Published as a conference paper at ICLR 2023

ROBUST GRAPH DICTIONARY LEARNING

Weijie Liu,1,2 Jiahao Xie,2 Chao Zhang,2,8,9 Makoto Yamada,3,4,5 Nenggan Zheng,1,2,6,8 Hui Qian2,7,8
1 Qiushi Academy for Advanced Studies, Zhejiang University
2 College of Computer Science and Technology, Zhejiang University
3 Okinawa Institute of Science and Technology
4 Kyoto University
5 RIKEN AIP
6 Zhejiang Lab
7 State Key Lab of CAD&CG, Zhejiang University
8 Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
9 Advanced Technology Institute, Zhejiang University
{westonhunter,xiejh,zczju,zng,qianhui}@zju.edu.cn, makoto.yamada@oist.jp

ABSTRACT

Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse linear combinations of basis elements (atoms) and is widely used in machine learning, computer vision, and signal processing. To extend DL to graphs, Vincent-Cuaz et al. (2021) proposed a method, called GDL, which describes the topology of each graph with a pairwise relation matrix (PRM) and compares PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, the lack of robustness often excludes GDL from a variety of real-world applications, since GWD is sensitive to structural noise in graphs. This paper proposes an improved graph dictionary learning algorithm based on a robust Gromov-Wasserstein discrepancy (RGWD), which has theoretically sound properties and an efficient numerical scheme. Based on such a discrepancy, our dictionary learning algorithm can learn atoms from noisy graph data. Experimental results demonstrate that our algorithm achieves good performance on both simulated and real-world datasets.

1 INTRODUCTION

Dictionary learning (DL) seeks to learn a set of basis elements (atoms) from data and approximates data samples by sparse linear combinations of these basis elements (Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011). It has numerous machine learning applications, including dimensionality reduction (Feng et al., 2013; Wei et al., 2018), classification (Raina et al., 2007; Mairal et al., 2008), and clustering (Ramirez et al., 2010; Sprechmann and Sapiro, 2010), to name a few. Although DL has received significant attention, it mostly focuses on vectorized data of the same dimension and is not amenable to graph data (Xu, 2020; Vincent-Cuaz et al., 2021; 2022). Many exciting machine learning tasks use graphs to capture complex structures (Backstrom and Leskovec, 2011; Sadreazami et al., 2017; Naderializadeh et al., 2020; Jin et al., 2017; Agrawal et al., 2018). DL for graphs is more challenging due to the lack of effective means to compare graphs. Specifically, evaluating the similarity between one observed graph and its approximation is difficult, since the two often have different numbers of nodes and the node correspondence across graphs is often unknown (Xu, 2020; Vincent-Cuaz et al., 2021). The seminal work of Vincent-Cuaz et al. (2021) proposed a DL method for graphs based on the Gromov-Wasserstein Discrepancy (GWD), a variant of the Gromov-Wasserstein distance. The Gromov-Wasserstein distance compares probability distributions supported on different metric spaces using pairwise distances (Mémoli, 2011).
By expressing each graph as a probability measure and capturing the graph topology with a pairwise relation matrix (PRM), comparing graphs can be naturally formulated as computing the GWD, since both the node correspondence and the discrepancy of the compared graphs are calculated (Peyré et al., 2016; Xu et al., 2019b). However, observed graphs often contain structural noise, including spurious or missing edges, which leads to differences between the obtained PRMs and the true ones (Donnat et al., 2018; Xu et al., 2019b). Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz et al., 2022; Tran et al., 2022), the inaccuracies of the PRMs may severely affect GWD and the effectiveness of DL in real-world applications.

Contributions. To handle the inaccuracies of PRMs, this paper first proposes a novel robust Gromov-Wasserstein discrepancy (RGWD) which adopts a minimax formulation. We prove that the inner maximization problem has a closed-form solution and derive an efficient numerical scheme to approximate RGWD. Under suitable assumptions, such a numerical scheme is guaranteed to find a $\delta$-stationary solution within $O(1/\delta^2)$ iterations. We further prove that RGWD is lower bounded and that the lower bound is achieved if and only if the two graphs are isomorphic. Therefore, RGWD can be employed to compare graphs. RGWD also satisfies the triangle inequality, which is of its own interest and allows numerous potential applications. A robust graph dictionary learning (RGDL) algorithm is thereby developed to learn atoms from noisy graph data, which assesses the quality of approximated graphs via RGWD. Numerical experiments on both synthetic and real-world datasets demonstrate that RGDL achieves good performance.

The rest of the paper is organized as follows. In Sec. 2, a comprehensive review of the background is given. Sec. 3 presents RGWD and the numerical approximation scheme for RGWD. RGDL is delineated in Sec. 4. Empirical results are demonstrated in Sec. 5. We finally discuss related work in Sec. 6.

2 PRELIMINARY

2.1 OPTIMAL TRANSPORT

We first present the notation used throughout this paper and then review the definition of the Gromov-Wasserstein distance, which originates from optimal transport theory (Villani, 2008; Peyré and Cuturi, 2018).

Notation. We use bold lowercase symbols (e.g. $\mathbf{x}$), bold uppercase letters (e.g. $\mathbf{A}$), uppercase calligraphic fonts (e.g. $\mathcal{X}$), and Greek letters (e.g. $\alpha$) to denote vectors, matrices, spaces (sets), and measures, respectively. $\mathbf{1}_d \in \mathbb{R}^d$ is the $d$-dimensional all-ones vector. $\Delta_d$ is the probability simplex with $d$ bins, namely the set of probability vectors $\Delta_d = \{\mathbf{a} \in \mathbb{R}^d_+ \mid \sum_{i=1}^d a_i = 1\}$. $\mathbf{A}[i,:]$ and $\mathbf{A}[:,j]$ are the $i$th row and the $j$th column of matrix $\mathbf{A}$, respectively. Given a matrix $\mathbf{A}$, $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_\infty$ denote its Frobenius norm and element-wise $\ell_\infty$-norm (i.e., $\|\mathbf{A}\|_\infty = \max_{i,j}|A_{ij}|$), respectively. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. The bracketed notation $[\![n]\!]$ is shorthand for the integer set $\{1, 2, \ldots, n\}$. A discrete measure $\alpha$ is denoted by $\alpha = \sum_{i=1}^m a_i \delta_{x_i}$, where $\delta_x$ is the Dirac measure at position $x$, i.e., a unit of mass infinitely concentrated at $x$.

Gromov-Wasserstein distance. Optimal transport addresses the problem of transporting one probability measure towards another probability measure with the minimum cost (Villani, 2008; Peyré and Cuturi, 2018). The induced cost defines a distance between the two probability measures.
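To make the discrete optimal transport problem just described concrete, the following minimal sketch (our own illustration, not from the paper; it assumes the POT package, installable as `pip install pot`) computes an optimal transport plan between two small discrete measures and the induced cost.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed to be installed)

# Two discrete measures alpha = sum_i p_i * delta_{x_i} and beta = sum_j q_j * delta_{y_j}.
x = np.array([[0.0], [1.0], [2.0]])       # support points of alpha
y = np.array([[0.5], [2.5]])              # support points of beta
p = np.array([0.4, 0.4, 0.2])             # weights of alpha (a probability vector)
q = np.array([0.5, 0.5])                  # weights of beta (a probability vector)

M = ot.dist(x, y)                         # pairwise squared-Euclidean ground costs
T = ot.emd(p, q, M)                       # optimal transport plan in Pi(p, q)
print("transport plan:\n", T)
print("induced cost:", np.sum(T * M))
```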
The Gromov-Wasserstein (GW) distance extends classic optimal transport to compare probability measures supported on different spaces (Mémoli, 2011). Let $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ be two metric spaces. Given two probability measures $\alpha = \sum_{i=1}^m p_i \delta_{x_i}$ and $\beta = \sum_{i'=1}^n q_{i'} \delta_{y_{i'}}$, where $x_1, x_2, \ldots, x_m \in \mathcal{X}$ and $y_1, y_2, \ldots, y_n \in \mathcal{Y}$, the $r$-GW distance between $\alpha$ and $\beta$ is defined as
$$\mathrm{GW}_r(\alpha, \beta) := \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}, \mathbf{q})} \sum_{i,j=1}^{m} \sum_{i',j'=1}^{n} D_{ii'jj'}^r \, T_{ii'} T_{jj'} \Big)^{1/r},$$
where the feasible domain of the transport plan $\mathbf{T} = [T_{ii'}]$ is given by the set $\Pi(\mathbf{p}, \mathbf{q}) = \{\mathbf{T} \in \mathbb{R}^{m \times n}_+ \mid \mathbf{T}\mathbf{1}_n = \mathbf{p},\ \mathbf{T}^\top \mathbf{1}_m = \mathbf{q}\}$, and $D_{ii'jj'}$ calculates the difference between pairwise distances, i.e., $D_{ii'jj'} = |d_{\mathcal{X}}(x_i, x_j) - d_{\mathcal{Y}}(y_{i'}, y_{j'})|$.

2.2 GRAPH REPRESENTATION AND COMPARISON

In this subsection, we formalize the idea of comparing graphs with GWD, which addresses the challenges that graphs often have different numbers of nodes and that the node correspondence is unknown (Xu et al., 2019b; Xu, 2020; Vincent-Cuaz et al., 2021).

Pairwise relation and graph representation. Given a graph $\mathcal{G}$ with $n$ nodes, assigning each node an index $i \in [\![n]\!]$, $\mathcal{G}$ can be expressed as a tuple $(\mathbf{C}, \mathbf{p})$, where $\mathbf{C} \in \mathbb{R}^{n \times n}$ is a matrix encoding the pairwise relations (e.g. adjacency, shortest-path, Laplacian, or heat kernel) and $\mathbf{p} \in \Delta_n$ is a probability vector modeling the relative importance of nodes within the graph (Peyré et al., 2016; Xu et al., 2019b; Titouan et al., 2019; Vincent-Cuaz et al., 2022).

Gromov-Wasserstein Discrepancy. GWD can be derived from the 2-GW distance by replacing the metrics with pairwise relations (Xu et al., 2019b; Vincent-Cuaz et al., 2022). More specifically, given an observed source graph $\mathcal{G}^s$ and a target graph $\mathcal{G}^t$ that can be expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, GWD is defined as
$$\mathrm{GWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t)\big) = \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'})^2 \, T_{ii'} T_{jj'} \Big)^{1/2},$$
where $n_s$ and $n_t$ are the numbers of nodes of $\mathcal{G}^s$ and $\mathcal{G}^t$ respectively. GWD computes both a soft assignment matrix between the nodes of the two graphs and a notion of discrepancy between them. For conciseness, we abbreviate $\mathrm{GWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t)\big)$ as $\mathrm{GWD}(\mathbf{C}^s, \mathbf{C}^t)$ in the sequel.

2.3 DICTIONARY LEARNING

Traditional DL approximates data vectors as sparse linear combinations of basis elements (atoms) (Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011; Jiang et al., 2015), and is usually formulated as
$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{W}} \sum_{k=1}^{K} \Big\| \mathbf{X}[:,k] - \sum_{m=1}^{M} w^k_m \mathbf{D}[:,m] \Big\|_2^2 + \lambda \Omega(\mathbf{w}^k), \qquad (1)$$
where $\mathbf{X} \in \mathbb{R}^{d \times K}$ is the data matrix whose columns represent samples, the matrix $\mathbf{D} \in \mathbb{R}^{d \times M}$ contains the $M$ atoms to learn and is constrained to the set $\mathcal{C} = \{\mathbf{D} \in \mathbb{R}^{d \times M} \mid \forall m \in [\![M]\!],\ \|\mathbf{D}[:,m]\|_2 \leq 1\}$, $\mathbf{W} \in \mathbb{R}^{M \times K}$ is the new representation of the data whose $k$th column $\mathbf{w}^k = [w^k_m]_{m \in [\![M]\!]}$ stores the embedding of the $k$th sample, and $\lambda \Omega(\mathbf{w}^k)$ promotes the sparsity of $\mathbf{w}^k$. Such a formulation only applies to vectorized data. Recently, Xu (2020) proposed to approximate graphs via the highly non-linear GW barycenter. Specifically, given a dataset of $K$ graphs with PRMs $\{\mathbf{C}_k\}_{k \in [\![K]\!]}$ such that $\mathbf{C}_k \in \mathbb{R}^{n_k \times n_k}$, the basis elements $\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}$ are learned by solving
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]},\, \{\mathbf{w}^k\}_{k \in [\![K]\!]}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big(\mathbf{C}_k,\ \mathcal{B}\big(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}\big)\Big),$$
where $\mathbf{w}^k \in \Delta_M$ is referred to as the embedding of the $k$th graph $\mathcal{G}_k$, and the GW barycenter $\mathcal{B}\big(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}\big)$ gives the approximation of $\mathcal{G}_k$ and is defined as
$$\mathcal{B}\big(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}\big) = \arg\min_{\mathbf{B}} \sum_{m=1}^{M} w^k_m \, \mathrm{GWD}^2(\mathbf{B}, \bar{\mathbf{C}}_m).$$
Therefore, a complex bi-level optimization problem is involved, which is computationally inefficient (Vincent-Cuaz et al., 2021).
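As a concrete illustration of the quantity that GWD minimizes (a sketch of our own with hypothetical helper names, not code from the paper), the snippet below evaluates the squared GW objective $\sum_{i,j,i',j'} (C^s_{ij} - C^t_{i'j'})^2 T_{ii'} T_{jj'}$ for a given coupling; plugging in the always-feasible independent coupling $\mathbf{p}^s (\mathbf{p}^t)^\top$ gives an upper bound on $\mathrm{GWD}^2$.

```python
import numpy as np

def gw_objective(Cs, Ct, T):
    """sum_{i,j,i',j'} (Cs[i,j] - Ct[i',j'])**2 * T[i,i'] * T[j,j'] for symmetric PRMs."""
    r, c = T.sum(axis=1), T.sum(axis=0)             # marginals of the coupling
    quad = r @ (Cs ** 2) @ r + c @ (Ct ** 2) @ c    # contributions of Cs^2 and Ct^2
    cross = -2.0 * np.sum((Cs @ T @ Ct.T) * T)      # cross term
    return quad + cross

# Two toy graphs: a path on 3 nodes and a single edge on 2 nodes, uniform node weights.
Cs = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
Ct = np.array([[0., 1.], [1., 0.]])
ps, pt = np.full(3, 1 / 3), np.full(2, 1 / 2)

T = np.outer(ps, pt)                                # independent coupling in Pi(ps, pt)
print("GW objective at the independent coupling:", gw_objective(Cs, Ct, T))
```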
DL for graphs. To overcome the above computational issues, Vincent-Cuaz et al. (2021) proposed GDL, which approximates each graph as a weighted sum of PRMs and is formulated as
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]},\, \{\mathbf{w}^k\}_{k \in [\![K]\!]}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big(\mathbf{C}_k,\ \sum_{m=1}^{M} w^k_m \bar{\mathbf{C}}_m\Big) + \lambda \Omega(\mathbf{w}^k),$$
where each atom $\bar{\mathbf{C}}_m$ is an $n_a \times n_a$ matrix. In contrast to the $\ell_2$ loss in Eq. (1), GWD is used to assess the quality of the linear representation $\sum_{m=1}^{M} w^k_m \bar{\mathbf{C}}_m$ for $k \in [\![K]\!]$. However, observed graphs often contain noisy edges or miss some edges in real-world applications (Clauset et al., 2008; Xu et al., 2019b; Shi et al., 2019; Piccioli et al., 2022), which leads to inaccuracies of the PRMs $\mathbf{C}_k$, that is, deviations between $\mathbf{C}_k$ and the true PRMs. Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz et al., 2022; Tran et al., 2022), the quality of the learned dictionary may be severely affected.

3 ROBUST GROMOV WASSERSTEIN DISCREPANCY

To deal with the inaccuracies of PRMs, this section defines a robust variant of GWD, referred to as RGWD. The properties of RGWD are rigorously analyzed. We then derive a theoretically guaranteed numerical scheme for calculating RGWD approximately. Due to the limit of space, all proofs can be found in the appendix.

Definition 1 Given an observed source graph $\mathcal{G}^s$ and a target graph $\mathcal{G}^t$ that can be expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, RGWD is defined by the solution to the following problem
$$\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t) \Big)^{1/2},$$
where the objective $f(\cdot)$ is given by
$$f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t) = \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 \, T_{ii'} T_{jj'},$$
and the perturbation $\mathbf{E}$ lies in the bounded set $\mathcal{U}_\epsilon = \{\mathbf{E} \mid \mathbf{E} = \mathbf{E}^\top \text{ and } \|\mathbf{E}\|_\infty \leq \epsilon\}$.

RGWD requires the sought transport plan to have low transportation costs for all perturbations $\mathbf{E}$ in the set $\mathcal{U}_\epsilon$. For succinctness, we omit $\mathbf{C}^s$ and $\mathbf{C}^t$ in $f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t)$ in the following.

3.1 PROPERTIES OF RGWD

The properties of RGWD are presented as follows. Firstly, although RGWD involves a non-convex non-concave minimax optimization problem, the inner maximization problem has a closed-form solution, which allows an efficient numerical scheme for RGWD. Secondly, RGWD has a lower bound that is achieved if and only if the expressions of the compared graphs are identical up to a permutation, which implies that RGWD can be employed to evaluate the similarity between one observed graph and its approximation in DL. Thirdly, RGWD satisfies the triangle inequality, which is of its own interest and allows numerous potential applications including clustering (Elkan, 2003; Haj Kacem et al., 2019), metric learning (Pitis et al., 2019), and Bayesian learning (Moore, 2000; Xiao et al., 2019). Finally, arbitrarily changing the node orders does not affect the value of RGWD. More formally, we state the properties in the following theorem.

Theorem 1 Given an observed source graph $\mathcal{G}^s$ and a target graph $\mathcal{G}^t$ that can be expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, RGWD satisfies:

1. For all $\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)$, $\mathbf{E}^\star(\mathbf{T}) = [E^\star_{i'j'}(\mathbf{T})]$ with
$$E^\star_{i'j'}(\mathbf{T}) = \begin{cases} -\epsilon, & \text{if } \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \geq 0, \\ \epsilon, & \text{otherwise}, \end{cases}$$
solves the inner maximization problem $\max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}, \mathbf{E})$.

2. RGWD is lower bounded, that is, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) \geq \epsilon$, where the equality holds if and only if there exists a bijection $\pi^\star : [\![n_s]\!] \to [\![n_t]\!]$ such that $p^s_i = p^t_{\pi^\star(i)}$ for all $i \in [\![n_s]\!]$ and $C^s_{ij} = C^t_{\pi^\star(i)\pi^\star(j)}$ for all $i, j \in [\![n_s]\!]$.

3. The triangle inequality holds for RGWD, i.e.,
$$\mathrm{RGWD}\big((\mathbf{C}_1, \mathbf{p}_1), (\mathbf{C}_3, \mathbf{p}_3), \epsilon\big) \leq \mathrm{RGWD}\big((\mathbf{C}_1, \mathbf{p}_1), (\mathbf{C}_2, \mathbf{p}_2), \epsilon\big) + \mathrm{RGWD}\big((\mathbf{C}_2, \mathbf{p}_2), (\mathbf{C}_3, \mathbf{p}_3), \epsilon\big).$$

4. RGWD is invariant to the permutation of node orders, i.e., for all permutation matrices $\mathbf{Q}^s$ and $\mathbf{Q}^t$,
$$\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \mathrm{RGWD}\big((\mathbf{Q}^s \mathbf{C}^s (\mathbf{Q}^s)^\top, \mathbf{Q}^s \mathbf{p}^s), (\mathbf{Q}^t \mathbf{C}^t (\mathbf{Q}^t)^\top, \mathbf{Q}^t \mathbf{p}^t), \epsilon\big).$$
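Item 1 of Theorem 1 can be sketched in a few lines of numpy (our own illustration, not the authors' code; the sign convention follows from expanding the square in $f$, which is convex in each entry of $\mathbf{E}$, so the maximum sits at an endpoint of $[-\epsilon, \epsilon]$):

```python
import numpy as np

def worst_case_perturbation(Cs, Ct, T, eps):
    """Closed-form maximizer E*(T) of f(T, E) over U_eps (Theorem 1, item 1).

    b[i', j'] = sum_{i,j} T[i, i'] * T[j, j'] * (Cs[i, j] - Ct[i', j'])
              = (T^T Cs T)[i', j'] - Ct[i', j'] * outer(col_sums, col_sums)[i', j'].
    E*[i', j'] is -eps where b >= 0 and +eps otherwise; the symmetry of E* is
    inherited from the symmetry of Cs and Ct.
    """
    c = T.sum(axis=0)                       # column marginals of the coupling
    b = T.T @ Cs @ T - Ct * np.outer(c, c)
    return np.where(b >= 0, -eps, eps)
```

Combined with the `gw_objective` helper above, `gw_objective(Cs, Ct + worst_case_perturbation(Cs, Ct, T, eps), T)` evaluates $f(\mathbf{T}, \mathbf{E}^\star(\mathbf{T}))$, i.e., the inner maximum for a fixed coupling.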
As implied by Theorem 1, RGWD does not define a distance between metric-measure spaces. Firstly, the identity axiom is not satisfied. Secondly, symmetry generally does not hold either, which we exemplify below.

Example 1 (Asymmetry of RGWD) Consider the case $\mathbf{p}^s = \mathbf{p}^t = [0.5, 0.5]^\top$, $\mathbf{C}^s = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $\mathbf{C}^t = \begin{bmatrix} 0 & 4 \\ 4 & 0 \end{bmatrix}$. Then $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), 1\big) = \sqrt{11.5}$ with the solution given by $\mathbf{T} = \begin{bmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}$ and $\mathbf{E} = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}$. In contrast, $\mathrm{RGWD}\big((\mathbf{C}^t, \mathbf{p}^t), (\mathbf{C}^s, \mathbf{p}^s), 1\big) = \sqrt{10.5}$ with $\mathbf{T} = \begin{bmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}$ and $\mathbf{E} = \begin{bmatrix} -1 & -1 \\ -1 & -1 \end{bmatrix}$. One has $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), 1\big) \neq \mathrm{RGWD}\big((\mathbf{C}^t, \mathbf{p}^t), (\mathbf{C}^s, \mathbf{p}^s), 1\big)$. Example 1 showcases that RGWD is asymmetric even if $n_s = n_t$.

3.2 NUMERICAL SCHEME OF RGWD

We derive a gradient-based numerical scheme to solve RGWD by exploiting the property that the inner maximization problem has a closed-form solution; the scheme is summarized in Algorithm 1. In each iteration, the $\mathbf{E}_\tau$ that solves the inner problem for the current $\mathbf{T}_\tau$ is calculated. Then, the transport plan is updated using projected gradient descent.

Algorithm 1 Projected Gradient Descent for RGWD
1: Input: Initialization $\mathbf{T}_0$, step-size $\eta$, number of iterations $N$.
2: Output: Estimated optimal transport plan $\hat{\mathbf{T}}$ and its corresponding perturbation $\hat{\mathbf{E}}$.
3: for $\tau = 0, 1, \ldots, N-1$ do
4:   Find $\mathbf{E}_\tau$ that maximizes $f(\mathbf{T}_\tau, \mathbf{E})$.
5:   Update the transport plan via $\mathbf{T}_{\tau+1} = \mathrm{Proj}_{\Pi(\mathbf{p}^s, \mathbf{p}^t)}\big(\mathbf{T}_\tau - \eta \nabla_{\mathbf{T}} f(\mathbf{T}_\tau, \mathbf{E}_\tau)\big)$, where the partial gradient takes the form
$$\nabla_{\mathbf{T}} f(\mathbf{T}_\tau, \mathbf{E}_\tau) = 2 (\mathbf{C}^s \odot \mathbf{C}^s) \mathbf{T}_\tau \mathbf{1}_{n_t} \mathbf{1}_{n_t}^\top + 2 \cdot \mathbf{1}_{n_s} \mathbf{1}_{n_s}^\top \mathbf{T}_\tau \big((\mathbf{C}^t + \mathbf{E}_\tau) \odot (\mathbf{C}^t + \mathbf{E}_\tau)\big) - 4 \mathbf{C}^s \mathbf{T}_\tau (\mathbf{C}^t + \mathbf{E}_\tau),$$
with $\odot$ denoting element-wise multiplication.
6: end for
7: Pick $\tau$ uniformly at random from $\{1, 2, \ldots, N\}$.
8: Set $\hat{\mathbf{T}} \leftarrow \mathbf{T}_\tau$.
9: Find $\hat{\mathbf{E}}$ that maximizes $f(\hat{\mathbf{T}}, \mathbf{E})$.

To present the convergence guarantee of Algorithm 1, we introduce the notion of the Moreau envelope. The stationarity of any function $h(\mathbf{x})$ can be quantified by the norm of the gradient of its Moreau envelope $h_\lambda(\mathbf{x}) = \min_{\mathbf{x}'} \big\{ h(\mathbf{x}') + \frac{1}{2\lambda} \|\mathbf{x} - \mathbf{x}'\|^2 \big\}$. The following theorem gives the convergence rate of Algorithm 1; the proof is deferred to the appendix.

Theorem 2 Define $\varphi(\cdot) = \max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\cdot, \mathbf{E})$. The output $\hat{\mathbf{T}}$ of Algorithm 1 with step-size $\eta = \frac{\gamma}{\sqrt{N+1}}$ satisfies
$$\mathbb{E}\big[\|\nabla \varphi_{1/2l}(\hat{\mathbf{T}})\|^2\big] \leq 2 \, \frac{\varphi_{1/2l}(\mathbf{T}_0) - \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \varphi(\mathbf{T}) + l L^2 \gamma^2}{\gamma \sqrt{N+1}},$$
where $l = 2 \max\{10 n^3 U_1^2 + 6 n^3 U_1 \epsilon + 4 n U_1 U_2 + 4 n^3 \epsilon^2,\ 6 n^2 U_1 U_2 + 2 U_2^2 + 4 n^2 U_2 \epsilon\}$ and $L = 2 \max\{(4 U_1 + 2\epsilon) U_2^2 n^3,\ 2 (2 U_1 + \epsilon)^2 U_2 n^3\}$ with $n = \max\{n_s, n_t\}$, $U_1 = \max\{\|\mathbf{C}^s\|_\infty, \|\mathbf{C}^t\|_\infty\}$, and $U_2 = \max\{\|\mathbf{p}^t\|_2,\ \max_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \|\mathbf{T}\|_F\}$.

When $U_1$ and $\epsilon$ are of the order $O(1/n^3)$, both $l$ and $L$ are of the order $O(1)$, and Theorem 2 states that a $\delta$-stationary solution can be obtained within $O(1/\delta^2)$ iterations. Note that we can multiply $\mathbf{C}^s$, $\mathbf{C}^t$, and $\epsilon$ by the same number without affecting the resulting transport plan.
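The following sketch (our own, not the released implementation) strings the pieces of Algorithm 1 together: the closed-form inner maximizer, the gradient $\nabla_{\mathbf{T}} f$ from line 5, and a projection step. The exact Euclidean projection onto $\Pi(\mathbf{p}^s, \mathbf{p}^t)$ is itself a constrained problem (the appendix solves it with an augmented Lagrangian method); here a crude alternating-projection heuristic stands in for it, so this is an approximation rather than a faithful reimplementation.

```python
import numpy as np

def worst_case_E(Cs, Ct, T, eps):
    # Closed-form inner maximizer (Theorem 1, item 1).
    c = T.sum(axis=0)
    return np.where(T.T @ Cs @ T - Ct * np.outer(c, c) >= 0, -eps, eps)

def grad_T(Cs, Ct, E, T):
    # Partial gradient of f with respect to T (line 5 of Algorithm 1).
    D = Ct + E
    ns, nt = T.shape
    return (2 * (Cs * Cs) @ T @ np.ones((nt, nt))
            + 2 * np.ones((ns, ns)) @ T @ (D * D)
            - 4 * Cs @ T @ D)

def approx_project(T, p, q, iters=200):
    # Stand-in for Proj_{Pi(p, q)}: alternately correct row/column sums and clip.
    # Only an approximation of the exact Euclidean projection used in the paper.
    for _ in range(iters):
        T = T + (p - T.sum(axis=1))[:, None] / T.shape[1]
        T = T + (q - T.sum(axis=0))[None, :] / T.shape[0]
        T = np.maximum(T, 0.0)
    return T

def rgwd_pgd(Cs, Ct, p, q, eps, eta=1e-2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    T = np.outer(p, q)                             # feasible initialization
    iterates = []
    for _ in range(n_iter):
        E = worst_case_E(Cs, Ct, T, eps)
        T = approx_project(T - eta * grad_T(Cs, Ct, E, T), p, q)
        iterates.append(T)
    T_hat = iterates[rng.integers(len(iterates))]  # uniformly random iterate (line 7)
    return T_hat, worst_case_E(Cs, Ct, T_hat, eps)
```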
4 ROBUST GRAPH DICTIONARY LEARNING

The problem of learning a robust dictionary for graph data is now formulated as follows. Given a dataset of $K$ graphs expressed by $\{(\mathbf{C}_k, \mathbf{p}_k)\}_{k \in [\![K]\!]}$, estimating the optimal dictionary is formalized as
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]},\, \{\mathbf{w}^k\}_{k \in [\![K]\!]}} \sum_{k=1}^{K} \mathrm{RGWD}^2\Big(\big(\sum_{m=1}^{M} w^k_m \bar{\mathbf{C}}_m,\ \bar{\mathbf{p}}\big),\ (\mathbf{C}_k, \mathbf{p}_k),\ \epsilon\Big) - \lambda \|\mathbf{w}^k\|_2^2, \qquad (2)$$
where $\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}$ and $\{\mathbf{w}^k\}_{k \in [\![K]\!]}$ are the dictionary and the graph embeddings respectively, and $\bar{\mathbf{p}}$ is obtained by sorting and averaging $\{\mathbf{p}_k\}_{k \in [\![K]\!]}$ following Xu et al. (2019a). To solve (2), we propose a nested iterative optimization algorithm that is summarized in Algorithm 2. The main idea is that the dictionary and the embeddings are updated alternately. We discuss some crucial details below.

Algorithm 2 Robust Graph Dictionary Learning (RGDL)
1: Input: The dataset $\{(\mathbf{C}_k, \mathbf{p}_k)\}_{k \in [\![K]\!]}$, the initial dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}$, the number of iterations $T$, mini-batch size $b$.
2: Output: The learned dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}$.
3: for $t = 0, 1, \ldots, T-1$ do
4:   Sample a mini-batch of graphs whose indices are denoted by $\mathcal{B}$ such that $|\mathcal{B}| = b$.
5:   for $k \in \mathcal{B}$ do
6:     Initialize $\mathbf{w}^k = \frac{1}{M} \mathbf{1}_M$ and $\mathbf{T}_k = \bar{\mathbf{p}} \mathbf{p}_k^\top$.
7:     repeat
8:       Calculate $(\mathbf{T}_k, \mathbf{E}_k)$ via Algorithm 1 with fixed $\mathbf{w}^k$.
9:       Compute $\mathbf{w}^k$ solving (4) for the fixed $\mathbf{T}_k$ and $\mathbf{E}_k$ with conditional gradient.
10:    until convergence
11:   end for
12:   Update the atom $\bar{\mathbf{C}}_m$ for $m \in [\![M]\!]$ with the stochastic gradient $\hat{\nabla}_{\bar{\mathbf{C}}_m}$, which has the form
$$\hat{\nabla}_{\bar{\mathbf{C}}_m} = \sum_{k \in \mathcal{B}} w^k_m \Big( \big(\sum_{m'=1}^{M} w^k_{m'} \bar{\mathbf{C}}_{m'}\big) \odot \bar{\mathbf{p}} \bar{\mathbf{p}}^\top - \mathbf{T}_k (\mathbf{C}_k + \mathbf{E}_k) \mathbf{T}_k^\top \Big). \qquad (3)$$
13: end for

Solving $\mathbf{w}^k$. We now formulate the problem of obtaining the embedding of the $k$th graph $\mathcal{G}_k$ when the dictionary is fixed and the PRM is inaccurate. Given the dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [\![M]\!]}$, where each $\bar{\mathbf{C}}_m \in \mathbb{R}^{n_a \times n_a}$, the embedding of $\mathcal{G}_k$ expressed by $(\mathbf{C}_k, \mathbf{p}_k)$ is calculated by solving
$$\min_{\mathbf{w}^k \in \Delta_M} \mathrm{RGWD}^2\Big(\big(\sum_{m=1}^{M} w^k_m \bar{\mathbf{C}}_m,\ \bar{\mathbf{p}}\big),\ (\mathbf{C}_k, \mathbf{p}_k),\ \epsilon\Big) - \lambda \|\mathbf{w}^k\|_2^2, \qquad (4)$$
where $\lambda \geq 0$ induces a negative quadratic regularization promoting sparsity on the simplex (Li et al., 2020; Vincent-Cuaz et al., 2021). When $\mathbf{w}^k$ is fixed, updating $\mathbf{T}_k$ and $\mathbf{E}_k$ can be done by Algorithm 1, whose convergence is guaranteed by Theorem 2. For fixed $\mathbf{T}_k$ and $\mathbf{E}_k$, the problem of updating $\mathbf{w}^k$ is a non-convex problem that can be tackled by a conditional gradient algorithm. Note that for non-convex problems, the conditional gradient algorithm is proved to converge to a local stationary point (Lacoste-Julien, 2016). Such a procedure is described in Lines 5 to 11 of Algorithm 2, and we observe that it empirically converges within tens of iterations.

Stochastic updates. To enhance computational efficiency, each atom is updated with stochastic estimates of the gradient. At each stochastic update, $b$ embedding learning problems are solved independently for the current dictionary using the procedure stated above, where $b$ is the size of the sampled mini-batch. Each atom is then updated using the stochastic gradient given in Eq. (3). Note that the symmetry of each atom is preserved as long as the initialized atom is symmetric, since the stochastic gradients are symmetric.
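To make the dictionary update in Algorithm 2 concrete, here is a small numpy sketch (our own, with hypothetical helper names) of the stochastic gradient in Eq. (3) and the resulting atom update; constants that can be absorbed into the learning rate are ignored, and the per-graph couplings $\mathbf{T}_k$ and perturbations $\mathbf{E}_k$ are assumed to come from an RGWD solver such as the Algorithm 1 sketch above.

```python
import numpy as np

def atom_gradients(C_bar, W, T_list, C_list, E_list, p_bar):
    """Stochastic gradients of Eq. (3) for all atoms over a mini-batch.

    C_bar  : (M, na, na) array of current atoms
    W      : (|B|, M) array of mini-batch embeddings w^k
    T_list, C_list, E_list : per-graph couplings, observed PRMs and perturbations
    p_bar  : (na,) shared node-weight vector of the atoms
    """
    grads = np.zeros_like(C_bar)
    P = np.outer(p_bar, p_bar)
    for wk, Tk, Ck, Ek in zip(W, T_list, C_list, E_list):
        model = np.tensordot(wk, C_bar, axes=1)        # sum_m w^k_m * C_bar[m]
        residual = model * P - Tk @ (Ck + Ek) @ Tk.T   # shared factor in Eq. (3)
        grads += wk[:, None, None] * residual[None, :, :]
    return grads

def update_atoms(C_bar, grads, lr=0.1):
    # Plain stochastic gradient step; symmetry of each atom is preserved because
    # the gradients are symmetric whenever the atoms and PRMs are symmetric.
    return C_bar - lr * grads
```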
5 EXPERIMENTS

This section provides empirical evidence that RGDL performs well in the unsupervised graph clustering task on both synthetic and real-world datasets (code available at https://github.com/cxxszz/rgdl). The heat kernel matrix is employed as the PRM since it captures both global and local topology and achieves good performance in many tasks (Donnat et al., 2018; Tsitsulin et al., 2018; Chowdhury and Needham, 2021).

5.1 SIMULATED DATASETS

We first test RGDL in the graph clustering task on datasets simulated according to the well-studied Stochastic Block Model (SBM) (Holland et al., 1983; Wang and Wong, 1987). RGDL is compared against the following state-of-the-art graph clustering methods: (i) GDL (Vincent-Cuaz et al., 2021), which learns graph dictionaries via GWD; (ii) Gromov-Wasserstein Factorization (GWF) (Xu, 2020), which approximates graphs via GW barycenters; (iii) Spectral Clustering (SC) of Shi and Malik (2000); Stella and Shi (2003), applied to the matrix whose entries store the pairwise GWD between graphs.

Table 1: Average (stdev) ARI scores for the first scenario of synthetic datasets.

                 balanced                                    unbalanced
σ                0.05          0.10          0.15           0.05          0.10          0.15
GDL              0.119(0.017)  0.031(0.012)  0.016(0.006)   0.049(0.019)  0.018(0.004)  0.018(0.001)
GWF              0.071(0.007)  0.034(0.003)  0.008(0.001)   0.052(0.020)  0.014(0.001)  0.015(0.001)
SC               0.057(0.002)  0.033(0.002)  0.010(0.001)   0.054(0.024)  0.015(0.004)  0.010(0.001)
RGDL(ϵ=0.01)     0.316(0.005)  0.161(0.005)  0.052(0.002)   0.246(0.013)  0.039(0.009)  0.024(0.001)
RGDL(ϵ=0.1)      0.853(0.003)  0.756(0.018)  0.439(0.015)   0.765(0.022)  0.694(0.046)  0.499(0.016)
RGDL(ϵ=0.2)      0.975(0.025)  0.879(0.023)  0.736(0.020)   0.866(0.023)  0.815(0.028)  0.770(0.016)
RGDL(ϵ=0.3)      0.975(0.025)  0.879(0.023)  0.869(0.013)   0.943(0.001)  0.916(0.027)  0.848(0.061)
RGDL(ϵ=10)       0.975(0.025)  0.950(0.000)  0.950(0.000)   0.943(0.001)  0.943(0.001)  0.943(0.001)
RGDL(ϵ=30)       0.781(0.046)  0.779(0.070)  0.728(0.085)   0.723(0.067)  0.698(0.057)  0.666(0.040)

Table 2: Average (stdev) ARI scores for the second scenario of synthetic datasets.

                 balanced                                    unbalanced
ρ                0.00          0.05          0.10           0.00          0.05          0.10
GDL              0.260(0.020)  0.187(0.013)  0.024(0.004)   0.152(0.008)  0.018(0.001)  0.005(0.001)
GWF              0.182(0.006)  0.086(0.004)  0.027(0.005)   0.020(0.005)  0.016(0.002)  0.010(0.002)
SC               0.204(0.002)  0.129(0.017)  0.016(0.005)   0.129(0.008)  0.013(0.001)  0.011(0.005)
RGDL(ϵ=0.01)     0.451(0.014)  0.449(0.016)  0.449(0.016)   0.401(0.002)  0.401(0.002)  0.399(0.000)
RGDL(ϵ=0.1)      1.000(0.000)  1.000(0.000)  0.975(0.025)   1.000(0.000)  1.000(0.000)  1.000(0.000)
RGDL(ϵ=0.2)      1.000(0.000)  1.000(0.000)  0.975(0.025)   1.000(0.000)  1.000(0.000)  1.000(0.000)
RGDL(ϵ=0.3)      1.000(0.000)  1.000(0.000)  1.000(0.000)   1.000(0.000)  1.000(0.000)  1.000(0.000)
RGDL(ϵ=10)       1.000(0.000)  1.000(0.000)  1.000(0.000)   1.000(0.000)  1.000(0.000)  1.000(0.000)
RGDL(ϵ=30)       0.896(0.070)  0.888(0.080)  0.857(0.044)   0.864(0.057)  0.827(0.043)  0.816(0.074)

Dataset generation. We consider two scenarios of inaccuracies. In the first scenario (S1), Gaussian noise is added to the heat kernel matrix of each graph. More specifically, denoting the heat kernel matrix of the $k$th graph by $\mathbf{C}_k$ for $k \in [\![K]\!]$, the PRM available to the DL methods is $\tilde{\mathbf{C}}_k = \mathbf{C}_k + \mathbf{Z} + \mathbf{Z}^\top$, where each entry $Z_{ij}$ of $\mathbf{Z}$ is sampled from the Gaussian distribution $\mathcal{N}(0, \sigma)$. In the second scenario (S2), we randomly add $\rho|\mathcal{E}|$ edges to the graph and then randomly remove $\rho|\mathcal{E}|$ edges while keeping the graph connected, where $\mathcal{E}$ is the edge set of the graph. The heat kernel matrix is then constructed for the modified graph. These two scenarios allow us to study the performance of RGDL against different scales of inaccuracies. In both S1 and S2, we generate two datasets, both of which involve three generative structures (also used to label the graphs): dense (only one community), two communities, and three communities. We fix $p = 0.1$ as the probability of inter-community connectivity and $1 - p$ as the probability of intra-community connectivity. The first dataset includes 20 graphs for each generative structure and is thus referred to as the balanced dataset. The second dataset consists of 12, 18, and 30 graphs for the three generative structures respectively, and is hence named the unbalanced dataset. The number of graph nodes is uniformly sampled from $[30, 50]$. The magnitude of the observed PRM $\tilde{\mathbf{C}}_k$ satisfies $\|\tilde{\mathbf{C}}_k\|_\infty \leq 15$.
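The data-generation protocol can be sketched as follows (an illustrative reconstruction under our own assumptions about unspecified details, e.g. the heat-kernel time scale; it is not the authors' script), using networkx to sample SBM graphs and to apply the scenario-S2 edge perturbation.

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

def heat_kernel(G, t=1.0):
    """Heat kernel PRM exp(-t * L); the time scale t is an assumed hyper-parameter."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return expm(-t * L)

def sample_sbm(n_communities, rng, n_min=30, n_max=50, p_inter=0.1):
    """One SBM graph: intra-community prob 1 - p_inter, inter-community prob p_inter."""
    n = int(rng.integers(n_min, n_max + 1))
    sizes = [n // n_communities] * n_communities
    sizes[0] += n - sum(sizes)
    probs = np.full((n_communities, n_communities), p_inter)
    np.fill_diagonal(probs, 1.0 - p_inter)
    return nx.stochastic_block_model(sizes, probs.tolist(), seed=int(rng.integers(2**31)))

def perturb_edges(G, rho, rng):
    """Scenario S2: add rho*|E| edges, then remove rho*|E| edges while staying connected."""
    G = G.copy()
    n_flip = int(rho * G.number_of_edges())
    non_edges = list(nx.non_edges(G))
    for idx in rng.choice(len(non_edges), size=min(n_flip, len(non_edges)), replace=False):
        G.add_edge(*non_edges[idx])
    edges, removed = list(G.edges()), 0
    for idx in rng.permutation(len(edges)):
        if removed == n_flip:
            break
        u, v = edges[idx]
        G.remove_edge(u, v)
        if nx.is_connected(G):
            removed += 1
        else:
            G.add_edge(u, v)   # undo removals that would disconnect the graph
    return G

rng = np.random.default_rng(0)
G = sample_sbm(n_communities=2, rng=rng)
C_noisy = heat_kernel(perturb_edges(G, rho=0.05, rng=rng))
```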
Evaluating the performance. The learned embeddings of the graphs are used as input to a k-means algorithm to cluster the graphs. We use the well-known Adjusted Rand Index (ARI) (Hubert and Arabie, 1985; Steinley, 2004) to evaluate the quality of the clustering by comparing it with the graph labels. RGDL with varied ϵ is compared against GDL, GWF, and SC. RGDL, GDL, and GWF use three atoms, each of which is a $6 \times 6$ matrix. We run each method 5 times and report the averaged ARI scores and the standard deviations. The experimental results reported in Table 1 and Table 2 demonstrate that RGDL outperforms the baselines significantly.

Influence of ϵ. RGDL with moderate ϵ values outperforms the baseline methods by a large margin and is more robust to the noise. Even when ϵ is relatively small (ϵ = 0.01), RGDL achieves better performance than the baselines. Increasing ϵ within a suitable range can boost ARI, and RGDL is not sensitive to the choice of ϵ. If ϵ becomes too large, the performance of RGDL slowly decreases. In practice, when a small quantity of data labels is available, ϵ can be chosen according to the performance on this small subset of data.

Figure 1: ARI scores vs. time (sec) on the MUTAG (left), BZR (middle), and Peking_1 (right) datasets; curves shown for RGDL with $\epsilon = 10^{-1}U$, $10^{-2}U$, and $10^{-3}U$.

5.2 REAL-WORLD DATASETS

We further use RGDL to cluster real-world graphs. We consider widely utilized benchmark datasets including MUTAG (Debnath et al., 1991), BZR (Sutherland et al., 2003), and Peking_1 (Pan et al., 2016). The labels of the graphs are employed as the ground truth to evaluate the estimated clustering results. For each dataset, the size of the atoms is set to the median of the numbers of graph nodes, following Vincent-Cuaz et al. (2021). The number of atoms $M$ is set as $M = \beta \cdot (\#\text{classes})$, where $\beta$ is chosen from $\{2, 3, 4, 5\}$. RGDL is run with different values of ϵ; specifically, ϵ is chosen from $\{U, 10^{-1}U, 10^{-2}U, 10^{-3}U\}$, where $U = \max_{k \in [\![K]\!]} \|\mathbf{C}_k\|_\infty$.

Results. The experimental results on real-world graphs are reported in Figure 1. RGDL with $\epsilon = 10^{-1}U$ or $\epsilon = 10^{-2}U$ outperforms the baselines on all datasets, which implies that the observed graphs contain structural noise and that $[10^{-2}U, 10^{-1}U]$ is often a suitable range for ϵ. The time required for RGDL to converge is comparable to that of state-of-the-art methods.
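The clustering-evaluation protocol of this section can be sketched as follows (our own illustration using scikit-learn; the embedding matrix `W` below is a random stand-in for the $K \times M$ embeddings produced by RGDL, GDL, or GWF).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_clustering(embeddings, labels, n_runs=5, seed=0):
    """Run k-means on graph embeddings and score it with ARI against the labels."""
    n_clusters = len(np.unique(labels))
    scores = []
    for r in range(n_runs):
        pred = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed + r).fit_predict(embeddings)
        scores.append(adjusted_rand_score(labels, pred))
    return float(np.mean(scores)), float(np.std(scores))

rng = np.random.default_rng(0)
W = rng.random((60, 3))                 # stand-in embeddings for 60 graphs, 3 atoms
labels = np.repeat([0, 1, 2], 20)       # ground-truth generative structures
print("ARI mean/std:", evaluate_clustering(W, labels))
```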
6 RELATED WORK

Unbalanced OT. Enhancing the robustness of the optimal transport plan has received wide attention recently (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022; Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). Originally, robust variants of classical OT were proposed to compare distributions supported on the same metric space (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022); they model the noise as outlier supports and reduce the influence of outlier supports by allowing mass destruction and creation. Following the same spirit, variants of the GW distance that also relax the marginal constraints were proposed (Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). However, these methods do not take the inaccuracies of the pairwise distances/similarities into account. The proposed RGWD aims to handle such cases.

Graph representation learning and graph comparison. Comparing graphs often requires learning meaningful graph representations. Some methods manually design representations that are invariant under graph isomorphism (Bagrow and Bollt, 2019; Tam and Dunson, 2022). Such representations are often sophisticated and require domain knowledge. Graph neural network-based methods learn the representations of graphs in an end-to-end manner (Scarselli et al., 2008; Zhang et al., 2018; Lee et al., 2018; Errica et al., 2019), which however requires a large amount of labeled data. Another family of methods that uses graph representations implicitly is referred to as graph kernels (Shervashidze et al., 2009; Vishwanathan et al., 2010). Methods based on GWD and its variants can estimate the node correspondence and provide an interpretable discrepancy between the compared graphs (Xu et al., 2019b; Titouan et al., 2019; Barbe et al., 2020; Chapel et al., 2020). In this paper, we propose a novel graph dictionary learning method based on a robust variant of GWD to learn representations of graphs that are useful in downstream tasks.

Non-linear combination of atoms. Classic DL methods are linear in the sense that they attempt to approximate each vectorized datum by a linear combination of a few basis elements. Recently, non-linear operations have also been considered. In order to exploit the non-linear nature of data, autoencoder-based methods encode data to low-dimensional vectors using neural networks and decode them with another neural network (Hinton and Salakhutdinov, 2006; Hu and Tan, 2018). Another family of methods replaces the linear combinations by geodesic interpolations (Boissard et al., 2011; Bigot et al., 2013; Seguy and Cuturi, 2015; Schmitz et al., 2018). More closely related to our work, Xu (2020) proposed to approximate graphs via the GW barycenter of graph atoms, which however involves a complicated and computationally demanding optimization problem.

Projection robust OT. To improve the convergence of empirical Wasserstein distances (Rüschendorf, 1985) to population ones, a group of methods projects the distributions to informative low-dimensional subspaces (Paty and Cuturi, 2019; Lin et al., 2020; 2021), which involves solving min-max or max-min problems. This paper considers distributions supported on different metric spaces and does not project distributions.

7 CONCLUSION

In this paper, we propose a novel graph dictionary learning algorithm that is robust to the structural noise of graphs. We first propose a robust variant of GWD, referred to as RGWD, which involves a minimax optimization problem. Exploiting the fact that the inner maximization problem has a closed-form solution, an efficient numerical scheme is derived. Based on RGWD, a robust dictionary learning algorithm for graphs called RGDL is developed to learn atoms from noisy graph data. Numerical results on both simulated and real-world datasets demonstrate that RGDL achieves good performance in the presence of structural noise.

ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program of China (2020YFB1313501, 2020AAA0107400), National Natural Science Foundation of China (T2293723, 61972347, 61672376, 61751209, 61472347, 62206248), Zhejiang Provincial Natural Science Foundation (LR19F020005, LZ18F020002), the Key Research and Development Project of Zhejiang Province (2022C01022, 2022C01119, 2021C03003), the Fundamental Research Funds for the Central Universities (226-2022-00051), Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies, China Scholarship Council (202206320311), and MEXT KAKENHI (20H04243).

REFERENCES

Monica Agrawal, Marinka Zitnik, and Jure Leskovec. Large-scale analysis of disease pathways in the human interactome.
In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2018: Proceedings of the Pacific Symposium, pages 111 122. World Scientific, 2018. Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 635 644, 2011. James P Bagrow and Erik M Bollt. An information-theoretic, all-scales approach to comparing networks. Applied Network Science, 4(1):1 15, 2019. Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33:12934 12944, 2020. Am elie Barbe, Marc Sebban, Paulo Gonc alves, Pierre Borgnat, and R emi Gribonval. Graph diffusion wasserstein distances. In ECML PKDD, 2020. J er emie Bigot, Ra ul Gouet, Thierry Klein, and Alfredo L opez. Geodesic pca in the wasserstein space. ar Xiv preprint ar Xiv:1307.7721, 2013. Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution s template estimate with wasserstein metrics. ar Xiv preprint ar Xiv:1111.5927, 2011. S ebastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015. Laetitia Chapel, Gilles Gasso, et al. Partial optimal tranport with applications on positive-unlabeled learning. In Neur IPS, 2020. Samir Chowdhury and Tom Needham. Generalized spectral clustering via gromov-wasserstein learning. In International Conference on Artificial Intelligence and Statistics, pages 712 720. PMLR, 2021. Aaron Clauset, Cristopher Moore, and M E J Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98 101, 2008. Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786 797, 1991. Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1320 1329, 2018. Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems, 32, 2019. Published as a conference paper at ICLR 2023 Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th international conference on Machine Learning (ICML-03), pages 147 153, 2003. Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations, 2019. Zhizhao Feng, Meng Yang, Lei Zhang, Yan Liu, and David Zhang. Joint discriminative dimensionality reduction and dictionary learning for face recognition. Pattern Recognition, 46(8):2134 2143, 2013. Mohamed Aymen Ben Haj Kacem, Chiheb-Eddine Ben N Cir, and Nadia Essoussi. Overview of scalable partitional methods for big data clustering. In Clustering Methods for Big Data Analytics, pages 1 23. Springer, 2019. Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504 507, 2006. 
Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983. Junlin Hu and Yap-Peng Tan. Nonlinear dictionary learning with application to image classification. Pattern Recognition, 75:282 291, 2018. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193 218, 1985. Wenhao Jiang, Feiping Nie, and Heng Huang. Robust dictionary learning with capped l1-norm. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in neural information processing systems, 30, 2017. Simon Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. ar Xiv preprint ar Xiv:1607.00345, 2016. Khang Le, Huy Nguyen, Quang M Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust optimal transport: Computational complexity and barycenter computation. Advances in Neural Information Processing Systems, 34:21947 21959, 2021. John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1666 1674, 2018. Ping Li, Syama Sundar Rangapuram, and Martin Slawski. Methods for sparse and low-rank recovery under simplex constraints. Statistica Sinica, 30(2):557 577, 2020. Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. Projection robust wasserstein distance and riemannian optimization. Advances in neural information processing systems, 33:9383 9397, 2020. Tianyi Lin, Zeyu Zheng, Elynn Chen, Marco Cuturi, and Michael I Jordan. On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics, pages 262 270. PMLR, 2021. Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis Bach. Supervised dictionary learning. Advances in neural information processing systems, 21, 2008. Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pages 689 696, 2009. St ephane Mallat. A wavelet tour of signal processing. Elsevier, 1999. Published as a conference paper at ICLR 2023 Facundo M emoli. Gromov wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417 487, 2011. Andrew W Moore. The anchors hierarchy: using the triangle inequality to survive high dimensional data. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 397 405, 2000. Debarghya Mukherjee, Aritra Guha, Justin M Solomon, Yuekai Sun, and Mikhail Yurochkin. Outlier-robust optimal transport. In International Conference on Machine Learning, pages 7850 7860. PMLR, 2021. Navid Naderializadeh, Mark Eisen, and Alejandro Ribeiro. Wireless power control via counterfactual optimization of graph neural networks. In 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 1 5. IEEE, 2020. Sloan Nietert, Ziv Goldfeld, and Rachel Cummings. Outlier-robust optimal transport: Duality, structure, and statistical analysis. In International Conference on Artificial Intelligence and Statistics, pages 11691 11719. PMLR, 2022. 
Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, and Chengqi Zhang. Task sensitive feature exploration and learning for multitask graph classification. IEEE transactions on cybernetics, 47 (3):744 758, 2016. Franc ois-Pierre Paty and Marco Cuturi. Subspace robust wasserstein distances. In International conference on machine learning, pages 5072 5081. PMLR, 2019. Gabriel Peyre and Marco Cuturi. Computational optimal transport. ar Xiv: Machine Learning, 2018. Gabriel Peyr e, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In ICML, 2016. Giovanni Piccioli, Guilhem Semerjian, Gabriele Sicuro, and Lenka Zdeborov a. Aligning random graphs with a sub-tree similarity message-passing algorithm. Journal of Statistical Mechanics: Theory and Experiment, 2022(6):063401, 2022. Silviu Pitis, Harris Chan, Kiarash Jamali, and Jimmy Ba. An inductive bias for distances: Neural nets that respect the triangle inequality. In International Conference on Learning Representations, 2019. Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pages 759 766, 2007. Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3501 3508. IEEE, 2010. Ludger R uschendorf. The wasserstein distance and approximation theorems. Probability Theory and Related Fields, 70(1):117 129, 1985. Hamidreza Sadreazami, Arash Mohammadi, Amir Asif, and Konstantinos N Plataniotis. Distributed-graph-based statistical approach for intrusion detection in cyber-physical systems. IEEE Transactions on Signal and Information Processing over Networks, 4(1):137 147, 2017. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi, Gabriel Peyr e, and Jean-Luc Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643 678, 2018. Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. Advances in Neural Information Processing Systems, 28, 2015. Published as a conference paper at ICLR 2023 Thibault S ejourn e, Franc ois-Xavier Vialard, and Gabriel Peyr e. The unbalanced gromov wasserstein distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems, 34:8766 8779, 2021. Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics, pages 488 495. PMLR, 2009. Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S Yu. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31 (2):357 370, 2019. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000. Pablo Sprechmann and Guillermo Sapiro. Dictionary learning and sparse coding for unsupervised clustering. 
In 2010 IEEE international conference on acoustics, speech and signal processing, pages 2042 2045. IEEE, 2010. Douglas Steinley. Properties of the hubert-arable adjusted rand index. Psychological methods, 9(3): 386, 2004. X Yu Stella and Jianbo Shi. Multiclass spectral clustering. In Computer Vision, IEEE International Conference on, volume 2, pages 313 313. IEEE Computer Society, 2003. Jeffrey J Sutherland, Lee A O brien, and Donald F Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structureactivity relationships. Journal of chemical information and computer sciences, 43(6):1906 1915, 2003. Edric Tam and David Dunson. Multiscale graph comparison via the embedded laplacian distance. ar Xiv preprint ar Xiv:2201.12064, 2022. Vayer Titouan, Nicolas Courty, Romain Tavenard, Chapel Laetitia, and R emi Flamary. Optimal transport for structured data with application on graphs. In ICML, 2019. Ivana Toˇsi c and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2): 27 38, 2011. Quang Huy Tran, Hicham Janati, Nicolas Courty, R emi Flamary, Ievgen Redko, Pinar Demetci, and Ritambhara Singh. Unbalanced co-optimal transport. ar Xiv preprint ar Xiv:2205.14923, 2022. Anton Tsitsulin, Davide Mottin, Panagiotis Karras, Alexander Bronstein, and Emmanuel M uller. Netlsd: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2347 2356, 2018. C edric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. C edric Vincent-Cuaz, Titouan Vayer, R emi Flamary, Marco Corneli, and Nicolas Courty. Online graph dictionary learning. In International Conference on Machine Learning, pages 10564 10574. PMLR, 2021. C edric Vincent-Cuaz, R emi Flamary, Marco Corneli, Titouan Vayer, and Nicolas Courty. Semirelaxed gromov-wasserstein divergence and applications on graphs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id= RSha Mexjc-x. S Vishwanathan, N Schraudolph, R Kondor, and KM Borgwardt. Graph kernels. the journal ofmachine learning research. 2010. Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8 19, 1987. Published as a conference paper at ICLR 2023 Xian Wei, Hao Shen, Yuanxiang Li, Xuan Tang, Fengxiang Wang, Martin Kleinsteuber, and Yi Lu Murphey. Reconstructible nonlinear dimensionality reduction via joint dictionary learning. IEEE transactions on neural networks and learning systems, 30(1):175 189, 2018. Teng Xiao, Jiaxin Ren, Zaiqiao Meng, Huan Sun, and Shangsong Liang. Dynamic bayesian metric learning for personalized product search. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1693 1702, 2019. Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph partitioning and matching. In Neur IPS, 2019a. Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-wasserstein learning for graph matching and node embedding. In ICML, 2019b. Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 6478 6485, 2020. Yangyang Xu. Iteration complexity of inexact augmented lagrangian methods for constrained convex programming. Mathematical Programming, 185(1):199 244, 2021. Pinar Yanardag and S Vishwanathan. 
Deep graph kernels in: Proceedings of the 21th acm sigkdd international conference on knowledge discovery and data mining, 1365 1374. ACM, New York, 2015. Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. Tong Zhang, Yun Wang, Zhen Cui, Chuanwei Zhou, Baoliang Cui, Haikuan Huang, and Jian Yang. Deep wasserstein graph discriminant learning for graph classification. In AAAI, pages 10914 10922, 2021. Published as a conference paper at ICLR 2023 The appendix is organized as follows. We first provide omitted proofs in the main paper in Sec. A. Then, algorithmic details are presented in Sec. B. Finally, Sec. C gives additional experimental results. A OMITTED PROOFS Theorem 1 Given an observed source graph Gs and a target graph Gt that can be expressed as (Cs, ps) and (Ct, pt) respectively, RGWD satisfies 1. for all T Π(ps, pt), E(T) = [Ei j (T)] where ( ϵ, if Pns i,j=1 Tii Tjj (Cs ij Ct i j ) 0, ϵ, otherwise. solves the inner maximization problem max E Uϵ f(T, E). 2. RGWD is lower bounded, that is, RGWD (Cs, ps), (Ct, pt), ϵ ϵ, where the equality holds if and only if there exists a bijective π : Jns K Jnt K such that ps i = pt π (i) for all i Jns K and Cs ij = Ct π (i)π (j) for all i, j Jns K. 3. The triangle inequality holds for RGWD, i.e., RGWD (C1, p1), (C3, p3), ϵ RGWD (C1, p1), (C2, p2), ϵ + RGWD (C2, p2), (C3, p3), ϵ . 4. RGWD is invariant to the permutation of node orders, i.e., for all permutation matrices Qs and Qt, RGWD (Cs, ps), (Ct, pt), ϵ = RGWD (Qs Cs Qs, Qs ps), (Qt Ct Qt, Qt pt), ϵ . Proof: (i) The objective can be rewritten as follows, i ,j =1 (Cs ij Ct i j Ei j )2Tii Tjj i ,j =1 (Cs ij Ct i j )2Tii Tjj + i,j=1 Tii Tjj E2 i j 2 i,j=1 Tii Tjj (Cs ij Ct i j )Ei j i ,j =1 (Cs ij Ct i j )2Tii Tjj + pt i pt j E2 i j 2 i,j=1 Tii Tjj (Cs ij Ct i j )Ei j , (5) which by the property of quadratic functions yields the closed-form solution E(T) = [Ei j (T)], where ( ϵ, if Pns i,j=1 Tii Tjj (Cs ij Ct i j ) 0, ϵ, otherwise. It is easy to verify that the such a choice guarantees the symmetry of E(T). (ii) We now prove the lower boundedness. By Eq. (5), we have min T Π(ps,pt) max E Uϵ i ,j =1 (Cs ij Ct i j Ei j )2Tii Tjj min T Π(ps,pt) i ,j =1 Tii Tjj ϵ2 = ϵ2. Published as a conference paper at ICLR 2023 Note that when there exists a bijective π : Jns K Jnt K such that ps i = pt π (i) for all i Jns K and Cs ij = Ct π (i)π (j) for all i, j Jns K, choosing the transport plan ˆT = [ ˆTii ] where ˆTii = ps i, if i = π (i), 0, otherwise, we have for all i , j Jnt K, i,j=1 ˆTii ˆTjj (Cs ij Ct i j ) = ˆTπ 1(i )i ˆTπ 1(j )j Cs π 1(i )π 1(j ) Ct i j = 0, which implies that Ei j (T) = ϵ for all i , j Jnt K. We then have i ,j =1 (Cs ij Ct i j Ei j )2 ˆTii ˆTjj = i,j=1 (Cs ij Ct π (i)π (j) ϵ)2 ˆTiπ (i) ˆTjπ (j) = ϵ2. Therefore, in such a case, RGWD (Cs, ps), (Ct, pt), ϵ = ϵ. On the other hand, when such a bijective does not exist, RGWD (Cs, ps), (Ct, pt), ϵ v u u tϵ2 + min T Π(ps,pt) i ,j =1 (Cs ij Ct i j )2Tii Tjj > ϵ, where the strict inequality is due to the fact that Pns i ,j =1(Cs ij Ct i j )2Tii Tjj > 0. (iii) Thirdly, we prove the triangle inequality. Given tuples (C1, p1), (C2, p2), and (C3, p3) which have the node numbers n1, n2, and n3 respectively, let (T 12, E 12), (T 23, E 23), and (T 13, E 13) be the solutions of RGWD (C1, p1), (C2, p2), ϵ , RGWD (C2, p2), (C3, p3), ϵ , and RGWD (C1, p1), (C3, p3), ϵ . 
Define T13 = [T 13 i1i3] where T 13 i1i3 = T 12 i1i2T 23 i2i3 p2 i2 . Then we have RGWD (C1, p1), (C3, p3), ϵ C1 i1j1 C3 i3j3 E 13 i3j3 2T 13 i1i3T 13 j1j3 C1 i1j1 C3 i3j3 E 13 i3j3 2 n2 X T 12 i1i2T 23 i2i3 p2 i2 T 12 j1j2T 23 j2j3 p2 j2 C1 i1j1 C2 i2j2 + C2 i2j2 C3 i3j3 E 13 i3j3 2 T 12 i1i2T 23 i2i3 p2 i2 T 12 j1j2T 23 j2j3 p2 j2 C1 i1j1 C2 i2j2 2 T 12 i1i2T 23 i2i3 p2 i2 T 12 j1j2T 23 j2j3 p2 j2 C2 i2j2 C3 i3j3 E 13 i3j3 2 T 12 i1i2T 23 i2i3 p2 i2 T 12 j1j2T 23 j2j3 p2 j2 C1 i1j1 C2 i2j2 2T 12 i1i2T 12 j1j2 + C2 i2j2 C3 i3j3 E 13 i3j3 2T 23 i2i3T 23 j2j3 RGWD (C1, p1), (C2, p2), ϵ + RGWD (C2, p2), (C3, p3), ϵ . Published as a conference paper at ICLR 2023 (iv) Finally, we prove the invariance to the node order permutation. Denote the solution to the objective of RGWD by T = [T ii ] and E = [E i j ], which implies ( ϵ, if Pns i,j=1 T ii T jj (Cs ij Ct i j ) 0, ϵ, otherwise, i ,j =1 (Cs ij Ct i j E i j )2T ii T jj max E Uϵ i ,j =1 (Cs ij Ct i j Ei j )2Tii Tjj , for all T Π(ps, pt). The two permutation operations can be equivalently denoted as two bijectives πs and πt. Denote Cs = [ Cs ij] where Cs ij = Cs πs 1(i)πs 1(j), Ct = [ Ct i j ] where Ct i j = Ct πt 1(i )πt 1(j ), E = [ E i j ] where E i j = E πt 1(i )πt 1(j ), T = [ T s ii ] where T s ii = T πs 1(i)πt 1(i ). We first prove E solves the inner maximization problem for T . For all i , j Jnt K, when P ij T ii T jj ( Cs ij Ct i j ) 0, we have P ij T iπt 1(i )T jπt 1(j )(Cs ij Ct πt 1(i )πt 1(j )) 0, which is consistent with E i j = ϵ. The case when P ij T ii T jj ( Cs ij Ct i j ) > 0 is similar. Since i ,j =1 (Cs ij Ct i j E i j )2T ii T jj = i ,j =1 ( Cs ij Ct i j E i j )2 T ii T jj i ,j =1 (Cs ij Ct i j Ei j )2Tii Tjj = max E Uϵ i ,j =1 ( Cs ij Ct i j Ei j )2 Tii Tjj , where Tii = Tπs 1(i)πt 1(i ), T and E solve the optimization problem of RGWD (Qs Cs Qs, Qs ps), (Qt Ct Qt, Qt pt), ϵ . To prove Theorem 2, we require the following lemma. Lemma 3 f( ) is l-smooth and L-Lipschitz, where l = 2 max{10n3U 2 1 + 6n3U1ϵ + 4n U1U2 + 4n3ϵ2, 6n2U1U2 + 2U 2 2 + 4n2U2ϵ} and L = 2 max{(4U1 + 2ϵ)U 2 2 n3, 2(2U1 + ϵ)2U2n3} with n = max{ns, nt}, U1 = max{ Cs , Ct } and U2 = max{ pt 2, max T Π(ps,pt) T F }. Proof: (i) We first prove that f( ) is L-Lipschitz. For all T, T Π(ps, pt) and E, E Uϵ, f(T, E) f(T , E ) iji j (Cs ij Ct i j Ei j )2Tii Tjj X iji j (Cs ij Ct i j E i j )2Tii Tjj iji j (Cs ij Ct i j E i j )2Tii Tjj X iji j (Cs ij Ct i j E i j )2Tii T jj iji j (Cs ij Ct i j E i j )2Tii T jj X iji j (Cs ij Ct i j E i j )2T ii T jj . Published as a conference paper at ICLR 2023 For the first term, X iji j (Cs ij Ct i j Ei j )2Tii Tjj X iji j (Cs ij Ct i j E i j )2Tii Tjj iji j (Cs ij Ct i j Ei j + Cs ij Ct i j E i j )(E i j Ei j )Tii Tjj Cs ij Ct i j Ei j + Cs ij Ct i j E i j E i j Ei j Tii Tjj (4U1 + 2ϵ)U 2 2 n2 X E i j Ei j . For the second term, X iji j (Cs ij Ct i j E i j )2Tii Tjj X iji j (Cs ij Ct i j E i j )2Tii T jj iji j |Cs ij Ct i j E i j |2 Tii Tjj T jj (2U1 + ϵ)2U2 X (2U1 + ϵ)2U2n2 Tjj T jj . For the third term, X iji j (Cs ij Ct i j E i j )2Tii T jj X iji j (Cs ij Ct i j E i j )2T ii T jj (2U1 + ϵ)2U2n2 Tjj T jj . Combining the three relations above, we have f(T, E) f(T , E ) max n (4U1 + 2ϵ)U 2 2 n2, (2U1 + ϵ)2U2n2o X i j |E i j Ei j | + X jj |Tjj T jj | E E 2 F + T T 2 F . 
(ii) Now we prove that f( ) is l-smooth, which requires finding a constant l satisfying vec Tf(T, E) vec Ef(T, E) vec Tf(T , E ) vec Ef(T , E ) 2 l hvec(T) vec(E) i hvec(T ) vec(E ) where vec(X) means the vectorization of matrix X and [ a b ] denotes the concatenation of vectors a and b. Since the left hand side satisfies vec Tf(T, E) vec Ef(T, E) vec Tf(T , E ) vec Ef(T , E ) 2 Tf(T, E) Tf(T , E ) 2 F + Ef(T, E) Ef(T , E ) 2 F Tf(T, E) Tf(T , E ) F + Ef(T, E) Ef(T , E ) F , and the right hand side satisfies l hvec(T) vec(E) i hvec(T ) vec(E ) E E 2 F + T T 2 F 2 E E F + l Published as a conference paper at ICLR 2023 it suffices to find a constant l satisfying Tf(T, E) Tf(T , E ) F + Ef(T, E) Ef(T , E ) F l 2 E E F + l We bound Tf(T, E) Tf(T , E ) F as follows, Tf(T, E) Tf(T , E ) F 2n Cs Cs F T T F + 4 Cs F T(Ct + E) T (Ct + E ) F +2n T(Ct + E) (Ct + E) T (Ct + E ) (Ct + E ) F . For the first term, 2n Cs Cs F T T F 2n Cs 2 F 2n3U 2 1 T T F . For the second term, 4 Cs F T(Ct + E) T (Ct + E ) F 4 Cs F T T F Ct F + E F + 4 Cs F T F E E F (4n2U 2 1 + 4n2U1ϵ) T T F + 4n U1U2 E E F . For the third term, 2n T(Ct + E) (Ct + E) T (Ct + E ) (Ct + E ) F 2n T(Ct + E) (Ct + E) T (Ct + E) (Ct + E) F +2n T (Ct + E) (Ct + E) T (Ct + E ) (Ct + E) F +2n T (Ct + E ) (Ct + E) T (Ct + E ) (Ct + E ) F 4n Ct 2 F + E 2 F T T F + 2n T F Ct F + E F E E F +2n T F Ct F + E F E E F 4n3U 2 1 + 4n3ϵ2 T T F + (4n2U1U2 + 4n2U2ϵ) E E F Since Ef(T, E) = 2(E+C )ptpt 2T Cs T, Ef(T, E) Ef(T , E ) F can be bounded as follows, Ef(T, E) Ef(T , E ) F 2(E E ) F pt 2 F + 4 T F Cs F T T F 2U 2 2 (E E ) F + 4n U1U2 T T F . Combining the above four relations, we have Tf(T, E) Tf(T , E ) F + Ef(T, E) Ef(T , E ) F max{10n3U 2 1 + 6n3U1ϵ + 4n U1U2 + 4n3ϵ2, 6n2U1U2 + 2U 2 2 + 4n2U2ϵ} E E F + T T F , which yields the desired result. Theorem 2 Define φ( ) = max E Uϵ f( , E). The output ˆT of Algorithm 1 with step-size η = γ N+1 satisfies E φ1/2l(ˆT) 2 2φ1/2l(T0) min T Π(ps,pt) φ(T) + l L2γ2 2 max{10n3U 2 1 + 6n3U1ϵ + 4n U1U2 + 4n3ϵ2, 6n2U1U2 + 2U 2 2 + 4n2U2ϵ} and L = 2 max{(4U1+2ϵ)U 2 2 n3, 2(2U1+ϵ)2U2n3} with n = max{ns, nt}, U1 = max{ Cs , Ct } and U2 = max{ pt 2, max T Π(ps,pt) T F }. Published as a conference paper at ICLR 2023 Proof: By the smoothness of f( ), for any T Π(ps, pt) and Tτ from Algorithm 1, we have φ( T) f( T, Eτ) f(Tτ, Eτ) + Tf(Tτ, Eτ), T Tτ l = φ(Tτ) + Tf(Tτ, Eτ), T Tτ l 2 T Tτ 2 F . (6) Let ˆTτ = argmin T Π(ps,pt) φ(T) + l T Tτ 2 F . We have φ1/2l(Tτ+1) φ(ˆTτ) + l Tτ+1 ˆTτ 2 F φ(ˆTτ) + l Tτ η Tf(Tτ, Eτ) ˆTτ 2 F φ(ˆTτ) + l Tτ ˆTτ 2 F + 2lη Tf(Tτ, Eτ), ˆTτ Tτ + η2l Tf(Tτ, Eτ) 2 F φ1/2l(Tτ) + 2lη Tf(Tτ, Eτ), ˆTτ Tτ + η2l Tf(Tτ, Eτ) 2 F φ1/2l(Tτ) + 2ηl φ(ˆTτ) φ(Tτ) + l 2 ˆTτ Tτ 2 F + η2l L2, where the second line uses Lemma 3.1 of Bubeck et al. (2015) and the last line follows from (6). Taking a telescopic sum over τ, we obtain φ1/2l(TN) φ1/2l(T0) + 2ηl φ(ˆTτ) φ(Tτ) + l 2 ˆTτ Tτ 2 F + η2l L2. Rearranging this, we obtain φ(ˆTτ)+φ(Tτ l 2 ˆTτ Tτ 2 F φ1/2l(T0) min T Π(ps,pt) φ(T) 2ηl N + ηL2 Since φ(T) + l T Tτ 2 F is l-strongly convex, we have φ(ˆTτ) + φ(Tτ) l 2 ˆTτ Tτ 2 F 2 Tτ ˆTτ 2 F + φ(Tτ) + l Tτ Tτ 2 F min T φ(T) + l T Tτ 2 F l Tτ ˆTτ 2 F = 1 4l φ1/2l(Tτ) 2 F . Plugging this in (7) and combining Lemma 3 proves the result. B ALGORITHMIC DETAILS The Projected Gradient Descent (PGD) consists of the following three steps in each iteration τ. Find Eτ that maximizes f(Tτ, E). By Theorem 1, we need to calculate an auxiliary matrix G = T τ Cs Tτ Ct T τ Tτ, where denotes the element-wise multiplication. 
And we have Ei j (Tτ) = ϵ, if Gi j 0, ϵ, otherwise. Such a step involves computational cost O(n3) where n = max{ns, nt}. Gradient descent. Calculate Hτ = Tτ η Tf(Tτ, Eτ) where Tf(Tτ, Eτ) = 2 Cs Cs Tτ1nt1nt + 21ns1ns Tτ Ct + Eτ Ct + Eτ 4Cs Tτ Ct + Eτ , which also involves computational cost O(n3). Published as a conference paper at ICLR 2023 Figure 2: PCA-based visualization of embeddings produced by GDL (left), GWF (middle), and RGDL (right) respectively for the graphs in MUTAG dataset. The color of each point indicate the type of the corresponding graph. RGDL achieves the best clustering results. Projection into the feasible domain. This requires solving the following problem min T 0 1 2 T Hτ 2 F , s.t. T1n = p, T 1m = q. This optimization problem has a strongly convex objective and linear constraints, and hence can be solved efficiently via Augmented Lagrangian Method with computational complexity n2| log ρ| ρ1/2 (Xu, 2021) where ρ measures the optimality, that is, the violation of the two linear constraints. When ρ = O 1 n2 , this step also has cubic costs if we ignore the log term. Therefore, the overall complexity of PGD obtaining a δ-stationary solution is O n3 δ2 + n2| log ρ| C ADDITIONAL EXPERIMENTS C.1 ADDITIONAL EXPERIMENTAL RESULTS OF GRAPH CLUSTERING Visualization of graph embeddings. Since GDL, GWF, and RGDL can output graph embeddings, we further illustrate the embeddings generated by them respectively based on PCA. As is shown in Figure 2, the embeddings of the two types of the graphs are less likely to be mixed together, which explains why RGDL achieves higher ARI values. Sensitivity analysis of λ. As is discussed in Li et al. (2020); Vincent-Cuaz et al. (2021), the negative quadratic term can promote the sparsity of graph embeddings. We further conduct sensitivity analysis of λ by varying the value in {0, 10 5, 10 4, 10 3, 10 2, 10 1}. As is shown in Table 3, λ 10 4, 10 2 often yields good performance. The experiments in the main paper are run with λ = 10 3. Table 3: ARI scores of RGDL with varied λ s. Datasets 0 10 5 10 4 10 3 10 2 10 1 MUTAG 0.4389 0.4561 0.4636 0.458 0.4661 0.4412 BZR 0.2515 0.2515 0.2707 0.2707 0.2707 0 Peking 1 0.1195 0.1195 0.1195 0.1195 0.1195 0.1079 C.2 GRAPH CLASSIFICATION The learned embeddings of graphs can also be used in the graph classification task. RGDL is thus compared against GDL (Vincent-Cuaz et al., 2021), GWF (Xu, 2020), and other state-of-the-art graph classification methods including WGDL (Zhang et al., 2021) and GNTK (Du et al., 2019) on the benchmark datasets MUTAG (Debnath et al., 1991), IMDB-B, and IMDB-M (Yanardag and Vishwanathan, 2015). RGDL, GDL, and GWF use 3-NN as the classifier due to its simplicity. We perform a 10-fold nested cross validation (using 9 folds for training, 1 for testing, and reporting the average accuracy of this experiment repeated 10 times) by keeping same folds across methods. Published as a conference paper at ICLR 2023 Table 4: Graph classification results. Datasets RGDL GDL GWF WGDL GNTK IMDB-B 77.7(1.5) 59.0(2.6) 53.0(6.2) 79.7(3.6) 76.9(3.6) IMDB-M 52.9(2.5) 44.2(2.3) 42.0(3.7) 53.5(5.0) 52.8(4.6) MUTAG 98.2(3.0) 89.5(5.3) 86.0(3.0) 94.7(2.6) 90.0(8.5) The results are reported in Table 4 and RGDL outperforms or matches state-of-the-art methods. RGDL outperforms GDL and GWF significantly, which indicates the necessity of taking into account the structural noise of observed graphs. 
Although WGDL and GNTK achieve similar performance, they are more computationally and memory demanding due to their use of graph neural networks.