# Learning Graphons via Structured Gromov-Wasserstein Barycenters

Hongteng Xu¹,², Dixin Luo³, Lawrence Carin⁴, Hongyuan Zha⁵
¹Gaoling School of Artificial Intelligence, Renmin University of China
²Beijing Key Laboratory of Big Data Management and Analysis Methods
³School of Computer Science and Technology, Beijing Institute of Technology
⁴Department of Electrical and Computer Engineering, Duke University
⁵School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
dixin.luo@bit.edu.cn (Correspondence author. Currently on leave from the College of Computing, Georgia Institute of Technology.)

Abstract

We propose a novel and principled method to learn a nonparametric graph model called a graphon, which is defined in an infinite-dimensional space and represents arbitrary-size graphs. Based on the weak regularity lemma from the theory of graphons, we leverage a step function to approximate a graphon. We show that the cut distance of graphons can be relaxed to the Gromov-Wasserstein distance of their step functions. Accordingly, given a set of graphs generated by an underlying graphon, we learn the corresponding step function as the Gromov-Wasserstein barycenter of the given graphs. Furthermore, we develop several enhancements and extensions of the basic algorithm, e.g., the smoothed Gromov-Wasserstein barycenter for guaranteeing the continuity of the learned graphons and the mixed Gromov-Wasserstein barycenters for learning multiple structured graphons. The proposed approach overcomes drawbacks of prior state-of-the-art methods and outperforms them on both synthetic and real-world data. The code is available at https://github.com/HongtengXu/SGWB-Graphon.

Introduction

Given a set of graphs, e.g., social networks and biological networks, we are often interested in modeling their generative mechanisms and building statistical graph models (Kolaczyk 2009; Goldenberg et al. 2010). Many efforts have been made to achieve this aim, leading to methods such as the stochastic block model (Nowicki and Snijders 2001), the graphlet (Soufiani and Airoldi 2012), and the latent space model (Hoff, Raftery, and Handcock 2002). However, when dealing with large-scale complex networks, the parametric models above are often oversimplified and thus suffer from underfitting. To enhance the model capacity, a nonparametric graph model called the graphon (or graph limit) was proposed (Janson and Diaconis 2008; Lovász 2012). Mathematically, a graphon is a two-dimensional symmetric Lebesgue-measurable function, denoted as $W: \Omega^2 \mapsto [0, 1]$, where $\Omega$ is a measure space, e.g., $\Omega = [0, 1]$. Given a graphon, we can generate arbitrarily sized graphs by the following sampling process:

$$v_n \sim \mathrm{Uniform}(\Omega),\ n = 1, \dots, N, \qquad a_{nn'} \sim \mathrm{Bernoulli}(W(v_n, v_{n'})),\ n, n' = 1, \dots, N. \quad (1)$$

The first step samples $N$ nodes independently from a uniform distribution defined on $\Omega$. The second step generates an adjacency matrix $A = [a_{nn'}] \in \{0, 1\}^{N \times N}$, whose elements obey the Bernoulli distributions determined by the graphon. Accordingly, we derive a graph $G(V, E)$ with $V = \{1, \dots, N\}$ and $E = \{(n, n') \mid a_{nn'} = 1\}$.
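To make the sampling process in (1) concrete, the following minimal NumPy sketch (our illustration, not the authors' released code) draws an undirected graph of $N$ nodes from a graphon given as a vectorized callable:

```python
import numpy as np

def sample_graph_from_graphon(W, N, rng=None):
    """Sample an N-node undirected graph from a graphon W: [0,1]^2 -> [0,1].

    Implements the two-step process of Eq. (1): draw node positions uniformly,
    then draw each edge from a Bernoulli distribution with probability W(v_n, v_n').
    """
    rng = np.random.default_rng(rng)
    v = rng.uniform(0.0, 1.0, size=N)        # node positions v_n ~ Uniform([0, 1])
    P = W(v[:, None], v[None, :])            # edge probabilities W(v_n, v_n')
    upper = rng.uniform(size=(N, N)) < P     # Bernoulli draws
    A = np.triu(upper, k=1)                  # keep the upper triangle (no self-loops)
    A = (A | A.T).astype(np.int8)            # symmetrize for an undirected graph
    return v, A

# Example: the graphon W(x, y) = xy from Table 1.
v, A = sample_graph_from_graphon(lambda x, y: x * y, N=200, rng=0)
```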
This graphon model is theoretically useful for characterizing complex graphs (Chung and Radcliffe 2011; Lovász 2012) and has been widely used in many applications, e.g., network centrality (Ballester, Calvó-Armengol, and Zenou 2006; Avella-Medina et al. 2018), control (Jackson and Zenou 2015; Gao and Caines 2019), and optimization (Nagurney 2013; Parise and Ozdaglar 2018). A fundamental problem connected to these applications concerns how to robustly learn graphons from observed graphs. Many learning methods have been developed to solve this problem. Most of them are based on the weak regularity lemma of graphons (Frieze and Kannan 1999). This lemma indicates that an arbitrary graphon can be approximated well by a two-dimensional step function. To learn step functions as target graphons, existing methods either leverage stochastic block models, e.g., the sorting-and-smoothing (SAS) method (Chan and Airoldi 2014), the stochastic block approximation (SBA) (Airoldi, Costa, and Chan 2013), and its variant largest gap (LG) (Channarond et al. 2012), or they apply low-rank approximation directly to observed graphs, e.g., the matrix completion (MC) method (Keshavan, Montanari, and Oh 2010) and the universal singular value thresholding (USVT) algorithm (Chatterjee et al. 2015).

These methods require the observed graphs to be well-aligned [1] and generated by a single graphon. However, real-world graphs, e.g., social networks collected from different platforms and different time slots, often have complicated clustering structure, and the correspondence between their nodes is unknown in general. This violation limits the feasibility of the above learning methods in practice. Specifically, these methods have to solve a multi-graph matching problem before learning graphons. Because of the NP-hardness of the matching problem, this preprocessing often introduces severe noise to the subsequent learning problem and leads to undesirable learning results.

[1] Well-aligned graphs have comparable size and the correspondence between their nodes is provided.

[Figure 1: Illustrations of learning results obtained by various methods for different graphons: (a) $W(x, y) = xy$, $0 \le x, y \le 1$; (b) $W(x, y) = |x - y|$, $0 \le x, y \le 1$. In both (a) and (b), we visualize the graphon and its estimations with size $1{,}000 \times 1{,}000$, and each estimation is derived from 10 graphs with fewer than 300 nodes. The node degrees of the graphs provide strong evidence to align graphs when learning the graphon in (a) but are useless for the graphon in (b). Our GWB method outperforms state-of-the-art methods. In the challenging case (b), our estimation can be aligned to the ground truth by a measure-preserving mapping, i.e., it is close to the ground truth under the cut distance.]

To overcome the aforementioned challenges, we propose a new method to learn one or multiple graphons from unaligned graphs. Our method leverages step functions to estimate graphons. It minimizes the Gromov-Wasserstein distance (GWD) (Mémoli 2011) between the step function of each observed graph and that of the target graphon, whose solution is a Gromov-Wasserstein barycenter (GWB) of the graphs (Xu, Luo, and Carin 2019). We demonstrate that this learning strategy minimizes an upper bound of the cut distance (Lovász 2012) between the graphon and its step function, which leads to a computationally efficient algorithm.
To the best of our knowledge, our work makes the first attempt to learn graphons from unaligned graphs. Different from existing methods, which first match graphs heuristically and then estimate graphons, our method leverages the permutation-invariance of the GW distance and integrates graph matching implicitly into the estimation phase. As a result, our method mitigates the bias caused by undesired matching processes. Given a graphon $W(x, y)$, if its marginal $W(y) = \int_{\Omega} W(x, y)\,dx$ (or $W(x) = \int_{\Omega} W(x, y)\,dy$) is very different from a constant function, the graphs generated by it can readily be aligned by sorting and matching their nodes according to their degrees. Otherwise, it is hard to align its graphs because the node degrees are almost the same. As illustrated in Figure 1, no matter whether the graphs are easy to align or not, our method successfully learns the graphons and consistently outperforms existing methods. Besides the basic GWB method, we design 1) a smoothed GWB method to enhance the continuity of learned graphons, and 2) a mixture model of GWBs to learn multiple graphons from graphs with unknown clusters. These structured GWB models achieve encouraging learning results in complicated scenarios.

Proposed Method

A graphon $W: \Omega^2 \mapsto [0, 1]$ is defined on a probability space $(\Omega, \mu)$, where $\mu$ is a probability measure on the space $\Omega$. Such functions constitute a space of graphons, denoted as $\mathcal{W}$. Let $\{G_m\}_{m=1}^M$ be a set of graphs generated by an unknown graphon $W$ via the sampling process shown in (1). We want to estimate $W$ based on $\{G_m\}_{m=1}^M$, making the estimation close to the ground truth under a specific metric.

Approximate Graphons by Step Functions

A graphon can always be approximated by a step function in the cut norm (Frieze and Kannan 1999). For each $W \in \mathcal{W}$, its cut norm is defined as

$$\|W\|_{\square} := \sup_{X, Y \subseteq \Omega} \Big| \int_{X \times Y} W(x, y)\,dx\,dy \Big|, \quad (2)$$

where the supremum is taken over all measurable subsets $X$ and $Y$ of $\Omega$. Based on the cut norm, we can define a commonly-used metric called the cut distance (Lovász 2012): $\delta_{\square}(W_1, W_2) := \inf_{\phi \in S_{\Omega}} \|W_1 - W_2^{\phi}\|_{\square}$ for $W_1, W_2 \in \mathcal{W}$, where $S_{\Omega}$ represents the set of measure-preserving mappings from $\Omega$ to $\Omega$ and $W_2^{\phi}(x, y) = W_2(\phi(x), \phi(y))$. The cut distance plays a central role in graphon theory. We say that two graphons $W_1, W_2$ are equivalent if $\delta_{\square}(W_1, W_2) = 0$, denoted as $W_1 \cong W_2$. The work in (Borgs et al. 2008) demonstrates that the quotient space $\widehat{\mathcal{W}} := \mathcal{W} \backslash \cong$ is homeomorphic to the set of graphons and that $(\widehat{\mathcal{W}}, \delta_{\square})$ is a compact metric space. Similarly, we can define $\delta_1(W_1, W_2) := \inf_{\phi \in S_{\Omega}} \|W_1 - W_2^{\phi}\|_1$, where $\|W\|_1 := \int_{\Omega^2} |W(x, y)|\,dx\,dy$. According to their definitions, we have

$$\delta_{\square}(W_1, W_2) \le \delta_1(W_1, W_2), \quad \forall W_1, W_2 \in \mathcal{W}. \quad (3)$$

Let $\mathcal{P} = (P_1, \dots, P_K)$ be a partition of $\Omega$ into $K$ measurable sets. We define a step function $W_{\mathcal{P}}: \Omega^2 \mapsto [0, 1]$ as

$$W_{\mathcal{P}}(x, y) = \sum_{k, k'=1}^{K} w_{kk'}\, \mathbf{1}_{P_k \times P_{k'}}(x, y), \quad (4)$$

where each $w_{kk'} \in [0, 1]$ and the indicator function $\mathbf{1}_{P_k \times P_{k'}}(x, y)$ is 1 if $(x, y) \in P_k \times P_{k'}$ and 0 otherwise. The weak regularity lemma (Lovász 2012) shown below guarantees that every graphon can be approximated well in the cut norm by step functions.

Theorem 1 (Weak Regularity Lemma (Lovász 2012)). For every graphon $W \in \mathcal{W}$ and $K \ge 1$, there always exists a step function $W_{\mathcal{P}}$ with $|\mathcal{P}| = K$ steps such that

$$\|W - W_{\mathcal{P}}\|_{\square} \le \frac{2}{\sqrt{\log K}} \|W\|_{L_2}. \quad (5)$$

Note that a corollary of this lemma is $\delta_{\square}(W, W_{\mathcal{P}}) \le \frac{2}{\sqrt{\log K}} \|W\|_{L_2}$, because $\delta_{\square}(W, W_{\mathcal{P}}) \le \|W - W_{\mathcal{P}}\|_{\square}$.
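To illustrate the form of the step function in (4), the sketch below (our illustration; block-averaging over a fixed equitable partition is one simple way to build a step function, not the existential construction of Theorem 1) approximates a graphon given on a discrete grid:

```python
import numpy as np

def step_function_approximation(W_grid, K):
    """Approximate a graphon, given as an (R, R) grid on [0,1]^2, by a K-step function (Eq. (4)).

    Returns the (K, K) matrix [w_kk'] of block averages over an equitable partition,
    and the (R, R) step function obtained by broadcasting these values back to the grid.
    """
    R = W_grid.shape[0]
    # Assign each grid point to one of K equitable partition cells P_1, ..., P_K.
    labels = np.minimum((np.arange(R) * K) // R, K - 1)
    W_step = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            W_step[k, l] = W_grid[np.ix_(labels == k, labels == l)].mean()
    return W_step, W_step[labels][:, labels]

# Example: approximate W(x, y) = xy with K = 10 steps.
x = np.linspace(0, 1, 1000)
W_step, W_step_grid = step_function_approximation(np.outer(x, x), K=10)
```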
Oracle Estimator

Based on the weak regularity lemma, we would like to learn a step function $W_{\mathcal{P}}$ from observed graphs $\{G_m\}_{m=1}^M$ such that the cut distance between the step function and the ground truth, i.e., $\delta_{\square}(W, W_{\mathcal{P}})$, is minimized. Note that a graph $G$ can also be represented as a step function.

Definition 2. For a graph with a node set $V = \{1, \dots, N\}$ and an adjacency matrix $A$, we can represent it as a step function with $N$ equitable partitions of $\Omega$, i.e., $\mathcal{P} = \{P_n\}_{n=1}^N$ [2], denoted as $G_{\mathcal{P}}$, where $G_{\mathcal{P}}(x, y) = \sum_{n, n'=1}^{N} a_{nn'}\, \mathbf{1}_{P_n \times P_{n'}}(x, y)$.

[2] The equitable partitions have the same size, i.e., $|P_n| = |P_{n'}|$ for all $n \ne n'$.

Ideally, if we knew the positions of a graph's nodes, i.e., the $v_n$'s in (1), we could derive an isomorphism of the graph according to the order of the positions and obtain an oracle step function, denoted as $\hat{G}$ and $\hat{G}_{\mathcal{P}}$, respectively. Applying this sorting operation to $\{G_m\}_{m=1}^M$, we obtain a set of well-aligned graphs $\{\hat{G}_m\}_{m=1}^M$ and a set of oracle step functions $\{\hat{G}_{m, \mathcal{P}_m}\}_{m=1}^M$, where the number of partitions $|\mathcal{P}_m|$ is equal to the number of nodes in $\hat{G}_m$. Accordingly, we achieve an oracle estimator of $W$ as follows:

$$W_O = \frac{1}{M} \sum_{m=1}^{M} \hat{G}_{m, \mathcal{P}_m}. \quad (6)$$

This oracle estimator provides a consistent estimation of $W$:

Theorem 3. For every $W \in \mathcal{W}$, let $\{\hat{G}_{m, \mathcal{P}_m}\}_{m=1}^M$ be a set of oracle step functions defined by Definition 2. We have

$$\delta_{\square}(W, W_O) \le \frac{C}{\min_m |\mathcal{P}_m|}, \quad (7)$$

where $C$ is a constant.

Proof. For an arbitrary graphon $W$ and its oracle estimator $W_O$, we have

$$\delta_{\square}(W, W_O) = \delta_{\square}\Big(W, \frac{1}{M} \sum_{m=1}^{M} \hat{G}_{m, \mathcal{P}_m}\Big) \le \Big\| W - \frac{1}{M} \sum_{m=1}^{M} \hat{G}_{m, \mathcal{P}_m} \Big\|_{L_2} \le \frac{1}{M} \sum_{m=1}^{M} \big\| W - \hat{G}_{m, \mathcal{P}_m} \big\|_{L_2} \le \frac{1}{M} \sum_{m=1}^{M} \frac{C}{|\mathcal{P}_m|} \le \frac{C}{\min_m |\mathcal{P}_m|}. \quad (8)$$

The first inequality is based on the fact that for two arbitrary graphons, their cut distance is smaller than their $L_2$ distance (Janson 2013). The second inequality is the triangle inequality. The third inequality is a corollary of the step function approximation lemma in (Chan and Airoldi 2014), whose derivation corresponds to the second proof shown in the supplementary file of that reference. In particular, the constant $C$ corresponds to the supremum of the absolute difference between the graphon $W$ and its step function $\hat{G}$, which is independent of the number of partitions.
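As a small illustration of the oracle estimator in (6) (our sketch, assuming the true node positions from (1) are available and that all oracle step functions are rendered on a common evaluation grid):

```python
import numpy as np

def oracle_estimator(positions_list, adjacency_list, resolution=1000):
    """Oracle estimator of Eq. (6): average the oracle step functions of the graphs.

    positions_list[m]: true node positions v_n in [0, 1] of the m-th graph.
    adjacency_list[m]: the corresponding (N_m, N_m) adjacency matrix.
    Each graph is aligned by sorting its nodes by position and rendered as a step
    function on a common grid; the estimator is the average of these grids.
    """
    W_O = np.zeros((resolution, resolution))
    for v, A in zip(positions_list, adjacency_list):
        order = np.argsort(v)                        # sort nodes by their true positions
        A_sorted = A[np.ix_(order, order)]
        N = A.shape[0]
        # Map each grid point to the equitable partition cell of the node it falls into.
        idx = np.minimum((np.arange(resolution) * N) // resolution, N - 1)
        W_O += A_sorted[idx][:, idx]                 # the oracle step function on the grid
    return W_O / len(adjacency_list)
```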
Learning Graphons via GW Barycenters

The oracle estimator above is unavailable in practice because real-world graphs are generally unaligned: neither the positions of their nodes nor the correspondence between them is provided. Given such unaligned graphs, traditional learning methods first match observed graphs and then estimate the oracle step functions. As illustrated in Figure 1(b), this strategy often leads to failures because the matching step is NP-hard and creates wrongly-aligned graphs. To mitigate the dependency on well-aligned graphs (and their oracle step functions), we propose a new learning strategy. Specifically, considering the oracle estimator, we have

$$\begin{aligned}
\delta_{\square}(W, W_{\mathcal{P}}) &\le \delta_{\square}(W, W_O) + \delta_{\square}(W_O, W_{\mathcal{P}}) = \delta_{\square}(W, W_O) + \delta_{\square}\Big(\frac{1}{M} \sum_{m=1}^{M} \hat{G}_{m, \mathcal{P}_m},\ W_{\mathcal{P}}\Big) \\
&\le \delta_{\square}(W, W_O) + \frac{1}{M} \sum_{m=1}^{M} \delta_{\square}(\hat{G}_{m, \mathcal{P}_m}, W_{\mathcal{P}}) = \delta_{\square}(W, W_O) + \frac{1}{M} \sum_{m=1}^{M} \delta_{\square}(G_{m, \mathcal{P}_m}, W_{\mathcal{P}}) \\
&\le \delta_{\square}(W, W_O) + \frac{1}{M} \sum_{m=1}^{M} \delta_1(G_{m, \mathcal{P}_m}, W_{\mathcal{P}}). \quad (9)
\end{aligned}$$

The first inequality in (9) is the triangle inequality, and the second inequality is derived according to the definition of the cut distance. The last equality holds because $\delta_{\square}(G_{\mathcal{P}}, \hat{G}_{\mathcal{P}}) = 0$ (i.e., $G_{\mathcal{P}} \cong \hat{G}_{\mathcal{P}}$). Finally, the third inequality is based on (3). Theorem 3 and (9) indicate that we can minimize an upper bound of $\delta_{\square}(W, W_{\mathcal{P}})$ by solving the following optimization problem:

$$\min_{W_{\mathcal{P}}} \frac{1}{M} \sum_{m=1}^{M} \delta_1(G_{m, \mathcal{P}_m}, W_{\mathcal{P}}). \quad (10)$$

This strategy does not need to estimate the oracle step functions, because it directly considers the $\delta_1$ distance between the observed graphs and the proposed step function. To solve this problem, we derive a computationally efficient alternative to $\delta_1$ based on its equivalent definition shown below.

Theorem 4 (Remark 6.13 in (Janson 2013)). Let $W_1$ and $W_2$ be two graphons defined on the probability spaces $(\Omega_1, \mu_1)$ and $(\Omega_2, \mu_2)$, respectively. The distance $\delta_1(W_1, W_2)$ can be equivalently defined as

$$\delta_1(W_1, W_2) = \inf_{\pi \in \Pi(\mu_1, \mu_2)} \int_{(\Omega_1 \times \Omega_2)^2} |W_1(x, y) - W_2(x', y')|\,d\pi(x, x')\,d\pi(y, y'),$$

where $\Pi(\mu_1, \mu_2) = \{\pi \mid \pi \ge 0,\ \int_{\Omega_2} d\pi(x, y) = \mu_1,\ \int_{\Omega_1} d\pi(x, y) = \mu_2\}$ contains all measures on $\Omega_1 \times \Omega_2$ having marginals $\mu_1$ and $\mu_2$.

The characterization shown in Theorem 4 coincides with the 1-order Gromov-Wasserstein distance between $(\Omega_1, \mu_1)$ and $(\Omega_2, \mu_2)$ (Mémoli 2011). Let $W_{1, \mathcal{P}}$ and $W_{2, \mathcal{Q}}$ be two step functions defined on $(\Omega_1, \mu_1)$ and $(\Omega_2, \mu_2)$, which have equitable partitions $\mathcal{P} = \{P_i\}_{i=1}^I$ and $\mathcal{Q} = \{Q_j\}_{j=1}^J$. We rewrite $W_1(x, y) - W_2(x', y')$ as $\sum_{i, j=1}^{I} w_{1,ij}\, \mathbf{1}_{P_i \times P_j}(x, y) - \sum_{i', j'=1}^{J} w_{2,i'j'}\, \mathbf{1}_{Q_{i'} \times Q_{j'}}(x', y')$ and denote $r_{iji'j'} := |w_{1,ij} - w_{2,i'j'}|$. Let the probability measures $\mu_1$ and $\mu_2$ be constant in each partition, i.e., $\mu_1(x) = \sum_i \mu_{1,i}\, \mathbf{1}_{P_i}(x)$ and $\mu_2(x) = \sum_j \mu_{2,j}\, \mathbf{1}_{Q_j}(x)$. We can then rewrite the $\delta_1$ distance between the two step functions, i.e., $\delta_1(W_{1, \mathcal{P}}, W_{2, \mathcal{Q}})$, as

$$\begin{aligned}
&\inf_{\pi \in \Pi(\mu_1, \mu_2)} \sum_{i, i', j, j'} \int_{P_i \times Q_{i'}} \int_{P_j \times Q_{j'}} r_{iji'j'}\, d\pi_{x, x'}\, d\pi_{y, y'} = \inf_{\pi \in \Pi(\mu_1, \mu_2)} \sum_{i, i', j, j'} r_{iji'j'} \int_{P_i \times Q_{i'}} d\pi_{x, x'} \int_{P_j \times Q_{j'}} d\pi_{y, y'} \\
&= \min_{T \in \Pi(\mu_1, \mu_2)} \sum_{i, i', j, j'} r_{iji'j'}\, T_{ii'} T_{jj'} = d_{gw,1}(W_1, W_2),
\end{aligned}$$

where $W_1 = [w_{1,ij}] \in [0, 1]^{I \times I}$ and $W_2 = [w_{2,i'j'}] \in [0, 1]^{J \times J}$ rewrite the step functions in matrix form. The vectors $\mu_1 = [\mu_{1,i}]$ and $\mu_2 = [\mu_{2,j}]$ represent the probability measures $\mu_1$ and $\mu_2$; $T = [T_{ii'}] \in \mathbb{R}^{I \times J}$ is a coupling matrix in the set $\Pi(\mu_1, \mu_2) = \{T \ge 0 \mid T \mu_2 = \mu_1,\ T^{\top} \mu_1 = \mu_2\}$, whose element $T_{ii'} = \int_{P_i \times Q_{i'}} d\pi(x, x')$. The optimal $T$, denoted as $T^*$, is called the optimal transport or optimal coupling between $\mu_1$ and $\mu_2$ (Villani 2008; Peyré, Cuturi, et al. 2019).

The derivation above shows that instead of solving a complicated optimization problem in a function space, we can convert it to the 1-order Gromov-Wasserstein distance between matrices (Peyré, Cuturi, and Solomon 2016; Chowdhury and Mémoli 2019; Xu et al. 2019). Moreover, when we replace $r_{iji'j'}$ with $r_{iji'j'}^2$, we obtain the squared 2-order Gromov-Wasserstein distance:

$$d_{gw,2}^2(W_1, W_2) = \min_{T \in \Pi(\mu_1, \mu_2)} \sum_{i, i', j, j'} r_{iji'j'}^2\, T_{ii'} T_{jj'} = \min_{T \in \Pi(\mu_1, \mu_2)} \big\langle D - 2 W_1 T W_2^{\top},\ T \big\rangle. \quad (11)$$

Here, $\langle \cdot, \cdot \rangle$ calculates the inner product of two matrices, and $D = (W_1 \circ W_1) \mu_1 \mathbf{1}_J^{\top} + \mathbf{1}_I \mu_2^{\top} (W_2 \circ W_2)$, where $\mathbf{1}_I$ is an $I$-dimensional all-one vector and $\circ$ represents the Hadamard product of matrices. Because the 2-order GW distance and the 1-order GW distance are equivalent (pseudo-)metrics (Theorem 5.1 in (Mémoli 2011)), $d_{gw,2}^2(W_1, W_2)$ also provides a good alternative to the cut distance of the step functions. Plugging (11) into (10), the learning problem becomes estimating a GW barycenter of the observed graphs (Peyré, Cuturi, and Solomon 2016):

$$\min_{W \in [0, 1]^{K \times K}} \frac{1}{M} \sum_{m=1}^{M} d_{gw,2}^2(A_m, W), \quad (12)$$

where $A_m$ is the adjacency matrix of the graph $G_m$ and $W = [w_{kk'}] \in [0, 1]^{K \times K}$ is the matrix representation of the step function $W_{\mathcal{P}}$. Note that the number of partitions $K$ and the probability measures associated with $W$ and $\{A_m\}_{m=1}^M$ are predefined.
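To make the matrix form in (11) concrete, the following sketch (ours, not the released implementation) evaluates the squared 2-order GW objective $\langle D - 2 W_1 T W_2^{\top}, T\rangle$ for a given coupling $T$ and checks it against the brute-force four-index sum:

```python
import numpy as np

def gw2_objective(W1, W2, mu1, mu2, T):
    """Evaluate the squared 2-order GW objective of Eq. (11) for a given coupling T."""
    I, J = W1.shape[0], W2.shape[0]
    # D = (W1 o W1) mu1 1_J^T + 1_I mu2^T (W2 o W2)
    D = np.outer((W1 * W1) @ mu1, np.ones(J)) + np.outer(np.ones(I), mu2 @ (W2 * W2))
    return np.sum((D - 2.0 * W1 @ T @ W2.T) * T)

# Sanity check against sum_{i,i',j,j'} (w1_ij - w2_i'j')^2 T_ii' T_jj'.
rng = np.random.default_rng(0)
I, J = 4, 3
W1 = rng.random((I, I)); W1 = (W1 + W1.T) / 2     # step functions of graphons are symmetric
W2 = rng.random((J, J)); W2 = (W2 + W2.T) / 2
mu1, mu2 = np.full(I, 1 / I), np.full(J, 1 / J)
T = np.outer(mu1, mu2)                            # the independent coupling is always feasible
brute = sum((W1[i, j] - W2[k, l]) ** 2 * T[i, k] * T[j, l]
            for i in range(I) for j in range(I) for k in range(J) for l in range(J))
assert np.isclose(gw2_objective(W1, W2, mu1, mu2, T), brute)
```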
Implementation Details

Setting the number of partitions. Given a set of graphs $\{G_m\}_{m=1}^M$, we denote $N_{\max}$ as the number of nodes of the largest graph. Following the work in (Chan and Airoldi 2014; Airoldi, Costa, and Chan 2013; Channarond et al. 2012), we can set the number of partitions to $K = N_{\max} / \log N_{\max}$. This setting has been shown to achieve a trade-off between accuracy and computational efficiency.

Estimating probability measures. For the observed graphs, we estimate the probability measures by normalized node degrees (Xu, Luo, and Carin 2019), i.e., $\mu_m = \frac{A_m \mathbf{1}_{N_m}}{\mathbf{1}_{N_m}^{\top} A_m \mathbf{1}_{N_m}}$ for $m = 1, \dots, M$. We assume that the probability measure of $W$ is sorted, i.e., $\mu_W = [\mu_{W,1}, \dots, \mu_{W,K}]$ with $\mu_{W,1} \ge \dots \ge \mu_{W,K}$, and estimate it by sorting and merging $\{\mu_m\}_{m=1}^M$:

$$\mu_W = \frac{1}{M} \sum_{m=1}^{M} \mathrm{interp1d}_K(\mathrm{sort}(\mu_m)), \quad (13)$$

where $\mathrm{sort}(\cdot)$ sorts the elements of the input vector in descending order, and $\mathrm{interp1d}_K(\cdot)$ samples $K$ values from the input vector via linear interpolation. This strategy provides useful information when calculating the optimal transport between each $A_m$ and $W$ (Xu, Luo, and Carin 2019).

Learning optimal transports. The computation of $d_{gw,2}^2(A_m, W)$ is a non-convex, non-smooth optimization problem. To solve this problem efficiently, we apply the proximal gradient algorithm developed in (Xu et al. 2019). This algorithm reformulates the original problem as a series of subproblems and solves them iteratively. In each iteration, the subproblem is

$$\min_{T \in \Pi(\mu_m, \mu_W)} \big\langle D_m - 2 A_m T^{(s)} W^{\top},\ T \big\rangle + \beta\, \mathrm{KL}(T \,\|\, T^{(s)}), \quad (14)$$

where $T^{(s)}$ is the previous estimate of $T$ and $D_m = (A_m \circ A_m) \mu_m \mathbf{1}_K^{\top} + \mathbf{1}_{N_m} \mu_W^{\top} (W \circ W)$. We fix one transport matrix at its previous estimate in the GW term and add a proximal term as the regularizer. Here, the proximal term penalizes the KL-divergence between the transport matrix and its previous estimate, which smooths the update of the transport matrix. This problem can be solved by the Sinkhorn scaling algorithm (Sinkhorn and Knopp 1967), whose convergence rate is linear (Altschuler, Weed, and Rigollet 2017; Xie et al. 2020).

Learning GW barycenters. Given the optimal transports $\{T_m\}_{m=1}^M$, the GW barycenter has a closed-form solution (Peyré, Cuturi, and Solomon 2016): $W = \frac{1}{\mu_W \mu_W^{\top}} \circ \big(\frac{1}{M} \sum_{m=1}^{M} T_m^{\top} A_m T_m\big)$, where the division by $\mu_W \mu_W^{\top}$ is element-wise. The scheme of our algorithm is shown in Algorithm 1.

Algorithm 1: Learning Graphons via GWB
1: Input: adjacency matrices $\{A_m\}_{m=1}^M$; the weight of the proximal term $\beta$; the number of outer iterations $L$; the number of inner Sinkhorn iterations $S$.
2: Initialize $K = N_{\max} / \log N_{\max}$ and $W$ with entries drawn from $\mathrm{Uniform}([0, 1])$.
3: Initialize $\{\mu_m\}_{m=1}^M$ and $\mu_W$ via (13).
4: For $l = 1, \dots, L$:
5:   For $m = 1, \dots, M$:  // Solve (14)
6:     Initialize $T^{(0)} = \mu_m \mu_W^{\top}$ and $a = \mu_m$.
7:     For $s = 0, \dots, S - 1$:
8:       $C = \exp\big(-\frac{1}{\beta}(D_m - 2 A_m T^{(s)} W^{\top})\big) \circ T^{(s)}$
9:       $b = \mu_W / (C^{\top} a)$, $a = \mu_m / (C b)$, $T^{(s+1)} = \mathrm{diag}(a)\, C\, \mathrm{diag}(b)$ (element-wise division).
10:     $T_m = T^{(S)}$.
11:   Update $W = \frac{1}{\mu_W \mu_W^{\top}} \circ \big(\frac{1}{M} \sum_{m=1}^{M} T_m^{\top} A_m T_m\big)$.
12: Output: the graphon $W_{\mathcal{P}}(x, y) = \sum_{k, k'} w_{kk'}\, \mathbf{1}_{P_k \times P_{k'}}(x, y)$.
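The following NumPy sketch is our reading of Algorithm 1, not the authors' released code; the function and parameter names are ours, the default $\beta$ follows the synthetic-data setting reported later, and the clipping step and the stabilizing shift of the Sinkhorn kernel are our own hedges:

```python
import numpy as np

def gw_barycenter(As, K, beta=0.005, n_outer=5, n_sinkhorn=10, seed=0):
    """Sketch of Algorithm 1: learn a K x K step-function graphon from adjacency matrices As."""
    rng = np.random.default_rng(seed)
    # Node distributions: normalized degrees.
    mus = [A.sum(1) / A.sum() for A in As]
    # Barycenter distribution: average of sorted, linearly interpolated node distributions (Eq. (13)).
    grid = np.linspace(0.0, 1.0, K)
    mu_w = np.mean([np.interp(grid, np.linspace(0, 1, len(m)), np.sort(m)[::-1]) for m in mus], axis=0)
    mu_w /= mu_w.sum()                                   # renormalize to a probability vector
    W = rng.uniform(0.0, 1.0, (K, K))
    W = (W + W.T) / 2
    for _ in range(n_outer):
        Ts = []
        for A, mu in zip(As, mus):
            N = len(mu)
            D = np.outer((A * A) @ mu, np.ones(K)) + np.outer(np.ones(N), mu_w @ (W * W))
            T = np.outer(mu, mu_w)                       # T^(0)
            a = mu.copy()
            for _ in range(n_sinkhorn):                  # proximal-point Sinkhorn updates (Eq. (14))
                cost = D - 2.0 * A @ T @ W.T
                C = np.exp(-(cost - cost.min()) / beta) * T   # shift by cost.min() for stability
                b = mu_w / (C.T @ a)
                a = mu / (C @ b)
                T = a[:, None] * C * b[None, :]          # diag(a) C diag(b)
            Ts.append(T)
        # Closed-form barycenter update (line 11 of Algorithm 1).
        W = sum(T.T @ A @ T for T, A in zip(Ts, As)) / (len(As) * np.outer(mu_w, mu_w))
        W = np.clip(W, 0.0, 1.0)                         # keep W in the feasible set of Eq. (12)
    return W, mu_w
```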
Structured Gromov-Wasserstein Barycenters

We extend the above algorithm and propose two kinds of structured Gromov-Wasserstein barycenters to apply our learning method to more complicated scenarios.

Smoothed GW Barycenters. As shown in Figure 1(b), the estimated graphons achieved by our method can be discontinuous because of the permutation invariance of the Gromov-Wasserstein distance. To suppress the discontinuity of the results, we impose a smoothness regularization on the GW barycenters and obtain the following problem:

$$\min_{W \in [0, 1]^{K \times K}} \frac{1}{M} \sum_{m} \big\langle D_m - 2 A_m T_m W^{\top},\ T_m \big\rangle + \alpha \| L W L^{\top} \|_F^2, \quad (15)$$

where $T_m$ is the current estimate of the $m$-th optimal transport and $L W L^{\top}$ is the matrix representation of the Laplacian filtering of $W$. The first-order optimality condition of this problem also has a closed-form solution. In particular, setting the gradient of the objective to zero, we obtain

$$2 \alpha L^{\top} L W L^{\top} L + \mathrm{diag}(\mu_W)\, W\, \mathrm{diag}(\mu_W) = \frac{1}{M} \sum_{m=1}^{M} T_m^{\top} A_m T_m. \quad (16)$$

Applying the singular value decomposition (SVD) to $L^{\top} L$, i.e., $L^{\top} L = U \Lambda U^{\top}$, we rewrite the left side of (16) as $H W H^{\mathrm{H}}$, where $H = U(\sqrt{2\alpha \Lambda} + i\, \mathrm{diag}(\mu_W)) U^{\top}$ is a symmetric complex matrix and $H^{\mathrm{H}} = U(\sqrt{2\alpha \Lambda} - i\, \mathrm{diag}(\mu_W)) U^{\top}$ is its Hermitian transpose. Therefore, we obtain a smoothed $W$ by replacing line 11 of Algorithm 1 with $W = \frac{1}{M} H^{-1} \big(\sum_{m=1}^{M} T_m^{\top} A_m T_m\big) H^{-\mathrm{H}}$.

Mixed GW Barycenters. When the observed graphs are generated by $C$ different graphons, we can build a graphon mixture model and learn it as mixed GW barycenters:

$$\min_{\{W_c\}_{c=1}^C,\ P \in \Pi(\frac{1}{C}\mathbf{1}_C, \frac{1}{M}\mathbf{1}_M)} \sum_{c, m} p_{cm}\, d_{gw,2}^2(A_m, W_c), \quad (17)$$

where we set $P = [p_{cm}]$ as a coupling matrix whose marginals are $\frac{1}{C}\mathbf{1}_C$ and $\frac{1}{M}\mathbf{1}_M$. The value $p_{cm}$ indicates the joint probability that the $m$-th graph is generated by the $c$-th graphon. In other words, the objective function of (17) leads to a hierarchical optimal transport problem (Luo, Xu, and Carin 2020), in which the ground distance is defined by the GW distance and $P$ is the optimal transport matrix. This problem is solved by alternating optimization. In particular, replacing $\frac{1}{M}$ with $p_{cm}$, we still apply Algorithm 1 to learn $\{W_c\}_{c=1}^C$. Given $\{W_c\}_{c=1}^C$, we calculate the ground distance matrix $D_{gw} = [d_{gw,2}^2(A_m, W_c)]$ and update $P$ by solving an entropic optimal transport problem:

$$\min_{P \in \Pi(\frac{1}{C}\mathbf{1}_C, \frac{1}{M}\mathbf{1}_M)} \big\langle D_{gw},\ P \big\rangle + \beta \big\langle P,\ \log P \big\rangle. \quad (18)$$

Similar to (14), this problem can also be solved by the Sinkhorn scaling algorithm (Sinkhorn and Knopp 1967).
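For the $P$-update in (18), a minimal Sinkhorn sketch (ours; hyperparameter defaults are illustrative, not values from the paper) looks as follows:

```python
import numpy as np

def update_assignment_matrix(D_gw, beta=0.01, n_iter=50):
    """Sketch of the P-update in Eq. (18): entropic OT between uniform marginals.

    D_gw: (C, M) matrix of squared GW distances d_gw,2^2(A_m, W_c).
    Returns P with row marginals 1/C and column marginals 1/M via Sinkhorn scaling.
    """
    C, M = D_gw.shape
    mu_c = np.full(C, 1.0 / C)
    mu_m = np.full(M, 1.0 / M)
    K = np.exp(-D_gw / beta)          # Gibbs kernel of the entropic OT problem
    a = np.ones(C)
    for _ in range(n_iter):
        b = mu_m / (K.T @ a)
        a = mu_c / (K @ b)
    return a[:, None] * K * b[None, :]
```

A hard cluster label for the $m$-th graph can then be read off as the column-wise argmax of $P$, a simple post-processing choice on our part.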
Related Work

Graphon estimation. As a classic method, the stochastic block approximation (SBA) learns stochastic block models as graphons (Airoldi, Costa, and Chan 2013), in which the block size can be optimized heuristically by the largest gap algorithm (Channarond et al. 2012). The sorting-and-smoothing (SAS) method improves this strategy by adding total-variation denoising as a post-processing step (Chan and Airoldi 2014). The work in (Pensky et al. 2019) extends this strategy and proposes a dynamic stochastic block model to describe time-varying graphons. The matrix completion (MC) method (Keshavan, Montanari, and Oh 2010), the universal singular value thresholding (USVT) algorithm (Chatterjee et al. 2015), and the spectral method (Xu 2018) learn low-rank matrices as graphons. The work in (Ruiz, Chamon, and Ribeiro 2020) represents graphons by their Fourier transforms. The methods above require that the observed graphs be well-aligned and generated by a single graphon.

GW distance. The GW distance (Mémoli 2011) has been widely used to measure the difference between structured data like graphs (Vayer et al. 2018). For graphs, the optimal transport associated with their GW distance indicates the correspondence between their nodes (Xu et al. 2019). To calculate this distance, the work in (Peyré, Cuturi, and Solomon 2016) adds an entropy regularizer to the objective function and applies the Sinkhorn scaling algorithm (Cuturi 2013). The work in (Xu et al. 2019) improves this method by replacing the entropy regularizer with a Bregman proximal term. An ADMM-based method is proposed in (Xu 2020) to calculate the GW distance between directed graphs. Recently, the recursive GW distance (Xu, Luo, and Carin 2019) and the sliced GW distance (Titouan et al. 2019b) have been proposed to reduce the computational complexity of the GW distance. A GW barycenter model is proposed in (Peyré, Cuturi, and Solomon 2016), which shows potential for graph clustering (Xu 2020). Our work pioneers the development of structured GW barycenters to learn graphons.

Experiments

Synthetic Data

To demonstrate the efficacy of our GWB method and its smoothed variant (SGWB), we compare them with existing methods on learning synthetic graphons. We set the hyperparameters of our methods as follows: the weight of the proximal term $\beta = 0.005$, the number of iterations $L = 5$, and the number of Sinkhorn iterations $S = 10$; for the SGWB method, the weight of the smoothness regularizer is $\alpha = 0.0002$. The baselines include the stochastic block approximation (SBA) (Airoldi, Costa, and Chan 2013), the largest gap (LG) based block approximation (Channarond et al. 2012), the matrix completion (MC) (Keshavan, Montanari, and Oh 2010), the universal singular value thresholding (USVT) (Chatterjee et al. 2015), and the sorting-and-smoothing (SAS) method (Chan and Airoldi 2014).

We prepare 13 kinds of synthetic graphons, whose definitions are shown in Table 1. The resolution of these graphons is $1000 \times 1000$.

Table 1: Comparisons on estimation errors (MSE for "Easy to align", $d_{gw,2}$ for "Hard to align"). Each cell reports mean ± standard deviation.

| Type | $W(x, y)$ | # nodes | SBA | LG | MC | USVT | SAS | GWB | SGWB |
|---|---|---|---|---|---|---|---|---|---|
| Easy to align (MSE) | $xy$ | 200 | 65.6±6.5 | 29.8±5.7 | 11.3±0.8 | 31.7±2.5 | 125.0±1.3 | 40.6±5.7 | 39.3±5.5 |
| | | [100, 300] | 157.4±23.3 | 133.2±22.8 | 138.3±21.3 | 131.2±24.2 | 161.9±19.5 | 52.5±13.1 | 51.9±12.6 |
| | $e^{-(x^{0.7}+y^{0.7})}$ | 200 | 58.7±7.8 | 22.9±3.1 | 71.7±0.5 | 12.2±1.5 | 77.7±0.8 | 21.6±2.1 | 20.9±1.8 |
| | | [100, 300] | 165.2±22.8 | 157.6±24.2 | 166.2±21.5 | 158.2±24.5 | 153.4±25.1 | 48.6±11.9 | 48.0±10.8 |
| | $\frac{1}{4}[x^2+y^2+\sqrt{x}+\sqrt{y}]$ | 200 | 63.4±7.6 | 24.1±2.5 | 73.2±0.7 | 33.8±1.1 | 99.3±1.2 | 18.9±3.5 | 18.4±2.8 |
| | | [100, 300] | 258.5±36.0 | 254.2±36.6 | 259.5±35.0 | 254.2±36.8 | 240.6±39.2 | 81.0±18.8 | 80.5±17.8 |
| | $\frac{1}{2}(x+y)$ | 200 | 66.2±8.3 | 24.0±2.5 | 71.9±0.6 | 40.2±0.8 | 108.3±1.0 | 21.2±4.6 | 20.2±3.9 |
| | | [100, 300] | 247.6±40.3 | 241.3±41.0 | 247.0±39.1 | 241.3±41.0 | 231.3±40.2 | 83.8±22.5 | 84.6±22.0 |
| | $\frac{1}{1+\exp(-10(x^2+y^2))}$ | 200 | 55.0±9.5 | 23.1±3.2 | 64.6±0.5 | 37.3±0.6 | 73.3±0.7 | 14.8±2.3 | 16.1±1.5 |
| | | [100, 300] | 394.0±45.7 | 397.0±46.5 | 400.7±45.6 | 399.3±47.0 | 345.4±52.6 | 62.8±12.6 | 62.5±12.3 |
| | $\frac{1}{1+\exp(-(\max\{x,y\}^2+\min\{x,y\}^4))}$ | 200 | 48.3±6.1 | 24.5±2.3 | 71.1±0.4 | 24.4±0.5 | 54.4±0.5 | 15.3±1.0 | 17.1±1.4 |
| | | [100, 300] | 382.9±54.7 | 387.9±53.6 | 392.5±52.3 | 391.9±54.8 | 336.7±58.6 | 39.8±8.6 | 41.7±8.3 |
| | $e^{-\max\{x,y\}^{3/4}}$ | 200 | 56.3±7.1 | 26.8±0.9 | 79.3±0.5 | 50.6±0.3 | 68.6±0.6 | 21.4±1.7 | 21.2±1.0 |
| | | [100, 300] | 234.7±32.9 | 234.0±33.3 | 241.4±31.2 | 242.9±33.3 | 212.0±36.5 | 49.3±9.5 | 48.7±9.1 |
| | $e^{-\frac{1}{2}(\min\{x,y\}+\sqrt{x}+\sqrt{y})}$ | 200 | 55.7±7.7 | 26.4±5.6 | 76.4±0.4 | 28.3±0.5 | 76.4±0.8 | 23.2±1.3 | 23.3±1.4 |
| | | [100, 300] | 232.1±30.8 | 231.4±31.8 | 238.7±29.9 | 232.6±31.9 | 208.3±34.8 | 48.2±11.7 | 47.9±11.0 |
| | $\log(1+\max\{x,y\})$ | 200 | 66.0±8.4 | 37.1±6.6 | 66.9±0.8 | 120.9±0.5 | 137.0±1.2 | 23.7±2.5 | 23.2±1.8 |
| | | [100, 300] | 370.8±38.9 | 370.7±40.4 | 374.5±39.5 | 375.5±37.5 | 337.7±42.1 | 104.0±18.8 | 107.4±18.9 |
| Hard to align ($d_{gw,2}$) | $|x-y|$ | 200 | 0.202±0.001 | 0.200±0.002 | 0.206±0.001 | 0.215±0.002 | 0.217±0.002 | 0.057±0.005 | 0.050±0.003 |
| | | [100, 300] | 0.254±0.007 | 0.254±0.008 | 0.254±0.009 | 0.261±0.009 | 0.248±0.008 | 0.085±0.012 | 0.080±0.010 |
| | $1-|x-y|$ | 200 | 0.200±0.001 | 0.198±0.002 | 0.202±0.002 | 0.209±0.003 | 0.217±0.001 | 0.063±0.003 | 0.057±0.001 |
| | | [100, 300] | 0.383±0.041 | 0.384±0.040 | 0.383±0.040 | 0.393±0.041 | 0.350±0.044 | 0.075±0.009 | 0.077±0.004 |
| | $0.8\,I_2 \otimes \mathbf{1}_{[0,\frac{1}{2}]^2}$ | 200 | 0.252±0.018 | 0.258±0.018 | 0.258±0.016 | 0.252±0.016 | 0.367±0.004 | 0.252±0.002 | 0.218±0.001 |
| | | [100, 300] | 0.355±0.005 | 0.359±0.002 | 0.361±0.005 | 0.392±0.038 | 0.409±0.004 | 0.328±0.034 | 0.329±0.032 |
| | $0.8\,\mathrm{flip}(I_2) \otimes \mathbf{1}_{[0,\frac{1}{2}]^2}$ | 200 | 0.241±0.010 | 0.254±0.005 | 0.250±0.003 | 0.242±0.003 | 0.364±0.001 | 0.252±0.002 | 0.190±0.058 |
| | | [100, 300] | 0.453±0.004 | 0.487±0.002 | 0.450±0.008 | 0.477±0.001 | 0.468±0.001 | 0.427±0.027 | 0.420±0.027 |

Among these graphons, nine are considered in (Chan and Airoldi 2014).
The graphs generated by them are easy to align: the node degrees of these graphs provide strong evidence to sort and match nodes. For these graphons, we apply the mean-square error (MSE) to evaluate different methods. Additionally, to highlight the advantage of our method, we design four challenging graphons whose graphs are hard to align. [3] In particular, for the graphs generated by these four graphons, the node degrees of different nodes can be equal to each other. Therefore, there is no simple way of aligning the graphs. For these graphons, we apply the GW distance $d_{gw,2}$ as the evaluation measurement.

[3] $0.8\,I_2 \otimes \mathbf{1}_{[0,\frac{1}{2}]^2}$ is a graphon with two diagonal blocks, and $0.8\,\mathrm{flip}(I_2) \otimes \mathbf{1}_{[0,\frac{1}{2}]^2}$ is a bipartite graphon, where $\otimes$ represents the Kronecker product and $I_2$ is a $2 \times 2$ identity matrix.

For each graphon, we simulate graphs with two different settings: in one setting, each of the graphs has 200 nodes, while in the other setting, the number of each graph's nodes is sampled uniformly from the range [100, 300]. Originally, the baselines above are designed for the former setting. When dealing with graphs of different sizes, they pad zeros to the corresponding adjacency matrices and enforce the graphs to have the same size. For each setting, we test our method and the baselines in 10 trials, and in each trial we simulate 10 graphs and estimate the graphon by different methods.

Table 1 shows that our GWB method outperforms the baselines in most situations. Especially for the graphs that are hard to align, the gap between our methods and the baselines becomes larger. When the observed graphs have different sizes, the estimation errors of the baselines increase because the padding step harms the alignment of the graphs. By contrast, our GWB method is relatively robust to the variance of graph size, and it achieves much lower estimation errors. With the help of the smoothness regularizer, our SGWB method improves the stability of the original GWB method: it achieves comparable estimation errors but with smaller standard deviations.

For the nine easy graphons, we generate different numbers of graphs with different sizes and compare the learning methods on their averaged MSEs under different settings in Table 2. Our method outperforms the baselines, which verifies its robustness. Besides estimation errors, we also compare the methods on their computational complexity. The runtime in Table 2 shows that our GWB and SGWB are faster than SBA and LG in practice. Suppose that we have $M$ graphs, and each of them has $N$ nodes and $E$ edges. Learning a step function with $K$ partitions as a graphon, we list the complexity of the methods in Table 3. In particular, line 8 of Algorithm 1 involves sparse matrix multiplication, whose complexity is $O(EK + NK^2)$. Because the graphs often have sparse edges, i.e., $E = O(N \log N)$ and $K = O(N / \log N)$, the complexity of our GWB method is comparable to that of the others when the numbers of iterations (i.e., $L$ and $S$) are small.
Table 2: Comparisons on averaged MSE and runtime (seconds) under different configurations.

| Measurement | # graphs M | # nodes N | SBA | LG | MC | USVT | SAS | GWB | SGWB |
|---|---|---|---|---|---|---|---|---|---|
| MSE | 2 | 200 | 62.1±9.0 | 26.5±3.9 | 73.6±2.3 | 45.6±1.6 | 94.1±1.5 | 24.6±4.3 | 24.3±3.3 |
| | 10 | 200 | 59.5±7.7 | 26.5±3.6 | 70.7±0.6 | 39.9±0.9 | 91.1±0.9 | 22.3±2.8 | 21.9±2.4 |
| | 20 | 200 | 32.2±2.0 | 23.7±4.1 | 50.7±0.4 | 38.7±0.5 | 91.0±0.7 | 21.1±2.0 | 20.7±1.7 |
| | 2 | 500 | 59.4±0.4 | 49.1±2.0 | 69.0±1.8 | 35.4±2.4 | 23.9±4.3 | 22.2±7.0 | 21.1±4.5 |
| | 10 | 500 | 47.1±5.6 | 20.2±1.4 | 60.6±0.3 | 27.6±0.4 | 34.8±0.8 | 19.7±2.5 | 19.6±1.7 |
| | 20 | 500 | 34.1±3.3 | 18.8±1.7 | 43.2±0.2 | 27.0±0.3 | 35.4±0.7 | 17.3±1.9 | 17.7±1.4 |
| Runtime | 10 | 200 | 0.69±0.03 | 0.61±0.04 | 0.02±0.01 | 0.02±0.00 | 0.03±0.01 | 0.32±0.04 | 0.34±0.03 |
| | 10 | 500 | 3.74±0.09 | 3.69±0.08 | 0.08±0.02 | 0.09±0.01 | 0.08±0.01 | 0.67±0.06 | 0.70±0.06 |

Table 3: Comparisons on computational complexity.

| Method | Complexity | Method | Complexity |
|---|---|---|---|
| LG | $O(MN^2)$ | SBA | $O(MKN \log N)$ |
| MC | $O(N^3)$ | SAS | $O(MN \log N + K^2 \log K^2)$ |
| USVT | $O(N^3)$ | GWB | $O(LSM(EK + NK^2))$ |

Real-World Data

For real-world graph datasets, our mixed GWB method (Mix GWBs) provides a new way to cluster graphs. In particular, by learning $C$ graphons from $M$ graphs, we obtain $C$ centroids of different clusters, and the learned optimal transport $P = [p_{cm}]$ indicates the probability that the $m$-th graph belongs to the $c$-th cluster. To demonstrate the effectiveness of our method, we test it on two datasets and compare it with three baselines. The datasets are IMDB-BINARY and IMDB-MULTI (Yanardag and Vishwanathan 2015), which can be downloaded from (Morris et al. 2020). IMDB-BINARY contains 1000 graphs belonging to two clusters, while IMDB-MULTI contains 1500 graphs belonging to three clusters. These two datasets are challenging for graph clustering, as the nodes and edges of their graphs do not have any side information; we have to cluster graphs merely based on their binary adjacency matrices. For these two datasets, clustering methods based on the GW distance achieve state-of-the-art performance. The representative methods include (i) the fused Gromov-Wasserstein kernel method (FGWK) (Titouan et al. 2019a); (ii) K-means using the Gromov-Wasserstein distance as the metric (GW-KM); and (iii) the Gromov-Wasserstein factorization (GWF) method (Xu 2020). We test our Mix GWBs method and compare it with these three methods on clustering accuracy. In particular, for each dataset, we apply 10-fold cross-validation to evaluate each clustering method. The averaged clustering accuracy and the standard deviation are shown in Table 4. The performance of our method is at least comparable to that of the competitors.

Table 4: Comparisons on clustering accuracy (%).

| Dataset | FGWK | GW-KM | GWF | Mix GWBs |
|---|---|---|---|---|
| IMDB-BINARY | 56.7±1.5 | 53.5±2.3 | 60.6±1.7 | 61.4±1.8 |
| IMDB-MULTI | 42.3±2.4 | 41.0±3.1 | 40.8±2.0 | 42.9±1.9 |

[Figure 2: Illustrations of estimated graphons. (a) IMDB-BINARY; (b) IMDB-MULTI.]

Figure 2 visualizes the graphons learned by our method and illustrates the differences between the clusters. We find that the graphons correspond to block models with different block sizes. Additionally, IMDB-MULTI is much more challenging than IMDB-BINARY, because it contains three rather than two clusters and the block structure of each cluster is not as clear as that of IMDB-BINARY.

Conclusions

In this paper, we propose a novel method to learn graphons from unaligned graphs. Our method minimizes an upper bound of the cut distance between the target graphons and their approximations, which leads to a GW barycenter problem.
To extend our method to practical scenarios, we develop two structured variants of the basic GWB algorithm. In the future, we plan to improve the robustness of our method to the weight of the smoothness regularizer and to further reduce its complexity by applying the recursive GW distance (Xu, Luo, and Carin 2019) or the sliced GW distance (Titouan et al. 2019b). Additionally, because the graphon is naturally a generative graph model, we will consider using it for graph generation tasks.

Acknowledgments

The Duke University component of this work was supported in part by DARPA, DOE, NIH, ONR and NSF, and a portion of the work by the first two authors was performed when they were affiliated with Duke. Hongteng Xu was supported in part by the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098) and NSFC (No. 61832017), and Dixin Luo was supported in part by the Beijing Institute of Technology Research Fund Program for Young Scholars (XSQD-202107001) and the project 2020YFF0305200. We thank Thomas Needham and Samir Chowdhury for their constructive suggestions.

References

Airoldi, E. M.; Costa, T. B.; and Chan, S. H. 2013. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, 692-700.
Altschuler, J.; Weed, J.; and Rigollet, P. 2017. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, 1964-1974.
Avella-Medina, M.; Parise, F.; Schaub, M.; and Segarra, S. 2018. Centrality measures for graphons: Accounting for uncertainty in networks. IEEE Transactions on Network Science and Engineering.
Ballester, C.; Calvó-Armengol, A.; and Zenou, Y. 2006. Who's who in networks. Wanted: The key player. Econometrica 74:1403-1417.
Borgs, C.; Chayes, J. T.; Lovász, L.; Sós, V. T.; and Vesztergombi, K. 2008. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics 219(6):1801-1851.
Chan, S., and Airoldi, E. 2014. A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, 208-216.
Channarond, A.; Daudin, J.-J.; Robin, S.; et al. 2012. Classification and estimation in the stochastic blockmodel based on the empirical degrees. Electronic Journal of Statistics 6:2574-2601.
Chatterjee, S., et al. 2015. Matrix estimation by universal singular value thresholding. The Annals of Statistics 43(1):177-214.
Chowdhury, S., and Mémoli, F. 2019. The Gromov-Wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA 8(4):757-787.
Chung, F., and Radcliffe, M. 2011. On the spectra of general random graphs. The Electronic Journal of Combinatorics, P215.
Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2292-2300.
Frieze, A., and Kannan, R. 1999. Quick approximation to matrices and applications. Combinatorica 19(2):175-220.
Gao, S., and Caines, P. E. 2019. Graphon control of large-scale networks of linear systems. IEEE Transactions on Automatic Control.
Goldenberg, A.; Zheng, A. X.; Fienberg, S. E.; and Airoldi, E. M. 2010. A survey of statistical network models. Foundations and Trends in Machine Learning 2(2):129-233.
Hoff, P. D.; Raftery, A. E.; and Handcock, M. S. 2002. Latent space approaches to social network analysis. Journal of the American Statistical Association 97(460):1090-1098.
Jackson, M. O., and Zenou, Y. 2015. Games on networks. In Handbook of Game Theory with Economic Applications, volume 4. Elsevier. 95-163.
Janson, S., and Diaconis, P. 2008. Graph limits and exchangeable random graphs. Rendiconti di Matematica e delle sue Applicazioni. Serie VII 33-61.
Janson, S. 2013. Graphons, cut norm and distance, couplings and rearrangements. New York Journal of Mathematics.
Keshavan, R. H.; Montanari, A.; and Oh, S. 2010. Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6):2980-2998.
Kolaczyk, E. D. 2009. Statistical Analysis of Network Data: Methods and Models. Springer.
Lovász, L. 2012. Large networks and graph limits, volume 60. American Mathematical Soc.
Luo, D.; Xu, H.; and Carin, L. 2020. Hierarchical optimal transport for robust multi-view learning. arXiv preprint arXiv:2006.03160.
Mémoli, F. 2011. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4):417-487.
Morris, C.; Kriege, N. M.; Bause, F.; Kersting, K.; Mutzel, P.; and Neumann, M. 2020. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020).
Nagurney, A. 2013. Network economics: A variational inequality approach, volume 10. Springer Science & Business Media.
Nowicki, K., and Snijders, T. A. B. 2001. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96(455):1077-1087.
Parise, F., and Ozdaglar, A. 2018. Graphon games: A statistical framework for network games and interventions. arXiv preprint arXiv:1802.00080.
Pensky, M., et al. 2019. Dynamic network models and graphon estimation. Annals of Statistics 47(4):2378-2403.
Peyré, G.; Cuturi, M.; et al. 2019. Computational optimal transport. Foundations and Trends in Machine Learning 11(5-6):355-607.
Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, 2664-2672.
Ruiz, L.; Chamon, L. F.; and Ribeiro, A. 2020. The graphon Fourier transform. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5660-5664. IEEE.
Sinkhorn, R., and Knopp, P. 1967. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21(2):343-348.
Soufiani, H. A., and Airoldi, E. 2012. Graphlet decomposition of a weighted network. In Artificial Intelligence and Statistics, 54-63.
Titouan, V.; Courty, N.; Tavenard, R.; and Flamary, R. 2019a. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, 6275-6284.
Titouan, V.; Flamary, R.; Courty, N.; Tavenard, R.; and Chapel, L. 2019b. Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems, 14726-14736.
Vayer, T.; Chapel, L.; Flamary, R.; Tavenard, R.; and Courty, N. 2018. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834.
Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.
Xie, Y.; Wang, X.; Wang, R.; and Zha, H. 2020. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence, 433-453. PMLR.
Xu, H.; Luo, D.; Zha, H.; and Carin, L. 2019. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, 6932-6941.
Xu, H.; Luo, D.; and Carin, L. 2019. Scalable Gromov-Wasserstein learning for graph partitioning and matching. In Advances in Neural Information Processing Systems, 3052-3062.
Xu, J. 2018. Rates of convergence of spectral methods for graphon estimation. In International Conference on Machine Learning, 5433-5442.
Xu, H. 2020. Gromov-Wasserstein factorization models for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 6478-6485.
Yanardag, P., and Vishwanathan, S. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1365-1374.