Convergence of Invariant Graph Networks

Chen Cai 1, Yusu Wang 1

1 University of California San Diego, San Diego, USA. Correspondence to: Chen Cai.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract: Although theoretical properties such as expressive power and over-smoothing of graph neural networks (GNN) have been extensively studied recently, their convergence properties are a relatively new direction. In this paper, we investigate the convergence of one powerful GNN, the Invariant Graph Network (IGN), over graphs sampled from graphons. We first prove the stability of linear layers for general k-IGN (of order k) based on a novel interpretation of linear equivariant layers. Building upon this result, we prove the convergence of k-IGN under the model of Ruiz et al. (2020), where we access the edge weight but the convergence error is measured for graphon inputs. Under the more natural (and more challenging) setting of Keriven et al. (2020), where one can only access the 0-1 adjacency matrix sampled according to edge probability, we first show a negative result that the convergence of any IGN is not possible. We then obtain the convergence of a subset of IGNs, denoted IGN-small, after edge probability estimation. We show that IGN-small still contains a function class rich enough to approximate spectral GNNs arbitrarily well. Lastly, we perform experiments on various graphon models to verify our statements.

1. Introduction

Graph neural networks (GNNs) have recently become a key framework for the learning and analysis of graph-type data, leading to progress on link prediction, knowledge graph embedding, and property prediction, to name a few (Wu et al., 2020; Zhou et al., 2020). Although theoretical properties such as expressive power (Maron et al., 2019b; Keriven & Peyré, 2019; Maron et al., 2019a; Garg et al., 2020; Azizian & Lelarge, 2020; Geerts, 2020; Bevilacqua et al., 2021) and over-smoothing (Li et al., 2018; Oono & Suzuki, 2019; Cai & Wang, 2020; Zhou et al., 2021) of GNNs have received much attention, their convergence property is less understood. In this paper, we systematically investigate the convergence of one of the most powerful families of GNNs, the Invariant Graph Network (IGN) (Maron et al., 2018). Different from the message passing neural network (MPNN) (Gilmer et al., 2017), IGN treats graphs and the associated node/edge features as monolithic tensors and processes them in a permutation-equivariant manner. 2-IGN can approximate MPNNs arbitrarily well on a compact domain. When allowing the use of higher-order tensors as intermediate representations, k-IGN is shown to be at least as powerful as the k-WL test. As the tensor order k goes to $O(n^4)$, it achieves universality and can distinguish all graphs of size n (Maron et al., 2019b; Keriven & Peyré, 2019; Azizian & Lelarge, 2020).

The high-level question we are interested in is the convergence and stability of GNNs. In particular, given a sequence of graphs sampled from some generative model, does a GNN applied to them also converge to a limiting object? This problem has been considered recently; however, so far the studies (Ruiz et al., 2020; Keriven et al., 2020) focus on the convergence of spectral GNNs, which encompass several models (Bruna et al., 2013; Defferrard et al., 2016) including GCNs with order-1 filters (Kipf & Welling, 2016).
However, it is known that the expressive power of GCN is limited. Given that 2-IGN (and more generally k-IGN) is strictly more powerful than GCN (Xu et al., 2018) in terms of separating graphs (in terms of separating graphs, k-IGN > 2-IGN = GIN > GCN for k > 2), and given its ability to achieve universality, it is of great interest to study the convergence of such a powerful GNN. In fact, studying convergence for models more powerful than spectral GNNs, and for higher-order GNNs, is posed as an open question in Keriven et al. (2021). This is the question we aim to study in this paper.

Contributions. We present the first convergence study of the powerful k-IGNs (strictly more powerful than the spectral GNNs studied in previous work). We first analyze the building block of IGNs, the linear equivariant layers, and develop a stability result for such layers. The case of 2-IGN is proved via case analysis, while the general case of k-IGN uses a novel interpretation of the linear equivariant layers which we believe is of independent interest.

There are two existing models of convergence of spectral GNNs for graphs sampled from graphons, developed in Ruiz et al. (2020) and Keriven et al. (2020), respectively. Using the model of Ruiz et al. (2020) (denoted the edge weight continuous model), where we access the edge weight but the convergence error is measured between graphon inputs (see Section 5 for details), we obtain analogous convergence results for k-IGNs. The results cover both deterministic and random sampling for k-IGN, while Ruiz et al. (2020) only covers deterministic sampling for the much weaker spectral GNNs. Under the more natural (and more challenging) setting of Keriven et al. (2020), where one can only access the 0-1 adjacency matrix sampled according to edge probability (called the edge probability discrete model), we first show a negative result that in general the convergence of all IGNs is not possible. Building upon our earlier stability result, we obtain the convergence of a subset of IGNs, denoted IGN-small, after a step of edge probability estimation. We show that IGN-small still contains a rich function class that can approximate spectral GNNs arbitrarily well. Lastly, we perform experiments on various graphon models to verify our statements.

2. Related Work

One type of convergence in deep learning concerns the limiting behavior of neural networks when the width goes to infinity (Jacot et al., 2018; Du et al., 2018; Arora et al., 2019; Lee et al., 2019; Du et al., 2019). In that regime, the gradient flow on a normally initialized, fully connected neural network with a linear output layer in the infinite-width limit turns out to be equivalent to kernel regression with respect to the Neural Tangent Kernel (Jacot et al., 2018). Another type of convergence concerns the limiting behavior of neural networks when the depth goes to infinity. In the continuous limit, models such as residual networks, recurrent neural network decoders, and normalizing flows can be seen as an Euler discretization of an ordinary differential equation (Weinan, 2017; Chen et al., 2018; Lu et al., 2018; Ruthotto & Haber, 2020). The type of convergence we consider in this paper concerns the following question: when the input objects converge to a limit, does the output of a neural network on this sequence of objects also converge to a limit?
In the context of GNNs, such convergence and the related notions of stability and transferability have been studied in both the graphon setting (Ruiz et al., 2020; Keriven et al., 2020; Gama et al., 2020; Ruiz et al., 2021) and the manifold setting (Kostrikov et al., 2018; Levie et al., 2021). In the manifold setting, the analysis is closely related to the literature on the convergence of the Laplacian operator (Xu, 2004; Wardetzky, 2008; Belkin et al., 2008; 2009; Dey et al., 2010).

3. Preliminaries

3.1. Notations

To talk about convergence/stability, we will consider graphs of different sizes sampled from a generative model. Similar to earlier work in this direction, the specific generative model we consider is the graphon model.

Graphons. A graphon is a bounded, symmetric, measurable function $W : [0,1]^2 \to [0,1]$. We denote the space of graphons by $\mathcal{W}$. A graphon can be intuitively thought of as an undirected weighted graph with an uncountable number of nodes: roughly speaking, given $u_i, u_j \in [0,1]$, we can consider there to be an edge $(i,j)$ with weight $W(u_i, u_j)$. Given a graphon W, we can sample unweighted graphs of any size from W, either in a deterministic or a stochastic manner. We defer the definition of the sampling process until we introduce the edge weight continuous model in Section 5 and the edge probability discrete model in Section 6.

Tensor. Let $[n]$ denote $\{1, \dots, n\}$. A tensor X of order k, called a k-tensor, is a map from $[n]^k$ to $\mathbb{R}^d$. If we specify a name $\mathrm{name}_i$ for each axis, we say X is indexed by $(\mathrm{name}_1, \dots, \mathrm{name}_k)$. With slight abuse of notation, we also write $X \in \mathbb{R}^{n^k \times d}$. We refer to d as the feature dimension or the channel dimension. If $d = 1$, we have a k-tensor in $\mathbb{R}^{n^k \times 1} = \mathbb{R}^{n^k}$. Although the name for each axis acts as an identifier and can be given arbitrarily, we will use sets to name the axes in this paper. For example, given a 3-tensor X, we use $\{1\}$ to name the first axis, $\{2\}$ for the second axis, and so on. The benefit of doing so will become clear in Section 4.2.

Partition. A partition of $[k]$, denoted $\gamma$, is defined to be a set of disjoint sets $\gamma := \{\gamma_1, \dots, \gamma_s\}$ with $s \le k$ such that 1) $\gamma_i \subseteq [k]$ for all $i \in [s]$, 2) $\gamma_i \cap \gamma_j = \emptyset$ for all $i \ne j \in [s]$, and 3) $\cup_{i=1}^{s} \gamma_i = [k]$. We denote the space of all partitions of $[k]$ by $\Gamma_k$. Its cardinality is the k-th Bell number, $\mathrm{bell}(k) = |\Gamma_k|$.

Other conventions. By default, we use the 2-norm (Frobenius norm), i.e., the $\ell_2$ norm for all vectors/matrices and the $L_2$ norm for functions on $[0,1]$ and $[0,1]^2$. $\|\cdot\|_2$ or $\|\cdot\|$ denotes the 2-norm for discrete objects, while $\|W\|_{L_2} := \left(\int\!\!\int W^2(u,v)\,du\,dv\right)^{1/2}$ denotes the norm for continuous objects. Similarly, we use $\|\cdot\|_\infty$ and $\|\cdot\|_{L_\infty}$ to denote the infinity norm. When necessary, we use $\|\cdot\|_{L_2([0,1])}$ to specify the support explicitly. We use $\|\cdot\|_{spec}$ to denote the spectral norm. $\Phi_c$ and $\Phi_d$ refer to the continuous IGN and the discrete IGN, respectively. We sometimes call a function $f : [0,1] \to \mathbb{R}^d$ a graphon signal. Given $A \in \mathbb{R}^{n^k \times d_1}$ and $B \in \mathbb{R}^{n^k \times d_2}$, $[A, B] \in \mathbb{R}^{n^k \times (d_1+d_2)}$ is defined to be the concatenation of A and B along the feature dimension. See Table 4 in the Appendix for the full symbol list.

Table 1: Linear equivariant maps for $\mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ and $\mathbb{R}^{[0,1]^2} \to \mathbb{R}^{[0,1]^2}$. $\mathbf{1}$ is the all-one vector of size $n \times 1$ and $I_{u=v}$ is the indicator function.
Operations (Discrete; Continuous; Partitions):

1-2: The identity and transpose operations. Discrete: $T(A) = A$; $T(A) = A^T$. Continuous: $T(W) = W$; $T(W) = W^T$. Partitions: $\{\{1,3\},\{2,4\}\}$; $\{\{1,4\},\{2,3\}\}$.

3: The diag operation. Discrete: $T(A) = \mathrm{Diag}(\mathrm{Diag}^*(A))$. Continuous: $T(W)(u,v) = W(u,v)\,I_{u=v}$. Partition: $\{\{1,2,3,4\}\}$.

4-6: Average of rows replicated on rows / columns / diagonal. Discrete: $T(A) = \frac{1}{n}A\mathbf{1}\mathbf{1}^T$; $T(A) = \frac{1}{n}\mathbf{1}\mathbf{1}^T A^T$; $T(A) = \frac{1}{n}\mathrm{Diag}(A\mathbf{1})$. Continuous: $T(W)(u,\cdot) = \int W(u,v)\,dv$; $T(W)(\cdot,u) = \int W(u,v)\,dv$; $T(W)(u,v) = I_{u=v}\int W(u,v')\,dv'$. Partitions: $\{\{1,4\},\{2\},\{3\}\}$; $\{\{1,3\},\{2\},\{4\}\}$; $\{\{1,3,4\},\{2\}\}$.

7-9: Average of columns replicated on rows / columns / diagonal. Discrete: $T(A) = \frac{1}{n}\mathbf{1}\mathbf{1}^T A$; $T(A) = \frac{1}{n}A^T\mathbf{1}\mathbf{1}^T$; $T(A) = \frac{1}{n}\mathrm{Diag}(A^T\mathbf{1})$. Continuous: $T(W)(\cdot,v) = \int W(u,v)\,du$; $T(W)(v,\cdot) = \int W(u,v)\,du$; $T(W)(u,v) = I_{u=v}\int W(u',v)\,du'$. Partitions: $\{\{1\},\{2,4\},\{3\}\}$; $\{\{1\},\{2,3\},\{4\}\}$; $\{\{1\},\{2,3,4\}\}$.

10-11: Average of all elements replicated on the whole matrix / diagonal. Discrete: $T(A) = \frac{1}{n^2}(\mathbf{1}^T A\mathbf{1})\,\mathbf{1}\mathbf{1}^T$; $T(A) = \frac{1}{n^2}(\mathbf{1}^T A\mathbf{1})\,\mathrm{Diag}(\mathbf{1})$. Continuous: $T(W)(\cdot,\cdot) = \int\!\!\int W(u,v)\,du\,dv$; $T(W)(u,v) = I_{u=v}\int\!\!\int W(u',v')\,du'\,dv'$. Partitions: $\{\{1\},\{2\},\{3\},\{4\}\}$; $\{\{1\},\{2\},\{3,4\}\}$.

12-13: Average of diagonal elements replicated on the whole matrix / diagonal. Discrete: $T(A) = \frac{1}{n}(\mathbf{1}^T \mathrm{Diag}^*(A))\,\mathbf{1}\mathbf{1}^T$; $T(A) = \frac{1}{n}(\mathbf{1}^T \mathrm{Diag}^*(A))\,\mathrm{Diag}(\mathbf{1})$. Continuous: $T(W)(\cdot,\cdot) = \int W(u,u)\,du$; $T(W)(u,v) = I_{u=v}\int W(u',u')\,du'$. Partitions: $\{\{1,2\},\{3\},\{4\}\}$; $\{\{1,2\},\{3,4\}\}$.

14-15: Replicate diagonal elements on rows / columns. Discrete: $T(A) = \mathrm{Diag}^*(A)\mathbf{1}^T$; $T(A) = \mathbf{1}\,\mathrm{Diag}^*(A)^T$. Continuous: $T(W)(u,v) = W(u,u)$; $T(W)(u,v) = W(v,v)$. Partitions: $\{\{1,2,4\},\{3\}\}$; $\{\{1,2,3\},\{4\}\}$.

3.2. Invariant Graph Network

Definition 1. An Invariant Graph Network (IGN) is a function $\Phi : \mathbb{R}^{n^2 \times d_0} \to \mathbb{R}^{d}$ of the following form:
$$\Phi = h \circ L^{(T)} \circ \sigma \circ \cdots \circ \sigma \circ L^{(1)}, \qquad (1)$$
where each $L^{(t)}$ is a linear equivariant (LE) layer (Maron et al., 2018) from $\mathbb{R}^{n^{k_{t-1}} \times d_{t-1}}$ to $\mathbb{R}^{n^{k_t} \times d_t}$ (i.e., mapping a $k_{t-1}$-tensor with $d_{t-1}$ channels to a $k_t$-tensor with $d_t$ channels), $\sigma$ is a nonlinear activation function, and h is a linear invariant layer from $k_T$-tensors $\mathbb{R}^{n^{k_T} \times d_T}$ to vectors in $\mathbb{R}^d$. Here $d_t$ is the channel number and $k_t$ is the tensor order in the t-th layer.

Let $\mathrm{Diag}(\cdot)$ be the operator constructing a diagonal matrix from a vector and $\mathrm{Diag}^*(\cdot)$ be the operator extracting the diagonal from a matrix. Under the IGN framework, we view a graph with n nodes as a 2-tensor: in particular, given its adjacency matrix $A_n$ of size $n \times n$, node features $X_n \in \mathbb{R}^{n \times d_{node}}$, and edge features $E_{n\times n} \in \mathbb{R}^{n^2 \times d_{edge}}$, the input of IGN is the concatenation $[A_n, \mathrm{Diag}(X_n), E_{n\times n}] \in \mathbb{R}^{n^2 \times (1 + d_{node} + d_{edge})}$ along the channel dimension. We drop the subscript when there is no confusion. We use 2-IGN to denote an IGN whose largest tensor order within any intermediate layer is 2, while k-IGN is one whose largest tensor order across all layers is k. We use IGN to refer to the general IGN of any order k. Without loss of generality, we consider the input and output tensors to have a single channel; the extension to the multiple-channel case is presented in Appendix G.2. Consider all linear equivariant maps from $\mathbb{R}^{n^\ell}$ to $\mathbb{R}^{n^m}$, denoted $LE_{\ell,m}$. Maron et al. (2018) characterize a basis of the space of $LE_{\ell,m}$. It turns out that the cardinality of this basis equals the Bell number $\mathrm{bell}(\ell+m)$, and thus depends only on the orders of the input/output tensors and is independent of the graph size n. As an example, we list a specific basis of the space of LE maps for 2-IGN (thus with tensor order at most 2) in Tables 1, 2 and 3, for the case where the input and output channel numbers are both 1. Extending the LE layers to multiple input/output channels is straightforward, and can be achieved by parametrizing the LE layers according to the indices of the input/output channels. See Remark 8 in the Appendix.
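To make the normalized operators of Table 1 concrete, here is a minimal NumPy sketch (function names are ours, chosen for illustration) of a handful of the fifteen discrete basis maps; each one divides by the graph size n exactly as in the table so that a graphon analog exists.

```python
import numpy as np

# A few of the 15 normalized linear equivariant basis maps R^{n x n} -> R^{n x n}
# from Table 1. Each is normalized by the graph size n, matching the table.

def op_transpose(A):                # ops 1-2: identity is trivial, transpose shown
    return A.T

def op_diag(A):                     # op 3: keep only the diagonal
    return np.diag(np.diag(A))

def op_row_mean_on_rows(A):         # op 4: average of rows replicated on rows
    n = A.shape[0]
    return np.tile(A.mean(axis=1, keepdims=True), (1, n))

def op_col_mean_on_diag(A):         # op 9: average of columns replicated on diagonal
    return np.diag(A.mean(axis=0))

def op_total_mean(A):               # op 10: mean of all elements on the whole matrix
    return np.full_like(A, A.mean())

def op_diag_to_rows(A):             # op 14: replicate diagonal elements on rows
    n = A.shape[0]
    return np.tile(np.diag(A)[:, None], (1, n))

if __name__ == "__main__":
    n = 5
    A = np.random.rand(n, n)
    for op in (op_transpose, op_diag, op_row_mean_on_rows,
               op_col_mean_on_diag, op_total_mean, op_diag_to_rows):
        assert op(A).shape == (n, n)
```

A full LE layer is then a learned linear combination of such maps (plus the analogous maps of Tables 2 and 3 when the tensor order changes), applied channel-wise.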
Note that one difference of the operators in Tables 1, 2 and 3 from those given in the original paper is that here we normalize all operators appropriately w.r.t. the graph size n. (This normalization is also used in the official implementation of the IGN paper.) This is necessary when we consider the continuous limiting case. To talk about convergence, one has to define the continuous analog of IGN for graphons. In Tables 1, 2 and 3 we extend all LE operators defined for graphs to graphons, resulting in the continuous analog of 2-IGN, denoted 2-cIGN or $\Phi_c$ in the remaining text. A similar construction can be carried out for general k-IGN as well; the basis elements for k-IGNs are described in Section 4.2.

Definition 2 (2-cIGN). By extending all LE layers of 2-IGN to the graphon case as shown in Tables 1, 2 and 3, we can define the corresponding 2-cIGN via Eq. (1).

4. Stability of Linear Layers in IGN

In this section, we first show a stability result for a single linear layer of IGN. That is, given two graphons $W_1, W_2$, we show that if $\|W_1 - W_2\|_{pn}$ is small, then the distance between the objects obtained after applying a single LE layer remains small. Here $\|\cdot\|_{pn}$ is a partition-norm that will be introduced in a moment. Similar statements also hold for the discrete case when the input is a graph. We first describe how to prove stability for 2-(c)IGN as a warm-up. We then prove it for k-(c)IGN, which is significantly more interesting and requires a new interpretation of the elements in a specific basis of the space of LE operators from Maron et al. (2018).

The general LE layer $T : \mathbb{R}^{n^\ell} \to \mathbb{R}^{n^m}$ can be written as $T = \sum_\gamma c_\gamma T_\gamma$, where $T_\gamma \in \mathcal{B} := \{T_\gamma \mid \gamma \in \Gamma_{\ell+m}\}$ is a basis element of the space of $LE_{\ell,m}$ and the $c_\gamma$ are called filter coefficients. Hence proving the stability of T reduces to showing the stability of each element of $\mathcal{B}$, which is what we focus on from now on.

4.1. Stability of Linear Layers of 2-IGN

A natural way to show stability would be to show that the spectral norm of each LE operator in a basis is bounded. However, even for 2-IGN, some LE operators replicate diagonal elements to all rows (e.g., operators 14-15 in Table 1) and have unbounded spectral norm. To address this challenge, we need a more refined analysis. In particular, below we introduce a new norm that treats the diagonal differently from the non-diagonal elements in the 2-tensor case. We term it the partition-norm since, as we will see when handling higher-order k-IGN, this norm arises naturally w.r.t. partitions of the index set of tensors.

Definition 3 (Partition-norm). The partition-norm of a 2-tensor $A \in \mathbb{R}^{n^2}$ is defined as $\|A\|_{pn} := \left(\frac{\|\mathrm{Diag}^*(A)\|_2}{\sqrt{n}}, \frac{\|A\|_2}{n}\right)$. The continuous analog of the partition-norm for a graphon $W \in \mathcal{W}$ is defined as $\|W\|_{pn} := \left(\sqrt{\int W^2(u,u)\,du},\ \sqrt{\int\!\!\int W^2(u,v)\,du\,dv}\right)$. We refer to the first term as the normalized diagonal norm and the second term as the normalized matrix norm. Furthermore, we define operations such as addition/comparison on partition-norms simply as component-wise operations. For example, $\|A\|_{pn} \le \|B\|_{pn}$ if each of the two terms of $\|A\|_{pn}$ is at most the corresponding term of $\|B\|_{pn}$.

As each term in the partition-norm is a norm on a different part of the input, the partition-norm is itself a norm. By summing over the finite feature dimension, in both the finite and infinite cases, the definition of the partition-norm extends to multi-channel tensors $\mathbb{R}^{n^2 \times d}$ and the continuous version $\mathbb{R}^{[0,1]^2 \times d}$. See Appendix B.1 for details.
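The following short NumPy sketch (helper names ours) computes the partition-norm of Definition 3 and illustrates why it is the right yardstick: the "replicate diagonal on rows" operator has spectral norm growing with n, yet its output's partition-norm components stay controlled by the input's.

```python
import numpy as np

def partition_norm(A):
    """Partition-norm of a 2-tensor (Definition 3):
    (normalized diagonal norm, normalized matrix norm)."""
    n = A.shape[0]
    return (np.linalg.norm(np.diag(A)) / np.sqrt(n),
            np.linalg.norm(A) / n)             # Frobenius norm / n

def row_mean_on_rows(A):                       # operator 4 in Table 1
    n = A.shape[0]
    return np.tile(A.mean(axis=1, keepdims=True), (1, n))

def diag_on_rows(A):                           # operator 14 in Table 1: T(A)_{ij} = A_{ii}
    return np.tile(np.diag(A)[:, None], (1, A.shape[0]))

if __name__ == "__main__":
    n = 2000
    A = np.random.rand(n, n)
    eps = max(partition_norm(A))
    for T in (row_mean_on_rows, diag_on_rows):
        out = partition_norm(T(A))
        # Stability in the sense of Proposition 1 below: if both components of the
        # input partition-norm are at most eps, so are both components of the output,
        # even though diag_on_rows has spectral norm growing like sqrt(n).
        assert max(out) <= eps + 1e-9
```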
The following result shows that each basis operation for 2-IGN, shown in Tables 1, 2 and 3, is stable w.r.t. the partition-norm. Hence an LE layer consisting of a finite combination of these operations remains stable. The proof is via a case-by-case analysis and can be found in Appendix B.2.

Proposition 1. For all LE operators $T_i : \mathbb{R}^{n^2} \to \mathbb{R}^{n^2}$ of discrete 2-IGN listed in Table 1, $\|T_i(A)\|_{pn} \le \|A\|_{pn}$ for any $A \in \mathbb{R}^{n^2}$. Similar statements hold for $T_i : \mathbb{R}^{n} \to \mathbb{R}^{n^2}$ and $T_i : \mathbb{R}^{n^2} \to \mathbb{R}^{n}$ in Tables 2 and 3 in Appendix A. In the case of continuous 2-cIGN, the stability also holds.

Remark 1. Note that this also implies that, given $W_1, W_2 \in \mathcal{W}$, we have $\|T_i(W_1) - T_i(W_2)\|_{pn} \le \|W_1 - W_2\|_{pn}$. Similarly, given $A_1, A_2 \in \mathbb{R}^{n^2 \times 1} = \mathbb{R}^{n^2}$, we have $\|T_i(A_1) - T_i(A_2)\|_{pn} \le \|A_1 - A_2\|_{pn}$.

4.2. Stability of Linear Layers of k-IGN

We now consider the more general case of k-IGN. In principle, the proof for 2-IGN can still be extended to k-IGN, but going through all bell(k) elements of the LE basis of k-IGN one by one would be quite cumbersome. In the next two subsections, we provide a new interpretation of the elements of a basis of the space of $LE_{\ell,m}$ in a unified framework, so that we can avoid a case-by-case analysis. Such an interpretation, detailed in Section 4.3, is potentially of independent interest. First, we need some notation.

Definition 4 (Equivalence pattern). Given a k-tensor X, denote the space of its indices $\{(i_1, \dots, i_k) \mid i_1 \in [n], \dots, i_k \in [n]\}$ by $\mathcal{I}_k$. Given X, $\gamma = \{\gamma_1, \dots, \gamma_d\} \in \Gamma_k$, and an element $a = (a_1, \dots, a_k) \in \mathcal{I}_k$, we say $a \in \gamma$ if $i, j \in \gamma_l$ for some $l \in [d]$ always implies $a_i = a_j$. Alternatively, we also say a satisfies the equivalence pattern of $\gamma$ if $a \in \gamma$. As an example, suppose $\gamma = \{\{1,2\},\{3\}\}$; then $(x, x, y) \in \gamma$, while $(x, y, z) \notin \gamma$ for distinct $x, y$. Equivalence patterns induce slices/sub-tensors of a tensor.

Definition 5 (Slice/sub-tensor of $X \in \mathbb{R}^{n^k \times 1}$ for $\gamma \in \Gamma_k$). Let $X \in \mathbb{R}^{n^k \times 1}$ be a k-tensor indexed by $(\{1\}, \dots, \{k\})$. Consider a partition $\gamma = \{\gamma_1, \dots, \gamma_{k'}\} \in \Gamma_k$ of cardinality $k' \le k$. The slice (sub-tensor) of X induced by $\gamma$ is a $k'$-tensor $X_\gamma$, indexed by $(\gamma_1, \dots, \gamma_{k'})$, defined as $X_\gamma(j_1, \dots, j_{k'}) := X(\iota_\gamma(j_1, \dots, j_{k'}))$, where $j_\cdot \in [n]$ and $\iota_\gamma(j_1, \dots, j_{k'}) \in \gamma$. Here $\iota_\gamma : [n]^{k'} \to [n]^k$ is defined by $\iota_\gamma(j_1, \dots, j_{k'}) := (i_1, \dots, i_k)$ such that $\{a, b\} \subseteq \gamma_c$ implies $i_a = i_b := j_c$, for $a, b \in [k]$ and $c \in [k']$. As an example, we show five slices of a 3-tensor in Figure 1.

Consider the LE operators from $\mathbb{R}^{n^\ell}$ to $\mathbb{R}^{n^m}$. Each such map $T_\gamma$ can be represented by a matrix of size $n^\ell \times n^m$, which can further be considered as an $(\ell+m)$-tensor $B_\gamma$. Maron et al. (2018) showed that a specific basis for such operators can be characterized as follows. Each basis element corresponds to one of the $\mathrm{bell}(\ell+m)$ partitions in $\Gamma_{\ell+m}$: given a partition $\gamma \in \Gamma_{\ell+m}$, the corresponding basis LE operator $T_\gamma$ has tensor representation $B_\gamma$ defined, for any $a \in \mathcal{I}_{\ell+m}$, by
$$B_\gamma(a) = \begin{cases} 1 & \text{if } a \in \gamma, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$
The collection $\mathcal{B} = \{T_\gamma \mid \gamma \in \Gamma_{\ell+m}\}$ forms a basis for all $LE_{\ell,m}$ maps. In Section 4.3, we will provide an interpretation of each element of $\mathcal{B}$, making it easy to reason about its effect on an input tensor in a unified framework.

Before the main theorem, we also need to extend the partition-norm in Definition 3 from 2-tensors to higher-order tensors. Intuitively, for $X \in \mathbb{R}^{n^k}$, $\|X\|_{pn}$ has bell(k) components, where each component is the normalized norm of $X_\gamma$, the slice of X induced by $\gamma \in \Gamma_k$. See Figure 1 for examples of slices of a 3-tensor. The partition-norms of the input and output of an $LE_{\ell,m}$ map are thus of dimension bell(ℓ) and bell(m), respectively. See Appendix B.1 for details.

Figure 1: Five possible slices of a 3-tensor, corresponding to the bell(3) = 5 partitions of [3]. From left to right: a) $\{\{1,2\},\{3\}\}$, b) $\{\{1\},\{2,3\}\}$, c) $\{\{1,3\},\{2\}\}$, d) $\{\{1\},\{2\},\{3\}\}$, e) $\{\{1,2,3\}\}$.
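To make Eq. (2) concrete, the sketch below (function names ours) enumerates the set partitions of $[\ell+m]$ and builds a sparse representation of the 0-1 tensor $B_\gamma$ by checking the equivalence pattern of each index tuple; for $\ell = m = 1$ it recovers the familiar bell(2) = 2 basis maps (the identity pattern and the all-ones pattern).

```python
from itertools import product

def partitions(elements):
    """Enumerate all set partitions of `elements`; their number is bell(len(elements))."""
    elements = list(elements)
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        for i in range(len(smaller)):            # put `first` into an existing block
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller                # or into a block of its own

def satisfies(a, gamma):
    """a in gamma in the sense of Definition 4: positions in the same block of gamma
    (1-based) must carry equal index values."""
    return all(a[i - 1] == a[j - 1] for block in gamma for i in block for j in block)

def basis_support(gamma, n, order):
    """Sparse representation of B_gamma from Eq. (2): the index tuples a with B_gamma(a) = 1."""
    return {a for a in product(range(n), repeat=order) if satisfies(a, gamma)}

if __name__ == "__main__":
    n, l, m = 3, 1, 1
    basis = list(partitions(range(1, l + m + 1)))
    assert len(basis) == 2                       # bell(2) = 2
    for gamma in basis:
        print(gamma, sorted(basis_support(gamma, n, l + m)))
```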
The following theorem characterizes the effect of each operator in $\mathcal{B}$ in terms of the partition-norms of its input and output, generalizing Proposition 1 from matrices to higher-order tensors.

Theorem 1 (Stability of LE layers for k-IGN). Let $T_\gamma : \mathbb{R}^{[0,1]^\ell} \to \mathbb{R}^{[0,1]^m}$ be a basis element of the space of $LE_{\ell,m}$ maps, where $\gamma \in \Gamma_{\ell+m}$. If $\|X\|_{pn} \le \epsilon\mathbf{1}_{\mathrm{bell}(\ell)}$, then the partition-norm of $Y := T_\gamma(X)$ satisfies $\|Y\|_{pn} \le \epsilon\mathbf{1}_{\mathrm{bell}(m)}$, for all $\gamma \in \Gamma_{\ell+m}$.

The proof relies on a new interpretation of the elements of $\mathcal{B}$ in k-IGN. We give only an intuitive sketch using an example in the next subsection; see Appendix B.3 for the proof.

4.3. Interpretation of Basis Elements

For better understanding, color the input axes $\{1, \dots, \ell\}$ red and the output axes $\{\ell+1, \dots, \ell+m\}$ blue. Each $T_\gamma$ corresponds to one partition $\gamma$ of $[\ell+m]$. Any partition $\gamma \in \Gamma_{\ell+m}$ can be written as a disjoint union $\gamma = S_1 \cup S_2 \cup S_3$, where $S_1$ is a set of set(s) containing only input axes, $S_3$ is a set of set(s) containing only output axes, and $S_2$ is a set of set(s) each of which contains both input and output axes. With slight abuse of notation, we omit the subscript $\gamma$ on $S_1, S_2, S_3$ when the choice of $\gamma$ is fixed or clear, and denote $\{\ell+1, \dots, \ell+m\}$ by $\ell+[m]$. As an example, one basis element of the space of $LE_{3,3}$ maps is
$$\gamma = \{\{1,2\},\{3,6\},\{4\},\{5\}\}: \quad S_1 = \{\{1,2\}\}\ \text{(only input axes)}, \quad S_2 = \{\{3,6\}\}\ \text{(both input and output axes)}, \quad S_3 = \{\{4\},\{5\}\}\ \text{(only output axes)},$$
where 1, 2, 3 index the axes of the input tensor and 4, 5, 6 index the axes of the output tensor. Recall that there is a one-to-one correspondence between the partitions of $[\ell+m]$ and the basis elements of $\mathcal{B}$, as in Eq. (2).

Figure 2: An illustration of one basis element of the space of $LE_{3,3}$, for the partition $\{\{1,2\},\{3,6\},\{4\},\{5\}\}$. It selects the area spanned by axes $\{1,2\}$ and $\{3\}$ (grey shaded), averages over the (red) axes $\{1,2\}$, aligns the resulting 1D tensor with axis $\{6\}$ in the output tensor, and finally replicates the slice along axes $\{4\}$ and $\{5\}$ to fill in the whole cube on the right.

The basis element $T_\gamma$ corresponding to $\gamma = S_1 \cup S_2 \cup S_3$ operates on an input tensor $X \in \mathbb{R}^{n^\ell}$ and produces an output tensor $Y \in \mathbb{R}^{n^m}$ as follows. Given input X, (step 1) obtain its slice $X_\gamma$ on $\Pi_1$ (selection axes); (step 2) average $X_\gamma$ over $\Pi_2$ (reduction axes), resulting in $X_{\gamma,\mathrm{reduction}}$; (step 3) align $X_{\gamma,\mathrm{reduction}}$ on $\Pi_3$ (alignment axes) with $Y_\gamma$; and (step 4) replicate $Y_\gamma$ along $\Pi_4$ (replication axes), resulting in $Y_{\gamma,\mathrm{replication}}$, a slice of Y. Entries of Y outside $Y_{\gamma,\mathrm{replication}}$ are set to 0. In general, the $\Pi_i$ can be read off from $S_1$-$S_3$; see Appendix B.3 for details.

As a running example, Figure 2 illustrates the basis element corresponding to $\gamma = S_1 \cup S_2 \cup S_3$ with $S_1 = \{\{1,2\}\}$, $S_2 = \{\{3,6\}\}$, $S_3 = \{\{4\},\{5\}\}$. In the first step, given the 3-tensor X indexed by $(\{1\},\{2\},\{3\})$, we select the slice of interest $X_\gamma$ on $\Pi_1 = \{\{1,2\},\{3\}\}$, colored grey in the left cube of Figure 2. In the second step, we average $X_\gamma$ over the axis $\Pi_2 = \{\{1,2\}\}$ to reduce the 2-tensor $X_\gamma$, indexed by $(\{1,2\},\{3\})$, to a 1-tensor $X_{\gamma,\mathrm{reduction}}$, indexed by $(\{3\})$. In the third step, $X_{\gamma,\mathrm{reduction}}$ is aligned with $\Pi_3 = \{\{6\}\}$, resulting in the grey cuboid $Y_\gamma$ indexed by $(\{6\})$, shown in the right cube of Figure 2; here the only difference between $X_{\gamma,\mathrm{reduction}}$ and $Y_\gamma$ is the index name of the two tensors. In the fourth step, we replicate the grey cuboid $Y_\gamma$ over the axes $\Pi_4 = \{\{4\},\{5\}\}$ to fill in the cube, resulting in $Y_{\gamma,\mathrm{replication}}$, indexed by $(\{3\},\{4\},\{5\})$. Note that in general $Y_{\gamma,\mathrm{replication}}$ is a slice of Y and does not have to be the same as Y. These steps are defined formally in the Appendix. For each of the four steps, we can control the partition-norm of the output of that step (see Lemma 3 in the Appendix), and therefore control the partition-norm of the final output for every basis element. See Appendix B.3 for the full proofs.
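A minimal NumPy sketch (helper name ours) of this select / reduce / align / replicate recipe for the running $LE_{3,3}$ example $\gamma = \{\{1,2\},\{3,6\},\{4\},\{5\}\}$; it mirrors Figure 2 rather than a general implementation.

```python
import numpy as np

def T_gamma_example(X):
    """Basis element of LE_{3,3} for gamma = {{1,2},{3,6},{4},{5}} applied to a
    3-tensor X (axes named {1},{2},{3}), following the four steps of Section 4.3."""
    n = X.shape[0]
    # Step 1 (select): slice X on Pi_1 = {{1,2},{3}}, i.e. entries with equal first
    # two indices; the result has axes ({1,2},{3}).
    X_gamma = X[np.arange(n), np.arange(n), :]           # shape (n, n)
    # Step 2 (reduce): average over Pi_2 = {{1,2}}, leaving a 1-tensor indexed by {3}.
    X_red = X_gamma.mean(axis=0)                          # shape (n,)
    # Step 3 (align): identify the reduced tensor with axis {6} of the output.
    Y_gamma = X_red
    # Step 4 (replicate): copy along Pi_4 = {{4},{5}}; for this gamma the replication
    # fills the entire output cube, so no zero entries remain.
    return np.broadcast_to(Y_gamma[None, None, :], (n, n, n)).copy()

if __name__ == "__main__":
    n = 4
    X = np.random.rand(n, n, n)
    Y = T_gamma_example(X)
    assert np.allclose(Y[2, 3, :], X[np.arange(n), np.arange(n), :].mean(axis=0))
```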
5. Convergence of IGN in the Edge Weight Continuous Model

Ruiz et al. (2020) consider the convergence of $\|\Phi_c(W) - \Phi_c(W_n)\|_{L_2}$ in the graphon space, where W is the original graphon and $W_n$ is a piecewise-constant graphon induced from graphs of size n sampled from W (to be defined soon). We call this model the edge weight continuous model. The main result of Ruiz et al. (2021) is the convergence of continuous spectral GNNs in the deterministic sampling case, where graphs are sampled from W deterministically. Leveraging our earlier stability result for linear layers of continuous IGNs in Theorem 1, we can prove an analogous convergence result for cIGNs in the edge weight continuous model, for both the deterministic and the random sampling case.

Setup of the edge weight continuous model. Given a graphon $W \in \mathcal{W}$ and a signal $X \in \mathbb{R}^{[0,1] \times d}$, the input to cIGN is $[W, \mathrm{Diag}(X)] \in \mathbb{R}^{[0,1]^2 \times (1+d)}$. In the random sampling setting, we sample a graph of size n from W by setting the following edge weight matrix and discrete signal:
$$[\tilde{A}_n]_{ij} := W(u_i, u_j) \quad \text{and} \quad [\tilde{x}_n]_i := X(u_i), \qquad (4)$$
where $u_i$ is the i-th smallest point among n i.i.d. points sampled from the uniform distribution on [0, 1]. We further lift the discrete graph $(\tilde{A}_n, \tilde{x}_n)$ to a piecewise-constant graphon $\tilde{W}_n$ with signal $\tilde{X}_n$. Specifically, partition [0, 1] into $I_1 \cup \dots \cup I_n$ with $I_i = (u_i, u_{i+1}]$. We then define
$$\tilde{W}_n(u, v) := \sum_{i,j} [\tilde{A}_n]_{ij}\, I(u \in I_i)\, I(v \in I_j) \quad \text{and} \quad \tilde{X}_n(u) := \sum_{i} [\tilde{x}_n]_i\, I(u \in I_i), \qquad (5)$$
where I is the indicator function. Replacing the random sampling with a fixed grid, i.e., letting $u_i = \frac{i-1}{n}$, we obtain the deterministic edge weight continuous model, where $W_n$ and $X_n$ are defined analogously as the lifting of a deterministically sampled graph to a piecewise-constant graphon. Note that $\tilde{W}_n$ is a piecewise-constant graphon whose blocks are not all of the same size, while all blocks of $W_n$ are of size $\frac{1}{n}$. We use the tilde to emphasize that $\tilde{W}_n / \tilde{X}_n$ are random variables, in contrast to the deterministic $W_n / X_n$.
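A small NumPy sketch of the sampling and lifting in Eqs. (4)-(5), using as an assumed example the Lipschitz graphon $W(u,v) = \frac{u+v+1}{4}$ from Section 7 (function names and the boundary handling of the blocks are ours).

```python
import numpy as np

def W(u, v):                       # example Lipschitz graphon, also used in Section 7
    return (u + v + 1.0) / 4.0

def sample_points(n, random=True, rng=None):
    """Latent points: sorted i.i.d. Uniform[0,1] (random model) or the fixed grid (i-1)/n."""
    if random:
        rng = rng or np.random.default_rng(0)
        return np.sort(rng.uniform(0.0, 1.0, size=n))
    return np.arange(n) / n

def sampled_weight_matrix(u):
    """Eq. (4): [A_n]_{ij} = W(u_i, u_j)."""
    return W(u[:, None], u[None, :])

def induced_graphon(A_n, u):
    """Eq. (5): the induced piecewise-constant graphon, evaluated pointwise.
    Block I_i is taken as (u_i, u_{i+1}], with the last endpoint padded to 1."""
    edges = np.append(u, 1.0)
    def W_n(x, y):
        i = np.clip(np.searchsorted(edges, x, side="left") - 1, 0, len(u) - 1)
        j = np.clip(np.searchsorted(edges, y, side="left") - 1, 0, len(u) - 1)
        return A_n[i, j]
    return W_n

if __name__ == "__main__":
    u = sample_points(200, random=True)
    A = sampled_weight_matrix(u)
    W_200 = induced_graphon(A, u)
    print(abs(W_200(0.3, 0.7) - W(0.3, 0.7)))   # small for large n (cf. Lemma 2)
```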
We also need a few assumptions on the input and the IGN.

AS1. The graphon W is $A_1$-Lipschitz, i.e., $|W(u_2, v_2) - W(u_1, v_1)| \le A_1(|u_2 - u_1| + |v_2 - v_1|)$.

AS2. The filter coefficients $c_\gamma$ are upper bounded by $A_2$.

AS3. The graphon signal X is $A_3$-Lipschitz.

AS4. The activation functions in IGNs are normalized Lipschitz, i.e., $|\rho(x) - \rho(y)| \le |x - y|$ and $\rho(0) = 0$.

These four assumptions are quite natural and are also adopted in Ruiz et al. (2020). With AS1-4, we have the following key proposition. The proof leverages the stability of linear layers for k-IGN from Theorem 1; see Appendix C for details.

Proposition 2 (Stability of $\Phi_c$). If a cIGN $\Phi_c : \mathbb{R}^{[0,1]^2 \times d_{in}} \to \mathbb{R}^{d_{out}}$ satisfies AS2 and AS4, and $\|W_1 - W_2\|_{pn} \le \epsilon\mathbf{1}_2$, then $\|\Phi_c(W_1) - \Phi_c(W_2)\|_{pn} = \|\Phi_c(W_1) - \Phi_c(W_2)\|_{L_2} \le C(A_2)\,\epsilon$. The same statement still holds if we change the underlying norm of the partition-norm from $L_2$ to $L_\infty$.

Remark 2. The statements in Proposition 2 hold for the discrete IGN $\Phi_d$ as well.

From AS3 we can also bound the difference between the original signal X and the induced signals $X_n$ and $\tilde{X}_n$.

Lemma 1. Let $X \in \mathbb{R}^{[0,1] \times d}$ be an $A_3$-Lipschitz graphon signal satisfying AS3, and let $\tilde{X}_n$ and $X_n$ be the induced graphon signals as in Eqs. (4) and (5). Then i) $\|X - X_n\|_{pn}$ converges to 0, and ii) $\|X - \tilde{X}_n\|_{pn}$ converges to 0 in probability.

We have similar statements for W as well.

Lemma 2. If W satisfies AS1, then $\|W - W_n\|_{pn}$ converges to 0, and $\|W - \tilde{W}_n\|_{pn}$ converges to 0 in probability.

The following main theorem of this section (for k-cIGN of any order k) can be shown by combining Proposition 2 with Lemmas 1 and 2; see Appendix C for details.

Theorem 2 (Convergence of cIGN in the edge weight continuous model). Under the fixed sampling condition, IGN converges to cIGN, i.e., $\|\Phi_c([W, \mathrm{Diag}(X)]) - \Phi_c([W_n, \mathrm{Diag}(X_n)])\|_{L_2}$ converges to 0. An analogous statement holds for the random sampling setting, where $\|\Phi_c([W, \mathrm{Diag}(X)]) - \Phi_c([\tilde{W}_n, \mathrm{Diag}(\tilde{X}_n)])\|_{L_2}$ converges to 0 in probability.
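As a quick numerical illustration of Lemma 2 (and hence of the deterministic part of Theorem 2), the sketch below (ours, assuming the same Lipschitz graphon $W(u,v) = \frac{u+v+1}{4}$) approximates $\|W - W_n\|_{pn}$ on a fine grid and shows both components shrinking as n grows.

```python
import numpy as np

def W(u, v):
    return (u + v + 1.0) / 4.0

def Wn_fixed(u, v, n):
    """Induced step graphon for the fixed grid u_i = (i-1)/n: snap (u, v) down to the grid."""
    return W(np.floor(u * n) / n, np.floor(v * n) / n)

def pn_of_difference(n, m=2000):
    """Approximate ||W - W_n||_pn on an m x m discretization of [0,1]^2."""
    t = (np.arange(m) + 0.5) / m
    D = W(t[:, None], t[None, :]) - Wn_fixed(t[:, None], t[None, :], n)
    diag = np.sqrt(np.mean(np.diag(D) ** 2))      # normalized diagonal norm
    full = np.sqrt(np.mean(D ** 2))               # normalized matrix norm
    return diag, full

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, pn_of_difference(n))   # both components decay roughly like 1/n
```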
6. Convergence of IGN in the Edge Probability Discrete Model

In this section, we consider the convergence setup of Keriven et al. (2020), which we call the edge probability discrete model. The major differences from the edge weight continuous model of Ruiz et al. (2020) are that (1) we only access a 0-1 adjacency matrix instead of the full edge weights, and (2) the convergence error is measured in the graph space (instead of the graphon space). This model is more natural. However, we will first show a negative result that in general IGN does not converge in the edge probability discrete model, in Section 6.2. This motivates us to consider a relaxed setting where we estimate the edge probability from the data. With this extra assumption, we can prove the convergence of IGN-small, a subset of IGN, in the edge probability discrete model, in Section 6.3. Although this is not entirely satisfactory, we show that the family of functions that can be represented by IGN-small is nevertheless rich enough to, for example, approximate any spectral GNN arbitrarily well.

6.1. Setup: Edge Probability Discrete Model

We first state the setup and results of Keriven et al. (2020), keeping the notation close to the original paper for consistency. A random graph model (P, W, f) is represented by a probability distribution P, uniform over the latent space $\mathcal{U} = [0,1]$, a symmetric kernel $W: \mathcal{U}\times\mathcal{U} \to [0,1]$, and a bounded function (graph signal) $f : \mathcal{U} \to \mathbb{R}^{d_z}$. A random graph $G_n$ with n nodes is then generated from (P, W, f) according to latent variables $U := \{u_1, \dots, u_n\}$ as follows:
$$u_i \overset{iid}{\sim} P, \quad z_i = f(u_i), \quad a_{ij} \sim \mathrm{Ber}\big(\alpha_n W(u_i, u_j)\big) \ \text{ for } j < i \le n, \qquad (6)$$
where Ber is the Bernoulli distribution and $\alpha_n$ controls the sparsity of the sampled graph. Note that in our case we assume the sparsification factor $\alpha_n = 1$ (the classical graphon model). We define a degree function by $d_{W,P}(\cdot) := \int W(\cdot, u)\,dP(u)$. We assume the following:
$$\|W(\cdot, u)\|_{L_\infty} \le c_{\max}, \qquad d_{W,P}(u) \ge c_{\min}, \qquad W(\cdot, u) \text{ is } (c_{\mathrm{Lip.}}, n_{\mathcal{U}})\text{-piecewise Lipschitz}. \qquad (7)$$
We introduce two normalized sampling operators $S_U$ and $S_n$ that sample a continuous function to a discrete one over n points. For a function $W : \mathcal{U}^k \to \mathbb{R}^{d_{out}}$,
$$S_U W(i_1, \dots, i_k) := \Big(\tfrac{1}{\sqrt{n}}\Big)^{k}\, W(u_{(i_1)}, \dots, u_{(i_k)}),$$
where $u_{(i)}$ is the i-th smallest number among n uniform random samples over [0,1] and $i_1, \dots, i_k \in [n]$. Similarly,
$$S_n W(i_1, \dots, i_k) := \Big(\tfrac{1}{\sqrt{n}}\Big)^{k}\, W\!\Big(\tfrac{i_1}{n}, \dots, \tfrac{i_k}{n}\Big).$$
Note that the normalizing constant depends on the dimension of the support of W. We have $\|S_U W\|_2 \le \|W\|_{L_\infty}$ and $\|S_n W\|_2 \le \|W\|_{L_\infty}$.

To measure the convergence error, we consider the root mean square error at the node level: for a signal $x \in \mathbb{R}^{n^2 \times d_{out}}$ and latent variables U, we define
$$\mathrm{RMSE}_U(f, x) := \Big\|S_U f - \tfrac{x}{n}\Big\|_2 = \Big(n^{-2}\sum_{i=1}^{n}\sum_{j=1}^{n}\big\|f(u_i,u_j) - x(i,j)\big\|^2\Big)^{1/2}.$$
Again, there is a dependency on the input dimension: the normalization term $n^{-2}$ needs to be adjusted when the input order is different from 2.

6.2. Negative Result

Theorem 3. Given any graphon W with $c_{\max} < 1$ and any IGN architecture, there exists a set of parameters $\theta$ such that convergence of $\mathrm{IGN}_\theta$ to $\mathrm{cIGN}_\theta$ is not possible, i.e., $\mathrm{RMSE}_U\big(\Phi_c([W, \mathrm{Diag}(X)]),\ \Phi_d([A_n, \mathrm{Diag}(\tilde{x}_n)])\big)$ does not converge to 0 as $n \to \infty$, where $A_n$ is the 0-1 matrix generated according to Eq. (6), i.e., $A_n[i][j] = a_{ij}$.

The proof of Theorem 3 hinges on the fact that the input to IGN in the discrete case is a 0-1 matrix, while the input to cIGN in the continuous case has edge weights upper bounded by $c_{\max} < 1$. The margin between 1 and $c_{\max}$ makes it easy to construct counterexamples. See Appendix D.1 for details.

Theorem 3 states that we cannot expect every IGN to converge to its continuous version cIGN. As the proof crucially uses the fact that we can only access the 0-1 adjacency matrix, a natural question is: what if we can estimate the edge probability from the data? Interestingly, in this case we can obtain convergence for a subset of IGNs (which is still rich enough), called IGN-small.

6.3. Convergence of IGN-small

Let $\hat{W}_{n\times n}$ be the $n \times n$ edge probability matrix estimated from $A_n$, and let $\tilde{W}_n$ be the induced graphon defined in Eq. (5). To analyze the convergence error of a general IGN after edge probability estimation, we first decompose the error of interest using the triangle inequality. Assuming the output is a 1-tensor,
$$\mathrm{RMSE}_U\big(\Phi_c(W), \Phi_d(\hat{W}_{n\times n})\big) = \Big\|S_U\Phi_c(W) - \tfrac{1}{\sqrt{n}}\Phi_d(\hat{W}_{n\times n})\Big\|_2 \le \underbrace{\big\|S_U\Phi_c(W) - S_U\Phi_c(\tilde{W}_n)\big\|}_{\text{first term: discretization error}} + \underbrace{\big\|S_U\Phi_c(\tilde{W}_n) - \Phi_d \circ S_U(\tilde{W}_n)\big\|}_{\text{second term: sampling error}} + \underbrace{\big\|\Phi_d \circ S_U(\tilde{W}_n) - \tfrac{1}{\sqrt{n}}\Phi_d(\hat{W}_{n\times n})\big\|}_{\text{third term: estimation error}}.$$
The three terms measure the different sources of error. The first term is the discretization error, which can be controlled via a property of $S_U$ and Proposition 2. The second term is the sampling error arising from the randomness of U; this term vanishes if we consider only $S_n$ instead of $S_U$, under the extra condition stated below. The third term is the edge probability estimation error, which can be controlled by leveraging the existing statistical guarantees for the edge probability estimation algorithm of Zhang et al. (2015). (For better readability, here we only use W as the input instead of $[W, \mathrm{Diag}(X)]$; adding $\mathrm{Diag}(X)$ to the input is easy and is included in the full proof in Appendix D.2.)

Controlling the second term is more involved. This is also where we have to add an extra assumption to constrain the IGN space in order to achieve convergence after edge smoothing.

Figure 3: The convergence error for three generative models: (left) stochastic block model, (middle) smooth graphon, (right) piecewise smooth graphon. EW and EP stand for the edge weight continuous model (Eq. (4)) and the edge probability discrete model (Eq. (6)); the curves correspond to EW + fixed, EW + random, EP, and EP + edge smoothing. The three dashed lines in each figure indicate the decay rates $n^{-0.5}$, $n^{-1}$, and $n^{-2}$.
Definition 6 (IGN-small). Let $W_{n,E}$ be a graphon with chessboard pattern (see the full definition in Definition 10 in the Appendix), i.e., a piecewise-constant graphon where each block is of the same size. Similarly, define $X_{n,E}$ as the 1D analog. IGN-small denotes the subset of IGNs that satisfy $S_n\Phi_c([W_{n,E}, \mathrm{Diag}(X_{n,E})]) = \Phi_d \circ S_n([W_{n,E}, \mathrm{Diag}(X_{n,E})])$.

Theorem 4 (Convergence of IGN-small in the edge probability discrete model). Assume AS1-4, and let $\hat{W}_{n\times n}$ be the estimated edge probability matrix such that $\frac{1}{n}\|W_{n\times n} - \hat{W}_{n\times n}\|_2$ converges to 0 in probability. Let $\Phi_c$, $\Phi_d$ be continuous and discrete IGN-small. Then $\mathrm{RMSE}_U\big(\Phi_c([W, \mathrm{Diag}(X)]),\ \Phi_d([\hat{W}_{n\times n}, \mathrm{Diag}(\tilde{x}_n)])\big)$ converges to 0 in probability.

We leave the detailed proofs to Appendix D.2, with some discussion of the challenges in obtaining a full convergence result in Remark 4. We note that Theorem 4 has a practical implication: it suggests that, for a given unweighted graph (potentially sampled from some graphon), it may be beneficial to first perform edge probability estimation before feeding it into the general IGN framework, in order to improve the architecture's stability and convergence.

Finally, although the convergence of IGN-small is not entirely satisfactory, IGN-small contains an interesting class of functions that can approximate any spectral GNN arbitrarily well. See Appendix E for proof details.

Theorem 5. IGN-small can approximate spectral GNNs (both discrete and continuous) arbitrarily well on a compact domain in the $L_\infty$ sense.

7. Experiments

We experiment with 2-IGN on graphon models of increasing complexity: an Erdős-Rényi graph with p = 0.1, a stochastic block model with 2 blocks of equal size and probability matrix [[0.1, 0.25], [0.25, 0.4]], a Lipschitz graphon model with $W(u,v) = \frac{u+v+1}{4}$, and a piecewise Lipschitz graphon with $W(u,v) = \frac{u \% \frac{1}{3} + 1}{4}$, where % is the modulo operation. Similar to Keriven et al. (2020), we consider an untrained IGN with random weights to assess how convergence depends on the choice of architecture rather than on learning. We use a 5-layer IGN with hidden dimension 16. We take graphs of different sizes as input and plot the error in terms of the norm of the output difference. The results are plotted in Figure 3. See Appendix F for full details and results.

As suggested by Theorem 2, for both deterministic and random sampling, the error decreases as we increase the size of the sampled graph. Interestingly, if we take the 0-1 adjacency matrix as the input, the error does not decrease, which aligns with the negative result in Theorem 3. We further implement the edge smoothing algorithm (Eldridge et al., 2016) and find that after edge probability estimation the error again decreases, as implied by Theorem 4. We remark that although Theorem 4 applies only to IGN-small, our experiments for general 2-IGN with randomly initialized weights still show encouraging convergence results. Understanding the convergence of general IGN after edge smoothing is an important direction that we leave for future investigation.

8. Conclusion

In this paper, we investigate the convergence property of a powerful GNN, the Invariant Graph Network. We first prove a general stability result for linear layers in IGNs. We then prove a convergence result under the model of Ruiz et al. (2020) for both 2-IGN and higher-order k-IGN. Under the model of Keriven et al. (2020), we first show a negative result that in general the convergence of every IGN is not possible.
Nevertheless, we pinpoint the major roadblock and prove that if we preprocess the input graphs by edge smoothing (Zhang et al., 2015), the convergence of a subfamily of IGNs, called IGN-small, can be obtained. As an attempt to quantify the size of IGN-small, we also show that IGN-small contains a rich class of functions that can approximate any spectral GNN.

In the future, we would like to (1) further explore the expressive power of IGN-small and (2) investigate the convergence of general IGNs under the edge probability discrete model, or design variants that enjoy the convergence property while being equally powerful.

Acknowledgement

The authors would like to thank the anonymous reviewers for helpful comments. Chen Cai would like to thank Jinwoo Kim for helping out with illustrations, Haggai Maron for helpful discussion, and Hy Truong Son for providing the PyTorch implementation of IGN. This work is in part supported by the National Science Foundation under grants CCF-2112665 and IIS-2050360.

References

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322-332. PMLR, 2019.

Azizian, W. and Lelarge, M. Expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646, 2020.

Belkin, M., Sun, J., and Wang, Y. Discrete Laplace operator on meshed surfaces. In Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, pp. 278-287, 2008.

Belkin, M., Sun, J., and Wang, Y. Constructing Laplace operator from point clouds in R^d. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1031-1040. SIAM, 2009.

Bevilacqua, B., Frasca, F., Lim, D., Srinivasan, B., Cai, C., Balamurugan, G., Bronstein, M. M., and Maron, H. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

Cai, C. and Wang, Y. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29:3844-3852, 2016.

Dey, T. K., Ranjan, P., and Wang, Y. Convergence, stability, and discrete approximation of Laplace spectra. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 650-663. SIAM, 2010.

Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675-1685. PMLR, 2019.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Eldridge, J., Belkin, M., and Wang, Y. Graphons, mergeons, and so on! In Advances in Neural Information Processing Systems, pp. 2307-2315, 2016.

Finzi, M., Welling, M., and Wilson, A. G.
A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. ar Xiv preprint ar Xiv:2104.09459, 2021. Gama, F., Bruna, J., and Ribeiro, A. Stability properties of graph neural networks. IEEE Transactions on Signal Processing, 68:5680 5695, 2020. Garg, V., Jegelka, S., and Jaakkola, T. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pp. 3419 3430. PMLR, 2020. Geerts, F. The expressive power of kth-order invariant graph networks. ar Xiv preprint ar Xiv:2007.12035, 2020. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017. Holst, L. On the lengths of the pieces of a stick broken at random. Journal of Applied Probability, 17(3):623 634, 1980. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359 366, 1989. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. ar Xiv preprint ar Xiv:1806.07572, 2018. Convergence of Invariant Graph Networks Keriven, N. and Peyr e, G. Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems, 32:7092 7101, 2019. Keriven, N., Bietti, A., and Vaiter, S. Convergence and stability of graph convolutional networks on large random graphs. ar Xiv preprint ar Xiv:2006.01868, 2020. Keriven, N., Bietti, A., and Vaiter, S. On the universality of graph neural networks on large random graphs. ar Xiv preprint ar Xiv:2105.13099, 2021. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Kostrikov, I., Jiang, Z., Panozzo, D., Zorin, D., and Bruna, J. Surface networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2540 2548, 2018. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32: 8572 8583, 2019. Levie, R., Huang, W., Bucci, L., Bronstein, M., and Kutyniok, G. Transferability of spectral graph convolutional neural networks. Journal of Machine Learning Research, 22(272):1 59, 2021. Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI conference on artificial intelligence, 2018. Lu, Y., Zhong, A., Li, Q., and Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pp. 3276 3285. PMLR, 2018. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. ar Xiv preprint ar Xiv:1812.09902, 2018. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. ar Xiv preprint ar Xiv:1905.11136, 2019a. Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. In International conference on machine learning, pp. 4363 4371. PMLR, 2019b. Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. ar Xiv preprint ar Xiv:1905.10947, 2019. Pyke, R. Spacings. Journal of the Royal Statistical Society: Series B (Methodological), 27(3):395 436, 1965. 
R enyi, A. On the theory of order statistics. Acta Mathematica Academiae Scientiarum Hungarica, 4(3-4):191 231, 1953. Ruiz, L., Chamon, L., and Ribeiro, A. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33, 2020. Ruiz, L., Gama, F., and Ribeiro, A. Graph neural networks: Architectures, stability, and transferability. Proceedings of the IEEE, 109(5):660 682, 2021. Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, 62(3):352 364, 2020. Wardetzky, M. Convergence of the cotangent formula: An overview. Discrete differential geometry, pp. 275 286, 2008. Weinan, E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1 11, 2017. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4 24, 2020. Xu, G. Discrete laplace beltrami operators and their convergence. Computer aided geometric design, 21(8):767 784, 2004. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ar Xiv preprint ar Xiv:1810.00826, 2018. Zhang, Y., Levina, E., and Zhu, J. Estimating network edge probabilities by neighborhood smoothing. ar Xiv preprint ar Xiv:1509.08588, 2015. Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. AI Open, 1:57 81, 2020. Zhou, K., Huang, X., Zha, D., Chen, R., Li, L., Choi, S.- H., and Hu, X. Dirichlet energy constrained learning for deep graph neural networks. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Convergence of Invariant Graph Networks Table 2: Linear equivariant maps for Rn Rn n and R[0,1] R[0,1]2. Operations Discrete Continuous Partitions 1-3: Replicate to diagonal/rows/columns T (A) = Diag(A) T (A)i,j = Ai T (A)i,j = Aj T (W )(u, v) = Iu=v W (u) T (W )(u, v) = W (u) T (W )(u, v) = W (v) {{1,2,3}} {{1,3},{2}} {{1,2},{3}} 4-5: Replicate mean to diagonal/all matrix T (A)i,i = 1 n A1 T (A)i,j = 1 n A1 T (W )(u, v) = Iu=v R W (u)du T (W )(u, v) = R W (u)du {{1},{2,3}} {{1},{2},{3}} Table 3: Linear equivariant maps for Rn n Rn and R[0,1]2 R[0,1]. Operations Discrete Continuous Partitions 1-3: Replicate diagonal/row mean/ columns mean T (A) = Diag (A) T (A) = 1 n A1 T (A) = 1 T (W )(u) = W (u, u) T (W )(u) = R W (u, v)dv T (W )(u) = R W (u, v)du {{1,2,3}} {{1,2},{3}} {{1,3},{2}} 4-5: Replicate mean of all elements/ mean of diagonal T (A)i = 1 n2 1T A1 n 1T Diag(Diag (A))1 T (W )(u) = R W (u, v)dudv T (W )(u) = R Iu,v W (u, v)dudv {{1},{2},{3}} {{1,2},{3}} We list the all LE maps for Rn Rn n and Rn n Rn in Table 2 and Table 3 respectively. We also summarize the notations used throughout the paper in Table 4. B. Missing Proofs from Section 4 B.1. Extension of Partition-norm There are three ways of extending Partition-norm 1) extend the definition of partition-norm to multiple channels 2) changing the underlying norm from L2 norm to L norm, and 3) extend Partition-norm defined for 2-tensor to k-tensor. First recall the definition partition-norm. Definition 3 (Partition-norm). The partition-norm of 2-tensor A Rn2 is defined as A pn := ( Diag (A) 2 n , A 2 n ). The continuous analog of the partition-norm for graphon W W is defined as W pn = q R W 2(u, u)du, q RR W 2(u, v)dudv . 
We refer to the first term as the normalized diagonal norm and the second term as the normalized matrix norm. Furthermore, we define operations like addition/comparasion on the partition-norm simply as component-wise operations. For example, A pn B pn if each of the two terms of A is at most the corresponding term of B. To extend partition-norm to signal A Rn2 d of multiple channels, we denote A = [A ,1 Rn2 1, ..., A ,d Rn2 1] where [ , ] is the concatenation along channels. A pn := Pd i=1 A ,i pn. As pn is defined by both continous and discrete case, we can extending pn both for multi-channel signal both for graphs and graphons. Another way of generalizing Partition-norm is to change the L2 to L norm. We denote the resulting norm as pn . For W W, W pn := (maxu [0,1] W(u, u), maxu [0,1],v [0,1] W(u, v)). The discrete case and high order tensor case can be defined similarily as the L2 case. The last way of extending Partition-norm to k-tensor X Rnk 1 is to define the norm for each slice of X, i.e., X pn := (( 1 n)|γ1| Xγ1 2, ..., 1 n)|γbell(k)| Xγbell(k) 2) where γ Γk. Note how we order (γ1, ..., γbell(k)) can be arbitrary as long as the order is used consistent. Convergence of Invariant Graph Networks Table 4: Summary of important notations. Symbol Meaning 1n all-one vector of size n 1 2/ L2 2-norm for matrix/ graphon / L infnity-norm for matrix/graphon [ , ] Given A Rnk d1, B Rnk d2, [A, B] is the concatenation of A and B along feature dimension. [A, B] Rnk (d1+d2). W : [0, 1]2 [0, 1] graphon X R[0,1] d 1D signal W space of graphons pn partition-norm. When the underlying norm is L norm, we also use pn . I indicator function I interval SGNN spectral graph neuarl networks, defined in Equation (12) LEℓ,m linear equivariant maps from ℓ-tensor to m-tensor Notations related to sampling Wn Induced piecewise constant graphon from fixed grid g Wn Induced piecewise constant graphon from random grid Induced piecewise constant graphon from random grid, but resize the all individual blocks to be of equal size (also called chessboard graphon in the paper). Wn,E(Ii Ij) := W(u(i), u(j)) Wn n n n matrix sampled from W; Wn n(i, j) = W(ui, uj) c Wn n Rn n the estimated edge probablity from graphs samplied according to edge probability discrete model from Zhang et al. (2015) f xn Rn d sampled singal [f xn]i := X(ui) Xn induced 1D piecewise graphon signal from fixed grid f Xn induced 1D piecewise graphon signal from random grid SU normalized sampling operator for random grid. SUf(i, j) = 1 n(f(u(i)), f(u(j)) Sn normalized sampling operator for fixed grid. Snf(i, j) = 1 RMSEU(x, f) n 1 Pn i=1 xi f (ui) 2 1/2 for 1D signal; n 2 P j xi,j f (ui, uj) 2 1/2 for 2D case αn a parameter that controls the sparsity of sample graphs. Set to be 1 in the paer. Notations related to IGN bell(k) Bell number: number of partitions of [k]. bell(2) = 2, bell(3) = 5, bell(4) = 15, bell(5) = 52... Γk space of all partitions of [k] Ik the space of indices. Ik := {(i1, ..., ik)|i1 [n], ..., ik [n]}. Elements of Ik is denoted as a γ [k] partition of [k]. For example {{1, 2}, {3}} is a partion of [3]. The totoal number of partitions of [k] is bell(k). a γ a satisfies the equivalence pattern of γ. For example, (x, x, y) {{1, 2}, {3}} where x, y, z [n]. γ < β given two partitions γ, β Γk, γ < β if γ is finer than β. For example, {1, 2, 3} < {{1, 2}, {3}}. Bγ l + m tensor; tensor representation of LEl,m maps. 
we differentiate Tγ (operators) from Bγ (tensor representation of operators) B a basis of the space of linear equivariant operations from ℓ-tensor to m-tensor. B = {Tγ|γ Γl+k} Tc/Td linear equivariant layers for graphon (continous) and graphs (discrete) Φc/Φd IGN for graphon (continous) and graphs (discrete) L(i) i-th linear equivariant layer of IGN L normalized graph Laplacian Ti basis element of the space of linear equivariant maps; sometimes also written as Tγ. Convergence of Invariant Graph Networks B.2. Proof of stability of linear layer for 2-IGN Proposition 1. For all LE operators Ti : Rn2 Rn2 of discrete 2-IGN listed in Table 1, Ti(A) pn A pn for any A Rn2. Similar statements hold for Ti : Rn Rn2 and Ti : Rn2 Rn in Tables 2 and 3 in Appendix A. In the case of continuous 2-c IGN, the stability also holds. Proof. The statements hold in both discrete and continuous cases. Without loss of generality, we only prove the continuous case by going over all linear equivariant maps R[0,1]2 R[0,1]2 in Table 1. 1-3: It is easy to see that the partition-norm does not increase for all three cases. 4-6: It is enough to prove case 4 only. For diagonal norm Diag(T(W)) 2 L2 = R ( R W(u, v)dv)2du RR W 2(u, v)dudv = ϵ. For matrix norm: T(W) 2 L2 = Diag(T(W)) L2 ϵ. Therefore the statement holds for this linear equivariant operation. 7-9: same as case 4-6. 10-11: It is enough to prove the first case: average of all elements replicated on the whole matrix. The diagonal norm is the same as the matrix norm. Both norms are decreasing so we are done. 12-13: It is enough to prove only case 12. Since diagonal norm is equal to matrix norm, and diagonal norm is decreasing by Jensen s inequality we are done. 14-15: Since matrix norm is the same as diagonal norm, which stays the same so we are done. As shown in all cases for any W W with W pn < (ϵ, ϵ), Ti(W) pn < (ϵ, ϵ). Therefore we finish the proof for R[0,1]2 R[0,1]2. We next go over all linear equivariant maps R[0,1] R[0,1]2 in Table 2 and prove it case by case. 1-3: It is enough to prove the second case. It is easy to see diagonal norm is preserved and T(W) 2 = W 2 ϵ. Therefore T(W) pn (ϵ, ϵ). 4-5: It is enough to prove the second case. Norm on diagonal is no larger than W by Jensen s inequality. The matrix norm is the same as the diagonal norm therefore also no large than ϵ. Therefore T(W) pn (ϵ, ϵ). Last, we prove the cases for R[0,1]2 R[0,1]. For cases 1-3, it is enough to prove case 2. Since the norm of the output is no large than the matrix norm of input by Jensen s inequality, we are done. Similar reasoning applies to cases 4-5 as well. B.3. Proof of Theorem 1 We need a few definitions and lemmas first. Definition 7 (axis of a tensor). Given a k-tensor X Rnk 1 indexed by (name1, ..., namek). The axis of X, denoted as ax(X), is defined to be ax(X) := (name1, ..., namek). As an example, the aixs of the first grey sub-tensor in Figure 5a is {{1, 2}, {3}}. Definition 8 (partial order of partitions). Given two paritions of [k], denoted as γ = {γ1, ..., γd1} and β = {β1, ..., βd2}, we say γ is finer than β, denoted as γ < β, if and only if 1) γ = β and 2) for any βj β, there exists γi γ such that βj γi. For example, {{1, 2, 3}} is finer than {{1, 2}, {3}} but {{1, 2}, {3}} is not comparable with {{1, 3}, {2}}. Note that space of partitions forms a hasse diagram under the partial order defined above (each set of elements has a least upper bound and a greatest lower bound, so that it forms a lattice). See Figure 4 for an example. 
Convergence of Invariant Graph Networks {{1, 2, 3}} {{1, 2}, {3}} {{1, 3}, {2}} {{2, 3}, {1}} {{1}, {2}, {3}} Figure 4: Space of partitions forms a hasse diagram under the partial order defined in Definition 8. Definition 9 (average a k-tensor X over Π). Let X Rnk 1 be a k-tensor indexed by {{1}, ..., {k}}. Without loss of generality, let Π = {{1}, ..., {d}}. Denote the resulting (k d)-tensor X , indexed by {{d + 1}, ..., {k}}. By averaging X over Π, we mean t Id X(t, ). The definition can be extended to R[0,1]k by replacing average with integral. Lemma 3 (properties of partition-norm). We list some properties of partition-norm. Although all lemmas are stated in the discrete case, the continous version also holds. The statements also holds for pn as well. 1. Let X Rnk 1 be a k-tensor and denote one of its slices X Rnk 1 with k k. If X pn ϵ1bell(k), then X pn ϵ1bell(k ). 2. Let k < k. Let X Rnk 1 be a k-tensor and X Rnk 1 be the resulting k -tensor after averaging over k k axis of X. If X pn ϵ1bell(k), then X pn ϵ1bell(k ). 3. Let k > k. Let X Rnk 1 be a k-tensor and X be the resulting k -tensor after replicating X over k k axis of X . If X pn ϵ1bell(k), then X pn ϵ1bell(k ). 4. Let k < k and X Rnk 1 be a k-tensor and γ Γk be one parition of [k]. Let X has only one non-zero slice Xγ of order k , i.e., if a Ik, X(a) = 0 implies a γ. If Xγ pn ϵ1bell(k ), then X pn ϵ1bell(k). Proof. We prove statements one by one. Note that although the proof is done for L2 norm, we do not make use of any specific property of L2 norm and the same proof can be applied to L as well. Therefore all statements in the lemma apply to pn as well. 1. By the definition of partition-norm and slice in Definition 5, we know that any slice of X is also a slice of X, therefore any component of X pn will be upper bounded by ϵ, which concludes the proof. 2. Without loss of generality, we can assume that k = k 1 and the axis of X that is averaged over is axis {1}. To bound X pn, we need to bound the normalized norm of any slice of X . Let X γ be arbitrary slice of X . Since X is obtained by averaging over axis 1 of X, we know that X γ is the obtained by averaging over axis of 1 of Xγ, a slice of X, where γ := γ {{1}}. Since X pn ϵ1bell(k), we know that ( 1 n)|γ| Xγ ϵ. By Jensen s inequality, we have ( 1 n)|γ | X γ ( 1 n)|γ| Xγ , and therefore ( 1 n)|γ | X γ ϵ. Since ( 1 n)|γ | X γ ϵ holds for arbitrary slice of X , we conclude that X pn ϵ1bell(k ). The proof above only handles the case of k = k 1. The general case where k k > 1 can be handled by evoking the proof above multiple times for different reduction axis. 3. We assume X is indexed by ({1}, ..., {k}) and X is indexed by ({1}, ..., {k + 1}). Just as the last case, without loss of generality we assume that X is obtained by replicating X over 1 new axis, denoted as {k + 1}. In other words, ax(X ) = ax(X) {{k + 1}}. Convergence of Invariant Graph Networks To control X pn, we need to bound ( 1 n)|γ| X γ where γ Γk+1. Since X is obtained from X by replicating it over {k +1}, ( 1 n)|γ| X γ = ( 1 n)|β| Xβ where β = γ|[k]. As X pn ϵ1bell(k), it implies that ( 1 n)|γ| X γ ϵ holds for any γ Γk . Therefore we conclude that X pn ϵ1bell(k ). 4. To bound X pn, we need to bound the normalized norm of any slice of X. Let Xβ be arbitrarily slice of X where β Γk. Since γ and β are partitions of [k], there exist partitions that are finer than both β and γ, where the notion of finer between two partitions is defined in Definition 8. 
Among all partitions that satisfy such conditions, denote the most coarse one as α Γk. This can be done because the Γk is finite. Note that |α| < |β| and |α| < |γ|. Since Xα is a slice of Xγ and Xγ pn ϵ1bell(k ), ( 1 n)|α| Xα ϵ according to Lemma 3.1. As Xα is the slice of Xβ (implies Xα Xβ ) and α is the most coarse partition that is finer than β and γ (implies Xα Xβ ), we have Xβ = Xα . This implies ( 1 n)|β| Xβ ( 1 n)|α| Xα ϵ. As ( 1 n)k Xβ ϵ holds for arbitrary slice β of X, we conclue that X pn ϵ1bell(k). Now we are ready to prove the main theorem. Theorem 1 (Stability of LE layers for k-IGN). Let Tγ : R[0,1]ℓ R[0,1]m be a basis element of the space of LEℓ,m maps where γ Γℓ+m. If X pn ϵ1bell(ℓ), then the partition-norm of Y := Tγ(X) satisfies Y pn ϵ1bell(m) for all γ Γℓ+m. Proof. Without loss of generality, we first consider discrete cases of mapping from X Rnℓto Y Rnm. In general, each element Tγ of linear permutation equivariant basis can be identified with the following operation on input/output tensors. Given input X, (step 1) obtain its subtensor Xγ X on a certain Π1 (selection axis), (step 2) average Xγ over Π2 (reduction axis), resulting in Xγ,reduction. (step 3) Align Xγ,reduction on Π3 (alignment axis) with Yγ and (step 4) replicate Yγ along Π4 (replication axis), resulting Yγ,replication, a slice of Y . Entries of Y outside Yγ,replication will be set to be 0. In general, Πi can be read off from S1-S3. Π1-Π4 corresponds to different axis of input/output tensor and can be read off from different parts of Sγ = S1 S2 S3. Note such operation can be naturally extended to the continuous case, as done in Tables 1, 2 and 3 for 2-IGN. We next give detailed explnations of each step. Figure 5: Five slices of a 3-tensor, corresponding to bell(3) = 5 paritions of [3]. From left to right: a) {{1, 2}, {3}} b) {{1}, {2, 3}} c) {{1, 3}, {2}} d) {{1}, {2}, {3}} e) {{1, 2, 3}}. First step (X Xγ): select Xγ from X via Π1. Π1 corresponds to S|[ℓ] := {s [l] | s S and s [l] = }. It specifies the what parts (such as diagonal part or off-diagonal part for 2-tensor) of the input ℓ-tensor is under consideration. We denote the resulting subtensor as Xγ. See Definition 5 for formal definition. As an example in Equation (3), Π1 corresponds to {{1, 2}, {3}}, meaing we select a 2-tensor with axises {1, 2} and {3}. Note that the cardinality |S|[ℓ]| = |(S1 S2)|[ℓ]| l encodes the order of Xγ. Convergence of Invariant Graph Networks Second step (Xγ Xγ,reduction): average of Xγ over Π2. Π2 corresponds to S1 S|[ℓ], which tells us along what axis to average over Xγ. It will reduce the tensor Xγ of order |S1| + |S2|, indexed by S|[ℓ], to a tensor of order |S|[ℓ]| |S1| = |S2|, indexed by S2|[l]. In the example of Figure 6, this corresponds to averaging over axis {{1, 2}} , reducing 2-tensor (indexed by axis {1, 2} and {3}) to 1-tensor (indexed by axis {3}). The normalization factor in the discrete case is n|S1|. We denote the tensor after reduction as Xγ,reduction. As the second step performs tensor order reduction, we end up with a tensor Xγ,reduction of order |S2|. Last two steps specify how to fill in the output tensor Y with Xγ,reduction. To fill in Y , we will first align Xγ,reduction with Yγ, a subtensor of Y , on Π3 and then replicate Yγ on Π4, resulting in Yγ,replication, a sub-tensor of Y . Third step (Xγ,reduction Yγ): align Xγ,reduction with Yγ. To fill in Yγ, we need to specify how the resuting |S2|-tensor Xγ,reduction is aligned with the |S2|-subtensor of Yγ. 
After all, there are many ways of selecting a |S2|-tensor from Y , which is indexed by {{l}, {l + 1}, ..., {ℓ+ m}}. This is specified by the third step. Let Yγ be the |S2|-tensor. We next define the precise relationship between Xγ,reduction and Yγ. Xγ,reduction is indexed by S2|[l] while Yγ is indexed by S2|l+[m] and defined to be Yγ( ) = Xγ,reduction( ). In the example of Figure 6, Xγ,reduction is a 1D tensor indexed by {3} and Yγ (the grey cuboid on the right cube of Figure 6) is indexed by {6}. Fourth step (Yγ Yγ,replication): replicating Yγ over Π4. Π4 denotes the S3 and specifies to what axis we will replicate the |S2|-tensor Yγ over. Recall that Yγ is indexed by S2|l+[m]. Let the resulting tensor be Yγ,replication, which is a subtensor of Y Rnl. Yγ,replication is defined to be indexed by (S2 S3)|l+[m]. Without loss of generality, let the first |S2| component are indexed by S2|l+[m] and the rest components are indexed by S3|l+[m]. The mathematical definition of the fourth step is then Yγ,replication( , t) := Yγ( ) for all t [n]|S3|. Note that the order of Yγ,replication can be smaller than order of Y . The example in Equation (3) has S3 = {{4}, {5}}, which means that we will replicate the 1-tensor along axis {4} and {5}. Note that in general, we do not have to fill in the whole m-tensor (think about copy row average to diagonal in Table 1). {{1,2},{3,6},{4},{5}} Figure 6: An illustration of the one linear equivariant basis from Rn3 Rn3. The partition is {{1, 2}, {3, 6}, {4}, {5}}. It selects area spanned by axis {1, 2} and {3} (grey shaded), average over the (red) axis {1, 2}, and then align the resulting 1D slice with axis {6} in the output tensor, and finally replicate the slices along axis {4} and {5} to fill in the whole cube on the right. After the interpretation of general linear equivariant maps in k-IGN, We now show that if X pn ϵ1bell(ℓ), then Tγ(X) ϵ1bell(m) holds for all γ. This can be done easily with the use of Lemma 3. For any partition of [ℓ+ m] γ, according to the first step we are mainly concerned about the Xγ pn instead of X pn. Since Xγ is a slice of X, then if X pn ϵ1bell(ord(X)), by Lemma 3.1, then Xγ pn ϵ1bell(|S1|+|S2|). According to the second step and Lemma 3.2, we can also conclude that Xγ,reduction pn ϵ1bell(|S2|). For the third step of align Xγ,reduction with Yγ, it is quite obvious that Yγ pn = Xγ,reduction pn ϵ1bell(|S2|). For the fourth step of replicating Yγ over Π4 to get Yγ,replication, by Lemma 3.3, we have Yγ,replication pn ϵ1bell(|S2|+|S3|). Lastly, we envoke Lemma 3.4 to get Y pn ϵ1bell(m), which concludes our proof. Convergence of Invariant Graph Networks C. Missing Proofs from Section 5 (Edge Weight Continuous Model) First we need a lemma on the distrbution of gaps between n uniform sampled points on [0, 1]. Lemma 4. Let u(i) be n points unformly sampled on [0, 1], sorted from small to large with u(0) = 0 and u(n+1) = 1. Let Di = u(i) u(i 1). All Dis have same distribution, which is Beta(1, n 1). In particular, E(Di) = 1 n, E(D2 i ) = 2 n(n+1), E(D2 i ) = 6 n(n+1)(n+2). Proof. By a symmetry argument, it is easy to see that all intervals follow the same distribution. For the first interval, the probability all the n 1 points are above x is (1 x)n 1 so the density of the length of the first (and so each) interval is (n 1)(1 x)1 x. This is a Beta distribution with parameters α = 1 and β = n 1. The expectation of higher moments follows easily. 
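As a quick numerical sanity check of Lemma 4 (ours, not part of the paper), the following Monte Carlo sketch compares the empirical moments of a spacing against the stated Beta$(1, n-1)$ moments; in the simulation we draw $n-1$ interior uniform points together with the endpoints 0 and 1, which yields exactly $n$ spacings with that marginal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000

# n - 1 interior uniform points plus the endpoints 0 and 1 give n spacings;
# each spacing has the Beta(1, n - 1) marginal used in Lemma 4.
u = np.sort(rng.uniform(size=(trials, n - 1)), axis=1)
pts = np.concatenate([np.zeros((trials, 1)), u, np.ones((trials, 1))], axis=1)
D = np.diff(pts, axis=1)                      # shape (trials, n)

for k, moment in [(1, 1 / n),
                  (2, 2 / (n * (n + 1))),
                  (3, 6 / (n * (n + 1) * (n + 2)))]:
    print(k, (D[:, 0] ** k).mean(), moment)   # empirical vs. Lemma 4
```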
Note that although the intervals are identically distributed, they are not independently distributed, since their sum is 1.

Lemma 1. Let $X \in \mathbb{R}^{[0,1] \times d}$ be an $A_3$-Lipschitz graphon signal satisfying AS3, and let $\widetilde{X}_n$ and $X_n$ be the induced graphon signals as in Eqs. (4) and (5). Then i) $\|X - X_n\|_{pn}$ converges to 0 and ii) $\|X - \widetilde{X}_n\|_{pn}$ converges to 0 in probability.

Proof. We first bound $\|X - X_n\|_{L^2[0,1]}$ and $\|X - \widetilde{X}_n\|_{L^2[0,1]}$. For the first case, partitioning the unit interval as $I_i = [(i-1)/n, i/n]$ for $1 \le i \le n$ (the same partition used to obtain $x_n$, and thus $X_n$, from $X$), we can use the Lipschitz property of $X$ to derive
$\|X - X_n\|_{L^2(I_i)}^2 \le A_3^2 \int_0^{1/n} u^2\,du = \frac{A_3^2}{3n^3}.$
We can then write $\|X - X_n\|_{L^2([0,1])}^2 = \sum_i \|X - X_n\|_{L^2(I_i)}^2 \le \frac{A_3^2}{3n^2}$, which implies $\|X - X_n\|_{L^2([0,1])} \le \frac{A_3}{\sqrt{3}\,n}$.

For the second case, since $\|X - \widetilde{X}_n\|_{L^2([0,1])}^2 = \sum_i \|X - \widetilde{X}_n\|_{L^2(I_i)}^2$, we bound each term:
$\|X - \widetilde{X}_n\|_{L^2(I_i)}^2 \le A_3^2 \int_0^{D_i} u^2\,du = \frac{A_3^2 D_i^3}{3}, \quad\text{and therefore}\quad \|X - \widetilde{X}_n\|_{L^2([0,1])}^2 \le \frac{A_3^2}{3}\sum_i D_i^3,$
where $D_i$ stands for the length of $I_i$, which is a random variable due to the random sampling. According to Lemma 4, all $D_i$ are identically distributed and follow the Beta distribution $B(1, n-1)$, with $\mathbb{E}(D_i^3) = \frac{6}{n(n+1)(n+2)}$. Since by Jensen's inequality $\mathbb{E}(\sqrt{Y}) \le \sqrt{\mathbb{E}(Y)}$ holds for any positive random variable $Y$, we get $\mathbb{E}\big(\sqrt{\sum_i D_i^3}\big) \le \sqrt{\sum_i \mathbb{E}(D_i^3)} = \Theta(\frac{1}{n})$. Using Markov's inequality, we can then upper bound
$\mathbb{P}\big(\|X - \widetilde{X}_n\|_{L^2([0,1])} \ge \epsilon\big) \le \mathbb{P}\Big(\sqrt{\tfrac{A_3^2}{3}\sum_i D_i^3} \ge \epsilon\Big) \le \frac{\mathbb{E}\big(\sqrt{\tfrac{A_3^2}{3}\sum_i D_i^3}\big)}{\epsilon} = \Theta\Big(\frac{1}{n\epsilon}\Big).$
Since $\mathbb{P}(\|X - \widetilde{X}_n\|_{L^2([0,1])} \ge \epsilon)$ goes to 0 as $n$ increases, we conclude that $\|X - \widetilde{X}_n\|_{pn}$ converges to 0 in probability.

Lemma 2. If $W$ satisfies AS1, then $\|W - W_n\|_{pn}$ converges to 0 and $\|W - \widetilde{W}_n\|_{pn}$ converges to 0 in probability.

Proof. For the first case, partitioning the unit interval as $I_i = [(i-1)/n, i/n]$ for $1 \le i \le n$, we can use the graphon's Lipschitz property to derive
$\|W - W_n\|_{L^1(I_i \times I_j)} \le A_1 \int_{I_j}\!\int_0^{1/n} |u|\,du\,dv + A_1 \int_{I_i}\!\int_0^{1/n} |v|\,dv\,du = \frac{A_1}{n^3}.$
We can then write $\|W - W_n\|_{L^1([0,1]^2)} = \sum_{i,j} \|W - W_n\|_{L^1(I_i \times I_j)} \le n^2 \cdot \frac{A_1}{n^3} = \frac{A_1}{n}$, which, since $W - W_n : [0,1]^2 \to [-1,1]$, implies $\|W - W_n\|_{L^2([0,1]^2)} \le \sqrt{\|W - W_n\|_{L^1([0,1]^2)}} \le \sqrt{\frac{A_1}{n}}$. The second-to-last inequality holds because all entries of $W - W_n$ lie in $[-1,1]$. Similarly, $\|\mathrm{Diag}(W - W_n)\|_{L^2[0,1]} \le \sqrt{\|\mathrm{Diag}(W - W_n)\|_{L^1[0,1]}} \le \sqrt{2n A_1 \int_0^{1/n} u\,du} = \sqrt{\frac{A_1}{n}}$. Therefore we conclude the first part of the proof.

For the second case, the diagonal norm is handled similarly to the proof of Lemma 1, so we only focus on $\|W - \widetilde{W}_n\|_{L^2([0,1]^2)}$. Since $W - \widetilde{W}_n : [0,1]^2 \to [-1,1]$,
$\|W - \widetilde{W}_n\|_{L^2([0,1]^2)}^2 \le \|W - \widetilde{W}_n\|_{L^1([0,1]^2)} = \sum_{i,j} \|W - \widetilde{W}_n\|_{L^1(I_i \times I_j)},$
where
$\|W - \widetilde{W}_n\|_{L^1(I_i \times I_j)} \le A_1 \int_{I_j}\!\int_{I_i} |u|\,du\,dv + A_1 \int_{I_i}\!\int_{I_j} |v|\,dv\,du = \frac{A_1}{2}\,(D_i^2 D_j + D_j^2 D_i),$
so that
$\|W - \widetilde{W}_n\|_{L^2([0,1]^2)} \le \sqrt{\|W - \widetilde{W}_n\|_{L^1([0,1]^2)}} = \sqrt{\sum_{i,j} \frac{A_1}{2}\,(D_j D_i^2 + D_i D_j^2)} = \sqrt{A_1 \sum_i D_i^2}, \qquad (10)$
where we use $\sum_i D_i = 1$ for the last equality. Since by Jensen's inequality $\mathbb{E}(\sqrt{Y}) \le \sqrt{\mathbb{E}(Y)}$ for any positive random variable $Y$, $\mathbb{E}\big(\sqrt{\sum_i D_i^2}\big) = \Theta(\frac{1}{n})$ since $\mathbb{E}(D_i^2) = \Theta(\frac{1}{n^2})$ by Lemma 4. By Markov's inequality, we then bound
$\mathbb{P}\big(\|W - \widetilde{W}_n\|_{L^2([0,1]^2)} > \epsilon\big) \le \mathbb{P}\big(\sqrt{\|W - \widetilde{W}_n\|_{L^1([0,1]^2)}} > \epsilon\big) \le \frac{\mathbb{E}\big(\sqrt{A_1\sum_i D_i^2}\big)}{\epsilon} = \Theta\Big(\frac{1}{n\epsilon}\Big).$
Therefore, we conclude that both $\|W - W_n\|_{pn}$ and $\|W - \widetilde{W}_n\|_{pn}$ converge to 0, the latter in probability.
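The deterministic bounds of Lemmas 1 and 2 are easy to check numerically. The sketch below (ours) measures the $L^1$ and $L^2$ discretization error between an assumed Lipschitz graphon and its piecewise-constant approximation on a regular $n \times n$ partition; the specific graphon and the convention of sampling at the lower-left corner of each cell are our assumptions, standing in for Eqs. (4) and (5). Since these are only upper bounds, the observed decay may be faster than the stated rates.

```python
import numpy as np

# Hypothetical Lipschitz graphon, chosen only for illustration.
W = lambda u, v: 0.5 * (1.0 + np.cos(np.pi * (u - v)))

def discretization_error(n, m=1000):
    """L1 and L2 distance between W and the step graphon that is constant on
    each cell I_i x I_j, taking the value of W at the cell's lower-left
    corner; both norms are estimated on a fine m x m evaluation grid."""
    g = (np.arange(m) + 0.5) / m
    U, V = np.meshgrid(g, g, indexing="ij")
    Wn = W(np.floor(U * n) / n, np.floor(V * n) / n)   # piecewise constant
    diff = np.abs(W(U, V) - Wn)
    return diff.mean(), np.sqrt((diff ** 2).mean())

for n in (10, 20, 40, 80):
    print(n, discretization_error(n))   # both errors shrink as n grows
```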
Proposition 2 (Stability of $\Phi_c$). If a c-IGN $\Phi_c : \mathbb{R}^{[0,1]^2 \times d_{in}} \to \mathbb{R}^{d_{out}}$ satisfies AS2 and AS4 and $\|W_1 - W_2\|_{pn} \le \epsilon \mathbf{1}_2$, then $\|\Phi_c(W_1) - \Phi_c(W_2)\|_{pn} = \|\Phi_c(W_1) - \Phi_c(W_2)\|_{L^2} \le C(A_2)\,\epsilon$. The same statement still holds if we change the underlying norm of the partition-norm from $L^2$ to $L^\infty$.

Proof. Without loss of generality, it suffices to prove the statement for 2-IGN, as the proof for $k$-IGN is the same with a slightly different constant. Since we have proved the stability of every linear layer of IGN in Theorem 1, a general linear layer $T$ is just a linear combination of the individual basis elements, i.e., $T = \sum_\gamma c_\gamma T_\gamma$ with $|c_\gamma| \le A_2$ for all $\gamma$ according to AS2. Without loss of generality, we can assume $T(X)$ is of order 2 and have
$\|T(W_1) - T(W_2)\|_{pn} = \Big\|\sum_\gamma c_\gamma T_\gamma(W_1 - W_2)\Big\|_{pn} \le \sum_\gamma |c_\gamma|\,\|T_\gamma(W_1 - W_2)\|_{pn} \le \Big(\sum_\gamma |c_\gamma|\,\epsilon,\ \sum_\gamma |c_\gamma|\,\epsilon\Big) \le (15 A_2 \epsilon,\ 15 A_2 \epsilon).$
To extend the result to the nonlinear layers, note that AS4 ensures that the norm does not increase after passing through a nonlinear layer. Therefore
$\|\sigma \circ T(X) - \sigma \circ T(Y)\|_{pn} \le \|T(X) - T(Y)\|_{pn} = \|T(X - Y)\|_{pn} \le 15 A_2 \|X - Y\|_{pn}.$
Repeating this process across layers, we finish the proof of the $L^2$ case. The extension to $L^\infty$ is similar to the $L^2$ case; the main modification is to change the definition of the partition-norm from the $L^2$ norm on the different slices (corresponding to different partitions of $[\ell]$, where $\ell$ is the order of the input) to the $L^\infty$ norm. The extension to the case where the input and output tensors are of orders $\ell$ and $m$ is also straightforward according to Theorem 1.

Theorem 2 (Convergence of c-IGN in the edge weight continuous model). Under the fixed sampling condition, IGN converges to c-IGN, i.e., $\|\Phi_c([W, \mathrm{Diag}(X)]) - \Phi_c([W_n, \mathrm{Diag}(X_n)])\|_{L^2}$ converges to 0. An analogous statement holds for the random sampling setting, where $\|\Phi_c([W, \mathrm{Diag}(X)]) - \Phi_c([\widetilde{W}_n, \mathrm{Diag}(\widetilde{X}_n)])\|_{L^2}$ converges to 0 in probability.

Proof. By Proposition 2, it suffices to prove that $\|[W, \mathrm{Diag}(X)] - [W_n, \mathrm{Diag}(X_n)]\|_{pn}$ and $\|[W, \mathrm{Diag}(X)] - [\widetilde{W}_n, \mathrm{Diag}(\widetilde{X}_n)]\|_{pn}$ go to 0. The former is upper bounded by $(\Theta(\frac{1}{n^{1.5}}), \Theta(\frac{1}{n^{1.5}}))$ according to Lemmas 1 and 2, which decreases to 0 as $n$ increases. Therefore we finish the proof of convergence for the deterministic case. For the random sampling case, by Lemmas 1 and 2, both $\|W - \widetilde{W}_n\|_{L^2([0,1]^2)}$ and $\|X - \widetilde{X}_n\|_{L^2([0,1])}$ go to 0 in probability as $n$ increases at the rate of $\Theta(\frac{1}{n^{1.5}})$. Therefore we can also conclude the convergence of IGN in probability according to Proposition 2.

D. Missing Proofs from Section 6 (Edge Probability Discrete Model)

D.1. Missing Proof for Section 6.2

Theorem 3. Given any graphon $W$ with $c_{max} < 1$ and an IGN architecture, there exists a set of parameters $\theta$ such that convergence of $\mathrm{IGN}_\theta$ to c-$\mathrm{IGN}_\theta$ is not possible, i.e., $\mathrm{RMSE}_U\big(\Phi_c([W, \mathrm{Diag}(X)]),\ \Phi_d([A_n, \mathrm{Diag}(\widetilde{X}_n)])\big)$ does not converge to 0 as $n \to \infty$, where $A_n$ is the 0-1 matrix generated according to Eq. (6), i.e., $A_n[i][j] = a_{i,j}$.

Proof. Without loss of generality, let the network be $\mathrm{IGN} = L^{(2)} \circ \sigma \circ L^{(1)}$, and let the input to the IGN be $A$ in the discrete case and $W$ in the continuous case. For simplicity, we assume that the graphon $W$ is a constant $p$ on $[0,1]^2$. As $A$ consists only of 0s and 1s and all entries of $W$ are below $c_{max}$, it is easy to construct a linear layer $L^{(1)}$ from $\mathbb{R}$ to $\mathbb{R}$ via the right choice of bias such that $L^{(1)}$ maps any number no larger than $c_{max}$ to a negative number and maps 1 to a positive number. Therefore $L^{(1)}(W)$ is negative everywhere (and hence vanishes after the nonlinearity), while $L^{(1)}(A)$ equals a positive number $c \in \mathbb{R}_+$ on the entries $(i,j)$ where $A(i,j) = 1$. Let $\sigma$ be ReLU and let $L^{(2)}$ be the average of all entries. We can see that c-$\mathrm{IGN}(W) = 0$ for all $n$, while $\mathrm{IGN}(A)$ converges to $\sigma(c)\,p$ as $n$ increases. As the construction above only relies on the fact that there is a separation between $c_{max}$ and 1 (but not on the size $n$), it can be extended to any IGN for any $n$, which means the gap between c-$\mathrm{IGN}(W)$ and $\mathrm{IGN}(A)$ will not decrease as $n$ increases. In the general case where $W$ is not constant, the only difference is that $\mathrm{IGN}(A)$ will converge to $\sigma(c)\,\bar{p}$, where $\bar{p}$ is a different constant that depends on $W$. Therefore we conclude the proof.

Remark 3. The reason that the same argument does not work for spectral GNNs is that a spectral GNN always maintains $Ax$ in the intermediate layers.
In contrast, IGN keeps both A and Diag(x) in separate channels, which makes it easy to isolate them to construct counterexamples. D.2. Missing Proofs from Section 6.3 Notation. For any P, Q Rn n, define d2, , the normalized 2, matrix norm, by d2, (P, Q) = n 1/2 P Q 2, := maxi n 1/2 Pi, Qi, 2 where Pi, , Pi, are i-th row of P and Q. Note that d2, (P, Q) 1 Let SU be the sampling operator for W, i.e., SU(W) = 1 n[W(Ui, Uj)]n n. Note that as U is randomly sampled, SU is a random operator. Denote Sn as sampling on a fixed equally spaced grid of size n n, i.e. Sn W = 1 n)]n n. Sn is a fixed operator when n is fixed. Let c Wn n be the estimated edge probability from graphs A sampled from W. Let g Wn be the piece-wise constant graphon induced from sample U as Eq. (5). Similarly, denote Wn n be the n n matrix realized on sample U, i.e., Wn n[i, j] = W(ui, uj). It is easy to see that SU(W) = 1 n Wn n. Let Wn,E be the graphon induced by Wn n with n n blocks of the same size. In particular, Wn,E(Ii Ij) := W(u(i), u(j)) where Ii = [ i 1 n]. E in the subscript is the shorthand for the blocks of equal size . Similarly we can also define the 1D analog of g Wn and Wn,E, f Xn and ] Xn,E. Proof strategy. We first state five lemmas that will be used in the proof of Theorem 4. Lemma 5 concerns the property of normalized sampling operator SU and Sn. Lemmas 6 and 7 concerns the norm convergence of g Wn W L and Convergence of Invariant Graph Networks Wn,E W L . Lemma 8 characterize the effects of linear equivariant layers T and IGN Φ on L norm of the input and output. Lemma 9 bounds the L norm of the difference of stochastic sampling operator SU and the deterministic sampling operator Sn. Theorem 4 is built on the results from five lemmeas and the existing result on the theoretical guarantee of edge probability estimation from Zhang et al. (2015). The convergence of some lemmas is almost surely convergence. Convergence almost surely implies convergence in probability, and in this paper, all theorems concern convergence in probability. Note that proofs of Lemmas 5 to 7 and 9 for the W and X are almost the same. Therefore without loss of generality, we mainly prove the case of W. Definition 10 (Chessboard pattern). Let ui = i 1 n for all i [n]. A graphon W is defined to have chessboard pattern if and only if there exists a n such that W is a piecewise constant on [ui, ui+1] [uj, uj+1] for all i, j [n]. Similarily, f : [0, 1] R has 1D chessboard pattern if there exists n such that f is a piecewise constant on [ui, ui+1] for all i [n]. See Figure 7 for examples and counterexamples. (a) (b) (c) (d) (e) Figure 7: (a) and (c) has chessboard pattern. (e) has 1D chessboard pattern. (d) does not has the chessboard pattern. (b) is of form Diag( g fn,E) and also does not have chessboard pattern, but in the case of IGN approximating Spectral GNN, (b) is reprerepresented in the form of c) via a linear equivariant layers of 2-IGN. Lemma 5 (Property of Sn and SU). We list the properties of sampling operator SU and Sn 1. SU σ = σ SU. Similar result holds for Sn as well. 2. SUf1d f L where f1d : [0, 1] R. Similar result holds for f2d : [0, 1]2 R and Sn as well. Lemma 6. Let W be [0, 1]2 R and X be [0, 1] R. If W is Lipschitz, g Wn W L converges to 0 in probability. If X is Lipschitz, f Xn X L converges to 0 in probability. Proof. Without loss of generality, we only prove the case for W. By the Lipschitz condition of W, if suffices to bound the Zn := maxn i=1Di where Di is the length of i-th interval |u(i) u(i 1)|. 
Characterizing the distribution of the length of largest interval is a well studied problem (R enyi, 1953; Pyke, 1965; Holst, 1980). It can be shown that Zn follows P (Zn x) = Pn+1 j=0 ( 1)j(1 jx)n + with the expectation E(Zk) = 1 n+1 Pn+1 i=1 1 i = Θ( log n n ). By Markov inequality, we conclude that g Wn W L converges to 0 in probability. Lemma 7. Let W be [0, 1]2 R and X be [0, 1] R. If W is Lipschitz, Wn,E W L converges to 0 almost surely. If X is Lipschitz, ] Xn,E X L converges to 0 almost surely. Proof. As Wn,E is a piecewise constant graphon and W is Lipschitz according to AS1, we only need to examine maxi,j (W Wn,E)( i It is easy to see that (W Wn,E)( i n) W(u(i), u(j)) where u(i) stands for the i-th smallest random variable from uniform i.i.d. samples from [0, 1]. By the Lipschitz condition of W, if suffices to bound i n u(j) . Glivenko-Cantelli theorem tells us that the L of empirical distribution Fn and cumulative distribution function F converges to 0 almost surely, i.e., supu [0,1]|F(u) Fn(u)| 0 almost surely. Since maxi u(i) i n = supu {u(1),...,u(n)}|F(u) Convergence of Invariant Graph Networks Fn(u)| supu [0,1]|F(u) Fn(u)| when F(u) = u (cdf of uniform distribution), we conclude that Wn,E W L converges to 0 almost surely. We also need a lemma on the property of the linear equivariant layers T. Lemma 8 (Property of Tc and σ). Let σ be nonlinear layer. Let Tc be a linear combination of elements of basis of the space of linear equivariant layers of c IGN, with coefficients upper bounded. We have the following property about Tc and σ 1. If W is Lipschitz, Tc(W) is piecewise Lipschitz on diagonal and off-diagonal. Same statement holds for Φc(W). 2. Sn σ( Wn,E) = σ Sn( Wn,E). Proof. We prove two statements one by one. 1. We examine the linear equivariant operators from R[0,1]2 to R[0,1]2 in Table 1. There are some operations such as average of rows replicated on diagonal will destroy the Lipschitz conditon of Tc(W) but Tc(W) will still be piecewise Lipschitz on diagonal and off-diagonal. Since σ will preserve the Lipschitzness, Φc(W) is piecewise Lipschitz on diagonal and off-diagonal. 2. This is easy to see as σ acts on input pointwise. Lemma 9. Let W be [0, 1]2 R 1. If W is Lipschitz, SUW Sn W converges to 0 almost surely. Similarily, if X is Lipschitz, SUDiag(X) Sn Diag(X) converges to 0 almost surely. 2. If W is piecewise Lipschitz on S1 and S2 where S1 is the diagonal and S2 is off-diagonal, then SUW Sn W converges to 0 almost surely. Proof. Since the case of X is essentially the same with that of W, we only prove the case of W. 1. As n SUW Sn W SUW Sn W , it suffices to prove that n SUW Sn W = maxi,j|W(u(i), u(j)) W( i n)| converges to 0 almost surely. Similar to Lemma 7, using Lipschitz condition of W and Glivenko-Cantelli theorem concludes the proof. 2. This statement is stronger than the one above. The proof of the last item can be adapted here. As W is A1 Lipschitz on off-diagonal region and A2 Lipschitz on diagonal, n SUW Sn W = maxi,j W(u(i), u(j)) W( i = max maxi =j W(u(i), u(j)) W( i n) , maxi=j W(u(i), u(j)) W( i Using Lipschitz condition on diagonal and off-diagonal part of W and Glivenko-Cantelli theorem concludes the proof. With all lemmas stated, we are ready to prove the main theorem. Theorem 4 (convergence of IGN-small in the edge probability discrete model). Assume AS 1-4, and let c Wn n be the estimated edge probability that satisfies 1 n Wn n c Wn n 2 converges to 0 in probability. Let Φc, Φd be continuous and discrete IGN-small. 
Then RMSEU Φc ([W, Diag(X)]) , Φd [c Wn n, Diag(f xn)] converges to 0 in probability. Convergence of Invariant Graph Networks Proof. Using the triangle inequality RMSEU(Φc ([W, Diag(X)]) , Φd [c Wn n, Diag(f xn)] ) = SUΦc ([W, Diag(X)]) 1 nΦd [c Wn n, Diag(f xn)] = SUΦc ([W, Diag(X)]) SUΦc [g Wn, Diag( f Xn)] + SUΦc [g Wn, Diag( f Xn)] Φd SU([g Wn, Diag(f xn)]) + Φd SU([g Wn, Diag( f Xn)]) 1 nΦd([c Wn n, Diag( f Xn)]) SUΦc ([W, Diag(X)]) SUΦc [g Wn, Diag( f Xn)] | {z } First term: discrization error + SUΦc [g Wn, Diag( f Xn)] Φd SU([g Wn, Diag( f Xn)]) | {z } Second term: sampling error + Φd SU([g Wn, Diag( f Xn)]) 1 nΦd [c Wn n, Diag(f xn)] | {z } Third term: estimation error The three terms measure the different sources of error. The first term is concerned with the discretization error. The second term concerns the sampling error from the randomness of U. This term will vanish if we consider only Sn instead of SU for IGN-small. The third term concerns the edge probability estimation error. For the first term, it is similar to the sketch in Section 6.3. SUΦc([W, Diag(X)]) SUΦc([g Wn, Diag( f Xn)]) = SU(Φc([W, Diag(X)]) Φc([g Wn, Diag( f Xn)])) , if suffices to upper bound Φc([W, Diag(X)]) Φc([g Wn, Diag( f Xn)]) L according to property of SU in Lemma 5. Since Φc([W, Diag(X)]) Φc([g Wn, Diag( f Xn)]) L C( W g Wn L + Diag(X) Diag( f Xn) L ) by Proposition 2, and W g Wn L converges to 0 in probability according to Lemma 6, we conclude that the first term will converges to 0 in probability. For the third term Φd SU([g Wn, Diag( f Xn)]) 1 nΦd([c Wn n, Diag(f xn)]) = 1 n(Φd([Wn n, Diag(f xn)]) Φd([c Wn n, Diag(f xn)])) = Φd([Wn n, Diag(f xn)]) Φd([c Wn n, Diag(f xn)]) pn, it suffices to control the [Wn n, Diag(f xn)] [c Wn n, Diag(f xn)] pn = 1 n Wn n c Wn n 2 Wn n c Wn n 2, , which will also goes to 0 in probability as n increases according to the statistical guarantee of edge probability estimation of neighborhood smoothing algorithm (Zhang et al., 2015), stated in Theorem 8 in Appendix G. Therefore by Proposition 2, the third term also goes to 0 in probability. Therefore the rest work is to control the second term SUΦc [g Wn, Diag( f Xn)] Φd SU [g Wn, Diag( f Xn)] . Again, we use the triangle inequality Second term = SUΦc [g Wn, Diag( f Xn)] Φd SU [g Wn, Diag( f Xn)] SUΦc [g Wn, Diag( f Xn)] SnΦc [ Wn,E, Diag( ] Xn,E)] + SnΦc [ Wn,E, Diag( ] Xn,E)] Φd SU [g Wn, Diag( f Xn)] = SUΦc [g Wn, Diag( f Xn)] SnΦc [ Wn,E, Diag( ] Xn,E)] + SnΦc [ Wn,E, Diag( ] Xn,E)] Φd Sn([ Wn,E, Diag( ] Xn,E)]) = SUΦc [g Wn, Diag( f Xn)] SnΦc [ Wn,E, Diag( ] Xn,E)] SUΦc [g Wn, Diag( f Xn)] SUΦc [ Wn,E, Diag( ] Xn,E)] + SUΦc [ Wn,E, Diag( ] Xn,E)] SnΦc [ Wn,E, Diag( ] Xn,E)] = SU Φc([g Wn, Diag( f Xn)] Φc [ Wn,E, Diag( ] Xn,E)] | {z } term a + (SU Sn)Φc [ Wn,E, Diag( ] Xn,E)] | {z } term b The second equality holds because SU([g Wn, Diag( f Xn)]) = Sn([ Wn,E, ] Xn,E]) by definition of Wn,E and IGN-small (See Remark 4 for more discussion). The third equality holds by the definition of IGN-small. We will bound the term a) SU(Φc([g Wn, Diag( f Xn)]) Φc([ Wn,E, ] Xn,E])) and b) (SU Sn)Φc([ Wn,E, Diag( ] Xn,E)]) next. Convergence of Invariant Graph Networks For term a) SU(Φc([g Wn, Diag( f Xn)]) Φc([ Wn,E, ] Xn,E])) , if suffices to prove that Φc([g Wn, Diag( f Xn)]) Φc([ Wn,E, ] Xn,E])) L converges to 0 in probability. According to Proposition 2, it suffices to bound the [g Wn, f Xn] [ Wn,E, ] Xn,E] L . 
Because [g Wn, f Xn] [ Wn,E, ] Xn,E] L = g Wn Wn,E L + Diag( f Xn) Diag( ] Xn,E) L ) g Wn W L + Wn,E W L + Diag( f Xn) Diag(X) L + Diag( ] Xn,E) Diag(X) L , we only need to upper bound g Wn W L , Wn,E W L , Diag( f Xn) Diag(X) L ) and Diag( ] Xn,E) Diag(X) L ), which are proved by Lemma 6 and Lemma 7 respectively. For term b) (SU Sn)Φc [ Wn,E, Diag( ] Xn,E)] (SU Sn)Φc [ Wn,E, Diag( ] Xn,E)] = (SUΦc [ Wn,E, Diag( ] Xn,E)] SnΦc [ Wn,E, Diag( ] Xn,E)] (SUΦc [ Wn,E, Diag( ] Xn,E)] SUΦc ([W, Diag(X)]) + SUΦc ([W, Diag(X)]) SnΦc ([W, Diag(X)]) + SnΦc ([W, Diag(X)]) SnΦc [ Wn,E, Diag( ] Xn,E)] = (SU(Φc [ Wn,E, Diag( ] Xn,E)] Φc([W, Diag(X)])) + SUΦc ([W, Diag(X)]) SnΦc ([W, Diag(X)]) + Sn(Φc [ Wn,E, Diag( ] Xn,E)] Φc([W, Diag(X)])) For the first and last term, by the property of SU, Sn and Φc, it suffices to bound W Wn,E L and Diag(X) Diag( ] Xn,E) L . Without loss of generality, We only prove the case for W. As W Wn,E L converges to 0 almost surely by Lemma 7, we conclude that the first and last term converges to 0 almost surely (therefore in probability). For the second term SUΦc ([W, Diag(X)]) SnΦc ([W, Diag(X)]) , Φc ([W, Diag(X)]) is piecewise Lipschitz on diagonal and off-diagonal according to Lemma 8, and it converges to 0 almost surely according to the second part of Lemma 9. As all terms converge to 0 in the probability or almost surely, we conclude that SUΦc ([W, Diag(X)]) Φd([c Wn n, Diag( f Xn)]) converges to 0 in probability. Remark 4. Note that we can not prove Sn Φc( Wn,E) = Φd Sn( Wn,E) in general. The difficulty is that starting with Wn,E of chessboard pattern, after the first layer, pattern like Figure 7(e) may appear in σ T1(g Wn). If T2 is just a average/integral to map Rn2 1 to R, then Sn T2 σ T1(g Wn) = T2 σ T1(g Wn) will not be equal to T2 σ T1(Sng Wn). The reason is that both σ T1(g Wn) and σ T1(Sng Wn) will no longer be of chessboard pattern (Figure 7(e) may occur). The diagonal in the σ T1(g Wn) has no effect after taking integral in T2 as it is of measure 0. On the other hand, the diagonal in the matrix σ T1(Sng Wn) will affect the average. Therefore in general, SnΦc( Wn,E) = Φd Sn( Wn,E) does not hold. E. IGN-small Can Approximate Spectral GNN Definition of Spectral GNN. The spectral GNN (SGNN) here stands for GNN with multiple layers of the following form j = 1, . . . dℓ+1, z(ℓ+1) j = ρ i=1 h(ℓ) ij (L)z(ℓ) i + b(ℓ) j 1n where L = D(A) 1 2 stands for normalized adjacency,4 zℓ j, bℓ j R denotes the embedding and bias at layer ℓ. h : R R, h(λ) = P k 0 βkλk, h(L) = P k βk Lk, i.e., we apply h to the eigenvalues of L when it is diagonalizable. Extending h to multiple input output channels which are indexed in i and j, we have h(ℓ) ij (λ) = P k β(ℓ) ijkλk. By defining all components of spectral GNN for graphon, the continuous version of spectral GNN can also be defined. See Keriven et al. (2020) for details. 4We follow the same notation as Keriven et al. (2020), which is different from the conventional notation. Convergence of Invariant Graph Networks We first prove IGN can approximate spectral GNN arbitrarily well, both for discrete SGNN and continuous SGNN as well. Next, we show that such IGN belongs to IGN-small. We need the following simple assumption to ensure the input lies in a compact domain. AS5. There exists an upper bound on x L for the discrete case and X L in the continuous case. AS6. min(D(A)mean) cmin where D(A)mean is defined to be 1 n Diag(A1). The same lower bound holds for graphon case. Lemma 10. 
Given diagonal matrix D, matrix M and vector x, 2-IGN can approximate matrix-vector multiplication 1 n Mx and DMD arbitrarily well in L sense on a compact domain. Proof. Given diagonal matrix D and matrix M, to implement DMD with linear equivariant layers of 2-IGN, we first use operation 14-15 in Table 1 to copy diagonal elements in D to rows and columns of two matrix Drow and Dcol. Then calculating DMD becomes entry-wise multiplication of three matrix Drow, M, Dcol. Assuming all entries of D and M lies in a compact domain, we can use MLP (which is part of IGN according to Remark 8) to approximate multiplication arbitrarily well (Cybenko, 1989; Hornik et al., 1989). Note that in general matrix multiplication can not be achieved by 2-IGN but in our case, we are exploiting that D(A)mean is a diagonal matrix. See Remark 5 for illustration. To implement 1 n Mx with linear equivariant layers of 2-IGN, first map x into a diagonal matrix Diag(x) and concatenate it with M as the input [Diag(x), M] Rn n 2 to 2-IGN. Apply copy diagonal to all columns to the first channel and use MLP to uniformly approximates up to arbitrary precision ϵ the multiplication of first channel with the second channel. Then use operation copy row mean to map Rn n Rn to get the 1 n Mx within ϵ precision. See Figure 8. Remark 5. Linear layers in 2-IGN can not implement matrix-matrix multiplication in general. When we introduce the matrix multiplication component, the expressive power of GNN in terms of WL test provably increases from 2-WL to 3-WL (Maron et al., 2019a)). Theorem 6. Given n, ϵ, and SGNNθ1(n), there exists a 2-IGN IGNθ2(n) such that it approximates SGNNθ1(n) on a compact set (input feature xn) arbitrarily well in L sense on a compact domain. use MLP to mix channels to approximate pointwise multiplication linear equivaraint layers: copy column average to columns Channel dimension approximated Mx/n Node dimension Figure 8: An illusttration of how we approximate the major building blocks of SGNN: 1 Proof. Since IGN and SGNN has the same nonlinearity. To show that IGN can approximate SGNN, it suffices to show that IGN can approximate linear layer of SGNN, which further boils down to prove that IGN can approximate Lx. Here we assume the input of 2-IGN is A Rn n and x Rn d. We need to first show how L = D(A) 1 2 AD(A) 1 2 can be implemented by linear layers of IGN. This is achieved by noting that L = 1 2 mean AD(A) 1 2 mean where D(A)mean is normalized degree matrix 1 n Diag(A1). Representing L as 1 2 mean AD(A) 1 2 mean ensures that all entries in A and D(A)mean lies in a compact domain, which is crucial when we extending the approximation proof to the graphon case. Now we show how Lx = 1 2 mean AD(A) 1 2 meanx is implemented. First, it is easy to see that 2-IGN can calculate exactly D(A)mean using equivariant layers. Second, as approximating a) f(a, b) = ab and b) f(a) = 1 a can achieved by MLP on Convergence of Invariant Graph Networks compact domain, approximating D(A) 1 2 mean AD(A) 1 2 mean can also achieved by 2-IGN layers according to Lemma 10. Third, we need to show 1 2 mean AD(A) 1 2 meanx can also be implemented. This is proved in Lemma 10. There are two main functions we need to approximate with MLP: a) f(x) = 1/ a and b) f(a, b) = ab. For a) the input is then D(A)mean whose all entries lie in [0, 1]. By classical universal approximation theorem (Cybenko, 1989; Hornik et al., 1989), we know MLP can approximate a) arbitrarily well. 
For b) the input is (D(A) 1/2 mean , A) for normalized adjacency matrix calculation, and (L, x) for graph signal convolution. To ensure the uniform approximation, we need to ensure all of them lie in a compact domain. This is indeed the case as all entries in D(A)mean, A, x are all upper bounded. We list the support of all components explicitly below 1. every entry in A is either 0 or 1 therefore lies in a compact domain. 2. similarly, all entries D(A)mean lies in [cmin, 1] by AS6, and therefore D(A) 1 2 mean also lies in a compact domain. As L(A) is the multiplication of D(A) 1/2 mean , A, D(A) 1/2 mean , every entry of L(A) also lies in compact domain. 3. input signal x has bounded l -norm by assumption AS5. 4. all coefficient for operators is upper bounded and independent from n by AS2. Since we showed the L(A)x can be approximated arbitrarily well by IGN, repeating such processes and leveraging the fact that L has bounded spectral norm, we can then approximate Lk(A)x up to ϵ precision. The errors ϵ depend on the approximation error of the MLP to the relevant function, the previous errors, and uniform bounds as well as uniform continuity of the approximated functions. Theorem 7. Given ϵ, and a spectral GNN c SGNNθ1, there exists a continous 2-IGN c IGNθ2such that it approximates c SGNNθ1 on a compact set (input feature X) arbitrarily well. Proof. we show that all items listed in proof of Theorem 6 still holds in the continuous case we consider the W instead in the continuous case, where all entries still lies in a compact domain [0, 1]. similarly all entries of the continuous analog of D(A)mean, D(A) 1 2 mean, and T(W) also lies in a compact domain according to AS6. the statements about input signal X and the coefficient for linear equivariant operators also holds in the continuous setting. Therefore we conclude the proof. Now we are ready to prove that those IGN that can approximate SGNN well is a subset of IGN-small. Lemma 11. Let Wn,E be graphon of chessboard pattern. Let ] Xn,E be a graphon signal with 1D chessboard pattern. Sn Wn,E ] Xn,E = (Sn Wn,E)(Sn ] Xn,E) Proof. Since Sn Wn,E ] Xn,E(i) = Sn R j [0,1] Wn,E(i, j) ] Xn,E(j)dj = ..., 1 n R j [0,1] Wn,E( i n, j) ] Xn,E(j), ... , it suffices to analysize 1 n R j [0,1] Wn,E( i n, j) ] Xn,E(j). Convergence of Invariant Graph Networks Since Wn,E, ] Xn,E are of chessboard pattern, we can replace integral with summation. Sn Wn,E ] Xn,E(i) = 1 n j [0,1] Wn,E( i n, j) ] Xn,E(j) j [n] Wn,E( i n) ] Xn,E( j 1 n Wn,E( i n)(Sn ] Xn,E)(j) = X (Sn Wn,E)(i, j)(Sn ] Xn,E)(j) = (Sn Wn,E)(Sn ] Xn,E) (i) Which concludes the proof. Note that our proof does make use of the property of multiplication between two numbers. Remark 6. The whole proof only relies on that Wn,E and ] Xn,E have checkboard patterns. Therefore replacing the multiplication with other operations (such as a MLP) will still hold. Theorem 5. IGN-small can approximates spectral GNN (both discrete and continuous ones) arbitrarily well on the compact domain in the L sense. Proof. To prove this, we only need to show that SnΦc,approx([ Wn,E, g fn,E]) = Φd,approx Sn([ Wn,E, g fn,E]). Here Φc,approx and Φd,approx denotes those specific IGN in Theorems 6 and 7 constructed to approximate SGNN. To build up some intuition, let ΦSGNN denotes the spectral GNN that Φapprox approximates. it is easy to see that SnΦc,SGNN([ Wn,E, g fn,E]) = Φd,SGNNSn([ Wn,E, g fn,E]) due to Lemma 11 and Lemma 8.2. 
To show that the same holds for $\Phi_{approx}$, note that the only difference between $W_{n,E}\,\widetilde{f}_{n,E}$ implemented by the SGNN and its approximation by $\Phi_{approx}$ is that $\Phi_{approx}$ uses an MLP to simulate the multiplication between numbers. According to Remark 6, the approximated version of $W_{n,E}\,\widetilde{f}_{n,E}$ still commutes with $S_n$. Since the nonlinear layer $\sigma$ in $\Phi_{approx}$ also commutes with $S_n$ according to Lemma 8.2, we can combine the results above and conclude that $\Phi_{approx}$ commutes with $S_n$. Therefore $\Phi_{approx}$ belongs to IGN-small, which finishes the proof.

F. More experiments

We next show the full results verifying Theorems 2 and 4. The main procedure is described in Section 7. As the ground truth is defined in the continuous regime, we use the outputs of IGN on large graphs as an approximation of the unknown true limit. We experiment with two methods: a) we take the output of IGN under the deterministic edge weight continuous model as the ground truth, and b) we take graphs sampled from the stochastic edge weight continuous model as input to IGN and average the outputs over 10 random seeds. Case a) is shown in the main text; here we include results for both a) and b).

G. Third-party results

G.1. Edge Probability Estimation from Zhang et al. (2015)

We next restate the setting and the theorem regarding the theoretical guarantee of the edge probability estimation algorithm.

Definition 11. For any $\delta, A_1 > 0$, let $\mathcal{F}_{\delta;A_1}$ denote a family of piecewise Lipschitz graphon functions $f : [0,1]^2 \to [0,1]$ such that (i) there exists an integer $K \ge 1$ and a sequence $0 = x_0 < \cdots < x_K = 1$ satisfying $\min_{0 \le s \le K-1}(x_{s+1} - x_s) \ge \delta$, and (ii) both $|f(u_1, v) - f(u_2, v)| \le A_1 |u_1 - u_2|$ and $|f(u, v_1) - f(u, v_2)| \le A_1 |v_1 - v_2|$ hold for all $u, u_1, u_2 \in [x_s, x_{s+1}]$, $v, v_1, v_2 \in [x_t, x_{t+1}]$ and $0 \le s, t \le K - 1$.

Assume that $\alpha_n = 1$. It is easy to see that the setup considered in Zhang et al. (2015) is slightly more general than the setup in Keriven et al. (2020). The statistical guarantee of the edge smoothing algorithm is stated below.

Figure 9: Four graphons of increasing complexity: a random graph ($p = 0.1$), a stochastic block model, a smooth graphon (1 piece), and a piecewise smooth graphon (3 pieces).

Figure 10: Convergence error when the ground truth is approximated by random sampling, for the four settings EW + fixed, EW + random, EP, and EP + edge smoothing.

Figure 11: Convergence error when the ground truth is approximated by grid sampling, for the same four settings.

Figure 12: The convergence error for the four generative models under the two ways of approximating the ground truth. The three dashed lines in each figure indicate the decay rates $n^{-0.5}$, $n^{-1}$ and $n^{-2}$. EW stands for the edge weight continuous model and EP stands for the edge probability discrete model. As implied by Theorem 2, EW + fixed and EW + random both converge as $n$ increases. On the other hand, EP does not converge, which is consistent with Theorem 3. After edge probability estimation, EP + edge smoothing converges again, which is consistent with Theorem 4.
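For concreteness, the sketch below (ours, with assumed graphons standing in for three of the models in Figure 9) shows how inputs are generated under the edge weight (EW) and edge probability (EP) models with either fixed-grid or random latent positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-ins for three of the graphon models in Figure 9.
graphons = {
    "random graph (p=0.1)": lambda u, v: np.full_like(u + v, 0.1),
    "stochastic block model": lambda u, v: np.where((u < 0.5) == (v < 0.5), 0.8, 0.1),
    "smooth graphon (1 piece)": lambda u, v: 0.45 * (1.0 + np.cos(np.pi * (u - v))),
}

def sample_graph(W, n, model="EP", fixed_grid=True):
    """EW: weighted graph with entries W(u_i, u_j).
    EP: 0-1 graph with A_ij ~ Bernoulli(W(u_i, u_j)).
    fixed_grid selects deterministic latent positions i/n vs. random ones."""
    u = np.arange(1, n + 1) / n if fixed_grid else np.sort(rng.uniform(size=n))
    Wn = W(u[:, None], u[None, :])
    if model == "EW":
        return Wn
    upper = np.triu((rng.uniform(size=(n, n)) < Wn).astype(float), 1)
    return upper + upper.T                    # symmetric 0-1 adjacency

A = sample_graph(graphons["stochastic block model"], 400, model="EP")
```

The convergence error reported in Figures 10-12 is then the root mean squared difference between the IGN output on such an input of size $n$ and the output on a much larger graph used as a proxy for the continuous limit.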
Theorem 8 (Zhang et al. (2015)). Assume that $A_1$ is a global constant and $\delta = \delta(n)$ depends on $n$, satisfying $\delta / (n^{-1}\log n)^{1/2} \to \infty$ as $n \to \infty$. Then the estimator $\widehat{P}$ with neighborhood $N_i$ defined in Zhang et al. (2015) and $h = C(n^{-1}\log n)^{1/2}$, for any global constant $C \in (0, 1]$, satisfies
$\max_{f \in \mathcal{F}_{\delta;A_1}} \mathrm{pr}\Big\{ d_{2,\infty}(\widehat{P}, P)^2 \ge C_1 \Big(\frac{\log n}{n}\Big)^{1/2} \Big\} \le n^{-C_2},$
where $C_1$ and $C_2$ are positive global constants and $d_{2,\infty}(P, Q) := n^{-1/2}\|P - Q\|_{2,\infty} = \max_i n^{-1/2}\|P_{i\cdot} - Q_{i\cdot}\|_2$.

G.2. IGN Details

Remark 7 (independence from $n$). Although the result in Maron et al. (2018) is correct for large $n$, as noted by Finzi et al. (2021) it does not hold when $n$ is small. This is not an issue here, as we consider the regime where $n$ goes to infinity.

Remark 8 (multi-channel IGN contains MLP). For simplicity, in the main text we focus on the case where the numbers of input and output tensor channels are 1. The general case of multiple input and output channels is presented in Equation 9 of Maron et al. (2018). The main takeaway is that permutation equivariance does not constrain the mixing over feature channels, i.e., the space of linear equivariant maps from $\mathbb{R}^{n^\ell \times d_1}$ to $\mathbb{R}^{n^m \times d_2}$ is of dimension $d_1 d_2\,\mathrm{bell}(\ell + m)$. Therefore IGN contains MLP.
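As a small illustration of Remark 8 (ours, not part of the paper), the dimension $d_1 d_2\,\mathrm{bell}(\ell + m)$ can be computed from the Bell numbers; $\mathrm{bell}(4) = 15$ recovers the 15 basis operations of Table 1 used in the proof of Proposition 2, and $\mathrm{bell}(3) = 5$ matches the five slices of a 3-tensor in Figure 5.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(k):
    """Bell numbers via the recurrence B(k) = sum_j C(k-1, j) B(j)."""
    if k == 0:
        return 1
    return sum(comb(k - 1, j) * bell(j) for j in range(k))

def dim_linear_equivariant(l, m, d1=1, d2=1):
    """Dimension d1 * d2 * bell(l + m) of the space of linear equivariant
    maps from R^{n^l x d1} to R^{n^m x d2} (Maron et al., 2018); channel
    mixing is unconstrained, which is why multi-channel IGN contains an MLP."""
    return d1 * d2 * bell(l + m)

print(bell(3))                        # 5 partitions of [3], cf. Figure 5
print(dim_linear_equivariant(2, 2))   # 15 basis maps for 2-IGN, cf. Table 1
print(dim_linear_equivariant(2, 2, d1=8, d2=16))
```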