# INVARIANT AND EQUIVARIANT GRAPH NETWORKS

Published as a conference paper at ICLR 2019

Haggai Maron, Heli Ben-Hamu, Nadav Shamir & Yaron Lipman
Department of Computer Science and Applied Mathematics
Weizmann Institute of Science
Rehovot, Israel

## ABSTRACT

Invariant and equivariant networks have been successfully used for learning images, sets, point clouds, and graphs. A basic challenge in developing such networks is finding the maximal collection of invariant and equivariant linear layers. Although this question is answered for the first three examples (for popular transformations, at least), a full characterization of invariant and equivariant linear layers for graphs is not known. In this paper we provide a characterization of all permutation invariant and equivariant linear layers for (hyper-)graph data, and show that their dimension, in the case of edge-value graph data, is 2 and 15, respectively. More generally, for graph data defined on $k$-tuples of nodes, the dimension is the $k$-th and $2k$-th Bell numbers. Orthogonal bases for the layers are computed, including a generalization to multi-graph data. The constant number of basis elements and their characteristics allow successfully applying the networks to graphs of different sizes. From the theoretical point of view, our results generalize and unify recent advancements in equivariant deep learning. In particular, we show that our model is capable of approximating any message passing neural network. Applying these new linear layers in a simple deep neural network framework is shown to achieve results comparable to the state-of-the-art and to have better expressivity than previous invariant and equivariant bases.

## 1 INTRODUCTION

We consider the problem of graph learning, namely finding a functional relation between input graphs (more generally, hyper-graphs) $\mathcal{G}^\ell$ and corresponding targets $T^\ell$, e.g., labels. As graphs are common data representations, this task has received quite a bit of recent attention in the machine learning community (Bruna et al., 2013; Henaff et al., 2015; Monti et al., 2017; Ying et al., 2018).

More specifically, a (hyper-)graph data point $\mathcal{G} = (\mathcal{V}, \mathcal{A})$ consists of a set of $n$ nodes $\mathcal{V}$, and values $\mathcal{A}$ attached to its hyper-edges (a hyper-edge is an ordered subset of the nodes $\mathcal{V}$). These values are encoded in a tensor $\mathcal{A}$. The order of the tensor $\mathcal{A}$, or equivalently, the number of indices used to represent its elements, indicates the type of data it represents, as follows: a first-order tensor represents node values, where $A_i$ is the value of the $i$-th node; a second-order tensor represents edge values, where $A_{ij}$ is the value attached to the $(i, j)$ edge; in general, a $k$-th order tensor encodes hyper-edge values, where $A_{i_1,\ldots,i_k}$ represents the value of the hyper-edge $(i_1, \ldots, i_k)$. For example, it is customary to represent a graph using a binary adjacency matrix $A$, where $A_{ij}$ equals one if vertex $i$ is connected to vertex $j$ and zero otherwise. We denote the set of order-$k$ tensors by $\mathbb{R}^{n^k}$.

The task at hand is constructing a functional relation $f(\mathcal{A}^\ell) \approx T^\ell$, where $f$ is a neural network. If $T^\ell = t^\ell$ is a single output response then it is natural to ask that $f$ is order invariant, namely it should produce the same output regardless of the node numbering used to encode $\mathcal{A}$. For example, if we represent a graph using an adjacency matrix $A = \mathcal{A} \in \mathbb{R}^{n \times n}$, then for an arbitrary permutation matrix $P$ and an arbitrary adjacency matrix $A$, the function $f$ is order invariant if it satisfies $f(P^T A P) = f(A)$.
If the targets $T^\ell$ specify an output response in the form of a tensor, $T^\ell = \mathcal{T}^\ell$, then it is natural to ask that $f$ is order equivariant, that is, that $f$ commutes with the node-renumbering operator acting on tensors. Using the above adjacency matrix example, for every adjacency matrix $A$ and every permutation matrix $P$, the function $f$ is equivariant if it satisfies $f(P^T A P) = P^T f(A) P$. To define invariance and equivariance for functions acting on general tensors $\mathcal{A} \in \mathbb{R}^{n^k}$ we use the reordering operator: $P \star \mathcal{A}$ is defined to be the tensor that results from renumbering the nodes $\mathcal{V}$ according to the permutation defined by $P$. Invariance now reads as $f(P \star \mathcal{A}) = f(\mathcal{A})$, while equivariance means $f(P \star \mathcal{A}) = P \star f(\mathcal{A})$. Note that the latter equivariance definition also holds for functions between tensors of different orders, $f : \mathbb{R}^{n^k} \to \mathbb{R}^{n^l}$.

*Figure 1: The full basis for equivariant linear layers for edge-value data $\mathcal{A} \in \mathbb{R}^{n \times n}$, for $n = 5$. The 15 purely linear basis elements, $B^\mu$, are represented by $n^2 \times n^2$ matrices, and the 2 bias basis elements (right), $C^\lambda$, by $n \times n$ matrices; see equation 9.*

Following the standard paradigm of neural networks, where a network $f$ is defined by alternating compositions of linear layers and non-linear activations, we set as a goal to characterize all linear invariant and equivariant layers. The case of node-value input $\mathcal{A} = a \in \mathbb{R}^n$ was treated in the pioneering works of Zaheer et al. (2017) and Qi et al. (2017). These works characterize all linear permutation invariant and equivariant operators acting on node-value (i.e., first-order) tensors, $\mathbb{R}^n$. In particular it is shown that the linear space of invariant linear operators $L : \mathbb{R}^n \to \mathbb{R}$ is of dimension one, containing essentially only the sum operator, $L(a) = \alpha \mathbf{1}^T a$. The space of equivariant linear operators $L : \mathbb{R}^n \to \mathbb{R}^n$ is of dimension two, $L(a) = \left[\alpha I + \beta (\mathbf{1}\mathbf{1}^T - I)\right] a$. The general equivariant tensor case was partially treated in Kondor et al. (2018), where the authors make the observation that the standard tensor operators, namely product, element-wise product, summation, and contraction, are all equivariant, and due to linearity the same applies to their linear combinations. However, these do not exhaust nor provide a full and complete basis for all possible tensor equivariant linear layers.

In this paper we provide a full characterization of permutation invariant and equivariant linear layers for general tensor input and output data. We show that the space of invariant linear layers $L : \mathbb{R}^{n^k} \to \mathbb{R}$ is of dimension $b(k)$, where $b(k)$ is the $k$-th Bell number. The $k$-th Bell number is the number of possible partitions of a set of size $k$; for example, $b(3) = 5$. Furthermore, the space of equivariant linear layers $L : \mathbb{R}^{n^k} \to \mathbb{R}^{n^l}$ is of dimension $b(k + l)$. Remarkably, this dimension is independent of the size $n$ of the node set $\mathcal{V}$. This allows applying the same network to graphs of different sizes. For both types of layers we provide a general formula for an orthogonal basis that can be readily used to build linear invariant or equivariant layers with maximal expressive power. Going back to the example of a graph represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, we have $k = 2$, and the linear invariant layers $L : \mathbb{R}^{n \times n} \to \mathbb{R}$ have dimension $b(2) = 2$, while the linear equivariant layers $L : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ have dimension $b(4) = 15$. Figure 1 shows a visualization of the basis of the linear equivariant layers acting on edge-value data such as adjacency matrices.
In Hartford et al. (2018) the authors provide an impressive generalization of the case of node-value data to several node sets, $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_m$ of sizes $n_1, n_2, \ldots, n_m$. Their goal is to learn interactions across sets. That is, an input data point is a tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_m}$ that assigns a value to each element in the Cartesian product $\mathcal{V}_1 \times \mathcal{V}_2 \times \cdots \times \mathcal{V}_m$. Renumbering the nodes in each node set using permutation matrices $P_1, \ldots, P_m$ (resp.) results in a new tensor we denote by $P_{1:m} \star \mathcal{A}$. Order invariance means $f(P_{1:m} \star \mathcal{A}) = f(\mathcal{A})$ and order equivariance is $f(P_{1:m} \star \mathcal{A}) = P_{1:m} \star f(\mathcal{A})$. Hartford et al. (2018) introduce bases for linear invariant and equivariant layers. Although the layers in Hartford et al. (2018) satisfy order invariance and equivariance, they do not exhaust all possible such layers in case some node sets coincide. For example, if $\mathcal{V}_1 = \mathcal{V}_2$ they have 4 independent learnable parameters, whereas our model has the maximal number of 15 parameters.

Our analysis allows generalizing the multi-node-set case to arbitrary tensor data over $\mathcal{V}_1 \times \mathcal{V}_2 \times \cdots \times \mathcal{V}_m$, namely data points in the form of a tensor $\mathcal{A} \in \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}}$. The tensor $\mathcal{A}$ attaches a value to every element of the Cartesian product $\mathcal{V}_1^{k_1} \times \mathcal{V}_2^{k_2} \times \cdots$, that is, a $k_1$-tuple from $\mathcal{V}_1$, a $k_2$-tuple from $\mathcal{V}_2$, and so forth. We show that the linear space of invariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}} \to \mathbb{R}$ is of dimension $\prod_{i=1}^m b(k_i)$, while the space of equivariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}} \to \mathbb{R}^{n_1^{l_1} \times n_2^{l_2} \times \cdots \times n_m^{l_m}}$ has dimension $\prod_{i=1}^m b(k_i + l_i)$. We also provide orthogonal bases for these spaces. Note that, for clarity, the discussion above disregards biases and features; we detail these in the paper.

In Appendix C we show that our model is capable of approximating any message-passing neural network as defined in Gilmer et al. (2017), a framework which encapsulates several popular graph learning models. One immediate corollary is that the universal approximation power of our model is not lower than that of message passing neural nets.

In the experimental part of the paper we concentrated on possibly the most popular instantiation of graph learning, namely that of a single node set and edge-value data, e.g., adjacency matrices. We created simple networks by composing our invariant or equivariant linear layers in standard ways and tested the networks in learning invariant and equivariant graph functions: (i) We compared identical networks with our basis and with the basis of Hartford et al. (2018) and showed we can learn graph functions like trace, diagonal, and maximal singular vector. The basis in Hartford et al. (2018), tailored to the multi-set setting, cannot learn these functions, demonstrating that it is not maximal in the graph-learning (i.e., multi-set with repetitions) scenario. We also demonstrate that our representation allows extrapolation: learning on graphs of one size and testing on another size. (ii) We also tested our networks on a collection of graph learning datasets, achieving results that are comparable to the state-of-the-art on 3 social network datasets.

## 2 PREVIOUS WORK

Our work builds on two main sub-fields of deep learning: group invariant or equivariant networks, and deep learning on graphs. Here we briefly review the relevant works.

**Invariance and equivariance in deep learning.** In many learning tasks the functions that we want to learn are invariant or equivariant to certain symmetries of the input object description.
Maybe the first example is the celebrated translation invariance of Convolutional Neural Networks (CNNs) (LeCun et al., 1989; Krizhevsky et al., 2012); in this case, the image label is invariant to a translation of the input image. In recent years this idea was generalized to other types of symmetries such as rotational symmetries (Cohen & Welling, 2016a;b; Weiler et al., 2018; Cohen et al., 2018). Cohen & Welling (2016a) introduced Group Equivariant Neural Networks that use a generalization of the convolution operator to groups of rotations and reflections; Weiler et al. (2018) and Cohen et al. (2018) also considered rotational symmetries, but in the case of 3D shapes and spherical functions. Ravanbakhsh et al. (2017) showed that any equivariant layer is equivalent to a certain parameter-sharing scheme. If we adopt this point of view, our work reveals the structure of the parameter sharing in the case of graphs and hyper-graphs. In another work, Kondor & Trivedi (2018) show that a neural network layer is equivariant to the action of some compact group iff it implements a generalized form of the convolution operator. Yarotsky (2018) suggested certain group invariant/equivariant models and proved their universality. To the best of our knowledge these models were not implemented.

**Learning of graphs.** Learning on graphs is of huge interest in machine learning, and we restrict our attention to recent advancements in deep learning on graphs. Gori et al. (2005) and Scarselli et al. (2009) introduced Graph Neural Networks (GNN): GNNs hold a state (a real-valued vector) for each node in the graph, and propagate these states according to the graph structure and learned parametric functions. This idea was further developed in Li et al. (2015), which uses gated recurrent units. Following the success of CNNs, numerous works suggested ways to define a convolution operator on graphs. One promising approach is to define convolution by imitating its spectral properties, using the Laplacian operator to define a generalized Fourier basis on graphs (Bruna et al., 2013). Multiple follow-up works (Henaff et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Levie et al., 2017) suggest more efficient and spatially localized filters. The main drawback of spectral approaches is that the generalized Fourier basis is graph-dependent, and applying the same network to different graphs can be challenging. Another popular way to generalize the convolution operator to graphs is learning stationary functions that operate on the neighbors of each node and update its current state (Atwood & Towsley, 2016; Duvenaud et al., 2015; Hamilton et al., 2017; Niepert et al., 2016; Veličković et al., 2017; Monti et al., 2017; Simonovsky & Komodakis, 2017). This idea generalizes the locality and weight-sharing properties of the standard convolution operators on regular grids. As shown in the important work of Gilmer et al. (2017), most of the above-mentioned methods (including the spectral methods) can be seen as instances of the general class of Message Passing Neural Networks.

## 3 LINEAR INVARIANT AND EQUIVARIANT LAYERS

In this section we characterize the collection of linear invariant and equivariant layers. We start with the case of a single node set $\mathcal{V}$ of size $n$ and edge-value data, that is, order-2 tensors $\mathcal{A} = A \in \mathbb{R}^{n \times n}$. As a typical example imagine, as above, an adjacency matrix of a graph. We set a bit of notation.
Given a matrix $X \in \mathbb{R}^{a \times b}$ we denote by $\mathrm{vec}(X) \in \mathbb{R}^{ab \times 1}$ its column stack, and by brackets the inverse action of reshaping to a square matrix, namely $[\mathrm{vec}(X)] = X$. Let $p$ denote an arbitrary permutation and $P$ its corresponding permutation matrix.

Let $L \in \mathbb{R}^{1 \times n^2}$ denote the matrix representing a general linear operator $L : \mathbb{R}^{n \times n} \to \mathbb{R}$ in the standard basis; then $L$ is order invariant iff $L\,\mathrm{vec}(P^T A P) = L\,\mathrm{vec}(A)$. Using the property of the Kronecker product that $\mathrm{vec}(XAY) = (Y^T \otimes X)\,\mathrm{vec}(A)$, we get the equivalent equality $L(P^T \otimes P^T)\,\mathrm{vec}(A) = L\,\mathrm{vec}(A)$. Since the latter equality should hold for every $A$, we get (after transposing both sides of the equation) that order invariant $L$ is equivalent to the equation

$$P \otimes P\,\mathrm{vec}(L) = \mathrm{vec}(L) \tag{1}$$

for every permutation matrix $P$. Note that we used $L^T = \mathrm{vec}(L)$.

For equivariant layers we consider a general linear operator $L : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ and its corresponding matrix $L \in \mathbb{R}^{n^2 \times n^2}$. Equivariance of $L$ is now equivalent to $[L\,\mathrm{vec}(P^T A P)] = P^T [L\,\mathrm{vec}(A)] P$. Using the above property of the Kronecker product again we get $L(P^T \otimes P^T)\,\mathrm{vec}(A) = (P^T \otimes P^T) L\,\mathrm{vec}(A)$. Noting that $P^T \otimes P^T$ is an $n^2 \times n^2$ permutation matrix and its inverse is $P \otimes P$, we get to the equivalent equality $(P \otimes P) L (P^T \otimes P^T)\,\mathrm{vec}(A) = L\,\mathrm{vec}(A)$. As before, since this holds for every $A$ and using the properties of the Kronecker product, we get that $L$ is order equivariant iff for all permutation matrices $P$

$$P \otimes P \otimes P \otimes P\,\mathrm{vec}(L) = \mathrm{vec}(L). \tag{2}$$

From equations 1 and 2 we see that finding invariant and equivariant linear layers for order-2 tensor data over one node set requires finding fixed points of the permutation matrix group represented by Kronecker powers $P \otimes P \otimes \cdots \otimes P$ of permutation matrices $P$. As we show next, this is also the general case for order-$k$ tensor data $\mathcal{A} \in \mathbb{R}^{n^k}$ over one node set, $\mathcal{V}$. That is,

$$\text{invariant } L: \quad P^{\otimes k}\,\mathrm{vec}(L) = \mathrm{vec}(L) \tag{3}$$

$$\text{equivariant } L: \quad P^{\otimes 2k}\,\mathrm{vec}(L) = \mathrm{vec}(L) \tag{4}$$

for every permutation matrix $P$, where $P^{\otimes k} = \underbrace{P \otimes \cdots \otimes P}_{k}$. In equation 3, $L \in \mathbb{R}^{1 \times n^k}$ is the matrix of an invariant operator; and in equation 4, $L \in \mathbb{R}^{n^k \times n^k}$ is the matrix of an equivariant operator. We call equations 3, 4 the fixed-point equations.

To see this, let us add a bit of notation first. Let $p$ denote the permutation corresponding to the permutation matrix $P$. We let $P \star \mathcal{A}$ denote the tensor that results from expressing the tensor $\mathcal{A}$ after renumbering the nodes in $\mathcal{V}$ according to the permutation $P$. Explicitly, the $(p(i_1), p(i_2), \ldots, p(i_k))$-th entry of $P \star \mathcal{A}$ equals the $(i_1, i_2, \ldots, i_k)$-th entry of $\mathcal{A}$. The matrix that corresponds to the operator $P \star$ in the standard tensor basis $e^{(i_1)} \otimes \cdots \otimes e^{(i_k)}$ is the Kronecker power $P^{T \otimes k} = (P^T)^{\otimes k}$. Note that $\mathrm{vec}(\mathcal{A})$ is exactly the coordinate vector of the tensor $\mathcal{A}$ in this standard basis and therefore we have $\mathrm{vec}(P \star \mathcal{A}) = P^{T \otimes k}\,\mathrm{vec}(\mathcal{A})$. We now show:

**Proposition 1.** *A linear layer is invariant (equivariant) if and only if its coefficient matrix satisfies the fixed-point equations, namely equation 3 (equation 4).*

*Proof.* Similarly to the argument from the order-2 case, let $L \in \mathbb{R}^{1 \times n^k}$ denote the matrix corresponding to a general linear operator $L : \mathbb{R}^{n^k} \to \mathbb{R}$. Order invariance means

$$L\,\mathrm{vec}(P \star \mathcal{A}) = L\,\mathrm{vec}(\mathcal{A}). \tag{5}$$

Using the matrix $P^{T \otimes k}$ we have equivalently $L P^{T \otimes k}\,\mathrm{vec}(\mathcal{A}) = L\,\mathrm{vec}(\mathcal{A})$, which is in turn equivalent to $P^{\otimes k}\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for all permutation matrices $P$. For order equivariance, let $L \in \mathbb{R}^{n^k \times n^k}$ denote the matrix of a general linear operator $L : \mathbb{R}^{n^k} \to \mathbb{R}^{n^k}$. Now equivariance of $L$ is equivalent to

$$[L\,\mathrm{vec}(P \star \mathcal{A})] = P \star [L\,\mathrm{vec}(\mathcal{A})]. \tag{6}$$

Similarly to above this is equivalent to $L P^{T \otimes k}\,\mathrm{vec}(\mathcal{A}) = P^{T \otimes k} L\,\mathrm{vec}(\mathcal{A})$, which in turn leads to $P^{\otimes k} L P^{T \otimes k} = L$, and using the Kronecker product properties we get $P^{\otimes 2k}\,\mathrm{vec}(L) = \mathrm{vec}(L)$.
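The fixed-point equations 3 and 4 can be checked numerically for small $n$: stacking the linear constraints $(P^{\otimes \ell} - I)\,\mathrm{vec}(L) = 0$ over all permutation matrices $P$ and computing the null-space dimension recovers the Bell numbers derived in the next subsection. The following is a minimal sketch of such a check (our own illustration, assuming numpy; it is not the paper's implementation):

```python
import itertools
import numpy as np

def perm_matrices(n):
    """Yield all n x n permutation matrices."""
    for perm in itertools.permutations(range(n)):
        P = np.zeros((n, n))
        P[np.arange(n), perm] = 1.0
        yield P

def kron_power(P, k):
    """Kronecker power P^{(x) k}."""
    out = np.array([[1.0]])
    for _ in range(k):
        out = np.kron(out, P)
    return out

def fixed_point_dim(n, order):
    """Dimension of {x : P^{(x) order} x = x for all n x n permutation matrices P}."""
    size = n ** order
    rows = [kron_power(P, order) - np.eye(size) for P in perm_matrices(n)]
    A = np.vstack(rows)
    return size - np.linalg.matrix_rank(A)

if __name__ == "__main__":
    # Use n >= order so that every equality pattern of [n]^order is realizable;
    # e.g. for n = 3, order = 4 the dimension would drop to 14 instead of 15.
    print(fixed_point_dim(n=3, order=2))  # 2  = b(2): invariant layers on edge data
    print(fixed_point_dim(n=4, order=4))  # 15 = b(4): equivariant layers on edge data
```

For $n = 3$, $\ell = 2$ the reported dimension is $2 = b(2)$, and for $n = 4$, $\ell = 4$ it is $15 = b(4)$, matching the invariant and equivariant layer counts for edge-value data.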
### 3.1 SOLVING THE FIXED-POINT EQUATIONS

We have reduced the problem of finding all invariant and equivariant linear operators $L$ to finding all solutions $L$ of equations 3 and 4. Although the fixed-point equations consist of an exponential number of equations with only a polynomial number of unknowns, they actually possess a solution space of constant dimension (i.e., independent of $n$).

To find the solutions of $P^{\otimes \ell}\,\mathrm{vec}(X) = \mathrm{vec}(X)$, where $X \in \mathbb{R}^{n^\ell}$, note that $P^{\otimes \ell}\,\mathrm{vec}(X) = \mathrm{vec}(Q \star X)$, where $Q = P^T$. As above, the tensor $Q \star X$ is the tensor that results from renumbering the nodes in $\mathcal{V}$ using the permutation $Q$. Equivalently, the fixed-point equations we need to solve can be formulated as

$$Q \star X = X, \quad \text{for all permutation matrices } Q. \tag{7}$$

The permutation group acts on tensors $X \in \mathbb{R}^{n^\ell}$ with the action $X \mapsto Q \star X$. We are looking for fixed points under this action. To that end, let us define an equivalence relation on the index space of tensors $\mathbb{R}^{n^\ell}$, namely on $[n]^\ell$, where with a slight abuse of notation (we use light brackets) we set $[n] = \{1, 2, \ldots, n\}$. For multi-indices $a, b \in [n]^\ell$ we set $a \sim b$ iff $a$ and $b$ have the same equality pattern, that is, $a_i = a_j \Leftrightarrow b_i = b_j$ for all $i, j \in [\ell]$.

The equality pattern equivalence relation partitions the index set $[n]^\ell$ into equivalence classes, the collection of which is denoted $[n]^\ell/\sim$. Each equivalence class can be represented by a unique partition of the set $[\ell]$, where each set in the partition indicates a maximal set of identical values. Let us exemplify. For $\ell = 2$ we have two equivalence classes, $\gamma_1 = \{\{1\}, \{2\}\}$ and $\gamma_2 = \{\{1, 2\}\}$; $\gamma_1$ represents all multi-indices $(i, j)$ with $i \neq j$, while $\gamma_2$ represents all multi-indices $(i, j)$ with $i = j$. For $\ell = 4$, there are 15 equivalence classes, $\gamma_1 = \{\{1\}, \{2\}, \{3\}, \{4\}\}$, $\gamma_2 = \{\{1\}, \{2\}, \{3, 4\}\}$, $\gamma_3 = \{\{1, 2\}, \{3\}, \{4\}\}, \ldots$; for instance, $\gamma_3$ represents multi-indices $(i_1, i_2, i_3, i_4)$ such that $i_1 = i_2$, $i_2 \neq i_3$, $i_3 \neq i_4$, $i_2 \neq i_4$.

For each equivalence class $\gamma \in [n]^\ell/\sim$ we define an order-$\ell$ tensor $B^\gamma \in \mathbb{R}^{n^\ell}$ by setting

$$B^\gamma_a = \begin{cases} 1 & a \in \gamma \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Since we have a tensor $B^\gamma$ for every equivalence class $\gamma$, and the equivalence classes are in one-to-one correspondence with partitions of the set $[\ell]$, we have $b(\ell)$ tensors $B^\gamma$. (Recall that $b(\ell)$ denotes the $\ell$-th Bell number.) We next prove:

**Proposition 2.** *The tensors $B^\gamma$ in equation 8 form an orthogonal basis (in the standard inner product) of the solution set of equations 7. The dimension of the solution set is therefore $b(\ell)$.*

*Proof.* Let us first show that $X$ is a solution to equation 7 iff $X$ is constant on the equivalence classes of the equality pattern relation $\sim$. Since a permutation $q : [n] \to [n]$ is a bijection, the equality patterns of $a = (i_1, i_2, \ldots, i_\ell) \in [n]^\ell$ and $q(a) = (q(i_1), q(i_2), \ldots, q(i_\ell)) \in [n]^\ell$ are identical, i.e., $a \sim q(a)$. Taking the $a \in [n]^\ell$ entry of both sides of equation 7 gives $X_{q(a)} = X_a$. Now, if $X$ is constant on equivalence classes then in particular it has the same value at $a$ and $q(a)$ for all $a \in [n]^\ell$ and permutations $q$; therefore $X$ is a solution to equation 7. For the only-if part, consider a tensor $X$ for which there exist multi-indices $a \sim b$ (with identical equality patterns) such that $X_a \neq X_b$; then $X$ is not a solution to equation 7. Indeed, since $a \sim b$ one can find a permutation $q$ so that $b = q(a)$, and using the equation above, $X_b = X_{q(a)} = X_a$, which leads to a contradiction. To finish the proof, note that any tensor $X$ that is constant on equivalence classes can be written as a linear combination of the $B^\gamma$, which are merely the indicators of the equivalence classes. Furthermore, the tensors $B^\gamma$ have pairwise disjoint supports and therefore form an orthogonal basis.
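As a concrete illustration of equation 8, the basis tensors $B^\gamma$ can be enumerated directly from equality patterns. The sketch below (our own helper code, assuming numpy; the names are not from the paper) groups the multi-indices of $[n]^\ell$ by equality pattern, builds the indicator tensors, and checks both the count $b(\ell)$ and the fixed-point property of equation 7:

```python
import itertools
from collections import defaultdict
import numpy as np

def equality_pattern(index):
    """Canonical label of the equality pattern of a multi-index,
    e.g. (3, 5, 3) and (1, 2, 1) both map to (0, 1, 0)."""
    seen = {}
    return tuple(seen.setdefault(v, len(seen)) for v in index)

def pattern_basis(n, order):
    """Indicator tensors B^gamma of equation 8, one per equality pattern."""
    classes = defaultdict(list)
    for idx in itertools.product(range(n), repeat=order):
        classes[equality_pattern(idx)].append(idx)
    basis = []
    for indices in classes.values():
        B = np.zeros((n,) * order)
        for idx in indices:
            B[idx] = 1.0
        basis.append(B)
    return basis

def permute_tensor(X, perm):
    """(Q * X): entry (q(i1),...,q(il)) of the result equals entry (i1,...,il) of X."""
    Y = np.empty_like(X)
    for idx in itertools.product(range(X.shape[0]), repeat=X.ndim):
        Y[tuple(perm[i] for i in idx)] = X[idx]
    return Y

if __name__ == "__main__":
    n, order = 4, 4
    basis = pattern_basis(n, order)
    print(len(basis))  # 15 = b(4), the 15 equivariant layers for edge-value data
    # every basis tensor solves the fixed-point equation Q * X = X
    for perm in itertools.permutations(range(n)):
        assert all(np.array_equal(permute_tensor(B, perm), B) for B in basis)
```

The canonical pattern labels used here correspond exactly to the partitions of $[\ell]$ described above.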
Combining propositions 1 and 2 we get the characterization of invariant and equivariant linear layers acting on general $k$-order tensor data over a single node set $\mathcal{V}$:

**Theorem 1.** *The space of invariant (equivariant) linear layers $\mathbb{R}^{n^k} \to \mathbb{R}$ ($\mathbb{R}^{n^k} \to \mathbb{R}^{n^k}$) is of dimension $b(k)$ ($b(2k)$), with basis elements $B^\gamma$ defined in equation 8, where $\gamma$ are equivalence classes in $[n]^k/\sim$ ($[n]^{2k}/\sim$).*

**Biases.** Theorem 1 deals with purely linear layers, that is, without bias, i.e., without a constant part. Nevertheless, extending the previous analysis to constant layers is straightforward. First, any constant layer $\mathbb{R}^{n^k} \to \mathbb{R}$ is also invariant, so all constant invariant layers are represented by constants $c \in \mathbb{R}$. For equivariant layers $L : \mathbb{R}^{n^k} \to \mathbb{R}^{n^k}$ producing a constant output $C$, we note that equivariance means $C = L(P \star \mathcal{A}) = P \star L(\mathcal{A}) = P \star C$. Representing this equation in matrix form we get $P^{T \otimes k}\,\mathrm{vec}(C) = \mathrm{vec}(C)$. This shows that constant equivariant layers on one node set acting on general $k$-order tensors are also characterized by the fixed-point equations, and in fact have the same form and dimensionality as invariant layers on $k$-order tensors; see equation 3. Specifically, their basis is $B^\lambda$, $\lambda \in [n]^k/\sim$. For example, for $k = 2$, the biases are shown on the right in figure 1.

**Features.** It is quite common that input tensors have vector values (i.e., features) attached to each hyper-edge ($k$-tuple of nodes) in $\mathcal{V}$, that is, $\mathcal{A} \in \mathbb{R}^{n^k \times d}$. Linear invariant layers $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{1 \times d'}$ or equivariant layers $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d'}$ can now be formulated using a slight generalization of the previous analysis. The operator $P \star \mathcal{A}$ is defined to act only on the nodal indices, i.e., $i_1, \ldots, i_k$ (the first $k$ indices). Explicitly, the $(p(i_1), p(i_2), \ldots, p(i_k), i_{k+1})$-th entry of $P \star \mathcal{A}$ equals the $(i_1, i_2, \ldots, i_k, i_{k+1})$-th entry of $\mathcal{A}$. Invariance is now formulated exactly as before, equation 5, namely $L\,\mathrm{vec}(P \star \mathcal{A}) = L\,\mathrm{vec}(\mathcal{A})$. The matrix that corresponds to $P\star$ acting on $\mathbb{R}^{n^k \times d}$ in the standard basis is $P^{T \otimes k} \otimes I_d$, and therefore $L(P^{T \otimes k} \otimes I_d)\,\mathrm{vec}(\mathcal{A}) = L\,\mathrm{vec}(\mathcal{A})$. Since this is true for all $\mathcal{A}$ we have, using the properties of the Kronecker product,

$$\left(P^{\otimes k} \otimes I_d \otimes I_{d'}\right)\mathrm{vec}(L) = \mathrm{vec}(L).$$

Equivariance is written as in equation 6, $[L\,\mathrm{vec}(P \star \mathcal{A})] = P \star [L\,\mathrm{vec}(\mathcal{A})]$. In matrix form, the equivariance equation becomes $L(P^{T \otimes k} \otimes I_d)\,\mathrm{vec}(\mathcal{A}) = (P^{T \otimes k} \otimes I_{d'})\,L\,\mathrm{vec}(\mathcal{A})$; since this is true for all $\mathcal{A}$, and using the properties of the Kronecker product again, we get

$$\left(P^{\otimes k} \otimes I_d \otimes P^{\otimes k} \otimes I_{d'}\right)\mathrm{vec}(L) = \mathrm{vec}(L).$$

The basis (with biases) of the solution space of these fixed-point equations is defined as follows. We use $a, b \in [n]^k$, $i, j \in [d]$, $i', j' \in [d']$, $\lambda \in [n]^k/\sim$, $\mu \in [n]^{2k}/\sim$:

$$\text{invariant: } B^{\lambda,j,j'}_{a,i,i'} = \begin{cases} 1 & a \in \lambda,\ i = j,\ i' = j' \\ 0 & \text{otherwise} \end{cases}; \qquad C^{j'}_{i'} = \begin{cases} 1 & i' = j' \\ 0 & \text{otherwise} \end{cases} \tag{9a}$$

$$\text{equivariant: } B^{\mu,j,j'}_{a,i,b,i'} = \begin{cases} 1 & (a, b) \in \mu,\ i = j,\ i' = j' \\ 0 & \text{otherwise} \end{cases}; \qquad C^{\lambda,j'}_{b,i'} = \begin{cases} 1 & b \in \lambda,\ i' = j' \\ 0 & \text{otherwise} \end{cases} \tag{9b}$$

Note that these basis elements are similar to the ones in equation 8, with the difference that we have a different basis tensor for each pair of input feature channel $j$ and output feature channel $j'$.
An invariant (equation 10a) / equivariant (equation 10b) linear layer $L$, including the biases, can be written as follows for input $\mathcal{A} \in \mathbb{R}^{n^k \times d}$:

$$L(\mathcal{A})_{i'} = \sum_{a,i} T_{a,i,i'}\,\mathcal{A}_{a,i} + Y_{i'}; \qquad T = \sum_{\lambda,j,j'} w_{\lambda,j,j'}\, B^{\lambda,j,j'}; \qquad Y = \sum_{j'} b_{j'}\, C^{j'} \tag{10a}$$

$$L(\mathcal{A})_{b,i'} = \sum_{a,i} T_{a,i,b,i'}\,\mathcal{A}_{a,i} + Y_{b,i'}; \qquad T = \sum_{\mu,j,j'} w_{\mu,j,j'}\, B^{\mu,j,j'}; \qquad Y = \sum_{\lambda,j'} b_{\lambda,j'}\, C^{\lambda,j'} \tag{10b}$$

where the learnable parameters are $w \in \mathbb{R}^{b(k) \times d \times d'}$ and $b \in \mathbb{R}^{d'}$ for a single linear invariant layer $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{d'}$, and $w \in \mathbb{R}^{b(2k) \times d \times d'}$ and $b \in \mathbb{R}^{b(k) \times d'}$ for a single linear equivariant layer $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d'}$. The natural generalization of theorem 1 to include bias and features is therefore:

**Theorem 2.** *The space of invariant (equivariant) linear layers $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{d'}$ ($\mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d'}$) is of dimension $d d'\, b(k) + d'$ (for equivariant: $d d'\, b(2k) + d'\, b(k)$) with basis elements defined in equation 9; equations 10a (10b) show the general form of such layers.*

Since, by arguments similar to proposition 2, the purely linear parts $B$ and biases $C$ in equation 9 are independent solutions of the relevant fixed-point equations, theorem 2 will be proved if their number equals the dimension of the solution space of these fixed-point equations, namely $d d'\, b(k)$ for the purely linear part and $d'$ for the bias in the invariant case, and $d d'\, b(2k)$ for the purely linear part and $d'\, b(k)$ for the bias in the equivariant case. This can be shown by repeating the arguments of the proof of proposition 2, slightly adapted to this case, or by a combinatorial identity we show in Appendix B.

For example, figure 1 depicts the 15 basis elements for linear equivariant layers $\mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ taking as input edge-value (order-2) tensor data $\mathcal{A} \in \mathbb{R}^{n \times n}$ and outputting a tensor of the same dimensions. The basis elements for the purely linear part are shown as $n^2 \times n^2$ matrices, while the bias part is shown as $n \times n$ matrices (far right); the size of the node set is $|\mathcal{V}| = n = 5$.

**Mixed order equivariant layers.** Another useful generalization of order equivariant linear layers is to linear layers between tensors of different orders, that is, $L : \mathbb{R}^{n^k} \to \mathbb{R}^{n^l}$, where $l \neq k$. For example, one can think of a layer mapping an adjacency matrix to per-node features. For simplicity we will discuss the purely linear scalar-valued case; generalization to include bias and/or general feature vectors can be done as discussed above. Consider the matrix $L \in \mathbb{R}^{n^l \times n^k}$ representing the linear layer $L$. Using the renumbering operator $P\star$, order equivariance is equivalent to $[L\,\mathrm{vec}(P \star \mathcal{A})] = P \star [L\,\mathrm{vec}(\mathcal{A})]$. Note that while this equation looks identical to equation 6, it is nevertheless different in the sense that the $P\star$ operator on the l.h.s. of this equation acts on $k$-order tensors while the one on the r.h.s. acts on $l$-order tensors. Still, we can transform this equation into a matrix equation as before by remembering that $P^{T \otimes k}$ is the matrix representation of the renumbering operator $P\star$ acting on $k$-order tensors in the standard basis. Therefore, repeating the arguments in the proof of proposition 1, equivariance is equivalent to $P^{\otimes (k+l)}\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for all permutation matrices $P$. This equation is solved as in section 3.1. The corresponding bases of such equivariant layers are computed as in equation 9b, with the only difference that now $a \in [n]^k$, $b \in [n]^l$, and $\mu \in [n]^{k+l}/\sim$.

## 4 EXPERIMENTS

**Implementation details.** We implemented our method in TensorFlow (Abadi et al., 2016). The equivariant linear basis was implemented efficiently using basic row/column/diagonal summation operators; see Appendix A for details.
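For example, in the edge-value case $k = 2$ the invariant layer of equation 10a has only $b(2) = 2$ purely linear basis elements per feature pair, which reduce to summing the diagonal entries and the off-diagonal entries of each input channel. A minimal numpy sketch of such an invariant layer (our own illustration, not the paper's TensorFlow code) is:

```python
import numpy as np

def invariant_layer(A, w_diag, w_off, bias):
    """Invariant linear layer R^{n x n x d} -> R^{d'} for k = 2 (equation 10a).

    The b(2) = 2 basis elements correspond to summing the diagonal and the
    off-diagonal entries of each input feature channel; w_diag, w_off are
    (d, d') weight matrices and bias is a (d',) vector.
    """
    diag_sum = np.einsum('iij->j', A)        # sum of diagonal entries per channel
    total_sum = A.sum(axis=(0, 1))           # sum of all entries per channel
    off_sum = total_sum - diag_sum           # sum of off-diagonal entries per channel
    return diag_sum @ w_diag + off_sum @ w_off + bias

# Permutation invariance check: relabeling the nodes does not change the output.
rng = np.random.default_rng(0)
n, d, d_out = 6, 3, 4
A = rng.normal(size=(n, n, d))
w1, w2, b = rng.normal(size=(d, d_out)), rng.normal(size=(d, d_out)), rng.normal(size=d_out)
perm = rng.permutation(n)
A_perm = A[perm][:, perm]                    # P^T A P applied channel-wise
assert np.allclose(invariant_layer(A, w1, w2, b),
                   invariant_layer(A_perm, w1, w2, b))
```

The parameter count matches theorem 2: two $d \times d'$ weight matrices ($d d'\, b(2)$ parameters) plus a $d'$-dimensional bias.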
The networks we used are a composition of 1–4 equivariant linear layers with ReLU activations between them for the equivariant function setting. For the invariant function setting we further added a max over the invariant basis and 1–3 fully-connected layers with ReLU activations.

**Table 1: Comparison to baseline methods on the synthetic experiments.**

| Task | # layers | Trivial predictor | Hartford et al. (2018) | Ours |
|---|---|---|---|---|
| Symmetric projection | 1 | 4.17 | 2.09 | 1E-05 |
| | 2 | 4.17 | 2.09 | 7E-06 |
| | 3 | 4.17 | 2.09 | 2E-05 |
| Diagonal extraction | 1 | 0.21 | 0.81 | 8E-06 |
| | 2 | 0.21 | 0.81 | 7E-06 |
| | 3 | 0.21 | 0.81 | 1E-04 |
| Max singular vector | 1 | 0.025 | 0.043 | 0.015 |
| | 2 | 0.025 | 0.044 | 0.0084 |
| | 3 | 0.025 | 0.043 | 0.0054 |
| | 4 | 0.025 | 0.043 | 0.0016 |
| Trace | 1 | 333.33 | 316.22 | 0.005 |
| | 2 | 333.33 | 311.55 | 0.001 |
| | 3 | 333.33 | 307.97 | 0.003 |

**Synthetic datasets.** We tested our method on several synthetic equivariant and invariant graph functions that highlight the differences in expressivity between our linear basis and the basis of Hartford et al. (2018). Given an input matrix $A \in \mathbb{R}^{n \times n}$ we considered: (i) projection onto the symmetric matrices, $\frac{1}{2}(A + A^T)$; (ii) diagonal extraction, $\mathrm{diag}(\mathrm{diag}(A))$ (keeps only the diagonal and puts zeros elsewhere); (iii) computing the maximal right singular vector, $\arg\max_{\|v\|_2=1} \|Av\|_2$; and (iv) computing the trace, $\mathrm{tr}(A)$. Tasks (i)-(iii) are equivariant while task (iv) is invariant. We created accordingly 4 datasets with 10K train and 1K test examples of $40 \times 40$ matrices. For tasks (i), (ii), (iv) we used i.i.d. random matrices with uniform distribution in $[0, 10]$ and mean-squared error (MSE) as the loss; for task (iii) we used random matrices with uniform distribution of singular values in $[0, 0.5]$ and spectral gap $0.5$; due to the sign ambiguity in this task we used a cosine loss of the form $l(x, y) = 1 - \langle x/\|x\|,\, y/\|y\| \rangle^2$.

We trained networks with 1, 2, and 3 hidden layers with 8 feature channels each and a single fully-connected layer. Both our models and those of Hartford et al. (2018) use the same architecture but with different bases for the linear layers. Table 1 logs the best mean-squared error of each method over a set of hyper-parameters. We add the MSE of the trivial mean predictor.

**Table 2: Generalization to unseen graph sizes (networks trained on $n = 40$).**

| Task | n = 30 | n = 40 | n = 50 |
|---|---|---|---|
| sym | 0.0053 | 3.8E-05 | 0.0013 |
| svd | 0.0108 | 0.0084 | 0.0096 |
| diag | 0.0150 | 1.5E-05 | 0.0055 |

This experiment emphasizes simple cases in which the additional parameters in our model, with respect to Hartford et al. (2018), are needed. We note that Hartford et al. (2018) target a different scenario where the permutations acting on the rows and columns of the input matrix are not necessarily the same. The assumption taken in this paper, namely that the same permutation acts on both rows and columns, gives rise to additional parameters that are associated with the diagonal and with the transpose of the matrix (for a complete list of layers for the $k = 2$ case see Appendix A). In case of an input matrix that represents a graph, these parameters can be understood as parameters that control self-edges or node features, and incoming/outgoing edges, in different ways. Table 2 shows the result of applying the learned equivariant networks from the above experiment to graphs (matrices) of unseen sizes, $n = 30$ and $n = 50$. Note that although the network was trained on a fixed size, it provides plausible generalization to graphs of different sizes. We note that the invariant task of computing the trace did not generalize well to unseen sizes and probably requires training on different sizes, as was done in the datasets below.
**Table 3: Graph classification results.**

| dataset | MUTAG | PTC | PROTEINS | NCI1 | NCI109 | COLLAB | IMDB-B | IMDB-M |
|---|---|---|---|---|---|---|---|---|
| size | 188 | 344 | 1113 | 4110 | 4127 | 5000 | 1000 | 1500 |
| classes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3 |
| avg node # | 17.9 | 25.5 | 39.1 | 29.8 | 29.6 | 74.4 | 19.7 | 13 |
| DGCNN | 85.83 ± 1.7 | 58.59 ± 2.5 | 75.54 ± 0.9 | 74.44 ± 0.5 | NA | 73.76 ± 0.5 | 70.03 ± 0.9 | 47.83 ± 0.9 |
| PSCN (k=10) | 88.95 ± 4.4 | 62.29 ± 5.7 | 75 ± 2.5 | 76.34 ± 1.7 | NA | 72.6 ± 2.2 | 71 ± 2.3 | 45.23 ± 2.8 |
| DCNN | NA | NA | 61.29 ± 1.6 | 56.61 ± 1.0 | NA | 52.11 ± 0.7 | 49.06 ± 1.4 | 33.49 ± 1.4 |
| ECC | 76.11 | NA | NA | 76.82 | 75.03 | NA | NA | NA |
| DGK | 87.44 ± 2.7 | 60.08 ± 2.6 | 75.68 ± 0.5 | 80.31 ± 0.5 | 80.32 ± 0.3 | 73.09 ± 0.3 | 66.96 ± 0.6 | 44.55 ± 0.5 |
| DiffPool | NA | NA | 78.1 | NA | NA | 75.5 | NA | NA |
| CCN | 91.64 ± 7.2 | 70.62 ± 7.0 | NA | 76.27 ± 4.1 | 75.54 ± 3.4 | NA | NA | NA |
| GK | 81.39 ± 1.7 | 55.65 ± 0.5 | 71.39 ± 0.3 | 62.49 ± 0.3 | 62.35 ± 0.3 | NA | NA | NA |
| RW | 79.17 ± 2.1 | 55.91 ± 0.3 | 59.57 ± 0.1 | > 3 days | NA | NA | NA | NA |
| PK | 76 ± 2.7 | 59.5 ± 2.4 | 73.68 ± 0.7 | 82.54 ± 0.5 | NA | NA | NA | NA |
| WL | 84.11 ± 1.9 | 57.97 ± 2.5 | 74.68 ± 0.5 | 84.46 ± 0.5 | 85.12 ± 0.3 | NA | NA | NA |
| FGSD | 92.12 | 62.80 | 73.42 | 79.80 | 78.84 | 80.02 | 73.62 | 52.41 |
| AWE-DD | NA | NA | NA | NA | NA | 73.93 ± 1.9 | 74.45 ± 5.8 | 51.54 ± 3.6 |
| AWE-FB | 87.87 ± 9.7 | NA | NA | NA | NA | 70.99 ± 1.4 | 73.13 ± 3.2 | 51.58 ± 4.6 |
| ours | 84.61 ± 10 | 59.47 ± 7.3 | 75.19 ± 4.3 | 73.71 ± 2.6 | 72.48 ± 2.5 | 77.92 ± 1.7 | 71.27 ± 4.5 | 48.55 ± 3.9 |

**Graph classification.** We tested our method on standard benchmarks of graph classification. We use 8 different real-world datasets from the benchmark of Yanardag & Vishwanathan (2015): five of these datasets originate from bioinformatics while the other three come from social networks. In all datasets the adjacency matrix of each graph is used as input and a categorical label is assigned as output. In the bioinformatics datasets node labels are also provided as inputs; these node labels can be used in our framework by placing their 1-hot representations on the diagonal of the input. Table 3 specifies the results for our method compared to state-of-the-art deep and non-deep graph learning methods. We follow the evaluation protocol, including the 10-fold splits, of Zhang et al. (2018). For each dataset we selected learning and decay rates on one random fold. In all experiments we used a fixed simple architecture of 3 layers with (16, 32, 256) features, respectively. The last equivariant layer is followed by an invariant max layer according to the invariant basis. We then add two fully-connected hidden layers with (512, 256) features. We compared our results to seven deep learning methods: DGCNN (Zhang et al., 2018), PSCN (Niepert et al., 2016), DCNN (Atwood & Towsley, 2016), ECC (Simonovsky & Komodakis, 2017), DGK (Yanardag & Vishwanathan, 2015), DiffPool (Ying et al., 2018) and CCN (Kondor et al., 2018). We also compare our results to four popular graph kernel methods: the Graphlet Kernel (GK) (Shervashidze et al., 2009), the Random Walk Kernel (RW) (Vishwanathan et al., 2010), the Propagation Kernel (PK) (Neumann et al., 2016), and Weisfeiler-Lehman kernels (WL) (Shervashidze et al., 2011), and to two recent feature-based methods: Family of Graph Spectral Distances (FGSD) (Verma & Zhang, 2017) and Anonymous Walk Embeddings (AWE) (Ivanov & Burnaev, 2018). Our method achieved results comparable to the state-of-the-art on the three social network datasets, and slightly worse results than the state-of-the-art on the biological datasets.

## 5 GENERALIZATIONS TO MULTI-NODE SETS

Lastly, we provide a generalization of our framework to data that is given on tuples of nodes from a collection of node sets $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_m$ of sizes $n_1, n_2, \ldots, n_m$ (resp.), namely $\mathcal{A} \in \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}}$.
We characterize invariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times \cdots \times n_m^{k_m}} \to \mathbb{R}$ and equivariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times \cdots \times n_m^{k_m}} \to \mathbb{R}^{n_1^{l_1} \times \cdots \times n_m^{l_m}}$, where for simplicity we do not discuss features, which can be readily added as discussed in section 3. Note that the case of $k_i = l_i = 1$ for all $i = 1, \ldots, m$ is treated in Hartford et al. (2018). The reordering operator is now built out of permutation matrices $P_i \in \mathbb{R}^{n_i \times n_i}$ ($p_i$ denotes the corresponding permutation), $i = 1, \ldots, m$, denoted $P_{1:m}\star$, and defined as follows: the $(p_1(a_1), p_2(a_2), \ldots, p_m(a_m))$-th entry of the tensor $P_{1:m} \star \mathcal{A}$, where $a_i \in [n_i]^{k_i}$, is defined to be the $(a_1, a_2, \ldots, a_m)$-th entry of the tensor $\mathcal{A}$. Rewriting the invariant and equivariant equations, i.e., equations 5 and 6, in matrix form, similarly to before, we get the fixed-point equations $M\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for invariant layers, and $(M' \otimes M)\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for equivariant layers, where $M = P_1^{\otimes k_1} \otimes \cdots \otimes P_m^{\otimes k_m}$ and $M' = P_1^{\otimes l_1} \otimes \cdots \otimes P_m^{\otimes l_m}$. The solutions of these equations are linear combinations of basis tensors similar to equation 9, of the form

$$\text{invariant: } B^{\lambda_1,\ldots,\lambda_m}_{a_1,\ldots,a_m} = \begin{cases} 1 & a_i \in \lambda_i,\ \forall i \\ 0 & \text{otherwise} \end{cases}; \qquad \text{equivariant: } B^{\mu_1,\ldots,\mu_m}_{a_1,\ldots,a_m,b_1,\ldots,b_m} = \begin{cases} 1 & (a_i, b_i) \in \mu_i,\ \forall i \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $\lambda_i \in [n_i]^{k_i}/\sim$, $\mu_i \in [n_i]^{k_i + l_i}/\sim$, $a_i \in [n_i]^{k_i}$, $b_i \in [n_i]^{l_i}$. The number of these tensors is $\prod_{i=1}^m b(k_i)$ for invariant layers and $\prod_{i=1}^m b(k_i + l_i)$ for equivariant layers. Since these are all linearly independent (they have pairwise disjoint supports of non-zero entries), we need to show that their number equals the dimension of the solution space of the relevant fixed-point equations above. This can be done again by arguments similar to the proof of proposition 2, or as shown in Appendix B. To summarize:

**Theorem 3.** *The linear space of invariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}} \to \mathbb{R}$ is of dimension $\prod_{i=1}^m b(k_i)$. The space of equivariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times \cdots \times n_m^{k_m}} \to \mathbb{R}^{n_1^{l_1} \times \cdots \times n_m^{l_m}}$ has dimension $\prod_{i=1}^m b(k_i + l_i)$. Orthogonal bases for these layers are listed in equation 11.*

## ACKNOWLEDGMENTS

This research was supported in part by the European Research Council (ERC Consolidator Grant, LiftMatch 771136) and the Israel Science Foundation (Grant No. 1830/17).

## REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001, 2016.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. pp. 1–14, 2013. URL http://arxiv.org/abs/1312.6203.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999, 2016a.

Taco S. Cohen and Max Welling. Steerable CNNs. 2016b. URL http://arxiv.org/abs/1612.08498.

Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
William Fulton and Joe Harris. Representation Theory: A First Course, volume 129. Springer Science & Business Media, 2013.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272, 2017.

Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. Proceedings of the International Joint Conference on Neural Networks, 2:729–734, 2005. doi: 10.1109/IJCNN.2005.1555942.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

Jason S. Hartford, Devon R. Graham, Kevin Leyton-Brown, and Siamak Ravanbakhsh. Deep models of interactions across sets. In ICML, 2018.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. 2015. URL http://arxiv.org/abs/1506.05163.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. 2017. URL http://arxiv.org/abs/1705.07664.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. 2015. URL http://arxiv.org/abs/1511.05493.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, 2017.

Marion Neumann, Roman Garnett, Christian Bauckhage, and Kristian Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. 2016.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.

Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389, 2017.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. doi: 10.1109/TNN.2008.2005605.

Nino Shervashidze, S.V.N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495, 2009.

Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi: 10.1109/CVPR.2017.11.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. 2017. URL http://arxiv.org/abs/1710.10903.

Saurabh Verma and Zhi-Li Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pp. 88–98, 2017.

S. Vichy N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. 2018. URL http://arxiv.org/abs/1807.02547.

Pinar Yanardag and S.V.N. Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 2015. doi: 10.1145/2783258.2783417.

Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. 2018. URL http://arxiv.org/abs/1806.08804.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.

Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

## APPENDIX A EFFICIENT IMPLEMENTATION OF LAYERS

For fast execution of order-2 layers we implemented the following 15 operations, which can easily be shown to span the basis discussed in the paper. We denote by $\mathbf{1} \in \mathbb{R}^n$ the vector of all ones.

1. The identity and transpose operations: $L(A) = A$, $L(A) = A^T$.
2. The diag operation: $L(A) = \mathrm{diag}(\mathrm{diag}(A))$.
3. Sum of rows replicated on rows / columns / diagonal: $L(A) = A\mathbf{1}\mathbf{1}^T$, $L(A) = \mathbf{1}(A\mathbf{1})^T$, $L(A) = \mathrm{diag}(A\mathbf{1})$.
4. Sum of columns replicated on rows / columns / diagonal: $L(A) = A^T\mathbf{1}\mathbf{1}^T$, $L(A) = \mathbf{1}(A^T\mathbf{1})^T$, $L(A) = \mathrm{diag}(A^T\mathbf{1})$.
5. Sum of all elements replicated on the whole matrix / diagonal: $L(A) = (\mathbf{1}^T A \mathbf{1})\,\mathbf{1}\mathbf{1}^T$, $L(A) = (\mathbf{1}^T A \mathbf{1})\,\mathrm{diag}(\mathbf{1})$.
6. Sum of diagonal elements replicated on the whole matrix / diagonal: $L(A) = (\mathbf{1}^T \mathrm{diag}(A))\,\mathbf{1}\mathbf{1}^T$, $L(A) = (\mathbf{1}^T \mathrm{diag}(A))\,\mathrm{diag}(\mathbf{1})$.
7. Replicate diagonal elements on rows / columns: $L(A) = \mathrm{diag}(A)\mathbf{1}^T$, $L(A) = \mathbf{1}\,\mathrm{diag}(A)^T$.
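A direct numpy transcription of these 15 operations for a single feature channel could look as follows (a sketch for illustration only; the normalization mentioned below and the actual TensorFlow implementation are omitted):

```python
import numpy as np

def ops15(A):
    """The 15 basis operations for order-2 equivariant layers, listed above.

    A is an n x n matrix (single feature channel); returns a list of 15
    n x n matrices whose span equals the span of the basis in Figure 1.
    """
    n = A.shape[0]
    one = np.ones((n, 1))
    row_sum = A @ one                  # column vector of row sums, A 1
    col_sum = A.T @ one                # column vector of column sums, A^T 1
    tot = (one.T @ A @ one).item()     # sum of all entries, 1^T A 1
    dg = np.diag(A).reshape(n, 1)      # diagonal as a column vector
    dg_sum = dg.sum()                  # sum of diagonal entries
    return [
        A,                             # 1. identity
        A.T,                           #    transpose
        np.diag(np.diag(A)),           # 2. keep the diagonal only
        row_sum @ one.T,               # 3. row sums replicated: A 1 1^T
        one @ row_sum.T,               #    1 (A 1)^T
        np.diagflat(row_sum),          #    diag(A 1)
        col_sum @ one.T,               # 4. column sums replicated: A^T 1 1^T
        one @ col_sum.T,               #    1 (A^T 1)^T
        np.diagflat(col_sum),          #    diag(A^T 1)
        tot * np.ones((n, n)),         # 5. total sum on the whole matrix
        tot * np.eye(n),               #    total sum on the diagonal
        dg_sum * np.ones((n, n)),      # 6. diagonal sum on the whole matrix
        dg_sum * np.eye(n),            #    diagonal sum on the diagonal
        dg @ one.T,                    # 7. diagonal replicated on rows
        one @ dg.T,                    #    diagonal replicated on columns
    ]
```

An equivariant layer with $d$ input and $d'$ output channels is then a learned linear combination of these outputs per input/output channel pair, plus the two bias basis elements of equation 9b.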
We normalize each operation to have unit max operator norm. We note that in case the input matrix is symmetric, our basis reduces to 11 elements in the first layer. If we further assume the matrix has zero diagonal, we get a 6-element basis in the first layer. In both cases our model is more expressive than the 4-element basis of Hartford et al. (2018), and since the output of the first layer (or other inner states) need not be symmetric nor have zero diagonal, the deeper layers can potentially make good use of the full 15-element basis.

## APPENDIX B INVARIANT AND EQUIVARIANT SUBSPACE DIMENSIONS

We prove a useful combinatorial fact as a corollary of proposition 2. This fact will be used to easily compute the dimensions of more general spaces of invariant and equivariant linear layers. We use the fact that if $V$ is a representation of a finite group $G$ then

$$\phi = \frac{1}{|G|} \sum_{g \in G} g \ \in\ \mathrm{End}(V) \tag{12}$$

is a projection onto $V^G = \{v \in V \mid gv = v,\ \forall g \in G\}$, the subspace of fixed points in $V$ under the action of $G$, and consequently that $\mathrm{tr}(\phi) = \dim(V^G)$ (see Fulton & Harris (2013) for simple proofs).

**Proposition 3.** *The following formula holds: $\frac{1}{n!}\sum_{P \in \Pi_n} \mathrm{tr}(P)^k = b(k)$, where $\Pi_n$ is the group of $n \times n$ permutation matrices.*

*Proof.* In our case, the vector space is the space of order-$k$ tensors and the group acting on it is the matrix group $G = \{P^{\otimes k} \mid P \in \Pi_n\}$. Then

$$\dim(V^G) = \mathrm{tr}(\phi) = \frac{1}{|G|}\sum_{g \in G} \mathrm{tr}(g) = \frac{1}{n!}\sum_{P \in \Pi_n} \mathrm{tr}(P^{\otimes k}) = \frac{1}{n!}\sum_{P \in \Pi_n} \mathrm{tr}(P)^k,$$

where we used the multiplicative law of the trace with respect to the Kronecker product. Now we use proposition 2, noting that in this case $V^G$ is the solution space of the fixed-point equations. Therefore, $\dim(V^G) = b(k)$ and the proof is finished.

Recall that for a permutation matrix $P$, $\mathrm{tr}(P) = |\{i \in [n] \text{ s.t. } P \text{ fixes } e_i\}|$. Using this, we can interpret the equation in proposition 3 as the $k$-th moment of a random variable counting the number of fixed points of a permutation, with uniform distribution over the permutation group. Proposition 3 shows that the $k$-th moment of this random variable is the $k$-th Bell number.

We can now use proposition 3 to calculate the dimensions of two linear layer spaces: (i) equivariant layers acting on order-$k$ tensors with features (as in section 3); and (ii) multi-node sets (as in section 5).

**Theorem 2.** *The space of invariant (equivariant) linear layers $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{d'}$ ($\mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d'}$) is of dimension $d d'\, b(k) + d'$ (for equivariant: $d d'\, b(2k) + d'\, b(k)$) with basis elements defined in equation 9; equations 10a (10b) show the general form of such layers.*

*Proof.* We prove the dimension formula for the invariant case; the equivariant case is proved similarly. The solution space of the fixed-point equations is the set $V^G$ for the matrix group $G = \{P^{\otimes k} \otimes I_d \otimes I_{d'} \mid P \in \Pi_n\}$. Using the projection formula 12, the dimension of the solution subspace, which is the space of purely linear invariant layers, can be computed as follows:

$$\dim(V^G) = \frac{1}{n!}\sum_{P \in \Pi_n} \mathrm{tr}(P)^k\, \mathrm{tr}(I_d)\, \mathrm{tr}(I_{d'}) = \left(\frac{1}{n!}\sum_{P \in \Pi_n} \mathrm{tr}(P)^k\right) \mathrm{tr}(I_d)\, \mathrm{tr}(I_{d'}) = d\, d'\, b(k).$$

**Theorem 3.** *The linear space of invariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times n_2^{k_2} \times \cdots \times n_m^{k_m}} \to \mathbb{R}$ is of dimension $\prod_{i=1}^m b(k_i)$. The space of equivariant linear layers $L : \mathbb{R}^{n_1^{k_1} \times \cdots \times n_m^{k_m}} \to \mathbb{R}^{n_1^{l_1} \times \cdots \times n_m^{l_m}}$ has dimension $\prod_{i=1}^m b(k_i + l_i)$. Orthogonal bases for these layers are listed in equation 11.*

*Proof.* In this case the fixed-point equations are $M\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for invariant layers, and $(M' \otimes M)\,\mathrm{vec}(L) = \mathrm{vec}(L)$ for equivariant layers, where $M = P_1^{\otimes k_1} \otimes \cdots \otimes P_m^{\otimes k_m}$ and $M' = P_1^{\otimes l_1} \otimes \cdots \otimes P_m^{\otimes l_m}$. Similarly to the previous theorem, plugging these matrices into equation 12, using the trace multiplication rule and proposition 3, we get the above formulas.
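The identity in proposition 3 can also be verified directly for small $n$ by brute force over all permutations; the short script below (our own check, standard library only) compares the $k$-th moment of the number of fixed points with the Bell numbers computed from the usual recurrence:

```python
import itertools
from math import comb, factorial

def bell(k):
    """Bell numbers via the recurrence b(m+1) = sum_j C(m, j) b(j), b(0) = 1."""
    b = [1]
    for m in range(k):
        b.append(sum(comb(m, j) * b[j] for j in range(m + 1)))
    return b[k]

def moment_of_fixed_points(n, k):
    """(1/n!) * sum over all permutations of (#fixed points)^k,
    i.e. the left-hand side of Proposition 3, since tr(P) = #fixed points."""
    total = 0
    for perm in itertools.permutations(range(n)):
        fixed = sum(1 for i, p in enumerate(perm) if i == p)
        total += fixed ** k
    return total / factorial(n)

if __name__ == "__main__":
    for k in range(1, 5):
        # the identity holds whenever n >= k; here n = 6
        print(k, moment_of_fixed_points(6, k), bell(k))
```

As expected, the script prints $b(1), b(2), b(3), b(4) = 1, 2, 5, 15$.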
## APPENDIX C IMPLEMENTING MESSAGE PASSING WITH OUR MODEL

In this appendix we show that our model can approximate message passing layers as defined in Gilmer et al. (2017) to an arbitrary precision, and consequently that our model is able to approximate any network consisting of several such layers. The key idea is to mimic multiplication of features by the adjacency matrix, which allows summing over local neighborhoods. This can be implemented using our basis.

**Theorem 4.** *Our model can represent message passing layers to an arbitrary precision on compact sets.*

*Proof.* Consider input vertex data $H = (h_u) \in \mathbb{R}^{n \times d}$ ($n$ is the number of vertices in the graph, and $d$ is the input feature depth), the adjacency matrix $A = (a_{uv}) \in \mathbb{R}^{n \times n}$ of the graph, and additional edge features $E = (e_{uv}) \in \mathbb{R}^{n \times n \times l}$. Recall that a message passing layer of Gilmer et al. (2017) is of the form:

$$m^{t+1}_u = \sum_{v \in N(u)} M_t(h^t_u, h^t_v, e_{uv}) \tag{13a}$$

$$h^{t+1}_u = U_t(h^t_u, m^{t+1}_u) \tag{13b}$$

where $u, v$ are nodes in the graph, $h^t_u$ is the feature vector associated with $u$ in layer $t$, and $e_{uv}$ are additional edge features. We denote the number of output features of $M_t$ by $d'$. In our setting we represent this data using a tensor $Y \in \mathbb{R}^{n \times n \times (1+l+d)}$ where the first channel is the adjacency matrix $A$, the next $l$ channels are the edge features, and the last $d$ channels are diagonal matrices that hold $H$. Let us construct a message passing layer using our model:

1. Our first step is constructing an $n \times n \times (1+l+2d)$ tensor. In the first channels we put the adjacency matrix $A$ and the edge features $E$. In the next $d$ channels we replicate the features on the rows, and in the last $d$ channels we replicate the features on the columns. The output tensor $Z^1$ has the form $Z^1_{u,v} = [a_{uv}, e_{uv}, h^t_u, h^t_v]$.
2. Next, we copy the feature channels $[a_{uv}, e_{uv}, h^t_u]$ to the output tensor $Z^2$. We then apply a multilayer perceptron (MLP) on the last $l+2d$ feature dimensions of $Z^1$ that approximates $M_t$ (Hornik, 1991). The output tensor of this stage is $Z^2_{u,v} = [a_{uv}, e_{uv}, h^t_u, M_t(h^t_u, h^t_v, e_{uv}) + \epsilon_1]$.
3. Next, we would like to perform the point-wise multiplication $Z^2_{u,v,1} \cdot Z^2_{u,v,(l+d+2):\mathrm{end}}$. This step would zero out the outputs of $M_t$ for non-adjacent nodes $u, v$. As this point-wise multiplication is not a part of our framework, we can use an MLP on the feature dimension to approximate it and get $Z^3_{u,v} = [a_{uv}, e_{uv}, h^t_u, a_{uv} \cdot M_t(h^t_u, h^t_v, e_{uv}) + \epsilon_2]$.
4. As before, we copy the feature channels $[a_{uv}, e_{uv}, h^t_u]$. We now apply a sum over the rows (the $v$ dimension) on the $M_t$ output channels. We put the output of this sum on the diagonal of $Z^4$ in separate channels. We get $Z^4_{u,v} = [a_{uv}, e_{uv}, h^t_u, \delta_{uv} \sum_{w \in N(u)} M_t(h^t_u, h^t_w, e_{uw}) + \epsilon_3]$, where $\delta_{uv}$ is the Kronecker delta. We get a tensor $Z^4 \in \mathbb{R}^{n \times n \times (1+l+d+d')}$.
5. The last step is to apply an MLP to the last $d+d'$ feature channels of the diagonal of $Z^4$. After this last step we have $Z^5_{u,v} = [a_{uv}, e_{uv}, \delta_{uv} \cdot U_t(h^t_u, m^{t+1}_u) + \epsilon_4]$.

The errors $\epsilon_i$ depend on the approximation error of the MLP to the relevant function, the previous errors $\epsilon_{i-1}$ (for $i > 1$), and uniform bounds as well as uniform continuity of the approximated functions.

**Corollary 1.** *Our model can represent any message passing network to an arbitrary precision on compact sets. In other words, in terms of universality, our model is at least as powerful as any message passing neural network (MPNN) that falls into the framework of Gilmer et al. (2017).*
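To make the construction in the proof of theorem 4 concrete, the sketch below (our own illustration; the linear message function is an arbitrary stand-in for the MLP approximation of $M_t$, and the point-wise product of step 3 is computed exactly rather than approximated) carries out steps 1-4 for a single message passing round:

```python
import numpy as np

def aggregate_messages(A, H, E, message_fn):
    """Exact version of steps 1-4 of the proof of Theorem 4.

    A: (n, n) adjacency, H: (n, d) node features, E: (n, n, l) edge features.
    message_fn maps the concatenated per-edge features [e_uv, h_u, h_v] to d'
    message outputs; in the proof it is approximated by an MLP on the feature
    dimension. Returns an (n, n, d') tensor whose diagonal holds
    m_u = sum_v a_uv * M(h_u, h_v, e_uv), as in step 4.
    """
    n, d = H.shape
    H_rows = np.broadcast_to(H[:, None, :], (n, n, d))   # step 1: h_u replicated on rows
    H_cols = np.broadcast_to(H[None, :, :], (n, n, d))   #         h_v replicated on columns
    Z = np.concatenate([E, H_rows, H_cols], axis=-1)     # per-edge input [e_uv, h_u, h_v]
    M = message_fn(Z)                                    # step 2: (n, n, d') messages
    M = A[:, :, None] * M                                # step 3: mask by adjacency channel
    m = M.sum(axis=1)                                    # step 4: sum over neighbours v
    out = np.zeros((n, n, m.shape[-1]))
    out[np.arange(n), np.arange(n)] = m                  #         place m_u on the diagonal
    return out

# toy usage with a linear "message function" standing in for the learned MLP
rng = np.random.default_rng(1)
n, d, l, d_out = 5, 3, 2, 4
A = (rng.random((n, n)) < 0.4).astype(float)
H, E = rng.normal(size=(n, d)), rng.normal(size=(n, n, l))
W = rng.normal(size=(l + 2 * d, d_out))
Z4 = aggregate_messages(A, H, E, lambda Z: Z @ W)
```

Replicating features stored on the diagonal onto rows and columns (operation 7 of Appendix A) and placing a row sum on the diagonal (operation 3) are basis operations, while the point-wise product with the adjacency channel is the part the proof approximates with an MLP on the feature dimension.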