# ON UNIVERSAL EQUIVARIANT SET NETWORKS

Published as a conference paper at ICLR 2020

Nimrod Segol & Yaron Lipman
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
{nimrod.segol,yaron.lipman}@weizmann.ac.il

## ABSTRACT

Using deep neural networks that are either invariant or equivariant to permutations in order to learn functions on unordered sets has become prevalent. The most popular, basic models are DeepSets (Zaheer et al., 2017) and PointNet (Qi et al., 2017). While known to be universal for approximating invariant functions, DeepSets and PointNet are not known to be universal when approximating equivariant set functions. On the other hand, several recent equivariant set architectures have been proven equivariant universal (Sannai et al., 2019; Keriven & Peyré, 2019); however, these models either use layers that are not permutation equivariant (in the standard sense) and/or use higher order tensor variables, which are less practical. There is, therefore, a gap in understanding the universality of popular equivariant set models versus theoretical ones. In this paper we close this gap by proving that: (i) PointNet is not equivariant universal; and (ii) adding a single linear transmission layer makes PointNet universal. We call this architecture PointNetST and argue it is the simplest permutation equivariant universal model known to date. Another consequence is that DeepSets is universal, and also PointNetSeg, a popular point cloud segmentation network (used, e.g., in Qi et al. (2017)), is universal. The key theoretical tool used to prove the above results is an explicit characterization of all permutation equivariant polynomial layers. Lastly, we provide numerical experiments validating the theoretical results and comparing different permutation equivariant models.

## 1 INTRODUCTION

Many interesting tasks in machine learning can be described by functions $F$ that take as input a set, $X = (x_1, \dots, x_n)$, and output some per-element features or values, $F(X) = (F(X)_1, \dots, F(X)_n)$. Permutation equivariance is the property required of $F$ so that it is well-defined. Namely, it assures that reshuffling the elements in $X$ and applying $F$ results in the same output, reshuffled in the same manner. For example, if $\tilde{X} = (x_2, x_1, x_3, \dots, x_n)$ then $F(\tilde{X}) = (F(X)_2, F(X)_1, F(X)_3, \dots, F(X)_n)$.

Building neural networks that are permutation equivariant by construction has proved extremely useful in practice. Arguably the most popular models are DeepSets (Zaheer et al., 2017) and PointNet (Qi et al., 2017). These models enjoy a small number of parameters, a low memory footprint and computational efficiency, along with high empirical expressiveness. Although both DeepSets and PointNet are known to be invariant universal (i.e., they can approximate arbitrary invariant continuous functions), they are not known to be equivariant universal (i.e., able to approximate arbitrary equivariant continuous functions).

On the other hand, several researchers have suggested theoretical permutation equivariant models and proved they are equivariant universal. Sannai et al. (2019) build a universal equivariant network by taking $n$ copies of $(n-1)$-invariant networks and combining them with a layer that is not permutation invariant in the standard (above mentioned) sense.
Keriven & Peyré (2019) solve a more general problem of building networks that are equivariant universal over arbitrary high order input tensors in $\mathbb{R}^{n^d}$ (including graphs); their construction, however, uses higher order tensors as hidden variables, which is of less practical value. Yarotsky (2018) proves that neural networks constructed using a finite set of invariant and equivariant polynomial layers are also equivariant universal; however, his network is not explicit (i.e., the polynomials are not characterized for the equivariant case) and is also of less practical interest due to the high degree polynomial layers.

In this paper we close the gap between the practical and theoretical permutation equivariant constructions and prove:

**Theorem 1.** (i) PointNet is not equivariant universal. (ii) Adding a single linear transmission layer (i.e., $X \mapsto \mathbf{1}\mathbf{1}^T X$) to PointNet makes it equivariant universal. (iii) Using ReLU activations, the minimal width required for a universal permutation equivariant network satisfies $\omega \le k_{out} + k_{in} + \binom{n+k_{in}}{k_{in}}$.

This theorem suggests that, arguably, PointNet with the addition of a single linear layer is the simplest universal equivariant network, able to learn arbitrary continuous equivariant functions of sets. An immediate corollary of this theorem is:

**Corollary 1.** DeepSets and PointNetSeg are universal.

PointNetSeg is a network often used for point cloud segmentation (e.g., in Qi et al. (2017)). One of the benefits of our result is that it provides a simple characterization of universal equivariant architectures that can be used in the network design process to guarantee universality.

The theoretical tool used for the proof of Theorem 1 is an explicit characterization of the permutation equivariant polynomials over sets of vectors in $\mathbb{R}^k$ using power-sum multi-symmetric polynomials. We prove:

**Theorem 2.** Let $P : \mathbb{R}^{n\times k} \to \mathbb{R}^{n\times l}$ be a permutation equivariant polynomial map. Then,
$$P(X) = \sum_{|\alpha|\le n} b_\alpha q_\alpha^T, \tag{1}$$
where $b_\alpha = (x_1^\alpha, \dots, x_n^\alpha)^T$, $q_\alpha = (q_{\alpha,1}, \dots, q_{\alpha,l})^T$, and $q_{\alpha,j} = q_{\alpha,j}(s_1, \dots, s_t)$, $t = \binom{n+k}{k}$, are polynomials; $s_j(X) = \sum_{i=1}^n x_i^{\alpha_j}$ are the power-sum multi-symmetric polynomials. On the other hand, every polynomial map $P$ satisfying Equation 1 is equivariant.

This theorem, which extends Proposition 2.27 in Golubitsky & Stewart (2002) to sets of vectors using multivariate polynomials, lends itself to expressing arbitrary equivariant polynomials as a composition of entry-wise continuous functions and a single linear transmission, which in turn facilitates the proof of Theorem 1.

We conclude the paper with numerical experiments validating the theoretical results and testing several permutation equivariant networks on the tasks of set classification and regression.

## 2 PRELIMINARIES

**Equivariant maps.** Vectors $x \in \mathbb{R}^k$ are by default column vectors; $\mathbf{0}, \mathbf{1}$ are the all-zero and all-one vectors/tensors; $e_i$ is the $i$-th standard basis vector; $I$ is the identity matrix; all dimensions are inferred from context or mentioned explicitly. We represent a set of $n$ vectors in $\mathbb{R}^k$ as a matrix $X \in \mathbb{R}^{n\times k}$ and denote $X = (x_1, x_2, \dots, x_n)^T$, where $x_i \in \mathbb{R}^k$, $i \in [n]$, are the rows of $X$. We denote by $S_n$ the permutation group of $[n]$; its action on $X$ is defined by $\sigma \star X = (x_{\sigma^{-1}(1)}, x_{\sigma^{-1}(2)}, \dots, x_{\sigma^{-1}(n)})^T$, $\sigma \in S_n$. That is, $\sigma$ reshuffles the rows of $X$. The natural class of maps assigning a value or feature vector to every element in an input set is that of permutation equivariant maps:

**Definition 1.** A map $F : \mathbb{R}^{n\times k} \to \mathbb{R}^{n\times l}$ satisfying $F(\sigma\star X) = \sigma\star F(X)$ for all $\sigma \in S_n$ and $X \in \mathbb{R}^{n\times k}$ is called permutation equivariant.
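As a quick sanity check of Definition 1, the following snippet (our own illustration, not part of the paper's code) numerically verifies that the map $X \mapsto X + \mathbf{1}\mathbf{1}^T X$, i.e., each row plus the sum over all rows, the kind of "transmission" used later in the paper, satisfies $F(\sigma\star X) = \sigma\star F(X)$ for a random permutation:

```python
import torch

def F(X):
    # F(X) = X + 1 1^T X: adds the column-wise sum of the set to every row.
    return X + X.sum(dim=0, keepdim=True)

n, k = 7, 3
X = torch.randn(n, k)
sigma = torch.randperm(n)                          # a random permutation of [n]
print(torch.allclose(F(X[sigma]), F(X)[sigma]))    # True: F(sigma * X) = sigma * F(X)
```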
**Power-sum multi-symmetric polynomials.** Given a vector $z = (z_1, \dots, z_n) \in \mathbb{R}^n$, the power-sum symmetric polynomials $s_j(z) = \sum_{i=1}^n z_i^j$, $j \in [n]$, uniquely characterize $z$ up to permutation of its entries. In other words, for $z, y \in \mathbb{R}^n$ we have $y = \sigma\star z$ for some $\sigma \in S_n$ if and only if $s_j(y) = s_j(z)$ for all $j \in [n]$. An equivalent property is that every $S_n$ invariant polynomial $p$ can be expressed as a polynomial in the power-sum symmetric polynomials, i.e., $p(z) = q(s_1(z), \dots, s_n(z))$; see Rydh (2007), Corollary 8.4, and Briand (2004), Theorem 3. This fact was previously used in Zaheer et al. (2017) to prove that DeepSets is universal for invariant functions. We extend this result to equivariant functions and the multi-feature (sets of vectors) case.

For a vector $x \in \mathbb{R}^k$ and a multi-index vector $\alpha = (\alpha_1, \dots, \alpha_k) \in \mathbb{N}^k$ we define $x^\alpha = x_1^{\alpha_1} \cdots x_k^{\alpha_k}$, and $|\alpha| = \sum_{i\in[k]}\alpha_i$. A generalization of the power-sum symmetric polynomials to matrices exists and is called the power-sum multi-symmetric polynomials, defined (with a slight abuse of notation) by $s_\alpha(X) = \sum_{i=1}^n x_i^\alpha$, where $\alpha \in \mathbb{N}^k$ is a multi-index satisfying $|\alpha| \le n$. Note that the number of power-sum multi-symmetric polynomials acting on $X \in \mathbb{R}^{n\times k}$ is $t = \binom{n+k}{k}$. For notational simplicity, let $\alpha_1, \dots, \alpha_t$ be a list of all $\alpha \in \mathbb{N}^k$ with $|\alpha| \le n$; we then index the collection of power-sum multi-symmetric polynomials as $s_1, \dots, s_t$. Similarly to the vector case, the numbers $s_j(X)$, $j \in [t]$, characterize $X$ up to permutation of its rows. That is, $Y = \sigma\star X$ for some $\sigma \in S_n$ iff $s_j(Y) = s_j(X)$ for all $j \in [t]$. Furthermore, every $S_n$ invariant polynomial $p : \mathbb{R}^{n\times k} \to \mathbb{R}$ can be expressed as a polynomial in the power-sum multi-symmetric polynomials (see Rydh (2007), Corollary 8.4), i.e.,
$$p(X) = q(s_1(X), \dots, s_t(X)). \tag{2}$$
These polynomials were recently used to encode multi-sets in Maron et al. (2019).
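To make these objects concrete, here is a small sketch (ours, not taken from the paper's code release) that enumerates all multi-indices $\alpha$ with $|\alpha| \le n$, evaluates the power-sum multi-symmetric polynomials $s_\alpha(X) = \sum_i x_i^\alpha$, and checks that their count matches $t = \binom{n+k}{k}$ and that they are unchanged when the rows of $X$ are permuted:

```python
import math
from itertools import product
import torch

def multi_indices(n, k):
    # All alpha in N^k (zero multi-index included) with |alpha| <= n.
    return [a for a in product(range(n + 1), repeat=k) if sum(a) <= n]

def power_sums(X, alphas):
    # s_alpha(X) = sum_i prod_j x_{ij}^{alpha_j}, one number per multi-index alpha.
    return torch.stack([(X ** torch.tensor(a, dtype=X.dtype)).prod(dim=1).sum()
                        for a in alphas])

n, k = 5, 2
alphas = multi_indices(n, k)
print(len(alphas) == math.comb(n + k, k))      # True: t = C(n+k, k) = 21 here

X = torch.rand(n, k)
perm = torch.randperm(n)
print(torch.allclose(power_sums(X, alphas), power_sums(X[perm], alphas)))  # invariant to row permutation
```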
## 3 EQUIVARIANT MULTI-SYMMETRIC POLYNOMIAL LAYERS

In this section we develop the main theoretical tool of this paper, namely, a characterization of all permutation equivariant polynomial layers. As far as we know, these layers were not fully characterized before. Theorem 2 provides an explicit representation of arbitrary permutation equivariant polynomial maps $P : \mathbb{R}^{n\times k} \to \mathbb{R}^{n\times l}$ using the basis of power-sum multi-symmetric polynomials, $s_i(X)$. The particular use of the power-sum polynomials $s_i(X)$ has the advantage that it can be encoded efficiently using a neural network: as we will show, $s_i(X)$ can be approximated using a PointNet with a single linear transmission layer. This allows approximating an arbitrary equivariant polynomial map using PointNet with a single linear transmission layer.

A version of this theorem for vectors instead of matrices (i.e., the case $k = 1$) appears as Proposition 2.27 in Golubitsky & Stewart (2002); we extend their proof to matrices, which is the relevant scenario for ML applications as it allows working with sets of vectors. For $k = 1$, Theorem 2 reduces to the following form: $p(x)_i = \sum_{a\le n} p_a(s_1(x), \dots, s_n(x))\, x_i^a$ with $s_j(x) = \sum_i x_i^j$. For matrices, the monomial $x_i^a$ is replaced by $x_i^\alpha$ for a multi-index $\alpha$, and the power-sum symmetric polynomials are replaced by the power-sum multi-symmetric polynomials.

First, note that it is enough to prove Theorem 2 for $l = 1$ and apply it to every column of $P$. Hence, we deal with a vector of polynomials $p : \mathbb{R}^{n\times k} \to \mathbb{R}^n$ and need to prove it can be expressed as $p = \sum_{|\alpha|\le n} b_\alpha q_\alpha$, for $S_n$ invariant polynomials $q_\alpha$.

Given a polynomial $p(X)$ and the cyclic permutation $\sigma_1 = (1\,2\,3\cdots n)$, the following operation, taking a polynomial to a vector of polynomials, is useful in characterizing equivariant polynomial maps:
$$\langle p\rangle(X) = \big(p(X),\; p(\sigma_1\star X),\; p(\sigma_1^2\star X),\; \dots,\; p(\sigma_1^{n-1}\star X)\big)^T. \tag{3}$$
Theorem 2 will be proved using the following two lemmas.

**Lemma 1.** Let $p : \mathbb{R}^{n\times k} \to \mathbb{R}^n$ be an equivariant polynomial map. Then there exists a polynomial $\bar p : \mathbb{R}^{n\times k} \to \mathbb{R}$, invariant to $S_{n-1}$ (permuting the last $n-1$ rows of $X$), so that $p = \langle\bar p\rangle$.

*Proof.* Equivariance of $p$ means that for all $\sigma \in S_n$ it holds that
$$\sigma\star p(X) = p(\sigma\star X). \tag{4}$$
Choosing an arbitrary permutation $\sigma \in \mathrm{stab}(1) < S_n$, namely a permutation satisfying $\sigma(1) = 1$, and observing the first row in Equation 4, we get $p_1(X) = p_1(\sigma\star X) = p_1(x_1, x_{\sigma^{-1}(2)}, \dots, x_{\sigma^{-1}(n)})$. Since this is true for all $\sigma \in \mathrm{stab}(1)$, $p_1$ is $S_{n-1}$ invariant. Next, applying $\sigma = (1\,i)$ to Equation 4 and observing the first row again, we get $p_i(X) = p_1(x_i, \dots, x_1, \dots)$. Using the invariance of $p_1$ to $S_{n-1}$, we get $p = \langle p_1\rangle$. □

**Lemma 2.** Let $p : \mathbb{R}^{n\times k} \to \mathbb{R}$ be a polynomial invariant to $S_{n-1}$ (permuting the last $n-1$ rows of $X$). Then
$$p(X) = \sum_{|\alpha|\le n} x_1^\alpha q_\alpha(X), \tag{5}$$
where the $q_\alpha$ are $S_n$ invariant.

*Proof.* Expanding $p$ with respect to $x_1$ we get
$$p(X) = \sum_{|\alpha|\le m} x_1^\alpha p_\alpha(x_2, \dots, x_n), \tag{6}$$
for some $m \in \mathbb{N}$. We first claim the $p_\alpha$ are $S_{n-1}$ invariant. Indeed, note that if $p(X) = p(x_1, x_2, \dots, x_n)$ is $S_{n-1}$ invariant, i.e., invariant to permutations of $x_2, \dots, x_n$, then so are its derivatives $\partial^{|\beta|}p(X)/\partial x_1^\beta$, for all $\beta \in \mathbb{N}^k$. Taking the derivative $\partial^{|\beta|}/\partial x_1^\beta$ on both sides of Equation 6, we get that $p_\beta$ is $S_{n-1}$ invariant. Now fix $\alpha$. Since $p_\alpha$ is $S_{n-1}$ invariant, it can be expressed as a polynomial in the power-sum multi-symmetric polynomials of $x_2, \dots, x_n$, i.e., $p_\alpha(x_2, \dots, x_n) = r(s_1(x_2, \dots, x_n), \dots, s_t(x_2, \dots, x_n))$. Note that $s_i(x_2, \dots, x_n) = s_i(X) - x_1^{\alpha_i}$ and therefore $p_\alpha(x_2, \dots, x_n) = r(s_1(X) - x_1^{\alpha_1}, \dots, s_t(X) - x_1^{\alpha_t})$. Since $r$ is a polynomial, expanding its monomials in the $s_i(X)$ and $x_1^\alpha$ shows that $p$ can be expressed as $p = \sum_{|\alpha|\le m'} x_1^\alpha \tilde q_\alpha$, where $m' \in \mathbb{N}$ and the $\tilde q_\alpha$ are $S_n$ invariant (being products of the $S_n$ invariant polynomials $s_i(X)$). Plugging this into Equation 6 we get Equation 5, possibly with the sum taken over some $n' > n$. It remains to show that $n'$ can be taken to be at most $n$; this is proved in Corollary 5 of Briand (2004). □

*Proof.* (Theorem 2) Given an equivariant $p$ as above, use Lemma 1 to write $p = \langle\bar p\rangle$, where $\bar p(X)$ is invariant to permuting the last $n-1$ rows of $X$. Use Lemma 2 to write $\bar p(X) = \sum_{|\alpha|\le n} x_1^\alpha q_\alpha(X)$, where the $q_\alpha$ are $S_n$ invariant. We get
$$p = \sum_{|\alpha|\le n} b_\alpha q_\alpha.$$
The converse direction is immediate after noting that the $b_\alpha$ are equivariant and the $q_\alpha$ are invariant. □

## 4 UNIVERSALITY OF SET EQUIVARIANT NEURAL NETWORKS

We consider equivariant deep neural networks $F : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$,
$$F(X) = L_m \circ \nu \circ \cdots \circ \nu \circ L_1(X), \tag{7}$$
where $L_i : \mathbb{R}^{n\times k_i} \to \mathbb{R}^{n\times k_{i+1}}$ are affine equivariant transformations and $\nu$ is an entry-wise nonlinearity (e.g., ReLU). We define the width of the network to be $\omega = \max_i k_i$; note that this definition is different from the one used for standard MLPs, where the width would be $n\omega$, see, e.g., Hanin & Sellke (2017). Zaheer et al. (2017) proved that the affine equivariant $L_i$ are of the form
$$L_i(X) = XA + \frac{1}{n}\mathbf{1}\mathbf{1}^T X B + \mathbf{1}c^T, \tag{8}$$
where $A, B \in \mathbb{R}^{k_i\times k_{i+1}}$ and $c \in \mathbb{R}^{k_{i+1}}$ are the layer's trainable parameters; we call the linear transformation $X \mapsto \frac{1}{n}\mathbf{1}\mathbf{1}^T X B$ a linear transmission layer.
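A minimal PyTorch sketch of the affine equivariant layer in Equation 8 (written for illustration; the module and parameter names are ours, not from the paper's code release). Setting `transmission=False` drops the $B$ term and recovers a PointNet layer; using a nonzero $B$ in exactly one layer of the stack gives PointNetST.

```python
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    """L(X) = X A + (1/n) 1 1^T X B + 1 c^T  (Equation 8)."""
    def __init__(self, k_in, k_out, transmission=True):
        super().__init__()
        self.A = nn.Parameter(torch.randn(k_in, k_out) / k_in ** 0.5)
        # transmission=False drops the B term, giving a plain per-point (PointNet) layer.
        self.B = nn.Parameter(torch.randn(k_in, k_out) / k_in ** 0.5) if transmission else None
        self.c = nn.Parameter(torch.zeros(k_out))

    def forward(self, X):                                    # X: (n, k_in)
        out = X @ self.A + self.c                            # per-point term + bias
        if self.B is not None:
            out = out + X.mean(dim=0, keepdim=True) @ self.B # (1/n) 1 1^T X B, broadcast over rows
        return out

# L(sigma * X) == sigma * L(X): the layer is permutation equivariant.
layer = EquivariantLinear(3, 5)
X, perm = torch.randn(10, 3), torch.randperm(10)
print(torch.allclose(layer(X[perm]), layer(X)[perm], atol=1e-6))   # True
```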
We now define the equivariant models considered in this paper. The DeepSets (Zaheer et al., 2017) architecture is Equation 7 with the choice of layers as in Equation 8. Taking $B = 0$ in all layers gives the PointNet architecture (Qi et al., 2017). PointNetST is an equivariant model of the form Equation 7 with layers as in Equation 8 where only a single layer $L_i$ has a non-zero $B$. The PointNetSeg (Qi et al., 2017) architecture is a PointNet $F$ composed with an invariant max layer, namely $\max(F(X))_j = \max_{i\in[n]} F(X)_{i,j}$, which is then concatenated with the input $X$, i.e., $[X, \mathbf{1}\max(F(X))]$, and fed as input to another PointNet $G$, that is, $G([X, \mathbf{1}\max(F(X))])$. We will prove that PointNetST is permutation equivariant universal and therefore arguably the simplest permutation equivariant universal model known to date. Universality of equivariant deep networks is defined next.

**Definition 2.** Permutation equivariant universality¹ of a model $F : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ means that for every permutation equivariant continuous function $H : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ defined over the cube $K = [0,1]^{n\times k_{in}} \subset \mathbb{R}^{n\times k_{in}}$, and every $\epsilon > 0$, there exists a choice of $m$ (i.e., network depth), $k_i$ (i.e., network width) and the trainable parameters of $F$ so that $\|H(X) - F(X)\| < \epsilon$ for all $X \in K$.

¹Or just equivariant universal, in short.

*Proof.* (Theorem 1) Fact (i), namely that PointNet is not equivariant universal, is a consequence of the following simple lemma:

**Lemma 3.** Let $h = (h_1, \dots, h_n)^T : \mathbb{R}^n \to \mathbb{R}^n$ be the equivariant linear function defined by $h(x) = \mathbf{1}\mathbf{1}^T x$. There is no $f : \mathbb{R} \to \mathbb{R}$ so that $|h_i(x) - f(x_i)| < \frac{1}{2}$ for all $i \in [n]$ and $x \in [0,1]^n$.

*Proof.* Assume such an $f$ exists. Let $e_1 = (1, 0, \dots, 0)^T \in \mathbb{R}^n$. Then,
$$1 = |h_2(e_1) - h_2(\mathbf{0})| \le |h_2(e_1) - f(0)| + |f(0) - h_2(\mathbf{0})| < 1,$$
reaching a contradiction. □

To prove (ii) we first reduce the problem from the class of all continuous equivariant functions to the class of equivariant polynomials. This is justified by the following lemma.

**Lemma 4.** Equivariant polynomials $P : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ are dense in the space of continuous equivariant functions $F : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ over the cube $K$.

*Proof.* Take an arbitrary $\epsilon > 0$. Consider the function $f_{ij} : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}$ which denotes the $(i,j)$-th output entry of $F$. By the Stone-Weierstrass Theorem there exists a polynomial $p_{ij} : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}$ such that $|f_{ij}(X) - p_{ij}(X)| \le \epsilon$ for all $X \in K$. Consider the polynomial map $P : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ defined by $(P)_{ij} = p_{ij}$. $P$ is in general not equivariant. To finish the proof we symmetrize $P$:
$$\Big\| F(X) - \frac{1}{n!}\sum_{\sigma\in S_n}\sigma\star P(\sigma^{-1}\star X)\Big\| = \Big\|\frac{1}{n!}\sum_{\sigma\in S_n}\sigma\star\big(F(\sigma^{-1}\star X) - P(\sigma^{-1}\star X)\big)\Big\| \le \frac{1}{n!}\sum_{\sigma\in S_n}\big\|F(\sigma^{-1}\star X) - P(\sigma^{-1}\star X)\big\| \le \frac{1}{n!}\sum_{\sigma\in S_n}\epsilon = \epsilon,$$
where in the first equality we used the fact that $F$ is equivariant. This concludes the proof since $\frac{1}{n!}\sum_{\sigma\in S_n}\sigma\star P(\sigma^{-1}\star X)$ is an equivariant polynomial map. □

Now, according to Theorem 2, an arbitrary equivariant polynomial $P : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times k_{out}}$ can be written as $P = \sum_{|\alpha|\le n} b_\alpha(X) q_\alpha(X)^T$, where $b_\alpha(X) = (x_1^\alpha, \dots, x_n^\alpha)^T \in \mathbb{R}^n$ and $q_\alpha = (q_{\alpha,1}, \dots, q_{\alpha,k_{out}}) \in \mathbb{R}^{k_{out}}$ are invariant polynomials. Remember that every $S_n$ invariant polynomial can be expressed as a polynomial in the $t = \binom{n+k_{in}}{k_{in}}$ power-sum multi-symmetric polynomials $s_j(X) = \frac{1}{n}\sum_{i=1}^n x_i^{\alpha_j}$, $j \in [t]$ (we use the $1/n$-normalized version for a bit more simplicity later on). We can therefore write $P$ as a composition of three maps:
$$P = Q \circ L \circ B, \tag{9}$$
where $B : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times t}$ is defined by $B(X) = (b(x_1), \dots, b(x_n))^T$, $b(x) = (x^{\alpha_1}, \dots, x^{\alpha_t})$; $L$ is defined as in Equation 8 with $B = [0, I]$ and $A = [e_1, \dots, e_{k_{in}}, 0]$, where $I \in \mathbb{R}^{t\times t}$ is the identity matrix and $e_i \in \mathbb{R}^t$ is the standard basis (as usual). We assume $\alpha_j = e_j \in \mathbb{R}^{k_{in}}$, for $j \in [k_{in}]$. Note that the output of $L$ is of the form $L(B(X)) = (X, \mathbf{1}s_1(X), \mathbf{1}s_2(X), \dots, \mathbf{1}s_t(X))$.
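The decomposition above is easy to play with numerically. The sketch below (our own illustration, with hypothetical helper names) builds $B(X)$ from all monomials $x^\alpha$, $|\alpha|\le n$, ordered so that the first $k_{in}$ multi-indices are the standard basis vectors $e_j$, applies the transmission step, and checks that the resulting $L(B(X))$, namely $X$ with the broadcast power sums appended, is permutation equivariant.

```python
import torch
from itertools import product

def monomial_features(X, n):
    """B(X): row i is b(x_i) = (x_i^{alpha_1}, ..., x_i^{alpha_t}), over all |alpha| <= n,
    ordered so that the first k multi-indices are the standard basis e_1, ..., e_k."""
    k = X.shape[1]
    alphas = [a for a in product(range(n + 1), repeat=k) if sum(a) <= n]
    basis = [tuple(int(i == j) for i in range(k)) for j in range(k)]
    alphas = basis + [a for a in alphas if a not in basis]
    E = torch.tensor(alphas, dtype=X.dtype)                  # (t, k) exponents
    return (X.unsqueeze(1) ** E.unsqueeze(0)).prod(dim=2)    # (n, t)

def L_of_B(X, n):
    """L(B(X)) = (X, 1 s_1(X), ..., 1 s_t(X)) with s_j the 1/n-normalized power sums."""
    BX = monomial_features(X, n)
    s = BX.mean(dim=0, keepdim=True)                         # (1/n) 1^T B(X)
    return torch.cat([X, s.expand(X.shape[0], -1)], dim=1)

n, k = 5, 2
X = torch.rand(n, k)
perm = torch.randperm(n)
out = L_of_B(X, n)
print(out.shape)                                             # (n, k + t) = (5, 23)
print(torch.allclose(L_of_B(X[perm], n), out[perm]))         # equivariant, as expected
```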
Finally, $Q : \mathbb{R}^{n\times(k_{in}+t)} \to \mathbb{R}^{n\times k_{out}}$ is defined by $Q(X, \mathbf{1}s_1, \dots, \mathbf{1}s_t) = (q(x_1, s_1, \dots, s_t), \dots, q(x_n, s_1, \dots, s_t))^T$, where $q(x, s_1, \dots, s_t) = \sum_{|\alpha|\le n} x^\alpha q_\alpha(s_1, \dots, s_t)^T$.

*Figure 1: The construction of the universal network (PointNetST).*

The decomposition of $P$ in Equation 9 suggests that replacing $Q$ and $B$ with Multi-Layer Perceptrons (MLPs) would lead to a universal permutation equivariant network consisting of PointNet with a single linear transmission layer, namely PointNetST. The $F$ approximating $P$ will be defined as
$$F = \Psi \circ L \circ \Phi, \tag{10}$$
where $\Phi : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^{n\times t}$ and $\Psi : \mathbb{R}^{n\times(t+k_{in})} \to \mathbb{R}^{n\times k_{out}}$ are both of PointNet architecture, namely there exist MLPs $\phi : \mathbb{R}^{k_{in}} \to \mathbb{R}^t$ and $\psi : \mathbb{R}^{t+k_{in}} \to \mathbb{R}^{k_{out}}$ so that $\Phi(X) = (\phi(x_1), \dots, \phi(x_n))^T$ and $\Psi(X) = (\psi(x_1), \dots, \psi(x_n))^T$. See Figure 1 for an illustration of $F$.

To build the MLPs $\phi, \psi$ we first construct $\psi$ to approximate $q$; that is, we use the universality of MLPs (see Hornik (1991); Sonoda & Murata (2017); Hanin & Sellke (2017)) to construct $\psi$ so that $\|\psi(x, s_1, \dots, s_t) - q(x, s_1, \dots, s_t)\| < \frac{\epsilon}{2}$ for all $(x, s_1, \dots, s_t) \in [0,1]^{k_{in}+t}$. Furthermore, as $\psi$ is uniformly continuous over $[0,1]^{k_{in}+t}$, let $\delta$ be such that if $z, z' \in [0,1]^{k_{in}+t}$ and $\|z - z'\| < \delta$, then $\|\psi(z) - \psi(z')\| < \frac{\epsilon}{2}$. Now we use universality again to construct $\phi$ approximating $b$; that is, we take $\phi$ so that $\|\phi(x) - b(x)\| < \delta$ for all $x \in [0,1]^{k_{in}}$. Then
$$\|F(X) - P(X)\| \le \underbrace{\|\Psi(L(\Phi(X))) - \Psi(L(B(X)))\|}_{\mathrm{err}_1} + \underbrace{\|\Psi(L(B(X))) - Q(L(B(X)))\|}_{\mathrm{err}_2}.$$
First, $\|L(\Phi(X)) - L(B(X))\| < \delta$ for all $X \in K$, and therefore $\mathrm{err}_1 < \frac{\epsilon}{2}$. Second, note that if $X \in K$ then $B(X) \in [0,1]^{n\times t}$ and $L(B(X)) \in [0,1]^{n\times(k_{in}+t)}$; therefore, by construction of $\psi$, we have $\mathrm{err}_2 < \frac{\epsilon}{2}$.

To prove (iii) we use the result in Hanin & Sellke (2017) (see Theorem 1 there) bounding the width of an MLP approximating a function $f : [0,1]^{d_{in}} \to \mathbb{R}^{d_{out}}$ by $d_{in} + d_{out}$. Therefore, the width of the MLP $\phi$ is bounded by $k_{in} + t$, while the width of the MLP $\psi$ is bounded by $t + k_{in} + k_{out}$, proving the bound. □

We can now prove Corollary 1.

*Proof.* (Corollary 1) The fact that the DeepSets model is equivariant universal is immediate. Indeed, the PointNetST model can be obtained from the DeepSets model by setting $B = 0$ in all but one layer, with $B$ as in Equation 8.

For the PointNetSeg model, note that by Theorem 1 in Qi et al. (2017) every invariant function $f : \mathbb{R}^{n\times k_{in}} \to \mathbb{R}^t$ can be approximated by a network of the form $\Psi(\max(F(X)))$, where $F(X)$ is a PointNet model and $\Psi$ is an MLP. In particular, for every $\varepsilon > 0$ there exist such $F, \Psi$ for which $\|\Psi(\max(F(X))) - (s_1(X), \dots, s_t(X))\| < \varepsilon$ for every $X \in [0,1]^{n\times k_{in}}$, where $s_1(X), \dots, s_t(X)$ are the power-sum multi-symmetric polynomials. It follows that we can use PointNetSeg to approximate $\mathbf{1}(s_1(X), \dots, s_t(X))$. The rest of the proof closely resembles the proof of Theorem 1. □

**Graph neural networks with constructed adjacency.** One approach sometimes applied to learning from sets of vectors is to define an adjacency matrix (e.g., by thresholding distances of node feature vectors) and apply a graph neural network to the resulting graph (e.g., Wang et al. (2019); Li et al. (2019)). Using the common message passing paradigm (Gilmer et al., 2017), this boils down to layers of the form
$$L(X)_i = \psi\Big(x_i, \sum_{j\in\mathcal{N}_i}\phi(x_i, x_j)\Big) = \psi\Big(x_i, \sum_{j\in[n]} N(x_i, x_j)\phi(x_i, x_j)\Big),$$
where $\phi, \psi$ are MLPs, $\mathcal{N}_i$ is the index set of neighbors of node $i$, and $N(x_i, x_j)$ is the indicator function for the edge $(i,j)$. If $N$ can be approximated by a continuous function, which is the case at least in the $L^2$ sense for a finite set of vectors, then since $L$ is also equivariant, it follows from Theorem 1 that such a network can be approximated (again, at least in $L^2$ norm) arbitrarily well by any universal equivariant network such as PointNetST or DeepSets.
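For concreteness, here is a small sketch (ours, not the paper's code; the hidden sizes and MLP shapes are arbitrary choices) of a message passing layer of exactly this form, $L(X)_i = \psi(x_i, \sum_{j\in\mathcal{N}_i}\phi(x_i, x_j))$, with the neighborhoods $\mathcal{N}_i$ built by taking the 10 nearest neighbors of each set element, as in the experiments below.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """L(X)_i = psi(x_i, sum_{j in N_i} phi(x_i, x_j)) over a kNN graph."""
    def __init__(self, k_in, k_out, k_neighbors=10, hidden=64):
        super().__init__()
        self.k_neighbors = k_neighbors
        self.phi = nn.Sequential(nn.Linear(2 * k_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.psi = nn.Sequential(nn.Linear(k_in + hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, k_out))

    def forward(self, X):                                     # X: (n, k_in)
        dists = torch.cdist(X, X)                             # pairwise distances
        idx = dists.topk(self.k_neighbors + 1, largest=False).indices[:, 1:]  # drop self
        neighbors = X[idx]                                    # (n, k_neighbors, k_in)
        center = X.unsqueeze(1).expand_as(neighbors)          # x_i repeated per neighbor
        messages = self.phi(torch.cat([center, neighbors], dim=-1)).sum(dim=1)
        return self.psi(torch.cat([X, messages], dim=-1))     # (n, k_out)

layer = MessagePassingLayer(3, 8)
out = layer(torch.randn(100, 3))
print(out.shape)   # torch.Size([100, 8])
```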
We tested the ability of a DeepSets model with varying depth and width to approximate a single graph convolution layer. We found that a DeepSets model with a small number of layers can approximate a graph convolution layer reasonably well. For details see Appendix A.

## 5 EXPERIMENTS

We conducted experiments in order to validate our theoretical observations. We compared the results of several equivariant models, as well as a baseline (full) MLP, on three equivariant learning tasks: a classification task (knapsack) and two regression tasks (squared norm and Fiedler eigenvector). For all tasks we compare the results of 7 different models: DeepSets, PointNet, PointNetSeg, PointNetST, PointNetQT, GraphNet, and the MLP baseline. PointNetQT is PointNet with a single quadratic equivariant transmission layer, as defined in Appendix B. GraphNet is similar to the graph convolution network in Kipf & Welling (2016) and is defined explicitly in Appendix B. We generated the adjacency matrices for GraphNet by taking the 10 nearest neighbors of each set element. In all experiments we used a network of the form Equation 7 with depth $m = 6$ and varying width, fixed across all layers.²

²The code can be found at https://github.com/NimrodSegol/On-Universal-Equivariant-Set-Networks

**Equivariant classification.** For classification, we chose to learn the multidimensional knapsack problem, which is known to be NP-hard. We are given a set of 4-vectors, represented by $X \in \mathbb{R}^{n\times 4}$, and our goal is to learn the equivariant classification function $f : \mathbb{R}^{n\times 4} \to \{0,1\}^n$ defined by the following optimization problem:
$$f(X) = \arg\max_z \sum_{i=1}^n x_{i1} z_i \quad \text{s.t.} \quad \sum_{i=1}^n x_{ij} z_i \le w_j,\ j = 2, 3, 4; \qquad z_i \in \{0,1\},\ i \in [n].$$
Intuitively, given a set of vectors $X \in \mathbb{R}^{n\times 4}$, $(X)_{ij} = x_{ij}$, where each row represents an element in a set, our goal is to find a subset maximizing the value while satisfying budget constraints. The first column of $X$ defines the value of each element, and the three other columns the costs. To evaluate the success of a trained model we record the percentage of sets for which the predicted subset satisfies all the budget constraints and attains a value within 10% of the optimal value. In Appendix C we detail how we generated this dataset.
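A sketch of this evaluation criterion (our own illustrative code; `optimal_value` is assumed to come from the dynamic-programming solver mentioned in Appendix C, and the budgets (100, 80, 50) are the $w_j$ listed there): a predicted indicator vector $z$ counts as a success iff it satisfies all three budget constraints and attains at least 90% of the optimal value.

```python
import torch

def knapsack_success(X, z, optimal_value, w=(100.0, 80.0, 50.0)):
    """X: (n, 4) set of items; z: (n,) predicted 0/1 selection.
    Success = all budget constraints hold and value is within 10% of optimal."""
    z = z.float()
    value = (X[:, 0] * z).sum()                       # first column: values
    costs = (X[:, 1:] * z.unsqueeze(1)).sum(dim=0)    # last three columns: costs
    feasible = bool((costs <= torch.tensor(w)).all())
    return feasible and bool(value >= 0.9 * optimal_value)

# Success rate over a batch of sets, given per-set optimal values `opts`:
# rate = sum(knapsack_success(X, z, v) for X, z, v in zip(Xs, zs, opts)) / len(Xs)
```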
**Equivariant regression.** The first equivariant function we considered for regression is $f(X) = \mathbf{1}\sum_{i=1}^n\sum_{j=1}^k (X_{i,j} - \frac{1}{2})^2$. Hanin & Sellke (2017) showed this function cannot be approximated by MLPs of small width. We drew 10k training examples and 1k test examples i.i.d. from a $N(\frac{1}{2}, 1)$ distribution (per entry of $X$). The second equivariant function we considered is defined on point clouds $X \in \mathbb{R}^{n\times 3}$. For each point cloud we computed a graph by connecting every point to its 10 nearest neighbors. We then computed the absolute value of the first non-trivial eigenvector of the graph Laplacian. We used the ModelNet dataset (Wu et al., 2015), which contains 9k training meshes and 2k test meshes. The point clouds are generated by randomly sampling 512 points from each mesh.

*Figure 2: Classification and regression tasks with permutation equivariant models (train and test curves for the knapsack, squared-norm and Fiedler tasks, plotted against network width, for PointNet, DeepSets, PointNetSeg, PointNetST, MLP, PointNetQT and GraphNet; the knapsack panels report the percentage of values within 10% of optimal). All the universal permutation equivariant models perform similarly, while the equivariant non-universal PointNet demonstrates consistently reduced performance; the MLP baseline (with the same number of parameters as the equivariant models) performs poorly.*

**Result summary.** Figure 2 summarizes train and test accuracy of the different models after training (training details in Appendix C) as a function of the network width $\omega$. We tested 15 values of $\omega$, equidistant in $[5, nk_{in}]$. As can be seen in the graphs, in all three datasets the equivariant universal models (PointNetST, PointNetQT, DeepSets, PointNetSeg) achieved comparable accuracy. PointNet, which is not equivariant universal, consistently achieved inferior performance compared to the universal models, as expected from the theory developed in this paper. The non-equivariant MLP, although universal, used the same width (i.e., the same number of parameters) as the equivariant models and was able to over-fit only on one train set (the quadratic function); its performance on the test sets was inferior by a large margin to the equivariant models. We also note that in general the GraphNet model achieved comparable results to the equivariant universal models but was still outperformed by the DeepSets model. An interesting point is that although the width used in the experiments is much smaller than the bound $k_{out} + k_{in} + \binom{n+k_{in}}{k_{in}}$ established by Theorem 1, the universal models are still able to learn the tested functions well. This raises the question of the tightness of this bound, which we leave to future work.
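As a concrete illustration of the second regression target described above, the following sketch (ours, not taken from the paper's repository) builds the 10-nearest-neighbor graph of a point cloud, forms the graph Laplacian (we assume the unnormalized Laplacian and symmetrize the kNN adjacency; the paper does not specify these details), and returns the absolute value of the Fiedler eigenvector, i.e., the eigenvector of the second-smallest eigenvalue.

```python
import torch

def fiedler_target(P, k=10):
    """P: (n, 3) point cloud. Returns |Fiedler eigenvector| of the graph Laplacian
    of the symmetrized kNN graph (unnormalized Laplacian -- our assumption)."""
    n = P.shape[0]
    idx = torch.cdist(P, P).topk(k + 1, largest=False).indices[:, 1:]   # drop self
    A = torch.zeros(n, n)
    A[torch.arange(n).unsqueeze(1), idx] = 1.0
    A = torch.maximum(A, A.T)                  # symmetrize the kNN adjacency
    L = torch.diag(A.sum(dim=1)) - A           # unnormalized Laplacian D - A
    eigvals, eigvecs = torch.linalg.eigh(L)    # eigenvalues in ascending order
    return eigvecs[:, 1].abs()                 # first non-trivial (Fiedler) eigenvector

y = fiedler_target(torch.randn(512, 3))
print(y.shape)    # torch.Size([512])
```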
## 6 CONCLUSIONS

In this paper we analyze several set equivariant neural networks and compare their approximation power. We show that while vanilla PointNet (Qi et al., 2017) is not equivariant universal, adding a single linear transmission layer makes it equivariant universal. Our proof strategy is based on a characterization of polynomial equivariant functions. As a corollary we show that the DeepSets model (Zaheer et al., 2017) and PointNetSeg (Qi et al., 2017) are equivariant universal. Experimentally, we tested the different models on several classification and regression tasks, finding that adding a single linear transmission layer to PointNet makes a significant positive impact on performance.

## 7 ACKNOWLEDGEMENTS

This research was supported in part by the European Research Council (ERC Consolidator Grant, LiftMatch 771136) and the Israel Science Foundation (Grant No. 1830/17).

## REFERENCES

Emmanuel Briand. When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials? Beiträge zur Algebra und Geometrie, 45, 2004.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. CoRR, 2017.

Martin Golubitsky and Ian Stewart. The Symmetry Perspective. 2002.

Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. CoRR, abs/1710.11278, 2017. URL http://arxiv.org/abs/1710.11278.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.

Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In The IEEE International Conference on Computer Vision (ICCV), 2019.

Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. CoRR, abs/1905.11136, 2019. URL http://arxiv.org/abs/1905.11136.

Silvano Martello and Paolo Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc., New York, NY, USA, 1990. ISBN 0-471-92420-2.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652-660, 2017.

David Rydh. A minimal set of generators for the ring of multisymmetric functions. In Annales de l'institut Fourier, volume 57, pp. 1741-1769, 2007.

Akiyoshi Sannai, Yuuki Takai, and Matthieu Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.

Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233-268, 2017.

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 2019.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391-3401, 2017.

## A APPROXIMATING A GRAPH CONVOLUTION LAYER WITH DEEPSETS

To test the ability of an equivariant universal model to approximate a graph convolution layer, we conducted an experiment where we applied a single graph convolution layer (see Appendix B for a full description of the graph convolution layers used in this paper) with 3 input features and 10 output features. We constructed a kNN graph by taking 10 neighbors. We sampled 1000 examples in $\mathbb{R}^{100\times 3}$ i.i.d. from a $N(\frac{1}{2}, 1)$ distribution (per entry of $X$). We regressed to the output of the graph convolution layer using the smooth L1 loss. The results are summarized in Figure 3.

*Figure 3: Using DeepSets to regress to a single graph convolution layer (error versus epochs). In the left panel we varied the depth (2 to 9) and used a constant width of 200; in the right panel we varied the width (13 to 200) and used a fixed depth of 6 layers. The y axes are in log scale. Note that even a 2-layer DeepSets can approximate a graph convolution layer up to an error of 0.01.*

## B DESCRIPTION OF LAYERS

### B.1 QUADRATIC LAYER

One potential application of Theorem 2 is augmenting an equivariant neural network (Equation 7) with equivariant polynomial layers $P : \mathbb{R}^{n\times k} \to \mathbb{R}^{n\times l}$ of some maximal degree $d$. This can be done in the following way: look for all solutions $\alpha, \beta_1, \beta_2, \ldots \in \mathbb{N}^k$ of $|\alpha| + \sum_i |\beta_i| \le d$. Any such solution gives a basis element of the form $p(X) = x_1^\alpha \prod_j \sum_{i=1}^n x_i^{\beta_j}$. In the paper we tested PointNetQT, an architecture that adds to PointNet a single quadratic equivariant layer. We opted to use only the quadratic transmission operators: for a matrix $X \in \mathbb{R}^{n\times k}$ we define $L(X) \in \mathbb{R}^{n\times k}$ as follows:
$$L(X) = XW_1 + \mathbf{1}\mathbf{1}^T X W_2 + \big((\mathbf{1}\mathbf{1}^T X)\odot(\mathbf{1}\mathbf{1}^T X)\big) W_3 + (X\odot X) W_4 + \big((\mathbf{1}\mathbf{1}^T X)\odot X\big) W_5,$$
where $\odot$ is point-wise multiplication and $W_i \in \mathbb{R}^{k\times k}$, $i \in [5]$, are the learnable parameters.

### B.2 GRAPH CONVOLUTION LAYER

We implement the graph convolution layers as $L(X) = BXW_2 + XW_1 + \mathbf{1}c^T$, with $W_1, W_2, c$ learnable parameters. The matrix $B$ is defined as in Kipf & Welling (2016) from the kNN graph of the set: $B = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $A$ is the adjacency matrix of the graph with added self-connections and $D$ is the degree matrix of the graph.

## C IMPLEMENTATION DETAILS

**Knapsack data generation.** We constructed a dataset of 10k training examples and 1k test examples consisting of $50\times 4$ matrices. We took $w_1 = 100$, $w_2 = 80$, $w_3 = 50$. To generate $X \in \mathbb{R}^{50\times 4}$, we draw an integer uniformly at random between 1 and 100 and randomly choose 50 integers between 1 and this number as the first column of $X$. We also randomly choose an integer between 1 and 25 and then randomly choose 150 integers in that range as the three last columns of $X$. The labels for each input $X$ were computed by a standard dynamic programming approach; see Martello & Toth (1990).
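A sketch of this data-generation procedure as we read it (our interpretation of the terse description above; in particular, the inclusive ranges and the sampling of cost entries in $[1, \text{drawn integer}]$ are assumptions):

```python
import torch

def generate_knapsack_instance(n=50):
    """One example X in R^{50 x 4}: column 0 holds values, columns 1-3 hold costs."""
    hi_value = torch.randint(1, 101, (1,)).item()        # integer in [1, 100]
    values = torch.randint(1, hi_value + 1, (n, 1))      # 50 values in [1, hi_value]
    hi_cost = torch.randint(1, 26, (1,)).item()          # integer in [1, 25]
    costs = torch.randint(1, hi_cost + 1, (n, 3))        # 150 cost entries in [1, hi_cost]
    return torch.cat([values, costs], dim=1).float()     # (50, 4)

X = generate_knapsack_instance()
print(X.shape)   # torch.Size([50, 4])
# Budgets w = (100, 80, 50) constrain the three cost columns; labels come from a
# standard dynamic-programming multidimensional knapsack solver (Martello & Toth, 1990).
```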
**Optimization.** We implemented the experiments in PyTorch (Paszke et al., 2017) with the Adam optimizer (Kingma & Ba, 2014). For classification we used the cross entropy loss and trained for 150 epochs with learning rate 0.001, learning rate decay of 0.5 every 100 epochs and batch size 32. For the quadratic function regression we trained for 150 epochs with learning rate 0.001, learning rate decay of 0.1 every 50 epochs and batch size 64; for the regression to the leading eigenvector we trained for 50 epochs with learning rate 0.001 and batch size 32. To regress to the output of a single graph convolution layer we trained for 200 epochs with learning rate 0.001 and batch size 32.
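A minimal sketch of the classification training setup described above (our own code, assuming a generic `model` and a `train_loader` with batch size 32; treating each element's label as a two-class problem for the cross entropy loss is our assumption):

```python
import torch
import torch.nn as nn

def train_classification(model, train_loader, epochs=150):
    # Adam, lr 0.001, lr decay 0.5 every 100 epochs, cross entropy loss (Appendix C).
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for X, y in train_loader:           # X: (batch, n, 4), y: (batch, n) in {0, 1}
            optimizer.zero_grad()
            logits = model(X)               # (batch, n, 2) per-element class scores
            loss = criterion(logits.reshape(-1, 2), y.reshape(-1).long())
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```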