# On the Universality of Invariant Networks

Haggai Maron 1, Ethan Fetaya 2 3, Nimrod Segol 1, Yaron Lipman 1

1 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. 2 Department of Computer Science, University of Toronto, Toronto, Canada. 3 Vector Institute. Correspondence to: Haggai Maron.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Constraining linear layers in neural networks to respect symmetry transformations from a group $G$ is a common design principle for invariant networks that has found many applications in machine learning. In this paper, we consider a fundamental question that has received little attention to date: Can these networks approximate any (continuous) invariant function? We tackle the rather general case where $G \leq S_n$ (an arbitrary subgroup of the symmetric group) acts on $\mathbb{R}^n$ by permuting coordinates. This setting includes several recent popular invariant networks. We present two main results: First, $G$-invariant networks are universal if high-order tensors are allowed. Second, there are groups $G$ for which higher-order tensors are unavoidable for obtaining universality. $G$-invariant networks consisting of only first-order tensors are of special interest due to their practical value. We conclude the paper by proving a necessary condition for the universality of $G$-invariant networks that incorporate only first-order tensors.

1. Introduction

The basic paradigm of deep neural networks is repeatedly composing layers of linear functions with non-linear, entrywise activation functions to create effective predictive models for learning tasks of interest. When trying to learn a function (task) $f$ that is known to be invariant to some group of symmetries $G$ (i.e., a $G$-invariant function), it is common to use linear layers that respect this symmetry, namely, invariant and/or equivariant linear layers. Networks with invariant/equivariant linear layers with respect to some group $G$ will be referred to here as $G$-invariant networks.

A fundamental question in learning theory is that of approximation or universality (Cybenko, 1989; Hornik, 1991). In the invariant case: Can a $G$-invariant network approximate an arbitrary continuous $G$-invariant function? The goal of this paper is to address this question for all finite permutation groups $G \leq S_n$, where $S_n$ is the symmetric group acting on $[n] = \{1, 2, \dots, n\}$. Note that this is a fairly general setting that contains many useful examples (detailed below).

The archetypal example of $G$-invariant networks is Convolutional Neural Networks (CNNs) (LeCun et al., 1989; Krizhevsky et al., 2012), which restrict their linear layers to convolutions in order to learn image tasks that are translation invariant or equivariant.¹ In recent years researchers have considered other types of data and/or symmetries, and consequently new $G$-invariant networks have emerged. Tasks involving point clouds or sets are in general invariant to the order of the input, and therefore permutation invariance/equivariance was developed (Qi et al., 2017; Zaheer et al., 2017). Learning tasks involving interaction between different sets, where the input data is tabular, require dealing with different permutations acting independently on each set (Hartford et al., 2018).
Tasks involving graphs and hyper-graphs lead to symmetries defined by tensor products of permutations (Kondor et al., 2018; Maron et al., 2019). A general treatment of invariance/equivariance to finite subgroups of the symmetric group is discussed in (Ravanbakhsh et al., 2017); infinite symmetries are discussed in general in (Kondor & Trivedi, 2018) as well as in (Cohen & Welling, 2016a;b; Cohen et al., 2018; Weiler et al., 2018).

Among these examples, universality is known for point-cloud networks and set networks (Qi et al., 2017; Zaheer et al., 2017), as well as for networks invariant to finite translation groups (e.g., cyclic convolutional neural networks) (Yarotsky, 2018). However, universality is not known for tabular and multi-set networks (Hartford et al., 2018), graph and hyper-graph networks (Kondor et al., 2018; Maron et al., 2019), and networks invariant to finite translations with rotations and/or reflections. We cover all these cases in this paper.

¹ It is common to use convolutional layers without cyclic padding, which implies that these networks are not precisely translation invariant.

Maybe the most closely related work to ours is (Yarotsky, 2018), which considered actions of compact groups and suggested provably universal architectures that are based on polynomial layers. In contrast, we study the standard and widely used linear layer model.

The paper is organized as follows: First, we prove that an arbitrary continuous function $f : \mathbb{R}^n \to \mathbb{R}$ invariant to an arbitrary permutation group $G \leq S_n$ can be approximated using a $G$-invariant network. The proof is constructive and makes use of linear equivariant layers between tensors $X \in \mathbb{R}^{n^k}$ of order $k \leq d$, where $d$ depends on the permutation group $G$. Second, we prove a lower bound on the order $d$ of tensors used in a $G$-invariant network so as to achieve universality. Specifically, we show that for $G = A_n$ (the alternating group) any $G$-invariant network that uses tensors of order at most $d = (n-2)/2$ cannot approximate arbitrary $G$-invariant functions. We conclude the paper by considering the question: For which groups $G \leq S_n$ are $G$-invariant networks using only first-order tensors universal? We prove a necessary condition, and describe families of groups for which universality cannot be attained using only first-order tensors.

2. Preliminaries and main results

The symmetries we consider in this paper are arbitrary subgroups of the symmetric group, i.e., $G \leq S_n$. The action of $G$ on $x \in \mathbb{R}^n$ used in this paper is defined as

$$g \cdot x = (x_{g^{-1}(1)}, \dots, x_{g^{-1}(n)}), \quad g \in G. \quad (1)$$

The action of $G$ on tensors $X \in \mathbb{R}^{n^k \times a}$ (the last index, denoted $j$, represents feature depth) is defined similarly by

$$(g \cdot X)_{i_1 \dots i_k, j} = X_{g^{-1}(i_1) \dots g^{-1}(i_k), j}, \quad g \in G. \quad (2)$$

The inset illustrates this action on tensors of order $k = 1, 2, 3$: the permutation $g$ is a transposition of two numbers and is applied to each dimension of the tensor.

Definition 1. A $G$-invariant function is a function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies $f(g \cdot x) = f(x)$ for all $x \in \mathbb{R}^n$ and $g \in G$.

Definition 2. A linear equivariant layer is an affine map $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$ satisfying $L(g \cdot X) = g \cdot L(X)$, for all $g \in G$ and $X \in \mathbb{R}^{n^k \times a}$. An invariant linear layer is an affine map $h : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{b}$ satisfying $h(g \cdot X) = h(X)$, for all $g \in G$ and $X \in \mathbb{R}^{n^k \times a}$.

Figure 1. Illustration of the invariant network architecture. The function is composed of multiple linear $G$-equivariant layers (gray), possibly of high order, and ends with a linear $G$-invariant function (light blue) followed by a Multi-Layer Perceptron (yellow).
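To make the actions in Equations (1) and (2) concrete, here is a minimal sketch (ours, not from the paper), assuming NumPy; it implements $g \cdot x$ and $g \cdot X$ and checks that a simple summation layer is invariant:

```python
import numpy as np

def act_on_vector(g, x):
    """Action of a permutation g on x in R^n (equation 1): (g.x)_i = x_{g^{-1}(i)}.
    g is encoded as an integer array with g[i] = g(i), 0-based."""
    g_inv = np.argsort(g)            # argsort of a permutation array is its inverse
    return x[g_inv]

def act_on_tensor(g, X):
    """Action of g on X in R^{n^k x a} (equation 2):
    (g.X)_{i1...ik,j} = X_{g^{-1}(i1)...g^{-1}(ik),j}; the last axis is the feature axis."""
    g_inv = np.argsort(g)
    for axis in range(X.ndim - 1):   # permute every index axis, leave the feature axis alone
        X = np.take(X, g_inv, axis=axis)
    return X

# Sanity check: summation over all index tuples is an S_n-invariant (hence G-invariant) layer.
n, a = 4, 3
X = np.random.randn(n, n, a)         # an order-2 tensor with a feature channels
g = np.array([2, 0, 3, 1])           # some permutation of {0, 1, 2, 3}
assert np.allclose(X.sum(axis=(0, 1)), act_on_tensor(g, X).sum(axis=(0, 1)))
```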
A common way to construct $G$-invariant networks is:

Definition 3. A $G$-invariant network is a function $F : \mathbb{R}^{n \times a} \to \mathbb{R}$ defined as $F = m \circ h \circ L_d \circ \sigma \circ \cdots \circ \sigma \circ L_1$, where the $L_i$ are linear $G$-equivariant layers, $\sigma$ is an activation function,² $h$ is a $G$-invariant layer, and $m$ is a Multi-Layer Perceptron (MLP).

² We assume any activation function for which the universal approximation theorem for MLPs holds, e.g., ReLU and sigmoid.

Figure 1 illustrates the $G$-invariant network model. By construction, $G$-invariant networks are $G$-invariant functions (note that entrywise activation is equivariant as well). This framework has been used, with an appropriate group $G$, in previous works to build predictive $G$-invariant models for learning.

Our goal is to show the approximation power of $G$-invariant networks, namely, that $G$-invariant networks can approximate arbitrary continuous $G$-invariant functions $f$. Without loss of generality, we consider only functions of the form $f : \mathbb{R}^n \to \mathbb{R}$. Indeed, in case of multiple features, $\mathbb{R}^{n \times a}$, we rearrange the input as $\mathbb{R}^{n'}$, $n' = na$, and take the appropriate $G' \leq S_{n'}$. We prove:

Theorem 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuous $G$-invariant function for some $G \leq S_n$, and $K \subset \mathbb{R}^n$ a compact set. There exists a $G$-invariant network that approximates $f$ to an arbitrary precision.

The proof of Theorem 1 is constructive and builds an $f$-approximating $G$-invariant network with hidden tensors $X \in \mathbb{R}^{n^d}$ of order $d$, where $d = d(G)$ is a natural number depending on the group $G$. Unfortunately, we show that in the worst case $d$ can be as high as $\frac{n(n-1)}{2}$. Note that $d = 2$ could already be computationally challenging. It is therefore of interest to ask whether there exist more efficient $G$-invariant networks that use lower-order tensors without sacrificing approximation power. Surprisingly, the answer is that in general we cannot go much lower: for general permutation groups $G$ the required order grows linearly with $n$. Specifically, we prove the following for $G = A_n$, the alternating group:

Theorem 2. If an $A_n$-invariant network has the universal approximation property, then it consists of tensors of order at least $\frac{n-2}{2}$.

Although in general we cannot expect universal approximation of $G$-invariant networks with inner tensor order smaller than $\frac{n-2}{2}$, it is still possible that for specific groups of interest we can prove approximation power with more efficient (i.e., lower-order inner tensor) $G$-invariant networks. Of specific interest are $G$-invariant networks that use only first-order tensors. In Section 5 we prove the following necessary condition for universality of first-order $G$-invariant networks:

Theorem 3. Let $G \leq S_n$. If first order $G$-invariant networks are universal, then $|[n]^2/H| < |[n]^2/G|$ for any strict super-group $G < H \leq S_n$.

Here $|[n]^2/G|$ is the number of equivalence classes of $[n]^2$ defined by the relation $(i_1, i_2) \sim (j_1, j_2)$ if $j_\ell = g(i_\ell)$, $\ell = 1, 2$, for some $g \in G$. Intuitively, this condition asks that super-groups of $G$ have strictly better separation of the double index space $[n]^2$.
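Before turning to the proofs, the following is a minimal sketch of Definition 3 for the simplest practical case, $G = S_n$ with first-order tensors only. It uses the known two-parameter form of $S_n$-equivariant layers from (Zaheer et al., 2017) and sum pooling as the invariant layer; this is an illustration of ours, with random rather than trained weights, and all names are illustrative:

```python
import numpy as np

def equivariant_layer(X, theta1, theta2, bias):
    """First-order S_n-equivariant layer L: R^{n x a} -> R^{n x b}, in the two-parameter
    form of (Zaheer et al., 2017): L(X) = X @ theta1 + (1 1^T / n) X @ theta2 + bias."""
    mean = X.mean(axis=0, keepdims=True)
    return X @ theta1 + np.repeat(mean, X.shape[0], axis=0) @ theta2 + bias

def invariant_layer(X, theta):
    """S_n-invariant layer h: R^{n x b} -> R^b, sum pooling followed by a linear map."""
    return X.sum(axis=0) @ theta

def mlp(y, weights):
    """The final MLP m, with ReLU activations."""
    for W, b in weights[:-1]:
        y = np.maximum(y @ W + b, 0.0)
    W, b = weights[-1]
    return y @ W + b

def invariant_network(X, layer_params, inv_theta, mlp_weights):
    """F = m . h . L_d . sigma . ... . sigma . L_1 (Definition 3), with sigma = ReLU."""
    for theta1, theta2, bias in layer_params:
        X = np.maximum(equivariant_layer(X, theta1, theta2, bias), 0.0)
    return mlp(invariant_layer(X, inv_theta), mlp_weights)

# Sanity check: the output does not change when the rows (set elements) are permuted.
rng = np.random.default_rng(0)
n, a, b = 5, 3, 4
layer_params = [(rng.standard_normal((a, b)), rng.standard_normal((a, b)), rng.standard_normal(b))]
inv_theta = rng.standard_normal((b, b))
mlp_weights = [(rng.standard_normal((b, b)), rng.standard_normal(b)),
               (rng.standard_normal((b, 1)), rng.standard_normal(1))]
X = rng.standard_normal((n, a))
g = rng.permutation(n)
assert np.allclose(invariant_network(X, layer_params, inv_theta, mlp_weights),
                   invariant_network(X[g], layer_params, inv_theta, mlp_weights))
```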
3. G-invariant networks universality

The key to showing Theorem 1, namely that $G$-invariant networks are universal, is showing they can approximate a set of functions that are: (i) $G$-invariant; and (ii) able to approximate arbitrary $G$-invariant functions to a desired precision. The $G$-invariant polynomials are an example of such a set:

Definition 4. The $G$-invariant polynomials are all the polynomials in $x_1, \dots, x_n$ over $\mathbb{R}$ that are also $G$-invariant functions. They are denoted $\mathbb{R}[x_1, \dots, x_n]^G$, where $\mathbb{R}[x_1, \dots, x_n]$ is the set of all polynomials over $\mathbb{R}$.

To see that $G$-invariant polynomials can approximate any arbitrary (continuous) function $f : K \subset \mathbb{R}^n \to \mathbb{R}$, where $K$ is a compact set, one can use the Stone-Weierstrass (SW) theorem, as done in (Yarotsky, 2018): First use SW to approximate $f$ over the symmetrized domain $\tilde{K} = \bigcup_{g \in G} g \cdot K$ by some (not necessarily $G$-invariant) polynomial $p \in \mathbb{R}[x_1, \dots, x_n]$. Second, consider $q(x) = \frac{1}{|G|} \sum_{g \in G} p(g \cdot x)$. $q$ is a $G$-invariant polynomial and hence $q \in \mathbb{R}[x_1, \dots, x_n]^G$; furthermore, for $x \in K$:

$$|q(x) - f(x)| \leq \frac{1}{|G|} \sum_{g \in G} |p(g \cdot x) - f(g \cdot x)| \leq \max_{x \in \tilde{K}} |p(x) - f(x)|.$$

Our goal in this section is to prove the following proposition that, together with the comment above, proves Theorem 1:

Proposition 1. For any $\epsilon > 0$, compact set $K \subset \mathbb{R}^n$, and $G$-invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ there exists a $G$-invariant network $F$ that approximates $p$ to an $\epsilon$-accuracy, namely $\max_{x \in K} |F(x) - p(x)| < \epsilon$.

The proposition will be proved in several steps: (i) We represent $p$ as $p(x) = \sum_{k=0}^{d} p_k(x)$, where $p_k$ is a $G$-invariant homogeneous polynomial of degree $k$, i.e., $p_k \in \mathbb{R}_k[x_1, \dots, x_n]^G$. (ii) We characterize all homogeneous $G$-invariant polynomials of a fixed degree $k$. In particular we find a basis of all such polynomials, $b_{k1}, b_{k2}, \dots, b_{kn_k} \in \mathbb{R}_k[x_1, \dots, x_n]^G$. Using the bases of homogeneous $G$-invariant polynomials of degrees up to $d$ we write

$$p(x) = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj}\, b_{kj}(x). \quad (3)$$

(iii) We approximate each basis element $b_{kj}$ using a $G$-invariant network. (iv) We construct a $G$-invariant network $F$ approximating $p$ to an $\epsilon$-accuracy using Equation 3 and (iii).

3.1. Proof of Proposition 1

Part (i): It is a known fact that a $G$-invariant polynomial can be written as a sum of homogeneous $G$-invariant polynomials (Kraft & Procesi, 2000):

Lemma 1. Let $p : \mathbb{R}^n \to \mathbb{R}$ be a $G$-invariant polynomial of degree $d$. Then $p$ can be written as $p(x) = \sum_{k=0}^{d} p_k(x)$, where the $p_k$ are homogeneous $G$-invariant polynomials of degree $k$.

Part (ii): We need to find bases for the linear spaces of homogeneous $G$-invariant polynomials of degree $k = 0, 1, \dots, d$, i.e., $\mathbb{R}_k[x_1, \dots, x_n]^G$. Any homogeneous polynomial of degree $k$ can be written as

$$p_k(x) = \sum_{i_1, \dots, i_k = 1}^{n} W_{i_1 \dots i_k}\, x_{i_1} \cdots x_{i_k}, \quad (4)$$

where $W \in \mathbb{R}^{n^k}$ is its coefficient tensor; since $x_{i_1} \cdots x_{i_k} = x_{i_{\sigma(1)}} \cdots x_{i_{\sigma(k)}}$ for all $\sigma \in S_k$, a unique choice of $W$ can be obtained by taking a symmetric $W$, that is, a $W$ that satisfies $W_{i_1 \dots i_k} = W_{i_{\sigma(1)} \dots i_{\sigma(k)}}$ for all $\sigma \in S_k$. In short, we ask $W \in \mathrm{Sym}^k(\mathbb{R}^n) \subset \mathbb{R}^{n^k}$. For example, the case $k = 2$ amounts to representing a quadratic form using a symmetric matrix, that is, $W$ satisfies in this case $W = W^T$.

The next proposition shows that if $p$ is $G$-invariant, its coefficient tensor is a fixed point of the action of $G$ on symmetric tensors $W \in \mathbb{R}^{n^k}$:

Proposition 2. Let $p \in \mathbb{R}_k[x_1, \dots, x_n]^G$. Then its coefficient tensor $W \in \mathbb{R}^{n^k}$ satisfies the fixed-point equation:

$$g \cdot W = W, \quad g \in G. \quad (5)$$

Proof. From the fact that $p$ is $G$-invariant we get the set of equations $p(x) = p(g \cdot x)$, for all $g \in G$. Now,

$$p(g \cdot x) = \sum_{i_1, \dots, i_k = 1}^{n} W_{i_1 \dots i_k}\, x_{g^{-1}(i_1)} \cdots x_{g^{-1}(i_k)} = \sum_{i_1, \dots, i_k = 1}^{n} W_{g(i_1) \dots g(i_k)}\, x_{i_1} \cdots x_{i_k}.$$

By equating the monomial coefficients of $p(x)$ and $p(g \cdot x)$ and using the symmetry of $W$ we get $W_{i_1 \dots i_k} = W_{g(i_1) \dots g(i_k)}$. This implies that $W$ satisfies $g \cdot W = W$ for all $g \in G$.

Equation 5 is a linear homogeneous system of equations and therefore the set of solutions $W$ forms a linear space. To define a basis for this linear space we first define the following equivalence relation: $(i_1, \dots, i_k) \sim (j_1, \dots, j_k)$ if there exist $g \in G$ and $\sigma \in S_k$ so that $j_\ell = g(i_{\sigma(\ell)})$, $\ell = 1, \dots, k$. Intuitively, $g$ takes care of the $G$-invariance while $\sigma$ factors out the fact that the monomials satisfy $x_{i_1} \cdots x_{i_k} = x_{i_{\sigma(1)}} \cdots x_{i_{\sigma(k)}}$. For example, let $n = 5$, $k = 3$, $g = (23)(45)$, $\sigma = (23)$ (we use cycle notation); then we have $(2, 2, 4) \sim (3, 5, 3)$. The equivalence classes are denoted $\tau$ and called the $k$-classes.
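The $k$-classes can be enumerated by brute force for small groups. A small sketch of ours (assuming NumPy; the helper names are illustrative, not from the paper), which also checks numerically that the resulting basis polynomials of Equation (6) below are $G$-invariant:

```python
import itertools
import numpy as np

def k_classes(group, n, k):
    """Brute-force enumeration of the k-classes. Factoring out sigma in S_k, a class is
    determined by the G-orbit of the multiset {i1,...,ik}, so we enumerate sorted tuples."""
    seen, classes = set(), []
    for tup in itertools.combinations_with_replacement(range(n), k):
        if tup in seen:
            continue
        orbit = {tuple(sorted(g[i] for i in tup)) for g in group}
        seen |= orbit
        classes.append(orbit)
    return classes

def p_tau(tau, x):
    """Evaluate the basis polynomial of equation (6): the sum of x_{i1}*...*x_{ik} over all
    index tuples in the class; each multiset is expanded into its distinct orderings."""
    total = 0.0
    for multiset in tau:
        for tup in set(itertools.permutations(multiset)):
            total += np.prod([x[i] for i in tup])
    return total

# Example: quadratic (k = 2) invariants of the cyclic group C_4 acting on [4] (0-based).
n, k = 4, 2
group = [tuple((i + s) % n for i in range(n)) for s in range(n)]
classes = k_classes(group, n, k)
x = np.random.randn(n)
for g in group:
    gx = x[np.argsort(g)]                       # (g.x)_i = x_{g^{-1}(i)}
    for tau in classes:
        assert np.isclose(p_tau(tau, x), p_tau(tau, gx))
```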
We show:

Proposition 3. The set of polynomials

$$p_\tau(x) = \sum_{(i_1, \dots, i_k) \in \tau} x_{i_1} \cdots x_{i_k}, \quad (6)$$

where $\tau$ is a $k$-class, forms a basis of $\mathbb{R}_k[x_1, \dots, x_n]^G$.

Proof. Denote by $W^\tau$ the symmetric coefficient tensor of $p_\tau$. Note that

$$W^\tau_{i_1 \dots i_k} = \begin{cases} 1 & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases}. \quad (7)$$

We have

$$p_\tau(g \cdot x) = \sum_{(i_1, \dots, i_k) \in \tau} x_{g^{-1}(i_1)} \cdots x_{g^{-1}(i_k)} = p_\tau(x),$$

hence $p_\tau \in \mathbb{R}_k[x_1, \dots, x_n]^G$. The set of polynomials $p_\tau$, with $\tau$ a $k$-class, is a linearly independent set since each $p_\tau$ contains a different collection of monomials. By Proposition 2, the symmetric coefficient tensor $W$ of every $q \in \mathbb{R}_k[x_1, \dots, x_n]^G$ satisfies the fixed-point equation, Equation 5. This in particular means that $W$ is constant on the $k$-classes. Hence $W$ can be written as a linear combination of the $W^\tau$; see also Equation 7.

As we later show, the fixed-point equation, Equation 5, is also used to characterize and compute a basis for the space of linear permutation-equivariant and -invariant layers (Maron et al., 2019). These equations are equivalently formulated as a weight-sharing scheme in (Ravanbakhsh et al., 2017). A slight difference in the present case, which deals with polynomials, is the additional constraints encoding the symmetry of $W$; these are needed since every polynomial of degree $> 1$ has several representing tensors $W$.

Part (iii): Our next step is approximating each $p_\tau$ with a $G$-invariant network. The next proposition introduces the building blocks of this construction:

Proposition 4. Let $\tau$ be a $k$-class and let $L^\tau_\ell : \mathbb{R}^n \to \mathbb{R}^{n^k}$, $\ell = 1, \dots, k$, be the linear operator defined as follows: for $x \in \mathbb{R}^n$,

$$L^\tau_\ell(x)_{i_1 \dots i_k} = \begin{cases} x_{i_\ell} & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases}.$$

Then $L^\tau_\ell$ is a linear $G$-equivariant function, that is, $L^\tau_\ell(g \cdot x) = g \cdot L^\tau_\ell(x)$ for all $x \in \mathbb{R}^n$, $g \in G$.

Proof. We have

$$\big(g \cdot L^\tau_\ell(x)\big)_{i_1 \dots i_k} = \begin{cases} x_{g^{-1}(i_\ell)} & (g^{-1}(i_1), \dots, g^{-1}(i_k)) \in \tau \\ 0 & \text{otherwise} \end{cases}.$$

On the other hand,

$$L^\tau_\ell(g \cdot x)_{i_1 \dots i_k} = \begin{cases} x_{g^{-1}(i_\ell)} & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases},$$

and both expressions are equal since $(i_1, \dots, i_k) \in \tau$ if and only if $(g^{-1}(i_1), \dots, g^{-1}(i_k)) \in \tau$, by the definition of $\tau$.
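A minimal numerical sketch of Proposition 4 (ours, assuming NumPy; names are illustrative): build the tensor $L^\tau_\ell(x)$ for a small $k$-class and verify equivariance against the action of Equation (2):

```python
import itertools
import numpy as np

def L_tau_ell(x, tau, ell, k):
    """Operator of Proposition 4: out[i1,...,ik] = x[i_ell] if (i1,...,ik) is in the
    k-class tau (given as a set of index tuples), and 0 otherwise."""
    n = len(x)
    out = np.zeros((n,) * k)
    for idx in tau:
        out[idx] = x[idx[ell]]
    return out

def act(g, T):
    """(g.T)_{i1...ik} = T_{g^{-1}(i1),...,g^{-1}(ik)} for a tensor with no feature axis."""
    g_inv = np.argsort(g)
    for axis in range(T.ndim):
        T = np.take(T, g_inv, axis=axis)
    return T

# Check equivariance for G = C_4 and the 2-class generated by the tuple (0, 1).
n, k = 4, 2
group = [np.array([(i + s) % n for i in range(n)]) for s in range(n)]
tau = {tuple(g[list(perm)]) for g in group for perm in itertools.permutations((0, 1))}
x = np.random.randn(n)
for g in group:
    gx = x[np.argsort(g)]                                   # (g.x)_i = x_{g^{-1}(i)}
    for ell in range(k):
        assert np.allclose(L_tau_ell(gx, tau, ell, k), act(g, L_tau_ell(x, tau, ell, k)))
```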
Next, we construct the approximating $G$-invariant network:

Proposition 5. For any $\epsilon > 0$, compact set $K \subset \mathbb{R}^n$, and $k$-class $\tau$ there exists a $G$-invariant network $F^\tau$ that approximates $p_\tau$ from Equation 6 to an $\epsilon$-accuracy.

Proof. Let $c > 0$ be sufficiently large so that $K \subset [-c, c]^n \subset \mathbb{R}^n$. Denote by $m_k : \mathbb{R}^k \to \mathbb{R}$ an MLP that approximates the multiplication function $f(y_1, \dots, y_k) = \prod_{i=1}^{k} y_i$ on $[-c, c]^k$ to $n^{-k}\epsilon$-accuracy, i.e., $\max_{-c \leq y_i \leq c} |f(y) - m_k(y)| < n^{-k}\epsilon$. Consider the following $G$-invariant network: First, given an input $x \in \mathbb{R}^n$, map it to $\mathbb{R}^{n^k \times k}$ (i.e., $k$ is the number of channels) by

$$L^\tau(x)_{i_1 \dots i_k, \ell} = L^\tau_\ell(x)_{i_1 \dots i_k}. \quad (8)$$

$L^\tau : \mathbb{R}^n \to \mathbb{R}^{n^k \times k}$ is a linear equivariant layer (see Equation 2). Second, apply $m_k$ to the feature dimension in $\mathbb{R}^{n^k \times k}$. That is, given $Y \in \mathbb{R}^{n^k \times k}$ define $M^k(Y)_{i_1 \dots i_k} = m_k(Y_{i_1 \dots i_k, 1}, \dots, Y_{i_1 \dots i_k, k})$. Note that $M^k : \mathbb{R}^{n^k \times k} \to \mathbb{R}^{n^k}$ can be interpreted as a composition of equivariant linear layers.³ Lastly, denote by $s : \mathbb{R}^{n^k} \to \mathbb{R}$ the summation layer: for $Z \in \mathbb{R}^{n^k}$, $s(Z) = \sum_{i_1, \dots, i_k = 1}^{n} Z_{i_1 \dots i_k}$. Note that $M^k$ and $s$ are equivariant and invariant (respectively) for all $G \leq S_n$. This construction can be visualized using the following diagram:

$$\mathbb{R}^n \xrightarrow{\;L^\tau\;} \mathbb{R}^{n^k \times k} \xrightarrow{\;M^k\;} \mathbb{R}^{n^k} \xrightarrow{\;s\;} \mathbb{R}.$$

This $G$-invariant network $F^\tau = s \circ M^k \circ L^\tau$ approximates $p_\tau$ to an $\epsilon$-accuracy over the compact set $K \subset \mathbb{R}^n$. Indeed, let $x \in K$; then

$$|F^\tau(x) - p_\tau(x)| \leq \sum_{i_1, \dots, i_k = 1}^{n} \left| M^k(L^\tau(x))_{i_1 \dots i_k} - W^\tau_{i_1 \dots i_k}\, x_{i_1} \cdots x_{i_k} \right| = \sum_{i_1, \dots, i_k = 1}^{n} \begin{cases} \left| m_k(x_{i_1}, \dots, x_{i_k}) - x_{i_1} \cdots x_{i_k} \right| & (i_1, \dots, i_k) \in \tau \\ \left| m_k(0, \dots, 0) \right| & \text{otherwise} \end{cases} \;\leq\; \epsilon,$$

where in the last inequality we used the $n^{-k}\epsilon$-accuracy of $m_k$ in approximating the product operator on $[-c, c]^k \subset \mathbb{R}^k$: each of the $n^k$ summands is bounded by $n^{-k}\epsilon$.

Part (iv): In the final stage, we would like to approximate an arbitrary $p \in \mathbb{R}[x_1, \dots, x_n]^G$ with a $G$-invariant network to $\epsilon$-accuracy over a compact set $K \subset \mathbb{R}^n$.

³ In fact, any application of an MLP to the feature dimension is $G$-equivariant for any $G \leq S_n$, since it can be realized by scalings of the identity operator, possibly with a constant and nonlinear pointwise activations (see, e.g., (Qi et al., 2017; Zaheer et al., 2017)).

Proof (Proposition 1). Let us denote by $b_{k1}, \dots, b_{kn_k}$ the polynomials $p_\tau$, with $\tau$ the $k$-classes. Let $F^{kj}$ denote the $G$-invariant network approximating $b_{kj}$, $k = 0, 1, \dots, d$, $j \in [n_k]$, to an $\epsilon$-accuracy over the set $K$, the existence of which is guaranteed by Proposition 5. We now utilize the decomposition of $p$ shown in Equation 3 and get

$$\left| p(x) - \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}(x) \right| \leq \sum_{k=0}^{d} \sum_{j=1}^{n_k} |\alpha_{kj}| \left| b_{kj}(x) - F^{kj}(x) \right| \leq \|\alpha\|_1\, \epsilon,$$

where $\|\alpha\|_1 = \sum_{k,j} |\alpha_{kj}|$ depends only upon $p$, and $\epsilon$ is arbitrary. To finish the proof we need to show that $F = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}$ can indeed be realized as a single, unified $G$-invariant network. This is a simple yet technical construction and we defer the proof of this fact to the supplementary material:

Lemma 2. There exists a $G$-invariant network, in the sense of Definition 3, that realizes the sum of $G$-invariant networks $F = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}$.

3.2. Bounded order construction

We have constructed a $G$-invariant network $F$ that approximates an arbitrary $G$-invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ of degree $d$. The network $F$ uses tensors of order $d$, where $d$ matches the degree of $p$. In this subsection we construct a $G$-invariant network $F$ that approximates $p$ with a maximal tensor order that depends only on the group $G \leq S_n$. Therefore, the tensor order is independent of the degree of the polynomial $p$. We use the following theorem by Noether (Kraft & Procesi, 2000):

Theorem 4 (Noether). Let $G$ be a finite group acting linearly on $\mathbb{R}^n$. There exist finitely many $G$-invariant polynomials $f_1, \dots, f_m \in \mathbb{R}[x_1, \dots, x_n]^G$ such that any invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ can be expressed as $p(x) = h(f_1(x), \dots, f_m(x))$, where $h \in \mathbb{R}[x_1, \dots, x_m]$ is a polynomial and $\deg(f_i) \leq |G|$, $i = 1, \dots, m$.

The idea of using a set of generating invariant polynomials in the context of universality was introduced in (Yarotsky, 2018). For the case of interest in this paper, namely $G \leq S_n$, there exists a generating set of $G$-invariant polynomials of degree bounded by $\frac{n(n-1)}{2}$, for $n \geq 3$; see (Göbel, 1995). We can now repeat the construction above, building a $G$-invariant network $F_i$ approximating $f_i$ to an $\epsilon_1$-accuracy, $i = 1, \dots, m$. The maximal order of these networks is bounded by $d \leq \frac{n(n-1)}{2}$. These networks can be combined, as above, into a single $G$-invariant network $F : \mathbb{R}^n \to \mathbb{R}^m$ with the final output approximating $f(x) = (f_1(x), \dots, f_m(x))$ to an $\epsilon_1$-accuracy. Now we compose the output of $F$ with an MLP $H : \mathbb{R}^m \to \mathbb{R}$ approximating the polynomial $h$ over the compact set $f(K) + B_\epsilon \subset \mathbb{R}^m$ to an $\epsilon$-accuracy, where $B_\epsilon$ is a closed ball centered at the origin of radius $\epsilon$ and the sum is the Minkowski sum. Since $H$ is continuous and $f(K) + B_\epsilon$ is compact, there exists $\delta > 0$ so that $|H(y) - H(y')| \leq \epsilon$ if $\|y - y'\|_2 \leq \delta$. We use $\epsilon_1 = \min\{\delta, \epsilon\}$ for the construction of $F$ above. We have

$$|H(F(x)) - h(f(x))| \leq |H(F(x)) - H(f(x))| + |H(f(x)) - h(f(x))| \leq 2\epsilon,$$

for all $x \in K$. We have thus constructed $H \circ F$, a $G$-invariant network with maximal tensor order bounded by $\frac{n(n-1)}{2}$, approximating $p$ to an arbitrary precision.
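As a tiny sanity check of the idea behind Theorem 4 (a sketch of ours, not part of the paper's construction), consider the simplest case $G = S_2$ acting on $\mathbb{R}^2$: the elementary symmetric polynomials $f_1 = x_1 + x_2$, $f_2 = x_1 x_2$ are a classical generating set, and the invariant $p(x) = x_1^2 + x_2^2$ equals $h(f_1, f_2)$ with $h(y_1, y_2) = y_1^2 - 2 y_2$:

```python
import numpy as np

def f1(x): return x[0] + x[1]            # elementary symmetric polynomial e_1
def f2(x): return x[0] * x[1]            # elementary symmetric polynomial e_2

def p(x): return x[0] ** 2 + x[1] ** 2   # an S_2-invariant polynomial
def h(y1, y2): return y1 ** 2 - 2 * y2   # p expressed in the generators: e_1^2 - 2 e_2

x = np.random.randn(2)
assert np.isclose(p(x), h(f1(x), f2(x)))
```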
3.3. Examples

Universality of (hyper-)graph networks. Graph, or hyper-graph, data can be described using tensors $X \in \mathbb{R}^{n^k \times a}$, where $n$ is the number of vertices of the graph and $x_{i_1, i_2, \dots, i_k, :} \in \mathbb{R}^a$ is a feature vector attached to a (generalized) edge defined by the ordered set of vertices $(i_1, i_2, \dots, i_k)$. For example, an adjacency matrix of an $n$-vertex graph is described by $X \in \mathbb{R}^{n^2}$. The graph symmetries are reorderings of the vertices by a permutation, namely $g \cdot X$, where $g \in S_n$. Typically, any function we would like to learn on graphs would be invariant to this action. Recently, (Maron et al., 2019) characterized the spaces of equivariant and invariant linear layers with this symmetry, provided a formula for their basis, and employed the corresponding $G$-invariant networks for learning graph-related tasks. A corollary of Theorem 1 is that this construction yields a universal approximator of continuous functions defined on graphs. This is in contrast to the popular message passing neural network model (Gilmer et al., 2017) that was recently shown to be non-universal (Xu et al., 2019).

Universality of rotation invariant convolutional networks. For learning tasks involving $m \times m$ images one might require invariance to periodic translations and 90-degree rotations. Note that periodic translations and 90-degree rotations can be seen as permutations in $S_n$, $n = m^2$, acting on the pixels of the image. Constructing a suitable $G$-invariant network would lead, according to Theorem 1, to a universal approximator.

4. A lower bound on equivariant layer order

In the previous section we showed how an arbitrary $G$-invariant polynomial can be approximated with a $G$-invariant network with tensor order $d = d(G) \leq \frac{n(n-1)}{2}$. This upper bound would be prohibitive in practice. In this section we prove a lower bound: we show that there exists a group for which the tensor order cannot be less than $\frac{n-2}{2}$ if we wish to maintain the universal approximation property. We consider the alternating group, $G = A_n \leq S_n$. Remember that $g \in A_n$ if $g$ is a composition of an even number of transpositions.

Figure 2. Illustration of the $(n-2)$-transitivity of $A_n$, the main property we use in this section. Any subset of $n-2$ distinct elements can be mapped to any other subset of $n-2$ distinct elements (gray). If needed, a transposition can be applied to the remaining 2 elements (blue) to assure an even permutation.

Definition 5. A group $G \leq S_n$ is $k$-transitive if for every two sequences $(i_1, i_2, \dots, i_k)$, $(j_1, j_2, \dots, j_k)$ of distinct elements in $[n]$ there exists $g \in G$ so that $j_\ell = g(i_\ell)$, for $\ell = 1, \dots, k$.

The alternating group is $(n-2)$-transitive (see Figure 2 and (Dixon & Mortimer, 1996)). Our goal is to prove:

Theorem 2. If an $A_n$-invariant network has the universal approximation property, then it consists of tensors of order at least $\frac{n-2}{2}$.

For the proof we first need a characterization of the linear equivariant layers $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$, where $l = 0$ represents the invariant case. By definition, $L(g \cdot X) = g \cdot L(X)$ for all $X \in \mathbb{R}^{n^k \times a}$. In particular this means that $g^{-1} \cdot L(g \cdot X) = L(X)$. Recall that $L$ is an affine map (see Definition 2) and therefore can be represented as a sum of a purely linear part and a constant part. Representing the linear part of $L$ as a tensor $L \in \mathbb{R}^{n^{k+l} \times a \times b}$, these equations become the fixed-point equation for linear equivariant layers (see the supplementary material for a derivation):

$$g \cdot L = L, \quad g \in G. \quad (9)$$

The constant part of $L$ can be encoded using a tensor $B \in \mathbb{R}^{n^l \times b}$ that satisfies Equation 9 as well. Note that this fixed-point equation is similar to the fixed-point equation of homogeneous $G$-invariant polynomials, Equation 5. We denote by $\mathcal{L}_G$ the collection of linear $G$-equivariant ($l > 0$) or $G$-invariant ($l = 0$) layers $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$.
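The solution space of Equation (9) can again be enumerated by brute force for small groups: each orbit of index tuples gives one indicator basis tensor, i.e., one free parameter of the layer. A small sketch of ours (names are illustrative), which also previews Proposition 6 below by comparing $A_n$ with $S_n$:

```python
import itertools

def is_even(p):
    """True if the permutation p (p[i] = p(i)) is even, computed from its cycle lengths."""
    seen, parity = set(), 0
    for i in range(len(p)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = p[j]
            length += 1
        parity ^= (length - 1) & 1
    return parity == 0

def fixed_point_orbits(group, n, order):
    """Orbits of index tuples in [n]^order under (i1,...,i_order) -> (g(i1),...,g(i_order)).
    Each orbit corresponds to one indicator tensor solving g.L = L (equation 9)."""
    seen, orbits = set(), []
    for tup in itertools.product(range(n), repeat=order):
        if tup in seen:
            continue
        orbit = {tuple(g[i] for i in tup) for g in group}
        seen |= orbit
        orbits.append(orbit)
    return orbits

n = 5
sym = list(itertools.permutations(range(n)))
alt = [p for p in sym if is_even(p)]
# For order k + l = 2 <= n - 2 both groups give the same two "equality patterns"
# {(i, i)} and {(i, j): i != j}, matching the two-parameter S_n-equivariant layer
# of (Zaheer et al., 2017); this is the content of Proposition 6 below.
print(len(fixed_point_orbits(sym, n, order=2)), len(fixed_point_orbits(alt, n, order=2)))  # -> 2 2
```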
Proposition 6. If $k + l \leq n - 2$, then $\mathcal{L}_{A_n} = \mathcal{L}_{S_n}$.

Proof. In view of the fixed-point equation for equivariant/invariant layers (9), we need to show that the solution set of this equation is identical for $G = A_n$ and $G = S_n$, as long as $k + l \leq n - 2$. The solution set of the fixed-point equation consists of tensors $L$ that are constant on each equivalence class defined by the equivalence relation: $(i_1, \dots, i_{k+l}) \sim (j_1, \dots, j_{k+l})$ if there exists $g \in G$ so that $j_\ell = g(i_\ell)$ for $\ell = 1, \dots, k + l$. Both $A_n$ and $S_n$ are $(n-2)$-transitive.⁴ Therefore, the equivalence relations defined above for $A_n$ and $S_n$ reduce to the same equivalence relation: $(i_1, \dots, i_{k+l}) \sim (j_1, \dots, j_{k+l})$ if $i_\alpha = i_\beta$ if and only if $j_\alpha = j_\beta$, for all $\alpha, \beta \in [k + l]$ (see (Maron et al., 2019), where these classes are called equality patterns). Since this equivalence relation is the same for $A_n$ and $S_n$, we get that the solution set of the fixed-point equation (9) is the same for both groups. Since the constant-part tensor $B$ is of smaller order than $k + l \leq n - 2$, the same argumentation applies to the constant part as well.

⁴ $S_n$ is in fact $n$-transitive and is therefore also $k$-transitive for all $k \leq n$.

Proposition 6 implies that any $A_n$-invariant network with tensor order at most $(n-2)/2$ will in fact be $S_n$-invariant. Therefore, one approach to show that such networks have limited approximation power is to come up with an $A_n$-invariant continuous function that is not $S_n$-invariant, as follows:

Proof (Theorem 2). Consider the Vandermonde polynomial $V(x) = \prod_{1 \leq i < j \leq n} (x_j - x_i)$. $V$ is $A_n$-invariant, since even permutations preserve its sign, whereas an odd permutation such as $g = (12)$ flips it: $V(g \cdot x) = -V(x)$, so $V$ is not $S_n$-invariant. Pick $x$ with distinct coordinates, so that $V(x) \neq 0$, let $\epsilon > 0$, and let $K \subset \mathbb{R}^n$ be a compact set containing both $x$ and $g \cdot x$ for $g = (12)$. Assume by way of contradiction that there exists an $A_n$-invariant network $F$, which is $S_n$-invariant due to the above, such that $|V(x) - F(x)| \leq \epsilon$ as well as

$$|V(g \cdot x) - F(g \cdot x)| = |(-1)V(x) - F(x)| = |V(x) + F(x)| \leq \epsilon.$$

These last equations imply that $|V(x)| \leq \epsilon$, and since $\epsilon$ is arbitrary we get $V(x) = 0$, a contradiction.
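A quick numerical check (ours; the sign helper is standard) of the sign behavior of the Vandermonde polynomial used in the proof above, $V(g \cdot x) = \mathrm{sign}(g)\, V(x)$:

```python
import itertools
import numpy as np

def vandermonde(x):
    """V(x) = prod_{i < j} (x_j - x_i)."""
    n = len(x)
    return np.prod([x[j] - x[i] for i in range(n) for j in range(i + 1, n)])

def sign(g):
    """Sign of a permutation g (g[i] = g(i)), computed from its cycle lengths."""
    seen, s = set(), 1
    for i in range(len(g)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = g[j]
            length += 1
        s *= (-1) ** (length - 1)
    return s

n = 4
x = np.random.randn(n)                          # distinct coordinates almost surely
for g in map(np.array, itertools.permutations(range(n))):
    gx = x[np.argsort(g)]                       # (g.x)_i = x_{g^{-1}(i)}
    assert np.isclose(vandermonde(gx), sign(g) * vandermonde(x))
```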
5. Universality of first order networks

We have seen that $G$-invariant networks with tensor order $\frac{n(n-1)}{2}$ are universal. On the other hand, for general permutation groups $G$ the tensor order is at least $(n-2)/2$ if universality is required. A particularly important question for applications, where higher-order tensors are computationally prohibitive, is which permutation groups $G$ give rise to first order $G$-invariant networks that are universal.

Definition 6. A first order $G$-invariant network is a $G$-invariant network where the maximal tensor order is 1.

In this section we discuss this (mostly) open question. First, we note that there are a few cases for which first order $G$-invariant networks are known to be universal: for instance, when $G = \{e\}$ (i.e., the trivial group), $G$-invariant networks are composed of fully connected layers, a case which is covered by the original universal approximation theorems (Cybenko, 1989; Hornik, 1991). First order universality is also known when $G$ is a (possibly high-dimensional) grid (e.g., $G = \mathbb{Z}_{n_1} \times \cdots \times \mathbb{Z}_{n_k}$) (Yarotsky, 2018), a case that includes periodic convolutional neural networks. Universality of first order networks is also known when $G = S_n$ (Zaheer et al., 2017; Qi et al., 2017; Yarotsky, 2018), in the context of invariant networks that operate on sets or point clouds.

Our goal in this section is to derive a necessary condition on $G$ for the universality of first order $G$-invariant networks. To this end, we first find a function, playing the role of the Vandermonde polynomial in the previous section, that is $G$-invariant but not $H$-invariant, where $G < H \leq S_n$.

Lemma 3. Let $G < H \leq S_n$. Then there exists a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ which is $G$-invariant but not $H$-invariant.

Proof. Pick a point $x_0 \in \mathbb{R}^n$ with distinct coordinates. Since the stabilizer $(S_n)_{x_0}$ is trivial (i.e., no permutation other than the identity fixes $x_0$), the size of the orbit of $x_0$ equals the size of the acting group, namely $|G \cdot x_0| = |G|$ and $|H \cdot x_0| = |H|$. Furthermore, since $|G| < |H|$ and $G \cdot x_0 \subseteq H \cdot x_0$, we get that the $H$-orbit strictly contains the $G$-orbit, that is, $G \cdot x_0 \subsetneq H \cdot x_0$. Since $H \cdot x_0$ is a finite set of points, there exists a continuous function $\hat{f}$ such that $\hat{f}|_{G \cdot x_0} = 1$ and $\hat{f}|_{H \cdot x_0 \setminus G \cdot x_0} = 0$. Define $f(x) = \frac{1}{|G|} \sum_{g \in G} \hat{f}(g \cdot x)$. Now, $f$ is $G$-invariant by construction, but $f(x_0) = 1$ and $f(h \cdot x_0) = 0$ for $h \cdot x_0 \in H \cdot x_0 \setminus G \cdot x_0$. Therefore, $f$ is not $H$-invariant.

In the case of first order $G$-invariant networks the equivariant/invariant layers have the form $L : \mathbb{R}^{n \times a} \to \mathbb{R}^{n \times b}$ and satisfy the fixed-point equations (9). The solution set of the purely linear equivariant layers consists of tensors $L \in \mathbb{R}^{n^2 \times a \times b}$ that are constant on equivalence classes of indices defined by the equivalence relation $(i_1, i_2) \sim (j_1, j_2)$ if there exists $g \in G$ so that $j_\ell = g(i_\ell)$, $\ell = 1, 2$. We denote the number of equivalence classes by $|[n]^2/G|$. The solution set of the constant equivariant operators consists of tensors $B \in \mathbb{R}^{n \times b}$ that are constant on equivalence classes defined by the equivalence relation $i \sim j$ if there exists $g \in G$ so that $j = g(i)$. We denote the number of these classes by $|[n]/G|$. We prove:

Theorem 3. Let $G \leq S_n$. If first order $G$-invariant networks are universal, then $|[n]^2/H| < |[n]^2/G|$ for any strict super-group $G < H \leq S_n$.

Proof. Assume by contradiction that there exists a strict super-group $G < H \leq S_n$ so that $|[n]^2/G| = |[n]^2/H|$. This in particular means that $|[n]/G| = |[n]/H|$ (indeed, every $H$-class is a union of $G$-classes, so equality of the total counts forces the classes of the diagonal pairs $(i, i)$, and hence the classes of $[n]$, to coincide as well). Therefore $\mathcal{L}_G = \mathcal{L}_H$; that is, the spaces of equivariant and invariant linear layers coincide for $G$ and $H$. This implies, as before, that every first order $G$-invariant network is also $H$-invariant. We proceed similarly to the proof of Theorem 2: By Lemma 3, there exists a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ that is $G$-invariant but not $H$-invariant. Let $x_0$ be a point with distinct coordinates where $f(x_0) = 1$ (it exists by construction, see the proof of Lemma 3). Furthermore, by construction $f(h \cdot x_0) = 0$ if $h \cdot x_0 \in H \cdot x_0 \setminus G \cdot x_0$. Let $\epsilon > 0$ and let $K \subset \mathbb{R}^n$ be a compact set containing both $x_0$ and $h \cdot x_0$. Assume by way of contradiction that there exists a first order $G$-invariant network $F$ (which is also $H$-invariant in view of the above) such that $|f(x_0) - F(x_0)| \leq \epsilon$ as well as

$$|f(h \cdot x_0) - F(h \cdot x_0)| = |f(h \cdot x_0) - F(x_0)| \leq \epsilon.$$

These last equations imply that

$$1 = |f(x_0) - f(h \cdot x_0)| \leq |f(x_0) - F(x_0)| + |F(x_0) - f(h \cdot x_0)| \leq 2\epsilon,$$

and since $\epsilon$ is arbitrary we get a contradiction.

Using Theorem 3 we can show that there exist a few infinite families of permutation groups (beyond the alternating group $A_n$) for which first order invariant networks are not universal. For example, any strict subgroup $G < S_n$ that is 2-transitive is such a group, since in this case $|[n]^2/G| = |[n]^2/S_n|$ and consequently $G$-invariant/equivariant layers are also $S_n$-invariant/equivariant.
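The quantity $|[n]^2/G|$ appearing in Theorem 3 is easy to compute by brute force for small groups. A small sketch of ours comparing the cyclic group $C_4$ with one of its strict super-groups, the dihedral group $D_4$; the strict inequality found here for this particular pair is consistent with cyclic groups being 2-closed, as discussed in the next paragraph:

```python
import itertools

def pair_orbit_count(group, n):
    """|[n]^2 / G|: the number of equivalence classes of index pairs under the diagonal action."""
    seen, count = set(), 0
    for pair in itertools.product(range(n), repeat=2):
        if pair in seen:
            continue
        seen |= {(g[pair[0]], g[pair[1]]) for g in group}
        count += 1
    return count

n = 4
cyclic = [tuple((i + s) % n for i in range(n)) for s in range(n)]                 # C_4
dihedral = cyclic + [tuple((s - i) % n for i in range(n)) for s in range(n)]      # D_4 = rotations + reflections
print(pair_orbit_count(cyclic, n), pair_orbit_count(dihedral, n))                  # -> 4 3
```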
Examples of 2-transitive permutation groups include the projective special linear groups over finite fields, $\mathrm{PSL}_d(\mathbb{F}_q)$ (for a prime power $q$), which act on the finite projective space and can be seen as subgroups of $S_n$ for $n = (q^d - 1)/(q - 1)$ (the number of elements in this space). Similarly, the affine groups over finite fields, $\mathrm{A\Gamma L}_d(\mathbb{F}_q)$, which act on $\mathbb{F}_q^d$, can be shown to be 2-transitive as subgroups of $S_n$ for $n = q^d$. See (Dixon & Mortimer, 1996) for a full classification of 2-transitive subgroups of $S_n$.

Relation to (Ravanbakhsh et al., 2017). Groups for which the condition in Theorem 3 holds are called 2-closed and were first introduced by (Wielandt, 1969) (see (Babai, 1995) for further study). Theorem 3 reveals an interesting connection between our work and the work of (Ravanbakhsh et al., 2017) that studies parameter-sharing schemes. One of the basic notions defined in their paper is that of uniquely $G$-equivariant functions, which describes functions that are $G$-equivariant but not equivariant to any super-group of $G$. For example, a consequence of Proposition 6 is that $A_n \leq S_n$ (with the representation used in this paper) has no uniquely equivariant linear functions between tensors of total order $\leq n - 2$. It was shown in (Ravanbakhsh et al., 2017) that 2-closed groups are exactly the groups for which one can find a uniquely equivariant function. In this section we proved that the existence of a uniquely $G$-equivariant linear function is a necessary condition for first order universality. As stated in (Ravanbakhsh et al., 2017), some examples of 2-closed groups are fixed-point free groups (e.g., the cyclic group $C_n$) and $S_n$ itself.

6. Conclusion

In this paper we have considered the universal approximation property of a popular invariant neural network model. We have shown that these networks are universal, with a construction that uses tensors of order $\frac{n(n-1)}{2}$, which makes this architecture impractical. On the other hand, there exists a permutation group for which we have proved a lower bound of $\frac{n-2}{2}$ on the tensor order required to achieve universality. We then addressed the more practical question of which groups $G$ allow first order $G$-invariant networks to be universal. We have proved that 2-closedness of $G$ is a necessary condition, and gave examples of infinite permutation group families that do not satisfy this condition.

Our work is a first step in advancing the understanding of the approximation power of a large class of invariant neural networks that is becoming increasingly popular in applications. Several questions remain open: First, a classification of 2-closed groups would give us a complete answer to which networks are first-order universal. As far as we know this is an open question in group theory. Still, mapping the 2-closed landscape for specific groups $G$ that are interesting for machine learning applications is a worthy challenge. Second, in case one wishes to construct a $G$-invariant network for a group $G$ that is not 2-closed, developing fast and efficient implementations of higher-order layers seems like a potentially useful direction. Lastly, another interesting avenue for future work might be to come up with new, possibly non-linear, models for invariant networks.

Acknowledgments

This research was supported in part by the European Research Council (ERC Consolidator Grant, LiftMatch 771136) and the Israel Science Foundation (Grant No. 1830/17).

References
Babai, L. Automorphism groups, isomorphism, reconstruction. Chapter 27 of the Handbook of Combinatorics, pp. 1447–1540. R. L. Graham, M. Grötschel, L. Lovász (eds.), 1995.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b. URL http://arxiv.org/abs/1612.08498.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Dixon, J. D. and Mortimer, B. Permutation Groups, volume 163. Springer Science & Business Media, 1996.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272, 2017.

Göbel, M. Computing bases for rings of permutation-invariant polynomials. J. Symb. Comput., 19(4):285–291, 1995.

Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. arXiv preprint arXiv:1803.02879, 2018.

Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Kraft, H. and Procesi, C. Classical invariant theory, a primer. Lecture Notes, 2000.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Syx72jC9tm.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389, 2017.

Weiler, M., Geiger, M., Welling, M., Boomsma, W., and Cohen, T. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. 2018. URL http://arxiv.org/abs/1807.02547.

Wielandt, H. Permutation Groups Through Invariant Relations and Invariant Functions. 1969.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.