# On the Universality of Invariant Networks

Haggai Maron 1, Ethan Fetaya 2 3, Nimrod Segol 1, Yaron Lipman 1

1 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. 2 Department of Computer Science, University of Toronto, Toronto, Canada. 3 Vector Institute. Correspondence to: Haggai Maron.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Constraining linear layers in neural networks to respect symmetry transformations from a group $G$ is a common design principle for invariant networks that has found many applications in machine learning. In this paper, we consider a fundamental question that has received little attention to date: Can these networks approximate any (continuous) invariant function? We tackle the rather general case where $G \leq S_n$ (an arbitrary subgroup of the symmetric group) acts on $\mathbb{R}^n$ by permuting coordinates. This setting includes several recent popular invariant networks. We present two main results: First, $G$-invariant networks are universal if high-order tensors are allowed. Second, there are groups $G$ for which higher-order tensors are unavoidable for obtaining universality. $G$-invariant networks consisting of only first-order tensors are of special interest due to their practical value. We conclude the paper by proving a necessary condition for the universality of $G$-invariant networks that incorporate only first-order tensors.

1. Introduction

The basic paradigm of deep neural networks is repeatedly composing layers of linear functions with non-linear, entrywise activation functions to create effective predictive models for learning tasks of interest. When trying to learn a function (task) $f$ that is known to be invariant to some group of symmetries $G$ (i.e., a $G$-invariant function), it is common to use linear layers that respect this symmetry, namely, invariant and/or equivariant linear layers. Networks with invariant/equivariant linear layers with respect to some group $G$ will be referred to here as $G$-invariant networks.

A fundamental question in learning theory is that of approximation or universality (Cybenko, 1989; Hornik, 1991). In the invariant case: Can a $G$-invariant network approximate an arbitrary continuous $G$-invariant function? The goal of this paper is to address this question for all finite permutation groups $G \leq S_n$, where $S_n$ is the symmetric group acting on $[n] = \{1, 2, \dots, n\}$. Note that this is a fairly general setting that contains many useful examples (detailed below).

The archetypal example of $G$-invariant networks is Convolutional Neural Networks (CNNs) (LeCun et al., 1989; Krizhevsky et al., 2012), which restrict their linear layers to convolutions in order to learn image tasks that are translation invariant or equivariant.¹ In recent years researchers have considered other types of data and/or symmetries, and consequently new $G$-invariant networks have emerged. Tasks involving point clouds or sets are in general invariant to the order of the input, and therefore permutation invariance/equivariance was developed (Qi et al., 2017; Zaheer et al., 2017). Learning tasks involving interaction between different sets, where the input data is tabular, require dealing with different permutations acting independently on each set (Hartford et al., 2018).
Tasks involving graphs and hyper-graphs lead to symmetries defined by tensor products of permutations (Kondor et al., 2018; Maron et al., 2019). A general treatment of invariance/equivariance to finite subgroups of the symmetric group is discussed in (Ravanbakhsh et al., 2017); infinite symmetries are discussed in general in (Kondor & Trivedi, 2018) as well as in (Cohen & Welling, 2016a;b; Cohen et al., 2018; Weiler et al., 2018).

Among these examples, universality is known for point-cloud networks and set networks (Qi et al., 2017; Zaheer et al., 2017), as well as for networks invariant to finite translation groups (e.g., cyclic convolutional neural networks) (Yarotsky, 2018). However, universality is not known for tabular and multi-set networks (Hartford et al., 2018), graph and hyper-graph networks (Kondor et al., 2018; Maron et al., 2019), and networks invariant to finite translations with rotations and/or reflections. We cover all these cases in this paper.

¹ It is common to use convolutional layers without cyclic padding, which implies that these networks are not precisely translation invariant.

Maybe the most closely related work to ours is (Yarotsky, 2018), which considered actions of compact groups and suggested provably universal architectures that are based on polynomial layers. In contrast, we study the standard and widely used linear layer model.

The paper is organized as follows: First, we prove that an arbitrary continuous function $f : \mathbb{R}^n \to \mathbb{R}$ invariant to an arbitrary permutation group $G \leq S_n$ can be approximated using a $G$-invariant network. The proof is constructive and makes use of linear equivariant layers between tensors $X \in \mathbb{R}^{n^k}$ of order $k \leq d$, where $d$ depends on the permutation group $G$. Second, we prove a lower bound on the order $d$ of tensors used in a $G$-invariant network so as to achieve universality. Specifically, we show that for $G = A_n$ (the alternating group) any $G$-invariant network that uses tensors of order at most $d = (n-2)/2$ cannot approximate arbitrary $G$-invariant functions. We conclude the paper by considering the question: For which groups $G \leq S_n$ are $G$-invariant networks using only first-order tensors universal? We prove a necessary condition, and describe families of groups for which universality cannot be attained using only first-order tensors.

2. Preliminaries and main results

The symmetries we consider in this paper are arbitrary subgroups of the symmetric group, i.e., $G \leq S_n$. The action of $G$ on $x \in \mathbb{R}^n$ used in this paper is defined as

$$g \cdot x = (x_{g^{-1}(1)}, \dots, x_{g^{-1}(n)}), \quad g \in G. \quad (1)$$

The action of $G$ on tensors $X \in \mathbb{R}^{n^k \times a}$ (the last index, denoted $j$, represents feature depth) is defined similarly by

$$(g \cdot X)_{i_1 \dots i_k, j} = X_{g^{-1}(i_1) \dots g^{-1}(i_k), j}, \quad g \in G. \quad (2)$$

The inset illustrates this action on tensors of order $k = 1, 2, 3$: the permutation $g$ is a transposition of two numbers and is applied to each dimension of the tensor.

Definition 1. A $G$-invariant function is a function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies $f(g \cdot x) = f(x)$ for all $x \in \mathbb{R}^n$ and $g \in G$.

Definition 2. A linear equivariant layer is an affine map $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$ satisfying $L(g \cdot X) = g \cdot L(X)$, for all $g \in G$ and $X \in \mathbb{R}^{n^k \times a}$. An invariant linear layer is an affine map $h : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{b}$ satisfying $h(g \cdot X) = h(X)$, for all $g \in G$ and $X \in \mathbb{R}^{n^k \times a}$.

Figure 1. Illustration of the invariant network architecture. The function is composed of multiple linear $G$-equivariant layers (gray), possibly of high order, and ends with a linear $G$-invariant function (light blue) followed by a Multi-Layer Perceptron (yellow).
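To make the actions in Equations (1) and (2) concrete, here is a minimal sketch (ours, not from the paper), assuming NumPy; it implements $g \cdot x$ and $g \cdot X$ and checks that a simple summation layer is invariant:

```python
import numpy as np

def act_on_vector(g, x):
    """Action of a permutation g on x in R^n (equation 1): (g.x)_i = x_{g^{-1}(i)}.
    g is encoded as an integer array with g[i] = g(i), 0-based."""
    g_inv = np.argsort(g)            # argsort of a permutation array is its inverse
    return x[g_inv]

def act_on_tensor(g, X):
    """Action of g on X in R^{n^k x a} (equation 2):
    (g.X)_{i1...ik,j} = X_{g^{-1}(i1)...g^{-1}(ik),j}; the last axis is the feature axis."""
    g_inv = np.argsort(g)
    for axis in range(X.ndim - 1):   # permute every index axis, leave the feature axis alone
        X = np.take(X, g_inv, axis=axis)
    return X

# Sanity check: summation over all index tuples is an S_n-invariant (hence G-invariant) layer.
n, a = 4, 3
X = np.random.randn(n, n, a)         # an order-2 tensor with a feature channels
g = np.array([2, 0, 3, 1])           # some permutation of {0, 1, 2, 3}
assert np.allclose(X.sum(axis=(0, 1)), act_on_tensor(g, X).sum(axis=(0, 1)))
```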
A common way to construct $G$-invariant networks is:

Definition 3. A $G$-invariant network is a function $F : \mathbb{R}^{n \times a} \to \mathbb{R}$ defined as $F = m \circ h \circ L_d \circ \sigma \circ \cdots \circ \sigma \circ L_1$, where the $L_i$ are linear $G$-equivariant layers, $\sigma$ is an activation function,² $h$ is a $G$-invariant layer, and $m$ is a Multi-Layer Perceptron (MLP).

² We assume any activation function for which the universal approximation theorem for MLPs holds, e.g., ReLU and sigmoid.

Figure 1 illustrates the $G$-invariant network model. By construction, $G$-invariant networks are $G$-invariant functions (note that entrywise activation is equivariant as well). This framework has been used, with an appropriate group $G$, in previous works to build predictive $G$-invariant models for learning.

Our goal is to show the approximation power of $G$-invariant networks, namely, that $G$-invariant networks can approximate arbitrary continuous $G$-invariant functions $f$. Without loss of generality, we consider only functions of the form $f : \mathbb{R}^n \to \mathbb{R}$. Indeed, in case of multiple features, $\mathbb{R}^{n \times a}$, we rearrange the input as $\mathbb{R}^{n'}$, $n' = na$, and take the appropriate $G' \leq S_{n'}$. We prove:

Theorem 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuous $G$-invariant function for some $G \leq S_n$, and $K \subset \mathbb{R}^n$ a compact set. There exists a $G$-invariant network that approximates $f$ to an arbitrary precision.

The proof of Theorem 1 is constructive and builds an $f$-approximating $G$-invariant network with hidden tensors $X \in \mathbb{R}^{n^d}$ of order $d$, where $d = d(G)$ is a natural number depending on the group $G$. Unfortunately, we show that in the worst case $d$ can be as high as $\frac{n(n-1)}{2}$. Note that $d = 2$ could already be computationally challenging. It is therefore of interest to ask whether there exist more efficient $G$-invariant networks that use lower-order tensors without sacrificing approximation power. Surprisingly, the answer is that in general we cannot go much lower: for general permutation groups $G$ the required order grows linearly with $n$. Specifically, we prove the following for $G = A_n$, the alternating group:

Theorem 2. If an $A_n$-invariant network has the universal approximation property, then it consists of tensors of order at least $\frac{n-2}{2}$.

Although in general we cannot expect universal approximation of $G$-invariant networks with inner tensor order smaller than $\frac{n-2}{2}$, it is still possible that for specific groups of interest we can prove approximation power with more efficient (i.e., lower-order inner tensor) $G$-invariant networks. Of specific interest are $G$-invariant networks that use only first-order tensors. In Section 5 we prove the following necessary condition for universality of first-order $G$-invariant networks:

Theorem 3. Let $G \leq S_n$. If first order $G$-invariant networks are universal, then $|[n]^2/H| < |[n]^2/G|$ for any strict super-group $G < H \leq S_n$.

Here $|[n]^2/G|$ is the number of equivalence classes of $[n]^2$ defined by the relation $(i_1, i_2) \sim (j_1, j_2)$ if $j_\ell = g(i_\ell)$, $\ell = 1, 2$, for some $g \in G$. Intuitively, this condition asks that super-groups of $G$ have strictly better separation of the double index space $[n]^2$.
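Before turning to the proofs, the following is a minimal sketch of Definition 3 for the simplest practical case, $G = S_n$ with first-order tensors only. It uses the known two-parameter form of $S_n$-equivariant layers from (Zaheer et al., 2017) and sum pooling as the invariant layer; this is an illustration of ours, with random rather than trained weights, and all names are illustrative:

```python
import numpy as np

def equivariant_layer(X, theta1, theta2, bias):
    """First-order S_n-equivariant layer L: R^{n x a} -> R^{n x b}, in the two-parameter
    form of (Zaheer et al., 2017): L(X) = X @ theta1 + (1 1^T / n) X @ theta2 + bias."""
    mean = X.mean(axis=0, keepdims=True)
    return X @ theta1 + np.repeat(mean, X.shape[0], axis=0) @ theta2 + bias

def invariant_layer(X, theta):
    """S_n-invariant layer h: R^{n x b} -> R^b, sum pooling followed by a linear map."""
    return X.sum(axis=0) @ theta

def mlp(y, weights):
    """The final MLP m, with ReLU activations."""
    for W, b in weights[:-1]:
        y = np.maximum(y @ W + b, 0.0)
    W, b = weights[-1]
    return y @ W + b

def invariant_network(X, layer_params, inv_theta, mlp_weights):
    """F = m . h . L_d . sigma . ... . sigma . L_1 (Definition 3), with sigma = ReLU."""
    for theta1, theta2, bias in layer_params:
        X = np.maximum(equivariant_layer(X, theta1, theta2, bias), 0.0)
    return mlp(invariant_layer(X, inv_theta), mlp_weights)

# Sanity check: the output does not change when the rows (set elements) are permuted.
rng = np.random.default_rng(0)
n, a, b = 5, 3, 4
layer_params = [(rng.standard_normal((a, b)), rng.standard_normal((a, b)), rng.standard_normal(b))]
inv_theta = rng.standard_normal((b, b))
mlp_weights = [(rng.standard_normal((b, b)), rng.standard_normal(b)),
               (rng.standard_normal((b, 1)), rng.standard_normal(1))]
X = rng.standard_normal((n, a))
g = rng.permutation(n)
assert np.allclose(invariant_network(X, layer_params, inv_theta, mlp_weights),
                   invariant_network(X[g], layer_params, inv_theta, mlp_weights))
```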
3. G-invariant networks universality

The key to showing Theorem 1, namely that $G$-invariant networks are universal, is showing they can approximate a set of functions that are: (i) $G$-invariant; and (ii) able to approximate arbitrary $G$-invariant functions to a desired precision. The $G$-invariant polynomials are an example of such a set:

Definition 4. The $G$-invariant polynomials are all the polynomials in $x_1, \dots, x_n$ over $\mathbb{R}$ that are also $G$-invariant functions. They are denoted $\mathbb{R}[x_1, \dots, x_n]^G$, where $\mathbb{R}[x_1, \dots, x_n]$ is the set of all polynomials over $\mathbb{R}$.

To see that $G$-invariant polynomials can approximate any arbitrary (continuous) function $f : K \subset \mathbb{R}^n \to \mathbb{R}$, where $K$ is a compact set, one can use the Stone-Weierstrass (SW) theorem, as done in (Yarotsky, 2018): First use SW to approximate $f$ over the symmetrized domain $\tilde{K} = \bigcup_{g \in G} g \cdot K$ by some (not necessarily $G$-invariant) polynomial $p \in \mathbb{R}[x_1, \dots, x_n]$. Second, consider $q(x) = \frac{1}{|G|} \sum_{g \in G} p(g \cdot x)$. $q$ is a $G$-invariant polynomial and hence $q \in \mathbb{R}[x_1, \dots, x_n]^G$; furthermore, for $x \in K$:

$$|q(x) - f(x)| \leq \frac{1}{|G|} \sum_{g \in G} |p(g \cdot x) - f(g \cdot x)| \leq \max_{x \in \tilde{K}} |p(x) - f(x)|.$$

Our goal in this section is to prove the following proposition that, together with the comment above, proves Theorem 1:

Proposition 1. For any $\epsilon > 0$, compact set $K \subset \mathbb{R}^n$, and $G$-invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ there exists a $G$-invariant network $F$ that approximates $p$ to an $\epsilon$-accuracy, namely $\max_{x \in K} |F(x) - p(x)| < \epsilon$.

The proposition will be proved in several steps: (i) We represent $p$ as $p(x) = \sum_{k=0}^{d} p_k(x)$, where $p_k$ is a $G$-invariant homogeneous polynomial of degree $k$, i.e., $p_k \in \mathbb{R}_k[x_1, \dots, x_n]^G$. (ii) We characterize all homogeneous $G$-invariant polynomials of a fixed degree $k$. In particular we find a basis of all such polynomials, $b_{k1}, b_{k2}, \dots, b_{kn_k} \in \mathbb{R}_k[x_1, \dots, x_n]^G$. Using the bases of homogeneous $G$-invariant polynomials of degrees up to $d$ we write

$$p(x) = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj}\, b_{kj}(x). \quad (3)$$

(iii) We approximate each basis element $b_{kj}$ using a $G$-invariant network. (iv) We construct a $G$-invariant network $F$ approximating $p$ to an $\epsilon$-accuracy using Equation 3 and (iii).

3.1. Proof of Proposition 1

Part (i): It is a known fact that a $G$-invariant polynomial can be written as a sum of homogeneous $G$-invariant polynomials (Kraft & Procesi, 2000):

Lemma 1. Let $p : \mathbb{R}^n \to \mathbb{R}$ be a $G$-invariant polynomial of degree $d$. Then $p$ can be written as $p(x) = \sum_{k=0}^{d} p_k(x)$, where the $p_k$ are homogeneous $G$-invariant polynomials of degree $k$.

Part (ii): We need to find bases for the linear spaces of homogeneous $G$-invariant polynomials of degree $k = 0, 1, \dots, d$, i.e., $\mathbb{R}_k[x_1, \dots, x_n]^G$. Any homogeneous polynomial of degree $k$ can be written as

$$p_k(x) = \sum_{i_1, \dots, i_k = 1}^{n} W_{i_1 \dots i_k}\, x_{i_1} \cdots x_{i_k}, \quad (4)$$

where $W \in \mathbb{R}^{n^k}$ is its coefficient tensor; since $x_{i_1} \cdots x_{i_k} = x_{i_{\sigma(1)}} \cdots x_{i_{\sigma(k)}}$ for all $\sigma \in S_k$, a unique choice of $W$ can be obtained by taking a symmetric $W$, that is, a $W$ that satisfies $W_{i_1 \dots i_k} = W_{i_{\sigma(1)} \dots i_{\sigma(k)}}$ for all $\sigma \in S_k$. In short, we ask $W \in \mathrm{Sym}^k(\mathbb{R}^n) \subset \mathbb{R}^{n^k}$. For example, the case $k = 2$ amounts to representing a quadratic form using a symmetric matrix, that is, $W$ satisfies in this case $W = W^T$.

The next proposition shows that if $p$ is $G$-invariant, its coefficient tensor is a fixed point of the action of $G$ on symmetric tensors $W \in \mathbb{R}^{n^k}$:

Proposition 2. Let $p \in \mathbb{R}_k[x_1, \dots, x_n]^G$. Then its coefficient tensor $W \in \mathbb{R}^{n^k}$ satisfies the fixed-point equation:

$$g \cdot W = W, \quad g \in G. \quad (5)$$

Proof. From the fact that $p$ is $G$-invariant we get the set of equations $p(x) = p(g \cdot x)$, for all $g \in G$. Now,

$$p(g \cdot x) = \sum_{i_1, \dots, i_k = 1}^{n} W_{i_1 \dots i_k}\, x_{g^{-1}(i_1)} \cdots x_{g^{-1}(i_k)} = \sum_{i_1, \dots, i_k = 1}^{n} W_{g(i_1) \dots g(i_k)}\, x_{i_1} \cdots x_{i_k}.$$

By equating the monomial coefficients of $p(x)$ and $p(g \cdot x)$ and using the symmetry of $W$ we get $W_{i_1 \dots i_k} = W_{g(i_1) \dots g(i_k)}$. This implies that $W$ satisfies $g \cdot W = W$ for all $g \in G$.

Equation 5 is a linear homogeneous system of equations and therefore the set of solutions $W$ forms a linear space. To define a basis for this linear space we first define the following equivalence relation: $(i_1, \dots, i_k) \sim (j_1, \dots, j_k)$ if there exist $g \in G$ and $\sigma \in S_k$ so that $j_\ell = g(i_{\sigma(\ell)})$, $\ell = 1, \dots, k$. Intuitively, $g$ takes care of the $G$-invariance while $\sigma$ factors out the fact that the monomials satisfy $x_{i_1} \cdots x_{i_k} = x_{i_{\sigma(1)}} \cdots x_{i_{\sigma(k)}}$. For example, let $n = 5$, $k = 3$, $g = (23)(45)$, $\sigma = (23)$ (we use cycle notation); then we have $(2, 2, 4) \sim (3, 5, 3)$. The equivalence classes are denoted $\tau$ and called the $k$-classes.
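The $k$-classes can be enumerated by brute force for small groups. A small sketch of ours (assuming NumPy; the helper names are illustrative, not from the paper), which also checks numerically that the resulting basis polynomials of Equation (6) below are $G$-invariant:

```python
import itertools
import numpy as np

def k_classes(group, n, k):
    """Brute-force enumeration of the k-classes. Factoring out sigma in S_k, a class is
    determined by the G-orbit of the multiset {i1,...,ik}, so we enumerate sorted tuples."""
    seen, classes = set(), []
    for tup in itertools.combinations_with_replacement(range(n), k):
        if tup in seen:
            continue
        orbit = {tuple(sorted(g[i] for i in tup)) for g in group}
        seen |= orbit
        classes.append(orbit)
    return classes

def p_tau(tau, x):
    """Evaluate the basis polynomial of equation (6): the sum of x_{i1}*...*x_{ik} over all
    index tuples in the class; each multiset is expanded into its distinct orderings."""
    total = 0.0
    for multiset in tau:
        for tup in set(itertools.permutations(multiset)):
            total += np.prod([x[i] for i in tup])
    return total

# Example: quadratic (k = 2) invariants of the cyclic group C_4 acting on [4] (0-based).
n, k = 4, 2
group = [tuple((i + s) % n for i in range(n)) for s in range(n)]
classes = k_classes(group, n, k)
x = np.random.randn(n)
for g in group:
    gx = x[np.argsort(g)]                       # (g.x)_i = x_{g^{-1}(i)}
    for tau in classes:
        assert np.isclose(p_tau(tau, x), p_tau(tau, gx))
```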
We show:

Proposition 3. The set of polynomials

$$p_\tau(x) = \sum_{(i_1, \dots, i_k) \in \tau} x_{i_1} \cdots x_{i_k}, \quad (6)$$

where $\tau$ is a $k$-class, forms a basis of $\mathbb{R}_k[x_1, \dots, x_n]^G$.

Proof. Denote by $W^\tau$ the symmetric coefficient tensor of $p_\tau$. Note that

$$W^\tau_{i_1 \dots i_k} = \begin{cases} 1 & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases}. \quad (7)$$

We have

$$p_\tau(g \cdot x) = \sum_{(i_1, \dots, i_k) \in \tau} x_{g^{-1}(i_1)} \cdots x_{g^{-1}(i_k)} = p_\tau(x),$$

hence $p_\tau \in \mathbb{R}_k[x_1, \dots, x_n]^G$. The set of polynomials $p_\tau$, with $\tau$ a $k$-class, is a linearly independent set since each $p_\tau$ contains a different collection of monomials. By Proposition 2, the symmetric coefficient tensor $W$ of every $q \in \mathbb{R}_k[x_1, \dots, x_n]^G$ satisfies the fixed-point equation, Equation 5. This in particular means that $W$ is constant on the $k$-classes. Hence $W$ can be written as a linear combination of the $W^\tau$; see also Equation 7.

As we later show, the fixed-point equation, Equation 5, is also used to characterize and compute a basis for the space of linear permutation-equivariant and -invariant layers (Maron et al., 2019). These equations are equivalently formulated as a weight-sharing scheme in (Ravanbakhsh et al., 2017). A slight difference in the present case, which deals with polynomials, is the additional constraints encoding the symmetry of $W$; these are needed since every polynomial of degree $> 1$ has several representing tensors $W$.

Part (iii): Our next step is approximating each $p_\tau$ with a $G$-invariant network. The next proposition introduces the building blocks of this construction:

Proposition 4. Let $\tau$ be a $k$-class and let $L^\tau_\ell : \mathbb{R}^n \to \mathbb{R}^{n^k}$, $\ell = 1, \dots, k$, be the linear operator defined as follows: for $x \in \mathbb{R}^n$,

$$L^\tau_\ell(x)_{i_1 \dots i_k} = \begin{cases} x_{i_\ell} & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases}.$$

Then $L^\tau_\ell$ is a linear $G$-equivariant function, that is, $L^\tau_\ell(g \cdot x) = g \cdot L^\tau_\ell(x)$ for all $x \in \mathbb{R}^n$, $g \in G$.

Proof. We have

$$\big(g \cdot L^\tau_\ell(x)\big)_{i_1 \dots i_k} = \begin{cases} x_{g^{-1}(i_\ell)} & (g^{-1}(i_1), \dots, g^{-1}(i_k)) \in \tau \\ 0 & \text{otherwise} \end{cases}.$$

On the other hand,

$$L^\tau_\ell(g \cdot x)_{i_1 \dots i_k} = \begin{cases} x_{g^{-1}(i_\ell)} & (i_1, \dots, i_k) \in \tau \\ 0 & \text{otherwise} \end{cases},$$

and both expressions are equal since $(i_1, \dots, i_k) \in \tau$ if and only if $(g^{-1}(i_1), \dots, g^{-1}(i_k)) \in \tau$, by the definition of $\tau$.
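A minimal numerical sketch of Proposition 4 (ours, assuming NumPy; names are illustrative): build the tensor $L^\tau_\ell(x)$ for a small $k$-class and verify equivariance against the action of Equation (2):

```python
import itertools
import numpy as np

def L_tau_ell(x, tau, ell, k):
    """Operator of Proposition 4: out[i1,...,ik] = x[i_ell] if (i1,...,ik) is in the
    k-class tau (given as a set of index tuples), and 0 otherwise."""
    n = len(x)
    out = np.zeros((n,) * k)
    for idx in tau:
        out[idx] = x[idx[ell]]
    return out

def act(g, T):
    """(g.T)_{i1...ik} = T_{g^{-1}(i1),...,g^{-1}(ik)} for a tensor with no feature axis."""
    g_inv = np.argsort(g)
    for axis in range(T.ndim):
        T = np.take(T, g_inv, axis=axis)
    return T

# Check equivariance for G = C_4 and the 2-class generated by the tuple (0, 1).
n, k = 4, 2
group = [np.array([(i + s) % n for i in range(n)]) for s in range(n)]
tau = {tuple(g[list(perm)]) for g in group for perm in itertools.permutations((0, 1))}
x = np.random.randn(n)
for g in group:
    gx = x[np.argsort(g)]                                   # (g.x)_i = x_{g^{-1}(i)}
    for ell in range(k):
        assert np.allclose(L_tau_ell(gx, tau, ell, k), act(g, L_tau_ell(x, tau, ell, k)))
```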
Next, we construct the approximating $G$-invariant network:

Proposition 5. For any $\epsilon > 0$, compact set $K \subset \mathbb{R}^n$, and $k$-class $\tau$ there exists a $G$-invariant network $F^\tau$ that approximates $p_\tau$ from Equation 6 to an $\epsilon$-accuracy.

Proof. Let $c > 0$ be sufficiently large so that $K \subset [-c, c]^n \subset \mathbb{R}^n$. Denote by $m_k : \mathbb{R}^k \to \mathbb{R}$ an MLP that approximates the multiplication function $f(y_1, \dots, y_k) = \prod_{i=1}^{k} y_i$ on $[-c, c]^k$ to $n^{-k}\epsilon$-accuracy, i.e., $\max_{-c \leq y_i \leq c} |f(y) - m_k(y)| < n^{-k}\epsilon$. Consider the following $G$-invariant network: First, given an input $x \in \mathbb{R}^n$, map it to $\mathbb{R}^{n^k \times k}$ (i.e., $k$ is the number of channels) by

$$L^\tau(x)_{i_1 \dots i_k, \ell} = L^\tau_\ell(x)_{i_1 \dots i_k}. \quad (8)$$

$L^\tau : \mathbb{R}^n \to \mathbb{R}^{n^k \times k}$ is a linear equivariant layer (see Equation 2). Second, apply $m_k$ to the feature dimension in $\mathbb{R}^{n^k \times k}$. That is, given $Y \in \mathbb{R}^{n^k \times k}$ define $M^k(Y)_{i_1 \dots i_k} = m_k(Y_{i_1 \dots i_k, 1}, \dots, Y_{i_1 \dots i_k, k})$. Note that $M^k : \mathbb{R}^{n^k \times k} \to \mathbb{R}^{n^k}$ can be interpreted as a composition of equivariant linear layers.³ Lastly, denote by $s : \mathbb{R}^{n^k} \to \mathbb{R}$ the summation layer: for $Z \in \mathbb{R}^{n^k}$, $s(Z) = \sum_{i_1, \dots, i_k = 1}^{n} Z_{i_1 \dots i_k}$. Note that $M^k$ and $s$ are equivariant and invariant (respectively) for all $G \leq S_n$. This construction can be visualized using the following diagram:

$$\mathbb{R}^n \xrightarrow{\;L^\tau\;} \mathbb{R}^{n^k \times k} \xrightarrow{\;M^k\;} \mathbb{R}^{n^k} \xrightarrow{\;s\;} \mathbb{R}.$$

This $G$-invariant network $F^\tau = s \circ M^k \circ L^\tau$ approximates $p_\tau$ to an $\epsilon$-accuracy over the compact set $K \subset \mathbb{R}^n$. Indeed, let $x \in K$; then

$$|F^\tau(x) - p_\tau(x)| \leq \sum_{i_1, \dots, i_k = 1}^{n} \left| M^k(L^\tau(x))_{i_1 \dots i_k} - W^\tau_{i_1 \dots i_k}\, x_{i_1} \cdots x_{i_k} \right| = \sum_{i_1, \dots, i_k = 1}^{n} \begin{cases} \left| m_k(x_{i_1}, \dots, x_{i_k}) - x_{i_1} \cdots x_{i_k} \right| & (i_1, \dots, i_k) \in \tau \\ \left| m_k(0, \dots, 0) \right| & \text{otherwise} \end{cases} \;\leq\; \epsilon,$$

where in the last inequality we used the $n^{-k}\epsilon$-accuracy of $m_k$ in approximating the product operator on $[-c, c]^k \subset \mathbb{R}^k$: each of the $n^k$ summands is bounded by $n^{-k}\epsilon$.

Part (iv): In the final stage, we would like to approximate an arbitrary $p \in \mathbb{R}[x_1, \dots, x_n]^G$ with a $G$-invariant network to $\epsilon$-accuracy over a compact set $K \subset \mathbb{R}^n$.

³ In fact, any application of an MLP to the feature dimension is $G$-equivariant for any $G \leq S_n$, since it can be realized by scalings of the identity operator, possibly with a constant and nonlinear pointwise activations (see, e.g., (Qi et al., 2017; Zaheer et al., 2017)).

Proof (Proposition 1). Let us denote by $b_{k1}, \dots, b_{kn_k}$ the polynomials $p_\tau$, with $\tau$ the $k$-classes. Let $F^{kj}$ denote the $G$-invariant network approximating $b_{kj}$, $k = 0, 1, \dots, d$, $j \in [n_k]$, to an $\epsilon$-accuracy over the set $K$, the existence of which is guaranteed by Proposition 5. We now utilize the decomposition of $p$ shown in Equation 3 and get

$$\left| p(x) - \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}(x) \right| \leq \sum_{k=0}^{d} \sum_{j=1}^{n_k} |\alpha_{kj}| \left| b_{kj}(x) - F^{kj}(x) \right| \leq \|\alpha\|_1\, \epsilon,$$

where $\|\alpha\|_1 = \sum_{k,j} |\alpha_{kj}|$ depends only upon $p$, and $\epsilon$ is arbitrary. To finish the proof we need to show that $F = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}$ can indeed be realized as a single, unified $G$-invariant network. This is a simple yet technical construction and we defer the proof of this fact to the supplementary material:

Lemma 2. There exists a $G$-invariant network, in the sense of Definition 3, that realizes the sum of $G$-invariant networks $F = \sum_{k=0}^{d} \sum_{j=1}^{n_k} \alpha_{kj} F^{kj}$.

3.2. Bounded order construction

We have constructed a $G$-invariant network $F$ that approximates an arbitrary $G$-invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ of degree $d$. The network $F$ uses tensors of order $d$, where $d$ matches the degree of $p$. In this subsection we construct a $G$-invariant network $F$ that approximates $p$ with a maximal tensor order that depends only on the group $G \leq S_n$. Therefore, the tensor order is independent of the degree of the polynomial $p$. We use the following theorem by Noether (Kraft & Procesi, 2000):

Theorem 4 (Noether). Let $G$ be a finite group acting linearly on $\mathbb{R}^n$. There exist finitely many $G$-invariant polynomials $f_1, \dots, f_m \in \mathbb{R}[x_1, \dots, x_n]^G$ such that any invariant polynomial $p \in \mathbb{R}[x_1, \dots, x_n]^G$ can be expressed as $p(x) = h(f_1(x), \dots, f_m(x))$, where $h \in \mathbb{R}[x_1, \dots, x_m]$ is a polynomial and $\deg(f_i) \leq |G|$, $i = 1, \dots, m$.

The idea of using a set of generating invariant polynomials in the context of universality was introduced in (Yarotsky, 2018). For the case of interest in this paper, namely $G \leq S_n$, there exists a generating set of $G$-invariant polynomials of degree bounded by $\frac{n(n-1)}{2}$, for $n \geq 3$; see (Göbel, 1995). We can now repeat the construction above, building a $G$-invariant network $F_i$ approximating $f_i$ to an $\epsilon_1$-accuracy, $i = 1, \dots, m$. The maximal order of these networks is bounded by $d \leq \frac{n(n-1)}{2}$. These networks can be combined, as above, into a single $G$-invariant network $F : \mathbb{R}^n \to \mathbb{R}^m$ with the final output approximating $f(x) = (f_1(x), \dots, f_m(x))$ to an $\epsilon_1$-accuracy. Now we compose the output of $F$ with an MLP $H : \mathbb{R}^m \to \mathbb{R}$ approximating the polynomial $h$ over the compact set $f(K) + B_\epsilon \subset \mathbb{R}^m$ to an $\epsilon$-accuracy, where $B_\epsilon$ is a closed ball centered at the origin of radius $\epsilon$ and the sum is the Minkowski sum. Since $H$ is continuous and $f(K) + B_\epsilon$ is compact, there exists $\delta > 0$ so that $|H(y) - H(y')| \leq \epsilon$ if $\|y - y'\|_2 \leq \delta$. We use $\epsilon_1 = \min\{\delta, \epsilon\}$ for the construction of $F$ above. We have

$$|H(F(x)) - h(f(x))| \leq |H(F(x)) - H(f(x))| + |H(f(x)) - h(f(x))| \leq 2\epsilon,$$

for all $x \in K$. We have thus constructed $H \circ F$, a $G$-invariant network with maximal tensor order bounded by $\frac{n(n-1)}{2}$, approximating $p$ to an arbitrary precision.
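As a tiny sanity check of the idea behind Theorem 4 (a sketch of ours, not part of the paper's construction), consider the simplest case $G = S_2$ acting on $\mathbb{R}^2$: the elementary symmetric polynomials $f_1 = x_1 + x_2$, $f_2 = x_1 x_2$ are a classical generating set, and the invariant $p(x) = x_1^2 + x_2^2$ equals $h(f_1, f_2)$ with $h(y_1, y_2) = y_1^2 - 2 y_2$:

```python
import numpy as np

def f1(x): return x[0] + x[1]            # elementary symmetric polynomial e_1
def f2(x): return x[0] * x[1]            # elementary symmetric polynomial e_2

def p(x): return x[0] ** 2 + x[1] ** 2   # an S_2-invariant polynomial
def h(y1, y2): return y1 ** 2 - 2 * y2   # p expressed in the generators: e_1^2 - 2 e_2

x = np.random.randn(2)
assert np.isclose(p(x), h(f1(x), f2(x)))
```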
3.3. Examples

Universality of (hyper-)graph networks. Graph, or hyper-graph, data can be described using tensors $X \in \mathbb{R}^{n^k \times a}$, where $n$ is the number of vertices of the graph and $x_{i_1, i_2, \dots, i_k, :} \in \mathbb{R}^a$ is a feature vector attached to a (generalized) edge defined by the ordered set of vertices $(i_1, i_2, \dots, i_k)$. For example, an adjacency matrix of an $n$-vertex graph is described by $X \in \mathbb{R}^{n^2}$. The graph symmetries are reorderings of the vertices by a permutation, namely $g \cdot X$, where $g \in S_n$. Typically, any function we would like to learn on graphs would be invariant to this action. Recently, (Maron et al., 2019) characterized the spaces of equivariant and invariant linear layers with this symmetry, provided a formula for their basis, and employed the corresponding $G$-invariant networks for learning graph-related tasks. A corollary of Theorem 1 is that this construction yields a universal approximator of continuous functions defined on graphs. This is in contrast to the popular message passing neural network model (Gilmer et al., 2017) that was recently shown to be non-universal (Xu et al., 2019).

Universality of rotation invariant convolutional networks. For learning tasks involving $m \times m$ images one might require invariance to periodic translations and 90-degree rotations. Note that periodic translations and 90-degree rotations can be seen as permutations in $S_n$, $n = m^2$, acting on the pixels of the image. Constructing a suitable $G$-invariant network would lead, according to Theorem 1, to a universal approximator.

4. A lower bound on equivariant layer order

In the previous section we showed how an arbitrary $G$-invariant polynomial can be approximated with a $G$-invariant network with tensor order $d = d(G) \leq \frac{n(n-1)}{2}$. This upper bound would be prohibitive in practice. In this section we prove a lower bound: we show that there exists a group for which the tensor order cannot be less than $\frac{n-2}{2}$ if we wish to maintain the universal approximation property. We consider the alternating group, $G = A_n \leq S_n$. Remember that $g \in A_n$ if $g$ is a composition of an even number of transpositions.

Figure 2. Illustration of the $(n-2)$-transitivity of $A_n$, the main property we use in this section. Any subset of $n-2$ distinct elements can be mapped to any other subset of $n-2$ distinct elements (gray). If needed, a transposition can be applied to the remaining 2 elements (blue) to assure an even permutation.

Definition 5. A group $G \leq S_n$ is $k$-transitive if for every two sequences $(i_1, i_2, \dots, i_k)$, $(j_1, j_2, \dots, j_k)$ of distinct elements in $[n]$ there exists $g \in G$ so that $j_\ell = g(i_\ell)$, for $\ell = 1, \dots, k$.

The alternating group is $(n-2)$-transitive (see Figure 2 and (Dixon & Mortimer, 1996)). Our goal is to prove:

Theorem 2. If an $A_n$-invariant network has the universal approximation property, then it consists of tensors of order at least $\frac{n-2}{2}$.

For the proof we first need a characterization of the linear equivariant layers $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$, where $l = 0$ represents the invariant case. By definition, $L(g \cdot X) = g \cdot L(X)$ for all $X \in \mathbb{R}^{n^k \times a}$. In particular this means that $g^{-1} \cdot L(g \cdot X) = L(X)$. Recall that $L$ is an affine map (see Definition 2) and therefore can be represented as a sum of a purely linear part and a constant part. Representing the linear part of $L$ as a tensor $L \in \mathbb{R}^{n^{k+l} \times a \times b}$, these equations become the fixed-point equation for linear equivariant layers (see the supplementary material for a derivation):

$$g \cdot L = L, \quad g \in G. \quad (9)$$

The constant part of $L$ can be encoded using a tensor $B \in \mathbb{R}^{n^l \times b}$ that satisfies Equation 9 as well. Note that this fixed-point equation is similar to the fixed-point equation of homogeneous $G$-invariant polynomials, Equation 5. We denote by $\mathcal{L}_G$ the collection of linear $G$-equivariant ($l > 0$) or $G$-invariant ($l = 0$) layers $L : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^l \times b}$.
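The solution space of Equation (9) can again be enumerated by brute force for small groups: each orbit of index tuples gives one indicator basis tensor, i.e., one free parameter of the layer. A small sketch of ours (names are illustrative), which also previews Proposition 6 below by comparing $A_n$ with $S_n$:

```python
import itertools

def is_even(p):
    """True if the permutation p (p[i] = p(i)) is even, computed from its cycle lengths."""
    seen, parity = set(), 0
    for i in range(len(p)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = p[j]
            length += 1
        parity ^= (length - 1) & 1
    return parity == 0

def fixed_point_orbits(group, n, order):
    """Orbits of index tuples in [n]^order under (i1,...,i_order) -> (g(i1),...,g(i_order)).
    Each orbit corresponds to one indicator tensor solving g.L = L (equation 9)."""
    seen, orbits = set(), []
    for tup in itertools.product(range(n), repeat=order):
        if tup in seen:
            continue
        orbit = {tuple(g[i] for i in tup) for g in group}
        seen |= orbit
        orbits.append(orbit)
    return orbits

n = 5
sym = list(itertools.permutations(range(n)))
alt = [p for p in sym if is_even(p)]
# For order k + l = 2 <= n - 2 both groups give the same two "equality patterns"
# {(i, i)} and {(i, j): i != j}, matching the two-parameter S_n-equivariant layer
# of (Zaheer et al., 2017); this is the content of Proposition 6 below.
print(len(fixed_point_orbits(sym, n, order=2)), len(fixed_point_orbits(alt, n, order=2)))  # -> 2 2
```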
Proposition 6. If $k + l \leq n - 2$, then $\mathcal{L}_{A_n} = \mathcal{L}_{S_n}$.

Proof. In view of the fixed-point equation for equivariant/invariant layers (9), we need to show that the solution set of this equation is identical for $G = A_n$ and $G = S_n$, as long as $k + l \leq n - 2$. The solution set of the fixed-point equation consists of tensors $L$ that are constant on each equivalence class defined by the equivalence relation: $(i_1, \dots, i_{k+l}) \sim (j_1, \dots, j_{k+l})$ if there exists $g \in G$ so that $j_\ell = g(i_\ell)$ for $\ell = 1, \dots, k + l$. Both $A_n$ and $S_n$ are $(n-2)$-transitive.⁴ Therefore, the equivalence relations defined above for $A_n$ and $S_n$ reduce to the same equivalence relation: $(i_1, \dots, i_{k+l}) \sim (j_1, \dots, j_{k+l})$ if $i_\alpha = i_\beta$ if and only if $j_\alpha = j_\beta$, for all $\alpha, \beta \in [k + l]$ (see (Maron et al., 2019), where these classes are called equality patterns). Since this equivalence relation is the same for $A_n$ and $S_n$, we get that the solution set of the fixed-point equation (9) is the same for both groups. Since the constant-part tensor $B$ is of smaller order than $k + l \leq n - 2$, the same argumentation applies to the constant part as well.

⁴ $S_n$ is in fact $n$-transitive and is therefore also $k$-transitive for all $k \leq n$.

Proposition 6 implies that any $A_n$-invariant network with tensor order at most $(n-2)/2$ will in fact be $S_n$-invariant. Therefore, one approach to show that such networks have limited approximation power is to come up with an $A_n$-invariant continuous function that is not $S_n$-invariant, as follows:

Proof (Theorem 2). Consider the Vandermonde polynomial $V(x) = \prod_{1 \leq i < j \leq n} (x_j - x_i)$. $V$ is $A_n$-invariant, since even permutations preserve its sign, whereas an odd permutation such as $g = (12)$ flips it: $V(g \cdot x) = -V(x)$, so $V$ is not $S_n$-invariant. Pick $x$ with distinct coordinates, so that $V(x) \neq 0$, let $\epsilon > 0$, and let $K \subset \mathbb{R}^n$ be a compact set containing both $x$ and $g \cdot x$ for $g = (12)$. Assume by way of contradiction that there exists an $A_n$-invariant network $F$, which is $S_n$-invariant due to the above, such that $|V(x) - F(x)| \leq \epsilon$ as well as

$$|V(g \cdot x) - F(g \cdot x)| = |(-1)V(x) - F(x)| = |V(x) + F(x)| \leq \epsilon.$$

These last equations imply that $|V(x)| \leq \epsilon$, and since $\epsilon$ is arbitrary we get $V(x) = 0$, a contradiction.
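A quick numerical check (ours; the sign helper is standard) of the sign behavior of the Vandermonde polynomial used in the proof above, $V(g \cdot x) = \mathrm{sign}(g)\, V(x)$:

```python
import itertools
import numpy as np

def vandermonde(x):
    """V(x) = prod_{i < j} (x_j - x_i)."""
    n = len(x)
    return np.prod([x[j] - x[i] for i in range(n) for j in range(i + 1, n)])

def sign(g):
    """Sign of a permutation g (g[i] = g(i)), computed from its cycle lengths."""
    seen, s = set(), 1
    for i in range(len(g)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = g[j]
            length += 1
        s *= (-1) ** (length - 1)
    return s

n = 4
x = np.random.randn(n)                          # distinct coordinates almost surely
for g in map(np.array, itertools.permutations(range(n))):
    gx = x[np.argsort(g)]                       # (g.x)_i = x_{g^{-1}(i)}
    assert np.isclose(vandermonde(gx), sign(g) * vandermonde(x))
```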
5. Universality of first order networks

We have seen that $G$-invariant networks with tensor order $\frac{n(n-1)}{2}$ are universal. On the other hand, for general permutation groups $G$ the tensor order is at least $(n-2)/2$ if universality is required. A particularly important question for applications, where higher-order tensors are computationally prohibitive, is which permutation groups $G$ give rise to first order $G$-invariant networks that are universal.

Definition 6. A first order $G$-invariant network is a $G$-invariant network where the maximal tensor order is 1.

In this section we discuss this (mostly) open question. First, we note that there are a few cases for which first order $G$-invariant networks are known to be universal: for instance, when $G = \{e\}$ (i.e., the trivial group), $G$-invariant networks are composed of fully connected layers, a case which is covered by the original universal approximation theorems (Cybenko, 1989; Hornik, 1991). First order universality is also known when $G$ is a (possibly high-dimensional) grid (e.g., $G = \mathbb{Z}_{n_1} \times \cdots \times \mathbb{Z}_{n_k}$) (Yarotsky, 2018), a case that includes periodic convolutional neural networks. Universality of first order networks is also known when $G = S_n$ (Zaheer et al., 2017; Qi et al., 2017; Yarotsky, 2018), in the context of invariant networks that operate on sets or point clouds.

Our goal in this section is to derive a necessary condition on $G$ for the universality of first order $G$-invariant networks. To this end, we first find a function, playing the role of the Vandermonde polynomial in the previous section, that is $G$-invariant but not $H$-invariant, where $G < H \leq S_n$.

Lemma 3. Let $G < H \leq S_n$. Then there exists a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ which is $G$-invariant but not $H$-invariant.

Proof. Pick a point $x_0 \in \mathbb{R}^n$ with distinct coordinates. Since the stabilizer $(S_n)_{x_0}$ is trivial (i.e., no permutation other than the identity fixes $x_0$), the size of the orbit of $x_0$ equals the size of the acting group, namely $|G \cdot x_0| = |G|$ and $|H \cdot x_0| = |H|$. Furthermore, since $|G| < |H|$ and $G \cdot x_0 \subseteq H \cdot x_0$, we get that the $H$-orbit strictly contains the $G$-orbit, that is, $G \cdot x_0 \subsetneq H \cdot x_0$. Since $H \cdot x_0$ is a finite set of points, there exists a continuous function $\hat{f}$ such that $\hat{f}|_{G \cdot x_0} = 1$ and $\hat{f}|_{H \cdot x_0 \setminus G \cdot x_0} = 0$. Define $f(x) = \frac{1}{|G|} \sum_{g \in G} \hat{f}(g \cdot x)$. Now, $f$ is $G$-invariant by construction, but $f(x_0) = 1$ and $f(h \cdot x_0) = 0$ for $h \cdot x_0 \in H \cdot x_0 \setminus G \cdot x_0$. Therefore, $f$ is not $H$-invariant.

In the case of first order $G$-invariant networks the equivariant/invariant layers have the form $L : \mathbb{R}^{n \times a} \to \mathbb{R}^{n \times b}$ and satisfy the fixed-point equations (9). The solution set of the purely linear equivariant layers consists of tensors $L \in \mathbb{R}^{n^2 \times a \times b}$ that are constant on equivalence classes of indices defined by the equivalence relation $(i_1, i_2) \sim (j_1, j_2)$ if there exists $g \in G$ so that $j_\ell = g(i_\ell)$, $\ell = 1, 2$. We denote the number of equivalence classes by $|[n]^2/G|$. The solution set of the constant equivariant operators consists of tensors $B \in \mathbb{R}^{n \times b}$ that are constant on equivalence classes defined by the equivalence relation $i \sim j$ if there exists $g \in G$ so that $j = g(i)$. We denote the number of these classes by $|[n]/G|$. We prove:

Theorem 3. Let $G \leq S_n$. If first order $G$-invariant networks are universal, then $|[n]^2/H| < |[n]^2/G|$ for any strict super-group $G < H \leq S_n$.

Proof. Assume by contradiction that there exists a strict super-group $G < H \leq S_n$ so that $|[n]^2/G| = |[n]^2/H|$. This in particular means that $|[n]/G| = |[n]/H|$ (indeed, every $H$-class is a union of $G$-classes, so equality of the total counts forces the classes of the diagonal pairs $(i, i)$, and hence the classes of $[n]$, to coincide as well). Therefore $\mathcal{L}_G = \mathcal{L}_H$; that is, the spaces of equivariant and invariant linear layers coincide for $G$ and $H$. This implies, as before, that every first order $G$-invariant network is also $H$-invariant. We proceed similarly to the proof of Theorem 2: By Lemma 3, there exists a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ that is $G$-invariant but not $H$-invariant. Let $x_0$ be a point with distinct coordinates where $f(x_0) = 1$ (it exists by construction, see the proof of Lemma 3). Furthermore, by construction $f(h \cdot x_0) = 0$ if $h \cdot x_0 \in H \cdot x_0 \setminus G \cdot x_0$. Let $\epsilon > 0$ and let $K \subset \mathbb{R}^n$ be a compact set containing both $x_0$ and $h \cdot x_0$. Assume by way of contradiction that there exists a first order $G$-invariant network $F$ (which is also $H$-invariant in view of the above) such that $|f(x_0) - F(x_0)| \leq \epsilon$ as well as

$$|f(h \cdot x_0) - F(h \cdot x_0)| = |f(h \cdot x_0) - F(x_0)| \leq \epsilon.$$

These last equations imply that

$$1 = |f(x_0) - f(h \cdot x_0)| \leq |f(x_0) - F(x_0)| + |F(x_0) - f(h \cdot x_0)| \leq 2\epsilon,$$

and since $\epsilon$ is arbitrary we get a contradiction.

Using Theorem 3 we can show that there exist a few infinite families of permutation groups (beyond the alternating group $A_n$) for which first order invariant networks are not universal. For example, any strict subgroup $G < S_n$ that is 2-transitive is such a group, since in this case $|[n]^2/G| = |[n]^2/S_n|$ and consequently $G$-invariant/equivariant layers are also $S_n$-invariant/equivariant.
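The quantity $|[n]^2/G|$ appearing in Theorem 3 is easy to compute by brute force for small groups. A small sketch of ours comparing the cyclic group $C_4$ with one of its strict super-groups, the dihedral group $D_4$; the strict inequality found here for this particular pair is consistent with cyclic groups being 2-closed, as discussed in the next paragraph:

```python
import itertools

def pair_orbit_count(group, n):
    """|[n]^2 / G|: the number of equivalence classes of index pairs under the diagonal action."""
    seen, count = set(), 0
    for pair in itertools.product(range(n), repeat=2):
        if pair in seen:
            continue
        seen |= {(g[pair[0]], g[pair[1]]) for g in group}
        count += 1
    return count

n = 4
cyclic = [tuple((i + s) % n for i in range(n)) for s in range(n)]                 # C_4
dihedral = cyclic + [tuple((s - i) % n for i in range(n)) for s in range(n)]      # D_4 = rotations + reflections
print(pair_orbit_count(cyclic, n), pair_orbit_count(dihedral, n))                  # -> 4 3
```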
Examples of 2-transitive permutation groups include the projective special linear groups over finite fields, $\mathrm{PSL}_d(\mathbb{F}_q)$ (for a prime power $q$), which act on the finite projective space and can be seen as subgroups of $S_n$ for $n = (q^d - 1)/(q - 1)$ (the number of elements in this space). Similarly, the affine groups over finite fields, $\mathrm{A\Gamma L}_d(\mathbb{F}_q)$, which act on $\mathbb{F}_q^d$, can be shown to be 2-transitive as subgroups of $S_n$ for $n = q^d$. See (Dixon & Mortimer, 1996) for a full classification of 2-transitive subgroups of $S_n$.

Relation to (Ravanbakhsh et al., 2017). Groups for which the condition in Theorem 3 holds are called 2-closed and were first introduced by (Wielandt, 1969) (see (Babai, 1995) for further study). Theorem 3 reveals an interesting connection between our work and the work of (Ravanbakhsh et al., 2017) that studies parameter-sharing schemes. One of the basic notions defined in their paper is that of uniquely $G$-equivariant functions, which describes functions that are $G$-equivariant but not equivariant to any super-group of $G$. For example, a consequence of Proposition 6 is that $A_n \leq S_n$ (with the representation used in this paper) has no uniquely equivariant linear functions between tensors of total order $\leq n - 2$. It was shown in (Ravanbakhsh et al., 2017) that 2-closed groups are exactly the groups for which one can find a uniquely equivariant function. In this section we proved that the existence of a uniquely $G$-equivariant linear function is a necessary condition for first order universality. As stated in (Ravanbakhsh et al., 2017), some examples of 2-closed groups are fixed-point free groups (e.g., the cyclic group $C_n$) and $S_n$ itself.

6. Conclusion

In this paper we have considered the universal approximation property of a popular invariant neural network model. We have shown that these networks are universal, with a construction that uses tensors of order $\frac{n(n-1)}{2}$, which makes this architecture impractical. On the other hand, there exists a permutation group for which we have proved a lower bound of $\frac{n-2}{2}$ on the tensor order required to achieve universality. We then addressed the more practical question of which groups $G$ allow first order $G$-invariant networks to be universal. We have proved that 2-closedness of $G$ is a necessary condition, and gave examples of infinite permutation group families that do not satisfy this condition.

Our work is a first step in advancing the understanding of the approximation power of a large class of invariant neural networks that is becoming increasingly popular in applications. Several questions remain open: First, a classification of 2-closed groups would give us a complete answer to which networks are first-order universal. As far as we know this is an open question in group theory. Still, mapping the 2-closed landscape for specific groups $G$ that are interesting for machine learning applications is a worthy challenge. Second, in case one wishes to construct a $G$-invariant network for a group $G$ that is not 2-closed, developing fast and efficient implementations of higher-order layers seems like a potentially useful direction. Lastly, another interesting avenue for future work might be to come up with new, possibly non-linear, models for invariant networks.

Acknowledgments

This research was supported in part by the European Research Council (ERC Consolidator Grant, LiftMatch 771136) and the Israel Science Foundation (Grant No. 1830/17).

References
Babai, L. Automorphism groups, isomorphism, reconstruction. Chapter 27 of the Handbook of Combinatorics, pp. 1447–1540. R. L. Graham, M. Grötschel, L. Lovász (eds.), 1995.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b. URL http://arxiv.org/abs/1612.08498.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Dixon, J. D. and Mortimer, B. Permutation Groups, volume 163. Springer Science & Business Media, 1996.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272, 2017.

Göbel, M. Computing bases for rings of permutation-invariant polynomials. J. Symb. Comput., 19(4):285–291, 1995.

Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. arXiv preprint arXiv:1803.02879, 2018.

Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Kraft, H. and Procesi, C. Classical invariant theory, a primer. Lecture Notes, 2000.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Syx72jC9tm.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389, 2017.

Weiler, M., Geiger, M., Welling, M., Boomsma, W., and Cohen, T. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. 2018. URL http://arxiv.org/abs/1807.02547.

Wielandt, H. Permutation Groups Through Invariant Relations and Invariant Functions. 1969.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.