Universal Equivariant Multilayer Perceptrons

Siamak Ravanbakhsh 1 2

1 School of Computer Science, McGill University, Montreal, Canada. 2 Mila - Quebec AI Institute. Correspondence to: Siamak Ravanbakhsh. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Group invariant and equivariant Multilayer Perceptrons (MLP), also known as Equivariant Networks and Group Convolutional Neural Networks (G-CNN), have achieved remarkable success in learning on a variety of data structures, such as sequences, images, sets, and graphs. This paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. In particular, it is shown that having a hidden layer on which the group acts regularly is sufficient for universal equivariance (invariance). For example, some types of steerable CNNs become universal. Another corollary is the unconditional universality of equivariant MLPs for all Abelian groups. A third corollary is the universality of equivariant MLPs with a high-order hidden layer, where we give both group-agnostic and group-specific bounds on the order of the hidden layer that guarantees universal equivariance.

1. Introduction

Invariance and equivariance properties constrain the output of a function under various transformations of its input. This constraint serves as a strong learning bias that has proven useful in sample-efficient learning for a wide range of structured data. In this work, we are interested in universality results for Multilayer Perceptrons (MLPs) that are constrained to be equivariant or invariant. This type of result guarantees that the model can approximate any continuous equivariant (invariant) function with arbitrary precision, in the same way an unconstrained MLP can approximate an arbitrary continuous function (Hornik et al., 1989; Cybenko, 1989; Funahashi, 1989).

The study of invariance in neural networks goes back to the book of Perceptrons (Minsky & Papert, 2017), where the necessity of parameter-sharing for invariance was used to prove the limitations of a single-layer Perceptron. Follow-up work showed how parameter symmetries can be used to achieve invariance to finite and infinite groups (Shawe-Taylor, 1989; Wood & Shawe-Taylor, 1996; Shawe-Taylor, 1993; Wood, 1996). These fundamental early works went unnoticed during the resurgence of neural network research and the renewed attention to symmetry (Hinton et al., 2011; Mallat, 2012; Bruna & Mallat, 2013; Gens & Domingos, 2014; Jaderberg et al., 2015; Dieleman et al., 2016; Cohen & Welling, 2016a).

When equivariance constraints are imposed on feed-forward layers in an MLP, the linear map in each layer is constrained to use tied parameters (Wood & Shawe-Taylor, 1996; Ravanbakhsh et al., 2017b). This model, which we call an equivariant MLP, appears in deep learning with sets (Zaheer et al., 2017; Qi et al., 2017), exchangeable tensors (Hartford et al., 2018), graphs (Maron et al., 2018), relational data (Graham & Ravanbakhsh, 2019), and sets of symmetric elements (Maron et al., 2020). Universality results for some of these models exist (Zaheer et al., 2017; Segol & Lipman, 2019; Keriven & Peyré, 2019; Maron et al., 2020). Broader results for high-order invariant MLPs appear in (Maron et al., 2019). Universality results for non-standard architectures appear in (Yarotsky, 2018; Sannai et al., 2019).
In addition to proving the universality of networks with polynomial layers, Yarotsky (2018) also proves universality for standard MLPs equivariant to Abelian groups. A similar result follows as a corollary to our main theorem.

A parallel line of work in equivariant deep learning studies linear actions of a group beyond permutations. The resulting equivariant linear layers can be written using convolution operations (Cohen & Welling, 2016b; Kondor & Trivedi, 2018). When limited to permutation groups, group convolution is simply another expression of parameter-sharing (Ravanbakhsh et al., 2017b); see also Section 2.3. However, in working with linear representations, one may move beyond finite groups (Cohen et al., 2019a; Kondor & Trivedi, 2018); see also (Wood & Shawe-Taylor, 1996). Some applications include equivariance to isometries of Euclidean space (Weiler & Cesa, 2019; Worrall et al., 2017) and of the sphere (Cohen et al., 2018). An extension of this view to manifolds is proposed in (Cohen et al., 2019b). Finally, a third line of work in equivariant deep learning, involving a specialized architecture and learning procedure, is that of Capsule networks (Sabour et al., 2017; Hinton et al., 2018); see (Lenssen et al., 2018) for a group-theoretic generalization.

2. Preliminaries

Let $G = \{g\}$ be a finite group. We define the action of this group on two finite sets N and M of input and output units in a feedforward layer. Using these actions, which define permutation groups, we then define equivariance and invariance. In detail, a G-action on the set N is a structure-preserving map (homomorphism) $a : G \to S_N$ into the symmetric group $S_N$, the group of all permutations of N. The image of this map is a permutation group $G_N \leq S_N$. Instead of writing $[a(g)](n)$ for $g \in G$ and $n \in N$, we use the shorthand $g \cdot n \doteq g^{-1} n$ to denote this action. Let M be another G-set, where the corresponding permutation action $G_M \leq S_M$ is defined by $b : G \to S_M$. The G-action on N naturally extends to $x \in \mathbb{R}^{N}$ by $[g \cdot x]_n \doteq x_{g \cdot n}$ for all $g \in G_N$. More conveniently, we also write this action as $A_g x$, where $A_g$ is the permutation matrix form of $a(g) : N \to N$.

2.1. Invariant and Equivariant Linear Maps

Let the real matrix $W \in \mathbb{R}^{|M| \times |N|}$ denote a linear map $W : \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$. We say this map is G-equivariant iff

$B_g W x = W A_g x \quad \forall x \in \mathbb{R}^{|N|},\ g \in G,$  (1)

where, similar to $A_g$, the permutation matrix $B_g$ is defined based on the action $b(g) : M \to M$. In this definition, we assume that the group action on the input is faithful, that is, $a$ is injective, so that $G_N \cong G$. If the action on the output index set M is not faithful, then the kernel of this action is a non-trivial normal subgroup of G, $\ker(b) \trianglelefteq G$. In this case $G_M \cong G/\ker(b)$ is a quotient group, and it is more accurate to say that W is invariant to $\ker(b)$ and equivariant to $G/\ker(b)$. Using this convention, G-equivariance and G-invariance correspond to the extreme cases $\ker(b) = \{e\}$ and $\ker(b) = G$, respectively. Moreover, the composition of such invariant-equivariant functions preserves this property, motivating the design of deep networks by stacking equivariant layers.

2.2. Orbits and Homogeneous Spaces

$G_N$ partitions N into orbits $N_1, \dots, N_O$, where $G_N$ is transitive on each orbit, meaning that for each pair $n_1, n_2 \in N_o$ there is at least one $g \in G_N$ such that $g \cdot n_1 = n_2$. If $G_N$ has a single orbit, it is transitive, and N is called a homogeneous space for G. If, moreover, the choice of $g \in G_N$ with $g \cdot n_1 = n_2$ is unique, then $G_N$ is called regular.
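As a concrete illustration of these definitions, the following minimal sketch (my own, not code from the paper) checks transitivity and regularity for two small permutation groups: the cyclic group C4 acting on {0, 1, 2, 3} by shifts is regular, while S3 acting on {0, 1, 2} is transitive but not regular, since its point stabilizers are non-trivial.

```python
# A minimal sketch illustrating Section 2.2 with permutation groups
# represented as tuples p, where p[n] is the image of n under p.
from itertools import permutations

def orbits(group, points):
    """Partition `points` into orbits under a full permutation group."""
    remaining, out = set(points), []
    while remaining:
        n = next(iter(remaining))
        orb = {p[n] for p in group}   # group is closed, so this is the orbit
        out.append(sorted(orb))
        remaining -= orb
    return out

def is_regular(group, points):
    """Regular: for every pair (n1, n2), exactly one g has g(n1) = n2."""
    return all(sum(p[n1] == n2 for p in group) == 1
               for n1 in points for n2 in points)

C4 = [tuple((n + s) % 4 for n in range(4)) for s in range(4)]  # cyclic shifts
S3 = list(permutations(range(3)))                              # all of S3

print(orbits(C4, range(4)))      # one orbit: C4 is transitive
print(is_regular(C4, range(4)))  # True: the action is regular
print(orbits(S3, range(3)))      # one orbit: S3 is transitive
print(is_regular(S3, range(3)))  # False: point stabilizers are S2
```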
Given a subgroup $H \leq G$ and $g \in G$, the right coset of H in G, defined as $Hg \doteq \{hg \mid h \in H\}$, is a subset of G. For a fixed $H \leq G$, the set of these right cosets, $H\backslash G = \{Hg \mid g \in G\}$, forms a partition of G. G naturally acts on the right coset space, where $g' \cdot (Hg) \doteq H(gg')$ sends one coset to another. The significance of this action is that any transitive G-action is isomorphic to the G-action on some right coset space. To see why, note that in this action any $h \in H$ stabilizes the coset $He$, because $h \cdot He = He$. (More generally, when G acts on the coset $Ha \in H\backslash G$, all $g \in a^{-1}Ha$ stabilize $Ha$: since $g = a^{-1}ha$ for some $h \in H$, we have $(a^{-1}ha) \cdot Ha = H(aa^{-1}ha) = Ha$. This means that any transitive G-action on a set N may be identified with the stabilizer subgroup $G_n \doteq \{g \in G \mid g \cdot n = n\}$ for a choice of $n \in N$, giving a bijection between N and the right coset space $G_n\backslash G$.) Therefore, in any action, the stabilizer identifies the coset space.

Figure 1. The equivariant MLP of (16); the G-action on the input, hidden, and output units is indicated in the figure. $W_c$ and $W'_c$, for all channels of the hidden layer $c = 1, \dots, C$, are constrained by the parameter-sharing of (3). If the G-action on the hidden layer is regular, the number of channels can grow to approximate any continuous G-equivariant function with arbitrary accuracy. Bias terms are not shown.

2.3. Parameter-Sharing and Group-CNNs View

Consider the equivariance condition of (1). Since the equality holds for all $x \in \mathbb{R}^{|N|}$, and using the fact that the inverse of a permutation matrix is its transpose, the equivariance constraint reduces to

$B_g W A_g^\top = W \quad \forall g \in G.$  (2)

The equation above ties the parameters within the orbits of the G-action on the rows and columns of W:

$W(m, n) = W(g \cdot m,\ g \cdot n) \quad \forall g \in G,\ (m, n) \in M \times N,$  (3)

where $W(g \cdot m, g \cdot n)$ is an element of the matrix W. This type of group action on a Cartesian product space is sometimes called the diagonal action; in this case, the action is on the Cartesian product of the rows and columns of W.

We saw that any homogeneous G-space is isomorphic to a coset space. Using $N = H\backslash G$ and $M = K\backslash G$, the parameter-sharing constraint of (2) becomes

$W(Kg, Hg') = W(g^{-1} \cdot Kg,\ g^{-1} \cdot Hg')$  (4)
$\qquad\qquad\ = W(K,\ Hg'g^{-1}) \quad \forall g, g' \in G.$  (5)

Since we can always act on both indices so that the first argument becomes the coset K, we can replace the matrix W with a vector w such that $W(Kg, Hg') = w(Hg'g^{-1})$ for all $g, g' \in G$. This rewriting also enables us to express the matrix-vector multiplication of the linear map W in the form of a cross-correlation of the input with a kernel w:

$[Wx](n) = [Wx](Kg)$  (6)
$\qquad\quad = \sum_{Hg' \in H\backslash G} W(Kg, Hg')\, x(Hg')$  (7)
$\qquad\quad = \sum_{Hg' \in H\backslash G} w(Hg'g^{-1})\, x(Hg').$  (8)

This relates the parameter-sharing view of equivariant maps (4) to the convolution view (8). Therefore, the universality results in the following extend to group convolution layers (Cohen & Welling, 2016a; Cohen et al., 2019a) for finite groups.

Equivariant Affine Maps. We may extend our definition and consider affine G-maps $Wx + b$ by allowing an invariant bias parameter $b \in \mathbb{R}^{|M|}$ satisfying

$B_g b = b.$  (9)

This implies a parameter-sharing constraint $b(m) = b(g \cdot m)$. For homogeneous M, this constraint enforces a scalar bias. Beyond homogeneous spaces, the number of free parameters in b grows with the number of orbits.
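The correspondence between the parameter-sharing (2)-(3) and the convolution form (8) can be checked numerically. The sketch below (my own illustration, assuming G = C5 acting on both input and output indices by cyclic shifts, so the tied weight matrix is circulant) verifies the constraint (2), the equivariance (1), and the equality of Wx with a circular cross-correlation.

```python
# Parameter-sharing <-> group convolution for G = C5 acting by cyclic shifts.
import numpy as np

N = 5
rng = np.random.default_rng(0)

def shift_matrix(s, n=N):
    """Permutation matrix A_g of the shift g: i -> (i + s) mod n."""
    A = np.zeros((n, n))
    A[(np.arange(n) + s) % n, np.arange(n)] = 1.0
    return A

# Tie W(m, n) across the diagonal C5-orbits, i.e. W(m, n) = w((n - m) mod N):
# a circulant matrix built from a kernel w, as in (3).
w = rng.normal(size=N)
W = np.array([[w[(n - m) % N] for n in range(N)] for m in range(N)])

x = rng.normal(size=N)
for s in range(N):
    A = shift_matrix(s)                 # action on the input index set
    B = shift_matrix(s)                 # same (regular) action on the output
    assert np.allclose(B @ W @ A.T, W)            # constraint (2)
    assert np.allclose(B @ (W @ x), W @ (A @ x))  # equivariance (1)

# The same map as a cross-correlation of x with the kernel w, cf. (8):
corr = np.array([sum(w[(n - m) % N] * x[n] for n in range(N)) for m in range(N)])
assert np.allclose(W @ x, corr)
print("circulant parameter-sharing == circular cross-correlation")
```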
2.4. Invariant and Equivariant MLPs

One may stack multiple layers of equivariant affine maps with multiple channels, followed by a non-linearity, so as to build an equivariant MLP. One layer of this equivariant MLP, a.k.a. equivariant network, is given by

$x^{(\ell)}_{c'} = \sigma\Big(\sum_{c=1}^{C^{(\ell-1)}} W^{(\ell)}_{c',c}\, x^{(\ell-1)}_{c} + b^{(\ell)}_{c'}\Big),$

where $1 \leq c \leq C^{(\ell-1)}$ and $1 \leq c' \leq C^{(\ell)}$ index the input and output channels respectively, and $x^{(\ell)}$ is the output of layer $1 \leq \ell \leq L$, with $x^{(0)} = x$ denoting the original input. Here, we assume that G acts faithfully on all $x^{(\ell)}_c \in \mathbb{R}^{H^{(\ell)}}$ for all $c, \ell$, with $H^{(0)} = N$ and $H^{(L)} = M$. The parameter matrices $W^{(\ell)}_{c',c} \in \mathbb{R}^{H^{(\ell)} \times H^{(\ell-1)}}$ and the bias vectors $b^{(\ell)}_{c'} \in \mathbb{R}^{H^{(\ell)}}$ are constrained by the parameter-sharing conditions (2) and (9), respectively. In an invariant MLP, the faithfulness condition for the G-action on the hidden and output layers is lifted. In practice, it is common to construct invariant networks by first constructing an equivariant network, followed by pooling over $H^{(L)}$.

3. Universality Results

This section presents two new results on the universality of both invariant and equivariant networks with a single hidden layer (L = 2). Formally, a G-equivariant MLP $\hat{\psi} : \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$ is a universal G-equivariant approximator if, for any G-equivariant continuous function $\psi : \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$, any compact set $K \subset \mathbb{R}^{|N|}$, and any $\epsilon > 0$, there exists a choice of parameters and number of channels such that $\|\psi(x) - \hat{\psi}(x)\| < \epsilon$ for all $x \in K$.

Theorem 3.1. A G-invariant network

$\hat{\psi}(x) = \sum_{c=1}^{C} w'_c\, \mathbf{1}^\top \sigma(W_c x + b_c)$  (10)

with a single hidden layer, on which G acts regularly, is a universal G-invariant approximator. Here, $\mathbf{1} = [1, \dots, 1]^\top$ has length $|G|$, and $b_c, w'_c \in \mathbb{R}$.

Proof. The first step follows the symmetrization argument of Yarotsky (2018). Since the MLP is a universal approximator, for any compact set $K \subset \mathbb{R}^{|N|}$ and any $\epsilon > 0$, we can find $\psi_{MLP}$ such that $|\psi(x) - \psi_{MLP}(x)| \leq \epsilon$ for $x \in K$. Let $K_{sym} = \bigcup_{g \in G}\{A_g x \mid x \in K\}$ denote the symmetrized K, which is again a compact subset of $\mathbb{R}^{|N|}$ for finite G. Let $\psi_{MLP+}$ approximate $\psi$ on the symmetrized compact set $K_{sym}$. It is then easy to show that for G-invariant $\psi$, the symmetrized MLP $\psi_{sym}(x) = \frac{1}{|G|}\sum_{g \in G}\psi_{MLP+}(A_g x)$ also approximates $\psi$:

$|\psi(x) - \psi_{sym}(x)| = \Big|\psi(x) - \frac{1}{|G|}\sum_{g \in G}\psi_{MLP+}(A_g x)\Big|$  (11)
$\leq \frac{1}{|G|}\sum_{g \in G}|\psi(A_g x) - \psi_{MLP+}(A_g x)| \leq \epsilon,$  (12)

where we used the G-invariance of $\psi$, i.e., $\psi(x) = \psi(A_g x)$.

The next step is to show that $\psi_{sym}$ is equal to $\hat{\psi}$ of (10) for some parameters $W_c \in \mathbb{R}^{|H| \times |N|}$ constrained so that $H_g W_c = W_c A_g$ for all $g \in G$, where $A_g$ and $H_g$ are the permutation representations of the G-action on the input and the hidden layer, respectively. Writing the single-hidden-layer $\psi_{MLP+}$ with first-layer weight vectors $w_c$ and second-layer weights $w'_c$,

$\psi_{sym}(x) = \frac{1}{|G|}\sum_{c=1}^{C}\sum_{g \in G} w'_c\, \sigma\big(w_c^\top A_g x + b_c\big)$  (13)
$= \sum_{c=1}^{C} w'_c\, \mathbf{1}^\top \sigma(W_c x + b_c),$  (14)
$\text{with} \quad W_c \doteq \begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|G|}} \end{bmatrix},$  (15)

where in the last step we put the summation terms into the rows of the matrix $W_c$ and performed the summation over hidden units using multiplication by $\mathbf{1}^\top$; here $w_c$ is the rescaled first-layer weight vector, absorbing the $\frac{1}{|G|}$ factor. Since the summation in (13) is over $g \in G$, each row of $W_c$, and therefore each hidden unit, is attached to exactly one group member, which translates to having a principal homogeneous space, a.k.a. a regular G-set, as the hidden layer. Note that we have the freedom to choose any order for the rows, corresponding to a different order in the summation, which means that the choice of a particular principal homogeneous space is irrelevant.

Now we show that the parameter matrix $W_c \in \mathbb{R}^{|G| \times |N|}$ above satisfies the parameter-sharing constraint $W_c A_g = H_g W_c$ for all $g \in G$:

$H_g W_c A_g^{-1} = \begin{bmatrix} w_c^\top A_{g_1 g} \\ \vdots \\ w_c^\top A_{g_{|G|} g} \end{bmatrix} A_g^{-1} = \begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|G|}} \end{bmatrix} = W_c.$

The first equality follows from the fact that $H_g$ moves the row indexed by $g_r$ to the row $g \cdot g_r = g_r g^{-1}$ (since $H_g A_{g_r} = A_{g \cdot g_r} = A_{g_r g^{-1}}$); therefore, the current row $g_r$ was previously $g^{-1} \cdot g_r = g_r g$. The second equality follows because $A_g^{-1}$ acts from the right, so no further inversion is needed: $A_{g_r g} A_g^{-1} = A_{g_r g g^{-1}} = A_{g_r}$. This shows that a G-invariant network with a single hidden layer on which G acts regularly is equivalent to a symmetrized MLP, and therefore, for some number of channels, it is a universal approximator of G-invariant functions. ∎
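The construction in this proof is easy to reproduce numerically. The following sketch (my own, not the paper's code) instantiates it for G = S3 acting on 3 points: the rows of W_c are w^T A_g for all g in G, so W_c A_g is a row permutation of W_c (this is the constraint H_g W_c = W_c A_g for the regular action on the hidden units), and the resulting single-hidden-layer network is G-invariant.

```python
# Regular hidden layer for G = S3: rows of W are w^T A_g, one per g in G.
import numpy as np
from itertools import permutations

G = list(permutations(range(3)))              # S3
idx = {g: i for i, g in enumerate(G)}

def mat(g):                                   # permutation matrix A_g
    A = np.zeros((3, 3))
    A[list(g), np.arange(3)] = 1.0
    return A

def compose(g, h):                            # (g o h)(i) = g[h[i]]
    return tuple(g[i] for i in h)

rng = np.random.default_rng(1)
w, b = rng.normal(size=3), rng.normal()
W = np.stack([mat(g).T @ w for g in G])       # row g of W is w^T A_g

f = lambda x: np.maximum(W @ x + b, 0.0).sum()   # 1^T sigma(W x + b)

x = rng.normal(size=3)
for g in G:
    # W A_g equals W with its rows permuted by h -> h o g: this is H_g W.
    perm = [idx[compose(h, g)] for h in G]
    assert np.allclose(W @ mat(g), W[perm])
    # Hence the one-hidden-layer network is G-invariant.
    assert np.isclose(f(mat(g) @ x), f(x))
print("regular hidden layer => invariance verified for S3")
```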
This result should not be surprising, since the size of a regular hidden layer grows with the group and, as is evident from the proof, an equivariant MLP with a regular hidden layer implicitly averages the output over all transformations of the input. Next, we apply a similar idea to prove the universality of equivariant MLPs with a regular hidden layer.

Theorem 3.2. A G-equivariant MLP

$\hat{\psi}(x) = \sum_{c=1}^{C} W'_c\, \sigma(W_c x + b_c)$  (16)

with a single regular hidden layer is a universal G-equivariant approximator.

Proof. In this setting, the symmetrization of the universal MLP, using the so-called Reynolds operator (Sturmfels, 2008), is given by

$\psi_{sym}(x) = \frac{1}{|G|}\sum_{g \in G} B_{g^{-1}} \sum_{c=1}^{C} w'_c\, \sigma\big(w_c^\top A_g x + b_c\big),$  (17)

where $w_c \in \mathbb{R}^{|N|}$ and $w'_c \in \mathbb{R}^{|M|}$ are the weight vectors in the first and second layer associated with hidden unit c. Our objective is to show that this symmetrized MLP is equivalent to the equivariant network of (16), in which $W'_c \in \mathbb{R}^{|M| \times |H|}$ and $W_c \in \mathbb{R}^{|H| \times |N|}$ use parameter-sharing to satisfy

$H_g W_c = W_c A_g \quad \text{and} \quad B_g W'_c = W'_c H_g \quad \forall g \in G.$  (18)

Here, $A_g$, $B_g$, and $H_g$ are the permutation representations of the G-action on the input, the output, and the hidden layer, respectively. First, rewrite the symmetrized MLP as

$\psi_{sym}(x) = \sum_{g \in G} B_{g^{-1}} \sum_{c=1}^{C} w'_c\, \sigma\big(w_c^\top A_g x + b_c\big) = \sum_{c=1}^{C} W'_c\, \sigma(W_c x + b_c),$

where

$W'_c = \Big[\, B_{g_1^{-1}} w'_c \;\big|\; \dots \;\big|\; B_{g_{|G|}^{-1}} w'_c \,\Big], \qquad W_c = \begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|G|}} \end{bmatrix},$

and the $\frac{1}{|G|}$ factor is absorbed into one of the weights. It remains to show that the two matrices above satisfy the equivariance conditions $H_g W_c = W_c A_g$ and $B_g W'_c = W'_c H_g$. The proof for $W_c$ is identical to the invariant network case. For $W'_c$, we use a similar approach:

$B_g W'_c H_g^{-1} = \Big[\, B_g B_{g_1^{-1} g}\, w'_c \;\big|\; \dots \;\big|\; B_g B_{g_{|G|}^{-1} g}\, w'_c \,\Big] = \Big[\, B_{g_1^{-1}} w'_c \;\big|\; \dots \;\big|\; B_{g_{|G|}^{-1}} w'_c \,\Big] = W'_c.$

In the first step, since $H_g^{-1} = H_{g^{-1}}$ is acting from the right, it moves the column indexed by $g_l^{-1}$ to $g_l^{-1}g^{-1}$; this means that the column currently at $g_l^{-1}$ was previously at $g_l^{-1}g$. The second step uses the following: $B_g B_{g_l^{-1}g} = B_{g \cdot (g_l^{-1}g)} = B_{g_l^{-1}gg^{-1}} = B_{g_l^{-1}}$. This proves the equality of the symmetrized MLP (17) and the equivariant MLP of (16). An argument similar to that in the proof of the invariant case shows the universality of $\psi_{sym}$. Putting these together completes the proof of Theorem 3.2. ∎

3.1. Universality for Abelian Groups

In the case where G is an Abelian group, any faithful transitive action is regular, meaning that the hidden layer in a G-equivariant neural network is necessarily regular. Combined with Theorem 3.2, this leads to an unconditional universality result for Abelian groups. A similar result for Abelian groups appears in (Yarotsky, 2018).

Corollary 1. For an Abelian group G, a G-equivariant (invariant) neural network with a single hidden layer is a universal approximator of continuous G-equivariant (invariant) functions on compact subsets of $\mathbb{R}^{|N|}$.

A corollary to this is the universality of a Convolutional Neural Network (CNN) with a single hidden layer.

Corollary 2 (Universality of CNNs). For arbitrary input-output dimensions, a CNN with a single hidden layer, full kernels, and cyclic padding is a universal approximator of continuous circular-translation equivariant (invariant) functions.

The term circular, both for the padding and the translations, reflects the need to work with finite translations, which are produced by the action of a product of cyclic groups. (The input can be zero-padded before circular padding, so that Corollary 2 guarantees universal approximation of translation-equivariant functions where the translations are bounded by the size of the original input.)
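A minimal numerical check of the setting of Corollary 2 (my own sketch, with N = 6 and a ReLU non-linearity): when both layers of a single-hidden-layer CNN are circulant, i.e., full kernels with cyclic padding, the network commutes with circular translations.

```python
# A one-hidden-layer circular CNN is equivariant to circular translations.
import numpy as np

N = 6
rng = np.random.default_rng(2)

def circulant(w):          # W[m, n] = w[(n - m) mod N]: full kernel, cyclic pad
    return np.array([[w[(n - m) % N] for n in range(N)] for m in range(N)])

def roll_mat(s):           # permutation matrix of the shift i -> (i + s) mod N
    A = np.zeros((N, N))
    A[(np.arange(N) + s) % N, np.arange(N)] = 1.0
    return A

W1, W2 = circulant(rng.normal(size=N)), circulant(rng.normal(size=N))
b = rng.normal()           # scalar bias: the invariant bias of (9)
net = lambda x: W2 @ np.maximum(W1 @ x + b, 0.0)

x = rng.normal(size=N)
for s in range(N):
    A = roll_mat(s)
    assert np.allclose(net(A @ x), A @ net(x))   # shift equivariance
print("one-hidden-layer circular CNN is translation equivariant")
```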
3.2. Universality for Regular Steerable CNNs

In building models equivariant to subgroups of Euclidean isometries (translation, rotation, and reflection), a simple solution is to consider the action of the (circular) translation group on rotated and/or reflected copies of each feature-map (Dieleman et al., 2016). This approach is formalized in (Cohen & Welling, 2016b) for a general semi-direct product $G = N \rtimes H$ and linear representations. In the setting where the actions of N and H on the base space and the fibers are both regular, the G-action is also regular, and as a corollary to Theorem 3.2 the steerable CNN becomes universal. For example, this is the case in the practical setup where the feature-maps are rotated and reflected.

3.3. Universality for High-Order Hidden Layers

The G-action on the hidden units H naturally extends to its simultaneous action on the Cartesian product $H^D = H \times \dots \times H$: $g \cdot (h_1, \dots, h_D) \doteq (g \cdot h_1, \dots, g \cdot h_D)$. We call this an order-D product space. Product spaces are used in building high-order layers in G-equivariant networks in several recent works (Kondor et al., 2018; Maron et al., 2018; Keriven & Peyré, 2019; Albooyeh et al., 2019). Maron et al. (2019) show that for

$D \geq \frac{|H|\,(|H| - 1)}{2},$  (19)

such MLPs with multiple hidden layers of order D become universal G-invariant approximators. In this section, we show that better bounds on the order D that guarantee universal invariance and equivariance follow from the universality results of Theorems 3.1 and 3.2. The next section provides an in-depth analysis of product spaces that not only gives an alternative proof of the theorem below, but could also lead to yet better bounds. (The elegant proof below was proposed by an anonymous reviewer; the original proof uses the ideas discussed in the next section and appears later in the paper.)

Theorem 3.3. Let G act faithfully on $H = [H\backslash G]$ (we overload H to denote both the set of hidden units and its stabilizer subgroup). Then $H^D$ has a regular orbit for any $D \geq \log_2(|H|)$, and therefore, by Theorem 3.2, an order-D hidden layer guarantees universal equivariance.

Proof. If G acts faithfully on H, the intersection of the stabilizers of all the points in H is trivial, i.e., $\mathrm{Core}_G(H) = \{e\}$. If, instead of taking the intersection of the stabilizers of all $h \in H$, we can take the intersection of the stabilizers of D carefully chosen points, we will know there is a regular orbit in $H^D$. That is because the stabilizer of a point in $H^D$ is the intersection of the stabilizers of its elements in H, that is, $\mathrm{Stab}_G(h_1, \dots, h_D) = \bigcap_{d=1}^{D}\mathrm{Stab}_G(h_d)$. So the question is: for what value of D can we find D points such that the intersection of their stabilizers is trivial?

We work recursively to find a bound on D. Start with just one point $h_1 \in H^1$, and assume its stabilizer has size $s_1$. Now assume we have a point $(h_1, \dots, h_d) \in H^d$ whose stabilizer has size $s_d$. If $s_d = 1$, we are done. Otherwise, since the action is faithful, there has to exist a point $h_{d+1}$ such that the intersection of the stabilizers of $h_1, \dots, h_{d+1}$ is a strictly smaller subgroup of the stabilizer of $(h_1, \dots, h_d)$. The size of a proper subgroup is at most half the size of the original group, and therefore $s_{d+1} \leq s_d/2$: each additional point at least halves the size of the stabilizer. It follows that for any $D \geq \log_2(|H|)$, $[H\backslash G]^D = H^D$ has an orbit with a trivial stabilizer. ∎
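The recursive argument above can be traced on a small example. The sketch below (my own, using S4 acting faithfully on 4 points, where the point stabilizer is a copy of S3 of size 6) greedily picks points and verifies that the joint stabilizer at least halves at each step, reaching a trivial stabilizer, and hence a regular orbit of H^D, within three steps.

```python
# Stabilizer halving (proof of Theorem 3.3) for S4 acting on H = {0,1,2,3}.
from itertools import permutations

G = list(permutations(range(4)))              # S4, faithful on 4 points
H = range(4)

def stab(pts):                                # intersection of point stabilizers
    return [g for g in G if all(g[p] == p for p in pts)]

chosen, s = [], G
while len(s) > 1:
    # pick any point that strictly shrinks the current joint stabilizer
    h = next(p for p in H if len(stab(chosen + [p])) < len(s))
    chosen.append(h)
    prev, s = s, stab(chosen)
    assert len(s) <= len(prev) // 2           # proper subgroup: at most half
    print(f"D = {len(chosen)}: |Stab(h_1..h_D)| = {len(s)}")
# The tuple `chosen` has a trivial stabilizer: it lies on a regular orbit.
```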
Since the largest stabilizer for any faithful action on H is $S_{|H|-1}$, we can use a lower bound for D in Theorem 3.3 that is independent of the stabilizer subgroup. The following bound follows from applying Stirling's approximation $N! < N^{N+\frac{1}{2}} e^{-N+1}$ to the size of the largest possible stabilizer, $|S_{|H|-1}| = (|H| - 1)!$.

Corollary 3. The high-order G-set of hidden units $H^D$, with $N = |H|$ the number of hidden units, has a regular orbit for

$D \geq \big(N - \tfrac{1}{2}\big)\log_2(N - 1) - (N - 2)\log_2(e),$

and following Theorem 3.2 the corresponding equivariant MLP is a universal approximator of continuous G-equivariant functions.

4. Decomposition of Product G-Sets

A prerequisite to the analysis of product G-sets is their classification, which also leads to a classification of all G-maps based on their input/output G-sets.

4.1. Classification of G-Sets and G-Maps

Recall that any transitive G-set N is isomorphic to a right-coset space $H\backslash G$. However, the right-coset spaces $H\backslash G$ and $(g^{-1}Hg)\backslash G$ for $g \in G$ are themselves isomorphic. (The stabilizer subgroups of two points in a homogeneous space are conjugate, and therefore G-sets resulting from conjugate choices of right cosets are isomorphic. To see why stabilizers are conjugate, assume $n' = a^{-1} \cdot n$ and $h \in G_n$; then $(a^{-1}ha) \cdot n' = (a^{-1}h) \cdot n = a^{-1} \cdot n = n'$, so $a^{-1}ha \in G_{n'}$. Since conjugation is a bijection, this means $G_{n'} = a^{-1}G_n a$.) This also means that what we care about is conjugacy classes of subgroups, $[H] = \{g^{-1}Hg \mid g \in G\}$, which classify right-coset spaces up to conjugacy, $[H\backslash G] = \{(g^{-1}Hg)\backslash G \mid g \in G\}$. We use the bracket to identify the conjugacy class. In this notation, for $H, H' \leq G$, we say $[H] < [H']$ iff $g^{-1}Hg < H'$ for some $g \in G$.

G acts transitively on each orbit of a G-set, and we can identify each orbit with its stabilizer subgroup. Therefore, a list of these subgroups along with their multiplicities completely defines a G-set up to isomorphism (Rotman, 2012):

$N \cong \bigsqcup_{[H_i] \leq G} p_i\, [H_i\backslash G],$  (20)

where $p_1, \dots, p_I \in \mathbb{Z}_{\geq 0}$ denote the multiplicities of the right-coset spaces, and N has $\sum_{i=1}^{I} p_i$ orbits. To ensure a faithful G-action on N, a necessary and sufficient condition is for the point stabilizers $G_n$, $n \in N$, to have a trivial intersection. The point stabilizers within each orbit are conjugate to each other, and their intersection, which is the largest normal subgroup of G contained in $H_i$, is called the core of the G-action on $[H_i\backslash G]$:

$\mathrm{Core}_G(H_i) \doteq \bigcap_{g \in G} g^{-1} H_i g.$  (21)

Next, we extend the classification of G-sets to G-equivariant maps, a.k.a. G-maps $W : \mathbb{R}^{N} \to \mathbb{R}^{M}$, by jointly classifying the input and output index sets N and M. We may consider an expression similar to (20) for the output index set, $M = \bigsqcup_{[K_j] \leq G} q_j\, [K_j\backslash G]$. The linear G-map $W : \mathbb{R}^{N} \to \mathbb{R}^{M}$ is then equivariant to $G/K$ and invariant to $K \trianglelefteq G$ iff

$\bigcap_{p_i > 0} \mathrm{Core}_G(H_i) = \{e\} \quad \text{and} \quad \bigcap_{q_j > 0} \mathrm{Core}_G(K_j) = K,$  (22)

where the second condition translates to the K-invariance of the G-action on M, and the first condition simply ensures the faithfulness of the G-action on N. This result means that the multiplicities $(p_1, \dots, p_I)$ and $(q_1, \dots, q_J)$ completely identify a (linear) G-map $W : \mathbb{R}^{N} \to \mathbb{R}^{M}$ that is equivariant to $G/K$ and invariant to $K \trianglelefteq G$, up to isomorphism.

4.2. Diagonal Action on the Cartesian Product of G-Sets

Previously, we classified all G-sets as disjoint unions of homogeneous spaces $\bigsqcup_{i=1}^{I} p_i [G_i\backslash G]$, where G acts transitively on each orbit. However, as we saw earlier, G also naturally acts on the Cartesian product of homogeneous G-sets, $N_1 \times \dots \times N_D = (G_1\backslash G) \times \dots \times (G_D\backslash G)$, where the action is defined by $g \cdot (G_1 h_1, \dots, G_D h_D) \doteq (G_1(h_1 g), \dots, G_D(h_D g))$.
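The diagonal action is easy to explore by direct orbit computation. The following sketch (my own illustration, with G = S3 and D = 2) decomposes the product space {0,1,2} x {0,1,2} into its orbits and reports their stabilizer sizes, recovering the diagonal/off-diagonal split that reappears in Example 1 below.

```python
# Orbits of the diagonal S3-action on the product {0,1,2} x {0,1,2}.
from itertools import permutations, product

G = list(permutations(range(3)))
points = list(product(range(3), repeat=2))

def act(g, pt):                         # diagonal action on index tuples
    return tuple(g[i] for i in pt)

remaining, orbs = set(points), []
while remaining:
    p = next(iter(remaining))
    orb = {act(g, p) for g in G}        # G is a full group, so this is closed
    orbs.append(sorted(orb))
    remaining -= orb

for orb in orbs:
    stab = [g for g in G if act(g, orb[0]) == orb[0]]
    print(f"orbit size {len(orb)}, stabilizer size {len(stab)}: {orb}")
# -> one orbit of size 3 (the diagonal, stabilizer of size 2) and one of
#    size 6 (the off-diagonal, trivial stabilizer: a regular orbit).
```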
A special case is the repeated self-product of the same homogeneous space $H = [H\backslash G]$, which, as we saw, gives the order-D product space

$H^D = [H\backslash G]^D = \underbrace{[H\backslash G] \times \dots \times [H\backslash G]}_{D\ \text{times}}.$

The following discussion shows how the product space decomposes into orbits, where the existence of a regular orbit leads to universality.

4.3. The Burnside Ring and Decomposition of G-Sets

Since any G-set can be written as a disjoint union of homogeneous spaces (20), we expect a decomposition of the product G-space in the form

$[G_i\backslash G] \times [G_j\backslash G] = \bigsqcup_{[G_\ell] \leq G} \delta^{\ell}_{i,j}\, [G_\ell\backslash G].$  (23)

Indeed, this decomposition exists, and the multiplicities $\delta^{\ell}_{i,j} \in \mathbb{Z}_{\geq 0}$ are called the structure coefficients of the Burnside ring. The (commutative semi)ring structure is due to the fact that the set of non-isomorphic G-sets, $\Omega(G) = \{\bigsqcup_{[G_i] \leq G} p_i [G_i\backslash G] \mid p_i \in \mathbb{Z}_{\geq 0}\}$, is equipped with: 1) a commutative product operation, the Cartesian product of G-spaces; and 2) a summation operation, the disjoint union of G-spaces (Dieck, 2006). A key to the analysis of product G-spaces is finding the structure coefficients in (23).

Example 1 (PRODUCT OF SETS). The symmetric group $S_N$ acts faithfully on N, where the stabilizer of $n \in N$ is $S_{N-\{n\}}$, the group of all permutations of the remaining items $N - \{n\}$. This means $N = [S_{N-\{n\}}\backslash S_N]$. The diagonal $S_N$-action on the product space $N^D$ decomposes into $\sum_i p_i = \mathrm{Bell}(D)$ orbits, where the Bell number is the number of different partitions of a set of D labelled objects (Maron et al., 2018). One may further refine these orbits by their type in the form of (23):

$[S_{N-\{n\}}\backslash S_N]^D = \bigsqcup_{d=1}^{D} S(D, d)\, [S_{N-\{n_1, \dots, n_d\}}\backslash S_N],$  (24)

where the structure coefficient $S(D, d)$ is the Stirling number of the second kind, counting the number of ways D objects can be partitioned into d non-empty sets. For example, when $D = 2$, one may think of the index set $N \times N$ as indexing an $|N| \times |N|$ matrix. This matrix decomposes into one ($S(2, 1) = 1$) diagonal $[S_{N-\{n\}}\backslash S_N]$ and one ($S(2, 2) = 1$) set of off-diagonals $[S_{N-\{n_1, n_2\}}\backslash S_N]$. This decomposition is presented in (Albooyeh et al., 2019), where it is shown that these orbits correspond to hyper-diagonals for higher-order tensors. For general groups, inferring the structure coefficients is more challenging, as we will see shortly.

From (24) in the example above, it follows that an order $D = |N|$ product of sets contains a regular orbit. The following corollary combines this with the universality results of Theorems 3.1 and 3.2.

Corollary 4 (Universality of Equivariant Hyper-Graph Networks). An $S_N$-equivariant network with a hidden layer of order $D \geq |N|$ is a universal approximator of $S_N$-equivariant (invariant) functions, where the input and output layers may be of any order.

Note how the group-specific analysis gives a better, linear bound of $D = N$, compared to the group-agnostic bound of order $N\log(N)$ from Corollary 3. A universality result for the invariant case only, using a quadratic order, appears in (Maron et al., 2019), where the MLP is called a hyper-graph network. Keriven & Peyré (2019) prove universality for the equivariant case, without giving a bound on the order of the hidden layer, and assuming an output $M = H^1$ of degree one. In comparison, Corollary 4 uses a linear bound and applies to the much more general setting of arbitrary orders for the input and output product sets. In fact, the universality result is true for arbitrary input-output $S_N$-sets.
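A quick check of Example 1 (my own sketch): orbits of the diagonal S_N action on N^D are in bijection with "first-occurrence" patterns of the index tuple, so for D <= N their count equals Bell(D), and at D = N the all-distinct tuple lies on a regular orbit, as used in Corollary 4.

```python
# Counting orbits of the diagonal S_N action on N^D via tuple patterns.
from itertools import product
from math import comb

def pattern(t):
    """Canonical relabeling by first occurrence: (3, 1, 3) -> (0, 1, 0)."""
    seen, out = {}, []
    for v in t:
        out.append(seen.setdefault(v, len(seen)))
    return tuple(out)

def bell(D):                  # Bell numbers via B_{n+1} = sum_k C(n,k) B_k
    B = [1]
    for n in range(D):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B[D]

N = 4
for D in range(1, N + 1):
    orbs = {pattern(t) for t in product(range(N), repeat=D)}
    print(f"D={D}: {len(orbs)} orbits of S_{N} on N^D; Bell({D}) = {bell(D)}")
# At D = N = 4 the orbit of (0, 1, 2, 3) has a trivial stabilizer: regular.
```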
Linear G-Maps as Product Spaces. For finite groups, the linear G-map $W : \mathbb{R}^{N} \to \mathbb{R}^{M}$ is indexed by $M \times N$, and it is therefore itself a product space. In fact, the parameter-sharing of (3) ties all the parameters $W(m, n)$ that are in the same orbit. Therefore, the decomposition (23) also identifies the parameter-sharing pattern of W. (When N and M are homogeneous spaces, another characterization of the orbits of the product space $[G_n\backslash G] \times [G_m\backslash G]$ is through their one-to-one correspondence with the double cosets $G_n\backslash G/G_m = \{G_n g G_m \mid g \in G\}$.)

Example 2 (EQUIVARIANT MAPS BETWEEN SET PRODUCTS). Equation (24) gives a closed form for the decomposition of $N^D$ into orbits. Assuming a similar decomposition for $M^{D'}$, the equivariant map $W : \mathbb{R}^{N^D} \to \mathbb{R}^{M^{D'}}$ decomposes into $\mathrm{Bell}(D + D')$ linear maps corresponding to the orbits of $M^{D'} \times N^D$.

4.3.1. BURNSIDE'S TABLE OF MARKS

Burnside's table of marks simplifies working with the multiplication operation of the Burnside ring and enables the analysis of the G-action on product spaces (Burnside, 1911; Pfeiffer, 1997). The mark of $H \leq G$ on a finite G-set N is defined as the number of points in N fixed by all $h \in H$:

$m_N(H) \doteq |\{n \in N \mid h \cdot n = n\ \ \forall h \in H\}|.$  (25)

An interesting quality of the number of fixed points is that it adds up when we take the disjoint union of two spaces $N_1 \sqcup N_2$. Also, when considering the product space $N_1 \times N_2$, any combination of points fixed in both spaces is fixed by H, so the marks multiply:

$m_{N_1 \sqcup N_2}(G_i) = m_{N_1}(G_i) + m_{N_2}(G_i)$  (26)
$m_{N_1 \times N_2}(G_i) = m_{N_1}(G_i)\, m_{N_2}(G_i).$  (27)

Now define the vector of marks $m_N \doteq [m_N(G_1), \dots, m_N(G_I)]$, where I is the number of conjugacy classes of subgroups of G and we assume a fixed order on the classes $[G_i] \leq G$; this defines a map from $\Omega(G)$ to $\mathbb{Z}^{I}$. Due to (26) and (27), given G-sets $N_1, \dots, N_D$, we can perform elementwise addition and multiplication on the integer vectors $m_{N_1}, \dots, m_{N_D}$ to obtain the marks of the union and product G-sets, respectively. Moreover, this special quality of marks makes the vector of marks an injective (semiring) homomorphism: we can work backwards from the resulting vector of marks and decompose the union/product space into homogeneous spaces.

To facilitate the calculation of this vector for any G-set N, one may use the table of marks. The table of marks of a group G is the square matrix of the marks of all subgroups on all right-coset spaces, that is, the element $(i, j)$ of this matrix is

$M_G(i, j) \doteq m_{G_i\backslash G}(G_j), \quad \text{or} \quad M_G \doteq \begin{bmatrix} m_{\{e\}\backslash G} \\ \vdots \\ m_{G\backslash G} \end{bmatrix}.$  (28)

(Note that $m_{G_i\backslash G}(G_j) = m_{G_i\backslash G}(gG_jg^{-1})$ and $m_{G_i\backslash G}(G_j) = m_{(gG_ig^{-1})\backslash G}(G_j)$ for all $g \in G$; therefore, the table of marks is a characterization up to conjugacy.)

Table 1. The table of marks $M_G$ (schematic). Rows are right-coset spaces, columns are conjugacy classes of subgroups; entries above the diagonal are zero.

              {e}          ...  G_i                ...  G_j               ...  G
{e}\G         |G|
...
G_i\G         |G : G_i|    ...  |G : N_G(G_i)|
...
G_j\G         |G : G_j|    ...  m_{G_j\G}(G_i)     ...  |G : N_G(G_j)|
...
G\G           1            ...  1                  ...  1                 ...  1

The matrix $M_G$ carries valuable information about the subgroup structure of G. For example, the action of $G_j$ on $G_i\backslash G$ has a fixed point iff $[G_j] \leq [G_i]$. Therefore, the sparsity pattern in the table of marks reflects the subgroup lattice structure of G, up to conjugacy. (The subgroup lattice of G is a partially ordered set in which the order $G_i < G_j$ is the subgroup relation, and the greatest and least elements are G and $\{e\}$, respectively. Any G-set is isomorphic to a right-coset space produced by a member of this lattice; however, we only care about this lattice up to conjugacy because, as we saw, the right cosets $H\backslash G$ and $(g^{-1}Hg)\backslash G$ are isomorphic.)

A useful property of $M_G$ is that we can use it to find the marks $m_N$ of any G-set $N = \bigsqcup_i p_i [G_i\backslash G]$ in $\Omega(G)$ using the expression $m_N = [p_1, \dots, p_I]\, M_G$. Moreover, the structure coefficients of (23) can be recovered from the table of marks:

$\delta^{\ell}_{i,j} = \sum_{l} M_G(i, l)\, M_G(j, l)\, \big(M_G^{-1}\big)(l, \ell).$  (29)
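To make this machinery concrete, the following worked example (my own sketch, not from the paper) computes the table of marks of S3 from its coset spaces and then uses (29), i.e., an elementwise product of mark vectors followed by multiplication with the inverse table, to decompose the product G-set [C2\S3] x [C2\S3].

```python
# Table of marks of S3 and a product decomposition via (29).
import numpy as np
from itertools import permutations

G = list(permutations(range(3)))                  # S3
e = (0, 1, 2)
compose = lambda g, h: tuple(g[i] for i in h)     # group multiplication

# Representatives of the four conjugacy classes of subgroups of S3.
subgroups = {"{e}": [e],
             "C2": [e, (1, 0, 2)],
             "C3": [e, (1, 2, 0), (2, 0, 1)],
             "S3": G}
names = list(subgroups)

def cosets(H):                                    # right coset space H\G
    return {frozenset(compose(h, g) for h in H) for g in G}

def mark(H, K):                                   # points of H\G fixed by K
    return sum(all(frozenset(compose(x, k) for x in c) == c for k in K)
               for c in cosets(H))

M = np.array([[mark(subgroups[r], subgroups[c]) for c in names]
              for r in names])
print(M)                                          # lower-triangular, cf. Table 1

# Decompose [C2\S3] x [C2\S3]: elementwise product of marks, then M^{-1}.
delta = (M[1] * M[1]) @ np.linalg.inv(M)
print(dict(zip(names, np.rint(delta).astype(int))))
# -> {'{e}': 1, 'C2': 1, 'C3': 0, 'S3': 0}: [C2\S3]^2 = [C2\S3] u [{e}\S3],
#    i.e. the 3x3 index set splits into a diagonal and a regular orbit.
```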
Figure 2. A high-order hidden layer decomposes into orbits, which are characterized by the table of marks. By increasing the order, one can guarantee the existence of a regular orbit in the decomposition; by Theorem 3.2, this leads to universal equivariance.

5. Universality of G-Maps on Product Spaces

Using the tools discussed in the previous section, in this section we prove some properties of product spaces that are consequential in the design of equivariant maps. Previously, we saw that product spaces decompose into orbits, identified by $\delta^{\ell}_{ij} > 0$ in (23). The following theorem states that such product spaces always have orbits that are at least as large as the largest of the input orbits, and that at least one of the product orbits is strictly larger than both inputs. For simplicity, this theorem is stated in terms of the stabilizers rather than the orbits, where by the orbit-stabilizer theorem larger stabilizers correspond to smaller orbits. Also, while the following theorem is stated for the product of homogeneous G-sets, it trivially extends to products of G-sets with multiple orbits.

Theorem 5.1. Let $[G_i\backslash G]$ and $[G_j\backslash G]$ be transitive G-sets, with $\{e\} < G_i, G_j < G$. Their product G-set decomposes into orbits $[G_i\backslash G] \times [G_j\backslash G] = \bigsqcup_{\ell} \delta^{\ell}_{ij}\, [G_\ell\backslash G]$, such that: (i) $[G_\ell] \leq [G_i], [G_j]$ for all the resulting orbits; (ii) if $G_j \not\leq \mathrm{Core}_G(G_i)$ and $G_i \not\leq \mathrm{Core}_G(G_j)$, then $[G_\ell] < [G_i], [G_j]$ for at least one of the resulting orbits.

Proof. The proof is by analysis of the table of marks $M_G$. The vector of marks of the product space is the elementwise product of the input vectors of marks: $m_{[G_i\backslash G] \times [G_j\backslash G]} = m_{[G_i\backslash G]} \odot m_{[G_j\backslash G]}$. The same vector can be written as a linear combination of the rows of $M_G$ with non-negative integer coefficients: $m_{[G_i\backslash G]} \odot m_{[G_j\backslash G]} = \sum_{\ell} \delta^{\ell}_{ij}\, m_{[G_\ell\backslash G]}$. For convenience, we assume a topological ordering of the conjugacy classes of subgroups, $\{e\} = G_1, \dots, G_i, \dots, G_I = G$, consistent with their partial order, that is, $[G_i] < [G_j] \Rightarrow i < j$. This means that $M_G$ is lower-triangular, with non-zero diagonal; see Table 1. Three important properties of this table are (Pfeiffer, 1997): (1) the sparsity pattern in $M_G$ reflects the subgroup relation: $m_{[G_i\backslash G]}(\ell) > 0$ iff $[G_\ell] \leq [G_i]$; (2) the first column is the index of $G_i$ in G: $m_{[G_i\backslash G]}(1) = |G : G_i|$ for all i; (3) the diagonal element is the index of the normalizer: $m_{[G_i\backslash G]}(i) = |G : N_G(G_i)|$ for all i, where the normalizer of H in G is defined as the largest intermediate subgroup of G in which H is normal: $N_G(H) = \{g \in G \mid gHg^{-1} = H\}$.

(i) From (1) it follows that the non-zeros of the product, $(m_{[G_i\backslash G]} \odot m_{[G_j\backslash G]})(\ell) > 0$, correspond to $[G_\ell] \leq [G_i]$ and $[G_\ell] \leq [G_j]$. Since the only rows of $M_G$ with such non-zero patterns are $m_{[G_\ell\backslash G]}$ with $[G_\ell] \leq [G_i], [G_j]$, all the resulting orbits have such stabilizers. This finishes the proof of the first claim.

(ii) If $[G_i] \not\leq [G_j]$ and $[G_j] \not\leq [G_i]$, then any $[G_\ell]$, which is (up to conjugacy) a subgroup of both groups, is strictly smaller than both, which means one of the resulting orbits must be larger than both input orbits. Next, w.l.o.g., assume $[G_i] \leq [G_j]$, and consider a proof by contradiction: suppose the product does not have a strictly larger orbit. It then follows that $m_{[G_i\backslash G]} \odot m_{[G_j\backslash G]} = \delta^{i}_{ij}\, m_{[G_i\backslash G]}$ for some $\delta^{i}_{ij} > 0$. Consider the first and the i-th elements of the elementwise product above:

$|G : G_i|\ |G : G_j| = \delta^{i}_{ij}\ |G : G_i|$
$m_{[G_j\backslash G]}(i)\ |G : N_G(G_i)| = \delta^{i}_{ij}\ |G : N_G(G_i)|.$

Substituting $\delta^{i}_{ij} = |G : G_j|$ from the first equation into the second and simplifying, we get $m_{[G_j\backslash G]}(i) = |G : G_j|$. This means that the action of $G_i$ on $[G_j\backslash G]$ fixes all points, and therefore $G_i \leq \mathrm{Core}_G(G_j)$ as defined in (21). This contradicts the assumption of (ii). ∎

Table 2. The table of marks of the alternating group $A_5$ (blank entries are zero).

              {e}  C2  C3  K4  C5  S3  D10  A4  A5
{e}\A5         60
C2\A5          30   2
C3\A5          20       2
K4\A5          15   3       3
C5\A5          12               2
S3\A5          10   2   1           1
D10\A5          6   2           1        1
A4\A5           5   1   2   1                1
A5\A5           1   1   1   1   1   1    1   1   1
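Theorem 5.1 and the decompositions used in Example 3 below can be verified numerically from Table 2. The sketch below (my own; the matrix hard-codes Table 2) decomposes [A4\A5]^D for D = 2, 3 via elementwise powers of the mark vector and (29), exhibiting the regular orbit at D = 3.

```python
# Decomposing powers of [A4\A5] with the table of marks of A5 (Table 2).
import numpy as np

names = ["{e}", "C2", "C3", "K4", "C5", "S3", "D10", "A4", "A5"]
M = np.array([
    [60, 0, 0, 0, 0, 0, 0, 0, 0],   # {e}\A5
    [30, 2, 0, 0, 0, 0, 0, 0, 0],   # C2\A5
    [20, 0, 2, 0, 0, 0, 0, 0, 0],   # C3\A5
    [15, 3, 0, 3, 0, 0, 0, 0, 0],   # K4\A5
    [12, 0, 0, 0, 2, 0, 0, 0, 0],   # C5\A5
    [10, 2, 1, 0, 0, 1, 0, 0, 0],   # S3\A5
    [ 6, 2, 0, 0, 1, 0, 1, 0, 0],   # D10\A5
    [ 5, 1, 2, 1, 0, 0, 0, 1, 0],   # A4\A5
    [ 1, 1, 1, 1, 1, 1, 1, 1, 1],   # A5\A5
])

m = M[names.index("A4")]             # marks of the 5-point set [A4\A5]
for D in (2, 3):
    delta = np.rint((m ** D) @ np.linalg.inv(M)).astype(int)
    print(f"[A4\\A5]^{D} =", {n: int(d) for n, d in zip(names, delta) if d})
# D = 2 -> {'C3': 1, 'A4': 1};  D = 3 -> {'{e}': 1, 'C3': 3, 'A4': 1}:
# D = 3 already contains the regular orbit [{e}\A5] (cf. Theorem 3.2).
```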
A sufficient condition for (ii) in Theorem 5.1 is for the G-action on the input G-sets to be faithful; note that in this case the cores are trivial (see Section 4.1). An implication of this theorem is that the repeated self-product $[H\backslash G]^D$ is bound to produce a regular orbit. This leads to Theorem 3.3, which we saw earlier. Here, we give a shorter proof using Theorem 5.1; see Fig. 2.

Alternative Proof of Theorem 3.3. Since G acts faithfully on N, $\mathrm{Core}_G(H) = \{e\}$. From Theorem 5.1 it follows that each time we take a product with N, a strictly smaller stabilizer is produced, so that $H = H^{(0)} > H^{(1)} > \dots > H^{(D)} = \{e\}$, where $H^{(d)}$ is the smallest stabilizer at step d. By Lagrange's theorem, the size of a proper subgroup is at most half the size of its overgroup in this sequence of stabilizers. It follows that for any $D \geq \log_2|H|$, $[H\backslash G]^D$ has an orbit with $H^{(D)} = \{e\}$ as its stabilizer. ∎

Example 3 (UNIVERSAL APPROXIMATION FOR $A_5$). The alternating group $A_5$ is the group of even permutations of 5 objects. One way to create a universal approximator for this group is to use a regular hidden layer (see Theorem 3.2). A more convenient alternative is to consider the canonical action of this group on a set N of size 5, and use an order-D layer to ensure universality. Using Corollary 3 we get $D \geq \lceil (5 - \tfrac{1}{2})\log_2(4) - 3\log_2(e) \rceil = 5$. The natural action of $A_5$ on $N = [5]$ is isomorphic to $[A_4\backslash A_5]$, i.e., $A_4$ is a stabilizer. Using this stabilizer in Theorem 3.3, we get the bound $D \geq \lceil \log_2(|A_4|) \rceil = 4$. However, using the table of marks we can show that $D = 3$ already produces a regular orbit. The table of marks of the alternating group $A_5$ is shown in Table 2. Our objective is to find the decomposition of $[A_4\backslash A_5]^3$. We do this in steps, first showing

$[A_4\backslash A_5]^2 = [A_4\backslash A_5] \sqcup [C_3\backslash A_5].$  (30)

To see this, note that the elementwise product of the vector of marks $m_{[A_4\backslash A_5]}$ (the next-to-last row in Table 2) with itself is equal to $m_{[A_4\backslash A_5]} + m_{[C_3\backslash A_5]}$. Since the vector of marks is an injective homomorphism, this implies (30). Applying the same idea one more time gives

$[A_4\backslash A_5]^3 = \big([A_4\backslash A_5] \sqcup [C_3\backslash A_5]\big) \times [A_4\backslash A_5] = [A_4\backslash A_5] \sqcup 3\,[C_3\backslash A_5] \sqcup [\{e\}\backslash A_5].$

This shows that $[A_4\backslash A_5]^3$ contains a regular orbit, $[\{e\}\backslash A_5]$. Therefore, an order $D = 3$ hidden layer $N^3$, on which $A_5$ acts by even permutations, also produces a universal equivariant (invariant) approximator.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. In particular, the first proof of Theorem 3.3, as well as clarifications on the proofs of the main theorems, were proposed by reviewers. This research is in part funded by the Canada CIFAR AI Chair Program.

References

Albooyeh, M., Bertolini, D., and Ravanbakhsh, S. Incidence networks for geometric deep learning. arXiv preprint arXiv:1905.11460, 2019.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872-1886, 2013.

Burnside, W. Theory of Groups of Finite Order. Cambridge University Press, 1911.
Cohen, T. S. and Welling, M. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Cohen, T. S., Geiger, M., and Weiler, M. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142-9153, 2019a.

Cohen, T. S., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019b.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

Dieck, T. T. Transformation Groups and Representation Theory, volume 766. Springer, 2006.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.

Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989.

Gens, R. and Domingos, P. M. Deep symmetry networks. In Advances in Neural Information Processing Systems, pp. 2537-2545, 2014.

Graham, D. and Ravanbakhsh, S. Deep models for relational databases. arXiv preprint arXiv:1903.09033, 2019.

Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. In Proceedings of the 35th International Conference on Machine Learning, pp. 1909-1918, 2018.

Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pp. 44-51. Springer, 2011.

Hinton, G. E., Sabour, S., and Frosst, N. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.

Hornik, K., Stinchcombe, M., White, H., et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017-2025, 2015.

Keriven, N. and Peyré, G. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems, pp. 7090-7099, 2019.

Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Lenssen, J. E., Fey, M., and Libuschewski, P. Group equivariant capsule networks. arXiv preprint arXiv:1806.05086, 2018.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.

Maron, H., Litany, O., Chechik, G., and Fetaya, E. On learning sets of symmetric elements. arXiv preprint arXiv:2002.08599, 2020.

Minsky, M. and Papert, S. A. Perceptrons: An Introduction to Computational Geometry. MIT Press, 2017.

Pfeiffer, G. The subgroups of M24, or how to compute the table of marks of a finite group. Experimental Mathematics, 6(3):247-270, 1997.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652-660, 2017.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Deep learning with sets and point clouds. In International Conference on Learning Representations (ICLR) workshop track, 2017a.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of JMLR: WCP, August 2017b.

Rotman, J. J. An Introduction to the Theory of Groups, volume 148. Springer Science & Business Media, 2012.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856-3866, 2017.

Sannai, A., Takai, Y., and Cordonnier, M. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.

Segol, N. and Lipman, Y. On universal equivariant set networks. arXiv preprint arXiv:1910.02421, 2019.

Shawe-Taylor, J. Building symmetries into feedforward networks. In First IEE International Conference on Artificial Neural Networks (Conf. Publ. No. 313), pp. 158-162. IET, 1989.

Shawe-Taylor, J. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816-826, 1993.

Sturmfels, B. Algorithms in Invariant Theory. Springer Science & Business Media, 2008.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems, pp. 14334-14345, 2019.

Wood, J. Invariant pattern recognition: a review. Pattern Recognition, 29(1):1-17, 1996.

Wood, J. and Shawe-Taylor, J. Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1-2):33-60, 1996.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.

Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, 2017.