# Exponential Separations in Symmetric Neural Networks

Aaron Zweig (Courant Institute of Mathematical Sciences, New York University, az831@nyu.edu)
Joan Bruna (Center for Data Science, New York University, bruna@cims.nyu.edu)

Abstract. In this work we demonstrate a novel separation between symmetric neural network architectures. Specifically, we consider the Relational Network [21] architecture as a natural generalization of the Deep Sets [32] architecture, and study their representational gap. Under the restriction to analytic activation functions, we construct a symmetric function acting on sets of size $N$ with elements in dimension $D$, which can be efficiently approximated by the former architecture, but provably requires width exponential in $N$ and $D$ for the latter.

1 Introduction

The modern success of deep learning can in part be attributed to architectures that enforce appropriate invariances. Invariance to permutation of the input, i.e. treating the input as an unordered set, is a desirable property when learning symmetric functions in fields such as particle physics and population statistics.

The simplest architectures that enforce permutation invariance treat each set element individually, without allowing for interaction, as captured by the popular Deep Sets model [18, 32]. Several architectures explicitly enable interaction between set elements, the simplest being the Relational Networks [21], which encode pairwise interaction. This may be understood as an instance of self-attention, the mechanism underlying Transformers [27], which have emerged as powerful generic neural network architectures for processing a wide variety of data, from image patches to text to physical data. Specifically, Set Transformers [12] are special instantiations of Transformers, made permutation equivariant by omitting the positional encoding of inputs and using self-attention for pooling.

Both the Deep Sets and Relational Network architectures are universal approximators for the class of symmetric functions. But empirical evidence suggests an inherent advantage of symmetric networks using self-attention in synthetic settings [16], on point cloud data [12], and in quantum chemistry [17]. In this work, we formalize this question in terms of approximation power, and explicitly construct symmetric functions which provably require exponentially many neurons in the Deep Sets model, yet are efficiently approximated with self-attention.

This exponential separation bears notable differences from typical separation results. In particular, while the expressive power of a vanilla neural network is characterized by depth and width, the expressiveness of symmetric networks is controlled in particular by the symmetric width. In contrast to depth separations of vanilla neural networks [7], in this work we observe width separations, where the weaker architecture (even with arbitrary depth) requires exponential symmetric width to match the expressive power of the stronger architecture.

Summary of Contributions

In this work:

- We demonstrate a width separation between the Deep Sets and Relational Network architectures, where the former requires symmetric width $L \gg \mathrm{poly}(N, D)$ to approximate a family of analytic symmetric functions, while the latter can approximate with polynomial efficiency. This also answers an open question on high-dimensional Deep Sets representation posed in Wagstaff et al. [30].
- We introduce an extension of the Hall inner product to high dimensions that preserves low-degree orthogonality of multisymmetric powersum polynomials, which may be of independent interest.

2 Setup and Main Result

2.1 Symmetric Architectures

Figure 1: Architectural diagrams for $\mathrm{Sym}_L$ (left: Deep Sets with symmetric width $L$) and $\mathrm{Sym}^2_L$ (right: Relational Network with symmetric width $L$).

To introduce the symmetric architectures, we must first characterize how to treat sets as inputs. We will consider sets of size $N$, where each element of the set is a vector of dimension $D$. In particular, we will represent a set as a matrix $X \in \mathbb{C}^{D \times N}$; each column vector $x_n \in \mathbb{C}^D$ is an element of the set. Note that we consider complex-valued inputs because the natural inner product over symmetric polynomials integrates over the complex unit circle; see Macdonald [14] or Theorem 4.3.

A function $f : \mathbb{C}^{D \times N} \to \mathbb{C}$ is symmetric if $f(X) = f(X\Pi)$ for any permutation matrix $\Pi \in \mathbb{R}^{N \times N}$, i.e. if $f$ is invariant to permuting the columns of $X$. In other words, a symmetric function treats the input $X$ as an unordered set of column vectors. Given the symmetric width parameter $L$, we consider two primary symmetric architectures:

Definition 2.1. Let $\mathrm{Sym}_L$ denote the class of singleton symmetric networks with symmetric width $L$, i.e. functions $f$ of the form
$$f(X) = \rho\big(\varphi_1(X), \dots, \varphi_L(X)\big), \tag{1}$$
$$\varphi_l(X) = \sum_{n=1}^{N} \psi_l(x_n), \tag{2}$$
where $\{\psi_l : \mathbb{C}^D \to \mathbb{C}\}_{l=1}^{L}$ and $\rho : \mathbb{C}^L \to \mathbb{C}$ are arbitrary neural networks with analytic activations.

The class $\mathrm{Sym}_L$ is exactly the architecture of Deep Sets [32] restricted to analytic activations. However, we introduce this notation to differentiate this class from the more expressive architectures that allow for pairwise interaction among set elements. From the theory of symmetric polynomials, if $L \ge L^* := \binom{N+D}{N} - 1$, then $\mathrm{Sym}_L$ is a universal approximator for any analytic symmetric function [19]. Therefore we will primarily be interested in the expressive power of $\mathrm{Sym}_L$ for $L < L^*$.

Definition 2.2. Let $\mathrm{Sym}^2_L$ denote the class of pairwise symmetric networks with symmetric width $L$, i.e. functions $f$ of the form
$$f(X) = \rho\big(\varphi_1(X), \dots, \varphi_L(X)\big), \tag{3}$$
$$\varphi_l(X) = \sum_{n,n'=1}^{N} \psi_l(x_n, x_{n'}), \tag{4}$$
where $\{\psi_l : \mathbb{C}^D \times \mathbb{C}^D \to \mathbb{C}\}_{l=1}^{L}$ and $\rho : \mathbb{C}^L \to \mathbb{C}$ are arbitrary neural networks with analytic activations.

Similarly, the class $\mathrm{Sym}^2_L$ is exactly the architecture of Relational Pooling [21] with analytic activations. We note this architecture is also equivalent to the 2-ary instantiation of Janossy Pooling [16].

2.2 Main Result

Our main result demonstrates an exponential separation, where $\mathrm{Sym}_L$ requires exponentially large symmetric width $L$ to match the expressive power of the class $\mathrm{Sym}^2_L$ with $L = 1$. We choose norms to make this separation as prominent as possible: there is a hard function that can be approximated in $\mathrm{Sym}^2_L$ in the infinity norm, but cannot be approximated in $\mathrm{Sym}_L$ even in an appropriately chosen $L^2$ norm with respect to some non-trivial data distribution. We require one activation assumption to realize the $\mathrm{Sym}^2_L$ approximation:

Assumption 2.3. The activation $\sigma : \mathbb{C} \to \mathbb{C}$ is analytic, and for fixed $D, N$ there exist two-layer neural networks $f_1, f_2$ using $\sigma$, both with $O\!\big(D^2 + D \log\frac{D}{\epsilon}\big)$ width and $O(D \log D)$ bounded weights, such that
$$\sup_{|\xi| \le 3} \big|f_1(\xi) - \xi^2\big| \le \epsilon, \qquad \sup_{|\xi| \le 3} \Big|f_2(\xi) - \frac{1 - (\xi/4)^{\min(D,\,\dots)}}{1 - \xi/4}\Big| \le \epsilon.$$

Essentially this assumption guarantees that networks built with the analytic activation $\sigma$ are able to efficiently approximate the map $\xi \mapsto \xi^2$, and a truncated form of the finite Blaschke product [8] with a zero at $\xi = 1/4$ and pole at $\xi = 4$. We show in Lemma G.3 that the $\exp$ activation satisfies this assumption.
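To make Definitions 2.1 and 2.2 concrete, the following minimal numpy sketch instantiates both classes with a simple analytic choice (the $\exp$ activation) for the inner networks. The widths, weights, and the particular pairwise map used here are illustrative placeholders rather than the parameterizations appearing in our proofs; the checks at the end only verify permutation invariance of the two architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 5, 3, 4            # set size, element dimension, symmetric width (illustrative)

# Random two-layer "networks" with the analytic exp activation; the widths and
# weights here are placeholders, not the parameterizations used in the proofs.
W = rng.standard_normal((L, 8, D)) + 1j * rng.standard_normal((L, 8, D))
v = rng.standard_normal((L, 8)) + 1j * rng.standard_normal((L, 8))

def psi_single(l, x):                     # psi_l : C^D -> C
    return v[l] @ np.exp(W[l] @ x)

def psi_pair(l, x, y):                    # psi_l : C^D x C^D -> C (toy pairwise interaction)
    return v[l] @ np.exp(W[l] @ (x * np.conj(y)))

def rho(z):                               # rho : C^L -> C, analytic
    return np.sum(z ** 2)

def sym_L(X):
    """Deep Sets / Sym_L: sum psi_l over single elements, then apply rho."""
    phi = np.array([sum(psi_single(l, X[:, n]) for n in range(X.shape[1]))
                    for l in range(L)])
    return rho(phi)

def sym2_L(X):
    """Relational Network / Sym^2_L: sum psi_l over all ordered pairs, then rho."""
    M = X.shape[1]
    phi = np.array([sum(psi_pair(l, X[:, n], X[:, m])
                        for n in range(M) for m in range(M))
                    for l in range(L)])
    return rho(phi)

X = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(D, N)))   # columns on the unit circle
perm = rng.permutation(N)
print(np.isclose(sym_L(X), sym_L(X[:, perm])))     # True: invariant to column permutation
print(np.isclose(sym2_L(X), sym2_L(X[:, perm])))   # True: invariant to column permutation
```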
Theorem 2.4 (Exponential width-separation). Fix $N$ and $D > 1$, and a non-trivial data distribution $\mu$ on $D \cdot N$ copies of the unit complex circle, $(S^1)^{D \times N}$. Then there exists an analytic symmetric function $g : \mathbb{C}^{D \times N} \to \mathbb{C}$ such that $\|g\|_{L^2(\mu)} = 1$ and:

- For $L \le N^{-2} \exp\!\big(O(\min(D, \dots))\big)$:
$$\min_{f \in \mathrm{Sym}_L} \|f - g\|^2_{L^2(\mu)} \ge \frac{1}{\dots}$$
- There exists $f \in \mathrm{Sym}^2_L$ with $L = 1$, parameterized with an activation $\sigma$ that satisfies Assumption 2.3, with width $\mathrm{poly}(N, D, 1/\epsilon)$, depth $O(\log D)$, and max weight $O(D \log D)$, such that over $(S^1)^{D \times N}$:
$$\sup_{X \in (S^1)^{D \times N}} |f(X) - g(X)| \le \epsilon.$$

Remark 1. The lower bound is completely independent of the width and depth of the parameterized networks $\{\psi_l\}$ and $\rho$. The only parameter the theorem restricts is the symmetric width $L$. This is in sharp contrast to the separations of vanilla networks [7], where there is a natural trade-off between width and depth.

Remark 2. In the upper bound, we consider the network $f \in \mathrm{Sym}^2_L$ to have width and depth in the usual sense of vanilla neural networks, where the parameterized maps $\{\psi_l\}$ and $\rho$ obey the width, depth, and weight bounds given.

3 Related Work

3.1 Depth Separation

Numerous works have studied the difference in expressive power between different neural network architectures. Many of these works center on the representational gap between two-layer and three-layer networks [4, 7]. In particular, recent works have focused on generalizing the family of functions that realize these separations, to various radial functions [20] and non-radial functions [28]. A separate line of work considers separations between networks when the depth varies polynomially [24]. Notably, Vardi, Yehudai, and Shamir [26] demonstrate that depth has a greater impact on expressivity than width in the case of vanilla neural networks.

3.2 Symmetric Architectures

We primarily consider the symmetric neural network parameterization introduced in Deep Sets [32], with PointNet [18] a similar symmetric parameterization using a different pooling function. Simple linear equivariant layers were also introduced in Zaheer et al. [32]. In the context of relationships between objects in an image, the first symmetric architecture enabling explicit pairwise interaction was introduced in Santoro et al. [21]. More complicated symmetric architectures, allowing for higher-order interaction and more substantial equivariant layers, were built on top of attention primitives [12, 13]. The notion of explicit high-order interactions between set elements before symmetrizing is formalized in the architecture of Janossy pooling [16]. Symmetric architectures are generalized by graph neural networks [10, 22], under the restriction to the complete graph.

3.3 Symmetric Network Expressivity

The dependence of representational power on the symmetric width parameter $L$ was first demonstrated in the $D = 1$ case. Under the strong condition $L < N$, it was proven that there are symmetric functions which cannot be exactly represented by a Deep Sets network [29]; this was later strengthened to functions which cannot be approximated in the infinity norm to arbitrary precision [30]. The work introducing Janossy pooling [16] also includes a theoretical result showing that singleton networks cannot exactly represent some particular pairwise symmetric network. Crucially, however, this result is restricted to a simplified, non-universal symmetric architecture excluding the $\rho$ transformation, and therefore does not characterize the real-world architectures given above.
The question of expressiveness in symmetric networks may also be generalized to graph neural networks, with a focus on distinguishing non-isomorphic graphs as compared to the Weisfeiler-Lehman test [31] and on calculating invariants such as substructure counts [3]. In particular, one may understand expressiveness in symmetric networks incorporating pairwise interaction as the ability to learn functions of the complete graph decorated with edge features.

3.4 Symmetric Polynomial Theory

Our proofs rely on the technical machinery of symmetric polynomial theory, thoroughly characterized in Macdonald [14]. In particular, we utilize the integral representation of the finite-variable Hall inner product as introduced in Section A. Because this integral is defined over the complex unit circle, we consequently consider complex-valued neural networks [1]. The connection of symmetric networks to the powersum polynomials was first observed in Zaheer et al. [32], and likewise the multisymmetric powersum polynomials have been applied in higher-dimensional symmetric problems [15, 23]. The algebraic properties of the multisymmetric powersum polynomials are well studied, for example as a basis of higher-dimensional symmetric polynomials [19] and through their algebraic dependencies [6]. However, to the best of our knowledge this is the first work to apply the Hall inner product to symmetric neural networks, and to extend this inner product to yield low-degree orthogonality over the multisymmetric polynomials.

4 Warmup: One-dimensional Set Elements

To begin, we consider the simpler case where $D = 1$, i.e. where we learn a symmetric function acting on a set of scalars. It was already observed in Zaheer et al. [32] that the universality of Deep Sets can be demonstrated by approximating the network with symmetric polynomials. We first demonstrate that, through this approximation, we can relate the symmetric width $L$ to expressive power.

4.1 Symmetric Polynomials

In order to approximate symmetric networks by symmetric polynomials, we choose a suitable basis. The powersum polynomials serve as the natural choice, as their structure matches that of a singleton symmetric network, and they obey the very convenient orthogonality properties that we detail below.

Definition 4.1. For $k \in \mathbb{N}$ and $x \in \mathbb{C}^N$, the normalized powersum polynomial is defined as
$$p_k(x) = \frac{1}{\sqrt{k}} \sum_{n=1}^{N} x_n^k,$$
with $p_0(x) = 1$.

A classical result in symmetric polynomial theory is the existence of an $L^2$ inner product that grants orthogonality for products of powersums. To make this notion explicit and keep track of products, we index products with partitions.

Definition 4.2. An integer partition $\lambda$ is a non-increasing, finite sequence of positive integers $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k$. The weight of the partition is $|\lambda| = \sum_{i=1}^{k} \lambda_i$, and the length $l(\lambda)$ is the number of terms in the sequence. We then characterize a product of powersums by
$$p_\lambda(x) = \prod_i p_{\lambda_i}(x). \tag{8}$$
This notation intentionally also allows for the empty partition, so that if $\lambda = \varnothing$ then $p_\lambda = 1$.

All together, we can now state the following remarkable fact:

Theorem 4.3 ([14, Chapter VI (9.10)]). There exists an $L^2(d\nu)$ inner product (for some probability measure $\nu$) such that, for partitions $\lambda, \mu$ with $|\lambda| \le N$,
$$\langle p_\lambda, p_\mu \rangle_V = z_\lambda \mathbf{1}_{\lambda = \mu}, \tag{9}$$
where $z_\lambda$ is a combinatorial constant. We index this inner product with $V$ because it is written as an expectation with respect to a density proportional to the squared Vandermonde polynomial (see Section A for the precise definition).
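The following Monte Carlo sketch illustrates Theorem 4.3 numerically, under two assumptions that go beyond what is reproduced here: that the measure $\nu$ is the squared-Vandermonde reweighting of the uniform distribution on the $N$-torus (the precise definition lives in Section A), and that the normalization of Definition 4.1 is $p_k(x) = \frac{1}{\sqrt{k}}\sum_n x_n^k$ as written above. Under those conventions, distinct low-weight partitions come out approximately orthogonal, and the diagonal constant for $\lambda = (2,1)$ comes out to approximately $1$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, M = 4, 200_000            # number of variables (set size) and Monte Carlo samples

def p(k, x):
    # normalized powersum as assumed here: p_k(x) = (1 / sqrt(k)) * sum_n x_n^k
    return (x ** k).sum(axis=-1) / np.sqrt(k)

# Sample uniformly on the N-torus and reweight by the squared Vandermonde,
# i.e. self-normalized importance sampling for a density nu proportional to |V(x)|^2.
x = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(M, N)))
vand = np.prod([x[:, i] - x[:, j] for i, j in combinations(range(N), 2)], axis=0)
w = np.abs(vand) ** 2
w /= w.sum()

def inner(f, g):
    # <f, g>_V is approximated by the weighted average of f(x) * conj(g(x))
    return np.sum(w * f * np.conj(g))

p2p1 = p(2, x) * p(1, x)     # p_lambda for lambda = (2, 1), weight 3 <= N
p3 = p(3, x)                 # p_mu for mu = (3): same weight, different partition
print(abs(inner(p2p1, p3)))            # approximately 0: distinct partitions
print(inner(p2p1, p2p1).real)          # approximately 1 under this normalization
```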
This inner product may also be considered the finite-variable specialization of the Hall inner product, defined on symmetric polynomials over infinitely many variables [14, Chapter I (4.5)]. It is easy to check that the degree of $p_\lambda$ equals $|\lambda|$, so the theorem states that the terms $p_\lambda$ are "almost" an orthogonal basis, except for possible correlation between two high-degree terms. Let us remark that we assume analytic activations for the sake of this theorem, as the orthogonality property does not hold for symmetric polynomials with negative exponents. In exchange for that assumption, however, we can apply this very powerful inner product, which ultimately results in the irrelevance of network depth.

4.2 Projection Lemma

Before we can proceed to prove a representational lower bound, we need one tool to better understand $f \in \mathrm{Sym}_L$. Utilizing the orthogonality properties of the inner product $\langle \cdot, \cdot \rangle_V$ allows us to project any $f \in \mathrm{Sym}_L$ to a simplified form, while keeping a straightforward dependence on $L$.

For example, consider some uniformly convergent power series (with no constant term) $\varphi(x) = \sum_{k=1}^{\infty} c_k p_k(x)$. We claim $\langle p_2 p_1, \varphi^3 \rangle_V = 0$. Indeed, expanding $\varphi^3$ one exclusively gets terms of the form $p_{k_1} p_{k_2} p_{k_3}$, and because the partition $\{k_1, k_2, k_3\}$ has a different length than $\{2, 1\}$, the two partitions are clearly distinct, so by orthogonality $\langle p_2 p_1, p_{k_1} p_{k_2} p_{k_3} \rangle_V = 0$.

Motivated by this observation, we can project $f$ so that it only contains products of two terms. Let us introduce $P_1$ as the orthogonal projection onto $\mathrm{span}(\{p_t : 1 \le t \le N/2\})$, and $P_2$ as the orthogonal projection onto $\mathrm{span}(\{p_t p_{t'} : 1 \le t, t' \le N/2\})$.

Lemma 4.4. Given any $f \in \mathrm{Sym}_L$, we may choose coefficients $v_{ij}$ over $i \le j \le L$, and symmetric polynomials $\varphi_i$ over $i \le L$, such that
$$P_2 f = \sum_{i \le j \le L} v_{ij}\,(P_1 \varphi_i)(P_1 \varphi_j). \tag{10}$$

4.3 Rank Lemma

Given the reduced form of $f$ above, we may now go about lower bounding its approximation error to a given function $g$. By the properties of orthogonal projection, we have $\|f - g\|^2_V \ge \|P_2(f - g)\|^2_V$. And by Parseval's theorem, the approximation error $\|P_2 f - P_2 g\|^2_V$ equals
$$\sum_{t \le t'} \left| \Big\langle P_2 f, \frac{p_t p_{t'}}{\|p_t p_{t'}\|_V} \Big\rangle_V - \Big\langle P_2 g, \frac{p_t p_{t'}}{\|p_t p_{t'}\|_V} \Big\rangle_V \right|^2.$$
Rearranging the orthogonal coefficients in the form of matrices, we have the following fact:

Lemma 4.5. Given any $f \in \mathrm{Sym}_L$, and $g$ such that $P_2 g = g$, we have the bound
$$\|P_2 f - P_2 g\|^2_V \ge \frac{1}{2} \|F - G\|^2_F, \tag{11}$$
where $F, G \in \mathbb{C}^{N/2 \times N/2}$ are matrices with entries $F_{tt'} = \langle P_2 f, p_t p_{t'} \rangle_V$ and $G_{tt'} = \langle P_2 g, p_t p_{t'} \rangle_V$. Furthermore, $F$ has rank at most $L$.

The significance of this lemma is the rank constraint: it implies that choosing symmetric width $L$ corresponds to a maximum rank $L$ on the matrix $F$. From here, we can use standard arguments about low-rank approximation in the Frobenius norm to yield a lower bound.

4.4 Separation in the One-dimensional Case

Our main goal in this section is to construct a hard symmetric function $g$ that cannot be efficiently approximated by $\mathrm{Sym}_L$ for $L \le N/4$. It is not particularly expensive for the symmetric width $L$ to scale linearly with the set size $N$; however, we will use the same proof structure to prove Theorem 2.4, which will require $L$ to scale exponentially.

Theorem 4.6. For $D = 1$:
$$\max_{\|g\|_V = 1} \; \min_{f \in \mathrm{Sym}_L} \|f - g\|^2_V \ge 1 - \frac{2L}{N}. \tag{12}$$
In particular, for $L = N/4$ we recover a constant lower bound of $1/2$.

Proof (sketch). Choose $g$ such that $P_2 g = g$. Then, because $P_2$ is an orthogonal projection and applying Lemma 4.5,
$$\min_{f \in \mathrm{Sym}_L} \|f - g\|^2_V \ge \min_{f \in \mathrm{Sym}_L} \|P_2 f - P_2 g\|^2_V \tag{13}$$
$$\ge \frac{1}{2} \min_{\mathrm{rank}(F) \le L} \|F - G\|^2_F. \tag{14}$$
We note that $\|p_t p_t\|^2_V = z_{(t,t)} = 2$, so the choice $g = \frac{1}{\sqrt{N}} \sum_{t=1}^{N/2} p_t p_t$ can be seen to obey $\|g\|_V = 1$, and implies that $G$ is the scaled identity matrix $\frac{2}{\sqrt{N}} I \in \mathbb{C}^{N/2 \times N/2}$.
Then by standard properties of the SVD,
$$\min_{f \in \mathrm{Sym}_L} \|f - g\|^2_V \ge \frac{1}{2} \min_{\mathrm{rank}(F) \le L} \Big\|F - \frac{2}{\sqrt{N}} I\Big\|^2_F \tag{15}$$
$$= \frac{2}{N} \min_{\mathrm{rank}(F) \le L} \|F - I\|^2_F \tag{16}$$
$$= \frac{2}{N}\Big(\frac{N}{2} - L\Big). \tag{17}$$

5 Proof Sketch of Main Result

5.1 Challenges for High-dimensional Set Elements

We would like to strengthen this separation in several ways:

- Generalize to the $D > 1$ case;
- Realize a separation where the symmetric width $L$ must scale exponentially in $N$ and $D$, showing that $\mathrm{Sym}_L$ is infeasible;
- Show that the hard function $g$ can nevertheless be efficiently approximated in $\mathrm{Sym}^2_L$ for $L$ polynomial in $N$ and $D$.

First, in order to approximate via polynomials in the high-dimensional case, we will require the high-dimensional analogue of the powersum polynomials:

Definition 5.1. For a multi-index $\alpha \in \mathbb{N}^D$, the normalized multisymmetric powersum polynomial is defined as
$$p_\alpha(X) = \frac{1}{\sqrt{|\alpha|}} \sum_{n=1}^{N} \prod_{d=1}^{D} x_{dn}^{\alpha_d}. \tag{19}$$

So the plan is to find a high-dimensional analogue of Lemma 4.4 and Lemma 4.5, now using multisymmetric powersum polynomials, mimic the proof of the $D = 1$ case, and then additionally show that the hard function $g$ is efficiently computable in the pairwise symmetric architecture. Note that because the algebraic basis of multisymmetric powersum polynomials has size $L^* = \binom{N+D}{N} - 1$ (we subtract one to discount the constant polynomial), we can expect an exponential separation when we apply a similar rank argument.

5.2 Sketch of Main Result (lower bound)

Because we are in high dimensions, we cannot simply apply the restricted Hall inner product introduced in Theorem 4.3. To the best of our knowledge, there is no standard generalization of the Hall inner product to multisymmetric polynomials that preserves the orthogonality property. As the main technical ingredient in the high-dimensional case, we introduce a novel generalization, which builds on two inner products.

First, we introduce a new input distribution $\nu$ over set inputs $X \in \mathbb{C}^{D \times N}$, and induce an $L^2$ inner product
$$\langle f, g \rangle_A = \mathbb{E}_{X \sim \nu}\big[ f(X) \overline{g(X)} \big]. \tag{20}$$
We use this inner product to measure the approximation error of $\mathrm{Sym}_L$. That is, we seek a lower bound on $\min_{f \in \mathrm{Sym}_L} \|f - g\|_A$ for a suitable choice of hard function $g$.

We can now apply an analogue of Lemma 4.4 to project $f$ to a simplified form. But we cannot immediately apply an analogue of Lemma 4.5, as it relied on Parseval's theorem, and the low-degree multisymmetric powersum polynomials are not orthogonal in this inner product. Put another way, if we represent $\langle \cdot, \cdot \rangle_A$ as a matrix in the basis of low-degree multisymmetric powersums, it will be positive-definite but include some off-diagonal terms.

The idea is to introduce a new inner product with a different input distribution $\nu_0$,
$$\langle f, g \rangle_{A_0} = \mathbb{E}_{X \sim \nu_0}\big[ f(X) \overline{g(X)} \big], \tag{21}$$
and define the bilinear form
$$\langle f, g \rangle = \langle f, g \rangle_A - 2 \langle f, g \rangle_{A_0}. \tag{22}$$
Typically positive-definiteness is lost when subtracting two inner products, but we prove that $\langle \cdot, \cdot \rangle$ is an inner product when restricted to a particular subspace of symmetric polynomials (see Theorem D.3). Furthermore, the careful choice of $\nu$ and $\nu_0$ cancels the off-diagonal correlation between different multisymmetric powersums, so they are orthogonal under this new inner product $\langle \cdot, \cdot \rangle$. By the norm domination $\|\cdot\| \le \|\cdot\|_A$, we are able to pass from the former $L^2$ norm to the latter norm that obeys orthogonality, and apply an analogue of the Rank Lemma 4.5.

Thus we derive a lower bound using any hard function $g$ whose corresponding matrix $G$ (built from the orthogonal coefficients) is diagonal and high-rank. And because the total number of polynomials is $L^*$, the rank argument now yields an exponential separation.
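Both the $D = 1$ computation in equations (15)-(17) and its high-dimensional analogue ultimately reduce to the Eckart-Young theorem: the best rank-$L$ Frobenius approximation of a scaled identity $c I \in \mathbb{C}^{M \times M}$ keeps $L$ of its singular values and pays $c^2$ for each of the remaining $M - L$. The short numerical sketch below checks this for the $D = 1$ construction, with the constant $2/\sqrt{N}$ as reconstructed above and the illustrative choice $L = N/4$; it illustrates the rank argument only, not the full proof.

```python
import numpy as np

N = 40                       # set size; G below is (N/2) x (N/2)
M, L = N // 2, N // 4        # matrix size and symmetric width (illustrative choice)

c = 2 / np.sqrt(N)           # G = (2 / sqrt(N)) * I, as in the D = 1 construction
G = c * np.eye(M)

# Eckart-Young: the best rank-L approximation of G in Frobenius norm keeps its
# L largest singular values, so the squared error is c^2 * (M - L) = 2 - 4L/N.
U, s, Vt = np.linalg.svd(G)
best_rank_L = (U[:, :L] * s[:L]) @ Vt[:L, :]
err = np.linalg.norm(G - best_rank_L, "fro") ** 2

print(np.isclose(err, c ** 2 * (M - L)))   # True
print(0.5 * err, 1 - 2 * L / N)            # Theorem 4.6 lower bound: both 0.5 here
```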
Based on this proof, we have much freedom in our choice of $g$. By choosing its coefficients in the basis of multisymmetric powersum polynomials, it is easy to enforce the conditions that $G$ be diagonal and high-rank for a variety of possible functions. However, ensuring that $g$ is not pathological (i.e. that it is bounded and Lipschitz), and that it can be efficiently approximated in $\mathrm{Sym}^2_L$, requires a more careful choice.

5.3 Sketch of Main Result (upper bound)

It remains to approximate the hard function $g$ with a network from $\mathrm{Sym}^2_L$. First we must choose $g$ in particular. Based on the lower bound proof, the desideratum for $g$ is that it be supported exclusively on terms of the form $p_\alpha \overline{p_\alpha}$ over many values of $\alpha$, as this induces a diagonal and high-rank matrix $G$ in an analogue of Lemma 4.5. Furthermore, by simple algebra one can confirm that
$$p_\alpha(X)\,\overline{p_\alpha(X)} = \frac{1}{|\alpha|} \sum_{n,n'} \prod_{d=1}^{D} \big(x_{dn} \overline{x_{dn'}}\big)^{\alpha_d},$$
so a $g$ supported on these polynomials can clearly be written in the form of a network in $\mathrm{Sym}^2_L$. This structure of $g$ guarantees difficult approximation, and is akin to the radial structure of the hard functions introduced in works on depth separation [7].

We must however be careful in our choice of $g$: for the matrix $G$ to be high-rank, $g$ must be supported on exponentially many powersum polynomials. But this could make $g$ exponentially large, and therefore challenging to approximate efficiently with a network from $\mathrm{Sym}^2_L$.

We handle this difficulty by defining $g$ in a different way. We introduce the finite Blaschke product $\mu(\xi) = \frac{\xi - 1/4}{\xi/4 - 1}$, a function that analytically maps the complex unit circle to itself. Then the choice
$$g(X) = \sum_{n,n'=1}^{N} \prod_{d=1}^{D} \mu\big(x_{dn} \overline{x_{dn'}}\big) \tag{23}$$
ensures that $\|g\|_\infty$, $\|g\|_A$, and $\mathrm{Lip}(g)$ are all polynomial in $N$, $D$, $\frac{1}{\epsilon}$ for approximation error $\epsilon$ (see Lemma E.3). Furthermore, again by simple algebra it is clear that $g$ is only supported on terms of the form $p_\alpha \overline{p_\alpha}$. So it remains to show that the induced diagonal matrix $G$ is effectively high-rank, which follows from expanding the Blaschke products.

Satisfied that this choice of $g$ meets the desiderata for the lower bound and has no pathological behavior, it remains to construct $f \in \mathrm{Sym}^2_L$ for $L = 1$ that approximates $g$. That is, we choose $\psi_1$ and $\rho$ so that $g(X) \approx \rho\big(\sum_{n,n'=1}^{N} \psi_1(x_n, x_{n'})\big)$. Clearly we may take $\rho$ to be the identity, and $\psi_1(x_n, x_{n'})$ to approximate $\prod_{d=1}^{D} \mu\big(x_{dn} \overline{x_{dn'}}\big)$, which is straightforwardly calculated in depth $O(\log D)$ by performing successive multiplications in a binary-tree-like structure (see Theorem F.1).

Ultimately, we use a slight variant of this function for the formal proof. Because the orthogonality of our newly introduced inner product $\langle \cdot, \cdot \rangle$ only holds for low-degree polynomials, we must truncate the high-degree terms of $g$; we confirm in Appendix F that this truncation nevertheless preserves the properties we care about.
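The following sketch instantiates the upper-bound construction of equation (23) directly, ignoring any normalizing prefactor and the high-degree truncation used in the formal proof. The pairwise map $\psi_1$ is computed exactly as the product of Blaschke factors rather than by the depth-$O(\log D)$ network of Theorem F.1, and $\rho$ is the identity, so the $\mathrm{Sym}^2_1$ network reproduces this version of $g$ on the torus by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 6, 4

def mu(xi):
    # Blaschke factor: maps the unit circle to itself, |mu(xi)| = 1 when |xi| = 1
    return (xi - 0.25) / (xi / 4 - 1)

def g(X):
    # Hard function of eq. (23), without the truncation used in the formal proof:
    # sum over ordered pairs (n, n') of prod_d mu(x_{dn} * conj(x_{dn'}))
    M = X.shape[1]
    return sum(np.prod(mu(X[:, n] * np.conj(X[:, m])))
               for n in range(M) for m in range(M))

def psi1(x, y):
    # Pairwise map psi_1 : C^D x C^D -> C, computed exactly here; the paper
    # approximates this product with a depth-O(log D) network (Theorem F.1).
    return np.prod(mu(x * np.conj(y)))

def f_sym2(X):
    # Sym^2_L network with L = 1 and rho = identity
    M = X.shape[1]
    return sum(psi1(X[:, n], X[:, m]) for n in range(M) for m in range(M))

X = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(D, N)))   # a point on the torus
print(np.isclose(g(X), f_sym2(X)))                 # True: the pairwise net realizes g
print(abs(g(X)) <= N * N + 1e-9)                   # each of the N^2 summands has modulus 1
print(np.isclose(g(X), g(X[:, rng.permutation(N)])))   # permutation invariance
```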
6 Discussion

In this work, we have demonstrated how symmetric width captures more of the expressive power of symmetric networks than depth when restricted to analytic activations, by evincing an exponential separation between two of the most common architectures that enforce permutation invariance. The most unusual property of this result is the complete independence of depth, owing to the unique orthogonality properties of the restricted Hall inner product when paired with the assumption of analyticity. This stands in contrast to the case of vanilla neural networks, for which separations beyond small depth would resolve open questions in circuit complexity suspected to be quite hard [25]. Furthermore, the greater dependence on width than on depth is a property unique to symmetric networks, whereas the opposite is true for vanilla networks [26].

A natural extension would be to consider the simple equivariant layers introduced in Zaheer et al. [32], which we suspect will not substantially improve the approximation power of $\mathrm{Sym}_L$. Furthermore, allowing for multiple such equivariant layers, this network becomes exactly akin to a Graph Convolutional Network [10] on a complete graph, whereas $\mathrm{Sym}^2_L$ corresponds to a message passing network [9], as it is capable of interpreting edge features.

6.1 Limitations

The major limitation of this result is the restriction to analytic activations. Although analytic symmetric functions appear crucially in the study of exactly solvable quantum systems [2, 11], this assumption may be overly strict for general problems of learning symmetric functions. We nevertheless conjecture that these bounds will still hold when allowing for non-analytic activations, and consider this an exciting question for future work. Additionally, whether the hard function $g$ can be efficiently learned with gradient descent remains unclear, and future work could address its learnability.

Acknowledgements: This work has been partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF-1845360, and NSF CCF-1814524.

References

[1] Joshua Bassey, Lijun Qian, and Xianfang Li. "A survey of complex-valued neural networks". arXiv preprint arXiv:2101.12249 (2021).
[2] Mathieu Beau and Adolfo del Campo. "Parent Hamiltonians of Jastrow wavefunctions". SciPost Physics Core 4.4 (2021), p. 030.
[3] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. "Can graph neural networks count substructures?" Advances in Neural Information Processing Systems 33 (2020), pp. 10383-10395.
[4] Amit Daniely. "Depth separation for neural networks". Conference on Learning Theory. PMLR, 2017, pp. 690-696.
[5] Persi Diaconis and Mehrdad Shahshahani. "On the eigenvalues of random matrices". Journal of Applied Probability 31.A (1994), pp. 49-62.
[6] Mátyás Domokos. "Vector invariants of a class of pseudo-reflection groups and multisymmetric syzygies". arXiv preprint arXiv:0706.2154 (2007).
[7] Ronen Eldan and Ohad Shamir. "The power of depth for feedforward neural networks". Conference on Learning Theory. PMLR, 2016, pp. 907-940.
[8] John Garnett. Bounded Analytic Functions. Vol. 236. Springer Science & Business Media, 2007.
[9] Justin Gilmer et al. "Neural message passing for quantum chemistry". International Conference on Machine Learning. PMLR, 2017, pp. 1263-1272.
[10] Thomas N. Kipf and Max Welling. "Semi-supervised classification with graph convolutional networks". arXiv preprint arXiv:1609.02907 (2016).
[11] Edwin Langmann. "A method to derive explicit formulas for an elliptic generalization of the Jack polynomials". arXiv preprint math-ph/0511015 (2005).
[12] Juho Lee et al. "Set Transformer: A framework for attention-based permutation-invariant neural networks". International Conference on Machine Learning. PMLR, 2019, pp. 3744-3753.
[13] Chih-Yao Ma et al. "Attend and interact: Higher-order object interactions for video understanding". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6790-6800.
[14] Ian Grant Macdonald. Symmetric Functions and Hall Polynomials. Oxford University Press, 1998.
[15] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. "Provably powerful graph networks". Advances in Neural Information Processing Systems 32 (2019).
[16] Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. "Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs". arXiv preprint arXiv:1811.01900 (2018).
[17] David Pfau, James S. Spencer, Alexander G. D. G. Matthews, and W. Matthew C. Foulkes. "Ab initio solution of the many-electron Schrödinger equation with deep neural networks". Physical Review Research 2.3 (2020), p. 033429.
[18] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. "PointNet: Deep learning on point sets for 3D classification and segmentation". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 652-660.
[19] David Rydh. "A minimal set of generators for the ring of multisymmetric functions". Annales de l'Institut Fourier. Vol. 57. 6. 2007, pp. 1741-1769.
[20] Itay Safran and Ohad Shamir. "Depth-width tradeoffs in approximating natural functions with neural networks". International Conference on Machine Learning. PMLR, 2017, pp. 2979-2987.
[21] Adam Santoro et al. "A simple neural network module for relational reasoning". Advances in Neural Information Processing Systems 30 (2017).
[22] Franco Scarselli et al. "The graph neural network model". IEEE Transactions on Neural Networks 20.1 (2008), pp. 61-80.
[23] Nimrod Segol and Yaron Lipman. "On universal equivariant set networks". arXiv preprint arXiv:1910.02421 (2019).
[24] Matus Telgarsky. "Benefits of depth in neural networks". Conference on Learning Theory. PMLR, 2016, pp. 1517-1539.
[25] Gal Vardi, Daniel Reichman, Toniann Pitassi, and Ohad Shamir. "Size and depth separation in approximating benign functions with neural networks". Conference on Learning Theory. PMLR, 2021, pp. 4195-4223.
[26] Gal Vardi, Gilad Yehudai, and Ohad Shamir. "Width is less important than depth in ReLU neural networks". arXiv preprint arXiv:2202.03841 (2022).
[27] Ashish Vaswani et al. "Attention is all you need". Advances in Neural Information Processing Systems 30 (2017).
[28] Luca Venturi, Samy Jelassi, Tristan Ozuch, and Joan Bruna. "Depth separation beyond radial functions". arXiv preprint arXiv:2102.01621 (2021).
[29] Edward Wagstaff et al. "On the limitations of representing functions on sets". International Conference on Machine Learning. PMLR, 2019, pp. 6487-6494.
[30] Edward Wagstaff et al. "Universal approximation of functions on sets". Journal of Machine Learning Research 23.151 (2022), pp. 1-56.
[31] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826 (2018).
[32] Manzil Zaheer et al. "Deep sets". Advances in Neural Information Processing Systems 30 (2017).