Set2Graph: Learning Graphs From Sets

Hadar Serviansky¹, Nimrod Segol¹, Jonathan Shlomi¹, Kyle Cranmer², Eilam Gross¹, Haggai Maron³, Yaron Lipman¹
¹Weizmann Institute of Science  ²New York University  ³NVIDIA Research

Many problems in machine learning can be cast as learning functions from sets to graphs, or more generally to hypergraphs; in short, Set2Graph functions. Examples include clustering, learning vertex and edge features on graphs, and learning features on triplets in a collection. A natural approach for building Set2Graph models is to characterize all linear equivariant set-to-hypergraph layers and stack them with non-linear activations. This poses two challenges: (i) the expressive power of these networks is not well understood; and (ii) these models would suffer from high, often intractable, computational and memory complexity, as their dimension grows exponentially. This paper advocates a family of neural network models for learning Set2Graph functions that is both practical and of maximal expressive power (universal), that is, it can approximate arbitrary continuous Set2Graph functions over compact sets. Testing these models on different machine learning tasks, mainly an application to particle physics, we find them favorable to existing baselines.

1 Introduction

[Figure 1: Set-to-graph functions are represented as collections of set-to-k-edge functions.]

We consider the problem of learning functions taking sets of vectors in $\mathbb{R}^{d_{\text{in}}}$ to graphs, or more generally hypergraphs; we name this problem Set2Graph, or set-to-graph. Set-to-graph functions appear in machine-learning applications such as clustering, predicting features on edges and nodes in graphs, and learning k-edge information in sets.

Mathematically, we represent each set-to-graph function as a collection of set-to-k-edge functions, where each set-to-k-edge function learns features on k-edges. That is, given an input set $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^{d_{\text{in}}}$ we consider functions $F^k$ attaching feature vectors to k-edges: each k-tuple $(x_{i_1}, \ldots, x_{i_k})$ is assigned an output vector $F^k(X)_{i_1, i_2, \ldots, i_k, :} \in \mathbb{R}^{d_{\text{out}}}$. Functions mapping sets to hypergraphs with hyperedges of size up to k are then modeled by $(F^1, F^2, \ldots, F^k)$. For example, functions mapping sets to standard graphs are represented by $(F^1, F^2)$; see Figure 1.

Set-to-graph functions are well-defined if they satisfy a property called equivariance (defined later), and therefore the set-to-graph problem is an instance of the broader class of equivariant learning [3, 27, 16]. A natural approach for learning an equivariant set-to-graph model is to use an out-of-the-box fully equivariant model as in [20].

A central question is: Are equivariant models universal for set-to-graph functions? That is, can equivariant models approximate any continuous equivariant function? In the equivariant learning literature, set-to-set models [44, 25] are proven to be equivariant universal [11, 30, 28]. In contrast, the situation for graph-to-graph equivariant models is more intricate: some models, such as message passing (a.k.a. graph convolutional networks), are known to be non-universal [41, 23, 19, 2], while high-order equivariant models are known to be universal [21] but require high-order tensors and are therefore not practical. Universality of equivariant set-to-graph models is not known, as far as we are aware.
In particular, are high-order tensors required for universality (as in the graph-to-graph case), or are low-order tensors (as in the set-to-set case) sufficient?

In this paper we: (i) show that low-order tensors are sufficient for set-to-graph universality, and (ii) build an equivariant model for the set-to-graph problem that is both practical (i.e., has a small number of parameters and no need to build high-order tensors in memory) and provably universal. We achieve this with a composition of three networks: $F^k = \psi \circ \beta \circ \phi$, where $\phi$ is a set-to-set model, $\beta$ is a non-learnable broadcasting set-to-graph layer, and $\psi$ is a simple graph-to-graph network using only a single Multi-Layer Perceptron (MLP) acting on each k-edge independently.

Our main motivation for this work comes from an important set-to-2-edge learning problem in particle physics: partitioning (clustering) of simulated particles generated in the Large Hadron Collider (LHC). We demonstrate that our model produces state-of-the-art results on this task compared to relevant baselines. We also experimented with another set-to-2-edge problem, Delaunay triangulation, and a set-to-3-edge problem, 3D convex hull, in which we also achieve superior performance to the baselines.

2 Previous work

Equivariant learning. In many learning setups the task is invariant or equivariant to certain transformations of the input. Canonical examples are image recognition tasks [18, 17] and set classification tasks [44, 25]. Earlier methods such as [34] used non-equivariant methods to learn set functions. Restricting models to be invariant or equivariant to these transformations has been shown to be an excellent approach for reducing the number of parameters of models while improving generalization [44, 25, 13, 9, 33, 41, 15, 20, 19, 3, 5, 7, 40, 4, 8, 37, 39]. There has been keen interest in the analysis of equivariant models [27, 15], especially the analysis of their approximation power [44, 25, 21, 11, 30, 22]. As far as we know, the set-to-graph case has not been treated before.

Similarity learning. Our work is related to the field of similarity learning, in which the goal is to learn a similarity function on pairs of inputs. In most cases, a Siamese architecture is used to extract features for each input, and a similarity score is then computed from this pair of features [1, 31, 43]. The difference from our setup is that similarity learning aims to extract pairwise relations between two inputs independently of the other members of the set, while we learn these pairwise relations globally from the entire input set. In the experimental section we show that the independence assumption made in similarity learning can cause a significant degradation in performance compared to our global approach.

Other related methods. [10] suggest a method for meta-clustering that can be seen as an instance of the set2graph setup. Their method is based on LSTMs and therefore depends on the order of the set elements. In contrast, our method is blind (equivariant) to the chosen order of the input sets. The Neural Relational Inference model [12] is another related work that targets learning relations and dynamics of a set of objects in an unsupervised manner. [35] had previously tackled planar Delaunay triangulation and convex hull prediction problems using a non-equivariant network.
3 Learning hypergraphs from sets

We would like to learn functions from sets of n vectors in $\mathbb{R}^{d_{\text{in}}}$ to hypergraphs with n nodes (think of the nodes as corresponding to the set elements) and arbitrary k-edge feature vectors in $\mathbb{R}^{d_{\text{out}}}$, where a k-edge is defined as a k-tuple of set elements. A function mapping sets of vectors to k-edges is called a set-to-k-edge function and denoted $F^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$. Consequently, a set-to-hypergraph function is modeled as a sequence $(F^1, F^2, \ldots, F^K)$, for target hypergraphs with hyperedges of maximal size K. For example, $F^2$ learns pairwise relations in a set; and $(F^1, F^2)$ is a function from sets to graphs (outputting both node features and pairwise relations); see Figure 1. Our goal is to design permutation-equivariant neural network models for $F^k$ that are as efficient as possible in terms of number of parameters and memory usage, but at the same time have maximal expressive power, i.e., are universal.

Representing sets and k-edges. A matrix $X = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^{n \times d_{\text{in}}}$ represents a set of n vectors $x_i \in \mathbb{R}^{d_{\text{in}}}$ and therefore should be considered up to re-ordering of its rows. We denote by $S_n = \{\sigma\}$ the symmetric group, that is, the group of bijections (permutations) $\sigma : [n] \to [n]$, where $[n] = \{1, \ldots, n\}$. We denote by $\sigma \cdot X$ the matrix resulting from reordering the rows of X by the permutation $\sigma$, i.e., $(\sigma \cdot X)_{i,j} = X_{\sigma^{-1}(i), j}$. In this notation, X and $\sigma \cdot X$ represent the same set, for all permutations $\sigma$.

k-edges are represented as a tensor $Y \in \mathbb{R}^{n^k \times d_{\text{out}}}$, where $Y_{\mathbf{i},:} \in \mathbb{R}^{d_{\text{out}}}$ denotes the feature vector attached to the k-edge defined by the k-tuple $(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$, where $\mathbf{i} = (i_1, i_2, \ldots, i_k) \in [n]^k$ is a multi-index with non-repeating indices. Similarly to the set case, k-edges are considered up to renumbering of the nodes by some permutation $\sigma \in S_n$. That is, if we define the action $\sigma \cdot Y$ by $(\sigma \cdot Y)_{\mathbf{i},j} = Y_{\sigma^{-1}(\mathbf{i}), j}$, where $\sigma^{-1}(\mathbf{i}) = (\sigma^{-1}(i_1), \sigma^{-1}(i_2), \ldots, \sigma^{-1}(i_k))$, then Y and $\sigma \cdot Y$ represent the same k-edge data, for all $\sigma \in S_n$.

Equivariance. A sufficient condition for $F^k$ to represent a well-defined map between sets $X \in \mathbb{R}^{n \times d_{\text{in}}}$ and k-edge data $Y \in \mathbb{R}^{n^k \times d_{\text{out}}}$ is equivariance to permutations, namely

$$F^k(\sigma \cdot X) = \sigma \cdot F^k(X), \qquad (1)$$

for all sets $X \in \mathbb{R}^{n \times d_{\text{in}}}$ and permutations $\sigma \in S_n$. Equivariance guarantees, in particular, that the two equivalent sets X and $\sigma \cdot X$ are mapped to equivalent k-edge data tensors $F^k(X)$ and $\sigma \cdot F^k(X)$.

Set-to-k-edge models. In this paper we explore the following equivariant neural network model family for approximating $F^k$:

$$F^k(X; \theta) = \psi \circ \beta \circ \phi(X), \qquad (2)$$

where $\phi$, $\beta$, and $\psi$ are defined below. For $F^k$ to be equivariant (as in equation 1) it is sufficient that its constituents, namely $\phi$, $\beta$, $\psi$, are equivariant, i.e., that they all satisfy equation 1.

Set-to-graph models. Given the model of set-to-k-edge functions, a model for a set-to-graph function can now be constructed from a pair of set-to-k-edge networks $(F^1, F^2)$. Similarly, a set-to-hypergraph function would require $(F^1, \ldots, F^K)$, where K is the maximal hyperedge size. Figure 1 shows an illustration of set-to-k-edge and set-to-graph functions.

φ component. $\phi : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n \times d_1}$ is a set-to-set equivariant model, that is, $\phi$ maps sets of vectors in $\mathbb{R}^{d_{\text{in}}}$ to sets of vectors in $\mathbb{R}^{d_1}$. To achieve the universality goal we will need $\phi$ to be universal as a set-to-set model; that is, $\phi$ can approximate arbitrary continuous set-to-set functions. Several options exist [11, 28], although probably the simplest option is Deep Sets [44] or one of its variations; all were recently proven to be universal in [30].
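As a concrete reference, the following is a minimal sketch of one possible choice of φ: a Deep Sets-style equivariant network in PyTorch. The class names (DeepSetsLayer, SetToSet), layer widths, and the sum-pooling variant are illustrative assumptions and not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DeepSetsLayer(nn.Module):
    """One permutation-equivariant set-to-set layer in the spirit of Deep Sets [44]:
    each element is updated from its own features and from a sum over the whole set."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin_self = nn.Linear(d_in, d_out)
        self.lin_pool = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                       # x: (batch, n, d_in)
        pooled = x.sum(dim=1, keepdim=True)     # permutation-invariant summary of the set
        return torch.relu(self.lin_self(x) + self.lin_pool(pooled))

class SetToSet(nn.Module):
    """phi: a stack of equivariant layers mapping R^{n x d_in} -> R^{n x d_1}."""
    def __init__(self, d_in, d_hidden, d_1, num_layers=3):
        super().__init__()
        dims = [d_in] + [d_hidden] * (num_layers - 1) + [d_1]
        self.layers = nn.ModuleList(DeepSetsLayer(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

A quick sanity check of equivariance: permuting the rows of the input permutes the rows of the output, since the pooled summary is unchanged by permutations.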
In practice, as will be clear later from the proof of the universality of the model, when building a set-to-graph or set-to-hypergraph model, the φ (set-to-set) part of the k-edge networks can be shared between the different set-to-k-edge models $F^k$ without compromising universality.

β component. $\beta : \mathbb{R}^{n \times d_1} \to \mathbb{R}^{n^k \times d_2}$ is a non-learnable linear broadcasting layer mapping sets to k-edges. In theory, as shown in [20], the space of equivariant linear mappings $\mathbb{R}^{n \times d_1} \to \mathbb{R}^{n^k \times d_2}$ is of dimension $d_1 d_2 \,\mathrm{bell}(k+1)$, which can be very high since the Bell numbers grow exponentially. Interestingly, in the set-to-k-edge case one can achieve universality with only k linear operators. We define the broadcasting operator to be

$$\beta(X)_{\mathbf{i},:} = [x_{i_1}, x_{i_2}, \ldots, x_{i_k}], \qquad (3)$$

where $\mathbf{i} = (i_1, \ldots, i_k)$ and brackets denote concatenation in the feature dimension, that is, for $A \in \mathbb{R}^{n^k \times d_a}$, $B \in \mathbb{R}^{n^k \times d_b}$ their concatenation is $[A, B] \in \mathbb{R}^{n^k \times (d_a + d_b)}$. Therefore, the feature output dimension of β is $d_2 = k d_1$. As an example, consider the graph case, where k = 2. In this case $\beta(X)_{i_1, i_2, :} = [x_{i_1}, x_{i_2}]$; this function is illustrated in Figure 2, broadcasting data in $\mathbb{R}^{n \times d_1}$ to a tensor in $\mathbb{R}^{n \times n \times d_2}$. To see that the broadcasting layer is equivariant, it is enough to consider a single feature $\beta(X)_{\mathbf{i}} = x_{i_1}$. Permuting the rows of X by a permutation σ we get $\beta(\sigma \cdot X)_{\mathbf{i},j} = x_{\sigma^{-1}(i_1), j} = \beta(X)_{\sigma^{-1}(\mathbf{i}), j} = (\sigma \cdot \beta(X))_{\mathbf{i},j}$.

[Figure 2: The model architecture for the set-to-graph and set-to-2-edge functions.]

ψ component. $\psi : \mathbb{R}^{n^k \times d_2} \to \mathbb{R}^{n^k \times d_{\text{out}}}$ is a mapping of k-tensors to k-tensors. Here the theory of equivariant operators indicates that the space of linear equivariant maps is of dimension $d_2 d_{\text{out}} \,\mathrm{bell}(2k)$, which suggests a huge number of model parameters even for a single linear layer. Surprisingly, universality can be achieved with much less, in fact a single linear operator (i.e., a scaled identity) in each layer. In the multi-feature, multi-layer case this boils down to applying a Multi-Layer Perceptron $m : \mathbb{R}^{d_2} \to \mathbb{R}^{d_{\text{out}}}$ independently to each feature vector of the input tensor $X \in \mathbb{R}^{n^k \times d_2}$. That is, we use

$$\psi(X)_{\mathbf{i},:} = m(X_{\mathbf{i},:}). \qquad (4)$$

Figure 2 illustrates set-to-2-edge and set-to-graph models incorporating the three components φ, β, ψ discussed above. We note that, indeed, φ, β, ψ are all equivariant.
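For concreteness, here is a minimal sketch of the k = 2 composition $\psi \circ \beta \circ \phi$ of equation 2, reusing the SetToSet sketch above as φ. The names (broadcast_2edge, SetTo2Edge) and layer sizes are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

def broadcast_2edge(x):
    """beta for k = 2 (equation 3): each ordered pair (i1, i2) receives [x_{i1}, x_{i2}].
    x: (batch, n, d_1) -> (batch, n, n, 2*d_1)."""
    n = x.shape[1]
    xi = x.unsqueeze(2).expand(-1, -1, n, -1)   # x_{i1} broadcast along the second index
    xj = x.unsqueeze(1).expand(-1, n, -1, -1)   # x_{i2} broadcast along the first index
    return torch.cat([xi, xj], dim=-1)

class SetTo2Edge(nn.Module):
    """F^2 = psi o beta o phi (equation 2), with SetToSet (sketched above) as phi."""
    def __init__(self, d_in, d_hidden, d_1, d_out):
        super().__init__()
        self.phi = SetToSet(d_in, d_hidden, d_1)
        self.psi = nn.Sequential(               # psi: one MLP m applied per 2-edge (equation 4)
            nn.Linear(2 * d_1, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

    def forward(self, x):                       # x: (batch, n, d_in)
        return self.psi(broadcast_2edge(self.phi(x)))   # (batch, n, n, d_out)
```

Equivariance can be sanity-checked numerically: permuting the rows of the input set permutes both node axes of the output 2-edge tensor accordingly.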
4 Universality of set-to-graph models

In this section we prove that the model $F^k$ introduced above is universal, in the sense that it can approximate arbitrary continuous equivariant set-to-k-edge functions $G^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$ over compact domains $K \subset \mathbb{R}^{n \times d_{\text{in}}}$.

Theorem 1. The model $F^k$ is set-to-k-edge universal.

A corollary of Theorem 1 establishes set-to-hypergraph universal models:

Theorem 2. The model $(F^1, \ldots, F^K)$ is set-to-hypergraph universal.

Our main tool for proving Theorem 1 is a characterization of the equivariant set-to-k-edge polynomials $P^k$. This characterization can be seen as a generalization of the characterization of set-to-set equivariant polynomials that recently appeared in [30]. We consider an arbitrary continuous set-to-k-edge mapping $G^k(X)$ over a compact set $K \subset \mathbb{R}^{n \times d_{\text{in}}}$. Since $G^k$ is equivariant we can assume K is symmetric, i.e., $\sigma \cdot K = K$ for all $\sigma \in S_n$. The proof consists of three parts: (i) characterization of the equivariant set-to-k-edge polynomials $P^k$; (ii) showing that every equivariant continuous set-to-k-edge function $G^k$ can be approximated by some $P^k$; (iii) showing that every $P^k$ can be approximated by our model $F^k$.

Before providing the full proof, which contains some technical derivations, let us give a simpler universality proof (under some mild extra conditions) for the set-to-2-edge model $F^2$, based on the Singular Value Decomposition (SVD).

4.1 A simple proof for universality of second-order tensors

It is enough to consider the case $d_{\text{out}} = 1$; the general case follows by applying the argument to each output feature dimension independently. Let $G^2$ be an arbitrary continuous equivariant set-to-2-edge function $G^2 : K \subset \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n \times n}$. We want to approximate $G^2$ with our model $F^2$.

First, note that without loss of generality we can assume $G^2(X)$ has a simple spectrum (i.e., its eigenvalues are all distinct) for all $X \in K$. Indeed, if this is not the case we can always choose $\lambda > 0$ sufficiently large and consider $G^2 + \lambda\,\mathrm{diag}(1, 2, \ldots, n)$. This diagonal addition does not change the 2-edge values assigned by $G^2$, and it guarantees a simple spectrum by standard Hermitian matrix eigenvalue perturbation theory (see, e.g., [32], Section IV:4).

Now let $G^2(X) = U(X)\Sigma(X)V(X)^T$ be the SVD of $G^2(X)$, where $U = [u_1, \ldots, u_n]$ and $V = [v_1, \ldots, v_n]$. Since $G^2(X)$ has a simple spectrum, $U$, $V$, $\Sigma$ are all continuous in X; $\Sigma$ is unique, and $U$, $V$ are unique up to a sign flip of the singular vectors (i.e., the columns of $U$, $V$) [24]. Let us first assume that the singular vectors can be chosen uniquely also up to a sign; later we show how to achieve this with an additional mild assumption. Now, uniqueness of the SVD together with the equivariance of $G^2$ imply that $U$, $V$ are continuous set-to-set equivariant functions and $\Sigma$ is a continuous set-invariant function:

$$(\sigma \cdot U(X))\,\Sigma(X)\,(\sigma \cdot V(X))^T = \sigma \cdot G^2(X) = G^2(\sigma \cdot X) = U(\sigma \cdot X)\,\Sigma(\sigma \cdot X)\,V(\sigma \cdot X)^T. \qquad (5)$$

Lastly, since φ is set-to-set universal there is a choice of its parameters so that it approximates arbitrarily well the equivariant set-to-set function $Y = [U, V, \mathbf{1}\mathbf{1}^T \Sigma]$. The ψ component can be chosen by noting that $G^2(X)_{i_1, i_2} = \sum_{j=1}^n \sigma_j U_{i_1, j} V_{i_2, j} = p(\beta(Y)_{i_1, i_2, :})$, where $\sigma_j$ are the singular values and $p : \mathbb{R}^{6n} \to \mathbb{R}$ is a cubic polynomial. To conclude, pick m to approximate p sufficiently well so that $\psi \circ \beta \circ \phi$ approximates $G^2$ to the desired accuracy.

To achieve uniqueness of the singular vectors up to a sign we can add, e.g., the following assumption: $\mathbf{1}^T u_i(X) \neq 0 \neq \mathbf{1}^T v_i(X)$ for all singular vectors and all $X \in K$. Under this assumption we can always pick $u_i(X)$, $v_i(X)$ in the SVD so that $\mathbf{1}^T u_i(X) > 0$ and $\mathbf{1}^T v_i(X) > 0$, for all $i \in [n]$. Lastly, note that equation 5 suggests that an outer product can also be used as a broadcasting layer. We now move to the general proof.

4.2 Equivariant set-to-k-edge polynomials

We start with a characterization of the set-to-k-edge equivariant polynomials $P^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$. We need some more notation. Given a vector $x \in \mathbb{R}^d$ and a multi-index $\alpha \in [n]^d$, we set $x^\alpha = \prod_{i=1}^d x_i^{\alpha_i}$, $|\alpha| = \sum_{i=1}^d \alpha_i$, and define accordingly $X^\alpha = (x_1^\alpha, \ldots, x_n^\alpha)^T$. Given two tensors $A \in \mathbb{R}^{n^{k_1}}$, $B \in \mathbb{R}^{n^{k_2}}$ we use the notation $A \otimes B \in \mathbb{R}^{n^{k_1 + k_2}}$ to denote the tensor product, defined by $(A \otimes B)_{\mathbf{i}_1, \mathbf{i}_2} = A_{\mathbf{i}_1} B_{\mathbf{i}_2}$, where $\mathbf{i}_1, \mathbf{i}_2$ are suitable multi-indices. Lastly, we denote by $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$ a vector of multi-indices $\alpha_i \in [n]^d$, and $X^{\boldsymbol{\alpha}} = X^{\alpha_1} \otimes \cdots \otimes X^{\alpha_k}$.

Theorem 3. An equivariant set-to-k-edge polynomial $P^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$ can be written as

$$P^k(X) = \sum_{\boldsymbol{\alpha}} X^{\boldsymbol{\alpha}} \otimes q_{\boldsymbol{\alpha}}(X), \qquad (6)$$

where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$, $\alpha_i \in [n]^{d_{\text{in}}}$, and $q_{\boldsymbol{\alpha}} : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ are $S_n$-invariant polynomials.

As an example, consider the graph case, where k = 2. Equivariant set-to-2-edge polynomials take the form

$$P^k(X) = \sum_{\alpha_1, \alpha_2} X^{\alpha_1} \otimes X^{\alpha_2} \otimes q_{\alpha_1, \alpha_2}(X), \qquad (7)$$

and coordinate-wise,

$$P^k_{ijl}(X) = \sum_{\alpha_1, \alpha_2} x_i^{\alpha_1} x_j^{\alpha_2}\, q_{\alpha_1, \alpha_2, l}(X). \qquad (8)$$
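To make the structure of equation 7 concrete, here is a small numerical sketch (for $d_{\text{in}} = d_{\text{out}} = 1$) that assembles a polynomial of this form, using power sums as the invariant $q$'s, and checks equivariance under a random permutation. The particular monomials and coefficient assignment are arbitrary illustrative choices.

```python
import numpy as np

def invariant_q(X):
    # a few S_n-invariant polynomials of the set (power sums), used as q_alpha(X)
    return np.array([X.sum(), (X ** 2).sum(), (X ** 3).sum()])

def equivariant_P2(X):
    # P(X)_{i,j} = sum_{a1,a2} x_i^{a1} * x_j^{a2} * q_{a1,a2}(X)   (equation 7, d_in = d_out = 1)
    q = invariant_q(X)
    monomials = np.stack([X ** 0, X ** 1, X ** 2], axis=1)   # X^{alpha} for alpha = 0, 1, 2
    P = np.zeros((len(X), len(X)))
    for a1 in range(3):
        for a2 in range(3):
            P += np.outer(monomials[:, a1], monomials[:, a2]) * q[(a1 + a2) % 3]
    return P

X = np.random.randn(6)                  # a set of n = 6 scalars
sigma = np.random.permutation(6)
P, P_perm = equivariant_P2(X), equivariant_P2(X[sigma])
assert np.allclose(P_perm, P[np.ix_(sigma, sigma)])   # P(sigma.X) = sigma.P(X)
```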
The general proof idea and the proof itself are given in the supplementary material. Figure 3 provides an illustration of these polynomials.

Approximating $G^k$ with a polynomial $P^k$. For an arbitrary tensor $A \in \mathbb{R}^{a \times b \times c}$ we denote its infinity norm by $\|A\|_\infty = \max_{\mathbf{i}} |A_{\mathbf{i}}|$.

Lemma 1. Let $G^k : K \subset \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$ be a continuous equivariant function over a symmetric domain $K \subset \mathbb{R}^{n \times d_{\text{in}}}$. For an arbitrary $\epsilon > 0$, there exists an equivariant polynomial $P^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$ so that $\max_{X \in K} \|G^k(X) - P^k(X)\|_\infty < \epsilon$.

[Figure 3: Illustration of the structure of an equivariant set-to-k-edge polynomial $P^k$, for k = 2: $P^k(X)$ is a sum of terms of the form $X^{\alpha_1} \otimes X^{\alpha_2} \otimes q_{\boldsymbol{\alpha}}(X)$.]

This is a standard lemma, similar to [42, 21, 30]; we provide a proof in the supplementary material.

Approximating $P^k$ with a network $F^k$. The final component of the proof of Theorem 1 is showing that an equivariant polynomial $P^k$ can be approximated over K using a network of the form in equation 2. The key is to use the characterization of Theorem 3 and write $P^k$ in a form similar to our model in equation 2:

$$P^k_{\mathbf{i},:}(X) = p(\beta(H(X))_{\mathbf{i},:}), \qquad (9)$$

where $H : K \to \mathbb{R}^{n \times d_1}$ is defined by $H(X)_{i,:} = [x_i, q(X)]$, with $q(X) = [q_{\boldsymbol{\alpha}_1}(X), \ldots, q_{\boldsymbol{\alpha}_m}(X)]$, and $\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \ldots, \boldsymbol{\alpha}_m$ are all the multi-indices participating in the sum in equation 6. Note that $\beta(H(X))_{\mathbf{i},:} = [x_{i_1}, q(X), x_{i_2}, q(X), \ldots, x_{i_k}, q(X)]$. Therefore, $p : \mathbb{R}^{d_2} \to \mathbb{R}^{d_{\text{out}}}$ is chosen as the polynomial

$$p : [x_1, y, x_2, y, \ldots, x_k, y] \mapsto \sum_{\boldsymbol{\alpha}} x_1^{\alpha_1} \cdots x_k^{\alpha_k}\, y_{\boldsymbol{\alpha}},$$

where $y = [y_{\boldsymbol{\alpha}_1}, \ldots, y_{\boldsymbol{\alpha}_m}] \in \mathbb{R}^{m d_{\text{out}}}$ and $y_{\boldsymbol{\alpha}_i} \in \mathbb{R}^{d_{\text{out}}}$. In view of equation 9, all that is left is to choose φ and ψ (i.e., m) to approximate H and p (respectively) to the desired accuracy. We detail the rest of the proof in the supplementary material.

Universality of the set-to-hypergraph model. Theorem 2 follows from Theorem 1 by considering a set-to-hypergraph continuous function G as a collection of set-to-k-edge functions $G^k$ and approximating each one using our model $F^k$. Note that universality still holds if $F^1, \ldots, F^K$ all share the φ part of the network (assuming sufficient width $d_1$). Note, however, that a single set-to-k-edge model (equation 2) is not universal when approximating set-to-hypergraph functions:

Proposition 1. The set-to-2-edge model, $F^2$, cannot approximate general set-to-graph functions.

The proof is in the supplementary material; it shows that even the constant function that outputs 1 for 1-edges (nodes) and 0 for 2-edges cannot be approximated by a set-to-2-edge model $F^2$.

5 Applications

5.1 Model variants and baselines

We tested our model on three learning tasks from two categories: set-to-2-edge and set-to-3-edge.

Variants of our model. We consider two variations of our model:

S2G: Our basic model. We used the $F^2$ and $F^3$ models (respectively) for the set-to-2-edge and set-to-3-edge tasks. For $F^2$, φ is implemented using Deep Sets [44] with 5 layers and output dimension $d_1 \in \{5, 80\}$; ψ is implemented with an MLP, m, with {2, 3} layers whose input dimension $d_2$ is determined by $d_1$ and β; β is implemented according to equation 3, so for k = 2 it uses $d_2 = 2 d_1$ output features. For $F^3$, S2G is described in Section 5.4.

S2G+: For the k = 2 case we also tested a more general (but not more expressive) broadcasting β defined using the full equivariant basis $\mathbb{R}^n \to \mathbb{R}^{n^2}$ from [20], which contains bell(3) = 5 basis operations. This broadcasting layer gives $d_2 = 5 d_1$.
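A minimal sketch of this fuller broadcasting, under our reading of the $\mathbb{R}^n \to \mathbb{R}^{n^2}$ equivariant basis of [20] (row broadcast, column broadcast, diagonal placement, total-sum broadcast, and diagonal total-sum); the function name and the batched shape convention follow the sketches above and are illustrative assumptions.

```python
import torch

def broadcast_full_basis(x):
    """S2G+ broadcasting sketch: five linear equivariant maps R^{n x d_1} -> R^{n x n x d_1}
    (bell(3) = 5), applied per feature and concatenated, so d_2 = 5 * d_1.
    x: (batch, n, d_1) -> (batch, n, n, 5 * d_1)."""
    b, n, d = x.shape
    eye = torch.eye(n, device=x.device).view(1, n, n, 1)
    row = x.unsqueeze(2).expand(b, n, n, d)                              # A_{i,j} = x_i
    col = x.unsqueeze(1).expand(b, n, n, d)                              # A_{i,j} = x_j
    diag = eye * x.unsqueeze(2)                                          # A_{i,j} = delta_{ij} x_i
    total = x.sum(dim=1, keepdim=True).unsqueeze(1).expand(b, n, n, d)   # A_{i,j} = sum_k x_k
    diag_total = eye * x.sum(dim=1, keepdim=True).unsqueeze(1)           # delta_{ij} sum_k x_k
    return torch.cat([row, col, diag, total, diag_total], dim=-1)
```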
Baselines. We compare our results to the following baselines:

MLP: A standard multi-layer perceptron applied to the flattened set features.

SIAM: A popular similarity learning model (see, e.g., [43]) based on Siamese networks. This model has the same structure as in equation 2, where φ is a Siamese MLP (a non-universal set-to-set function) applied independently to each element in the set. We use the same loss as for our model (according to the task at hand).

SIAM-3: The same architecture as SIAM but with a triplet loss [38] on the learned representations based on $\ell_2$ distance; see, e.g., [29]. Edge predictions are obtained by thresholding distances between pairs of learned representations.

GNN: A Graph Neural Network [23] applied to the k-NN (k ∈ {0, 5, 10}) induced graph. Edge prediction is done via an outer product [14].

AVR: A non-learnable, geometry-based baseline called Adaptive Vertex Reconstruction [36], typically used for the particle physics problem we tackle. More information can be found in the supplementary material.

Further architecture, implementation, and hyper-parameter details, as well as parameter counts, can be found in the supplementary material.

5.2 Partitioning for particle physics

The first learning setup we tackle is learning set-to-2-edge functions. Here, each training example is a pair (X, Y), where X is a set $X = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^{n \times d_{\text{in}}}$ and $Y \in \{0,1\}^{n \times n}$ is an adjacency matrix (the diagonal of Y is ignored). Our main experiment tackles an important particle partitioning problem.

Problem statement. In particle physics experiments, such as the Large Hadron Collider (LHC), beams of incoming particles are collided at high energies. The results of the collision are outgoing particles, whose properties (such as the trajectory) are measured by detectors surrounding the collision point. A critical low-level task in analyzing these data is to associate the particle trajectories with their progenitor, which can be formalized as partitioning sets of particle trajectories into subsets according to their unobserved point of origin in space. This task is referred to as vertex reconstruction in particle physics and is illustrated in Figure 4. We cast this problem as a set-to-2-edge problem by treating the measured particle trajectories as elements of the input set and nodes of the output graph, where the parameters that characterize them serve as the node features. An edge between two nodes indicates that the two particles come from a common progenitor, or vertex.

Data. We consider three different types (or flavors) of particle sets (called jets) corresponding to three different fundamental data-generating processes, labeled bottom-jets, charm-jets, and light-jets (B/C/L). The important distinction between the flavors is the typical number of partitions in each set. Since it is impossible to label real data collected by the detectors at the LHC, algorithms for particle physics are typically designed with high-fidelity simulators, which can provide labeled training data. These algorithms are then applied to, and calibrated with, real data collected by the LHC experiments. The generated sets are small, ranging from 2 to 14 elements each, with around 0.9M sets divided into train/val/test using the ratios 0.6/0.2/0.2. Each set element has 10 features ($d_{\text{in}}$). More information can be found in the supplementary material.

[Figure 4: Illustration of a particle physics experiment. The task is to partition the set of observed particles based on their point of origin (in blue).]

Evaluation metrics, loss and post-processing. We consider multiple quantities to quantify the performance of the partitioning: the F1 score, the Rand Index (RI), and the Adjusted Rand Index (ARI = (RI − E[RI]) / (1 − E[RI])).
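As a reference for these metrics, the following sketch computes RI and ARI for two cluster labelings directly from pairwise agreements, which is adequate here since jets contain at most 14 elements; the helper name rand_indices is ours, and the ARI value should match, e.g., sklearn.metrics.adjusted_rand_score.

```python
import numpy as np
from itertools import combinations

def rand_indices(labels_true, labels_pred):
    """Rand Index and Adjusted Rand Index of two cluster labelings,
    computed from agreement on pairs of elements (a sketch for small sets)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    same_true = np.array([labels_true[i] == labels_true[j] for i, j in pairs])
    same_pred = np.array([labels_pred[i] == labels_pred[j] for i, j in pairs])
    ri = np.mean(same_true == same_pred)
    # expected RI for independent partitions with the same pair statistics
    p_t, p_p = same_true.mean(), same_pred.mean()
    expected_ri = p_t * p_p + (1.0 - p_t) * (1.0 - p_p)
    ari = (ri - expected_ri) / (1.0 - expected_ri)
    return ri, ari

# example: 5 particles, true partition {0,1,2}{3,4}, predicted {0,1}{2,3,4}
print(rand_indices([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))
```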
All models are trained to optimize the F1 score. We make sure the adjacency matrix of the output graph encodes a valid partitioning of nodes into clusters by turning every connected component into a clique.

Results. We compare the results of all learning-based methods and a typical baseline algorithm used in particle physics (AVR). We also add the results of a trivial baseline that predicts that all nodes have the same progenitor. All models have roughly the same number of parameters. We performed each experiment 11 times with different random initializations, and evaluated the model F1 score, RI, and ARI on the test set. The results are shown in Table 2-Jets. For bottom and charm jets, which have secondary vertices, both of our models significantly outperform the baselines by 5%-10% in all performance metrics. In light-jets, without secondary decays, our models yield similar scores. We also performed an extensive ablation study; see Table A1 in the supplementary material. Note that S2G+ has the same expressive power as S2G, and produces equivalent results in practice. Table 1 compares the training times of the different methods.

| Model  | Epochs | Training time (minutes) |
|--------|--------|-------------------------|
| S2G    | 193    | 62                      |
| S2G+   | 139    | 47                      |
| GNN    | 91     | 21                      |
| SIAM   | 77     | 24                      |
| SIAM-3 | 22     | 322                     |
| MLP    | 132    | 22                      |

Table 1: Training times of the different models. Middle column: number of epochs with early stopping. Right column: total training time in minutes.

5.3 Learning Delaunay triangulations

In a second set-to-2-edge task we test our model's ability to learn Delaunay triangulations: given a set of planar points, we want to predict the Delaunay edges between pairs of points; see, e.g., [6], Chapter 9. We generated 50k planar point sets as training data and 5k planar point sets as test data; the point sets, $X \in \mathbb{R}^{n \times 2}$, were uniformly sampled in the unit square, and a ground-truth matrix in $\{0,1\}^{n \times n}$ was computed per point set using a Delaunay triangulation algorithm. The number of points in a set, n, is either 50 or varies and is randomly chosen from {20, ..., 80}. Training was stopped after 100 epochs. As in the previous experiment, all models have roughly the same number of parameters. See more implementation details in the supplementary material.

In Table 2-Delaunay we report accuracy of prediction as well as precision, recall, and F1 score. Evidently, both of our models (S2G and S2G+) outperform the baselines. We also tried the MLP baseline, which yielded very low F1 scores (about 0.1). See also Figure 1 in the supplementary material for visualizations of several qualitative examples.
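For reference, the ground-truth adjacency for this task can be generated with standard computational-geometry tools; the following is a minimal sketch (the function name delaunay_adjacency is ours, and the sampling follows the unit-square setup described above).

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_adjacency(points):
    """Ground-truth sketch for the Delaunay task: points (n, 2) -> symmetric {0,1}
    adjacency with Y[i, j] = 1 iff (i, j) is an edge of some Delaunay triangle."""
    n = len(points)
    Y = np.zeros((n, n), dtype=np.float32)
    for tri in Delaunay(points).simplices:       # each simplex is a triangle (i, j, k)
        for a in range(3):
            i, j = tri[a], tri[(a + 1) % 3]
            Y[i, j] = Y[j, i] = 1.0
    return Y

X = np.random.rand(50, 2)                        # one example: 50 points in the unit square
Y = delaunay_adjacency(X)
```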
5.4 Set-to-3-edges: learning the convex hull of a point cloud

In the last experiment we demonstrate learning of a set-to-3-edge function. The learning task we target is finding supporting triangles in the convex hull of a set of points in $\mathbb{R}^3$. In this scenario, the input is a point set $X \in \mathbb{R}^{n \times 3}$, and the function we want to learn is $F^3 : \mathbb{R}^{n \times 3} \to \mathbb{R}^{n^3}$, where the output is a probability for each triplet of nodes (triangle) $\{x_{i_1}, x_{i_2}, x_{i_3}\}$ to belong to the triangular mesh that describes the convex hull of X. [35] had previously tackled a 2-dimensional version of this problem, but since their network predicts the order of nodes in the 2D convex hull, it is not easily adapted to the 3D setting. Note that storing 3rd-order tensors in memory is not feasible; hence we concentrate on a local version of the problem: given a point set $X \subset \mathbb{R}^3$, identify the triangles within the K-nearest-neighbors of each point that belong to the convex hull of the entire point set X. We used K = 10.

Therefore, for broadcasting (β) from point data to 3-edge data, instead of holding a 3rd-order tensor in memory, we broadcast only over the subset of K-NN neighborhoods. This allows working with high-order information with a relatively low memory footprint. Furthermore, since we want to consider 3-edges (triangles) with no order, we use an invariant universal set model (Deep Sets again) as m.

For k = 3, S2G is implemented as follows: φ is implemented using Deep Sets with 3 layers and output dimension $d_1 = 512$; β maps each triplet of points to the (unordered) set of its three feature vectors; ψ is implemented with a Deep Sets network with 3 layers of 64 features, followed by an MLP, m, with 3 layers. More details are in the supplementary material.

We tested our S2G model on two types of data: Gaussian and spherical. For both types we draw point sets in $\mathbb{R}^3$ i.i.d. from a standard normal distribution, N(0, 1), where for the spherical data we normalize each point to unit length. We generated 20k point-set samples as a training set, 2k for validation, and another 2k as a test set. Point sets are in $\mathbb{R}^{n \times 3}$, where n = 30, n = 50, or n ∈ [20, 100].

Jets:

| Flavor | Model   | F1            | RI            | ARI           |
|--------|---------|---------------|---------------|---------------|
| B | S2G     | 0.646 ± 0.003 | 0.736 ± 0.004 | 0.491 ± 0.006 |
| B | S2G+    | 0.655 ± 0.004 | 0.747 ± 0.006 | 0.508 ± 0.007 |
| B | GNN     | 0.586 ± 0.003 | 0.661 ± 0.004 | 0.381 ± 0.005 |
| B | SIAM    | 0.606 ± 0.002 | 0.675 ± 0.005 | 0.411 ± 0.004 |
| B | SIAM-3  | 0.597 ± 0.002 | 0.673 ± 0.005 | 0.396 ± 0.005 |
| B | MLP     | 0.533 ± 0.000 | 0.643 ± 0.000 | 0.315 ± 0.000 |
| B | AVR     | 0.565         | 0.612         | 0.318         |
| B | trivial | 0.438         | 0.303         | 0.026         |
| C | S2G     | 0.747 ± 0.001 | 0.727 ± 0.003 | 0.457 ± 0.004 |
| C | S2G+    | 0.751 ± 0.002 | 0.733 ± 0.003 | 0.467 ± 0.005 |
| C | GNN     | 0.720 ± 0.002 | 0.689 ± 0.003 | 0.390 ± 0.005 |
| C | SIAM    | 0.729 ± 0.001 | 0.695 ± 0.002 | 0.406 ± 0.004 |
| C | SIAM-3  | 0.719 ± 0.001 | 0.710 ± 0.003 | 0.421 ± 0.005 |
| C | MLP     | 0.686 ± 0.000 | 0.658 ± 0.000 | 0.319 ± 0.000 |
| C | trivial | 0.610         | 0.472         | 0.078         |
| C | AVR     | 0.695         | 0.650         | 0.326         |
| L | S2G     | 0.972 ± 0.001 | 0.970 ± 0.001 | 0.931 ± 0.003 |
| L | S2G+    | 0.971 ± 0.002 | 0.969 ± 0.002 | 0.929 ± 0.003 |
| L | GNN     | 0.972 ± 0.001 | 0.970 ± 0.001 | 0.929 ± 0.003 |
| L | SIAM    | 0.973 ± 0.001 | 0.970 ± 0.001 | 0.925 ± 0.003 |
| L | SIAM-3  | 0.895 ± 0.006 | 0.876 ± 0.008 | 0.729 ± 0.015 |
| L | MLP     | 0.960 ± 0.000 | 0.957 ± 0.000 | 0.894 ± 0.000 |
| L | trivial | 0.910         | 0.867         | 0.675         |
| L | AVR     | 0.970         | 0.965         | 0.922         |

Delaunay:

| Setting | Model  | Acc   | Prec  | Rec   | F1    |
|---------|--------|-------|-------|-------|-------|
| n = 50 | S2G    | 0.984 | 0.927 | 0.926 | 0.926 |
| n = 50 | S2G+   | 0.983 | 0.927 | 0.925 | 0.926 |
| n = 50 | GNN0   | 0.826 | 0.384 | 0.966 | 0.549 |
| n = 50 | GNN5   | 0.809 | 0.363 | 0.985 | 0.530 |
| n = 50 | GNN10  | 0.759 | 0.311 | 0.978 | 0.471 |
| n = 50 | SIAM   | 0.939 | 0.766 | 0.653 | 0.704 |
| n = 50 | SIAM-3 | 0.911 | 0.608 | 0.538 | 0.570 |
| n ∈ {20, ..., 80} | S2G    | 0.947 | 0.736 | 0.934 | 0.799 |
| n ∈ {20, ..., 80} | S2G+   | 0.947 | 0.735 | 0.934 | 0.798 |
| n ∈ {20, ..., 80} | GNN0   | 0.810 | 0.387 | 0.946 | 0.536 |
| n ∈ {20, ..., 80} | GNN5   | 0.777 | 0.352 | 0.975 | 0.506 |
| n ∈ {20, ..., 80} | GNN10  | 0.746 | 0.322 | 0.970 | 0.474 |
| n ∈ {20, ..., 80} | SIAM   | 0.919 | 0.667 | 0.764 | 0.687 |
| n ∈ {20, ..., 80} | SIAM-3 | 0.895 | 0.578 | 0.622 | 0.587 |

Convex Hull (a):

| Data | Model | # points | F1     | AUC-ROC |
|------|-------|----------|--------|---------|
| Spherical | S2G  | 30     | 0.780  | 0.988  |
| Spherical | GNN5 | 30     | 0.693  | 0.974  |
| Spherical | SIAM | 30     | 0.425  | 0.885  |
| Spherical | S2G  | 50     | 0.686  | 0.975  |
| Spherical | GNN5 | 50     | 0.688  | 0.973  |
| Spherical | SIAM | 50     | 0.424  | 0.890  |
| Spherical | S2G  | 20-100 | 0.535  | 0.953  |
| Spherical | GNN5 | 20-100 | 0.667  | 0.970  |
| Spherical | SIAM | 20-100 | 0.354  | 0.885  |
| Gaussian  | S2G  | 30     | 0.707  | 0.996  |
| Gaussian  | GNN5 | 30     | 0.5826 | 0.9865 |
| Gaussian  | SIAM | 30     | 0.275  | 0.946  |
| Gaussian  | S2G  | 50     | 0.661  | 0.997  |
| Gaussian  | GNN5 | 50     | 0.4834 | 0.9917 |
| Gaussian  | SIAM | 50     | 0.254  | 0.974  |
| Gaussian  | S2G  | 20-100 | 0.552  | 0.994  |
| Gaussian  | GNN5 | 20-100 | 0.41   | 0.9866 |
| Gaussian  | SIAM | 20-100 | 0.187  | 0.969  |

Table 2: Jets - performance of partitioning for three types of jets. Delaunay - results on the Delaunay triangulation task. Convex Hull (a) and (b) - convex hull learning quantitative and qualitative results (panel (b) shows ground-truth vs. predicted convex-hull triangles for n = 30).

We compare our method, S2G, to the SIAM, GNN and MLP baselines. The F1 scores and AUC-ROC of the predicted convex hull triangles are shown in Table 2-Convex Hull (a), where our model outperforms the baselines in most cases (as in the previous experiment, we exclude MLP from the table since it yields very low results). See Table 2-Convex Hull (b) for several examples of triangles predicted using our trained model compared to the ground truth.
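For reference, the ground-truth triangle labels for this local version of the task can be generated with standard tools; a minimal sketch under the assumptions above (candidate triangles drawn from each point's K-NN neighborhood with K = 10, spherical data shown; the helper name convex_hull_triangle_labels is ours).

```python
import numpy as np
from itertools import combinations
from scipy.spatial import ConvexHull, cKDTree

def convex_hull_triangle_labels(X, k=10):
    """Ground-truth sketch for the set-to-3-edge task: label a candidate triangle
    (i1, i2, i3) positive iff it is a facet of the convex hull of X; candidates are
    all triplets inside each point's k-nearest-neighbor neighborhood (self included)."""
    hull_faces = {tuple(sorted(int(v) for v in f)) for f in ConvexHull(X).simplices}
    _, knn = cKDTree(X).query(X, k=k)                    # (n, k) neighbor indices
    candidates = set()
    for neighborhood in knn:
        candidates.update(combinations(sorted(int(v) for v in neighborhood), 3))
    candidates = np.array(sorted(candidates))            # (m, 3) candidate triangles
    labels = np.array([tuple(int(v) for v in t) in hull_faces for t in candidates],
                      dtype=np.float32)
    return candidates, labels

# spherical data: i.i.d. normal points projected onto the unit sphere
X = np.random.randn(30, 3)
X /= np.linalg.norm(X, axis=1, keepdims=True)
triangles, y = convex_hull_triangle_labels(X, k=10)
```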
5.5 Discussion

In the experiments above we compared our model S2G (and its variant S2G+) with several broadly used, state-of-the-art architectures. As noted, our models compare favorably with these baselines on all tasks. We attribute this to the increased expressive power (universality) of these models. Our architectures are able to learn relations between set elements while reasoning about the whole set, whereas Siamese networks rely only on the relation between pairs or triplets of points. The GNN models use a prescribed connectivity that hinders efficient set learning.

6 Conclusion

In this paper we presented a novel neural network family that can model equivariant set-to-k-edge functions and consequently set-to-graph, or more generally set-to-hypergraph, functions. The family uses a relatively small number of parameters compared to other equivariant models, and is shown to be a universal approximator of such continuous equivariant functions. We show the efficacy of these networks on several tasks, including a real-life particle physics problem. There are many directions for future work. One is adapting the model to learn valid clusterings of sets (i.e., to learn graphs that are built of disjoint cliques). Another is incorporating our network architecture in set models, with the goal of improving performance on general set tasks by constructing an intermediate latent graph representation, for example for constructing scene graphs as in [26].

Broader Impact

Our contribution describes a class of neural network models for functions from sets to graphs and includes theoretical results showing that the model family is universal. The potential uses of these models are very broad and include the physical sciences, computer graphics, and social networks. The paper includes experiments that show the positive impact of these models in the context of particle physics, and similar tasks appear repeatedly in the physical sciences. The models could also be used for social networks and areas with more complex ethical and societal consequences. Because the models treat the input as a set and are permutation equivariant, they have the potential to mitigate potential bias in data due to sorting and other pre-processing that could affect methods that treat the input as a sequence. Otherwise, the considerations of bias in the data and impact of failure are no different for our model than the generic considerations in the use of supervised learning and neural networks. Finally, we note that the models we describe are already being used in real-world particle physics research.

Acknowledgments and Disclosure of Funding

HS, NS and YL were supported in part by the European Research Council (ERC Consolidator Grant "LiftMatch" 771136), the Israel Science Foundation (Grant No. 1830/17) and by a research grant from the Carolito Stiftung (WAIC). JS and EG were supported by the NSF-BSF Grant 2017600 and the ISF Grant 2871/19. KC was supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471 and by the Moore-Sloan data science environment at NYU.

References

[1] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737-744, 1994.

[2] Z. Chen, S. Villar, L. Chen, and J. Bruna. On the equivalence between graph isomorphism testing and function approximation with GNNs. arXiv preprint arXiv:1905.12560, 2019.

[3] T. Cohen and M. Welling.
Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990-2999, 2016.

[4] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

[5] T. S. Cohen and M. Welling. Steerable CNNs. 2016.

[6] M. De Berg, M. Van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry. Springer, 1997.

[7] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.

[8] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. 3D object classification and retrieval with spherical CNNs. arXiv preprint arXiv:1711.06721, 2017.

[9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263-1272, 2017.

[10] Y. Jiang and N. Verma. Meta-learning to cluster. arXiv preprint arXiv:1910.14134, 2019.

[11] N. Keriven and G. Peyré. Universal invariant and equivariant graph neural networks. CoRR, abs/1905.04943, 2019.

[12] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.

[13] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[14] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

[15] R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

[16] R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[19] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. arXiv preprint arXiv:1905.11136, 2019.

[20] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019.

[21] H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.

[22] H. Maron, O. Litany, G. Chechik, and E. Fetaya. On learning sets of symmetric elements. arXiv preprint arXiv:2002.08599, 2020.

[23] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. arXiv preprint arXiv:1810.02244, 2018.

[24] K. A. O'Neil. Critical points of the singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 27(2):459-473, 2005.

[25] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

[26] M. Raboh, R. Herzig, J. Berant, G. Chechik, and A. Globerson. Differentiable scene graphs. In The IEEE Winter Conference on Applications of Computer Vision, pages 1488-1497, 2020.

[27] S.
Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389, 2017.

[28] A. Sannai, Y. Takai, and M. Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.

[29] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.

[30] N. Segol and Y. Lipman. On universal equivariant set networks. In International Conference on Learning Representations, 2020.

[31] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 118-126, 2015.

[32] G. W. Stewart. Matrix Perturbation Theory. 1990.

[33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. 2017.

[34] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[35] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. 2015.

[36] W. Waltenberger. RAVE: A detector-independent toolkit to reconstruct vertices. IEEE Transactions on Nuclear Science, 58:434-444, 2011.

[37] M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. 2018.

[38] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473-1480, 2006.

[39] D. Worrall and G. Brostow. CubeNet: Equivariance to 3D rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 567-584, 2018.

[40] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028-5037, 2017.

[41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

[42] D. Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

[43] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353-4361, 2015.

[44] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401, 2017.