Learning Aggregation Functions

Giovanni Pellegrini (1), Alessandro Tibo (2), Paolo Frasconi (3), Andrea Passerini (1,2) and Manfred Jaeger (2)
(1) DISI, University of Trento
(2) Computer Science Department, Aalborg University
(3) DINFO, Università di Firenze
{giovanni.pellegrini, andrea.passerini}@unitn.it, {alessandro, jaeger}@cs.aau.dk, paolo.frasconi@pm.me
Equal Contribution. Contact Authors.

Abstract

Learning on sets is increasingly gaining attention in the machine learning community, due to its widespread applicability. Typically, representations over sets are computed by using fixed aggregation functions such as sum or maximum. However, recent results showed that universal function representation by sum- (or max-) decomposition requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. To mitigate this problem, we introduce a learnable aggregation function (LAF) for sets of arbitrary cardinality. LAF can approximate several extensively used aggregators (such as average, sum, maximum) as well as more complex functions (e.g., variance and skewness). We report experiments on semi-synthetic and real data showing that LAF outperforms state-of-the-art sum- (max-) decomposition architectures such as Deep Sets and library-based architectures like Principal Neighborhood Aggregation, and can be effectively combined with attention-based architectures.

1 Introduction

The need to aggregate representations is ubiquitous in deep learning. Some recent examples include max-over-time pooling used in convolutional networks for sequence classification [Kim, 2014], average pooling of neighbors in graph convolutional networks [Kipf and Welling, 2017], max-pooling in Deep Sets [Zaheer et al., 2017], in (generalized) multi-instance learning [Tibo et al., 2020] and in GraphSAGE [Hamilton et al., 2017]. In all the above cases (with the exception of LSTM-pooling in GraphSAGE) the aggregation function is predefined, i.e., not tunable, which may in general be a disadvantage [Ilse et al., 2018]. Sum-based aggregation has been advocated based on theoretical findings showing that permutation-invariant functions can be sum-decomposed [Zaheer et al., 2017; Xu et al., 2019]. However, recent results [Wagstaff et al., 2019] showed that this universal function representation guarantee requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. This suggests that learning set functions that are accurate on sets of large cardinality is difficult.

Inspired by previous work on learning uninorms [Melnikov and Hüllermeier, 2016], we propose a new parametric family of aggregation functions that we call LAF, for learnable aggregation functions. A single LAF unit can approximate standard aggregators like sum, max or mean as well as model intermediate behaviours (possibly different in different areas of the space). In addition, LAF layers with multiple aggregation units can approximate higher-order moments of distributions like variance, skewness or kurtosis. In contrast, other authors [Corso et al., 2020] suggest employing a predefined library of elementary aggregators to be combined. Since LAF can represent sums, it can be seen as a smooth version of the class of functions that are shown in [Zaheer et al., 2017] to enjoy universality results in representing set functions. The hope is that, being smoother, LAF is more easily learnable.
Our empirical findings show that this can actually be the case, especially when asking the model to generalize over large sets. In particular, we find that:

- LAF layers can learn a wide range of aggregators (including higher-order moments) on sets of scalars without background knowledge on the nature of the aggregation task.
- LAF layers on top of traditional layers can learn the same wide range of aggregators on sets of high-dimensional vectors (MNIST images).
- LAF outperforms state-of-the-art set learning methods such as Deep Sets and PNA on real-world problems involving point clouds and text concept set retrieval.
- LAF performs comparably to PNA on random graph generation tasks, outperforming several graph neural network architectures including GAT [Veličković et al., 2018] and GIN [Xu et al., 2019].

2 The LAF Framework

We use $x = \{x_1, \ldots, x_N\}$ to denote finite multisets of real numbers $x_i \in \mathbb{R}$. Note that directly taking $x$ to be a multiset, not a vector, means that there is no need to define properties like exchangeability or permutation equivariance for operations on $x$. An aggregation function agg is any function that returns for any multiset $x$ of arbitrary cardinality $N \in \mathbb{N}$ a value $\mathrm{agg}(x) \in \mathbb{R}$.

Standard aggregation functions like mean and max can be understood as (normalized) $L_p$-norms. We therefore build our parametric LAF aggregator around functions of a form that generalizes $L_p$-norms:

$$L_{a,b}(x) := \Big(\sum_i x_i^b\Big)^a \qquad (a, b \geq 0). \qquad (1)$$

$L_{a,b}$ is invariant under the addition of zeros: $L_{a,b}(x) = L_{a,b}(x \cup \mathbf{0})$, where $\mathbf{0}$ is a multiset of zeros of arbitrary cardinality. In order to also enable aggregations that can represent conjunctive behaviour such as min, we make symmetric use of aggregators of the multisets $1 - x := \{1 - x_i \,|\, x_i \in x\}$. For $L_{a,b}(1-x)$ to be a well-behaved, dual version of $L_{a,b}(x)$, the values in $x$ need to lie in the range $[0, 1]$. We therefore restrict the following definition of our learnable aggregation function to sets $x$ whose elements are in $[0, 1]$:

$$\mathrm{LAF}(x) := \frac{\alpha L_{a,b}(x) + \beta L_{c,d}(1-x)}{\gamma L_{e,f}(x) + \delta L_{g,h}(1-x)} \qquad (2)$$

defined by tunable parameters $a, \ldots, h \geq 0$ and $\alpha, \ldots, \delta \in \mathbb{R}$. In cases where sets need to be aggregated whose elements are not already bounded by 0 and 1, we apply a sigmoid function to the set elements prior to aggregation.

Table 1 shows how a number of important aggregation functions are special cases of LAF (for values in $[0, 1]$). We make repeated use of the fact that $L_{0,1}$ returns the constant 1. For max and min, LAF only provides an asymptotic approximation in the limit of specific function parameters (as indicated in the last column of Table 1). In most cases, the parameterization of LAF for the functions in Table 1 will not be unique. Being able to encode the powers of moments implies that, e.g., the variance of $x$ can be expressed as the difference $\frac{1}{N}\sum_i x_i^2 - \big(\frac{1}{N}\sum_i x_i\big)^2$ of two LAF aggregators.

| Name | Definition | a | b | c | d | e | f | g | h | α | β | γ | δ | Limits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| constant $\kappa \in \mathbb{R}$ | $\kappa$ | 0 | 1 | - | - | 0 | 1 | - | - | $\kappa$ | 0 | 1 | 0 | |
| max | $\max_i x_i$ | 1/r | r | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 | $r \to \infty$ |
| min | $\min_i x_i$ | 0 | 1 | 1/r | r | 0 | 1 | - | - | 1 | -1 | 1 | 0 | $r \to \infty$ |
| sum | $\sum_i x_i$ | 1 | 1 | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 | |
| nonzero count | $\lvert\{i : x_i \neq 0\}\rvert$ | 1 | 0 | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 | |
| mean | $\frac{1}{N}\sum_i x_i$ | 1 | 1 | - | - | 1 | 0 | - | - | 1 | 0 | 1 | 0 | |
| k-th moment | $\frac{1}{N}\sum_i x_i^k$ | 1 | k | - | - | 1 | 0 | - | - | 1 | 0 | 1 | 0 | |
| l-th power of k-th moment | $\big(\frac{1}{N}\sum_i x_i^k\big)^l$ | l | k | - | - | l | 0 | - | - | 1 | 0 | 1 | 0 | |
| min/max | $\min_i x_i / \max_i x_i$ | 0 | 1 | 1/r | r | 1/s | s | - | - | 1 | -1 | 1 | 0 | $r, s \to \infty$ |
| max/min | $\max_i x_i / \min_i x_i$ | 1/r | r | - | - | 0 | 1 | 1/s | s | 1 | 0 | 1 | -1 | $r, s \to \infty$ |

Table 1: Different functions achievable by varying the parameters in the formulation in Equation 2.
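As an illustration of Equation 2 and Table 1, the following is a minimal NumPy sketch of a single LAF unit on scalar sets (function and parameter names are ours, not taken from the authors' code); the parameter settings below reproduce the sum, mean and (approximately) max rows of Table 1.

```python
import numpy as np

def L(x, a, b):
    """Generalized L_p-style term of Equation 1: (sum_i x_i^b)^a."""
    return np.sum(x ** b) ** a

def laf(x, a, b, c, d, e, f, g, h, alpha, beta, gamma, delta):
    """LAF aggregator of Equation 2; x is a 1-D array with values in [0, 1]."""
    num = alpha * L(x, a, b) + beta * L(1.0 - x, c, d)
    den = gamma * L(x, e, f) + delta * L(1.0 - x, g, h)
    return num / den

x = np.array([0.2, 0.5, 0.9])
# Unused (exponent) pairs are set to (0, 1) so that they contribute the constant 1.
print(laf(x, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0))      # sum  -> 1.6
print(laf(x, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0))      # mean -> about 0.533
r = 50
print(laf(x, 1 / r, r, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0))  # approx. max -> close to 0.9
```

Setting instead $\beta = -1$ with $(c, d) = (1/r, r)$ and $(a, b) = (0, 1)$ approaches the min row for large $r$, since the numerator becomes $1 - \max_i(1 - x_i)$.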
Since LAF includes sum-aggregation, we can adapt the results of [Zaheer et al., 2017] and [Wagstaff et al., 2019] on the theoretical universality of sum-aggregation as follows.

Proposition 1. Let $X \subseteq \mathbb{R}$ be countable, and $f$ a function defined on finite multisets with elements from $X$. Then there exist functions $\phi : X \to [0, 1]$, $\rho : \mathbb{R} \to \mathbb{R}$, and a parameterization of LAF, such that $f(x) = \rho(\mathrm{LAF}(\phi x; \alpha, \beta, \gamma, \delta, a, b, c, d))$, where $\phi x$ is the multiset $\{\phi(x) \,|\, x \in x\}$.

A proof in [Wagstaff et al., 2019] for a very similar proposition used a mapping from $X$ into the reals. Our requirement that LAF inputs must be in $[0, 1]$ requires a modification of the proof (contained in the supplementary material; see https://github.com/alessandro-t/laf for supplementary material and code), which for the definition of $\phi$ relies on a randomized construction. Proposition 1 shows that we retain the theoretical universality guarantees of [Zaheer et al., 2017], while enabling a wider range of solutions based on continuous encoding and decoding functions.

LAF enables a continuum of intermediate and hybrid aggregators. In Figure 1 we plot 4 different randomly generated LAF functions over $[0, 1] \times [0, 1]$, i.e., evaluated over sets of size 2. Parameters $\alpha, \ldots, \delta$ were randomly sampled in $[0, 1]$, $b, d, f, h$ in $\{0, \ldots, 5\}$, and $a, c, e, g$ obtained as $1/i$ with $i$ a random integer from $\{0, \ldots, 5\}$. The figure illustrates the rich repertoire of aggregation functions with different qualitative behaviours already for non-extreme parameter values.

[Figure 1: Left: end-to-end LAF architecture. Right: LAF functions with randomly generated parameters.]

Learning the functions depicted in Table 1 can in principle be done by a single LAF unit. However, learning complex aggregation functions might require a larger number of independent units, in that the final aggregation is the result of the combination of simpler aggregations. Moreover, a LAF layer should be able to approximate the behaviour of simpler functions also when multiple units are used. We therefore analyzed the application of multiple LAF units to some of the known functions in Table 1. The details and a visual representation of this analysis are shown in the supplementary material. Although a single unit is sometimes sufficient to approximate the target function well, the error variance among different runs is quite high, indicating that the optimization sometimes fails to converge to a good set of parameters. Multiple units provide more stability, while performing better than a single-unit aggregation in some cases. We therefore construct the LAF architecture for the experimental section by using multiple aggregators, computing the final aggregation as a linear combination of the units' aggregations.

Several LAF units can be combined as shown in Figure 1 to capture different aspects of the input set, which can in general be a multiset of vectors $x = \{x_1, \ldots, x_N\}$, where now $x_i \in \mathbb{R}^d$. Note that multiple aggregators are also used in related frameworks such as Deep Sets [Zaheer et al., 2017] or Graph Neural Networks [Veličković et al., 2018; Corso et al., 2020]. A module with $r$ LAF units takes as input $d$-dimensional vectors and produces a vector of size $r \cdot d$ as output. Each LAF unit performs an elementwise aggregation of the vectors in the set, such that $L_{k,j} = \mathrm{LAF}(\{x_{1,j}, \ldots, x_{N,j}\}; \alpha_k, \beta_k, \gamma_k, \delta_k, a_k, b_k, c_k, d_k)$ for $k = 1, \ldots, r$ and $j = 1, \ldots, d$. The output vector can then be fed into the next layer.
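To make the shape bookkeeping concrete, here is a minimal PyTorch-style sketch of such a module with $r$ LAF units applied elementwise to $d$-dimensional inputs. The module and variable names are ours, and this is only a sketch under the assumptions above (inputs already in $[0, 1]$, non-negativity of the exponents enforced by clamping); the official implementation linked in the supplementary material may differ.

```python
import torch
import torch.nn as nn

class LAFLayer(nn.Module):
    """Sketch of a LAF layer: r learnable aggregators applied elementwise.

    Input:  a set of N d-dimensional vectors, shaped (N, d), with entries in [0, 1].
    Output: a vector of size r * d (one aggregate per unit and input feature).
    """
    def __init__(self, r: int, eps: float = 1e-7):
        super().__init__()
        self.exponents = nn.Parameter(torch.rand(r, 8))        # a, b, c, d, e, f, g, h
        self.coeffs = nn.Parameter(torch.randn(r, 4) * 0.01)   # alpha, beta, gamma, delta
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ex = self.exponents.clamp(min=0.0)                     # keep a..h non-negative
        a, b, c, d, e, f, g, h = [ex[:, i].view(-1, 1) for i in range(8)]
        al, be, ga, de = [self.coeffs[:, i].view(-1, 1) for i in range(4)]
        xs = x.unsqueeze(0).clamp(self.eps, 1 - self.eps)      # (1, N, d), avoid 0^0 issues

        def L(t, p, q):
            # Equation 1 per unit and per feature: (sum_i t_i^q)^p, result shaped (r, d).
            return t.pow(q.unsqueeze(1)).sum(dim=1).pow(p)

        num = al * L(xs, a, b) + be * L(1.0 - xs, c, d)
        den = ga * L(xs, e, f) + de * L(1.0 - xs, g, h)
        return (num / (den + self.eps)).reshape(-1)            # (r * d,)

# Usage: aggregate a set of 5 three-dimensional vectors with r = 4 LAF units.
layer = LAFLayer(r=4)
out = layer(torch.rand(5, 3))
print(out.shape)  # torch.Size([12])
```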
3 Related Work

Several studies address the problem of aggregating data over sets. Sum-decomposition strategies have been used in [Zaheer et al., 2017] for point cloud classification and set expansion tasks, and in [Santoro et al., 2017] for question answering and dynamic physical systems computation. Max, sum and average are standard aggregation functions for node neighborhoods in graph neural networks [Hamilton et al., 2017; Kipf and Welling, 2017; Xu et al., 2019; Veličković et al., 2018]. [Zaheer et al., 2017] first proved universal representation results for these standard aggregators when combined with learned mappings over inputs and results of the aggregation. However, [Wagstaff et al., 2019] showed that these universality results are of little practical use, as they either require highly discontinuous mappings that would be extremely difficult to learn, or a latent dimension that is at least the size of the maximum number of input elements.

Uninorms [Yager and Rybalov, 1996] are a class of aggregation functions in fuzzy logic that can behave in a conjunctive, disjunctive or averaging manner depending on a parameter called the neutral element. [Melnikov and Hüllermeier, 2016] proposed to learn fuzzy aggregators by adjusting these learnable parameters, showing promising results on combining reviewers' scores on papers into an overall decision of acceptance or rejection. Despite the advantage of incorporating different behaviours in one single function, uninorms present discontinuities in the regions between aggregators, making them not amenable to use in fully differentiable frameworks. Furthermore, the range of possible behaviours is restricted to those commonly used in the context of fuzzy logic.

The need for considering multiple candidate aggregators is advocated in a very recent work that was developed in parallel with our framework [Corso et al., 2020]. The resulting architecture, termed Principal Neighborhood Aggregation (PNA), combines multiple standard aggregators, including most of the ones we consider in the LAF framework, adjusting their outputs with degree scalers. However, the underlying philosophy is rather different. PNA aims at learning to select the appropriate aggregator(s) from a pool of candidates, while LAF explores a continuous space of aggregators that includes standard ones as extreme cases. Our experimental evaluation shows that PNA has trouble learning aggregators that generalize over set sizes, despite having them in the pool of candidates, likely because of the quasi-combinatorial structure of its search space. On the other hand, LAF can successfully learn even the higher-moment aggregators and consistently outperforms PNA.

Closely connected, but somewhat complementary to aggregation operators, are attention mechanisms [Bahdanau et al., 2015; Vaswani et al., 2017]. They have been explored to manipulate set data in [Lee et al., 2019] and in the context of multi-instance learning [Ilse et al., 2018]. Attention operates at the level of set elements, and aims at a transformation (weighting) of their representations such as to optimize a subsequent weighted sum-aggregation. Although the objectives of attention-based frameworks and LAF are different in principle, our method can be used inside attention frameworks as the aggregation layer of the learned representation.
We discuss the combination of LAF and attention in Section 5, showing the advantage of using LAF.

4 Experiments

In this section, we present and discuss experimental results showing the potential of the LAF framework on both synthetic and real-world tasks. Synthetic experiments are aimed at showing the ability of LAF to learn a wide range of aggregators and its ability to generalize over set sizes (i.e., having test-set sets whose cardinality exceeds the cardinality of the training-set sets), something that alternative architectures based on predefined aggregators fail to achieve. We use Deep Sets, PNA, and LSTM as representatives of these architectures. The LSTM architecture corresponds to a version of Deep Sets where the aggregation function is replaced by an LSTM layer. Experiments on diverse tasks including point cloud classification, text concept set retrieval and graph property prediction are aimed at showing the potential of the framework on real-world applications.

4.1 Scalars Aggregation

This section shows the capacity of the LAF framework to learn simple and complex aggregation functions where the constituents of the sets are simple numerical values. In this setting we consider sets made of scalar integer values. The training set is constructed as follows: for each set, we initially sample its cardinality K from a uniform distribution taking values in {2, . . . , M}, and then we uniformly sample K integers in {0, . . . , 9}. For the training set we use M = 10. We construct several test sets for different values of M (M = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50). This implies that models need to generalize to larger set sizes. Contrary to the training set, each test set is constructed in order to diversify the target labels it contains, so as to avoid degenerate behaviours for large set sizes (e.g., maximum constantly equal to 9). Each synthetic dataset is composed of 100,000 sets for training, 20,000 sets for validation and 100,000 for testing.

[Figure 2: Test performances for the synthetic experiment with integer scalars on increasing test set size. The x axis represents the maximum test set cardinality, the y axis depicts the MAE. The dot, star, diamond and triangle symbols denote LAF, Deep Sets, PNA, and LSTM respectively. Skewness: $\frac{1}{N}\sum_i \big((x_i - \hat{\mu})/\hat{\sigma}\big)^3$, Kurtosis: $\frac{1}{N}\sum_i \big((x_i - \hat{\mu})/\hat{\sigma}\big)^4$, where $\hat{\mu}$ and $\hat{\sigma}$ are the sample mean and standard deviation.]

The number of aggregation units is set as follows. The model contains nine LAF units (Equation 2), whose parameters $\{a_k, \ldots, h_k\}$, $k = 1, \ldots, 9$, are initialized by uniform sampling in $[0, 1]$, as those parameters must be positive, whereas the coefficients $\{\alpha_k, \ldots, \delta_k\}$ are initialized with a Gaussian distribution with zero mean and standard deviation 0.01 to also cover negative values. The positivity constraint for the parameters $\{a, b, \ldots, h\}$ is enforced by projection during the optimization process. The remaining parameters can take on negative values.
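A minimal PyTorch-style sketch of this initialization and of the projection step is given below; the variable names and the choice of optimizer are our own assumptions, and the actual training details are in the supplementary material.

```python
import torch
import torch.nn as nn

n_units = 9
# Exponents a..h: uniform in [0, 1]; coefficients alpha..delta: N(0, 0.01^2).
exponents = nn.Parameter(torch.rand(n_units, 8))
coeffs = nn.Parameter(torch.randn(n_units, 4) * 0.01)

# Optimizer choice is an assumption; any gradient-based optimizer works the same way.
optimizer = torch.optim.Adam([exponents, coeffs], lr=1e-3)

def training_step(loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Projection: keep the exponent parameters non-negative after each update.
    with torch.no_grad():
        exponents.clamp_(min=0.0)

# Dummy usage with a toy loss involving both parameter groups.
training_step((exponents.sum() - coeffs.sum()) ** 2)
```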
Deep Sets also uses nine units: three max units, three sum units, and three mean units, while PNA uses seven units: mean, max, sum, standard deviation, variance, skewness and kurtosis. Preliminary experiments showed that expanding the set of aggregators for PNA with higher-order moments only leads to worse performance. Each set of integers is fed into an embedding layer (followed by a sigmoid) before performing the aggregation function. Deep Sets and PNA do need an embedding layer (otherwise they would have no parameters to be tuned). Although LAF does not need an embedding layer, we used it in all models to make the comparison more uniform. The architecture details are reported in the supplementary material. We use the Mean Absolute Error (MAE) as a loss function to calculate the prediction error.

Figure 2 shows the trend of the MAE for the different methods for increasing test set sizes, for different types of target aggregators. As expected, Deep Sets manages to learn the identity function and thus correctly models aggregators like sum, max and mean. Even if LAF needs to adjust its parameters in order to properly aggregate the data, its performance is competitive with that of Deep Sets. When moving to more complex aggregators like inverse count, median or moments of different orders, Deep Sets fails to learn the latent representation. On the other hand, the performance of LAF is very stable for growing set sizes. While having in principle at its disposal most of the target aggregators (including higher-order moments), PNA badly overfits the cardinality of the sets in the training set in all cases (remember that the training set contains sets of cardinality at most 10). The reason why LAF substantially outperforms PNA on large set sizes could be explained in terms of a greater flexibility to adapt to the learnt representation. Indeed, LAF parameters can adjust the LAF function to be consistent with the latent representation even if the input mapping fails to learn the identity. On the other hand, having a set of fixed, hard-coded aggregators, PNA needs to both learn the identity mapping and select the correct aggregator among the candidates. Finally, LSTM results are generally poor when compared to the other methods, particularly in the case of the count and the sum.

4.2 MNIST Digits

We performed an additional set of experiments aiming to demonstrate the ability of LAF to learn from more complex representations of the data by plugging it into end-to-end differentiable architectures. In these experiments, we thus replaced numbers by visual representations obtained from MNIST digits. Unlike the model proposed in the previous section, here we use three dense layers for learning picture representations before performing the aggregation function. Results obtained in this way are very similar to those obtained with numerical inputs and, due to space limitations, we report them along with other architectural and experimental details in the supplementary material.

4.3 Point Clouds

In order to evaluate LAF on real-world data, we consider point cloud classification, a prototypical task for set-wise prediction.
We therefore run experimental comparisons on the ModelNet40 [Wu et al., 2015] dataset, which consists of 9,843 training and 2,468 test point clouds of objects distributed over 40 classes. The dataset is preprocessed following the same procedure described by [Zaheer et al., 2017]. We create point clouds of 100 and 1,000 three-dimensional points by adopting the Point Cloud Library's sampling routine developed by [Rusu and Cousins, 2011] and normalizing each set of points to have zero mean (along each axis) and unit (global) variance. We refer to the two datasets as P100 and P1000. For all settings, we consider the same architecture and hyper-parameters as the Deep Sets permutation-invariant model described by [Zaheer et al., 2017]. For LAF, we replace the original aggregation function (max) used in Deep Sets with 10 LAF units, while for PNA we use the concatenation of max, min, mean, and standard deviation, as proposed by the authors. For PNA we do not consider any scaler, as the cardinalities of the sets are fixed.

| Method | P100 | P1000 |
|---|---|---|
| DeepSets | 82.0 ± 2.0% | 87.0 ± 1.0% |
| PNA | 82.9 ± 0.7% | 86.4 ± 0.6% |
| LSTM | 78.7 ± 1.1% | 82.2 ± 1.7% |
| LAF | 84.0 ± 0.6% | 87.0 ± 0.5% |

Table 2: Results on the point cloud classification task. Accuracies with standard deviations (over 5 runs) for the ModelNet40 dataset.

Results in Table 2 show that LAF produces an advantage on the lower-resolution dataset (i.e., on P100), while it obtains comparable (and slightly more stable) performance on the higher-resolution one (i.e., on P1000). These results suggest that having predefined aggregators is not necessarily an optimal choice in real-world cases, and that the flexibility of LAF in modeling diverse aggregation functions can boost performance and stability.

4.4 Set Expansion

Following the experimental setup of Deep Sets, we also considered the Set Expansion task. In this task the aim is to augment a set of objects of the same class with other similar objects, as explained in [Zaheer et al., 2017]. The model learns to predict a score for an object given a query set and to decide whether to add the object to the existing set. Specifically, [Zaheer et al., 2017] consider the application of set expansion to text concept retrieval: the idea is to retrieve words that belong to a particular concept, given as input a set of words sharing that concept. We employ the same model and hyper-parameters as the original publication, where we replace the sum-decomposition aggregation with LAF units for our method and with the min, max, mean, and standard deviation aggregators for PNA.

We trained our model on sets constructed from vocabularies of different sizes, namely LDA-1k, LDA-3k and LDA-5k. Table 3 shows the results of LAF, Deep Sets and PNA on different evaluation metrics. We report the retrieval metrics recall@K, median rank and mean reciprocal rank. We also report the results of the other methods the authors compared to in the original paper. More details on these methods can be found in the original publication. Briefly, Random samples a word uniformly from the vocabulary; Bayes Set is the method of [Ghahramani and Heller, 2006]; w2v-Near computes the nearest neighbors in the word2vec [Mikolov et al., 2013] space; NN-max uses a similar architecture to Deep Sets but uses max pooling to compute the set feature, as opposed to sum pooling; NN-max-con uses max pooling on set elements but concatenates this pooled representation with that of the query for a final set feature; NN-sum-con is similar to NN-max-con but uses sum pooling followed by concatenation with the query representation.
For the sake of fairness, we have rerun Deep Sets using the current implementation from the authors (indicated as DeepSets† in Table 3), which exhibits better results than the ones reported in the original paper. Nonetheless, LAF outperforms all other methods in most cases, especially on LDA-3k and LDA-5k.

LDA-1k (Vocab = 17k)

| Method | Recall@10 (%) | Recall@100 (%) | Recall@1k (%) | MRR | Med. |
|---|---|---|---|---|---|
| Random | 0.06 | 0.6 | 5.9 | 0.001 | 8520 |
| Bayes Set | 1.69 | 11.9 | 37.2 | 0.007 | 2848 |
| w2v-Near | 6.00 | **28.1** | 54.7 | 0.021 | 641 |
| NN-max | 4.78 | 22.5 | 53.1 | 0.023 | 779 |
| NN-sum-con | 4.58 | 19.8 | 48.5 | 0.021 | 1110 |
| NN-max-con | 3.36 | 16.9 | 46.6 | 0.018 | 1250 |
| DeepSets | 5.53 | 24.2 | 54.3 | 0.025 | 696 |
| DeepSets† | 5.89 | 26.0 | **55.3** | 0.026 | **619** |
| PNA | 5.56 | 24.7 | 53.2 | 0.027 | 753 |
| LSTM | 4.29 | 21.5 | 52.6 | 0.022 | 690 |
| LAF | **6.51** | 26.6 | 54.5 | **0.030** | 650 |

LDA-3k (Vocab = 38k)

| Method | Recall@10 (%) | Recall@100 (%) | Recall@1k (%) | MRR | Med. |
|---|---|---|---|---|---|
| Random | 0.02 | 0.2 | 2.6 | 0.000 | 28635 |
| Bayes Set | 2.01 | 14.5 | 36.5 | 0.008 | 3234 |
| w2v-Near | 4.80 | 21.2 | 43.2 | 0.016 | 2054 |
| NN-max | 5.30 | 24.9 | 54.8 | 0.025 | 672 |
| NN-sum-con | 5.81 | 27.2 | 60.0 | 0.027 | 453 |
| NN-max-con | 5.61 | 25.7 | 57.5 | 0.026 | 570 |
| DeepSets | 6.04 | 28.5 | 60.7 | 0.027 | 426 |
| DeepSets† | 7.56 | 28.5 | **64.0** | 0.035 | 349 |
| PNA | 7.04 | 27.2 | 58.7 | 0.028 | 502 |
| LSTM | 5.56 | 25.7 | 58.8 | 0.026 | 830 |
| LAF | **8.14** | **32.3** | 62.8 | **0.037** | **339** |

LDA-5k (Vocab = 61k)

| Method | Recall@10 (%) | Recall@100 (%) | Recall@1k (%) | MRR | Med. |
|---|---|---|---|---|---|
| Random | 0.01 | 0.2 | 1.6 | 0.000 | 30600 |
| Bayes Set | 1.75 | 12.5 | 34.5 | 0.007 | 3590 |
| w2v-Near | 4.03 | 16.7 | 35.2 | 0.013 | 6900 |
| NN-max | 4.72 | 21.4 | 47.0 | 0.022 | 1320 |
| NN-sum-con | 4.87 | 23.5 | 53.9 | 0.022 | 731 |
| NN-max-con | 4.72 | 22.0 | 51.8 | 0.022 | 877 |
| DeepSets | 5.54 | 26.1 | 55.5 | 0.026 | 616 |
| DeepSets† | 6.49 | 27.9 | **56.9** | 0.030 | 536 |
| PNA | 5.47 | 23.8 | 52.4 | 0.025 | 807 |
| LSTM | 4.87 | 23.8 | 55.0 | 0.022 | 672 |
| LAF | **6.71** | **28.3** | **56.9** | **0.031** | **523** |

Table 3: Results on Text Concept Set Retrieval on LDA-1k, LDA-3k, and LDA-5k. Bold values denote the best performance for each metric; DeepSets† denotes our rerun of the authors' current implementation.

4.5 Multi-task Graph Properties

[Corso et al., 2020] defines a benchmark consisting of six classical graph theory tasks on artificially generated graphs from a wide range of popular graph types like Erdős–Rényi, Barabási–Albert or star-shaped graphs. Three of the tasks are defined for nodes, while the other three are defined for whole graphs. The node tasks are the single-source shortest-path lengths (N1), the eccentricity (N2) and the Laplacian features (N3). The graph tasks are graph connectivity (G1), diameter (G2), and the spectral radius (G3). We compare LAF against PNA by simply replacing the original PNA aggregators and scalers with 100 LAF units (see Equation 2).

| Method | N1 | N2 | N3 | G1 | G2 | G3 |
|---|---|---|---|---|---|---|
| Baseline | -1.87 | -1.50 | -1.60 | -0.62 | -1.30 | -1.41 |
| GIN | -2.00 | -1.90 | -1.60 | -1.61 | -2.17 | -2.66 |
| GCN | -2.16 | -1.89 | -1.60 | -1.69 | -2.14 | -2.79 |
| GAT | -2.34 | -2.09 | -1.60 | -2.44 | -2.40 | -2.70 |
| MPNN (max) | -2.33 | -2.26 | -2.37 | -1.82 | -2.69 | -3.52 |
| MPNN (sum) | -2.36 | -2.16 | -2.59 | -2.54 | -2.67 | -2.87 |
| PNA* | -2.54 | -2.42 | -2.94 | -2.61 | -2.82 | -3.29 |
| PNA | -2.89 | -2.89 | -3.77 | -2.61 | -3.04 | -3.57 |
| LAF | -2.13 | -2.20 | -1.67 | -2.35 | -2.77 | -3.63 |

Table 4: Results on the multi-task graph properties prediction benchmark, expressed in log10 of the mean squared error.

Table 4 shows that, although these datasets were designed to highlight the features of the PNA architecture, which outperforms a wide range of alternative graph neural network approaches, LAF produces competitive results, outperforming state-of-the-art GNN approaches like GIN [Xu et al., 2019], GCN [Kipf and Welling, 2017] and GAT [Veličković et al., 2018], and even improving over PNA on spectral radius prediction. PNA* is the variant of PNA without scalers, also proposed by [Corso et al., 2020].

5 Set Transformer With LAF Aggregation

In this section we investigate the combination of LAF aggregation with attention mechanisms on sets, as proposed in the Set Transformer framework [Lee et al., 2019].
Briefly, Set Transformer consists of an encoder and a decoder. The encoder maps a set of input vectors into a set of feature vectors by leveraging attention blocks. The decoder employs a pooling by multihead attention (PMA) layer, which aggregates the set of feature vectors produced by the encoder. In the following experiment we replace PMA with a LAF layer.

Inspired by one of the tasks described in [Lee et al., 2019], we propose here to approximate the average of the unique numbers in a set of MNIST images. Solving the task requires learning a cascade of two processing steps: one that detects unique elements in a set (which can be done by the transformer encoder, as shown in the experiments of [Lee et al., 2019]) and one that aggregates the results by averaging (which LAF is supposed to do well). The set cardinalities are uniformly sampled from {2, 3, 4, 5} and each set label is calculated as the average of the unique digits contained in the set. We trained two Set Transformer models: one with PMA (ST-PMA) and the other with LAF (ST-LAF). The full implementation details are reported in the supplementary material.

Quantitative and qualitative results of the evaluation are shown in Figure 3, where we report the MAE for both methods (we ran several experiments changing the number of seeds k of PMA; all of them exhibited the same behaviour, and for this experiment we used k = 1). ST-LAF exhibits a clear improvement over ST-PMA for this particular task.

[Figure 3: Distribution of the predicted values for ST-PMA and ST-LAF by set cardinality. The x-axis shows the true labels of the sets, the y-axis the predicted ones. Different colors represent the set cardinalities |x|.]

Note that for ST-PMA only 25% of the sets (red points in the scatter plot), corresponding to sets with maximum cardinality, approximate the average well, while for all other cardinalities (the remaining 75% of the sets) ST-PMA predicts a constant value equal to the average label in the training set. ST-LAF instead clearly captures the distribution of the labels, generalizing better with respect to the set sizes. A similar behaviour was observed when learning to predict the sum rather than the average of the unique digits in a set (see the supplementary material for the results).

6 Conclusions

The theoretical underpinnings for sum aggregation as a universal framework for defining set functions do not necessarily provide a template for practical solutions. We therefore introduced a framework for learning aggregation functions that makes use of a parametric aggregator to effectively explore a rich space of possible aggregations. LAF defines a new class of aggregation functions, which includes widely used aggregators as special cases, and which also has the ability to learn complex functions such as higher-order moments. We empirically showed the generalization ability of our method on synthetic settings as well as real-world datasets, providing comparisons with state-of-the-art sum-decomposition approaches and recently introduced techniques. The flexibility of our model is a crucial aspect for potential practical use in many deep learning architectures, due to its ability to be easily plugged into larger architectures, as shown in our experiments with the Set Transformer. The portability of LAF opens a new range of possible applications for aggregation functions in machine learning methods, and future research in this direction can enhance the expressivity of many architectures and models that deal with unstructured data.
Acknowledgements

The work of PF was partly supported by the PRIN 2017 project RexLearn, funded by the Italian Ministry of Education, University and Research, GA No 2017TWNMH2. The work of AP was partly supported by the TAILOR project, funded by the EU Horizon 2020 research and innovation programme, GA No 952215. The scholarship of GP is funded by TIM and EIT Digital.

References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[Corso et al., 2020] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[Ghahramani and Heller, 2006] Zoubin Ghahramani and Katherine A. Heller. Bayesian sets. In Advances in Neural Information Processing Systems, pages 435-442, 2006.

[Hamilton et al., 2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1024-1034, 2017.

[Ilse et al., 2018] Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2132-2141, 2018.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746-1751, 2014.

[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[Lee et al., 2019] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3744-3753, 2019.

[Melnikov and Hüllermeier, 2016] Vitalik Melnikov and Eyke Hüllermeier. Learning to aggregate using uninorms. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, pages 756-771, 2016.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[Rusu and Cousins, 2011] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In 2011 IEEE International Conference on Robotics and Automation, pages 1-4. IEEE, 2011.
[Santoro et al., 2017] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967-4976, 2017.

[Tibo et al., 2020] Alessandro Tibo, Manfred Jaeger, and Paolo Frasconi. Learning and interpreting multi-multi-instance learning networks. Journal of Machine Learning Research, 21(193):1-60, 2020.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998-6008, 2017.

[Veličković et al., 2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

[Wagstaff et al., 2019] Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Ingmar Posner, and Michael Osborne. On the limitations of representing functions on sets. arXiv preprint arXiv:1901.09006, 2019.

[Wu et al., 2015] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.

[Xu et al., 2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

[Yager and Rybalov, 1996] Ronald R. Yager and Alexander Rybalov. Uninorm aggregation operators. Fuzzy Sets and Systems, 80(1):111-120, 1996.

[Zaheer et al., 2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3391-3401. Curran Associates, Inc., 2017.