# Regularizing Towards Permutation Invariance in Recurrent Models

Edo Cohen-Karlik, Tel Aviv University, Israel (edocohen@mail.tau.ac.il) · Avichai Ben David, Tel Aviv University, Israel (avichaib@mail.tau.ac.il) · Amir Globerson, Tel Aviv University, Israel and Google Research (gamir@post.tau.ac.il)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

In many machine learning problems the output should not depend on the order of the input. Such permutation invariant functions have been studied extensively in recent years. Here we argue that temporal architectures such as RNNs are highly relevant for such problems, despite the inherent dependence of RNNs on order. We show that RNNs can be regularized towards permutation invariance, and that this can result in compact models compared to non-recurrent architectures. We implement this idea via a novel form of stochastic regularization. Existing solutions mostly restrict the learning problem to hypothesis classes which are permutation invariant by design [Zaheer et al., 2017, Lee et al., 2019, Murphy et al., 2018]. Our approach of enforcing permutation invariance via regularization gives rise to models which are semi permutation invariant (i.e., invariant to some permutations and not to others). We show that our method outperforms other permutation invariant approaches on synthetic and real-world datasets.

## 1 Introduction

In recent years deep learning has shown remarkable performance in a vast range of applications, from natural language processing to autonomous vehicles. One of the most successful models of the current deep-learning renaissance is the convolutional neural network (CNN) [Krizhevsky et al., 2012], which exploits domain-specific properties such as the invariance of images to certain spatial transformations. Such inductive bias is common in other domains where it is necessary to learn from limited amounts of data.

In this work we consider the setting where the learned function is such that the order of the inputs does not affect the output value. These problems are commonly referred to as permutation invariant, and typical solutions restrict the learned models to functions that are permutation invariant by design. One such work is Deep Sets [Zaheer et al., 2017], which provided a characterization of permutation invariant functions. Specifically, given a set of objects $\{x_1, \dots, x_n\}$ and a permutation invariant function $f(x_1, \dots, x_n)$, they showed that f can be expressed as follows: there exist two networks $\phi(\cdot)$ and $\rho(\cdot)$ such that f is the result of applying $\phi$ to all inputs, sum-aggregating, and then applying $\rho$. Namely:

$$f(x_1, \dots, x_n) = \rho\Big(\sum_{i=1}^{n} \phi(x_i)\Big) \qquad (1.1)$$
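For concreteness, here is a minimal PyTorch sketch of the sum-decomposition in Eq. 1.1. This is our illustration, not the architecture used in the paper's experiments; the layer sizes and module names are arbitrary.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Sum-decomposition f(x_1, ..., x_n) = rho(sum_i phi(x_i)) of Eq. 1.1."""
    def __init__(self, in_dim=1, hidden=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):                  # x: (batch, set_size, in_dim)
        pooled = self.phi(x).sum(dim=1)    # sum over the set dimension -> order invariant
        return self.rho(pooled)

# Permuting the set elements leaves the output unchanged:
model = DeepSets()
x = torch.rand(2, 5, 1)
perm = torch.randperm(5)
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-5)
```

Because the only interaction between set elements is the sum, any ordering of the inputs produces the same output by construction.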
Other works suggest replacing the summation with different permutation invariant aggregation methods, such as element-wise maximum [Qi et al., 2017] and attention mechanisms [Lee et al., 2019, Vinyals et al., 2016]. Although these approaches result in permutation invariant functions and can be shown to express any permutation invariant function, it is not clear how many parameters such an implementation requires. Indeed, there remains an important open question: given the many ways in which a permutation-invariant function can be implemented, what are the relative advantages of each approach? Here we highlight the potential of recurrent architectures for modeling permutation invariance.

By recurrent architectures we mean any architecture with a state that evolves as the sequence is processed, for example recurrent neural networks (RNNs), LSTMs [Hochreiter and Schmidhuber, 1997] and GRUs [Chung et al., 2014]. We focus on standard RNNs in what follows, but our approach applies to any recurrent model. It initially seems counter-intuitive that RNNs should be useful for modeling permutation invariant functions. However, as we show in Section 3, there are permutation invariant functions that RNNs can model with far fewer parameters than Deep Sets. The reason RNNs are effective models for permutation invariant functions is that their state can act as an aggregator that computes order-invariant summaries. For example, max-aggregation for positive numbers can be implemented via the simple state update $s_{t+1} = \max[s_t, x_t]$ with $s_0 = 0$, and can be realized with only four ReLU gates. Similarly, the state can collect statistics of the sequence in a permutation invariant manner (e.g., order statistics [Alon et al., 1999]).

One option for achieving permutation invariant RNNs is to take all permutations of the input sequence, feed each of them to an RNN, and average. This method is exponential in the sequence length, and Murphy et al. [2018] suggest approximating it by sub-sampling the permutation set. Here we take an alternative approach that is conceptually simple, more general, and empirically more effective. We propose to regularize RNNs towards invariance: learn a regular RNN model f, but with a regularization term R(f) that penalizes models f that violate invariance. The naive implementation of this idea would be to require that all permutations of the training data result in the same output. However, we go beyond this by requiring the same output for subsets of training sequences. We call this subset invariance regularization (SIRE). It is a very natural criterion, since most sequence classification tasks do not have fixed-length inputs, and a subsequence of a training point is likely to be a valid example as well. In contrast to previous methods, which all result in architectures that are invariant by design, our method enforces invariance in a soft manner, thus enabling usage in semi permutation invariant settings where previous methods are not applicable. This makes it applicable to settings where there is some temporal structure in the inputs.

The rest of the paper is structured as follows: in Section 2 we define notation and formally describe the problem setting. Section 3 shows that in some cases RNNs are favorable with respect to other permutation invariant architectures. In Section 4 we describe our regularization method. In Section 5 we discuss related work, and Section 6 provides an empirical evaluation.

## 2 Formulation

Consider a general recurrent neural network with a state update function $f : S \times X \to S$ parameterized by $\Theta$ (we omit the $\Theta$ dependence when clear from context). The initial state $s_0$ is also a learned parameter. The state update rule is therefore given by:

$$s_{t+1} = f(s_t, x_{t+1}; \Theta) \qquad (2.1)$$

In what follows we use the notation $f(s, x_1, \dots, x_n)$ to denote the state generated by starting at state $s$ and processing the sequence $x_1, \dots, x_n$. Thus, for an input sequence $(x_1, x_2, x_3)$, the state $s_3$ is given by:

$$s_3 = f(s_2, x_3) = f(f(s_1, x_2), x_3) = f(f(f(s_0, x_1), x_2), x_3) =: f(s_0, x_1, x_2, x_3) \qquad (2.2)$$

The state is mapped to an output $y_t$ via the output mapping $y_t = g(s_t)$.
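To make the state-as-aggregator intuition concrete, here is a short, self-contained sketch (ours, not from the paper) of the unrolled computation in Eq. 2.2, with a state update that tracks the running maximum. The final state is then independent of the input order, which is exactly the property the regularizer below encourages.

```python
def run_rnn(update, s0, xs):
    """Compute f(s0, x_1, ..., x_n) by folding the state update over the sequence (Eq. 2.2)."""
    s = s0
    for x in xs:
        s = update(s, x)
    return s

def max_update(s, x):
    """One recurrent step whose state stores the running maximum (positive inputs, s0 = 0)."""
    return max(s, x)

xs = [3.0, 7.0, 2.0, 5.0]
assert run_rnn(max_update, 0.0, xs) == run_rnn(max_update, 0.0, list(reversed(xs))) == 7.0
```

The max can itself be written with ReLUs, e.g. max(s, x) = s + ReLU(x - s), which is how a small ReLU RNN can realize this update.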
We next define several notions of permutation invariance. Informally, a model f is permutation invariant if it produces the same output regardless of the ordering of the input. In the definitions below we assume the input is sampled from some distribution D, and we require invariance only for inputs in the support of D, or their subsets. If D is the true underlying distribution of the data, this is clearly sufficient, since we will never test on examples outside D. When training, we consider the empirical distribution as an approximation to D. We begin by defining invariance for the sequences sampled from D.

**Definition 1.** An RNN is called permutation invariant with respect to D on length n if for any $(x_1, \dots, x_n)$ in the support of D and any permutation $\pi \in S_n$ we have:¹

$$f(s_0, x_1, \dots, x_n) = f(s_0, x_{\pi_1}, \dots, x_{\pi_n}) \qquad (2.3)$$

We note that Definition 1 does not impose any constraint on sequences of length $n' < n$. However, a natural requirement of a permutation invariant RNN is to satisfy the same property for shorter sequences as well. This is captured by the following definition.

**Definition 2.** An RNN is called subset-permutation invariant with respect to D on length n if for any $(x_1, \dots, x_n)$ in the support of D, any sequence $(x'_1, \dots, x'_m)$ whose elements are a subset of $(x_1, \dots, x_n)$, and any permutations $\hat{\pi}, \pi \in S_m$ it holds that:

$$f(s_0, x'_{\hat{\pi}_1}, \dots, x'_{\hat{\pi}_m}) = f(s_0, x'_{\pi_1}, \dots, x'_{\pi_m}) \qquad (2.4)$$

Note that Definition 2 is more restrictive than Definition 1: an RNN that satisfies Definition 2 trivially satisfies Definition 1, but the converse does not hold. Definition 2 involves the response of the RNN to sub-sequences of the data. It is therefore closely related to the states that the RNN can reach when presented with sequences of different lengths. The next definition captures this notion.

**Definition 3.** Denote the states reachable under D with parameters $\Theta$ by $S_{D,\Theta}$. Formally:²

$$S_{D,\Theta} \stackrel{\text{def}}{=} \left\{ f(s_0, x'_1, \dots, x'_i; \Theta) \;\middle|\; \exists (x_1, \dots, x_n) : (x'_1, \dots, x'_i) \in 2^{\{x_1, \dots, x_n\}},\; P(x_1, \dots, x_n) > 0 \right\}$$

We shall use this definition when proposing an invariance regularizer in Section 4.

¹ $S_n$ denotes the symmetric group containing all permutations of a set with n elements.
² We use the notation $2^{\{x_1, \dots, x_n\}}$ to denote all sequences whose elements are subsets of $(x_1, \dots, x_n)$.
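As an illustration of Definition 1 (our sketch, not the authors' code), the following checks how far a recurrent cell is from satisfying Eq. 2.3 on a given sequence by comparing final states under randomly sampled permutations; `cell` stands in for any state-update network.

```python
import torch

def final_state(cell, s0, xs):
    """Unroll f(s0, x_1, ..., x_n) as in Eq. 2.2."""
    s = s0
    for x in xs:
        s = cell(x.unsqueeze(0), s)
    return s

def invariance_gap(cell, s0, xs, n_perms=10):
    """Largest distance between final states over sampled permutations (zero if Eq. 2.3 holds for them)."""
    base = final_state(cell, s0, xs)
    gaps = [(final_state(cell, s0, xs[torch.randperm(len(xs))]) - base).norm().item()
            for _ in range(n_perms)]
    return max(gaps)

cell = torch.nn.RNNCell(input_size=1, hidden_size=8)
s0 = torch.zeros(1, 8)
xs = torch.rand(5, 1)
print(invariance_gap(cell, s0, xs))  # typically > 0 for an untrained, order-sensitive RNN
```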
## 3 Compact RNNs for Permutation Invariance

In this section we show the existence of functions that are permutation invariant and can be modeled by a very small RNN, whereas modeling them with a Deep Sets architecture requires significantly more parameters. In what follows we make this argument formal.

**Theorem 4.** For any natural number K > 4, there exists a permutation invariant function that can be implemented by an RNN with 3 hidden neurons, but whose Deep Sets implementation requires Ω(K) neurons.

The theorem says there are cases where an RNN requires far fewer parameters than a Deep Sets architecture, and by standard sample complexity lower bounds this implies that there are distributions for which the RNN requires far fewer samples to learn than Deep Sets. We prove the result using the parity function to demonstrate the gap in model size, and in Section 6 we provide an empirical demonstration.

*Proof.* To prove Theorem 4 one needs to show that for any K there exists a function f such that: (a) f can be implemented with a constant number of neurons by an RNN, and (b) any Deep Sets architecture requires at least K neurons to implement f. Let $n = 2^K$. Given $X = \{0, 1\}$, define the parity function operating over sets of size n:

$$\mathrm{parity}(\{x_1, \dots, x_n\}) = \Big(\sum_{i=1}^{n} x_i\Big) \bmod 2 \qquad (3.1)$$

Next we claim that the parity function can be implemented by an RNN with three hidden neurons and 12 parameters in total. Consider an RNN operating on a sequence $x_1, \dots, x_n \in \{0, 1\}$ with the update rule $s_{t+1} = W_1^T \sigma(W_2 x_t + W_3 s_t + B)$. Setting $W_1 = [1, -1, -1]$, $W_2 = W_3 = [2, 2, 2]$ and $B = [0, -1, -3]$ gives the hidden pre-activations

$$\begin{bmatrix} 2x_t + 2s_t \\ 2x_t + 2s_t - 1 \\ 2x_t + 2s_t - 3 \end{bmatrix} \qquad (3.2)$$

so the update implements the function $\mathrm{ReLU}(a) - \mathrm{ReLU}(a - 1) - \mathrm{ReLU}(a - 3)$ with $a = 2x_t + 2s_t$. This is equivalent to the XOR function, namely $(x_t + s_t) \bmod 2$. Since the state update implements addition modulo 2, it easily follows that the full RNN computes parity: setting $s_0 = 0$ and applying $f(s_0, x_1, \dots, x_n)$ yields the parity of $(x_1, \dots, x_n)$, as required. Note that the implementation uses four weight vectors $W_1, W_2, W_3, B \in \mathbb{R}^3$, which amounts to 12 parameters.

The second part of the proof requires showing that any Deep Sets architecture needs at least K neurons to implement parity over sets of n elements. Recall that a Deep Sets architecture is composed of two functions, $\phi$ and $\rho$ (see Eq. 1.1). We assume $\phi$ and $\rho$ are feed-forward nets with ReLU activations and a fixed number of hidden layers of fixed width; our result can be extended to variable-width networks.

First we argue that WLOG the function $\phi$ can be assumed to be the identity $\phi_I(x) = x$. To see this, recall that there are only two possible x values. We take any Deep Sets implementation $\phi, \rho$ and show that it has an equivalent implementation $\phi_I, \tilde{\rho}$. Denote the two values that $\phi$ takes by $\phi(0) = v_0$ and $\phi(1) = v_1$. After the sum aggregation of the Deep Sets architecture we have:

$$\sum_i \phi(x_i) = n_0 v_0 + n_1 v_1 = (n - n_1) v_0 + n_1 v_1 = n v_0 + n_1 (v_1 - v_0) \qquad (3.3)$$

where $n_0, n_1$ are the numbers of zeros and ones in the sequence $x_1, \dots, x_n$ (so that $n_0 + n_1 = n$). Now note that $\sum_i x_i = n_1$. We can therefore define $\tilde{\rho}(z) = \rho(n v_0 + z (v_1 - v_0))$, and the two implementations are equivalent. From now on we therefore assume $\phi(x) = x$.

Given the above, $\rho : \mathbb{R} \to \mathbb{R}$ is a continuous function which takes $\sum_{i=1}^{n} x_i$ as input and outputs 1 for odd values and 0 for even values. Recall that a ReLU network implements a piecewise linear function, and a network of depth L with r units per layer can model a function with at most $r^L$ linear segments [Montufar et al., 2014]. The function $\rho$ above must have at least n segments, since it switches between the values 0 and 1 n times. Such a network has $L r^2$ parameters. Minimizing $L r^2$ under the condition $r^L \geq n$ gives $L = \log_2 n$ and $r = 2$. Thus the minimum size of a network implementing $\rho$ is $4 \log_2 n = 4K$, proving the result.
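As a sanity check (not code from the paper), here is a NumPy sketch of the 12-parameter parity RNN. The weight magnitudes follow the construction above, while the signs of $W_1$ and $B$ are our reconstruction from the stated ReLU expression, so treat them as an inferred reading of the proof rather than a verbatim copy.

```python
import numpy as np

# Three-neuron parity RNN: s_{t+1} = W1^T ReLU(W2*x_t + W3*s_t + B)
W1 = np.array([1.0, -1.0, -1.0])   # output weights (signs inferred)
W2 = np.array([2.0, 2.0, 2.0])     # input weights
W3 = np.array([2.0, 2.0, 2.0])     # recurrent weights
B = np.array([0.0, -1.0, -3.0])    # biases (signs inferred)

def step(s, x):
    """ReLU(a) - ReLU(a-1) - ReLU(a-3) with a = 2x + 2s, i.e. (x + s) mod 2 for x, s in {0, 1}."""
    return W1 @ np.maximum(W2 * x + W3 * s + B, 0.0)

def rnn_parity(bits):
    s = 0.0                        # s_0 = 0
    for x in bits:
        s = step(s, x)
    return int(round(s))

bits = np.random.randint(0, 2, size=100)
assert rnn_parity(bits) == bits.sum() % 2          # order of the bits does not matter
assert rnn_parity(bits[::-1]) == bits.sum() % 2
```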
## 4 Permutation Invariant Regularization

In the previous section we showed that RNNs can implement certain permutation invariant functions in a compact manner. On the other hand, if we learn an RNN from data, we have no guarantee that it will be permutation invariant. The question is then: how can we learn RNNs that correspond to permutation invariant functions? In this section we present an approach to this problem which relies on a regularization term that encourages permutation invariant RNNs. Intuitively, such a term should be designed so that it is minimized only by permutation invariant RNNs.

### 4.1 Regularizing Towards Subset Permutation Invariance

Our goal is to define a function R(f) that is zero when f is subset-permutation invariant and non-zero otherwise. Following Definition 2, it is natural to use the expected squared distance between the RNN states for all sub-sequence pairs that are required to have the same output; clearly, identical states yield identical outputs. We thus define the regularizer:

$$R_{\mathrm{SUB}}(f) = \mathbb{E}\left[ \left\| f(s_0, x'_{\hat{\pi}_1}, \dots, x'_{\hat{\pi}_m}) - f(s_0, x'_{\pi_1}, \dots, x'_{\pi_m}) \right\|^2 \right] \qquad (4.1)$$

where the expectation is taken with respect to D, the permutations $\hat{\pi}, \pi \in S_m$, and the subsequence sampling.

### 4.2 Pair Permutation Invariance Regularization

Calculating $R_{\mathrm{SUB}}$ exactly takes exponential time, so we must resort to approximations. The simplest approach would be to randomly select a subset and two permutations and replace the expectation in $R_{\mathrm{SUB}}$ with its empirical average. However, as we show next, there is a simpler approach.

We suggest an alternative regularizer, $R_{\mathrm{SIRE}}$, that also vanishes for subset-permutation-invariant models but avoids the permutation sampling of $R_{\mathrm{SUB}}$. Our key insight is that, because of the recurrent nature of the RNN, one only needs to verify invariance to the order in which two elements are appended to an existing sub-sequence. We next state the key result that facilitates the new regularizer.

**Theorem 5.** An RNN is subset-permutation invariant with respect to D if for all $x_1, x_2 \in X$ and all $s \in S_{D,\Theta}$ it holds that:

$$f(s, x_1, x_2) = f(s, x_2, x_1) \qquad (4.2)$$

To prove Theorem 5 we make use of the following lemma.

**Lemma 6.** Assume Equation 4.2 holds under the conditions of Theorem 5. Then for all $x_1, \dots, x_t \in X$ and all $i \in \{1, \dots, t - 1\}$:

$$f(s_0, x_1, \dots, x_{i-1}, x_i, x_{i+1}, x_{i+2}, \dots, x_t) = f(s_0, x_1, \dots, x_{i-1}, x_{i+1}, x_i, x_{i+2}, \dots, x_t)$$

**Corollary 7.** Assume the condition of Lemma 6 holds. Then for a sequence $x_1, \dots, x_t$, any two elements $x_i$ and $x_j$ can be swapped without changing the value of the resulting state.

The proofs of Lemma 6 and Corollary 7 are provided in the Appendix.

*Proof of Theorem 5.* Given an arbitrary permutation $\pi \in S_t$, it suffices to show:

$$f(s_0, x_1, \dots, x_t) = f(s_0, x_{\pi_1}, \dots, x_{\pi_t}) \qquad (4.3)$$

Denote by i the first index for which $i \neq \pi_i$. Then there exists $j > i$ such that $\pi_j = i$. By Corollary 7 we can swap $x_{\pi_i}$ and $x_{\pi_j}$. Repeating this process until the identity permutation is reached establishes Equation 4.3.

#### 4.2.1 The SIRE Regularizer

The result above implies that testing for subset-permutation invariance is equivalent to testing the effect of the order in which two inputs are appended to an existing state. This immediately suggests a regularizer that vanishes if and only if the RNN f is subset-permutation invariant. We refer to it as the Subset Invariance Regularizer (SIRE), defined as:

$$R_{\mathrm{SIRE}}(f) = \mathbb{E}_{\substack{x_1, x_2 \sim D \\ s \sim S_{D,\Theta}}} \left[ \left\| f(s, x_1, x_2) - f(s, x_2, x_1) \right\|^2 \right] \qquad (4.4)$$

The key advantage of SIRE over $R_{\mathrm{SUB}}$ is that it requires sampling sub-sequences but not permutations. Empirically, this translates to much faster learning when using $R_{\mathrm{SIRE}}$ (see Appendix). In practice we of course do not sum over all states $s \in S_{D,\Theta}$, as that would require all permutations over the training data; instead, we randomly sample subsets of training sequences and estimate SIRE via an average over those. In summary, we propose learning an RNN by minimizing the regular training loss (e.g., squared error for regression or cross-entropy for classification) plus the regularization term $R_{\mathrm{SIRE}}(f)$ of Equation 4.4, estimated via sampling. As with any regularization scheme, $R_{\mathrm{SIRE}}(f)$ may be multiplied by a regularization coefficient λ.
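The following PyTorch sketch is our illustration of one way to estimate $R_{\mathrm{SIRE}}$ from Eq. 4.4; it is not the authors' released code, and the particular scheme for sampling a reachable state s is an assumption on our part: run the cell on a random subset of a training sequence to obtain s, then penalize the distance between the states reached by appending two elements in either order.

```python
import torch

def sire_penalty(cell, s0, batch, n_samples=8):
    """Monte-Carlo estimate of R_SIRE (Eq. 4.4) for an RNNCell-like `cell`.

    batch: tensor of shape (num_seqs, seq_len, input_dim); s0: (1, hidden_dim).
    """
    total = 0.0
    for _ in range(n_samples):
        seq = batch[torch.randint(len(batch), (1,)).item()]
        # Sample a reachable state s in S_{D,Theta}: run on a random subset of the sequence.
        k = torch.randint(0, len(seq), (1,)).item()
        s = s0
        for x in seq[torch.randperm(len(seq))[:k]]:
            s = cell(x.unsqueeze(0), s)
        # Appending two elements in either order should give the same state.
        i, j = torch.randint(len(seq), (2,))
        x1, x2 = seq[i].unsqueeze(0), seq[j].unsqueeze(0)
        s12 = cell(x2, cell(x1, s))
        s21 = cell(x1, cell(x2, s))
        total = total + ((s12 - s21) ** 2).sum()
    return total / n_samples

# Training objective: loss = task_loss + lam * sire_penalty(cell, s0, batch)
```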
## 5 Related Work

In recent years the question of invariances and network architecture has attracted considerable attention, in particular for various forms of permutation invariance. Several works have focused on characterizing architectures that are permutation invariant by design [Zaheer et al., 2017, Vinyals et al., 2016, Qi et al., 2017, Hartford et al., 2018, Lee et al., 2019, Zellers et al., 2018]. While these works address invariance for sets, there has also been work on invariance of computations on graphs [Maron et al., 2019, Herzig et al., 2018]. There, the focus is on problems that take a graph as input, and the goal is for the output to be invariant to all equivalent representations of the graph.

The line of work most closely related to ours is Murphy et al. [2018], which views a permutation invariant function as the average, over all possible orderings, of the outputs of a permutation-sensitive function. As this approach is intractable, the authors suggest several efficient approximations. Our work is conceptually different from the above: those approaches are invariant by design, either explicitly, by implementing a permutation invariant pooling operator [Zaheer et al., 2017], or by approximating such a pooling layer [Murphy et al., 2018]. We take a different approach and do not attempt to obtain a network that is strictly permutation invariant. Instead, we control the variance-invariance spectrum via a regularization term.

## 6 Experiments

To evaluate the empirical effectiveness of our regularization scheme, we compare it to other methods for learning permutation invariant models. We also demonstrate that our regularization scheme is effective in semi permutation invariant settings.

Baselines: We compare our method (SIRE regularization) to two permutation invariance learning methods: Deep Sets [Zaheer et al., 2017] and the π-SGD algorithm from the Janossy Pooling paper [Murphy et al., 2018]. We used the code provided by Murphy et al. [2018] for the experiments over digits, and the code by Lee et al. [2019] for the point cloud experiment. Cross-validation was used for learning all architectures.

### 6.1 Learning Parity

In Theorem 4 we showed that RNNs can implement the parity function more efficiently than Deep Sets. Here we provide an empirical demonstration of this fact. As training data we take Boolean sequences of length at most ten, labeled by their parity. We train both RNNs and Deep Sets on these, and test on sequences of length up to 100. Figure 1 shows the results: the RNN indeed learns the correct parity function, whereas Deep Sets does not. For both networks we used the minimal width required to perfectly fit the training data. For full details see the Appendix.

### 6.2 Arithmetic Tasks on Sequences of Integers

To evaluate our regularization approach, we consider three tasks used in Murphy et al. [2018].³ In all tasks the input is a sequence $(x_1, \dots, x_n)$ with $x_i \in \{0, \dots, 99\}$. The tasks are: (1) sum: the label is the sum of all elements in the sequence; (2) range: the label is the difference between the maximum and minimum elements in the sequence; (3) variance: the label is the empirical variance of the sequence, $\frac{1}{n}\sum_{j=1}^{n}\big(x_j - \frac{1}{n}\sum_{i=1}^{n} x_i\big)^2$. In Murphy et al. [2018] all tasks were evaluated with n = 5; here we perform all experiments with n = 5, 10, ..., 30.
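For concreteness, a small NumPy sketch (ours, not the benchmark's released code) of how labeled examples for the three tasks described above can be generated.

```python
import numpy as np

def make_example(n, task, rng=np.random.default_rng(0)):
    """Generate one (sequence, label) pair for the sum / range / variance tasks."""
    x = rng.integers(0, 100, size=n)            # x_i in {0, ..., 99}
    if task == "sum":
        y = x.sum()
    elif task == "range":
        y = x.max() - x.min()
    elif task == "variance":
        y = ((x - x.mean()) ** 2).mean()        # empirical (population) variance
    else:
        raise ValueError(f"unknown task: {task}")
    return x, y

seq, label = make_example(n=10, task="range")
```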
³ The original benchmark includes two more tasks, which we omit since all models achieved near-perfect performance on them.

Figure 1: Test accuracy as a function of sequence length for learning parity, using Deep Sets and RNNs.

Figure 2: Test prediction accuracy (zero-one error) of sum (left) and range (center). For the variance experiment we report mean square error (as in Murphy et al. [2018]).

Figure 2 shows the average accuracy of each model as a function of the sequence length.⁴ The range task turns out to be less challenging than the sum task. This is probably because the ground truth for a sequence of any size is bounded by 99 (since max x_i ≤ 99 and min x_i ≥ 0). In contrast, in the sum task the output is bounded by 99n, which makes the task harder for longer sequences. Results on the sum task clearly show that $R_{\mathrm{SIRE}}$ generalizes better to longer sequences than the baselines. For the variance task we report RMSE values;⁵ the graph shows that SIRE outperforms Deep Sets. We omit π-SGD as a baseline in the variance task, as it failed to converge to low training error with all configurations explored, resulting in poor performance.

⁴ For each sequence length we perform cross-validation to select the best configuration and report the average of 20 runs for sum and the average of 3 runs for range and variance.
⁵ Lower is better.

### 6.3 Point Clouds

We evaluate our method on a 40-way point cloud classification task using ModelNet40 [Chang et al., 2015]. A point cloud [Chang et al., 2015, Wu et al., 2015] consists of a set of n vectors in $\mathbb{R}^3$ and has many applications in robotics, where LIDAR sensors are common. As point clouds are represented as a list of vectors with no natural order, they are ideal candidates for evaluating permutation invariant methods. Experiment details are as in Zaheer et al. [2017].

| Method    | 100 pts | 1000 pts | 5000 pts |
|-----------|---------|----------|----------|
| Deep Sets | 0.825   | 0.872    | 0.90     |
| SIRE      | 0.835   | 0.878    | 0.899    |

Table 1: Point cloud classification results.

In Table 1 we report results for n = 100, 1000, 5000. For n = 100 and n = 1000 SIRE outperforms Deep Sets, and it achieves comparable results for n = 5000.⁶

### 6.4 Arithmetic Semi Permutation Task

One advantage of our method is the possibility to tune the level of invariance an RNN should capture. This may be useful in real-world datasets where the data is permutation invariant only to some extent. For example, human activity recognition signals often correspond to repetitive actions such as walking or running. Another example is classification of ECG readings, which are also characterized by periodic signals.

|          | seq. len=10    | seq. len=15    | seq. len=20    |
|----------|----------------|----------------|----------------|
| λ = 0.0  | 0.9346 (0.006) | 0.9461 (0.001) | 0.9678 (0.005) |
| λ = 0.01 | 0.9584 (0.008) | 0.9658 (0.008) | 0.9780 (0.004) |

Table 2: Learning semi permutation invariant models on the half-range problem. Test accuracy for two regularization coefficients and different sequence lengths. Note that setting λ = 0 amounts to a vanilla RNN without regularization.

Here we demonstrate that our regularization method can capture such soft permutation invariance by defining the toy task of half-range. The data is a sequence of integers generated in a similar fashion to the range task above, but the target is not completely order invariant: the goal is to predict the difference between the maximum integer of the first half of the sequence and the minimum integer of the second half. Formally, given $(x_1, \dots, x_k)$, half-range is defined as

$$\mathrm{HR}(x_1, \dots, x_k) = \max\left\{ x_1, \dots, x_{\lfloor k/2 \rfloor} \right\} - \min\left\{ x_{\lfloor k/2 \rfloor + 1}, \dots, x_k \right\}$$
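A tiny sketch of the half-range target as we read the formula above (taking ⌊k/2⌋ as integer division); this is for illustration only.

```python
def half_range(xs):
    """HR(x_1, ..., x_k): max of the first half minus min of the second half."""
    k = len(xs)
    return max(xs[: k // 2]) - min(xs[k // 2 :])

assert half_range([3, 9, 1, 7, 2, 8]) == 9 - 2
```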
Clearly half-range is not permutation invariant. Nevertheless, its output is insensitive to many of the possible permutations, so it makes sense to learn it by regularizing towards invariance using the SIRE regularizer. Results for regularized and un-regularized models are shown in Table 2. The regularized version consistently outperforms the standard RNN, showing that the notion of semi-invariance is empirically effective in this case.

### 6.5 Locally Perturbed MNIST

Since the introduction of the MNIST dataset [LeCun et al., 1998], it has been used as a starting point for many variations. For example, Larochelle et al. [2007] created Rotated-MNIST, a more challenging version of MNIST where digits are rotated by a random angle. Another example is MNIST-C [Mu and Gilmer, 2019], where a corrupted test set was created to evaluate out-of-distribution robustness of computer vision models. Yet another variant of MNIST is Perturbed MNIST [Goodfellow et al., 2013, Srivastava et al., 2013], where random permutations are applied to the digits. Here we present Locally Perturbed MNIST, a variant of MNIST where pixels are randomly permuted with nearby pixels (see Figure 3). Our goal is to test the performance of our method on data that exhibits some degree of permutation invariance. We believe that such structure is also present in problems such as activity recognition and document analysis.

⁶ We also evaluated Set Transformer [Lee et al., 2019], using the official implementation (https://github.com/juho-lee/set_transformer). We were not able to reproduce the reported results for the Set Transformer model, and thus do not report results for it. In addition we evaluated π-SGD, but it exhibited optimization difficulties for lengths greater than n = 100, resulting in poor results compared to the other baselines.

Figure 3: Locally Perturbed MNIST digits. See Appendix for complete details on the perturbation applied.

Since the spatial structure of the image is partially preserved, a relevant baseline is a CNN.⁷ The CNN achieves an accuracy of 0.963 on Locally Perturbed MNIST. Learning with a GRU achieves an accuracy of 0.833, and adding $R_{\mathrm{SIRE}}$ to the same architecture boosts performance to 0.977. These findings suggest that $R_{\mathrm{SIRE}}$ is effective for learning data with partial invariance properties.

⁷ We use a simple CNN with 2 convolutional layers followed by 2 fully connected layers. This architecture achieves an accuracy of 0.9963 on regular MNIST.

## 7 Discussion

We have introduced a novel approach for modeling permutation invariant learning problems using recurrent architectures. While RNNs are generally order dependent, we propose a regularization term and prove that when this term is zero, the network is permutation invariant. We further discuss the permutation invariant parity function, for which fixed-aggregation methods such as Deep Sets need a large number of parameters, whereas simple recurrent models can implement parity with O(1) parameters. Empirically, we show that recurrent models can easily solve tasks on which Deep Sets models do not do as well. We further demonstrate that our method scales better to larger set sizes than the recurrent method of Murphy et al. [2018].

In addition, we consider a setting where the data is only partially permutation invariant. This property cannot be captured by architectures that are fully permutation invariant by design, and therefore such cases are typically handled with RNNs. We show that adding our regularization term helps in such semi permutation invariant problems.
A common approach to time series classification is to perform some form of feature engineering with a sliding window and then treat the resulting features as a set, without significance to order. Our method may prove useful in such scenarios. An interesting theoretical question which we do not address is that of optimization behavior for different architectures. For example, we noticed empirically that our regularization method leads to lower optimization error than an RNN without regularization. We hypothesize that this is because the regularization operates on shorter sequences, and can thus alleviate optimization issues related to vanishing gradients. We also note that our regularizer does not require labeled data, and can thus employ unlabeled data in a semi-supervised setting.

Taken together, our results demonstrate the potential of recurrent architectures for permutation invariant and partially invariant modeling. They also highlight the importance of sample complexity considerations when learning invariant functions: different implementations of an invariant function may require a different number of parameters and thus result in different sample complexities. In future work we plan to provide a more comprehensive theoretical account of these phenomena.

## 8 Broader Impact

In this work we analyze an approach for learning recurrent models in cases where there is an underlying permutation invariance. The method can improve sequence labeling systems. We do not see any ethical concerns with this contribution. Societal aspects are positive, for example in terms of improving the accuracy of models in healthcare.

## 9 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080).

## References

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

J. Hartford, D. Graham, K. Leyton-Brown, and S. Ravanbakhsh. Deep models of interactions across sets. In International Conference on Machine Learning, pages 1914-1923, 2018.

R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems, pages 7211-7221, 2018.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473-480, 2007.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744-3753, 2019.

H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks. In 7th International Conference on Learning Representations, ICLR, 2019.

G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.

N. Mu and J. Gilmer. MNIST-C: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337, 2019.

R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900, 2018.

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.

R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems, pages 2310-2318, 2013.

O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep Sets. In Advances in Neural Information Processing Systems, pages 3391-3401, 2017.

R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural Motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831-5840, 2018.