# deep_learning_with_sets_and_point_clouds__3ab40250.pdf Under review as a conference paper at ICLR 2017 DEEP LEARNING WITH SETS AND POINT CLOUDS Siamak Ravanbakhsh, Jeff Schneider & Barnab as P oczos School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA {mravanba,jeff.schneider,bapoczos}@cs.cmu.edu We introduce a simple permutation equivariant layer for deep learning with set structure. This type of layer, obtained by parameter-sharing, has a simple implementation and linear-time complexity in the size of each set. We use deep permutation-invariant networks to perform point-could classification and MNISTdigit summation, where in both cases the output is invariant to permutations of the input. In a semi-supervised setting, where the goal is make predictions for each instance within a set, we demonstrate the usefulness of this type of layer in set-outlier detection as well as semi-supervised learning with clustering sideinformation. 1 INTRODUCTION Recent progress in deep learning (Le Cun et al., 2015) has witnessed its application to structured settings, including graphs (Bruna et al., 2013; Duvenaud et al., 2015) groups (Gens & Domingos, 2014; Christopher, 2014; Cohen & Welling, 2016), sequences and hierarchies (Irsoy & Cardie, 2014; Socher et al., 2013). Here, we introduce a simple permutation-equivariant layer for deep learning with set structure, where the primary dataset is a collection of sets, possibly of different sizes. Note that each instance may have a structure of its own, such as graph, image, or another set. In typical machine-learning applications, iid assumption implies that the entire data-set itself has a set structure. Therefore, our special treatment of the set structure is only necessary due to multiplicity of distinct yet homogeneous (data)sets. Here, we show that a simple parameter-sharing scheme enables a general treatment of sets within supervised and semi-supervised settings. In the following, after introducing the set layer in Section 2, we explore several novel applications. Section 3 studies supervised learning with sets that requires invariance to permutation of inputs. Section 3.1 considers the task of summing multiple MNIST digits, and Section 3.2 studies an important application of sets in representing low-dimensional point-clouds. Here, we show that deep networks can successfully classify objects using their point-cloud representation. Section 4 presents numerical study in semi-supervised setting where the output of the multi-layer network is equivariant to the permutation of inputs. We use permutation-equivariant layer to perform outlier detection on Celeb A face dataset in Section 4.1 and improve galaxy red-shift estimates using its clustering information in Section 5. 2 PERMUTATION-EQUIVARIANT LAYER Let xn X denote use x = [x1, . . . , x N] to denote a vector of xn instances. Here, xn, could be a feature-vector, an image or any other structured object. Our goal is to design neural network layers that are indifferent to permutations of instances in x. Achieving this goal amounts to treating x as a set rather than a vector. The function f : X N YN is equivariant to the permutation of its inputs iff f(πx) = πf(x) π SN where the symmetric group SN is the set of all permutation of indices 1, . . . , N. Similarly, the function f : ℜN ℜis invariant to permutation of its inputs i.e., a.k.a. a symmetric function (David Under review as a conference paper at ICLR 2017 Figure 1: Classification accuracy of different schemes in predicting the sum of a (left) N=3 and (right) N=6 MNIST digits without access to individual image labels. The training set was fixed to 10,000 sets. et al., 1966) iff f(πx) = f(x) π SN. Here, the action of π on a vector x ℜN can be represented by a permutation matrix. With some abuse of notation, we use π {0, 1}N N to also denote this matrix. Given two permutation equivariance function f : X N YN and g : YN ZN, their composition is also permutation-equivariance; this is because g(f(πx)) = g(πf(x)) = πg(f(x)). Consider the standard neural network layer fΘ(x) .= σ(Θx) Θ ℜN N (1) where Θ is the weight vector and σ : ℜ ℜis a nonlinearity such as sigmoid function. σ : ℜN ℜN is the point-wise application of σ to its input vector. The following theorem states the necessary and sufficient conditions for permutation-equivariance in this type of function. Theorem 2.1. The function fΘ : ℜN ℜN as defined in Eq. (1) is permutation equivariant iff all the off-diagonal elements of Θ are tied together and all the diagonal elements are equal as well, Θ = λI + γ (11T) λ, γ ℜ 1 = [1, . . . , 1]T ℜN (2) where I is the identity matrix.1 This function is simply a non-linearity applied to a weighted combination of I) its input Ix and; II) the sum of input values (11T)x. Since summation does not depend on the permutation, the layer is permutation-equivariant. Therefore we can manipulate the operations and parameters in this layer, for example to get another variation f(x) .= σ(λIx γ(max n x)1) (3) where the max operation over elements of the set is (similar to summation) commutative and using γ instead of +γ amounts to a reparametrization. In practice using this variation performs better in some applications. This may be due to the fact that for λ = γ, the input to the non-linearity is max-normalized. For multiple input-output channels, we may speed up the operation of the layer using matrix multiplication. Suppose we have K input channels corresponding to K features for each instance in the set with a set of size N, and K output channels. Here, x ℜN K and fΘ : ℜN K ℜN K . The permutation-equivariant layer parameters are Λ, Γ ℜK,K (replacing λ and γ in Eq. (2)). The output of this layer becomes y = σ xΛ + 1 1TxΓ Similarly, the multiple input-output channel variation of the Eq. (3) is y = σ xΛ 1xmaxΓ (4) where xmax = (maxn x) ℜ1 K is a row-vector of maximum value of x ℜN K over the set dimension. In practice, we may further reduce the number of parameters in favor of better generalization by factoring Γ and Λ and keeping a single λ ℜK,K y = σ β + x 1(max n x) Γ (5) 1See the Appendix A for the proof. Under review as a conference paper at ICLR 2017 Figure 2: Examples for 8 out of 40 object classes (column) in the Model Net40. Each point-cloud is produces by sampling 1000 particles from the mesh representation of the original Meodel Net40 instances. Two pointclouds in the same column are from the same class. The projection of particles into xy, zy and xz planes are added for better visualization. where β ℜK is a bias parameter. This final variation of permutation-equivariant layer is simply a fully connected layer where input features are max-normalized within each set. With multiple input-output channels, the complexity of this layer for each set is O(N K K ). Subtracting the mean or max over the set also reduces the internal covariate shift (Ioffe & Szegedy, 2015) and we observe that for deep networks (even using tanh activation), batch-normalization is not required. When applying dropout (Srivastava et al., 2014) to regularize permutation-equivariant layers with multiple output channels, it is often beneficial to simultaneously dropout the channels for all instances within a set. In particular, when set-members share similar features, independent dropout effectively does not regularize the model as the the network learns to replace the missing features from other set-members. In the remainder of this paper, we demonstrate how this simple treatment of sets can solve novel and non-trivial problems that occasionally have no alternative working solutions within deep learning. 3 SUPERVISED LEARNING The permutation-equivariant layers that we introduced so far are useful for semi-supervised learning (or transductive) setting, where we intend to predict a value per each instance in every set. In supervised (or inductive) setting the task is to make a prediction for each set (rather than instances within them), and we require permutation invariance of f : X N Y. A pooling operation over the set-dimension can turn any permutation equivariant function f : X N YN, permutation invariant f(x) .= L n f(x). Here is any commutative operation such as summation or maximization. In a related work Chen et al. (2014) construct deep permutation invariant features by pairwise coupling of features at the previous layer, where fi,j([xi, xj]) .= [|xi xj|, xi + xj] is invariant to transposition of i and j.2 As we see shortly in Section 3.1, in the supervised setting, even simple application of setpooling, without max-normalization of Eq. (5) performs very well in practice. However, in semisupervised setting since there is no pooling operation, the permutation invariant layer requires maxnormalization in order to obtain the required information about the context of each instance. 3.1 PREDICTING THE SUM OF MNIST DIGITS MNIST dataset (Le Cun et al., 1998) contains 70,000 instances of 28 28 grey-scale stamps of digits in {0, . . . , 9}. We randomly sample a subset of N images from this dataset to build 10,000 sets of training and 10,000 sets of validation images, where the set-label is the sum of digits in that set (i.e., individual labels per image is unavailable). 2Due to change in the number of features/channels in each layer this approach cannot produce permutationequivariant layers. Also, this method requires a graph to guide the multi-resolution partitioning of the nodes, which is then used to define pairing of features in each layer. Under review as a conference paper at ICLR 2017 Table 1: Classification accuracy and the (size of) representation used by different methods on the Model Net40 dataset. model instance size representation accuracy set-layer + transformation (ours) 5000 3 point-cloud 90 .3% set-layer (ours) 1000 3 point-cloud 87 1% set-pooling only (ours) 1000 3 point-cloud 83 1% set-layer (ours) 100 3 point-cloud 82 2% KNN graph-convolution (ours) 1000 (3 + 8) directed 8-regular graph 58 2% 3DShape Nets (Wu et al., 2015) 303 voxels (using convolutional deep belief net) 77% Deep Pano (Shi et al., 2015) 64 160 panoramic image (2D CNN + angle-pooling) 77.64% Vox Net (Maturana & Scherer, 2015) 323 voxels (voxels from point-cloud + 3D CNN) 83.10% MVCNN (Su et al., 2015) 164 164 12 multi-vew images (2D CNN + view-pooling) 90.1% VRN Ensemble (Brock et al., 2016) 323 voxels (3D CNN, variational autoencoder) 95.54% 3D GAN (Wu et al., 2016) 643 voxels (3D CNN, generative adversarial training) 83.3% In our first experiment, each set contains N = 3 images, and the set label is a number between 0 y 3 9 = 27. We then considered four different models for predicting the sum: I) Concatenating instances along the horizontal axis. II) Stacking images in each set as different input channels. III) Using set-pooling without max-normalization. IV) Using the permutation equivariant layer of Eq. (5) with set-pooling. All models are defined to have similar number of layers and parameters; see Appendix B.1 for details. The output of all models is a (9N + 1)-way softmax, predicting the sum of N digits. Figure 1(left) show the prediction accuracy over the validation-set for different models for N = 3. We see that using the set-layer performs the best. However, interestingly, using set-pooling alone produces similarly good results. We also observe that concatenating the digits eventually performs well, despite its lack of invariance. This is because due to the sufficiently large size of the dataset most permutations of length three appear in the training set. However, as we increase the size of each set to N = 6, permutation invariance becomes crucial; see Figure 1(right). We see that using the default dropout rate of 20%, the model simply memorizes the input instances (indicated by discrepancy of training/validation error) and by increasing this dropout rate, the model simply predicts values close to the mean value. However, the permutation-invariant layer learns to predict the sum of six digits with > 80% accuracy, without having access to individual image labels. Performance using set-pooling alone is similar. 3.2 POINT CLOUD CLASSIFICATION A low-dimensional point-cloud is a set of low-dimensional vectors. This type of data is frequently encountered in various applications from robotics and vision to cosmology. In these applications, point-cloud data is often converted to voxel or mesh representation at a preprocessing step (e.g., Maturana & Scherer, 2015; Ravanbakhsh et al., 2016; Lin et al., 2004). Since the output of many range sensors such as Li DAR which are extensively used in applications such as autonomous vehicles is in the form of point-cloud, direct application of deep learning methods to point-cloud is highly desirable. Moreover, when working with point-clouds rather than voxelized 3D objects, it is easy to apply transformations such as rotation and translation as differentiable layers at low cost. Here, we show that treating the point-cloud data as a set, we can use the set-equivariant layer of Eq. (5) to classify point-cloud representation of a subset of Shape Net objects (Chang et al., 2015), called Model Net40 (Wu et al., 2015). This subset consists of 3D representation of 9,843 training and 2,468 test instances belonging to 40 classes of objects; see Fig. 2. We produce point-clouds with 100, 1000 and 5000 particles each (x, y, z-coordinates) from the mesh representation of objects using the point-cloud-library s sampling routine (Rusu & Cousins, 2011). Each set is normalized by the initial layer of the deep network to have zero mean (along individual axes) and unit (global) Under review as a conference paper at ICLR 2017 variance. Additionally we experiment with the K-nearest neighbor graph of each point-cloud and report the results using graph-convolution; see Appendix B.3 for model details. Table 1 compares our method against the competition.3 Note that we achieve our best accuracy using 5000 3 dimensional representation of each object, which is much smaller than most other methods. All other techniques use either voxelization or multiple view of the 3D object for classification. Interestingly, variations of view/angle-pooling (e.g., Su et al., 2015; Shi et al., 2015) can be interpreted as set-pooling where the class-label is invariant to permutation of different views. The results also shows that using fully-connected layers with set-pooling alone (without max-normalization over the set) works relatively well. We see that reducing the number of particles to only 100, still produces comparatively good results. Using graph-convolution is computationally more challenging and produces inferior results in this setting. The results using 5000 particles is also invariant to small changes in scale and rotation around the z-axis; see Appendix B.3 for details. Figure 3: Each box is the particle-cloud maximizing the activation of a unit at the firs (top) and second (bottom) permutation-equivariant layers of our model. Two images of the same column are two different views of the same point-cloud. Features. To visualize the features learned by the set layers, we used Adamax (Kingma & Ba, 2014) to locate 1000 particle coordinates maximizing the activation of each unit.4 Activating the tanh units beyond the second layer proved to be difficult. Figure 3 shows the particle-cloud-features learned at the first and second layers of our deep network. We observed that the first layer learns simple localized (often cubic) point-clouds at different (x, y, z) locations, while the second layer learns more complex surfaces with different scales and orientations. 4 SEMI-SUPERVISED LEARNING In semi-supervised or transductive learning, some/all instances within each training set are labelled. Our goal is to make predictions for individual instances within a test set. Therefore, the permutation equivariant layer leverages the interaction between the set-members to label individual member. Note that in this case, we do not perform any pooling operation over the set dimension of the data. 4.1 SET ANOMALY DETECTION The objective here is for the deep model to find the anomalous face in each set, simply by observing examples and without any access to the attribute values. Celeb A dataset (Liu et al., 2015) contains 3The error-bar on our results is due to variations depending on the choice of particles during test time and it is estimated over three trials. 4We started from uniformly distributed set of particles and used a learning rate of .01 for Adamax, with first and second order moment of .1 and .9 respectively. We optimized the input in 105 iterations. The results of Fig. 3 are limited to instances where tanh units were successfully activated. Since the input at the first layer of our deep network is normalized to have a zero mean and unit standard deviation, we do not need to constrain the input while maximizing unit s activation. Under review as a conference paper at ICLR 2017 Figure 4: Each row shows a set, constructed from Celeb A dataset, such that all set members except for an outlier, share at least two attributes (on the right). The outlier is identified with a red frame. The model is trained by observing examples of sets and their anomalous members, without access to the attributes. The probability assigned to each member by the outlier detection network is visualized using a red bar at the bottom of each image. The probabilities in each row sum to one. See Appendix B.2 for more examples. 202,599 face images, each annotated with 40 boolean attributes. We use 64 64 stamps and using these attributes we build 18,000 sets, each containing N = 16 images (on the training set) as follows: after randomly selecting two attributes, we draw 15 images where those attributes are present and a single image where both attributes are absent. Using a similar procedure we build sets on the test images. No individual person s face appears in both train and test sets. Our deep neural network consists of 9 2D-convolution and max-pooling layers followed by 3 permutation-equivariant layers and finally a softmax layer that assigns a probability value to each set member (Note that one could identify arbitrary number of outliers using a sigmoid activation at the output.) Our trained model successfully finds the anomalous face in 75% of test sets. Visually inspecting these instances suggests that the task is non-trivial even for humans; see Fig. 4. For details of the model, training and more identification examples see Appendix B.2. As a baseline, we repeat the same experiment by using a set-pooling layer after convolution layers, and replacing the permutation-equivariant layers with fully connected layers, with the same number of hidden units/output-channels, where the final layer is a 16-way softmax. The resulting network shares the convolution filters for all instances within all sets, however the input to the softmax is not equivariant to the permutation of input images. Permutation equivariance seems to be crucial here as the baseline model achieves a training and test accuracy of 6.3%; the same as random selection. 5 IMPROVED RED-SHIFT ESTIMATION USING CLUSTERING INFORMATION An important regression problem in cosmology is to estimate the red-shift of galaxies, corresponding to their age as well as their distance from us (Binney & Merrifield, 1998). Two common types of observation for distant galaxies include a) photometric and b) spectroscopic observations, where the latter can produce more accurate red-shift estimates. One way to estimate the red-shift from photometric observations is using a regression model (Connolly et al., 1995). We use a multi-layer Perceptron for this purpose and use the more accurate spectroscopic red-shift estimates as the ground-truth. As another baseline, we have a photometric redshift estimate that is provided by the catalogue and uses various observations (including clustering information) to estimate individual galaxy-red-shift. Our objective is to use clustering information of the galaxies to improve our red-shift prediction using the multi-layer Preceptron. Note that the prediction for each galaxy does not change by permuting the members of the galaxy cluster. Therefore, we can treat each galaxy cluster as a set and use permutation-equivariant layer to estimate the individual galaxy red-shifts. Under review as a conference paper at ICLR 2017 For each galaxy, we have 17 photometric features 5 from the red Ma PPer galaxy cluster catalog (Rozo & Rykoff, 2014), which contains photometric readings for 26,111 red galaxy clusters. In this task in contrast to the previous ones, sets have different cardinalities; each galaxy-cluster in this catalog has between 20 300 galaxies i.e., x ℜN(c) 17, where N(c) is the cluster-size. See Fig. 5(a) for distribution of cluster sizes. The catalog also provides accurate spectroscopic red-shift estimates for a subset of these galaxies as well as photometric estimates that uses clustering information. Fig. 5(b) reports the distribution of available spectroscopic red-shift estimates per cluster. We randomly split the data into 90% training and 10% test clusters, and use the following simple architecture for semi-supervised learning. We use four permutation-equivariant layers with 128, 128, 128 and 1 output channels respectively, where the output of the last layer is used as red-shift estimate. The squared loss of the prediction for available spectroscopic red-shifts is minimized.6 Fig. 5(c) shows the agreement of our estimates with spectroscopic readings on the galaxies in the test-set with spectroscopic readings. The figure also compares the photometric estimates provided by the catalogue (see Rozo & Rykoff, 2014), to the ground-truth. As it is customary in cosmology literature, we report the average scatter |zspec z| 1+zspec , where zspec is the accurate spectroscopic measurement and z is a photometric estimate. The average scatter using our model is .023 compared to the scatter of .025 in the original photometric estimates for the red Ma PPer catalog. Both of these values are averaged over all the galaxies with spectroscopic measurements in the test-set. We repeat this experiment, replacing the permutation-equivariant layers with fully connected layers (with the same number of parameters) and only use the individual galaxies with available spectroscopic estimate for training. The resulting average scatter for multi-layer Perceptron is .026, demonstrating that using clustering information indeed improves photometric red-shift estimates. Figure 5: application of permutation-equivariant layer to semi-supervised red-shift prediction using clustering information: a) distribution of cluster (set) size; b) distribution of reliable red-shift estimates per cluster; c) prediction of red-shift on test-set (versus ground-truth) using clustering information as well as Red Ma PPer photometric estimates (also using clustering information). We introduced a simple parameter-sharing scheme to effectively achieve permutation-equivariance in deep networks and demonstrated its effectiveness in several novel supervised and semi-supervised tasks. Our treatment of set structure also generalizes various settings in multi-instance learning (Ray et al., 2011; Zhou et al., 2009). In addition to our experimental settings, the permutation-invariant layer can be used for distribution regression and classification which have become popular recently (Szabo et al., 2016). In our experiments with point-cloud data we observed the model to be robust to the variations in the number of particles in each cloud, suggesting the usefulness of our method in the general setting of distribution regression where the number of samples should not qualitatively affect our representation of a distribution. We leave further investigation of this direction to future work. 5We have a single measurement for each u,g,r, i and z band as well as measurement error bars, location of the galaxy in the sky, as well as the probability of each galaxy being the cluster center. We do not include the information regarding the richness estimates of the clusters from the catalog, for any of the methods, so that baseline multi-layer Preceptron is blind to the clusters. 6We use mini-batches of size 128, Adam (Kingma & Ba, 2014), with learning rate of .001, β1 = .9 and β2 = .999. All layers except for the last layer use Tanh units and simultaneous dropout with 50% dropout rate. Under review as a conference paper at ICLR 2017 ACKNOWLEDGEMENT We would like to thank Francois Lanusse for the pointing us to the red Ma PPer dataset and the anonymous reviewers as well as Andrew Wagner for valuable feedback. Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. ar Xiv preprint ar Xiv:1603.04467, 2016. James Binney and Michael Merrifield. Galactic astronomy. Princeton University Press, 1998. Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. ar Xiv preprint ar Xiv:1608.04236, 2016. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Le Cun. Spectral networks and locally connected networks on graphs. ar Xiv preprint ar Xiv:1312.6203, 2013. Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. ar Xiv preprint ar Xiv:1512.03012, 2015. Xu Chen, Xiuyuan Cheng, and St ephane Mallat. Unsupervised deep haar scattering on graphs. In Advances in Neural Information Processing Systems, pp. 1709 1717, 2014. Olah Christopher. Groups and group convolutions. http://colah.github.io/posts/ 2014-12-Groups-Convolution/, 2014. Djork-Arn e Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). ar Xiv preprint ar Xiv:1511.07289, 2015. Taco S Cohen and Max Welling. Group equivariant convolutional networks. ar Xiv preprint ar Xiv:1602.07576, 2016. AJ Connolly, I Csabai, AS Szalay, DC Koo, RG Kron, and JA Munn. Slicing through multicolor space: Galaxy redshifts from broadband photometry. ar Xiv preprint astro-ph/9508100, 1995. F.N. David, M.S. Kendall, and D.E. Barton. Symmetric Functions and Allied Tables. University Press, 1966. David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Al an Aspuru Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224 2232, 2015. Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in neural information processing systems, pp. 2537 2545, 2014. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv preprint ar Xiv:1502.03167, 2015. Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pp. 2096 2104, 2014. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436 444, 2015. Hong-Wei Lin, Chiew-Lan Tai, and Guo-Jin Wang. A mesh reconstruction algorithm driven by an intrinsic property of a point cloud. Computer-Aided Design, 36(1):1 9, 2004. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015. Under review as a conference paper at ICLR 2017 Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922 928. IEEE, 2015. Siamak Ravanbakhsh, Junier Oliva, Sebastien Fromenteau, Layne C Price, Shirley Ho, Jeff Schneider, and Barnab as P oczos. Estimating cosmological parameters from the dark matter distribution. In Proceedings of The 33rd International Conference on Machine Learning, 2016. Soumya Ray, Stephen Scott, and Hendrik Blockeel. Multi-instance learning. In Encyclopedia of Machine Learning, pp. 701 710. Springer, 2011. Eduardo Rozo and Eli S Rykoff. redmapper ii: X-ray and sz performance benchmarks for the sdss catalog. The Astrophysical Journal, 783(2):80, 2014. Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13 2011. Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339 2343, 2015. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, pp. 1642. Citeseer, 2013. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1): 1929 1958, 2014. Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945 953, 2015. Z. Szabo, B. Sriperumbudur, B. Poczos, and A. Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 2016. Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. ar Xiv preprint ar Xiv:1610.07584, 2016. Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912 1920, 2015. Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. Multi-instance learning by treating instances as non-iid samples. In Proceedings of the 26th annual international conference on machine learning, pp. 1249 1256. ACM, 2009. Under review as a conference paper at ICLR 2017 Proof. of the Theorem 2.1 From definition of permutation equivariance fΘ(πx) = πfΘ(x) and definition of f in Eq. (1), the condition becomes σ(Θπx) = πσ(Θx), which (assuming sigmoid is a bijection) is equivalent to Θπ = πΘ. Therefore we need to show that the necessary and sufficient conditions for the matrix Θ ℜN N to commute with all permutation matrices π Sn is given by Eq. (2). We prove this in both directions: To see why Θ = λI + γ (11T) commutes with any permutation matrix, first note that commutativity is linear that is Θ1π = πΘ1 Θ2π = πΘ2 (aΘ1 + bΘ2)π = π(aΘ1 + bΘ2). Since both Identity matrix I, and constant matrix 11T, commute with any permutation matrix, so does their linear combination Θ = λI + γ (11T). We need to show that in a matrix Θ that commutes with all permutation matrices All diagonal elements are identical: Let πk,l for 1 k, l N, k = l, be a transposition (i.e., a permutation that only swaps two elements). The inverse permutation matrix of πk,l is the permutation matrix of πl,k = πT k,l. We see that commutativity of Θ with the transposition πk,l implies that Θk,k = Θl,l: πk,lΘ = Θπk,l πk,lΘπl,k = Θ (πk,lΘπl,k)l,l = Θl,l Θk,k = Θl,l Therefore, π and Θ commute for any permutation π, they also commute for any transposition πk,l and therefore Θi,i = λ i. All off-diagonal elements are identical: We show that since Θ commutes with any product of transpositions, any choice two off-diagonal elements should be identical. Let (i, j) and (i , j ) be the index of two off-diagonal elements (i.e., i = j and i = j ). Moreover for now assume i = i and j = j . Application of the transposition πi,i Θ, swaps the rows i, i in Θ. Similarly, Θπj,j switches the jth column with j th column. From commutativity property of Θ and π Sn we have πj ,jπi,i Θ = Θπj ,jπi,i πj ,jπi,i Θ(πj ,jπi,i ) 1 = Θ πj ,jπi,i Θπi ,iπj,j = Θ (πj ,jπi,i Θπi ,iπj,j )i,j = Θi,j Θi ,j = Θi,j where in the last step we used our assumptions that i = i , j = j , i = j and i = j . In the cases where either i = i or j = j , we can use the above to show that Θi,j = Θi ,j and Θi ,j = Θi ,j , for some i = i, i and j = j, j , and conclude Θi,j = Θi ,j . B DETAILS OF MODELS In the following, all our implementations use Tensorflow (Abadi et al., 2016). B.1 MNIST SUMMATION All nonlinearities are exponential linear units (ELU Clevert et al., 2015). All models have 4 convolution layers followed by max-pooling. The convolution layers have respectively 16-32-64-128 output channels and 5 5 receptive fields. Each pooling, fully-connected and set-layer is followed by a 20% dropout. For models III and IV we use simultaneous dropout. In models I and II, the convolution layers are followed by two fullyconnected layers with 128 hidden units. In model III, after the first fully connected layer we perform Under review as a conference paper at ICLR 2017 Figure 6: Each row of the images shows a set, constructed from Celeb A dataset images, such that all set members except for an outlier, share at least two attributes. The outlier is identified with a red frame. The model is trained by observing examples of sets and their anomalous members and without access to the attributes. The probability assigned to each member by the outlier detection network is visualized using a red bar at the bottom of each image. The probabilities in each row sum to one. set-pooling followed by another dense layer with 128 hidden units. In the model IV, the convolution layers are followed by a permutation-equivariant layer with 128 output channels, followed by setpooling and a fully connected layer with 128 hidden units. For optimization, we used a learning rate of .0003 with Adam using the default β1 = .9 and β2 = .999. B.2 FACE OUTLIER DETECTION MODEL Our model has 9 convolution layers with 3 3 receptive fields. The model has convolution layers with 32, 32, 64 feature-maps followed by max-pooling followed by 2D convolution layers with 64, 64, 128 feature-maps followed by another max-pooling layer. The final set of convolution layers have 128, 128, 256 feature-maps, followed by a max-pooling layer with pool-size of 5 that reduces the output dimension to batch size.N 256, where the set-size N = 16. This is then forwarded to three permutation-equivariant layers with 256, 128 and 1 output channels. The output of final layer is fed to the Softmax, to identify the outlier. We use exponential linear units (Clevert et al., 2015), drop out with 20% dropout rate at convolutional layers and 50% dropout rate at the first two set layers. When applied to set layers, the selected feature (channel) is simultaneously dropped in all the set members of that particular set. We use Adam (Kingma & Ba, 2014) for optimization and use batch-normalization only in the convolutional layers. We use mini-batches of 8 sets, for a total of 128 images per batch. Under review as a conference paper at ICLR 2017 B.3 MODELS FOR POINT-CLOUDS CLASSIFICATION Set convolution. We use a network comprising of 3 permutation-equivariant layers with 256 channels followed by max-pooling over the set structure. The resulting vector representation of the set is then fed to a fully connected layer with 256 units followed by a 40-way softmax unit. We use Tanh activation at all layers and dropout on the layers after set-max-pooling (i.e., two dropout operations) with 50% dropout rate. Applying dropout to permutation-equivariant layers for point-cloud data deteriorated the performance. We observed that using different types of permutation-equivariant layers (see Section 2) and as few as 64 channels for set layers changes the result by less than 5% in classification accuracy. For the setting with 5000 particles, we increase the number of units to 512 in all layers and randomly rotate the input around the z-axis. We also randomly scale the point-cloud by s U(.8, 1./.8). For this setting only, we use Adamax (Kingma & Ba, 2014) instead of Adam and reduce learning rate from .001 to .0005. Graph convolution. For each point-cloud instance with 1000 particles, we build a sparse K-nearest neighbor graph and use the three point coordinates as input features. We normalized all graphs at the preprocessing step. For direct comparison with set layer, we use the exact architecture of 3 graphconvolution layer followed by set-pooling (global graph pooling) and dense layer with 256 units. We use exponential linear activation function instead of Tanh as it performs better for graphs. Due to over-fitting, we use a heavy dropout of 50% after graph-convolution and dense layers. Similar to dropout for sets, all the randomly selected features are simultaneously dropped across the graph nodes. the We use a mini-batch size of 64 and Adam for optimization where the learning rate is .001 (the same as that of permutation-equivariant counter-part). Despite our efficient sparse implementation using Tensorflow, graph-convolution is significantly slower than the set layer. This prevented a thorough search for hyper-parameters and it is quite possible that better hyper-parameter tuning would improve the results that we report here.