Published as a conference paper at ICLR 2020

PRUNED GRAPH SCATTERING TRANSFORMS

Vassilis N. Ioannidis
Dpt. of Electrical and Computer Engineering, Univ. of Minnesota, Minneapolis, MN, USA
ioann006@umn.edu

Siheng Chen
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
schen@merl.com

Georgios B. Giannakis
Dpt. of Electrical and Computer Engineering, Univ. of Minnesota, Minneapolis, MN, USA
georgios@umn.edu

ABSTRACT

Graph convolutional networks (GCNs) have achieved remarkable performance in a variety of network science learning tasks. However, theoretical analysis of such approaches is still in its infancy. Graph scattering transforms (GSTs) are non-trainable deep GCN models that are amenable to generalization and stability analyses. The present work addresses some limitations of GSTs by introducing a novel, so-termed pruned (p)GST approach. The resultant pruning algorithm is guided by a graph-spectrum-inspired criterion, and retains informative scattering features on-the-fly while bypassing the exponential complexity associated with GSTs. It is further established that pGSTs are stable to perturbations of the input graph signals with bounded energy. Experiments showcase that i) pGST performs comparably to the baseline GST that uses all scattering features, while achieving significant computational savings; ii) pGST achieves comparable performance to state-of-the-art GCNs; and iii) graph data from various domains lead to different scattering patterns, suggesting domain-adaptive pGST network architectures.

1 INTRODUCTION

The abundance of graph-structured data calls for advanced learning techniques, and complements nicely standard machine learning tools that cannot be directly applied to irregular data domains. Permeating the benefits of deep learning to the graph domain, graph convolutional networks (GCNs) provide a versatile and powerful framework to learn from complex graph data (Bronstein et al., 2017). GCNs and variants thereof have attained remarkable success in social network analysis, 3D point cloud processing, recommender systems, and action recognition. However, researchers have recently reported inconsistent perspectives on the appropriate designs for GCN architectures. For example, experiments in social network analysis have argued that deeper GCNs only marginally increase the learning performance (Wu et al., 2019), whereas a method for 3D point cloud segmentation achieves state-of-the-art performance with a 56-layer GCN (Li et al., 2019). These controversial empirical findings motivate theoretical analysis to understand the fundamental performance factors and the architecture design choices for GCNs.

Aiming to endow GCNs with theoretical guarantees, one promising research direction is to study graph scattering transforms (GSTs). GSTs are non-trainable GCNs comprising a cascade of graph filter banks followed by nonlinear activation functions. The graph filter banks are mathematically designed and are adopted to scatter an input graph signal into multiple channels. GSTs extract scattering features that can be utilized towards graph learning tasks (Gao et al., 2019), with competitive performance especially when the number of training examples is small. Under certain conditions on the graph filter banks, GSTs are endowed with energy conservation properties (Zou & Lerman, 2019), as well as stability, meaning robustness to graph topology deformations (Gama et al., 2019a).

This work was mainly done while V. N. Ioannidis was working at Mitsubishi Electric Research Laboratories.
Figure 1: Illustration of the same pGST applied to different graph datasets: (a) academic collaboration network, (b) protein-protein network, (c) 3D point cloud. Notice that for the social network (a) most of the GST branches are pruned, suggesting that most information is captured by local interactions.

However, GSTs are associated with exponential complexity in space and time that increases with the number of layers. This discourages deployment of GSTs when a deep architecture is needed. Furthermore, stability should not come at odds with sensitivity: a filter's output should be sensitive to, and able to detect, perturbations of large magnitude. Lastly, graph data in different domains (social networks, 3D point clouds) have distinct properties, which encourages GSTs with domain-adaptive architectures.

The present paper develops a data-adaptive pruning framework for the GST to systematically retain important features. Specifically, the contribution of this work is threefold.

C1. We put forth a pruning approach to select informative GST features, which we naturally term pruned graph scattering transform (pGST). The pruning decisions are guided by a criterion promoting alignment (matching) of the input graph spectrum with that of the graph filters. The optimal pruning decisions are provided on-the-fly, and alleviate the exponential complexity of GSTs.

C2. We prove that the pGST is stable to perturbations of the input graph signals. Under certain conditions on the energy of the perturbations, the resulting pruning patterns before and after the perturbations are identical, and the overall pGST is stable.

C3. We showcase with extensive experiments that: i) the proposed pGSTs perform similarly to, and in certain cases better than, the baseline GSTs that use all scattering features, while achieving significant computational savings; ii) the extracted pGST features can be utilized towards graph classification and 3D point cloud recognition, and even without any training on the feature extraction step, the performance is comparable to state-of-the-art deep supervised learning approaches, particularly when training data are scarce; and iii) by analyzing the pruning patterns of the pGST, we deduce that graph signals in different domains call for different network architectures; see Fig. 1.

2 RELATED WORK

GCNs rely on a layered processing architecture comprising trainable graph convolution operations to linearly combine features per graph neighborhood, followed by pointwise nonlinear functions applied to the linearly transformed features (Bronstein et al., 2017). Complex GCNs and their variants have shown remarkable success in graph semi-supervised learning (Kipf & Welling, 2017; Veličković et al., 2018) and graph classification (Ying et al., 2018). To simplify GCNs, (Wu et al., 2019) has shown that employing a single-layer linear GCN degrades the performance in certain social network learning tasks only slightly. On the other hand, (Li et al., 2019) has developed a 56-layer GCN that achieves state-of-the-art performance in 3D point cloud segmentation. Hence, designing GCN architectures guided by properties of the graph data is a highly motivated research question. Towards theoretically explaining the success of GCNs, recent works study the stability properties of GSTs with respect to metric deformations of the domain (Gama et al., 2019b;a; Zou & Lerman, 2019).
GSTs generalize scattering transforms (Bruna & Mallat, 2013; Mallat, 2012) to non-Euclidean domains. GSTs are a cascade of graph filter banks and nonlinear operations that is organized in a tree-structured architecture. The number of extracted scattering features of a GST grows exponentially with the number of layers. Theoretical guarantees for GSTs are obtained after fixing the graph filter banks to implement a set of graph wavelets. The work in (Zou & Lerman, 2019) establishes energy conservation properties for GSTs given that certain energy-preserving graph wavelets are employed, and also proves that GSTs are stable to graph structure perturbations; see also (Gama et al., 2019b), which focuses on diffusion wavelets. On the other hand, (Gama et al., 2019a) proves stability to relative metric deformations for a wide class of graph wavelet families. These contemporary works shed light on the stability and generalization capabilities of GCNs. However, stable transforms are not necessarily informative, and albeit highly desirable, a principled approach to selecting informative GST features remains an uncharted venue.

3 BACKGROUND

Consider a graph $\mathcal{G} := \{\mathcal{V}, \mathcal{E}\}$ with node set $\mathcal{V} := \{v_i\}_{i=1}^{N}$ and edge set $\mathcal{E} := \{e_i\}_{i=1}^{E}$. Its connectivity is described by the graph shift matrix $S \in \mathbb{R}^{N \times N}$, whose $(n, n')$th entry $S_{nn'}$ is nonzero if $(n, n') \in \mathcal{E}$ or if $n = n'$. A typical choice for $S$ is the adjacency or the Laplacian matrix. Further, each node can also be associated with a few attributes. Collect the attributes across all nodes in the matrix $X := [x_1, \ldots, x_F] \in \mathbb{R}^{N \times F}$, where each column $x_f \in \mathbb{R}^{N}$ is a graph signal.

Graph Fourier transform. A Fourier transform corresponds to the expansion of a signal over bases that are invariant to filtering; here, this graph frequency basis is the eigenbasis of the graph shift matrix $S$. Henceforth, $S$ is assumed normal with $S = V \Lambda V^{\top}$, where $V \in \mathbb{R}^{N \times N}$ forms the graph Fourier basis, and $\Lambda \in \mathbb{R}^{N \times N}$ is the diagonal matrix of corresponding eigenvalues $\lambda_0, \ldots, \lambda_{N-1}$. These eigenvalues represent graph frequencies. The graph Fourier transform (GFT) of $x \in \mathbb{R}^{N}$ is $\hat{x} = V^{\top} x \in \mathbb{R}^{N}$, while the inverse transform is $x = V \hat{x}$. The vector $\hat{x}$ represents the signal's expansion in the eigenvector basis and describes the graph spectrum of $x$. The inverse GFT reconstructs the graph signal from its graph spectrum by combining graph frequency components weighted by the GFT coefficients of the signal. The GFT has been a popular tool for analyzing graph signals in the graph spectral domain.

Graph convolutional neural networks. GCNs carry over the benefits of CNNs from processing Euclidean data to modeling graph-structured data. GCNs model graph data through a succession of layers, each of which consists of a graph convolution operation (graph filter), a pointwise nonlinear function $\sigma(\cdot)$, and oftentimes also a pooling operation. Given a graph signal $x \in \mathbb{R}^{N}$, the graph convolution operation diffuses each node's information to its neighbors according to the graph shift matrix $S$, as $Sx$. The $n$th entry $[Sx]_n = \sum_{n' \in \mathcal{N}_n} S_{nn'} x_{n'}$ is a weighted average of the one-hop neighboring features. Successive applications of $S$ increase the receptive field, spreading the information across the network. Hence, a $K$th order graph convolution operation (graph filtering) is
$$h(S)x = \sum_{k=0}^{K} w_k S^{k} x = V \hat{h}(\Lambda) \hat{x} \tag{1}$$
where the graph filter $h(\cdot)$ is parameterized by the learnable weights $\{w_k\}_{k=0}^{K}$, and the graph filter in the graph spectral domain is $\hat{h}(\Lambda) = \sum_{k=0}^{K} w_k \Lambda^{k}$.
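To make the vertex-domain/spectral-domain equivalence in (1) concrete, the following minimal NumPy sketch (not from the paper; the toy graph, filter taps, and function names are illustrative) applies a $K$th order polynomial filter both ways and checks that they agree, assuming a symmetric shift matrix so that $V$ is orthogonal.

```python
import numpy as np

def graph_filter_vertex(S, x, w):
    """h(S) x = sum_k w[k] S^k x, evaluated in the vertex domain."""
    out, Skx = np.zeros_like(x, dtype=float), x.astype(float)
    for wk in w:
        out += wk * Skx        # accumulate w_k S^k x
        Skx = S @ Skx          # next power of S applied to x
    return out

def graph_filter_spectral(S, x, w):
    """Same filter via the GFT: V diag(h(lam)) V^T x, with h(lam) = sum_k w_k lam^k."""
    lam, V = np.linalg.eigh(S)                 # graph frequencies / Fourier basis
    x_hat = V.T @ x                            # GFT of x
    h_lam = sum(wk * lam**k for k, wk in enumerate(w))
    return V @ (h_lam * x_hat)                 # inverse GFT of the filtered spectrum

# Toy example: a 4-node undirected path graph with arbitrary filter taps (K = 2).
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, -2.0, 0.5, 3.0])
w = [0.5, 1.0, -0.25]
assert np.allclose(graph_filter_vertex(S, x, w), graph_filter_spectral(S, x, w))
```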
In the graph vertex domain, the learnable weights reflect the influence of the various orders of neighbors; in the graph spectral domain, those weights adaptively adjust the focus and emphasize certain graph frequency bands. GCNs employ various graph filter banks per layer, and learn the parameters that minimize a predefined learning objective, such as classification or regression.

Graph scattering transforms. GSTs are the non-trainable counterparts of GCNs, where the parameters of the graph convolutions are selected based on mathematical designs. GSTs process the input at each layer by a sequential application of graph filter banks $\{h_j(S)\}_{j=1}^{J}$, an elementwise nonlinear function $\sigma(\cdot)$, and a pooling operator $U$. At the first layer, the input graph signal $x \in \mathbb{R}^{N}$ constitutes the first scattering feature vector $z^{(0)} := x$. Next, $z^{(0)}$ is processed by the graph filter banks and $\sigma(\cdot)$ to generate $\{z^{(j)}\}_{j=1}^{J}$ with $z^{(j)} := \sigma(h_j(S) z^{(0)})$. At the second layer, the same operation is repeated per $j$. The resulting computation structure is a tree with $J$ branches at each non-leaf node; see also Fig. 2. The $\ell$th layer of the tree includes $J^{\ell}$ nodes. Each tree node at layer $\ell$ in the scattering transform is indexed by the path $p(\ell)$ of the sequence of $\ell$ graph convolutions applied to the input graph signal $x$, i.e., $p(\ell) := (j(1), j(2), \ldots, j(\ell))$; a tree node is fully specified by its corresponding path.

Figure 2: Scattering pattern associated with a pGST with J = 3 and L = 3. The dashed lines represent the pruned branches. An example of a graph signal and the GFTs of x and the filter banks are included as well. Note that the third filter j = 3 at ℓ = 1 generates no output, i.e., z^{(3)} = 0, and hence is pruned.

The scattering feature vector at the tree node indexed by $(p(\ell), j)$ at layer $\ell + 1$ is
$$z^{(p(\ell), j)} = \sigma\big(h_j(S)\, z^{(p(\ell))}\big) \tag{2}$$
where the variable $p(\ell)$ holds the list of indices of the parent nodes ordered by ancestry, and all paths $p(\ell)$ in the tree with length $\ell$ are included in the path set $\mathcal{P}(\ell)$ with $|\mathcal{P}(\ell)| = J^{\ell}$. The nonlinear transformation function $\sigma(\cdot)$ disperses the graph frequency representation through the spectrum, and endows the GST with increased discriminating power (Gama et al., 2019a). By exploiting the sparsity of the graph, the computational complexity of (2) is $O(KE)$, where $E = |\mathcal{E}|$ is the number of edges in $\mathcal{G}$; recall that any analytic function $h(S)$ can be written as a polynomial of $S$ with maximum degree $N - 1$ (Horn & Johnson, 2012). Each scattering feature vector $z^{(p(\ell))}$ is summarized by an aggregation operator $U(\cdot)$ to obtain a scalar scattering coefficient $\phi(p(\ell)) := U(z^{(p(\ell))})$, where $U(\cdot)$ is typically an average or sum operator that effects dimensionality reduction of the extracted features. The scattering coefficient at each tree node reflects the activation level at a certain graph frequency band. These scattering coefficients are collected across all tree nodes to form a scattering feature map $\Phi(x) := \{\phi(p(\ell))\}_{p(\ell) \in \mathcal{P}(\ell),\, \ell \leq L}$, where $|\Phi(x)| = \sum_{\ell=0}^{L} J^{\ell}$. The GST operation resembles a forward pass of a trained GCN. This is why several works study GST stability under perturbations of $S$ in order to understand the working mechanism of GCNs (Zou & Lerman, 2019; Gama et al., 2019a;b).

4 PRUNED GRAPH SCATTERING TRANSFORMS

While the representation power of the GST increases with the number of layers, the computational and space complexity of the transform also increase exponentially with the number of layers due to its scattering nature.
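To make this exponential growth concrete, the following illustrative sketch (not from the paper) builds the full GST feature map via the recursion in (2), assuming $\sigma(\cdot) = |\cdot|$ and $U(\cdot)$ = mean; the filter handles are placeholders for any polynomial graph filter bank such as the one sketched after (1).

```python
import numpy as np

def gst_feature_map(x, filters, L, sigma=np.abs, U=np.mean):
    """Full (unpruned) GST: expand every tree node for L layers with J = len(filters) branches.

    filters : list of callables; filters[j](z) should return h_j(S) z.
    Returns Phi(x) as a dict mapping each path p = (j1, ..., jl) to its coefficient phi(p).
    """
    layer = {(): np.asarray(x, dtype=float)}   # layer 0 holds the root feature z^(0) = x
    phi = {(): U(layer[()])}
    for _ in range(L):
        nxt = {}
        for path, z in layer.items():
            for j, h in enumerate(filters):
                z_child = sigma(h(z))          # z^{(p, j)} = sigma(h_j(S) z^{(p)}), cf. (2)
                nxt[path + (j,)] = z_child
                phi[path + (j,)] = U(z_child)  # phi(p, j) = U(z^{(p, j)})
        layer = nxt                            # the new layer holds J times as many nodes
    return phi
```

With $J$ filters and $L$ layers the returned dictionary holds $\sum_{\ell=0}^{L} J^{\ell}$ coefficients, which is exactly the exponential cost that the pruning developed in this section targets.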
Hence, even if informative features are available at deeper GST layers, the associated exponential complexity of extracting such features is prohibitive with the existing GST architectures. On the other hand, various input data (social networks, 3D point clouds) may have distinct properties, leading to different GST feature maps. In some cases, only a few tree nodes in deep layers are informative; in other cases, tree nodes in shallow layers carry significant information; see Fig. 1. This calls for a customized GST that adaptively chooses significant tree nodes.

Alleviating these GST limitations, we introduce a pruned graph scattering transform (pGST) to systematically retain informative tree nodes without additional complexity. Our novel pGST alleviates the exponential complexity and adapts the GST to different input data. Furthermore, the pGST offers a practical mechanism to understand the architecture of GCNs: based on the pruning patterns, the proposed pGST suggests when a deeper GCN is desirable, and when a shallow one will suffice.

Pruning the wavelet packets has traditionally been employed for compression in image processing applications (Xiong et al., 2002), where the pruning is guided by a rate-distortion optimality criterion. In this work, we consider a graph-spectrum-inspired criterion. Intuitively, each tree node in the GST reflects a unique subband in the graph spectrum. When the subband of a tree node does not have sufficient overlap with the graph spectrum of a graph signal, this tree node cannot capture the properties of this graph signal, and should be pruned. For example, consider a smooth graph signal $x$, where connected nodes have similar signal values, that has a sparse (low-rank) representation in the graph spectral domain; that is, $\hat{x} := V^{\top} x \in \mathbb{R}^{N}$ and $[\hat{x}]_n = 0$ for $n \geq b$. The graph spectrum of the $j$th graph filtered output is then
$$V^{\top} h_j(S) x = \mathrm{diag}\big(\hat{h}_j(\lambda)\big) \hat{x} = \big[\hat{h}_j(\lambda_1)\hat{x}_1, \hat{h}_j(\lambda_2)\hat{x}_2, \ldots, \hat{h}_j(\lambda_N)\hat{x}_N\big]^{\top}$$
where $\lambda_n$ is the $n$th eigenvalue of $S$ and each frequency component $\hat{x}_n$ is weighted by the corresponding transformed eigenvalue $\hat{h}_j(\lambda_n)$. Hence, if the support of the graph spectrum $\{\hat{h}_j(\lambda_n)\}_n$ is not included in the support of $\hat{x}$, then the $j$th graph filter output will not capture any information; that is, $h_j(S)x = 0_N$; see Fig. 2. Thus, identifying such graph filters and pruning the corresponding tree nodes will result in a parsimonious and thus computationally efficient GST.

Pruning criterion. Motivated by the aforementioned observation, we introduce a pruning criterion that selects the scattering branches per tree node by maximizing the alignment between the graph spectrum of the graph filters and that of the scattering feature. At the tree node $p$, the optimization problem is
$$\max_{\{f^{(p,j)}\}_{j=1}^{J}} \; \sum_{j=1}^{J} f^{(p,j)} \sum_{n=1}^{N} \big(\hat{h}_j(\lambda_n)^2 - \tau\big)\,[\hat{z}^{(p)}]_n^2 \quad \text{s.t.}\; f^{(p,j)} \in \{0, 1\},\; j = 1, \ldots, J \tag{4}$$
where $\hat{z}^{(p)} := V^{\top} z^{(p)}$ is the graph spectrum of the scattering feature vector $z^{(p)}$; $\tau$ is a user-specified threshold; and $f^{(p,j)}$ stands for the pruning assignment variable indicating whether node $(p, j)$ is active ($f^{(p,j)} = 1$) or should be pruned ($f^{(p,j)} = 0$). The objective in (4) promotes retaining tree nodes that maximize the alignment of the graph spectrum of $\hat{z}^{(p)}$ with that of $\hat{h}_j(\lambda)$. The threshold $\tau$ introduces a minimum spectral value to locate those tree nodes whose corresponding graph spectral response is small, i.e., $\hat{h}_j(\lambda_n)^2 \leq \tau$.
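For intuition, criterion (4) can be evaluated at a single tree node with a few lines of linear algebra. The sketch below is illustrative only (the function name and interface are ours, not the paper's), and assumes a symmetric shift so that $V$ is orthogonal and the filters are specified by their spectral responses $\hat{h}_j(\lambda)$; it returns the binary assignments that maximize (4) by keeping exactly the branches whose alignment term is positive.

```python
import numpy as np

def prune_decisions_spectral(z, V, lam, filt_specs, tau):
    """Evaluate the spectral alignment criterion (4) at one tree node.

    z          : scattering feature vector z^{(p)} at this node (length N)
    V, lam     : graph Fourier basis and eigenvalues of the shift matrix S
    filt_specs : list of callables; filt_specs[j](lam) returns the response h_j(lam)
    Returns f[j] = 1 to keep branch j, 0 to prune it.
    """
    z_hat = V.T @ z                                   # graph spectrum of z^{(p)}
    alignment = [np.sum((h(lam) ** 2 - tau) * z_hat ** 2) for h in filt_specs]
    return [1 if a > 0 else 0 for a in alignment]     # keep branches with positive alignment
```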
Note that criterion (4) is evaluated per tree node $p$, thus allowing for a flexible and scalable design. The optimization problem in (4) is nonconvex since $f^{(p,j)}$ is a discrete variable. Furthermore, recovering $\hat{z}^{(p)}$ requires an eigendecomposition of the Laplacian matrix that incurs $O(N^3)$ complexity. Nevertheless, by exploiting the structure in (4), we develop an efficient pruning algorithm that achieves the maximum of (4), as summarized in the following theorem.

Theorem 1. The optimal pruning assignment variables $\{f^{\star(p,j)}\}_j$ of (4) are given by
$$f^{\star(p,j)} = \begin{cases} 1 & \text{if } \|z^{(p,j)}\|^2 / \|z^{(p)}\|^2 > \tau, \\ 0 & \text{if } \|z^{(p,j)}\|^2 / \|z^{(p)}\|^2 < \tau, \end{cases} \qquad j = 1, \ldots, J. \tag{5}$$

The optimal variables $f^{\star(p,j)}$ are found by comparing the energy of the input $z^{(p)}$ to that of the output $z^{(p,j)}$ per graph filter $j$, which can be evaluated at a low complexity of $O(N)$. Our pruning criterion hence leads to a principled and scalable algorithm for selecting the GST tree nodes to be pruned. The pruning objective is evaluated at each tree node $p$, and pruning decisions are made on-the-fly. Hence, when $f^{\star(p)} = 1$, tree node $p$ is active and the graph filter bank is applied to $z^{(p)}$, expanding the tree to the next layer; otherwise, the GST is not expanded further at tree node $p$, which can result in exponential savings in computations. An example of such a pruned tree is depicted in Fig. 2. Evidently, the hyperparameter $\tau$ controls the input-to-output energy ratio: a large $\tau$ corresponds to an aggressively pruned scattering tree, while a small $\tau$ amounts to a minimally pruned scattering tree. The pGST is then defined as $\Psi(x) := \{\phi(p)\}_{p \in \mathcal{T}}$, where $\mathcal{T} := \{p \in \mathcal{P} \,|\, f^{\star(p)} = 1\}$ is the set of active tree nodes.

Our pruning approach provides a concise version of GSTs and effects savings in computations as well as memory. Although the worst-case complexity of the pGST is still exponential, a desirable complexity can be effected by properly selecting $\tau$. As a byproduct, the scattering patterns of pGSTs reveal the appropriate depths and widths of the GSTs for different input data; see also Fig. 1. The pruning approach so far is an unsupervised one, since no input data labels are assumed available.

5 STABILITY AND SENSITIVITY OF PGST

In this section, we prove the stability of the pGST to perturbations of the input graph signal. To establish the ensuing results, we consider graph wavelets that form a frame with frame bounds $A$ and $B$ (Hammond et al., 2011). Specifically, for any graph signal $x \in \mathbb{R}^{N}$, it holds that $A^2 \|x\|^2 \leq \sum_{j=1}^{J} \|h_j(S)x\|^2 \leq B^2 \|x\|^2$. In the graph vertex domain, the scalar frame bounds $A$ and $B$ characterize the numerical stability of recovering a graph signal $x$ from $\{h_j(S)x\}_j$. In the graph spectral domain, they reflect the ability of the graph filter bank to amplify $x$ along each graph frequency. Tight frame bounds, satisfying $A^2 = B^2$, are of particular interest because such wavelets lead to enhanced numerical stability and faster computations (Shuman et al., 2015). The frame property of the graph wavelets plays an instrumental role in proving GST stability to perturbations of the underlying graph structure (Gama et al., 2019a;b; Zou & Lerman, 2019).

Consider a perturbed graph signal $\tilde{x}$ given by
$$\tilde{x} := x + \delta \in \mathbb{R}^{N} \tag{6}$$
where $\delta \in \mathbb{R}^{N}$ is the perturbation vector. Such an additive model (6) may represent noise in the observed features or adversarial perturbations. We are interested in studying how, and under which conditions, our pGST is affected by such perturbations. A stable transformation should produce a similar output under small input perturbations.
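Theorem 1 turns the spectral criterion (4) into a simple input-to-output energy test, so pruning can be folded directly into the GST recursion sketched earlier. The following code is again an illustrative sketch under the same hypothetical conventions ($\sigma(\cdot) = |\cdot|$, $U(\cdot)$ = mean, placeholder filter handles) rather than the paper's implementation; feeding it both $x$ and $\tilde{x} = x + \delta$ and comparing the outputs is also a quick way to probe the stability questions studied next.

```python
import numpy as np

def pgst_feature_map(x, filters, L, tau, sigma=np.abs, U=np.mean):
    """pGST sketch: expand node (p, j) only if ||z^{(p,j)}||^2 / ||z^{(p)}||^2 > tau, cf. (5)."""
    layer = {(): np.asarray(x, dtype=float)}
    psi = {(): U(layer[()])}                       # retained scattering coefficients Psi(x)
    for _ in range(L):
        nxt = {}
        for path, z in layer.items():
            parent_energy = np.sum(z ** 2)
            if parent_energy == 0.0:               # a zero feature cannot spawn informative children
                continue
            for j, h in enumerate(filters):
                z_child = sigma(h(z))
                if np.sum(z_child ** 2) / parent_energy > tau:
                    nxt[path + (j,)] = z_child     # branch is active: keep and expand it
                    psi[path + (j,)] = U(z_child)
        layer = nxt
    return psi
```

A large τ prunes aggressively and retains few coefficients, while a very small τ prunes only branches whose output energy is (near) zero, approaching the full GST map above.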
Before establishing that our pGST is stable, we first show that the GST is stable to small perturbations of the input graph signal; note that prior art deals with GST stability to structure perturbations (Gama et al., 2019a;b; Zou & Lerman, 2019).

Lemma 1. Consider the GST $\Phi(\cdot)$ with $L$ layers and $J$ graph filters, and suppose that the graph filter bank forms a frame with bound $B$, while $x$ and $\tilde{x}$ are related via (6). It then holds that
$$\|\Phi(x) - \Phi(\tilde{x})\|_2 \leq \sqrt{\frac{\sum_{\ell=0}^{L} (B^2 J)^{\ell}}{\sum_{\ell=0}^{L} J^{\ell}}}\; \|\delta\|_2. \tag{7}$$

The squared difference of the GSTs is normalized by the number of scattering features in $\Phi(\cdot)$, that is, $|\Phi(x)| = \sum_{\ell=0}^{L} J^{\ell}$. The bound in (7) relates to the frame bound of the wavelet filter bank. Notice that for tight frames with $B = 1$, the normalized stability bound (7) is tight.

Let $\tilde{\mathcal{T}}$ be the structure of the pruned tree for $\Psi(\tilde{x})$. The following lemma asserts that the pGST yields the same pruned tree for the original and the perturbed inputs.

Lemma 2. Let $\tilde{z}_p$ denote the perturbed scattering feature at the tree node $p$ and $\delta_p := \tilde{z}_p - z_p$. If for all $p \in \mathcal{P}$ and $j = 1, \ldots, J$ it holds that
$$\big|\, \|h_j(S) z_p\|^2 - \tau \|z_p\|^2 \,\big| > \|h_j(S)\delta_p\|^2 + \tau \,\big|\, \|\tilde{z}_p\|^2 - \|z_p\|^2 \,\big|, \tag{8}$$
then we obtain:
1. The pruning transform outputs the same tree for $\Psi(x)$ and $\Psi(\tilde{x})$; that is, $\mathcal{T} = \tilde{\mathcal{T}}$; and,
2. With $g(x) := \|h_j(S)x\|^2 - \tau \|x\|^2$, a necessary condition for (8) is
$$|g(z_p)| > g(\delta_p). \tag{9}$$

According to (9), Lemma 2 can be interpreted as a signal-to-noise ratio (SNR) condition because, when $g(\delta_p) > 0$, it is possible to write (9) as $|g(z_p)| / g(\delta_p) > 1$. Lemma 2 provides a per-layer and per-branch condition for the pGST to output the same scattering tree for the original and the perturbed signal. By combining Lemmas 1 and 2, we arrive at the following stability result for the pGST network.

Figure 3: Classification accuracy against the number of training samples for authorship attribution (a) and against the SNR in dB for source localization (b); runtime comparison in seconds of the scattering transforms (c); feature overlap (d).

Theorem 2. Consider the pGST $\Psi(\cdot)$ with $L$ layers and $J$ graph filters, and suppose that the graph filter bank forms a frame with bound $B$, while $x$ and $\tilde{x}$ are related via (6). The pGST is stable to bounded perturbations $\delta$, in the sense that
$$\|\Psi(x) - \Psi(\tilde{x})\|_2 \leq \sqrt{\frac{\sum_{\ell=0}^{L} F_{\ell} B^{2\ell}}{\sum_{\ell=0}^{L} F_{\ell}}}\; \|\delta\|_2$$
where $F_{\ell} := |\mathcal{P}(\ell) \cap \mathcal{T}|$ is the number of active scattering features at layer $\ell$, and $|\Psi(x)| = \sum_{\ell=0}^{L} F_{\ell}$ is the number of retained scattering features.

6 EXPERIMENTS

This section evaluates the performance of our pGST in various graph classification tasks. Graph classification amounts to predicting a label $y_i$ given $x_i$ and $S_i$ for the $i$th graph. Our pGST extracts $\Psi(x_i)$, which is utilized as a feature vector for predicting $y_i$. During training, the structure $\mathcal{T}$ of the pGST is determined, and it is kept fixed during validation and testing. The parameter $\tau$ is selected via cross-validation. Our goal is to provide tangible answers to the following research questions.

RQ1 How does the proposed pGST compare to the GST?
RQ2 How does the pGST compare to state-of-the-art GCN approaches for graph classification?
RQ3 What are the appropriate scattering patterns for various graph data?

Appendix A includes additional experiments with ablation studies over the effect of the parameters J, L, and τ.

pGST vs. GST.
To address RQ1, we reproduce the experiments of two tasks in (Gama et al., 2019a): authorship attribution and source localization. For the scattering transforms, we consider three implementations of graph filter banks: the diffusion wavelets (DS) in (Gama et al., 2019b), the monic cubic wavelets (MCS) in (Hammond et al., 2011), and the tight Hann wavelets (THS) in (Shuman et al., 2015); PDS, PMCS, and PTHS denote the pruned versions of these transforms. The scattering transforms use J = 5 filters, L = 5 layers, and τ = 0.01. The extracted features from the GSTs are subsequently utilized by a linear support vector machine (SVM) classifier.

Authorship attribution amounts to determining whether a certain text was written by a specific author. Each text is represented by a graph with N = 244 nodes, where words (nodes) are connected based on their relative positions in the text, and x is a bag-of-words representation of the text; see also (Gama et al., 2019b). Fig. 3 (a) reports the classification accuracy for the authorship attribution task as the number of training samples (texts) increases. GSTs utilize $\sum_{\ell=0}^{4} 5^{\ell} = 781$ scattering coefficients, while pGSTs rely only on |T| = 61 for PDS, |T| = 30 for PMCS, and |T| = 80 for PTHS. Evidently, the proposed pGST achieves comparable performance to the baseline GST, while using only a subset of the features (12.8%, 3.8%, and 10.2%, respectively). The SVM classifier provides a coefficient that weighs each scattering scalar, and the magnitude of each coefficient indicates the importance of the corresponding scattering feature for classification. Fig. 3 (d) depicts the percentage of features retained after pruning that appear among the top-2|T| most important GST features according to the SVM classifier. It is observed that, although the pGST does not take the labels into account, the retained features are indeed informative for classification.

Source localization amounts to recovering the source of a rumor given a diffused signal over a Facebook subnetwork with N = 234; see the detailed settings in (Gama et al., 2019b). Fig. 3 (b) shows the classification accuracy of the scattering transforms for increasing SNR in dB. In accordance with Lemma 1 and Theorem 2, both pGST and GST are stable over a wide range of SNRs. Furthermore, the performance of pGST matches that of the corresponding GST, while the pGST uses only a subset of the features. Fig. 3 (c) depicts the runtime of the different scattering approaches, where the computational advantage of the pruned methods is evident.
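As a rough sketch of this classification pipeline (the data below are random placeholders, and the shapes, labels, and names are illustrative rather than the paper's actual setup), each text's scattering map becomes one row of a feature matrix that a linear SVM then classifies; the learned coefficient magnitudes give the per-feature importance used in the feature-overlap study of Fig. 3 (d).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder scattering features: one row per text, one column per retained coefficient.
# In practice each row would come from a (p)GST feature extractor such as the sketches above.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 61))          # e.g. |T| = 61 retained pGST coefficients
y = rng.integers(0, 2, size=200)              # binary author label

Phi_tr, Phi_te, y_tr, y_te = train_test_split(Phi, y, test_size=0.3, random_state=0)
svm = LinearSVC(C=1.0, max_iter=10000).fit(Phi_tr, y_tr)
print("test accuracy:", svm.score(Phi_te, y_te))

importance = np.abs(svm.coef_).ravel()        # coefficient magnitude per scattering feature
top_features = np.argsort(importance)[::-1]   # most informative coefficients first
```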
Graph classification. Towards answering RQ2, the proposed pGST is compared with the following state-of-the-art approaches (for the competing approaches we report the 10-fold cross-validation numbers reported by the original authors; see also (Ying et al., 2018)): the kernel methods shortest-path (Borgwardt & Kriegel, 2005) and Weisfeiler-Lehman optimal assignment (WL-OA) (Kriege et al., 2016); the deep learning approaches PatchySan (Niepert et al., 2016), GraphSAGE (Hamilton et al., 2017), edge-conditioned filters in CNNs (ECC) (Simonovsky & Komodakis, 2017), Set2Set (Vinyals et al., 2015), SortPool (Zhang et al., 2018), and DiffPool (Ying et al., 2018); and the geometric scattering classifier (GSC) (Gao et al., 2019). Results are presented for the protein datasets D&D, Enzymes, and Proteins, and the scientific collaboration dataset Collab; a detailed description of the datasets is included in the Appendix. We perform 10-fold cross-validation and report the classification accuracy averaged over the 10 folds. A gradient boosting classifier is employed for pGST and GST, with parameters chosen based on the performance on the validation set. The graph scattering transforms use the MC wavelet with L = 5, J = 5, and τ = 0.01. Table 1 lists the classification accuracy of the proposed and competing approaches. Even without any training on the feature extraction step, the performance of pGST is comparable to the state-of-the-art deep supervised learning approaches across all datasets. GST and pGST also outperform GSC, since the latter uses a linear SVM to classify the scattering features.

Table 1: Graph classification accuracy.

| Method        | ENZYMES | D&D   | COLLAB | PROTEINS |
|---------------|---------|-------|--------|----------|
| SHORTEST-PATH | 42.32   | 78.86 | 59.10  | 76.43    |
| WL-OA         | 60.13   | 79.04 | 80.74  | 75.26    |
| PATCHYSAN     | --      | 76.27 | 72.60  | 75.00    |
| GRAPHSAGE     | 54.25   | 75.42 | 68.25  | 70.48    |
| ECC           | 53.50   | 74.10 | 67.79  | 72.65    |
| SET2SET       | 60.15   | 78.12 | 71.75  | 74.29    |
| SORTPOOL      | 57.12   | 79.37 | 73.76  | 75.54    |
| DIFFPOOL-DET  | 58.33   | 75.47 | 82.13  | 75.62    |
| DIFFPOOL-NOLP | 62.67   | 79.98 | 75.63  | 77.42    |
| DIFFPOOL      | 64.23   | 81.15 | 75.50  | 78.10    |
| GSC           | 53.88   | 76.57 | 76.88  | 74.03    |
| GST           | 59.84   | 79.28 | 77.32  | 76.23    |
| PGST (Ours)   | 60.25   | 81.27 | 78.40  | 78.57    |

Figure 4: 3D point cloud classification accuracy against network depth L: (a) train/test split of 9,843/2,468 clouds; (b) train/test split of 615/11,703 clouds.

Point cloud classification. We further test the pGST in classifying 3D point clouds. Given a point cloud, a graph can be created by connecting points (nodes) to their nearest neighbors based on their Euclidean distance. Each node is also associated with 6 scalars denoting its x-y-z coordinates and RGB colors. For this experiment, GSTs are compared against PointNet++ (Qi et al., 2017a;b), 3DShapeNets (Wu et al., 2015), and VoxNet (Maturana & Scherer, 2015), which are state-of-the-art deep learning approaches. Fig. 4 reports the classification accuracy on the ModelNet40 dataset (Wu et al., 2015) for increasing L. In Fig. 4 (a), 9,843 clouds are used for training and 2,468 for testing with the gradient boosting classifier; whereas in Fig. 4 (b) only 615 clouds are used for training and the rest for testing with a fully connected neural network classifier with 3 layers. The scattering transforms use an MC wavelet with J = 5 for Fig. 4 (a) and J = 9 for Fig. 4 (b). Fig. 4 showcases that scattering transforms are competitive with state-of-the-art approaches, while pGST outperforms GST. This may be attributed to overfitting effects, since a large number of GST features is not informative. Furthermore, the exponential complexity associated with GSTs prevents their application for L = 6. Fig. 4 (b) shows that when the training data are scarce, GST and pGST outperform PointNet++, which requires a large amount of training data to optimize over the network parameters.
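The point-cloud graph construction just described is easy to sketch. The snippet below is illustrative only (the choice of k, the random placeholder points, and the helper names are ours, not the paper's); it builds a symmetric k-nearest-neighbor adjacency from the coordinates, which can then serve as the shift matrix S, while the 6 per-node scalars are kept as graph signals.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Placeholder point cloud: 1024 points with x-y-z coordinates and RGB colors.
rng = np.random.default_rng(0)
xyz = rng.random((1024, 3))                  # coordinates used to build the graph
rgb = rng.random((1024, 3))                  # colors kept as node attributes
X = np.hstack([xyz, rgb])                    # 6 scalars per node, as in the experiments

k = 10                                       # number of neighbors (an assumption, not from the paper)
A = kneighbors_graph(xyz, n_neighbors=k, mode="connectivity", include_self=False)
S = ((A + A.T) > 0).astype(float).toarray()  # symmetrize to obtain an undirected shift matrix
```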
Scattering patterns. Towards answering RQ3, we depict the scattering structures of pGSTs, with an MC wavelet, J = 3, and L = 5, for the Collab, Proteins, and ModelNet40 datasets in Fig. 1. Evidently, graph data from various domains require an adaptive scattering architecture. Specifically, most tree nodes for the academic collaboration dataset are pruned, and hence most informative features lie in the shallow layers. This is consistent with the study in (Wu et al., 2019), which experimentally shows that deeper GCNs do not contribute as much for social network data. These findings are further supported by the small-world phenomenon in social networks, which suggests that the diameter of social networks is small (Watts & Strogatz, 1998). On the other hand, the tree nodes for a 3D point cloud are minimally pruned, which is in line with the work in (Li et al., 2019) that showcases the advantage of deep GCNs in 3D point cloud classification. For the protein datasets, additional experiments are performed in Appendix A.4 that corroborate the pGST insights regarding the required number of GCN layers.

7 CONCLUSIONS

This paper developed a novel approach to pruning the graph scattering transform. The proposed pGST relies on a graph-spectrum-based, data-adaptive criterion to prune non-informative features on-the-fly, and effectively reduces the computational complexity of GSTs. Furthermore, the stability of the pGST to perturbations of the input signal is established. Experiments demonstrate that i) pGSTs achieve performance gains relative to GSTs; ii) pGST is competitive in a variety of graph classification tasks; and iii) graph data from different domains exhibit unique pruned scattering patterns, which calls for adaptive network architectures.

Acknowledgments. This work was supported by Mitsubishi Electric Research Laboratories, the Doctoral Dissertation Fellowship of the University of Minnesota, and the NSF grants 171141 and 1500713.

REFERENCES

Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In IEEE Intl. Conf. on Data Mining (ICDM), 2005.

Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Sig. Process. Mag., 34(4):18-42, 2017.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intel., 35(8):1872-1886, 2013.

Fernando Gama, Joan Bruna, and Alejandro Ribeiro. Stability of graph scattering transforms. In Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2019a.

Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Diffusion scattering transforms on graphs. In Proc. Intl. Conf. on Learn. Representations (ICLR), 2019b.

Feng Gao, Guy Wolf, and Matthew Hirn. Geometric scattering for graph data analysis. In Proc. Intl. Conf. Mach. Learn. (ICML), pp. 2122-2131, 2019.

William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.

David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129-150, 2011.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proc. Intl. Conf. on Learn. Representations (ICLR), Toulon, France, Apr. 2017.

Nils M. Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2016.

Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. Can GCNs go as deep as CNNs? In Proc. Int. Conf. Comput. Vis., 2019.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proc. Intl. Conf. Mach. Learn. (ICML), 2016.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conf. on Comp. Vis. and Pat. Rec., 2017a.

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2017b.

David I. Shuman, Christoph Wiesmeyr, Nicki Holighaus, and Pierre Vandergheynst. Spectrum-adapted tight graph wavelet and vertex-frequency frames. IEEE Trans. Sig. Process., 63(16):4223-4235, 2015.

Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In IEEE Conf. on Comp. Vis. and Pat. Rec., 2017.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proc. Intl. Conf. on Learn. Representations (ICLR), 2018.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In Proc. Intl. Conf. on Learn. Representations (ICLR), 2015.

Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440, 1998.

Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In Proc. Intl. Conf. Mach. Learn. (ICML), 2019.

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conf. on Comp. Vis. and Pat. Rec., 2015.

Zixiang Xiong, Kannan Ramchandran, and Michael T. Orchard. Space-frequency quantization for wavelet image coding. 2002.

Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2018.

Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, 2018.

Dongmian Zou and Gilad Lerman. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 2019.

A ADDITIONAL EXPERIMENTS

Dataset characteristics. The characteristics of the datasets used in Table 1 are shown in Table 2. Notice that the nodes in the Collab dataset do not have any features, and hence x was selected as the vector that holds the node degrees.

Table 2: Dataset characteristics.

| Dataset  | Graphs | Features F | Max number of nodes per graph |
|----------|--------|------------|-------------------------------|
| Collab   | 5000   | 1          | 492                           |
| D&D      | 1178   | 89         | 5748                          |
| Enzymes  | 600    | 3          | 126                           |
| Proteins | 1113   | 3          | 620                           |

Figure 5: Performance of pGSTs for varying τ: (a) classification accuracy, (b) number of active features |T|, (c) runtime in seconds.

Figure 6: Classification accuracy over different parameters.

A.1 ABLATION STUDY
Fig. 5 reports how the pGST is affected by varying the threshold τ in the task of source localization, with J = 6 and L = 5. Fig. 5 (a) shows the classification accuracy, which generally decreases as τ increases since the number of active features |T| decreases; cf. Fig. 5 (b). Fig. 5 (c) reports the runtime in seconds of the approaches. Fig. 6 showcases the classification performance of the pGST with τ = 0.01 for varying L with J = 3 (left) and for varying J with L = 3 (right). It is observed that the classification performance generally increases with L and J.

B PROOF OF THEOREM 1

First, the objective in (4) is rewritten as
$$\sum_{j=1}^{J} f_j \sum_{n=1}^{N} \big(\hat{h}_j(\lambda_n)^2 - \tau\big)[\hat{z}^{(p)}]_n^2 = \sum_{j=1}^{J} \hat{z}^{(p)\top}\big(\mathrm{diag}(\hat{h}_j(\lambda))^2 - \tau I\big)\hat{z}^{(p)} f_j \tag{10}$$
which follows by definition. By introducing the scalars $\alpha_j := \hat{z}^{(p)\top}\big(\mathrm{diag}(\hat{h}_j(\lambda))^2 - \tau I\big)\hat{z}^{(p)}$ for $j = 1, \ldots, J$, (4) can be rewritten as
$$\max_{\{f_j\}_{j=1}^{J}} \; \sum_{j=1}^{J} \alpha_j f_j \quad \text{s.t. } f_j \in \{0, 1\},\; j = 1, \ldots, J. \tag{11}$$
The optimization problem in (11) is nonconvex since $f_j$ is a discrete variable. However, maximizing the sum in (11) amounts to setting $f_j = 1$ for the positive $\alpha_j$ over $j$. Such an approach leads to the optimal pruning assignment variables
$$f_j^{\star} = \begin{cases} 1 & \text{if } \alpha_j > 0, \\ 0 & \text{if } \alpha_j < 0, \end{cases} \qquad j = 1, \ldots, J. \tag{12}$$
The rest of the proof focuses on rewriting $\alpha_j$ as follows
$$\alpha_j = \hat{z}^{(p)\top}\big(\mathrm{diag}(\hat{h}_j(\lambda))^2 - \tau I\big)\hat{z}^{(p)} \tag{13}$$
$$= \big\|\mathrm{diag}(\hat{h}_j(\lambda))\,\hat{z}^{(p)}\big\|^2 - \tau\big\|\hat{z}^{(p)}\big\|^2. \tag{14}$$
Furthermore, since $V$ is an orthogonal matrix, it holds that $\|\hat{z}^{(p)}\|^2 = \|V^{\top} z^{(p)}\|^2 = \|z^{(p)}\|^2$, and it follows that
$$\big\|\mathrm{diag}(\hat{h}_j(\lambda))\,\hat{z}^{(p)}\big\|^2 = \big\|h_j(S) z^{(p)}\big\|^2 \tag{15}$$
$$= \big\|\sigma(h_j(S) z^{(p)})\big\|^2 = \big\|z^{(p,j)}\big\|^2 \tag{16}$$
where the second line follows since $\sigma(\cdot)$ is applied elementwise and does not change the norm. Hence $\alpha_j = \|z^{(p,j)}\|^2 - \tau\|z^{(p)}\|^2$, and substituting into (12) yields (5).

C PROOF OF LEMMA 1

By the definition of the scattering feature map $\Phi(\cdot)$, it holds that
$$\|\Phi(x) - \Phi(\tilde{x})\|^2 = \sum_{\ell=0}^{L}\sum_{p(\ell) \in \mathcal{P}(\ell)} \big|\phi(p(\ell)) - \tilde{\phi}(p(\ell))\big|^2. \tag{17}$$
Hence, it is well motivated to bound each term of the sum in (17) as follows
$$\big|\phi(p(\ell)) - \tilde{\phi}(p(\ell))\big| = \big|U(z^{(p(\ell))}) - U(\tilde{z}^{(p(\ell))})\big| \tag{18}$$
$$\leq \|U\|\,\big\|z^{(p(\ell))} - \tilde{z}^{(p(\ell))}\big\| \tag{19}$$
where (19) follows since the norm is a sub-multiplicative operator. Next, we show the following recursive bound
$$\big\|z^{(p(\ell))} - \tilde{z}^{(p(\ell))}\big\| = \big\|\sigma(h_{j(\ell)}(S) z^{(p(\ell-1))}) - \sigma(h_{j(\ell)}(S) \tilde{z}^{(p(\ell-1))})\big\| \tag{20}$$
$$\leq \|\sigma\|\,\big\|h_{j(\ell)}(S) z^{(p(\ell-1))} - h_{j(\ell)}(S) \tilde{z}^{(p(\ell-1))}\big\| \tag{21}$$
$$\leq \big\|h_{j(\ell)}(S) z^{(p(\ell-1))} - h_{j(\ell)}(S) \tilde{z}^{(p(\ell-1))}\big\| \tag{22}$$
$$\leq \|h_{j(\ell)}(S)\|\,\big\|z^{(p(\ell-1))} - \tilde{z}^{(p(\ell-1))}\big\| \tag{23}$$
where (21) and (23) follow since the norm is a sub-multiplicative operator, and (22) follows since the nonlinearity is nonexpansive, i.e., $\|\sigma\| \leq 1$. Hence, by applying (23) $\ell - 1$ times, the following condition holds
$$\big\|z^{(p(\ell))} - \tilde{z}^{(p(\ell))}\big\| \leq \|h_{j(\ell)}(S)\|\,\|h_{j(\ell-1)}(S)\| \cdots \|h_{j(1)}(S)\|\,\|x - \tilde{x}\| \tag{24}$$
and by further applying the frame bound and (6) it follows that
$$\big\|z^{(p(\ell))} - \tilde{z}^{(p(\ell))}\big\| \leq B^{\ell}\,\|\delta\|. \tag{25}$$
Combining (19), (25), and the average operator property $\|U\| = 1$, it holds that
$$\big|\phi(p(\ell)) - \tilde{\phi}(p(\ell))\big| \leq B^{\ell}\,\|\delta\|. \tag{26}$$
By applying the bound (26) to all entries on the right-hand side of (17), it follows that
$$\|\Phi(x) - \Phi(\tilde{x})\|^2 \leq \sum_{\ell=0}^{L}\sum_{p(\ell) \in \mathcal{P}(\ell)} B^{2\ell}\,\|\delta\|^2. \tag{27}$$
By factoring out $\|\delta\|^2$ and observing that the sum on the right-hand side of (27) does not depend on the path index $p$, it follows that
$$\|\Phi(x) - \Phi(\tilde{x})\|^2 \leq \Big(\sum_{\ell=0}^{L} |\mathcal{P}(\ell)|\, B^{2\ell}\Big)\,\|\delta\|^2. \tag{28}$$
Finally, since the cardinality of the paths at layer $\ell$ is $|\mathcal{P}(\ell)| = J^{\ell}$ and $\sum_{\ell=0}^{L} (B^2 J)^{\ell} = \frac{(B^2 J)^{L+1} - 1}{B^2 J - 1}$, it holds that
$$\|\Phi(x) - \Phi(\tilde{x})\|^2 \leq \frac{(B^2 J)^{L+1} - 1}{B^2 J - 1}\,\|\delta\|^2. \tag{29}$$

D PROOF OF LEMMA 2

We prove the case $\ell = 0$, where $z_{p(0)} = x$, since the same proof holds for any $\ell$. First, we adapt (8) to the following
$$\big|\, \|h_j(S)x\|^2 - \tau\|x\|^2 \,\big| > \|h_j(S)\delta\|^2 + \tau\,\big|\, \|\tilde{x}\|^2 - \|x\|^2 \,\big|. \tag{30}$$
The proof examines two cases and proceeds by contradiction. For the first case, consider that branch $j$ is pruned in $\Psi(x)$ and not pruned in $\Psi(\tilde{x})$, i.e., $(j) \notin \mathcal{T}$ and $(j) \in \tilde{\mathcal{T}}$.
By applying (5) for $z^{(j)} = \sigma(h_j(S)x)$, there exists $C \geq 0$ such that
$$\frac{\|h_j(S)x\|^2}{\|x\|^2} \leq \tau - C \tag{31}$$
$$\|h_j(S)x\|^2 - \tau\|x\|^2 \leq -C\|x\|^2. \tag{32}$$
Furthermore, from (5) it holds for $\tilde{z}^{(j)} = \sigma(h_j(S)\tilde{x})$ that
$$\frac{\|h_j(S)\tilde{x}\|^2}{\|\tilde{x}\|^2} > \tau. \tag{33}$$
By applying (6) to (33), and using the triangle inequality, it follows that
$$\|h_j(S)x\|^2 + \|h_j(S)\delta\|^2 \geq \tau\|\tilde{x}\|^2. \tag{34}$$
Next, by applying (32) it holds that
$$\tau\|x\|^2 - C\|x\|^2 + \|h_j(S)\delta\|^2 \geq \tau\|\tilde{x}\|^2 \tag{35}$$
$$\tau\big(\|x\|^2 - \|\tilde{x}\|^2\big) + \|h_j(S)\delta\|^2 \geq C\|x\|^2. \tag{36}$$
Next, by utilizing (30) and the absolute value property $|a| \geq a$ to upper-bound the left-hand side of (36), it follows that
$$\|h_j(S)x\|^2 - \tau\|x\|^2 > C\|x\|^2. \tag{37}$$
Finally, by applying (32), the following is obtained
$$0 > 2C\|x\|^2 \tag{38}$$
which implies that $C < 0$. However, this contradicts (31), since $C \geq 0$. Following a symmetric argument, we can complete the proof for the other case.
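As a numerical aside that is not part of the original proofs, the norm-preservation steps (15)-(16) used in the proof of Theorem 1 are easy to check for a random symmetric shift matrix. The snippet below is an illustrative sketch under the stated assumptions ($\sigma(\cdot) = |\cdot|$, polynomial filters, orthogonal $V$); dimensions and names are made up.

```python
import numpy as np

# Quick check of (15)-(16): for a symmetric S and sigma = |.|, the spectral-domain energy
# of the filtered signal equals the vertex-domain energy of the scattering feature z^{(p,j)}.
rng = np.random.default_rng(0)
N = 12
A = rng.random((N, N)); S = (A + A.T) / 2             # a symmetric shift matrix
lam, V = np.linalg.eigh(S)                             # graph frequencies and Fourier basis
w = rng.standard_normal(4)                             # filter taps w_0, ..., w_3
z = rng.standard_normal(N)

h_S_z = sum(wk * np.linalg.matrix_power(S, k) @ z for k, wk in enumerate(w))
z_hat = V.T @ z                                        # GFT of z
spec = np.polyval(w[::-1], lam) * z_hat                # diag(h(lam)) z_hat

assert np.allclose(np.linalg.norm(spec), np.linalg.norm(h_S_z))            # step (15)
assert np.allclose(np.linalg.norm(h_S_z), np.linalg.norm(np.abs(h_S_z)))   # step (16)
```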