# diffusion_scattering_transforms_on_graphs__251fbb86.pdf Published as a conference paper at ICLR 2019 DIFFUSION SCATTERING TRANSFORMS ON GRAPHS Fernando Gama Dept. of Electrical and Systems Engineering University of Pennsylvania Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania Joan Bruna Courant Institute of Mathematical Sciences New York University Stability is a key aspect of data analysis. In many applications, the natural notion of stability is geometric, as illustrated for example in computer vision. Scattering transforms construct deep convolutional representations which are certified stable to input deformations. This stability to deformations can be interpreted as stability with respect to changes in the metric structure of the domain. In this work, we show that scattering transforms can be generalized to non Euclidean domains using diffusion wavelets, while preserving a notion of stability with respect to metric changes in the domain, measured with diffusion maps. The resulting representation is stable to metric perturbations of the domain while being able to capture high-frequency information, akin to the Euclidean Scattering. 1 INTRODUCTION Convolutional Neural Networks (CNN) are layered information processing architectures. Each of the layers in a CNN is itself the composition of a convolution operation with a pointwise nonlinearity where the filters used at different layers are the outcome of a data-driven optimization process (Le Cun et al., 2010; 2015). Scattering transforms have an analogous layered architecture but differ from CNNs in that the convolutional filters used at different layers are not trained but selected from a multi-resolution filter bank (Mallat, 2012; Bruna & Mallat, 2013). The fact that they are not trained endows scattering transforms with intrinsic value in situations where training is impossible and inherent limitations in the converse case. That said, an equally important value of scattering transforms is that by isolating the convolutional layered architecture from training effects it permits analysis of the fundamental properties of CNN information processing architectures. This analysis is undertaken in Mallat (2012); Bruna & Mallat (2013) where the fundamental conclusion is about the stability of scattering transforms with respect to deformations in the underlying domain that are close to translations. In this paper we consider graphs and signals supported on graphs such as brain connectivity networks and functional activity levels (Huang et al., 2016), social networks and opinions (Jackson, 2008), or user similarity networks and ratings in recommendation systems (Huang et al., 2018). Our specific goals are: (i) To define a family of graph-scattering transforms. (ii) To define a notion of deformation for graph signals. (iii) To study the stability of graph scattering transforms with respect to this notion of deformation. To accomplish goal (i) we consider the family of graph diffusion wavelets which provide an appropriate construction of a multi-resolution filter bank (Coifman & Maggioni, 2006). Our diffusion scattering transforms are defined as the layered composition of diffusion wavelet filter banks and pointwise nonlinearities. To accomplish goal (ii) we adopt the graph diffusion distance as a measure of deformation of the underlying domain (Coifman & Lafon, 2006; Nadler et al., 2006). Diffusion distances measure the similarity of two graphs through the time it takes for a signal to be diffused on the graph. The major accomplishment of this paper is to show that the diffusion graph scattering transforms are stable with respect to deformations as measured with respect to diffusion distances. Specifically, consider a signal x supported on graph G whose diffusion scattering transform is denoted by the operator ΨG. Consider now a deformation of the Published as a conference paper at ICLR 2019 signal s domain so that the signal s support is now described by the graph G whose diffusion scattering operator is ΨG . We show that the operator norm distance ΨG ΨG is bounded by a constant multiplied by the diffusion distance between the graphs G and G . The constant in this bound depends on the spectral gap of G but, very importantly, does not depend on the number of nodes in the graph. It is important to point out that finding stable representations is not difficult. E.g., taking signal averages is a representation that is stable to domain deformations indeed, invariant. The challenge is finding a representation that is stable and rich in its description of the signal. In our numerical analyses we show that linear filters can provide representations that are either stable or rich but that cannot be stable and rich at the same time. The situation is analogous to (Euclidean) scattering transforms and is also associated with high frequency components. We can obtain a stable representation by eliminating high frequency components but the representation loses important signal features. Alternatively, we can retain high frequency components to have a rich representation but that representation is unstable to deformations. Diffusion scattering transforms are observed to be not only stable as predicted by our theoretical analysis but also sufficiently rich to achieve good performance in graph signal classification examples. 2 RELATED WORK Since graph and graph signals are of increasing interest but do not have the regular structure that would make use of CNNs appealing, it is pertinent to ask the question of what should be an appropriate generalization of CNNs to graphs and the graph signals whose topology they describe (Bronstein et al., 2017). If one accepts the value of convolutions as prima facie, a natural solution is to replace convolutions with graph shift invariant filters which are known to be valid generalizations of (convolutional) time invariant filters (Bruna et al., 2014). This idea is not only natural but has been demonstrated to work well in practical implementations of Graph Neural Networks (GNNs) (Defferrard et al., 2016; Gama et al., 2019; Gilmer et al., 2017; Henaff et al., 2015; Kipf & Welling, 2017). Same as Euclidean scattering transforms, our graph scattering transforms differ from GNNs in that they do not have to be trained. The advantages and limitations of the absence of training notwithstanding, our work also sheds light on the question of why graph convolutions are appropriate generalizations of regular domain convolutions for signal classification problems. Our work suggests that the value of GNNs stems from their stability relative to deformations of the underlying domain that are close to permutations which is the property that a pair of graphs must satisfy to have small diffusion distance. The stability results obtained in this paper build on the notion of scattering transforms. These scattering representations were introduced by Mallat (2012) and further developed in Bruna & Mallat (2013) with computer vision applications. Since, these representations have been extended to handle transformations on more complex groups, such as roto-translations (Sifre & Mallat, 2013; Oyallon & Mallat, 2015), and to domains such as audio processing (And en & Mallat, 2014) and quantum chemistry (Eickenberg et al., 2017). Similarly as in this work, extensions of scattering to general graphs have been considered in Chen et al. (2014) and Zou & Lerman (2018). Chen et al. (2014) focuses on Haar wavelets that hierarchically coarsen the graph, and relies on building multiresolution pairings. The recent Zou & Lerman (2018) is closest to our work. There, the authors define graph scattering using spectrally constructed wavelets from (Hammond et al., 2011), and establish some properties of the resulting representation, such as energy conservation and stability to spectral perturbations. In contrast, our stability results are established with respect to diffusion metric perturbations, which are generally weaker, in the sense that they define a weaker topology (see Section 3). We use diffusion wavelets (Coifman & Maggioni, 2006) to obtain multi-resolution graph filter banks that are localized in frequency as well as in the graph domain, while spanning the whole spectrum. Diffusion wavelets serve as the constructive basis for the obtained stability results. Our work is also closely related to recent analysis of stability of Graph Neural Networks in the context of surface representations in (Kostrikov et al., 2017). In our work, however, we do not rely on extrinsic deformations and exploit the specific multiresolution structure of wavelets. Published as a conference paper at ICLR 2019 3 PROBLEM SET-UP This section introduces our framework and states the desired stability properties of signal representations defined on general non-Euclidean domains. 3.1 EUCLIDEAN STABILITY TO DEFORMATIONS WITH SCATTERING Motivated by computer vision applications, our analysis starts with the notion of deformation stability. If x(u) L2(Ω) is an image defined over an Euclidean domain Ω Rd, we are interested in signal representations Φ : L2(Ω) RK that are stable with respect to small deformations. If xτ(u) := x(u τ(u)) denotes a change of variables with a differentiable field τ : Ω Ωsuch that τ < 1, then we ask x, τ , Φ(x) Φ(xτ) x τ , with (1) τ := τ denoting a uniform bound on the operator norm of τ. In this setting, a notorious challenge to achieving (1) while keeping enough discriminative power in Φ(x) is to transform the high-frequency content of x in such a way that it becomes stable. Scattering transforms (Mallat, 2012; Bruna & Mallat, 2013) provide such representations by cascading wavelet decompositions with pointwise modulus activation functions. We briefly summarize here their basic definition. Given a mother wavelet ψ L1(Ω) with at least a vanishing moment R ψ(u)du = 0 and with good spatial localization, we consider rotated and dilated versions ψj,c(u) = 2 jdψ(2 j Rcu) using scale parameter j and angle θ {2πc/C}c=0,...,C 1. A wavelet decomposition operator is defined as a filter bank spanning all scales up to a cutoff 2J and all angles: ΨJ : x 7 (x ψj,c)j J,c C. This filter bank is combined with a pointwise modulus activation function ρ(z) = |z|, as well as a low-pass average pooling operator U computing the average over the domain. The resulting representation using m layers becomes Φ(x) = {S0(x), S1(x), . . . , Sm 1(x)} , with (2) Sk(x) = UρΨJρ . . . ΨJx = {U(||x ψα1| ψα2| ψαk|); }α1,...,αk (k = 0, . . . , m 1). The resulting signal representation has the structure of a CNN, in which feature maps are not recombined with each other, and trainable filters are replaced by multiscale, oriented wavelets. It is shown in Mallat (2012) that for appropriate signal classes and wavelet families, the resulting scattering transform satisfies a deformation stablity condition of the form (1), which has been subsequently generalised to broader multiresolution families (Wiatowski & B olcskei, 2018). In essence, the mechanism that provides stability is to capture high-frequency information with the appropriate spatio-temporal tradeoffs, using spatially localized wavelets. 3.2 DEFORMATIONS AND METRIC STABILITY Whereas deformations provide the natural framework to describe geometric stability in Euclidean domains, their generalization to non-Euclidean, non-smooth domains is not straightforward. Let x L2(X). If X is embedded into a low-dimension Euclidean space Ω Rd, such as a 2surface within a three-dimensional space, then one can still define meaningful deformations on X via extrinsic deformations of Ω(Kostrikov et al., 2017). However, in this work we are interested in intrinsic notions of geometric stability, that do not necessarily rely on a pre-existent low-dimensional embedding of the domain. The change of variables ϕ(u) = u τ(u) defining the deformation can be seen as a perturbation of the Euclidean metric in L2(Rd). Indeed, xτ, yτ L2(Rd,µ) = Z Rd xτ(u)yτ(u)dµ(u) = Z Rd x(u)y(u)|I τ(u)|dµ(u) = x, y L2(Rd, µ) , with d µ(u) = |I τ(u)|dµ(u), and |I τ(u)| 1 if τ is small, where I is the identity. Therefore, a possible way to extend the notion of deformation stability to general domains L2(X) is to think of X as a metric space and reason in terms of stability of Φ : L2(X) RK to metric changes in X. This requires a representation that can be defined on generic metric spaces, as well as a criteria to compare how close two metric spaces are. Published as a conference paper at ICLR 2019 3.3 DIFFUSION WAVELETS AND METRICS ON GRAPHS Graphs are flexible data structures that enable general metric structures and modeling non-Euclidean domains. The main ingredients of the scattering transform can be generalized using tools from computational harmonic analysis on graphs. We note that, unlike the case of Euclidean domains, where deformations are equivalent whether they are analyzed from the function domain or its image, in the case of graphs, we focus on deformations on the underlying graph domain, while keeping the same function mapping (i.e. we model deformations as a change of the underlying graph support and analyze how this affects the interaction between the function mapping and the graph). In particular, diffusion wavelets (Coifman & Maggioni, 2006) provide a simple framework to define a multi-resolution analysis from powers of a diffusion operator defined on a graph. A weighted, undirected graph G = (V, E, W) with |V | = n nodes, edge set E and adjacency matrix W Rn n defines a diffusion process A in its nodes, given in its symmetric form by the normalized adjacency A := D 1/2WD 1/2 , with D = diag(d1, . . . , dn) , (3) where di = P (i,j) E Wi,j denotes the degree of node i. Denote by d = W1 the degree vector containing di in the ith element. By construction, A is well-localized in space (it is nonzero only where there is an edge connecting nodes), it is self-adjoint and satisfies A 1, where A is the operator norm. Let λ0 λ1 . . . λn 1 denote its eigenvalues in decreasing order. Defining d1/2 = ( d1, . . . , dn), one can easily verify that the normalized squared root degree vector v = d1/2/ d1/2 2 = d/ d 1 is the eigenvector with associated eigenvalue λ0 = 1. Also, note that λn 1 = 1 if and only if G has a connected component that is non-trivial and bipartite (Chung, 1997). In the following, it will be convenient to assume that the spectrum of A (which is real and discrete since A is self-adjoint and in finite-dimensions) is non-negative. Since we shall be taking powers of A, this will avoid folding negative eigenvalues into positive ones. For that purpose, we adopt the so-called lazy diffusion, given by T = 1 2(I + A). In Section 4 we use this diffusion operator to define both a multiscale wavelet filter bank and a low-pass average pooling, leading to the diffusion scattering representation. This diffusion operator can also be used to construct a metric on G. The so-called diffusion distances (Coifman & Lafon, 2006; Nadler et al., 2006) measure distances between two nodes x, x V in terms of their associated diffusion at time s: d G,s(x, x ) = T s Gδx T s Gδx , where δx is a vector with all zeros except a 1 in position x. In this work, we build on this diffusion metric to define a distance between two graphs G, G . Assuming first that G and G have the same size, the simplest formulation is to compare the diffusion metric generated by G and G up to a node permutation: Definition 3.1. Let G = (V, E, W), G = (V , E , W ) have the same size |V | = |V | = n. The normalized diffusion distance between graphs G, G at time s > 0 is ds(G, G ) := inf Π Πn (T s G) (T s G) ΠT(T s G ) (T s G )Π = inf Π Πn T 2s G ΠTT 2s G Π , (4) where Πn is the space of n n permutation matrices. The diffusion distance is defined at a specific time s. As s increases, this distance becomes weaker1, since it compares points at later stages of diffusion. The role of time is thus to select the smoothness of the graph deformation , similarly as τ measures the smoothness of the deformation in the Euclidean case. For convenience, we denote d(G, G ) = d1/2(G, G ) and use the distance at s = 1/2 as our main deformation measure. The quantity d defines a distance between graphs (seen as metric spaces) yielding a stronger topology than other alternatives such as the Gromov-Hausdorff distance, defined as ds GH(G, G ) = inf Π sup x,x V |ds G(x, x ) ds G (π(x), π(x ))| with ds G(x, x ) = T t G(δx δx ) L2(G) . We choose d(G, G ) in this work for convenience and mathematical tractability, but leave for future work the study of stability relative to ds GH. Finally, 1In the sense that it defines a weaker topology, i.e., limm ds(G, Gm) 0 limm ds (G, Gm) = 0 for s > s, but not vice-versa. Published as a conference paper at ICLR 2019 we consider for simplicity only the case where the sizes of G and G are equal, but definition (3.1) can be naturally extended to compare variable-sized graphs by replacing permutations by softcorrespondences (see Bronstein et al., 2010). 3.4 PROBLEM STATEMENT Our goal is to build a stable and rich representation ΦG(x). The stability property is stated in terms of the diffusion metric above: For a chosen diffusion time s, x Rn , G = (V, E, W), G = (V , E , W ) with |V | = |V | = n , we want ΦG(x) ΦG (x) x ds(G, G ) . (5) This representation can be used to model both signals and domains, or just domains G, by considering a prespecified x = f(G), such as the degree, or by marginalizing from an exchangeable distribution, ΦG = Ex QΦG(x). The motivation of (5) is two-fold: On the one hand, we are interested in applications where the signal of interest may be measured in dynamic environments that modify the domain, e.g. in measuring brain signals across different individuals. On the other hand, in other applications, such as building generative models for graphs, we may be interested in representing the domain G itself. A representation from the adjacency matrix of G needs to build invariance to node permutations, while capturing enough discriminative information to separate different graphs. In particular, and similarly as with Gromov-Hausdorff distances, the definition of d(G, G ) involves a matching problem between two kernel matrices, which defines an NP-hard combinatorial problem. This further motivates the need for efficient representations of graphs ΦG that can efficiently tell apart two graphs, and such that ℓ(θ) = ΦG ΦG(θ) can be used as a differentiable loss for training generative models. 4 GRAPH DIFFUSION SCATTERING Let T be a lazy diffusion operator associated with a graph G of size n such as those described in Section 3.3. Following Coifman & Maggioni (2006), we construct a family of multiscale filters by exploiting the powers of the diffusion operator T 2j. We define ψ0 := I T , ψj := T 2j 1(I T 2j 1) = T 2j 1 T 2j , (j > 0) . (6) This corresponds to a graph wavelet filter bank with optimal spatial localization. Graph diffusion wavelets are localized both in space and frequency, and favor a spatial localization, since they can be obtained with only two filter coefficients, namely h0 = 1 for diffusion T 2j 1 and h1 = 1 for diffusion T 2j. The finest scale ψ0 corresponds to one half of the normalized Laplacian operator ψ0 = (1/2) = 1/2(I D 1/2WD 1/2), here seen as a temporal difference in a diffusion process, seeing each diffusion step (each multiplication by ) as a time step. The coarser scales ψj capture temporal differences at increasingly spaced diffusion times. For j = 0, . . . , Jn 1, we consider the linear operator Ψ : L2(G) (L2(G))Jn x 7 (ψjx)j=0,...,Jn 1 , (7) which is the analog of the wavelet filter bank in the Euclidean domain. Whereas several other options exist to define graph wavelet decompositions (Rustamov & Guibas, 2013; Gavish et al., 2010), and GNN designs that favor frequency localization, such as Cayley filters (Levie et al., 2019), we consider here wavelets that can be expressed with few diffusion terms, favoring spatial over frequential localization, for stability reasons that will become apparent next. We choose dyadic scales for convenience, but the construction is analogous if one replaces scales 2j by γj for any γ > 1 in (6). If the graph G exhibits a spectral gap, i.e., βG = supi=1,...n 1 |λi| < 1, the following proposition proves that the linear operator Ψ defines a stable frame. Proposition 4.1. For each n, let Ψ define the diffusion wavelet decomposition (7) and assume βG < 1. Then there exists a constant 0 < C(β) depending only on β such that for any x Rn Published as a conference paper at ICLR 2019 satisfying x, v = 0, j=0 ψjx 2 x 2 . (8) This proposition thus provides the Littlewood-Paley bounds of Ψ, which control the ability of the filter bank to capture and amplify the signal x along each frequency (i.e. the ability of the filter to increase or decrease the energy of the representation, relative to the energy of the x). We note that diffusion wavelets are neither unitary nor analytic and therefore do not preserve energy. However, the frame bounds in Proposition 4.1 provide lower bounds on the energy lost, such that the smaller 1 β is, the less unitary our diffusion wavelets are. It also informs us about how the spectral gap β determines the appropriate diffusion scale J: The maximum of p(u) = (ur u2r)2 is at u = 2 1/r, thus the cutoff r should align with β as r = 1 log2 β , since larger values of r capture energy in a spectral range where the graph has no information. Therefore, the maximum scale can be adjusted as J = 1 + log2 r = 1 + l log2 1 log2 β m . Recall that the Euclidean Scattering transform is constructed by cascading three building blocks: a wavelet decomposition operator, a pointwise modulus activation function, and an averaging operator. Following the Euclidean scattering, given a graph G and x L2(G), we define an analogous Diffusion Scattering transform ΦG(x) by cascading three building blocks: the Wavelet decomposition operator Ψ, a pointwise activation function ρ, and an average operator U which extracts the average over the domain. The average over a domain can be interpreted as the diffusion at infinite time, thus Ux = limt T tx = v T, x . More specifically, we consider a first layer transformation given by φ1(G, x) = UρΨx = {Uρψjx}0 j Jn 1 , , (9) followed by second order coefficients φ2(G, x) = UρΨρΨx = {Uρψj2ρψj1x}0 j1,j2 Jn 1 , , (10) and so on. The representation obtained from m layers of such transformation is thus ΦG(x) = {Ux, φ1(G, x), . . . , φm 1(G, x)} = {U(ρΨ)kx ; k = 0, . . . , m 1} . (11) 5 STABILITY OF GRAPH DIFFUSION SCATTERING 5.1 STABILITY AND EQUIVARIANCE OF DIFFUSION WAVELETS Given two graphs G, G of size n and a signal x Rn, our objective is to bound ΦG(x) ΦG (x) in terms of d(G, G ). Let π the permutation minimising the distortion between G and G in (4). Since all operations in Φ are either equivariant or invariant with respect to permutations, we can assume w.l.o.g. that π = 1, so that the diffusion distance can be directly computed by comparing nodes with the given order. A key property of G that drives the stability of the diffusion scattering is given by its spectral gap 1 βG = 1 supi=1,...n 1 |λi| 0. In the following, we use ℓ2 operator norms, unless stated otherwise. Lemma 5.1. Assume β := max(βG, βG ) < 1. Then inf Π Πn ΨG ΠΨG ΠT 2d(G, G ) (1 β2)3 . (12) Remark: If diffusion distance is measured at time different from s = 1/2, the stability bound would be modified due to scales j such that 2j < s. The following lemma studies the stability of the low-pass operator U with respect to graph perturbations. Lemma 5.2. Let G, G be two graphs with same size, denote by v and v their respective squaredroot degree vectors, and by β, β their spectral gap. Then inf Π Πn v Πv 2 2 d(G, G ) 1 min(β, β ) . (13) Published as a conference paper at ICLR 2019 Spectral Gap asymptotic behavior Lemmas 5.1 and 5.2 leverage the spectral gap of the lazy diffusion operator associated with G. In some cases, such as regular graphs, the spectral gap vanishes asymptotically as n , thus degrading the upper bound asymptotically. Improving the bound by leveraging other properties of the graph (such as regular degree distribution) is an important open task. 5.2 STABILITY AND INVARIANCE OF DIFFUSION SCATTERING The scattering transform coefficients ΦG(x) obtained after m layers are given by equation 11, for low-pass operator U such that Ux = v, x so that U = v T. From Lemma 5.1 we have that, ΨG ΨG εΨ = 2d(G, G ) p β2(1 + β2)/(1 β2)3. We also know, from Proposition 4.1 that Ψ conforms a frame, i.e. C(β) x 2 Ψx 2 x 2 for known constant C(β) given in Prop. 4.1. Additionally, from Lemma 5.2 we get that UG UG εU= 2d(G, G )/(1 min(β, β )). The objective now is to prove stability of the scattering coefficients ΦG(x), that is, to prove that ΦG(x) ΦG (x) d(G, G ) x . (14) This is captured in the following Theorem: Theorem 5.3. Let G, G be two graphs and let d(G, G ) be their distance measured as in equation 4. Let TG and TG be the respective diffusion operators. Denote by UG, ρG and ΨG and by UG , ρG and ΨG the low pass operator, pointwise nonlinearity and the wavelet filter bank used on the scattering transform defined on each graph, respectively, cf. equation 11. Assume ρG = ρG and that ρG is non-expansive. Let β = min(βG, βG ), β+ = max(βG, βG ) and assume β+ < 1. Then, we have that, for each k = 0, . . . , m 1, the following holds UG(ρGΨG)k UG (ρG ΨG )k 2 1 β d(G, G ) 1/2 + k β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) . (15) Defining ΦG(x) 2 = Pm 1 k=0 UG(ρGΨG)k 2 analogously to Bruna & Mallat (2013), it is straightforward to compute the stability bound on the scattering coefficients as follows. Corollary 5.4. In the context of Theorem 5.3, let x Rn and let ΦG(x) be the scattering coefficients computed by means of equation 11 on graph G after m layers, and let ΦG (x) be the corresponding coefficients on graph G . Then, ΦG(x) ΦG (x) 2 " 2 1 β d(G, G ) 1/2 + k β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) ΦG(x) ΦG (x) m1/2d1/2(G, G ) x if d(G, G ) 1. Corollary 5.4 satisfies equation 5. It also shows that the closer the graphs are in terms of the diffusion metric, the closer their scattering representations will be. The constant is given by topological properties, the spectral gaps of G and G , as well as design parameters, the number of layers m. We observe that the stability bound grows the smaller the spectral gap is and also as more layers are considered. The spectral gap is tightly linked with diffusion processes on graphs, and thus it does emerge from the choice of a diffusion metric. Graphs with values of beta closer to 1, exhibit weaker diffusion paths, and thus a small perturbation on the edges of these graphs would lead to a larger diffusion distance. The contrary holds as well. In other words, the tolerance of the graph to edge perturbations (i.e., d(G, G) being small) depends on the spectral gap of the graph. We also note that, as stated at the end of Section 5.1, the spectral gap appears in our upper bounds, but it is not necessarily sharp. In particular, the spectral gap is a poor indication of stability in regular graphs, and we believe our bound can be improved by leveraging structural properties of regular domains. Finally, we note that the size of the graphs impacts the stability result inasmuch as it impacts the distance measure d(G, G ). This is expected, since graphs of different size can be compared, as mentioned in Section 3.3. Different from Zou & Lerman (2018), our focus is on obtaining graph wavelet banks that are localized in the graph domain to improve computational efficiency as discussed in Defferrard et al. (2016). We also notice that the scattering transform in Zou & Lerman Published as a conference paper at ICLR 2019 (2018) is stable with respect to a graph measure that depends on the spectrum of the graph through both eigenvectors and eigenvalues. More specifically, it is required that the spectrum gets concentrated as the graphs grow. However, in general, it is not straightforward to relate the topological structure of the graph with its spectral properties. As mentioned in Section 3.3, the stability is computed with a metric d(G, G ) which is stronger than what could be hoped for. Our metric is permutation-invariant, in analogy with the rigid translation invariance in the Euclidean case, and stable to small perturbations around permutations. The extension of (16) to weaker metrics, using e.g. multiscale deformations, is left for future work. 5.3 FROM DIFFUSION SCATTERING TO DIFFUSION GNNS By denoting Tj = T 2j, observe that one can approximate the diffusion wavelets from (6) as a cascade of low-pass diffusions followed by a high-pass filter at resolution 2j: ψj = Tj 1(I Tj 1) T P j 0 and p0(x) = 1 x. Denote by QJ(x) = PJ 1 j=0 pj(x)2. It follows from the definition that Ψvi 2 = QJ(λi) for i = 1, . . . , n 1 and therefore C1 = min x (0,β) QJ(x) , C2 = max x (0,β) QJ(x) . (19) We check that C1 minx (0,β) p0(x)2 = (1 β)2 and C2 = QJ(0) = 1. Indeed, denote by Q(x) = P j=0 pj(x)2. One easily verifies that Q(x) is continuous in [0, 1) since it is bounded by a geometric series. Also, observe that Q(x) = (1 x)2+P j>0(x2j 1 x2j)2 satisfies the recurrence Q(x2) = Q(x) + 2x(1 x)2 Q(x) since x [0, 1). By continuity it thus follows that sup x [0,1) QJ(x) sup x [0,1) Q(x) = lim x 0 Q(x) = Q(0) = 1 . B PROOF OF LEMMA 5.1 By definition ΨG ΨG 2 = [ΨG ΨG ] [ΨG ΨG ] and [ΨG ΨG ] [ΨG ΨG ] = j=0 (ψj(G) ψj(G )) (ψj(G) ψj(G )) . (20) Since TG and TG are self-adjoint, so are ψj(G) and ψj(G ). From the triangular inequality, we thus obtain ΨG ΨG 2 X j ψj(G) ψj(G ) 2 . (21) Denote by v the eigenvector associated with λ0 = 1, [v]i = (di/ Pn j=1 dj)1/2. Since β < 1, we can write T = vv T + T with T < 1 and v Null(T). It follows by induction that T r = vv T + T r, and hence ψj(G) = T 2j 1 G T 2j G = T 2j 1 and equivalently for G , resulting in ψj(G) ψj(G ) 2 2 T 2j 1 ΨG ΨG 2 4 X Let us show that for matrices A and B such that β = max( A , B ) < 1 and r N, one has Ar Br rβr 1 A B . (23) Indeed, by noting g(t) = (t B + (1 t)A)r, we have Ar Br = g(1) g(0) = 0 g (t)dt Z 1 0 g (t) dt sup t (0,1) g (t) . Published as a conference paper at ICLR 2019 By noting At = t B + (1 t)A we verify that l=0 Al t(B A)Ar l 1 t , which results in g (t) rβr 1 B A , proving (23). By plugging (23) into (22) we thus obtain ΨG ΨG 2 4 T G T G 2 X j 22jβ2j+1 (24) 4 T G T G 2 X 4 T G T G 2 β2(1 + β2) which yields ΨG ΨG 2 T G T G q (1 β2)3 . Finally, we observe that T G T G = TG TG , which proves (12) as claimed. C PROOF OF LEMMA 5.2 Without loss of generality, assume that the node assignment that minimizes TG ΠTG Π T is the identity. We need to bound the leading eigenvectors of two symmetric matrices TG and TG with a spectral gap. As before, let TG = vv T + T G and TG = v (v )T + T G . Let α = v, v 0 since both are non-negative vectors. Denote by E = TG TG and E = T G T G . Then E = vv T v (v )T + E. Hence Ev = v v α T G (v αv ), so (I T G )(v v α) = Ev Since E d(G, G ) and T G < β , we have (1 α)(1 β ) (I T G )(v v α) d(G, G ) , 1 α d(G, G ) Finally, since v v = 2 2α, we have v v 2 2d(G, G ) Since we are free to swap the role of v and v , the result follows. D PROOF OF THEOREM 5.3 First, note that ρG = ρG = ρ since it is a pointwise nonlinearity (an absolute value), and is independent of the graph topology. Now, let s start with k = 0. In this case, we get UGx UG x which is immediately bounded by Lemma 5.2 satisfying equation 15. For k = 1 we have UGρΨGx UG ρΨG x = UGρΨGx UG ρΨGx + UG ρΨGx UG ρΨG x (26) (UG UG )ρΨGx + UG ρ((ΨG ΨG )x) (27) where the triangular inequality of the norm was used, together with the fact that ρu ρu ρ(u u ) for any real vector u since ρ is the pointwise absolute value. Using the submultiplicativity of the operator norm, we get UGρΨGx UG ρΨG x UG UG ρ ΨG x + UG ρ ΨG ΨG x . (28) Published as a conference paper at ICLR 2019 From Lemmas 5.1 and 5.2 we have that ΨG ΨG εΨ and UG UG εU, and from Proposition 4.1 that ΨG 1. Note also that UG = UG = 1 and that ρ = 1. This yields UGρΨGx UG ρΨG x εU x + εΨ x . (29) satisfying equation 15 for k = 1. For k = 2, we observe that UGρΨGρΨGx UG ρΨG ρΨG x = UGρΨGρΨGx UG ρΨGρΨGx + UG ρΨGρΨGx UG ρΨG ρΨG x (30) (UG UG )ρΨGρΨGx + UG (ρΨGρΨGx ρΨG ρΨG x) (31) The first term is bounded in a straightforward fashion by (UG UG )ρΨGρΨGx εU x in analogy to the development for k = 1. Since UG = 1, for the second term, we focus on ρΨGρΨGx ρΨG ρΨG x = ρΨGρΨGx ρΨGρΨG x + ρΨGρΨG x ρΨG ρΨG x (32) ρΨGρΨGx ρΨGρΨG x + ρΨGρΨG x ρΨG ρΨG x (33) We note that, in the first term in equation 33, the first layer induces an error, but after that, the processing is through the same filter banks. So we are basically interested in bounding the propagation of the error induced in the first layer. Applying twice the fact that ρ(u) ρ(u ) ρ(u u ) we get ρΨGρΨGx ρΨGρΨG x ρ(ΨG(ρΨGx ρΨG x)) ρ(ΨGρ((ΨG ρΨG )x)) . (34) And following with submultiplicativity of the operator norm, ρΨGρΨGx ρΨGρΨG x εΨ x . (35) For the second term in equation 33, we see that the first layer applied is the same in both, namely ρΨG so there is no error induced. Therefore, we are interested in the error obtained after the first layer, which is precisely the same error obtained for k = 1. Therefore, ρΨGρΨG x ρΨG ρΨG x = ρΨGx ρΨG x εΨ x . (36) Plugging equation 35 and equation 36 back in equation 31 we get UGρΨGρΨGx UG ρΨG ρΨG x εU x + εΨ x + εΨ x (37) satisfying equation 15 for k = 2. For general k we see that we will have a first term that is the error induced by the mismatch on the low pass filter that amounts to εU, a second term that accounts for the propagation through (k 1) equal layers of an initial error, yielding εΨ, and a final third term that is the error induced by the previous layer, (k 1)εΨ. More formally, assume that equation 15 holds for k 1, implying that (ρΨG)k 1x (ρΨG )k 1x (k 1)εΨ x (38) Then, for k, we can write UG(ρΨG)kx UG (ρΨG )kx (UG UG )(ρΨG)kx + UG ((ρΨG)kx (ρΨG )kx) (39) Again, the first term we bound it in a straightforward manner using submultiplicativity of the operator norm (UG UG )(ρΨG)kx εU x . (40) For the second term, since UG = 1 we focus on (ρΨG)kx (ρΨG )kx = (ρΨG)kx (ρΨG)k 1ρΨG x + (ρΨG)k 1ρΨG x (ρΨG )kx (41) (ρΨG)k 1ρΨGx (ρΨG)k 1ρΨG x + (ρΨG)k 1ρΨG x (ρΨG )k 1ρΨG x (42) The first term in equation 42 computes the propagation in the initial error caused by the first layer. Then, repeatedly applying ρ(u) ρ(u ) ρ(u u ) in analogy with k = 2 and using submultiplicativity, we get (ρΨG)k 1ρΨGx (ρΨG)k 1ρΨG x εΨ x . (43) Published as a conference paper at ICLR 2019 The second term in equation 42 is the bounded by equation 38, since the first layer is exactly the same in this second term. Then, combining equation 43 with equation 38, yields (ρΨG)kx (ρΨG )kx εΨ + (k 1)εΨ x = kεΨ x . (44) Overall, we get UG(ρΨG)kx UG (ρΨG )kx εU x + εΨ x (45) which satisfies equation 15 for k. Finally, since this holds for k = 2, the proof is completed by induction. E PROOF OF COROLLARY 5.4 From Theorem 5.3, we have UG(ρGΨG)k UG (ρG ΨG )k 2 1 β ds(G, G ) 1/2 + k β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) (46) and, by definition (Bruna & Mallat, 2013, Sec. 3.1), k=0 UG(ρGΨG)kx 2 (47) ΦG(x) ΦG (x) 2 = k=0 UG(ρGΨG)kx UG (ρG ΨG )kx 2 (48) Then, applying the inequality of Theorem 5.3, we get " 2 1 β d(G, G ) 1/2 + k β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) Now, considering each term, such that 2 1 β d(G, G ) + k=0 k2 β2 +(1 + β2 +) (1 β2 +)3 d2(G, G ) (50) β2 +(1 + β2 +) (1 β )(1 β2 +)3 d3/2(G, G ) (51) we observe that the second and third term vanish for d(G, G ) 1, leaving only the first term, yielding ΦG ΦG 2 m 2 1 β d(G, G ) . (52) Finally, this leads to the corollary result ΦG(x) ΦG(x) m1/2 d1/2(G, G ) x if d(G, G ) 1 (53) completing the proof. F PROOF OF COROLLARY 5.5 Let us denote by ej = x(j) G x(j) G . From (17), from the triangle inequality we verify that ej+1 ajej + bj , with aj = θ(j) 1 + θ(j) 2 , bj = β2j 1d(G, G ) x(j) G and e0 = 0 . Published as a conference paper at ICLR 2019 It follows that j=1 (1 + aj) bj = β2j 1d(G, G ) x(j) G β2j 1d(G, G ) x j=1 (1 + aj) j=1 (1 + aj) d(G, G ) x X j=1 (1 + aj) d(G, G ) x 1 1 β , (55) as desired.