# diffusion_scattering_transforms_on_graphs__251fbb86.pdf

Published as a conference paper at ICLR 2019

DIFFUSION SCATTERING TRANSFORMS ON GRAPHS

Fernando Gama Dept. of Electrical and Systems Engineering University of Pennsylvania

Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania

Joan Bruna Courant Institute of Mathematical Sciences New York University

Stability is a key aspect of data analysis. In many applications, the natural notion of stability is geometric, as illustrated for example in computer vision. Scattering transforms construct deep convolutional representations which are certiﬁed stable to input deformations. This stability to deformations can be interpreted as stability with respect to changes in the metric structure of the domain. In this work, we show that scattering transforms can be generalized to non Euclidean domains using diffusion wavelets, while preserving a notion of stability with respect to metric changes in the domain, measured with diffusion maps. The resulting representation is stable to metric perturbations of the domain while being able to capture high-frequency information, akin to the Euclidean Scattering.

1 INTRODUCTION

Convolutional Neural Networks (CNN) are layered information processing architectures. Each of the layers in a CNN is itself the composition of a convolution operation with a pointwise nonlinearity where the ﬁlters used at different layers are the outcome of a data-driven optimization process (Le Cun et al., 2010; 2015). Scattering transforms have an analogous layered architecture but differ from CNNs in that the convolutional ﬁlters used at different layers are not trained but selected from a multi-resolution ﬁlter bank (Mallat, 2012; Bruna & Mallat, 2013). The fact that they are not trained endows scattering transforms with intrinsic value in situations where training is impossible and inherent limitations in the converse case. That said, an equally important value of scattering transforms is that by isolating the convolutional layered architecture from training effects it permits analysis of the fundamental properties of CNN information processing architectures. This analysis is undertaken in Mallat (2012); Bruna & Mallat (2013) where the fundamental conclusion is about the stability of scattering transforms with respect to deformations in the underlying domain that are close to translations.

In this paper we consider graphs and signals supported on graphs such as brain connectivity networks and functional activity levels (Huang et al., 2016), social networks and opinions (Jackson, 2008), or user similarity networks and ratings in recommendation systems (Huang et al., 2018). Our speciﬁc goals are: (i) To deﬁne a family of graph-scattering transforms. (ii) To deﬁne a notion of deformation for graph signals. (iii) To study the stability of graph scattering transforms with respect to this notion of deformation. To accomplish goal (i) we consider the family of graph diffusion wavelets which provide an appropriate construction of a multi-resolution ﬁlter bank (Coifman & Maggioni, 2006). Our diffusion scattering transforms are deﬁned as the layered composition of diffusion wavelet ﬁlter banks and pointwise nonlinearities. To accomplish goal (ii) we adopt the graph diffusion distance as a measure of deformation of the underlying domain (Coifman & Lafon, 2006; Nadler et al., 2006). Diffusion distances measure the similarity of two graphs through the time it takes for a signal to be diffused on the graph. The major accomplishment of this paper is to show that the diffusion graph scattering transforms are stable with respect to deformations as measured with respect to diffusion distances. Speciﬁcally, consider a signal x supported on graph G whose diffusion scattering transform is denoted by the operator ΨG. Consider now a deformation of the

Published as a conference paper at ICLR 2019

signal s domain so that the signal s support is now described by the graph G whose diffusion scattering operator is ΨG . We show that the operator norm distance ΨG ΨG is bounded by a constant multiplied by the diffusion distance between the graphs G and G . The constant in this bound depends on the spectral gap of G but, very importantly, does not depend on the number of nodes in the graph.

It is important to point out that ﬁnding stable representations is not difﬁcult. E.g., taking signal averages is a representation that is stable to domain deformations indeed, invariant. The challenge is ﬁnding a representation that is stable and rich in its description of the signal. In our numerical analyses we show that linear ﬁlters can provide representations that are either stable or rich but that cannot be stable and rich at the same time. The situation is analogous to (Euclidean) scattering transforms and is also associated with high frequency components. We can obtain a stable representation by eliminating high frequency components but the representation loses important signal features. Alternatively, we can retain high frequency components to have a rich representation but that representation is unstable to deformations. Diffusion scattering transforms are observed to be not only stable as predicted by our theoretical analysis but also sufﬁciently rich to achieve good performance in graph signal classiﬁcation examples.

2 RELATED WORK

Since graph and graph signals are of increasing interest but do not have the regular structure that would make use of CNNs appealing, it is pertinent to ask the question of what should be an appropriate generalization of CNNs to graphs and the graph signals whose topology they describe (Bronstein et al., 2017). If one accepts the value of convolutions as prima facie, a natural solution is to replace convolutions with graph shift invariant ﬁlters which are known to be valid generalizations of (convolutional) time invariant ﬁlters (Bruna et al., 2014). This idea is not only natural but has been demonstrated to work well in practical implementations of Graph Neural Networks (GNNs) (Defferrard et al., 2016; Gama et al., 2019; Gilmer et al., 2017; Henaff et al., 2015; Kipf & Welling, 2017). Same as Euclidean scattering transforms, our graph scattering transforms differ from GNNs in that they do not have to be trained. The advantages and limitations of the absence of training notwithstanding, our work also sheds light on the question of why graph convolutions are appropriate generalizations of regular domain convolutions for signal classiﬁcation problems. Our work suggests that the value of GNNs stems from their stability relative to deformations of the underlying domain that are close to permutations which is the property that a pair of graphs must satisfy to have small diffusion distance.

The stability results obtained in this paper build on the notion of scattering transforms. These scattering representations were introduced by Mallat (2012) and further developed in Bruna & Mallat (2013) with computer vision applications. Since, these representations have been extended to handle transformations on more complex groups, such as roto-translations (Sifre & Mallat, 2013; Oyallon & Mallat, 2015), and to domains such as audio processing (And en & Mallat, 2014) and quantum chemistry (Eickenberg et al., 2017).

Similarly as in this work, extensions of scattering to general graphs have been considered in Chen et al. (2014) and Zou & Lerman (2018). Chen et al. (2014) focuses on Haar wavelets that hierarchically coarsen the graph, and relies on building multiresolution pairings. The recent Zou & Lerman (2018) is closest to our work. There, the authors deﬁne graph scattering using spectrally constructed wavelets from (Hammond et al., 2011), and establish some properties of the resulting representation, such as energy conservation and stability to spectral perturbations. In contrast, our stability results are established with respect to diffusion metric perturbations, which are generally weaker, in the sense that they deﬁne a weaker topology (see Section 3). We use diffusion wavelets (Coifman & Maggioni, 2006) to obtain multi-resolution graph ﬁlter banks that are localized in frequency as well as in the graph domain, while spanning the whole spectrum. Diffusion wavelets serve as the constructive basis for the obtained stability results. Our work is also closely related to recent analysis of stability of Graph Neural Networks in the context of surface representations in (Kostrikov et al., 2017). In our work, however, we do not rely on extrinsic deformations and exploit the speciﬁc multiresolution structure of wavelets.

Published as a conference paper at ICLR 2019

3 PROBLEM SET-UP

This section introduces our framework and states the desired stability properties of signal representations deﬁned on general non-Euclidean domains.

3.1 EUCLIDEAN STABILITY TO DEFORMATIONS WITH SCATTERING

Motivated by computer vision applications, our analysis starts with the notion of deformation stability. If x(u) L2(Ω) is an image deﬁned over an Euclidean domain Ω Rd, we are interested in signal representations Φ : L2(Ω) RK that are stable with respect to small deformations. If xτ(u) := x(u τ(u)) denotes a change of variables with a differentiable ﬁeld τ : Ω Ωsuch that τ < 1, then we ask

x, τ , Φ(x) Φ(xτ) x τ , with (1)

τ := τ denoting a uniform bound on the operator norm of τ. In this setting, a notorious challenge to achieving (1) while keeping enough discriminative power in Φ(x) is to transform the high-frequency content of x in such a way that it becomes stable.

Scattering transforms (Mallat, 2012; Bruna & Mallat, 2013) provide such representations by cascading wavelet decompositions with pointwise modulus activation functions. We brieﬂy summarize here their basic deﬁnition. Given a mother wavelet ψ L1(Ω) with at least a vanishing moment R ψ(u)du = 0 and with good spatial localization, we consider rotated and dilated versions ψj,c(u) = 2 jdψ(2 j Rcu) using scale parameter j and angle θ {2πc/C}c=0,...,C 1. A wavelet decomposition operator is deﬁned as a ﬁlter bank spanning all scales up to a cutoff 2J and all angles: ΨJ : x 7 (x ψj,c)j J,c C. This ﬁlter bank is combined with a pointwise modulus activation function ρ(z) = |z|, as well as a low-pass average pooling operator U computing the average over the domain. The resulting representation using m layers becomes

Φ(x) = {S0(x), S1(x), . . . , Sm 1(x)} , with (2) Sk(x) = UρΨJρ . . . ΨJx = {U(||x ψα1| ψα2| ψαk|); }α1,...,αk (k = 0, . . . , m 1).

The resulting signal representation has the structure of a CNN, in which feature maps are not recombined with each other, and trainable ﬁlters are replaced by multiscale, oriented wavelets. It is shown in Mallat (2012) that for appropriate signal classes and wavelet families, the resulting scattering transform satisﬁes a deformation stablity condition of the form (1), which has been subsequently generalised to broader multiresolution families (Wiatowski & B olcskei, 2018). In essence, the mechanism that provides stability is to capture high-frequency information with the appropriate spatio-temporal tradeoffs, using spatially localized wavelets.

3.2 DEFORMATIONS AND METRIC STABILITY

Whereas deformations provide the natural framework to describe geometric stability in Euclidean domains, their generalization to non-Euclidean, non-smooth domains is not straightforward. Let x L2(X). If X is embedded into a low-dimension Euclidean space Ω Rd, such as a 2surface within a three-dimensional space, then one can still deﬁne meaningful deformations on X via extrinsic deformations of Ω(Kostrikov et al., 2017).

However, in this work we are interested in intrinsic notions of geometric stability, that do not necessarily rely on a pre-existent low-dimensional embedding of the domain. The change of variables ϕ(u) = u τ(u) deﬁning the deformation can be seen as a perturbation of the Euclidean metric in L2(Rd). Indeed,

xτ, yτ L2(Rd,µ) = Z

Rd xτ(u)yτ(u)dµ(u) = Z

Rd x(u)y(u)|I τ(u)|dµ(u) = x, y L2(Rd, µ) ,

with d µ(u) = |I τ(u)|dµ(u), and |I τ(u)| 1 if τ is small, where I is the identity. Therefore, a possible way to extend the notion of deformation stability to general domains L2(X) is to think of X as a metric space and reason in terms of stability of Φ : L2(X) RK to metric changes in X. This requires a representation that can be deﬁned on generic metric spaces, as well as a criteria to compare how close two metric spaces are.

Published as a conference paper at ICLR 2019

3.3 DIFFUSION WAVELETS AND METRICS ON GRAPHS

Graphs are ﬂexible data structures that enable general metric structures and modeling non-Euclidean domains. The main ingredients of the scattering transform can be generalized using tools from computational harmonic analysis on graphs. We note that, unlike the case of Euclidean domains, where deformations are equivalent whether they are analyzed from the function domain or its image, in the case of graphs, we focus on deformations on the underlying graph domain, while keeping the same function mapping (i.e. we model deformations as a change of the underlying graph support and analyze how this affects the interaction between the function mapping and the graph).

In particular, diffusion wavelets (Coifman & Maggioni, 2006) provide a simple framework to deﬁne a multi-resolution analysis from powers of a diffusion operator deﬁned on a graph. A weighted, undirected graph G = (V, E, W) with |V | = n nodes, edge set E and adjacency matrix W Rn n deﬁnes a diffusion process A in its nodes, given in its symmetric form by the normalized adjacency

A := D 1/2WD 1/2 , with D = diag(d1, . . . , dn) , (3)

where di = P

(i,j) E Wi,j denotes the degree of node i. Denote by d = W1 the degree vector containing di in the ith element. By construction, A is well-localized in space (it is nonzero only where there is an edge connecting nodes), it is self-adjoint and satisﬁes A 1, where A is the operator norm. Let λ0 λ1 . . . λn 1 denote its eigenvalues in decreasing order. Deﬁning d1/2 = ( d1, . . . , dn), one can easily verify that the normalized squared root degree vector v = d1/2/ d1/2 2 = d/ d 1 is the eigenvector with associated eigenvalue λ0 = 1. Also, note that λn 1 = 1 if and only if G has a connected component that is non-trivial and bipartite (Chung, 1997).

In the following, it will be convenient to assume that the spectrum of A (which is real and discrete since A is self-adjoint and in ﬁnite-dimensions) is non-negative. Since we shall be taking powers of A, this will avoid folding negative eigenvalues into positive ones. For that purpose, we adopt the so-called lazy diffusion, given by T = 1

2(I + A). In Section 4 we use this diffusion operator to deﬁne both a multiscale wavelet ﬁlter bank and a low-pass average pooling, leading to the diffusion scattering representation.

This diffusion operator can also be used to construct a metric on G. The so-called diffusion distances (Coifman & Lafon, 2006; Nadler et al., 2006) measure distances between two nodes x, x V in terms of their associated diffusion at time s: d G,s(x, x ) = T s Gδx T s Gδx , where δx is a vector with all zeros except a 1 in position x.

In this work, we build on this diffusion metric to deﬁne a distance between two graphs G, G . Assuming ﬁrst that G and G have the same size, the simplest formulation is to compare the diffusion metric generated by G and G up to a node permutation: Deﬁnition 3.1. Let G = (V, E, W), G = (V , E , W ) have the same size |V | = |V | = n. The normalized diffusion distance between graphs G, G at time s > 0 is

ds(G, G ) := inf Π Πn (T s G) (T s G) ΠT(T s G ) (T s G )Π = inf Π Πn T 2s G ΠTT 2s G Π , (4)

where Πn is the space of n n permutation matrices.

The diffusion distance is deﬁned at a speciﬁc time s. As s increases, this distance becomes weaker1, since it compares points at later stages of diffusion. The role of time is thus to select the smoothness of the graph deformation , similarly as τ measures the smoothness of the deformation in the Euclidean case. For convenience, we denote d(G, G ) = d1/2(G, G ) and use the distance at s = 1/2 as our main deformation measure. The quantity d deﬁnes a distance between graphs (seen as metric spaces) yielding a stronger topology than other alternatives such as the Gromov-Hausdorff distance, deﬁned as

ds GH(G, G ) = inf Π sup x,x V |ds G(x, x ) ds G (π(x), π(x ))|

with ds G(x, x ) = T t G(δx δx ) L2(G) . We choose d(G, G ) in this work for convenience and mathematical tractability, but leave for future work the study of stability relative to ds GH. Finally,

1In the sense that it deﬁnes a weaker topology, i.e., limm ds(G, Gm) 0 limm ds (G, Gm) = 0 for s > s, but not vice-versa.

Published as a conference paper at ICLR 2019

we consider for simplicity only the case where the sizes of G and G are equal, but deﬁnition (3.1) can be naturally extended to compare variable-sized graphs by replacing permutations by softcorrespondences (see Bronstein et al., 2010).

3.4 PROBLEM STATEMENT

Our goal is to build a stable and rich representation ΦG(x). The stability property is stated in terms of the diffusion metric above: For a chosen diffusion time s, x Rn , G = (V, E, W), G = (V , E , W ) with |V | = |V | = n , we want

ΦG(x) ΦG (x) x ds(G, G ) . (5)

This representation can be used to model both signals and domains, or just domains G, by considering a prespeciﬁed x = f(G), such as the degree, or by marginalizing from an exchangeable distribution, ΦG = Ex QΦG(x).

The motivation of (5) is two-fold: On the one hand, we are interested in applications where the signal of interest may be measured in dynamic environments that modify the domain, e.g. in measuring brain signals across different individuals. On the other hand, in other applications, such as building generative models for graphs, we may be interested in representing the domain G itself. A representation from the adjacency matrix of G needs to build invariance to node permutations, while capturing enough discriminative information to separate different graphs. In particular, and similarly as with Gromov-Hausdorff distances, the deﬁnition of d(G, G ) involves a matching problem between two kernel matrices, which deﬁnes an NP-hard combinatorial problem. This further motivates the need for efﬁcient representations of graphs ΦG that can efﬁciently tell apart two graphs, and such that ℓ(θ) = ΦG ΦG(θ) can be used as a differentiable loss for training generative models.

4 GRAPH DIFFUSION SCATTERING

Let T be a lazy diffusion operator associated with a graph G of size n such as those described in Section 3.3. Following Coifman & Maggioni (2006), we construct a family of multiscale ﬁlters by exploiting the powers of the diffusion operator T 2j. We deﬁne

ψ0 := I T , ψj := T 2j 1(I T 2j 1) = T 2j 1 T 2j , (j > 0) . (6)

This corresponds to a graph wavelet ﬁlter bank with optimal spatial localization. Graph diffusion wavelets are localized both in space and frequency, and favor a spatial localization, since they can be obtained with only two ﬁlter coefﬁcients, namely h0 = 1 for diffusion T 2j 1 and h1 = 1 for diffusion T 2j. The ﬁnest scale ψ0 corresponds to one half of the normalized Laplacian operator ψ0 = (1/2) = 1/2(I D 1/2WD 1/2), here seen as a temporal difference in a diffusion process, seeing each diffusion step (each multiplication by ) as a time step. The coarser scales ψj capture temporal differences at increasingly spaced diffusion times. For j = 0, . . . , Jn 1, we consider the linear operator

Ψ : L2(G) (L2(G))Jn

x 7 (ψjx)j=0,...,Jn 1 , (7)

which is the analog of the wavelet ﬁlter bank in the Euclidean domain. Whereas several other options exist to deﬁne graph wavelet decompositions (Rustamov & Guibas, 2013; Gavish et al., 2010), and GNN designs that favor frequency localization, such as Cayley ﬁlters (Levie et al., 2019), we consider here wavelets that can be expressed with few diffusion terms, favoring spatial over frequential localization, for stability reasons that will become apparent next. We choose dyadic scales for convenience, but the construction is analogous if one replaces scales 2j by γj for any γ > 1 in (6).

If the graph G exhibits a spectral gap, i.e., βG = supi=1,...n 1 |λi| < 1, the following proposition proves that the linear operator Ψ deﬁnes a stable frame. Proposition 4.1. For each n, let Ψ deﬁne the diffusion wavelet decomposition (7) and assume βG < 1. Then there exists a constant 0 < C(β) depending only on β such that for any x Rn

Published as a conference paper at ICLR 2019

satisfying x, v = 0,

j=0 ψjx 2 x 2 . (8)

This proposition thus provides the Littlewood-Paley bounds of Ψ, which control the ability of the ﬁlter bank to capture and amplify the signal x along each frequency (i.e. the ability of the ﬁlter to increase or decrease the energy of the representation, relative to the energy of the x). We note that diffusion wavelets are neither unitary nor analytic and therefore do not preserve energy. However, the frame bounds in Proposition 4.1 provide lower bounds on the energy lost, such that the smaller 1 β is, the less unitary our diffusion wavelets are. It also informs us about how the spectral gap β determines the appropriate diffusion scale J: The maximum of p(u) = (ur u2r)2 is at u = 2 1/r, thus the cutoff r should align with β as r = 1 log2 β , since larger values of r capture energy in a spectral range where the graph has no information. Therefore, the maximum scale can be adjusted as J = 1 + log2 r = 1 + l log2 1 log2 β m .

Recall that the Euclidean Scattering transform is constructed by cascading three building blocks: a wavelet decomposition operator, a pointwise modulus activation function, and an averaging operator. Following the Euclidean scattering, given a graph G and x L2(G), we deﬁne an analogous Diffusion Scattering transform ΦG(x) by cascading three building blocks: the Wavelet decomposition operator Ψ, a pointwise activation function ρ, and an average operator U which extracts the average over the domain. The average over a domain can be interpreted as the diffusion at inﬁnite time, thus Ux = limt T tx = v T, x . More speciﬁcally, we consider a ﬁrst layer transformation given by φ1(G, x) = UρΨx = {Uρψjx}0 j Jn 1 , , (9) followed by second order coefﬁcients

φ2(G, x) = UρΨρΨx = {Uρψj2ρψj1x}0 j1,j2 Jn 1 , , (10)

and so on. The representation obtained from m layers of such transformation is thus

ΦG(x) = {Ux, φ1(G, x), . . . , φm 1(G, x)} = {U(ρΨ)kx ; k = 0, . . . , m 1} . (11)

5 STABILITY OF GRAPH DIFFUSION SCATTERING

5.1 STABILITY AND EQUIVARIANCE OF DIFFUSION WAVELETS

Given two graphs G, G of size n and a signal x Rn, our objective is to bound ΦG(x) ΦG (x) in terms of d(G, G ). Let π the permutation minimising the distortion between G and G in (4). Since all operations in Φ are either equivariant or invariant with respect to permutations, we can assume w.l.o.g. that π = 1, so that the diffusion distance can be directly computed by comparing nodes with the given order. A key property of G that drives the stability of the diffusion scattering is given by its spectral gap 1 βG = 1 supi=1,...n 1 |λi| 0. In the following, we use ℓ2 operator norms, unless stated otherwise. Lemma 5.1. Assume β := max(βG, βG ) < 1. Then

inf Π Πn ΨG ΠΨG ΠT 2d(G, G )

(1 β2)3 . (12)

Remark: If diffusion distance is measured at time different from s = 1/2, the stability bound would be modiﬁed due to scales j such that 2j < s.

The following lemma studies the stability of the low-pass operator U with respect to graph perturbations. Lemma 5.2. Let G, G be two graphs with same size, denote by v and v their respective squaredroot degree vectors, and by β, β their spectral gap. Then

inf Π Πn v Πv 2 2 d(G, G ) 1 min(β, β ) . (13)

Published as a conference paper at ICLR 2019

Spectral Gap asymptotic behavior Lemmas 5.1 and 5.2 leverage the spectral gap of the lazy diffusion operator associated with G. In some cases, such as regular graphs, the spectral gap vanishes asymptotically as n , thus degrading the upper bound asymptotically. Improving the bound by leveraging other properties of the graph (such as regular degree distribution) is an important open task.

5.2 STABILITY AND INVARIANCE OF DIFFUSION SCATTERING

The scattering transform coefﬁcients ΦG(x) obtained after m layers are given by equation 11, for low-pass operator U such that Ux = v, x so that U = v T.

From Lemma 5.1 we have that, ΨG ΨG εΨ = 2d(G, G ) p

β2(1 + β2)/(1 β2)3. We also know, from Proposition 4.1 that Ψ conforms a frame, i.e. C(β) x 2 Ψx 2 x 2 for known constant C(β) given in Prop. 4.1. Additionally, from Lemma 5.2 we get that UG UG εU= 2d(G, G )/(1 min(β, β )).

The objective now is to prove stability of the scattering coefﬁcients ΦG(x), that is, to prove that

ΦG(x) ΦG (x) d(G, G ) x . (14)

This is captured in the following Theorem: Theorem 5.3. Let G, G be two graphs and let d(G, G ) be their distance measured as in equation 4. Let TG and TG be the respective diffusion operators. Denote by UG, ρG and ΨG and by UG , ρG and ΨG the low pass operator, pointwise nonlinearity and the wavelet ﬁlter bank used on the scattering transform deﬁned on each graph, respectively, cf. equation 11. Assume ρG = ρG and that ρG is non-expansive. Let β = min(βG, βG ), β+ = max(βG, βG ) and assume β+ < 1. Then, we have that, for each k = 0, . . . , m 1, the following holds

UG(ρGΨG)k UG (ρG ΨG )k 2 1 β d(G, G ) 1/2 + k

β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) . (15)

Deﬁning ΦG(x) 2 = Pm 1 k=0 UG(ρGΨG)k 2 analogously to Bruna & Mallat (2013), it is straightforward to compute the stability bound on the scattering coefﬁcients as follows. Corollary 5.4. In the context of Theorem 5.3, let x Rn and let ΦG(x) be the scattering coefﬁcients computed by means of equation 11 on graph G after m layers, and let ΦG (x) be the corresponding coefﬁcients on graph G . Then,

ΦG(x) ΦG (x) 2

" 2 1 β d(G, G ) 1/2 + k

β2 +(1 + β2 +) (1 β2 +)3 d(G, G )

ΦG(x) ΦG (x) m1/2d1/2(G, G ) x if d(G, G ) 1.

Corollary 5.4 satisﬁes equation 5. It also shows that the closer the graphs are in terms of the diffusion metric, the closer their scattering representations will be. The constant is given by topological properties, the spectral gaps of G and G , as well as design parameters, the number of layers m. We observe that the stability bound grows the smaller the spectral gap is and also as more layers are considered. The spectral gap is tightly linked with diffusion processes on graphs, and thus it does emerge from the choice of a diffusion metric. Graphs with values of beta closer to 1, exhibit weaker diffusion paths, and thus a small perturbation on the edges of these graphs would lead to a larger diffusion distance. The contrary holds as well. In other words, the tolerance of the graph to edge perturbations (i.e., d(G, G) being small) depends on the spectral gap of the graph. We also note that, as stated at the end of Section 5.1, the spectral gap appears in our upper bounds, but it is not necessarily sharp. In particular, the spectral gap is a poor indication of stability in regular graphs, and we believe our bound can be improved by leveraging structural properties of regular domains.

Finally, we note that the size of the graphs impacts the stability result inasmuch as it impacts the distance measure d(G, G ). This is expected, since graphs of different size can be compared, as mentioned in Section 3.3. Different from Zou & Lerman (2018), our focus is on obtaining graph wavelet banks that are localized in the graph domain to improve computational efﬁciency as discussed in Defferrard et al. (2016). We also notice that the scattering transform in Zou & Lerman

Published as a conference paper at ICLR 2019

(2018) is stable with respect to a graph measure that depends on the spectrum of the graph through both eigenvectors and eigenvalues. More speciﬁcally, it is required that the spectrum gets concentrated as the graphs grow. However, in general, it is not straightforward to relate the topological structure of the graph with its spectral properties.

As mentioned in Section 3.3, the stability is computed with a metric d(G, G ) which is stronger than what could be hoped for. Our metric is permutation-invariant, in analogy with the rigid translation invariance in the Euclidean case, and stable to small perturbations around permutations. The extension of (16) to weaker metrics, using e.g. multiscale deformations, is left for future work.

5.3 FROM DIFFUSION SCATTERING TO DIFFUSION GNNS

By denoting Tj = T 2j, observe that one can approximate the diffusion wavelets from (6) as a cascade of low-pass diffusions followed by a high-pass ﬁlter at resolution 2j:

ψj = Tj 1(I Tj 1) T P

j <j 1 2j (I Tj 1) =

This pyramidal structure of multi-resolution analysis wavelets in which each layer now corresponds to a different scale, shows that the diffusion scattering is a particular instance of GNNs where each layer j is generated by the pair of operators {I, Tj 1}. If x(j) Rn dj denotes the feature representation at layer j using dj feature maps per node, the corresponding update is given by

x(j+1) = ρ x(j)θ(j) 1 + Tj 1x(j)θ(j) 2 , (17)

where θ(j) 1 , θ(j) 2 are dj dj+1 weight matrices. In this case, a simple modiﬁcation of the previous theorem shows that the resulting GNN representation ΦG(x, Θ), Θ = (θ(j) 1 , θ(j) 2 )j J is also stable with respect to d(G, G ), albeit this time the constants are parameter-dependent:

Corollary 5.5. The J layer GNN with parameters Θ = (θ(j) 1 , θ(j) 2 )j J satisﬁes

ΦG(x, Θ) ΦG (x, Θ) d(G, G ) x

j J (1 + θ(j) 1 + θ(j) 2 )

This bound is thus learning-agnostic and is proved by elementary application of the diffusion distance deﬁnition. An interesting question left for future work is to relate such stability to gradient descent optimization biases, similarly as in (Gunasekar et al., 2018; Wei et al., 2018), which could provide stability certiﬁcates for learnt GNN representations.

6 NUMERICAL EXPERIMENTS

In this section, we ﬁrst show empirically the dependence of the stability result with respect to the spectral gap, and then we illustrate the discriminative power of the diffusion scattering transform in two different classiﬁcation tasks; namely, the problems of authorship attribution and source localization.

Consider a small-world graph G with N = 200 nodes, edge probability p SW and rewiring probability q SW = 0.1. Let x N(0, I) be a random graph signal deﬁned on top of G and ΦG(x) the corresponding scattering transform. Let G be another realization of the small-world graph, and let ΦG (x) be the scattering representation of the same graph signal x but on the different support G . We can then proceed to compute ΦG(x) ΦG (x) . By changing the value of p SW we can change value of the spectral gap β and study the dependence of the difference in representations as a function of the spectral gap. Results shown in Fig. 1a are obtained by varying p SW from 0.1 to 0.9. For each value of p SW we generate one graph G and 50 different graphs G ; and for each graph G we compute ΦG(x) ΦG (x) for 1, 000 different graph signal realizations x. The average across all signal realizations is considered the estimate of the representation difference, and then the mean as well as the variance across all graphs are computed and shown in the ﬁgure (error bars).

Published as a conference paper at ICLR 2019

(a) Small world (b) Authorship attribution (c) Facebook graph

Figure 1. (a) Difference in representation between the signal deﬁned on the original graph G and on the deformed graph G as a function of the spectral gap β. (b)-(c) Classiﬁcation error percentage as a function of perturbation for the authorship attribution and the Facebook graph, respectively.

Fig. 1a shows the average difference ΦG(x) ΦG(x ) as a function of the spectral gap (changing p SW from 0.1 to 0.9 led to values of spectral gap between 0.5 and close to 1). First and foremost we observe that, indeed, as β reaches one, the stability result gets worse and the representation difference increases. We also observe that, for deeper scattering representations, the difference also gets worse, although it is not a linear behaviour as predicted in equation 16, which suggest that the bound is not tight.

For classifying we train a SVM linear model fed by features obtained from different representations. We thus compare with two non-trainable linear representations of the data: a data-based method (using the graph signals to feed the classiﬁer) and a graph-based method (obtaining the GFT coefﬁcients as features for the data). Additionally, we consider the graph scattering with varying depth to analyze the richness of the representation. Our aim is mainly to illustrate that the scattering representation is rich enough, relative to linear representations, and is stable to graph deformations.

First, we consider the problem of authorship attribution where the main task is to determine if a given text was written by a certain author. We construct author proﬁles by means of word adjacency networks (WAN). This WAN acts as the underlying graph support for the graph signal representing the word count (bag-of-words) of the target text of unknown authorship. Intuitively, the choice of words of the target text should reﬂect the pairwise connections in the WAN, see Segarra et al. (2015) for detailed construction of WANs. In particular, we consider all works by Jane Austen. To illustrate the stability result, we construct a WAN with 188 nodes (functional words) using a varying number of texts to form the training set, obtaining an array of graphs that are similar but not exactly the same. For the test set, we include 154 excerpts by Jane Austen and 154 excerpts written by other contemporary authors, totaling 308 data points. Fig. 1b shows classiﬁcation error as a function of the number of training samples used. We observe that graph scattering transforms monotonically improve while considering more training data, whereas other methods vary more erratically, showing their lack of stability (their representations vary more wildly when the underlying graph support changes). This shows that scattering diffusion transforms strike a good balance between stability and discriminative power.

For the second task, let G be a 234-node graph modeling real-world Facebook interactions (Mc Auley & Leskovec, 2012). In the source localization problem, we observe a diffusion process after some unknown time t, that originated at some unknown node i, i.e. we observe x = W tδi, where δi is the signal with all zeros except a 1 on node i. The objective is to determine which community the source node i belongs to. These signals can be used to model rumors that percolate through the social network by interaction between users and the objective is to determine which user group generated said rumor (or initiated a discussion on some topic). We generate a training sample of size 2, 000, for nodes i chosen at random and diffusion times t chosen as random as well. The GFT is computed by projecting on the eigenbasis of the operator T. We note that, to avoid numerical instabilities, the diffusion is carried out using the normalized operator (W/λmax(W)) and t tmax = 20. The representation coefﬁcients (graph signals, GFT or scattering coefﬁcients) obtained from this set are used to train different linear SVMs to perform classiﬁcation. For the test set, we draw 200 new signals. We compute the classiﬁcation errors on the test set as a measure of usefulness of the obtained representations. Results are presented in Fig. 1c, where perturbations are illustrated by dropping edges with probability p (adding or removing friends in Facebook). Again, it is observed that the

Published as a conference paper at ICLR 2019

scattering representation exhibits lower variations when the underlying graph changes, compared to the linear approaches.

Finally, to remark the discriminative power of the scattering representation, we observe that as the graph scattering grows deeper, the obtained features help in more accurate classiﬁcation. We remark that in regimes with sufﬁcient labeled examples, trainable GNN architectures will generally outperform scattering-based representations.

7 CONCLUSIONS

In this work we addressed the problem of stability of graph representations. We designed a scattering transform of graph signals using diffusion wavelets and we proved that this transform is stable under deformations of the underlying graph support. More speciﬁcally, we showed that the scattering transform of a graph signal supported on two different graphs is proportional to the diffusion distance between those graphs. As a byproduct of our analysis, we obtain stability bounds for Graph Neural Networks generated by diffusion operators. Additionally, we showed that the resulting descriptions are also rich enough to be able to adequately classify plays by author in the context of authorship attribution, and identify the community origin of a signal in a source localization problem.

That said, there are a number of directions to build upon from these results. First, our stability bounds depend on the spectral gap of the graph diffusion. Although lazy diffusion prevents this spectral gap to vanish, as the size of the graph increases we generally do not have a tight bound, as illustrated by regular graphs. An important direction of future research is thus to develop stability bounds which are robust to vanishing spectral gaps. Next, and related to this ﬁrst point, we are working on extending the analysis to broader families of wavelet decompositions on graphs and their corresponding graph neural network versions, including stability with respect to the Gromov Hausdorff metric, which can be achieved by using graph wavelet ﬁlter banks that achieve bounds analogous to those in Lemmas 5.1 and 5.2.

ACKNOWLEDGMENTS

Work supported by NSF CCF 1717120, US ARO W911NF1710438, ARL DCIST CRA W911NF17-2-0181, ISTC-WAS and Intel AI Dev Cloud.

J. And en and S. Mallat. Deep scattering spectrum. IEEE Trans. Signal Process., 62(16):4114 4128, Aug. 2014.

A. M. Bronstein, M. M. Bronstein, R. Kimmel, M. Mahmoudi, and G. Sapiro. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. Int. J. of Comput. Vision, 89(2-3):266 286, 2010.

M. M. Bronstein, J. Bruna, Y. Le Cun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag., 34(4):18 42, July 2017.

J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872 1886, Aug. 2013.

J. Bruna, W. Zaremba, A. Szlam, and Y. Le Cun. Spectral networks and deep locally connected networks on graphs. ar Xiv:1312.6203v3 [cs.LG], 21 May 2014. URL http://arxiv.org/ abs/1213.6203.

X. Chen, X. Cheng, and S. Mallat. Unsupervised deep Haar scattering on graphs. In Annu. Conf. Neural Inform. Process. Syst. 2014, Montr eal, QC, 8-13 Dec. 2014. NIPS Foundation.

F. R. K. Chung. Spectral Graph Theory. Number 92 in Conference Board of the Mathematical Sciences: Regional Conference Series in Mathematics. American Mathematical Society, Providence, RI, 1997.

R. R. Coifman and S. Lafon. Diffusion maps. Appl. Comput. Harmonic Anal., 21(1):5 30, July 2006.

Published as a conference paper at ICLR 2019

R. R. Coifman and M. Maggioni. Diffusion wavelets. Appl. Comput. Harmonic Anal., 21(1):53 94, July 2006.

M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral ﬁltering. In Annu. Conf. Neural Inform. Process. Syst. 2016, Barcelona, Spain, 5-10 Dec. 2016. NIPS Foundation.

M. Eickenberg, G. Exarchakis, M. Hirn, and S. Mallat. Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3d electronic densities. In Annu. Conf. Neural Inform. Process. Syst. 2017, Long Beach, CA, 4-9 Dec. 2017. NIPS Foundation.

F. Gama, A. G. Marques, G. Leus, and A. Ribeiro. Convolutional neural network architectures for signals supported on graphs. IEEE Trans. Signal Process., 67(4):1034 1049, Feb. 2019.

M. Gavish, B. Nadler, and R. R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In 27th Int. Conf. Mach. Learning, Haifa, Israel, 21-24 June 2010.

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In 34th Int. Conf. Mach. Learning, Sydney, Australia, 6-11 Aug. 2017.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In 35th Int. Conf. Mach. Learning, number 80, pp. 1832 1841, Stockholm, Sweden, 10-15 July 2018.

D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Appl. Comput. Harmonic Anal., 30(2):129 150, March 2011.

M. Henaff, J. Bruna, and Y. Le Cun. Deep convolutional networks on graph-structured data. ar Xiv:1506.051631v1 [cs.LG], 16 June 2015. URL http://arxiv.org/abs/1506. 05163.

W. Huang, L. Goldsberry, N. F. Wymbs, S. T. Grafton, D. S. Bassett, and A. Ribeiro. Graph frequency analysis of brain signals. IEEE J. Select. Topics Signal Process., 10(7):1189 1203, Oct. 2016.

W. Huang, A. G. Marques, and A. Ribeiro. Matrix completion via graph signal processing. In 43rd IEEE Int. Conf. Acoust., Speech and Signal Process., Calgary, AB, 15-20 Apr. 2018. IEEE.

M. O. Jackson. Social and Economic Networks. Princeton University Press, Princeton, NJ, 2008.

T. N. Kipf and M. Welling. Semi-supervised classiﬁcation with graph convolutional networks. In 5th Int. Conf. Learning Representations, Toulon, France, 24-26 Apr. 2017. Assoc. Comput. Linguistics.

I. Kostrikov, J. Bruna, D. Panozzo, and D. Zorin. Surface networks. ar Xiv:1705.10819v1 [stat.ML], 30 May 2017. URL http://arxiv.org/abs/1705.10819.

Y. Le Cun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In 2010 IEEE Int. Symp. Circuits and Syst., Paris, France, 30 May-2 June 2010. IEEE.

Y. Le Cun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):85 117, 2015.

R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral ﬁlters. IEEE Trans. Signal Process., 67(1):97 109, Jan. 2019.

S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10): 1331 1398, 2012.

J. Mc Auley and J. Leskovec. Learning to discover social circles in Ego networks. In 26th Annu. Conf. Neural Inform. Process. Syst., Stateline, TX, 3-8 Dec. 2012. NIPS Foundation.

Published as a conference paper at ICLR 2019

B. Nadler, S. Lafon, I. Kevrekidis, and R. R. Coifman. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Annu. Conf. Neural Inform. Process. Syst. 2006, Vancouver, BC, 4-9 Dec. 2006. NIPS Foundation.

E. Oyallon and S. Mallat. Deep roto-translation scattering for object classiﬁcation. In IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition 2015, Boston, MA, 7-12 June 2015. IEEE Comput. Soc.

R. Rustamov and L. J. Guibas. Wavelets on graphs via deep learning. In Annu. Conf. Neural Inform. Process. Syst. 2013, Lake Tahoe, NV, 5-10 Dec. 2013. NIPS Foundation.

S. Segarra, M. Eisen, and A. Ribeiro. Authorship attribution through function word adjacency networks. IEEE Trans. Signal Process., 63(20):5464 5478, Oct. 2015.

L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition 2013, Portland, OR, 23-28 June 2013. IEEE Comput. Soc.

C. Wei, J. D. Lee, Q. Liu, and T. Ma. On the margin theory of feedforward neural networks. ar Xiv:1810.05369v1 [stat.ML], 12 Oct. 2018. URL http://arxiv.org/abs/ 1810.05369.

T. Wiatowski and H. B olcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Trans. Inf. Theory, 64(3):1845 1866, 2018.

D. Zou and G. Lerman. Graph convolutional neural networks via scattering. ar Xiv:1804.00099v1 [cs.IT], 31 March 2018. URL http://arxiv.org/abs/1804.00099.

Published as a conference paper at ICLR 2019

A PROOF OF PROPOSITION 4.1

Since all operators ψj are polynomials of the diffusion T, they all diagonalise in the same basis. Let T = V ΛV T, where V TV = I contains the eigenvectors of T and Λ = diag(λ0, . . . , λn 1) its eigenvalues. The frame bounds C1, C2 are obtained by evaluating Ψx 2 for x = vi, i = 1, . . . , n 1, since v0 corresponds to the square-root degree vector and x is by assumption orthogonal to v0.

We verify that the spectrum of ψj is given by (pj(λ0), . . . , pj(λn 1)), where pj(x) = x2j 1 x2j

for j > 0 and p0(x) = 1 x. Denote by QJ(x) = PJ 1 j=0 pj(x)2. It follows from the deﬁnition that Ψvi 2 = QJ(λi) for i = 1, . . . , n 1 and therefore

C1 = min x (0,β) QJ(x) , C2 = max x (0,β) QJ(x) . (19)

We check that C1 minx (0,β) p0(x)2 = (1 β)2 and C2 = QJ(0) = 1. Indeed, denote by Q(x) = P j=0 pj(x)2. One easily veriﬁes that Q(x) is continuous in [0, 1) since it is bounded by a

geometric series. Also, observe that Q(x) = (1 x)2+P

j>0(x2j 1 x2j)2 satisﬁes the recurrence

Q(x2) = Q(x) + 2x(1 x)2 Q(x)

since x [0, 1). By continuity it thus follows that

sup x [0,1) QJ(x) sup x [0,1) Q(x) = lim x 0 Q(x) = Q(0) = 1 .

B PROOF OF LEMMA 5.1

By deﬁnition ΨG ΨG 2 = [ΨG ΨG ] [ΨG ΨG ] and

[ΨG ΨG ] [ΨG ΨG ] =

j=0 (ψj(G) ψj(G )) (ψj(G) ψj(G )) . (20)

Since TG and TG are self-adjoint, so are ψj(G) and ψj(G ). From the triangular inequality, we thus obtain ΨG ΨG 2 X

j ψj(G) ψj(G ) 2 . (21)

Denote by v the eigenvector associated with λ0 = 1, [v]i = (di/ Pn j=1 dj)1/2. Since β < 1, we can write T = vv T + T with T < 1 and v Null(T). It follows by induction that T r = vv T + T r, and hence

ψj(G) = T 2j 1 G T 2j G = T 2j 1

and equivalently for G , resulting in

ψj(G) ψj(G ) 2 2 T 2j 1

ΨG ΨG 2 4 X

Let us show that for matrices A and B such that β = max( A , B ) < 1 and r N, one has

Ar Br rβr 1 A B . (23)

Indeed, by noting g(t) = (t B + (1 t)A)r, we have

Ar Br = g(1) g(0) =

0 g (t)dt Z 1

0 g (t) dt sup t (0,1) g (t) .

Published as a conference paper at ICLR 2019

By noting At = t B + (1 t)A we verify that

l=0 Al t(B A)Ar l 1 t ,

which results in g (t) rβr 1 B A , proving (23).

By plugging (23) into (22) we thus obtain

ΨG ΨG 2 4 T G T G 2 X

j 22jβ2j+1 (24)

4 T G T G 2 X

4 T G T G 2 β2(1 + β2)

which yields ΨG ΨG 2 T G T G q

(1 β2)3 . Finally, we observe that T G T G = TG TG , which proves (12) as claimed.

C PROOF OF LEMMA 5.2

Without loss of generality, assume that the node assignment that minimizes TG ΠTG Π T is the identity. We need to bound the leading eigenvectors of two symmetric matrices TG and TG with a spectral gap. As before, let TG = vv T + T G and TG = v (v )T + T G . Let α = v, v 0 since both are non-negative vectors. Denote by E = TG TG and E = T G T G . Then E = vv T v (v )T + E. Hence Ev = v v α T G (v αv ), so

(I T G )(v v α) = Ev

Since E d(G, G ) and T G < β , we have

(1 α)(1 β ) (I T G )(v v α) d(G, G ) ,

1 α d(G, G )

Finally, since v v = 2 2α, we have

v v 2 2d(G, G )

Since we are free to swap the role of v and v , the result follows.

D PROOF OF THEOREM 5.3

First, note that ρG = ρG = ρ since it is a pointwise nonlinearity (an absolute value), and is independent of the graph topology. Now, let s start with k = 0. In this case, we get UGx UG x which is immediately bounded by Lemma 5.2 satisfying equation 15.

For k = 1 we have

UGρΨGx UG ρΨG x = UGρΨGx UG ρΨGx + UG ρΨGx UG ρΨG x (26) (UG UG )ρΨGx + UG ρ((ΨG ΨG )x) (27)

where the triangular inequality of the norm was used, together with the fact that ρu ρu ρ(u u ) for any real vector u since ρ is the pointwise absolute value. Using the submultiplicativity of the operator norm, we get

UGρΨGx UG ρΨG x UG UG ρ ΨG x + UG ρ ΨG ΨG x . (28)

Published as a conference paper at ICLR 2019

From Lemmas 5.1 and 5.2 we have that ΨG ΨG εΨ and UG UG εU, and from Proposition 4.1 that ΨG 1. Note also that UG = UG = 1 and that ρ = 1. This yields

UGρΨGx UG ρΨG x εU x + εΨ x . (29)

satisfying equation 15 for k = 1.

For k = 2, we observe that

UGρΨGρΨGx UG ρΨG ρΨG x = UGρΨGρΨGx UG ρΨGρΨGx + UG ρΨGρΨGx UG ρΨG ρΨG x (30) (UG UG )ρΨGρΨGx + UG (ρΨGρΨGx ρΨG ρΨG x) (31)

The ﬁrst term is bounded in a straightforward fashion by (UG UG )ρΨGρΨGx εU x in analogy to the development for k = 1. Since UG = 1, for the second term, we focus on

ρΨGρΨGx ρΨG ρΨG x = ρΨGρΨGx ρΨGρΨG x + ρΨGρΨG x ρΨG ρΨG x (32)

ρΨGρΨGx ρΨGρΨG x + ρΨGρΨG x ρΨG ρΨG x (33)

We note that, in the ﬁrst term in equation 33, the ﬁrst layer induces an error, but after that, the processing is through the same ﬁlter banks. So we are basically interested in bounding the propagation of the error induced in the ﬁrst layer. Applying twice the fact that ρ(u) ρ(u ) ρ(u u ) we get

ρΨGρΨGx ρΨGρΨG x ρ(ΨG(ρΨGx ρΨG x)) ρ(ΨGρ((ΨG ρΨG )x)) . (34)

And following with submultiplicativity of the operator norm,

ρΨGρΨGx ρΨGρΨG x εΨ x . (35)

For the second term in equation 33, we see that the ﬁrst layer applied is the same in both, namely ρΨG so there is no error induced. Therefore, we are interested in the error obtained after the ﬁrst layer, which is precisely the same error obtained for k = 1. Therefore,

ρΨGρΨG x ρΨG ρΨG x = ρΨGx ρΨG x εΨ x . (36)

Plugging equation 35 and equation 36 back in equation 31 we get

UGρΨGρΨGx UG ρΨG ρΨG x εU x + εΨ x + εΨ x (37)

satisfying equation 15 for k = 2.

For general k we see that we will have a ﬁrst term that is the error induced by the mismatch on the low pass ﬁlter that amounts to εU, a second term that accounts for the propagation through (k 1) equal layers of an initial error, yielding εΨ, and a ﬁnal third term that is the error induced by the previous layer, (k 1)εΨ. More formally, assume that equation 15 holds for k 1, implying that

(ρΨG)k 1x (ρΨG )k 1x (k 1)εΨ x (38)

Then, for k, we can write

UG(ρΨG)kx UG (ρΨG )kx (UG UG )(ρΨG)kx + UG ((ρΨG)kx (ρΨG )kx) (39)

Again, the ﬁrst term we bound it in a straightforward manner using submultiplicativity of the operator norm (UG UG )(ρΨG)kx εU x . (40) For the second term, since UG = 1 we focus on

(ρΨG)kx (ρΨG )kx = (ρΨG)kx (ρΨG)k 1ρΨG x + (ρΨG)k 1ρΨG x (ρΨG )kx (41)

(ρΨG)k 1ρΨGx (ρΨG)k 1ρΨG x + (ρΨG)k 1ρΨG x (ρΨG )k 1ρΨG x (42)

The ﬁrst term in equation 42 computes the propagation in the initial error caused by the ﬁrst layer. Then, repeatedly applying ρ(u) ρ(u ) ρ(u u ) in analogy with k = 2 and using submultiplicativity, we get

(ρΨG)k 1ρΨGx (ρΨG)k 1ρΨG x εΨ x . (43)

Published as a conference paper at ICLR 2019

The second term in equation 42 is the bounded by equation 38, since the ﬁrst layer is exactly the same in this second term. Then, combining equation 43 with equation 38, yields

(ρΨG)kx (ρΨG )kx εΨ + (k 1)εΨ x = kεΨ x . (44)

Overall, we get UG(ρΨG)kx UG (ρΨG )kx εU x + εΨ x (45)

which satisﬁes equation 15 for k. Finally, since this holds for k = 2, the proof is completed by induction.

E PROOF OF COROLLARY 5.4

From Theorem 5.3, we have

UG(ρGΨG)k UG (ρG ΨG )k 2 1 β ds(G, G ) 1/2 + k

β2 +(1 + β2 +) (1 β2 +)3 d(G, G ) (46)

and, by deﬁnition (Bruna & Mallat, 2013, Sec. 3.1),

k=0 UG(ρGΨG)kx 2 (47)

ΦG(x) ΦG (x) 2 =

k=0 UG(ρGΨG)kx UG (ρG ΨG )kx 2 (48)

Then, applying the inequality of Theorem 5.3, we get

" 2 1 β d(G, G ) 1/2 + k

β2 +(1 + β2 +) (1 β2 +)3 d(G, G )

Now, considering each term, such that

2 1 β d(G, G ) +

k=0 k2 β2 +(1 + β2 +) (1 β2 +)3 d2(G, G ) (50)

β2 +(1 + β2 +) (1 β )(1 β2 +)3 d3/2(G, G ) (51)

we observe that the second and third term vanish for d(G, G ) 1, leaving only the ﬁrst term, yielding

ΦG ΦG 2 m 2 1 β d(G, G ) . (52)

Finally, this leads to the corollary result

ΦG(x) ΦG(x) m1/2 d1/2(G, G ) x if d(G, G ) 1 (53)

completing the proof.

F PROOF OF COROLLARY 5.5

Let us denote by ej = x(j) G x(j) G . From (17), from the triangle inequality we verify that

ej+1 ajej + bj , with

aj = θ(j) 1 + θ(j) 2 , bj = β2j 1d(G, G ) x(j) G and e0 = 0 .

Published as a conference paper at ICLR 2019

It follows that

j=1 (1 + aj)

bj = β2j 1d(G, G ) x(j) G β2j 1d(G, G ) x

j=1 (1 + aj)

j=1 (1 + aj)

d(G, G ) x X

j=1 (1 + aj)

d(G, G ) x 1 1 β , (55)

as desired.