Graph Scattering beyond Wavelet Shackles

Christian Koke (Technical University of Munich & Ludwig Maximilian University Munich, christian.koke@tum.de) and Gitta Kutyniok (Ludwig Maximilian University Munich & University of Tromsø, kutyniok@math.lmu.de)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract: This work develops a flexible and mathematically sound framework for the design and analysis of graph scattering networks with variable branching ratios and generic functional-calculus filters. Spectrally agnostic stability guarantees for node- and graph-level perturbations are derived; the vertex-set non-preserving case is treated by utilizing recently developed tools from mathematical physics. Energy propagation through the network layers is investigated and related to truncation stability. New methods of graph-level feature aggregation are introduced and stability of the resulting composite scattering architectures is established. Finally, scattering transforms are extended to edge- and higher-order tensorial input. Theoretical results are complemented by numerical investigations: Suitably chosen scattering networks conforming to the developed theory perform better than traditional graph-wavelet based scattering approaches in social network graph classification tasks and significantly outperform other graph-based learning approaches to regression of quantum-chemical energies on QM7.

1 Introduction

Euclidean wavelet scattering networks [22, 4] are deep convolutional architectures in which output features are generated in each layer. The employed filters are designed rather than learned and derive from a fixed (tight) wavelet frame, resulting in a tree-structured network with constant branching ratio. Such networks provide state-of-the-art methods in settings with limited data availability and serve as a mathematically tractable model of standard convolutional neural networks (CNNs). Rigorous investigations establishing remarkable invariance and stability properties of wavelet scattering networks were initially carried out in [22]. The extensive mathematical analysis [38] generalized the term scattering network to include tree-structured networks with varying branching ratios and frames of convolutional filters, thus significantly narrowing the conceptual gap to general CNNs.

With increasing interest in data on graph-structured domains, well-performing networks generalizing Euclidean CNNs to this geometric setting emerged [18, 5, 9]. If efficiently implemented, such graph convolutional networks (GCNs) replace Euclidean convolutional filters by functional-calculus filters, i.e. scalar functions applied to a suitably chosen graph-shift-operator capturing the geometry of the underlying graph [18, 14, 9]. Almost immediately, proposals aimed at extending the success story of Euclidean scattering networks to the graph convolutional setting began appearing: In [48], the authors utilize dyadic graph wavelets (see e.g. [14]) based on the non-normalized graph Laplacian, resulting in a norm-preserving graph wavelet scattering transform. In [10], diffusion wavelets (see e.g. [8]) are used to construct a graph scattering transform enjoying spectrum-dependent stability guarantees to graph-level perturbations. For scattering transforms with $N$ layers and $K$ distinct functional-calculus filters, the work [11] derives node-level stability bounds of $\mathcal{O}(K^{N/2})$ and conducts corresponding numerical experiments choosing diffusion wavelets, monic cubic wavelets [14] and tight Hann wavelets [35] as filters.
In [12] the authors, following [8], construct so-called geometric wavelets and establish the expressivity of a scattering transform based on such a frame through extensive numerical experiments. A theoretical analysis of this and a closely related wavelet-based scattering transform is the main focus of [28]. Additionally, graph-wavelet based scattering transforms have been extended to the spatio-temporal domain [27], utilized to overcome the problem of oversmoothing in GCNs [25] and pruned to deal with their exponential (in network depth) increase in needed resources [15].

Common among all these contributions is the focus on graph wavelets, which are generically understood to derive in a scale-sampling procedure from a common wavelet-generating kernel function $g : \mathbb{R} \to \mathbb{R}$ satisfying various properties [14]. Established stability or expressivity properties, especially to structural perturbations, are then generally linked to the specific choice of the wavelet kernel $g$ and the utilized graph shift operator [10, 28]. This severely limits the diversity of available filter banks in the design of scattering networks and draws into question their validity as models for more general GCNs, whose filters generically do not derive from a wavelet kernel.

A primary focus of this work is to provide alleviation in this situation: After reviewing the graph signal processing setting in Section 2, we introduce a general framework for the construction of (generalized) graph scattering transforms beyond the wavelet setting in Section 3. Section 4 establishes spectrum-agnostic stability guarantees on the node-signal level and, for the first time, also for graph-level perturbations. To handle the vertex-set non-preserving case, a new distance measure for operators capturing the geometry of varying graphs is utilized. After providing conditions for energy decay (with the layers) and relating it to truncation stability, we consider graph-level feature aggregation and higher-order inputs in Sections 5 and 6 respectively. In Section 7 we then provide numerical results indicating that general functional-calculus-filter based scattering is at least as expressive as standard wavelet-based scattering in graph classification tasks and outperforms leading graph neural network approaches to regression of quantum-chemical energies on QM7.

2 Graph Signal Processing

Taking a signal processing approach, we consider signals on graphs as opposed to graph embeddings:

Node-signals: Given a graph $(G, E)$, we are primarily interested in node-signals, which are functions from the node-set $G$ to the complex numbers, modelled as elements of $\mathbb{C}^{|G|}$. We equip this space with an inner product according to $\langle f, g \rangle = \sum_{i=1}^{|G|} f_i \overline{g_i}\,\mu_i$ (with all vertex weights $\mu_i \geq 1$) and denote the resulting inner product space by $\ell^2(G)$. We forego considering arbitrary inner products on $\mathbb{C}^{|G|}$ solely in the interest of increased readability.

Functional calculus filters: Our fundamental objects in investigating node-signals will be functional-calculus filters based on a normal operator $\Delta : \ell^2(G) \to \ell^2(G)$. Prominent examples include the adjacency matrix $W$, the degree matrix $D$, and the normalized ($\mathcal{L} := \mathrm{Id} - D^{-1/2} W D^{-1/2}$) or un-normalized ($L := D - W$) graph Laplacians. Writing normalized eigenvalue-eigenvector pairs of $\Delta$ as $(\lambda_i, \varphi_i)_{i=1}^{|G|}$, the filter obtained from applying $g : \mathbb{C} \to \mathbb{C}$ is given by $g(\Delta)f = \sum_{i=1}^{|G|} g(\lambda_i)\,\langle \varphi_i, f\rangle_{\ell^2(G)}\,\varphi_i$. The operator we utilize in our numerical investigations of Section 7 is given by $L_\Delta := L/\lambda_{\max}(L)$; we divide by the largest eigenvalue to ensure that the spectrum $\sigma(L_\Delta)$ is contained in the interval $[0, 1]$, which aids in the choice of functions from which filters are derived.
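As a concrete illustration, the following minimal numpy sketch implements such a functional-calculus filter for an undirected graph given by a symmetric adjacency matrix; it is a sketch under the assumption of unit node weights $\mu_i = 1$ (as in our experiments), and all function names are ours, not part of any released implementation.

```python
import numpy as np

def normalized_shift_operator(W):
    """Build L_Delta = L / lambda_max(L) from a symmetric adjacency matrix W."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # un-normalized graph Laplacian
    return L / np.linalg.eigvalsh(L).max()     # spectrum now contained in [0, 1]

def apply_filter(g, Delta, f):
    """Functional-calculus filter: g(Delta) f = sum_i g(lambda_i) <phi_i, f> phi_i."""
    lam, Phi = np.linalg.eigh(Delta)           # Delta is symmetric, hence normal
    return Phi @ (g(lam) * (Phi.T @ f))

# toy example: a path graph on four nodes, low-pass filtered with cos(pi/2 x)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Delta = normalized_shift_operator(W)
f = np.random.default_rng(0).normal(size=4)
f_low = apply_filter(lambda x: np.cos(np.pi / 2 * x), Delta, f)
```

Since $\cos(\pi x/2)$ decreases from one to zero on $[0,1]$, the resulting operator attenuates the high-frequency (large-eigenvalue) components of $f$.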
Generalized frames: We are most interested in filters that arise from a collection of functions adequately covering the spectrum of the operator to which they are applied. To this end we call a collection $\{g_i(\cdot)\}_{i \in I}$ of functions a generalized frame if it satisfies the generalized frame condition $A \leq \sum_{i \in I} |g_i(c)|^2 \leq B$ for any $c$ in $\mathbb{C}$, for constants $A, B > 0$. As proved in Appendix B, this condition is sufficient to guarantee that the associated operators form a frame:

Theorem 2.1. Let $\Delta : \ell^2(G) \to \ell^2(G)$ be normal. If the family $\{g_i(\cdot)\}_{i \in I}$ of bounded functions satisfies $A \leq \sum_{i \in I} |g_i(c)|^2 \leq B$ for all $c$ in the spectrum $\sigma(\Delta)$, we have for all $f \in \ell^2(G)$
$$A\,\|f\|^2_{\ell^2(G)} \;\leq\; \sum_{i \in I} \|g_i(\Delta)f\|^2_{\ell^2(G)} \;\leq\; B\,\|f\|^2_{\ell^2(G)}.$$

Notably, the functions $\{g_i\}_{i \in I}$ need not be continuous: In fact, in our numerical implementations we will, among other mappings, utilize the function $\delta_0(\cdot)$, defined by $\delta_0(0) = 1$ and $\delta_0(c) = 0$ for $c \neq 0$, as well as a modified cosine $\widetilde{\cos}$, defined by $\widetilde{\cos}(0) = 0$ and $\widetilde{\cos}(c) = \cos(c)$ for $c \neq 0$.
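The generalized frame condition can be checked numerically on a grid covering the relevant spectrum. The sketch below does this for the (un-normalized) trigonometric filter bank used in Architecture II of Section 3, together with the constant-1 output-generating function; this is an assumption made for illustration, since the exact normalizations of the filters in Fig. 2 are not reproduced here.

```python
import numpy as np

# Filter bank of Architecture II (un-normalized), plus the constant-1 function
# as output-generating function chi, so that chi(Delta) = Id.
filters = [
    lambda x: np.sin(np.pi / 2 * x),
    lambda x: np.cos(np.pi / 2 * x),
    lambda x: np.sin(np.pi * x),
    lambda x: np.cos(np.pi * x),
]
chi = lambda x: np.ones_like(x)

c = np.linspace(0.0, 1.0, 10_001)               # grid covering sigma(L_Delta) in [0, 1]
S = chi(c) ** 2 + sum(g(c) ** 2 for g in filters)
A, B = S.min(), S.max()                          # empirical frame bounds on the grid
print(f"A = {A:.3f}, B = {B:.3f}")
# S is identically 3 on [0, 1] here, consistent with the bound B_n = 3 quoted in Section 4.
```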
3 The Generalized Graph Scattering Transform

A generalized graph scattering transform is a non-linear map $\Phi$ based on a tree-structured multi-layer graph convolutional network with constant branching factor in each layer. For an input signal $f \in \ell^2(G)$, outputs are generated in each layer of such a scattering network and then concatenated to form a feature vector in a feature space $\mathcal{F}$. The network is built up from three ingredients:

Connecting operators: To allow intermediate signal representations in the hidden network layers to be further processed with functional-calculus filters based on varying operators, which might not all be normal for the same choice of node weights, we allow these intermediate representations to live in varying graph signal spaces. In fact, we do not even assume that these signal spaces are based on a common vertex set. This is done to allow for modelling of recently proposed networks where input and processing graphs are decoupled (see e.g. [1, 36]), as well as architectures incorporating graph pooling [20]. Instead, we associate one signal space $\ell^2(G_n)$ to each layer $n$. Connecting operators are then (not necessarily linear) operators $P_n : \ell^2(G_{n-1}) \to \ell^2(G_n)$ connecting the signal spaces of subsequent layers. We assume them to be Lipschitz continuous ($\|P_n(f) - P_n(g)\|_{\ell^2(G_n)} \leq R_n \|f - g\|_{\ell^2(G_{n-1})}$) and triviality preserving ($P_n(0) = 0$). For our original node-signal space we also write $\ell^2(G) = \ell^2(G_0)$.

Non-linearities: To each layer we also associate a (possibly) non-linear function $\rho_n : \mathbb{C} \to \mathbb{C}$ acting pointwise on signals in $\ell^2(G_n)$. Similar to connecting operators, we assume that $\rho_n$ preserves zero and is Lipschitz continuous with Lipschitz constant denoted by $L_n$. This definition allows for the absolute-value non-linearity, but also ReLU or, trivially, the identity function.

Operator frames: Beyond these ingredients, the central building block of our scattering architecture is comprised of a family of functional-calculus filters in each layer. That is, we assume that in each layer the node-signal space $\ell^2(G_n)$ carries a normal operator $\Delta_n$ and an associated collection of functions comprised of an output-generating function $\chi_n(\cdot)$ as well as a filter bank $\{g_{\gamma_n}(\cdot)\}_{\gamma_n \in \Gamma_n}$ indexed by an index set $\Gamma_n$. As the network layer $n$ varies (and in contrast to wavelet-scattering networks), we allow the index set $\Gamma_n$ as well as the collection $\{\chi_n(\cdot)\} \cup \{g_{\gamma_n}(\cdot)\}_{\gamma_n \in \Gamma_n}$ of functions to vary. We only demand that in each layer the functions in the filter bank, together with the output-generating function, constitute a generalized frame with frame constants $A_n, B_n > 0$. We refer to the collection $\Omega_N := \big(\rho_n, \{\chi_n(\cdot)\} \cup \{g_{\gamma_n}(\cdot)\}_{\gamma_n \in \Gamma_n}\big)_{n=1}^N$ as a module sequence and call $\mathcal{D}_N := (P_n, \Delta_n)_{n=1}^N$ our operator collection.

The generalized scattering transform is then constructed iteratively (cf. Figure 1: Schematic scattering architecture): To our initial signal $f \in \ell^2(G)$ we first apply the connecting operator $P_1$, yielding a signal representation in $\ell^2(G_1)$. Subsequently, we apply the pointwise non-linearity $\rho_1$. Then we apply our graph filters $\{\chi_1(\Delta_1)\} \cup \{g_{\gamma_1}(\Delta_1)\}_{\gamma_1 \in \Gamma_1}$ to $\rho_1(P_1(f))$, yielding the output $V_1(f) := \chi_1(\Delta_1)\rho_1(P_1(f))$ as well as the intermediate hidden representations $\{U_1[\gamma_1](f) := g_{\gamma_1}(\Delta_1)\rho_1(P_1(f))\}_{\gamma_1 \in \Gamma_1}$ obtained in the first layer. Here we have introduced the one-step scattering propagator $U_n[\gamma_n] : \ell^2(G_{n-1}) \to \ell^2(G_n)$ mapping $f \mapsto g_{\gamma_n}(\Delta_n)\rho_n(P_n(f))$ as well as the output-generating operator $V_n : \ell^2(G_{n-1}) \to \ell^2(G_n)$ mapping $f$ to $\chi_n(\Delta_n)\rho_n(P_n(f))$. Upon defining the set $\Gamma^{N-1} := \Gamma_{N-1} \times \dots \times \Gamma_1$ of paths of length $(N-1)$ terminating in layer $N-1$ (with $\Gamma^0$ taken to be the one-element set) and iterating the above procedure, we see that the outputs generated in the $N$-th layer are indexed by paths in $\Gamma^{N-1}$ terminating in the previous layer. Outputs generated in the $N$-th layer are thus given by $\{V_N\, U[\gamma_{N-1}] \dots U[\gamma_1](f)\}_{(\gamma_{N-1}, \dots, \gamma_1) \in \Gamma^{N-1}}$. Concatenating the features obtained in the various layers of a network with depth $N$, our full feature vectors thus live in the feature space
$$\mathcal{F}_N = \bigoplus_{n=1}^{N} \ell^2(G_n)^{\oplus|\Gamma^{n-1}|}. \qquad (1)$$
The associated canonical norm is denoted $\|\cdot\|_{\mathcal{F}_N}$. For convenience, a brief review of direct sums of spaces, their associated norms and a discussion of corresponding direct sums of maps is provided in Appendix A. We denote the hence constructed generalized scattering transform of depth $N$, based on a module sequence $\Omega_N$ and operator collection $\mathcal{D}_N$, by $\Phi_N$.

In our numerical experiments in Section 7, we consider two particular instantiations of the above general architecture. In both cases the utilized shift operator is $L_\Delta := L/\lambda_{\max}(L)$, node weights satisfy $\mu_i = 1$, the branching ratio in each layer is chosen as 4 and the depth is set to $N = 4$ as well. The connecting operators are set to the identity and non-linearities are set to the modulus ($|\cdot|$). The two architectures differ in the utilized filters, which are repeated in each layer and depicted in Fig. 2 (Filters of tested Architectures). Postponing a discussion of other parameter choices, we note here that the filters $\{\sin(\tfrac{\pi}{2}\,\cdot), \cos(\tfrac{\pi}{2}\,\cdot)\}$ provide a high- and a low-pass filter on the spectrum $\sigma(L_\Delta) \subseteq [0, 1]$, while $\{\sin(\pi\,\cdot), \cos(\pi\,\cdot)\}$ provides a spectral refinement of the former two filters. The inner two elements of the filter bank in Architecture II thus separate an input signal into high- and low-lying spectral components; the outer two act similarly at a higher spectral scale. Additionally, Architecture I, utilizing $\widetilde{\cos}$ and $\delta_0$ as introduced in Section 2, prevents the lowest-lying spectral information from propagating; instead it is extracted via $\delta_0(\cdot)$ in each layer. Note that $\mathrm{Id}$ arises from applying the constant-1 function to $L_\Delta$. Normalizations are chosen so that the resulting collections indeed form generalized frames, with upper frame bounds $B_n$ as reported in Section 4.
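To make the layer-wise construction concrete, the following sketch computes all outputs $V_n U[\gamma_{n-1}]\dots U[\gamma_1](f)$ for the special case used in our experiments (all $G_n = G$, $P_n = \mathrm{Id}$, $\rho_n = |\cdot|$, and the same filter bank in every layer); function and variable names are ours and chosen for illustration only.

```python
import numpy as np

def scattering_transform(Delta, f, filters, chi, depth):
    """Generalized scattering outputs for one graph.

    Delta   : symmetric (normal) graph-shift operator, shape (n, n)
    f       : input node signal, shape (n,)
    filters : list of callables g_gamma : R -> R  (filter bank, reused in every layer)
    chi     : callable, the output-generating function
    depth   : number of layers N
    Returns the list of output signals, one per path terminating in the previous layer.
    """
    lam, Phi = np.linalg.eigh(Delta)
    def calc(g, x):                        # functional-calculus filter g(Delta) x
        return Phi @ (g(lam) * (Phi.T @ x))

    outputs = []
    layer = [f]                            # hidden representations entering the current layer
    for _ in range(depth):
        nxt = []
        for h in layer:
            h = np.abs(h)                  # P_n = Id followed by rho_n = |.|
            outputs.append(calc(chi, h))              # output generated in this layer
            nxt.extend(calc(g, h) for g in filters)   # propagate along every path gamma_n
        layer = nxt
    return outputs
```

With the four Architecture II filters and `chi = lambda x: np.ones_like(x)`, a depth of 4 yields $1 + 4 + 16 + 64 = 85$ output signals, matching the branching ratio 4 and depth $N = 4$ described above.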
4 Stability Guarantees

In order to produce meaningful signal representations, a small change in the input signal should produce a small change in the output of our generalized scattering transforms. This property is captured in the result below, which is proved in Appendix C.

Theorem 4.1. With the notation of Section 3, we have for all $f, h \in \ell^2(G)$:
$$\|\Phi_N(f) - \Phi_N(h)\|_{\mathcal{F}_N} \;\leq\; \sqrt{1 + \sum_{n=1}^{N-1} \max\bigl\{B_n - 1,\; B_n(L_n R_n)^2 - 1,\; 0\bigr\}\prod_{k=1}^{n} B_k}\;\;\|f - h\|_{\ell^2(G)}.$$

In the case where the upper frame bounds $B_n$ and Lipschitz constants $L_n$ and $R_n$ are all smaller than or equal to one, this statement reduces to the much nicer inequality
$$\|\Phi_N(f) - \Phi_N(h)\|_{\mathcal{F}_N} \leq \|f - h\|_{\ell^2(G)}. \qquad (2)$$
Below, we always assume $R_n, L_n \leq 1$, as this is easily achievable through rescaling. We will keep $B_n$ variable to demonstrate how filter size influences stability results. As for our experimentally tested architectures (cf. Fig. 2), we note for Architecture I that $B_n = 1/2$ for all $n$, so that (2) applies. For Architecture II we have $B_n = 3$, which yields a stability constant of $\sqrt{1 + 2\cdot 3 + 2\cdot 3^2 + 2\cdot 3^3} \approx 9$. Similar to other constants derived in this section, this bound is however not necessarily tight.

Operators capturing graph geometries might only be known approximately in real-world tasks, e.g. if edge weights are only known to a certain level of precision. Hence it is important that our scattering representation be insensitive to small perturbations in the underlying normal operators in each layer, which is captured by our next result, proved in Appendix D. Smallness here is measured in the Frobenius norm $\|\cdot\|_F$, which for convenience is briefly reviewed in Appendix A.

Theorem 4.2. Let $\Phi_N$ and $\widetilde{\Phi}_N$ be two scattering transforms based on the same module sequence $\Omega_N$ and operator sequences $\mathcal{D}_N$, $\widetilde{\mathcal{D}}_N$ with the same connecting operators ($P_n = \widetilde{P}_n$) in each layer. Assume $R_n, L_n \leq 1$ and $B_n \leq B$ for some $B$ and all $n \leq N$. Assume that the respective normal operators satisfy $\|\Delta_n - \widetilde{\Delta}_n\|_F \leq \delta$ for some $\delta > 0$. Further assume that the functions $\{g_{\gamma_n}\}_{\gamma_n \in \Gamma_n}$ and $\chi_n$ in each layer are Lipschitz continuous with associated Lipschitz constants satisfying $L^2_{\chi_n} + \sum_{\gamma_n \in \Gamma_n} L^2_{g_{\gamma_n}} \leq D^2$ for all $n \leq N$ and some $D > 0$. Then we have
$$\|\widetilde{\Phi}_N(f) - \Phi_N(f)\|_{\mathcal{F}_N} \;\leq\; \sqrt{\bigl(\max\{B, 1/2\}\bigr)^{N+1}}\;D\,\delta\,\|f\|_{\ell^2(G)}$$
for all $f \in \ell^2(G)$. If $B \leq 1/2$, the stability constant improves to $\sqrt{2(1 - B^N)/(1 - B)}\,D \leq 2D$.

The condition $B \leq 1/2$ is e.g. satisfied by our Architecture I, but strictly speaking we may not apply Theorem 4.2, since not all utilized filters are Lipschitz continuous. Remark D.3 in Appendix D however shows that the above stability result remains applicable for this architecture as long as we demand that $\Delta$ and $\widetilde{\Delta}$ are (potentially rescaled) graph Laplacians. For Architecture II we note that $D = \pi\sqrt{10}/2$; the corresponding stability constant then follows from Theorem 4.2.
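The mechanism behind Theorem 4.2 can be illustrated numerically: for a Lipschitz filter function, the induced operator moves at most on the order of the Frobenius distance of the shift operators (functional calculus of normal matrices is Hölder/Lipschitz stable in Frobenius norm, cf. [37]). The toy sketch below, with illustrative names and parameters of our choosing, perturbs the edge weights of a random graph and compares the two distances.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def filt(g, Delta):
    lam, Phi = np.linalg.eigh(Delta)
    return Phi @ np.diag(g(lam)) @ Phi.T

# base graph and a perturbed copy (edge weights known only up to small errors)
W = rng.uniform(0.5, 1.5, size=(8, 8)); W = np.triu(W, 1); W = W + W.T
E = 1e-3 * rng.normal(size=(8, 8));     E = np.triu(E, 1); E = E + E.T
L, Lp = laplacian(W), laplacian(W + E)
Delta   = L  / np.linalg.eigvalsh(L).max()
Delta_p = Lp / np.linalg.eigvalsh(Lp).max()

g = lambda x: np.cos(np.pi / 2 * x)          # Lipschitz with constant pi/2 on [0, 1]
delta = np.linalg.norm(Delta_p - Delta, 'fro')
shift = np.linalg.norm(filt(g, Delta_p) - filt(g, Delta), 'fro')
print(f"||Delta' - Delta||_F = {delta:.2e},  ||g(Delta') - g(Delta)||_F = {shift:.2e}")
```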
We are also interested in perturbations that change the vertex set of the graphs in our architecture. This is important for example in the context of social networks, when passing from nodes representing individuals to nodes representing (close-knit) groups of individuals. To investigate this setting, we utilize tools originally developed within the mathematical physics community [29]:

Definition 4.3. Let $H$ and $\widetilde{H}$ be two finite-dimensional Hilbert spaces. Let $\Delta$ and $\widetilde{\Delta}$ be normal operators on these spaces. Let $J : H \to \widetilde{H}$ and $\widetilde{J} : \widetilde{H} \to H$ be linear maps called identification operators. We call the two spaces $\delta$-quasi-unitarily equivalent (with $\delta \geq 0$) if for any $f \in H$ and $u \in \widetilde{H}$ we have
$$\|Jf\|_{\widetilde{H}} \leq 2\|f\|_H, \qquad \|(J - \widetilde{J}^{\,*})f\|_{\widetilde{H}} \leq \delta\|f\|_H,$$
$$\|f - \widetilde{J}Jf\|_H \leq \delta\sqrt{\|f\|^2_H + \langle f, |\Delta|f\rangle_H}, \qquad \|u - J\widetilde{J}u\|_{\widetilde{H}} \leq \delta\sqrt{\|u\|^2_{\widetilde{H}} + \langle u, |\widetilde{\Delta}|u\rangle_{\widetilde{H}}}.$$
If, for some $\omega \in \mathbb{C}$, the resolvents $R := (\Delta - \omega)^{-1}$ and $\widetilde{R} := (\widetilde{\Delta} - \omega)^{-1}$ satisfy $\|(\widetilde{R}J - JR)f\|_{\widetilde{H}} \leq \delta\|f\|_H$ for all $f \in H$, we say that $\Delta$ and $\widetilde{\Delta}$ are $\omega$-$\delta$-close with identification operator $J$.

Absolute value $|\Delta|$ and adjoint $\widetilde{J}^{\,*}$ of operators are briefly reviewed in Appendix A. While the above definition might seem fairly abstract at first, it is in fact a natural setting in which to investigate structural perturbations, as Figure 3 exemplifies.

(Figure 3: Prototypical example of $\delta$-quasi-unitarily equivalent node-signal spaces with $(-1)$-$12\delta$-close Laplacians. Details in Appendix E.)

In our current setting, the Hilbert spaces in Definition 4.3 are node-signal spaces $H = \ell^2(G)$, $\widetilde{H} = \ell^2(\widetilde{G})$ of different graphs. The notion of $\omega$-$\delta$-closeness is then useful, as it allows to compare filters defined on different graphs but obtained from applying the same function to the respective graph operators:

Lemma 4.4. In the setting of Definition 4.3, let $\Delta$ and $\widetilde{\Delta}$ be $\omega$-$\delta$-close and satisfy $\|\Delta\|_{op}, \|\widetilde{\Delta}\|_{op} \leq K$ for some $K > 0$. If $g : \mathbb{C} \to \mathbb{C}$ is holomorphic on the disk $B_{K+1}(0)$ of radius $(K+1)$, there is a constant $C_g \geq 0$ so that $\|g(\widetilde{\Delta})J - Jg(\Delta)\|_{op} \leq C_g\,\delta$, with $C_g$ depending on $g$, $\omega$ and $K$.

An explicit characterization of $C_g$, together with a proof of this result, is presented in Appendix F. Lemma 4.4 is our main tool in establishing our next result, proved in Appendix G, which captures stability under vertex-set non-preserving perturbations:

Theorem 4.5. Let $\Phi_N$, $\widetilde{\Phi}_N$ be scattering transforms based on a common module sequence $\Omega_N$ and differing operator sequences $\mathcal{D}_N$, $\widetilde{\mathcal{D}}_N$. Assume $R_n, L_n \leq 1$ and $B_n \leq B$ for some $B$ and all $n \leq N$. Assume that there are identification operators $J_n : \ell^2(G_n) \to \ell^2(\widetilde{G}_n)$, $\widetilde{J}_n : \ell^2(\widetilde{G}_n) \to \ell^2(G_n)$ ($0 \leq n \leq N$) so that the respective signal spaces are $\delta$-quasi-unitarily equivalent, the respective normal operators $\Delta_n$, $\widetilde{\Delta}_n$ are $\omega$-$\delta$-close as well as bounded (in norm) by $K > 0$, and the connecting operators satisfy $\|\widetilde{P}_n J_{n-1}f - J_n P_n f\|_{\ell^2(\widetilde{G}_n)} = 0$. For the common module sequence $\Omega_N$ assume that the non-linearities satisfy $\|\rho_n(J_n f) - J_n\rho_n(f)\|_{\ell^2(\widetilde{G}_n)} = 0$ and that the constants $C_{\chi_n}$ and $\{C_{g_{\gamma_n}}\}_{\gamma_n \in \Gamma_n}$ associated through Lemma 4.4 to the functions of the generalized frames in each layer satisfy $C^2_{\chi_n} + \sum_{\gamma_n \in \Gamma_n} C^2_{g_{\gamma_n}} \leq D^2$ for some $D > 0$. Denote the operator that the family $\{J_n\}_n$ of identification operators induces on $\mathcal{F}_N$ through concatenation by $J_N : \mathcal{F}_N \to \widetilde{\mathcal{F}}_N$. Then, with $K_N = \sqrt{(2N+1)\,2D^2 B^{N+1}}$ if $B > 1/2$ and $K_N = \sqrt{2D^2(1 - B^N)/(1 - B)}$ if $B \leq 1/2$:
$$\|\widetilde{\Phi}_N(J_0 f) - J_N\Phi_N(f)\|_{\widetilde{\mathcal{F}}_N} \;\leq\; K_N\,\delta\,\|f\|_{\ell^2(G)} \qquad \forall f \in \ell^2(G).$$

The stability result persists with slightly altered stability constants if identification operators only almost commute with non-linearities and/or connecting operators, as Appendix G further elucidates. Theorem 4.5 is not applicable to Architecture I, where filters are not all holomorphic, but is directly applicable to Architecture II; stability constants can be calculated in terms of $D$ and $B$ as before.

Beyond these results, stability under truncation of the scattering transform is equally desirable: Given the energy $W_N := \sum_{(\gamma_N, \dots, \gamma_1) \in \Gamma^N} \|U[\gamma_N]\dots U[\gamma_1](f)\|^2_{\ell^2(G_N)}$ stored in the network at layer $N$, it is not hard to see that after extending $\Phi_N(f)$ by zero to match dimensions with $\Phi_{N+1}(f)$ we have $\|\Phi_N(f) - \Phi_{N+1}(f)\|^2_{\mathcal{F}_{N+1}} \leq (R_{N+1}L_{N+1})^2 B_{N+1}\, W_N$ (see Appendix H for more details). A bound for $W_N$ is then given as follows:
Theorem 4.6. Let $\Phi_\infty$ be a generalized graph scattering transform based on an operator sequence $\mathcal{D}_\infty = (P_n, \Delta_n)_{n=1}^\infty$ and a module sequence $\Omega_\infty$ with each $\rho_n(\cdot) \geq 0$. Assume in each layer $n \geq 1$ that there is an eigenvector $\psi_n$ of $\Delta_n$ with solely positive entries; denote the smallest entry by $m_n := \min_{i \in G_n}\psi_n[i]$ and the eigenvalue corresponding to $\psi_n$ by $\lambda_n$. Quantify the spectral gap opened up at this eigenvalue through neglecting the output-generating function by $\eta_n := \sum_{\gamma_n \in \Gamma_n}|g_{\gamma_n}(\lambda_n)|^2$ and assume $B_n m_n \geq \eta_n$. We then have, with $C_N := \prod_{i=1}^N \max\{1,\, B_i(L_iR_i)^2\}$,
$$W_N(f) \;\leq\; C_N\,\prod_{n=1}^{N}\Bigl(1 - m_n + \frac{\eta_n}{B_n}\Bigr)\,\|f\|^2_{\ell^2(G)}. \qquad (3)$$

The product in (3) decays if $C_N \to C$ converges and $\sum_{n=1}^N (m_n - \eta_n/B_n) \to \infty$ diverges as $N \to \infty$. The positivity assumptions on the eigenvectors $\psi_n$ can e.g. always be ensured if they are chosen to lie in the lowest-lying eigenspace of a graph Laplacian or normalized graph Laplacian (irrespective of the connectedness of the underlying graphs). As an example, we note that if we extend our Architecture I to infinite depth (recall from Section 3 that we are using the same filters, operators, etc. in each layer), we have, upon choosing $\lambda_n = 0$ and $\psi_n$ to be the constant normalized vector, that $\eta_n = 0$, $C_N = 1$ and $m_n = 1/\sqrt{|G|}$ for a graph with $|G|$ vertices. On a graph with 16 vertices we then e.g. have $W_N \leq (3/4)^N\|f\|^2_{\ell^2(G)}$ and thus $\|\Phi_N(f) - \Phi_{N+1}(f)\|^2_{\mathcal{F}_{N+1}} \leq (3/4)^N\,\|f\|^2_{\ell^2(G)}/2$. As detailed in Appendix H, Theorem 4.6 also implies that under the given assumptions the scattering transform has trivial kernel as $N \to \infty$, mapping only $0$ to $0$.

5 Graph-Level Feature Aggregation

To solve tasks such as graph classification or regression over multiple graphs, we need to represent graphs of varying sizes in a common feature space. Given a scattering transform $\Phi_N$, we thus need to find a stability-preserving map from the feature space $\mathcal{F}_N$ to some Euclidean space that is independent of any vertex-set cardinalities. Since $\mathcal{F}_N$ is a large direct sum of smaller spaces (cf. (1)), we simply construct such maps on each summand independently and then concatenate them.

General non-linear feature aggregation: Our main tool in passing to graph-level features is a non-linear map $N^G_p : \ell^2(G) \to \mathbb{R}^p$ given as
$$N^G_p(f) = \frac{1}{\sqrt{p}}\Bigl(\|f\|_{\ell^1(G)}/\sqrt{\mu_G},\; \|f\|_{\ell^2(G)},\; \|f\|_{\ell^3(G)},\; \dots,\; \|f\|_{\ell^p(G)}\Bigr)^{\!\top}, \qquad (4)$$
with $\mu_G := \sum_{i \in G}\mu_i$ and $\|f\|_{\ell^q(G)} := (\sum_{i \in G}|f_i|^q\mu_i)^{1/q}$. Our inspiration to use this map stems from the standard case where all $\mu_i = 1$: For $p \geq |G|$, the vector $|f| = (|f_1|, \dots, |f_{|G|}|)^\top$ can then be recovered from $N^G_p(f)$ up to permutation of indices [23]. Hence, employing $N^G_p$ (with $p \geq |G|$) to aggregate node information into graph-level information, we lose the minimal necessary information about node permutation (clearly $N^G_p(f) = N^G_p(\Pi f)$ for any permutation matrix $\Pi$) and, beyond that, only information about the complex phase (respectively the sign in the real case) in each entry of $f$.

(Figure 4: Graph-level scattering.)

Given a scattering transform $\Phi_N$ mapping from $\ell^2(G)$ to the feature space $\mathcal{F}_N = \bigoplus_{n=1}^N \ell^2(G_n)^{\oplus|\Gamma^{n-1}|}$, we obtain a corresponding map $\Psi_N$ mapping from $\ell^2(G)$ to $\mathcal{R}_N = \bigoplus_{n=1}^N(\mathbb{R}^{p_n})^{\oplus|\Gamma^{n-1}|}$ by concatenating the feature map $\Phi_N$ with the operator that the family of non-linear maps $\{N^{G_n}_{p_n}\}_{n=1}^N$ induces on $\mathcal{F}_N$ by concatenation. Similarly we obtain the map $\widetilde{\Psi}_N : \ell^2(\widetilde{G}) \to \mathcal{R}_N$ by concatenating the map $\widetilde{\Phi}_N : \ell^2(\widetilde{G}) \to \bigoplus_{n=1}^N \ell^2(\widetilde{G}_n)^{\oplus|\Gamma^{n-1}|}$ with the operator induced by the family $\{N^{\widetilde{G}_n}_{p_n}\}_{n=1}^N$. The feature space $\mathcal{R}_N$ is completely determined by the path sets $\Gamma^N$ and the used maximal $p$-norm indices $p_n$. It no longer depends on cardinalities of vertex sets of any graphs, allowing to compare (signals on) varying graphs with each other.
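The aggregation map (4) is straightforward to transcribe; the sketch below assumes unit node weights (names are illustrative) and checks the permutation invariance noted above.

```python
import numpy as np

def N_p(f, p, mu=None):
    """Graph-level aggregation map of eq. (4): scaled l^1,...,l^p norms of a node signal."""
    mu = np.ones_like(f, dtype=float) if mu is None else mu
    mu_G = mu.sum()
    norms = [(np.abs(f) ** q * mu).sum() ** (1.0 / q) for q in range(1, p + 1)]
    norms[0] /= np.sqrt(mu_G)            # the l^1 norm is additionally divided by sqrt(mu_G)
    return np.asarray(norms) / np.sqrt(p)

# permutation invariance: N_p(f) == N_p(Pi f) for any permutation Pi
f = np.array([1.0, -2.0, 0.5, 3.0])
assert np.allclose(N_p(f, 5), N_p(f[::-1], 5))
```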
Most of the results of the previous sections then readily transfer to the graph-level feature setting (cf. Appendix I.1).

Low-pass feature aggregation: The spectrum-free aggregation scheme of the previous paragraph is especially adapted to settings where there are no high-level spectral properties remaining constant under graph perturbations. However, many commonly utilized operators, such as normalized and un-normalized graph Laplacians, have a somewhat stable spectral theory: Eigenvalues are always real and non-negative, and the lowest-lying eigenvalue equals zero and is simple (if the graph is connected). In this section we shall thus assume that each mentioned normal operator $\Delta_n$ ($\widetilde{\Delta}_n$) has these spectral properties. We denote the lowest-lying normalized eigenvector (which is generically determined up to a complex phase) by $\psi_n$ and denote by $M^{|\langle\cdot,\cdot\rangle|}_{G_n} : \ell^2(G_n) \to \mathbb{C}$ the map given by $M^{|\langle\cdot,\cdot\rangle|}_{G_n}(f) = |\langle\psi_n, f\rangle_{\ell^2(G_n)}|$. The absolute value around the inner product is introduced to absorb the phase ambiguity in the choice of $\psi_n$. Given a scattering transform $\Phi_N$ mapping from $\ell^2(G)$ to the feature space $\mathcal{F}_N$, we obtain a corresponding map $\Psi^{|\langle\cdot,\cdot\rangle|}_N$ mapping from $\ell^2(G)$ to $\mathcal{C}_N = \bigoplus_{n=1}^N\mathbb{C}^{|\Gamma^{n-1}|}$ by concatenating the feature map $\Phi_N$ with the operator that the family of maps $\{M^{|\langle\cdot,\cdot\rangle|}_{G_n}\}_{n=1}^N$ induces on $\mathcal{F}_N$ by concatenation. As detailed in Appendix I.2, this map inherits stability properties in complete analogy to the discussion of Section 4.

6 Higher Order Scattering

Node signals capture information about nodes in isolation. However, one might be interested in binary, ternary or even higher-order relations between nodes, such as distances or angles in graphs representing molecules. In this section we focus on binary relations, i.e. edge-level input, as this is the instantiation we also test in our regression experiment in Section 7. Appendix J.2 provides more details and extends these considerations beyond the binary setting. We equip the space of edge inputs with an inner product according to $\langle f, g\rangle = \sum_{i,j=1}^{|G|} f_{ij}\overline{g_{ij}}\,\mu_{ij}$ and denote the resulting inner-product space by $\ell^2(E)$, with $E = G \times G$ the set of edges. Setting e.g. node weights $\mu_i$ and edge weights $\mu_{ik}$ to one, the adjacency matrix $W$ as well as normalized or un-normalized graph Laplacians constitute self-adjoint operators on $\ell^2(E)$, where they act by matrix multiplication. Replacing the $G_n$ of Section 3 by $E_n$, we can then follow the recipe laid out there in constructing second-order scattering transforms; all that we need are a module sequence $\Omega_N$ and an operator sequence $\mathcal{D}^2_N := (P^2_n, \Delta^2_n)_{n=1}^N$, where now $P^2_n : \ell^2(E_{n-1}) \to \ell^2(E_n)$ and $\Delta^2_n : \ell^2(E_n) \to \ell^2(E_n)$. We denote the resulting feature map by $\Phi^2_N$ and write $\mathcal{F}^2_N$ for the corresponding feature space. The map $N^G_p$ introduced in (4) can also be adapted to aggregate higher-order features into graph-level features: With $\|f\|_{\ell^q(E)} := (\sum_{ij}|f_{ij}|^q\mu_{ij})^{1/q}$ and $\mu_E := \sum_{ij}\mu_{ij}$, we define $N^E_p(f) = \bigl(\|f\|_{\ell^1(E)}/\sqrt{\mu_E},\; \|f\|_{\ell^2(E)},\; \|f\|_{\ell^3(E)},\; \dots,\; \|f\|_{\ell^p(E)}\bigr)^\top/\sqrt{p}$. Given a feature map $\Phi^2_N$ with feature space $\mathcal{F}^2_N = \bigoplus_{n=1}^N \ell^2(E_n)^{\oplus|\Gamma^{n-1}|}$, we obtain a corresponding map $\Psi^2_N$ mapping from $\ell^2(E)$ to $\mathcal{R}_N = \bigoplus_{n=1}^N(\mathbb{R}^{p_n})^{\oplus|\Gamma^{n-1}|}$ by concatenating $\Phi^2_N$ with the map that the family of non-linear maps $\{N^{E_n}_{p_n}\}_{n=1}^N$ induces on $\mathcal{F}^2_N$ by concatenation. The stability results of the preceding sections then readily translate to $\Phi^2_N$ and $\Psi^2_N$ (cf. Appendix J).
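A minimal sketch of one second-order scattering layer and the edge-level aggregation map is given below. Here the edge-level signal is a $|G|\times|G|$ matrix, the action of a filter is taken to be left matrix multiplication (one concrete reading of "acting by matrix multiplication"), and unit edge weights are assumed; all names are ours.

```python
import numpy as np

def filt(g, Delta):
    lam, Phi = np.linalg.eigh(Delta)
    return Phi @ np.diag(g(lam)) @ Phi.T

def second_order_layer(Delta, F, filters):
    """One scattering layer on an edge-level signal F (a |G| x |G| matrix):
    rho = |.| followed by each filter g(Delta), here acting by left matrix multiplication."""
    F = np.abs(F)
    return [filt(g, Delta) @ F for g in filters]

def N_p_edge(F, p):
    """Edge-level analogue of the aggregation map (4), with unit edge weights mu_ij = 1."""
    a = np.abs(F).ravel()
    norms = [(a ** q).sum() ** (1.0 / q) for q in range(1, p + 1)]
    norms[0] /= np.sqrt(a.size)          # divide the l^1 norm by sqrt(mu_E)
    return np.asarray(norms) / np.sqrt(p)
```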
7 Experimental Results

We showcase that, even upon selecting the fairly simple Architectures I and II introduced in Section 3 (cf. also Fig. 2), our generalized graph scattering networks are able to outperform both wavelet-based scattering transforms and leading graph networks under different circumstances. To aid visual clarity when comparing results, we colour-code the best-performing method in green, the second-best performing in yellow and the third-best performing method in orange, respectively.

Social network graph classification: To facilitate contact between our generalized graph scattering networks and the wider literature, we combine a network conforming to our general theory, namely Architecture I in Fig. 2 (as discussed in Section 3, with depth $N = 4$, identity as connecting operators and $|\cdot|$-non-linearities), with the low-pass aggregation scheme of Section 5 and a Euclidean support vector machine with RBF kernel (GGSN+EK). The choice $N = 4$ was made to keep computation time palatable, while aggregation scheme and non-linearities were chosen to facilitate comparison with standard wavelet-scattering approaches. For this hybrid architecture (GGSN+EK), classification accuracies under the standard choice of 10-fold cross validation on five common social network graph datasets are compared with performances of popular graph kernel approaches, leading deep learning methods as well as geometric wavelet scattering (GS-SVM) [12]. More details are provided in Appendix K.

As evident from Table 1, our network consistently achieves higher accuracies than the geometric wavelet scattering transform of [12], with the performance gap becoming significant on the more complex REDDIT datasets, reaching a relative mean performance increase of more than 10% on REDDIT-12K. This indicates the liberating power of transcending the graph wavelet setting. While on comparatively smaller and somewhat simpler datasets there is a performance gap between our static architecture and fully trainable networks, this gap closes on more complex datasets: While P-Poinc e.g. outperforms our method on IMDB datasets, the roles are reversed on REDDIT datasets. On REDDIT-B our approach trails only GIN, with the difference in accuracies insignificant. On REDDIT-5K our method comes in third, with the gap to the second-best method (GIN) being statistically insignificant. On REDDIT-12K we generate state-of-the-art results.

Table 1: Classification accuracies [%] on social network datasets.

| Method | COLLAB | IMDB-B | IMDB-M | REDDIT-B | REDDIT-5K | REDDIT-12K |
|---|---|---|---|---|---|---|
| WL [33] | 77.82 ± 1.45 | 71.60 ± 5.16 | N/A | 78.52 ± 2.01 | 50.77 ± 2.02 | 34.57 ± 1.32 |
| Graphlet [34] | 73.42 ± 2.43 | 65.40 ± 5.95 | N/A | 77.26 ± 2.34 | 39.75 ± 1.36 | 25.98 ± 1.29 |
| DGK [42] | 73.00 ± 0.20 | 66.90 ± 0.50 | 44.50 ± 0.50 | 78.00 ± 0.30 | 41.20 ± 0.10 | 32.20 ± 0.10 |
| DGCNN [46] | 73.76 ± 0.49 | 70.03 ± 0.86 | 47.83 ± 0.85 | N/A | 48.70 ± 4.54 | N/A |
| PSCN [26] | 72.60 ± 2.15 | 71.00 ± 2.29 | 45.23 ± 2.84 | 86.30 ± 1.58 | 49.10 ± 0.70 | 41.32 ± 0.42 |
| P-Poinc [19] | N/A | 81.86 ± 4.26 | 57.31 ± 4.27 | 79.78 ± 3.21 | 51.71 ± 3.01 | 42.16 ± 3.41 |
| S2S-N2N-PP [16] | 81.75 ± 0.80 | 73.80 ± 0.70 | 51.19 ± 0.50 | 86.50 ± 0.80 | 52.28 ± 0.50 | 42.47 ± 0.10 |
| GSN-e [3] | 85.5 ± 1.2 | 77.8 ± 3.3 | 54.3 ± 3.3 | N/A | N/A | N/A |
| WKPI-kC [47] | N/A | 75.1 ± 1.1 | 49.5 ± 0.4 | N/A | 59.5 ± 0.6 | 48.4 ± 0.5 |
| GIN [41] | 80.20 ± 1.90 | 75.10 ± 5.10 | 52.30 ± 2.80 | 92.40 ± 2.50 | 57.50 ± 1.50 | N/A |
| GS-SVM [12] | 79.94 ± 1.61 | 71.20 ± 3.25 | 48.73 ± 2.32 | 89.65 ± 1.94 | 53.33 ± 1.37 | 45.23 ± 1.25 |
| GGSN+EK [OURS] | 80.34 ± 1.68 | 73.20 ± 3.76 | 49.47 ± 2.27 | 91.60 ± 1.97 | 56.89 ± 2.24 | 49.03 ± 1.58 |
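Once per-graph scattering feature vectors have been computed (e.g. with the sketches of Sections 2 and 3), the Euclidean-kernel step of GGSN+EK amounts to a standard RBF-kernel support vector machine evaluated under 10-fold cross validation. The snippet below shows this wiring with scikit-learn; the placeholder features, labels and hyperparameters are illustrative and are not the settings of Appendix K.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one aggregated scattering feature vector per graph, y: class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 85))          # placeholder features (e.g. 85 low-pass outputs)
y = rng.integers(0, 2, size=200)        # placeholder binary labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```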
Regression of quantum chemical energies: In order to showcase the prowess of both our higher-order scattering scheme and our spectrum-agnostic aggregation method of Section 5, we combine these building blocks into a hybrid architecture which we then apply in combination with kernel methods (2GGST+EK) to the task of atomization energy regression on QM7. This is a comparatively small dataset of 7165 molecular graphs, taken from the 970-million strong molecular database GDB-13 [2]. Each graph in QM7 represents an organic molecule, with nodes corresponding to individual atoms. Beyond the node-level information of atomic charge, there is also edge-level information characterising interaction strengths between individual nodes/atoms available. This is encoded into so-called Coulomb matrices (see e.g. [31] or Appendix K) of molecular graphs, which for us serve a dual purpose: On the one hand, we consider a Coulomb matrix as an edge-level input signal on a given graph. On the other hand, we also treat it as an adjacency matrix from which we build up a graph Laplacian $L$. Our normal operator is then chosen as $L_\Delta = L/\lambda_{\max}(L)$ again. Connecting operators are set to the identity, while non-linearities are fixed to $\rho_{n\geq 1}(\cdot) = |\cdot|$. Filters are chosen as $(\sin(\tfrac{\pi}{2}L_\Delta), \cos(\tfrac{\pi}{2}L_\Delta), \sin(\pi L_\Delta), \cos(\pi L_\Delta))$, acting through matrix multiplication. Output-generating functions are set to the identity and the depth is $N = 4$, so that we essentially recover Architecture II of Fig. 2, now applied to edge-level input. Graph-level features are aggregated via the map $N^E_5$ of Section 6. We chose $p = 5$ (and not $p \gg 5$) for $N^E_p$ to avoid overfitting. Generated feature vectors are combined with node-level scattering features, obtained from applying Architecture II of Fig. 2 to the input signal of atomic charge, into composite feature vectors; these are plotted in Figure 5.

(Figure 5: Atomization energy as a function of the primary principal components of the scattering features.)

As is visually evident, even when reduced to the low-dimensional subspace of their first three principal components, the generated scattering features are able to aptly resolve the atomization energy of the molecules. This aptitude is also reflected in Table 2, comparing our approach with leading graph-based learning methods trained with ten-fold cross validation on node- and (depending on the model) edge-level information. Our method is the best performing. We significantly outperform the next best model (DTNN), producing less than half of its mean absolute error (MAE). Errors of other methods are at least one, sometimes two, orders of magnitude greater. In part, this performance discrepancy might be explained by the heightened suitability of our scattering transform for environments with somewhat limited training-data availability. Here we speculate that the additional performance gap might be explained by the fact that our graph shift operator carries the same information as the Coulomb matrix (a proven molecular graph descriptor in itself [31]). Additionally, our filters being infinite series in powers of the underlying normal operator allows for rapid dispersion of information across underlying molecular graphs, as opposed to e.g. the filters in GraphConv or SchNet, which do not incorporate such higher powers.

Table 2: Comparison of methods on QM7.

| Method | MAE [kcal/mol] |
|---|---|
| Attentive FP [40] | 66.2 ± 2.8 |
| DMPNN [44] | 105.8 ± 13.2 |
| DTNN [39] | 8.2 ± 3.9 |
| GraphConv [18] | 118.9 ± 20.2 |
| GROVER (base) [30] | 72.5 ± 5.9 |
| MPNN [13] | 113.0 ± 17.2 |
| N-GRAM [21] | 125.6 ± 1.5 |
| PAGTN (global) [6] | 47.8 ± 3.0 |
| PhysChem [45] | 59.6 ± 2.3 |
| SchNet [32] | 74.2 ± 6.0 |
| Weave [17] | 59.6 ± 2.3 |
| GGST+EK [OURS] | 11.3 ± 0.6 |
| 2GGST+EK [OURS] | 3.4 ± 0.3 |
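The dual role of the Coulomb matrix described above can be sketched as follows; the standard Coulomb-matrix definition of [31] (off-diagonal entries $Z_iZ_j/|R_i - R_j|$, diagonal entries $0.5\,Z_i^{2.4}$) is used, the inputs $Z$ (nuclear charges) and $R$ (atom positions) are hypothetical, and zeroing the diagonal before forming the adjacency matrix is one natural choice, not necessarily the one made in Appendix K.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix of a molecule (cf. [31]): Z are nuclear charges, R atom positions."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    C = np.outer(Z, Z) / np.where(d == 0, np.inf, d)   # off-diagonal: Z_i Z_j / |R_i - R_j|
    np.fill_diagonal(C, 0.5 * Z ** 2.4)                 # diagonal: 0.5 * Z_i^2.4
    return C

# dual role of C: edge-level input signal and weighted adjacency matrix
Z = np.array([6.0, 1.0, 1.0, 1.0, 1.0])                 # illustrative methane-like molecule
R = 0.63 * np.array([[0, 0, 0], [1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
C = coulomb_matrix(Z, R)

W = C.copy(); np.fill_diagonal(W, 0.0)                  # adjacency from off-diagonal entries
L = np.diag(W.sum(axis=1)) - W
Delta = L / np.linalg.eigvalsh(L).max()                 # shift operator L / lambda_max(L)
F = C                                                   # the same matrix, read as a signal in l2(E)
```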
To quantify the effect of including second-order scattering coefficients, we also include the result of performing kernel regression solely on first-order features generated through Architecture II of Fig. 2 (GGST+EK). While results are still better than those of all but one leading approach, incorporating higher-order scattering improves performance significantly.

8 Discussion

Leaving behind the traditional reliance on graph wavelets, we developed a theoretically well-founded framework for the design and analysis of (generalized) graph scattering networks, allowing for varying branching ratios, non-linearities and filter banks. We provided spectrum-independent stability guarantees, covering changes in input signals and, for the first time, also arbitrary normal perturbations in the underlying graph-shift-operators. After introducing a new framework to quantify vertex-set non-preserving changes in graph domains, we obtained spectrum-independent stability guarantees for this setting too. We provided conditions for energy decay and discussed implications for truncation stability. Then we introduced a new method of graph-level feature aggregation and extended scattering networks to higher-order input data. Our numerical experiments showed that a simple scattering transform conforming to our framework is able to outperform the traditional graph-wavelet based approach to graph scattering in social network graph classification tasks. On complex datasets our method is also competitive with current fully trainable methods, outperforming all competitors on REDDIT-12K. Additionally, higher-order graph scattering transforms significantly outperform current leading graph-based learning methods in predicting atomization energies on QM7.

A reasonable critique of scattering networks as tractable models for general graph convolutional networks is their inability to emulate non-tree-structured network topologies. While transcending the wavelet setting has arguably diminished the conceptual gap between the two architectures, this structural difference persists. Additionally, we note that, despite a provided promising example, it is not yet clear whether the newly introduced graph-perturbation framework can aptly provide stability guarantees for all reasonable coarse-graining procedures. Exploring this question is the subject of ongoing work.
Broader Impact

We caution against an over-interpretation of established mathematical guarantees: Such guarantees do not negate biases that may be inherent to utilized datasets.

Disclosure of Funding

Christian Koke acknowledges support from the German Research Foundation through the MIMO II project (DFG SPP 1798, KU 1446/21-2). Gitta Kutyniok acknowledges support from the ONE Munich Strategy Forum (LMU Munich, TU Munich, and the Bavarian Ministry for Science and Art), the Konrad Zuse School of Excellence in Reliable AI (DAAD), the Munich Center for Machine Learning (BMBF), as well as the German Research Foundation under Grants DFG-SPP-2298, KU 1446/31-1 and KU 1446/32-1 and under Grant DFG-SFB/TR 109, Project C09, and the Federal Ministry of Education and Research under Grant MaGriDo.

References

[1] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021.
[2] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.
[3] Giorgos Bouritsas, Fabrizio Frasca, Stefanos P. Zafeiriou, and Michael Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[4] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872–1886, 2013.
[5] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014), 2014.
[6] Benson Chen, Regina Barzilay, and Tommi Jaakkola. Path-augmented graph transformer network, 2019.
[7] Paul Civin, Nelson Dunford, and Jacob T. Schwartz. Linear operators. Part I: General theory. American Mathematical Monthly, 67:199, 1960.
[8] Ronald R. Coifman and Mauro Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006. Special Issue: Diffusion Maps and Wavelets.
[9] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 2016.
[10] Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Diffusion scattering transforms on graphs. In 7th International Conference on Learning Representations (ICLR 2019). OpenReview.net, 2019.
[11] Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Stability of graph scattering transforms. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 8036–8046, 2019.
[12] Feng Gao, Guy Wolf, and Matthew J. Hirn. Geometric scattering for graph data analysis. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 2122–2131. PMLR, 2019.
[13] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
[14] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[15] Vassilis N. Ioannidis, Siheng Chen, and Georgios B. Giannakis. Pruned graph scattering transforms. In International Conference on Learning Representations, 2020.
[16] Yu Jin and Joseph F. JáJá. Learning graph-level representations with gated recurrent neural networks. CoRR, abs/1805.07683, 2018.
[17] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
[18] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017). OpenReview.net, 2017.
[19] Panagiotis Kyriakis, Iordanis Fostiropoulos, and Paul Bogdan. Learning hyperbolic representations of topological features. In International Conference on Learning Representations, 2021.
[20] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 3734–3743. PMLR, 2019.
[21] Shengchao Liu, Mehmet F. Demirel, and Yingyu Liang. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. Advances in Neural Information Processing Systems, 32, 2019.
[22] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[23] Hana Melánová, Bernd Sturmfels, and Rosa Winter. Recovery from power sums. Experimental Mathematics, 2022.
[24] Michael Reed and Barry Simon. Methods of Modern Mathematical Physics, Volume 1. Academic Press, 1981.
[25] Yimeng Min, Frederik Wenkel, and Guy Wolf. Scattering GCN: Overcoming oversmoothness in graph convolutional networks. Advances in Neural Information Processing Systems, 33:14498–14508, 2020.
[26] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2014–2023. PMLR, 2016.
[27] Chao Pan, Siheng Chen, and Antonio Ortega. Spatio-temporal graph scattering transform, 2020.
[28] Michael Perlmutter, Feng Gao, Guy Wolf, and Matthew Hirn. Understanding graph neural networks with asymmetric geometric scattering transforms, 2019.
[29] Olaf Post. Spectral Analysis on Graph-like Spaces. Lecture Notes in Mathematics, vol. 2039. Springer, Berlin, Heidelberg, 2012.
[30] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.
[31] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012.
[32] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems, 30, 2017.
[33] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561, 2011.
[34] Nino Shervashidze, S.V.N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 488–495. PMLR, 2009.
[35] David I. Shuman, Christoph Wiesmeyr, Nicki Holighaus, and Pierre Vandergheynst. Spectrum-adapted tight graph wavelet and vertex-frequency frames. IEEE Transactions on Signal Processing, 63(16):4223–4235, 2015.
[36] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature, 2021.
[37] T. P. Wihler. On the Hölder continuity of matrix functions for normal matrices. Journal of Inequalities in Pure and Applied Mathematics, 10(4), 2009.
[38] Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64:1845–1866, 2018.
[39] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9:513–530, 2018.
[40] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, and Mingyue Zheng. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry, 63(16):8749–8760, 2020.
[41] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR 2019). OpenReview.net, 2019.
[42] Pinar Yanardag and S.V.N. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1365–1374. Association for Computing Machinery, 2015.
[44] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.
[45] Shuwen Yang, Ziyao Li, Guojie Song, and Lingsheng Cai. Deep molecular representation learning via fusing physical and chemical information. Advances in Neural Information Processing Systems, 34, 2021.
[46] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 4438–4445. AAAI Press, 2018.
[47] Qi Zhao and Yusu Wang. Learning metrics for persistence-based summaries and applications for graph classification. Advances in Neural Information Processing Systems, 32, 2019.
[48] Dongmian Zou and Gilad Lerman. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 49(3):1046–1074, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] A discussion of how and in which order the main claims made in the abstract and introduction were substantiated within the paper is a main focus of Section 8.
   (b) Did you describe the limitations of your work? [Yes] This is a second focus of Section 8.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] This is part of Section 8.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] Every theorem and lemma is stated in a mathematically precise way, with all underlying assumptions included. Additionally, Appendix A briefly reviews some terminology that might not be immediately present in every reader's mind, but is utilized in order to be able to state theoretical results in a precise and concise manner.
   (b) Did you include complete proofs of all theoretical results? [Yes] This is the focus of Appendices B, C, D, F, G, H and I. Results in Appendix J are merely stated, as statements and corresponding proofs are in complete analogy (in fact almost verbatim the same) to previously discussed statements and proofs.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please see the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] This is the main focus of Appendix K.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Errors for results of conducted experiments are included in Table 1 and Table 2 respectively.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] This is described at the beginning of Appendix K.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] All utilized datasets were matched to the papers that introduced them (cf. Appendix K). Additionally, we partially built on code corresponding to [12], which we mentioned in Appendix K.
   (b) Did you mention the license of the assets? [Yes] We mentioned in Appendix K that the code corresponding to [12] is freely available under an Apache License.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Please see the supplementary material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] For the utilized social network datasets, we mention in Appendix K that neither personally identifiable data nor content that might be considered offensive is utilised.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]