# Graph-Coupled Oscillator Networks

T. Konstantin Rusch 1 2, Benjamin P. Chamberlain 3, James Rowbottom 3, Siddhartha Mishra 1 2, Michael M. Bronstein 4 3

1 Seminar for Applied Mathematics (SAM), D-MATH, ETH Zürich, Switzerland. 2 ETH AI Center, ETH Zürich. 3 Twitter Inc., London, UK. 4 Department of Computer Science, University of Oxford, UK. Correspondence to: T. Konstantin Rusch.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear controlled and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet-energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Moreover, we prove that GraphCON mitigates the exploding and vanishing gradients problem, thereby facilitating the training of deep multi-layer GNNs. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.

1. Introduction

Graph Neural Networks (GNNs) (Sperduti, 1994; Goller & Kuchler, 1996; Sperduti & Starita, 1997; Frasconi et al., 1998; Gori et al., 2005; Scarselli et al., 2008; Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017; Monti et al., 2017; Gilmer et al., 2017) are a widely-used class of models for learning on relations and interaction data. These models have recently been successfully applied in a variety of tasks such as computer vision and graphics (Monti et al., 2017), recommender systems (Ying et al., 2018), transportation (Derrow-Pinion et al., 2021), computational chemistry (Gilmer et al., 2017), drug discovery (Gaudelet et al., 2021), physics (Shlomi et al., 2020), and analysis of social networks (see Zhou et al. (2019); Bronstein et al. (2021) for additional applications).

Several recent works proposed Graph ML models based on differential equations coming from physics (Avelar et al., 2019; Poli et al., 2019b; Zhuang et al., 2020; Xhonneux et al., 2020b), including diffusion (Chamberlain et al., 2021b) and wave (Eliasof et al., 2021) equations, as well as geometric equations such as Beltrami (Chamberlain et al., 2021a) and Ricci (Topping et al., 2021) flows. Such approaches allow not only to recover popular GNN models as discretization schemes for the underlying differential equations, but also, in some cases, to address problems encountered in traditional GNNs such as oversmoothing (Nt & Maehara, 2019; Oono & Suzuki, 2020) and bottlenecks (Alon & Yahav, 2021). In this paper, we propose a novel physically-inspired approach to learning on graphs.
Our framework, termed GraphCON (Graph-Coupled Oscillator Network), builds upon suitable time-discretizations of a specific class of ordinary differential equations (ODEs) that model the dynamics of a network of nonlinear controlled and damped oscillators, which are coupled via the adjacency structure of the underlying graph. Graph-coupled oscillators are often encountered in mechanical, electronic, and biological systems, and have been studied extensively (Strogatz, 2015), with a prominent example being functional circuits in the brain such as cortical columns (Stiefel & Ermentrout, 2016). In these circuits, each neuron oscillates with periodic firing and spiking of the action potential. The network of neurons is coupled in the form of a graph, with neurons representing nodes and edges corresponding to synapses linking neurons.

Main Contributions. In the subsequent sections, we will demonstrate the following features of GraphCON:

- GraphCON is flexible enough to accommodate any standard GNN layer (such as GAT or GCN) as its coupling function. As timesteps of our discretized ODE can be interpreted as layers of a deep neural network (Chen et al., 2018; Haber & Ruthotto, 2018; Chamberlain et al., 2021b), one can view GraphCON as a wrapper around any underlying basic GNN layer, allowing one to build deep GNNs. Moreover, we will show that standard GNNs can be recovered as steady states of the underlying class of ODEs, whereas GraphCON utilizes their dynamic behavior to sample a richer set of states, which leads to better expressive power.

- We mathematically formulate the frequently encountered oversmoothing problem for GNNs (Nt & Maehara, 2019; Oono & Suzuki, 2020) in terms of the stability of zero-Dirichlet-energy steady states of the underlying equations. By a careful analysis of the dynamics of the proposed ODEs, we demonstrate that zero-Dirichlet-energy steady states are not (exponentially) stable. Consequently, we show that the oversmoothing problem for GraphCON is mitigated by construction.

- We rigorously prove that GraphCON mitigates the so-called exploding and vanishing gradients problem for the resulting GNN. Hence, GraphCON can greatly improve the trainability of deep multi-layer GNNs.

- We provide an extensive empirical evaluation of GraphCON on a wide variety of graph learning tasks such as transductive and inductive node classification and graph regression and classification, demonstrating that GraphCON achieves competitive performance.

2. GraphCON

Let G = (V, E ⊂ V × V) be an undirected graph with |V| = v nodes and |E| = e edges, consisting of unordered pairs of nodes {i, j}, denoted i ∼ j. We label nodes by the index i ∈ V = {1, 2, ..., v}. For any i ∈ V, we denote its 1-neighborhood by N_i = {j ∈ V : i ∼ j}. Furthermore, let X ∈ R^{v×m} be given by X = {X_i} for i ∈ V, with X_i denoting the m-dimensional feature vector at node i. Central to our framework is a graph dynamical system represented by the following nonlinear system of ODEs:

X'' = σ(F_θ(X, t)) − γ X − α X'.   (1)

Here, X(t) denotes the time-dependent v × m matrix of node features, σ is the activation function, and F_θ is a general learnable (possibly time-dependent) 1-neighborhood coupling function of the form

(F_θ(X, t))_i = F_θ(X_i(t), X_j(t), t),   i ∼ j,   (2)

parametrized with a set of learnable parameters θ.
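To make the interface of such a 1-neighborhood coupling concrete, the following minimal sketch (our own illustration in PyTorch, not code from the paper or its repository; all names are ours) shows a GCN-style choice of F_θ acting on a dense normalized adjacency.

```python
import torch

def coupling(X, A_hat, W):
    # A GCN-style 1-neighborhood coupling in the sense of (2): row i of the output
    # depends only on X_i and the features of its neighbors, as encoded by the
    # (v, v) normalized adjacency A_hat; W is a learnable (m, m) weight matrix.
    return A_hat @ X @ W
```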
By introducing the auxiliary velocity variable Y(t) = X'(t) ∈ R^{v×m}, we can rewrite the second-order ODEs (1) as a first-order system:

X' = Y,   Y' = σ(F_θ(X, t)) − γ X − α Y.   (3)

The key idea of our framework is, given the input node features X(0) as an initial condition, to use the solution X(T) at some time T as the output (more generally, one can also apply (linear) transformations (embeddings) to X(0) and X(T)). As will be shown in the following section, the space of solutions of our system is a rich class of functions that can solve many learning tasks on a graph.

The system (3) must be solved by an iterative numerical solver using a suitable time-discretization. It is highly desirable for a time-discretization to preserve the structure of the underlying ODEs (3) (Hairer et al., 1987). In this paper, we use the following IMEX (implicit-explicit) time-stepping scheme, which extends the symplectic Euler method (Hairer et al., 1987) to systems with an additional damping term:

Y^n = Y^{n−1} + Δt [σ(F_θ(X^{n−1}, t^{n−1})) − γ X^{n−1} − α Y^{n−1}],
X^n = X^{n−1} + Δt Y^n,   (4)

for n = 1, ..., N, where Δt > 0 is a fixed time-step and Y^n, X^n denote the hidden node features at time t^n = n Δt. The iterative scheme (4) can be interpreted as an N-layer graph neural network (with potential additional linear input and readout layers, omitted here for simplicity), which we refer to as GraphCON (see Section 3 for the motivation of this nomenclature). The coupling function F_θ plays the role of a message passing mechanism (Gilmer et al. (2017); also referred to, in various contexts, as diffusion or neighborhood aggregation) in traditional GNNs.

Choice of the coupling function F_θ. Our framework allows any learnable 1-neighborhood coupling to be used as F_θ, including instances of message passing mechanisms commonly used in the Graph ML literature such as GraphSAGE (Hamilton et al., 2017), Graph Attention (Velickovic et al., 2018), Graph Convolution (Defferrard et al., 2016; Kipf & Welling, 2017), SplineCNN (Fey et al., 2018), or MoNet (Monti et al., 2017). In this paper, we focus on two particularly popular choices:

Attentional message passing of Velickovic et al. (2018):

F_θ(X^n, t^n) = A^n(X^n) X^n W^n,

with learnable weight matrices W^n ∈ R^{m×m} and attention matrices A^n ∈ R^{v×v} following the adjacency structure of the graph G, i.e., (A^n(X^n))_{ij} = 0 if j ∉ N_i and

(A^n(X^n))_{ij} = exp(LeakyReLU(a^T [W^n X_i^n || W^n X_j^n])) / Σ_{k∈N_i} exp(LeakyReLU(a^T [W^n X_i^n || W^n X_k^n]))

otherwise (here X_i^n denotes the i-th row of X^n and a ∈ R^{2m}). We refer to (4) based on this attentional 1-neighborhood coupling as GraphCON-GAT.

Graph convolution operator of Kipf & Welling (2017):

F_θ(X^n, t^n) = D̂^{−1/2} Â D̂^{−1/2} X^n W^n,   (5)

with Â = A + I denoting the adjacency matrix of G with inserted self-loops, D̂ = diag(Σ_{l=1}^{v} Â_{kl}) the corresponding diagonal degree matrix, and W^n ∈ R^{m×m} learnable weight matrices. We refer to (4) based on this convolutional 1-neighborhood coupling as GraphCON-GCN.

Steady states of GraphCON and relation to GNNs. It is straightforward to see that the steady states X*, Y* of the GraphCON dynamical system (4) with an autonomous coupling function F_θ = F_θ(X) (as in GraphCON-GAT or GraphCON-GCN) are given by

Y* ≡ 0   and   X* = (1/γ) σ(F_θ(X*)).   (6)

Using a simple fixed point iteration to find the steady states (6) yields a multi-layer GNN of the form

X^n = (1/γ) σ(F_θ(X^{n−1})),   for n = 1, 2, ..., N.   (7)

We observe that (up to a rescaling by the factor 1/γ) equation (7) corresponds to the update formula for any standard N-layer message-passing GNN (Gilmer et al., 2017), including such popular variants as GAT (Velickovic et al., 2018) or GCN (Kipf & Welling, 2017). Thus, this interpretation of GraphCON (4) clearly brings out its relationship with standard GNNs.

Unlike standard multi-layer GNNs of the generic form (7), which can be thought of as steady states of the underlying ODEs (3), GraphCON evolves the underlying node features dynamically in time. Interpreting the multiple GNN layers as iterations at times t^n = n Δt in (4), we observe that the node features in GraphCON follow the trajectories of the corresponding dynamical system and can explore a richer sampling of the underlying latent feature space, leading to possibly greater expressive power than standard GNNs (7), which might remain in the vicinity of steady states. Moreover, this interpretation also reveals that, in principle, any GNN of the form (7) can be used within the GraphCON framework, offering a very flexible and broad class of architectures. Hence, one can think of GraphCON as an additional wrapper on top of any basic GNN layer, allowing for a principled and stable design of deep multi-layered GNNs. In the following Section 3, we show that such an approach has several key advantages over standard GNNs. An illustrative code sketch contrasting (4) and (7) is given below.
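The following minimal PyTorch sketch contrasts the GraphCON update (4) with the plain fixed-point stack (7), both using the GCN coupling (5) on a dense normalized adjacency. This is our own sketch under simplifying assumptions (dense adjacency, ReLU activation, zero initial velocity, no input/readout layers) and is not the authors' released implementation.

```python
import torch

def graphcon_gcn(X0, A_hat, Ws, dt=1.0, alpha=1.0, gamma=1.0):
    # GraphCON update (4) with GCN coupling (5). X0: (v, m) input node features,
    # A_hat: (v, v) normalized adjacency D^{-1/2}(A + I)D^{-1/2}, Ws: one (m, m)
    # weight matrix per layer / time step. Y plays the role of the velocity in (3).
    X, Y = X0, torch.zeros_like(X0)
    for W in Ws:
        coupling = torch.relu(A_hat @ X @ W)              # sigma(F_theta(X))
        Y = Y + dt * (coupling - gamma * X - alpha * Y)   # velocity update
        X = X + dt * Y                                    # IMEX step: uses the new Y
    return X

def gnn_fixed_point(X0, A_hat, Ws, gamma=1.0):
    # The fixed-point stack (7): a standard multi-layer GNN, recovered here as a
    # fixed-point iteration for the steady states (6).
    X = X0
    for W in Ws:
        X = torch.relu(A_hat @ X @ W) / gamma
    return X
```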
3. Properties of GraphCON

To gain some insight into the functioning of GraphCON (4), we start by setting the hyperparameter γ = 1 and assuming that the 1-neighborhood coupling F_θ is given by either the GAT or the GCN type coupling function. In this case, the underlying ODEs (3) take the following node-wise form,

X_i' = Y_i,   Y_i' = σ( Σ_{j∈N_i} A_ij X_j ) − X_i − α Y_i,   (8)

for all nodes i ∈ V, with A_ij = A(X_i(t), X_j(t)) ∈ R stemming from the attention or convolution operators. Furthermore, the matrices are right stochastic, i.e., the entries satisfy

0 ≤ A_ij ≤ 1,   j ∈ N_i, i ∈ V,   and   Σ_{j∈N_i} A_ij = 1,   i ∈ V.   (9)

Uncoupled case. The simplest case of (8) corresponds to setting σ ≡ 0 and α = 0. In this case, all nodes are uncoupled from each other and the solutions of the resulting ODEs are of the form

X_i(t) = X_i(0) cos(t) + Y_i(0) sin(t).   (10)

Thus, the dynamics of the ODEs (3) in this special case correspond to a system of uncoupled oscillators, with each node oscillating at unit frequency.

Coupled linear case. Next, we introduce coupling between the nodes that are adjacent on the underlying graph G and assume the identity activation function σ(x) = x. In this case, (8) is a coupled linear system and an exact closed-form solution such as (10) may not be possible. However, we can describe the dynamics of (8) in the form of the following proposition (proved in SM C.1).

Proposition 3.1. Let the node features X, Y evolve according to the ODEs (8) with activation function σ = id and time-independent matrix A (e.g. A_ij = A(X_i(0), X_j(0)) using the initial features). Further assume that A is symmetric and α = 0. Then

Σ_{i∈V} |Y_i(t)|² + (1/2) Σ_{i∈V} Σ_{j∈N_i} A_ij |X_i(t) − X_j(t)|² = Σ_{i∈V} |Y_i(0)|² + (1/2) Σ_{i∈V} Σ_{j∈N_i} A_ij |X_i(0) − X_j(0)|²   (11)

holds for all t > 0.

Thus, in this case, we have shown that the dynamics of the underlying ODEs (8) preserves the energy

E(t) := Σ_{i∈V} |Y_i(t)|² + (1/2) Σ_{i∈V} Σ_{j∈N_i} A_ij |X_i(t) − X_j(t)|²,   (12)

and the trajectories of (8) are constrained to lie on a manifold of the node feature space, defined by the level sets of the energy. In particular, the energy (12) is not produced or destroyed but simply redistributed among the nodes of the underlying graph G.
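A quick numerical check of this conservation property is straightforward; the snippet below is our own toy setup (a cycle graph with the symmetric, row-stochastic coupling A_ij = 1/2, identity activation, α = 0, and a small time-step) that evaluates the energy (12) along a discrete trajectory of (4). With the structure-preserving step, the value should stay approximately constant, up to the discretization error.

```python
import torch

v, m, dt = 12, 2, 0.01
A = torch.zeros(v, v)
for i in range(v):                       # cycle graph: symmetric and row-stochastic
    A[i, (i + 1) % v] = A[i, (i - 1) % v] = 0.5
X, Y = torch.randn(v, m), torch.randn(v, m)
src, dst = A.nonzero(as_tuple=True)
energy = lambda X, Y: float((Y ** 2).sum()
                            + 0.5 * (A[src, dst, None] * (X[src] - X[dst]) ** 2).sum())
E0 = energy(X, Y)
for _ in range(2000):                    # sigma = id, alpha = 0, gamma = 1 in (4)
    Y = Y + dt * (A @ X - X)
    X = X + dt * Y
print(E0, energy(X, Y))                  # approximately equal, cf. (11)
```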
Thus, the dynamics of (3) in this setting amounts to the motion of a linear system of coupled oscillators.

General nonlinear case. In the general case, we have (i) a nonlinear activation function σ; (ii) time-dependent nonlinear coefficients A_ij = A(X_i(t), X_j(t)); and (iii) possibly unsymmetric entries A_ij ≠ A_ji. All these factors destroy the energy conservation property (11) and can possibly lead to unbounded growth of the energy. Hence, we need to add some damping to the system. To this end, the damping term in (8) is activated by setting α > 0. Moreover, choosing γ ≠ 1 corresponds to controlling the frequencies of the nodes. Thus, the overall dynamics of the underlying ODEs (3) amounts to the motion of a nonlinear system of coupled, controlled and damped oscillators, with the coupling structure being that of the underlying graph. This explains our choice of the name Graph-Coupled Oscillator Network, or GraphCON for short.

We illustrate the dynamics of GraphCON in Fig. 1, where the model is applied to the graph of a molecule from the ZINC database (Irwin et al., 2012), with features X denoting the positions of the nodes, which are propagated in time through the action of GraphCON (4). The oscillatory behavior of the node features, as well as their dependence on the adjacency structure of the underlying graph, can be clearly observed in this figure.

Figure 1. Illustration of GraphCON dynamics on a ZINC molecular graph. The initial positions of GraphCON (X^0 in (4)) are represented by the 2-dimensional positions of the nodes, while the initial velocities (Y^0 in (4)) are set to the initial positions. The positions are propagated forward in time ("layers") using GraphCON-GCN with random weights. The molecular graph is plotted at the initial time t = 0 as well as at t = 20.

Oversmoothing and GraphCON. One of the common plights of GNN models such as GAT (Velickovic et al., 2018), GCN (Kipf & Welling, 2017) and their variants is oversmoothing (Nt & Maehara, 2019; Oono & Suzuki, 2020), a phenomenon where all node features in a deep GNN converge to the same constant value as the number of hidden layers is increased. Consequently, one often must resort to shallow GNNs at the expense of expressive power (Nt & Maehara, 2019; Oono & Suzuki, 2020). Many attempts have been made in recent years to mitigate the oversmoothing problem for GNNs, including regularization procedures such as DropEdge (Rong et al., 2020), using intermediate representations (Xu et al., 2018b), or adding residual connections (Chen et al., 2020).

We will show that GraphCON allows to mitigate this problem by construction, and set off by formulating this problem in precise mathematical terms. To this end, we recall the Dirichlet energy, defined on the node features X of an undirected graph G as

E(X) = Σ_{i∈V} Σ_{j∈N_i} |X_i − X_j|².   (13)

Next, we define oversmoothing as follows:

Definition 3.2. Let X^n denote the hidden features of the n-th layer of an N-layer GNN, with n = 0, ..., N. We define oversmoothing as the exponential convergence to zero of the layer-wise Dirichlet energy as a function of n, i.e.,

E(X^n) ≤ C_1 e^{−C_2 n},   (14)

with some constants C_1, C_2 > 0.

In other words, oversmoothing happens when the graph gradients vanish quickly in the number of hidden layers of the GNN (see for instance the illustration in Fig. 2). As a result, the feature vectors across all nodes rapidly (exponentially) converge to the same constant value.
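The Dirichlet energy (13) is straightforward to monitor layer by layer; the sketch below is our own (function and variable names are ours) and omits any normalization constant. Tracking it for the outputs X^n of successive layers makes Definition 3.2 directly testable: rapid decay of the value with n indicates oversmoothing.

```python
import torch

def dirichlet_energy(X, edge_index):
    # Dirichlet energy (13) of node features X (shape (v, m)); edge_index is a
    # (2, e') tensor listing every undirected edge in both directions, so the sum
    # runs over all pairs i ~ j exactly as in (13).
    src, dst = edge_index
    return ((X[src] - X[dst]) ** 2).sum()
```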
This behavior is commonly observed in GNNs and is identified as one of the reasons for the difficulty in designing deep GNNs. GraphCON behaves rather differently and allows to mitigate the oversmoothing problem in the sense of Definition 3.2. To see this, we focus on the underlying ODEs (3). It is trivial to extend the definition of oversmoothing from the discrete case to the continuous one by requiring that oversmoothing happens for the ODEs (3) if the Dirichlet energy behaves as

E(X(t)) ≤ C_1 e^{−C_2 t},   t > 0,   (15)

for some C_{1,2} > 0. We have the following simple proposition (proved in SM C.2) that characterizes the oversmoothing problem for the underlying ODEs in the standard terminology of dynamical systems (Wiggins, 2003).

Proposition 3.3. The oversmoothing problem occurs for the ODEs (3) if and only if the hidden states (X*, Y*) = (c, 0) are exponentially stable steady states (fixed points) of the ODE (3), for some c ∈ R^m and 0 being the m-dimensional vector with zeroes for all its entries.

In other words, all the trajectories of the ODE (3) that start within the corresponding basin of attraction have to converge exponentially fast in time (satisfy (15)) to the corresponding steady state (c, 0) for the oversmoothing problem to occur for this system. Note that the basins of attraction will be different for different values of c. Given this characterization, the key questions are a) whether (c, 0) are fixed points for the ODE (3), and b) whether these fixed points are exponentially stable. We answer these questions for the ODEs (8) in the following

Proposition 3.4. Assume that the activation function σ in the ODEs (8) is ReLU. Then, for any c ∈ R^m such that each entry of the vector satisfies c_ℓ ≥ 0, for all 1 ≤ ℓ ≤ m, the hidden state (c, 0) is a steady state for the ODEs (8). However, under the additional assumption α ≤ 1/2, this fixed point is not exponentially stable.

The fact that (c, 0) is a steady state of (8), for any positive c, is straightforward to see from the structure of (8) and the definition of the ReLU activation function. We can already observe from the energy identity (11) for the simplified symmetric linear system that the energy (12) of small perturbations around the steady state (c, 0) is conserved in time. Hence, these small perturbations do not decay at all, let alone exponentially fast in time. Thus, these steady states are not exponentially stable.

An extension of this analysis to the nonlinear, time-dependent, possibly non-symmetric system (8) is more subtle, and the proof relies on the identity (28) (expressed in Proposition C.1 in SM C.3), which describes how a suitably defined energy of the general system (8) evolves around small perturbations of the steady state (c, 0). A careful analysis of this identity reveals that these small perturbations can grow polynomially in time (at least for short time periods) and do not decay exponentially. Consequently, the fixed point (c, 0) is not (exponentially) stable. This shows that the oversmoothing problem, in the sense of Definition 3.2, is mitigated for the ODEs (3) and for structure-preserving time-discretizations of it such as (4); in simple words, it follows that GraphCON mitigates oversmoothing by construction.
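To make this concrete, the following toy numerical check is entirely our own construction (a random graph with a row-stochastic coupling matrix as in (9), identity weights, and hypothetical sizes): it starts GraphCON-like dynamics near a constant steady state (c, 0) and tracks the Dirichlet energy of the perturbation, which Proposition 3.4 predicts is not damped exponentially.

```python
import torch

torch.manual_seed(0)
v, m, dt, alpha = 20, 3, 0.1, 0.25                 # alpha <= 1/2, as in Proposition 3.4
A = (torch.rand(v, v) < 0.3).float()
A = ((A + A.t()) > 0).float().fill_diagonal_(0)    # random undirected graph
A_hat = A + torch.eye(v)
A_hat = A_hat / A_hat.sum(dim=1, keepdim=True)     # row-stochastic coupling, cf. (9)

c = torch.ones(1, m)                               # constant positive state
X = c + 1e-3 * torch.randn(v, m)                   # small perturbation of (c, 0)
Y = torch.zeros(v, m)
src, dst = A.nonzero(as_tuple=True)
for n in range(1, 501):
    Y = Y + dt * (torch.relu(A_hat @ X) - X - alpha * Y)   # (8) with identity weights
    X = X + dt * Y
    if n % 100 == 0:
        print(n, float(((X[src] - X[dst]) ** 2).sum()))    # perturbation Dirichlet energy
```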
This analysis also illustrates the rich dynamics of (3): even if the trajectories reach a steady state of the form (c, 0), very small perturbations will grow and the trajectory will veer away from this steady state, possibly towards other constant steady states, which are also not stable. Thus, the trajectories can sample large parts of the latent space, contributing to the expressive power of the model.

We remark here that the use of the ReLU activation function in Proposition C.1 is purely for definiteness. Any other widely used activation function can be used as σ, with the corresponding zero-Dirichlet-energy steady states being specified by the roots of the algebraic equation σ(c) = c, and an analogous result can be derived. For instance, the zero-Dirichlet-energy steady state corresponding to the Tanh activation function is given by (0, 0).

On the exploding and vanishing gradients problem. The mitigation of oversmoothing by GraphCON has a great bearing on increasing the expressivity of the resulting deep GNN. In addition, it turns out that using graph-coupled oscillators can also facilitate training of the underlying GNNs. To see this, we will consider a concrete example of the coupling function in (4), namely GCN (5). Other coupling functions such as GAT can be considered analogously. For simplicity of exposition and without any loss of generality, we consider scalar node features by setting m = 1. We also set α, γ = 1. With these assumptions, an N-layer deep GraphCON-GCN reduces to the following explicit (node-wise) form,

Y_i^n = (1 − Δt) Y_i^{n−1} + Δt σ(C_i^{n−1}) − Δt X_i^{n−1},
C_i^{n−1} = (w_i^n / d_i) X_i^{n−1} + Σ_{j∈N_i} (w_j^n / √(d_i d_j)) X_j^{n−1},
X_i^n = X_i^{n−1} + Δt Y_i^n,   1 ≤ n ≤ N, 1 ≤ i ≤ v.   (16)

Here, d_i = deg(i) denotes the degree of a node i ∈ V and w^n ∈ R^v denotes the learnable weight vector. Moreover, we are in a setting where the learning task is for the GNN to approximate the ground truth vector X̄ ∈ R^v. Consequently, we set up the following loss function,

J(w) = Σ_{i∈V} |X_i^N − X̄_i|²,   (17)

with w = [w^1, w^2, ..., w^N] denoting the concatenated learnable weights in (16). During training, one computes an approximate minimizer of the loss function (17) with a (stochastic) gradient descent (SGD) procedure. At every step of gradient descent, we need to compute the gradient ∇_w J. For definiteness, we fix a node k ∈ V and a layer 1 ≤ ℓ ≤ N and consider the learnable weight w_k^ℓ. Thus, in an SGD step, one needs to compute the gradient ∂J/∂w_k^ℓ. By the chain rule, one readily proves the following identity (see for instance (Pascanu et al., 2013)),

∂J/∂w_k^ℓ = (∂J/∂Z^N) (∂Z^N/∂w_k^ℓ),   (18)

where Z^n = [X_1^n, Y_1^n, X_2^n, Y_2^n, ..., X_i^n, Y_i^n, ..., X_v^n, Y_v^n]^T is the concatenated node-feature vector at the layer 1 ≤ n ≤ N. Furthermore, by using the product rule, we see that

∂Z^N/∂w_k^ℓ = ( ∏_{n=ℓ+1}^{N} ∂Z^n/∂Z^{n−1} ) ∂Z^ℓ/∂w_k^ℓ.   (19)

In other words, the gradient ∂J/∂w_k^ℓ measures the contribution made by the node k in the ℓ-th hidden layer to the learning process. If we assume that the partial gradient behaves as ∂Z^n/∂Z^{n−1} ≈ λ for all n, then the long-product structure of (19) implies that ∂Z^N/∂Z^ℓ ≈ λ^{N−ℓ}. If, on average, λ > 1, then the total gradient (18) can grow exponentially in the number of layers, leading to the exploding gradients problem. Similarly, if, on average, λ < 1, then the total gradient (18) can decay exponentially in the number of layers, leading to the vanishing gradients problem. Either of these situations can lead to failure of training, as the gradient step either blows up or does not change at all.
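A quick empirical check of this effect is easy to set up; the sketch below (our own, with hypothetical sizes and random data, reusing the graphcon_gcn sketch from Section 2) backpropagates a squared-error loss through many GraphCON-GCN layers and prints per-layer gradient norms, which can be compared with those of a plain GCN stack of the same depth.

```python
import torch

torch.manual_seed(0)
v, m, N = 16, 4, 64
A = (torch.rand(v, v) < 0.3).float()
A = ((A + A.t()) > 0).float().fill_diagonal_(0)
A_hat = A + torch.eye(v)
d = A_hat.sum(dim=1)
A_hat = A_hat / torch.sqrt(d[:, None] * d[None, :])        # D^{-1/2}(A + I)D^{-1/2}

Ws = [torch.nn.Parameter(0.2 * torch.randn(m, m)) for _ in range(N)]
X0, target = torch.randn(v, m), torch.randn(v, m)
loss = ((graphcon_gcn(X0, A_hat, Ws, dt=0.5) - target) ** 2).sum()
loss.backward()
# Gradient norms for an early, a middle, and a late layer; Propositions 3.5 and 3.6
# below bound how these quantities behave as the depth N grows.
print([round(Ws[l].grad.norm().item(), 6) for l in (0, N // 2, N - 1)])
```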
Hence, for very deep GNN architectures, it is essential to investigate whether the exploding and vanishing gradients problem can be mitigated. We start by showing the following upper bound (proved in SM C.4) on the gradients.

Proposition 3.5. Let X^n, Y^n be the node features generated by GraphCON-GCN (16). We assume that Δt << 1 is chosen to be sufficiently small. Then the gradient of the loss function J (17) with respect to any learnable weight parameter w_k^ℓ, for some 1 ≤ k ≤ v and 1 ≤ ℓ ≤ N, is bounded as

|∂J/∂w_k^ℓ| ≤ β' D̂ Δt (1 + Γ N Δt) ( max_{1≤i≤v} (|X_i^0| + |Y_i^0|) + max_{1≤i≤v} |X̄_i| + β ),   (20)

where

β = max_x |σ(x)|,   β' = max_x |σ'(x)|,   D̂ = max_{i,j∈V} 1/√(d_i d_j),   Γ := 6 + 4 β' D̂ max_{1≤n≤N} ||w^n||_1.   (21)

The upper bound (20) clearly shows that the total gradient is globally bounded, independent of the number of layers N, if Δt ∼ N^{−1}, thus mitigating the exploding gradients problem. Even if the small parameter Δt is chosen independently of the number of layers N, the total gradient in (20) grows at most quadratically in the number of layers, thus preventing exponential blowup of gradients and mitigating the exploding gradients problem. However, this upper bound (20) does not necessarily rule out the vanishing gradients problem. To this end, we derive the following formula (in SM C.4) for the gradients.

Proposition 3.6. For 1 ≤ n ≤ N, let X^n be the node features generated by GraphCON-GCN (16). Then, for sufficiently small Δt << 1, the gradient ∂J/∂w_k^ℓ, for any ℓ, k, satisfies the following expression,

∂J/∂w_k^ℓ = 2 Δt² Σ_{j∈N_k∪{k}} ( σ'(C_j^{ℓ−1}) X_j^{ℓ−1} / √(d_j d_k) ) ( X_j^N − X̄_j ) + O(Δt³),   (22)

with the order notation being defined in SM Eqn. (54).

One readily observes from the formula (22) that, to leading order in the small parameter Δt, the gradient ∂J/∂w_k^ℓ is independent of the number of layers N of the underlying GNN. Thus, although the gradient can be small (due to small Δt), it will not vanish as the number of layers increases, mitigating the vanishing gradients problem.

4. Related Work

Differential equations have historically played a role in designing and interpreting various algorithms in machine learning, including non-linear dimensionality reduction methods (Belkin & Niyogi, 2003; Coifman & Lafon, 2006) and ranking (Page et al., 1999; Chakrabarti, 2007) (all of which are related to closed-form solutions of diffusion PDEs). In the context of deep learning, differential equations have been used to derive various types of neural networks, including Neural ODEs and their variants, which have been used to design and interpret residual (Chen et al., 2018) and convolutional (Haber & Ruthotto, 2018) neural networks. These approaches have recently gained traction in Graph ML, e.g. with ODE-based models for learning on graphs (Avelar et al., 2019; Poli et al., 2019b; Zhuang et al., 2020; Xhonneux et al., 2020b).

Chamberlain et al. (2021b) used parabolic diffusion-type PDEs to design GNNs, using graph gradient and divergence operators as the spatial differential operator, a transformer-type attention as a learnable diffusivity function (a 1-neighborhood coupling in our terminology), and a variety of time-stepping schemes to discretize the temporal dimension in this framework. Chamberlain et al. (2021a) applied a non-Euclidean diffusion equation ("Beltrami flow") to a joint positional-feature space, yielding a scheme with adaptive spatial derivatives ("graph rewiring"), and Topping et al. (2021) studied a discrete geometric PDE similar to Ricci flow to improve information propagation in GNNs.
We can see the contrast between the diffusion-based methods of Chamberlain et al. (2021b;a) and GraphCON in the simple case of the identity activation σ(x) = x. Then, under the further assumption that the second-order time derivative X'' is removed from (1) and α = γ = 1, we recover the graph diffusion PDEs of Chamberlain et al. (2021b). Hence, the presence of the temporal second-order derivative distinguishes this approach from diffusion-based PDEs. Eliasof et al. (2021) proposed a GNN framework arising from a mixture of parabolic (diffusion) and hyperbolic (wave) PDEs on graphs with convolutional coupling operators, which describe dissipative wave propagation. We point out that a particular instance of their model (the damped wave equation, also known as the telegrapher's equation) can be obtained as a special case of our model (1) with the identity activation function. This is not surprising, as the zero grid-size limit of oscillators on a regular grid yields a wave equation. However, given that we use a nonlinear activation function and the specific placement of the activation layer in (3), a local PDE interpretation of the general form of our underlying ODEs (1) does not appear to be feasible. Finally, the explicit use of networks of coupled, controlled oscillators to design machine learning models was proposed in the context of recurrent neural networks (RNNs) by Rusch & Mishra (2021a;b).

5. Experimental results

We present a detailed experimental evaluation of the proposed framework on a variety of graph learning tasks. We test two settings of GraphCON: GraphCON-GCN (using graph convolution as the 1-neighborhood coupling in (4)) and GraphCON-GAT (using the attentional coupling). Since in most experiments these two configurations already outperform the state-of-the-art (SOTA), we only apply GraphCON with more involved coupling functions in a few particular tasks. All code to reproduce our results can be found at https://github.com/tk-rusch/GraphCON.

5.1. Evolution of Dirichlet Energy

We start by illustrating the dynamics of the Dirichlet energy (13) of GraphCON for an undirected graph representing a 2-dimensional 10 × 10 regular grid with 4-neighbor connectivity. The node features X are randomly sampled from U([0, 1]) and then propagated through 100-layer GNNs (with random weights): GAT, GCN, and their GraphCON-stacked versions (GraphCON-GAT and GraphCON-GCN) for two different values of the damping parameter α = 0, 0.5 in (4) and with fixed γ = 1. In Fig. 2, we plot the (logarithm of the) Dirichlet energy of each layer's output with respect to the (logarithm of the) layer number. It can clearly be seen that GAT and GCN suffer from the oversmoothing problem, as the Dirichlet energy converges exponentially fast to zero, indicating that the node features become constant, while GraphCON is devoid of this behavior. This holds true even for a non-zero value of the damping parameter α, where the Dirichlet energy stabilizes after an initial decay.

Figure 2. Dirichlet energy E(X^n) of layer-wise node features X^n propagated through a GAT and GCN as well as their GraphCON-stacked versions (GraphCON-GAT and GraphCON-GCN) for two different values of α = 0, 0.5 in (4) and fixed γ = 1.
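The setup of this experiment is easy to reproduce in a few lines. The sketch below is our own (dense adjacency, random weights, arbitrary feature width; it is not the authors' script): it builds the 10 × 10 grid graph and tracks the Dirichlet energy of features propagated through plain GCN layers and through GraphCON-GCN steps with α = 0.5.

```python
import torch

def grid_adjacency(n=10):
    # 4-neighbor connectivity on an n x n grid.
    idx = lambda r, c: r * n + c
    A = torch.zeros(n * n, n * n)
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = 1
            if c + 1 < n:
                A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = 1
    return A

A = grid_adjacency()
A_hat = A + torch.eye(A.shape[0])
d = A_hat.sum(dim=1)
A_hat = A_hat / torch.sqrt(d[:, None] * d[None, :])
src, dst = A.nonzero(as_tuple=True)
energy = lambda X: float(((X[src] - X[dst]) ** 2).sum())        # Dirichlet energy (13)

Ws = [0.2 * torch.randn(8, 8) for _ in range(100)]
X_gcn = X_con = torch.rand(A.shape[0], 8)
Y = torch.zeros_like(X_con)
for n, W in enumerate(Ws, 1):
    X_gcn = torch.relu(A_hat @ X_gcn @ W)                        # plain GCN layer
    Y = Y + (torch.relu(A_hat @ X_con @ W) - X_con - 0.5 * Y)    # GraphCON, dt = 1, alpha = 0.5
    X_con = X_con + Y
    if n in (1, 10, 100):
        print(n, energy(X_gcn), energy(X_con))
```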
5.2. Transductive node classification

We evaluate GraphCON on both homophilic and heterophilic datasets, where high homophily implies that the features of a node are similar to those of its neighbors. The homophily level reported in Table 1 and Table 2 is the measure proposed by Pei et al. (2020).

Homophilic datasets. We consider three widely used node classification tasks, based on the citation networks Cora (McCallum et al., 2000), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012). We follow the evaluation protocols and training, validation, and test splits of Shchur et al. (2018); Chamberlain et al. (2021b), using only the largest connected component of each network. Table 1 compares GraphCON with standard GNN baselines: GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), MoNet (Monti et al., 2017), GraphSAGE (GS) (Hamilton et al., 2017), CGNN (Xhonneux et al., 2020a), GDE (Poli et al., 2019a), and GRAND (Chamberlain et al., 2021b). We observe that GraphCON-GCN and GraphCON-GAT outperform pure GCN and GAT consistently. We also provide results for GraphCON based on the propagation layer used in GRAND, i.e., transformer-based (Vaswani et al., 2017) graph attention, referred to as GraphCON-Tran, which also outperforms the basic underlying model. Overall, GraphCON models show the best performance on all these datasets.

Table 1. Transductive node classification test accuracy (MAP in %) on homophilic datasets. Mean and standard deviation are obtained using 20 random initializations on 5 random splits each. The three best performing methods are highlighted in red (first), blue (second), and violet (third).

                     Cora          Citeseer      Pubmed
Homophily level      0.81          0.74          0.80
GAT-ppr              81.6 ± 0.3    68.5 ± 0.2    76.7 ± 0.3
MoNet                81.3 ± 1.3    71.2 ± 2.0    78.6 ± 2.3
GraphSAGE-mean       79.2 ± 7.7    71.6 ± 1.9    77.4 ± 2.2
GraphSAGE-maxpool    76.6 ± 1.9    67.5 ± 2.3    76.1 ± 2.3
CGNN                 81.4 ± 1.6    66.9 ± 1.8    66.6 ± 4.4
GDE                  78.7 ± 2.2    71.8 ± 1.1    73.9 ± 3.7
GCN                  81.5 ± 1.3    71.9 ± 1.9    77.8 ± 2.9
GraphCON-GCN         81.9 ± 1.7    72.9 ± 2.1    78.8 ± 2.6
GAT                  81.8 ± 1.3    71.4 ± 1.9    78.7 ± 2.3
GraphCON-GAT         83.2 ± 1.4    73.2 ± 1.8    79.5 ± 1.8
GRAND                83.6 ± 1.0    73.4 ± 0.5    78.8 ± 1.7
GraphCON-Tran        84.2 ± 1.3    74.2 ± 1.7    79.4 ± 1.3

Heterophilic datasets. We also evaluate GraphCON on the heterophilic graphs Cornell, Texas and Wisconsin from the WebKB dataset¹. Here, the assumption on neighbor feature similarity does not hold. Many GNN models were shown to struggle in this setting, as can be seen by the poor performance of the baseline GCN and GAT in Table 2. On the other hand, we see from Table 2 that not only do GraphCON-GCN and GraphCON-GAT dramatically outperform the underlying GCN and GAT models (e.g. for the most heterophilic Texas graph, GraphCON-GCN and GraphCON-GAT have mean accuracies of 85.4% and 82.2%, compared to accuracies of 55.1% and 52.2% for GCN and GAT), but the GraphCON models also provide the best performance overall, outperforming recent baselines that are specifically designed for heterophilic graphs.

5.3. Inductive node classification

In this experiment, we consider the Protein-Protein Interaction (PPI) dataset of Zitnik & Leskovec (2017), using the protocol of Hamilton et al. (2017). Table 3 shows the test performance (micro-averaged F1) of GraphCON and several standard GNN baselines. We can see that GraphCON significantly improves the performance of the underlying models (GAT from 97.4% to 99.4% and GCN from 98.5% to 99.6%, which is the top result on this benchmark).

5.4. Molecular graph property regression
We reproduce the benchmark proposed in Dwivedi et al. (2020), regressing the constrained solubility of 12K molecular graphs from the ZINC dataset (Irwin et al., 2012). We follow verbatim the settings of Dwivedi et al. (2020); Beani et al. (2021): we make no use of edge features and constrain the network sizes to 100K parameters. Table 4 summarizes the performance of GraphCON and standard GNN baselines. Both GraphCON-GAT and GraphCON-GCN outperform GAT and GCN, respectively, by a factor of 2. Moreover, the performance of GraphCON-GCN is on par with the recent state-of-the-art method DGN (Beani et al., 2021), with a significantly lower standard deviation.

Given these results, it is instructive to ask why GraphCON models outperform their underlying base GNN models such as GCN. A part of the answer can be seen from SM Table 6, where the MAE of GCN and GraphCON-GCN on this task is shown for an increasing number of layers. We observe from this table that while the MAE of GCN increases with the number of layers, the MAE of GraphCON-GCN decreases monotonically with increasing layers, allowing for the use of very deep GraphCON models with increased expressive power.

¹ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo11/www/wwkb/

Table 2. Transductive node classification test accuracy (MAP in %) on heterophilic datasets. All results represent the average performance of the respective model over 10 fixed train/val/test splits, which are taken from Pei et al. (2020).

                  Texas         Wisconsin     Cornell
Homophily level   0.11          0.21          0.30
GPRGNN            78.4 ± 4.4    82.9 ± 4.2    80.3 ± 8.1
H2GCN             84.9 ± 7.2    87.7 ± 5.0    82.7 ± 5.3
GCNII             77.6 ± 3.8    80.4 ± 3.4    77.9 ± 3.8
Geom-GCN          66.8 ± 2.7    64.5 ± 3.7    60.5 ± 3.7
PairNorm          60.3 ± 4.3    48.4 ± 6.1    58.9 ± 3.2
GraphSAGE         82.4 ± 6.1    81.2 ± 5.6    76.0 ± 5.0
MLP               80.8 ± 4.8    85.3 ± 3.3    81.9 ± 6.4
GAT               52.2 ± 6.6    49.4 ± 4.1    61.9 ± 5.1
GraphCON-GAT      82.2 ± 4.7    85.7 ± 3.6    83.2 ± 7.0
GCN               55.1 ± 5.2    51.8 ± 3.1    60.5 ± 5.3
GraphCON-GCN      85.4 ± 4.2    87.8 ± 3.3    84.3 ± 4.8

Table 3. Test micro-averaged F1 score on the Protein-Protein Interaction (PPI) dataset.

Model                               Micro-averaged F1
VR-GCN (Chen et al., 2017)          97.8
GraphSAGE (Hamilton et al., 2017)   61.2
PDE-GCN (Eliasof et al., 2021)      99.2
GCNII (Chen et al., 2020)           99.5
Cluster-GCN (Chiang et al., 2019)   99.4
GeniePath (Liu et al., 2019)        98.5
JKNet (Xu et al., 2018b)            97.6
GAT (Velickovic et al., 2018)       97.3
GraphCON-GAT                        99.4
GCN (Kipf & Welling, 2017)          98.5
GraphCON-GCN                        99.6

Table 4. Test mean absolute error (MAE, averaged over 4 runs with different initializations) on ZINC (without edge features, small 12k version), restricted to small network sizes of 100k parameters. Baseline results are taken from Beani et al. (2021).

Model                                 Test MAE
GIN (Xu et al., 2018a)                0.41 ± 0.008
GatedGCN (Bresson & Laurent, 2017)    0.42 ± 0.006
GraphSAGE (Hamilton et al., 2017)     0.41 ± 0.005
MoNet (Monti et al., 2017)            0.41 ± 0.007
PNA (Corso et al., 2020)              0.32 ± 0.032
DGN (Beani et al., 2021)              0.22 ± 0.010
GCN (Kipf & Welling, 2017)            0.47 ± 0.002
GraphCON-GCN                          0.22 ± 0.004
GAT (Velickovic et al., 2018)         0.46 ± 0.002
GraphCON-GAT                          0.23 ± 0.004

5.5. MNIST Superpixel graph classification

This experiment, first suggested by Monti et al. (2017), is based on the MNIST dataset (LeCun et al., 1998), where the grey-scale images are transformed into irregular graphs as follows: the vertices of the graphs represent superpixels (large blobs of similar color), while the edges represent their spatial adjacency. Each graph has a fixed number of 75 superpixels (vertices). We use the standard 55K-5K-10K splitting for training, validation, and testing.
Table 5 shows that GraphCON-GCN dramatically improves the performance of a pure GCN (test accuracy of 88.89% vs 98.70%). We stress that both models share the parameters over all layers, i.e. GraphCON-GCN does not have more parameters despite being a deeper model. Thus, the better performance of GraphCON-GCN over GCN can be attributed to the use of more layers (iterations) and not to a higher number of parameters (see SM Table 7 for accuracy vs. number of layers for this test case). Finally, Table 5 also shows that GraphCON-GAT outperforms all other methods, including the recently proposed PNCNN (Finzi et al., 2021), reaching a nearly-perfect test accuracy of 98.91%.

Table 5. Test accuracy in % on MNIST Superpixel 75.

Model                                Test accuracy
ChebNet (Defferrard et al., 2016)    75.62
MoNet (Monti et al., 2017)           91.11
PNCNN (Finzi et al., 2021)           98.76
SplineCNN (Fey et al., 2018)         95.22
GIN (Xu et al., 2018a)               97.23
GraphCON-GIN                         98.53
GatedGCN (Bresson & Laurent, 2017)   97.95
GraphCON-GatedGCN                    98.27
GCN (Kipf & Welling, 2017)           88.89
GraphCON-GCN                         98.68
GAT (Velickovic et al., 2018)        96.19
GraphCON-GAT                         98.91

6. Conclusions

In conclusion, we proposed a novel framework for designing deep Graph Neural Networks, called GraphCON, based on suitable time discretizations of the ODEs (1) that model the dynamics of a network of controlled and damped oscillators. The coupling between the nodes is conditioned on the structure of the underlying graph.

One can readily interpret GraphCON as a framework to propagate information through multiple layers of a deep GNN, where each hidden layer has the same structure as standard GNNs such as GAT, GCN, etc. Unlike in canonical constructions of deep GNNs, which stack hidden layers in a straightforward iterative fashion (7), GraphCON stacks them in a more involved manner using the dynamics of the ODE (3). Hence, in principle, any GNN hidden layer can serve as the coupling function F_θ in GraphCON (4), making it an attractive framework for constructing very deep GNNs.

The well-known oversmoothing problem for GNNs was described mathematically in terms of the stability of zero-Dirichlet-energy steady states of the underlying ODE (3). We showed that such zero-Dirichlet-energy steady states of (3), which lead to constant node features, are not (exponentially) stable. Even if a trajectory reaches a feature vector that is constant across all nodes, very small perturbations will nudge it away and the resulting node features will deviate from each other. Thus, we demonstrated that, by construction, the oversmoothing problem, in the sense of Definition 3.2, is mitigated for GraphCON.

In addition to increasing expressivity by mitigating the oversmoothing problem, GraphCON was rigorously shown to mitigate the exploding and vanishing gradients problem. Consequently, using coupled oscillators also facilitates efficient training of the resulting GNNs.

Finally, we extensively tested GraphCON on a variety of node- and graph-classification and regression tasks, including heterophilic datasets known to be challenging for standard GNN models. From these experiments, we observed that (i) GraphCON models significantly outperform the underlying base GNNs such as GCN or GAT and (ii) GraphCON models are either on par with or outperform state-of-the-art models on these tasks. This shows that ours is a novel, flexible, easy-to-use framework for constructing deep GNNs with theoretical guarantees and solid empirical performance.

References

Alon, U. and Yahav, E.
On the bottleneck of graph neural networks and its practical implications. In ICML, 2021. Avelar, P. H. C., Tavares, A. R., , Gori, M., and Lamb, L. C. Discrete and continuous deep residual learning over graphs. ar Xiv preprint, 2019. Beani, D., Passaro, S., L etourneau, V., Hamilton, W., Corso, G., and Li o, P. Directional graph networks. In ICML. PMLR, 2021. Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373 1396, 2003. Bresson, X. and Laurent, T. Residual gated graph convnets. ar Xiv:1711.07553, 2017. Bronstein, M. M., Bruna, J., Cohen, T., and Veliˇckovi c, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv:2104.13478, 2021. Bruna, J., Zaremba, W., Szlam, A., and Le Cun, Y. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, 2014. Chakrabarti, S. Dynamic personalized pagerank in entityrelation graphs. In WWW, 2007. Chamberlain, B., Rowbottom, J., Eynard, D., Di Giovanni, F., Dong, X., and Bronstein, M. Beltrami flow and neural diffusion on graphs. In Neur IPS, 2021a. Chamberlain, B., Rowbottom, J., Gorinova, M. I., Bronstein, M. M., Webb, S., and Rossi, E. GRAND: graph neural diffusion. In Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139 of Proceedings of Machine Learning Research, pp. 1407 1418. PMLR, 2021b. Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. ar Xiv:1710.10568, 2017. Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks. In ICML. PMLR, 2020. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Neur IPS, 2018. Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In KDD, 2019. Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006. Corso, G., Cavalleri, L., Beaini, D., Li o, P., and Veliˇckovi c, P. Principal neighbourhood aggregation for graph nets. ar Xiv:2004.05718, 2020. Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29:3844 3852, 2016. Derrow-Pinion, A., She, J., Wong, D., Lange, O., Hester, T., Perez, L., Nunkesser, M., Lee, S., Guo, X., Battaglia, P. W., Gupta, V., Li, A., Xu, Z., Sanchez-Gonzalez, A., Li, Y., and Veliˇckovi c, P. Traffic Prediction with Graph Neural Networks in Google Maps. 2021. Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. ar Xiv:2003.00982, 2020. Eliasof, M., Haber, E., and Treister, E. Pde-gcn: Novel architectures for graph neural networks motivated by partial differential equations. In Neur IPS, 2021. Fey, M., Lenssen, J. E., Weichert, F., and M uller, H. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In CVPR, 2018. Finzi, M. A., Bondesan, R., and Welling, M. Probabilistic numeric convolutional neural networks. In 9th International Conference on Learning Representations, ICLR, 2021. Frasconi, P., Gori, M., and Sperduti, A. A general framework for adaptive processing of data structures. IEEE Trans. Neural Networks, 9(5):768 786, 1998. Gaudelet, T., Day, B., Jamasb, A. 
R., Soman, J., Regep, C., Liu, G., Hayter, J. B., Vickers, R., Roberts, C., Tang, J., et al. Utilizing graph machine learning within drug discovery and development. Briefings in Bioinformatics, 22(6), 2021. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, 2017. Goller, C. and Kuchler, A. Learning task-dependent distributed representations by backpropagation through structure. In ICNN, 1996. Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in graph domains. In IJCNN, 2005. Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34, 2018. Graph-Coupled Oscillator Networks Hairer, E., Norsett, S. P., and Wanner, G. Solving ordinary differential equations I. Springer, 1987. Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In Neur IPS, 2017. Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757 1768, 2012. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proc. IEEE, 86(11):2278 2324, 1998. Liu, Z., Chen, C., Li, L., Zhou, J., Li, X., Song, L., and Qi, Y. Geniepath: Graph neural networks with adaptive receptive paths. In AAAI, 2019. Mc Callum, A. K., Nigam, K., Rennie, J., and Seymore, K. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127 163, 2000. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, 2017. Namata, G., London, B., Getoor, L., Huang, B., and EDU, U. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, pp. 1, 2012. Nt, H. and Maehara, T. Revisiting graph neural networks: all we have is low pass filters. ar Xiv:1812.08434v4, 2019. Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020. Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, 1999. Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of ICML 13, pp. III 1310 III 1318. JMLR.org, 2013. Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-gcn: Geometric graph convolutional networks. ar Xiv:2002.05287, 2020. Poli, M., Massaroli, S., Park, J., Yamashita, A., Asama, H., and Park, J. Graph neural ordinary differential equations. ar Xiv:1911.07532, 2019a. Poli, M., Massaroli, S., Park, J., Yamashita, A., Asama, H., and Park, J. Graph neural ordinary differential equations. pp. 6571 6583, 2019b. Rong, Y., Huang, W., Xu, T., and Huang, J. Towards deep graph convolutional networks on node classification. In ICLR, 2020. Rusch, T. K. and Mishra, S. Coupled oscillatory recurrent neural network (cornn): An accurate and (gradient) stable architecture for learning long time dependencies. In ICLR, 2021a. Rusch, T. K. and Mishra, S. Unicornn: A recurrent model for learning very long time dependencies. In ICML, 2021b. 
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61 80, 2008. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93 93, 2008. Shchur, O., Mumme, M., Bojchevski, A., and G unnemann, S. Pitfalls of graph neural network evaluation. ar Xiv:1811.05868, 2018. Shlomi, J., Battaglia, P., and Vlimant, J.-R. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, 2020. Sperduti, A. Encoding labeled graphs by labeling RAAM. In NIPS, 1994. Sperduti, A. and Starita, A. Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks, 8(3):714 735, 1997. Stiefel, K. M. and Ermentrout, G. B. Neurons as oscillators. Journal of Neurophysiology, 116:2950 2960, 2016. Strogatz, S. Nonlinear Dynamics and Chaos. Westview, Boulder CO, 2015. Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., and Bronstein, M. M. Understanding over-squashing and bottlenecks on graphs via curvature. ar Xiv:2111.14522, 2021. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Neur IPS, 2017. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. In 6th International Conference on Learning Representations, ICLR, 2018. Graph-Coupled Oscillator Networks Wiggins, S. Introduction to nonlinear dynamical systems and chaos. Springer, 2003. Xhonneux, L.-P., Qu, M., and Tang, J. Continuous graph neural networks. In ICML. PMLR, 2020a. Xhonneux, L.-p. A. C., Qu, M., and Tang, J. Continuous graph neural networks. In ICML, 2020b. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ar Xiv:1810.00826, 2018a. Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML. PMLR, 2018b. Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., , Li, C., and Sun, M. Graph neural networks: a review of methods and applications. ar Xiv:1812.08434v4, 2019. Zhuang, J., Dvornek, N., Li, X., and Duncan, J. S. Ordinary differential equations on graph networks. Technical Report, 2020. Zitnik, M. and Leskovec, J. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33 (14):i190 i198, 2017. Supplementary Material for Graph-Coupled Oscillator Networks Supplementary Material for: Graph-Coupled Oscillator Networks A. Further experimental results A.1. Performance of Graph CON with respect to number of layers As we have argued in the main text, Graph CON is designed to be a deep GNN architecture with many layers. Depth could enhance the expressive power of Graph CON and we investigate this issue in three of the datasets, presented in Section 5 of the main text. In the first two experiments, we will focus on the Graph CON-GCN model and compare and contrast its performance, with respect to increasing depth, with the baseline GCN model. We start with the molecular graph property regression example for the ZINC dataset of Irwin et al. (2012). 
In Table 6, we present the mean absolute error (MAE) of the models on the test set with respect to an increasing number of layers (up to 20 layers) of the respective GNNs. As observed from this table, the MAE with standard GCN increases with depth. On the other hand, the MAE with GraphCON decreases as more layers are added.

Table 6. Test mean absolute errors of GraphCON-GCN as well as its baseline model GCN on the ZINC task for different numbers of layers N = 5, 10, 15, 20.

Model           N = 5    N = 10   N = 15   N = 20
GraphCON-GCN    0.241    0.233    0.228    0.214
GCN             0.442    0.463    0.478    0.489

Next, we consider the MNIST Superpixel graph classification task and present the test accuracy with increasing depth (number of layers) for both GCN and GraphCON-GCN. As in the previous example, we observe that increasing depth leads to worsening of the test accuracy for GCN. On the other hand, the test accuracy for GraphCON-GCN increases as more layers (up to 32 layers) are added to the model.

Table 7. Test accuracies in % of GraphCON-GCN as well as its baseline model GCN on the MNIST Superpixel 75 task for different numbers of layers N = 4, 8, 16, 32.

Model           N = 4    N = 8    N = 16   N = 32
GraphCON-GCN    97.78    98.51    98.55    98.68
GCN             88.09    87.26    86.78    85.67

Additionally, we compare the performance of GraphCON-GCN to GCN with DropEdge (Rong et al., 2020) (GCN+DropEdge), which has been specifically designed to mitigate the oversmoothing phenomenon for deeper GNN models. We consider the Cora node classification task in the semi-supervised setting, where we compare GraphCON-GCN to GCN+DropEdge for an increasing number of layers N = 2, 4, 8, 16, 32, 64. We observe in Table 8 that GraphCON improves (or retains) performance for a large increase in the number of layers, in contrast to plain GCN+DropEdge on this task. Thus, all three experiments demonstrate that GraphCON leverages more depth to improve performance.

Table 8. Test accuracies in % of GraphCON-GCN as well as of GCN+DropEdge on Cora (semi-supervised setting) for different numbers of layers N = 2, 4, 8, 16, 32, 64. The GCN+DropEdge results are taken from https://github.com/DropEdge/DropEdge.

Model           N = 2    N = 4    N = 8    N = 16   N = 32   N = 64
GraphCON-GCN    82.20    82.78    83.53    84.85    82.95    82.12
GCN+DropEdge    82.80    82.00    75.80    75.70    62.50    49.50

A.2. Sensitivity of performance of GraphCON to hyperparameters α and γ

We recall that GraphCON, (4) of the main text, has two additional hyperparameters, namely the damping parameter α ≥ 0 and the frequency control parameter γ > 0. In Table 9, we present the values of α, γ that led to the best performance of the resulting GraphCON models. It is natural to ask how sensitive the performance of GraphCON is to the variation of these hyperparameters. To this end, we choose the MNIST Superpixel graph classification task and perform a sensitivity study of the GraphCON-GCN model with respect to these hyperparameters. First, we fix the value γ = 0.76 (corresponding to the best results in Table 9) and vary α in the range α ∈ [0, 2]. The results are plotted in Fig. 3 and show that the accuracy is extremely robust over a very large parameter range in α. Only for large values α > 1.6 do we see that the accuracy deteriorates, when the damping is too high. Next, for this model and task, we fix α = 1 (which provides the best performance as reported in Table 9) and vary γ ∈ [0, 2]. Again, for a large range of values corresponding to γ ∈ [0.2, 2], the accuracy is very robust. However, for very small values of γ, the accuracy falls significantly.
This is to be expected, as the model loses its interpretation as a system of oscillators for γ → 0. Thus, these sensitivity results demonstrate that GraphCON performs very robustly with respect to variations of the parameters α, γ within a reasonable range.

Figure 3. Sensitivity (measured as test accuracy) plot for the α and γ hyperparameters of GraphCON-GCN (with 32 layers) trained on the MNIST Superpixel 75 experiment. First, α = 1.0 is fixed and γ is varied in [0, 2]. Second, γ = 0.76 is fixed and α is varied in [0, 2]. The fixed α, γ are taken from the best performing GraphCON-GCN on the MNIST Superpixel 75 task (Table 9).

B. Training details

All experiments were run on NVIDIA GeForce GTX 1080 Ti, RTX 2080 Ti as well as RTX 2080 Ti GPUs. The tuning of the hyperparameters was done using a standard random search algorithm. We fix the time-step Δt in (4) to 1 in all experiments. The damping parameter α as well as the frequency control parameter γ are set to 1 for all Cora, Citeseer and Pubmed experiments, while we set them to 0 for all experiments based on the Texas, Cornell and Wisconsin network graphs. For all other experiments, we include α and γ in the hyperparameter search space. The tuned values can be found in Table 9.

Table 9. Hyperparameters α and γ of GraphCON (4) for each best performing GraphCON model (based on a validation set).

Model          Experiment           α       γ
GraphCON-GCN   PPI                  0.242   1.0
GraphCON-GAT   PPI                  0.785   1.0
GraphCON-GCN   ZINC                 0.215   1.115
GraphCON-GAT   ZINC                 1.475   1.324
GraphCON-GCN   MNIST (superpixel)   1.0     0.76
GraphCON-GAT   MNIST (superpixel)   0.76    0.105

C. Mathematical details for Section 3 of the main text

In this section, we provide details for the mathematical results in Section 3 of the main text. We start with:

C.1. Proof of Proposition 3.1

Proof. We multiply the second equation of (8) (with σ = id and α = 0) by Y_i^T and obtain, using Σ_{j∈N_i} A_ij = 1,

(1/2) d/dt |Y_i|² = Σ_{j∈N_i} A_ij Y_i^T (X_j − X_i).

Summing over i ∈ V and using the symmetry condition A_ij = A_ji in the above expression yields

(1/2) d/dt Σ_{i∈V} |Y_i|² = −(1/2) Σ_{i∈V} Σ_{j∈N_i} A_ij (Y_j − Y_i)^T (X_j − X_i),

and hence

d/dt [ Σ_{i∈V} |Y_i|² + (1/2) Σ_{i∈V} Σ_{j∈N_i} A_ij |X_j − X_i|² ] = 0.

Integrating the last line in the above expression over the time interval [0, t] yields the desired identity (11).

C.2. Proof of Proposition 3.3

Proof. By the definition of the Dirichlet energy (13), (15) implies that

lim_{t→∞} X_i(t) = c,   for all i ∈ V,   (23)

for some c ∈ R^m. In other words, all the hidden node features converge to the same feature vector c as time increases. Moreover, by (15), this convergence is exponentially fast. Plugging (23) into the first equation of the ODE (3), we obtain that

lim_{t→∞} Y_i(t) = 0,   for all i ∈ V,   (24)

with 0 being the m-vector with zeroes for all its entries. Thus, oversmoothing in the sense of Definition 3.2 amounts to (c, 0) being an exponentially stable fixed point (steady state) for the dynamics of (8). On the other hand, if (c, 0) is an exponentially stable steady state of (8), then the trajectories converge to this state exponentially fast, satisfying (15). Consequently, by the definition of the Dirichlet energy (13), we readily observe that the oversmoothing problem, in the sense of Definition 3.2, occurs in this case.

C.3. Proof of Proposition 3.4

The main aim of this section is to show that steady states of (8) of the form (c, 0) are not exponentially stable. To this end, we fix c and start by considering small perturbations around the fixed point (c, 0).
C.3. Proof of Proposition 3.4

The main aim of this section is to show that steady states of (8) of the form (c, 0) are not exponentially stable. To this end, we fix c and start by considering small perturbations around the fixed point (c, 0). We define X̂_i = X_i − c and Ŷ_i = Y_i, and evolve these perturbations by the linearized ODE

X̂_i' = Ŷ_i,    Ŷ_i' = σ'(c) Σ_{j∈N_i} Â_ij (X̂_j − X̂_i) − α Ŷ_i. (25)

As σ(x) = max(x, 0) and c ≥ 0 (entrywise), we have σ'(c) = Id, and the linearized system (25) reduces to

X̂_i' = Ŷ_i,    Ŷ_i' = Σ_{j∈N_i} Â_ij (X̂_j − X̂_i) − α Ŷ_i, (26)

with

Â_ij = A_ij(c, c), for j ∈ N_i, i ∈ V,    0 ≤ Â_ij ≤ 1,    Σ_{j∈N_i} Â_ij = 1. (27)
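Before analysing (26) through the energy identity of Proposition C.1 below, its behaviour can also be probed numerically. The following toy sketch integrates (26) on a small graph with a row-stochastic coupling of the form Â_ij = 1/deg(i) (cf. (36) below); the particular graph, the value of α and the initial perturbation are arbitrary illustrative choices, not taken from the paper's experiments.

```python
# Toy numerical sketch of the linearised perturbation system (26):
#   X'_i = Y_i,   Y'_i = sum_{j in N_i} A_ij (X_j - X_i) - alpha * Y_i,
# with the row-stochastic coupling A_ij = 1/deg(i).
import numpy as np
from scipy.integrate import solve_ivp

v = 6
adj = np.zeros((v, v))
for i in range(v):                      # cycle edges
    adj[i, (i + 1) % v] = adj[(i + 1) % v, i] = 1.0
adj[0, 3] = adj[3, 0] = 1.0             # one chord, so that deg is non-constant
A = adj / adj.sum(axis=1, keepdims=True)   # A_ij = 1/deg(i): row-stochastic, asymmetric

alpha = 0.25
rng = np.random.default_rng(0)
x0 = 1e-3 * (rng.standard_normal(v) + 1.0)   # small perturbation X-hat(0) of c
z0 = np.concatenate([x0, np.zeros(v)])       # Y-hat(0) = 0


def rhs(_, z):
    x, y = z[:v], z[v:]
    return np.concatenate([y, A @ x - x - alpha * y])   # uses sum_j A_ij = 1


sol = solve_ivp(rhs, (0.0, 60.0), z0, t_eval=np.linspace(0.0, 60.0, 7), rtol=1e-8)
for t, z in zip(sol.t, sol.y.T):
    print(f"t = {t:5.1f}   ||X-hat(t)|| = {np.linalg.norm(z[:v]):.2e}")
# The perturbation does not decay to zero: the trajectory settles at a nearby
# constant state instead of returning to (c, 0), consistent with the claim that
# such zero-Dirichlet-energy steady states are not exponentially stable.
```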
We have the following proposition on the dynamics of the linearized system (26) with respect to perturbations of the fixed point (c, 0).

Proposition C.1. Perturbations (X̂(t), Ŷ(t)) of the fixed point (c, 0), which evolve according to (26), satisfy the following identity:

Σ_{i∈V} |Ŷ_i(t)|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j(t) − X̂_i(t)|² = T1(t) + T2(t) + T3(t), (28)

with

T1(t) = [ Σ_{i∈V} |Ŷ_i(0)|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j(0) − X̂_i(0)|² ] e^{−2αt},

T2(t) = (α/2) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) ∫_0^t |X̂_j(s) − X̂_i(s)|² e^{2α(s−t)} ds,

T3(t) = (1/2) Σ_{i∈V} Σ_{j∈N_i} (Â_ij − Â_ji) ∫_0^t (Ŷ_i(s) + Ŷ_j(s))^⊤ (X̂_j(s) − X̂_i(s)) e^{2α(s−t)} ds.

Proof. Multiplying the second equation in (26) by Ŷ_i^⊤ and using the fact that Σ_{j∈N_i} Â_ij = 1, we obtain

(1/2) d/dt |Ŷ_i|² + α |Ŷ_i|² = Σ_{j∈N_i} Â_ij Ŷ_i^⊤ (X̂_j − X̂_i)
    = (1/2) Σ_{j∈N_i} Â_ij (Ŷ_i + Ŷ_j)^⊤ (X̂_j − X̂_i) − (1/4) d/dt Σ_{j∈N_i} Â_ij |X̂_j − X̂_i|², (29)

where we have used the first equation of (26) in the last line of (29). Consequently, we have for all i ∈ V,

(1/2) d/dt |Ŷ_i|² + α |Ŷ_i|² + (1/4) d/dt Σ_{j∈N_i} Â_ij |X̂_j − X̂_i|² = (1/2) Σ_{j∈N_i} Â_ij (Ŷ_i + Ŷ_j)^⊤ (X̂_j − X̂_i). (30)

Summing (30) over all nodes i ∈ V yields

(1/2) d/dt [ Σ_{i∈V} |Ŷ_i|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j − X̂_i|² ] + α Σ_{i∈V} |Ŷ_i|²
    = (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij − Â_ji) (Ŷ_i + Ŷ_j)^⊤ (X̂_j − X̂_i). (31)

Multiplying both sides of (31) by e^{2αt} and using the chain rule, we readily obtain

(1/2) d/dt [ e^{2αt} ( Σ_{i∈V} |Ŷ_i|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j − X̂_i|² ) ]
    = (α/4) e^{2αt} Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j − X̂_i|² + (1/4) e^{2αt} Σ_{i∈V} Σ_{j∈N_i} (Â_ij − Â_ji) (Ŷ_i + Ŷ_j)^⊤ (X̂_j − X̂_i). (32)

Integrating (32) over the time interval [0, t] yields

(1/2) e^{2αt} ( Σ_{i∈V} |Ŷ_i(t)|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j(t) − X̂_i(t)|² ) − (1/2) ( Σ_{i∈V} |Ŷ_i(0)|² + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) |X̂_j(0) − X̂_i(0)|² )
    = (α/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij + Â_ji) ∫_0^t |X̂_j(s) − X̂_i(s)|² e^{2αs} ds + (1/4) Σ_{i∈V} Σ_{j∈N_i} (Â_ij − Â_ji) ∫_0^t (Ŷ_i(s) + Ŷ_j(s))^⊤ (X̂_j(s) − X̂_i(s)) e^{2αs} ds. (33)

We readily obtain the desired identity (28) from (33) after multiplying by 2 e^{−2αt}. □

Next, we observe that the right-hand side of the nonlinear ODE (8) is globally Lipschitz. Therefore, solutions exist for all time t > 0, are unique, and depend continuously on the data. We assume that the initial perturbations around the steady state (c, 0) are small, i.e., they satisfy

|X̂_i(0) − X̂_j(0)| ≤ ϵ, for all j ∈ N_i, i ∈ V,    |Ŷ_i(0)| ≤ ϵ, for all i ∈ V,

for some 0 < ϵ << 1. Hence, there exists a small time τ > 0 such that the time evolution of these perturbations can be approximated to arbitrary accuracy by solutions of the linearized system (26).

Next, we see from the identity (28) that the evolution of the perturbations (X̂, Ŷ) away from the fixed point (c, 0) under the linearized system (26) is governed by the three terms T1, T2 and T3. The term T1 is clearly a dissipative term and states that the initial perturbations are damped exponentially fast in time. On the other hand, the term T2, which has a positive sign, is a production term and states that the initial perturbations will grow with time t. By continuous dependence of the dynamics of the ODE (26) on its initial data, there exists a time, still denoted by τ (possibly smaller than the τ encountered before), such that

|X̂_i(t) − X̂_j(t)| = O(ϵ), for all j ∈ N_i, i ∈ V, t ∈ [0, τ],    |Ŷ_i(t)| = O(ϵ), for all i ∈ V, t ∈ [0, τ]. (34)

Plugging the above expression into the term T2 in (28) and using the right-stochasticity of the matrix Â, we obtain that

T2(t) = O(ϵ²) (1 − e^{−2αt}), for t ≤ τ. (35)

Thus, the leading term in T2 grows algebraically with respect to the initial perturbations.

Next, we turn our attention to the term T3 in (28). This term is proportional to the asymmetry of the graph-coupling matrix Â = A(c, c). If this matrix were symmetric, then T3 would vanish. On the other hand, for many 1-neighborhood couplings considered in this article, the matrix Â is not symmetric. In fact, one can explicitly compute that for the GAT and transformer attention couplings as well as for the GCN coupling, we have

Â_ij = 1/deg(i), for all j ∈ N_i, i ∈ V. (36)

Here, deg(i) refers to the degree of node i, with possibly inserted self-loops. As the ordering of the nodes of the graph G is arbitrary, we can order them in such a manner that Â_ij ≥ Â_ji. Even with such an ordering, as long as the matrix Â is not symmetric, the term T3 is of indefinite sign. If it is positive, then we have additional growth in time in (28). On the other hand, if T3 is negative, it has a dissipative effect. The rate of this dissipation can be readily estimated for short times t ≤ τ under the assumption (34) as

|T3(t)| ≤ (D̄/D̲ − 1) (1 − e^{−2αt}) O(ϵ²), (37)

where we define

D̄ = max_{i∈V} deg(i),    D̲ = min_{i∈V} deg(i). (38)

Thus, by combining (35) with (37), we obtain

T2(t) + T3(t) ≥ [ 1 − (D̄/D̲ − 1) ] (1 − e^{−2αt}) O(ϵ²). (39)

In particular, for α ≥ 1/2, we see from (39) that the overall balance (28) leads to an algebraic growth, rather than exponential decay, of the initial perturbations of the fixed point (c, 0). Thus, we have shown that this steady state is not exponentially stable and that small perturbations will take the trajectories of the ODE (26) away from this fixed point, completing the proof of Proposition 3.4. □

Remark C.2. We see from the above proof that the condition α ≥ 1/2 is only a sufficient condition for the proof of Proposition 3.4; it can readily be relaxed.

C.4. Proofs of Propositions 3.5 and 3.6

As a first step towards the gradient bounds of Proposition 3.5, we prove an upper bound on the hidden node features of the general form of GraphCON (4), written node-wise as

C^{n−1}_i = (F_θ(X^{n−1}))_i,
Y^n_i = Y^{n−1}_i + Δt σ(C^{n−1}_i) − γ Δt X^{n−1}_i − α Δt Y^{n−1}_i, (40)
X^n_i = X^{n−1}_i + Δt Y^n_i.

We derive the following upper bound on the resulting hidden node features.

Proposition C.3. For all n, let t_n = n Δt and let the time step Δt satisfy γ Δt < α. Let X^n_i denote the hidden state vector at any node i ∈ V, evolving according to GraphCON (40). Then the hidden states satisfy the following bound:

|X^n_i|² ≤ |X^0_i|² + (1/γ) |Y^0_i|² + m β² t_n / (2γ(α − γΔt)), (41)

where β is the global bound (21) on the underlying activation function σ.

Proof. We multiply the third equation of (40) by γ (X^{n−1}_i)^⊤ and the second equation of (40) by (Y^n_i)^⊤, and repeatedly use the elementary identities

a^⊤(a − b) = |a|²/2 − |b|²/2 + |a − b|²/2,    b^⊤(a − b) = |a|²/2 − |b|²/2 − |a − b|²/2.

Summing the two resulting identities, absorbing the term |Y^n_i − Y^{n−1}_i|², and using the assumption γΔt < α on the time step, we obtain

γ |X^n_i|² + |Y^n_i|² ≤ γ |X^{n−1}_i|² + |Y^{n−1}_i|² + 2Δt (Y^n_i)^⊤ σ(C^{n−1}_i) − 2(α − γΔt) Δt |Y^n_i|².

Next, we use the elementary inequality a^⊤ b ≤ ε|a|² + |b|²/(4ε), with ε = α − γΔt, on the term (Y^n_i)^⊤ σ(C^{n−1}_i) to obtain

γ |X^n_i|² + |Y^n_i|² ≤ γ |X^{n−1}_i|² + |Y^{n−1}_i|² + (Δt / (2(α − γΔt))) |σ(C^{n−1}_i)|². (42)

Now, from the bound (21) on the activation function, we have |σ(C^{n−1}_i)|² ≤ m β², and hence (42) yields

γ |X^n_i|² + |Y^n_i|² ≤ γ |X^{n−1}_i|² + |Y^{n−1}_i|² + m β² Δt / (2(α − γΔt)). (43)

Iterating (43) over n yields

γ |X^n_i|² + |Y^n_i|² ≤ γ |X^0_i|² + |Y^0_i|² + m β² t_n / (2(α − γΔt)), (44)

which readily yields the desired inequality (41). □
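The role of the boundedness assumption on σ in Proposition C.3 can be illustrated numerically. The sketch below iterates the explicit scheme (40) with a bounded activation (tanh, so β = 1) and a generic averaging coupling standing in for F_θ; the graph, widths and parameter values are illustrative assumptions, chosen so that γΔt < α.

```python
# Quick numerical sanity sketch for the message of Proposition C.3: with a bounded
# activation and gamma*dt < alpha, iterating the explicit scheme (40) does not blow up.
import numpy as np

rng = np.random.default_rng(1)
v, m, N = 20, 4, 200                 # nodes, feature width, number of layers/steps
dt, alpha, gamma = 1.0, 1.0, 0.5     # gamma * dt < alpha, as assumed in Proposition C.3
adj = (rng.random((v, v)) < 0.2).astype(float)
adj = np.maximum(adj, adj.T)
np.fill_diagonal(adj, 1.0)           # self-loops, so every row has positive degree
A_hat = adj / adj.sum(1, keepdims=True)      # simple averaging coupling (stand-in for F_theta)
W = rng.standard_normal((m, m)) / np.sqrt(m)

X = rng.standard_normal((v, m))
Y = np.zeros((v, m))
for n in range(1, N + 1):
    C = A_hat @ X @ W
    Y = Y + dt * (np.tanh(C) - gamma * X - alpha * Y)
    X = X + dt * Y
    if n % 50 == 0:
        print(f"n = {n:3d}   max_i |X^n_i| = {np.abs(X).max():.3f}")
# The printed values remain bounded (they saturate rather than grow), consistent with
# (41), which permits at most sqrt(t_n)-type growth of the hidden states.
```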
C.4.1. Proof of Proposition 3.5

Proof. For any ℓ ≤ n ≤ N, a tedious yet straightforward computation yields the following representation formula:

∂Z^n/∂Z^{n−1} = I_{2v×2v} + Δt E^{n,n−1} + Δt² F^{n,n−1}. (45)

Here, E^{n,n−1} ∈ R^{2v×2v} is the matrix whose entries are given, for any 1 ≤ i ≤ v, by

E^{n,n−1}_{2i−1,2i} = 1,    E^{n,n−1}_{2i−1,j} = 0 for j ≠ 2i,
E^{n,n−1}_{2i,2i} = −1,
E^{n,n−1}_{2i,2i−1} = −1 + σ'(C^{n−1}_i) w^n_i / d_i,
E^{n,n−1}_{2i,2j} = 0 for 1 ≤ j ≤ v, j ≠ i,
E^{n,n−1}_{2i,2j−1} = σ'(C^{n−1}_j) w^n_j / √(d_i d_j) for j ∈ N_i,    E^{n,n−1}_{2i,2j−1} = 0 for j ∉ N_i, j ≠ i.

Similarly, F^{n,n−1} ∈ R^{2v×2v} is the matrix whose entries are given, for any 1 ≤ i ≤ v, by

F^{n,n−1}_{2i,j} = 0 for all j,
F^{n,n−1}_{2i−1,2i−1} = −1 + σ'(C^{n−1}_i) w^n_i / d_i,
F^{n,n−1}_{2i−1,2j−1} = σ'(C^{n−1}_j) w^n_j / √(d_i d_j) for j ∈ N_i,    F^{n,n−1}_{2i−1,2j−1} = 0 for j ∉ N_i, j ≠ i.

Using (21), it is straightforward to compute that

∥E^{n,n−1}∥ ≤ 2 + β D̂ ∥w^n∥_1,    ∥F^{n,n−1}∥ ≤ 1 + β D̂ ∥w^n∥_1. (46)

Then, using Δt ≤ 1 and the definition (21), we have from (45) that

∥∂Z^n/∂Z^{n−1}∥ ≤ 1 + (Γ/2) Δt.

Therefore, from the identity (19), we obtain

∥∂Z^N/∂Z^ℓ∥ ≤ (1 + (Γ/2) Δt)^{N−ℓ}.

Now, choosing Δt << 1 small enough such that the following inequality holds,

(1 + (Γ/2) Δt)^{N−ℓ} ≤ 1 + (N − ℓ) Γ Δt, (47)

leads to the following bound:

∥∂Z^N/∂Z^ℓ∥ ≤ 1 + (N − ℓ) Γ Δt ≤ 1 + N Γ Δt. (48)

A straightforward differentiation of the loss function (17) yields

∂J/∂Z^N = (1/v) ( X^N_1 − X̄_1, 0, X^N_2 − X̄_2, 0, …, X^N_v − X̄_v, 0 )^⊤. (49)

Hence,

∥∂J/∂Z^N∥ ≤ max_{1≤i≤v} |X^N_i| + max_{1≤i≤v} |X̄_i|. (50)

Applying the pointwise upper bound (41) to (50), we obtain

∥∂J/∂Z^N∥ ≤ max_{1≤i≤v} ( |X^0_i| + |Y^0_i| ) + max_{1≤i≤v} |X̄_i| + β √( m t_N / (2γ(α − γΔt)) ). (51)

Finally, a direct calculation provides the following characterization of the vector ∂Z^ℓ/∂w^ℓ_k:

(∂Z^ℓ/∂w^ℓ_k)_{2j} = Δt σ'(C^ℓ_j) X^{ℓ−1}_j / √(d_k d_j) for j ∈ N_k,
(∂Z^ℓ/∂w^ℓ_k)_{2j−1} = Δt² σ'(C^ℓ_j) X^{ℓ−1}_j / √(d_k d_j) for j ∈ N_k,
with all other entries equal to zero. (52)

Therefore, using the pointwise bound (41), one readily calculates that

∥∂Z^ℓ/∂w^ℓ_k∥ ≤ Δt β D̂ [ max_{1≤i≤v} ( |X^0_i| + |Y^0_i| ) + β √( m t_N / (2γ(α − γΔt)) ) ]
             ≤ Δt β D̂ [ max_{1≤i≤v} ( |X^0_i| + |Y^0_i| ) + max_{1≤i≤v} |X̄_i| + β √( m t_N / (2γ(α − γΔt)) ) ]. (53)

Multiplying (48), (51) and (53) and using the product rule (19) yields the desired upper bound (20). □

C.4.2. Proof of Proposition 3.6

To investigate how small the gradients in (18) can be, we will need the following order notation:

β = O(α), for α, β ∈ R_+, if there exist constants C, C̄ > 0 such that C α ≤ β ≤ C̄ α;
M = O(α), for M ∈ R^{d_1×d_2} and α ∈ R_+, if there exists a constant C > 0 such that ∥M∥ ≤ C α. (54)

Equipped with this notation, we proceed to prove Proposition 3.6.

Proof. The key ingredient in the proof is the following representation formula,

∂Z^N/∂Z^ℓ = I_{2v×2v} + Δt Σ_{n=ℓ+1}^{N} E^{n,n−1} + O(Δt²), (55)

the proof of which follows directly from the identity (45) and the boundedness of the matrices E and F in (45). Then, (22) follows by multiplying (49), (55) and (52) and a straightforward rearrangement of terms. □

One readily observes from the formula (22) that, to leading order in the small parameter Δt, the gradient ∂J/∂w^ℓ_k is independent of the number of layers N of the underlying GNN. Thus, although the gradient can be small (due to a small Δt), it does not vanish as the number of layers is increased, mitigating the vanishing gradients problem. Even if the small parameter Δt depends on the number of layers, as long as this dependence is at most polynomial, i.e., Δt ~ N^{−s} for some s > 0, the gradient cannot decay exponentially in N, alleviating the vanishing gradients problem in this case as well.
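The qualitative content of Propositions 3.5 and 3.6 — gradients that are bounded above and do not vanish as depth grows — can be probed with a small autograd experiment. The following sketch runs the recursion (40) with a tanh coupling and per-layer weight matrices and reports the gradient norm of the loss with respect to the first layer's weights for increasing depth N. The graph, feature width, Δt and the mean-squared loss are illustrative assumptions, and the averaging coupling is a generic stand-in for the GCN/GAT couplings used in the paper.

```python
# Toy autograd probe of the (non-)vanishing-gradient behaviour of the scheme (40).
import torch

torch.manual_seed(0)
v, m = 20, 4
dt, alpha, gamma = 0.1, 1.0, 1.0
adj = (torch.rand(v, v) < 0.2).float()
adj = torch.maximum(adj, adj.t())
adj.fill_diagonal_(1.0)
A_hat = adj / adj.sum(dim=1, keepdim=True)      # simple averaging coupling
x_init = torch.randn(v, m)
target = torch.randn(v, m)

for N in (4, 16, 64):
    weights = [(torch.randn(m, m) / m).requires_grad_(True) for _ in range(N)]
    x, y = x_init.clone(), torch.zeros(v, m)
    for W in weights:                           # GraphCON recursion (40), tanh coupling
        y = y + dt * (torch.tanh(A_hat @ x @ W) - gamma * x - alpha * y)
        x = x + dt * y
    loss = ((x - target) ** 2).mean()
    loss.backward()
    print(f"N = {N:3d}   ||dJ/dW^1|| = {weights[0].grad.norm():.3e}")
# Compare the printed norms across N: if gradients vanished with depth, the N = 64
# value would be orders of magnitude smaller than the N = 4 value.
```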