Elastic Graph Neural Networks

Xiaorui Liu 1 *  Wei Jin 1 *  Yao Ma 1  Yaxin Li 1  Hua Liu 2  Yiqi Wang 1  Ming Yan 3  Jiliang Tang 1

Abstract

While many existing graph neural networks (GNNs) have been proven to perform ℓ2-based graph smoothing that enforces smoothness globally, in this work we aim to further enhance the local smoothness adaptivity of GNNs via ℓ1-based graph smoothing. As a result, we introduce a family of GNNs (Elastic GNNs) based on ℓ1 and ℓ2-based graph smoothing. In particular, we propose a novel and general message passing scheme for GNNs. This message passing algorithm is not only friendly to back-propagation training but also achieves the desired smoothing properties with a theoretical convergence guarantee. Experiments on semi-supervised learning tasks demonstrate that the proposed Elastic GNNs obtain better adaptivity on benchmark datasets and are significantly more robust to graph adversarial attacks. The implementation of Elastic GNNs is available at https://github.com/lxiaorui/ElasticGNN.

*Equal contribution. 1Department of Computer Science and Engineering, Michigan State University, USA. 2School of Mathematics, Shandong University, China. 3Department of Computational Mathematics, Science and Engineering, Michigan State University, USA. Correspondence to: Xiaorui Liu. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Graph neural networks (GNNs) generalize traditional deep neural networks (DNNs) from regular grids, such as image, video, and text, to irregular data such as social networks, transportation networks, and biological networks, which are typically denoted as graphs (Defferrard et al., 2016; Kipf & Welling, 2016). One popular such generalization is the neural message passing framework (Gilmer et al., 2017):

$\mathbf{x}_u^{(k+1)} = \mathrm{UPDATE}^{(k)}\big(\mathbf{x}_u^{(k)}, \mathbf{m}_{\mathcal{N}(u)}^{(k)}\big),$   (1)

where $\mathbf{x}_u^{(k)} \in \mathbb{R}^d$ denotes the feature vector of node $u$ in the $k$-th iteration of message passing and $\mathbf{m}_{\mathcal{N}(u)}^{(k)}$ is the message aggregated from $u$'s neighborhood $\mathcal{N}(u)$. The specific architecture design has been motivated from the spectral domain (Kipf & Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). A recent study (Ma et al., 2020) has proven that the message passing schemes in numerous popular GNNs, such as GCN, GAT, PPNP, and APPNP, intrinsically perform ℓ2-based graph smoothing of the graph signal, and they can be considered as solving the graph signal denoising problem

$\arg\min_{\mathbf{F}} \; \mathcal{L}(\mathbf{F}) := \|\mathbf{F} - \mathbf{X}_{in}\|_F^2 + \lambda \, \mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}),$   (2)

where $\mathbf{X}_{in} \in \mathbb{R}^{n \times d}$ is the input signal and $\mathbf{L} \in \mathbb{R}^{n \times n}$ is the graph Laplacian matrix encoding the graph structure. The first term guides $\mathbf{F}$ to be close to the input signal $\mathbf{X}_{in}$, while the second term enforces global smoothness on the filtered signal $\mathbf{F}$. The resulting message passing schemes can be derived from different optimization solvers, and they typically entail the aggregation of node features from neighboring nodes, which intuitively coincides with the cluster or consistency assumption that neighboring nodes should be similar (Zhu & Ghahramani; Zhou et al., 2004). While existing GNNs are prominently driven by ℓ2-based graph smoothing, ℓ2-based methods enforce smoothness globally and the level of smoothness is usually shared across the whole graph. However, the level of smoothness over different regions of the graph can be different.
For instance, node features or labels can change significantly between clusters but smoothly within a cluster (Zhu, 2005). Therefore, it is desirable to enhance the local smoothness adaptivity of GNNs. Motivated by the idea of trend filtering (Kim et al., 2009; Tibshirani et al., 2014; Wang et al., 2016), we aim to achieve this goal via ℓ1-based graph smoothing. Intuitively, compared with ℓ2-based methods, ℓ1-based methods penalize large values less and thus better preserve discontinuities or non-smooth signals. Theoretically, ℓ1-based methods tend to promote signal sparsity to trade for discontinuity (Rudin et al., 1992; Tibshirani et al., 2005; Sharpnack et al., 2012). Owing to these advantages, trend filtering (Tibshirani et al., 2014) and graph trend filtering (Wang et al., 2016; Varma et al., 2019) demonstrate that ℓ1-based graph smoothing can adapt to an inhomogeneous level of smoothness of signals and yield estimators that are k-th order piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator.

While ℓ1-based methods exhibit various appealing properties and have been extensively studied in different domains such as signal processing (Elad, 2010) and statistics and machine learning (Hastie et al., 2015), they have rarely been investigated in the design of GNNs. In this work, we attempt to bridge this gap and enhance the local smoothness adaptivity of GNNs via ℓ1-based graph smoothing. Incorporating ℓ1-based graph smoothing in the design of GNNs faces tremendous challenges. First, since the message passing schemes in GNNs can be derived from the optimization iterations of the graph signal denoising problem, a fast, efficient and scalable optimization solver is desired. Unfortunately, solving the associated optimization problem involving the ℓ1 norm is challenging since the objective function is composed of smooth and non-smooth components and the decision variable is further coupled by the discrete graph difference operator. Second, to integrate the derived message passing scheme into GNNs, it has to be composed of simple operations that are friendly to the back-propagation training of the whole GNN. Third, it requires an appropriate normalization step to deal with diverse node degrees, which is often overlooked by existing graph total variation and graph trend filtering methods. Our attempt to address these challenges leads to a family of novel GNNs, i.e., Elastic GNNs. Our key contributions can be summarized as follows:

- We introduce ℓ1-based graph smoothing in the design of GNNs to further enhance the local smoothness adaptivity, for the first time;
- We derive a novel and general message passing scheme, i.e., Elastic Message Passing (EMP), and develop a family of GNN architectures, i.e., Elastic GNNs, by integrating the proposed message passing scheme into deep neural nets;
- Extensive experiments demonstrate that Elastic GNNs obtain better adaptivity on various real-world datasets, and they are significantly more robust to graph adversarial attacks. The study on different variants of Elastic GNNs suggests that ℓ1 and ℓ2-based graph smoothing are complementary and the proposed GNNs are more versatile.

2. Preliminary

We use bold upper-case letters such as $\mathbf{X}$ to denote matrices and bold lower-case letters such as $\mathbf{x}$ to denote vectors. Given a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, we use $\mathbf{X}_i$ to denote its $i$-th row and $\mathbf{X}_{ij}$ to denote its element in the $i$-th row and $j$-th column.
We define the Frobenius norm, ℓ1 norm, and ℓ21 norm of a matrix $\mathbf{X}$ as $\|\mathbf{X}\|_F = \sqrt{\sum_{ij} \mathbf{X}_{ij}^2}$, $\|\mathbf{X}\|_1 = \sum_{ij} |\mathbf{X}_{ij}|$, and $\|\mathbf{X}\|_{21} = \sum_i \|\mathbf{X}_i\|_2 = \sum_i \sqrt{\sum_j \mathbf{X}_{ij}^2}$, respectively. We define $\|\mathbf{X}\|_2 = \sigma_{\max}(\mathbf{X})$, where $\sigma_{\max}(\mathbf{X})$ is the largest singular value of $\mathbf{X}$. Given two matrices $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{n \times d}$, we define their inner product as $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{tr}(\mathbf{X}^\top \mathbf{Y})$.

Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ be a graph with node set $\mathcal{V} = \{v_1, \ldots, v_n\}$ and undirected edge set $\mathcal{E} = \{e_1, \ldots, e_m\}$. We use $\mathcal{N}(v_i)$ to denote the neighboring nodes of node $v_i$, including $v_i$ itself. Suppose that each node is associated with a $d$-dimensional feature vector, and the features for all nodes are denoted as $\mathbf{X}_{fea} \in \mathbb{R}^{n \times d}$. The graph structure $\mathcal{G}$ can be represented as an adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $\mathbf{A}_{ij} = 1$ when there exists an edge between nodes $v_i$ and $v_j$. The graph Laplacian matrix is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix. Let $\Delta \in \{-1, 0, 1\}^{m \times n}$ be the oriented incidence matrix, which contains one row for each edge. If $e_\ell = (i, j)$, then the $\ell$-th row of $\Delta$ is

$\Delta_\ell = (0, \ldots, \underbrace{-1}_{i}, \ldots, \underbrace{1}_{j}, \ldots, 0),$

where the edge orientation can be arbitrary. Note that the incidence matrix and the unnormalized Laplacian matrix satisfy the equivalence $\mathbf{L} = \Delta^\top \Delta$. Next, we briefly introduce some necessary background about the graph signal denoising perspective of GNNs and the graph trend filtering methods.

2.1. GNNs as Graph Signal Denoising

It is evident from recent work (Ma et al., 2020) that many popular GNNs can be uniformly understood as graph signal denoising with Laplacian smoothing regularization. Here we briefly describe several representative examples.

GCN. The message passing scheme in Graph Convolutional Networks (GCN) (Kipf & Welling, 2016), $\mathbf{X}_{out} = \tilde{\mathbf{A}} \mathbf{X}_{in}$, is equivalent to one gradient descent step for minimizing $\mathrm{tr}(\mathbf{F}^\top (\mathbf{I} - \tilde{\mathbf{A}}) \mathbf{F})$ with the initialization $\mathbf{F} = \mathbf{X}_{in}$ and stepsize 1/2. Here $\tilde{\mathbf{A}} = \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}$, with $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ being the adjacency matrix with self-loops and $\hat{\mathbf{D}}$ its degree matrix.

PPNP & APPNP. The message passing schemes in PPNP and APPNP (Klicpera et al., 2018) follow the aggregation rules $\mathbf{X}_{out} = \alpha \big(\mathbf{I} - (1-\alpha)\tilde{\mathbf{A}}\big)^{-1} \mathbf{X}_{in}$ and $\mathbf{X}^{(k+1)} = (1-\alpha)\tilde{\mathbf{A}} \mathbf{X}^{(k)} + \alpha \mathbf{X}_{in}$, respectively. They are shown to be the exact solution and one gradient descent step with stepsize $\alpha/2$, respectively, for the following problem:

$\min_{\mathbf{F}} \; \|\mathbf{F} - \mathbf{X}_{in}\|_F^2 + (1/\alpha - 1) \, \mathrm{tr}(\mathbf{F}^\top (\mathbf{I} - \tilde{\mathbf{A}}) \mathbf{F}).$   (3)

For a more comprehensive illustration, please refer to (Ma et al., 2020). We point out that all these message passing schemes adopt ℓ2-based graph smoothing, as the signal differences between neighboring nodes are penalized by the square of the ℓ2 norm, e.g., $\sum_{(v_i, v_j) \in \mathcal{E}} \big\| \frac{\mathbf{F}_i}{\sqrt{d_i + 1}} - \frac{\mathbf{F}_j}{\sqrt{d_j + 1}} \big\|_2^2$ with $d_i$ being the degree of node $v_i$. The resulting message passing schemes are usually linear smoothers, which smooth the input signal by a linear transformation.

2.2. Graph Trend Filtering

In the univariate case, the $k$-th order graph trend filtering (GTF) estimator (Wang et al., 2016) is given by

$\arg\min_{\mathbf{f} \in \mathbb{R}^n} \; \frac{1}{2} \|\mathbf{f} - \mathbf{x}\|_2^2 + \lambda \|\Delta^{(k+1)} \mathbf{f}\|_1,$   (4)

where $\mathbf{x} \in \mathbb{R}^n$ is the 1-dimensional input signal of $n$ nodes and $\Delta^{(k+1)}$ is a $k$-th order graph difference operator. When $k = 0$, it penalizes the absolute differences across neighboring nodes in graph $\mathcal{G}$, i.e., $\sum_{(v_i, v_j) \in \mathcal{E}} |f_i - f_j|$, where $\Delta^{(1)}$ is equivalent to the incidence matrix $\Delta$. Generally, $k$-th order graph difference operators can be defined recursively: $\Delta^{(k+1)} = \mathbf{L}^{\frac{k+1}{2}} \in \mathbb{R}^{n \times n}$ for odd $k$, and $\Delta^{(k+1)} = \Delta \mathbf{L}^{\frac{k}{2}} \in \mathbb{R}^{m \times n}$ for even $k$.

It is demonstrated that GTF can adapt to inhomogeneity in the level of smoothness of the signal and tends to provide piecewise polynomials over graphs (Wang et al., 2016). For instance, when $k = 0$, the sparsity induced by the ℓ1-based penalty $\|\Delta^{(1)} \mathbf{f}\|_1$ implies that many of the differences $f_i - f_j$ are zero across edges $(v_i, v_j) \in \mathcal{E}$ in $\mathcal{G}$. The piecewise property originates from the discontinuity of the signal allowed by the less aggressive ℓ1 penalty, with adaptively chosen knot nodes or knot edges. Note that the smoothers induced by GTF are not linear smoothers and cannot be simply represented by a linear transformation of the input signal.
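To make the preliminaries concrete, the following minimal sketch (our illustration, not code from the paper or its released implementation) builds the oriented incidence matrix Δ and the unnormalized Laplacian L for a small path graph, checks the identity L = ΔᵀΔ, and contrasts the ℓ2 and ℓ1 edge-difference penalties on a roughly piecewise-constant signal. The toy graph, the signal, and all variable names are assumptions made only for illustration.

```python
# Toy illustration of the notation in Section 2 (assumed example, not the paper's code).
import torch

edges = [(0, 1), (1, 2), (2, 3)]          # a path graph with n = 4 nodes and m = 3 edges
n, m = 4, len(edges)

Delta = torch.zeros(m, n)
for l, (i, j) in enumerate(edges):        # row l has -1 at node i and +1 at node j
    Delta[l, i], Delta[l, j] = -1.0, 1.0

A = torch.zeros(n, n)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = torch.diag(A.sum(dim=1)) - A          # unnormalized Laplacian L = D - A

assert torch.allclose(L, Delta.T @ Delta) # the equivalence L = Delta^T Delta

f = torch.tensor([[0.0], [0.1], [5.0], [5.1]])   # a nearly piecewise-constant signal
l2_penalty = torch.trace(f.T @ L @ f)            # sum of squared edge differences
l1_penalty = (Delta @ f).abs().sum()             # sum of absolute edge differences (k = 0 GTF penalty)
print(l2_penalty.item(), l1_penalty.item())      # the single large jump dominates the l2 penalty far more than the l1 penalty
```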
3. Elastic Graph Neural Networks

In this section, we first propose a new graph signal denoising estimator. Then we develop an efficient optimization algorithm for solving the denoising problem and introduce a novel, general and efficient message passing scheme, i.e., Elastic Message Passing (EMP), for graph signal smoothing. Finally, the integration of the proposed message passing scheme and deep neural networks leads to Elastic GNNs.

3.1. Elastic Graph Signal Estimator

To combine the advantages of ℓ1 and ℓ2-based graph smoothing, we propose the following elastic graph signal estimator:

$\arg\min_{\mathbf{F} \in \mathbb{R}^{n \times d}} \; \underbrace{\lambda_1 \|\Delta \mathbf{F}\|_1}_{g_1(\Delta \mathbf{F})} + \underbrace{\tfrac{\lambda_2}{2} \mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}) + \tfrac{1}{2} \|\mathbf{F} - \mathbf{X}_{in}\|_F^2}_{f(\mathbf{F})},$   (5)

where $\mathbf{X}_{in} \in \mathbb{R}^{n \times d}$ is the $d$-dimensional input signal of $n$ nodes. The first term can be written in an edge-centric way, $\|\Delta^{(1)} \mathbf{F}\|_1 = \sum_{(v_i, v_j) \in \mathcal{E}} \|\mathbf{F}_i - \mathbf{F}_j\|_1$, which penalizes the absolute differences across connected nodes in graph $\mathcal{G}$. Similarly, the second term penalizes the differences quadratically via $\mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}) = \sum_{(v_i, v_j) \in \mathcal{E}} \|\mathbf{F}_i - \mathbf{F}_j\|_2^2$. The last term is the fidelity term, which preserves the similarity with the input signal. The regularization coefficients $\lambda_1$ and $\lambda_2$ control the balance between ℓ1 and ℓ2-based graph smoothing.

Remark 1. It is possible to consider higher-order graph differences in both the ℓ1-based and ℓ2-based smoothers. However, in this work, we focus on the 0-th order graph difference operator $\Delta$, since we assume a piecewise constant prior for graph representation learning.

Normalization. In existing GNNs, it is beneficial to normalize the Laplacian matrix for better numerical stability, and the normalization trick is also crucial for achieving superior performance. Therefore, for the ℓ2-based graph smoothing, we follow the common normalization trick in GNNs: $\tilde{\mathbf{L}} = \mathbf{I} - \tilde{\mathbf{A}}$, where $\tilde{\mathbf{A}} = \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}$, $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, and $\hat{\mathbf{D}}_{ii} = d_i + 1$ with $d_i$ being the degree of node $v_i$. It leads to the degree-normalized penalty

$\mathrm{tr}(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}) = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{d_i + 1}} - \frac{\mathbf{F}_j}{\sqrt{d_j + 1}} \Big\|_2^2.$

In the literature on graph total variation and graph trend filtering, the normalization step is often overlooked and the graph difference operator $\Delta$ is used directly, as in GTF (Wang et al., 2016; Varma et al., 2019). To achieve better numerical stability and handle the diverse node degrees in real-world graphs, we propose to normalize each column of the incidence matrix by the square root of the node degrees for the ℓ1-based graph smoothing as follows [1]: $\tilde{\Delta} = \Delta \hat{\mathbf{D}}^{-\frac{1}{2}}$. It leads to a degree-normalized total variation penalty [2]

$\|\tilde{\Delta} \mathbf{F}\|_1 = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{d_i + 1}} - \frac{\mathbf{F}_j}{\sqrt{d_j + 1}} \Big\|_1.$

Note that this normalized incidence matrix maintains the relation with the normalized Laplacian matrix as in the unnormalized case:

$\tilde{\Delta}^\top \tilde{\Delta} = \hat{\mathbf{D}}^{-\frac{1}{2}} (\hat{\mathbf{D}} - \hat{\mathbf{A}}) \hat{\mathbf{D}}^{-\frac{1}{2}} = \tilde{\mathbf{L}}.$   (6)

[1] It naturally supports real-valued edge weights if the edge weights are set in the incidence matrix $\Delta$.
[2] With the normalization, the piecewise constant prior is up to the degree scaling, i.e., sparsity in $\tilde{\Delta} \mathbf{F}$.

With the normalization, the estimator defined in (5) becomes:

$\arg\min_{\mathbf{F} \in \mathbb{R}^{n \times d}} \; \underbrace{\lambda_1 \|\tilde{\Delta} \mathbf{F}\|_1}_{g_1(\tilde{\Delta} \mathbf{F})} + \underbrace{\tfrac{\lambda_2}{2} \mathrm{tr}(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}) + \tfrac{1}{2} \|\mathbf{F} - \mathbf{X}_{in}\|_F^2}_{f(\mathbf{F})}.$   (7)
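The following self-contained sketch illustrates the normalization step above and evaluates the objective of the estimator (7). It is our illustration under an assumed toy graph and assumed values of λ1 and λ2, not the authors' implementation, and the helper name elastic_objective is hypothetical.

```python
# Toy illustration of the normalization in Section 3.1 and the objective in Eq. (7).
import torch

edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)

A = torch.zeros(n, n)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + torch.eye(n)                          # adjacency with self-loops
D_hat_invsqrt = torch.diag(A_hat.sum(1).rsqrt())  # D_hat^{-1/2}
A_tilde = D_hat_invsqrt @ A_hat @ D_hat_invsqrt
L_tilde = torch.eye(n) - A_tilde                  # normalized Laplacian

Delta = torch.zeros(m, n)
for l, (i, j) in enumerate(edges):
    Delta[l, i], Delta[l, j] = -1.0, 1.0
Delta_tilde = Delta @ D_hat_invsqrt               # column-wise degree normalization

assert torch.allclose(Delta_tilde.T @ Delta_tilde, L_tilde, atol=1e-6)  # relation (6)

def elastic_objective(F, X_in, lam1=3.0, lam2=3.0):
    """Eq. (7): lam1*||Delta_tilde F||_1 + lam2/2*tr(F^T L_tilde F) + 1/2*||F - X_in||_F^2."""
    return (lam1 * (Delta_tilde @ F).abs().sum()
            + 0.5 * lam2 * torch.trace(F.T @ L_tilde @ F)
            + 0.5 * ((F - X_in) ** 2).sum())
```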
Capture correlation among dimensions. The node features in real-world graphs are usually multi-dimensional. Although the estimator defined in (7) is able to handle multi-dimensional data, since the signals from different dimensions are separable under the ℓ1 and ℓ2 norms, such an estimator treats each feature dimension independently and does not exploit the potential relations between feature dimensions. However, the sparsity patterns of node differences across edges could be shared among feature dimensions. To better exploit this potential correlation, we propose to couple the multi-dimensional features by the ℓ21 norm, which penalizes the summation of the ℓ2 norms of the node differences:

$\|\tilde{\Delta} \mathbf{F}\|_{21} = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{d_i + 1}} - \frac{\mathbf{F}_j}{\sqrt{d_j + 1}} \Big\|_2.$

This penalty promotes the row sparsity of $\tilde{\Delta} \mathbf{F}$ and enforces similar sparsity patterns among feature dimensions. In other words, if two nodes are similar, all their feature dimensions should be similar. Therefore, we define the ℓ21-based estimator as

$\arg\min_{\mathbf{F} \in \mathbb{R}^{n \times d}} \; \underbrace{\lambda_1 \|\tilde{\Delta} \mathbf{F}\|_{21}}_{g_{21}(\tilde{\Delta} \mathbf{F})} + \underbrace{\tfrac{\lambda_2}{2} \mathrm{tr}(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}) + \tfrac{1}{2} \|\mathbf{F} - \mathbf{X}_{in}\|_F^2}_{f(\mathbf{F})},$   (8)

where $g_{21}(\cdot) = \lambda_1 \|\cdot\|_{21}$. In the following subsections, we will use $g(\cdot)$ to represent both $g_1(\cdot)$ and $g_{21}(\cdot)$, and we use ℓ1 to represent both ℓ1 and ℓ21 if not specified otherwise.

3.2. Elastic Message Passing

For the ℓ2-based graph smoother, message passing schemes can be derived from the gradient descent iterations of the graph signal denoising problem, as in the case of GCN and APPNP (Ma et al., 2020). However, computing the estimators defined by (7) and (8) is much more challenging because of the non-smoothness, and because the two components, i.e., $f(\mathbf{F})$ and $g(\tilde{\Delta} \mathbf{F})$, are non-separable as they are coupled by the graph difference operator $\tilde{\Delta}$. In the literature, researchers have developed optimization algorithms for the graph trend filtering problem (4), such as the Alternating Direction Method of Multipliers (ADMM) and Newton-type algorithms (Wang et al., 2016; Varma et al., 2019). However, these algorithms require solving a nontrivial sub-problem in every iteration, which incurs high computation complexity. Moreover, it is unclear how to make these iterations compatible with the back-propagation training of deep learning models. This motivates us to design an algorithm that is not only efficient but also friendly to back-propagation training. To this end, we propose to solve an equivalent saddle point problem using a primal-dual algorithm with efficient computations.

Saddle point reformulation. For a general convex function $g(\cdot)$, its conjugate function is defined as $g^*(\mathbf{Z}) := \sup_{\mathbf{X}} \langle \mathbf{Z}, \mathbf{X} \rangle - g(\mathbf{X})$. By using $g(\tilde{\Delta} \mathbf{F}) = \sup_{\mathbf{Z}} \langle \tilde{\Delta} \mathbf{F}, \mathbf{Z} \rangle - g^*(\mathbf{Z})$, the problems (7) and (8) can be equivalently written as the following saddle point problem:

$\min_{\mathbf{F}} \max_{\mathbf{Z}} \; f(\mathbf{F}) + \langle \tilde{\Delta} \mathbf{F}, \mathbf{Z} \rangle - g^*(\mathbf{Z}),$   (9)

where $\mathbf{Z} \in \mathbb{R}^{m \times d}$. Motivated by the Proximal Alternating Predictor-Corrector (PAPC) (Loris & Verhoeven, 2011; Chen et al., 2013), we propose an efficient algorithm with low per-iteration computation complexity and a convergence guarantee:

$\bar{\mathbf{F}}^{k+1} = \mathbf{F}^k - \gamma \nabla f(\mathbf{F}^k) - \gamma \tilde{\Delta}^\top \mathbf{Z}^k,$   (10)
$\mathbf{Z}^{k+1} = \mathrm{prox}_{\beta g^*} (\mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1}),$   (11)
$\mathbf{F}^{k+1} = \mathbf{F}^k - \gamma \nabla f(\mathbf{F}^k) - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1},$   (12)

where $\mathrm{prox}_{\beta g^*}(\mathbf{X}) = \arg\min_{\mathbf{Y}} \tfrac{1}{2} \|\mathbf{Y} - \mathbf{X}\|_F^2 + \beta g^*(\mathbf{Y})$. The stepsizes $\gamma$ and $\beta$ will be specified later. The first step (10) obtains a prediction of $\mathbf{F}^{k+1}$, i.e., $\bar{\mathbf{F}}^{k+1}$, by a gradient descent step on the primal variable $\mathbf{F}^k$. The second step (11) is a proximal dual ascent step on the dual variable $\mathbf{Z}^k$ based on the predicted $\bar{\mathbf{F}}^{k+1}$. Finally, another gradient descent step on the primal variable, based on $(\mathbf{F}^k, \mathbf{Z}^{k+1})$, gives the next iterate $\mathbf{F}^{k+1}$ in (12). Algorithm (10)-(12) can be interpreted as a predictor-corrector algorithm for the saddle point problem (9).
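The iterations (10)-(12) translate directly into a short loop. The sketch below is our illustration rather than the released code: grad_f follows the definition of f(F) in (7)/(8), while prox_conj is a placeholder standing for prox_{βg*}, to be instantiated by the operators derived next in Eqs. (14) and (15). The function and argument names are assumptions.

```python
# Generic predictor-corrector (PAPC-style) loop for the saddle point problem (9); illustration only.
import torch

def papc(X_in, Delta_tilde, L_tilde, prox_conj, lam2=3.0, gamma=0.25, beta=2.0, K=10):
    d = X_in.shape[1]
    m = Delta_tilde.shape[0]
    F = X_in.clone()
    Z = torch.zeros(m, d)

    def grad_f(F):  # gradient of f(F) = lam2/2 * tr(F^T L F) + 1/2 * ||F - X_in||_F^2
        return lam2 * (L_tilde @ F) + (F - X_in)

    for _ in range(K):
        F_bar = F - gamma * grad_f(F) - gamma * (Delta_tilde.T @ Z)   # Eq. (10): predict
        Z = prox_conj(Z + beta * (Delta_tilde @ F_bar), beta)         # Eq. (11): proximal dual ascent
        F = F - gamma * grad_f(F) - gamma * (Delta_tilde.T @ Z)       # Eq. (12): correct
    return F
```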
Next, we demonstrate how to compute the proximal operator in Eq. (11).

Proximal operators. Using Moreau's decomposition principle (Bauschke & Combettes, 2011), $\mathbf{X} = \mathrm{prox}_{\beta g^*}(\mathbf{X}) + \beta \, \mathrm{prox}_{\beta^{-1} g}(\mathbf{X}/\beta)$, we can rewrite step (11) using the proximal operator of $g(\cdot)$, that is,

$\mathrm{prox}_{\beta g^*}(\mathbf{X}) = \mathbf{X} - \beta \, \mathrm{prox}_{\beta^{-1} g}(\mathbf{X}/\beta).$   (13)

We discuss the two options for the function $g(\cdot)$ corresponding to the objectives (7) and (8).

Option I (ℓ1 norm): $g_1(\mathbf{X}) = \lambda_1 \|\mathbf{X}\|_1$. By definition, the proximal operator of $\tfrac{1}{\beta} g_1$ is

$\mathrm{prox}_{\frac{1}{\beta} g_1}(\mathbf{X}) = \arg\min_{\mathbf{Y}} \tfrac{1}{2} \|\mathbf{Y} - \mathbf{X}\|_F^2 + \tfrac{\lambda_1}{\beta} \|\mathbf{Y}\|_1,$

which is equivalent to the (component-wise) soft-thresholding operator:

$\big(\mathrm{prox}_{\frac{1}{\beta} g_1}(\mathbf{X})\big)_{ij} = \mathrm{sign}(\mathbf{X}_{ij}) \max\big(|\mathbf{X}_{ij}| - \tfrac{\lambda_1}{\beta}, 0\big) = \mathbf{X}_{ij} - \mathrm{sign}(\mathbf{X}_{ij}) \min\big(|\mathbf{X}_{ij}|, \tfrac{\lambda_1}{\beta}\big).$

Therefore, using (13), we have

$\big(\mathrm{prox}_{\beta g_1^*}(\mathbf{X})\big)_{ij} = \mathrm{sign}(\mathbf{X}_{ij}) \min(|\mathbf{X}_{ij}|, \lambda_1),$   (14)

which is a component-wise projection onto the ℓ∞ ball of radius $\lambda_1$.

Option II (ℓ21 norm): $g_{21}(\mathbf{X}) = \lambda_1 \|\mathbf{X}\|_{21}$. By definition, the proximal operator of $\tfrac{1}{\beta} g_{21}$ is

$\mathrm{prox}_{\frac{1}{\beta} g_{21}}(\mathbf{X}) = \arg\min_{\mathbf{Y}} \tfrac{1}{2} \|\mathbf{Y} - \mathbf{X}\|_F^2 + \tfrac{\lambda_1}{\beta} \|\mathbf{Y}\|_{21},$

with the $i$-th row being

$\big(\mathrm{prox}_{\frac{1}{\beta} g_{21}}(\mathbf{X})\big)_i = \frac{\mathbf{X}_i}{\|\mathbf{X}_i\|_2} \max\big(\|\mathbf{X}_i\|_2 - \tfrac{\lambda_1}{\beta}, 0\big).$

Similarly, using (13), the $i$-th row of $\mathrm{prox}_{\beta g_{21}^*}(\mathbf{X})$ is

$\big(\mathrm{prox}_{\beta g_{21}^*}(\mathbf{X})\big)_i = \mathbf{X}_i - \beta \, \mathrm{prox}_{\frac{1}{\beta} g_{21}}(\mathbf{X}_i / \beta) = \mathbf{X}_i - \beta \frac{\mathbf{X}_i / \beta}{\|\mathbf{X}_i / \beta\|_2} \max\big(\|\mathbf{X}_i / \beta\|_2 - \tfrac{\lambda_1}{\beta}, 0\big) = \mathbf{X}_i - \frac{\mathbf{X}_i}{\|\mathbf{X}_i\|_2} \max(\|\mathbf{X}_i\|_2 - \lambda_1, 0) = \frac{\mathbf{X}_i}{\|\mathbf{X}_i\|_2} \big(\|\mathbf{X}_i\|_2 - \max(\|\mathbf{X}_i\|_2 - \lambda_1, 0)\big) = \frac{\mathbf{X}_i}{\|\mathbf{X}_i\|_2} \min(\|\mathbf{X}_i\|_2, \lambda_1),$   (15)

which is a row-wise projection onto the ℓ2 ball of radius $\lambda_1$. Note that the proximal operator in the ℓ1 norm case treats each feature dimension independently, while in the ℓ21 norm case it couples the multi-dimensional features, which is consistent with the motivation of exploiting the correlation among feature dimensions.

The algorithm (10)-(12) and the proximal operators (14) and (15) enable us to derive the final message passing scheme. Note that the computation $\mathbf{F}^k - \gamma \nabla f(\mathbf{F}^k)$ in steps (10) and (12) can be shared to save computation. Therefore, we decompose step (10) into two steps:

$\mathbf{Y}^k = \mathbf{F}^k - \gamma \nabla f(\mathbf{F}^k) = \big((1-\gamma)\mathbf{I} - \gamma \lambda_2 \tilde{\mathbf{L}}\big) \mathbf{F}^k + \gamma \mathbf{X}_{in},$   (16)
$\bar{\mathbf{F}}^{k+1} = \mathbf{Y}^k - \gamma \tilde{\Delta}^\top \mathbf{Z}^k.$   (17)

In this work, we choose $\gamma = \frac{1}{1+\lambda_2}$ and $\beta = \frac{1}{2\gamma}$. Therefore, with $\tilde{\mathbf{L}} = \mathbf{I} - \tilde{\mathbf{A}}$, Eq. (16) can be simplified as

$\mathbf{Y}^k = \gamma \mathbf{X}_{in} + (1 - \gamma) \tilde{\mathbf{A}} \mathbf{F}^k.$   (18)

Let $\bar{\mathbf{Z}}^{k+1} := \mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1}$; then steps (11) and (12) become

$\mathbf{Z}^{k+1} = \mathrm{prox}_{\beta g^*} (\bar{\mathbf{Z}}^{k+1}),$   (19)
$\mathbf{F}^{k+1} = \mathbf{F}^k - \gamma \nabla f(\mathbf{F}^k) - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1} = \mathbf{Y}^k - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1}.$   (20)

Substituting the proximal operator in (19) with (14) or (15), we obtain the complete elastic message passing scheme (EMP) as summarized in Figure 1:

$\mathbf{Y}^k = \gamma \mathbf{X}_{in} + (1-\gamma) \tilde{\mathbf{A}} \mathbf{F}^k$
$\bar{\mathbf{F}}^{k+1} = \mathbf{Y}^k - \gamma \tilde{\Delta}^\top \mathbf{Z}^k$
$\bar{\mathbf{Z}}^{k+1} = \mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1}$
$\mathbf{Z}^{k+1} = \min(|\bar{\mathbf{Z}}^{k+1}|, \lambda_1) \odot \mathrm{sign}(\bar{\mathbf{Z}}^{k+1})$   (Option I: ℓ1 norm)
$\mathbf{Z}^{k+1}_i = \min(\|\bar{\mathbf{Z}}^{k+1}_i\|_2, \lambda_1) \, \bar{\mathbf{Z}}^{k+1}_i / \|\bar{\mathbf{Z}}^{k+1}_i\|_2, \; \forall i \in [m]$   (Option II: ℓ21 norm)
$\mathbf{F}^{k+1} = \mathbf{Y}^k - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1}$

Figure 1. Elastic Message Passing (EMP), with $\mathbf{F}^0 = \mathbf{X}_{in}$ and $\mathbf{Z}^0 = \mathbf{0}_{m \times d}$.

Interpretation of EMP. EMP can be interpreted as the standard message passing (MP) ($\mathbf{Y}^k$ in Figure 1) with extra operations (the following steps). The extra operations compute $\mathbf{Z}$ to adjust the standard MP such that sparsity in $\tilde{\Delta} \mathbf{F}$ is promoted and some large node differences can be preserved. EMP is general and covers some existing propagation rules as special cases, as demonstrated in Remark 2.

Remark 2 (Special cases). If there is only ℓ2-based regularization, i.e., $\lambda_1 = 0$, then according to the projection operator we have $\mathbf{Z}^k = \mathbf{0}_{m \times d}$. Therefore, with $\gamma = \frac{1}{1+\lambda_2}$, the proposed message passing scheme reduces to

$\mathbf{F}^{k+1} = \frac{1}{1+\lambda_2} \mathbf{X}_{in} + \frac{\lambda_2}{1+\lambda_2} \tilde{\mathbf{A}} \mathbf{F}^k.$

With $\lambda_2 = \frac{1}{\alpha} - 1$, it recovers the message passing in APPNP: $\mathbf{F}^{k+1} = \alpha \mathbf{X}_{in} + (1-\alpha) \tilde{\mathbf{A}} \mathbf{F}^k$. If $\lambda_2 = \infty$, it recovers the simple aggregation operation in many GNNs: $\mathbf{F}^{k+1} = \tilde{\mathbf{A}} \mathbf{F}^k$.

Computation Complexity. EMP is efficient and composed of simple operations. The major computation cost comes from four sparse matrix multiplications, namely $\tilde{\mathbf{A}} \mathbf{F}^k$, $\tilde{\Delta}^\top \mathbf{Z}^k$, $\tilde{\Delta} \bar{\mathbf{F}}^{k+1}$ and $\tilde{\Delta}^\top \mathbf{Z}^{k+1}$. The computation complexity is on the order of $O(md)$, where $m$ is the number of edges in graph $\mathcal{G}$ and $d$ is the feature dimension of the input signal $\mathbf{X}_{in}$. The remaining operations are simple matrix additions and projections.
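For concreteness, the sketch below spells out the EMP updates from Figure 1 with dense PyTorch operations. It is only an illustration of the update rules, not the authors' released code (which uses sparse operations); A_tilde and Delta_tilde are assumed to be built as in the earlier sketches, and the default values of λ1, λ2 and K are assumptions matching the settings used later in the experiments.

```python
# Dense illustration of the Elastic Message Passing scheme in Figure 1 (assumed example).
import torch

def emp(X_in, A_tilde, Delta_tilde, lam1=3.0, lam2=3.0, K=10, option="l21"):
    gamma = 1.0 / (1.0 + lam2)                 # stepsizes chosen as in the paper
    beta = 1.0 / (2.0 * gamma)
    m, d = Delta_tilde.shape[0], X_in.shape[1]
    F, Z = X_in, torch.zeros(m, d)             # F^0 = X_in, Z^0 = 0

    for _ in range(K):
        Y = gamma * X_in + (1.0 - gamma) * (A_tilde @ F)       # standard MP part, Eq. (18)
        F_bar = Y - gamma * (Delta_tilde.T @ Z)                # predictor, Eq. (17)
        Z_bar = Z + beta * (Delta_tilde @ F_bar)
        if option == "l1":                     # Eq. (14): entry-wise projection onto the l_inf ball
            Z = torch.clamp(Z_bar, -lam1, lam1)
        else:                                  # Eq. (15): row-wise projection onto the l_2 ball
            row_norm = Z_bar.norm(dim=1, keepdim=True).clamp_min(1e-12)
            Z = Z_bar * torch.clamp(lam1 / row_norm, max=1.0)
        F = Y - gamma * (Delta_tilde.T @ Z)                    # corrector, Eq. (20)
    return F

# In Elastic GNNs (Section 3.3), EMP would be applied to the predictions of a base model,
# e.g. Y_pre = emp(mlp(X_fea), A_tilde, Delta_tilde), where mlp is a hypothetical prediction network.
```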
The convergence of EMP and the parameter settings are justified by Theorem 1, with the proof deferred to Appendix B.

Theorem 1 (Convergence). Under the stepsize setting $\gamma < \frac{2}{1 + \lambda_2 \|\tilde{\mathbf{L}}\|_2}$ and $\beta \le \frac{4}{3 \gamma \|\tilde{\Delta}\|_2^2}$, the elastic message passing scheme (EMP) in Figure 1 converges to the optimal solution of the elastic graph signal estimator defined in (7) (Option I) or (8) (Option II). It is sufficient to choose any $\gamma < \frac{2}{1 + 2\lambda_2}$ and $\beta \le \frac{2}{3\gamma}$ since $\|\tilde{\mathbf{L}}\|_2 = \|\tilde{\Delta}\|_2^2 = \|\tilde{\Delta} \tilde{\Delta}^\top\|_2 \le 2$.

3.3. Elastic GNNs

Incorporating the elastic message passing scheme derived from the elastic graph signal estimators (7) and (8) into deep neural networks, we introduce a family of GNNs, namely Elastic GNNs. In this work, we follow the decoupled approach proposed in APPNP (Klicpera et al., 2018), where we first make predictions from the node features and then aggregate the predictions through the proposed EMP:

$\mathbf{Y}_{pre} = \mathrm{EMP}\big(h_\theta(\mathbf{X}_{fea}), K, \lambda_1, \lambda_2\big),$   (21)

where $\mathbf{X}_{fea} \in \mathbb{R}^{n \times d}$ denotes the node features, $h_\theta(\cdot)$ is any machine learning model, such as a multilayer perceptron (MLP), $\theta$ denotes the learnable parameters in the model, and $K$ is the number of message passing steps. The training objective is the cross entropy loss defined by the final prediction $\mathbf{Y}_{pre}$ and the labels of the training data. Elastic GNNs also have the following nice properties:

- In addition to the backbone neural network model, Elastic GNNs only require setting three hyperparameters, namely the two coefficients $\lambda_1$, $\lambda_2$ and the propagation step $K$, and they do not introduce any learnable parameters. Therefore, the risk of overfitting is reduced.
- The hyperparameters $\lambda_1$ and $\lambda_2$ provide better smoothness adaptivity to Elastic GNNs, depending on the smoothness properties of the graph data.
- The message passing scheme only entails simple and efficient operations, which makes it friendly to the efficient, end-to-end back-propagation training of the whole GNN model.

4. Experiment

In this section, we conduct experiments to validate the effectiveness of the proposed Elastic GNNs. We first introduce the experimental settings. Then we assess the performance of Elastic GNNs and investigate the benefits of introducing ℓ1-based graph smoothing into GNNs on semi-supervised learning tasks under normal and adversarial settings. In the ablation study, we validate the local adaptive smoothness, sparsity pattern, and convergence of EMP.

4.1. Experimental Settings

Datasets. We conduct experiments on 8 real-world datasets, including three citation graphs, i.e., Cora, CiteSeer and PubMed (Sen et al., 2008), two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics (Shchur et al., 2018), two co-purchase graphs, i.e., Amazon Computers and Amazon Photo (Shchur et al., 2018), and one blog graph, i.e., Polblogs (Adamic & Glance, 2005). In the Polblogs graph, node features are not available, so we set the feature matrix to be an n × n identity matrix.

Baselines. We compare the proposed Elastic GNNs with representative GNNs including GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), ChebNet (Defferrard et al., 2016), GraphSAGE (Hamilton et al., 2017), APPNP (Klicpera et al., 2018) and SGC (Wu et al., 2019). For all models, we use 2-layer neural networks with 64 hidden units.

Parameter settings. For each experiment, we report the average performance and the standard deviation over 10 runs.
For all methods, hyperparameters are tuned from the following search space: 1) learning rate: {0.05, 0.01, 0.005}; 2) weight decay: {5e-4, 5e-5, 5e-6}; 3) dropout rate: {0.5, 0.8}. For APPNP, the propagation step K is tuned from {5, 10} and the parameter α is tuned from {0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0}. For Elastic GNNs, the propagation step K is tuned from {5, 10} and the parameters λ1 and λ2 are tuned from {0, 3, 6, 9}. As suggested by Theorem 1, we set γ = 1/(1+λ2) and β = 1/(2γ) in the proposed elastic message passing scheme. The Adam optimizer (Kingma & Ba, 2014) is used in all experiments.

4.2. Performance on Benchmark Datasets

On commonly used datasets including Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo, we compare the performance of the proposed Elastic GNN (ℓ21 + ℓ2) with representative GNN baselines on the semi-supervised learning task. The detailed statistics of these datasets and the data splits are summarized in Table 5 in Appendix A. The classification accuracy is shown in Table 1. From these results, we can make the following observations:

- Elastic GNN outperforms GCN, GAT, ChebNet, GraphSAGE and SGC by significant margins on all datasets. For instance, Elastic GNN improves over GCN by 3.1%, 2.0% and 1.8% on the Cora, CiteSeer and PubMed datasets, respectively. The improvement comes from the global and local smoothness adaptivity of Elastic GNN.
- Elastic GNN (ℓ21 + ℓ2) consistently achieves higher performance than APPNP on all datasets. Essentially, Elastic GNN covers APPNP as a special case when there is only ℓ2 regularization, i.e., λ1 = 0. Beyond ℓ2-based graph smoothing, the ℓ21-based graph smoothing further enhances the local smoothness adaptivity. This comparison verifies the benefits of introducing ℓ21-based graph smoothing into GNNs.

4.3. Robustness Under Adversarial Attack

Locally adaptive smoothness makes Elastic GNNs more robust to adversarial attacks on the graph structure. This is because such attacks tend to connect nodes with different labels, which blurs the cluster structure in the graph, whereas EMP can tolerate large node differences along these wrong edges while maintaining the smoothness along correct edges. To validate this, we evaluate the performance of Elastic GNNs under untargeted adversarial graph attacks, which try to degrade the overall performance of GNN models by deliberately modifying the graph structure. We use the MetaAttack (Zügner & Günnemann, 2019) implemented in DeepRobust (Li et al., 2020) [3], a PyTorch library for adversarial attacks and defenses, to generate the adversarially attacked graphs based on four datasets: Cora, CiteSeer, Polblogs and PubMed. We randomly split 10%/10%/80% of the nodes for training, validation and test. The detailed data statistics are summarized in Table 6 in Appendix A. Note that, following the works (Zügner et al., 2018; Zügner & Günnemann, 2019; Entezari et al., 2020; Jin et al., 2020), we only consider the largest connected component (LCC) of the adversarial graphs. Therefore, the results in Table 2 are not directly comparable with the results in Table 1. We focus on investigating the robustness introduced by ℓ1-based graph smoothing rather than on adversarial defense, so we do not compare with defense strategies. Existing defense strategies can be applied to Elastic GNNs to further improve their robustness against attacks.

[3] https://github.com/DSE-MSU/DeepRobust

Variants of Elastic GNNs.
To investigate Elastic GNNs more deeply, we consider the following variants: (1) ℓ2 (λ1 = 0); (2) ℓ1 (λ2 = 0, Option I); (3) ℓ21 (λ2 = 0, Option II); (4) ℓ1 + ℓ2 (Option I); (5) ℓ21 + ℓ2 (Option II). To save computation, we fix the learning rate as 0.01, weight decay as 0.0005, dropout rate as 0.5 and K = 10, since this setting works well for the chosen datasets and models. Only λ1 and λ2 are tuned. The classification accuracy under different perturbation rates, ranging from 0% to 20%, is summarized in Table 2. From the results, we can make the following observations:

- All variants of Elastic GNNs outperform GCN and GAT by significant margins under all perturbation rates. For instance, when the perturbation rate is 15%, Elastic GNN (ℓ21 + ℓ2) improves over GCN by 12.1%, 7.4%, 13.7% and 7.7% on the four datasets being considered. This is because Elastic GNN can adapt to the change of smoothness while GCN and GAT cannot adapt well when the perturbation rate increases.
- ℓ21 outperforms ℓ1 in most cases, and ℓ21 + ℓ2 outperforms ℓ1 + ℓ2 in almost all cases. This demonstrates the benefits of exploiting the correlation between feature channels by coupling multi-dimensional features via the ℓ21 norm.
- ℓ21 outperforms ℓ2 in most cases, which suggests the benefits of local smoothness adaptivity.
- When ℓ21 and ℓ2 are combined, Elastic GNN (ℓ21 + ℓ2) achieves significantly better performance than the ℓ2, ℓ21 or ℓ1 variant alone in almost all cases. This suggests that ℓ1 and ℓ2-based graph smoothing are complementary to each other, and combining them provides significantly better robustness against adversarial graph attacks.

4.4. Ablation Study

We provide an ablation study to further investigate the adaptive smoothness, sparsity pattern, and convergence of EMP in Elastic GNN, based on three datasets: Cora, CiteSeer and PubMed. In this section, we fix λ1 = 3 and λ2 = 3 for Elastic GNN, and α = 0.1 for APPNP. We fix the learning rate as 0.01, weight decay as 0.0005 and dropout rate as 0.5, since this setting works well for both methods.

Adaptive smoothness. It is expected that ℓ1-based smoothing enhances local smoothness adaptivity by increasing the smoothness along correct edges (connecting nodes with the same label) while lowering the smoothness along wrong edges (connecting nodes with different labels).

Table 1. Classification accuracy (%) on benchmark datasets with 10 random data splits.

Model       | Cora       | CiteSeer   | PubMed     | CS         | Physics    | Computers  | Photo
ChebNet     | 76.3 ± 1.5 | 67.4 ± 1.5 | 75.0 ± 2.0 | 91.8 ± 0.4 | OOM        | 81.0 ± 2.0 | 90.4 ± 1.0
GCN         | 79.6 ± 1.1 | 68.9 ± 1.2 | 77.6 ± 2.3 | 91.6 ± 0.6 | 93.3 ± 0.8 | 79.8 ± 1.6 | 90.3 ± 1.2
GAT         | 80.1 ± 1.2 | 68.9 ± 1.8 | 77.6 ± 2.2 | 91.1 ± 0.5 | 93.3 ± 0.7 | 79.3 ± 2.4 | 89.6 ± 1.6
SGC         | 80.2 ± 1.5 | 68.9 ± 1.3 | 75.5 ± 2.9 | 90.1 ± 1.3 | 93.1 ± 0.6 | 73.0 ± 2.0 | 83.5 ± 2.9
APPNP       | 82.2 ± 1.3 | 70.4 ± 1.2 | 78.9 ± 2.2 | 92.5 ± 0.3 | 93.7 ± 0.7 | 80.1 ± 2.1 | 90.8 ± 1.3
GraphSAGE   | 79.0 ± 1.1 | 67.5 ± 2.0 | 77.6 ± 2.0 | 91.7 ± 0.5 | 92.5 ± 0.8 | 80.7 ± 1.7 | 90.9 ± 1.0
Elastic GNN | 82.7 ± 1.0 | 70.9 ± 1.4 | 79.4 ± 1.8 | 92.5 ± 0.3 | 94.2 ± 0.5 | 80.7 ± 1.8 | 91.3 ± 1.3

Table 2. Classification accuracy (%) under different perturbation rates of adversarial graph attack.
Dataset  | Ptb Rate | GCN        | GAT        | ℓ2         | ℓ1         | ℓ21        | ℓ1 + ℓ2    | ℓ21 + ℓ2
Cora     | 0%       | 83.5 ± 0.4 | 84.0 ± 0.7 | 85.8 ± 0.4 | 85.1 ± 0.5 | 85.3 ± 0.4 | 85.8 ± 0.4 | 85.8 ± 0.4
Cora     | 5%       | 76.6 ± 0.8 | 80.4 ± 0.7 | 81.0 ± 1.0 | 82.3 ± 1.1 | 81.6 ± 1.1 | 81.9 ± 1.4 | 82.2 ± 0.9
Cora     | 10%      | 70.4 ± 1.3 | 75.6 ± 0.6 | 76.3 ± 1.5 | 76.2 ± 1.4 | 77.9 ± 0.9 | 78.2 ± 1.6 | 78.8 ± 1.7
Cora     | 15%      | 65.1 ± 0.7 | 69.8 ± 1.3 | 72.2 ± 0.9 | 73.3 ± 1.3 | 75.7 ± 1.2 | 76.9 ± 0.9 | 77.2 ± 1.6
Cora     | 20%      | 60.0 ± 2.7 | 59.9 ± 0.6 | 67.7 ± 0.7 | 63.7 ± 0.9 | 70.3 ± 1.1 | 67.2 ± 5.3 | 70.5 ± 1.3
CiteSeer | 0%       | 72.0 ± 0.6 | 73.3 ± 0.8 | 73.6 ± 0.9 | 73.2 ± 0.6 | 73.2 ± 0.5 | 73.6 ± 0.6 | 73.8 ± 0.6
CiteSeer | 5%       | 70.9 ± 0.6 | 72.9 ± 0.8 | 72.8 ± 0.5 | 72.8 ± 0.5 | 72.8 ± 0.5 | 73.3 ± 0.6 | 72.9 ± 0.5
CiteSeer | 10%      | 67.6 ± 0.9 | 70.6 ± 0.5 | 70.2 ± 0.6 | 70.8 ± 0.6 | 70.7 ± 1.2 | 72.4 ± 0.9 | 72.6 ± 0.4
CiteSeer | 15%      | 64.5 ± 1.1 | 69.0 ± 1.1 | 70.2 ± 0.6 | 68.1 ± 1.4 | 68.2 ± 1.1 | 71.3 ± 1.5 | 71.9 ± 0.7
CiteSeer | 20%      | 62.0 ± 3.5 | 61.0 ± 1.5 | 64.9 ± 1.0 | 64.7 ± 0.8 | 64.7 ± 0.8 | 64.7 ± 0.8 | 64.7 ± 0.8
Polblogs | 0%       | 95.7 ± 0.4 | 95.4 ± 0.2 | 95.4 ± 0.2 | 95.8 ± 0.3 | 95.8 ± 0.3 | 95.8 ± 0.3 | 95.8 ± 0.3
Polblogs | 5%       | 73.1 ± 0.8 | 83.7 ± 1.5 | 82.8 ± 0.3 | 78.7 ± 0.6 | 78.7 ± 0.7 | 82.8 ± 0.4 | 83.0 ± 0.3
Polblogs | 10%      | 70.7 ± 1.1 | 76.3 ± 0.9 | 73.7 ± 0.3 | 75.2 ± 0.4 | 75.3 ± 0.7 | 81.5 ± 0.2 | 81.6 ± 0.3
Polblogs | 15%      | 65.0 ± 1.9 | 68.8 ± 1.1 | 68.9 ± 0.9 | 72.1 ± 0.9 | 71.5 ± 1.1 | 77.8 ± 0.9 | 78.7 ± 0.5
Polblogs | 20%      | 51.3 ± 1.2 | 51.5 ± 1.6 | 65.5 ± 0.7 | 68.1 ± 0.6 | 68.7 ± 0.7 | 77.4 ± 0.2 | 77.5 ± 0.2
PubMed   | 0%       | 87.2 ± 0.1 | 83.7 ± 0.4 | 88.1 ± 0.1 | 86.7 ± 0.1 | 87.3 ± 0.1 | 88.1 ± 0.1 | 88.1 ± 0.1
PubMed   | 5%       | 83.1 ± 0.1 | 78.0 ± 0.4 | 87.1 ± 0.2 | 86.2 ± 0.1 | 87.0 ± 0.1 | 87.1 ± 0.2 | 87.1 ± 0.2
PubMed   | 10%      | 81.2 ± 0.1 | 74.9 ± 0.4 | 86.6 ± 0.1 | 86.0 ± 0.2 | 86.9 ± 0.2 | 86.3 ± 0.1 | 87.0 ± 0.1
PubMed   | 15%      | 78.7 ± 0.1 | 71.1 ± 0.5 | 85.7 ± 0.2 | 85.4 ± 0.2 | 86.4 ± 0.2 | 85.5 ± 0.1 | 86.4 ± 0.2
PubMed   | 20%      | 77.4 ± 0.2 | 68.2 ± 1.0 | 85.8 ± 0.1 | 85.4 ± 0.1 | 86.4 ± 0.1 | 85.4 ± 0.1 | 86.4 ± 0.1

(GCN and GAT are the basic GNNs; the remaining columns are variants of Elastic GNN.)

To validate this, we compute the average adjacent node differences (based on the node features in the last layer) along wrong and correct edges separately, and use the ratio between these two averages to measure the smoothness adaptivity. The results are summarized in Table 3. It is clearly observed that, for all datasets, the ratio for Elastic GNN is significantly higher than that for the ℓ2-based method APPNP, which validates its better local smoothness adaptivity.

Sparsity pattern. To validate the piecewise constant property enforced by EMP, we also investigate the sparsity pattern of the adjacent node differences, i.e., $\tilde{\Delta} \mathbf{F}$, based on the node features in the last layer. The node difference along edge $e_i$ is defined as sparse if $\|(\tilde{\Delta} \mathbf{F})_i\|_2 < 0.1$. The sparsity ratios for the ℓ2-based method (APPNP) and the ℓ1-based method (Elastic GNN) are summarized in Table 4. It can be observed that in Elastic GNN a significant portion of $\tilde{\Delta} \mathbf{F}$ is sparse for all datasets, while in APPNP this portion is much smaller. This sparsity pattern validates the piecewise constant prior as designed.

Table 3. Ratio between average node differences along wrong and correct edges.

Model                  | Cora | CiteSeer | PubMed
ℓ2 (APPNP)             | 1.57 | 1.35     | 1.43
ℓ21 + ℓ2 (Elastic GNN) | 2.03 | 1.94     | 1.79

Table 4. Sparsity ratio (i.e., $\|(\tilde{\Delta} \mathbf{F})_i\|_2 < 0.1$) in the node differences $\tilde{\Delta} \mathbf{F}$.

Model                  | Cora | CiteSeer | PubMed
ℓ2 (APPNP)             | 2%   | 16%      | 11%
ℓ21 + ℓ2 (Elastic GNN) | 37%  | 74%      | 42%

Convergence of EMP. We provide two additional experiments to demonstrate the impact of the propagation step K on the classification performance and the convergence of the message passing scheme. Figure 2 shows the increase of classification accuracy as the propagation step K increases. It verifies the effectiveness of EMP in improving graph representation learning. It also shows that a small number of propagation steps can already achieve very good performance, so the computation cost of EMP can be small. Figure 3 shows the decrease of the objective value defined in Eq. (8) during the forward message passing process, which verifies the convergence of the proposed EMP as suggested by Theorem 1.
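One way to reproduce a convergence curve like the one in Figure 3, sketched below as an assumption rather than the authors' evaluation code, is to evaluate the ℓ21 elastic objective of Eq. (8) on the intermediate F^k produced by the EMP updates; the helper name and its arguments are hypothetical.

```python
# Sketch for tracking the Eq. (8) objective during EMP propagation (assumed example).
import torch

def elastic_objective_l21(F, X_in, Delta_tilde, L_tilde, lam1=3.0, lam2=3.0):
    """Eq. (8): lam1*||Delta_tilde F||_21 + lam2/2*tr(F^T L_tilde F) + 1/2*||F - X_in||_F^2."""
    tv = (Delta_tilde @ F).norm(dim=1).sum()               # row-wise l2 norms, summed (l21 norm)
    smooth = 0.5 * lam2 * torch.trace(F.T @ L_tilde @ F)
    fidelity = 0.5 * ((F - X_in) ** 2).sum()
    return lam1 * tv + smooth + fidelity

# Recording this value after every EMP step yields a monotone-looking decreasing curve
# once the iterates approach the optimum, consistent with Theorem 1.
```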
Figure 2. Classification accuracy under different propagation steps K (curves for Cora, CiteSeer and PubMed, comparing Elastic GNN and APPNP).

Figure 3. Convergence of the objective value of the problem in Eq. (8) during message passing.

5. Related Work

The design of GNN architectures has mainly been motivated from the spectral domain (Kipf & Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). The message passing scheme (Gilmer et al., 2017; Ma & Tang, 2020) for feature aggregation is one central component of GNNs. Recent works have proven that the message passing in GNNs can be regarded as low-pass graph filtering (Nt & Maehara, 2019; Zhao & Akoglu, 2019). More generally, it has recently been proven that message passing in many GNNs can be unified under the graph signal denoising framework (Ma et al., 2020; Pan et al., 2020; Zhu et al., 2021; Chen et al., 2020). We point out that these schemes intrinsically perform ℓ2-based graph smoothing and typically can be represented as linear smoothers. ℓ1-based graph signal denoising has been explored in graph trend filtering (Wang et al., 2016; Varma et al., 2019), which tends to provide estimators that are k-th order piecewise polynomials over graphs. Graph total variation has also been utilized in semi-supervised learning (Nie et al., 2011; Jung et al., 2016; 2019; Aviles-Rivero et al., 2019), spectral clustering (Bühler & Hein, 2009; Bresson et al., 2013b) and graph cut problems (Szlam & Bresson, 2010; Bresson et al., 2013a). However, it is unclear whether these algorithms can be used to design GNNs. To the best of our knowledge, we make the first such investigation in this work.

6. Conclusion

In this work, we propose to enhance the smoothness adaptivity of GNNs via ℓ1 and ℓ2-based graph smoothing. Through the proposed elastic graph signal estimator, we derive a novel, efficient and general message passing scheme, i.e., elastic message passing (EMP). Integrating the proposed message passing scheme with deep neural networks leads to a family of GNNs, i.e., Elastic GNNs. Extensive experiments on benchmark datasets and adversarially attacked graphs demonstrate the benefits of introducing ℓ1-based graph smoothing in the design of GNNs. The empirical study suggests that ℓ1 and ℓ2-based graph smoothing are complementary to each other, and that the proposed Elastic GNNs have better smoothness adaptivity owing to the integration of ℓ1 and ℓ2-based graph smoothing. We hope the proposed elastic message passing scheme can inspire more powerful GNN architecture designs in the future.

Acknowledgements

This research is supported by the National Science Foundation (NSF) under grant numbers CNS-1815636, IIS-1928278, IIS-1714741, IIS-1845081, IIS-1907704, IIS-1955285, and by the Army Research Office (ARO) under grant number W911NF-21-1-0198. Ming Yan is supported by NSF grant DMS-2012439 and a Facebook Faculty Research Award (Systems for ML).

References

Adamic, L. A. and Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pp. 36-43, 2005.

Aviles-Rivero, A. I., Papadakis, N., Li, R., Alsaleh, S. M., Tan, R. T., and Schönlieb, C.-B. When labelled data hurts: Deep semi-supervised classification with the graph 1-Laplacian. arXiv preprint arXiv:1906.08635, 2019.

Bauschke, H. H.
and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011. ISBN 1441994661.

Bresson, X., Laurent, T., Uminsky, D., and von Brecht, J. H. An adaptive total variation algorithm for computing the balanced cut of a graph, 2013a.

Bresson, X., Laurent, T., Uminsky, D., and von Brecht, J. H. Multiclass total variation clustering. arXiv preprint arXiv:1306.1185, 2013b.

Bühler, T. and Hein, M. Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81-88, 2009.

Chen, P., Huang, J., and Zhang, X. A primal-dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2):025011, 2013.

Chen, S., Eldar, Y. C., and Zhao, L. Graph unrolling networks: Interpretable neural networks for graph signal denoising, 2020.

Chung, F. R. and Graham, F. C. Spectral graph theory. Number 92. American Mathematical Soc., 1997.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3844-3852, 2016.

Elad, M. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.

Entezari, N., Al-Sayouri, S. A., Darvishzadeh, A., and Papalexakis, E. E. All you need is low (rank): Defending against adversarial attacks on graphs. In WSDM, 2020.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263-1272. PMLR, 2017.

Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.

Hastie, T., Tibshirani, R., and Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015. ISBN 1498712169.

Jin, W., Ma, Y., Liu, X., Tang, X., Wang, S., and Tang, J. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 66-74, 2020.

Jung, A., Hero III, A. O., Mara, A., and Jahromi, S. Semi-supervised learning via sparse label propagation. arXiv preprint arXiv:1612.01414, 2016.

Jung, A., Hero III, A. O., Mara, A. C., Jahromi, S., Heimowitz, A., and Eldar, Y. C. Semi-supervised learning in network-structured data via total variation minimization. IEEE Transactions on Signal Processing, 67(24):6256-6269, 2019. doi: 10.1109/TSP.2019.2953593.

Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. ℓ1 trend filtering. SIAM Review, 51(2):339-360, 2009.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997, 2018.

Li, Y., Jin, W., Xu, H., and Tang, J. DeepRobust: A PyTorch library for adversarial attacks and defenses, 2020.

Li, Z. and Yan, M. A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785, 2017.

Loris, I.
and Verhoeven, C. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12):125007, 2011.

Ma, Y. and Tang, J. Deep Learning on Graphs. Cambridge University Press, 2020.

Ma, Y., Liu, X., Zhao, T., Liu, Y., Tang, J., and Shah, N. A unified view on graph neural networks as graph signal denoising. arXiv preprint arXiv:2010.01777, 2020.

Nie, F., Wang, H., Huang, H., and Ding, C. Unsupervised and semi-supervised learning via ℓ1-norm graph. In 2011 International Conference on Computer Vision, pp. 2268-2273. IEEE, 2011.

Nt, H. and Maehara, T. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.

Pan, X., Shiji, S., and Gao, H. A unified framework for convolution-based graph neural networks. https://openreview.net/forum?id=zUMDFb9Bt, 2020.

Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259-268, 1992.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2008.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93-93, 2008.

Sharpnack, J., Singh, A., and Rinaldo, A. Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pp. 1028-1036. PMLR, 2012.

Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.

Szlam, A. and Bresson, X. Total variation, Cheeger cuts. In ICML, 2010.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91-108, 2005.

Tibshirani, R. J. et al. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285-323, 2014.

Varma, R., Lee, H., Kovačević, J., and Chi, Y. Vector-valued graph trend filtering with non-convex penalties. IEEE Transactions on Signal and Information Processing over Networks, 6:48-62, 2019.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Wang, Y.-X., Sharpnack, J., Smola, A. J., and Tibshirani, R. J. Trend filtering on graphs. Journal of Machine Learning Research, 17:1-41, 2016.

Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861-6871. PMLR, 2019.

Zhao, L. and Akoglu, L. PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations, 2019.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. 2004.

Zhu, M., Wang, X., Shi, C., Ji, H., and Cui, P. Interpreting and unifying graph neural networks with an optimization framework, 2021.

Zhu, X. and Ghahramani, Z. Learning from labeled and unlabeled data with label propagation.

Zhu, X. J. Semi-supervised learning literature survey. 2005.

Zügner, D. and Günnemann, S. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412, 2019.

Zügner, D., Akbarnejad, A., and Günnemann, S. Adversarial attacks on neural networks for graph data. In KDD. ACM, 2018.
Appendix for Elastic Graph Neural Networks

A. Data Statistics

The data statistics for the benchmark datasets used in Section 4.2 are summarized in Table 5. The data statistics for the adversarially attacked graphs used in Section 4.3 are summarized in Table 6.

Table 5. Statistics of benchmark datasets.

Dataset          | Classes | Nodes | Edges  | Features | Training Nodes | Validation Nodes | Test Nodes
Cora             | 7       | 2708  | 5278   | 1433     | 20 per class   | 500              | 1000
CiteSeer         | 6       | 3327  | 4552   | 3703     | 20 per class   | 500              | 1000
PubMed           | 3       | 19717 | 44324  | 500      | 20 per class   | 500              | 1000
Coauthor CS      | 15      | 18333 | 81894  | 6805     | 20 per class   | 30 per class     | Rest of nodes
Coauthor Physics | 5       | 34493 | 247962 | 8415     | 20 per class   | 30 per class     | Rest of nodes
Amazon Computers | 10      | 13381 | 245778 | 767      | 20 per class   | 30 per class     | Rest of nodes
Amazon Photo     | 8       | 7487  | 119043 | 745      | 20 per class   | 30 per class     | Rest of nodes

Table 6. Dataset statistics for the adversarially attacked graphs.

Dataset  | Nodes (LCC) | Edges (LCC) | Classes | Features
Cora     | 2,485       | 5,069       | 7       | 1,433
CiteSeer | 2,110       | 3,668       | 6       | 3,703
Polblogs | 1,222       | 16,714      | 2       | /
PubMed   | 19,717      | 44,338      | 3       | 500

B. Convergence Guarantee

We provide Theorem 1 to show the convergence guarantee of the proposed elastic message passing scheme and to give practical guidance for the parameter settings in EMP.

Theorem 1 (Convergence of EMP). Under the stepsize setting $\gamma < \frac{2}{1 + \lambda_2 \|\tilde{\mathbf{L}}\|_2}$ and $\beta \le \frac{4}{3 \gamma \|\tilde{\Delta}\|_2^2}$, the elastic message passing scheme (EMP) in Figure 1 converges to the optimal solution of the elastic graph signal estimator defined in (7) (Option I) or (8) (Option II). It is sufficient to choose any $\gamma < \frac{2}{1 + 2\lambda_2}$ and $\beta \le \frac{2}{3\gamma}$ since $\|\tilde{\mathbf{L}}\|_2 = \|\tilde{\Delta}\|_2^2 = \|\tilde{\Delta} \tilde{\Delta}^\top\|_2 \le 2$.

Proof. We first consider the general problem

$\min_{\mathbf{F}} \; f(\mathbf{F}) + g(\mathbf{B} \mathbf{F}),$   (22)

where $f$ and $g$ are convex functions and $\mathbf{B}$ is a bounded linear operator. It is proved in (Loris & Verhoeven, 2011; Chen et al., 2013) that the iterations in (10)-(12) guarantee the convergence of $\mathbf{F}^k$ to the optimal solution of the minimization problem (22) if the parameters satisfy $\gamma < \frac{2}{L}$ and $\beta \le \frac{1}{\gamma \lambda_{\max}(\mathbf{B} \mathbf{B}^\top)}$, where $L$ is the Lipschitz constant of $\nabla f(\mathbf{F})$. These conditions are further relaxed to $\gamma < \frac{2}{L}$ and $\beta \le \frac{4}{3 \gamma \lambda_{\max}(\mathbf{B} \mathbf{B}^\top)}$ in (Li & Yan, 2017). For the specific problems defined in (7) and (8), the two function components $f$ and $g$ are both convex, and the linear operator $\tilde{\Delta}$ is bounded. The Lipschitz constant of $\nabla f(\mathbf{F})$ can be computed from the largest eigenvalue of the Hessian matrix of $f(\mathbf{F})$:

$L = \lambda_{\max}(\nabla^2 f(\mathbf{F})) = \lambda_{\max}(\mathbf{I} + \lambda_2 \tilde{\mathbf{L}}) = 1 + \lambda_2 \|\tilde{\mathbf{L}}\|_2.$

Therefore, the elastic message passing scheme derived from iterations (10)-(12) is guaranteed to converge to the optimal solution of problem (7) (Option I) or problem (8) (Option II) if the stepsizes satisfy $\gamma < \frac{2}{1 + \lambda_2 \|\tilde{\mathbf{L}}\|_2}$ and $\beta \le \frac{4}{3 \gamma \|\tilde{\Delta}\|_2^2}$.

Let $\tilde{\Delta} = \mathbf{U} \Sigma \mathbf{V}^\top$ be the singular value decomposition of $\tilde{\Delta}$. We derive

$\|\tilde{\Delta} \tilde{\Delta}^\top\|_2 = \|\mathbf{U} \Sigma \mathbf{V}^\top \mathbf{V} \Sigma \mathbf{U}^\top\|_2 = \|\mathbf{U} \Sigma^2 \mathbf{U}^\top\|_2 = \|\mathbf{V} \Sigma^2 \mathbf{V}^\top\|_2 = \|\mathbf{V} \Sigma \mathbf{U}^\top \mathbf{U} \Sigma \mathbf{V}^\top\|_2 = \|\tilde{\Delta}^\top \tilde{\Delta}\|_2.$

The equivalence $\tilde{\mathbf{L}} = \tilde{\Delta}^\top \tilde{\Delta}$ in (6) further gives $\|\tilde{\mathbf{L}}\|_2 = \|\tilde{\Delta}^\top \tilde{\Delta}\|_2 = \|\tilde{\Delta} \tilde{\Delta}^\top\|_2 = \|\tilde{\Delta}\|_2^2$. Since $\|\tilde{\mathbf{L}}\|_2 \le 2$ (Chung & Graham, 1997), we have $\frac{2}{1 + 2\lambda_2} \le \frac{2}{1 + \lambda_2 \|\tilde{\mathbf{L}}\|_2}$ and $\frac{2}{3\gamma} \le \frac{4}{3 \gamma \|\tilde{\Delta}\|_2^2}$. Therefore, $\gamma < \frac{2}{1 + 2\lambda_2}$ and $\beta \le \frac{2}{3\gamma}$ are sufficient for the convergence of EMP.
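As a quick numeric sanity check of the quantities used in the proof (an illustration on an assumed toy graph, not part of the paper), one can verify ‖L̃‖₂ = ‖Δ̃‖₂² ≤ 2 and confirm that the stepsizes γ = 1/(1+λ2) and β = 1/(2γ) used in the experiments satisfy the sufficient conditions:

```python
# Numeric check of ||L_tilde||_2 = ||Delta_tilde||_2^2 <= 2 and of the stepsizes (assumed toy graph).
import torch

edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)
A = torch.zeros(n, n)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + torch.eye(n)
D_inv_sqrt = torch.diag(A_hat.sum(1).rsqrt())
L_tilde = torch.eye(n) - D_inv_sqrt @ A_hat @ D_inv_sqrt

Delta = torch.zeros(m, n)
for l, (i, j) in enumerate(edges):
    Delta[l, i], Delta[l, j] = -1.0, 1.0
Delta_tilde = Delta @ D_inv_sqrt

L_norm = torch.linalg.matrix_norm(L_tilde, ord=2)              # spectral norm of L_tilde
D_norm_sq = torch.linalg.matrix_norm(Delta_tilde, ord=2) ** 2  # largest singular value squared
assert torch.allclose(L_norm, D_norm_sq, atol=1e-5)
assert L_norm <= 2.0

lam2 = 3.0
gamma = 1.0 / (1.0 + lam2)   # 0.25 < 2/(1 + 2*lam2) = 2/7, so the gamma condition holds
beta = 1.0 / (2.0 * gamma)   # 2.0 <= 2/(3*gamma) = 8/3, so the beta condition holds
```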