# implicit_vs_unfolded_graph_neural_networks__526850a9.pdf

Journal of Machine Learning Research 26 (2025) 1-46 Submitted 5/22; Revised 1/25; Published 5/25

Implicit vs Unfolded Graph Neural Networks

Yongyi Yang YONGYI@UMICH.EDU University of Michigan Ann Arbor, United States

Tang Liu CNLIUTANG@GMAIL.COM Alibaba Hangzhou, China

Yangkun Wang ESPYLAPIZA@GMAIL.COM University of California, San Diego La Jolla, United States

Zengfeng Huang HUANGZF@FUDAN.EDU.CN Fudan University Shanghai, China

David Wipf DAVIDWIPF@GMAIL.COM Amazon Web Services Shanghai, China

Editor: Pradeep Ravikumar

It has been observed that message-passing graph neural networks (GNN) sometimes struggle to maintain a healthy balance between the efficient/scalable modeling of long-range dependencies across nodes while avoiding unintended consequences such oversmoothed node representations, sensitivity to spurious edges, or inadequate model interpretability. To address these and other issues, two separate strategies have recently been proposed, namely implicit and unfolded GNNs (that we abbreviate to IGNN and UGNN respectively). The former treats node representations as the fixed points of a deep equilibrium model that can efficiently facilitate arbitrary implicit propagation across the graph with a fixed memory footprint. In contrast, the latter involves treating graph propagation as unfolded descent iterations as applied to some graph-regularized energy function. While motivated differently, in this paper we carefully quantify explicit situations where the solutions they produce are equivalent and others where their properties sharply diverge. This includes the analysis of convergence, representational capacity, and interpretability. In support of this analysis, we also provide empirical head-to-head comparisons across multiple synthetic and public realworld node classification benchmarks. These results indicate that while IGNN is substantially more memory-efficient, UGNN models support unique, integrated graph attention mechanisms and propagation rules that can achieve strong node classification accuracy across disparate regimes such as adversarially-perturbed graphs, graphs with heterophily, and graphs involving long-range dependencies.

Keywords: Graph Neural Networks, Graph-Regularized Energy Functions, Algorithm Unfolding, Bilevel Optimization, Implicit Deep Learning.

2025 Yongyi Yang, Tang Liu, Yangkun Wang, Zengfeng Huang, David Wipf.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0459.html.

YANG, LIU, WANG, HUANG, WIPF

1. Introduction

Given graph data with node features, graph neural networks (GNNs) represent an effective way of exploiting relationships among these features to predict labeled quantities of interest, e.g., node classification (Wu et al., 2020; Zhou et al., 2018). In particular, each layer of a message-passing GNN (our focus) is constructed by bundling a graph propagation step with an aggregation function such that information can be shared between neighboring nodes to an extent determined by network depth. (Kipf and Welling, 2017; Hamilton et al., 2017b; Kearnes et al., 2016; Velickovic et al., 2018). For sparsely-labeled graphs, or graphs with entity relations reflecting long-range dependencies, it is often desirable to propagate signals arbitrary distances across nodes, but without expensive memory requirements during training, oversmoothing effects (Oono and Suzuki, 2020; Li et al., 2018), or disproportionate sensitivity to bad/spurious edges (Zhu et al., 2020b,a; Z ugner et al., 2019; Z ugner and G unnemann, 2019). Moreover, to the extent possible we would like such propagation to follow interpretable patterns that can be adjusted in predictable ways to problem-specific contexts (e.g., graphs possessing heterophily). To address these issues, at least in part, two distinct strategies have recently been proposed that will form the basis of our study.1 First, the framework of implicit deep learning (Bai et al., 2019; El Ghaoui et al., 2020) has been applied to producing supervised node embeddings that satisfy an equilibrium criteria instantiated through a graph-dependent fixed-point equation. The resulting so-called implicit GNN (IGNN) pipeline (Gu et al., 2020), as well as related antecedents (Dai et al., 2018; Gallicchio and Micheli, 2020) and descendants (Baker et al., 2023; Fu et al., 2023; Liu et al., 2021a), mimic the behavior of a GNN model with arbitrary depth for handling long-range dependencies, but is nonetheless trainable via a fixed memory budget and robust against oversmoothing effects. See Figure 1 for a high-level conceptual depiction of IGNNs. Secondly, unfolded GNN (UGNN) architectures are formed via graph propagation layers patterned after the unfolded descent iterations of some graph-regularized energy function (Chen and Eldar, 2021; Di Giovanni et al., 2023; Liu et al., 2021b; Ma et al., 2020; Pan et al., 2021; Xue et al., 2023; Yang et al., 2021; Zhang et al., 2020; Zhu et al., 2021); Figure 1 contains an illustration. In this way, the node embeddings at each UGNN layer can be viewed as increasingly refined approximations to an interpretable energy minimizer, which can be nested within a bilevel optimization framework (Wang et al., 2016) for supervised training akin to IGNN. More broadly, similar unfolding strategies have previously been adopted for designing a variety of new deep architectures or interpreting existing ones (Geng et al., 2021; Gregor and Le Cun, 2010; He et al., 2017; Hershey et al., 2014; Sprechmann et al., 2015; Veliˇckovi c et al., 2020; Yang et al., 2022). As such, IGNN and UGNN are closely related in the sense that their node embeddings are intentionally aligned with a meta-criterion: either a fixed-point for IGNN or an energy minimizer for UGNN, where in both cases a useful inductive bias is introduced to address similar desiderata. And yet despite these commonalities, there has thus far been no systematic examination of the meaningful similarities and differences. We take key steps in this direction via the following workflow. First, after introducing the basic setup and notation in Section 2, we overview the IGNN framework in Section 3, including its attractive reliance on memory-efficient implicit differentiation and existing convergence results. Next, Section 4 introduces a general-purpose UGNN framework that encompasses a variety of existing unfolded models as special cases, including models with graph attention, broad graph

1. There are of course many more than just two approaches for tackling these types of issues, e.g., random-walk GNNs, cooperative GNNs, graph rewiring, graph Transformers (Finkelshtein et al., 2024; Karhadkar et al., 2023; T onshoff et al., 2023; Topping et al., 2022; Wu et al., 2023); however, a broader treatment of such methodology remains outside of our primary scope herein.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Figure 1: Illustration of UGNNs and IGNNs. The lefthand side displays IGNN forward pass layers, each of which reduces the distance between the current node embedding Y (k) and a fixed point Y . contrast, the righthand side reveals that, given input features X, each UGNN forward pass layer induces node embeddings Y that form a descent step of an energy function ℓY .

propagation operators, and nonlinear activation functions. Sections 5 6 provide comparative analysis of the relative strengths and weaknesses of IGNN and UGNN models with respect to convergence guarantees and model expressiveness. We conclude with empirical comparisons in Section 7 and scalability considerations in Section 8, while proofs and supplementary details are deferred to the Appendix. Overall, our contributions can be summarized as follows:

Although the original conceptions are unique, we consider a sufficiently broad design space of IGNN and UGNN models such that practically-relevant, interpretable regions of exact overlap can be established, analyzed, and contrasted with key areas of nonconformity.

Within this framework, we compare the relative ability of IGNNs and UGNNs to converge to optimal (or locally-optimal) solutions per various model-specific design criteria. Among other things, this analysis reveals that IGNNs enjoy an advantage in that (unlike UGNNs) their implicit layers do not unavoidably induce symmetric propagation weights. In contrast, we show that UGNNs are more flexible by accommodating a broader class of robust nonlinear graph propagation operators while still guaranteeing at least local convergence. This capability is relevant to handling graphs with certain types of heterophily (meaning nodes sharing an edge may not share similar labels and/or features) and adversarial perturbations.

We investigate the consequences of symmetric (layer-tied) UGNN propagation weights (an issue that has been raised but remains unresolved in the literature). In particular, we prove that with linear activations, a UGNN can reproduce any IGNN representation, and define sufficient conditions for equivalency in broader regimes. We also show that UGNN layers with symmetric, layer-tied weights can exactly mimic arbitrary graph convolutional network (GCN) models (Kipf and Welling, 2017) characterized by asymmetric, layer-specific weights without introducing any additional parameters or significant complexity. Collectively, these results suggest that the weight symmetry enforced by UGNN models may not be overly restrictive in practice.

YANG, LIU, WANG, HUANG, WIPF

Empirically, we provide comprehensive, head-to-head comparisons between equivalently-sized IGNN and UGNN models while evaluating computational complexity and memory costs. These results also serve to complement our analytical findings while showcasing node classification regimes whereby IGNN or UGNN can achieve near-SOTA performance. We also demonstrate how graph heterophily can manifest in fundamentally different ways that UGNN models in particular are naturally equipped to handle.

2. Basic Setup

Consider a graph G = {V, E}, with n = |V| nodes and m = |E| edges. We define L Rn n as the Laplacian of G, meaning L = D A = B B, where D and A are degree and adjacency matrices respectively, while B Rm n is an incidence matrix. We also let e A = A + I (i.e., A with self loops) and denote e D as the corresponding degree matrix. Both IGNNs and UGNNs incorporate graph structure via optimized embeddings Y Rn d that are a function of some adjustable weights W, i.e., Y Y (W) (for simplicity of notation, we will frequently omit including this dependency on W), where by design Y (W)/ W is computable, either implicitly (IGNN) or explicitly (UGNN). We may then insert these embeddings within an application-specific meta-loss given by

i=1 D g [y i (W); θ] , ti , (1)

where g : Rd Rc is some differentiable node-wise function with parameters θ and c-dimensional output tasked with predicting ground-truth node-wise targets ti Rc. Additionally, y i (W) is the i-th row of Y (W), n < n is the number of labeled nodes (we assume w.l.o.g. that the first n nodes are labeled), and D is a discriminator function, e.g., cross-entropy for classification, squared error for regression, etc. Given that Y (W) is differentiable by construction, we can optimize ℓθ (θ, W) via gradient descent to obtain our final predictive model. At a conceptual level then, the only difference between IGNNs and UGNNs is in how the corresponding optimal embeddings Y (W) are motivated and computed. We introduce the specifics of each option in the following two sections.

3. Overview of Implicit GNNs

IGNN models are predicated on the fixed-point update

Y (k+1) = σ h PY (k)Wp + f (X; Wx) i , (2)

where Y (k) are node embeddings after the k-th iteration, σ is a nonlinear activation function, W = {Wp, Wx} is a set of two weight matrices, and P Rn n is an arbitrary graph propagation operator, e.g., P = A; other variants are also obtainable using, for example, graph rewiring (Gasteiger et al., 2019). Additionally, f : Rn d0 Rn d is a base predictor function that converts the initial d0-dimensional node features X Rn d0 into d-dimensional candidate embeddings, e.g., f (X; Wx) = XWx or f (X; Wx) = MLP [X; Wx]. Now assume that σ is a differentiable component-wise non-expansive (CONE) mapping as specified in (Gu et al., 2020). This condition stipulates that σ applies the same function individually

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

to each element of its input, and that this function satisfies x y σ(x) σ(y) for all {x, y} R2 (with some abuse of notation, we will overload the definition of σ when the meaning is clear from context). Furthermore, we assume that the weights Wp are such that λpf(|P Wp|) < 1, where | | denotes the element-wise absolute value and λpf refers to the Peron-Frobenius eigenvalue.2

Under these conditions, it has been shown in (Gu et al., 2020) that limk Y (k) = Y , where Y

satisfies the fixed-point equation Y = σ [PY Wp + f (X; Wx)]. Therefore, as an IGNN forward pass we can iterate (2) K times, where K is sufficiently large, to compute node embeddings Y (K) Y for use within the meta-loss from (1). For the IGNN backward pass, implicit differentiation methods (Bai et al., 2019; El Ghaoui et al., 2020), carefully tailored for handling graph-based models (Gu et al., 2020), can but used to compute gradients of Y with respect to Wp and Wx. Critically, this implicit technique does not require storing each intermediate representation {Y (k)}K k=1 from the forward pass, and hence is quite memory efficient even if K is arbitrarily large. As K can be viewed as quantifying the extent of signal propagation across the graph, IGNNs can naturally capture long-range dependencies with a K-independent memory budget unlike more traditional techniques.

In follow-up work from (Liu et al., 2021a) a special case of (2) is considered whereby σ is an identity mapping and Wp is constrained to be a symmetric matrix W s p . These simplifications facilitate a more stream-lined implementation, referred to as efficient infinite-depth graph neural networks (EIGNN) in (Liu et al., 2021a), while maintaining accurate predictions on various node classification tasks. As we will demonstrate in the next section, the EIGNN fixed point minimizes an explicit energy function allowing this approach to be simultaneously interpreted as a special case of UGNN as well. Additionally, while EIGNN is designed to to economize IGNNs by removing the nonlinear activation, more recent work has targeted increasing IGNN expressivity (Baker et al., 2023) by relaxing the requirement that IGNN weights satisfy the Peron-Frobenius eigenvalue constraint from above in certain settings using monotone operator theory; hence the name MIGNN. And in a related vein, it has also been proposed to instantiate IGNNs (and the propagation P in particular) using a parameterized graph Laplacian operator that may be more suitable for handling potential oversmoothing effects (Fu et al., 2023).

4. A General-Purpose Unfolded GNN Framework

In this section we will first introduce a general-purpose energy function that, when reduced with simplifying assumptions, decomposes into various interpretable special cases that have been proposed in the literature, both as the basis for new UGNN models as well as motivation for canonical GNN architectures. Later, we will describe the generic proximal gradient descent iterations designed to optimize this energy. By construction, these iterations unfold in one-to-one correspondence with the UGNN model layers of each respective architecture under consideration.

4.1 Flexible Energy Function Design

Unlike IGNN models, the starting point of UGNNs is an energy function. In order to accommodate a broad variety of existing graph-regularized objectives and ultimately GNN architectures as special

2. The Peron-Frobenius (PF) eigenvalue is a real, non-negative eigenvalue that exists for all square, non-negative matrices and produces the largest modulus.

YANG, LIU, WANG, HUANG, WIPF

cases, we propose the rather general form

ℓY (Y ; W, f, ρ, e B, ϕ) Y f (X; Wx) 2 Wf +

i=1 ρ h e BY Wp Y e B i

i=1 ϕ (yi; Wϕ) , (3)

where W = {Wx, Wf, Wp, Wϕ} is a set of four weight matrices, ρ : R R is a differentiable function, and e B Rm n is a function of A that can be viewed as a generalized incidence matrix. Furthermore, ϕ : Rd R is an arbitrary function (possibly non-smooth) of the node embeddings that is bounded from below, and the weighted norm W is defined such that X 2 W = tr[X WX]. The above loss is composed of three terms: (i) A quadratic penalty on deviations of the embeddings Y from the node-wise base model f (X; Wx), (ii) a (possibly) non-convex graph-aware penalty that favors smoothness across edges (see illustrative special cases below), and (iii) an arbitrary function of each embedding that is independent of both the graph structure A and the original node features X. As a representative special case, if e B = B, then the second term from (3) satisfies

i=1 ρ h e BY Wp Y e B i

{i,j} E ρ yi yj 2 Wp

which penalizes deviations between the embeddings of two nodes sharing an edge. As has been noted in our prior work (Yang et al., 2021), if ρ is chosen to be concave and non-decreasing on [0, ), the resulting regularization favors robustness to bad edges, and could be useful for handling heterophily graphs or adversarial attacks. Intuitively, this robustness occurs because the penalized differences between the embeddings of two nodes sharing a spurious edge no longer accumulate quadratically, a growth-rate characterized by a well-known sensitivity to outliers (West, 1984).3

Simplifying further, if in addition to the assumption from (4), ϕ is set to zero, ρ is a (non-robust) identity mapping, Wf = I, and Wp = λI with λ > 0, then (3) collapses to the more familiar loss

ℓY (Y ; Wx, f) Y f (X; Wx) 2 F + λtr h Y LY i , (5)

as originally proposed in (Zhou et al., 2004) to smooth base predictions from f (X; Wx) using a quadratic, graph-aware regularization factor tr Y LY = P {i,j} E yi yj 2. This construction has also formed the basis of a wide variety of work linking different GNN architectures (Ma et al., 2020; Pan et al., 2021; Xue et al., 2023; Zhang et al., 2020; Zhu et al., 2021); more details to follow in Section 4.2. Similarly, if we allow for broader choices of e B = B with e L e B e B = π(A) for some function π : Sn Sn over the space of n-dimensional PSD matrices Sn, then (5) can be generalized by swapping L for this e L as proposed in (Ioannidis et al., 2018; Fu et al., 2023). Candidates for e L include normalized graph Laplacians (Von Luxburg, 2007) and various diffusion kernels designed according to application-specific requirements, e.g., graph signal denoising or graph rewiring (Gasteiger et al., 2019; Klicpera et al., 2019b). Moreover, if we reintroduce Wp = I, then it can be shown that the EIGNN fixed point (assuming suitable hyperparameter settings) minimizes the resulting generalized version of (5); see Appendix for specifics of the derivation as the original EIGNN algorithm (Liu et al., 2021a) is not motivated in this way.

3. In the present context, this would imply that spurious edges (e.g., that were errantly included in the graph) or which link nodes with neither label nor features in common, could dominate the resulting loss leading to poor performance on downstream tasks. Meanwhile, concave regularization grows slowly once the embeddings are already sufficiently separated, muting the impact of outliers.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

4.2 Descent Iterations That Form UGNN Layers

We now derive descent iterations for the general loss from the previous section (and later more interpretable special cases) that will be mapped to UGNN layers. For this purpose, let U(k) denote a single gradient descent step along (3), excluding the possibly non-differentiable ϕ-dependent term, and evaluated at some candidate point Y (k). We may compute such an abridged gradient step as4

U(k) = Y (k) α h e B Γ(k) e BY (k) Wp + W p + Y (k) Wf + W f f (X; W) i , (6)

where α is the step size and Γ(k) is a diagonal matrix with i-th diagonal element given by

γ(k) e = ρ (z)

z=diag h e BY (k)Wp(Y (k)) e B i

and e {1, . . . , m} is an index for the e-th edge in E. We then have the following:

Lemma 1 If ρ has Lipschitz continuous gradients, then the proximal gradient update

Y (k+1) = proxϕ U(k) arg min Y

" 1 2α U(k) Y 2 F + X

i ϕ (yi; Wϕ)

is guaranteed to satisfy ℓY (Y (k+1); W, f, ρ, e B, ϕ) ℓY (Y (k); W, f, ρ, e B, ϕ) for any α (0, 1/L], where L is the Lipschitz constant for gradients of (3) w.r.t. Y , excluding the non-smooth ϕ term.

The function proxϕ : Rd Rd is known as the proximal operator5 associated with ϕ (separably extended across all n nodes in the graph), and the proof follows from basic properties of gradient descent (Bertsekas, 1999) and proximal methods (Combettes and Pesquet, 2011); see Appendix. Generally speaking, the mapping Y (k) 7 U(k) from (6) can be viewed as a (possibly nonlinear) graph filter, while proxϕ from (8) serves as an activation function applied to the embedding of each node. Collectively then, Y (k+1) = proxϕ U(k) provides a flexible template for UGNN layers that naturally reduces to common GNN architectures per various design choices. And analogous to IGNN, we can execute K UGNN steps to approximate some Y , which could be a fixed-point, stationary point, or global optimum; more on this in Sections 4.3 and 5 below. For example, consider the selection P i ϕ (yi; Wϕ) = P i,j I [yij < 0], where yij is the (i, j)-th element of Y and I is an indicator function that assigns an infinite penalty to any yij < 0. The proximal operator then becomes proxϕ(U) = Re LU(U), i.e., a Re LU function that element-wise sets negative values to zero. If we add this ϕ-dependent term to (5) (as a special case of (3)), the resulting update becomes

Y (k+1) = proxϕ U(k) = Re LU Y (k) α h (λL + I) Y (k) f (X; W) i . (9)

And when combined with observations from (Ma et al., 2020; Pan et al., 2021; Zhang et al., 2020; Zhu et al., 2021),6 if we initialize with Y (0) = f (X; W) = XW and apply simple reparameterizations, the first iteration reduces to the canonical GCN layer Y (1) = Re LU h e D 1/2 e A e D 1/2XW i

4. Note that here w.l.o.g. we have absorbed an extra Wf + W f factor into f(X; W). 5. If (8) happens to have multiple solutions, then the proximal operator selects one of them so as to remain a proper deterministic function of its argument, which is relevant for forming later connections with IGNN. 6. While these works do not consider the treatment of nonlinear activations, they do nicely introduce how the embedded linear gradient step U (k) relates to GCN filters.

YANG, LIU, WANG, HUANG, WIPF

from (Kipf and Welling, 2017). Subsequent iterations are no longer equivalent, although the inherent differences (e.g., the skip connections from the input layer) are useful for avoiding oversmoothing effects that can at times hamper deep GCN models (Oono and Suzuki, 2020; Li et al., 2018; Rong et al., 2020). Additionally, the widely-used APPNP architecture (Klicpera et al., 2019a) is also a special case of (9) when the Re LU operator is removed (i.e., ϕ is zero), α = 1 1+λ, and L is changed to the symmetric normalized Laplacian (Von Luxburg, 2007). And as a final representative example, if we adopt a nonlinear choice for ρ, then L in (9) will be replaced by L(k) B Γ(k)B, where Γ(k) rescales graph edges. As has been discussed in (Yang et al., 2021), this instantiates a form of graph attention, whereby the attention weights Γ(k) produced by a concave, non-decreasing ρ add robustness to spurious edges. To see why this happens, note that if ρ is concave and non-decreasing, then each γ(k) e computed via (7) will necessarily be a non-increasing function of z. Furthermore, taking Wp = I

and e B = B for simplicity of exposition, then z = diag h e BY (k)Wp Y (k) e B i

e = yi yj 2 2 for some (i, j) E. Hence if the difference between yi and yj becomes large, as may be expected

across a spurious edge, the corresponding attention weight γ(k) e will necessarily become small as desired to avoid dominating the loss.

4.3 UGNN Fixed Points and Connections with IGNN

While UGNN layers are formed from the descent steps of a graph-regularized energy, for a more direct comparison with IGNN, it is illuminating to consider the situation where the update Y (k+1) = proxϕ U(k) is iterated with k becoming sufficiently large. In this case, we may ideally reach a fixed point Y that satisfies

Y = proxϕ Y α h e B Γ e BY Wp + W p + Y Wf + W f f (X; W) i

= proxϕ Y I αW s f αW s p + α h I e B Γ e B i Y W s p + αf (X; W) , (10)

where W s p Wp + W p and W s f Wf + W f are symmetric weight matrices, and Γ is defined analogously to (7). It now follows that if W s f = I W s p , α = 1 and Γ = I (i.e., ρ is an identity

mapping), and we define the propagation operator as P = I e B e B I e L, then (10) reduces to

Y = proxϕ PY W s p + f (X; W) . (11)

This expression is exactly equivalent to the IGNN fixed point from Section 3 when we set σ to the proximal operator of ϕ and we restrict Wp to be symmetric. For direct head-to-head comparisons then, it is instructive to consider what CONE activation functions σ can actually be expressed as the proximal operator of some penalty ϕ. In this regard we have the following:

Lemma 2 A continuous CONE function σ : R R can be expressed as the proximal operator of some function ϕ iff σ is also non-decreasing.

The proof follows from the analysis in (Gribonval and Nikolova, 2020). This result implies that, at least with respect to allowances for decreasing activation functions σ, IGNN is more flexible than UGNN. However, the relative flexibility of IGNN vs UGNN fixed points becomes much more nuanced once we take into account convergence considerations. For example, the proximal gradient iterations

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

from Section 4.2 do not require that proxϕ is non-expansive or CONE to guarantee descent, although it thus far remains ambiguous how this relates to obtainable fixed points of UGNN. To this end, in Section 5 we will evaluate when such fixed points exist, what properties they have, convergence conditions for reaching them, and importantly for present purposes, how UGNN fixed points relate to their IGNN counterparts.

4.4 Potential for Broader Fixed-Point Equivalency Regimes

Before we tackle formal convergence considerations, a natural question arises regarding the extent to which IGNN and UGNN fixed-points are capable of alignment when we grant greater flexibility to the former. In particular, if for the moment we relax the restriction that IGNN iterations must strictly adhere to (2), we could potentially treat any UGNN fixed-point as the fixed-point of a richer class of IGNN models merely by definition. We would therefore obtain a trivial equivalency, and in this sense, IGNNs could be interpreted as uniformly more expressive than UGNNs.

The unresolved issue though, at least for the purposes of the IGNN/UGNN definitions we have adopted for this paper, is that to actually qualify as an IGNN there must exist a stable/scalable procedure for implicitly computing the gradients needed for back-propagation and model training. In other words, we require implicit differentiation conditioned on a given fixed point in order to achieve the signature K-independent memory budget of IGNNs as reviewed in Section 3. Without this, we are just swapping names and there is diminished reason to differentiate IGNNs and UGNNs in the first place. However, to our knowledge established implicit differentiation schemes applicable to producing GNN-like architectures equipped with graph propagation are all limited to handling fixedpoints induced by (2), with different variants tied to varying choices for σ, P, Wp and f. Generalizing further is non-trivial, necessitating architecture-specific accommodations; see Section D of (Gu et al., 2020) for details of the nuance involved and (Baker et al., 2023) for follow-up effort demonstrating the difficulties of extending further. In fact, even without complex graph structure linking all training instances, achieving stable implicit differentiation is known to be challenging. Hence we adopt (2) for characterizing the design space of IGNN models, and leave broader alternatives, should they arise, to future work.

5. Convergence Comparisons

In this section we first detail situations whereby convergence to a unique fixed point, that also may at times correspond with a unique UGNN global minimum, can be established. In these situations IGNN facilitates stronger guarantees in terms of broader choices for Wp and σ. Later, we examine alternative scenarios whereby UGNN has a distinct advantage with respect to convergence guarantees to potentially local solutions. We remark that prior work examining UGNN iterations has largely been limited to certain special cases of (3), and critically, only established a non-increasing cost function (see for example (Yang et al., 2021)), not actual convergence to a meaningful solution set; the latter requires confirmation of additional non-trivial criteria as we will tackle herein. As such, an ancillary benefit of this section is to ground UGNN models to more comprehensive, elucidating convergence properties.

YANG, LIU, WANG, HUANG, WIPF

5.1 Convergence to Unique Global Solutions

To facilitate the most direct, head-to-head comparison between IGNN and UGNN, we consider a restricted form of the UGNN energy from (3). In particular, let

ℓY (Y ; W, f, ϕ) Y f (X; Wx) 2 Wf + tr h Y e LY Wp i +

i=1 ϕ (yi) , (12)

which is obtainable from (3) when ρ is an identity mapping and ϕ is assumed to have no trainable parameters. We also define Σ I W s f + e L W s p , with minimum and maximum eigenvalues given by λmin(Σ) and λmax(Σ) respectively.

Theorem 3 If proxϕ is non-expansive and λmin(Σ) > 0, then (12) has a unique global minimum. Additionally, starting from any initialization Y (0), the iterations Y (k+1) = proxϕ U(k) applied to (12) with α (0, 2/λmax(Σ)) are guaranteed to converge to this global minimum as k .

Corollary 4 For any non-decreasing CONE function σ and symmetric Wp W s p satisfying W s p 2 < P 1 2 , there will exist a function ϕ such that σ = proxϕ and the fixed point of (11) is the unique global minimum of (12) when Wf is chosen such that W s f = I W s p and P = I e L.

Lemma 5 For any non-expansive σ (not necessarily element-wise) and Wp (not necessarily symmetric) satisfying Wp 2 < P 1 2 , iterating (2) will converge to a unique fixed point.7

Hence from Theorem 3 and Corollary 4, if we are willing to accept symmetric graph propagation weights W s p (and a non-decreasing σ), UGNN enjoys the same convergence guarantee as IGNN, but with the added benefit of an interpretable underlying energy function. In contrast, when Wp is not symmetric as in Lemma 5, we can only guarantee a unique IGNN fixed point, but we are no longer able to establish an association with a specific UGNN energy. In fact, it does not seem likely that any familiar/interpretable functional form can even possibly exist to underlie such a fixed point when Wp is not symmetric. This is largely because of the following simple result:

Lemma 6 There does not exist any second-order smooth function h : Rn d R such that h(Y ; W)/ Y = Y W for all (asymmetric) matrices W Rd d with d > 1 and n 1.

And by continuity arguments, a similar result will apply to many non-smooth functions. Consequently, with asymmetric weights there is no obvious way to associate fixed points with stationary points as we have done for UGNN. Interestingly, while a wide variety of work on unfolding models (including outside of the graph domain) has led to architectures with symmetric or even more restrictive PSD weight constraints (Amos and Kolter, 2017; Di Giovanni et al., 2023; Frecon et al., 2022; Xie et al., 2023; Yang et al., 2022), the apparent inevitability of these constraints remains underexplored.

5.2 Broader Convergence Regimes

As we move to alternative energy functions with nonlinear dependencies on the graph, e.g., ρ not equal to identity, meaningful (albeit possibly weaker) convergence properties for UGNN can still be

7. An analogous result has been shown in (Gu et al., 2020), but restricted to CONE mappings (not arbitrary non-expansive mappings) and with a dependency on the less familiar Peron-Frobenius eigenvalue; see Section 3.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

established. In particular, we consider convergence to local minima, or more generally, stationary points of the underlying objective. However, because (3) may be non-convex and non-smooth, we must precisely define an applicable notion of stationarity. While various possibilities exist, for present purposes we define Y as a stationary point of (3) if 0 F h ℓY (Y ; W, f, ρ, e B, ϕ) i , where F denotes the Fr echet subdifferential. The latter generalizes the standard subdifferential as defined for convex functions, to non-convex cases. More formally, the Fr echet subdifferential (Li et al., 2020a) of some function h(Y ) is defined as the set

F [h(Y )] = n S : h(Y ) h(Z) + tr h S (Y Z) i + o ( Y Z F) Y o , (13)

which is equivalent to the gradient when h is differentiable and the regular subdifferential when h is convex. Loosely speaking, this definition specifies a set of affine functions that approximately lower-bound h(Y ) within a restricted region that is sufficiently close to Y as specified by the o ( Y Z F) term; however, as we move away from Y the affine bound need not hold, which allows for a meaningful, non-empty subdifferential in broader non-convex regimes. Based on this definition, we have the following:

Theorem 7 Assume that Wp and Wf are PSD, ϕ is continuous with a non-empty Fr echet subdifferential for all Y 8 and ρ is a concave, non-decreasing function with Lipschitz-continuous gradients. Additionally, starting from any initialization Y (0), let Y (k) k=0 denote a sequence generated by Y (k+1) = proxϕ U(k) with step size parameter α (0, 1/L]. Then all accumulation points of Y (k) k=0 are stationary points of (3). Furthermore, limk ℓY (Y (k); W, f, ρ, e B, ϕ) = ℓY (Y ; W, f, ρ, e B, ϕ) for some stationary point Y .

The proof is based on Zangwill s global convergence theorem (Luenberger, 1984) and additional results from (Sriperumbudur and Lanckriet, 2009); see Appendix for further details.

Corollary 8 If in addition to the conditions from Theorem 7, ϕ is non-expansive and ρ is chosen such that the composite function ρ h ( )2i is convex, then limk ℓY (Y (k); W, f, ρ, e B, ϕ) =

ℓY (Y ; W, f, ρ, e B, ϕ), where Y is a global minimizer of (3).

These results both apply to situations where the k-dependent UGNN graph propagation operator P (k) I e B Γ(k) e B is nonlinear by virtue of the Y (k)-dependency of Γ(k), i.e., P (k)Y (k)W s p = PY (k)W s p for any fixed P. In contrast, it is unknown (and difficult to determine) if general nonlinear alternatives to PY Wp in (2) will converge.

5.3 Recap of Relative IGNN and UGNN Flexibility

In terms of model expressiveness, the advantage of the IGNN framework is two-fold: (i) it allows for asymmetric weights Wp while still providing a strong convergence guarantee, and (ii) it allows for decreasing activation functions. However, the latter is likely much less relevant in practice, as most deep models adopt some form of non-decreasing activation anyway, e.g., Re LU, etc. In contrast, UGNN models are more flexible than IGNN in their accommodation of: (i) nonlinear graph propagation through the graph attention mechanism that results from a nonlinear ρ as mentioned

8. For minor technical reasons, we also assume ϕ is such that lim Y ℓY (Y ; W, f, ρ, e B, ϕ) = .

YANG, LIU, WANG, HUANG, WIPF

in Section 4.2, and (ii) expansive proximal operators. While the latter may seem like a mere technicality, expansive proximal operators actually undergird a variety of popular sparsity shrinkage penalties, which have been proposed for integration with GNN models (Zheng et al., 2021). For example, the ℓ0 norm and related non-convex regularization factors (Chen et al., 2017b; Fan and Li, 2001) are expansive and can be useful for favoring parsimonious node embeddings. And overall, with only mild assumptions, UGNN at least guarantees cost function descent across a broad class of models per Lemma 1, with even convergence to stationary points possible in relevant situations from Theorem 7 that IGNN cannot emulate.

6. How Limiting Are Symmetric (Layer-Tied) Propagation Weights?

Previously we observed that the primary advantage IGNN has over UGNN, at least in terms of model expressiveness, is that IGNN places no restriction that the propagation weight matrix Wp need be symmetric, although both models ultimately involve an architecture with layer-tied weights unlike typical message-passing models such as GCN. To better understand the implications of this distinction, we will now explore the expressiveness of GNN models with symmetric, layer-tied weights. We remark that resolution of these issues is of growing relevance, as evidenced by a number of recent papers. For example, (Di Giovanni et al., 2023) explores what can be categorized as UGNN models and the emergence of symmetric weights; however, there is no direct analysis of the inevitability of symmetric weights, the expressiveness relative to GNNs with asymmetric weights, nor consideration of the limitations of layer-tied weights. Additionally, relative to our general UGNN framework emanating from (3), their approach excludes ρ and non-linear activations. Critically, without the latter, our Theorem 10 presented below is not possible. In a related vein, (Luo et al., 2022) addresses the expressiveness of weight-tied GNN layers relative to GCN models, although these results are a special case of ours, and symmetric weights are not considered.9 More broadly, symmetric weights naturally emerge when deriving unfolding architectures analogous to common deep feed-forward models such as MLPs, Res Nets, or Transformers (Amos and Kolter, 2017; Frecon et al., 2022; Xie et al., 2023; Yang et al., 2022). However, even these works do not quantify the expressiveness relative to unconstrained weights as we investigate here.

6.1 Fixed-Point Equivalency with Symmetric and Asymmetric Weights

In this section we examine situations whereby UGNN models are able to reproduce, up to some inconsequential transform, the fixed points of an IGNN, even when the latter includes an asymmetric propagation weight matrix Wp. However, because it is challenging to analyze general situations with arbitrary activation functions, we first consider the case where σ is an identity mapping. We then have the following:

Theorem 9 For any Wp Rd d, Wx Rd0 d, X Rn d0, and P that admit a unique IGNN fixed point Y = PY Wp + f(X; Wx), there exists a Y Rn d , Wx Rd d , right-invertible

9. We remark that while (Di Giovanni et al., 2023) and (Luo et al., 2022) represent valuable, complementary contributions that have been independently introduced by others, our work was both posted to arxiv and submitted to JMLR well in advance, and hence should not be viewed as derivative of them. [Note: This footnote is to provide context specifically for reviewing purposes and can be removed later.]

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

transform T Rd d , and symmetric W s p Rd d , such that

Y T = PY TW s p + f(X; Wx) Wx, with Y Y F < ϵ, ϵ > 0. (14)

This result implies that a UGNN model with ϕ = 0 can produce node-wise embeddings Y = Y T capable of matching any IGNN fixed-point with arbitrary precision up to transforms T and Wx. And given that T can be absorbed into the meta-loss output layer g from (1) (which is often a linear layer anyway), and Wx can be absorbed into f, the distinction introduced by these transforms is negligible. Proceeding further, if we allow for nonlinear activation functions σ, we may then consider a more general, parameterized family of penalty functions ϕ(y; Wϕ) such that the resulting proximal operator increases our chances of finding a UGNN fixed point that aligns with IGNN. However, some care must be taken to control UGNN model capacity to reduce the possibility of trivial, degenerate alignments. To this end, we consider proximal operators in the set

Sσ = {proxϕ : y 7 Gσ(Cy)|with {G, C} chosen such that proxϕ is proximal operator.}, (15)

where the matrices G and C can be aggregated into Wϕ. It is then possible to derive sufficient conditions under which UGNN has optimal solutions equivalent to IGNN fixed points (it remains an open question if a necessary condition can be similarly derived). See Appendix for details.

6.2 UGNN Capacity to Match Canonical GCN Architectures

The previous section considered the alignment of fixed points, which are obtainable after executing potentially infinite graph propagation steps, from models with and without symmetric propagation weights. Somewhat differently, this section investigates analogous issues in the context of the possible embeddings obtainable after k steps of graph propagation. Specifically, we evaluate the expressiveness of a canonical GCN model with arbitrary, layer-independent weights versus a UGNN model structured so as to effectively maintain an equivalent capacity (up to a right-invertible linear transformation as discussed above).

Theorem 10 Let Y (k+1) GCN = σ PY (k) GCNW (k) p + βY (k) GCN denote the k-th layer of a GCN, where

W (k) p is a layer-dependent weight matrix, σ = proxϕ for some function ϕ, and β {0, 1} determines whether or not a residual connection is included. Then with α = 1, ρ an identity mapping, and f(X; Wx) set to zero, there will always exist a right-invertible T, initialization Y (0), and symmetric weights W s p and W s f such that the k-th iteration step computed via (8) satisfies Y (k+1)T = Y (k+1) GCN .

Given the close association between GCNs and certain UGNN models, Theorem 10 suggests that, at least in principle, symmetric layer-tied weights may not be prohibitively restrictive in the finite layer regime. And as can be observed from the proof, this equivalency result is possible using a matching parameter budget instantiated through an expanded hidden dimension but constrained via block-sparse weight matrices W s p and W s f . Of course in practice when we extrapolate outside of regimes where the results of this section strictly hold, it remains to be seen if symmetric weight constraints have a significant adverse impact. We explore such considerations next.

YANG, LIU, WANG, HUANG, WIPF

7. Empirical Accuracy Comparisons

In this section we evaluate the predictions of various comparable IGNN and UGNN models across experimental settings designed to complement our theoretical findings and showcase potential strengths and weaknesses that may arise in practice. To this end, we first conduct a series of quasisynthetic tests designed to explore the interplay between the type of graph propagation weights (symmetric as with UGNN vs asymmetric as with IGNN), the graph propagation path length (finite as with UGNN vs approximately infinite as with IGNN), and model expressiveness. We then turn to real-world benchmarks to examine relative capabilities handling long-range dependencies, graph heterophily, and adversarial attacks, all of which represent domains within which either IGNNs, UGNNs, or both have been motivated in prior work. Under these different application scenarios, we also evaluate against popular methods specifically designed for the particular task at hand. As for IGNN/UGNN model architectures, we assume the following:

IGNN baselines: We use (2) as the grounding for IGNN baselines with the following notable specializations. First, when σ is an identity, we refer to the resulting model as EIGNN following (Liu et al., 2021a), and when σ is a Re LU activation with relaxed weight constraints on Wp from (Baker et al., 2023), the model is denoted MIGNN (because of its origins from monotone operator theory). We also remark that beyond these candidates, there is as of yet no more expressive IGNN parameterization that we are aware of that comes equipped with an implementation suitable for stable training and implicit differentiation across large graphs.

UGNN baselines: For UGNN we limit our exploration to various special cases of the general framework introduced in Section 4 and instantiated via the composite update described by (6), (7), and (8). To recap the role of these steps, when conditioned on fixed graph attention (6) computes an affine filter that includes a graph propagation operator (as with IGNNs), (7) serves as a form of nonlinear graph attention operator (unique to currently-available UGNNs), and (8) produces the nonlinear activation function for each layer (analogous to σ for IGNNs). While UGNNs also accommodate expansive proximal operators beyond σ, we do not consider this flexibility herein.

Further specification of dataset-specific architectural choices and hyperparameters will be provided in the forthcoming subsections, while full experimental details are deferred to the Appendix. All models and testing were implemented using the Deep Graph Library (DGL) (Wang et al., 2019). We also note that portions of these experiments have appeared previously in our conference paper (Yang et al., 2021), which presents a useful UGNN-related modeling framework and supporting DGL-based code that we further augmented for our purposes herein.

7.1 Exploring the Practical Impact of Symmetric Weights

From our analysis in Section 6 it is reasonable to speculate that UGNN symmetric weights are not a significant limitation relative to IGNN when it comes to model expressiveness, potentially even in regimes that deviate from the stated technical conditions, or when UGNN has limited propagation steps as may occur in practical implementations. To explore this hypothesis empirically, we design the following quasi-synthetic experiment: First, we train separate IGNN and UGNN models on the ogbn-arxiv dataset (Hu et al., 2020), where the architectures are equivalent with the exception of Wp and the number of propagation steps (see Appendix for all network and training details). Additionally, because UGNN requires symmetric propagation weights, a matching parameter count can be achieved by simply expanding the UGNN hidden dimension. Once trained, we then generate

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

predictions from each model and treat them as new, ground-truth labels. We next train three additional models separately on both sets of synthetic labels:

1. An IGNN with architecture equivalent to the original used for generating labels,

2. An analogous UGNN, but with limited propagation steps consistent with practical use cases,

3. A hybrid model that can equivalently be viewed as either an IGNN with finite propagation steps, or a UGNN where the restriction on symmetric weights has been relaxed.

For all three models above the number of parameters is a small fraction of the number of labels to mitigate overfitting issues.

Gen Rec IGNN UGNN Hybrid

IGNN 95.7 0.9 89.8 1.9 91.6 1.2 UGNN 91.0 1.5 94.5 0.8 92.3 0.7

Table 1: Accuracy recovering quasi-synthetic labels.

Prediction accuracy results are presented in Table 1, where the rows indicate the label generating models, and the columns denote the recovery models. We first consider the leftmost two columns labeled IGNN and UGNN. As would be expected, the highest recovery accuracy (95.7 and 94.5) occurs when the generating and recovery models are matched such that perfect recovery is theoretically possible by design. But even in the mismatched cases (91.0 and 89.8) the performance is still quite strong, suggesting that a UGNN with symmetric weights and limited propagation steps can reasonably approximate an IGNN, and vice versa. To further probe the source of the modest gap that nonetheless does exist, we next examine the rightmost hybrid column of Table 1. Here we observe that the hybrid model performs somewhat worse recovering IGNN data (91.6 vs 95.7), indicating that truncated propagation steps, potentially more so than symmetric weights, reduces performance when the generating model involves true long-range propagation as with IGNN. Not surprisingly, we also notice that when trained on UGNN data, the hybrid model outperforms IGNN (92.3 vs 91.0) since here there is no benefit to modeling long-range propagation beyond the limit of the truncated UGNN propagation steps. In general though, the fact that the performance of all models varies within a range of a few percentage points, even under these idealized conditions where the data-generating model has maximal advantage, suggests a significant overlap of model expressiveness as might be expected per the analysis in Section 6. Hence UGNN models with symmetric weights and finite propagation steps are likely adequate in many practical scenarios.

7.2 Results Using Long-Dependency Benchmarks

We now turn to experiments involving existing graph benchmarks purported to involve long-range dependencies, a scenario that has been previously used to motivate both UGNN and IGNN models. The modeling goal here is to capture these long-range effects by stably introducing a sufficient number of propagation layers without the risk of oversmoothing, which might otherwise degrade performance. We begin with the synthetic Chains node classification dataset (Gu et al., 2020; Liu

YANG, LIU, WANG, HUANG, WIPF

et al., 2021a) explicitly tailored to evaluate GNN models w.r.t. long-range dependencies. Results of IGNN and UGNN models, as well as baselines SGC (Chen et al., 2020), GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), and SSE (Dai et al., 2018) are shown in Figure 2. Baseline results are from (Gu et al., 2020) without error bars; however, in our own testing we observe the results to be quite stable across trials such that error bars are not necessary for conveying the core message. Note that SSE, like IGNN and to some extent UGNNs, is explicitly designed for capturing long-range dependencies. While the Chains dataset has been shown in the past to differentiate IGNN and SSE from traditional GNN architectures (Gu et al., 2020), we observe here that UGNN is equally capable of matching performance, and hence a more challenging benchmark is needed.

Figure 2: Accuracy comparisons on long-range Chains dataset.

For this purpose we next turn to the Amazon Co-Purchase dataset, a node classification benchmark often used for evaluating long-range dependencies, in part because of sparse labels relative to graph size (Baker et al., 2023; Dai et al., 2018; Gu et al., 2020).10 We adopt the test setup from (Dai et al., 2018; Gu et al., 2020; Baker et al., 2023), and compare performance using different label ratios. We report results from five baselines, namely SGC, GCN, SSE, IGNN, and MIGNN, where the latter is a more expressive version of IGNN. Regardless, UGNN is still able to outperform all of them by a non-trivial margin as shown in Figures 3 and 4. This likely indicates that while some long-range capability is needed to outperform the canonical GNNs, the unbounded extent of IGNNs provides diminishing marginal returns beyond the UGNN, which has greater flexibility in other relevant aspects. We also remark that, to the degree that MIGNN is more expressive than prior

10. An alternative is the Long-Range Graph Benchmark (Dwivedi et al., 2023); unfortunately though, this benchmark is exclusively composed of tiny graphs for solving graph-level prediction problems whereby scalable UGNN and IGNN node-level embeddings are less relevant. Yet another option is PPI (Hamilton et al., 2017a); however, multiple existing methods can already achieve nearly 100% accuracy (Gu et al., 2020; Liu et al., 2021a), so it is less impactful for differentiation.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

IGNNs, this does not translate into improved classification performance, possibly because the core architecture is still the same as discussed in Section 3.

Figure 3: Micro-F1 on Amazon Co-Purchase.

Figure 4: Macro-F1 on Amazon Co-Purchase.

7.3 Results on Heterophily Graphs

A homophily graph occurs when nearby nodes share similar labels and/or features, and messagepassing GNNs are generally well-equipped to handle them. In contrast, heterophily graphs exhibit an opposing phenomena, namely, nodes sharing an edge have a reduced likelihood of having the same label. These complementary notions are often quantified via the homophily ratio

H |{(u, v) E : tu = tv}|

from (Zhu et al., 2020b), where tu and tv are the target labels of nodes u and v respectively. H quantifies the tendency of nodes to be connected with other nodes from the same class. Graphs with H 1 exhibit strong homophily, while conversely, those with H 0 show strong heterophily, indicating that many edges are connecting nodes with different labels. With this in mind, it has been suggested that GNNs with more sensitivity to long-range dependencies might be well-suited for handling heterophily graphs (Liu et al., 2021a; Pei et al., 2019), since nodes sharing the same label may now be concentrated many hops away, i.e., a form of long-range association. But we conjecture that the utility of such long-range GNNs will depend on attributes of the heterophilic graph not directly measured by the homophily ratio. More concretely, the implicit assumption is that if we increase the receptive field of a given node, the percentage of nodes with the same label will be higher, improving model performance.

7.3.1 A CLOSER LOOK AT HETEROPHILY AND LONG-RANGE DEPENDENCIES

To kick the tires on the above assumption, we first compute an additional metric across a variety of heterophily graphs. We name this metric the average same-label distance (ASD), with formal definition given by

c Eu VEv Ludist(u, v), (17)

where Lu denotes the set of nodes with the same label as node u, c is the number of classes as introduced previously, and dist(u, v) counts the shortest path length between nodes u and v. To

YANG, LIU, WANG, HUANG, WIPF

reduce computational complexity, we compute the expectation in (17) by randomly sampling 1000 nodes as u and then averaging. The resulting log-scale ASD values are presented in Figure 5 across four classic heterophily datasets examined in (Pei et al., 2019), Actor, Cornell, Texas, and Wisconsin, and the more recent heterophily benchmarks Roman, Amazon, Minesweeper, Tolokers, and Questions from (Platonov et al., 2023).

Figure 5: Average same-label distance (ASD) on heterophily benchmarks.

Interestingly, there is a clear demarcation between the new heterophily datasets and the classical ones: The new datasets unanimously possess a larger log(ASD) (all log(ASD) > 1) while the classical ones are smaller (all log(ASD) < 1). We will now empirically explore the impact of this distinction on performance. To streamline this treatment, we reference the datasets from (Platonov et al., 2023) with log(ASD) > 1 as ASDsup, while the datasets from (Pei et al., 2019) with log(ASD) < 1 are referenced as ASDinf.

7.3.2 HOW DIFFERENT MANIFESTATIONS

OF HETEROPHILY IMPACT PERFORMANCE

Intuition suggests that incorporating long-range dependencies will be more effective on the ASDsup datasets, since a larger effective receptive field, as occurs with greater numbers of propagation steps over the graph, can potentially lead to aggregation over more nodes with the same label. In contrast, for ASDinf heterophily graphs, same-label nodes may actually be relatively close together, but mixed in with appreciable concentrations of mismatched-labeled nodes. Under these conditions long-range propagation may be less effective and instead, treating edges between mismatched nodes as outliers a preferable alternative. For UGNN models, such treatment can be operationalized by choosing ρ as a concave non-decreasing function as espoused in Section 4.1. And as alluded to in Section 4.2, this leads to an emergent form of graph attention that is capable of selectively pruning bad edges.

To test this hypothesis, we first train both IGNN and UGNN models on the four ASDinf datasets from Section 7.3.1, using the data splits, processed node features, and labels provided by (Pei et al., 2019). Notably, we include UGNN models both with and without a concave ρ function (by without here we just mean that ρ is an inconsequential identity mapping). Node classification accuracy results are presented in Table 2, where for context, we also include results from the additional GNN architectures GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), Graph SAGE (Hamilton et al., 2017b), SOTA heterophily methods GEOM-GCN (Pei et al., 2019) and H2GCN (Zhu et al., 2020b), as well as a baseline MLP as reported in (Zhu et al., 2020b). Two things stand out regarding these results. First, the IGNN, EIGNN, and UGNN (without ρ) models, are comparable or worse than the MLP, subject to some dataset-to-dataset variability, indicating that merely introducing long-range

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

DATASET TEXAS WISCONSIN ACTOR CORNELL AVG.

HOM. RATIO (H) 0.11 0.21 0.22 0.3

GCN 59.46 5.25 59.80 6.99 30.26 0.79 57.03 4.67 51.64 GAT 58.38 4.45 55.29 8.71 26.28 1.73 58.92 3.32 49.72 GRAPHSAGE 82.43 6.14 81.18 5.56 34.23 0.99 75.95 5.01 68.45 GEOM-GCN 67.57 64.12 31.63 60.81 56.03 H2GCN 84.86 6.77 86.67 4.69 35.86 1.03 82.16 4.80 72.39 MLP 81.89 4.78 85.29 3.61 35.76 0.98 81.08 6.37 71.01

IGNN 83.1 5.6 79.2 6.0 35.5 2.9 76.0 6.2 68.45 EIGNN 80.54 5.67 83.72 3.46 34.74 1.19 81.08 5.70 70.02 UGNN(W/O ρ) 81.62 5.51 82.75 7.83 37.10 1.07 83.51 7.30 71.25 UGNN(W/ ρ) 84.59 3.83 86.67 4.19 37.43 1.50 86.76 5.05 73.86

Table 2: Node classification accuracy on ASDinf heterophily graphs. Boldface indicates the best performing model within each block. When equipped with a concave ρ, a simple/generic UGNN model outperforms IGNN models and is competitive with strong baselines explicitly designed for heterophily graphs.

dependencies alone may not necessarily offer a clear-cut advantage over ignoring the graph structure altogether. In contrast, UGNN (with ρ) is uniformly better than the other IGNN/UGNN models, as well as the MLP. This observation is consistent with the notion that heterophily of the ASDinf variety can be mitigated using robust/concave penalization that mutes the impact of edges connecting nodes with mismatched labels. This strategy also allows this UGNN model to even outperform strong GNN architectures explicitly designed for heterophily graphs, e.g., H2GCN. To provide complementary evidence for our hypothesis, we next train IGNN and UGNN models on the five ASDsup benchmarks. With these datasets heterophily manifests more as a longer average distance between nodes sharing a label, rather than spurious edges connected to any given node that can be adequately treated as outliers. As such, we might expect that the effectiveness of ρ will be marginal relative to long-range graph propagations that are tantamount to a large receptive field. Results are shown in Table 3, along with a set of recent heterophily GNN models tested in (Platonov et al., 2023) for reference. These models include H2GCN (Zhu et al., 2020c), GPR-GCN (Chien et al., 2020), FSGNN (Maurya et al., 2022), FAGNCN (Bo et al., 2021), GBK-GNN (Du et al., 2022) and Jacobi Conv (Wang and Zhang, 2022). We also include an MLP as well as GCN and GAT baselines for reference.11

From these results, we observe that all UGNN/IGNN-based models significantly outperform the MLP, suggesting that same labels spread across a wider receptive field can in fact be exploited within the ASDsup datasets. However, the UGNN with concave ρ no longer offers an advantage, in

11. For GCN and GAT we ran the experiments ourselves using the standard implementations from DGL. While (Platonov et al., 2023) also includes results associated with a number of popular baseline GNN architectures (e.g., GCN, GAT, Graph SAGE, GT, and variants thereof etc.), upon inspection of the associated public codebases, we find that the actual implementations involve additional tweaks/enhancements which deviate from the original models. These and other types of modifications may be effective on specific tasks/datasets, and are interesting to consider; however, they can be applied across a wide variety of GNN architectures and are well outside of our analysis-driven priorities here.

YANG, LIU, WANG, HUANG, WIPF

that IGNN and UGNN (regardless of ρ) perform similarly. Additionally, both IGNN and UGNN are competitive with the best performing heterophily-motivated GNNs, e.g., FSGNN and GBK-GNN.

DATASET ROMAN AMAZON MINESWEEPER TOLOKERS QUESTIONS AVG.

GCN 53.31 1.55 49.86 0.21 77.49 0.60 76.53 0.11 76.61 0.41 66.76 GAT 71.31 0.55 50.52 0.52 88.86 0.51 75.72 6.19 74.34 0.38 72.15 H2GCN 60.11 0.52 36.47 0.23 89.71 0.31 73.35 1.01 63.59 1.46 64.65 GPR-GNN 64.85 0.27 44.88 0.34 86.24 0.61 72.94 0.97 55.48 0.91 64.88 FSGNN 79.92 0.56 52.74 0.83 90.08 0.70 82.76 0.61 78.86 0.92 76.87 FAGCN 65.22 0.56 44.12 0.30 88.17 0.73 77.75 1.05 77.24 1.26 70.50 GBK-GNN 74.57 0.47 45.98 0.71 90.85 0.58 81.01 0.67 74.47 0.86 73.38 JACOBICONV 71.14 0.42 43.55 0.48 89.66 0.40 68.66 0.65 73.88 1.16 69.38 MLP 65.88 0.38 45.90 0.52 50.89 1.39 72.95 1.06 70.34 0.76 61.19

IGNN 84.04 0.85 51.79 0.46 90.73 1.45 82.06 0.72 75.44 0.45 76.81 EIGNN 75.72 0.29 45.27 0.78 88.17 0.66 79.90 0.76 69.13 0.65 71.64 UGNN(W/O ρ) 82.75 1.71 50.99 0.37 91.41 0.20 85.62 0.17 75.83 0.40 77.32 UGNN(W/ ρ) 83.54 0.47 53.38 0.35 87.58 0.08 84.83 0.25 75.21 0.21 76.91

Table 3: Node classification accuracy on ASDsup heterophily graphs. Boldface indicates the best performing model within each block.

7.4 Adversarial Attack Results

As demonstrated in the previous section, when heterophily implicitly manifests as graphs with spurious edges, UGNN models with a concave ρ provide a unique advantage over IGNN models. To further explore this advantage, we turn to an application scenario whereby spurious edges are explicit, namely, certain forms of adversarial attacks. To this end, we compare the models using graph data corrupted via the the Mettack algorithm (Z ugner and G unnemann, 2019). Mettack operates by perturbing graph edges with the aim of maximally reducing the node classification accuracy of a surrogate GNN model that is amenable to adversarial attack optimization. And the design is such that this reduction is generally transferable to other GNN models trained with the same perturbed graph. In terms of experimental design, we follow the exact same non-targeted attack scenario from (Zhang and Zitnik, 2020), setting the edge perturbation rate to 20%, adopting the Meta-Self training strategy, and a GCN as the surrogate model.

MODEL ATK-CORA ATK-CITESEER AVG.

SURROGATE (GCN) 57.38 1.42 60.42 1.48 58.90

GNNGUARD 70.46 1.03 65.20 1.84 67.83 GNN-JACCARD 64.51 1.35 63.38 1.31 63.95 GNN-SVD 66.45 0.76 65.34 1.00 65.90

EIGNN 65.16 0.78 68.36 1.01 66.76 IGNN 61.47 2.30 64.70 1.26 63.09 UGNN(W/O ρ) 62.89 1.59 63.83 1.95 63.36 UGNN(W/ ρ) 70.23 1.09 70.63 0.93 70.43

Table 4: Node classification accuracy under Adversarial attacks. Boldface indicates the best performing model within each block.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

As baselines for reference, we use three strong defense models, namely GNNGuard (Zhang and Zitnik, 2020), GNN-Jaccard (Wu et al., 2019) and GNN-SVD (Entezari et al., 2020). Results on Cora and Citeseer data are reported in Table 4. As in Table 2, we again observe that UGNN with ρ outperforms the IGNN models as well as UGNN without ρ. And somewhat surprisingly, UGNN with ρ also performs comparably or better than strong existing models that were meticulously designed to defend against adversarial attacks. These result further demonstrate the UGNN advantage in terms of robustness relative to presently-available IGNN models.

8. Scalability Considerations

In this section we provide complexity analysis and supporting system efficiency empirical comparisons for full-graph training with UGNN and IGNN models. We then conclude with a discussion of sampling-based and lazy training methods that can be applied to further speed up UGNN models.

8.1 Complexity Analysis

Time Complexity: The time complexity of UGNN is O(md K + Nnd2), where m is the number of edges, n is the number of nodes, K is the number of propagation steps, N is the total number of MLP layers and d is the largest hidden size. As a useful reference point, the time complexity of a standard K-layer GCN or Graph SAGE model is O(md K + Knd2). Moreover, if the UGNN MLP layers are all after propagation (i.e., no parameters before propagation), the time complexity can be reduced to O(Nnd2) by precomputing the propagation (i.e., the same as an MLP). For IGNN (as well as MIGNN), the time complexity depends on the number of iterations needed to drive (2) to an approximate fixed point, which relates to certain properties of the weight matrix Wp. However, the official IGNN implementation12 executes propagation steps until either acceptable error rate or an upper threshold is reached. If the number of iteration steps used to approximate the fixed point is K, then the IGNN time complexity is O(md K + Nnd2), which is essentially the same as UGNN (and other message passing-based GNNs), although generally the value of K will be much higher. That being said, the computational complexity of the IGNN backward pass alone is constant with respect to K, although this does not change the overall complexity order. For EIGNN, based on its linearity, the overall/composite graph propagation matrix can actually be pre-computed in O(n3) time. In this way, it does not need to be recomputed during training, which reduces the training time complexity by a factor of O(Nnd2). However, for large-sized graphs often encountered in practice, an O(n3) computation can be a prohibitive overhead, which would also need to be recomputed any time a propagation-related hyper-parameter is modified.

Memory Complexity: Absent sampling-based training methods to be discussed further in Section 8.3 below, the memory complexity of UGNN is O(md(K + N)); a comparable K-layer GCN or Graph SAGE model would require O(md K). Meanwhile the IGNN memory complexity is only O(md), which is constant with respect to K (a key advantage).

8.2 Empirical Comparisons

In terms of time and memory consumption in practice, we compare UGNN and IGNN on the Amazon Co-Purchase benchmark, which has been advocated in (Gu et al., 2020) as a suitable data source for testing IGNN. Results are shown in Figures 6 and 7 based on executing 100 steps of training and

12. https://github.com/Swiftie H/IGNN

YANG, LIU, WANG, HUANG, WIPF

evaluation on a single Tesla T4 GPU. Clearly, IGNN maintains a huge advantage in terms of memory consumption, while UGNN has a faster runtime provided the number of propagation steps is not too large. Of course these results are based on full-graph training, so it is worth considering additional scalability measures as described next.

Figure 6: Memory cost

Figure 7: Running time

8.3 Directions for Additional Scalability

We are presently unaware of existing work to further scale IGNN models to handle huge real-world graphs, for example, using sampling-based methods (Chen et al., 2018; Chiang et al., 2019; Hamilton et al., 2017a; Ying et al., 2018; Zou et al., 2019) or historical embedding caches (Chen et al., 2017a; Fey et al., 2021) as have been applied to typical message-passing GNNs. While in principle such methods may be applicable, the stability of implicit differentiation would need to be maintained, and present IGNN work thus far has largely focused on full graph training. In contrast, there exist two recent orthogonal directions explicitly designed for scaling UGNNs. First, lazy training methods, combined with sampling and a historical embedding cache, have been developed such that UGNN complexity is largely decoupled from model depth and far more efficient than full-graph training (Xue et al., 2023). As a second approach, UGNN models have been combined with offline sampling (Zeng et al., 2021; Gasteiger et al., 2022) such that training can proceed on the very largest publicly-available benchmarks (Jiang et al., 2023). This includes the 377GB dataset MAG240M from (Hu et al., 2021) and the 1.15TB dataset IGB-full from (Khatua et al., 2023), where UGNN models akin to (9), but integrated with offline sampling, achieve SOTA performance. As opposed to full-graph training, the memory requirement for this approach reduces to O(nsd(K + N)), where ns is the size of the offline sampled subgraphs. A comparable GCN or Graph SAGE model would utilize O(nsd K).

9. Conclusions

This work has closely examined the relationship between IGNNs and UGNNs, shedding light on their similarities and differences. In this regard, representative contributions and take-home messages are as follows:

Existing IGNN models broadly coalesce around the propagation update from (2), with notable differences tied to varying choices for σ, P, Wp, and f. As for UGNN, we have introduced a flexible foundation via the general energy from (3) and the attendant descent steps given by (6), (7), and (8). Collectively, these subsume a wide variety of existing UGNN models, while also motivating previously-unused variants with convergence guarantees and broader coverage of IGNN capabilities.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

IGNNs should in principle have an advantage when long-range propagation and full-graph training are dominant factors leading to high accuracy. However, the practical significance of such an advantage is presently limited for two reasons: (i) Many node-classification benchmarks in common use today do not actually require long-range propagation, and relatively shallow GNNs already perform well; (ii) Even in instances where long-range propagation may be helpful (e.g., certain heterophily graphs, the Amazon Co-Purchase benchmark, etc.), the dataset scales are such that UGNNs can easily be applied as well while maintaining greater flexibility for matching or exceeding the accuracy. Of course these observations are subject to change as new application scenarios emerge, and IGNNs remain an elegant paradigm to consider for future use cases.

We have demonstrated that UGNN symmetric graph propagation weights are essentially unavoidable, while IGNNs have no such restriction. And yet our complementary analytical and empirical results suggest that these symmetric UGNN weights are not a significant hindrance relative to IGNNs, at least in terms of model expressiveness and subsequent node classification accuracy. In contrast, UGNN nonlinear graph propagation operators, as instantiated via a concave ρ and the attention weights from (7), can be advantageous relative to existing IGNNs when unreliable edges are present, e.g., as evidenced by the comparisons in Tables 2 and 4 involving certain types of heterophily graphs and adversarial attacks respectively.

Regarding heterophily in particular, a by-product of our analysis is the differentiation of two heterophily regimes, characterized by distinct distributions of same-label nodes as quantified by the ASD metric in Figure 5. Within the ASDsup regime, both UGNN and IGNN models excel by exploiting long-range dependencies; however, in the ASDinf regime only UGNNs with a concave ρ are competitive.

IGNN has a major advantage in memory costs for full-graph training, since backpropagation does not require storing intermediate node representations across all propagation steps. However, UGNNs have an advantage in terms of computational complexity if the number of propagation steps can be truncated without compromising accuracy. Furthermore, UGNN memory and computation savings are possible with sampling-based methods or lazy training.

UGNN model energy functions in particular provide a natural entry point for inducing versatile GNN architectures that yield predictable results. Natural candidates for extending beyond (3) include the handling of heterogeneous graphs using an energy whose minimization mimics knowledge graph embedding computations (Ahn et al., 2022), or weaving negative samples within an energy function targeting link prediction tasks (Wang et al., 2024). Future work could also consider exploiting UGNN models for the sake of explainability, e.g., disentangling the importance of input features versus network effects in arriving at node-level predictions.

Overall, both models benefit from the inductive biases of their respective design criteria, which frequently overlap but retain important areas of distinction that we have elucidated herein.

YANG, LIU, WANG, HUANG, WIPF

Hongjoon Ahn, Yongyi Yang, Quan Gan, Taesup Moon, and David Wipf. Descent steps of a relationaware energy produce heterogeneous graph neural networks. Advances in Neural Information Processing Systems, 35, 2022.

Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136 145. PMLR, 2017.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. ar Xiv preprint ar Xiv:1909.01377, 2019.

Justin Baker, Qingsong Wang, Cory D. Hauck, and Bao Wang. Implicit graph neural networks: A monotone operator viewpoint. In International Conference on Machine Learning, 2023.

Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. Beyond low-frequency information in graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. ar Xiv preprint ar Xiv:1710.10568, 2017a.

Jie Chen, Tengfei Ma, and Cao Xiao. Fast GCN: fast learning with graph convolutional networks via importance sampling. ar Xiv preprint ar Xiv:1801.10247, 2018.

Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119, pages 1725 1735, 2020.

Siheng Chen and Yonina C Eldar. Graph signal denoising via unrolling networks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5290 5294, 2021.

Yichen Chen, Dongdong Ge, Mengdi Wang, Zizhuo Wang, Yinyu Ye, and Hao Yin. Strong NPhardness for sparse optimization with concave penalty functions. In International Confernece on Machine Learning, 2017b.

Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 257 266, 2019.

Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. ar Xiv preprint ar Xiv:2006.07988, 2020.

Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185 212. Springer, 2011.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. Learning steady-states of iterative algorithms over graphs. In International Conference on Machine Learning, pages 1106 1114, 2018.

Francesco Di Giovanni, James Rowbottom, Benjamin Paul Chamberlain, Thomas Markovich, and Michael M Bronstein. Understanding convolution on graphs via energies. Transactions on Machine Learning Research, 2023.

Lun Du, Xiaozhou Shi, Qiang Fu, Xiaojun Ma, Hengyu Liu, Shi Han, and Dongmei Zhang. Gbkgnn: Gated bi-kernel graph neural networks for modeling both homophily and heterophily. In Proceedings of the ACM Web Conference 2022, pages 1550 1558, 2022.

Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1 48, 2023.

Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. ar Xiv preprint ar Xiv:1908.06315, 2020.

Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. All you need is low (rank) defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 169 177, 2020.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. JASTA, 96(456):1348 1360, 2001.

Matthias Fey, Jan E Lenssen, Frank Weichert, and Jure Leskovec. GNNAuto Scale: Scalable and expressive graph neural networks via historical embeddings. In International Conference on Machine Learning, pages 3294 3304. PMLR, 2021.

Ben Finkelshtein, Xingyue Huang, Michael Bronstein, and Ismail Ilkan Ceylan. Cooperative graph neural networks. International Conference on Machine Learning, 2024.

Jordan Frecon, Gilles Gasso, Massimiliano Pontil, and Saverio Salzo. Bregman neural networks. In International Conference on Machine Learning, pages 6779 6792. PMLR, 2022.

Guoji Fu, Mohammed Haroon Dupty, Yanfei Dong, and Lee Wee Sun. Implicit graph neural diffusion based on constrained dirichlet energy minimization. ar Xiv preprint ar Xiv:2308.03306, 2023.

Claudio Gallicchio and Alessio Micheli. Fast and deep graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

Johannes Gasteiger, Stefan Weißenberger, and Stephan G unnemann. Diffusion improves graph learning. Advances in neural information processing systems, 32, 2019.

Johannes Gasteiger, Chendi Qian, and Stephan G unnemann. Influence-based mini-batching for graph neural networks. In Learning on Graphs Conference, pages 9 1. PMLR, 2022.

Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021.

YANG, LIU, WANG, HUANG, WIPF

Karol Gregor and Yann Le Cun. Learning fast approximations of sparse coding. In International Conference on Machine Learning, 2010.

R emi Gribonval and Mila Nikolova. A characterization of proximity operators. Journal of Mathematical Imaging and Vision, 62(6):773 789, 2020.

Fangda Gu, Heng Chang, Wenwu Zhu, Somayeh Sojoudi, and Laurent El Ghaoui. Implicit graph neural networks. In Advances in Neural Information Processing Systems, 2020.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017a.

William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025 1035, 2017b.

Hao He, Bo Xin, Satoshi Ikehata, and David Wipf. From bayesian sparsity to gated recurrent nets. In Advances in Neural Information Processing Systems, 2017.

John Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. ar Xiv preprint ar Xiv:1409.2574, 2014.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118 22133, 2020.

Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. ar Xiv:2103.09430, 2021.

Vassilis N Ioannidis, Meng Ma, Athanasios N Nikolakopoulos, Georgios B Giannakis, and Daniel Romero. Kernel-based inference of functions on graphs. In D. Comminiello and J. Principe, editors, Adaptive Learning Methods for Nonlinear System Modeling. Elsevier, 2018.

Haitian Jiang, Renjie Liu, Xiao Yan, Zhenkun Cai, Minjie Wang, and David Wipf. Muse GNN: Interpretable and convergent graph neural network layers at scale. ar Xiv preprint ar Xiv:2310.12457, 2023.

Kedar Karhadkar, Pradeep Kr Banerjee, and Guido Mont ufar. Fo SR: First-order spectral rewiring for addressing oversquashing in gnns. International Conference on Learning Representations, 2023.

Steven M. Kearnes, Kevin Mc Closkey, Marc Berndl, Vijay S. Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des., 30(8):595 608, 2016.

Arpandeep Khatua, Vikram Sharma Mailthody, Bhagyashree Taleka, Tengfei Ma, Xiang Song, and Wen-mei Hwu. Igb: Addressing the gaps in labeling, features, heterogeneity, and size of public graph datasets for deep learning research. ar Xiv preprint ar Xiv:2302.13522, 2023.

D. Kinderlehrer and G. Stampacchia. An Introduction to Variational Inequalities and Their Applications. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM, 3600 Market Street, Floor 6, Philadelphia, PA 19104), 1980. ISBN 9780898719451.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

Johannes Klicpera, Aleksandar Bojchevski, and Stephan G unnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations, 2019a.

Johannes Klicpera, Stefan Weißenberger, and Stephan G unnemann. Diffusion improves graph learning. In Advances in Neural Information Processing Systems, 2019b.

Jiajin Li, Anthony Man-Cho So, and Wing-Kin Ma. Understanding notions of stationarity in nonsmooth optimization: A guided tour of various constructions of subdifferential for nonsmooth functions. IEEE Signal Processing Magazine, 37(5):18 31, 2020a.

Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. Deeprobust: A pytorch library for adversarial attacks and defenses. ar Xiv preprint ar Xiv:2005.06149, 2020b.

Juncheng Liu, Kenji Kawaguchi, Bryan Hooi, Yiwei Wang, and Xiaokui Xiao. Eignn: Efficient infinite-depth graph neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2021a.

Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, 2021b.

D.G. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Reading, Massachusetts, second edition, 1984.

Yi Luo, Guiduo Duan, Guangchun Luo, and Aiguo Chen. Unifying label-inputted graph neural networks with deep equilibrium models. ar Xiv preprint ar Xiv:2211.10629v1, 2022.

Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. ar Xiv preprint ar Xiv:2010.01777, 2020.

Sunil Kumar Maurya, Xin Liu, and Tsuyoshi Murata. Simplifying approach to node classification in graph neural networks. Journal of Computational Science, 62:101695, 2022.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.

Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In 8th International Conference on Learning Representations, ICLR, 2020.

Xuran Pan, Shiji Song, and Gao Huang. A unified framework for convolution-based graph neural networks, 2021. URL https://openreview.net/forum?id=z UMD--Fb9Bt.

Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In International Conference on Learning Representations, 2019.

YANG, LIU, WANG, HUANG, WIPF

Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of gnns under heterophily: Are we really making progress? In International Conference on Learning Representations, 2023.

Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In 8th International Conference on Learning Representations, ICLR, 2020.

Pablo Sprechmann, Alex Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE Trans. Pattern Analysis and Machine Intelligence, 37(9), 2015.

Bharath Sriperumbudur and Gert Lanckriet. On the convergence of the concave-convex procedure. In Advances in Neural Information Processing Systems, 2009.

Jan T onshoff, Martin Ritzert, Hinrikus Wolf, and Martin Grohe. Walking out of the Weisfeiler Leman hierarchy: Graph learning beyond message passing. Transactions on Machine Learning Research, 2023.

Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. International Conference on Learning Representations, 2022.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li o, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR, 2018.

Petar Veliˇckovi c, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural execution of graph algorithms. In International Conference on Learning Representations, 2020.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395 416, 2007.

Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. ar Xiv preprint ar Xiv:1909.01315, 2019.

Xiyuan Wang and Muhan Zhang. How powerful are spectral graph neural networks. In International Conference on Machine Learning, pages 23341 23362. PMLR, 2022.

Yuxin Wang, Xiannian Hu, Quan Gan, Xuanjing Huang, Xipeng Qiu, and David Wipf. Efficient link prediction via gnn layers induced by negative sampling. IEEE Transactions on Knowledge and Data Engineering, 2024.

Zhangyang Wang, Qing Ling, and Thomas Huang. Learning deep ℓ0 encoders. In AAAI Conference on Artificial Intelligence, 2016.

Mike West. Outlier models and prior distributions in Bayesian linear regression. J. Royal Statistical Society: Series B, 46(3), 1984.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial examples for graph data: Deep insights into attack and defense. In Proceedings of the Twenty Eighth International Joint Conference on Artificial Intelligence, IJCAI, pages 4816 4823, 2019.

Qitian Wu, Wentao Zhao, Chenxiao Yang, Hengrui Zhang, Fan Nie, Haitian Jiang, Yatao Bian, and Junchi Yan. SGFormer: Simplifying and empowering transformers for large-graph representations. Advances in Neural Information Processing Systems, 37, 2023.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4 24, 2020.

Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Guangcan Liu, and Zhouchen Lin. Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3604 3616, 2023.

Rui Xue, Haoyu Han, Mohamad Ali Torkamani, Jian Pei, and Xiaorui Liu. Lazy GNN: Large-scale graph neural networks via lazy propagation. In International Conference on Machine Learning, 2023.

Yongyi Yang, Tang Liu, Yangkun Wang, Jinjing Zhou, Quan Gan, Zhewei Wei, Zheng Zhang, Zengfeng Huang, and David Wipf. Graph neural networks inspired by classical iterative algorithms. In International Conference on Machine Learning, 2021.

Yongyi Yang, , Zengfeng Huang, and David Wipf. Transformers from an optimization perspective. In Advances in Neural Information Processing Systems, volume 35, 2022.

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974 983, 2018.

Hanqing Zeng, Muhan Zhang, Yinglong Xia, Ajitesh Srivastava, Andrey Malevich, Rajgopal Kannan, Viktor Prasanna, Long Jin, and Ren Chen. Decoupling the depth and scope of graph neural networks. Advances in Neural Information Processing Systems, 34, 2021.

Hongwei Zhang, Tijin Yan, Zenjun Xie, Yuanqing Xia, and Yuan Zhang. Revisiting graph convolutional network on semi-supervised node classification from an optimization perspective. Co RR, abs/2009.11469, 2020.

Xiang Zhang and Marinka Zitnik. Gnnguard: Defending graph neural networks against adversarial attacks. Advances in Neural Information Processing Systems, 33, 2020.

Xuebin Zheng, Bingxin Zhou, Junbin Gao, Yu Guang Wang, Pietro Lio, Ming Li, and Guido Mont ufar. How framelets enhance graph neural networks. In International Conference on Machine Learning, 2021.

YANG, LIU, WANG, HUANG, WIPF

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Sch olkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 2004.

Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. ar Xiv preprint ar Xiv:1812.08434, 2018.

Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. ar Xiv preprint ar Xiv:2009.13566, 2020a.

Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems, Neur IPS, 2020b.

Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in neural information processing systems, 33:7793 7804, 2020c.

Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. ar Xiv preprint ar Xiv:2101.11859, 2021.

Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. Advances in Neural Information Processing Systems, 32, 2019.

Daniel Z ugner and Stephan G unnemann. Adversarial attacks on graph neural networks via meta learning. In 7th International Conference on Learning Representations, ICLR, 2019.

Daniel Z ugner, Amir Akbarnejad, and Stephan G unnemann. Adversarial attacks on neural networks for graph data. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI, 2019.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Appendix A. Datasets, Experiment Details, and Model Specifications

In this section we introduce additional experiment details as well as give a comprehensive introduction of the UGNN architectures used in the paper.

A.1 Datasets

Practical Impact of Symmetric Weights We choose the popular ogbn-arxiv dataset (Hu et al., 2020) for the experiments in Section 7.1. Original training splits for generating subsequent synthetic labels follow the standard OGB protocol. Long-Range Dependency/Sparse Label Tests In Section 7.2, we adopt the Amazon Co-Purchase dataset and Chains dataset, both of which have previously been used in (Gu et al., 2020) for evaluating performance involving long-range dependencies. We use the Amazon Co-Purchase dataset provided by the IGNN repo (Gu et al., 2020), including the data-processing and evaluation code, in order to obtain a fair comparison. As for splitting, 10% of nodes are selected as the test set. Because there is no dev set, we directly report the test result of the last epoch. We also vary the fraction of training nodes from 5% to 9%. Additionally, because there are no node features, we learn a 128-dim feature vector for each node. All of these settings from above follow from (Gu et al., 2020). The Chains dataset is a synthetic dataset introduced by (Gu et al., 2020). We use exactly the same setting as in (Gu et al., 2020). Heterophily Experiments In Section 7.3, we use several heterophily datasets, including Texas, Wisconsin, Actor and Cornell datasets which are introduced by (Pei et al., 2019), and Roman-empire, Amazon-ratings, Minesweeper, Tolokers and Questions which are introduced by (Platonov et al., 2023). We used the data split, processed node features, and labels provided by (Pei et al., 2019) and (Platonov et al., 2023) respectively. Adversarial Attack Experiments As mentioned in Section 7.4, we tested on Cora and Citerseer using Mettack. We use the Deep Robust library (Li et al., 2020b) and apply the exact same non-targeted attack setting as in (Zhang and Zitnik, 2020). For all the baseline results in Table 4, we run the implementation in the Deep Robust library or the GNNGuard official code. Note the GCN-Jaccard results differ slightly from those reported in (Zhang and Zitnik, 2020), likely because of updates in the Deep Robust library and the fact that (Zhang and Zitnik, 2020) only report results from a single trial (as opposed to averaged results across multiple trails as we report). Summary Statistics Table 5 summarizes the attributes of each dataset listed above.

A.2 Experiment Details

In all experiments of IGNN and EIGNN, we optionally set W s p to be trainable parameters13 or fix it to I. For IGNN, following the original paper, we optionally set f(X) = MLP(PX) or MLP(X), while for EIGNN and UGNN we only set f(X) = MLP(X). In label recovering task in Section 7.1 and the evaluation of time and memory in Section 8, for the purpose of comparing finite vs infinite number of propagations and symmetric vs asymmetric weights, we ignore irrelevant options such as different choice of f(X), hidden size, and the attention mechanism in UGNN. We always use f(X) = XWx and g [y i (W); θ] = Wgy i (W) where Wg Rc d is a learned matrix that maps the propagated node features to the output space. We set the

13. For EIGNN, following the exact formulation in (Liu et al., 2021a), we actually use W s p = µW, where W is trainable and is forced to be symmetric, and µ is a hyperparamter.

YANG, LIU, WANG, HUANG, WIPF

Table 5: Dataset statistics. The FEATURES column describes the dimensionality of node features. The Amazon Co-P row denotes the Amazon Co-Purchase dataset. Note that the Amazon Co-Purchase dataset has no node features.

DATASET NODES EDGES FEATURES CLASSES

AMAZON CO-P 334,863 2,186,607 - 58

CORA 2,708 5,429 1,433 7 CITESEER 3,327 4,732 3,703 6

OGBN-ARXIV 169,343 1,166,243 128 40

TEXAS 183 309 1,703 5 WISCONSIN 251 499 1,703 5 ACTOR 7,600 33,544 931 5 CORNELL 183 295 1,703 5

ROMAN-EMPIRE 22,662 32,927 300 18 AMAZON-RATINGS 32,927 93,050 300 5 MINESWEEPER 1,000 39,402 7 2 TOLOKERS 11,758 519,000 10 2 QUESTIONS 48,921 153,540 301 2

hidden size to 32 in asymmetric case and 34 in symmetric case to ensure nearly the same number of parameters (we deliberately set the hidden size to this small to ensure the models do not overfit). For generating models, we first train them using the original labels of the dataset by 500 steps. For UGNN, we fix the number of propagation steps is set to 2, and adopt (11) of UGNN. When projecting Wp (or W in UGNN) to the space that admits unique fixed point, in IGNN s original paper it uses inf (Gu et al., 2020), which would break the symmetry, based on our result in Section 5.1, in our implementation when the model weight is symmetric we project 2 instead of inf. In EIGNN, we use rescaling as in (Liu et al., 2021a) instead of projection. In the direct comparison between IGNN and UGNN in Section 7.1 and the time & memory comparison in Section 8, we always use normalized adjacency P = ˆA for propagation matrix. However, in other experiments with UGNN, we allow extra flexibility of the propagation matrix. Finally, for the results in Sections 7.1, 7.3 and 7.4, we run each experiment for 10 trials to obtain the standard deviation.

A.3 Basic UGNN Architecture

The UGNN architecture is composed of the input module f (X; W), followed by the unfolded propagation layers defined by (8), concluding with g(y; θ). We allow an optional attention layer in UGNN by specifying the ρ function in (3). See Section A.5 below and (Yang et al., 2021) for details. The aggregate design of UGNN is depicted in in Figure 8. For simplicity, we generally adopt a single attention layer sandwiched between equal numbers of propagation layers; however, for heterophily datasets we apply an extra attention layer before

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

propagation. Note that the attention only involves reweighting the edge weights of the graph (i.e., it does not alter the node embeddings at each layer).

Figure 8: Model Architecture. S is the total number of propagation steps. K and L are number of MLP layers before and after the propagation respectively. While not a requirement, in all of our experiments, either K or L is set to zero, meaning that the MLP exists on only one side of the propagation layers.

A.4 UGNN Variations

Pre-conditioning and Reparameterization If we unfold the iteration step of (5), define the reparameterized embeddings Z = D1/2Y and left multiply the update rule by D1/2, we have

Z(k+1) = (1 α)Z(k) + αλ D 1/2AY (k) + α D 1/2f(X; W)

= (1 α)Z(k) + αλ D 1/2A D 1/2Z(k) + α D 1Z(0). (18)

From here, if we choose α = λ = 1, for Z(1) we have that

Z(1) = D 1/2A D 1/2 + D 1 Z(0)

= D 1/2 A D 1/2Z(0), (19)

which gives the exact single-layer GCN formulation in Z-space with Z(0) = f(X; W).

Normalized Laplacian Unfolding From another perspective, if we replace L in (5) with a normalized graph Laplacian, and then take gradients steps as before, there is no need to do preconditioning and reparameterizing. For example, unfolding (5) with L changed to the symmetrically-normalized version L = I D 1/2 A D 1/2, we get

Y (k+1) = (1 α αλ)Y (k) + αλ D 1/2 A D 1/2Y (k) + αY (0), (20)

where we set D = I + D. This formula is essentially the same as (18). The main difference is that there is no D 1 in front of X, which indicates an emphasis on the initial features. We found this

YANG, LIU, WANG, HUANG, WIPF

version to be helpful on ogbn-arxiv and Amazon Co-Purchase data. Note however that all of our theoretical support from the main paper applies equally well to this normalized version, just with a redefinition of the gradient steps to include the normalized Laplacian.

A.5 Specific Attention Formula

We adapt the TWIRLS model from (Yang et al., 2021) as the basis for UGNNs with an optional attention layer that is determined by the ρ function in (3). While the attention mechanism can in principle adopt any concave, non-decreasing function ρ, in this work we restrict ρ to a single functional form that is sufficiently flexible to effectively accommodate all experimental scenarios. Specifically, we adopt

τ p 2z2 if z < τ

2 p T p ρ0 if z > T

2 pzp ρ0 otherwise,

where p, T, and τ are non-negative hyperparameters and ρ0 = 2 p

p τ p is a constant that ensures ρ is continuous. Additionally, the gradient of ρ produces the attention score function (akin to γ in the main paper) given by

s(z2) p(z2)

τ p 2 if z < τ 0 if z > T zp 2 otherwise.

And for convenience and visualization, we also adopt the reparameterizations τ = τ 1 2 p and T = T 1 2 p , and plot ρ(z2) and s(z2) in Figure 9 using p = 0.1, τ = 0.2, T = 2.

Figure 9: A visualization of attention functions.

Overall, this flexible choice has a natural interpretation in terms of its differing behavior between the intervals [0, τ], ( τ, T), and ( T, ). For example, in the [0, τ] interval a quadratic penalty is applied, which leads to constant attention independent of z. This is exactly like the simple case where there is no attention. In contrast, within the ( T, ) interval ρ is constant and the corresponding attention weight is set to zero (truncation), which is tantamount to edge removal. And finally, the middle interval provides a natural transition between these two extremes, with ρ becoming increasingly flat with larger z values.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Additionally, many familiar special cases emerge for certain parameter selections. For example T = corresponds with no explicit truncation, while p = 2 can instantiate no attention. And T = τ means we simply truncate those edges with large distance, effectively keeping the remaining edge attention weights at 1 (note that there is normalization during propagation, so setting edges to a constant is equivalent to setting them to 1).

Appendix B. EIGNN as a Special Case of UGNN

As claimed in the main paper, the EIGNN model can be viewed as a special case of UGNN; we provide the details here. We begin with the general energy from (3), and then choose ϕ (y; Wϕ) = 0, ρ z2 = z2, and adopt the weight parameterization W s P = s2F F, where F is a trainable parameter matrix and s is a rescaling factor given by

s = F F +ϵF 1

This produces the simplified loss

ℓY (Y ; Wx, f) = Y f(X; Wx) 2 F + µtr h F Y LY F i , (24)

where P = I L is the normalized adjacency matrix. A gradient step on (24) initialized at Y (k) can then be computed as Y (k+1) = γs2F FY (k)W s p + f (X, Wx) , (25)

which matches the original EIGNN updates from (Liu et al., 2021a).

Appendix C. Notation Used in Proofs

In following sections, we give proofs of all propositions in the main paper. We first introduce some notations used in proofs here. For simplicity, in following sections we denote ℓY (Y ; W, f, ρ, e B, ϕ) in (3) by ℓY (Y ), denote ϕ (yi; Wϕ) by ϕ(yi) and P i ϕ(yi) by ϕ(Y ), and denote f(X; W) by f(X). And the smooth part of ℓY is denoted by ℓs Y (Y ) = ℓY (Y ) ϕ(Y ). And we possibly ignore the parameter of the function (e.g. ℓY ). We denote the vectorization of a matrix M by vec(M) and the Kronecker product of two matrices M1 and M2 denoted by M1 M2. For Fr echet subdifferential respect to Y , we denote it by Y , ignoring the subscript F used in main paper.

Appendix D. Proofs Related to UGNN Convergence

In this section, we investigate different conditions to guarantee convergence (to stationary point) of UGNN under different assumptions of ℓY . To start with, we first prove that as long as proximal gradient descent has a fixed point, the fixed point is a stationary point.

YANG, LIU, WANG, HUANG, WIPF

Lemma 11 For any Z such that

Z arg min Y 1 2α Y (Z α ℓs Y (Z)) 2 F + ϕ(Y ), (26)

it must hold that 0 ZℓY (Z).

Proof We will use the summation property and local minimality property of Fr echet subdifferential (see (Li et al., 2020a) for detailed descriptions). For any Y such that

Y arg min Y 1 2α Y (Z α ℓs Y (Z)) 2 F + ϕ(Y ), (27)

2α Y (Z α ℓs Y (Z)) 2 F + ϕ(Y ) Y =Y . (28)

It follows that

0 ℓs Y (Z) + Y ϕ(Y ). (29)

So, if Y = Z, we have

0 ℓs Y (Z) + Zϕ(Z) = Z [ℓs Y (Z) + ϕ(Z)] = ZℓY (Z) (30)

With Lemma 11, to prove that UGNN converges to a stationary point, we only need to show it has a fixed point. For the convergence of UGNN, we assume ϕ is continuous and ℓY is Lipschitz smooth (i.e. its gradient is Lipschitz continuous) with Lipschitz constant L. And we further distinguish two cases 1) No other conditions provided 2) ϕ is convex and ℓY is strongly convex. We will use different strategies to prove convergence and show that the conditions on the step-size that guarantee convergence are different in these two cases. Note that, as mentioned in the main paper, in the first case, we also need to assume ℓY (Y ) when Y to avoid a degenerate case that the process goes to infinite. The Framework of Proving Convergence To prove the convergence of the iteration of (8), we will use Zangwill s convergence theorem. For clarity, we state it next.

Proposition 12 (Zangwill s Convergence Theorem (Luenberger, 1984)) Let A : X 2X be a set-valued function, G X be a solution set (which can be any set we are interested in). For any sequence {xk} k=1 such that xk+1 A(xk), if the following conditions hold jointly:

1. (boundedness) {xk|1 k } is contained in some compact subset of X.

2. (descent function) there exists a continuous function ℓsuch that

(a) if x G, then y A(x), ℓ(y) < ℓ(x).

(b) if x G, then y A(x), ℓ(y) ℓ(x).

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

3. (closedness) the mapping A is closed at all point of X \ G i.e. {(x, y)|x clos (X \ G) , y A(x)} is a closed set, where clos means set closure.

then the limit of any convergent subsequence of {xk} k=1 is a solution i.e. inside G.

Given Lemma 11, we define the solution set G as set of fixed points:

G = Z Z arg min Y 1 2α Y (Z α ℓs Y (Z)) 2 F + ϕ(Y ) . (31)

Then the proof is composed of three steps:

1. showing there exists some function that is a descent function;

2. showing the process is closed;

3. showing Y (k) k=1 are bounded.

Thereinto, the closedness is a straightforward consequence of Lemma 6 in (Sriperumbudur and Lanckriet, 2009). For boundedness, as long as we find some descent function ℓ, we consider the set S = Y | ℓ(Y ) ℓ Y (0) . Apparently Y (k) k=1 S, so we only need to show that S is bounded. It suffices to prove lim Y ℓ(Y ) = . Therefore, step 2 is already finished and step 3 is

nearly finished. The most tricky part is proving descent of some function.

D.1 UGNN Convergence in General Case

Here we don t cast convexity assumption on the objective function. In this case we choose ℓY as the candidate descent function. Note that by the assumption that lim Y ℓY (Y ) = made in main

paper, we have boundedness of S. To prove descent, we first prove loosely descent (i.e. Lemma 1) and then show when Y (k) G, it is also strict. Proof of Lemma 1 Here we first state a property of Lipschitz smooth function:

Proposition 13 ((Bertsekas, 1999)) for any ℓ(y) whose gradient is L-Lipschitz, it satisfies

y, z, ℓ(y) ℓ(z) + ℓ(z) (y z) + L

2 y z 2 2. (32)

From Proposition 13, if function ℓis L-Lipschitz smooth, then for any z, y, we have

ℓ(y) ℓ(z) + ℓ(z) (y z) + L

2 y z 2 2 (33)

= ℓ(z) + ℓ(z) y ℓ(z) z + L

2 y 2 2 + L

2 z 2 2 Ly z (34)

2L ℓ(z) 2 2

L2 ℓ(z) 2 2 + y 2 2 + z 2 2 + 2

L ℓ(z) z 2y z (35)

2L ℓ(z) 2 2 + L

YANG, LIU, WANG, HUANG, WIPF

Taking ℓ(y) = ℓs Y (y), y = vec(Y ) and z = vec Y (k) , and let

βY Y (k) = ℓs Y Y (k) α

ℓs Y Y (k) 2

we can define an upper-bound of ℓY :

ℓ(upp) Y (Y ; Y (k)) = 1

Y h Y (k) α ℓ Y (k) i 2

i ϕ(yi) + βY Y (k) (38)

i ϕ(yi) + βY Y (k) , (39)

and adopting (36) we have that:

ℓY (Y ) = ℓs Y (Y ) + X

i ϕ(yi) ℓ(upp) Y . (40)

(Notice that since we have assumed α 1 L, ℓY is also 1

α-Lipschitz). It s not difficult to check

ℓ(upp) Y Y (k); Y (k) = ℓY Y (k) . Also, since adding terms which is not related to Y does not affect the arg min result in (8), we have that

Y (k+1) arg min Y 1 2α Y U(k) 2 F + X

i ϕ(yi) (41)

= arg min Y 1 2α Y U(k) 2 F + X

i ϕ(yi) + βY Y (k) (42)

= arg min Y ℓ(upp) Y Y ; Y (k) . (43)

Then we have

ℓY Y (k+1) ℓ(upp) Y Y (k+1); Y (k) ℓ(upp) Y Y (k); Y (k) = ℓY Y (k) , (44)

which proves the the lemma.

Furthermore, when Y (k) G, i.e.Y (k) arg min ℓ(upp) Y Y ; Y (k) , we have that

ℓ(upp) Y Y (k+1); Y (k) < ℓ(upp) Y Y (k); Y (k) , (45)

and thus (44) holds strictly, which matches the condition of strictly descent given in Proposition 12.

D.2 UGNN Convergence Under Convexity

When ℓs Y is strongly convex and ϕ is convex, we have already know the existence and uniqueness of stationary point (global optimum) Y , so by Lemma 11 the fixed point of iteration (8) must be Y , i.e. G = {Y }. In this case, we choose the distance of current point to the fixed point ℓ(Y ) = Y Y 2 F as the descent function. As long as descent is proved, the boundedness of ℓ(Y ) is obvious.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

We prove the descent of ℓ(Y ) by using the fact that when ϕ and ℓs Y are convex, the proximal operator proxϕ is non-expansive and the gradient step is contraction with small enough α. For proximal operator, Proposition 1 in (Gribonval and Nikolova, 2020) has already shown that as long as ϕ is convex, proxϕ is non-expansive, i.e.

x, y, proxϕ(x) proxϕ(y) 2 x y 2. (46)

Thus we only need to prove gradient descent is contraction. To prove this, we first introduce another property of convex Lipschitz smooth function:

Proposition 14 (Theorem 2.1.5 of (Nesterov, 2003)) For any function ℓthat is convex and whose gradient is L-Lipschitz continuous, we have

x, y, [ ℓ(x) ℓ(y)] (x y) 1

L ℓ(x) ℓ(y) 2 2. (47)

Then we prove the contraction of gradient descent:

Lemma 15 for any function ℓthat is L-Lipschitz smooth satisfies that

x, y, [x α ℓ(x)] [y α ℓ(y)] 2 2 (1 α2L2 2αL) x y 2 2. (48)

Proof From Proposition 14, we have that:

[x α ℓ(x)] [y α ℓ(y)] 2 2 (49)

= x y 2 2 + α2 ℓ(x) ℓ(y) 2 2 2α(x y) [ ℓ(x) ℓ(y)] (50)

x y 2 2 + α2 2α

ℓ(x) ℓ(y) 2 2 (51)

(1 α2L2 2αL) x y 2 2, (52)

which finishes the proof.

Combining (46) and Lemma 15, we get following corollary:

Corollary 16 For ℓY (Y ) = ℓs Y (Y ) + ϕ(Y ), where ℓs Y is strongly convex and L-Lipschitz smooth, ϕ is convex, α < 2

L, we have Y (k+1) Y F < Y (k) Y F (53)

if Y (k) = Y .

Proof Since Y is the fixed point, we have Y proxϕ [Y α ℓs Y (Y )], therefore by (46)

Y (k+1) Y F = proxϕ h Y (k) α ℓs Y Y (k) i proxϕ [Y α ℓs Y (Y )] 2

h Y (k) α ℓs Y Y (k) i [Y α ℓs Y (Y )] 2

YANG, LIU, WANG, HUANG, WIPF

Since α < 2

L, from Lemma 15 we have that when Y (k) = Y ,

h Y (k) α ℓs Y Y (k) i [Y α ℓs Y (Y )] 2

F < Y (k) Y . (56)

Therefore, ℓ Y (k) = Y (k) Y is strictly descent when Y (k) = Y . Combining this and Proposition 12 we conclude that:

Theorem 17 When ℓs Y (Y ) is L-Lipschitz smooth and strongly convex, ϕ(Y ) is continuous and convex, iteration (8) converges to the unique global minimum of ℓY when α < 2

D.3 Remarks

Note that, the upper-bound of α is different in the two cases. Actually, if there is no penalty term ϕ, i.e. no proximal step, in both cases the upper bound is 2/L, which is a standard result on gradient descent for smooth functions that can be found in any optimization textbook. To get some intuition, considering the upper bound derived in (39), although α = 1/L is the optimal step size to minimize ℓ(upp) Y , with α relaxed to 2/L, ℓ(upp) Y still goes down:

ℓ(upp) Y h Y (k) α ℓs Y Y (k) ; Y (k)i ℓ(upp) Y Y (k); Y (k) = ℓs Y Y (k) . (57)

However, this is no longer true when the proximal step is added. Thus in the latter case, we either cast a more strict bound on α to still guarantee descent of ℓY , or add extra assumptions that the process is non-expansive to guarantee that the output of each iteration would not be pushed further away from the fixed point despite not ensuring the descent of ℓY .

D.4 Other Related Proofs

Note that Lemma 1 and Theorem 7 has been proved in section D.1 14. Here we provide other missing proofs from Section 5.1. Proof of Theorem 3 First, we will show that ℓs Y is strongly convex under conditions given. We have

= vec h Y W s f f(X; W) + LY W s p i (58)

= W s f I + W s p L vec(Y ) + vec [f(X; W)] (59)

= Σvec(Y ) + vec [f(X; W)] . (60)

Therefore the Hessian of ℓs Y is 2ℓs Y vec(Y )2 = Σ. (61)

We also know that when λmin(Σ) > 0, ℓs Y is strongly convex respect to Y , thus has a unique global minimum. And from Proposition 1 in (Gribonval and Nikolova, 2020), we know ϕ is convex through the non-expansiveness of σ.

14. The last part of Theorem 7 that lim k ℓY Y (k) = ℓY (Y ) follows directly from the continuity of ℓY .

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Furthermore, the λmax(Σ) is the Lipschitz constant of ℓs Y , and from theorem 17 we know when α < 2/λmax(Σ) the algorithm converges to the stationary point, which is the global optimum.

Proof of Corollary 4 Taking W s f = I W s p and P = I L, we have that

Σ = I P W s p . (62)

By the spectral properties of Kronecker product, we have

λmin(Σ) 1 W s p 2 P 2 > 0 (63)

and λmax(Σ) 1 + W s p 2 P 2 < 2. (64)

The discussion in Section 4.3 mentioned that to get (11), we need to set α = 1 < 2 λmax(Σ), therefore from Theorem 3 we conclude that (11) converges to the global minimum of (12).

Proof of Lemma 5 (Convergence of IGNN) We first convert (2) to the vector form:

vec Y (k+1) = σ h (Wp P) vec Y (k) + vec(f) i , (65)

where f = f(X; Wx), ignoring the parameters for simplicity. Let M = Wp P. From the spectral properties of knronecker product, we have

M 2 = Wp 2 P 2 < 1. (66)

By definition of matrix norm we have

x1, x2, M(x1 x2) 2 M 2 x1 x2 2, (67)

which means M (as a linear transformation) is a contraction mapping on Euclidean space. We also assumed σ is contraction mapping, and thus from Banach s fixed point theorem(Kinderlehrer and Stampacchia, 1980) we conclude that (65), as well as (2), has a unique fixed point.

Appendix E. Proofs Related to UGNN Expressiveness

In this section, we provide detailed proofs of propositions in Section 6, as well as an analysis under cases where activation functions are non-linear.

E.1 Proof of Lemma 2

This lemma is a straight-forward corollary of Theorem 1 in (Gribonval and Nikolova, 2020). For clarity, we excerpt the theorem here:

Proposition 18 (Theorem 1 in (Gribonval and Nikolova, 2020)) Let H be some Hilbert space (e.g. Rk) and Y H be non-empty. A function σ : Y H is the proximal operator of a function ϕ : H R {+ } if and only if there exists a convex l.s.c. (lower semi-continuous) function ψ : H R {+ } such that for each y Y, σ(y) ψ(y).

YANG, LIU, WANG, HUANG, WIPF

Since our σ is component-wise, we only need to consider Y = H = R. Furthermore, since σ(x) is continuous, its indefinite integration exists. Let ψ(x) = R σ(x)dx, which must be convex since its derivative is non-decreasing. Then according to Proposition 18, there exists a function ϕ whose proximal operator is σ. Conversely, if there exists ϕ such that σ = proxϕ, from Proposition 18 it is the subdifferential of some convex function ψ, which means it is non-decreasing.

E.2 Proof of Lemma 6

We only need a counterexample to prove this. Suppose n = 1 and d = 2, consider W = 0 1 0 0

Y = a b , then Y W = 0 a

If there exists any second order smooth function h such that h

a = 0 and h

b = a, we have that

2h a b = 0 = 2h

which contradicts the second order smoothness assumption of h.

E.3 Proof of Theorem 9

In this subsection we denote ˆX = f(X; Wx) for simplicity. First consider a case which allows T and W s p to have complex values. Hereinafter we denote the space of all complex-valued matrices with shape d1 d2 by Cd1 d2. We recall Jordan s decomposition theorem below. Let J(λ) be Jordan block matrix

λ 1 λ 1 λ ... 1 λ

We have the following well-known fact.

Proposition 19 (Jordan) For every W Rd d, there exists a complex-valued invertible matrix P Cd d and complex-valued block-diagonal matrix

J(λ1) J(λ2) ... J(λk)

such that W = PΩP 1, where λj C is the j-th eigenvalue of W and the size of J(λj) is the algebraic multiplicity of λj.

Corollary 20 If a square matrix W Rd d has d distinct eigenvalues, then there exists P, Λ Cd d

where P is invertible and Λ is diagonal, such that W = PΛP 1.

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

Then, we shall show that actually on the complex domain almost every matrix can be diagonalized, which means, for each matrix W, either it itself can be diagonalized, or there s a diagonalizable matrix W that is arbitrarily closed to W.

Corollary 21 For every square matrix W Rd d and any ϵ > 0, there exists a diagonalizable square matrix W Rd d such that W W ϵ.

Proof Let W = RΩR 1 and E = diag ϵ1, ϵ2, , ϵn such that the diagonal of Ω+ E is distinct and ϵ2 j < ϵ2 d R 2 R 1 2 . Since Ωis upper-triangle, its eigenvalues are the elements in its diagonal, which are distinct. Thus according to Corollary 20, Ω+ E is diagonalizable, suppose Ω+ E = QΛQ 1. Let W = R(Ω+ E)R 1 = RQΛQ 1R 1, it is apparent W also diagonalizable. Now consider the difference between W and W. We have W W 2 = PEP 1 2 E 2 R 2 R 1 2 < ϵ2.

Note that the norm in Corollary 21 can be any matrix norm.

Lemma 22 The solution to the equation PY W + ˆX = Y is continuous respect to W as long as a unique solution exists.

Proof The solution of this equation is vec(Y ) = I W P 1 vec( ˆX), which is continuous.

Theorem 23 For any Wp Rd d that admits a unique fixed point for IGNN and ˆX Rn d, suppose Y Rn d is the only solution of PY W + ˆX = Y , then there exists a Y Cn d, Wx Cd d , right-invertible T Cd d and hermitian W s p Cd d , such that

PY T W s p + ˆX Wx = Y T, with Y Y < ϵ, ϵ > 0, Y Cn d. (68)

Proof By Corollary 21 and Lemma 22, there exists diagonalizable W Rd d such that PY W + ˆX = Y for some Y satisfies Y Y < ϵ. Suppose now W = RΛR 1 where Λ is diagonal and R is invertible. We have

PY RΛR 1 + ˆX = Y , (69)

and right-multiply by R at each side of this equation we get

P(Y R)Λ + ˆXR = Y R. (70)

By choosing T = Wx = R and W s p = Λ, we proved the theorem.

Next, we consider restricting this result to the real domain. This is straight-forward since every complex linear transformation can be converted to real linear transformation by consider real and imaginary parts separately. Hereinafter, we denote the real part of an complex-valued matrix M by M(r) and imaginary part by M(i). Proof of Theorem 9 Let Y , T, W s p , Wx be the matrices obtained from Theorem 23. Note the last three are complex-valued, T is invertible and W s p is hermitian.

YANG, LIU, WANG, HUANG, WIPF

Consider T = T(r) T(i) . Since T is invertible, let

= T 1, then we have I = ( T

(i). Thus T

(i) = I. This means T is right-invertible.

Then let W p s =

" W s p(r) W s p(i) W s p(i) W s p(r)

. Since W s p is hermitian, we have that W s p(r) = W s p(r)

and W s p(i) = W s p(i), which ensures W s p constructed here is symmetric. Finally, let Wx = Wx(r) Wx(i) . Now it s easy to verify that PY TW s p + ˆX Wx = Y T.

E.4 Proof of Theorem 10

First we will illustrate that by setting the parameters in UGNN to properly, the iteration step from (8) is equivalent to a GCN layer with symmetric shared weights, with or without residual connections. We set ρ to be identity (so that Γ = I), B satisfies that B B = I P (e.g. if P = ˆA = D 1/2 A D 1/2

is normalized adjacency, then B is normalized incidence matrix B = D 1/2B). For a GCN model without residual connections, let W s f = I W s p , then we have Y (k+1) =

σ(PY (k)W s p ). Consider Y (0) = W h Y (0) GCN 0 0 0 i where Y (0) GCN is the input of GCN, and

the ith 0 here denote a block matrix of all zero with the same size as Y (i) GCN. When we let

W (1) p W (1) p W (2) p W (2) p W (3) p

W (3) p ... ... W (k) p W (k) p

it is not difficult to verify that the kth block of Y (k) is Y (k) GCN. Note the 0s in the (71) are block matrices of all zeros with proper size. For GCN model with residual connection, let W s f = I W s p W s r , then Y (k+1) = σ(PY (k)W s p + Y (k)W s r ). We take the same W s p as before (71), and let

I ... ... I I

Since W s r and W s p are both symmetric, apparently W s f is also symmetric. Also notice that since

residual connection exists, the size of all the W (i) p s are the same, so the size of I in the (72) is the same as W (i) p .

IMPLICIT VS UNFOLDED GRAPH NEURAL NETWORKS

E.5 A Sufficient Condition of Equivalence of Symmetric and Asymmetric Weights

In this section, we give a sufficient condition of fixed-point equivalence of symmetric and asymmetric weights in general case as mentioned in Section 6.1. Here we consider general nonlinear functions σ and penalty term ϕ. Hereinafter we denote proxϕ by σ. Constraint on σ To simplify the discussion, we add a mild assumption on σ and consider a general type of σ: We assume σ(x) is a proximal operator of some penalty function which is differentiable with respect to x. This assumption is consistent with practice. Firstly, commonly used non-linearities like tanh, sigmoid and Re LU are all proximal operators since they are all continuous and component-wise non-decreasing (By Lemma 2). Also, although at some point Re LU and Leaky-Re LU are not differentiable, we can approximate them by

r log (exp(rx) + 1)

and Leaky-Re LU(x, p) 1

r log (exp(rx) + exp(prx))

with big enough r, which are differentiable and do not affect their practical attributes. σ Considered As discussed in the main paper, it would cause some degenerated cases if we allow a general class of proximal operators. Also it would too difficult to handle in this case. Therefore, we only consider proximal operators in a special while general family

S = { σ : x 7 Gσ(Cx)|G, C are chosen such that σ is proximal operator}

Now we consider under what value of G and C we can ensure σ is proximal operator. From Proposition 18, we know ϕ, σ = proxϕ iff σ is subgradient of some l.s.c convex function. Since we have assumed σ is continuous and differentiable, R σ(x)dx is twice differentiable. We know that a twice-differentiable function ψ is convex iff its Hessian Hψ is positive semi-definite. Therefore the Hessian of σ must be positive semi-definite, we denote it by H(x) = σ(x)

x , and so do Hessian of σ(x), which we denote by H(x) = σ(x)

x . From the chain rule we have

H(x) = σ(x)

x = G σ(Cx)

x = GH(Cx)C.

From the deduction above, we can conclude that:

Lemma 24 σ : Rd Rd, x 7 Gσ(Cx) is proximal operator iff x CRd, GH(x)C is positive semi-definite where H(x) = σ(x)

The Condition of General Fixed-Point Alignment

Theorem 25 For σ satisfying the assumptions above, Wp Rd d that admits unique fixed point for IGNN, and Wx Rd0 d, suppose Y is the only solution of Y = σ(PY W + XWx), a sufficient condition of

ϕ, right invertible T Rd d , symmetric W s p Rd d and Wx Rd d such that

σ(PY TW s p + XWx) = Y T

YANG, LIU, WANG, HUANG, WIPF

right-invertible T Rd d , C Rd d, W s p Rd d such that TW s p C = Wp

and x C R, T H(x)C is P.S.D,

where σ(x) = proxϕ(x) and H(x) = σ(x)

Proof We have assumed σ S, thus σ(x) = Gσ(Cx). We want to prove that

Y T = σ(PY TW s p + X Wx) (73)

= σ(PY TW s p C + X Wx C)G (74)

= σ(PY TW s p C + XWx)G (Assume Wx C = Wx). (75)

A sufficient condition for this equation to hold is T = G and Y = σ(AY TW s p C + XWx) = σ(AY Wp + XWx). Comparing the last two terms, apparently it holds when TW s p C = Wp. And to ensure σ S, from Lemma 24 we know it means T H(C x)C be positive semi-definite for all x.

We can verify this Theorem 25 by linear case. If σ is linear, then H(Cx) = I, thus simply taking C = T 1 (on complex field) we have TW s p T 1 can generate any real matrix W and T H(C x)C = T T = I is positive semi-definite.