# graph_attention_retrospective__072abd0f.pdf

Journal of Machine Learning Research 24 (2023) 1-52 Submitted 11/22; Revised 4/23; Published 5/23

Graph Attention Retrospective

Kimon Fountoulakis kimon.fountoulakis@uwaterloo.ca David R. Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada

Amit Levi amit.levi@uwaterloo.ca Huawei Noah s Ark Lab Montreal, Quebec, Canada

Shenghao Yang shenghao.yang@uwaterloo.ca David R. Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada

Aseem Baranwal aseem.baranwal@uwaterloo.ca David R. Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada

Aukosh Jagannath a.jagannath@uwaterloo.ca Department of Statistics and Actuarial Science, Department of Applied Mathematics University of Waterloo Waterloo, Ontario, Canada

Editor: Joan Bruna

Graph-based learning is a rapidly growing sub-ﬁeld of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classiﬁcation for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an easy regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and signiﬁcantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classiﬁcation. In the hard regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classiﬁcation, we provide a positive result on graph attention s robustness against structural noise in the graph. In particular, our robustness result implies that graph attention

. Equal contribution.

c 2023 Kimon Fountoulakis, Amit Levi, Shenghao Yang, Aseem Baranwal and Aukosh Jagannath.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1255.html.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

can be strictly better than both the simple graph convolution and the best linear classiﬁer of node features. We evaluate our theoretical results on synthetic and real-world data. Keywords: graph neural networks, attention mechanism, contextual stochastic block model, high-dimensional probability, node classiﬁcation

1. Introduction

Graph learning has received a lot of attention recently due to breakthrough learning models (Gori et al., 2005; Scarselli et al., 2009; Bruna et al., 2014; Duvenaud et al., 2015; Henaﬀ et al., 2015; Atwood and Towsley, 2016; Deﬀerrard et al., 2016; Hamilton et al., 2017; Kipf and Welling, 2017) that are able to exploit multi-modal data that consist of nodes and their edges as well as the features of the nodes. One of the most important problems in graph learning is the problem of classiﬁcation, where the goal is to classify the nodes or edges of a graph given the graph and the features of the nodes. Two of the most fundamental mechanisms for classiﬁcation, and graph learning in general, are graph convolution and graph attention. Graph convolution, usually deﬁned using its spatial version, corresponds to averaging the features of a node with the features of its neighbors (Kipf and Welling, 2017).1 Graph attention (Veliˇckovi c et al., 2018) mechanisms augment this convolution by appropriately weighting the edges of a graph before spatially convolving the data. Graph attention is able to do this by using information from the given features for each node. Despite its wide adoption by practitioners (Fey and Lenssen, 2019; Wang et al., 2019; Hu et al., 2020) and its large academic impact as well, the number of works that rigorously study its eﬀectiveness is quite limited. One of the motivations for using a graph attention mechanism as opposed to a simple convolution is the expectation that the attention mechanism is able to distinguish interclass edges from intra-class edges and consequently weights inter-class edges and intra-class edges diﬀerently before performing the convolution step. This ability essentially maintains the weights of important edges and signiﬁcantly reduces the weights of unimportant edges, and thus it allows graph convolution to aggregate features from a subset of neighbor nodes that would help node classiﬁcation tasks. In this work, we explore the regimes in which this heuristic picture holds in simple node classiﬁcation tasks, namely classifying the nodes in a contextual stochastic block model (CSBM) (Binkiewicz et al., 2017; Deshpande et al., 2018). The CSBM is a coupling of the stochastic block model (SBM) with a Gaussian mixture model, where the features of the nodes within a class are drawn from the same component of the mixture model. For a more precise deﬁnition, see Section 2. We focus on the case of two classes where the answer to the above question is suﬃciently precise to understand the performance of graph attention and build useful intuition about it. We brieﬂy and informally summarize our contributions as follows:

1. In the easy regime , i.e., when the distance between the means of the Gaussian mixture model is much larger than the standard deviation, we show that there exists a choice of attention architecture that distinguishes inter-class edges from intra-class

1. Although the model in Kipf and Welling (2017) is related to spectral convolutions, it is mainly a spatial convolution since messages are propagated along graph edges. More broadly, graph convolution can refer to diﬀerent variants arising from diﬀerent (approximations of) graph spectral ﬁlters. We provide details in Section 2.

Graph Attention Retrospective

edges with high probability (Theorem 3). In particular, we show that the attention coeﬃcients for one class of edges are much higher than the other class of edges (Corollary 4). Furthermore, we show that these attention coeﬃcients lead to a perfect node classiﬁcation result (Corollary 5). However, in the same regime, we also show that the graph is not needed to perfectly classify the data (Proposition 7).

2. In the hard regime , i.e., when the distance between the means is small compared to the standard deviation, we show that every attention architecture is unable to distinguish inter-class from intra-class edges with high probability (Theorem 9). Moreover, we show that using the original Graph Attention Network (GAT) architecture (Veliˇckovi c et al., 2018), with high probability, most of the attention coeﬃcients are going to have uniform weights, similar to those of simple graph convolution (Kipf and Welling, 2017) (Theorem 10). We also consider a setting where graph attention is provided with independent and perfect edge information, and we show that even in this case, if the distance between the means is suﬃciently small, then graph attention would fail to (almost) perfectly classify the nodes with high probability (Theorem 13).

3. Beyond perfect node classiﬁcation, we show that graph attention is able to assign a signiﬁcantly higher weight to self-loops, irrespective of the parameters of the Gaussian mixture model that generates node features or the stochastic block model that generates the graph. Consequently, we show that with a high probability, graph attention convolution is at least as good as the best linear classiﬁer for node features (Theorem 14). In a high structural noise regime when the graph is not helpful for node classiﬁcation, i.e., when intra-class edge probability p is close to inter-class edge probability q, this means that graph attention is strictly better than simple graph convolution because it is able to essentially ignore the graph structural noise (Corollary 15). In addition, we obtain lower bounds on graph attention s performance for almost perfect node classiﬁcation and partial node classiﬁcation (Corollary 16).

4. We provide an extensive set of experiments both on synthetic data and on three popular real-world datasets that validates our theoretical results.

1.1 Limitations of our theoretical setting

In this work, to study the beneﬁts and the limitations of the graph attention mechanism, we focus on node classiﬁcation tasks using the CSBM generative model with two classes and Gaussian node features. This theoretical setting has a few limitations. First, we note that many real-world networks are much more complex and may exhibit diﬀerent structures than the ones obtainable from a stochastic block model. Furthermore, real-world node attributes such as categorical features may not even follow a continuous probability distribution. Apparently, there are clear gaps between CSBM and the actual generative processes of real-world data. Nonetheless, we note that a good understanding of graph attention s behavior over CSBM would already provide us with useful insights. For example, it has been shown empirically that synthetic graph datasets based on CSBM constitute good benchmarks for evaluating various GNN architectures (Palowitch et al., 2022). To that end, in Section 6, we summarize some practical implications of our theoretical results which could be useful for practitioners working with GNNs.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Second, there are diﬀerent levels of granularity for machine learning problems on graphs. Besides node-level tasks such as node classiﬁcation, other important problems include edgelevel tasks such as link prediction and graph-level tasks such as graph classiﬁcation. While, empirically, variants of graph attention networks have been successfully applied to solve problems at all levels of granularity, our theoretical results mostly pertain to the eﬀects of graph attention for node classiﬁcation. This is a limitation on the scope of our study. On the other hand, we note that edge-level and graph-level tasks are typically solved by adding a pooling layer which combines node representations from previous graph (attention) convolution layers. Consequently, the quality of graph attention convolution s output for a node not only aﬀects node classiﬁcation but also has a major inﬂuence on link prediction and graph classiﬁcation. Therefore, our results may be extended to study the eﬀects of graph attention on link prediction and graph classiﬁcation under the CSBM generative model. For example, link prediction is closely related to edge classiﬁcation which we extensively study in Section 3. For graph classiﬁcation, one may consider the problem of classifying graphs generated from diﬀerent parameter settings of the CSBM. In this case, our results on graph attention s impact on node representations may be used to establish results on graph attention s impact on graph representations after certain pooling layers. In particular, if graph attention convolution generates good node representations that are indicative of node class memberships, then one can show that the graph representation obtained from appropriately aggregating node representations would be indicative of the clustering structure of the graph, and hence the graph representation would be useful for graph classiﬁcation tasks.

1.2 Relevant work

Recently the concept of attention for neural networks (Bahdanau et al., 2015; Vaswani et al., 2017) was transferred to graph neural networks (Li et al., 2016; Bresson and Laurent, 2018; Veliˇckovi c et al., 2018; Lee et al., 2019; Puny et al., 2020). A few papers have attempted to understand the attention mechanism in Veliˇckovi c et al. (2018). One work relevant to ours is Brody et al. (2022). In this paper, the authors show that a node may fail to assign large edge weight to its most important neighbors due to a global ranking of nodes generated by the attention mechanism in Veliˇckovi c et al. (2018). Another related work is Knyazev et al. (2019), which presents an empirical study of the ability of graph attention to generalize on larger, complex, and noisy graphs. In addition, in Hou et al. (2019) the authors propose a diﬀerent metric to generate the attention coeﬃcients and show empirically that it has an advantage over the original GAT architecture.

Other related work to ours, which does not focus on graph attention, comes from the ﬁeld of statistical learning on random data models. Random graphs and the stochastic block model have been traditionally used in clustering and community detection (Abbe, 2018; Athreya et al., 2018; Moore, 2017). Moreover, the works by Binkiewicz et al. (2017); Deshpande et al. (2018), which also rely on CSBM are focused on the fundamental limits of unsupervised learning. Of particular relevance is the work by Baranwal et al. (2021), which studies the performance of graph convolution on CSBM as a semi-supervised learning problem. Within the context of random graphs, Keriven et al. (2021) studies the approximation power of graph neural networks on random graphs. In Maskey et al. (2022) the authors

Graph Attention Retrospective

derive generalization error of graph neural networks for graph classiﬁcation and regression tasks. In our paper we are interested in understanding graph attention s capability for edge/node classiﬁcation with respect to diﬀerent parameter regimes of CSBM. Finally, there are a few related theoretical works on understanding the generalization and representation power of graph neural networks (Chen et al., 2019; Chien et al., 2021; Zhu et al., 2020; Xu et al., 2019; Garg et al., 2020; Loukas, 2020a,b). For a recent survey in this direction see Jegelka (2022). Our work takes a statistical perspective which allows us to characterize the precise performance of graph attention compared to graph convolution and no convolution for CSBM, with the goal of answering the particular questions that we imposed above.

2. Preliminaries

In this section, we describe the Contextual Stochastic Block Model (CSBM) (Deshpande et al., 2018) which serves as our data model, and the Graph Attention mechanism (Veliˇckovi c et al., 2018). We also deﬁne notations and terms that are frequently used throughout the paper. For a vector x Rn and n N, the norm x denotes the Euclidean norm of x, i.e.

x def = P i [n] x2 i . We write [n] def = {1, 2, . . . , n}. We use Ber(p) to denote the Bernoulli distribution, so x Ber(p) means the random variable x takes value 1 with probability p and 0 with probability 1 p. Let d, n N, and ϵ1, . . . , ϵn Ber(1/2). Deﬁne two classes as Ck = {j [n] | ϵj = k} for k {0, 1}. For each index i [n], we set the feature vector Xi Rd as Xi N((2ϵi 1)µ, σ2I), where µ Rd, σ R and I {0, 1}d d is the identity matrix.2 For a given pair p, q [0, 1] we consider the stochastic adjacency matrix A {0, 1}n n deﬁned as follows. For i, j [n] in the same class (i.e., intra-class edge), we set aij Ber(p), and if i, j are in diﬀerent classes (i.e., inter-class edge), we set aij Ber(q). We denote by (X, A) CSBM(n, p, q, µ, σ2) a sample obtained according to the above random process. An advantage of CSBM is that it allows us to control the noise by controlling the parameters of the distributions of the model. In particular, CSBM allows us to control the distance of the means and the variance of the Gaussians, which are important for controlling the separability of the Gaussians. For example, ﬁxing the variance, then the closer the means are the more diﬃcult the separation of the node Gaussian features becomes. Another notable advantage of CSBM is that it allows us to control the structural noise and homophily level of the graph. The level of noise in the graph is controlled by increasing or reducing the gap between the intra-class edge probability p and the inter-class edge probability q, and the level of homophily in the graph is controlled by the relative magnitude between p and q. For example, when p is much larger than q, then a node is more likely to be connected with a node from the same class, and hence we obtain a homophilous graph; when q is much larger than p, then a node is more likely to be connected with a node from a diﬀerent class, and hence we obtain a heterophilous graph. There are several recent works exploring the behavior of GNNs in heterophilous graphs (Lim et al., 2021; Yan et al., 2022; Bodnar et al., 2022; Luan et al., 2022). Interestingly, we note that our results for graph attention s

2. The means of the mixture of Gaussians are µ. Our results can be easily generalized to general means. The current setting makes our analysis simpler without loss of generality.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

behavior over the CSBM data model do not depend on whether the graph is homophilous or heterophilous. Given node representations hi RF for i [n], a (spatial) graph convolution for node i has output

j [n] Aijcij Whj, cij = X

where W RF F is a learnable matrix. Throughout this paper, we will refer to this operation as simple graph convolution or standard graph convolution. Our deﬁnition of graph convolution is essentially the mean aggregation step in a general GNN layer (Hamilton et al., 2017). The normalization constant cij in our deﬁnition is closely related to the symmetric normalization cij = (P ℓ [n] Aiℓ) 1/2(P ℓ [n] Ajℓ) 1/2 used in the original Graph Convolutional Network (GCN) introduced by Kipf and Welling (2017). Our deﬁnition does not aﬀect the discussions or implications we have for GCN with symmetric normalization. More broadly, there are other forms of graph convolution in the literature (Bronstein et al., 2021; Deﬀerrard et al., 2016; Levie et al., 2018), which we do not compare within this work. A single-head graph attention applies some weight function on the edges based on their node features (or a mapping thereof). Given two representations hi, hj RF for two nodes i, j [n], the attention model/mechanism is deﬁned as the mapping

Ψ(hi, hj) def = α(Whi, Whj)

where α : RF RF R and W RF F is a learnable matrix. The attention coeﬃcient for a node i and its neighbor j is deﬁned as

γij def = exp(Ψ(hi, hj)) P ℓ Ni exp(Ψ(hi, hℓ)), (1)

where Ni is the set of neighbors of node i that also includes node i itself. Let f be some element-wise activation function (which is usually nonlinear), the graph attention convolution output for a node i [n] is given by

j [n] Aijγij Whj,

hi = f(h i).

A multi-head graph attention (Veliˇckovi c et al., 2018) uses K N weight matrices W1, . . ., WK RF F and averages their individual (single-head) outputs. We consider the most simpliﬁed case of a single graph attention layer (i.e., F = d and F = 1) where α is realized by an MLP using the Leaky Relu activation function. The Leaky Relu activation function is deﬁned as Leaky Relu(x) = x if x 0 and Leaky Relu(x) = βx for some constant β [0, 1) if x < 0. The CSBM model induces node features X which are correlated through the graph G = ([n], E), represented by an adjacency matrix A. A natural requirement of attention architectures is to maintain important edges in the graph and ignore unimportant edges. For example, important edges could be the set of intra-class edges and unimportant edges

Graph Attention Retrospective

could be the set of inter-class edges. In this case, if graph attention mains all intra-class edges and ignores all inter-class edges, then a node from a class will be connected only to nodes from its own class. More speciﬁcally, a node v will be connected to neighbor nodes whose associated node features come from the same distribution as node features of v. Given two sets A and B, we denote A B def = {(i, j) : i A, j B} and A2 def = A A.

3. Separation of edges and its implications for graph attention s ability for perfect node classiﬁcation

In this section, we study the ability of graph attention to separate important edges from unimportant edges. In addition, we study the implications of graph attention s behavior for node classiﬁcation. In particular, we are interested in whether or not the graph attention convolution is able to perfectly classify the nodes in the graph. Depending on the parameter regimes of the CSBM, we separate the results into two parts. In Section 3.1, we study graph attention s behavior when the distance between the means of the node features is large. We construct a speciﬁc attention architecture and demonstrate its ability to separate intraclass edges from inter-class edges. Consequently, we show that the corresponding graph attention convolution is able to perfectly classify the nodes. In Section 3.2, we study graph attention s behavior when the distance between the means of the node features is small. We show that with high probability no attention architecture is able to separate intra-class edges from inter-class edges. Consequently, we conjecture that no graph attention is able to perfectly classify the nodes. We provide evidence of this conjecture in Section 3.2.1, where we assume the existence of a strong attention mechanism that is able to perfectly classify the edges independently from node features. We show that using this idealistic attention mechanism still fails to (almost) perfectly classify the nodes, provided that the distance between the means of the node features is suﬃciently small. The two-parameter regimes of the CSBM that we consider in Section 3.1 and Section 3.2 are as follows. The ﬁrst ( easy regime ) is where µ = ω(σ log n), and the second ( hard regime ) is where µ = κσ for some 0 < κ O( log n). We start by deﬁning edge separability and node separability.

Deﬁnition 1 (Edge separability). Given an attention model Ψ, we say that the model separates the edges, if the outputs satisfy

sign(Ψ(Xi, Xj)) = sign(Ψ(Xk, Xℓ))

whenever (i, j) is an intra-class edge, i.e. (i, j) (C2 1 C2 0) E, and (k, ℓ) is an inter-class edge, i.e. (k, ℓ) E \ (C2 1 C2 0).

Deﬁnition 2 (Node separability). Given a classiﬁcation model which outputs a scalar representation h i for node i, we say that the model separates the nodes if h i > 0 whenever i C1 and h i < 0 whenever i C0.

Some of our results in this section rely on a mild assumption that lower bounds the sparsity of the graph generated by the CSBM model. This assumption says that the expected degree of a node in the graph is larger than log2 n. It is required to obtain a concentration of node degrees property used in the proofs. While this assumption may limit a direct

Fountoulakis, Levi, Yang, Baranwal and Jagannath

application of our results to very sparse graphs, for example, graphs whose node degrees are bounded by a constant, it is mainly an artifact of the proof techniques that we use. We expect that similar results still hold for sparser graphs, but to rigorously remove the assumption on graph density one may have to adopt a diﬀerent proof technique.

Assumption 1. p, q = Ω(log2 n/n).

3.1 Easy Regime

In this regime µ = ω(σ log n) we show that a two-layer MLP attention is able to separate the edges with high probability. Consequently, we show that the corresponding graph attention convolution is able to separate the nodes with high probability. At a high level, we transform the problem of classifying an edge (i, j) E into the problem of classifying a point [ w T Xi, w T Xj] in R2, where w = µ/ µ is a unit vector that maximizes the total pairwise distances among the four means given below. When we consider the set of points [ w T Xi, w T Xj] for all (i, j) E, we can think of each point as a two-dimensional Gaussian vector whose mean is one of the following: [ w T µ, w T µ], [ w T µ, w T µ], [ w T µ, w T µ], [ w T µ, w T µ]. The set of intra-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either [ w T µ, w T µ] or [ w T µ, w T µ], while the set of inter-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either [ w T µ, w T µ] or [ w T µ, w T µ]. Therefore, in order to correctly classify the edges, we need to correctly classify the data corresponding to means [ w T µ, w T µ] and [ w T µ, w T µ] as 1, and classify the data corresponding to the other means as 1. This problem is known in the literature as the XOR problem (Minsky and Papert, 1969). To achieve this we consider a two-layer MLP architecture Ψ which separates the ﬁrst and the third quadrants of the two-dimensional space from the second and the fourth quadrants. In particular, we consider the following speciﬁcation of Ψ(Xi, Xj),

Ψ(Xi, Xj) def = r T Leaky Relu S w T Xi w T Xj

w def = µ µ , S def =

1 1 1 1 1 1 1 1

, r def = R

where R > 0 is an arbitrary scaling parameter. The particular function Ψ has been chosen such that it is able to classify the means of the XOR problem correctly, that is,

sign(Ψ(E[Xi], E[Xj])) = 1, if (i, j) is an intra-class edge, 1, if (i, j) is an inter-class edge.

To see this, we can take look at the following simpliﬁed expression of Ψ, which is obtained by substituting the speciﬁcations of S and r from (4) to (3),

Ψ(Xi, Xj) =

2R(1 β) w T Xi, if w T Xj w T Xi , 2R(1 β) sign( w T Xi) w T Xj, if w T Xi < w T Xj < w T Xi , 2R(1 β) w T Xi, if w T Xj w T Xi . (5)

Graph Attention Retrospective

Then, by substituting w = µ/ µ into (5), one easily veriﬁes that

Ψ(E[Xi], E[Xj]) = 2R(1 β) µ , if (i, j) is an intra-class edge, 2R(1 β) µ , if (i, j) is an inter-class edge. (6)

On the other hand, our assumption that µ = ω(σ log n) guarantees that the distance between the means of the XOR problem is much larger than the standard deviation of the Gaussians, and thus with high probability there is no overlap between the distributions. More speciﬁcally, one can show that with a high probability,

Ψ(Xi, Xj) = 2R(1 β) µ (1 o(1)), if (i, j) is an intra-class edge, 2R(1 β) µ (1 o(1)), if (i, j) is an inter-class edge. (7)

This property guarantees that with high probability,

sign(Ψ(Xi, Xj)) = sign(Ψ(E[Xi], E[Xj])), for all (i, j) E,

which implies perfect separability of the edges. We state this result below in Theorem 3 and provide detailed arguments in the appendix.

Theorem 3. Suppose that µ = ω(σ log n). Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), the two-layer MLP attention architecture Ψ given in (3) and (4) separates intra-class edges from inter-class edges.

Our analysis of edge separability by the attention architecture Ψ has two important implications. First, edge separability by Ψ implies a nice concentration result for the attention coeﬃcients γij deﬁned in (1). In particular, one can show that the attention coeﬃcients have the desirable property to maintain important edges and drop unimportant edges. Second, the nice distinguishing power of the attention coeﬃcients in turn implies the exact recoverability of the node memberships using the graph attention convolution. We state these two results below in Corollary 4 and Corollary 5, respectively. Denote

Ψ def = (1p q 1p<q) Ψ,

that is, Ψ = Ψ if p q and Ψ = Ψ if p < q, where Ψ is the two-layer MLP attention architecture given in (3) and (4). As it will become clear later, the attention architecture Ψ

allows attention coeﬃcients to assign more weights to either intra-class or inter-class edges, depending on if p q or p < q, and this will help us obtain a perfect separation of the nodes.

Corollary 4. Suppose that µ = ω(σ log n) and that Assumption 1 holds. Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), the attention architecture Ψ yields attention coeﬃcients γij such that

1. If p q, then γij = 2 np(1 o(1)) if (i, j) is an intra-class edge and γij = o( 1 n(p+q)) otherwise;

2. If p < q, then γij = 2 nq(1 o(1)) if (i, j) is an inter-class edge and γij = o( 1 n(p+q)) otherwise.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Corollary 5. Suppose that µ = ω(σ log n) and that Assumption 1 holds. Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the attention architecture Ψ with the graph attention convolution given in (2), where f is set to be the identify function, the model separates the nodes.

Corollary 4 shows the desired behavior of the attention mechanism, namely it is able to assign signiﬁcantly large weights to important edges while it drops unimportant edges. When p q, the attention mechanism maintains intra-class edges and essentially ignores all inter-class edges; when p < q, it maintains inter-class edges and essentially ignores all intra-class edges. We now explain the high-level idea of the proof and leave detailed arguments to Appendix A.2. Corollary 4 follows from the concentration of Ψ(Xi, Xj) around Ψ(E[Xi], E[Xj]) given by (7). Assume for a moment that p q so that 1p q 1p<q = 1. Then Ψ = Ψ, and (7) implies that the value of exp(Ψ (Xi, Xj)) when (i, j) is an intra-class edge is exponentially larger than the value of exp(Ψ (Xi, Xj)) when (i, j) is an inter-class edge. Therefore, by the deﬁnition of the attention coeﬃcients in (1), the denominator of γij is dominated by terms (i, k) where k is in the same class as i. Moreover, using concentration of node degrees guaranteed by Assumption 1, each node i is connected to 2 np(1 o(1)) many intra-class nodes. By appropriately setting the scaling parameter R in (4), the values of Ψ (Xi, Xk) for all intra-class edges (i, k) can be made within o(1) multiplicative factor from each other. Therefore we get γij = 2 np(1 o(1)) when (i, j) is an intra-class edge. A similar reasoning applies to inter-class edges and yields the vanishing value of γij when (i, j) is an inter-class edge. Finally, the argument for p < q follows analogously. The concentration result of attention coeﬃcients in Corollary 4 implies the node classiﬁcation result in Corollary 5, which holds for any p, q satisfying Assumption 1. That is, even when the graph structure is noisy, e.g., when p q, it is still possible to exactly recover the node class memberships. Essentially, by applying Corollary 4 and carrying out some straightforward algebraic simpliﬁcations, one can show that with high probability for all i [n],

j Ni γij w T Xj = (1p q 1p<q) µ (1 o(1)), if i C0, (1p q 1p<q) µ (1 o(1)), if i C1,

and hence the model separates the nodes with high probability. We provide details in the appendix. While Corollary 5 provides a positive result for graph attention, it can be shown that a simple linear classiﬁer which does not use the graph at all achieves perfect node separability with high probability. In particular, the Bayes optimal classiﬁer for the node features without the graph is able to separate the nodes with high probability. This means that in this regime, using the additional graph information is unnecessary, as it does not provide additional power compared to a simple linear classiﬁer for the node classiﬁcation task.

Lemma 6 (Section 6.4 in Anderson (2003)). Let (X, A) CSBM(n, p, q, µ, σ2). Then the optimal Bayes classiﬁer for X is realized by the linear classiﬁer

( 0 if µT Xi 0 1 if µT Xi > 0 . (8)

Graph Attention Retrospective

Proposition 7. Suppose µ = ω(σ log n). Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), the linear classiﬁer given in (8) separates the nodes.

The proof of Proposition 7 is elementary. To see the claim one may show that the probability that the classiﬁer in (8) misclassiﬁes a node i [n] is o(1). To do this, let us ﬁx i [n] and write Xi = (2ϵi 1)µ + σgi where gi N(0, I). Assume for a moment ϵi = 0. Then the probability of misclassiﬁcation is

Pr µT Xi > 0 = Pr µT gi

where Φ( ) is the cumulative distribution function of N(0, 1) and the last equality follows from the fact that µT gi

µ N(0, 1). The assumption that µ = ω(σ log n) implies µ σ 2 log n for large enough n. Therefore, using standard tail bounds for normal distribution (Vershynin, 2018) we have that

2π µ exp µ 2

n 1 4π log n.

This means that the probability that there exists i C0 which is misclassiﬁed is at most 1 2 4π log n = o(1). A similar argument can be applied to the case where ϵi = 1, and an application of a union bound on the events that there is i [n] which is misclassiﬁed ﬁnishes the proof of Proposition 7.

3.2 Hard Regime

In this regime ( µ = κσ for κ O( log n)), we show that every attention architecture Ψ fails to separate the edges if κ < 2 log n, and we conjecture that no graph attention convolution is able to separate the nodes. The conjecture is based on our node separability result in Section 3.2.1 which says that, even if we assume that there is an attention architecture which is able to separate the edges independently from node features, the corresponding graph attention convolution still fails to (almost) perfectly classify the nodes with high probability. The goal of the attention mechanism is to decide whether an edge (i, j) is an inter-class edge or an intra-class edge based on the node feature vectors Xi and Xj. Let X ij denote the vector obtained from concatenating Xi and Xj, that is,

X ij def = Xi Xj

We would like to analyze every classiﬁer h which takes as input X ij and tries to separate inter-class edges and intra-class edges. An ideal classiﬁer would have the property

y = h (X ij) = 0, if (i, j) is an inter-class edge, 1, if (i, j) is an intra-class edge. (10)

To understand the limitations of all such classiﬁers in this regime, it suﬃces to consider the Bayes optimal classiﬁer for this data model, whose probability of misclassifying an arbitrary edge lower bounds that of every attention architecture which takes as input (Xi, Xj).

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Consequently, by deriving a misclassiﬁcation rate for the Bayes classiﬁer, we obtain a lower bound on the misclassiﬁcation rate for every attention mechanism Ψ for classifying intraclass and inter-class edges. The following Lemma 8 describes the Bayes classiﬁer for this classiﬁcation task.

Lemma 8. Let (X, A) CSBM(n, p, q, µ, σ2) and let X ij be deﬁned as in (9). The Bayes optimal classiﬁer for X ij is realized by the following function,

( 0, if p cosh x T µ

σ2 q cosh x T ν

1, otherwise, (11)

where µ def = µ µ

and ν def = µ µ

Using Lemma 8, we can lower bound the rate of misclassiﬁcation of edges that every attention mechanism Ψ exhibits. Below we deﬁne Φc def = 1 Φ, where Φ is the cumulative distribution function of N(0, 1).

Theorem 9. Suppose µ = κσ for some κ > 0 and let Ψ be any attention mechanism. Then,

1. With probability at least 1 o(1), Ψ fails to correctly classify at least 2 Φc(κ)2 fraction of inter-class edges;

2. For any K > 0 if q > K log2 n

nΦc(κ)2 , then with probability at least 1 O(n 8KΦc(κ)2 log n), Ψ misclassify at least one inter-class edge.

Part 1 of Theorem 9 implies that if µ is linear in the standard deviation σ, that is if κ = O(1), then with overwhelming probability the attention mechanism fails to distinguish a constant fraction of inter-class edges from intra-class edges. Furthermore, part 2 of Theorem 9 characterizes a regime for the inter-class edge probability q where the attention mechanism fails to distinguish at least one inter-class edge. It provides a lower bound on q in terms of the scale at which the distance between the means grows compared to the standard deviation σ. This aligns with the intuition that as we increase the distance between the means, it gets easier for the attention mechanism to correctly distinguish inter-class and intra-class edges. However, if q is also increased with the right proportion, in other words, if the noise in the graph is increased, then the attention mechanism would still fail to correctly distinguish at least one inter-class edge. For instance, for κ = 2 log log n and K = log2 n, we get that if q > Ω( log6+o(1) n

n ), then with probability at least 1 o(1), Ψ misclassiﬁes at least an inter-class edge. The proof of Theorem 9 relies on analyzing the behavior of the Bayes optimal classiﬁer in (11). We compute an upper bound on the probability with which the optimal classiﬁer correctly classiﬁes an arbitrary inter-class edge. Then the proof of part 1 of Theorem 9 follows from a concentration argument for the fraction of inter-class edges that are misclassiﬁed by the optimal classiﬁer. For part 2, we use a similar concentration argument to choose a suitable threshold for q that forces the optimal classiﬁer to fail on at least one inter-class edge. We provide formal arguments in the appendix.

Graph Attention Retrospective

As a motivating example of how the attention mechanism would fail and what exactly the attention coeﬃcients would behave in this regime, we focus on one of the most popular attention architecture (Veliˇckovi c et al., 2018), where α is a single-layer neural network parameterized by (w, a, b) Rd R2 R with Leaky Relu activation function. Namely, the attention coeﬃcients are deﬁned by

γij def = exp Leaky Relu a T w T Xi w T Xj

P ℓ Ni exp Leaky Relu a T w T Xi w T Xℓ

We show that, as a consequence of the inability of the attention mechanism to distinguish intra-class and inter-class edges, with overwhelming probability most of the attention coefﬁcients γij given by (12) are going to be Θ(1/|Ni|). In particular, Theorem 10 says that for the vast majority of nodes in the graph, the attention coeﬃcients on most edges are uniform irrespective of whether the edge is inter-class or intra-class. As a result, this means that the attention mechanism is unable to assign higher weights to important edges and lower weights to unimportant edges.

Theorem 10. Assume that µ Kσ and σ K for some absolute constants K and K . Moreover, assume that the parameters (w, a, b) Rd R2 R are bounded. Then, with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), there exists a subset A [n] with cardinality at least n(1 o(1)) such that for all i A the following hold:

1. There is a subset Ji,0 Ni C0 with cardinality at least 9 10|Ni C0|, such that γij = Θ(1/|Ni|) for all j Ji,0.

2. There is a subset Ji,1 Ni C1 with cardinality at least 9 10|Ni C1|, such that γij = Θ(1/|Ni|) for all j Ji,1.

Theorem 10 is proved by carefully computing the numerator and the denominator in (12). In this regime, µ is not much larger than σ, that is, the signal does not dominate noise, so the numerator in (12) is not indicative of the class memberships of nodes i, j but rather acts like Gaussian noise. On the other hand, denote the denominator in (12) by δi and observe that it is the same for all γiℓwhere ℓ Ni. Using concentration arguments about {w T Xℓ}ℓ [n] yields γij = Θ(1/δi) and δi = Θ(|Ni|) ﬁnishes up the proof. We provide details in the appendix. Compared to the easy regime, it is diﬃcult to obtain a separation result for the nodes without additional assumptions. In the easy regime, the distance between the means was much larger than the standard deviation, which made the signal (the expectation of the convolved data) dominate the noise (i.e., the variance of the convolved data). In the hard regime the noise dominates the signal . Thus, we conjecture the following.

Conjecture 11. There is an absolute constant M > 0 such that, whenever µ M

log n n(p+q)(1 max(p, q)) p+q |p q|, every graph attention model fails to perfectly classify the nodes with high probability.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

The above conjecture means that in the hard regime, the performance of the graph attention model depends on q as opposed to the easy regime, where in Corollary 5 we show that it doesn t. This property is veriﬁed by our synthetic experiments in Section 5. The quantity σ q

log n n(p+q)(1 max(p, q)) in the threshold comes from our conjecture that the expected maximum noise of the graph attention convolved data over the nodes is at least cσ q

log n n(p+q)(1 max(p, q)) for some constant c > 0. The quantity p+q |p q| in the threshold comes from our conjecture that the distance between the means (i.e. signal ) of the graph attention convolved data is reduced to at most |p q|/(p+q) of the original distance. Proving Conjecture 11 would require delicate treatment of the correlations between the attention coeﬃcients γij and the node features Xi for i [n].

3.2.1 Are good attention coefficients helpful in the hard regime ?

In this subsection we are interested in understanding the implications of edge separability on node separability in the hard regime and when Ψ is restricted to a speciﬁc class of functions. In particular, we show that Conjecture 11 is true under an additional assumption that Ψ does not depend on the node features. In addition, we show that even if we were allowed to use an extremely good attention function Ψ which separates the edges with an arbitrarily large margin, with high probability the graph attention convolution (2) will still misclassify at least one node as long as µ /σ is suﬃciently small. We consider the class of functions Ψ which can be expressed in the following form:

Ψ(i, j) = sign(p q)t, if (i, j) is an intra-class edge, sign(p q)t, if (i, j) is an inter-class edge, (13)

for some t 0. The particular class of functions in (13) is motivated by the property of the ideal edge classiﬁer in (10) and the behavior of Ψ in (6) when it is applied to the means of the Gaussians. There are a few possible ways to obtain a function Ψ which satisﬁes (13). For example, in the presence of good edge features which reﬂect the class memberships of the edges, we can make Ψ take as input the edge features. Moreover, if | p q| > p

2 log n/n, one such Ψ may be easily realized from the eigenvectors of the graph adjacency matrix. By the exact spectral recovery result in Lemma 12, we know that there exists a classiﬁer ˆτ which separates the nodes. Therefore, we can set Ψ(i, j) = sign(p q)t if ˆτ(i) = ˆτ(j) and Ψ(i, j) = sign(p q)t otherwise.

Lemma 12 (Exact recovery in (Abbe, 2018)). Suppose that p, q = Ω(log2 n/n) and | p q| > p

2 log n/n. Then there exists a classiﬁer ˆτ taking as input the graph A and perfectly classiﬁes the nodes with probability at least 1 o(1).

Theorem 13. Suppose that p, q satisfy Assumption 1 and that p, q are bounded away from 1. For every ϵ > 0, there are absolute constants M, M = O( ϵ) such that, with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the graph attention convolution in (2) and the attention architecture Ψ in (13), the model misclassiﬁes at least Ω(n1 ϵ) nodes for any w such that w = 1, if

1. t = O(1) and µ Mσ q

log n n(p+q)(1 max(p, q)) p+q

Graph Attention Retrospective

2. t = ω(1) and µ M σ q

log n n(p+q)(1 max(p, q)).

Theorem 13 warrants some discussions. We start with the role of t in the attention function (13). One may think of t as the multiplicative margin of separation for intra-class and intra-class edges. When t = O(1), the margin of separation is at most a constant. This includes the special case when Ψ(i, j) = 0 for all (i, j) E, i.e, the margin of separation is 0. In this case, the graph attention convolution in (2) reduces to the standard graph convolution with uniform averaging among the neighbors. Therefore, part 1 of Theorem 13 also applies to the standard graph convolution. On the other hand, when t = ω(1), the margin of separation is not only bounded away from 0 but also it grows with n. Next, we discuss the additional assumption that p, q are bounded away from 1. This assumption is used to obtain a concentration result required for the proof of Theorem 13. It is also intuitive in the following sense. If both p and q are arbitrarily close to 1, then after the convolution the convolved node feature vectors collapse to approximately a single point, and thus this becomes a trivial case where no classiﬁer is able to separate the nodes; on the other hand, if p is arbitrarily close to 1 and q is very small, then after the convolution the convolved node feature vectors collapse to approximately one of two points according to which class the node comes from, and in this case the nodes can be easily separated by a linear classiﬁer. We now focus on the threshold for µ under which the model is going to misclassify at least one node with high probability. In part 1 of Theorem 13, t = O(1), i.e., the attention mechanism Ψ is either unable to separate the edges or unable to separate the edges with a large enough margin. In this case, one can show that all attention coeﬃcients are Θ( 1 n(p+q)). Consequently, the quantity |p q| appears in denominator of the threshold for µ in part 1 of Theorem 13. Because of that, if p and q are arbitrarily close, then the model is not able to separate the nodes irrespective of how large µ is. For example, treating 1 max(p, q) as a constant since p and q are bounded away from 1 by assumption, we have that

log n n(p + q)(1 max(p, q)) p + q

|p q| = ω(σ p

This means that if p and q are close enough, every attention function Ψ in the form of (13) and t = O(1) cannot help classify all nodes correctly even if µ = ω(σ log n). On the contrary, recall that in the easy regime where µ = ω(σ log n), the attention mechanism given in (3) and (4) helps separate the nodes with high probability. This illustrates the limitation of every attention mechanism in the form of (13) which have an insigniﬁcant margin of separation. According to Theorem 10, the vast majority of attention coeﬃcients are uniform, and thus in Conjecture 11 we expect that graph attention in general shares similar limitations in the hard regime. In part 2 of Proposition 13, t = ω(1), i.e., the attention mechanism Ψ separates the edges with a large margin. In this case, one can show that the attention coeﬃcients on important edges (e.g. intra-class edges) are exponentially larger than those on unimportant edges (e.g. inter-class edges). Consequently, the factor (p + q)/|p q| no longer appears in the threshold for µ in part 2 of Theorem 13. However, at the same time, the threshold also implies that, even when we have a perfect attention mechanism that is able to separate

Fountoulakis, Levi, Yang, Baranwal and Jagannath

the edges with a large margin, as long as µ /σ is small enough, then the model is going to misclassify at least one node with high probability. Finally, notice that in Theorem 13 the parameter ϵ captures a natural tradeoﬀbetween the threshold for µ and the lower bound on the number of misclassiﬁed nodes. Namely, the smaller the ϵ is, the smaller the threshold for µ becomes, and hence the less signal there is in the node features, the more mistakes the model is going to make. This is precisely demonstrated by the scaling of M, M = O( ϵ) and misclassiﬁcation bound Ω(n1 ϵ) with respect to ϵ. We leave the proof of Theorem 13 to the appendix.

4. Robustness to structural noise and its implications beyond perfect node classiﬁcation

In this section, we provide a positive result on the capacity of graph attention convolution for node classiﬁcation beyond the perfect classiﬁcation regime. In particular, we show that independent of the parameters of CSBM, i.e., independent of p, q, µ and σ, the two-layer MLP attention architecture Ψ from Section 3.1 is able to achieve the classiﬁcation performance obtainable by the Bayes optimal classiﬁer for node features. This is proved by showing that there is a parameter setting for Ψ where the attention coeﬃcient on self-loops can be made exponentially large. Consequently, the corresponding graph attention convolution behaves like a linear function of node features. We provide two corollaries of this result. The ﬁrst corollary provides a ranking of graph attention convolution, simple graph convolution, and linear function in terms of their ability to classify nodes in CSBM. More speciﬁcally, by noticing that the simple graph convolution is also realized by a speciﬁc parameter setting of the attention architecture Ψ, our result implies that the performance of graph attention convolution for node classiﬁcation in CSBM is lower bounded by the best possible performance between a linear classiﬁer and simple graph convolution. In particular, graph attention is strictly more powerful than simple graph convolution when the graph is noisy (e.g. when p q), and it is strictly more powerful than a linear classiﬁer when the graph is less noisy (e.g. when p and q are signiﬁcantly diﬀerent). The second corollary provides perfect classiﬁcation, almost perfect classiﬁcation, and partial classiﬁcation results for graph attention convolution. It follows immediately from the reduction of graph attention convolution to a linear function under the speciﬁcation of Ψ that we will discuss. In what follows we start with high-level ideas, then we present formal statements of the results, and we discuss the implications in detail. Recall the two-layer MLP attention architecture Ψ from (3) and (4) is equivalently written in (5) as

Ψ(Xi, Xj) =

2R(1 β) w T Xi, if w T Xj w T Xi , 2R(1 β) sign( w T Xi) w T Xj, if w T Xi < w T Xj < w T Xi , 2R(1 β) w T Xi, if w T Xj w T Xi .

We make the following observations. Assuming that the scaling parameter R > 0,

If w T Xi > 0, then the function Ψ assigns more weight to an edge (i, j) such that w T Xj > 0 than an edge (i, j ) such that w T Xj < 0;

If w T Xi < 0, then the function Ψ assigns more weight to an edge (i, j) such that w T Xj < 0 than an edge (i, j ) such that w T Xj > 0;

Graph Attention Retrospective

If w T Xi = 0, then the function Ψ assigns uniform weight to every edge (i, j).

This means that the behavior of Ψ depends on which side of the hyperplane {x : w T x = 0} that Xi comes from. In other words, for ﬁxed Xj, whether the attention function Ψ will up-weight or down-weight an edge (i, j) depends entirely on the classiﬁcation of Xi based on the linear classiﬁer w. Moreover, note that the attention function value satisﬁes

2R(1 β) max{ | w T Xi|, | w T Xj|} Ψ(Xi, Xj) 2R(1 β) min{| w T Xi|, | w T Xj|}.

Therefore, out of all unit norm vectors w, our choice w = µ/ µ maximizes the range of values that Ψ can output. Recall from Lemma 6 that w also happens to characterize the Bayes optimal classiﬁer for the node features. Finally, the attention function Ψ achieves minimum/maximum at self-attention, i.e.

Ψ(Xi, Xi) = 2R(1 β)| w T Xi| = max j [n] Ψ(Xi, Xj),

Ψ(Xi, Xi) = 2R(1 β)| w T Xi| = min j [n] Ψ(Xi, Xj).

A consequence of these observations is that, by setting the scaling parameter R suﬃciently large, one can make exp(Ψ(Xi, Xj)) exponentially larger than exp(Ψ(Xi, Xj )) for any j, j

such that sign( w T Xj) = sign( w T Xi) and sign( w T Xj ) = sign( w T Xi). According to the deﬁnition of attention coeﬃcients in (1), this means that the attention coeﬃcients γij where sign( w T Xj) = sign( w T Xi) are going to be exponentially larger than the attention coeﬃcients γij where sign( w T Xj ) = sign( w T Xi). Therefore, one could expect that in this case, if the linear classiﬁer w correctly classiﬁes Xi for some i [n], e.g., for i C1 this means that w T Xi > 0, then graph attention convolution output h i = P j Ni γij w T Xj also satisﬁes h i > 0, due to suﬃciently larger attention coeﬃcients γij for which w T Xj > 0. We state the result below in Theorem 14 and leave detailed arguments in the appendix.

Theorem 14. With probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log2 n/σ), the graph attention convolution output satisﬁes

j Ni γij w T Xj > 0 if and only if w T Xi > 0, i [n],

j Ni γij w T Xj < 0 if and only if w T Xi < 0, i [n].

Theorem 14 means that there is a parameter setting for the attention architecture Ψ such that the performance of graph attention convolution matches with the performance of the Bayes optimal classiﬁer for node features. This shows the ability of graph attention to ignore the graph structure, which can be beneﬁcial when the graph is noisy. For example, if p = q, then it is easy to see that simple graph convolution completely mixes up the node features, making it not possible to achieve any meaningful node classiﬁcation result. On the other hand, as long as there is some signal from the original node features, i.e. µ > 0, then graph attention will be able to pick that up and classify the nodes at least as good as the best classiﬁer for the node features alone. It is also worth noting that by

Fountoulakis, Levi, Yang, Baranwal and Jagannath

setting R = 0, the attention function Ψ has a constant value, and hence graph attention convolution reduces to the standard graph convolution, which has been shown to be useful in the regime where the original node features are not very strong but the graph has a nice structure (Baranwal et al., 2021). For example, when there is a signiﬁcant gap between p and q, |p q|/(p + q) = Ω(1), then setting R = 0 could signiﬁcantly improve the node separability threshold over the best linear classiﬁer (Baranwal et al., 2021). This shows the robustness of graph attention against noise in one of the two sources of information, namely node features and edge connectivities. Unlike the Bayes optimal classiﬁer for node features which is sensitive to feature noise or the simple graph convolution which is sensitive to structural noise, one can expect graph attention to work as long as at least one of the two sources of information has a good signal. Therefore, we obtain a ranking of node classiﬁcation models among graph attention convolution, simple graph convolution, and a linear classiﬁer. We state this below in Corollary 15.

Corollary 15. The node classiﬁcation performance obtainable by graph attention convolution is lower bounded by the best possible node classiﬁcation performance between a simple graph convolution and a linear classiﬁer.

By a straightforward characterization of the performance of the linear classiﬁer w = µ/ µ , we immediately obtain the following classiﬁcation results in Corollary 16. Recall that we denoted Φ as the cumulative distribution function of the standard normal distribution. Write µ = κσ for some κ > 0.

Corollary 16. With probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log2 n/σ), one has that

(Perfect classiﬁcation) If κ 2 log n then all nodes are correctly classiﬁed;

(Almost perfect classiﬁcation) If κ = ω(1) then at least 1 o(1) fraction of all nodes are correctly classiﬁed;

(Partial classiﬁcation) If κ = O(1) then at least Φ(κ) o(1) fraction of all nodes are correctly classiﬁed.

Interestingly, we note that the perfect classiﬁcation result in Corollary 16 is nearly equivalent (up to a small order in the threshold κ) to the perfect classiﬁcation result in Corollary 5 from Section 3.1. They are obtained from diﬀerent behaviors of the attention coeﬃcients. This shows that there could be more than one type of good attention coeﬃcients that are able to deliver good node classiﬁcation performance.

5. Experiments

In this section, we demonstrate empirically our results on synthetic and real data. The parameters of the models that we experiment with are set by using an ansatz based on our theorems. The particular details are given in Section 5.1. For real datasets, we use the default splits which come from Py Torch Geometric (Fey and Lenssen, 2019) and OGB (Hu

Graph Attention Retrospective

et al., 2020). In all our experiments we use MLP-GAT, where the attention mechanism Ψ is set to be the two-layer network in (3) and (4) with R = 1. For synthetic experiments using CSBM with known p and q, we use the variant that takes p, q into account, Ψ = (1p q 1p<q)Ψ. In Figure 2b and Figure 3b we additionally consider the original GAT architecture of Veliˇckovi c et al. (2018) to demonstrate Theorem 10.

5.1 Ansatz for GAT, MLP-GAT and GCN

For the original GAT architecture we ﬁx w = µ/ µ and deﬁne the ﬁrst head as a1 = 1

2(1, 1) and b1 = 1

2w T µ; The second head is deﬁned as a2 = a1 and b2 = b1. We now discuss the choice of such an ansatz. The parameter w is picked based on the optimal Bayes classiﬁer without a graph, and the attention is set such that the ﬁrst head maintains intra-class edges in C1 and the second head maintains intra-class edges in C0. Note that for the original GAT (Veliˇckovi c et al., 2018), due to the fact that the attention mechanism consists of just one layer (i.e. a nonlinear activation applied on a linear transformation, see (12)), it is not possible for the original GAT to keep only γij which correspond to intraclass edges. More speciﬁcally, one may use the same techniques in the proof of Theorem 3 and Corollaries 4 and 5 to prove the node separability results for the original GAT. In this particular case, the result will depend on q in contrast to the result we get for MLP-GAT, where no dependence of q was needed. For MLP-GAT we use the ansatz Ψ = (1p q 1p<q)Ψ where Ψ is given in (3) and (4) with R = 1. This choice of two-layer network allows us to bypass the XOR problem (Minsky and Papert, 1969) and separate inter-class from intraclass edges as shown in Theorem 3. Note that no single-layer architecture will be able to separate the edges due to the XOR problem . For GCN we used the ansatz from Baranwal et al. (2021) which is also w = µ/ µ .

5.2 Synthetic data

We use the CSBM to generate the data. In a recent work (Palowitch et al., 2022), a simple variant of the CSBM was also chosen as the default generative model for benchmarking GNN performance for node classiﬁcation tasks. We present two sets of experiments. In the ﬁrst set, we ﬁx the distance between the means and vary q, and in the second set, we ﬁx q and vary the distance. We set n = 1000, d = n/ log2(n), p = 0.5 and σ = 0.1. Results are averaged over 10 trials.

5.2.1 Fixing the distance between the means and varying q

We consider the two regimes separately, where for the easy regime we ﬁx the mean µ to be a vector where each coordinate is equal to 10σ p

d. This guarantees that the distance between the means is 10σ p

log n2. In the hard regime we ﬁx the mean µ to a vector where each coordinate is equal to σ/

d, and this guarantees that the distance is σ. We ﬁx p = 0.5 and vary q from log2(n)/n to 1 log2(n)/n. In Figure 1 we illustrate Theorem 3 and Corollaries 4, 5 for the easy regime, and in Figure 2 we illustrate Theorem 9 and Theorem 10 for the hard regime. In particular, in Figure 1a we show Theorem 3, MLP-GAT is able to classify intra-class and inter-class edges perfectly. In Figure 1b we show that in the easy regime, as claimed in Corollary 4 for MLP-

Fountoulakis, Levi, Yang, Baranwal and Jagannath

GAT, when p q, the γ that correspond to intra-class edges concentrate around 2/np, while the γ for the inter-class edges concentrate to tiny values; when p < q, we see the opposite, that is the γ that correspond to intra-class edges concentrate to tiny values, while the γ for the inter-class edges concentrate around 2/nq. In Figure 1c we observe that the performance of MLP-GAT for node classiﬁcation is independent of q in the easy regime as claimed in Corollary 5. However, in this plot, we observe that not using the graph also achieves perfect node classiﬁcation, a result which is proved in Proposition 7. In the same plot, we also show the performance of simple graph convolution, where its performance depends on q (see Baranwal et al. (2021)). In Figure 2a we show Theorem 9. MLP-GAT misclassiﬁes a constant fraction of the intra and inter edges as proved in Theorem 9. In Figure 2b we show Theorem 10, where γ s in the hard regime concentrate around uniform (GCN) coeﬃcients for both MLP-GAT and GAT. In Figure 2c we illustrate that node classiﬁcation accuracy is a function of q for MLP-GAT. This is conjectured in Conjecture 11. On the other hand, note that the performance of MLP-GAT is lower bounded by the performance of not using the graph, as proved in Theorem 14.

0.2 0.4 0.6 0.8 q

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(a) Edge classiﬁcation

0.2 0.4 0.6 0.8 q

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(b) Attention coeﬃcients

0.2 0.4 0.6 0.8 q

Classification accuracy

MLP-GAT GCN No-graph p=q

(c) Node classiﬁcation

Figure 1: Demonstration of Theorem 3 and Corollaries 4, 5 for the easy regime. The shaded areas in the plots show standard deviation. Unlike GCN, the performance of MLP-GAT does not degrade as we vary q.

Graph Attention Retrospective

0.2 0.4 0.6 0.8 q

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(a) Edge classiﬁcation

0.2 0.4 0.6 0.8 q

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average , intra edges, GAT head 0

Average , inter edges, GAT head 0

Average , intra edges, GAT head 1

Average , inter edges, GAT head 1

Average 1/|Ni|

(b) Attention coeﬃcients

0.2 0.4 0.6 0.8 q

Classification accuracy

MLP-GAT GCN No-graph p=q

(c) Node classiﬁcation

Figure 2: Demonstration of Theorem 9 and Theorem 10 for the hard regime. The shaded areas in the plots show standard deviation. Unlike GCN, the performance of MLP-GAT is lower bounded by the performance of not using the graph.

5.2.2 Fixing q and varying the distance between the means

We consider the case where q = 0.1. In Figure 3 we show how the attention coeﬃcients of MLP-GAT and GAT, the node and edge classiﬁcation depend on the distance between the means. We also add a vertical line at σ to approximately separate the easy (left of σ) and hard (right of σ) regimes. Figure 3c illustrates Theorems 3 and 9 in the hard and easy regimes, respectively. In particular, we observe that in the hard regime, MLP-GAT fails to distinguish intra from inter edges, while in the easy regime, it is able to do that perfectly for a large enough distance between the means. In Figure 3a we observe that in the hard regime, γ concentrate around the uniform (GCN) coeﬃcients, while in the easy regime, MLP-GAT is able to maintain the γ for intraclass edges, while it sets the γ to tiny values for inter-class edges. In Figure 3b. we observe that in the hard regime, the γ of GAT concentrate around the uniform coeﬃcients (proved in Theorem 10), while in the easy regime although the γ concentrate, GAT is not able to distinguish intra from inter edges. This makes sense since the separation of edges can t be done by simple linear classiﬁers used by GAT, see the discussion below Theorem 10. Finally, in Figure 3d we show node classiﬁcation results for MLP-GAT. In the easy regime,

Fountoulakis, Levi, Yang, Baranwal and Jagannath

we observe perfect classiﬁcation as proved in Corollary 5. However, as the distance between the means decreases, we observe that MLP-GAT starts to misclassify nodes.

10 3 10 2 10 1 100 101

Distance between means

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(a) Attention coeﬃcients of MLP-GAT

10 3 10 2 10 1 100 101

Distance between means

Avg. , intra edges, GAT, head 0, class 0

Avg. , inter edges, GAT, head 0, class 0

Avg. , intra edges, GAT, head 0, class 1

Avg. , inter edges, GAT, head 0, class 1

Avg. , intra edges, GAT, head 1, class 0

Avg. , inter edges, GAT, head 1, class 0

Avg. , intra edges, GAT, head 1, class 1

Avg. , inter edges, GAT, head 1, class 1

Average 1/|Ni|

(b) Attention coeﬃcients of GAT

10 3 10 2 10 1 100 101

Distance between means

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(c) Edge classiﬁcation accuracy

10 3 10 2 10 1 100 101

Distance between means

Classification accuracy

MLP-GAT GCN No-graph

(d) Node classiﬁcation accuracy

Figure 3: Attention coeﬃcients of MLP-GAT and GAT, node and edge classiﬁcation as a function of the distance between the means. Shaded areas show standard deviation. When there is a suﬃcient distance between the means, attention coeﬃcients of MLP-GAT demonstrate a nice separation while GAT does not.

5.3 Real data

For the experiment on real data, we illustrate the attention coeﬃcients, node and edge classiﬁcation for MLP-GAT as a function of the distance between the means. We use popular real-world graph datasets Cora, Pub Med, and Cite Seer collected by Py Torch Geometric (Fey and Lenssen, 2019) and ogbn-arxiv from Open Graph Benchmark (Hu et al., 2020). The datasets come with multiple classes, however, for each of our experiments, we do a onev.s.-all classiﬁcation for a single class. This is a semi-supervised problem, only a fraction of the training nodes have labels. The rest of the nodes are used for measuring prediction accuracy. To control the distance between the means of the problem we use the true labels to determine the class of each node and then we compute the empirical mean for each class. We subtract the empirical means from their corresponding classes and we also add means

Graph Attention Retrospective

µ and µ to each class, respectively. This modiﬁcation can be thought of as translating the mean of the distribution of the data for each class.

10 4 10 3 10 2 10 1

Distance between means

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(a) Edge class., Cora

10 4 10 3 10 2 10 1

Distance between means

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(b) Attention coef., Cora

10 4 10 3 10 2 10 1

Distance between means

Classification accuracy

MLP-GAT GCN No-graph

(c) Node class., Cora

10 4 10 3 10 2

Distance between means

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(d) Edge class., Pub Med

10 4 10 3 10 2

Distance between means

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(e) Attention coef., Pub Med

10 4 10 3 10 2

Distance between means

Classification accuracy

MLP-GAT GCN No-graph

(f) Node class., Pub Med

10 4 10 3 10 2

Distance between means

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(g) Edge class., Cite Seer

10 4 10 3 10 2

Distance between means

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(h) Attention coef., Cite Seer

10 4 10 3 10 2

Distance between means

Classification accuracy

MLP-GAT GCN No-graph

(i) Node class., Cite Seer

10 3 10 2 10 1

Distance between means

Classification accuracy

MLP-GAT, intra edge classification MLP-GAT, inter edge classification

(j) Edge class., ogbn-arxiv

10 3 10 2 10 1

Distance between means

Average , intra edges, MLP-GAT

Average , inter edges, MLP-GAT

Average 1/|Ni|

(k) Attention coef., ogbn-arxiv

10 3 10 2 10 1

Distance between means

Classification accuracy

MLP-GAT GCN No-graph

(l) Node class., ogbn-arxiv

Figure 4: Attention coeﬃcients, node and edge classiﬁcation for MLP-GAT as a function of the distance between the means for real data.

The results of this experiment are shown in Figure 4. For Cora, Pub Med, and Cite Seer we show results for class 0 since these are small datasets, each dataset contains at most 7 classes, and the classes have similar sizes. In our experiments on other classes, we observed that the results are similar. For ogbn-arxiv we show results for the largest class (i.e. class 16) since this is a larger dataset which has 40 imbalanced classes. Picking a large class

Fountoulakis, Levi, Yang, Baranwal and Jagannath

makes the one-v.s.-all classiﬁcation task more balanced. In our experiments on other classes having similar sizes, we obtained similar results. We note that in the real data, we also observe similar behavior of MLP-GAT in the easy and hard regimes as for the synthetic data. In particular, for all datasets as the distance of means increases, MLP-GAT is able to accurately classify intra-class and inter-class edges, see Figures 4a, 4d and 4g. Moreover, as the distance between the means increases, the average intra-class γ becomes much larger than the average inter-class γ, see Figures 4b, 4e, 4h, and 4k, and the model is able to classify the nodes accurately, see Figures 4c, 4f, 4i, and 4l. On the contrary, in the same ﬁgures, we observe that as the distance between the means decreases then MLP-GAT is not able to separate intra-class from inter-class edges, the averaged γ are very close to uniform coeﬃcients and the model can t classify the nodes accurately. Note that Figure 4 does not show the standard deviation for the attention coeﬃcients γ. We show the standard deviation of γ in Figure 5. We observe that the standard deviation is higher than what we observed in the synthetic data. In particular, it can be more than half of the averaged γ. This is to be expected since for the real data the degrees of the nodes do not concentrate as well. In Figure 5 we show that the standard deviation of the uniform coeﬃcients 1/|Ni| is also high. For Cora, Pub Med, and Cite Seer, the standard deviation for intra-class γ is similar to that of 1/|Ni|, while the deviation for inter-class γ is large for a small distance between the means, but it gets much smaller as the distance increases. For ognb-arxiv, the standard deviation of 1/|Ni| is particularly high. This implies that the degree distribution of nodes of ogbn-arxiv has a heavy tail, which could potentially result from the graph structure being noisier than other datasets and also explain GCN s relatively much worse performance when the distance between the means is large.

10 4 10 3 10 2 10 1

Distance between means

Stand. dev. , intra edges, MLP-GAT

Stand. dev. , inter edges, MLP-GAT

Stand. dev. of 1/|Ni|

10 4 10 3 10 2

Distance between means

Stand. dev. , intra edges, MLP-GAT

Stand. dev. , inter edges, MLP-GAT

Stand. dev. of 1/|Ni|

(b) Pub Med

10 4 10 3 10 2

Distance between means

Stand. dev. , intra edges, MLP-GAT

Stand. dev. , inter edges, MLP-GAT

Stand. dev. of 1/|Ni|

(c) Cite Seer

10 3 10 2 10 1

Distance between means

Stand. dev. , intra edges, MLP-GAT

Stand. dev. , inter edges, MLP-GAT

Stand. dev. of 1/|Ni|

(d) ogbn-arxiv

Figure 5: Standard deviation for attention coeﬃcients of MLP-GAT.

Graph Attention Retrospective

6. Summary of implications for practitioners

While this work focuses on theoretical understanding of graph attention s capability for edge and node classiﬁcation using the CSBM generative model, our ﬁndings yield a series of potential suggestions for GNN practitioners. In this section we provide some interesting practical implications of our results.

6.1 Why graph attention? Beneﬁts of graph attention s robustness to structural noise

When the graph is very noisy, e.g. when a node has a similar number of neighbors from the same class and from diﬀerent classes, simple graph convolution will mix up node features and thus make nodes from diﬀerent classes indistinguishable. In this case, simple graph convolution can do more harm than good. However, in practice, it is diﬃcult to determine how noisy the graph is, or if the graph is even useful at all. This could pose a challenge in choosing an architecture. Our results in Theorem 14 and Corollary 15 imply that graph attention has the ability to dramatically reduce the impact of a noisy graph, in a way such that the output is at least as good as the output from the best linear classiﬁer (on the input) which does not rely on the graph. This shows that graph attention convolution is more robust against structural noise in the graph, and hence on noisy graphs, it is strictly better than simple graph convolution.

6.2 Which attention architecture? Beneﬁts of multi-layer attention architecture

In this work, we are able to obtain positive results for graph attention by using the two-layer MLP attention architecture in (3). This is diﬀerent from the original GAT which uses a single-layer attention architecture (Veliˇckovi c et al., 2018). In our analyses and empirical experiments, we found that the original single-layer attention does not have the important properties required for obtaining positive results (e.g. Theorem 3, Corollary 5, Theorem 14, Corollary 16) for graph attention convolution. Coincidentally, this aligns with the ﬁndings of Brody et al. (2022), where the authors study limitations of the original GAT architecture from a diﬀerent perspective than ours, and they propose a new architecture termed GATv2. The two-layer MLP attention architecture that we consider in (3) encompasses GATv2 as a special case. Therefore, the two-layer MLP attention architecture can be a good candidate to consider when practitioners search for a suitable graph attention architecture for their speciﬁc downstream tasks. On the other hand, our results in Section 3.2 imply that even the two-layer MLP attention architecture (and hence GATv2) has limitations when the node features are noisy. To ﬁx that, a potential solution is to incorporate additional information such as edge features, which we discuss next.

6.3 Will additional information help? Beneﬁts of incorporating good edge features

Even though we do not consider edge features in our analyses, our results in Theorem 13 imply that good attention functions that are able to classify the edges independently from the node features can be very helpful, as they help reduce the threshold under which graph

Fountoulakis, Levi, Yang, Baranwal and Jagannath

attention convolution would fail to separate the nodes. One potential way to obtain good attention functions that behave like the one given in (13) is by incorporating good edge features. Furthermore, given our result in Theorem 9, which says that graph attention based on noisy node features cannot perfectly classify the edges, the importance of incorporating informative edge features that are more indicative of edge class memberships (if they are accessible in practice) into the attention mechanism is more pronounced.

7. Conclusion and future work

In this work, we study the impact of graph attention on edges and its implications for node classiﬁcation. We show that graph attention improves robustness to noise in the graph structure. We also show that graph attention may not be very useful in a hard regime where the node features are noisy. Our work shows that single-layer graph attention convolution has limited power at distinguishing intra-class from inter-class edges. Given the empirical successes of graph attention and its many variants, a promising future work is to study the power of multi-layer graph attention convolutions for distinguishing intra-class and inter-class edges. Moreover, our negative results in Section 3.2 for edge/node classiﬁcation pertains to perfect classiﬁcation and almost perfect classiﬁcation. In practice, misclassiﬁcation of a small constant fraction of nodes/edges is often inevitable, but nonetheless useful. Therefore, an interesting future line of work is to characterize the threshold under which graph attention is going to misclassify a certain proportion of nodes. Finally, variants of graph attention networks have been successfully used in tasks other than node classiﬁcation, such as link prediction and graph classiﬁcation. These tasks are typically solved by architectures that add a ﬁnal aggregation layer which combines node representations generated from graph attention convolution. It is an interesting future direction to develop a good understanding of the beneﬁts and limitations of the graph attention mechanism for those tasks.

Acknowledgments

K. Fountoulakis would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a et e ﬁnanc ee par le Conseil de recherches en sciences naturelles et en g enie du Canada (CRSNG), [RGPIN-2019-04067, DGECR-2019-00147]. A. Jagannath acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a et e ﬁnanc ee par le Conseil de recherches en sciences naturelles et en g enie du Canada (CRSNG), [RGPIN-2020-04597, DGECR-2020-00199].

Graph Attention Retrospective

Appendix A. Proofs

We deﬁne the following high-probability events which will be used in some proofs. Each of these events holds with probability at least 1 o(1), which follows from straightforward applications of Chernoﬀbound and union bound, e.g., see (Baranwal et al., 2021).

Deﬁnition 17. Deﬁne the following events over the randomness of A and {ϵi}i [n] and {Xi}i [n],

E1 is the event that |C0| = n

2 O( n log n) and |C1| = n

2 O( n log n).

E2 is the event that for each i [n], Dii = n(p+q)

2 1 10 log n

E3 is the event that for each i [n], |C0 Ni| = Dii (1 ϵi)p+ϵiq

p+q 1 10 log n

|C1 Ni| = Dii (1 ϵi)q+ϵip

p+q 1 10 log n

E4 is the event that for each i [n], w T Xi E w T Xi 10σ log n.

E is the intersection of the above 4 events.

Lemma 18 (Baranwal et al. (2021)). With probability at least 1 o(1) event E holds.

Some of our proofs also utilize the following simple observation on the mutual independence among {w T gi}i [n]when {gi}i [n] are i.i.d. Gaussian random vectors.

Observation 19. Fix w = 0 in Rd and let g1, . . . , gn be i.i.d. drawn from N(0, I). Then w T g1, w T g2, . . . , w T gn are independent.

Proof Note that since w T gi N(0, w 2), it suﬃces to prove that the covariance E[w T gi w T gj] = 0 for all i = j. By deﬁnition, for i = j,

E[w T gi w T gj] = E

ℓ [d] wkwℓgikgjℓ

ℓ [d] wkwℓE[gikgjℓ] = 0,

where the last equality follows from independence between gi and gj.

A.1 Proof of Theorem 3

We restate Theorem 3 for convenience.

Theorem. Suppose that µ = ω(σ log n). Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), the two-layer MLP attention architecture Ψ given in (3) and (4) separates intra-class edges from inter-class edges.

Recall from (5) that

Ψ(Xi, Xj) =

2R(1 β) w T Xi, if w T Xj w T Xi , 2R(1 β) sign( w T Xi) w T Xj, if w T Xi < w T Xj < w T Xi , 2R(1 β) w T Xi, if w T Xj w T Xi .

Fountoulakis, Levi, Yang, Baranwal and Jagannath

We will condition on the event that w T Xi E w T Xi 10σ log n for all i [n], which holds with probability at least 1 O(1/n99) following a union bound and the Gaussian tail probability. Under this event, because µ = ω(σ log n), for all i, j C1 we have

sign( w T Xi) = sign( w T Xj) = 1,

min{ w T Xi, w T Xj} µ 10σ p

max{ w T Xi, w T Xj} µ + 10σ p

Ψ(Xi, Xj) 2R(1 β) min{ w T Xi, w T Xj} 2R(1 β) ( µ O(σ p

Ψ(Xi, Xj) 2R(1 β) max{ w T Xi, w T Xj} 2R(1 β) ( µ + O(σ p

which implies that Ψ(Xi, Xj) = µ O(σ log n) for i, j C1. Similarly, we get that

Ψ(Xi, Xj) = µ O(σ log n) = µ (1 o(1)), if (i, j) is an intra-class edge, µ O(σ log n) = µ (1 o(1)), if (i, j) is an inter-class edge. (14) Therefore, we probability at least 1 o(1), we have that

sign(Ψ(Xi, Xj)) = 1, if (i, j) is an intra-class edge, 1, if (i, j) is an inter-class edge,

which means perfect separability of edges, and the proof is complete.

A.2 Proof of Corollary 4

We restate Corollary 4 for convenience.

Corollary. Suppose that µ = ω(σ log n) and that Assumption 1 holds. Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), the attention architecture Ψ yields attention coeﬃcients γij such that

1. If p q, then γij = 2 np(1 o(1)) if (i, j) is an intra-class edge and γij = o( 1 n(p+q)) otherwise;

2. If p < q, then γij = 2 nq(1 o(1)) if (i, j) is an inter-class edge and γij = o( 1 n(p+q)) otherwise.

The proof is straightforward by considering the cases p q and p < q separately. When p q, we have Ψ = Ψ. Using the speciﬁcation of Ψ in (3) and (4), the deﬁnition of attention coeﬃcients in (1), the high probability event in Lemma 18, the expression of Ψ(Xi, Xj) in (14), and picking R such that 1/R = ω(σ log n) and 1/R = o( µ ), we obtain the claimed results. The result when p < q is obtained in the same way.

Graph Attention Retrospective

A.3 Proof of Corollary 5

We restate Corollary 5 for convenience.

Corollary. Suppose that µ = ω(σ log n) and that Assumption 1 holds. Then with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the attention architecture Ψ with the graph attention convolution given in (2), where f is set to be the identify function, the model separates the nodes.

We prove the case p q and the case p < q follows analogously. Consider the attention architecture Ψ = (1p q 1p<q) Ψ = Ψ, where Ψ is given in (3) and (4). Pick R such that 1/R = ω(σ log n) and 1/R = o( µ )). Assume that i C1, and denote the graph attention convolution output as

h i def = X

j Ni γij w T Xj.

We will condition on the event E , which holds with probability at least 1 o(1). By using Corollary 4 we have X

j Ni γij w T Xj = X

j C0 Ni γij w T Xj + X

j C1 Ni γij w T Xj

np(1 o(1)) µ + 10σ p

+ |C0 Ni| o 1 n(p + q)

= (1 o(1)) µ + 10σ p

log n nq(1 o(1))

ω(n(p + q)) µ 10σ p

= µ (1 o(1)).

Similarly, we have that X

j Ni γij w T Xj (1 o(1)) µ 10σ p

log n nq(1 o(1))

ω(n(p + q)) µ + 10σ p

= µ (1 o(1)).

This means that h i = µ (1 o(1)) for i C1. Applying the same reasoning we get that h i = µ (1 o(1)) for i C0. Therefore, with probability at least 1 o(1), the graph attention convolution separates the nodes.

A.4 Proof of Lemma 8

We restate Lemma 8 for convenience.

Lemma. Let (X, A) CSBM(n, p, q, µ, σ2) and let X ij be deﬁned as in (9). The Bayes optimal classiﬁer for X ij is realized by the following function,

( 0, if p cosh x T µ

σ2 q cosh x T ν

1, otherwise,

Fountoulakis, Levi, Yang, Baranwal and Jagannath

where µ def = µ µ

and ν def = µ µ

Proof Note that X ij is a mixture of 2d-dimensional Gaussian distributions,

N( µ , σ2I) i C0, j C0 N(µ , σ2I) i C1, j C1 N( ν , σ2I) i C0, j C1 N(ν , σ2I) i C1, j C0

The optimal classiﬁer is then given by

h (x) = arg max c {0,1} Pr[y = c | x].

Note that Pr[y = 0] = q p+q and Pr[y = 1] = p p+q. Thus, by Bayes rule, we obtain that

Pr[y = c | x] = Pr[y = c] fx|y(x | y = c) Pr[y = 0]fx|y=0(x | y = 0) + Pr[y = 1]fx|y=1(x | y = 1)

1 + Pr[y=1 c] fx|y(x|y=1 c)

Pr[y=c] fx|y(x|y=c) .

Suppose that x = X ij such that i j. Then h (x) = 0 if and only if Pr[y = 0 | x] 1

2. Hence, for c = 0 we require that

Pr[y = 1 c] fx|y(x | y = 1 c)

Pr[y = c] fx|y(x | y = c) = p

q fx|y(x | y = 1) fx|y(x | y = 0) = p

σ2 x T ν 1,

Similarly, we obtain the reverse condition for h (x) = 1.

A.5 Proof of Theorem 9

We restate Theorem 9 for convenience.

Theorem. Suppose µ = κσ for some κ > 0 and let Ψ be any attention mechanism. Then,

1. With probability at least 1 o(1), Ψ fails to correctly classify at least 2 Φc(κ)2 fraction of inter-class edges;

2. For any K > 1 if q > K log2 n

nΦc(κ)2 , then with probability at least 1 O(n K

4 Φc(κ)2 log n), Ψ misclassify at least one inter-class edge.

We will write i j if node i and node j are in the same class and i j otherwise. From Lemma 8, we observe that for successful classiﬁcation by the optimal classiﬁer, we need

p cosh x T µ

σ2 q cosh x T ν

σ2 for i j,

p cosh x T µ

σ2 > q cosh x T ν

σ2 for i j.

Graph Attention Retrospective

We will split the analysis into two cases. First, note that when p q we have for i j that

p cosh x T µ

σ2 q cosh x T ν

σ2 = cosh x T µ

σ2 cosh x T ν

σ2 = |x T µ | |x T ν |.

In the ﬁrst implication, we used that p q, while the second implication follows from the fact that cosh(a) cosh(b) = |a| |b| for all a, b R. Similarly, for p < q we have for i j that

p cosh x T µ

σ2 > q cosh x T ν

σ2 = cosh x T µ

σ2 > cosh x T ν

σ2 = |x T µ | > |x T ν |.

Therefore, for each of the above cases, we can upper bound the probability for either i j or i j that X ij is correctly classiﬁed, by the probability of the event |X T ij µ | |X T ij ν | or equivalently |X T ij µ | > |X T ij ν |. We focus on the former as the latter is equivalent and symmetric. Writing Xi = µ + σgi and Xj = µ + σgj, we have that for i C1 and j C0,

Pr[h (X ij) = 0] Pr h |X T ij µ | |X T ij ν | i

= Pr |XT i µ + XT j µ| |XT i µ XT j µ|

= Pr σ|g T i µ + g T j µ| | 2 µ 2 + σg T i µ σg T j µ|

Pr |g T i ˆµ + g T j ˆµ| |g T i ˆµ g T j ˆµ| 2 µ

= Pr |g T i ˆµ + g T j ˆµ| |g T i ˆµ g T j ˆµ| 2κ ,

where we denote ˆµ = µ/ µ . In the second to last step above, we used triangle inequality to pull 2 µ 2 outside the absolute value, while in the last equation we use µ = κσ. We now denote zi = g T i ˆµ for all i [n]. Then the above probability is Pr[|zi + zj| |zi zj| 2κ], where zi, zj N(0, 1) are independent random variables. Note that we have

Pr[h (X ij) = 0] Pr[|zi + zj| |zi zj| 2κ]

= Pr[|zi + zj| |zi zj| 2κ, |zi| κ]

+ Pr[|zi + zj| |zi zj| 2κ, |zi| > κ]

= Pr[|zi| κ] + Φ(κ) Pr[|zi| > κ]. (15)

To see how we obtain the last equation, observe that if |zi| κ then we have

|zi + zj| |zi zj| = |zi + zj| |zj zi|

|zi| + |zj| |zj zi| by triangle inequality

|zi| + |zj| |zj| |zi| by reverse triangle inequality

|zi| + |zj| (|zj| |zi|) = 2|zi|

hence, Pr[|zi + zj| |zi zj| 2κ, |zi| κ] = Pr[|zi| κ]. On the other hand, for |zi| > κ, we look at each case, conditioned on the events zi > κ and zi < κ for each of

Fountoulakis, Levi, Yang, Baranwal and Jagannath

the four cases based on the signs of zi + zj and zi zj. We denote by E the event that |zi + zj| |zi zj| 2κ, and analyze the cases in detail. First consider the case zi < κ:

Pr[E, zi + zj 0, zi zj 0 | zi < κ] = Pr[zj zi, zj zi | zi < κ] = 0,

Pr[E, zi + zj 0, zi zj < 0 | zi < κ] = Pr[zj > |zi|, zi κ | zi < κ] = Φ(zi),

Pr[E, zi + zj < 0, zi zj 0 | zi < κ] = Pr[zj < |zi|, zi κ | zi < κ] = 0,

Pr[E, zi + zj < 0, zi zj < 0 | zi < κ] = Pr[zi < zj < zi, zj > κ | zi < κ]

= Φ(κ) Φ(zi).

The sum of the four probabilities in the above is Pr[E | zi < κ] = Φ(κ). Similarly, we analyze the other case, zi > κ:

Pr[E, zi + zj 0, zi zj 0 | zi > κ] = Pr[ zi zj zi, zj κ | zi > κ]

= Φ(κ) Φc(zi),

Pr[E, zi + zj 0, zi zj < 0 | zi > κ] = Pr[zj > |zi|, zi κ | zi > κ] = 0,

Pr[E, zi + zj < 0, zi zj 0 | zi > κ] = Pr[zj < |zi|, zi κ | zi > κ] = Φc(zi),

Pr[E, zi + zj < 0, zi zj < 0 | zi > κ] = Pr[zj < zi, zj > zi | zi > κ] = 0.

The sum of the four probabilities above is Pr[E | zi > κ] = Φ(κ). Therefore, we obtain that Pr[|zi + zj| |zi zj| 2κ | |zi| > κ] = Φ(κ),

which justiﬁes (15). Next, note that Pr[|zi| κ] = Φ(κ) Φc(κ) and Pr[|zi| > κ] = 2Φc(κ), so we have from (15) that

Pr[h (X ij) = 0] Φ(κ) Φc(κ) + 2Φc(κ)Φ(κ)

= 1 2Φc(κ) + 2Φc(κ)Φ(κ) = 1 2Φc(κ)2.

Thus, X ij is misclassiﬁed with probability at least 2Φc(κ)2. We will now construct sets of pairs with mutually independent elements, such that the union of those sets covers all inter-class edges. This will enable us to use a concentration argument that computes the fraction of the inter-class edges that are misclassiﬁed. Since the graph operations are permutation invariant, let us assume for simplicity that C0 = {1, . . . , n

2 } and C1 = {n

2 + 1, . . . , n} for an even number of nodes n. Also, deﬁne the function

( i + l, i + l n

2 , i + l n

2 , i + l > n

We now construct the following sequence of sets for all l {0, . . . , n

Sl = {(Xm(i,l), Xi+ n

2 ) for all i C0 such that (m(i, l), i + n/2) E}.

Fix l {0, . . . , n

2 1} and observe that the pairs in the set Sl are mutually independent. Deﬁne a Bernoulli random variable, βi, to be the indicator that (Xm(i,l), Xi+ n

2 ) is misclassiﬁed. We have that E[βi] 2Φc(κ)2. Note that the fraction of pairs in the set Sl that are

Graph Attention Retrospective

misclassiﬁed is 1 |Sl| P i:(Xm(i,l),Xi+n/2) Sl βi, which is a sum of independent Bernoulli random variables. Hence, by the additive Chernoﬀbound, we obtain

i C0 Nm(i,l) βi 2|Sl|Φc(κ)2 |Sl|t

1 exp( 2|Sl|t2).

Since p, q = Ω(log2 n

n ), we have by the Chernoﬀbound and a union bound that with proba-

bility at least 1 1/poly(n), |Sl| = nq(1 o(1)) for all l. We now choose t = q

|Sl| = o(1) to obtain that on the event where |Sl| = nq(1 o(1)), we have the following for any large C > 1:

i C0 Nm(i,l) βi 2Φc(κ)2 o(1)

Following a union bound over all l {0, . . . , n

2 1}, we conclude that for any c > 0,

i C0 Nm(i,l) βi 2Φc(κ)2 o(1), l n 0, . . . , n

Thus, out of all the pairs X ij with j i, with probability at least 1 o(1), we have that at least a fraction 2Φc(κ)2 of the pairs are misclassiﬁed by the attention mechanism. This concludes part 1 of the theorem. For part 2, note that by the additive Chernoﬀbound we have for any t (0, 1),

i C0 Nm(i,l) βi 2|Sl|Φc(κ)2 |Sl|t

1 exp( 2|Sl|t2).

Since |Sl| = nq

2 (1 o(1)) with probability at least 1/poly(n), we choose t = 2 q

KΦc(κ)2 log2 n

nq to obtain

i C0 Nm(i,l) βi nqΦc(κ)2(1 o(1)) q

KnqΦc(κ)2 log2 n

1 O(n 8KΦc(κ)2 log n).

Now note that if q > K log2 n

nΦc(κ)2 then we have nqΦc(κ)2 > K log2 n, which implies that

KnqΦc(κ)2 log2 n > 0.

Hence, in this regime of q,

i C0 Nm(i,l) βi > 0

1 O(n 8KΦc(κ)2 log n),

and the proof is complete.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

A.6 Proof of Theorem 10

We restate Theorem 10 for convenience

Theorem. Assume that µ Kσ and σ K for some absolute constants K and K . Moreover, assume that the parameters (w, a, b) Rd R2 R are bounded. Then, with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), there exists a subset A [n] with cardinality at least n(1 o(1)) such that for all i A the following hold:

1. There is a subset Ji,0 Ni C0 with cardinality at least 9 10|Ni C0|, such that γij = Θ(1/|Ni|) for all j Ji,0.

2. There is a subset Ji,1 Ni C1 with cardinality at least 9 10|Ni C1|, such that γij = Θ(1/|Ni|) for all j Ji,1.

For i [n] let us write Xi = (2ϵi 1)µ + σgi where gi N(0, I), ϵi = 0 if i C0 and ϵi = 1 if i C1. Moreover, since the parameters (w, a, b) Rd R2 R are bounded, we can write w = R ˆw and a = R ˆa such that ˆw = 1 and ˆa = 1 and R, R are some constants. We deﬁne the following sets which will become useful later in our computation of γij s. Deﬁne

A def = i [n] |ˆa1 ˆw T gi| 10 p

log(n(p + q)), and |ˆa2 ˆw T gj| 10 p

log(n(p + q)), j Ni

For i [n] deﬁne

Ji,0 def = n j Ni C0 | |ˆa2 ˆw T gj|

Ji,1 def = n j Ni C1 | |ˆa2 ˆw T gj|

Bt i,0 def = j Ni C0 | 2t 1 ˆa2 ˆw T gj 2t , t = 1, 2, . . . , T,

Bt i,1 def = j Ni C1 | 2t 1 ˆa2 ˆw T gj 2t , t = 1, 2, . . . , T,

where T def = l log2 10 p

log(n(p + q)) m .

We start with a few claims about the sizes of these sets.

Claim 20. With probability at least 1 o(1), we have that |A| n(1 o(1)).

Proof Because |ˆa2| 1 we know that A is a superset of A where

A def = i [n] | ˆw T gi| 10 p

log(n(p + q)), and | ˆw T gj| 10 p

log(n(p + q)), j Ni

We give a lower bound for |A | and hence prove the result. First of all, note that if p + q Ω(1/ log2 n), then log(n(p + q)) = log n(1 o(1)) and we easily get that with probability at least 1 o(1), | ˆw T gi| 10 p

log(n(p + q)) for all i [n], and thus |A| = |A | = n. Therefore

Graph Attention Retrospective

let us assume without loss of generality that p + q O(1/ log2 n). Consider the following sum of indicator random variables

i [n] 1n | ˆ w T gi| 10

log(n(p+q)) o.

By the multiplicative Chernoﬀbound, for any δ > 0 we have

Pr [S nb(1 + δ)] exp δ2

where b def = Pr(| ˆw T gi| 10 p

log(n(p + q))). Moreover, by the standard upper bound on the Gaussian tail probability (Proposition 2.1.2, Vershynin (2018)) we know that b < e 50 log(n(p+q)). Let us set

δ def = 1 bn(p + q) log n.

Then by the upper bound on b and the assumption that p, q = Ω(log2 n/n) we know that

δ (n(p + q))49

log n Ω(log97 n) = ω(1).

It follows that δ2

2 + δnb Ω(δnb) = Ω 1 (p + q) log n

Therefore, with probability at least 1 o(1) we have that

S nb(1 + δ) n (n(p + q))50 + n n(p + q) log n = O n n(p + q) log n

Apply the concentration result of node degrees, this means that with probability at least 1 o(1), n i [n] | ˆw T gi| 10 p

log(n(p + q)) or j Ni such that | ˆw T gj| 10 p

log(n(p + q)) o

2 (p + q)(1 o(1)) = O n n(p + q) log n

2 (p + q)(1 o(1)) = O n log n

Therefore we have |A | n O(n/ log n) = n(1 o(1)).

Claim 21. With probability at least 1 o(1), we have that for all i [n],

10|Ni C0| and |Ji,1| 9

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Proof We prove the result for Ji,0, the result for Ji,1 follows analogously. First, ﬁx i [n]. For each j |Ni C0| we have that

Pr[|ˆa2w T gj|

10] Pr[|w T gj|

Denote Jc i,0 def = (Ni C0) \ Ji,0. We have that

E[|Jc i,0|] = E

j Ni C0 1{|ˆa2w T gj|

e 50|Ni C0|,

Apply Chernoﬀ s inequality (Theorem 2.3.4 in Vershynin (2018)) we have

Pr |Jc i,0| 1

10|Ni C0| e E[|Jc i,0|] e E[|Jc i,0|] |Ni C0|/10

ee 50|Ni C0|

25|Ni C0| .

Apply the union bound we get

Pr |Ji,0| 9

10|C0 Ni|, i [n] 1 X

i [n] exp 4

i [n] exp 4

25 n min(p, q)(1 o(1))

= (1 o(1)) 1 n exp 2n min(p, q)(1 o(1))

The second inequality follows because |Ni C0| n

2 min(p, q)(1 o(1)) under the event E3 (cf. Deﬁnition 17) for all i [n]. The last equality is due to our assumption that p, q = Ω(log2 n

Claim 22. With probability at least 1 o(1), we have that for all i [n] and for all t [T],

|Bt i,0| E[|Bt i,0|] +

T|Ni C0| 4 5 and |Bt i,1| E[|Bt i,1|] +

T|Ni C1| 4 5 .

Proof We prove the result for Bt i,0, and the result for Bt i,1 follows analogously. First ﬁx i [n] and t [T]. By the additive Chernoﬀinequality, we have

Pr |Bt i,0| E[|Bt i,0|] + |Ni C0|

5 e 2T|Ni C0|3/5.

Graph Attention Retrospective

Taking a union bound over all i [n] and t [T] we get

n |Bt i,0| E[|Bt i,0|] +

T|Ni C0| 4 5 o

n T exp 2T n

2 min(p, q)(1 o(1)) 3/5 + o(1) = o(1),

where the last equality follows from Assumption 1 that p, q = Ω(log2 n

n ), and hence

n T exp 2T n

2 min(p, q)(1 o(1)) 3/5 = n T exp ω

2T log n = O n c

for some absolute constant c > 0. Moreover, we have used degree concentration, which introduced the additional additive o(1) term in the probability upper bound. Therefore we have Pr h |Bt i,0| E[|Bt i,0|] +

T|Ni C0| 4 5 , i [n] t [T] i 1 o(1).

We start by deﬁning an event E# which is the intersection of the following events over the randomness of A and {ϵi}i and Xi = (2ϵi 1)µ + σgi,

E 1 is the event that for each i [n], |C0 Ni| = n

2 ((1 ϵi)p + ϵiq)(1 o(1)) and |C1 Ni| = n

2 ((1 ϵi)q + ϵip)(1 o(1)).

E 2 is the event that |A| n o( n).

E 3 is the event that |Ji,0| 9

10|Ni C0| and |Ji,1| 9

10|Ni C1| for all i [n].

E 4 is the event that |Bt i,0| E[|Bt i,0|] +

T|Ni C0| 4 5 and |Bt i,1| E[|Bt i,1|] +

C1| 4 5 for all i [n] and for all t [T].

By Claims 20, 21, 22, we get that with probability at least 1 o(1), the event E# def = T4 i=1 E i holds. We will show that under event E#, for all i A, for all j Ji,c where c {0, 1}, we have γij = Θ(1/|Ni|). This will prove Theorem 10. Fix i A and some j Ji,0. Let us consider

γij = exp Leaky Relu(a1w T Xi + a2w T Xj + b) P k Ni exp (Leaky Relu(a1w T Xi + a2w T Xk + b))

= exp σRR Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b ) P k Ni exp (σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ))

= 1 P k Ni exp( ik ij)

Fountoulakis, Levi, Yang, Baranwal and Jagannath

where for l Ni, we denote

κil def = (2ϵi 1) ˆw T µ/σ + (2ϵl 1) ˆw T µ/σ,

il def = σRR Leaky Relu(κil + ˆa1w T gi + ˆa2w T gl + b ),

and b = σRR b . We will show that

k Ni exp( ik ij) = Θ(|Ni|)

and hence conclude that γij = Θ(1/|Ni|). First of all, note that since µ Kσ for some absolute constant K, we know that

Let us assume that ˆa1 ˆw T gi 0 and consider the following two cases regarding the magnitude of ˆa1 ˆw T gi. Case 1. If κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b < 0, then

ik ij = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b )

Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b )

= σRR Leaky Relu(ˆa1 ˆw T gi + ˆa2 ˆw T gk O(1))

β(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b )

= σRR Leaky Relu(ˆa2 ˆw T gk O(1)) O(1)

= σRR Θ(ˆa2 ˆw T gk) O(1) ,

where β is the slope of Leaky Relu(x) for x < 0. Here, the second equality follows from |κik + b |

2K + |b | = O(1) and κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b < 0. The third equality follows from

We have j Ji,0 and hence |ˆa2 ˆw T gj| = O(1);

We have κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj +b < 0, so ˆa1 ˆw T gi < |κij|+|ˆa2 ˆw T gj|+|b | = O(1), moreover, because ˆa1 ˆw T gi 0, we get that |ˆa1 ˆw T gi| = O(1);

We have |κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b | |ˆa1 ˆw T gi| + |ˆa2 ˆw T gj| + |κij + b | = O(1) + O(1) + O(1) = O(1).

Graph Attention Retrospective

Case 2. If κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b 0, then

ik ij = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b )

Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b )

= σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b )

κij ˆa1 ˆw T gi ˆa2 ˆw T gj b

= σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1)

= σRR Θ(ˆa2 ˆw T gk) O(1) , if k Ji,0 Ji,1 σRR O(ˆa2 ˆw T gk) O(1) , otherwise.

To see the last (in)equality in the above, consider the following cases:

1. If k Ji,0 Ji,1, then there are two cases depending on the sign of κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b .

If κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b 0, then we have that

Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1)

= κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ˆa1 ˆw T gi O(1)

= ˆa2 ˆw T gk + κik + b O(1)

= ˆa2 ˆw T gk O(1).

If κik+ˆa1 ˆw T gi+ˆa2 ˆw T gk+b < 0, then because ˆa1 ˆw T gi 0 and |κik+ˆa2 ˆw T gk+ b | |κik| + |ˆa2 ˆw T gk| + |b | = O(1), we know that ˆa1 ˆw T gi < |κik| + |ˆa2 ˆw T gk| + |b | = O(1) and |κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b | = O(1). Therefore it follows that

Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1)

= Leaky Relu( O(1)) O(1) O(1)

= ˆa2 ˆw T gk O(1)

where the last equality is due to the fact that k Ji,0 Ji,1 so |ˆa2 ˆw T gk| = O(1).

2. If k Ji,0 Ji,1, then there are two cases depending on the sign of κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b .

If κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b 0, then we have that

Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1)

= κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ˆa1 ˆw T gi O(1)

= ˆa2 ˆw T gk + κik + b O(1)

= ˆa2 ˆw T gk O(1).

Fountoulakis, Levi, Yang, Baranwal and Jagannath

If κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b < 0, then we have that,

Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1)

= βκik + βˆa1 ˆw T gi + βˆa2 ˆw T gk + βb ˆa1 ˆw T gi O(1)

= βˆa2 ˆw T gk (1 β)ˆa1 ˆw T gi O(1)

βˆa2 ˆw T gk O(1),

where β is the slope of Leaky Relu( ).

Combining the two cases regarding the magnitude of ˆa1 ˆw T gi and our assumption that σ, R, R = O(1), so far we have showed that, for any i such that ˆa1 ˆw T gi 0, for all j Ji,0, we have

ik ij = Θ(ˆa2 ˆw T gk) O(1), if k Ji,0 Ji,1 O(ˆa2 ˆw T gk) O(1), otherwise. (16)

By following a similar argument, one can show that Equation 16 holds for any i such that ˆa1 ˆw T gi < 0. Let us now compute

k Ni exp( ik ij) = X

k Ni C0 exp( ik ij) + X

k Ni C1 exp( ik ij)

for some j Ji,0. Let us focus on P k Ni C0 exp( ik ij) ﬁrst. We will show that Ω(|Ni C0|) P k Ni C0 exp( ik ij) O(|Ni|). First of all, we have that

k Ni C0 exp( ik ij) X

k Ji,0 exp( ik ij) = X

k Ji,0 exp Θ(ˆa2 ˆw T gk) O(1)

k Ji,0 ec1 = |Ji,0|ec1 = Ω(|Ni C0|), (17)

where c1 is an absolute constant (possibly negative). On the other hand, consider the following partition of Ni C0:

P1 def = {k Ni C0 | ˆa2 ˆw T gk 1},

P2 def = {k Ni C0 | ˆa2 ˆw T gk 1}.

It is easy to see that

k P1 exp( ik ij) X

k P1 exp O(ˆa2 ˆw T gk) O(1) X

k P1 ec2 = |P1|ec2 = O(|Ni C0|),

Graph Attention Retrospective

where c2 is an absolute constant. Moreover, because i A we have that P2 S t [T] Bt i,0. It follows that X

k P2 exp( ik ij) = X

exp( ik ij)

exp O(ˆa2 ˆw T gk) O(1)

t [T] |Bt i,0|ec32t,

where c3 is an absolute constant. We can upper bound the above quantity as follows. Under the Event E , we have that

|Bt i,0| mt +

T|Ni C0| 4 5 , for all t [T],

mt def = E[|Bt i,0|] = X

k Ni C0 Pr(2t 1 ˆa2 ˆw T gk 2t) X

k Ni C0 Pr[ˆa2 ˆw T gk 2t 1]

k Ni C0 Pr[ ˆw T gk 2t 1] |Ni C0|e 22t 3.

It follows that X

t [T] |Bt i,0|ec32t X

|Ni C0|e 22t 3 +

T|Ni C0| 4 5 ec32t

t=1 e 22t 3ec32t + X

T|Ni C0| 4 5 ec32T

c4|Ni C0| + o(|Ni|)

where c4 is an absolute constant. The third inequality in the above follows from

The series P t=1 e 22t 3ec32t converges absolutely for any constant c3;

The sum P t [T]

T|Ni C0| 4 5 ec32T = T 3 2 |Ni C0| 4 5 ec32T = o(|Ni|) because

log T 3 2 ec32T = 3

2 log l log2 10 p

log(n(p + q)) m + c32

log(n(p+q)) m

2 log l log2 10 p

log(n(p + q)) m + 20c3 p

log(n(p + q))

c log(n(p + q)) ,

for any c > 0. In particular, by picking c > 5 we see that T 3 2 ec32T O((n(p+q)) 1 c ) o(|Ni| 1 5 ), and hence we get T 3 2 ec32T |Ni C0| 4 5 |Ni| 4 5 o(|Ni| 1 5 ) = o(|Ni|).

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Combining Equations 19 and 20 we get X

k P2 exp( ik ij) O(|Ni|), (21)

and combining Equations 18 and 21 we get X

k Ni C0 exp( ik ij) = X

k P1 exp( ik ij) + X

k P1 exp( ik ij) O(|Ni|). (22)

Now, by Equations 17 and 22 we get

Ω(|Ni C0|) X

k Ni C0 exp( ik ij) O(|Ni|). (23)

It turns out that repeating the same argument for P k Ni C1 exp( ik ij) yields

Ω(|Ni C1|) X

k Ni C1 exp( ik ij) O(|Ni|). (24)

Finally, Equations 23 and 24 give us X

k Ni exp( ik ij) = Θ(|Ni|),

which readily implies

γij = 1 P k Ni exp( ik ij) = Θ(1/|Ni|)

as required. We have showed that for all i A and for all j Ji,0, γij = Θ(1/|Ni|). Repeating the same argument we get that the same result holds for all i A and for all j Ji,1, too. Hence, by Claims 20 and 21 about the cardinalities of A, Ji,0 and Ji,1 we have thus proved Theorem 10.

A.7 Proof of Proposition 13

We restate Proposition 13 for convenience.

Proposition 23. Suppose that p, q satisfy Assumption 1 and that p, q are bounded away from 1. For every ϵ > 0, there are absolute constants M, M = O( ϵ) such that, with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the graph attention convolution in (2) and the attention architecture Ψ in (13), the model misclassiﬁes at least Ω(n1 ϵ) nodes for any w such that w = 1, if

1. t = O(1) and µ Mσ q

log n n(p+q)(1 max(p, q)) p+q

2. t = ω(1) and µ M σ q

log n n(p+q)(1 max(p, q)).

Graph Attention Retrospective

We start with part 1 of the proposition. Let us assume that p q. The result when p < q follows analogously. We will condition on the events E1, E2, E3 deﬁned in Deﬁnition 17. These events are concerned with the concentration of class sizes |C0| and |C1| and the concentration of the number of intra-class and inter-class edges, i.e. |C0 Ni| and C1 Ni| for all i. By Lemma 18, the probability that these events hold simultaneously is at least 1 o(1). Fix any w Rd such that w = 1. Without loss of generality, assume that w T µ > 0. Because t = O(1), by the deﬁnition of Ψ in (13) and the attention coeﬃcients in (1) we have that

( c1 n(p+q)(1 o(1)), if (i, j) is an intra-class edge, c2 n(p+q)(1 o(1)), if (i, j) is an inter-class edge, (25)

for some positive constants c1 1 and c2 1. Let us write Xi = (2ϵi 1)µ + σgi where gi N(0, I), ϵi = 0 if i C0 and ϵi = 1 if i C1. Let us consider the classiﬁcation of nodes in C0 based on the model output P j Ni γijw T Xj for node i. Note that, depending on whether i C0 or i C1, the expectation of the model output is symmetric around 0, and therefore we consider the decision boundary at 0. Let 0 < ϵ < 1 and ﬁx any partition of C0 into a set of disjoint subsets,

such that ℓ= |C0|1 ϵ and |C(h) 0 | = |C0|ϵ for every h = 1, 2, . . . , ℓ. In what follows we will consider the classiﬁcation of nodes in each C(h) 0 separately. We will show that, with probability at least 1 o(1), for each one of more than half of the subsets C(h) 0 where h = 1, 2, . . . , ℓ, the model is going to misclassify at least one node, and consequently, the model is going to misclassify at least ℓ/2 = |C0|1 ϵ/2 = Ω(n1 ϵ) nodes, giving the required result on misclassiﬁcation rate.

Fix any h {1, 2, . . . , ℓ}. Using (25) we get that, for large enough n, the event that the model correctly classiﬁes all nodes in C(h) 0 satisﬁes

max i C(h ) 0

j Ni γijw T Xj < 0

max i C(h ) 0

j Ni C1 γij X

j Ni C0 γij

w T µ + σ X

j Ni γijw T gj < 0

w T µ + σ max i C(h ) 0

j Ni γijw T gj < 0

Fountoulakis, Levi, Yang, Baranwal and Jagannath

for some absolute constant c3 > 0, and hence the probability that the model correctly classiﬁes all nodes in C(h ) 0 satisﬁes, for large enough n,

max i C(h ) 0

j Ni γijw T Xj < 0

max i C(h ) 0

j Ni w T gj < c3

max i C(h ) 0

j Ni w T gj < M

log n n(p + q)(1 max(p, q))

where the last inequality follows from our assumption on µ and we denote M def = c3M > 0. Now we will use Sudakov s minoration inequality (Vershynin, 2018) to obtain a lower bound on the expected maximum, and then apply Borell s inequality to upper bound the above probability. In order to apply Sudakov s result we will need to deﬁne a metric over the node index set [n]. Let zi def = P j Ni w T gj. For i, j C0, i = j, consider the canonical metric d (i, j) given by

d (i, j)2 def = E[(zi zj)2]

k Ni γ2 ik + X

k Nj γ2 jk 2 X

k Ni Nj γikγjk

1 n2(p + q)2

= c4|Jij| n2(p + q)2 ,

where Jij def = (Ni Nj)\(Ni Nj) is the symmetric diﬀerence of the neighbors of i and j, c4 > 0 is an absolute constant, and the inequality is due to (25). We lower bound |Jij| as follows. For i, j C0, i = j, and a node k [n], the probability that k is a neighbor of exactly one of i and j is 2p(1 p) if k C0 and 2q(1 q) if k C1. Therefore we have E[|Jij|] = n(p(1 p) + q(1 q)). It follows from the multiplicative Chernoﬀbound that for any 0 < δ < 1, Pr[|Jij| < E[|Jij|](1 δ)] exp( δ2 E[|Jij|]/3).

log n E[|Jij|] = 3

log n n(p(1 p) + q(1 q)) = o(1),

where the last equality follows from n(p(1 p) + q(1 q)) = Ω(log2 n), which is in turn due to the assumptions that p, q = Ω(log2 n/n) and p, q are bounded away from 1. Apply a union bound over all i, j C0, we get that with probability at least 1 o(1), the size of Jij satisﬁes |Jij| n(p(1 p) + q(1 q))(1 o(1)), i, j C0. (26)

Therefore it follows that, for large enough n,

c4|Jij| n2(p + q)2 =

c4n(p(1 p) + q(1 q))(1 o(1))

n2(p + q)2 = Ω

1 max(p, q)

Graph Attention Retrospective

We condition on the event that the inequality (26) holds for all i, j C0, which happens with probability at least 1 o(1). Apply Sudakov s minoration with metric d (i, j), we get that for large enough n, for all h = 1, 2, . . . , ℓ,

max i C(h) 0

j Ni γijw T gj

c5d (i, j) q

log |C(h) 0 | c6 ϵ

log n n(p + q)(1 max(p, q)) (27)

for some absolute constants c5, c6 > 0. The last inequality in the above follows from |C(h) 0 | = |C0|ϵ = Ω(nϵ). In addition, note that since by assumption Ψ is independent from the node features, using (25) we have that P j Ni γijw T gj is Gaussian with variance O( 1 n(p+q)). We use Borell s inequality (Adler and Taylor (2007) Chapter 2) to get that for any t > 0 and large enough n,

max i C(h ) 0

j Ni γijw T gj < E

max i C(h ) 0

j Ni γijw T gj

exp( c7t2n(p + q)).

for some absolute constant c7 > 0. By the lower bound of the expectation (27), we get that

max i C(h ) 0

j Ni γijw T gj < c6 ϵ

log n n(p + q)(1 max(p, q)) t

exp( c7t2n(p + q)).

Now, let M > 0 be any constant that also satisﬁes M < c6 ϵ/c3. Recall that we deﬁned M = c3M, and hence M < c6 ϵ. Set

t = (c6 ϵ M)

log n n(p + q)(1 max(p, q)) = Ω

log n n(p + q)(1 max(p, q))

then combine with the events we have conditioned so far we get

max i C(h ) 0

j Ni γijw T gj M

log n n(p + q)(1 max(p, q))

Recall that the above probability is the probability of correctly classifying all nodes in C(h ) 0 . Since our choice of h was arbitrary, this applies to every h {1, 2, . . . , h}. Let I(h) 0 denote the indicator variable of the event that there is at least one node in C(h) 0 that is misclassiﬁed, then

= ℓ E[I(1) 0 ] = ℓ (1 o(1)) = ℓ o(ℓ).

Apply the reverse Markov inequality, we get

h=1 I(h) 0 1

ℓ E h Pℓ h=1 I(h) 0 i

1 2ℓ= o(1).

Fountoulakis, Levi, Yang, Baranwal and Jagannath

Therefore, with probability at least 1 o(1), we have Pℓ h=1 I(h) 0 1

2ℓ= Ω(n1 ϵ). This implies the required result that the model misclassiﬁes at least Ω(n1 ϵ) nodes. The proof of part 2 is similar to the proof of part 1. Let us assume that p q since the result when p < q can be proved analogously. We condition on the events E1, E2, E3 deﬁned in Deﬁnition 17 which simultaneous hold with probability at least 1 o(1) by Lemma 18. Fix any w Rd such that w = 1. Because t = ω(1), by the deﬁnition of Ψ in (13) and the attention coeﬃcients in (1) we have that

( 2 np(1 o(1)), if (i, j) is an intra-class edge, o( 1 n(p+q)), if (i, j) is an inter-class edge. (29)

Write Xi = (2ϵi 1)µ + σgi where gi N(0, I), ϵi = 0 if i C0 and ϵi = 1 if i C1. We consider the classiﬁcation of nodes in C0 based on the model output P j Ni γijw T Xj for node i and the decision boundary at 0. As before, let 0 < ϵ < 1 and ﬁx any partition of C0 into a set of disjoint subsets C0 = Sℓ h=1 C(h) 0 such that ℓ= |C0|1 ϵ and |C(h) 0 | = |C0|ϵ for every h = 1, 2, . . . , ℓ. We proceed to show that, with high probability, for each one of more than half of the subsets C(h) 0 where h = 1, 2, . . . , ℓ, the model is going to misclassify at least one node, and consequently, the model is going to misclassify at least ℓ/2 = Ω(n1 ϵ) nodes as required. Fix any h {1, 2, . . . , ℓ}. Using (29) we get that, for large enough n, the event that the model correctly classiﬁes all nodes in C(h ) 0 satisﬁes (

max i C(h ) 0

j Ni γijw T Xj < 0

c1w T µ + σ max i C(h ) 0

j Ni γijw T gj < 0

for some absolute constant c1 > 0, and hence the probability that the model classiﬁes all nodes in Ch 0 correctly satisﬁes, for large enough n,

max i C(h ) 0

j Ni γijw T Xj < 0

max i C(h ) 0

j Ni w T gj < c1 |w T µ|

max i C(h ) 0

j Ni w T gj < M

log n n(p + q)(1 max(p, q))

where the last inequality follows from our assumption on µ and we denote M def = c1M > 0. The rest of the proof of part 2 proceeds as the proof of part 1.

A.8 Proof of Theorem 14

We restate Theorem 14 for convenience. Recall that we write µ = κσ for some κ > 0.

Theorem. With probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log2 nσ), the graph attention convolution output satisﬁes

j Ni γij w T Xj > 0 if and only if w T Xi > 0, i [n],

j Ni γij w T Xj < 0 if and only if w T Xi < 0, i [n].

Graph Attention Retrospective

We will condition on the following two events concerning the values of w T Xi for i [n]. Both events hold with probability at least 1 o(1). First, for ϵ > 0 consider the event w T Xi [ ϵσ max{1, κ}, ϵσ max{1, κ}] for all i [n] .

Since w = µ/ µ and µ = κσ, each w T Xi follows a normal distribution with mean ϵiκσ and variance σ2 (recall that ϵi {0, 1} generates the node memberships in CSBM), we have that

Pr ϵσ max{1, κ} w T Xi ϵσ max{1, κ} = Z κ+ϵ max{1,κ}

κ ϵ max{1,κ}

2 πϵ, if κ 1, q

2 πe κ2(1 ϵ)2/2ϵκ, if κ > 1.

Let us pick ϵ such that

ϵ = 1 n log n if κ 2 p

log n, and ϵ = 1

4 if κ > 2 p

This gives that

Pr w T Xi [ ϵσ max{1, κ}, ϵσ max{1, κ}] for all i [n] 1 n o(1/n) = 1 o(1).

Second, consider the event that | w T Xi E[ w T Xi]| 10σ log n. By Lemma 18 we know that it holds with probability at least 1 o(1). Under these two events, we have that

ϵσ max{1, κ} w T Xi κσ + 10σ p

log n if w T Xi > 0,

log n w T Xi ϵσ max{1, κ} if w T Xi < 0.

Recall from (5) that

Ψ(Xi, Xj) =

2R(1 β) w T Xi, if w T Xj w T Xi , 2R(1 β) sign( w T Xi) w T Xj, if w T Xi < w T Xj < w T Xi , 2R(1 β) w T Xi, if w T Xj w T Xi .

Consider an arbitrary node i where w T Xi > 0. We will show that hi = P j Ni γij w T Xj > 0. By the deﬁnition of attention coeﬃcients in (1), we know that X

j Ni γij w T Xj > 0 X

j Ni exp(Ψ(Xi, Xj)) w T Xj > 0.

Using the above expression for Ψ, we have that X

j Ni exp(Ψ(Xi, Xj)) w T Xj

= exp(Ψ(Xi, Xi)) w T Xi + X

exp(Ψ(Xi, Xj)) w T Xj

exp(Ψ(Xi, Xi)) w T Xi (n 1) exp(Ψ(Xi, Xi))(κσ + 10σ p

e2R(1 β)ϵσ max{1,κ}ϵσ max{1, κ} (n 1)e 2R(1 β)ϵσ max{1,κ}(κσ + 10σ p

Fountoulakis, Levi, Yang, Baranwal and Jagannath

where the second last inequality follows from the event that | w T Xj| κσ + 10σ log n for all j, and |Ni| n, and the last inequality follows from the fact that the function

f(x) = xe2R(1 β)x (n 1)(κσ + 10σ p

log n)e 2R(1 β)x

is increasing with respect to x for x 0. To see that h i = P j Ni exp(Ψ(Xi, Xj)) w T Xj > 0, consider the following cases separately.

If κ 2 log n, recall that we picked ϵ = 1/(n log n), then because R = Ω(n log2 n/σ), for large enough n we have that

4R(1 β)σ > n log n max{1, κ}

log n + log n log n

+ log(κ + 10 p

4R(1 β)σ > 1 ϵ max{1, κ}

log n + log 1 ϵ max{1, κ}

+ log(κ + 10 p

e4R(1 β)σϵ max{1,κ} > n(κ + 10 log n)

ϵ max{1, κ}

e2R(1 β)σϵ max{1,κ}σϵ max{1, κ} > ne 2R(1 β)σϵ max{1,κ}(κσ + 10σ p

By (30) this means that h i > 0.

If κ > 2 log n, then ϵ = 1/4. Because R = Ω(n log2 n/σ), for large enough n we have

4R(1 β)σ > log n + log(κ + 10 log n) log(κ/4)

4R(1 β)σ > log n + log(κ + 10 log n) log(ϵκ)

e4R(1 β)σϵκ > n(κ + 10 log n)

ϵκ e2R(1 β)σϵκσϵκ > n(κσ + 10σ p

By (30) this means that h i > 0.

This shows that h i > 0 whenever w T Xi > 0. Similarly, one easily gets that h i < 0 whenever w T Xi < 0. Therefore the proof is complete.

A.9 Proof of Corollary 16

We restate Corollary 16 for convenience. Recall that we write µ = κσ for some κ > 0.

Corollary. With probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log2 nσ), one has that

(Perfect classiﬁcation) If κ 2 log n then all nodes are correctly classiﬁed;

(Almost perfect classiﬁcation) If κ = ω(1) then at least 1 o(1) fraction of all nodes are correctly classiﬁed;

Graph Attention Retrospective

(Partial classiﬁcation) If κ = O(1) then at least Φ(κ) o(1) fraction of all nodes are correctly classiﬁed.

Fix i [n] and write w T Xi = (2ϵi 1) µ +σ/ µ µT gi where gi N(0, I), where we recall that ϵi {0, 1} deﬁnes the class membership of node i. We have that w T Xi follows a normal distribution with mean (2ϵi 1)κσ and standard deviation σ. Part 1 of the corollary is exactly Proposition 7 whose proof is given in the main text. We consider the other cases for κ. For both classes ϵi = 0 and ϵi = 1, the probability of correct classiﬁcation is Φ(κ) (using 0 as the decision boundary). Therefore, by applying additive Chernoﬀbound, we have that

Pr At most nΦ(κ) n log n nodes are correctly classiﬁed

i [n] 1{node i is correctly classiﬁed} nΦ(κ) p

The proof is complete by noticing that nΦ(κ) n log n = n(1 o(1)) when κ = ω(1) and nΦ(κ) n log n = n(Φ(κ) o(1)) when κ = O(1).

E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1 86, 2018.

R. J. Adler and J. E. Taylor. Random ﬁelds and geometry. Springer, 2007.

T. W. Anderson. An introduction to multivariate statistical analysis. John Wiley & Sons, 2003.

A. Athreya, D. E. Fishkind, M. Tang, C. E. Priebe, Y. Park, J. T. Vogelstein, K. Levin, V. Lyzinski, Y. Qin, and D. L. Sussman. Statistical inference on random dot product graphs: A survey. Journal of Machine Learning Research, 18(226):1 92, 2018.

J. Atwood and D. Towsley. Diﬀusion-convolutional neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2016.

D. Bahdanau, K. H. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

A. Baranwal, K. Fountoulakis, and A. Jagannath. Graph convolution for semi-supervised classiﬁcation: Improved linear separability and out-of-distribution generalization. In International Conference on Machine Learning (ICML), 2021.

N. Binkiewicz, J. T. Vogelstein, and K. Rohe. Covariate-assisted spectral clustering. Biometrika, 104:361 377, 2017.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

C. Bodnar, F. Di Giovanni, B. Chamberlain, P. Lio, and M. M. Bronstein. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in gnns. In Advances in Neural Information Processing Systems (Neur IPS), 2022.

X. Bresson and T. Laurent. Residual gated graph convnets. In ar Xiv:1711.07553, 2018.

S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks. In International Conference on Learning Representations (ICLR), 2022.

M. M. Bronstein, J. Bruna, T. Cohen, and P. Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021.

J. Bruna, W. Zaremba, A. Szlam, and Y. Le Cun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations (ICLR), 2019.

E. Chien, J. Peng, P. Li, and O. Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations (ICLR), 2021.

M. Deﬀerrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral ﬁltering. In Advances in Neural Information Processing Systems (Neur IPS), 2016.

Y. Deshpande, A. Montanari S. Sen, and E. Mossel. Contextual stochastic block models. In Advances in Neural Information Processing Systems (Neur IPS), 2018.

D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. G omez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular ﬁngerprints. In Advances in Neural Information Processing Systems (Neur IPS), 2015.

M. Fey and J. E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2020.

M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks (IJCNN), 2005.

W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (Neur IPS), 2017.

M. Henaﬀ, J. Bruna, and Y. Le Cun. Deep convolutional networks on graph-structured data. In ar Xiv:1506.05163, 2015.

Graph Attention Retrospective

Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. B. Ma, H. Chen, and M.-C. Yang. Measuring and improving the use of graph information in graph neural networks. In International Conference on Learning Representations (ICLR), 2019.

W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (Neur IPS), 2020.

S. Jegelka. Theory of graph neural networks: Representation and learning. In ar Xiv:2204.07697, 2022.

N. Keriven, A. Bietti, and S. Vaiter. On the universality of graph neural networks on large random graphs. In Advances in Neural Information Processing Systems (Neur IPS), 2021.

T. N. Kipf and M. Welling. Semi-supervised classiﬁcation with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

B. Knyazev, G. W. Taylor, and M. Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2019.

B. J. Lee, R. A. Rossi, S. Kim, K. N. Ahmed, and E. Koh. Attention models in graphs: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 2019.

R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral ﬁlters. IEEE Transactions on Signal Processing, 67(1):97 109, 2018.

Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.

D. Lim, F. Hohne, X. Li, S. L. Huang, V. Gupta, O. Bhalerao, and S. N. Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. In Advances in Neural Information Processing Systems (Neur IPS), 2021.

A. Loukas. How hard is to distinguish graphs with graph neural networks? In Advances in Neural Information Processing Systems (Neur IPS), 2020a.

A. Loukas. What graph neural networks cannot learn: Depth vs width. In International Conference on Learning Representations (ICLR), 2020b.

S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, and D. Precup. Revisiting heterophily for graph neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2022.

S. Maskey, R. Levie, Y. Lee, and G. Kutyniok. Generalization analysis of message passing neural networks on large random graphs. In Advances in Neural Information Processing Systems (Neur IPS), 2022.

M. Minsky and S. Papert. Perceptron: an introduction to computational geometry, 1969.

Fountoulakis, Levi, Yang, Baranwal and Jagannath

C. Moore. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. Bulletin of The European Association for Theoretical Computer Science, 1(121), 2017.

J. Palowitch, A. Tsitsulin, B. Mayer, and B. Perozzi. Graphworld: Fake graphs bring real insights for gnns. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.

O. Puny, H. Ben-Hamu, and Y. Lipman. Global attention improves graph networks generalization. In ar Xiv:2006.07846, 2020.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 2009.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (Neur IPS), 2017.

P. Veliˇckovi c, G. Cucurull, A. Casanova, A. Romero, P. Li o, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. ar Xiv preprint ar Xiv:1909.01315, 2019.

K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.

Y. Yan, M. Hashemi, K. Swersky, Y. Yang, and D. Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In IEEE International Conference on Data Mining (ICDM), 2022.

J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra. Beyond homophily in graph neural networks: Current limitations and eﬀective designs. In Advances in Neural Information Processing Systems (Neur IPS), 2020.