Journal of Machine Learning Research 24 (2023) 1-52. Submitted 11/22; Revised 4/23; Published 5/23.

Graph Attention Retrospective

Kimon Fountoulakis (kimon.fountoulakis@uwaterloo.ca)
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Amit Levi (amit.levi@uwaterloo.ca)
Huawei Noah's Ark Lab, Montreal, Quebec, Canada

Shenghao Yang (shenghao.yang@uwaterloo.ca)
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Aseem Baranwal (aseem.baranwal@uwaterloo.ca)
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Aukosh Jagannath (a.jagannath@uwaterloo.ca)
Department of Statistics and Actuarial Science, Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, Canada

Editor: Joan Bruna

Abstract

Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution, which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification.
In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention's robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.

Keywords: graph neural networks, attention mechanism, contextual stochastic block model, high-dimensional probability, node classification

*. Equal contribution.
©2023 Kimon Fountoulakis, Amit Levi, Shenghao Yang, Aseem Baranwal and Aukosh Jagannath. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1255.html.

1. Introduction

Graph learning has received a lot of attention recently due to breakthrough learning models (Gori et al., 2005; Scarselli et al., 2009; Bruna et al., 2014; Duvenaud et al., 2015; Henaff et al., 2015; Atwood and Towsley, 2016; Defferrard et al., 2016; Hamilton et al., 2017; Kipf and Welling, 2017) that are able to exploit multi-modal data consisting of nodes and their edges as well as the features of the nodes. One of the most important problems in graph learning is the problem of classification, where the goal is to classify the nodes or edges of a graph given the graph and the features of the nodes. Two of the most fundamental mechanisms for classification, and graph learning in general, are graph convolution and graph attention.
Graph convolution, usually defined using its spatial version, corresponds to averaging the features of a node with the features of its neighbors (Kipf and Welling, 2017).¹ Graph attention (Veličković et al., 2018) mechanisms augment this convolution by appropriately weighting the edges of a graph before spatially convolving the data. Graph attention is able to do this by using information from the given features for each node. Despite its wide adoption by practitioners (Fey and Lenssen, 2019; Wang et al., 2019; Hu et al., 2020) and its large academic impact, the number of works that rigorously study its effectiveness is quite limited. One of the motivations for using a graph attention mechanism, as opposed to a simple convolution, is the expectation that the attention mechanism is able to distinguish inter-class edges from intra-class edges, and consequently weights inter-class and intra-class edges differently before performing the convolution step. This ability essentially maintains the weights of important edges and significantly reduces the weights of unimportant edges, and thus allows graph convolution to aggregate features from a subset of neighbor nodes that helps node classification tasks. In this work, we explore the regimes in which this heuristic picture holds in simple node classification tasks, namely classifying the nodes in a contextual stochastic block model (CSBM) (Binkiewicz et al., 2017; Deshpande et al., 2018). The CSBM is a coupling of the stochastic block model (SBM) with a Gaussian mixture model, where the features of the nodes within a class are drawn from the same component of the mixture model. For a more precise definition, see Section 2. We focus on the case of two classes, where the answer to the above question is sufficiently precise to understand the performance of graph attention and to build useful intuition about it. We briefly and informally summarize our contributions as follows:

1.
In the "easy regime", i.e., when the distance between the means of the Gaussian mixture model is much larger than the standard deviation, we show that there exists a choice of attention architecture that distinguishes inter-class edges from intra-class edges with high probability (Theorem 3). In particular, we show that the attention coefficients for one class of edges are much higher than for the other class of edges (Corollary 4). Furthermore, we show that these attention coefficients lead to a perfect node classification result (Corollary 5). However, in the same regime, we also show that the graph is not needed to perfectly classify the data (Proposition 7).

¹. Although the model in Kipf and Welling (2017) is related to spectral convolutions, it is mainly a spatial convolution, since messages are propagated along graph edges. More broadly, graph convolution can refer to different variants arising from different (approximations of) graph spectral filters. We provide details in Section 2.

2. In the "hard regime", i.e., when the distance between the means is small compared to the standard deviation, we show that every attention architecture is unable to distinguish inter-class from intra-class edges with high probability (Theorem 9). Moreover, we show that using the original Graph Attention Network (GAT) architecture (Veličković et al., 2018), with high probability, most of the attention coefficients are going to have uniform weights, similar to those of simple graph convolution (Kipf and Welling, 2017) (Theorem 10). We also consider a setting where graph attention is provided with independent and perfect edge information, and we show that even in this case, if the distance between the means is sufficiently small, then graph attention would fail to (almost) perfectly classify the nodes with high probability (Theorem 13).

3.
Beyond perfect node classification, we show that graph attention is able to assign a significantly higher weight to self-loops, irrespective of the parameters of the Gaussian mixture model that generates the node features or the stochastic block model that generates the graph. Consequently, we show that with high probability, graph attention convolution is at least as good as the best linear classifier for the node features (Theorem 14). In a high structural noise regime where the graph is not helpful for node classification, i.e., when the intra-class edge probability p is close to the inter-class edge probability q, this means that graph attention is strictly better than simple graph convolution, because it is able to essentially ignore the graph structural noise (Corollary 15). In addition, we obtain lower bounds on graph attention's performance for almost perfect node classification and partial node classification (Corollary 16).

4. We provide an extensive set of experiments, both on synthetic data and on three popular real-world datasets, that validates our theoretical results.

1.1 Limitations of our theoretical setting

In this work, to study the benefits and the limitations of the graph attention mechanism, we focus on node classification tasks using the CSBM generative model with two classes and Gaussian node features. This theoretical setting has a few limitations. First, we note that many real-world networks are much more complex and may exhibit different structures than the ones obtainable from a stochastic block model. Furthermore, real-world node attributes such as categorical features may not even follow a continuous probability distribution. Admittedly, there are clear gaps between CSBM and the actual generative processes of real-world data. Nonetheless, we note that a good understanding of graph attention's behavior over CSBM already provides useful insights.
For example, it has been shown empirically that synthetic graph datasets based on CSBM constitute good benchmarks for evaluating various GNN architectures (Palowitch et al., 2022). To that end, in Section 6, we summarize some practical implications of our theoretical results which could be useful for practitioners working with GNNs. Second, there are different levels of granularity for machine learning problems on graphs. Besides node-level tasks such as node classification, other important problems include edge-level tasks such as link prediction and graph-level tasks such as graph classification. While, empirically, variants of graph attention networks have been successfully applied to solve problems at all levels of granularity, our theoretical results mostly pertain to the effects of graph attention for node classification. This is a limitation on the scope of our study. On the other hand, we note that edge-level and graph-level tasks are typically solved by adding a pooling layer which combines node representations from previous graph (attention) convolution layers. Consequently, the quality of graph attention convolution's output for a node not only affects node classification but also has a major influence on link prediction and graph classification. Therefore, our results may be extended to study the effects of graph attention on link prediction and graph classification under the CSBM generative model. For example, link prediction is closely related to edge classification, which we extensively study in Section 3. For graph classification, one may consider the problem of classifying graphs generated from different parameter settings of the CSBM. In this case, our results on graph attention's impact on node representations may be used to establish results on graph attention's impact on graph representations after certain pooling layers.
In particular, if graph attention convolution generates good node representations that are indicative of node class memberships, then one can show that the graph representation obtained from appropriately aggregating node representations would be indicative of the clustering structure of the graph, and hence the graph representation would be useful for graph classification tasks.

1.2 Relevant work

Recently, the concept of attention for neural networks (Bahdanau et al., 2015; Vaswani et al., 2017) was transferred to graph neural networks (Li et al., 2016; Bresson and Laurent, 2018; Veličković et al., 2018; Lee et al., 2019; Puny et al., 2020). A few papers have attempted to understand the attention mechanism of Veličković et al. (2018). One work relevant to ours is Brody et al. (2022). In this paper, the authors show that a node may fail to assign large edge weight to its most important neighbors due to a global ranking of nodes generated by the attention mechanism in Veličković et al. (2018). Another related work is Knyazev et al. (2019), which presents an empirical study of the ability of graph attention to generalize on larger, complex, and noisy graphs. In addition, in Hou et al. (2019) the authors propose a different metric to generate the attention coefficients and show empirically that it has an advantage over the original GAT architecture. Other work related to ours, which does not focus on graph attention, comes from the field of statistical learning on random data models. Random graphs and the stochastic block model have traditionally been used in clustering and community detection (Abbe, 2018; Athreya et al., 2018; Moore, 2017). Moreover, the works by Binkiewicz et al. (2017) and Deshpande et al. (2018), which also rely on CSBM, are focused on the fundamental limits of unsupervised learning. Of particular relevance is the work by Baranwal et al. (2021), which studies the performance of graph convolution on CSBM as a semi-supervised learning problem.
Within the context of random graphs, Keriven et al. (2021) study the approximation power of graph neural networks on random graphs. In Maskey et al. (2022), the authors derive the generalization error of graph neural networks for graph classification and regression tasks. In our paper, we are interested in understanding graph attention's capability for edge/node classification with respect to different parameter regimes of CSBM. Finally, there are a few related theoretical works on understanding the generalization and representation power of graph neural networks (Chen et al., 2019; Chien et al., 2021; Zhu et al., 2020; Xu et al., 2019; Garg et al., 2020; Loukas, 2020a,b). For a recent survey in this direction, see Jegelka (2022). Our work takes a statistical perspective which allows us to characterize the precise performance of graph attention compared to graph convolution and no convolution for CSBM, with the goal of answering the particular questions that we posed above.

2. Preliminaries

In this section, we describe the Contextual Stochastic Block Model (CSBM) (Deshpande et al., 2018), which serves as our data model, and the graph attention mechanism (Veličković et al., 2018). We also define notation and terms that are frequently used throughout the paper. For a vector $x \in \mathbb{R}^n$ with $n \in \mathbb{N}$, $\|x\|$ denotes the Euclidean norm of $x$, i.e. $\|x\| \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i \in [n]} x_i^2}$. We write $[n] \stackrel{\mathrm{def}}{=} \{1, 2, \ldots, n\}$. We use $\mathrm{Ber}(p)$ to denote the Bernoulli distribution, so $x \sim \mathrm{Ber}(p)$ means the random variable $x$ takes value 1 with probability $p$ and 0 with probability $1 - p$. Let $d, n \in \mathbb{N}$, and let $\epsilon_1, \ldots, \epsilon_n \sim \mathrm{Ber}(1/2)$. Define two classes as $C_k = \{j \in [n] \mid \epsilon_j = k\}$ for $k \in \{0, 1\}$. For each index $i \in [n]$, we set the feature vector $X_i \in \mathbb{R}^d$ as $X_i \sim N((2\epsilon_i - 1)\mu, \sigma^2 I)$, where $\mu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}$, and $I \in \{0,1\}^{d \times d}$ is the identity matrix.² For a given pair $p, q \in [0, 1]$, we consider the stochastic adjacency matrix $A \in \{0,1\}^{n \times n}$ defined as follows.
For $i, j \in [n]$ in the same class (i.e., an intra-class edge), we set $a_{ij} \sim \mathrm{Ber}(p)$, and if $i, j$ are in different classes (i.e., an inter-class edge), we set $a_{ij} \sim \mathrm{Ber}(q)$. We denote by $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$ a sample obtained according to the above random process. An advantage of CSBM is that it allows us to control the noise by controlling the parameters of the distributions of the model. In particular, CSBM allows us to control the distance between the means and the variance of the Gaussians, which are important for controlling the separability of the Gaussians. For example, fixing the variance, the closer the means are, the more difficult the separation of the node Gaussian features becomes. Another notable advantage of CSBM is that it allows us to control the structural noise and homophily level of the graph. The level of noise in the graph is controlled by increasing or reducing the gap between the intra-class edge probability $p$ and the inter-class edge probability $q$, and the level of homophily in the graph is controlled by the relative magnitude of $p$ and $q$. For example, when $p$ is much larger than $q$, a node is more likely to be connected with a node from the same class, and hence we obtain a homophilous graph; when $q$ is much larger than $p$, a node is more likely to be connected with a node from a different class, and hence we obtain a heterophilous graph. There are several recent works exploring the behavior of GNNs on heterophilous graphs (Lim et al., 2021; Yan et al., 2022; Bodnar et al., 2022; Luan et al., 2022). Interestingly, we note that our results for graph attention's behavior over the CSBM data model do not depend on whether the graph is homophilous or heterophilous.

². The means of the mixture of Gaussians are $\pm\mu$. Our results can be easily generalized to general means. The current setting makes our analysis simpler without loss of generality.
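The sampling process just described can be sketched in a few lines of NumPy; this snippet is ours, not part of the original model definition, and the function name `sample_csbm` is hypothetical:

```python
import numpy as np

def sample_csbm(n, p, q, mu, sigma, rng=None):
    """Sample (X, A) ~ CSBM(n, p, q, mu, sigma^2) as described above."""
    rng = np.random.default_rng(rng)
    d = mu.shape[0]
    eps = rng.integers(0, 2, size=n)               # class labels, eps_i ~ Ber(1/2)
    # Node features: X_i ~ N((2*eps_i - 1) * mu, sigma^2 * I)
    X = (2 * eps - 1)[:, None] * mu + sigma * rng.standard_normal((n, d))
    # Edges: Ber(p) within a class, Ber(q) across classes
    same = (eps[:, None] == eps[None, :])
    probs = np.where(same, p, q)
    U = rng.random((n, n))
    A = np.triu(U < probs, k=1).astype(int)
    A = A + A.T                                    # undirected, no self-loops
    return X, A, eps
```

Sampling with, say, `sample_csbm(1000, 0.5, 0.1, mu, 1.0)` yields a homophilous instance, while swapping `p` and `q` yields a heterophilous one.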
Given node representations $h_i \in \mathbb{R}^F$ for $i \in [n]$, a (spatial) graph convolution for node $i$ has output
$$h_i' = \sum_{j \in [n]} A_{ij} c_{ij} W h_j,$$
where $W \in \mathbb{R}^{F' \times F}$ is a learnable matrix and $c_{ij} \stackrel{\mathrm{def}}{=} 1/|N_i|$, with $N_i$ denoting the set of neighbors of node $i$ (including node $i$ itself). Throughout this paper, we will refer to this operation as simple graph convolution or standard graph convolution. Our definition of graph convolution is essentially the mean aggregation step in a general GNN layer (Hamilton et al., 2017). The normalization constant $c_{ij}$ in our definition is closely related to the symmetric normalization $c_{ij} = (\sum_{\ell \in [n]} A_{i\ell})^{-1/2} (\sum_{\ell \in [n]} A_{j\ell})^{-1/2}$ used in the original Graph Convolutional Network (GCN) introduced by Kipf and Welling (2017). Our definition does not affect the discussions or implications we have for GCN with symmetric normalization. More broadly, there are other forms of graph convolution in the literature (Bronstein et al., 2021; Defferrard et al., 2016; Levie et al., 2018), which we do not compare with in this work. A single-head graph attention applies a weight function on the edges based on their node features (or a mapping thereof). Given two representations $h_i, h_j \in \mathbb{R}^F$ for two nodes $i, j \in [n]$, the attention model/mechanism is defined as the mapping
$$\Psi(h_i, h_j) \stackrel{\mathrm{def}}{=} \alpha(W h_i, W h_j),$$
where $\alpha : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ and $W \in \mathbb{R}^{F' \times F}$ is a learnable matrix. The attention coefficient for a node $i$ and its neighbor $j$ is defined as
$$\gamma_{ij} \stackrel{\mathrm{def}}{=} \frac{\exp(\Psi(h_i, h_j))}{\sum_{\ell \in N_i} \exp(\Psi(h_i, h_\ell))}, \qquad (1)$$
where $N_i$ is the set of neighbors of node $i$, which also includes node $i$ itself. Let $f$ be some element-wise activation function (which is usually nonlinear); the graph attention convolution output for a node $i \in [n]$ is given by
$$h_i' = \sum_{j \in [n]} A_{ij} \gamma_{ij} W h_j, \qquad h_i = f(h_i').$$
A multi-head graph attention (Veličković et al., 2018) uses $K \in \mathbb{N}$ weight matrices $W_1, \ldots, W_K \in \mathbb{R}^{F' \times F}$ and averages their individual (single-head) outputs. We consider the most simplified case of a single graph attention layer (i.e., $F = d$ and $F' = 1$), where $\alpha$ is realized by an MLP using the Leaky ReLU activation function.
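A minimal sketch of the single-head graph attention convolution above, with the softmax attention coefficients of (1) computed over the neighborhood $N_i$ (self-loop included), might look as follows. The helper name `attention_conv`, the choice $f = \tanh$, and the decision to let `psi` act on raw features are our own illustrative assumptions:

```python
import numpy as np

def attention_conv(X, A, W, psi, f=np.tanh):
    """Single-head graph attention convolution: for each node i,
    gamma_ij is a softmax of psi over the neighborhood N_i (self-loop added)."""
    n = A.shape[0]
    A_hat = A + np.eye(n, dtype=int)          # N_i includes node i itself
    H = X @ W.T                                # W h_j for every node j
    out = np.zeros_like(H)
    for i in range(n):
        nbrs = np.flatnonzero(A_hat[i])
        scores = np.array([psi(X[i], X[j]) for j in nbrs], dtype=float)
        scores -= scores.max()                 # numerically stable softmax
        gamma = np.exp(scores) / np.exp(scores).sum()
        out[i] = gamma @ H[nbrs]               # sum_j gamma_ij * W h_j
    return f(out)
```

With a constant `psi`, the coefficients become uniform over $N_i$ and the layer reduces to the mean aggregation of the simple graph convolution, which is exactly the degenerate behavior studied in the hard regime below.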
The Leaky ReLU activation function is defined as $\mathrm{LeakyRelu}(x) = x$ if $x \ge 0$ and $\mathrm{LeakyRelu}(x) = \beta x$ for some constant $\beta \in [0, 1)$ if $x < 0$. The CSBM model induces node features $X$ which are correlated through the graph $G = ([n], E)$, represented by an adjacency matrix $A$. A natural requirement of attention architectures is to maintain important edges in the graph and ignore unimportant edges. For example, the important edges could be the set of intra-class edges and the unimportant edges could be the set of inter-class edges. In this case, if graph attention maintains all intra-class edges and ignores all inter-class edges, then a node from a class will be connected only to nodes from its own class. More specifically, a node $v$ will be connected to neighbor nodes whose associated node features come from the same distribution as the node features of $v$. Given two sets $A$ and $B$, we denote $A \times B \stackrel{\mathrm{def}}{=} \{(i, j) : i \in A, j \in B\}$ and $A^2 \stackrel{\mathrm{def}}{=} A \times A$.

3. Separation of edges and its implications for graph attention's ability for perfect node classification

In this section, we study the ability of graph attention to separate important edges from unimportant edges. In addition, we study the implications of graph attention's behavior for node classification. In particular, we are interested in whether or not graph attention convolution is able to perfectly classify the nodes in the graph. Depending on the parameter regimes of the CSBM, we separate the results into two parts. In Section 3.1, we study graph attention's behavior when the distance between the means of the node features is large. We construct a specific attention architecture and demonstrate its ability to separate intra-class edges from inter-class edges. Consequently, we show that the corresponding graph attention convolution is able to perfectly classify the nodes. In Section 3.2, we study graph attention's behavior when the distance between the means of the node features is small.
We show that with high probability no attention architecture is able to separate intra-class edges from inter-class edges. Consequently, we conjecture that no graph attention is able to perfectly classify the nodes. We provide evidence for this conjecture in Section 3.2.1, where we assume the existence of a strong attention mechanism that is able to perfectly classify the edges independently of the node features. We show that even this idealistic attention mechanism fails to (almost) perfectly classify the nodes, provided that the distance between the means of the node features is sufficiently small. The two parameter regimes of the CSBM that we consider in Section 3.1 and Section 3.2 are as follows. The first (the "easy regime") is where $\|\mu\| = \omega(\sigma\sqrt{\log n})$, and the second (the "hard regime") is where $\|\mu\| = \kappa\sigma$ for some $0 < \kappa \le O(\sqrt{\log n})$. We start by defining edge separability and node separability.

Definition 1 (Edge separability). Given an attention model $\Psi$, we say that the model separates the edges if the outputs satisfy $\mathrm{sign}(\Psi(X_i, X_j)) \neq \mathrm{sign}(\Psi(X_k, X_\ell))$ whenever $(i, j)$ is an intra-class edge, i.e. $(i, j) \in (C_1^2 \cup C_0^2) \cap E$, and $(k, \ell)$ is an inter-class edge, i.e. $(k, \ell) \in E \setminus (C_1^2 \cup C_0^2)$.

Definition 2 (Node separability). Given a classification model which outputs a scalar representation $h_i'$ for node $i$, we say that the model separates the nodes if $h_i' > 0$ whenever $i \in C_1$ and $h_i' < 0$ whenever $i \in C_0$.

Some of our results in this section rely on a mild assumption that lower bounds the sparsity of the graph generated by the CSBM model. This assumption says that the expected degree of a node in the graph is larger than $\log^2 n$. It is required to obtain a concentration-of-node-degrees property used in the proofs. While this assumption may limit a direct application of our results to very sparse graphs, for example, graphs whose node degrees are bounded by a constant, it is mainly an artifact of the proof techniques that we use.
We expect that similar results still hold for sparser graphs, but to rigorously remove the assumption on graph density one may have to adopt a different proof technique.

Assumption 1. $p, q = \Omega(\log^2 n / n)$.

3.1 Easy Regime

In this regime ($\|\mu\| = \omega(\sigma\sqrt{\log n})$), we show that a two-layer MLP attention is able to separate the edges with high probability. Consequently, we show that the corresponding graph attention convolution is able to separate the nodes with high probability. At a high level, we transform the problem of classifying an edge $(i, j) \in E$ into the problem of classifying a point $[\tilde w^T X_i, \tilde w^T X_j]$ in $\mathbb{R}^2$, where $\tilde w = \mu / \|\mu\|$ is a unit vector that maximizes the total pairwise distances among the four means given below. When we consider the set of points $[\tilde w^T X_i, \tilde w^T X_j]$ for all $(i, j) \in E$, we can think of each point as a two-dimensional Gaussian vector whose mean is one of the following: $[\tilde w^T \mu, \tilde w^T \mu]$, $[-\tilde w^T \mu, -\tilde w^T \mu]$, $[\tilde w^T \mu, -\tilde w^T \mu]$, $[-\tilde w^T \mu, \tilde w^T \mu]$. The set of intra-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $[\tilde w^T \mu, \tilde w^T \mu]$ or $[-\tilde w^T \mu, -\tilde w^T \mu]$, while the set of inter-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $[\tilde w^T \mu, -\tilde w^T \mu]$ or $[-\tilde w^T \mu, \tilde w^T \mu]$. Therefore, in order to correctly classify the edges, we need to correctly classify the data corresponding to the means $[\tilde w^T \mu, \tilde w^T \mu]$ and $[-\tilde w^T \mu, -\tilde w^T \mu]$ as $1$, and classify the data corresponding to the other two means as $-1$. This problem is known in the literature as the XOR problem (Minsky and Papert, 1969). To achieve this, we consider a two-layer MLP architecture $\Psi$ which separates the first and the third quadrants of the two-dimensional space from the second and the fourth quadrants. In particular, we consider the following specification of $\Psi(X_i, X_j)$:
$$\Psi(X_i, X_j) \stackrel{\mathrm{def}}{=} r^T \,\mathrm{LeakyRelu}\!\left(S \begin{bmatrix} \tilde w^T X_i \\ \tilde w^T X_j \end{bmatrix}\right), \qquad (3)$$
$$\tilde w \stackrel{\mathrm{def}}{=} \frac{\mu}{\|\mu\|}, \qquad S \stackrel{\mathrm{def}}{=} \begin{bmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix}, \qquad r \stackrel{\mathrm{def}}{=} R \begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}, \qquad (4)$$
where $R > 0$ is an arbitrary scaling parameter.
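The architecture in (3)-(4) can be sketched directly. The sign pattern of $S$ and $r$ below is one choice consistent with the simplified expression (5); the function name `psi_xor` and the default parameter values are ours:

```python
import numpy as np

def psi_xor(xi, xj, mu, R=1.0, beta=0.2):
    """Two-layer MLP attention of (3)-(4): positive on intra-class means,
    negative on inter-class means of the induced XOR problem."""
    w = mu / np.linalg.norm(mu)                       # unit vector mu / ||mu||
    z = np.array([w @ xi, w @ xj])                    # projected pair in R^2
    S = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
    r = R * np.array([1.0, 1.0, -1.0, -1.0])
    leaky = lambda t: np.where(t >= 0, t, beta * t)   # Leaky ReLU, slope beta
    return r @ leaky(S @ z)
```

One can check algebraically that this choice gives $\Psi = 2R(1-\beta)\,\mathrm{sign}(z_1 z_2)\min(|z_1|, |z_2|)$ for $z = [\tilde w^T X_i, \tilde w^T X_j]$, i.e., it is positive on the first and third quadrants and negative on the other two, matching (5) and (6).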
The particular function $\Psi$ has been chosen such that it classifies the means of the XOR problem correctly, that is,
$$\mathrm{sign}(\Psi(\mathbb{E}[X_i], \mathbb{E}[X_j])) = \begin{cases} 1, & \text{if } (i, j) \text{ is an intra-class edge}, \\ -1, & \text{if } (i, j) \text{ is an inter-class edge}. \end{cases}$$
To see this, we can look at the following simplified expression for $\Psi$, which is obtained by substituting the specifications of $S$ and $r$ from (4) into (3):
$$\Psi(X_i, X_j) = \begin{cases} 2R(1-\beta)\, \tilde w^T X_i, & \text{if } \tilde w^T X_j \ge |\tilde w^T X_i|, \\ 2R(1-\beta)\, \mathrm{sign}(\tilde w^T X_i)\, \tilde w^T X_j, & \text{if } -|\tilde w^T X_i| < \tilde w^T X_j < |\tilde w^T X_i|, \\ -2R(1-\beta)\, \tilde w^T X_i, & \text{if } \tilde w^T X_j \le -|\tilde w^T X_i|. \end{cases} \qquad (5)$$
Then, by substituting $\tilde w = \mu / \|\mu\|$ into (5), one easily verifies that
$$\Psi(\mathbb{E}[X_i], \mathbb{E}[X_j]) = \begin{cases} 2R(1-\beta)\|\mu\|, & \text{if } (i, j) \text{ is an intra-class edge}, \\ -2R(1-\beta)\|\mu\|, & \text{if } (i, j) \text{ is an inter-class edge}. \end{cases} \qquad (6)$$
On the other hand, our assumption that $\|\mu\| = \omega(\sigma\sqrt{\log n})$ guarantees that the distance between the means of the XOR problem is much larger than the standard deviation of the Gaussians, and thus with high probability there is no overlap between the distributions. More specifically, one can show that with high probability,
$$\Psi(X_i, X_j) = \begin{cases} 2R(1-\beta)\|\mu\|(1 - o(1)), & \text{if } (i, j) \text{ is an intra-class edge}, \\ -2R(1-\beta)\|\mu\|(1 - o(1)), & \text{if } (i, j) \text{ is an inter-class edge}. \end{cases} \qquad (7)$$
This property guarantees that with high probability, $\mathrm{sign}(\Psi(X_i, X_j)) = \mathrm{sign}(\Psi(\mathbb{E}[X_i], \mathbb{E}[X_j]))$ for all $(i, j) \in E$, which implies perfect separability of the edges. We state this result below in Theorem 3 and provide detailed arguments in the appendix.

Theorem 3. Suppose that $\|\mu\| = \omega(\sigma\sqrt{\log n})$. Then with probability at least $1 - o(1)$ over the data $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$, the two-layer MLP attention architecture $\Psi$ given in (3) and (4) separates intra-class edges from inter-class edges.

Our analysis of edge separability by the attention architecture $\Psi$ has two important implications. First, edge separability by $\Psi$ implies a nice concentration result for the attention coefficients $\gamma_{ij}$ defined in (1).
In particular, one can show that the attention coefficients have the desirable property of maintaining important edges and dropping unimportant edges. Second, the distinguishing power of the attention coefficients in turn implies the exact recoverability of the node memberships using graph attention convolution. We state these two results below in Corollary 4 and Corollary 5, respectively. Denote by $\tilde h$ the linear classifier
$$\tilde h(X_i) \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{if } \mu^T X_i \ge 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

Proposition 7. Suppose $\|\mu\| = \omega(\sigma\sqrt{\log n})$. Then with probability at least $1 - o(1)$ over the data $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$, the linear classifier given in (8) separates the nodes.

The proof of Proposition 7 is elementary. To see the claim, one may show that the probability that the classifier in (8) misclassifies a node $i \in [n]$ is $o(1)$. To do this, fix $i \in [n]$ and write $X_i = (2\epsilon_i - 1)\mu + \sigma g_i$, where $g_i \sim N(0, I)$. Assume for a moment that $\epsilon_i = 0$. Then the probability of misclassification is
$$\Pr\!\left[\mu^T X_i \ge 0\right] = \Pr\!\left[\frac{\mu^T g_i}{\|\mu\|} \ge \frac{\|\mu\|}{\sigma}\right] = \Phi^c\!\left(\frac{\|\mu\|}{\sigma}\right),$$
where $\Phi$ is the cumulative distribution function of $N(0, 1)$, $\Phi^c \stackrel{\mathrm{def}}{=} 1 - \Phi$, and the last equality follows from the fact that $\mu^T g_i / \|\mu\| \sim N(0, 1)$. The assumption that $\|\mu\| = \omega(\sigma\sqrt{\log n})$ implies $\|\mu\| \ge 2\sigma\sqrt{\log n}$ for large enough $n$. Therefore, using standard tail bounds for the normal distribution (Vershynin, 2018), we have that
$$\Phi^c\!\left(\frac{\|\mu\|}{\sigma}\right) \le \frac{\sigma}{\sqrt{2\pi}\,\|\mu\|} \exp\!\left(-\frac{\|\mu\|^2}{2\sigma^2}\right) \le \frac{n^{-2}}{\sqrt{4\pi \log n}}.$$
This means that the probability that there exists $i \in C_0$ which is misclassified is at most $\frac{1}{2n\sqrt{4\pi \log n}} = o(1)$. A similar argument can be applied to the case where $\epsilon_i = 1$, and an application of a union bound over the events that there is some $i \in [n]$ which is misclassified finishes the proof of Proposition 7.

3.2 Hard Regime

In this regime ($\|\mu\| = \kappa\sigma$ for $\kappa \le O(\sqrt{\log n})$), we show that every attention architecture $\Psi$ fails to separate the edges if $\kappa < 2\sqrt{\log n}$, and we conjecture that no graph attention convolution is able to separate the nodes.
The conjecture is based on our node separability result in Section 3.2.1, which says that even if we assume that there is an attention architecture which is able to separate the edges independently of the node features, the corresponding graph attention convolution still fails to (almost) perfectly classify the nodes with high probability. The goal of the attention mechanism is to decide whether an edge $(i, j)$ is an inter-class edge or an intra-class edge based on the node feature vectors $X_i$ and $X_j$. Let $X'_{ij}$ denote the vector obtained by concatenating $X_i$ and $X_j$, that is,
$$X'_{ij} \stackrel{\mathrm{def}}{=} \begin{bmatrix} X_i \\ X_j \end{bmatrix}. \qquad (9)$$
We would like to analyze every classifier $h$ which takes as input $X'_{ij}$ and tries to separate inter-class edges from intra-class edges. An ideal classifier would have the property
$$y = h(X'_{ij}) = \begin{cases} 0, & \text{if } (i, j) \text{ is an inter-class edge}, \\ 1, & \text{if } (i, j) \text{ is an intra-class edge}. \end{cases} \qquad (10)$$
To understand the limitations of all such classifiers in this regime, it suffices to consider the Bayes optimal classifier for this data model, whose probability of misclassifying an arbitrary edge lower bounds that of every attention architecture which takes as input $(X_i, X_j)$. Consequently, by deriving a misclassification rate for the Bayes classifier, we obtain a lower bound on the misclassification rate of every attention mechanism $\Psi$ for classifying intra-class and inter-class edges. The following Lemma 8 describes the Bayes classifier for this classification task.

Lemma 8. Let $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$ and let $X'_{ij}$ be defined as in (9). The Bayes optimal classifier for $X'_{ij}$ is realized by the following function:
$$h^*(x) = \begin{cases} 0, & \text{if } p \cosh\!\left(\frac{x^T \tilde\mu}{\sigma^2}\right) \le q \cosh\!\left(\frac{x^T \tilde\nu}{\sigma^2}\right), \\ 1, & \text{otherwise}, \end{cases} \qquad (11)$$
where $\tilde\mu \stackrel{\mathrm{def}}{=} \begin{bmatrix} \mu \\ \mu \end{bmatrix}$ and $\tilde\nu \stackrel{\mathrm{def}}{=} \begin{bmatrix} \mu \\ -\mu \end{bmatrix}$.

Using Lemma 8, we can lower bound the rate of misclassification of edges that every attention mechanism $\Psi$ exhibits. Below we define $\Phi^c \stackrel{\mathrm{def}}{=} 1 - \Phi$, where $\Phi$ is the cumulative distribution function of $N(0, 1)$.

Theorem 9.
Suppose $\|\mu\| = \kappa\sigma$ for some $\kappa > 0$ and let $\Psi$ be any attention mechanism. Then:

1. With probability at least $1 - o(1)$, $\Psi$ fails to correctly classify at least a $2\Phi^c(\kappa)^2$ fraction of inter-class edges;

2. For any $K > 0$, if $q > \frac{K \log^2 n}{n\, \Phi^c(\kappa)^2}$, then with probability at least $1 - O(n^{-8K\Phi^c(\kappa)^2 \log n})$, $\Psi$ misclassifies at least one inter-class edge.

Part 1 of Theorem 9 implies that if $\|\mu\|$ is linear in the standard deviation $\sigma$, that is, if $\kappa = O(1)$, then with overwhelming probability the attention mechanism fails to distinguish a constant fraction of inter-class edges from intra-class edges. Furthermore, part 2 of Theorem 9 characterizes a regime for the inter-class edge probability $q$ where the attention mechanism fails to distinguish at least one inter-class edge. It provides a lower bound on $q$ in terms of the scale at which the distance between the means grows compared to the standard deviation $\sigma$. This aligns with the intuition that as we increase the distance between the means, it gets easier for the attention mechanism to correctly distinguish inter-class and intra-class edges. However, if $q$ is also increased in the right proportion, in other words, if the noise in the graph is increased, then the attention mechanism would still fail to correctly distinguish at least one inter-class edge. For instance, for $\kappa = \sqrt{2 \log \log n}$ and $K = \log^2 n$, we get that if $q > \Omega(\log^{6+o(1)} n / n)$, then with probability at least $1 - o(1)$, $\Psi$ misclassifies at least one inter-class edge. The proof of Theorem 9 relies on analyzing the behavior of the Bayes optimal classifier in (11). We compute an upper bound on the probability with which the optimal classifier correctly classifies an arbitrary inter-class edge. Then the proof of part 1 of Theorem 9 follows from a concentration argument for the fraction of inter-class edges that are misclassified by the optimal classifier.
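The Bayes rule of (11), and its failure on inter-class edges in the hard regime, can be checked with a small Monte Carlo sketch; the parameter values and the name `bayes_edge_classifier` are our own choices:

```python
import numpy as np

def bayes_edge_classifier(xi, xj, p, q, mu, sigma):
    """Bayes rule from (11): returns 1 for intra-class, 0 for inter-class."""
    mu_t = np.concatenate([mu, mu])      # means of intra-class edge pairs
    nu_t = np.concatenate([mu, -mu])     # means of inter-class edge pairs
    x = np.concatenate([xi, xj])
    s2 = sigma ** 2
    return int(p * np.cosh(x @ mu_t / s2) > q * np.cosh(x @ nu_t / s2))

# Fraction of inter-class edges misclassified when kappa = ||mu||/sigma is small
rng = np.random.default_rng(0)
p, q, sigma = 0.5, 0.4, 1.0
mu = np.array([0.1, 0.0])                # kappa = 0.1: deep in the hard regime
errs, trials = 0, 2000
for _ in range(trials):
    xi = mu + sigma * rng.standard_normal(2)     # node in C1
    xj = -mu + sigma * rng.standard_normal(2)    # node in C0
    errs += bayes_edge_classifier(xi, xj, p, q, mu, sigma)  # 1 = wrong here
print(errs / trials)
```

With $p > q$ and $\kappa$ near zero, both hyperbolic cosines are close to 1, so the rule essentially always votes "intra-class" and misclassifies most inter-class edges, in line with part 1 of Theorem 9.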
For part 2, we use a similar concentration argument to choose a suitable threshold for q that forces the optimal classifier to fail on at least one inter-class edge. We provide formal arguments in the appendix.

As a motivating example of how the attention mechanism fails and how exactly the attention coefficients behave in this regime, we focus on one of the most popular attention architectures (Veličković et al., 2018), where α is a single-layer neural network parameterized by (w, a, b) ∈ R^d × R² × R with LeakyReLU activation function. Namely, the attention coefficients are defined by

γij := exp(LeakyReLU(a1·w^T Xi + a2·w^T Xj + b)) / Σ_{ℓ∈Ni} exp(LeakyReLU(a1·w^T Xi + a2·w^T Xℓ + b)),    (12)

where a = (a1, a2). We show that, as a consequence of the inability of the attention mechanism to distinguish intra-class and inter-class edges, with overwhelming probability most of the attention coefficients γij given by (12) are Θ(1/|Ni|). In particular, Theorem 10 says that for the vast majority of nodes in the graph, the attention coefficients on most edges are uniform, irrespective of whether the edge is inter-class or intra-class. As a result, the attention mechanism is unable to assign higher weights to important edges and lower weights to unimportant edges.

Theorem 10. Assume that ‖µ‖ ≤ Kσ and σ ≤ K′ for some absolute constants K and K′. Moreover, assume that the parameters (w, a, b) ∈ R^d × R² × R are bounded. Then, with probability at least 1 − o(1) over the data (X, A) ∼ CSBM(n, p, q, µ, σ²), there exists a subset A ⊆ [n] with cardinality at least n(1 − o(1)) such that for all i ∈ A the following hold:

1. There is a subset Ji,0 ⊆ Ni ∩ C0 with cardinality at least (9/10)|Ni ∩ C0|, such that γij = Θ(1/|Ni|) for all j ∈ Ji,0.
2. There is a subset Ji,1 ⊆ Ni ∩ C1 with cardinality at least (9/10)|Ni ∩ C1|, such that γij = Θ(1/|Ni|) for all j ∈ Ji,1.

Theorem 10 is proved by carefully computing the numerator and the denominator in (12).
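The coefficients in (12) are simply a softmax over LeakyReLU scores of projected feature pairs. A minimal sketch of this computation (all names are ours and illustrative; this is not the authors' code) is:

```python
import math

def leaky_relu(x, beta=0.2):
    # LeakyReLU with slope beta on the negative part
    return x if x >= 0 else beta * x

def attention_coefficients(i, neighbors, X, w, a, b=0.0, beta=0.2):
    """Sketch of the attention coefficients in (12): a softmax over
    LeakyReLU(a1*w^T X_i + a2*w^T X_j + b) for j in N_i.
    X is a dict mapping node -> feature vector (list of floats)."""
    def proj(v):
        return sum(wk * vk for wk, vk in zip(w, v))
    a1, a2 = a
    scores = {j: leaky_relu(a1 * proj(X[i]) + a2 * proj(X[j]) + b, beta)
              for j in neighbors}
    m = max(scores.values())                    # subtract max for stability
    exps = {j: math.exp(s - m) for j, s in scores.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}  # gamma_ij, sums to 1
```

For instance, with a = (0, 0) every score is identical and the coefficients are exactly uniform, 1/|Ni| for every neighbor, which is the Θ(1/|Ni|) behavior that Theorem 10 establishes for the hard regime.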
In this regime, ‖µ‖ is not much larger than σ, that is, the signal does not dominate the noise, so the numerator in (12) is not indicative of the class memberships of nodes i, j but rather acts like Gaussian noise. On the other hand, denote the denominator in (12) by δi and observe that it is the same for all γiℓ where ℓ ∈ Ni. Using concentration arguments for {w^T Xℓ}_{ℓ∈[n]} yields γij = Θ(1/δi), and δi = Θ(|Ni|) finishes the proof. We provide details in the appendix.

Compared to the easy regime, it is difficult to obtain a separation result for the nodes without additional assumptions. In the easy regime, the distance between the means was much larger than the standard deviation, which made the signal (the expectation of the convolved data) dominate the noise (i.e., the variance of the convolved data). In the hard regime the noise dominates the signal. Thus, we conjecture the following.

Conjecture 11. There is an absolute constant M > 0 such that, whenever

‖µ‖ ≤ M σ √(log n / (n(p+q)(1 − max(p, q)))) · (p+q)/|p−q|,

every graph attention model fails to perfectly classify the nodes with high probability.

The above conjecture means that in the hard regime the performance of the graph attention model depends on q, as opposed to the easy regime, where in Corollary 5 we show that it doesn't. This property is verified by our synthetic experiments in Section 5. The quantity σ√(log n / (n(p+q)(1 − max(p, q)))) in the threshold comes from our conjecture that the expected maximum noise of the graph attention convolved data over the nodes is at least cσ√(log n / (n(p+q)(1 − max(p, q)))) for some constant c > 0. The quantity (p+q)/|p−q| in the threshold comes from our conjecture that the distance between the means (i.e., the signal) of the graph attention convolved data is reduced to at most |p−q|/(p+q) of the original distance.
Proving Conjecture 11 would require a delicate treatment of the correlations between the attention coefficients γij and the node features Xi for i ∈ [n].

3.2.1 Are good attention coefficients helpful in the hard regime?

In this subsection we are interested in understanding the implications of edge separability for node separability in the hard regime, when Ψ is restricted to a specific class of functions. In particular, we show that Conjecture 11 is true under the additional assumption that Ψ does not depend on the node features. In addition, we show that even if we were allowed to use an extremely good attention function Ψ which separates the edges with an arbitrarily large margin, with high probability the graph attention convolution (2) will still misclassify at least one node as long as ‖µ‖/σ is sufficiently small. We consider the class of functions Ψ which can be expressed in the following form:

Ψ(i, j) = sign(p − q)·t, if (i, j) is an intra-class edge; −sign(p − q)·t, if (i, j) is an inter-class edge,    (13)

for some t ≥ 0. The particular class of functions in (13) is motivated by the property of the ideal edge classifier in (10) and the behavior of Ψ in (6) when it is applied to the means of the Gaussians. There are a few possible ways to obtain a function Ψ which satisfies (13). For example, in the presence of good edge features which reflect the class memberships of the edges, we can make Ψ take the edge features as input. Moreover, if |√p − √q| > √(2 log n / n), one such Ψ may be easily realized from the eigenvectors of the graph adjacency matrix. By the exact spectral recovery result in Lemma 12, we know that there exists a classifier τ̂ which separates the nodes. Therefore, we can set Ψ(i, j) = sign(p − q)·t if τ̂(i) = τ̂(j) and Ψ(i, j) = −sign(p − q)·t otherwise.

Lemma 12 (Exact recovery in (Abbe, 2018)). Suppose that p, q = Ω(log² n / n) and |√p − √q| > √(2 log n / n).
Then there exists a classifier τ̂ which takes as input the graph A and perfectly classifies the nodes with probability at least 1 − o(1).

Theorem 13. Suppose that p, q satisfy Assumption 1 and that p, q are bounded away from 1. For every ϵ > 0, there are absolute constants M, M′ = O(√ϵ) such that, with probability at least 1 − o(1) over the data (X, A) ∼ CSBM(n, p, q, µ, σ²), using the graph attention convolution in (2) and the attention architecture Ψ in (13), the model misclassifies at least Ω(n^{1−ϵ}) nodes for any w such that ‖w‖ = 1, if

1. t = O(1) and ‖µ‖ ≤ M σ √(log n / (n(p+q)(1 − max(p, q)))) · (p+q)/|p−q|;
2. t = ω(1) and ‖µ‖ ≤ M′ σ √(log n / (n(p+q)(1 − max(p, q)))).

Theorem 13 warrants some discussion. We start with the role of t in the attention function (13). One may think of t as the multiplicative margin of separation between intra-class and inter-class edges. When t = O(1), the margin of separation is at most a constant. This includes the special case where Ψ(i, j) = 0 for all (i, j) ∈ E, i.e., the margin of separation is 0. In this case, the graph attention convolution in (2) reduces to the standard graph convolution with uniform averaging among the neighbors. Therefore, part 1 of Theorem 13 also applies to the standard graph convolution. On the other hand, when t = ω(1), the margin of separation is not only bounded away from 0 but also grows with n. Next, we discuss the additional assumption that p, q are bounded away from 1. This assumption is used to obtain a concentration result required for the proof of Theorem 13. It is also intuitive in the following sense.
If both p and q are arbitrarily close to 1, then after the convolution the convolved node feature vectors collapse to approximately a single point, and thus this becomes a trivial case where no classifier is able to separate the nodes; on the other hand, if p is arbitrarily close to 1 and q is very small, then after the convolution the convolved node feature vectors collapse to approximately one of two points according to the class the node comes from, and in this case the nodes can be easily separated by a linear classifier.

We now focus on the threshold for ‖µ‖ under which the model is going to misclassify at least one node with high probability. In part 1 of Theorem 13, t = O(1), i.e., the attention mechanism Ψ is either unable to separate the edges or unable to separate the edges with a large enough margin. In this case, one can show that all attention coefficients are Θ(1/(n(p+q))). Consequently, the quantity |p−q| appears in the denominator of the threshold for ‖µ‖ in part 1 of Theorem 13. Because of that, if p and q are arbitrarily close, then the model is not able to separate the nodes irrespective of how large ‖µ‖ is. For example, treating 1 − max(p, q) as a constant, since p and q are bounded away from 1 by assumption, we have that if p and q are sufficiently close then

σ √(log n / (n(p+q)(1 − max(p, q)))) · (p+q)/|p−q| = ω(σ √(log n)).

This means that if p and q are close enough, every attention function Ψ of the form (13) with t = O(1) cannot help classify all nodes correctly even if ‖µ‖ = ω(σ√(log n)). On the contrary, recall that in the easy regime where ‖µ‖ = ω(σ√(log n)), the attention mechanism given in (3) and (4) helps separate the nodes with high probability. This illustrates the limitation of every attention mechanism of the form (13) which has an insignificant margin of separation. According to Theorem 10, the vast majority of attention coefficients are uniform, and thus in Conjecture 11 we expect that graph attention in general shares similar limitations in the hard regime.
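The construction discussed above, where a node classifier τ̂ obtained, e.g., from exact spectral recovery (Lemma 12) is turned into an attention function of the form (13), can be sketched as follows (the helper names are ours, chosen for illustration):

```python
def make_edge_attention(tau_hat, p, q, t=1.0):
    """Sketch of an attention function of the form (13), built from a node
    classifier tau_hat: node index -> predicted class in {0, 1}."""
    sgn = 1.0 if p > q else (-1.0 if p < q else 0.0)

    def psi(i, j):
        # same predicted class -> treated as intra-class: +sign(p - q)*t
        # different predicted class -> treated as inter-class: -sign(p - q)*t
        return sgn * t if tau_hat(i) == tau_hat(j) else -sgn * t

    return psi
```

With a perfect τ̂ and p > q, the function equals +t on intra-class edges and −t on inter-class edges; the margin parameter t plays exactly the role analyzed in Theorem 13.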
In part 2 of Theorem 13, t = ω(1), i.e., the attention mechanism Ψ separates the edges with a large margin. In this case, one can show that the attention coefficients on important edges (e.g., intra-class edges) are exponentially larger than those on unimportant edges (e.g., inter-class edges). Consequently, the factor (p+q)/|p−q| no longer appears in the threshold for ‖µ‖ in part 2 of Theorem 13. However, at the same time, the threshold also implies that, even when we have a perfect attention mechanism that is able to separate the edges with a large margin, as long as ‖µ‖/σ is small enough, the model is going to misclassify at least one node with high probability. Finally, notice that in Theorem 13 the parameter ϵ captures a natural tradeoff between the threshold for ‖µ‖ and the lower bound on the number of misclassified nodes. Namely, the smaller ϵ is, the smaller the threshold for ‖µ‖ becomes; hence the less signal there is in the node features, the more mistakes the model is going to make. This is precisely demonstrated by the scaling of M, M′ = O(√ϵ) and the misclassification bound Ω(n^{1−ϵ}) with respect to ϵ. We leave the proof of Theorem 13 to the appendix.

4. Robustness to structural noise and its implications beyond perfect node classification

In this section, we provide a positive result on the capacity of graph attention convolution for node classification beyond the perfect classification regime. In particular, we show that, independent of the parameters of the CSBM, i.e., independent of p, q, µ and σ, the two-layer MLP attention architecture Ψ from Section 3.1 is able to achieve the classification performance obtainable by the Bayes optimal classifier for the node features. This is proved by showing that there is a parameter setting for Ψ under which the attention coefficients on self-loops can be made exponentially large.
Consequently, the corresponding graph attention convolution behaves like a linear function of node features. We provide two corollaries of this result. The first corollary provides a ranking of graph attention convolution, simple graph convolution, and linear function in terms of their ability to classify nodes in CSBM. More specifically, by noticing that the simple graph convolution is also realized by a specific parameter setting of the attention architecture Ψ, our result implies that the performance of graph attention convolution for node classification in CSBM is lower bounded by the best possible performance between a linear classifier and simple graph convolution. In particular, graph attention is strictly more powerful than simple graph convolution when the graph is noisy (e.g. when p q), and it is strictly more powerful than a linear classifier when the graph is less noisy (e.g. when p and q are significantly different). The second corollary provides perfect classification, almost perfect classification, and partial classification results for graph attention convolution. It follows immediately from the reduction of graph attention convolution to a linear function under the specification of Ψ that we will discuss. In what follows we start with high-level ideas, then we present formal statements of the results, and we discuss the implications in detail. Recall the two-layer MLP attention architecture Ψ from (3) and (4) is equivalently written in (5) as Ψ(Xi, Xj) = 2R(1 β) w T Xi, if w T Xj w T Xi , 2R(1 β) sign( w T Xi) w T Xj, if w T Xi < w T Xj < w T Xi , 2R(1 β) w T Xi, if w T Xj w T Xi . We make the following observations. 
Assuming that the scaling parameter R > 0:

- If w̃^T Xi > 0, then the function Ψ assigns more weight to an edge (i, j) such that w̃^T Xj > 0 than to an edge (i, j′) such that w̃^T Xj′ < 0;
- If w̃^T Xi < 0, then the function Ψ assigns more weight to an edge (i, j) such that w̃^T Xj < 0 than to an edge (i, j′) such that w̃^T Xj′ > 0;
- If w̃^T Xi = 0, then the function Ψ assigns uniform weight to every edge (i, j).

This means that the behavior of Ψ depends on which side of the hyperplane {x : w̃^T x = 0} Xi comes from. In other words, for fixed Xj, whether the attention function Ψ will up-weight or down-weight an edge (i, j) depends entirely on the classification of Xi by the linear classifier w̃. Moreover, note that the attention function value satisfies

−2R(1−β) max{|w̃^T Xi|, |w̃^T Xj|} ≤ Ψ(Xi, Xj) ≤ 2R(1−β) min{|w̃^T Xi|, |w̃^T Xj|}.

Therefore, out of all unit norm vectors w̃, our choice w̃ = µ/‖µ‖ maximizes the range of values that Ψ can output. Recall from Lemma 6 that this w̃ also happens to characterize the Bayes optimal classifier for the node features. Finally, the attention function Ψ achieves its maximum/minimum at self-attention, i.e.,

Ψ(Xi, Xi) = 2R(1−β)|w̃^T Xi| = max_{j∈[n]} Ψ(Xi, Xj),
Ψ(Xi, −Xi) = −2R(1−β)|w̃^T Xi| = min_{j∈[n]} Ψ(Xi, Xj).

A consequence of these observations is that, by setting the scaling parameter R sufficiently large, one can make exp(Ψ(Xi, Xj)) exponentially larger than exp(Ψ(Xi, Xj′)) for any j, j′ such that sign(w̃^T Xj) = sign(w̃^T Xi) and sign(w̃^T Xj′) ≠ sign(w̃^T Xi). By the definition of the attention coefficients in (1), this means that the attention coefficients γij where sign(w̃^T Xj) = sign(w̃^T Xi) are going to be exponentially larger than the attention coefficients γij′ where sign(w̃^T Xj′) ≠ sign(w̃^T Xi).
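These observations can be checked numerically. Below is a sketch of the piecewise form of Ψ from (5), written in terms of the scalar projections w̃^T Xi and w̃^T Xj and assuming LeakyReLU slope β; the helper is ours, not the authors' implementation.

```python
def psi(xi_proj, xj_proj, R=1.0, beta=0.2):
    """Sketch of the two-layer MLP attention value in (5), in terms of the
    scalar projections xi_proj = w~^T X_i and xj_proj = w~^T X_j."""
    c = 2.0 * R * (1.0 - beta)
    if xj_proj >= abs(xi_proj):
        return c * xi_proj
    if xj_proj <= -abs(xi_proj):
        return -c * xi_proj
    # intermediate regime: -|w~^T X_i| < w~^T X_j < |w~^T X_i|
    return c * (1.0 if xi_proj >= 0 else -1.0) * xj_proj

# Self-attention attains the maximum value 2R(1-beta)|w~^T Xi|:
xi = -1.5
vals = [psi(xi, 0.1 * k) for k in range(-40, 41)]
assert max(vals) <= 2.0 * 1.0 * 0.8 * abs(xi) + 1e-9
assert abs(psi(xi, xi) - 2.0 * 0.8 * abs(xi)) < 1e-9
```

The two asserts verify, for one choice of w̃^T Xi, that no neighbor projection yields a larger value than self-attention, in line with the sandwich bound above.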
Therefore, one could expect that in this case, if the linear classifier w̃ correctly classifies Xi for some i ∈ [n], e.g., for i ∈ C1 this means that w̃^T Xi > 0, then the graph attention convolution output h′i = Σ_{j∈Ni} γij w̃^T Xj also satisfies h′i > 0, due to sufficiently larger attention coefficients γij for which w̃^T Xj > 0. We state the result below in Theorem 14 and leave the detailed arguments to the appendix.

Theorem 14. With probability at least 1 − o(1) over the data (X, A) ∼ CSBM(n, p, q, µ, σ²), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log² n / σ), the graph attention convolution output satisfies

Σ_{j∈Ni} γij w̃^T Xj > 0 if and only if w̃^T Xi > 0, for all i ∈ [n];
Σ_{j∈Ni} γij w̃^T Xj < 0 if and only if w̃^T Xi < 0, for all i ∈ [n].

Theorem 14 means that there is a parameter setting for the attention architecture Ψ such that the performance of graph attention convolution matches the performance of the Bayes optimal classifier for the node features. This shows the ability of graph attention to ignore the graph structure, which can be beneficial when the graph is noisy. For example, if p = q, then it is easy to see that simple graph convolution completely mixes up the node features, making it impossible to achieve any meaningful node classification result. On the other hand, as long as there is some signal in the original node features, i.e., ‖µ‖ > 0, graph attention will be able to pick it up and classify the nodes at least as well as the best classifier for the node features alone. It is also worth noting that by setting R = 0, the attention function Ψ has a constant value, and hence graph attention convolution reduces to the standard graph convolution, which has been shown to be useful in the regime where the original node features are not very strong but the graph has a nice structure (Baranwal et al., 2021).
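The mechanism behind Theorem 14 can be illustrated on a toy example: with R = 0 the coefficients are uniform and a single positive node is dragged negative by its neighbors, while a large R concentrates the softmax on edges whose projection sign agrees with that of the node (including the self-loop), so the output sign follows w̃^T Xi. The sketch below (our names, scalar projections only) illustrates this effect and is not the formal statement.

```python
import math

def conv_output(xi_proj, neigh_projs, R, beta=0.2):
    """Toy illustration of Theorem 14's mechanism: softmax-weighted average
    of neighbor projections, with scores from the piecewise form (5)."""
    c = 2.0 * R * (1.0 - beta)
    def psi(a, b):
        if b >= abs(a):
            return c * a
        if b <= -abs(a):
            return -c * a
        return c * (1.0 if a >= 0 else -1.0) * b
    projs = [xi_proj] + neigh_projs               # include the self-loop
    scores = [psi(xi_proj, pj) for pj in projs]
    m = max(scores)                               # stabilize the softmax
    gam = [math.exp(s - m) for s in scores]
    z = sum(gam)
    return sum(g / z * pj for g, pj in zip(gam, projs))

# One positive node among many negative neighbors: uniform averaging (R = 0)
# is dragged negative, while a large R keeps the output positive.
neg_neighbors = [-1.0] * 10
print(conv_output(1.0, neg_neighbors, R=0.0) < 0)   # → True
print(conv_output(1.0, neg_neighbors, R=50.0) > 0)  # → True
```

Setting R = 0 recovers the uniform-averaging behavior of standard graph convolution discussed above, while a large R lets the output sign track the linear classifier.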
For example, when there is a significant gap between p and q, i.e., |p−q|/(p+q) = Ω(1), setting R = 0 can significantly improve the node separability threshold over the best linear classifier (Baranwal et al., 2021). This shows the robustness of graph attention against noise in one of the two sources of information, namely node features and edge connectivities. Unlike the Bayes optimal classifier for node features, which is sensitive to feature noise, or the simple graph convolution, which is sensitive to structural noise, one can expect graph attention to work as long as at least one of the two sources of information has a good signal. Therefore, we obtain a ranking of node classification models among graph attention convolution, simple graph convolution, and a linear classifier. We state this below in Corollary 15.

Corollary 15. The node classification performance obtainable by graph attention convolution is lower bounded by the best possible node classification performance between a simple graph convolution and a linear classifier.

By a straightforward characterization of the performance of the linear classifier w̃ = µ/‖µ‖, we immediately obtain the classification results in Corollary 16. Recall that we denote by Φ the cumulative distribution function of the standard normal distribution. Write ‖µ‖ = κσ for some κ > 0.

Corollary 16. With probability at least 1 − o(1) over the data (X, A) ∼ CSBM(n, p, q, µ, σ²), using the two-layer MLP attention architecture Ψ given in (3) and (4) with R = Ω(n log² n / σ), one has that

(Perfect classification) If κ ≥ √(2 log n), then all nodes are correctly classified;
(Almost perfect classification) If κ = ω(1), then at least a 1 − o(1) fraction of all nodes are correctly classified;
(Partial classification) If κ = O(1), then at least a Φ(κ) − o(1) fraction of all nodes are correctly classified.
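Ignoring the o(1) terms, the guarantees of Corollary 16 can be tabulated as a function of κ. The helper below is an illustrative sketch under the assumption that the perfect-classification threshold is κ ≥ √(2 log n), as stated above; the function name is ours.

```python
import math

def phi(x):
    # Standard normal CDF Phi(x)
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def guaranteed_fraction(kappa, n):
    """Fraction of nodes Corollary 16 guarantees to be correctly classified
    when ||mu|| = kappa * sigma, ignoring o(1) terms (illustrative sketch)."""
    if kappa >= math.sqrt(2.0 * math.log(n)):
        return 1.0                # perfect classification
    return phi(kappa)             # partial (tends to 1 as kappa -> infinity)

print(guaranteed_fraction(10.0, n=1000))           # → 1.0, since 10 ≥ √(2 log 1000) ≈ 3.72
print(round(guaranteed_fraction(1.0, n=1000), 4))  # → 0.8413
```

The Φ(κ) term is exactly the per-node accuracy of the Bayes optimal linear classifier on the raw features, which is what graph attention matches in this parameter setting.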
Interestingly, we note that the perfect classification result in Corollary 16 is nearly equivalent (up to a small order in the threshold κ) to the perfect classification result in Corollary 5 from Section 3.1. They are obtained from different behaviors of the attention coefficients. This shows that there could be more than one type of good attention coefficients that are able to deliver good node classification performance. 5. Experiments In this section, we demonstrate empirically our results on synthetic and real data. The parameters of the models that we experiment with are set by using an ansatz based on our theorems. The particular details are given in Section 5.1. For real datasets, we use the default splits which come from Py Torch Geometric (Fey and Lenssen, 2019) and OGB (Hu Graph Attention Retrospective et al., 2020). In all our experiments we use MLP-GAT, where the attention mechanism Ψ is set to be the two-layer network in (3) and (4) with R = 1. For synthetic experiments using CSBM with known p and q, we use the variant that takes p, q into account, Ψ = (1p q 1p 0 and let Ψ be any attention mechanism. Then, 1. With probability at least 1 o(1), Ψ fails to correctly classify at least 2 Φc(κ)2 fraction of inter-class edges; 2. For any K > 1 if q > K log2 n nΦc(κ)2 , then with probability at least 1 O(n K 4 Φc(κ)2 log n), Ψ misclassify at least one inter-class edge. We will write i j if node i and node j are in the same class and i j otherwise. From Lemma 8, we observe that for successful classification by the optimal classifier, we need p cosh x T µ σ2 q cosh x T ν σ2 for i j, p cosh x T µ σ2 > q cosh x T ν σ2 for i j. Graph Attention Retrospective We will split the analysis into two cases. First, note that when p q we have for i j that p cosh x T µ σ2 q cosh x T ν σ2 = cosh x T µ σ2 cosh x T ν σ2 = |x T µ | |x T ν |. 
In the first implication, we used that p q, while the second implication follows from the fact that cosh(a) cosh(b) = |a| |b| for all a, b R. Similarly, for p < q we have for i j that p cosh x T µ σ2 > q cosh x T ν σ2 = cosh x T µ σ2 > cosh x T ν σ2 = |x T µ | > |x T ν |. Therefore, for each of the above cases, we can upper bound the probability for either i j or i j that X ij is correctly classified, by the probability of the event |X T ij µ | |X T ij ν | or equivalently |X T ij µ | > |X T ij ν |. We focus on the former as the latter is equivalent and symmetric. Writing Xi = µ + σgi and Xj = µ + σgj, we have that for i C1 and j C0, Pr[h (X ij) = 0] Pr h |X T ij µ | |X T ij ν | i = Pr |XT i µ + XT j µ| |XT i µ XT j µ| = Pr σ|g T i µ + g T j µ| | 2 µ 2 + σg T i µ σg T j µ| Pr |g T i ˆµ + g T j ˆµ| |g T i ˆµ g T j ˆµ| 2 µ = Pr |g T i ˆµ + g T j ˆµ| |g T i ˆµ g T j ˆµ| 2κ , where we denote ˆµ = µ/ µ . In the second to last step above, we used triangle inequality to pull 2 µ 2 outside the absolute value, while in the last equation we use µ = κσ. We now denote zi = g T i ˆµ for all i [n]. Then the above probability is Pr[|zi + zj| |zi zj| 2κ], where zi, zj N(0, 1) are independent random variables. Note that we have Pr[h (X ij) = 0] Pr[|zi + zj| |zi zj| 2κ] = Pr[|zi + zj| |zi zj| 2κ, |zi| κ] + Pr[|zi + zj| |zi zj| 2κ, |zi| > κ] = Pr[|zi| κ] + Φ(κ) Pr[|zi| > κ]. (15) To see how we obtain the last equation, observe that if |zi| κ then we have |zi + zj| |zi zj| = |zi + zj| |zj zi| |zi| + |zj| |zj zi| by triangle inequality |zi| + |zj| |zj| |zi| by reverse triangle inequality |zi| + |zj| (|zj| |zi|) = 2|zi| hence, Pr[|zi + zj| |zi zj| 2κ, |zi| κ] = Pr[|zi| κ]. On the other hand, for |zi| > κ, we look at each case, conditioned on the events zi > κ and zi < κ for each of Fountoulakis, Levi, Yang, Baranwal and Jagannath the four cases based on the signs of zi + zj and zi zj. We denote by E the event that |zi + zj| |zi zj| 2κ, and analyze the cases in detail. 
First consider the case zi < κ: Pr[E, zi + zj 0, zi zj 0 | zi < κ] = Pr[zj zi, zj zi | zi < κ] = 0, Pr[E, zi + zj 0, zi zj < 0 | zi < κ] = Pr[zj > |zi|, zi κ | zi < κ] = Φ(zi), Pr[E, zi + zj < 0, zi zj 0 | zi < κ] = Pr[zj < |zi|, zi κ | zi < κ] = 0, Pr[E, zi + zj < 0, zi zj < 0 | zi < κ] = Pr[zi < zj < zi, zj > κ | zi < κ] = Φ(κ) Φ(zi). The sum of the four probabilities in the above is Pr[E | zi < κ] = Φ(κ). Similarly, we analyze the other case, zi > κ: Pr[E, zi + zj 0, zi zj 0 | zi > κ] = Pr[ zi zj zi, zj κ | zi > κ] = Φ(κ) Φc(zi), Pr[E, zi + zj 0, zi zj < 0 | zi > κ] = Pr[zj > |zi|, zi κ | zi > κ] = 0, Pr[E, zi + zj < 0, zi zj 0 | zi > κ] = Pr[zj < |zi|, zi κ | zi > κ] = Φc(zi), Pr[E, zi + zj < 0, zi zj < 0 | zi > κ] = Pr[zj < zi, zj > zi | zi > κ] = 0. The sum of the four probabilities above is Pr[E | zi > κ] = Φ(κ). Therefore, we obtain that Pr[|zi + zj| |zi zj| 2κ | |zi| > κ] = Φ(κ), which justifies (15). Next, note that Pr[|zi| κ] = Φ(κ) Φc(κ) and Pr[|zi| > κ] = 2Φc(κ), so we have from (15) that Pr[h (X ij) = 0] Φ(κ) Φc(κ) + 2Φc(κ)Φ(κ) = 1 2Φc(κ) + 2Φc(κ)Φ(κ) = 1 2Φc(κ)2. Thus, X ij is misclassified with probability at least 2Φc(κ)2. We will now construct sets of pairs with mutually independent elements, such that the union of those sets covers all inter-class edges. This will enable us to use a concentration argument that computes the fraction of the inter-class edges that are misclassified. Since the graph operations are permutation invariant, let us assume for simplicity that C0 = {1, . . . , n 2 } and C1 = {n 2 + 1, . . . , n} for an even number of nodes n. Also, define the function ( i + l, i + l n 2 , i + l n 2 , i + l > n We now construct the following sequence of sets for all l {0, . . . , n Sl = {(Xm(i,l), Xi+ n 2 ) for all i C0 such that (m(i, l), i + n/2) E}. Fix l {0, . . . , n 2 1} and observe that the pairs in the set Sl are mutually independent. 
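The case analysis above concludes that Pr[|zi + zj| − |zi − zj| ≤ 2κ] = 1 − 2Φc(κ)² for independent standard normals zi, zj. This identity is easy to sanity-check by Monte Carlo simulation; the sketch below uses our own helper names and a fixed seed for reproducibility.

```python
import math
import random

def phi_c(x):
    # Standard normal tail probability Phi^c(x) = 1 - Phi(x)
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def simulate(kappa, trials=200_000, seed=0):
    # Monte Carlo estimate of Pr[|z_i + z_j| - |z_i - z_j| <= 2*kappa]
    # for independent z_i, z_j ~ N(0, 1).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        zi, zj = rng.gauss(0, 1), rng.gauss(0, 1)
        if abs(zi + zj) - abs(zi - zj) <= 2 * kappa:
            hits += 1
    return hits / trials

kappa = 0.5
closed_form = 1.0 - 2.0 * phi_c(kappa) ** 2   # value derived in the proof
estimate = simulate(kappa)
print(abs(estimate - closed_form) < 0.01)      # → True
```

The agreement reflects the fact that |a + b| − |a − b| = 2·sign(ab)·min(|a|, |b|), so the complementary event is that zi and zj share a sign and both exceed κ in magnitude, which has probability 2Φc(κ)².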
Define a Bernoulli random variable, βi, to be the indicator that (Xm(i,l), Xi+ n 2 ) is misclassified. We have that E[βi] 2Φc(κ)2. Note that the fraction of pairs in the set Sl that are Graph Attention Retrospective misclassified is 1 |Sl| P i:(Xm(i,l),Xi+n/2) Sl βi, which is a sum of independent Bernoulli random variables. Hence, by the additive Chernoffbound, we obtain i C0 Nm(i,l) βi 2|Sl|Φc(κ)2 |Sl|t 1 exp( 2|Sl|t2). Since p, q = Ω(log2 n n ), we have by the Chernoffbound and a union bound that with proba- bility at least 1 1/poly(n), |Sl| = nq(1 o(1)) for all l. We now choose t = q |Sl| = o(1) to obtain that on the event where |Sl| = nq(1 o(1)), we have the following for any large C > 1: i C0 Nm(i,l) βi 2Φc(κ)2 o(1) Following a union bound over all l {0, . . . , n 2 1}, we conclude that for any c > 0, i C0 Nm(i,l) βi 2Φc(κ)2 o(1), l n 0, . . . , n Thus, out of all the pairs X ij with j i, with probability at least 1 o(1), we have that at least a fraction 2Φc(κ)2 of the pairs are misclassified by the attention mechanism. This concludes part 1 of the theorem. For part 2, note that by the additive Chernoffbound we have for any t (0, 1), i C0 Nm(i,l) βi 2|Sl|Φc(κ)2 |Sl|t 1 exp( 2|Sl|t2). Since |Sl| = nq 2 (1 o(1)) with probability at least 1/poly(n), we choose t = 2 q KΦc(κ)2 log2 n nq to obtain i C0 Nm(i,l) βi nqΦc(κ)2(1 o(1)) q KnqΦc(κ)2 log2 n 1 O(n 8KΦc(κ)2 log n). Now note that if q > K log2 n nΦc(κ)2 then we have nqΦc(κ)2 > K log2 n, which implies that KnqΦc(κ)2 log2 n > 0. Hence, in this regime of q, i C0 Nm(i,l) βi > 0 1 O(n 8KΦc(κ)2 log n), and the proof is complete. Fountoulakis, Levi, Yang, Baranwal and Jagannath A.6 Proof of Theorem 10 We restate Theorem 10 for convenience Theorem. Assume that µ Kσ and σ K for some absolute constants K and K . Moreover, assume that the parameters (w, a, b) Rd R2 R are bounded. 
Then, with probability at least 1 o(1) over the data (X, A) CSBM(n, p, q, µ, σ2), there exists a subset A [n] with cardinality at least n(1 o(1)) such that for all i A the following hold: 1. There is a subset Ji,0 Ni C0 with cardinality at least 9 10|Ni C0|, such that γij = Θ(1/|Ni|) for all j Ji,0. 2. There is a subset Ji,1 Ni C1 with cardinality at least 9 10|Ni C1|, such that γij = Θ(1/|Ni|) for all j Ji,1. For i [n] let us write Xi = (2ϵi 1)µ + σgi where gi N(0, I), ϵi = 0 if i C0 and ϵi = 1 if i C1. Moreover, since the parameters (w, a, b) Rd R2 R are bounded, we can write w = R ˆw and a = R ˆa such that ˆw = 1 and ˆa = 1 and R, R are some constants. We define the following sets which will become useful later in our computation of γij s. Define A def = i [n] |ˆa1 ˆw T gi| 10 p log(n(p + q)), and |ˆa2 ˆw T gj| 10 p log(n(p + q)), j Ni For i [n] define Ji,0 def = n j Ni C0 | |ˆa2 ˆw T gj| Ji,1 def = n j Ni C1 | |ˆa2 ˆw T gj| Bt i,0 def = j Ni C0 | 2t 1 ˆa2 ˆw T gj 2t , t = 1, 2, . . . , T, Bt i,1 def = j Ni C1 | 2t 1 ˆa2 ˆw T gj 2t , t = 1, 2, . . . , T, where T def = l log2 10 p log(n(p + q)) m . We start with a few claims about the sizes of these sets. Claim 20. With probability at least 1 o(1), we have that |A| n(1 o(1)). Proof Because |ˆa2| 1 we know that A is a superset of A where A def = i [n] | ˆw T gi| 10 p log(n(p + q)), and | ˆw T gj| 10 p log(n(p + q)), j Ni We give a lower bound for |A | and hence prove the result. First of all, note that if p + q Ω(1/ log2 n), then log(n(p + q)) = log n(1 o(1)) and we easily get that with probability at least 1 o(1), | ˆw T gi| 10 p log(n(p + q)) for all i [n], and thus |A| = |A | = n. Therefore Graph Attention Retrospective let us assume without loss of generality that p + q O(1/ log2 n). Consider the following sum of indicator random variables i [n] 1n | ˆ w T gi| 10 log(n(p+q)) o. By the multiplicative Chernoffbound, for any δ > 0 we have Pr [S nb(1 + δ)] exp δ2 where b def = Pr(| ˆw T gi| 10 p log(n(p + q))). 
Moreover, by the standard upper bound on the Gaussian tail probability (Proposition 2.1.2, Vershynin (2018)) we know that b < e 50 log(n(p+q)). Let us set δ def = 1 bn(p + q) log n. Then by the upper bound on b and the assumption that p, q = Ω(log2 n/n) we know that δ (n(p + q))49 log n Ω(log97 n) = ω(1). It follows that δ2 2 + δnb Ω(δnb) = Ω 1 (p + q) log n Therefore, with probability at least 1 o(1) we have that S nb(1 + δ) n (n(p + q))50 + n n(p + q) log n = O n n(p + q) log n Apply the concentration result of node degrees, this means that with probability at least 1 o(1), n i [n] | ˆw T gi| 10 p log(n(p + q)) or j Ni such that | ˆw T gj| 10 p log(n(p + q)) o 2 (p + q)(1 o(1)) = O n n(p + q) log n 2 (p + q)(1 o(1)) = O n log n Therefore we have |A | n O(n/ log n) = n(1 o(1)). Claim 21. With probability at least 1 o(1), we have that for all i [n], 10|Ni C0| and |Ji,1| 9 Fountoulakis, Levi, Yang, Baranwal and Jagannath Proof We prove the result for Ji,0, the result for Ji,1 follows analogously. First, fix i [n]. For each j |Ni C0| we have that Pr[|ˆa2w T gj| 10] Pr[|w T gj| Denote Jc i,0 def = (Ni C0) \ Ji,0. We have that E[|Jc i,0|] = E j Ni C0 1{|ˆa2w T gj| e 50|Ni C0|, Apply Chernoff s inequality (Theorem 2.3.4 in Vershynin (2018)) we have Pr |Jc i,0| 1 10|Ni C0| e E[|Jc i,0|] e E[|Jc i,0|] |Ni C0|/10 ee 50|Ni C0| 25|Ni C0| . Apply the union bound we get Pr |Ji,0| 9 10|C0 Ni|, i [n] 1 X i [n] exp 4 i [n] exp 4 25 n min(p, q)(1 o(1)) = (1 o(1)) 1 n exp 2n min(p, q)(1 o(1)) The second inequality follows because |Ni C0| n 2 min(p, q)(1 o(1)) under the event E3 (cf. Definition 17) for all i [n]. The last equality is due to our assumption that p, q = Ω(log2 n Claim 22. With probability at least 1 o(1), we have that for all i [n] and for all t [T], |Bt i,0| E[|Bt i,0|] + T|Ni C0| 4 5 and |Bt i,1| E[|Bt i,1|] + T|Ni C1| 4 5 . Proof We prove the result for Bt i,0, and the result for Bt i,1 follows analogously. First fix i [n] and t [T]. 
By the additive Chernoffinequality, we have Pr |Bt i,0| E[|Bt i,0|] + |Ni C0| 5 e 2T|Ni C0|3/5. Graph Attention Retrospective Taking a union bound over all i [n] and t [T] we get n |Bt i,0| E[|Bt i,0|] + T|Ni C0| 4 5 o n T exp 2T n 2 min(p, q)(1 o(1)) 3/5 + o(1) = o(1), where the last equality follows from Assumption 1 that p, q = Ω(log2 n n ), and hence n T exp 2T n 2 min(p, q)(1 o(1)) 3/5 = n T exp ω 2T log n = O n c for some absolute constant c > 0. Moreover, we have used degree concentration, which introduced the additional additive o(1) term in the probability upper bound. Therefore we have Pr h |Bt i,0| E[|Bt i,0|] + T|Ni C0| 4 5 , i [n] t [T] i 1 o(1). We start by defining an event E# which is the intersection of the following events over the randomness of A and {ϵi}i and Xi = (2ϵi 1)µ + σgi, E 1 is the event that for each i [n], |C0 Ni| = n 2 ((1 ϵi)p + ϵiq)(1 o(1)) and |C1 Ni| = n 2 ((1 ϵi)q + ϵip)(1 o(1)). E 2 is the event that |A| n o( n). E 3 is the event that |Ji,0| 9 10|Ni C0| and |Ji,1| 9 10|Ni C1| for all i [n]. E 4 is the event that |Bt i,0| E[|Bt i,0|] + T|Ni C0| 4 5 and |Bt i,1| E[|Bt i,1|] + C1| 4 5 for all i [n] and for all t [T]. By Claims 20, 21, 22, we get that with probability at least 1 o(1), the event E# def = T4 i=1 E i holds. We will show that under event E#, for all i A, for all j Ji,c where c {0, 1}, we have γij = Θ(1/|Ni|). This will prove Theorem 10. Fix i A and some j Ji,0. Let us consider γij = exp Leaky Relu(a1w T Xi + a2w T Xj + b) P k Ni exp (Leaky Relu(a1w T Xi + a2w T Xk + b)) = exp σRR Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b ) P k Ni exp (σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b )) = 1 P k Ni exp( ik ij) Fountoulakis, Levi, Yang, Baranwal and Jagannath where for l Ni, we denote κil def = (2ϵi 1) ˆw T µ/σ + (2ϵl 1) ˆw T µ/σ, il def = σRR Leaky Relu(κil + ˆa1w T gi + ˆa2w T gl + b ), and b = σRR b . We will show that k Ni exp( ik ij) = Θ(|Ni|) and hence conclude that γij = Θ(1/|Ni|). 
First of all, note that since µ Kσ for some absolute constant K, we know that Let us assume that ˆa1 ˆw T gi 0 and consider the following two cases regarding the magnitude of ˆa1 ˆw T gi. Case 1. If κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b < 0, then ik ij = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b ) = σRR Leaky Relu(ˆa1 ˆw T gi + ˆa2 ˆw T gk O(1)) β(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b ) = σRR Leaky Relu(ˆa2 ˆw T gk O(1)) O(1) = σRR Θ(ˆa2 ˆw T gk) O(1) , where β is the slope of Leaky Relu(x) for x < 0. Here, the second equality follows from |κik + b | 2K + |b | = O(1) and κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b < 0. The third equality follows from We have j Ji,0 and hence |ˆa2 ˆw T gj| = O(1); We have κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj +b < 0, so ˆa1 ˆw T gi < |κij|+|ˆa2 ˆw T gj|+|b | = O(1), moreover, because ˆa1 ˆw T gi 0, we get that |ˆa1 ˆw T gi| = O(1); We have |κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b | |ˆa1 ˆw T gi| + |ˆa2 ˆw T gj| + |κij + b | = O(1) + O(1) + O(1) = O(1). Graph Attention Retrospective Case 2. If κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b 0, then ik ij = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) Leaky Relu(κij + ˆa1 ˆw T gi + ˆa2 ˆw T gj + b ) = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) κij ˆa1 ˆw T gi ˆa2 ˆw T gj b = σRR Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1) = σRR Θ(ˆa2 ˆw T gk) O(1) , if k Ji,0 Ji,1 σRR O(ˆa2 ˆw T gk) O(1) , otherwise. To see the last (in)equality in the above, consider the following cases: 1. If k Ji,0 Ji,1, then there are two cases depending on the sign of κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b . If κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b 0, then we have that Leaky Relu(κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ) ˆa1 ˆw T gi O(1) = κik + ˆa1 ˆw T gi + ˆa2 ˆw T gk + b ˆa1 ˆw T gi O(1) = ˆa2 ˆw T gk + κik + b O(1) = ˆa2 ˆw T gk O(1). 
If $\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b' < 0$, then because $\hat a_1 \hat w^T g_i \ge 0$ and $|\kappa_{ik} + \hat a_2 \hat w^T g_k + b'| \le |\kappa_{ik}| + |\hat a_2 \hat w^T g_k| + |b'| = O(1)$, we know that $\hat a_1 \hat w^T g_i < |\kappa_{ik}| + |\hat a_2 \hat w^T g_k| + |b'| = O(1)$ and $|\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b'| = O(1)$. Therefore it follows that
\[
\mathrm{LeakyReLU}(\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b') - \hat a_1 \hat w^T g_i \pm O(1)
= \mathrm{LeakyReLU}(\pm O(1)) \pm O(1)
= \hat a_2 \hat w^T g_k \pm O(1),
\]
where the last equality is due to the fact that $k \in J_{i,0} \cup J_{i,1}$, so $|\hat a_2 \hat w^T g_k| = O(1)$.
2. If $k \notin J_{i,0} \cup J_{i,1}$, then there are two cases depending on the sign of $\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b'$. If $\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b' \ge 0$, then we have that
\[
\mathrm{LeakyReLU}(\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b') - \hat a_1 \hat w^T g_i \pm O(1)
= \hat a_2 \hat w^T g_k + \kappa_{ik} + b' \pm O(1)
= \hat a_2 \hat w^T g_k \pm O(1).
\]
If $\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b' < 0$, then we have that
\[
\mathrm{LeakyReLU}(\kappa_{ik} + \hat a_1 \hat w^T g_i + \hat a_2 \hat w^T g_k + b') - \hat a_1 \hat w^T g_i \pm O(1)
= \beta\kappa_{ik} + \beta\hat a_1 \hat w^T g_i + \beta\hat a_2 \hat w^T g_k + \beta b' - \hat a_1 \hat w^T g_i \pm O(1)
= \beta\hat a_2 \hat w^T g_k - (1-\beta)\hat a_1 \hat w^T g_i \pm O(1)
\le \beta\hat a_2 \hat w^T g_k \pm O(1),
\]
where $\beta$ is the slope of the negative part of $\mathrm{LeakyReLU}(\cdot)$. Combining the two cases regarding the magnitude of $\hat a_1 \hat w^T g_i$ and our assumption that $\sigma, R, R' = O(1)$, so far we have shown that, for any $i$ such that $\hat a_1 \hat w^T g_i \ge 0$ and for all $j \in J_{i,0}$, we have
\[
\Lambda_{ik} - \Lambda_{ij} = \begin{cases} \Theta(\hat a_2 \hat w^T g_k) \pm O(1), & \text{if } k \in J_{i,0} \cup J_{i,1}, \\ O(\hat a_2 \hat w^T g_k) \pm O(1), & \text{otherwise.} \end{cases} \tag{16}
\]
By following a similar argument, one can show that Equation 16 holds for any $i$ such that $\hat a_1 \hat w^T g_i < 0$. Let us now compute
\[
\sum_{k \in N_i} \exp(\Lambda_{ik} - \Lambda_{ij}) = \sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij}) + \sum_{k \in N_i \cap C_1} \exp(\Lambda_{ik} - \Lambda_{ij})
\]
for some $j \in J_{i,0}$. Let us focus on $\sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij})$ first. We will show that $\Omega(|N_i \cap C_0|) \le \sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij}) \le O(|N_i|)$. First of all, we have that
\[
\sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij})
\ge \sum_{k \in J_{i,0}} \exp(\Lambda_{ik} - \Lambda_{ij})
= \sum_{k \in J_{i,0}} \exp\big(\Theta(\hat a_2 \hat w^T g_k) \pm O(1)\big)
\ge \sum_{k \in J_{i,0}} e^{c_1} = |J_{i,0}|\,e^{c_1} = \Omega(|N_i \cap C_0|), \tag{17}
\]
where $c_1$ is an absolute constant (possibly negative).
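The lower bound (17), and the matching upper bounds that follow, ultimately rest on the fact that exponentials of bounded-coefficient Gaussian scores have constant expectation: for $g \sim N(0,1)$, $\mathbb{E}[e^{cg}] = e^{c^2/2}$, so a sum of such terms over a neighborhood of size $m$ is $\Theta(m)$. A quick numerical check of this moment identity (illustrative only; the constant `c` and the sample size are arbitrary assumptions):

```python
import numpy as np

# E[exp(c * g)] for standard normal g equals exp(c^2 / 2) (the Gaussian MGF),
# so sum_k exp(c * g_k) over m i.i.d. samples is roughly exp(c^2 / 2) * m.
rng = np.random.default_rng(0)
c, m = 0.7, 100_000
g = rng.standard_normal(m)
empirical = np.exp(c * g).mean()   # Monte Carlo average of exp(c * g_k)
theory = np.exp(c**2 / 2)          # closed-form moment
```

The two quantities agree closely, which is why each of the per-class sums below scales linearly in the neighborhood size.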
On the other hand, consider the following partition of $N_i \cap C_0$:
\[
P_1 := \{k \in N_i \cap C_0 \,:\, \hat a_2 \hat w^T g_k \le 1\}, \qquad P_2 := \{k \in N_i \cap C_0 \,:\, \hat a_2 \hat w^T g_k > 1\}.
\]
It is easy to see that
\[
\sum_{k \in P_1} \exp(\Lambda_{ik} - \Lambda_{ij}) \le \sum_{k \in P_1} \exp\big(O(\hat a_2 \hat w^T g_k) \pm O(1)\big) \le \sum_{k \in P_1} e^{c_2} = |P_1|\,e^{c_2} = O(|N_i \cap C_0|), \tag{18}
\]
where $c_2$ is an absolute constant. Moreover, because $i \in \mathcal{A}$ we have that $P_2 \subseteq \bigcup_{t \in [T]} B^t_{i,0}$. It follows that
\[
\sum_{k \in P_2} \exp(\Lambda_{ik} - \Lambda_{ij}) = \sum_{t \in [T]} \sum_{k \in B^t_{i,0}} \exp(\Lambda_{ik} - \Lambda_{ij}) \le \sum_{t \in [T]} \sum_{k \in B^t_{i,0}} \exp\big(O(\hat a_2 \hat w^T g_k) \pm O(1)\big) \le \sum_{t \in [T]} |B^t_{i,0}|\,e^{c_3 2^t}, \tag{19}
\]
where $c_3$ is an absolute constant. We can upper bound the above quantity as follows. Under the event $\mathcal{E}^\#$, we have that $|B^t_{i,0}| \le m_t + \sqrt{T}\,|N_i \cap C_0|^{4/5}$ for all $t \in [T]$, where
\[
m_t := \mathbb{E}[|B^t_{i,0}|] = \sum_{k \in N_i \cap C_0} \Pr\big(2^{t-1} \le \hat a_2 \hat w^T g_k \le 2^t\big)
\le \sum_{k \in N_i \cap C_0} \Pr\big[\hat a_2 \hat w^T g_k \ge 2^{t-1}\big]
\le \sum_{k \in N_i \cap C_0} \Pr\big[\hat w^T g_k \ge 2^{t-1}\big]
\le |N_i \cap C_0|\,e^{-2^{2t-3}}.
\]
It follows that
\[
\sum_{t \in [T]} |B^t_{i,0}|\,e^{c_3 2^t}
\le \sum_{t \in [T]} \Big(|N_i \cap C_0|\,e^{-2^{2t-3}} + \sqrt{T}\,|N_i \cap C_0|^{4/5}\Big)\,e^{c_3 2^t}
\le |N_i \cap C_0| \sum_{t=1}^{\infty} e^{-2^{2t-3}}\,e^{c_3 2^t} + \sum_{t \in [T]} \sqrt{T}\,|N_i \cap C_0|^{4/5}\,e^{c_3 2^T}
\le c_4 |N_i \cap C_0| + o(|N_i|), \tag{20}
\]
where $c_4$ is an absolute constant. The third inequality in the above follows from:
The series $\sum_{t=1}^{\infty} e^{-2^{2t-3}}\,e^{c_3 2^t}$ converges absolutely for any constant $c_3$;
The sum $\sum_{t \in [T]} \sqrt{T}\,|N_i \cap C_0|^{4/5}\,e^{c_3 2^T} = T^{3/2} |N_i \cap C_0|^{4/5}\,e^{c_3 2^T} = o(|N_i|)$, because
\[
\log\big(T^{3/2}\,e^{c_3 2^T}\big)
= \tfrac{3}{2}\log\Big\lceil \log_2\big(10\sqrt{\log(n(p+q))}\big) \Big\rceil + c_3\, 2^{\lceil \log_2(10\sqrt{\log(n(p+q))}) \rceil}
\le 2\log\Big\lceil \log_2\big(10\sqrt{\log(n(p+q))}\big) \Big\rceil + 20 c_3 \sqrt{\log(n(p+q))}
\le \tfrac{1}{c}\log(n(p+q))
\]
for any constant $c > 0$ and large enough $n$. In particular, by picking $c > 5$ we see that $T^{3/2}\,e^{c_3 2^T} \le O\big((n(p+q))^{1/c}\big) \le o(|N_i|^{1/5})$, and hence we get $T^{3/2}\,e^{c_3 2^T}\,|N_i \cap C_0|^{4/5} \le |N_i|^{4/5}\,o(|N_i|^{1/5}) = o(|N_i|)$.
Combining Equations 19 and 20 we get
\[
\sum_{k \in P_2} \exp(\Lambda_{ik} - \Lambda_{ij}) \le O(|N_i|), \tag{21}
\]
and combining Equations 18 and 21 we get
\[
\sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij}) = \sum_{k \in P_1} \exp(\Lambda_{ik} - \Lambda_{ij}) + \sum_{k \in P_2} \exp(\Lambda_{ik} - \Lambda_{ij}) \le O(|N_i|). \tag{22}
\]
Now, by Equations 17 and 22 we get
\[
\Omega(|N_i \cap C_0|) \le \sum_{k \in N_i \cap C_0} \exp(\Lambda_{ik} - \Lambda_{ij}) \le O(|N_i|). \tag{23}
\]
It turns out that repeating the same argument for $\sum_{k \in N_i \cap C_1} \exp(\Lambda_{ik} - \Lambda_{ij})$ yields
\[
\Omega(|N_i \cap C_1|) \le \sum_{k \in N_i \cap C_1} \exp(\Lambda_{ik} - \Lambda_{ij}) \le O(|N_i|).
\]
(24)
Finally, Equations 23 and 24 give us
\[
\sum_{k \in N_i} \exp(\Lambda_{ik} - \Lambda_{ij}) = \Theta(|N_i|),
\]
which readily implies
\[
\gamma_{ij} = \frac{1}{\sum_{k \in N_i} \exp(\Lambda_{ik} - \Lambda_{ij})} = \Theta(1/|N_i|)
\]
as required. We have shown that for all $i \in \mathcal{A}$ and for all $j \in J_{i,0}$, $\gamma_{ij} = \Theta(1/|N_i|)$. Repeating the same argument, we get that the same result holds for all $i \in \mathcal{A}$ and for all $j \in J_{i,1}$, too. Hence, by Claims 20 and 21 about the cardinalities of $\mathcal{A}$, $J_{i,0}$ and $J_{i,1}$, we have thus proved Theorem 10.

A.7 Proof of Proposition 13

We restate Proposition 13 for convenience.

Proposition 23. Suppose that $p, q$ satisfy Assumption 1 and that $p, q$ are bounded away from 1. For every $\epsilon > 0$, there are absolute constants $M, M' = O(\sqrt{\epsilon})$ such that, with probability at least $1 - o(1)$ over the data $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$, using the graph attention convolution in (2) and the attention architecture $\Psi$ in (13), the model misclassifies at least $\Omega(n^{1-\epsilon})$ nodes for any $w$ such that $\|w\| = 1$, if
1. $t = O(1)$ and $\|\mu\| \le M\sigma\sqrt{\frac{\log n}{n(p+q)}(1 - \max(p, q))}$, or
2. $t = \omega(1)$ and $\|\mu\| \le M'\sigma\sqrt{\frac{\log n}{n(p+q)}(1 - \max(p, q))}$.

We start with part 1 of the proposition. Let us assume that $p \ge q$; the result when $p < q$ follows analogously. We will condition on the events $E_1, E_2, E_3$ defined in Definition 17. These events are concerned with the concentration of the class sizes $|C_0|$ and $|C_1|$ and the concentration of the number of intra-class and inter-class edges, i.e. $|C_0 \cap N_i|$ and $|C_1 \cap N_i|$ for all $i$. By Lemma 18, the probability that these events hold simultaneously is at least $1 - o(1)$. Fix any $w \in \mathbb{R}^d$ such that $\|w\| = 1$. Without loss of generality, assume that $w^T \mu > 0$. Because $t = O(1)$, by the definition of $\Psi$ in (13) and the attention coefficients in (1) we have that
\[
\gamma_{ij} = \begin{cases} \frac{c_1}{n(p+q)}(1 \pm o(1)), & \text{if } (i,j) \text{ is an intra-class edge}, \\ \frac{c_2}{n(p+q)}(1 \pm o(1)), & \text{if } (i,j) \text{ is an inter-class edge}, \end{cases} \tag{25}
\]
for some positive absolute constants $c_1$ and $c_2$. Let us write $X_i = (2\epsilon_i - 1)\mu + \sigma g_i$ where $g_i \sim N(0, I)$, $\epsilon_i = 0$ if $i \in C_0$ and $\epsilon_i = 1$ if $i \in C_1$.
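The data model $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$ used throughout can be sampled directly from its definition. A minimal sketch (assuming Bernoulli$(1/2)$ class labels and no self-loops; the function and variable names are our own, not the paper's):

```python
import numpy as np

def sample_csbm(n, p, q, mu, sigma, rng=None):
    """Sample (X, A, eps) from CSBM(n, p, q, mu, sigma^2).

    eps_i in {0, 1} are the class memberships; X_i = (2*eps_i - 1)*mu + sigma*g_i
    with g_i ~ N(0, I); an edge (i, j) appears with probability p within a class
    and q across classes (no self-loops).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mu = np.asarray(mu)
    eps = rng.integers(0, 2, size=n)                       # Bernoulli(1/2) labels
    X = (2 * eps - 1)[:, None] * mu + sigma * rng.standard_normal((n, len(mu)))
    same = eps[:, None] == eps[None, :]                    # intra-class indicator
    A = np.triu((rng.random((n, n)) < np.where(same, p, q)).astype(int), k=1)
    return X, A + A.T, eps
```

Sampling the upper triangle and symmetrizing keeps each pair $(i, j)$ an independent Bernoulli draw, matching the stochastic block model.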
Let us consider the classification of nodes in $C_0$ based on the model output $\sum_{j \in N_i} \gamma_{ij} w^T X_j$ for node $i$. Note that, depending on whether $i \in C_0$ or $i \in C_1$, the expectation of the model output is symmetric around 0, and therefore we consider the decision boundary at 0. Let $0 < \epsilon < 1$ and fix any partition of $C_0$ into disjoint subsets $C_0 = \bigcup_{h=1}^{\ell} C_0^{(h)}$, such that $\ell = |C_0|^{1-\epsilon}$ and $|C_0^{(h)}| = |C_0|^{\epsilon}$ for every $h = 1, 2, \ldots, \ell$. In what follows we will consider the classification of nodes in each $C_0^{(h)}$ separately. We will show that, with probability at least $1 - o(1)$, for each one of more than half of the subsets $C_0^{(h)}$, $h = 1, 2, \ldots, \ell$, the model is going to misclassify at least one node; consequently, the model is going to misclassify at least $\ell/2 = |C_0|^{1-\epsilon}/2 = \Omega(n^{1-\epsilon})$ nodes, giving the required result on the misclassification rate.

Fix any $h \in \{1, 2, \ldots, \ell\}$. Using (25) we get that, for large enough $n$, the event that the model correctly classifies all nodes in $C_0^{(h)}$ satisfies
\[
\Big\{\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T X_j < 0\Big\}
\subseteq \Big\{\max_{i \in C_0^{(h)}} \Big[\Big(\sum_{j \in N_i \cap C_1} \gamma_{ij} - \sum_{j \in N_i \cap C_0} \gamma_{ij}\Big) w^T \mu + \sigma \sum_{j \in N_i} \gamma_{ij} w^T g_j\Big] < 0\Big\}
\subseteq \Big\{-c_3\, w^T \mu + \sigma \max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < 0\Big\}
\]
for some absolute constant $c_3 > 0$, and hence the probability that the model correctly classifies all nodes in $C_0^{(h)}$ satisfies, for large enough $n$,
\[
\Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T X_j < 0\Big]
\le \Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \frac{c_3}{\sigma}\, w^T \mu\Big]
\le \Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \bar M \sqrt{\tfrac{\log n}{n(p+q)}(1 - \max(p, q))}\Big],
\]
where the last inequality follows from our assumption on $\|\mu\|$ and we denote $\bar M := c_3 M > 0$. Now we will use Sudakov's minoration inequality (Vershynin, 2018) to obtain a lower bound on the expected maximum, and then apply Borell's inequality to upper bound the above probability. In order to apply Sudakov's result we will need to define a metric over the node index set $[n]$. Let $z_i := \sum_{j \in N_i} \gamma_{ij} w^T g_j$.
For $i, j \in C_0$, $i \ne j$, consider the canonical metric $d(i, j)$ given by
\[
d(i, j)^2 := \mathbb{E}[(z_i - z_j)^2]
= \sum_{k \in N_i} \gamma_{ik}^2 + \sum_{k \in N_j} \gamma_{jk}^2 - 2\sum_{k \in N_i \cap N_j} \gamma_{ik}\gamma_{jk}
\ge \frac{c_4 |J_{ij}|}{n^2(p+q)^2},
\]
where $J_{ij} := (N_i \cup N_j) \setminus (N_i \cap N_j)$ is the symmetric difference of the neighborhoods of $i$ and $j$, $c_4 > 0$ is an absolute constant, and the inequality is due to (25). We lower bound $|J_{ij}|$ as follows. For $i, j \in C_0$, $i \ne j$, and a node $k \in [n]$, the probability that $k$ is a neighbor of exactly one of $i$ and $j$ is $2p(1-p)$ if $k \in C_0$ and $2q(1-q)$ if $k \in C_1$. Therefore we have $\mathbb{E}[|J_{ij}|] = n(p(1-p) + q(1-q))$. It follows from the multiplicative Chernoff bound that for any $0 < \delta < 1$,
\[
\Pr\big[|J_{ij}| < \mathbb{E}[|J_{ij}|](1 - \delta)\big] \le \exp\big(-\delta^2\,\mathbb{E}[|J_{ij}|]/3\big).
\]
Set
\[
\delta = 3\sqrt{\frac{\log n}{\mathbb{E}[|J_{ij}|]}} = 3\sqrt{\frac{\log n}{n(p(1-p) + q(1-q))}} = o(1),
\]
where the last equality follows from $n(p(1-p) + q(1-q)) = \Omega(\log^2 n)$, which is in turn due to the assumptions that $p, q = \Omega(\log^2 n / n)$ and that $p, q$ are bounded away from 1. Applying a union bound over all $i, j \in C_0$, we get that with probability at least $1 - o(1)$, the size of $J_{ij}$ satisfies
\[
|J_{ij}| \ge n(p(1-p) + q(1-q))(1 - o(1)), \quad \forall\, i, j \in C_0. \tag{26}
\]
Therefore it follows that, for large enough $n$,
\[
\frac{c_4 |J_{ij}|}{n^2(p+q)^2} \ge \frac{c_4\, n(p(1-p) + q(1-q))(1 - o(1))}{n^2(p+q)^2} = \Omega\Big(\frac{1 - \max(p, q)}{n(p+q)}\Big).
\]
We condition on the event that the inequality (26) holds for all $i, j \in C_0$, which happens with probability at least $1 - o(1)$. Applying Sudakov's minoration with the metric $d(i, j)$, we get that for large enough $n$, for all $h = 1, 2, \ldots, \ell$,
\[
\mathbb{E}\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j\Big] \ge c_5 \min_{i \ne j \in C_0^{(h)}} d(i, j)\sqrt{\log |C_0^{(h)}|} \ge c_6\sqrt{\frac{\epsilon \log n}{n(p+q)}(1 - \max(p, q))} \tag{27}
\]
for some absolute constants $c_5, c_6 > 0$. The last inequality in the above follows from $|C_0^{(h)}| = |C_0|^{\epsilon} = \Omega(n^{\epsilon})$. In addition, note that since by assumption $\Psi$ is independent of the node features, using (25) we have that $\sum_{j \in N_i} \gamma_{ij} w^T g_j$ is Gaussian with variance $O\big(\frac{1}{n(p+q)}\big)$.
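The computation of $\mathbb{E}[|J_{ij}|] = n(p(1-p) + q(1-q))$ above can be checked by simulation: each other node is adjacent to exactly one of $i$ and $j$ independently, with probability $2p(1-p)$ or $2q(1-q)$ depending on its class. A sketch under the stated assumptions (two classes of size $n/2$; the trial count and names are our own):

```python
import numpy as np

def expected_symmetric_difference(n, p, q):
    """Asymptotic mean E|J_ij| = n(p(1-p) + q(1-q)) for i, j in the same class."""
    return n * (p * (1 - p) + q * (1 - q))

def empirical_symmetric_difference(n, p, q, trials=2000, rng=None):
    """Monte Carlo: count nodes adjacent to exactly one of two fixed nodes i, j in C0."""
    rng = np.random.default_rng(0) if rng is None else rng
    half = n // 2
    # edge probabilities from i (and from j) to the other n - 2 nodes:
    # half - 2 remaining same-class nodes (prob p), half other-class nodes (prob q)
    probs = np.concatenate([np.full(half - 2, p), np.full(half, q)])
    total = 0.0
    for _ in range(trials):
        ei = rng.random(len(probs)) < probs   # neighbors of i
        ej = rng.random(len(probs)) < probs   # neighbors of j
        total += np.sum(ei ^ ej)              # symmetric difference size
    return total / trials
```

The empirical mean matches the formula up to the $O(1)$ correction from excluding $i$ and $j$ themselves.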
We use Borell's inequality (Adler and Taylor, 2007, Chapter 2) to get that, for any $t > 0$ and large enough $n$,
\[
\Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \mathbb{E}\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j\Big] - t\Big] \le \exp\big(-c_7 t^2 n(p+q)\big)
\]
for some absolute constant $c_7 > 0$. By the lower bound (27) on the expectation, we get that
\[
\Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < c_6\sqrt{\tfrac{\epsilon \log n}{n(p+q)}(1 - \max(p, q))} - t\Big] \le \exp\big(-c_7 t^2 n(p+q)\big).
\]
Now, let $M > 0$ be any constant that also satisfies $M < c_6\sqrt{\epsilon}/c_3$. Recall that we defined $\bar M = c_3 M$, and hence $\bar M < c_6\sqrt{\epsilon}$. Set
\[
t = \big(c_6\sqrt{\epsilon} - \bar M\big)\sqrt{\tfrac{\log n}{n(p+q)}(1 - \max(p, q))} = \Omega\Big(\sqrt{\tfrac{\log n}{n(p+q)}(1 - \max(p, q))}\Big);
\]
then, combining with the events we have conditioned on so far, we get
\[
\Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \bar M\sqrt{\tfrac{\log n}{n(p+q)}(1 - \max(p, q))}\Big] \le \exp\big(-c_7 t^2 n(p+q)\big) = o(1).
\]
Recall that the above probability upper bounds the probability of correctly classifying all nodes in $C_0^{(h)}$. Since our choice of $h$ was arbitrary, this applies to every $h \in \{1, 2, \ldots, \ell\}$. Let $I_0^{(h)}$ denote the indicator variable of the event that there is at least one node in $C_0^{(h)}$ that is misclassified; then
\[
\mathbb{E}\Big[\sum_{h=1}^{\ell} I_0^{(h)}\Big] = \ell\,\mathbb{E}[I_0^{(1)}] = \ell\,(1 - o(1)) = \ell - o(\ell).
\]
Applying the reverse Markov inequality, we get
\[
\Pr\Big[\sum_{h=1}^{\ell} I_0^{(h)} \le \tfrac{1}{2}\ell\Big] \le \frac{\ell - \mathbb{E}\big[\sum_{h=1}^{\ell} I_0^{(h)}\big]}{\ell - \tfrac{1}{2}\ell} = o(1).
\]
Therefore, with probability at least $1 - o(1)$, we have $\sum_{h=1}^{\ell} I_0^{(h)} \ge \tfrac{1}{2}\ell = \Omega(n^{1-\epsilon})$. This implies the required result that the model misclassifies at least $\Omega(n^{1-\epsilon})$ nodes.

The proof of part 2 is similar to the proof of part 1. Let us assume that $p \ge q$, since the result when $p < q$ can be proved analogously. We condition on the events $E_1, E_2, E_3$ defined in Definition 17, which simultaneously hold with probability at least $1 - o(1)$ by Lemma 18. Fix any $w \in \mathbb{R}^d$ such that $\|w\| = 1$. Because $t = \omega(1)$, by the definition of $\Psi$ in (13) and the attention coefficients in (1) we have that
\[
\gamma_{ij} = \begin{cases} \frac{2}{np}(1 \pm o(1)), & \text{if } (i,j) \text{ is an intra-class edge}, \\ o\big(\frac{1}{n(p+q)}\big), & \text{if } (i,j) \text{ is an inter-class edge}. \end{cases} \tag{29}
\]
Write $X_i = (2\epsilon_i - 1)\mu + \sigma g_i$ where $g_i \sim N(0, I)$, $\epsilon_i = 0$ if $i \in C_0$ and $\epsilon_i = 1$ if $i \in C_1$.
We consider the classification of nodes in $C_0$ based on the model output $\sum_{j \in N_i} \gamma_{ij} w^T X_j$ for node $i$ and the decision boundary at 0. As before, let $0 < \epsilon < 1$ and fix any partition of $C_0$ into disjoint subsets $C_0 = \bigcup_{h=1}^{\ell} C_0^{(h)}$ such that $\ell = |C_0|^{1-\epsilon}$ and $|C_0^{(h)}| = |C_0|^{\epsilon}$ for every $h = 1, 2, \ldots, \ell$. We proceed to show that, with high probability, for each one of more than half of the subsets $C_0^{(h)}$, $h = 1, 2, \ldots, \ell$, the model is going to misclassify at least one node, and consequently, the model is going to misclassify at least $\ell/2 = \Omega(n^{1-\epsilon})$ nodes as required. Fix any $h \in \{1, 2, \ldots, \ell\}$. Using (29) we get that, for large enough $n$, the event that the model correctly classifies all nodes in $C_0^{(h)}$ satisfies
\[
\Big\{\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T X_j < 0\Big\} \subseteq \Big\{-c_1\, w^T \mu + \sigma \max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < 0\Big\}
\]
for some absolute constant $c_1 > 0$, and hence the probability that the model classifies all nodes in $C_0^{(h)}$ correctly satisfies, for large enough $n$,
\[
\Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T X_j < 0\Big]
\le \Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \frac{c_1}{\sigma}\, w^T \mu\Big]
\le \Pr\Big[\max_{i \in C_0^{(h)}} \sum_{j \in N_i} \gamma_{ij} w^T g_j < \bar M'\sqrt{\tfrac{\log n}{n(p+q)}(1 - \max(p, q))}\Big],
\]
where the last inequality follows from our assumption on $\|\mu\|$ and we denote $\bar M' := c_1 M' > 0$. The rest of the proof of part 2 proceeds as the proof of part 1.

A.8 Proof of Theorem 14

We restate Theorem 14 for convenience. Recall that we write $\|\mu\| = \kappa\sigma$ for some $\kappa > 0$.

Theorem. With probability at least $1 - o(1)$ over the data $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$, using the two-layer MLP attention architecture $\Psi$ given in (3) and (4) with $R = \Omega(n \log^2 n / \sigma)$, the graph attention convolution output satisfies
\[
\sum_{j \in N_i} \gamma_{ij}\, \tilde w^T X_j > 0 \ \text{ if and only if } \ \tilde w^T X_i > 0, \quad \forall\, i \in [n],
\]
\[
\sum_{j \in N_i} \gamma_{ij}\, \tilde w^T X_j < 0 \ \text{ if and only if } \ \tilde w^T X_i < 0, \quad \forall\, i \in [n].
\]
We will condition on the following two events concerning the values of $\tilde w^T X_i$ for $i \in [n]$. Both events hold with probability at least $1 - o(1)$. First, for $\epsilon > 0$ consider the event
\[
\big\{\tilde w^T X_i \notin [-\epsilon\sigma\max\{1, \kappa\},\ \epsilon\sigma\max\{1, \kappa\}] \ \text{ for all } i \in [n]\big\}.
\]
Since $\tilde w = \mu/\|\mu\|$ and $\|\mu\| = \kappa\sigma$, each $\tilde w^T X_i$ follows a normal distribution with mean $(2\epsilon_i - 1)\kappa\sigma$ and variance $\sigma^2$ (recall that $\epsilon_i \in \{0, 1\}$ generates the node memberships in CSBM), so we have that
\[
\Pr\big[-\epsilon\sigma\max\{1, \kappa\} \le \tilde w^T X_i \le \epsilon\sigma\max\{1, \kappa\}\big]
= \int_{\kappa - \epsilon\max\{1, \kappa\}}^{\kappa + \epsilon\max\{1, \kappa\}} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx
\le \begin{cases} \sqrt{\tfrac{2}{\pi}}\,\epsilon, & \text{if } \kappa \le 1, \\ \sqrt{\tfrac{2}{\pi}}\,e^{-\kappa^2(1-\epsilon)^2/2}\,\epsilon\kappa, & \text{if } \kappa > 1. \end{cases}
\]
Let us pick $\epsilon$ such that $\epsilon = \frac{1}{n\log n}$ if $\kappa \le 2\sqrt{\log n}$, and $\epsilon = \frac{1}{4}$ if $\kappa > 2\sqrt{\log n}$. This gives that
\[
\Pr\big[\tilde w^T X_i \notin [-\epsilon\sigma\max\{1, \kappa\},\ \epsilon\sigma\max\{1, \kappa\}] \ \text{ for all } i \in [n]\big] \ge 1 - n \cdot o(1/n) = 1 - o(1).
\]
Second, consider the event that $|\tilde w^T X_i - \mathbb{E}[\tilde w^T X_i]| \le 10\sigma\sqrt{\log n}$ for all $i \in [n]$. By Lemma 18 we know that it holds with probability at least $1 - o(1)$. Under these two events, we have that
\[
\epsilon\sigma\max\{1, \kappa\} \le \tilde w^T X_i \le \kappa\sigma + 10\sigma\sqrt{\log n} \quad \text{if } \tilde w^T X_i > 0,
\]
\[
-\kappa\sigma - 10\sigma\sqrt{\log n} \le \tilde w^T X_i \le -\epsilon\sigma\max\{1, \kappa\} \quad \text{if } \tilde w^T X_i < 0.
\]
Recall from (5) that
\[
\Psi(X_i, X_j) = \begin{cases} 2R(1-\beta)\,\tilde w^T X_i, & \text{if } \tilde w^T X_j \ge |\tilde w^T X_i|, \\ 2R(1-\beta)\,\mathrm{sign}(\tilde w^T X_i)\,\tilde w^T X_j, & \text{if } -|\tilde w^T X_i| < \tilde w^T X_j < |\tilde w^T X_i|, \\ -2R(1-\beta)\,\tilde w^T X_i, & \text{if } \tilde w^T X_j \le -|\tilde w^T X_i|. \end{cases}
\]
Consider an arbitrary node $i$ where $\tilde w^T X_i > 0$. We will show that $h_i' = \sum_{j \in N_i} \gamma_{ij}\,\tilde w^T X_j > 0$. By the definition of the attention coefficients in (1), we know that
\[
\sum_{j \in N_i} \gamma_{ij}\,\tilde w^T X_j > 0 \iff \sum_{j \in N_i} \exp(\Psi(X_i, X_j))\,\tilde w^T X_j > 0.
\]
Using the above expression for $\Psi$, we have that
\[
\sum_{j \in N_i} \exp(\Psi(X_i, X_j))\,\tilde w^T X_j
= \exp(\Psi(X_i, X_i))\,\tilde w^T X_i + \sum_{j \in N_i \setminus \{i\}} \exp(\Psi(X_i, X_j))\,\tilde w^T X_j
\ge e^{2R(1-\beta)\epsilon\sigma\max\{1,\kappa\}}\,\epsilon\sigma\max\{1, \kappa\} - (n-1)\,e^{-2R(1-\beta)\epsilon\sigma\max\{1,\kappa\}}\big(\kappa\sigma + 10\sigma\sqrt{\log n}\big), \tag{30}
\]
where the inequality follows from the event that $|\tilde w^T X_j| \le \kappa\sigma + 10\sigma\sqrt{\log n}$ for all $j$, the bound $|N_i| \le n$, and the fact that the function $f(x) = x\,e^{2R(1-\beta)x} - (n-1)(\kappa\sigma + 10\sigma\sqrt{\log n})\,e^{-2R(1-\beta)x}$ is increasing with respect to $x$ for $x \ge 0$. To see that $h_i' > 0$, i.e. that $\sum_{j \in N_i} \exp(\Psi(X_i, X_j))\,\tilde w^T X_j > 0$, consider the following cases separately.
If $\kappa \le 2\sqrt{\log n}$, recall that we picked $\epsilon = 1/(n\log n)$; then because $R = \Omega(n\log^2 n/\sigma)$, for large enough $n$ we have that
\[
4R(1-\beta)\sigma > \frac{n\log n}{\max\{1, \kappa\}}\Big(\log n + \log(n\log n) + \log\big(\kappa + 10\sqrt{\log n}\big)\Big)
\]
\[
\Longrightarrow\ 4R(1-\beta)\sigma > \frac{1}{\epsilon\max\{1, \kappa\}}\Big(\log n + \log\frac{1}{\epsilon\max\{1, \kappa\}} + \log\big(\kappa + 10\sqrt{\log n}\big)\Big)
\]
\[
\Longrightarrow\ e^{4R(1-\beta)\sigma\epsilon\max\{1,\kappa\}} > \frac{n\big(\kappa + 10\sqrt{\log n}\big)}{\epsilon\max\{1, \kappa\}}
\]
\[
\Longrightarrow\ e^{2R(1-\beta)\sigma\epsilon\max\{1,\kappa\}}\,\sigma\epsilon\max\{1, \kappa\} > n\,e^{-2R(1-\beta)\sigma\epsilon\max\{1,\kappa\}}\big(\kappa\sigma + 10\sigma\sqrt{\log n}\big).
\]
By (30) this means that $h_i' > 0$.

If $\kappa > 2\sqrt{\log n}$, then $\epsilon = 1/4$. Because $R = \Omega(n\log^2 n/\sigma)$, for large enough $n$ we have
\[
4R(1-\beta)\sigma > \log n + \log\big(\kappa + 10\sqrt{\log n}\big) - \log(\kappa/4)
= \log n + \log\big(\kappa + 10\sqrt{\log n}\big) - \log(\epsilon\kappa)
\]
\[
\Longrightarrow\ e^{4R(1-\beta)\sigma\epsilon\kappa} > \frac{n\big(\kappa + 10\sqrt{\log n}\big)}{\epsilon\kappa}
\]
\[
\Longrightarrow\ e^{2R(1-\beta)\sigma\epsilon\kappa}\,\sigma\epsilon\kappa > n\,e^{-2R(1-\beta)\sigma\epsilon\kappa}\big(\kappa\sigma + 10\sigma\sqrt{\log n}\big).
\]
By (30) this means that $h_i' > 0$. This shows that $h_i' > 0$ whenever $\tilde w^T X_i > 0$. Similarly, one easily gets that $h_i' < 0$ whenever $\tilde w^T X_i < 0$. Therefore the proof is complete.

A.9 Proof of Corollary 16

We restate Corollary 16 for convenience. Recall that we write $\|\mu\| = \kappa\sigma$ for some $\kappa > 0$.

Corollary. With probability at least $1 - o(1)$ over the data $(X, A) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$, using the two-layer MLP attention architecture $\Psi$ given in (3) and (4) with $R = \Omega(n\log^2 n/\sigma)$, one has that:
(Perfect classification) If $\kappa \ge 2\sqrt{\log n}$ then all nodes are correctly classified;
(Almost perfect classification) If $\kappa = \omega(1)$ then at least a $1 - o(1)$ fraction of all nodes are correctly classified;
(Partial classification) If $\kappa = O(1)$ then at least a $\Phi(\kappa) - o(1)$ fraction of all nodes are correctly classified.
Fix $i \in [n]$ and write $\tilde w^T X_i = (2\epsilon_i - 1)\|\mu\| + (\sigma/\|\mu\|)\,\mu^T g_i$ where $g_i \sim N(0, I)$, and recall that $\epsilon_i \in \{0, 1\}$ defines the class membership of node $i$. We have that $\tilde w^T X_i$ follows a normal distribution with mean $(2\epsilon_i - 1)\kappa\sigma$ and standard deviation $\sigma$. Part 1 of the corollary is exactly Proposition 7, whose proof is given in the main text. We consider the other cases for $\kappa$. For both classes $\epsilon_i = 0$ and $\epsilon_i = 1$, the probability of correct classification is $\Phi(\kappa)$ (using 0 as the decision boundary).
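The per-node success probability $\Phi(\kappa)$ above can be verified with a quick Monte Carlo estimate (an illustrative sketch; the sample size and the helper names are our own):

```python
import math
import numpy as np

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correct_fraction(kappa, sigma=1.0, n=200_000, rng=None):
    """Fraction of nodes whose score w~^T X_i ~ N((2*eps_i - 1)*kappa*sigma, sigma^2)
    lands on the correct side of the decision boundary 0."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.integers(0, 2, size=n)     # class memberships
    s = (2 * eps - 1) * kappa * sigma + sigma * rng.standard_normal(n)
    return np.mean((s > 0) == (eps == 1))
```

For $\kappa$ around 1 the empirical fraction sits close to $\Phi(\kappa) \approx 0.84$, and it approaches 1 as $\kappa$ grows, matching the three regimes of the corollary.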
Therefore, by applying the additive Chernoff bound, we have that
\[
\Pr\Big[\sum_{i \in [n]} \mathbf{1}\{\text{node } i \text{ is correctly classified}\} \ge n\Phi(\kappa) - \frac{n}{\sqrt{\log n}}\Big] \ge 1 - o(1).
\]
The proof is complete by noticing that $n\Phi(\kappa) - n/\sqrt{\log n} = n(1 - o(1))$ when $\kappa = \omega(1)$, and $n\Phi(\kappa) - n/\sqrt{\log n} = n(\Phi(\kappa) - o(1))$ when $\kappa = O(1)$.

References

E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1-86, 2018.
R. J. Adler and J. E. Taylor. Random Fields and Geometry. Springer, 2007.
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, 2003.
A. Athreya, D. E. Fishkind, M. Tang, C. E. Priebe, Y. Park, J. T. Vogelstein, K. Levin, V. Lyzinski, Y. Qin, and D. L. Sussman. Statistical inference on random dot product graphs: A survey. Journal of Machine Learning Research, 18(226):1-92, 2018.
J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
D. Bahdanau, K. H. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
A. Baranwal, K. Fountoulakis, and A. Jagannath. Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization. In International Conference on Machine Learning (ICML), 2021.
N. Binkiewicz, J. T. Vogelstein, and K. Rohe. Covariate-assisted spectral clustering. Biometrika, 104:361-377, 2017.
C. Bodnar, F. Di Giovanni, B. Chamberlain, P. Liò, and M. M. Bronstein. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNNs. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
X. Bresson and T. Laurent. Residual gated graph convnets. arXiv:1711.07553, 2018.
S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks?
In International Conference on Learning Representations (ICLR), 2022.
M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations (ICLR), 2019.
E. Chien, J. Peng, P. Li, and O. Milenkovic. Adaptive universal generalized PageRank graph neural network. In International Conference on Learning Representations (ICLR), 2021.
M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
Y. Deshpande, S. Sen, A. Montanari, and E. Mossel. Contextual stochastic block models. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks (IJCNN), 2005.
W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs.
In Advances in Neural Information Processing Systems (NeurIPS), 2017.
M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.
Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. B. Ma, H. Chen, and M.-C. Yang. Measuring and improving the use of graph information in graph neural networks. In International Conference on Learning Representations (ICLR), 2019.
W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
S. Jegelka. Theory of graph neural networks: Representation and learning. arXiv:2204.07697, 2022.
N. Keriven, A. Bietti, and S. Vaiter. On the universality of graph neural networks on large random graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
B. Knyazev, G. W. Taylor, and M. Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh. Attention models in graphs: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 2019.
R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97-109, 2018.
Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.
D. Lim, F. Hohne, X. Li, S. L. Huang, V. Gupta, O. Bhalerao, and S. N. Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods.
In Advances in Neural Information Processing Systems (NeurIPS), 2021.
A. Loukas. How hard is to distinguish graphs with graph neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2020a.
A. Loukas. What graph neural networks cannot learn: Depth vs width. In International Conference on Learning Representations (ICLR), 2020b.
S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, and D. Precup. Revisiting heterophily for graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
S. Maskey, R. Levie, Y. Lee, and G. Kutyniok. Generalization analysis of message passing neural networks on large random graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
C. Moore. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. Bulletin of the European Association for Theoretical Computer Science, 1(121), 2017.
J. Palowitch, A. Tsitsulin, B. Mayer, and B. Perozzi. GraphWorld: Fake graphs bring real insights for GNNs. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.
O. Puny, H. Ben-Hamu, and Y. Lipman. Global attention improves graph networks generalization. arXiv:2006.07846, 2020.
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 2009.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
R. Vershynin.
High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep Graph Library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
Y. Yan, M. Hashemi, K. Swersky, Y. Yang, and D. Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In IEEE International Conference on Data Mining (ICDM), 2022.
J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.