# attentive_walkaggregating_graph_neural_networks__32f7f92d.pdf Published in Transactions on Machine Learning Research (08/2022) Attentive Walk-Aggregating Graph Neural Networks Mehmet F. Demirel demirel@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison Shengchao Liu liusheng@mila.quebec Quebec AI Institute (Mila) Siddhant Garg sidgarg@amazon.com Amazon Alexa AI Zhenmei Shi zhmeishi@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison Yingyu Liang yliang@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison Reviewed on Open Review: https: // openreview. net/ forum? id= TWSTy Yd2Rl Graph neural networks (GNNs) have been shown to possess strong representation power, which can be exploited for downstream prediction tasks on graph-structured data, such as molecules and social networks. They typically learn representations by aggregating information from the K-hop neighborhood of individual vertices or from the enumerated walks in the graph. Prior studies have demonstrated the effectiveness of incorporating weighting schemes into GNNs; however, this has been primarily limited to K-hop neighborhood GNNs so far. In this paper, we aim to design an algorithm incorporating weighting schemes into walk-aggregating GNNs and analyze their effect. We propose a novel GNN model, called AWARE, that aggregates information about the walks in the graph using attention schemes. This leads to an end-to-end supervised learning method for graph-level prediction tasks in the standard setting where the input is the adjacency and vertex information of a graph, and the output is a predicted label for the graph. We then perform theoretical, empirical, and interpretability analyses of AWARE. Our theoretical analysis in a simplified setting identifies successful conditions for provable guarantees, demonstrating how the graph information is encoded in the representation, and how the weighting schemes in AWARE affect the representation and learning performance. Our experiments demonstrate the strong performance of AWARE in graph-level prediction tasks in the standard setting in the domains of molecular property prediction and social networks. Lastly, our interpretation study illustrates that AWARE can successfully capture the important substructures of the input graph. The code is available on Git Hub. 1 Introduction The increasing prominence of machine learning applications for graph-structured data has lead to the popularity of graph neural networks (GNNs) in several domains, such as social networks (Kipf & Welling, 2016), molecular property prediction (Duvenaud et al., 2015), and recommendation systems (Ying et al., 2018). Several empirical and theoretical studies (e.g., (Duvenaud et al., 2015; Kipf & Welling, 2016; Xu Published in Transactions on Machine Learning Research (08/2022) et al., 2019; Dehmamy et al., 2019)) have shown that GNNs can achieve strong representation power by constructing representations encoding rich information about the graph. A popular approach of learning GNNs involves aggregating information from the K-hop neighborhood of individual vertices in the graph (e.g., (Kipf & Welling, 2016; Gilmer et al., 2017; Xu et al., 2019)). An alternative approach for learning graph representations is via walk aggregation (e.g., (Vishwanathan et al., 2010; Shervashidze et al., 2011; Perozzi et al., 2014)) that enumerates and encodes information of the walks in the graph. Existing studies have shown that walk-aggregating GNNs can achieve strong empirical performance with concrete analysis of the encoded graph information (Liu et al., 2019a). The results show that the approach can encode important information about the walks in the graph. This can potentially allow emphasizing and aggregating important walks to improve the quality of the representation for downstream prediction tasks. Weighting important information has been a popular strategy in recent studies on representation learning. It is important to note that the strong representation power of GNNs may not always translate to learning the best representation amongst all possible ones for the downstream prediction tasks. While the strong representation power allows encoding all kinds of information, a subset of the encoded information that is not relevant for prediction may interfere or even overwhelm the information useful for prediction, leading to sub-optimal performance. A particularly attractive approach to address this challenge is by incorporating weighting schemes into GNNs, which is inspired by the strong empirical performance of attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015; Xu et al., 2015; Vaswani et al., 2017; Shankar et al., 2018; Deng et al., 2018) for natural language processing (e.g., (Devlin et al., 2019)) and computer vision tasks (e.g., (Dosovitskiy et al., 2020)). In the domain of graph representation learning, recent studies (Gilmer et al., 2017; Veličković et al., 2017; Yun et al., 2019; Maziarka et al., 2020; Rong et al., 2020) have used the attention mechanism to improve the empirical performance of GNNs by learning to select important information and removing the irrelevant ones. These studies, however, have only explored using attention schemes for K-hop neighborhood GNNs, and there has been no corresponding work exploring this idea for walk-aggregating GNNs. In this paper, we propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction in the standard setting where the input is the adjacency and vertex information of a graph, and the output is a predicted label for the graph. AWARE aggregates the walk information by weighting schemes at distinct levels (vertex-, walk-, and graph-level). At the vertex (or graph) level, the model weights different directions in the vertex (graph, respectively) embedding space to emphasize important feature in the embedding space. At the walk level, it weights the embeddings for different walks in the graph according to the embeddings of the vertices along the walk. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones leading to representations that can improve learning performance. We perform an extensive three-fold analysis of AWARE as summarized below: Theoretical Analysis: We analyze AWARE in the simplified setting when the weights depend only on the latent vertex representations, identifying conditions when the weighting schemes improve learning. Prior weighted GNNs (e.g., (Veličković et al., 2017; Maziarka et al., 2020)) do not enjoy similar theoretical guarantees, making this the first provable guarantee on the learning performance of weighted GNNs to the best of our knowledge. Furthermore, current understanding of weighted GNNs typically focuses only on the positive effect of weighting on their representation power. In contrast, we also explore the limitation scenarios when the weighting does not translate to stronger learning power. Empirical Analysis: We empirically evaluate the performance of AWARE on graph-level prediction tasks from two domains: molecular property prediction (61 tasks from 11 popular benchmarks) and social networks (4 tasks). For both domains, AWARE overall outperforms both traditional graph representation methods as well as recent GNNs (including the ones that use attention mechanisms) in the standard setting. Interpretability Analysis: We perform an interpretation study to support our design for AWARE as well as the theoretical insights obtained about the weighting schemes. We provide a visual illustration that Published in Transactions on Machine Learning Research (08/2022) AWARE can extract the important sub-graphs for the prediction tasks. Furthermore, we show that the weighting scheme in AWARE can align well with the downstream predictors. 2 Related Work Graph neural networks (GNNs). GNNs have been the predominant method for capturing information of graph data (Li et al., 2015; Duvenaud et al., 2015; Kipf & Welling, 2016; Kearnes et al., 2016; Gilmer et al., 2017). A majority of GNN methods build graph representations by aggregating information from the K-hop neighborhood of individual vertices (Duvenaud et al., 2015; Li et al., 2015; Battaglia et al., 2016; Kearnes et al., 2016; Xu et al., 2019; Yang et al., 2019). This is achieved by maintaining a latent representation for every vertex, and iteratively updating it to capture information from neighboring vertices that are K-hops away. Another popular approach is enumerating the walks in the graph and using their information (Vishwanathan et al., 2010; Shervashidze et al., 2011; Perozzi et al., 2014). Liu et al. (2019a) use the motivation of aggregating information from the walks by proposing a GNN model that can achieve strong empirical performance along with concrete theoretical analysis. Theoretical studies have shown that GNNs have strong representation power (Xu et al., 2019; Dehmamy et al., 2019; Liu et al., 2019a), and have inspired new disciplines for improving their representations further (Morris et al., 2019; Azizian & marc lelarge, 2021). To this extent, while the standard setting of GNNs has only vertex features and the adjacency information as inputs (see Section 3), many recent GNNs (Kearnes et al., 2016; Gilmer et al., 2017; Coors et al., 2018; Yang et al., 2019; Klicpera et al., 2020; Wang et al., 2021) exploit extra information, such as edge features and 3D information, in order to gain stronger performance. In this work; however, we focus on analyzing the effect of applying attention schemes for representation learning, and thus want to perform this analysis in the standard setting. GNNs with attention. The empirical effectiveness of attention mechanisms has been demonstrated on language (Martins & Astudillo, 2016; Devlin et al., 2019; Raffel et al., 2020) and vision tasks (Ramachandran et al., 2019; Dosovitskiy et al., 2020; Zhao et al., 2020). This has also been extended to the K-hop GNN research line where the main motivation is to dynamically learn a weighting scheme at various granularities, e.g., vertex-, edgeand graph-level. Graph Attention Network (GAT) (Veličković et al., 2017) and Molecule Attention Transformer (MAT) (Maziarka et al., 2020) utilize the attention idea in their message passing functions. GTransformer (Rong et al., 2020) applies an attention mechanism at both vertexand edge-levels to better capture the structural information in molecules. ENN-S2S (Gilmer et al., 2017) adopts an attention module (Vinyals et al., 2015) as a readout function. However, all such studies are based on K-hop GNNs, and to the best of our knowledge, our work is the first to bring attention schemes into walk-aggregation GNNs. 3 Preliminaries Graph data. We assume an input graph G=(V, A) consisting of vertex attributes V and an adjacency matrix A. The vertices are indexed by [m]={1, . . . , m}. Suppose each vertex has C discrete-valued attributes,1 and the jth attribute takes values in a set of size kj. Let hj i {0, 1}kj be the one-hot encoding of the jth attribute for vertex i. The vertex i is represented as the concatenation of C attributes, i.e., hi=[h1 i ; . . . ; h C i ] {0, 1}K where K= PC j=1 kj. Then V is the set {hi}m i=1. We denote the adjacency matrix by A {0, 1}m m, where Ai,j=1 indicates that vertices i and j are connected. We denote the set containing the neighbors of vertex i by N(i)={j [m] : Ai,j=1}. Although many GNNs exploit extra input information like edge attributes and 3D information, our primary focus is on the effect of weighting schemes. Hence, we perform our analysis in the standard setting that only has the vertex attributes and the adjacency matrix of the graph as the input. Description of Vertex Attributes. For molecular graphs, vertices and edges correspond to atoms and bonds, respectively. Each vertex i [m] will then possess useful attribute information, such as the atom symbol and whether the atom is acceptor or donor. Such vertex attributes are folded into a vertex attribute 1Note that the vertex attributes are discrete-valued in general. If there are numeric attributes, they can simply be padded to the learned embedding for the other attributes. Published in Transactions on Machine Learning Research (08/2022) matrix R {0, 1}m C where C is the number of attributes on each vertex i [m]. Here is a concrete example: Ri, = [Ri,1, Ri,2, . . . , Ri,7, Ri,8], atom symbol Ri,1 {C, Cl, I, F, . . .}, atom degree Ri,2 {0, 1, 2, 3, 4, 5, 6}, is acceptor Ri,7 {0, 1}, is donor Ri,8 {0, 1}. The matrix R can then be translated into the vertex attribute vector set V using one-hot vectors for the attributes. For social network graphs, vertices and edges correspond to entities (actors, online posts) and the connections between them, respectively. For the social network graphs in our experiments, we follow existing work and utilize the vertex degree as the vertex attribute (i.e., C = 1). Vertex embedding. We define an r-dimensional embedding of vertex i by: fi = Whi, (1) where W=[W 1; . . . ; W C] Rr K and W j Rr kj is the embedding matrix for each attribute j [C]. We denote the embedding corresponding to V by F = [f1; . . . ; fm]. Walk aggregation. Unlike the typical approach of aggregating K-hop neighborhood information, walk aggregation enumerates the walks in the graph, and uses their information (e.g., (Vishwanathan et al., 2010; Perozzi et al., 2014)). Liu et al. (2019a) utilize the walk-aggregation strategy by proposing the N-gram graph GNN, which can achieve strong empirical performance, allow for fine-grained theoretical analysis, and potentially alleviate the over-squashing problem in K-hop GNNs. The N-gram graph views the graph as a Bag-of-Walks. It learns the vertex embeddings F in Equation (1) in a self-supervised manner. It then enumerates and embeds walks in the graph. The embedding of a particular walk p, denoted as fp, is the element-wise product of the embeddings of all vertices along this walk. The embedding for the n-gram walk set (walks of length n), denoted as f(n), is the sum of the embeddings of all walks of length n. Formally, given the vertex embedding fi for vertex i, i p fi, f(n) = X p:n-gram fp, (2) i p is the element-wise product over all the vertices in the walk p, and P p:n-gram is the sum over all the walks p in the n-gram walk set. It has been shown that the method is equivalent to a message-passing GNN described as follows: set F(1) = F = [f1; . . . ; fm] and f(1) = F(1)1 = P i [m] fi, and then for 2 n T: F(n) = (F(n 1)A) F(1), and f(n) = F(n)1, (3) where is the element-wise product and 1 denotes a vector of ones in Rm. The final graph embedding is given by the concatenation of all f(n) s, i.e., f[T ](G)=[f(1); . . . ; f(T )]. Compared to K-hop aggregation strategies, this formulation explicitly allows analyzing representations at different granularities of the graph: vertices, walks, and the entire graph. This provides motivation for capitalizing on the N-gram walk-aggregation strategy for incorporating and analyzing the effect of weighting schemes on walk-aggregation GNNs. The principled design facilitates theoretical analysis of conditions under which the weighting schemes can be beneficial. Thus, in this paper, we analyze the effect of incorporating attention weighting schemes on the N-gram walk-aggregation GNN. 4 AWARE: Attentive Walk-Aggregating Graph Neural Network We propose AWARE, an end-to-end fully supervised GNN for learning graph embeddings by aggregating information from walks with learned weighting schemes. Intuitively, not all walks in a graph are equally Published in Transactions on Machine Learning Research (08/2022) important for downstream prediction tasks. AWARE incorporates an attention mechanism to assign different contributions to individual walks as well as assigns feature weightings at the vertex and graph embedding levels. These weights are learned in a supervised fashion for prediction. This enables AWARE to mitigate the shortcomings of its unweighted counterpart (Liu et al., 2019a), which computes graph embeddings in an unsupervised manner only using the graph topology. Algorithm 1 AWARE (W, Wv, Ww, Wg) Require: Graph G=(V, A), max walk length T 1: Compute vertex embeddings F by Eqn (1) 2: F(1) = σ(Wv F) 3: for each n [2, T] do 4: Compute Sn using Eqn (7) 5: F(n)= F(n 1)(A Sn) F(1) 6: end for 7: Set f(n) := σ(Wg F(n))1 for 1 n T 8: Set f[T ](G) := [f(1); . . . ; f(T )] Ensure: The graph embedding f[T ](G) At a high level, AWARE first computes vertex embeddings F, and initializes a latent vertex representation F(1) by incorporating a feature weighting at the vertex level. It then iteratively updates the latent representation F(n) using attention at the walk level, and then performs a weighted summarization at the graph level to obtain embeddings f(n) for walk sets of length n. The f(n) s are concatenated to produce the graph embedding f[T ](G) for the downstream task. We now provide more details. Weighted vertex embedding. Intuitively, some directions in the vertex embedding space are likely to be more important for the downstream prediction task than others. In the extreme case, the prediction task may depend only on a subset of the vertex attributes (corresponding to some directions in the embedding space), while the rest may be inconsequential and hence should be ignored when constructing the graph embedding. AWARE weights different vertex features using Wv Rr r by computing the initial latent vertex representation F(1) as: F(1) = σ(Wv F), where F is computed using Equation (1) (4) where σ is an activation function, and r is the dimension of the weighted vertex embedding. Walk attention. AWARE computes embeddings corresponding to walks of length n in an iterative manner, and updates the latent vertex representations in each iteration using such walk embeddings. When aggregating the embedding of a walk, each vertex in the walk is bound to have a different contribution towards the downstream prediction task. For instance, in molecular property prediction, the existence of chemical bonds between certain types of atoms in the molecule may have more impact on the property to be predicted than others. To achieve this, in iteration n, AWARE updates the latent representations for vertex i from [F(n 1)]i to [F(n)]i by taking an element-wise product of [F(n 1)]i with a weighted sum of the latent representation vectors of its neighbors j N(i). Such a weighted update of the latent representations implicitly assigns a different importance to each neighbor j for vertex i. Assuming that the importance of vertex j for vertex i depends on their latent representations, we consider a score function corresponding to the update from vertex j to i as: Sji := S(fj, fi). (5) While our theoretical analysis is for weighting schemes defined in Equation (5), in practice one can have more flexibility, e.g., one can allow the weights to depend on the neighbors and the iterations. To allow different weights [S(n)]ji for different iterations n by using the latent representations for vertices from the previous iteration (n 1). In particular, we use the self-attention mechanism: [Z(n)]j i = [F(n 1)]j Ww[F(n 1)]i (6) where [F(n 1)]i is the latent vector of vertex i at iteration n 1, and Ww Rr r is a parameter matrix to be learned. We then define the attention weighting matrix used at iteration n as: [Sn]ji = e[Z(n)]j i P k N(i) e[Z(n)]k i (7) Published in Transactions on Machine Learning Research (08/2022) Using this attention matrix Sn, we perform the iterative update to the latent vertex representations via a weighted sum of the latent representation vectors of their neighbors: F(n) = F(n 1)(A Sn) F(1) (8) This update is simple and efficient, and automatically aggregates important information from the vertex neighbors for the downstream prediction task. In particular, it does not have the typical projection operation for aggregating information from neighbors. Instead, it computes the weighted sum and then the element-wise product to aggregate the information. Weighted summarization. Since the downstream task may selectively prefer certain directions in the final graph embedding space, AWARE learns a weighting Wg Rr r to compute a weighted sum of latent vertex representations for obtaining walk set embeddings of length n as follows: f(n) = σ(Wg F(n))1 (9) where 1 denotes a vector of ones in Rm. Walk set embeddings up to length T are then concatenated to produce the graph embedding f[T ](G) = [f(1), . . . , f(T )]. End-to-end supervised training. We summarize the different weighting schemes and steps of AWARE as a pseudo-code in Algorithm 1. The graph embeddings produced by AWARE can be fed into any properly-chosen predictor hθ parametrized by θ, so as to be trained end-to-end on labeled data. For a given loss function L, and a labeled data set S = {(Gi, yi)}M i=1 where Gi s are graphs and yi s are their labels, AWARE can learn the parameters (W, Wv, Ww, Wg) and the predictor θ by optimizing the loss i [M] L yi, hθ f[T ](Gi) (10) The N-Gram walk aggregation strategy termed as the N-Gram Graph (Liu et al., 2019a) operates in two steps: first to learn a graph embedding using the graph topology without any supervision, and then to use a predictor on the embedding for the downstream task. In contrast, AWARE is end-to-end fully supervised, and simultaneously learns the vertex/graph embeddings for the downstream task along with the weighting schemes to highlight the important information in the graph and suppress the irrelevant and/or harmful ones. Secondly, the weighting schemes of AWARE allow for the use of simple predictors over the graph embeddings (e.g., logistic regression or shallow fully-connected networks) for performing end-to-end supervised learning. In contrast, N-Gram Graph requires strong predictors such as XGBoost (with thousands of trees) to exploit the encoded information in the graph embedding. 5 Theoretical Analysis For the design of our walk-aggregation GNN with weighting schemes, we are interested in the following two fundamental questions: (1) what representation can it obtain and (2) under what conditions can the weighting scheme improve the prediction performance? In this section, we provide theoretical analysis of the walk weighting scheme.2 We consider the simplified case when the weights depend only on the latent embeddings of the vertices along the walk: Assumption 1. The weights are Sij defined in Equation (5). First, in Section 5.1 and 5.2, we answer the above two questions under the following simplifying assumption: Assumption 2. Wv = Wg = I, the number of attributes is C = 1, and the activation is linear σ(z) = z. 2For the other weighting schemes Wv and Wg, we know Wv weights the vertex embeddings fi, and Wg weights the final embeddings F(n), emphasizing important directions in the corresponding space. If Wv has singular vector decomposition Wv = UΣV , then it will relatively emphasize the singular vector directions with large singular values. Similar for Wg. See Section 7 for some visualization. Published in Transactions on Machine Learning Research (08/2022) In this simplified case, the only weighting is Sij computed by Ww, which allows our analysis to focus on its effect. We further assume that the number of attributes on the vertices is C = 1 to simplify the notations. We will show that the weighting scheme can highlight important information, and reduce irrelevant information for the prediction, and thus improve learning. To this end, we first analyze what information can be encoded in our graph representation, and how they are weighted (Theorem 1). We then examine when and why the weighting can help learning a predictor with better performance (Theorem 3). Next, in Section 5.3, we provide analysis for the general setting where Wv and Wg may not be the identity matrix, C 1, and σ is the leaky rectified linear unit (Re LU). The analysis leads to guarantees (Theorem 4 and Theorem 5) that are similar to those in the simplified setting. 5.1 The Effect of Weighting on Representation We will show that the representation/embedding f(n) is a linear mapping of a high dimension vector c(n) into the low dimension embedding space, where the vector c(n) records the statistics about the walks in the graph. First, we formally define the walk statistics c(n) (a variant of the count statistics defined in (Liu et al., 2019a)). Recall that we assume the number of attributes is C = 1. K is the number of possible attribute values, and the columns of the vertex embedding parameter matrix W Rr K are embeddings for different attribute values u. Let W(u) denote the column for value u, i.e., W(u) = Wh(u) where h(u) is the one-hot vector of u. Definition 1 (Walk Statistics). A walk type of length n is a sequence of n attribute values v = (v1, v2, , vn) where each vi is an attribute value. The walk statistics vector c(n)(G) RKn is the histogram of all walk types of length n in the graph G, i.e., each entry is indexed by a walk type v and the entry value is the number of walks with sequence of attribute values v in the graph. Furthermore, let c[T ](G) be the concatenation of c(1)(G), . . . , c(T )(G). When G is clear from the context, we write c(n) and c[T ] for short. Note that the walk statistics c(n) may not completely distinguish any two different graphs, i.e., there can exist two different graphs with the same walk statistics c(n) for any given n. Figure 1 shows such an example where the given two graphs are isomorphically different despite having the same walk statistics c(1), c(2), and c(3). On the other hand, such indistinguishable cases are highly unlikely in practice. We also acknowledge other well-known statistics for distinguishability that have been used for analyzing GNNs, in particular, the Weisfeiler-Lehman isomorphism test (e.g., Xu et al. (2019)). Nevertheless, it is crucial noting here that the goal of our theoretical analysis is very different. Namely, while the Weisfeiler-Lehman test has been used as an important tool to analyze the representation power of GNNs, the goal of our analysis is the prediction performance. As pointed out in the introduction, strong representation power may not always translate to good prediction performance. In fact, a very strong representation power emphasizing too much on graph distinguishability is harmful rather than beneficial for the prediction. For example, a good representation for prediction should emphasize the effective features related to class labels and remove irrelevant features and/or noise. If two graphs only differ in some features irrelevant to the class label, then it is preferable to get the same representation for them, rather than insisting on graph distinguishability. Weighting schemes can potentially down-weight or remove the irrelevant information and improve the prediction performance. Next, we introduce the following notation for the linear mapping projecting c(n) to the representation f(n). Definition 2 (ℓ-way Column Product). Let A be a d N matrix, and let ℓbe a natural integer. The ℓ-way column product of A is a d N ℓmatrix denoted as A[ℓ], whose column indexed by a sequence (i1, i2, , iℓ) is the element-wise product of the i1, i2, . . . , iℓ-th columns of A, i.e., (i1, i2, . . . , iℓ)-th column in A[ℓ] is Ai1 Ai2 Aiℓwhere Aj for j [N] is the j-th column in A, and is the element-wise product. In particular, W [n] is an r by Kn matrix, whose columns are indexed by walk types v = (v1, v2, , vn) and equal W(v1) W(v2) W(vn). Definition 3 (Walk Weights). The weight of a walk type v = (v1, . . . , vn) is i=1 S(W(vi), W(vi+1)) (11) where S( , ) is the weight function in Equation (5). Published in Transactions on Machine Learning Research (08/2022) B B B B B B B B B B B B Figure 1: Two different graphs with the same walk-statistics c(1), c(2), and c(3). Vertices with the same letters have entirely identical attribute values. All edges are undirected and identical. The following theorem then shows that f(n) can be viewed as a compressed version (linear mapping) of the walk statistics, weighted by the attention weights S. Theorem 1. Assume Assumption 1 and 2. The embedding f(n) is a linear mapping of the walk statistics c(n): f(n) = M(n)Λ(n)c(n) (12) where M(n) = W [n] is a matrix depending only on W, and Λ(n) is a Kn-dimensional diagonal matrix whose columns are indexed by walk types v and have diagonal entries λ(v). Therefore, f[T ] = MΛc[T ] (13) where M is a block-diagonal matrix with diagonal blocks M(1), M(2), . . . , M(T ), and Λ is block-diagonal with blocks Λ(1), Λ(2), . . . , Λ(T ). Proof. It is sufficient to prove the first statement with M(n) = W [n], as the second one directly follows. To this end, we will first prove the following lemma. Lemma 1. Let Pi,n be the set of walks starting from vertex i and of length n. Then the latent vector on vertex i is: [F(n)]i = X p Pi,n λ(vp) k p [F(1)]k where λ(vp) is the weight for the sequence of attribute values on p, and J k p[F(1)]k is the element-wise product of all the [F(1)]k s on p. Proof. We prove the lemma by induction. For n = 1, it is trivially true. Suppose the statement is true for n 1. Then recall that [F(n)]i is constructed by weighted-summing up all the latent vectors [F(n 1)]j from the neighbors j of i, and then element-wise product with [F(1)]i = fi. That is, j Ni Sji[F(n 1)]j [F(1)]i. (15) Published in Transactions on Machine Learning Research (08/2022) So letting Ni denote the set of neighbors of i, we have by induction j Ni Sji[F(n 1)]j [F(1)]i (16) p Pj,n 1 λ(vp) k p [F(1)]k [F(1)]i (17) p Pj,n 1 Sjiλ(vp) k p [F(1)]k By concatenating i to the walks p Pj,n 1 for all neighbors j Ni, we obtain the set of walks starting from i and of length n, i.e., Pi,n. Furthermore, for a path obtained by concatenating i and p Pj,n 1, the weight is exactly Sji λ(vp). Therefore, [F(n)]i = X p Pj,n 1 Sji λ(vp) k p [F(1)]k p Pi,n λ(vp) k p [F(1)]k By induction, we complete the proof. We now use Lemma 1 to prove the theorem statement. Recall that hk is the one-hot vector for the attribute on vertex k. Let ep {0, 1}Kn be the one-hot vector for the walk type of a walk p. f(n) = F(n)1 (21) i=1 [F(n)]i (22) p Pi,n λ(vp) k p [F(1)]k p:walks of length n λ(vp) k p [F(1)]k p:walks of length n λ(vp) p:walks of length n λ(vp)W [n]ep (26) p:walks of length n λ(vp)ep (27) = W [n]Λ(n)c(n). (28) The third line follows from Lemma 1. The forth line follows from that the union of Pi,n for all i is the set of all walks of length n. The sixth line follows from the definition of W [n] and ep. The last line follows from the definitions of Λ(n) and c(n). Remark. Theorem 1 shows that the embedding f(n) can encode a compressed version of the weighted walk statistics Λ(n)c(n). Note that similar to Λ(n), c(n) is in high dimension Kn. Its entries are indexed Published in Transactions on Machine Learning Research (08/2022) by all possible sequences of the attribute values v = (v1, . . . , vn), and the entry value is just the count of the corresponding sequence in the graph. Λ(n)c(n) is thus an entry-wise weighted version of the counts, i.e., weighting the walks with walk type v = (v1, . . . , vn) by the corresponding weight λ(v). In words, c(n) is first weighted by our weighting scheme where the count of each walk type v is weighted by the corresponding walk weight λ(v), and then compressed from the high dimension RKn to the low dimension Rr. Ideally, we would like to have relatively larger weights on walk types important for the prediction task and smaller for those not important. This provides the basis for the focus of our analysis: the effect of weighting for the learning performance. The N-gram graph method is a special case of our method, by setting the message weights S( , ) to be always 1 (and thus Λ(n) being an identity matrix). Then we have f(n) = W [n]c(n). Our method thus enjoys greater representation power, since it can be viewed as a generalization that allows to weight the features. What is more important, and is also the focus of our study, is that this weighting can potentially help learn a predictor with better prediction performance. This is analyzed in the next subsection. Remark. The weighted walk statistics Λ(n)c(n) is compressed from a high dimension to a low dimension by multiplying with W [n]. For the unweighted case, the analysis in (Liu et al., 2019a) shows that there exists a large family of W (e.g., the entries of W are independent Rademacher variables) such that W [n] has the Restricted Isometry Property (RIP) and thus c(n) can be recovered from f(n) by compressive sensing techniques (see the review in Appendix A.1), i.e., f(n) encodes c(n). A similar result holds for our weighted case. In particular, it is well known in the compressive sensing literature that when W [n] has RIP, and Λ(n)c(n) is sparse, then Λ(n)c(n) can be recovered from f(n), i.e., f(n) preserves the information of Λ(n)c(n). However, it is unclear if there exists W such that W [n] can have RIP. We show that a wide family of W satisfy this and thus Λ(n)c(n) can be recovered from f(n). Theorem 2. Assume Assumption 1 and 2. If r = Ω((ns3 n log K)/ϵ2) where sn is the sparsity of c(n), then there is a prior distribution over W such that with probability 1 exp( Ω(r1/3)), W [n] satisfies (sn, ϵ)-RIP. Therefore, if r = Ω(ns3 n log K) and Λ(n)c(n) is the sparsest vector satisfying f(n) = W [n]Λ(n)c(n), then with probability 1 exp( Ω(r1/3)), Λ(n)c(n) can be recovered from f(n). Proof. The first statement follows from Theorem 8 in Appendix A.2, and the second follows from Theorem 6 in Appendix A.1. The distribution of W satisfying the above can be that with (properly scaled) i.i.d. Rademacher entries or Gaussian entries. Since this is not the focus of our paper, below we simply assume that W [n] (and thus M) has RIP and focus on analyzing the effect of the weighting on the learning over the representations. 5.2 The Effect of Weighting on Learning Since we have shown that the embedding f(n) can be viewed as a linear mapping of the weighted walk statistics to low dimensional representations, we are now ready to analyze if the weighting can potentially improve the learning. We now illustrate the intuition for the benefit of appropriate weighting. First, consider the case where we learn over the weighted features Λc[T ] (instead of learning over f[T ](G) = MΛc[T ] which has an additional M). Suppose that the label is given by a linear function on c[T ] with parameter β , i.e., y = β , c[T ] . If Λ is invertible, the parameter Λ 1β on Λc[T ] has the same loss as β on c[T ]. So we only need to learn Λ 1β . The sample size needed to learn Λ 1β on Λc[T ] will depend on the factor Λ 1β 2 Λc[T ] 2, which is potentially smaller than β 2 c[T ] 2 for the unweighted case. This means fewer data samples are needed (equivalently, smaller loss for a fixed amount of samples). Now, consider the case of learning over f[T ](G) = MΛc[T ] that has an extra M. We note that c[T ] can be sparse compared to its high dimension (since likely only a very small fraction of all possible walk types will appear in a graph). Well-established results from compressive sensing show that when M has the Restricted Isometry Property (RIP), learning over MΛc[T ] is comparable to learning over Λc[T ]. Indeed, Theorem 2 Published in Transactions on Machine Learning Research (08/2022) shows when W is random and the embedding dimension r is large enough, there are families of distributions of W such that M has RIP for Λc[T ]. Thus, we assume M has RIP and focus on the analysis of how Ww affects the weighting and the learning. In practice, our method is more general and the parameters are learned over the data. Still, the analysis in the special case under the assumptions can provide useful insights for understanding our method, in particular, how the weighting can affect the learning of a predictor over the embeddings. However, the above intuition is only for learning over a fixed weighting Λ induced by a fixed Ww. Our key challenge is to incorporate the learning of Ww in the analysis, which we now address. Formally, we consider learning Ww from a hypothesis class W, and let Λ(Ww) and f[T ](G; Ww) denote the weights and representation given by Ww. For prediction, we consider binary classification with the logistic loss ℓ(g, y) = log(1+exp( gy)) where g is the prediction and y is the true label. Let ℓD(θ, Ww) be the risk of a linear classifier with a parameter θ on f[T ](G; Ww) over the data distribution D, and let ℓS(θ, Ww) denote the risk over the training dataset S. Suppose we have a dataset S = {(Gi, yi)}M i=1 of M i.i.d. sampled from D, and ˆθ and c Ww are the parameters learned via ℓ2-regularization with regularization coefficient Bθ: ˆθ, c Ww = arg min Ww W, θ 2 Bθ ℓS(θ, Ww) := 1 i=1 ℓ θ, f[T ](Gi; Ww) , yi . (29) To derive error bounds, suppose W is equipped with a norm and let N(W, ϵ) be the ϵ-covering number of W w.r.t. the norm (other complexity measures on W, such as VC-dimension, can also be used). Suppose f[T ](G; Ww) is Lf-Lipschitz w.r.t. the norm on W and the ℓ2 norm on the representation. Furthermore, let β denote the best linear classifier on c[T ], and let ℓ D denote its risk. Theorem 3. Assume Assumption 1 and 2. Assume c[T ] is s-sparse, M satisfies (2s, ϵ0)-RIP, Λ(Ww) is invertible and f[T ](G; Ww) is Lf-Lipschitz over W. For any δ, ϵ (0, 1), there are regularization coefficient values Bθ such that with probability 1 δ: ℓD(ˆθ, c Ww) ℓ D + 2ϵ + O r T + Cϵ(W) + min Ww W B(Ww) O Cϵ(W) := log N W, ϵ 8BθLf δ , B(Ww) := max G D Λ(Ww)c[T ](G) 2 Λ(Ww) 1β 2. (31) Proof. Since ˆθ = ˆθ(c Ww) where ˆθ(c Ww) is defined in Lemma 2, by Lemma 2.(1), we have ℓD(ˆθ, c Ww) ℓS(ˆθ, c Ww) + O r T + log N W, ϵ 8BθLf Furthermore, since ˆθ, c Ww are the optimal solution for the regularized regression, then for any Ww W, ℓS(ˆθ, c Ww) ℓS(ˆθ(Ww), Ww). (33) Then by Lemma 2.(2), we have ℓS(ˆθ(Ww), Ww) ℓ D + O δ + log N W, ϵ 8BθLf Combining the above inequalities proves the theorem. Published in Transactions on Machine Learning Research (08/2022) Lemma 2. Suppose f[T ](G; Ww) is Lf-Lipschitz w.r.t. the norm on W and the ℓ2 norm on the representation. Let ˆθ(Ww) = arg min θ 2 Bθ i=1 ℓ θ, f[T ](Gi; Ww) , yi (35) be the optimal solution for a fixed Ww. (1) For any ϵ, δ (0, 1), with probability at least 1 δ, for any Ww W, |ℓD(ˆθ(Ww), Ww) ℓS(ˆθ(Ww), Ww)| O r T + log N W, ϵ 8BθLf (2) Assume that M satisfies the (2s, ϵ0)-RIP, and c[T ] is s-sparse. Also assume that Λ 1(Ww) is invertible over W. Then for any ϵ, δ (0, 1), there exists an appropriate choice of regularization coefficient Bθ, such that with probability at least 1 δ, for any Ww W, ℓD(ˆθ(Ww), Ww) ℓ D + O δ + log N W, ϵ 8BθLf Proof. (1) We apply a net argument on W. Let X be an ϵ/8BθLf-net of W, so |X| N(W, ϵ/8BθLf). Then for the given M, any Ww X and any θ satisfies: |ℓD(θ, Ww) ℓS(θ, Ww)| O r T + log N W, ϵ 8BθLf Then for any W w W, there exists a Ww X such that Ww W w ϵ/8BθLf. Then letting θ denote ˆθ(W w), we have |ℓD(θ, W w) ℓS(θ, W w)| |ℓD(θ, W w) ℓD(θ, Ww)| (39) + |ℓD(θ, Ww) ℓS(θ, Ww)| (40) + |ℓS(θ, Ww) ℓS(θ, W w)|. (41) For any G with label y, we have |ℓ( θ, f[T ](G; Ww) , y) ℓ( θ, f[T ](G; W w) , y)| (42) | θ, f[T ](G; Ww) θ, f[T ](G; W w) | (43) = | θ, f[T ](G; Ww) f[T ](G; W w) | (44) = θ 2 f[T ](G; Ww) f[T ](G; W w) 2 (45) BθLf Ww W w (46) |ℓD(θ, W w) ℓS(θ, W w)| ϵ r T + log N W, ϵ 8BθLf This proves the first statement. (2) Let X be the set of Λc[T ] for G from the data distribution. Since c[T ] is s-sparse, Λc[T ] is also s-sparse. Then Λc[T ](G) Λc[T ](G ) is 2s-sparse for any G and G , so M satisfies ( X, ϵ)-RIP. Then we can apply the theorem for learning over compressive sensing data. In particular, for a fixed Ww, we apply Theorem 4.2 in (Arora et al., 2018). (The theorem is included as Theorem 7 in Section A.1 for completeness. Note that choosing an appropriate λ in that theorem is equivalent to choosing an appropriate Bθ by standard Lagrange multiplier theory.) The statement follows from that the logistic loss function is 1-Lipschitz and convex, and that the optimal solution over Λ(Ww)c[T ] is Λ 1(Ww)θ with the same loss as θ over c[T ]. Combining with a net argument similar as above proves the statement. Published in Transactions on Machine Learning Research (08/2022) Remark. Theorem 3 shows that the learned model has risk comparable to that of the best linear classifier on the walk statistics, given sufficient data. To see the benefit of weighting schemes, let us now compare to the unweighted case. In the unweighted case, Λ is the identity matrix, log N W, ϵ 8BθLf reduces to 0, and B(Ww) reduces to B0 := max G D c[T ](G) 2 β 2. (49) Therefore, our method needs extra samples to learn Ww, leading to the extra error terms related to log N W, ϵ 8BθLf . On the other hand, the benefit of weighting is replacing the factor B0 above with min Ww B(Ww). If there is W w with B(W w) B0, the error is significantly reduced. Therefore, there is a trade-off between the reduction of error for learning classifiers on an appropriate weighted representation and the additional samples needed for learning an appropriate weighting. The benefit of weighting can be significant in practice. min Ww B(Ww) can be much smaller than B0, especially when some features (i.e., walk types) in c[T ] are important while others are not, which is true for many real-world applications. For a concrete example, suppose c[T ](G) is s-sparse with each non-zero entry being some constant c. Suppose only a few of the features are useful for the prediction. In particular, β is ρ-sparse with each non-zero entry being some constant b, and ρ s. Suppose there is a weighting W w that leads to weight Υ on the entries corresponding to the ρ important features (i.e., the non-zero entries in β ), and weight υ for the other features where |υ| |Υ|. Then it can be shown that ρb2 = bc ρs, (50) min Ww B(Ww) p ρ(Υc)2 + (s ρ)(cυ)2p ρ(b/Υ)2 (51) min Ww B(Ww) Since ρ s and |υ| |Υ|, min Ww B(Ww) is much smaller than B0, so the weighting can significantly reduce the error. This demonstrates that with proper weighting highlighting important features and suppressing irrelevant features for prediction, the error can be much smaller than the error for without weighting. 5.3 Analysis for the General Setting Here we analyze the more general case where Wv and Wg may not be the identity matrix I and the number of attributes C 1. We assume that the activation function σ is the leaky rectified linear unit: σ(z) = max{αz, z} for some α [0, 1]. (53) This includes the following special cases: (1) the linear activation analyzed above corresponds to α = 1; (2) the commonly used rectified linear unit (Re LU) corresponds to α = 0. We also note that while our analysis is for the leaky rectified linear unit, it can easily be generalized to any piece-wise linear function. We will need to generalize the notations. Recall that C is the number of attributes, kj is the number of possible values for the j-th attribute. Let KC := QC j=1 kj denote the number of possible attribute value vector. Also, hj i {0, 1}kj is the one hot vector for the j-th attribute on vertex i. hi denotes the one hot vector for vertex i, which is the concatenation [h1 i , . . . , h C i ] {0, 1}K. The ℓ-th column of the embedding parameter matrix W j Rr kj is an embedding vector for the ℓ-th value of the j-th attribute, and the parameter matrix W Rr K is the concatenation W = [W 1, W 1, . . . , W C] with K = PC 1 j=0 kj. Finally, given an attribute vector u = [u1, u2, . . . , u C] where uj is the value for the j-th attribute, let Let W(u) denote the embedding for u, i.e., W(u) = Wh(u) where h(u) = [h(u1), h(u2), . . . , h(u C)] and h(uj) is the one-hot vector of uj. We define the walk statistics ci (n) for each vertex i and the walk statistics c(n) for the whole graph as follows. Published in Transactions on Machine Learning Research (08/2022) Definition 4 (Walk Statistics for the General Case). A walk type of length n is a sequence of n attribute vectors v = (v1, v2, , vn) where each vi is an attribute vector of C attributes. The walk statistics vector ci (n)(G) RKn C for vertex i [m] is the histogram of all walk types of length n beginning from i in the graph G, i.e., each entry is indexed by a walk type v and the entry value is the number of walks beginning from vertex i with sequence of attribute value vectors v in the graph. Furthermore, let ci [T ](G) be the concatenation of ci (1)(G), . . . , ci (T )(G), and let c(n)(G) = P i [m] ci (n)(G) and c[T ](G) = P i [m] ci [T ](G). When G is clear from the context, we write ci (n), ci [T ], c(n), c[T ] for short. So the definition is similar to that for the simplified case, except that now the walk statistics for each vertex is also defined, and a walk type considers all C attributes. When C = 1, the c(n) and c[T ] here reduces to those defined in the simplified setting. Similarly, the definition of the walk weight is the same as that in the simplified setting, except that it is defined over the generalized walk types. The Effect of Weighting on Representation. We will first consider the representation power, showing that F i (n) := [Wg F(n)]i is a linear mapping of ci (n). This is based on the observation that for the leaky Re LU unit, we have σ(z) = Γ(z)z where Γ(z) = αI[z < 0] + I[z 0]. This inspires the following notation. Definition 5. Given a vector u Rr , define a diagonal matrix Γ(u) Rr r with diagonal entries [Γ(u)]ii = αI[ui < 0] + I[ui 0]. (54) Let (Wv W){n} be a matrix with Kn C column corresponding to all possible length-n walk types, with the column indexed by a walk type v = (v1, . . . , vn) being g1 g2 gn with gi = Γ(Wv W(vi)) Wv W(vi). The following theorem then shows that F i (n) can be a compressed version of the walk statistic for vertex i, weighted by the weighting parameter matrix Wv, Wg and also by the attention scores S. Theorem 4. Assume Assumption 1. The embedding F i (n) := [Wg F(n)]i is a linear mapping of the walk statistics ci (n) for any i [m]: F i (n) = Wg(Wv W){n}Λ(n)ci (n). (55) where Λ(n) is a Kn C-dimensional diagonal matrix, whose columns are indexed by walk types v and have diagonal entries λ(v). Therefore, i=1 σ(MΛci [T ]) (56) where M is a block-diagonal matrix with diagonal blocks Wg(Wv W)(1), Wg(Wv W){2}, . . . , Wg(Wv W){T }, and Λ is block-diagonal with blocks Λ(1), Λ(2), . . . , Λ(T ). Proof. The proof is similar to that of Theorem 1. First, we note that Lemma 1 still applies to the general case, so can be used to prove the theorem statement. Recall that hk is the one-hot vector for the attributes on vertex k. Let ep {0, 1}Kn C be the one-hot vector for the walk type of a walk p. By the definition of F i (n) and by Lemma 1, F i (n) = [Wg F(n)]i (57) = Wg[F(n)]i (58) p Pi,n λ(vp) k p [F(1)]k Published in Transactions on Machine Learning Research (08/2022) Then by the definition of F(1), F i (n) = Wg X p Pi,n λ(vp) k p σ(Wv Whk) p Pi,n λ(vp) k p (Γ(Wv Whk) Wv Whk) p Pi,n λ(vp)(Wv W){n}ep (62) = Wg(Wv W){n} X p Pi,n λ(vp)ep (63) = Wg(Wv W){n}Λ(n)ci (n). (64) The second line follows from the property of σ and the definition of Γ. The third line follows from the definitions of (Wv W){n} and ep. The last line follows from the definition of Λ(n) and ci (n). The theorem shows that in the general case, before applying the last activation, the embedding F i (n) is a linear mapping of the walk statistics ci (n), with a more complicated mapping Wg(Wv W){n}Λ(n). On the other hand, the final graph embedding f[T ] is no longer a linear mapping of the walk statistics in general, but is a sum of the nonlinear transformation of the linear embedding F i (n) s. Only when σ is the identity function, f[T ] = Pm i=1 σ(MΛci [T ]) = Pm i=1 MΛci [T ] = MΛ Pm i=1 ci [T ] = MΛc[T ] becomes a linear mapping, and recovers the result in the simplified setting. Finally, similarly as before, with properly set Wg, Wv, W, the linear mapping M can satisfy RIP. We will assume this in the following analysis. The Effect of Weighting on Learning. Now we are ready to analyze the learning performance. Suppose we learn a classifier h on top of f[T ] together with the parameters W, Wv, Ww, Wg in the model. Formally, let Wall := (W, Wv, Ww, Wg). We consider learning Wall from a hypothesis class W, and learning the classifier h from a classifier hypothesis class H = {hθ} with a parameter θ. Let Λ(Wall) and f[T ](G; Wall) denote the weights and representation given by Wall. For prediction, we consider binary classification with the logistic loss ℓ(g, y) = log(1 + exp( gy)) where g is the prediction and y is the true label. Let ℓD(θ, Ww) be the risk of a classifier with a parameter θ on f[T ](G; Wall) over the data distribution D, and let ℓS(θ, Wall) denote the risk over the training dataset S. Recall that ℓ D denote the risk of the best linear classifier on c[T ]. Suppose we have a dataset S = {(Gi, yi)}M i=1 of M i.i.d. sampled from D, and ˆθ, c Wall are the parameters learned via ℓ2-regularization: ˆθ, c Wall = arg min Wall W,hθ H ℓS(θ, Wall) := 1 i=1 ℓ hθ f[T ](Gi; Wall) , yi . (65) A technical challenge is that f[T ] is no longer a linear mapping, so we cannot directly apply the guarantees about regression on RIP linear mappings. To address this challenge, we compare the power of our nonlinear learning to the linear learning. Formally, let f lin [T ](G; Wall) be the linear mapping induced by Wall: f lin [T ](G; Wall) := MΛc[T ](G) (66) and let ˆθlin and c W lin all be the parameters learned via ℓ2-regularization with regularization coefficient Bθ: ˆθlin, c W lin all = arg min Wall W, θ 2 Bθ ℓlin S (θ, Wall) := 1 i=1 ℓ θ, f lin [T ](Gi; Wall) , yi . (67) Published in Transactions on Machine Learning Research (08/2022) We introduce the following notation to measure the power of the classifier from H on f[T ] compared to that of a linear classifier on the linear mapping f lin [T ]: δS(W, H, Bθ) := min Wall W,hθ H ℓS(θ, Wall) min Wall W, θ 2 Bθ ℓlin S (θ, Wall). (68) Theorem 5. Assume Assumption 1. Assume c[T ] is s-sparse, M satisfies (2s, ϵ0)-RIP, Λ(Wall) is invertible and f[T ](G; Wall) is Lf-Lipschitz over W. For any δ, ϵ (0, 1), there are regularization coefficient values Bθ such that with probability 1 δ: ℓD(ˆθ, c Wall) ℓ D + 2ϵ + O r T + Cϵ(W) + min Ww W B(Wall) O + δS(W, H, Bθ) (69) Cϵ(W) := log N W, ϵ 8BθLf B(Wall) := max G D Λ(Wall)c[T ](G) 2 Λ(Wall) 1β 2. (71) Proof. The proof is similar to that in the simplified setting. First, following the same proof for Lemma 2.(1) with Ww replaced by Wall, we have ℓD(ˆθ, c Wall) ℓS(ˆθ, c Wall) + O r T + log N W, ϵ 8BθLf Furthermore, since ˆθ, c Wall are the optimal solution for the regression on f[T ] and ˆθlin, c W lin all are that for the regression on f lin [T ], we have ℓS(ˆθ, c Wall) = ℓlin S (ˆθlin, c W lin all ) + δS(W, H, Bθ), (73) and for any Wall W, ℓlin S (ˆθlin, c W lin all ) ℓlin S (ˆθlin(Wall), Wall). (74) Finally, following the same proof for Lemma 2.(2) with Ww replaced by Wall, we have ℓlin S (ˆθlin(Wall), Wall) ℓ D + O δ + log N W, ϵ 8BθLf Combining the above inequalities proves the theorem. The theorem shows a similar conclusion as that in the simplified setting. In particular, when σ is the identity function and H is the class of linear classifiers with norms bounded by Bθ, we have δ(W, H, Bθ) = 0, and thus the bound here reduces to the bound in the simplified setting. In the general case, when we choose a powerful enough classifier class H and the nonlinear embedding f[T ] preserves enough information, then there exists ˆθ, c Wall that achieves better predictions than the linear counterparts ˆθlin, c W lin all . This leads to a small (or even negative) δ(W, H, Bθ), and thus the theorem for the general case gives a similar or better bound than that in the simplified case. 6 Experiments 6.1 Experimental Setup Datasets. We perform experiments on graph-level prediction tasks from two domains: molecular property prediction (61 tasks from 11 benchmarks) and social networks (4 benchmarks).3 Specifically, we consider 37 Published in Transactions on Machine Learning Research (08/2022) Table 1: Details on the benchmark datasets used in our experiments Dataset # of Tasks Type Domain IMDB-BINARY (Yanardag & Vishwanathan, 2015) 1 Classification Social Network IMDB-MULTI (Yanardag & Vishwanathan, 2015) 1 Classification Social Network REDDIT-BINARY (Yanardag & Vishwanathan, 2015) 1 Classification Social Network COLLAB (Yanardag & Vishwanathan, 2015) 1 Classification Social Network Mutagenicity (Kazius et al., 2005) 1 Classification Chemistry Tox21 (Tox21 Data Challenge, 2014) 12 Classification Chemistry Clin Tox (Artemov et al., 2016; Gayvert et al., 2016) 2 Classification Chemistry HIV (AIDS Antiviral Screen Data, 2017) 1 Classification Chemistry MUV (Rohrer & Baumann, 2009) 17 Classification Chemistry Delaney (Delaney, 2004) 1 Regression Chemistry Malaria (Gamo et al., 2010) 1 Regression Chemistry CEP (Hachmann et al., 2011) 1 Regression Chemistry QM7 (Blum & Reymond, 2009) 1 Regression Chemistry QM8 (Ramakrishnan et al., 2015) 12 Regression Chemistry QM9 (Ruddigkeit et al., 2012) 12 Regression Chemistry classification (33 molecular + 4 social networks) and 28 regression (on molecular) tasks in total. Table 1 provides details about the datasets used in our experiments. Baseline methods. We consider WL kernels (Shervashidze et al., 2011), Morgan fingerprints (Morgan, 1965), and N-Gram Graph (Liu et al., 2019a) as baselines for graph representation learning. For the predictor on top of the representations, we use SVM for WL kernels, and Random Forest and XGBoost (Chen & Guestrin, 2016) for Morgan fingerprints and N-Gram Graph. We also consider several recent end-to-end trainable GNNs that are commonly used, including GCNN (Duvenaud et al., 2015), GAT (Veličković et al., 2017), GIN (Xu et al., 2019), Attentive FP (Xiong et al., 2019), and PNA (Corso et al., 2020). Note that we do not consider recent GNN models that use extra edge/3D information or self-supervised pre-training as baselines in order to avoid unfair comparison to AWARE since our analysis throughout this paper focuses on the standard setting (see Section 3). Attentive FP and PNA were run without using extra edge information as this is not their main contribution. Evaluation. We perform single-task learning for each task in each dataset. Each dataset is randomly split into training, validation, and test sets with a ratio of 8:1:1, respectively. We report the average performance across 5 runs (datasets are split independently for each run). We select optimal hyperparameters using grid search. We present the full hyperparameter details as well as an ablation study on their effects below. For the molecular property prediction tasks, we use evaluation metrics from the benchmark paper (Wu et al., 2018), except for the MUV dataset for which we use ROC-AUC following recent studies (Hu et al., 2019; Rong et al., 2020). For the social network tasks, we follow the evaluation metrics from (Xu et al., 2019). Hyperparameter Tuning. For AWARE, we carefully perform a hyperparameter sweeping on the different candidate values listed in Table 2. Table 2: Hyperparameter sweeping for AWARE Hyperparameters Candidate values Learning rate 1e-3, 1e-4 # of linear layers in the predictor: L 1, 2, 3 Maximum walk length: T 3, 6, 9, 12 Vertex embedding dimension: r 100, 300, 500 Random dimension: r 100, 300, 500 Optimizer Adam 3Our code can be accessed at https://github.com/mehmetfdemirel/aware Published in Transactions on Machine Learning Research (08/2022) Table 3: Overall performance on all 15 datasets (65 tasks). We report (# tasks with top-1 performance, # tasks with top-3 performance). Models with no top-3 performance on a dataset are left blank. Models that are too slow, not well tuned, or not run due to model/dataset incompatibility are marked with . For full results with error bounds, see Tables 4, 5, and 6. Dataset # Tasks Metric Morgan FP WL Kernel GCNN GAT GIN Attentive FP PNA N-Gram Graph AWARE IMDB-BINARY 1 ACC (0, 1) (0, 1) (1, 1) IMDB-MULTI 1 ACC (0, 1) (0, 1) (1, 1) REDDIT-BINARY 1 ACC (0, 1) (0, 1) (1, 1) COLLAB 1 ACC (0, 1) (0, 1) (1, 1) Mutagenicity 1 ACC (1, 1) (0, 1) (0, 1) Tox21 12 ROC (0, 4) (0, 2) (0, 5) (1, 3) (4, 11) (7, 11) Clin Tox 2 ROC (1, 1) (0, 1) (0, 1) (0, 1) (1, 2) HIV 1 ROC (1, 1) (0, 1) (0, 1) MUV 17 ROC (2, 7) (3, 4) (0, 8) (0, 1) (0, 3) (1, 2) (1, 6) (1, 4) (9, 16) Delaney 1 RMSE (0, 1) (0, 1) (1, 1) Malaria 1 RMSE (1, 1) (0, 1) (0, 1) CEP 1 RMSE (1, 1) (0, 1) (0, 1) QM7 1 MAE (0, 1) (0, 1) (1, 1) QM8 12 MAE (5, 6) (1, 7) (0, 1) (0, 11) (6, 11) QM9 12 MAE (3, 12) (4, 7) (1, 11) (4, 6) Total 65 (4, 13) (3, 6) (9, 27) (1, 2) (6, 22) (1, 13) (2, 18) (6, 41) (33, 53) For all the molecular baseline methods other than GAT, Attentive FP, and PNA, the hyperparameter search strategy outlined in (Liu et al., 2019a) has been adopted. For GAT, we use their reported optimal hyperparameters (Veličković et al., 2017; Yang et al., 2019). For Attentive FP and PNA, we performed a hyperparameter tuning that included their reported optimal hyperparameters. For social network experiments, we perform hyperparameter tuning on PNA and Attentive FP, and use the optimal hyperparameters reported for the other baseline methods. In addition, for some of the social network datasets, we remove graphs with vertices more than a certain threshold (REDDIT-BINARY: 200, COLLAB: 100), because they have many vertices with a lot of neighbors and do not fit into memory for methods using one-hot feature encoding. Training Details. We train AWARE on 9 classification and 6 regression datasets, each of which consisting of multiple tasks, resulting in a total of 37 classification and 28 regression tasks. Each dataset is split into 5 different sets of training, validation, and test sets (i.e., 5 different random seeds) with a respective ratio of 8:1:1. We train the model for 500 epochs and use early stopping on the validation set with a patience of 50 epochs. No learning rate scheduler is used. GPU Specifications. In general, an NVIDIA Ge Force GTX 1080 (8GB) GPU model was used in the training process to obtain the main experimental results. For some of the bigger datasets, we used an NVIDIA A100 (40 GB) GPU model. 6.2 Results Prediction Performance. For a quick overview, we present the relative performance of AWARE compared to the baseline methods in Table 3. We observe that AWARE achieves the best performance in 33 out of the 65 tasks, while being ranked in the top-3 performing methods for 53 tasks. In particular, AWARE (even with a simple fully-connected predictor) significantly outperforms N-Gram Graph (which uses a powerful RF or XGB predictor) in 44 tasks, and achieves comparable performance in all other tasks. This indicates that AWARE can successfully learn a weighting scheme to selectively focus on the graph information that is important for the downstream prediction task. We also present complete results for all tasks with error bounds in Tables 4, 5, and 6. These allow for more fine-grained inspection. For example, we observe in Tables 5 and 6 that both N-Gram Graph and AWARE give overall stronger performance compared to other baselines across the Tox21 tasks in Table 5 and QM8 tasks in Table 6. This suggests that the tasks from these two datasets rely heavily on walk information, which can be well-exploited by approaches using walk-level aggregation. AWARE, being able to highlight important walk types, can further improve the performance of N-Gram Graph as observed in Tables 5 and 6. The Effect of Hyperparameters. We also analyze the effect of different hyperparameters on the prediction performance. Figure 2 demonstrates the effect of the maximum walk length T and the latent dimension r , Published in Transactions on Machine Learning Research (08/2022) Table 4: In this table, we present the performance of 8 models on 4 classification tasks in the domain of social networks (Morgan FP is excluded as it works only on molecular graphs). Experiments are run on 5 different random seeds, and the average of the 5 reported for each task along with their standard deviation in the subscript. The top-3 models in each task are highlighted in gray and the best one is highlighted in blue . Higher is better. Task # of Classes Metric WL Kernel GCNN GAT GIN Attentive FP PNA N-Gram Graph AWARE IMDB-BINARY 2 ACC 0.680 0.022 0.698 0.026 0.568 0.047 0.696 0.037 0.716 0.022 0.710 0.011 0.522 0.036 0.740 0.020 IMDB-MULTI 3 ACC 0.403 0.027 0.459 0.033 0.366 0.025 0.473 0.031 0.481 0.021 0.489 0.031 0.341 0.019 0.499 0.026 REDDIT-BINARY 2 ACC 0.892 0.017 0.931 0.013 0.900 0.036 0.933 0.009 0.864 0.029 0.938 0.010 0.764 0.026 0.949 0.014 COLLAB 3 ACC 0.567 0.011 0.660 0.009 0.616 0.029 0.669 0.014 0.653 0.012 0.675 0.024 0.376 0.119 0.739 0.017 Table 5: In this table, we present the performance of 9 models on 33 classification tasks from the domain of molecular property prediction. Experiments are run on 5 different random seeds, and the average of the 5 run results is reported for each task along with their standard deviation in the subscript. The top-3 models in each task are highlighted in gray and the best one is highlighted in blue (breaking ties by checking more digits in the average result). We mark incompatible task/model pairs with a . Higher is better. Dataset/Task Metric Morgan FP WL Kernel GCNN GAT GIN Attentive FP PNA N-Gram Graph AWARE Mutagenicity ACC 0.684 0.083 0.758 0.011 0.601 0.017 0.747 0.019 0.657 0.029 0.753 0.013 0.506 0.011 0.757 0.040 Tox21 tasks NR-AR ROC 0.763 0.043 0.701 0.068 0.762 0.035 0.754 0.058 0.759 0.048 0.783 0.035 0.786 0.039 0.776 0.049 0.786 0.041 NR-AR-LBD ROC 0.858 0.048 0.861 0.053 0.844 0.046 0.800 0.056 0.830 0.046 0.839 0.065 0.838 0.045 0.873 0.039 0.865 0.054 NR-Ah R ROC 0.890 0.010 0.876 0.017 0.886 0.017 0.823 0.020 0.872 0.016 0.878 0.011 0.901 0.013 0.897 0.008 0.889 0.006 NR-Aromatase ROC 0.821 0.024 0.818 0.027 0.828 0.024 0.744 0.039 0.760 0.053 0.844 0.019 0.837 0.018 0.852 0.013 0.861 0.019 NR-ER ROC 0.726 0.036 0.704 0.031 0.737 0.018 0.706 0.042 0.683 0.021 0.747 0.014 0.738 0.030 0.754 0.020 0.765 0.028 NR-ER-LBD ROC 0.838 0.043 0.799 0.033 0.813 0.048 0.764 0.023 0.772 0.032 0.808 0.037 0.815 0.039 0.834 0.030 0.853 0.059 NR-PPAR-gamma ROC 0.840 0.063 0.845 0.060 0.816 0.036 0.758 0.035 0.780 0.062 0.848 0.053 0.841 0.067 0.857 0.053 0.862 0.040 SR-ARE ROC 0.820 0.016 0.801 0.029 0.809 0.014 0.735 0.020 0.794 0.020 0.809 0.028 0.821 0.019 0.851 0.014 0.828 0.011 SR-ATAD5 ROC 0.850 0.017 0.814 0.020 0.827 0.052 0.754 0.052 0.803 0.050 0.807 0.047 0.821 0.055 0.853 0.025 0.841 0.025 SR-HSE ROC 0.797 0.019 0.803 0.037 0.774 0.037 0.686 0.038 0.740 0.062 0.787 0.037 0.778 0.027 0.808 0.025 0.820 0.026 SR-MMP ROC 0.890 0.007 0.875 0.017 0.877 0.017 0.834 0.014 0.872 0.025 0.895 0.018 0.873 0.019 0.905 0.015 0.905 0.014 SR-p53 ROC 0.844 0.012 0.842 0.044 0.818 0.015 0.733 0.036 0.817 0.026 0.804 0.026 0.843 0.024 0.860 0.019 0.852 0.030 Clin Tox tasks CT_TOX ROC 0.813 0.036 0.830 0.057 0.860 0.027 0.828 0.075 0.859 0.063 0.873 0.053 0.895 0.043 0.849 0.024 0.905 0.038 FDA_APPROVED ROC 0.795 0.084 0.862 0.029 0.866 0.028 0.899 0.033 0.883 0.025 0.870 0.070 0.879 0.022 0.852 0.044 0.895 0.050 HIV ROC 0.856 0.012 0.811 0.015 0.813 0.014 0.783 0.015 0.829 0.014 0.796 0.016 0.822 0.013 0.843 0.017 0.825 0.014 MUV-466 ROC 0.765 0.142 0.708 0.130 0.736 0.061 0.749 0.109 0.705 0.134 0.574 0.161 0.713 0.085 0.724 0.100 0.830 0.078 MUV-548 ROC 0.953 0.036 0.917 0.061 0.960 0.022 0.764 0.117 0.793 0.113 0.865 0.056 0.966 0.016 0.925 0.061 0.976 0.016 MUV-600 ROC 0.536 0.098 0.536 0.106 0.570 0.091 0.437 0.095 0.575 0.153 0.508 0.128 0.680 0.111 0.675 0.108 0.687 0.062 MUV-644 ROC 0.893 0.068 0.944 0.028 0.885 0.024 0.762 0.161 0.749 0.094 0.776 0.133 0.913 0.069 0.799 0.085 0.909 0.029 MUV-652 ROC 0.725 0.131 0.653 0.139 0.694 0.177 0.493 0.124 0.645 0.071 0.593 0.111 0.659 0.124 0.688 0.117 0.819 0.084 MUV-689 ROC 0.676 0.277 0.735 0.217 0.671 0.257 0.553 0.247 0.775 0.088 0.452 0.220 0.666 0.172 0.669 0.203 0.833 0.077 MUV-692 ROC 0.693 0.199 0.447 0.193 0.581 0.235 0.626 0.170 0.629 0.118 0.581 0.174 0.618 0.209 0.606 0.147 0.639 0.194 MUV-712 ROC 0.927 0.058 0.889 0.072 0.936 0.038 0.760 0.162 0.773 0.195 0.946 0.040 0.881 0.119 0.812 0.103 0.931 0.059 MUV-713 ROC 0.554 0.206 0.787 0.093 0.731 0.109 0.586 0.109 0.567 0.183 0.526 0.094 0.648 0.093 0.715 0.089 0.781 0.151 MUV-733 ROC 0.709 0.101 0.707 0.108 0.751 0.129 0.637 0.053 0.558 0.198 0.664 0.136 0.632 0.168 0.696 0.084 0.819 0.127 MUV-737 ROC 0.791 0.092 0.773 0.071 0.796 0.082 0.675 0.087 0.723 0.093 0.794 0.063 0.810 0.111 0.879 0.049 0.917 0.058 MUV-810 ROC 0.794 0.111 0.875 0.052 0.714 0.124 0.588 0.166 0.682 0.188 0.604 0.084 0.782 0.133 0.680 0.094 0.820 0.103 MUV-832 ROC 0.986 0.014 0.964 0.034 0.926 0.042 0.923 0.036 0.918 0.129 0.714 0.121 0.960 0.037 0.969 0.030 0.973 0.027 MUV-846 ROC 0.877 0.128 0.884 0.066 0.911 0.067 0.863 0.151 0.764 0.112 0.857 0.094 0.940 0.024 0.781 0.100 0.964 0.027 MUV-852 ROC 0.890 0.096 0.867 0.109 0.882 0.099 0.743 0.133 0.735 0.194 0.863 0.047 0.850 0.086 0.834 0.141 0.917 0.090 MUV-858 ROC 0.701 0.080 0.677 0.186 0.705 0.106 0.650 0.205 0.746 0.134 0.553 0.147 0.760 0.110 0.630 0.148 0.657 0.186 MUV-859 ROC 0.530 0.082 0.533 0.094 0.613 0.173 0.499 0.076 0.607 0.126 0.681 0.095 0.604 0.037 0.724 0.145 0.653 0.186 and Figure 3 shows the impact of the number of layers L in the final predictor and the vertex embedding dimension r. In general, the performance is quite stable across different hyperparameter values. This indicates that our algorithm is friendly towards hyperparameter tuning. 6.3 Ablation Studies In this section, we perform three different ablation studies to further explore our method. First, we perform a study to examine the impact of each weighting component Wv, Ww and Wg in AWARE. We individually remove one component from the model and compare its performance to the full model. We also compare our full model to the version with linear σ, i.e., σ(z) = z. Table 7 shows that the weighting components mostly lead to better performance even though there are cases in which they may not. We see that all three weighting components contribute to improved performance for most tasks. Notably, there exist tasks for Published in Transactions on Machine Learning Research (08/2022) Table 6: In this table, we present the performance of 9 models on 28 regression tasks from the domain of molecular property prediction. Experiments are run on 5 different random seeds, and the average of the 5 run results are reported for each task along with their standard deviation in the subscript. The top-3 models in each task are highlighted in gray and the best one is highlighted in blue (breaking ties by checking more digits in the average result). Models that are too slow are left blank. Lower is better. Dataset/Task Metric Morgan FP WL Kernel GCNN GAT GIN Attentive FP PNA N-Gram Graph AWARE Delaney RMSE 1.081 0.073 1.160 0.050 0.762 0.151 0.954 0.151 0.840 0.070 0.615 0.026 0.922 0.122 0.744 0.068 0.585 0.042 Malaria RMSE 0.995 0.028 1.090 0.037 1.141 0.057 1.136 0.035 1.129 0.032 1.080 0.028 1.048 0.022 1.030 0.039 1.056 0.036 CEP RMSE 1.274 0.047 1.783 0.083 1.457 0.112 1.344 0.112 1.064 0.057 1.108 0.046 1.153 0.052 1.409 0.029 1.233 0.040 QM7 MAE 118.883 2.421 173.582 4.293 76.000 2.743 213.014 10.618 82.681 3.979 74.710 9.079 108.913 25.555 49.661 4.246 39.697 3.400 E1-CC2 MAE 0.009 0.000 0.033 0.001 0.007 0.001 0.012 0.002 0.008 0.001 0.012 0.001 0.008 0.001 0.007 0.000 0.007 0.000 E2-CC2 MAE 0.011 0.000 0.024 0.001 0.007 0.000 0.012 0.001 0.008 0.000 0.013 0.001 0.010 0.000 0.008 0.000 0.008 0.000 f1-CC2 MAE 0.016 0.001 0.071 0.001 0.016 0.002 0.020 0.003 0.014 0.001 0.020 0.002 0.015 0.001 0.015 0.000 0.013 0.000 f2-CC2 MAE 0.035 0.001 0.080 0.001 0.033 0.001 0.038 0.001 0.031 0.001 0.039 0.001 0.032 0.001 0.030 0.001 0.030 0.002 E1-PBE0 MAE 0.009 0.000 0.034 0.001 0.006 0.001 0.015 0.004 0.007 0.001 0.012 0.000 0.008 0.001 0.007 0.000 0.007 0.000 E2-PBE0 MAE 0.011 0.000 0.029 0.001 0.007 0.000 0.012 0.002 0.008 0.000 0.012 0.001 0.009 0.001 0.007 0.000 0.008 0.000 f1-PBE0 MAE 0.014 0.000 0.067 0.001 0.012 0.000 0.016 0.001 0.011 0.001 0.017 0.001 0.013 0.001 0.012 0.000 0.011 0.001 f2-PBE0 MAE 0.028 0.001 0.078 0.000 0.025 0.001 0.030 0.001 0.024 0.001 0.031 0.001 0.025 0.000 0.024 0.000 0.022 0.001 E1-CAM MAE 0.009 0.000 0.033 0.001 0.006 0.001 0.012 0.003 0.007 0.001 0.012 0.001 0.007 0.000 0.006 0.000 0.006 0.000 E2-CAM MAE 0.010 0.000 0.026 0.001 0.006 0.000 0.011 0.001 0.007 0.001 0.013 0.001 0.009 0.000 0.007 0.000 0.007 0.000 f1-CAM MAE 0.015 0.001 0.072 0.001 0.013 0.000 0.018 0.001 0.012 0.001 0.017 0.001 0.013 0.001 0.013 0.001 0.012 0.001 f2-CAM MAE 0.030 0.001 0.080 0.001 0.027 0.001 0.034 0.003 0.027 0.001 0.035 0.003 0.027 0.001 0.026 0.001 0.024 0.001 mu MAE 0.625 0.003 0.506 0.019 0.654 0.011 0.476 0.008 0.562 0.020 0.575 0.012 0.536 0.002 0.535 0.007 alpha MAE 3.348 0.018 0.533 0.083 1.033 0.144 0.688 0.081 1.076 0.157 3.322 0.661 0.595 0.004 0.774 0.035 homo MAE 0.007 0.000 0.004 0.000 0.008 0.001 0.004 0.000 0.009 0.000 0.007 0.001 0.005 0.000 0.006 0.000 lumo MAE 0.009 0.000 0.004 0.000 0.009 0.002 0.004 0.000 0.009 0.000 0.008 0.001 0.005 0.001 0.005 0.000 gap MAE 0.010 0.000 0.006 0.000 0.011 0.001 0.005 0.000 0.012 0.000 0.010 0.001 0.007 0.000 0.007 0.000 r2 MAE 97.768 0.405 30.788 2.295 100.926 8.128 36.583 1.937 82.265 8.864 97.403 18.507 56.776 0.283 83.000 8.780 zpve MAE 0.008 0.000 0.001 0.000 0.004 0.002 0.001 0.000 0.002 0.000 0.008 0.001 0.000 0.000 0.001 0.000 cv MAE 1.422 0.010 0.229 0.014 0.541 0.220 0.248 0.013 0.521 0.062 1.318 0.256 0.334 0.004 0.586 0.042 u0 MAE 14.657 0.153 0.906 0.337 1.698 1.589 2.283 0.567 2.715 1.299 22.330 3.091 0.427 0.032 0.090 0.017 u298 MAE 14.647 0.148 1.126 0.494 5.110 5.487 2.032 0.453 2.683 1.263 21.365 2.566 0.428 0.032 0.086 0.009 h298 MAE 14.650 0.146 0.785 0.292 2.066 1.159 2.308 0.580 2.930 1.093 20.880 5.738 0.429 0.032 0.098 0.007 g298 MAE 14.651 0.149 0.646 0.169 2.576 1.555 2.269 0.596 4.014 1.422 19.794 3.679 0.427 0.028 0.086 0.010 Table 7: Ablation study I: Change in performance on removing/modifying components of AWARE. +" / -" indicate relatively better/worse performance respectively. Dataset Task No Wv No Ww No Wg No Wv, Ww or Wg Linear σ IMDB-BINARY IMDB-BINARY 5.03% +1.12% 1.96% 7.54% 10.06% Tox21 NR-AR +1.32% 0.76% +0.67% +0.37% 1.12% Clin Tox CT_TOX 9.00% 2.09% +0.70% 3.07% 10.35% Clin Tox FDA_APPROVED 7.83% 2.39% +1.38% 4.16% 10.40% MUV MUV-466 20.08% 16.70% 7.44% +1.80% 18.73% Delaney Delaney 28.45% 0.18% 4.69% 57.80% 76.17% Malaria Malaria 0.83% +2.10% 1.15% 2.32% 5.86% QM7 QM7 11.59% +3.74% 18.45% 71.39% 85.21% Table 8: Ablation study II: Change in AWARE s performance when the vertex embedding matrix W is randomly initialized and non-trainable, with linear σ. Underline indicates better performance. Dataset Task Metric Trainable W Fixed Random W IMDB-BINARY IMDB-BINARY ACC 0.716 0.660 Tox21 NR-AR ROC-AUC 0.776 0.774 Clin Tox CT_TOX ROC-AUC 0.889 0.764 Clin Tox FDA_APPROVED ROC-AUC 0.869 0.774 Delaney Delaney RMSE 0.612 1.162 Malaria Malaria RMSE 1.062 1.126 QM7 QM7 MAE 41.280 96.675 which specific weights lead to a drop in performance. Aligning with Theorems 1 and 3 in Section 5, this Published in Transactions on Machine Learning Research (08/2022) 3 6 9 12 0.7 3 6 9 12 0.7 3 6 9 12 0.7 3 6 9 12 0.7 NR-Aromatase 3 6 9 12 0.7 3 6 9 12 0.7 3 6 9 12 0.7 NR-PPAR-gamma 3 6 9 12 0.7 3 6 9 12 0.7 r = 100 r = 300 r = 500 Figure 2: Effect of T and r on the prediction performance on the 12 tasks in the Tox21 dataset. For each pair of T and r hyperparameter values, the model was run on 5 different seeds of data and the average of the 5 runs is reported. Higher is better. FDA_APPROVED r = 100 r = 300 r = 500 Figure 3: Effect of the number of linear layers L in the fully connected neural network for graph-level prediction and the vertex embedding dimension r on the prediction performance on the 2 tasks in the Clin Tox dataset. For each pair of L and r hyperparameter values, the model was run on 5 different seeds of data and the average of the 5 runs is reported. Higher is better. Published in Transactions on Machine Learning Research (08/2022) Table 9: Ablation study III: Change in AWARE s performance when the final predictor is changed from a multiple layer NN to a linear predictor. Underline indicates better performance. Dataset Task Metric Multiple layers Linear predictor IMDB-BINARY IMDB-BINARY ACC 0.716 0.678 Tox21 NR-AR ROC-AUC 0.776 0.759 Clin Tox CT_TOX ROC-AUC 0.889 0.880 Clin Tox FDA_APPROVED ROC-AUC 0.869 0.870 Delaney Delaney RMSE 0.612 0.640 Malaria Malaria RMSE 1.062 1.070 QM7 QM7 MAE 41.280 415.155 indicates that weighting schemes are successful in learning important artifacts for the downstream task only under specific conditions. Furthermore, we can also observe the advantage of using a non-linear activation function σ over a linear one. Second, we analyze the change in performance when a non-trainable vertex embedding matrix W and a linear σ are used. Table 8 demonstrates using a trainable random vertex embedding matrix W and a non-linear σ gives overall better performance. It also shows that even with random W and a linear σ, our method can still get decent performance providing justification for the simplification assumptions in our theoretical analysis. Third, we examine the advantage of using a fully-connected neural network with multiple linear layers as a predictor over using a simple linear predictor. Table 9 suggests that using multiple layers in the final predictor leads to better performance in general. 7 Interpretation and Visualization AWARE uses an attention mechanism at the walk level (Ww) to aggregate crucial information from the neighbors of each vertex (Section 4). While we have demonstrated the empirical effectiveness of this in Section 6, we now focus on validating our analysis that AWARE can highlight important substructures of the input graph for the prediction task. For this analysis, we use the Mutagenicity dataset (Kazius et al., 2005), which comes with the ground-truth information that molecules that contain specific chemical groups (-NO2, -NH2) are much more likely to be assigned a mutagen label (Debnath et al., 1991). This dataset has been introduced for the purpose of increasing accuracy and reliability in mutagenicity predictions for molecular compounds. Mutagenicity of a molecular compound, among many other attributes, is known to impede its ability to become a usable drug. A mutagen is a physical or chemical factor that has the potential to alter the DNA of an organism, which in turn increases the possibility of mutations. The dataset contains 4337 molecular structures with 2401 labeled as mutagen . Molecular structures in this dataset contain around 30 atoms on average. To find substructures that AWARE uses for its prediction, we compute the importance score for each bond (edge) of the molecule by using the attention scores computed via Equation (7) (Specifically for an edge i j, we use [S(T )]ij + [S(T )]ji). Accordingly, we visualize two randomly chosen mutagenic molecules in Figure 4 and the important substructures as attributed by different interpretation techniques. Figures 4b and 4c depict the interpretation of the GIN model (Xu et al., 2019) using Grad and GNNExplainer techniques (Ying et al., 2019). The former computes gradients with respect to the adjacency matrix and vertex features, while the latter extracts substructures with the closest property prediction to the complete graph. In the first molecule, although both of these techniques are able to highlight the two NH2 groups as important for the final prediction, they fail to highlight the NO2 group. In the second molecule, while Grad fails to identify the NO2 atom group, GNNExplainer marks majority of the bonds (edges) in the molecule as important, which should not be the case. In contrast, AWARE can successfully highlight both the NH2 and NO2 groups as important in the first molecule as well as the NO2 group in the second one, as can be seen in Figure 4d. This provides further evidence that AWARE is able to identify substructures in the graph that are significant (or insignificant) for Published in Transactions on Machine Learning Research (08/2022) 16H 17H 18H 0C 1C 2C 3N 4C 5C 6C 7N 8O 9O 10C 11N 12C 13H 14H 15H 16H 17H 18H 19H 20H 21H 22H 23H (a) Molecule (c) GNNExplainer 0C 1C 2C 3C 4C 5C 6C 7C 8C 9C 10C 11C 12C 13C 14C 15C 16N 17O 18C 19C 20C 21O 22O 23C 24H 25H 26H 27H 28H 29H 30H 31H 32H 33H 34H (e) edge importance Figure 4: Visualization of two random mutagen molecules from Mutagenicity and their important substructures for accurate prediction captured by different interpretation techniques. Different node colors indicate different atom types. (a) depicts the original molecules with important mutagenic atom groups circled in red, such as NO2 and NH2. (b), (c), and (d) demonstrate important substructures detected by different methods. (e) is a heatmap for the edge importance scores computed by AWARE. 0.4 0.2 0.0 0.2 0.4 f[T](Gi) for yi = 0 f[T](Gi) for yi = 1 Figure 5: Interpretation of graph-level attention Wg for the NR-AR classification task. a given downstream prediction task (In the examples given in Figure 4, we set a threshold ( 1.0) on the importance scores computed by AWARE to highlight important substructures in the molecules). Interpretation for Wg. AWARE uses Wg to selectively weight the embeddings at the graph level for the prediction task (Section 4). Towards interpreting Wg, we want to analyze how well it aligns with the predictor for the downstream task. Specifically, we train AWARE for the binary classification NR-AR task (Tox21 dataset) using a linear predictor with parameter w (without a non-linear activation function). We randomly sample 200 data points, and compute their graph embeddings f[T ](G) from AWARE. We denote the top three left singular vectors of Wg by {u1, u2, u3}. For a particular ui, we define vi = [ui, ui, . . . (T times)] to bring ui to the same-dimensional space as f[T ](G). Finally, we plot {v1, v2, v3}, w, and the embeddings f[T ](G) for all 200 samples in Figure 5 using PCA with n = 2 components. We observe that Wg s largest singular vector direction aligns very well with the parameter w of the downstream predictor that the model has learned. This suggests that this weight can successfully emphasize the directions in the graph embedding space that are important for the prediction. Published in Transactions on Machine Learning Research (08/2022) 8 Conclusion In this work, we present and analyze a novel attentive walk-aggregating GNN, AWARE, providing the first provable guarantees on the learning performance of weighted GNNs by identifying the specific conditions under which weighting can improve the learning performance in the standard setting. Our experiments on 65 graph-level prediction tasks from the domains of molecular property prediction and social networks demonstrate that AWARE overall outperforms traditional and recent baselines in the standard setting where only adjacency and vertex attribute information are used. Our interpretability study lends support to our algorithm design and theoretical insights by providing concrete evidence that the attention mechanism works in favor of emphasizing the important walks in the graph while diminishing the others. Lastly, our ablation studies show the importance of the different components and design choices of our model. Supported with the strong representation power by AWARE, we believe that it can be further explored for a wide range of tasks, including but not limited to multi-task learning (e.g., Liu et al. (2019b; 2022b)), self-supervised pretraining (e.g., Liu et al. (2022a;c)), few-shot learning (e.g., Altae-Tran et al. (2017)), etc. We would also like to briefly touch on the ethical impacts and weaknesses of our method here. Though AWARE can be used for graph-structured data from distinct data domains, we will be highlighting the ethical implications of our method for the important domain of molecular property prediction. Having strong empirical performance for the molecular property prediction domain, AWARE can potentially be used for efficient drug development process. Physical experiments for this task can be expensive and slow, which can be alleviated by using AWARE for an initial virtual screening (selecting high-confident candidates from a large pool before physical screening). A strong empirically performing model like AWARE can help speed up the process, and provide tremendous cost savings for this important task. However, deploying an automatic ML prediction model for such a highly critical task must be done extremely carefully. As evidenced by Section 6, AWARE does not achieve the best performance for all molecular property prediction tasks. Thus, AWARE may fail to identify promising chemicals for drug development, and/or make erroneous selections. While the former may increase the time and cost of the process, the latter might lead to failures in developing the drug. Nevertheless, with sufficient physical experimentation performed by human experts, such unwanted events can be minimized while still enjoying the benefits of using AWARE. 9 Acknowledgement The work is partially supported by Air Force Grant FA9550-18-1-0166, the National Science Foundation (NSF) Grants 2008559-IIS, CCF-2046710, and 2023239-DMS. AIDS Antiviral Screen Data. Aids antiviral screen data. https://wiki.nci.nih.gov/display/NCIDTPdata/ AIDS+Antiviral+Screen+Data, 2017. Accessed: 2020-12-20. 17, 32 Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Central Science, 3(4):283 293, 2017. 24 Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and lstm. International Conference on Learning Representations, 2018. 12, 30 Artem V Artemov, Evgeny Putin, Quentin Vanhaelen, Alexander Aliper, Ivan V Ozerov, and Alex Zhavoronkov. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bio Rxiv, pp. 095653, 2016. 17 Waiss Azizian and marc lelarge. Expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id= lx Hg XYN4bwl. 3 Published in Transactions on Machine Learning Research (08/2022) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2014. URL http://arxiv.org/abs/1409.0473. cite arxiv:1409.0473Comment: Accepted at ICLR 2015 as oral presentation. 2 Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502 4510, 2016. 3 Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13. Journal of the American Chemical Society, 131(25):8732 8733, 2009. 17, 32 Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Techical Report, 2009. 30 Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589 592, 2008. 29 Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203 4215, 2005. 29 Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785 794. ACM, 2016. 17 Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 518 533, 2018. 3 Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33, 2020. 17 Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786 797, 1991. 22 Nima Dehmamy, Albert-Laszlo Barabasi, and Rose Yu. Understanding the representation power of graph neural networks in learning graph topology. Advances in Neural Information Processing Systems 32 (NIPS 2019), 2019. 2, 3 John S. Delaney. ESOL: Estimating Aqueous Solubility Directly from Molecular Structure. Journal of Chemical Information and Computer Sciences, 44(3):1000 1005, May 2004. ISSN 0095-2338. doi: 10.1021/ci034243x. 17, 32 Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M. Rush. Latent alignment and variational attention. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pp. 9735 9747, Red Hook, NY, USA, 2018. Curran Associates Inc. 2 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171 4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423. 2, 3 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. 2, 3 Published in Transactions on Machine Learning Research (08/2022) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224 2232, 2015. 1, 3, 17 Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Bull. Am. Math, 54:151 165, 2017. 29 Francisco-Javier Gamo, Laura M. Sanz, Jaume Vidal, Cristina de Cozar, Emilio Alvarez, Jose-Luis Lavandera, Dana E. Vanderwall, Darren V. S. Green, Vinod Kumar, Samiul Hasan, James R. Brown, Catherine E. Peishoff, Lon R. Cardon, and Jose F. Garcia-Bustos. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305 310, May 2010. ISSN 1476-4687. doi: 10.1038/nature09107. 17, 32 Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology, 23(10):1294 1301, 2016. 17, 32 Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. ar Xiv preprint ar Xiv:1704.01212, 2017. 2, 3 Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik. The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241 2251, September 2011. ISSN 1948-7185. doi: 10.1021/jz200866s. 17, 32 Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. ar Xiv preprint ar Xiv:1905.12265, 2019. 17 Shiva Prasad Kasiviswanathan and Mark Rudelson. Restricted isometry property under high correlations. ar Xiv preprint ar Xiv:1904.05510, 2019. 30 Jeroen Kazius, Ross Mc Guire, and Roberta Bursi. Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry, 48(1):312 320, 2005. 17, 22, 32 Steven Kearnes, Kevin Mc Closkey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595 608, 2016. 3 Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. 1, 2, 3 Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. ar Xiv preprint ar Xiv:2003.03123, 2020. 3 Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ar Xiv preprint ar Xiv:1511.05493, 2015. 3 Shengchao Liu, Mehmet F Demirel, and Yingyu Liang. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. In Advances in Neural Information Processing Systems, pp. 8466 8478, 2019a. 2, 3, 4, 5, 6, 7, 10, 17, 18, 29 Shengchao Liu, Yingyu Liang, and Anthony Gitter. Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 9977 9978, 2019b. 24 Shengchao Liu, Hongyu Guo, and Jian Tang. Molecular geometry pretraining with se (3)-invariant denoising distance matching. ar Xiv preprint ar Xiv:2206.13602, 2022a. 24 Shengchao Liu, Meng Qu, Zuobai Zhang, Huiyu Cai, and Jian Tang. Structured multi-task learning for molecular property prediction. In International Conference on Artificial Intelligence and Statistics, pp. 8906 8920. PMLR, 2022b. 24 Published in Transactions on Machine Learning Research (08/2022) Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. In International Conference on Learning Representations, 2022c. URL https://openreview.net/forum?id=x QUe1p OKPam. 24 Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. ar Xiv preprint ar Xiv:1508.04025, 2015. 2 Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multilabel classification. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1614 1623, New York, New York, USA, 20 22 Jun 2016. PMLR. URL http://proceedings.mlr. press/v48/martins16.html. 3 Lukasz Maziarka, Tomasz Danel, Slawomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanislaw Jastrzkebski. Molecule attention transformer. ar Xiv preprint ar Xiv:2002.08264, 2020. 2, 3 HL Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5(2):107 113, 1965. 17 Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4602 4609, 2019. 3 Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701 710, 2014. 2, 3, 4 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1 67, 2020. URL http://jmlr.org/papers/v21/20-074.html. 3 Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. ar Xiv preprint ar Xiv:1906.05909, 2019. 3 Raghunathan Ramakrishnan, Mia Hartmann, Enrico Tapavicza, and O Anatole Von Lilienfeld. Electronic spectra from tddft and machine learning in chemical space. The Journal of chemical physics, 143(8):084111, 2015. 17, 32 Sebastian G Rohrer and Knut Baumann. Maximum unbiased validation (muv) data sets for virtual screening based on pubchem bioactivity data. Journal of chemical information and modeling, 49(2):169 184, 2009. 17, 32 Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Grover: Self-supervised message passing transformer on large-scale molecular data. ar Xiv preprint ar Xiv:2007.02835, 2020. 2, 3, 17 Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of chemical information and modeling, 52(11):2864 2875, 2012. 17, 32 Mark Rudelson, Roman Vershynin, et al. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013. 32 Shiv Shankar, Siddhant Garg, and Sunita Sarawagi. Surprisingly easy hard-attention for sequence to sequence learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 640 645, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1065. URL https://www.aclweb.org/anthology/D18-1065. 2 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539 2561, 2011. 2, 3, 17 Published in Transactions on Machine Learning Research (08/2022) Tox21 Data Challenge. Tox21 data challenge 2014. https://tripod.nih.gov/tox21/challenge/, 2014. 17, 32 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000 6010, 2017. 2 Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. ar Xiv preprint ar Xiv:1710.10903, 2017. 2, 3, 17, 18 Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. ar Xiv preprint ar Xiv:1511.06391, 2015. 3 S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. The Journal of Machine Learning Research, 11:1201 1242, 2010. 2, 3, 4 Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural networks. In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 3089 3096. International Joint Conferences on Artificial Intelligence Organization, 8 2021. doi: 10.24963/ijcai.2021/425. URL https://doi.org/10.24963/ijcai.2021/425. Main Track. 3 Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9 (2):513 530, 2018. 17 Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of medicinal chemistry, 63(16):8749 8760, 2019. 17 Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2048 2057, Lille, France, 07 09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/xuc15.html. 2 Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id= ry Gs6i A5Km. 1, 2, 3, 7, 17, 22 Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365 1374. ACM, 2015. 17, 32 Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370 3388, 2019. 3, 18 Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974 983, 2018. 1 Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. In Advances in neural information processing systems, pp. 9244 9255, 2019. 22 Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. In Advances in Neural Information Processing Systems, pp. 11983 11993, 2019. 2 Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076 10085, 2020. 3 Published in Transactions on Machine Learning Research (08/2022) Attentive Walk-Aggregating Graph Neural Networks A Toolbox for Theoretical Analysis A.1 Toolbox from Compressive Sensing For completeness, here we include the review from (Liu et al., 2019a) about related concepts in the field of compressed sensing that are important for our analysis. Please refer to (Foucart & Rauhut, 2017) for more details. The primary goal of compressed sensing is to recover a high-dimensional k-sparse signal x RN from a few linear measurements. Here, being k-sparse means that x has at most k non-zero entries, i.e., |x|0 k. In the noiseless case, we have a design matrix A Rd N and the measurement vector is z = Ax. The optimization formulation is then minimizex x 0 subject to Ax = z (76) where x 0 is ℓ0 norm of x , i.e., the number of non-zero entries in x . The assumption that x is the sparsest vector satisfying Ax = z is equivalent to that x is the optimal solution for (76). Unfortunately, the ℓ0-minimization in (76) is NP-hard. The typical approach in compressed sensing is to consider its convex surrogate using ℓ1-minimization: minimizex x 1 subject to Ax = z (77) where x 1 = P i |x i| is the ℓ1 norm of x . The fundamental question is when the optimal solution of (76) is equivalent to that of (77), i.e., when exact recovery is guaranteed. A.1.1 The Restricted Isometry Property One common condition for recovery is the Restricted Isometry Property (RIP): Definition 6. A Rd N is (X, ϵ)-RIP for some subset X RN if for any x X, (1 ϵ) x 2 Ax 2 (1 + ϵ) x 2. (78) We will abuse notation and say (k, ϵ)-RIP if X is the set of all k-sparse x RN. Introduced by (Candes & Tao, 2005), RIP has been used to show to guarantee exact recovery. Theorem 6 (Restatement of Theorem 1.1 in (Candes, 2008)). Suppose A is (2k, ϵ)-RIP for an ϵ < 2 1. Let ˆx denote the solution to (77), and let xk denote the vector x with all but the k-largest entries set to zero. Then ˆx x 1 C0 xk x 1 (79) ˆx x 2 C0k 1/2 xk x 1. (80) In particular, if x is k-sparse, the recovery is exact. Furthermore, it has been shown that A is (k, ϵ)-RIP with overwhelming probability when d = Ω(k log N d Aij N(0, 1)( i, j) or d Aij U{ 1, 1}( i, j). There are also many others types of A with RIP; see (Foucart & Rauhut, 2017). Published in Transactions on Machine Learning Research (08/2022) A.1.2 Compressed Learning Given that Ax preserves the information of sparse x when A is RIP, it is then natural to study the performance of a linear classifier learned on Ax compared to that of the best linear classifier on x. Our analysis will use a theorem from (Arora et al., 2018) that generalizes that of (Calderbank et al., 2009). Let X RN denote X = {x : x RN, x 0 k, x 2 B}. (81) Let {(xi, yi)}M i=1 be a set of M samples i.i.d. from some distribution over X { 1, 1}. Let ℓdenote a λℓ-Lipschitz convex loss function. Let ℓD(θ) denote the risk of a linear classifier with weight θ RN, i.e., ℓD(θ) = E[ℓ( θ, x , y)], and let θ denote a minimizer of ℓD(θ). Let ℓA D(θ) denote the risk of a linear classifier with weight θ Rd over Ax, i.e., ℓA D(θA) = E[ℓ( θA, Ax , y)], and let ˆθA denote the weight learned with ℓ2-regularization over {(Axi, yi)}i: ˆθA = arg min θ 1 M i=1 ℓ( θ, Axi , yi) + λ θ 2 (82) where λ is the regularization coefficient. Theorem 7 (Restatement of Theorem 4.2 in (Arora et al., 2018)). Suppose A is ( X, ϵ)-RIP. Then with probability at least 1 δ, ℓA D(ˆθA) ℓD(θ ) + O for appropriate choice of λ. Here, X = {x x : x, x X} for any X RN. A.2 Tools for the Proof of Theorem 2 For the proof, we concern about whether the ℓ-way column product of W has RIP. Existing results in the literature do not directly apply in our case. But following the ideas in Theorem 4.3 in (Kasiviswanathan & Rudelson, 2019), we are able to prove the following theorem for our purpose. Theorem 8. Let X be an n d matrix, and let R be a d N random matrix with independent entries Rij such that E[Rij] = 0, E[R2 ij] = 1, and |Rij| τ almost surely. Let t 2 be a constant. Let ϵ (0, 1), and let k be an integer satisfying sr(X) Cτ 4tk3 k for some universal constant C > 0. Then with probability at least 1 exp( cϵ2sr(X)/(k2τ 4t)) for some universal constant c > 0, the matrix XR[t]/ X F is (k, ϵ)-RIP. Here, sr(X) = X 2 F / X 2 is the stable rank of X. In our case, we will apply the theorem with X being Id d/ d where Id d Rd d is the identity matrix. Proof of Theorem 8. The proof follows the idea in Theorem 4.3 in (Kasiviswanathan & Rudelson, 2019). However, their analysis is for a different type of matrices (ℓ-way Column Hadamard Product). We thus include a proof for our case for completeness. Let u Rdt be a vector with sparsity k, and its entries indexed by sequences (i1, i2, . . . , it) [d] t. Let ℓ [p], and define (i1,i2,...,it) [d] t u(i1,i2,...,it) j=1 Rℓij. (84) Note that the random variables yℓ(ℓ [p]) are independent. We will now estimate the ψ2-norm of yℓand then use the Hanson-Wright inequality (and its corollaries) with a net argument to establish the concentration for the norm of XR[ℓ]u = Xy where y = (y1, . . . , yp). Published in Transactions on Machine Learning Research (08/2022) Let supp(u) be the support of u. By the triangle inequality, (i1,i2,...,it) [d] t u(i1,i2,...,it) (i1,i2,...,it) supp(u) u(i1,i2,...,it) = O τ t u 1 (87) Next, we choose an (1/2C2)-net N in the set of all k-sparse vectors in Cdt 1 such that (6C2)k exp k log C0dt Note that for any k-sparse vector u Cdt 1, y = R[t]u = (y1, . . . , yp) is a random vector with independent coordinates such that for any ℓ [p], E[yℓ] = 0, E[y2 ℓ] = u 2 2, and yℓ ψ2 Cτ t k u 2. (90) Then by Corollary 1, for any fixed u Cdt 1 with |supp(u)| k (and y = R[t]u), Pr [| Xy 2 X F | > ϵ X F ] 2 exp maxℓ yℓ 4 ψ2 sr(X) τ 4tk2 sr(X) . (91) Together with the union bound over u N and using the assumption on sr(X), we have Pr h u N, | XR[t]u 2 X F | > ϵ X F i exp k log C0dt τ 4tk2 sr(X) . (92) Finally, we extend the above argument from the net to all k-sparse vectors. From Corollary 2, we have Pr h I [d] t, |I| = k, XR[t] I 2 > C1ϵ X F i exp c1ϵ2 τ 4tk2 sr(X) . (93) First assume that the events in equation 92 and equation 93 happen. Any k-sparse vector u can be written as u = a + b, where a N, and b satisfies |supp(b)| k and b 2 1/(2C1). Let Ib = supp(b) [d] t and let b be b restricted to Ib. Let R[t] Ib be the submatrix of R[t] with columns indexed by Ib. Then XR[t]u 2 = XR[t]a + XR[t]b 2 (94) XR[t]a 2 + XR[t]b 2 (95) = XR[t]a 2 + XR[t] Ib b 2 (96) XR[t]a 2 + XR[t] Ib 2 b 2 (97) (1 + ϵ) X F + 1 2C2 XR[t] Ib 2 (98) (1 + ϵ1) X F (99) where the bound on XR[t]a 2 is from equation 92 and the spectrum norm bound for XR[t] Ib 2 is from equation 93. Similarly, XR[t]u 2 (1 ϵ2) X F . (100) Adjusting the constants and removing the conditioning completes the proof. Published in Transactions on Machine Learning Research (08/2022) For proving the above Theorem 8, the Hanson-Wright Inequality and its corollaries are useful. We thus include them here for completeness. Theorem 9 (Hanson-Wright Inequality (Rudelson et al., 2013)). Let x = (x1, . . . , xn) Rn be a random vector with independent components xi which satisfy E[xi] = 0 and xi ψ2 K. Let M be an n n matrix. Then for every t 0, Pr x Mx E[x Mx] > t 2 exp c min t2 K4 M 2 F , t K2 M 2 Corollary 1 (Subgaussian Concentration (Rudelson et al., 2013)). Let M be a fixed n d matrix. Let x = (x1, . . . , xn) Rn be a random vector with independent components xi which satisfies E[xi] = 0, E[x2 i ] = 1 and xi ψ2 K. Then for every t 0, Pr[| Mx 2 M F | > t] 2 exp ct2 Corollary 2 (Spectrum Norm of the Product (Rudelson et al., 2013)). Let B be a fixed n p matrix, and let G = (Gij) be a p d matrix with independent entries that satisfy: E[Gij] = 0, E[G2 ij] = 1, and Gij ψ2 K. Then for any a, b > 0, Pr[ BG 2 > CK2(a B F + b d B 2)] 2 exp a2sr(B) b2d . (103) B Dataset Licenses. The Delaney (Delaney, 2004), CEP (Hachmann et al., 2011), QM7 (Blum & Reymond, 2009), QM9 (Ruddigkeit et al., 2012), MUV (Rohrer & Baumann, 2009), and Mutagenicity (Kazius et al., 2005) datasets are all licensed under the Copyright of the American Chemical Society (ACS) which allows free usage of the data and materials appearing in public domain articles without any permission. The QM8 (Ramakrishnan et al., 2015) dataset is under Creative Commons Attribution (CC BY) license of the American Institute of Physics (AIP) Publishing LLC requiring no permission from the authors and publisher for using publicly released data from the paper. The Clin Tox (Gayvert et al., 2016) dataset is under the Copyright of Elsevier Ltd. which permits usage of public domain works and open access content without author permissions. The Malaria (Gamo et al., 2010) dataset is licensed under Copyright of Macmillan Publishers Limited that allows usage for personal and noncommercial use. The Tox21 (Tox21 Data Challenge, 2014) dataset was released by NIH National Center for Advancing Translational Sciences for free public usage as a part of a crowdsourced data analysis challenge. The HIV (AIDS Antiviral Screen Data, 2017) dataset was released by NIH National Cancer Institute (NCI) for public usage without any confidentiality agreement which allows access to chemical structural data on compounds. The IMDB-BINARY, IMDB-MULTI, REDDITBINARY, COLLAB (Yanardag & Vishwanathan, 2015) datasets are licensed under ACM Copyright 2015 under Creative Commons License that allows free usage for non-commercial academic purposes.