# Halting in Random Walk Kernels

Mahito Sugiyama (ISIR, Osaka University, Japan; JST, PRESTO) — mahito@ar.sanken.osaka-u.ac.jp
Karsten M. Borgwardt (D-BSSE, ETH Zürich, Basel, Switzerland) — karsten.borgwardt@bsse.ethz.ch

Random walk kernels measure graph similarity by counting matching walks in two graphs. In their most popular form, geometric random walk kernels, walks of length k are downweighted by a factor of λ^k (λ < 1) to ensure convergence of the corresponding geometric series. We know from the field of link prediction that this downweighting often leads to a phenomenon referred to as halting: longer walks are downweighted so much that the similarity score is completely dominated by the comparison of walks of length 1, which amounts to a naïve kernel between edges and vertices. We theoretically show that halting may occur in geometric random walk kernels. We also empirically quantify its impact on simulated datasets and popular graph classification benchmark datasets. Our findings promise to be instrumental in future graph kernel development and applications of random walk kernels.

1 Introduction

Over the last decade, graph kernels have become a popular approach to graph comparison [4, 5, 7, 9, 12, 13, 14], which is at the heart of many machine learning applications in bioinformatics, imaging, and social-network analysis. The first and best-studied instance of this family of kernels are random walk kernels, which count matching walks in two graphs [5, 7] to quantify their similarity. In particular, the geometric random walk kernel [5] is often used in applications as a baseline comparison method on graph benchmark datasets when developing new graph kernels. Geometric random walk kernels assign a weight λ^k to walks of length k, where λ < 1 is set small enough to ensure convergence of the corresponding geometric series. Related similarity measures have also been employed in link prediction [6, 10] as a similarity score between vertices [8].

However, there is one caveat regarding these approaches. Walk-based similarity scores with exponentially decaying weights tend to suffer from a problem referred to as halting [1]: they may downweight walks of length 2 and more so much that the similarity score is ultimately completely dominated by walks of length 1. In other words, they become almost identical to a simple comparison of edges and vertices, which ignores any topological information in the graph beyond single edges. Such a simple similarity measure could be computed more efficiently outside the random walk framework. Therefore, halting may affect both the expressivity and the efficiency of these similarity scores.

Halting has been conjectured to occur in random walk kernels [1], but its existence in graph kernels has never been theoretically proven or empirically demonstrated. Our goal in this study is to answer the open question of if and when halting occurs in random walk graph kernels. We theoretically show that halting may occur in graph kernels and that its extent depends on properties of the graphs being compared (Section 2). We empirically demonstrate in which simulated datasets and popular graph classification benchmark datasets halting is a concern (Section 3). We conclude by summarizing when halting occurs in practice and how it can be avoided (Section 4). We believe that our findings will be instrumental in future applications of random walk kernels and in the development of novel graph kernels.
2 Theoretical Analysis of Halting

We theoretically analyze the phenomenon of halting in random walk graph kernels. First, we review the definition of graph kernels in Section 2.1. We then present our key theoretical result regarding halting in Section 2.2 and clarify the connection to linear kernels on vertex and edge label histograms in Section 2.3.

2.1 Random Walk Kernels

Let G = (V, E, φ) be a labeled graph, where V is the vertex set, E is the edge set, and φ : V ∪ E → Σ is a mapping into the range Σ of vertex and edge labels. For an edge (u, v) ∈ E, we identify (u, v) and (v, u) if G is undirected. The degree of a vertex v ∈ V is denoted by d(v). The direct (tensor) product G_× = (V_×, E_×, φ_×) of two graphs G = (V, E, φ) and G′ = (V′, E′, φ′) is defined as follows [1, 5, 14]:

$$V_\times = \{\, (v, v') \in V \times V' \mid \varphi(v) = \varphi'(v') \,\},$$
$$E_\times = \{\, ((u, u'), (v, v')) \in V_\times \times V_\times \mid (u, v) \in E,\ (u', v') \in E',\ \text{and}\ \varphi(u, v) = \varphi'(u', v') \,\},$$

and all labels are inherited, that is, φ_×((v, v′)) = φ(v) = φ′(v′) and φ_×((u, u′), (v, v′)) = φ(u, v) = φ′(u′, v′). We denote by A_× the adjacency matrix of G_× and by δ_× and Δ_× the minimum and maximum degrees of G_×, respectively.

To measure the similarity between graphs G and G′, random walk kernels count all pairs of matching walks on G and G′ [2, 5, 7, 11]. If we assume a uniform distribution for the starting and stopping probabilities over the vertices of G and G′, the number of matching walks is obtained through the adjacency matrix A_× of the product graph G_× [14]. For each k ∈ ℕ, the k-step random walk kernel between two graphs G and G′ is defined as

$$K^\times_k(G, G') = \sum_{l=0}^{k} \lambda_l \sum_{i,j=1}^{|V_\times|} \bigl[A_\times^l\bigr]_{ij}$$

with a sequence of positive, real-valued weights λ_0, λ_1, λ_2, …, λ_k, assuming that A_×^0 = I, the identity matrix. Its limit K^×_∞(G, G′) is simply called the random walk kernel. Interestingly, K^×_∞ can be directly computed if the weights form a geometric series, that is, λ_l = λ^l, resulting in the geometric random walk kernel:

$$K_{\mathrm{GR}}(G, G') = \sum_{i,j=1}^{|V_\times|} \Bigl[\sum_{l=0}^{\infty} \lambda^l A_\times^l\Bigr]_{ij} = \sum_{i,j=1}^{|V_\times|} \bigl[(I - \lambda A_\times)^{-1}\bigr]_{ij}.$$

To see why (I − λA_×) is invertible, suppose (I − λA_×)x = 0 for some vector x. Then λA_× x = x and (λA_×)^l x = x for every l ∈ ℕ; if (λA_×)^l converges to 0 as l → ∞, then x must be 0, so (I − λA_×) is invertible. Moreover, (I − λA_×)^{-1} = Σ_{l=0}^{∞} λ^l A_×^l follows from the identity (I − λA_×)(I + λA_× + λ²A_×² + …) = I [5].

It is well known that this geometric series of matrices, often called the Neumann series, I + λA_× + (λA_×)² + …, converges only if the maximum eigenvalue of A_×, denoted by μ_{×,max}, is strictly smaller than 1/λ. Therefore, the geometric random walk kernel K_GR is well-defined only if λ < 1/μ_{×,max}. The minimum and maximum degrees δ_× and Δ_× of G_× satisfy the relationship [3]

$$\delta_\times \le \bar{d}_\times \le \mu_{\times,\max} \le \Delta_\times,$$

where d̄_× is the average vertex degree of G_×, that is, d̄_× = (1/|V_×|) Σ_{v ∈ V_×} d(v). In practice, it is therefore sufficient to set the parameter λ < 1/Δ_×.

In the inductive learning setting, since we do not know a priori the target graphs that a learner will receive in the future, λ should be small enough that λ < 1/μ_{×,max} holds for any pair of unseen graphs; otherwise, we would need to re-compute the full kernel matrix and re-train the learner. In the transductive setting, we are given a collection 𝒢 of graphs beforehand and can explicitly compute the upper bound of λ, which is (max_{G,G′∈𝒢} μ_{×,max})^{-1}, the inverse of the maximum of the maximum eigenvalues over all pairs of graphs G, G′ ∈ 𝒢.
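To make these definitions concrete, the following is a minimal sketch (our own illustration, not the implementation used in this paper) of how the product graph and the two kernels could be computed for small vertex-labeled, undirected graphs given as NumPy adjacency matrices; the function names and the omission of edge labels are our simplifications.

```python
import numpy as np

def product_adjacency(A1, labels1, A2, labels2):
    # Direct (tensor) product of two vertex-labeled graphs: vertices of G_x
    # are pairs (v, v') with matching vertex labels; edge labels are ignored
    # in this simplified sketch.
    pairs = [(v, w) for v in range(len(labels1)) for w in range(len(labels2))
             if labels1[v] == labels2[w]]
    Ax = np.zeros((len(pairs), len(pairs)))
    for i, (u, up) in enumerate(pairs):
        for j, (v, vp) in enumerate(pairs):
            if A1[u, v] and A2[up, vp]:
                Ax[i, j] = 1.0
    return Ax

def k_step_rw_kernel(Ax, weights):
    # K_k = sum_{l=0}^{k} lambda_l * sum_{i,j} [Ax^l]_{ij},
    # where weights = [lambda_0, ..., lambda_k].
    val, P = 0.0, np.eye(Ax.shape[0])
    for lam_l in weights:
        val += lam_l * P.sum()
        P = P @ Ax
    return val

def geometric_rw_kernel(Ax, lam):
    # K_GR = sum_{i,j} [(I - lam * Ax)^{-1}]_{ij}; requires lam < 1 / mu_max,
    # where mu_max is the largest eigenvalue of the (symmetric) matrix Ax.
    n = Ax.shape[0]
    mu_max = np.max(np.linalg.eigvalsh(Ax)) if n else 0.0
    if lam * mu_max >= 1:
        raise ValueError("geometric series does not converge for this lambda")
    return float(np.linalg.solve(np.eye(n) - lam * Ax, np.ones(n)).sum())
```

Building A_× explicitly needs O(|V_×|²) memory; the fixed-point iterations of [14] used in the experiments avoid this, so the sketch above only mirrors the definitions.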
2.2 Halting

The geometric random walk kernel K_GR is one of the most popular graph kernels, as it can take walks of any length into account [5, 14]. However, the fact that it weights walks of length k by the kth power of λ, together with the condition λ < (μ_{×,max})^{-1} < 1, immediately tells us that the contribution of longer walks is significantly lowered in K_GR. If the contribution of walks of length 2 and more to the kernel value is completely dominated by the contribution of walks of length 1, we speak of halting: it is as if the random walks halt after one step. Here, we analyze under which conditions this halting phenomenon may occur in geometric random walk kernels. We obtain the following key theoretical statement by comparing K_GR to the one-step random walk kernel K^×_1.

Theorem 1. Let λ_0 = 1 and λ_1 = λ in the random walk kernel. For a pair of graphs G and G′,

$$K^\times_1(G, G') \;\le\; K_{\mathrm{GR}}(G, G') \;\le\; K^\times_1(G, G') + \varepsilon, \qquad \varepsilon = \frac{|V_\times| (\lambda \Delta_\times)^2}{1 - \lambda \Delta_\times},$$

and ε monotonically converges to 0 as λ → 0.

Proof. Let d(v) be the degree of a vertex v in G_× and N(v) the set of neighboring vertices of v, that is, N(v) = { u ∈ V_× | (u, v) ∈ E_× }. Since A_× is the adjacency matrix of G_×, the following relationships hold:

$$\sum_{i,j=1}^{|V_\times|} [A_\times]_{ij} = \sum_{v \in V_\times} d(v) \le \Delta_\times |V_\times|, \qquad \sum_{i,j=1}^{|V_\times|} [A_\times^2]_{ij} = \sum_{v \in V_\times} \sum_{v' \in N(v)} d(v') \le \Delta_\times^2 |V_\times|,$$
$$\sum_{i,j=1}^{|V_\times|} [A_\times^3]_{ij} = \sum_{v \in V_\times} \sum_{v' \in N(v)} \sum_{v'' \in N(v')} d(v'') \le \Delta_\times^3 |V_\times|, \quad \dots, \quad \sum_{i,j=1}^{|V_\times|} [A_\times^n]_{ij} \le \Delta_\times^n |V_\times|.$$

From the assumption that λΔ_× < 1, we have

$$K_{\mathrm{GR}}(G, G') = \sum_{i,j=1}^{|V_\times|} \bigl[I + \lambda A_\times + \lambda^2 A_\times^2 + \dots\bigr]_{ij} = K^\times_1(G, G') + \sum_{i,j=1}^{|V_\times|} \bigl[\lambda^2 A_\times^2 + \lambda^3 A_\times^3 + \dots\bigr]_{ij}$$
$$\le K^\times_1(G, G') + |V_\times| \lambda^2 \Delta_\times^2 \bigl(1 + \lambda \Delta_\times + \lambda^2 \Delta_\times^2 + \dots\bigr) = K^\times_1(G, G') + \varepsilon.$$

The lower bound holds because all omitted terms are non-negative, and it is clear that ε monotonically goes to 0 as λ → 0. ∎

Moreover, we can normalize ε by dividing K_GR(G, G′) by K^×_1(G, G′).

Corollary 1. Let λ_0 = 1 and λ_1 = λ in the random walk kernel. For a pair of graphs G and G′,

$$1 \;\le\; \frac{K_{\mathrm{GR}}(G, G')}{K^\times_1(G, G')} \;\le\; 1 + \varepsilon', \qquad \varepsilon' = \frac{(\lambda \Delta_\times)^2}{(1 - \lambda \Delta_\times)(1 + \lambda \bar{d}_\times)},$$

where d̄_× is the average of the vertex degrees of G_×.

Proof. Since K^×_1(G, G′) = |V_×| + λ Σ_{v ∈ V_×} d(v) = |V_×|(1 + λ d̄_×), it follows that ε / K^×_1(G, G′) = ε′. ∎

Theorem 1 can easily be generalized to any k-step random walk kernel K^×_k.

Corollary 2. Let ε(k) = |V_×| (λΔ_×)^k / (1 − λΔ_×). For a pair of graphs G and G′, we have

$$K^\times_k(G, G') \;\le\; K_{\mathrm{GR}}(G, G') \;\le\; K^\times_k(G, G') + \varepsilon(k+1).$$

Our results imply that, in the geometric random walk kernel K_GR, the contribution of walks of length 2 and more diminishes for very small choices of λ. This can easily happen in real-world graph data, as λ is upper-bounded by the inverse of the maximum degree of the product graph.
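The bounds of Theorem 1 and Corollary 1 are easy to check numerically. Below is a small sketch (our own illustration, not part of the original experiments) that, given the product adjacency matrix Ax from the previous sketch, compares K_GR with the one-step kernel K^×_1 and with the bounds ε and ε′; it assumes a non-empty product graph and λΔ_× < 1.

```python
import numpy as np

def halting_bounds(Ax, lam):
    # Theorem 1: K_1 <= K_GR <= K_1 + eps, with
    # eps = |V_x| * (lam * Delta_x)^2 / (1 - lam * Delta_x).
    n = Ax.shape[0]
    delta_max = Ax.sum(axis=1).max()   # maximum degree Delta_x of G_x
    d_bar = Ax.sum() / n               # average degree of G_x
    assert lam * delta_max < 1, "bound requires lam < 1 / Delta_x"
    k_gr = np.linalg.solve(np.eye(n) - lam * Ax, np.ones(n)).sum()
    k_1 = n + lam * Ax.sum()           # K_1 with lambda_0 = 1, lambda_1 = lam
    eps = n * (lam * delta_max) ** 2 / (1 - lam * delta_max)
    # Corollary 1: 1 <= K_GR / K_1 <= 1 + eps_rel (relative deviation).
    eps_rel = (lam * delta_max) ** 2 / ((1 - lam * delta_max) * (1 + lam * d_bar))
    return k_1, k_gr, eps, eps_rel
```

Evaluating eps_rel at λ = (max Δ_×)^{-1} for all pairs of graphs in a dataset is what underlies the histograms shown later in Figure 2.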
2.3 Relationships to Linear Kernels on Label Histograms

Next, we clarify the relationship between K_GR and basic linear kernels on vertex and edge label histograms. We show that, due to halting, K_GR converges to such linear kernels. Given a pair of graphs G and G′, let us introduce two linear kernels on vertex and edge histograms. Assume that the range of labels is Σ = {1, 2, …, s} without loss of generality. The vertex label histogram of a graph G = (V, E, φ) is a vector f = (f_1, f_2, …, f_s) such that f_i = |{ v ∈ V | φ(v) = i }| for each i ∈ Σ. Let f and f′ be the vertex label histograms of graphs G and G′, respectively. The vertex label histogram kernel K_VH(G, G′) is then defined as the linear kernel between f and f′: K_VH(G, G′) = ⟨f, f′⟩ = Σ_{i=1}^{s} f_i f′_i. Similarly, the edge label histogram is a vector g = (g_1, g_2, …, g_s) such that g_i = |{ (u, v) ∈ E | φ(u, v) = i }| for each i ∈ Σ. The edge label histogram kernel K_EH(G, G′) is defined as the linear kernel between the respective histograms g and g′: K_EH(G, G′) = ⟨g, g′⟩ = Σ_{i=1}^{s} g_i g′_i. Finally, we introduce the vertex-edge label histogram. Let h = (h_111, h_211, …, h_sss) be a histogram vector such that h_ijk = |{ (u, v) ∈ E | φ(u, v) = i, φ(u) = j, φ(v) = k }| for each i, j, k ∈ Σ. The vertex-edge label histogram kernel K_VEH(G, G′) is defined as the linear kernel between the respective histograms h and h′ of G and G′: K_VEH(G, G′) = ⟨h, h′⟩ = Σ_{i,j,k=1}^{s} h_ijk h′_ijk. Notice that K_VEH(G, G′) = K_EH(G, G′) if vertices are not labeled.

From the definition of the direct product of graphs, we can confirm the following relationships between histogram kernels and the random walk kernel.

Lemma 1. For a pair of graphs G, G′ and their direct product G_×, we have

$$K_{\mathrm{VH}}(G, G') = \tfrac{1}{\lambda_0} K^\times_0(G, G') = |V_\times|, \qquad K_{\mathrm{VEH}}(G, G') = \tfrac{1}{\lambda_1} K^\times_1(G, G') - \tfrac{\lambda_0}{\lambda_1} K^\times_0(G, G') = \sum_{i,j=1}^{|V_\times|} [A_\times]_{ij}.$$

Proof. The first equation, K_VH(G, G′) = |V_×|, follows from

$$K_{\mathrm{VH}}(G, G') = \sum_{v \in V} \bigl|\{\, v' \in V' \mid \varphi(v) = \varphi'(v') \,\}\bigr| = \bigl|\{\, (v, v') \in V \times V' \mid \varphi(v) = \varphi'(v') \,\}\bigr| = |V_\times| = \tfrac{1}{\lambda_0} K^\times_0(G, G').$$

We can prove the second equation in a similar fashion:

$$K_{\mathrm{VEH}}(G, G') = 2 \sum_{(u,v) \in E} \bigl|\{\, (u', v') \in E' \mid \varphi(u, v) = \varphi'(u', v'),\ \varphi(u) = \varphi'(u'),\ \varphi(v) = \varphi'(v') \,\}\bigr|$$
$$= 2\,\bigl|\{\, ((u, v), (u', v')) \in E \times E' \mid \varphi(u, v) = \varphi'(u', v'),\ \varphi(u) = \varphi'(u'),\ \varphi(v) = \varphi'(v') \,\}\bigr| = \sum_{i,j=1}^{|V_\times|} [A_\times]_{ij} = \tfrac{1}{\lambda_1} K^\times_1(G, G') - \tfrac{\lambda_0}{\lambda_1} K^\times_0(G, G'). \;\; ∎$$

Finally, let us define a new kernel

$$K_{\mathrm{H}}(G, G') := K_{\mathrm{VH}}(G, G') + \lambda\, K_{\mathrm{VEH}}(G, G') \qquad (1)$$

with a parameter λ. From Lemma 1, since K_H(G, G′) = K^×_1(G, G′) holds if λ_0 = 1 and λ_1 = λ in the one-step random walk kernel K^×_1, we have the following relationship from Theorem 1.

Corollary 3. For a pair of graphs G and G′, we have K_H(G, G′) ≤ K_GR(G, G′) ≤ K_H(G, G′) + ε, where ε is given in Theorem 1.

To summarize, our results show that if the parameter λ of the geometric random walk kernel K_GR is small enough, random walks halt and K_GR reduces to K_H, which in turn converges to K_VH as λ → 0; the latter is based on vertex histograms only and completely ignores the topological structure of the graphs.
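As an illustration of Equation (1), a minimal sketch of the histogram kernels is given below; the graph representation (a dict holding vertex labels and directed edge-label triples) is our own convention, not one prescribed by the paper.

```python
from collections import Counter

def k_vh(vlabels, vlabels_p):
    # K_VH: linear kernel between vertex-label histograms f and f'.
    f, fp = Counter(vlabels), Counter(vlabels_p)
    return sum(f[l] * fp[l] for l in f)

def k_veh(triples, triples_p):
    # K_VEH: linear kernel between vertex-edge label histograms h and h'.
    # Each triple is (edge label, label of u, label of v); for undirected
    # graphs we list both orientations of every edge, so that the kernel
    # value matches sum_{i,j} [A_x]_{ij} as in Lemma 1.
    h, hp = Counter(triples), Counter(triples_p)
    return sum(h[t] * hp[t] for t in h)

def k_h(g, gp, lam):
    # K_H = K_VH + lam * K_VEH   (Equation (1)); with lambda_0 = 1 and
    # lambda_1 = lam this equals the one-step kernel K_1 (Lemma 1).
    return k_vh(g["vlabels"], gp["vlabels"]) + lam * k_veh(g["triples"], gp["triples"])
```

With this orientation convention, k_h reproduces the one-step random walk kernel, which is the baseline against which halting is measured below.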
3 Experiments

We empirically examine the halting phenomenon of the geometric random walk kernel on popular real-world graph benchmark datasets and semi-simulated graph data.

3.1 Experimental Setup

Environment. We used Amazon Linux AMI release 2015.03 and ran all experiments on a single core of a 2.5 GHz Intel Xeon CPU E5-2670 with 244 GB of memory. All kernels were implemented in C++ with the Eigen library and compiled with gcc 4.8.2.

Datasets. We collected five real-world graph classification benchmark datasets:¹ ENZYMES, NCI1, NCI109, MUTAG, and D&D, which are popular in the graph-classification literature [13, 14]. ENZYMES and D&D contain proteins, and NCI1, NCI109, and MUTAG contain chemical compounds. Statistics of these datasets are summarized in Table 1, in which we also show the maximum of the maximum degrees of product graphs, max_{G,G′∈𝒢} Δ_×, for each dataset 𝒢. We consistently used λ_max = (max_{G,G′∈𝒢} Δ_×)^{-1} as the upper bound of λ in geometric random walk kernels; the gap between this value and the theoretical maximum of λ was less than one order of magnitude. The average degrees d̄_× of the product graphs, which lower-bound μ_{×,max}, were 18.17, 7.93, 5.60, 6.21, and 13.31 for ENZYMES, NCI1, NCI109, MUTAG, and D&D, respectively.

Kernels. We employed the following graph kernels in our experiments: linear kernels on vertex label histograms K_VH, edge label histograms K_EH, vertex-edge label histograms K_VEH, and the combination K_H introduced in Equation (1). We also included a Gaussian RBF kernel between vertex-edge label histograms, denoted as K_VEH,G. From the family of random walk kernels, we used the geometric random walk kernel K_GR and the k-step random walk kernel K^×_k. Only the number of steps k was treated as a parameter in K^×_k, and the weights λ_l were fixed to 1 for all l. We used fixed-point iterations [14, Section 4.3] for the efficient computation of K_GR. Moreover, we employed the Weisfeiler-Lehman subtree kernel [13], denoted as K_WL, as the state-of-the-art graph kernel; its parameter h is the number of iterations.

3.2 Results on Real-World Datasets

We first compared the geometric random walk kernel K_GR to other kernels in graph classification. The classification accuracy of each graph kernel was examined by 10-fold cross validation with multiclass C-support vector classification (libsvm² was used), in which the parameter C for C-SVC and a parameter of each kernel (if one exists) were chosen by internal 10-fold cross validation (CV) on the training dataset only. We repeated the whole experiment 10 times and report average classification accuracies with their standard errors. The list of parameters optimized by the internal CV is as follows: C ∈ {2^-7, 2^-5, …, 2^5, 2^7} for C-SVC, the width σ ∈ {10^-2, …, 10^2} in the RBF kernel K_VEH,G, the number of steps k ∈ {1, …, 10} in K^×_k, the number of iterations h ∈ {1, …, 10} in K_WL, and λ ∈ {10^-5, …, 10^-2, λ_max} in K_H and K_GR, where λ_max = (max_{G,G′∈𝒢} Δ_×)^{-1}.

¹ The code and all datasets are available at http://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels.html
² http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Table 1: Statistics of graph datasets. |Σ_V| and |Σ_E| denote the numbers of vertex and edge labels.

| Dataset | Size | #classes | avg. \|V\| | avg. \|E\| | max \|V\| | max \|E\| | \|Σ_V\| | \|Σ_E\| | max Δ_× |
|---|---|---|---|---|---|---|---|---|---|
| ENZYMES | 600 | 6 | 32.63 | 62.14 | 126 | 149 | 3 | 1 | 65 |
| NCI1 | 4110 | 2 | 29.87 | 32.3 | 111 | 119 | 37 | 3 | 16 |
| NCI109 | 4127 | 2 | 29.68 | 32.13 | 111 | 119 | 38 | 3 | 17 |
| MUTAG | 188 | 2 | 17.93 | 19.79 | 28 | 33 | 7 | 11 | 10 |
| D&D | 1178 | 2 | 284.32 | 715.66 | 5748 | 14267 | 82 | 1 | 50 |

Figure 1: Classification accuracy on real-world datasets (means ± SD). Rows correspond to (a) ENZYMES, (b) NCI1, and (c) NCI109; columns show (i) a comparison of the various graph kernels (label histogram kernels K_VH, K_EH, K_VEH, K_VEH,G; random walk kernels K_GR, K^×_k, K_H; and K_WL), (ii) a comparison of K_GR with K_H as a function of the parameter λ, and (iii) the k-step kernel K^×_k as a function of the number of steps k.

Results are summarized in the left column of Figure 1 for ENZYMES, NCI1, and NCI109. We present results on MUTAG and D&D in the Supplementary Notes, as different graph kernels do not give significantly different results on these datasets (e.g., [13]). Overall, we observe two trends. First, the Weisfeiler-Lehman subtree kernel K_WL was the most accurate, which confirms the results in [13]. Second, the two random walk kernels K_GR and K^×_k show greater accuracy than naïve linear kernels on edge and vertex histograms, which indicates that halting is not occurring in these datasets. It is also noteworthy that employing a Gaussian RBF kernel on vertex-edge histograms leads to a clear improvement over linear kernels on all three datasets. On ENZYMES, the Gaussian kernel is even on par with the random walks in terms of accuracy.

Figure 2: Distribution of log10 ε′, where ε′ is defined in Corollary 1, in the real-world datasets; panels (a) ENZYMES, (b) NCI1, (c) NCI109.

Figure 3: Classification accuracy on semi-simulated datasets (means ± SD) as a function of the number of added edges; panels (a) Sim-ENZYMES, (b) Sim-NCI1, (c) Sim-NCI109.
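As a side note on the protocol above, the largest value in the λ grid, λ_max = (max_{G,G′∈𝒢} Δ_×)^{-1}, could be obtained from a dataset as in the following sketch; it reuses the hypothetical product_adjacency helper from the Section 2.1 sketch and is quadratic in the number of graphs, so it is meant only as an illustration of the transductive bound, not as the procedure used in the paper.

```python
def lambda_max(graphs):
    # graphs: list of (adjacency matrix, vertex label list) pairs.
    # Returns 1 / max_{G, G'} Delta_x over all pairs of graphs, the
    # transductive upper bound for lambda.
    max_deg = 0
    for i, (A1, l1) in enumerate(graphs):
        for A2, l2 in graphs[i:]:
            Ax = product_adjacency(A1, l1, A2, l2)
            if Ax.size:
                max_deg = max(max_deg, int(Ax.sum(axis=1).max()))
    return 1.0 / max_deg if max_deg else 1.0
```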
To investigate the effect of halting in more detail, we show the accuracy of K_GR and K_H in the center column of Figure 1 for various choices of λ, from 10^-5 to its upper bound. We can clearly see that halting occurs for small λ, which greatly affects the performance of K_GR. More specifically, if λ is chosen to be very small (smaller than 10^-3 in our datasets), the accuracies are close to the naïve baseline K_H, which ignores the topological structure of the graphs. However, accuracies are much closer to those reached by the Weisfeiler-Lehman kernel if λ is close to its theoretical maximum. Of course, the theoretical maximum of λ depends on unseen test data in reality. Therefore, we often have to set λ conservatively so that we can apply the trained model to any unseen graph data.

Moreover, we also investigated the accuracy of the random walk kernel as a function of the number of steps k of the k-step random walk kernel K^×_k. Results are shown in the right column of Figure 1. In all datasets, accuracy improves with each step, up to four to five steps. The optimal number of steps in K^×_k and the maximum λ give similar accuracy levels. We also confirmed Theorem 1 in that conservative choices of λ (10^-3 or less) give the same accuracy as a one-step random walk.

In addition, Figure 2 shows histograms of log10 ε′, where ε′ is given in Corollary 1 with λ = (max Δ_×)^{-1}, for all pairs of graphs in the respective datasets. The value ε′ can be viewed as the deviation of K_GR from K_H in percentages. Although ε′ is small on average (about 0.1 percent in ENZYMES and the NCI datasets), we confirmed the existence of relatively large values of ε′ (more than 1 percent), which might cause the difference between K_GR and K_H.

3.3 Results on Semi-Simulated Datasets

To empirically study halting, we generated semi-simulated graphs from our three benchmark datasets (ENZYMES, NCI1, and NCI109) and compared the three kernels K_GR, K_H, and K_VH. In each dataset, we artificially generated denser graphs by randomly adding edges, in which the number of new edges per graph was drawn from a normal distribution with mean m ∈ {10, 20, 50, 100}, while the distribution of edge labels was kept unchanged. Note that the accuracy of the vertex histogram kernel K_VH always stays the same, as we only added edges.

Results are plotted in Figure 3. There are two key observations. First, by adding new false edges to the graphs, the accuracy levels drop for both the random walk kernel and the histogram kernel. However, even after adding 100 new false edges per graph, both are still better than a naïve classifier that assigns all graphs to the same class (accuracy of 16.6 percent on ENZYMES and approximately 50 percent on NCI1 and NCI109). Second, the geometric random walk kernel quickly approaches the accuracy level of K_H when new edges are added. This is a strong indicator that halting occurs: as graphs become denser, the upper bound for λ gets smaller, and the accuracy of the geometric random walk kernel K_GR rapidly drops and converges to that of K_H. This result confirms Corollary 3, which says that both K_GR and K_H converge to K_VH as λ goes to 0.
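The densification procedure of Section 3.3 can be sketched as follows; the paper specifies only the mean of the normal distribution, so the standard deviation used here and the omission of edge-label assignment are our assumptions.

```python
import numpy as np

def add_random_edges(A, m_mean, std=1.0, rng=None):
    # Add new edges uniformly at random among non-adjacent vertex pairs;
    # the number of added edges is drawn from N(m_mean, std^2).
    # New edge labels (not handled here) would be sampled from the
    # dataset's existing edge-label distribution, as described in Section 3.3.
    rng = np.random.default_rng() if rng is None else rng
    A = A.copy()
    n = A.shape[0]
    candidates = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] == 0]
    m = min(len(candidates), max(0, int(round(rng.normal(m_mean, std)))))
    for k in rng.choice(len(candidates), size=m, replace=False):
        i, j = candidates[k]
        A[i, j] = A[j, i] = 1
    return A
```

Applying such a procedure with m_mean ∈ {10, 20, 50, 100} to every graph would yield datasets analogous to the Sim-ENZYMES, Sim-NCI1, and Sim-NCI109 datasets of Figure 3.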
4 Discussion

In this work, we show when and where the phenomenon of halting occurs in random walk kernels. Halting refers to the fact that similarity measures based on counting walks (of potentially infinite length) often downweight longer walks so much that the similarity score is completely dominated by walks of length 1, degenerating the random walk kernel into a simple kernel between edges and vertices. While it had been conjectured that this problem may arise in graph kernels [1], we provide the first theoretical proof and empirical demonstration of the occurrence and extent of halting in geometric random walk kernels.

We show that the difference between geometric random walk kernels and simple edge kernels depends on the maximum degree of the graphs being compared; with increasing maximum degree, the difference converges to zero. We empirically demonstrate on simulated graphs that the comparison of graphs with high maximum degrees suffers from halting. On real graph data from popular graph classification benchmark datasets, the maximum degree is so low that halting can be avoided if the decaying weight λ is set close to its theoretical maximum. Still, if λ is set conservatively to a low value to ensure convergence on unseen test graphs with unknown maximum degrees, halting can clearly be observed.

There is an interesting connection between halting and tottering [1, Section 2.1.5], a weakness of random walk kernels described more than a decade ago [11]. Tottering is the phenomenon that a walk of infinite length may go back and forth along the same edge, thereby creating an artificially inflated similarity score if two graphs share a common edge. Halting and tottering seem to be opposing effects: if halting occurs, the effect of tottering is reduced and vice versa, since halting downweights tottering walks and counteracts the inflation of the similarity scores. An interesting point is that the strategies proposed to remove tottering from walk kernels did not lead to a clear improvement in classification accuracy [11], while we observed a strong negative effect of halting on classification accuracy in our experiments (Section 3). This finding stresses the importance of studying halting.

Our theoretical and empirical results have important implications for future applications of random walk kernels. First, if the geometric random walk kernel is used on a graph dataset with known maximum degree, λ should be chosen close to its theoretical maximum. Second, simple baseline kernels based on vertex and edge label histograms should be employed to check empirically whether the random walk kernel gives better accuracy than these baselines. Third, particularly in datasets with high maximum degree, we advise using a fixed-length-k random walk kernel rather than a geometric random walk kernel. Optimizing the length k by cross validation on the training dataset led to competitive or superior results compared to the geometric random walk kernel in all of our experiments. Based on these results and the fact that, by definition, the fixed-length kernel does not suffer from halting, we recommend using the fixed-length random walk kernel as a comparison method in future studies on novel graph kernels.

Acknowledgments. This work was supported by JSPS KAKENHI Grant Number 26880013 (MS), the Alfried Krupp von Bohlen und Halbach-Stiftung (KB), the SNSF Starting Grant "Significant Pattern Mining" (KB), and the Marie Curie Initial Training Network MLPM2012, Grant No. 316861 (KB).

References

[1] Borgwardt, K. M. Graph Kernels. PhD thesis, Ludwig-Maximilians-University Munich, 2007.
[2] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl 1):i47–i56, 2005.
[3] Brualdi, R. A. The Mutually Beneficial Relationship of Graphs and Matrices. AMS, 2011.
[4] Costa, F. and Grave, K. D. Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on Machine Learning (ICML), 255–262, 2010.
[5] Gärtner, T., Flach, P., and Wrobel, S. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines (LNCS 2777), 129–143, 2003.
[6] Girvan, M. and Newman, M. E. J. Community structure in social and biological networks. Proceedings of the National Academy of Sciences (PNAS), 99(12):7821–7826, 2002.
[7] Kashima, H., Tsuda, K., and Inokuchi, A. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML), 321–328, 2003.
[8] Katz, L. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
[9] Kriege, N., Neumann, M., Kersting, K., and Mutzel, P. Explicit versus implicit graph feature maps: A computational phase transition for walk kernels. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 881–886, 2014.
[10] Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[11] Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. Extensions of marginalized graph kernels. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
[12] Shervashidze, N. and Borgwardt, K. M. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems (NIPS) 22, 1660–1668, 2009.
[13] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[14] Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.