# The Multiscale Laplacian Graph Kernel

Risi Kondor, Department of Computer Science and Department of Statistics, University of Chicago, Chicago, IL 60637. risi@cs.uchicago.edu

Horace Pan, Department of Computer Science, University of Chicago, Chicago, IL 60637. hopan@uchicago.edu

**Abstract.** Many real world graphs, such as the graphs of molecules, exhibit structure at multiple different scales, but most existing kernels between graphs are either purely local or purely global in character. In contrast, by building a hierarchy of nested subgraphs, the Multiscale Laplacian Graph kernels (MLG kernels) that we define in this paper can account for structure at a range of different scales. At the heart of the MLG construction is another new graph kernel, called the Feature Space Laplacian Graph kernel (FLG kernel), which has the property that it can lift a base kernel defined on the vertices of two graphs to a kernel between the graphs. The MLG kernel applies such FLG kernels to subgraphs recursively. To make the MLG kernel computationally feasible, we also introduce a randomized projection procedure, similar to the Nyström method, but for RKHS operators.

## 1 Introduction

There is a wide range of problems in applied machine learning, from web data mining [1] to protein function prediction [2], where the input space is a space of graphs. A particularly important application domain is chemoinformatics, where the graphs capture the structure of molecules. In the pharmaceutical industry, for example, machine learning algorithms are regularly used to screen candidate drug compounds for safety and efficacy against specific diseases [3].

Because kernel methods neatly separate the issue of data representation from the statistical learning component, it is natural to formulate graph learning problems in the kernel paradigm. Starting with [4], a number of different graph kernels have appeared in the literature (for an overview, see [5]). In general, a graph kernel $k(G_1, G_2)$ must satisfy the following requirements:

1. The kernel should capture the right notion of similarity between $G_1$ and $G_2$. For example, if $G_1$ and $G_2$ are social networks, then $k$ might capture to what extent their clustering structure, degree distribution, etc. match. If, on the other hand, $G_1$ and $G_2$ are molecules, then we are probably more interested in what functional groups are present, and how they are arranged relative to each other.
2. The kernel is usually computed from the adjacency matrices $A_1$ and $A_2$ of the two graphs, but it must be invariant to the ordering of the vertices. In other words, writing the kernel explicitly in terms of $A_1$ and $A_2$, we must have $k(A_1, A_2) = k(A_1, P A_2 P^\top)$ for any permutation matrix $P$.

Permutation invariance has proved to be the central constraint around which much of the graph kernels literature is organized, effectively stipulating that graph kernels must be built out of graph invariants. Efficiently computable graph invariants offered by the mathematics literature tend to fall in one of two categories:

1. Local invariants, which can often be reduced to simply counting some local properties, such as the number of triangles, squares, etc. that appear in $G$ as subgraphs (see the sketch below).
2. Spectral invariants, which can be expressed as functions of the eigenvalues of the adjacency matrix or the graph Laplacian.
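To make the distinction concrete, the following sketch (our own illustration, not part of the paper) computes one invariant of each kind with NumPy and checks that both are unaffected by relabeling the vertices.

```python
# A minimal sketch contrasting the two kinds of graph invariants: a local one
# (triangle count) and a spectral one (Laplacian eigenvalues). Both are
# unchanged when the vertices are permuted.
import numpy as np

def triangle_count(A):
    """Local invariant: number of triangles = trace(A^3) / 6 for a simple graph."""
    return int(round(np.trace(A @ A @ A) / 6))

def laplacian_spectrum(A):
    """Spectral invariant: sorted eigenvalues of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))

# 4-cycle plus one chord (contains two triangles)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)

P = np.eye(4)[[2, 0, 3, 1]]          # a permutation matrix
A_perm = P @ A @ P.T                  # same graph, vertices relabelled

assert triangle_count(A) == triangle_count(A_perm)
assert np.allclose(laplacian_spectrum(A), laplacian_spectrum(A_perm))
```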
Correspondingly, while different graph kernels are motivated in very different ways, from random walks [4] through shortest paths [6, 7] to Fourier transforms on the symmetric group [8], most graph kernels in the literature ultimately reduce to computing a function of the two graphs that is either purely local or purely spectral. Any of the kernels based on the subgraph counting idea (e.g., [9]) are local. On the other hand, most of the random walk based kernels are reducible to a spectral form involving the eigenvalues of either the two graphs individually or their Kronecker product [5], and therefore are really only sensitive to the large scale structure of graphs.

In practice, it would be desirable to have a kernel that can take structure into account at multiple different scales. A kernel between molecules, for example, should not only be sensitive to the overall large-scale shape of the graphs (whether they are more like a chain, a ring, a chain that branches, etc.), but also to what smaller structures (e.g., functional groups) are present in the graphs, and how they are related to the global structure (e.g., whether a particular functional group is towards the middle or one of the ends of a chain). For the most part, such a multiscale graph kernel has been missing from the literature.

Two notable exceptions are the Weisfeiler-Lehman kernel [10] and the propagation kernel [11]. The WL kernel uses a combination of message passing and hashing to build summaries of the local neighborhoods of vertices at different scales. While it has been shown to be effective, the Weisfeiler-Lehman kernel's hashing step is somewhat ad hoc: perturbing the edges by a small amount can lead to completely different hash features. Similarly, the propagation kernel monitors how the distribution of node/edge labels spreads through the graph and then uses locality sensitive hashing to efficiently bin the label distributions into feature vectors. Most recently, structure2vec [12] represents each graph with a latent variable model, embeds it into a feature space, and uses the inner product in that space as a kernel. This approach compares favorably to the standard kernel methods in both accuracy and computational efficiency.

In this paper we present a new graph kernel, the Multiscale Laplacian Graph Kernel (MLG kernel), which, we believe, is the first kernel in the literature that can truly compare structure in graphs simultaneously at multiple different scales. We begin by introducing the Feature Space Laplacian Graph Kernel (FLG kernel) in Section 2. The FLG kernel operates at a single scale, combining vertex features with topological information through the graph Laplacian. An important property of the FLG kernel is that it can work with vertex labels provided as a base kernel on the vertices, which allows us to apply the FLG kernel recursively. The MLG kernel, defined in Section 3, uses the FLG kernel's recursive property to build a hierarchy of subgraph kernels that are sensitive to both the topological relationships between individual vertices and those between subgraphs of increasing sizes. Each kernel is defined in terms of the preceding kernel in the hierarchy. Efficient computability is a major concern in our paper, and recursively defined kernels on combinatorial data structures can be very expensive.
Therefore, in Section 4 we describe a strategy based on a combination of linearizing each level of the kernel (relative to a given dataset) and a randomized low rank projection step, which reduces every stage of the kernel computation to simple operations involving small matrices, leading to a very fast algorithm. Finally, Section 5 presents experimental comparisons of our kernel with competing methods.

## 2 Laplacian Graph Kernels

Let $G$ be a weighted undirected graph with vertex set $V = \{v_1, \ldots, v_n\}$ and edge set $E$. Recall that the graph Laplacian of $G$ is the $n \times n$ matrix $L_G$ with
$$[L_G]_{i,j} = \begin{cases} -w_{i,j} & \text{if } \{v_i, v_j\} \in E, \\ \sum_{k \,:\, \{v_i, v_k\} \in E} w_{i,k} & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$
where $w_{i,j}$ is the weight of edge $\{v_i, v_j\}$. The graph Laplacian is positive semi-definite, and in terms of the adjacency matrix $A$ and the weighted degree matrix $D$ it can be expressed as $L_G = D - A$.

Spectral graph theory tells us that the low eigenvalue eigenvectors of $L_G$ are informative about the overall shape of $G$. One way of seeing this is to note that for any vector $z \in \mathbb{R}^n$,
$$z^\top L_G\, z = \sum_{\{i,j\} \in E} w_{i,j}\,(z_i - z_j)^2,$$
so the low eigenvalue eigenvectors are the smoothest functions on $G$, in the sense that they vary the least between adjacent vertices. An alternative interpretation emerges if we use $G$ to construct a Gaussian graphical model (Markov Random Field, or MRF) over $n$ variables $x_1, \ldots, x_n$, with clique potentials $\phi(x_i, x_j) = e^{-w_{i,j}(x_i - x_j)^2/2}$ for each edge and $\psi(x_i) = e^{-\eta x_i^2/2}$ for each vertex. The joint distribution of $x = (x_1, \ldots, x_n)$ is then
$$p(x) \;\propto\; \prod_{\{v_i, v_j\} \in E} e^{-w_{i,j}(x_i - x_j)^2/2} \prod_{v_i \in V} e^{-\eta x_i^2/2} \;=\; e^{-x^\top (L_G + \eta I)\, x / 2}, \tag{1}$$
showing that the covariance matrix of $x$ is $(L_G + \eta I)^{-1}$. Note that the $\psi$ factors were added to ensure that the distribution is normalizable, and $\eta$ is typically just a small constant "regularizer": $L_G$ actually has a zero eigenvalue eigenvector (namely the constant vector $n^{-1/2}(1, 1, \ldots, 1)^\top$), so without adding $\eta I$ we would not be able to invert it. In the following we will call $L_G + \eta I$ the regularized Laplacian, and denote it simply by $L$.

Both of the above views suggest that if we want to define a kernel between graphs that is sensitive to their overall shape, comparing the low eigenvalue eigenvectors of their Laplacians is a good place to start. Previous work by [13] also used the graph Laplacian for constructing a similarity function on graphs. Following the MRF route, given two graphs $G_1$ and $G_2$ of $n$ vertices, we can define the kernel between them to be a kernel between the corresponding distributions $p_1 = \mathcal{N}(0, L_1^{-1})$ and $p_2 = \mathcal{N}(0, L_2^{-1})$. Specifically, we will use the Bhattacharyya kernel [14]
$$k(p_1, p_2) = \int \sqrt{p_1(x)}\, \sqrt{p_2(x)}\; dx, \tag{2}$$
because for Gaussian distributions it can be computed in closed form, giving
$$k(p_1, p_2) = \frac{\bigl|\bigl(\tfrac{1}{2} L_1 + \tfrac{1}{2} L_2\bigr)^{-1}\bigr|^{1/2}}{|L_1^{-1}|^{1/4}\, |L_2^{-1}|^{1/4}}.$$

If some of the eigenvalues of $L_1^{-1}$ or $L_2^{-1}$ are zero or very close to zero, along certain directions in space the two distributions in (2) become very flat, leading to vanishingly small kernel values (unless the flat directions of the two Gaussians are perfectly aligned). To remedy this problem, similarly to [15], we "soften" (or regularize) the kernel by adding some small constant $\gamma$ times the identity to $L_1^{-1}$ and $L_2^{-1}$. This leads to what we call the Laplacian Graph Kernel.

**Definition 1.** Let $G_1$ and $G_2$ be two graphs of $n$ vertices each, with (regularized) Laplacians $L_1$ and $L_2$, respectively. We define the Laplacian graph kernel (LG kernel) with parameter $\gamma$ between $G_1$ and $G_2$ as
$$k_{LG}(G_1, G_2) = \frac{\bigl|\tfrac{1}{2} S_1^{-1} + \tfrac{1}{2} S_2^{-1}\bigr|^{-1/2}}{|S_1|^{1/4}\, |S_2|^{1/4}}, \tag{3}$$
where $S_1 = L_1^{-1} + \gamma I$ and $S_2 = L_2^{-1} + \gamma I$.
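As a concrete illustration of Definition 1, the following sketch (our own code, not the authors'; the example graphs, unit edge weights, and parameter values are ours) computes the LG kernel between two equally sized graphs directly from their adjacency matrices.

```python
# Sketch of the LG kernel of Definition 1, assuming unit edge weights;
# eta and gamma are the regularizers from the text.
import numpy as np

def regularized_laplacian(A, eta=0.01):
    """L = D - A, regularized so that it is invertible."""
    return np.diag(A.sum(axis=1)) - A + eta * np.eye(A.shape[0])

def lg_kernel(A1, A2, eta=0.01, gamma=0.1):
    """Bhattacharyya-type LG kernel between two n-vertex graphs, eq. (3)."""
    S1 = np.linalg.inv(regularized_laplacian(A1, eta)) + gamma * np.eye(A1.shape[0])
    S2 = np.linalg.inv(regularized_laplacian(A2, eta)) + gamma * np.eye(A2.shape[0])
    M = 0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2)
    return np.linalg.det(M) ** -0.5 / (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25)

# A 4-cycle versus a 4-path: the cycle is more similar to itself than to the path.
cycle = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], float)
path  = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
print(lg_kernel(cycle, cycle), lg_kernel(cycle, path))
```

Since (3) is a Bhattacharyya overlap of two normalized densities, the kernel of a graph with itself evaluates to 1, which makes the first printed value a handy sanity check.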
By virtue of (2), the LG kernel is positive semi-definite, and because the value of the overlap integral is largely determined by the extent to which the subspaces spanned by the largest eigenvalue eigenvectors of $L_1^{-1}$ and $L_2^{-1}$ are aligned, it effectively captures similarity between the overall shapes of $G_1$ and $G_2$. However, the LG kernel does suffer from three major limitations: it assumes that both graphs have the same number of vertices, it is only sensitive to the overall structure of the two graphs, and it is not invariant to permuting the vertices. Our goal for the rest of this paper is to overcome each of these limitations, while retaining the LG kernel's attractive spectral interpretation.

### 2.1 The feature space Laplacian graph kernel (FLG kernel)

In the probabilistic view of the LG kernel, every graph generates random vectors $x = (x_1, \ldots, x_n)$ according to (1), and the kernel between two graphs is determined by comparing the corresponding distributions. The invariance problem arises because the ordering of the variables $x_1, \ldots, x_n$ is arbitrary: even if $G_1$ and $G_2$ are topologically the same, $k_{LG}(G_1, G_2)$ might be low if their vertices happen to be numbered differently.

One of the central ideas of this paper is to address this issue by transforming from the vertex space variables $x_1, \ldots, x_n$ to feature space variables $y_1, \ldots, y_m$, where $y_i = \sum_j t_{i,j}(x_j)$ and each $t_{i,j}$ function may only depend on $j$ through local and reordering invariant properties of vertex $v_j$. If we then compute an analogous kernel to the LG kernel, but now between the distributions of the $y$'s rather than the $x$'s, the resulting kernel will be permutation invariant.

In the simplest case, the $t_{i,j}$ functions are linear, i.e., $t_{i,j}(x_j) = \phi_i(v_j)\, x_j$, where $(\phi_1, \ldots, \phi_m)$ is a collection of $m$ local (and permutation invariant) vertex features. For example, $\phi_i(v_j)$ may be the degree of vertex $v_j$, or the value of $h_\beta(v_j, v_j)$, where $h_\beta$ is the diffusion kernel on $G$ with length scale parameter $\beta$ (c.f. [16]). In the chemoinformatics setting, the $\phi_i$'s might be some way of encoding what type of atom is located at vertex $v_j$.

A linear transform of a multivariate normal random variable is again multivariate normal. In our case, defining $[U]_{i,j} = \phi_i(v_j)$ and $y = Ux$, we have $\mathbb{E}(y) = 0$ and $\mathrm{Cov}(y, y) = U\, \mathrm{Cov}(x, x)\, U^\top = U L^{-1} U^\top$, leading to the following kernel, which is the workhorse of the present paper.

**Definition 2.** Let $G_1$ and $G_2$ be two graphs with regularized Laplacians $L_1$ and $L_2$, respectively, $\gamma \ge 0$ a parameter, and $(\phi_1, \ldots, \phi_m)$ a collection of $m$ local vertex features. Define the corresponding feature mapping matrices
$$[U_1]_{i,j} = \phi_i(v_j), \qquad [U_2]_{i,j} = \phi_i(v'_j),$$
where $v_j$ is the $j$'th vertex of $G_1$ and $v'_j$ is the $j$'th vertex of $G_2$. The corresponding Feature space Laplacian graph kernel (FLG kernel) is defined as
$$k_{FLG}(G_1, G_2) = \frac{\bigl|\tfrac{1}{2} S_1^{-1} + \tfrac{1}{2} S_2^{-1}\bigr|^{-1/2}}{|S_1|^{1/4}\, |S_2|^{1/4}}, \tag{4}$$
where $S_1 = U_1 L_1^{-1} U_1^\top + \gamma I$ and $S_2 = U_2 L_2^{-1} U_2^\top + \gamma I$.

Since the $\phi_1, \ldots, \phi_m$ vertex features are, by definition, local and invariant to vertex renumbering, the FLG kernel is permutation invariant. Moreover, the distributions now live in the space of features rather than the space defined by the vertices, so we can apply the kernel to two graphs with different numbers of vertices. The major remaining shortcoming of the FLG kernel is that it cannot take into account structure at multiple different scales.
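To make Definition 2 concrete, here is a small sketch (ours, not the authors' code). The vertex features used, the degree and its square, are only an illustrative choice of local, permutation-invariant features.

```python
# Sketch of the FLG kernel (Definition 2) with explicit vertex features.
import numpy as np

def regularized_laplacian(A, eta=0.01):
    return np.diag(A.sum(axis=1)) - A + eta * np.eye(A.shape[0])

def feature_matrix(A):
    """m x n matrix U with [U]_{i,j} = phi_i(v_j); here m = 2 (degree, degree^2)."""
    deg = A.sum(axis=1)
    return np.vstack([deg, deg ** 2])

def flg_kernel(A1, A2, eta=0.01, gamma=0.1):
    U1, U2 = feature_matrix(A1), feature_matrix(A2)
    S1 = U1 @ np.linalg.inv(regularized_laplacian(A1, eta)) @ U1.T + gamma * np.eye(U1.shape[0])
    S2 = U2 @ np.linalg.inv(regularized_laplacian(A2, eta)) @ U2.T + gamma * np.eye(U2.shape[0])
    M = 0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2)
    return np.linalg.det(M) ** -0.5 / (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25)

# Unlike the LG kernel, the two graphs may now have different numbers of vertices.
triangle = np.array([[0,1,1],[1,0,1],[1,1,0]], float)
path4    = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
print(flg_kernel(triangle, path4))
```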
### 2.2 The kernelized FLG kernel

The key to boosting $k_{FLG}$ to a multiscale kernel is that it can itself be "kernelized", i.e., it can be computed from just the inner products between the feature vectors of the vertices (which we call the base kernel), without having to know the actual $\phi_i(v_j)$ feature values.

**Definition 3.** Given a collection $\phi = (\phi_1, \ldots, \phi_m)$ of local vertex features, we define the corresponding base kernel $\kappa$ between two vertices $v$ and $v'$ as the dot product of their feature vectors: $\kappa(v, v') = \phi(v)^\top \phi(v')$. Note that in this definition $v$ and $v'$ may be two vertices of the same graph, or of two different graphs.

We first show that, similarly to other kernel methods [17], to compute $k_{FLG}(G_1, G_2)$ one only needs to consider the subspace of $\mathbb{R}^m$ spanned by the feature vectors of their vertices.

**Proposition 1.** Let $G_1$ and $G_2$ be two graphs with vertex sets $V_1 = \{v_1, \ldots, v_{n_1}\}$ and $V_2 = \{v'_1, \ldots, v'_{n_2}\}$, and let $\{\xi_1, \ldots, \xi_p\}$ be an orthonormal basis for the subspace
$$W = \mathrm{span}\bigl\{\phi(v_1), \ldots, \phi(v_{n_1}),\ \phi(v'_1), \ldots, \phi(v'_{n_2})\bigr\},$$
with $p = \dim(W)$. Then (4) can be rewritten as
$$k_{FLG}(G_1, G_2) = \frac{\bigl|\tfrac{1}{2} \bar S_1^{-1} + \tfrac{1}{2} \bar S_2^{-1}\bigr|^{-1/2}}{|\bar S_1|^{1/4}\, |\bar S_2|^{1/4}}, \tag{5}$$
where $[\bar S_1]_{i,j} = \xi_i^\top S_1 \xi_j$ and $[\bar S_2]_{i,j} = \xi_i^\top S_2 \xi_j$. In other words, $\bar S_1$ and $\bar S_2$ are the projections of $S_1$ and $S_2$ to $W$.

Similarly to kernel PCA [18] or the Bhattacharyya kernel [15], the easiest way to construct the basis $\{\xi_1, \ldots, \xi_p\}$ required by (5) is to compute the eigendecomposition of the joint Gram matrix of the vertices of the two graphs.

**Proposition 2.** Let $G_1$ and $G_2$ be as in Proposition 1, let $V = \{v_1, \ldots, v_{n_1+n_2}\}$ be the union of their vertex sets (where it is assumed that the first $n_1$ vertices are $\{v_1, \ldots, v_{n_1}\}$ and the second $n_2$ vertices are $\{v'_1, \ldots, v'_{n_2}\}$), and define the joint Gram matrix $K \in \mathbb{R}^{(n_1+n_2)\times(n_1+n_2)}$ as $K_{i,j} = \kappa(v_i, v_j) = \phi(v_i)^\top \phi(v_j)$. Let $u_1, \ldots, u_p$ be a maximal orthonormal set of non-zero eigenvalue eigenvectors of $K$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_p$. Then the vectors
$$\xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{\ell=1}^{n_1+n_2} [u_i]_\ell\, \phi(v_\ell) \tag{6}$$
form an orthonormal basis for $W$. Moreover, defining $Q = [\lambda_1^{1/2} u_1, \ldots, \lambda_p^{1/2} u_p] \in \mathbb{R}^{(n_1+n_2)\times p}$ and setting $Q_1 = Q_{1:n_1,\,:}$ and $Q_2 = Q_{n_1+1:n_1+n_2,\,:}$ (the first $n_1$ and remaining $n_2$ rows of $Q$, respectively), the matrices $\bar S_1$ and $\bar S_2$ appearing in (5) can be computed as
$$\bar S_1 = Q_1^\top L_1^{-1} Q_1 + \gamma I, \qquad \bar S_2 = Q_2^\top L_2^{-1} Q_2 + \gamma I. \tag{7}$$

Proofs of these two propositions are given in the Supplemental Material. As in other kernel methods, the significance of Propositions 1 and 2 is not just that they show how $k_{FLG}(G_1, G_2)$ can be efficiently computed when $\phi$ is very high dimensional, but that they make it clear that the FLG kernel may be induced from any base kernel. For completeness, we close this section with the generalized definition of the FLG kernel.

**Definition 4.** Let $G_1$ and $G_2$ be two graphs. Assume that each of their vertices comes from an abstract vertex space $\mathcal{V}$ and that $\kappa: \mathcal{V}\times\mathcal{V} \to \mathbb{R}$ is a symmetric positive semi-definite kernel on $\mathcal{V}$. The generalized FLG kernel induced from $\kappa$ is then defined as
$$k^{\kappa}_{FLG}(G_1, G_2) = \frac{\bigl|\tfrac{1}{2} \bar S_1^{-1} + \tfrac{1}{2} \bar S_2^{-1}\bigr|^{-1/2}}{|\bar S_1|^{1/4}\, |\bar S_2|^{1/4}}, \tag{8}$$
where $\bar S_1$ and $\bar S_2$ are as defined in Proposition 2.

## 3 The multiscale Laplacian graph kernel (MLG kernel)

By a multiscale graph kernel we mean a kernel that is able to capture similarity between graphs based not just on the topological relationships between their individual vertices, but also on the topological relationships between subgraphs. The key property of the FLG kernel that allows us to build such a kernel is that it can be applied recursively. In broad terms, the construction goes as follows:
1. Given a graph $G$, associate each vertex with a subgraph centered around it and compute the FLG kernel between every pair of these subgraphs.
2. Reinterpret the FLG kernel between these subgraphs as a new base kernel between the center vertices of the subgraphs.
3. Consider larger subgraphs centered at each vertex, compute the FLG kernel between them induced from the new base kernel constructed in the previous step, and recurse.

To compute the actual multiscale graph kernel $K$ between $G$ and another graph $G'$, we follow the same process for $G'$ and then set $K(G, G')$ equal to the FLG kernel induced from their top level base kernels. The following definitions formalize this construction.

**Definition 5.** Let $G$ be a graph with vertex set $V$, and $\kappa$ a positive semi-definite kernel on $V$. Assume that for each $v \in V$ we have a nested sequence of $L$ neighborhoods
$$v \in N_1(v) \subseteq N_2(v) \subseteq \cdots \subseteq N_L(v) \subseteq V, \tag{9}$$
and for each $N_\ell(v)$, let $G_\ell(v)$ be the corresponding induced subgraph of $G$. We define the Multiscale Laplacian Subgraph Kernels (MLS kernels) $K_1, \ldots, K_L : V \times V \to \mathbb{R}$ as follows:

1. $K_1$ is just the FLG kernel $k^{\kappa}_{FLG}$ induced from the base kernel $\kappa$ between the lowest level subgraphs: $K_1(v, v') = k^{\kappa}_{FLG}(G_1(v), G_1(v'))$.
2. For $\ell = 2, 3, \ldots, L$, $K_\ell$ is the FLG kernel induced from $K_{\ell-1}$ between $G_\ell(v)$ and $G_\ell(v')$: $K_\ell(v, v') = k^{K_{\ell-1}}_{FLG}(G_\ell(v), G_\ell(v'))$.

Definition 5 defines the MLS kernel as a kernel between different subgraphs of the same graph $G$. However, if two graphs $G_1$ and $G_2$ share the same base kernel, the MLS kernel can also be used to compare any subgraph of $G_1$ with any subgraph of $G_2$. This is what allows us to define an $(L{+}1)$'th FLG kernel, which compares the two full graphs.

**Definition 6.** Let $\mathcal{G}$ be a collection of graphs such that all their vertices are members of an abstract vertex space $\mathcal{V}$ endowed with a symmetric positive semi-definite kernel $\kappa: \mathcal{V}\times\mathcal{V} \to \mathbb{R}$. Assume that the MLS kernels $K_1, \ldots, K_L$ are defined as in Definition 5, both for pairs of subgraphs within the same graph and across pairs of different graphs. We define the Multiscale Laplacian Graph Kernel (MLG kernel) between any two graphs $G_1, G_2 \in \mathcal{G}$ as $K(G_1, G_2) = k^{K_L}_{FLG}(G_1, G_2)$.

Definition 5 leaves open the question of how the neighborhoods $N_1(v), \ldots, N_L(v)$ are to be defined. In the simplest case, we set $N_\ell(v)$ to be the ball $B_r(v)$ (i.e., the set of vertices at a distance of at most $r$ from $v$), where $r = r_0 d^{\ell-1}$ for some $d > 1$.

### 3.1 Computational complexity

Definitions 5 and 6 suggest a recursive approach to computing the MLG kernel: computing $K(G_1, G_2)$ first requires computing $K_L(v, v')$ between all $\binom{n_1+n_2}{2}$ pairs of top level subgraphs across $G_1$ and $G_2$; each of these kernel evaluations requires computing $K_{L-1}(v, v')$ between up to $O(n^2)$ pairs of level $L{-}1$ subgraphs, and so on. Following this recursion blindly would require up to $O(n^{2L+2})$ kernel evaluations, which is clearly infeasible. The recursive strategy is wasteful because it involves evaluating the same kernel entries over and over again in different parts of the recursion tree. An alternative solution, which requires only $O(Ln^2)$ kernel evaluations, is to first compute $K_1(v, v')$ for all $(v, v')$ pairs, then compute $K_2(v, v')$ for all $(v, v')$ pairs, and so on.
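The sketch below (ours, not the authors' released code) illustrates this level-by-level strategy, restricted for brevity to subgraphs of a single graph: at each level, every pair of neighborhood subgraphs is compared with the kernelized FLG kernel of Proposition 2, and the resulting kernel matrix becomes the base kernel for the next level. The radii, regularizers, and the degree-based base kernel in the example are illustrative choices.

```python
# Level-by-level MLS kernel computation (Section 3.1), simplified sketch.
import numpy as np

def reg_lap_inv(A_sub, eta=0.01):
    """Inverse of the regularized Laplacian of an induced subgraph."""
    L = np.diag(A_sub.sum(axis=1)) - A_sub + eta * np.eye(A_sub.shape[0])
    return np.linalg.inv(L)

def flg_from_base(K_base, idx1, idx2, Linv1, Linv2, gamma=0.1):
    """Kernelized FLG kernel between two subgraphs (Proposition 2)."""
    idx = np.concatenate([idx1, idx2])
    lam, U = np.linalg.eigh(K_base[np.ix_(idx, idx)])
    keep = lam > 1e-8
    Q = U[:, keep] * np.sqrt(lam[keep])        # rows: vertex coordinates in W
    Q1, Q2 = Q[:len(idx1)], Q[len(idx1):]
    I = np.eye(Q.shape[1])
    S1 = Q1.T @ Linv1 @ Q1 + gamma * I
    S2 = Q2.T @ Linv2 @ Q2 + gamma * I
    M = 0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2)
    return np.linalg.det(M) ** -0.5 / (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25)

def mls_kernel_matrix(A, K0, radii=(1, 2)):
    """O(L n^2) level-by-level computation of the MLS kernels K_1, ..., K_L."""
    n, K = A.shape[0], K0.copy()
    for r in radii:                            # one pass per level
        reach = np.linalg.matrix_power(A + np.eye(n), r) > 0
        balls = [np.flatnonzero(reach[v]) for v in range(n)]
        Linvs = [reg_lap_inv(A[np.ix_(b, b)]) for b in balls]
        K_next = np.zeros((n, n))
        for v in range(n):
            for w in range(v, n):
                K_next[v, w] = K_next[w, v] = flg_from_base(
                    K, balls[v], balls[w], Linvs[v], Linvs[w])
        K = K_next
    return K

# Example: vertex degree as a one-dimensional feature, dot-product base kernel.
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], float)
deg = A.sum(axis=1, keepdims=True)
print(mls_kernel_matrix(A, deg @ deg.T))
```

Comparing two graphs with this scheme amounts to running the same loop over their concatenated vertex set and finally evaluating one more FLG kernel between the two full graphs, as in Definition 6.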
## 4 Linearized Kernels and Low Rank Approximation

Computing the MLG kernel between two graphs, as described in the previous section, may involve $O(Ln^2)$ kernel evaluations. At the top levels of the hierarchy each $G_\ell(v)$ might have $\Theta(n)$ vertices, so the cost of a single FLG kernel evaluation can be as high as $O(n^3)$. Somewhat pessimistically, this means that the overall cost of computing $k_{FLG}(G_1, G_2)$ is $O(Ln^5)$. Given a dataset of $M$ graphs, computing their Gram matrix requires repeating this for all $\{G_1, G_2\}$ pairs, giving $O(LM^2 n^5)$, which is even more problematic. The solution that we propose in this section is to compute, for each level $\ell = 1, 2, \ldots, L+1$, a single joint basis for all subgraphs at the given level across all graphs $G_1, \ldots, G_M$. For concreteness, we go back to the definition of the FLG kernel.

**Definition 7.** Let $\mathcal{G} = \{G_1, \ldots, G_M\}$ be a collection of graphs, $V_1, \ldots, V_M$ their vertex sets, and assume that $V_1, \ldots, V_M \subseteq \mathcal{V}$ for some general vertex space $\mathcal{V}$. Further, assume that $\kappa: \mathcal{V}\times\mathcal{V} \to \mathbb{R}$ is a positive semi-definite kernel on $\mathcal{V}$, $\mathcal{H}_\kappa$ is its Reproducing Kernel Hilbert Space, and $\phi: \mathcal{V} \to \mathcal{H}_\kappa$ is the corresponding feature map satisfying $\kappa(v, v') = \langle\phi(v), \phi(v')\rangle$ for any $v, v' \in \mathcal{V}$. The joint vertex feature space of $\{G_1, \ldots, G_M\}$ is then
$$W_{\mathcal{G}} = \mathrm{span}\Bigl(\bigcup_{i=1}^{M} \bigcup_{v \in V_i} \{\phi(v)\}\Bigr).$$

$W_{\mathcal{G}}$ is just the generalization of the $W$ space defined in Proposition 1 from two graphs to $M$. The following generalization of Propositions 1 and 2 is then immediate.

**Proposition 3.** Let $N = \sum_{i=1}^M |V_i|$, let $V = (v_1, \ldots, v_N)$ be the concatenation of the vertex sets $V_1, \ldots, V_M$, and let $K$ be the corresponding joint Gram matrix $K_{i,j} = \kappa(v_i, v_j) = \langle\phi(v_i), \phi(v_j)\rangle$. Let $u_1, \ldots, u_P$ be a maximal orthonormal set of non-zero eigenvalue eigenvectors of $K$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_P$, where $P = \dim(W_{\mathcal{G}})$. Then the vectors
$$\xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{\ell=1}^{N} [u_i]_\ell\, \phi(v_\ell), \qquad i = 1, \ldots, P,$$
form an orthonormal basis for $W_{\mathcal{G}}$. Moreover, defining $Q = [\lambda_1^{1/2} u_1, \ldots, \lambda_P^{1/2} u_P] \in \mathbb{R}^{N \times P}$, and setting $Q_1$ to be the submatrix of $Q$ composed of its first $|V_1|$ rows, $Q_2$ the submatrix composed of the next $|V_2|$ rows, and so on, for any $G_i, G_j \in \mathcal{G}$ the generalized FLG kernel induced from $\kappa$ (Definition 4) can be expressed as
$$k_{FLG}(G_i, G_j) = \frac{\bigl|\tfrac{1}{2} \bar S_i^{-1} + \tfrac{1}{2} \bar S_j^{-1}\bigr|^{-1/2}}{|\bar S_i|^{1/4}\, |\bar S_j|^{1/4}}, \tag{10}$$
where $\bar S_i = Q_i^\top L_i^{-1} Q_i + \gamma I$ and $\bar S_j = Q_j^\top L_j^{-1} Q_j + \gamma I$.

The significance of Proposition 3 is that $\bar S_1, \ldots, \bar S_M$ are now fixed matrices that do not need to be recomputed for each kernel evaluation. Once we have constructed the joint basis $\{\xi_1, \ldots, \xi_P\}$, the $\bar S_i$ matrix of each graph $G_i$ can be computed independently, as a precomputation step, and individual kernel evaluations reduce to just plugging them into (10).

At a conceptual level, Proposition 3 linearizes the kernel $\kappa$ by projecting everything down to $W_{\mathcal{G}}$. In particular, it replaces the $\{\phi(v_i)\}$ RKHS vectors with explicit finite dimensional feature vectors given by the corresponding rows of $Q$, just like we had in the unkernelized FLG kernel of Definition 2. For our multiscale kernels this is particularly important, because linearizing not just $k^{\kappa}_{FLG}$, but also $k^{K_1}_{FLG}, k^{K_2}_{FLG}, \ldots$, allows us to compute the MLG kernel level by level, without recursion. After linearizing the base kernel $\kappa$, we attach explicit, finite dimensional vectors to each vertex of each graph. Then we compute $k^{K_1}_{FLG}$ between all pairs of lowest level subgraphs, and by linearizing this kernel as well, each vertex effectively just gets an updated feature vector. We then repeat the process for $k^{K_2}_{FLG}, \ldots, k^{K_L}_{FLG}$, and finally we compute the MLG kernel $K(G_1, G_2)$.
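The following sketch (ours; variable names and parameters are illustrative) shows the linearization of Proposition 3: one eigendecomposition of the joint Gram matrix across the whole dataset, then per-graph precomputation of the projected matrices, after which every pairwise FLG kernel evaluation only involves small matrices.

```python
# Linearization of a base kernel over a dataset of graphs (Proposition 3), sketch.
import numpy as np

def linearize(K_joint, sizes, lap_invs, gamma=0.1, tol=1e-8):
    """K_joint: N x N base-kernel Gram matrix over the concatenated vertex sets;
    sizes: [|V_1|, ..., |V_M|]; lap_invs: inverses of the regularized Laplacians
    L_i. Returns the precomputed per-graph S matrices."""
    lam, U = np.linalg.eigh(K_joint)
    keep = lam > tol
    Q = U[:, keep] * np.sqrt(lam[keep])        # row j = feature vector of vertex j
    S, offset = [], 0
    for size, Linv in zip(sizes, lap_invs):
        Qi = Q[offset: offset + size]
        S.append(Qi.T @ Linv @ Qi + gamma * np.eye(Q.shape[1]))
        offset += size
    return S

def flg_from_S(Si, Sj):
    """Pairwise FLG kernel, eq. (10), from the precomputed S matrices."""
    M = 0.5 * np.linalg.inv(Si) + 0.5 * np.linalg.inv(Sj)
    return np.linalg.det(M) ** -0.5 / (np.linalg.det(Si) ** 0.25 * np.linalg.det(Sj) ** 0.25)
```

In the multiscale setting this step is applied at every level, with the level-$\ell$ kernel matrix playing the role of `K_joint` for level $\ell{+}1$.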
### 4.1 Randomized low rank approximation

The difficulty in the above approach, of course, is that at each level the joint Gram matrix $K$ is a Gram matrix between all vertices of all graphs, so storing it is already very costly, let alone computing its eigendecomposition. Moreover, $P = \dim(W_{\mathcal{G}})$ is also very large, so managing the $\bar S_1, \ldots, \bar S_M$ matrices (each of which is of size $P \times P$) becomes infeasible. The natural alternative is to replace $W_{\mathcal{G}}$ by a smaller, approximate joint feature space, defined as follows.

**Definition 8.** Let $\mathcal{G}$, $\kappa$, $\mathcal{H}_\kappa$ and $\phi$ be defined as in Definition 7. Let $\tilde V = (\tilde v_1, \ldots, \tilde v_{\tilde N})$ be $\tilde N \le N$ vertices sampled from the joint vertex set $V = (v_1, \ldots, v_N)$. Then the corresponding subsampled vertex feature space is
$$\tilde W_{\mathcal{G}} = \mathrm{span}\{\phi(\tilde v) \mid \tilde v \in \tilde V\},$$
with $\tilde P = \dim(\tilde W_{\mathcal{G}})$.

Similarly to before, we construct an orthonormal basis $\{\xi_1, \ldots, \xi_{\tilde P}\}$ for $\tilde W_{\mathcal{G}}$ by forming the (now much smaller) Gram matrix $\tilde K_{i,j} = \kappa(\tilde v_i, \tilde v_j)$, computing its eigenvalues and eigenvectors, and setting $\xi_i = \frac{1}{\sqrt{\lambda_i}} \sum_{\ell=1}^{\tilde N} [u_i]_\ell\, \phi(\tilde v_\ell)$. The resulting approximate FLG kernel is
$$\tilde k_{FLG}(G_i, G_j) = \frac{\bigl|\tfrac{1}{2} \tilde S_i^{-1} + \tfrac{1}{2} \tilde S_j^{-1}\bigr|^{-1/2}}{|\tilde S_i|^{1/4}\, |\tilde S_j|^{1/4}}, \tag{11}$$
where $\tilde S_i = \tilde Q_i^\top L_i^{-1} \tilde Q_i + \gamma I$ and $\tilde S_j = \tilde Q_j^\top L_j^{-1} \tilde Q_j + \gamma I$ are the projections of $S_i$ and $S_j$ to $\tilde W_{\mathcal{G}}$.

We introduce a further layer of approximation by restricting $\tilde W_{\mathcal{G}}$ to the space spanned by the first $\hat P < \tilde P$ basis vectors (ordered by descending eigenvalue), effectively doing kernel PCA on $\{\phi(\tilde v)\}_{\tilde v \in \tilde V}$, or equivalently, a low rank approximation of $\tilde K$. Assuming that $v^g_j$ is the $j$'th vertex of $G_g$, in contrast to Proposition 2, now the $j$'th row of $\tilde Q_g$ consists of the coordinates of the projection of $\phi(v^g_j)$ onto $\tilde W_{\mathcal{G}}$, i.e.,
$$[\tilde Q_g]_{j,i} = \frac{1}{\sqrt{\lambda_i}} \sum_{\ell=1}^{\tilde N} [u_i]_\ell\, \langle\phi(v^g_j), \phi(\tilde v_\ell)\rangle = \frac{1}{\sqrt{\lambda_i}} \sum_{\ell=1}^{\tilde N} [u_i]_\ell\, \kappa(v^g_j, \tilde v_\ell).$$

The above procedure is similar to the popular Nyström approximation for kernel matrices [19, 20], except that in our case the ultimate goal is not to approximate the joint Gram matrix $K$ itself, but the $\bar S_1, \ldots, \bar S_M$ matrices used to form the FLG kernel. In practice, we found that the eigenvalues of $\tilde K$ usually drop off very rapidly, suggesting that $\tilde W_{\mathcal{G}}$ can be safely approximated by a surprisingly small dimensional subspace ($\hat P \approx 10$), and correspondingly the sample size $\tilde N$ can be kept quite small as well (on the order of 100). The combination of these two factors makes computing the entire stack of kernels feasible, reducing the complexity of computing the Gram matrix for a dataset of $M$ graphs of $\Theta(n)$ vertices each to $O(ML\tilde N^2 \hat P^3 + ML\tilde N^3 + M^2 \hat P^3)$. It is also important to note that this linearization step requires the graphs (not the labels) in the test set to be known during training, in order to project the features of the test graphs onto the low rank approximation of $W_{\mathcal{G}}$.
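The sketch below (ours; the sampling scheme and parameter values are illustrative) shows the subsampled projection of Definition 8: a small set of "anchor" vertices defines the basis, and every vertex of every graph is then projected onto the top $\hat P$ components.

```python
# Randomized low rank projection of Section 4.1, sketch.
import numpy as np

def subsampled_projection(kappa, all_vertices, n_sample=100, p_hat=10, seed=0):
    """kappa(v, w): base kernel; all_vertices: vertex descriptors for the whole
    dataset. Returns the sampled anchors and a function mapping a list of
    vertices to the rows of the corresponding Q-tilde block."""
    rng = np.random.default_rng(seed)
    pick = rng.choice(len(all_vertices), size=min(n_sample, len(all_vertices)), replace=False)
    anchors = [all_vertices[i] for i in pick]
    K_tilde = np.array([[kappa(a, b) for b in anchors] for a in anchors])
    lam, U = np.linalg.eigh(K_tilde)
    order = np.argsort(lam)[::-1][:p_hat]      # top P_hat eigenpairs
    lam, U = lam[order], U[:, order]
    keep = lam > 1e-8                          # guard against numerically zero modes
    lam, U = lam[keep], U[:, keep]

    def project(vertices):
        C = np.array([[kappa(v, a) for a in anchors] for v in vertices])
        return C @ U / np.sqrt(lam)            # [Q_g]_{j,i} = sum_l [u_i]_l k(v_j, a_l) / sqrt(lam_i)

    return anchors, project
```

The rows returned by `project` play the role of the $\tilde Q_g$ blocks in (11); the default sample size and $\hat P$ mirror the ballpark figures quoted above ($\tilde N$ on the order of 100, $\hat P \approx 10$).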
## 5 Experiments

We tested the efficacy of the MLG kernel by performing classification on benchmark bioinformatics datasets using a binary C-SVM solver [21], and compared our classification results against those from other representative graph kernels from the literature: the Weisfeiler-Lehman kernel, the Weisfeiler-Lehman edge kernel [9], the shortest path kernel [6], the graphlet kernel [9], and the p-random walk kernel [5]. The results are summarized in Table 1.

Table 1: Classification results (mean accuracy ± standard deviation)

| Method | MUTAG [22] | PTC [23] | ENZYMES [2] | PROTEINS [2] | NCI1 [24] | NCI109 [24] |
|---|---|---|---|---|---|---|
| WL | 84.50 (±2.16) | 59.97 (±1.60) | 53.75 (±1.37) | 75.43 (±1.95) | 84.76 (±0.32) | 85.12 (±0.29) |
| WL-Edge | 82.94 (±2.33) | 60.18 (±2.19) | 52.00 (±0.72) | 73.63 (±2.12) | 84.65 (±0.25) | 85.32 (±0.34) |
| SP | 85.50 (±2.50) | 59.53 (±1.71) | 42.31 (±1.37) | 75.61 (±0.45) | 73.61 (±0.36) | 73.23 (±0.26) |
| Graphlet | 82.44 (±1.29) | 55.88 (±0.31) | 30.95 (±0.73) | 71.63 (±0.33) | 62.40 (±0.27) | 62.35 (±0.28) |
| p-RW | 80.33 (±1.35) | 59.85 (±0.95) | 28.17 (±0.76) | 71.67 (±0.78) | TIMED OUT | TIMED OUT |
| MLG | 84.21 (±2.61) | 63.62 (±4.69) | 57.92 (±5.39) | 76.14 (±1.95) | 80.83 (±1.29) | 81.30 (±0.80) |

We randomly selected 20% of each dataset to be used as a test set. On the other 80% we did 10-fold cross validation to select the parameters for each kernel method to be used on the test set, and repeated this setup 10 times. For the Weisfeiler-Lehman kernels, the height parameter $h$ was chosen from $\{1, 2, \ldots, 5\}$; the walk length $p$ for the p-random walk kernel was chosen from $\{1, 2, \ldots, 5\}$; for the graphlet kernel the graphlet size was chosen from $\{3, 4, 5\}$. For the MLG kernel, we chose $\eta$ from $\{0.01, 0.1, 1\}$, the neighborhood radius from $\{1, 2, 3\}$, the number of levels $L$ from $\{1, 2, 3\}$, and fixed $\gamma$ at 0.01. We used the given discrete node labels to create a one-hot binary feature vector for each node, and used the dot product between the nodes' binary feature vectors as the base kernel. All experiments were run on a 16-core Intel E5-2670 @ 2.6 GHz processor with 32 GB of memory.

The MLG kernel is fairly competitive in accuracy on all datasets except NCI1 and NCI109, where it still performs better than all non-Weisfeiler-Lehman kernels. The Supplementary Materials give a more detailed discussion of the experiments and datasets.

## 6 Conclusions

In this paper we have proposed two new graph kernels: (1) the FLG kernel, a very simple single level kernel that combines information attached to the vertices with the graph Laplacian; and (2) the MLG kernel, a multilevel, recursively defined kernel that captures topological relationships not just between individual vertices, but also between subgraphs. Clearly, designing kernels that can optimally take into account the multiscale structure of actual chemical compounds is a challenging task that will require further work and domain knowledge. However, it is encouraging that even "straight out of the box", tuning only two or three parameters, the MLG kernel is competitive with other well known kernels in the literature. Beyond graphs, the general idea of multiscale kernels is of interest for other types of data that have multiresolution structure (such as images), and the way that the MLG kernel chains together local spectral analysis at multiple scales is potentially applicable to these domains as well; this will be the subject of further research.

## Acknowledgements

This work was completed in part with computing resources provided by the University of Chicago Research Computing Center and with the support of DARPA-D16AP00112 and NSF-1320344.

## References

[1] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321–354, 2003.

[2] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005.

[3] H. Kubinyi. Drug research: myths, hype and reality. Nature Reviews Drug Discovery, 2(8):665–668, August 2003.

[4] T. Gärtner. Exponential and geometric kernels for graphs. In NIPS 2002 workshop on unreal data: Principles of modeling nonvectorial data, 2002.

[5] S. V. N. Vishwanathan, Karsten Borgwardt, Risi Kondor, and Nicol Schraudolph. On graph kernels. Journal of Machine Learning Research (JMLR), 11, 2010.

[6] Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, Texas, USA, pages 74–81, 2005.
[7] Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, and Karsten M. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, 2013.

[8] Risi Kondor and Karsten Borgwardt. The skew spectrum of graphs. In Proceedings of the International Conference on Machine Learning (ICML), pages 496–503. ACM, 2008.

[9] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, 2009.

[10] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research (JMLR), 12:2539–2561, November 2011.

[11] Marion Neumann, Roman Garnett, Christian Bauckhage, and Kristian Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 2016.

[12] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

[13] Fredrik D. Johansson and Devdatt Dubhashi. Learning with similarity functions on graphs using matchings of geometric embeddings. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 467–476, 2015.

[14] Tony Jebara and Risi Kondor. Bhattacharyya and expected likelihood kernels. In Proceedings of the Annual Conference on Computational Learning Theory and Kernels Workshop (COLT/KW), 2003.

[15] Risi Kondor and Tony Jebara. A kernel between sets of vectors. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[16] Marc Alexa, Michael Kazhdan, and Leonidas Guibas. A Concise and Provably Informative Multi-Scale Signature Based on Heat Diffusion. In Proceedings of the Eurographics Symposium on Geometry Processing, volume 28, 2009.

[17] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

[18] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, Matthias Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, pages 536–542, 1999.

[19] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.

[20] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

[21] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 3, 2011.

[22] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem, 34:786–797, 1991.

[23] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma. Statistical evaluation of the predictive toxicology challenge. Bioinformatics, pages 1183–1193, 2003.

[24] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, pages 347–375, 2008.