# Taming graph kernels with random features

Krzysztof Choromanski^{1,2}

^1 Google DeepMind, ^2 Columbia University. Correspondence to: Krzysztof Choromanski. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

We introduce in this paper the mechanism of graph random features (GRFs). GRFs can be used to construct unbiased randomized estimators of several important kernels defined on graph nodes, in particular the regularized Laplacian kernel. Like regular RFs for non-graph kernels, they provide a means to scale up kernel methods defined on graphs to larger networks. Importantly, they also give substantial computational gains for smaller graphs when applied in downstream applications. Consequently, GRFs address the notoriously difficult problem of the cubic (in the number of nodes of the graph) time complexity of graph kernel algorithms. We provide a detailed theoretical analysis of GRFs and an extensive empirical evaluation: from speed tests, through Frobenius relative error analysis, to k-means graph-clustering with graph kernels. We show that the computation of GRFs admits an embarrassingly simple distributed algorithm that can be applied if the graph under consideration needs to be split across several machines. We also introduce a (still unbiased) quasi Monte Carlo variant of GRFs, q-GRFs, relying on so-called reinforced random walks, that might be used to optimize the variance of GRFs. As a byproduct, we obtain a novel approach to solving certain classes of linear equations with positive and symmetric matrices.

## 1. Introduction

Consider a positive definite kernel (similarity function) $\mathrm{K}: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$. Kernel methods (Campbell, 2002; Kontorovich et al., 2008; Smola & Schölkopf, 2002; Canu & Smola, 2006; Ghojogh et al., 2021; Cumby & Roth, 2003) provide powerful mechanisms for modeling non-linear relationships between data points. However, since they rely on the so-called kernel matrix $\mathbf{K} = [\mathrm{K}(\mathbf{x}_i, \mathbf{x}_j)]_{i,j}$ for the dataset $\mathcal{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$ under consideration, their time complexity is at least quadratic in the size of the dataset. To address this limitation, the mechanism of random features (RFs) (Liu et al., 2022; Rahimi & Recht, 2007; 2008) was proposed to linearize kernel functions. Random features rely on randomized maps $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^m$ that transform datapoints from the original space to a new space in which the linear (dot-product) kernel corresponds to the original kernel, i.e.:

$$\mathrm{K}(\mathbf{x}, \mathbf{y}) = \mathbb{E}[\phi(\mathbf{x})^\top \phi(\mathbf{y})] \quad (1)$$

The theory of random features for kernels was born with the seminal paper (Johnson, 1984a), proposing a simple randomized linear transformation $\phi$ (with Gaussian random matrices) to approximate dot-product kernels via dimensionality reduction ($m \ll d$). That mapping $\phi$ is often referred to as the Johnson-Lindenstrauss Transform (JLT). Interestingly, it turned out that for several nonlinear kernels, the JLT can be modified, by adding a nonlinear function acting element-wise on the linearly transformed dimensions of the input vectors (and potentially by replacing the Gaussian sampling mechanism and altering the constant multiplicative renormalization term), to obtain Eq. 1. For radial basis function kernels (RBFs) (Kégl et al., 1998) and softmax kernels, these nonlinear functions are trigonometric transformations (Rahimi & Recht, 2007).
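To make Eq. 1 concrete, below is a minimal NumPy sketch of the classical random Fourier feature mechanism of Rahimi & Recht (2007) for the Gaussian kernel (an instantiation of RBFs). It is provided only as an illustration of the general RF template discussed here; all function and variable names are our own.

```python
import numpy as np

def rff_features(X, m, sigma=1.0, seed=0):
    """Random Fourier features: rows of X (n x d) are mapped to m-dimensional
    vectors phi(x) with E[phi(x) . phi(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))   # Gaussian projection directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)        # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)      # element-wise nonlinearity on a JLT-style map

# Sanity check against the exact Gaussian kernel (error shrinks as m grows).
X = np.random.default_rng(1).normal(size=(5, 3))
Phi = rff_features(X, m=20_000)
approx = Phi @ Phi.T
exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.max(np.abs(approx - exact)))
```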
For the Gaussian kernel (an instantiation of RBFs) and the softmax kernel, one can also apply the so-called positive random features, leveraging exponential nonlinearities (Choromanski et al., 2020). For the angular kernel, the sign function is used (Goemans & Williamson, 2004; Choromanski et al., 2017). Thus linearization schemes for different kernels are obtained by applying different element-wise transformations, but the overall structure of $\phi$ is the same across different kernels. RFs provide an unbiased low-rank decomposition of the kernel matrix $\mathbf{K}$, effectively enabling (approximate) computations with $\mathbf{K}$ to be conducted in sub-quadratic time, bypassing the explicit materialization of $\mathbf{K}$. Even more importantly, they help to create nonlinear variants of simpler linear methods (e.g. kernel SVM (Boser et al., 1992) as opposed to regular SVM (Cortes & Vapnik, 1995), or kernel regression (Chávez et al., 2020) as opposed to linear regression (Goldin, 2010)) via the $\phi$-transformation.

So far we have considered points embedded in the regular Euclidean $\mathbb{R}^d$-space, but what if the space itself is non-Euclidean? Upon discretization, such a space can be approximated by an undirected weighted graph (often equipped with the shortest-path-distance metric), and in that picture the points correspond to graph nodes. It is thus natural to consider in that context graph kernels $\mathrm{K}: \mathrm{V} \times \mathrm{V} \rightarrow \mathbb{R}$ defined on the set of nodes $\mathrm{V}$ of the graph. And indeed, several such kernels were proposed: diffusion, regularized Laplacian, p-step random walk, inverse cosine and more (Smola & Kondor, 2003; Kondor & Lafferty, 2002; Chung & Yau, 1999). The term graph kernels is also often used for kernels taking graphs as input, thus computing similarities between two graphs rather than two nodes of a given graph (Vishwanathan et al., 2010). Those are not a subject of this paper.

Unfortunately, for most graph kernels, the computation of the kernel matrix $\mathbf{K}$ is at least cubic in the number of nodes of the graph. This is a major obstacle on the quest to make those methods more practical and mainstream kernel techniques. A tempting idea is to try to apply random features also for graph kernels, yet so far no analogue of the mechanism presented above for regular kernels has been proposed (see Sec. 2 for additional discussion of some efforts to apply RF-based techniques in the theory of graph kernels in the broad sense). This is in striking contrast to kernels operating in $\mathbb{R}^d$, where not only are RFs widely adopted, but (as already discussed) the core underlying mechanism is the same for a large plethora of different kernels, irrespective of their taxonomy, e.g. shift-invariant (Gaussian, Laplace, Cauchy) or not (softmax, dot-product). One of the main challenges is that the value of a graph kernel is no longer solely a property of its input nodes, but also of the medium (the graph itself) where those nodes reside. In fact, it does not even make sense to talk about the properties of graph nodes in abstraction from the whole graph (in this paper we do not assume any internal graph-node structure).

This work is one of the first steps towards developing a rigorous theory of random features for graph kernels. We introduce here the mechanism of graph random features (GRFs). GRFs can be used to construct unbiased randomized estimators of several important kernels defined on graph nodes, in particular the regularized Laplacian kernel.
Like regular RFs for non-graph kernels, they provide a means to scale up kernel methods defined on graphs to larger networks. Importantly, they also give substantial computational gains for smaller graphs when applied in downstream applications. Consequently, GRFs address the notoriously difficult problem of the cubic (in the number of nodes of the graph) time complexity of graph kernel algorithms. We provide a detailed theoretical analysis of GRFs and an extensive empirical evaluation: from speed tests, through Frobenius relative error analysis, to k-means graph-clustering with graph kernels. We show that the computation of GRFs admits an embarrassingly simple distributed algorithm that can be applied for particularly massive graphs. We also introduce a (still unbiased) quasi Monte Carlo variant of GRFs, q-GRFs, relying on so-called reinforced random walks (Figueiredo & Garetto, 2017), that might be used to optimize the variance of GRFs. As a byproduct, we also obtain a novel approach for solving linear equations with positive and symmetric matrices.

The q-GRF variants of our algorithm (which, as we show, still provide unbiased estimation) aim to translate the geometric techniques of structured random projections, recently successfully applied in the theory of RFs (see Sec. 2 for a detailed discussion), to non-Euclidean domains defined by graphs. To the best of our knowledge, this has never been done before. We propose this idea in the paper, yet leave a comprehensive theoretical and empirical analysis to future work.

## 2. Related work

Probabilistic techniques are used in different contexts in the theory of graph kernels. Sampling algorithms can be applied to make the computation of kernels taking graphs as inputs (rather than graph nodes) feasible (Yanardag & Vishwanathan, 2015; Shervashidze et al., 2009; Leskovec & Faloutsos, 2006; Orbanz, 2017; Bressan, 2020; Wang et al., 2011). Examples include in particular graphlet kernels, where counters of various small-size sub-graphs are replaced by their counterparts obtained via sub-graph sub-sampling (often via random walks) (Ribeiro et al., 2022; Wu et al., 2019; Chen et al., 2016). They provide more tractable, yet biased, estimation of the original objective.

Other approaches define new classes of graph kernels starting from RF-inspired representations. Examples include the structure-aware random Fourier kernels from (Fang et al., 2021), which apply so-called L-hop sub-graphs processed by graph neural networks (GNNs) to obtain their latent embeddings and define a kernel by taking their dot products. Even though here we proceed in the reverse order - starting from well-known graph kernels defined by deterministic closed-form expressions and producing their unbiased RF-based representations - there is an interesting link between that line of research and our work. GRFs can be cast as representations with a similar core structure, but much more specific and GNN-free, and consequently providing simplicity, low computation time and strong theoretical guarantees.

Quasi Monte Carlo (q-MC) methods, where independent samples are replaced by correlated ensembles for more accurate estimation, are powerful MC techniques.
In the theory of RFs for regular kernels, new q-MC methods applying block-orthogonal ensembles of random Gaussian vectors (the so-called orthogonal random features or ORFs) were shown to (sometimes only asymptotically) reduce the approximation variance for various kernels (Yu et al., 2016; Lin et al., 2020; Choromanski et al., 2018b; 2017; 2020; Likhosherstov et al., 2022; Choromanski et al., 2018a).

## 3. Graph random features (GRFs)

### 3.1. Hidden Gram-structure of squared inverse matrices

We start our analysis, which will eventually lead us to the definition of GRFs, by forgetting (only temporarily) about graph theory and focusing on positive symmetric matrices.

#### 3.1.1. Preliminaries

Consider a positive symmetric matrix $\mathbf{U} \in \mathbb{R}^{N \times N}$ of spectral radius $\rho(\mathbf{U}) \overset{\mathrm{def}}{=} \max_{\lambda \in \Lambda(\mathbf{U})} |\lambda|$ satisfying $\rho(\mathbf{U}) < 1$, where $\Lambda(\mathbf{U})$ stands for the set of eigenvalues of $\mathbf{U}$ ($\Lambda(\mathbf{U}) \subseteq \mathbb{R}$ since $\mathbf{U}$ is symmetric). Note that under this condition, the following series converges:

$$\mathbf{I}_N + \mathbf{U} + \mathbf{U}^2 + ... \quad (2)$$

and is equal to $(\mathbf{I}_N - \mathbf{U})^{-1}$. Furthermore, we have:

Lemma 3.1. If $\mathbf{U} \in \mathbb{R}^{N \times N}$ is a positive symmetric matrix with $\rho(\mathbf{U}) < 1$, then the following holds:

$$\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ... = (\mathbf{I}_N - \mathbf{U})^{-2} \quad (3)$$

Proof. We have: $\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ... = (\mathbf{I}_N + \mathbf{U} + \mathbf{U}^2 + ...) + (\mathbf{U} + 2\mathbf{U}^2 + 3\mathbf{U}^3 + ...)$. Thus we have:

$$\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ... = (\mathbf{I}_N + \mathbf{U} + \mathbf{U}^2 + ...) + \mathbf{U}(\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ...) \quad (4)$$

Therefore, from what we have said above, we obtain:

$$(\mathbf{I}_N - \mathbf{U})(\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ...) = (\mathbf{I}_N - \mathbf{U})^{-1} \quad (5)$$

and, consequently: $\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ... = (\mathbf{I}_N - \mathbf{U})^{-2}$.

We will derive an algorithm for unbiasedly estimating $(\mathbf{I}_N - \mathbf{U})^{-2}$ with the Gram matrix $\mathbf{M} = [\phi(i)^\top \phi(j)]_{i,j=1,...,N}$ for a randomized mapping $\phi: \{1, ..., N\} \rightarrow \mathbb{R}^N$. We will refer to $\phi(k)$ for a given $k \in \{1, ..., N\}$ as a signature vector for $k$ (see also Fig. 1 for an illustration of signature vectors).

#### 3.1.2. Computing signature vectors

We are ready to bring back the graphs. We interpret $\mathbf{U} = [u_{i,j}]_{i,j=1,...,N}$ as a weighted adjacency matrix of the weighted graph $\mathrm{G}_{\mathbf{U}}$ with vertex set $\mathrm{V} = \mathrm{V}(\mathrm{G}_{\mathbf{U}})$ of $N$ vertices, where the weight between nodes $i$ and $j$ is given by $u_{i,j}$. We then calculate signature vectors via random walks on $\mathrm{G}_{\mathbf{U}}$. The algorithm for computing $\phi(i)$ for a specific node $i$ is given in the Algorithm 1 box. Calculations of $\phi(i)$ for different $i$ are done independently (and therefore can be easily parallelized).

Algorithm 1: Computing a signature vector for a given node $i$.
Input: graph $\mathrm{G}_{\mathbf{U}}$, termination probability $p_{\mathrm{term}}$, sampling strategy sample, number of random walks $m$, node $i$.
Result: signature vector $\phi(i)$.
Main:
1. Initialize: $\phi(i)[j] = 0$ for $j = 1, 2, ..., N$.
2. Initialize: $\mathcal{H} \leftarrow \emptyset$.
3. For $t = 1, ..., m$ do:
   (a) initialize: $\mathrm{load} = 1$
   (b) initialize: $\mathrm{current\_vertex} \leftarrow i$
   (c) update: $\phi(i)[i] \leftarrow \phi(i)[i] + 1$
   (d) while (not terminated) do:
      1. assign: $v = \mathrm{current\_vertex}$
      2. $(w, p) = \mathrm{sample}(v, \mathrm{G}_{\mathbf{U}}, \mathcal{H})$
      3. update: $\mathrm{current\_vertex} \leftarrow w$
      4. update: $\mathrm{load} \leftarrow \mathrm{load} \cdot \frac{u_{v,w}}{p(1 - p_{\mathrm{term}})}$
      5. update: $\phi(i)[w] \leftarrow \phi(i)[w] + \mathrm{load}$
      6. update: $\mathcal{H} \leftarrow \mathcal{H}.\mathrm{add}((v, w))$
4. Renormalize: $\phi(i) \leftarrow \frac{\phi(i)}{m}$.

The algorithm works as follows. The signature vector is zeroed out at initialization. At initialization, we also zero out the data structure history $\mathcal{H}$, which maintains some particular function of the list of visited edges (more details later). A number $m$ of random walks, all initiated at node $i$, is conducted. In principle, they can also be parallelized, but here, for the clarity of the analysis, we assume that they are executed in a given order.

Figure 1. The pictorial representation of signature vectors. The contributions to the dot-product between two signature vectors come from the vertices that were visited by both.
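For concreteness, here is a minimal Python sketch of Algorithm 1 (a hedged illustration, not the authors' reference implementation), using the simplest uniform neighbor sampler discussed in the next paragraphs and a dense NumPy adjacency matrix; all identifiers and defaults are our own.

```python
import numpy as np

def signature_vector(U, i, p_term=0.1, num_walks=100, rng=None):
    """Compute the signature vector phi(i) of Algorithm 1 for node i of the
    weighted graph G_U given by the symmetric nonnegative matrix U (N x N)."""
    rng = np.random.default_rng() if rng is None else rng
    N = U.shape[0]
    neighbors = [np.flatnonzero(U[v]) for v in range(N)]   # adjacency lists
    phi = np.zeros(N)
    for _ in range(num_walks):
        load, v = 1.0, i
        phi[i] += 1.0                       # the walk starts at i with load 1
        while rng.random() > p_term:        # terminate before each step w.p. p_term
            nbrs = neighbors[v]
            if len(nbrs) == 0:
                break
            w = int(rng.choice(nbrs))       # uniform sampler: p = 1 / deg(v)
            p = 1.0 / len(nbrs)
            load *= U[v, w] / (p * (1.0 - p_term))
            phi[w] += load                  # deposit the current load at w
            v = w
    return phi / num_walks                  # renormalize by the number of walks
```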
Each random walk carries a load that is updated after every transition, as long as the termination state has not been reached. Before each transition, termination occurs independently with a given probability $p_{\mathrm{term}}$. The update rule of the load is given in point 4 of the while-loop in the Algorithm 1 box. The strategy sample takes as input: (1) the current vertex $v$, (2) the entire graph, (3) the function of the history of visits, and outputs: (1) a neighbor $w$ of $v$, chosen according to a particular randomized sampling strategy, as well as (2) its corresponding probability $p$. Every time a new vertex is reached, the value of its corresponding signature vector dimension is increased by the current value of the load, as given in point 5 of the while-loop in the Algorithm 1 box. When all the walks terminate, the signature vector is renormalized by the multiplicative term $\frac{1}{m}$ and returned.

Sampler & the history of visits: The simplest sample strategy is to choose a neighbor $w$ of the current vertex $v$ uniformly at random from the set of all neighbors $\mathrm{N}_v$ of $v$ in $\mathrm{G}_{\mathbf{U}}$. In this setting, $p = \frac{1}{\deg(v)}$, where $\deg(v)$ stands for the degree of $v$, defined as the size $|\mathrm{N}_v|$ of its neighborhood. In that setting, $\mathcal{H}$ is not needed and thus we have $\mathcal{H} = \emptyset$. We show in Sec. 5 that this strategy works very well in practice. Another option is to take $p = p(v, w) = \frac{u_{v,w}}{\sum_{z \in \mathrm{N}_v} u_{v,z}}$, i.e. make the probabilities proportional to the weights. In that setting, the history is also not needed. In Sec. 4, we analyze in more detail the q-GRF setting, where a nontrivial $\mathcal{H}$ is useful. This history can be used to diversify the set of random walks, by de-prioritizing edges that were already selected in previous random walks. Thus it might have the same positive effect on the estimation accuracy as the ORFs from (Yu et al., 2016), which diversify the set of random projection vectors. However, a detailed analysis of such strategies is beyond the scope of this paper. Most importantly, as we show next, regardless of the particular instantiation of sample, the signature vectors computed in Algorithm 1 indeed provide an unbiased estimation of the matrix $(\mathbf{I}_N - \mathbf{U})^{-2}$ (proof in Sec. 4):

Theorem 3.2. Denote by $\mathbf{B} \in \mathbb{R}^{N \times N}$ a matrix with rows given by $\phi(i)$ for $i = 1, ..., N$, computed according to Algorithm 1, and by $\mathbf{B}'$ its independent copy (obtained in practice by constructing another independent set of random walks). Assume that the method sample is not degenerate, i.e. it assigns a positive probability $p(v, w)$ to every neighbor $w$ of $v$. Then the following is true:

$$(\mathbf{I}_N - \mathbf{U})^{-2} = \mathbb{E}[\mathbf{B}(\mathbf{B}')^\top]. \quad (6)$$

Time complexity & signature vector maps: To compute $\phi(i)$ for every $i$, we need to run in every node $m$ random walks, each of expected length $l = \frac{1}{p_{\mathrm{term}}}$; thus the expected time complexity of computing $\mathbf{B}, \mathbf{B}'$ is $O(\frac{Nmt_{\mathrm{sam}}}{p_{\mathrm{term}}} + T_{\mathrm{rep}})$, where $t_{\mathrm{sam}}$ stands for the time complexity of the mechanism sample and $T_{\mathrm{rep}}$ for the space complexity of storing the representation of $\mathrm{G}_{\mathbf{U}}$. For instance, if $\mathbf{U}$ is sparse, the graph adjacency list representation is the right choice. In principle, even more compact representations are possible if graph edges do not need to be stored explicitly. For the most straightforward uniform sampling strategy, we have $t_{\mathrm{sam}} = O(1)$, and for many more sophisticated ones $t_{\mathrm{sam}}$ is proportional to the average degree of a vertex. We assume that in all considered representations there is an indexing mechanism for the neighbors of any given vertex and that a vertex of a particular index can be retrieved in $O(1)$ time.
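The following hedged sketch illustrates Theorem 3.2 empirically: it stacks signature vectors into a matrix B and an independent copy B', and compares B(B')^T with the exact (I_N - U)^(-2). It reuses the signature_vector sketch given after the Algorithm 1 box; names, the toy matrix U and all parameter values are our own choices.

```python
import numpy as np

def signature_matrix(U, p_term=0.1, num_walks=100, rng=None):
    """Rows are the signature vectors phi(i), i = 1, ..., N (see the sketch above)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.stack([signature_vector(U, i, p_term, num_walks, rng)
                     for i in range(U.shape[0])])

# Small symmetric nonnegative U with spectral radius < 1.
rng = np.random.default_rng(0)
A = rng.random((6, 6)); A = 0.5 * (A + A.T)
U = 0.8 * A / np.abs(np.linalg.eigvalsh(A)).max()

B  = signature_matrix(U, num_walks=2000, rng=rng)    # rows phi(i)
Bp = signature_matrix(U, num_walks=2000, rng=rng)    # independent copy
estimate = B @ Bp.T                                  # unbiased estimate of (I - U)^(-2)
exact = np.linalg.matrix_power(np.linalg.inv(np.eye(6) - U), 2)
print(np.linalg.norm(estimate - exact) / np.linalg.norm(exact))  # shrinks as num_walks grows
```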
Note that for $p_{\mathrm{term}}$ large relative to $\frac{m t_{\mathrm{sam}}}{N}$, the matrices $\mathbf{B}, \mathbf{B}'$ should be kept in an implicit manner (this is a pretty conservative lower bound; in practice different walks will share many vertices), since most of the entries of each of their rows are equal to zero in this case. Thus for every $i$, we can store a dictionary with keys corresponding to the visited vertices $w$ and values corresponding to $\phi(i)[w]$. We call such a dictionary a signature vector map (or svmap). In practice, one can choose a large $p_{\mathrm{term}}$ without an accuracy drop (see: Sec. 5).

#### 3.1.3. Trimming RFs: anchor points and JLT

Algorithm 1 provides substantial computational gains in calculating $(\mathbf{I}_N - \mathbf{U})^{-2}$, replacing $O(N^3)$ time complexity with expected $O(\frac{Nm}{p_{\mathrm{term}}} + T_{\mathrm{rep}})$. Furthermore, the signature vectors $\phi(i)$ can in practice often be stored efficiently (see our discussion on svmaps above). Interestingly, there are also easy ways to directly control their effective dimensionality via one of two simple-to-implement tricks, provided below.

Anchor points: This technique randomly samples $K < N$ points from the set of all $N$ vertices of $\mathrm{G}_{\mathbf{U}}$. We call this set: anchor points. Signature vectors are updated only at anchor points (but we still update the load for each transition) and thus effectively each of them has dimensionality $K$. In principle, different sampling strategies are possible and analyzing all of them is outside the scope of this work. The simplest approach is to sample uniformly at random a $K$-element subset of $\mathrm{V}$ with no repetitions. In this setting, if we replace the term $\frac{1}{m}$ in step 4 of the Algorithm 1 box (the renormalization step) with $\frac{N}{Km}$, the estimator remains unbiased (see: Sec. 4).

JLTed signature vectors: A more algebraic approach to trimming the dimensionality of signature vectors is to apply the Johnson-Lindenstrauss Transform (see: Sec. 1). We simply replace $\phi(i)$ with $\phi_{\mathrm{JLT}}(i) = \frac{1}{\sqrt{K}}\mathbf{G}\phi(i)$ after all the computations. Here $\mathbf{G} \in \mathbb{R}^{K \times N}$ is a Gaussian matrix with entries taken independently at random from $\mathcal{N}(0, 1)$, and $K$ is the number of RFs used in the JLT. The matrix $\mathbf{G}$ is sampled independently from all the walks and is the same for all the nodes. The unbiasedness of dot-product kernel estimation with JLTs (Johnson, 1984b), combined with Theorem 3.2, provides the unbiasedness of the overall mechanism.

### 3.2. From squared inverse matrices to graph kernels

#### 3.2.1. Preliminaries

We are ready to show how the results obtained in Sec. 3.1 can be applied to graph kernels. For a given undirected weighted loopless graph $\mathrm{G}$ with a weighted symmetric adjacency matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, we consider the family of d-regularized Laplacian kernels defined as follows for $\deg_{\mathbf{W}}(i) \overset{\mathrm{def}}{=} \sum_{j \in \mathrm{N}_i} \mathbf{W}(i, j)$:

$$[\mathrm{K}^d_{\mathrm{lap}}(v, w)]_{v,w=1,...,N} = (\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-d}, \quad (7)$$

for the symmetrically normalized Laplacian $\mathbf{L}_{\mathrm{G}} \in \mathbb{R}^{N \times N}$:

$$\mathbf{L}_{\mathrm{G}}(v, w) = \begin{cases} 1 & \text{if } v = w \\ -\frac{\mathbf{W}(v,w)}{\sqrt{\deg_{\mathbf{W}}(v)\deg_{\mathbf{W}}(w)}} & \text{if } (v, w) \text{ is an edge} \\ 0 & \text{otherwise} \end{cases} \quad (8)$$

For $d = 1$, we get the popular regularized Laplacian kernel. As we show in Sec. A.2, as long as the 1-regularized Laplacian kernel is well defined (i.e. the matrix $\mathbf{I}_N + \sigma^2\mathbf{L}_{\mathrm{G}}$ is invertible and the inverse has positive entries), the d-regularized Laplacian kernel is a valid positive definite kernel.
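As a reference point for the later sections, here is a hedged NumPy sketch of the definitions in Eqs. 7-8: it builds the symmetrically normalized Laplacian L_G from a weighted adjacency matrix W and the groundtruth d-regularized Laplacian kernel matrix by explicit inversion (the O(N^3) brute-force baseline that GRFs are designed to avoid). Function names are ours; W is assumed symmetric, nonnegative and without isolated nodes.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetrically normalized Laplacian of a weighted adjacency matrix W:
    L_G = I - D^(-1/2) W D^(-1/2), with D the diagonal weighted-degree matrix."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def regularized_laplacian_kernel(W, d=1, sigma2=0.2):
    """Groundtruth d-regularized Laplacian kernel matrix (I + sigma^2 L_G)^(-d),
    computed by explicit matrix inversion -- the cubic-time baseline."""
    N = W.shape[0]
    M = np.eye(N) + sigma2 * normalized_laplacian(W)
    return np.linalg.matrix_power(np.linalg.inv(M), d)
```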
#### 3.2.2. Signature vectors are RFs for the 2-regularized Laplacian kernel

We will start with $d = 2$, since it turns out that the mechanism of signature vectors can be straightforwardly applied there to obtain the corresponding GRFs. Note that the following is true for $\mathbf{U} \in \mathbb{R}^{N \times N}$ defined as $\mathbf{U} = [u_{i,j}]_{i,j=1,...,N}$, where $u_{i,j} = \frac{\sigma^2}{\sigma^2+1} \cdot \frac{\mathbf{W}(i,j)}{\sqrt{\deg_{\mathbf{W}}(i)\deg_{\mathbf{W}}(j)}}$:

$$\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}} = (\sigma^2 + 1)(\mathbf{I}_N - \mathbf{U}) \quad (9)$$

Equation 9, together with Theorem 3.2, immediately provides the GRF mechanism for the 2-regularized Laplacian kernel:

Corollary 3.3. Consider the kernel matrix of the 2-regularized Laplacian kernel for a given graph $\mathrm{G}$ with $N$ nodes $\{1, 2, ..., N\}$. Then the following holds:

$$(\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-2} = \mathbb{E}[\mathbf{C}(\mathbf{C}')^\top], \quad (10)$$

where the rows of $\mathbf{C} \in \mathbb{R}^{N \times K}$ are of the form $\frac{1}{\sigma^2+1}\phi(i)$ for $i = 1, ..., N$, and the signature vectors $\phi(i)$ are computed via Algorithm 1 for the graph $\mathrm{G}_{\mathbf{U}}$ and the matrix $\mathbf{U}$ defined above. Furthermore, $\mathbf{C}'$ is an independent copy of $\mathbf{C}$.

#### 3.2.3. GRFs & regularized Laplacian kernel

Let us now consider the case $d = 1$. From Corollary 3.3, we immediately get the following:

Corollary 3.4. Consider the kernel matrix of the 1-regularized Laplacian kernel for a given graph $\mathrm{G}$ with $N$ nodes $\{1, 2, ..., N\}$. Then the following holds:

$$(\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-1} = \mathbb{E}[\mathbf{C}\mathbf{D}^\top], \quad (11)$$

where the matrix $\mathbf{D} \in \mathbb{R}^{N \times K}$ is defined as $\mathbf{D} = (\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})\mathbf{C}'$ and $\mathbf{C}, \mathbf{C}' \in \mathbb{R}^{N \times K}$ are as in Corollary 3.3.

Corollary 3.4 provides a GRF mechanism for the regularized Laplacian kernel, since it implies that we can write $\mathrm{K}^1_{\mathrm{lap}}$ as:

$$\mathrm{K}^1_{\mathrm{lap}}(i, j) = \mathbb{E}[\phi(i)^\top \psi(j)], \quad (12)$$

where $\phi(i)$ and $\psi(i)$ for $i = 1, ..., N$ are the rows of $\mathbf{C}$ and $\mathbf{D}$ respectively. Notice that this is the so-called asymmetric RF setting (Choromanski et al., 2022b), where we use a different transformation for $i$ and $j$. The approximate kernel matrix is not necessarily symmetric, but its expectation is always symmetric and the estimation is unbiased. However, it is also easy to obtain here a GRF mechanism providing a symmetric approximate kernel matrix and maintaining unbiasedness. It suffices to define:

$$\widehat{\phi}(i) = \frac{1}{\sqrt{2}}[\phi(i)^\top, \psi(i)^\top]^\top, \quad \widehat{\psi}(j) = \frac{1}{\sqrt{2}}[\psi(j)^\top, \phi(j)^\top]^\top, \quad (13)$$

and the corresponding approximate kernel matrix is of the form $\frac{1}{2}(\mathbf{C}\mathbf{D}^\top + \mathbf{D}\mathbf{C}^\top)$.

#### 3.2.4. Going beyond d = 2

Let us now see what happens when we take $d > 2$. We will define GRFs recursively, having already established the variants for $d = 1$ and $d = 2$ in the previous sections (they will serve as the base for our induction analysis). Take some $d$ and assume that the following GRF mechanism, defined by an unbiased low-rank decomposition of the d-regularized Laplacian kernel matrix, exists for some $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{N \times K}$:

$$(\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-d} = \mathbb{E}[\mathbf{X}\mathbf{Y}^\top] \quad (14)$$

Then, using Corollary 3.3, we can rewrite:

$$(\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-(d+2)} = \mathbb{E}[\mathbf{X}\mathbf{Y}^\top\mathbf{C}(\mathbf{C}')^\top], \quad (15)$$

if we assume that the matrices $\mathbf{C}, \mathbf{C}' \in \mathbb{R}^{N \times K}$ are constructed independently from $\mathbf{X}$ and $\mathbf{Y}$. We can then compute in time $O(NK^2)$ the matrix $\mathbf{S} = \mathbf{Y}^\top\mathbf{C}$ and then conduct its SVD decomposition $\mathbf{S} = \mathbf{Z}\mathbf{\Sigma}\widetilde{\mathbf{Z}}^\top$ in time $O(K^3)$, for $\mathbf{Z}, \widetilde{\mathbf{Z}}, \mathbf{\Sigma} \in \mathbb{R}^{K \times K}$ and where $\mathbf{\Sigma}$ is diagonal. Then we can rewrite:

$$(\mathbf{I}_N + \sigma^2 \mathbf{L}_{\mathrm{G}})^{-(d+2)} = \mathbb{E}[(\mathbf{X}\mathbf{Z}\mathbf{\Sigma}^{\frac{1}{2}})(\mathbf{C}'\widetilde{\mathbf{Z}}\mathbf{\Sigma}^{\frac{1}{2}})^\top] \quad (16)$$

for $\mathbf{X}\mathbf{Z}\mathbf{\Sigma}^{\frac{1}{2}}, \mathbf{C}'\widetilde{\mathbf{Z}}\mathbf{\Sigma}^{\frac{1}{2}} \in \mathbb{R}^{N \times K}$. The random feature vectors can then be given as the rows of $\mathbf{X}\mathbf{Z}\mathbf{\Sigma}^{\frac{1}{2}}$ and $\mathbf{C}'\widetilde{\mathbf{Z}}\mathbf{\Sigma}^{\frac{1}{2}}$. We see that, starting with $d = 1, 2$, we can follow the above procedure to produce a GRF-based low-rank decomposition of the kernel matrix of the d-regularized Laplacian kernel for any $d \in \mathbb{N}_+$.
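Putting Eq. 9 and Corollaries 3.3-3.4 together, the following hedged end-to-end sketch (reusing signature_matrix and normalized_laplacian from the earlier sketches; all names and default values are ours) computes GRF estimates of the 1- and 2-regularized Laplacian kernel matrices:

```python
import numpy as np

def graph_U(W, sigma2=0.2):
    """Matrix U of Eq. 9, satisfying I + sigma^2 L_G = (sigma^2 + 1) (I - U),
    i.e. U = sigma^2/(sigma^2+1) * D^(-1/2) W D^(-1/2)."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return (sigma2 / (sigma2 + 1.0)) * (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def grf_laplacian_kernels(W, sigma2=0.2, p_term=0.1, num_walks=80, rng=None):
    """Unbiased GRF estimates of (I + sigma^2 L_G)^(-1) and (I + sigma^2 L_G)^(-2)."""
    rng = np.random.default_rng() if rng is None else rng
    N = W.shape[0]
    U = graph_U(W, sigma2)
    C  = signature_matrix(U, p_term, num_walks, rng) / (sigma2 + 1.0)  # rows phi(i)/(sigma^2+1)
    Cp = signature_matrix(U, p_term, num_walks, rng) / (sigma2 + 1.0)  # independent copy
    K2_hat = C @ Cp.T                                                  # Corollary 3.3, d = 2
    D = (np.eye(N) + sigma2 * normalized_laplacian(W)) @ Cp            # D = (I + sigma^2 L_G) C'
    K1_hat = C @ D.T                                                   # Corollary 3.4, d = 1
    return K1_hat, K2_hat
```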
#### 3.2.5. GRFs for fast graph kernel methods

We will assume here that the graphs $\mathrm{G}$ under consideration have $e$ edges with $e = o(N^2)$, which is a reasonable assumption in machine learning, and thus the graph adjacency list representation is the default choice. Using the analysis from Sec. 3.1.2, we conclude that GRFs can be computed in time $O(\frac{Nmt_{\mathrm{sam}}}{p_{\mathrm{term}}} + Ne)$ for $d = 2$. If uniform sampling is the strategy used by sample, then the time complexity can be simplified to $O(\frac{Nm}{p_{\mathrm{term}}} + Ne)$. If JLT-based dimensionality reduction is applied, an additional cost $O(N^2K)$ is incurred. This can be further reduced to $O(N^2\log(K))$ if unbiased structured JLT variants are applied (Choromanski et al., 2017). For $d = 1$, an additional time $O(Ne)$ (for computing the matrix $\mathbf{D}$, see: Corollary 3.4) is incurred. A similar analysis can be applied to larger $d$. It is easy to see that for $d > 2$ GRFs can be computed in time cubic in $K$ and linear in $d$. Thus for $d > 2$ they are useful if $K \ll N$. We call this stage graph pre-processing. If the brute-force approach is applied, pre-processing involves explicit computation of the graph kernel matrix, and for the graph kernels considered in this paper (as well as many others) it takes time $O(N^3)$. In some applications, pre-processing is a one-time procedure (per training), but in many others it needs to be applied several times during training (e.g. in graph Transformers, where graph kernels can be used to topologically modulate the regular attention mechanism) (Choromanski et al., 2022a).

When the graph pre-processing is completed, graph kernels usually interact with the downstream algorithm via their matrix-vector multiplication interface, e.g. the algorithm computes $\mathbf{K}\mathbf{x}$ for a given graph kernel matrix $\mathbf{K} = [\mathrm{K}(i, j)]_{i,j=1,...,N} \in \mathbb{R}^{N \times N}$ and a series of vectors $\mathbf{x} \in \mathbb{R}^N$ (see: k-means clustering with graph kernels (Dhillon et al., 2004)). We will refer to this phase as inference. Brute-force inference can be trivially conducted in time $O(N^2)$ per matrix-vector multiplication. Using the GRF strategy, an unbiased estimate of the matrix-vector product can be computed via a series of matrix-vector multiplications obtained from (a chain of) the decompositions constructed to define GRFs. For $d = 2$ those decompositions involve at least three matrices, and thus the GRFs do not even need to be explicitly constructed. For instance, for $d = 2$, inference can be conducted in time $O(NK)$, and this is a pretty conservative bound. Even if one applies $K = \Omega(N)$, most of the entries of the random feature vectors can still be equal to zero (if $p_{\mathrm{term}}$ is not too small), and in such a setting the svmaps from Sec. 3.1.2 can be applied to provide a time complexity sub-quadratic in $N$. For $d = 1$, an additional cost $O(Ne)$ of multiplications with the matrix $\mathbf{I}_N + \sigma^2\mathbf{L}_{\mathrm{G}}$ needs to be incurred. For $d > 2$, inference can be directly conducted in time $O(NK)$.

#### 3.2.6. Systems of linear equations & GRFs

It is easy to see that there is nothing particularly special about the Laplacian matrix $\mathbf{I}_N + \sigma^2\mathbf{L}_{\mathrm{G}}$ from Corollary 3.4 and that for any invertible matrix of the form $\lambda(\mathbf{I}_N - \mathbf{U})$, with symmetric $\mathbf{U} \in \mathbb{R}^{N \times N}$ of $o(N^2)$ nonzero entries, one can find a decomposition:

$$\frac{1}{\lambda}(\mathbf{I}_N - \mathbf{U})^{-1} = \mathbb{E}[\mathbf{C}\mathbf{D}^\top] \quad (17)$$

in sub-cubic time, using our methods. That, however, immediately provides a sub-cubic randomized algorithm for unbiasedly solving linear systems of the form $(\mathbf{I}_N - \mathbf{U})\mathbf{x} = \mathbf{b}$ for any given $\mathbf{b} \in \mathbb{R}^N$ as: $\mathbf{x} = \lambda(\mathbf{C}(\mathbf{D}^\top\mathbf{b}))$, where the brackets indicate the order of computations.
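A hedged sketch of the linear-system application above (with λ = 1), reusing signature_matrix from the earlier sketch; it returns an unbiased randomized estimate of the solution of (I_N - U)x = b without ever forming the inverse explicitly. Names and defaults are ours.

```python
import numpy as np

def grf_solve(U, b, p_term=0.1, num_walks=2000, rng=None):
    """Unbiased randomized estimate of x solving (I - U) x = b, for symmetric
    nonnegative U with spectral radius < 1, via (I - U)^(-1) = E[B D^T]."""
    rng = np.random.default_rng() if rng is None else rng
    N = U.shape[0]
    B  = signature_matrix(U, p_term, num_walks, rng)   # E[B (B')^T] = (I - U)^(-2)
    Bp = signature_matrix(U, p_term, num_walks, rng)   # independent copy
    D = (np.eye(N) - U) @ Bp                           # hence E[B D^T] = (I - U)^(-1)
    return B @ (D.T @ b)                               # brackets fix the cheap evaluation order

# Compare with the direct (cubic-time) solver: np.linalg.solve(np.eye(N) - U, b).
```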
#### 3.2.7. Final remarks

The construction of GRFs presented in this section can be thought of as a low-rank decomposition of the kernel matrix $\mathbf{K}$. One could then argue that, instead of using the presented algorithm, a standard low-rank decomposition method could be applied (such as spectral analysis). This, however, defeats the main goals of this paper: sub-cubic time complexity (the matrix to be decomposed would need to be explicitly constructed) and unbiased estimation (such an approximation would be biased). The latter remains a problem for the alternative approaches even if more refined low-rank decomposition techniques not enforcing materialization of the kernel matrix $\mathbf{K}$ are applied (such methods would, however, rely on efficient algorithms for computing $\mathbf{K}\mathbf{x}$, which are non-trivial: for $d = 1$ they would need to solve systems of linear equations with symmetric matrices as sub-routines).

## 4. Theoretical results

### 4.1. GRFs provide unbiased graph kernel estimation

We start by proving Theorem 3.2. Concentration results for GRFs in the base setting, where the sampler chooses a neighbor uniformly at random, are given in Sec. A.3. We leave as an exciting open question the analysis of concentration results beyond second moment methods.

Proof. For any two nodes $a, b \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})$, denote by $\Omega(a, b)$ the set of walks from $a$ to $b$ in $\mathrm{G}_{\mathbf{U}}$. Furthermore, the symbol $\sqsubseteq_{\mathrm{pre}}$ indicates that one walk is a prefix of the other. Then:

$$\phi(i)^\top \phi(j) = \frac{1}{m^2}\sum_{x \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})} \sum_{k,l=1}^{m} \sum_{\omega_1 \in \Omega(i,x)} \sum_{\omega_2 \in \Omega(j,x)} \frac{w(\omega_1)\,w(\omega_2)\,\mathbb{1}[\omega_1 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(k, i)]\,\mathbb{1}[\omega_2 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(l, j)]}{\prod_{s=1}^{l_1} p^i_{k,s}(1 - p_{\mathrm{term}})\,\prod_{t=1}^{l_2} p^j_{l,t}(1 - p_{\mathrm{term}})}, \quad (18)$$

where $\bar{\Omega}(k, i)$ and $\bar{\Omega}(l, j)$ stand for the kth and lth sampled random walk from $i$ and $j$ respectively, the function $w$ outputs the product of the edge-weights of its input walk, and $p^i_{k,s}$ and $p^j_{l,t}$ are the probabilities returned by sample for the sth/tth vertex of the kth/lth sampled random walk starting at $i$/$j$. Furthermore, $l_1$ and $l_2$ refer to the number of edges of $\omega_1$ and $\omega_2$ respectively. Note that those probabilities are in general random variables (if they are chosen as nontrivial functions of the history of sampled walks). Therefore, since (conditioning step by step on the history) the probability of each prefix event cancels out the corresponding product in the denominator, and the $m^2$ pairs $(k, l)$ cancel out the $\frac{1}{m^2}$ term, we obtain, for $\mathrm{len}(\omega)$ denoting the number of edges of $\omega$:

$$\mathbb{E}[\phi(i)^\top \phi(j)] = \sum_{x \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})}\sum_{\omega_1 \in \Omega(i,x)} \sum_{\omega_2 \in \Omega(j,x)} w(\omega_1)w(\omega_2) = \sum_{\omega \in \Omega(i,j)} (\mathrm{len}(\omega) + 1)\,w(\omega) \quad (19)$$

The proof is completed by the observation that:

$$(\mathbf{I}_N + 2\mathbf{U} + 3\mathbf{U}^2 + ...)(i, j) = \sum_{\omega \in \Omega(i,j)} (\mathrm{len}(\omega) + 1)\,w(\omega) \quad (20)$$

This in turn follows from the fact that for any $k \geq 0$, $\mathbf{U}^k(i, j)$ can be interpreted as the sum of $w(\omega)$ over all walks $\omega$ from $i$ to $j$ with $k$ edges. The extension to the anchor point setting is obtained by following the same proof, but with $x$ iterating only over the sampled anchor points, and results in an additional multiplicative constant, as explained in Sec. 3.1.3.

### 4.2. q-GRFs and correlated random walks

As already mentioned, the idea of q-GRFs is to correlate different random walks initiated in a given node $i$, in order to explore the entire graph more efficiently (in particular, by avoiding reconstructing similar walks); it can be thought of as an analogue of the ORF or low-discrepancy-sequence techniques applied for regular RFs (Yang et al., 2014). A natural candidate for the q-GRF variant is the one where the method sample implements the so-called reinforced random walk strategy (Kozma, 2012). The probability of choosing a neighbor $w$ of $v$ (see: Sec. 3.1.2) can be defined as:

$$p(v, w) = \frac{f(N(v, w))}{\sum_{z \in \mathrm{N}_v} f(N(v, z))}, \quad (21)$$

where $N(x, y)$ stands for the number of times that an edge $\{x, y\}$ has already been used (across all the walks, or the previous walks) and $f: \mathbb{N} \rightarrow \mathbb{R}$ is a fixed function. Note that, since we want to de-prioritize edges that were already frequently visited, $f$ should be a decreasing function. Implementing such a strategy can be done straightforwardly with an additional multiplicative factor $d_{\mathrm{ave}}$ in the time complexity, where $d_{\mathrm{ave}}$ stands for the average degree of a graph node. We leave a detailed theoretical and empirical analysis of this class of methods to future work.
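Below is a hedged sketch of the reinforced-walk sampler of Eq. 21, which could be plugged into Algorithm 1 in place of the uniform sampler (the history H then stores edge-usage counts). The particular choice f(n) = 1/(1 + n) and all identifiers are ours, given only for illustration.

```python
import numpy as np
from collections import defaultdict

def reinforced_sampler(v, U, edge_counts, f=lambda n: 1.0 / (1.0 + n), rng=None):
    """q-GRF sampler (Eq. 21): pick a neighbor w of v with probability proportional
    to f(N(v, w)), where N(v, w) counts how often the edge {v, w} was already used;
    a decreasing f de-prioritizes frequently visited edges. Returns (w, p)."""
    rng = np.random.default_rng() if rng is None else rng
    nbrs = np.flatnonzero(U[v])
    scores = np.array([f(edge_counts[frozenset((v, int(w)))]) for w in nbrs])
    probs = scores / scores.sum()
    idx = rng.choice(len(nbrs), p=probs)
    w = int(nbrs[idx])
    edge_counts[frozenset((v, w))] += 1     # update the shared history H
    return w, probs[idx]

# Usage inside Algorithm 1: H = defaultdict(int), shared across all m walks from a node,
# and (w, p) = reinforced_sampler(v, U, H) replaces the uniform neighbor choice.
```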
## 5. Experiments

We empirically verify the quality of GRFs via various experiments, including downstream applications of graph kernels. The real-world graphs were accessed from the repositories described in (Ivashkin, 2023).

### 5.1. Speed tests

Here we compared the total number of FLOPS used by the GRF variant of the regularized Laplacian kernel in pre-processing, followed by a single inference call involving computing the action of the kernel matrix on a given vector (this is, for instance, the way in which kernels are applied in the kernelized k-means algorithm, see: Sec. 5.3), with the corresponding numbers of FLOPS for various linear system solvers as well as a brute-force (BF) algorithm. Note that for d-regularized Laplacian kernels with $d = 1$, inference straightforwardly reduces to solving linear systems of equations. We used graphs of different sizes. The results are presented in Table 1. GRFs provide a substantial reduction of FLOPS, even for smaller graphs. We took $p_{\mathrm{term}} = 0.1$ since it worked well in several other tests (see: Sec. 5.2, Sec. 5.3).

| N | GRF | BF | Gauss-Seidel | Conjugate Gradient | Jacobi |
| --- | --- | --- | --- | --- | --- |
| 800 | 960K | 512.64M | 6400K | 512M | 6400K |
| 1000 | 1.4M | 1001M | 10M | 1000M | 10M |
| 3000 | 10.2M | 27B | 90M | 27B | 90M |

Table 1. The comparison of the number of FLOPS for the setting from Sec. 5.1 for different linear system solvers and GRFs. The rows correspond to graphs with N = 800, N = 1000 and N = 3000 nodes.

### 5.2. Relative Frobenius norm error

We took several undirected graphs of different sizes and computed for them the groundtruth kernel matrices $\mathbf{K}_d = [\mathrm{K}^d_{\mathrm{lap}}(i, j)]_{i,j=1,...,N}$, using d-regularized Laplacian kernels with $d = 1, 2$. We then computed their counterparts $\widehat{\mathbf{K}}_d$ obtained via GRFs and the corresponding relative Frobenius norm error, defined as:

$$\epsilon = \frac{\|\mathbf{K}_d - \widehat{\mathbf{K}}_d\|_F}{\|\mathbf{K}_d\|_F}$$

We considered different termination probabilities $p_{\mathrm{term}} \in \{0.1, 0.06, 0.01\}$, leading to average random walk lengths of $10$, $\frac{50}{3}$ and $100$ respectively, as well as different numbers $m$ of random walks for the construction of the GRF vector: $m = 1, 2, 10, 20, 40, 80$. We fixed $\sigma^2 = 0.2$. The reported empirical relative Frobenius norm errors were obtained by averaging over $s = 10$ independent experiments. We also report the corresponding standard deviations. The results are presented in Fig. 2. We see that different $p_{\mathrm{term}}$ result in very similar error curves, and thus in practice it suffices to take $p_{\mathrm{term}} = 0.1$ (average walk length $l = 10$), regardless of the density of the graph, even for graphs with $N > 1000$ nodes. Interestingly, the error curves are also very similar across all the graphs. Furthermore, a small number of random walks ($m = 80$) suffices to obtain $\epsilon < 2\%$. The standard deviations are reported on the plots, but since they are very small, we also include them in Sec. A.1.

Figure 2. Relative Frobenius norm error for the setting described in Sec. 5.2. The first two rows correspond to d = 1 and the last two to d = 2. We considered the following graphs (from upper-left to lower-right): Erdős-Rényi graph with edge prob. 0.4 (ER-0.4, N = 1000), ER-0.1 (N = 1000), dolphins (N = 62), eurosis (N = 1272), Networking (N = 1249), Databases (N = 1046), Encryption-and-Compression (N = 864), Hardware-and-Architecture (N = 763).
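A hedged sketch of the Sec. 5.2 measurement, reusing regularized_laplacian_kernel and grf_laplacian_kernels from the earlier sketches (names, defaults and the toy graph are ours): it computes the relative Frobenius norm error between the groundtruth and GRF-estimated kernel matrices for d = 1, 2.

```python
import numpy as np

def relative_frobenius_errors(W, sigma2=0.2, p_term=0.1, num_walks=80):
    """Relative Frobenius norm errors of the GRF estimates of the
    d-regularized Laplacian kernel matrices, for d = 1 and d = 2."""
    K1 = regularized_laplacian_kernel(W, d=1, sigma2=sigma2)   # groundtruth, O(N^3)
    K2 = regularized_laplacian_kernel(W, d=2, sigma2=sigma2)
    K1_hat, K2_hat = grf_laplacian_kernels(W, sigma2, p_term, num_walks)
    err = lambda K, K_hat: np.linalg.norm(K - K_hat) / np.linalg.norm(K)
    return err(K1, K1_hat), err(K2, K2_hat)

# Example on a small random graph (symmetric 0/1 adjacency; assumes no isolated nodes):
rng = np.random.default_rng(0)
A = (rng.random((200, 200)) < 0.1).astype(float)
W = np.triu(A, 1); W = W + W.T
print(relative_frobenius_errors(W))
```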
### 5.3. K-means algorithm with graph kernels

We applied GRFs in the kernelized k-means algorithm (Dhillon et al., 2004) to cluster graph nodes using graph kernels. As in the previous section, we applied d-regularized Laplacian graph kernels with $d = 1, 2$. Furthermore, we measured the impact of the additional dimensionality reduction techniques (anchor points and JLT-based reduction) on the clustering quality. For each variant, we report the so-called clustering error, defined as $\epsilon = \frac{P_{\mathrm{error}}}{\binom{N}{2}}$, where $P_{\mathrm{error}}$ stands for the number of pairs of nodes $\{i, j\}$ of the graph that were classified differently by the groundtruth algorithm and its GRF variant (e.g. they belonged to the same cluster in the groundtruth variant and to two different clusters in the GRF one, or vice versa). In all the experiments we used $\sigma^2 = 0.2$, $p_{\mathrm{term}}^{-1}m \approx 400$ and $K \approx 0.6N$. We chose the number of clusters $\mathrm{nb\_clusters} = 3$. The results are in Table 2. GRFs, even with the more accurate of the two compression methods (anchor points and JLT), provide an accurate approximation; for some graphs (e.g. citeseer with N = 2120) the error is < 1%.

| Kernel, method | citeseer | Databases | polbooks | karate |
| --- | --- | --- | --- | --- |
| d=1, reg | 0.020 | 0.170 | 0.28 | 0.11 |
| d=1, min(jlt, anc) | 0.010 | 0.097 | 0.323 | 0.314 |
| d=2, reg | 0.008 | 0.140 | 0.12 | 0.032 |
| d=2, min(jlt, anc) | 0.010 | 0.098 | 0.264 | 0.198 |

Table 2. Clustering errors for the experiments from Sec. 5.3 (for the d-regularized Laplacian kernel). The tested methods are: (1) regular GRFs (reg, no compression), (2) GRFs with the more accurate of the JLT (jlt) and anchor points (anc) techniques.

## 6. Conclusion

We have proposed in this paper a new paradigm of graph random features (GRFs) for the unbiased and computationally efficient estimation of various kernels defined on the nodes of a graph. GRFs rely on families of random walks initiated in different nodes, depositing loads in the visited vertices of the graph. The computation of GRFs can also be easily parallelized. We have provided a detailed theoretical analysis of our method and exhaustive experiments, involving in particular k-means clustering on graphs.

## References

Boser, B. E., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. In Haussler, D. (ed.), Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, Pittsburgh, PA, USA, July 27-29, 1992, pp. 144-152. ACM, 1992. doi: 10.1145/130385.130401. URL https://doi.org/10.1145/130385.130401.

Bressan, M. Faster algorithms for sampling connected induced subgraphs. CoRR, abs/2007.12102, 2020. URL https://arxiv.org/abs/2007.12102.

Campbell, C. Kernel methods: a survey of current techniques. Neurocomputing, 48(1-4):63-84, 2002. doi: 10.1016/S0925-2312(01)00643-9. URL https://doi.org/10.1016/S0925-2312(01)00643-9.

Canu, S. and Smola, A. J. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714-720, 2006. doi: 10.1016/j.neucom.2005.12.009. URL https://doi.org/10.1016/j.neucom.2005.12.009.

Chávez, G., Liu, Y., Ghysels, P., Li, X. S., and Rebrova, E. Scalable and memory-efficient kernel ridge regression. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, May 18-22, 2020, pp. 956-965. IEEE, 2020. doi: 10.1109/IPDPS47924.2020.00102. URL https://doi.org/10.1109/IPDPS47924.2020.00102.

Chen, X., Li, Y., Wang, P., and Lui, J. C. S. A general framework for estimating graphlet statistics via random walk. Proc. VLDB Endow., 10(3):253-264, 2016. doi: 10.14778/3021924.3021940. URL http://www.vldb.org/pvldb/vol10/p253-chen.pdf.

Choromanski, K., Downey, C., and Boots, B. Initialization matters: Orthogonal predictive state recurrent neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018a. URL https://openreview.net/forum?id=HJJ23bW0b.
Choromanski, K., Rowland, M., Sarl os, T., Sindhwani, V., Turner, R. E., and Weller, A. The geometry of random features. In Storkey, A. J. and P erez-Cruz, F. (eds.), International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pp. 1 9. PMLR, 2018b. URL http://proceedings.mlr.press/ v84/choromanski18a.html. Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. ar Xiv preprint ar Xiv:2009.14794, 2020. Choromanski, K., Lin, H., Chen, H., Zhang, T., Sehanobish, A., Likhosherstov, V., Parker-Holder, J., Sarl os, T., Weller, A., and Weingarten, T. From block-toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In Chaudhuri, K., Jegelka, S., Song, L., Szepesv ari, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 3962 3983. PMLR, 2022a. URL https://proceedings.mlr. press/v162/choromanski22a.html. Choromanski, K. M., Rowland, M., and Weller, A. The unreasonable effectiveness of structured random orthogonal embeddings. Advances in neural information processing systems, 30, 2017. Choromanski, K. M., Lin, H., Chen, H., Sehanobish, A., Ma, Y., Jain, D., Varley, J., Zeng, A., Ryoo, M. S., Likhosherstov, V., Kalashnikov, D., Sindhwani, V., and Weller, A. Hybrid random features. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022b. URL https://openreview.net/forum? id=EMigf E6Ze S. Chung, F. R. K. and Yau, S. Coverings, heat kernels and spanning trees. Electron. J. Comb., 6, 1999. doi: 10. 37236/1444. URL https://doi.org/10.37236/ 1444. Cortes, C. and Vapnik, V. Support-vector networks. Mach. Learn., 20(3):273 297, 1995. doi: 10.1007/ BF00994018. URL https://doi.org/10.1007/ BF00994018. Cumby, C. M. and Roth, D. On kernel methods for relational learning. In Fawcett, T. and Mishra, N. (eds.), Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pp. 107 114. AAAI Press, 2003. URL http://www.aaai.org/Library/ ICML/2003/icml03-017.php. Dhillon, I. S., Guan, Y., and Kulis, B. Kernel k-means: spectral clustering and normalized cuts. In Kim, W., Kohavi, R., Gehrke, J., and Du Mouchel, W. (eds.), Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pp. 551 556. ACM, 2004. doi: 10.1145/1014052.1014118. URL https: //doi.org/10.1145/1014052.1014118. Taming graph kernels with random features Fang, J., Zhang, Q., Meng, Z., and Liang, S. Structureaware random fourier kernel for graphs. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 17681 17694, 2021. Figueiredo, D. R. and Garetto, M. On the emergence of shortest paths by reinforced random walks. IEEE Trans. Netw. Sci. Eng., 4(1):55 69, 2017. doi: 10.1109/ TNSE.2016.2618063. URL https://doi.org/10. 1109/TNSE.2016.2618063. 
Ghojogh, B., Ghodsi, A., Karray, F., and Crowley, M. Reproducing kernel hilbert space, mercer s theorem, eigenfunctions, nystr om method, and use of kernels in machine learning: Tutorial and survey. Co RR, abs/2106.08443, 2021. URL https://arxiv.org/ abs/2106.08443. Goemans, M. X. and Williamson, D. P. Approximation algorithms for max-3-cut and other problems via complex semidefinite programming. J. Comput. Syst. Sci., 68(2):442 470, 2004. doi: 10.1016/j.jcss.2003.07. 012. URL https://doi.org/10.1016/j.jcss. 2003.07.012. Goldin, R. F. Review: Statistical models: Theory and practice (revised edition). cambridge university press, new york, new york, 2009, xiv + 442 pp., ISBN 978-0-521-74385-3, $40. by david a. freedman. Am. Math. Mon., 117(9):844 847, 2010. doi: 10.4169/ 000298910X521733. URL https://doi.org/10. 4169/000298910X521733. Ivashkin, V. Community graphs repository, 2023. URL https://github.com/vlivashkin/ community-graphs. Johnson, W. B. Extensions of lipschitz mappings into hilbert space. Contemporary mathematics, 26:189 206, 1984a. Johnson, W. B. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189 206, 1984b. K egl, B., Krzyzak, A., and Niemann, H. Radial basis function networks in nonparametric classification and function learning. In Jain, A. K., Venkatesh, S., and Lovell, B. C. (eds.), Fourteenth International Conference on Pattern Recognition, ICPR 1998, Brisbane, Australia, 1620 August, 1998, pp. 565 570. IEEE Computer Society, 1998. doi: 10.1109/ICPR.1998.711206. URL https: //doi.org/10.1109/ICPR.1998.711206. Kondor, R. and Lafferty, J. D. Diffusion kernels on graphs and other discrete input spaces. In Sammut, C. and Hoffmann, A. G. (eds.), Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, pp. 315 322. Morgan Kaufmann, 2002. Kontorovich, L., Cortes, C., and Mohri, M. Kernel methods for learning languages. Theor. Comput. Sci., 405(3):223 236, 2008. doi: 10.1016/j.tcs.2008.06.037. URL https: //doi.org/10.1016/j.tcs.2008.06.037. Kozma, G. Reinforced random walk, 2012. URL https: //arxiv.org/abs/1208.0364. Leskovec, J. and Faloutsos, C. Sampling from large graphs. In Eliassi-Rad, T., Ungar, L. H., Craven, M., and Gunopulos, D. (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 631 636. ACM, 2006. doi: 10. 1145/1150402.1150479. URL https://doi.org/ 10.1145/1150402.1150479. Likhosherstov, V., Choromanski, K., Dubey, A., Liu, F., Sarlos, T., and Weller, A. Chefs random tables: Nontrigonometric random features. In Neur IPS, 2022. Lin, H., Chen, H., Choromanski, K. M., Zhang, T., and Laroche, C. Demystifying orthogonal monte carlo and beyond. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Liu, F., Huang, X., Chen, Y., and Suykens, J. A. K. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 44(10):7128 7148, 2022. doi: 10.1109/TPAMI.2021.3097011. URL https://doi. org/10.1109/TPAMI.2021.3097011. Orbanz, P. Subsampling large graphs and invariance in networks. ar Xiv: Statistics Theory, 2017. Rahimi, A. and Recht, B. Random features for large-scale kernel machines. 
Advances in neural information processing systems, 20, 2007. Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pp. 1313 1320. Curran Associates, Inc., 2008. Taming graph kernels with random features Ribeiro, P., Paredes, P., Silva, M. E. P., Apar ıcio, D., and Silva, F. M. A. A survey on subgraph counting: Concepts, algorithms, and applications to network motifs and graphlets. ACM Comput. Surv., 54(2):28:1 28:36, 2022. doi: 10.1145/3433652. URL https: //doi.org/10.1145/3433652. Shervashidze, N., Vishwanathan, S. V. N., Petri, T., Mehlhorn, K., and Borgwardt, K. M. Efficient graphlet kernels for large graph comparison. In Dyk, D. A. V. and Welling, M. (eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, pp. 488 495. JMLR.org, 2009. URL http://proceedings. mlr.press/v5/shervashidze09a.html. Smola, A. J. and Kondor, R. Kernels and regularization on graphs. In Sch olkopf, B. and Warmuth, M. K. (eds.), Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pp. 144 158. Springer, 2003. doi: 10.1007/ 978-3-540-45167-9\ 12. URL https://doi.org/ 10.1007/978-3-540-45167-9_12. Smola, A. J. and Sch olkopf, B. Bayesian kernel methods. In Mendelson, S. and Smola, A. J. (eds.), Advanced Lectures on Machine Learning, Machine Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lectures, volume 2600 of Lecture Notes in Computer Science, pp. 65 117. Springer, 2002. doi: 10.1007/3-540-36434-X\ 3. URL https: //doi.org/10.1007/3-540-36434-X_3. Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. J. Mach. Learn. Res., 11:1201 1242, 2010. doi: 10.5555/1756006. 1859891. URL https://dl.acm.org/doi/10. 5555/1756006.1859891. Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., Deng, B., and Li, X. Understanding graph sampling algorithms for social network analysis. In 31st IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2011 Workshops), 20-24 June 2011, Minneapolis, Minnesota, USA, pp. 123 128. IEEE Computer Society, 2011. doi: 10.1109/ICDCSW.2011. 34. URL https://doi.org/10.1109/ICDCSW. 2011.34. Wu, L., Yen, I. E., Zhang, Z., Xu, K., Zhao, L., Peng, X., Xia, Y., and Aggarwal, C. C. Scalable global alignment graph kernel using random features: From node embedding to graph embedding. In Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., and Karypis, G. (eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 48, 2019, pp. 1418 1428. ACM, 2019. doi: 10.1145/ 3292500.3330918. URL https://doi.org/10. 1145/3292500.3330918. Yanardag, P. and Vishwanathan, S. V. N. Deep graph kernels. In Cao, L., Zhang, C., Joachims, T., Webb, G. I., Margineantu, D. D., and Williams, G. 
(eds.), Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1365 1374. ACM, 2015. doi: 10.1145/2783258.2783417. URL https: //doi.org/10.1145/2783258.2783417. Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. W. Quasi-monte carlo feature maps for shift-invariant kernels. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pp. 485 493. JMLR.org, 2014. URL http://proceedings.mlr.press/ v32/yangb14.html. Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann Rice, D. N., and Kumar, S. Orthogonal random features. Advances in neural information processing systems, 29, 2016. Taming graph kernels with random features Kernel, Graph, pterm m=2 m=10 m=20 m=40 m=80 d=1, ER-0.4-1000, 0.1 0.0007259063790268863 0.0001903398359981377 0.00011763026820191872 8.79471573831236e-05 4.29321194947306e-05 d=1, ER-0.4-1000, 0.06 0.0005435175726472963 0.00013151409601958063 9.575658877244471e-05 4.3906087463097626e-05 4.0755077316485244e-05 d=1, ER-0.4-1000, 0.01 0.00022105611802712477 9.098043091730836e-05 7.626805728093205e-05 3.874109025559475e-05 3.7022093881032226e-05 d=1, ER-0.1-1000, 0.1 0.00616767931977833 0.004336727765953796 0.003934111721006881 0.0027504647840658337 0.0013502125948369828 d=1, ER-0.1-1000, 0.06 0.011169983712293842 0.004103948449818662 0.002123051971417801 0.0015286651824786544 0.001809200930986754 d=1, ER-0.1-1000, 0.01 0.008926804414936252 0.003593843664943383 0.002372436626864906 0.002528684013474355 0.0013236120502376739 d=1, dolphins, 0.1 0.00616767931977833 0.004336727765953796 0.003934111721006881 0.0027504647840658337 0.0013502125948369828 d=1, dolphins, 0.06 0.011169983712293842 0.004103948449818662 0.002123051971417801 0.0015286651824786544 0.001809200930986754 d=1, dolphins, 0.01 0.008926804414936252 0.003593843664943383 0.002372436626864906 0.002528684013474355 0.0013236120502376739 d=1, eurosis, 0.1 0.0060276707312881305 0.00092538589272756 0.0006342279681653932 0.0004384555835094276 0.00018536188888052218 d=1, eurosis, 0.06 0.004751929441074538 0.001020005565545354 0.000889042372679839 0.0005086950138860196 0.0002556955468331719 d=1, eurosis, 0.01 0.0037293050211431988 0.0007734429750918471 0.0006679713650390155 0.0004663147033666988 0.00029012600990115697 d=1, Networking, 0.1 0.00354911912464601 0.0008569845357120783 0.0007800162218792217 0.000486001643398867 0.0003402086731748513 d=1, Networking, 0.06 0.004330451321925086 0.0011314015990526772 0.0009955841652321848 0.0005356434310080785 0.0002993746309815702 d=1, Networking, 0.01 0.0035153340017524855 0.0015484554089057877 0.0009127004247725001 0.00024227482240693019 0.0002548717948222871 d=1, Databases, 0.1 0.004920247514564321 0.0010238037860984882 0.0006558270450890158 0.00046304862763166114 0.0003245028197695966 d=1, Databases, 0.06 0.004526013840391158 0.000963926605473951 0.0007692921395363883 0.0005198262247493105 0.0004310588594538151 d=1, Databases, 0.01 0.004943075493204197 0.0008015939178439096 0.000780276410600827 0.0003613319043567048 0.0003127005621608197 d=1, Enc&Comp, 0.1 0.0048278192814453745 0.0010825306683229289 0.0008869476479790293 0.0007716890723512931 0.0003704820179027577 d=1, Enc&Comp, 0.06 0.0020893222009769137 0.0012842314605827185 0.0004723607601931406 0.0006878263190053518 0.0003750554030785581 d=1, Enc&Comp, 0.01 0.00504474860734935 0.0013024704440231251 0.0009294029364896054 
0.000660959802814907 0.00020474232102263236 d=1, Hard&Arch, 0.1 0.004611366876905434 0.0009512468516057303 0.0012081728737839681 0.0007626404640344821 0.00040508532045909195 d=1, Hard&Arch, 0.06 0.005063954232425586 0.001065299727136788 0.0012631414043132634 0.0006979985117773265 0.000421532809850973 d=1, Hard&Arch, 0.01 0.00274347868585054 0.0007186205865051595 0.0007365735158683657 0.00045503009550719795 0.00040966993435182316 d=2, ER-0.4-1000, 0.1 0.0005611717968776918 0.00027274809547358435 0.00011320568751584864 6.757544191642404e-05 6.039762637275414e-05 d=2, ER-0.4-1000, 0.06 0.0004256206092542753 0.0001255914552193544 8.877177278468484e-05 4.854379424147713e-05 4.386961653719971e-05 d=2, ER-0.4-1000, 0.01 0.0003048387437939227 8.612938808993734e-05 8.861321498224955e-05 3.118698564760755e-05 2.6440001535795706e-05 d=2, ER-0.1-1000, 0.1 0.0005236255151350093 0.00022767131458694158 0.00013017955748542738 9.70665903350761e-05 4.911946309935634e-05 d=2, ER-0.1-1000, 0.06 0.0005865169497721389 0.00025410090630458796 0.00019059306793781872 0.00011444739288479127 7.032284297507503e-05 d=2, ER-0.1-1000, 0.01 0.00046667795838897455 0.00027540444251076277 0.00010490895428890273 0.00013378373593132866 7.724500570322393e-05 d=2, dolphins, 0.1 0.008072537726967633 0.00571208308059399 0.0051191001841705405 0.0017181464186888107 0.001090082709032633 d=2, dolphins, 0.06 0.009755110708905598 0.00389170462370401 0.0025002668985388667 0.0016741448929614132 0.001371113891426613 d=2, dolphins, 0.01 0.006479716073801618 0.0024690593909838824 0.0016779353904653569 0.0018280376993101016 0.0008140444788820552 d=2, eurosis, 0.1 0.002985106085538542 0.000876946183305746 0.0007680934798181345 0.0005249300647113408 0.0002331259532404271 d=2, eurosis, 0.06 0.0041183672021622205 0.0008723728098050511 0.0007700295259631496 0.0005934744828608486 0.00017965740908675555 d=2, eurosis, 0.01 0.003124420288948241 0.0012421356753991252 0.000522089061052929 0.00036706376215862906 0.00019123151674677616 d=2, Networking, 0.1 0.003894812593269389 0.0012115481023431956 0.000687373457322389 0.0003709238725270653 0.0002917580174452871 d=2, Networking, 0.06 0.004866701669257291 0.0010784522973902714 0.0007199911928308271 0.000574302270555022 0.00040290056035626387 d=2, Networking, 0.01 0.003824906545448091 0.0014371944936820223 0.0003065956016769412 0.0003731510983670436 0.0003803384907960821 d=2, Databases, 0.1 0.004920247514564321 0.0010238037860984882 0.0006558270450890158 0.00046304862763166114 0.0003245028197695966 d=2, Databases, 0.06 0.004526013840391158 0.000963926605473951 0.0007692921395363883 0.0005198262247493105 0.0004310588594538151 d=2, Databases, 0.01 0.004943075493204197 0.0008015939178439096 0.000780276410600827 0.0003613319043567048 0.0003127005621608197 d=2, Enc&Comp, 0.1 0.006724706042286198 0.0014845981818248232 0.0010458737350497948 0.0008081389763835953 0.00035644825231180745 d=2, Enc&Comp, 0.06 0.004100807796697153 0.0011827406795993489 0.0008530016288033847 0.0007814930040936276 0.00024164229327689848 d=2, Enc&Comp, 0.01 0.002887642103947921 0.0009055525946103308 0.0006739056569936388 0.0005584100220945653 0.00042133127128199215 d=2, Hard&Arch, 0.1 0.003095039050038908 0.00126431072738253 0.000531917678481419 0.00047539738617717955 0.0005823370820758574 d=2, Hard&Arch, 0.06 0.0028895869476586425 0.0017287614112698189 0.0012669302897452082 0.0005934700975376661 0.0003548026338104648 d=2, Hard&Arch, 0.01 0.0029574141141546616 0.0014192560156932896 0.0008739945261346017 0.0006409254479486006 
0.0005212426748758263

Table 3. Standard deviations for the experiments from Sec. 5.2 (for the d-regularized Laplacian kernel) for m > 1.

## A. Appendix

### A.1. Additional experimental details

In this section, we report the standard deviations for the experiments conducted in Sec. 5.2. The results are presented in Table 3.

### A.2. d-regularized Laplacian kernels

We show here that, as long as the regularized Laplacian kernel is well-defined, the d-regularized variants for $d > 1$ are valid positive definite kernels. The following is true:

Theorem A.1. If the matrix $\mathbf{I}_N + \sigma^2\mathbf{L}_{\mathrm{G}}$ is invertible and its inverse has positive entries, then the d-regularized Laplacian kernel is a valid positive definite kernel.

Proof. Denote $\mathbf{A} = -\sigma^2\mathbf{L}_{\mathrm{G}}$, so that $\mathbf{I}_N - \mathbf{A} = \mathbf{I}_N + \sigma^2\mathbf{L}_{\mathrm{G}}$. Note first that, by the Perron-Frobenius Theorem, we have $\rho(\mathbf{A}) = \lambda_{\max}(\mathbf{A})$, where $\lambda_{\max}(\mathbf{A})$ stands for the largest eigenvalue of $\mathbf{A}$ (note that all eigenvalues of $\mathbf{A}$ are real since $\mathbf{A}$ is symmetric). Then, for $n$ sufficiently large, we have $\|\mathbf{A}^n\| \leq (1 - \xi)^n$ for some $\xi \in (0, 1)$. Thus the Neumann series $\sum_{n=0}^{\infty} \mathbf{A}^n$ converges to $(\mathbf{I}_N - \mathbf{A})^{-1}$. Thus we can rewrite: $(\mathbf{I}_N - \mathbf{A})^{-d} = (\sum_{n=0}^{\infty} \mathbf{A}^n)^d$. Therefore we have: $(\mathbf{I}_N - \mathbf{A})^{-d} = \sum_{n} c_{d,n} \mathbf{A}^n$ for some sequence of coefficients $(c_{d,0}, c_{d,1}, ...)$. We conclude that $(\mathbf{I}_N - \mathbf{A})^{-d}$ is symmetric for any $d = 1, 2, ...$ (since $\mathbf{A}$ is). We can rewrite: $(\mathbf{I}_N - \mathbf{A})^{-(d+1)} = \mathbf{X}\mathbf{Y}$, where $\mathbf{X} = (\mathbf{I}_N - \mathbf{A})^{-d}$ and $\mathbf{Y} = (\mathbf{I}_N - \mathbf{A})^{-1}$. We will proceed by induction on $d$. We can then conclude that the matrix $(\mathbf{I}_N - \mathbf{A})^{-(d+1)} = \mathbf{X}\mathbf{Y}$ is a product of two positive definite matrices. Furthermore, $(\mathbf{I}_N - \mathbf{A})^{-(d+1)}$ is symmetric, as we have already observed. But then it is a positive definite matrix (a symmetric product of two positive definite matrices is positive definite, since it is similar to the positive definite matrix $\mathbf{Y}^{\frac{1}{2}}\mathbf{X}\mathbf{Y}^{\frac{1}{2}}$).

### A.3. Concentration results for GRFs with the base sampler

We inherit the notation from the main body of the paper, in particular from the proof of Theorem 3.2. We also denote $A(\omega) = \prod_{s=1}^{l_1} \frac{\deg(v_s)}{1 - p_{\mathrm{term}}}$, where $\omega = (v_1, v_2, ..., v_{l_1+1})$. Furthermore, for two walks $\omega_1, \omega_2$, we use the following notation: $\omega_1 \sqsubset_{\mathrm{pre}} \omega_2$ if $\omega_1$ is a strict prefix of $\omega_2$, and $\omega_1 \sqsubseteq_{\mathrm{pre}} \omega_2$ if $\omega_1$ is a prefix (not necessarily strict) of $\omega_2$. We prove the following concentration result.

Theorem A.2. Consider the unbiased estimation of the matrix $\mathbf{K} = (\mathbf{I}_N - \mathbf{U})^{-2}$ via $\mathbf{M} = \mathbf{B}(\mathbf{B}')^\top$ for matrices $\mathbf{B}, \mathbf{B}' \in \mathbb{R}^{N \times N}$, as in Theorem 3.2. Then the following is true for any $i, j \in \{1, ..., |\mathrm{V}(\mathrm{G}_{\mathbf{U}})|\}$, $i \neq j$:

$$\mathrm{Var}(\mathbf{M}(i, j)) = \frac{1}{m^2}(\Lambda - \mathbf{K}^2(i, j)) \quad (23)$$

for $\Lambda$ defined as follows:

$$\Lambda = \sum_{x_1, x_2 \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})} \sum_{\omega_1 \in \Omega(i, x_1)} \sum_{\omega_2 \in \Omega(i, x_2)} \sum_{\omega_3 \in \Omega(j, x_1)} \sum_{\omega_4 \in \Omega(j, x_2)} w(\omega_1)w(\omega_2)w(\omega_3)w(\omega_4)\,\Gamma(\omega_1, \omega_2, \omega_3, \omega_4) \quad (24)$$

and

$$\Gamma(\omega_1, \omega_2, \omega_3, \omega_4) = \begin{cases} \frac{1}{A(\omega_2)A(\omega_4)} & \text{if } \omega_1 \sqsubseteq_{\mathrm{pre}} \omega_2 \text{ and } \omega_3 \sqsubseteq_{\mathrm{pre}} \omega_4 \\ \frac{1}{A(\omega_2)A(\omega_3)} & \text{if } \omega_1 \sqsubseteq_{\mathrm{pre}} \omega_2 \text{ and } \omega_4 \sqsubseteq_{\mathrm{pre}} \omega_3 \\ \frac{1}{A(\omega_1)A(\omega_4)} & \text{if } \omega_2 \sqsubseteq_{\mathrm{pre}} \omega_1 \text{ and } \omega_3 \sqsubseteq_{\mathrm{pre}} \omega_4 \\ \frac{1}{A(\omega_1)A(\omega_3)} & \text{if } \omega_2 \sqsubseteq_{\mathrm{pre}} \omega_1 \text{ and } \omega_4 \sqsubseteq_{\mathrm{pre}} \omega_3 \\ 0 & \text{otherwise} \end{cases} \quad (25)$$

Proof. We have the following:

$$\mathrm{Var}(\mathbf{M}(i, j)) = \mathrm{Var}\Big(\frac{1}{m^2}\sum_{k=1}^{m}\sum_{l=1}^{m} X_{k,l}\Big), \quad (26)$$

where:

$$X_{k,l} = \sum_{x \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})}\sum_{\omega_1 \in \Omega(i, x)} \sum_{\omega_2 \in \Omega(j, x)} w(\omega_1)w(\omega_2)\,\mathbb{1}[\omega_1 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(k, i)]\,\mathbb{1}[\omega_2 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(l, j)] \quad (27)$$

Note that since different random walks are chosen independently, all random variables $X_{k,l}$ are independent and therefore:

$$\mathrm{Var}(\mathbf{M}(i, j)) = \frac{1}{m^2}(\mathbb{E}[X_{1,1}^2] - (\mathbb{E}[X_{1,1}])^2) \quad (28)$$

Thus, from the unbiasedness of the estimator, we obtain:

$$\mathrm{Var}(\mathbf{M}(i, j)) = \frac{1}{m^2}(\mathbb{E}[X_{1,1}^2] - \mathbf{K}^2(i, j)) \quad (29)$$

It suffices to prove that $\mathbb{E}[X_{1,1}^2] = \Lambda$.
Note that we have:

$$\mathbb{E}[X_{1,1}^2] = \mathbb{E}\Big[\sum_{x_1, x_2 \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})} \sum_{\omega_1 \in \Omega(i, x_1)} \sum_{\omega_2 \in \Omega(i, x_2)} \sum_{\omega_3 \in \Omega(j, x_1)} \sum_{\omega_4 \in \Omega(j, x_2)} w(\omega_1)w(\omega_2)w(\omega_3)w(\omega_4)\,\mathbb{1}[\omega_1 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i)]\,\mathbb{1}[\omega_2 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i)]\,\mathbb{1}[\omega_3 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j)]\,\mathbb{1}[\omega_4 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j)]\Big] \quad (30)$$

Thus we have:

$$\mathbb{E}[X_{1,1}^2] = \sum_{x_1, x_2 \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})} \sum_{\omega_1 \in \Omega(i, x_1)} \sum_{\omega_2 \in \Omega(i, x_2)} \sum_{\omega_3 \in \Omega(j, x_1)} \sum_{\omega_4 \in \Omega(j, x_2)} w(\omega_1)w(\omega_2)w(\omega_3)w(\omega_4)\,\mathbb{P}\big[\omega_1 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i),\, \omega_2 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i),\, \omega_3 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j),\, \omega_4 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j)\big] \quad (31)$$

We conclude the proof by observing that:

$$\Gamma(\omega_1, \omega_2, \omega_3, \omega_4) = \mathbb{P}\big[\omega_1 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i),\, \omega_2 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, i),\, \omega_3 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j),\, \omega_4 \sqsubseteq_{\mathrm{pre}} \bar{\Omega}(1, j)\big] \quad (32)$$

A completely analogous analysis can be conducted for $i = j$, but the formula is more convoluted, since certain pairs of walks sampled from $i$ and from $j$ are clearly no longer independent, being exactly the same (namely: the kth sampled walk from $i$ and the kth sampled walk from $j$ for $k = 1, 2, ..., m$).

### A.4. Additional derivations leading to Equation 18

For the Reader's convenience, we explain here in more detail the derivations leading to Eq. 18. For simplicity, we assume that $m = 1$, since for larger $m$ the formula is obtained simply by averaging over all pairs of random walks from nodes $i$ and $j$. Note first that, directly from the definition of GRFs and Algorithm 1, we get:

$$\phi(i)^\top \phi(j) = \sum_{x \in \mathrm{V}(\mathrm{G}_{\mathbf{U}})} \Big(\sum_{r \in \mathcal{K}_x(\bar{\Omega}(1, i))} \mathrm{load}_{\bar{\Omega}(1, i)}(x, r)\Big)\Big(\sum_{v \in \mathcal{K}_x(\bar{\Omega}(1, j))} \mathrm{load}_{\bar{\Omega}(1, j)}(x, v)\Big)$$

where $\mathcal{K}_x(\omega)$, for a given walk $\omega$, returns the set of those time indices at which $\omega$ hits $x$ (the first vertex of the walk gets time index zero, the next one time index one, etc.), and furthermore $\mathrm{load}_{\omega}(x, c)$ returns the increment of the load added to the value stored at $x$ at time $c$ (see: Algorithm 1, while-loop, point 5). Now note that each $r$ can be identified with its corresponding prefix-subwalk $\omega_1$, and each $v$ can be identified with its corresponding prefix-subwalk $\omega_2$. Finally, observe that the increment of the load that $\omega_1$ contributes is precisely $w(\omega_1)\prod_{s=1}^{l_1}\frac{(p^i_{1,s})^{-1}}{1 - p_{\mathrm{term}}}$ and the increment of the load that $\omega_2$ contributes is precisely $w(\omega_2)\prod_{t=1}^{l_2}\frac{(p^j_{1,t})^{-1}}{1 - p_{\mathrm{term}}}$ (see: Algorithm 1, while-loop, point 4). That gives us Eq. 18.