# General Graph Random Features

Published as a conference paper at ICLR 2024

Isaac Reid¹, Krzysztof Choromanski²,³, Eli Berger⁴, Adrian Weller¹,⁵
¹University of Cambridge, ²Google DeepMind, ³Columbia University, ⁴University of Haifa, ⁵Alan Turing Institute
ir337@cam.ac.uk, kchoro@google.com
Equal contribution.

We propose a novel random walk-based algorithm for unbiased estimation of arbitrary functions of a weighted adjacency matrix, coined general graph random features (g-GRFs). This includes many of the most popular examples of kernels defined on the nodes of a graph. Our algorithm enjoys subquadratic time complexity with respect to the number of nodes, overcoming the notoriously prohibitive cubic scaling of exact graph kernel evaluation. It can also be trivially distributed across machines, permitting learning on much larger networks. At the heart of the algorithm is a modulation function which upweights or downweights the contribution from different random walks depending on their lengths. We show that by parameterising it with a neural network we can obtain g-GRFs that give higher-quality kernel estimates or perform efficient, scalable kernel learning. We provide robust theoretical analysis and support our findings with experiments including pointwise estimation of fixed graph kernels, solving non-homogeneous graph ordinary differential equations, node clustering and kernel regression on triangular meshes.¹

¹Code is available at https://github.com/isaac-reid/general_graph_random_features.

## 1 Introduction and related work

The kernel trick is a powerful technique to perform nonlinear inference using linear learning algorithms (Campbell, 2002; Kontorovich et al., 2008; Canu and Smola, 2006; Smola and Schölkopf, 2002). Supposing we have a set of $N$ datapoints $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^N$, it replaces Euclidean dot products $\mathbf{x}_i^\top \mathbf{x}_j$ with evaluations of a kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, capturing the similarity of the datapoints by instead taking an inner product between implicit (possibly infinite-dimensional) feature vectors in some Hilbert space $\mathcal{H}_K$. An object of key importance is the Gram matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ whose entries enumerate the pairwise kernel evaluations, $\mathbf{K} := [K(\mathbf{x}_i, \mathbf{x}_j)]_{i,j=1}^N$. Despite the theoretical rigour and empirical success enjoyed by kernel-based learning algorithms, the requirement to manifest and invert this matrix leads to notoriously poor $\mathcal{O}(N^3)$ time-complexity scaling. This has spurred research dedicated to efficiently approximating $\mathbf{K}$, the chief example of which is random features (Rahimi and Recht, 2007): a Monte-Carlo approach which gives explicitly manifested, finite-dimensional vectors $\phi(\mathbf{x}_i) \in \mathbb{R}^m$ whose Euclidean dot product is equal to the kernel evaluation in expectation,

$$\mathbf{K}_{ij} = \mathbb{E}\left[\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)\right]. \quad (1)$$

This allows one to construct a low-rank decomposition of $\mathbf{K}$ which provides much better scalability. Testament to its utility, a rich taxonomy of random features exists to approximate many different Euclidean kernels including the Gaussian, softmax, and angular and linear kernels (Johnson, 1984; Dasgupta et al., 2010; Goemans and Williamson, 2004; Choromanski et al., 2020).

Kernels defined on discrete input spaces, e.g. $K: \mathcal{N} \times \mathcal{N} \to \mathbb{R}$ with $\mathcal{N}$ the set of nodes of a graph $\mathcal{G}$ (Smola and Kondor, 2003; Kondor and Lafferty, 2002; Chung and Yau, 1999), enjoy widespread applications including in bioinformatics (Borgwardt et al., 2005), community detection (Kloster and Gleich, 2014) and recommender systems (Yajima, 2006).
More recently, they have been used in applications as diverse as manifold learning for deep generative modelling (Zhou et al., 2020) and for solving single- and multiple-source shortest path problems (Crane et al., 2017). However, for these graph-based kernel methods the problem of poor scalability is particularly acute. This is because even computing the corresponding Gram matrix $\mathbf{K}$ is typically of at least cubic time complexity in the number of nodes $N$, requiring e.g. the inversion of an $N \times N$ matrix or computation of multiple matrix-matrix products.

Despite the presence of this computational bottleneck, random feature methods for graph kernels have proved elusive. Indeed, only recently was a viable graph random feature (GRF) mechanism proposed by Choromanski (2023). Their algorithm uses an ensemble of random walkers which deposit a load at every vertex they pass through that depends on i) the product of weights of edges traversed by the walker and ii) the marginal probability of the subwalk. Using this scheme, it is possible to construct random features $\{\phi(i)\}_{i=1}^N \subset \mathbb{R}^N$ such that $\phi(i)^\top \phi(j)$ gives an unbiased approximation to the $ij$-th matrix element of the 2-regularised Laplacian kernel. Multiple independent approximations can be combined to estimate the $d$-regularised Laplacian kernel with $d \in 2\mathbb{N}$ or the diffusion kernel (although the latter is only asymptotically unbiased). The GRF algorithm enjoys both subquadratic time complexity and strong empirical performance on tasks like k-means node clustering, and it is trivial to distribute across machines when working with massive graphs. However, a key limitation of GRFs is that they only address a limited family of graph kernels which may not be suitable for the task at hand.

Our central contribution is a simple modification which generalises the algorithm to arbitrary functions of a weighted adjacency matrix, allowing efficient and unbiased approximation of a much broader class of graph node kernels. We achieve this by introducing an extra modulation function $f$ that controls each walker's load as it traverses the graph. As well as empowering practitioners to approximate many more fixed kernels, we demonstrate that $f$ can also be parameterised by a neural network and learned. We use this powerful approach to optimise g-GRFs for higher-quality approximation of fixed kernels and for scalable implicit kernel learning.

The remainder of the manuscript is organised as follows. In Sec. 2 we introduce general graph random features (g-GRFs) and prove that they enable scalable approximation of arbitrary functions of a weighted adjacency matrix, including many of the most popular examples of kernels defined on the nodes of a graph. We also extend the core algorithm with neural modulation functions, replacing one component of the g-GRF mechanism with a neural network, and derive generalisation bounds for the corresponding class of learnable graph kernels (Sec. 2.1). In Sec. 3 we run extensive experiments, including: pointwise estimation of a variety of popular graph kernels (Sec. 3.1); simulation of time evolution under non-homogeneous graph ordinary differential equations (Sec. 3.2); kernelised k-means node clustering, including on large graphs (Sec. 3.3); training a neural modulation function to suppress the mean square error of fixed kernel estimates (Sec. 3.4); and training a neural modulation function to learn a kernel for node attribute prediction on triangular mesh graphs (Sec. 3.5).
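Before moving on, a minimal sketch (our own illustration, not code from the paper) of the scalability argument behind Eq. 1: once explicit features are available, downstream products with the approximated Gram matrix never require forming the $N \times N$ matrix. The placeholder features below simply stand in for the graph random features constructed later.

```python
import numpy as np

def approx_kernel_matvec(phi_1, phi_2, v):
    """Apply the estimated Gram matrix K_hat = phi_1 @ phi_2.T to a vector v.
    Bracketing as phi_1 @ (phi_2.T @ v) costs O(N * l) for l-dimensional
    features, instead of the O(N^2) needed to materialise K_hat itself."""
    return phi_1 @ (phi_2.T @ v)

# Toy usage with random placeholder features, standing in for graph random features:
rng = np.random.default_rng(0)
N, l = 1000, 32
phi_1, phi_2 = rng.normal(size=(N, l)), rng.normal(size=(N, l))
v = rng.normal(size=N)
result = approx_kernel_matvec(phi_1, phi_2, v)   # shape (N,)
```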
## 2 General graph random features

Consider a directed weighted graph $\mathcal{G}(\mathcal{N}, \mathcal{E}, \mathbf{W} := [w_{ij}]_{i,j \in \mathcal{N}})$ where: i) $\mathcal{N} := \{1, ..., N\}$ is the set of nodes; ii) $\mathcal{E}$ is the set of edges, with $(i,j) \in \mathcal{E}$ if there is a directed edge from $i$ to $j$ in $\mathcal{G}$; and iii) $\mathbf{W}$ is the weighted adjacency matrix, with $w_{ij}$ the weight of the directed edge from $i$ to $j$ (equal to 0 if no such edge exists). Note that an undirected graph can be described as directed with the symmetric weighted adjacency matrix $\mathbf{W}$. Now consider the matrices $\mathcal{K}_\alpha(\mathbf{W}) \in \mathbb{R}^{N \times N}$, where $\alpha = (\alpha_k)_{k=0}^{\infty}$ and $\alpha_k \in \mathbb{R}$:

$$\mathcal{K}_\alpha(\mathbf{W}) := \sum_{k=0}^{\infty} \alpha_k \mathbf{W}^k. \quad (2)$$

We assume that the sum above converges for all $\mathbf{W}$ under consideration, which can be ensured with a regulariser $\mathbf{W} \to \sigma \mathbf{W}$, $\sigma \in \mathbb{R}_+$. Without loss of generality, we also assume that $\alpha$ is normalised such that $\alpha_0 = 1$. The matrix $\mathcal{K}_\alpha(\mathbf{W})$ can be associated with a graph function $K^{\mathcal{G}}_\alpha: \mathcal{N} \times \mathcal{N} \to \mathbb{R}$ mapping from a pair of graph nodes to a real number. Note that if $\mathcal{G}$ is an undirected graph then $\mathcal{K}_\alpha(\mathbf{W})$ automatically inherits the symmetry of $\mathbf{W}$. In this case, it follows from Weyl's perturbation inequality (Bai et al., 2000) that $\mathcal{K}_\alpha(\mathbf{W})$ is positive semidefinite for any given $\alpha$ provided the spectral radius $\rho(\mathbf{W}) := \max_{\lambda \in \Lambda(\mathbf{W})} |\lambda|$ is sufficiently small (with $\Lambda(\mathbf{W})$ the set of eigenvalues of $\mathbf{W}$). This can again be ensured by multiplying the weight matrix $\mathbf{W}$ by a regulariser $\sigma \in \mathbb{R}_+$. It then follows that $\mathcal{K}_\alpha(\mathbf{W})$ can be considered the Gram matrix of a graph kernel function $K^{\mathcal{G}}_\alpha$.

With suitably chosen $\alpha = (\alpha_k)_{k=0}^{\infty}$, the class described by Eq. 2 includes many popular examples of graph node kernels in the literature (Smola and Kondor, 2003; Chapelle et al., 2002). They measure connectivity between nodes and are typically functions of the graph Laplacian matrix, defined by $\mathbf{L} := \mathbf{I} - \widetilde{\mathbf{W}}$ with $\widetilde{\mathbf{W}} := [w_{ij}/\sqrt{d_i d_j}]_{i,j=1}^N$. Here, $d_i := \sum_j w_{ij}$ is the weighted degree of node $i$, so that $\widetilde{\mathbf{W}}$ is the normalised weighted adjacency matrix. For reference, Table 1 gives the kernel definitions and normalised coefficients $\alpha_k$ (corresponding to powers of $\widetilde{\mathbf{W}}$) to be considered later in the manuscript. In practice, factors in $\alpha_k$ equal to a quantity raised to the power of $k$ are absorbed into the normalisation of $\widetilde{\mathbf{W}}$.

| Name | Form | $\alpha_k$ |
|---|---|---|
| $d$-regularised Laplacian | $(\mathbf{I}_N + \sigma^2\mathbf{L})^{-d}$ | $\binom{d+k-1}{k}\left(1+\sigma^{-2}\right)^{-k}$ |
| $p$-step random walk | $(\alpha\mathbf{I}_N - \mathbf{L})^p$, $\alpha \geq 2$ | $\binom{p}{k}(\alpha-1)^{-k}$ |
| Diffusion | $\exp(-\sigma^2\mathbf{L}/2)$ | $\frac{1}{k!}\left(\frac{\sigma^2}{2}\right)^k$ |
| Inverse Cosine | $\cos(\mathbf{L}\pi/4)$ | $\frac{1}{k!}\left(\frac{\pi}{4}\right)^k \cdot \begin{cases}(-1)^{k/2} & \text{if } k \text{ even} \\ (-1)^{(k-1)/2} & \text{if } k \text{ odd}\end{cases}$ |

Table 1: Different graph functions/kernels $K^{\mathcal{G}}_\alpha: \mathcal{N} \times \mathcal{N} \to \mathbb{R}$. The $\exp$ and $\cos$ mappings are defined via Taylor series expansions rather than element-wise, e.g. $\exp(\mathbf{M}) := \lim_{n\to\infty}(\mathbf{I}_N + \mathbf{M}/n)^n$ and $\cos(\mathbf{M}) := \mathrm{Re}(\exp(i\mathbf{M}))$. $\sigma$ and $\alpha$ are regularisers. Note that the diffusion kernel is sometimes instead defined by $\exp(\sigma^2(\mathbf{I}_N - \mathbf{L}))$ but these forms are equivalent up to normalisation. Also note that the $p$-step random walk kernel is closely related to the graph Matérn kernel (Borovitskiy et al., 2021).

The chief goal of this work is to construct a random feature map $\phi(i): \mathcal{N} \to \mathbb{R}^l$ with $l \in \mathbb{N}$ that provides unbiased approximation of $\mathcal{K}_\alpha(\mathbf{W})$ as in Eq. 1. To do so, we consider the following algorithm.

Algorithm 1: Constructing a random feature vector $\phi_f(i) \in \mathbb{R}^N$ to approximate $\mathcal{K}_\alpha(\mathbf{W})$

Input: weighted adjacency matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, vector of unweighted node degrees (no. neighbours) $\mathbf{d} \in \mathbb{R}^N$, modulation function $f: \mathbb{N} \cup \{0\} \to \mathbb{R}$, termination probability $p_{\mathrm{halt}} \in (0,1)$, node $i \in \mathcal{N}$, number of random walks to sample $m \in \mathbb{N}$.
Output: random feature vector $\phi_f(i) \in \mathbb{R}^N$

1: initialise: $\phi_f(i) \leftarrow \mathbf{0}$
2: for $w = 1, ..., m$ do
3:   initialise: load $\leftarrow 1$
4:   initialise: current_node $\leftarrow i$
5:   initialise: terminated $\leftarrow$ False
6:   initialise: walk_length $\leftarrow 0$
7:   while terminated = False do
8:     $\phi_f(i)[\text{current\_node}] \leftarrow \phi_f(i)[\text{current\_node}] + \text{load} \cdot f(\text{walk\_length})$
9:     walk_length $\leftarrow$ walk_length $+ 1$
10:    new_node $\leftarrow$ Unif[$\mathcal{N}$(current_node)]  ▷ assign to one of the neighbours
11:    load $\leftarrow$ load $\cdot \frac{\mathbf{d}[\text{current\_node}]}{1 - p_{\mathrm{halt}}} \cdot \mathbf{W}[\text{current\_node}, \text{new\_node}]$  ▷ update load
12:    current_node $\leftarrow$ new_node
13:    terminated $\leftarrow$ ($t \sim$ Unif(0,1)) $< p_{\mathrm{halt}}$  ▷ draw RV $t$ to decide on termination
14:  end while
15: end for
16: normalise: $\phi_f(i) \leftarrow \phi_f(i)/m$

[Figure 1: Schematic of a random walk on a graph (solid red) and an accompanying modulation function $f$ (dashed blue) used to approximate an arbitrary graph node function $K^{\mathcal{G}}$.]

This is identical to the algorithm presented by Choromanski (2023) for constructing features to approximate the 2-regularised Laplacian kernel, apart from the presence of the extra modulation function $f: \mathbb{N} \cup \{0\} \to \mathbb{R}$ in line 8 that upweights or downweights contributions from walks depending on their length (see Fig. 1). We refer to $\phi_f$ as general graph random features (g-GRFs), where the subscript $f$ identifies the modulation function. Crucially, the time complexity of Alg. 1 is subquadratic in the number of nodes $N$, in contrast to exact methods which are $\mathcal{O}(N^3)$.² We now state the following central result, proved in App. A.2.

Theorem 2.1 (Unbiased approximation of $\mathcal{K}_\alpha$ via convolutions). For two modulation functions $f_1, f_2: \mathbb{N} \cup \{0\} \to \mathbb{R}$, g-GRFs $(\phi_{f_1}(i))_{i=1}^N$, $(\phi_{f_2}(i))_{i=1}^N$ constructed according to Alg. 1 give unbiased approximation of $\mathcal{K}_\alpha$,

$$[\mathcal{K}_\alpha]_{ij} = \mathbb{E}\left[\phi_{f_1}(i)^\top \phi_{f_2}(j)\right], \quad (3)$$

for kernels with an arbitrary Taylor expansion $\alpha = (\alpha_k)_{k=0}^{\infty}$ provided that $\alpha = f_1 * f_2$. Here, $*$ is the discrete convolution of the modulation functions $f_1, f_2$; that is, for all $k \in \mathbb{N} \cup \{0\}$,

$$\sum_{p=0}^{k} f_1(k-p) f_2(p) = \alpha_k. \quad (4)$$

Clearly the class of pairs of modulation functions $f_1, f_2$ that satisfy Eq. 4 is highly degenerate. Indeed, it is possible to solve for $f_1$ given any $f_2$ and $\alpha$ provided $f_2(0) \neq 0$. For instance, a trivial solution is given by: $f_1(i) = \alpha_i$, $f_2(i) = \mathbb{I}(i=0)$ with $\mathbb{I}(\cdot)$ the indicator function. In this case, the walkers corresponding to $f_2$ are 'lazy', depositing all their load at the node at which they begin. Contributions to the estimator $\phi_{f_1}(i)^\top\phi_{f_2}(j)$ only come from walkers initialised at $i$ traversing all the way to $j$, rather than two walkers both passing through an intermediate node. Also of great interest is the case of symmetric modulation functions $f_1 = f_2$, where now intersections do contribute. In this case, the following is true (proved in App. A.3).

Theorem 2.2 (Computing symmetric modulation functions). Supposing $f_1 = f_2 = f$, Eq. 4 is solved by a function $f$ which is unique (up to a sign) and is given by

$$f(i) = \sum_{n=0}^{i} \binom{1/2}{n} \sum_{\substack{k_1+2k_2+3k_3+\dots = i \\ k_1+k_2+k_3+\dots = n}} \binom{n}{k_1\, k_2\, k_3 \dots} \alpha_1^{k_1} \alpha_2^{k_2} \alpha_3^{k_3} \dots \,. \quad (5)$$

Moreover, $f(i)$ can be efficiently computed with the iterative formula

$$f(0) = \sqrt{\alpha_0} = 1, \qquad f(i+1) = \frac{\alpha_{i+1} - \sum_{p=0}^{i-1} f(i-p)f(p+1)}{2f(0)} \quad \text{for } i \geq 0. \quad (6)$$

²Concretely, Alg. 1 yields a pair of matrices $\Phi_{1,2} := (\phi(i))_{i=1}^N \in \mathbb{R}^{N \times N}$ such that $\mathbf{K} = \mathbb{E}(\Phi_1\Phi_2^\top)$ in subquadratic time. Of course, explicitly multiplying the matrices to evaluate every element of $\widehat{\mathbf{K}}$ would be $\mathcal{O}(N^3)$, but we avoid this since in applications we just evaluate $\Phi_1(\Phi_2^\top \mathbf{v})$ where $\mathbf{v} \in \mathbb{R}^N$ is some vector and the brackets give the order of computation. This is $\mathcal{O}(N^2)$.
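To make Theorem 2.2 and Alg. 1 concrete, the following is a minimal NumPy sketch (ours, not the authors' released implementation at the linked repository). It computes a symmetric modulation function from Taylor coefficients via the iterative formula in Eq. 6, recovers $\alpha$ from a pair of modulation functions via the convolution in Eq. 4, and samples the random feature of Alg. 1. The walk loop is restructured slightly (the termination draw happens before each move rather than after it, which induces the same distribution of deposited prefix subwalks), an isolated-node guard is added, and all variable names are our own.

```python
import numpy as np

def symmetric_modulation(alpha, length):
    """Iterative solution of Eq. 6 for a symmetric modulation function:
    f(0) = sqrt(alpha_0) = 1 and
    f(i+1) = (alpha_{i+1} - sum_{p=0}^{i-1} f(i-p) f(p+1)) / (2 f(0))."""
    f = np.zeros(length)
    f[0] = np.sqrt(alpha[0])                  # alpha is assumed normalised, alpha_0 = 1
    for i in range(length - 1):
        cross = sum(f[i - p] * f[p + 1] for p in range(i))
        f[i + 1] = (alpha[i + 1] - cross) / (2.0 * f[0])
    return f

def recover_alpha(f1, f2):
    """Discrete convolution of Eq. 4, alpha_k = sum_p f1(k-p) f2(p): recovers the
    Taylor coefficients of the kernel encoded by a pair of modulation functions."""
    K = min(len(f1), len(f2))
    return np.array([sum(f1[k - p] * f2[p] for p in range(k + 1)) for k in range(K)])

def g_grf_feature(W, f, node, p_halt=0.1, m=16, rng=None):
    """Sample the random feature phi_f(node) in R^N following Alg. 1. The loop is
    restructured so that the termination draw happens before each move (rather than
    after it, as in lines 10-13), which gives the same distribution of deposited
    prefix subwalks; a guard for nodes with no out-neighbours is also added."""
    rng = np.random.default_rng() if rng is None else rng
    N = W.shape[0]
    neighbours = [np.flatnonzero(W[v]) for v in range(N)]    # out-neighbours of each node
    degrees = np.array([len(nb) for nb in neighbours])       # unweighted degrees d
    phi = np.zeros(N)
    for _ in range(m):
        load, current, walk_length = 1.0, node, 0
        while True:
            phi[current] += load * f[walk_length]
            walk_length += 1
            # f should be pre-computed beyond the longest sampled walk; the required
            # length grows only logarithmically with m (App. A.1).
            if walk_length >= len(f) or degrees[current] == 0 or rng.uniform() < p_halt:
                break
            new = rng.choice(neighbours[current])            # uniform over neighbours
            load *= degrees[current] / (1.0 - p_halt) * W[current, new]
            current = new
    return phi / m

# Toy usage: estimate a diffusion-type kernel sum_k (1/k!) W_hat^k on a random graph.
rng = np.random.default_rng(0)
N, sigma, L = 30, 0.25, 200
A = np.triu((rng.uniform(size=(N, N)) < 0.2).astype(float), 1)
A = A + A.T                                                  # random undirected adjacency
deg = np.maximum(A.sum(axis=1), 1.0)
W_hat = sigma * A / np.sqrt(np.outer(deg, deg))              # regularised, normalised adjacency
alpha = np.ones(L)
for k in range(1, L):
    alpha[k] = alpha[k - 1] / k                              # alpha_k = 1/k!
f = symmetric_modulation(alpha, L)
Phi_1 = np.stack([g_grf_feature(W_hat, f, i, m=32, rng=rng) for i in range(N)])
Phi_2 = np.stack([g_grf_feature(W_hat, f, i, m=32, rng=rng) for i in range(N)])  # independent copy
K_hat = Phi_1 @ Phi_2.T                                      # estimate of K_alpha(W_hat)
```

Calling `recover_alpha(f, f)` on the computed modulation function returns the original Taylor coefficients up to numerical error, which is a convenient sanity check on Eq. 4.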
For symmetric modulation functions, the random features $\phi_{f_1}(i)$ and $\phi_{f_2}(i)$ are identical apart from the particular sample of random walks used to construct them. They cannot share the same sample or estimates of diagonal kernel elements $[\mathcal{K}_\alpha]_{ii}$ will be biased.

Computational cost: Note that when running Alg. 1 one only needs to evaluate the modulation functions $f_{1,2}(i)$ up to the length of the longest walk one samples. A batch of size $b$, $(f_{1,2}(i))_{i=1}^{b}$, can be pre-computed in time $\mathcal{O}(b^2)$ and reused for random features corresponding to different nodes and even different graphs. Further values of $f$ can be computed at runtime if $b$ is too small, and also reused for later computations. Moreover, the minimum length $b$ required to ensure that all $m$ walks are shorter than $b$ with high probability ($\Pr(\cap_{i=1}^m \{\mathrm{len}(\omega_i) < b\}) > 1-\delta$, $\delta \ll 1$) scales only logarithmically with $m$ (see App. A.1). This means that, despite estimating a much more general family of graph functions, g-GRFs are essentially no more expensive than the original GRF algorithm. Moreover, any techniques used for dimensionality reduction of regular GRFs (e.g. applying the Johnson-Lindenstrauss transform (Dasgupta et al., 2010) or using anchor points (Choromanski, 2023)) can also be used with g-GRFs, providing further efficiency gains.

Generating functions: Inserting the constraint for unbiasedness in Eq. 4 back into the definition of $\mathcal{K}_\alpha(\mathbf{W})$, we immediately have that

$$\mathcal{K}_\alpha(\mathbf{W}) = \mathcal{K}_{f_1}(\mathbf{W})\mathcal{K}_{f_2}(\mathbf{W}), \quad (7)$$

where $\mathcal{K}_{f_1}(\mathbf{W}) := \sum_{i=0}^{\infty} f_1(i)\mathbf{W}^i$ is the generating function corresponding to the sequence $(f_1(i))_{i=0}^{\infty}$. This is natural because the (discrete) Fourier transform of a (discrete) convolution returns the product of the (discrete) Fourier transforms of the respective functions. In the symmetric case $f_1 = f_2$, it follows that

$$\mathcal{K}_f(\mathbf{W}) = \left(\mathcal{K}_\alpha(\mathbf{W})\right)^{1/2}. \quad (8)$$

If the RHS has a simple Taylor expansion (e.g. $\mathcal{K}_\alpha(\mathbf{W}) = \exp(\mathbf{W})$ so $\mathcal{K}_f(\mathbf{W}) = \exp(\mathbf{W}/2)$), this enables us to obtain $f$ without recourse to the conditional sum in Eq. 5 or the iterative expression in Eq. 6. This is the case for many popular graph kernels; we provide some prominent examples in the table below. A notable exception is the inverse cosine kernel.

| Name | $f(i)$ |
|---|---|
| $d$-regularised Laplacian | $\frac{(d-2+2i)!!}{(2i)!!\,(d-2)!!}$ |
| $p$-step random walk | $\binom{p/2}{i}$ |
| Diffusion | $\frac{1}{2^i i!}$ |

As an interesting corollary, by considering the diffusion kernel we have also proved that $\sum_{p=0}^{k} \frac{1}{2^p p!}\cdot\frac{1}{2^{k-p}(k-p)!} = \frac{1}{k!}$.

### 2.1 Neural modulation functions, kernel learning and generalisation

Instead of using a fixed modulation function $f$ to estimate a fixed kernel, it is possible to parameterise it more flexibly. For example, we can define a neural modulation function $f^{(N)}: \mathbb{N} \cup \{0\} \to \mathbb{R}$ by a neural network (with a restricted domain) whose input and output dimensionalities are equal to 1. During training, we can choose the loss function to target particular desiderata of g-GRFs: for example, to suppress the mean square error of estimates of some particular fixed kernel (Sec. 3.4), or to learn a kernel which performs better in a downstream task (Sec. 3.5). Implicitly learning $\mathcal{K}_\alpha$ via $f^{(N)}$ is more scalable than learning $\mathcal{K}_\alpha$ directly because it obviates the need to repeatedly compute the exact kernel, which is typically of $\mathcal{O}(N^3)$ time complexity. Since any modulation function $f$ maps to a unique $\alpha$ by Eq. 4, it is also always straightforward to recover the exact kernel which the g-GRFs estimate, e.g. once the training is complete. Supposing we have (implicitly) learned $\alpha$, how can the learned kernel $K^{\mathcal{G}}_\alpha$ be expected to generalise?
Let $\psi_{K_\alpha}: x \to \mathcal{H}_{K_\alpha}$ denote the feature mapping from the input space to the reproducing kernel Hilbert space $\mathcal{H}_{K_\alpha}$ induced by the kernel $K^{\mathcal{G}}_\alpha$. Define the hypothesis set

$$\mathcal{H} = \left\{x \to \langle \mathbf{w}, \psi_{K_\alpha}(x)\rangle : |\alpha_i| \leq \alpha_i^{(M)}, \|\mathbf{w}\|_2 \leq 1\right\}, \quad (9)$$

where we restricted our family of kernels so that the absolute value of each Taylor coefficient $\alpha_i$ is smaller than some maximum value $\alpha_i^{(M)}$. Following very similar arguments to Cortes et al. (2010), the following is true.

Theorem 2.3 (Empirical Rademacher complexity bound). For a fixed sample $S = (x_i)_{i=1}^m$, the empirical Rademacher complexity $\widehat{\mathfrak{R}}(\mathcal{H})$ is bounded by

$$\widehat{\mathfrak{R}}(\mathcal{H}) \leq \sqrt{\frac{1}{m}\sum_{i=0}^{\infty} \alpha_i^{(M)} \rho(\mathbf{W})^i}, \quad (10)$$

where $\rho(\mathbf{W})$ is the spectral radius of the weighted adjacency matrix $\mathbf{W}$.

Naturally, the bound on $\widehat{\mathfrak{R}}(\mathcal{H})$ increases monotonically with $\rho(\mathbf{W})$. Following standard arguments in the literature, this immediately yields generalisation bounds for learning kernels $K^{\mathcal{G}}_\alpha$. We discuss this in detail, including proving Theorem 2.3, in App. A.4.

## 3 Experiments

Here we test the empirical performance of g-GRFs, both for approximating fixed kernels (Secs 3.1-3.3) and with learnable neural modulation functions (Secs 3.4-3.5).

### 3.1 Unbiased pointwise estimation of fixed kernels

We begin by confirming that g-GRFs do indeed give unbiased estimates of the graph kernels listed in Table 1, taking regularisers $\sigma = 0.25$ and $\alpha = 20$ with $p_{\mathrm{halt}} = 0.1$. We use symmetric modulation functions $f$, computed with the closed forms where available and using the iterative scheme in Eq. 6 if not. Fig. 2 plots the relative Frobenius norm error between the true kernels $\mathbf{K}$ and their approximations with g-GRFs $\widehat{\mathbf{K}}$ (that is, $\|\mathbf{K} - \widehat{\mathbf{K}}\|_{\mathrm{F}}/\|\mathbf{K}\|_{\mathrm{F}}$) against the number of random walkers $m$. We consider 8 different graphs: a small random Erdős-Rényi graph, a larger Erdős-Rényi graph, a binary tree, a d-regular graph and 4 real-world examples (karate, dolphins, football and eurosis) (Ivashkin, 2023). They vary substantially in size. For every graph and for all kernels, the quality of the estimate improves as $m$ grows and becomes very small with even a modest number of walkers.

[Figure 2: Unbiased estimation of popular kernels (1-reg Lap, 2-reg Lap, 2-step RW, Diffusion, Cosine) on different graphs using g-GRFs. Panels: ER (N = 20, p = 0.2), ER (N = 100, p = 0.04), Binary tree (N = 127), d-regular (N = 100, d = 10), Karate (N = 34), Dolphins (N = 62), Football (N = 115), Eurosis (N = 1272); each plots Frobenius norm error against the number of random walks. The approximation error improves with the number of walkers. We repeat 10 times; one standard deviation of the mean error is shaded.]

### 3.2 Solving differential equations on graphs

An intriguing application of g-GRFs for fixed kernels is efficiently computing approximate solutions of time-invariant non-homogeneous ordinary differential equations (ODEs) on graphs. Consider the following ODE defined on the nodes $\mathcal{N}$ of the graph $\mathcal{G}$:

$$\frac{d\mathbf{x}(t)}{dt} = \mathbf{W}\mathbf{x}(t) + \mathbf{y}(t), \quad (11)$$

where $\mathbf{x}(t) \in \mathbb{R}^N$ is the state of the graph at time $t$, $\mathbf{W} \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix and $\mathbf{y}(t)$ is a (known) driving term. Assuming the null initial condition $\mathbf{x}(0) = \mathbf{0}$,
Eq. 11 is solved by the convolution

$$\mathbf{x}(t) = \int_0^t \exp(\mathbf{W}(t-\tau))\,\mathbf{y}(\tau)\, d\tau = \mathbb{E}_{\tau \sim P}\left[\frac{1}{p(\tau)}\exp(\mathbf{W}(t-\tau))\,\mathbf{y}(\tau)\right], \quad (12)$$

where $P$ is a probability distribution on the interval $[0, t]$, equipped with a (preferably efficient) sampling mechanism and probability density function $p(\tau)$. Taking $n \in \mathbb{N}$ Monte Carlo samples $(\tau_i)_{i=1}^n$ i.i.d. from $P$, we can construct the unbiased estimator:

$$\widehat{\mathbf{x}}(t) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{p(\tau_j)}\exp(\mathbf{W}(t-\tau_j))\,\mathbf{y}(\tau_j). \quad (13)$$

Note that $\exp(\mathbf{W}(t-\tau_j))$ is nothing other than the diffusion kernel, which is expensive to compute exactly for large $N$ but can be efficiently approximated with g-GRFs. Take $\exp(\mathbf{W}(t-\tau_j)) \approx \Phi_j\Phi_j^\top$ with $\Phi_j := (\phi(i))_{i=1}^N$ an $N \times N$ matrix whose rows are g-GRFs constructed to approximate the kernel at a particular $\tau_j$. Then we have that

$$\widehat{\mathbf{x}}(t) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{p(\tau_j)}\Phi_j\left(\Phi_j^\top\mathbf{y}(\tau_j)\right), \quad (14)$$

which can be computed in quadratic time (cf. cubic if the heat kernel is computed exactly). Further speed gains are possible if dimensionality reduction techniques are applied to the g-GRFs (Choromanski, 2023; Dasgupta et al., 2010).

As an example, we consider diffusion on three real-world graphs with a fixed source at one node, taking $\mathbf{W} = -\mathbf{L}$ (with $\mathbf{L}$ the graph Laplacian) and $\mathbf{y}(t) = \mathbf{y} = (1, 0, 0, ...)^\top$. The steady state is $\mathbf{x}(\infty) = \mathbf{W}^{-1}(-\mathbf{y})$. We simulate evolution under the ODE for $t = 1$ with $n = 10$ discretisation timesteps and $P$ uniform, approximating $\exp(\mathbf{W}(t-\tau_j))$ with different numbers of walkers $m$. As $m$ grows, the quality of approximation improves and the (normalised) error on the final state $\|\widehat{\mathbf{x}}(1) - \mathbf{x}(1)\|_2/\|\mathbf{x}(1)\|_2$ drops for every graph. We take 100 repeats for statistics and plot the results in Fig. 3. One standard deviation of the mean error is shaded. $p_{\mathrm{halt}} = 0.1$ and the regulariser is given by $\sigma = 1$.

[Figure 3: Diffusion simulation error against the number of walkers on the karate, dolphins and football graphs. The ODE simulation error decreases as the number of walkers grows.]

### 3.3 Efficient kernelised graph node clustering

As a further demonstration of the utility of our new mechanism, we show how estimates of the kernel $\mathbf{K} = \exp(\sigma^2\mathbf{W})$ can be used to assign nodes to $k = 3$ clusters. Here, we choose $\mathbf{W}$ to be the (unweighted) adjacency matrix and the regulariser is $\sigma^2 = 0.2$. We follow the algorithm proposed by Dhillon et al. (2004), comparing the clusters when we use exact and g-GRF-approximated kernels. For all graphs, $m \leq 80$. Table 2 reports the clustering error, defined by

$$E_c := \frac{\text{no. wrong pairs}}{N(N-1)/2}. \quad (15)$$

This is the number of misclassified pairs of nodes (assigned to the same cluster when the converse is true or vice versa) divided by the total number of pairs. The error is small even with a modest number of walkers and on large graphs; kernel estimates efficiently constructed using g-GRFs can be readily deployed on downstream tasks where exact methods are slow.

| Graph | $N$ | Clustering error, $E_c$ |
|---|---|---|
| karate | 34 | 0.08 |
| dolphins | 62 | 0.16 |
| polbooks | 105 | 0.12 |
| football | 115 | 0.02 |
| databases | 1046 | 0.10 |
| eurosis | 1272 | 0.09 |
| cora | 2485 | 0.01 |
| citeseer | 3300 | 0.04 |

Table 2: Errors in kernelised k-means clustering when approximating the kernel $\exp(\sigma^2\mathbf{W})$ with g-GRFs.

### 3.4 Learning $f^{(N)}$ for better kernel approximation

Following the discussion in Sec. 2.1, we now replace the fixed $f$ with a neural modulation function $f^{(N)}$ parameterised by a simple neural network with 1 hidden layer of size 1:

$$f^{(N)}(x) = \sigma_{\mathrm{softplus}}\left(w_2\,\sigma_{\mathrm{ReLU}}(w_1 x + b_1) + b_2\right), \quad (16)$$

where $w_1, b_1, w_2, b_2 \in \mathbb{R}$ and $\sigma_{\mathrm{ReLU}}$ and $\sigma_{\mathrm{softplus}}$ are the ReLU and softplus activation functions, respectively.
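As a sketch of how Eq. 16 can be realised (ours, in NumPy; the parameter values below are arbitrary placeholders rather than trained weights, and the paper trains the four scalars against a kernel-approximation loss as described next), the forward pass of the neural modulation function can be evaluated on a batch of walk lengths and used as a pre-computed table of $f$ values:

```python
import numpy as np

def softplus(z):
    # numerically stable softplus: log(1 + exp(z))
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def relu(z):
    return np.maximum(z, 0.0)

def neural_modulation(lengths, w1, b1, w2, b2):
    """Forward pass of the neural modulation function in Eq. 16:
    f_N(x) = softplus(w2 * relu(w1 * x + b1) + b2), evaluated on walk lengths."""
    x = np.asarray(lengths, dtype=float)
    return softplus(w2 * relu(w1 * x + b1) + b2)

# Placeholder parameters: in the paper these four scalars are trained (e.g. with Adam)
# against a Frobenius-norm loss between the target and g-GRF-estimated Gram matrices.
params = dict(w1=-0.3, b1=1.0, w2=1.2, b2=-0.5)
f_table = neural_modulation(np.arange(64), **params)
# f_table can now be passed to the Alg. 1 sampler in place of a fixed modulation function.
```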
Bigger, more expressive architectures (including ones allowing $f^{(N)}$ to take negative values) can be used, but this is found to be sufficient for our purposes. We define our loss function to be the Frobenius norm error between a target Gram matrix and our g-GRF-approximated Gram matrix on the small Erdős-Rényi graph ($N = 20$) with $m = 16$ walks. For the target, we choose the 2-regularised Laplacian kernel. We train symmetric $f^{(N)}_1 = f^{(N)}_2$ but provide a brief discussion of the asymmetric case (including its respective strengths and weaknesses) in App. A.5. On this graph, we minimise the loss with the Adam optimiser and a decaying learning rate (LR = 0.01, $\gamma$ = 0.975, 1000 epochs).

We make the following striking observation: $f^{(N)}$ does not generically converge to the unique unbiased (symmetric) modulation function implied by $\alpha$, but instead to some different $f$ that, though biased, gives a smaller mean squared error (MSE). This is possible because, by downweighting long walks, the learned $f^{(N)}$ gives estimators with a smaller variance, which is sufficient to suppress the MSE on the kernel matrix elements even though it no longer gives the target value in expectation.

We then fix $f^{(N)}$ and use it for kernel approximation on the remaining graphs. The learned, biased $f^{(N)}$ still provides better kernel estimates, including for graphs with very different topologies and a much greater number of nodes: eurosis is bigger by a factor of over 60. See Table 3 for the results. $p_{\mathrm{halt}} = 0.5$ and $\sigma = 0.8$.

Naturally, the learned $f^{(N)}$ is dependent on the number of random walks $m$; as $m$ grows, the variance on the kernel approximation drops, so it is intuitive that the learned $f^{(N)}$ will approach the unbiased $f$. Fig. 4 empirically confirms this is the case, showing the learned $f^{(N)}$ for different numbers of walkers. The unbiased modulation function, which for the 2-regularised Laplacian kernel is constant, is also shown.

| Graph | $N$ | Frob. norm error on $\widehat{\mathbf{K}}$ (Unbiased) | Frob. norm error on $\widehat{\mathbf{K}}$ (Learned) |
|---|---|---|---|
| Small ER | 20 | 0.0488(9) | 0.0437(9) |
| Larger ER | 100 | 0.0503(4) | 0.0448(4) |
| Binary tree | 127 | 0.0453(4) | 0.0410(4) |
| d-regular | 100 | 0.0490(2) | 0.0434(2) |
| karate | 34 | 0.0492(6) | 0.0439(6) |
| dolphins | 62 | 0.0505(5) | 0.0449(5) |
| football | 115 | 0.0520(2) | 0.0459(2) |
| eurosis | 1272 | 0.0551(2) | 0.0484(2) |

Table 3: Kernel approximation error with $m = 16$ walks and unbiased or learned modulation functions. Lower is better. Brackets give one standard deviation of the last digit.

[Figure 4: Learned modulation functions for different numbers of random walkers $m \in \{2, 4, 8, 16, 32, 64, 128, 256, 512\}$, plotted against walk length $i$. The learned $f^{(N)}$ approaches the unbiased $f$ as $m$ grows.]

These learned modulation functions might guide the more principled construction of biased, low-MSE GRFs in the future. An analogue in Euclidean space is provided by structured orthogonal random features (SORFs), which replace a random Gaussian matrix with an HD-product to estimate the Gaussian kernel (Choromanski et al., 2017; Yu et al., 2016). This likewise improves kernel approximation quality despite breaking estimator unbiasedness.

### 3.5 Implicit kernel learning for node attribute prediction

As suggested in Sec. 2.1, it is also possible to train the neural modulation function $f^{(N)}$ directly using performance on a downstream task, performing implicit kernel learning. We have argued that this is much more scalable than optimising $\mathcal{K}_\alpha$ directly. In this spirit, we now address the problem of kernel regression on triangular mesh graphs, as previously considered by Reid et al.
(2023). For graphs in this dataset (Dawson-Haggerty, 2023), every node is associated with a normal vector $\mathbf{v}(i) \in \mathbb{R}^3$ equal to the mean of the normal vectors of its 3 surrounding faces. The task is to predict the directions of missing vectors (a random 5% split) from the remainder. Our (unnormalised) predictions are given by $\widehat{\mathbf{v}}(i) := \sum_j \widehat{K}^{(N)}(i,j)\,\mathbf{v}(j)$, where $\widehat{K}^{(N)}$ is a kernel estimate constructed using g-GRFs with a neural modulation function $f^{(N)}$ (see Eq. 3). The angular prediction error is $1 - \cos\theta_i$, with $\theta_i$ the angle between the true $\mathbf{v}(i)$ and approximate $\widehat{\mathbf{v}}(i)$ normals, averaged over the missing vectors.

[Figure 5: Fixed (1-reg Lap) and learned modulation functions for the kernel regression task, plotted against walk length $i$.]

We train a symmetric pair $f^{(N)}$ using this angular prediction error on the small cylinder graph ($N = 210$) as the loss function. Then we freeze $f^{(N)}$ and compute the angular prediction error for other, larger graphs. Fig. 5 shows the learned $f^{(N)}$ as well as some other modulation functions corresponding to popular fixed kernels. Note also that learning $f^{(N)}$ already includes (but is not limited to) optimising the lengthscale of a given kernel: taking $\widetilde{\mathbf{W}} \to \beta\widetilde{\mathbf{W}}$ is identical to $f(i) \to f(i)\beta^i \;\forall i$.

The prediction errors are highly correlated between the different modulation functions for a given random draw of walks; ensembles which explore the graph poorly, terminating quickly or repeating edges, will give worse g-GRF estimators for every $f$. For this reason, we compute the prediction errors as the average normalised difference compared to the learned kernel result. Table 4 reports the results. Crucially, this difference is found to be positive for every graph and every fixed kernel, meaning the learned kernel always performs best.

| Graph | $N$ | 1-reg Lap | 2-reg Lap | Diffusion |
|---|---|---|---|---|
| cylinder | 210 | +0.40(2) | +0.85(4) | +0.029(3) |
| teapot | 480 | +0.81(5) | +1.78(8) | +0.059(3) |
| idler-riser | 782 | +0.52(2) | +1.12(1) | +0.042(2) |
| busted | 1941 | +0.81(2) | +1.60(4) | +0.063(2) |
| torus | 4350 | +2.13(5) | +5.3(1) | +0.067(2) |
| cycloidal | 21384 | +0.065(4) | +0.13(1) | +0.011(1) |

Table 4: Normalised difference in angular prediction error compared to the learned kernel defined by $f^{(N)}$ (trained on the cylinder graph). All entries are positive since the learned kernel always performs best. We take 100 repeats (but only 10 for the very large cycloidal graph). The brackets give one standard deviation on the final digit.

It is remarkable that the learned $f^{(N)}$ gives the smallest error for all the graphs even though it was just trained on cylinder, the smallest one. We have implicitly learned a good kernel for this task which generalises well across topologies. It is also intriguing that the diffusion kernel performs only slightly worse. This is to be expected because their modulation functions are similar (see Fig. 5), so they encode very similar kernels, but this will not always be the case depending on the task at hand.

## 4 Conclusion

We have introduced general graph random features (g-GRFs), a novel random walk-based algorithm for time-efficient estimation of arbitrary functions of a weighted adjacency matrix. The mechanism is conceptually simple and trivially distributed across machines, unlocking kernel-based machine learning on very large graphs. By parameterising one component of the random features with a simple neural network, we can further suppress the mean square error of estimators and perform scalable implicit kernel learning.
## 5 Ethics and reproducibility

Ethics: Our work is foundational. There are no direct ethical concerns that we can see, though of course increases in scalability afforded by graph random features might amplify risks of graph-based machine learning, from bad actors or as unintended consequences.

Reproducibility: To foster reproducibility, we clearly state the central algorithm in Alg. 1. Source code is available at https://github.com/isaac-reid/general_graph_random_features. All theoretical results are accompanied by proofs in Appendices A.1-A.4, where any assumptions are made clear. The datasets we use correspond to standard graphs and are freely available online. We link suitable repositories in every instance. Except where prohibitively computationally expensive, results are reported with uncertainties to help comparison.

## 6 Relative contributions and acknowledgements

EB initially proposed using a modulation function to generalise GRFs to estimate the diffusion kernel and derived its mathematical expression. IR and KC then developed the full g-GRFs algorithm for general functions of a weighted adjacency matrix, proving all theoretical results, running the experiments and preparing the manuscript. AW provided helpful discussions and advice.

IR acknowledges support from a Trinity College External Studentship. AW acknowledges support from a Turing AI fellowship under grant EP/V025279/1 and the Leverhulme Trust via CFI. We thank Kenza Tazi and Austin Tripp for their careful readings of the manuscript. Richard Turner provided valuable suggestions and support throughout the project.

## References

Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst. Templates for the solution of algebraic eigenvalue problems: a practical guide. SIAM, 2000. URL https://www.cs.ucdavis.edu/~bai/ET/contents.html.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002. URL https://dl.acm.org/doi/pdf/10.5555/944919.944944.

Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47-i56, 2005. URL https://doi.org/10.1093/bioinformatics/bti1007.

Viacheslav Borovitskiy, Iskander Azangulov, Alexander Terenin, Peter Mostowsky, Marc Deisenroth, and Nicolas Durrande. Matérn Gaussian processes on graphs. In International Conference on Artificial Intelligence and Statistics, pages 2593-2601. PMLR, 2021. URL https://doi.org/10.48550/arXiv.2010.15538.

Colin Campbell. Kernel methods: a survey of current techniques. Neurocomputing, 48(1-4):63-84, 2002. doi: 10.1016/S0925-2312(01)00643-9. URL https://doi.org/10.1016/S0925-2312(01)00643-9.

Stéphane Canu and Alexander J. Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714-720, 2006. doi: 10.1016/j.neucom.2005.12.009. URL https://doi.org/10.1016/j.neucom.2005.12.009.

Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 15, 2002. URL https://dl.acm.org/doi/10.5555/2968618.2968693.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.
arXiv preprint arXiv:2009.14794, 2020. URL https://doi.org/10.48550/arXiv.2009.14794.

Krzysztof M Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. Advances in Neural Information Processing Systems, 30, 2017. URL https://doi.org/10.48550/arXiv.1703.00864.

Krzysztof Marcin Choromanski. Taming graph kernels with random features. In International Conference on Machine Learning, pages 5964-5977. PMLR, 2023. URL https://doi.org/10.48550/arXiv.2305.00156.

Fan R. K. Chung and Shing-Tung Yau. Coverings, heat kernels and spanning trees. Electron. J. Comb., 6, 1999. doi: 10.37236/1444. URL https://doi.org/10.37236/1444.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th Annual International Conference on Machine Learning (ICML 2010), 2010. URL http://www.cs.nyu.edu/~mohri/pub/lk.pdf.

Keenan Crane, Clarisse Weischedel, and Max Wardetzky. The heat method for distance computation. Communications of the ACM, 60(11):90-99, 2017. URL https://dl.acm.org/doi/10.1145/3131280.

Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 341-350. ACM, 2010. doi: 10.1145/1806689.1806737. URL https://doi.org/10.1145/1806689.1806737.

Michael Dawson-Haggerty. Trimesh repository, 2023. URL https://github.com/mikedh/trimesh.

Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551-556, 2004. URL https://dl.acm.org/doi/10.1145/1014052.1014118.

Michel X. Goemans and David P. Williamson. Approximation algorithms for max-3-cut and other problems via complex semidefinite programming. J. Comput. Syst. Sci., 68(2):442-470, 2004. doi: 10.1016/j.jcss.2003.07.012. URL https://doi.org/10.1016/j.jcss.2003.07.012.

Vladimir Ivashkin. Community graphs repository, 2023. URL https://github.com/vlivashkin/community-graphs.

William B Johnson. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189-206, 1984. URL http://stanford.edu/class/cs114/readings/JL-Johnson.pdf.

Kyle Kloster and David F Gleich. Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1386-1395, 2014. URL https://doi.org/10.48550/arXiv.1403.3148.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50, 2002. URL https://doi.org/10.48550/arXiv.math/0405343.

Risi Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Claude Sammut and Achim G. Hoffmann, editors, Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, pages 315-322. Morgan Kaufmann, 2002. URL https://www.ml.cmu.edu/research/dap-papers/kondor-diffusion-kernels.pdf.

Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri. Kernel methods for learning languages. Theor. Comput. Sci., 405(3):223-236, 2008. doi: 10.1016/j.tcs.2008.06.037. URL https://doi.org/10.1016/j.tcs.2008.06.037.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007. URL https://dl.acm.org/doi/10.5555/2981562.2981710.

Isaac Reid, Krzysztof Choromanski, and Adrian Weller. Quasi-Monte Carlo graph random features. arXiv preprint arXiv:2305.12470, 2023. URL https://doi.org/10.48550/arXiv.2305.12470.

Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 144-158. Springer, 2003. URL https://people.cs.uchicago.edu/~risi/papers/SmolaKondor.pdf.

Alexander J. Smola and Bernhard Schölkopf. Bayesian kernel methods. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, Machine Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lectures, volume 2600 of Lecture Notes in Computer Science, pages 65-117. Springer, 2002. URL https://doi.org/10.1007/3-540-36434-X_3.

Yasutoshi Yajima. One-class support vector machines for recommendation tasks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 230-239. Springer, 2006. URL https://dl.acm.org/doi/10.1007/11731139_28.

Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. Advances in Neural Information Processing Systems, 29, 2016. URL https://doi.org/10.48550/arXiv.1610.09072.

Yufan Zhou, Changyou Chen, and Jinhui Xu. Learning manifold implicitly via explicit heat-kernel learning. Advances in Neural Information Processing Systems, 33:477-487, 2020. URL https://doi.org/10.48550/arXiv.2010.01761.

## A Appendices

### A.1 Minimum batch size scales logarithmically with number of walks

Consider an ensemble of $m$ random walks $\mathcal{S} := \{\omega_i\}_{i=1}^m$ whose lengths are sampled from a geometric distribution with termination probability $p$. Trivially, $\Pr(\mathrm{len}(\omega_i) < b) = 1 - (1-p)^b$. Given $m$ independent walkers,

$$\Pr\left(\max_{\omega_i \in \mathcal{S}} \mathrm{len}(\omega_i) < b\right) = \Pr\left(\cap_{i=1}^m \{\mathrm{len}(\omega_i) < b\}\right) = \left(1 - (1-p)^b\right)^m. \quad (17)$$

Take 'with high probability' to mean 'with probability at least $1 - \delta$', with $\delta \ll 1$ fixed. Then

$$b = \frac{\log\left(1 - (1-\delta)^{1/m}\right)}{\log(1-p)} \approx \frac{\log(\delta) - \log(m)}{\log(1-p)}. \quad (18)$$

As the number of walkers $m$ grows, the minimum value of $b$ to ensure that all walkers are shorter than $b$ with high probability scales logarithmically with $m$.

### A.2 Proof of Theorem 2.1 (Unbiased approximation of $\mathcal{K}_\alpha$ via convolutions)

We begin by proving that g-GRFs constructed according to Alg. 1 give unbiased estimation of graph functions with Taylor coefficients $(\alpha_i)_{i=0}^\infty$ provided the discrete convolution relation in Eq. 4 is fulfilled. Denote the set of $m$ walks sampled out of node $i$ by $\{\Omega^{(i)}_k\}_{k=1}^m$. Carefully considering Alg. 1, it is straightforward to convince oneself that the g-GRF estimator takes the form

$$\phi(i)_v = \frac{1}{m}\sum_{k=1}^{m}\sum_{\omega_{iv} \in \Omega_{iv}} \frac{\widetilde{\omega}(\omega_{iv})\, f(\mathrm{len}(\omega_{iv}))}{p(\omega_{iv})}\, \mathbb{I}\left(\omega_{iv} \in \Omega^{(i)}_k\right). \quad (19)$$

Here, $\Omega_{iv}$ denotes the set of all graph walks between nodes $i$ and $v$, of which each $\omega_{iv}$ is a member. $\mathrm{len}(\omega_{iv})$ is a function that gives the length of walk $\omega_{iv}$ and $\widetilde{\omega}(\omega_{iv})$ evaluates the product of the edge weights it traverses. $p(\omega_{iv}) = (1-p)^{\mathrm{len}(\omega_{iv})}\prod_{i=1}^{\mathrm{len}(\omega_{iv})}\frac{1}{d_i}$ denotes the walk's marginal probability. $\mathbb{I}$ is the indicator function, which evaluates to 1 if its argument is true (namely, walk $\omega_{iv}$ is a prefix subwalk of $\Omega^{(i)}_k$, the $k$th walk sampled out of $i$) and 0 otherwise.
Trivially, we have that $\mathbb{E}\left[\mathbb{I}(\omega_{iv} \in \Omega^{(i)}_k)\right] = p(\omega_{iv})$ (by construction, to make the estimator unbiased).³ Now note that:

$$\begin{aligned}
\mathbb{E}\left[\phi(i)^\top\phi(j)\right] &= \mathbb{E}\left[\sum_{v\in\mathcal{N}}\phi(i)_v\,\phi(j)_v\right] \\
&= \sum_{v\in\mathcal{N}}\sum_{\omega_{iv}\in\Omega_{iv}}\sum_{\omega_{jv}\in\Omega_{jv}}\widetilde{\omega}(\omega_{iv})f(\mathrm{len}(\omega_{iv}))\,\widetilde{\omega}(\omega_{jv})f(\mathrm{len}(\omega_{jv})) \\
&= \sum_{v\in\mathcal{N}}\sum_{l_1=0}^{\infty}\sum_{l_2=0}^{\infty}\mathbf{W}^{l_1}_{iv}\mathbf{W}^{l_2}_{jv}f(l_1)f(l_2) \\
&= \sum_{v\in\mathcal{N}}\sum_{l_1=0}^{\infty}\sum_{l_3=0}^{l_1}\mathbf{W}^{l_1-l_3}_{iv}\mathbf{W}^{l_3}_{jv}f(l_1-l_3)f(l_3) \\
&= \sum_{l_1=0}^{\infty}\mathbf{W}^{l_1}_{ij}\sum_{l_3=0}^{l_1}f(l_1-l_3)f(l_3).
\end{aligned} \quad (20)$$

From the first to the second line we used the definition of g-GRFs in Eq. 19, together with the fact that the walk ensembles sampled out of $i$ and $j$ are independent, so the expectation of the product factorises. We then rewrote the sum of the product of edge weights over all possible paths as powers of the weighted adjacency matrix $\mathbf{W}$, with $l_{1,2}$ corresponding to walk lengths. To get to the fourth line, we changed indices in the infinite sums, then the final line followed simply. The final expression in Eq. 20 is exactly equal to $\mathcal{K}_\alpha(\mathbf{W}) := \sum_{l_1=0}^\infty \alpha_{l_1}\mathbf{W}^{l_1}$ provided we have that

$$\sum_{l_3=0}^{l_1} f(l_1-l_3)f(l_3) := \alpha_{l_1}, \quad (21)$$

as stated in Eq. 4 of the main text (with variables renamed to $k$ and $p$).

³To flag a subtle point: in the sum we take $\omega_{iv} \in \Omega_{iv}$ to mean that $\omega_{iv}$ is a member of the set of all possible walks of any length between nodes $i$ and $v$, $\Omega_{iv}$. On the other hand, inside the indicator function by $\omega_{iv} \in \Omega^{(i)}_k$ we mean that $\omega_{iv}$ is a prefix subwalk of one particular walk $\Omega^{(i)}_k$ sampled out of node $i$. In these two cases the interpretation of the symbol $\in$ should be slightly different.

### A.3 Proof of Theorem 2.2 (Computing symmetric modulation functions)

Here, we show how to compute $f$ under the constraint that the modulation functions are identical, $f_1 = f_2 = f$. We will use the relationship in Eq. 4, reproduced below for the reader's convenience:

$$\sum_{p=0}^{i} f(i-p)f(p) = \alpha_i. \quad (22)$$

The iterative form in Eq. 6 is easy to show. For $i = 0$ we have that $f(0)^2 = \alpha_0$, so $f(0) = \sqrt{\alpha_0} = 1$ (where we used the normalisation condition $\alpha_0 = 1$). Now note that

$$\sum_{p=0}^{i+1} f(i+1-p)f(p) = 2f(0)f(i+1) + \sum_{p=1}^{i} f(i+1-p)f(p) = \alpha_{i+1} \quad (23)$$

$$\implies f(i+1) = \frac{\alpha_{i+1} - \sum_{p=0}^{i-1} f(i-p)f(p+1)}{2f(0)}. \quad (24)$$

This enables us to compute $f(i+1)$ given $\alpha_{i+1}$ and $(f(p))_{p=0}^{i}$.

The analytic form in Eq. 5 is only a little harder. Inserting the discrete convolution relation in Eq. 4 back into Eq. 2, we have that

$$\mathcal{K}_\alpha(\mathbf{W}) = \mathcal{K}_{f_1}(\mathbf{W})\mathcal{K}_{f_2}(\mathbf{W}), \quad (25)$$

where $\mathcal{K}_{f_1}(\mathbf{W}) := \sum_{i=0}^{\infty} f_1(i)\mathbf{W}^i$ is the generating function corresponding to the sequence $(f_1(i))_{i=0}^\infty$. Constraining $f_1 = f_2$,

$$\mathcal{K}_f(\mathbf{W}) = \left(\mathcal{K}_\alpha(\mathbf{W})\right)^{1/2}. \quad (26)$$

We also discuss this in the generating functions section of Sec. 2, where we use it to derive simple closed forms for $f$ for some special kernels. Now we have that

$$\sum_{i=0}^{\infty} f(i)\mathbf{W}^i = \left(\sum_{n=0}^{\infty}\alpha_n\mathbf{W}^n\right)^{1/2} = \left(1 + \left(\alpha_1\mathbf{W} + \alpha_2\mathbf{W}^2 + \dots\right)\right)^{1/2} = \sum_{n=0}^{\infty}\binom{1/2}{n}\left(\alpha_1\mathbf{W} + \alpha_2\mathbf{W}^2 + \dots\right)^n. \quad (27)$$

We need to equate powers of $\mathbf{W}$ between the generating functions. Consider the terms proportional to $\mathbf{W}^i$. Clearly no such terms will feature when $n > i$, so we can restrict the sum on the RHS to $0 \leq n \leq i$. Meanwhile, the term in $(\alpha_1\mathbf{W}+\alpha_2\mathbf{W}^2+\dots)^n$ proportional to $\mathbf{W}^i$ is nothing other than

$$\sum_{\substack{k_1+2k_2+3k_3+\dots=i \\ k_1+k_2+k_3+\dots=n}} \binom{n}{k_1\,k_2\,k_3\dots}\alpha_1^{k_1}\alpha_2^{k_2}\alpha_3^{k_3}\dots \quad (28)$$

where $\binom{n}{k_1 k_2 k_3 \dots}$ is the multinomial coefficient. Combining,

$$f(i) = \sum_{n=0}^{i}\binom{1/2}{n}\sum_{\substack{k_1+2k_2+3k_3+\dots=i \\ k_1+k_2+k_3+\dots=n}} \binom{n}{k_1\,k_2\,k_3\dots}\alpha_1^{k_1}\alpha_2^{k_2}\alpha_3^{k_3}\dots \quad (29)$$

as shown in Eq. 5. Though this expression gives $f$ purely in terms of $\alpha$, the presence of the conditional sum limits its practical utility compared to the iterative form.

### A.4 Proof of Theorem 2.3 (Empirical Rademacher complexity bound)

In this appendix we derive the bound on the empirical Rademacher complexity stated in Theorem 2.3 and show the consequent generalisation bounds. The early stages closely follow arguments made by Cortes et al. (2010).
Recall that we have defined the hypothesis set

$$\mathcal{H} = \left\{x \to \langle\mathbf{w},\psi_{K_\alpha}(x)\rangle : |\alpha_i| \leq \alpha_i^{(M)}, \|\mathbf{w}\|_2 \leq 1\right\}, \quad (30)$$

with $\psi_{K_\alpha}: x \to \mathcal{H}_{K_\alpha}$ the feature mapping from the input space to the reproducing kernel Hilbert space $\mathcal{H}_{K_\alpha}$ induced by the kernel $K^{\mathcal{G}}_\alpha$. The empirical Rademacher complexity $\widehat{\mathfrak{R}}(\mathcal{H})$ for an arbitrary fixed sample $S = (x_i)_{i=1}^m$ is defined by

$$\widehat{\mathfrak{R}}(\mathcal{H}) := \frac{1}{m}\mathbb{E}_\sigma\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^{m}\sigma_i h(x_i)\right], \quad (31)$$

where the expectation is taken over $\sigma = (\sigma_1, ..., \sigma_m)$ with $\sigma_i \sim \mathrm{Unif}(\{\pm 1\})$ i.i.d. Rademacher random variables. Begin by noting that

$$h(x) := \langle\mathbf{w},\psi_{K_\alpha}(x)\rangle = \sum_{i=1}^{m}\beta_i K_\alpha(x_i, x), \quad (32)$$

where $\beta_i$ are the coordinates of the orthogonal projection of $\mathbf{w}$ on $\mathcal{H}_S = \mathrm{span}(\psi_{K_\alpha}(x_1), ..., \psi_{K_\alpha}(x_m))$, with $\beta^\top\mathbf{K}_\alpha\beta \leq 1$. Then we have that

$$\widehat{\mathfrak{R}}(\mathcal{H}) = \frac{1}{m}\mathbb{E}_\sigma\left[\sup_{\alpha,\beta}\sigma^\top\mathbf{K}_\alpha\beta\right]. \quad (33)$$

The supremum $\sup_\beta \sigma^\top\mathbf{K}_\alpha\beta$ is reached when $\mathbf{K}_\alpha^{1/2}\beta$ is collinear with $\mathbf{K}_\alpha^{1/2}\sigma$, and making $\|\beta\|_2$ as large as possible gives

$$\sup_\beta \sigma^\top\mathbf{K}_\alpha\beta = \sqrt{\sigma^\top\mathbf{K}_\alpha\sigma}. \quad (34)$$

Now note that

$$\sigma^\top\mathbf{K}_\alpha\sigma = \sum_{i=0}^{\infty}\alpha_i\,\sigma^\top\mathbf{W}^i\sigma. \quad (35)$$

$\sigma^\top\mathbf{W}^i\sigma$ may take either sign, and the sum is maximised by taking $\alpha_i = \alpha_i^{(M)}$ for positive terms and $\alpha_i = -\alpha_i^{(M)}$ for negative terms. Observe that

$$\sup_\sigma\sup_\alpha \sigma^\top\mathbf{K}_\alpha\sigma \leq m\sum_{i=0}^{\infty}\alpha_i^{(M)}\rho(\mathbf{W})^i, \quad (36)$$

whereupon from Eq. 31 it follows that

$$\widehat{\mathfrak{R}}(\mathcal{H}) \leq \sqrt{\frac{1}{m}\sum_{i=0}^{\infty}\alpha_i^{(M)}\rho(\mathbf{W})^i}, \quad (37)$$

as stated in Thm 2.3. This bound is not tight for general graphs, but will be for specific examples: for example, when $\mathbf{W}$ is proportional to the identity so the only edges are self-loops. Nonetheless, it provides some intuition for how the learned kernel's complexity depends on $\mathbf{W}$.

As stated in the main text, Eq. 37 immediately yields generalisation bounds for learning kernels. Again closely following Cortes et al. (2010), consider the application of binary classification where nodes are assigned a label $y_i = \pm 1$. Denote by $R(h)$ the true generalisation error of $h \in \mathcal{H}$,

$$R(h) = \Pr(yh(x) < 0). \quad (38)$$

Consider a training sample $S = ((x_i, y_i))_{i=1}^m$, and define the $\rho$-empirical margin loss by

$$\widehat{R}_\rho(h) := \frac{1}{m}\sum_{i=1}^{m}\min\left(1, \left[1 - y_i h(x_i)/\rho\right]_+\right), \quad (39)$$

where $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in \mathcal{H}$ (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002):

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\widehat{\mathfrak{R}}(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2m}}. \quad (40)$$

Inserting the bound on the empirical Rademacher complexity in Eq. 37, we immediately have that

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\sqrt{\frac{1}{m}\sum_{i=0}^{\infty}\alpha_i^{(M)}\rho(\mathbf{W})^i} + 3\sqrt{\frac{\log(2/\delta)}{2m}}, \quad (41)$$

which shows how the generalisation error can be controlled via $\alpha^{(M)}$ or $\rho(\mathbf{W})$.

### A.5 Learning asymmetric $f^{(N)}$

[Figure 6: Modulation functions $(f_1, f_2)$ parameterised by separate neural networks before and after training to target the 2-regularised Laplacian kernel (legend: initial $f_1$, initial $f_2$, trained $f_1$, trained $f_2$), plotted against walk length $i$.]

Recalling from Eq. 1 that the modulation functions $(f_1, f_2)$ do not necessarily need to be equal for unbiased estimation of some kernel $\mathcal{K}_\alpha$, a natural extension of Sec. 3.4 is to introduce two separate neural modulation functions $f^{(N)}_{1,2}$ and train them both following the scheme in Sec. 3.4. Intriguingly, even with an initialisation where $f^{(N)}_2$ encodes 'lazy' behaviour (deposits almost all its load at the starting node; see Sec. 2) and $f^{(N)}_1$ is flat, upon training the neural modulation functions quickly become very similar (though not identical). See Fig. 6. We use the same optimisation hyperparameters and network architectures as in Sec. 3.4. Rigorously proving the best possible choice of $(f_1, f_2)$, including whether e.g. a symmetric pair is optimal, is left as an exciting open theoretical problem.
We note that, whilst parameterising two separate neural modulation functions gives a more general mechanism (with the symmetric pair as a special case), it also doubles the number of parameters required, so slows training and evaluation.

### A.6 Further approximation error results for Erdős-Rényi graphs

Here we further investigate the behaviour of the g-GRFs kernel approximation as the properties of the graph change. To do this, we generate random graphs using the Erdős-Rényi model, where each of the $\binom{N}{2}$ possible edges is present with probability $p_{\mathrm{edge}}$ or absent with probability $1 - p_{\mathrm{edge}}$. Edges are independent.

Firstly, we investigate how the approximation error varies with graph sparsity. We take a fixed number of nodes ($N = 100$) and control the sparsity by varying $p_{\mathrm{edge}}$ between 0.1 and 0.9. We then approximate the diffusion kernel using g-GRFs with a termination probability $p_{\mathrm{halt}} = 0.1$ and 8 walkers. For each graph we compute the relative Frobenius norm error between the true ($\mathbf{K}$) and approximated ($\widehat{\mathbf{K}}$) kernels: namely, $\|\mathbf{K}-\widehat{\mathbf{K}}\|_{\mathrm{F}}/\|\mathbf{K}\|_{\mathrm{F}}$. We repeat 100 times to obtain the standard deviation of the mean error estimate. Fig. 7 shows the results; approximation quality degrades slightly as $p_{\mathrm{edge}}$ grows, but remains very good throughout (error < 0.00275).

Next, we investigate the scalability of the method by comparing the time for exact and approximate evaluation of the diffusion kernel as the size of the graph grows. We fix $p_{\mathrm{edge}} = 0.5$ and vary the number of nodes $N$ between 100 and 12800. For every graph, we measure the wall-clock time for i) exact computation of $\mathbf{K}$ by computing the matrix exponential and ii) approximate computation of $\widehat{\mathbf{K}}$ using g-GRFs (8 walkers, $p_{\mathrm{halt}} = 0.5$, regulariser $\sigma = 0.5$). Fig. 8 shows the results. Naturally, the exact method scales worse, becoming slower for graphs bigger than a few thousand nodes, and by the largest graph g-GRFs are already faster by a factor of 7.8. We also measure the relative Frobenius norm error between the true and approximated Gram matrices to check that the quality of estimation remains good. This is shown by the green line, which indeed remains almost constant and takes very small values for the error (around 0.005).

[Figure 7: Approximation error vs. edge-generation probability $p_{\mathrm{edge}}$ for the diffusion kernel on a graph of size N = 100. The quality remains good as sparsity changes, varying within a narrow range.]

[Figure 8: Wall-clock time for exact and approximate kernel evaluation as the number of nodes varies. g-GRFs scale better and are faster beyond a few thousand nodes. The approximation error remains low as N grows.]

### A.7 Further experimental details

In this short section, we provide further experimental details and discussion to supplement the main text.

1. Choice of $p_{\mathrm{halt}}$: The termination probability encodes a simple trade-off between approximation quality and speed; if $p_{\mathrm{halt}}$ is lower, walks tend to last longer and sample more subwalks, so give a better approximation of graph kernels. In practice, any reasonably small value works well, as also reported for the original GRFs mechanism (Choromanski, 2023). In experiments we typically choose $p_{\mathrm{halt}} \approx 0.1$.

2. Choice of $\sigma$ and kernel convergence:
After Eq. 2 we noted that, when approximating some fixed kernel $\mathcal{K}_\alpha(\mathbf{W}) = \sum_{k=0}^{\infty}\alpha_k\mathbf{W}^k$, we assume that the sum does not diverge. It would not be possible to construct a finite random feature estimator if this were not the case. We require that $\sum_{k=0}^{\infty}\alpha_k\rho(\mathbf{W})^k$ is finite, with $\rho(\mathbf{W}) := \max_{\lambda\in\Lambda(\mathbf{W})}|\lambda|$ the spectral radius of the weighted adjacency matrix ($\Lambda(\mathbf{W})$ is the set of eigenvalues of $\mathbf{W}$). When considering e.g. the diffusion kernel in the literature, this is ensured by multiplying $\mathbf{W}$ by some regulariser $\sigma\in\mathbb{R}_+$, taking e.g. $\mathbf{W}\to\sigma\mathbf{W}$ whereupon $\lambda_i\to\sigma\lambda_i$ for $i=1,...,N$. This is the reason for the extra parameters $\{\sigma,\alpha\}$ in the kernel expressions in Table 1. As we noted in Sec. 3.5, this is exactly equivalent to transforming the modulation function $f(i)\to f(i)\sigma^i\;\forall i$. Where space allows, we report the exact choice of $\sigma$ with each experiment (though empirically it does not modify our conclusions).

In Sec. 3.5 we take the weighted adjacency matrix $\mathbf{W} := [a_{ij}/\sqrt{d_i d_j}]_{i,j=1}^N$ with $a_{ij}=1$ if node $i$ is connected to $j$ and 0 otherwise, and $d_i$ the degree of node $i$. We then regularise by taking $\mathbf{W}\to\sigma\mathbf{W}$ with $\sigma = 0.025$, which is small enough for convergence even with the largest graph considered. We use $m = 16$ walkers and $p_{\mathrm{halt}} = 0.5$.
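To make the convergence condition above concrete, the following is a small NumPy sketch (our own illustration, not the paper's released code) that builds the normalised adjacency matrix used in Sec. 3.5 and rescales it so the spectral radius of the regularised matrix sits comfortably inside the radius of convergence of the chosen Taylor series. The function names and the default target radius are our own choices.

```python
import numpy as np

def normalised_adjacency(A):
    """W := [a_ij / sqrt(d_i d_j)], as used in Sec. 3.5."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                                  # guard against isolated nodes
    return A / np.sqrt(np.outer(d, d))

def regularise(W, sigma=None, target_radius=0.5):
    """Rescale W -> sigma * W. If sigma is not supplied, choose it so that the
    spectral radius of sigma * W equals target_radius. Keeping sigma * rho(W) < 1
    is sufficient for convergence of the (regularised) Laplacian kernel series;
    the diffusion series converges for any sigma."""
    rho = np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius rho(W)
    if sigma is None:
        sigma = target_radius / rho
    return sigma * W, sigma

# Example: random unweighted adjacency, normalised and then rescaled.
rng = np.random.default_rng(0)
A = np.triu((rng.uniform(size=(50, 50)) < 0.1).astype(float), 1)
A = A + A.T
W_reg, sigma = regularise(normalised_adjacency(A))
```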