# repelling_random_walks__44f25d2b.pdf Published as a conference paper at ICLR 2024 Repelling Random Walks Isaac Reid1 , Eli Berger2 , Krzysztof Choromanski3,4 , Adrian Weller1,5 1University of Cambridge, 2University of Haifa, 3Google Deep Mind, 4Columbia University, 5Alan Turing Institute ir337@cam.ac.uk, kchoro@google.com We present a novel quasi-Monte Carlo mechanism to improve graph-based sampling, coined repelling random walks. By inducing correlations between the trajectories of an interacting ensemble such that their marginal transition probabilities are unmodified, we are able to explore the graph more efficiently, improving the concentration of statistical estimators whilst leaving them unbiased. The mechanism has a trivial drop-in implementation. We showcase the effectiveness of repelling random walks in a range of settings including estimation of graph kernels, the Page Rank vector and graphlet concentrations. We provide detailed experimental evaluation and robust theoretical guarantees. To our knowledge, repelling random walks constitute the first rigorously studied quasi-Monte Carlo scheme correlating the directions of walkers on a graph, inviting new research in this exciting nascent domain.1 1 Introduction and related work Quasi-Monte Carlo (QMC) sampling is well-established as a universal tool to improve the convergence of MC methods, improving the concentration properties of estimators by using low-discrepancy samples to reduce integration error (Dick et al., 2013). They replace i.i.d. samples with a correlated ensemble, carefully constructed to be more diverse and hence improve approximation quality. Such methods have been widely adopted in the Euclidean setting. For example, when sampling from isotropic distributions, one popular approach is to condition that samples are orthogonal: a trick that has proved successful in applications including dimensionality reduction (Choromanski et al., 2017), evolution strategy methods in reinforcement learning (Choromanski et al., 2018; Rowland et al., 2018) and estimating sliced Wasserstein distances (Rowland et al., 2019). Orthogonal Monte Carlo has also been used to improve the convergence of random feature maps for kernel approximation (Yu et al., 2016), including recently in attention approximation for scalable Transformers (Choromanski et al., 2020). Intuitively, conditioning that samples are orthogonal prevents them from clustering together and ensures that they explore Rd better. In specific applications it is sometimes possible to derive rigorous theoretical guarantees (Reid et al., 2023b). Less clear is how these powerful ideas generalise to discrete space. Of particular interest are random walks on graphs, which sample a sequence of nodes connected by edges with some stopping criterion. Random walks are ubiquitous in machine learning and statistics (Xia et al., 2019), providing a simple mechanism for unbiased graph sampling that can be implemented in a distributed way. However, slow diffusion times (especially for challenging graph topologies) can lead to poor convergence and downstream performance. Our key contribution is the first (to our knowledge) quasi-Monte Carlo scheme that correlates the directions of an ensemble of graph random walkers to improve estimator accuracy. By conditioning that walkers repel in a particular way that leaves the marginal walk probabilities unmodified, we are able to provably suppress the variance of various estimators Senior lead. 1Code is available at https://github.com/isaac-reid/repelling_random_walks. Published as a conference paper at ICLR 2024 whilst preserving their unbiasedness. We derive strong theoretical guarantees and observe large performance gains for algorithms estimating three disparate quantities: graph kernels (Choromanski, 2023), the Page Rank vector (Avrachenkov et al., 2007) and graphlet concentrations (Chen et al., 2016). Related work: The poor mixing of random walkers on graphs is well-documented and various schemes exist to try to improve estimator convergence. Most directly modify the base Markov chain by changing the transition probabilities, but without altering the walker s stationary distribution and therefore leaving asymptotic estimators (e.g. based on empirical node occupations) unmodified. The canonical example of such a scheme is non-backtracking walks which do not permit walkers to return to their most recently visited node (Alon et al., 2007; Diaconis et al., 2000; Lee et al., 2012). More involved schemes allow walkers to interact with their entire history (Zhou et al., 2015; Doshi et al., 2023). Many of these strategies provide theoretical guarantees that the asymptotic variance of estimators is reduced, but crucially the marginal probabilities of sampling different walks are modified so they cannot be applied to non-asymptotic estimators that rely on particular known marginal transition probabilities. Conversely, our QMC scheme leaves marginal walk probabilities unmodifed. Research has also predominantly been restricted to the behaviour of a single self-interacting walker rather than an ensemble, and when multiple walkers are considered analytic results are generally restricted to simple structures, e.g. complete graphs (Rosales et al., 2022; Chen, 2014). This research exists within the broader literature of reinforced random walks, where nonlinear Markov kernels are used so that walkers are less (or more) likely to transition to nodes that have been visited in the past (Pemantle, 2007). However, the analytic focus has predominantly been on properties like recurrence times, escape times from sets, cover times and localisation results for simple topologies (Amit et al., 1983; Tóth, 1995; Tarrès, 2004), rather than the behaviour of associated statistical estimators on general graphs. The latter is of more direct interest in machine learning. The remainder of the manuscript is organised as follows. In Sec. 2 we introduce the requisite mathematics and present our novel QMC repelling random walk mechanism. In Secs 3-5 we use it to approximate three quantities of interest in machine learning: graph node kernels (Sec. 3), the Page Rank vector (Sec. 4), and graphlet statistics (Sec. 5). Repelling random walks are empirically found to outperform the i.i.d. variant in every case and we are often able to provide concrete theoretical guarantees. 2 Repelling random walks Consider an undirected, connected graph G(N, E) where N := {1, ..., N} denotes the set of nodes and E denotes the set of edges, with (i, j) E if there is an edge between nodes i, j N. Write the graph s (weighted) adjacency matrix A := [aij]i,j N , where aij = 0 if (i, j) E and 0 otherwise. Let di := P j N I[(i, j) E] denote the node degree, which is the number of neighbours of a particular node, and let N(i) := {j N|(i, j) E} denote the set of neighbours of node i. The transition matrix P = [Pij]i,j N of a simple random walk is given by ( 1 di if (i, j) E 0 otherwise (1) such that at every timestep the walker selects one of its neighbours with uniform probability. This can be viewed as a finite and time-reversible Markov chain with state space N. Supposing we have m such walkers on the graph simultaneously, we can define an augmented Markov chain with state space N m, consisting of the possible node positions of all the walkers. If the walkers are independent, the joint transition matrix Q RNm N m is given a Kronecker product of the marginal transition matrices P(i): Q = m i=1P(i) (2) where the index i = 1, ..., m enumerates the walkers present. Our key contribution is now to induce correlations between the walkers paths such that the joint transition matrix Q is modified but each marginal transition matrix (and hence the unbiasedness of any estimator relying on it) is unchanged. The correlations are designed to improve estimator convergence. Published as a conference paper at ICLR 2024 Figure 1: Schematic for behaviour of repelling random walkers at a particular timestep. By sampling from each block (blue and green rectangles) without replacement we get a more even distribution over neighbours, without changing the marginal probabilities. Definition 2.1 (Repelling random walks). A repelling ensemble has the following behaviour. Let V(i) t denote the set of walkers at node i at timestep t, and N (i) t := |V(i) t | the size of this set. Randomly divide these walkers into N (i) t //d subsets of size d and one remainder subset of size N (i) t %d < d (where // and % denote truncating integer division and the remainder, respectively). Among each subset, assign the walkers to a neighbour from the set N(i) uniformly without replacement. This is in contrast to i.i.d. walkers where V(i) t are assigned to the neighbours N(i) uniformly with replacement. We provide a schematic in Fig. 1. In the repelling scheme, each walker still has a marginal transition probability Pij = {1/di if (i, j) E, 0 otherwise}, but now they are forced to take different edges and heuristically explore the graph more effectively. The sample of walks is more diverse . Since the marginal transition probabilities P(i) are unmodified, any estimators that are unbiased with i.i.d. walkers are also automatically unbiased with repelling walkers, including in the non-asymptotic regime. However, as we shall see, their concentration properties are often substantially better. Computational cost and implementation: Repelling random walks have a trivial dropin implementation. The only difference is whether walkers are assigned to neighbours with or without replacement. Moreover, the transitions in the augmented state space N m remain Markovian (memoryless); there are no extra space complexity costs because we only need access to the current positions of all the walkers. Physical interpretation and entanglement: Under repulsive interactions, the joint transition matrix Q can no longer be written as a Kronecker product. Consider transitions of 2 walkers in the same block from (i1, i2) to (j1, j2). We have: QNi1+i2,Nj1+j2 := Pr(j1, j2|i1, i2) = Pi1j1Pi2j2 ( 1 + δi1i2 di di 1(1 δj1j2) 1 if di = 1 1 if di = 1 (3) with δi1i2 the delta function. This does not generically factorise into (i1, j1)- and (i2, j2)- dependent parts. In quantum mechanics (QM), an interacting Hamiltonian H which cannot be written as a Kronecker sum gives rise to a time-evolution operator U := exp( i h Ht) that cannot be written as a Kronecker product, which in turn generically gives rise to quantum entanglement between particles. Just as the von-Neumann entropy (a measure of bipartite quantum entanglement (Amico et al., 2008)) increases under such interactions, in our QMC scheme the Shannon mutual information initially increases from 0: during the first timestep, I1,2 = δi1i2 log( di di 1) 0. Note that the analogy is not exact because in QM the timeevolution operator acts on (complex) wavefunctions whereas here the transition matrix acts on the (real positive) probabilities of being in different states of a Markov chain encoding the positions of walkers on a graph. It is just intended to help build intuition for the reader. It will be convenient to define one further class of interacting random walk. Published as a conference paper at ICLR 2024 Definition 2.2 (Transient repelling random walks). An ensemble of random walks is described as transient repelling the walkers repel (according to Def. 2.1) at the first timestep, and are independent thereafter. Such an ensemble will capture the repelling behaviour at early times but eventually relax to independence. Whilst less practical than the full repelling scheme, we will see that sometimes it makes theoretical analysis tractable. We now apply our repelling random walks mechanism to three disparate applications: approximation of graph kernels (Sec. 3), approximation of the Page Rank vector (Sec. 4), and approximation of graphlet concentrations (Sec. 5). 3 Application 1: approximating graph kernels We begin by demonstrating the effectiveness of repelling random walks for estimating graph kernels KG : N N R, defined on the nodes N of a graph G. Such kernels capture the structure of G, letting practitioners repurpose theoretically grounded and empirically successful algorithms like support vector machines, kernelised principal component analysis and Gaussian processes to the discrete domain (Smola and Kondor, 2003). Applications include in bioinformatics (Borgwardt et al., 2005), community detection (Kloster and Gleich, 2014), generative modelling (Zhou et al., 2020) and solving shortest-path problems (Crane et al., 2017). Chief examples of KG are the d-regularised Laplacian and diffusion kernels, given by K(d) lap := (I + σ2e L) d and Kdiff := exp( σ2e L/2) respectively. Here, σ2 is a lengthscale parameter and e L is the normalised graph Laplacian, defined by e L := I W with W = [aij/( di dj)1/2]N i,j=1 a normalised weighted adjacency matrix ( di = P j aij is the weighted node degree and aij is the weight of the original edge). e L is the analogue of the familiar Laplacian operator 2 = 2 x2 2 + ... + 2 x2n in discrete space, describing diffusion on G (Chung and Yau, 1999; Chung, 1997). For large graphs, computing e.g. K(d) lap exactly can be prohibitively expensive due to the O(N 3) time complexity of matrix inversion. This motivated the recently-introduced class of Graph Random Features (GRFs) (Choromanski, 2023), which provide a discrete analogue to Random Fourier Features (Rahimi and Recht, 2007). These N-dimensional vectors ϕ(i) RN are constructed for every node i N such that their Euclidean dot product is equal to the kernel evaluation in expectation, [K(2) lap]ij = E ϕ(i) ϕ(j) . (4) In their paper, Choromanski (2023) provides an elegant algorithm for constructing ϕ(i): one simulates m N random walks out of each node i that terminate with probability p at every timestep, depositing a load at every node they visit to build up a randomised projection of the local environment in G. They show that this gives an unbiased estimate of K(2) lap, which can be used to construct K(d) lap for d N or an asymptotically unbiased approximation of Kdiff. Since the unbiasedness of the estimator depends on the marginal probabilities of sampling different finite-length random walks being unmodified (c.f. just its stationary distribution), it is a natural setting to test our new quasi-Monte Carlo scheme. Remarkably, under mild conditions, we are able to derive an analytic closed form for the difference in kernel estimator mean squared error (MSE) between the i.i.d. and transientrepelling mechanisms for general graphs (deferred to Eq. 32 in App. A.1 for brevity). This enables us to make the following statement for some specific graphs, proved in App. A.1. Theorem 3.1 (Superiority of repelling random walks for kernel estimation). Consider graph nodes indexed (i, j) separated by at least 2 edges. In the limit σ 0, provided the number of walkers in the transient repelling ensemble is smaller than or equal to the node degrees d{i,j} and the edge-weights of W are equal, Var([ b K(2) lapij]repelling) Var([ b K(2) lapij]i.i.d.) (5) for both i) trees and ii) 2-dimensional grids. Though we have made some restrictions for analytic tractability, we will empirically observe that the full repelling QMC scheme is effective in much broader settings. In particular, Published as a conference paper at ICLR 2024 5 10 15 No. random walks Frob. norm error ER (N = 20, p = 0.2) GRFs q-a-GRFs q-r-GRFs q-ar-GRFs 5 10 15 No. random walks Frob. norm error ER (N = 100, p = 0.04) 5 10 15 No. random walks Frob. norm error Binary tree (N = 127) 5 10 15 No. random walks Frob. norm error d-regular (N = 100, d = 10) 5 10 15 No. random walks Frob. norm error Karate (N = 34) 5 10 15 No. random walks Frob. norm error Dolphins (N = 62) 5 10 15 No. random walks Frob. norm error Football (N = 115) 5 10 15 No. random walks Frob. norm error Eurosis (N = 1272) Figure 2: Relative Frobenius norm of estimates of the 2-regularised Laplace kernel (lower is better) vs. number of random walks for: i) vanilla GRFs; ii) GRFs with antithetic termination (Reid et al., 2023a) ( q-a-GRFs ); iii) GRFs with repelling walks ( q-r-GRFs ); iv) GRFs with both antithetic termination and repelling walks ( q-ar-GRFs ). Using both QMC schemes together gives the best results for all graphs considered and the gains are large (sometimes a factor of > 2). N gives the number of nodes, p is the edge-generation probability for the Erdös-Rényi graphs, and d is the d-regular node degree. One standard deviation on the mean error is shaded but is too small to easily see. it substantially suppresses kernel estimator variance with many walkers, arbitrary σ and arbitrary graphs. Extending the proof to these general cases is an exciting open problem. We also note that our scheme is fully compatible with the recently-introduced QMC scheme known as antithetic termination (Reid et al., 2023a), which anticorrelates the lengths of random walkers (by coupling their terminations) but does not modify their trajectories. Both schemes can be applied simultaneously, inducing negative correlations between both the walk directions and lengths. 3.1 Pointwise kernel estimation We now empirically test Eq. 5 for general graphs by comparing the variance of [ b K(2) lap]ij under different schemes. In what follows, GRFs refers to graph random features constructed using i.i.d. walkers, whilst q-{a,r,ar}-GRFs denotes the efficient quasi-Monte Carlo variants where walkers exhibit antithetic termination ( a ) (Reid et al., 2023a), repel ( r ), or both ( ar ). We use these different flavours of (q-)GRFs to generate unbiased estimates b K(2) lap, then compute the relative Frobenius norm K(2) lap b K(2) lap F/ K(2) lap F between the true and approximated Gram matrices. Fig. 2 presents the results for various graphs: small Erdős Rényi, larger Erdős-Rényi, a binary tree, a d-regular graph, and four standard real-world examples from (Ivashkin, 2023) (karate, dolphins, football and eurosis). These differ substantially in both size and structure. We take 100 repeats to compute the variance of the kernel approximation error, using a regulariser σ = 0.1 and a termination probability p = 0.5. The gains provided by the repelling QMC scheme (green) are much greater than those from antithetic termination (orange), but the lowest variance is achieved when both are used together (red). Note that the gains provided by repelling random walks continue to accrue as the size of the ensemble grows; with m = 16 walkers the error is often halved. 3.2 Downstream applications: kernel regression for node attribute prediction We have both proved (Theorem 3.1) and empirically confirmed (Fig. 2) that using repelling random walks substantially improves the quality of estimation of the 2-regularised Laplacian Published as a conference paper at ICLR 2024 kernel using GRFs. Naturally, this permits better performance in downstream applications that depend on the approximation. As an example, we follow Reid et al. (2023a) and consider kernel regression on triangular mesh graphs (Dawson-Haggerty, 2023). Consider a graph G where each node is associated with a normal vector v(i). The task is to predict the directions of a random set of missing test vectors (a 5% split) from the remaining train vectors. We compute our (unnormalised) predictions bv(i) as bv(i) := P j b K(2) lap(i, j)v(j), where j sums over the training vertices and b K(2) lap(i, j) is constructed using the GRF and q-{a,r,ar}-GRF mechanisms described in Sec. 3.1. We compute the average angular error 1 cos θ between the prediction bv(i) and groundtruth v(i) across the test set. We use m = 16 random walks with a termination probability p = 0.5 and a regulariser σ = 0.1, taking 1000 repeats for statistics. Table 1 reports the results. Higher-quality kernel approximations with repelling random walks give more accurate downstream predictions for all graphs, with the biggest gains appearing when our repelling scheme is introduced ( r and ar ). The difference is remarkably big when the number of nodes N is big: on torus, the error is reduced by a factor of almost 3. Accurate approximation is especially helpful for these large graphs as exact methods become increasingly expensive. Table 1: Angular error 1 cos θ between true and predicted node vectors when approximating the Gram matrix with GRFs and q-{a,r,ar}-GRFs (lower is better). Brackets give one standard deviation. Both schemes in combination works best. Graph N Pred error, 1 cos θ GRFs q-a-GRFs q-r-GRFs q-ar-GRFs cylinder 210 0.0650(7) 0.0644(7) 0.0466(3) 0.0459(2) teapot 480 0.0331(2) 0.0322(2) 0.0224(1) 0.0215(1) idler-riser 782 0.0528(3) 0.0521(3) 0.0408(2) 0.0408(2) busted 1941 0.00463(2) 0.00456(2) 0.003833(6) 0.003817(6) torus 4350 0.000506(1) 0.000482(1) 0.000180(1) 0.000181(1) Though for concreteness we have considered one particular downstream application, we stress that improving the kernel estimate can be expected to boost performance in any algorithm that uses it, e.g. for graph node clustering (Dhillon et al., 2004), shortest-path prediction (Crane et al., 2017) or simulation of graph diffusion (Reid et al., 2023a). 4 Application 2: approximating Page Rank As a second application, we use repelling random walks to improve numerical estimates of the Page Rank vector: a popular measure of node importance in a graph originally proposed by Page et al. (1998) to rank websites in search engine results. The Page Rank vector is defined as the stationary distribution of Markov chain whose state space is the set of all graph nodes N, with a transition matrix e P := (1 p)P + p Here, p (0, 1) is a scalar, N is the number of nodes, P is defined in Eq. 1 and E = [1]i,j N is a matrix whose entries are all ones. This encodes the behaviour of a surfer who at every timestep either teleports to a random node with probability p or moves to one of its neighbours chosen uniformly at random. Since e P is stochastic, aperiodic and irreducible, we can define the unique Page Rank vector π RN: π e P = π , π 1 = 1, (7) where we normalised the sum of vector entries to 1. Physically, πj is the stationary probability that a surfer is at node j. π is very expensive to compute for large graphs and the form of e P invites MC estimation with random walkers. Fogaras et al. (2005) suggest the following algorithm. Published as a conference paper at ICLR 2024 Algorithm 4.1 (Random walks for Page Rank estimation). (Fogaras et al., 2005) Simulate m N runs of a simple random walk with transition probability matrix P out of every node i N, terminating with probability p at each timestep. Evaluate the estimator bπj as the fraction of walks terminating at node j, bπj := 1 Nm P N Pm j=1 I(walker terminates at j). It is straightforward to show that bπ is an unbiased estimator of π (see App. A.2). This is a natural setting to test an ensemble of repelling random walks. We are able to make the following surprisingly strong statement. Theorem 4.2 (Superiority of repelling random walks for Page Rank estimation). For a transient repelling ensemble, Var(bπj)repelling Var(bπj)i.i.d. (8) for any graph. We defer a full proof to App. A.2 but provide a brief sketch below. Proof sketch: Supposing that the number of walkers is smaller than the minimum node degree, the behaviours of a transient repelling and i.i.d. ensemble only differ at the first timestep. In the former scheme walkers are forced to diverge whereas in the latter they are independent. The expectation values of the estimators associated with each walker are conditionally independent given their node positions at t = 1 and are identical in both schemes by definition; denote it by f(vt=1). With the i.i.d. ensemble the variance depends on Ev(1) v(2)[f(v(1) t=1)f(v(2) t=1)] where the node positions of a pair of walkers v(1,2) are independent. Meanwhile, for repelling walkers it depends on Ev(1) =v(2)[f(v(1) t=1)f(v(2) t=1)] where we condition that v(1,2) cannot be equal. Simple algebra reveals that the latter is smaller. It is straightforward to then generalise to when the number of walkers exceeds the minimum node degree. It is remarkable that Theorem 4.2 holds for arbitrary G. Table 2 shows the Page Rank estimator error with 2 walkers that are either i) i.i.d. or ii) repelling out of every node. The quality of approximation is already excellent with just a single pair. The termination probability is p = 0.3 and we take 1000 trials to compute the standard deviations (10000 for eurosis since it is larger). As per the theoretical guarantees, repelling random walks consistently perform better. Table 2: Mean L2-norm of the difference between the true and approximated Page Rank vectors πerr := π bπ 2, using i.i.d. and repelling pairs of random walkers. Lower is better. Repelling random walks consistently outperform i.i.d. random walks. Parentheses give one standard deviation on the mean error. Graph N Page Rank error, πerr i.i.d. repelling Small ER 20 0.0208(2) 0.0196(2) Larger ER 100 0.00420(2) 0.00406(2) Binary tree 127 0.00290(1) 0.00270(1) d-regular 100 0.00434(2) 0.00422(2) karate 34 0.0124(1) 0.0115(1) dolphins 62 0.00686(4) 0.00651(4) football 115 0.00385(2) 0.00376(2) eurosis 1272 0.000342(2) 0.000335(2) In the Page Rank setting, RRWs are closely related to the algorithm presented by Luo (2019), which was introduced to reduce edge bandwidth. The scheme takes di walks out of every node i and permutes them randomly among the neighbours at every timestep. Note that sampling without replacement is identical to permutation if the number of walkers is equal to the number of neighbours to which they must be assigned. As a brief addendum for the interested reader: bπj is actually a member of a broader class of functions coined step-by-step linear, defined as follows. Published as a conference paper at ICLR 2024 Definition 4.3 (Step-by-step linear functions). Let Ωr denote the set of all infinite-length walks starting at node r, Ωr := {(vi) i=0 | v0 = r, vi N, (vi, vi+1) E}. We refer to a function y : Ωr R as step-by-step linear if it takes the form: i=0 f(vi, i) j=1 g(vj 1, vj, j, j 1), (9) where f : N (N {0}) R and g : N N (N {0}) (N {0}) R. These functions have the property that the variance of the corresponding Monte Carlo estimator is guaranteed to be suppressed by conditioning that the ensemble of random walks {ω}m i=1 is transient repelling. Concretely, the following is true. Theorem 4.4 (Variance of step-by-step linear functions is reduced by transient repulsion). Consider the estimator Y := Pm i=1 y(ωi) where {ωi}m i=1 enumerates m (infinite) walks on G and y : Ωr R is a step-by-step linear function. Suppose that the sets of walks {ωi}m i=1 are either i) i.i.d. or ii) transient repelling (Def. 2.2). We have that: Var(Yrepelling) Var(Yi.i.d.). (10) We provide a proof and further discussion in Sec. A.3. Interestingly, the step-by-step linear family also includes ϕ(i)k, the kth component of the GRF corresponding to the ith node of G, though of course this alone is insufficient to guarantee suppression of variance of the kernel estimator ϕ(i) ϕ(j). 5 Application 3: approximating graphlet concentrations triangle wedge Figure 3: Graphlets for k = 3 Finally, we use repelling random walks to estimate the relative frequencies of graphlets: induced subgraph patterns within a graph G. Formally, a k-node induced subgraph Gk = (Vk, Ek) satisfies Vk V, |Vk| = k and Ek = {(u, v) : u, v Vk (u, v) E}: that is, a subset of k connnected nodes together with any edges between them. For example, for k = 3 the possible graphlets are a triangle and a wedge (see Fig. 3). Computing a graph s graphlet concentrations the proportions of different k-node graphlets is a task of broad interest in biology (Pržulj, 2007; Milenković and Pržulj, 2008) and network science (Becchetti et al., 2008; Ugander et al., 2013) since it characterises the local structure of G (Milo et al., 2002). Such concentrations even permit construction of graphlet kernels K : G G R to compare different graphs (Shervashidze et al., 2009). For large graphs, exact computation by exhaustive counting is unfeasible because of the combinatorial explosion in the number of graphlets with N. This motivates random walk Markov Chain Monte Carlo approaches. Such crawling-based algorithms also benefit from not requiring access to the entire graph simultaneously: a typical restriction for online social networks where the graph is only available via API calls to retrieve a particular node s neighbours (e.g. user s friends). These algorithms are also easily distributed across machines. Chen et al. (2016) propose a general algorithm for asymptotically unbiased, efficient estimation of graphlet concentrations using random walks. We summarise one particular instantiation of it for k = 3 below. Algorithm 5.1 (Graphlet concentration estimation using random walks). (Chen et al., 2016) Simulate a simple random walk of length L N (the sampling budget) out of a randomly selected node. Consider X(3) i = (Xi, Xi+1, Xi+2) with 1 i L 2, the states of an augmented Markov Chain whose state space is the ordered 3-tuples of consecutively-visited nodes. Discard all such states where Xi = Xi+2 (where the walker backtracks), and for the Published as a conference paper at ICLR 2024 5 10 15 No. random walks Graphlet conc. est. error ER (N = 20, p = 0.2) i.i.d. repelling 5 10 15 No. random walks Graphlet conc. est. error ER (N = 100, p = 0.04) 5 10 15 No. random walks Graphlet conc. est. error d-regular (N = 100, d = 10) 5 10 15 No. random walks Graphlet conc. est. error Polbooks (N = 105) 5 10 15 No. random walks Graphlet conc. est. error Karate (N = 34) 5 10 15 No. random walks Graphlet conc. est. error Dolphins (N = 62) 5 10 15 No. random walks Graphlet conc. est. error Football (N = 115) 5 10 15 No. random walks Graphlet conc. est. error Eurosis (N = 1272) Figure 4: Mean square error on estimates of k = 3 graphlet concentrations with different numbers of random walks on different graphs. Lower is better. Using the repelling scheme consistently improves the quality of the estimate compared to independent walks. remaining n classify the graphlets g(3) i to get the weighted counts Cwed := Pn 2 i=1 I(g(3) i = wedge) di+1 2 and Ctri := Pn 2 i=1 I(g(3) i = triangle) di+1 6 (where di+1 is the degree of the i + 1th node). In the limit of large L, bc(3) tri := Ctri Ctri+Cwed gives an unbiased estimator of the concentration of triangle graphlets. The weightings in the computation of C{wed,tri} are included to correct for two sources of bias: di+1 accounts for the fact that the stationary distribution of the expanded Markov chain is inversely proportional to the degree of the middle node, π(X(3) i ) = (2|V|di+1) 1, and the combinatorial factors adjust for the fact that 6 states X(3) i correspond to the triangle graphlet (twice the number of Hamiltonian paths) but only 2 correspond to the wedge. We implement Alg. 5.1 with both i) i.i.d. walkers and ii) a repelling ensemble. A rigorous theoretical analysis of concentration properties is very challenging and is deferred as important future work; for now, our study is empirical. Fig. 4 plots the fractional error of the estimator of triangle graphlet concentration bc(3) tri against the number of walkers. We use the same graphs as in Sec. 3, but replace the binary tree with polbooks since for the former bctri = 0 trivially. We impose a restricted sampling budget with walks of length L = 16 to highlight the benefits of repelling random walks in the transient regime, and take 2500 repeats over all starting nodes for statistics. Repelling random walks consistently perform better, providing more accurate estimates of the triangle graphlet concentration, and for some graphs the improvement is large. Alg. 5.1 can be generalised to estimate the concentrations larger graphlets with k > 3; we anticipate that repelling random walks will still prove effective. 6 Conclusion We have presented a new quasi-Monte Carlo scheme called repelling random walks that induces correlations between the directions of random walkers on a graph. Estimators constructed using this interacting ensemble are guaranteed to remain unbiased but their concentration properties are often substantially improved. We test our algorithm on applications as diverse as estimating graph kernels, the Page Rank vector and graphlet concentrations. In every case the experimental performance is very strong and often we are able to provide concrete theoretical guarantees. We hope this work will motivate further research on developing quasi-Monte Carlo methods to improve sampling on graphs. Published as a conference paper at ICLR 2024 7 Ethics and reproducibility Ethics: Our work is foundational with no immediate ethical concerns apparent to us. However, increases in scalability provided by quasi-Monte Carlo algorithms could exacerbate existing and incipient risks of graph-based machine learning, from bad actors or as unintended consequences. Reproducibility: Every effort has been made to guarantee the work s reproducibility. The core algorithm is clearly presented in Def. 2.1. Accompanying theoretical results are proved and discussed in the Appendices A.1-A.3, including any assumptions where appropriate. Source code is available at https://github.com/isaac-reid/repelling_random_walks. All datasets used correspond to standard graphs and are freely available online; we give links to suitable repositories in every instance. Results are reported with uncertainties to facilitate comparison. 8 Relative contributions and acknowledgements IR conceived and implemented the repelling random walks mechanism, derived the theoretical results and prepared the manuscript. KC was crucially involved in the project throughout, acting as the senior research lead. EB proposed the class of step-by-step linear functions and made important theoretical contributions. AW provided helpful discussions and guidance. IR acknowledges support from a Trinity College External Studentship. AW acknowledges support from a Turing AI fellowship under grant EP/V025279/1 and the Leverhulme Trust via CFI. We thank Kenza Tazi and Austin Tripp for their careful readings of the manuscript, and Yashar Ahmadian for interesting discussions about the relationship with quantum entanglement. Richard Turner provided valuable suggestions and support throughout the project. Noga Alon, Itai Benjamini, Eyal Lubetzky, and Sasha Sodin. Non-backtracking random walks mix faster. Communications in Contemporary Mathematics, 9(04):585 603, 2007. URL https://doi.org/10.1142/S021919970700255. Luigi Amico, Rosario Fazio, Andreas Osterloh, and Vlatko Vedral. Entanglement in manybody systems. Reviews of modern physics, 80(2):517, 2008. URL https://doi.org/10. 1103/Rev Mod Phys.80.517. Daniel J Amit, Giorgio Parisi, and Luca Peliti. Asymptotic behavior of the" true" selfavoiding walk. Physical Review B, 27(3):1635, 1983. URL https://doi.org/10.1103/ Phys Rev B.27.1635. Konstantin Avrachenkov, Nelly Litvak, Danil Nemirovsky, and Natalia Osipova. Monte carlo methods in pagerank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis, 45(2):890 904, 2007. URL https://dl.acm.org/doi/10.1137/ 050643799. Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16 24, 2008. URL https://dl.acm.org/doi/10.1145/1401890.1401898. Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47 i56, 2005. URL https://doi.org/10.1093/bioinformatics/ bti1007. Published as a conference paper at ICLR 2024 Jun Chen. Two particles repelling random walks on the complete graph. Electronic Journal of Probability, 19(none):1 17, 2014. doi: 10.1214/EJP.v19-2669. URL https://doi. org/10.1214/EJP.v19-2669. Xiaowei Chen, Yongkun Li, Pinghui Wang, and John Lui. A general framework for estimating graphlet statistics via random walk. ar Xiv preprint ar Xiv:1603.07504, 2016. URL https://doi.org/10.48550/ar Xiv.1603.07504. Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning, pages 970 978. PMLR, 2018. URL https://doi.org/10.48550/ar Xiv.1603.07504. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. ar Xiv preprint ar Xiv:2009.14794, 2020. URL https://doi.org/10.48550/ar Xiv.2009.14794. Krzysztof M Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. Advances in neural information processing systems, 30, 2017. URL https://doi.org/10.48550/ar Xiv.1703.00864. Krzysztof Marcin Choromanski. Taming graph kernels with random features. In International Conference on Machine Learning, pages 5964 5977. PMLR, 2023. URL https://doi.org/10.48550/ar Xiv.2305.00156. Fan R. K. Chung and Shing-Tung Yau. Coverings, heat kernels and spanning trees. Electron. J. Comb., 6, 1999. doi: 10.37236/1444. URL https://doi.org/10.37236/1444. Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997. Keenan Crane, Clarisse Weischedel, and Max Wardetzky. The heat method for distance computation. Communications of the ACM, 60(11):90 99, 2017. URL https://dl.acm. org/doi/10.1145/3131280. Michael Dawson-Haggerty. Trimesh repository, 2023. URL https://github.com/mikedh/ trimesh. Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551 556, 2004. URL https://dl.acm. org/doi/10.1145/1014052.1014118. Persi Diaconis, Susan Holmes, and Radford M Neal. Analysis of a nonreversible markov chain sampler. Annals of Applied Probability, pages 726 752, 2000. URL https://doi. org/10.1214/aoap/1019487508. Josef Dick, Frances Y Kuo, and Ian H Sloan. High-dimensional integration: the quasimonte carlo way. Acta Numerica, 22:133 288, 2013. URL https://doi.org/10.1017/ S0962492913000044. Vishwaraj Doshi, Jie Hu, and Do Young Eun. Self-repellent random walks on general graphs achieving minimal sampling variance via nonlinear markov chains. ar Xiv preprint ar Xiv:2305.05097, 2023. URL https://doi.org/10.48550/ar Xiv.2305.05097. Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3):333 358, 2005. URL https://doi.org/10.1007/978-3-540-30216-2_9. Vladimir Ivashkin. Community graphs repository, 2023. URL https://github.com/ vlivashkin/community-graphs. Published as a conference paper at ICLR 2024 Kyle Kloster and David F Gleich. Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1386 1395, 2014. URL https://doi.org/10.48550/ar Xiv.1403.3148. Chul-Ho Lee, Xin Xu, and Do Young Eun. Beyond random walk and metropolis-hastings samplers: why you should not backtrack for unbiased graph sampling. ACM SIGMETRICS Performance evaluation review, 40(1):319 330, 2012. URL https://dl.acm.org/ doi/10.1145/2318857.2254795. Siqiang Luo. Distributed pagerank computation: An improved theoretical study. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4496 4503, 2019. URL https://doi.org/10.1609/aaai.v33i01.33014496. Tijana Milenković and Nataša Pržulj. Uncovering biological network function via graphlet degree signatures. Cancer informatics, 6:CIN S680, 2008. URL https://doi.org/10. 48550/ar Xiv.0802.0556. Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594): 824 827, 2002. URL https://www.science.org/doi/10.1126/science.298.5594.824. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bring order to the web. Technical report, Technical report, stanford University, 1998. URL https://www.cis.upenn.edu/~mkearns/teaching/Networked Life/ pagerank.pdf. Robin Pemantle. A survey of random processes with reinforcement. Probability Surveys, 4 (none):1 79, 2007. doi: 10.1214/07-PS094. URL https://doi.org/10.1214/07-PS094. Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177 e183, 2007. URL https://doi.org/10.1093/bioinformatics/ btl301. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007. URL https://dl.acm.org/doi/10. 5555/2981562.2981710. Isaac Reid, Krzysztof Choromanski, and Adrian Weller. Quasi-monte carlo graph random features. ar Xiv preprint ar Xiv:2305.12470, 2023a. URL https://doi.org/10.48550/ ar Xiv.2305.12470. Isaac Reid, Krzysztof Marcin Choromanski, Valerii Likhosherstov, and Adrian Weller. Simplex random features. In International Conference on Machine Learning, pages 28864 28888. PMLR, 2023b. URL https://doi.org/10.48550/ar Xiv.2301.13856. Rafael A Rosales, Fernando PA Prado, and Benito Pires. Vertex reinforced random walks with exponential interaction on complete graphs. Stochastic Processes and their Applications, 148:353 379, 2022. URL https://doi.org/10.1016/j.spa.2022.03.007. Mark Rowland, Krzysztof M Choromanski, François Chalus, Aldo Pacchiano, Tamas Sarlos, Richard E Turner, and Adrian Weller. Geometrically coupled monte carlo sampling. Advances in Neural Information Processing Systems, 31, 2018. URL https://dl.acm. org/doi/pdf/10.5555/3326943.3326962. Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 186 195. PMLR, 2019. URL https://doi.org/10.48550/ar Xiv.1903.03784. Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics, pages 488 495. PMLR, 2009. URL https://api.semanticscholar.org/ Corpus ID:17557614. Published as a conference paper at ICLR 2024 Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 144 158. Springer, 2003. URL https://people.cs.uchicago.edu/~risi/ papers/Smola Kondor.pdf. Pierre Tarrès. Vertex-reinforced random walk on z eventually gets stuck on five points. The Annals of Probability, 32(3):2650 2701, 2004. ISSN 00911798. URL http://www.jstor. org/stable/3481644. Bálint Tóth. The" true" self-avoiding walk with bond repulsion on z: Limit theorems. The Annals of Probability, pages 1523 1556, 1995. URL https://doi.org/10.1214/aop/ 1176987793. Johan Ugander, Lars Backstrom, and Jon Kleinberg. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd international conference on World Wide Web, pages 1307 1318, 2013. URL https:// doi.org/10.48550/ar Xiv.1304.1548. Feng Xia, Jiaying Liu, Hansong Nie, Yonghao Fu, Liangtian Wan, and Xiangjie Kong. Random walks: A review of algorithms and applications. IEEE Transactions on Emerging Topics in Computational Intelligence, 4(2):95 107, 2019. URL https://doi.org/10. 1109/TETCI.2019.2952908. Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. Advances in neural information processing systems, 29, 2016. URL https://doi.org/10.48550/ar Xiv.1610. 09072. Yufan Zhou, Changyou Chen, and Jinhui Xu. Learning manifold implicitly via explicit heatkernel learning. Advances in Neural Information Processing Systems, 33:477 487, 2020. URL https://doi.org/10.48550/ar Xiv.2010.01761. Zhuojie Zhou, Nan Zhang, and Gautam Das. Leveraging history for faster sampling of online social networks. ar Xiv preprint ar Xiv:1505.00079, 2015. URL https://doi.org/ 10.48550/ar Xiv.1505.00079. Published as a conference paper at ICLR 2024 A Appendices A.1 Proof of Theorem 3.1 (superiority of repelling random walks for kernel estimation In this appendix, we prove Theorem 3.1: namely, that using transient repelling random walks reduces the mean square error of graph random feature (GRF) estimates of the 2regularised Laplacian kernel, defined by: h b K(2) lap i ij := I e L 2 ij = (1 + σ2) 2 I σ2 1 + σ2 W 2 (11) where e L is the Laplacian, σ is a regulariser and W is a normalised weighted adjacency matrix with elements W = aij/( di dj)1/2 N i,j=1 (with di = P j aij the degree of the ith node and aij the weight of the edge before normalisation). Ignoring the overall normalisation constant and absorbing the factor of σ2/(1 + σ2) into W, wlg we will now consider estimation of b Kij = (I W) 2 (12) where W = [wij]N i,j=1. Directly from the definition of the GRF vector (see e.g. (Choromanski, 2023) or (Reid et al., 2023a)), we have that b Kij = ϕ(i) ϕ(j) = 1 m2 X eω(ωix) p(ωix) eω(ωjx) p(ωjx) N(ωix)N(ωjx) (13) l=1 I ωix Ωl . (14) Here: m N is the number of random walkers simulated out of each node; Ωix is the set of all walks on the graph between the nodes indexed i and x; ωix is a member of this set; eω(ω) is a function that returns the product of weights of edges traversed by a graph random walk ω; and p(ω) is a function that returns the marginal probability of a random walk (equal to ((1 p)/d)len(ω) with 0 < p < 1 a finite termination probability, d the node degree and len(ω) the walk length in the case of a d-regular graph). Ωl denotes the lth walk out of node i such that N(ωix) counts the empirical number of walkers completing a particular prefix subwalk ωix, a discrete random variable between 0 and m. Since E(N(ωix)) = mp(ωix), it is straightforward to see that E(ϕ(i) ϕ(j)) = X ωjx Ωjx eω(ωix)eω(ωjx) ωij Ωij (len(ωij) + 1)eω(ωij) = (I W) 2 (15) which confirms that the estimator is unbiased. Our task is now to determine how the variance of b Kij depends on whether the ensemble of m walkers from each node is i) independent or ii) repelling. We will see that, under some conditions, it is guaranteed to be smaller in the latter case. The following is true: E b K2 ij = 1 m4 X eω(ωix) p(ωix) eω(ωiy) p(ωiy) E(N(ωix)N(ωiy)) eω(ωjx) p(ωjx) eω(ωjy) p(ωjy) E(N(ωjx)N(ωjy)) Published as a conference paper at ICLR 2024 This makes clear that the object of central importance will be E (N(ωix)N(ωiy)) (17) where x, y N. We will now consider how this depends on i) the pair of subwalks (ωix, ωiy) and ii) the presence or absence of repulsion. To avoid notational clutter, we will write expressions as if the graph is d-regular with the understanding that it is trivial to relax this without changing any conclusions, making the replacement: dlen(ωix) Qlen(ωix) 1 i=0 di. i.i.d. walkers: Begin with the simpler i.i.d. case. First consider the case that ωix = ωiy, ωix / ωiy and ωiy / ωix: namely, that the walks are distinct and neither is a (strict) subwalk of the other. It follows that a single walker cannot take both subwalks simultaneously. We also assume that both walks are of length len(ωi(x,y)) 1. Then we have that E [N(ωix)N(ωiy)] = E l2=1 I(ωix Ωl1)I(ωiy Ωl2)) = m(m 1) 1 p len(ωix)+len(ωiy) . What about if ωix ωiy? It is straightforward to convince oneself that E [N(ωix)N(ωiy)] = m(m 1) 1 p len(ωix)+len(ωiy) + m 1 p len(ωiy) (19) where the extra second correlation term comes from a single walker completing both subwalks. Lastly, suppose that len(ωix) = 0 (i.e. one of the subwalks has zero length). Then we have that E [N(ωix)N(ωiy)] = m2 1 p len(ωiy) . (20) Now we move onto the repelling case, which is substantially more difficult. Repelling walkers: For tractability, we will consider the transient repulsion scheme described in Def. 2.2. Suppose that we have N α walkers at some node indexed α = i, with the set of neighbouring nodes including nodes labelled β and γ (i.e. β, γ N(α)). An important quantity is E(NβNγ|N α). (21) From Def. 2.1, we have that Nβ = Nα//d + ϵ1, Nγ = Nα//d + ϵ2, (22) where // denotes truncating integer division ϵ1,2 are anticorrelated binary random variables. The reason they are anticorrelated is that a walker that transitions to β cannot also transition to γ. With a little work, one can convince oneself that E(NβNγ|N α) = N α d where R := N α%d, the remainder after the the walkers have been partitioned into blocks of size d. From this form, we can see that we will be concerned with the statistics of N α: in particular how E(Nα 2) behaves. Understanding this is our next task. Let Nα be a random variable denoting the number of walkers at node α of some particular walk on the graph. Then let N α denote the number of walkers surviving the p-step , where each walker terminates independently with probability p. Let Nβ denote the number of walkers that subsequently hop to node β on the walk. It is clear that Nβ = N//d+ϵ, with ϵ Published as a conference paper at ICLR 2024 a random variable that takes a value of 1 with probability R d and 0 with probability 1 R d . It is simple to show that E(N 2 β) = E(N α 2) d2 + E R It is also trivially the case that E(N α 2) = E(Nα 2)(1 p)2 + E(Nα)p(1 p) (law of iterated expectations). The second term in Eq. 24 is generically difficult to describe analytically, but it is simple in the special case that N α < d so R = N α. In particular, here we have that E(N 2 α) = E(Nα) = m 1 p len(ωiα) (25) whereupon, referring back to Eq. 23, E(NβNγ|N α) = 0. (26) It is trivial to see why this must be the case: supposing we begin with fewer than d walkers at node i, in the transient repelling scheme they all diverge at the first timestep. Any subsequent node α on some walk is occupied by at most 1 walker which chooses one of its dα neighbours at random, so value of NβNγ (product of occupations of its child nodes) always vanishes. It is encouraging that our algebraic approach reproduces this intuitive result. It follows immediately that, for walks diverging at some node not equal to the starting node i, E [N(ωix)N(ωiy)] = 0. What about if the walkers instead diverge at i? Here the result is different because E(Ni) = m, the initial number of particles, and Var(m) = 0 since the total number of walkers is a fixed hyperparameter. After just a little work, E(NβNγ) = (1 p)2 d(d 1)m(m 1) (27) and therefore E [N(ωix)N(ωiy)] = d d 1m(m 1) 1 p len(ωix)+len(ωiy) . (28) This is also intuitive: the repelling scheme shifts probability mass onto walks that diverge at i, enhancing this correlation term. The subwalk case is also straightforward. Supposing ωix ωiy, we can use Eq. 25 to show that E [N(ωix)N(ωiy)] = m 1 p len(ωiy) (29) because if the walkers immediately diverge then we can only sample both ωiy and ωix ωiy if a single walk traverses both. Lastly, if len(ωix) = 0, we still have that E [N(ωix)N(ωiy)] = m2 1 p len(ωiy) (30) which is natural because if one (or both) of the walks is of zero length then the repulsion scheme cannot modify the correlation term. We summarise these observations in Table 3, denoting c := 1 p d and ωix as shorthand for len(ωix) for compactness. Class E(N(ωix)N(ωiy)) i.i.d. transient repelling Same walk, ωix = ωiy mcωix + m(m 1)c2ωix mcωix Subwalk, ωix ωiy mcωiy + m(m 1)cωix+ωiy mcωiy Different walks, diverge at i m(m 1)cωix+ωiy d d 1m(m 1)cωix+ωiy Different walks, diverge at d = i m(m 1)cωix+ωiy 0 len(ωix) = 0 m2cωiy m2cωiy Published as a conference paper at ICLR 2024 More explicitly, we can write E(N(ωix)N(ωiy)) = m(m 1)cωix+ωiy I(both > 0) +mcωlonger I(subwalks, both > 0) +m2cωlonger I(len 0) if i.i.d. di di 1m(m 1)cωix+ωiy I(both > 0, div at i) +mcωlonger I(subwalks, both > 0) +m2cωlonger I(len 0) if repelling where only the first term differs. Here, I(both > 0) means both walks traverse at least one edge, I(len 0) means one of the walks is of length 0, I(subwalks) means ωix ωiy (or vice versa), and ωlonger denotes the length of the longer walk. We will now insert these expressions into Eq. 16 to compute the difference in variance of b Kij with the two possible coupling schemes. After tedious but straightforward algebra, we arrive at the following closed form: Var h b Kij i Var h b Kij i didj (di 1)(dj 1) (m 1)2 i N (i) i N (i)\i j N (j) j N (j)\j wii wjj wii wjj 1 (1 W)2 + di di 1 m 1 p(ωix) [B(x, j) C(x, j)] + dj dj 1 m 1 p(ωjx) [B(x, i) C(x, i)] di di 1 (B(i, j) C(i, j)) + dj dj 1 (B(j, i) C(j, i)) B(x, i) := X i N(i) w2 ii 1 (1 W)2 i N(i) wii 1 (1 W)2 C(x, i) := X i N(i) w2 ii W (1 W)2 i N(i) wii W (1 W)2 Note that both B and C are always positive by Jensen s inequality. It is remarkable that such a simple expression exists for the difference in kernel estimator variance between the i.i.d. and transient repelling schemes. Showing the class of graphs for which this is guaranteed to be positive is a challenging open problem, but we are able to make progress in some tractable special cases. For instance, consider the limit w 0 with all the graph weights equal, W = w A. In this case, when computing matrix elements we can just retain terms corresponding to the shortest path. For example, 1 (1 w A)2 ij = 1 + 2w A + 3w2A2 + ... ij = M(lij)(lij + 1)wlij + O(wlij+1) (35) Published as a conference paper at ICLR 2024 where lij denotes the length of the shortest path between nodes i and j and M(lij) denotes its multiplicity: the number of such unique paths that exist in G. To give examples, for a tree M(lij) = 1 and for a square grid M(lij) = a+b a , where a is the difference in x-coordinates of nodes i and j and b is the difference in y coordinates. When will (a) be positive? Provided that nodes i and j are separated by at least 2 edges, the following is true: m2 M(lij)2(lij 1)2w2lij + O(w2lij+1), (36) didj (di 1)(dj 1) (m 1)2 i N (i) i N (i)\i j N (j) j N (j)\j = didj (di 1)(dj 1) (m 1)2 i N (i) i N (i)\i j N (j) j N (j)\j M(li j )M(li j )wli j +li j +4(li j + 1)(li j + 1) + O(wli j +li j +5). (37) Note that, since i N(i) and j N(j), it is trivially the case that lij 2 li j lij + 2. In Eq. 37, only terms where li j = li j = lij 2 will give contributions of the same order as the leading term O(w2lij) in Eq. 36. The condition that Eq. 36 is greater at the leading order is then: M(lij)2 didj (di 1)(dj 1) i N (i),i N (i)\i ,j N (j),j N (j)\j li j =li j =lij 2 M(li j )M(li j ) (38) where we remind the reader that M(lij) denotes the degeneracy (number) of shortest paths (of length lij) between nodes i and j. This encodes the topological constraint that is sufficient for variance reduction with repelling random walkers on an equal-weights graph as w 0. Heuristically, the set of nodes N(i) cannot be too connected to the nodes N(j). A cursory numerical check suggests that Eq. 38 is not generically satisfied for every pair of nodes (i, j) on arbitrary graphs, but it does seem to very often be true. We can, however, identify some particular examples where it is guaranteed to hold. For example, it is trivially true for trees for which M(lij) = 1 but the RHS is 0. To wit: for trees there is a unique shortest path between nodes i and j and it is only the case that li j = lij 2 if both i and j lie on this path. Then conditioning that i = i and j = j means that we cannot fulfil li j = lij 2, whereupon the sum is over the empty set so evaluates to 0. It is also true for the two dimensional square grid. Without loss of generality, locate node i at coordinates (0, 0) and node j at (a, b). If a = 0 or b = 0 there is a unique shortest path so the inequality follows trivially. If a = 1 then M(lij) = b + 1 whereas the RHS evaluates to 2 (d/(d 1))2 which is smaller for b 1 and d = 4. Finally, if a, b > 1, then M(lij) = a+b a (a walker on a shortest path between nodes i and j must take a steps in one direction and b steps in the other, but we are free to permute their order). Meanwhile, the sum on the RHS of Eq. 38 evaluates to: d d 1 a + b 2 a 2 a + b 2 b 2 + a + b 2 a 1 whereupon the ratio RHS/LHS is: d d 1 2 2 a(a 1)b(b 1) + a2b2 [(a + b)(a + b 1)]2 2 2ab (a + b)2 1 + a(a 1) + b(b 1) (a 1)(b 1) + ab Published as a conference paper at ICLR 2024 The expression in square brackets is greater than 1 so its inverse is smaller than 1. Hence, it is sufficient that (a + b)2 2ab = (a b)2 2ab + 2 d d 1 which is trivially always true for d = 4. The inequality in Eq. 38 then holds in all cases where nodes i and j are separated by at least 2 edges, so terms (a) will indeed be positive on a two-dimensional square grid. Having asserted that the terms labelled (a) in Eq. 32 will sum to a positive value under certain conditions (namely: graphs characterised by Eq. 38 with equal edge weights w 0, considering nodes i and j separated by at least 2 edges), we now proceed to consider the terms labelled (b). When will (b) be positive? Now we consider the remaining terms involving e.g. B(x, i) C(x, i) where x, i N. These demand a little more care because the sum over x N means that we need to account for terms where x = i even when considering off-diagonal terms of the kernel estimate i = j. Note that B(x, i) = diw2Vari N(i) (1 w A) 2 the empirical variance of the matrix elements Kxi among the set of vertices i that neighbour i. Denote by lxi the smallest walk length for which the variance of the number of walks from x to the set of nodes N(i) is nonzero, lxi = min l N {l | Vari N(i)(Al xi ) = 0} . (43) This might well correspond to the shortest path between x and i N(i), but this is not necessarily the case (i.e. if all the neighbours i have an equal number of equally short paths to x so the variance on this quantity vanishes). Then we have that B(x, i) = dw2 i Vari N (i) (1 w A) 2 xi = d( lxi + 1)2w2 lxi Vari N (i) h A lxi xi i + O(w2 lxi+1) (44) C(x, i) = diw2Vari N (i) w A(1 w A) 2 xi = d( lxi)2w2 lxi Vari N (i) h A lxi xi i + O(w2 lxi+1) (45) where Vari N(i) h A lxi xi i denotes this first nonvanishing variance. For trees with x = i, lxi = lxi 1 and Vari N(i) h A lxi xi i = d 1 To leading order in w, it is trivial to see that B(x, i) > C(x, i). Inserting back into Eq. 32 and noting the positivity of the prefactor ω(ωix)2 p(ωix) , the positivity of (2) follows. Note that in this section we have not assumed anything about the structure of G beyond equal weights and w 0. The topological constraints originate solely from the terms in (a). Combining the above arguments, we conclude that the using the transient repelling scheme is indeed guaranteed to suppress the variance of kernel estimators b Kij for nodes i and j separated by at least 2 edges under certain conditions: namely, that we have an equallyweighted graph with w 0, and that the topological condition in Eq. 38 is true (which is the case for e.g. trees and two-dimensional grids). We stress that, in practice, the scheme performs very well even for much more general classes of graphs and in the non-asymptotic w limit. We defer extending the proof above to include these cases as future work. A.2 Proof of Theorem 4.2 (superiority of repelling random walks for Page Rank estimation In this appendix, we show how random walks can be used to estimate to the Page Rank vector and prove that using a transient repelling ensemble reduces the estimator mean square error (Theorem 4.2). Published as a conference paper at ICLR 2024 Our intention is to estimate the vector π, defined as the stationary distribution of the transition matrix e P defined in Eq. 6 and reproduced below: e P := (1 p)P + p The reader should refer back to Eq. 6 for all symbol definitions. We then require that π e P = π , π 1 = 1. (47) Rearranging and Taylor expanding (1 (1 p)P) 1, it is straightforward to convince oneself that the solution is given by k=0 (1 p)k Pk ji. (48) This is nothing other than a sum over all walks ωji from each of the graph nodes j to node i, each weighted by a factor of 1 p d k p (with dk generalising to the product of node degrees along ωji if the graph is not d-regular) and normalised by the number of vertices N. Equivalently, supposing we simulate a random walk out of a random node on the graph j, it is the probability that it terminates at node i. This invites the algorithm proposed by Fogaras et al. (2005) and shown in Alg. 4.1. We construct the unbiased estimator l=1 I[Ω(j) l terminates at node i] (49) where Ω(j) l denotes the lth walk (out of a total of m N) simulated from node j. Our task is now to consider the variance properties of the estimator bπ when ensembles of walkers out of each node are either i) independent or ii) repelling according to our QMC scheme defined in Def. 2.1. Evidently, E(bπ2 i ) = 1 N 2m2 X l1,l2=1 E n I[Ω(j1) l1 terminates at node i ] I[Ω(j2) l2 terminates at node i ] o . (50) By construction, only walkers out of the same node are correlated: we simulate repelling ensembles out of every vertex but a pair of walkers coming from different vertices remain independent throughout. Therefore, it is sufficient to consider the behaviour of terms j1 = j2. In particular, we we need to determine whether the value of E n I[Ω(j) l1 terminates at node i ] I[Ω(j) l2 terminates at node i ] o (51) is suppressed with repelling random walks for fixed arbitrary j N. We will refer to this as the correlation term. First consider the special case j = i so all walks are of length at least 1. We also consider transient repulsion (see Def. 2.2) so that walkers only repel at the first timestep; the fullrepelling scheme is empirically effective but difficult to reason about analytically. For i.i.d. walkers, the correlation term in Eq. 51 evaluates to p2 (1 p)2 P 1 (1 p)P ji = p2 1 p At the first timestep, a pair of repelling walkers are either assigned to the same block (see Fig. 1) or different blocks with a probability that depends on the number of walkers m and the degree of the first node dj. The correlation term in Eq. 51 depends on its evaluations conditioned on each of these two events, weighted by their respective probabilities. If the pair are assigned to different blocks their dynamics are i.i.d. so the correlation term is unmodified. Meanwhile, if they are assigned to the same block (which happens almost Published as a conference paper at ICLR 2024 surely if m dj), then the equivalent term evaluates to This differs from Eq. 52 in that i) the variable j in the sum over the neighbours of j can no longer be equal to j since the walks repel, and ii) there is an extra factor of dj dj 1 to account for the increase the conditional probability of choosing j given that j becomes excluded when the first walker picks it ( without replacement ). Denote the matrix element f(j , i) := h 1 1 (1 p)P i j i. Then the difference between the terms in Eqs 52 and 53 is equal to j j f(j , i)f(j , i) 1 dj dj 1I(do not share first edge) . (54) This can be rewritten j j f(j , i)f(j , i) I(share first edge) 1 = p2 (1 p)2 1 dj 1 j j f(j , i)2 j j f(j , i) where we used Jensen s inequality. To complete the proof, we also consider the subcase i = j. For i.i.d. walkers, the correlation term evaluates to p2 1 1 (1 p)P ii = p2 1 + (1 p)P 1 (1 p)P 1 + 2(1 p) P 1 (1 p)P ii + (1 p)2 P 1 (1 p)P For repelling walkers in the same block, we need to consider contributions from i) both walkers terminating immediately, ii) one terminating and one leaving then returning to i, and iii) both walkers leaving (to different neighbours (i = i ) then returning to i. Enumerating these possibilities, we get: p2 1 + 2(1 p) P 1 (1 p)P Only the final term differs compared to Eq. 56. It is of precisely the same form as considered above with i = j, so we immediately deduce that it is smaller with repelling walkers in the same block. It follows that when i = j repelling random walkers also yield lower variance estimators bπj. We have now considered the cases where the pair of walkers are in either different blocks or the same block, including the sub-cases i = j and i = j, and proved that the variance of the estimator bπ is the same or reduced in both cases. The proof is complete. A.3 Proof of Theorem 4.4 (variance of step-by-step linear functions is reduced by transient repulsion) In this section, we supplement the discussion at the end of 4, identifying a general class of functions whose variance is suppressed by conditioning that random walks exhibit transient repulsion. Published as a conference paper at ICLR 2024 Following Def. 4.3, let Ωr denote the set of all infinite-length walks starting at node r, Ωr := {(vi) i=0 | v0 = r, vi N, (vi, vi+1) E}. Recall that we refer to a function y : Ωr R as step-by-step linear if it takes the form: i=0 f(vi, i) j=1 g(vj 1, vj, j, j 1). (58) An example of such a function provided by ϕ(i)k, the k-th component of the graph random feature corresponding to node i, for which f(vi, i) = I(vi = k) and g(vj 1, vj, j, j 1) = wvj 1,vj dvj 1 1 p I(tj > p). Here, wvj 1,vj is the weight of the edge (vj 1, vj) E, dvj 1 is the degree of the node vj 1 and tj Unif(0, 1) is a termination random variable which controls whether the walk ends at the jth timestep. Another example is provided by the kth component of the Page Rank vector estimator bπk, in which case f(vi, i) = I(vi = k)I(tj > p) and g(vj 1, vj, j, j 1) = I(tj 1 < p), ensuring that y(w) evaluates to 1 if the walker terminates at node k and is zero otherwise. In Theorem 4.4, we asserted that, for the estimator Y := Pm i=1 y(ωi) (where {ωi}m i=1 enumerates m (infinite) walks on G and y : Ωr R is a step-by-step linear function), Var(Yrepelling) Var(Yi.i.d.). (59) This is proved as follows. Begin by writing y(ω(k)) = f(v0, 0) + i=1 f(v(k) i , i) j=1 g(v(k) j 1, v(k) j , j, j 1) = f(v0, 0) + h (v(k) i ) i=1 (60) with k = {1, 2}. Note that we took v(1) 0 = v(2) 0 = v0 since the walkers begin at the same node, and introduced the function h to simplify notation. As in Sec. A.2, a pair of walks may be either assigned to the same block or different blocks (see Fig. 1). We focus on the former case since in the latter they are i.i.d. and the variance of the estimator is trivially unchanged. This means that our walkers diverge at the first timestep, v(1) 1 = v(2) 1 . It follows from the definition of transient repulsion that the random variables h((v(k) i ) i=1) are conditionally independent given (v(1) 1 , v(2) 1 ) since at this point the walkers stop repelling. Moreover, since in both cases the marginal distribution over ω is identical, for Var(Y ) to be reduced by repulsion we just require that Ei.i.d. h h (v(1) i ) i=1 h (v(2) i ) i=1 i ? Erep h h (v(1) i ) i=1 h (v(2) i ) i=1 i . (61) In the i.i.d. scheme, v(1) 1 and v(2) 1 are independent and are uniformly distributed among the set of neighbours N(v0), each with probability 1/d0. Meanwhile, in the repelling scheme, (v(1) 1 , v(2) 1 ) is uniformly distributed among the set {(vi, vj)|vi, vj N(v0), vi = vj} i.e. all d0(d0 1) possible pairs of distinct neighbours of node v0. Then Eq. 61 evaluates to 1 d2 0 v(1) 1 N (v0) v(2) 1 N (v0) E(h|v(1) 1 )E(h|v(2) 1 ) 1 d0(d0 1) v(1) 1 N (v0) v(2) 1 N (v0)\v(1) 1 E(h|v(1) 1 )E(h|v(2) 1 ) ? 0 (62) where we used that h(1,2) are conditionally independent given v(1,2) 1 . Rearranging, this expression is nothing other than 1 d0 1Varv1 N(v0)(E(h|v1)) ? 0 (63) where the variance is being computed over the d0 nodes that neighbour v0. Eq. 63 trivially holds, confirming that Var(Yrepelling) Var(Yi.i.d.). Note that Theorem 4.4 does not obviate the long proof in Sec. A.1: suppressing the variance of the GRF coordinate ϕ(i)k, i, k N is not sufficient to conclude that the variance of the dot product b Kij = ϕ(i) ϕ(j) is also reduced since now we need to consider correlations between ϕ(i)k1 and ϕ(i)k2 with k1 = k2. On the other hand, it does subsume Theorem 4.2 as a special case, though we keep this section in the manuscript for clarity of presentation.