# Fast Bidirectional Probability Estimation in Markov Models

Siddhartha Banerjee (sbanerjee@cornell.edu) and Peter Lofgren (plofgren@cs.stanford.edu)

Siddhartha Banerjee is an assistant professor at the School of Operations Research and Information Engineering at Cornell (http://people.orie.cornell.edu/sbanerjee); Peter Lofgren is a graduate student in the Computer Science Department at Stanford (http://cs.stanford.edu/people/plofgren/).

We develop a new bidirectional algorithm for estimating Markov chain multi-step transition probabilities: given a Markov chain, we want to estimate the probability of hitting a given target state in $\ell$ steps after starting from a given source distribution. Given the target state $t$, we use a (reverse) local power iteration to construct an *expanded target distribution*, which has the same mean as the quantity we want to estimate but a smaller variance; this can then be sampled efficiently by a Monte Carlo algorithm. Our method extends to any Markov chain on a discrete (finite or countable) state-space, and can be extended to compute functions of multi-step transition probabilities such as PageRank, graph diffusions, hitting/return times, etc. Our main result is that in *sparse* Markov chains, wherein the number of transitions between states is comparable to the number of states, the running time of our algorithm for a uniform-random target node is order-wise smaller than that of Monte Carlo and power iteration based algorithms; in particular, our method can estimate a probability $p$ using only $O(1/\sqrt{p})$ running time.

## 1 Introduction

Markov chains are one of the workhorses of stochastic modeling, finding use across a variety of applications: MCMC algorithms for simulation and statistical inference; network centrality metrics for data mining; statistical physics; operations management models for reliability, inventory and supply chains; etc. In this paper, we consider a fundamental problem associated with Markov chains, which we refer to as the *multi-step transition probability estimation* (or MSTP-estimation) problem: given a Markov chain on state space $S$ with transition matrix $P$, an initial source distribution $\sigma$ over $S$, a target state $t \in S$ and a fixed length $\ell$, we are interested in computing the $\ell$-step transition probability from $\sigma$ to $t$. Formally, we want to estimate:

$$p^{\ell}_{\sigma}[t] := \langle \sigma P^{\ell}, e_t \rangle = \sigma P^{\ell} e_t^T, \qquad (1)$$

where $e_t$ is the indicator vector of state $t$.

A natural parametrization for the complexity of MSTP-estimation is in terms of the minimum transition probability we want to detect: given a desired minimum detection threshold $\delta$, we want algorithms that give estimates with small relative error for any $(\sigma, t, \ell)$ such that $p^{\ell}_{\sigma}[t] > \delta$. Parametrizing in terms of the minimum detection threshold $\delta$ can be thought of as benchmarking against a standard Monte Carlo algorithm, which estimates $p^{\ell}_{\sigma}[t]$ by sampling independent $\ell$-step paths starting from states sampled from $\sigma$. An alternate technique for MSTP-estimation is based on linear-algebraic iterations, in particular the (local) power iteration. We discuss both in more detail in Section 1.2. Crucially, however, both these techniques have a running time of $\Omega(1/\delta)$ for testing whether $p^{\ell}_{\sigma}[t] > \delta$ (cf. Section 1.2).
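For concreteness, here is a minimal Scala sketch of this Monte Carlo baseline (Scala being the language of our implementation, cf. Section 4); the sparse `Chain` representation and all identifiers are illustrative assumptions, not code from our experiments.

```scala
import scala.util.Random

object MonteCarloMSTP {
  // Sparse row representation: each state maps to its outgoing
  // (nextState, probability) pairs, assumed nonempty and summing to 1.
  type Chain = Map[Int, Seq[(Int, Double)]]

  // Sample one transition out of state v.
  def step(chain: Chain, v: Int, rng: Random): Int = {
    val nbrs = chain(v)
    val u = rng.nextDouble()
    var acc = 0.0
    var next = nbrs.last._1 // fallback guards against floating-point round-off
    var done = false
    for ((w, p) <- nbrs if !done) {
      acc += p
      if (u <= acc) { next = w; done = true }
    }
    next
  }

  // Estimate p^ell_sigma[t] as the fraction of ell-step walks ending at t,
  // with walks started from states drawn by sampleSigma.
  def estimate(chain: Chain, sampleSigma: Random => Int, t: Int, ell: Int,
               nWalks: Int, seed: Long = 0L): Double = {
    val rng = new Random(seed)
    var hits = 0
    for (_ <- 1 to nWalks) {
      var v = sampleSigma(rng)
      for (_ <- 1 to ell) v = step(chain, v, rng)
      if (v == t) hits += 1
    }
    hits.toDouble / nWalks
  }
}
```

Detecting probabilities of size $\delta$ with small relative error requires on the order of $1/\delta$ such walks, which is exactly the $\Omega(1/\delta)$ barrier discussed above.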
### 1.1 Our Results

To the best of our knowledge, our work gives the first bidirectional algorithm for MSTP-estimation which works for general discrete state-space Markov chains (bidirectional estimators have been developed before for reversible Markov chains [1]; our method is not only more general, but also conceptually and operationally simpler than these techniques, cf. Section 1.2). The algorithm we develop is very simple, both in terms of implementation and analysis. Moreover, we prove that in many settings it is order-wise faster than existing techniques. Our algorithm consists of two distinct forward and reverse components, which are executed sequentially. In brief, the two components proceed as follows:

- **Reverse work:** Starting from the target node $t$, we perform a sequence of reverse local power iterations; in particular, we use the REVERSE-PUSH operation defined in Algorithm 1.
- **Forward work:** We next sample a number of random walks of length $\ell$, starting from $\sigma$ and transitioning according to $P$, and return the sum of residues along each walk as an estimate of $p^{\ell}_{\sigma}[t]$.

The full algorithm, which we refer to as the Bidirectional-MSTP estimator, is formalized in Algorithm 2. It works for all countable-state Markov chains, giving the following accuracy result:

**Theorem 1** (for details, refer to Section 2.3). *Given any Markov chain $P$, source distribution $\sigma$, terminal state $t$, length $\ell$, threshold $\delta$ and relative error $\epsilon$, Bidirectional-MSTP (Algorithm 2) returns an unbiased estimate $\hat{p}^{\ell}_{\sigma}[t]$ for $p^{\ell}_{\sigma}[t]$ which, with high probability, satisfies:*
$$\left| \hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t] \right| < \max\left\{ \epsilon\, p^{\ell}_{\sigma}[t],\, \delta \right\}.$$

Since we dynamically adjust the number of REVERSE-PUSH operations to ensure that all residues are small, the proof of the above theorem follows from straightforward concentration bounds.

Since Bidirectional-MSTP combines local power iteration and Monte Carlo techniques, a natural question is when the algorithm is faster than both. It is easy to construct scenarios where the runtime of Bidirectional-MSTP is comparable to that of its two constituent algorithms; for example, if $t$ has more than $1/\delta$ in-neighbors. Surprisingly, however, we show that in sparse Markov chains and for typical target states, Bidirectional-MSTP is order-wise faster:

**Theorem 2** (for details, refer to Section 2.3). *Given any Markov chain $P$, source distribution $\sigma$, length $\ell$, threshold $\delta$ and desired accuracy $\epsilon$, for a uniform-random choice of $t \in S$, the Bidirectional-MSTP algorithm has a running time of $\widetilde{O}\left(\ell^{3/2}\sqrt{\bar{d}/\delta}\right)$, where $\bar{d}$ is the average number of neighbors of nodes in $S$.*

Thus, for typical targets, we can estimate transition probabilities of order $\delta$ in time only $O(1/\sqrt{\delta})$. Note that we do not need the number of neighboring states to be small for every state, but only on average; for example, this holds in power-law networks, where some nodes have very high degree but the average degree is small. The proof of this result is based on a modification of an argument in [2]; refer to Section 2.3 for details.

Estimating transition probabilities to a target state is one of the fundamental primitives in Markov chain models; hence, we believe that our algorithm can prove useful in a variety of application domains. In Section 3, we briefly describe how to adapt our method for some of these applications: estimating hitting/return times and stationary probabilities, extensions to non-homogeneous Markov chains (in particular, for estimating graph diffusions and heat kernels), and connections to local algorithms and expansion testing. In addition, our MSTP-estimator could be useful in several other applications: estimating ruin probabilities in reliability models, buffer overflows in queueing systems, statistical physics simulations, etc.
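To get a sense of the scale of the speedup promised by Theorem 2, consider a back-of-the-envelope calculation (our illustration, ignoring constants, $\epsilon$ and logarithmic factors) for a sparse chain with $\bar{d} = 10$, walk length $\ell = 10$ and threshold $\delta = 10^{-8}$:

$$\underbrace{\frac{1}{\delta}}_{\text{Monte Carlo / power iteration}} = 10^{8} \qquad \text{vs.} \qquad \underbrace{\ell^{3/2}\sqrt{\bar{d}/\delta}}_{\text{Bidirectional-MSTP}} = 10^{1.5}\cdot\sqrt{10^{9}} \approx 31.6 \times \left(3.16\times 10^{4}\right) \approx 10^{6},$$

a roughly hundredfold reduction, consistent with the empirical speedups we report in Section 4.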
### 1.2 Existing Approaches for MSTP-Estimation

There are two main techniques used for MSTP-estimation. The first is a natural Monte Carlo algorithm: we estimate $p^{\ell}_{\sigma}[t]$ by sampling independent $\ell$-step paths, each starting from a random state sampled from $\sigma$. A simple concentration argument shows that for a given value of $\delta$, we need $\widetilde{\Theta}(1/\delta)$ samples to get an accurate estimate of $p^{\ell}_{\sigma}[t]$, irrespective of the choice of $t$ and the structure of $P$. Note that this algorithm is agnostic of the terminal state $t$: it gives an accurate estimate for any $t$ such that $p^{\ell}_{\sigma}[t] > \delta$.

On the other hand, the problem also admits a natural linear-algebraic solution, using the standard power iteration starting with $\sigma$, or the reverse power iteration starting with $e_t$ (obtained by rewriting Equation (1) as $p^{\ell}_{\sigma}[t] := \sigma \left( e_t (P^T)^{\ell} \right)^T$). When the state space is large, performing a direct power iteration is infeasible; however, there are localized versions of the power iteration that are still efficient. Such algorithms have been developed, among other applications, for PageRank estimation [3, 4] and for heat kernel estimation [5]. Although slow in the worst case (in particular, local power iterations are slow if a state has a very large out-neighborhood, for the forward iteration, or in-neighborhood, for the reverse update), such local update algorithms are often fast in practice, as unlike Monte Carlo methods they exploit the local structure of the chain. However, even in sparse Markov chains and for a large fraction of target states, their running time can be $\Omega(1/\delta)$. For example, consider a random walk on a random $d$-regular graph and let $\delta = o(1/n)$; then for $\ell \geq \log_d(1/\delta)$, verifying $p^{\ell}_{e_s}[t] > \delta$ is equivalent to uncovering the entire $\log_d(1/\delta)$-step neighborhood of $s$. Since a large random $d$-regular graph is (whp) an expander, this neighborhood has $\Omega(1/\delta)$ distinct nodes. Finally, note that as with Monte Carlo, power iterations can be adapted to either the source or the terminal state, but not both.

For reversible Markov chains, one can get a bidirectional algorithm for estimating $p^{\ell}_{e_s}[t]$ based on colliding random walks. For example, consider the problem of estimating length-$2\ell$ random walk transition probabilities in a regular undirected graph $G(V, E)$ on $n$ vertices [1, 6]. The main idea is that to test whether a random walk goes from $s$ to $t$ in $2\ell$ steps with probability $\delta$, we can generate two independent random walks of length $\ell$, starting from $s$ and $t$ respectively, and detect whether they terminate at the same intermediate node. Suppose $p_w, q_w$ are the probabilities that a length-$\ell$ walk from $s$ and from $t$, respectively, terminates at node $w$; then by the reversibility of the chain we have $p^{2\ell}_{e_s}[t] = \sum_{w \in V} p_w q_w$, which is also the collision probability. The critical observation is that if we generate $\sqrt{1/\delta}$ walks from each of $s$ and $t$, then we get $1/\delta$ potential collision pairs, which is sufficient to detect whether $p^{2\ell}_{e_s}[t] > \delta$. This argument forms the basis of the birthday paradox, and similar techniques are used in a variety of estimation problems (e.g., see [7]). Showing concentration for this estimator is tricky, as the samples are not independent; moreover, to control the variance of the samples, the algorithms often need to separately deal with 'heavy' intermediate nodes, where $p_w$ or $q_w$ is much larger than $O(1/n)$.
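For intuition, here is a minimal sketch of this collision estimator for regular undirected graphs; it is our own illustration of the idea in [1, 6] rather than code from those works, and it omits the separate handling of heavy intermediate nodes needed to control variance.

```scala
import scala.util.Random

object CollisionEstimator {
  // Regular undirected graph as adjacency lists; a walk picks a uniform neighbor.
  def walkEnd(adj: Array[Array[Int]], start: Int, ell: Int, rng: Random): Int = {
    var v = start
    for (_ <- 1 to ell) v = adj(v)(rng.nextInt(adj(v).length))
    v
  }

  // Estimate p^{2 ell}_{e_s}[t]: by reversibility it equals sum_w p_w * q_w,
  // the probability that independent length-ell walks from s and t collide.
  def estimate(adj: Array[Array[Int]], s: Int, t: Int, ell: Int,
               nWalks: Int, seed: Long = 0L): Double = {
    val rng = new Random(seed)
    val fromS = Array.fill(nWalks)(walkEnd(adj, s, ell, rng))
    val endCounts = fromS.groupBy(identity).map { case (w, ends) => (w, ends.length) }
    var collisions = 0L
    for (_ <- 1 to nWalks) {
      val w = walkEnd(adj, t, ell, rng)
      collisions += endCounts.getOrElse(w, 0)
    }
    collisions.toDouble / (nWalks.toLong * nWalks) // average over nWalks^2 pairs
  }
}
```

With `nWalks` of order $\sqrt{1/\delta}$ per side, the $\textit{nWalks}^2$ walk pairs provide enough potential collisions to detect $p^{2\ell}_{e_s}[t] > \delta$.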
Our proposed approach is much simpler, both in terms of algorithm and analysis, and, more significantly, it extends beyond reversible chains to any general discrete state-space Markov chain. The most similar approach to ours is the recent FAST-PPR algorithm of Lofgren et al. [2] for PageRank estimation; our algorithm borrows several ideas and techniques from that work. However, the FAST-PPR algorithm relies heavily on the structure of PageRank, in particular the fact that the PageRank walk has Geometric($\alpha$) length (and hence can be stopped and restarted due to the memoryless property). Our work provides an elegant and powerful generalization of the FAST-PPR algorithm, extending the approach to general Markov chains.

## 2 The Bidirectional MSTP-estimation Algorithm

### 2.1 Algorithm

As described in Section 1.1, given a target state $t$, our bidirectional MSTP algorithm keeps track of a pair of vectors for each length $k \in \{0, 1, 2, \ldots, \ell\}$: the estimate vector $q^k_t \in \mathbb{R}^n$ and the residual vector $r^k_t \in \mathbb{R}^n$. The vectors are initially all set to $\mathbf{0}$ (i.e., the all-0 vector), except $r^0_t$, which is initialized as $e_t$. They are updated using a reverse push operation, defined as follows:

**Algorithm 1** REVERSE-PUSH($v, i$)
**Inputs:** transition matrix $P$, estimate vector $q^i_t$, residual vectors $r^i_t, r^{i+1}_t$
1: **return** new estimate vector $\widetilde{q}^i_t$ and residual vectors $\widetilde{r}^i_t, \widetilde{r}^{i+1}_t$, computed as:
$$\widetilde{q}^i_t \leftarrow q^i_t + \langle r^i_t, e_v \rangle e_v; \qquad \widetilde{r}^i_t \leftarrow r^i_t - \langle r^i_t, e_v \rangle e_v; \qquad \widetilde{r}^{i+1}_t \leftarrow r^{i+1}_t + \langle r^i_t, e_v \rangle \left( e_v P^T \right)$$

The main observation behind our algorithm is that we can rewrite $p^{\ell}_{\sigma}[t]$ in terms of $\{q^k_t, r^k_t\}$ as an expectation over random sample-paths of the Markov chain, as follows (cf. Equation (3)):
$$p^{\ell}_{\sigma}[t] = \langle \sigma, q^{\ell}_t \rangle + \sum_{k=0}^{\ell} \mathbb{E}_{V_k \sim \sigma P^k}\left[ r^{\ell-k}_t(V_k) \right] \qquad (2)$$

In other words, given vectors $\{q^k_t, r^k_t\}$, we can get an unbiased estimator for $p^{\ell}_{\sigma}[t]$ by sampling a length-$\ell$ random trajectory $\{V_0, V_1, \ldots, V_{\ell}\}$ of the Markov chain $P$ starting at a random state $V_0$ sampled from the source distribution $\sigma$, and then adding the residuals along the trajectory as in Equation (2). We formalize this bidirectional MSTP algorithm in Algorithm 2.

**Algorithm 2** Bidirectional-MSTP($P, \sigma, t, \ell_{max}, \delta$)
**Inputs:** transition matrix $P$, source distribution $\sigma$, target state $t$, maximum steps $\ell_{max}$, minimum probability threshold $\delta$, relative error bound $\epsilon$, failure probability $p_f$
1: Set accuracy parameter $c$ based on $\epsilon$ and $p_f$, and set reverse threshold $\delta_r$ (cf. Theorems 1 and 2); in our experiments we use $c = 7$ and $\delta_r = \sqrt{\delta/c}$
2: Initialize: estimate vectors $q^k_t = \mathbf{0}\ \forall k \in \{0, 1, 2, \ldots, \ell_{max}\}$, residual vectors $r^0_t = e_t$ and $r^k_t = \mathbf{0}\ \forall k \in \{1, 2, 3, \ldots, \ell_{max}\}$
3: **for** $i \in \{0, 1, \ldots, \ell_{max}\}$ **do**
4: &nbsp;&nbsp;**while** $\exists\, v \in S$ s.t. $r^i_t[v] > \delta_r$ **do**
5: &nbsp;&nbsp;&nbsp;&nbsp;execute REVERSE-PUSH($v, i$)
6: &nbsp;&nbsp;**end while**
7: **end for**
8: Set number of sample paths $n_f = c\,\ell_{max}\delta_r/\delta$ (see Theorem 1 for details)
9: **for** index $i \in \{1, 2, \ldots, n_f\}$ **do**
10: &nbsp;&nbsp;Sample starting node $V^0_i \sim \sigma$
11: &nbsp;&nbsp;Generate sample path $T_i = \{V^0_i, V^1_i, \ldots, V^{\ell_{max}}_i\}$ of length $\ell_{max}$ starting from $V^0_i$
12: &nbsp;&nbsp;For $\ell \in \{1, 2, \ldots, \ell_{max}\}$: sample $k \sim \text{Uniform}(\{0, 1, \ldots, \ell-1\})$ and compute $S^{\ell}_{t,i} = \ell\, r^{\ell-k}_t[V^k_i]$ (we reinterpret the sum over $k$ in Equation (2) as an expectation and sample a single $k$, rather than summing over all $k \leq \ell$, for computational speed; the $k = \ell$ term vanishes since $r^0_t = \mathbf{0}$ once REVERSE-PUSH($t, 0$) has executed)
13: **end for**
14: **return** $\{\hat{p}^{\ell}_{\sigma}[t]\}_{\ell \in [\ell_{max}]}$, where $\hat{p}^{\ell}_{\sigma}[t] = \langle \sigma, q^{\ell}_t \rangle + (1/n_f)\sum_{i=1}^{n_f} S^{\ell}_{t,i}$
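For concreteness, here is a compact Scala sketch combining Algorithms 1 and 2; the sparse-map representation, the `Chain` interface and all identifiers are our illustrative assumptions, not the exact code used in our experiments (cf. Section 4).

```scala
import scala.collection.mutable
import scala.util.Random

object BidirectionalMSTP {
  // The chain must expose transitions in both directions: out-neighbors for the
  // forward walks and in-neighbors (with probabilities P(u, v)) for REVERSE-PUSH.
  // Every state is assumed to have at least one outgoing transition.
  case class Chain(out: Int => Seq[(Int, Double)], in: Int => Seq[(Int, Double)])

  private def sample(pairs: Seq[(Int, Double)], rng: Random): Int = {
    val u = rng.nextDouble()
    var acc = 0.0
    var chosen = pairs.last._1 // fallback guards against floating-point round-off
    var done = false
    for ((w, p) <- pairs if !done) { acc += p; if (u <= acc) { chosen = w; done = true } }
    chosen
  }

  /** Returns estimates of p^ell_sigma[t] for ell = 0..ellMax (sigma given sparsely). */
  def estimate(chain: Chain, sigma: Seq[(Int, Double)], t: Int, ellMax: Int,
               delta: Double, c: Double = 7.0, seed: Long = 0L): Array[Double] = {
    val rng = new Random(seed)
    val deltaR = math.sqrt(delta / c) // reverse threshold, as in our experiments

    // Reverse phase: q(i) and r(i) are sparse estimate / residual vectors.
    val q = Array.fill(ellMax + 1)(mutable.Map.empty[Int, Double].withDefaultValue(0.0))
    val r = Array.fill(ellMax + 1)(mutable.Map.empty[Int, Double].withDefaultValue(0.0))
    r(0)(t) = 1.0
    for (i <- 0 to ellMax) {
      // A push at level i only creates residual at level i + 1, so one sweep per
      // level implements the while-loop of Algorithm 2.
      for (v <- r(i).keys.toSeq if r(i)(v) > deltaR) {
        val mass = r(i)(v) // REVERSE-PUSH(v, i)
        q(i)(v) += mass
        r(i)(v) = 0.0
        if (i < ellMax)
          for ((u, pUV) <- chain.in(v)) r(i + 1)(u) += mass * pUV
      }
    }

    // Forward phase: nf walks of length ellMax, scored against the residuals.
    val nf = math.max(1, math.ceil(c * ellMax * deltaR / delta).toInt)
    val score = Array.fill(ellMax + 1)(0.0)
    for (_ <- 1 to nf) {
      val path = new Array[Int](ellMax + 1)
      path(0) = sample(sigma, rng)
      for (k <- 1 to ellMax) path(k) = sample(chain.out(path(k - 1)), rng)
      for (ell <- 1 to ellMax) {
        val k = rng.nextInt(ell) // uniform over {0,...,ell-1}, cf. line 12 of Algorithm 2
        score(ell) += ell * r(ell - k)(path(k))
      }
    }

    Array.tabulate(ellMax + 1) { ell =>
      val qTerm = sigma.map { case (v, p) => p * q(ell)(v) }.sum // <sigma, q^ell>
      qTerm + score(ell) / nf
    }
  }
}
```

One design note: because residuals only flow from level $i$ to level $i+1$, a single sweep over the levels suffices and the per-level maps stay sparse; this is what makes the reverse phase local.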
### 2.2 Some Intuition Behind our Approach

Before formally analyzing the performance of our MSTP-estimation algorithm, we first build some intuition as to why it works. In particular, it is useful to interpret the estimates and residues in probabilistic/combinatorial terms. In Figure 1, we consider a simple Markov chain on three states: Solid, Hollow and Checkered (henceforth $(S, H, C)$). On the right side, we illustrate an intermediate stage of reverse work using $S$ as the target, after performing the REVERSE-PUSH operations $(S, 0), (H, 1), (C, 1)$ and $(S, 2)$, in that order. Each push at level $i$ uncovers a collection of length-$(i+1)$ paths terminating at $S$; for example, in the figure, we have uncovered all paths of lengths 2 and 3, and several paths of length 4. The crucial observation is that each uncovered path of length $i$ starting from a node $v$ is accounted for in either $q^i_t[v]$ or $r^i_t[v]$. In particular, in Figure 1, all paths starting at solid nodes are stored in the estimates of the corresponding states, while those starting at blurred nodes are stored in the residues. We can now use this set of pre-discovered paths to boost the estimate returned by Monte Carlo trajectories generated starting from the source distribution. The dotted line in the figure represents the current reverse-work frontier: it separates the fully uncovered neighborhood of $(S, 0)$ from the remaining states $(v, i)$.

*Figure 1: Visualizing a sequence of REVERSE-PUSH operations. Given the Markov chain on the left with S as the target, we perform REVERSE-PUSH operations (S, 0), (H, 1), (C, 1), (S, 2).*

In a sense, what the REVERSE-PUSH operation does is construct a sequence of importance-sampling weights, which can then be used for Monte Carlo. An important novelty here is that the importance-sampling weights are: (i) adapted to the target state, and (ii) dynamically adjusted to ensure that the Monte Carlo estimates have low variance. Viewed in this light, it is easy to see how the algorithm can be modified for applications beyond basic MSTP-estimation: for example, to non-homogeneous Markov chains, or for estimating the probability of hitting a target state $t$ for the first time in $\ell$ steps (cf. Section 3). Essentially, we only need an appropriate reverse-push/dynamic-programming update for the quantity of interest (with an associated invariant, as in Equation (2)).

### 2.3 Performance Analysis

We first formalize the critical invariant introduced in Equation (2):

**Lemma 1.** *Given a terminal state $t$, suppose we initialize $q^0_t = \mathbf{0}$, $r^0_t = e_t$ and $q^k_t, r^k_t = \mathbf{0}\ \forall k \neq 0$. Then for any source distribution $\sigma$ and length $\ell$, after any arbitrary sequence of REVERSE-PUSH($v, k$) operations, the vectors $\{q^k_t, r^k_t\}$ satisfy the invariant:*
$$p^{\ell}_{\sigma}[t] = \langle \sigma, q^{\ell}_t \rangle + \sum_{k=0}^{\ell} \langle \sigma P^k, r^{\ell-k}_t \rangle \qquad (3)$$

The proof follows the outline of a similar result in Andersen et al. [4] for PageRank estimation; due to lack of space, we defer it to our full version [8].
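While we defer the full proof to [8], the key step is easy to verify directly (a sketch of our own, using the definitions in Algorithm 1): a single REVERSE-PUSH($v, i$) with mass $m = \langle r^i_t, e_v \rangle$ leaves the right-hand side of Equation (3) unchanged. For $i < \ell$, the push decreases the $k = \ell - i$ term and increases the $k = \ell - i - 1$ term by exactly offsetting amounts:

$$\Delta = -\,m\,\langle \sigma P^{\ell-i}, e_v \rangle + m\,\langle \sigma P^{\ell-i-1}, e_v P^T \rangle = -\,m\,\sigma P^{\ell-i} e_v^T + m\,\sigma P^{\ell-i-1} P\, e_v^T = 0.$$

For $i = \ell$, the mass removed from $r^{\ell}_t$ (the $k = 0$ term) reappears in $\langle \sigma, q^{\ell}_t \rangle$, and pushes at levels $i > \ell$ touch no term of Equation (3); hence the invariant is preserved by any sequence of pushes.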
Using this result, we can now characterize the accuracy of the Bidirectional-MSTP algorithm:

**Theorem 1.** *We are given any Markov chain $P$, source distribution $\sigma$, terminal state $t$, maximum length $\ell_{max}$, and parameters $\delta$, $p_f$ and $\epsilon$ (i.e., the desired threshold, failure probability and relative error). Suppose we choose any reverse threshold $\delta_r > \delta$, and set the number of sample paths $n_f = c\,\ell_{max}\delta_r/\delta$, where $c = \max\{6e/\epsilon^2,\, 1/\ln 2\}\,\ln\left(2\ell_{max}/p_f\right)$. Then for any length $\ell \leq \ell_{max}$, with probability at least $1 - p_f$, the estimate returned by Bidirectional-MSTP satisfies:*
$$\left| \hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t] \right| < \max\left\{ \epsilon\, p^{\ell}_{\sigma}[t],\, \delta \right\}.$$

*Proof.* Given any Markov chain $P$ and terminal state $t$, note first that for a given length $\ell \leq \ell_{max}$, Equation (2) shows that the estimate $\hat{p}^{\ell}_{\sigma}[t]$ is unbiased. Now, for any random trajectory $T_k$, the score $S^{\ell}_{t,k}$ obeys: (i) $\mathbb{E}[S^{\ell}_{t,k}] \leq p^{\ell}_{\sigma}[t]$, and (ii) $S^{\ell}_{t,k} \in [0, \ell\delta_r]$; the first inequality again follows from Equation (2), while the second follows from the fact that we executed REVERSE-PUSH operations until all residual values were less than $\delta_r$. Now consider the rescaled random variables $X_k = S^{\ell}_{t,k}/(\ell\delta_r)$ and $X = \sum_{k \in [n_f]} X_k$; then $X_k \in [0, 1]$, $\mathbb{E}[X] \leq (n_f/(\ell\delta_r))\, p^{\ell}_{\sigma}[t]$, and also $X - \mathbb{E}[X] = (n_f/(\ell\delta_r))\left(\hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t]\right)$. Moreover, using standard Chernoff bounds (cf. Theorem 1.1 in [9]), we have:
$$\mathbb{P}\left[ |X - \mathbb{E}[X]| > \epsilon\,\mathbb{E}[X] \right] < 2\exp\left(-\frac{\epsilon^2\,\mathbb{E}[X]}{3}\right) \qquad \text{and} \qquad \mathbb{P}[X > b] \leq 2^{-b} \ \text{ for any } b > 2e\,\mathbb{E}[X].$$

Now we consider two cases:

1. $\mathbb{E}[S^{\ell}_{t,k}] > \delta/2e$ (i.e., $\mathbb{E}[X] > n_f\delta/(2e\,\ell\delta_r) \geq c/2e$, using $n_f = c\,\ell_{max}\delta_r/\delta$, cf. Algorithm 2): here we can use the first concentration bound to get:
$$\mathbb{P}\left[ \left|\hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t]\right| \geq \epsilon\, p^{\ell}_{\sigma}[t] \right] = \mathbb{P}\left[ |X - \mathbb{E}[X]| \geq \epsilon\,\frac{n_f}{\ell\delta_r}\, p^{\ell}_{\sigma}[t] \right] \leq \mathbb{P}\left[ |X - \mathbb{E}[X]| \geq \epsilon\,\mathbb{E}[X] \right] \leq 2\exp\left(-\frac{\epsilon^2\,\mathbb{E}[X]}{3}\right) \leq 2\exp\left(-\frac{\epsilon^2 c}{6e}\right).$$
Moreover, by the union bound over all $\ell \leq \ell_{max}$, the probability that some $\ell$ violates the relative error bound is at most $2\ell_{max}\exp\left(-\epsilon^2 c/6e\right)$; as long as $c \geq (6e/\epsilon^2)\ln(2\ell_{max}/p_f)$, we get the desired failure probability.

2. $\mathbb{E}[S^{\ell}_{t,k}] < \delta/2e$ (i.e., $\mathbb{E}[X] < c/2e$): in this case, note first that since $X \geq 0$, we have $p^{\ell}_{\sigma}[t] - \hat{p}^{\ell}_{\sigma}[t] \leq (\ell\delta_r/n_f)\,\mathbb{E}[X] \leq \delta/2e < \delta$. On the other hand, we also have:
$$\mathbb{P}\left[ \hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t] \geq \delta \right] = \mathbb{P}\left[ X - \mathbb{E}[X] \geq \frac{n_f\,\delta}{\ell\delta_r} \right] \leq \mathbb{P}[X \geq c] \leq 2^{-c},$$
where the last inequality follows from our second concentration bound, which applies since $c > 2e\,\mathbb{E}[X]$. Now, as before, we can use the union bound to show that the failure probability is bounded by $p_f$ as long as $c \geq \log_2(\ell_{max}/p_f)$.

Combining the two cases, we see that as long as $c \geq \max\{6e/\epsilon^2,\, 1/\ln 2\}\,\ln(2\ell_{max}/p_f)$, we have:
$$\mathbb{P}\left[ \exists\, \ell \leq \ell_{max} : \left|\hat{p}^{\ell}_{\sigma}[t] - p^{\ell}_{\sigma}[t]\right| \geq \max\{\delta,\, \epsilon\, p^{\ell}_{\sigma}[t]\} \right] \leq p_f. \qquad \blacksquare$$

One aspect that is not obvious from the intuition in Section 2.2 or the accuracy analysis is whether using a bidirectional method actually improves the running time of MSTP-estimation. This is addressed by the following result, which shows that for typical targets, our algorithm achieves a significant speedup:

**Theorem 2.** *Let any Markov chain $P$, source distribution $\sigma$, maximum length $\ell_{max}$ and parameters $\delta$, $p_f$ and $\epsilon$ be given. Suppose we set $\delta_r = \sqrt{\epsilon^2\,\bar{d}\,\delta / \left(\ell_{max}\log(\ell_{max}/p_f)\right)}$. Then for a uniform-random choice of $t \in S$, the Bidirectional-MSTP algorithm has a running time of $\widetilde{O}\left(\ell_{max}^{3/2}\sqrt{\bar{d}/\delta}\right)$.*

*Proof.* The runtime of Algorithm 2 consists of two parts:

**Forward work** (generating trajectories): we generate $n_f = c\,\ell_{max}\delta_r/\delta$ sample trajectories, each of length $\ell_{max}$; hence the running time is $O(c\,\ell_{max}^2\delta_r/\delta)$ for any Markov chain $P$, source distribution $\sigma$ and target node $t$. Substituting for $c$ from Theorem 1, we get a forward-work running time of $T_f = O\left(\frac{\ell_{max}^2\,\delta_r}{\epsilon^2\,\delta}\log(\ell_{max}/p_f)\right)$.

**Reverse work** (REVERSE-PUSH operations): let $T_r$ denote the reverse-work runtime for a uniform-random choice of $t \in S$. Then:
$$\mathbb{E}_t[T_r] = \frac{1}{n}\sum_{t \in S}\sum_{k=0}^{\ell_{max}}\sum_{v \in S}\left(d_{in}(v) + 1\right)\mathbf{1}_{\{\text{REVERSE-PUSH}(v,\,k)\text{ is executed}\}}.$$
Now for a given $t \in S$ and $k \in \{0, 1, \ldots, \ell_{max}\}$, note that the only states $v \in S$ on which we execute REVERSE-PUSH($v, k$) are those with residual $r^k_t[v] > \delta_r$; consequently, after the push these states have $q^k_t[v] > \delta_r$, and hence, by Equation (3) (setting $\sigma = e_v$, i.e., starting from state $v$), $p^k_{e_v}[t] \geq \delta_r$. Moreover, a REVERSE-PUSH($v, k$) operation involves updating the residuals of $d_{in}(v) + 1$ states. Since $\sum_{t \in S} p^k_{e_v}[t] = 1$, a straightforward counting argument shows that for any $v \in S$, $\sum_{t \in S}\mathbf{1}_{\{p^k_{e_v}[t] \geq \delta_r\}} \leq 1/\delta_r$.
Thus, we have:
$$\mathbb{E}_t[T_r] \leq \frac{1}{n}\sum_{k=0}^{\ell_{max}}\sum_{v \in S}\left(d_{in}(v) + 1\right)\sum_{t \in S}\mathbf{1}_{\{p^k_{e_v}[t] \geq \delta_r\}} \leq \frac{\ell_{max} + 1}{n}\sum_{v \in S}\frac{d_{in}(v) + 1}{\delta_r} = O\left(\frac{\ell_{max}\,\bar{d}}{\delta_r}\right).$$
Finally, choosing $\delta_r$ as above balances $T_f$ and $T_r$, and gives the result. $\blacksquare$

## 3 Applications of MSTP-estimation

**Estimating the stationary distribution and hitting probabilities:** MSTP-estimation can be used in two ways to estimate stationary probabilities $\pi[t]$. First, if we know the mixing time $\tau_{mix}$ of the chain $P$, we can directly use Algorithm 2 to approximate $\pi[t]$, setting $\ell_{max} = \tau_{mix}$ and using any source distribution $\sigma$. Theorem 2 then guarantees that we can estimate a stationary probability of order $\delta$ in time $\widetilde{O}(\tau_{mix}^{3/2}\sqrt{\bar{d}/\delta})$; in comparison, Monte Carlo has an $O(\tau_{mix}/\delta)$ runtime. We note that in practice we usually do not know the mixing time; in such a setting, our algorithm can be used to compute an estimate of $p^{\ell}_{\sigma}[t]$ for all values of $\ell \leq \ell_{max}$.

An alternative is to modify Algorithm 2 to estimate the truncated hitting probability $\hat{p}^{\ell,hit}_{\sigma}[t]$ (i.e., the probability of hitting $t$ starting from $\sigma$ for the *first* time in $\ell$ steps). By setting $\sigma = e_t$, we get an estimate for the expected truncated return time $\mathbb{E}[T_t \mathbf{1}_{\{T_t \leq \ell_{max}\}}] = \sum_{\ell \leq \ell_{max}} \ell\, \hat{p}^{\ell,hit}_{e_t}[t]$. This is a lower bound on $\mathbb{E}[T_t]$; using the fact that $\pi[t] = 1/\mathbb{E}[T_t]$, it therefore yields an upper bound for $\pi[t]$ which converges to $\pi[t]$ as $\ell_{max} \to \infty$. We note also that the truncated hitting time has been shown to be useful in other applications, such as identifying similar documents on a document-word-author graph [10]. To estimate the truncated hitting probability, we modify Algorithm 2 as follows: at each stage $i \in \{1, 2, \ldots, \ell_{max}\}$ (note: not $i = 0$), instead of REVERSE-PUSH($t, i$) we update $\widetilde{q}^i_t[t] = q^i_t[t] + r^i_t[t]$ and set $\widetilde{r}^i_t[t] = 0$, but do not push $r^i_t[t]$ back to the in-neighbors of $t$ in the $(i+1)$-th stage. The rest of the algorithm remains the same. It is easy to see from the discussion in Section 2.2 that the resulting quantity $\hat{p}^{\ell,hit}_{\sigma}[t]$ is an unbiased estimate of $\mathbb{P}[\text{hitting time of } t = \ell \mid X_0 \sim \sigma]$; we omit a formal proof due to lack of space.

**Exact stationary probabilities in strong Doeblin chains:** A strong Doeblin chain [11] is obtained by mixing a Markov chain $P$ and a distribution $\sigma$ as follows: at each transition, the process proceeds according to $P$ with probability $\alpha$, and otherwise samples a state from $\sigma$. Doeblin chains are widely used in ML applications; special cases include the celebrated PageRank metric [12], variants such as HITS and SALSA [13], and other algorithms for applications such as ranking [14] and structured prediction [15]. An important property of these chains is that if we sample a starting node $V_0$ from $\sigma$ and sample a trajectory of Geometric($\alpha$) length starting from $V_0$, then the terminal node is an unbiased sample from the stationary distribution [16]. There are two ways in which our algorithm can be used for this purpose: one is to replace the REVERSE-PUSH algorithm with a corresponding local update algorithm for the strong Doeblin chain (similar to the one in Andersen et al. [4] for PageRank), and then sample random trajectories of Geometric($\alpha$) length. A more direct technique is to choose some $\ell_{max} \gg 1/\alpha$, estimate $\{p^{\ell}_{\sigma}[t]\}_{\ell \in [\ell_{max}]}$, and then directly compute the stationary distribution as $\pi[t] \approx \sum_{\ell=1}^{\ell_{max}} \alpha^{\ell-1}(1-\alpha)\, p^{\ell}_{\sigma}[t]$, as sketched below.
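As a minimal sketch of this direct technique (our illustration, assuming MSTP estimates $\hat{p}^{\ell}_{\sigma}[t]$, e.g. from Algorithm 2):

```scala
// Stationary probability of a strong Doeblin chain from MSTP estimates:
// pi[t] ~= sum_{ell = 1..ellMax} alpha^(ell - 1) * (1 - alpha) * pHat(ell),
// where pHat(ell) estimates p^ell_sigma[t]. The neglected tail has total
// weight alpha^ellMax, negligible once ellMax >> 1/alpha.
def doeblinStationary(pHat: Array[Double], alpha: Double): Double = {
  var weight = 1.0 - alpha // weight of length ell = 1: alpha^0 * (1 - alpha)
  var acc = 0.0
  for (ell <- 1 until pHat.length) {
    acc += weight * pHat(ell)
    weight *= alpha // next length: alpha^ell * (1 - alpha)
  }
  acc
}
```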
**Graph diffusions:** If we assign a weight $\alpha_i$ to random walks of length $i$ on a (weighted) graph, the resulting scoring functions $f(P, \sigma)[t] := \sum_{i=0}^{\infty} \alpha_i \left(\sigma P^i\right)[t]$ are known as graph diffusions [17] and are used in a variety of applications. The case $\alpha_i = \alpha^{i-1}(1-\alpha)$ corresponds to PageRank. If instead the length is drawn according to a Poisson distribution (i.e., $\alpha_i = e^{-\alpha}\alpha^i/i!$), then the resulting function is called the heat kernel $h(G, \alpha)$; this too has several applications, including finding communities (clusters) in large networks [5]. Note that for any function $f$ as defined above, the truncated sum $f^{\ell_{max}} = \sum_{i=0}^{\ell_{max}} \alpha_i\, \sigma P^i$ obeys $\|f - f^{\ell_{max}}\| \leq \sum_{i=\ell_{max}+1}^{\infty} \alpha_i$; thus a guarantee on an estimate for the truncated sum directly translates to a guarantee on the estimate for the diffusion. We can use MSTP-estimation to efficiently estimate these truncated sums. We perform numerical experiments on heat kernel estimation in the next section.

**Conductance testing in graphs:** MSTP-estimation is an essential primitive for conductance testing in large Markov chains [1]. In particular, in regular undirected graphs, Kale et al. [6] develop a sublinear bidirectional estimator based on counting collisions between walks in order to identify 'weak' nodes: those which belong to sets with small conductance. Our algorithm can be used to extend this process to any graph, including weighted and directed graphs.

**Local algorithms:** There has been much recent interest in local algorithms: those which perform computations given only a small neighborhood of a source node [18]. In this regard, we note that Bidirectional-MSTP gives a natural local algorithm for MSTP-estimation, and thus for the applications mentioned above: given a $k$-hop neighborhood around the source and target, we can perform Bidirectional-MSTP with $\ell_{max}$ set to $k$. This follows from the fact that the invariant in Equation (2) holds after any sequence of REVERSE-PUSH operations.

*Figure 2: Estimating heat kernels: Bidirectional MSTP-estimation vs. Monte Carlo and Forward Push. To compare runtimes, we choose parameters such that the mean relative error of all algorithms is around 10%. Notice that Bidirectional-MSTP is 100 times faster than the other algorithms.*

## 4 Experiments

To demonstrate the efficiency of our algorithm on large Markov chains, we use heat kernel estimation (cf. Section 3) as an example application. The heat kernel is a non-homogeneous Markov chain score, defined as the probability of stopping at the target on a random walk from the source, where the walk length is sampled from a Poisson($\bar{\ell}$) distribution. In real-world graphs, the heat-kernel value between a pair of nodes has been shown to be a good indicator of an underlying community relationship [5]; this suggests that it can serve as a metric for personalized search on social networks. For example, if a social network user $s$ wants to view a list of users attending some event, then sorting these users by their heat-kernel values will place the users most similar to $s$ on top. Bidirectional-MSTP is ideal for such personalized search applications, as the set of users filtered by a search query is typically much smaller than the set of nodes in the network.

In Figure 2, we compare the runtime of different algorithms for heat kernel computation on four real-world graphs, ranging from millions to billions of edges (the Pokec [19], LiveJournal [20] and Orkut [20] datasets are from SNAP [21]; Twitter-2010 [22] was downloaded from the Laboratory for Web Algorithmics [23]; refer to our full version [8] for details). For each graph, for random (source, target) pairs, we compute the heat kernel using Bidirectional-MSTP, as well as two benchmark algorithms: Monte Carlo, and the Forward Push algorithm (as presented in [5]). All three algorithms have parameters which allow them to trade off speed and accuracy; for a fair comparison, we choose parameters such that the empirical mean relative error of each algorithm is around 10%.
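Concretely, the aggregation step that turns MSTP estimates into a truncated heat-kernel score can be sketched as follows (our illustration, not the benchmarked code; `pHat(i)` is assumed to estimate $p^i_{\sigma}[t]$, e.g. from Algorithm 2):

```scala
// Truncated heat-kernel score: walk length ~ Poisson(lBar), so the diffusion
// weights are alpha_i = e^(-lBar) * lBar^i / i!. Passing estimates up to
// ellMax ~ lBar + 10 * sqrt(lBar) leaves negligible tail mass (cf. below).
def heatKernel(pHat: Array[Double], lBar: Double): Double = {
  var weight = math.exp(-lBar) // alpha_0
  var acc = weight * pHat(0)
  for (i <- 1 until pHat.length) {
    weight *= lBar / i // Poisson recurrence: alpha_i = alpha_{i-1} * lBar / i
    acc += weight * pHat(i)
  }
  acc
}
```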
All three algorithms were implemented in Scala; for the Forward Push algorithm, our implementation follows the code linked from [5]. We set the average walk length $\bar{\ell} = 5$ (since longer walks mix into the stationary distribution), and set the maximum length to $\bar{\ell} + 10\sqrt{\bar{\ell}} \approx 27$; the probability of a walk being longer than this is less than $10^{-12}$, which is negligible. For reproducibility, our source code is available on our website (cf. [8]).

Figure 2 shows that across all graphs, Bidirectional-MSTP is 100x faster than the two benchmark algorithms. For example, on the Twitter graph it can estimate a heat-kernel score in 0.1 seconds, while the other algorithms take more than 4 minutes. We note, though, that Monte Carlo and Forward Push can return scores from the source to all targets, rather than just one target; thus Bidirectional-MSTP is most useful when we want the score for a small set of targets.

**Acknowledgments.** Research supported by the DARPA GRAPHS program via grant FA9550-12-1-0411, and by NSF grant 1447697. Peter Lofgren was supported by an NPSC fellowship. Thanks to Ashish Goel and other members of the Social Algorithms Lab at Stanford for many helpful discussions.

## References

[1] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In *Studies in Complexity and Cryptography: Miscellanea on the Interplay between Randomness and Computation*. Springer, 2011.

[2] Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and C. Seshadhri. FAST-PPR: Scaling personalized PageRank estimation for large graphs. In *ACM SIGKDD '14*, 2014.

[3] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using PageRank vectors. In *IEEE FOCS '06*, 2006.

[4] Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Vahab S. Mirrokni, and Shang-Hua Teng. Local computation of PageRank contributions. In *Algorithms and Models for the Web-Graph*. Springer, 2007.

[5] Kyle Kloster and David F. Gleich. Heat kernel based community detection. In *ACM SIGKDD '14*, 2014.

[6] Satyen Kale, Yuval Peres, and C. Seshadhri. Noise tolerance of expanders and sublinear expander reconstruction. In *IEEE FOCS '08*, 2008.

[7] Rajeev Motwani, Rina Panigrahy, and Ying Xu. Estimating sum by weighted sampling. In *Automata, Languages and Programming*, pages 53-64. Springer, 2007.

[8] Siddhartha Banerjee and Peter Lofgren. Fast bidirectional probability estimation in Markov models. Technical report, 2015. http://arxiv.org/abs/1507.05998.

[9] Devdatt P. Dubhashi and Alessandro Panconesi. *Concentration of Measure for the Analysis of Randomized Algorithms*. Cambridge University Press, 2009.

[10] Purnamrita Sarkar, Andrew W. Moore, and Amit Prakash. Fast incremental proximity search in large graphs. In *Proceedings of the 25th International Conference on Machine Learning*, pages 896-903. ACM, 2008.

[11] Wolfgang Doeblin. Éléments d'une théorie générale des chaînes simples constantes de Markoff. *Annales Scientifiques de l'École Normale Supérieure*, volume 57, pages 61-111. Société Mathématique de France, 1940.

[12] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.

[13] Ronny Lempel and Shlomo Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. *Computer Networks*, 33(1):387-401, 2000.

[14] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Iterative ranking from pair-wise comparisons. In *Advances in Neural Information Processing Systems*, pages 2474-2482, 2012.
[15] Jacob Steinhardt and Percy Liang. Learning fast-mixing models for structured prediction. In *ICML '15*, 2015.

[16] Krishna B. Athreya and Örjan Stenflo. Perfect sampling for Doeblin chains. *Sankhyā: The Indian Journal of Statistics*, pages 763-777, 2003.

[17] Fan Chung. The heat kernel as the PageRank of a graph. *Proceedings of the National Academy of Sciences*, 104(50):19735-19740, 2007.

[18] Christina E. Lee, Asuman Ozdaglar, and Devavrat Shah. Computing the stationary distribution locally. In *Advances in Neural Information Processing Systems*, pages 1376-1384, 2013.

[19] Lubos Takac and Michal Zabovsky. Data analysis in public social networks. In *International Scientific Conference & Workshop Present Day Trends of Innovations*, 2012.

[20] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In *Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC '07)*, San Diego, CA, October 2007.

[21] Stanford Network Analysis Platform (SNAP). http://snap.stanford.edu/. Accessed: 2014-02-11.

[22] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In *ACM WWW '11*, 2011.

[23] Laboratory for Web Algorithmics. http://law.di.unimi.it/datasets.php. Accessed: 2014-02-11.