# Statistical guarantees for local graph clustering

Journal of Machine Learning Research 22 (2021) 1-54. Submitted 1/20; Revised 4/21; Published 6/21.

Wooseok Ha (HAYWSE@BERKELEY.EDU), Department of Statistics, University of California at Berkeley, Berkeley, CA, USA.

Kimon Fountoulakis (KFOUNTOU@UWATERLOO.CA), School of Computer Science, University of Waterloo, Waterloo, ON, Canada.

Michael W. Mahoney (MMAHONEY@STAT.BERKELEY.EDU), ICSI and Department of Statistics, University of California at Berkeley, Berkeley, CA, USA.

(* Equal contribution.)

Editor: Vahab Mirrokni

Abstract

Local graph clustering methods aim to find small clusters in very large graphs. These methods take as input a graph and a seed node, and they return as output a good cluster in a running time that depends on the size of the output cluster but that is independent of the size of the input graph. In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the ℓ1-regularized Page Rank method (Fountoulakis et al., 2019) for the recovery of a single target cluster, given a seed node inside the cluster. Assuming the target cluster has been generated by a random model, we present two results. In the first, we show that the optimal support of ℓ1-regularized Page Rank recovers the full target cluster, with bounded false positives. In the second, we show that if the seed node is connected solely to the target cluster, then the optimal support of ℓ1-regularized Page Rank recovers exactly the target cluster. We also show empirically that ℓ1-regularized Page Rank has state-of-the-art performance on many real graphs, demonstrating the superiority of the method. From a computational perspective, we show that the solution path of ℓ1-regularized Page Rank is monotonic. This allows for the application of the forward stagewise algorithm, which approximates the entire solution path in a running time that does not depend on the size of the whole graph. Finally, we show that ℓ1-regularized Page Rank and approximate personalized Page Rank (APPR) (Andersen et al., 2006), another very popular method for local graph clustering, are equivalent, in the sense that we can lower- and upper-bound the output of one with the output of the other. Based on this relation, we establish for APPR similar results to those we establish for ℓ1-regularized Page Rank.

Keywords: clustering, local graph clustering, Page Rank, seed expansion, ℓ1-regularization.

1. Introduction

In many data applications, one is interested in finding small-scale structure in a very large data set. As an example, consider the following version of the so-called local graph clustering problem: given a large graph and a seed node in that graph, quickly find a good small cluster that includes that seed node. From an algorithmic perspective, one typically considers worst-case input graphs, and one is interested in running time guarantees, e.g., to find a good cluster in a time that depends linearly or sub-linearly on the size of the entire graph. From a statistical perspective, such a local graph clustering problem can be understood as a recovery problem.
One assumes that there exists a target cluster in a given large graph, where the graph is assumed to have been generated by a random model, and the objective is to recover the target cluster from one node inside the cluster. In this paper, we consider the so-called ℓ1-regularized Page Rank algorithm (Fountoulakis et al., 2019), a popular algorithm for the local graph clustering problem, and we establish statistical recoverability guarantees for it. Previous theoretical analysis of local graph clustering, e.g., Andersen et al. (2006); Zhu et al. (2013), is based on the notion of conductance (a cluster quality metric that considers the internal versus external connectivity of a cluster) and considers running time performance for worst-case input graphs. In contrast, our goal is to study the average-case performance of the ℓ1-regularized Page Rank algorithm under a certain type of local random graph model. The model we consider is very general; it concerns only the target cluster and its adjacent nodes; and it encompasses the stochastic block model (Holland et al., 1983; Abbe, 2017) and the planted clustering model (Alon et al., 1998; Arias-Castro and Verzelen, 2014) as special cases. Within this random graph model, we provide theoretical guarantees for the unique optimal solution of the ℓ1-regularized Page Rank optimization problem. In particular, the cluster is recovered through the support set of the ℓ1-regularized Page Rank vector, and we give rigorous bounds on the false positives and false negatives of the recovered cluster. Furthermore, our statistical perspective is closely aligned with statistical guarantees for the sparse regression problem (and the lasso problem (Tibshirani, 1996)), where the objective is to recover the true parameter and/or support from noisy data. Given this connection, we also establish a result for the exact support recovery of ℓ1-regularized Page Rank. Empirically, we demonstrate the ability of the method to recover the target cluster on a range of real-world graphs. Finally, we establish an equivalence between ℓ1-regularized Page Rank and the very popular local graph clustering algorithm of Andersen et al. (2006); Zhu et al. (2013). This allows us to prove similar average-case guarantees for the algorithm of Andersen et al. (2006); Zhu et al. (2013) as well.

1.1 Literature review

There is a large body of related work, the most relevant of which is work in theoretical computer science on local graph algorithms, for example personalized Page Rank and flow-based methods, and work in statistics on stochastic graph models. We discuss each in turn.

Local graph clustering. The origins of local graph clustering lie in the work of Spielman and Teng (2013). Subsequent to their original results, there has been a great deal of follow-up work on local graph clustering procedures, including with random walks (Andersen et al., 2006; Zhu et al., 2013), local Lanczos spectral approximations (Shi et al., 2017), evolving sets (Andersen et al., 2016), seed expansion methods (Kloumann and Kleinberg, 2014), optimization-based approaches (Fountoulakis et al., 2019, 2017), and local flow methods (Orecchia and Zhu, 2014; Wang et al., 2017; Veldt et al., 2019; Fountoulakis et al., 2020b,a). There also exist local higher-order clustering methods (Yin et al., 2017), linear algebra approaches (Shi et al., 2017), spectral methods based on Heat Kernel Page Rank (Kloster and Gleich, 2014), and parallel local spectral approaches (Shun et al., 2016).
In all of these cases, given a seed node, or a seed set of nodes, the goal of existing local graph clustering approaches is to compute a cluster nearby the seed that is related to the "best" cluster nearby the seed. Here, "best" and "nearby" are intentionally left under-specified, as they can be formalized in one of a few different but related ways. For example, "best" is usually related to a clustering score such as conductance. In fact, many existing methods for local graph clustering with theoretical guarantees are motivated through the problem of finding a cluster that is near the seed node and that also has small conductance value (Spielman and Teng, 2013; Andersen et al., 2006; Fountoulakis et al., 2019; Orecchia and Zhu, 2014; Veldt et al., 2019; Wang et al., 2017; Fountoulakis et al., 2020b). More recently, Green et al. (2019) studied local graph clustering in a traditional statistical learning setup, where they aim to identify density clusters around a given seed node.

Stochastic block model. There are numerous papers in statistics on partitioning random graphs. Arguably, the stochastic block model (SBM) is the most commonly employed random model for graph partitioning, and it has been extensively studied (Abbe and Sandon, 2015; Abbe et al., 2015; Zhang and Zhou, 2016; Massoulié, 2014; Mossel et al., 2018; Newman et al., 2002; Mossel et al., 2015; Rohe et al., 2011; Amini and Levina, 2018; Abbe, 2017). Recent work has also generalized the SBM to the degree-corrected block model, to capture the degree heterogeneity of networks (Chen et al., 2018; Gulikers et al., 2017; Zhao et al., 2012; Gao et al., 2018). The literature in this area is too extensive to cover in this paper, but we refer the reader to excellent surveys of the graph partitioning problem (Abbe, 2017). We should emphasize that the traditional graph partitioning problem is quite different from the local graph clustering problem that we consider in this paper. Among other things, the former partitions all the vertices of a graph into different clusters, while for the latter our objective is to find a single cluster given a seed node in the cluster; the former takes as input a graph, while the latter takes as input a graph and a seed set of nodes; and the former runs in time depending on the size of the graph, while the latter runs in time depending on the size of the output, but is otherwise independent of the size of the graph.

1.2 Notation

We write [n] = {1, . . . , n} for any n ≥ 1. Throughout the paper we assume we have a connected, undirected graph G = (V, E), where V denotes the set of nodes, with |V| = n, and E ⊆ V × V denotes the set of edges. We denote by A the adjacency matrix of G, i.e., A_ij = w_ij if (i, j) ∈ E, and 0 otherwise. For an unweighted graph, w_ij is set to 1 for all (i, j) ∈ E. We denote by D the diagonal degree matrix of G, i.e., D_ii := d_i = Σ_{j:(i,j)∈E} w_ij, where d_i is the weighted degree of node i. In this case, d = (d_i) ∈ R^n denotes the degree vector, and the volume of a subset of nodes B ⊆ V is defined as Vol(B) = Σ_{i∈B} d_i. We denote by L = D − A the graph Laplacian, and we set Q := αD + ((1 − α)/2) L. If v ∈ R^n is a vector defined on V, we denote by support(v) the support set of v, i.e., support(v) := {i ∈ V | v_i ≠ 0}. For given sets of indices B1, B2 ⊆ [n], we write M_{B1,B2} to denote the submatrix of M indexed by B1 and B2. If B1 = {i} is a singleton, we use M_{i,B2} to indicate the i-th row of M whose columns are indexed by B2. Analogously, we use M_{B1,j} to indicate the j-th column of M whose rows are indexed by B1. We denote by B1\B2 the set difference between B1 and B2, and by B1^c = [n]\B1 the complement of B1.
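To make the notation concrete, the following minimal Python sketch (ours, not part of the paper) builds the degree vector d, degree matrix D, Laplacian L, and the matrix Q from a dense adjacency matrix; the function name is hypothetical.

```python
import numpy as np

def graph_matrices(A, alpha):
    """Given a symmetric adjacency matrix A and teleportation parameter
    alpha, return the degree vector d, degree matrix D, the graph
    Laplacian L = D - A, and Q = alpha*D + ((1 - alpha)/2)*L."""
    d = A.sum(axis=1)                        # (weighted) degrees d_i
    D = np.diag(d)
    L = D - A                                # graph Laplacian
    Q = alpha * D + 0.5 * (1.0 - alpha) * L
    return d, D, L, Q
```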
2. Local graph clustering from a variational point of view

In this section, we briefly review and motivate ℓ1-regularized Page Rank (Fountoulakis et al., 2019), an optimization formulation of local graph clustering for finding a local cluster around a given seed node. We then study some properties of the output vector of ℓ1-regularized Page Rank that provide further intuition about the method.

2.1 Background on ℓ1-regularized Page Rank

Page Rank (Page et al., 1999; Brin and Page, 1998) is a popular approach for ranking the importance of the nodes of a graph. It is defined as the stationary distribution of a Markov chain, which is encoded by a convex combination of the input distribution s ∈ R^n and the (lazy) random walk on the graph, i.e.,

    p^PR = αs + (1 − α)W p^PR,    (1)

where W = (I + AD^{−1})/2 is the lazy random walk operator and where α ∈ (0, 1) is the teleportation parameter. To measure the ranking or importance of the nodes of the whole graph, Page Rank is often computed by setting the input vector s to be the uniform distribution over {1, 2, . . . , n}. For local graph clustering, where the aim is to identify a target cluster given a seed node in the cluster, the input distribution s is instead set to be equal to one for the seed node and zero everywhere else; that is, when node i is given as the seed node, we take s to be the discrete Dirac measure with s_i = 1 and zero elsewhere. This personalized Page Rank (Haveliwala, 2002) measures the closeness or similarity of the nodes to the given seed node, and it outputs a ranking of the nodes that is personalized with respect to the seed node (as opposed to the original Page Rank, which considers the entire graph). From an operational point of view, the underlying diffusion process in (1) defining personalized Page Rank performs a lazy random walk with probability 1 − α and teleports the random walker back to the original seed node with probability α. From the definition itself, the personalized Page Rank vector can be obtained by solving the linear system (1).
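On a small graph, one could do this with a direct dense solve; the following Python sketch (purely illustrative, and exactly the kind of global computation the local methods below are designed to avoid) solves (I − (1 − α)W)p = αs.

```python
import numpy as np

def personalized_pagerank(A, seed, alpha):
    """Solve the stationary equation (1) directly:
    p = alpha*s + (1 - alpha)*W p, with lazy walk W = (I + A D^{-1})/2
    and s the Dirac measure at the seed node."""
    n = A.shape[0]
    d = A.sum(axis=1)
    W = 0.5 * (np.eye(n) + A / d)       # A / d scales column j by 1/d_j
    s = np.zeros(n)
    s[seed] = 1.0
    return np.linalg.solve(np.eye(n) - (1.0 - alpha) * W, alpha * s)
```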
Unfortunately, this step can be prohibitively expensive, especially when there is a single seed node or a small set of seed nodes, and when one is interested in very small clusters in a very large graph. In the seminal work of Andersen et al. (2006), the authors propose an iterative algorithm, called approximate personalized Page Rank (APPR), to solve this running time problem. They do so by approximating the personalized Page Rank vector while running in time independent of the size of the entire graph. APPR was developed from an algorithmic (or "theoretical computer science") perspective, but it is equivalent to applying a coordinate descent type algorithm to the linear system (1) with a particular scheme of early stopping (see Section 4 for more details on the APPR algorithm). Gleich and Mahoney (2014) show that APPR implicitly solves a constrained ℓ1-regularized ℓ2-objective min-cut problem, drawing a connection between APPR and ℓ1-regularized optimization. Motivated by this, Fountoulakis et al. (2019) proposed the ℓ1-regularized Page Rank optimization problem. Unlike APPR, the solution method for the ℓ1-regularized Page Rank optimization problem is purely optimization-based. It uses ℓ1-norm regularization to automatically set to zero all nodes that are dissimilar to the seed node, thus resulting in a highly sparse output. In this manner, ℓ1-regularized Page Rank can estimate the personalized ranking while at the same time keeping only the most relevant nodes. Prior work (Fountoulakis et al., 2019) showed that proximal gradient descent (ISTA) can solve the ℓ1-regularized Page Rank minimization problem with access to only a small portion of the entire graph, i.e., without even touching the entire graph, therefore allowing the method to easily scale to very large graphs (Shun et al., 2016). They also showed that the optimality conditions characterizing ℓ1-regularized Page Rank imply the termination criterion of APPR, and in particular, the worst-case performance guarantees of the two methods are identical. In this paper, we investigate the statistical performance of ℓ1-regularized Page Rank by reformulating local graph clustering as a problem of sparse support recovery.

Here we give a more precise definition of the ℓ1-regularized Page Rank optimization problem from Fountoulakis et al. (2019) that we consider.

Definition 1 (ℓ1-regularized Page Rank) Given a graph G = (V, E), with |V| = n, and a seed vector s ∈ R^n, the ℓ1-regularized Page Rank vector on the graph is defined as

    x̂ = argmin_{x∈R^n} f(x) + ρα‖Dx‖_1, where f(x) := (1/2) xᵀQx − α xᵀs,    (2)

where recall Q = αD + ((1 − α)/2) L, and where ρ > 0 is a user-specified parameter that controls the amount of regularization.

Note that our definition of ℓ1-regularized Page Rank is consistent with the original formulation (Fountoulakis et al., 2019, Equation (8)) by the change of variables D^{1/2}x = q. To better understand the objective function of (2), note that when the regularization parameter ρ is set to 0, one can easily check that the re-scaled output solution Dx̂ recovers the original Page Rank solution, that is, Dx̂ = p^PR satisfies the stationary equation (1). For the stationary personalized Page Rank vector p^PR, mass is concentrated around the seed node, meaning that, after ordering, it has a long tail for nodes far away from the seed node. Importantly, we can then efficiently cut this tail using ℓ1-norm regularization, without even having to compute the long tail.
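As a concrete, dense (hence deliberately non-local) illustration of Definition 1, the following ISTA sketch alternates a gradient step on f with soft-thresholding at the per-coordinate weights ραd_i. Because the solution is non-negative (Lemma 2 in Section 2.2 below), the prox reduces to a shifted positive part. This is a sketch under those assumptions, not the strongly local solver of Fountoulakis et al. (2019).

```python
import numpy as np

def l1_pagerank_ista(A, seed, alpha, rho, iters=5000):
    """ISTA sketch for Problem (2):
    minimize f(x) + rho*alpha*||D x||_1, f(x) = 0.5 x'Qx - alpha*x's."""
    n = A.shape[0]
    d = A.sum(axis=1)
    Q = alpha * np.diag(d) + 0.5 * (1.0 - alpha) * (np.diag(d) - A)
    s = np.zeros(n)
    s[seed] = 1.0
    step = 1.0 / np.linalg.eigvalsh(Q)[-1]   # 1/L, L = Lipschitz constant of grad f
    x = np.zeros(n)
    for _ in range(iters):
        grad = Q @ x - alpha * s
        # prox of the weighted l1 norm; the plain positive part suffices
        # here because the optimal solution is non-negative (Lemma 2)
        x = np.maximum(x - step * (grad + rho * alpha * d), 0.0)
    return x
```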
2.2 Properties

Here, we state some properties of the ℓ1-regularized Page Rank vector that will be useful for our analysis, as well as for gaining insight into the method. The proofs of these lemmas can be found in Appendix C. First, the following lemma guarantees that the ℓ1-regularized Page Rank vector is non-negative. This result should be natural, because ℓ1-regularized Page Rank computes the importance scores of the nodes relative to the seed node, which cannot be negative.

Lemma 2 (Non-negativity of ℓ1-reg. Page Rank.) Let x̂ be the optimal solution given in Definition 1. Then x̂ is non-negative, i.e., x̂_j ≥ 0 for all j ∈ V.

The next lemma guarantees that the gradient of f at the optimal solution x̂ must be non-positive. Since the ℓ1-regularized Page Rank problem is strongly convex (the minimum eigenvalue of Q is > 0), by the KKT conditions this characterization is both necessary and sufficient for x̂ to be the unique solution. We frequently use this lemma in the proofs of our subsequent results.

Lemma 3 (Optimality condition for ℓ1-reg. PR minimization.) Let support(x̂) := {i ∈ V | x̂_i ≠ 0} be the support set of the optimal solution. Then

    ∇_i f(x̂) = (Qx̂)_i − αs_i   = −ραd_i, if i ∈ support(x̂);
                                ∈ [−ραd_i, 0], if i ∉ support(x̂) and i is a neighbor of a nonzero node;
                                = 0, otherwise.

Finally, the next result shows that the solution path of the ℓ1-regularized Page Rank problem is monotone, meaning that when the output of ℓ1-regularized Page Rank is parametrized as a function of ρ > 0, the trajectory x̂(ρ) changes in a monotonic manner as ρ varies.

Lemma 4 (Monotonicity of ℓ1-reg. Page Rank.) Let x̂(ρ) denote the solution to (2) indexed by ρ > 0. Then x̂(ρ) is monotone as a function of ρ, i.e., x̂(ρ0) ≤ x̂(ρ1) whenever ρ0 > ρ1, where ≤ is applied component-wise.

Lemma 4 shows that once a node enters the model at some value ρ > 0, it never leaves the model thereafter. This result is intuitive, since the importance scores of the nodes increase as the amount of regularization or shrinkage decreases. In Section 5, we use this monotonicity property in a crucial way to develop an iterative algorithm that approximates the entire solution path.

3. Statistical guarantees under random model

In this section, we study the recovery guarantees for ℓ1-regularized Page Rank under a random graph model. We begin by introducing the random model that we consider for generating a target cluster. In particular, we assume the graph is generated according to the following model.

Definition 5 (Local random model.) Given a graph G = (V, E) that has n vertices, let K ⊆ V be a target cluster inside the graph, and let K^c denote the complement of K. If two vertices i and j belong to K, then we draw an edge between i and j with probability p, independently of all other edges; if i ∈ K and j ∈ K^c, then we draw an edge with probability q, independently of all other edges; and otherwise, we allow any (deterministic or random) model to generate edges among the vertices in K^c.

According to Definition 5, the adjacency matrix A ∈ R^{n×n} is symmetric, and for any i, j ∈ V, the entry A_ij is an independent draw from a Bernoulli distribution with probability p if i, j ∈ K, and from a Bernoulli distribution with probability q if i ∈ K and j ∈ K^c. For the rest of the graph, i.e., when both i and j belong to K^c, A_ij can be generated from an arbitrary fixed model. Under this definition, we can also naturally define the population version of the graph, which is the graph induced by the expected adjacency matrix E[A], where the expectation is taken with respect to the distribution defined by our random model. That is, the population graph is an undirected graph Ḡ = (V, Ē) whose adjacency matrix is E[A], where

    E[A]_ij =  p, if i ∈ K and j ∈ K;
               q, if i ∈ K and j ∈ K^c;
               any value, if i ∈ K^c and j ∈ K^c.

The expected degree matrix is similarly denoted by E[D], and the expected graph Laplacian is defined as E[L] = E[D] − E[A]. The model in Definition 5 allows us to formulate the problem of local graph clustering as the recovery of a target cluster. Since we are interested in recovering a single target cluster, it is natural to make assumptions only about the nodes in the target cluster and the nodes adjacent to it, and to leave the interactions among the other nodes unspecified. This random model is fairly general, and it covers several popular random graph models appearing in the literature, including the stochastic block model (SBM) (Holland et al., 1983; Abbe, 2017) and the planted clustering model (Alon et al., 1998; Arias-Castro and Verzelen, 2014; Chen and Xu, 2016).
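Concretely, one can simulate Definition 5 as follows (a minimal sketch; since the model leaves K^c unspecified, the internal wiring of K^c below is an arbitrary illustrative choice):

```python
import numpy as np

def sample_local_model(n, K, p, q, rng=None):
    """Sample an adjacency matrix following Definition 5: pairs inside K
    are wired with probability p, K-to-K^c pairs with probability q.
    Purely for illustration, K^c is wired internally as Erdos-Renyi(q);
    the model itself allows any mechanism there."""
    rng = np.random.default_rng(rng)
    P = np.full((n, n), q, dtype=float)   # K-K^c and (here) K^c-K^c probability
    P[np.ix_(K, K)] = p                   # within-cluster probability
    U = rng.random((n, n))
    A = np.triu((U < P).astype(float), k=1)
    return A + A.T                        # symmetric, no self-loops

# e.g., K = np.arange(20) with n = 1000 and p, q as in Section 6
```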
For instance, if the subgraph on the vertices in K^c is generated from the SBM, then the entire graph G = (V, E) follows the SBM. On the other hand, if the subgraph on K^c is generated from the classical Erdős-Rényi model with probability q, then the entire graph G = (V, E) follows the planted densest subgraph model (in this case, nodes in K^c do not belong to any clusters). Hence, the results we obtain for our model hold more broadly across these different random graph models.

Before we move on to our results, we need an additional piece of notation. We write S ⊆ K to denote the singleton containing the given seed node. Let k = |K| denote the cardinality of the target cluster. According to our local model, any node in the target cluster has the same expected degree, E[d_i] = p(k − 1) + q(n − k) for all i ∈ K, which we denote by d̄. For nodes ℓ outside K, we write E[d_ℓ] to denote the expected degree, where the expectation is taken with respect to the distribution that generates the graph on K^c. Conductance measures the weight of the edges that are removed over the volume of the cluster; formally, it is defined as the ratio Cut(S, S^c)/min(Vol(S), Vol(S^c)), where Cut(S, S^c) := Σ_{i∈S, j∈S^c} A_ij. From Definition 5, the conductance of the target cluster in the population graph Ḡ is given by

    Cond = 1 − γ, where γ := p(k − 1)/d̄ ∈ (0, 1).    (4)

Here γ can be viewed as the probability that the random walker stays inside K under the population graph. (Note that d̄ is the expected degree of a node in the target cluster, and p(k − 1) is its expected degree when restricted to the subgraph within K.) As in the worst-case analysis of local graph clustering (Andersen et al., 2006; Zhu et al., 2013), the conductance Cond, or equivalently the number γ, plays a crucial role in determining the behavior of ℓ1-regularized Page Rank under the local graph model. In particular, in the extreme scenario where γ = 1 we have q = 0, indicating perfect separability of the target cluster from the rest of the graph, while for γ = 0 we have p = 0, meaning there is no signal to recover. With this definition, we can also write p(k − 1) = γd̄ and q(n − k) = (1 − γ)d̄.

3.1 Recovery of target cluster with bounded false positives

Here, we investigate the performance of ℓ1-regularized Page Rank on a graph generated by the local random model of Definition 5, and we state two of our main theorems. Our first main result guarantees full recovery of the target cluster for an appropriate choice of the regularization parameter. Specifically, for δ > 0, define

    ρ̄(δ) := ((1 − α)/2) · γp / ((1 + δ) d̄²) = O(γp/d̄²),    (5)

where α is the teleportation constant and where p, γ, d̄ are the parameters of the random model appearing in (4). Then, if we solve the convex problem (2) with ρ ≤ ρ̄(δ), the optimal solution fully recovers the target cluster K, as long as the seed node is initialized inside K.

Theorem 6 (Full recovery.) Suppose that p²k ≥ O(δ^{−2} log k). If the regularization parameter satisfies ρ ≤ ρ̄(δ), where ρ̄(δ) is defined in (5), then the solution to Problem (2) fully recovers the cluster K, i.e., K ⊆ support(x̂), with probability at least 1 − 6 exp(−O(δ²p²k)). (More precisely, we assume (1 − δ)p²k ≥ c_0^{−1} δ^{−2} log k for a fixed constant c_0 > 0; then the statement holds with probability at least 1 − 6e^{−c_0 δ²(1−δ)p²k}.)
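As a quick numerical illustration of these quantities (note that the constants in (5) are as reconstructed here from the definitions above, so treat the computed ρ̄ as indicative rather than exact), consider parameter values in the range used in Section 6:

```python
# Illustrative parameter values in the range used in Section 6
p, q, k, n, alpha, delta = 0.5, 0.002, 20, 1000, 0.1, 0.1
d_bar = p * (k - 1) + q * (n - k)     # expected degree inside K
gamma = p * (k - 1) / d_bar           # equation (4); Cond = 1 - gamma
rho_bar = 0.5 * (1 - alpha) * gamma * p / ((1 + delta) * d_bar**2)  # equation (5)
print(f"gamma = {gamma:.3f}, Cond = {1 - gamma:.3f}, rho_bar = {rho_bar:.2e}")
```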
Our next main result provides an upper bound on the false positives present in the support set of the ℓ1-regularized Page Rank vector. By "false positives", we mean the nonzero nodes that belong to K^c. We measure the size of the false positives using the notion of volume, where we recall that the volume of a subset of vertices B ⊆ V is given by Vol(B) = Σ_{i∈B} d_i.

Theorem 7 (Bounds on false positives.) Suppose the same conditions as in Theorem 6. If the regularization parameter satisfies ρ ≤ ρ̄(δ), then the solution to Problem (2) satisfies the bound

    Vol(FP) ≤ Vol(K) · ((1 + α)/γ² − 1),    (6)

with probability at least 1 − 6 exp(−O(δ²p²k)) (the same probability bound as in Theorem 6), where FP = {i ∈ support(x̂) : i ∈ K^c} is the collection of false positive nodes.

The proofs of Theorem 6 and Theorem 7 are given in Appendix B.1 and Appendix B.2, respectively. To give a brief sketch of the proofs, we first consider the following reduced version of the ℓ1-regularized Page Rank problem (see Appendix A.3 for details):

    min_{x∈R^n} { (1/2) xᵀQx − α xᵀs + ρα‖Dx‖_1 : x_{K^c} = 0 }.    (7)

The reduced problem (7) is introduced because it allows us to analyze the properties of the solution while restricting the matrices Q and D to the nodes in K only. Note that when there are no false positives, the optimal solution of (2) coincides with the solution to Problem (7). More generally, we can prove that the support set of x̂ always contains the support set of the solution to (7). Furthermore, we can show that when ρ ≤ ρ̄(δ), the solution to (7) is strictly positive on K. Putting these together establishes Theorem 6. On the other hand, using the optimality condition of Lemma 3, we can obtain an upper bound on the volume of the optimal solution x̂. Since, by Theorem 6, x̂ entirely recovers the target cluster K, the errors in the support set of x̂ are solely due to the presence of false positives. Combining the two results, we obtain an upper bound on the false positives, thereby proving Theorem 7.

The results of Theorem 6 and Theorem 7 exhibit several regimes where ℓ1-regularized Page Rank can fully recover the target cluster with nonvanishing probability. In particular, when p = O(1), the size of the target cluster, k, is only required to be larger than O(log k), which includes the constant size k = O(1). This is often the regime of interest for local graph clustering, where the goal is to find small- and meso-scale clusters in massive graphs (Leskovec et al., 2009, 2010). In addition, Theorem 6 indicates that if γ is small, we need to set ρ to be small in order to recover the entire cluster. Intuitively, more mass leaks out to K^c for small γ, so we need to run more steps of the random walk (equivalently, use a smaller value of ρ in our optimization framework) to find the right cluster. However, this means that the ℓ1-regularized Page Rank vector will also pick up many nonzero nodes in K^c, resulting in many false positives in the support set. Indeed, Theorem 7 shows that the volume of the false positives grows quadratically in 1/γ, so γ needs to be well-bounded for local clustering to give a meaningful recovery. In the case of p = O(1), k = O(1), this amounts to requiring that q = O(1/n) in order for the recovered cluster to keep high mass inside K.

We make several further remarks regarding these results. First, we suspect that the bound we obtain in (6) may not be tight with respect to α and other constants; in particular, the factor (1 + α) may be an artifact of our proof. Studying the lower bound on the performance of the method, as well as obtaining an improved bound on the false positives, is therefore an interesting future direction to pursue.
Furthermore, on the basis of our empirical results, ℓ1-regularized Page Rank performs well across a broad range of α values, and we have not seen much difference in performance among different α's. The role of α in ℓ1-regularized Page Rank is closely tied to the regularization parameter ρ, and we leave the question of selecting the optimal α for future work.

3.2 Exact recovery of target cluster with no false positives

Next, we study scenarios under which ℓ1-regularized Page Rank exhibits a stronger recovery guarantee. Specifically, under some additional conditions, we show that the support set of the optimal solution of (2) identifies the target cluster exactly, without producing any false positives. For this stronger exact recovery result, we require the following assumption on the parameters of the model.

Assumption 1 We assume p = O(1), k = O(1), i.e., the within-cluster connectivity and the size of the target cluster do not scale with the size of the graph n. Also, we assume q = c/n for a fixed numerical constant c > 0.

As we noted previously, the setting k = O(1) is often the case of interest for local graph clustering, where we would like to identify small- and medium-scale structure in large graphs (Leskovec et al., 2009, 2010). In this case, Assumption 1 requires p = O(1), so that the underlying signal of the problem does not vanish as the size of the graph grows, n → ∞. As discussed earlier, this means that q must also scale as O(n^{−1}) for the local clustering algorithm to find the target without making many false positives.

Now we turn to the statement of the exact recovery guarantees for ℓ1-regularized Page Rank when applied to a noisy graph generated from Definition 5. In particular, the fact that q = O(n^{−1}) under Assumption 1 implies that, with nonvanishing probability, there is a node in the target cluster that is connected solely to K. This node will serve as a good seed node input for ℓ1-regularized Page Rank. With this choice of seed node, we now give conditions under which the optimal solution x̂ has no false positives with nonvanishing probability.

Theorem 8 (No false positives.) Suppose the same conditions as in Theorem 6. Assume that Assumption 1 holds and that the size of the target cluster satisfies k ≥ 2(c + 3). Fix δ ≤ 0.1. If the regularization parameter satisfies ρ ≤ ρ̄(δ), then there is a good starting node in K such that the solution to Problem (2) with that node as a seed node and with teleportation parameter α ∈ [0.1, 0.9] satisfies support(x̂) ⊆ K, with probability at least 1 − 6 exp(−O(δ²p²k)) − (1 − exp(−1.5c))^k − O(n^{−1}), as long as

    d_j ≥ C(0.5c + 1) for all nodes j ∈ K^c adjacent to K,    (8)

where C > 0 is a universal constant. (More precisely, we assume (1 − δ)p²k ≥ c_0^{−1} δ^{−2} log k for a fixed constant c_0 > 0; then the statement holds with probability at least 1 − 6e^{−c_0 δ²(1−δ)p²k} − (1 − exp(−1.5c))^k − O(n^{−1}).)

The proof is given in Appendix B.3. The proof of Theorem 8 proceeds by showing that, under the conditions of the theorem, the solution to the reduced problem (7) obeys the optimality condition for the full-dimensional ℓ1-regularized Page Rank problem (Lemma 3). To do so, we use concentration inequalities to bound the difference between the solution to (7) and the ℓ1-regularized Page Rank vector on the population graph Ḡ = (V, Ē) (see Appendix A.2). Once we establish the equivalence of the solutions of the two problems, the result follows, since by construction the support set of the solution to (7) is contained in the target cluster.
In the statement of Theorem 8, we required the conditions α ∈ [0.1, 0.9] and δ ≤ 0.1 mainly to avoid overly complicated constants; while this simplifies the presentation of the theorem, it is not difficult to show that a similar result holds more generally. Importantly, when combined with Theorem 6 (full recovery of the cluster), Theorem 8 immediately establishes that ℓ1-regularized Page Rank recovers the target cluster exactly, even when the target cluster is constant-sized. We state this result informally in the following corollary, which requires no proof.

Corollary 9 (Exact recovery; informal statement.) Under the same assumptions as Theorem 6 and Theorem 8, there is a good starting node in K such that ℓ1-regularized Page Rank parameterized with that node as a seed node satisfies support(x̂) = K, with nonvanishing probability.

It should be noted that some condition like (8) on the realized degrees seems necessary in order for ℓ1-regularized Page Rank to have no false positives. The optimization program (2) assigns less weight to low-degree nodes in the ℓ1 penalty, so any node adjacent to K will become active unless the ℓ1-regularized Page Rank penalizes it with a nontrivial weight. Unlike Theorem 6 and Theorem 7, condition (8) rules out some specific models to which Theorem 8 can be applied. For example, the planted clustering model with p = O(1) and q = O(1/n) does not satisfy this condition, because the degrees in K^c do not concentrate. For the stochastic block model, the condition is still satisfied if the nodes adjacent to the target cluster belong to clusters with degree larger than O(1/(γp)) = O(1). In practice, condition (8) may not hold for every node adjacent to K, in which case the nodes that violate it may enter the model as false positives. We nevertheless require the condition here, since our model is essentially local and we have no control outside K beyond its neighbors.

3.3 Comparison with existing results

The local graph clustering problem has been relatively well studied in the area of theoretical computer science, and the existing works largely focus on worst-case guarantees. We now compare our results under the random graph model with the currently known state-of-the-art worst-case results, given by Zhu et al. (2013). First, the main result (Theorem 1) of Zhu et al. (2013), when applied to our population graph Ḡ, implies that Vol(FP), Vol(FN) ≤ Vol(K) · O((1 − γ) log k), as long as Gap = O(1/((1 − γ) log k)) ≥ O(1). When γ = O(1) ∈ (0, 1), our Theorem 6 states that if p²k ≥ O(log k), the output of the ℓ1-regularized Page Rank model does not contain any false negatives. This cannot be deduced from Zhu et al. (2013). In addition, our general bound on the false positives, i.e., Vol(FP) ≤ Vol(K) · (O(1/γ²) − 1) in Theorem 7, is better than the worst-case bound of Zhu et al. (2013) in the regime of large γ, which is typically the case in many interesting scenarios. For instance, when the expected target conductance Cond = 1 − γ is small and fixed, the bound of the worst-case result degrades as the size of the target cluster k increases, whereas our result improves through an increasing probability bound.
In the regime of p = O(1), q = O(1/n), and k = O(1) (hence γ = O(1)), our Theorem 8 shows that the output contains no false positives at all. In this particular case, the strong separability (p = O(1), q = O(1/n)) corresponds to a constant signal-to-noise ratio, since even for q = O(1/n) there is still a constant number of edges outgoing from the target cluster, while the number of internal edges inside the target cluster is also constant. Although in practice the exact recovery of the target cluster may be a strong requirement, nevertheless, for real-world clusters with a high signal-to-noise ratio, ℓ1-regularized Page Rank can still reconstruct the ground-truth clusters more or less exactly (see, for instance, Section 6.2).

Finally, we briefly compare our theoretical results with the information-theoretic results of Chen and Xu (2016) in the special case of the planted clustering model. To ease and simplify the comparison, we only consider the case where p = O(1) and q = O(log n/n). In this case, Chen and Xu (2016, Theorem 2.5) implies that a solution to an SDP achieves exact recovery as long as k ≥ O(log n), whereas our Theorems 6 and 7 suggest that in the same regime ℓ1-regularized Page Rank fully recovers the target cluster while picking up a constant proportion of false positives. Importantly, ℓ1-regularized Page Rank is essentially a local method, while the SDP of Chen and Xu (2016) is a global method that explores the entire graph.

4. Equivalence between ℓ1-regularized Page Rank and APPR

In this section, we illuminate a deep connection between approximate personalized Page Rank (APPR) and ℓ1-regularized Page Rank, and based on this result, we provide novel statistical recoverability guarantees for APPR when the graph is generated from the random graph model. Approximate personalized Page Rank (APPR), first studied in Andersen et al. (2006) and followed up by a number of subsequent works, is at the heart of local graph clustering; its central idea is to approximate the original Page Rank vector without ever touching the entire graph. Gleich and Mahoney (2014) show that APPR implicitly corresponds to adding an ℓ1-norm regularization term to the ℓ2-norm formulation of a min-cut linear program, explaining why APPR gives rise to very sparse solutions. Fountoulakis et al. (2019) further observed that the APPR algorithm is equivalent to an iterative coordinate descent solver applied to the Page Rank linear system (1) with a specifically designed termination criterion. In our notation, we can write the algorithm in a simple and compact way:

Given a parameter ρ > 0, initialize x^{(0)} = 0. For each k ≥ 0, while ‖D^{−1}∇f(x^{(k)})‖_∞ ≥ ρα, iterate the steps

    Choose an i ∈ [n] such that ∇_i f(x^{(k)}) ≤ −ραd_i;
    Update x^{(k+1)}_i = x^{(k)}_i − d_i^{−1} ∇_i f(x^{(k)}).    (9)

Output x^{(k*)} =: x^APPR. (It can be shown that if APPR is initialized at x^{(0)} = 0, then ∇f(x^{(k)}) ≤ 0 for all k ≥ 0; in this case, the termination criterion of APPR reads as ∇_i f(x^{(k)}) > −ραd_i for all i ∈ [n].)

The output of the APPR algorithm is the personalized score vector x^APPR that efficiently approximates the original Page Rank. Note that the updating formula (9) requires one evaluation of the gradient, which can be done by accessing only the neighboring nodes of the selected entry i. As a result, the APPR algorithm can solve the Page Rank equation (1) extremely efficiently; in particular, the running time of the algorithm depends on the size of the output rather than the size of the whole graph. APPR was developed purely from an algorithmic perspective, and the output of the algorithm depends on which coordinate i is chosen at every iteration (see the updating step (9)).
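The following Python sketch mirrors the coordinate scheme (9), written densely for clarity; a real APPR implementation maintains the gradient locally and touches only the neighbors of each updated node, which this illustration does not attempt.

```python
import numpy as np

def appr(A, seed, alpha, rho, max_iters=100_000):
    """Dense sketch of the coordinate scheme (9)."""
    n = A.shape[0]
    d = A.sum(axis=1)
    Q = alpha * np.diag(d) + 0.5 * (1.0 - alpha) * (np.diag(d) - A)
    s = np.zeros(n)
    s[seed] = 1.0
    x = np.zeros(n)
    for _ in range(max_iters):
        grad = Q @ x - alpha * s
        violating = np.flatnonzero(grad <= -rho * alpha * d)
        if violating.size == 0:        # termination criterion of APPR
            break
        i = violating[0]               # any violating coordinate may be chosen
        x[i] -= grad[i] / d[i]
    return x
```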
This property, while allowing the algorithm to enjoy the locality property and resulting in an output vector that is highly sparse, can in principle lead to different local clustering results (albeit results that may look more or less similar). On the other hand, ℓ1-regularized Page Rank eliminates this issue by decoupling the locality/sparsity of the local clustering output vector from the algorithmic issue of running in time independent of the size of the graph. In particular, any convex optimization algorithm that exploits the locality of the problem, such as proximal gradient descent (ISTA), can be employed to minimize the ℓ1-regularized objective function (2) while touching only the neighbors of the current non-zero nodes. Therefore, the formulation via ℓ1-regularized Page Rank achieves the two objectives of graph processing that are desirable for local graph clustering, i.e., both the locality/sparsity of the solution and the strongly local running time of the algorithm.

It is shown in Fountoulakis et al. (2019) that ℓ1-regularized Page Rank can be viewed as a variational formulation of APPR, in the sense that the optimality condition for the ℓ1-regularized objective function (Lemma 3) implies the termination criterion of APPR, i.e., ‖D^{−1}∇f(x^{(k)})‖_∞ < ρα. In this work, we strengthen the connection between the output of APPR and the output of ℓ1-regularized Page Rank. In particular, we show that with appropriate choices of the regularization parameters, the two approaches become equivalent in terms of cluster recovery.

Theorem 10 (Equivalence between ℓ1-reg. PR and APPR.) Let x̂(ρ) be the ℓ1-regularized Page Rank vector (2), and let x^APPR(ρ) be the output of APPR (9) at regularization parameter ρ. Then for any ρ0 > 0,

    support(x̂(ρ0)) ⊆ support(x^APPR(ρ0)) ⊆ support(x̂((1 − α)ρ0/2)).

That is, for any parameter ρ = ρ0 > 0, the support set of the output of APPR is, respectively, a superset and a subset of that of ℓ1-regularized Page Rank at ρ = ρ0 and ρ = (1 − α)ρ0/2.

The proof is given in Appendix B.4. Theorem 10 establishes a rather stronger equivalence between the ℓ1-regularized Page Rank and APPR approaches than was established in the prior work of Fountoulakis et al. (2019). Importantly, using this result, various statistical guarantees for the output cluster of APPR follow directly from the results we establish for the output of ℓ1-regularized Page Rank. While analyzing APPR's output in the random graph setting involves technical challenges due to its algorithmic nature, the fact that the output cluster of ℓ1-regularized Page Rank is given by an optimization solution allows us to analyze its statistical properties more easily. In the previous section (Section 3), we studied the recovery guarantees for ℓ1-regularized Page Rank under a random graph model. In the next section, we show how these guarantees can be transferred to the output cluster of APPR using Theorem 10.
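Theorem 10 is also easy to sanity-check numerically with the illustrative sketches from earlier sections (the hypothetical l1_pagerank_ista and appr functions; A, seed, and alpha are assumed to be in scope, e.g., from the sampler in Section 3):

```python
import numpy as np

rho0 = 1e-3
s_inner = set(np.flatnonzero(l1_pagerank_ista(A, seed, alpha, rho0)))
s_appr  = set(np.flatnonzero(appr(A, seed, alpha, rho0)))
s_outer = set(np.flatnonzero(l1_pagerank_ista(A, seed, alpha, (1 - alpha) * rho0 / 2)))
# support(x_hat(rho0)) ⊆ support(x_APPR(rho0)) ⊆ support(x_hat((1-alpha)*rho0/2)),
# up to the numerical tolerance of the finite ISTA iterations
assert s_inner <= s_appr <= s_outer
```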
4.1 Approximate personalized Page Rank on random graph

One advantage of the variational formulation of local graph clustering, via the ℓ1-regularized convex program (2), is that it allows a tractable analysis of the method on random graphs. Given the connection between ℓ1-regularized Page Rank and APPR established in Theorem 10, this further allows us to obtain statistical guarantees for the APPR algorithm under the random graph model. We formally state this result in the following theorem, which is a simple consequence of Theorem 10, in combination with the results obtained in Sections 3.1 and 3.2. (We write FP(x^APPR) = {i ∈ support(x^APPR) : i ∈ K^c} to denote the collection of false positive nodes in APPR's output.)

Theorem 11 (Recovery guarantees for APPR.) Consider the APPR algorithm given in (9) with parameter ρ = ρ̄(δ). Under the same conditions as Theorem 6, the output vector x^APPR(ρ̄(δ)) satisfies

    K ⊆ support(x^APPR) and Vol(FP(x^APPR)) ≤ Vol(K) · O(1/γ²),

with nonvanishing probability. Furthermore, under the same conditions as Theorem 8, there is a good starting node in K such that APPR's output vector parameterized with that node as a seed node satisfies support(x^APPR) = K, with nonvanishing probability, as long as condition (8) holds for all nodes j ∈ K^c adjacent to K. The proof is given in Appendix B.5.

5. Stagewise Page Rank and solution paths

The ℓ1-regularized Page Rank problem (2) is convex, and there are numerous ways to solve it using convex optimization techniques. In Fountoulakis et al. (2019), the authors apply proximal gradient descent (ISTA) and show that the algorithm enjoys the locality property, meaning that the algorithm touches at most the optimal support set and its neighbors, while inheriting the fast convergence of proximal gradient descent. Optimization algorithms typically solve the problem at a single value of ρ, or at a sequence of values. Path algorithms, on the other hand, are designed to solve the problem for all values of the regularization parameter ρ ∈ (0, ∞), or for a subset of them when terminated early. This is more desirable when we need solutions over a list of parameter values. The idea of designing path algorithms has already gained much attention in the sparse regression literature, first pioneered by Efron et al. (2004) and further developed by Zou et al. (2007); Hastie et al. (2004); Arnold and Tibshirani (2016). This has rendered the exploration of full regression coefficient paths relatively inexpensive. Unlike in the regression setting, however, this type of path algorithm has been less studied in the setting of local graph clustering (Gleich and Kloster, 2016).

Motivated by the path algorithms developed in the sparse regression setting, we consider the following coordinate-wise algorithm for local graph clustering:

Initialize x^{(0)} = 0. For each k ≥ 0, iterate the steps

    Choose the i ∈ [n] for which d_i^{−1}∇_i f(x^{(k)}) ≤ 0 is smallest among [n];
    Update x^{(k+1)}_i = x^{(k)}_i + d_i^{−1} η.    (10)

Output the sequence {x^{(0)}, x^{(1)}, . . .}.

The two main features of this algorithm are: 1) we greedily select, at each iteration, the coordinate i that maximizes the magnitude of the gradient, and 2) we update the current iterate by adding a small step size η to the i-th coordinate. This conservative update of the variable counterbalances the greedy selection step, thus making the algorithm more stable. The above algorithm is called the stagewise algorithm, and in fact it has been widely studied by many authors (Efron et al., 2004; Hastie et al., 2007; Rosset et al., 2004; Rosset and Zhu, 2007; Zhao and Yu, 2007; Tibshirani, 2015).
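A dense sketch of the stagewise scheme (10) follows (again purely illustrative; a strongly local implementation would track the degree-scaled gradient only on the touched nodes and their neighbors):

```python
import numpy as np

def stagewise_path(A, seed, alpha, eta, max_steps=10_000):
    """Stagewise scheme (10): repeatedly pick the coordinate with the most
    negative degree-scaled gradient and increase it by eta / d_i."""
    n = A.shape[0]
    d = A.sum(axis=1)
    Q = alpha * np.diag(d) + 0.5 * (1.0 - alpha) * (np.diag(d) - A)
    s = np.zeros(n)
    s[seed] = 1.0
    x = np.zeros(n)
    path = [x.copy()]
    for _ in range(max_steps):
        g = (Q @ x - alpha * s) / d   # degree-scaled gradient D^{-1} grad f(x)
        i = int(np.argmin(g))
        if g[i] >= 0.0:               # no descent coordinate left
            break
        x[i] += eta / d[i]
        path.append(x.copy())
    return path
```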
The stagewise algorithm is known to have an implicit regularization effect closely related to ℓ1-norm regularization (Tibshirani, 2015); in particular, if each component of the ℓ1-regularized solution has a monotone path, then the stagewise path exactly coincides with that of ℓ1 regularization as η → 0 (Efron et al., 2004; Rosset et al., 2004). Recalling our earlier result on the monotone path property of ℓ1-regularized Page Rank (Lemma 4), the following result is then immediate.

Corollary 12 The output sequence of the stagewise algorithm described in (10) converges to the ℓ1-regularized Page Rank solution path as the step size goes to 0, i.e., η → 0.

Therefore, the stagewise algorithm allows us to provably explore the entire ℓ1-regularization path via a single run of simple iterative steps. Another advantage of the stagewise algorithm is that it enjoys the locality property, in that the algorithm touches only the chosen nodes and their neighbors as it progresses. This is obvious from the update step of (10) and the expression of the gradient ∇_i f(x). Thus, when terminated early, the algorithm produces an approximate and partial solution path without accessing the entire graph. For the ℓ1-regularized Page Rank method, the parameter ρ controls the extent to which the random walk moves away from the seed, so different values of ρ reveal different scales of local clustering structure around the seed node. Therefore, in the setting of local graph clustering, the stagewise algorithm allows us to provably and efficiently track the evolution of an ℓ1-regularized Page Rank diffusion and better understand the local cluster properties of the graph. This is well-suited to exploratory graph analysis, and the idea of using path algorithms for exploring graphs has also been studied in Gleich and Kloster (2016). In addition to exploratory analysis, the stagewise algorithm can be a competitive algorithm for finding the target cluster when the size of the target cluster is small or medium and one needs a fine-scale resolution of the solution path. However, when the size of the target cluster is quite large, using optimization algorithms with a coarse grid of regularization parameters may lead to better computational savings, without exploring the entire solution path from scratch. Overall, the stagewise algorithm should be used in a way complementary to the optimization algorithms that directly solve (2). We also refer the reader to Tibshirani (2015) for a comprehensive study of the stagewise algorithm for general sparse modeling problems.

In Figure 1, we compare the actual solution path to the stagewise algorithm paths for different step sizes. We generate data from the stochastic block model with p = 0.5, q = 0.002, and r = 50 clusters, each of which has 20 nodes. Figure 1 shows the ℓ1-regularization path and the stagewise component paths for one particular draw from the stochastic block model. Here we only show the solution paths for nodes in the target cluster other than the seed node, among n = 1000 nodes (each colored line corresponds to a different node in the target cluster).
Note that when the step size is small, η = 0.0001, the stagewise path appears to closely match the ℓ1-regularization path; for a moderate step size, η = 0.0005, the stagewise path exhibits some jagged pattern but nevertheless accurately approximates the optimal path; and for a relatively large step size, η = 0.001, while the jagged pattern becomes more evident visually, it is clear that the overall trend still coincides well with that of the ℓ1 solution path.

6. Numerical evaluation

In this section, we provide a detailed numerical evaluation to illustrate the performance of ℓ1-regularized Page Rank on synthetic and real data. We have conducted comprehensive experiments to demonstrate the state-of-the-art performance of the method across a wide range of real graphs, which has not been investigated to this extent in prior work. To measure the quality of the recovered cluster, we define Precision and Recall as Vol(TP)/Vol(support(x̂)) and Vol(TP)/Vol(K), respectively, where TP = support(x̂) ∩ K is the set of true positive nodes. The F1 score is the harmonic mean of precision and recall, 2/(Precision^{−1} + Recall^{−1}). We will also make use of conductance, where we recall Cond = Cut(S, S^c)/min(Vol(S), Vol(S^c)). The lower the conductance value, the better the quality of the cluster.
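These volume-weighted metrics are straightforward to compute; a minimal sketch (function names are ours) is:

```python
import numpy as np

def vol(A, S):
    """Vol(S): total degree of the nodes in S."""
    return A[list(S)].sum()

def cluster_metrics(A, recovered, target):
    """Volume-weighted Precision/Recall/F1 and conductance of `recovered`."""
    recovered, target = set(recovered), set(target)
    tp = recovered & target
    precision = vol(A, tp) / vol(A, recovered)
    recall = vol(A, tp) / vol(A, target)
    f1 = 0.0 if not tp else 2.0 / (1.0 / precision + 1.0 / recall)
    rest = set(range(A.shape[0])) - recovered
    cut = A[np.ix_(list(recovered), list(rest))].sum()
    cond = cut / min(vol(A, recovered), vol(A, rest))
    return precision, recall, f1, cond
```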
For our experiments, we solve problem (2) using a proximal coordinate descent algorithm, which enjoys both the strongly local running time property (the running time depends on the size of the cluster rather than on the entire graph) and linear convergence (Fountoulakis et al., 2019). If we further need to explore the entire solution path of (2), the stagewise algorithm with a small step size is used.

6.1 Simulated data

In Figure 2 we demonstrate the performance of the stagewise algorithm (10) with η = 10^{−4} as γ increases. For the experiments in Figure 2, we fix the teleportation parameter α = 0.1. We generate graphs from the stochastic block model with 10 clusters, each of which has 20 nodes and only one of which is the target cluster K. We use the same parameters p and q across different clusters to generate edges within and between clusters. Here we set p = 0.5, and q varies in order to generate the range of γ shown in Figure 2. The results are averaged over 30 trials.

Figure 1: Comparison of the ℓ1-regularization path and the stagewise paths for different step sizes η: (a) ℓ1-reg. path, (b) stagewise path with η = 10^{−4}, (c) stagewise path with η = 5·10^{−4}, and (d) stagewise path with η = 10^{−3}. The profiles are shown only for nodes in K\S among n = 1000 nodes. Each color in the plot corresponds to a different node. The x-axis is the ℓ1 norm of the current estimate, ‖x‖_1. For stagewise, the results are obtained with 7706, 1527, and 764 iterations, respectively.

For this experiment we could use ℓ1-regularized Page Rank or APPR, but all these methods behave similarly, with stagewise having the benefit of not needing a choice of regularization parameter. In particular, in Section 5 we show that for a small enough step size η, the stagewise algorithm explores the piecewise-constant solution path of the ℓ1-regularized Page Rank problem (see Corollary 12). Therefore, Figure 2 reflects the performance of ℓ1-regularized Page Rank as well. Moreover, in Theorem 10 we show an equivalence result between APPR and ℓ1-regularized Page Rank, and later on, in Figure 4, we illustrate this equivalence in practice on real data as well. Therefore, Figure 2 also reflects the approximate performance of APPR.

There are two subfigures in Figure 2. In Subfigure 2(a), we illustrate the performance when the stagewise algorithm touches at most 3/4 of the overall nodes in the graph; after this threshold, the algorithm is terminated. In Subfigure 2(b), we illustrate the performance when the stagewise algorithm touches at most 1/2 of the nodes. We introduce this threshold for two reasons. First, in local graph clustering, it does not make sense to explore more than half of the graph. One could do this, of course, but someone willing to touch more than half of the graph could as well have used non-local algorithms. Second, we do this to illustrate the importance of not letting the algorithm explore a huge part of the graph when the stagewise algorithm (or ℓ1-regularized Page Rank) is combined with metrics such as conductance. Otherwise, one runs the risk of finding a larger or a smaller cluster that has smaller conductance than the target cluster but is very different from it, thus resulting in a small F1 score.

In each subfigure, we illustrate the performance of the stagewise algorithm in three cases. The first (red line with rhombs, "Stagewise, best F1 score") is a best-case scenario where we select the solution with the best F1 score out of all the solutions produced by the stagewise algorithm (10). The second scenario (blue line with squares, "Stagewise, minimum conductance") shows a more realistic case where we select the solution with minimum conductance out of all the solutions produced by the stagewise algorithm. The third scenario (orange line with circles, "Stagewise + sweep-cut") also shows a realistic case where at each iteration of the stagewise algorithm we perform a sweep cut to find a cluster of small conductance, and out of all iterations we return the cluster with the smallest conductance. The sweep-cut rounding procedure is a common technique for post-processing the output of local graph clustering methods; for details, we refer the reader to Andersen et al. (2006); Zhu et al. (2013).

In both subfigures, the F1 score for the "Stagewise, best F1 score" scenario scales linearly as a function of γ. On the other hand, in Subfigure 2(a) the scenario where we select the solution with minimum conductance scales sub-linearly as a function of γ until a phase transition around γ ≈ 0.75, after which the minimum-conductance scenario matches the best-case scenario. In Subfigure 2(a) we get similar results for the "Stagewise + sweep-cut" scenario, although the gap to the best-case scenario becomes smaller. In Subfigure 2(b) we see that the gap between the "Stagewise, best F1 score" scenario and the other two scenarios is much smaller. As discussed above, the gap closes because in Subfigure 2(b) we do not allow the stagewise algorithm to explore more than half of the nodes in the graph, so the method avoids returning clusters of small conductance but low F1 score.

In Figure 3 we demonstrate a much more detailed view of the performance of the stagewise algorithm for four representative values of γ.
More precisely, we show that for large and medium values of γ there exists a solution in the solution path of ℓ1-regularized Page Rank, found by the stagewise algorithm, that recovers the target cluster, while for small γ < 0.5 there is no solution in the solution path of ℓ1-regularized Page Rank that recovers the target cluster with high accuracy.

Figure 2: Performance (F1 score) of the stagewise algorithm (10) as γ increases, under the three scenarios described in the text: red line with rhombs ("Stagewise, best F1 score"), where we select the solution with the best F1 score out of all the solutions produced by the stagewise algorithm; blue line with squares ("Stagewise, minimum conductance"), where we select the solution with minimum conductance out of all the solutions produced; and orange line with circles ("Stagewise + sweep-cut"), where at each iteration of the stagewise algorithm we perform a sweep cut to find a cluster of small conductance and, out of all iterations, return the cluster with the smallest conductance. Panel (a): F1 score against γ, with stagewise terminated when 3/4 of the overall nodes are nonzero. Panel (b): F1 score against γ, with stagewise terminated when 1/2 of the overall nodes are nonzero.

For Figure 3 we use the same experimental setting as in Figure 2. For large γ (Figure 3(a)), we observe that when the stagewise algorithm recovers about 20 nodes, these nodes correspond to very high precision and recall, and as the number of nodes in the solution increases, precision decreases. We also observe that the conductance of the recovered cluster is a good metric for finding the target cluster, meaning that we will find the target cluster with high precision and recall if, out of all solutions on the path, we choose the one with minimum conductance. As γ gets smaller, the minimum conductance no longer relates to the target cluster. However, it is clear from Figures 3(b) and 3(c) that the stagewise algorithm with minimum conductance still finds the target cluster K with good accuracy if the algorithm is terminated early. Finally, in Figure 3(d) we demonstrate a case where γ is small, the conductance of the solution fails to relate to the target cluster, and there is no output of the stagewise algorithm that recovers the target cluster accurately.

6.2 Real data experiments

Next, we test the performance of local graph clustering methods using biology networks and social networks. All the graphs that we consider are unweighted and undirected. For a summary of the datasets, see Table 1. All datasets come with suggested ground-truth clusters for the social networks and with a gold standard for the biology networks. However, we filter the given ground-truth clusters by measuring their conductance values. In particular, we keep only the ground-truth clusters that have conductance value less than or equal to 0.6; that is, we keep the clusters for which the number of edges crossing the cluster is at most 60% of the volume of the cluster. This way we obtain a wide range of clusters, from good (small conductance) to noisy (large conductance, i.e., close to 0.6).
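This filtering step can be expressed compactly; a sketch (hypothetical helper, reusing vol from the metrics sketch above):

```python
import numpy as np

def filter_clusters(A, clusters, max_cond=0.6):
    """Keep only the ground-truth clusters with conductance <= max_cond."""
    n = A.shape[0]
    kept = []
    for C in clusters:
        Cc = set(range(n)) - set(C)
        cut = A[np.ix_(sorted(C), sorted(Cc))].sum()
        if cut / min(vol(A, C), vol(A, Cc)) <= max_cond:
            kept.append(C)
    return kept
```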
Table 1: Summary of datasets.

dataset      nodes      edges       description
Sfld         232        15570       pairwise similarities of blasted sequences of proteins
PPI-mips     1096       13221       protein-protein interaction network
FB-Johns55   5157       186572      Facebook social network for Johns Hopkins University
Colgate88    3482       155043      Facebook social network for Colgate University
Orkut        3072441    117185083   large-scale on-line social network

6.2.1 DATASETS

Sfld. This dataset contains pairwise similarities of blasted sequences of 232 proteins belonging to the amidohydrolase superfamily (Brown et al., 2006). There are 232 nodes and 31140 edges in this graph. A gold standard is provided describing families within the given superfamily. According to the gold standard, the amidohydrolase superfamily contains 29 families/clusters. However, after filtering the families we find that only two have conductance value less than 0.6. (This should not be surprising, given that the graph is so dense.) In Table 2 we present properties of these two clusters, which correspond to the urease.0 and AMP families.

PPI-mips. This dataset is a protein-protein interaction graph of mammalian species (Pagel et al., 2004). There are 1096 nodes and 26442 edges in this graph. In Table 2 we provide all suggested ground-truth clusters that have conductance less than 0.6.

Figure 3: In panels 3(a), 3(b), and 3(c) we illustrate that for large and medium values of γ (γ = 0.91, 0.86, and 0.65, respectively) the stagewise algorithm recovers the target cluster, while for small γ (γ = 0.44, panel 3(d)) the stagewise algorithm does not recover the target cluster with high accuracy. Each panel plots precision, recall, and conductance against the number of active (nonzero) nodes in the stagewise solution; the vertical line indicates the cardinality of the target cluster (k = 20).

Table 2: Ground-truth clusters for the Sfld and PPI-mips datasets. Full details about the datasets can be found in the original paper (Pagel et al., 2004). In this table we report the volume, the number of nodes, and the conductance of each ground-truth cluster.
dataset    feature                                  short name   volume   nodes   conductance
Sfld       urease.0                                 urease       16209    100     0.42
Sfld       AMP                                      AMP          1721     28      0.56
PPI-mips   Actin-associated-proteins                Actin        870      24      0.36
PPI-mips   Anaphase-promoting-complex               Anaphase     165      11      0.33
PPI-mips   Cdc28p-complexes                         Cdc28p       135      10      0.33
PPI-mips   Coat-complexes                           Coat         676      19      0.49
PPI-mips   cytoplasmic-ribosomal-large-subunit      ct-large     9720     81      0.33
PPI-mips   cytoplasmic-ribosomal-small-subunit      ct-small     4788     57      0.33
PPI-mips   F0-F1-ATP-synthase                       F0-F1-ATP    315      15      0.33
PPI-mips   H+-transporting-ATPase-vacuolar          H+           315      15      0.33
PPI-mips   mitochondrial-ribosomal-large-subunit    mc-large     1488     32      0.33
PPI-mips   mitochondrial-ribosomal-small-subunit    mc-small     273      14      0.33
PPI-mips   Mitochondrial-translation-complexes      mc-complex   281      12      0.53
PPI-mips   mRNA-splicing                            mRNA         1861     33      0.43
PPI-mips   Nuclear-pore-complex                     Nuclear      614      18      0.50
PPI-mips   RNA-polymerase-II-holoenzyme             RNA          1487     29      0.45
PPI-mips   Spindle-pole-body                        Spindle      981      22      0.52
PPI-mips   TRAPP-complex                            TRAPP        135      10      0.33
PPI-mips   tRNA-splicing                            tRNA         165      11      0.33
PPI-mips   19-22S-regulator                         19-22S       459      18      0.33
PPI-mips   20S-proteasome                           20S          315      15      0.33

FB-Johns55. This is an anonymized Facebook dataset from a particular day in September 2005 for the student social network at Johns Hopkins University. The graph is unweighted and represents friendship ties. The data form a subset of the Facebook100 dataset from Traud et al. (2011, 2012). This graph has 5157 nodes and 186572 edges. The dataset comes with 6 features per node: second major, high school, gender, dorm, major index, and year. We construct ground-truth clusters by using the features of each node. In particular, we consider nodes with the same value of a feature to be a cluster, e.g., students of year 2009. As analyzed in the original paper that introduced these datasets (Traud et al., 2012), for FB-Johns55 the year and major index features give non-trivial assortativity coefficients, a local measure of homophily. This agrees with the ground-truth clusters we find after applying our filtering technique. The clusters per graph are shown in Table 3.

Colgate88. This graph is constructed similarly to the FB-Johns55 graph, but for Colgate University. We apply the same filtering techniques as for FB-Johns55, and we present the filtered clusters in Table 3. There are 3482 nodes and 155043 edges in this graph.

Orkut. Orkut is a free on-line social network where users form friendships with each other. Orkut also allows users to form groups, which other members can then join. This dataset has 3072441 nodes and 117185083 edges. It can be downloaded from Leskovec and Krevl (2014). The dataset comes with 5000 ground-truth communities, which we filter by keeping the clusters with conductance less than or equal to 0.6. Out of the 5000 communities, only 282 pass our filtering test.

Table 3: Ground-truth clusters for the FB-Johns55 and Colgate88 datasets.

dataset      feature           volume   nodes   conductance
FB-Johns55   year 2006         81893    845     0.54
FB-Johns55   year 2007         89021    842     0.49
FB-Johns55   year 2008         82934    926     0.39
FB-Johns55   year 2009         33059    910     0.21
FB-Johns55   major index 217   10697    201     0.26
FB-Johns55   second major 0    178034   2844    0.51
FB-Johns55   dorm 0            137166   2121    0.52
FB-Johns55   gender 1          181656   2144    0.46
FB-Johns55   gender 2          173524   2598    0.48
Colgate88    year 2004         14888    230     0.54
Colgate88    year 2005         50643    501     0.50
Colgate88    year 2006         62065    557     0.48
Colgate88    year 2007         68382    589     0.41
Colgate88    year 2008         62430    641     0.29
Colgate88    year 2009         35379    641     0.11
Colgate88    second major 0    175239   2107    0.54
Colgate88    dorm 0            100414   1157    0.53
Colgate88    gender 1          162759   1695    0.48
Colgate88    gender 2          123724   1485    0.55

The above real graphs may not be generated from the random model that we consider in Section 3; nonetheless, we evaluate the method on them to illustrate scenarios that are not just idealized.
Moreover, the Facebook social networks and the biology networks are expected to exhibit homophily structure for some ground-truth clusters that are highly inter-connected relative to the rest of the graph. These block-structured networks belong to a special class of networks, and we believe that random graph models, such as the stochastic block model, can closely approximate them.

6.2.2 METHODS

We compare the performance of ℓ1-regularized Page Rank with state-of-the-art local graph clustering algorithms. There are two categories of local graph clustering methods: local spectral methods (Andersen et al., 2006) and their follow-up work (Zhu et al., 2013); and local flow-based methods (Lang and Rao, 2004; Andersen and Lang, 2008; Orecchia and Zhu, 2014; Veldt et al., 2016).

Approximate personalized Page Rank (APPR): In Andersen et al. (2006) and Zhu et al. (2013), an approximate personalized Page Rank (APPR) algorithm is proposed, where the personalized Page Rank linear system is solved approximately using a local diffusion process. Both papers study nearly identical algorithms, with the latter paper giving better theoretical guarantees based on a stronger assumption. In particular, the authors of Zhu et al. (2013) assume that the internal connectivity of the target cluster (the minimum conductance of the induced subgraph of the target cluster) is quadratically stronger than the conductance of the target cluster.

Simple Local (SL): Of the three flow-based methods, Flow Improve, Local Flow Improve, and Simple Local in Andersen and Lang (2008); Orecchia and Zhu (2014); Veldt et al. (2016), respectively, Simple Local (Veldt et al., 2016) simplifies and generalizes the methods proposed in Andersen and Lang (2008); Orecchia and Zhu (2014), while having similar theoretical and practical guarantees in terms of the quality of the output. Depending on the parameter tuning of Simple Local, one can obtain Flow Improve or one of the two variants of Local Flow Improve. In particular, there are two variants of Local Flow Improve in Orecchia and Zhu (2014): an exact and an inexact variant. The exact variant is equivalent to Simple Local; the only difference is how one solves the sub-problems at each iteration, see Section 8 in Fountoulakis et al. (2020a) for details. In the inexact variant, each sub-problem is solved using Dinic's algorithm with early termination. This version has slightly worse guarantees in terms of conductance than Simple Local, although in practice we expect both methods to perform similarly; we expect the differences between the methods to be merely computational, and even the computational differences should be minor, since all of these methods have strongly local running times. Since there exists no implementation of the inexact variant of Local Flow Improve, we do not implement it. Therefore, we use only Simple Local in our experiments, with different parameter tunings, so that we obtain performance representative of a wide range of methods. Moreover, Simple Local requires initialization with a set of seed nodes that ideally has some overlap with the target cluster. Therefore, in this paper, we use Simple Local with two different initialization techniques. Below we comment on why we choose the following inputs to Simple Local.

1. ℓ1-reg. PR-SL: Simple Local using the output of ℓ1-reg. PR as input.
2. BFS-SL: We initialize Simple Local using the output of a breadth-first-search-type (BFS) algorithm starting from a given seed node. The algorithm used for the initialization of Simple Local is shown in Algorithm 1. Details about how we use this algorithm are given in the next subsection. Briefly, Algorithm 1 is used as a seed node expansion technique, which has also been used in Veldt et al. (2016).

Note that all these flow-based methods are "improve" methods, i.e., they are not stand-alone, and they require initial input from some other stand-alone method, such as APPR or ℓ1-reg. PR. There are both theoretical and practical reasons for this. It is a theoretical point because, in the theoretical analysis of Simple Local (see Theorem 3 in Orecchia and Zhu (2014)), the input to Simple Local is required to have sufficient overlap with the target cluster. Otherwise, there is currently no guarantee that flow-improve methods output a reasonable approximation to the target cluster in terms of its conductance value (Theorem 1 in Orecchia and Zhu (2014)) or in terms of false and true positives (Theorem 3 in Orecchia and Zhu (2014)). Theorem 3 in Orecchia and Zhu (2014) suggests that sufficient overlap can be achieved by using a local spectral method such as APPR or ℓ1-reg. PR; in this paper, we use only the latter. This is because, as shown in Fountoulakis et al. (2019), APPR is very similar to ℓ1-reg. PR both theoretically and empirically; we also show extensive experiments on the similarity of APPR and ℓ1-reg. PR in this section. In practice, one could instead use heuristics such as a BFS-type algorithm to slightly expand a seed node within a target cluster and then provide the output of the BFS-type algorithm as input to Simple Local; see also Veldt et al. (2016). In Algorithm 1 we provide pseudo-code for the BFS-type algorithm that we use in this paper. The only difference from a standard BFS algorithm is that we explore neighbors in batches, which allows us to terminate the BFS algorithm after a given number of steps.

6.2.3 PARAMETER TUNING

APPR and ℓ1-reg. PR have two parameters, the teleportation parameter α and a tolerance/ℓ1-regularization parameter ρ. The teleportation parameter should be set according to the reciprocal of the mixing time of a random walk within the target cluster, which is equal to the smallest nonzero eigenvalue of the normalized Laplacian of the subgraph corresponding to the target cluster; see Zhu et al. (2013) for details. Let us denote this eigenvalue by λ. In our case the target cluster is a given ground-truth cluster, and we use this information to set the parameter α. In particular, for each node in the target cluster we run APPR and ℓ1-reg. PR 4 times, where α is set based on a range of values in [λ/2, 2λ] with a step of (2λ - λ/2)/4. The regularization parameter of ℓ1-reg. PR has nearly the same purpose as the tolerance parameter of APPR. Both parameters are set to be proportional to the inverse of the volume of the target cluster, as suggested by theoretical arguments in this paper as well as in Andersen et al. (2006); Zhu et al. (2013); Fountoulakis et al. (2019). For each parameter setting, we use a sweep cut to round the output of APPR and ℓ1-reg. PR to a cluster of low conductance. The sweep-cut rounding procedure is a common technique to post-process the output of local graph clustering methods; for details, we refer the reader to Andersen et al. (2006); Zhu et al. (2013); Fountoulakis et al. (2019).
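The α grid described above can be computed as in the following sketch, assuming networkx and a connected target subgraph; alpha_grid is our own name, not the authors' tuning code.

```python
import numpy as np
import networkx as nx

def alpha_grid(G, target, num=4):
    """Candidate teleportation parameters for APPR / l1-reg. PR.

    lam is the smallest nonzero eigenvalue of the normalized Laplacian of
    the subgraph induced by the target cluster (assumed connected); the
    grid [lam/2, 2*lam] with step (2*lam - lam/2)/num follows the tuning
    described in the text.
    """
    L = nx.normalized_laplacian_matrix(G.subgraph(target)).toarray()
    eigvals = np.sort(np.linalg.eigvalsh(L))
    lam = eigvals[1]  # eigvals[0] is ~0 for a connected subgraph
    step = (2 * lam - lam / 2) / num
    return [lam / 2 + i * step for i in range(num)]
```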
We use a proximal coordinate descent algorithm (Fountoulakis et al., 2019) to solve the ℓ1-reg. PR problem. Over all parameter settings, we return the cluster with the lowest conductance value as the output of APPR and ℓ1-reg. PR. Simple Local has only one parameter, denoted by δ. The δ parameter controls how localized the output is going to be around the input seed nodes (Veldt et al., 2016); it also controls the quality of the output in terms of its conductance value (Theorem 1 in Orecchia and Zhu (2014)) or in terms of false and true positives (Theorem 3 in Orecchia and Zhu (2014)). The parameter has to satisfy δ ≥ 0, and according to Orecchia and Zhu (2014) it should be set such that $\mathrm{Vol}(R)/\mathrm{Vol}(V\setminus R) + \delta = \frac{1}{3/\sigma + 3}$, where R is the set of seed nodes given as input to Simple Local, and σ is a parameter that satisfies $\sigma \in [\mathrm{Vol}(R)/\mathrm{Vol}(V\setminus R),\, 1]$. It is suggested in Orecchia and Zhu (2014) to set $\sigma = O(\mathrm{Vol}(R\cap K)/\mathrm{Vol}(K))$, where K denotes the target cluster. Therefore, in our experiments we set

$$\sigma = \min\left\{\max\left\{\frac{\mathrm{Vol}(R)}{\mathrm{Vol}(V\setminus R)},\ \frac{\mathrm{Vol}(R\cap K)}{\mathrm{Vol}(K)}\right\},\ 1\right\}.$$

When we use ℓ1-reg. PR as input to Simple Local, this parameter setting also provides the theoretical guarantees claimed in Orecchia and Zhu (2014) (assuming that all assumptions about the target cluster are satisfied). We set the parameter steps = n in Algorithm 1, and we terminate Algorithm 1 when the current set of seed nodes, denoted by seeds, satisfies $\mathrm{Vol}(\mathrm{seeds}\cap K)/\mathrm{Vol}(K) \geq 0.75$, i.e., the output of Algorithm 1 has at least 75% overlap with the target cluster, or when $\mathrm{Vol}(\mathrm{seeds})/\mathrm{Vol}(G) \geq 0.25$, i.e., the volume of the output becomes larger than 25% of the volume of the graph. This setting is also motivated by Theorem 3 in Orecchia and Zhu (2014), which requires that the input to Simple Local have 75% overlap with the target cluster. However, since this relies on the assumption that the input set of nodes is the output of APPR or ℓ1-reg. PR, which are themselves clustering algorithms, while Algorithm 1 is not, we also add the second termination criterion, that the output cannot have volume more than 25% of the volume of the graph.

Algorithm 1: A modified BFS algorithm for seed set expansion
  Input: seeds - set of seed nodes that we want to expand
  Output: seeds - updated/expanded set of seeds
  Parameters: steps - number of steps of the algorithm
  Create a queue Q and enqueue all nodes in seeds; mark all nodes in seeds as visited; set step = 1;
  while step <= steps do
      k = 1 and l = size of Q;
      while k <= l do
          remove node u from the head of Q;
          mark and enqueue all unvisited neighbours of u;
          add the newly visited nodes to the set seeds;
          increase k;
      end
      increase step;
  end
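The σ and δ settings just described can be computed as in the following sketch; simple_local_params is a hypothetical helper of ours, using the reconstruction of σ given above.

```python
import networkx as nx

def simple_local_params(G, R, K):
    """Sigma and delta settings for Simple Local, as described in the text.

    R is the input seed set and K the target cluster; this is a sketch of
    the tuning rule, not part of any Simple Local implementation.
    """
    vol_R = nx.volume(G, R)
    vol_rest = nx.volume(G, set(G) - set(R))
    vol_K = nx.volume(G, K)
    vol_overlap = nx.volume(G, set(R) & set(K))
    # sigma = min{ max{ vol(R)/vol(V\R), vol(R n K)/vol(K) }, 1 }
    sigma = min(max(vol_R / vol_rest, vol_overlap / vol_K), 1.0)
    # delta chosen so that vol(R)/vol(V\R) + delta = 1/(3/sigma + 3)
    delta = max(1.0 / (3.0 / sigma + 3.0) - vol_R / vol_rest, 0.0)
    return sigma, delta
```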
6.2.4 RESULTS

APPR and ℓ1-reg. PR are similar. We start our empirical observations by comparing APPR and ℓ1-reg. PR. We already proved in Theorem 10 that these methods are similar. However, we would like to point out some additional details, which will help justify the simplification of the experiments in the remainder of this section. In Figure 4, we make a much more extensive empirical comparison between the two local spectral methods, where we compare their precision, recall, and F1 score. We use the Orkut dataset, and we compare the two methods using all 282 ground-truth clusters of this dataset. In this figure, we present average results over all nodes for each given ground-truth cluster. We observe that APPR and ℓ1-reg. PR produce output with nearly identical precision, recall, and F1 score. We attribute any minor differences between the two methods to the minor differences between the optimality conditions of ℓ1-reg. PR and APPR. Moreover, APPR and proximal coordinate descent for ℓ1-reg. PR have similar running times in practice when producing the results in Figure 4. This is justified by the worst-case analysis in Fountoulakis et al. (2019) as well as by our average-case analysis in Theorem 11. In particular, each method required about 75 hours to produce Figure 4, which involved 304960 calls to APPR and ℓ1-reg. PR. Due to the similarity of the two approaches, we use only ℓ1-reg. PR in the subsequent experiments. Note that in this experiment we do not show the performance of flow-based methods, because, based on our empirical observations on small datasets like FB-Johns55 and Colgate88, it would have taken close to a month to obtain plots like those in Figure 4. (Improving this situation for local flow-improve methods is an important direction for future work.) For example, see the running times of flow-based methods for FB-Johns55 and Colgate88 in Table 7.

APPR and ℓ1-reg. PR work well for low conductance target clusters. Here, we comment on the performance of APPR and ℓ1-reg. PR in Figure 4 on the Orkut dataset. We observe that the performance of both methods decreases as the conductance of the target cluster increases, i.e., as the cluster quality gets worse. This is an expected outcome, predicted by the average-case analysis of this paper. However, it is important to mention that, for target clusters with conductance close to 0.4, the F1 score of ℓ1-reg. PR is around 0.8. This is surprisingly good performance, since a conductance of 0.4 means that the number of edges crossing the cluster boundary is 40% of the cluster's volume (which implies that such target clusters are actually quite noisy and not of particularly high quality).

Figure 4: Similarity of APPR and ℓ1-reg. PR: F1 score (a), recall (b), and precision (c) against the conductance of the ground-truth clusters for the Orkut dataset (282 ground-truth clusters). This plot demonstrates two properties of the algorithms. First, it shows the performance of ℓ1-reg. PR and APPR as the conductance of the target cluster increases (horizontal axis): as the conductance becomes larger, overall performance decreases. Second, it shows that ℓ1-reg. PR and APPR obtain nearly identical results, which is also justified by our theoretical arguments. See the main text for details.
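For concreteness, the following is a minimal dense sketch of a proximal coordinate descent solver for the ℓ1-regularized Page Rank objective, in the spirit of the solver of Fountoulakis et al. (2019) discussed above. It is an illustration under simplifying assumptions: it stores the full matrix and sweeps all coordinates, so it lacks the locality that makes the actual solver strongly local, and all names are ours.

```python
import numpy as np

def l1_reg_pr_cd(A, seed, alpha, rho, max_iter=10000, tol=1e-8):
    """Proximal coordinate descent sketch for l1-regularized Page Rank.

    Minimizes (1/2) x'Qx - alpha*s'x + rho*alpha*||Dx||_1 with
    Q = alpha*D + (1-alpha)/2 * L, where L = D - A. Since the optimal
    solution is nonnegative, each coordinate step is a nonnegative
    soft-thresholding update.
    """
    d = A.sum(axis=1)
    n = A.shape[0]
    s = np.zeros(n)
    s[seed] = 1.0
    Q = alpha * np.diag(d) + 0.5 * (1 - alpha) * (np.diag(d) - A)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            grad_i = Q[i] @ x - alpha * s[i]
            # exact coordinate minimizer, clipped at zero
            x[i] = max(x[i] - (grad_i + rho * alpha * d[i]) / Q[i, i], 0.0)
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x
```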
Results for the biology datasets Sfld and PPI-mips. The results for the biology datasets are shown in Table 4. We present average results over all nodes for each given ground-truth cluster. We denote with bold numbers the performance of the method with the largest F1 score among all methods. We note the consistent state-of-the-art performance of ℓ1-reg. PR for all clusters in Table 4. For some ground-truth clusters, ℓ1-regularized Page Rank perfectly recovers the target cluster, which is mainly attributed to the fact that these ground-truth clusters have a strong separability property (see also Corollary 9 of the main text). In most experiments, Simple Local did not improve the performance of its ℓ1-reg. PR input, but it also did not make it worse. In some cases, like the Spindle ground-truth cluster in the PPI-mips dataset, Simple Local decreased the performance of ℓ1-reg. PR in terms of the F1 score. This is because Simple Local found clusters that have smaller conductance, but that do not correspond to the clusters with the highest F1 score. This is a known issue that has been discussed in Fountoulakis et al. (2017). BFS-SL has the worst performance among all methods in most experiments. In fact, BFS-SL performs well only for the urease ground-truth cluster in the Sfld dataset. The performance of BFS-SL is especially poor for all ground-truth clusters in the PPI-mips dataset. It is important to mention that we did experiment with different parameter tunings for both the BFS-type Algorithm 1 and Simple Local in the BFS-SL method, but the performance was poor for all settings of parameters that we tried. We attribute the poor performance of BFS-SL to the BFS-type algorithm, which provides the input to Simple Local. In particular, the BFS-type Algorithm 1 is not related to clustering in any general sense, and this translates into poor-quality input to Simple Local. As mentioned in the theoretical analyses in Orecchia and Zhu (2014); Veldt et al. (2016), Simple Local requires as input the output of a local spectral method such as ℓ1-reg. PR in order to perform well, which is also verified by the results in Table 4.

Table 4: Results for the biology datasets Sfld and PPI-mips. In this table we present average F1 scores over all nodes for each given ground-truth cluster. We denote with bold numbers the performance of the method with the largest score among all methods.

dataset    feature      ℓ1-reg. PR   BFS-SL   ℓ1-reg. PR-SL
Sfld       urease       0.75         0.42     0.38
Sfld       AMP          0.86         0.86     0.86
PPI-mips   Actin        0.98         0.04     0.98
PPI-mips   Anaphase     1.00         0.09     1.00
PPI-mips   Cdc28p       1.00         0.10     1.00
PPI-mips   Coat         0.85         0.04     0.85
PPI-mips   ct-large     1.00         0.02     1.00
PPI-mips   ct-small     1.00         0.05     1.00
PPI-mips   F0-F1-ATP    1.00         0.06     1.00
PPI-mips   H+           1.00         0.06     1.00
PPI-mips   mc-large     1.00         0.03     1.00
PPI-mips   mc-small     1.00         0.07     1.00
PPI-mips   mc-complex   0.78         0.07     0.79
PPI-mips   mRNA         0.93         0.03     0.93
PPI-mips   Nuclear      0.85         0.05     0.85
PPI-mips   RNA          0.87         0.03     0.80
PPI-mips   Spindle      0.85         0.03     0.82
PPI-mips   TRAPP        1.00         0.10     1.00
PPI-mips   tRNA         1.00         0.09     1.00
PPI-mips   19-22S       1.00         0.05     1.00
PPI-mips   20S          1.00         0.06     1.00

The corresponding running times for each method are shown in Table 5. Each numeric value in that table is the total running time for a given ground-truth cluster, i.e., the sum of the running times over all nodes in the ground-truth cluster. Note that ℓ1-reg. PR is the fastest method.

Table 5: Running times for the biology datasets Sfld and PPI-mips. The numeric entries of the table show the total time in seconds over all nodes of a given ground-truth cluster. The running times of ℓ1-reg. PR were less than 0.1 seconds for most experiments, and they have been rounded up to 0.1.

dataset    feature      ℓ1-reg. PR   BFS-SL   ℓ1-reg. PR-SL
Sfld       urease       17.2         29.8     151.2
Sfld       AMP          0.2          2.5      1.6
PPI-mips   Actin        0.1          1.8      0.6
PPI-mips   Anaphase     0.1          0.8      0.1
PPI-mips   Cdc28p       0.1          0.7      0.1
PPI-mips   Coat         0.1          1.5      0.3
PPI-mips   ct-large     3.6          8.5      16.5
PPI-mips   ct-small     1.3          10.0     6.6
PPI-mips   F0-F1-ATP    0.1          1.0      0.1
PPI-mips   H+           0.1          1.1      0.1
PPI-mips   mc-large     0.3          3.3      2.0
PPI-mips   mc-small     0.1          0.9      0.1
PPI-mips   mc-complex   0.1          0.8      0.2
PPI-mips   mRNA         0.3          3.5      2.4
PPI-mips   Nuclear      0.1          1.3      0.3
PPI-mips   RNA          0.2          1.5      2.7
PPI-mips   Spindle      0.1          2.3      0.7
PPI-mips   TRAPP        0.1          0.7      0.1
PPI-mips   tRNA         0.1          0.7      0.1
PPI-mips   19-22S       0.1          1.3      0.2
PPI-mips   20S          0.1          1.0      0.1

Results for the social networks FB-Johns55 and Colgate88. The results for the social network graphs are shown in Table 6. There are many interesting observations for this set of experiments. First, ℓ1-reg. PR outperforms BFS-SL, with the exception of the ground-truth cluster major index 217 in FB-Johns55, where BFS-SL has a 0.03 larger F1 score, and the ground-truth cluster year 2009 in Colgate88, where BFS-SL has the same F1 score as ℓ1-reg. PR. We observed two reasons why BFS-SL has worse performance in most experiments. The first is that BFS-SL outputs a cluster that has smaller conductance than the output cluster of ℓ1-reg. PR, but smaller conductance is often not related to the ground-truth cluster, especially in cases where the ground-truth cluster itself has large conductance. We attribute this behavior to SL, because as an algorithm it attempts to find a cluster of small conductance. The second reason is that the input to SL from the BFS-type Algorithm 1 is not a good approximation to the ground-truth cluster, which is a property SL requires in order to perform well.

The second set of observations concerns ℓ1-reg. PR-SL. For most ground-truth clusters, we observe that ℓ1-reg. PR-SL performs worse than or on par with ℓ1-reg. PR, with the exception of the ground-truth clusters year 2009 and major index 217 in FB-Johns55 and the ground-truth clusters year 2008 and year 2009 in Colgate88. When ℓ1-reg. PR-SL makes the output of ℓ1-reg. PR worse, it is because the former finds a cluster with a better conductance value that does not relate to the ground-truth cluster. We observe this behavior very often when the target cluster does not have small conductance, and this is also confirmed by our simulation study (see Figure 3(d) in Section 6.1). When ℓ1-reg. PR-SL performs better, it is because the ground-truth cluster has small conductance, but not small enough for ℓ1-reg. PR to perform well by itself: ℓ1-reg. PR leaks more mass outside the target cluster than it should, which results in small precision. This is a well-known problem that has also been observed in Fountoulakis et al. (2017), and it can be fixed by SL. The corresponding running times for each method are shown in Table 7; the running time is the sum of the running times over all nodes in the ground-truth cluster. Note that ℓ1-reg. PR is the fastest method.

Memory scalability for small- and large-scale graphs. One of the main motivations of the ℓ1-regularized Page Rank model is its low memory requirements for each seed node. In Table 8 we illustrate the memory requirements of solving the ℓ1-regularized Page Rank problem using proximal coordinate descent. In particular, we illustrate how the memory requirement increases as ρ decreases. As expected, for large values of ρ the output solution is very sparse and the algorithm has very low memory requirements. As ρ gets smaller, the solutions become denser and the memory requirements increase. We observe similar behavior for FB-Johns55 (a small dataset) and Orkut (a large dataset).

7. Conclusion

In this paper, we have examined the ℓ1-regularized Page Rank optimization problem for local graph clustering.
In a local graph clustering problem, the objective is to find a single target cluster in a large graph, given a seed node in the cluster, and to do so in a running time that does not depend on the size of the entire graph, but instead depends on the size of the output cluster. Algorithmic results (i.e., running-time bounds to achieve a given cluster quality) for local graph clustering abound, but our results are the first statistical results (i.e., where one is interested in recovering a cluster under a hypothesized model). Under our local random model, we show that the optimal support of ℓ1-regularized Page Rank identifies the target cluster with bounded false positives, and in certain settings exact recovery is also possible. We further establish a strong connection between ℓ1-regularized Page Rank and approximate personalized Page Rank (APPR), based on which we obtain similar statistical guarantees for APPR under the random model. Additionally, we have brought the idea of solution path algorithms from the sparse regression literature to the local graph clustering literature, and we showed that the forward stagewise algorithm gives a provable approximation to the entire ℓ1 regularization path. Finally, we demonstrated the state-of-the-art performance of ℓ1-regularized Page Rank on both simulated and real data graphs.

Table 6: Results for the Facebook datasets FB-Johns55 and Colgate88. In this table we present average F1 scores over all nodes for each given ground-truth cluster. We denote with bold numbers the performance of the method with the largest score among all methods.

dataset      feature           ℓ1-reg. PR   BFS-SL   ℓ1-reg. PR-SL
FB-Johns55   year 2006         0.32         0.13     0.23
FB-Johns55   year 2007         0.43         0.17     0.31
FB-Johns55   year 2008         0.50         0.34     0.36
FB-Johns55   year 2009         0.84         0.78     0.89
FB-Johns55   major index 217   0.85         0.88     0.88
FB-Johns55   second major 0    0.41         0.20     0.20
FB-Johns55   dorm 0            0.46         0.13     0.08
FB-Johns55   gender 1          0.42         0.21     0.21
FB-Johns55   gender 2          0.46         0.19     0.18
Colgate88    year 2004         0.42         0.29     0.44
Colgate88    year 2005         0.44         0.15     0.43
Colgate88    year 2006         0.46         0.27     0.39
Colgate88    year 2007         0.54         0.25     0.46
Colgate88    year 2008         0.75         0.56     0.88
Colgate88    year 2009         0.96         0.96     0.98
Colgate88    second major 0    0.49         0.24     0.25
Colgate88    dorm 0            0.46         0.04     0.26
Colgate88    gender 1          0.45         0.21     0.25
Colgate88    gender 2          0.34         0.20     0.26

Table 7: Running times for the Facebook datasets FB-Johns55 and Colgate88. The numeric entries of the table show the total time in seconds over all nodes of a given ground-truth cluster.

dataset      feature           ℓ1-reg. PR   BFS-SL   ℓ1-reg. PR-SL
FB-Johns55   year 2006         307          28126    56022
FB-Johns55   year 2007         453          28083    61462
FB-Johns55   year 2008         1125         25855    62668
FB-Johns55   year 2009         852          15736    15090
FB-Johns55   major index 217   80           728      768
FB-Johns55   second major 0    8866         91500    308041
FB-Johns55   dorm 0            10865        70084    260485
FB-Johns55   gender 1          1533         69265    205031
FB-Johns55   gender 2          13080        83836    301843
Colgate88    year 2004         11           1482     659
Colgate88    year 2005         70           5800     7679
Colgate88    year 2006         99           5471     11398
Colgate88    year 2007         154          6198     11641
Colgate88    year 2008         287          4756     6900
Colgate88    year 2009         530          2882     2507
Colgate88    second major 0    2876         24525    71937
Colgate88    dorm 0            449          15184    36992
Colgate88    gender 1          1533         20347    53016
Colgate88    gender 2          430          16854    39537

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637). This material is also based in part on research sponsored by DARPA and the Air Force Research Laboratory under agreement number FA8750-17-2-0122. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and the Air Force Research Laboratory or the U.S. Government. We would also like to acknowledge ARO, NSF, ONR, and the Berkeley Institute for Data Science for providing partial support of this work.

Table 8: Memory scalability of proximal coordinate descent for solving the ℓ1-regularized Page Rank problem as ρ decreases. We show averaged results over 100 trials, i.e., 100 seed nodes. The numbers in the first row of each dataset block are megabytes. Note that to compute the memory requirements for storing the adjacency matrix we did not store the edge weights, since the graphs are unweighted.

                                       ρ = 1.0e-1  1.0e-2  1.0e-3  1.0e-4  1.0e-5  1.0e-6  1.0e-7
Orkut
  Average memory (MB)                  0.11        0.20    0.27    1.00    3.51    22.8    100
  Average memory / mem. for adjacency  1.2e-4      2.1e-4  2.9e-4  1.0e-3  3.6e-3  2.4e-2  1.0e-1
  Average nodes touched                72          72      131     1474    10788   83139   443857
  Average nodes in solution            1           1       4       33      266     2401    22159
  Number of iterations                 1           2       10      104     1050    11370   132153
FB-Johns55
  Average memory (MB)                  0.09        0.17    0.20    0.75    1.44    2.41    2.65
  Average memory / mem. for adjacency  6.3e-2      1.1e-1  1.3e-1  4.9e-1  9.5e-1  1.5e+0  1.7e+0
  Average nodes touched                66          68      114     959     3113    5079    5157
  Average nodes in solution            1           1       3       36      367     4201    5157
  Number of iterations                 1           3       10      107     1746    41820   539493

Appendix A. Some auxiliary lemmas

To establish the theorems, we need a few concentration inequalities and intermediate results that will be used in the proofs. In Section A.1, we give degree concentration inequalities for random graphs; in Section A.2, we state recovery guarantees for ℓ1-regularized Page Rank on the population graph; and Section A.3 gives a few important results on the ℓ1-regularized Page Rank problem when restricted to the target cluster.

A.1 Concentration lemmas

Here, we present several concentration lemmas for the degrees of random graphs. The first lemma is a consequence of Chernoff's inequality for sums of independent Bernoulli random variables, combined with a union bound (cf. Vershynin (2018, Theorem 2.3.1)).

Lemma 13 (Adapted from Vershynin (2018, Proposition 2.4.1).) Let $X_i$ be a sum of independent Bernoulli random variables, with $\mathbb{E}[X_i] = \mu$ for $i = 1, 2, \ldots, k$. Then there exists a universal constant $c_0 > 0$ such that, if $\mu \geq c_0^{-1}\delta^{-2}\log k$, then with probability at least $1 - 2e^{-c_0\delta^2\mu}$,

$$|X_i - \mu| \leq \delta\mu \quad\text{for all } i = 1, \ldots, k.$$

According to our random model (Definition 5), the degree vector $d_K$ of the target cluster $K$ comprises random variables that are sums of independent Bernoulli random variables with common mean $\bar d$. Therefore, applying Lemma 13, it is straightforward to see the following result.

Lemma 14 (Adapted from Vershynin (2018, Proposition 2.4.1).) There exists a universal constant $c_0 > 0$ such that, if $\bar d \geq c_0^{-1}\delta^{-2}\log k$, then with probability at least $1 - 2e^{-c_0\delta^2\bar d}$,

$$|d_i - \bar d| \leq \delta\bar d \quad\text{for all } i \in K,$$

where $\bar d$ is the average degree of the vertices in the cluster $K$.

The next lemma bounds the number of edges between a node $j \in K^c$ and the target cluster $K$ when $q = O(1/n)$.

Lemma 15 Suppose that $q = c/n$ and that $k \geq 2(c+3)$ for a positive constant $c$. Then, for $n$ sufficiently large, with probability at least $1 - O(n^{-1})$,

$$\max_{j\in K^c}\|A_{j,K}\|_1 \leq c + 2.$$
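As an illustration of the degree concentration in Lemma 14, the following sketch simulates the internal degrees of a planted cluster; the parameters are arbitrary choices of ours, used only for the demonstration.

```python
import numpy as np

# Monte Carlo check of Lemma 14: inside a planted cluster of size k with
# edge probability p, every internal degree should fall within a
# (1 +/- delta) factor of its mean with high probability.
rng = np.random.default_rng(0)
k, p, delta = 500, 0.2, 0.1
mu = (k - 1) * p  # mean internal degree

A = (rng.random((k, k)) < p).astype(int)
A = np.triu(A, 1)
A = A + A.T  # symmetric adjacency matrix of the induced subgraph
deg = A.sum(axis=1)

print("max relative deviation:", np.max(np.abs(deg - mu)) / mu)
# Typically well below delta = 0.1 for these parameters.
```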
A.2 Exact recovery of target cluster in population graph

Next, we define the ground truth Page Rank vector, which we obtain by applying ℓ1-regularized Page Rank to the population graph of our random model. Recall the adjacency matrix $\mathbb{E}[A]$ defined in (3), the associated diagonal degree matrix $\mathbb{E}[D]$, and the Laplacian matrix $\mathbb{E}[L]$. Writing $\mathbb{E}[Q] = \alpha\,\mathbb{E}[D] + \frac{1-\alpha}{2}\mathbb{E}[L]$, we define the population version of ℓ1-regularized Page Rank as

$$x^* = \arg\min_x\ \tfrac{1}{2}x^\top\mathbb{E}[Q]x - \alpha x^\top s + \rho\alpha\|\mathbb{E}[D]x\|_1.\qquad(11)$$

Compared to (2), the matrices associated with the graph are now replaced by their expected counterparts. The following lemma shows that there exists a range of ρ values for which the solution to the population version of the ℓ1-regularized Page Rank minimization problem (11) gives exact recovery of the target cluster.

Lemma 16 (Exact recovery for population Page Rank vector.) Consider the local random model given in Definition 5 and suppose that Assumption 1 holds. Then, for $n$ sufficiently large, the ground truth Page Rank vector $x^*$, defined in (11), identifies $K$ correctly, i.e., $\mathrm{support}(x^*) = K$, as long as $\rho \in [\bar\rho, \tilde\rho)$, where

$$\bar\rho = \frac{q(1-\alpha)}{2\alpha\bar d\,\mathbb{E}[d_\ell] + q(1-\alpha)\left(k\bar d + (n-k)\mathbb{E}[d_\ell]\right)} \quad\text{and}\quad \tilde\rho = \frac{p(1-\alpha)}{\bar d\left((1+\alpha)\bar d + (1-\alpha)p\right)},$$

where $\mathbb{E}[d_\ell] = \min_{\ell\in K^c}\mathbb{E}[d_\ell]$. Furthermore, $x^*$ has a closed form expression, $x^* = u^*\mathbf{1}_S + v^*\mathbf{1}_K$, where $u^*$ and $v^*$ are given by

$$u^* = \frac{2\alpha}{(1+\alpha)\bar d + (1-\alpha)p}, \qquad v^* = \frac{\frac{1-\alpha}{2}p\,u^* - \rho\alpha\bar d}{\alpha\bar d + \frac{1-\alpha}{2}q(n-k)}.$$

From the closed form expression, we can see that $x^*$ has high probability mass on the single seed node, with a long tail further away from the seed node. The mass outside the target cluster is thresholded exactly to zero by the ℓ1-norm regularization, thus identifying the target cluster without any false positives. Of course, in practice the population graph is unknown, and we are instead given an instance of the graph from the random model. In this case, the ground truth vector $x^*$ allows us to estimate the magnitude of the ℓ1-regularized Page Rank vector $\hat x$ by analyzing the error between $x^*$ and $\hat x$. For more details, we refer the reader to the proof of Lemma 19.

A.3 ℓ1-regularized Page Rank restricted to target cluster

Consider the ℓ1-regularized Page Rank problem restricted to the target cluster $K$:

$$\hat x^R = \arg\min_{x\in\mathbb{R}^n}\left\{\tfrac{1}{2}x^\top Qx - \alpha x^\top s + \rho\alpha\|Dx\|_1 : x_{K^c} = 0\right\} = \arg\min_{y\in\mathbb{R}^k}\ \tfrac{1}{2}y^\top Q_{K,K}y - \alpha y^\top s_K + \rho\alpha\|D_{K,K}y\|_1.\qquad(12)$$

Since our aim is to recover the target cluster $K$ based on the local model (Definition 5), it is natural to analyze the properties of the solution to this reduced problem. Here, abusing notation, we use $\hat x^R$ to denote the vector either in the original space $\mathbb{R}^n$ or in the reduced space $\mathbb{R}^k$. We present several lemmas about $\hat x^R$ that will be critical for the proofs of the theorems. First, we give a guarantee on the recovery of the target cluster by the optimal solution of (12).

Lemma 17 Let $\hat x^R$ be the solution to the reduced problem (12). Suppose that $(1-\delta)p^2k \geq c_0^{-1}\delta^{-2}\log k$. Then, if the regularization parameter satisfies $\rho \leq \rho(\delta)$, we have $\mathrm{support}(\hat x^R) = K$, with probability at least $1 - 6e^{-c_0\delta^2(1-\delta)p^2k}$. Here $c_0$ is the same constant appearing in Lemma 13.

For the above result, since by construction $\hat x^R$ is zero outside $K$, all we need to show is that $\hat x^R_K > 0$ when $\rho = O(\gamma p/\bar d^2)$. Next we compare the support set of $\hat x^R$ to that of $\hat x$.

Lemma 18 For any regularization parameter $\rho \in [0, \infty)$, we have $\mathrm{support}(\hat x^R) \subseteq \mathrm{support}(\hat x)$.
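As a quick sanity check on the closed form of Lemma 16, the following sketch evaluates $u^*$, $v^*$, and the interval $[\bar\rho, \tilde\rho)$ for an arbitrary parameter setting; here E_dl stands in for $\mathbb{E}[d_\ell]$, which we treat as a free model quantity, and all parameter values are illustrative choices of ours.

```python
# Illustrative evaluation of the population quantities in Lemma 16.
p, c, k, n, alpha = 0.3, 2.0, 50, 10_000, 0.15
q = c / n
d_bar = p * (k - 1) + q * (n - k)  # mean degree inside K
E_dl = 10.0  # min expected degree outside K; a free model quantity here

rho_bar = q * (1 - alpha) / (2 * alpha * d_bar * E_dl
          + q * (1 - alpha) * (k * d_bar + (n - k) * E_dl))
rho_til = p * (1 - alpha) / (d_bar * ((1 + alpha) * d_bar + (1 - alpha) * p))
assert rho_bar < rho_til  # Lemma 16 needs a nonempty interval

rho = 0.5 * (rho_bar + rho_til)  # any rho in [rho_bar, rho_til)
u = 2 * alpha / ((1 + alpha) * d_bar + (1 - alpha) * p)
v = ((1 - alpha) / 2 * p * u - rho * alpha * d_bar) \
    / (alpha * d_bar + (1 - alpha) / 2 * q * (n - k))
print(f"rho in [{rho_bar:.2e}, {rho_til:.2e}), u*={u:.2e}, v*={v:.2e}")
```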
Finally, we have the following estimate on the maximum coordinate of $\hat x^R$ on $K\setminus S$ (recall that $u^*$ and $v^*$ are respectively the components of $x^*$ on $S$ and $K\setminus S$; see Lemma 16).

Lemma 19 Let $\delta \leq 0.1$ and suppose that $(1-\delta)p^2k \geq c_0^{-1}\delta^{-2}\log k$. Assume that Assumption 1 holds and that the size of the target cluster satisfies $k \geq 5$. If the regularization parameter satisfies $\rho \leq \rho(\delta)$, then for $n$ sufficiently large, the following bound holds:

$$\|\hat x^R_{K\setminus S}\|_\infty \;\leq\; \underbrace{v^*}_{=\ \|x^*_{K\setminus S}\|_\infty} \;+\; \underbrace{c_1(1-\alpha)\frac{u^*}{\bar d} + 2c_1\delta(v^* + \rho\alpha)}_{\geq\ \|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\|_\infty}$$

with probability at least $1 - 6e^{-c_0\delta^2(1-\delta)p^2k}$. Here $c_0$ is the same constant appearing in Lemma 13 and $c_1$ is a positive numerical constant.

Appendix B. Proofs of theorems

In this section we prove all of our theorems. Based on the lemmas presented above, we give the proofs of our five theorems, respectively, in Sections B.1, B.2, B.3, B.4, and B.5.

B.1 Proof of Theorem 6

The proof of Theorem 6 is a straightforward combination of Lemmas 17 and 18. Specifically, by Lemma 17, we know that for $\rho \leq \rho(\delta)$ we have $\mathrm{support}(\hat x^R) = K$. Applying Lemma 18 then gives $\mathrm{support}(\hat x^R) = K \subseteq \mathrm{support}(\hat x)$, thus proving the result.

B.2 Proof of Theorem 7

To prove Theorem 7, we first need the following lemma bounding the volume of $\mathrm{support}(\hat x)$. This lemma is a stronger version of Fountoulakis and Gondzio (2015, Theorem 2).

Lemma 20 For any regularization parameter $\rho > 0$, it holds that

$$\mathrm{Vol}(\mathrm{support}(\hat x(\rho))) \leq \frac{1}{\rho}\left(1 - d^\top\hat x(\rho)\right),$$

where $\mathrm{Vol}(\mathrm{support}(\hat x)) = \sum_{i\in\mathrm{support}(\hat x)}d_i$.

Now, by our choice of $\rho = \rho(\delta)$ in (5), we have

$$\mathrm{Vol}(\mathrm{support}(\hat x)) \leq \frac{1}{\rho} = \frac{1}{\rho(\delta)} = \left(\frac{1+\alpha}{1-\alpha}\right)^2\left(\frac{1+\delta}{1-\delta}\right)^2\frac{(1+\delta)\bar d^2}{\gamma p} \leq \left(\frac{1+\alpha}{1-\alpha}\right)^2\left(\frac{1+\delta}{1-\delta}\right)^2\frac{(1+\delta)k\bar d}{\gamma^2},\qquad(13)$$

where the last step follows from (4). Furthermore, by Theorem 6, $\mathrm{support}(\hat x)$ contains the target cluster $K$, so the errors in $\mathrm{support}(\hat x)$ are solely attributed to the presence of false positives. Denoting by FP the set of false positives in $\mathrm{support}(\hat x)$, we can write

$$\mathrm{Vol}(\mathrm{support}(\hat x)) = \mathrm{Vol}(K) + \mathrm{Vol}(\mathrm{FP}).\qquad(14)$$

Since $\mathrm{Vol}(K) = \sum_{i\in K}d_i \geq (1-\delta)k\bar d$ by Lemma 14, it follows by (13) and (14) that

$$\mathrm{Vol}(\mathrm{FP}) = \mathrm{Vol}(\mathrm{support}(\hat x)) - \mathrm{Vol}(K) \leq \left[\left(\frac{1+\alpha}{1-\alpha}\right)^2\frac{(1+\delta)^3}{(1-\delta)^3}\frac{1}{\gamma^2} - 1\right]\mathrm{Vol}(K),$$

with probability at least $1 - 2\exp(-c_0\delta^2\bar d)$. This proves the theorem.

B.3 Proof of Theorem 8

Under Assumption 1, with nonvanishing probability we can find a good seed node in the target cluster that is connected solely to $K$.

Lemma 21 Suppose that Assumption 1 holds, i.e., $q = c/n$ for a fixed constant $c > 0$. Under the local random model given in Definition 5, if $n$ is sufficiently large, then

$$P\{\text{there exists a node } i\in K \text{ such that } i \text{ is not connected to } K^c\} \geq 1 - \left(1 - \exp(-1.5c)\right)^k.$$

We select this node as the seed node for the ℓ1-regularized Page Rank problem. Next we show that, under the conditions of Theorem 8, the optimal solution to the reduced problem, $\hat x^R = \hat x^R(\rho(\delta))$ in (12), obeys

$$|(Q\hat x^R - \alpha s)_j| = \left|\tfrac{1-\alpha}{2}A_{j,K}\hat x^R\right| < \rho(\delta)\alpha d_j \quad\text{for } j\in K^c.\qquad(15)$$

Note that the above inequality is simply the optimality condition for the full-dimensional problem on $K^c$ (see Lemma 3), and, by the optimality conditions for the reduced problem, $\hat x^R$ also satisfies the full-dimensional optimality condition on $K$. By uniqueness of the solution, this means that $\hat x^R = \hat x$. Also, since $\mathrm{support}(\hat x^R) \subseteq K$ by construction, this proves that there are no false positives in $\hat x$. Now write $A_{j,K}\hat x^R$ as the sum of two terms, $A_{j,K}\hat x^R = A_{j,S}\hat x^R_S + A_{j,K\setminus S}\hat x^R_{K\setminus S}$. The first term vanishes, since by Lemma 21 we choose the seed node to be one that is connected solely to $K$. For the second term, we use Hölder's inequality to bound

$$A_{j,K\setminus S}\hat x^R_{K\setminus S} \leq \|A_{j,K\setminus S}\|_1\,\|\hat x^R_{K\setminus S}\|_\infty.$$
Applying Lemmas 15 and 19, using the facts that $u^* \leq \frac{2\alpha}{(1+\alpha)\bar d}$ and $v^* \leq \frac{(1-\alpha)p}{(1+\alpha)\bar d^2}$ (which can be deduced from Lemma 16), and using $\delta \leq 0.1$ and $\alpha \in [0.1, 0.9]$ (by the assumptions of the theorem), we obtain

$$|(Q\hat x^R - \alpha s)_j| \leq \frac{1-\alpha}{2}\|A_{j,K\setminus S}\|_1\|\hat x^R_{K\setminus S}\|_\infty \leq (0.5c+1)(1-\alpha)\left[c_1(1-\alpha)\frac{u^*}{\bar d} + (1+2c_1\delta)v^* + 2c_1\delta\,\rho(\delta)\alpha\right] \leq (0.5c+1)\left[\frac{C_1}{\bar d^2} + \rho(\delta)\alpha\right],$$

where $C_1 > 0$ is a positive constant. Then, to prove (15), it suffices to show

$$(0.5c+1)\left[\frac{C_1}{\rho(\delta)\alpha\bar d^2} + 1\right] < d_j.$$

Plugging the definition of $\rho(\delta)$ (5) into the above, and using $\alpha \in [0.1, 0.9]$ and $\delta \leq 0.1$, we obtain

$$(0.5c+1)\left[\frac{C_1}{\rho(\delta)\alpha\bar d^2} + 1\right] \leq \frac{(0.5c+1)\,C_2}{\gamma p}$$

for some constant $C_2 > 0$. By the assumption (8), this means that $\rho(\delta)^{-1}\alpha^{-1}|(Q\hat x^R - \alpha s)_j| < d_j$, and thus we have proved the claim (15).

B.4 Proof of Theorem 10

Fix $\rho_0 > 0$. First, we prove the first part of the theorem, i.e., $\mathrm{support}(\hat x(\rho_0)) \subseteq \mathrm{support}(x^{\mathrm{APPR}}(\rho_0))$. When $\rho_0 > 1/d_S$, this relation holds trivially, because neither $\hat x(\rho_0)$ nor $x^{\mathrm{APPR}}(\rho_0)$ contains any nodes in its support. To prove that the relation holds for $\rho_0 \leq 1/d_S$, we proceed by induction. Specifically, we prove that

$$\begin{cases}\hat x_i(\rho_0) < x^{\mathrm{APPR}}_i(\rho_0), & i\in\mathrm{support}(x^{\mathrm{APPR}}(\rho_0)),\\[2pt] \hat x_i(\rho_0) = x^{\mathrm{APPR}}_i(\rho_0) = 0, & i\notin\mathrm{support}(x^{\mathrm{APPR}}(\rho_0)),\end{cases}\qquad(16)$$

for all $\rho_0 \in (0, 1/d_S]$. Assuming that this holds, it follows immediately that $\mathrm{support}(\hat x(\rho_0)) \subseteq \mathrm{support}(x^{\mathrm{APPR}}(\rho_0))$, thus proving the claim. Now we turn to proving (16). When $\rho_0 = 1/d_S$, we can see from the algorithm (9) that $x^{\mathrm{APPR}}(\rho_0)$ contains the seed node as the only element of its support, while by Lemma 3 we have $\hat x = 0$, which proves (16). Next, assuming that (16) holds at some $\rho_0 \in (0, 1/d_S]$, let $\tilde\rho_0 > 0$ be chosen such that $\tilde\rho_0 < \rho_0$ and such that some node $i\in V$ violates condition (16) for the first time at $\tilde\rho_0$. If $i \in \mathrm{support}(x^{\mathrm{APPR}}(\tilde\rho_0))$, then, since the path $\hat x(\rho)$ is continuous by the proof of Lemma 4, $i$ is the first node such that $\hat x_i(\tilde\rho_0) = x^{\mathrm{APPR}}_i(\tilde\rho_0) > 0$ (there may exist multiple nodes that simultaneously violate (16) at $\rho = \tilde\rho_0$, but the same argument applies). In this case, by induction, we know that $\hat x_j(\tilde\rho_0) \leq x^{\mathrm{APPR}}_j(\tilde\rho_0)$ for $j\neq i$. Then

$$(D^{-1}\nabla f(\hat x(\tilde\rho_0)))_i = \frac{1+\alpha}{2}\hat x_i(\tilde\rho_0) - \frac{1-\alpha}{2}(D^{-1}A\hat x(\tilde\rho_0))_i - \alpha\frac{s_i}{d_i} \geq \frac{1+\alpha}{2}x^{\mathrm{APPR}}_i(\tilde\rho_0) - \frac{1-\alpha}{2}(D^{-1}Ax^{\mathrm{APPR}}(\tilde\rho_0))_i - \alpha\frac{s_i}{d_i} = (D^{-1}\nabla f(x^{\mathrm{APPR}}(\tilde\rho_0)))_i,$$

where the inequality holds since $(D^{-1}Ax)_i = \sum_{j\sim i}w_{ji}x_j/d_i$ for any $x\in\mathbb{R}^{|V|}$. By the optimality condition of (2) (Lemma 3), the left-hand side equals $(D^{-1}\nabla f(\hat x(\tilde\rho_0)))_i = -\tilde\rho_0\alpha$, which gives $(D^{-1}\nabla f(x^{\mathrm{APPR}}(\tilde\rho_0)))_i \leq -\tilde\rho_0\alpha$. However, this contradicts the termination criterion of the APPR algorithm (9), so in fact we must have $i \notin \mathrm{support}(x^{\mathrm{APPR}}(\tilde\rho_0))$ and $\hat x_i(\tilde\rho_0) > 0$ in violation of condition (16). Then, by Lemma 3, this implies that $(D^{-1}\nabla f(\hat x(\tilde\rho_0)))_i = -\tilde\rho_0\alpha$, and, following the same steps as above, we can see that $-\tilde\rho_0\alpha > (D^{-1}\nabla f(x^{\mathrm{APPR}}(\tilde\rho_0)))_i$. However, this again contradicts the termination criterion of APPR, thus proving the first part of the theorem.

Next, we prove the second part of the theorem, i.e., $\mathrm{support}(x^{\mathrm{APPR}}(\rho_0)) \subseteq \mathrm{support}(\hat x(\rho_1))$ for $\rho_1 = (1-\alpha)\rho_0/2$. We first claim that for every node $i \in \mathrm{support}(x^{\mathrm{APPR}}(\rho_0))$, we must have

$$-\rho_0\alpha d_i < \nabla_if(x^{\mathrm{APPR}}(\rho_0)) \leq -\frac{1-\alpha}{2}\rho_0\alpha d_i.$$

The first inequality is immediate from the termination criterion of APPR. To see why the second inequality holds, note that if a node $i\in V$ is selected at iteration $k$ of the APPR algorithm, then by the update step (9) we have $x^{(k+1)}_i = x^{(k)}_i - d_i^{-1}\nabla_if(x^{(k)})$. The gradient of $f$ at node $i$ is then updated as

$$\nabla_if(x^{(k+1)}) = \frac{1+\alpha}{2}d_ix^{(k+1)}_i - \frac{1-\alpha}{2}\sum_{j\sim i}w_{ji}x^{(k)}_j - \alpha s_i = \nabla_if(x^{(k)}) - \frac{1+\alpha}{2}\nabla_if(x^{(k)}) = \frac{1-\alpha}{2}\nabla_if(x^{(k)}).\qquad(17)$$
By the termination criterion of APPR, a node $i$ is selected at iteration $k$ only when $\nabla_if(x^{(k)}) < -\rho_0\alpha d_i$. This shows that the right-hand side of (17) is at most $-\frac{1-\alpha}{2}\rho_0\alpha d_i$; moreover, updates at other nodes can only decrease the gradient at $i$. In particular, taking $k\to\infty$, we obtain $\nabla_if(x^{\mathrm{APPR}}(\rho_0)) \leq -\frac{1-\alpha}{2}\rho_0\alpha d_i$, which proves the claim. Now consider the following strategy of continuously adding small amounts of mass to the vector $x^{\mathrm{APPR}}(\rho_0)$:

1. Start with $x^{\mathrm{APPR}}(\rho_0)$. Find the node whose gradient is smallest, i.e., $j_1 \in \arg\min_{i\in V}\nabla_if(x^{\mathrm{APPR}}(\rho_0))$.
2. Add mass to node $j_1$ until some other node $j_2\in V$ has a gradient as small as that of $j_1$. This event always happens because, by the computation in (17), the gradient at node $j_1$ is increasing while the gradients at the other nodes are either decreasing (if they are neighbors of $j_1$) or stay the same.
3. Add mass to the nodes $(j_1, j_2)$ such that the gradients at $j_1$ and $j_2$ increase at the same rate,5 until some other node $j_3\in V$ has a gradient as small as those of $j_1$ and $j_2$.
4. Continue this process until the gradient at each selected node $i$ reaches the level $-\rho_1\alpha d_i$, with $\rho_1 = (1-\alpha)\rho_0/2$.

Writing $\tilde x^{\mathrm{APPR}}$ for the output vector of the above procedure, it is easy to see that $\tilde x^{\mathrm{APPR}}$ satisfies the optimality condition of the ℓ1-regularized Page Rank minimization problem, given in Lemma 3, with $\rho = \rho_1$. This in turn shows that $\mathrm{support}(x^{\mathrm{APPR}}) \subseteq \mathrm{support}(\hat x(\rho_1))$, since by the uniqueness of the optimal solution we have $\tilde x^{\mathrm{APPR}} = \hat x(\rho_1)$. This completes the proof of Theorem 10.

B.5 Proof of Theorem 11

First, by Theorem 10, we know that

$$\mathrm{support}(\hat x(\rho(\delta))) \subseteq \mathrm{support}(x^{\mathrm{APPR}}(\rho(\delta))) \subseteq \mathrm{support}(\hat x((1-\alpha)\rho(\delta)/2)).$$

Since $\mathrm{support}(\hat x(\rho(\delta)))$ contains the target cluster by Theorem 6, together with the relation above we obtain $K \subseteq \mathrm{support}(x^{\mathrm{APPR}})$. To see that the false positives in $x^{\mathrm{APPR}}$ are also bounded, we can follow the same steps as in the proof of Theorem 7 (see Section B.2) to show that

$$\frac{\mathrm{Vol}(\mathrm{FP}(\hat x((1-\alpha)\rho(\delta)/2)))}{\mathrm{Vol}(K)} \leq \frac{2}{1-\alpha}\left(\frac{1+\alpha}{1-\alpha}\right)^2\frac{(1+\delta)^3}{(1-\delta)^3}\frac{1}{\gamma^2} - 1.$$

From the relation between $\mathrm{support}(x^{\mathrm{APPR}}(\rho(\delta)))$ and $\mathrm{support}(\hat x((1-\alpha)\rho(\delta)/2))$, it directly follows that $\mathrm{Vol}(\mathrm{FP}(x^{\mathrm{APPR}}))/\mathrm{Vol}(K)$ obeys the same bound. Finally, to prove the exact recovery case, it suffices to show that $\mathrm{support}(\hat x((1-\alpha)\rho(\delta)/2))$ contains no false positives. For this, all we need to do is replace $\rho(\delta)$ by $(1-\alpha)\rho(\delta)/2$ in the proof of Theorem 8; the same proof steps then go through, and we finally obtain the condition

$$\frac{2C(0.5c+1)}{(1-\alpha)\gamma p} = O(1)$$

in place of (8). This completes the proof of the theorem.

Appendix C. Proofs of lemmas

We prove all of our lemmas in this section.

5. This is always possible since, for any positive $\epsilon\in\mathbb{R}^{|V|}$, we know that $\sum_{i\in V}(\nabla_if(x+\epsilon) - \nabla_if(x)) > 0$. So, in order to increase the gradients at the same rate, we need to find a direction $\epsilon$ such that the increments $\nabla_jf(x+\epsilon) - \nabla_jf(x)$ are equal across the selected nodes $j$.

C.1 Proof of Lemma 2

By the KKT conditions, $\hat x$ must satisfy, for all $j\in V$,

$$(Q\hat x)_j - \alpha s_j \in \begin{cases}\{-\rho\alpha d_j\}, & \hat x_j > 0,\\ \{\rho\alpha d_j\}, & \hat x_j < 0,\\ [-\rho\alpha d_j,\ \rho\alpha d_j], & \hat x_j = 0,\end{cases}\qquad(18)$$

where $d = \mathrm{diag}(D)$ is the degree vector of the graph. A simple calculation shows that

$$(Q\hat x)_j = \frac{1+\alpha}{2}d_j\hat x_j - \frac{1-\alpha}{2}\sum_{i:(i,j)\in E}w_{ij}\hat x_i.\qquad(19)$$

Now assume that $j\in V$ is a node such that $\hat x_j < 0$ and $\hat x_j = \min_{j'\in V}\hat x_{j'}$. Then, from the condition (18), it must be that $(Q\hat x)_j > 0$. However, by (19) this is only possible if there is at least one node $i\in V$ with $i\sim j$ such that $\hat x_i < \frac{1+\alpha}{1-\alpha}\hat x_j$. Since $\frac{1+\alpha}{1-\alpha} > 1$ and $\hat x_j < 0$, this means that $\hat x_i$ is smaller than $\hat x_j$, contradicting the fact that $\hat x_j$ is the smallest entry over the graph. This proves the desired result.

C.2 Proof of Lemma 3

This lemma follows from Lemma 2 and the definition of $Q$.
C.3 Proof of Lemma 4

First, by Rosset and Zhu (2007, Proposition 1), we know that the optimal solution path $\hat x(\rho)$ of the ℓ1-regularized minimization problem (2) is piecewise linear as a function of $\rho > 0$; in particular, this implies that the path is continuous. Next, we prove that if the support set of $\hat x(\rho)$ remains constant on an interval $[\rho_1, \rho_0]$ ($\rho_0 > \rho_1$), then $\hat x(\rho)$ is strictly decreasing on $[\rho_1, \rho_0]$, i.e., $\hat x(\rho_1) > \hat x(\rho_0)$, where the inequality applies component-wise. Together with the non-negativity of the solution (Lemma 2) and the continuity of the path $\hat x(\rho)$, this establishes the lemma. Write $\mathcal{A}(\rho) = \{j : \hat x_j(\rho) > 0\}$ for $\rho > 0$, and choose $\rho_0 > \rho_1 > 0$ such that the support set $\mathcal{A}(\rho) = \mathcal{A}$ does not change for $\rho\in[\rho_1,\rho_0]$. For $j\in\mathcal{A}$, by Lemma 3, we have $(Q\hat x(\rho))_j - \alpha s_j = -\rho\alpha d_j$, and so

$$Q_{\mathcal{A},\mathcal{A}}(\hat x_{\mathcal{A}}(\rho_1) - \hat x_{\mathcal{A}}(\rho_0)) = (\rho_0 - \rho_1)\alpha d_{\mathcal{A}}.$$

Multiplying both sides by $Q^{-1}_{\mathcal{A},\mathcal{A}}$ (note that $Q_{\mathcal{A},\mathcal{A}}$ is positive definite),

$$\hat x_{\mathcal{A}}(\rho_1) - \hat x_{\mathcal{A}}(\rho_0) = \alpha(\rho_0-\rho_1)Q^{-1}_{\mathcal{A},\mathcal{A}}d_{\mathcal{A}}.\qquad(20)$$

We can write

$$Q_{\mathcal{A},\mathcal{A}} = \frac{1+\alpha}{2}D^{1/2}_{\mathcal{A},\mathcal{A}}\left(I - \frac{1-\alpha}{1+\alpha}(D^{-1/2}AD^{-1/2})_{\mathcal{A},\mathcal{A}}\right)D^{1/2}_{\mathcal{A},\mathcal{A}}.$$

For all $z\in\mathbb{R}^{|\mathcal{A}|}$, we have

$$\left|z^\top(D^{-1/2}AD^{-1/2})_{\mathcal{A},\mathcal{A}}z\right| = \Bigg|\sum_{(i,j)\in E,\ i,j\in\mathcal{A}}\frac{2w_{ij}z_iz_j}{\sqrt{d_id_j}}\Bigg| \leq \sum_{(i,j)\in E,\ i,j\in\mathcal{A}}w_{ij}\left(\frac{z_i^2}{d_i} + \frac{z_j^2}{d_j}\right) < z^\top z,$$

where the last step holds since $d_i > \sum_{j:(i,j)\in E,\,j\in\mathcal{A}}w_{ij}$. This shows that all the eigenvalues of $(D^{-1/2}AD^{-1/2})_{\mathcal{A},\mathcal{A}}$ have absolute value less than 1. Using this fact, $Q^{-1}_{\mathcal{A},\mathcal{A}}$ can be written as the Neumann series

$$Q^{-1}_{\mathcal{A},\mathcal{A}} = \frac{2}{1+\alpha}D^{-1/2}_{\mathcal{A},\mathcal{A}}\left(I + \frac{1-\alpha}{1+\alpha}(D^{-1/2}AD^{-1/2})_{\mathcal{A},\mathcal{A}} + \left(\frac{1-\alpha}{1+\alpha}\right)^2(D^{-1/2}AD^{-1/2})^2_{\mathcal{A},\mathcal{A}} + \cdots\right)D^{-1/2}_{\mathcal{A},\mathcal{A}}.\qquad(21)$$

Now it is easy to see that the right-hand side of (20) is positive, i.e., $\alpha(\rho_0-\rho_1)Q^{-1}_{\mathcal{A},\mathcal{A}}d_{\mathcal{A}} > 0$: every term of the series applied to the positive vector $D^{-1/2}_{\mathcal{A},\mathcal{A}}d_{\mathcal{A}}$ is entrywise nonnegative, and the identity term is positive. This completes the proof of Lemma 4.

C.4 Proof of Lemma 13

This lemma is proved in Vershynin (2018, Proposition 2.4.1); we reproduce the argument here for completeness. By Chernoff's inequality (Vershynin, 2018, Theorem 2.3.1), for some constant $c_0 > 0$,

$$P\{|X_i - \mu| \geq \delta\mu\} \leq 2e^{-2c_0\delta^2\mu}.$$

Using a union bound,

$$P\left\{\max_{i\leq k}|X_i - \mu| \geq \delta\mu\right\} \leq 2e^{-2c_0\delta^2\mu + \log k}.$$

Since $\log k \leq c_0\delta^2\mu$ by assumption, the result follows.

C.5 Proof of Lemma 14

This lemma follows directly from Lemma 13.

C.6 Proof of Lemma 15

We have $\|A_{j,K}\|_1 \sim \mathrm{Binom}(k, q)$ for $j\in K^c$, so using the tail bound for the Binomial distribution of Arratia and Gordon (1989), for any $t > kq$,

$$P\{\|A_{j,K}\|_1 \geq t\} \leq \exp\left(-t\log\frac{t}{kq} - (k-t)\log\frac{1-t/k}{1-q}\right).$$

Set $t = c+3$. Since $k \geq 2(c+3)$ by assumption and $q \leq 0.5$ for $n$ sufficiently large, using $\log(1-x) \geq -2x$ for $0 < x \leq 1/2$, the exponent above is at most $-(c+3)\log\frac{(c+3)n}{ck} + 2(c+3)$. Then, using a union bound over $j\in K^c$,

$$P\left\{\max_{j\in K^c}\|A_{j,K}\|_1 \geq c+3\right\} \leq \exp\left(-(c+3)\log\frac{(c+3)n}{ck} + 2(c+3) + \log(n-k)\right) = O(n^{-1}),$$

as long as $n \gg k$.

C.7 Proof of Lemma 16

First, the results in Lemma 2 and Lemma 3 hold for any graph, seed vector $s$, and $\rho > 0$, so they also apply to (11). In particular, the solution $x^*$ is non-negative, and thus the optimality condition can be expressed as

$$(\mathbb{E}[Q]x^*)_j - \alpha\mathbf{1}_{\{j\in S\}} \begin{cases} = -\rho\alpha\,\mathbb{E}[d_j], & x^*_j > 0,\\ \in [-\rho\alpha\,\mathbb{E}[d_j],\ 0], & x^*_j = 0.\end{cases}\qquad(22)$$

(Note $s$ is 1 on the seed node and 0 elsewhere.) For $\rho > 1/\bar d$, the optimal solution of (11) is the zero vector; for $\rho \leq 1/\bar d$, the first nonzero nodes appear. From (22), as $\rho$ decreases below $1/\bar d$, we can see that the seed node becomes active first. Now let $\rho' > 0$ be the regularization parameter at which the next set of nodes enters the active set. Then, for $\rho \in (\rho', 1/\bar d]$, we know that $x^*_S > 0$ and $x^*_{S^c} = 0$.
Writing $x^*_S = u\mathbf{1}_S$ for some $u > 0$, we can see that for the seed node $j\in S$,

$$(\mathbb{E}[Q]x^*)_j = \frac{1+\alpha}{2}\mathbb{E}[d_j]\,x^*_j - \frac{1-\alpha}{2}\sum_{i:(i,j)\in E}\mathbb{E}[A_{ij}]\,x^*_i = \frac{1+\alpha}{2}\bar d\,u.$$

Substituting into the optimality condition (22) and solving with respect to $u$, we get $\frac{1+\alpha}{2}u\bar d - \alpha = -\rho\alpha\bar d$, and so

$$u = \frac{2(\alpha - \rho\alpha\bar d)}{(1+\alpha)\bar d}.$$

It remains to be shown that $x^*_S = u\mathbf{1}_S$, for the above $u$, satisfies the optimality conditions for $j\in S^c$, i.e.,

$$(\mathbb{E}[Q]x^*)_j \in [-\rho\alpha\,\mathbb{E}[d_j],\ 0] \quad\text{for all } j\in S^c.\qquad(23)$$

We have that

$$(\mathbb{E}[Q]x^*)_j = -\frac{1-\alpha}{2}p\,u = -(\alpha - \rho\alpha\bar d)\frac{p(1-\alpha)}{\bar d(1+\alpha)} \quad\text{for all } j\in K\setminus S,$$
$$(\mathbb{E}[Q]x^*)_j = -\frac{1-\alpha}{2}q\,u = -(\alpha - \rho\alpha\bar d)\frac{q(1-\alpha)}{\bar d(1+\alpha)} \quad\text{for all } j\in K^c.$$

Using the above equalities, we get that the optimality conditions are satisfied if and only if

$$\rho > \tilde\rho := \frac{(1-\alpha)p}{\bar d\left((1+\alpha)\bar d + (1-\alpha)p\right)}.$$

Next, we prove that, when $\rho$ reaches $\tilde\rho$, the nodes in $K\setminus S$ enter the active set. To see this, from (22), we can check that the first $\rho$ value allowing nonzero nodes in $K\setminus S$ is given by $\frac{1-\alpha}{2}p\,u = \rho\alpha\bar d$, and so $\rho' = \tilde\rho$. Now assuming that $x^*$ takes the form $u\mathbf{1}_S + v\mathbf{1}_K$, and substituting into the optimality condition (22), we obtain the following equations:

$$j\in S:\quad \frac{1+\alpha}{2}(u+v)\bar d - \frac{1-\alpha}{2}p\,v(k-1) - \alpha = -\rho\alpha\bar d,$$
$$j\in K\setminus S:\quad \frac{1+\alpha}{2}v\bar d - \frac{1-\alpha}{2}p\,(u + v(k-1)) = -\rho\alpha\bar d.$$

Solving with respect to $u$ and $v$, we obtain

$$u^* = \frac{2\alpha}{(1+\alpha)\bar d + (1-\alpha)p}, \qquad v^* = \frac{\frac{1-\alpha}{2}p\,u^* - \rho\alpha\bar d}{\alpha\bar d + \frac{1-\alpha}{2}q(n-k)}.\qquad(24)$$

Also, $x^* = u^*\mathbf{1}_S + v^*\mathbf{1}_K$, with $u^*$ and $v^*$ given in (24), satisfies the optimality conditions on $K^c$ if and only if

$$\rho > \bar\rho := \frac{q(1-\alpha)}{2\alpha\bar d\,\mathbb{E}[d_\ell] + q(1-\alpha)\left(k\bar d + (n-k)\mathbb{E}[d_\ell]\right)},$$

where $\mathbb{E}[d_\ell] = \min_{\ell\in K^c}\mathbb{E}[d_\ell]$. We claim that $\tilde\rho > \bar\rho$, in which case we have proved that $x^* = u^*\mathbf{1}_S + v^*\mathbf{1}_K$ satisfies the optimality condition (22) for $\rho\in[\bar\rho, \tilde\rho)$, which proves the lemma. It remains to check that $\tilde\rho > \bar\rho$. With some algebra, we have that

$$\tilde\rho > \bar\rho \iff 2p\alpha\bar d\,\mathbb{E}[d_\ell] + pq(1-\alpha)k\bar d + pq(1-\alpha)(n-k)\mathbb{E}[d_\ell] > q(1+\alpha)\bar d^2 + pq(1-\alpha)\bar d$$
$$\iff 2\alpha\bar d\,(p\,\mathbb{E}[d_\ell] - q\bar d) + pq(1-\alpha)(k-1)\bar d + pq(1-\alpha)(n-k)\mathbb{E}[d_\ell] > q(1-\alpha)\bar d^2.$$

Also, since $\bar d = p(k-1) + q(n-k)$, we have $q(1-\alpha)\bar d^2 = q(1-\alpha)\bar d\,[p(k-1) + q(n-k)]$, and hence

$$\tilde\rho > \bar\rho \iff 2\alpha\bar d\,(p\,\mathbb{E}[d_\ell] - q\bar d) + (1-\alpha)q(n-k)(p\,\mathbb{E}[d_\ell] - q\bar d) > 0.\qquad(25)$$

Under Assumption 1, we know that $p\,\mathbb{E}[d_\ell] = \Theta(1)$ while $q\bar d = O(\bar d/n)$, so $p\,\mathbb{E}[d_\ell] - q\bar d > 0$ for $n$ sufficiently large. This verifies (25).

C.8 Proof of Lemma 17

Define $\check x^R(\rho) = \alpha Q^{-1}_{K,K}(s_K - \rho d_K)$. We will prove that $\check x^R = \check x^R(\rho) > 0$ for any $\rho \leq \rho(\delta)$, where $\rho(\delta)$ is defined in (5). In particular, this means that $\check x^R$ satisfies the optimality condition for the reduced problem (12), implying that $\hat x^R_K = \check x^R > 0$ and thus proving the result. First, using (21), we can write

$$\check x^R = \frac{2\alpha}{1+\alpha}D^{-1}_{K,K}\left(I + \frac{1-\alpha}{1+\alpha}A_{K,K}D^{-1}_{K,K} + \left(\frac{1-\alpha}{1+\alpha}\right)^2(A_{K,K}D^{-1}_{K,K})^2 + \cdots\right)(s_K - \rho d_K).$$

Let $w = D^{-1}_{K,K}s_K$. Then, with $\beta := \frac{1-\alpha}{1+\alpha}$,

$$\check x^R = \frac{2\alpha}{1+\alpha}\left[\sum_{j\geq 0}\beta^j(D^{-1}_{K,K}A_{K,K})^jw - \rho\sum_{j\geq 0}\beta^j(D^{-1}_{K,K}A_{K,K})^j\mathbf{1}\right].\qquad(26)$$

Note that $s_K$ has mass 1 on the seed node, so we have $w = s_K/d_S$, where $d_S$ is the degree of the seed node. Denoting by FON the first-order neighbors of the seed node in $K$, i.e., $\mathrm{FON} = \{i\in K : i \text{ is a neighbor of the seed node}\}$, we then have $D^{-1}_{K,K}A_{K,K}w = D^{-1}_{K,K}\mathbf{1}_{\mathrm{FON}}/d_S$, where $\mathbf{1}_{\mathrm{FON}}$ is the indicator vector of the set FON. Applying the degree concentration Lemma 14, with probability at least $1 - 2e^{-c_0\delta^2\bar d}$,

$$D^{-1}_{K,K}A_{K,K}w \geq \frac{1}{(1+\delta)^2\bar d^2}\mathbf{1}_{\mathrm{FON}}.\qquad(27)$$

Next, using Lemma 13, we have that $|\mathrm{FON}| \geq (1-\delta)pk$ with probability at least $1 - 2e^{-c_0\delta^2pk}$.
Furthermore, each non-seed node $i\notin\mathrm{FON}$ in $K$ is connected to nodes in FON with probability $p$, independently of all other pairs. Then, conditionally on $|\mathrm{FON}|$, the quantities $A^\top_{i,K}\mathbf{1}_{\mathrm{FON}}$ are i.i.d. $\mathrm{Binom}(|\mathrm{FON}|, p)$, and thus, applying Lemma 13 and using the assumption $(1-\delta)p^2k \geq c_0^{-1}\delta^{-2}\log k$, we get

$$P\left\{A^\top_{i,K}\mathbf{1}_{\mathrm{FON}} \geq (1-\delta)|\mathrm{FON}|p \ \text{ for all } i\notin\mathrm{FON}\ \middle|\ |\mathrm{FON}|\right\} \geq 1 - 2e^{-c_0\delta^2|\mathrm{FON}|p} \geq 1 - 2e^{-c_0\delta^2(1-\delta)p^2k}.$$

Hence, we can conclude that, with probability at least $1 - 4e^{-c_0\delta^2(1-\delta)p^2k}$,

$$A^\top_{i,K}\mathbf{1}_{\mathrm{FON}} \geq (1-\delta)^2p^2k \quad\text{for all } i\notin\mathrm{FON}.\qquad(28)$$

Consequently,

$$\left(\frac{1-\alpha}{1+\alpha}\right)^2(D^{-1}_{K,K}A_{K,K})^2w \geq \left(\frac{1-\alpha}{1+\alpha}\right)^2(D^{-1}_{K,K}A_{K,K})\frac{\mathbf{1}_{\mathrm{FON}}}{(1+\delta)^2\bar d^2} \geq \left(\frac{1-\alpha}{1+\alpha}\right)^2\frac{(1-\delta)^2p^2k}{(1+\delta)^2\bar d^2}D^{-1}_{K,K}\mathbf{1} \geq \left(\frac{1-\alpha}{1+\alpha}\right)^2\left(\frac{1-\delta}{1+\delta}\right)^2\frac{\gamma p}{(1+\delta)\bar d^2}\mathbf{1},$$

where the first step applies (27), the second step uses (28), and the third applies Lemma 14 and the fact that $pk = \gamma\bar d$. Returning to (26), we can now see that $\check x^R > 0$ as long as $\rho$ is less than

$$\rho(\delta) = \left(\frac{1-\alpha}{1+\alpha}\right)^2\left(\frac{1-\delta}{1+\delta}\right)^2\frac{\gamma p}{(1+\delta)\bar d^2}.$$

This completes the proof of Lemma 17.

C.9 Proof of Lemma 18

Recall that $\hat x$ is the ℓ1-regularized Page Rank vector on the original graph (2). If $\rho > 1/d_S$, then we can easily check that $\hat x = \hat x^R = 0$, while for $\rho = 1/d_S$ we have $\mathrm{support}(\hat x^R) = \mathrm{support}(\hat x) = S$. To prove that $\mathrm{support}(\hat x^R) \subseteq \mathrm{support}(\hat x)$ holds for $\rho < 1/d_S$, we proceed by induction. Writing $\mathcal{A}_1 = \mathrm{support}(\hat x^R)$ and $\mathcal{A}_2 = \mathrm{support}(\hat x)$, by the optimality condition we have

$$\hat x^R_{\mathcal{A}_1} = \alpha Q^{-1}_{\mathcal{A}_1,\mathcal{A}_1}(s_{\mathcal{A}_1} - \rho d_{\mathcal{A}_1}) \quad\text{and}\quad \hat x_{\mathcal{A}_2} = \alpha Q^{-1}_{\mathcal{A}_2,\mathcal{A}_2}(s_{\mathcal{A}_2} - \rho d_{\mathcal{A}_2}).$$

By the inductive hypothesis, assume $\mathcal{A}_1 \subseteq \mathcal{A}_2$. According to the expression (21), we can then see that $\hat x_{\mathcal{A}_1} \geq \hat x^R_{\mathcal{A}_1}$, where the inequality applies component-wise. Now let $i\in K$ be a node such that $\hat x^R_i = \hat x_i = 0$. Using the optimality condition, we know that $i$ becomes active for the full-dimensional problem (2) whenever the following condition is satisfied:

$$\frac{1-\alpha}{2}\sum_{j\sim i,\ j\in\mathcal{A}_2}w_{ij}\hat x_j = \rho\alpha d_i.\qquad(29)$$

Analogously, node $i$ becomes active for the reduced problem (12) when the following condition is satisfied:

$$\frac{1-\alpha}{2}\sum_{j\sim i,\ j\in\mathcal{A}_1}w_{ij}\hat x^R_j = \rho\alpha d_i.\qquad(30)$$

Comparing the left-hand sides of (29) and (30), it is clear that, under the induction assumption, the left-hand side of (29) is larger than that of (30). This implies that $\hat x_i$ becomes active earlier than $\hat x^R_i$ as $\rho$ decreases, so the induction assumption continues to hold. This completes the proof of the lemma.

C.10 Proof of Lemma 19

Since $\hat x^R = (\hat x^R_S, \hat x^R_{K\setminus S})$ is the solution to the reduced problem (12), fixing $\hat x^R_S$, the subvector $\hat x^R_{K\setminus S}$ is the minimizer of the following optimization problem:

$$\hat x^R_{K\setminus S} = \arg\min_{y\in\mathbb{R}^{k-1}}\ \tfrac{1}{2}(\hat x^R_S, y)^\top Q_{K,K}(\hat x^R_S, y) + \rho\alpha\|D(\hat x^R_S, y)\|_1.$$

Due to Lemma 17, the subgradient of $\|D(\hat x^R_S, y)\|_1$ at $y = \hat x^R_{K\setminus S}$ is given by $d_{K\setminus S}$. Using optimality, it follows that

$$0 = \underbrace{Q_{K\setminus S,K}\hat x^R}_{\text{gradient at } y = \hat x^R_{K\setminus S}} + \rho\alpha\,d_{K\setminus S}.\qquad(31)$$

Next, we prove that our choice of $\rho = \rho(\delta)$ lies in the interval $[\bar\rho, \tilde\rho)$.

Lemma 22 Under the conditions of Lemma 19, we have $\rho = \rho(\delta) \in (\bar\rho, \tilde\rho)$.

Hence, by Lemma 16, the population version of ℓ1-regularized Page Rank (11) has support $K$, so it follows that $x^*$ is also the solution to the reduced problem

$$\arg\min_{x\in\mathbb{R}^k}\left\{\tfrac{1}{2}x^\top\mathbb{E}[Q_{K,K}]x - \alpha x^\top s_K + \rho\alpha\|\mathbb{E}[D_{K,K}]x\|_1 : x_{K^c} = 0\right\}.\qquad(32)$$

(Abusing notation, we use $x^*$ and $x^*_K$ interchangeably throughout the proof.) So, using the optimality condition, we obtain

$$0 = \underbrace{\mathbb{E}[Q_{K\setminus S,K}]x^*}_{\text{gradient at } y = x^*_{K\setminus S}} + \rho\alpha\,\mathbb{E}[d_{K\setminus S}],$$

where the subgradient of $\|\mathbb{E}[D](x^*_S, y)\|_1$ equals $\mathbb{E}[d_{K\setminus S}]$ due to Lemma 16 and Lemma 22. Combining the two optimality equations, and adding and subtracting $Q_{K\setminus S,K}x^*$, it follows that

$$0 = \mathbb{E}[Q_{K\setminus S,K}]x^* - Q_{K\setminus S,K}\hat x^R + \rho\alpha(\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}) = (\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K})x^* + Q_{K\setminus S,K}(x^* - \hat x^R) + \rho\alpha(\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}).$$

Writing $Q_{K\setminus S,K}(x^* - \hat x^R) = Q_{K\setminus S,S}(x^*_S - \hat x^R_S) + Q_{K\setminus S,K\setminus S}(x^*_{K\setminus S} - \hat x^R_{K\setminus S})$ and rearranging terms, we get

$$Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S}) = (\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K})x^* + Q_{K\setminus S,S}(x^*_S - \hat x^R_S) + \rho\alpha(\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}).$$

Taking the $\ell_\infty$ norm on both sides,

$$\|Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\|_\infty \leq \underbrace{\|(\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K})x^*\|_\infty}_{\text{term 1}} + \underbrace{\rho\alpha\|\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}\|_\infty}_{\text{term 2}} + \underbrace{\|Q_{K\setminus S,S}(\hat x^R_S - x^*_S)\|_\infty}_{\text{term 3}}.\qquad(33)$$
C.10 Proof of Lemma 19

Since $\hat x^R = (\hat x^R_S, \hat x^R_{K\setminus S})$ is the solution to the reduced problem (12), fixing $\hat x^R_S$, then $\hat x^R_{K\setminus S}$ is the minimizer of the following optimization problem (dropping terms that do not depend on $y$):

\[ \hat x^R_{K\setminus S} = \mathop{\mathrm{arg\,min}}_{y\in\mathbb{R}^{k-1}} \ \frac{1}{2}(\hat x^R_S, y)^\top Q_{K,K}(\hat x^R_S, y) + \rho\alpha\,\bigl\|D(\hat x^R_S, y)\bigr\|_1. \]

Due to Lemma 17, the subgradient of $\|D(\hat x^R_S, y)\|_1$ at $y = \hat x^R_{K\setminus S}$ is given by $d_{K\setminus S}$. Using optimality, it follows that

\[ 0 = \underbrace{Q_{K\setminus S,K}\,\hat x^R}_{\text{gradient at } y = \hat x^R_{K\setminus S}} + \ \rho\alpha\, d_{K\setminus S}. \]

Next, we can prove that our choice of $\rho = \rho(\delta)$ lies in the interval $(\rho^{**}, \rho^*)$.

Lemma 22 Under the conditions of Lemma 19, we have $\rho = \rho(\delta) \in (\rho^{**}, \rho^*)$.

Hence, by Lemma 16, the population version of ℓ1-regularized Page Rank (11) has support $K$, so it follows that $x^*$ is also the solution to the reduced problem

\[ \min\left\{ \frac{1}{2}x^\top \mathbb{E}[Q_{K,K}]\,x - \alpha\, x^\top s_K + \rho\alpha\,\bigl\|\mathbb{E}[D_{K,K}]\,x\bigr\|_1 \ :\ x_{K^c} = 0 \right\}. \quad (32) \]

(Abusing notation, we use $x^*$ and $x^*_K$ interchangeably throughout the proof.) So, using the optimality condition, we obtain

\[ 0 = \underbrace{\mathbb{E}[Q_{K\setminus S,K}]\,x^*}_{\text{gradient at } y = x^*_{K\setminus S}} + \ \rho\alpha\,\mathbb{E}[d_{K\setminus S}], \]

where the subgradient of $\|\mathbb{E}[D](x^*_S, y)\|_1$ equals $\mathbb{E}[d_{K\setminus S}]$ due to Lemma 16 and Lemma 22. Combining the two optimality equations, and adding and subtracting $Q_{K\setminus S,K}\,x^*$, it follows that

\[ 0 = \mathbb{E}[Q_{K\setminus S,K}]\,x^* - Q_{K\setminus S,K}\,\hat x^R + \rho\alpha\,\mathbb{E}[d_{K\setminus S}] - \rho\alpha\, d_{K\setminus S} = \bigl(\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K}\bigr)x^* + Q_{K\setminus S,K}(x^* - \hat x^R) + \rho\alpha\bigl(\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}\bigr). \]

Writing $Q_{K\setminus S,K}(x^* - \hat x^R) = Q_{K\setminus S,S}(x^*_S - \hat x^R_S) + Q_{K\setminus S,K\setminus S}(x^*_{K\setminus S} - \hat x^R_{K\setminus S})$, and rearranging terms, we get

\[ Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S}) = \bigl(\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K}\bigr)x^* + Q_{K\setminus S,S}(x^*_S - \hat x^R_S) + \rho\alpha\bigl(\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}\bigr). \]

Taking the $\ell_\infty$ norm on both sides,

\[ \bigl\|Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\bigr\|_\infty \le \underbrace{\bigl\|(\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K})x^*\bigr\|_\infty}_{\text{term 1}} + \underbrace{\rho\alpha\,\bigl\|\mathbb{E}[d_{K\setminus S}] - d_{K\setminus S}\bigr\|_\infty}_{\text{term 2}} + \underbrace{\bigl\|Q_{K\setminus S,S}(\hat x^R_S - x^*_S)\bigr\|_\infty}_{\text{term 3}}. \]

For term 1 we know that $x^* = u\mathbf{1}_S + v\mathbf{1}_K$ from Lemma 16, so plugging in to term 1,

\[ \bigl(\mathbb{E}[Q_{K\setminus S,K}] - Q_{K\setminus S,K}\bigr)x^* = \frac{1-\alpha}{2}\bigl(A_{K\setminus S,S} - \mathbb{E}[A_{K\setminus S,S}]\bigr)u - \frac{1+\alpha}{2}\bigl(d_{K\setminus S} - \mathbb{E}[d_{K\setminus S}]\bigr)v + \frac{1-\alpha}{2}\bigl(A_{K\setminus S,K} - \mathbb{E}[A_{K\setminus S,K}]\bigr)\mathbf{1}_K\, v. \]

Hence

\[ \text{term 1} \le \frac{1-\alpha}{2}\bigl\|A_{K\setminus S,S} - \mathbb{E}[A_{K\setminus S,S}]\bigr\|_\infty u + \frac{1+\alpha}{2}\bigl\|d_{K\setminus S} - \mathbb{E}[d_{K\setminus S}]\bigr\|_\infty v + \frac{1-\alpha}{2}\bigl\|(A_{K\setminus S,K} - \mathbb{E}[A_{K\setminus S,K}])\mathbf{1}_K\bigr\|_\infty v \le \frac{1-\alpha}{2}\bigl\|A_{K\setminus S,S} - \mathbb{E}[A_{K\setminus S,S}]\bigr\|_\infty u + \delta\bar d\, v \le \frac{1-\alpha}{2}\,u + \delta\bar d\, v, \]

with probability at least $1 - 4e^{-c_0\delta^2 pk}$, where the first step uses the triangle inequality, the second step applies Lemma 13 and Lemma 14, and the last step holds since $|A_{j,S} - p| \le 1$ for all $j \in K\setminus S$. Furthermore, by Lemma 14, we can bound term 2 as $\text{term 2} \le \rho\alpha\,\delta\bar d$, while for term 3, a simple calculation leads to $\text{term 3} \le \frac{1-\alpha}{2}\|\hat x^R_S - x^*_S\|_\infty$. Putting the results together, we have

\[ \bigl\|Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\bigr\|_\infty \le \frac{1-\alpha}{2}\,u + \delta\bar d\, v + \rho\alpha\,\delta\bar d + \frac{1-\alpha}{2}\,\|\hat x^R_S - x^*_S\|_\infty. \quad (34) \]

Now it remains to upper bound the term $\|\hat x^R_S - x^*_S\|_\infty$. Following the same steps as before, indeed we can check that the following holds:

\[ \bigl\|Q_{S,S}(\hat x^R_S - x^*_S)\bigr\|_\infty \le \underbrace{\bigl\|(\mathbb{E}[Q_{S,K}] - Q_{S,K})x^*\bigr\|_\infty}_{\text{term 4}} + \underbrace{\rho\alpha\,\bigl\|\mathbb{E}[d_S] - d_S\bigr\|_\infty}_{\text{term 5}} + \underbrace{\bigl\|Q_{S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\bigr\|_\infty}_{\text{term 6}}. \quad (35) \]

Also, we can see that, using $x^* = u\mathbf{1}_S + v\mathbf{1}_K$,

\[ \text{term 4} = \Bigl|\frac{1+\alpha}{2}(u+v)(d_S - \bar d) - \frac{1-\alpha}{2}\,v\,\bigl(A_{S,K} - \mathbb{E}[A_{S,K}]\bigr)\mathbf{1}_K\Bigr| \le (u+v)\,\delta\bar d, \]

where the inequality applies Lemma 13 and Lemma 14. It is also easy to see that $\text{term 5} \le \rho\alpha\,\delta\bar d$, while

\[ \text{term 6} = \frac{1-\alpha}{2}\Bigl|A_{S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\Bigr| \le \frac{1-\alpha}{2}\,\bigl\|A_{S,K\setminus S}\bigr\|_1\, \bigl\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\bigr\|_\infty \le \frac{1-\alpha}{2}(1+\delta)p(k-1)\,\bigl\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\bigr\|_\infty, \]

where the second step uses Hölder's inequality and the next step applies Lemma 13. Furthermore, $Q_{S,S} = \frac{1+\alpha}{2}d_S$ by definition, which is bounded below by $\frac{1+\alpha}{2}(1-\delta)\bar d$. Thus, from (35), we get

\[ \|\hat x^R_S - x^*_S\|_\infty \le \frac{\delta\bar d(u+v) + \rho\alpha\,\delta\bar d + \frac{1-\alpha}{2}(1+\delta)p(k-1)\,\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\|_\infty}{\frac{1+\alpha}{2}(1-\delta)\bar d}. \]

Finally, return to (34) and substitute the above bound in place of $\|\hat x^R_S - x^*_S\|_\infty$; then

\[ \bigl\|Q_{K\setminus S,K\setminus S}(\hat x^R_{K\setminus S} - x^*_{K\setminus S})\bigr\|_\infty \le \frac{1-\alpha}{2}\,u + \delta\bar d\, v + \rho\alpha\,\delta\bar d + \frac{1-\alpha}{1+\alpha}\cdot\frac{\delta\bar d(u+v) + \rho\alpha\,\delta\bar d + \frac{1-\alpha}{2}(1+\delta)p(k-1)\,\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\|_\infty}{(1-\delta)\bar d}. \quad (36) \]

The following lemma concerns the local strong convexity of the matrix $Q_{K\setminus S,K\setminus S}$ in the $\ell_\infty$ norm.

Lemma 23 Under the conditions of Lemma 19, with probability at least $1 - 2e^{-c_0\delta^2\bar d}$,

\[ \bigl\|Q_{K\setminus S,K\setminus S}\, y\bigr\|_\infty \ge \alpha(1-\delta)\bar d\,\|y\|_\infty \quad \text{for all } y \in \mathbb{R}^{k-1}. \]

Applying Lemma 23 to the left-hand side of (36), moving the $\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\|_\infty$ term to the left, and simplifying using $\alpha < 1$ and $p(k-1) = \gamma\bar d$, the coefficient multiplying $\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\|_\infty$ on the left-hand side is at least $c_1\bar d$ for some numerical constant $c_1 > 0$: by assumption of the lemma we have $\delta \le 0.1$ and $\alpha \in [0.1, 0.9]$, while $\gamma \in (0,1)$ by definition (4), so in particular $\frac{1-\alpha}{1+\alpha}\frac{\delta}{1-\delta}\,\bar d \le \delta\bar d$. As a result,

\[ c_1\bar d\,\bigl\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\bigr\|_\infty \le (1-\alpha)u + 2\delta\bar d\,(v + \rho\alpha). \]

Dividing both sides by $c_1\bar d$,

\[ \bigl\|\hat x^R_{K\setminus S} - x^*_{K\setminus S}\bigr\|_\infty \le c_1^{-1}\frac{(1-\alpha)u}{\bar d} + 2c_1^{-1}\delta\,(v + \rho\alpha). \]

Using the triangle inequality, Lemma 19 now follows.

C.11 Proof of Lemma 20

By Lemma 3 we know that

\[ (Q\hat x - \alpha s)_i \begin{cases} = -\rho\alpha d_i, & \hat x_i > 0, \\ \in [-\rho\alpha d_i,\, 0], & \hat x_i = 0. \end{cases} \]

Summing over all $i$, we have

\[ \sum_i (Q\hat x - \alpha s)_i \le -\rho\alpha \sum_{j:\hat x_j>0} d_j = -\rho\alpha\,\mathrm{Vol}\bigl(\mathrm{support}(\hat x)\bigr), \]

and the left-hand side equals $\alpha\sum_i d_i\hat x_i - \alpha$, since $\mathbf{1}^\top Q\hat x = \mathbf{1}^\top\bigl(\alpha D + \frac{1-\alpha}{2}L\bigr)\hat x = \alpha\sum_i d_i\hat x_i$ and $\mathbf{1}^\top s = 1$. Dividing both sides by $\rho\alpha$ and rearranging, we get

\[ \mathrm{Vol}\bigl(\mathrm{support}(\hat x)\bigr) \le \frac{1}{\rho}\bigl(1 - d^\top\hat x\bigr). \]
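The volume bound of Lemma 20 is easy to probe numerically. The sketch below uses a plain projected ISTA loop of our own (any solver for the ℓ1-regularized Page Rank objective would do; this is not the paper's code) on a small random graph with assumed parameters, and then checks $\mathrm{Vol}(\mathrm{support}(\hat x)) \le (1 - d^\top\hat x)/\rho$.

```python
import numpy as np

# Solve min_x  x'Qx/2 - alpha*s'x + rho*alpha*d'x  over x >= 0 by projected
# gradient (the nonnegative prox of rho*alpha*||Dx||_1), then check Lemma 20.
rng = np.random.default_rng(1)
n, alpha, rho = 60, 0.15, 5e-3               # illustrative values (assumed)
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
idx = np.arange(n)
A[idx, (idx + 1) % n] = 1.0                  # ring: no isolated nodes
A[(idx + 1) % n, idx] = 1.0
d = A.sum(axis=1)
Q = (1 + alpha) / 2 * np.diag(d) - (1 - alpha) / 2 * A
s = np.zeros(n); s[0] = 1.0                  # seed node 0

x = np.zeros(n)
eta = 1.0 / np.linalg.eigvalsh(Q).max()      # step size 1/L
for _ in range(20_000):
    x = np.maximum(0.0, x - eta * (Q @ x - alpha * s + rho * alpha * d))

vol = d[x > 1e-10].sum()                     # Vol(support(x))
print(f"Vol(support) = {vol:.1f}, (1 - d'x)/rho = {(1 - d @ x) / rho:.1f}")
assert vol <= (1 - d @ x) / rho + 1e-6
```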
C.12 Proof of Lemma 21

This result follows from a simple probability calculation. We have

\[ \mathbb{P}\{\exists\, i \in K : i \text{ is not connected to } K^c\} = 1 - \mathbb{P}\{\forall\, i \in K : i \text{ is connected to } K^c\} = 1 - \bigl(\mathbb{P}\{i \text{ is connected to } K^c\}\bigr)^k = 1 - \bigl(1 - \mathbb{P}\{i \text{ is not connected to } K^c\}\bigr)^k = 1 - \bigl(1 - (1-q)^{n-k}\bigr)^k \ge 1 - \bigl(1 - (1-q)^n\bigr)^k. \]

Here, the second and fourth steps hold since each pair of nodes has an edge independently of all other pairs. Now suppose that $n$ is sufficiently large so that $q \le 0.5$. Then, since $\log(1-q) \ge -1.5q$ for $q \in [0, 0.5]$ and $q = c/n$, we have $(1-q)^n \ge \exp(-1.5c)$, which leads to

\[ \mathbb{P}\{\exists\, i \in K : i \text{ is not connected to } K^c\} \ge 1 - \bigl(1 - \exp(-1.5c)\bigr)^k. \]

C.13 Proof of Lemma 22

Letting $\bar c = \frac{1-\alpha}{1+\alpha}\bigl(\frac{1-\delta}{1+\delta}\bigr)^2\frac{\gamma}{1+\delta}$, we can write $\rho = \rho(\delta) = \frac{\bar c\,(1-\alpha)p}{(1+\alpha)\bar d^2}$. Following the same steps as in (25) (proof of Lemma 16), we can see that

\[ \rho > \rho^{**} \iff 2\bar c p\alpha\bar d\,\mathbb{E}[\underline d_\ell] + \bar c pq(1-\alpha)\,k\bar d + \bar c pq(1-\alpha)(n-k)\,\mathbb{E}[\underline d_\ell] > q(1+\alpha)\bar d^2 \]

\[ \iff 2\alpha\bar d\,\bigl(\bar c p\,\mathbb{E}[\underline d_\ell] - q\bar d\bigr) + (1-\alpha)q(n-k)\bigl(\bar c p\,\mathbb{E}[\underline d_\ell] - q\bar d\bigr) - pq(1-\alpha)(k-1)\bar d\,(1-\bar c) > 0. \quad (37) \]

Since $\delta \le 0.1$ and $\alpha \in [0.1, 0.9]$, we know that $\bar c \ge c'\gamma$ for some constant $c' > 0$. Also, since $q = c/n$, by Assumption 1 we have that $\gamma = \Theta(1)$, and thus

\[ 2\alpha\bar d\,\bigl(\bar c p\,\mathbb{E}[\underline d_\ell] - q\bar d\bigr) + (1-\alpha)q(n-k)\bigl(\bar c p\,\mathbb{E}[\underline d_\ell] - q\bar d\bigr) = \Omega(\bar d), \qquad pq(1-\alpha)(k-1)\bar d\,(1-\bar c) = O\Bigl(\frac{k\bar d}{n}\Bigr). \]

Therefore, if $n$ is sufficiently large, the first two terms in (37) are positive and much larger than the last term. This proves that $\rho > \rho^{**}$.

To see $\rho < \rho^*$, since $\bar c \le \frac{1-\alpha}{1+\alpha}$, it suffices to check

\[ \frac{(1-\alpha)^2 p}{(1+\alpha)^2\bar d^2} < \frac{(1-\alpha)p}{\bar d\bigl[(1+\alpha)\bar d + (1-\alpha)p\bigr]} = \rho^*. \]

Rearranging the terms, we equivalently need to show $2\alpha(1+\alpha)\bar d > (1-\alpha)^2 p$. Since $\bar d \ge p(k-1)$, we then have $2\alpha(1+\alpha)(k-1) > (1-\alpha)^2$, which, due to the assumption $\alpha \in [0.1, 0.9]$, holds for any $k \ge 5$.

C.14 Proof of Lemma 23

Denoting by $\mathbf{1}$ the vector whose entries are all ones, we have

\[ Q_{K\setminus S,K\setminus S} = \alpha D_{K\setminus S,K\setminus S} + \frac{1-\alpha}{2}\bigl(D_{K\setminus S,K\setminus S} - A_{K\setminus S,K\setminus S}\bigr) = \alpha D_{K\setminus S,K\setminus S} + \frac{1-\alpha}{2}\bigl(\mathrm{diag}(A\mathbf{1})_{K\setminus S,K\setminus S} - A_{K\setminus S,K\setminus S}\bigr) = \alpha D_{K\setminus S,K\setminus S} + \frac{1-\alpha}{2}\,\mathrm{diag}\bigl(A_{K\setminus S,\,S\cup K^c}\,\mathbf{1}_{S\cup K^c}\bigr) + \frac{1-\alpha}{2}\,L_{K\setminus S,K\setminus S}, \]

where $L_{K\setminus S,K\setminus S}$ is the graph Laplacian of the sub-graph induced by $A_{K\setminus S,K\setminus S}$. Now take $y$ with $\|y\|_\infty = 1$. Without loss of generality, we will assume that the maximum is attained at the first coordinate with $y_1 = 1$ (the same argument holds in the case of $y_1 = -1$). Then, by the definition of the graph Laplacian, it is easy to see that $(L_{K\setminus S,K\setminus S}\, y)_1 \ge 0$, and so $(Q_{K\setminus S,K\setminus S}\, y)_1 \ge \alpha d_1$. Hence, applying degree concentration to $d_1$ (Lemma 14),

\[ \bigl\|Q_{K\setminus S,K\setminus S}\, y\bigr\|_\infty \ge (Q_{K\setminus S,K\setminus S}\, y)_1 \ge \alpha d_1 \ge (1-\delta)\,\alpha\bar d, \]

with probability at least $1 - 2e^{-c_0\delta^2\bar d}$, proving the result.
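The probabilistic statement of Lemma 23 is degree concentration plus a deterministic $\ell_\infty$ bound: for any principal submatrix $Q_{B,B}$ and any $y$, one has $\|Q_{B,B}\,y\|_\infty \ge \alpha\,(\min_{i\in B} d_i)\,\|y\|_\infty$, since at a coordinate attaining $\|y\|_\infty$ both the Laplacian part and the diagonal remainder are nonnegative. A randomized check of that deterministic core (a toy construction of ours, purely illustrative):

```python
import numpy as np

# ||Q_{B,B} y||_inf >= alpha * min_{i in B} d_i * ||y||_inf for any subset B,
# where Q = ((1+alpha)/2) D - ((1-alpha)/2) A and d is the full-graph degree.
rng = np.random.default_rng(2)
for _ in range(200):
    n, alpha = 25, rng.uniform(0.1, 0.9)
    A = (rng.random((n, n)) < 0.3).astype(float)
    A = np.triu(A, 1); A = A + A.T
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0              # ring: no isolated nodes
    A[(idx + 1) % n, idx] = 1.0
    d = A.sum(axis=1)
    Q = (1 + alpha) / 2 * np.diag(d) - (1 - alpha) / 2 * A
    B = rng.choice(n, size=10, replace=False)  # arbitrary node subset
    y = rng.standard_normal(10)
    lhs = np.abs(Q[np.ix_(B, B)] @ y).max()
    rhs = alpha * d[B].min() * np.abs(y).max()
    assert lhs >= rhs - 1e-9
```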