# Finding Dense Subgraphs via Low-Rank Bilinear Optimization

Dimitris S. Papailiopoulos (DIMITRIS@UTEXAS.EDU), Ioannis Mitliagkas (IOANNIS@UTEXAS.EDU), Alexandros G. Dimakis (DIMAKIS@AUSTIN.UTEXAS.EDU), Constantine Caramanis (CONSTANTINE@UTEXAS.EDU). The University of Texas at Austin.

Given a graph, the Densest k-Subgraph (DkS) problem asks for the subgraph on k vertices that contains the largest number of edges. In this work, we develop a new algorithm for DkS that searches a low-dimensional space for provably dense subgraphs. Our algorithm comes with novel performance bounds that depend on the graph spectrum. These graph-dependent bounds are surprisingly tight for real-world graphs, where we find subgraphs with density provably within 70% of the optimum. The guarantees are significantly tighter than the best available worst-case a priori bounds. Our algorithm runs in nearly linear time, under spectral assumptions satisfied by most graphs found in applications. Moreover, it is highly scalable and parallelizable. We demonstrate this by implementing it in MapReduce and executing numerous experiments on massive real-world graphs with up to billions of edges. We empirically show that our algorithm can find subgraphs of significantly higher density than the previous state of the art.

## 1. Introduction

Given a graph G on n vertices with m edges and a parameter k, we are interested in finding an induced subgraph on k vertices with the largest average degree, also known as the maximum density. This is the Densest k-Subgraph (DkS) problem, a fundamental problem in combinatorial optimization with applications in numerous fields including social sciences, communication networks, and biology (see, e.g., (Hu et al., 2005; Gibson et al., 2005; Dourisboure et al., 2007; Saha et al., 2010; Miller et al., 2010; Bahmani et al., 2012)). DkS is a notoriously hard problem.
(Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).)

It is NP-hard by reduction from MAXCLIQUE. Moreover, Khot showed in (Khot, 2004) that, under widely believed complexity-theoretic assumptions, DkS cannot be approximated within an arbitrary constant factor.¹ The best known approximation ratio was n^{1/3+ϵ} (for some small ϵ), due to (Feige et al., 2001). Recently, (Bhaskara et al., 2010) introduced an algorithm with approximation ratio n^{1/4+ϵ} that runs in time n^{O(1/ϵ)}. Such results, where the approximation factor scales as a polynomial in the number of vertices, are too pessimistic for real-world applications. This resistance to better approximations, despite the long history of the problem, suggests that DkS is probably very hard in the worst case.

Our Contributions. In this work we move beyond the worst-case framework. We present a novel DkS algorithm that has two key features: i) it comes with approximation guarantees that are surprisingly tight on real-world graphs, and ii) it is fully parallelizable and can scale up to graphs with billions of edges. Our algorithm combines spectral and combinatorial techniques; it relies on examining candidate subgraphs obtained from vectors lying in a low-dimensional subspace of the adjacency matrix of the graph. This is accomplished through a framework called the Spannogram, which we define below. Our approximation guarantees are graph-dependent: they are related to the spectrum of the adjacency matrix of the graph. Let opt denote the average degree (i.e., the density) of the densest k-subgraph, where 0 ≤ opt ≤ k − 1. Our algorithm takes as input the graph, the subgraph size k, and an accuracy parameter d ∈ {1, . . . , n}. The output is a subgraph on k vertices with density opt_d, for which we obtain the following approximation result:

Theorem 1.
For any unweighted graph, our algorithm outputs in time O(n^{d+2} log(n/δ)) a k-subgraph that has density

opt_d ≥ 0.5 (1 − δ) opt − 2 |λ_{d+1}|,

with probability 1 − 1/n, where λ_i is the i-th largest, in magnitude, eigenvalue of the adjacency matrix of the graph. If the graph is bipartite, or if the largest d eigenvalues of the graph are positive, then our algorithm runs in time O(n^{d+1} + T_d) and outputs a k-subgraph with density opt_d ≥ opt − 2 |λ_{d+1}|, where T_d is the time to compute the d leading eigenvectors of the adjacency matrix of the graph.

¹An approximation ratio ρ means that there exists an algorithm that produces in polynomial time a number A such that 1 ≤ opt/A ≤ ρ, where opt is the optimal density.

Our bounds come close to 2 + ϵ and 1 + ϵ factor approximations when λ_{d+1} is significantly smaller than the density of the densest k-subgraph. In the following theorem, we give such an example. However, we note that in the worst case our bounds might not yield anything meaningful.

Theorem 2. If the densest k-subgraph contains a constant fraction of all the edges, and k = Θ(√|E|), then we can approximate DkS within a factor of 2 + ϵ, in time n^{O(1/ϵ²)}. If additionally the graph is bipartite, we can approximate DkS within a factor of 1 + ϵ.

The above result is similar to the 1 + ϵ approximation ratio of (Arora et al., 1995) for dense graphs, where the densest k-subgraph contains a constant fraction of the Ω(n²) edges and k = Ω(n). The innovation here is that our ratio also applies to sparse graphs with a sublinear number of edges.

Computable upper bounds. In addition to these theoretical guarantees, our analysis allows us to obtain a graph-dependent upper bound on the optimal subgraph density. This is shown in Fig. 3 in our experimental section, where for many graphs our algorithm is provably within 70% of the upper bound on opt. These are far stronger guarantees than the best available a priori bounds.
This illustrates the potential power of graph-dependent guarantees, which, however, require the execution of an algorithm.

Nearly-linear time approximation. Our algorithm has a worst-case running time of O(n^{d+2} log(n/δ)). Under some mild spectral assumptions, a randomized version of our algorithm runs in nearly-linear time.

Theorem 3. Let the d largest eigenvalues of the graph be positive, and let the ratio of the d-th to the (d + 1)-st largest eigenvalue be a constant C. Then, we can modify our algorithm to output, with probability 1 − δ, a k-subgraph with density (1 − ϵ)² opt_d, in time O(m log n + n · ϵ^{−d} log(1/(ϵ δ))), where m is the number of edges.

We found that the above spectral condition holds for all d ≤ 5 in many real-world graphs that we tested.

Scalability. We develop two key scalability features that allow us to scale up efficiently on massive graphs.

Vertex sparsification: We introduce a pre-processing step that eliminates vertices that are unlikely to be part of the densest k-subgraph. The elimination is based on the vertices' weighted leverage scores (Mahoney & Drineas, 2009; Boutsidis et al., 2009) and admits a provable bound on the introduced error. We empirically found that, even with a negligible additional error, the elimination dramatically reduced problem sizes in all tested datasets.

MapReduce implementation: We show that our algorithm is fully parallelizable and tailor it for the MapReduce framework. We use our MapReduce implementation to run experiments on Amazon's Elastic MapReduce (EMR). In our large-scale experiments, we were able to scale out to thousands of mappers and reducers in parallel over 800 cores, and find large dense subgraphs in graphs with billions of edges.

### 1.1. Related work

DkS algorithms: One of the few positive results for DkS is a 1 + ϵ approximation for dense graphs where m = Ω(n²), in the linear subgraph setting k = Ω(n) (Arora et al., 1995). For some values of m = o(n²), a 2 + ϵ approximation was established by (Suzuki & Tokuyama, 2005).
Moreover, for any k = Ω(n), a constant-factor approximation is possible via a greedy approach by (Asahiro et al., 2000), or via semidefinite relaxations by (Srivastav & Wolf, 1998) and (Feige & Langberg, 2001). Recently, (Alon et al., 2013) established new approximation results for graphs with small ϵ-rank, using an approximate solver for low-rank perturbed versions of the adjacency matrix. There is a vast literature on algorithms for detecting communities and well-connected subgraphs: greedy schemes (Ravi et al., 1994), optimization approaches (Jethava et al., 2012; d'Aspremont et al., 2010; Ames, 2011), and the truncated power method (Yuan & Zhang, 2011). We compare with several of these algorithms in our evaluation section.

The Spannogram framework: We present an exact solver for bilinear optimization problems on matrices of constant rank, under {0, 1} and sparsity constraints on the variables. Our theory is a generalization of the Spannogram framework, originally introduced in the foundational work of (Karystinos & Liavas, 2010) and further developed in (Asteris et al., 2014; Papailiopoulos et al., 2013), that obtains exact solvers for low-rank quadratic optimization problems with combinatorial constraints, such as sparse PCA.

MapReduce algorithms for graphs: The design of MapReduce algorithms for massive graphs is an active research area as Hadoop becomes one of the standards for storing large data sets. The related work by Bahmani et al. (Bahmani et al., 2012) designs a novel MapReduce algorithm for the densest subgraph problem. This problem requires finding a subgraph of highest normalized density without enforcing a specific subgraph size k. Surprisingly, without a subgraph size restriction, the densest subgraph problem becomes polynomially solvable, and is therefore fundamentally different from what we consider in this paper.
## 2. Proposed Algorithm

The density of a subgraph indexed by a vertex set S ⊆ {1, . . . , n} is equal to the average degree of the vertices within S:

den(S) = (1/|S|) 1_S^T A 1_S,

where A is the adjacency matrix (A_{i,j} = 1 if (i, j) is an edge, else A_{i,j} = 0) and the indicator vector 1_S has 1s in the entries indexed by S and 0s otherwise. Observe that 1_S^T A 1_S = Σ_{i,j∈S} A_{i,j} is twice the number of edges in the subgraph with vertices in S. For a fixed subgraph size |S| = k, we can express DkS as a quadratic optimization:

DkS: opt = (1/k) max_{|S|=k} 1_S^T A 1_S,

where |S| = k denotes that the optimization variable is a k-vertex subset of {1, . . . , n}.

The bilinear relaxation of DkS. We approximate DkS via its bipartite version, which can be expressed as a bilinear maximization:

DBkS: opt^B = (1/k) max_{|X|=k} max_{|Y|=k} 1_X^T A 1_Y.

As the following lemma shows, the two problems are fundamentally related: a good solution for the bipartite version maps to a half as good solution for DkS. The proof is given in the Supplemental Material.

Lemma 1. A ρ-approximation algorithm for DBkS implies a 2ρ-approximation algorithm for DkS.

### 2.1. DkS through low-rank approximations

At the core of our approximation lies a constant-rank solver: we show that DBkS can be solved in polynomial time on constant-rank matrices. We solve constant-rank instances of DBkS instead of DkS due to an important implication: DkS is NP-hard even for rank-1 matrices with one negative eigenvalue, as we show in the Supplemental Material. The exact steps of our algorithm are given in the pseudocode tables referred to as Algorithms 1-3.² The output of our algorithm is a k-subgraph Z_d that has density opt_d and comes with provable guarantees. We present our theoretical guarantees in the next subsection.

²In the pseudocode of Algorithm 2, topk(v) denotes the indices of the k largest signed elements of v.
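To make the objective concrete, the following small sketch (our own illustration, not the paper's algorithm; function names are hypothetical) evaluates den(S) and solves DkS by brute force on a toy graph:

```python
from itertools import combinations

def density(A, S):
    # den(S) = (1/|S|) * 1_S^T A 1_S: twice the number of internal edges
    # divided by |S|, i.e., the average degree inside S
    return sum(A[i][j] for i in S for j in S) / len(S)

def densest_k_subgraph_bruteforce(A, k):
    # Exhaustive DkS: check all (n choose k) vertex subsets (exponential cost)
    n = len(A)
    return max(combinations(range(n), k), key=lambda S: density(A, S))

# Toy graph: a triangle {0, 1, 2} plus a pendant vertex 3 attached to vertex 0
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
S = densest_k_subgraph_bruteforce(A, 3)  # the triangle, with density 2
```

The brute-force search is only feasible for tiny graphs; the point of the paper is to avoid exactly this (n choose k) enumeration.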
Our main algorithmic innovation, the constant-rank solver for DBkS (Algorithms 2-3), is called many times: in lines 5, 8, and 15 of our general DkS approximation, shown as Algorithm 1. We describe its steps subsequently.

Algorithm 1 Low-rank approximations for DkS
 1: [V_d, Λ_d] = EVD(A, d)
 2: if G is bipartite then
 3:   B = bi-adjacency matrix of G
 4:   [V_d, Σ_d, U_d] = SVD(B, d)
 5:   {X_d, Y_d} = arg max_{|X|+|Y|=k} 1_X^T V_d Σ_d U_d^T 1_Y
 6:   Z_d = X_d ∪ Y_d
 7: else if the first d eigenvalues of A are positive then
 8:   {X_d, Y_d} = arg max_{|X|=|Y|=k} 1_X^T V_d Λ_d V_d^T 1_Y
 9:   Z_d = X_d
10: else
11:   for i = 1 : log(n/δ) do
12:     draw n fair coins and assign them to vertices
13:     L = vertices with heads; R = {1, . . . , n} \ L
14:     B_d^i = [V_d Λ_d V_d^T]_{L,R}
15:     {X_i, Y_i} = arg max_{|X|+|Y|=k} 1_X^T B_d^i 1_Y
16:   end for
17:   {X_d, Y_d} = arg max_i 1_{X_i}^T B_d^i 1_{Y_i}
18:   Z_d = X_d ∪ Y_d
19: end if
20: Output: Z_d

Constant-rank solver for DBkS. In the following, we present an exact solver for DBkS on constant-rank approximations of A. Our DkS algorithm makes a number of calls to the DBkS low-rank solver on slightly different (sometimes rectangular) matrices. The details of the general low-rank solver are in the Supplemental Material.

Step 1: Obtain A_d = Σ_{i=1}^d λ_i v_i v_i^T, the rank-d approximation of A. Here, λ_i is the i-th largest-in-magnitude eigenvalue and v_i the corresponding eigenvector.

Step 2: Use A_d to obtain O(n^d) candidate subgraphs. For any matrix A, we can solve DBkS by exhaustively checking all (n choose k)² pairs (X, Y) of k-subsets of vertices. Surprisingly, if we want to find the X, Y pairs that maximize 1_X^T A_d 1_Y, i.e., the bilinear problem on the rank-d matrix A_d, then we show that only O(n^d) candidate pairs need to be examined.

Step 3: Check all k-set pairs {X, Y} obtained in Step 2, and output the one with the largest density on the low-rank weighted adjacency matrix A_d.

In the next section, we derive the constant-rank solver using two key facts.
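Step 1 only needs the top few eigenpairs. As a minimal pure-Python illustration of this step (our own sketch; a real implementation would use a sparse eigensolver such as scipy.sparse.linalg.eigsh), power iteration recovers the leading eigenpair, from which the rank-1 term λ_1 v_1 v_1^T of A_d is formed:

```python
def leading_eigenpair(A, iters=200):
    # Power iteration: repeatedly apply A and normalize; for generic starting
    # vectors this converges to the largest-magnitude eigenvalue's eigenvector
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate
    lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
    return lam, v

# Adjacency matrix of the complete graph K4: leading eigenvalue is n - 1 = 3
A = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
lam, v1 = leading_eigenpair(A)
# Rank-1 approximation term: A_1[i][j] = lam * v1[i] * v1[j]
A1 = [[lam * v1[i] * v1[j] for j in range(4)] for i in range(4)]
```

For d > 1 one would deflate (subtract λ_1 v_1 v_1^T) and repeat, or use a proper eigensolver; the sketch only shows where A_d comes from.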
First, for each fixed vertex set Y, we show that it is easy to find the optimal set X that maximizes 1_X^T A_d 1_Y for that Y. Since this turns out to be easy, the challenge is to bound the number of different vertex sets Y that we need to check. Do we need to exhaustively check all (n choose k) k-sets Y? We show that this question is equivalent to searching the span of the first d eigenvectors of A, and collecting in a set S_d the top-k coordinates of all vectors in that d-dimensional space. By modifying the Spannogram theory of (Karystinos & Liavas, 2010; Asteris et al., 2014), we show that this set has size O(n^d) and can be constructed in time O(n^{d+1}). This implies that DBkS can be solved in time O(n^{d+1}) on A_d.

Computational complexity. The worst-case time complexity of the constant-rank DBkS solver on A_d is O(T_d + n^{d+1}), where T_d is the time to compute the first d eigenvectors of A. Under conditions satisfied by many real-world graphs, we show that we can modify our algorithm to obtain a randomized one that succeeds with probability 1 − δ and is ϵ-far from the optimal rank-d solution, while its complexity reduces to nearly linear in the number of edges m of the graph G: O(m log n + n · ϵ^{−d} log(1/(ϵ δ))).

Algorithm 2 lowrank_DBkS(k, d, A)
1: [V_d, Λ_d] = EVD(A, d)
2: S_d = Spannogram(k, V_d)
3: {X_d, Y_d} = arg max_{|X|=k, Y∈S_d} 1_X^T V_d Λ_d V_d^T 1_Y
4: Output: {X_d, Y_d}

Spannogram(k, V_d)
1: S_d = {topk(v) : v ∈ span(v_1, . . . , v_d)}
2: Output: S_d

### 2.2. Approximation Guarantees

We approximate DBkS by finding a solution to the constant-rank problem

max_{|X|=k} max_{|Y|=k} 1_X^T A_d 1_Y.

We output a pair of vertex sets X_d, Y_d, which we refer to as the rank-d optimal solution, with density opt_d^B = (1/k) 1_{X_d}^T A 1_{Y_d}. Our approximation guarantees measure how far opt_d^B is from opt^B, the optimal density for DBkS.
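The first fact can be seen directly: for a fixed Y, the objective 1_X^T A_d 1_Y = Σ_{i∈X} (A_d 1_Y)_i, so the optimal X simply collects the k largest entries of A_d 1_Y. A small sketch (our own naming; A_d given as a dense matrix):

```python
def topk(v, k):
    # Indices of the k largest signed entries of v
    return sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]

def best_X_for_Y(Ad, Y, k):
    # For fixed Y, the inner maximization over X is solved by
    # X = topk(Ad 1_Y): pick the k rows with the most mass toward Y
    w = [sum(row[j] for j in Y) for row in Ad]
    return topk(w, k)

Ad = [[0, 2, 1],
      [2, 0, 0],
      [1, 0, 0]]
X = best_X_for_Y(Ad, Y=[1, 2], k=2)  # row 0 has weight 3 toward Y, so 0 is in X
```

This is why only the set of candidate Y supports, S_d, needs to be enumerated: the matching X is free.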
Our bounds capture a simple core idea: the loss in our approximation comes from solving the problem on A_d instead of on the full-rank matrix A. This loss is quantified in the next lemma. The detailed proofs of the following results are in the supplemental material.

Lemma 2. For any matrix A: opt_d^B ≥ opt^B − 2 |λ_{d+1}|, where λ_i is the i-th largest eigenvalue of A.

Using an appropriate pre-processing step and then running Algorithm 2 as a subroutine on a sub-sampled and low-rank version of A, we output a k-subgraph Z_d that has density opt_d. By essentially combining Lemmata 1 and 2 we obtain the following bounds.

Theorem 1. Algorithm 1 outputs in time O(n^{d+2} log(n/δ)) a k-subgraph with density opt_d = den(Z_d) ≥ 0.5 (1 − δ) opt − 2 |λ_{d+1}|, with probability 1 − 1/n, where λ_i is the i-th largest, in magnitude, eigenvalue of the adjacency matrix of the graph. If the graph is bipartite, or if the largest d eigenvalues of the graph are positive, then our algorithm runs in time O(n^{d+1} + T_d), and outputs a k-subgraph with density opt_d ≥ opt − 2 |λ_{d+1}|, where T_d is the time to compute the d leading eigenvectors of the adjacency matrix of the graph.

Using bounds on eigenvalues of graphs, Theorem 1 translates to the following approximation guarantees.

Theorem 2. If the densest k-subgraph contains a constant fraction of all the edges, and k = Θ(√|E|), then we can approximate DkS within a factor of 2 + ϵ, in time n^{O(1/ϵ²)}. If additionally the graph is bipartite, then we can approximate DkS within a factor of 1 + ϵ.

Remark 1. The above results are similar to the 1 + ϵ ratio of (Arora et al., 1995), which holds for graphs where the densest k-subgraph contains Ω(n²) edges.

Graph-dependent bounds. For any given graph, after running our constant-rank solver on A_d, we can compute an upper bound on the optimal density opt via bounds on opt^B, since it is easy to see that opt^B ≥ opt. Our graph-dependent bound is the minimum of three upper bounds on the unknown optimal density:

Lemma 3.
The optimal density of DkS can be bounded as

opt ≤ min { (1/k) 1_{X_d}^T A_d 1_{Y_d} + |λ_{d+1}|, k − 1, λ_1 }.

In our experimental section, we plot the above upper bounds and show that for most tested graphs our algorithm performs provably within 70% of the upper bound on the optimal density. These are far stronger guarantees than the best available a priori bounds.

## 3. The Spannogram Framework

In this section, we describe how our constant-rank solver operates by examining candidate vectors in a low-dimensional span of A. Here, we work on a rank-d matrix

A_d = v_1 u_1^T + . . . + v_d u_d^T,

where u_i = λ_i v_i, and we wish to solve:

max_{|X|=|Y|=k} 1_X^T (v_1 u_1^T + . . . + v_d u_d^T) 1_Y.   (1)

Observe that we can rewrite (1) in the following way:

max_{|X|=|Y|=k} 1_X^T (v_1 (u_1^T 1_Y) + . . . + v_d (u_d^T 1_Y)) = max_{|Y|=k} max_{|X|=k} 1_X^T v_Y,   (2)

where v_Y = c_1 v_1 + . . . + c_d v_d, with c_i = u_i^T 1_Y, is an n-dimensional vector generated by the d-dimensional subspace spanned by v_1, . . . , v_d.

We now make a key observation: for every fixed vector v_Y in (2), the index set X that maximizes 1_X^T v_Y can be easily computed. It is not hard to see that, for any fixed vector v_Y, the k-subset X that maximizes 1_X^T v_Y = Σ_{i∈X} [v_Y]_i corresponds to the set of the k largest signed coordinates of v_Y. That is, the locally optimal k-set is topk(v_Y).

We now wish to find all possible locally optimal sets X. If we could check all vectors v_Y, then we could find all locally optimal index sets topk(v_Y). Let us denote by S_d the set of all k-subsets X that are optimal solutions of the inner maximization of (2) for some vector v in the span of v_1, . . . , v_d:

S_d = {topk(c_1 v_1 + . . . + c_d v_d) : c_1, . . . , c_d ∈ R}.

Clearly, this set contains all possible locally optimal X sets of the form topk(v_Y). Therefore, we can rewrite DBkS on A_d as

max_{|Y|=k} max_{X∈S_d} 1_X^T A_d 1_Y.   (3)
The above problem can now be solved in the following way: for every set X ∈ S_d, find the locally optimal set Y that maximizes 1_X^T A_d 1_Y, which is topk(A_d 1_X). Then, we simply test all such (X, Y) pairs on A_d and keep the maximizer. Hence, solving DBkS on A_d is equivalent to constructing the set of k-supports S_d and then finding the optimal solution in that set. How large can S_d be, and can we construct it in polynomial time? Initially, one might expect that S_d could have size as large as (n choose k). Instead, we show that S_d is tremendously smaller, as in (Karystinos & Liavas, 2010) and (Asteris et al., 2014).

Lemma 4. The set S_d has size at most O(n^d) and can be built in time O(n^{d+1}) using Algorithm 2.

### 3.1. Constructing the set S_d

We build up to the general rank-d algorithm by explaining special cases that are easier to understand.

Rank-1 case. We start with the d = 1 case, where we have S_1 = {topk(c_1 v_1) : c_1 ∈ R}. It is not hard to see that there are only two supports to include in S_1: topk(v_1) and topk(−v_1). These two sets can be constructed in time O(n) via a partial sorting and selection algorithm (Cormen et al., 2001). Hence, S_1 has size 2 and can be constructed in time O(n).

Rank-2 case. This is the first non-trivial d, and it exhibits the details of the Spannogram algorithm. Let an auxiliary angle φ ∈ Φ = [0, π) and let c = [c_1, c_2]^T = [sin φ, cos φ]^T.³ Then, we re-express c_1 v_1 + c_2 v_2 in terms of φ as

v(φ) = sin φ · v_1 + cos φ · v_2.   (4)

This means that we can rewrite the set S_2 as

S_2 = {topk(±v(φ)) : φ ∈ [0, π)}.

Observe that each element of v(φ) is a continuous spectral curve in φ: [v(φ)]_i = [v_1]_i sin φ + [v_2]_i cos φ. Consequently, the top/bottom-k supports of v(φ) (i.e., topk(±v(φ))) are themselves a function of φ. How can we find all possible supports?

Figure 1. A rank d = 2 spannogram for n = 5 and two random vectors v_1, v_2: the five curves [v(φ)]_1, . . . , [v(φ)]_5 plotted over φ.
Observe that every two curves intersect in exactly one point. These intersection points define intervals in which a top-k set is invariant.

The Spannogram. In Fig. 1, we draw an example plot of five curves [v(φ)]_i, i = 1, . . . , 5, which we call a spannogram. From the spannogram in Fig. 1, we can see that the continuity of these sinusoidal curves implies a local invariance property of the top/bottom-k supports topk(±v(φ)) in a small neighborhood around a fixed φ. So, when does a top/bottom-k support change? The index sets topk(±v(φ)) change if and only if two curves cross, i.e., when the ordering of two elements [v(φ)]_i, [v(φ)]_j changes.

Finding all supports: There are n curves and each pair intersects at exactly one point in the Φ domain.⁴ Therefore, there are exactly (n choose 2) intersection points, and these define (n choose 2) + 1 intervals. Within an interval, the top/bottom-k supports topk(±v(φ)) remain the same. Hence, it is now clear that |S_2| ≤ 2 ((n choose 2) + 1) = O(n²).

A way to find all supports in S_2 is to compute the vectors v(φ_{i,j}) at the intersection points of every pair of curves i, j, and then the supports in the two intervals adjacent to each intersection point. The vector v(φ_{i,j}) at the intersection of curves i and j can be computed by first solving the linear system

[v(φ_{i,j})]_i = [v(φ_{i,j})]_j  ⇔  (e_i − e_j)^T [v_1 v_2] c_{i,j} = 0

for the unknown vector c_{i,j}, where e_i is the i-th column of the n × n identity matrix, i.e., c_{i,j} = nullspace((e_i − e_j)^T [v_1 v_2]). Then, we compute v(φ_{i,j}) = [v_1 v_2] c_{i,j}. Further details on breaking ties in topk(v(φ_{i,j})) can be found in the supplemental material.

³Observe that as we scan φ, the vectors ±c express all possible unit-norm vectors on the circle.

⁴Here we assume that the curves are in general position. This can always be ensured by infinitesimally perturbing the curves, as in (Papailiopoulos et al., 2013).
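A compact sketch of this rank-2 construction (our own Python rendering; tie-breaking at crossings and the interval-adjacent supports are simplified, which suffices for generic inputs):

```python
def topk(v, k):
    # Support of the k largest signed entries, as a hashable tuple
    return tuple(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])

def spannogram_rank2(v1, v2, k):
    # Enumerate S_2: for each pair of curves (i, j), find a direction c on
    # which [v(phi)]_i = [v(phi)]_j, and collect top-k supports of +/- v there
    n = len(v1)
    S = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Solve (e_i - e_j)^T [v1 v2] c = 0: c is orthogonal to
            # the row (v1[i] - v1[j], v2[i] - v2[j])
            a, b = v1[i] - v1[j], v2[i] - v2[j]
            c1, c2 = -b, a  # nullspace direction of the 1x2 system
            v = [c1 * v1[t] + c2 * v2[t] for t in range(n)]
            S.add(topk(v, k))
            S.add(topk([-x for x in v], k))
    return S

S2 = spannogram_rank2([1.0, 0.5, -0.2], [0.1, 0.9, 0.3], k=1)
```

Each pair contributes O(1) supports, giving the O(n²) bound; a full implementation would also handle the ties at each crossing explicitly.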
Computational cost: We have (n choose 2) intersection points, and at each we calculate the top/bottom-k supports of v(φ_{i,j}). The top/bottom-k elements of each v(φ_{i,j}) can be computed in time O(n) using a partial sorting and selection algorithm (Cormen et al., 2001). Since we perform this routine O(n²) times, the total complexity of our rank-2 algorithm is O(n³).

General rank-d case. The algorithm generalizes to arbitrary dimension d, as we show in the supplemental material; its pseudocode is given as Algorithm 3.

Remark 2. Observe that each iteration of the loop under line 2 of Algorithm 3 can be computed in parallel. This allows us to parallelize the Spannogram.

Algorithm 3 Spannogram(k, V_d)
 1: S_d = ∅
 2: for all (i_1, . . . , i_d) ∈ {1, . . . , n}^d and s ∈ {−1, 1} do
 3:   c = s · nullspace([[V_d]_{i_1,:} − [V_d]_{i_2,:}; . . . ; [V_d]_{i_1,:} − [V_d]_{i_d,:}])
 4:   v = V_d c
 5:   S = topk(v)
 6:   T = S \ {i_1, . . . , i_d}
 7:   for all (k − |T|)-subsets J of {i_1, . . . , i_d} do
 8:     S_d = S_d ∪ {T ∪ J}
 9:   end for
10: end for
11: Output: S_d

### 3.2. An approximate S_d in nearly-linear time

In our exact solver, we solve DBkS on A_d in time O(n^{d+1}). Surprisingly, when A_d has only positive eigenvalues, we can tightly approximate DBkS on A_d in nearly linear time.

Theorem 3. Let the d largest eigenvalues of the graph be positive, and let the ratio of the d-th to the (d + 1)-st largest eigenvalue be a constant C. Then, we can output, with probability 1 − δ, a k-subgraph with density (1 − ϵ)² opt_d, in time O(m log n + n · ϵ^{−d} log(1/(ϵ δ))).

The main idea is that, instead of checking all O(n^d) possible k-sets in S_d, we can approximately solve the problem by randomly sampling M = O(ϵ^{−d} log(1/(ϵ δ))) vectors in the span of v_1, . . . , v_d. Our proof is based on the fact that we can approximate the surface of the d-dimensional sphere with M randomly sampled vectors from the span of v_1, . . . , v_d. This allows us to identify, with probability 1 − δ, near-optimal candidates in S_d.
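The sampling idea can be sketched as follows (our own Python rendering; `num_samples` stands in for the M = O(ϵ^{−d} log(1/(ϵδ))) bound, and Gaussian coefficients are drawn because their direction is uniform on the sphere):

```python
import random

def topk(v, k):
    return tuple(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])

def approx_spannogram(V, k, num_samples, seed=0):
    # V: list of d eigenvectors (each of length n), assumed pre-scaled by the
    # square roots of their eigenvalues, as in the Lambda^{1/2} weighting.
    # Sample random directions c and keep the supports topk(+/- V c).
    rng = random.Random(seed)
    n, d = len(V[0]), len(V)
    S = set()
    for _ in range(num_samples):
        c = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [sum(c[j] * V[j][t] for j in range(d)) for t in range(n)]
        S.add(topk(v, k))
        S.add(topk([-x for x in v], k))
    return S

candidates = approx_spannogram([[1.0, 0.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0, 0.0]], k=2, num_samples=50)
```

Each sample costs O(n d) plus an O(n) selection, which is where the nearly-linear total running time comes from.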
The modified algorithm is very simple and is given below; its analysis can be found in the supplemental material.

Algorithm 4 Spannogram_approx(k, V_d, Λ_d)
1: for i = 1 : O(ϵ^{−d} log(1/(ϵ δ))) do
2:   v = (Λ_d^{1/2} V_d^T)^T · randn(d, 1)
3:   S_d = S_d ∪ topk(v) ∪ topk(−v)
4: end for
5: Output: S_d

## 4. Scaling up

In this section, we present the two key scalability features that allow us to scale up to graphs with billions of edges.

### 4.1. Vertex Sparsification

We introduce a very simple and efficient pre-processing step for discarding vertices that are unlikely to appear in a top-k set in S_d. This step runs after we compute A_d and uses the weighted leverage score of the i-th vertex,

ℓ_i = Σ_{j=1}^d [V_d]_{i,j}² |λ_j|,

to decide whether to discard it. We show in the supplemental material that, by appropriately setting a threshold, we can guarantee a provable bound on the error introduced. In our experimental results, the above elimination is able to reduce n to approximately n̂ ≈ 10 k for a provably small additive error, even for data sets where n = 10^8.

### 4.2. MapReduce Implementation

A MapReduce implementation allows scaling out to a large number of compute nodes that can work in parallel. The reader can refer to (Meng & Mahoney, 2013; Bahmani et al., 2012) for a comprehensive treatment of the MapReduce paradigm. In short, the Hadoop/MapReduce infrastructure stores the input graph as a distributed file spread across multiple machines; it provides a tuple-streaming abstraction, where each map and reduce function receives and emits tuples as (key, value) pairs. The role of the keys is to ensure information aggregation: all tuples with the same key are processed by the same reducer. For the spectral decomposition step of our scheme, we design a simple implementation of the power method in MapReduce. The details are beyond the scope of this work; high-performance implementations are already available in the literature, e.g., (Lin & Schatz, 2010).
We instead focus on the novel implementation of the Spannogram. Our MapReduce implementation of the rank-2 Spannogram is outlined in Algorithm 5. The Mapper is responsible for the duplication and dissemination of the eigenvectors, V_2 and U_2 = V_2 Λ_2, to all reducers. Line 3 emits the j-th row of V_2 and U_2 once for every node i. Since i is used as the key, this ensures that every reducer receives V_2 and U_2 in their entirety. From the breakdown of the Spannogram in Section 3, it is understood that, for the rank-2 case, it suffices to solve a simple system of equations for every pair of nodes. The Reducer for node i receives the full eigenvectors V_2, U_2 and is responsible for solving the problem for every pair (i, j) with j > i. Then, Line 6 emits the best candidate computed at Reducer i. A trivial final step, not outlined here, collects all n² candidate sets and keeps the best one as the final solution.

The basic outline in Algorithm 5 comes with heavy communication needs and was chosen here for ease of exposition. The more efficient version that we implement does not replicate V_2, U_2 n times. Instead, the number of reducers, say R = n^α, is fine-tuned to the capabilities of the cluster. The mappers emit V_2, U_2 R times, once for every reducer. Then, reducer r is responsible for solving for node pairs (i, j), where i ≡ r (mod R) and j > i. Depending on the performance bottleneck, different choices of α are more appropriate. We divide the construction of the O(n²) candidate sets in S_2 over O(n^α) reducers, and each of them computes O(n^{2−α}) candidate subgraphs. The total communication cost for this parallelization scheme is O(n^{1+α}): n^α reducers need access to the entire V_2, U_2, which has 2 · 2 · n entries. Moreover, the total computation cost for each reducer is O(n^{3−α}).
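The pair-partitioning scheme can be sketched as follows (our own illustration; `R` plays the role of n^α reducers, with reducer r owning the pairs whose smaller index is congruent to r mod R):

```python
def partition_pairs(n, R):
    # Assign pair (i, j), j > i, to reducer r = i mod R, mirroring the
    # i = r (mod R) scheme: every pair lands on exactly one reducer,
    # and each reducer handles O(n^2 / R) pairs
    buckets = {r: [] for r in range(R)}
    for i in range(n):
        for j in range(i + 1, n):
            buckets[i % R].append((i, j))
    return buckets

buckets = partition_pairs(6, 3)
total = sum(len(p) for p in buckets.values())  # all 15 pairs, each exactly once
```

Because the mapping depends only on i, a reducer can stream through its pairs without coordination, which is what makes the Spannogram embarrassingly parallel at this level.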
Algorithm 5 Spannogram_MR(V_2, U_2)
1: Map({[V_2]_{j,:}, [U_2]_{j,:}, j}):
2: for i = 1 : n do
3:   emit: (i, {[V_2]_{j,1}, [V_2]_{j,2}, [U_2]_{j,1}, [U_2]_{j,2}, j})
4: end for

1: Reduce_i((i, {[V_2]_{j,1}, [V_2]_{j,2}, [U_2]_{j,1}, [U_2]_{j,2}, j}), for all j):
2: for each j ≥ i + 1 do
3:   c = nullspace([V_2]_{i,:} − [V_2]_{j,:})
4:   [den_j, {X_j, Y_j}] = max_{|Y|=k, X∈topk(±V_2 c)} 1_X^T V_2 U_2^T 1_Y
5: end for
6: emit: (i, {X_i, Y_i}) = arg max_j 1_{X_j}^T V_2 U_2^T 1_{Y_j}

## 5. Experimental Evaluation

We experimentally evaluate the performance of our algorithm and compare it to the truncated power method (TPower) of (Yuan & Zhang, 2011), a greedy algorithm by (Feige et al., 2001) (GFeige), and another greedy algorithm by (Ravi et al., 1994) (GRavi). We performed experiments on synthetic dense subgraphs and also on massive real graphs from multiple sources. In all experiments, we compare the density of the subgraph obtained by the Spannogram to the density of the output subgraphs of the other algorithms. Our experiments illustrate three key points: (1) for all tested graphs, our method outperforms, sometimes significantly, all other algorithms compared; (2) our method is highly scalable, allowing us to solve far larger problem instances; (3) our data-dependent upper bound in many cases provides a certificate of near-optimality, far more accurate and useful than what a priori bounds are able to do.

Figure 2. Planted clique experiments for random graphs. (a) Densities of the recovered subgraph vs. the expected number of edges, for GFeige, GRavi, TPower, and the Spannogram. (b) Running times of the Spannogram and power iteration (two top eigenvectors) on 2.5 billion edges, as a function of the number of cores.

Planted clique. We first consider the so-called (and now much studied) Planted Clique problem: we seek to find a clique of size k that has been planted in a graph where all other edges are drawn independently with probability 1/2. We scale our randomized experiments from n = 100 up to n = 10^5.
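For reference, planted-clique instances of this kind can be generated as in the following sketch (our own code: a G(n, 1/2) background graph with a clique forced on k random vertices):

```python
import random

def planted_clique_graph(n, k, seed=0):
    # Background edges appear independently with probability 1/2;
    # all edges inside the planted clique are present
    rng = random.Random(seed)
    clique = set(rng.sample(range(n), k))
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if (i in clique and j in clique) or rng.random() < 0.5:
                A[i][j] = A[j][i] = 1
    return A, clique

A, clique = planted_clique_graph(60, 12)
```

Recovery is then judged by whether an algorithm's output k-set matches the hidden `clique`.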
In all cases we set the size of the clique to k = 3√n, close to what is believed to be the critical computability threshold. In all our experiments, GRavi, TPower, and the Spannogram successfully recovered the hidden clique. However, as can be seen in Fig. 2, the Spannogram algorithm is the only one able to scale up to n = 10^5, a massive dense graph with about 2.5 billion edges. The reason is that this graph does not fit in the main memory of one machine, which caused all centralized algorithms to crash after several hours. Our MapReduce implementation scales out smoothly, since it splits the problem into multiple smaller problems solved in parallel. Specifically, we used the Amazon Web Services Elastic MapReduce framework (aws). We implemented our map and reduce functions in Python using the MRJob class (mrj). For our biggest experiments we used a cluster of 100 m1.xlarge AWS instances (a total of 800 cores). The running times of our experiments over MapReduce are shown in Fig. 2(b). The main bottleneck is the computation of the first two eigenvectors, which is performed by repeating the power iteration for a few (typically 4) iterations. This step is not the emphasis of this work and has not been optimized. The Spannogram algorithm itself is significantly faster, and since it is CPU intensive the benefits of parallelization are clear. In principle, the other algorithms could also be implemented over MapReduce, but that requires non-trivial distributed algorithm design. As is well known, e.g., (Meng & Mahoney, 2013), implementing iterative machine learning algorithms over MapReduce can be a significant task, and schemes that perform worse in standard metrics can be highly preferable for this parallel framework. Careful MapReduce algorithmic design is needed especially for dense graphs like the one in the hidden clique problem.

Figure 3. Subgraph density vs. subgraph size (k). We compare our DkS Spannogram algorithm with the algorithms from (Feige et al., 2001) (GFeige), (Ravi et al., 1994) (GRavi), and (Yuan & Zhang, 2011) (TPower). Across all subgraph sizes k, we obtain higher subgraph densities using Spannograms of rank d = 2 or 5. We also obtain a provable data-dependent upper bound (solid black line) on the objective. This proves that for these data sets our algorithm is typically within 80% of optimality, for all sizes up to k = 250; indeed, for small subgraph sizes we find a clique, which is clearly optimal. Further experiments on multiple other data sets are shown in the supplemental material.

Real Datasets. Next, we demonstrate our method's performance on real datasets and also illustrate the power of our data-dependent bounds. We run experiments on large graphs from different applications; our findings are presented in Fig. 3. The figure compares the density achieved by the Spannogram algorithm for ranks 1, 2, and 5 to the performance of GFeige, GRavi, and TPower. It shows that the rank-2 and rank-5 versions of our algorithm improve, sometimes significantly, over the other techniques. Our novel data-dependent upper bound shows that our results on these data sets are provably near-optimal. The experiments are performed on two community graphs (com-LiveJournal and com-DBLP), a web graph (web-NotreDame), and a subset of the Facebook graph. A larger set of experiments is included in the supplemental material. Note that the largest graph in Figure 3 contains no more than 35 million edges; these cases fit in the main memory of a single machine, and the running times, all measured on a standard MacBook Pro laptop using Matlab, are presented in the supplemental material. In summary, rank-2 took less than one second on all these graphs, while prior methods took approximately the same time, up to a few seconds.
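The eigenvector computation discussed above, the main bottleneck of the pipeline, amounts to a few steps of block power (orthogonal) iteration for the two leading eigenvectors. The following is a minimal centralized sketch, assuming a dense symmetric adjacency matrix; the name `top2_eigvecs` is hypothetical and the small iteration budget mirrors the few (typically 4) rounds used over MapReduce.

```python
import numpy as np

def top2_eigvecs(A, iters=4, seed=0):
    """Block power iteration sketch (hypothetical name) for the two
    leading eigenvectors of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((A.shape[0], 2)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A @ Q)[0]      # multiply, then re-orthonormalize
    lam = np.diag(Q.T @ A @ Q)          # Rayleigh-quotient eigenvalue estimates
    return lam, Q
```

In the distributed setting, the matrix-vector products A @ Q are the part that is sharded across machines; the 2-column QR step is cheap and can run on a single node.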
Rank-1 was significantly faster than all other methods on all tested graphs and took fractions of a second. Rank-5 took up to 1000 seconds for the largest graph (LiveJournal). We conclude that our algorithm is an efficient option for finding dense subgraphs. Different rank choices give a tradeoff between accuracy and running time, while the parallel nature allows scalability when needed. Further, our theoretical upper bound can be useful for practitioners investigating dense structures in large graphs.

6. Acknowledgments

The authors would like to acknowledge support from NSF grants CCF 1344364, CCF 1344179, DARPA XDATA, and research gifts by Google, Docomo and Microsoft.

References

Amazon Web Services, Elastic MapReduce. URL http://aws.amazon.com/elasticmapreduce/.

MRJob. URL http://pythonhosted.org/mrjob/.

Alon, Noga, Lee, Troy, Shraibman, Adi, and Vempala, Santosh. The approximate rank of a matrix and its algorithmic applications. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pp. 675-684. ACM, 2013.

Ames, Brendan P. W. Convex relaxation for the planted clique, biclique, and clustering problems. PhD thesis, University of Waterloo, 2011.

Arora, Sanjeev, Karger, David, and Karpinski, Marek. Polynomial time approximation schemes for dense instances of NP-hard problems. In STOC, 1995.

Asahiro, Yuichi, Iwama, Kazuo, Tamaki, Hisao, and Tokuyama, Takeshi. Greedily finding a dense subgraph. Journal of Algorithms, 34(2):203-221, 2000.

Asteris, Megasthenis, Papailiopoulos, Dimitris S., and Karystinos, George N. The sparse principal component of a constant-rank matrix. IEEE Trans. IT, 60(4):2281-2290, 2014.

Bahmani, Bahman, Kumar, Ravi, and Vassilvitskii, Sergei. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment, 5(5):454-465, 2012.

Bhaskara, Aditya, Charikar, Moses, Chlamtac, Eden, Feige, Uriel, and Vijayaraghavan, Aravindan.
Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In STOC, 2010.

Boutsidis, Christos, Mahoney, Michael W., and Drineas, Petros. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968-977. Society for Industrial and Applied Mathematics, 2009.

Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., and Stein, Clifford. Introduction to Algorithms. MIT Press, 2001.

d'Aspremont, Alexandre et al. Weak recovery conditions using graph partitioning bounds. 2010.

Dourisboure, Yon, Geraci, Filippo, and Pellegrini, Marco. Extraction and classification of dense communities in the web. In WWW, 2007.

Feige, Uriel and Langberg, Michael. Approximation algorithms for maximization problems arising in graph partitioning. Journal of Algorithms, 41(2):174-211, 2001.

Feige, Uriel, Peleg, David, and Kortsarz, Guy. The dense k-subgraph problem. Algorithmica, 29(3):410-421, 2001.

Gibson, David, Kumar, Ravi, and Tomkins, Andrew. Discovering large dense subgraphs in massive graphs. In PVLDB, 2005.

Hu, Haiyan, Yan, Xifeng, Huang, Yu, Han, Jiawei, and Zhou, Xianghong Jasmine. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(suppl 1):i213-i221, 2005.

Jethava, Vinay, Martinsson, Anders, Bhattacharyya, Chiranjib, and Dubhashi, Devdatt. The Lovász theta function, SVMs and finding large dense subgraphs. In NIPS, 2012.

Karystinos, George N. and Liavas, Athanasios P. Efficient computation of the binary vector that maximizes a rank-deficient quadratic form. IEEE Trans. IT, 56(7):3581-3593, 2010.

Khot, Subhash. Ruling out PTAS for graph min-bisection, densest subgraph and bipartite clique. In FOCS, 2004.

Lin, Jimmy and Schatz, Michael. Design patterns for efficient graph algorithms in MapReduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 78-85. ACM, 2010.
Mahoney, Michael W. and Drineas, Petros. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702, 2009.

Meng, Xiangrui and Mahoney, Michael. Robust regression on MapReduce. In Proceedings of The 30th International Conference on Machine Learning, pp. 888-896, 2013.

Miller, B., Bliss, N., and Wolfe, P. Subgraph detection using eigenvector L1 norms. In NIPS, 2010.

Papailiopoulos, Dimitris, Dimakis, Alexandros, and Korokythakis, Stavros. Sparse PCA through low-rank approximations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 747-755, 2013.

Ravi, Sekharipuram S., Rosenkrantz, Daniel J., and Tayi, Giri K. Heuristic and special case algorithms for dispersion problems. Operations Research, 42(2):299-310, 1994.

Saha, Barna, Hoch, Allison, Khuller, Samir, Raschid, Louiqa, and Zhang, Xiao-Ning. Dense subgraphs with restrictions and applications to gene annotation graphs. In Research in Computational Molecular Biology, pp. 456-472. Springer, 2010.

Srivastav, Anand and Wolf, Katja. Finding dense subgraphs with semidefinite programming. Springer, 1998.

Suzuki, Akiko and Tokuyama, Takeshi. Dense subgraph problems with output-density conditions. In Algorithms and Computation, pp. 266-276. Springer, 2005.

Yuan, Xiao-Tong and Zhang, Tong. Truncated power method for sparse eigenvalue problems. arXiv preprint arXiv:1112.2679, 2011.