The Product Cut

Xavier Bresson (Nanyang Technological University, Singapore) xavier.bresson@ntu.edu.sg
Thomas Laurent (Loyola Marymount University, Los Angeles) tlaurent@lmu.edu
Arthur Szlam (Facebook AI Research, New York) aszlam@fb.com
James H. von Brecht (California State University, Long Beach) james.vonbrecht@csulb.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We introduce a theoretical and algorithmic framework for multi-way graph partitioning that relies on a multiplicative cut-based objective. We refer to this objective as the Product Cut. We provide a detailed investigation of the mathematical properties of this objective and an effective algorithm for its optimization. The proposed model has strong mathematical underpinnings, and the corresponding algorithm achieves state-of-the-art performance on benchmark data sets.

1 Introduction

We propose the following model for multi-way graph partitioning. Let $G = (V, W)$ denote a weighted graph, with $V$ its vertex set and $W$ its weighted adjacency matrix. We define the Product Cut of a partition $\mathcal{P} = (A_1, \ldots, A_R)$ of the vertex set $V$ as

$$\mathrm{Pcut}(\mathcal{P}) = \frac{\prod_{r=1}^{R} Z(A_r, A_r^c)}{e^{H(\mathcal{P})}}, \qquad H(\mathcal{P}) = -\sum_{r=1}^{R} \theta_r \log \theta_r, \qquad (1)$$

where $\theta_r = |A_r|/|V|$ denotes the relative size of a set. This model provides a distinctive way to incorporate classical notions of a quality partition. The non-linear, non-local function $Z(A_r, A_r^c)$ of a set measures its intra- and inter-connectivity with respect to the graph. The entropic balance $H(\mathcal{P})$ measures deviations of the partition $\mathcal{P}$ from a collection of sets $(A_1, \ldots, A_R)$ with equal size. In this way, the Product Cut optimization parallels the classical Normalized Cut optimization [10, 15, 13] in terms of its underlying notion of cluster, and it arises quite naturally as a multiplicative version of the Normalized Cut. Nevertheless, the two models strongly diverge beyond the point of this superficial similarity. We provide a detailed analysis to show that (1) settles the compromise between cut and balance in a fundamentally different manner than classical objectives, such as the Normalized Cut or the Cheeger Cut. The sharp inequalities

$$0 \le \mathrm{Ncut}(\mathcal{P}) \le 1 \qquad \text{and} \qquad e^{-H(\mathcal{P})} \le \mathrm{Pcut}(\mathcal{P}) \le 1 \qquad (2)$$

succinctly capture this distinction; the Product Cut exhibits a non-vanishing lower bound while the Normalized Cut does not. We show analytically and experimentally that this distinction leads to superior stability properties and performance.

From an algorithmic point of view, we show how to cast the minimization of (1) as a convex maximization program. This leads to a simple, exact continuous relaxation of the discrete problem that has a clear mathematical structure. We leverage this formulation to develop a monotonic algorithm for optimizing (1) via a sequence of linear programs, and we introduce a randomized version of this strategy that leads to a simple yet highly effective algorithm. We also introduce a simple version of Algebraic Multigrid (AMG) tailored to our problem that allows us to perform each step of the algorithm at very low cost. On graphs that contain reasonably well-balanced clusters of medium scale, the algorithm provides a strong combination of accuracy and efficiency. We conclude with an experimental evaluation and comparison of the algorithm on real world data sets to validate these claims.
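To make the role of the entropic balance concrete, the short sketch below (an illustrative Python snippet, not part of the paper's released code; the cluster sizes are arbitrary example values) evaluates $e^{-H(\mathcal{P})}$ for a balanced and an ill-balanced partition, showing how the lower bound in (2) approaches 1 as the partition degenerates.

```python
import numpy as np

def entropic_balance(sizes):
    """H(P) = -sum_r theta_r log theta_r, with theta_r = |A_r| / |V|."""
    theta = np.asarray(sizes, dtype=float)
    theta = theta / theta.sum()
    return -np.sum(theta * np.log(theta))

# Balanced 4-way partition of 1000 vertices: e^{-H(P)} = 1/4.
print(np.exp(-entropic_balance([250, 250, 250, 250])))   # 0.25
# Ill-balanced partition: the lower bound e^{-H(P)} in (2) approaches 1,
# so any Product Cut value must also approach 1.
print(np.exp(-entropic_balance([997, 1, 1, 1])))          # ~0.98
```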
2 The Product Cut Model

We begin by introducing our notation and by describing the rationale underlying our model. We use $G = (V, W)$ to denote a graph on $n$ vertices $V = \{v_1, \ldots, v_n\}$ with weighted edges $W = \{w_{ij}\}_{i,j=1}^{n}$ that encode similarity between vertices. We denote partitions of the vertex set into $R$ subsets as $\mathcal{P} = (A_1, \ldots, A_R)$, with the understanding that the $A_r \subset V$ satisfy the covering constraint $A_1 \cup \cdots \cup A_R = V$, the non-overlapping constraint $A_r \cap A_s = \emptyset$ $(r \neq s)$, and the non-triviality constraint $A_r \neq \emptyset$. We use $f, g, h, u, v$ to denote vertex functions $f : V \to \mathbb{R}$, which we view as functions $f(v_i)$ and $n$-vectors $f \in \mathbb{R}^n$ interchangeably. For a subset $A \subset V$ we use $|A|$ for its cardinality and $1_A$ for its indicator function. Finally, for a given graph $G = (V, W)$ we use $D := \mathrm{diag}(W 1_V)$ to denote the diagonal matrix of weighted vertex degrees.

The starting point for our model arises from a well-known and widely used property of the random walk on a graph. Namely, a random walker initially located in a cluster $A$ is unlikely to leave that cluster quickly [8]. Different approaches to quantifying this intuition then lead to a variety of multi-way partitioning strategies for graphs [11, 12, 1]. The personalized page-rank methodology provides an example of this approach. Following [1], given a scalar $0 < \alpha < 1$ and a non-empty vertex subset $A$ we define

$$\mathrm{pr}_A := M_\alpha^{-1}\left( 1_A / |A| \right), \qquad M_\alpha := \left( \mathrm{Id} - \alpha W D^{-1} \right) / (1 - \alpha) \qquad (3)$$

as its personalized page-rank vector. As $1_A/|A|$ is the uniform distribution on the set $A$ and $W D^{-1}$ is the transition matrix of the random walk on the graph, $\mathrm{pr}_A$ corresponds to the stationary distribution of a random walker that, at each step, moves with probability $\alpha$ to a neighboring vertex by a usual random walk, and has a probability $(1 - \alpha)$ to teleport to the set $A$. If $A$ has a reasonable cluster structure, then $\mathrm{pr}_A$ will concentrate on $A$ and assign low probabilities to its complement. Given a high-quality partition $\mathcal{P} = (A_1, \ldots, A_R)$ of $V$, we therefore expect that $\sigma_{i,r} := \mathrm{pr}_{A_r}(v_i)$ should achieve its maximal value over $1 \le r \le R$ when $r = r(i)$ is the class of the $i$th vertex. Viewed from this perspective, we can formulate an $R$-way graph partitioning problem as the task of selecting $\mathcal{P} = (A_1, \ldots, A_R)$ to maximize some combination of the collection $\{\sigma_{i,r(i)} : v_i \in V\}$ of page-rank probabilities generated by the partition. Two intuitive options immediately come to mind, the arithmetic and geometric means of the collection:

Maximize $\displaystyle \frac{1}{n} \sum_{r=1}^{R} \sum_{v_i \in A_r} \mathrm{pr}_{A_r}(v_i)$ over all partitions $(A_1, \ldots, A_R)$ of $V$ into $R$ sets. (4)

Maximize $\displaystyle \left( \prod_{r=1}^{R} \prod_{v_i \in A_r} \mathrm{pr}_{A_r}(v_i) \right)^{1/n}$ over all partitions $(A_1, \ldots, A_R)$ of $V$ into $R$ sets. (5)

The first option corresponds to a straightforward variant of the classical Normalized Cut. The second option leads to a different type of cut-based objective that we term the Product Cut. The underlying reason for considering (5) is quite natural. If we view each $\mathrm{pr}_{A_r}$ as a probability distribution, then (5) corresponds to a formal likelihood of the partition. This proves quite analogous to re-formulating the classical k-means objective for partitioning $n$ data points $(x_1, \ldots, x_n)$ into $R$ clusters $(A_1, \ldots, A_R)$ in terms of maximizing a likelihood $\prod_{r=1}^{R} \prod_{x_i \in A_r} \exp(-\|x_i - m_r\|^2)$ of Gaussian densities centered at the cluster means $m_r$. While the Normalized Cut variant (4) is certainly popular, we show that it suffers from several defects that the Product Cut resolves. As the Product Cut can be effectively optimized and generally leads to higher quality partitions, it therefore provides a natural alternative.
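As a concrete illustration of (3)-(5), the following sketch (illustrative Python, not the authors' released code; it assumes a dense NumPy weight matrix W with positive vertex degrees and a partition given as a list of index arrays) computes the personalized page-rank vectors and evaluates the arithmetic and geometric objectives.

```python
import numpy as np

def pagerank_vector(W, A, alpha=0.9):
    """Personalized page-rank pr_A = M_alpha^{-1} (1_A / |A|), eq. (3)."""
    n = W.shape[0]
    D_inv = 1.0 / W.sum(axis=1)                 # assumes every vertex has positive degree
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)   # M_alpha
    b = np.zeros(n)
    b[A] = 1.0 / len(A)
    return np.linalg.solve(M, b)

def arithmetic_objective(W, partition, alpha=0.9):   # objective (4)
    n = W.shape[0]
    return sum(pagerank_vector(W, A, alpha)[A].sum() for A in partition) / n

def geometric_objective(W, partition, alpha=0.9):    # objective (5)
    n = W.shape[0]
    logs = sum(np.log(pagerank_vector(W, A, alpha)[A]).sum() for A in partition)
    return np.exp(logs / n)
```

Maximizing the geometric objective over partitions is exactly the problem that the Product Cut, defined next, rewrites in cut-and-balance form.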
To make these ideas precise, let us define the $\alpha$-smoothed similarity matrix as $\Omega_\alpha := M_\alpha^{-1}$ and use $\{\omega_{ij}\}_{i,j=1}^{n}$ to denote its entries. Thus $\omega_{ij} = (M_\alpha^{-1} 1_{v_j})_i = \mathrm{pr}_{\{v_j\}}(v_i)$, and so $\omega_{ij}$ gives a non-local measure of similarity between the vertices $v_i$ and $v_j$ by means of the personalized page-rank diffusion process. The matrix $\Omega_\alpha$ is column stochastic, non-symmetric, non-sparse, and has diagonal entries greater than $(1 - \alpha)$. Given a partition $\mathcal{P} = (A_1, \ldots, A_R)$, we define

$$\mathrm{Pcut}(\mathcal{P}) := \frac{\prod_{r=1}^{R} Z(A_r, A_r^c)^{1/n}}{e^{H(\mathcal{P})}} \qquad \text{and} \qquad \mathrm{Ncut}(\mathcal{P}) := \frac{1}{R} \sum_{r=1}^{R} \frac{\mathrm{Cut}(A_r, A_r^c)}{\mathrm{Vol}(A_r)} \qquad (6)$$

as its Product Cut and Normalized Cut, respectively. The non-linear, non-local function

$$Z(A, A^c) := \prod_{i \in A} \frac{\sum_{j \in V} \omega_{ij}}{\sum_{j \in A} \omega_{ij}} \qquad (7)$$

of a set measures its intra- and inter-connectivity with respect to the graph, while $H(\mathcal{P})$ denotes the entropic balance (1). The definitions of $\mathrm{Cut}(A, A^c) = \sum_{i \in A^c} \sum_{j \in A} \omega_{ij}$ and $\mathrm{Vol}(A) = \sum_{i \in V} \sum_{j \in A} \omega_{ij}$ are standard. A simple computation then shows that maximizing the geometric average (5) is equivalent to minimizing the Product Cut, while maximizing the arithmetic average (4) is equivalent to minimizing the Normalized Cut.

At a superficial level, both models wish to achieve the same goal. The numerator of the Product Cut aims at a partition in which each vertex is weakly connected to vertices from other clusters and strongly connected with vertices from its own cluster. The denominator $e^{H(\mathcal{P})}$ is maximal when $|A_1| = |A_2| = \ldots = |A_R|$, and so aims at a well-balanced partition of the vertices. The objective (5) therefore promotes partitions with strongly intra-connected clusters and weakly inter-connected clusters that have comparable size. The Normalized Cut, defined here on $\Omega_\alpha$ but usually posed over the original similarity matrix $W$, is exceedingly well-known [10, 15] and also aims at finding a good balance between low cut value and clusters of comparable sizes. Despite this apparent parallel between the Product and Normalized Cuts, the two objectives behave quite differently both in theory and in practice. To illustrate this discrepancy at a high level, note first that the sharp bounds

$$0 \le \mathrm{Ncut}(\mathcal{P}) \le 1 \qquad (8)$$

hold for the Normalized Cut. The lower bound is attained for partitions $\mathcal{P}$ in which the clusters are mutually disconnected. For the Product Cut, we have

Theorem 1 The following inequality holds for any partition $\mathcal{P}$:
$$e^{-H(\mathcal{P})} \le \mathrm{Pcut}(\mathcal{P}) \le 1. \qquad (9)$$
Moreover, the lower bound is attained for partitions $\mathcal{P}$ in which the clusters are mutually disconnected.

The lower bound in (9) can be directly read from (6) and (7), while the upper bound is non-trivial and proved in the supplementary material. This theorem goes to the heart of the difference between the Product and Normalized Cuts. To illustrate this, let $\mathcal{P}^{(k)}$ denote a sequence of partitions. Then (9) shows that

$$\lim_{k \to \infty} H(\mathcal{P}^{(k)}) = 0 \quad \Longrightarrow \quad \lim_{k \to \infty} \mathrm{Pcut}(\mathcal{P}^{(k)}) = 1. \qquad (10)$$

In other words, an arbitrarily ill-balanced partition leads to arbitrarily poor values of its Product Cut. The Normalized Cut does not possess this property. As an extreme but easy-to-analyze example, consider the case where $G = (V, W)$ is a collection of isolated vertices. All possible partitions $\mathcal{P}$ consist of mutually disconnected clusters and the lower bound is reached for both (8) and (9). Thus $\mathrm{Ncut}(\mathcal{P}) = 0$ for all $\mathcal{P}$, and so all partitions are equivalent for the Normalized Cut. On the other hand $\mathrm{Pcut}(\mathcal{P}) = e^{-H(\mathcal{P})}$, which shows that, in the absence of cut information, the Product Cut will choose the partition that maximizes the entropic balance. So in this case, any partition $\mathcal{P}$ for which $|A_1| = \ldots = |A_R|$ will be a minimizer. In essence, this tighter lower bound for the Product Cut reflects its stronger balancing effect vis-à-vis the Normalized Cut.
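For concreteness, the sketch below (illustrative Python under the same assumptions as the previous snippet: a dense NumPy weight matrix and a partition given as index arrays; the helper names are hypothetical) evaluates the two objectives in (6) directly from the smoothed similarity matrix $\Omega_\alpha$.

```python
import numpy as np

def smoothed_similarity(W, alpha=0.9):
    """Omega_alpha = M_alpha^{-1}, with M_alpha = (Id - alpha W D^{-1}) / (1 - alpha)."""
    n = W.shape[0]
    D_inv = 1.0 / W.sum(axis=1)
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)
    return np.linalg.inv(M)

def product_cut(Omega, partition):
    """Pcut(P) per (6)-(7): prod_r Z(A_r, A_r^c)^{1/n} / e^{H(P)}."""
    n = Omega.shape[0]
    log_Z, H = 0.0, 0.0
    row_sums = Omega.sum(axis=1)                   # (Omega 1_V)_i
    for A in partition:
        in_mass = Omega[np.ix_(A, A)].sum(axis=1)  # (Omega 1_A)_i for i in A
        log_Z += np.sum(np.log(row_sums[A]) - np.log(in_mass))
        theta = len(A) / n
        H -= theta * np.log(theta)
    return np.exp(log_Z / n - H)

def normalized_cut(Omega, partition):
    """Ncut(P) per (6), computed on Omega_alpha."""
    total = 0.0
    for A in partition:
        vol = Omega[:, A].sum()                    # Vol(A)
        cut = vol - Omega[np.ix_(A, A)].sum()      # Cut(A, A^c)
        total += cut / vol
    return total / len(partition)
```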
2.1 (In-)Stability Properties of the Product Cut and Normalized Cut

In practice, the stronger balancing effect of the Product Cut manifests as a stronger tolerance to perturbations. We now delve deeper and contrast the two objectives by analyzing their stability properties using experimental data as well as a simplified model problem that isolates the source of the inherent difficulties.

Figure 1: The graphs $G'_n$ used for analyzing stability. (a) $A_n$ in blue, $B_n$ in green, $C$ in red. (b) $\mathcal{P}'^{\mathrm{good}}_n = (A_n, B_n \cup C)$. (c) $\mathcal{P}'^{\mathrm{bad}}_n = (A_n \cup B_n, C)$.

Invoking ideas from dynamical systems theory, we say an objective is stable if an infinitesimal perturbation of a graph $G = (V, W)$ leads to an infinitesimal perturbation of the optimal partition. If an infinitesimal perturbation leads to a dramatic change in the optimal partition, then the objective is unstable. We use a simplified model to study stability of the Product Cut and Normalized Cut objectives. Consider a graph $G_n = (V_n, W_n)$ made of two clusters $A_n$ and $B_n$ containing $n$ vertices each. Each vertex in $G_n$ has degree $k$ and is connected to $\mu k$ vertices in the opposite cluster, where $0 \le \mu \le 1$. The graph $G'_n$ is a perturbation of $G_n$ constructed by adding a small cluster $C$ of size $n' \ll n$ to the original graph. Each vertex of $C$ has degree $k'$ and is connected to $\mu' k'$ vertices in $B_n$ and $(1 - \mu') k'$ vertices in $C$ for some $0 \le \mu' \le 1$. In the perturbed graph $G'_n$, a total of $n'$ vertices in $B_n$ are linked to $C$ and have degree $k + \mu' k'$. See figure 1(a). The main properties of $G_n$ and $G'_n$ are:

Unperturbed graph $G_n$: $|A_n| = |B_n| = n$, $\mathrm{Cond}_{G_n}(A_n) = \mu$, $\mathrm{Cond}_{G_n}(B_n) = \mu$.
Perturbed graph $G'_n$: $|A_n| = |B_n| = n$, $\mathrm{Cond}_{G'_n}(A_n) = \mu$, $\mathrm{Cond}_{G'_n}(B_n) \approx \mu$, $|C| = n' \ll n$, $\mathrm{Cond}_{G'_n}(C) = \mu'$,

where $\mathrm{Cond}_G(A) = \mathrm{Cut}(A, A^c)/\min(|A|, |A^c|)$ denotes the conductance of a set. If we consider the parameters $\mu, \mu', k, k', n'$ as fixed and look at the perturbed graph $G'_n$ in the limit $n \to \infty$ of a large number of vertices, then as $n$ becomes larger the degree of the bulk vertices will remain constant while the size $|C|$ of the perturbation becomes infinitesimal. To examine the influence of this infinitesimal perturbation on each model, let $\mathcal{P}_n = (A_n, B_n)$ denote the desired partition of the unperturbed graph $G_n$, and let $\mathcal{P}'^{\mathrm{good}}_n = (A_n, B_n \cup C)$ and $\mathcal{P}'^{\mathrm{bad}}_n = (A_n \cup B_n, C)$ denote the partitions of the perturbed graph $G'_n$ depicted in figure 1(b) and 1(c), respectively. As $\mathcal{P}'^{\mathrm{good}}_n \approx \mathcal{P}_n$, a stable objective will prefer $\mathcal{P}'^{\mathrm{good}}_n$ to $\mathcal{P}'^{\mathrm{bad}}_n$, while any objective preferring the converse is unstable.

A detailed study of stability proves possible for this specific graph family. We summarize the conclusions of this analysis in the theorem below, which shows that the Normalized Cut is unstable in certain parameter regimes while the Product Cut is always stable. The supplementary material contains the proof.

Theorem 2 Suppose that $\mu, \mu', k, k', n'$ are fixed. Then

$$\mu' < 2\mu \;\Longrightarrow\; \mathrm{Ncut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) > \mathrm{Ncut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) \quad \text{for } n \text{ large enough}. \qquad (11)$$
$$\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) < \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) \quad \text{for } n \text{ large enough}. \qquad (12)$$

Statement (11) simply says that the large cluster $A_n$ must have a conductance $\mu$ at least twice better than the conductance $\mu'$ of the small perturbation cluster $C$ in order to prevent instability. Thus adding an infinitesimally small cluster with mediocre conductance (up to two times worse than the conductance of the main structure) has the potential of radically changing the partition selected by the Normalized Cut.
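As a small aside, the conductance used throughout this analysis is straightforward to compute; the sketch below (illustrative Python, assuming a dense NumPy adjacency matrix and a vertex-index array, not code from the paper) implements $\mathrm{Cond}_G(A) = \mathrm{Cut}(A, A^c)/\min(|A|, |A^c|)$ on the raw weights $W$.

```python
import numpy as np

def conductance(W, A):
    """Cond_G(A) = Cut(A, A^c) / min(|A|, |A^c|) on the raw weights W."""
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[np.asarray(A)] = True
    cut = W[np.ix_(mask, ~mask)].sum()            # total weight leaving A
    return cut / min(mask.sum(), (~mask).sum())
```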
Moreover, this result holds for the classical Normalized Cut, its smoothed variant (4), as well as for similar objectives such as the Cheeger Cut and Ratio Cut. Conversely, (12) shows that adding an infinitesimally small cluster will not affect the partition selected by the Product Cut. The proof, while lengthy, is essentially just theorem 1 in disguise. To see this, note that the sequence of partitions $\mathcal{P}'^{\mathrm{bad}}_n$ becomes arbitrarily ill-balanced, which from (10) implies $\lim_{n\to\infty} \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) = 1$. However, the unperturbed graph $G_n$ grows in a self-similar fashion as $n \to \infty$, and so the Product Cut of $\mathcal{P}_n$ remains approximately a constant, say $\gamma$, for all $n$. Thus $\mathrm{Pcut}_{G_n}(\mathcal{P}_n) \approx \gamma < 1$ for $n$ large enough, and $\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) \approx \mathrm{Pcut}_{G_n}(\mathcal{P}_n)$ since $|C|$ is infinitesimal. Therefore $\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) \approx \gamma < 1$. Comparing this upper bound with the fact that $\lim_{n\to\infty} \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) = 1$, we see that the Product Cut of $\mathcal{P}'^{\mathrm{bad}}_n$ eventually becomes larger than the Product Cut of $\mathcal{P}'^{\mathrm{good}}_n$. While we execute this program in full only for the example above, this line of argument is fairly general, and similar stability estimates are possible for more general families of graphs.

This general contrast between the Product Cut and the Normalized Cut extends beyond the realm of model problems, as the user familiar with off-the-shelf NCut codes likely knows. When provided with dirty graphs, for example an e-mail network or a text data set, NCut has the aggravating tendency to return a super-cluster. That is, NCut often returns a partition $\mathcal{P} = (A_1, \ldots, A_R)$ where a single set $A_r$ contains the vast majority of the vertices. Figure 2 illustrates this phenomenon. It compares the partitions obtained for NCut (computed on $\Omega_\alpha$ using a modification of the standard spectral approximation from [15]) and for PCut (computed using the algorithm presented in the next section) on two graphs constructed from text data sets. The NCut algorithm returns highly ill-balanced partitions containing a super-cluster, while PCut returns an accurate and well-balanced partition.

Figure 2: The Product and Normalized Cuts on WEBKB4 (R = 4 clusters) and CITESEER (R = 6 clusters). The pie charts visually depict the sizes of the clusters in each partition. In both cases, NCut returns a super-cluster while PCut returns a well-balanced partition. The NCut objective prefers the ill-balanced partitions while the PCut objective dramatically prefers the balanced partitions.

|                               | WEBKB4, PCut algo. | WEBKB4, NCut algo. | CITESEER, PCut algo. | CITESEER, NCut algo. |
|-------------------------------|--------------------|--------------------|----------------------|----------------------|
| $e^{-H(\mathcal{P})}$         | .2506              | .7946              | .1722                | .7494                |
| $\mathrm{Pcut}(\mathcal{P})$  | .5335              | .8697              | .4312                | .8309                |
| $\mathrm{Ncut}(\mathcal{P})$  | .5257              | .5004              | .5972                | .5217                |

Other strategies for optimizing NCut obtain similarly unbalanced partitions. As an example, using the algorithm from [9] with the original sparse weight matrix $W$ leads to relative cluster sizes of 99.2%, 0.5%, 0.2% and 0.1% for WEBKB4 and 98.5%, 0.4%, 0.3%, 0.3%, 0.3% and 0.2% for CITESEER. As our theoretical results indicate, these unbalanced partitions result from the normalized cut criterion itself and not from the algorithm used to minimize it.

3 The Algorithm

Our strategy for optimizing the Product Cut relies on a popular paradigm for discrete optimization, i.e. exact relaxation. We begin by showing that the discrete, graph-based formulation (5) can be relaxed to a continuous optimization problem, specifically a convex maximization program.
We then prove that this relaxation is exact, in the sense that optimal solutions of the discrete and continuous problems coincide. With an exact relaxation in hand, we may then appeal to continuous optimization strategies (rather than discrete or greedy ones) for optimizing the Product Cut. This general idea of exact relaxation is intimately coupled with convex maximization.

Assume that the graph $G = (V, W)$ is connected. Then by taking the logarithm of (5) we see that (5) is equivalent to the problem

Maximize $\displaystyle \sum_{r=1}^{R} \sum_{i \in A_r} \log \frac{(\Omega_\alpha 1_{A_r})_i}{|A_r|}$ over all partitions $\mathcal{P} = (A_1, \ldots, A_R)$ of $V$ into $R$ non-empty subsets. (P)

The relaxation of (P) then follows from the usual approach. We first encode sets $A_r \subset V$ as binary vertex functions $1_{A_r}$, then relax the binary constraint to arrive at a continuous program. Given a vertex function $f \in \mathbb{R}^n_+$ with non-negative entries, we define the continuous energy $e(f)$ as

$$e(f) := \langle f, \log\!\left( \Omega_\alpha f / \langle f, 1_V \rangle \right) \rangle \ \text{ if } f \neq 0, \qquad \text{and} \qquad e(0) = 0,$$

where $\langle \cdot, \cdot \rangle$ denotes the usual dot product in $\mathbb{R}^n$ and the logarithm applies entrywise. As $(\Omega_\alpha f)_i > 0$ whenever $f \neq 0$, the continuous energy is well-defined. After noting that $\sum_r e(1_{A_r})$ is simply the objective value in problem (P), we arrive at the following continuous relaxation

Maximize $\displaystyle \sum_{r=1}^{R} e(f_r)$ over all $(f_1, \ldots, f_R) \in \mathbb{R}^n_+ \times \cdots \times \mathbb{R}^n_+$ satisfying $\sum_{r=1}^{R} f_r = 1_V$, (P-rlx)

where the non-negative cone $\mathbb{R}^n_+$ consists of all vectors in $\mathbb{R}^n$ with non-negative entries. The following theorem provides the theoretical underpinning for our algorithmic approach. It establishes convexity of the relaxed objective for connected graphs.

Theorem 3 Assume that $G = (V, W)$ is connected. Then the energy $e(f)$ is continuous, positive 1-homogeneous and convex on $\mathbb{R}^n_+$. Moreover, the strict convexity property
$$e(\theta f + (1-\theta) g) < \theta\, e(f) + (1-\theta)\, e(g) \quad \text{for all } \theta \in (0, 1)$$
holds whenever $f, g \in \mathbb{R}^n_+$ are linearly independent.

The continuity of $e(f)$ away from the origin as well as the positive one-homogeneity are obvious, while the continuity of $e(f)$ at the origin is easy to prove. The proof of convexity of $e(f)$, provided in the supplementary material, is non-trivial and heavily relies on the particular structure of $\Omega_\alpha$ itself. With convexity of $e(f)$ in hand, we may prove the main theorem of this section.

Theorem 4 (Equivalence of (P) and (P-rlx)) Assume that $G = (V, W)$ is connected and that $V$ contains at least $R$ vertices. If $\mathcal{P} = (A_1, \ldots, A_R)$ is a global optimum of (P) then $(1_{A_1}, \ldots, 1_{A_R})$ is a global optimum of (P-rlx). Conversely, if $(f_1, \ldots, f_R)$ is a global optimum of (P-rlx) then $(f_1, \ldots, f_R) = (1_{A_1}, \ldots, 1_{A_R})$, where $(A_1, \ldots, A_R)$ is a global optimum of (P).

Proof. By strict convexity, the solution of the maximization (P-rlx) occurs at the extreme points of the constraint set $\Sigma = \{(f_1, \ldots, f_R) : f_r \in \mathbb{R}^n_+ \text{ and } \sum_{r=1}^{R} f_r = 1_V\}$. Any such extreme point takes the form $(1_{A_1}, \ldots, 1_{A_R})$, where necessarily $A_1 \cup \cdots \cup A_R = V$ and $A_r \cap A_s = \emptyset$ $(r \neq s)$ hold. It therefore suffices to rule out extreme points that have an empty set of vertices. But if $A$ and $B$ are non-empty and disjoint, then $1_A$ and $1_B$ are linearly independent, and so the inequality $e(1_A + 1_B) < e(1_A) + e(1_B)$ holds by strict convexity and one-homogeneity. Thus, given a partition of the vertices into $R-1$ non-empty subsets and one empty subset, we can obtain a better energy by splitting one of the non-empty vertex subsets into two non-empty subsets. Thus any globally maximal partition cannot contain empty subsets.

With theorems 3 and 4 in hand, we may now proceed to optimize (P) by searching for optima of its exact relaxation.
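To make the relaxed objective concrete, the following sketch (illustrative Python, operating on a matrix $\Omega_\alpha$ such as the one produced by the hypothetical smoothed_similarity helper in the earlier snippet) evaluates $e(f)$ and checks that on indicator vectors it reproduces the discrete objective of (P).

```python
import numpy as np

def relaxed_energy(Omega, f):
    """e(f) = <f, log(Omega f / <f, 1>)> for f >= 0, f != 0; e(0) = 0."""
    if not np.any(f):
        return 0.0
    return float(f @ np.log(Omega @ f / f.sum()))

def discrete_objective(Omega, partition):
    """Objective of (P): sum_r sum_{i in A_r} log((Omega 1_{A_r})_i / |A_r|)."""
    total = 0.0
    for A in partition:
        total += np.sum(np.log(Omega[np.ix_(A, A)].sum(axis=1) / len(A)))
    return total

# For indicator vectors f_r = 1_{A_r}, sum_r e(f_r) equals the discrete objective,
# which is the content of the exact-relaxation argument:
#   n = Omega.shape[0]
#   indicators = [np.isin(np.arange(n), A).astype(float) for A in partition]
#   assert np.isclose(sum(relaxed_energy(Omega, f) for f in indicators),
#                     discrete_objective(Omega, partition))
```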
We tackle the latter problem by leveraging sequential linear programming, or gradient thresholding, strategies for convex maximization. We may write (P-rlx) as

Maximize $E(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i = 1, \ldots, n$, (13)

where $F = (f_1, \ldots, f_R)$ is the optimization variable, $E(F)$ is the convex energy to be maximized, $\mathcal{C}$ is the bounded convex set $[0,1]^n \times \cdots \times [0,1]^n$, and the $n$ affine constraints $\psi_i(F) = 0$ correspond to the row-stochastic constraints $\sum_{r=1}^{R} f_{i,r} = 1$. Given a current feasible estimate $F^k$ of the solution, we obtain the next estimate $F^{k+1}$ by solving the linear program

Maximize $L_k(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i = 1, \ldots, n$, (14)

where $L_k(F) = E(F^k) + \langle \nabla E(F^k), F - F^k \rangle$ is the linearization of the energy $E(F)$ around the current iterate. By convexity of $E(F)$, this strategy monotonically increases $E(F^k)$ since
$$E(F^{k+1}) \ge L_k(F^{k+1}) \ge L_k(F^k) = E(F^k).$$
The iterates $F^k$ therefore encode a sequence of partitions of $V$ that monotonically increase the energy at each step. Either the current iterate maximizes the linear form, in which case first-order optimality holds, or else the subsequent iterate produces a partition with a strictly larger objective value. The latter case can occur only a finite number of times, as only a finite number of partitions exist. Thus the sequence $F^k$ converges after a finite number of iterations.

While simple and easy to implement, this algorithm suffers from a severe case of early termination. When initialized from a random partition, the iterates $F^k$ almost immediately converge to a poor-quality solution. We may rescue this poor-quality algorithm and convert it to a highly effective one, while maintaining its simplicity, by randomizing the LP (14) at each step in the following way. At step $k$ we solve the LP

Maximize $L_k(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i \in I_k$, (15)

where the set $I_k$ is a random subset of $\{1, 2, \ldots, n\}$ obtained by drawing $s_k$ constraints uniformly at random without replacement. The LP (15) is therefore a version of LP (14) in which we have dropped a random set of constraints. If we start by enforcing a small number $s_k$ of constraints and slowly increment this number, $s_{k+1} = s_k + \Delta s_k$, as the algorithm progresses, we allow the algorithm time to explore the energy landscape. Enforcing more constraints as the iterates progress ensures that (15) eventually coincides with (14), so convergence of the iterates $F^k$ of the randomized algorithm is still guaranteed. The attraction is that LP (15) has a simple, closed-form solution given by a variant of gradient thresholding. We derive the closed-form solution of LP (15) in section 1 of the supplementary material, and this leads to Algorithm 1 below.

Algorithm 1 Randomized SLP for PCut
  Initialization: $(f^0_1, \ldots, f^0_R) = (1_{A_1}, \ldots, 1_{A_R})$ for $(A_1, \ldots, A_R)$ a random partition of $V$
  for $k = 0$ to maxiter do
    for $r = 1$ to $R$ do
      Set $\hat{f}_r = f^k_r / \big(\sum_{i=1}^{n} f^k_{i,r}\big)$, then solve $M_\alpha u_r = \hat{f}_r$
      Set $g_{i,r} = \hat{f}_{i,r} / u_{i,r}$ for $i = 1, \ldots, n$, then solve $M_\alpha^T v_r = g_r$
      Set $h_r = \log u_r + v_r - 1$
    end for
    Choose $s_k$ vertices at random and let $I \subset V$ be these vertices.
    for all $i \in V$ do
      If $i \in I$: set $f^{k+1}_{i,r} = 1$ if $r = \arg\max_s h_{i,s}$, and $0$ otherwise.
      If $i \notin I$: set $f^{k+1}_{i,r} = 1$ if $h_{i,r} > 0$, and $0$ otherwise.
    end for
  end for
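A compact NumPy transcription of Algorithm 1 might look as follows (an illustrative sketch, not the authors' released implementation; it uses dense linear solves in place of the AMG solver described below, assumes a dense weight matrix W, and uses a simple linear schedule for the number of enforced constraints).

```python
import numpy as np

def pcut_randomized_slp(W, R, alpha=0.9, maxiter=100, ds=None, seed=0):
    """Sketch of Algorithm 1 (randomized SLP) for the Product Cut."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    ds = ds if ds is not None else max(1, n // maxiter)            # constraint schedule
    D_inv = 1.0 / W.sum(axis=1)
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)   # M_alpha
    F = np.eye(R)[rng.integers(0, R, size=n)]      # random initial partition, n x R binary
    s = 0
    for _ in range(maxiter):
        H = np.empty((n, R))
        for r in range(R):
            col = F[:, r]
            if col.sum() == 0:                     # empty cluster: make it unattractive
                H[:, r] = -np.inf
                continue
            f_hat = col / col.sum()
            u = np.linalg.solve(M, f_hat)          # u_r = Omega_alpha f_hat
            v = np.linalg.solve(M.T, f_hat / u)    # v_r = Omega_alpha^T g_r
            H[:, r] = np.log(u) + v - 1.0          # gradient h_r
        s = min(n, s + ds)
        I = rng.choice(n, size=s, replace=False)   # randomly enforced constraints
        F_new = (H > 0).astype(float)              # thresholding on unconstrained rows
        F_new[I] = 0.0
        F_new[I, H[I].argmax(axis=1)] = 1.0        # enforce sum_r f_{i,r} = 1 on I
        F = F_new
    return F.argmax(axis=1)                        # cluster label per vertex
```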
The overall effectiveness of this strategy relies on two key ingredients. The first is a proper choice of the number of constraints $s_k$ to enforce at each step. Selecting the rate at which $s_k$ increases is similar, in principle, to selecting a learning rate schedule for a stochastic gradient descent algorithm. If $s_k$ increases too quickly, the algorithm converges to poor-quality partitions. If $s_k$ increases too slowly, the algorithm finds a quality solution but wastes computational effort. A good rule of thumb is to increase $s_k$ linearly at some constant rate $\Delta s_k \equiv \lambda$ until all constraints are enforced, at which point we switch to the deterministic algorithm and terminate the process at convergence. The second key ingredient involves approximating solutions to the linear system $M_\alpha x = b$ quickly. We use a simple Algebraic Multigrid (AMG) technique, i.e. a stripped-down version of [7] or [6], to accomplish this. The main insight here is that exact solutions of $M_\alpha x = b$ are not needed, but not all approximate solutions are effective. We need an approximate solution $x$ that has non-zero entries on all of $V$ for thresholding to succeed, and this can be accomplished by AMG at very little cost.

4 Experiments

We conclude our study of the Product Cut model by presenting an extensive experimental evaluation of the algorithm.[1] We intend these experiments to highlight the fact that, in addition to a strong theoretical model, the algorithm itself leads to state-of-the-art performance in terms of cluster purity on a variety of real world data sets. We provide experimental results on four text data sets (20NEWS, RCV1, WEBKB4, CITESEER) and four data sets containing images of handwritten digits (MNIST, PENDIGITS, USPS, OPTDIGITS). We provide the source for these data sets and details on their construction in the supplementary material.

[1] The code is available at https://github.com/xbresson/pcut

We compare our method against partitioning algorithms that, like the Product Cut, rely on graph-cut objective principles and that partition the graph in a direct, non-recursive manner. The NCut algorithm [15] is a widely used spectral algorithm that relies on a post-processing of the eigenvectors of the graph Laplacian to optimize the Normalized Cut energy. The NMFR algorithm [14] uses a graph-based random walk variant of the Normalized Cut. The LSD algorithm [2] provides a non-negative matrix factorization algorithm that relies upon a trace-based relaxation of the Normalized Cut objective. The MTV algorithm from [3] and the balanced k-cut algorithm from [9] provide total-variation based algorithms that attempt to find an optimal multi-way Cheeger cut of the graph by using ℓ1 optimization techniques. Both algorithms optimize the same objective and achieve similar purity values; we report results for [3] only. The GRACLUS algorithm [4, 5] uses a multi-level coarsening approach to optimize the NCut objective as formulated in terms of kernel k-means.

Table 1: Algorithmic Comparison via Cluster Purity.

|               | 20NEWS | RCV1 | WEBKB4 | CITESEER | MNIST | PENDIGITS | USPS | OPTDIGITS |
|---------------|--------|------|--------|----------|-------|-----------|------|-----------|
| size          | 20K    | 9.6K | 4.2K   | 3.3K     | 70K   | 11K       | 9.3K | 5.6K      |
| R             | 20     | 4    | 4      | 6        | 10    | 10        | 10   | 10        |
| RND           | 6      | 30   | 39     | 22       | 11    | 12        | 17   | 12        |
| NCUT          | 27     | 38   | 40     | 23       | 77    | 80        | 72   | 91        |
| LSD           | 34     | 38   | 46     | 53       | 76    | 86        | 70   | 91        |
| MTV           | 36     | 43   | 45     | 43       | 96    | 87        | 85   | 95        |
| GRACLUS       | 42     | 42   | 49     | 54       | 97    | 85        | 87   | 94        |
| NMFR          | 61     | 43   | 58     | 63       | 97    | 87        | 86   | 98        |
| PCut (.9, λ1) | 61     | 53   | 58     | 63       | 97    | 87        | 89   | 98        |
| PCut (.9, λ2) | 60     | 50   | 57     | 64       | 96    | 84        | 89   | 95        |

Table 1 reports the accuracy obtained by these algorithms for each data set. We use cluster purity to quantify the quality of the calculated partition, defined according to the relation

$$\mathrm{Purity} = \frac{1}{n} \sum_{r=1}^{R} \max_{1 \le s \le R} n_{r,s},$$

where $n_{r,s}$ denotes the number of data points in the $r$th computed cluster that belong to the $s$th ground-truth class.
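Cluster purity is simple to compute from predicted and ground-truth labels; the sketch below (illustrative Python implementing the standard definition above, not code from the paper) mirrors the formula.

```python
import numpy as np

def cluster_purity(pred, truth):
    """Purity = (1/n) * sum_r max_s n_{r,s}, where n_{r,s} counts points of
    computed cluster r that carry ground-truth label s."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for r in np.unique(pred):
        _, counts = np.unique(truth[pred == r], return_counts=True)
        total += counts.max()
    return total / len(pred)

# Example: a perfect clustering has purity 1.0 regardless of label permutation.
print(cluster_purity([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 7, 7]))  # 1.0
```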