The Product Cut

Xavier Bresson (Nanyang Technological University, Singapore) xavier.bresson@ntu.edu.sg
Thomas Laurent (Loyola Marymount University, Los Angeles) tlaurent@lmu.edu
Arthur Szlam (Facebook AI Research, New York) aszlam@fb.com
James H. von Brecht (California State University, Long Beach) james.vonbrecht@csulb.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We introduce a theoretical and algorithmic framework for multi-way graph partitioning that relies on a multiplicative cut-based objective. We refer to this objective as the Product Cut. We provide a detailed investigation of the mathematical properties of this objective and an effective algorithm for its optimization. The proposed model has strong mathematical underpinnings, and the corresponding algorithm achieves state-of-the-art performance on benchmark data sets.

1 Introduction

We propose the following model for multi-way graph partitioning. Let $G = (V, W)$ denote a weighted graph, with $V$ its vertex set and $W$ its weighted adjacency matrix. We define the Product Cut of a partition $\mathcal{P} = (A_1, \ldots, A_R)$ of the vertex set $V$ as

$$\mathrm{Pcut}(\mathcal{P}) = \frac{\prod_{r=1}^{R} Z(A_r, A_r^c)}{e^{H(\mathcal{P})}}, \qquad H(\mathcal{P}) = -\sum_{r=1}^{R} \theta_r \log \theta_r, \qquad (1)$$

where $\theta_r = |A_r|/|V|$ denotes the relative size of a set. This model provides a distinctive way to incorporate classical notions of a quality partition. The non-linear, non-local function $Z(A_r, A_r^c)$ of a set measures its intra- and inter-connectivity with respect to the graph. The entropic balance $H(\mathcal{P})$ measures deviations of the partition $\mathcal{P}$ from a collection of sets $(A_1, \ldots, A_R)$ with equal size. In this way, the Product Cut optimization parallels the classical Normalized Cut optimization [10, 15, 13] in terms of its underlying notion of cluster, and it arises quite naturally as a multiplicative version of the Normalized Cut. Nevertheless, the two models strongly diverge beyond the point of this superficial similarity. We provide a detailed analysis to show that (1) settles the compromise between cut and balance in a fundamentally different manner than classical objectives, such as the Normalized Cut or the Cheeger Cut. The sharp inequalities

$$0 \le \mathrm{Ncut}(\mathcal{P}) \le 1 \qquad \text{and} \qquad e^{-H(\mathcal{P})} \le \mathrm{Pcut}(\mathcal{P}) \le 1 \qquad (2)$$

succinctly capture this distinction; the Product Cut exhibits a non-vanishing lower bound while the Normalized Cut does not. We show analytically and experimentally that this distinction leads to superior stability properties and performance.

From an algorithmic point of view, we show how to cast the minimization of (1) as a convex maximization program. This leads to a simple, exact continuous relaxation of the discrete problem that has a clear mathematical structure. We leverage this formulation to develop a monotonic algorithm for optimizing (1) via a sequence of linear programs, and we introduce a randomized version of this strategy that leads to a simple yet highly effective algorithm. We also introduce a simple version of Algebraic Multigrid (AMG) tailored to our problem that allows us to perform each step of the algorithm at very low cost. On graphs that contain reasonably well-balanced clusters of medium scale, the algorithm provides a strong combination of accuracy and efficiency. We conclude with an experimental evaluation and comparison of the algorithm on real world data sets to validate these claims.
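To make the role of the entropic balance concrete, the short sketch below (an illustrative Python snippet, not part of the paper's released code; the cluster sizes are arbitrary example values) evaluates $e^{-H(\mathcal{P})}$ for a balanced and an ill-balanced partition, showing how the lower bound in (2) approaches 1 as the partition degenerates.

```python
import numpy as np

def entropic_balance(sizes):
    """H(P) = -sum_r theta_r log theta_r, with theta_r = |A_r| / |V|."""
    theta = np.asarray(sizes, dtype=float)
    theta = theta / theta.sum()
    return -np.sum(theta * np.log(theta))

# Balanced 4-way partition of 1000 vertices: e^{-H(P)} = 1/4.
print(np.exp(-entropic_balance([250, 250, 250, 250])))   # 0.25
# Ill-balanced partition: the lower bound e^{-H(P)} in (2) approaches 1,
# so any Product Cut value must also approach 1.
print(np.exp(-entropic_balance([997, 1, 1, 1])))          # ~0.98
```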
2 The Product Cut Model

We begin by introducing our notation and by describing the rationale underlying our model. We use $G = (V, W)$ to denote a graph on $n$ vertices $V = \{v_1, \ldots, v_n\}$ with weighted edges $W = \{w_{ij}\}_{i,j=1}^{n}$ that encode similarity between vertices. We denote partitions of the vertex set into $R$ subsets as $\mathcal{P} = (A_1, \ldots, A_R)$, with the understanding that the $A_r \subset V$ satisfy the covering constraint $A_1 \cup \cdots \cup A_R = V$, the non-overlapping constraint $A_r \cap A_s = \emptyset$ $(r \neq s)$, and the non-triviality constraint $A_r \neq \emptyset$. We use $f, g, h, u, v$ to denote vertex functions $f : V \to \mathbb{R}$, which we view as functions $f(v_i)$ and $n$-vectors $f \in \mathbb{R}^n$ interchangeably. For a subset $A \subset V$ we use $|A|$ for its cardinality and $1_A$ for its indicator function. Finally, for a given graph $G = (V, W)$ we use $D := \mathrm{diag}(W 1_V)$ to denote the diagonal matrix of weighted vertex degrees.

The starting point for our model arises from a well-known and widely used property of the random walk on a graph. Namely, a random walker initially located in a cluster $A$ is unlikely to leave that cluster quickly [8]. Different approaches to quantifying this intuition then lead to a variety of multi-way partitioning strategies for graphs [11, 12, 1]. The personalized page-rank methodology provides an example of this approach. Following [1], given a scalar $0 < \alpha < 1$ and a non-empty vertex subset $A$ we define

$$\mathrm{pr}_A := M_\alpha^{-1}\left( 1_A / |A| \right), \qquad M_\alpha := \left( \mathrm{Id} - \alpha W D^{-1} \right) / (1 - \alpha) \qquad (3)$$

as its personalized page-rank vector. As $1_A/|A|$ is the uniform distribution on the set $A$ and $W D^{-1}$ is the transition matrix of the random walk on the graph, $\mathrm{pr}_A$ corresponds to the stationary distribution of a random walker that, at each step, moves with probability $\alpha$ to a neighboring vertex by a usual random walk, and has a probability $(1 - \alpha)$ to teleport to the set $A$. If $A$ has a reasonable cluster structure, then $\mathrm{pr}_A$ will concentrate on $A$ and assign low probabilities to its complement. Given a high-quality partition $\mathcal{P} = (A_1, \ldots, A_R)$ of $V$, we therefore expect that $\sigma_{i,r} := \mathrm{pr}_{A_r}(v_i)$ should achieve its maximal value over $1 \le r \le R$ when $r = r(i)$ is the class of the $i$th vertex. Viewed from this perspective, we can formulate an $R$-way graph partitioning problem as the task of selecting $\mathcal{P} = (A_1, \ldots, A_R)$ to maximize some combination of the collection $\{\sigma_{i,r(i)} : v_i \in V\}$ of page-rank probabilities generated by the partition. Two intuitive options immediately come to mind, the arithmetic and geometric means of the collection:

Maximize $\displaystyle \frac{1}{n} \sum_{r=1}^{R} \sum_{v_i \in A_r} \mathrm{pr}_{A_r}(v_i)$ over all partitions $(A_1, \ldots, A_R)$ of $V$ into $R$ sets. (4)

Maximize $\displaystyle \left( \prod_{r=1}^{R} \prod_{v_i \in A_r} \mathrm{pr}_{A_r}(v_i) \right)^{1/n}$ over all partitions $(A_1, \ldots, A_R)$ of $V$ into $R$ sets. (5)

The first option corresponds to a straightforward variant of the classical Normalized Cut. The second option leads to a different type of cut-based objective that we term the Product Cut. The underlying reason for considering (5) is quite natural. If we view each $\mathrm{pr}_{A_r}$ as a probability distribution, then (5) corresponds to a formal likelihood of the partition. This proves quite analogous to re-formulating the classical k-means objective for partitioning $n$ data points $(x_1, \ldots, x_n)$ into $R$ clusters $(A_1, \ldots, A_R)$ in terms of maximizing a likelihood $\prod_{r=1}^{R} \prod_{x_i \in A_r} \exp(-\|x_i - m_r\|^2)$ of Gaussian densities centered at the cluster means $m_r$. While the Normalized Cut variant (4) is certainly popular, we show that it suffers from several defects that the Product Cut resolves. As the Product Cut can be effectively optimized and generally leads to higher quality partitions, it therefore provides a natural alternative.
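As a concrete illustration of (3)-(5), the following sketch (illustrative Python, not the authors' released code; it assumes a dense NumPy weight matrix W with positive vertex degrees and a partition given as a list of index arrays) computes the personalized page-rank vectors and evaluates the arithmetic and geometric objectives.

```python
import numpy as np

def pagerank_vector(W, A, alpha=0.9):
    """Personalized page-rank pr_A = M_alpha^{-1} (1_A / |A|), eq. (3)."""
    n = W.shape[0]
    D_inv = 1.0 / W.sum(axis=1)                 # assumes every vertex has positive degree
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)   # M_alpha
    b = np.zeros(n)
    b[A] = 1.0 / len(A)
    return np.linalg.solve(M, b)

def arithmetic_objective(W, partition, alpha=0.9):   # objective (4)
    n = W.shape[0]
    return sum(pagerank_vector(W, A, alpha)[A].sum() for A in partition) / n

def geometric_objective(W, partition, alpha=0.9):    # objective (5)
    n = W.shape[0]
    logs = sum(np.log(pagerank_vector(W, A, alpha)[A]).sum() for A in partition)
    return np.exp(logs / n)
```

Maximizing the geometric objective over partitions is exactly the problem that the Product Cut, defined next, rewrites in cut-and-balance form.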
To make these ideas precise, let us define the $\alpha$-smoothed similarity matrix as $\Omega_\alpha := M_\alpha^{-1}$ and use $\{\omega_{ij}\}_{i,j=1}^{n}$ to denote its entries. Thus $\omega_{ij} = (M_\alpha^{-1} 1_{v_j})_i = \mathrm{pr}_{\{v_j\}}(v_i)$, and so $\omega_{ij}$ gives a non-local measure of similarity between the vertices $v_i$ and $v_j$ by means of the personalized page-rank diffusion process. The matrix $\Omega_\alpha$ is column stochastic, non-symmetric, non-sparse, and has diagonal entries greater than $(1 - \alpha)$. Given a partition $\mathcal{P} = (A_1, \ldots, A_R)$, we define

$$\mathrm{Pcut}(\mathcal{P}) := \frac{\prod_{r=1}^{R} Z(A_r, A_r^c)^{1/n}}{e^{H(\mathcal{P})}} \qquad \text{and} \qquad \mathrm{Ncut}(\mathcal{P}) := \frac{1}{R} \sum_{r=1}^{R} \frac{\mathrm{Cut}(A_r, A_r^c)}{\mathrm{Vol}(A_r)} \qquad (6)$$

as its Product Cut and Normalized Cut, respectively. The non-linear, non-local function

$$Z(A, A^c) := \prod_{i \in A} \frac{\sum_{j \in V} \omega_{ij}}{\sum_{j \in A} \omega_{ij}} \qquad (7)$$

of a set measures its intra- and inter-connectivity with respect to the graph, while $H(\mathcal{P})$ denotes the entropic balance (1). The definitions of $\mathrm{Cut}(A, A^c) = \sum_{i \in A^c} \sum_{j \in A} \omega_{ij}$ and $\mathrm{Vol}(A) = \sum_{i \in V} \sum_{j \in A} \omega_{ij}$ are standard. A simple computation then shows that maximizing the geometric average (5) is equivalent to minimizing the Product Cut, while maximizing the arithmetic average (4) is equivalent to minimizing the Normalized Cut.

At a superficial level, both models wish to achieve the same goal. The numerator of the Product Cut aims at a partition in which each vertex is weakly connected to vertices from other clusters and strongly connected with vertices from its own cluster. The denominator $e^{H(\mathcal{P})}$ is maximal when $|A_1| = |A_2| = \ldots = |A_R|$, and so aims at a well-balanced partition of the vertices. The objective (5) therefore promotes partitions with strongly intra-connected clusters and weakly inter-connected clusters that have comparable size. The Normalized Cut, defined here on $\Omega_\alpha$ but usually posed over the original similarity matrix $W$, is exceedingly well-known [10, 15] and also aims at finding a good balance between low cut value and clusters of comparable sizes. Despite this apparent parallel between the Product and Normalized Cuts, the two objectives behave quite differently both in theory and in practice. To illustrate this discrepancy at a high level, note first that the sharp bounds

$$0 \le \mathrm{Ncut}(\mathcal{P}) \le 1 \qquad (8)$$

hold for the Normalized Cut. The lower bound is attained for partitions $\mathcal{P}$ in which the clusters are mutually disconnected. For the Product Cut, we have

Theorem 1 The following inequality holds for any partition $\mathcal{P}$:
$$e^{-H(\mathcal{P})} \le \mathrm{Pcut}(\mathcal{P}) \le 1. \qquad (9)$$
Moreover, the lower bound is attained for partitions $\mathcal{P}$ in which the clusters are mutually disconnected.

The lower bound in (9) can be directly read from (6) and (7), while the upper bound is non-trivial and proved in the supplementary material. This theorem goes to the heart of the difference between the Product and Normalized Cuts. To illustrate this, let $\mathcal{P}^{(k)}$ denote a sequence of partitions. Then (9) shows that

$$\lim_{k \to \infty} H(\mathcal{P}^{(k)}) = 0 \quad \Longrightarrow \quad \lim_{k \to \infty} \mathrm{Pcut}(\mathcal{P}^{(k)}) = 1. \qquad (10)$$

In other words, an arbitrarily ill-balanced partition leads to arbitrarily poor values of its Product Cut. The Normalized Cut does not possess this property. As an extreme but easy-to-analyze example, consider the case where $G = (V, W)$ is a collection of isolated vertices. All possible partitions $\mathcal{P}$ consist of mutually disconnected clusters and the lower bound is reached for both (8) and (9). Thus $\mathrm{Ncut}(\mathcal{P}) = 0$ for all $\mathcal{P}$, and so all partitions are equivalent for the Normalized Cut. On the other hand $\mathrm{Pcut}(\mathcal{P}) = e^{-H(\mathcal{P})}$, which shows that, in the absence of cut information, the Product Cut will choose the partition that maximizes the entropic balance. So in this case, any partition $\mathcal{P}$ for which $|A_1| = \ldots = |A_R|$ will be a minimizer. In essence, this tighter lower bound for the Product Cut reflects its stronger balancing effect vis-à-vis the Normalized Cut.
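For concreteness, the sketch below (illustrative Python under the same assumptions as the previous snippet: a dense NumPy weight matrix and a partition given as index arrays; the helper names are hypothetical) evaluates the two objectives in (6) directly from the smoothed similarity matrix $\Omega_\alpha$.

```python
import numpy as np

def smoothed_similarity(W, alpha=0.9):
    """Omega_alpha = M_alpha^{-1}, with M_alpha = (Id - alpha W D^{-1}) / (1 - alpha)."""
    n = W.shape[0]
    D_inv = 1.0 / W.sum(axis=1)
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)
    return np.linalg.inv(M)

def product_cut(Omega, partition):
    """Pcut(P) per (6)-(7): prod_r Z(A_r, A_r^c)^{1/n} / e^{H(P)}."""
    n = Omega.shape[0]
    log_Z, H = 0.0, 0.0
    row_sums = Omega.sum(axis=1)                   # (Omega 1_V)_i
    for A in partition:
        in_mass = Omega[np.ix_(A, A)].sum(axis=1)  # (Omega 1_A)_i for i in A
        log_Z += np.sum(np.log(row_sums[A]) - np.log(in_mass))
        theta = len(A) / n
        H -= theta * np.log(theta)
    return np.exp(log_Z / n - H)

def normalized_cut(Omega, partition):
    """Ncut(P) per (6), computed on Omega_alpha."""
    total = 0.0
    for A in partition:
        vol = Omega[:, A].sum()                    # Vol(A)
        cut = vol - Omega[np.ix_(A, A)].sum()      # Cut(A, A^c)
        total += cut / vol
    return total / len(partition)
```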
2.1 (In-)Stability Properties of the Product Cut and Normalized Cut

In practice, the stronger balancing effect of the Product Cut manifests as a stronger tolerance to perturbations. We now delve deeper and contrast the two objectives by analyzing their stability properties using experimental data as well as a simplified model problem that isolates the source of the inherent difficulties.

Figure 1: The graphs $G'_n$ used for analyzing stability. (a) $A_n$ in blue, $B_n$ in green, $C$ in red. (b) $\mathcal{P}'^{\mathrm{good}}_n = (A_n, B_n \cup C)$. (c) $\mathcal{P}'^{\mathrm{bad}}_n = (A_n \cup B_n, C)$.

Invoking ideas from dynamical systems theory, we say an objective is stable if an infinitesimal perturbation of a graph $G = (V, W)$ leads to an infinitesimal perturbation of the optimal partition. If an infinitesimal perturbation leads to a dramatic change in the optimal partition, then the objective is unstable. We use a simplified model to study stability of the Product Cut and Normalized Cut objectives. Consider a graph $G_n = (V_n, W_n)$ made of two clusters $A_n$ and $B_n$ containing $n$ vertices each. Each vertex in $G_n$ has degree $k$ and is connected to $\mu k$ vertices in the opposite cluster, where $0 \le \mu \le 1$. The graph $G'_n$ is a perturbation of $G_n$ constructed by adding a small cluster $C$ of size $n' \ll n$ to the original graph. Each vertex of $C$ has degree $k'$ and is connected to $\mu' k'$ vertices in $B_n$ and $(1 - \mu') k'$ vertices in $C$ for some $0 \le \mu' \le 1$. In the perturbed graph $G'_n$, a total of $n'$ vertices in $B_n$ are linked to $C$ and have degree $k + \mu' k'$. See figure 1(a). The main properties of $G_n$ and $G'_n$ are:

Unperturbed graph $G_n$: $|A_n| = |B_n| = n$, $\mathrm{Cond}_{G_n}(A_n) = \mu$, $\mathrm{Cond}_{G_n}(B_n) = \mu$.
Perturbed graph $G'_n$: $|A_n| = |B_n| = n$, $\mathrm{Cond}_{G'_n}(A_n) = \mu$, $\mathrm{Cond}_{G'_n}(B_n) \approx \mu$, $|C| = n' \ll n$, $\mathrm{Cond}_{G'_n}(C) = \mu'$,

where $\mathrm{Cond}_G(A) = \mathrm{Cut}(A, A^c)/\min(|A|, |A^c|)$ denotes the conductance of a set. If we consider the parameters $\mu, \mu', k, k', n'$ as fixed and look at the perturbed graph $G'_n$ in the limit $n \to \infty$ of a large number of vertices, then as $n$ becomes larger the degree of the bulk vertices will remain constant while the size $|C|$ of the perturbation becomes infinitesimal. To examine the influence of this infinitesimal perturbation on each model, let $\mathcal{P}_n = (A_n, B_n)$ denote the desired partition of the unperturbed graph $G_n$, and let $\mathcal{P}'^{\mathrm{good}}_n = (A_n, B_n \cup C)$ and $\mathcal{P}'^{\mathrm{bad}}_n = (A_n \cup B_n, C)$ denote the partitions of the perturbed graph $G'_n$ depicted in figure 1(b) and 1(c), respectively. As $\mathcal{P}'^{\mathrm{good}}_n \approx \mathcal{P}_n$, a stable objective will prefer $\mathcal{P}'^{\mathrm{good}}_n$ to $\mathcal{P}'^{\mathrm{bad}}_n$, while any objective preferring the converse is unstable.

A detailed study of stability proves possible for this specific graph family. We summarize the conclusions of this analysis in the theorem below, which shows that the Normalized Cut is unstable in certain parameter regimes while the Product Cut is always stable. The supplementary material contains the proof.

Theorem 2 Suppose that $\mu, \mu', k, k', n'$ are fixed. Then

$$\mu' < 2\mu \;\Longrightarrow\; \mathrm{Ncut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) > \mathrm{Ncut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) \quad \text{for } n \text{ large enough}. \qquad (11)$$
$$\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) < \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) \quad \text{for } n \text{ large enough}. \qquad (12)$$

Statement (11) simply says that the large cluster $A_n$ must have a conductance $\mu$ at least twice better than the conductance $\mu'$ of the small perturbation cluster $C$ in order to prevent instability. Thus adding an infinitesimally small cluster with mediocre conductance (up to two times worse than the conductance of the main structure) has the potential of radically changing the partition selected by the Normalized Cut.
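As a small aside, the conductance used throughout this analysis is straightforward to compute; the sketch below (illustrative Python, assuming a dense NumPy adjacency matrix and a vertex-index array, not code from the paper) implements $\mathrm{Cond}_G(A) = \mathrm{Cut}(A, A^c)/\min(|A|, |A^c|)$ on the raw weights $W$.

```python
import numpy as np

def conductance(W, A):
    """Cond_G(A) = Cut(A, A^c) / min(|A|, |A^c|) on the raw weights W."""
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[np.asarray(A)] = True
    cut = W[np.ix_(mask, ~mask)].sum()            # total weight leaving A
    return cut / min(mask.sum(), (~mask).sum())
```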
Moreover, this result holds for the classical Normalized Cut, its smoothed variant (4), as well as for similar objectives such as the Cheeger Cut and Ratio Cut. Conversely, (12) shows that adding an infinitesimally small cluster will not affect the partition selected by the Product Cut. The proof, while lengthy, is essentially just theorem 1 in disguise. To see this, note that the sequence of partitions $\mathcal{P}'^{\mathrm{bad}}_n$ becomes arbitrarily ill-balanced, which from (10) implies $\lim_{n\to\infty} \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) = 1$. However, the unperturbed graph $G_n$ grows in a self-similar fashion as $n \to \infty$, and so the Product Cut of $\mathcal{P}_n$ remains approximately a constant, say $\gamma$, for all $n$. Thus $\mathrm{Pcut}_{G_n}(\mathcal{P}_n) \approx \gamma < 1$ for $n$ large enough, and $\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) \approx \mathrm{Pcut}_{G_n}(\mathcal{P}_n)$ since $|C|$ is infinitesimal. Therefore $\mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{good}}_n) \approx \gamma < 1$. Comparing this upper bound with the fact that $\lim_{n\to\infty} \mathrm{Pcut}_{G'_n}(\mathcal{P}'^{\mathrm{bad}}_n) = 1$, we see that the Product Cut of $\mathcal{P}'^{\mathrm{bad}}_n$ eventually becomes larger than the Product Cut of $\mathcal{P}'^{\mathrm{good}}_n$. While we execute this program in full only for the example above, this line of argument is fairly general, and similar stability estimates are possible for more general families of graphs.

This general contrast between the Product Cut and the Normalized Cut extends beyond the realm of model problems, as the user familiar with off-the-shelf NCut codes likely knows. When provided with dirty graphs, for example an e-mail network or a text data set, NCut has the aggravating tendency to return a super-cluster. That is, NCut often returns a partition $\mathcal{P} = (A_1, \ldots, A_R)$ where a single set $A_r$ contains the vast majority of the vertices. Figure 2 illustrates this phenomenon. It compares the partitions obtained for NCut (computed on $\Omega_\alpha$ using a modification of the standard spectral approximation from [15]) and for PCut (computed using the algorithm presented in the next section) on two graphs constructed from text data sets. The NCut algorithm returns highly ill-balanced partitions containing a super-cluster, while PCut returns an accurate and well-balanced partition.

Figure 2: The Product and Normalized Cuts on WEBKB4 (R = 4 clusters) and CITESEER (R = 6 clusters). The pie charts visually depict the sizes of the clusters in each partition. In both cases, NCut returns a super-cluster while PCut returns a well-balanced partition. The NCut objective prefers the ill-balanced partitions while the PCut objective dramatically prefers the balanced partitions.

|                               | WEBKB4, PCut algo. | WEBKB4, NCut algo. | CITESEER, PCut algo. | CITESEER, NCut algo. |
|-------------------------------|--------------------|--------------------|----------------------|----------------------|
| $e^{-H(\mathcal{P})}$         | .2506              | .7946              | .1722                | .7494                |
| $\mathrm{Pcut}(\mathcal{P})$  | .5335              | .8697              | .4312                | .8309                |
| $\mathrm{Ncut}(\mathcal{P})$  | .5257              | .5004              | .5972                | .5217                |

Other strategies for optimizing NCut obtain similarly unbalanced partitions. As an example, using the algorithm from [9] with the original sparse weight matrix $W$ leads to relative cluster sizes of 99.2%, 0.5%, 0.2% and 0.1% for WEBKB4 and 98.5%, 0.4%, 0.3%, 0.3%, 0.3% and 0.2% for CITESEER. As our theoretical results indicate, these unbalanced partitions result from the normalized cut criterion itself and not from the algorithm used to minimize it.

3 The Algorithm

Our strategy for optimizing the Product Cut relies on a popular paradigm for discrete optimization, i.e. exact relaxation. We begin by showing that the discrete, graph-based formulation (5) can be relaxed to a continuous optimization problem, specifically a convex maximization program.
We then prove that this relaxation is exact, in the sense that optimal solutions of the discrete and continuous problems coincide. With an exact relaxation in hand, we may then appeal to continuous optimization strategies (rather than discrete or greedy ones) for optimizing the Product Cut. This general idea of exact relaxation is intimately coupled with convex maximization.

Assume that the graph $G = (V, W)$ is connected. Then by taking the logarithm of (5) we see that (5) is equivalent to the problem

Maximize $\displaystyle \sum_{r=1}^{R} \sum_{i \in A_r} \log \frac{(\Omega_\alpha 1_{A_r})_i}{|A_r|}$ over all partitions $\mathcal{P} = (A_1, \ldots, A_R)$ of $V$ into $R$ non-empty subsets. (P)

The relaxation of (P) then follows from the usual approach. We first encode sets $A_r \subset V$ as binary vertex functions $1_{A_r}$, then relax the binary constraint to arrive at a continuous program. Given a vertex function $f \in \mathbb{R}^n_+$ with non-negative entries, we define the continuous energy $e(f)$ as

$$e(f) := \langle f, \log\!\left( \Omega_\alpha f / \langle f, 1_V \rangle \right) \rangle \ \text{ if } f \neq 0, \qquad \text{and} \qquad e(0) = 0,$$

where $\langle \cdot, \cdot \rangle$ denotes the usual dot product in $\mathbb{R}^n$ and the logarithm applies entrywise. As $(\Omega_\alpha f)_i > 0$ whenever $f \neq 0$, the continuous energy is well-defined. After noting that $\sum_r e(1_{A_r})$ is simply the objective value in problem (P), we arrive at the following continuous relaxation

Maximize $\displaystyle \sum_{r=1}^{R} e(f_r)$ over all $(f_1, \ldots, f_R) \in \mathbb{R}^n_+ \times \cdots \times \mathbb{R}^n_+$ satisfying $\sum_{r=1}^{R} f_r = 1_V$, (P-rlx)

where the non-negative cone $\mathbb{R}^n_+$ consists of all vectors in $\mathbb{R}^n$ with non-negative entries. The following theorem provides the theoretical underpinning for our algorithmic approach. It establishes convexity of the relaxed objective for connected graphs.

Theorem 3 Assume that $G = (V, W)$ is connected. Then the energy $e(f)$ is continuous, positive 1-homogeneous and convex on $\mathbb{R}^n_+$. Moreover, the strict convexity property
$$e(\theta f + (1-\theta) g) < \theta\, e(f) + (1-\theta)\, e(g) \quad \text{for all } \theta \in (0, 1)$$
holds whenever $f, g \in \mathbb{R}^n_+$ are linearly independent.

The continuity of $e(f)$ away from the origin as well as the positive one-homogeneity are obvious, while the continuity of $e(f)$ at the origin is easy to prove. The proof of convexity of $e(f)$, provided in the supplementary material, is non-trivial and heavily relies on the particular structure of $\Omega_\alpha$ itself. With convexity of $e(f)$ in hand, we may prove the main theorem of this section.

Theorem 4 (Equivalence of (P) and (P-rlx)) Assume that $G = (V, W)$ is connected and that $V$ contains at least $R$ vertices. If $\mathcal{P} = (A_1, \ldots, A_R)$ is a global optimum of (P) then $(1_{A_1}, \ldots, 1_{A_R})$ is a global optimum of (P-rlx). Conversely, if $(f_1, \ldots, f_R)$ is a global optimum of (P-rlx) then $(f_1, \ldots, f_R) = (1_{A_1}, \ldots, 1_{A_R})$, where $(A_1, \ldots, A_R)$ is a global optimum of (P).

Proof. By strict convexity, the solution of the maximization (P-rlx) occurs at the extreme points of the constraint set $\Sigma = \{(f_1, \ldots, f_R) : f_r \in \mathbb{R}^n_+ \text{ and } \sum_{r=1}^{R} f_r = 1_V\}$. Any such extreme point takes the form $(1_{A_1}, \ldots, 1_{A_R})$, where necessarily $A_1 \cup \cdots \cup A_R = V$ and $A_r \cap A_s = \emptyset$ $(r \neq s)$ hold. It therefore suffices to rule out extreme points that have an empty set of vertices. But if $A$ and $B$ are non-empty and disjoint, then $1_A$ and $1_B$ are linearly independent, and so the inequality $e(1_A + 1_B) < e(1_A) + e(1_B)$ holds by strict convexity and one-homogeneity. Thus, given a partition of the vertices into $R-1$ non-empty subsets and one empty subset, we can obtain a better energy by splitting one of the non-empty vertex subsets into two non-empty subsets. Thus any globally maximal partition cannot contain empty subsets.

With theorems 3 and 4 in hand, we may now proceed to optimize (P) by searching for optima of its exact relaxation.
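To make the relaxed objective concrete, the following sketch (illustrative Python, operating on a matrix $\Omega_\alpha$ such as the one produced by the hypothetical smoothed_similarity helper in the earlier snippet) evaluates $e(f)$ and checks that on indicator vectors it reproduces the discrete objective of (P).

```python
import numpy as np

def relaxed_energy(Omega, f):
    """e(f) = <f, log(Omega f / <f, 1>)> for f >= 0, f != 0; e(0) = 0."""
    if not np.any(f):
        return 0.0
    return float(f @ np.log(Omega @ f / f.sum()))

def discrete_objective(Omega, partition):
    """Objective of (P): sum_r sum_{i in A_r} log((Omega 1_{A_r})_i / |A_r|)."""
    total = 0.0
    for A in partition:
        total += np.sum(np.log(Omega[np.ix_(A, A)].sum(axis=1) / len(A)))
    return total

# For indicator vectors f_r = 1_{A_r}, sum_r e(f_r) equals the discrete objective,
# which is the content of the exact-relaxation argument:
#   n = Omega.shape[0]
#   indicators = [np.isin(np.arange(n), A).astype(float) for A in partition]
#   assert np.isclose(sum(relaxed_energy(Omega, f) for f in indicators),
#                     discrete_objective(Omega, partition))
```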
We tackle the latter problem by leveraging sequential linear programming, or gradient thresholding, strategies for convex maximization. We may write (P-rlx) as

Maximize $E(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i = 1, \ldots, n$, (13)

where $F = (f_1, \ldots, f_R)$ is the optimization variable, $E(F)$ is the convex energy to be maximized, $\mathcal{C}$ is the bounded convex set $[0,1]^n \times \cdots \times [0,1]^n$, and the $n$ affine constraints $\psi_i(F) = 0$ correspond to the row-stochastic constraints $\sum_{r=1}^{R} f_{i,r} = 1$. Given a current feasible estimate $F^k$ of the solution, we obtain the next estimate $F^{k+1}$ by solving the linear program

Maximize $L_k(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i = 1, \ldots, n$, (14)

where $L_k(F) = E(F^k) + \langle \nabla E(F^k), F - F^k \rangle$ is the linearization of the energy $E(F)$ around the current iterate. By convexity of $E(F)$, this strategy monotonically increases $E(F^k)$ since
$$E(F^{k+1}) \ge L_k(F^{k+1}) \ge L_k(F^k) = E(F^k).$$
The iterates $F^k$ therefore encode a sequence of partitions of $V$ that monotonically increase the energy at each step. Either the current iterate maximizes the linear form, in which case first-order optimality holds, or else the subsequent iterate produces a partition with a strictly larger objective value. The latter case can occur only a finite number of times, as only a finite number of partitions exist. Thus the sequence $F^k$ converges after a finite number of iterations.

While simple and easy to implement, this algorithm suffers from a severe case of early termination. When initialized from a random partition, the iterates $F^k$ almost immediately converge to a poor-quality solution. We may rescue this poor-quality algorithm and convert it to a highly effective one, while maintaining its simplicity, by randomizing the LP (14) at each step in the following way. At step $k$ we solve the LP

Maximize $L_k(F)$ subject to $F \in \mathcal{C}$ and $\psi_i(F) = 0$ for $i \in I_k$, (15)

where the set $I_k$ is a random subset of $\{1, 2, \ldots, n\}$ obtained by drawing $s_k$ constraints uniformly at random without replacement. The LP (15) is therefore a version of LP (14) in which we have dropped a random set of constraints. If we start by enforcing a small number $s_k$ of constraints and slowly increment this number, $s_{k+1} = s_k + \Delta s_k$, as the algorithm progresses, we allow the algorithm time to explore the energy landscape. Enforcing more constraints as the iterates progress ensures that (15) eventually coincides with (14), so convergence of the iterates $F^k$ of the randomized algorithm is still guaranteed. The attraction is that LP (15) has a simple, closed-form solution given by a variant of gradient thresholding. We derive the closed-form solution of LP (15) in section 1 of the supplementary material, and this leads to Algorithm 1 below.

Algorithm 1 Randomized SLP for PCut
  Initialization: $(f^0_1, \ldots, f^0_R) = (1_{A_1}, \ldots, 1_{A_R})$ for $(A_1, \ldots, A_R)$ a random partition of $V$
  for $k = 0$ to maxiter do
    for $r = 1$ to $R$ do
      Set $\hat{f}_r = f^k_r / \big(\sum_{i=1}^{n} f^k_{i,r}\big)$, then solve $M_\alpha u_r = \hat{f}_r$
      Set $g_{i,r} = \hat{f}_{i,r} / u_{i,r}$ for $i = 1, \ldots, n$, then solve $M_\alpha^T v_r = g_r$
      Set $h_r = \log u_r + v_r - 1$
    end for
    Choose $s_k$ vertices at random and let $I \subset V$ be these vertices.
    for all $i \in V$ do
      If $i \in I$: set $f^{k+1}_{i,r} = 1$ if $r = \arg\max_s h_{i,s}$, and $0$ otherwise.
      If $i \notin I$: set $f^{k+1}_{i,r} = 1$ if $h_{i,r} > 0$, and $0$ otherwise.
    end for
  end for
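A compact NumPy transcription of Algorithm 1 might look as follows (an illustrative sketch, not the authors' released implementation; it uses dense linear solves in place of the AMG solver described below, assumes a dense weight matrix W, and uses a simple linear schedule for the number of enforced constraints).

```python
import numpy as np

def pcut_randomized_slp(W, R, alpha=0.9, maxiter=100, ds=None, seed=0):
    """Sketch of Algorithm 1 (randomized SLP) for the Product Cut."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    ds = ds if ds is not None else max(1, n // maxiter)            # constraint schedule
    D_inv = 1.0 / W.sum(axis=1)
    M = (np.eye(n) - alpha * W * D_inv[None, :]) / (1.0 - alpha)   # M_alpha
    F = np.eye(R)[rng.integers(0, R, size=n)]      # random initial partition, n x R binary
    s = 0
    for _ in range(maxiter):
        H = np.empty((n, R))
        for r in range(R):
            col = F[:, r]
            if col.sum() == 0:                     # empty cluster: make it unattractive
                H[:, r] = -np.inf
                continue
            f_hat = col / col.sum()
            u = np.linalg.solve(M, f_hat)          # u_r = Omega_alpha f_hat
            v = np.linalg.solve(M.T, f_hat / u)    # v_r = Omega_alpha^T g_r
            H[:, r] = np.log(u) + v - 1.0          # gradient h_r
        s = min(n, s + ds)
        I = rng.choice(n, size=s, replace=False)   # randomly enforced constraints
        F_new = (H > 0).astype(float)              # thresholding on unconstrained rows
        F_new[I] = 0.0
        F_new[I, H[I].argmax(axis=1)] = 1.0        # enforce sum_r f_{i,r} = 1 on I
        F = F_new
    return F.argmax(axis=1)                        # cluster label per vertex
```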
The overall effectiveness of this strategy relies on two key ingredients. The first is a proper choice of the number of constraints $s_k$ to enforce at each step. Selecting the rate at which $s_k$ increases is similar, in principle, to selecting a learning rate schedule for a stochastic gradient descent algorithm. If $s_k$ increases too quickly, the algorithm converges to poor-quality partitions. If $s_k$ increases too slowly, the algorithm finds a quality solution but wastes computational effort. A good rule of thumb is to increase $s_k$ linearly at some constant rate $\Delta s_k \equiv \lambda$ until all constraints are enforced, at which point we switch to the deterministic algorithm and terminate the process at convergence. The second key ingredient involves approximating solutions to the linear system $M_\alpha x = b$ quickly. We use a simple Algebraic Multigrid (AMG) technique, i.e. a stripped-down version of [7] or [6], to accomplish this. The main insight here is that exact solutions of $M_\alpha x = b$ are not needed, but not all approximate solutions are effective. We need an approximate solution $x$ that has non-zero entries on all of $V$ for thresholding to succeed, and this can be accomplished by AMG at very little cost.

4 Experiments

We conclude our study of the Product Cut model by presenting an extensive experimental evaluation of the algorithm.[1] We intend these experiments to highlight the fact that, in addition to a strong theoretical model, the algorithm itself leads to state-of-the-art performance in terms of cluster purity on a variety of real world data sets. We provide experimental results on four text data sets (20NEWS, RCV1, WEBKB4, CITESEER) and four data sets containing images of handwritten digits (MNIST, PENDIGITS, USPS, OPTDIGITS). We provide the source for these data sets and details on their construction in the supplementary material.

[1] The code is available at https://github.com/xbresson/pcut

We compare our method against partitioning algorithms that, like the Product Cut, rely on graph-cut objective principles and that partition the graph in a direct, non-recursive manner. The NCut algorithm [15] is a widely used spectral algorithm that relies on a post-processing of the eigenvectors of the graph Laplacian to optimize the Normalized Cut energy. The NMFR algorithm [14] uses a graph-based random walk variant of the Normalized Cut. The LSD algorithm [2] provides a non-negative matrix factorization algorithm that relies upon a trace-based relaxation of the Normalized Cut objective. The MTV algorithm from [3] and the balanced k-cut algorithm from [9] provide total-variation based algorithms that attempt to find an optimal multi-way Cheeger cut of the graph by using ℓ1 optimization techniques. Both algorithms optimize the same objective and achieve similar purity values; we report results for [3] only. The GRACLUS algorithm [4, 5] uses a multi-level coarsening approach to optimize the NCut objective as formulated in terms of kernel k-means.

Table 1: Algorithmic Comparison via Cluster Purity.

|               | 20NEWS | RCV1 | WEBKB4 | CITESEER | MNIST | PENDIGITS | USPS | OPTDIGITS |
|---------------|--------|------|--------|----------|-------|-----------|------|-----------|
| size          | 20K    | 9.6K | 4.2K   | 3.3K     | 70K   | 11K       | 9.3K | 5.6K      |
| R             | 20     | 4    | 4      | 6        | 10    | 10        | 10   | 10        |
| RND           | 6      | 30   | 39     | 22       | 11    | 12        | 17   | 12        |
| NCUT          | 27     | 38   | 40     | 23       | 77    | 80        | 72   | 91        |
| LSD           | 34     | 38   | 46     | 53       | 76    | 86        | 70   | 91        |
| MTV           | 36     | 43   | 45     | 43       | 96    | 87        | 85   | 95        |
| GRACLUS       | 42     | 42   | 49     | 54       | 97    | 85        | 87   | 94        |
| NMFR          | 61     | 43   | 58     | 63       | 97    | 87        | 86   | 98        |
| PCut (.9, λ1) | 61     | 53   | 58     | 63       | 97    | 87        | 89   | 98        |
| PCut (.9, λ2) | 60     | 50   | 57     | 64       | 96    | 84        | 89   | 95        |

Table 1 reports the accuracy obtained by these algorithms for each data set. We use cluster purity to quantify the quality of the calculated partition, defined according to the relation

$$\mathrm{Purity} = \frac{1}{n} \sum_{r=1}^{R} \max_{1 \le s \le R} n_{r,s},$$

where $n_{r,s}$ denotes the number of data points in the $r$th computed cluster that belong to the $s$th ground-truth class.
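Cluster purity is simple to compute from predicted and ground-truth labels; the sketch below (illustrative Python implementing the standard definition above, not code from the paper) mirrors the formula.

```python
import numpy as np

def cluster_purity(pred, truth):
    """Purity = (1/n) * sum_r max_s n_{r,s}, where n_{r,s} counts points of
    computed cluster r that carry ground-truth label s."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for r in np.unique(pred):
        _, counts = np.unique(truth[pred == r], return_counts=True)
        total += counts.max()
    return total / len(pred)

# Example: a perfect clustering has purity 1.0 regardless of label permutation.
print(cluster_purity([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 7, 7]))  # 1.0
```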