# Reduction Techniques for Graph-Based Convex Clustering

Lei Han¹ and Yu Zhang²
¹Department of Statistics, Rutgers University
²Department of Computer Science and Engineering, Hong Kong University of Science and Technology
¹lhan@stat.rutgers.edu, leihan.cs@gmail.com; ²yu.zhang.ust@gmail.com

(Both authors contributed equally.)

The Graph-based Convex Clustering (GCC) method has gained increasing attention recently. The GCC method adopts a fused regularizer to learn the cluster centers and obtains a geometric clusterpath by varying the regularization parameter. One major limitation is that solving the GCC model is computationally expensive. In this paper, we develop efficient graph reduction techniques for the GCC model that eliminate edges, each of which corresponds to two data points from the same cluster, without solving the optimization problem in the GCC method, leading to improved computational efficiency. Specifically, two reduction techniques are proposed for the tree-based and cyclic-graph-based convex clustering methods separately. The proposed reduction techniques are appealing because they only need to scan the data once with negligible additional cost, and because they are independent of the solver for the GCC method, they can improve the efficiency of any existing solver. Experiments on both synthetic and real-world datasets show that our methods can largely improve the efficiency of the GCC model.

## Introduction

Clustering, which partitions a data set into several subsets such that each subset contains similar data points, is an important unsupervised task and has been extensively studied in the literature. Many clustering algorithms have been proposed, such as the k-means method, density-based approaches (Ester et al. 1996), spectral clustering methods (Ng, Jordan, and Weiss 2002), and minimum spanning tree (MST) algorithms (Grygorash, Zhou, and Jorgensen 2006; Wang, Wang, and Wilkes 2009). Different algorithms are suited to capturing different types of cluster structure. For example, the k-means algorithm groups data points into clusters of convex shape, while the spectral clustering and MST methods can handle non-convex and even more complex cluster shapes. All the aforementioned clustering methods have limitations: some are sensitive to the initialization of the cluster centers, since attaining the global optimum is difficult, and many cannot determine the number of clusters.

Recently, the GCC method (Hocking et al. 2011; Chen et al. 2014) has attracted increasing attention since it can determine the number of clusters based on the globally optimal solution of its convex objective function. The GCC model organizes the data points as an undirected graph and adopts a fused regularizer, the sum of a set of pairwise fusion terms, each of which corresponds to an edge in the graph. The GCC method then obtains a clusterpath by optimizing the objective function over a sequence of regularization parameters. The generated clusterpath provides a meaningful geometric interpretation of the clustering structure hidden in the data. In terms of clustering performance, the GCC method is comparable to state-of-the-art clustering methods such as spectral clustering.
However, solving the GCC model is computationally expensive due to the complex fused regularizer. In this paper, we develop novel graph reduction techniques for the GCC model. Instead of seeking efficient algorithms to solve the GCC model, the graph reduction techniques eliminate edges in the graph before the objective function of the GCC model is optimized, such that the data points connected by the eliminated edges are (almost) guaranteed to be grouped. In this way, we reduce the number of variables to be optimized, since the variables for data points detected to belong to the same cluster are merged into a single variable, and hence we improve the computational efficiency. These reduction techniques are derived from an analysis of the relations between the primal and dual forms of the GCC model. The proposed reduction techniques match well with the clusterpath generated by the GCC method, since they can identify the data points to be grouped together as the value of the regularization parameter varies, which is just what the clusterpath needs. They are appealing because they only need to scan the data once with negligible additional cost, and because they are independent of the solver, they can improve the efficiency of any existing solver for the GCC model. Specifically, we consider two types of graphs in the GCC model, i.e., the acyclic tree and the general cyclic graph, and correspondingly propose two reduction techniques for the tree-based convex clustering (TCC) model and the cyclic-graph-based convex clustering (CGCC) model, respectively. We evaluate the proposed reduction techniques on both synthetic and real-world datasets and find that they gain considerable speedup.

## Overview on The GCC Method

If unspecified, we use bold-face capital letters for matrices, bold-face lower-case letters for vectors, and lower-case letters for scalars. We are given $n$ data points in a $p$-dimensional space, and the data matrix is denoted by $\mathbf{X}\in\mathbb{R}^{n\times p}$, where $\mathbf{x}_i$, the $i$th row of $\mathbf{X}$, denotes the $i$th data point. An undirected graph $G = (V, E)$ encodes relations between the $n$ data points, where each data point corresponds to a vertex in $V$ and $E$ denotes the set of edges in $G$. The GCC method proposed in (Hocking et al. 2011) aims to solve the following convex optimization problem:

$$\min_{\mathbf{A}\in\mathbb{R}^{n\times p}} \frac{1}{2}\|\mathbf{A}-\mathbf{X}\|_F^2 + \lambda\sum_{(i,j)\in E} w_{ij}\|\boldsymbol{\alpha}_i-\boldsymbol{\alpha}_j\|_q, \qquad (1)$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm, $\boldsymbol{\alpha}_i$ is the $i$th row of $\mathbf{A}$, $w_{ij}\ge 0$ denotes the similarity between the corresponding pair of data points, and $\|\cdot\|_q$ denotes the $\ell_q$ norm of a vector. Here $\boldsymbol{\alpha}_i$ can be viewed as an instance-specific cluster center for $\mathbf{x}_i$, and data points sharing the same instance-specific cluster center are considered to belong to the same cluster. The fused regularizer (i.e., the second term in problem (1)) thus detects equivalence among the pairs of rows of $\mathbf{A}$ indexed by $E$. According to (Chen et al. 2014), the graph $G$ is only required to be connected, and hence it can be sparse: many $w_{ij}$'s can be zero, or equivalently the set of edges $E$ can be small. The continuous path of optimal solutions obtained from problem (1) by varying $\lambda$ is called the clusterpath (Hocking et al. 2011; Chen et al. 2014). In the following analysis, we consider two instantiations of the GCC model depending on the structure of $G$, i.e., the TCC and CGCC models.
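To make the formulation concrete, the following is a minimal NumPy sketch (not the authors' code) of evaluating objective (1); the function name and data layout are our own illustration.

```python
import numpy as np

def gcc_objective(A, X, edges, w, lam, q=2):
    """Evaluate objective (1): 0.5 * ||A - X||_F^2
    plus lam * sum over edges (i, j) of w_r * ||alpha_i - alpha_j||_q."""
    fit = 0.5 * np.linalg.norm(A - X, 'fro') ** 2
    fuse = sum(w[r] * np.linalg.norm(A[i] - A[j], ord=q)
               for r, (i, j) in enumerate(edges))
    return fit + lam * fuse

# Example: n = 4 points in p = 2 dimensions on a path graph (a tree).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
edges = [(0, 1), (1, 2), (2, 3)]
w = np.ones(len(edges))
print(gcc_objective(X.copy(), X, edges, w, lam=0.5))  # A = X: only the fused term remains
```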
We first consider the simple case where $G$ is a tree, for which problem (1) defines the TCC model, and then study a general graph, which can be cyclic, with problem (1) as the objective function of the CGCC model.

## Tree Reduction for The TCC Model

In this section, we consider the TCC model. The tree used is denoted by $T = (V, E_T)$. Obviously $|E_T|$, the size of $E_T$, equals $n-1$.

### The Dual Form for The TCC Model

We first transform problem (1) for the TCC model. The fused regularizer in problem (1), denoted by $\Omega_q(\mathbf{A})$, can be reformulated as $\Omega_q(\mathbf{A}) = \lambda\|\mathbf{W}\mathbf{C}\mathbf{A}\|_{1,q}$, where $\|\cdot\|_{1,q}$ denotes the sum of the $\ell_q$ norms of the rows of a matrix, $\mathbf{W}\in\mathbb{R}^{(n-1)\times(n-1)}$ is a diagonal matrix with the weights $w_{ij}$ for $(i,j)\in E_T$ on its diagonal, and $\mathbf{C}\in\mathbb{R}^{(n-1)\times n}$ is an auxiliary sparse matrix in which each row contains only two non-zero entries, $1$ and $-1$, corresponding to an edge in $E_T$. Therefore, the objective function of the TCC model can be reformulated as

$$\min_{\mathbf{A}\in\mathbb{R}^{n\times p}} \frac{1}{2}\|\mathbf{A}-\mathbf{X}\|_F^2 + \lambda\|\mathbf{W}\mathbf{C}\mathbf{A}\|_{1,q}. \qquad (2)$$

Now we present some useful lemmas.

**Lemma 1.** $\mathbf{C}$ has full row rank, so $\mathrm{rank}(\mathbf{C}) = n-1$.

**Lemma 2.** Construct a matrix $\mathbf{D} = \begin{pmatrix}\mathbf{C}\\ \mathbf{1}_n\end{pmatrix}$, where $\mathbf{1}_n\in\mathbb{R}^{1\times n}$ is a row vector with all elements being 1. Then $\mathrm{rank}(\mathbf{D}) = n$ and hence $\mathbf{D}$ is invertible.

Based on Lemmas 1 and 2, let

$$\boldsymbol{\Gamma} = \begin{pmatrix}\mathbf{B}\\ \boldsymbol{\eta}\end{pmatrix} = \mathbf{D}\mathbf{A} = \begin{pmatrix}\mathbf{C}\mathbf{A}\\ \mathbf{1}_n\mathbf{A}\end{pmatrix},$$

where $\mathbf{B}\in\mathbb{R}^{(n-1)\times p}$ and $\boldsymbol{\eta}\in\mathbb{R}^{1\times p}$. Then problem (2) can be rewritten as

$$\min_{\boldsymbol{\Gamma}} \frac{1}{2}\|\mathbf{X}-\mathbf{D}^{-1}\boldsymbol{\Gamma}\|_F^2 + \lambda\|\mathbf{W}\mathbf{B}\|_{1,q}. \qquad (3)$$

By defining $\mathbf{D}_1\in\mathbb{R}^{n\times(n-1)}$ and $\mathbf{d}_2\in\mathbb{R}^{n\times 1}$ via $[\mathbf{D}_1\ \mathbf{d}_2] = \mathbf{D}^{-1}$, we can proceed as

$$\min_{\mathbf{B},\boldsymbol{\eta}} f(\mathbf{B},\boldsymbol{\eta}) = \frac{1}{2}\|\mathbf{X}-(\mathbf{D}_1\mathbf{B}+\mathbf{d}_2\boldsymbol{\eta})\|_F^2 + \lambda\|\mathbf{W}\mathbf{B}\|_{1,q}.$$

By setting $\frac{\partial f}{\partial\boldsymbol{\eta}} = 0$, the solution for $\boldsymbol{\eta}$ is given by

$$\boldsymbol{\eta}^* = (\mathbf{d}_2^T\mathbf{d}_2)^{-1}\mathbf{d}_2^T(\mathbf{X}-\mathbf{D}_1\mathbf{B}^*). \qquad (4)$$

Note that $\mathbf{d}_2^T\mathbf{d}_2$ is a scalar. By plugging Eq. (4) back into $f(\mathbf{B},\boldsymbol{\eta})$, we can rewrite problem (2) as

$$\min_{\mathbf{B}\in\mathbb{R}^{(n-1)\times p}} \frac{1}{2}\|\tilde{\mathbf{X}}-\tilde{\mathbf{D}}\mathbf{B}\|_F^2 + \lambda\|\mathbf{W}\mathbf{B}\|_{1,q}, \qquad (5)$$

where $\mathbf{I}$ denotes an identity matrix of appropriate size, $\tilde{\mathbf{X}} = \big(\mathbf{I}-\frac{1}{\mathbf{d}_2^T\mathbf{d}_2}\mathbf{d}_2\mathbf{d}_2^T\big)\mathbf{X}\in\mathbb{R}^{n\times p}$, and $\tilde{\mathbf{D}} = \big(\mathbf{I}-\frac{1}{\mathbf{d}_2^T\mathbf{d}_2}\mathbf{d}_2\mathbf{d}_2^T\big)\mathbf{D}_1\in\mathbb{R}^{n\times(n-1)}$. The solution of problem (5) has an affine relationship with that of problem (2): $\mathbf{A}^* = \mathbf{D}^{-1}\boldsymbol{\Gamma}^*$, where $\boldsymbol{\Gamma}^*$ can be obtained from $\mathbf{B}^*$ based on Eqs. (3) and (4).

In order to find a sufficient condition guaranteeing that some data points are grouped into a cluster, we derive the dual form of problem (5) as¹

$$\min_{\boldsymbol{\Theta}\in\mathbb{R}^{n\times p}} g(\boldsymbol{\Theta}) = \frac{\lambda^2}{2}\Big\|\boldsymbol{\Theta}-\frac{\tilde{\mathbf{X}}}{\lambda}\Big\|_F^2 \quad \text{s.t.}\quad \|\tilde{\mathbf{d}}_r\boldsymbol{\Theta}\|_{\bar q}\le w_r,\ r = 1,\dots,n-1, \qquad (6)$$

where $\boldsymbol{\Theta}\in\mathbb{R}^{n\times p}$ is the dual variable, the index $r$ corresponds to a pair of indices $(i,j)$, denoted by $r\sim(i,j)$, $w_r$ equals $w_{ij}$, $\tilde{\mathbf{d}}_r$ is the $r$th row of $\tilde{\mathbf{D}}^T$, and $\bar q = \frac{q}{q-1}$.

¹Please refer to the supplementary material (http://www.stat.rutgers.edu/home/lhan/) for the details.

The KKT conditions for problems (5) and (6) include

$$\lambda\boldsymbol{\Theta}^* = \tilde{\mathbf{X}}-\tilde{\mathbf{D}}\mathbf{B}^*,\qquad
\tilde{\mathbf{d}}_r\boldsymbol{\Theta}^* = \begin{cases} w_r\frac{\boldsymbol{\beta}_r^*}{\|\boldsymbol{\beta}_r^*\|_q}, & \text{if } \boldsymbol{\beta}_r^*\neq\mathbf{0}\ (\text{i.e., } \boldsymbol{\alpha}_i^*\neq\boldsymbol{\alpha}_j^*),\\ \mathbf{u} \text{ with } \|\mathbf{u}\|_{\bar q}\le w_r, & \text{if } \boldsymbol{\beta}_r^*=\mathbf{0}\ (\text{i.e., } \boldsymbol{\alpha}_i^*=\boldsymbol{\alpha}_j^*),\end{cases} \qquad (7)$$

where $r\sim(i,j)$ and $\boldsymbol{\beta}_r^*$ is the $r$th row of $\mathbf{B}^*$. Eq. (7) suggests a sufficient condition to determine whether two data points will be clustered by the TCC method:

$$\|\tilde{\mathbf{d}}_r\boldsymbol{\Theta}^*\|_{\bar q} < w_r \;\Rightarrow\; \boldsymbol{\beta}_r^* = \mathbf{0} \;\Rightarrow\; \boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_j^*; \qquad \text{(R1)}$$

$$\boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_j^*,\ \boldsymbol{\alpha}_j^* = \boldsymbol{\alpha}_k^* \;\Rightarrow\; \boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_k^*. \qquad \text{(R2)}$$

According to rules (R1) and (R2), if two data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are grouped by the TCC method, which is equivalent to $\boldsymbol{\beta}_r^* = \mathbf{0}$, the corresponding edge in the tree is redundant and can be eliminated, since we can treat $\boldsymbol{\alpha}_i$ and $\boldsymbol{\alpha}_j$ as the same variable. Unfortunately, rule (R1) is not directly usable since we do not know the optimal solution $\boldsymbol{\Theta}^*$, and neither is rule (R2). Inspired by feature screening methods (Wang et al. 2013; Wang, Wonka, and Ye 2014; Wang et al. 2014; Wang and Ye 2014; Zhao and Liu 2014), we resort to constructing a feasible region $\mathcal{O}$ that contains $\boldsymbol{\Theta}^*$, and then relax rule (R1) as

$$\sup_{\boldsymbol{\Theta}\in\mathcal{O}} \|\tilde{\mathbf{d}}_r\boldsymbol{\Theta}\|_{\bar q} < w_r \;\Rightarrow\; \boldsymbol{\beta}_r^* = \mathbf{0}. \qquad (8)$$
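To illustrate the reformulation above, here is a minimal NumPy sketch constructing $\mathbf{C}$, $\mathbf{D}$, and the quantities $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{D}}$ of problem (5). All names are ours, and the dense inverse is used only for clarity; it is not meant as an efficient implementation.

```python
import numpy as np

def tcc_transform(X, tree_edges):
    """Build C and D = [C; 1_n] (Lemmas 1-2), split D^{-1} = [D1 | d2],
    and form X_tilde and D_tilde as defined under problem (5)."""
    n, p = X.shape
    m = len(tree_edges)                        # = n - 1 for a tree
    C = np.zeros((m, n))
    for r, (i, j) in enumerate(tree_edges):
        C[r, i], C[r, j] = 1.0, -1.0           # one +1/-1 pair per edge
    D = np.vstack([C, np.ones((1, n))])        # invertible by Lemma 2
    D_inv = np.linalg.inv(D)
    D1, d2 = D_inv[:, :m], D_inv[:, m:]        # D1: n x (n-1), d2: n x 1
    P = np.eye(n) - (d2 @ d2.T) / (d2.T @ d2).item()  # removes the d2 direction
    return C, P @ X, P @ D1                    # C, X_tilde, D_tilde
```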
Eq. (8) provides a sufficient condition for finding the edges to be eliminated. The key here is to construct a tight region $\mathcal{O}$ for $\boldsymbol{\Theta}^*$, since a tighter region concentrating around $\boldsymbol{\Theta}^*$ leads to the elimination of more edges.

### Feasible Region Construction

If we are given the optimal $\boldsymbol{\Theta}^*(\lambda')$ at some regularization parameter $\lambda'$, where the optimal solution $\boldsymbol{\Theta}^*$ is viewed as a function of the regularization parameter, then we can construct $\mathcal{O}$ for $\boldsymbol{\Theta}^*(\lambda)$ with $\lambda < \lambda'$ by utilizing $\boldsymbol{\Theta}^*(\lambda')$, as we will see later. This motivates running the reduction techniques along a decreasing sequence of regularization parameters, through which the clusterpath can be generated simultaneously.

We first determine a constant $\lambda_{\max}$ such that for any regularization parameter larger than $\lambda_{\max}$, the TCC model groups all the data points into one cluster. $\lambda_{\max}$ can serve as the first regularization parameter, since the corresponding optimal solution can be obtained analytically without any solver. For problem (6), we define $\mathcal{F}_r = \{\boldsymbol{\Theta} : \|\tilde{\mathbf{d}}_r\boldsymbol{\Theta}\|_{\bar q}\le w_r\}$ for $r = 1,\dots,n-1$ and $\mathcal{F} = \cap_{r=1}^{n-1}\mathcal{F}_r$ as the intersection of $\{\mathcal{F}_r\}_{r=1}^{n-1}$. It is easy to see that if $\tilde{\mathbf{X}}/\lambda\in\mathcal{F}$, then $\boldsymbol{\Theta}^*(\lambda) = \tilde{\mathbf{X}}/\lambda$. Moreover, from (R1) we can see that if $\tilde{\mathbf{X}}/\lambda$ is an interior point of $\mathcal{F}$, then $\mathbf{B}^* = \mathbf{0}$. Indeed, we have the following result to determine $\lambda_{\max}$.

**Theorem 1.** For problem (6), define $\lambda_{\max} = \max_r\{\rho_r : \|\tilde{\mathbf{d}}_r\tilde{\mathbf{X}}/\rho_r\|_{\bar q} = w_r\}$. Then the following statements are equivalent: (i) $\tilde{\mathbf{X}}/\lambda\in\mathcal{F}$; (ii) $\boldsymbol{\Theta}^*(\lambda) = \tilde{\mathbf{X}}/\lambda$; (iii) $\mathbf{B}^* = \mathbf{0}$.

Condition (iii) in Theorem 1 implies that all the data points are grouped into a single cluster once the regularization parameter reaches $\lambda_{\max}$. Now we can construct $\mathcal{O}$ for any $\lambda$ smaller than $\lambda_{\max}$ via the following theorem.²

²Similar to (Hocking et al. 2011), three values of $q$ (i.e., $1$, $2$, and $\infty$) are investigated, and our analysis can be easily generalized to a general $q$.

**Theorem 2.** Suppose the optimal $\boldsymbol{\Theta}^*(\lambda')$ is known for some $\lambda'\le\lambda_{\max}$. Let $\rho_r$ be defined as in Theorem 1. For any $0<\lambda<\lambda'$, define $\tilde{\mathbf{d}}_* = \arg\max_{\tilde{\mathbf{d}}_r}\rho_r$, $\mathbf{e} = I(|\tilde{\mathbf{d}}_*|\doteq\|\tilde{\mathbf{d}}_*\|_\infty)\circ\mathrm{sign}(\tilde{\mathbf{d}}_*)$,

$$\mathbf{n}(\lambda') = \begin{cases} \tilde{\mathbf{X}}/\lambda' - \boldsymbol{\Theta}^*(\lambda'), & \text{if } \lambda' < \lambda_{\max},\\ \tilde{\mathbf{d}}_*^T\,\mathrm{sign}(\tilde{\mathbf{d}}_*\tilde{\mathbf{X}})/\lambda_{\max}, & \text{if } \lambda' = \lambda_{\max},\ q = 1,\\ \tilde{\mathbf{d}}_*^T\tilde{\mathbf{d}}_*\tilde{\mathbf{X}}/\lambda_{\max}, & \text{if } \lambda' = \lambda_{\max},\ q = 2,\\ \mathbf{e}^T\tilde{\mathbf{d}}_*\tilde{\mathbf{X}}/\lambda_{\max}, & \text{if } \lambda' = \lambda_{\max},\ q = \infty,\end{cases}$$

and

$$\mathbf{v}(\lambda,\lambda') = \frac{\tilde{\mathbf{X}}}{\lambda} - \boldsymbol{\Theta}^*(\lambda'),\qquad \mathbf{v}_\perp(\lambda,\lambda') = \mathbf{v}(\lambda,\lambda') - \frac{\langle\mathbf{v}(\lambda,\lambda'),\mathbf{n}(\lambda')\rangle}{\|\mathbf{n}(\lambda')\|_F^2}\,\mathbf{n}(\lambda'),$$

where $I(\cdot)$, $|\cdot|$, and $\doteq$ are the element-wise indicator, absolute value, and equivalence operators, respectively, and $\mathrm{sign}(\cdot)$ denotes the sign function. Then we have the following result:

$$\|\boldsymbol{\Theta}^*(\lambda) - \mathbf{o}(\lambda,\lambda')\|_F \le R(\lambda,\lambda'), \qquad (9)$$

where $\mathbf{o}(\lambda,\lambda') = \boldsymbol{\Theta}^*(\lambda') + \frac{1}{2}\mathbf{v}_\perp(\lambda,\lambda')$ and $R(\lambda,\lambda') = \frac{1}{2}\|\mathbf{v}_\perp(\lambda,\lambda')\|_F$.

Theorem 2 implies that the optimal $\boldsymbol{\Theta}^*(\lambda)$ lies in a ball depending on $\boldsymbol{\Theta}^*(\lambda')$ with center $\mathbf{o}(\lambda,\lambda')$ and radius $R(\lambda,\lambda')$. Therefore, we can directly use this ball as the feasible region $\mathcal{O}(\lambda,\lambda')$:

$$\mathcal{O}(\lambda,\lambda') = \big\{\boldsymbol{\Theta} : \|\boldsymbol{\Theta} - \mathbf{o}(\lambda,\lambda')\|_F \le R(\lambda,\lambda')\big\}. \qquad (10)$$

Based on the sufficient condition in Eq. (8), we only need to estimate the following supremum:

$$\sup\big\{\|\tilde{\mathbf{d}}_r\boldsymbol{\Theta}\|_{\bar q} : \boldsymbol{\Theta}\in\mathcal{O}(\lambda,\lambda')\big\},\quad r = 1,\dots,n-1. \qquad (11)$$

### The Eater Rules

The supremum is evaluated in the following theorem, which also yields the Exact Tree Reduction (Eater) rules.

**Theorem 3 (Eater Rules).** For the TCC model, suppose the optimal solution $\boldsymbol{\Theta}^*(\lambda')$ at some $0<\lambda'\le\lambda_{\max}$ is known. Let $r\sim(i,j)$. Then for $0<\lambda<\lambda'$, the edge $(i,j)$ can be eliminated under the regularization parameter $\lambda$, i.e., $\boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_j^*$, if one of the following conditions holds:

$$\begin{cases} \|\tilde{\mathbf{d}}_r\mathbf{o}(\lambda,\lambda')\|_\infty + R(\lambda,\lambda')\|\tilde{\mathbf{d}}_r\|_2 < w_r, & \text{if } q = 1,\\ \|\tilde{\mathbf{d}}_r\mathbf{o}(\lambda,\lambda')\|_2 + R(\lambda,\lambda')\|\tilde{\mathbf{d}}_r\|_2 < w_r, & \text{if } q = 2,\\ \|\tilde{\mathbf{d}}_r\mathbf{o}(\lambda,\lambda')\|_1 + R(\lambda,\lambda')\|\tilde{\mathbf{d}}_r\|_2 < w_r, & \text{if } q = \infty; \end{cases} \qquad \text{(R1*)}$$

$$\boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_k^* \text{ and } \boldsymbol{\alpha}_j^* = \boldsymbol{\alpha}_k^* \text{ for some } k. \qquad \text{(R2*)}$$
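Putting Theorems 1-3 together for the $q = 2$ case used later in the experiments, here is a hedged sketch of the screening step. It assumes the `tcc_transform` helper above and implements only the generic $\lambda' < \lambda_{\max}$ branch of Theorem 2 (the $\lambda' = \lambda_{\max}$ branch needs the special normal direction).

```python
import numpy as np

def eater_lambda_max(X_t, D_t, w):
    # Theorem 1, q = 2: rho_r = ||d_r X_tilde||_2 / w_r; lambda_max = max_r rho_r.
    rho = np.linalg.norm(D_t.T @ X_t, axis=1) / w
    return rho.max()

def eater_ball(X_t, lam, lam_prev, Theta_prev):
    # Theorem 2 (case lam_prev < lambda_max): center o and radius R of the ball.
    n_vec = X_t / lam_prev - Theta_prev
    v = X_t / lam - Theta_prev
    v_perp = v - (np.sum(v * n_vec) / np.sum(n_vec * n_vec)) * n_vec
    return Theta_prev + 0.5 * v_perp, 0.5 * np.linalg.norm(v_perp)

def eater_screen(o, R, D_t, w):
    # Rule (R1*), q = 2: edge r is eliminated if ||d_r o||_2 + R ||d_r||_2 < w_r.
    lhs = np.linalg.norm(D_t.T @ o, axis=1) + R * np.linalg.norm(D_t, axis=0)
    return lhs < w          # boolean mask: True = edge safely eliminated
```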
Some remarks on the Eater rules: (1) All the conditions in the Eater rules can be checked efficiently via simple matrix operations on the given data matrix $\mathbf{X}$. This is promising, since we can directly determine which data points will be clustered together for a given $\lambda$ without solving the objective function; the only requirement is knowledge of $\boldsymbol{\Theta}^*(\lambda')$ at some $\lambda' > \lambda$. (2) To generate the clusterpath, multiple TCC models need to be learned along a decreasing sequence of regularization parameters $\lambda_0 > \lambda_1 > \dots > \lambda_t$ with $\lambda_0 = \lambda_{\max}$. The proposed reduction techniques work naturally along this sequence, since $\boldsymbol{\Theta}^*(\lambda_i)$ can be used to perform the reduction for $\lambda_{i+1}$. Moreover, at the beginning, both $\lambda_{\max}$ and $\boldsymbol{\Theta}^*(\lambda_{\max})$ can be computed analytically according to Theorem 1.

## Graph Reduction for CGCC

In this section, we present reduction techniques for the CGCC model. The cyclic graph is denoted by $G = (V, E_C)$, and $\bar{\mathbf{C}}\in\mathbb{R}^{m\times n}$ is defined analogously to $\mathbf{C}$, where $m$, the number of edges, is not smaller than $n$. Then we have the following property for $\bar{\mathbf{C}}$.

**Lemma 3.** $\bar{\mathbf{C}}\in\mathbb{R}^{m\times n}$ is a rank-deficient matrix, with $\mathrm{rank}(\bar{\mathbf{C}}) < n \le m$.

Since $\bar{\mathbf{C}}$ is rank-deficient, we cannot transform the objective function of the CGCC model as we did for the TCC model, and therefore we work with the original problem (1) instead.

### The Dual Form for The CGCC Model

Problem (1) can be reformulated as

$$\min_{\mathbf{A}\in\mathbb{R}^{n\times p}} \frac{1}{2}\|\mathbf{A}-\mathbf{X}\|_F^2 + \lambda\|\bar{\mathbf{W}}\bar{\mathbf{C}}\mathbf{A}\|_{1,q}, \qquad (12)$$

where $\bar{\mathbf{W}}$ is the corresponding diagonal weight matrix. The Lagrangian dual of problem (12) is

$$\min_{\boldsymbol{\Phi}\in\mathbb{R}^{m\times p}} h(\boldsymbol{\Phi}) = \frac{\lambda^2}{2}\Big\|\bar{\mathbf{C}}^T\boldsymbol{\Phi}-\frac{\mathbf{X}}{\lambda}\Big\|_F^2 \quad \text{s.t.}\quad \|\boldsymbol{\phi}_r\|_{\bar q}\le w_r,\ r = 1,\dots,m, \qquad (13)$$

where $\boldsymbol{\phi}_r$ is the $r$th row of $\boldsymbol{\Phi}$. The KKT conditions for problem (13) include

$$\mathbf{X} = \mathbf{A}^* + \lambda\bar{\mathbf{C}}^T\boldsymbol{\Phi}^*,\qquad \bar{\mathbf{C}}\mathbf{X} = \bar{\mathbf{C}}\mathbf{A}^* + \lambda\bar{\mathbf{C}}\bar{\mathbf{C}}^T\boldsymbol{\Phi}^*, \qquad (14)$$

$$\boldsymbol{\phi}_r^* = \begin{cases} w_r\frac{[\bar{\mathbf{C}}\mathbf{A}^*]_r}{\|[\bar{\mathbf{C}}\mathbf{A}^*]_r\|_q}, & \text{if } [\bar{\mathbf{C}}\mathbf{A}^*]_r\neq\mathbf{0}\ (\text{i.e., } \boldsymbol{\alpha}_i^*\neq\boldsymbol{\alpha}_j^*),\\ \mathbf{u} \text{ with } \|\mathbf{u}\|_{\bar q}\le w_r, & \text{if } [\bar{\mathbf{C}}\mathbf{A}^*]_r=\mathbf{0}\ (\text{i.e., } \boldsymbol{\alpha}_i^*=\boldsymbol{\alpha}_j^*),\end{cases} \qquad (15)$$

where $r\sim(i,j)$. According to Eq. (15), we obtain a sufficient condition:

$$\|\boldsymbol{\phi}_r^*\|_{\bar q} < w_r \;\Rightarrow\; [\bar{\mathbf{C}}\mathbf{A}^*]_r = \mathbf{0} \;\Rightarrow\; \boldsymbol{\alpha}_i^* = \boldsymbol{\alpha}_j^*. \qquad \text{(R3)}$$

Similar to the TCC case, we want to construct a feasible region $\mathcal{G}$ containing $\boldsymbol{\Phi}^*$ and then relax rule (R3) as

$$\sup_{\boldsymbol{\Phi}\in\mathcal{G}} \|\boldsymbol{\phi}_r\|_{\bar q} < w_r \;\Rightarrow\; [\bar{\mathbf{C}}\mathbf{A}^*]_r = \mathbf{0}. \qquad (16)$$

In order to construct $\mathcal{G}$, similar to the TCC model, we first need to seek a constant $\lambda_{\max}$ at which the CGCC model groups all the data points into one cluster. We define $\bar{\mathcal{F}} = \{\boldsymbol{\Phi} : \|\boldsymbol{\phi}_r\|_{\bar q} < w_r,\ r = 1,\dots,m\}$. For $\boldsymbol{\Phi}^*\in\bar{\mathcal{F}}$, the optimality condition of problem (13) implies that

$$\bar{\mathbf{C}}\bar{\mathbf{C}}^T\boldsymbol{\Phi}^* = \frac{\bar{\mathbf{C}}\mathbf{X}}{\lambda}. \qquad (17)$$

It is easy to show that $\bar{\mathbf{C}}\bar{\mathbf{C}}^T$ is rank-deficient and hence not invertible, since $\bar{\mathbf{C}}\in\mathbb{R}^{m\times n}$ is rank-deficient. As a consequence, problem (13) can have multiple solutions, and so can Eq. (17), which makes finding $\lambda_{\max}$ difficult. Instead of deriving exact reduction rules for the CGCC model, we therefore seek inexact reduction rules, which we call the inexact Cyclic-Graph Reduction (Cigar) rules.

### The Cigar Rules

We propose to relax the dual problem (13) as

$$\min_{\boldsymbol{\Phi}} \tilde h(\boldsymbol{\Phi}) = \frac{\lambda^2}{2}\Big\|\bar{\mathbf{C}}^T\boldsymbol{\Phi}-\frac{\mathbf{X}}{\lambda}\Big\|_F^2 + \frac{\delta\lambda^2}{2}\|\boldsymbol{\Phi}\|_F^2 \quad \text{s.t.}\quad \|\boldsymbol{\phi}_r\|_{\bar q}\le w_r,\ r = 1,\dots,m, \qquad (18)$$

where $\delta$ is a small positive constant. Problem (18) is strictly convex. Define $\bar{\mathbf{D}} = \bar{\mathbf{C}}\bar{\mathbf{C}}^T + \delta\mathbf{I}\in\mathbb{R}^{m\times m}$, which is positive definite. Then we can determine $\lambda_{\max}$ via the following theorem.

**Theorem 4.** For the relaxed dual problem (18), define $\lambda_{\max} = \max_r\{\rho_r : \|\bar{\mathbf{d}}_r\mathbf{Y}/\rho_r\|_{\bar q} = w_r\}$, where $\bar{\mathbf{d}}_r$ denotes the $r$th row of the inverse matrix $\bar{\mathbf{D}}^{-1}$ and $\mathbf{Y} = \bar{\mathbf{C}}\mathbf{X}$. Then the following statements are equivalent: (i) $\bar{\mathbf{D}}^{-1}\mathbf{Y}/\lambda\in\bar{\mathcal{F}}$; (ii) $\boldsymbol{\Phi}^*(\lambda) = \bar{\mathbf{D}}^{-1}\mathbf{Y}/\lambda$.

Then we can construct the feasible region for $\boldsymbol{\Phi}^*$ based on the following theorem.

**Theorem 5.** Suppose that $\boldsymbol{\Phi}^*(\lambda')$ is known for some $\lambda'\le\lambda_{\max}$. For any $0<\lambda<\lambda'$, define

$$\mathbf{n}(\lambda') = \frac{\boldsymbol{\Lambda}^{-1}\mathbf{Y}}{\lambda'} - \boldsymbol{\Lambda}\boldsymbol{\Phi}^*(\lambda'),\qquad \mathbf{v}(\lambda,\lambda') = \frac{\boldsymbol{\Lambda}^{-1}\mathbf{Y}}{\lambda} - \boldsymbol{\Lambda}\boldsymbol{\Phi}^*(\lambda'),$$

$$\mathbf{v}_\perp(\lambda,\lambda') = \mathbf{v}(\lambda,\lambda') - \frac{\langle\mathbf{v}(\lambda,\lambda'),\mathbf{n}(\lambda')\rangle}{\|\mathbf{n}(\lambda')\|_F^2}\,\mathbf{n}(\lambda'),$$

where $\boldsymbol{\Lambda} = \bar{\mathbf{D}}^{\frac{1}{2}}$. Then we have the following result:

$$\|\boldsymbol{\Lambda}\boldsymbol{\Phi}^*(\lambda) - \mathbf{o}(\lambda,\lambda')\|_F \le R(\lambda,\lambda'),\quad \text{if } \lambda < \lambda_{\max}, \qquad (19)$$

where $\mathbf{o}(\lambda,\lambda') = \boldsymbol{\Lambda}\boldsymbol{\Phi}^*(\lambda') + \frac{1}{2}\mathbf{v}_\perp(\lambda,\lambda')$ and $R(\lambda,\lambda') = \frac{1}{2}\|\mathbf{v}_\perp(\lambda,\lambda')\|_F$.
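A corresponding sketch for the relaxed CGCC dual with $q = 2$: building $\bar{\mathbf{D}} = \bar{\mathbf{C}}\bar{\mathbf{C}}^T + \delta\mathbf{I}$, the $\lambda_{\max}$ of Theorem 4, and the ball of Theorem 5. The dense factorizations are for clarity only (remark (3) below discusses cheaper alternatives), and the value of `delta` here is purely illustrative.

```python
import numpy as np

def cigar_setup(X, C_bar, delta=1e-3):
    """D_bar = C C^T + delta*I (positive definite), Y = C X,
    and Lambda = D_bar^{1/2} via an eigendecomposition."""
    m = C_bar.shape[0]
    D_bar = C_bar @ C_bar.T + delta * np.eye(m)
    evals, U = np.linalg.eigh(D_bar)
    Lam = U @ np.diag(np.sqrt(evals)) @ U.T        # Lambda = D_bar^{1/2}
    return D_bar, C_bar @ X, Lam

def cigar_lambda_max(D_bar, Y, w):
    # Theorem 4, q = 2: rho_r = ||d_bar_r Y||_2 / w_r, d_bar_r a row of D_bar^{-1}.
    rho = np.linalg.norm(np.linalg.solve(D_bar, Y), axis=1) / w
    return rho.max()

def cigar_ball(Y, Lam, lam, lam_prev, Phi_prev):
    # Theorem 5: ball containing Lambda @ Phi*(lambda), for lam < lam_prev.
    LinvY = np.linalg.solve(Lam, Y)
    n_vec = LinvY / lam_prev - Lam @ Phi_prev
    v = LinvY / lam - Lam @ Phi_prev
    v_perp = v - (np.sum(v * n_vec) / np.sum(n_vec * n_vec)) * n_vec
    return Lam @ Phi_prev + 0.5 * v_perp, 0.5 * np.linalg.norm(v_perp)
```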
Finally, we obtain the Cigar rules for the CGCC model.

**Definition 1 (Cigar Rules).** For the CGCC problem, suppose the optimal solution $\boldsymbol{\Phi}^*(\lambda')$ of problem (18) is given, where $0<\lambda'<\lambda_{\max}$. Let $r\sim(i,j)$ and let $\boldsymbol{\zeta}_r$ be the $r$th row of $\boldsymbol{\Lambda}^{-1}$. Then for $0<\lambda<\lambda'$, the edge $(i,j)$ can be eliminated under $\lambda$, i.e., $\mathbf{x}_i$ and $\mathbf{x}_j$ are (inexactly) clustered, denoted by $\boldsymbol{\alpha}_i^*\approx\boldsymbol{\alpha}_j^*$, if one of the following conditions holds:

$$\begin{cases} \|\boldsymbol{\zeta}_r\mathbf{o}(\lambda,\lambda')\|_\infty + R(\lambda,\lambda')\|\boldsymbol{\zeta}_r\|_2 < w_r, & \text{if } q = 1,\\ \|\boldsymbol{\zeta}_r\mathbf{o}(\lambda,\lambda')\|_2 + R(\lambda,\lambda')\|\boldsymbol{\zeta}_r\|_2 < w_r, & \text{if } q = 2,\\ \|\boldsymbol{\zeta}_r\mathbf{o}(\lambda,\lambda')\|_1 + R(\lambda,\lambda')\|\boldsymbol{\zeta}_r\|_2 < w_r, & \text{if } q = \infty; \end{cases} \qquad \text{(R3*)}$$

$$\boldsymbol{\alpha}_i^*\approx\boldsymbol{\alpha}_k^* \text{ and } \boldsymbol{\alpha}_j^*\approx\boldsymbol{\alpha}_k^* \text{ for some } k. \qquad \text{(R4*)}$$

Some remarks on the Cigar rules: (1) The Cigar rules are inexact, since problem (18) is not the exact dual of problem (1). When we use the Cigar rules, some edges may be mistakenly eliminated, and consequently some data points may be mis-clustered. (2) The parameter $\delta$ in problem (18) plays an important role, since $\lambda_{\max}$ depends on its value; we determine $\delta$ empirically, and the details are in the supplementary material. (3) When the number of rows of $\bar{\mathbf{C}}$, i.e., $m$, is large, directly computing $\bar{\mathbf{D}}^{-1}$, $\boldsymbol{\Lambda}$, and $\boldsymbol{\Lambda}^{-1}$ is computationally demanding. However, due to the specific form of $\bar{\mathbf{D}}$, there are efficient ways to compute these matrices, whose details are in the supplementary material.

Figure 1: The performance of the Eater rule on synthetic data (n=200). The first and second rows report the results on the halfmoon and spiral data, respectively.

## Experiments

In this section, we evaluate the Eater and Cigar rules on both synthetic and real-world datasets. For the TCC model, we use the tree structure generated by the MST algorithm, since MST clustering methods can deal with complex cluster structures. For the CGCC model, we add edges to the MST to construct cyclic graphs, on which the Cigar rule is evaluated. To measure the performance of the reduction rules, we define the following metrics: data reduction rate (DRR) $= 1 - n_c/n$; data fusion rate (DFR) $= 1 - n_c^*/n$; exact edge reduction rate (EERR) $= |\hat{E}_{re}|/|E_{re}^*|$; and mistaken edge-elimination rate (MEER) $= |\hat{E}_{re}\cap(E\setminus E_{re}^*)|/|\hat{E}_{re}|$, where $n_c$ is the number of distinct $\boldsymbol{\alpha}_i$'s (number of clusters) obtained by the Eater/Cigar rule, $n_c^*$ is the true number of distinct $\boldsymbol{\alpha}_i$'s detected by a solver, $\hat{E}_{re}$ is the set of edges removed by the Eater/Cigar rules, and $E_{re}^*$ is the true set of removed edges. The MEER is defined specifically for the Cigar rule, since the Eater rule is exact. Similar to (Hocking et al. 2011), the weight for a data pair $(i,j)$ is defined as $w_{ij} = \exp(-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|_2^2)$, and we use $\gamma = 10/\bar d^2$, where $\bar d$ is the average $\ell_2$ distance over all possible pairs of data points. In the experiments, $q$ is set to 2. We use the FISTA algorithm (Beck and Teboulle 2009) to solve problem (5) for the TCC model, and the ADMM method for the CGCC model. All experiments are conducted on a machine with an Intel i7 CPU and 8GB RAM under the Matlab 2013b environment.

### Experiments on Synthetic Data

We investigate two synthetic datasets widely used in the clustering literature (Grygorash, Zhou, and Jorgensen 2006; Wang, Wang, and Wilkes 2009), i.e., the halfmoon and spiral datasets. In each dataset, we generate $n = 200$ data points.
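As a small worked example of the experimental setup, here is a sketch of the Gaussian edge weights defined above, given the edge list of the MST (or any graph); the helper name is ours.

```python
import numpy as np

def edge_weights(X, edges):
    """w_ij = exp(-gamma * ||x_i - x_j||_2^2) with gamma = 10 / d_bar^2,
    where d_bar is the mean pairwise l2 distance (experiments section)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                 # all pairwise distances
    d_bar = dist[np.triu_indices(n, k=1)].mean()        # average over pairs
    gamma = 10.0 / d_bar ** 2
    return np.array([np.exp(-gamma * dist[i, j] ** 2) for i, j in edges])
```

For CGCC-$k$ below, one would append to the MST the $k(n-1)$ non-tree edges with the largest such weights.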
The Eater and Cigar rules are tested on a sequence of 500 regularization parameter values equally spaced between $\lambda_{\max}$ and $\lambda_{\max}/500$, and the clusterpath is generated simultaneously. In the following experiments, we use CGCC-$k$ to denote the CGCC model on a cyclic graph generated from the MST by adding the $k(n-1)$ additional edges with the largest weights. In our experiments, we find that all the TCC and CGCC models correctly detect the clusters in the data; the clustering performance of these models is reported in the supplementary material. In the following, we focus on evaluating the proposed rules.

Table 1: Running time (in seconds) on the synthetic data based on the Eater rule. E+S denotes the total time cost of using the Eater rule together with the solver.

| Data | Solver | Eater | E+S | Speedup |
|---|---|---|---|---|
| halfmoon (n=200) | 134.4 | 0.5 | 21.6 | 6.2 |
| spiral (n=200) | 176.4 | 0.6 | 31.3 | 5.6 |

Fig. 1 shows the performance of the Eater rule in terms of DRR, DFR, and EERR when $n$ equals 200. In Figs. 1(a) and 1(c), the closer the DRR curve is to the DFR curve, the better the Eater rule performs, because the region below the DRR curve denotes the fraction of data reduction the Eater rule achieves. Note that if the rule is exact, the DRR curve never exceeds the DFR curve. By comparing the DRR and DFR curves, we can see that a large proportion of the data points are clustered via the Eater rule. The blue regions in Figs. 1(b) and 1(d) represent the exact edge reduction rate: at least 40% of the edges can be eliminated for each $\lambda/\lambda_{\max}$, and the average fraction of eliminated edges is around 70%. These results verify that the Eater rule can detect data points from the same cluster without learning the TCC model. Table 1 shows the running time for generating the clusterpath with the TCC model, where the Eater rule achieves a 5-6 times speedup compared with solving the TCC problem directly.

Fig. 2 shows the performance of the Cigar rule in terms of DRR, DFR, EERR, and MEER when the size of the data is 200. Unlike the Eater rule, the Cigar rule may mis-cluster some data points, as analyzed previously. The red regions in Figs. 2(c), (f), (i), and (l) denote the fraction of mistakenly eliminated edges under different settings. From the results, we observe that the Cigar rule makes some mistakes when $\lambda$ is small; it also becomes less effective in terms of EERR in that regime. In contrast, when $\lambda/\lambda_{\max} > 0.3$, the Cigar rule correctly detects the clusters. Based on the DFR curve, we see that when $\lambda$ is small, the data fusion rate varies rapidly, implying that the clusterpath changes frequently; this is a possible reason why the Cigar rule becomes less effective for small $\lambda$. Table 2 records the running time of the different methods, where the Cigar rule achieves a 2-7 times speedup.
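The overall pipeline used above, i.e., sweeping a decreasing $\lambda$ grid, screening with the previous dual solution, then solving on the surviving edges, can be sketched as follows. Here `tcc_solver` is a hypothetical placeholder for any solver of problem (5) (e.g., FISTA), assumed to return the primal rows $\mathbf{B}$ and the dual $\boldsymbol{\Theta}$; the sketch reuses the Eater helpers from the earlier snippets and glosses over the merging of fused variables that a full implementation would perform.

```python
import numpy as np

def clusterpath_with_eater(X_t, D_t, w, tcc_solver, n_grid=500):
    """Sweep lambda from lambda_max down to lambda_max / n_grid (as in the
    synthetic experiments), screening edges with the Eater rule before each solve."""
    lam_max = eater_lambda_max(X_t, D_t, w)
    grid = np.linspace(lam_max, lam_max / n_grid, n_grid)
    Theta_prev, lam_prev = X_t / lam_max, lam_max      # analytic start (Theorem 1)
    path = []
    for lam in grid[1:]:
        o, R = eater_ball(X_t, lam, lam_prev, Theta_prev)
        fused = eater_screen(o, R, D_t, w)             # edges known to be fused
        keep = ~fused
        # Solve the reduced problem on the surviving edges only; a full
        # implementation would also merge the variables of fused edges.
        B, Theta = tcc_solver(X_t, D_t[:, keep], w[keep], lam)
        path.append((lam, fused, B))
        Theta_prev, lam_prev = Theta, lam
    return path
```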
### Iris Data & Vehicle Data

We test the Eater and Cigar rules on two UCI datasets, i.e., the iris and vehicle datasets.³ The iris data contains 150 data points, each with 4 attributes, while the vehicle data contains 846 data points, each with 18 attributes.

³https://archive.ics.uci.edu/ml/datasets.html

Figure 3: The performance of the Eater rule on the iris and vehicle data (x-axis: indices in the interval of λ/λ_max, on the linear and logarithmic scale, respectively). The first and second rows report the results on the iris and vehicle data, respectively.

On the iris data, we first test the Eater rule along a sequence of 360 values. The sequence consists of two parts: the first subsequence contains 60 values of $\lambda/\lambda_{\max}$ equally spaced on the logarithmic scale from $\frac{1}{1.1}$ to $\frac{1}{1.1^{60}}$, and the second contains 300 values equally spaced on the residual interval $(\frac{1}{1.1^{60}}, 0)$, since empirically we find that the clusterpath changes frequently when $\lambda/\lambda_{\max} < \frac{1}{1.1^{60}}$. For the CGCC model, we generate a sequence of $\lambda$ identical to that used for the synthetic data. On the vehicle data, we test the two rules along a sequence of 100 values of $\lambda/\lambda_{\max}$ equally spaced on the logarithmic scale from $\frac{1}{1.1}$ to $\frac{1}{1.1^{100}}$.

Fig. 3 shows the performance of the Eater rule. In Figs. 3(b) and 3(d), we observe that the Eater rule consistently achieves 80%-90% EERR, resulting in very effective data reduction, as shown in Figs. 3(a) and 3(c).

Table 2: Running time (in seconds) on the synthetic data based on the Cigar rule. C+S denotes the total time cost of using the Cigar rule together with the solver.

| Data | Model | Solver | Cigar | C+S | Speedup |
|---|---|---|---|---|---|
| Halfmoon (n=200) | CGCC-1 | 189.3 | 8.1 | 25.0 | 7.6 |
| Halfmoon (n=200) | CGCC-2 | 201.6 | 15.1 | 42.9 | 4.7 |
| Spiral (n=200) | CGCC-1 | 202.2 | 6.9 | 85.2 | 2.4 |
| Spiral (n=200) | CGCC-2 | 233.4 | 13.9 | 88.2 | 2.6 |

Table 3: Running time (in seconds) on the real-world datasets. R+S denotes the total time cost of using the reduction rules together with the solver.

| Data | Model | Solver | Rule | R+S | Speedup |
|---|---|---|---|---|---|
| Iris | TCC | 247.6 | 0.3 | 28.5 | 8.7 |
| Iris | CGCC-1 | 310.9 | 7.1 | 45.4 | 6.8 |
| Iris | CGCC-2 | 388.1 | 12.8 | 58.7 | 6.6 |
| Vehicle | TCC | 1399.5 | 2.6 | 168.7 | 8.3 |
| Vehicle | CGCC-1 | 2625.7 | 31.4 | 877.5 | 3.0 |
| Vehicle | CGCC-2 | 3473.7 | 44.4 | 1271.4 | 2.7 |

Fig. 4 shows the performance of the Cigar rule for the CGCC model. In Figs. 4(b), (e), (h), and (k), although the Cigar rule eliminates only a small fraction of the edges when $\lambda/\lambda_{\max} < 0.1$, it is almost exact, as shown in Figs. 4(c), (f), (i), and (l), implying that the Cigar rule is more effective on the real-world datasets than on the synthetic datasets. Table 3 shows the running time of the different methods on the two datasets. From the results, we can see that the proposed rules achieve around 2-9 times speedup. Moreover, similar to the results on the synthetic data, the Eater rule is much more efficient than the Cigar rule, but the computational cost of both rules is negligible compared with the time cost of the solver.
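For completeness, the mixed logarithmic/linear grid used on the iris data can be written down directly; the helper below is our reading of the description above (60 log-spaced ratios from $1/1.1$ to $1/1.1^{60}$, then 300 linearly spaced ratios on the remaining open interval), not the authors' code.

```python
import numpy as np

def iris_lambda_grid(lam_max, base=1.1, n_log=60, n_lin=300):
    """Ratios lambda / lambda_max: log-spaced head, linearly spaced tail."""
    log_part = base ** -np.arange(1, n_log + 1)                 # 1/1.1, ..., 1/1.1**60
    lin_part = np.linspace(log_part[-1], 0.0, n_lin + 2)[1:-1]  # open interval endpoints dropped
    return lam_max * np.concatenate([log_part, lin_part])
```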
Figure 4: The performance of the Cigar rule on the iris and vehicle data (x-axis: indices in the logarithmic-scale interval of λ/λ_max). The first and second rows report the performance on the iris and vehicle data, respectively. In each row, the first three figures show the results for the CGCC-1 model while the rest are for the CGCC-2 model.

## Conclusion and Future Work

In this paper, we developed novel reduction techniques for the graph-based convex clustering model, which eliminate a fraction of the edges in the graph before solving it. In future work, we are interested in finding exact rules for the CGCC model and in applying the reduction techniques to other applications with large-scale data.

## Acknowledgment

This work is supported by NSFC (61305071, 61473087) and the Natural Science Foundation of Jiangsu Province of China (BK20141340).

## References

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.

Chen, G. K.; Chi, E.; Ranola, J.; and Lange, K. 2014. Convex clustering: An attractive alternative to hierarchical clustering. arXiv preprint arXiv:1409.2065.

Ester, M.; Kriegel, H.-P.; Sander, J.; and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, volume 96, 226-231.

Grygorash, O.; Zhou, Y.; and Jorgensen, Z. 2006. Minimum spanning tree based clustering algorithms. In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence, 73-81.

Hocking, T. D.; Joulin, A.; Bach, F.; and Vert, J.-P. 2011. Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning.

Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2002. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, 849-856.

Wang, J., and Ye, J. 2014. Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. In Advances in Neural Information Processing Systems, 2132-2140.

Wang, J.; Zhou, J.; Wonka, P.; and Ye, J. 2013. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, 1070-1078.

Wang, J.; Zhou, J.; Wonka, P.; and Ye, J. 2014. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems.

Wang, X.; Wang, X.; and Wilkes, D. M. 2009. A divide-and-conquer approach for minimum spanning tree-based clustering.
IEEE Transactions on Knowledge and Data Engineering 21(7):945-958.

Wang, J.; Wonka, P.; and Ye, J. 2014. Scaling SVM and least absolute deviations via exact data reduction. In Proceedings of the International Conference on Machine Learning.

Zhao, Z., and Liu, J. 2014. Safe and efficient screening for sparse support vector machine. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining.