# Robust Bayesian Max-Margin Clustering

Changyou Chen, Jun Zhu, Xinhua Zhang

Dept. of Electrical and Computer Engineering, Duke University, Durham, NC, USA; State Key Lab of Intelligent Technology & Systems, Tsinghua National TNList Lab, Dept. of Computer Science & Tech., Tsinghua University, Beijing 100084, China; Australian National University (ANU) and National ICT Australia (NICTA), Canberra, Australia

cchangyou@gmail.com; dcszj@tsinghua.edu.cn; xinhua.zhang@anu.edu.au

## Abstract

We present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present the max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time clustering documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.

## 1 Introduction

Existing clustering methods fall roughly into two categories. Deterministic clustering directly optimises some loss function, while Bayesian clustering models the data-generating process and infers the clustering structure via Bayes' rule. Typical deterministic methods include the well-known k-means [1], nCut [2], support vector clustering [3], Bregman divergence clustering [4, 5], and methods built on the very effective max-margin principle [6–9]. Although these methods can flexibly incorporate constraints for better performance, it is challenging for them to finely capture hidden regularities in the data, e.g., automated inference of the number of clusters and of the hierarchies underlying the clusters. In contrast, Bayesian clustering provides favourable convenience in modelling latent structures, and their posterior distributions can be inferred in a principled fashion. For example, by placing a Dirichlet process (DP) prior on the mixing probabilities of a Gaussian mixture, Dirichlet process Gaussian mixture models (DPGMM) [10] can infer the number of clusters in a dataset. Other priors on latent structures include the hierarchical cluster structure [11–13], the co-clustering structure [14], etc. However, it is typically difficult for Bayesian clustering to accommodate external constraints such as the max-margin criterion, because under standard Bayesian inference, designing informative priors (if any exist) that satisfy such constraints is highly challenging. To address this issue, we propose Bayesian max-margin clustering (BMC), which allows max-margin constraints to be flexibly incorporated into a Bayesian clustering model. Distinct from traditional max-margin clustering, BMC is fully Bayesian and enables probabilistic inference of the number of clusters or the latent feature representations of the data.
Technically, BMC leverages the regularized Bayesian inference (RegBayes) principle [15], which has shown promise on supervised learning tasks such as classification [16, 17], link prediction [18], and matrix factorisation [19], where max-margin constraints are introduced to improve the discriminative power of a Bayesian model. However, little exploration has been devoted to the unsupervised setting, due in part to the absence of true labels, which makes it technically challenging to enforce max-margin constraints. BMC constitutes a first extension of RegBayes to the unsupervised clustering task. Note that, distinct from clustering models using the maximum entropy principle [20, 21] or posterior regularisation [22], BMC is more general owing to the intrinsic generality of RegBayes [15].

We demonstrate the flexibility and effectiveness of BMC with two concrete instantiations. The first is the Dirichlet process max-margin Gaussian mixture (DPMMGM), a nonparametric Bayesian clustering model that relaxes the Gaussian assumption underlying DPGMM by incorporating max-margin constraints, and is able to infer the number of clusters in the raw input space. To further discover latent feature representations, we propose the max-margin clustering topic model (MMCTM). As a topic model, it performs max-margin clustering of documents while at the same time learning the latent topic representation of each document. For both DPMMGM and MMCTM, we develop efficient MCMC algorithms by exploiting data augmentation techniques. This avoids imposing restrictive assumptions, such as those in variational Bayes, thereby facilitating inference of the true posterior. Extensive experiments demonstrate superior clustering performance of BMC over various competitors.

## 2 Regularized Bayesian Inference

We first briefly review the principle of regularised Bayesian inference (RegBayes) [15]. The motivation of RegBayes is to enrich the posterior of a probabilistic model by incorporating additional constraints, under an information-theoretical optimisation formulation. Formally, suppose a probabilistic model has latent variables $\Theta$, endowed with a prior $p(\Theta)$ (examples of $\Theta$ will become clear shortly). We also have observations $X := \{x_1, \ldots, x_n\}$, with $x_i \in \mathbb{R}^p$. Let $p(X|\Theta)$ be the likelihood. Then posterior inference via Bayes' theorem is equivalent to solving the following optimisation problem [15]:

$$\inf_{q(\Theta) \in \mathcal{P}} \ \mathrm{KL}\big(q(\Theta)\,\|\,p(\Theta)\big) - \mathbb{E}_{\Theta \sim q(\Theta)}\big[\log p(X|\Theta)\big], \qquad (1)$$

where $\mathcal{P}$ is the space of probability distributions (in theory we also require $q$ to be absolutely continuous with respect to $p$ so that the KL divergence is well defined; we treat this constraint as an implicit assumption for clarity), and $q(\Theta)$ is the desired posterior (here and afterwards we drop the dependency on $X$ for notational simplicity). In other words, the Bayesian posterior $p(\Theta|X)$ is identical to the optimal solution of (1). The power of RegBayes stems in part from the flexibility of engineering $\mathcal{P}$, which typically encodes constraints imposed on $q(\Theta)$, e.g., via expectations of some feature functions of $\Theta$ (and possibly the data $X$). Furthermore, the constraints can be parameterised by an auxiliary variable $\xi$. For example, $\xi$ may quantify the extent to which the constraints are violated, and it is then penalised in the objective through a function $U$. To summarise, RegBayes can be generally formulated as

$$\inf_{\xi,\, q(\Theta)} \ \mathrm{KL}\big(q(\Theta)\,\|\,p(\Theta)\big) - \mathbb{E}_{\Theta \sim q(\Theta)}\big[\log p(X|\Theta)\big] + U(\xi) \quad \text{s.t.} \quad q(\Theta) \in \mathcal{P}(\xi). \qquad (2)$$

To distinguish it from the standard Bayesian posterior, the optimal $q(\Theta)$ is called the post-data posterior. Under mild regularity conditions, RegBayes admits a generic representation theorem that characterises the solution $q(\Theta)$ [15].
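As a quick sanity check on formulation (1), the following NumPy sketch (purely illustrative and not part of the paper) verifies numerically that, when $\mathcal{P}$ is the whole probability simplex over a discrete $\Theta$, the minimiser of (1) is the usual Bayes posterior $p(\Theta|X) \propto p(\Theta)\,p(X|\Theta)$, and the optimal value is the negative log evidence.

```python
import numpy as np

# Toy discrete latent variable Theta with three states.
prior = np.array([0.5, 0.3, 0.2])        # p(Theta)
lik = np.array([0.10, 0.40, 0.05])       # p(X | Theta) for a fixed observation X

def regbayes_objective(q):
    """Objective (1) for discrete Theta: KL(q || prior) - E_q[log p(X|Theta)]."""
    return np.sum(q * (np.log(q) - np.log(prior))) - np.sum(q * np.log(lik))

bayes_posterior = prior * lik / np.sum(prior * lik)   # p(Theta | X)

# The Bayes posterior attains the minimum, which equals -log p(X).
assert np.isclose(regbayes_objective(bayes_posterior), -np.log(np.sum(prior * lik)))
rng = np.random.default_rng(0)
for _ in range(1000):                                  # random competitors never do better
    q = rng.dirichlet(np.ones(3))
    assert regbayes_objective(q) >= regbayes_objective(bayes_posterior) - 1e-9
```

Restricting $q$ to a constrained set $\mathcal{P}(\xi)$, as in (2), is precisely what moves the solution away from the plain Bayes posterior.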
RegBayes has also been shown to be more general than conventional Bayesian methods, including those that impose constraints on the prior. Such generality is essential for developing a Bayesian framework for max-margin clustering. Note that, as with many sophisticated Bayesian models, posterior inference remains a key challenge when developing novel RegBayes models. One of our key technical contributions is therefore the development of efficient and accurate inference algorithms for BMC, as detailed below.

## 3 Robust Bayesian Max-margin Clustering

For clustering, one key assumption of our model is that $X$ forms a latent cluster structure. In particular, let each cluster be associated with a latent projector $\eta_k \in \mathbb{R}^p$, which is included in $\Theta$ and whose prior distribution is subsumed in $p(\Theta)$. Given any distribution $q$ on $\Theta$, we define the compatibility score of $x_i$ with respect to cluster $k$ using the marginal distribution of $\eta_k$ (as $\eta_k \in \Theta$):

$$F_k(x_i) = \mathbb{E}_{q(\eta_k)}\big[\eta_k^\top x_i\big] = \mathbb{E}_{q(\Theta)}\big[\eta_k^\top x_i\big]. \qquad (3)$$

For each example $x_i$, we introduce a random variable $y_i$ taking values in $\mathbb{Z}^+$, which denotes its cluster assignment and is also included in $\Theta$. Inspired by the conventional multiclass SVM [7, 23], we use $\mathcal{P}(\xi)$ in RegBayes (2) to encode max-margin constraints based on $F_k(x_i)$, with the slack variables $\xi_i$ penalised via their sum in $U(\xi)$. This gives our Bayesian max-margin clustering (BMC):

$$\inf_{\xi_i \ge 0,\, q(\Theta)} \ \mathcal{L}(q(\Theta)) + 2c \sum_i \xi_i \quad \text{s.t.} \quad F_{y_i}(x_i) - F_k(x_i) \ge \ell\, \mathbb{I}(y_i \ne k) - \xi_i, \ \forall i, k, \qquad (4)$$

where $\mathcal{L}(q(\Theta)) = \mathrm{KL}(q(\Theta)\,\|\,p(\Theta)) - \mathbb{E}_{\Theta \sim q(\Theta)}[\log p(X|\Theta)]$ measures the KL divergence between $q$ and the original Bayesian posterior $p(\Theta|X)$ (up to a constant); $\mathbb{I}(\cdot) = 1$ if its argument holds true and $0$ otherwise; and $\ell > 0$ is a constant margin parameter. We found that the balance constraints commonly adopted in max-margin clustering models [6] are either unnecessary or unhelpful in our framework; we address this issue in the specific models. Clearly, by absorbing the slack variables $\xi$, the optimisation problem (4) is equivalent to

$$\inf_{q(\Theta)} \ \mathcal{L}(q(\Theta)) + 2c \sum_i \max\Big(0,\ \max_{k: k \ne y_i} \mathbb{E}_{\Theta \sim q(\Theta)}[\zeta_{ik}]\Big), \qquad (5)$$

where $\zeta_{ik} := \ell\, \mathbb{I}(y_i \ne k) - (\eta_{y_i} - \eta_k)^\top x_i$. An exact solution to (5) is hard to compute. An alternative is to approximate the posterior by assuming independence between random variables, e.g., variational inference; however, this is usually slow and susceptible to local optima. In order to obtain an analytic optimal distribution $q$ that facilitates efficient Bayesian inference, we resort to the Gibbs classifier technique [17], which approximates (in fact, upper bounds, by convexity of the max function) the second term in (5) with an expected hinge loss, i.e., by moving the expectation outside the max. This leads to our final formulation of BMC:

$$\inf_{q(\Theta)} \ \mathcal{L}(q(\Theta)) + 2c \sum_i \mathbb{E}_{\Theta \sim q(\Theta)}\Big[\max\Big(0,\ \max_{k: k \ne y_i} \zeta_{ik}\Big)\Big]. \qquad (6)$$

Problem (6) is still much more challenging than existing RegBayes models [17], which are restricted to supervised learning with only two classes. Specifically, BMC allows multiple clusters/classes in an unsupervised setting, and the latent cluster membership $y_i$ needs to be inferred. This complicates the model and brings challenges for posterior inference, as addressed below. In a nutshell, our inference algorithms rely on two key steps that exploit data augmentation techniques. First, in order to tackle the multi-class case, we introduce auxiliary variables $s_i := \arg\max_{k: k \ne y_i} \zeta_{ik}$.
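To make these quantities concrete, here is a small NumPy sketch (an illustration under our own conventions, not the authors' code) that computes $\zeta_{ik}$, the auxiliary indices $s_i$, and the per-example hinge term $\max(0, \zeta_{i s_i})$ whose posterior expectation appears in (6), for one posterior draw of $(\{y_i\}, \{\eta_k\})$.

```python
import numpy as np

def gibbs_hinge_terms(X, y, eta, ell=1.0):
    """For one posterior draw of (y, {eta_k}), return zeta_{ik}, s_i, and the
    per-example hinge max(0, zeta_{i s_i}) appearing inside the expectation in (6).

    X: (n, p) data; y: (n,) integer cluster labels in {0, ..., K-1}; eta: (K, p) projectors.
    """
    scores = X @ eta.T                                     # scores[i, k] = eta_k^T x_i
    n, K = scores.shape
    margin = ell * (np.arange(K)[None, :] != y[:, None])   # ell * I(y_i != k)
    zeta = margin - (scores[np.arange(n), y][:, None] - scores)
    zeta[np.arange(n), y] = -np.inf                        # exclude k = y_i from the max
    s = np.argmax(zeta, axis=1)                            # auxiliary variables s_i
    hinge = np.maximum(0.0, zeta[np.arange(n), s])         # max(0, zeta_{i s_i})
    return zeta, s, hinge
```

Averaging `hinge` over posterior draws of $(\{y_i\}, \{\eta_k\})$ gives a Monte Carlo estimate of the expected hinge loss term in (6).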
Applying standard derivations from the calculus of variations [24] and augmenting the model with $\{s_i\}$, we obtain an analytic form of the optimal solution to (6) on the augmented space (refer to Appendix A for details):

$$q(\Theta, \{s_i\}) \propto p(\Theta|X) \prod_i \exp\big(-2c \max(0, \zeta_{i s_i})\big). \qquad (7)$$

Second, since the max term in (7) hinders efficient sampling, we apply the augmentation technique introduced by [17], which shows that $q(\Theta, \{s_i\})$ is identical to the marginal distribution of the augmented post-data posterior

$$q(\Theta, \{s_i\}, \{\lambda_i\}) \propto p(\Theta|X) \prod_i \phi_i(\lambda_i \,|\, \Theta), \qquad (8)$$

where $\phi_i(\lambda_i|\Theta) := \lambda_i^{-\frac{1}{2}} \exp\big(-\tfrac{(\lambda_i + c\,\zeta_{i s_i})^2}{2\lambda_i}\big)$. Here $\lambda_i$ is an augmented variable for $x_i$ that follows a generalised inverse Gaussian distribution [25] given $\Theta$ and $x_i$. Note that our two steps of data augmentation are exact and incur no approximation. With the augmented variables $(\{s_i\}, \{\lambda_i\})$, we can develop efficient sampling algorithms for the augmented posterior $q(\Theta, \{s_i\}, \{\lambda_i\})$ without restrictive assumptions, thereby allowing us to approach the true target posterior $q(\Theta)$ by dropping the augmented variables. The details will become clear in the subsequent clustering models.

## 4 Dirichlet Process Max-margin Gaussian Mixture Models

In (4), we have left the prior $p(\Theta)$ and the likelihood $p(X|\Theta)$ unspecified. This section presents an instantiation of Bayesian nonparametric clustering for non-Gaussian data; the next section presents another instantiation for max-margin document clustering based on topic models.

Figure 1: Left: Graphical model of DPMMGM; the part excluding $\eta_k$ and $v$ corresponds to DPGMM. Right: Graphical model of MMCTM; the part excluding $\{\eta_k\}$ and the arrow between $y_i$ and $w_{il}$ corresponds to CTM.

Here a convenient model of $p(X, \Theta)$ is a mixture of Gaussians. Let the mean and variance of the $k$-th cluster component be $\mu_k$ and $\Lambda_k$. In the nonparametric setting, the number of clusters is allowed to be infinite, and the cluster $y_i$ that each data point belongs to is drawn from a Dirichlet process [10]. In summary, the latent variables are $\Theta = \{\mu_k, \Lambda_k, \eta_k\}_{k=1}^{\infty} \cup \{y_i\}_{i=1}^{n}$. The prior $p(\Theta)$ is specified as follows: $\mu_k$ and $\Lambda_k$ employ a standard Normal-inverse-Wishart prior [26],

$$\mu_k \sim \mathcal{N}\big(\mu_k;\, m, (r\Lambda_k)^{-1}\big), \quad \text{and} \quad \Lambda_k \sim \mathrm{IW}(\Lambda_k;\, S, \nu); \qquad (9)$$

$y_i \in \mathbb{Z}^+$ has a Dirichlet process prior with parameter $\alpha$; and $\eta_k$ follows a normal prior with mean $0$ and covariance $vI$, where $I$ is the identity matrix. The likelihood $p(x_i|\Theta)$ is $\mathcal{N}\big(x_i;\, \mu_{y_i}, (r\Lambda_{y_i})^{-1}\big)$, i.e., independent of $\eta_k$. The max-margin constraints take effect in the model via the $\phi_i$'s in (8). Note that this model of $p(\Theta, X)$, apart from $\eta_k$, is effectively the Dirichlet process Gaussian mixture model (DPGMM) [10]. We therefore call the post-data posterior $q(\Theta, \{s_i\}, \{\lambda_i\})$ in (8) the Dirichlet process max-margin Gaussian mixture model (DPMMGM). The hyperparameters include $m, r, S, \nu, \alpha, v$.

Interpretation as a generalised DP mixture. The form of the augmented post-data posterior in (8) reveals that, compared with DPGMM, each data point is associated with an additional factor $\phi_i(\lambda_i|\Theta)$. Thus we can interpret DPMMGM as a generalised DP mixture with a Normal-inverse-Wishart-Normal base distribution and a generalised pseudo-likelihood that is proportional to

$$f(x_i, \lambda_i \,|\, y_i, \mu_{y_i}, \Lambda_{y_i}, \{\eta_k\}) := \mathcal{N}\big(x_i;\, \mu_{y_i}, (r\Lambda_{y_i})^{-1}\big)\, \phi_i(\lambda_i \,|\, \Theta). \qquad (10)$$

In summary, DPMMGM employs the following generative process, with the graphical model shown in Fig. 1 (left):

$$(\mu_k, \Lambda_k, \eta_k) \sim \mathcal{N}\big(\mu_k;\, m, (r\Lambda_k)^{-1}\big)\, \mathrm{IW}(\Lambda_k;\, S, \nu)\, \mathcal{N}(\eta_k;\, 0, vI), \quad k = 1, 2, \ldots$$
$$w \sim \text{Stick-Breaking}(\alpha), \qquad y_i \,|\, w \sim \text{Discrete}(w), \quad i \in [n]$$
$$(x_i, \lambda_i) \,|\, y_i, \{\mu_k, \Lambda_k, \eta_k\} \sim f(x_i, \lambda_i \,|\, y_i, \mu_{y_i}, \Lambda_{y_i}, \{\eta_k\}), \quad i \in [n].$$

Here $[n] := \{1, \ldots, n\}$ is the set of integers up to $n$, and the last line means that $(x_i, \lambda_i)$ is generated from a distribution proportional to $f(\cdot)$. Since this normalisation constant is shared by all samples $x_i$, there is no need to deal with it during posterior inference.
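To illustrate how (8) and (10) are used during sampling, the sketch below (a hedged illustration under our own assumptions, not the authors' implementation) evaluates the log of the unnormalised pseudo-likelihood $f$ for a candidate assignment, following the densities as stated above, and resamples the augmented variable $\lambda_i$; for the latter we assume the standard result from [17] that the conditional is $\mathrm{GIG}(\tfrac12, 1, c^2\zeta_{is_i}^2)$, so that $\lambda_i^{-1}$ is inverse Gaussian with mean $1/(c|\zeta_{is_i}|)$ and shape $1$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_pseudo_likelihood(x_i, lam_i, mu_k, Lam_k, zeta_isi, r=1.0, c=0.1):
    """Log of the unnormalised pseudo-likelihood (10) for assigning x_i to cluster k:
    the Gaussian term N(x_i; mu_k, (r*Lam_k)^{-1}) plus log phi_i(lam_i | Theta).
    zeta_isi is the margin quantity zeta_{i s_i} implied by that assignment."""
    cov = np.linalg.inv(r * Lam_k)                 # covariance as stated in the text
    log_gauss = multivariate_normal.logpdf(x_i, mean=mu_k, cov=cov)
    log_phi = -0.5 * np.log(lam_i) - (lam_i + c * zeta_isi) ** 2 / (2.0 * lam_i)
    return log_gauss + log_phi

def resample_lambda(zeta_isi, c=0.1, rng=None):
    """Draw lambda_i from its conditional: lambda_i ~ GIG(1/2, 1, c^2 zeta^2),
    i.e. 1/lambda_i ~ InverseGaussian(1/(c|zeta|), 1) (assumed from [17]).
    Assumes zeta_isi != 0."""
    rng = rng or np.random.default_rng()
    inv_lam = rng.wald(1.0 / (c * abs(zeta_isi)), 1.0)   # inverse-Gaussian (Wald) draw
    return 1.0 / inv_lam
```

Evaluations of this kind, combined with standard DP mixture machinery, underlie the Gibbs updates for $y_i$ described next.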
Another benefit of this interpretation is that it allows us to use existing techniques for non-conjugate DP mixtures to sample the cluster indicators $y_i$ efficiently, and to infer the number of clusters in the data. This differs from previous work on RegBayes nonparametric models, where a truncated approximation is used to deal with the infinite-dimensional model space [15, 18]; in contrast, our method does not rely on any approximation. Note also that DPMMGM does not need the complicated class balance constraints [6], because the Gaussians in the pseudo-likelihood balance the clusters to some extent.

Posterior inference. Posterior inference for DPMMGM can be done by efficient Gibbs sampling. We integrate out the infinite-dimensional vector $w$, so the variables to be sampled are $\{\mu_k, \Lambda_k, \eta_k\}_k \cup \{y_i, s_i, \lambda_i\}_i$. The conditional distributions are derived in Appendix B. Note that we use an extension of the Reused Algorithm [27] to jointly sample $(y_i, s_i)$, which allows allocation to empty clusters in the Bayesian nonparametric setting. The time complexity is almost the same as for DPGMM, except for the additional step of sampling $\eta_k$, which costs $O(p^3)$. It may therefore be necessary to place the constraints on a subspace (e.g., obtained by projection) of the original feature space when $p$ is high.

## 5 Max-margin Clustering Topic Model

Although many applications exhibit clustering structure in the raw observed data, which can be effectively captured by DPMMGM, it is common for such regularities to be more salient in terms of high-level but latent features. For example, topic distributions are often more useful than word frequencies in the task of document clustering. Therefore, we develop a max-margin clustering topic model (MMCTM) in the BMC framework, which allows topic discovery to co-occur with document clustering in a Bayesian and max-margin fashion. To this end, latent Dirichlet allocation (LDA) [28] is extended by introducing a cluster label into the model and defining each cluster as a mixture of topic distributions. This cluster-based topic model (CTM) [29] can then be used in concert with BMC to enforce a large margin between clusters in the posterior $q(\Theta)$.

Let $V$ be the size of the word vocabulary, $T$ the number of topics, and $K$ the number of clusters, and let $\mathbf{1}_N$ denote the $N$-dimensional all-ones vector. The generative process of CTM for the documents is:

1. For each topic $t$, generate its word distribution $\phi_t$: $\phi_t \,|\, \beta \sim \mathrm{Dir}(\beta \mathbf{1}_V)$.
2. Draw a base topic distribution $\mu_0$: $\mu_0 \,|\, \alpha_0 \sim \mathrm{Dir}(\alpha_0 \mathbf{1}_T)$. Then, for each cluster $k$, generate its topic distribution mixture $\mu_k$: $\mu_k \,|\, \alpha_1, \mu_0 \sim \mathrm{Dir}(\alpha_1 \mu_0)$.
3. Draw a base cluster distribution $\gamma$: $\gamma \,|\, \omega \sim \mathrm{Dir}(\omega \mathbf{1}_K)$. Then, for each document $i \in [D]$: generate a cluster label $y_i$ and a topic distribution $\mu_i$, $y_i \,|\, \gamma \sim \mathrm{Discrete}(\gamma)$, $\mu_i \,|\, \alpha, \mu_{y_i} \sim \mathrm{Dir}(\alpha \mu_{y_i})$; then generate the observed words $w_{il}$: $z_{il} \sim \mathrm{Discrete}(\mu_i)$, $w_{il} \sim \mathrm{Discrete}(\phi_{z_{il}})$, for $l \in [N_i]$.

Fig. 1 (right) shows the structure. We then augment CTM with max-margin constraints and obtain the same form of posterior as in Eq. (7), with the variables $\Theta$ corresponding to $\{\phi_t\}_{t=1}^{T} \cup \{\eta_k, \mu_k\}_{k=1}^{K} \cup \{\mu_0, \gamma\} \cup \{\mu_i, y_i\}_{i=1}^{D} \cup \{z_{il}\}_{i=1, l=1}^{D, N_i}$.
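A compact simulation of the CTM generative process above (a toy sketch with made-up sizes; MMCTM's max-margin factor and the projectors $\eta_k$ are not shown) may help fix the notation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, K, D, N = 1000, 50, 8, 100, 200      # vocab size, topics, clusters, docs, words/doc (toy)
beta, alpha0, alpha1, alpha, omega = 0.01, 10.0, 10.0, 10.0, 0.1

phi = rng.dirichlet(beta * np.ones(V), size=T)                    # step 1: topic-word dists
mu0 = rng.dirichlet(alpha0 * np.ones(T))                          # step 2: base topic dist
mu_k = np.stack([rng.dirichlet(alpha1 * mu0) for _ in range(K)])  # cluster topic mixtures
gamma = rng.dirichlet(omega * np.ones(K))                         # step 3: base cluster dist

docs, y = [], np.empty(D, dtype=int)
for i in range(D):
    y[i] = rng.choice(K, p=gamma)                       # cluster label y_i
    mu_i = rng.dirichlet(alpha * mu_k[y[i]])            # document topic distribution
    z = rng.choice(T, size=N, p=mu_i)                   # topic assignments z_il
    docs.append([rng.choice(V, p=phi[t]) for t in z])   # observed words w_il
```

In MMCTM, the max-margin factor in (7) additionally couples $y_i$ to the empirical topic proportions of document $i$, as described next.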
Compared with the raw word space, which is normally extremely high-dimensional and sparse, it is more reasonable to characterise the clustering structure in a latent feature space, namely the empirical latent topic distributions, as in MedLDA [16]. Specifically, we summarise the topic distribution of document $i$ by $x_i \in \mathbb{R}^T$, whose $t$-th element is $\frac{1}{N_i} \sum_{l=1}^{N_i} \mathbb{I}(z_{il} = t)$. The compatibility score of document $i$ with respect to cluster $k$ is then defined, similarly to (3), as $F_k(x_i) = \mathbb{E}_{q(\Theta)}\big[\eta_k^\top x_i\big]$. Note, however, that the expectation is also taken over $x_i$, since it is not observed.

Posterior inference. To achieve fast mixing, we integrate out $\{\phi_t\}_{t=1}^T \cup \{\mu_0, \gamma\} \cup \{\mu_k\}_{k=1}^K \cup \{\mu_i\}_{i=1}^D$ in the posterior, so that $\Theta = \{y_i\}_{i=1}^D \cup \{\eta_k\}_{k=1}^K \cup \{z_{il}\}_{i=1, l=1}^{D, N_i}$. The integration is straightforward by Dirichlet-multinomial conjugacy. The detailed form of the posterior and the conditional distributions are derived in Appendix C. Since MMCTM extends CTM with max-margin constraints, many of the sampling formulas are extensions of those in CTM [29], with an additional step for sampling $\eta_k$; sampling can thus be done fairly efficiently.

Dealing with vacuous solutions. Unlike in DPMMGM, the max-margin constraints in MMCTM do not interact with the observed words $w_{il}$ but with the latent topic representations $x_i$ (or $z_{il}$), which are themselves inferred from the model. This easily makes the latent representations collapse into a single cluster, a vacuous solution that plagues many other unsupervised learning methods as well. One remedy is to incorporate cluster balance constraints into the model [7]. However, this does not help in our Bayesian setting: apart from a significant increase in computational cost, MCMC often fails to converge in practice. (We observed that the cluster sizes kept bouncing across sampling iterations, probably due to the highly nonlinear mapping from the observed word space to the feature space of topic distributions, which makes the problem multi-modal, i.e., there are multiple optimal topic assignments in the post-data posterior (8); the balance constraints might also weaken the max-margin constraints too much.) Another solution is to morph the problem into a weakly semi-supervised setting, where we assign a few documents to each cluster according to their true labels (we refer to these documents as landmarks) and sample the rest as in the unsupervised setting above. These labelled examples can be viewed as introducing constraints that serve as an alternative to the balance constraints. Usually only a very small number of labelled documents is needed, barely increasing the cost of training and labelling. We focus on this setting in the experiments.

## 6 Experiments

### 6.1 Dirichlet Process Max-margin Gaussian Mixture

Figure 2: An illustration of DPGMM (top) and DPMMGM (bottom).

We first show the distinction between our DPMMGM and DPGMM by running both models on the non-Gaussian half-rings dataset [30]. There are a number of hyperparameters to be determined, e.g., $(\alpha, r, S, \nu, v, c, \ell)$; see Section 4. It turns out that the cluster structure is insensitive to $(\alpha, r, S, \nu)$, so we use a standard sampling method to update $\alpha$ [31], while $r$, $\nu$, and $S$ are sampled using Gamma, truncated Poisson, and inverse-Wishart priors respectively, as in [32]. We set $v = 0.01$, $c = 0.1$, $\ell = 5$ in this experiment. Note that the clustering structure is sensitive to the values of $c$ and $\ell$, which is studied below.
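For concreteness, the $\alpha$ update mentioned above is typically an auxiliary-variable move; the sketch below shows the classical Escobar and West (1995) version under an assumed $\mathrm{Gamma}(a, b)$ hyperprior. It is an illustrative stand-in rather than the authors' exact update, which follows [31].

```python
import numpy as np

def resample_alpha(alpha, K, n, a=1.0, b=1.0, rng=None):
    """One auxiliary-variable update of the DP concentration alpha, given K occupied
    clusters among n points and an assumed Gamma(a, b) hyperprior (Escobar & West, 1995)."""
    rng = rng or np.random.default_rng()
    eta = rng.beta(alpha + 1.0, n)                    # auxiliary variable eta ~ Beta(alpha+1, n)
    odds = (a + K - 1.0) / (n * (b - np.log(eta)))    # mixing odds of the two Gamma components
    pi_eta = odds / (1.0 + odds)
    shape = a + K if rng.random() < pi_eta else a + K - 1.0
    return rng.gamma(shape, 1.0 / (b - np.log(eta)))  # scale = 1 / rate, rate = b - log(eta)
```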
Empirically we find that DPMMGM converges much faster than DPGMM, with both converging well within 200 iterations (see Appendix D.4 for examples). In Fig. 2, the clustering structures clearly demonstrate that DPMMGM relaxes the Gaussian assumption on the data distribution and correctly finds the number of clusters based on the margin boundary, whereas DPGMM produces an overly fragmented partition of the data for the clustering task.

Parameter sensitivity. We next study the sensitivity to the hyperparameters $c$ and $\ell$, with the other hyperparameters sampled during inference as above. Intuitively, the impact of these parameters is as follows. $c$ controls the weight that the max-margin constraints place on the posterior. If there were no other constraint, the max-margin constraints would drive the data points to collapse into a single cluster; as a result, we expect that a larger weight $c$ results in fewer clusters. Similarly, increasing $\ell$ leads to a higher loss for any violation of the constraints, thus also driving the data points to collapse. To test these implications, we run DPMMGM on a 2-dimensional synthetic dataset with 15 clusters [33], varying $c$ and $\ell$ to study how the cluster structure changes with these parameter settings. As can be observed from Fig. 3, the results indeed follow our intuition, providing a means to control the cluster structure in applications.

Figure 3: Clustering structures with varied $\ell$ and $c$: (first row, panels a-e) fixed $\ell$ = 5e-1 with $c$ increasing over 5e-6, 5e-4, 5e-3, 5e-2, 5e-1; (second row, panels f-j) fixed $c$ = 5e-3 with $\ell$ increasing over 5e-4, 5e-2, 5e-1, 2, 5. Lines are the $\eta$'s. Clearly the number of clusters decreases as $c$ and $\ell$ grow.

Real datasets. As with other clustering models, we test DPMMGM on ten real datasets (of small to moderate size) from the UCI repository [34]; scaling up to large datasets is left as interesting future work. The first three columns of Table 1 list some statistics of these datasets (we used random subsets of the three large datasets Letter, MNIST, and Segmentation).

A heuristic approach for model selection. Model selection is generally hard for unsupervised clustering; most existing algorithms simply fix the hyperparameters without examining their impact on model performance [10, 35]. In DPMMGM, the hyperparameters $c$ and $\ell$ are critical to clustering quality since they control the number of clusters. Without training data in our setting, they cannot be set by cross validation; nor can they be estimated by Bayesian sampling, because they are not parameters of a proper Bayesian model. We therefore introduce a time-efficient heuristic for selecting appropriate values. Suppose the dataset is known to have $K$ clusters. Our heuristic goes as follows. First, initialise $c$ and $\ell$ to 0.1. Then, at each iteration, compare the inferred number of clusters with $K$. If it is larger than $K$ (otherwise we do the converse), choose $c$ or $\ell$ at random and increase its value by $u/n$, where $u$ is a uniform random variable in $[0, 1]$ and $n$ is the number of iterations so far. According to the parameter sensitivity study above, increasing $c$ or $\ell$ tends to decrease the number of clusters, and the model eventually stabilises due to the stochastic decrement of the step size $u/n$.
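The heuristic is simple enough to state in a few lines; the following sketch is our paraphrase of the description above (the `run_dpmmgm` callback, which runs inference and returns the inferred number of clusters, and the fixed iteration budget are illustrative assumptions).

```python
import numpy as np

def select_c_ell(run_dpmmgm, K_target, iters=100, rng=None):
    """Heuristic hyperparameter search: start at c = ell = 0.1 and nudge one of them
    by u/n per iteration, upwards if too many clusters are inferred, downwards otherwise."""
    rng = rng or np.random.default_rng()
    params = {"c": 0.1, "ell": 0.1}
    for n in range(1, iters + 1):
        K_hat = run_dpmmgm(params["c"], params["ell"])     # inferred number of clusters
        step = rng.random() / n                            # u / n, shrinking over iterations
        name = rng.choice(["c", "ell"])                    # pick c or ell at random
        if K_hat > K_target:
            params[name] += step                           # too many clusters -> increase
        elif K_hat < K_target:
            params[name] = max(params[name] - step, 1e-8)  # the converse -> decrease
    return params
```

The shrinking step size $u/n$ is what makes the search settle rather than oscillate.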
We denote the model learned with this heuristic as DPMMGM. In the case where the true number of clusters is unknown, we can still apply this strategy, except that the target number of clusters $K$ is first inferred by DPGMM; this variant is denoted DPMMGM*.

| Dataset | n | p | K | k-means | nCut | DPGMM | DPMMGM | DPMMGM* |
|---|---|---|---|---|---|---|---|---|
| Glass | 214 | 10 | 7 | 0.37 ± 0.04 | 0.22 ± 0.00 | 0.37 ± 0.05 | 0.46 ± 0.01 | 0.45 ± 0.01 |
| Half circle | 300 | 2 | 2 | 0.43 ± 0.00 | 1.00 ± 0.00 | 0.49 ± 0.02 | 0.67 ± 0.02 | 0.51 ± 0.07 |
| Iris | 150 | 4 | 3 | 0.72 ± 0.08 | 0.61 ± 0.00 | 0.73 ± 0.00 | 0.73 ± 0.00 | 0.73 ± 0.00 |
| Letter | 1000 | 16 | 10 | 0.33 ± 0.01 | 0.04 ± 0.00 | 0.19 ± 0.09 | 0.38 ± 0.04 | 0.23 ± 0.04 |
| MNIST | 1000 | 784 | 10 | 0.50 ± 0.01 | 0.38 ± 0.00 | 0.55 ± 0.03 | 0.56 ± 0.01 | 0.55 ± 0.02 |
| Satimage | 4435 | 36 | 6 | 0.57 ± 0.06 | 0.55 ± 0.00 | 0.21 ± 0.05 | 0.51 ± 0.01 | 0.30 ± 0.00 |
| Segmentation | 1000 | 19 | 7 | 0.52 ± 0.03 | 0.34 ± 0.00 | 0.23 ± 0.09 | 0.61 ± 0.05 | 0.52 ± 0.10 |
| Vehicle | 846 | 18 | 4 | 0.10 ± 0.00 | 0.14 ± 0.00 | 0.02 ± 0.02 | 0.14 ± 0.00 | 0.05 ± 0.00 |
| Vowel | 990 | 10 | 11 | 0.42 ± 0.01 | 0.44 ± 0.00 | 0.28 ± 0.03 | 0.39 ± 0.02 | 0.41 ± 0.02 |
| Wine | 178 | 13 | 3 | 0.84 ± 0.01 | 0.46 ± 0.00 | 0.56 ± 0.02 | 0.90 ± 0.02 | 0.59 ± 0.01 |

Table 1: Comparison of different methods on NMI scores (mean ± standard deviation over five runs). The data-property columns give the number of data points n, the dimensionality p, and the true number of clusters K.

Comparison. We measure the quality of the clustering results using the standard normalised mutual information (NMI) criterion [36]. We compare our DPMMGM with the well-established k-means, nCut, and DPGMM clustering methods. (We additionally compare with some existing max-margin clustering models in Appendix D.2 on two-cluster data, because their code only handles the two-cluster case; our method performs best there.) All experiments are repeated five times with random initialisation. The results are shown in Table 1. DPMMGM clearly outperforms the other models, achieving the best NMI scores. DPMMGM*, which is not informed of the true number of clusters, still obtains reasonably high NMI scores and outperforms DPGMM.

### 6.2 Max-margin Clustering Topic Model

Datasets. We test the MMCTM model on two document datasets: 20NEWS and Reuters-R8. For 20NEWS, we combine the training and test sets used in [16], which results in 20 categories/clusters with roughly balanced cluster sizes; it contains 18,772 documents in total with a vocabulary size of 61,188. The Reuters-R8 dataset is a subset of the Reuters-21578 dataset (downloaded from csmining.org/index.php/r52-and-r8-of-reuters-21578.html), with 8 categories and 7,674 documents in total. The category sizes are imbalanced: the smallest category has 51 documents and the largest has 2,292.

Comparison. We randomly choose $L \in \{5, 10, 15, 20, 25\}$ documents from each category as landmarks, and use 80% of the documents for training and the rest for testing. We set the number of topics $T$ to 50, and set the Dirichlet hyperparameters of Section 5 to $\omega = 0.1$, $\beta = 0.01$, $\alpha = \alpha_0 = \alpha_1 = 10$, as clustering quality is not sensitive to them. For the other hyperparameters related to the max-margin constraints, e.g., $v$ in the Gaussian prior for $\eta$, the balance parameter $c$, and the cost parameter $\ell$, instead of cross validation, which is computationally expensive and not helpful in our scenario with few labelled data, we simply set $v = 0.1$, $c = 9$, $\ell = 0.1$. This is found to be a good setting and is denoted as MMCTM. To test the robustness of this setting, we vary $c$ over $\{0.1, 0.2, 0.5, 0.7, 1, 3, 5, 7, 9, 15, 30, 50\}$ while keeping $v = \ell = 0.1$ ($\ell$ and $c$ play similar roles, so varying one is enough). We report the best performance over these settings, denoted MMCTM*, which can roughly be viewed as the optimal parameter setting. We compare MMCTM with state-of-the-art SVM and semi-supervised SVM (S3VM) models.
These are efficiently implemented in [37], and the related parameters are chosen by 5-fold cross validation. As in [16], raw word frequencies are used as their input features. We also compare MMCTM with a Bayesian baseline, the cluster-based topic model (CTM) [29], which is the building block of MMCTM without the max-margin constraints. Note that we do not compare with the standard MedLDA [16] because it is supervised. We measure performance by cluster accuracy, i.e., the proportion of correctly clustered documents. To accelerate MMCTM, we simply initialise it with CTM, and find that it converges surprisingly fast in terms of accuracy, usually within 30 iterations (see Appendix D.5).

| Dataset | L | CTM | SVM | S3VM | MMCTM | MMCTM* |
|---|---|---|---|---|---|---|
| 20NEWS | 5 | 17.22 ± 4.2 | 37.13 ± 2.9 | 39.36 ± 3.2 | 56.70 ± 1.9 | 57.86 ± 0.9 |
| 20NEWS | 10 | 24.50 ± 4.5 | 46.99 ± 2.4 | 47.91 ± 2.8 | 54.92 ± 1.6 | 56.56 ± 1.3 |
| 20NEWS | 15 | 22.76 ± 4.2 | 52.80 ± 1.2 | 52.49 ± 1.4 | 55.06 ± 2.7 | 57.80 ± 2.2 |
| 20NEWS | 20 | 26.07 ± 7.2 | 56.10 ± 1.5 | 54.44 ± 2.1 | 56.62 ± 2.2 | 59.70 ± 1.4 |
| 20NEWS | 25 | 27.20 ± 1.5 | 59.15 ± 1.4 | 57.45 ± 1.7 | 55.70 ± 2.4 | 61.92 ± 3.0 |
| Reuters-R8 | 5 | 41.27 ± 16.7 | 78.12 ± 1.1 | 78.51 ± 2.3 | 79.18 ± 4.1 | 80.86 ± 2.9 |
| Reuters-R8 | 10 | 42.63 ± 7.4 | 80.69 ± 1.2 | 79.15 ± 1.2 | 80.04 ± 5.3 | 83.48 ± 1.0 |
| Reuters-R8 | 15 | 39.67 ± 9.9 | 83.25 ± 1.7 | 81.87 ± 0.8 | 85.48 ± 2.1 | 86.86 ± 2.5 |
| Reuters-R8 | 20 | 58.24 ± 8.3 | 85.66 ± 1.0 | 73.95 ± 2.0 | 82.92 ± 1.7 | 83.82 ± 1.6 |
| Reuters-R8 | 25 | 51.93 ± 5.9 | 84.95 ± 0.1 | 82.39 ± 1.8 | 86.56 ± 2.5 | 88.12 ± 0.5 |

Table 2: Clustering accuracy (in %, mean ± standard deviation).

Figure 4: Accuracy vs. number of topics (training and test) on (a) the 20NEWS dataset and (b) the Reuters-R8 dataset.

Figure 5: 2-D t-SNE embedding on 20NEWS for MMCTM (left) and CTM (right). Best viewed in colour. See Appendix D.3 for the results on the Reuters-R8 dataset.

The accuracies are shown in Table 2, and we can see that MMCTM outperforms the other models (also see Appendix D.4), except for SVM when L = 20 on the Reuters-R8 dataset. In addition, MMCTM performs almost as well as the optimal parameter setting (MMCTM*).

Sensitivity to the number of topics. The above experiments simply set $T = 50$. To examine the effect of $T$, we vary $T$ from 10 to 100; the corresponding accuracies are plotted in Fig. 4 for the two datasets. In both cases, $T = 50$ appears to be a good choice.

Cluster embedding. Finally, we visualise the clustering results by embedding them into the 2-dimensional plane using t-SNE [38]. In Fig. 5, it can be observed that, compared to CTM, MMCTM generates well-separated clusters with a much larger margin between them.

## 7 Conclusions

We propose a robust Bayesian max-margin clustering framework that bridges the gap between max-margin learning and Bayesian clustering, allowing many Bayesian clustering algorithms to be directly equipped with the max-margin criterion. Posterior inference is done via two data augmentation techniques. Two models within the framework are proposed, for Bayesian nonparametric max-margin clustering and for topic-model-based document clustering. Experimental results show that our models significantly outperform existing methods, with competitive clustering accuracy.
## Acknowledgments

This work was supported by an Australia-China Science and Research Fund grant (ACSRF-06283) from the Department of Industry, Innovation, Climate Change, Science, Research and Tertiary Education of the Australian Government, the National Key Project for Basic Research of China (No. 2013CB329403), and NSF of China (Nos. 61322308, 61332007). NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and by the Australian Research Council through the ICT Centre of Excellence program.

## References

[1] J. MacQueen. Some methods of classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob., page 281, 1967.
[2] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):705–767, 2000.
[3] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. JMLR, 2:125–137, 2001.
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.
[5] H. Cheng, X. Zhang, and D. Schuurmans. Convex relaxations of Bregman divergence clustering. In UAI, 2013.
[6] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Max-margin clustering. In NIPS, 2005.
[7] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In ICML, 2008.
[8] Y. F. Li, I. W. Tsang, J. T. Kwok, and Z. H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.
[9] G. T. Zhou, T. Lan, A. Vahdat, and G. Mori. Latent maximum margin clustering. In NIPS, 2013.
[10] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.
[11] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.
[12] Y. W. Teh, H. Daumé III, and D. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.
[13] Y. Hu, J. Boyd-Graber, H. Daumé III, and Z. I. Ying. Binary to bushy: Bayesian hierarchical clustering with the Beta coalescent. In NIPS, 2013.
[14] P. Wang, K. B. Laskey, C. Domeniconi, and M. I. Jordan. Nonparametric Bayesian co-clustering ensembles. In SDM, 2011.
[15] J. Zhu, N. Chen, and E. P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 2014.
[16] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. JMLR, 13(8):2237–2278, 2012.
[17] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with fast sampling algorithms. In ICML, 2013.
[18] J. Zhu. Max-margin nonparametric latent feature models for link prediction. In ICML, 2012.
[19] M. Xu, J. Zhu, and B. Zhang. Fast max-margin matrix factorization with data augmentation. In ICML, 2013.
[20] L. Wang, X. Li, Z. Tu, and J. Jia. Discriminative clustering via generative feature mapping. In AAAI, 2012.
[21] R. Gomes, A. Krause, and P. Perona. Discriminative clustering by regularized information maximization. In NIPS, 2010.
[22] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049, 2010.
[23] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[24] C. Fox. An Introduction to the Calculus of Variations. Courier Dover Publications, 1987.
[25] B. Jørgensen. Statistical Properties of the Generalized Inverse Gaussian Distribution. Lecture Notes in Statistics, 1982.
[26] K. P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, UCB, 2007.
[27] S. Favaro and Y. W. Teh. MCMC for normalized random measure mixture models. Stat. Sci., 2013.
[28] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[29] H. M. Wallach. Structured topic models for language. PhD thesis, University of Cambridge, 2008.
[30] A. Jain and M. Law. Data clustering: A user's dilemma. Lecture Notes in Comp. Sci., 3776:1–10, 2005.
[31] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101(476):1566–1581, 2006.
[32] M. Davy and J. Y. Tourneret. Generative supervised classification using Dirichlet process priors. TPAMI, 32(10):1781–1794, 2010.
[33] P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. PR, 39(5):761–765, 2006.
[34] K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[35] A. Shah and Z. Ghahramani. Determinantal clustering process: a nonparametric Bayesian approach to kernel based semi-supervised clustering. In UAI, 2013.
[36] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. JMLR, 11:2837–2854, 2010.
[37] V. Sindhwani, P. Niyogi, and M. Belkin. SVMlin: Fast linear SVM solvers for supervised and semi-supervised learning. In NIPS Workshop on Machine Learning Open Source Software, 2006. http://vikas.sindhwani.org/svmlin.html.
[38] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579–2605, 2008.