# Von Mises-Fisher Clustering Models

Siddharth Gopal (SGOPAL1@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213 USA
Yiming Yang (YIMING@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

This paper proposes a suite of models for clustering high-dimensional data on a unit sphere based on the von Mises-Fisher (vMF) distribution, with the goal of discovering more intuitive clusters than existing approaches. The proposed models include (a) a Bayesian formulation of the vMF mixture that enables information sharing among clusters, (b) a Hierarchical vMF mixture that provides multi-scale shrinkage and a tree-structured view of the data, and (c) a Temporal vMF mixture that captures the evolution of clusters in temporal data. For posterior inference, we develop fast variational methods as well as collapsed Gibbs sampling techniques for all three models. Our experiments on six datasets provide strong empirical support in favour of vMF-based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation.

1. Introduction

With the advent of large amounts of unlabeled data, clustering has emerged as an important tool for the end user to obtain a structured view of the data. Probabilistic clustering algorithms such as K-means (Gaussian mixtures), Multinomial Mixtures and Latent Dirichlet Allocation (Blei et al., 2003) have emerged as the de facto standard for discovering latent structures and relations in the data. Such probabilistic models define a generative model for the data by assuming a rigid instance representation; e.g., Multinomial Mixtures assume that each instance is represented as discrete feature counts and is drawn from one of many Multinomial distributions. However, it is questionable whether such a representation of the data is appropriate for all domains. For example, in text mining (classification, retrieval, collaborative filtering, etc.), documents have typically been represented in the term frequency-inverse document frequency normalized form (Tf-Idf normalization) (Salton & McGill, 1986), where each document is represented as a point on a unit sphere using a combination of within-document frequencies and inverse corpus frequencies. Tf-Idf normalization has consistently shown better performance than the feature-counts representation on several supervised tasks such as classification (Joachims, 2002) and retrieval (Robertson, 2004). Similarly, in image modeling, unit-normalized spatial pyramid vectors are a common representation (Yang et al., 2009). Normalization is often an important step in data analysis because it removes the magnitude of the instances from the picture and places more importance on the directional distribution; in other words, we do not want unduly long documents or big images to distort our inferences, hence we represent the instances as relative contributions of individual features. Despite the practical successes of normalized data representations, they have not been well studied by Bayesian graphical models. Since the data lies on a unit-sphere manifold, popular clustering assumptions such as Gaussian (O'Hagan et al., 2004) or Multinomial (Blei et al., 2003; 2004; Blei & Lafferty, 2006a;b) are not appropriate.
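To make the unit-sphere representation concrete, the following is a minimal sketch (our own illustration in Python/NumPy, not code from the paper) of mapping a term-count matrix to Tf-Idf weighted, L2-normalized vectors so that each document lies on the unit sphere; the paper itself uses the "ltc" weighting variant (see Section 4.2.1), which differs in details.

```python
import numpy as np

def tfidf_unit_sphere(counts: np.ndarray) -> np.ndarray:
    """Map a (documents x terms) count matrix to Tf-Idf weighted,
    L2-normalized rows, i.e. points on the unit sphere."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)             # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))          # inverse document frequency
    tf = np.log1p(counts)                             # dampened term frequency
    w = tf * idf                                      # Tf-Idf weights
    norms = np.linalg.norm(w, axis=1, keepdims=True)  # L2 norm of each document
    return w / np.maximum(norms, 1e-12)               # rows now satisfy ||x_i|| = 1

# Example: three toy documents over a vocabulary of four terms.
X = tfidf_unit_sphere(np.array([[2, 0, 1, 0],
                                [0, 3, 0, 1],
                                [1, 1, 0, 0]], dtype=float))
assert np.allclose(np.linalg.norm(X, axis=1), 1.0)
```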
On one hand we have the empirical success of such normalized representations, and on the other hand we have a wide variety of graphical models that model data using a different representation. Can we get the best of both worlds? Can we develop models that are particularly suited to such unit-sphere manifolds but at the same time maintain the flexibility of other graphical models? In this paper, we propose a suite of models using von Mises-Fisher (vMF) distributions, which have been primarily used to model such directional data. vMF models have long been studied in the directional statistics community (Fisher, 1953; Jupp & Mardia, 1989) and are naturally suited to such scenarios as they model distances between instances using the angle of separation, i.e. cosine similarity. Work on vMF models has typically focused on low-dimensional problems (2D or 3D spaces) to maintain tractability, relying on Gibbs sampling for inference, which is generally difficult to scale to higher dimensions (Mardia & El-Atoum, 1976; Guttorp & Lockhart, 1988; Bangert et al., 2010; Hasnat et al., 2013). More popular works include the text clustering work of (Banerjee et al., 2003; 2006), where an EM-based algorithm without Bayesian inference was used, and the spherical topic models of (Reisinger et al., 2010), which mimic LDA with Bayesian inference for learning the mean parameters of the vMF distributions but leave the crucial concentration parameters (the variance parts) of the models to be set manually.

In this paper, using the vMF distribution as the building block, we propose three increasingly powerful models for cluster analysis:

1. The Bayesian vMF Mixture Model (B-vMFmix): A fully Bayesian formulation of a mixture of vMF distributions, where each instance, represented as a point on a unit sphere, is assumed to be drawn from one of many vMF distributions. The parameters of the clusters are themselves drawn from a common prior, which helps the clusters share information with each other.

2. The Hierarchical vMF Mixture Model (H-vMFmix): When the data we want to analyze is huge, one of the most efficient means of browsing is by means of a hierarchy (Cutting et al., 1992). We extend B-vMFmix to H-vMFmix, which partitions the data into increasing levels of specificity as defined by a given input hierarchy. To our knowledge, this is the first hierarchical Bayesian model for vMF-based clustering.

3. The Temporal vMF Mixture Model (T-vMFmix): For temporal data streams, analyzing how the latent clusters in the data evolve over time is naturally desirable. We augment B-vMFmix into the first temporal vMF-based model, which accommodates changes in cluster parameters between adjacent time-points. For example, in a corpus of documents, this could reflect the changing vocabulary within a cluster of documents.

We develop fast variational inference schemes for all the methods and equally fast collapsed Gibbs sampling techniques for the simpler models. Our empirical comparison on several datasets conclusively establishes that vMF distributions are better alternatives to standard Gaussian and Multinomial mixtures and can be successfully used for cluster analysis. To our knowledge, this is the first work that provides a thorough treatment of the vMF distribution in the context of a wide variety of graphical models for data analysis.

2. Related Background

The von Mises-Fisher (vMF) distribution defines a probability density over points on a unit sphere.
It is parameterized by a mean parameter µ and a concentration parameter κ: the former defines the mean direction and the latter determines the spread of the probability mass around the mean. The density function for $x \in \mathbb{R}^D$, $\|x\| = 1$, $\|\mu\| = 1$, $\kappa > 0$ is given by

$$ f(x \mid \mu, \kappa) = C_D(\kappa)\, \exp(\kappa \mu^\top x), \qquad C_D(\kappa) = \frac{\kappa^{D/2 - 1}}{(2\pi)^{D/2}\, I_{D/2 - 1}(\kappa)} $$

where $I_\nu(a)$ is the modified Bessel function of the first kind with order ν and argument a. Note that $\mu^\top x$ is the cosine similarity between x and the mean µ, and that κ plays the role of an inverse variance.

The simplest vMF mixture model (Banerjee et al., 2006) assumes that each instance is drawn from one of K vMF distributions with a mixing distribution π, where K is a known constant, and the parameters of the vMF distributions correspond to the underlying themes (clusters) in the data. The cluster-assignment variable for instance $x_i$ is denoted by $z_i \in \{1, 2, \dots, K\}$ in the probabilistic generative model given below:

$$ z_i \sim \text{Categorical}(\cdot \mid \pi), \qquad x_i \sim \text{vMF}(\cdot \mid \mu_{z_i}, \kappa), \qquad i = 1, 2, \dots, N $$

The k-th cluster is defined by a mean parameter $\mu_k$ and concentration parameter κ. The parameters $P = \{\mu, \kappa, \pi\}$ are treated as fixed unknown constants and $Z = \{z_i\}_{i=1}^N$ are treated as latent variables. To train the model, we can use the familiar EM algorithm, iterating between calculating E[Z] in the E-step and optimizing P to maximize the likelihood in the M-step. The update equations are¹

$$ E[z_{ik}] = \frac{\pi_k\, \text{vMF}(x_i \mid \mu_k, \kappa)}{\sum_{j=1}^{K} \pi_j\, \text{vMF}(x_i \mid \mu_j, \kappa)}, \qquad R_k = \sum_{i=1}^{N} E[z_{ik}]\, x_i, $$
$$ \mu_k = \frac{R_k}{\|R_k\|}, \qquad \bar{r} = \frac{\sum_k \|R_k\|}{N}, \qquad \kappa = \frac{\bar{r}D - \bar{r}^3}{1 - \bar{r}^2} $$

[Footnote 1: The supplementary material thoroughly describes all the inference schemes (EM, variational, collapsed Gibbs sampling) as well as the complete set of results, only a part of which is presented in the paper.]

3. Proposed Bayesian vMF models

We improve upon the basic vMF mixture model of the previous section by adding Bayesian components in three significant ways: (a) a fully Bayesian vMF mixture model (B-vMFmix), (b) a hierarchical extension of B-vMFmix, and (c) a temporal extension of B-vMFmix.

3.1. Bayesian vMF mixtures (B-vMFmix)

The Bayesian vMF mixture model is the fundamental building block for the hierarchical and temporal vMF mixtures. This model enables sharing of information between the clusters by shrinking the cluster parameters towards each other using a prior. The generative model for given data $D = \{x_i\}_{i=1}^N$ and a fixed number of clusters K is

$$ \pi \sim \text{Dirichlet}(\cdot \mid \alpha) $$
$$ \mu_k \sim \text{vMF}(\cdot \mid \mu_0, C_0), \qquad \kappa_k \sim \text{logNormal}(\cdot \mid m, \sigma^2), \qquad k = 1, 2, \dots, K $$
$$ z_i \sim \text{Categorical}(\cdot \mid \pi), \qquad x_i \sim \text{vMF}(\cdot \mid \mu_{z_i}, \kappa_{z_i}), \qquad i = 1, 2, \dots, N $$

The prior parameters are $\{\alpha, \mu_0, C_0, m, \sigma^2\}$. The cluster mean parameters $\mu = \{\mu_k\}_{k=1}^K$ are drawn from a common prior vMF distribution with parameters $\{\mu_0, C_0\}$. The cluster concentration parameters $\kappa = \{\kappa_k\}_{k=1}^K$ are drawn from a common log-normal prior with mean m and variance σ². The mixing distribution π is drawn from a symmetric Dirichlet with parameter α. This Bayesian model improves over the simple vMF mixture in multiple ways: firstly, we share statistical strength by shrinking the cluster mean parameters towards a common µ₀; secondly, there is flexibility to learn cluster-specific concentration parameters $\kappa_k$ without the risk of overfitting if the priors are appropriately set; thirdly, the posterior distribution over the parameters gives a measure of uncertainty, unlike the point estimates in simple vMF mixtures. These advantages are evident in our experimental section (Section 4).
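As a concrete reference point, here is a minimal sketch (our own illustration, not code from the paper) of the vMF log-density with a numerically stable $\log C_D(\kappa)$, followed by ancestral sampling from the B-vMFmix generative model; it assumes SciPy ≥ 1.11, which provides `scipy.stats.vonmises_fisher`, and the prior settings below are arbitrary illustrative values.

```python
import numpy as np
from scipy.special import ive            # exponentially scaled Bessel function I_v(z) * exp(-z)
from scipy.stats import vonmises_fisher  # available in SciPy >= 1.11

def log_cd(kappa: float, d: int) -> float:
    """log C_D(kappa) = (D/2 - 1) log kappa - (D/2) log(2 pi) - log I_{D/2-1}(kappa),
    computed via the scaled Bessel function for numerical stability."""
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def log_vmf(x: np.ndarray, mu: np.ndarray, kappa: float) -> float:
    """log f(x | mu, kappa) for unit-norm x and mu."""
    return log_cd(kappa, x.shape[0]) + kappa * mu @ x

# Ancestral sampling from the B-vMFmix generative model.
rng = np.random.default_rng(0)
D, K, N = 50, 5, 200
alpha, m, sigma2, C0 = 1.0, np.log(100.0), 0.25, 10.0
mu0 = np.eye(D)[0]                                         # arbitrary prior mean direction

pi = rng.dirichlet(alpha * np.ones(K))                      # mixing distribution
mus = np.stack([vonmises_fisher(mu0, C0).rvs(1, random_state=rng).reshape(-1) for _ in range(K)])
kappas = np.exp(rng.normal(m, np.sqrt(sigma2), size=K))     # log-normal concentrations
z = rng.choice(K, size=N, p=pi)                             # cluster assignments
X = np.stack([vonmises_fisher(mus[k], kappas[k]).rvs(1, random_state=rng).reshape(-1) for k in z])
```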
The likelihood of the data and the posterior of the parameters are given by

$$ P(Z, \mu, \kappa, \pi \mid \cdot) \propto P(X \mid Z, \mu, \kappa, \pi)\, P(Z, \mu, \kappa, \pi \mid m, \sigma^2, \mu_0, C_0, \alpha) $$

Since the posterior distribution of the parameters cannot be calculated in closed form, we resort to approximate inference using variational methods or sampling techniques.

3.1.1. VARIATIONAL INFERENCE

Using variational inference, we try to find a distribution that is closest in KL-divergence to the true posterior. We assume the following factored form for the approximate posterior:

$$ q(\pi) \sim \text{Dirichlet}(\cdot \mid \rho), \qquad q(\mu_k) \sim \text{vMF}(\cdot \mid \psi_k, \gamma_k) \;\; k = 1, \dots, K, \qquad q(z_i) \sim \text{Categorical}(\cdot \mid \lambda_i) \;\; i = 1, \dots, N $$

For the concentration parameters, the inference is not straightforward because of the presence of the log-Bessel function. We present two different ways of estimating the posterior of the concentration parameters: (a) a Sampling scheme and (b) a Bounding scheme. Each scheme assumes a different form for the posterior distribution of the $\kappa_k$'s:

$$ q(\kappa_k) \sim \text{no specific form} \;\; [\text{Sampling}], \qquad q(\kappa_k) \sim \text{logNormal}(\cdot \mid a_k, b_k) \;\; [\text{Bounding}], \qquad k = 1, \dots, K $$

We develop a variational algorithm where the posterior parameters ρ, ψ, γ, λ are iteratively optimized to maximize the variational lower bound (VLB)¹:

$$ \text{VLB} = E_q[\log P(D, Z, \mu, \kappa, \pi \mid m, \sigma^2, \mu_0, C_0, \alpha)] + H(q) $$

The closed-form updates of the posterior parameters for $\mu_k$ and $z_{ik}$ are

$$ \psi_k = \frac{R_k}{\|R_k\|}, \qquad \gamma_k = \|R_k\|, \qquad \text{where } R_k = E_q[\kappa_k] \sum_{i=1}^{N} E_q[z_{ik}]\, x_i + C_0 \mu_0 $$
$$ \lambda_{ik} \propto \exp\big( E_q[\log \text{vMF}(x_i \mid \mu_k, \kappa_k)] + E_q[\log \pi_k] \big) $$
$$ E_q[\log \text{vMF}(x_i \mid \mu_k, \kappa_k)] = E_q[\log C_D(\kappa_k)] + E_q[\kappa_k]\, x_i^\top E_q[\mu_k] $$

Next we present the posterior estimation of the concentration parameters.

Sampling: In the sampling scheme, we estimate the $\kappa_k$'s (and related quantities such as $\log C_D(\kappa_k)$) by drawing samples from the posterior distribution. Sampling requires the computation of the conditional distribution for $\kappa_k$. However, variational inference (for the other parameters) does not maintain samples but only posterior distributions. To overcome this issue, we rewrite the conditional distribution for $\kappa_k$ in terms of the expectations of the posterior parameters rather than samples. Using the VLB and Jensen's inequality on the conditional of $\kappa_k$,

$$ P(\kappa_k \mid X, m, \sigma^2, \mu_0, C_0, \alpha) \propto P(\kappa_k, X \mid m, \sigma^2, \mu_0, C_0, \alpha) = E_q\big[ P(\kappa_k, X, Z, \mu, \kappa_{-k}, \pi \mid m, \sigma^2, \mu_0, C_0, \alpha) \big] $$
$$ \geq \exp\Big( E_q\big[ \log P(\kappa_k, X, Z, \mu, \kappa_{-k}, \pi \mid m, \sigma^2, \mu_0, C_0) \big] \Big) $$
$$ \propto \exp\Big( \sum_{i=1}^{N} E_q[z_{ik}] \log C_D(\kappa_k) + \kappa_k \sum_{i=1}^{N} E_q[z_{ik}]\, x_i^\top E_q[\mu_k] \Big)\, \text{logNormal}(\kappa_k \mid m, \sigma^2) \qquad (1) $$

Having identified the proportionality of the conditional distribution, we can use MCMC sampling with a suitable proposal distribution to draw samples. We used a log-normal distribution around the current iterate as the proposal distribution. The MCMC sampling step for the posterior of κ introduces flexibility into the model, as the samples are now from the true posterior (given the rest of the parameters). The downside is that the variational bound is no longer guaranteed to be a valid lower bound, since $E_q[\log C_D(\kappa_k)]$ and $E_q[\kappa_k]$ are estimated from the samples. However, we observed that the posterior distribution of the $\kappa_k$'s was highly concentrated around the mode, and the estimates from samples gave good empirical performance in terms of predictive power. This partial MCMC sampling does not increase computational costs much, since there are only K variables to be estimated and the major computational bottleneck lies in updating λ.
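The following is a minimal sketch (our own, under the notation above; the function and variable names are illustrative, not from the paper) of one Metropolis-Hastings update for a single $\kappa_k$ using the unnormalized conditional in Eq. (1) and a log-normal proposal centered at the current iterate.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, d):
    """log C_D(kappa) computed stably via the scaled Bessel function."""
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def unnorm_log_post_kappa(kappa, nk, rk, d, m, sigma2):
    """Unnormalized log of Eq. (1): nk = sum_i E_q[z_ik] (expected cluster size),
    rk = sum_i E_q[z_ik] x_i^T E_q[mu_k] (expected projected resultant length)."""
    log_prior = -np.log(kappa) - 0.5 * (np.log(kappa) - m) ** 2 / sigma2  # log-normal prior (up to const)
    return nk * log_cd(kappa, d) + kappa * rk + log_prior

def mh_step_kappa(kappa, nk, rk, d, m, sigma2, step=0.1, rng=np.random.default_rng()):
    """One MH step with a log-normal proposal around the current kappa.
    For this proposal, the q(old|new)/q(new|old) correction reduces to (new/old)."""
    prop = np.exp(np.log(kappa) + step * rng.standard_normal())
    log_accept = (unnorm_log_post_kappa(prop, nk, rk, d, m, sigma2)
                  - unnorm_log_post_kappa(kappa, nk, rk, d, m, sigma2)
                  + np.log(prop) - np.log(kappa))            # Hastings correction
    return prop if np.log(rng.uniform()) < log_accept else kappa
```

Within the variational loop, averaging repeated draws from `mh_step_kappa` gives Monte Carlo estimates of $E_q[\kappa_k]$ and $E_q[\log C_D(\kappa_k)]$, which is how the sampling scheme described above plugs into the remaining updates.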
Another alternative to repeated sampling is to use grid search, where we break down the continuous posterior distribution of $\kappa_k$ over a finite set of points and use this discrete distribution to estimate the expectations of the various quantities. Refer to the last part of the supplementary material for a thorough discussion of the computational issues. Note that the same model with an unnormalized prior for κ has been used in earlier work. However, since κ is not a natural parameter of the vMF distribution, it is not clear whether the unnormalized prior results in a valid posterior distribution.

Bounding: The core problem in doing full variational inference for this model is the computation of $E_q[\log C_D(\kappa_k)]$ in the VLB. We are not aware of any distribution $q(\kappa_k)$ for which there is a closed-form expression for $E_q[\log C_D(\kappa_k)]$ (due to the presence of the log-Bessel function in $C_D(\kappa_k)$). To overcome this issue, we first upper-bound the log-Bessel function using a Turan-type inequality (Baricz et al., 2011), followed by an approximation using the delta method. More specifically, the growth rate of the modified Bessel function of the first kind with order ν and argument u, i.e. $I_\nu(u)$, can be bounded by (Baricz et al., 2011)

$$ \frac{I'_\nu(u)}{I_\nu(u)} \leq \frac{\sqrt{u^2 + \nu^2}}{u} $$

Integrating over u > 0,

$$ \log I_\nu(u) \leq \sqrt{u^2 + \nu^2} + \nu \log(u) - \nu \log\big(\nu + \sqrt{u^2 + \nu^2}\big) + \text{const} $$
$$ E_q[\log I_\nu(u)] \leq E_q\big[\sqrt{u^2 + \nu^2}\big] + \nu\, E_q[\log(u)] - \nu\, E_q\big[\log\big(\nu + \sqrt{u^2 + \nu^2}\big)\big] + \text{const} \qquad (2) $$

Since all the expectations on the RHS of Eq. (2) are of twice-differentiable functions, we can use the delta approximation method: the expectation of a function g over a distribution q is approximately

$$ E_q[g(x)] \approx g(E_q[x]) + \tfrac{1}{2}\, g''(E_q[x])\, \text{Var}_q[x] \qquad (3) $$

Applying Eqs. (2) and (3) to the VLB, we can estimate the posterior parameters $a_k, b_k$ by optimizing the VLB using gradient descent¹. Although it is tempting to directly apply the delta method to calculate $E_q[\log I_\nu(u)]$, this leads to expressions that are not directly computable and are numerically unstable.

3.1.2. COLLAPSED GIBBS SAMPLING

We can develop efficient sampling techniques for the model by using the fact that vMF distributions are conjugate to each other. This enables us to completely integrate out $\{\mu_k\}_{k=1}^K$ and π, and to update the model by maintaining only the cluster assignment variables $\{z_i\}_{i=1}^N$ and the concentration parameters $\{\kappa_k\}_{k=1}^K$. The conditional distributions¹ are

$$ P(z_i = k \mid Z_{-i}, \cdot) \propto \Big(\alpha + \sum_{j \neq i} I(z_j = k)\Big)\, C_D(\kappa_k)\, \frac{C_D\big(\big\|\kappa_k \sum_{j \neq i,\, z_j = k} x_j + C_0 \mu_0\big\|\big)}{C_D\big(\big\|\kappa_k \sum_{j:\, z_j = k} x_j + C_0 \mu_0\big\|\big)} $$
$$ P(\kappa_k \mid \kappa_{-k}, \cdot) \propto \frac{C_D(C_0)\, C_D(\kappa_k)^{N_k}}{C_D\big(\big\|\kappa_k \sum_{j:\, z_j = k} x_j + C_0 \mu_0\big\|\big)}\, \text{logNormal}(\kappa_k \mid m, \sigma^2) $$

where $N_k = \sum_j I(z_j = k)$ and the sum in the denominator of the first expression includes instance i assigned to cluster k. The conditional distribution of $\kappa_k$ is again not of a standard form and, as before, we use a step of MCMC sampling (with a log-normal proposal distribution around the current iterate) to sample $\kappa_k$. The advantage of sampling is that the distribution of samples eventually converges to the true posterior; the downside is that it is not clear how many samples must be drawn.
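A minimal sketch (ours, with illustrative names) of the collapsed Gibbs update for a single assignment $z_i$, assuming per-cluster running sums $\kappa_k \sum_{j: z_j = k} x_j + C_0 \mu_0$ and counts are maintained; the stable `log_cd` helper from the earlier sketch is repeated here so the block is self-contained.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, d):
    nu = d / 2.0 - 1.0
    return nu * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def gibbs_resample_zi(i, z, X, kappas, counts, sums, alpha, rng):
    """Collapsed Gibbs update for z_i. `sums[k]` holds kappa_k * sum_{j: z_j=k} x_j + C0*mu_0
    and `counts[k]` the cluster size, both currently *including* instance i."""
    D, K = X.shape[1], len(kappas)
    xi, old = X[i], z[i]
    counts[old] -= 1
    sums[old] -= kappas[old] * xi                      # remove i from its current cluster
    logp = np.empty(K)
    for k in range(K):
        without_i = np.linalg.norm(sums[k])            # ||kappa_k sum_{j != i} x_j + C0 mu_0||
        with_i = np.linalg.norm(sums[k] + kappas[k] * xi)
        logp[k] = (np.log(alpha + counts[k]) + log_cd(kappas[k], D)
                   + log_cd(without_i, D) - log_cd(with_i, D))
    p = np.exp(logp - logp.max())                      # normalize in a numerically safe way
    new = rng.choice(K, p=p / p.sum())                 # sample the new assignment
    z[i] = new
    counts[new] += 1
    sums[new] += kappas[new] * xi                      # add i to its new cluster
    return z, counts, sums
```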
3.1.3. EMPIRICAL BAYES

When the user does not have enough information to set the prior parameters, it is useful to be able to learn them directly from the data. The prior parameters $\{\alpha, \mu_0, C_0, m, \sigma^2\}$ are estimated by maximizing the variational lower bound (which acts as a proxy for the true marginal likelihood). The details are discussed in the supplementary material.

3.2. Hierarchical vMF Mixtures (H-vMFmix)

Often the data that the user wants to analyze is large, and manually inspecting a flat layer of several clusters is harder than hierarchically browsing the data (Cutting et al., 1992). For such cases, we develop a hierarchical vMF mixture model that enables a hierarchically nested organization of the data. We assume that the user wants to organize the data into a given fixed hierarchy of nodes $\mathcal{N}$. The hierarchy is defined by the parent function $pa(\cdot) : \mathcal{N} \rightarrow \mathcal{N}$, which returns the parent of a node in the hierarchy. The generative model for H-vMFmix is

$$ \pi \sim \text{Dirichlet}(\cdot \mid \alpha) $$
$$ \mu_n \sim \text{vMF}(\cdot \mid \mu_{pa(n)}, \kappa_{pa(n)}), \qquad \kappa_n \sim \text{logNormal}(\cdot \mid m, \sigma^2), \qquad n \in \mathcal{N} $$
$$ z_i \sim \text{Categorical}(\cdot \mid \pi), \;\; z_i \in \{\text{leaf nodes of the hierarchy}\}, \qquad x_i \sim \text{vMF}(\cdot \mid \mu_{z_i}, \kappa_{z_i}), \qquad i = 1, 2, \dots, N $$

The user-specified parameters are $\{\mu_0, C_0\}$ - the parameters of the root node - and $\{m, \sigma^2, \alpha\}$. Each node n is equipped with a vMF distribution with parameters $\mu_n, \kappa_n$. The mean parameters of sibling nodes are drawn from a common prior defined by their parent's vMF distribution. The concentration parameters of all the nodes are commonly drawn from a log-normal distribution with mean m and variance σ². Each instance $x_i$ is drawn from one of the leaf nodes $z_i$ (note that the data X resides at the leaf nodes). One can also formulate slightly different models, e.g. by letting all sibling nodes share the same concentration parameter, or by forming a hierarchy of concentration parameters; we leave such models for future research. By drawing the parameters of siblings from a common parent node, we enforce the constraint that nodes which are closer to each other in the hierarchy share similar parameters. Our hope is that this enables the data to be organized into the appropriate levels of granularity as defined by the hierarchy. For inference, we can develop a similar variational inference method with an Empirical Bayes step to estimate the prior parameters. The details are presented in the supplementary material.

3.3. Temporal vMF mixtures (T-vMFmix)

Sometimes a data collection evolves over time. In such cases, it is useful to develop models that can capture the evolution of clusters over time. We present the Temporal vMF mixture (T-vMFmix), a state-space model based on the vMF distribution where the parameters of a cluster at a given time-point have evolved from the previous time-point. Given data across T time-points, $X = \{\{x_{t,i}\}_{i=1}^{N_t}\}_{t=1}^{T}$, and a fixed number of clusters K, the generative model is

$$ \pi \sim \text{Dirichlet}(\cdot \mid \alpha), \qquad \kappa_k \sim \text{logNormal}(\cdot \mid m, \sigma^2), \qquad k = 1, 2, \dots, K $$
$$ \mu_{1,k} \sim \text{vMF}(\cdot \mid \mu_{-1}, C_0), \qquad k = 1, 2, \dots, K $$
$$ \mu_{t,k} \sim \text{vMF}(\cdot \mid \mu_{t-1,k}, C_0), \qquad t = 2, \dots, T, \;\; k = 1, 2, \dots, K $$
$$ z_{t,i} \sim \text{Mult}(\cdot \mid \pi), \qquad x_{t,i} \sim \text{vMF}(\cdot \mid \mu_{t, z_{t,i}}, \kappa_{z_{t,i}}), \qquad t = 1, 2, \dots, T, \;\; i = 1, 2, \dots, N_t $$

The prior parameters are $\{\mu_{-1}, C_0, \alpha, m, \sigma^2\}$. The cluster-specific concentration parameters $\kappa_k$ are commonly drawn from a log-normal distribution with mean m and variance σ². The mean parameters of the clusters at time t are drawn from a vMF distribution centered at the corresponding means at the previous time t−1 with concentration C₀. This time-evolution of the cluster parameters introduces a layer of flexibility and enables T-vMFmix to accommodate changes in the mean parameter within a given cluster.
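To illustrate the temporal random walk on the sphere, here is a minimal generative sketch (ours, with arbitrary illustrative parameter values), again assuming SciPy ≥ 1.11 for `scipy.stats.vonmises_fisher`.

```python
import numpy as np
from scipy.stats import vonmises_fisher  # SciPy >= 1.11

rng = np.random.default_rng(1)
D, K, T = 50, 5, 10
C0 = 500.0                               # large C0 -> cluster means drift slowly over time
mu_prior = np.eye(D)[0]                  # prior mean direction for t = 1 (plays the role of mu_{-1})

# Cluster means evolve as a vMF random walk: mu_{t,k} ~ vMF(mu_{t-1,k}, C0).
means = np.zeros((T, K, D))
for k in range(K):
    means[0, k] = vonmises_fisher(mu_prior, C0).rvs(1, random_state=rng).reshape(-1)
    for t in range(1, T):
        means[t, k] = vonmises_fisher(means[t - 1, k], C0).rvs(1, random_state=rng).reshape(-1)

# Cosine similarity between a cluster's mean at the first and last time-point
# quantifies how far that cluster has drifted.
drift = means[0, 0] @ means[-1, 0]
print(f"cosine similarity between first and last mean of cluster 0: {drift:.3f}")
```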
The C₀ parameter controls the sharpness of the time-evolution: a large value of C₀ ensures that the cluster parameters stay more or less the same over time, whereas a low value of C₀ allows the cluster parameters to fluctuate wildly between adjacent time-points. Note that it is also possible to incorporate time-evolution of the mixing distribution (similar to (Blei & Lafferty, 2006a)):

$$ \eta_t \sim N(\cdot \mid \eta_{t-1}, \Sigma^{-1}), \qquad \alpha_{t,k} = \frac{\exp(\eta_{t,k})}{\sum_{j=1}^{K} \exp(\eta_{t,j})}, \qquad \pi_t \sim \text{Dirichlet}(\cdot \mid \alpha_t) $$

For inference, we develop a mean-field variational approach with an Empirical Bayes step for estimating the prior parameters¹. The prior parameter C₀ in some sense acts as a regularization term, forcing the parameters of the next time-step to be similar to those of the previous one. We recommend setting this parameter manually rather than learning it directly from the data¹.

4. Empirical Evaluation

Throughout our experiments we used several popular benchmark datasets (Table 1): TDT-{4,5} (Allan et al., 1998), CNAE², K9³, NEWS20⁴, and NIPS (Globerson et al., 2007). All datasets (except NIPS) have associated class labels and are single-labeled, i.e. each instance is assigned to exactly one class label. For TDT-{4,5}, we used only the subset of documents for which relevance judgements were available.

Table 1. Dataset Statistics

| Dataset | #Instances | #Training | #Testing | #True Clusters | #Features |
|---------|-----------:|----------:|---------:|---------------:|----------:|
| TDT4    | 622        | 311       | 311      | 34             | 8895      |
| TDT5    | 6366       | 3183      | 3183     | 126            | 20733     |
| CNAE    | 1079       | 539       | 540      | 9              | 1079      |
| K9      | 2340       | 1170      | 1170     | 20             | 21839     |
| NEWS20  | 18744      | 11269     | 7505     | 20             | 53975     |
| NIPS    | 2483       | 1241      | 1242     | -              | 14036     |

[Footnote 2: http://archive.ics.uci.edu/ml/datasets/CNAE-9. Footnote 3: http://www-users.cs.umn.edu/boley/ftp/PDDPdata/. Footnote 4: http://people.csail.mit.edu/jrennie/20Newsgroups/]

4.1. Metrics and Baselines

The first and most natural question that we would like to answer is: are vMF mixtures any better than standard Gaussian or Multinomial mixtures? Generally, comparing two clustering models is in itself a hard problem. Conclusive comparisons often involve detailed user studies (Boyd-Graber et al., 2009), which are time-consuming and not always feasible. Therefore, likelihood-based comparisons have been commonly used as an alternative (Blei & Lafferty, 2006a;b; Teh et al., 2006). Likelihood measures the predictive power of the model on unseen data, i.e. the generalization ability of the model - a metric widely used for model selection. However, since the supports of the models are different - vMF models are defined on the unit sphere, Multinomial models are defined over non-negative integer counts, etc. - we cannot use the likelihood on a held-out test set to compare vMF and non-vMF models (numerically, the likelihood of the vMF models is around 5 orders of magnitude larger than that of Multinomial or Gaussian mixtures). To address this issue, we compare vMF and non-vMF clustering models based on how well they recover the ground-truth clusters, where the human-assigned class labels are taken to be the true ground-truth clusters. We use six widely used evaluation metrics for this comparison: Normalized Mutual Information (NMI), Mutual Information (MI), Rand Index (RI), Adjusted Rand Index (ARI), Purity and Macro-F1 (ma-F1). The definitions of the metrics can be found in (Banerjee et al., 2006), (Steinley, 2004) and (Manning et al., 2008).
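For reference, a minimal sketch of the ground-truth-based evaluation (our own, assuming scikit-learn's metric implementations rather than the exact definitions cited above; the label arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Hypothetical ground-truth class labels and predicted cluster assignments
# (each instance is assigned to the cluster with the largest posterior probability).
labels_true = np.array([0, 0, 1, 1, 2, 2, 2])
labels_pred = np.array([1, 1, 0, 0, 2, 2, 0])

nmi = normalized_mutual_info_score(labels_true, labels_pred)
ari = adjusted_rand_score(labels_true, labels_pred)
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")
```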
We compare the following five clustering methods:

- B-vMFmix: Our proposed Bayesian vMF mixture model, which extracts flat clusters.
- vMFmix: The mixture of vMF distributions described in Section 2. Note that this is similar to the model developed in (Banerjee et al., 2006), except that all clusters share the same κ. Using the same κ for all clusters performed significantly better than allowing cluster-specific $\kappa_k$'s; this may be due to the absence of a Bayesian prior.
- K-means (KM): The standard k-means with Euclidean distance and hard assignments.
- Gaussian Mixtures (GM): A mixture of Gaussian distributions with the means commonly drawn from a Normal prior and a single variance parameter for all clusters; using cluster-specific variances performed worse (even with an Inverse-Gamma prior).
- Multinomial Mixture (MM): A graphical model where each instance is represented as a vector of feature counts drawn from a mixture of K Multinomial distributions.

We used the Tf-Idf normalized data representation for vMFmix, KM and GM, and the feature-counts representation (without normalization) for MM. For all methods, every instance is assigned to the cluster with the largest probability before evaluation. Also, for the sake of a fair comparison, we set the same random initial values of the cluster-assignment variables in all the algorithms; the results of each method are averaged over 10 different starting values.

4.2. Results

Table 2 summarizes the main results of B-vMFmix, vMFmix, KM, GM and MM in the ground-truth-based evaluation on five datasets. Due to lack of space we only include the results for 30 clusters and two metrics, NMI and ARI⁵. Note that no separate test set is needed for this type of evaluation. The parameters of all the probabilistic models are estimated using MAP inference. Among the five methods, B-vMFmix achieves the best performance on all the datasets. In fact, both B-vMFmix and vMFmix show consistently strong performance against the other methods, suggesting that the clusters generated by vMF models are indeed better aligned with the ground truth than those of the non-vMF methods.

[Footnote 5: The complete set of results for varying numbers of clusters and metrics is given in the supplementary material.]

Table 2. Comparison of vMF mixtures vs. other clustering models with 30 clusters using the NMI and ARI metrics¹. Each result is averaged over 10 different starting values for the algorithms. Bold face numbers indicate the best performing method. Significance tests against B-vMFmix were conducted at the 5% and 1% levels.

| Method | TDT4 NMI | TDT4 ARI | TDT5 NMI | TDT5 ARI | CNAE NMI | CNAE ARI | K9 NMI | K9 ARI | NEWS20 NMI | NEWS20 ARI |
|---|---|---|---|---|---|---|---|---|---|---|
| B-vMFmix | **.900** | **.799** | **.860** | **.710** | **.748** | **.669** | **.551** | **.352** | **.567** | **.397** |
| vMFmix | .880 | .729 | .851 | .676 | .650 | .426 | .543 | .350 | .565 | .386 |
| K-means | .842 | .675 | .827 | .591 | .605 | .311 | .487 | .264 | .492 | .217 |
| Gaussian Mixtures | .857 | .693 | .833 | .617 | .620 | .357 | .482 | .293 | .518 | .267 |
| Multinomial Mixtures | .720 | .467 | .796 | .580 | .726 | .528 | .487 | .321 | .454 | .200 |

To further validate our findings, we conducted two-sided significance tests using a paired t-test between every method and B-vMFmix for both metrics. The results of the 10 different runs are treated as observed samples. The null hypothesis is that there is no significant difference between the methods. In our experiments, almost all the observed differences are statistically significant.
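A minimal sketch of this significance test (ours; SciPy's `ttest_rel` is assumed, and the score arrays below are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel

# NMI scores of two methods over the same 10 random restarts (hypothetical values).
nmi_bvmf = np.array([.90, .89, .91, .90, .90, .89, .91, .90, .90, .89])
nmi_kmeans = np.array([.84, .85, .83, .84, .85, .84, .83, .84, .85, .84])

t_stat, p_value = ttest_rel(nmi_bvmf, nmi_kmeans)    # two-sided paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")        # reject H0 at the 5% level if p < 0.05
```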
Table 3. Average likelihood on the held-out test set for B-vMFmix and vMFmix with 20 and 30 clusters¹. Each result is averaged over 10 different starting values for the algorithms. Bold face numbers indicate the best performing method.

| #Clusters | Method | TDT4 | TDT5 | CNAE | K9 | NEWS20 | NIPS |
|---|---|---|---|---|---|---|---|
| 20 | B-vMFmix | **8628821.5** | **235115133.1** | **942719.5** | **91683824.8** | **1637772501.4** | **58811701.0** |
| 20 | vMFmix | 8387289.5 | 232542058.5 | 936675.9 | 91476580.9 | 1637630340.0 | 58671832.3 |
| 30 | B-vMFmix | **8638797.7** | **235145615.9** | **944888.3** | **91694902.0** | **1638015418.8** | **58825835.7** |
| 30 | vMFmix | 8300513.7 | 231946214.7 | 939730.9 | 91376189.0 | 1637866392.5 | 58600285.2 |

Table 3 compares the performance of the basic vMF mixture (vMFmix) and the fully Bayesian vMF mixture model (B-vMFmix) on six datasets. Since the supports of both models are the same, i.e. points on a unit sphere, we can directly compare them by looking at the predictive power on a held-out test set. B-vMFmix is trained through variational inference (using the sampling method for the concentration parameters) and vMFmix is trained through the EM algorithm. In our experiments, the sampling method performed better than the bounding method with an insignificant difference in running time; we therefore report all results using the sampling-based method. Due to lack of space we only include the results for K = 20 and 30 clusters (refer to the supplementary material for full results). The results show that B-vMFmix assigns a higher likelihood to unseen data on all six datasets.

Figure 2 compares the performance of the hierarchical vMF mixture model (H-vMFmix) with vMFmix and B-vMFmix. We test H-vMFmix on 5 different hierarchies with varying depth and branching factor (the branching factor is the number of children under each node). For example, (height h=3, branching factor b=4) is a hierarchy 3 levels deep where each internal node has 4 children. We also plot the performance of the corresponding vMFmix and B-vMFmix; the number of clusters was set equal to the number of leaf nodes in the hierarchy. Due to lack of space we plot the results for only two datasets - NEWS20 and CNAE (refer to the supplementary material for full results). The superior performance of H-vMFmix in terms of predictive power strongly supports the rationale for hierarchically shrinking the model parameters.

[Figure 2. Average likelihood of H-vMFmix, B-vMFmix and vMFmix for different input hierarchies. The number of clusters for B-vMFmix and vMFmix is set to the number of leaf nodes in the hierarchy - NEWS20 (left), CNAE (right). The five hierarchies have 27 (3,3), 64 (3,4), 81 (4,3), 100 (2,10) and 125 (3,5) leaf nodes, indexed as (height, branching factor).]

Finally, we compare the predictive power of T-vMFmix with vMFmix and B-vMFmix on the following datasets, which are temporal in nature:

1. NIPS: This dataset, outlined in Table 1, has 2483 research papers spread over a span of 17 years. The collection was divided into 17 time-points based on the year of publication of each paper.

2. NYTC-elections: This is a collection of New York Times news articles from the period 1996-2007 which have been tagged "elections".
The collection was divided into 11 time-points based on the year of news release.

We test the models by predicting the likelihood of the data in the next time-point given all the articles from the beginning up to the current time-point, similar to the setup used in (Blei & Lafferty, 2006a). For simplicity, we fix the number of clusters to 30. For ease of visualization, we plot the relative improvement in log-likelihood of the B-vMFmix and T-vMFmix models over the simple vMFmix model (this is because the log-likelihoods at adjacent time-points are not comparable and fluctuate wildly). The results are plotted in Figure 1 (the supplementary material contains the results for 2 additional datasets). The results suggest that T-vMFmix, by taking the temporal nature of the data into account, is always able to assign a higher likelihood to the next time-point than B-vMFmix and vMFmix. To visualize cluster evolution over time, Figure 3 shows the progress of the mean parameter for one of the clusters in NIPS. The six words with the largest weights in the mean parameter are shown over time.

[Figure 1. Relative improvement in likelihood of the T-vMFmix and B-vMFmix models over vMFmix across time (with 30 clusters) - NIPS dataset (left), NYTC-elections dataset (right).]

[Figure 3. The change in vocabulary over the years for one of the topics from the NIPS dataset: (1987) excitatory, inhibitor, activity, synaptic, firing, cells; (1991) activity, synaptic, firing, inhibitory, excitatory, neuron; (1994) activity, cortical, cortex, stimulus, neuronal, response; (1997) cortical, visual, neuronal, orientation, spatial, tuning; (2000) scene, spikes, spike, quantitative, visual, stimuli; (2003) scene, accessible, quantitative, dissimilarity, fold, distinguish.]

4.2.1. EFFECT OF TF-IDF NORMALIZATION

In order to fully understand the benefits of the Tf-Idf normalized data representation, we performed controlled experiments on the two largest datasets - NEWS20 and TDT5. We compared the performance of representing the data using plain Tf normalization against Tf-Idf normalization with ltc⁶-based term weighting. The results of experiments using vMFmix and K-means (as reported in Table 4) show that, on both datasets, representing the data using Tf-Idf normalization gives significant performance benefits.

Table 4. Comparison of different data representations using vMFmix and K-means with 30 clusters. Bold face numbers indicate the best performing data representation.

| Method | Representation | NEWS20 NMI | NEWS20 ARI | TDT5 NMI | TDT5 ARI |
|---|---|---|---|---|---|
| vMFmix | Tf-Idf normalization | **.565** | **.386** | **.851** | **.676** |
| vMFmix | Tf normalization | .084 | .027 | .827 | .657 |
| K-means | Tf-Idf normalization | **.492** | **.213** | **.827** | **.591** |
| K-means | Tf normalization | .089 | .0284 | .7993 | .551 |

[Footnote 6: http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html]

4.3. Experimental Settings

All our experiments were run on a 48-core AMD Opteron 6168 @ 1.92 GHz with 66GB RAM, with full parallelization wherever possible. The main computational bottleneck in all our variational inference algorithms is the computation of λ and µ. The updates for both of these parameters can be rewritten in terms of matrix products, which can be computed efficiently using state-of-the-art parallel matrix multiplication tools. Refer to Section 10 of the supplementary material for a thorough discussion of the computational issues.
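As an illustration of that matrix-product form (our own sketch, not the authors' code), the responsibilities λ for all instances and clusters can be computed with a single matrix multiplication:

```python
import numpy as np

def update_lambda(X, e_mu, e_kappa, e_log_cd, e_log_pi):
    """Vectorized update of the variational responsibilities lambda (N x K).

    X        : (N, D) unit-normalized instances
    e_mu     : (K, D) E_q[mu_k], the posterior expected mean vectors (proportional to psi_k)
    e_kappa  : (K,)   E_q[kappa_k]
    e_log_cd : (K,)   E_q[log C_D(kappa_k)]
    e_log_pi : (K,)   E_q[log pi_k]
    """
    # log lambda_ik is proportional to E_q[log pi_k] + E_q[log C_D(kappa_k)] + E_q[kappa_k] x_i^T E_q[mu_k];
    # the x_i^T E_q[mu_k] terms for all (i, k) pairs are a single matrix product X @ e_mu.T.
    log_lam = e_log_pi + e_log_cd + e_kappa * (X @ e_mu.T)
    log_lam -= log_lam.max(axis=1, keepdims=True)        # stabilize before exponentiating
    lam = np.exp(log_lam)
    return lam / lam.sum(axis=1, keepdims=True)          # normalize each row to sum to 1
```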
5. Conclusion

In this paper we proposed a suite of powerful Bayesian vMF models on unit-sphere manifolds as an alternative to the popular approaches based on Multinomial or Gaussian distributions. Our models enable full Bayesian inference with vMF distributions in flat, hierarchical and temporal clustering. Our fast variational and sampling algorithms make the methods scalable to reasonable data volumes with high-dimensional feature spaces. The experiments provide strong empirical support for the effectiveness of our approaches: all our models outperformed strong baselines (K-means, Multinomial Mixtures and Latent Dirichlet Allocation) by a large margin on most datasets. For future work, we plan to develop non-parametric versions of our vMF models as well as to handle multi-field information.

Acknowledgements

This work is supported in part by the National Science Foundation (NSF) under grant IIS 1216282. We thank the reviewers for their excellent feedback and Guy Blelloch for sharing his computational resources.

References

Allan, James, Carbonell, Jaime G., Doddington, George, Yamron, Jonathan, and Yang, Yiming. Topic detection and tracking pilot study final report. 1998.

Banerjee, Arindam, Dhillon, Inderjit, Ghosh, Joydeep, and Sra, Suvrit. Generative model-based clustering of directional data. In SIGKDD, pp. 19-28. ACM, 2003.

Banerjee, Arindam, Dhillon, Inderjit S., Ghosh, Joydeep, and Sra, Suvrit. Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 6(2):1345, 2006.

Bangert, Mark, Hennig, Philipp, and Oelfke, Uwe. Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy. In ICMLA, 2010.

Baricz, Árpád, Ponnusamy, Saminathan, and Vuorinen, Matti. Functional inequalities for modified Bessel functions. Expositiones Mathematicae, 29(4):399-414, 2011.

Blei, David M. and Lafferty, John D. Dynamic topic models. In ICML, 2006a.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. JMLR, 2003.

Blei, David M., Griffiths, T., Jordan, M., and Tenenbaum, J. Hierarchical topic models and the nested Chinese restaurant process. NIPS, 16:106-114, 2004.

Blei, D. M. and Lafferty, J. D. Correlated topic models. In NIPS, pp. 147-155. Citeseer, 2006b.

Boyd-Graber, Jonathan, Chang, Jordan, Gerrish, Sean, Wang, Chong, and Blei, David. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.

Cutting, Douglass R., Karger, David R., Pedersen, Jan O., and Tukey, John W. Scatter/Gather: A cluster-based approach to browsing large document collections. In SIGIR, 1992.

Fisher, Ronald. Dispersion on a sphere. Proceedings of the Royal Society of London, Series A: Mathematical and Physical Sciences, 1953.

Globerson, A., Chechik, G., Pereira, F., and Tishby, N. Euclidean embedding of co-occurrence data. JMLR, 8:2265-2295, 2007.

Guttorp, Peter and Lockhart, Richard A. Finding the location of a signal: A Bayesian analysis. Journal of the American Statistical Association, 83(402):322-330, 1988.

Hasnat, Md Abul, Alata, Olivier, and Trémeau, Alain. Hierarchical 3-D von Mises-Fisher mixture model. Workshop on Divergences and Divergence Learning, ICML, 2013.

Joachims, Thorsten. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.

Jupp, P. E. and Mardia, K. V. A unified view of the theory of directional statistics, 1975-1988. International Statistical Review / Revue Internationale de Statistique, 1989.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.

Mardia, K. V. and El-Atoum, S. A. M. Bayesian inference for the von Mises-Fisher distribution. Biometrika, 63(1):203-206, 1976.
O'Hagan, Anthony, Forster, Jonathan, and Kendall, Maurice George. Bayesian Inference. Arnold, London, 2004.

Reisinger, Joseph, Waters, Austin, Silverthorn, Bryan, and Mooney, Raymond. Spherical topic models. In ICML, 2010.

Robertson, Stephen. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60(5):503-520, 2004.

Salton, Gerard and McGill, Michael J. Introduction to Modern Information Retrieval. 1986.

Steinley, Douglas. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 2004.

Teh, Yee Whye, Jordan, Michael I., Beal, Matthew J., and Blei, David M. Hierarchical Dirichlet processes. JASA, 101(476), 2006.

Yang, Jianchao, Yu, Kai, Gong, Yihong, and Huang, Thomas. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pp. 1794-1801. IEEE, 2009.