Ordinal Mixed Membership Models

Seppo Virtanen S.VIRTANEN@WARWICK.AC.UK
Mark Girolami M.GIROLAMI@WARWICK.AC.UK
Department of Statistics, University of Warwick, CV4 7AL Coventry UK

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

We present a novel class of mixed membership models for joint distributions of groups of observations that co-occur with ordinal response variables for each group, for learning statistical associations between the ordinal response variables and the observation groups. The proposed class of models addresses a requirement for predictive and diagnostic methods in a wide range of practical contemporary applications. In this work, by way of illustration, we apply the models to a collection of consumer-generated reviews of mobile software applications, where each review contains unstructured text data accompanied by an ordinal rating, and demonstrate that the models infer useful and meaningful recurring patterns of consumer feedback. We also compare the developed models to relevant existing works, which rely on improper statistical assumptions for ordinal variables, showing significant improvements in both predictive ability and knowledge extraction.

1. Introduction

There exist large repositories of user-generated assessment, preference or review data consisting of free-form text data associated with ordinal variables for quality or preference. Examples include product reviews, user feedback, recommendation systems, expert assessments, clinical records, survey questionnaires, and economic or health status reports, to name a few. The ubiquitous need to statistically model the underlying processes and analyse such data collections presents significant methodological research challenges, necessitating the development of proper statistical models and inference approaches. In this work, our interest focuses on, but is not limited to, analysing reviews of mobile software applications provided by consumers. Such analysis is useful for both software developers and consumers, inferring and understanding themes or properties of mobile applications that consumers comment about. These themes may involve consumers' preferences and experiences regarding properties they (dis)appreciate, as well as feature requests or problem reports directed to the software developers.

Our work belongs to the field of mixed membership modelling, a powerful and important statistical modelling methodology. Observations are grouped and each group is modelled with a mixture model; mixture components are common to all groups, whereas mixture proportions are group-specific. The components are deemed to capture recurring patterns of observations, and each group is assumed to exhibit a subset of components. This class of models has been shown to extract interpretable, meaningful themes, also referred to as topics, from, for example, text data (Blei et al., 2003). These models, however, are not able to capture statistical associations between the groups and co-occurring quantitative information, that is, response variables, related to each group. Previous work on joint models utilising both the textual data and response variables (Blei & McAuliffe, 2007; Dai & Storkey, 2015; Lacoste-Julien et al., 2009; Nguyen et al., 2013; Ramage et al., 2009; Wang et al., 2009) has demonstrated the utility of joint modelling by inferring topics that are predictive of the response, leading to increased interpretability. However, these models lack proper statistical formulations suitable for ordinal response variables, and it is not at all straightforward to correct this shortcoming. In this work, we remove this hindrance by presenting a novel class of joint mixed membership models. The proposed class of models builds on our new statistical generative response model for ordinal variables.
In more detail, we introduce a certain stick-breaking formulation to parameterise the underlying data-generating probabilities over the ordinal variables. The response model contains group-specific latent scores as well as mean variables that transform the scores into ordinal variables using the developed construction. We compare the response model with existing alternatives for ordinal variables (Albert & Chib, 1993; Chu & Ghahramani, 2005) and show that our formulation provides favourable statistical properties. We present two different novel model formulations that couple the developed response model with mixed membership models. Specifically, the formulations hierarchically couple the latent scores of the response model with the mixing components of a mixed membership model, either via the mixture proportions or via the observation assignments, capturing associations between the components and responses. The first construction infers a correlation structure between (as well as within) the mixture proportions and latent scores based on the observed data, neither enforcing any correlation structure a priori nor specifying which of the components are associated with the responses. We derive a scalable variational Bayesian inference algorithm to approximate the model posterior distribution. The model is motivated by the unsupervised correlated topic models of Blei & Lafferty (2006) and Paisley et al. (2012). The second construction assumes the latent scores of the response model are given by a weighted linear combination of the mean assignments over each group, such that the component-specific combination weights a posteriori provide a means to inspect components that have predictive value. We present a Markov chain Monte Carlo (MCMC) sampling scheme for posterior inference. The model is related to supervised LDA (SLDA; Blei & McAuliffe, 2007); our model can be seen as an extension of SLDA to ordinal responses.
We demonstrate the developed models on a collection of reviews of mobile software applications. We compare the models to the relevant previous work and show that the proper ordinal response model is valuable for learning statistical associations between the responses and text data, providing significant improvements in terms of both predictive ability and knowledge extraction by inferring interpretable and useful themes of consumer feedback.

The paper is structured as follows. Section 2 presents the methodological contributions of this work: Section 2.1 presents our proposed generative model for ordinal variables, whereas Sections 2.2 and 2.3 present model formulations and inference approaches for joint mixed membership modelling of groups of observations and group-specific ordinal response variables. Related work is reviewed in Section 3. Section 4 describes the experiments and contains the results. Section 5 concludes the paper.

2. Joint Mixed Membership Models

The mth group of observations w^(m) is paired with an ordinal response variable y^(m). The response variables, also referred to as ratings, take values in R ∈ Z₊ ordered categories ranging between poor (1) and excellent (R). We note that for the simple case R = 2, y^(m) is binary and may be modelled by a Bernoulli distribution. The group w^(m) contains an unordered sequence of D^(m) words w_d^(m) over a V-dimensional vocabulary, w^(m) = {w_1^(m), w_2^(m), ..., w_{D^(m)}^(m)}.

2.1. Ordinal Response Variables

We assume y^(m) is drawn from a categorical distribution over R categories. The probability that y^(m) takes an integer value r ∈ {1, ..., R} is denoted by p(y^(m) = r). Since the categories are ordered, we propose a stick-breaking parameterisation for the probabilities; a unit-length stick is split into R smaller sticks that sum to one. We refer to these smaller sticks as stick weights v_r^(m) for the mth group and rth category.
We parameterise the v_r^(m) using a function σ(·) mapping its argument to a value between zero and one, introducing continuous-valued latent variables or scores t^(m) for each group as well as mean parameters μ_r for each category. The generative model for the y^(m) is

p(y^(m) = r) = v_r^(m) ∏_{r'=1}^{r−1} (1 − v_{r'}^(m)),    (1)
v_r^(m) = σ(t^(m) − μ_r).

Each v_r^(m) represents a binary decision boundary, specified by the mean variables, for the t^(m). The mean variables are ordered, that is, μ_1 < μ_2 < ... < μ_R, representing boundaries between the ordered categories. For computational simplicity, we use σ(x) = (1 + exp(−x))⁻¹, corresponding to a logit (or sigmoid) function, for which 1 − σ(x) = σ(−x). Alternative choices include probit, log-log or Cauchy functions, to name a few. The stick-breaking formulation guarantees that the probabilities p(y^(m) = r), for r = 1, ..., R, are positive and sum to one for any value of the t^(m). More importantly, the formulation leads to a simple posterior inference algorithm; the ordering of the mean variables is implicitly inferred based on the observed data without enforcing explicit constraints. For identifiability, we set, without loss of generality, v_R^(m) = 1. Figure 1 demonstrates the construction of probabilities based on the t^(m) for simulated mean variables μ.

Based on a collection of observed responses y^(m), where m = 1, ..., M, the model log likelihood is

L = Σ_m [ ln v_{y^(m)}^(m) + Σ_{r'=1}^{y^(m)−1} ln(1 − v_{r'}^(m)) ].    (2)

Point estimates for the latent scores as well as the mean variables may be inferred by maximising the log likelihood using unconstrained gradient-based optimisation techniques.

Figure 1. Visual demonstration of category probabilities. The x-axis denotes a range of values for the latent variable or score t^(m), whereas the vertical lines denote the category cut-off points, referred to as mean variables μ.
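As an illustration, the construction in Equations (1)–(2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code; the cut-off values and scores below are hypothetical toy inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_probs(t, mu):
    """Category probabilities from the stick-breaking construction (Eq. 1).

    t  : scalar latent score t^(m)
    mu : the R-1 mean variables entering the sigmoid; the last stick
         weight v_R is fixed to 1 for identifiability.
    """
    v = sigmoid(t - np.asarray(mu, dtype=float))   # stick weights v_r, r < R
    v = np.append(v, 1.0)                          # v_R = 1
    # p_r = v_r * prod_{r' < r} (1 - v_{r'}); the products telescope to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def log_likelihood(ts, ys, mu):
    """Model log likelihood (Eq. 2) for 1-based ratings ys and scores ts."""
    return sum(np.log(ordinal_probs(t, mu)[y - 1]) for t, y in zip(ts, ys))

p = ordinal_probs(0.5, [-2.0, -0.5, 1.0, 2.5])   # R = 5 categories
```

For any score t, the R probabilities are positive and sum to one, without any explicit normalisation, because the stick-breaking products telescope.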
In the following sections, we present two approaches for parameterising the latent scores, constructing statistical associations between the responses and groups. The main statistical interest focuses on the parameterisation, whereas the mean variables are relevant mainly for computing predictions. For this reason, in the following, we assign a uniform prior to the mean variables.

2.2. Joint Correlated Topic Model

In this section, we present a novel joint model (referred to as JTM) for the y^(m) and w^(m), where m = 1, ..., M. At the core of the model are group-specific latent variables u^(m) that are common to y^(m) and w^(m), capturing statistical associations between them. For the responses, we introduce a linear mapping or projection ξ and construct the data-generating latent score (Equation 1) as t^(m) = ξᵀ u^(m), the inner product between u^(m) and the mapping ξ. The generative process for the w^(m) (groups of observations), for m = 1, ..., M, is given by

w_d^(m) ~ Categorical(η_{c_d^(m)}),    (3)
c_d^(m) ~ Categorical(θ^(m)),

where η_k, for k = 1, ..., K, denotes the mixture components (topics), c_d^(m), for d = 1, ..., D^(m), denotes the observation assignments, and θ^(m) the mixture (topic) proportions over the K topics. We connect the θ^(m) to the latent variables u^(m) by introducing topic-specific mappings v_k and gamma-distributed variables z_k^(m) (parameterised suitably) such that a priori

E[θ_k^(m)] ∝ β_k exp(v_kᵀ u^(m)),    (4)

where β_k, for k = 1, ..., K, are positive concentration parameters. The latent mappings capture statistical associations between any two topics indexed by k and k'. If v_k and v_{k'} are similar, the topics η_k and η_{k'}, respectively, tend to co-occur, assuming that β_k and β_{k'} are sufficiently large. We use (normalised) gamma-distributed variables to construct the topic proportions, thus parameterising a mapping from the continuous latent variables to the discrete topic proportions. For simplified posterior inference we define β_k = β exp(m_k).
(5)

The process is

θ_k^(m) ∝ z_k^(m),    z_k^(m) ~ Gamma(β, exp(−v_kᵀ u^(m) − m_k)),

where β denotes the shape parameter and exp(−v_kᵀ u^(m) − m_k) the rate parameter of the gamma distribution, respectively. We see that E[θ_k^(m)] ∝ β exp(v_kᵀ u^(m) + m_k), as desired (4), using equation (5).¹ Figure 2 illustrates a graphical plate diagram of the model. We complete the model description by specifying distributions for the model hyper-parameters, the root nodes in Figure 2. We assign β ~ Gamma(α₀, β₀); u^(m) ~ Normal(0, I); ξ, v_k ~ Normal(0, l⁻¹I), where l denotes a precision (inverse variance) parameter of a zero-mean Gaussian distribution; η_k ~ Dirichlet(γ1), where γ is a concentration parameter of a Dirichlet distribution; and a non-informative prior for the m_k.

2.2.1. INTERPRETATION

Having specified the model, we highlight the role of the latent variables and the corresponding mappings for the responses and topics, ξ and v_k, where k = 1, ..., K, respectively. We may compute a measure of similarity between two vectors x_i and x_j by defining a function

l(x_i, x_j) = x_iᵀ x_j / √((x_iᵀ x_i)(x_jᵀ x_j))

that outputs a value between −1 and 1, indicating similarity or dissimilarity between the vectors. We may compute l(ξ, v_k), for k = 1, ..., K, and use the (dis)similarity

¹We note that for a gamma-distributed random variable x ~ Gamma(a, b), with density (bᵃ/Γ(a)) x^{a−1} exp(−bx), where Γ(·) denotes the gamma function, E[x] = a/b.

Figure 2. Graphical plate diagram of the joint correlated topic model. Unshaded nodes correspond to unobserved variables, whereas shaded nodes correspond to observed variables. Hyper-parameters for the root nodes, whose values need to be fixed prior to posterior inference, are omitted from the visualisation. Plates indicate replication over topics, groups and words. The hidden variables may be divided into local group-specific variables and global variables common to all groups.
That is, the unnormalised topic proportions z^(m), topic indicators c^(m), and latent variables u^(m) are defined for each group, whereas the set of topics η_k and the mappings from latent variables to the data domains, ξ and v_k, are common to all groups.

scores to infer whether the topics are positively or negatively associated with excellent or poor ratings. Next, we present a theoretical justification for the similarity measure. Marginalisation of the latent variables u is analytically tractable, leading to a joint Gaussian distribution for the t^(m) and auxiliary variables h_k^(m) (replacing the v_kᵀ u^(m)). The covariance matrix of the Gaussian distribution is

Σ = WWᵀ + I,    where Wᵀ = [ξ v_1 ... v_K].

We see that the similarity values defined above correspond to correlations between the response and topical mappings, respectively. We also note that the distribution is able to capture correlations between any two topics. Hence, we refer to this model as the joint correlated topic model.

2.2.2. REGULARISATION

During posterior inference the model infers statistical associations between the groups and responses. The inferred topics summarise recurring word co-occurrences over the corpus into interpretable themes, some of which may have significant associations with the ratings. However, for finite sample sizes the correlation structure may be weak. Accordingly, we introduce a user-defined parameter λ > 0 that balances for the limited sample size. We expect that when the sample size M increases for a fixed vocabulary size V, the role of λ diminishes, since there are more data to estimate the underlying correlation structure. The joint likelihood of the model is

p(D, Θ) = [ ∏_m ∏_d p(w_d^(m) | c_d^(m), η) p(c_d^(m) | z^(m)) ] [ ∏_m ∏_{k=1}^K p(z_k^(m) | u^(m), v_k, m_k, β) ] exp(λL) p(ξ) ∏_m p(u^(m)) ∏_k p(v_k) p(β),

where D = {w^(m), y^(m)}_{m=1}^M, Θ denotes the unknown quantities of the model and L is given in Equation (2). For λ < 1 the model focuses more on explaining the text.
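As a concrete sketch, one pass of the generative process in Equations (3)–(5) for a single group can be simulated as follows. All sizes and parameter values are toy assumptions, and note that NumPy's gamma sampler takes a scale parameter, the inverse of the rate used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, V, D = 4, 3, 20, 50                  # topics, latent dim, vocab, words (toy)
beta = 1.0                                 # gamma shape parameter
m = np.zeros(K)                            # offsets m_k
u = rng.normal(size=L)                     # u^(m) ~ Normal(0, I)
v_maps = rng.normal(size=(K, L))           # topic mappings v_k
eta = rng.dirichlet(0.1 * np.ones(V), K)   # topics eta_k

# z_k^(m) ~ Gamma(shape=beta, rate=exp(-v_k^T u - m_k)); scale = 1 / rate
rate = np.exp(-(v_maps @ u) - m)
z = rng.gamma(shape=beta, scale=1.0 / rate)
theta = z / z.sum()                        # normalised topic proportions theta^(m)
c = rng.choice(K, size=D, p=theta)         # topic assignments c_d^(m)
w = np.array([rng.choice(V, p=eta[k]) for k in c])   # words w_d^(m)
```

The normalisation step makes explicit why the z_k^(m) are referred to as unnormalised topic proportions.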
2.2.3. VARIATIONAL BAYESIAN INFERENCE

We present a variational Bayesian (VB) (Wainwright & Jordan, 2008) posterior inference algorithm for the model that scales well to large data collections and can readily be extended to stochastic online learning (Hoffman et al., 2013). We approximately marginalise over the topic assignments and proportions using non-trivial factorised distributions, whereas we use point distributions (estimates) for several variables to simplify computations, in essence adopting an empirical Bayes approach for these variables. The corresponding inference algorithm is able to prune irrelevant topics from the model based on the observed data. Full variational inference would be possible using techniques presented by Böhning (1992), Jaakkola & Jordan (1997) and Wang & Blei (2013), for example, lower bounding the analytically intractable log-sigmoid function appearing in the log likelihood (2). Alternatively, MCMC sampling strategies may provide appealing approaches for posterior inference. However, it is far from trivial to design suitable proposal distributions for the latent variables.

We introduce a factorised posterior approximation

q(Θ) = ∏_m ∏_d q(c_d^(m)) ∏_k q(z_k^(m)),

omitting the point distributions for clarity, and minimise the KL divergence between the factorisation q(Θ) and the posterior p(Θ|D). Equivalently, we maximise a lower bound on the model evidence with respect to the parameters of q(Θ),

ln p(D) ≥ L_VB = E[ln p(D, Θ)] − E[ln q(Θ)],

where the expectations are taken with respect to q(Θ). We choose the following distributions for the topic assignments and unnormalised topic proportions:

q(c_d^(m)) = Categorical(c_d^(m) | φ_d^(m)),
q(z_k^(m)) = Gamma(z_k^(m) | a_k^(m), b_k^(m)),

whose parameters are

φ_{w,k}^(m) ∝ η_{k,w} exp(E[ln z_k^(m)]),
a_k^(m) = β + Σ_{d=1}^{D^(m)} φ_{d,k}^(m),
b_k^(m) = exp(−v_kᵀ u^(m) − m_k) + D^(m) / Σ_{k'=1}^K E[z_{k'}^(m)].
In the derivations, we applied Jensen's inequality to lower bound the analytically intractable E[ln Σ_{k=1}^K z_k^(m)], needed for the normalisation of the z_k^(m), by introducing additional auxiliary parameters for each group. The expectations appearing above with respect to the variational factorisation are

E[ln z_k^(m)] = ψ(a_k^(m)) − ln b_k^(m),    E[z_k^(m)] = a_k^(m) / b_k^(m),

where ψ(·) denotes the digamma function. The lower bound of the model evidence, the cost function to maximise, with respect to the u^(m) is

L_VB^u = λL + Σ_{m,k} E[ln p(z_k^(m) | u^(m), v_k, m_k, β)] + ln p(u^(m)),

whereas for the v_k, m_k and β the cost function is

L_VB^{v,m,β} = Σ_{m,k} E[ln p(z_k^(m) | u^(m), v_k, m_k, β)] + ln p(v_k, m_k, β).

To infer the mapping ξ we maximise L_VB^ξ = L + ln p(ξ). Unconstrained gradient-based optimisation techniques may be used to infer point estimates for these unobserved quantities (optimising β in the log domain). Finally, the topics are updated as

η_{k,w} ∝ Σ_{m,d: w_d^(m)=w} φ_{d,k}^(m) + γ − 1.

2.3. Ordinal Supervised Topic Model

In this section, we propose a novel topic model for the ordinal responses and groups of observations. The model assumes a generative process for the words similar to that in Equation 3, introducing topic assignments c_d^(m) for the words w_d^(m), where d = 1, ..., D^(m), and topic proportions θ^(m) for the mth group. Here, the generative model for the ratings depends on the c_d^(m), where d = 1, ..., D^(m). In more detail, we define

c̄_k^(m) = (1/D^(m)) Σ_{j=1}^{D^(m)} I[c_j^(m) = k],

where I[·] denotes the indicator function, equalling 1 if the argument is true and zero otherwise, representing an empirical topic distribution for the mth group. We use this quantity to construct a linear mapping to the ratings. The model (see Figure 3 for an illustration of a graphical plate diagram) is

t^(m) = ξᵀ c̄^(m),
w_d^(m) ~ Categorical(η_{c_d^(m)}),
c_d^(m) ~ Categorical(θ^(m)),
θ^(m) ~ Dirichlet(α),
η_k ~ Dirichlet(γ1),
ξ_k ~ Normal(0, ζ).
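A minimal end-to-end simulation of this model for one group might look as follows, with the stick-breaking response of Section 2.1 supplying the rating distribution. All sizes, cut-off values and the symmetric α used here are toy assumptions (the paper's prior is asymmetric in general).

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, D, R = 5, 30, 40, 5                      # toy sizes (assumptions)
alpha = np.full(K, 0.5)                        # symmetric toy concentration
eta = rng.dirichlet(0.1 * np.ones(V), K)       # topics eta_k
xi = rng.normal(0.0, 1.0, size=K)              # response weights, zeta = 1
mu = np.array([-2.0, -1.0, 1.0, 2.0])          # hypothetical cut-offs for R = 5

theta = rng.dirichlet(alpha)                   # theta^(m) ~ Dirichlet(alpha)
c = rng.choice(K, size=D, p=theta)             # topic assignments c_d^(m)
w = np.array([rng.choice(V, p=eta[k]) for k in c])   # words
c_bar = np.bincount(c, minlength=K) / D        # empirical topic distribution
t = xi @ c_bar                                 # latent score t^(m) = xi^T c_bar

# stick-breaking rating distribution (Section 2.1), v_R fixed to 1
v = np.append(1.0 / (1.0 + np.exp(-(t - mu))), 1.0)
probs = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
y = rng.choice(np.arange(1, R + 1), p=probs)   # ordinal rating
```

The component-specific weights ξ_k are the quantities inspected a posteriori to judge which topics carry predictive value.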
Based on the observed data D, the model infers a set of topics that explain not only word co-occurrences but also the responses.

Figure 3. Graphical plate diagram for the ordinal supervised topic model. The topic proportions θ^(m) are group-specific and generated from an asymmetric Dirichlet distribution. The ordinal generative model for the ratings depends on the topic assignments c_d^(m), which specify the topical content (textual themes via topics η_k) of the mth group.

2.3.1. MCMC SAMPLING SCHEME

We present an MCMC sampling scheme for the model. We consecutively sample the topic assignments given the current value of ξ using collapsed Gibbs sampling, building on the work of Griffiths & Steyvers (2004), analytically marginalising out the topics as well as the topic proportions. Then, given the newly sampled assignments, we update the value of ξ as well as the concentration parameters α. The topic assignment probabilities are given by

p(c_d^(m) = k) ∝ (N_{w,k}^{¬c_d^(m)} + γ) / (N_k^{¬c_d^(m)} + Vγ) × (N_{k,d}^{¬c_d^(m)} + α_k) × p(y^(m) | {c_j^(m)}_{j=1, j≠d}^{D^(m)}, c_d^(m) = k),

where N_{w,k} denotes the number of times word w (here, w = w_d^(m)) is assigned to the kth topic, N_k = Σ_{w=1}^V N_{w,k}, and N_{k,d} denotes the number of tokens in document d assigned to the kth topic. The upper index ¬c_d^(m) means that the current count is excluded. The parameters of the response distribution are inferred by maximising L_ξ = L + ln p(ξ). The concentration parameters are updated recursively as

α_k ← α_k · [ Σ_{m=1}^M ψ(N_{k,m} + α_k) − M ψ(α_k) ] / [ Σ_{m=1}^M ln( Σ_j (N_{j,m} + α_j) − 1/2 ) − M ψ(Σ_j α_j) ],

building on Minka's fixed-point iteration (Minka, 2000). In the denominator, we approximate ψ(x) ≈ ln(x − 1/2), which is accurate when x > 1. This is the case here, since all w^(m), for m = 1, ..., M, contain at least one word token. The asymmetric Dirichlet prior enables pruning irrelevant topics based on the observed data (Wallach et al., 2009).
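The α update above can be sketched as a small fixed-point routine. This is a sketch of a Minka-style iteration with the stated ψ approximation in the denominator, not the authors' code, and the counts below are toy values.

```python
import numpy as np
from scipy.special import digamma

def update_alpha(N, alpha, iters=100):
    """Fixed-point update for Dirichlet concentrations alpha_k.

    N : (K, M) array, N[k, m] = tokens in document m assigned to topic k.
    Uses psi(x) ~ ln(x - 1/2) in the denominator, accurate for x > 1.
    """
    K, M = N.shape
    for _ in range(iters):
        # numerator: sum_m psi(N_{k,m} + alpha_k) - M psi(alpha_k), per topic k
        num = digamma(N + alpha[:, None]).sum(axis=1) - M * digamma(alpha)
        # denominator: sum_m ln(sum_j (N_{j,m} + alpha_j) - 1/2) - M psi(sum_j alpha_j)
        den = np.log(N.sum(axis=0) + alpha.sum() - 0.5).sum() - M * digamma(alpha.sum())
        alpha = alpha * num / den
    return alpha

N = np.array([[3, 1, 4], [2, 5, 1], [0, 2, 2]], dtype=float)   # toy counts
alpha = update_alpha(N, np.ones(3))
```

Because the update is multiplicative with a positive ratio, the concentrations stay positive, and topics that receive few assignments are driven towards small α_k, which is what permits pruning.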
We note that due to the recursive sampling of the topic assignments, the computational cost of inference may become considerable for large data sets. The recursive property carries over to a corresponding variational Bayesian treatment as well, since the topic assignments are dependent on each other.

3. Related Work

Previous works on statistical models for ordinal data (Albert & Chib, 1993; Chu & Ghahramani, 2005) assume

y^(m) = j if μ_{j−1} < z^(m) ≤ μ_j,    z^(m) ~ Normal(t^(m), 1),

where the z^(m), for m = 1, ..., M, denote Gaussian-distributed auxiliary variables. Marginalisation of the z^(m) leads to an ordinal probit model. The corresponding inference algorithm relies on truncated Gaussian distributions and takes into account explicit ordering constraints for the mean variables, leading to a complicated inference algorithm that is sensitive to initialisation and thus potentially prone to poor local optima.

The original supervised LDA model (SLDA; Blei & McAuliffe, 2007) uses canonical exponential family distributions for the response model. Under the canonical formulations the expectation of a response variable is E[y^(m)] = g(t^(m)), where g(·) denotes a link function specific to each member of the family. The most common members of this family include the Gaussian, Bernoulli and Poisson distributions, suitable for continuous-valued, binary and count variables, respectively. More importantly, however, the formulation does not support ordinal variables. Previous applications of SLDA by Blei & McAuliffe (2007), Dai & Storkey (2015) and Nguyen et al. (2013) to ordinal responses, such as product or movie reviews, have made a strong model mis-specification; they treat ordinal variables as continuous-valued. In this approach, the ordinal variables are represented as distinct values in the real domain with arbitrary user-defined intervals between them, enabling use of a Gaussian response model.
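For concreteness, the cut-point construction of Albert & Chib (1993) described at the start of this section can be sampled as follows; the cut-off values are hypothetical toy inputs.

```python
import numpy as np

def probit_ordinal_sample(t, mu, rng):
    """Draw y via the auxiliary-variable construction of Albert & Chib (1993):
    z ~ Normal(t, 1) and y = j if mu_{j-1} < z <= mu_j,
    with mu_0 = -inf and mu_R = +inf implied.
    mu holds the R-1 interior cut-offs; returns a 1-based category."""
    z = rng.normal(t, 1.0)
    return int(np.searchsorted(mu, z)) + 1

rng = np.random.default_rng(0)
ys = [probit_ordinal_sample(0.0, [-1.0, 1.0], rng) for _ in range(2000)]
```

With t = 0 and cut-offs at ±1, the middle category captures roughly 68% of the draws, reflecting the standard-normal mass between the cut-offs.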
The model is y^(m) ~ Normal(t^(m) + μ, τ⁻¹), where μ is a mean variable and τ is a precision (inverse variance) parameter. There are a number of statistical flaws in this approach, undermining interpretability. First, we note that the mean parameter of the Gaussian distribution may, in general, lead to results that make no sense in terms of the ordinal categories, especially for non-equidistant between-category intervals. Second, observed ratings take discrete values, but the predictions will not correspond to these values. Third, the Gaussian error assumption is not supported by discrete data.

Wang et al. (2009) present an important and non-trivial extension of SLDA to unordered, that is, nominal response variables, motivated by classification tasks. The nominal variables represent logically separate concepts that do not permit ordering. Ramage et al. (2009) and Lacoste-Julien et al. (2009) present alternative joint topic models, where functions of the nominal response variables (class information) affect the topic proportions. The response variables are not explicitly modelled using generative formulations. The approach of Mimno & McCallum (2008) uses a similar model formulation suitable for a wide range of observed response variables (or features, in general), performing linear regression from the responses, which are treated as covariates, to the concentration parameters of the Dirichlet distributions of the topic proportions. However, it is not obvious how to use these formulations for ordinal response variables.

4. Experiments and Results

We collect consumer-generated reviews of mobile software applications (apps) from Apple's App Store. The review data for each app contain an ordinal rating taking values in five categories ranging from poor to excellent, as well as free-flowing text data. We select the vocabulary using tf-idf scores.
After simple pre-processing, the data collection contains M = 5511 apps with vocabulary size V = 3995 and a total number of words Σ_{m=1}^M D^(m) ≈ 1.5 × 10⁶. The relatively small data collection is chosen to keep algorithm running times reasonable, especially for the sampling-based inference approaches.

4.1. Experimental Setting

We compare the joint correlated topic model (JTM; Section 2.2) and the ordinal supervised topic model (ordinal SLDA; Section 2.3) to SLDA with a Gaussian response model, as adopted in previous work by Blei & McAuliffe (2007), Dai & Storkey (2015) and Nguyen et al. (2013) (see Section 3 for more details), as well as to sparse ordinal and Gaussian linear regression models. For the Gaussian response models we represent the ratings as unit-spaced integers starting from one. The likelihood-specific parameters for the Gaussian model are the mean and precision. We adopt the inference procedure described in Section 2.3, using collapsed Gibbs sampling, also for Gaussian SLDA. For the regression models we infer a linear combination of the word counts and assign a sparsity-inducing prior distribution to the regression weights over the vocabulary in order to improve predictive ability. We maximise the corresponding joint log likelihood of the model for a fixed prior precision². For all the models that use a Gaussian response model, the mean variable is inferred by computing the empirical response mean. We initialise the models randomly.

For the joint correlated topic model (JTM) we bound the maximum number of active topics to K = 100, set the dimensionality of the latent variables to L = 30, α₀ = 1, β₀ = 10⁻⁶ and the prior precision to l = L. The results are shown for λ = 0.001, although λ ≤ 0.1 also provided good performance with little statistical variation. We terminated the algorithm (in both the training and testing phases) when the relative difference of the (corresponding) lower bound fell below 10⁻⁴.
The SLDA models were also computed for K = 100, and we used ζ = 1. We used 500 sweeps of sampling for inferring the topics and response parameters. For testing we used 500 sweeps of collapsed Gibbs sampling. Although we omit formal time comparisons due to the difficulties of comparing VB to MCMC approaches, we find that the sampling approach is roughly one order of magnitude slower. In general, determining convergence for MCMC approaches remains an open research problem, whereas VB provides a local bound on the model evidence. For all the topic models we used γ = 0.01. For JTM, this (effectively) equals a topic Dirichlet concentration parameter value of γ + 1, because the point estimate shifts the value by minus one. For the regression models we sidestep proper cross-validation of the prior precision and show results for the values providing the best performance, potentially leading to over-optimistic results.

4.2. Rating Prediction

We evaluate the models quantitatively in terms of predictive ability. Even though the developed joint mixed membership models are formulated primarily for exploring statistical associations between the ratings and text data, they can readily be used as predictive models. More specifically, we predict the ordinal rating based on the text. We partition the available data into multiple training and test sets using 10-fold cross-validation. For each model (and fold) we compute the test-set log likelihood (probability) of the ratings (the higher, the better) and use these values for comparison. Although various predictive criteria have been proposed, the selected measure is well motivated by statistical modelling.

²We use t^(m) = ξᵀ x^(m), where x^(m) denotes the word counts over the V-dimensional vocabulary, and p(ξ|ε) ∝ ∏_{d=1}^V exp(−ε ln(cosh(ξ_d))), where ε denotes a precision parameter of the prior distribution.
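The sparsity-inducing prior used for the regression baselines (footnote 2) corresponds to a smooth log-cosh penalty on the weights: since ln cosh(x) ≈ |x| − ln 2 for large |x|, it acts like a Laplace (lasso-type) prior while remaining differentiable at zero. A minimal sketch, with hypothetical values:

```python
import numpy as np

def log_prior(xi, eps):
    """log p(xi | eps) up to a constant: -eps * sum_d ln cosh(xi_d)."""
    return -eps * np.sum(np.log(np.cosh(xi)))

def penalised_objective(log_lik, xi, eps):
    """Joint log likelihood maximised for the regression baselines:
    the response log likelihood plus the sparsity-inducing log prior."""
    return log_lik + log_prior(xi, eps)

xi = np.array([0.0, 0.1, -3.0])       # hypothetical regression weights
penalty = log_prior(xi, eps=1.0)      # negative; heaviest on the large weight
```

The penalty is zero at ξ = 0 and grows roughly linearly in |ξ_d|, which drives most weights towards zero while leaving a few strongly predictive words with large weights.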
In the test phase, for JTM, we infer the latent variables u, the topic proportions (unnormalised gamma-distributed variables z_k^(m)) and the topic assignments c_d^(m), given the values of the remaining parameters inferred in the training phase. For the SLDA models the test phase corresponds to estimating the topic assignments using the standard LDA algorithm (with collapsed Gibbs sampling), with the topics fixed to those inferred from the training data. Finally, we compute the corresponding latent scores t^(m) for the models, obtaining the predictions.

Table 1 shows the test-set log likelihoods for the models. The ordinal linear regression model resulted in significantly better predictions than the Gaussian regression model (paired one-sided Wilcoxon; p < 10⁻³), showing that it is important to substitute a statistically poorly motivated Gaussian response distribution with a proper generative model. For both models the sparsity assumption improves predictive ability. For the ordinal regression model, the most relevant words predictive of low (poor) ratings include "waste" and "free", and those of high (excellent) ratings include "amazing" and "perfect". The model, however, falls short of providing in-depth interpretations, necessitating the use of topic models.

All the topic models perform substantially better than the regression models. The ordinal SLDA model provides the best predictive performance, JTM is the second best and Gaussian SLDA is the worst. All (pair-wise) comparisons are statistically significant (paired one-sided Wilcoxon; p < 0.005). We found that K = 100 is a sufficiently large threshold value for the number of topics; some of the inferred topics are inactive. This, together with the good predictive accuracy, establishes evidence that the developed models have captured the relevant statistical variation in the observed data. For JTM, we also performed a sensitivity analysis over the dimensionality of the latent variables L and found little statistical variation for 30 ≤ L ≤ 100 = K.
The test log likelihoods range between a minimum of −669.42 (9.68) for L = 80 and a maximum of −661.98 (11.73) for L = 50.

Next, we compared the inferred topics of the different models quantitatively using the semantic coherence measure proposed by Mimno et al. (2011) for quantifying topic trustworthiness. Table 2 shows the average topic coherences (the higher, the better). The topics inferred by JTM have significantly larger coherence (two-sample one-sided Wilcoxon, p < 0.0002).

Table 1. Rating prediction test-set log likelihoods for different methods. The table shows the mean and standard deviation computed over the 10 cross-validation folds.

model                 log likelihood
Ordinal SLDA          −638.53 (13.38)
JTM                   −667.79 (15.91)
Gaussian SLDA         −681.71 (17.69)
Ordinal regression    −704.30 (13.21)
Gaussian regression   −735.40 (14.70)

Table 2. Average semantic coherence values for the inferred topics of different models.

model           coherence
JTM             −52.64 (19.94)
Ordinal SLDA    −66.30 (26.43)
Gaussian SLDA   −67.84 (26.54)

4.3. Inspection of Inferred Topics

Finally, we visualise and interpret the topics inferred by the JTM model. Figures 4 and 5 visualise nine topics associated with high (excellent) and low (poor) ratings, respectively. As explained in Section 2.2.1, the associations (both sign and strength) are given by computing the similarity scores (that is, correlations).
[Figure 4 image omitted: nine topics rendered as lists of their top words.]

Figure 4. Visual illustration of topics associated with high ratings.

One of the topics associated with high ratings (Figure 4) captures word co-occurrence patterns containing adjectives with positive semantics. The remaining topics capture themes customers appreciate, such as games, health monitoring, calculations (for example, unit conversions), learning languages, social networking and education. One of the topics captures positive customer feedback about the app interface and design.
[Figure 5 image omitted: nine topics rendered as lists of their top words.]

Figure 5. Visual illustration of topics associated with low ratings.

The topics associated with low ratings (Figure 5) contain customers' negative experiences or feature requests, such as removal of ads, software updates and problems with functionality.

5.
Discussion

In this work, we develop a new class of ordinal mixed membership models suitable for capturing statistical associations between groups of observations and co-occurring ordinal response variables for each group. We depart from the existing dominant approach that relies on improper model assumptions for the ordinal response variables. We successfully demonstrate the developed models by analysing reviews of mobile software applications provided by consumers. The proposed class of models, as well as the inference approaches, is applicable to a wide range of present-day applications. In the future, we expect to see improvements in statistical inference, including fully Bayesian treatments and nonparametric Bayesian formulations. Stochastic online learning or model formulations for streaming data may be applied to scale the statistical inference to cope with current data repositories containing review data for a few million groups.

Acknowledgement

This research is supported by an EPSRC programme grant, A Population Approach to Ubicomp System Design (EP/J007617/1).

References

Albert, James H and Chib, Siddhartha. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669-679, 1993.

Blei, David and Lafferty, John. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.

Blei, David M and McAuliffe, Jon D. Supervised topic models. In Advances in Neural Information Processing Systems, 2007.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Böhning, Dankmar. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1):197-200, 1992.

Chu, Wei and Ghahramani, Zoubin. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005.

Dai, Andrew and Storkey, Amos J.
The supervised hierarchical Dirichlet process. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):243-255, 2015.

Griffiths, Thomas L and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Jaakkola, T and Jordan, Michael I. A variational approach to Bayesian logistic regression models and their extensions. In Artificial Intelligence and Statistics, 1997.

Lacoste-Julien, Simon, Sha, Fei, and Jordan, Michael I. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, 2009.

Mimno, David and McCallum, Andrew. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence, 2008.

Mimno, David, Wallach, Hanna M, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In Empirical Methods in Natural Language Processing, 2011.

Minka, Thomas. Estimating a Dirichlet distribution, 2000.

Nguyen, Viet-An, Boyd-Graber, Jordan L, and Resnik, Philip. Lexical and hierarchical topic regression. In Advances in Neural Information Processing Systems, 2013.

Paisley, John, Wang, Chong, Blei, David M, et al. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(2):235-272, 2012.

Ramage, Daniel, Hall, David, Nallapati, Ramesh, and Manning, Christopher D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing, 2009.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wallach, Hanna M, Mimno, David, and McCallum, Andrew.
Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, 2009.

Wang, Chong and Blei, David M. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005-1031, 2013.

Wang, Chong, Blei, David, and Li, Fei-Fei. Simultaneous image classification and annotation. In Computer Vision and Pattern Recognition, 2009.