Ordinal Mixed Membership Models

Seppo Virtanen S.VIRTANEN@WARWICK.AC.UK
Mark Girolami M.GIROLAMI@WARWICK.AC.UK
Department of Statistics, University of Warwick, CV4 7AL Coventry UK

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

We present a novel class of mixed membership models for joint distributions of groups of observations that co-occur with ordinal response variables for each group, for learning statistical associations between the ordinal response variables and the observation groups. The proposed class of models addresses a requirement for predictive and diagnostic methods in a wide range of practical contemporary applications. In this work, by way of illustration, we apply the models to a collection of consumer-generated reviews of mobile software applications, where each review contains unstructured text data accompanied by an ordinal rating, and demonstrate that the models infer useful and meaningful recurring patterns of consumer feedback. We also compare the developed models to relevant existing works, which rely on improper statistical assumptions for ordinal variables, showing significant improvements in both predictive ability and knowledge extraction.

1. Introduction

There exist large repositories of user-generated assessment, preference or review data consisting of free-form text data associated with ordinal variables for quality or preference. Examples include product reviews, user feedback, recommendation systems, expert assessments, clinical records, survey questionnaires, and economic or health status reports, to name a few. The ubiquitous need to statistically model the underlying processes and analyse such data collections presents significant methodological research challenges, necessitating the development of proper statistical models and inference approaches. In this work, our interest focuses on, but is not limited to, analysing reviews of mobile software applications provided by consumers. Such analysis is useful for both software developers and consumers, inferring and understanding themes or properties of mobile applications that consumers comment about. These themes may involve consumers' preferences and experiences regarding properties they (dis)appreciate, as well as feature requests or problem reports directed to the software developers.

Our work belongs to the field of mixed membership modelling, a powerful and important statistical modelling methodology. Observations are grouped and each group is modelled with a mixture model; mixture components are common to all groups, whereas mixture proportions are group-specific. The components are deemed to capture recurring patterns of observations, and each group is assumed to exhibit a subset of components. This class of models has been shown to extract interpretable, meaningful themes, also referred to as topics, from, for example, text data (Blei et al., 2003). These models, however, are not able to capture statistical associations between the groups and co-occurring quantitative information, that is, response variables, related to each group. Previous work on joint models utilising both the textual data and response variables (Blei & McAuliffe, 2007; Dai & Storkey, 2015; Lacoste-Julien et al., 2009; Nguyen et al., 2013; Ramage et al., 2009; Wang et al., 2009) has demonstrated the utility of joint modelling by inferring topics that are predictive of the response, leading to increased interpretability. However, these models lack proper statistical formulations suitable for ordinal response variables, and it is not at all straightforward to correct this shortcoming. In this work, we remove this hindrance by presenting a novel class of joint mixed membership models. The proposed class of models builds on our new statistical generative response model for ordinal variables.
In more detail, we introduce a certain stick-breaking formulation to parameterise the underlying data-generating probabilities over the ordinal variables. The response model contains group-specific latent scores as well as mean variables that transform the scores into ordinal variables using the developed construction. We compare the response model with existing alternatives for ordinal variables (Albert & Chib, 1993; Chu & Ghahramani, 2005) and show that our formulation provides favourable statistical properties. We present two different novel model formulations that couple the developed response model with mixed membership models. Specifically, the formulations hierarchically couple the latent scores of the response model with the mixing components of a mixed membership model, either via the mixture proportions or via the observation assignments, capturing associations between the components and responses. The first construction infers a correlation structure between (as well as within) the mixture proportions and latent scores based on the observed data, neither enforcing any correlation structure a priori nor specifying which of the components are associated with the responses. We derive a scalable variational Bayesian inference algorithm to approximate the model posterior distribution. The model is motivated by the unsupervised correlated topic models of Blei & Lafferty (2006) and Paisley et al. (2012). The second construction assumes the latent scores of the response model are given by a weighted linear combination of the mean assignments over each group, such that the component-specific combination weights a posteriori provide a means to inspect components that have predictive value. We present a Markov chain Monte Carlo (MCMC) sampling scheme for posterior inference. The model is related to supervised LDA (SLDA; Blei & McAuliffe, 2007); our model can be seen as an extension of SLDA to ordinal responses.
We demonstrate the developed models on a collection of reviews of mobile software applications. We compare the models to the relevant previous work and show that the proper ordinal response model is valuable for learning statistical associations between the responses and text data, providing significant improvements in terms of both predictive ability and knowledge extraction by inferring interpretable and useful themes of consumer feedback.

The paper is structured as follows. Section 2 presents the methodological contributions of this work: Section 2.1 presents our proposed generative model for ordinal variables, whereas Sections 2.2 and 2.3 present model formulations and inference approaches for joint mixed membership modelling of groups of observations and group-specific ordinal response variables. Related work is reviewed in Section 3. Section 4 describes the experiments and contains the results. Section 5 concludes the paper.

2. Joint Mixed Membership Models

The mth group of observations w^(m) is paired with an ordinal response variable y^(m). The response variables, also referred to as ratings, take values in R ∈ Z₊ ordered categories ranging between poor (1) and excellent (R). We note that for the simple case R = 2, y^(m) is binary and may be modelled by a Bernoulli distribution. The group w^(m) contains an unordered sequence of D^(m) words w_d^(m) over a V-dimensional vocabulary, w^(m) = {w_1^(m), w_2^(m), ..., w_{D^(m)}^(m)}.

2.1. Ordinal Response Variables

We assume y^(m) is drawn from a categorical distribution over R categories. The probability that y^(m) takes an integer value r ∈ {1, ..., R} is denoted by p(y^(m) = r). Since the categories are ordered, we propose a stick-breaking parameterisation for the probabilities; a unit-length stick is split into R smaller sticks that sum to one. We refer to these smaller sticks as stick weights v_r^(m) for the mth group and rth category.
We parameterise the v_r^(m) using a function σ(·) mapping its argument to a value between zero and one, introducing continuous-valued latent variables or scores t^(m) for each group as well as mean parameters μ_r for each category. The generative model for the y^(m) is

p(y^(m) = r) = v_r^(m) ∏_{r'=1}^{r−1} (1 − v_{r'}^(m)),    (1)
v_r^(m) = σ(t^(m) − μ_r).

Each v_r^(m) represents a binary decision boundary, specified by the mean variables, for the t^(m). The mean variables are ordered, that is, μ_1 < μ_2 < ... < μ_R, representing boundaries between the ordered categories. For computational simplicity, we use σ(x) = (1 + exp(−x))⁻¹, corresponding to a logit (or sigmoid) function, for which 1 − σ(x) = σ(−x). Alternative choices include probit, log-log or Cauchy functions, to name a few. The stick-breaking formulation guarantees that the probabilities p(y^(m) = r), for r = 1, ..., R, are positive and sum to one for any value of the t^(m). More importantly, the formulation leads to a simple posterior inference algorithm; the ordering of the mean variables is implicitly inferred based on the observed data without enforcing explicit constraints. For identifiability, we set, without loss of generality, v_R^(m) = 1. Figure 1 demonstrates the construction of probabilities based on the t^(m) for simulated mean variables μ.

Based on a collection of observed responses y^(m), where m = 1, ..., M, the model log likelihood is

L = Σ_m [ ln v_{y^(m)}^(m) + Σ_{r'=1}^{y^(m)−1} ln(1 − v_{r'}^(m)) ].    (2)

Point estimates for the latent scores as well as the mean variables may be inferred by maximising the log likelihood using unconstrained gradient-based optimisation techniques.

Figure 1. Visual demonstration of category probabilities. The x-axis denotes a range of values for the latent variable or score t^(m), whereas the vertical lines denote the category cut-off points, referred to as mean variables μ.
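As an illustration, the construction in Equations (1)–(2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code; the cut-off values and scores below are hypothetical toy inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_probs(t, mu):
    """Category probabilities from the stick-breaking construction (Eq. 1).

    t  : scalar latent score t^(m)
    mu : the R-1 mean variables entering the sigmoid; the last stick
         weight v_R is fixed to 1 for identifiability.
    """
    v = sigmoid(t - np.asarray(mu, dtype=float))   # stick weights v_r, r < R
    v = np.append(v, 1.0)                          # v_R = 1
    # p_r = v_r * prod_{r' < r} (1 - v_{r'}); the products telescope to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def log_likelihood(ts, ys, mu):
    """Model log likelihood (Eq. 2) for 1-based ratings ys and scores ts."""
    return sum(np.log(ordinal_probs(t, mu)[y - 1]) for t, y in zip(ts, ys))

p = ordinal_probs(0.5, [-2.0, -0.5, 1.0, 2.5])   # R = 5 categories
```

For any score t, the R probabilities are positive and sum to one, without any explicit normalisation, because the stick-breaking products telescope.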
In the following sections, we present two approaches for parameterising the latent scores, constructing statistical associations between the responses and groups. The main statistical interest focuses on the parameterisation, whereas the mean variables are relevant mainly for computing predictions. For this reason, in the following, we assign a uniform prior to the mean variables.

2.2. Joint Correlated Topic Model

In this section, we present a novel joint model (referred to as JTM) for the y^(m) and w^(m), where m = 1, ..., M. At the core of the model are group-specific latent variables u^(m) that are common to y^(m) and w^(m), capturing statistical associations between them. For the responses, we introduce a linear mapping or projection ξ and construct the data-generating latent score (Equation 1) as t^(m) = ξᵀ u^(m), the inner product between u^(m) and the mapping ξ. The generative process for the w^(m) (groups of observations), for m = 1, ..., M, is given by

w_d^(m) ~ Categorical(η_{c_d^(m)}),    (3)
c_d^(m) ~ Categorical(θ^(m)),

where η_k, for k = 1, ..., K, denotes the mixture components (topics), c_d^(m), for d = 1, ..., D^(m), denotes the observation assignments, and θ^(m) the mixture (topic) proportions over the K topics. We connect the θ^(m) to the latent variables u^(m) by introducing topic-specific mappings v_k and gamma-distributed variables z_k^(m) (parameterised suitably) such that a priori

E[θ_k^(m)] ∝ β_k exp(v_kᵀ u^(m)),    (4)

where β_k, for k = 1, ..., K, are positive concentration parameters. The latent mappings capture statistical associations between any two topics indexed by k and k'. If v_k and v_{k'} are similar, the topics η_k and η_{k'}, respectively, tend to co-occur, assuming that β_k and β_{k'} are sufficiently large. We use (normalised) gamma-distributed variables to construct the topic proportions, thus parameterising a mapping from the continuous latent variables to the discrete topic proportions. For simplified posterior inference we define β_k = β exp(m_k).
(5)

The process is

θ_k^(m) ∝ z_k^(m),    z_k^(m) ~ Gamma(β, exp(−v_kᵀ u^(m) − m_k)),

where β denotes the shape parameter and exp(−v_kᵀ u^(m) − m_k) the rate parameter of the gamma distribution, respectively. We see that E[θ_k^(m)] ∝ β exp(v_kᵀ u^(m) + m_k), as desired (4), using equation (5).¹ Figure 2 illustrates a graphical plate diagram of the model. We complete the model description by specifying distributions for the model hyper-parameters, the root nodes in Figure 2. We assign β ~ Gamma(α₀, β₀); u^(m) ~ Normal(0, I); ξ, v_k ~ Normal(0, l⁻¹I), where l denotes a precision (inverse variance) parameter of a zero-mean Gaussian distribution; η_k ~ Dirichlet(γ1), where γ is a concentration parameter of a Dirichlet distribution; and a non-informative prior for the m_k.

2.2.1. INTERPRETATION

Having specified the model, we highlight the role of the latent variables and the corresponding mappings for the responses and topics, ξ and v_k, where k = 1, ..., K, respectively. We may compute a measure of similarity between two vectors x_i and x_j by defining a function

l(x_i, x_j) = x_iᵀ x_j / √((x_iᵀ x_i)(x_jᵀ x_j))

that outputs a value between −1 and 1, indicating similarity or dissimilarity between the vectors. We may compute l(ξ, v_k), for k = 1, ..., K, and use the (dis)similarity

¹We note that for a gamma-distributed random variable x ~ Gamma(a, b), with density (bᵃ/Γ(a)) x^{a−1} exp(−bx), where Γ(·) denotes the gamma function, E[x] = a/b.

Figure 2. Graphical plate diagram of the joint correlated topic model. Unshaded nodes correspond to unobserved variables, whereas shaded nodes correspond to observed variables. Hyper-parameters for the root nodes, whose values need to be fixed prior to posterior inference, are omitted from the visualisation. Plates indicate replication over topics, groups and words. The hidden variables may be divided into local group-specific variables and global variables common to all groups.
That is, the unnormalised topic proportions z^(m), topic indicators c^(m), and latent variables u^(m) are defined for each group, whereas the set of topics η_k and the mappings from latent variables to the data domains, ξ and v_k, are common to all groups.

scores to infer whether the topics are positively or negatively associated with excellent or poor ratings. Next, we present a theoretical justification for the similarity measure. Marginalisation of the latent variables u is analytically tractable, leading to a joint Gaussian distribution for the t^(m) and auxiliary variables h_k^(m) (replacing the v_kᵀ u^(m)). The covariance matrix of the Gaussian distribution is

Σ = WWᵀ + I,    where Wᵀ = [ξ v_1 ... v_K].

We see that the similarity values defined above correspond to correlations between the response and topical mappings, respectively. We also note that the distribution is able to capture correlations between any two topics. Hence, we refer to this model as the joint correlated topic model.

2.2.2. REGULARISATION

During posterior inference the model infers statistical associations between the groups and responses. The inferred topics summarise recurring word co-occurrences over the corpus into interpretable themes, some of which may have significant associations with the ratings. However, for finite sample sizes the correlation structure may be weak. Accordingly, we introduce a user-defined parameter λ > 0 that balances for the limited sample size. We expect that when the sample size M increases for a fixed vocabulary size V, the role of λ diminishes, since there are more data to estimate the underlying correlation structure. The joint likelihood of the model is

p(D, Θ) = [ ∏_m ∏_d p(w_d^(m) | c_d^(m), η) p(c_d^(m) | z^(m)) ] [ ∏_m ∏_{k=1}^K p(z_k^(m) | u^(m), v_k, m_k, β) ] exp(λL) p(ξ) ∏_m p(u^(m)) ∏_k p(v_k) p(β),

where D = {w^(m), y^(m)}_{m=1}^M, Θ denotes the unknown quantities of the model and L is given in Equation (2). For λ < 1 the model focuses more on explaining the text.
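As a concrete sketch, one pass of the generative process in Equations (3)–(5) for a single group can be simulated as follows. All sizes and parameter values are toy assumptions, and note that NumPy's gamma sampler takes a scale parameter, the inverse of the rate used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, V, D = 4, 3, 20, 50                  # topics, latent dim, vocab, words (toy)
beta = 1.0                                 # gamma shape parameter
m = np.zeros(K)                            # offsets m_k
u = rng.normal(size=L)                     # u^(m) ~ Normal(0, I)
v_maps = rng.normal(size=(K, L))           # topic mappings v_k
eta = rng.dirichlet(0.1 * np.ones(V), K)   # topics eta_k

# z_k^(m) ~ Gamma(shape=beta, rate=exp(-v_k^T u - m_k)); scale = 1 / rate
rate = np.exp(-(v_maps @ u) - m)
z = rng.gamma(shape=beta, scale=1.0 / rate)
theta = z / z.sum()                        # normalised topic proportions theta^(m)
c = rng.choice(K, size=D, p=theta)         # topic assignments c_d^(m)
w = np.array([rng.choice(V, p=eta[k]) for k in c])   # words w_d^(m)
```

The normalisation step makes explicit why the z_k^(m) are referred to as unnormalised topic proportions.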
2.2.3. VARIATIONAL BAYESIAN INFERENCE

We present a variational Bayesian (VB) (Wainwright & Jordan, 2008) posterior inference algorithm for the model that scales well to large data collections and can readily be extended to stochastic online learning (Hoffman et al., 2013). We approximately marginalise over the topic assignments and proportions using non-trivial factorised distributions, whereas we use point distributions (estimates) for several variables to simplify computations, in essence adopting an empirical Bayes approach for these variables. The corresponding inference algorithm is able to prune irrelevant topics from the model based on the observed data. Full variational inference would be possible using techniques presented by Böhning (1992), Jaakkola & Jordan (1997) and Wang & Blei (2013), for example, lower bounding the analytically intractable log-sigmoid function appearing in the log likelihood (2). Alternatively, MCMC sampling strategies may provide appealing approaches for posterior inference. However, it is far from trivial to design suitable proposal distributions for the latent variables.

We introduce a factorised posterior approximation

q(Θ) = ∏_m ∏_d q(c_d^(m)) ∏_k q(z_k^(m)),

omitting the point distributions for clarity, and minimise the KL divergence between the factorisation q(Θ) and the posterior p(Θ|D). Equivalently, we maximise a lower bound on the model evidence with respect to the parameters of q(Θ),

ln p(D) ≥ L_VB = E[ln p(D, Θ)] − E[ln q(Θ)],

where the expectations are taken with respect to q(Θ). We choose the following distributions for the topic assignments and unnormalised topic proportions:

q(c_d^(m)) = Categorical(c_d^(m) | φ_d^(m)),
q(z_k^(m)) = Gamma(z_k^(m) | a_k^(m), b_k^(m)),

whose parameters are

φ_{w,k}^(m) ∝ η_{k,w} exp(E[ln z_k^(m)]),
a_k^(m) = β + Σ_{d=1}^{D^(m)} φ_{d,k}^(m),
b_k^(m) = exp(−v_kᵀ u^(m) − m_k) + D^(m) / Σ_{k'=1}^K E[z_{k'}^(m)].
In the derivations, we applied Jensen's inequality to lower bound the analytically intractable E[ln Σ_{k=1}^K z_k^(m)], needed for the normalisation of the z_k^(m), by introducing additional auxiliary parameters for each group. The expectations appearing above with respect to the variational factorisation are

E[ln z_k^(m)] = ψ(a_k^(m)) − ln b_k^(m),    E[z_k^(m)] = a_k^(m) / b_k^(m),

where ψ(·) denotes the digamma function. The lower bound of the model evidence, the cost function to maximise, with respect to the u^(m) is

L_VB^u = λL + Σ_{m,k} E[ln p(z_k^(m) | u^(m), v_k, m_k, β)] + ln p(u^(m)),

whereas for the v_k, m_k and β the cost function is

L_VB^{v,m,β} = Σ_{m,k} E[ln p(z_k^(m) | u^(m), v_k, m_k, β)] + ln p(v_k, m_k, β).

To infer the mapping ξ we maximise L_VB^ξ = L + ln p(ξ). Unconstrained gradient-based optimisation techniques may be used to infer point estimates for these unobserved quantities (optimising β in the log domain). Finally, the topics are updated as

η_{k,w} ∝ Σ_{m,d: w_d^(m)=w} φ_{d,k}^(m) + γ − 1.

2.3. Ordinal Supervised Topic Model

In this section, we propose a novel topic model for the ordinal responses and groups of observations. The model assumes a generative process for the words similar to that in Equation 3, introducing topic assignments c_d^(m) for the words w_d^(m), where d = 1, ..., D^(m), and topic proportions θ^(m) for the mth group. Here, the generative model for the ratings depends on the c_d^(m), where d = 1, ..., D^(m). In more detail, we define

c̄_k^(m) = (1/D^(m)) Σ_{j=1}^{D^(m)} I[c_j^(m) = k],

where I[·] denotes the indicator function, equalling 1 if the argument is true and zero otherwise, representing an empirical topic distribution for the mth group. We use this quantity to construct a linear mapping to the ratings. The model (see Figure 3 for an illustration of a graphical plate diagram) is

t^(m) = ξᵀ c̄^(m),
w_d^(m) ~ Categorical(η_{c_d^(m)}),
c_d^(m) ~ Categorical(θ^(m)),
θ^(m) ~ Dirichlet(α),
η_k ~ Dirichlet(γ1),
ξ_k ~ Normal(0, ζ).
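A minimal end-to-end simulation of this model for one group might look as follows, with the stick-breaking response of Section 2.1 supplying the rating distribution. All sizes, cut-off values and the symmetric α used here are toy assumptions (the paper's prior is asymmetric in general).

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, D, R = 5, 30, 40, 5                      # toy sizes (assumptions)
alpha = np.full(K, 0.5)                        # symmetric toy concentration
eta = rng.dirichlet(0.1 * np.ones(V), K)       # topics eta_k
xi = rng.normal(0.0, 1.0, size=K)              # response weights, zeta = 1
mu = np.array([-2.0, -1.0, 1.0, 2.0])          # hypothetical cut-offs for R = 5

theta = rng.dirichlet(alpha)                   # theta^(m) ~ Dirichlet(alpha)
c = rng.choice(K, size=D, p=theta)             # topic assignments c_d^(m)
w = np.array([rng.choice(V, p=eta[k]) for k in c])   # words
c_bar = np.bincount(c, minlength=K) / D        # empirical topic distribution
t = xi @ c_bar                                 # latent score t^(m) = xi^T c_bar

# stick-breaking rating distribution (Section 2.1), v_R fixed to 1
v = np.append(1.0 / (1.0 + np.exp(-(t - mu))), 1.0)
probs = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
y = rng.choice(np.arange(1, R + 1), p=probs)   # ordinal rating
```

The component-specific weights ξ_k are the quantities inspected a posteriori to judge which topics carry predictive value.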
Based on the observed data D, the model infers a set of topics that explain not only word co-occurrences but also the responses.

Figure 3. Graphical plate diagram for the ordinal supervised topic model. The topic proportions θ^(m) are group-specific and generated from an asymmetric Dirichlet distribution. The ordinal generative model for the ratings depends on the topic assignments c_d^(m), which specify the topical content (textual themes via topics η_k) of the mth group.

2.3.1. MCMC SAMPLING SCHEME

We present an MCMC sampling scheme for the model. We consecutively sample the topic assignments given the current value of ξ using collapsed Gibbs sampling, building on the work of Griffiths & Steyvers (2004), analytically marginalising out the topics as well as the topic proportions. Then, given the newly sampled assignments, we update the value of ξ as well as the concentration parameters α. The topic assignment probabilities are given by

p(c_d^(m) = k) ∝ (N_{w,k}^{¬c_d^(m)} + γ) / (N_k^{¬c_d^(m)} + Vγ) × (N_{k,d}^{¬c_d^(m)} + α_k) × p(y^(m) | {c_j^(m)}_{j=1, j≠d}^{D^(m)}, c_d^(m) = k),

where N_{w,k} denotes the number of times word w (here, w = w_d^(m)) is assigned to the kth topic, N_k = Σ_{w=1}^V N_{w,k}, and N_{k,d} denotes the number of tokens in document d assigned to the kth topic. The upper index ¬c_d^(m) means that the current count is excluded. The parameters of the response distribution are inferred by maximising L_ξ = L + ln p(ξ). The concentration parameters are updated recursively as

α_k ← α_k · [ Σ_{m=1}^M ψ(N_{k,m} + α_k) − M ψ(α_k) ] / [ Σ_{m=1}^M ln( Σ_j (N_{j,m} + α_j) − 1/2 ) − M ψ(Σ_j α_j) ],

building on Minka's fixed-point iteration (Minka, 2000). In the denominator, we approximate ψ(x) ≈ ln(x − 1/2), which is accurate when x > 1. This is the case here, since all w^(m), for m = 1, ..., M, contain at least one word token. The asymmetric Dirichlet prior enables pruning irrelevant topics based on the observed data (Wallach et al., 2009).
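The α update above can be sketched as a small fixed-point routine. This is a sketch of a Minka-style iteration with the stated ψ approximation in the denominator, not the authors' code, and the counts below are toy values.

```python
import numpy as np
from scipy.special import digamma

def update_alpha(N, alpha, iters=100):
    """Fixed-point update for Dirichlet concentrations alpha_k.

    N : (K, M) array, N[k, m] = tokens in document m assigned to topic k.
    Uses psi(x) ~ ln(x - 1/2) in the denominator, accurate for x > 1.
    """
    K, M = N.shape
    for _ in range(iters):
        # numerator: sum_m psi(N_{k,m} + alpha_k) - M psi(alpha_k), per topic k
        num = digamma(N + alpha[:, None]).sum(axis=1) - M * digamma(alpha)
        # denominator: sum_m ln(sum_j (N_{j,m} + alpha_j) - 1/2) - M psi(sum_j alpha_j)
        den = np.log(N.sum(axis=0) + alpha.sum() - 0.5).sum() - M * digamma(alpha.sum())
        alpha = alpha * num / den
    return alpha

N = np.array([[3, 1, 4], [2, 5, 1], [0, 2, 2]], dtype=float)   # toy counts
alpha = update_alpha(N, np.ones(3))
```

Because the update is multiplicative with a positive ratio, the concentrations stay positive, and topics that receive few assignments are driven towards small α_k, which is what permits pruning.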
We note that due to the recursive sampling of the topic assignments, the computational cost of inference may become considerable for large data sets. The recursive property carries over to a corresponding variational Bayesian treatment as well, since the topic assignments are dependent on each other.

3. Related Work

Previous works on statistical models for ordinal data (Albert & Chib, 1993; Chu & Ghahramani, 2005) assume

y^(m) = j if μ_{j−1} < z^(m) ≤ μ_j,    z^(m) ~ Normal(t^(m), 1),

where the z^(m), for m = 1, ..., M, denote Gaussian-distributed auxiliary variables. Marginalisation of the z^(m) leads to an ordinal probit model. The corresponding inference algorithm relies on truncated Gaussian distributions and takes into account explicit ordering constraints for the mean variables, leading to a complicated inference algorithm that is sensitive to initialisation and thus potentially prone to poor local optima.

The original supervised LDA model (SLDA; Blei & McAuliffe, 2007) uses canonical exponential family distributions for the response model. Under the canonical formulations the expectation of a response variable is E[y^(m)] = g(t^(m)), where g(·) denotes a link function specific to each member of the family. The most common members of this family include the Gaussian, Bernoulli and Poisson distributions, suitable for continuous-valued, binary and count variables, respectively. More importantly, however, the formulation does not support ordinal variables. Previous applications of SLDA by Blei & McAuliffe (2007), Dai & Storkey (2015) and Nguyen et al. (2013) to ordinal responses, such as product or movie reviews, have made a strong model mis-specification; they treat ordinal variables as continuous-valued. In this approach, the ordinal variables are represented as distinct values in the real domain with arbitrary user-defined intervals between them, enabling use of a Gaussian response model.
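For concreteness, the cut-point construction of Albert & Chib (1993) described at the start of this section can be sampled as follows; the cut-off values are hypothetical toy inputs.

```python
import numpy as np

def probit_ordinal_sample(t, mu, rng):
    """Draw y via the auxiliary-variable construction of Albert & Chib (1993):
    z ~ Normal(t, 1) and y = j if mu_{j-1} < z <= mu_j,
    with mu_0 = -inf and mu_R = +inf implied.
    mu holds the R-1 interior cut-offs; returns a 1-based category."""
    z = rng.normal(t, 1.0)
    return int(np.searchsorted(mu, z)) + 1

rng = np.random.default_rng(0)
ys = [probit_ordinal_sample(0.0, [-1.0, 1.0], rng) for _ in range(2000)]
```

With t = 0 and cut-offs at ±1, the middle category captures roughly 68% of the draws, reflecting the standard-normal mass between the cut-offs.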
The model is y^(m) ~ Normal(t^(m) + μ, τ⁻¹), where μ is a mean variable and τ is a precision (inverse variance) parameter. There are a number of statistical flaws in this approach, undermining interpretability. First, we note that the mean parameter of the Gaussian distribution may, in general, lead to results that make no sense in terms of the ordinal categories, especially for non-equidistant between-category intervals. Second, observed ratings take discrete values, but the predictions will not correspond to these values. Third, the Gaussian error assumption is not supported by discrete data.

Wang et al. (2009) present an important and non-trivial extension of SLDA to unordered, that is, nominal response variables, motivated by classification tasks. The nominal variables represent logically separate concepts that do not permit ordering. Ramage et al. (2009) and Lacoste-Julien et al. (2009) present alternative joint topic models, where functions of the nominal response variables (class information) affect the topic proportions. The response variables are not explicitly modelled using generative formulations. The approach of Mimno & McCallum (2008) uses a similar model formulation suitable for a wide range of observed response variables (or features, in general), performing linear regression from the responses, which are treated as covariates, to the concentration parameters of the Dirichlet distributions of the topic proportions. However, it is not obvious how to use these formulations for ordinal response variables.

4. Experiments and Results

We collect consumer-generated reviews of mobile software applications (apps) from Apple's App Store. The review data for each app contain an ordinal rating taking values in five categories ranging from poor to excellent, as well as free-flowing text data. We select the vocabulary using tf-idf scores.
After simple pre-processing, the data collection contains M = 5511 apps with vocabulary size V = 3995 and a total number of words Σ_{m=1}^M D^(m) ≈ 1.5 × 10⁶. The relatively small data collection is chosen to keep algorithm running times reasonable, especially for the sampling-based inference approaches.

4.1. Experimental Setting

We compare the joint correlated topic model (JTM; Section 2.2) and the ordinal supervised topic model (ordinal SLDA; Section 2.3) to SLDA with a Gaussian response model, as adopted in previous work by Blei & McAuliffe (2007), Dai & Storkey (2015) and Nguyen et al. (2013) (see Section 3 for more details), as well as to sparse ordinal and Gaussian linear regression models. For the Gaussian response models we represent the ratings as unit-spaced integers starting from one. The likelihood-specific parameters for the Gaussian model are the mean and precision. We adopt the inference procedure described in Section 2.3, using collapsed Gibbs sampling, also for Gaussian SLDA. For the regression models we infer a linear combination of the word counts and assign a sparsity-inducing prior distribution to the regression weights over the vocabulary in order to improve predictive ability. We maximise the corresponding joint log likelihood of the model for a fixed prior precision². For all the models that use a Gaussian response model, the mean variable is inferred by computing the empirical response mean. We initialise the models randomly.

For the joint correlated topic model (JTM) we bound the maximum number of active topics to K = 100, set the dimensionality of the latent variables to L = 30, α₀ = 1, β₀ = 10⁻⁶ and the prior precision to l = L. The results are shown for λ = 0.001, although λ ≤ 0.1 also provided good performance with little statistical variation. We terminated the algorithm (in both the training and testing phases) when the relative difference of the (corresponding) lower bound fell below 10⁻⁴.
The SLDA models were also computed for K = 100, and we used ζ = 1. We used 500 sweeps of sampling for inferring the topics and response parameters. For testing we used 500 sweeps of collapsed Gibbs sampling. Although we omit formal time comparisons due to the difficulties of comparing VB to MCMC approaches, we find that the sampling approach is roughly one order of magnitude slower. In general, determining convergence for MCMC approaches remains an open research problem, whereas VB provides a local bound on the model evidence. For all the topic models we used γ = 0.01. For JTM, this (effectively) equals a topic Dirichlet concentration parameter value of γ + 1, because the point estimate shifts the value by minus one. For the regression models we sidestep proper cross-validation of the prior precision and show results for the values providing the best performance, potentially leading to over-optimistic results.

4.2. Rating Prediction

We evaluate the models quantitatively in terms of predictive ability. Even though the developed joint mixed membership models are formulated primarily for exploring statistical associations between the ratings and text data, they can readily be used as predictive models. More specifically, we predict the ordinal rating based on the text. We partition the available data into multiple training and test sets using 10-fold cross-validation. For each model (and fold) we compute the test-set log likelihood (probability) of the ratings (the higher, the better) and use these values for comparison. Although various predictive criteria have been proposed, the selected measure is well motivated by statistical modelling.

²We use t^(m) = ξᵀ x^(m), where x^(m) denotes the word counts over the V-dimensional vocabulary, and p(ξ|ε) ∝ ∏_{d=1}^V exp(−ε ln(cosh(ξ_d))), where ε denotes a precision parameter of the prior distribution.
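The sparsity-inducing prior used for the regression baselines (footnote 2) corresponds to a smooth log-cosh penalty on the weights: since ln cosh(x) ≈ |x| − ln 2 for large |x|, it acts like a Laplace (lasso-type) prior while remaining differentiable at zero. A minimal sketch, with hypothetical values:

```python
import numpy as np

def log_prior(xi, eps):
    """log p(xi | eps) up to a constant: -eps * sum_d ln cosh(xi_d)."""
    return -eps * np.sum(np.log(np.cosh(xi)))

def penalised_objective(log_lik, xi, eps):
    """Joint log likelihood maximised for the regression baselines:
    the response log likelihood plus the sparsity-inducing log prior."""
    return log_lik + log_prior(xi, eps)

xi = np.array([0.0, 0.1, -3.0])       # hypothetical regression weights
penalty = log_prior(xi, eps=1.0)      # negative; heaviest on the large weight
```

The penalty is zero at ξ = 0 and grows roughly linearly in |ξ_d|, which drives most weights towards zero while leaving a few strongly predictive words with large weights.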
In the test phase, for JTM, we infer the latent variables u, the topic proportions (unnormalised gamma-distributed variables z_k^(m)) and the topic assignments c_d^(m), given the values of the remaining parameters inferred in the training phase. For the SLDA models the test phase corresponds to estimating the topic assignments using the standard LDA algorithm (with collapsed Gibbs sampling), with the topics fixed to those inferred from the training data. Finally, we compute the corresponding latent scores t^(m) for the models, obtaining the predictions.

Table 1 shows the test-set log likelihoods for the models. The ordinal linear regression model resulted in significantly better predictions than the Gaussian regression model (paired one-sided Wilcoxon; p < 10⁻³), showing that it is important to substitute a statistically poorly motivated Gaussian response distribution with a proper generative model. For both models the sparsity assumption improves predictive ability. For the ordinal regression model, the most relevant words predictive of low (poor) ratings include "waste" and "free", and those of high (excellent) ratings include "amazing" and "perfect". The model, however, falls short of providing in-depth interpretations, necessitating the use of topic models.

All the topic models perform substantially better than the regression models. The ordinal SLDA model provides the best predictive performance, JTM is the second best and Gaussian SLDA is the worst. All (pair-wise) comparisons are statistically significant (paired one-sided Wilcoxon; p < 0.005). We found that K = 100 is a sufficiently large threshold value for the number of topics; some of the inferred topics are inactive. This, together with the good predictive accuracy, establishes evidence that the developed models have captured the relevant statistical variation in the observed data. For JTM, we also performed a sensitivity analysis over the dimensionality of the latent variables L and found little statistical variation for 30 ≤ L ≤ 100 = K.
The test log likelihoods range between a minimum of −669.42 (9.68) for L = 80 and a maximum of −661.98 (11.73) for L = 50.

Next, we compared the inferred topics of the different models quantitatively using the semantic coherence measure proposed by Mimno et al. (2011) for quantifying topic trustworthiness. Table 2 shows the average topic coherences (the higher, the better). The topics inferred by JTM have significantly larger coherence (two-sample one-sided Wilcoxon, p < 0.0002).

Table 1. Rating prediction test-set log likelihoods for different methods. The table shows the mean and standard deviation computed over the 10 cross-validation folds.

model                 log likelihood
Ordinal SLDA          −638.53 (13.38)
JTM                   −667.79 (15.91)
Gaussian SLDA         −681.71 (17.69)
Ordinal regression    −704.30 (13.21)
Gaussian regression   −735.40 (14.70)

Table 2. Average semantic coherence values for the inferred topics of different models.

model           coherence
JTM             −52.64 (19.94)
Ordinal SLDA    −66.30 (26.43)
Gaussian SLDA   −67.84 (26.54)

4.3. Inspection of Inferred Topics

Finally, we visualise and interpret the topics inferred by the JTM model. Figures 4 and 5 visualise nine topics associated with high (excellent) and low (poor) ratings, respectively. As explained in Section 2.2.1, the associations (both sign and strength) are given by computing the similarity scores (that is, correlations).
[Figure 4 image omitted: nine topics rendered as lists of their top words.]

Figure 4. Visual illustration of topics associated with high ratings.

One of the topics associated with high ratings (Figure 4) captures word co-occurrence patterns containing adjectives with positive semantics. The remaining topics capture themes customers appreciate, such as games, health monitoring, calculations (for example, unit conversions), learning languages, social networking and education. One of the topics captures positive customer feedback about the app interface and design.
[Figure 5 image omitted: nine topics rendered as lists of their top words.]

Figure 5. Visual illustration of topics associated with low ratings.

The topics associated with low ratings (Figure 5) contain customers' negative experiences or feature requests, such as removal of ads, software updates and problems with functionality.

5.
Discussion

In this work, we develop a new class of ordinal mixed membership models suitable for capturing statistical associations between groups of observations and co-occurring ordinal response variables for each group. We depart from the existing dominant approach that relies on improper model assumptions for the ordinal response variables. We successfully demonstrate the developed models by analysing reviews of mobile software applications provided by consumers. The proposed class of models, as well as the inference approaches, is applicable to a wide range of present-day applications. In the future, we expect to see improvements in statistical inference, including fully Bayesian treatments and nonparametric Bayesian formulations. Stochastic online learning or model formulations for streaming data may be applied to scale the statistical inference to cope with current data repositories containing review data for a few million groups.

Acknowledgement

This research is supported by an EPSRC programme grant, A Population Approach to Ubicomp System Design (EP/J007617/1).

References

Albert, James H and Chib, Siddhartha. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669-679, 1993.

Blei, David and Lafferty, John. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.

Blei, David M and McAuliffe, Jon D. Supervised topic models. In Advances in Neural Information Processing Systems, 2007.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Böhning, Dankmar. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1):197-200, 1992.

Chu, Wei and Ghahramani, Zoubin. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005.

Dai, Andrew and Storkey, Amos J.
The supervised hierarchical Dirichlet process. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):243-255, 2015.

Griffiths, Thomas L and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Jaakkola, T and Jordan, Michael I. A variational approach to Bayesian logistic regression models and their extensions. In Artificial Intelligence and Statistics, 1997.

Lacoste-Julien, Simon, Sha, Fei, and Jordan, Michael I. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, 2009.

Mimno, David and McCallum, Andrew. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence, 2008.

Mimno, David, Wallach, Hanna M, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In Empirical Methods in Natural Language Processing, 2011.

Minka, Thomas. Estimating a Dirichlet distribution, 2000.

Nguyen, Viet-An, Boyd-Graber, Jordan L, and Resnik, Philip. Lexical and hierarchical topic regression. In Advances in Neural Information Processing Systems, 2013.

Paisley, John, Wang, Chong, Blei, David M, et al. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(2):235-272, 2012.

Ramage, Daniel, Hall, David, Nallapati, Ramesh, and Manning, Christopher D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing, 2009.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wallach, Hanna M, Mimno, David, and McCallum, Andrew.
Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, 2009.

Wang, Chong and Blei, David M. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005-1031, 2013.

Wang, Chong, Blei, David, and Li, Fei-Fei. Simultaneous image classification and annotation. In Computer Vision and Pattern Recognition, 2009.