# Markov Mixed Membership Models

Aonan Zhang (AZ2385@COLUMBIA.EDU), John Paisley (JPAISLEY@COLUMBIA.EDU)
Department of Electrical Engineering, Columbia University, New York, NY, USA

We present a Markov mixed membership model (Markov M3) for grouped data that learns a fully connected graph structure among mixing components. A key feature of Markov M3 is that it interprets the mixed membership assignment as a Markov random walk over this graph of nodes. This is in contrast to tree-structured models, in which the assignment is made according to a tree structure on the mixing components. The Markov structure results in a simple parametric model that can learn a complex dependency structure between nodes, while still maintaining full conjugacy for closed-form stochastic variational inference. Empirical results demonstrate that Markov M3 performs well compared with tree-structured topic models, and can learn meaningful dependency structure between topics.

## 1. Introduction

Mixed membership modeling is a statistical framework for modeling grouped data, where each group is represented as a unique mixture over a shared structure (Airoldi et al., 2014). A wide range of data falls within the scope of mixed membership models, including documents (Blei et al., 2003), images (Li & Perona, 2005), and the genome (Pritchard et al., 2000). Exchangeability assumptions can be relaxed to extend mixed membership models to link data (Airoldi et al., 2008), heterogeneous data (Chang & Blei, 2009; Wang & Blei, 2011) and matrix factorization (Mackey et al., 2010). In this paper, we focus on the case where each group's data is assumed i.i.d. given its mixed membership mixing measure.

For discrete grouped data, the most basic mixed membership model is latent Dirichlet allocation (LDA) (Blei et al., 2003), which assumes a finite set of discrete distributions and models each group as a mixture over these distributions using a Dirichlet prior. The simple flat Dirichlet prior assumes no structure among the atoms, and so the model can overfit as the number of components grows beyond a certain point. To capture finer resolution without overfitting, structure has been introduced into the atom relationships, for example by modeling pairwise correlations (Blei & Lafferty, 2007) or tree structures (Blei et al., 2010).

Among these models, the tree-structured model is especially interesting for the structure it can learn (Blei et al., 2010; Li et al., 2012; Kim et al., 2012; Ahmed et al., 2013; Paisley et al., 2015). Because the components are given a strict parent/child relationship, tree models can discover components of different granularities having top-down dependencies. In topic modeling this is natural, since topics can be more or less specific, and the children of one topic can further specify the more general content of the parent topic that unites them. To consider two Bayesian nonparametric instances, the nested Chinese restaurant process (nCRP) (Blei et al., 2010) and the nested hierarchical Dirichlet process (nHDP) (Paisley et al., 2015) are two tree-structured models that select distributions on paths from a root node (see Figure 1). For example, the nCRP selects the atoms for a group by following a path from root to leaf node; the nHDP generalizes this by selecting a subtree of atoms for each group.
Still, in both models it is assumed that there is a clear tree-structured relationship among nodes. In this paper, we explore a related modeling framework that allows for a more flexible range of node connections by assuming a Markov structure among the nodes. To this end, we present the Markov mixed membership model (Markov M3) for grouped data, which learns a fully connected graph structure among mixing components.

With respect to tree-structured models, our proposed Markov model is straightforward in that, rather than imposing a tree-structured transition rule between nodes, we model the nodes as a fully connected graph with a first-order Markov rule for transitioning between nodes (see Figure 1). We therefore refer to this as a graph-based mixed membership model. In the context of topic models, this means that any topic can a priori transition to any other topic, but by placing a sparse Dirichlet prior on the transition distributions we learn a meaningful dependence between topics through posterior inference.

Figure 1. Comparison between three path-based models. (a) The tree-structured nested Chinese restaurant process (nCRP) selects one path per group; (b) the tree-based nested hierarchical Dirichlet process (nHDP) places high probability on a subtree for each group; (c) the proposed graph-based Markov mixed membership model selects one path per group using a Markov random walk on the fully connected set of nodes (an example high-probability connectivity is depicted in the background).

An advantage of our proposed framework is that it avoids the combinatorial issues encountered during inference by models such as the nCRP (Wang & Blei, 2009), and does not require complicated pruning procedures such as those required by the nHDP (Paisley et al., 2015). In contrast, Markov M3 is a fully conjugate exponential family model that allows for closed-form updates without such procedures. As with those two models, our model is easily learned with stochastic variational inference, allowing for processing of large data sets (Hoffman et al., 2013).

In Section 2 we present Markov M3 and show the nCRP to be the nonparametric limiting case of the model. We present a closed-form stochastic variational inference algorithm for the model in Section 3. In Section 4 we present experiments on two large grouped discrete data sets.

## 2. Model Description

Mixed membership models are applicable to data sets where the observations are grouped, i.e., where viewing the data on the instance level results in subsets of the data. For example, each document in a set can be represented as a group of words. Mixed membership models assume that the groups share a data-generating (global) structure, but with different distributions to account for group-level (local) differences. For example, topic models share the same set of distributions on words, but mix over them differently for each group. The generative process of Markov M3 is summarized in Algorithm 1, described in detail in Section 2.1, and sketched in code below.

Algorithm 1. Generative process for Markov M3

Global variables:
1. Draw an initial-state distribution $\pi \sim \mathrm{Dir}(\tfrac{\alpha_0}{K} 1_K)$.
2. For each atom $k \in \{1, 2, \dots, K\}$: draw parameter $\beta_k \stackrel{iid}{\sim} \mu$ and transition distribution $\theta_k \stackrel{iid}{\sim} \mathrm{Dir}(\tfrac{\alpha_0}{K} 1_K)$.

Local variables, for each document $d \in \{1, 2, \dots, D\}$:
1. Draw a Markov chain of atoms $z_d \sim \mathrm{MC}(\pi, \theta)$.
2. Draw a distribution on atoms $\nu_d \sim \mathrm{GEM}(\gamma_0)$.
3. For each word $n$ in document $d$: draw an assignment $\ell_{dn} \stackrel{iid}{\sim} \mathrm{Disc}(\nu_d)$ and an observation $w_{dn} \sim f(\beta_{z_d, \ell_{dn}})$.
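As a concrete reference for Algorithm 1, the following is a minimal NumPy sketch of the generative process, using the Dirichlet prior on atoms that the paper adopts (Eq. 3 below) in place of a general base distribution $\mu$ and a fixed stick-breaking truncation $T$. The function and parameter names are illustrative and not part of the model specification.

```python
import numpy as np

def generate_markov_m3(D, K, V, N, T=15, alpha0=1.0, beta0=0.1, gamma0=1.0, seed=0):
    """Sketch of Algorithm 1: generate D documents of N words each."""
    rng = np.random.default_rng(seed)

    # Global variables: initial-state distribution, topics, and transition rows.
    pi = rng.dirichlet(np.full(K, alpha0 / K))              # pi ~ Dir(alpha0/K 1_K)
    beta = rng.dirichlet(np.full(V, beta0), size=K)         # beta_k ~ Dir(beta0 1_V)
    theta = rng.dirichlet(np.full(K, alpha0 / K), size=K)   # theta_k ~ Dir(alpha0/K 1_K)

    docs = []
    for d in range(D):
        # Markov chain of atoms z_d, truncated at length T for this sketch.
        z = np.empty(T, dtype=int)
        z[0] = rng.choice(K, p=pi)
        for i in range(1, T):
            z[i] = rng.choice(K, p=theta[z[i - 1]])

        # Stick-breaking weights nu_d ~ GEM(gamma0), closed at the truncation.
        u = rng.beta(1.0, gamma0, size=T)
        u[-1] = 1.0
        nu = u * np.concatenate(([1.0], np.cumprod(1.0 - u[:-1])))

        # Each word picks a position in the chain, then a word from that topic.
        ell = rng.choice(T, size=N, p=nu)
        words = np.array([rng.choice(V, p=beta[z[i]]) for i in ell])
        docs.append(words)
    return docs, (pi, theta, beta)

docs, globals_ = generate_markov_m3(D=5, K=10, V=50, N=100)
```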
A common assumption made by mixed membership models is that the data are exchangeable within and across groups. In this case, there exists a random mixed membership measure (i.e., a de Finetti mixing measure) conditioned on which the data are generated i.i.d. LDA is an example of a mixed membership model that is exchangeable both within and across groups, whereas dynamic topic models assume partial exchangeability, since the order of documents matters (Blei & Lafferty, 2006). We present a fully exchangeable model in which each group mixes on paths selected from a fully connected graph according to a Markov random walk.

### 2.1. Markov mixed membership models

We define the generative process for the Markov mixed membership model; the procedure is also summarized in Algorithm 1. Let $w_d$ be the set of data for group $d$. We model this data as an i.i.d. set drawn from a mixture $G_d$ with mixing distribution $\nu_d \sim \mathrm{GEM}(\gamma_0)$ and a group-specific sequence of atoms $(\hat\beta_1, \hat\beta_2, \dots)$. We can draw from the GEM stick-breaking distribution by sampling

$$u_{di} \stackrel{iid}{\sim} \mathrm{Beta}(1, \gamma_0), \qquad \nu_{di} = u_{di}\prod_{j=1}^{i-1}(1 - u_{dj}). \tag{1}$$

We use the mixture $G_d = \sum_{i=1}^{\infty} \nu_{di}\,\delta_{\hat\beta_i}$ to generate the group $w_d = (w_{d1}, \dots, w_{dn})$ by sampling

$$\ell_{dn} \sim \mathrm{Discrete}(\nu_d), \qquad w_{dn}\,|\,\ell_{dn} \sim f(\hat\beta_{\ell_{dn}}). \tag{2}$$

The distribution $f$ is problem-specific. In this paper we take $f$ to be a discrete distribution and $\hat\beta$ a $V$-dimensional probability vector. Assuming we have $K$ atoms, $\hat\beta_i \in \beta = \{\beta_1, \dots, \beta_K\}$, in this paper we let

$$\beta_k \stackrel{iid}{\sim} \mathrm{Dir}(\beta_0 1_V). \tag{3}$$

In the nCRP (Blei et al., 2010), the sequence of $\hat\beta_i$ is selected by following a path from the root node to a leaf node of a tree. Instead, we assume a first-order Markov structure for this sequence. We construct a Markov transition distribution on $\beta$ by drawing

$$\theta_k \stackrel{iid}{\sim} \mathrm{Dir}(\alpha_0/K, \dots, \alpha_0/K) \tag{4}$$

for each $k$. The distribution on the sequence $\hat\beta_d$ is then $P(\hat\beta_i = \beta_{j'} \,|\, \hat\beta_{i-1} = \beta_j, \theta) = \theta_{j,j'}$. The variable $z_d$ shown in Algorithm 1 and used for inference indexes this sequence: $z_{di} = j$ if $\hat\beta_i = \beta_j$. We assume the same Dirichlet prior on the initial-state distribution $\pi$ as well.

We observe that if we were to set $\ell_{dn} = n$, we would assign $w_{dn}$ to atom $\hat\beta_n$ and the result would be a standard hidden Markov model (HMM). What differentiates our model, and reintroduces group-level exchangeability, is that each word chooses which of the selected states it belongs to i.i.d. from $\nu_d$. The analogy to the HMM can be pursued by thinking of the words of a document as first being partitioned into ordered sets according to $\nu_d$, with each set then drawn according to an HMM. We will see how this way of considering the model leads to variational inference that builds on inference for the HMM (Beal, 2003).

### 2.2. Relationship to tree-structured models

As Figure 1 illustrates, the major difference between graph-based and tree-based mixed membership models is the dependence structure between nodes. Where models such as the nCRP impose a strict parent/child hierarchy, Markov M3 in a sense captures the potential for each node to be the parent of all others (which simply results from a shift of perspective about Markov chains). The limited modeling ability of the nCRP is primarily due to the rigid single path allowed per group. In topic modeling, this forces each selected topic to be a strict subset of those previously selected. This assumption was relaxed by recent tree-structured alternatives (Kim et al., 2012; Ahmed et al., 2013; Paisley et al., 2015).
For example, as Figure 1 indicates, the nHDP allows for multiple paths per document, so two general topics can be combined in a single document by allowing words to select paths in different directions. As seen in Figure 1, Markov M3 in one sense returns to the one-path-per-group structure of the nCRP, but allows for exploration of the entire space like the nHDP. This provides another remedy to the rigidness of the nCRP. It also offers modeling capabilities not found in the nHDP, since in that model the presence of two subtrees is not causally linked according to the prior: the presence of one subtree says nothing about the presence of another. With Markov M3, there is a causal connection between two atoms according to the Markovian generative structure (which we observe is not symmetric). Markov M3 is again like the nCRP in that, once it selects its path of atoms, it mixes on them using a probability vector drawn from a stick-breaking distribution.

The Markov mixed membership model can be viewed as a possible parametric version of the nested Chinese restaurant process, in that we can show that the nCRP is the limiting process of the model in Algorithm 1 as $K$ goes to infinity. To roughly sketch this, consider the marginal probability measure $G^K_m = \sum_{k=1}^K \theta_{mk}\,\delta_{\beta_k}$ constructed for the $m$th node. Ishwaran & Zarepour (2002) proved that $\lim_{K\to\infty} G^K_m = G_m \sim \mathrm{DP}(\alpha_0\mu)$. In the limit $K \to \infty$, each individual $\theta_{mk} \to 0$ with probability one, while $\sum_k \theta_{mk} = 1$. In this case, the nonzero probability can be shown to be on a disjoint set of atoms for each $G_m$ with probability one, despite the fact that they share atoms in the finite approximation $G^K_m$. Practically speaking, this means that a state transition sequence sampled from the infinite limit of Markov M3 will never return to the same node twice, and so the model can equivalently be thought of as selecting a path in a tree. We formally state this in the following proposition.

Proposition 1. As $K$ goes to infinity, the Markov mixed membership model recovers the underlying mixing measure of the nested Chinese restaurant process.

### 2.3. Related work

Graph-based mixed membership models have been applied to grouped data in other settings. For example, mixed membership models have been applied to graph data, where instances are linked as a graph, and stochastic block models (Airoldi et al., 2008) have been proposed to model the links between instances through clustering. Mixed membership models have also been applied to collaborative filtering (Mackey et al., 2010; Wang & Blei, 2011) and link prediction (Chang & Blei, 2009). Exploring exchangeable structures in graphs has also received recent theoretical interest (Orbanz & Roy, 2015). We also observe that mixed membership models can be applied to the more traditional use of hidden Markov models for sequential data. For example, Paul (2014) considers a similarly named process for the fundamentally different problem of nonexchangeable sequence modeling.

## 3. Scalable Variational Inference

We derive a variational inference algorithm for learning an approximate posterior of all model variables shown in Algorithm 1. We can factorize the joint distribution of our model as

$$p(w, \beta, \theta, z, u, \ell, \pi) = p(\beta)\,p(\theta)\,p(\pi) \prod_d p(w_d\,|\,\beta, z_d, \ell_d)\,p(z_d\,|\,\theta, \pi)\,p(\ell_d\,|\,u_d)\,p(u_d). \tag{5}$$
We apply mean-field variational inference to approximate the posterior $p(\beta, \theta, \pi, z, u, \ell\,|\,w)$ by defining a factorized $q$ distribution on these variables and locally maximizing the variational objective function $\mathcal{L} = \mathbb{E}_q[\ln p(w, \beta, \theta, z, u, \ell, \pi)] - \mathbb{E}_q[\ln q]$ using coordinate ascent, which approximately minimizes the KL divergence between the true posterior and $q$. We restrict $q$ to the following factorized form,

$$q(\beta, \theta, \pi, z, u, \ell) = q(\beta)\,q(\theta)\,q(\pi)\,q(z)\,q(u)\,q(\ell), \tag{6}$$

which we further factorize as

$$q(\beta)\,q(\theta)\,q(\pi) = q(\pi\,|\,\alpha_\pi) \prod_{k=1}^K q(\beta_k\,|\,\lambda_k)\,q(\theta_k\,|\,\alpha_k), \qquad q(z)\,q(u)\,q(\ell) = \prod_d \Big[\, q(z_d\,|\,\varphi_d) \prod_i q(u_{di}\,|\,a_{di}, b_{di}) \prod_n q(\ell_{dn}\,|\,\phi_{dn}) \Big]. \tag{7}$$

We select all $q$ distributions to be in the same family as the priors defined in Algorithm 1. In the following, we use coordinate ascent to optimize the variational objective with respect to the variational parameters. We first discuss the updates of the local variables $z_d, u_d, \ell_d$, which depend only on a single instance. We then discuss the global variables $\beta, \theta, \pi$, which depend on multiple instances. Though we have proposed a model that is parametric in the size of the graph, $K$, we have defined the Markov chain for each group to be infinite in length. For inference, we introduce a truncation of this stick-breaking construction to level $T$, as is typically done (Blei & Jordan, 2006). We note that truncation-free methods are a possible remedy (Wang & Blei, 2012).

### 3.1. Local variables

The most complex part of inference is in learning the Markov sequence $z_d$ that selects atoms for group $d$. We can derive the explicit form of its posterior from

$$q(z_d) \propto \exp\Big\{ \sum_i \underbrace{\mathbb{E}[\ln p(z_{di}\,|\,z_{d,i-1}, \theta)]}_{\text{pairwise potential}} + \sum_i \sum_n \phi_{dn}(i)\, \underbrace{\mathbb{E}[\ln p(w_{dn}\,|\,z_{di}, \beta)]}_{\text{single potential}} \Big\}. \tag{8}$$

In Eq. (8) we can break $q(z_d)$ into single potentials and pairwise potentials as indicated. We can then solve using the forward-backward algorithm to infer the posterior joint marginals. In fact, this is exactly the procedure for learning the state transitions of an HMM, with the important difference that the emission at step $i$ is not predefined, but is instead the soft clustering of the data in $w_d$ induced by the variational parameters of $\ell_d$ (which changes with each iteration). Thus, a simple modification can be made to the forward-backward algorithm that accounts for the new emission process (we omit the full derivation for space; please see the appendix for details). Below, the variational parameters $\varphi_{di}$ are the result of forward-backward, and correspond to the distribution on atoms for the $i$th state of group $d$.

Given these vectors, the updates for the remaining variational parameters are more straightforward. The data allocation variable $\ell_{dn}$ has multinomial variational parameter $\phi_{dn}$, found by calculating

$$\phi_{dn}(i) \propto \exp\Big\{ \mathbb{E}[\ln \nu_{di}] + \sum_k \varphi_{di}(k)\,\mathbb{E}[\ln \beta_{k, w_{dn}}] \Big\}. \tag{9}$$

We observe that the last term sums over atom assignments for the $i$th value in the Markov sequence. The first expectation comes from the stick-breaking construction,

$$\mathbb{E}[\ln \nu_{di}] = \mathbb{E}[\ln u_{di}] + \sum_{j=1}^{i-1} \mathbb{E}[\ln(1 - u_{dj})]. \tag{10}$$

These expectations frequently arise in variational inference. For example, $\mathbb{E}[\ln \beta_{kv}] = \psi(\lambda_{kv}) - \psi(\sum_{v'} \lambda_{kv'})$, where $\psi(\cdot)$ denotes the digamma function.

The final local variables are the stick-breaking proportions $u_d$. Given the allocation distributions $q(\ell_{dn})$, the update of the beta $q$ distribution of $u_{di}$ is

$$a_{di} = 1 + \sum_n \phi_{dn}(i), \qquad b_{di} = \gamma_0 + \sum_n \sum_{i' > i} \phi_{dn}(i'). \tag{11}$$

We can iterate the local variable updates for each document several times before moving on to the global variables.
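To make the local step concrete, the following is a minimal sketch of the modified forward-backward pass for $q(z_d)$, written against the expected log potentials of Eq. (8). The input `log_emit[i, k]` is assumed to hold the precomputed soft emission term $\sum_n \phi_{dn}(i)\,\mathbb{E}[\ln \beta_{k, w_{dn}}]$; all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_pi, log_theta, log_emit):
    """Posterior marginals for q(z_d) from expected log potentials.

    log_pi    : (K,)   E[ln pi_k]
    log_theta : (K, K) E[ln theta_{k,k'}]
    log_emit  : (T, K) soft emission potentials, sum_n phi_dn(i) E[ln beta_{k, w_dn}]
    Returns per-position marginals (T, K) and pairwise marginals (T-1, K, K).
    """
    T, K = log_emit.shape
    log_alpha = np.zeros((T, K))   # forward messages
    log_beta = np.zeros((T, K))    # backward messages

    log_alpha[0] = log_pi + log_emit[0]
    for i in range(1, T):
        log_alpha[i] = log_emit[i] + logsumexp(log_alpha[i - 1][:, None] + log_theta, axis=0)
    for i in range(T - 2, -1, -1):
        log_beta[i] = logsumexp(log_theta + (log_emit[i + 1] + log_beta[i + 1])[None, :], axis=1)

    # Per-position marginals varphi_{di}(k).
    log_marg = log_alpha + log_beta
    varphi = np.exp(log_marg - logsumexp(log_marg, axis=1, keepdims=True))

    # Pairwise marginals over consecutive positions.
    log_pair = (log_alpha[:-1, :, None] + log_theta[None, :, :]
                + (log_emit[1:] + log_beta[1:])[:, None, :])
    pair = np.exp(log_pair - logsumexp(log_pair, axis=(1, 2), keepdims=True))
    return varphi, pair
```

The per-position marginals give the $\varphi_{di}(k)$ that appear in Eqs. (9) and (12)-(14).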
### 3.2. Global variables

The global variables include the Markov transition probabilities and the atoms. The updates to the variational parameters of the initial-state and transition probabilities, $\pi$ and $\theta$, are identical to those of the HMM,

$$\alpha_{\pi,k} = \frac{\alpha_0}{K} + \sum_d \varphi_{d1}(k), \tag{12}$$

$$\alpha_{k,k'} = \frac{\alpha_0}{K} + \sum_d \sum_{i>1} \varphi_{d,i-1}(k)\,\varphi_{d,i}(k'). \tag{13}$$

For Dirichlet-distributed atoms $\beta$, the variational update to the Dirichlet $q$ distribution of $\beta_k$ is

$$\lambda_{kv} = \beta_0 + \sum_d \sum_n \mathbb{1}(w_{dn} = v) \sum_i \phi_{dn}(i)\,\varphi_{di}(k). \tag{14}$$

The right-most summation calculates the probability that word $w_{dn}$ and topic $\beta_k$ are both assigned to the same point in the Markov sequence $(z_{d1}, z_{d2}, \dots)$, which is required for word $w_{dn}$ to belong to component $\beta_k$. The full batch procedure is outlined in Algorithm 2.

Algorithm 2. An outline of batch variational inference

Local variables, for each document $d$:
1. Update $q(z_d)$ with forward-backward (Eq. 8).
2. Update each word allocation $q(\ell_{dn})$ (Eq. 9).
3. Update the stick proportions $q(u_{di})$ (Eq. 11).

Global variables, for $\pi$ and each atom $k$:
1. Update the initial-state distribution $q(\pi)$ (Eq. 12).
2. Update the transition distribution $q(\theta_k)$ (Eq. 13).
3. Update the atom distribution $q(\beta_k)$ (Eq. 14).

### 3.3. Stochastic variational inference

Since Markov M3 is a conjugate exponential family model, it is immediately amenable to stochastic variational inference (SVI) (Hoffman et al., 2013). For models such as tree-based models and the proposed graph-based model, such large-data extensions can help in learning the greater level of structure defined by the model prior. As with other mixed membership models, we can exploit the fact that the variational objective function factorizes as

$$\mathcal{L} = \mathbb{E}_q[\ln p(\beta, \theta, \pi)] - \mathbb{E}_q[\ln q] + \sum_d \mathbb{E}_q[\ln p(w_d, z_d, u_d, \ell_d\,|\,\beta, \theta, \pi)]. \tag{15}$$

Using SVI, we stochastically optimize $\mathcal{L}$ by restricting the local calculations to a small subset $C_t$ of the $D$ groups at iteration $t$. Given the subset $C_t$, SVI proceeds by (1) optimizing all local variables indexed by $C_t$, (2) forming the global updates restricted to $C_t$, and (3) averaging these updates with the current values. Let $\hat\alpha_{\pi,k}$, $\hat\alpha_{k,k'}$ and $\hat\lambda_{kv}$ be the coordinate ascent updates restricted to $C_t$, calculated as in Eqs. (12)-(14). Then the stochastic updates to the true values are

$$\alpha^{(t+1)}_{\pi,k} = (1 - \rho_t)\,\alpha^{(t)}_{\pi,k} + \rho_t (D/|C_t|)\,\hat\alpha_{\pi,k},$$
$$\alpha^{(t+1)}_{k,k'} = (1 - \rho_t)\,\alpha^{(t)}_{k,k'} + \rho_t (D/|C_t|)\,\hat\alpha_{k,k'},$$
$$\lambda^{(t+1)}_{kv} = (1 - \rho_t)\,\lambda^{(t)}_{kv} + \rho_t (D/|C_t|)\,\hat\lambda_{kv}. \tag{16}$$

The decaying learning rate $\rho_t$ must satisfy $\sum_{t=1}^\infty \rho_t = \infty$ and $\sum_{t=1}^\infty \rho_t^2 < \infty$ to ensure convergence (Bottou, 1998). We set $\rho_t = (\tau_0 + t)^{-\kappa}$, where $\tau_0 > 0$ and $0.5 < \kappa \le 1$.
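For reference, here is a minimal sketch of one stochastic update from Eq. (16), assuming the restricted coordinate-ascent statistics for the minibatch have already been computed; the dictionary layout and names are illustrative, not the authors' code.

```python
import numpy as np

def svi_step(params, hat_params, t, D, batch_size, tau0=10.0, kappa=0.75):
    """One stochastic variational update, Eq. (16).

    params     : dict of current global variational parameters,
                 e.g. {'alpha_pi': (K,), 'alpha': (K, K), 'lam': (K, V)}
    hat_params : coordinate-ascent updates computed on the minibatch C_t only
    """
    rho = (tau0 + t) ** (-kappa)      # decaying learning rate rho_t
    scale = D / batch_size            # rescale minibatch statistics to corpus size
    return {key: (1.0 - rho) * params[key] + rho * scale * hat_params[key]
            for key in params}
```

As a usage note, with $D = 1.8$ million documents and $|C_t| = 500$ as in the New York Times experiment below, `scale` is 3600, so each minibatch estimate is rescaled to corpus size before the weighted average with the current parameters.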
## 4. Experiments

Our experiments with Markov M3 focus on grouped discrete data problems. We first consider topic modeling on small- and large-scale problems. We then show qualitative results on a music tagging problem, where the union of quantized song features and user tags provides the discrete grouping on a song level.

### 4.1. Document modeling

Batch comparisons. We first apply our model to three datasets easily learned with batch inference: Huffington Post, Science and NIPS papers. The statistics of each data set, along with its split into a training set and a test set, are shown in Table 1. For the test set, we split each document 90/10, learned local parameters on 90% of the words in the document, and predicted the remaining 10% given the inferred topic proportions.

Table 1. Three datasets used for batch comparison.

| Corpus | # train | # test | # vocab | # tokens |
|---|---|---|---|---|
| Huff Post | 3.5K | 589 | 6,313 | 907K |
| Science | 4K | 1K | 4,403 | 1.39M |
| NIPS | 2.2K | 300 | 14,086 | 3.3M |

We present the quantitative performance using perplexity, calculated as

$$\text{perplexity} = \exp\Big\{ -\frac{\sum_{w_n \in \mathcal{W}_{TS}} \log p(w_n \,|\, \mathcal{W}_{TR})}{|\mathcal{W}_{TS}|} \Big\},$$

where $\mathcal{W}_{TR}$ and $\mathcal{W}_{TS}$ represent the training and test words in the test set, respectively. We compare the performance of our model with LDA, the correlated topic model (CTM) (Blei & Lafferty, 2007) and the nested HDP (nHDP) (Paisley et al., 2015). The nHDP was shown to give better predictive ability than the nCRP, and so we do not compare with that algorithm. We have also noted that when $K$ goes to infinity, Markov M3 recovers the nCRP mixed membership model, and so the performance of our model tends to that of the nCRP.

The perplexity results for different numbers of topics are shown in Figure 2. For the nonparametric nHDP we truncate its posterior topic number to 175, which is higher than the maximum number of topics used by the parametric models. On all datasets, Markov M3 consistently performs better than the other parametric models. The reason is that Markov M3 can more flexibly model all pairwise topic dependencies, while LDA only considers there to be a slight negative correlation among topics, and CTM considers pairwise correlation without dependency information. We observe that the best performance of Markov M3 is better than that of the nHDP, which gives evidence that a graph structure is preferable to a tree structure for modeling topic dependencies.

Figure 2. Held-out perplexity results as a function of the number of topics on Huffington Post, Science and NIPS. The Markov transition model (Markov M3) overall achieves the best performance among parametric models. Its best performance is even better than the state-of-the-art nonparametric nHDP.

Stochastic learning. For a large-scale problem, we train our model using stochastic variational inference on the New York Times dataset, which contains 1.8 million documents, and compare its predictive performance on a held-out test set with online LDA (Hoffman et al., 2010a). For both models we use the same topic initialization. We also use a learning rate of $(10 + t)^{-0.75}$ for both models, and a batch size of $|C_t| = 500$. For Markov M3, we truncate the path length to 15 and set $\gamma_0 = 1$. In Figure 3, we show the predictive performance throughout the learning process of both models, considering various numbers of topics. We see that the stochastic version of Markov M3 performs better than online LDA, which is consistent with the results on the smaller-scale problems.

Figure 3. Predictive performance (held-out log likelihood as a function of documents seen, ×10⁵) for online Markov M3 and online LDA. Markov M3 is consistently better for various numbers of topics throughout the entire learning process.

Markov M3 learns a transition distribution over all topics. In Figure 5, we depict the most probable transitions from several topics within the graph. We limit connections and directions to those above a threshold for clarity, and use size to roughly indicate probability. As can be seen, most transition probabilities between topics are very low and thus are not displayed in the graph.
Among all the topics, a few are general (e.g., topic 67, with top words "say", "life", "man"), but most topics are more specific. Topics naturally form small cliques, where the transition probability within a cluster is significantly higher than transitions across clusters. There are also connections between topics in different but similar domains. For example, there is a high transition probability from a film topic (18) to a music topic (46).

Figure 5. Selected topic subgraph from a 200-topic graph learned by online Markov M3 on the New York Times dataset. The graph shows Markov transitions with high probability among topics.

We further illustrate the Markov property by focusing on the document level. Since each document learns a distribution on paths through the graph, we show the most probable path found using the Viterbi algorithm. We show the paths selected for three documents in Figure 4, where the arrows indicate path direction and the size of a node indicates the proportion for that topic. In general, Markov M3 tends to visit earlier topics with larger proportions, as is encouraged by the stick-breaking prior. The topics selected in these examples are clearly interpretable, and the sequence captures information about topic relations.

Figure 4. Topic paths selected by three documents ("E-Sports at College, With Stars and Scholarships", "Obama's Budget Seeks International Minimum Tax for Corporations", and "China Further Tightens Grip on the Internet"). The size of a node indicates the proportion of the topic.
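As a companion to the forward-backward sketch above, here is a minimal Viterbi sketch for extracting a most probable state path under the variational approximation, using the same expected log potentials; treating these potentials as the Viterbi scores is an illustrative choice on our part, and the names are not the authors' code.

```python
import numpy as np

def viterbi_path(log_pi, log_theta, log_emit):
    """Most probable atom sequence under the expected log potentials.

    log_pi    : (K,)   E[ln pi_k]
    log_theta : (K, K) E[ln theta_{k,k'}]
    log_emit  : (T, K) soft emission potentials per position and atom
    """
    T, K = log_emit.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)

    score[0] = log_pi + log_emit[0]
    for i in range(1, T):
        trans = score[i - 1][:, None] + log_theta   # previous state x next state
        back[i] = np.argmax(trans, axis=0)
        score[i] = np.max(trans, axis=0) + log_emit[i]

    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for i in range(T - 1, 0, -1):
        path[i - 1] = back[i, path[i]]
    return path
```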
Sensitivity analysis. We empirically analyze the effect of the truncation level of the Markov chain and the stick-breaking concentration parameter in our model. The truncation level determines the number of topics we allow for each document. When the truncation level is too small, each document can only explore a very limited number of topics (we recall that the nCRP restricted each document to three topics, not counting the shared root topic). As Figure 6 shows, the model is less accurate in this case. However, when the length of the Markov random walk becomes too large, performance also decreases. This is because the update to the transition distributions $\theta_k$ treats the entire sequence of each document equally. If there are many empty topics in the Markov sequence, as indicated by the $q$ distributions on $u_{di}$, the $q$ distributions of the transition matrix can become less informative. In this sense, the truncation of the model is important and should be set so that most available topics are used by a document. The concentration parameter $\gamma_0$ defines the smoothness of the stick-breaking proportions. Figure 6 shows that choosing a smooth prior can help improve performance.

Figure 6. Sensitivity analysis for (left) the truncation level of the sticks, and (right) the stick-breaking concentration parameter. Results are shown in terms of average log likelihood on a test set.

### 4.2. Million Song Dataset

We also experiment with the Million Song Dataset (Bertin-Mahieux et al., 2011). We first extract music audio features from 371K songs and learn a codebook of size 512 using K-means. We represent each song as a vector that can be split into two parts. The first part contains the vector-quantized audio features using the codebook; this gives a vector $w \in \mathbb{N}^J$, where $w_k$ represents the count of audio frames that fall into cluster $k$ for a particular song. The remaining part is a user-applied bag of tags $v \in \{0, 1\}^L$, with a total of $L = 561$ tags; we set $v_l = 1$ if tag $l$ is observed for the given song. Thus, the entire quantized feature can be represented as $[w, v]$, or a document with a vocabulary size of 1,073. Exploiting latent factors that generate audio waveforms for tagging has been studied in recent years (Hoffman et al., 2010b; Liang et al., 2013).
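To illustrate this representation, here is a minimal sketch of building the combined audio-tag "document" for one song. The codebook size of 512 and tag vocabulary of 561 follow the text, while the frame-level feature extraction is assumed to have happened elsewhere and all function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

CODEBOOK_SIZE = 512   # audio codewords
NUM_TAGS = 561        # user-applied tags

def fit_codebook(frame_features, seed=0):
    """Learn the audio codebook with K-means over pooled audio frames."""
    return KMeans(n_clusters=CODEBOOK_SIZE, random_state=seed).fit(frame_features)

def song_document(codebook, song_frames, tag_ids):
    """Represent one song as a count vector [w, v] over the joint vocabulary."""
    # w: counts of frames assigned to each codeword
    assignments = codebook.predict(song_frames)
    w = np.bincount(assignments, minlength=CODEBOOK_SIZE)

    # v: binary indicators over the tag vocabulary
    v = np.zeros(NUM_TAGS, dtype=int)
    v[list(tag_ids)] = 1

    return np.concatenate([w, v])   # length 512 + 561 = 1,073
```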
The end goal we consider is the problem of assigning semantic tags to a song by analyzing only its audio waveform, using a model learned from weakly labeled songs (i.e., with incomplete and noisy labels). We apply Markov M3 with 50 nodes (topics) to this problem to learn joint audio-tag topics: each factor is a distribution over "words", which are a combination of the audio codebook and the tags. In our problem setup, we note that the audio features dominate the feature vector, since the number of user-applied tags is much smaller than the number of quantized audio features; the topics learned will therefore be audio-centered, and the marginal tag distribution can be viewed as a weak semantic label of the music style captured by that audio factor.

As with document modeling, we learn a fully connected graph over music factors, which we display here by showing the path transitions for three held-out test songs. Again, the model learns a distribution on paths, and we only show the most probable path using Viterbi. For this testing problem, since we do not have any tags, we feed the quantized audio features into our model to learn their Markov transitions over factors. We then use the tagging portion of the selected atoms to represent the path selected. From Figure 7, we find that Markov M3 pulls out the correct, but noisy, tags (marked as red) with high probability, and also discovers other possibly relevant tags.

Figure 7. Markov transition paths learned from three songs without knowing their tags: The Bristols, "Little Baby" (indie, indie rock); Carroll Thompson, "I'm So Sorry" (female vocalist, reggae); and Akercocke, "Footsteps Resound In An Empty Chapel" (black metal, death metal, metal), with the ground-truth tags in parentheses. The black node denotes the initial state. The size of a node indicates the proportion for that factor, and the thickness of an arrow indicates the transition probability between factors. All the paths are embedded in the graph (the gray background).

## 5. Conclusions and Future Work

We proposed a Markov mixed membership model (Markov M3) that explores a fully connected graph structure among components. Markov M3 provides a new way of performing mixed membership modeling with structured distributions, and an alternative to similar tree-based models. We showed how Markov M3 gives a new perspective on the nCRP by showing the nCRP to be a limiting case of Markov M3. We showed the effectiveness of this modeling framework on small and large datasets of discrete grouped data such as documents and quantized music. In future work, we are interested in exploring a Bayesian nonparametric extension of Markov M3 that still allows its components to be fully connected. Another direction is to develop extensions of Markov M3 to problems with nonexchangeable data.

## References

Ahmed, A., Hong, L., and Smola, A. Nested Chinese restaurant franchise processes: Applications to user tracking and document modeling. In International Conference on Machine Learning, 2013.

Airoldi, E., Blei, D., Fienberg, S., and Xing, E. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

Airoldi, E. M., Blei, D., Erosheva, E. A., and Fienberg, S. E. (eds.). Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC Handbooks of Modern Statistical Methods, 2014.

Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, University College London, 2003.

Bertin-Mahieux, T., Ellis, D., Whitman, B., and Lamere, P. The million song dataset. In International Society for Music Information Retrieval, 2011.

Blei, D. and Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.

Blei, D. and Lafferty, J. Dynamic topic models. In International Conference on Machine Learning, 2006.

Blei, D. and Lafferty, J. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.

Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
Blei, D., Griffiths, T., and Jordan, M. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.

Bottou, L. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9), 1998.

Chang, J. and Blei, D. Relational topic models for document networks. In International Conference on Artificial Intelligence and Statistics, 2009.

Hoffman, M., Blei, D., and Bach, F. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 2010a.

Hoffman, M., Blei, D., and Cook, P. Bayesian nonparametric matrix factorization for recorded music. In International Conference on Machine Learning, 2010b.

Hoffman, M., Blei, D., Paisley, J., and Wang, C. Stochastic variational inference. Journal of Machine Learning Research, 14:1005–1031, 2013.

Ishwaran, H. and Zarepour, M. Dirichlet prior sieves in finite normal mixtures. Statistica Sinica, 12:941–963, 2002.

Kim, J., Kim, D., Kim, S., and Oh, A. Modeling topic hierarchies with the recursive Chinese restaurant process. In International Conference on Information and Knowledge Management, 2012.

Li, F. and Perona, P. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005.

Li, L., Zhang, X., Zhou, M., and Carin, L. Nested dictionary learning for hierarchical organization of imagery and text. In Uncertainty in Artificial Intelligence, 2012.

Liang, D., Hoffman, M., and Ellis, D. Beta process sparse nonnegative matrix factorization for music. In International Society for Music Information Retrieval, 2013.

Mackey, L., Weiss, D., and Jordan, M. Mixed membership matrix factorization. In International Conference on Machine Learning, 2010.

Orbanz, P. and Roy, D. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.

Paisley, J., Wang, C., Blei, D., and Jordan, M. Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):256–270, 2015.

Paul, M. J. Mixed membership Markov models for unsupervised conversation modeling. In Empirical Methods in Natural Language Processing, 2014.

Pritchard, J. K., Stephens, M., Rosenberg, N. A., and Donnelly, P. Association mapping in structured populations. American Journal of Human Genetics, 67:170–181, 2000.

Wang, C. and Blei, D. Variational inference for the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, 2009.

Wang, C. and Blei, D. Collaborative topic modeling for recommending scientific articles. In Knowledge Discovery and Data Mining, 2011.

Wang, C. and Blei, D. Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems, 2012.