Gibbs Max-margin Topic Models with Data Augmentation

Journal of Machine Learning Research 15 (2014) 1073-1110. Submitted 3/13; Revised 9/13; Published 3/14.

Jun Zhu dcszj@mail.tsinghua.edu.cn
Ning Chen ningchen@mail.tsinghua.edu.cn
Hugh Perkins ngls11@mails.tsinghua.edu.cn
Bo Zhang dcszb@mail.tsinghua.edu.cn

State Key Lab of Intelligent Technology and Systems
Tsinghua National Lab for Information Science and Technology
Department of Computer Science and Technology
Tsinghua University, Beijing, 100084, China

Editor: David Blei

Abstract

Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier to discover hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restrictive assumptions and no need to solve SVM subproblems.
Furthermore, each step of the augment-and-collapse Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results on several medium-sized and large-scale data sets demonstrate significant improvements in time efficiency. The classification performance is also improved over competitors on binary, multi-class and multi-label classification tasks.

Keywords: supervised topic models, max-margin learning, Gibbs classifiers, regularized Bayesian inference, support vector machines

©2014 Jun Zhu, Ning Chen, Hugh Perkins and Bo Zhang.

1. Introduction

As the availability and scope of complex data increase, developing statistical tools to discover latent structures and reveal hidden explanatory factors has become a major theme in statistics and machine learning. Topic models represent one type of such useful tools to discover latent semantic structures that are organized in an automatically learned latent topic space, where each topic (i.e., a coordinate of the latent space) is a unigram distribution over the terms in a vocabulary. Due to its nice interpretability and extensibility, the Bayesian formulation of topic models (Blei et al., 2003) has motivated substantially broad extensions and applications to various fields, such as document analysis, image categorization (Fei-Fei and Perona, 2005), and network data analysis (Airoldi et al., 2008). Besides discovering latent topic representations, many models usually have a goal to make good predictions, such as relational topic models (Chang and Blei, 2009; Chen et al., 2013), whose major goal is to make accurate predictions on the link structures of a document network. Another example is supervised topic models, our focus in this paper, which learn a prediction model for regression and classification tasks.
As supervising information (e.g., user-input rating scores for product reviews) gets easier to obtain on the Web, developing supervised latent topic models has attracted a lot of attention. Both maximum likelihood estimation (MLE) and max-margin learning have been applied to learn supervised topic models. Different from the MLE-based approaches (Blei and McAuliffe, 2007), which define a normalized likelihood model for response variables, max-margin supervised topic models, such as maximum entropy discrimination LDA (MedLDA) (Zhu et al., 2012), directly minimize a margin-based loss derived from an expected (or averaging) prediction rule. By performing discriminative learning, max-margin supervised topic models can discover predictive latent topic representations and have shown promising performance in various prediction tasks, such as text document categorization (Zhu et al., 2012) and image annotation (Yang et al., 2010). However, their learning problems are generally hard to solve due to the non-smoothness of the margin-based loss function. Most existing solvers rely on a variational approximation scheme with strict mean-field assumptions on posterior distributions, and they normally need to solve multiple latent SVM subproblems in an EM-type iterative procedure. By showing a new interpretation of MedLDA as a regularized Bayesian inference method, the recent work (Jiang et al., 2012) successfully developed Monte Carlo methods for such max-margin topic models, with a weaker mean-field assumption. Though the prediction performance is improved because of more accurate inference, the Monte Carlo methods still need to solve multiple SVM subproblems. Thus, their efficiency could be limited as learning SVMs is normally computationally demanding.
Furthermore, due to the dependence on SVM solvers, it is not easy to parallelize these algorithms for large-scale data analysis tasks, although substantial efforts have been made to develop parallel Monte Carlo methods for unsupervised topic models (Newman et al., 2009; Smola and Narayanamurthy, 2010; Ahmed et al., 2012). This paper presents Gibbs MedLDA, an alternative formulation of max-margin supervised topic models, for which we can develop simple and efficient inference algorithms. Technically, instead of minimizing the margin loss of an expected (averaging) prediction rule as adopted in existing max-margin topic models, Gibbs MedLDA minimizes the expected margin loss of many latent prediction rules, of which each rule corresponds to a configuration of topic assignments and the prediction model, drawn from a post-data posterior distribution. Theoretically, the expected margin loss is an upper bound of the existing margin loss of an expected prediction rule. Computationally, although the expected margin loss can still be hard to optimize using variational algorithms, we successfully develop simple and fast Gibbs sampling algorithms without any restrictive assumptions on the posterior distribution and without solving multiple latent SVM subproblems. By introducing a set of auxiliary variables and integrating out the Dirichlet variables by conjugacy, each of the sampling substeps has a closed-form conditional distribution, from which samples can be efficiently drawn. Our algorithms represent an extension of the classical ideas of data augmentation (Dempster et al., 1977; Tanner and Wong, 1987; van Dyk and Meng, 2001) and its recent developments in learning fully observed max-margin classifiers (Polson and Scott, 2011) to learn the sophisticated latent topic models.
On the other hand, Gibbs MedLDA represents a generalization of Gibbs (or stochastic) classifiers (McAllester, 2003; Catoni, 2007; Germain et al., 2009) to incorporate a hierarchy of latent variables for discovering latent topic representations. We further generalize the ideas to develop a Gibbs MedLDA regression model and a multi-task Gibbs MedLDA model, for which we also develop efficient augment-and-collapse Gibbs sampling algorithms by exploring the same ideas of data augmentation. Empirical results on real data sets demonstrate significant improvements in time efficiency. The classification performance is also significantly improved in binary, multi-class, and multi-label classification tasks.

The rest of the paper is structured as follows. Section 2 summarizes some related work. Section 3 reviews MedLDA and its EM-type algorithms. Section 4 presents Gibbs MedLDA and its sampling algorithms for classification. Section 5 presents two extensions of Gibbs MedLDA for regression and multi-task learning. Section 6 presents empirical results. Finally, Section 7 concludes and discusses future directions.

2. Related Work

Max-margin learning has been very successful in building classifiers (Vapnik, 1995) and structured output prediction models (Taskar et al., 2003) in the last decade. Recently, research on learning max-margin models in the presence of latent variable models has received increasing attention because of the promise of using latent variables to capture the underlying structures of the complex problems. Deterministic approaches (Yu and Joachims, 2009) fill in the unknown values of the hidden structures by using some estimates (e.g., MAP estimates), and then a max-margin loss function is defined with the filled-in hidden structures, while probabilistic approaches aim to infer an entire distribution profile of the hidden structures given evidence and some prior distribution, following the Bayes way of thinking.
Though the former works well in practice, we focus on Bayesian approaches, which can naturally incorporate prior beliefs, maintain the entire distribution profile of latent structures, and be extensible to nonparametric methods. One representative work along this line is maximum entropy discrimination (MED) (Jaakkola et al., 1999; Jebara, 2001), which learns a distribution of model parameters given a set of labeled training data. MedLDA (Zhu et al., 2012) is one extension of MED to infer hidden topical structures from data, and MMH (max-margin Harmoniums) (Chen et al., 2012) is another extension that infers the hidden semantic features from multi-view data. Along similar lines, recent work has also successfully developed nonparametric Bayesian max-margin models, such as infinite SVMs (iSVM) (Zhu et al., 2011) for discovering clustering structures when building SVM classifiers and infinite latent SVMs (iLSVM) (Zhu et al., 2014) for automatically learning predictive features for SVM classifiers. Both iSVM and iLSVM can automatically resolve the model complexity, for example, the number of components in a mixture model or the number of latent features in a factor analysis model. The nonparametric Bayesian max-margin ideas have been proven to be effective in dealing with more challenging problems, such as link prediction in social networks (Zhu, 2012) and low-rank matrix factorization for collaborative recommendation (Xu et al., 2012, 2013). One common challenge of these Bayesian max-margin latent variable models is on the posterior inference, which is normally intractable. Almost all the existing work adopts a variational approximation scheme, with some mean-field assumptions. Very little research has been done on developing Monte Carlo methods, except the work (Jiang et al., 2012), which still makes mean-field assumptions.
The work in the present paper provides a novel way to formulate Bayesian max-margin models and we show that these new formulations can have very simple and efficient Monte Carlo inference algorithms without making restrictive assumptions. The key step to deriving our algorithms is a data augmentation formulation of the expected margin-based loss. Other work on inferring the posterior distributions of latent variables includes max-margin min-entropy models (Miller et al., 2012), which learn a single set of model parameters, different from our focus of inferring the model posterior distribution. Data augmentation refers to methods of augmenting the observed data so as to make it easy to analyze with an iterative optimization or sampling algorithm. For deterministic algorithms, the technique has been popularized in the statistics community by the seminal expectation-maximization (EM) algorithm (Dempster et al., 1977) for maximum likelihood estimation (MLE) with missing values. For stochastic algorithms, the technique has been popularized in statistics by Tanner and Wong's data augmentation algorithm for posterior sampling (Tanner and Wong, 1987) and in physics by Swendsen and Wang's sampling algorithms for Ising and Potts models (Swendsen and Wang, 1987). When using the idea to solve estimation or posterior inference problems, the key step is to find a set of augmented variables, conditioned on which the distribution of our models can be easily sampled. The speed of mixing or convergence is another important concern when designing a data augmentation method. While the conflict between simplicity and speed is a common phenomenon with many standard augmentation schemes, some work has demonstrated that with more creative augmentation schemes it is possible to construct EM-type algorithms (Meng and van Dyk, 1997) or Markov chain Monte Carlo methods (known as slice sampling) (Neal, 1997) that are both fast and simple.
We refer the readers to van Dyk and Meng (2001) for an excellent review of the broad literature of data augmentation and an effective search strategy for selecting good augmentation schemes. For our focus on max-margin classifiers, the recent work (Polson and Scott, 2011) provides an elegant data augmentation formulation for support vector machines (SVM) with fully observed input data, which leads to analytical conditional distributions that are easy to sample from and fast to mix. Our work in the present paper builds on the method of Polson and Scott and presents a successful implementation of data augmentation to deal with the challenging posterior inference problems of Bayesian max-margin latent topic models. Our approach can be generalized to deal with other Bayesian max-margin latent variable models, for example, max-margin matrix factorization (Xu et al., 2013), as reviewed above. Finally, some preliminary results were presented in a conference paper (Zhu et al., 2013a). This paper presents a full extension.

3. MedLDA

We begin with a brief overview of MedLDA and its learning algorithms, which motivate our developments of Gibbs MedLDA.

3.1 MedLDA: A Regularized Bayesian Model

We consider binary classification with a labeled training set $\mathcal{D} = \{(\mathbf{w}_d, y_d)\}_{d=1}^D$, where $\mathbf{w}_d = \{w_{dn}\}_{n=1}^{N_d}$ denotes the bag-of-words appearing in document $d$ and the response variable $y_d$ takes values from the output space $\mathcal{Y} = \{-1, +1\}$. $D$ is the data set size. Basically, MedLDA consists of two parts: an LDA model for describing input documents $\mathbf{W} = \{\mathbf{w}_d\}_{d=1}^D$, and an expected classifier for considering the supervising signal $\mathbf{y} = \{y_d\}_{d=1}^D$. Below, we introduce each of them in turn.

LDA: Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model that posits each document as an admixture of $K$ topics, where each topic $\Phi_k$ is a multinomial distribution over a $V$-word vocabulary.
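As a concrete illustration of this admixture view, the generative process reviewed next (topics drawn from $\mathrm{Dir}(\beta)$, per-document proportions from $\mathrm{Dir}(\alpha)$, then topic and word draws) can be sketched in a few lines. This is only a toy sketch: the function name, sizes, and hyperparameter values below are our own illustrative choices, not values from the paper.

```python
import numpy as np

def generate_lda_corpus(D=20, K=4, V=50, Nd=30, alpha=0.1, beta=0.01, seed=0):
    """Toy sketch of LDA's generative process (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(V, beta), size=K)        # topics: Phi_k ~ Dir(beta)
    W, Z = [], []
    for d in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))       # proportions: theta_d ~ Dir(alpha)
        z_d = rng.choice(K, size=Nd, p=theta_d)          # assignments: z_dn ~ Mult(theta_d)
        w_d = np.array([rng.choice(V, p=Phi[k]) for k in z_d])  # words: w_dn ~ Mult(Phi_{z_dn})
        W.append(w_d)
        Z.append(z_d)
    return W, Z, Phi
```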
For document $d$, the generating process can be described as:

1. draw a topic proportion $\theta_d \sim \mathrm{Dir}(\alpha)$
2. for each word $n$ ($1 \le n \le N_d$):
   (a) draw a topic assignment¹ $z_{dn} \mid \theta_d \sim \mathrm{Mult}(\theta_d)$
   (b) draw the observed word $w_{dn} \mid z_{dn}, \Phi \sim \mathrm{Mult}(\Phi_{z_{dn}})$

where $\mathrm{Dir}(\cdot)$ is a Dirichlet distribution; $\mathrm{Mult}(\cdot)$ is a multinomial; and $\Phi_{z_{dn}}$ denotes the topic selected by the non-zero entry of $z_{dn}$. For a fully-Bayesian LDA, the topics are random samples drawn from a prior, for example, $\Phi_k \sim \mathrm{Dir}(\beta)$.

Given a set of documents $\mathbf{W}$, we let $\mathbf{z}_d = \{z_{dn}\}_{n=1}^{N_d}$ denote the set of topic assignments for document $d$ and let $\mathbf{Z} = \{\mathbf{z}_d\}_{d=1}^D$ and $\Theta = \{\theta_d\}_{d=1}^D$ denote all the topic assignments and mixing proportions for the whole corpus, respectively. Then, LDA infers the posterior distribution using Bayes' rule

$$p(\Theta, \mathbf{Z}, \Phi \mid \mathbf{W}) = \frac{p_0(\Theta, \mathbf{Z}, \Phi)\, p(\mathbf{W} \mid \mathbf{Z}, \Phi)}{p(\mathbf{W})},$$

where $p_0(\Theta, \mathbf{Z}, \Phi) = \prod_{k=1}^K p_0(\Phi_k \mid \beta) \prod_{d=1}^D p_0(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)$ according to the generating process of LDA; and $p(\mathbf{W})$ is the marginal evidence. We can show that the posterior distribution by Bayes' rule is the solution of an information theoretical optimization problem

$$\min_{q(\Theta, \mathbf{Z}, \Phi)} \mathrm{KL}\left[q(\Theta, \mathbf{Z}, \Phi) \,\|\, p_0(\Theta, \mathbf{Z}, \Phi)\right] - \mathbb{E}_q\left[\log p(\mathbf{W} \mid \mathbf{Z}, \Phi)\right] \quad \text{s.t.:} \ q(\Theta, \mathbf{Z}, \Phi) \in \mathcal{P}, \quad (1)$$

where $\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence and $\mathcal{P}$ is the space of probability distributions with an appropriate dimension. In fact, if we add the constant $\log p(\mathbf{W})$ to the objective, the problem is the minimization of the KL-divergence $\mathrm{KL}(q(\Theta, \mathbf{Z}, \Phi) \,\|\, p(\Theta, \mathbf{Z}, \Phi \mid \mathbf{W}))$, whose solution is the desired posterior distribution by Bayes' rule.

1. $z_{dn}$ is a $K$-dimensional binary vector with only one nonzero entry.

One advantage of this variational formulation of Bayesian inference is that it can be naturally extended to include some regularization terms on the desired post-data posterior distribution $q$. This insight has been taken to develop regularized Bayesian inference (RegBayes) (Zhu et al., 2014), a computational framework for doing Bayesian inference with posterior regularization.² As shown in Jiang et al.
(2012) and detailed below, MedLDA is one example of RegBayes models. Moreover, as we shall see in Section 4, our Gibbs max-margin topic models follow this similar idea too.

Expected Classifier: Given a training set $\mathcal{D}$, an expected (or averaging) classifier chooses a posterior distribution $q(h \mid \mathcal{D})$ over a hypothesis space $\mathcal{H}$ of classifiers such that the $q$-weighted (expected) classifier $h_q(\mathbf{w}) = \mathrm{sign}\, \mathbb{E}_q[h(\mathbf{w})]$ will have the smallest possible risk. MedLDA follows this principle to learn a posterior distribution $q(\eta, \Theta, \mathbf{Z}, \Phi \mid \mathcal{D})$ such that the expected classifier

$$\hat{y} = \mathrm{sign}\, F(\mathbf{w}) \quad (2)$$

has the smallest possible risk, approximated by the training error $R_{\mathcal{D}}(q) = \sum_{d=1}^D \mathbb{I}(\hat{y}_d \neq y_d)$. The discriminant function is defined as

$$F(\mathbf{w}) = \mathbb{E}_{q(\eta, \mathbf{z} \mid \mathcal{D})}[F(\eta, \mathbf{z}; \mathbf{w})], \quad F(\eta, \mathbf{z}; \mathbf{w}) = \eta^\top \bar{\mathbf{z}},$$

where $\bar{\mathbf{z}}$ is the average topic assignment associated with the words $\mathbf{w}$, a vector with element $\bar{z}_k = \frac{1}{N} \sum_{n=1}^N z_n^k$, and $\eta$ is the vector of classifier weights. Note that the expected classifier and the LDA likelihood are coupled via the latent topic assignments $\mathbf{Z}$. The strong coupling makes it possible for MedLDA to learn a posterior distribution that can describe the observed words well and make accurate predictions.

Regularized Bayesian Inference: To integrate the above two components for hybrid learning, MedLDA regularizes the properties of the topic representations by imposing the following max-margin constraints derived from the classifier (2) on the standard LDA inference problem (1):

$$y_d F(\mathbf{w}_d) \ge \ell - \xi_d, \ \forall d,$$

where $\ell$ ($\ge 1$) is the cost of making a wrong prediction, and $\xi = \{\xi_d\}_{d=1}^D$ are non-negative slack variables for inseparable cases. Let $\mathcal{L}(q) = \mathrm{KL}(q \,\|\, p_0(\eta, \Theta, \mathbf{Z}, \Phi)) - \mathbb{E}_q[\log p(\mathbf{W} \mid \mathbf{Z}, \Phi)]$ be the objective for doing standard Bayesian inference with the classifier $\eta$ and $p_0(\eta, \Theta, \mathbf{Z}, \Phi) = p_0(\eta) p_0(\Theta, \mathbf{Z}, \Phi)$. MedLDA solves the regularized Bayesian inference (Zhu et al., 2014) problem

$$\min_{q(\eta, \Theta, \mathbf{Z}, \Phi) \in \mathcal{P},\, \xi} \ \mathcal{L}(q(\eta, \Theta, \mathbf{Z}, \Phi)) + 2c \sum_{d=1}^D \xi_d \quad \text{s.t.:} \ y_d F(\mathbf{w}_d) \ge \ell - \xi_d, \ \xi_d \ge 0, \ \forall d, \quad (3)$$

2. Posterior regularization was first used in Ganchev et al.
(2010) for maximum likelihood estimation and was later extended in Zhu et al. (2014) to Bayesian and nonparametric Bayesian methods.

where the margin constraints directly regularize the properties of the post-data distribution and $c$ is the positive regularization parameter. Equivalently, MedLDA solves the unconstrained problem³

$$\min_{q(\eta, \Theta, \mathbf{Z}, \Phi)} \ \mathcal{L}(q(\eta, \Theta, \mathbf{Z}, \Phi)) + 2c\, \mathcal{R}(q(\eta, \Theta, \mathbf{Z}, \Phi)), \quad (4)$$

where $\mathcal{R}(q) = \sum_{d=1}^D \max(0, \ell - y_d F(\mathbf{w}_d))$ is the hinge loss that upper-bounds the training error $R_{\mathcal{D}}(q)$ of the expected classifier (2). Note that the constant 2 is included simply for convenience.

3.2 Existing Iterative Algorithms

Since it is difficult to solve problem (3) or (4) directly because of the non-conjugacy (between priors and likelihood) and the max-margin constraints, corresponding to a non-smooth posterior regularization term in (4), both variational and Monte Carlo methods have been developed for approximate solutions. It can be shown that the variational method (Zhu et al., 2012) is a coordinate descent algorithm to solve problem (4) with the fully-factorized assumption that $q(\eta, \Theta, \mathbf{Z}, \Phi) = q(\eta)\, q(\Theta)\, q(\mathbf{Z})\, q(\Phi)$, while the Monte Carlo methods (Jiang et al., 2012) make the weaker assumption that $q(\eta, \Theta, \mathbf{Z}, \Phi) = q(\eta)\, q(\Theta, \mathbf{Z}, \Phi)$. All these methods have a similar EM-type iterative procedure, which solves many latent SVM subproblems, as outlined below.

Estimate $q(\eta)$: Given $q(\Theta, \mathbf{Z}, \Phi)$, we solve problem (4) with respect to $q(\eta)$. In the equivalent constrained form, this step solves

$$\min_{q(\eta), \xi} \ \mathrm{KL}(q(\eta) \,\|\, p_0(\eta)) + 2c \sum_{d=1}^D \xi_d \quad \text{s.t.:} \ y_d\, \mathbb{E}_q[\eta]^\top \mathbb{E}_q[\bar{\mathbf{z}}_d] \ge \ell - \xi_d, \ \xi_d \ge 0, \ \forall d. \quad (5)$$

This problem is convex and can be solved with Lagrangian methods. Specifically, let $\mu_d$ be the Lagrange multipliers, one per constraint. When the prior $p_0(\eta)$ is the commonly used standard normal distribution, we have the optimum solution $q(\eta) = \mathcal{N}(\kappa, I)$, where $\kappa = \sum_{d=1}^D y_d \mu_d\, \mathbb{E}_q[\bar{\mathbf{z}}_d]$.
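To make this step concrete: with a standard normal prior, the update for $q(\eta)$ amounts to a linear-SVM-style primal in $\kappa$, namely minimizing $\frac{1}{2}\|\kappa\|^2 + 2c \sum_d \max(0, \ell - y_d \kappa^\top \mathbb{E}_q[\bar{\mathbf{z}}_d])$. The sketch below stands in for a real SVM solver with plain subgradient descent on made-up expected topic features; the function name, step-size schedule, and toy data are all our own illustrative choices.

```python
import numpy as np

def estimate_q_eta_mean(zbar_mean, y, c=1.0, ell=1.0, steps=2000, lr=0.05):
    """Sketch of the q(eta) step: with p0(eta) = N(0, I) the optimum is
    q(eta) = N(kappa, I), where kappa minimizes the SVM-style primal
        0.5 * ||kappa||^2 + 2c * sum_d max(0, ell - y_d * kappa @ E[zbar_d]).
    Decaying-step subgradient descent stands in for an SVM solver here."""
    kappa = np.zeros(zbar_mean.shape[1])
    for t in range(steps):
        margins = y * (zbar_mean @ kappa)
        active = margins < ell                    # violated or on-margin documents
        grad = kappa - 2.0 * c * ((y * active)[:, None] * zbar_mean).sum(axis=0)
        kappa -= lr / (1.0 + 0.01 * t) * grad
    return kappa

# toy expected topic features E[zbar_d] for four documents, labels +/-1
zbar = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
kappa = estimate_q_eta_mean(zbar, y)
```

On this separable toy set the learned mean $\kappa$ classifies all four documents correctly.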
It can be shown that the dual problem of (5) is the dual of a standard binary linear SVM, and we can solve it or its primal form efficiently using existing high-performance SVM learners. We denote the optimum solution of this problem by $(q^*(\eta), \kappa^*, \xi^*, \mu^*)$.

3. If not specified, $q$ is subject to the constraint $q \in \mathcal{P}$.

Estimate $q(\Theta, \mathbf{Z}, \Phi)$: Given $q(\eta)$, we solve problem (4) with respect to $q(\Theta, \mathbf{Z}, \Phi)$. In the constrained form, this step solves

$$\min_{q(\Theta, \mathbf{Z}, \Phi), \xi} \ \mathcal{L}(q(\Theta, \mathbf{Z}, \Phi)) + 2c \sum_{d=1}^D \xi_d \quad \text{s.t.:} \ y_d (\kappa^*)^\top \mathbb{E}_q[\bar{\mathbf{z}}_d] \ge \ell - \xi_d, \ \xi_d \ge 0, \ \forall d. \quad (6)$$

Although we can solve this problem using Lagrangian methods, it would be hard to derive the dual objective. An effective approximation strategy was used in Zhu et al. (2012) and Jiang et al. (2012), which updates $q(\Theta, \mathbf{Z}, \Phi)$ for only one step with $\xi$ fixed at $\xi^*$. By fixing $\xi$ at $\xi^*$, we have the solution

$$q(\Theta, \mathbf{Z}, \Phi) \propto p(\mathbf{W}, \Theta, \mathbf{Z}, \Phi)\, \exp\Big( \sum_{d=1}^D y_d \mu_d^* (\kappa^*)^\top \bar{\mathbf{z}}_d \Big),$$

where the second term indicates the regularization effects due to the max-margin posterior constraints. For those data with non-zero Lagrange multipliers (i.e., support vectors), the second term will bias MedLDA towards a new posterior distribution that favors more discriminative representations on these hard data points. The Monte Carlo methods (Jiang et al., 2012) directly draw samples from the posterior distribution $q(\Theta, \mathbf{Z}, \Phi)$ or its collapsed form using Gibbs sampling to estimate $\mathbb{E}_q[\bar{\mathbf{z}}_d]$, the expectations required to learn $q(\eta)$. In contrast, the variational methods (Zhu et al., 2012) solve problem (6) using coordinate descent to estimate $\mathbb{E}_q[\bar{\mathbf{z}}_d]$ with a fully factorized assumption.

4. Gibbs MedLDA

Now, we present Gibbs max-margin topic models for binary classification and their augment-and-collapse sampling algorithms. We will discuss further extensions in the next section.

4.1 Learning with an Expected Margin Loss

As stated above, MedLDA chooses the strategy to minimize the hinge loss of an expected classifier.
In learning theory, an alternative approach to building classifiers with a posterior distribution of models is to minimize an expected loss, under the framework known as Gibbs classifiers (or stochastic classifiers) (McAllester, 2003; Catoni, 2007; Germain et al., 2009), which have nice theoretical properties on generalization performance. For our case of inferring the distribution of latent topic assignments $\mathbf{Z} = \{\mathbf{z}_d\}_{d=1}^D$ and the classification model $\eta$, the expected margin loss is defined as follows. If we have drawn a sample of the topic assignments $\mathbf{Z}$ and the prediction model $\eta$ from a posterior distribution $q(\eta, \mathbf{Z})$, we can define the linear discriminant function $F(\eta, \mathbf{z}; \mathbf{w}) = \eta^\top \bar{\mathbf{z}}$ as before and make prediction using the latent prediction rule

$$\hat{y}(\eta, \mathbf{z}) = \mathrm{sign}\, F(\eta, \mathbf{z}; \mathbf{w}). \quad (7)$$

Note that the prediction is a function of the configuration $(\eta, \mathbf{z})$. Let $\zeta_d = \ell - y_d \eta^\top \bar{\mathbf{z}}_d$, where $\ell$ is a cost parameter as defined before. The hinge loss of the stochastic classifier is

$$\mathcal{R}(\eta, \mathbf{Z}) = \sum_{d=1}^D \max(0, \zeta_d),$$

a function of the latent variables $(\eta, \mathbf{Z})$, and the expected hinge loss is

$$\mathcal{R}'(q) = \mathbb{E}_q[\mathcal{R}(\eta, \mathbf{Z})] = \sum_{d=1}^D \mathbb{E}_q[\max(0, \zeta_d)],$$

a functional of the posterior distribution $q(\eta, \mathbf{Z})$. Since, for any $(\eta, \mathbf{Z})$, the hinge loss $\mathcal{R}(\eta, \mathbf{Z})$ is an upper bound of the training error of the latent Gibbs classifier (7), that is,

$$\mathcal{R}(\eta, \mathbf{Z}) \ge \sum_{d=1}^D \mathbb{I}(y_d \neq \hat{y}_d(\eta, \mathbf{z}_d)),$$

we also have

$$\mathcal{R}'(q) \ge \sum_{d=1}^D \mathbb{E}_q[\mathbb{I}(y_d \neq \hat{y}_d(\eta, \mathbf{z}_d))],$$

where $\mathbb{I}(\cdot)$ is an indicator function that equals 1 if the predicate holds and 0 otherwise. In other words, the expected hinge loss $\mathcal{R}'(q)$ is an upper bound of the expected training error of the Gibbs classifier (7). Thus, it is a good surrogate loss for learning a posterior distribution which could lead to a low training error in expectation.
Then, with the same goal as MedLDA of finding a posterior distribution $q(\eta, \Theta, \mathbf{Z}, \Phi)$ that on one hand describes the observed data and on the other hand predicts as well as possible on training data, we define Gibbs MedLDA as solving the new regularized Bayesian inference problem

$$\min_{q(\eta, \Theta, \mathbf{Z}, \Phi)} \ \mathcal{L}(q(\eta, \Theta, \mathbf{Z}, \Phi)) + 2c\, \mathcal{R}'(q(\eta, \Theta, \mathbf{Z}, \Phi)). \quad (8)$$

Note that we have written the expected margin loss $\mathcal{R}'$ as a function of the complete distribution $q(\eta, \Theta, \mathbf{Z}, \Phi)$. This does not conflict with our definition of $\mathcal{R}'$ as a function of the marginal distribution $q(\eta, \mathbf{Z})$ because the other irrelevant variables (i.e., $\Theta$ and $\Phi$) are integrated out when we compute the expectation. Comparing to MedLDA in problem (4), we have the following lemma by applying Jensen's inequality.

Lemma 1 The expected hinge loss $\mathcal{R}'$ is an upper bound of the hinge loss of the expected classifier (2):

$$\mathcal{R}'(q) \ge \mathcal{R}(q) = \sum_{d=1}^D \max(0, \mathbb{E}_q[\zeta_d]);$$

and thus the objective in (8) is an upper bound of that in (4) when the $c$ values are the same.

4.2 Formulation with Data Augmentation

If we directly solve problem (8), the expected hinge loss $\mathcal{R}'$ is hard to deal with because of the non-differentiable max function. Fortunately, we can develop a simple collapsed Gibbs sampling algorithm with analytical forms of local conditional distributions, based on a data augmentation formulation of the expected hinge loss. Let $\phi(y_d \mid \mathbf{z}_d, \eta) = \exp\{-2c \max(0, \zeta_d)\}$ be the unnormalized likelihood of the response variable for document $d$. Then, problem (8) can be written as

$$\min_{q(\eta, \Theta, \mathbf{Z}, \Phi)} \ \mathcal{L}(q(\eta, \Theta, \mathbf{Z}, \Phi)) - \mathbb{E}_q[\log \phi(\mathbf{y} \mid \mathbf{Z}, \eta)], \quad (9)$$

where $\phi(\mathbf{y} \mid \mathbf{Z}, \eta) = \prod_{d=1}^D \phi(y_d \mid \mathbf{z}_d, \eta)$. Solving problem (9) with the constraint that $q(\eta, \Theta, \mathbf{Z}, \Phi) \in \mathcal{P}$, we can get the normalized posterior distribution

$$q(\eta, \Theta, \mathbf{Z}, \Phi) = \frac{p_0(\eta, \Theta, \mathbf{Z}, \Phi)\, p(\mathbf{W} \mid \mathbf{Z}, \Phi)\, \phi(\mathbf{y} \mid \mathbf{Z}, \eta)}{\psi(\mathbf{y}, \mathbf{W})},$$

where $\psi(\mathbf{y}, \mathbf{W})$ is the normalization constant that depends on the observed data only.
Due to the complicated form of $\phi$, it will not have simple conditional distributions if we want to derive a Gibbs sampling algorithm for $q(\eta, \Theta, \mathbf{Z}, \Phi)$ directly. This motivates our exploration of data augmentation techniques. Specifically, using the ideas of data augmentation (Tanner and Wong, 1987; Polson and Scott, 2011), we have Lemma 2.

Lemma 2 (Scale Mixture Representation) The unnormalized likelihood can be expressed as

$$\phi(y_d \mid \mathbf{z}_d, \eta) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_d}} \exp\left( -\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d} \right) d\lambda_d.$$

Proof Due to the fact that $a \max(0, x) = \max(0, ax)$ if $a \ge 0$, we have $2c \max(0, \zeta_d) = 2 \max(0, c\zeta_d)$. Then, we can follow the proof in Polson and Scott (2011) to get the results.

Lemma 2 indicates that the posterior distribution of Gibbs MedLDA can be expressed as the marginal of a higher-dimensional distribution that includes the augmented variables $\lambda = \{\lambda_d\}_{d=1}^D$, that is,

$$q(\eta, \Theta, \mathbf{Z}, \Phi) = \int_0^\infty q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)\, d\lambda_1 \cdots d\lambda_D = \int_{\mathbb{R}_+^D} q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)\, d\lambda,$$

where $\mathbb{R}_+ = \{x : x \in \mathbb{R}, x > 0\}$ is the set of positive real numbers; the complete posterior distribution is

$$q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi) = \frac{p_0(\eta, \Theta, \mathbf{Z}, \Phi)\, p(\mathbf{W} \mid \mathbf{Z}, \Phi)\, \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)}{\psi(\mathbf{y}, \mathbf{W})},$$

and the unnormalized joint distribution of $\mathbf{y}$ and $\lambda$ is

$$\phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta) = \prod_{d=1}^D \frac{1}{\sqrt{2\pi\lambda_d}} \exp\left( -\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d} \right).$$

In fact, we can show that the complete posterior distribution is the solution of the data augmentation problem of Gibbs MedLDA

$$\min_{q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)} \ \mathcal{L}(q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)) - \mathbb{E}_q[\log \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)],$$

which is again subject to the normalization constraint that $q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi) \in \mathcal{P}$. The first term in the objective is $\mathcal{L}(q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)) = \mathrm{KL}(q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi) \,\|\, p_0(\eta, \Theta, \mathbf{Z}, \Phi)\, p_0(\lambda)) - \mathbb{E}_q[\log p(\mathbf{W} \mid \mathbf{Z}, \Phi)]$, where a prior distribution is imposed on the augmented variables $\lambda$. One good choice of $p_0(\lambda)$ is the noninformative uniform prior.

Remark 3 The objective of this augmented problem is an upper bound of the objective in (9) (thus, also an upper bound of MedLDA's objective due to Lemma 1).
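Lemma 2's identity can also be checked numerically: the left-hand side $e^{-2c\max(0,\zeta_d)}$ should match the integral on the right. Below is a small pure-Python sanity sketch; the quadrature bounds and grid size are arbitrary choices of ours, good enough for a few digits of agreement.

```python
import math

def hinge_pseudolik(zeta, c):
    # left-hand side of Lemma 2: exp{-2c * max(0, zeta_d)}
    return math.exp(-2.0 * c * max(0.0, zeta))

def scale_mixture(zeta, c, hi=80.0, n=200000):
    # right-hand side of Lemma 2: the integral over lambda_d in (0, inf) of
    # (2*pi*lambda)^(-1/2) * exp(-(lambda + c*zeta)^2 / (2*lambda)),
    # approximated by the trapezoid rule on (0, hi] (integrand -> 0 at 0)
    a = c * zeta
    h = hi / n
    total = 0.0
    for i in range(1, n + 1):
        lam = i * h
        f = math.exp(-((lam + a) ** 2) / (2.0 * lam)) / math.sqrt(2.0 * math.pi * lam)
        total += f if i < n else 0.5 * f
    return total * h
```

For a margin violation ($\zeta_d > 0$) both sides decay exponentially; for a satisfied margin ($\zeta_d \le 0$) both sides equal 1.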
Remark 3 holds because, by using the data augmentation, we can show that

$$\mathbb{E}_{q(\mathbf{V})}[\log \phi(\mathbf{y} \mid \mathbf{Z}, \eta)] = \mathbb{E}_{q(\mathbf{V})}\left[ \log \int_{\mathbb{R}_+^D} q(\lambda \mid \mathbf{V}) \frac{\phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)}{q(\lambda \mid \mathbf{V})}\, d\lambda \right] \ge \mathbb{E}_{q(\mathbf{V})}\big[ \mathbb{E}_{q(\lambda \mid \mathbf{V})}[\log \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)] - \mathbb{E}_{q(\lambda \mid \mathbf{V})}[\log q(\lambda \mid \mathbf{V})] \big] = \mathbb{E}_{q(\mathbf{V}, \lambda)}[\log \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)] - \mathbb{E}_{q(\mathbf{V}, \lambda)}[\log q(\lambda \mid \mathbf{V})],$$

where $\mathbf{V} = \{\eta, \Theta, \mathbf{Z}, \Phi\}$ denotes all the random variables in MedLDA and the inequality is due to Jensen's inequality. Therefore, we have

$$\mathcal{L}(q(\mathbf{V})) - \mathbb{E}_{q(\mathbf{V})}[\log \phi(\mathbf{y} \mid \mathbf{Z}, \eta)] \le \mathcal{L}(q(\mathbf{V})) - \mathbb{E}_{q(\mathbf{V}, \lambda)}[\log \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)] + \mathbb{E}_{q(\mathbf{V}, \lambda)}[\log q(\lambda \mid \mathbf{V})] = \mathcal{L}(q(\mathbf{V}, \lambda)) - \mathbb{E}_q[\log \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta)].$$

4.3 Inference with Collapsed Gibbs Sampling

Although with the above data augmentation formulation we can do Gibbs sampling to infer the complete posterior distribution $q(\eta, \lambda, \Theta, \mathbf{Z}, \Phi)$ and thus $q(\eta, \Theta, \mathbf{Z}, \Phi)$ by ignoring $\lambda$, the mixing speed would be slow because of the large sample space of the latent variables. One way to effectively reduce the sample space and improve mixing rates is to integrate out the intermediate Dirichlet variables $(\Theta, \Phi)$ and build a Markov chain whose equilibrium distribution is the resulting marginal distribution $q(\eta, \lambda, \mathbf{Z})$. We propose to use collapsed Gibbs sampling, which has been successfully used in LDA (Griffiths and Steyvers, 2004). With the data augmentation representation, this leads to our augment-and-collapse sampling algorithm for Gibbs MedLDA, as detailed below. For the data augmented formulation of Gibbs MedLDA, by integrating out the Dirichlet variables $(\Theta, \Phi)$, we get the collapsed posterior distribution:

$$q(\eta, \lambda, \mathbf{Z}) \propto p_0(\eta)\, p(\mathbf{W}, \mathbf{Z} \mid \alpha, \beta)\, \phi(\mathbf{y}, \lambda \mid \mathbf{Z}, \eta) = p_0(\eta) \prod_{d=1}^D \left[ \frac{\delta(\mathbf{C}_d + \alpha)}{\delta(\alpha)} \frac{1}{\sqrt{2\pi\lambda_d}} \exp\left( -\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d} \right) \right] \prod_{k=1}^K \frac{\delta(\mathbf{C}_k + \beta)}{\delta(\beta)},$$

where

$$\delta(\mathbf{x}) = \frac{\prod_{i=1}^{\dim(\mathbf{x})} \Gamma(x_i)}{\Gamma\left(\sum_{i=1}^{\dim(\mathbf{x})} x_i\right)};$$

$\Gamma(\cdot)$ is the Gamma function; $C_k^t$ is the number of times that the term $t$ is assigned to topic $k$ over the whole corpus; $\mathbf{C}_k = \{C_k^t\}_{t=1}^V$ is the set of word counts associated with topic $k$; $C_d^k$ is the number of times that terms are associated with topic $k$ within the $d$-th document; and $\mathbf{C}_d = \{C_d^k\}_{k=1}^K$ is the set of topic counts for document $d$. Then, the conditional distributions used in collapsed Gibbs sampling are as follows.
For $\eta$: Let us assume its prior is the commonly used isotropic Gaussian distribution $p_0(\eta) = \prod_{k=1}^K \mathcal{N}(\eta_k; 0, \nu^2)$, where $\nu$ is a non-zero parameter. Then, we have the conditional distribution of $\eta$ given the other variables:

$$q(\eta \mid \mathbf{Z}, \lambda) \propto p_0(\eta) \prod_{d=1}^D \exp\left( -\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d} \right) = \mathcal{N}(\eta; \mu, \Sigma), \quad (10)$$

a $K$-dimensional Gaussian distribution, where the posterior covariance matrix and mean are

$$\Sigma = \left( \frac{1}{\nu^2} I + c^2 \sum_{d=1}^D \frac{\bar{\mathbf{z}}_d \bar{\mathbf{z}}_d^\top}{\lambda_d} \right)^{-1}, \quad \mu = \Sigma \left( c \sum_{d=1}^D y_d \frac{\lambda_d + c\ell}{\lambda_d} \bar{\mathbf{z}}_d \right).$$

Therefore, we can easily draw a sample from this multivariate Gaussian distribution. The inverse can be robustly done using Cholesky decomposition, an $O(K^3)$ procedure. Since $K$ is normally not large, the inversion can be done efficiently, especially in applications where the number of documents is much larger than the number of topics.

For $\mathbf{Z}$: The conditional distribution of $\mathbf{Z}$ given the other variables is

$$q(\mathbf{Z} \mid \eta, \lambda) \propto \prod_{d=1}^D \left[ \frac{\delta(\mathbf{C}_d + \alpha)}{\delta(\alpha)} \exp\left( -\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d} \right) \right] \prod_{k=1}^K \frac{\delta(\mathbf{C}_k + \beta)}{\delta(\beta)}.$$

By canceling common factors, we can derive the conditional distribution of one variable $z_{dn}$ given the others $\mathbf{Z}_\neg$ as:

$$q(z_{dn}^k = 1 \mid \mathbf{Z}_\neg, \eta, \lambda, w_{dn} = t) \propto \frac{(C_{k, \neg n}^t + \beta_t)(C_{d, \neg n}^k + \alpha_k)}{\sum_{t'=1}^V C_{k, \neg n}^{t'} + \sum_{t'=1}^V \beta_{t'}} \exp\left( \frac{c\gamma\, y_d (c\ell + \lambda_d)\, \eta_k}{\lambda_d} - \frac{c^2 \left( \gamma^2 \eta_k^2 + 2\gamma(1 - \gamma)\, \eta_k \Lambda_{dn} \right)}{2\lambda_d} \right), \quad (11)$$

where $C_{\cdot, \neg n}$ indicates that word $n$ is excluded from the corresponding document or topic; $\gamma = \frac{1}{N_d}$; and

$$\Lambda_{dn} = \frac{1}{N_d - 1} \sum_{k'=1}^K \eta_{k'} C_{d, \neg n}^{k'}$$

is the discriminant function value without word $n$. We can see that the first term is from the LDA model for observed word counts and the second term is from the supervised signal $\mathbf{y}$.

Algorithm 1 Collapsed Gibbs sampling algorithm for Gibbs MedLDA classification models
1: Initialization: set $\lambda = 1$ and randomly draw $z_{dn}$ from a uniform distribution.
2: for $m = 1$ to $M$ do
3:   draw the classifier $\eta$ from the normal distribution (10)
4:   for $d = 1$ to $D$ do
5:     for each word $n$ in document $d$ do
6:       draw a topic from the multinomial distribution (11)
7:     end for
8:     draw $\lambda_d^{-1}$ (and thus $\lambda_d$) from the inverse Gaussian distribution (12)
9:   end for
10: end for
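A sketch of the $\eta$ step under these assumptions (isotropic prior, Eq. (10)): the Cholesky factor of the precision matrix is reused both to solve for $\mu$ and to draw the sample, which is the $O(K^3)$ operation mentioned above. The function name and toy sizes are invented for illustration.

```python
import numpy as np

def draw_eta(zbar, y, lam, c=1.0, ell=1.0, nu2=1.0, rng=None):
    """One Gibbs step for eta, Eq. (10): eta | Z, lambda ~ N(mu, Sigma) with
       Sigma^{-1} = (1/nu^2) I + c^2 * sum_d zbar_d zbar_d^T / lam_d,
       mu = Sigma @ (c * sum_d y_d (lam_d + c*ell)/lam_d * zbar_d).
    zbar is D x K (average topic assignments), y in {-1,+1}^D, lam > 0."""
    rng = np.random.default_rng() if rng is None else rng
    K = zbar.shape[1]
    prec = np.eye(K) / nu2 + c * c * (zbar.T * (1.0 / lam)) @ zbar
    b = c * ((y * (lam + c * ell) / lam)[:, None] * zbar).sum(axis=0)
    L = np.linalg.cholesky(prec)        # prec = L @ L.T, an O(K^3) step
    mu = np.linalg.solve(prec, b)       # posterior mean
    # if x ~ N(0, I) then mu + L^{-T} x ~ N(mu, prec^{-1})
    return mu + np.linalg.solve(L.T, rng.standard_normal(K))
```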
For \lambda: Finally, the conditional distribution of the augmented variables \lambda given the other variables is factorized, and we can derive the conditional distribution for each \lambda_d as

q(\lambda_d|Z, \eta) \propto \frac{1}{\sqrt{2\pi\lambda_d}} \exp\Big(-\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d}\Big) \propto \lambda_d^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}\Big(\lambda_d + \frac{c^2\zeta_d^2}{\lambda_d}\Big)\Big) = GIG\Big(\lambda_d; \frac{1}{2}, 1, c^2\zeta_d^2\Big),

where GIG(x; p, a, b) = C(p, a, b)\, x^{p-1} \exp\big(-\frac{1}{2}(\frac{b}{x} + ax)\big) is a generalized inverse Gaussian distribution (Devroye, 1986) and C(p, a, b) is a normalization constant. Therefore, we can derive that \lambda_d^{-1} follows an inverse Gaussian distribution

p(\lambda_d^{-1}|Z, \eta) = IG\Big(\lambda_d^{-1}; \frac{1}{c|\zeta_d|}, 1\Big),   (12)

where IG(x; a, b) = \sqrt{\frac{b}{2\pi x^3}} \exp\big(-\frac{b(x-a)^2}{2a^2 x}\big) for a > 0 and b > 0.

With the above conditional distributions, we can construct a Markov chain which iteratively draws samples of the classifier weights \eta using Equation (10), the topic assignments Z using Equation (11), and the augmented variables \lambda using Equation (12), with an initial condition. To sample from an inverse Gaussian distribution, we apply the transformation method with multiple roots (Michael et al., 1976), which is very efficient, with a constant time complexity. Overall, the per-iteration time complexity is O(K^3 + N_{total} K), where N_{total} = \sum_{d=1}^D N_d is the total number of words in all documents. If K is not very large (e.g., K \ll \sqrt{N_{total}}), which is the common case in practice as N_{total} is often very large, the per-iteration time complexity is O(N_{total} K); if K is very large (e.g., K \gg \sqrt{N_{total}}), which is not common in practice, drawing the global classifier weights will dominate and the per-iteration time complexity is O(K^3). In our experiments, we initially set \lambda_d = 1, \forall d, and randomly draw Z from a uniform distribution. In training, we run this Markov chain to finish the burn-in stage with M iterations, as outlined in Algorithm 1. Then, we draw a sample \hat{\eta} as the Gibbs classifier to make predictions on testing data.
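The constant-time transformation method of Michael et al. (1976) is simple to implement. Below is a sketch in NumPy; the function names and the requirement that \zeta_d be nonzero are our own choices:

```python
import numpy as np

def draw_inverse_gaussian(mean, shape, rng=np.random):
    """One draw from IG(x; mean, shape) via the transformation method
    with multiple roots (Michael et al., 1976): one normal and one
    uniform variate, i.e., constant expected time."""
    y = rng.standard_normal() ** 2
    x = (mean + mean * mean * y / (2.0 * shape)
         - (mean / (2.0 * shape)) * np.sqrt(4.0 * mean * shape * y + (mean * y) ** 2))
    # Pick between the two roots x and mean^2/x with the correct probability
    if rng.uniform() <= mean / (mean + x):
        return x
    return mean * mean / x

def draw_lambda(zeta, c=1.0, rng=np.random):
    """Sample lambda_d by drawing lambda_d^{-1} ~ IG(1/(c|zeta_d|), 1)
    as in Equation (12); zeta must be nonzero."""
    return 1.0 / draw_inverse_gaussian(1.0 / (c * abs(zeta)), 1.0, rng)
```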
In general, there is no theoretical guarantee that a Markov chain constructed using data augmentation converges to the target distribution; see Hobert (2011) for a failure example. However, for our algorithms, we can verify that the Markov transition distribution of the chain satisfies Condition K of Hobert (2011), that is, the transition probability from one state to any other state is larger than 0. Condition K implies that the Markov chain is Harris ergodic (Tan, 2009, Lemma 1). Therefore, no matter how the chain is started, our sampling algorithms can be employed to effectively explore the intractable posterior distribution. In practice, this sampling algorithm, as well as the ones to be presented, requires only a few iterations to reach stable prediction performance, as we shall see in Section 6.5.1. More theoretical analysis, such as of convergence rates, requires substantial technical Markov chain theory and is left for future work.

4.4 Prediction

To apply the Gibbs classifier \hat{\eta}, we need to infer the topic assignments for a testing document, denoted by w. A fully Bayesian treatment needs to compute an integral in order to get the posterior distribution of the topic assignments given the training data D and the testing document content w:

p(z|w, D) \propto \int_{\mathcal{P}_V^K} p(z, w, \Phi|D)\, d\Phi = \int_{\mathcal{P}_V^K} p(z, w|\Phi)\, p(\Phi|D)\, d\Phi,

where \mathcal{P}_V is the (V-1)-dimensional simplex, and the second equality holds due to the conditional independence assumption of the documents given the topics. Various approximation methods can be applied to compute the integral. Here, we take the approach applied in Zhu et al. (2012), which uses a point estimate of the topics \Phi from the training data and makes predictions based on it. Specifically, we use a point estimate \hat{\Phi} (a Dirac measure) to approximate the probability distribution p(\Phi|D). For the collapsed Gibbs sampler, an estimate of \hat{\Phi} using the samples is the posterior mean

\hat{\phi}_{kt} \propto C_k^t + \beta_t.
Then, given a testing document w, we infer its latent components z using \hat{\Phi} by drawing samples from the local conditional distribution

p(z_n^k = 1|z_{\neg n}, w, D) \propto \hat{\phi}_{k w_n} (C_{\neg n}^k + \alpha_k),   (13)

where C_{\neg n}^k is the number of times that the terms in this document w are assigned to topic k, with the n-th term excluded. To start the sampler, we randomly assign each word to one topic. Then, we run the Gibbs sampler for a few iterations until some stopping criterion is satisfied, for example, after a few burn-in steps or when the relative change of the data likelihood is lower than some threshold. Here, we adopt the latter, the same as in Jiang et al. (2012). After this burn-in stage, we keep one sample of z for prediction using the stochastic classifier. Empirically, using the average of a few (e.g., 10) samples of z could lead to slightly more robust predictions, as we shall see in Section 6.5.4.

5. Extensions to Regression and Multi-task Learning

The above ideas can be naturally generalized to develop Gibbs max-margin supervised topic models for various prediction tasks. In this section, we present two examples, for regression and multi-task learning, respectively.

5.1 Gibbs MedLDA Regression Model

We first discuss how to generalize the above ideas to develop a regression model, where the response variable Y takes real values. Formally, the Gibbs MedLDA regression model also has two components: an LDA model to describe the input bag-of-words documents, and a Gibbs regression model for the response variables. Since the LDA component is the same as in the classification model, we focus on presenting the Gibbs regression model.

5.1.1 The Model with Data Augmentation

If a sample of the topic assignments Z and the prediction model \eta are drawn from the posterior distribution q(\eta, Z), we define the latent regression rule as

\hat{y}(\eta, \bar{z}) = \eta^\top \bar{z}.   (14)
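The test-time sampler of Equation (13) can be sketched as follows. This is a minimal illustration, assuming `phi_hat` holds the point estimate \hat{\Phi} as a (K, V) array; the names and defaults are ours, not the paper's:

```python
import numpy as np

def sample_test_doc(words, phi_hat, alpha, n_iter=20, rng=np.random):
    """Collapsed Gibbs sampling for a test document via Equation (13):
    p(z_n = k | z_{-n}, w, D) propto phi_hat[k, w_n] * (C_k^{-n} + alpha_k).

    words:   list of term indices of the test document
    phi_hat: (K, V) point estimate of the topics from training
    alpha:   (K,) Dirichlet hyper-parameter
    Returns z_bar, the mean topic-assignment vector fed to the classifier."""
    K = phi_hat.shape[0]
    z = rng.randint(K, size=len(words))            # random initialization
    counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_iter):
        for n, w in enumerate(words):
            counts[z[n]] -= 1.0                    # exclude the n-th word
            p = phi_hat[:, w] * (counts + alpha)
            z[n] = rng.choice(K, p=p / p.sum())
            counts[z[n]] += 1.0
    return counts / len(words)
```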
To measure the goodness of the prediction rule (14), we adopt the widely used \epsilon-insensitive loss

R_\epsilon(\eta, Z) = \sum_{d=1}^D \max(0, |\Delta_d| - \epsilon),

where \Delta_d = y_d - \eta^\top \bar{z}_d is the margin between the true score and the predicted score. The \epsilon-insensitive loss has been successfully used in learning fully observed support vector regression (Smola and Scholkopf, 2003). In our case, the loss is a function of the predictive model \eta as well as the topic assignments Z, which are hidden from the input data. To resolve this uncertainty, we define the expected \epsilon-insensitive loss

R_\epsilon(q) = E_q[R_\epsilon(\eta, Z)] = \sum_{d=1}^D E_q[\max(0, |\Delta_d| - \epsilon)],

a function of the desired posterior distribution q(\eta, Z). With the above definitions, we can follow the same principle as Gibbs MedLDA to define the Gibbs MedLDA regression model as solving the regularized Bayesian inference problem

\min_{q(\eta,\Theta,Z,\Phi)} L(q(\eta, \Theta, Z, \Phi)) + 2c\, R_\epsilon(q(\eta, \Theta, Z, \Phi)).   (15)

Note that, as in the classification model, we have put the complete distribution q(\eta, \Theta, Z, \Phi) as the argument of the expected loss R_\epsilon, which only depends on the marginal distribution q(\eta, Z). This does not affect the results, because we are taking the expectation to compute R_\epsilon and any irrelevant variables will be marginalized out. As in the Gibbs MedLDA classification model, we can show that R_\epsilon is an upper bound of the \epsilon-insensitive loss of MedLDA's expected prediction rule, by applying Jensen's inequality to the convex function h(x) = \max(0, |x| - \epsilon).

Lemma 4 We have R_\epsilon \geq \sum_{d=1}^D \max(0, |E_q[\Delta_d]| - \epsilon).

We can reformulate problem (15) in the same form as problem (9), with the unnormalized likelihood

\phi(y_d|\eta, z_d) = \exp(-2c \max(0, |\Delta_d| - \epsilon)).

Then, we have the dual scale-of-mixture representation, by noting that

\max(0, |x| - \epsilon) = \max(0, x - \epsilon) + \max(0, -x - \epsilon).   (16)
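The expected \epsilon-insensitive loss R_\epsilon(q) can be estimated by Monte Carlo over posterior samples. A minimal sketch, with array shapes and names of our own choosing:

```python
import numpy as np

def expected_eps_insensitive_loss(eta_samples, zbar_samples, y, eps=1e-3):
    """Monte Carlo estimate of the expected epsilon-insensitive loss
    R_eps(q) = sum_d E_q[max(0, |Delta_d| - eps)], where
    Delta_d = y_d - eta^T z_bar_d, averaged over S posterior samples.

    eta_samples:  (S, K)    draws of the regression weights
    zbar_samples: (S, D, K) matching draws of the mean topic assignments
    y:            (D,)      real-valued responses"""
    total = 0.0
    for eta, zbar in zip(eta_samples, zbar_samples):
        delta = y - zbar @ eta
        total += np.maximum(0.0, np.abs(delta) - eps).sum()
    return total / len(eta_samples)
```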
Lemma 5 (Dual Scale Mixture Representation) For regression, the unnormalized likelihood can be expressed as

\phi(y_d|\eta, z_d) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_d}} \exp\Big(-\frac{(\lambda_d + c(\Delta_d - \epsilon))^2}{2\lambda_d}\Big) d\lambda_d \int_0^\infty \frac{1}{\sqrt{2\pi\omega_d}} \exp\Big(-\frac{(\omega_d - c(\Delta_d + \epsilon))^2}{2\omega_d}\Big) d\omega_d.

Proof By the equality (16), we have \phi(y_d|\eta, z_d) = \exp\{-2c \max(0, \Delta_d - \epsilon)\} \exp\{-2c \max(0, -\Delta_d - \epsilon)\}. Each of the exponential terms can be formulated as a scale mixture of Gaussians due to Lemma 2.

Then, the data augmented learning problem of the Gibbs MedLDA regression model is

\min_{q(\eta,\lambda,\omega,\Theta,Z,\Phi)} L(q(\eta, \lambda, \omega, \Theta, Z, \Phi)) - E_q[\log \phi(y, \lambda, \omega|Z, \eta)],

where \phi(y, \lambda, \omega|Z, \eta) = \prod_{d=1}^D \phi(y_d, \lambda_d, \omega_d|Z, \eta) and

\phi(y_d, \lambda_d, \omega_d|Z, \eta) = \frac{1}{\sqrt{2\pi\lambda_d}} \exp\Big(-\frac{(\lambda_d + c(\Delta_d - \epsilon))^2}{2\lambda_d}\Big) \frac{1}{\sqrt{2\pi\omega_d}} \exp\Big(-\frac{(\omega_d - c(\Delta_d + \epsilon))^2}{2\omega_d}\Big).

Solving the augmented problem and integrating out (\Theta, \Phi), we get the collapsed posterior distribution

q(\eta, \lambda, \omega, Z) \propto p_0(\eta) p(W, Z|\alpha, \beta) \phi(y, \lambda, \omega|Z, \eta).

Algorithm 2 Collapsed Gibbs sampling algorithm for Gibbs MedLDA regression models
1: Initialization: set \lambda = \omega = 1 and randomly draw z_{dn} from a uniform distribution.
2: for m = 1 to M do
3:   draw the classifier from the normal distribution (17)
4:   for d = 1 to D do
5:     for each word n in document d do
6:       draw a topic from the multinomial distribution (18)
7:     end for
8:     draw \lambda_d^{-1} (and thus \lambda_d) from the inverse Gaussian distribution (19)
9:     draw \omega_d^{-1} (and thus \omega_d) from the inverse Gaussian distribution (20)
10:  end for
11: end for

5.1.2 A Collapsed Gibbs Sampling Algorithm

Following similar derivations as in the classification model, the Gibbs sampling algorithm to infer the posterior has the following conditional distributions, with an outline in Algorithm 2.
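Lemma 5 rests on the scale-mixture identity of Lemma 2, exp(-2c max(0, \zeta)) = \int_0^\infty (2\pi\lambda)^{-1/2} \exp(-(\lambda + c\zeta)^2/(2\lambda)) d\lambda, which can be checked numerically. The sketch below uses simple trapezoidal quadrature; the truncation point and grid size are our own choices:

```python
import numpy as np

def hinge_pseudo_likelihood(zeta, c=1.0):
    """Left-hand side of the scale-mixture identity (Lemma 2):
    phi(zeta) = exp(-2c max(0, zeta))."""
    return np.exp(-2.0 * c * max(0.0, zeta))

def augmented_mixture(zeta, c=1.0, upper=60.0, grid=200001):
    """Right-hand side: trapezoidal quadrature of the augmented density
    (2 pi lam)^(-1/2) exp(-(lam + c zeta)^2 / (2 lam)) over lam in (0, inf),
    truncated at `upper`, where the integrand is negligible."""
    lam = np.linspace(1e-9, upper, grid)
    f = np.exp(-(lam + c * zeta) ** 2 / (2.0 * lam)) / np.sqrt(2.0 * np.pi * lam)
    h = lam[1] - lam[0]
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * h)
```

Running the check for a few positive and negative margins confirms the two sides agree, which is exactly the fact the augment-and-collapse construction relies on.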
For \eta: Again, with the isotropic Gaussian prior p_0(\eta) = \prod_{k=1}^K N(\eta_k; 0, \nu^2), we have

q(\eta|Z, \lambda, \omega) \propto p_0(\eta) \prod_{d=1}^D \exp\Big(-\frac{(\lambda_d + c(\Delta_d - \epsilon))^2}{2\lambda_d} - \frac{(\omega_d - c(\Delta_d + \epsilon))^2}{2\omega_d}\Big) \propto \exp\Big(c^2 \sum_{d=1}^D \psi_d \eta^\top \bar{z}_d - \frac{1}{2}\eta^\top\Big(\frac{1}{\nu^2} I + c^2 \sum_{d=1}^D \rho_d \bar{z}_d \bar{z}_d^\top\Big)\eta\Big) = N(\eta; \mu, \Sigma),   (17)

where the posterior covariance matrix and the posterior mean are

\Sigma = \Big(\frac{1}{\nu^2} I + c^2 \sum_{d=1}^D \rho_d \bar{z}_d \bar{z}_d^\top\Big)^{-1} and \mu = \Sigma\Big(c^2 \sum_{d=1}^D \psi_d \bar{z}_d\Big),

and \rho_d = \frac{1}{\lambda_d} + \frac{1}{\omega_d} and \psi_d = \frac{y_d - \epsilon}{\lambda_d} + \frac{y_d + \epsilon}{\omega_d} are two parameters. We can easily draw a sample from this K-dimensional multivariate Gaussian distribution; the inverse can be computed robustly using a Cholesky decomposition.

For Z: We can derive the conditional distribution of one variable z_{dn} given the others, Z_\neg, as:

q(z_{dn}^k = 1|Z_\neg, \eta, \lambda, \omega, w_{dn} = t) \propto \frac{(C_{k,\neg n}^t + \beta_t)(C_{d,\neg n}^k + \alpha_k)}{\sum_{t=1}^V C_{k,\neg n}^t + \sum_{t=1}^V \beta_t} \exp\Big(c^2\gamma\psi_d\eta_k - c^2\Big(\frac{\gamma^2\rho_d\eta_k^2}{2} + \gamma(1-\gamma)\rho_d\eta_k\Upsilon_{dn}^k\Big)\Big),   (18)

where \gamma = \frac{1}{N_d} and \Upsilon_{dn}^k = \frac{1}{N_d - 1}\sum_{k'=1}^K \eta_{k'} C_{d,\neg n}^{k'} is the discriminant function value without word n. The first term is from the LDA model for the observed word counts; the second term is from the supervised signal y.

For \lambda and \omega: Finally, we can derive that \lambda_d^{-1} and \omega_d^{-1} follow the inverse Gaussian distributions:

q(\lambda_d^{-1}|Z, \eta, \omega) = IG\Big(\lambda_d^{-1}; \frac{1}{c|\Delta_d - \epsilon|}, 1\Big),   (19)
q(\omega_d^{-1}|Z, \eta, \lambda) = IG\Big(\omega_d^{-1}; \frac{1}{c|\Delta_d + \epsilon|}, 1\Big).   (20)

The per-iteration time complexity of this algorithm is similar to that of the binary Gibbs MedLDA model, that is, linear in the number of documents and the number of topics if K is not too large.

5.2 Multi-task Gibbs MedLDA

The second extension is a multi-task Gibbs MedLDA. Multi-task learning is a scenario where multiple potentially related tasks are learned jointly, with the hope that their performance can be boosted by sharing statistical information among the tasks; it has attracted a lot of research attention. In particular, learning a common latent representation shared by all the related tasks has proven to be an effective way to capture task relationships (Ando and Zhang, 2005; Argyriou et al., 2007; Zhu et al., 2014).
Here, we take a similar approach to learning multiple predictive models which share common latent topic representations. As we shall see in Section 6.3.2, one natural application of our approach is multi-label classification (Tsoumakas et al., 2010), where each document can belong to multiple categories, by defining each task as a binary classifier that determines whether a data point belongs to a particular category. It can also be applied to multi-class classification, where each document belongs to only one of many categories, by defining a single output prediction rule (see Section 6.3.2 for details).

5.2.1 The Model with Data Augmentation

We consider L binary classification tasks, where each task i is associated with a classifier with weights \eta_i. We assume that all the tasks work on the same set of input data W = \{w_d\}_{d=1}^D, but each data point d has different binary labels \{y_d^i\}_{i=1}^L in the different tasks. A multi-task Gibbs MedLDA model has two components: an LDA model to describe the input words (the same as in Gibbs MedLDA), and multiple Gibbs classifiers sharing the same topic representations. When we have the classifier weights \eta and the topic assignments Z, drawn from a posterior distribution q(\eta, Z), we follow the same principle as in Gibbs MedLDA and define the latent Gibbs rule for each task as

\forall i = 1, \dots, L : \hat{y}_i(\eta_i, \bar{z}) = \mathrm{sign}\, F(\eta_i, \bar{z}; w) = \mathrm{sign}(\eta_i^\top \bar{z}).   (21)

Let \zeta_d^i = \ell - y_d^i \eta_i^\top \bar{z}_d. The hinge loss of the stochastic classifier i is

R_i(\eta_i, Z) = \sum_{d=1}^D \max(0, \zeta_d^i),

and the expected hinge loss is

R'_i(q) = E_q[R_i(\eta_i, Z)] = \sum_{d=1}^D E_q[\max(0, \zeta_d^i)].

For each task i, we can follow the argument as in Gibbs MedLDA to show that the expected loss R'_i(q) is an upper bound of the expected training error \sum_{d=1}^D E_q[I(y_d^i \neq \hat{y}_d^i(\eta_i, \bar{z}_d))] of the Gibbs classifier (21). Thus, it is a good surrogate loss for learning a posterior distribution that could lead to a low expected training error.
Then, following a similar procedure to defining the binary Gibbs MedLDA classifier, we define the multi-task Gibbs MedLDA model as solving the following RegBayes problem:

\min_{q(\eta,\Theta,Z,\Phi)} L(q(\eta, \Theta, Z, \Phi)) + 2c\, R'_{MT}(q(\eta, \Theta, Z, \Phi)),

where the multi-task expected hinge loss is defined as the summation of the expected hinge losses of all the tasks:

R'_{MT}(q(\eta, \Theta, Z, \Phi)) = \sum_{i=1}^L R'_i(q(\eta, \Theta, Z, \Phi)).

Due to the separability of the multi-task expected hinge loss, we can apply Lemma 2 to reformulate each task-specific expected hinge loss R'_i as a scale mixture by introducing a set of augmented variables \{\lambda_d^i\}_{d=1}^D. More specifically, let \phi_i(y_d^i|z_d, \eta) = \exp\{-2c \max(0, \zeta_d^i)\} be the unnormalized likelihood of the response variable for document d in task i. Then, we have

\phi_i(y_d^i|z_d, \eta) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_d^i}} \exp\Big(-\frac{(\lambda_d^i + c\zeta_d^i)^2}{2\lambda_d^i}\Big) d\lambda_d^i.

5.2.2 A Collapsed Gibbs Sampling Algorithm

Similar to the binary Gibbs MedLDA classification model, we can derive the collapsed Gibbs sampling algorithm, as outlined in Algorithm 3. Specifically, let

\phi_i(y^i, \lambda^i|Z, \eta) = \prod_{d=1}^D \frac{1}{\sqrt{2\pi\lambda_d^i}} \exp\Big(-\frac{(\lambda_d^i + c\zeta_d^i)^2}{2\lambda_d^i}\Big)

be the joint unnormalized likelihood of the class labels y^i = \{y_d^i\}_{d=1}^D and the augmentation variables \lambda^i = \{\lambda_d^i\}_{d=1}^D. Then, for the multi-task Gibbs MedLDA, we can integrate out the Dirichlet variables (\Theta, \Phi) and get the collapsed posterior distribution

q(\eta, \lambda, Z) \propto p_0(\eta) p(W, Z|\alpha, \beta) \prod_{i=1}^L \phi_i(y^i, \lambda^i|Z, \eta).

Algorithm 3 Collapsed Gibbs sampling algorithm for multi-task Gibbs MedLDA
1: Initialization: set \lambda = 1 and randomly draw z_{dn} from a uniform distribution.
2: for m = 1 to M do
3:   for i = 1 to L do
4:     draw the classifier \eta_i from the normal distribution (22)
5:   end for
6:   for d = 1 to D do
7:     for each word n in document d do
8:       draw a topic from the multinomial distribution (23)
9:     end for
10:    for i = 1 to L do
11:      draw (\lambda_d^i)^{-1} (and thus \lambda_d^i) from the inverse Gaussian distribution (24)
12:    end for
13:  end for
14: end for

Then, we can derive the conditional distributions used in collapsed Gibbs sampling as follows.

For \eta: We also assume its prior is an isotropic Gaussian, p_0(\eta) = \prod_{i=1}^L \prod_{k=1}^K N(\eta_{ik}; 0, \nu^2). Then, we have the factorized form q(\eta|Z, \lambda) = \prod_{i=1}^L q(\eta_i|Z, \lambda), where each individual distribution is

q(\eta_i|Z, \lambda) \propto p_0(\eta_i) \prod_{d=1}^D \exp\Big(-\frac{(\lambda_d^i + c\zeta_d^i)^2}{2\lambda_d^i}\Big) = N(\eta_i; \mu_i, \Sigma_i),   (22)

where the posterior covariance matrix and posterior mean are

\Sigma_i = \Big(\frac{1}{\nu^2} I + c^2 \sum_{d=1}^D \frac{\bar{z}_d \bar{z}_d^\top}{\lambda_d^i}\Big)^{-1} and \mu_i = \Sigma_i\Big(c \sum_{d=1}^D \frac{y_d^i(\lambda_d^i + c\ell)}{\lambda_d^i} \bar{z}_d\Big).

Similarly, the inverse can be computed robustly using a Cholesky decomposition, an O(K^3) procedure. Since K is normally not large, the inversion can be done efficiently.

For Z: The conditional distribution of Z is

q(Z|\eta, \lambda) \propto p(W, Z|\alpha, \beta) \prod_{d=1}^D \prod_{i=1}^L \exp\Big(-\frac{(\lambda_d^i + c\zeta_d^i)^2}{2\lambda_d^i}\Big).

By canceling common factors, we can derive the conditional distribution of one variable z_{dn} given the others, Z_\neg, as:

q(z_{dn}^k = 1|Z_\neg, \eta, \lambda, w_{dn} = t) \propto \frac{(C_{k,\neg n}^t + \beta_t)(C_{d,\neg n}^k + \alpha_k)}{\sum_{t=1}^V C_{k,\neg n}^t + \sum_{t=1}^V \beta_t} \prod_{i=1}^L \exp\Big(\frac{c\gamma y_d^i(c\ell + \lambda_d^i)\eta_{ik}}{\lambda_d^i} - \frac{c^2\big(\gamma^2\eta_{ik}^2 + 2\gamma(1-\gamma)\eta_{ik}\Lambda_{dn}^i\big)}{2\lambda_d^i}\Big),   (23)

where \Lambda_{dn}^i = \frac{1}{N_d - 1}\sum_{k'=1}^K \eta_{ik'} C_{d,\neg n}^{k'} is the discriminant function value without word n. We can see that the first term is from the LDA model for the observed word counts and the second term is from the supervised signals \{y_d^i\} of all the tasks.

For \lambda: Finally, the conditional distribution of the augmented variables \lambda is fully factorized, q(\lambda|Z, \eta) = \prod_{i=1}^L \prod_{d=1}^D q(\lambda_d^i|Z, \eta), and each variable follows a generalized inverse Gaussian distribution

q(\lambda_d^i|Z, \eta) \propto \frac{1}{\sqrt{2\pi\lambda_d^i}} \exp\Big(-\frac{(\lambda_d^i + c\zeta_d^i)^2}{2\lambda_d^i}\Big) = GIG\Big(\lambda_d^i; \frac{1}{2}, 1, c^2(\zeta_d^i)^2\Big).

Therefore, we can derive that (\lambda_d^i)^{-1} follows an inverse Gaussian distribution

p((\lambda_d^i)^{-1}|Z, \eta) = IG\Big((\lambda_d^i)^{-1}; \frac{1}{c|\zeta_d^i|}, 1\Big),   (24)

from which a sample can be efficiently drawn with a constant time complexity. The per-iteration time complexity of the algorithm is O(LK^3 + N_{total} K + DL).
For common large-scale applications where K and L are not too large while D (and thus N_total) is very large, the step of sampling the latent topic assignments takes most of the time. If L is very large, for example, in the PASCAL large-scale text/image categorization challenge tasks, which have tens of thousands of categories,4 the step of drawing the global classifier weights may dominate. A nice property of the algorithm is that we can easily parallelize this step, since there is no coupling among the classifiers once the topic assignments are given. A preliminary investigation of the parallel algorithm is presented in Zhu et al. (2013b).

6. Experiments

We present empirical results to demonstrate the efficiency and prediction performance of Gibbs MedLDA on the 20Newsgroups data set for classification, a hotel review data set for regression, and a Wikipedia data set with more than 1 million documents for multi-label classification. We also analyze its sensitivity to key parameters and examine the learned latent topic representations qualitatively. The 20Newsgroups data set contains about 20K postings within 20 groups. We follow the same setting as in Zhu et al. (2012) and remove a standard list of stop words for both binary and multi-class classification. For all the experiments, we use the standard normal prior p_0(\eta) (i.e., \nu^2 = 1) and the symmetric Dirichlet priors \alpha = \frac{\alpha}{K} \mathbf{1} and \beta = 0.01 \cdot \mathbf{1}, where \mathbf{1} is a vector with all entries being 1. For each setting, we report the average performance and standard deviation over five randomly initialized runs. All the experiments, except those on the large Wikipedia data set, are done on a standard desktop computer.

6.1 Binary Classification

The binary classification task is to distinguish postings of the newsgroup alt.atheism from postings of the newsgroup talk.religion.misc. The training set contains 856 documents, and
4. See the websites: http://lshtc.iit.demokritos.gr/; and http://www.image-net.org/challenges/LSVRC/2012/index.

Figure 1: Classification accuracy, training time (in log-scale) and testing time (in linear scale) on the 20Newsgroups binary classification data set.

the test set contains 569 documents. We compare Gibbs MedLDA with the MedLDA model that uses variational methods (denoted by vMedLDA) (Zhu et al., 2012) and the MedLDA that uses collapsed Gibbs sampling (denoted by gMedLDA) (Jiang et al., 2012). We also include unsupervised LDA using collapsed Gibbs sampling as a baseline, denoted by GibbsLDA. For GibbsLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). The results of other supervised topic models, such as sLDA and DiscLDA (Lacoste-Julien et al., 2009), were reported in Zhu et al. (2012). For Gibbs MedLDA, we set \alpha = 1, \ell = 164 and M = 10. As we shall see in Section 6.5, Gibbs MedLDA is insensitive to \alpha, \ell and M over a wide range. Although tuning c (e.g., via cross-validation) can produce slightly better results, we fix c = 1 for simplicity. Figure 1 shows the accuracy, training time, and testing time of the different methods with various numbers of topics. We can see that, by minimizing an expected hinge loss and making no restrictive assumptions on the posterior distributions, Gibbs MedLDA achieves higher accuracy than the other max-margin topic models, which make restrictive mean-field assumptions. Similarly, as gMedLDA makes a weaker mean-field assumption, it achieves slightly higher accuracy than vMedLDA, which assumes that the posterior distribution is fully factorized. For training time, Gibbs MedLDA is about two orders of magnitude faster than vMedLDA, and about one order of magnitude faster than gMedLDA.
This is partly because both vMedLDA and gMedLDA need to solve multiple SVM subproblems. For testing time, Gibbs MedLDA is comparable with gMedLDA and the unsupervised GibbsLDA, but faster than the variational algorithm used by vMedLDA, especially when the number of topics K is large. There are several possible reasons for the faster testing relative to vMedLDA, even though they use the same stopping criterion. For example, vMedLDA performs mean-field inference in the full space, which leads to a low convergence speed, while Gibbs MedLDA carries out Gibbs sampling in a collapsed space. Also, the sparsity of the sampled topics in Gibbs MedLDA could save time, while vMedLDA needs to carry out computation for each dimension of the variational parameters.

Figure 2: Predictive R^2, training time and testing time on the hotel review data set.

6.2 Regression

We use the hotel review data set (Zhu and Xing, 2010), built by randomly crawling hotel reviews from the TripAdvisor website,5 where each review is associated with a global rating score ranging from 1 to 5. In these experiments, we focus on predicting the global rating scores for reviews using the bag-of-words features only, with a vocabulary of 12,000 terms, although other manually extracted features (e.g., part-of-speech tags) are provided. All the reviews have character lengths between 1,500 and 6,000. The data set consists of 5,000 reviews, with 1,000 reviews per rating. The data set is uniformly partitioned into training and testing sets. We compare the Gibbs MedLDA regression model with the MedLDA regression model that uses variational inference and with supervised LDA (sLDA), which also uses variational inference.
For Gibbs MedLDA and vMedLDA, the precision is set at \epsilon = 1e-3 and c is selected via 5-fold cross-validation during training. Again, we set the Dirichlet parameter \alpha = 1 and the number of burn-in iterations M = 10. Figure 2 shows the predictive R^2 (Blei and McAuliffe, 2007) of the different methods. We can see that Gibbs MedLDA achieves prediction performance comparable to vMedLDA, which is better than sLDA. Note that vMedLDA uses a full likelihood model for both input words and response variables, while Gibbs MedLDA uses a simpler likelihood model for words only.6 For training time, Gibbs MedLDA is about two orders of magnitude faster than vMedLDA (as well as sLDA), again because Gibbs MedLDA does not need to solve multiple SVM subproblems. For testing time, Gibbs MedLDA is also much faster than vMedLDA and sLDA, especially when the number of topics is large, for the same reasons as stated in Section 6.1.

5. See the website: http://www.tripadvisor.com/.
6. The MedLDA with a simple likelihood on words only doesn't perform well for regression.

6.3 Multi-class Classification

We perform multi-class classification on 20Newsgroups with all 20 categories. The data set has a balanced distribution over the categories. The test set consists of 7,505 documents, in which the smallest category has 251 documents and the largest category has 399 documents. The training set consists of 11,269 documents, in which the smallest and the largest categories contain 376 and 599 documents, respectively. We consider two approaches to multi-class classification: one is to build multiple independent binary Gibbs MedLDA models, one for each category; the other is to build multiple dependent binary Gibbs MedLDA models under the framework of multi-task learning, as presented in Section 5.2.
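Predictive R^2 (Blei and McAuliffe, 2007) is one minus the ratio of residual to total variance of the responses. A small sketch, with names of our own choosing:

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """Predictive R^2: 1 - sum_d (y_d - yhat_d)^2 / sum_d (y_d - mean(y))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1, while predicting the response mean everywhere scores 0.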
6.3.1 Multiple One-vs-All Classifiers

Various methods exist for applying binary classifiers to multi-class classification, including the popular one-vs-all and one-vs-one strategies. Here we choose the one-vs-all strategy, which has been shown to be effective (Rifkin and Klautau, 2004), to provide some preliminary analysis. Let \hat{\eta}_i be the sampled classifier weights of the 20 one-vs-all binary classifiers after the burn-in stage. For a test document w, we infer the latent topic assignments \bar{z}_i under each one-vs-all binary classifier using a Gibbs sampler with the conditional distribution (13). Then, we predict the document as belonging to the single category with the largest discriminant function value, that is,

\hat{y} = \arg\max_{i=1,\dots,L} \hat{\eta}_i^\top \bar{z}_i,

where L is the number of categories (i.e., 20 in this experiment). Again, since Gibbs MedLDA is insensitive to \alpha and \ell, we set \alpha = 1 and \ell = 64. We also fix c = 1 for simplicity. The number of burn-in iterations is set to M = 20, which is sufficiently large, as will be shown in Figure 7. Figure 3 shows the classification accuracy and training time, where Gibbs MedLDA builds 20 binary Gibbs MedLDA classifiers. Note that for Gibbs MedLDA the horizontal axis denotes the number of topics used by each single binary classifier. Since there is no coupling among these 20 binary classifiers, we can learn them in parallel, which we denote by pGibbsMedLDA. We can see that Gibbs MedLDA clearly improves over its competitors in classification accuracy, which may be due to the different strategies for building the multi-class classifiers.7 However, given the performance gain on the binary classification task, we believe that the Gibbs sampling algorithm without any restrictive factorization assumptions is another factor leading to the improved performance. For training time, Gibbs MedLDA takes slightly less time than the variational MedLDA as well as gMedLDA.
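The one-vs-all decision rule above is a one-liner once the per-classifier discriminant values are available. A minimal sketch under our own naming assumptions:

```python
import numpy as np

def predict_one_vs_all(eta_hat, z_bars):
    """One-vs-all prediction: compute each binary classifier's discriminant
    value eta_i^T z_bar_i and return the argmax category index.

    eta_hat: (L, K) sampled weights of the L one-vs-all classifiers
    z_bars:  (L, K) topic assignments z_bar_i inferred under each classifier
             (each binary model infers its own test representation)"""
    scores = np.sum(eta_hat * z_bars, axis=1)  # row-wise dot products
    return int(np.argmax(scores))
```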
But if we train the 20 binary Gibbs MedLDA classifiers in parallel, we can save a lot of training time. These results are promising, since it is now not uncommon to have a desktop computer with multiple processors or a cluster with tens or hundreds of computing nodes.

7. MedLDA learns a multi-class SVM (Zhu et al., 2012).

Figure 3: (a) classification accuracy and (b) training time of the one-vs-all Gibbs MedLDA classifiers for multi-class classification on the whole 20Newsgroups data set.

6.3.2 Multi-class Classification as a Multi-task Learning Problem

The second approach to performing multi-class classification is to formulate it as a multi-task learning problem with a single output prediction rule. Specifically, let the label space be Y = \{1, \dots, L\}. We can define one binary classification task for each category i, where the task is to distinguish whether a data example belongs to class i (binary label +1) or not (binary label -1). All the binary tasks share the same topic representations. To apply the model as presented in Section 5.2, we need to determine the true binary label of each document in each task. Given the multi-class label y_d of document d, this can easily be done by defining

\forall i = 1, \dots, L : y_d^i = +1 if y_d = i, and -1 otherwise.

Then, we can learn a multi-task Gibbs MedLDA model using the data with the transformed multiple labels. Let \hat{\eta}_i be the sampled classifier weights of task i after the burn-in stage.
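The label transformation y_d^i = +1 if y_d = i, and -1 otherwise, can be vectorized directly. A small sketch, with names of our own choosing (labels are taken as 0-based here for convenience):

```python
import numpy as np

def to_multitask_labels(y, num_classes):
    """Map multi-class labels y_d in {0, ..., L-1} into L binary tasks:
    y^i_d = +1 if y_d == i, and -1 otherwise. Returns an (L, D) array."""
    y = np.asarray(y)
    return np.where(np.arange(num_classes)[:, None] == y[None, :], 1, -1)
```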
For a test document w, once we have inferred the latent topic assignments \bar{z} using a Gibbs sampler with the conditional distribution (13), we compute the discriminant function value \hat{\eta}_i^\top \bar{z} for each task i and predict the document as belonging to the single category with the largest discriminant function value, that is,

\hat{y} = \arg\max_{i=1,\dots,L} \hat{\eta}_i^\top \bar{z}.

Figure 4: (a) classification accuracy; (b) training time; and (c) testing time of the multi-task Gibbs MedLDA classifiers for multi-class classification on the whole 20Newsgroups data set.

Figure 4 shows the performance of the multi-task Gibbs MedLDA in comparison with the high-performance methods of the one-vs-all Gibbs MedLDA and gMedLDA. Note again that for the one-vs-all Gibbs MedLDA the horizontal axis denotes the number of topics used by each single binary classifier. We can see that although the multi-task Gibbs MedLDA uses 20 times fewer topics than the one-vs-all Gibbs MedLDA, their prediction accuracy scores are comparable when the multi-task Gibbs MedLDA uses a reasonable number of topics (e.g., larger than 40). Both implementations of Gibbs MedLDA yield higher performance than gMedLDA. Looking at training time, when only a single processor core is available, the multi-task Gibbs MedLDA is about 3 times faster than the one-vs-all Gibbs MedLDA. When multiple processor cores are available, the naive parallel one-vs-all Gibbs MedLDA is faster. In this case, using 20 processor cores, the parallel one-vs-all Gibbs MedLDA is about 7 times faster than the multi-task Gibbs MedLDA. In some scenarios, the testing time is significant.
We can see that using a single core, the multi-task Gibbs MedLDA tests about 20 times faster than the one-vs-all Gibbs MedLDA. Again, however, with multiple processor cores (in this case 20), the parallel one-vs-all Gibbs MedLDA tests at least as fast, at the expense of using more processor resources. So, depending on the processor cores available, both the parallel one-vs-all Gibbs MedLDA and the multi-task Gibbs MedLDA can be excellent choices. Where high-efficiency single-core processing is key, the multi-task Gibbs MedLDA is a great choice. When many processor cores are available, the parallel one-vs-all Gibbs MedLDA might be an appropriate choice.

6.4 Multi-label Classification

We also present some results on a multi-label classification task. We use the Wiki data set, which is built from the large Wikipedia set used in the PASCAL LSHC challenge 2012, where each document has multiple labels. The original data set is extremely imbalanced.8 We built our data set by selecting the 20 categories that have the largest numbers of documents and keeping all the documents that are labeled by at least one of these 20 categories. The training set consists of 1.1 million documents and the testing set consists of 5,000 documents. The vocabulary has 917,683 terms in total. To examine the effectiveness of Gibbs MedLDA, which performs topic discovery and classifier learning jointly, we compare it with a linear SVM classifier built on the raw bag-of-words features and with a two-step approach denoted by GibbsLDA+SVM. The GibbsLDA+SVM method first uses LDA with collapsed Gibbs sampling to discover latent topic representations for all the documents and then builds 20 separate binary SVM classifiers using the training documents with their

8. The data set is available at: http://lshtc.iit.demokritos.gr/.
Figure 5: Precision, recall and F1-measure of multi-label classification using different models on the Wiki data set.

discovered topic representations. For the multi-task Gibbs MedLDA, we use 40 burn-in steps, which is sufficiently large. The model is insensitive to the other parameters, similar to the multi-class classification task. Figure 5 shows the precision, recall and F1 measure (i.e., the harmonic mean of precision and recall) of the various models, running on a distributed cluster with 20 nodes (each node is equipped with two 6-core CPUs).9 We can see that the multi-task Gibbs MedLDA performs much better than the other competitors. There are several reasons for the improvements. Since the vocabulary has about 1 million terms, the raw features lie in a high-dimensional space and each document gives rise to a sparse feature vector, that is, only a few elements are nonzero. Thus, learning SVM classifiers on the raw data leads not just to over-fitting but to a wider failure to generalize. For example, two documents from the same category might contain non-intersecting sets of words, yet contain similar latent topics. Using LDA to discover latent topic representations can produce dense features. Building SVM classifiers using the latent topic features improves the overall F1 measure by improving the ability to generalize and reducing over-fitting. But, due to its two-step procedure, the discovered topic representations may not be very predictive. By doing max-margin learning and topic discovery jointly, the multi-task Gibbs MedLDA can discover more discriminative topic features, thus improving significantly over the two-step GibbsLDA+SVM algorithm.

6.5 Sensitivity Analysis

We now provide a more careful analysis of the various Gibbs MedLDA models regarding their sensitivity to some key parameters in the classification tasks.
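The paper reports precision, recall, and F1 but does not spell out the averaging convention for the multi-label case; the sketch below assumes micro-averaging over all (document, label) pairs, which is one common choice for multi-label evaluation:

```python
import numpy as np

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 (the harmonic mean of the two)
    for multi-label predictions given as (D, L) binary indicator arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = float(np.logical_and(y_true, y_pred).sum())  # true positives
    precision = tp / max(float(y_pred.sum()), 1.0)
    recall = tp / max(float(y_true.sum()), 1.0)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```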
Specifically, we will look at the effects of the number of burn-in steps, the Dirichlet prior α, the loss penalty ℓ, and the number of testing samples.

9. For Gibbs LDA, we use the parallel implementation in Yahoo-LDA, which is publicly available at: https://github.com/shravanmn/Yahoo_LDA. For Gibbs Med LDA, the parallel implementation of our Gibbs sampler is presented in Zhu et al. (2013b).

6.5.1 Burn-in Steps

Figure 6: (Left) testing accuracy, (Middle) training accuracy, and (Right) training time of Gibbs Med LDA with different numbers of burn-in steps for binary classification.

Figure 6 shows the testing accuracy, training accuracy and training time of Gibbs Med LDA with different numbers of burn-in steps in the binary classification task. When M = 0, the model is essentially random; in this case we draw a classifier with the randomly initialized topic assignments for the training data. We can see that both the training accuracy and testing accuracy increase very quickly and converge to their stable values within 5 to 10 burn-in steps. As expected, the training time grows roughly linearly with the number of burn-in steps. Moreover, the training time increases linearly as K increases. In the previous experiments, we chose M = 10, which is sufficiently large. Figure 7 shows the performance of Gibbs Med LDA for multi-class classification with different numbers of burn-in steps when using the one-vs-all strategy. We show the total training time as well as the training time of the naive parallel implementation of Gibbs Med LDA. We can see that when the number of burn-in steps is larger than 20, the performance is quite stable, especially when K is large. Again, the training time grows roughly linearly as the number of burn-in steps increases.
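The role of burn-in can be illustrated with a toy Gibbs sampler. The bivariate-normal sampler below is a hypothetical stand-in for the paper's sampler, used only to show why the first M sweeps are discarded (they still carry the arbitrary initialization) and why the total cost grows linearly in the number of sweeps:

```python
import random

def gibbs_bivariate_normal(rho, n_burnin, n_keep, seed=0):
    """Toy Gibbs sampler for a bivariate normal with correlation rho.

    Each full conditional is Gaussian: x | y ~ N(rho*y, 1 - rho^2)."""
    rng = random.Random(seed)
    x, y = 10.0, -10.0                      # deliberately bad initialization
    sd = (1.0 - rho * rho) ** 0.5
    kept = []
    for t in range(n_burnin + n_keep):      # cost is linear in both terms
        x = rho * y + rng.gauss(0.0, sd)
        y = rho * x + rng.gauss(0.0, sd)
        if t >= n_burnin:                   # discard the first n_burnin sweeps
            kept.append((x, y))
    return kept

samples = gibbs_bivariate_normal(rho=0.9, n_burnin=50, n_keep=2000)
mean_x = sum(s[0] for s in samples) / len(samples)
print(round(mean_x, 3))                     # close to the true mean of 0
```

A few dozen burn-in sweeps suffice here because the chain forgets its initialization geometrically fast, which mirrors the quick convergence observed in Figure 6.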
Even if we use 40 or 60 burn-in steps, the training time is still competitive with that of the variational Med LDA, especially considering that Gibbs Med LDA can be naively parallelized by learning the different binary classifiers simultaneously.

Figure 7: (Left) classification accuracy of Gibbs Med LDA, (Middle) training time of Gibbs Med LDA and (Right) training time of the parallel Gibbs Med LDA with different numbers of burn-in steps for multi-class classification.

Figure 8 shows the testing classification accuracy, training accuracy and training time of the multi-task Gibbs Med LDA for multi-class classification with different numbers of burn-in steps. We can see that again both the training accuracy and testing accuracy increase quickly and converge to their stable scores after about 30 burn-in steps. Also, the training time increases roughly linearly as the number of burn-in steps increases.

Figure 8: (Left) testing accuracy, (Middle) training accuracy, and (Right) training time of the multi-task Gibbs Med LDA with different numbers of burn-in steps for multi-class classification.

6.5.2 Dirichlet Prior

For topic models with a Dirichlet prior, the Dirichlet hyper-parameter can be automatically estimated, for example using the Newton-Raphson method (Blei et al., 2003). Here, we analyze its effects on the performance by setting different values. Figure 9 shows the classification performance of Gibbs Med LDA on the binary task with different α values for the symmetric Dirichlet prior α = (α/K)1.
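Before examining those results, the smoothing effect of the effective per-topic prior αk = α/K can be made concrete with a small numerical sketch (a toy simulation with hypothetical values of α and K, not results from the paper): for a fixed K, a small α concentrates the topic proportions θ on very few topics, while a larger α spreads the mass more evenly.

```python
import random

def sample_symmetric_dirichlet(alpha, K, rng):
    """Draw theta ~ Dir(alpha/K, ..., alpha/K) via normalized Gamma draws."""
    alpha_k = alpha / K                      # effective per-topic prior
    g = [rng.gammavariate(alpha_k, 1.0) for _ in range(K)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
K = 20
for alpha in (0.1, 1.0, 10.0, 100.0):
    # average number of "salient" topics (mass > 5%) over repeated draws
    n_rep = 200
    salient = 0
    for _ in range(n_rep):
        theta = sample_symmetric_dirichlet(alpha, K, rng)
        salient += sum(1 for p in theta if p > 0.05)
    print(alpha, salient / n_rep)            # count grows as alpha grows
```

This is why a larger K, which makes αk smaller and the representations sparser, calls for a correspondingly larger α to smooth them.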
For the three different topic numbers, we can see that the performance is quite stable over a wide range of α values, for example, from 0.1 to 10.

Figure 9: Classification accuracy of Gibbs Med LDA on the binary classification data set with different α values.

We can also see that a larger α is generally needed to get the best results when K becomes larger. For example, when α < 0.1, using fewer topics results in slightly higher performance. This is mainly because a large K tends to produce sparse topic representations and an appropriately large α is needed to smooth the representations, as the effective Dirichlet prior is αk = α/K.

6.5.3 Loss Penalty ℓ

Figure 10: Classification accuracy of Gibbs Med LDA on the binary classification data set with different ℓ values.

Figure 10 shows the classification performance of Gibbs Med LDA on the binary classification task with different ℓ values. Again, we can see that the performance is quite stable over a wide range, for example, from 25 to 625, for all three K values. In the above experiments, we set ℓ = 164. For the multi-class classification task, we have similar observations, and we set ℓ = 64 in the previous experiments.

6.5.4 The Number of Testing Samples

Figure 11: (Left) classification accuracy and (Right) testing time of Gibbs Med LDA on the binary classification data set with different numbers of z samples in making predictions.

Figure 11 shows the classification performance and testing time of Gibbs Med LDA in the binary classification task with different numbers of z samples when making predictions, as stated in Section 4.4.
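The prediction rule of Section 4.4 averages the discriminant function over multiple posterior samples before thresholding. The sketch below is a generic illustration with hypothetical stand-ins: `eta` plays the role of the learned classifier weights and `draw_theta` stands in for drawing the topic assignments z of a test document; neither is the paper's actual sampler.

```python
import random

def predict_with_z_samples(eta, draw_theta, n_samples, rng):
    """Average the linear discriminant over n_samples draws of the latent
    topic proportions, then threshold the averaged score."""
    score = 0.0
    for _ in range(n_samples):
        theta = draw_theta(rng)             # one posterior draw of z -> theta
        score += sum(w * t for w, t in zip(eta, theta))
    score /= n_samples                      # Monte Carlo estimate of the mean
    return (1 if score > 0 else -1), score

# Hypothetical 3-topic example: eta favors topic 0.
eta = [2.0, -1.0, -1.0]

def draw_theta(rng):
    # Noisy draws concentrated on topic 0 (a stand-in for the true posterior).
    g = [rng.gammavariate(a, 1.0) for a in (5.0, 1.0, 1.0)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(42)
label, score = predict_with_z_samples(eta, draw_theta, n_samples=10, rng=rng)
print(label)
```

Because each extra z sample adds one more pass over the document, testing time grows linearly in the number of samples, matching the right panel of Figure 11.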
We can see that in a wide range, for example, from 1 to 19, the classification performance is quite stable for all three K values we have tested; and the testing time increases roughly linearly as the number of z samples increases. For the multi-class classification task, we have similar observations.

6.6 Topic Representations

Finally, we also visualize the discovered latent topic representations of Gibbs Med LDA on the 20Newsgroups data set. We choose the multi-task Gibbs Med LDA, since it learns a single common topic space shared by multiple classifiers. We set the number of topics to 40. Figure 12 shows the average topic representations of the documents from each category, and Table 1 presents the 10 most probable words in each topic. We can see that for different categories, the average representations are quite different, indicating that the topic representations are good at distinguishing documents from different classes. We can also see that on average the documents in each category have very few salient topics, that is, topics with a high probability of describing the documents. For example, the two most salient topics for describing the documents in the category alt.atheism are topic 20 and topic 29, whose top-ranked words (see Table 1) reflect the semantic meaning of the category.
For the graphics category, the most salient topic is topic 23, whose top words include image, graphics, file and jpeg, all of which are closely related to the semantics of the category. For other categories, we have similar observations.

Figure 12: (a-t) per-class average topic representations on the 20Newsgroups data set. The panels correspond to the categories alt.atheism, graphics, ms-windows.misc, ibm.pc.hardware, sys.mac.hardware, windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, religion.christian, politics.guns, politics.mideast, politics.misc and religion.misc.
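The per-class averages shown in Figure 12 are simply the element-wise means of the per-document topic distributions within each category. A minimal sketch with made-up document-topic vectors (the function name and toy data are ours, not the paper's):

```python
from collections import defaultdict

def per_class_average(doc_topics, labels):
    """Element-wise mean of the K-dimensional topic vectors in each class."""
    sums = {}
    counts = defaultdict(int)
    for theta, y in zip(doc_topics, labels):
        if y not in sums:
            sums[y] = [0.0] * len(theta)
        sums[y] = [a + b for a, b in zip(sums[y], theta)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

# Toy example: two classes, K = 3 topics.
doc_topics = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1],   # class "a"
              [0.1, 0.1, 0.8], [0.1, 0.3, 0.6]]   # class "b"
labels = ["a", "a", "b", "b"]
avg = per_class_average(doc_topics, labels)
print([round(v, 6) for v in avg["a"]])   # [0.7, 0.2, 0.1]
```

Plotting one such averaged vector per category over the K topic coordinates reproduces the bar-chart style of Figure 12.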
Topic 1: data, mission, center, sci, jpl, planetary, mass, probe, ames, atmosphere
Topic 2: sale, offer, shipping, dos, mail, price, condition, good, interested, sell
Topic 3: woman, told, afraid, building, couldn, floor, beat, standing, immediately, crowd
Topic 4: space, nasa, launch, writes, earth, article, orbit, moon, shuttle, gov
Topic 5: team, game, hockey, play, season, nhl, ca, games, players, year
Topic 6: writes, don, time, article, information, number, people, make, work, part
Topic 7: mr, president, stephanopoulos, jobs, russian, administration, meeting, george, russia, working
Topic 8: db, cs, mov, bh, si, al, byte, theory, bits, larson
Topic 9: windows, writes, article, file, files, problem, dos, don, run, win
Topic 10: file, congress, january, centers, bill, quotes, hr, states, march, included
Topic 11: organizations, began, security, terrible, association, sy, publication, helped, organized, bullock
Topic 12: gun, people, guns, weapons, firearms, writes, article, government, fire, law
Topic 13: msg, health, medical, food, article, disease, patients, writes, doctor, research
Topic 14: jesus, god, people, christian, bible, sandvik, christians, objective, ra, christ
Topic 15: mac, apple, writes, drive, problem, mb, article, system, bit, don
Topic 16: drive, scsi, mb, card, system, controller, bus, hard, ide, disk
Topic 17: ma, nazis, mu, conflict, ql, te, ne, eu, cx, qu
Topic 18: wiring, supply, boxes, bnr, plants, reduce, corp, relay, onur, damage
Topic 19: writes, power, article, don, ground, good, current, work, circuit, ve
Topic 20: writes, people, article, don, god, life, things, apr, evidence, time
Topic 21: ms, myers, os, vote, votes, santa, fee, impression, issued, voting
Topic 22: pub, ftp, mail, anonymous, archive, electronic, server, faq, eff, directory
Topic 23: image, graphics, file, files, software, jpeg, images, version, program, data
Topic 24: internet, anonymous, privacy, posting, email, anonymity, users, postings, service, usenet
Topic 25: fallacy, conclusion, rule, innocent, perfect, assertion, true, ad, consistent, perspective
Topic 26: window, server, motif, widget, sun, display, application, mit, file, xterm
Topic 27: key, encryption, chip, clipper, government, keys, security, escrow, secure, nsa
Topic 28: unit, heads, master, multi, led, vpic, dual, ut, ratio, protected
Topic 29: god, jesus, people, church, christians, christ, christian, bible, faith, truth
Topic 30: bike, dod, writes, article, ride, don, apr, ca, motorcycle, good
Topic 31: matthew, wire, neutral, judas, reported, acts, island, safety, hanging, outlets
Topic 32: israel, people, turkish, armenian, jews, israeli, armenians, article, writes, turkey
Topic 33: courtesy, announced, sequence, length, rates, molecular, sets, pm, automatic, barbara
Topic 34: car, writes, cars, article, don, good, engine, speed, apr, oil
Topic 35: didn, apartment, sumgait, started, mamma, father, karina, ll, guy, knew
Topic 36: year, writes, article, game, team, baseball, good, don, games, runs
Topic 37: south, war, tb, nuclear, rockefeller, georgia, military, bay, acquired, ships
Topic 38: entry, output, file, program, build, um, entries, ei, eof, printf
Topic 39: open, return, filename, read, input, char, write, enter, judges, year
Topic 40: people, government, article, tax, make, health, state, don, money, cramer

Table 1: The ten most probable words in the topics discovered by multi-task Gibbs Med LDA (K = 40) on the 20Newsgroups data set.

7. Conclusions and Discussions

We have presented Gibbs Med LDA, an alternative approach to learning max-margin supervised topic models by minimizing an expected margin loss. We have applied Gibbs Med LDA to various tasks including text categorization, regression, and multi-task learning. By using data augmentation techniques, we have developed simple and highly efficient augment-and-collapse Gibbs sampling algorithms, without making any restrictive assumptions on the posterior distributions. Empirical results on the medium-sized 20Newsgroups and hotel review data sets and a large-scale Wikipedia data set demonstrate significant improvements in time efficiency and classification accuracy over existing max-margin topic models.
Our approaches are applicable to building other max-margin latent variable models, such as max-margin nonparametric latent feature models for link prediction (Zhu, 2012) and matrix factorization (Xu et al., 2012). Finally, we release the code for public use.10 The new data augmentation formulation, which has no need to solve constrained subproblems, has shown great promise for improving the time efficiency of max-margin topic models. For future work, we are interested in developing highly scalable sampling algorithms (e.g., using a distributed architecture) (Newman et al., 2009; Smola and Narayanamurthy, 2010; Ahmed et al., 2012) to deal with large-scale data sets. One nice property of the sampling algorithms is that the augmented variables are local to each document. Therefore, they can be effectively handled in a distributed architecture. But the global prediction model weights bring new challenges. Some preliminary work is presented in Zhu et al. (2013b). Another interesting direction is to apply the data augmentation technique to the multi-class max-margin formulation, which was proposed by Crammer and Singer (2001) and used in Med LDA for learning multi-class max-margin topic models. Intuitively, it can be solved with an iterative procedure that infers the classifier weights associated with each category while fixing the others, similar to polychotomous logistic regression (Holmes and Held, 2006); each substep may then involve a binary hinge loss, to which our data augmentation techniques can be applied. A systematic investigation is left for future work.

Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), National Natural Science Foundation of China (Nos. 61322308, 61332007, 61305066), Tsinghua University Initiative Scientific Research Program (No. 20121088071), a Microsoft Research Asia Research Fund (No.
20123000007), and a China Postdoctoral Science Foundation Grant (No. 2013T60117) to NC.

10. The code is available at: http://www.ml-thu.net/~jun/gibbs-medlda.shtml.

References

Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alex Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM), pages 123-132, 2012.

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research (JMLR), 9:1981-2014, 2008.

Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research (JMLR), 6:1817-1853, 2005.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems (NIPS), pages 41-48, 2007.

David M. Blei and John D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), pages 121-128, 2007.

David M. Blei, Andrew Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.

Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Monograph Series of the Institute of Mathematical Statistics, 2007.

Jonathan Chang and David Blei. Relational topic models for document networks. In Artificial Intelligence and Statistics (AISTATS), pages 81-88, 2009.

Ning Chen, Jun Zhu, Fuchun Sun, and Eric P. Xing. Large-margin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(12):2365-2378, 2012.

Ning Chen, Jun Zhu, Fei Xia, and Bo Zhang. Generalized relational topic models with data augmentation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1273-1279, 2013.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265-292, 2001.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

Luc Devroye. Non-uniform Random Variate Generation. Springer-Verlag, 1986.

Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 524-531, 2005.

Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research (JMLR), 11:2001-2049, 2010.

Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pages 353-360, 2009.

Thomas Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), pages 5228-5235, 2004.

James P. Hobert. The data augmentation algorithm: theory and methodology. In Handbook of Markov Chain Monte Carlo (S. Brooks, A. Gelman, G. Jones and X.-L. Meng, eds.). Chapman & Hall/CRC Press, Boca Raton, FL, 2011.

Chris C. Holmes and Leonhard Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145-168, 2006.

Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems (NIPS), 1999.

Tony Jebara. Discriminative, Generative and Imitative Learning. PhD thesis, Media Laboratory, MIT, December 2001.

Qixia Jiang, Jun Zhu, Maosong Sun, and Eric P. Xing. Monte Carlo methods for maximum margin supervised topic models.
In Advances in Neural Information Processing Systems (NIPS), pages 1601-1609, 2012.

Thorsten Joachims. Making Large-scale SVM Learning Practical. MIT Press, 1999.

Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems (NIPS), pages 897-904, 2009.

David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.

Xiao-Li Meng and David A. van Dyk. The EM algorithm – an old folk song sung to a fast new tune. Journal of the Royal Statistical Society, Series B, 59:511-567, 1997.

John R. Michael, William R. Schucany, and Roy W. Haas. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88-90, 1976.

Kevin Miller, M. Pawan Kumar, Ben Packer, Danny Goodman, and Daphne Koller. Max-margin min-entropy models. In Artificial Intelligence and Statistics (AISTATS), pages 779-787, 2012.

Radford M. Neal. Markov chain Monte Carlo methods based on slicing the density function. Technical Report 9722, Department of Statistics, University of Toronto, 1997.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. Journal of Machine Learning Research (JMLR), 10:1801-1828, 2009.

Nicholas G. Polson and Steven L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1-24, 2011.

Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5:101-141, 2004.

Alex Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Very Large Data Bases (VLDB), 3(1-2):703-710, 2010.

Alex Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199-222, 2003.

Robert H. Swendsen and Jian-Sheng Wang.
Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86-88, 1987.

Aixin Tan. Convergence Rates and Regeneration of the Block Gibbs Sampler for Bayesian Random Effects Models. PhD thesis, Department of Statistics, University of Florida, 2009.

Martin A. Tanner and Wing-Hung Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association (JASA), 82(398):528-540, 1987.

Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2003.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 2nd ed., pages 667-685, 2010.

David van Dyk and Xiao-Li Meng. The art of data augmentation. Journal of Computational and Graphical Statistics (JCGS), 10(1):1-50, 2001.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

Minjie Xu, Jun Zhu, and Bo Zhang. Bayesian nonparametric maximum margin matrix factorization for collaborative prediction. In Advances in Neural Information Processing Systems (NIPS), pages 64-72, 2012.

Minjie Xu, Jun Zhu, and Bo Zhang. Fast max-margin matrix factorization with data augmentation. In International Conference on Machine Learning (ICML), pages 978-986, 2013.

Shuanghong Yang, Jiang Bian, and Hongyuan Zha. Hybrid generative/discriminative learning for automatic image annotation. In Uncertainty in Artificial Intelligence (UAI), pages 683-690, 2010.

Chun-Nam Yu and Thorsten Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning (ICML), pages 1169-1176, 2009.

Jun Zhu. Max-margin nonparametric latent feature models for link prediction. In International Conference on Machine Learning (ICML), pages 719-726, 2012.

Jun Zhu and Eric P. Xing. Conditional topic random fields.
In International Conference on Machine Learning (ICML), pages 1239-1246, 2010.

Jun Zhu, Ning Chen, and Eric P. Xing. Infinite SVM: Dirichlet process mixtures of large-margin kernel machines. In International Conference on Machine Learning (ICML), pages 617-624, 2011.

Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research (JMLR), 13:2237-2278, 2012.

Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. Gibbs max-margin topic models with fast inference algorithms. In International Conference on Machine Learning (ICML), pages 124-132, 2013a.

Jun Zhu, Xun Zheng, Li Zhou, and Bo Zhang. Scalable inference in max-margin topic models. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 964-972, 2013b.

Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research (JMLR), in press (arXiv:1210.1766v3), 2014.