# Online Bayesian Passive-Aggressive Learning

Tianlin Shi (STL501@GMAIL.COM)
Jun Zhu (DCSZJ@MAIL.TSINGHUA.EDU.CN)

Institute for Interdisciplinary Information Sciences, Tsinghua University, China
Dept. of Comp. Sci. & Tech., TNList Lab, State Key Lab of Intell. Tech. & Sys., Tsinghua University, China

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

## Abstract

Online Passive-Aggressive (PA) learning is an effective framework for performing max-margin online learning. But the deterministic formulation and estimated single large-margin model could limit its capability in discovering descriptive structures underlying complex data. This paper presents online Bayesian Passive-Aggressive (BayesPA) learning, which subsumes the online PA and extends naturally to incorporate latent variables and perform nonparametric Bayesian inference, thus providing great flexibility for explorative analysis. We apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric methods to resolve the number of topics. Experimental results show that our approaches significantly improve time efficiency while maintaining comparable results with the batch counterparts.

## 1. Introduction

Online learning is an effective way to deal with large-scale applications, especially applications with streaming data. Online Passive-Aggressive (PA) learning (Crammer et al., 2006) provides a generic framework for online large-margin learning, with many applications (McDonald et al., 2005; Chiang et al., 2008). Though enjoying strong discriminative ability suitable for predictive tasks, existing PA methods are formulated as a point estimate problem by optimizing some deterministic objective function. This may lead to some inconvenience. For example, a single large-margin model is often less than sufficient in describing complex data, which may have rich underlying structures.

On the other hand, Bayesian methods enjoy great flexibility in describing the possible underlying structures of complex data. Moreover, the recent progress on nonparametric Bayesian methods (Hjort, 2010; Teh et al., 2006a) further provides an increasingly important framework that allows Bayesian models to have an unbounded model complexity, e.g., an infinite number of components in a mixture model (Hjort, 2010) or an infinite number of units in a latent feature model (Ghahramani & Griffiths, 2005), and to adapt when the learning environment changes. For Bayesian models, one challenging problem is posterior inference, for which both variational and Monte Carlo methods can be too expensive to apply to large-scale problems. To scale up Bayesian inference, much progress has been made on developing online variational Bayes (Hoffman et al., 2010; Mimno et al., 2012) and online Monte Carlo (Ahn et al., 2012) methods. However, due to their generative nature, Bayesian models lack the discriminative ability of large-margin methods and are usually less than sufficient for performing discriminative tasks.

Successful attempts have been made to bring large-margin learning and Bayesian methods together.
For example, maximum entropy discrimination (MED) (Jaakkola et al., 1999) made a significant advance in conjoining max-margin learning and Bayesian generative models, mainly in the context of supervised learning and structured output prediction (Zhu & Xing, 2009). Recently, much attention has been focused on generalizing MED to incorporate latent variables and perform nonparametric Bayesian inference, in many contexts including topic modeling (Zhu et al., 2012), matrix factorization (Xu et al., 2012), and multi-task learning (Jebara, 2011; Zhu et al., 2011). However, posterior inference in such models remains a big challenge. It is desirable to develop efficient online algorithms for these Bayesian max-margin models.

To address the above problems of both the online PA and Bayesian max-margin models, we present online Bayesian Passive-Aggressive (BayesPA) learning, a generic framework for performing online learning for Bayesian max-margin models. We show that online BayesPA subsumes the standard online PA when the underlying model is linear and the parameter prior is Gaussian. We further show that one significance of BayesPA is its natural generalization to incorporate latent variables and to perform nonparametric Bayesian inference, thus allowing BayesPA to have the great flexibility of (nonparametric) Bayesian methods for explorative analysis as well as the strong discriminative ability of large-margin learning for predictive tasks. As concrete examples, we apply the theory of online BayesPA to topic modeling and derive efficient online learning algorithms for max-margin supervised topic models (Zhu et al., 2012). We further develop efficient online learning algorithms for nonparametric max-margin topic models, an extension of the nonparametric topic models (Teh et al., 2006a; Wang et al., 2011) for predictive tasks. Extensive empirical results on real data sets show significant improvements in time efficiency while maintaining comparable results with the batch counterparts.

## 2. Bayesian Passive-Aggressive Learning

In this section, we present a general perspective on online max-margin Bayesian inference.

### 2.1. Online PA Learning

The goal of online learning is to minimize the cumulative loss for a certain prediction task on sequentially arriving training samples. Online Passive-Aggressive (PA) algorithms (Crammer et al., 2006) achieve this goal by updating some parameterized model $w$ (e.g., the weights of a linear SVM) in an online manner with the instantaneous losses from arriving data $\{x_t\}_{t \ge 0}$ and corresponding responses $\{y_t\}_{t \ge 0}$. The loss $\ell_\epsilon(w; x_t, y_t)$ could be the hinge loss $(\epsilon - y_t w^\top x_t)_+$ for binary classification or the $\epsilon$-insensitive loss $(|y_t - w^\top x_t| - \epsilon)_+$ for regression, where $\epsilon$ is a hyper-parameter and $(x)_+ = \max(0, x)$. The Passive-Aggressive update rule is then derived by defining the new weight $w_{t+1}$ as the solution to the optimization problem:

$$\min_w \; \tfrac{1}{2}\|w - w_t\|^2 \quad \text{s.t.:} \; \ell_\epsilon(w; x_t, y_t) = 0. \tag{1}$$

Intuitively, if $w_t$ suffers no loss on the new data, i.e., $\ell_\epsilon(w_t; x_t, y_t) = 0$, the algorithm passively assigns $w_{t+1} = w_t$; otherwise, it aggressively projects $w_t$ onto the feasible zone of parameter vectors that attain zero loss. With provable bounds, Crammer et al. (2006) show that online PA algorithms can achieve results comparable to the optimal classifier $w^*$.
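For binary classification with the hinge loss, problem (1) admits the well-known closed-form solution $w_{t+1} = w_t + \tau_t y_t x_t$ with $\tau_t = \ell_\epsilon(w_t; x_t, y_t)/\|x_t\|^2$ (Crammer et al., 2006). The following minimal Python sketch illustrates this hard-margin PA update; the variable names and the toy data stream are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pa_update(w, x, y, eps=1.0):
    """One hard-margin PA step for binary classification (problem (1)).

    w: current weight vector; x: feature vector; y: label in {+1, -1}.
    Returns the updated weights w_{t+1}.
    """
    loss = max(0.0, eps - y * np.dot(w, x))   # hinge loss of w_t on (x_t, y_t)
    if loss == 0.0:                           # passive: constraint already satisfied
        return w
    tau = loss / np.dot(x, x)                 # aggressive: minimal step onto the zero-loss set
    return w + tau * y * x

# toy usage on a synthetic stream
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(100):
    x = rng.normal(size=5)
    y = 1.0 if x[0] + x[1] > 0 else -1.0      # hypothetical ground-truth rule
    w = pa_update(w, x, y)
```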
In practice, in order to account for inseparable training samples, soft-margin constraints are often adopted and the resulting learning problem is

$$\min_w \; \tfrac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t), \tag{2}$$

where $c$ is a positive regularization parameter. For problems (1) and (2) with samples arriving one at a time, closed-form solutions can be derived (Crammer et al., 2006).

### 2.2. Online BayesPA Learning

Instead of updating a point estimate of $w$, online BayesPA sequentially infers a new posterior distribution $q_{t+1}(w)$, either parametric or nonparametric, on the arrival of new data $(x_t, y_t)$ by solving the following optimization problem:

$$\min_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}[q(w)\|q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t|w)] \quad \text{s.t.:} \; \ell_\epsilon[q(w); x_t, y_t] = 0, \tag{3}$$

where $\mathcal{F}_t$ is some distribution family, e.g., the probability simplex $\mathcal{P}$. In other words, we find a posterior distribution $q_{t+1}(w)$ in the feasible zone that is not only close to $q_t(w)$ in the commonly used KL-divergence, but also has a high likelihood of explaining the new data. As a result, if Bayes' rule already gives a posterior distribution $q_{t+1}(w) \propto q_t(w)p(x_t|w)$ that suffers no loss (i.e., $\ell_\epsilon = 0$), BayesPA passively updates the posterior following just Bayes' rule; otherwise, BayesPA aggressively projects the new posterior into the feasible zone of posteriors that attain zero loss. We should note that when no likelihood is defined (e.g., $p(x_t|w)$ is independent of $w$), BayesPA will passively set $q_{t+1}(w) = q_t(w)$ if $q_t(w)$ suffers no loss. We call this case non-likelihood BayesPA.

In practical problems, the constraints in (3) could be unrealizable. To deal with such cases, we introduce the soft-margin version of BayesPA learning, which is equivalent to minimizing the objective function $\mathcal{L}(q(w))$ of problem (3) with a regularization term (Cortes & Vapnik, 1995):

$$q_{t+1}(w) = \mathop{\mathrm{argmin}}_{q(w) \in \mathcal{F}_t} \; \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); x_t, y_t). \tag{4}$$

For the max-margin classifiers that we focus on in this paper, two loss functions $\ell_\epsilon(q(w); x_t, y_t)$ are common: the hinge loss of an averaging classifier that makes predictions using the rule $\hat{y}_t = \mathrm{sign}\,\mathbb{E}_{q(w)}[w^\top x_t]$,

$$\ell^{\mathrm{Avg}}_\epsilon[q(w); x_t, y_t] = (\epsilon - y_t\,\mathbb{E}_{q(w)}[w^\top x_t])_+,$$

and the expected hinge loss of a Gibbs classifier that randomly draws a classifier $w \sim q(w)$ to make predictions using the rule $\hat{y}_t = \mathrm{sign}\,w^\top x_t$,

$$\ell^{\mathrm{Gibbs}}_\epsilon[q(w); x_t, y_t] = \mathbb{E}_{q(w)}[(\epsilon - y_t w^\top x_t)_+].$$

They are closely connected via the following lemma, due to the convexity of the function $(x)_+$.

Lemma 2.1. The expected hinge loss $\ell^{\mathrm{Gibbs}}_\epsilon$ is an upper bound of the hinge loss $\ell^{\mathrm{Avg}}_\epsilon$, that is, $\ell^{\mathrm{Gibbs}}_\epsilon \ge \ell^{\mathrm{Avg}}_\epsilon$.

Before developing BayesPA learning for practical problems, we make several observations.

Lemma 2.2. If $q_0(w) = \mathcal{N}(0, I)$, $\mathcal{F}_t = \mathcal{P}$ and we use $\ell^{\mathrm{Avg}}_\epsilon$, the non-likelihood BayesPA subsumes the online PA. This can be proved by induction; see Appendix A.

Lemma 2.3. If $\mathcal{F}_t = \mathcal{P}$ and we use $\ell^{\mathrm{Gibbs}}_\epsilon$, the update rule of online BayesPA is

$$q_{t+1}(w) = \frac{q_t(w)\,p(x_t|w)\,e^{-2c(\epsilon - y_t w^\top x_t)_+}}{\Gamma(x_t, y_t)}, \tag{5}$$

where $\Gamma(x_t, y_t)$ is the normalization constant. Therefore, the posterior $q_t(w)$ from the previous round $t$ becomes a prior, while the newly observed data and its loss function provide a likelihood and an unnormalized pseudo-likelihood, respectively.

Mini-Batches. A useful technique to reduce the noise in data is the use of mini-batches. Suppose that we have a mini-batch of data points at time $t$ with an index set $B_t$, denoted as $X_t = \{x_d\}_{d \in B_t}$, $Y_t = \{y_d\}_{d \in B_t}$. The BayesPA update equation for this mini-batch is simply

$$q_{t+1}(w) = \mathop{\mathrm{argmin}}_{q \in \mathcal{F}_t} \; \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); X_t, Y_t),$$

where $\ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d)$.
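To make the update rule (5) concrete, the following toy Python sketch approximates the sequence of posteriors for the non-likelihood BayesPA case with a Gibbs classifier by importance-weighting samples drawn from the prior: each arriving example multiplies a particle's weight by the pseudo-likelihood $e^{-2c(\epsilon - y_t w^\top x_t)_+}$. This is only an illustration of the rule under assumed names and synthetic data, not the inference procedure used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, c, eps = 2, 200, 1.0, 1.0

# particles drawn once from the prior q_0(w) = N(0, I); their weights track the
# unnormalized density prod_t exp(-2c * hinge_t(w)) appearing in rule (5)
W = rng.normal(size=(1000, D))
log_wts = np.zeros(len(W))

for _ in range(N):
    x = rng.normal(size=D)
    y = 1.0 if x.sum() > 0 else -1.0              # hypothetical labeling rule
    hinge = np.maximum(0.0, eps - y * W @ x)      # (eps - y w^T x)_+ for every particle
    log_wts += -2.0 * c * hinge                   # multiply in the pseudo-likelihood

wts = np.exp(log_wts - log_wts.max())
wts /= wts.sum()
w_mean = wts @ W                                  # E_q[w] under the approximate posterior
print("averaging-classifier weights:", w_mean)
```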
### 2.3. Learning with Latent Structures

To expressively explain complex real-world data, Bayesian models with latent structures have been extensively developed. The latent structures can typically be characterized by two kinds of latent variables: local latent variables $h_d$ ($d \ge 0$) that characterize the hidden structures of each observed data point $x_d$, and global variables $M$ that capture the common properties shared by all data. The goal of BayesPA learning with latent structures is therefore to update the distribution of $M$ as well as the weights $w$ based on each incoming mini-batch $(X_t, Y_t)$ and its corresponding latent variables $H_t = \{h_d\}_{d \in B_t}$. Because of the uncertainty in $H_t$, we extend BayesPA to infer the joint posterior distribution $q_{t+1}(w, M, H_t)$ by solving

$$\min_{q \in \mathcal{F}_t} \; \mathcal{L}(q(w, M, H_t)) + 2c\,\ell_\epsilon(q(w, M, H_t); X_t, Y_t), \tag{6}$$

where $\mathcal{L}(q) = \mathrm{KL}[q\|q_t(w, M)p_0(H_t)] - \mathbb{E}_q[\log p(X_t|w, M, H_t)]$ and $\ell_\epsilon(q; X_t, Y_t)$ is some cumulative margin loss on the mini-batch data, induced from classifiers defined on the latent variables $H_t$ and/or the global variables $M$. Both averaging classifiers and Gibbs classifiers can be used, as in the case without latent variables. We will present concrete examples in the next sections.

Before diving into the details, we should note that in a real online setting, only the global variables are maintained in the bookkeeping, while the local information in the streaming data is forgotten. However, (6) gives us a distribution of $(w, M)$ that is coupled with the local variables $H_t$. Although in some cases we can marginalize out the local variables $H_t$, in general we would not obtain a closed-form posterior distribution $q_{t+1}(w, M)$ for the next optimization round, especially when dealing with more involved models like MedLDA (Zhu et al., 2012). Therefore, we resort to approximation methods, e.g., by posing additional assumptions on $q(w, M, H_t)$ such as the mean-field assumption $q(w, M, H_t) = q(w)q(M)q(H_t)$. Then, we can solve the problem via an iterative procedure and use the optimal distribution $q^*(w)q^*(M)$ as $q_{t+1}(w, M)$. More details will be provided in the next sections.

## 3. Online Max-Margin Topic Models

We apply the theory of online BayesPA to topic modeling and develop online learning algorithms for max-margin topic models. We also present a nonparametric generalization that resolves the number of topics in the next section.

### 3.1. Batch MedLDA

A max-margin topic model consists of a latent Dirichlet allocation (LDA) (Blei et al., 2003) model for describing the underlying topic representations and a max-margin classifier for predicting responses. Specifically, LDA is a hierarchical Bayesian model that treats each document as an admixture of topics $\Phi = \{\phi_k\}_{k=1}^K$, where each topic $\phi_k$ is a multinomial distribution over a $W$-word vocabulary. Let $\theta_d$ denote the mixing proportions of document $d$. The generative process of document $d$ is

$$z_{di} \sim \mathrm{Mult}(\theta_d), \quad x_{di} \sim \mathrm{Mult}(\phi_{z_{di}}), \quad i \in [n_d],$$

where $z_{di}$ is a topic assignment variable and $\mathrm{Mult}(\cdot)$ is a multinomial distribution. For Bayesian LDA, the topics are drawn from a Dirichlet distribution, i.e., $\phi_k \sim \mathrm{Dir}(\gamma)$.
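As a reference for the generative process above, here is a small Python sketch that samples a synthetic corpus from Bayesian LDA; the symmetric Dirichlet prior $\mathrm{Dir}(\alpha)$ on the mixing proportions $\theta_d$ and all constants are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, D, n_d = 5, 1000, 100, 50          # topics, vocabulary size, documents, words per doc
alpha, gamma = 0.1, 0.5                  # assumed symmetric Dirichlet hyper-parameters

Phi = rng.dirichlet(np.full(W, gamma), size=K)         # topics phi_k ~ Dir(gamma)

docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))           # mixing proportions theta_d
    z = rng.choice(K, size=n_d, p=theta)               # topic assignments z_di ~ Mult(theta_d)
    x = np.array([rng.choice(W, p=Phi[k]) for k in z]) # words x_di ~ Mult(phi_{z_di})
    docs.append((z, x))

z_bar = np.bincount(docs[0][0], minlength=K) / n_d     # average topic assignments of doc 0
print("empirical topic proportions of document 0:", z_bar)
```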
Given a document set $X = \{x_d\}_{d=1}^D$, let $Z = \{z_d\}_{d=1}^D$ and $\Theta = \{\theta_d\}_{d=1}^D$. LDA infers the posterior distribution $p(\Phi, \Theta, Z|X) \propto p_0(\Phi, \Theta, Z)p(X|Z, \Phi)$ via Bayes' rule. From a variational point of view, the Bayes posterior is equivalent to the solution of the optimization problem:

$$\min_{q \in \mathcal{P}} \; \mathrm{KL}[q(\Phi, \Theta, Z)\|p(\Phi, \Theta, Z|X)].$$

The advantage of the variational formulation of Bayesian inference lies in the convenience of posing restrictions on the post-data distribution with a regularization term. For supervised topic models (Blei & McAuliffe, 2010; Zhu et al., 2012), such a regularization term could be a loss function of a prediction model $w$ on the data $X = \{x_d\}_{d=1}^D$ and response signals $Y = \{y_d\}_{d=1}^D$. As a regularized Bayesian (RegBayes) model (Jiang et al., 2012), MedLDA infers a distribution over the latent variables $Z$ as well as the classification weights $w$ by solving the problem:

$$\min_{q \in \mathcal{P}} \; \mathcal{L}(q(w, \Phi, \Theta, Z)) + 2c\sum_{d=1}^{D}\ell_\epsilon(q(w, z_d); x_d, y_d),$$

where $\mathcal{L}(q(w, \Phi, \Theta, Z)) = \mathrm{KL}[q(w, \Phi, \Theta, Z)\|p(w, \Phi, \Theta, Z|X)]$. To specify the loss function, a linear discriminant function is defined with respect to $w$ and $z_d$:

$$f(w, z_d) = w^\top \bar{z}_d, \tag{7}$$

where $\bar{z}_{dk} = \frac{1}{n_d}\sum_i \mathbb{I}[z_{di} = k]$ is the average topic assignment of the words in document $d$. Based on the discriminant function, both averaging classifiers with the hinge loss

$$\ell^{\mathrm{Avg}}_\epsilon(q(w, z_d); x_d, y_d) = (\epsilon - y_d\,\mathbb{E}_q[f(w, z_d)])_+, \tag{8}$$

and Gibbs classifiers with the expected hinge loss

$$\ell^{\mathrm{Gibbs}}_\epsilon(q(w, z_d); x_d, y_d) = \mathbb{E}_q[(\epsilon - y_d f(w, z_d))_+], \tag{9}$$

have been proposed, with extensive comparisons reported in (Zhu et al., 2013a) using batch learning algorithms.

### 3.2. Online MedLDA

To apply online BayesPA, we have the global variables $M = \Phi$ and the local variables $H_t = (\Theta_t, Z_t)$. We consider Gibbs MedLDA because, as shown in (Zhu et al., 2013a), it admits efficient inference algorithms by exploring data augmentation. Specifically, let $\zeta_d = \epsilon - y_d f(w, z_d)$ and $\psi(y_d|z_d, w) = e^{-2c(\zeta_d)_+}$. Then, in light of Lemma 2.3, the optimal solution to problem (6), $q_{t+1}(w, M, H_t)$, is

$$\frac{q_t(w, M)\,p_0(H_t)\,p(X_t|H_t, M)\,\psi(Y_t|H_t, w)}{\Gamma(X_t, Y_t)},$$

where $\psi(Y_t|H_t, w) = \prod_{d \in B_t}\psi(y_d|h_d, w)$ and $\Gamma(X_t, Y_t)$ is a normalization constant. To potentially improve the inference accuracy, we first integrate out the local variables $\Theta_t$ by the conjugacy between a Dirichlet prior and a multinomial likelihood (Griffiths & Steyvers, 2004; Teh et al., 2006b). Then we have the local variables $H_t = Z_t$. By the equality (Zhu et al., 2013a):

$$\psi(y_d|z_d, w) = \int_0^\infty \psi(y_d, \lambda_d|z_d, w)\,d\lambda_d, \tag{10}$$

where $\psi(y_d, \lambda_d|z_d, w) = (2\pi\lambda_d)^{-1/2}\exp\big(-\tfrac{(\lambda_d + c\zeta_d)^2}{2\lambda_d}\big)$, the collapsed posterior $q_{t+1}(w, \Phi, Z_t)$ is a marginal distribution of $q_{t+1}(w, \Phi, Z_t, \lambda_t)$, which equals

$$\frac{p_0(Z_t)\,q_t(w, \Phi)\,p(X_t|Z_t, \Phi)\,\psi(Y_t, \lambda_t|Z_t, w)}{\Gamma(X_t, Y_t)},$$

where $\psi(Y_t, \lambda_t|Z_t, w) = \prod_{d \in B_t}\psi(y_d, \lambda_d|z_d, w)$ and $\lambda_t = \{\lambda_d\}_{d \in B_t}$ are augmented variables, which are also locally associated with individual documents. In fact, the augmented distribution $q_{t+1}(w, \Phi, Z_t, \lambda_t)$ is the solution to the problem:

$$\min_{q \in \mathcal{P}} \; \mathcal{L}(q(w, \Phi, Z_t, \lambda_t)) - \mathbb{E}_q[\log \psi(Y_t, \lambda_t|Z_t, w)], \tag{11}$$

where $\mathcal{L}(q) = \mathrm{KL}[q(w, \Phi, Z_t, \lambda_t)\|q_t(w, \Phi)p_0(Z_t)] - \mathbb{E}_q[\log p(X_t|Z_t, \Phi)]$. We can show that this objective is an upper bound of that of the original problem (6); see Appendix B for details. With the mild mean-field assumption that $q(w, \Phi, Z_t, \lambda_t) = q(w)q(\Phi)q(Z_t, \lambda_t)$, we can solve (11) via an iterative procedure that alternately updates each factor distribution (Jordan et al., 1998), as detailed below.

Global Update: By fixing the distribution of the local variables, $q(Z_t, \lambda_t)$, and ignoring irrelevant variables, we have the mean-field update equations:

$$q(\Phi_k) \propto q_t(\Phi_k)\exp\big(\mathbb{E}_{q(Z_t)}[\log p_0(Z_t)p(X_t|Z_t, \Phi)]\big), \; \forall k,$$
$$q(w) \propto q_t(w)\exp\big(\mathbb{E}_{q(Z_t, \lambda_t)}[\log p_0(Z_t)\psi(Y_t, \lambda_t|Z_t, w)]\big).$$

If initially $q_0(\Phi_k) = \mathrm{Dir}(\Delta^0_{k1}, \dots, \Delta^0_{kW})$ and $q_0(w) = \mathcal{N}(w; \mu_0, \Sigma_0)$, by induction we can show that the inferred distributions in each round have a closed form, namely, $q_t(\Phi_k) = \mathrm{Dir}(\Delta^t_{k1}, \dots, \Delta^t_{kW})$ and $q_t(w) = \mathcal{N}(w; \mu_t, \Sigma_t)$. For the above update equations, we then have

$$q(\Phi_k) = \mathrm{Dir}(\Delta^*_{k1}, \dots, \Delta^*_{kW}), \tag{12}$$

where $\Delta^*_{kw} = \Delta^t_{kw} + \sum_{d \in B_t}\sum_{i \in [n_d]}\gamma^k_{di}\,\mathbb{I}[x_{di} = w]$ for all words $w$ and $\gamma^k_{di} = \mathbb{E}_{q(z_d)}\mathbb{I}[z_{di} = k]$ is the probability of assigning word $x_{di}$ to topic $k$, and

$$q(w) = \mathcal{N}(w; \mu^*, \Sigma^*), \tag{13}$$

where the posterior parameters are computed as $(\Sigma^*)^{-1} = (\Sigma_t)^{-1} + c^2\sum_{d \in B_t}\mathbb{E}_{q(z_d, \lambda_d)}[\lambda_d^{-1}\bar{z}_d\bar{z}_d^\top]$ and $\mu^* = \Sigma^*\big((\Sigma_t)^{-1}\mu_t + c\sum_{d \in B_t}\mathbb{E}_{q(z_d, \lambda_d)}[y_d(1 + c\epsilon\lambda_d^{-1})\bar{z}_d]\big)$.
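As a concrete illustration of the global update, the following Python sketch applies (12) and (13) given Monte Carlo samples of the local variables $(z_d, \lambda_d)$ for one mini-batch. It is a minimal sketch under assumed array shapes and variable names (e.g., `Delta_t`, `Z_samples`), not the authors' implementation.

```python
import numpy as np

def global_update(Delta_t, mu_t, Sigma_t, docs, y, Z_samples, lam_samples, c=1.0, eps=1.0):
    """One application of (12)-(13) estimated from Monte Carlo samples of the local variables.

    Delta_t:       (K, W) Dirichlet parameters of q_t(Phi)
    mu_t, Sigma_t: parameters of q_t(w) = N(mu_t, Sigma_t)
    docs:          list of word-id arrays, one per document in the mini-batch
    y:             (D,) labels in {+1, -1}
    Z_samples:     list over documents of (J, n_d) sampled topic assignments
    lam_samples:   (D, J) sampled augmented variables lambda_d
    """
    K, W = Delta_t.shape
    Delta_new = Delta_t.copy()
    Sigma_inv = np.linalg.inv(Sigma_t)
    rhs = Sigma_inv @ mu_t                      # (Sigma_t)^{-1} mu_t term of (13)

    for d, (x_d, Z_d, lam_d) in enumerate(zip(docs, Z_samples, lam_samples)):
        J, n_d = Z_d.shape
        # (12): Monte Carlo estimate of E_q[ I[z_di = k] I[x_di = w] ] added into Delta
        for j in range(J):
            np.add.at(Delta_new, (Z_d[j], x_d), 1.0 / J)
        # empirical z_bar for every sample, shape (J, K)
        z_bar = np.stack([np.bincount(Z_d[j], minlength=K) / n_d for j in range(J)])
        inv_lam = 1.0 / lam_d                   # lambda_d^{-1} per sample
        # (13): accumulate E[lambda^{-1} z_bar z_bar^T] and E[y (1 + c*eps/lambda) z_bar]
        Sigma_inv += c**2 * np.einsum('j,jk,jl->kl', inv_lam, z_bar, z_bar) / J
        rhs += c * y[d] * ((1.0 + c * eps * inv_lam)[:, None] * z_bar).mean(axis=0)

    Sigma_new = np.linalg.inv(Sigma_inv)
    mu_new = Sigma_new @ rhs
    return Delta_new, mu_new, Sigma_new
```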
Local Update: Given the distribution of the global variables, $q(\Phi, w)$, the mean-field update equation for $(Z_t, \lambda_t)$ is

$$q(Z_t, \lambda_t) \propto p_0(Z_t)\prod_{d \in B_t}\frac{1}{\sqrt{2\pi\lambda_d}}\exp\Big(\sum_{i \in [n_d]}\Lambda_{z_{di}, x_{di}} - \mathbb{E}_{q(\Phi, w)}\Big[\frac{(\lambda_d + c\zeta_d)^2}{2\lambda_d}\Big]\Big), \tag{14}$$

where $\Lambda_{z_{di}, x_{di}} = \mathbb{E}_{q(\Phi)}[\log\Phi_{z_{di}, x_{di}}] = \Psi(\Delta^*_{z_{di}, x_{di}}) - \Psi(\sum_w \Delta^*_{z_{di}, w})$ and $\Psi(\cdot)$ is the digamma function, due to the Dirichlet distribution in (12). But it is impossible to evaluate the expectations in the global update using (14) because of the huge number of configurations of $(Z_t, \lambda_t)$. As a result, we turn to Gibbs sampling and estimate the required expectations using multiple empirical samples. This hybrid strategy has shown promising performance for LDA (Mimno et al., 2012). Specifically, the conditional distributions used in the Gibbs sampling are as follows.

For $Z_t$: By canceling out common factors, the conditional distribution of one variable $z_{di}$ given $Z^{\neg di}_t$ and $\lambda_t$ is

$$q(z_{di} = k|Z^{\neg di}_t, \lambda_t) \propto (\alpha + C^{\neg di}_{dk})\exp\Big(\Lambda_{k, x_{di}} + \frac{c y_d(c\epsilon + \lambda_d)\mu^*_k}{n_d\lambda_d} - \frac{c^2\big(\mu^{*2}_k + \Sigma^*_{kk} + 2(\mu^*_k\mu^* + \Sigma^*_{\cdot,k})^\top C^{\neg di}_d\big)}{2n_d^2\lambda_d}\Big), \tag{15}$$

where $\Sigma^*_{\cdot,k}$ is the $k$-th column of $\Sigma^*$ and $C^{\neg di}_d$ is a vector whose $k$-th entry is the number of words in document $d$ (except the $i$-th word) that are assigned to topic $k$.

For $\lambda_t$: Let $\bar{\zeta}_d = \epsilon - y_d\bar{z}_d^\top\mu^*$. The conditional distribution of each variable $\lambda_d$ given $Z_t$ is

$$q(\lambda_d|Z_t) \propto \frac{1}{\sqrt{2\pi\lambda_d}}\exp\Big(-\frac{c^2\bar{z}_d^\top\Sigma^*\bar{z}_d + (\lambda_d + c\bar{\zeta}_d)^2}{2\lambda_d}\Big) \propto \mathcal{GIG}\Big(\lambda_d; \tfrac{1}{2},\, 1,\, c^2(\bar{\zeta}_d^2 + \bar{z}_d^\top\Sigma^*\bar{z}_d)\Big), \tag{16}$$

a generalized inverse Gaussian distribution (Devroye, 1986). Therefore, $\lambda_d^{-1}$ follows an inverse Gaussian distribution $\mathcal{IG}\big(\lambda_d^{-1}; \frac{1}{c\sqrt{\bar{\zeta}_d^2 + \bar{z}_d^\top\Sigma^*\bar{z}_d}}, 1\big)$, from which we can draw a sample in constant time (Michael et al., 1976).

Algorithm 1 Online MedLDA
1: Let $q_0(w) = \mathcal{N}(0, v^2 I)$ and $q_0(\phi_k) = \mathrm{Dir}(\gamma)$, $\forall k$.
2: for $t = 0, 1, 2, \dots$ do
3:   Set $q(\Phi, w) = q_t(\Phi, w)$. Initialize $Z_t$.
4:   for $i = 1$ to $\mathcal{I}$ do
5:     Draw samples $\{Z^{(j)}_t, \lambda^{(j)}_t\}_{j=1}^{\mathcal{J}}$ from (15) and (16).
6:     Discard the first $\beta$ burn-in samples ($\beta < \mathcal{J}$).
7:     Use the remaining $\mathcal{J} - \beta$ samples to update $q(\Phi, w)$ following (12) and (13).
8:   end for
9:   Set $q_{t+1}(\Phi, w) = q(\Phi, w)$.
10: end for

For training, we run the global and local updates alternately until convergence at each round of PA optimization, as outlined in Alg. 1. To make predictions on test data, we then draw one sample $\hat{w}$ as the classification weights and apply the prediction rule. The inference of $\bar{z}$ for test documents is the same as in (Zhu et al., 2013a).

## 4. Online Nonparametric MedLDA

We present online nonparametric MedLDA for resolving the unknown number of topics, based on the theory of the hierarchical Dirichlet process (HDP) (Teh et al., 2006a).

### 4.1. Batch MedHDP

HDP provides an extension of LDA that allows for nonparametric inference of the unknown number of topics. The generative process of HDP can be summarized using a stick-breaking construction (Wang & Blei, 2012), where the stick lengths $\pi = \{\pi_k\}_{k=1}^\infty$ are generated as $\pi'_k \sim \mathrm{Beta}(1, \gamma)$ and $\pi_k = \pi'_k\prod_{j=1}^{k-1}(1 - \pi'_j)$ for $k = 1, 2, \dots$; each document then draws its topic proportions from a Dirichlet process with base measure $\pi$ and concentration $\alpha$, and each topic is drawn as $\phi_k \sim \mathrm{Dir}(\eta)$. MedHDP augments HDP with a max-margin classifier on the topic representations, in the same way as MedLDA augments LDA in Section 3.1.
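For intuition, the following short Python sketch draws a truncated realization of the corpus-level stick-breaking weights described above; the truncation level, variable names, and the use of $\gamma$ as the corpus-level concentration follow the construction sketched in Section 4.1 and are otherwise assumptions for illustration, since the construction itself is over infinitely many topics.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, K_max = 1.0, 50                     # corpus-level concentration, display truncation

beta_draws = rng.beta(1.0, gamma, size=K_max)            # pi'_k ~ Beta(1, gamma)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_draws)[:-1]))
pi = beta_draws * remaining                               # pi_k = pi'_k * prod_{j<k} (1 - pi'_j)

print("first stick lengths:", np.round(pi[:5], 3))
print("mass of the first", K_max, "sticks:", pi.sum())    # approaches 1 as K_max grows
```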
### 4.2. Online MedHDP

Analogous to online MedLDA, online MedHDP maintains a global distribution $q(w, \pi, \Phi)$ over the classification weights, the stick lengths, and the topics, and alternates between global and local updates on each incoming mini-batch; the local variables now include the topic assignments $Z_t$, the augmented variables $\lambda_t$, and auxiliary count variables $S_t = \{s_{dk}\}$. Since $Z_t$ contains only a finite number of discrete variables, we only need to maintain and update the global distribution for a finite number of topics.

Local Update: Fixing the global distribution $q(w, \pi, \Phi)$, we get the mean-field update equation for $(Z_t, S_t, \lambda_t)$:

$$q(Z_t, S_t, \lambda_t) \propto q(Z_t, S_t)\,q(Z_t, \lambda_t), \tag{21}$$

where $q(Z_t, S_t) = \exp\big(\mathbb{E}_{q(\Phi)q(\pi)}[\log p(X_t|\Phi, Z_t) + \log p(Z_t, S_t|\pi)]\big)$ and $q(Z_t, \lambda_t) = \exp\big(\mathbb{E}_{q(w)}[\log \psi(Y_t, \lambda_t|w, Z_t)]\big)$. To overcome the potentially unbounded latent space, we take the ideas from (Wang & Blei, 2012) and adopt an approximation for $q(Z_t, S_t)$:

$$q(Z_t, S_t) \propto \mathbb{E}_{q(\Phi)q(\pi)}[p(X_t|\Phi, Z_t)p(Z_t, S_t|\pi)]. \tag{22}$$

Instead of marginalizing out $\pi$ in (22), which is analytically difficult, we sample $\pi$ jointly with $(Z_t, S_t, \lambda_t)$. This leads to the following Gibbs sampling scheme.

For $Z_t$: Let $K$ be the current inferred number of topics. The conditional distribution of one variable $z_{di}$ given $Z^{\neg di}_t$, $\lambda_t$ and $\pi$ can be derived from (21), with $s_d$ marginalized out for convenience:

$$q(z_{di} = k|Z^{\neg di}_t, \lambda_t, \pi) \propto (\alpha\pi_k + C^{\neg di}_{dk})\frac{C^{\neg di}_{k, x_{di}} + \Delta_{k, x_{di}}}{\sum_w(C^{\neg di}_{kw} + \Delta_{kw})}\exp\Big(\frac{c y_d(c\epsilon + \lambda_d)\mu^*_k}{n_d\lambda_d} - \frac{c^2\big(\mu^{*2}_k + \Sigma^*_{kk} + 2(\mu^*_k\mu^* + \Sigma^*_{\cdot,k})^\top C^{\neg di}_d\big)}{2n_d^2\lambda_d}\Big).$$

Besides, for $k > K$ and a symmetric Dirichlet prior $\eta$, this becomes $q(z_{di} = k|Z^{\neg di}_t, \lambda_t, \pi) \propto \alpha\pi_k/W$, and therefore the total probability of assigning a new topic is $q(z_{di} > K|Z^{\neg di}_t, \lambda_t, \pi) \propto \frac{\alpha}{W}\big(1 - \sum_{k=1}^{K}\pi_k\big)$.

For $\lambda_t$: The conditional distribution $q(\lambda_d|Z_t, S_t, \pi)$ is the same as (16).

For $S_t$: The conditional distribution of $s_{dk}$ given $Z_t$, $\pi$ and $\lambda_t$ can be derived from the joint distribution (17):

$$q(s_{dk}|Z_t, \lambda_t, \pi) \propto S(n_d\bar{z}_{dk}, s_{dk})(\alpha\pi_k)^{s_{dk}}, \tag{23}$$

where $S(\cdot, \cdot)$ denotes the unsigned Stirling numbers of the first kind (Antoniak, 1974).

For $\pi$: It can be derived from (21) that given $(Z_t, S_t, \lambda_t)$, each stick-breaking variable $\pi'_k$ follows a beta distribution, $\pi'_k \sim \mathrm{Beta}(a_k, b_k)$, where $a_k = u_k + \sum_{d \in B_t}s_{dk}$ and $b_k = v_k + \sum_{d \in B_t}\sum_{j > k}s_{dj}$, with $(u_k, v_k)$ the beta parameters maintained by the global update.

Similar to online MedLDA, we iterate the above steps until convergence for training.

Figure 1. Test errors with different numbers of passes through the 20NG training dataset. Left: LDA-based models. Right: HDP-based models.

## 5. Experiments

We demonstrate the efficiency and prediction accuracy of online MedLDA and MedHDP, denoted as paMedLDA and paMedHDP, on the 20Newsgroups (20NG) dataset and a large Wikipedia dataset. A sensitivity analysis of the key parameters is also provided. Following the same setting as in (Zhu et al., 2012), we remove a standard list of stop words. All of the experiments are done on a normal computer with a single-core clock rate up to 2.4 GHz.

### 5.1. Classification on 20Newsgroups

We perform multi-class classification on the 20NG dataset with all 20 categories. The training set contains 11,269 documents, with the smallest category having 376 documents and the biggest category having 599 documents. The test set contains 7,505 documents, with the smallest and biggest categories having 259 and 399 documents respectively. We use the one-vs-all strategy (Rifkin & Klautau, 2004) for multi-class classification. We compare paMedLDA and paMedHDP with their batch counterparts, denoted as bMedLDA and bMedHDP, which are obtained by letting the batch size $|B|$ equal the dataset size $D$, and with Gibbs MedLDA (Zhu et al., 2013a), denoted as gMedLDA, which performs Gibbs sampling in a batch manner. We also consider online unsupervised topic models as baselines, including sparse inference for LDA (spLDA) (Mimno et al., 2012), which has been demonstrated to be superior to online variational LDA (Hoffman et al., 2010) in performance, and truncation-free online variational HDP (tfHDP) (Wang & Blei, 2012), which has been shown to be promising in nonparametric topic modeling.
For both of them, we learn a linear SVM on the topic representations using LIBSVM (Chang & Lin, 2011). The performance of other batch supervised topic models, such as sLDA (Blei & McAuliffe, 2010) and DiscLDA (Lacoste-Julien et al., 2008), is reported in (Zhu et al., 2012). For all LDA-based topic models, we use symmetric Dirichlet priors $\alpha = \frac{1}{K}\cdot\mathbf{1}$ and $\gamma = 0.5\cdot\mathbf{1}$; for all HDP-based topic models, we use $\alpha = 5$, $\gamma = 1$, $\eta = 0.45\cdot\mathbf{1}$; for all MED topic models, we use $\epsilon = 164$, $c = 1$, $v = 1$, the choice of which is not crucial to the models' performance as shown in (Zhu et al., 2013a).

We first analyze how many processed documents are sufficient for each model to converge. Figure 1 shows the prediction accuracy against the number of passes through the entire 20NG dataset, where $K = 80$ for parametric models and $(\mathcal{I}, \mathcal{J}, \beta) = (1, 2, 0)$ for BayesPA. As we can observe, by solving a series of latent BayesPA learning problems, paMedLDA and paMedHDP fully exploit the redundancy of documents and converge in one pass, while their batch counterparts need many passes as burn-in steps. Besides, compared with the online unsupervised learning algorithms, BayesPA topic models utilize the supervising-side information from each mini-batch, and therefore exhibit a faster convergence rate in discrimination ability.

Next, we study each model's best possible performance and the corresponding training time. To allow for a fair comparison, we train each model until the relative change of its objective is less than $10^{-4}$. Figure 2 shows the accuracy and training time of the LDA-based models on the whole dataset with varying numbers of topics. Similarly, Figure 3 shows the accuracy and training time of the HDP-based models, where the dots stand for the mean inferred numbers of topics and the lengths of the horizontal bars represent their standard deviations. As we can see, BayesPA topic models, at the power of online learning, are about one order of magnitude faster than their batch counterparts in training time. Furthermore, thanks to the merits of Gibbs sampling, which does not pose strict mean-field assumptions about the independence of latent variables, BayesPA topic models parallel their batch alternatives in accuracy.

Figure 2. Classification accuracy and running time of paMedLDA and comparison models on the 20NG dataset.

Figure 3. Classification accuracy and running time of paMedHDP and comparison models on the 20NG dataset.

### 5.2. Further Discussions

We provide further discussions on BayesPA learning for topic models. First, we analyze the models' sensitivity to some key parameters. Second, we illustrate an application to a large Wikipedia dataset containing 1.1 million documents, where class labels are not exclusive.

Figure 4. Test errors of paMedLDA (left) and paMedHDP (right) with different batch sizes on the 20NG dataset.

#### 5.2.1. Sensitivity Analysis

Batch Size $|B|$: Figure 4 presents the test errors of BayesPA topic models as a function of training time on the entire 20NG dataset with various batch sizes, where $K = 40$. We can see that the convergence speeds of different algorithms vary.
First of all, the batch algorithms suffer from multiple passes through the dataset and are therefore much slower than the online alternatives. Second, we can observe that algorithms with medium batch sizes ($|B| = 64, 256$) converge faster. If we choose a batch size that is too small, for example $|B| = 1$, each iteration does not provide sufficient evidence for the update of the global variables; if the batch size is too large, each mini-batch becomes redundant and the convergence rate drops.

Number of iterations $\mathcal{I}$ and samples $\mathcal{J}$: Since the time complexity of Algorithm 1 is linear in both $\mathcal{I}$ and $\mathcal{J}$, we would like to know how these parameters influence the quality of the trained model. First, notice that the first $\beta$ samples are discarded as burn-in steps. To understand how large $\beta$ should be, we consider different settings of the pair $(\mathcal{J}, \beta)$ and check the prediction accuracy of Algorithm 1 for $K = 40$, $|B| = 512$, as shown in Table 1. We can see that accuracies closer to the diagonal of the table are relatively lower, while settings with the same number of kept samples, e.g. $(\mathcal{J}, \beta) = (3, 0), (5, 2), (9, 6)$, yield similar results. The number of kept samples plays a more significant role in the performance of BayesPA topic models than the number of burn-in steps.

Table 1. Effect of the number of samples and burn-in steps.

| $\mathcal{J}$ \ $\beta$ | 0 | 2 | 4 | 6 | 8 |
|---|---|---|---|---|---|
| 1 | 0.783 | | | | |
| 3 | 0.803 | 0.799 | | | |
| 5 | 0.808 | 0.803 | 0.792 | | |
| 9 | 0.806 | 0.806 | 0.806 | 0.804 | 0.796 |

Next, we analyze which settings of $(\mathcal{I}, \mathcal{J})$ guarantee good performance. Figure 5 presents the results. As we can see, for $\mathcal{J} = 1$ the algorithms suffer from the noisy approximation and therefore sacrifice prediction accuracy. But for larger $\mathcal{J}$, simply setting $\mathcal{I} = 1$ is promising, possibly due to the redundancy among mini-batches.

Figure 5. Classification accuracies and training time of (a) paMedLDA and (b) paMedHDP with different combinations of $(\mathcal{I}, \mathcal{J})$ on the 20NG dataset.

#### 5.2.2. Multi-Task Classification

For multi-task classification, a set of binary classifiers is trained, each of which identifies whether a document $x_d$ belongs to a specific task/category, $y^\tau_d \in \{+1, -1\}$. These binary classifiers are allowed to share common latent representations and therefore can be attained via a modified BayesPA update equation:

$$\min_{q \in \mathcal{F}_t} \; \mathcal{L}(q(w, M, H_t)) + 2c\sum_{\tau=1}^{T}\ell_\epsilon(q(w, M, H_t); X_t, Y^\tau_t),$$

where $T$ is the total number of tasks; a small sketch of this summed loss appears at the end of this subsection. We can then derive the multi-task version of Passive-Aggressive topic models, denoted by paMedLDA-mt and paMedHDP-mt, in a way similar to Sections 3.2 and 4.2. We test paMedLDA-mt and paMedHDP-mt as well as comparison models, including bMedLDA-mt, bMedHDP-mt and gMedLDA-mt (Zhu et al., 2013b), on a large Wiki dataset built from the Wikipedia collection used in the PASCAL LSHC challenge 2012 (see http://lshtc.iit.demokritos.gr/). The Wiki dataset is a collection of documents with up to 20 different kinds of labels, and the data distribution among the labels is balanced. The training/testing split is 1.1 million / 5 thousand. To measure performance, we use the F1 score, the harmonic mean of precision and recall. Figure 6 shows the F1 scores of various models as a function of training time.

Figure 6. F1 scores of various models on the 1.1M Wikipedia dataset.

We can see that BayesPA topic models are again about one order of magnitude faster than the batch alternatives and yet produce comparable results. Therefore, BayesPA topic models are potentially extendable to large-scale multi-class settings.
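Returning to the modified multi-task objective above, the following Python sketch evaluates the summed Gibbs-classifier hinge loss $\sum_\tau \ell^{\mathrm{Gibbs}}_\epsilon$ for $T$ binary tasks that share the same topic representations $\bar{z}_d$, estimated from posterior samples of per-task weights. The array shapes and names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def multi_task_gibbs_hinge(W_samples, Z_bar, Y, c=1.0, eps=1.0):
    """Summed expected hinge loss over T tasks for one mini-batch.

    W_samples: (J, T, K) posterior samples of per-task weights w_tau
    Z_bar:     (D, K)    mean topic proportions z_bar_d of the mini-batch documents
    Y:         (D, T)    labels y_d^tau in {+1, -1}
    Returns 2c * sum_tau sum_d E_q[(eps - y_d^tau w_tau^T z_bar_d)_+].
    """
    scores = np.einsum('jtk,dk->jdt', W_samples, Z_bar)   # w_tau^T z_bar_d per sample
    hinge = np.maximum(0.0, eps - Y[None, :, :] * scores) # (eps - y f)_+ per sample
    return 2.0 * c * hinge.mean(axis=0).sum()             # average over samples, sum over d, tau

# toy usage with random shapes
rng = np.random.default_rng(0)
J, T, K, D = 4, 3, 10, 8
loss = multi_task_gibbs_hinge(rng.normal(size=(J, T, K)),
                              rng.dirichlet(np.ones(K), size=D),
                              rng.choice([-1.0, 1.0], size=(D, T)))
print("mini-batch multi-task loss:", loss)
```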
## 6. Conclusions and Future Work

We present online Bayesian Passive-Aggressive (BayesPA) learning as a new framework for max-margin Bayesian inference on streaming data. We show that BayesPA subsumes the online PA and, more significantly, generalizes naturally to incorporate latent variables and to perform nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. Based on the ideas of BayesPA, we develop efficient online learning algorithms for max-margin topic models as well as their nonparametric extensions. Empirical experiments on several real datasets demonstrate significant improvements in time efficiency, while maintaining comparable results.

As future work, we are interested in establishing provable bounds on the convergence behavior and limitations of BayesPA. Furthermore, a better understanding of its mathematical structure would allow one to design more involved BayesPA algorithms for various models. We are also interested in developing highly scalable, distributed (Broderick et al., 2013) BayesPA learning paradigms, which will better meet the demand of processing the massive real data available today.

Acknowledgments

The work is supported by the National Basic Research Program of China (No. 2013CB329403), National Natural Science Foundation of China (Nos. 61322308, 61332007), and Tsinghua University Initiative Scientific Research Program (No. 20121088071).

## References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

Antoniak, C.E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152-1174, 1974.

Blei, D.M. and McAuliffe, J.D. Supervised topic models. In NIPS, 2010.

Blei, D.M., Ng, A., and Jordan, M.I. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A.C., and Jordan, M.I. Streaming variational Bayes. arXiv preprint arXiv:1307.6769, 2013.

Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Chiang, D., Marton, Y., and Resnik, P. Online large-margin training of syntactic and structural translation features. In EMNLP, 2008.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online Passive-Aggressive algorithms. JMLR, 7:551-585, 2006.

Devroye, L. Non-Uniform Random Variate Generation. Springer, 1986.

Ghahramani, Z. and Griffiths, T.L. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.

Griffiths, T.L. and Steyvers, M. Finding scientific topics. PNAS, 101:5228-5235, 2004.

Hjort, N.L. Bayesian Nonparametrics. Cambridge University Press, 2010.

Hoffman, M., Bach, F.R., and Blei, D.M. Online learning for latent Dirichlet allocation. In NIPS, 2010.

Jaakkola, T., Meila, M., and Jebara, T. Maximum entropy discrimination. In NIPS, 1999.

Jebara, T. Multitask sparsity via maximum entropy discrimination. JMLR, 12:75-110, 2011.

Jiang, Q., Zhu, J., Sun, M., and Xing, E.P. Monte Carlo methods for maximum margin supervised topic models. In NIPS, 2012.
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., and Saul, L.K. An Introduction to Variational Methods for Graphical Models. Springer, 1998.

Lacoste-Julien, S., Sha, F., and Jordan, M.I. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008.

McDonald, R., Crammer, K., and Pereira, F. Online large-margin training of dependency parsers. In ACL, 2005.

Michael, J.R., Schucany, W.R., and Haas, R.W. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88-90, 1976.

Mimno, D., Hoffman, M., and Blei, D.M. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.

Rifkin, R. and Klautau, A. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.

Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006a.

Teh, Y.W., Newman, D., and Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, 2006b.

Wang, C. and Blei, D.M. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, 2012.

Wang, C., Paisley, J.W., and Blei, D.M. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.

Xu, M., Zhu, J., and Zhang, B. Bayesian nonparametric maximum margin matrix factorization for collaborative prediction. In NIPS, 2012.

Zhu, J. and Xing, E.P. Maximum entropy discrimination Markov networks. JMLR, 10:2531-2569, 2009.

Zhu, J., Chen, N., and Xing, E.P. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In ICML, 2011.

Zhu, J., Ahmed, A., and Xing, E.P. MedLDA: maximum margin supervised topic models. JMLR, 13:2237-2278, 2012.

Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with fast sampling algorithms. In ICML, 2013a.

Zhu, J., Zheng, X., Zhou, L., and Zhang, B. Scalable inference in max-margin topic models. In SIGKDD, 2013b.