Journal of Machine Learning Research 18 (2017) 1-39 Submitted 5/14; Revised 4/16; Published 4/17

Online Bayesian Passive-Aggressive Learning

Tianlin Shi TIANLINS@CS.STANFORD.EDU
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, 100084 China

Jun Zhu DCSZJ@TSINGHUA.EDU.CN
State Key Lab of Intelligent Technology and Systems, Tsinghua National Lab for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084 China

Editor: Amir Globerson

Correspondence should be addressed to J. Zhu. T. Shi is now with the Stanford Artificial Intelligence Laboratory (SAIL), Stanford University, Stanford, CA, 94305.

© 2017 Tianlin Shi and Jun Zhu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/14-188.html.

Abstract

We present online Bayesian Passive-Aggressive (Bayes PA) learning, a generic online learning framework for hierarchical Bayesian models with max-margin posterior regularization. We show that Bayes PA subsumes the standard online Passive-Aggressive (PA) learning and extends naturally to incorporate latent variables for both parametric and nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. As an important example, we apply Bayes PA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric Bayes PA topic models to infer the unknown number of topics in an online manner. Experimental results on 20newsgroups and a large Wikipedia multi-label dataset (with 1.1 million training documents and 0.9 million unique terms in the vocabulary) show that our approaches significantly improve time efficiency while achieving comparable accuracy with the corresponding batch algorithms.

1. Introduction

With the fast growth of data volume and variety, it is becoming increasingly important to develop scalable machine learning algorithms, which fall into two major categories: online/stochastic methods on a single core and parallel/distributed algorithms on multiple cores or multiple machines. This paper focuses on online learning, a process of answering a sequence of questions (e.g., which category does a document belong to?) given knowledge of the correct answers (e.g., the true category labels) to previous questions. Such a process is especially suitable for applications with streaming data. For applications with large datasets, online learning algorithms can effectively explore data redundancy relative to the model to be learned, by repeatedly subsampling the data; and they often lead to faster convergence to satisfactory results than the corresponding batch learning algorithms. Among the many popular algorithms, online Passive-Aggressive (PA) learning (Crammer et al., 2006) provides a generic framework for performing online learning for large-margin methods (e.g., SVMs), with many applications in natural language processing and text mining (McDonald et al., 2005; Chiang et al., 2008). Though enjoying strong discriminative ability that is preferable for predictive tasks, existing online PA methods are formulated as a point estimate problem by optimizing some deterministic objective function. This may lead to some potential shortcomings. For example, a single large-margin model could fail to capture the rich underlying structure of complex data.
On the other hand, Bayesian methods enjoy greater flexibility in describing the possible underlying structures of complex data by incorporating a hierarchy of latent variables. Moreover, the recent progress on nonparametric Bayesian methods (Hjort, 2010; Teh et al., 2006a) further provides an increasingly important framework that allows the Bayesian models to have an unbounded model complexity, e.g., an infinite number of components in a mixture model (Hjort, 2010) or an infinite number of units in a latent feature model (Ghahramani and Griffiths, 2005), and to adapt when the learning environment changes. In particular, adaptation to the changing environment is of great importance in online learning. For Bayesian models, one challenging problem is posterior inference, for which both variational and Monte Carlo methods can be too expensive to be applied to large-scale applications. To scale up Bayesian inference, much progress has been made on developing stochastic variational Bayes (Hoffman et al., 2013; Mimno et al., 2012) and stochastic gradient Monte Carlo (Welling and Teh, 2011; Ahn et al., 2012) methods, which repeatedly draw samples from a given finite dataset.1 To deal with the potentially unbounded streaming data, streaming variational Bayes methods (Broderick et al., 2013) have been developed as a general framework, with an application to topic models for learning latent topic representations. However, due to the generative nature, Bayesian models usually does not perform as well as discriminative models in predictive tasks such as classification. Successful attempts have been made to bring large-margin learning and Bayesian methods together to solve supervised learning tasks using flexible Bayesian latent variable models. For example, maximum entropy discrimination (MED) made a significant advance in conjoining max-margin learning and Bayesian generative models, in the context of univariate prediction (Jaakkola et al., 1999) and structured output prediction (Zhu and Xing, 2009). Recently, much attention has been devoted to generalizing MED to incorporate latent variables and perform nonparametric Bayesian inference in various contexts, including topic modeling (Zhu et al., 2012), matrix factorization (Xu et al., 2013), and multi-task learning (Jebara, 2011; Zhu et al., 2014c). Regularized Bayesian inference (Reg Bayes) (Zhu et al., 2014c) provides a unified framework for Bayesian models to perform max-margin learning, where the max-margin principle is incorporated through adding posterior constraints to an information-theoretical optimization problem. By solving this problem involving latent variables, Reg Bayes can uncover structures that support the supervised learning tasks. Reg Bayes subsumes the standard Bayes rule and is more flexible in incorporating domain knowledge or max-margin constraints. Though flexible in discovering latent structures and powerful in discriminative predictions, posterior inference in such models remains a challenge. By exploring data augmentation techniques, recent progress has been made to develop efficient MCMC methods (Zhu et al., 2014b), which can also be implemented in distributed clusters (Zhu et al., 2013). However, these batch learning methods are not applicable to streaming data, and they do not explore the statistical redundancy in large-scale corpora either. 1. There are also many distributed methods for scalable Bayesian inference. Please see (Zhu et al., 2014a) for a review. 
ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING To address the above problems of both online PA on incorporating flexible latent structures and Bayesian max-margin models on scalable streaming inference, this paper presents online Bayesian Passive-Aggressive (Bayes PA) learning, a general framework of performing online learning for Bayesian max-margin models. We show that online Bayes PA subsumes the standard online PA when the underlying model is linear and the parameter prior is Gaussian (See Table 1 for its close relationships with streaming variational Bayes and Reg Bayes). We further show that one major significance of Bayes PA is its natural generalization to incorporate a hierarchy of latent variables for both parametric and nonparametric Bayesian inference, therefore allowing online Bayes PA to have the great flexibility of (nonparametric) Bayesian methods for explorative analysis as well as the strong discriminative ability of large-margin learning for predictive tasks. As concrete examples, we apply the theory of online Bayes PA to topic modeling and derive efficient online learning algorithms for max-margin supervised topic models (Zhu et al., 2012). We further develop efficient online learning algorithms for the nonparametric max-margin topic models, an extension of the nonparametric topic models (Teh et al., 2006a) for predictive tasks. Extensive empirical results on real datasets demonstrate significant improvements on time efficiency and maintenance of comparable results with corresponding batch algorithms. The paper is structured as follows. We discuss the related work in Section 2, and review the preliminary knowledge in Section 3. Then, we move on to the detailed description of Bayes PA in Section 4, and present the theoretical analysis. Section 5 presents the concrete instantiations on topic modeling, and Section 6 presents the extensions to nonparametric topic models and multi-task learning. Section 7 presents experimental results. Finally, Section 8 concludes this paper with future directions discussed. 2. Related Work As a well-established learning paradigm, online learning is of both theoretical and practical interest. The goal of online learning is to make a sequence of decisions, such as classification and regression, and use the knowledge extracted from previous correct answers to produce decisions on incoming ones. The root of online learning could be traced back to early studies of repeated games (Hannan, 1957), where an agent dynamically makes choices with the summary of past information. The idea became popular with the advent of Perceptron algorithms (Rosenblatt, 1958), which adopt an additive update rule for the classifier weights, and its multiplicative counterpart is the Winnow algorithm (Littlestone, 1988). The class of online multiplicative algorithms was further generalized by Adaboost (Freund and Schapire, 1997) in a decision theoretic sense and is now widely applied to various fields of study (Arora et al., 2012). First presented by Crammer et al. (2006) as a weight updating method, online Passive-Aggressive (PA) algorithms provide a generic online learning framework for max-margin models. In particular, they considered loss functions that enforce max-margin constraints, and showed that surprisingly simple update rules could be derived in closed forms. Motivated by online PA learning and to handle unbalanced training sets, Dredze et al. 
(2008) proposed confidence-weighted learning, which maintains a Gaussian distribution of the classifier weights at each round and replaces the max-margin constraint in PA with a probabilistic constraint enforcing confidence of classification. Within the same framework, Crammer et al. (2008) derived a new convex form of the constraint and demonstrated performance improvements through empirical evaluations. SHI AND ZHU The theoretical analysis of online learning typically relies on the notion of regret, which is the average loss incurred by an adaptive online learner on streaming data versus the best achievable through a single fixed model having the hindsight of all data (Murphy, 2012). It can be shown that the notion of regret is closely related to weak duality in convex optimization, which brings online learning to the algorithmic framework of convex repeated games (Shalev-Shwartz and Singer, 2006). Although the classical regime of online learning is based on decision theory, recently much attention has been paid to the theory and practice of online probabilistic inference in the context of Big Data. Rooted either in variational inference or Monte Carlo sampling methods, there are broadly two lines of work on the topic of online Bayesian inference. Stochastic variational inference (SVI) (Hoffman et al., 2013) is a stochastic approximation algorithm for mean-field variational inference. By approximating the natural gradients in maximizing the evidence lower bound with stochastic gradients sampled from data points, Hoffman et al. (2013) demonstrated scalable inference of topic models on large corpora. Mimno et al. (2012) showed the performance of SVI could be improved through structured mean-field assumptions and locally collapsed variational inference. SVI is also applicable to the stochastic inference of nonparametric Bayesian models, such as hierarchical Dirichlet process (Wang et al., 2011; Wang and Blei, 2012b). There is also a large body of work on extending Monte Carlo methods to the online setting. A classic approach is sequential Monte Carlo methods (SMC) or particle filters (Doucet and Johansen, 2009), which arose from the numerical estimation of state-space models. Through Rao Blackwellized particle filters (Doucet et al., 2000), one could obtain online inference algorithms for latent Dirichlet allocation (LDA) (Canini et al., 2009), and more recently Gao et al. (2016) presented a streaming (collapsed) Gibbs sampler with better accuracy. To tackle the sparsity issues and inadequate coverage of particles, Steinhardt and Liang (2014) leveraged abstract particles to represent entire regions of the sample space. Recently, Korattikara et al. (2014) introduced an approximate Metropolis-Hasting rule based on sequential hypothesis testing that allows accepting and rejecting samples using only a fraction of the data. As an alternative, Bardenet et al. (2014) proposed an adaptive sampling strategy of Metropolis-Hastings from a controlled perturbation of the target distribution. With elegant use of gradient information that Metropolis-Hastings algorithms neglected, a line of work (Welling and Teh, 2011; Ahn et al., 2012; Patterson and Teh, 2013) also developed stochastic gradient methods based on Langevin dynamics. While most online Bayesian inference methods have adopted a stochastic approximation of the posterior distribution by sub-sampling a given finite dataset, in many applications data arrives in stream so that the dataset is changing over time and its size is unknown. 
To relax the previous request on knowing the data size, Broderick et al. (2013) made streaming updates to the estimated posterior and demonstrated the advantage of streaming variational Bayes (SVB) over stochastic variational inference. As will be discussed in this paper, Bayes PA also does not impose assumptions on the size of dataset and works on streaming data. The idea to discriminatively train univariate or structured output classifiers was popularized by the seminal work on support vector machines (Vapnik, 1995) and max-margin Markov networks (aka structural SVMs) (Taskar et al., 2003). Since then, research on developing max-margin models with latent variables has received increasing attention, because of the promise to capture underlying complex structures of the problems. A promising line of work focused on Bayesian approaches, and one representative is maximum entropy discrimination (MED) (Jaakkola et al., 1999; Jebara, 2001; Zhu and Xing, 2009), which learns distributions of model parameters discriminatively from labeled ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING data. Med LDA (Zhu et al., 2012) extends MED to infer latent topical structure from data with large margin constraints on the target posteriors. Similarly, nonparametric Bayesian max-margin models have also been developed, such as infinite SVMs (i SVM) (Zhu et al., 2011) for building SVM classifiers with a latent mixture structure, and infinite latent SVMs (i LSVM) (Zhu et al., 2014c) for automatic discovering predictive features for SVMs. Furthermore, the idea of nonparametric Bayesian learning has been applied to link prediction (Zhu, 2012), matrix factorization (Xu et al., 2013), etc. Regularized Bayesian inference (Reg Bayes) (Zhu et al., 2014c) provides a unified framework for performing max-margin learning of (nonparametric) Bayesian models, where the max-margin principle was incorporated through imposing posterior constraints to a variational formulation of the standard Bayes rule. Max-margin Bayesian learning in the batch mode has already been one of the common challenges facing this class of models. Despite its general intractability, efficient algorithms have been proposed under different settings. One way is to solve the problem via variational inference under a mean-field (Zhu et al., 2012) or structured mean-field (Jiang et al., 2012) assumption. Recently, Zhu et al. (2014b) provided a key insight in deriving efficient Monte Carlo methods without making strict assumptions. Their technique is based on a data augmentation formulation of the expected margin loss. Based on similar techniques, fast inference algorithms have also been developed for generalized relational topic models (Chen et al., 2015), matrix factorization (Xu et al., 2013), etc. Data augmentation (DA) refers to the method of introducing augmented variables along with the observed data to make their joint distribution tractable. The technique was popularized in the statistics community by the well-known expectation-maximization algorithm (EM) (Dempster et al., 1977) for maximum likelihood estimation with missing data. For posterior inference, the technique is popularized by Tanner and Wong (1987) in statistics and by Swendsen and Wang (1987) for the Ising and Potts models in physics. For a broader introduction to DA methods, we refer the readers to Van Dyk and Meng (2001). Finally, our conference version of the paper (Shi and Zhu, 2014) has introduced some preliminary work, which we largely extend here. 3. 
Preliminaries

This section reviews the preliminary knowledge that is needed to develop online Bayesian Passive-Aggressive learning. The relationships with Bayes PA will be summarized in Table 1 later.

3.1 Online Passive-Aggressive Learning

Based on a decision-theoretic view, the goal of online supervised learning is to minimize the cumulative loss for a certain prediction task on sequentially arriving training samples. Online Passive-Aggressive (PA) learning (Crammer et al., 2006) achieves this goal by updating a parametric model $w \in \mathbb{R}^K$ (e.g., the weights of a linear SVM) in an online manner with the instantaneous losses from arriving data $\{x_t\}_{t \geq 0}$ ($x_t \in \mathbb{R}^K$) and corresponding responses $\{y_t\}_{t \geq 0}$, where $K$ is the number of input features. The loss $\ell_\epsilon(w; x_t, y_t)$ they consider could be the hinge loss $(\epsilon - y_t w^\top x_t)_+$ for binary classification ($y_t \in \{+1, -1\}$) or the $\epsilon$-insensitive loss $(|y_t - w^\top x_t| - \epsilon)_+$ for regression ($y_t \in \mathbb{R}$), where $\epsilon$ is a hyper-parameter and $(x)_+ = \max(0, x)$. The online Passive-Aggressive update rule is then derived by defining the new weight $w_{t+1}$ as the solution to the following optimization problem:

$$\min_{w} \ \frac{1}{2}\|w - w_t\|^2 \quad \text{s.t.:}\ \ \ell_\epsilon(w; x_t, y_t) = 0, \qquad (1)$$

where $\|\cdot\|$ is the Euclidean 2-norm. Intuitively, if $w_t$ suffers no loss from the new data, i.e., $\ell_\epsilon(w_t; x_t, y_t) = 0$, the algorithm passively assigns $w_{t+1} = w_t$; otherwise, it aggressively projects $w_t$ onto the feasible region of parameter vectors that attain zero loss on the new data. With provable regret bounds, Crammer et al. (2006) showed that online PA algorithms can achieve results comparable to the optimal classifier $w^*$, which has the hindsight of all data. In practice, in order to account for inseparable training samples, soft-margin constraints are often adopted and the resulting PA learning problem is

$$\min_{w} \ \frac{1}{2}\|w - w_t\|^2 + 2c\, \ell_\epsilon(w; x_t, y_t), \qquad (2)$$

where $c$ is a positive regularization parameter and the constant factor 2 is included for simplicity, as will be clear soon. For problems (1) and (2) with samples arriving one at a time, closed-form solutions can be derived (Crammer et al., 2006). For example, for the binary hinge loss the update rule for problem (2) is $w_{t+1} = w_t + \tau_t y_t x_t$, where $\tau_t = \min\big(2c,\ (\epsilon - y_t w_t^\top x_t)_+ / \|x_t\|^2\big)$; and for the $\epsilon$-insensitive loss, the update rule is $w_{t+1} = w_t + \mathrm{sign}(y_t - w_t^\top x_t)\,\tau_t x_t$, where $\tau_t = (|y_t - w_t^\top x_t| - \epsilon)_+ / \|x_t\|^2$.

3.2 Streaming Variational Bayes

For Bayesian models, Bayes rule naturally leads to a streaming update procedure for online learning. Specifically, suppose the data $\{x_t\}_{t \geq 0}$ are generated i.i.d. according to a distribution $p(x|w)$ and the prior $p(w)$ is given. Bayes' theorem implies that the posterior distribution of $w$ given the first $T$ samples ($T \geq 1$) satisfies

$$p\big(w \mid \{x_t\}_{t=0}^{T}\big) \ \propto\ p\big(w \mid \{x_t\}_{t=0}^{T-1}\big)\, p(x_T \mid w). \qquad (3)$$

In other words, the posterior after observing the first $T-1$ samples is treated as the prior for the incoming data point. This natural streaming Bayes rule, however, is often intractable to compute, especially for complex models (e.g., when latent variables are present). Streaming variational Bayes (SVB) (Broderick et al., 2013) suggests that a variational approximation should be adopted, and it works well in practice. Specifically, let $\mathcal{A}$ be any algorithm that calculates an approximate posterior $q(w) = \mathcal{A}(X, q_0(w))$ based on data $X$ and prior $q_0(w)$. Then, the recursive formula for the approximate streaming update is $q(w \mid \{x_t\}_{t=0}^{T}) = \mathcal{A}\big(x_T,\ q(w \mid \{x_t\}_{t=0}^{T-1})\big)$. The choice of $\mathcal{A}$ can be problem-specific. For topic modeling, Broderick et al.
(2013) showed that one may adopt mean-field variational Bayes (Wainwright and Jordan, 2008), expectation propagation (Minka, 2001), and one-pass posterior approximation algorithms using stochastic variational inference (Hoffman et al., 2013) or sufficient statistics update (Honkela and Valpola, 2003; Luts et al., 2013). By applying the streaming update in a distributed setting asynchronously, SVB could also be scaled up across multiple computer clusters (Broderick et al., 2013). ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING 3.3 Regularized Bayesian Inference The decision-theoretic and Bayesian view of learning can be reciprocal. For example, it would be beneficial to combine the flexibility of Bayesian models with the discriminative power of largemargin methods. The idea of regularized Bayesian inference (Reg Bayes) (Zhu et al., 2014c) is to treat Bayesian inference as an optimization problem with an imposed loss function for prediction tasks or a regularization term on the target posterior distribution in general. Mathematically, Reg Bayes for supervised learning can be formulated as min q P KL h q(w)||p0(w) i Eq(w) h log p(X|w) i + 2c ℓ q(w); X, Y , (4) where P is the probability simplex, p(X|w) is the likelihood function and KL is the Kullback Leibler divergence.2 Note that if the loss is zero, i.e., ℓ(q(w); X, Y ) = 0, then the optimal solution is q (w) p0(w)p(X|w), which is just the Bayes rule. However, when ℓ(q(w); X, Y ) = 0, Reg Bayes biases the inferred posterior towards discriminating the supervising side-information Y , with the parameter c controlling the extent of regularization. If the posterior regularization term ℓ( ) is a convex function of q(w) through the linear expectation operator, Zhu et al. (2014c) presented a general representation theorem to characterize the solution to problem (4). To distinguish from the posterior distribution obtained via Bayes rule, the solution to problem (4) is called post-data posterior (Zhu et al., 2014c). Many instantiations have been developed following the generic framework of Reg Bayes to demonstrate its superior performance than standard Bayesian models in various settings, such as topic modeling (Jiang et al., 2012; Zhu et al., 2014b), matrix factorization (Xu et al., 2013) and link prediction (Zhu, 2012), as well as the flexibility of performing robust Bayesian inference with noisy domain knowledge (Mei et al., 2014). 4. Bayesian Passive-Aggressive Learning In this section, we present online Bayesian Passive-Aggressive (Bayes PA) learning, a general perspective on online max-margin Bayesian inference. Without loss of generality, we consider binary classification. The techniques can be applied to other learning tasks. We provide an extension in Section 6.2. 4.1 Online Bayes PA Learning Instead of updating a point estimate of w, online Bayesian PA (Bayes PA) sequentially infers a new post-data posterior distribution qt+1(w), either parametric or nonparametric, on the arrival of a new data point (xt, yt) by solving the following optimization problem: min q(w) Ft KL h q(w)||qt(w) i Eq(w) h log p(xt|w) i s.t.: ℓϵ q(w); xt, yt = 0, (5) 2. We assume that the model space W is a complete separable metric space endowed with its Borel σ-algebra B(W ). Let P0 and Q be probability measures on W . The Kullback-Leibler (KL) divergence of the probability measure Q with respect to the measure P0 is defined as KL[Q P0] = R d Q d P0 (w) log d Q d P0 (w)d P0(w), where d Q d P0 (w) is the Radon-Nikodym derivative (Durret, 2010). 
It is required that Q is absolutely continuous with respect to P0 such that this derivative exists. In this paper, we further assume that P0 is absolutely continuous with respect to some background measure µ. Thus, there exists a density p0 that satisfies dP0 = p0 dµ and there also exists a density q that satisfies dQ = q dµ. Then, the KL-divergence can be expressed as $\mathrm{KL}[q\,\|\,p_0] = \int q(w) \log \frac{q(w)}{p_0(w)}\, d\mu(w)$.

Figure 1: Graphical illustration of Bayes PA learning. (a) Update passively by Bayes rule if the resulting distribution suffers zero loss. (b) Otherwise, aggressively project the resulting distribution onto the feasible region of posteriors with zero loss.

where $\mathcal{F}_t$ can be some family of distributions or the whole probability simplex $\mathcal{P}$. For notational convenience, we denote the objective as $\mathcal{L}(q(w))$. In words, we find a post-data posterior distribution $q_{t+1}(w)$ in the feasible region that is not only close to the current posterior $q_t(w)$ in terms of KL-divergence, but also has a high likelihood of explaining the new data. As a result, if Bayes rule already gives a posterior distribution $q_{t+1}(w) \propto q_t(w)\, p(x_t|w)$ that suffers no loss (i.e., $\ell_\epsilon = 0$), Bayes PA passively updates the posterior following just Bayes rule;3 otherwise, Bayes PA aggressively projects the new posterior onto the feasible region of posteriors that attain zero loss. The passive and aggressive update cases are illustrated in Figure 1. We should note that when no likelihood is defined (e.g., $p(x_t|w)$ is independent of $w$), Bayes PA passively sets $q_{t+1}(w) = q_t(w)$ if $q_t(w)$ suffers no loss; otherwise, it aggressively projects $q_t(w)$ onto the feasible region. We call this case non-likelihood Bayes PA.

In practical problems, the constraints in (5) could be unrealizable. To deal with such cases, we introduce the soft-margin version of Bayes PA learning, which is equivalent to minimizing the objective function $\mathcal{L}(q(w))$ in problem (5) with a regularization term, similar to SVMs (Cortes and Vapnik, 1995):

$$q_{t+1}(w) = \operatorname*{argmin}_{q(w) \in \mathcal{F}_t} \ \mathcal{L}\big(q(w)\big) + 2c\, \ell_\epsilon\big(q(w); x_t, y_t\big). \qquad (6)$$

For the max-margin classifiers that we consider, two types of loss functionals $\ell_\epsilon(q(w); x_t, y_t)$ are common:

3. Under the Bayesian setting, we regard Bayes rule as a natural update (passive), since no aggressive projection is applied.

Methods    Max-margin learning?   Bayesian inference?   Streaming update?
PA         yes                    no                    yes
SVB        no                     yes                   yes
Reg Bayes  yes                    yes                   no
Bayes PA   yes                    yes                   yes

Table 1: The comparison between Bayes PA and its various precursors, including online PA, streaming variational Bayes (SVB) and regularized Bayesian inference (Reg Bayes), in three different aspects.

1. Averaging classifier: assume that a post-data posterior distribution $q(w)$ is given; then an averaging classifier makes predictions using the sign rule $\hat{y}_t = \mathrm{sign}\big(\mathbb{E}_{q(w)}[w^\top x_t]\big)$ when the discriminant function has the simple linear form $f(x_t; w) = w^\top x_t$. For this classifier, its hinge loss is therefore defined as:

$$\ell_\epsilon^{\mathrm{ave}}\big(q(w); x_t, y_t\big) = \Big(\epsilon - y_t\, \mathbb{E}_{q(w)}\big[w^\top x_t\big]\Big)_+ .$$

2. Gibbs classifier: assume that a post-data posterior distribution $q(w)$ is given; then a Gibbs classifier randomly draws a weight vector $w \sim q(w)$ to make predictions using the sign rule $\hat{y}_t = \mathrm{sign}(w^\top x_t)$, when the discriminant function has the same linear form as above. For each single $w$, we can measure its hinge loss $(\epsilon - y_t w^\top x_t)_+$.
To account for the randomness of w, the expected hinge loss of a Gibbs classifier is therefore defined as: ℓgibbs ϵ q(w); xt, yt = Eq(w) They are closely connected via the following lemma due to the convexity of the function (x)+. Lemma 1 The expected hinge loss ℓgibbs ϵ is an upper bound of the hinge loss ℓave ϵ , that is, ℓgibbs ϵ q(w); xt, yt ℓave ϵ q(w); xt, yt . Bayes PA is deeply connected to its various precursors reviewed in Section 3, as summarized in Table 1. First, Bayes PA is a natural Bayesian extension of online PA, which is explicated via the following theorem. Note that this theorem only serves as a connection to PA, and will not be relied on by our algorithms, where we are more interested in the cases with nontrivial likelihood models to describe the underlying structures (e.g., topic structures), which are typically out-of-reach for the standard PA. Nevertheless, the idea of the proof details would later be applied to develop practical Bayes PA algorithms for topic models. Therefore, we include the complete proof here. Theorem 2 If the prior is normal q0(w) = N(w0, I), Ft = P, and we use the averaging classifier ℓave ϵ , the non-likelihood Bayes PA subsumes the online PA. SHI AND ZHU Proof The soft-margin version of Bayes PA learning can be reformulated using a slack variable ξt: qt+1(w) = argmin q(w) P KL h q(w)||qt(w) i + 2c ξt s.t. : yt Eq(w) w xt ϵ ξt, ξt 0. (7) Similar to Corollary 5 in (Zhu et al., 2012), the optimal solution q (w) of the above problem can be derived from its functional Lagrangian and has the following form: q (w) = 1 Γ(τ t , xt, yt)qt(w) exp τ t ytw xt , (8) where the normalization term Γ(τt, xt, yt) = R w qt(w) exp τtytw xt dw, and τ t is the optimal solution to the dual problem max τt ϵτt log Γ(τt, xt, yt) s.t. 0 τt 2c. (9) Using this primal-dual interpretation, we prove that for the normal prior p0(w) = N(w0, I), the post-data posterior is also Gaussian: qt(w) = N(µt, I) for some µt in each round t = 0, 1, 2, ... This can be shown by induction. By our assumption, q0(w) = p0(w) = N(w0, I) is Gaussian. Assume for round t 0, the distribution qt(w) = N(µt, I). Then for round t + 1, Eq. (8) suggests the distribution qt+1(w) = C Γ(τ t , xt, yt) exp 1 2||w (µt + τ t ytxt)||2 , where the constant C = exp(ytτ t µ t xt + 1 2τ 2 t x t xt) . Therefore, the distribution qt+1(w) = N(µt+τ t ytxt, I), and the normalization term is Γ(τt, xt, yt) = ( 2π)K exp(τtytx t µt+1 2τ 2 t x t xt) for any τt [0, 2c]. Next, we show that µt+1 = µt + τ t ytxt is the optimal solution of the online PA update rule (Crammer et al., 2006). To see this, we replace Γ(τt, xt, yt) in problem (9) with our derived form. Ignoring constant terms, we obtain the dual problem max τt ϵτt 1 2τ 2 t x t xt ytτtµ t xt s.t.: 0 τt 2c, (10) which is exactly the dual form of the online PA update equation: µPA t+1 = arg min µ 1 2||µ µt||2 + 2c ξt s.t. ytµ xt ϵ ξt, ξt 0. The optimal solution is µPA t+1 = µt + τ t ytxt. Note that τ t is the optimal solution of dual problem (10) shared by both PA and Bayes PA. Therefore, we conclude that µt+1 = µPA t+1. Second, suppose some algorithm A is capable of solving problem (6), then it would produce streaming updates to the posterior distribution. For averaging classifiers, it is easy to modify the proof of theorem 2 to derive the update rule of Bayes PA, which is presented in the following lemma. 
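Before stating the lemma, the Gaussian special case established in the proof of Theorem 2 can be made concrete with a short numerical sketch. The snippet below is a minimal illustration under the assumptions of Theorem 2 (identity-covariance Gaussian posterior, no likelihood term, averaging classifier); the function name and the toy data stream are ours, not part of the paper's implementation.

```python
import numpy as np

def bayespa_gaussian_step(mu_t, x_t, y_t, eps=1.0, c=1.0):
    """One round of non-likelihood Bayes PA with q_t(w) = N(mu_t, I).

    Solving the dual problem (10) in closed form gives the PA-I style step
    tau* = min(2c, (eps - y_t mu_t.x_t)_+ / ||x_t||^2); the new posterior is
    N(mu_t + tau* y_t x_t, I), i.e., only the mean changes.
    """
    margin = y_t * float(mu_t @ x_t)
    loss = max(0.0, eps - margin)                # hinge loss of the posterior mean
    if loss == 0.0:
        return mu_t                              # passive: keep q_t unchanged
    tau = min(2.0 * c, loss / float(x_t @ x_t))  # aggressive: clipped dual solution
    return mu_t + tau * y_t * x_t

# toy usage: a stream of 2-D points labeled by the sign of the first coordinate
rng = np.random.default_rng(0)
mu = np.zeros(2)
for _ in range(100):
    x = rng.normal(size=2)
    y = 1.0 if x[0] > 0 else -1.0
    mu = bayespa_gaussian_step(mu, x, y, eps=1.0, c=1.0)
print("posterior mean after 100 rounds:", mu)
```
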
Lemma 3 If $\mathcal{F}_t = \mathcal{P}$ and we use the averaging classifier with loss functional $\ell_\epsilon^{\mathrm{ave}}$, the update rule of online Bayes PA is

$$q_{t+1}(w) = \frac{1}{\Gamma(\tau_t^*, x_t, y_t)}\, q_t(w)\, p(x_t|w)\, \exp\big(\tau_t^* y_t w^\top x_t\big), \qquad (11)$$

where $\Gamma(\tau_t, x_t, y_t) = \int_w q_t(w)\, p(x_t|w)\, \exp(\tau_t y_t w^\top x_t)\, dw$ is the normalization term, and $\tau_t^*$ is the optimal solution to the dual problem

$$\max_{\tau_t}\ \epsilon\tau_t - \log \Gamma(\tau_t, x_t, y_t) \quad \text{s.t.}\ 0 \leq \tau_t \leq 2c.$$

For Gibbs classifiers, we have the following lemma to characterize the streaming update rule.

Lemma 4 If $\mathcal{F}_t = \mathcal{P}$ and we use the Gibbs classifier with loss functional $\ell_\epsilon^{\mathrm{gibbs}}$, the update rule of online Bayes PA is

$$q_{t+1}(w) = \frac{q_t(w)\, p(x_t|w)\, \exp\big(-2c\,(\epsilon - y_t w^\top x_t)_+\big)}{\Gamma(x_t, y_t)}, \qquad (12)$$

where $\Gamma(x_t, y_t)$ is the normalization constant.

In both update rules (11) and (12), the post-data posterior $q_t(w)$ from the previous round $t$ can be treated as a prior, while the newly observed data and the loss it incurs provide a likelihood and an un-normalized pseudo-likelihood, respectively. Due to its analytical form, the Bayes PA update rule (12) is sequential in nature, similar to the conventional Bayes rule (3). Therefore, the sequential posterior at time $T$ is the same as the batch posterior that observes all the data instances up to $T$. For the case with averaging classifiers, Eq. (11) reduces to the streaming Bayesian update if there is no loss functional (i.e., $\ell_\epsilon = 0$). Therefore, Bayes PA is an extension of streaming variational Bayes (SVB) with imposed max-margin posterior constraints. Finally, the update formulation (6) is essentially a Reg Bayes problem with a single data point $(x_t, y_t)$. Although Reg Bayes inference is normally intractable, we would show later in the paper how to use variational approximations to bypass the difficulty in specific settings. This would lead to a variational approximation algorithm $\mathcal{A}$ for the streaming update of the post-data posterior.

Besides treating a single data point at a time, a useful technique in practice to reduce the noise in data is to use mini-batches. Suppose that we have a mini-batch of data points at time $t$ with an index set $B_t$. Let $X_t = \{x_d\}_{d \in B_t}$ and $Y_t = \{y_d\}_{d \in B_t}$. The online Bayes PA update equation for this mini-batch can be defined in a natural way:

$$\min_{q \in \mathcal{F}_t}\ \mathrm{KL}\big[q(w)\,\|\,q_t(w)\big] - \mathbb{E}_{q(w)}\big[\log p(X_t|w)\big] + 2c\, \ell_\epsilon\big(q(w); X_t, Y_t\big),$$

where $\ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d)$ is the cumulative loss functional on the mini-batch $B_t$. Like the vanilla PA methods (Crammer et al., 2006), Bayes PA on mini-batches may not have closed-form update rules, and numerical optimization methods are needed to solve this new formulation.

Figure 2: Graphical illustrations of: (a) the abstraction of models with latent structures; and (b) the procedure of Bayes PA learning with latent structures (for $t = 1, 2, \ldots$: draw a mini-batch $(X_t, Y_t)$, infer the hidden structure $q^*(H_t)$, and update the distribution of global variables $q^*(w, M)$).

4.2 Bayes PA Learning with Latent Structures

To expressively explain complex real-world data, Bayesian models with latent structures have been extensively developed. The latent structures can typically be characterized by a hierarchy of variables, which are generally grouped into two sets: local latent variables $h_d$ ($d \geq 0$) that characterize the hidden structures of each observed data point $x_d$, and global variables $M$ that capture the common properties shared by all data.
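Schematically, the streaming procedure of Figure 2 can be summarized by the skeleton below. This is only a structural sketch: the function names are placeholders of our own choosing, and the concrete local and global update steps for topic models are derived in Sections 5 and 6.

```python
def bayespa_latent_stream(stream, q_global, infer_local, update_global, n_inner=3):
    """Skeleton of Bayes PA learning with latent structures (cf. Figure 2).

    q_global      -- current distribution q_t(w, M) over classifier weights w and
                     global variables M (e.g., topics)
    infer_local   -- given q(w, M) and a mini-batch, returns q*(H_t), an approximate
                     posterior over the local latent variables of that mini-batch
    update_global -- given the mini-batch and q*(H_t), solves the global part of the
                     mini-batch Bayes PA problem (Eq. (13) below) for a new q(w, M)
    """
    for X_t, Y_t in stream:                  # one Bayes PA round per mini-batch
        for _ in range(n_inner):             # alternate local and global updates
            q_local = infer_local(q_global, X_t, Y_t)
            q_global = update_global(q_global, X_t, Y_t, q_local)
    return q_global
```
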
As illustrated in Figure 2, Bayes PA learning with latent structures aims to update the distribution of M as well as the classifier weights w, based on each incoming mini-batch (Xt, Yt) and their corresponding latent variables Ht = {hd}d Bt. Because of the uncertainty in Ht, our posterior approximation algorithm A would first infer the joint posterior distribution qt+1(w, M, Ht) from min q Ft L q(w, M, Ht) + 2c ℓϵ q(w, M, Ht); Xt, Yt , (13) where L(q) = KL[q(w, M, Ht)||qt(w, M)p0(Ht)] Eq(w,M,Ht)[log p(Xt|w, M, Ht)] and ℓϵ(q(w, M, Ht); Xt, Yt) is some cumulative loss functional on the min-batch data incurred by some classifiers on the latent variables Ht and/or global variables M. As in the case without latent variables, both averaging classifier and Gibbs classifier can be used. Next, algorithm A produces the approximate posterior qt+1(w, M). In general we would not obtain a closed-form posterior distribution by marginalizing out Ht, especially in dealing with some involved models like Med LDA (Zhu et al., 2012). The intractability is bypassed through the mean-field assumption q(w, M, Ht) = q(w)q(M)q(Ht). Specifically, algorithm A solves problem (13) using an iterative procedure and obtain the optimal distribution q (w)q (M)q (Ht). Then it sets qt+1(w, M) = q (w)q (M) and proceeds to next round. Concrete examples of this method will be discussed in Section 5 and Section 6. 5. Online Max-Margin Topic Models In this section, we apply the theory of online Bayes PA to topic modeling. We first review the basic ideas of max-margin topic models, and develop online learning algorithms based on Bayes PA with averaging and Gibbs classifiers respectively. ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING 5.1 Basics of Med LDA A max-margin topic model consists of a latent Dirichlet allocation (LDA) model (Blei et al., 2003) for describing the underlying topic representations of document content and a max-margin classifier for making predictions. Specifically, LDA is a hierarchical Bayesian model that treats each document as an admixture of K topics, Φ = {φk}K k=1, where each topic φk is a multinomial distribution over a given W-word vocabulary.4 The generative process of the d-th document (1 d D) is described as follows: Draw a topic mixture proportion vector θd|α Dir(α) For the i-th word in document d, where i = 1, 2, ..., nd, draw a latent topic assignment zdi Mult(θd). draw the word instance xdi Mult(φzdi). where Dir( ) is the Dirichlet distribution and Mult( ) is the multinomial distribution. For Bayesian LDA, the topics are also drawn from a Dirichlet distribution, i.e., φk Dir(γ). Given a document set X = {xd}D d=1. Let Z = {zd}D d=1 and Θ = {θd}D d=1 denote all the topic assignments and topic mixing vectors. LDA infers the posterior distribution p(Φ, Θ, Z|X) p0(Φ, Θ, Z)p(X|Z, Φ) via Bayes rule. From a variational point of view, the posterior is identical to the solution of the optimization problem: min q P KL h q(Φ, Θ, Z)||p(Φ, Θ, Z|X) i . The advantage of the variational formulation of Bayesian inference lies in the convenience of posing restrictions on the post-data distribution with a regularization term. For supervised topic models (Blei and Mc Auliffe, 2010; Zhu et al., 2012), such a regularization term could be a loss function of a prediction model w on the data X = {xd}D d=1 and response signals Y = {yd}D d=1. 
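To make the notation concrete, here is a small simulation of the LDA generative process just described; the sizes, hyper-parameters, and random seed are made up for illustration. The last two lines compute the per-document average topic-assignment frequencies $\bar z_d$, which the max-margin classifier introduced below operates on.

```python
import numpy as np

def generate_lda_corpus(D=100, K=10, W=500, doc_len=50, alpha=0.1, gamma=0.01, seed=0):
    """Toy simulation of the Bayesian LDA generative process described above."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(W, gamma), size=K)           # topics phi_k ~ Dir(gamma)
    docs, assignments = [], []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))              # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=doc_len, p=theta)              # z_di ~ Mult(theta_d)
        x = np.array([rng.choice(W, p=Phi[k]) for k in z])    # x_di ~ Mult(phi_{z_di})
        docs.append(x)
        assignments.append(z)
    return Phi, docs, assignments

Phi, docs, assignments = generate_lda_corpus()
# average topic-assignment frequencies of document 0 (the feature vector z-bar_d)
zbar_0 = np.bincount(assignments[0], minlength=Phi.shape[0]) / len(assignments[0])
print(zbar_0)
```
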
As a regularized Bayesian (Reg Bayes) model, Med LDA infers a distribution of the latent variables Z as well as classification weights w by solving the problem: min q P L q(w, Φ, Θ, Z) + 2c d=1 ℓϵ q(w, zd); xd, yd , where L(q(w, Φ, Θ, Z)) = KL[q(w, Φ, Θ, Z)||p(w, Φ, Θ, Z|X)] . To specify the loss function, a linear discriminant function needs to be defined with respect to w and zd f(w, zd) = w zd, (14) where zdk = 1 nd P i I[zdi = k] is the average frequency of assigning the words in document d to topic k. Based on the discriminant function, both averaging classifiers with the hinge loss ℓave ϵ (q(w, zd); xd, yd) = ϵ yd Eq(w,zd)[f(w, zd)] 4. Without causing confusion, we slightly abused the notation K to denote the topic number (i.e., the latent dimension) in topic models. SHI AND ZHU and Gibbs classifiers with the expected hinge loss ℓgibbs ϵ (q(w, zd); xd, yd) = Eq(w,zd) (ϵ ydf(w, zd))+ , (16) have been proposed, with extensive comparisons reported in Zhu et al. (2014b) using batch learning algorithms. As commented in (Chang and Blei, 2010), defining the classifier on zd, instead of θd, enforces words and labels to share the same topics. Besides, it retains the conjugacy structure between the Dirichlet prior of θ and the multinomial likelihood of generating z; and it will allow us to integrate out θ for efficient collapsed inference. 5.2 Online Med LDA We first apply online Bayes PA to Med LDA with averaging classifiers, which will be referred to as pa Med LDAave for convenience. During inference, we integrate out the local variables Θt using the conjugacy between a Dirichlet prior and a multinomial likelihood (Griffiths and Steyvers, 2004; Teh et al., 2006b), which potentially improves the inference accuracy. Then we have the global variables M = Φ and local variables Ht = Zt. The latent Bayes PA rule (13) becomes: min q,ξd KL h q(w, Φ, Zt)||qt(w, Φ)p0(Zt)p(Xt|Φ, Zt) i + 2c X d Bt ξd, (17) s.t.: yd Eq(w,zd)[w zd] ϵ ξd, ξd 0, d Bt, q(w, Φ, Zt) P. Since directly solving the above problem is intractable, we would impose a mild mean-field assumption q(w, Φ, Zt) = q(w)q(Φ)q(Zt). Now, problem (17) can be solved using an iterative procedure that alternately updates each factor distribution (Jordan et al., 1998), as detailed below: 1. Update global q(Φ): By fixing the distributions q(Zt) and q(w), we can ignore irrelevant terms and solve min q(Φ) KL h q(Φ)q(Zt)||qt(Φ)p0(Zt)p(Xt|Φ, Zt) i . The optimal solution has the following closed form: q (Φk) qt(Φk) exp Eq(Zt) h log p0(Zt)p(Xt|Zt, Φ) i , k = 1, 2, ..., K. (18) If initially the prior is q0(Φk) = Dir( 0 k1, ..., 0 k W ), then by induction the inferred distributions in each round are also in the family of Dirichlet distributions, namely, qt(Φk) = Dir( t k1, ..., t k W ). Using equation (18), we can derive q (Φk) = Dir( k1, ..., k W ), (19) where kw = t kw + P d Bt Pnd i=1 γk di I[xdi = w] for all words w (1 w W) in the vocuabulary and γk di = Eq(zd)I[zdi = k] is the probability of assigning each word xdi to topic k. ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING 2. Update global weight q(w): Keeping all the other distributions fixed, q(w) can be solved as min q(w) KL h q(w)||qt(w) i + 2c X s.t.: yd Eq(w)[w] bzd ϵ ξd, ξd 0, d Bt, where bzd = Eq(zd)[ zd] is the expectation of topic assignments under the fixed distribution q(Z). 
Similar to Proposition 2 in Med LDA (Zhu et al., 2012), the optimal solution is attained by solving the Lagrangian form with respect to q(w), which gives q (w) = 1 Z(τ d)qt(w) exp d Bt τ dydw bzd where the Lagrange multipliers τ d(d Bt) are obtained by solving the dual problem max 0 τd 2c ϵ X d Bt τd log Z(τd). For the common spherical Gaussian prior q0(w) = N(0, σ2I), by induction the distribution qt(w) = N(µt, σ2I) at each round. So equation (20) gives q (w) = N(µ , σ2I), where µ = µt + σ2 X d Bt τ dydbzd. (21) Furthermore, the dual problem becomes, max 0 τd 2c ϵ X 1 2σ2τd1τd2 bz d1 bzd2 µ t X d Bt ydτdbzd, (22) which is identical to the Lagrangian dual of the classical PA problem with mini-batch Bt (expressed in the equivalent constrained form by introducing slack variables) min µ ||µ µt||2 This equivalence suggests that we could rely on contemporary PA techniques to solve for µ . In particular, for instances coming one at a time (i.e., Bt = {t}, t), we have the closed-form solution ϵ ytµ t bzt whose computation requires O(K) time; for mini-batches, we could adapt methods solving linear SVM to either the dual (22) or primal (23) problem, which by state-of-the-arts require complexity O(poly(ϵ 1)K) per training instance in order to obtain ϵ-accurate solutions. Here we choose a gradient-based method similar to Shalev-Shwartz et al. (2011). Specifically, we first set µ1 = µt, and then take the gradient steps i = 2, 3, ... until µi converges to µ . Let SHI AND ZHU Ai t be the set of instances in Bt that suffer non-zero loss at step i, then we use the gradients to iteratively update µi+1 µi ρi i, (24) where annealing rate ρi = σ2i 1 and Correspondingly, we can derive the gradient-based update rule for the dual parameters. Imagine that we implicitly maintain the relationship µ = µt + σ2 P d Bt τdydbzd. Then the following update rule for τd (d Bt) naturally implies the update rule (24) for µ: i )τ i d + 2c i for d Ai t (1 1 i )τ i d for d Ai t. Therefore, the gradient steps adaptively adjust the contribution of each latent bzd to µ based on the loss it incurs. Furthermore, the annealing makes sure that 0 τ i d 2c for all i. Since the problem (22) is concave, it can be guaranteed that τ i d converges to τ d. This correspondence would be used in updating q(Zt). 3. Update local q(Zt): Fixing all the other distributions, we aim to solve min q(Zt) KL h q(Zt)||p0(Zt)p(Xt|Zt, Φ) i + 2c X s.t.: ydµ Eq(zd)[ zd] ϵ ξd, ξd 0, d Bt, where µ = Eq(w)[w] is the expectation of w under the fixed distribution q(w). Unlike the weight w, the expectation over Zt during optimization is intractable due to the combinatorial space of values. Instead, we adopt the same approximation strategy as Med LDA (Zhu et al., 2012): fix ξ, τd at the previous global step, and use the approximate solution q (Zt) = p0(Zt)p(Xt|Zt, Φ) exp d Bt τ dydµ zd Then the expectation of zd, as needed in the global updates, could be approximated by samples from the distribution q (Zt). Specifically, we use Gibbs sampling with the conditional distribution q(zdi = k | Z di t ) α + C di dk exp d Bt n 1 d ydτ dµ k where Λzdi,xdi = Eq(Φ)[log(Φzdi,xdi)] = Ψ( zdi,xdi) Ψ(P w zdi,w) (note that Ψ( ) is the digamma function) and C di d is a vector with the k-th entry being the number of words in document d (except the i-th word) that are assigned to topic k. Then we draw J samples {Z(j) t }J j=1 using Eq. (25), discard the first β (0 β < J ) burn-in samples, and approximate bzdk with the empirical sum (J β) 1 PJ j=β+1 P d,i I[z(j) di = k]. 
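As a rough illustration of this local step, the sketch below resamples the topic assignments of a single document following the structure of the conditional in Eq. (25), assuming a fixed Dirichlet posterior $q(\Phi_k) = \mathrm{Dir}(\Delta_k)$ and a fixed classifier mean $\mu$; all variable names, sizes, and toy data are ours, not the paper's implementation.

```python
import numpy as np
from scipy.special import digamma

def sample_zdi(i, z_d, x_d, Delta, mu, y_d, tau_d, alpha, rng):
    """Resample the topic of word i in one document, following Eq. (25).

    Delta : (K, W) Dirichlet parameters of the current q(Phi_k) = Dir(Delta[k])
    mu    : (K,) mean of the current q(w)
    tau_d : dual variable of this document from the previous global step
    """
    K = Delta.shape[0]
    n_d = len(z_d)
    counts = np.bincount(np.delete(z_d, i), minlength=K)   # C_{dk}^{-i}
    # Lambda_{k, x_di} = E_q[log Phi_{k, x_di}] under the Dirichlet q(Phi_k)
    Lam = digamma(Delta[:, x_d[i]]) - digamma(Delta.sum(axis=1))
    # unnormalized conditional: (alpha + C_{dk}^{-i}) * exp(Lambda + y_d tau_d mu_k / n_d)
    log_p = np.log(alpha + counts) + Lam + y_d * tau_d * mu / n_d
    p = np.exp(log_p - log_p.max())
    return rng.choice(K, p=p / p.sum())

# toy usage with made-up sizes
rng = np.random.default_rng(0)
K, W, n_d = 5, 20, 30
Delta = np.ones((K, W))
mu = rng.normal(size=K)
x_d = rng.integers(0, W, size=n_d)
z_d = rng.integers(0, K, size=n_d)
for i in range(n_d):
    z_d[i] = sample_zdi(i, z_d, x_d, Delta, mu, y_d=1.0, tau_d=0.5, alpha=0.2, rng=rng)
```
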
ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING Algorithm 1 Online Med LDA 1: Let q0(w) = N(0; σ2I), q0(φk) = Dir(γ), k. 2: for t = 0 do 3: Set q(Φ, w) = qt(Φ)qt(w). Initialize Zt. 4: for i = 1 I do 5: Draw samples {Z(j) t }J j=1 from Eq. (25). 6: Discard the first β burn-in samples (β < J ). 7: Use the rest J β samples to update q(Φ, w) following Eq.s (19, 20). 9: Set qt+1(Φ, w) = q(Φ, w). 10: end for At each round t of Bayes PA optimization during training, we run the global and local updates alternately until convergence, and assign qt(Φ, w) = q (Φ)q (w), as outlined in Algorithm 1. To make predictions on testing data, we use the mean µ as the classification weight and apply the prediction rule. The inference of z for testing documents is similar to Zhu et al. (2014b). First, we draw a single sample of Φ, and for each test document d, we infer the MAP of θd. Then, we directly run the sampling of zd until the burn-in stage is completed, and use the average of several samples to compute bzd. Then the prediction rule is applied on µ and bzd. 5.3 Online Gibbs Med LDA In this section, we apply the theory of Bayes PA to Gibbs Med LDA. As shown in Zhu et al. (2014b), using Gibbs classifiers admits efficient inference algorithms by exploring data augmentation (DA) techniques (Tanner and Wong, 1987; Polson and Scott, 2011). Based on this insight, we will develop our efficient online inference algorithms for Gibbs Med LDA. We denote the model by pa Med LDAgibbs . Specifically, let ζd = ϵ ydf(w, zd) and ψ(yd|zd, w) = e 2c(ζd)+. By Lemma 4, the optimal solution to problem (13) is qt+1(w, M, Ht) = qt(w, M)p0(Ht)p(Xt|Ht, M)ψ(Yt|Ht, w) Γ(Xt, Yt) , where ψ(Yt|Ht, w) = Q d Bt ψ(yd|hd, w) and Γ(Xt, Yt) is a normalization constant. The basic idea of DA is to construct conjugacy between prior and data during inference by introducing augmented variables. Specifically, we would use the following equality (Zhu et al., 2014b): ψ(yd|zd, w) = Z 0 ψ(yd, λd|zd, w)dλd, (26) where ψ(yd, λd|zd, w) = (2πλd) 1/2 exp (λd+cζd)2 . Equality (26) essentially implies that the collapsed posterior qt+1(w, Φ, Zt) is a marginal distribution of qt+1(w, Φ, Zt, λt) = p0(Zt)qt(w, Φ)p(Xt|Zt, Φ)ψ(Yt, λt|Zt, w) Γ(Xt, Yt) , where ψ(Yt, λt|Zt, w)=Q d Bt ψ(yd, λd|zd, w) and λt = {λd}d Bt are augmented variables locally associated with individual documents. In fact, the augmented distribution qt+1(w, Φ, Zt, λt) is the SHI AND ZHU solution to the problem: min q P L q(w, Φ, Zt, λt) Eq h log ψ(Yt, λt|Zt, w) i , (27) where L(q(w, Φ, Zt, λt)) = KL[q(w, Φ, Zt, λt) qt(w, Φ)p0(Zt)] Eq[log p(Xt|Zt, Φ)]. In fact, this objective is an upper bound of that in the original problem (13) (See Appendix A for details). Again, with the mild mean-field assumption that q(w, Φ, Zt, λt) = q(w, Φ)q(Zt, λt), we can solve problem (27) via an iterative procedure that alternately updates each factor distribution (Jordan et al., 1998), as detailed below. 1. Global Update: By fixing the distribution of local variables, q(Zt, λt), and ignoring irrelevant terms, the optimal distribution of w and Φ can be shown to have the induced factorization form, q(w, Φ) = q(w)q(Φ). For q(Φ), the update rule is exactly (19). For q(w), we have the update rule qt+1(w) qt(w) exp Eq(Zt,λt) h log p0(Zt)ψ(Yt, λt|Zt, w) i . If the initial prior is normal q0(w) = N(w; µ0, Σ0), by induction we can show that the inferred distribution in each round is also a normal distribution, namely, qt(w) = N(w; µt, Σt). 
Indeed, the optimal solution of q(w) to problem (27) is q (w) = N(w; µ , Σ ), (28) where the posterior paramters are computed as (Σt) 1 + c2 X d Bt Eq(zd,λd) h λ 1 d zd z d i (Σt) 1µt + c X d Bt Eq(zd,λd) yd 1 + cϵλ 1 d zd For the sequential update rule, we simply set µt+1 = µ and Σt+1 = Σ . 2. Local Update: Given the distribution of global variables, q(Φ, w), the mean-field update equation for (Zt, λt) is q(Zt, λt) p0(Zt) Y i [nd] Λzdi,xdi Eq(Φ,w) (λd + cζd)2 where Λ admits the same definition as in (25). But it is impossible to evaluate the expectation in the global update using q(Zt, λt) because of the huge number of configurations for (Zt, λt). As a result, we turn to Gibbs sampling and estimate the required expectations using multiple empirical samples. This hybrid strategy has shown promising performance for LDA (Mimno et al., 2012). Specifically, the conditional distributions used in the Gibbs sampling are as follows: ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING Algorithm 2 Online Gibbs Med LDA 1: Let q0(w) = N(0; σ2I), q0(φk) = Dir(γ), k. 2: for t = 0 do 3: Set q(Φ, w) = qt(Φ)qt(w). Initialize Zt. 4: for i = 1 I do 5: Draw samples {Z(j) t , λ(j) t }J j=1 from Eq.s (29, 30). 6: Discard the first β burn-in samples (β < J ). 7: Use the rest J β samples to update q(Φ, w) following Eq.s (19, 28). 9: Set qt+1(Φ, w) = q(Φ, w). 10: end for For Zt: By canceling out common factors, the conditional distribution of one variable zdi given Z di t and λt is q(zdi =k|Z di t , λt) (α+C di dk )exp cyd(cϵ+λd)µ k ndλd +Λk,xdi c2(µ 2 k +Σ kk+2(µ kµ +Σ ,k) C di d ) 2n2 dλd where Σ ,k is the k-th column of Σ . For λt: Let ζd = ϵ yd z d µ . The conditional distribution of each variable λd given Zt is q(λd|Zt) 1 2πλd exp c2 z d Σ zd + (λd + c ζd)2 2, 1, c2 ζ2 d + z d Σ zd , (30) a generalized inverse gaussian distribution (Devroye, 1986). Therefore, λ 1 d follows an inverse gaussian distribution, that is, q(λ 1 d |Zt) = IG ζ2 d + z d Σ zd , 1 from which we can draw a sample in constant time (Michael et al., 1976). For training, we run the global and local updates alternately until convergence at each round of PA optimization, as outlined in Alg. 2. To make predictions on testing data, we then draw one sample of ˆw as the classification weight and apply the prediction rule. The inference of z for testing documents is the same as online Med LDA. 6. Extensions In the above topic models, we assume that the number of topics (i.e., K) is pre-specified, which is unfortunately unknown in practice. An extra procedure is often applied to select a good K value. SHI AND ZHU We now present extensions of online Med LDA to automatically determine the unknown K values by leveraging nonparametric Bayesian techniques. We also present an extension of these models for multi-task learning. 6.1 Online Nonparametric Med LDA We first present online nonparametric Med LDA for resolving the unknown number of topics, based on the theory of hierarchical Dirichlet process (HDP) (Teh et al., 2006a). 6.1.1 BATCH MEDHDP A two-level HDP provides an extension to LDA that allows for a nonparametric inference of the unknown topic numbers. The generative process of HDP can be summarized using a stick-breaking construction (Wang and Blei, 2012b), where the stick lengths π = {πk} k=1 are generated as: ik sdj] for k = {1, 2, ...} and u0 k = 1, v0 k = η. Since Zt contains only a finite number of discrete variables, we only need to maintain and update the above global distributions for a finite number of topics. 2. 
Local Update: Fixing the global distribution q(w, π, Φ), we get the mean-field update equation for (Zt, St): q (Zt, St) q(Zt, St)ˆq(Zt) (36) SHI AND ZHU q(Zt, St) = exp Eq (Φ)q ( π)[log p(X|Φ, Zt) + log p(Zt, St| π)] , ˆq(Zt) = exp d Bt τdyd E[w] zd and τd(d Bt) are the dual variables computed in the global update. The most cumbersome point to tackle is the potentially unbounded sample space of Zt and St. We take the ideas from (Wang and Blei, 2012b) and adopt an approximation for q(Zt, St): q(Zt, St) Eq (Φ)q ( π) [p(X|Φ, Zt)p(Zt, St| π)] . (37) Computing the expectation regarding π in (37) turns out to be difficult. However, imagine that the expectation operator is essentially collapsing π out from the joint distribution q( π, Zt, St) Eq (Φ)) [q ( π)p(X|Φ, Zt)p(Zt, St| π)] . (38) Now we propose to uncollapse π and sample the local variables from q ( π, Zt, St) q( π, Zt, St)ˆq(Zt). (39) Notice in the local updates, π is only an auxilary variable. Putting all the pieces together, we have the following sampling scheme. For Zt: Let K be the current inferred number of topics. The conditional distribution of one variable zdi given all other local variables can be derived from (36) with sd marginalized out for convenience. q(zdi = k|Z di t , π) (απk + C di dk )(C di kxdi + t kxdi) P w (C di kw + t kw) exp d Bt n 1 d ydτdµ k Besides, for k > K and symmetric Dirichlet prior γ, (40) converge to a single rule q(zdi = k|Z di t , π) απk/W, and therefore the total probability of assigning a new topic is q(zdi > K|Z di t , π) α For St: The conditional distribution of sdk given (Zt, π, λt) can be derived from the joint distribution (32): q(sdk|Zt, π) S(nd zdk, sdk)(απk)sdk. (41) For π: It can be derived from (36) that given (Zt, St), each πk follows the beta distribution, πk Beta(ak, bk), (42) where ak = u k + P d Bt sdk and bk = v k + P d Bt P j>k sdj. ONLINE BAYESIAN PASSIVE-AGGRESSIVE LEARNING Similar to online Med LDA, we iterate the above steps till convergence for training. For testing, the learned model is essentially a finite Med LDA, and we use the same scheme as that of online Med LDA. Notice that if we run online Med HDP for only one round (T = 1) and use the entire dataset as mini-batch (|B| = D), iterating the above steps till converge in fact solves the batch Med HDP problem Eq. (31). We call this batch version Med HDPave , and will use it as a baseline algorithm. 6.1.3 ONLINE GIBBS MEDHDP For Gibbs Med HDP, the only difference is the loss functional ℓϵ, which is reflected in the sampling of local variables. As in online Gibbs Med LDA, we can facilitate more efficient inference by adopting the same data augmentation technique with the augmented variables λt. Then the local variables are (Zt, St, λt) and the global variables are unchanged. We then use the mean field assumption q(w, π, Φ, Ht) = q(w)q( π)q(Φ)q(Ht) and compute the iterative steps as follows. 1. Global Update: The same as online Med HDP, except that the update rule for w is now (28) for the Gibbs classifier. 2. Local Update: This step involves drawing samples of the local variables. We develop a Gibbs sampler, which iteratively draws St from the local conditional in (41), draws π from the conditional in (42), and draws the augmented variables λt from the conditional in (30). For Zt, we explain the sampling procedure in detail. 
Specifically, we infer Zt through q (Zt, St, λt, π) q( π, Zt, St)ˆq(Zt, λt), (43) ˆq(Zt, λt) = Y i [nd] Λzdi,xdi Eq(Φ,w) (λd + cζd)2 The Gibbs sampling for each variable zdi is q(zdi = k|Z di t , λt, π) (απk + C di dk )(C di kxdi + t kxdi) P w (C di kw + t kw) cyd(cϵ + λd)µ k ndλd c2(µ 2 k + Σ kk + 2(µ kµ + Σ ,k) C di d ) while the probability of sampling a new topic is q(zdi > K|Z di t , π) α Again, we iterate the above steps till convergence for training and the testing is the same as online Med HDP. A batch version algorithm can be attained by setting T = 1 and |B| = D, which we denote as Med HDPgibbs . SHI AND ZHU 6.2 Multi-task Learning The above models have been presented for classification. The basic ideas can be applied to solve other learning tasks, such as regression and multi-task learning (MTL). We use multi-task learning an one example. The primary assumption of multi-task learning is that by sharing statistical strength in a joint learning procedure, multiple related tasks can be mutually enhanced or some main tasks can be improved. MTL has many applications. We consider one scenario for multi-label classification. In this task, a set of binary classifiers are trained, each of which identifies whether a document xd belongs to a specific category yq d {+1, 1}. These binary classifiers are allowed to share common latent representations and therefore could be attained via a modified Bayes PA update equation: min q Ft L q(w, M, Ht) + 2c q=1 ℓϵ q(w, M, Ht); Xt, Y τ t , where Q is the total number of tasks. We can then derive the multi-task version of Passive Aggressive topic models, denoted by pa Med LDAmtave and pa Med LDAmtgibbs , in a way similar as in Section 5. We can further develop the nonparametric multi-task Med LDA topic models in a way similar as in Section 6.1 and the online PA learning algorithms. We denote the nonparametric online models by pa Med HDPave and pa Med HDPgibbs , according to whether the task-specific classifier is averaging or Gibbs. 7. Experiments We demonstrate the efficiency and prediction accuracy of online Med LDA, online Gibbs Med LDA and their extensions on the 20Newsgroup (20NG) and a large Wikipedia dataset. A sensitivity analysis of the key parameters is also provided. Following the same setting in Zhu et al. (2012), we remove a standard list of stop words. All of the experiments are done on a normal computer with single-core clock rate up to 2.4 GHz. 7.1 Classification on 20Newsgroup We perform multi-class classification on the entire 20NG dataset with all the 20 categories. The training set contains 11,269 documents, with the smallest category having 376 documents and the biggest category having 599 documents. The testing set contains 7,505 documents, with the smallest and biggest categories having 259 and 399 documents respectively. We adopt the one-vs-all strategy (Rifkin and Klautau, 2004) to combine binary classifiers for multi-class prediction tasks. We use shorthand notations pa Med LDAave and pa Med LDAgibbs for online Med LDA and online Gibbs Med LDA respectively. The corresponding batch learning algorithms are Med LDA (Zhu et al., 2012) and Gibbs Med LDA (Med LDAgibbs ) (Zhu et al., 2014b), which is a Med LDA model with Gibbs classifiers. We use collapsed Gibbs sampling to solve Med LDA, which is exactly the g Med LDA model proposed by Jiang et al. (2012). 
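As a small illustration of how the per-task classifiers are combined at prediction time, the following sketch applies Q binary max-margin classifiers that share one topic representation (the multi-label setting above), together with the one-vs-all rule used for multi-class prediction on 20NG. The function names, matrix shapes, and toy data are ours, not the paper's code.

```python
import numpy as np

def predict_multilabel(zbar, Mu, threshold=0.0):
    """Multi-label prediction with Q binary classifiers sharing the same topic
    representation zbar; Mu[q] is the posterior mean weight vector of task q."""
    scores = Mu @ zbar                        # one discriminant value per task
    return np.where(scores > threshold, 1, -1)

def predict_one_vs_all(zbar, Mu):
    """One-vs-all multi-class prediction: pick the class whose binary
    classifier gives the largest discriminant value."""
    return int(np.argmax(Mu @ zbar))

# toy usage: Q = 4 tasks/classes, K = 6 topics
rng = np.random.default_rng(0)
Mu = rng.normal(size=(4, 6))
zbar = rng.dirichlet(np.ones(6))
print(predict_multilabel(zbar, Mu), predict_one_vs_all(zbar, Mu))
```
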
We also choose a state-of-the-art online unsupervised topic model as a baseline, sparse stochastic inference for LDA (spLDA) (Mimno et al., 2012), which has been demonstrated to be superior to stochastic variational LDA (Hoffman et al., 2013) in prediction performance. To perform the supervised tasks, we learn a linear SVM on the topic representations using LIBSVM (Chang and Lin, 2011). We refer the readers to (Zhu et al., 2012) for the performance of other batch-learning-based supervised topic models, such as sLDA (Blei and McAuliffe, 2010) and DiscLDA (Lacoste-Julien et al., 2008), as well as SVM classifiers on the raw inputs, which are generally either inferior in accuracy or slower in efficiency than MedLDA.

Figure 3: Test errors of different models with respect to the number of passes through the 20NG training dataset. (Compared models: paMedLDAave, paMedLDAgibbs, MedLDAgibbs, gMedLDA, and spLDA+SVM, and the nonparametric models paMedHDPave, paMedHDPgibbs, MedHDPave, MedHDPgibbs, and tfHDP+SVM.)

For all the LDA-based topic models, we use symmetric Dirichlet priors α = (1/K)·1 and γ = 0.45·1. For Bayes PA with Gibbs classifiers, the parameters are set at ϵ = 164, c = 1, and σ^2 = 1. The models' performance is not sensitive to the choice of these parameters over wide ranges, as shown in Zhu et al. (2014b). For Bayes PA with averaging classifiers, the parameters determined by cross-validation are ϵ = 16, c = 500, and σ^2 = 10^{-3}. For reasons explained in Section 7.3, we set the mini-batch size |B| = 1 for the averaging classifier and |B| = 512 for the Gibbs classifier.

We first analyze how many processed documents are sufficient for each model to converge. Figure 3 shows the test errors with respect to the number of passes through the 20NG training set, where the number of topics is set at K = 80 and the other parameters of Bayes PA are set at (I, J, β) = (1, 2, 0). As we can observe, by solving a series of latent Bayes PA problems, both paMedLDAave and paMedLDAgibbs fully explore the redundancy of documents and converge in less than one pass, while their corresponding batch algorithms (i.e., MedLDA and MedLDAgibbs) need many passes as burn-in steps. Besides, compared with the online learning algorithms for unsupervised topic models (i.e., spLDA+SVM), Bayes PA topic models use the supervising side information from each mini-batch and therefore exhibit a faster convergence rate in discrimination ability. The convergence performance of Bayes PA models is significantly better than that of the unsupervised spLDA.

Next, we study each model's best possible performance and the corresponding training time. To allow for a fair comparison, we train each model until the relative change of its objective is less than 10^{-4}. Figure 4 shows the prediction accuracy and training time of LDA-based models on the whole dataset with varying numbers of topics. We can see that Bayes PA topic models are more than an order of magnitude faster than their corresponding batch algorithms in training time, due to the power of online learning. paMedLDAave is faster than paMedLDAgibbs because it does not need to update the covariance matrix of the classifier weights w. The tradeoff is that averaging models are more sensitive to the initial choice of σ^2, and therefore we need to use cross-validation to determine the best choice of the variance beforehand.
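The cross-validation just mentioned can be as simple as a small grid search over (c, σ^2) on a held-out split. The sketch below is only an illustration of such a procedure: the candidate grids and the train_and_eval callback are hypothetical and not the protocol reported in the paper.

```python
import itertools

def grid_search(train_and_eval, cs=(1, 100, 500, 1000), sigma2s=(1e-4, 1e-3, 1e-2)):
    """Pick (c, sigma^2) by held-out error.

    train_and_eval(c, sigma2) is a user-supplied callback that trains the averaging
    model with the given hyper-parameters on a training split and returns the error
    on a validation split.  The candidate grids here are illustrative only.
    """
    best = None
    for c, s2 in itertools.product(cs, sigma2s):
        err = train_and_eval(c, s2)
        if best is None or err < best[0]:
            best = (err, c, s2)
    return {"error": best[0], "c": best[1], "sigma2": best[2]}
```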
Furthermore, thanks to the merits of structured mean-field inference, which does not impose strict independence assumptions on the latent variables, Bayes PA topic models match their batch alternatives in accuracy. Moreover, all the supervised models are significantly better than the unsupervised spLDA in classification, reaching state-of-the-art performance on the tested datasets.

Table 2 visualizes the topic representations learnt by paMedLDAave and paMedLDAgibbs. For the displayed categories, we plot the corresponding classifier's topic distribution averaged over the positive examples, together with the top words from the topic matrix (a rough code sketch of this computation is given just before the table). As we can see, the average topic distributions become increasingly sparse as more and more data are observed. Eventually, the averaged topic distribution for each category contains only 1-2 non-zero entries, and meanwhile different categories have quite diverse average topic distributions, showing the strong discriminative ability of the topic representations in distinguishing different categories. Such sparse and discriminative patterns are similar to what has been shown in batch settings (Zhu et al., 2012, 2014b).

7.2 Extensions

We now present the experimental results on the extensions of Bayes PA topic models. We first present the results of nonparametric topic modeling on the same 20NG dataset. Then we demonstrate multi-task learning on a large Wikipedia dataset with more than 1 million documents and about 1 million unique terms.

7.2.1 NONPARAMETRIC TOPIC MODELING

Recall the nonparametric extensions of Bayes PA topic models, paMedHDPave and paMedHDPgibbs. To validate the advantage of online learning, we test them against their corresponding batch algorithms (i.e., MedHDPave and MedHDPgibbs) on the 20NG corpus. We also include an unsupervised baseline as a comparison model, the truncation-free HDP topic model (tfHDP) (Wang and Blei, 2012b), which is first used to discover the latent topic representations and is then combined with a linear SVM classifier for document categorization. For all HDP-based models, the following parameter setting is used: α = 5·1, γ = 0.5·1, and η = 1. As the initial number of topics for HDP to start with, we choose K = 20. We observed that the training time and the prediction accuracy do not depend heavily on the initial number of topics. The other parameters of Bayes PA are the same.

Figure 4: Classification accuracy and running time of various models with respect to the number of topics on the 20NG dataset. (Compared models: paMedLDAave, paMedLDAgibbs, MedLDAgibbs, gMedLDA, and spLDA+SVM, and the nonparametric models paMedHDPave, paMedHDPgibbs, MedHDPave, MedHDPgibbs, and tfHDP+SVM.)

Figure 3 shows the convergence of paMedHDPave and paMedHDPgibbs, and Figure 4 plots the accuracy and time together with the inferred topic numbers, where the length of the horizontal bars represents the variance of the inferred topic numbers. The results are summarized from five different runs. As we can see, the nonparametric extensions of Bayes PA topic models also dramatically improve time efficiency and match their corresponding batch algorithms in prediction performance. Furthermore, the averaging models are again faster to train because they do not need to update the covariance matrix of the classifier weights.
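As a rough illustration of how the per-category summaries in Table 2 (shown below) can be computed from a fitted model, the sketch below averages the inferred per-document topic distributions over the positive examples of a category and lists the top words of the dominant topics. The function and variable names are hypothetical; the routine is not taken from the paper.

```python
import numpy as np

def category_summary(theta, labels, phi, vocab, n_top_words=6, n_top_topics=2):
    """Summarize one category in the style of Table 2 (illustrative sketch).

    theta  : D x K matrix of per-document topic distributions inferred by the model
    labels : length-D array with +1 for documents of the category, -1 otherwise
    phi    : K x W topic-word matrix (each row sums to one)
    vocab  : list of the W vocabulary words
    """
    # Topic distribution averaged over the positive examples of the category.
    avg_topics = theta[labels == 1].mean(axis=0)
    # Top words of the most probable topics for that category.
    top_topics = np.argsort(-avg_topics)[:n_top_topics]
    top_words = {f"T{k}": [vocab[w] for w in np.argsort(-phi[k])[:n_top_words]]
                 for k in top_topics}
    return {"avg_topic_distribution": avg_topics, "top_words": top_words}
```

Running such a summary after 1, 64, 512, 4096, and 11,269 observed documents would give a progression like the one shown in Table 2.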
Table 2: Visualization of the topics learnt by paMedLDAave and paMedLDAgibbs. For each category, the classifier's topic distribution averaged over the positive examples is shown after 1, 64, 512, 4096, and 11,269 observed documents (plots omitted here), together with the top words of the dominant topics. See text for details.

Results by paMedLDAave:
- alt.atheism: T8 (writes, article, don, people, time, good); T11 (god, people, writes, don, article, atheism)
- comp.graphics: T19 (image, graphics, file, jpeg, files, software); T13 (writes, article, people, don, time, good)
- misc.forsale: T4 (sale, offer, shipping, mail, dos, price); T8 (writes, article, people, don, time, good)

Results by paMedLDAgibbs:
- alt.atheism: T11 (god, people, writes, article, don, time); T16 (god, people, don, writes, jesus, time)
- comp.graphics: T14 (graphics, image, file, program, images, writes); T13 (entry, space, don, people, system, writes)
- misc.forsale: T17 (sale, mail, dos, good, offer, shipping); T18 (don, writes, article, time, good, work)

7.2.2 MULTI-TASK CLASSIFICATION

We test paMedLDAmtave, paMedLDAmtgibbs, and their nonparametric extensions on a large Wiki dataset. The Wiki dataset is built from the Wikipedia collection used in the PASCAL LSHC challenge 2012 (footnote 6). It is a collection of documents with up to 20 different kinds of labels, and the data distribution among the labels is balanced. The training set consists of 1.1 million Wikipedia documents and the testing set consists of 5,000 documents. The vocabulary contains 917,683 unique terms. To measure performance, we use the F1 score, the harmonic mean of precision and recall. As baseline batch algorithms, we include MedLDAmt, a recent multi-task extension of Gibbs MedLDA (Zhu et al., 2013). Since MedHDP is a new model, there is no existing implementation of a multi-task batch version, so we instead extend MedHDP to support multi-task inference; we call this model MedHDPmt.

6. See http://lshtc.iit.demokritos.gr/.

We use the same validation scheme as before to select the batch size |B| = 1, c = 5000, and σ^2 = 10^{-6} for paMedLDAmtave; we choose |B| = 512, c = 1, and σ^2 = 1 for paMedLDAmtgibbs. For both models, the Dirichlet parameters are α = 0.8·1, γ = 0.5·1, and ϵ = 1. We use K = 40 topics for the MedLDA models. The nonparametric extensions use exactly the same parameter settings, except that α = 5·1, η = 1, and we do not need to specify the topic number K.

Figure 5: F1 scores of various multi-task topic models (paMedLDAmtave, paMedHDPmtave, paMedLDAmtgibbs, paMedHDPmtgibbs, MedLDAmt, and MedHDPmt) with respect to the running time (in seconds) on the 1.1M Wikipedia dataset.

Figure 5 shows the F1 scores of various models as a function of training time. We find that Bayes PA topic models produce results comparable with their batch counterparts, but the training time is significantly less. With either Gibbs or averaging classifiers, Bayes PA is about two orders of magnitude faster than its batch counterparts. Therefore, Bayes PA topic models could potentially be applied to large-scale multi-class settings.

Figure 6: Test errors of paMedLDAave (left) and paMedLDAgibbs (right) with different batch sizes (|B| = 1, 4, 16, 64, 256, 1024, and the full batch) on the 20NG dataset; the horizontal axis is CPU seconds on a log scale.
7.3 Sensitivity Analysis

We provide further discussion of Bayes PA learning for topic models by analyzing the models' sensitivity to some key parameters.

Batch Size |B|: Figure 6 presents the test errors of Bayes PA topic models (paMedLDAave, paMedLDAgibbs) as a function of training time on the entire 20NG dataset with various batch sizes. The number of topics is fixed at K = 40. We can see that the convergence speeds of different algorithms vary. First of all, the batch algorithms suffer from multiple passes through the dataset and are therefore much slower than the online alternatives. Second, algorithms with medium batch sizes (|B| = 64 or 256) converge faster. If we choose a batch size that is too small, for example |B| = 1, each iteration does not provide sufficient evidence for the update of the global variables; if the batch size is too large, each mini-batch becomes redundant and the convergence rate also decreases. By comparing the two panels, we find that paMedLDAave runs faster than paMedLDAgibbs. This is because for averaging classifiers we do not update the covariance of the classifier weights, which requires frequent matrix-inverse operations. Furthermore, paMedLDAave appears to be more robust against changes in the batch size. Similarly, Figure 7 shows the sensitivity to the batch-size parameter for the paMedHDP models. The results are similar to those of the paMedLDA models; that is, a moderate batch size leads to faster convergence.

Figure 7: Test errors of paMedHDPave (left) and paMedHDPgibbs (right) with different batch sizes (|B| = 1, 4, 16, 64, 256, 1024, and the full batch) on the 20NG dataset; the horizontal axis is CPU seconds on a log scale.

Number of iterations I and samples J: Since the time complexity of Algorithm 2 is linear in both I and J, we would like to know how these parameters influence the quality of the trained models. First, we analyze which settings of (I, J) guarantee good performance, fixing β = 0 and K = 40. Figure 8 presents the results. First, the number of samples J does not have a large effect on the accuracy. Second, the performance of paMedLDAgibbs and paMedHDPgibbs is not sensitive to the number of optimization rounds, but paMedLDAave and paMedHDPave suffer considerably if I = 1. This is because, with averaging classifiers, the sampling of the latent variables Z relies not only on the global parameters but also on a local variable τ, so more optimization rounds are needed. The training time of all models scales linearly in I and J.

Table 3: Effect of the number of local samples J and burn-in steps β for (a) paMedLDAave and (b) paMedLDAgibbs. The first β samples are discarded as burn-in steps.

(a) paMedLDAave
        β = 0   β = 2   β = 4
J = 1   0.802   -       -
J = 3   0.803   0.803   -
J = 5   0.805   0.805   0.803

(b) paMedLDAgibbs
        β = 0   β = 2   β = 4
J = 1   0.783   -       -
J = 3   0.803   0.799   -
J = 5   0.808   0.803   0.792

To understand how large β needs to be, we consider settings of the pairs (J, β) and check the prediction accuracy of Algorithm 2 for K = 40. Based on the sensitivity analysis of I and J, we fix I = 1 for paMedLDAgibbs and paMedHDPgibbs, and I = 2 for paMedLDAave and paMedHDPave. The results are shown in Table 3. We can see that accuracy scores closer to the diagonal of the table are relatively lower, while settings with the same number of kept samples, e.g., (J, β) = (3, 0), (5, 2), (9, 6), yield similar results. Therefore, the number of kept samples plays a more significant role in the performance of Bayes PA topic models than the number of burn-in steps.

Number of Topics:
Finally, we observed an accuracy degradation phenomenon when running one-pass Bayes PA with a large number of topics on small datasets. To see this effect, we test on a small binary classification dataset, which consists of the documents from the two groups alt.atheism and talk.religion.misc in the 20 newsgroups dataset. The training set contains 1,597 documents with a split of 795/802 over the two categories, and the test set contains 401 documents with a split of 204/197. For clarity, we report the results of online Bayes PA with a Gibbs MedLDA classifier, whose hyper-parameters are set as in the multi-class setting.

Figure 8: Classification accuracies and training time of (a) paMedLDAgibbs, (b) paMedHDPgibbs, (c) paMedLDAave, and (d) paMedHDPave with different combinations of (I, J) on the 20NG dataset. The x-axis shows different values of J (the number of local samples), while different values of I (I = 1, 2, 3, 4) are shown as separate lines.

As shown in Figure 9 (the red curve), when the number of topics becomes relatively large, the classification accuracy drops significantly for Gibbs MedLDA with a single pass over the dataset. We believe this phenomenon is due to the fact that, with a limited number of data points and a large number of parameters, online Bayes PA fails to converge in just one pass. Two practical strategies can be used to make the algorithm useful in this regime:

Multiple Scans: In this strategy, we simply scan the dataset for multiple passes. As shown in Figure 9, this method works well: with five passes, Bayes PA can learn topic models with more than 100 topics and leads to improved accuracy when K is in the range [50, 100]. However, one disadvantage of this strategy is that it takes more computational resources, linear in the number of scans.

Ghost Samples: In this strategy, we create ghost copies of the data; that is, for every data point (x_d, y_d) we make L - 1 exact clones. Then, in our global update equation, each aggregation of sufficient statistics is multiplied by an extra factor of L. This strategy works well too, as shown in Figure 9, although slightly worse than the strategy of multiple scans. One advantage of this strategy is that it does not increase the computational burden.

Figure 9: Solving the performance degradation issue with a couple of practical approaches (accuracy versus the number of topics, 0-200). Baseline: the model without any modifications, where the degradation issue emerges; ghost samples: create L = 10 copies for every mini-batch; multiple scans: run the model with a total of five passes over the dataset.

8. Conclusions and Future Work

We present online Bayesian Passive-Aggressive (Bayes PA) learning as a new framework for max-margin Bayesian inference on streaming data. For fixed but large-scale datasets, online Bayes PA effectively explores the statistical redundancy by repeatedly drawing samples and leads to faster convergence. We show that Bayes PA subsumes online PA and, more significantly, generalizes naturally to incorporate latent variables and to perform nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. Based on the ideas of Bayes PA, we develop efficient online learning algorithms for max-margin topic models as well as their nonparametric extensions, which can automatically infer the unknown number of topics.
Empirical experiments on 20Newsgroups and a large-scale Wikipedia multi-label dataset demonstrate significant improvements in time efficiency, while maintaining comparable prediction results. As for future work, we are interested in developing highly scalable, distributed (Broderick et al., 2013) Bayes PA learning paradigms, which will better meet the demand of processing the massive real data available today. We are also interested in applying Bayes PA to develop efficient algorithms for more sophisticated max-margin Bayesian models, such as the latent feature relational models (Zhu, 2012). Finally, a theoretical analysis of the regret bounds for Bayes PA needs to be systematically carried out for both averaging and Gibbs classifiers.

Acknowledgements

This work is supported by National Key Foundation R&D Projects (No. 2013CB329403), the National Natural Science Foundation of China (NSFC) (Nos. 61620106010, 61322308, 61332007), the NSFC and German Research Foundation (DFG) joint project Crossmodal Learning (No. NSFC 61621136008 / DFG TRR-169), and the National Youth Top-notch Talent Support Program. J. Zhu is also partly supported by collaborative projects from Intel and Tencent.

Appendix A.

We show that the objective in (27) is an upper bound of that in (13), that is,

\mathcal{L}\big(q(\mathbf{w}, \Phi, Z_t, \lambda_t)\big) - \mathbb{E}_q\big[\log \psi(Y_t, \lambda_t \mid Z_t, \mathbf{w})\big] \;\geq\; \mathcal{L}\big(q(\mathbf{w}, \Phi, Z_t)\big) + 2c \sum_{d \in B_t} \mathbb{E}_q\big[(\xi_d)_+\big],   (45)

where \mathcal{L}(q) = \mathrm{KL}\big[q \,\|\, q_t(\mathbf{w}, \Phi)\, q_0(Z_t)\big].

Proof. We first have

\mathcal{L}\big(q(\mathbf{w}, \Phi, Z_t, \lambda_t)\big) = \mathbb{E}_q\Big[\log \frac{q(\lambda_t \mid \mathbf{w}, \Phi, Z_t)\, q(\mathbf{w}, \Phi, Z_t)}{q_t(\mathbf{w}, \Phi, Z_t)}\Big], \qquad \mathcal{L}\big(q(\mathbf{w}, \Phi, Z_t)\big) = \mathbb{E}_q\Big[\log \frac{q(\mathbf{w}, \Phi, Z_t)}{q_t(\mathbf{w}, \Phi, Z_t)}\Big],

where q_t(\mathbf{w}, \Phi, Z_t) is shorthand for q_t(\mathbf{w}, \Phi)\, q_0(Z_t). Comparing these two equations and canceling out the common factors, we see that in order for (45) to hold, it suffices to prove that

-\mathrm{H}[q'] - \mathbb{E}_{q'}\big[\log \psi(Y_t, \lambda_t \mid Z_t, \mathbf{w})\big] \;\geq\; 2c \sum_{d \in B_t} \mathbb{E}_q\big[(\xi_d)_+\big]   (46)

holds uniformly for any given (\mathbf{w}, \Phi, Z_t), where H(·) is the entropy operator and q' = q(\lambda_t \mid \mathbf{w}, \Phi, Z_t). The inequality (46) can be reformulated as

\mathbb{E}_{q'}\Big[\log \frac{q'(\lambda_t)}{\psi(Y_t, \lambda_t \mid Z_t, \mathbf{w})}\Big] \;\geq\; 2c \sum_{d \in B_t} \mathbb{E}_q\big[(\xi_d)_+\big].   (47)

Exploiting the convexity of the function -\log(\cdot), i.e.,

\mathbb{E}_{q'}\Big[-\log \frac{\psi(Y_t, \lambda_t \mid Z_t, \mathbf{w})}{q'(\lambda_t)}\Big] \;\geq\; -\log \int_{\lambda_t} \psi(Y_t, \lambda_t \mid Z_t, \mathbf{w})\, \mathrm{d}\lambda_t,

and utilizing the equality (26), we obtain (47) and therefore prove (45).

References

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning (ICML), 2012.
Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, pages 1152-1174, 1974.
Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning (ICML), 2014.
David M. Blei and Jon D. McAuliffe. Supervised topic models. In Neural Information Processing Systems (NIPS), 2010.
David M. Blei, Andrew Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael Jordan. Streaming variational Bayes. In Neural Information Processing Systems (NIPS), pages 1727-1735, 2013.
Kevin R. Canini, Lei Shi, and Thomas Griffiths. Online inference of topics with latent Dirichlet allocation. In International Conference on Artificial Intelligence and Statistics, pages 65-72, 2009.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
Jonathan Chang and David M. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124-150, 2010.
Ning Chen, Jun Zhu, Fei Xia, and Bo Zhang. Discriminative relational topic models. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 37(5):973-986, 2015.
David Chiang, Yuval Marton, and Philip Resnik. Online large-margin training of syntactic and structural translation features. In Conference on Empirical Methods on Natural Language Processing (EMNLP), 2008.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research (JMLR), 7:551-585, 2006.
Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted learning. In Neural Information Processing Systems (NIPS), 2008.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
Luc Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12:656-704, 2009.
Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176-183. Morgan Kaufmann Publishers Inc., 2000.
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning (ICML), pages 264-271, 2008.
Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
Yang Gao, Jianfei Chen, and Jun Zhu. Streaming Gibbs sampling for LDA model. arXiv preprint arXiv:1601.01142, 2016.
Zoubin Ghahramani and Thomas Griffiths. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems (NIPS), 2005.
Thomas Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1):5228-5235, 2004.
James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97-139, 1957.
Nils Lid Hjort. Bayesian Nonparametrics. Cambridge University Press, 2010.
Matthew Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research (JMLR), 14(1):1303-1347, 2013.
Antti Honkela and Harri Valpola. Online variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803-808, 2003.
Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Neural Information Processing Systems (NIPS), 1999.
Tony Jebara. Discriminative, Generative and Imitative Learning. PhD thesis, Massachusetts Institute of Technology, 2001.
Tony Jebara. Multitask sparsity via maximum entropy discrimination. Journal of Machine Learning Research (JMLR), 12:75-110, 2011.
Qixia Jiang, Jun Zhu, Maosong Sun, and Eric Xing.
Monte Carlo methods for maximum margin supervised topic models. In Neural Information Processing Systems (NIPS), 2012.
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Springer, 1998.
Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning (ICML), 2014.
Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Neural Information Processing Systems (NIPS), 2008.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.
Jan Luts, Tamara Broderick, and Matt Wand. Real-time semiparametric regression. arXiv preprint arXiv:1209.3550, 2013.
Ryan McDonald, Koby Crammer, and Fernando Pereira. Online large-margin training of dependency parsers. In Annual Meeting of the Association for Computational Linguistics, 2005.
Shike Mei, Jun Zhu, and Xiaojin Zhu. Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In International Conference on Machine Learning, 2014.
John R. Michael, William R. Schucany, and Roy W. Haas. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88-90, 1976.
David Mimno, Matthew Hoffman, and David M. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning (ICML), 2012.
Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann Publishers Inc., 2001.
Kevin P. Murphy. Machine Learning: a Probabilistic Perspective. MIT Press, 2012.
Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Neural Information Processing Systems (NIPS), pages 3102-3110, 2013.
Nicholas G. Polson and Steven L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1-23, 2011.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5:101-141, 2004.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In Neural Information Processing Systems (NIPS), pages 1265-1272, 2006.
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
Tianlin Shi and Jun Zhu. Online Bayesian passive-aggressive learning. In International Conference on Machine Learning (ICML), 2014.
Jacob Steinhardt and Percy Liang. Filtering with abstract particles. In International Conference on Machine Learning (ICML), 2014.
Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86-88, 1987.
Martin A. Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528-540, 1987.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Neural Information Processing Systems (NIPS), 2003.
Yee Whye Teh, M. I.
Jordan, M. J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006a.
Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Neural Information Processing Systems (NIPS), 2006b.
David Van Dyk and Xiao-Li Meng. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1), 2001.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
Chong Wang and David M. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. arXiv preprint arXiv:1201.1657, 2012a.
Chong Wang and David M. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In Neural Information Processing Systems (NIPS), 2012b.
Chong Wang, John Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), pages 681-688, 2011.
Max Welling, Yee Whye Teh, and Hilbert J. Kappen. Hybrid variational/Gibbs collapsed inference in topic models. In Uncertainty in Artificial Intelligence (UAI), 2008.
Minjie Xu, Jun Zhu, and Bo Zhang. Fast max-margin matrix factorization with data augmentation. In International Conference on Machine Learning (ICML), pages 978-986, 2013.
Jun Zhu. Max-margin nonparametric latent feature models for link prediction. In International Conference on Machine Learning (ICML), 2012.
Jun Zhu and Eric P. Xing. Maximum entropy discrimination Markov networks. Journal of Machine Learning Research (JMLR), 10:2531-2569, 2009.
Jun Zhu, Ning Chen, and Eric P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In International Conference on Machine Learning (ICML), pages 617-624, 2011.
Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research (JMLR), 13:2237-2278, 2012.
Jun Zhu, Xun Zheng, Li Zhou, and Bo Zhang. Scalable inference in max-margin topic models. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2013.
Jun Zhu, Jianfei Chen, and Wenbo Hu. Big learning with Bayesian methods. arXiv preprint arXiv:1411.6370, 2014a.
Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research (JMLR), 15:1073-1110, 2014b.
Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research (JMLR), 15:1799-1847, 2014c.