# Online Bayesian Max-Margin Subspace Multi-View Learning

Jia He1,3, Changying Du2, Fuzhen Zhuang1, Xin Yin1, Qing He1, Guoping Long2
1Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
3University of Chinese Academy of Sciences, Beijing 100049, China
{hej, zhuangfz, yinx, heq}@ics.ict.ac.cn, {changying, guoping}@iscas.ac.cn

## Abstract

The past decades have witnessed a number of studies devoted to multi-view learning algorithms; however, few efforts have been made to handle online multi-view learning scenarios. In this paper, we propose an online Bayesian multi-view learning algorithm that learns a predictive subspace with the max-margin principle. Specifically, we first define the latent margin loss for classification in the subspace, and then cast the learning problem into a variational Bayesian framework by exploiting the pseudo-likelihood and data augmentation ideas. With the variational approximate posterior inferred from the past samples, we can naturally combine historical knowledge with newly arriving data, in a Bayesian Passive-Aggressive style. Experiments on various classification tasks show that our model has superior performance.

## 1 Introduction

Nowadays, multi-view data are often generated continuously from multiple information channels; e.g., hundreds of YouTube videos consisting of visual, audio and text features are uploaded every minute. Multi-view learning has attracted a great deal of interest over the past decades [Blum and Mitchell, 1998; Yarowsky, 1995; Gönen and Alpaydın, 2011; Quang et al., 2013; Sun and Chao, 2013]. Among these methods, multi-view subspace learning approaches aim at obtaining a subspace shared by multiple views and then learning models in the shared subspace [Sharma et al., 2012; Hardoon et al., 2004; Guo and Xiao, 2012]. They are very useful for cross-view classification and retrieval. However, without the maximum margin principle these approaches are prone to overfitting on small training data. In [Chen et al., 2012], a large-margin harmonium model (MMH) based on a latent subspace Markov network is developed for multi-view data. But MMH is built on the maximum entropy discrimination framework and cannot automatically infer the penalty parameter of max-margin models in a Bayesian style. In [Du et al., 2015], a posterior-regularized Bayesian approach is proposed to combine Principal Component Analysis (PCA) with max-margin learning, which can infer the penalty parameter of max-margin models but cannot address multi-view data.

On the other hand, multi-view data often cannot be collected at a single time due to temporal and spatial constraints in applications, while traditional multi-view algorithms need to store the entire training set. Online learning is an efficient way to address this problem, and many efforts have been devoted to it [Cesa-Bianchi and Lugosi, 2006; Hazan et al., 2007; Chechik et al., 2010]. Unfortunately, there are few studies on online multi-view learning. OPMV [Zhu et al., 2015] is one of the few online multi-view learning methods. However, OPMV is not formulated in a Bayesian framework and does not introduce the max-margin principle, so it is prone to overfitting on small training data.
Moreover, OPMV is formulated as a point estimate obtained by optimizing a deterministic objective function. Online Passive-Aggressive (PA) learning provides a method for online large-margin learning [Crammer et al., 2006]. Although it enjoys strong discriminative ability suitable for predictive tasks, it is also formulated as a point estimate obtained by optimizing a deterministic objective function. Point estimates can be seriously affected by inappropriate regularization, outliers and noise, especially when the training data arrive sequentially. Based on online PA learning, Shi and Zhu propose Bayesian PA learning [Shi and Zhu, 2013], which infers a posterior under the Bayesian framework instead of a point estimate. Nevertheless, these online learning methods cannot process multi-view data. To the best of our knowledge, few efforts have focused on online multi-view learning under the Bayesian framework.

In this paper, we address the aforementioned problems by developing an online Bayesian multi-view subspace learning method with the max-margin principle. Specifically, we first propose a predictive subspace learning method based on factor analysis and define a latent margin loss for classification in the subspace. Then we cast the learning problem into a variational Bayesian framework by exploiting the pseudo-likelihood and data augmentation ideas, which allows us to infer the penalty parameter automatically. With the variational approximate posterior inferred from the past samples, we can naturally combine historical knowledge with newly arriving data, in a Bayesian Passive-Aggressive style. We update our model with the training data arriving one by one, instead of storing all training data. Experiments on synthetic data and various real classification tasks show that both our batch and online models achieve superior performance compared with a number of competitors.

## Related Work

The earliest works on multi-view learning were introduced by Blum and Mitchell [Blum and Mitchell, 1998] and Yarowsky [Yarowsky, 1995]. Nowadays, there are many multi-view learning approaches, e.g., multiple kernel learning [Gönen and Alpaydın, 2011], disagreement-based multi-view learning [Blum and Mitchell, 1998], and late fusion methods that combine the outputs of models constructed from different view features [Ye et al., 2012]. In particular, multi-view subspace learning algorithms learn a latent salient representation of multi-view data [Sharma et al., 2012; Hardoon et al., 2004]. This line of work aims at obtaining a subspace shared by multiple views and then learning models in the shared subspace.

Online learning starts from the Perceptron algorithm [Rosenblatt, 1958] and has attracted much attention during the past years [Cesa-Bianchi and Lugosi, 2006; Hazan et al., 2007; Grangier and Bengio, 2008; Chechik et al., 2010]. Crammer et al. propose Online Passive-Aggressive (PA) learning, which provides a general framework for online large-margin learning [Crammer et al., 2006] and has found many applications [Chiang et al., 2008]. Online Bayesian Passive-Aggressive learning presents a generic framework for performing online learning with Bayesian max-margin models [Shi and Zhu, 2013].

## 2 The Model

In this section, we first propose max-margin subspace learning based on factor analysis. Then we develop a multi-view classification model with max-margin subspace learning under the Bayesian framework.
Finally, we extend the batch model to the online scenario, which trains the model with samples arriving one by one.

### 2.1 Max-margin Subspace Learning

Suppose we have a set of $N$ observations $x^{(n)}, n = 1, \dots, N$ in a $d$-dimensional feature space and a $1 \times N$ label vector $y$ with elements $y_n \in \{+1, -1\}, n = 1, \dots, N$. Factor analysis projects an observation into a low-dimensional space that captures the latent features of the data. The generative process for the $n$-th observation $x^{(n)}$ is as follows:

$$\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \Phi), \qquad x^{(n)} = \mu + W z^{(n)} + \varepsilon, \qquad (1)$$

where $\varepsilon \in \mathbb{R}^{d \times 1}$ denotes the Gaussian noise, $\Phi \in \mathbb{R}^{d \times d}$ is the covariance matrix of $\varepsilon$, $\mu \in \mathbb{R}^{d \times 1}$ is the mean of $x^{(n)}$, $W \in \mathbb{R}^{d \times m}$ is the factor loading matrix, and $z^{(n)}$ is an $m$-dimensional latent variable. The estimates of the model variables $(\mu, W, \Phi, Z)$ can be obtained as follows:

$$\max_{\mu, W, \Phi, Z} s(\mu, W, \Phi, Z) = \max_{\mu, W, \Phi, Z} \sum_{n=1}^{N} \log \frac{1}{(2\pi)^{d/2} |\Phi|^{1/2}} \exp\Big(-\frac{1}{2}\big(x^{(n)} - \mu - W z^{(n)}\big)^{T} \Phi^{-1} \big(x^{(n)} - \mu - W z^{(n)}\big)\Big).$$

However, factor analysis is an unsupervised model, which learns the latent variables of the observations without using any label information. The max-margin principle can be introduced to incorporate label information into the factor analysis model. We define $\tilde{z} = [z^{T}, 1]^{T}$ as the augmented latent representation of observation $x$, and let $f(x; z, \eta) = \eta^{T} \tilde{z}$ be a discriminant function parameterized by $\eta$. Now, for fixed values of $Z$ and $\eta$, we can compute the margin loss on the training data $(X, y)$ by

$$m(Z, \eta) = \sum_{n=1}^{N} \max\big(0, 1 - y_n f(x^{(n)}; z^{(n)}, \eta)\big). \qquad (2)$$

The max-margin subspace learning model can then be formulated as

$$\max_{\mu, W, \Phi, Z, \eta} s(\mu, W, \Phi, Z) - C\, m(Z, \eta), \qquad (3)$$

where $C$ is the regularization parameter.

### 2.2 Multi-view Classification with Bayesian M2SL

We now propose a Bayesian max-margin subspace multi-view learning (BM2SMVL) model. Assume that $N_v$ is the number of views, $N_c$ is the number of classes, $d_i$ is the dimension of the $i$-th view, the data matrix of the $i$-th view is $X_i \in \mathbb{R}^{d_i \times N}$ consisting of $N$ observations $x_i^{(n)}$ in a $d_i$-dimensional feature space, $x^{(n)} = \{x_i^{(n)}, i = 1, \dots, N_v\}$ denotes the $n$-th observation, and $Y$ is an $N_c \times N$ label matrix consisting of $N$ label vectors $y^{(n)} = \{y_c^{(n)}, c = 1, \dots, N_c\}$. If the $n$-th observation's label belongs to the $c$-th class, we define $y_c^{(n)} = +1$; otherwise $y_c^{(n)} = -1$.

In our BM2SMVL model, each view $x_i^{(n)}$ of the $n$-th observation $x^{(n)}$ is generated from the latent variable $z^{(n)}$. We impose prior distributions over all variables shown in Eq. (1). The generative process for the $n$-th observation is as follows:

$$z^{(n)} \sim \mathcal{N}(z^{(n)} \mid 0, I_m)$$
$$\mu_i \sim \mathcal{N}(\mu_i \mid 0, \beta_i^{-1} I_{d_i})$$
$$\alpha_i \sim \textstyle\prod_{j=1}^{m} \Gamma(\alpha_{ij} \mid a_{\alpha_i}, b_{\alpha_i})$$
$$W_i \mid \alpha_i \sim \textstyle\prod_{j=1}^{d_i} \mathcal{N}(w_{ij} \mid 0, \mathrm{diag}(\alpha_i)^{-1})$$
$$\phi_i \sim \Gamma(\phi_i \mid a_{\phi_i}, b_{\phi_i})$$
$$x_i^{(n)} \mid z^{(n)} \sim \mathcal{N}(x_i^{(n)} \mid W_i z^{(n)} + \mu_i, \phi_i^{-1} I_{d_i}),$$

where $\Gamma(\cdot)$ is the Gamma distribution, $\beta_i, a_{\alpha_i}, b_{\alpha_i}, a_{\phi_i}, b_{\phi_i}$ are hyper-parameters, and $W_i \in \mathbb{R}^{d_i \times m}$. The prior on $W_i$ and $\alpha_i$ follows the automatic relevance determination principle [Reents and Urbanczik, 1998]. To improve the efficiency of our algorithm, we define the covariance matrix $\Phi_i$ of $x_i^{(n)}$ as the diagonal matrix $\phi_i^{-1} I_{d_i}$.

Let $\Theta = (\mu, \alpha, W, \phi, Z)$ denote all generative variables, with prior $p_0(\Theta) = p_0(\mu)\, p_0(W, \alpha)\, p_0(\phi)\, p_0(Z)$. One can verify that the Bayesian posterior distribution $p(\Theta \mid X) = p_0(\Theta)\, p(X \mid \Theta) / p(X)$ is equal to the solution of the following optimization problem:

$$\min_{q(\Theta) \in \mathcal{P}} \mathrm{KL}\big(q(\Theta) \,\|\, p_0(\Theta)\big) - \mathbb{E}_{q(\Theta)}\big[\log p(X \mid \Theta)\big], \qquad (4)$$

where $\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence and $\mathcal{P}$ is the space of probability distributions. When the observations are given, $p(X)$ is a constant.
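To make the generative process above concrete, the following NumPy sketch samples synthetic multi-view data from the model. It is only an illustration: the sizes, hyper-parameter values and variable names are our own illustrative choices, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 2 views, latent dimension m, N samples.
N, m, dims = 100, 5, [20, 30]
# Illustrative hyper-parameter values (well-behaved for a demo).
beta, a_alpha, b_alpha, a_phi, b_phi = 1.0, 2.0, 2.0, 2.0, 2.0

# Latent representations z^(n) ~ N(0, I_m), stored as an m x N matrix.
Z = rng.standard_normal((m, N))

X = []
for d_i in dims:
    mu_i = rng.normal(0.0, 1.0 / np.sqrt(beta), size=d_i)        # mu_i ~ N(0, beta^-1 I)
    alpha_i = rng.gamma(a_alpha, 1.0 / b_alpha, size=m)           # ARD precisions alpha_ij ~ Gamma(a, b)
    W_i = rng.normal(0.0, 1.0 / np.sqrt(alpha_i), size=(d_i, m))  # rows of W_i ~ N(0, diag(alpha_i)^-1)
    phi_i = rng.gamma(a_phi, 1.0 / b_phi)                         # noise precision phi_i
    noise = rng.normal(0.0, 1.0 / np.sqrt(phi_i), size=(d_i, N))
    X.append(W_i @ Z + mu_i[:, None] + noise)                     # x_i^(n) = W_i z^(n) + mu_i + eps

print([Xi.shape for Xi in X])  # [(20, 100), (30, 100)]
```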
Next, we adapt our model with a one-vs-rest strategy, as commonly used for SVMs, to handle multi-class classification problems. We have $N_c$ classifiers; taking the $c$-th classifier as an example, $f_c(x^{(n)}; z^{(n)}, \eta_c) = \eta_c^{T} \tilde{z}^{(n)}$ denotes its discriminant function. Under the Bayesian framework, we impose a prior on $\eta_c$ as follows:

$$\gamma_c \sim p_0(\gamma_c) = \Gamma(\gamma_c \mid a_{\gamma,c}, b_{\gamma,c}), \qquad p(\eta_c \mid \gamma_c) = \mathcal{N}(\eta_c \mid 0, \gamma_c^{-1} I_{m+1}),$$

where $a_{\gamma,c}$ and $b_{\gamma,c}$ are hyper-parameters. For simplicity, let $H = \{(\eta_c, \gamma_c)\}_{c=1}^{N_c}$. We then replace the margin loss with the expected margin loss for classification, and introduce

$$\psi(Y \mid Z, H) = \prod_{n=1}^{N} \prod_{c=1}^{N_c} \exp\big\{-2C \max\big(0, 1 - y_c^{(n)} \eta_c^{T} \tilde{z}^{(n)}\big)\big\} \qquad (5)$$

as the pseudo-likelihood of the label variables. Our final model is

$$\min_{q(\Theta, H) \in \mathcal{P}} \mathrm{KL}\big(q(\Theta, H) \,\|\, p_0(\Theta, H)\big) - \mathbb{E}_{q(\Theta)}\big[\log p(X \mid \Theta)\big] - \mathbb{E}_{q(\Theta, H)}\big[\log \psi(Y \mid Z, H)\big], \qquad (6)$$

where $p_0(\Theta, H)$ is the prior, $p_0(\Theta, H) = p_0(\Theta)\, p_0(H)$, $p_0(\eta_c, \gamma_c) = p(\eta_c \mid \gamma_c)\, p_0(\gamma_c)$, and $C$ is the regularization parameter. Solving problem (6), we obtain the posterior distribution

$$q(\Theta, H) = \frac{p_0(\Theta, H)\, p(X \mid \Theta)\, \psi(Y \mid Z, H)}{\phi(X, Y)}, \qquad (7)$$

where $\phi(X, Y)$ is the normalization constant. We approximate $q(\Theta, H)$ with the variational inference procedure introduced in Section 3.

### 2.3 Online BM2SMVL

The goal of online learning is to minimize the cumulative loss on a prediction task over sequentially arriving training samples. In this section, we present an online BM2SMVL (OBM2SMVL) based on the online Passive-Aggressive learning framework [Crammer et al., 2006]. This generic framework for online large-margin learning has been used in many applications [Chiang et al., 2008], and online Bayesian Passive-Aggressive learning was proposed for online Bayesian max-margin topic models [Shi and Zhu, 2013].

Assume we have already obtained the posterior $q_t(\Theta, H)$ at time $t$. When a new data point $(x^{(t+1)}, y^{(t+1)})$ arrives, we need to update to the new posterior distribution $q_{t+1}(\Theta, H)$. For simplicity, we denote $x^{(t+1)} = \{x_i^{(t+1)}\}_{i=1}^{N_v}$ and $y^{(t+1)} = \{y_c^{(t+1)}\}_{c=1}^{N_c}$. Generally, let $\omega$ denote the parameterized model and $\ell(\omega; x^{(t+1)}, y^{(t+1)})$ the loss on the new data $(x^{(t+1)}, y^{(t+1)})$. Our OBM2SMVL sequentially infers a new posterior distribution $q_{t+1}(\omega)$ on the arrival of new data $(x^{(t+1)}, y^{(t+1)})$ by solving the following optimization problem:

$$\min_{q(\omega) \in \mathcal{P}} \mathrm{KL}\big(q(\omega) \,\|\, q_t(\omega)\big) - \mathbb{E}_{q(\omega)}\big[\log p(x^{(t+1)} \mid \omega)\big] + \ell\big(\omega; x^{(t+1)}, y^{(t+1)}\big).$$

This online objective balances three requirements. First, $\mathrm{KL}(q(\omega) \| q_t(\omega))$ should be as small as possible, so that $q_{t+1}(\omega)$ stays close to $q_t(\omega)$. Second, the likelihood of the new data, $\mathbb{E}_{q(\omega)}[\log p(x^{(t+1)} \mid \omega)]$, should be high enough. Third, the loss on the new data, $\ell(\omega; x^{(t+1)}, y^{(t+1)})$, should be as small as possible, so that the new model $q_{t+1}(\omega)$ suffers little loss on the new data.

To apply this online idea to the multi-view classification model BM2SMVL, we let $(\Theta, H)$ play the role of $\omega$. A new posterior distribution $q_{t+1}(\Theta, H)$ on the arrival of new data $(x^{(t+1)}, y^{(t+1)})$ can be obtained by solving

$$\min_{q(\Theta, H) \in \mathcal{P}} \mathrm{KL}\big(q(\Theta, H) \,\|\, q_t(\Theta, H)\big) - \mathbb{E}_{q(\Theta, H)}\big[\log p(x^{(t+1)} \mid \Theta, H)\big] + \ell\big(\Theta, H; x^{(t+1)}, y^{(t+1)}\big).$$

As above, we introduce the pseudo-likelihood $\psi(\cdot)$ to replace the hinge loss, so the problem becomes

$$\min_{q(\Theta, H) \in \mathcal{P}} \mathrm{KL}\big(q(\Theta, H) \,\|\, q_t(\Theta, H)\big) - \mathbb{E}_{q(\Theta)}\big[\log p(x^{(t+1)} \mid \Theta)\big] - \mathbb{E}_{q(\Theta, H)}\big[\log \psi\big(y^{(t+1)} \mid \tilde{z}^{(t+1)}, H\big)\big].$$

Similar to Eq. (7), we obtain the posterior distribution

$$q_{t+1}(\Theta, H) = \frac{q_t(\Theta, H)\, p(x^{(t+1)} \mid \Theta)\, \psi\big(y^{(t+1)} \mid \tilde{z}^{(t+1)}, H\big)}{\phi(x^{(t+1)}, y^{(t+1)})},$$

where $\phi(x^{(t+1)}, y^{(t+1)})$ is the normalization constant. Note that the latent variable $z^{(t)}$ is unrelated to the new posterior, because the prior of the new variable $z^{(t+1)}$ is $p_0(z)$. Let $(\Theta, H) \backslash z^{(t)}$ denote all variables in $\Theta$ and $H$ except $z^{(t)}$; then we can further write

$$q_{t+1}(\Theta, H) = \frac{q_t\big((\Theta, H) \backslash z^{(t)}\big)\, p_0(z)\, p(x^{(t+1)} \mid \Theta)\, \psi\big(y^{(t+1)} \mid \tilde{z}^{(t+1)}, H\big)}{\phi(x^{(t+1)}, y^{(t+1)})}. \qquad (8)$$

In order to approximate $q_{t+1}(\Theta, H)$, we use the variational approximate inference introduced in Section 3.
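The streaming structure implied by Eq. (8) can be summarized by the following Python-style skeleton, in which the posterior from step $t$ serves as the prior at step $t+1$. This is only a structural sketch under our own naming: `online_bm2smvl`, `stream` and `variational_update` are hypothetical names, and `variational_update` stands in for the mean-field updates derived in Section 3.

```python
def online_bm2smvl(stream, q0, variational_update):
    """Structural sketch of the OBM2SMVL loop around Eq. (8).

    stream: iterable of (x_t, y_t) multi-view samples arriving one by one.
    q0: initial variational posterior over the global variables (Theta without z, and H),
        i.e. the prior p0 before any data has been seen.
    variational_update: routine approximating Eq. (8) by mean-field inference; it returns
        the new global posterior and the local factor for the new latent variable z.
    """
    q_t = q0
    for x_new, y_new in stream:
        # Treat q_t as the prior for the global variables, draw the new z from p0(z),
        # and absorb the likelihood p(x|Theta) and the pseudo-likelihood psi(y|z, H).
        q_t, q_z = variational_update(prior=q_t, x=x_new, y=y_new)
        # q_z (the factor over the new z) is local and is not carried to the next step.
    return q_t
```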
## 3 Variational Inference

Because the posterior is intractable to compute, we apply variational inference [Beal, 2003] to approximate the posteriors in Eq. (7) for BM2SMVL and in Eq. (8) for OBM2SMVL. This method is much more efficient than sampling-based methods [Gilks, 2005].

### 3.1 Data Augmentation

The pseudo-likelihood function $\psi(\cdot)$ involves a max operator, which makes posterior inference difficult and inefficient. We therefore re-express the pseudo-likelihood as the integral of a function with an augmented variable, based on the data augmentation idea [Polson and Scott, 2011]. For BM2SMVL, we replace the pseudo-likelihood $\psi(\cdot)$ with

$$\psi\big(y_c^{(n)} \mid \tilde{z}^{(n)}, \eta_c\big) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi \lambda_c^{(n)}}} \exp\Big\{-\frac{\big[\lambda_c^{(n)} + C\big(1 - y_c^{(n)} \eta_c^{T} \tilde{z}^{(n)}\big)\big]^2}{2\lambda_c^{(n)}}\Big\}\, d\lambda_c^{(n)}.$$

Then we obtain

$$\psi(Y, \lambda \mid Z, H) = \prod_{n=1}^{N} \prod_{c=1}^{N_c} \frac{1}{\sqrt{2\pi \lambda_c^{(n)}}} \exp\Big\{-\frac{\big[\lambda_c^{(n)} + C\big(1 - y_c^{(n)} \eta_c^{T} \tilde{z}^{(n)}\big)\big]^2}{2\lambda_c^{(n)}}\Big\}.$$

Similarly, we introduce the augmented variable into the pseudo-likelihood function $\psi(\cdot)$ for OBM2SMVL:

$$\psi\big(y^{(t+1)}, \lambda^{(t+1)} \mid \tilde{z}^{(t+1)}, H\big) = \prod_{c=1}^{N_c} \frac{1}{\sqrt{2\pi \lambda_c^{(t+1)}}} \exp\Big\{-\frac{\big[\lambda_c^{(t+1)} + C\big(1 - y_c^{(t+1)} \eta_c^{T} \tilde{z}^{(t+1)}\big)\big]^2}{2\lambda_c^{(t+1)}}\Big\}.$$

### 3.2 Variational Approximate Inference

Next, we apply the mean-field variational method to approximate the posterior distributions.

#### Variational Inference in BM2SMVL

First, we define a family of factorized but free-form variational distributions:

$$V(\Theta, H, \lambda) = V(\mu)\, V(W)\, V(\alpha)\, V(\phi)\, V(Z)\, V(\eta)\, V(\lambda)\, V(\gamma).$$

The main idea of variational Bayesian inference is to minimize the KL divergence $\mathrm{KL}(V(\Theta, H, \lambda) \,\|\, q(\Theta, H, \lambda))$ between the approximating distribution and the target posterior. We initialize the factors of $V(\Theta, H, \lambda)$ and then iteratively update each factor while fixing the other factors at their current estimates. The joint distribution of data and parameters is

$$p(\Theta, H, \lambda, X, Y) = p_0(\mu)\, p(W \mid \alpha)\, p_0(\alpha)\, p_0(\phi)\, p_0(Z)\, p(\eta \mid \gamma)\, p_0(\gamma)\, p(X \mid \mu, W, \phi, Z)\, \psi(Y, \lambda \mid Z, \eta).$$

It can be shown that, keeping all other factors fixed, the optimal distribution $V(\lambda)$ satisfies $V(\lambda) \propto \exp\{\mathbb{E}_{-\lambda}[\log p(\Theta, H, \lambda, X, Y)]\}$, where $\mathbb{E}_{-\lambda}$ denotes the expectation with respect to $V(\Theta, H, \lambda)$ over all variables except $\lambda$. This yields the update

$$V\big(\lambda_c^{(n)}\big) = \mathrm{GIG}\Big(\lambda_c^{(n)} \,\Big|\, \tfrac{1}{2}, 1, \chi_c^{(n)}\Big), \qquad \chi_c^{(n)} = C^2 \big\langle \big(1 - y_c^{(n)} \eta_c^{T} \tilde{z}^{(n)}\big)^2 \big\rangle,$$

where $\langle \cdot \rangle$ denotes the expectation and $\mathrm{GIG}(\cdot)$ is the generalized inverse Gaussian distribution. Similarly, we can derive the updating formulas for all other factors. Since they are tedious but easy to derive, here we only provide the equation for $Z$; the other updating formulas are omitted due to the limited space of the paper:

$$V\big(z^{(n)}\big) = \mathcal{N}\big(z^{(n)} \,\big|\, \mu_z^{(n)}, \Sigma_z^{(n)}\big),$$
$$\Sigma_z^{(n)} = \Big\{\sum_{c=1}^{N_c} C^2 \big\langle (\lambda_c^{(n)})^{-1} \big\rangle \big\langle \tilde{\eta}_c \tilde{\eta}_c^{T} \big\rangle + I_m + \sum_{i=1}^{N_v} \langle \phi_i \rangle \big\langle W_i^{T} W_i \big\rangle\Big\}^{-1},$$
$$\mu_z^{(n)} = \Sigma_z^{(n)} \Big\{\sum_{i=1}^{N_v} \langle \phi_i \rangle \big\langle W_i^{T} \big\rangle \big(x_i^{(n)} - \langle \mu_i \rangle\big) + \sum_{c=1}^{N_c} \Big[C\big(1 + C\big\langle (\lambda_c^{(n)})^{-1} \big\rangle\big) y_c^{(n)} \langle \tilde{\eta}_c \rangle - C^2 \big\langle (\lambda_c^{(n)})^{-1} \big\rangle \big\langle \eta_{c,(m+1)} \tilde{\eta}_c \big\rangle\Big]\Big\},$$

where $\tilde{\eta}_c$ denotes the first $m$ dimensions of $\eta_c$, i.e., $\eta_c = [\tilde{\eta}_c^{T}, \eta_{c,(m+1)}]^{T}$.

#### Variational Inference in OBM2SMVL

Now we use variational inference to approximate $q_{t+1}(\Theta, H)$ in the OBM2SMVL model. Consistent with Eq. (8), the joint distribution of the new data and the parameters is

$$p\big(\Theta, H, \lambda^{(t+1)}, x^{(t+1)}, y^{(t+1)}\big) = q_t\big((\Theta, H) \backslash z^{(t)}\big)\, p_0(z)\, p\big(x^{(t+1)} \mid \mu, W, \phi, z\big)\, \psi\big(y^{(t+1)}, \lambda^{(t+1)} \mid \tilde{z}^{(t+1)}, \eta\big).$$

It can be shown that, keeping all other factors fixed, the optimal distribution $V(\lambda^{(t+1)})$ satisfies $V(\lambda^{(t+1)}) \propto \exp\{\mathbb{E}_{-\lambda^{(t+1)}}[\log p(\Theta, H, \lambda^{(t+1)}, x^{(t+1)}, y^{(t+1)})]\}$, where $\mathbb{E}_{-\lambda^{(t+1)}}$ denotes the expectation with respect to $V(\Theta, H, \lambda^{(t+1)})$ over all variables except $\lambda^{(t+1)}$. This yields the update

$$V\big(\lambda_c^{(t+1)}\big) = \mathrm{GIG}\Big(\lambda_c^{(t+1)} \,\Big|\, \tfrac{1}{2}, 1, \chi_c^{(t+1)}\Big), \qquad \chi_c^{(t+1)} = C^2 \big\langle \big(1 - y_c^{(t+1)} \eta_c^{T} \tilde{z}^{(t+1)}\big)^2 \big\rangle.$$
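For the $\mathrm{GIG}(\tfrac{1}{2}, 1, \chi)$ factor above, the quantity the other updates actually consume is $\langle \lambda^{-1} \rangle$. Assuming the standard $(p, a, b)$ parameterization with density proportional to $\lambda^{p-1}\exp(-(a\lambda + b/\lambda)/2)$, this expectation reduces to the closed form $\chi^{-1/2}$ because $K_{-1/2} = K_{1/2}$. The helper below is a small sketch of that step; the function names are our own, and the general-GIG branch via Bessel functions is included only as a numerical cross-check.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_mean_inverse(p, a, b):
    """E[1/x] for x ~ GIG(p, a, b) with density proportional to x^(p-1) exp(-(a*x + b/x)/2)."""
    s = np.sqrt(a * b)
    return np.sqrt(a / b) * kv(p - 1.0, s) / kv(p, s)

def expected_inv_lambda(C, e_margin_sq):
    """<lambda^{-1}> for the GIG(1/2, 1, chi) update, with chi = C^2 <(1 - y eta^T z)^2>.

    Because K_{-1/2} = K_{1/2}, the Bessel ratio cancels and the result is chi^(-1/2).
    """
    chi = C ** 2 * e_margin_sq
    return 1.0 / np.sqrt(chi)

# Cross-check the closed form against the general GIG formula.
C, e_margin_sq = 2.0, 0.49
chi = C ** 2 * e_margin_sq
assert np.isclose(expected_inv_lambda(C, e_margin_sq), gig_mean_inverse(0.5, 1.0, chi))
```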
Similarly, we can derive the updating formulas for all other factors. Since they are tedious but easy to derive, here we only provide the equation for $z^{(t+1)}$; the other updating formulas are omitted due to the limited space of the paper:

$$V\big(z^{(t+1)}\big) = \mathcal{N}\big(z^{(t+1)} \,\big|\, \mu_z^{(t+1)}, \Sigma_z^{(t+1)}\big),$$
$$\Sigma_z^{(t+1)} = \Big\{\sum_{c=1}^{N_c} C^2 \big\langle (\lambda_c^{(t+1)})^{-1} \big\rangle \big\langle \tilde{\eta}_c \tilde{\eta}_c^{T} \big\rangle + I_m + \sum_{i=1}^{N_v} \langle \phi_i \rangle \big\langle W_i^{T} W_i \big\rangle\Big\}^{-1},$$
$$\mu_z^{(t+1)} = \Sigma_z^{(t+1)} \Big\{\sum_{i=1}^{N_v} \langle \phi_i \rangle \big\langle W_i^{T} \big\rangle \big(x_i^{(t+1)} - \langle \mu_i \rangle\big) + \sum_{c=1}^{N_c} \Big[C\big(1 + C\big\langle (\lambda_c^{(t+1)})^{-1} \big\rangle\big) y_c^{(t+1)} \langle \tilde{\eta}_c \rangle - C^2 \big\langle (\lambda_c^{(t+1)})^{-1} \big\rangle \big\langle \eta_{c,(m+1)} \tilde{\eta}_c \big\rangle\Big]\Big\},$$

where $\tilde{\eta}_c$ denotes the first $m$ dimensions of $\eta_c$, i.e., $\eta_c = [\tilde{\eta}_c^{T}, \eta_{c,(m+1)}]^{T}$.

### 3.3 Computational Complexity

Each iteration of parameter updating in our batch learning model BM2SMVL requires $O(N N_v \bar{d} m^2)$ computation, where $\bar{d}$ is the average dimension over the $N_v$ views. Most of the computation is spent on calculating $\Sigma_z^{(n)}, n = 1, \dots, N$, where the matrix multiplication $\langle W_i^{T} W_i \rangle$ costs $O(d_i m^2)$. Each iteration of parameter updating in our online learning model OBM2SMVL costs $O(N_v \bar{d} m^2)$ when a new sample arrives.
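As an illustration of where this cost arises, the NumPy sketch below computes the parameters of $V(z^{(t+1)})$ for a single incoming sample from the current variational expectations, following the update given in Section 3.2 (the batch per-sample update has the same form). The function name and the dictionary-of-expectations layout are our own illustrative conventions, not the paper's implementation.

```python
import numpy as np

def update_z(x_new, y_new, E, C):
    """Mean-field update for V(z) = N(mu_z, Sigma_z); returns (mu_z, Sigma_z).

    x_new: list of view feature vectors x_i (length N_v); y_new: array of +/-1 labels (length N_c).
    E: dict of current expectations, e.g. E["WtW"][i] = <W_i^T W_i> (m x m), E["W"][i] = <W_i>,
       E["phi"][i] = <phi_i>, E["mu"][i] = <mu_i>, E["eta"][c] = <eta_c> (length m+1),
       E["eta_outer"][c] = <eta_c eta_c^T> ((m+1) x (m+1)), E["inv_lam"][c] = <lambda_c^{-1}>.
    """
    m = E["WtW"][0].shape[0]
    prec = np.eye(m)                       # prior N(z | 0, I_m) contributes I_m to the precision
    lin = np.zeros(m)
    for i, x_i in enumerate(x_new):        # view terms: <phi_i><W_i^T W_i> and <phi_i><W_i^T>(x_i - <mu_i>)
        prec += E["phi"][i] * E["WtW"][i]  # the O(d_i m^2) products <W_i^T W_i> dominate the cost
        lin += E["phi"][i] * E["W"][i].T @ (x_i - E["mu"][i])
    for c, y_c in enumerate(y_new):        # classifier terms from the augmented pseudo-likelihood
        inv_lam = E["inv_lam"][c]
        eta_head = E["eta"][c][:m]                              # <eta_c~>, first m dimensions
        prec += C**2 * inv_lam * E["eta_outer"][c][:m, :m]      # C^2 <lambda^-1> <eta~ eta~^T>
        lin += (C * (1.0 + C * inv_lam) * y_c * eta_head
                - C**2 * inv_lam * E["eta_outer"][c][:m, m])    # - C^2 <lambda^-1> <eta_{m+1} eta~>
    Sigma_z = np.linalg.inv(prec)
    return Sigma_z @ lin, Sigma_z
```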
## 4 Experiments

We evaluate the proposed batch learning model BM2SMVL and the online learning model OBM2SMVL on various classification tasks involving image and text data.

### 4.1 Real Data Sets

Four data sets are used in our experiments: Trecvid, Washington, Cornell and News4Gv.

Trecvid contains 1,078 manually labeled video shots belonging to five categories [Chen et al., 2012]. Each shot is represented by a 1,894-dimensional binary vector of text features and a 165-dimensional vector of HSV color histogram features.

The WebKB data set has two views: the content features of the web pages and the link features extracted from the link structure. It consists of 877 web pages from the computer science departments of four universities, i.e., Cornell, Washington, Wisconsin and Texas, and each university has five document classes, i.e., course, faculty, student, project and staff. We select the web pages from Cornell and Washington as our experimental data (available at http://www-2.cs.cmu.edu/~webkb/). These two data sets have five classes with two views.

The 20Newsgroups data set is widely used for classification. It has approximately 20,000 newsgroup documents divided into 20 categories. We follow [Long et al., 2008] to construct multi-view learning problems from it. We use the tf-idf weighting scheme to represent the documents, and a document frequency threshold of 5 is adopted to cut down the number of word features. The details of these data sets are shown in Table 1.

Table 1: Statistics of the multiclass data sets.

| | Trecvid | Washington | Cornell | News4Gv |
|---|---|---|---|---|
| size | 1078 | 230 | 195 | 1500 |
| class | 5 | 5 | 5 | 3 |
| V1-Dim | 1894 | 1703 | 1703 | 6783 |
| V2-Dim | 165 | 230 | 195 | 6307 |
| V3-Dim | - | - | - | 7717 |
| V4-Dim | - | - | - | 9336 |

### 4.2 Competitors

We compare our model with five competitors:

- VMRML [Quang et al., 2013]: a vector-valued manifold regularization multi-view learning method. The regularization parameters are set to the default values in their paper, and we tune the RBF kernel parameter σ carefully on each data set.
- MVMED [Sun and Chao, 2013]: a multi-view maximum entropy discrimination model. We use the model with the one-vs-rest strategy for multiclass problems. Following the paper, we choose the best parameter from $2^{[-5:5]}$ by 5-fold cross-validation on each data set.
- MMH [Chen et al., 2012]: a large-margin predictive latent subspace learning method for multi-view data. Starting from the parameters given in its code (http://bigml.cs.tsinghua.edu.cn/~ningchen/MMH.htm), we carefully tune the four parameters to choose the best setting for each data set.
- SVM-FULL: concatenates all views to form a new single view and applies SVM for classification. We choose the linear kernel and perform 5-fold cross-validation on the training sets to choose the cost parameter c from $10^{[-3:3]}$.
- OPMV [Zhu et al., 2015]: an online multi-view learning method. Following the paper, the learning rate parameter is chosen from $2^{[-8:8]}$, the regularization parameter is chosen from $10^{[-16:0]}$, and the penalty parameter is pre-defined as 1.

The parameters of all competitors are set according to the above rules.

Table 2: Batch learning comparison on multiclass data sets. Listed results are test accuracies (%) averaged over 20 independent runs. Bold face indicates the highest accuracy.

| | Trecvid | Washington | Cornell | News4Gv |
|---|---|---|---|---|
| MMH | 61.22 ± 0.0 | 80.98 ± 2.94 | 74.01 ± 0.19 | - |
| MVMED | 63.80 ± 0.0 | 73.86 ± 2.78 | 72.27 ± 2.97 | 94.26 ± 0.83 |
| VMRML | 63.27 ± 0.0 | 79.44 ± 2.61 | 76.96 ± 4.03 | 93.34 ± 0.98 |
| SVM-FULL | 62.34 ± 0.0 | 82.91 ± 3.33 | 76.14 ± 2.40 | **99.21 ± 0.30** |
| BM2SMVL | **65.86 ± 0.0** | **83.48 ± 3.03** | **78.87 ± 3.63** | 97.99 ± 0.57 |

Table 3: Online learning comparison on multiclass data sets. Listed results are test accuracies (%) averaged over 20 independent runs. Bold face indicates the highest accuracy.

| | Trecvid | Washington | Cornell | News4Gv |
|---|---|---|---|---|
| OPMV | 61.41 ± 0.0 | 73.44 ± 2.06 | 66.40 ± 3.83 | - |
| OBM2SMVL | **63.27 ± 0.0** | **77.96 ± 3.87** | **74.18 ± 4.63** | **96.03 ± 0.66** |

### 4.3 Parameter Setting

In our batch learning, the regularization parameter $C$ is chosen from the integer set {1, 2, 3} and the subspace dimension $m$ from the integer set {20, 30, 50} for each data set by performing 5-fold cross-validation on the training data. In our online learning, the regularization parameter $C$ is chosen from the integer set {1, 5, 15} and the subspace dimension $m$ from the integer set {20, 30, 50}. The remaining hyper-parameters are set identically for both the batch and online models: $a_{\alpha} = b_{\alpha} =$ 1e-3, $a_{\phi} =$ 1e-2, $a_{\gamma} =$ 1e-1, $b_{\phi} = b_{\gamma} = \beta =$ 1e-5.

### 4.4 Experimental Results

Since a normal prior with zero mean is imposed on the observation data, we normalize the observations to have zero mean and unit variance. In the batch learning experiments, we use the same training/testing split of the Trecvid data set as in [Chen et al., 2012], so there is only one result on this data set. For the other data sets, the results of all models are averaged over 20 independent runs. All results are shown in Table 2. The ratio of data sampled for training is 0.5 on the Trecvid, Washington and Cornell data sets, and 0.05 on News4Gv. Since MMH cannot handle high-dimensional data such as News4Gv, its result is missing for News4Gv in Table 2. In the online learning experiments, we use the same training/testing splits as in the batch learning experiments; we sample 0.1 of the training data for initial batch training, and the rest arrive one by one. Since OPMV can only deal with two-view data, its result is missing for News4Gv in Table 3.

From Table 2 and Table 3, we have the following observations:

- Our BM2SMVL achieves the best performance on the Trecvid, Washington and Cornell data sets and performs only slightly worse than SVM-FULL on News4Gv. We attribute this to the fact that our method can automatically infer the penalty parameter of the max-margin model based on the data augmentation idea, while MVMED and MMH are both built on the maximum entropy discrimination framework and cannot infer the penalty parameter. SVM-FULL makes full use of all the information from the observations by concatenating all views to form a new single view. This may be the reason why it performs better than our BM2SMVL on News4Gv.
However, on the other data sets some of the information in the observations is not helpful for classification, and in those cases SVM-FULL cannot achieve good performance.
- Our method infers a posterior under the Bayesian framework instead of a point estimate as in VMRML. With Bayesian model averaging over the posterior, we can make more robust predictions than VMRML.
- We also find that OBM2SMVL performs better than OPMV on all data sets and only slightly worse than BM2SMVL. Unlike OPMV, which seeks a point estimate by optimizing a deterministic objective function, our online model infers a posterior under the Bayesian framework. Point estimates can be seriously affected by inappropriate regularization, outliers and noise, especially when the training data arrive sequentially.

### 4.5 Sensitivity Analysis

We study the sensitivity of BM2SMVL and OBM2SMVL with respect to the subspace dimension $m$ and the regularization parameter $C$. When studying the influence of $m$, $C$ (batch) is set to 2 for BM2SMVL and $C$ (online) is set to 15 for OBM2SMVL. The averaged results are shown in Figure 1 (a) and Figure 2 (a). We find that the test accuracy increases as $m$ becomes larger, and once $m$ is large enough the test accuracy remains stable. When studying the influence of $C$, $m$ is set to 30 for both batch and online learning. From the results in Figure 1 (b) and Figure 2 (b), we find that different data sets may prefer different values of $C$. In batch learning, $C$ (batch) balances the classification model and the subspace learning model, so our model cannot achieve the best performance when $C$ (batch) is too large or too small. $C$ (online) reflects the importance of the newly arriving data in our online model. When $C$ (online) is too small, the new data play a tiny role in the online model and offer little help in improving its performance. For some data sets such as Cornell, when $C$ (online) is too large the performance of OBM2SMVL degrades, because the online model does not take full advantage of the historical knowledge. For some other data sets such as Trecvid and Washington, the results are less sensitive to $C$ (online) once it is large enough.

Figure 1: (a) Results on different data sets with different subspace dimensions m in BM2SMVL; (b) results on different data sets with different regularization parameters C in BM2SMVL.

Figure 2: (a) Results on different data sets with different subspace dimensions m (online) in OBM2SMVL; (b) results on different data sets with different regularization parameters C (online) in OBM2SMVL.

## 5 Conclusion

We propose an online Bayesian method to learn a predictive subspace for multi-view data. Specifically, the proposed method is based on the data augmentation idea for max-margin learning, which allows us to automatically infer the weights and the penalty parameter and to find the most appropriate predictive subspace simultaneously under the Bayesian framework.
Experiments on various classification tasks show that both our batch model BM2SMVL and our online model OBM2SMVL achieve superior performance compared with a number of state-of-the-art competitors.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 9154610306, 61573335, 61473273, 61473274), the National High-tech R&D Program of China (863 Program) (No. 2014AA015105), and the Guangdong provincial science and technology plan projects (No. 2015B010109005).

## References

[Beal, 2003] Matthew James Beal. Variational algorithms for approximate Bayesian inference. University College London, 2003.

[Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100, 1998.

[Cesa-Bianchi and Lugosi, 2006] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[Chechik et al., 2010] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. The Journal of Machine Learning Research, 11:1109-1135, 2010.

[Chen et al., 2012] Ning Chen, Jun Zhu, Fuchun Sun, and Eric P. Xing. Large-margin predictive latent subspace learning for multiview data analysis. Pattern Analysis and Machine Intelligence, 34(12):2365-2378, 2012.

[Chiang et al., 2008] David Chiang, Yuval Marton, and Philip Resnik. Online large-margin training of syntactic and structural translation features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 224-233, 2008.

[Crammer et al., 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551-585, 2006.

[Du et al., 2015] Changying Du, Shandian Zhe, Fuzhen Zhuang, Yuan Qi, Qing He, and Zhongzhi Shi. Bayesian maximum margin principal component analysis. In The Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[Gilks, 2005] Walter R. Gilks. Markov chain Monte Carlo. Wiley Online Library, 2005.

[Gönen and Alpaydın, 2011] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211-2268, 2011.

[Grangier and Bengio, 2008] David Grangier and Samy Bengio. A discriminative kernel-based approach to rank images from text queries. Pattern Analysis and Machine Intelligence, 30(8):1371-1384, 2008.

[Guo and Xiao, 2012] Yuhong Guo and Min Xiao. Cross language text classification via subspace co-regularized multi-view learning. Computer Science - Computation and Language, 2012.

[Hardoon et al., 2004] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664, 2004.

[Hazan et al., 2007] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

[Long et al., 2008] Bo Long, Philip S. Yu, and Zhongfei Zhang. A general model for multiple view unsupervised learning. In SIAM International Conference on Data Mining, pages 822-833, 2008.

[Polson and Scott, 2011] Nicholas G. Polson and Steven L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):43-47, 2011.

[Quang et al., 2013] Minh H. Quang, Loris Bazzani, and Vittorio Murino.
A unifying framework for vector-valued manifold regularization and multi-view learning. In International Conference on Machine Learning, pages 100-108, 2013.

[Reents and Urbanczik, 1998] G. Reents and R. Urbanczik. Self-averaging and on-line learning. Physical Review Letters, 80(24):5448, 1998.

[Rosenblatt, 1958] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[Sharma et al., 2012] Abhishek Sharma, Abhishek Kumar, Hal Daumé III, and David W. Jacobs. Generalized multiview analysis: A discriminative latent space. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2160-2167, 2012.

[Shi and Zhu, 2013] Tianlin Shi and Jun Zhu. Online Bayesian passive-aggressive learning. In International Conference on Machine Learning, pages 378-386, 2013.

[Sun and Chao, 2013] Shiliang Sun and Guoqing Chao. Multi-view maximum entropy discrimination. In International Joint Conference on Artificial Intelligence, pages 1706-1712, 2013.

[Yarowsky, 1995] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, 1995.

[Ye et al., 2012] Guangnan Ye, Dong Liu, I-Hong Jhuo, Shih-Fu Chang, et al. Robust late fusion with rank minimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3021-3028, 2012.

[Zhu et al., 2015] Yue Zhu, Wei Gao, and Zhi-Hua Zhou. One-pass multi-view learning. In Asian Conference on Machine Learning, pages 407-422, 2015.