# Deep Multi-View Concept Learning

Cai Xu, Ziyu Guan, Wei Zhao, Yunfei Niu, Quan Wang, Zhiheng Wang

State Key Lab of ISN, School of Computer Science and Technology, Xidian University
School of Computer Science and Technology, Xidian University
College of Computer Science and Technology, Henan Polytechnic University
{cxu3@stu., zyguan@, ywzhao@mail., yfniu@stu., qwang@}xidian.edu.cn, wzhenry@eyou.com

Abstract

Multi-view data is common in real-world datasets, where different views describe distinct perspectives. To better summarize the consistent and complementary information in multi-view data, researchers have proposed various multi-view representation learning algorithms, typically based on factorization models. However, most previous methods focused on shallow factorization models, which cannot capture complex hierarchical information. Although a deep multi-view factorization model has been proposed recently, it fails to explicitly discern consistent and complementary information in multi-view data and does not consider conceptual labels. In this work we present a semi-supervised deep multi-view factorization method, named Deep Multi-view Concept Learning (DMCL). DMCL performs nonnegative factorization of the data hierarchically, and tries to capture semantic structures and explicitly model consistent and complementary information in multi-view data at the highest abstraction level. We develop a block coordinate descent algorithm for DMCL. Experiments conducted on image and document datasets show that DMCL performs well and outperforms baseline methods.

1 Introduction

Multi-view data is prevalent in many real-world applications. For instance, the same news can be obtained from various language sources; an image can be described by different low-level visual features. These views often represent diverse and complementary information about the same data.
Integrating multiple views helps boost the performance of data mining tasks. We are concerned with representation learning by synthesizing multi-view data. In recent years, many multi-view representation learning algorithms were proposed based on different techniques (e.g. matrix factorization [Guan et al., 2015; Deng et al., 2015], transfer learning [Xu and Sun, 2012]).

Figure 1: Illustration of DMCL. It factorizes multi-view data iteratively to extract the high-level common encoding $\mathbf{V}_M$. Partial label information is leveraged to learn semantic structures, and structured sparseness constraints are used to model consistency and complementarity among different views.

As a particularly useful family of techniques in data analysis, matrix factorization is a successful representation learning technique in a variety of areas, e.g. recommendation [Wang et al., 2016a; 2016b] and image clustering [Trigeorgis et al., 2017]. Recently, Nonnegative Matrix Factorization (NMF), a specific form of matrix factorization, has received significant attention in multi-view representation learning [Zong et al., 2017; Guan et al., 2015; Liu et al., 2015] due to its intuitive parts-based interpretation [Lee and Seung, 2001]. Given a data matrix $\mathbf{X} \in \mathbb{R}^{D \times N}_+$ for $N$ items, NMF seeks two nonnegative matrices $\mathbf{U} \in \mathbb{R}^{D \times K}_+$ and $\mathbf{V} \in \mathbb{R}^{K \times N}_+$ such that $\mathbf{X} \approx \mathbf{U}\mathbf{V}$. $\mathbf{U}$/$\mathbf{V}$ is called the basis/encoding matrix. The Multi-view Concept Learning (MCL) algorithm proposed in [Guan et al., 2015] is a typical semi-supervised method which explicitly discerns consistent and complementary information in multi-view data to generate conceptual representations.
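As a concrete reference point for the factorization $\mathbf{X} \approx \mathbf{U}\mathbf{V}$, a minimal NMF with the classic multiplicative updates of [Lee and Seung, 2001] might look as follows (an illustrative sketch only, not the paper's model; the rank `K`, iteration count and seed are placeholders):

```python
import numpy as np

def nmf(X, K, iters=200, eps=1e-9, seed=0):
    """Plain NMF, X ~= U @ V, via Lee & Seung multiplicative updates.

    X : nonnegative (D, N) array; K : number of basis vectors.
    Returns nonnegative U (D, K) and V (K, N).
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    U = rng.random((D, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(iters):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update basis matrix
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update encoding matrix
    return U, V
```

Each such update does not increase the reconstruction error $\|\mathbf{X} - \mathbf{U}\mathbf{V}\|_F$, which is why multiplicative rules of this shape recur in NMF-based multi-view methods.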
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

However, a common drawback of the above methods is that they fail to capture complex hierarchical structures of real-world data. In order to learn a better representation by capturing the hierarchical structures, different deep matrix factorization techniques were proposed recently. Trigeorgis et al. [Trigeorgis et al., 2017] proposed the Deep Semi-NMF method for representation learning. It has an interpretation of clustering according to different attributes of a given dataset. Nevertheless, it only deals with the single-view case. Zhao et al. [Zhao et al., 2017] extended Deep Semi-NMF to the multi-view case. Although it is a multi-view deep factorization method, it neither considers label information of data nor explicitly models consistent and complementary information.

In this paper, we propose a new multi-view deep factorization method, named Deep Multi-view Concept Learning (DMCL). As shown in Figure 1, the deep model factorizes the data matrices ($\mathbf{X}^{(v)}$) iteratively to get the final representation $\mathbf{V}_M$. For each view, each layer is an NMF which takes the representation obtained from the previous layer as its data matrix. The final representation $\mathbf{V}_M$ is shared across different views. We impose graph embedding regularization [Yan et al., 2007] on $\mathbf{V}_M$ to capture data conceptual structures. We also require the basis matrices ($\{\mathbf{U}^{(v)}_M\}$) of the final layer to be sparse in terms of columns to explicitly model consistency and complementarity among different views.

The major contribution of this work is a novel semi-supervised deep NMF method for conceptual representation learning from multi-view data. Conceptual features can reflect semantic relationships between data items; they are connected to different views in a flexible fashion, i.e.
some conceptual features are described by all the views (consistency), while others may only be associated with some of the views (complementarity). We design the optimization problem to encourage $\mathbf{V}_M$ and $\{\mathbf{U}^{(v)}_M\}$ to comply with these properties. The second contribution is that we propose a block coordinate descent algorithm to optimize DMCL. We also design a pre-training scheme for DMCL. Thirdly, we empirically evaluate DMCL on two real-world datasets and show its superiority over state-of-the-art baseline methods.

2 Related Work

Our work falls into the area of multi-view representation learning, which is concerned with how to embed inputs from different views of the same set of data items into a new common latent space for better data representation. A recent survey of this area can be found in [Li et al., 2016]. One direction, stemming from Canonical Correlation Analysis (CCA) [Chaudhuri et al., 2009], is based on the principle of maximizing correlations in the common latent space. However, it is nontrivial to extend those methods to deal with more than two views. Another popular idea is to find a shared latent representation across different views. Many methods in this direction were based on (nonnegative) matrix factorization, e.g. [Zong et al., 2017; Liu et al., 2015; Guan et al., 2015]. However, researchers mainly focused on consistency among different views, while complementarity is rarely explicitly modeled. There were some works that explicitly considered complementarity. Guan et al. proposed an NMF-based flexible method where group sparseness constraints were imposed on basis matrices to learn flexible association patterns between encoding dimensions and views [Guan et al., 2015]. Nevertheless, traditional models are intrinsically shallow and may not handle intricate natural data well. Inspired by deep learning [Bengio and others, 2009], different deep models were proposed recently for multi-view representation learning.
Srivastava and Salakhutdinov [Srivastava and Salakhutdinov, 2012] proposed to learn joint representations of images and texts with Deep Boltzmann Machines. Ngiam et al. [Ngiam et al., 2011] explored extracting shared representations by training a bimodal deep autoencoder. Deep matrix factorization techniques, which factorize complex natural data into multiple levels of factors, can also increase representational and modeling power [Sharma et al., 2017; Trigeorgis et al., 2017]. Those methods can be viewed as a decoder network that produces a reconstruction $\mathbf{X} = g(\mathbf{V}_M)$. Compared with deep matrix factorization, deep neural networks are harder to optimize toward good solutions and lack interpretability. A closely related work is [Zhao et al., 2017], where multi-layer matrix factorization is performed for multi-view data clustering. Our DMCL differs from their method in that we not only incorporate label information but also explicitly learn consistency and complementarity among multiple views, trying to capture conceptual features hidden in the data.

3 The Method

Our DMCL is a deep extension of the MCL method. In this section, we review MCL briefly and then present DMCL, together with its optimization algorithm.

3.1 A Brief Review of MCL

We use $\mathbf{X}^{(v)} \in \mathbb{R}^{D_v \times N}_+$ to denote the $v$-th view of the data, where $D_v$ is the dimensionality of the $v$-th view. The dataset is described by $H$ views: $\{\mathbf{X}^{(v)}\}_{v=1}^H$. The basis matrix $\mathbf{U}^{(v)} \in \mathbb{R}^{D_v \times K}_+$ denotes the linear connection between $\mathbf{X}^{(v)}$ and $\mathbf{V}$, the common encoding matrix. The data matrix of each view is separated into labeled and unlabeled parts: $\mathbf{X}^{(v)} = [\mathbf{X}^{(v),l}\ \mathbf{X}^{(v),u}]$. Correspondingly, the encoding matrix becomes $\mathbf{V} = [\mathbf{V}^l\ \mathbf{V}^u]$. We use $N^l$/$N^u$ to denote the number of labeled/unlabeled items, respectively. The optimization problem of MCL is formulated as [Guan et al., 2015]

$$\min_{\{\mathbf{U}^{(v)}\}_{v=1}^H, \mathbf{V}}\ \frac{1}{2}\sum_{v=1}^H \big\|\mathbf{X}^{(v)} - \mathbf{U}^{(v)}\mathbf{V}\big\|_F^2 + \alpha\sum_{v=1}^H \big\|\mathbf{U}^{(v)}\big\|_{1,\infty} + \frac{\beta}{2}\Big(\mathrm{tr}\big[\mathbf{V}^l \mathbf{L}_a (\mathbf{V}^l)^T\big] - \mathrm{tr}\big[\mathbf{V}^l \mathbf{L}_p (\mathbf{V}^l)^T\big]\Big) + \gamma\big\|\mathbf{V}\big\|_{1,1}$$
$$\text{s.t.}\ U^{(v)}_{ik} \ge 0,\ 1 \ge V_{kj} \ge 0,\ \forall i, j, k, v. \quad (1)$$
The upper bound 1 on $V_{kj}$ is used to guarantee that the problem is well lower-bounded [Guan et al., 2015]. The first term is the reconstruction criterion. The second term contains the group sparseness constraints imposed on the basis matrix of each view. For view $v$, it is defined as

$$\big\|\mathbf{U}^{(v)}\big\|_{1,\infty} = \sum_{k=1}^{K} \max_{1 \le i \le D_v} \big|U^{(v)}_{ik}\big| \quad (2)$$

It means we encourage some basis vectors of $\mathbf{U}^{(v)}$ to be completely 0, so that the corresponding dimensions in $\mathbf{V}$ are not associated with this view. These dimensions could represent complementary information in multi-view data. The third term is the graph embedding criterion for regularizing $\mathbf{V}$, where $\mathrm{tr}(\cdot)$ denotes the matrix trace. $\mathbf{L}_a$ and $\mathbf{L}_p$ are the graph Laplacian matrices of the within-class affinity graph $G_a$ and the between-class penalty graph $G_p$, with weighted adjacency matrices defined as

$$W^a_{ij} = \begin{cases} \frac{1}{N^l_{c_i}} - \frac{1}{N^l}, & \text{if } c_i = c_j \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$W^p_{ij} = \begin{cases} \frac{1}{N^l}, & \text{if } c_i \neq c_j \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

where $c_i$ denotes the label of item $i$ and $N^l_{c_i}$ is the total number of labeled items with label $c_i$. The graph embedding term intrinsically forces within-class items to be near each other while keeping between-class items away from each other. With simple algebraic transformation, we have $\frac{1}{2}\sum_{i,j} W^a_{ij}\|\mathbf{v}^l_i - \mathbf{v}^l_j\|_2^2 = \mathrm{tr}[\mathbf{V}^l \mathbf{L}_a (\mathbf{V}^l)^T]$ and $\frac{1}{2}\sum_{i,j} W^p_{ij}\|\mathbf{v}^l_i - \mathbf{v}^l_j\|_2^2 = \mathrm{tr}[\mathbf{V}^l \mathbf{L}_p (\mathbf{V}^l)^T]$. The fourth term is a simple L1-norm regularizer on $\mathbf{V}$, since an item should not have too many conceptual features. MCL not only tries to capture semantic structures of the data through semi-supervision, but also models consistency and complementarity among different views via group sparsity constraints. However, the shallow computation (i.e. one-step factorization) in MCL may not be able to handle complex real-world data well.
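To make the two regularizers concrete, the sketch below evaluates the group-sparseness norm of Eq. (2) and builds $\mathbf{W}^a$, $\mathbf{W}^p$ and their Laplacians from a label vector (NumPy only; the between-class condition for $\mathbf{W}^p$ follows the penalty-graph description above, an assumption where the source is ambiguous):

```python
import numpy as np

def group_sparseness(U):
    """||U||_{1,inf} of Eq. (2): sum over columns of the max absolute entry."""
    return np.abs(U).max(axis=0).sum()

def label_graphs(labels):
    """Within-class affinity W_a (Eq. (3)), between-class penalty W_p (Eq. (4)),
    and their graph Laplacians L = D - W, for the labeled items."""
    labels = np.asarray(labels)
    Nl = labels.size
    counts = np.array([(labels == c).sum() for c in labels])  # N^l_{c_i} per item
    same = labels[:, None] == labels[None, :]
    Wa = np.where(same, 1.0 / counts[:, None] - 1.0 / Nl, 0.0)
    Wp = np.where(same, 0.0, 1.0 / Nl)
    La = np.diag(Wa.sum(axis=1)) - Wa
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    return Wa, Wp, La, Lp
```

With these in hand, the graph embedding penalty is just `np.trace(Vl @ La @ Vl.T) - np.trace(Vl @ Lp @ Vl.T)` for a labeled encoding `Vl`.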
3.2 Deep Multi-view Concept Learning

In order to obtain a more expressive representation, DMCL decomposes each of the data matrices $\{\mathbf{X}^{(v)}\}_{v=1}^H$ iteratively to obtain the high-level representation:

$$\mathbf{X}^{(v)} \approx \mathbf{U}^{(v)}_1 \mathbf{U}^{(v)}_2 \cdots \mathbf{U}^{(v)}_M \mathbf{V}_M \quad (5)$$

where $\mathbf{U}^{(v)}_1 \in \mathbb{R}^{D_v \times p_1}_+, \ldots, \mathbf{U}^{(v)}_m \in \mathbb{R}^{p_{m-1} \times p_m}_+, \ldots, \mathbf{U}^{(v)}_M \in \mathbb{R}^{p_{M-1} \times p_M}_+$ denote the $M$ basis matrices and $\mathbf{V}_M \in \mathbb{R}^{p_M \times N}_+$ denotes the final common encoding. The optimization problem of DMCL is

$$\min_{\{\mathbf{U}^{(v)}_m\}, \mathbf{V}_M} \sum_{v=1}^H \frac{1}{2}\big\|\mathbf{X}^{(v)} - \mathbf{U}^{(v)}_1 \mathbf{U}^{(v)}_2 \cdots \mathbf{U}^{(v)}_M \mathbf{V}_M\big\|_F^2 + \frac{\beta}{2}\Big\{\mathrm{tr}\big[\mathbf{V}^l_M \mathbf{L}_a (\mathbf{V}^l_M)^T\big] - \mathrm{tr}\big[\mathbf{V}^l_M \mathbf{L}_p (\mathbf{V}^l_M)^T\big]\Big\} + \alpha\sum_{v=1}^H \big\|\mathbf{U}^{(v)}_M\big\|_{1,\infty} + \gamma\big\|\mathbf{V}_M\big\|_{1,1}$$
$$\text{s.t.}\ (U^{(v)}_m)_{ik} \ge 0,\ 1 \ge (V_M)_{kj} \ge 0,\ \forall i, j, k, v, m. \quad (6)$$

Here we only apply the graph embedding constraints and the encoding sparseness constraint to $\mathbf{V}_M$, since the intermediate encodings are near low-level features and would not represent high-level conceptual features well. The group sparseness constraints used to learn the structures of consistency and complementarity are only imposed on $\{\mathbf{U}^{(v)}_M\}$, since they are only meaningful where a common encoding ($\mathbf{V}_M$) is reached. Note that although the overall factorization is equivalent to a linear operation, the multi-layer computation can still help to better represent data items hierarchically by seeking a better local optimum [Zhao et al., 2017].

3.3 Optimization

Problem (6) is not jointly convex in $\{\mathbf{U}^{(v)}_m\}$ and $\mathbf{V}_M$. Therefore, we can only find its local minima. To improve the quality of the solution and speed up learning, we initialize the model parameters using unsupervised greedy pre-training, similar to layer-wise pre-training in deep learning [Hinton and Salakhutdinov, 2006]. Specifically, for each view $v$ we first decompose $\mathbf{X}^{(v)}$ as $\mathbf{X}^{(v)} \approx \mathbf{U}^{(v)}_1 \mathbf{V}^{(v)}_1$ using NMF. Then, we treat the learned $\mathbf{V}^{(v)}_1$ as the data matrix for layer 2 and continue to factorize it iteratively until the final layer. An exception is $\mathbf{V}_M$. For layer $M$, we obtain a set of encoding matrices $\{\mathbf{V}^{(v)}_M\}$ from the above initialization scheme.
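The greedy pre-training just described might be sketched as follows (an illustrative reading of this scheme using plain multiplicative-update NMF per layer; the shared top-level encoding is left to random initialization, for the reason discussed next, and iteration counts and seeds are placeholders):

```python
import numpy as np

def pretrain(views, layer_sizes, iters=200, eps=1e-9, seed=0):
    """Greedy layer-wise initialization: for each view, factorize the previous
    layer's encoding with NMF; the shared V_M is initialized randomly."""
    rng = np.random.default_rng(seed)
    Us = []                          # Us[v][m] holds U^{(v)}_{m+1}
    for X in views:
        mats, data = [], X
        for p in layer_sizes:
            D, N = data.shape
            U = rng.random((D, p)) + eps
            V = rng.random((p, N)) + eps
            for _ in range(iters):   # multiplicative NMF updates
                U *= (data @ V.T) / (U @ V @ V.T + eps)
                V *= (U.T @ data) / (U.T @ U @ V + eps)
            mats.append(U)
            data = V                 # the encoding becomes the next layer's data
        Us.append(mats)
    p_M, N = layer_sizes[-1], views[0].shape[1]
    V_M = rng.random((p_M, N))       # shared top-level encoding: random init
    return Us, V_M
```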
However, it is difficult to use them to initialize $\mathbf{V}_M$, since elements in the same position of the encoding vectors for an item may represent different meanings in different views, and so they are not comparable. Hence, we choose to initialize $\mathbf{V}_M$ randomly. Preliminary experiments also confirmed the effectiveness of this choice.

Afterwards, the variables of (6) are separated into three groups: $\{\mathbf{U}^{(v)}_m\}_{m \neq M}$, $\{\mathbf{U}^{(v)}_M\}_{v=1}^H$ and $\mathbf{V}_M$. Problem (6) is convex in one group when the other two are fixed. Therefore, we solve DMCL by block coordinate descent [Lin, 2007], which each time optimizes one group of variables while keeping the other groups fixed. The procedure is depicted in Algorithm 1. At line 2, we have $\mathbf{V}^{(v)}_0 := \mathbf{X}^{(v)}$.

Algorithm 1: Optimization of DMCL
Input: $\{\mathbf{X}^{(v)}\}_{v=1}^H$; $\alpha$; $\beta$; $\gamma$; $\mathbf{L}_a$; $\mathbf{L}_p$; layer sizes $\{p_m\}$
Output: $\{\mathbf{U}^{(v)}_m\}, \forall v, m$; $\mathbf{V}_M$
1: for $m = 1$ to $M$, $v = 1$ to $H$ do
2:   $(\mathbf{U}^{(v)}_m, \mathbf{V}^{(v)}_m) \leftarrow \mathrm{NMF}(\mathbf{V}^{(v)}_{m-1}, p_m)$
3: end for
4: Randomly initialize $\mathbf{V}_M$
5: repeat
6:   Fix other variables, optimize (7) w.r.t. $\mathbf{V}_M$
7:   Fix other variables, optimize (15) w.r.t. $\{\mathbf{U}^{(v)}_m\}_{m \neq M}$
8:   Fix other variables, optimize (18) w.r.t. $\{\mathbf{U}^{(v)}_M\}_{v=1}^H$
9: until convergence or max no. of iterations reached

Next, we describe the detailed ideas for addressing the three subproblems.

Updating $\mathbf{V}_M$. Since $\mathbf{V}_M$ is randomly initialized, we optimize it first. The subproblem for $\mathbf{V}_M$ is:

$$\min_{\mathbf{V}_M} \psi(\mathbf{V}_M) := \frac{1}{2}\sum_{v=1}^H \big\|\mathbf{X}^{(v)} - \widetilde{\mathbf{U}}^{(v)}_M \mathbf{V}_M\big\|_F^2 + \gamma\big\|\mathbf{V}_M\big\|_{1,1} + \frac{\beta}{2}\Big(\mathrm{tr}\big[\mathbf{V}^l_M \mathbf{L}_a (\mathbf{V}^l_M)^T\big] - \mathrm{tr}\big[\mathbf{V}^l_M \mathbf{L}_p (\mathbf{V}^l_M)^T\big]\Big)$$
$$\text{s.t.}\ 1 \ge (V_M)_{kj} \ge 0,\ \forall k, j. \quad (7)$$

where $\widetilde{\mathbf{U}}^{(v)}_M = \prod_{m=1}^M \mathbf{U}^{(v)}_m$. We can decompose (7) into two subproblems in which the variables are $\mathbf{V}^l_M$ and $\mathbf{V}^u_M$, the labeled and unlabeled parts of $\mathbf{V}_M$, respectively. The update rules can be derived similarly as in [Guan et al., 2015]. Here we only give the equations due to space limitation:

$$(V^l_M)_{kj} \leftarrow \min\left\{1,\ \frac{-B_{kj} + \sqrt{B_{kj}^2 + 4A_{kj}C_{kj}}}{2A_{kj}}\, (V^l_M)_{kj}\right\} \quad (8)$$

$$(V^u_M)_{kj} \leftarrow \min\left\{1,\ \frac{-(\gamma - Q^u_{kj}) + |\gamma - Q^u_{kj}|}{2(\mathbf{P}\mathbf{v}^u_j)_k}\, (V^u_M)_{kj}\right\} \quad (9)$$

where $\mathbf{P}$, $\mathbf{Q}^u$, $A_{kj}$, $B_{kj}$ and $C_{kj}$ are

$$\mathbf{P} = \sum_{v=1}^H (\widetilde{\mathbf{U}}^{(v)}_M)^T \widetilde{\mathbf{U}}^{(v)}_M \quad (10)$$
$$\mathbf{Q}^u = \sum_{v=1}^H (\widetilde{\mathbf{U}}^{(v)}_M)^T \mathbf{X}^{(v),u} \quad (11)$$
$$A_{kj} = (\mathbf{P}\mathbf{v}^l_j)_k + \beta\big((\mathbf{D}^a + \mathbf{W}^p)\,\mathbf{v}^l_k\big)_j \quad (12)$$
$$B_{kj} = \gamma - Q^l_{kj} \quad (13)$$
$$C_{kj} = \beta\big((\mathbf{D}^p + \mathbf{W}^a)\,\mathbf{v}^l_k\big)_j \quad (14)$$

Here $\mathbf{v}^l_j$ and $\mathbf{v}^l_k$ denote the $j$-th column vector and the $k$-th row vector of $\mathbf{V}^l_M$, respectively, and $\mathbf{Q}^l$ is defined analogously to $\mathbf{Q}^u$ with $\mathbf{X}^{(v),l}$. $\mathbf{D}^a$ and $\mathbf{D}^p$ are diagonal matrices with $D^a_{ii} = \sum_{j=1}^{N^l} W^a_{ij}$ and $D^p_{ii} = \sum_{j=1}^{N^l} W^p_{ij}$.

Updating $\{\mathbf{U}^{(v)}_m\}_{m \neq M}$. Fixing the other variables, we get the subproblem for $\{\mathbf{U}^{(v)}_m\}$:

$$\min_{\{\mathbf{U}^{(v)}_m\}_{m \neq M}} \Gamma = \frac{1}{2}\sum_{v=1}^H \big\|\mathbf{X}^{(v)} - \mathbf{U}^{(v)}_1 \mathbf{U}^{(v)}_2 \cdots \mathbf{U}^{(v)}_M \mathbf{V}_M\big\|_F^2$$
$$\text{s.t.}\ (U^{(v)}_m)_{ik} \ge 0,\ \forall i, k, v;\ m = 1, \ldots, M-1. \quad (15)$$

For $m = 1$, the update is similar to that in NMF. Otherwise, the gradient w.r.t. $\mathbf{U}^{(v)}_m$ is derived as:

$$\frac{\partial \Gamma}{\partial \mathbf{U}^{(v)}_m} = (\boldsymbol{\xi}^{(v)}_m)^T \boldsymbol{\xi}^{(v)}_m \mathbf{U}^{(v)}_m \boldsymbol{\Phi}^{(v)}_m (\boldsymbol{\Phi}^{(v)}_m)^T - (\boldsymbol{\xi}^{(v)}_m)^T \mathbf{X}^{(v)} (\boldsymbol{\Phi}^{(v)}_m)^T \quad (16)$$

where $\boldsymbol{\xi}^{(v)}_m = \prod_{i=1}^{m-1} \mathbf{U}^{(v)}_i$ and $\boldsymbol{\Phi}^{(v)}_m = \prod_{i=m+1}^{M} \mathbf{U}^{(v)}_i \cdot \mathbf{V}_M$. Thus the additive update for $\mathbf{U}^{(v)}_m$ can be given as $(U^{(v)}_m)_{ik} \leftarrow (U^{(v)}_m)_{ik} - \eta\, (\partial\Gamma/\partial\mathbf{U}^{(v)}_m)_{ik}$. Inspired by [Lee and Seung, 2001], to obtain a multiplicative update rule, $\eta$ is set as $(U^{(v)}_m)_{ik} / \big[(\boldsymbol{\xi}^{(v)}_m)^T \boldsymbol{\xi}^{(v)}_m \mathbf{U}^{(v)}_m \boldsymbol{\Phi}^{(v)}_m (\boldsymbol{\Phi}^{(v)}_m)^T\big]_{ik}$, with which convergence is guaranteed. Correspondingly, the multiplicative update rule is:

$$(U^{(v)}_m)_{ik} \leftarrow (U^{(v)}_m)_{ik}\, \frac{\big[(\boldsymbol{\xi}^{(v)}_m)^T \mathbf{X}^{(v)} (\boldsymbol{\Phi}^{(v)}_m)^T\big]_{ik}}{\big[(\boldsymbol{\xi}^{(v)}_m)^T \boldsymbol{\xi}^{(v)}_m \mathbf{U}^{(v)}_m \boldsymbol{\Phi}^{(v)}_m (\boldsymbol{\Phi}^{(v)}_m)^T\big]_{ik}} \quad (17)$$

Updating $\{\mathbf{U}^{(v)}_M\}_{v=1}^H$. When the other variables are fixed, the $\mathbf{U}^{(v)}_M$ of different views are independent of each other and their subproblems are identical in form. For clarity, we focus on one view and omit the superscript $(v)$ temporarily:

$$\min_{\mathbf{U}_M} \phi(\mathbf{U}_M) := \frac{1}{2}\big\|\mathbf{X} - \widetilde{\mathbf{U}}_{M-1} \mathbf{U}_M \mathbf{V}_M\big\|_F^2 + \alpha\big\|\mathbf{U}_M\big\|_{1,\infty} \quad \text{s.t.}\ (U_M)_{ik} \ge 0,\ \forall i, k. \quad (18)$$

where $\widetilde{\mathbf{U}}_{M-1} = \prod_{m=1}^{M-1} \mathbf{U}_m$. $\phi(\mathbf{U}_M)$ is the sum of a differentiable function and a general closed convex function. It can be solved using composite gradient mapping [Nesterov, 2013], which was proposed for minimizing such composite functions.
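Before turning to the $\mathbf{U}_M$ subproblem in detail, note that the multiplicative rule (17) for the intermediate layers translates directly into code. The sketch below updates one $\mathbf{U}_m$ of a single view, with 0-based indexing for `m` and a small `eps` guarding the division (an implementation detail, not from the paper):

```python
import numpy as np

def update_intermediate_U(X, Us, V_M, m, eps=1e-9):
    """One multiplicative step in the spirit of Eq. (17) for U_m (0 <= m < M):
    U_m <- U_m * (xi^T X Phi^T) / (xi^T xi U_m Phi Phi^T),
    where xi = U_0 ... U_{m-1} and Phi = U_{m+1} ... U_{M-1} V_M."""
    xi = np.eye(X.shape[0])          # empty product for the first layer
    for Ui in Us[:m]:
        xi = xi @ Ui
    Phi = V_M
    for Ui in reversed(Us[m + 1:]):
        Phi = Ui @ Phi
    U = Us[m]
    num = xi.T @ X @ Phi.T
    den = xi.T @ xi @ U @ Phi @ Phi.T + eps
    return U * (num / den)
```

As with classic NMF multiplicative updates, such a step does not increase the reconstruction error of the view it is applied to.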
The central idea is to iteratively minimize an auxiliary function $m_L(\mathbf{U}_M)$ and adjust the guess of the Lipschitz constant of the first term of $\phi(\mathbf{U}_M)$, so that $\phi(\mathbf{U}_M)$ decreases at the minimizer of $m_L(\mathbf{U}_M)$. Denote $f(\mathbf{U}_M) = \frac{1}{2}\|\mathbf{X} - \widetilde{\mathbf{U}}_{M-1}\mathbf{U}_M\mathbf{V}_M\|_F^2$ and let $\mathbf{U}^t_M$ be the value of $\mathbf{U}_M$ at the $t$-th iteration. The auxiliary function is set as

$$m_L(\mathbf{U}^t_M; \mathbf{U}_M) = f(\mathbf{U}^t_M) + \mathrm{tr}\big[\nabla f(\mathbf{U}^t_M)^T (\mathbf{U}_M - \mathbf{U}^t_M)\big] + \frac{L}{2}\big\|\mathbf{U}_M - \mathbf{U}^t_M\big\|_F^2 + \alpha\big\|\mathbf{U}_M\big\|_{1,\infty} \quad (19)$$

where $L$ is the guess of $L_f$, the Lipschitz constant of $\nabla f(\mathbf{U}_M)$, and $\nabla f(\mathbf{U}^t_M)$ is the gradient of $f(\mathbf{U}_M)$ at $\mathbf{U}^t_M$:

$$\nabla f(\mathbf{U}^t_M) = \widetilde{\mathbf{U}}_{M-1}^T \widetilde{\mathbf{U}}_{M-1} \mathbf{U}^t_M \mathbf{V}_M \mathbf{V}_M^T - \widetilde{\mathbf{U}}_{M-1}^T \mathbf{X} \mathbf{V}_M^T$$

Then we minimize (19) to get a candidate for $\mathbf{U}^{t+1}_M$:

$$\widehat{\mathbf{U}}^{t+1}_M = \arg\min_{\mathbf{U}_M \ge 0}\ m_L(\mathbf{U}^t_M; \mathbf{U}_M) \quad (20)$$

We have $\phi(\mathbf{U}^t_M) = m_L(\mathbf{U}^t_M; \mathbf{U}^t_M) \ge m_L(\mathbf{U}^t_M; \widehat{\mathbf{U}}^{t+1}_M)$ from (20). Furthermore, it has been proved that for $L \ge L_f$ we have $m_L(\mathbf{U}^t_M; \widehat{\mathbf{U}}^{t+1}_M) \ge \phi(\widehat{\mathbf{U}}^{t+1}_M)$ [Nesterov, 2013], leading to an acceptable $\widehat{\mathbf{U}}^{t+1}_M$. Meanwhile, $L$ is inversely related to the step size (i.e. $\|\widehat{\mathbf{U}}^{t+1}_M - \mathbf{U}^t_M\|$), so it should not be set to too large a value. Now the problem is to find a suitable $L$ in each iteration. We start with an estimate $L_0$ such that $0 < L_0 \le L_f$, and in each iteration adjust $L$ until $\phi(\widehat{\mathbf{U}}^{t+1}_M) \le m_L(\mathbf{U}^t_M; \widehat{\mathbf{U}}^{t+1}_M)$. The algorithm for optimizing $\mathbf{U}_M$ is shown in Algorithm 2.

Algorithm 2: Composite Gradient Mapping
Input: $\eta_u > 1$, $\eta_d > 1$: scaling parameters for $L$
1: Initialize $\mathbf{U}^0_M$ and $t = 0$
2: Initialize $L_0$: $0 < L_0 \le L_f$, and set $L_t = L_0$
3: repeat
4:   $L = L_t$
5:   repeat
6:     Optimize (20) to get $\widehat{\mathbf{U}}^{t+1}_M$
7:     if $\phi(\widehat{\mathbf{U}}^{t+1}_M) > m_L(\mathbf{U}^t_M; \widehat{\mathbf{U}}^{t+1}_M)$ then $L = \eta_u L$
8:   until $\phi(\widehat{\mathbf{U}}^{t+1}_M) \le m_L(\mathbf{U}^t_M; \widehat{\mathbf{U}}^{t+1}_M)$
9:   $\mathbf{U}^{t+1}_M = \widehat{\mathbf{U}}^{t+1}_M$
10:  $L_{t+1} = \max(L_0, L/\eta_d)$
11:  $t = t + 1$
12: until convergence

The remaining problem is how to optimize (20) efficiently. The solution is similar to that in [Guan et al., 2015]; we omit the details due to space limitation.

3.4 Time Complexity

The DMCL model is composed of pre-training and fine-tuning stages, so we analyze them separately.
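Returning to Algorithm 2, one inner backtracking step can be sketched as follows. For readability, an entrywise L1 penalty stands in for the $L_{1,\infty}$ group norm of the paper (whose proximal step is more involved and is handled as in [Guan et al., 2015]); everything else mirrors the majorize-then-check logic above:

```python
import numpy as np

def phi(U, A, X, V, alpha):
    """phi(U) = 0.5||X - A U V||_F^2 + alpha*||U||_1 (L1 stand-in for L1,inf)."""
    return 0.5 * np.linalg.norm(X - A @ U @ V) ** 2 + alpha * np.abs(U).sum()

def composite_step(U, A, X, V, alpha, L=1.0, eta_u=2.0):
    """One backtracking step of composite gradient mapping: increase the
    Lipschitz guess L until the majorizer m_L upper-bounds phi at the prox
    point, then accept that point."""
    f0 = 0.5 * np.linalg.norm(X - A @ U @ V) ** 2
    G = A.T @ (A @ U @ V - X) @ V.T             # gradient of the smooth part
    while True:
        # prox step: gradient move, soft-threshold by alpha/L, clip at 0
        U_new = np.maximum(U - (G + alpha) / L, 0.0)
        m_L = (f0 + np.sum(G * (U_new - U))
               + 0.5 * L * np.linalg.norm(U_new - U) ** 2
               + alpha * np.abs(U_new).sum())
        if phi(U_new, A, X, V, alpha) <= m_L + 1e-12:
            return U_new, L
        L *= eta_u                              # Lipschitz guess was too small
```

Because the prox point minimizes the majorizer and the accepted majorizer dominates the objective, each accepted step cannot increase phi.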
For simplicity, we set the input feature dimensionalities of all views to $D$ and the sizes of all layers to $p$. We use $\tau$ to denote the number of iterations for all iterative procedures. The computational cost of the pre-training stage is $O(\tau H(DNp + MNp^2))$. For fine-tuning, the time complexity consists of three parts. For optimizing $\{\mathbf{U}^{(v)}_m\}_{m \neq M}$, the cost is $O(\tau HM(Dp^2 + Np^2 + DNp + Mp^3))$. For optimizing $\{\mathbf{U}^{(v)}_M\}_{v=1}^H$, we need to run Algorithm 2; the cost is $O(HD((2(\tau+1) + \log_2\frac{L_f}{L_0})p + \tau p^2 + Mp^3 + Dp^2))$, where $\tau$ here denotes the number of iterations of the outer loop of Algorithm 2. For $\mathbf{V}_M$, the time complexity is $O(\tau(Np + Np^2 + (N^l)^2 p) + H(Mp^3 + Dp^2))$. The time cost of DMCL is linear in the feature dimension $D$, the item number $N$ and the view number $H$. However, it is not linear in the layer number $M$ and the layer size $p$. Therefore, it is essential to set the layer number and layer sizes of DMCL properly for high performance and low time consumption.

4 Experiments

We evaluate the performance of DMCL on document and image datasets in terms of classification and clustering metrics. Important statistics are summarized in Table 1 and a brief introduction of the datasets is presented below.

Reuters [Amini et al., 2009]. It consists of 111,740 documents written in five languages, from 6 categories, represented as TF-IDF vectors. We utilize the documents written in English and Italian as two views. For each category, we randomly choose 200 documents; in total, 1,200 documents are used.

ImageNet [Deng et al., 2009]. It is a well-known real-world image database that contains roughly 15 million images organized according to the WordNet hierarchy. We randomly select 50 leaf synsets in the hierarchy as categories and sample 240 images from each one. Three kinds of features are extracted as different views, i.e., a 64-D HSV histogram, 512-D GIST descriptors, and a 1000-D bag of SIFT visual words.
We compare DMCL with the following baseline algorithms. Deep NMF (DNMF) [Song et al., 2015] is a deep matrix factorization method for single-view data; we apply it to each view and report the best performance. Concatenation DNMF (Concat DNMF) concatenates the feature vectors of different views and then applies DNMF. Deep Multi-view Semi-NMF (DMSNMF) [Zhao et al., 2017] is an unsupervised matrix factorization method synthesizing multi-view data to capture a uniform representation. Multi-view Concept Learning (MCL) [Guan et al., 2015] is a semi-supervised shallow method for multi-view data.

We use the holdout method [Han et al., 2011] for evaluation and tune model parameters by cross-validation on the training set. For each dataset, we randomly split the data items of each category and use 50% for training, while the remaining 50% are reserved for testing. We use the learned representations of these methods for classification and clustering. Note that the label information of the training set is utilized in the semi-supervised methods, i.e., MCL and DMCL. For classification, the training items are fed into the kNN classifier (k = 9) and Accuracy is calculated on the test set. For clustering, k-means is applied to the test set with k set to the actual number of classes. Accuracy and Normalized Mutual Information (NMI) are used to evaluate clustering performance [Trigeorgis et al., 2017]. To account for runtime randomness in evaluation, we run each test case 10 times and report the averaged performance and standard deviation.

Table 1: Dataset summary.

| Dataset  | Size  | # of categories | Dimensionality |
|----------|-------|-----------------|----------------|
| Reuters  | 1200  | 6               | 21536/15506    |
| ImageNet | 12000 | 50              | 64/512/1000    |

4.1 Results

Table 2 and Figure 2 show the classification and clustering performance of DMCL and the baseline methods.

Figure 2: Clustering performance of different methods. Error bars represent standard deviations.

First, DNMF is the worst; it is outperformed by all the other methods. This reveals the importance of using multiple views. Second, semi-supervised methods outperform unsupervised ones. This is intuitive, since by exploiting the partial label information we can build a more discriminative representation. Third, our proposed DMCL consistently outperforms MCL on the two datasets, which indicates that the deep model can generate a better representation by hierarchical modeling. We use the t-test with significance level 0.05 to test the significance of performance differences. Results show that DMCL significantly outperforms all the baseline methods.

4.2 Analysis

In this subsection, we analyze DMCL from two perspectives, i.e., parameter setting and convergence.

Parameter analysis. The parameters of DMCL include $\alpha$, $\beta$, $\gamma$, and the layer number and sizes. Here we explore their impact on performance by cross-validation on the training set and only show representative results due to space limitation. We first focus on the former three parameters. $\alpha$ and $\gamma$ control the degree of sparseness, while $\beta$ controls the impact of the graph embedding regularization. We vary one parameter each time and fix the other two to explore its influence.

Figure 3: Parameter study for DMCL by Accuracy and NMI on the Reuters dataset: (a) varying $\alpha$ when $\beta = 0.015$, $\gamma = 0.005$; (b) varying $\beta$ when $\alpha = 100$, $\gamma = 0.005$; (c) varying $\gamma$ when $\alpha = 100$, $\beta = 0.015$.
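A minimal sketch of this one-at-a-time sweep, scored with the evaluation metrics of Section 4, could look as follows; scikit-learn is assumed for the classifier, k-means and NMI, and the `fit_dmcl` call in the comment is a hypothetical stand-in for training the model at one parameter setting:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

def evaluate_representation(train_X, train_y, test_X, test_y, n_classes, seed=0):
    """Protocol sketch: 9-NN classification accuracy on the test split, and
    k-means clustering on the test split scored by NMI."""
    knn = KNeighborsClassifier(n_neighbors=9).fit(train_X, train_y)
    clf_acc = accuracy_score(test_y, knn.predict(test_X))
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(test_X)
    nmi = normalized_mutual_info_score(test_y, km.labels_)
    return clf_acc, nmi

# One-at-a-time sweep over alpha with beta, gamma fixed (fit_dmcl hypothetical):
# for alpha in [0, 50, 100, 200, 300]:
#     tr_X, tr_y, te_X, te_y = fit_dmcl(views, labels, alpha=alpha,
#                                       beta=0.015, gamma=0.005)
#     print(alpha, evaluate_representation(tr_X, tr_y, te_X, te_y, n_classes=6))
```

The clustering Accuracy reported in the paper additionally requires a best-map (Hungarian) matching between cluster indices and class labels, which is omitted here.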
Table 2: Classification performance on different datasets (accuracy $\pm$ std dev, %).

| Method      | ImageNet         | Reuters          |
|-------------|------------------|------------------|
| DNMF        | 17.66 $\pm$ 0.75 | 59.75 $\pm$ 1.65 |
| Concat DNMF | 25.28 $\pm$ 0.81 | 61.25 $\pm$ 1.93 |
| DMSNMF      | 27.39 $\pm$ 0.76 | 64.53 $\pm$ 1.72 |
| MCL         | 30.31 $\pm$ 0.62 | 70.85 $\pm$ 1.53 |
| DMCL        | 32.41 $\pm$ 0.67 | 73.17 $\pm$ 1.61 |

Fig. 3 shows the results on Reuters. We find a general pattern: the performance curves first go up and then go down as the parameters increase. This means the sparseness and graph embedding terms are useful for learning good representations. Based on the results, we set $\alpha = 100$, $\beta = 0.015$ and $\gamma = 0.005$ in the other experiments.

Regarding layer sizes, previous work on multi-view deep factorization [Zhao et al., 2017] has found that $p_M$, the size of the final layer, usually plays a more important role than the sizes of the other layers. Hence, we vary $p_M$ under different layer numbers and fix the sizes of the other layers empirically. Fig. 4 shows the NMI results on ImageNet under three layer-number settings. The full settings are $\{[200\ p_M], [300\ 200\ p_M], [500\ 300\ 200\ p_M]\}$. As can be seen, the performance increases with the layer number, which indicates that deep factorization really helps to find better representations. Considering both performance and efficiency, we choose 3 layers for all the experiments. The layer sizes are set to $[300\ 200\ 125]$.

Figure 4: Clustering performance of DMCL with three different layer-number settings on the ImageNet dataset.

Figure 5: Convergence analysis of DMCL on the ImageNet dataset.

Convergence analysis. Fig. 5 shows the curves of the objective function value and NMI against the number of iterations for DMCL. We find that at the beginning the objective function value drops and the performance increases rapidly.
The optimization procedure of DMCL typically converges in around 40 iterations on the ImageNet dataset.

5 Conclusion

In this paper, we developed a Deep Multi-view Concept Learning (DMCL) method to seek a common high-level representation for multi-view partially labeled data. DMCL tries to capture data semantic structures by graph embedding guided by label information. It also tries to learn the consistent and complementary information of multi-view data by group sparseness constraints. Experimental results on document/image datasets for both classification and clustering tasks confirmed the effectiveness of DMCL compared to competitive multi-view factorization models.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61522206, 61373118, 61672409), the Major Basic Research Project of Shaanxi Province (Grant No. 2017ZDJC-31), the Science and Technology Plan Program in Shaanxi Province of China (Grant No. 2017KJXX-80), the Fundamental Research Funds for the Central Universities, and the Innovation Fund of Xidian University.

References

[Amini et al., 2009] Massih Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In NIPS, pages 28-36, 2009.
[Bengio and others, 2009] Yoshua Bengio et al. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1-127, 2009.
[Chaudhuri et al., 2009] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129-136, 2009.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[Deng et al., 2015] Cheng Deng, Zongting Lv, Wei Liu, Junzhou Huang, Dacheng Tao, and Xinbo Gao.
Multi-view matrix decomposition: A new scheme for exploring discriminative information. In IJCAI, pages 3438-3444, 2015.
[Guan et al., 2015] Ziyu Guan, Lijun Zhang, Jinye Peng, and Jianping Fan. Multi-view concept learning for data representation. IEEE TKDE, 27(11):3016-3028, 2015.
[Han et al., 2011] Jiawei Han, Jian Pei, and Micheline Kamber. Data Mining: Concepts and Techniques. Elsevier, 2011.
[Hinton and Salakhutdinov, 2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[Lee and Seung, 2001] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556-562, 2001.
[Li et al., 2016] Yingming Li, Ming Yang, and Zhongfei Zhang. Multi-view representation learning: A survey from shallow methods to deep methods. arXiv preprint arXiv:1610.01206, 2016.
[Lin, 2007] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Comput., 19(10):2756-2779, 2007.
[Liu et al., 2015] Jing Liu, Yu Jiang, Zechao Li, Zhi-Hua Zhou, and Hanqing Lu. Partially shared latent factor learning with multiview data. IEEE Trans. Neural Netw. Learn. Syst., 26(6):1233-1246, 2015.
[Nesterov, 2013] Yu Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1):125-161, 2013.
[Ngiam et al., 2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, pages 689-696, 2011.
[Sharma et al., 2017] Pulkit Sharma, Vinayak Abrol, and Anil Kumar Sao. Deep-sparse-representation-based features for speech recognition. IEEE Trans. Audio Speech Lang. Process., 25(11):2162-2175, 2017.
[Song et al., 2015] Hyun Ah Song, Bo-Kyeong Kim, Thanh Luong Xuan, and Soo-Young Lee. Hierarchical feature extraction by multi-layer non-negative matrix factorization network for classification task. Neurocomputing, 165:63-74, 2015.
[Srivastava and Salakhutdinov, 2012] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, pages 2222-2230, 2012.
[Trigeorgis et al., 2017] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W Schuller. A deep matrix factorization method for learning attribute representations. IEEE Trans. Pattern Anal. Mach. Intell., 39(3):417-429, 2017.
[Wang et al., 2016a] Beidou Wang, Martin Ester, Jiajun Bu, Yu Zhu, Ziyu Guan, and Deng Cai. Which to view: Personalized prioritization for broadcast emails. In WWW, pages 1181-1190, 2016.
[Wang et al., 2016b] Beidou Wang, Martin Ester, Yikang Liao, Jiajun Bu, Yu Zhu, Ziyu Guan, and Deng Cai. The million domain challenge: Broadcast email prioritization by cross-domain recommendation. In SIGKDD, pages 1895-1904, 2016.
[Xu and Sun, 2012] Zhijie Xu and Shiliang Sun. Multi-source transfer learning with multi-view adaboost. In ICONIP, pages 332-339, 2012.
[Yan et al., 2007] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):40-51, 2007.
[Zhao et al., 2017] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In AAAI, pages 2921-2927, 2017.
[Zong et al., 2017] Linlin Zong, Xianchao Zhang, Long Zhao, Hong Yu, and Qianli Zhao. Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Netw., 88:74-89, 2017.