# Multi-View Clustering via Deep Matrix Factorization

Handong Zhao, Zhengming Ding, Yun Fu
Department of Electrical and Computer Engineering, Northeastern University, Boston, USA, 02115
College of Computer and Information Science, Northeastern University, Boston, USA, 02115
{hdzhao,allanding,yunfu}@ece.neu.edu

## Abstract

Multi-View Clustering (MVC) has garnered increasing attention recently, since many real-world data are comprised of different representations or views. The key is to explore complementary information to benefit the clustering problem. In this paper, we present a deep matrix factorization framework for MVC, where semi-nonnegative matrix factorization is adopted to learn the hierarchical semantics of multi-view data in a layer-wise fashion. To maximize the mutual information from each view, we enforce the non-negative representation of each view in the final layer to be the same. Furthermore, to respect the intrinsic geometric structure of each view, graph regularizers are introduced to couple the output representation of the deep structures. As a non-trivial contribution, we provide a solution based on an alternating minimization strategy, followed by a theoretical proof of convergence. Superior experimental results on three face benchmarks show the effectiveness of the proposed deep matrix factorization model.

## Introduction

Traditional clustering aims to identify groups of similar behavior in single-view data (von Luxburg 2007; Liu et al. 2015; Steinwart 2015; Tao et al. 2016; Liu et al. 2016; Li, Kong, and Fu 2017). Since real-world data are often captured from multiple sources or represented by several distinct feature sets (Cai, Nie, and Huang 2013a; Ding and Fu 2014; Gao et al. 2015; Zhao and Fu 2015; Wang, Ding, and Fu 2016), MVC has been intensively studied recently, leveraging heterogeneous data to achieve the same goal. Different features characterize different information from the data set. For example, an image can be described by different characteristics, e.g., color, texture, shape and so on. These multiple types of features can provide useful information from different views. MVC aims to integrate multiple feature sets together and to uncover the consistent latent information shared across views.

Extensive research efforts have been made in developing effective MVC methods (Cai, Nie, and Huang 2013a; Gao et al. 2015; Xu, Han, and Nie 2016; Zhao, Liu, and Fu 2016). Along this line, Kumar et al. developed co-regularized multi-view spectral clustering to cluster different views simultaneously under a co-regularization constraint (Kumar, Rai, and III 2011). Gao et al. proposed to perform clustering on the subspace representation of each view simultaneously, guided by a common cluster structure that enforces consistency across views (Gao et al. 2015). A good survey can be found in (Xu, Tao, and Xu 2013).

Figure 1: Framework of our proposed method. The same shape denotes the same class. For demonstration purposes, we only show the two-view case, where two deep matrix factorization structures are proposed to capture the rich information behind each view in a layer-wise fashion. With the deep structure, samples from the same class but different views gather close to each other to generate a more discriminative representation.
Recently, many MVC approaches have achieved promising performance based on Non-negative Matrix Factorization (NMF) and its variants, because the non-negativity constraints allow for better interpretability (Guan et al. 2012; Trigeorgis et al. 2014). The general idea is to seek a common latent factor through non-negative matrix factorization among multi-view data (Liu et al. 2013; Zhang et al. 2014; 2015). Semi Non-negative Matrix Factorization (Semi-NMF), one of the most popular variants of NMF, extends NMF by relaxing the factorized basis matrix to take real values. This relaxation gives Semi-NMF a wider range of real-world applications than NMF. Apart from exploring Semi-NMF in the MVC setting for the first time, our method has another distinction from existing NMF-based MVC methods: we adopt a deep structure and conduct Semi-NMF hierarchically, as shown in Figure 1. Through the deep Semi-NMF structure, we push data samples from the same class closer layer by layer, borrowing the idea of hierarchical feature learning from deep learning (Bengio 2009). Note that the proposed method differs from existing deep auto-encoder based MVC approaches (Andrew et al. 2013; Wang et al. 2015), even though all of them employ deep structures. One major difference is that (Andrew et al. 2013; Wang et al. 2015) are based on Canonical Correlation Analysis (CCA), which is limited to the 2-view case, while our method has no such limitation.

To sum up, in this paper we propose a deep MVC algorithm through graph-regularized semi-nonnegative matrix factorization. The key is to build a deep structure through semi-nonnegative matrix factorization that seeks a common feature representation with more consistent knowledge to facilitate clustering. To the best of our knowledge, this is the first attempt to apply semi-nonnegative matrix factorization to MVC in a deep structure. We summarize our major contributions as follows:

- A deep Semi-NMF structure is built to capture the hidden information by leveraging the strong interpretability of Semi-NMF and the effective feature learning of deep structures. Through this deep matrix factorization structure, we disassemble unimportant factors layer by layer and generate an effective consensus representation in the final layer for MVC.
- To respect the intrinsic geometric relationship among data samples, we introduce graph regularizers to guide the shared representation learning in each view. This practice makes the consensus representation in the final layer preserve most of the structure shared across multiple graphs. It can be considered a fusion scheme that boosts the final MVC performance.

## Overview of Semi-NMF

As a variant of NMF, Semi-NMF (Ding, Li, and Jordan 2010) extends traditional NMF from non-negative input to mixed-sign input, while still preserving strong interpretability. Its objective function can be expressed as

$$\min_{Z,\, H \ge 0} \; \| X - ZH \|_F^2, \qquad (1)$$

where $X \in \mathbb{R}^{d \times n}$ denotes the input data with $n$ samples, each of dimension $d$. Following the discussion on the equivalence of Semi-NMF and K-means clustering (Ding, Li, and Jordan 2010), $Z \in \mathbb{R}^{d \times K}$ can be considered as the cluster centroid matrix (for a neat presentation, we do not follow the notation style in (Ding, Li, and Jordan 2010) and drop the mixed-sign notation on $X$ and $Z$, which does not affect rigor), and $H \in \mathbb{R}^{K \times n}$, $H \ge 0$, is the soft cluster assignment matrix in the latent space (in some literature (Ding, Li, and Jordan 2010; Zhao et al. 2015), Semi-NMF is also called the soft version of K-means clustering).
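To make Eq. (1) concrete, below is a minimal NumPy sketch of the alternating Semi-NMF updates of (Ding, Li, and Jordan 2010): $Z$ is solved by unconstrained least squares, and $H$ by a multiplicative rule that keeps it non-negative. The function name, initialization, and iteration count are our own illustrative choices, not part of the paper.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """One-layer Semi-NMF: X (d x n) ~= Z (d x k) H (k x n), with H >= 0 and Z mixed-sign."""
    rng = np.random.default_rng(seed)
    H = np.abs(rng.standard_normal((k, X.shape[1])))   # non-negative initialization
    pos = lambda M: (np.abs(M) + M) / 2.0               # element-wise positive part
    neg = lambda M: (np.abs(M) - M) / 2.0               # element-wise negative part (as magnitudes)
    for _ in range(n_iter):
        # Z update: unconstrained least squares, Z = X H^T (H H^T)^(-1)
        Z = X @ np.linalg.pinv(H)
        # H update: multiplicative rule of Ding et al. (2010), preserving H >= 0
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        H *= np.sqrt((pos(ZtX) + neg(ZtZ) @ H) / (neg(ZtX) + pos(ZtZ) @ H + eps))
    return Z, H
```

Cluster assignments can then be read off as `H.argmax(axis=0)`, reflecting the soft K-means interpretation mentioned above.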
Similar to traditional NMF, the compact representation $H$ uncovers the hidden semantics by simulating the part-based representation in the human brain, i.e., its psychological and physiological interpretation. In reality, however, natural data may contain different modalities (or factors), e.g., expression, illumination, and pose in face datasets (Samaria and Harter 1994; Georghiades, Belhumeur, and Kriegman 2001). A single NMF is not strong enough to eliminate the effect of those undesirable factors and extract the intrinsic class information. To address this, Trigeorgis et al. (Trigeorgis et al. 2014) showed that a deep model based on Semi-NMF yields promising results in data representation. The multi-layer decomposition process can be expressed as

$$X \approx Z_1 H_1^{+}, \quad X \approx Z_1 Z_2 H_2^{+}, \quad \ldots, \quad X \approx Z_1 \cdots Z_m H_m^{+},$$

where $Z_i$ denotes the $i$-th layer basis matrix and $H_i^{+}$ is the $i$-th layer representation matrix. (Trigeorgis et al. 2014) showed that each hidden representation layer is able to identify different attributes. Inspired by this work, we propose an MVC method based on the deep matrix factorization technique.

## The proposed method

In the MVC setting, let us denote $\mathcal{X} = \{X^{(1)}, \ldots, X^{(v)}, \ldots, X^{(V)}\}$ as the data sample set, where $V$ is the number of views and $X^{(v)} \in \mathbb{R}^{d_v \times n}$, with $d_v$ the dimensionality of the $v$-th view and $n$ the number of data samples. We formulate our model as

$$\min_{Z_i^{(v)},\, H_i^{(v)},\, H_m,\, \alpha^{(v)}} \; \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big( H_m L^{(v)} H_m^{\mathsf{T}} \big) \Big)$$
$$\text{s.t.} \quad H_i^{(v)} \ge 0, \; H_m \ge 0, \; \sum_{v=1}^{V} \alpha^{(v)} = 1, \; \alpha^{(v)} \ge 0, \qquad (3)$$

where $X^{(v)}$ is the given data for the $v$-th view; $Z_i^{(v)}$, $i \in \{1, 2, \ldots, m\}$, is the $i$-th layer mapping for view $v$; $m$ is the number of layers; $H_m$ is the consensus latent representation for all views; $\alpha^{(v)}$ is the weighting coefficient for the $v$-th view; and $\gamma$ is the parameter controlling the weight distribution. $L^{(v)}$ is the graph Laplacian of the graph for view $v$, where each graph is constructed in a $k$-nearest-neighbor ($k$-NN) fashion. Denoting the weight matrix of the graph for view $v$ by $A^{(v)}$, we have $L^{(v)} = D^{(v)} - A^{(v)}$, where $D^{(v)}_{ii} = \sum_j A^{(v)}_{ij}$ (He and Niyogi 2004; Ding and Fu 2016).

Remark 1: Due to the homology of multi-view data, the final-layer representation $H_m^{(v)}$ of each view should be close to that of the other views. Here, we use the consensus $H_m$ as a constraint to enforce multi-view data to share the same representation after multi-layer factorization.

Remark 2: Multiple graphs are constructed to constrain the common representation learning so that the geometric structure of each view is well preserved for the final clustering. Moreover, the graph term fuses the geometric knowledge from multiple views to make the common representation more consistent.
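As a concrete illustration of the graph regularizer in Eq. (3), the sketch below builds the $k$-NN affinity $A^{(v)}$ and the Laplacian $L^{(v)} = D^{(v)} - A^{(v)}$ for one view using scikit-learn; the binary connectivity weighting and the symmetrization step are assumptions on our part, since the paper does not spell out the edge weighting.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X_view, k=5):
    """Graph Laplacian L = D - A for one view; X_view is (n x d), one row per sample."""
    # Binary k-NN affinity (assumed weighting), symmetrized so A describes an undirected graph.
    A = kneighbors_graph(X_view, n_neighbors=k, mode='connectivity', include_self=False)
    A = A.toarray()
    A = np.maximum(A, A.T)        # symmetrize
    D = np.diag(A.sum(axis=1))    # degree matrix, D_ii = sum_j A_ij
    return D - A                  # used in the tr(Hm L Hm^T) term of Eq. (3)
```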
## Optimization

To expedite the approximation of the variables in the proposed model, each layer is pre-trained to obtain initial approximations of the variables $Z_i^{(v)}$ and $H_i^{(v)}$ for the $i$-th layer of the $v$-th view. The effectiveness of pre-training has been demonstrated before on deep autoencoder networks (Hinton and Salakhutdinov 2006). Similar to (Trigeorgis et al. 2014), we first decompose the input data matrix $X^{(v)} \approx Z_1^{(v)} H_1^{(v)}$, where $Z_1^{(v)} \in \mathbb{R}^{d_v \times p_1}$ and $H_1^{(v)} \in \mathbb{R}^{p_1 \times n}$. Then the $v$-th view feature matrix $H_1^{(v)}$ is decomposed as $H_1^{(v)} \approx Z_2^{(v)} H_2^{(v)}$, where $Z_2^{(v)} \in \mathbb{R}^{p_1 \times p_2}$ and $H_2^{(v)} \in \mathbb{R}^{p_2 \times n}$. Here $p_1$ and $p_2$ are the dimensionalities of layer 1 and layer 2, respectively (for ease of presentation, we denote the dimensionalities, i.e., the layer sizes, from layer 1 to layer $m$ as $[p_1 \ldots p_m]$ in the experiments). We continue in this way until all layers have been pre-trained. Following this, the weights of each layer are fine-tuned by alternating minimization of the proposed objective in Eq. (3). First, we denote the cost function as

$$\mathcal{C} = \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big( H_m L^{(v)} H_m^{\mathsf{T}} \big) \Big).$$

**Update rule for weight matrix $Z_i^{(v)}$.** We minimize the objective with respect to $Z_i^{(v)}$ by fixing the remaining variables of the $v$-th view for the $i$-th layer. Setting $\partial \mathcal{C} / \partial Z_i^{(v)} = 0$ gives

$$Z_i^{(v)} = (\Phi^{\mathsf{T}}\Phi)^{-1} \Phi^{\mathsf{T}} X^{(v)} \tilde{H}_i^{(v)\mathsf{T}} \big( \tilde{H}_i^{(v)} \tilde{H}_i^{(v)\mathsf{T}} \big)^{-1} = \Phi^{\dagger} X^{(v)} \tilde{H}_i^{(v)\dagger}, \qquad (4)$$

where $\Phi = [Z_1^{(v)} \cdots Z_{i-1}^{(v)}]$, $\tilde{H}_i^{(v)}$ denotes the reconstruction (i.e., the learned latent feature) of the $i$-th layer's feature matrix in the $v$-th view, and $\dagger$ denotes the Moore-Penrose pseudo-inverse.

**Update rule for weight matrix $H_i^{(v)}$ ($i < m$).** Following (Ding, Li, and Jordan 2010), the update rule for $H_i^{(v)}$ ($i < m$) is formulated as

$$H_i^{(v)} = H_i^{(v)} \odot \sqrt{ \frac{ [\Phi^{\mathsf{T}} X^{(v)}]^{pos} + [\Phi^{\mathsf{T}}\Phi H_i^{(v)}]^{neg} }{ [\Phi^{\mathsf{T}} X^{(v)}]^{neg} + [\Phi^{\mathsf{T}}\Phi H_i^{(v)}]^{pos} } }, \qquad (5)$$

where $[M]^{pos}$ denotes the matrix in which all negative elements are replaced by 0, and $[M]^{neg}$ the matrix in which all positive elements are replaced by 0. That is, for all $k, j$,

$$[M]^{pos}_{kj} = \frac{|M_{kj}| + M_{kj}}{2}, \qquad [M]^{neg}_{kj} = \frac{|M_{kj}| - M_{kj}}{2}. \qquad (6)$$

**Update rule for weight matrix $H_m$ (i.e., $H_i^{(v)}$ with $i = m$).** Since $H_m$ involves the graph term, its updating rule and convergence property have not been investigated before. We give the updating rule first, followed by the proof of its convergence property:

$$H_m = H_m \odot \sqrt{ \frac{ [\Phi^{\mathsf{T}} X^{(v)}]^{pos} + [\Phi^{\mathsf{T}}\Phi H_m]^{neg} + G_u(H_m, A) }{ [\Phi^{\mathsf{T}} X^{(v)}]^{neg} + [\Phi^{\mathsf{T}}\Phi H_m]^{pos} + G_d(H_m, A) } }, \qquad (7)$$

where $G_u(H_m, A) = \beta\big( [H_m A^{(v)}]^{pos} + [H_m D^{(v)}]^{neg} \big)$ and $G_d(H_m, A) = \beta\big( [H_m A^{(v)}]^{neg} + [H_m D^{(v)}]^{pos} \big)$.

**Theorem 1.** The limiting solution of the update rule in Eq. (7) satisfies the KKT condition.

*Proof.* We introduce the Lagrangian function

$$\mathcal{L}(H_m) = \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big( H_m L^{(v)} H_m^{\mathsf{T}} \big) \Big) - \mathrm{tr}\big( \eta H_m^{\mathsf{T}} \big), \qquad (8)$$

where the Lagrangian multiplier $\eta$ enforces the non-negativity constraint $H_m \ge 0$. The zero-gradient condition gives $\partial \mathcal{L}(H_m)/\partial H_m = 2\Phi^{\mathsf{T}}(\Phi H_m - X^{(v)}) + 2 H_m (D^{(v)} - A^{(v)}) - \eta = 0$. From the complementary slackness condition, we obtain

$$\big[ 2\Phi^{\mathsf{T}}(\Phi H_m - X^{(v)}) + 2 H_m (D^{(v)} - A^{(v)}) \big]_{kl} (H_m)_{kl} = \eta_{kl} (H_m)_{kl} = 0. \qquad (9)$$

This is a fixed-point equation that the solution must satisfy at convergence. The limiting solution of Eq. (7) satisfies this fixed-point equation. At convergence, $H_m^{(\infty)} = H_m^{(t+1)} = H_m^{(t)} = H_m$, i.e.,

$$(H_m)_{kl} = (H_m)_{kl} \sqrt{ \frac{ [\Phi^{\mathsf{T}} X^{(v)}]^{pos}_{kl} + [\Phi^{\mathsf{T}}\Phi H_m]^{neg}_{kl} + [G_u(H_m, A)]_{kl} }{ [\Phi^{\mathsf{T}} X^{(v)}]^{neg}_{kl} + [\Phi^{\mathsf{T}}\Phi H_m]^{pos}_{kl} + [G_d(H_m, A)]_{kl} } }. \qquad (10)$$

Note that $\Phi^{\mathsf{T}} X^{(v)} = [\Phi^{\mathsf{T}} X^{(v)}]^{pos} - [\Phi^{\mathsf{T}} X^{(v)}]^{neg}$; $\Phi^{\mathsf{T}}\Phi H_m = [\Phi^{\mathsf{T}}\Phi H_m]^{pos} - [\Phi^{\mathsf{T}}\Phi H_m]^{neg}$; $H_m D^{(v)} = [H_m D^{(v)}]^{pos} - [H_m D^{(v)}]^{neg}$; $H_m A^{(v)} = [H_m A^{(v)}]^{pos} - [H_m A^{(v)}]^{neg}$. Thus Eq. (10) reduces to

$$\big[ 2\Phi^{\mathsf{T}}(\Phi H_m - X^{(v)}) + 2 H_m (D^{(v)} - A^{(v)}) \big]_{kl} (H_m)^2_{kl} = 0. \qquad (11)$$

Eq. (11) is identical to Eq. (9): both require that at least one of the two factors be zero, and the first factors in both equations are identical. For the second factor, $(H_m)_{kl}$ or $(H_m)^2_{kl}$, if $(H_m)_{kl} = 0$ then $(H_m)^2_{kl} = 0$, and vice versa. Therefore, if Eq. (9) holds, Eq. (11) also holds, and vice versa.
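For illustration, the following sketch implements one multiplicative step of Eq. (7) in NumPy for a single view (with $\Phi = Z_1 \cdots Z_m$). How the $V$ per-view terms are aggregated into the shared $H_m$ update is not spelled out in the displayed rule, so this per-view form, the small `eps` safeguard, and all variable names are our assumptions.

```python
import numpy as np

def pos(M):
    """Element-wise positive part, [M]^pos."""
    return (np.abs(M) + M) / 2.0

def neg(M):
    """Element-wise negative part as magnitudes, [M]^neg."""
    return (np.abs(M) - M) / 2.0

def update_Hm(Hm, Phi, X, A, Dg, beta, eps=1e-10):
    """One step of Eq. (7): graph-regularized multiplicative update of the consensus Hm."""
    PtX, PtPH = Phi.T @ X, Phi.T @ Phi @ Hm
    Gu = beta * (pos(Hm @ A) + neg(Hm @ Dg))   # G_u(Hm, A): terms pushing entries up
    Gd = beta * (neg(Hm @ A) + pos(Hm @ Dg))   # G_d(Hm, A): terms pulling entries down
    return Hm * np.sqrt((pos(PtX) + neg(PtPH) + Gu) /
                        (neg(PtX) + pos(PtPH) + Gd + eps))
```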
**Update rule for weight $\alpha^{(v)}$.** Similar to (Cai, Nie, and Huang 2013b), for ease of presentation let us denote $R^{(v)} = \| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \|_F^2 + \beta\, \mathrm{tr}( H_m L^{(v)} H_m^{\mathsf{T}} )$. The objective in Eq. (3) with respect to $\alpha^{(v)}$ is then

$$\min_{\alpha^{(v)}} \; \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} R^{(v)}, \quad \text{s.t.} \quad \sum_{v=1}^{V} \alpha^{(v)} = 1, \; \alpha^{(v)} \ge 0. \qquad (12)$$

The Lagrange function of Eq. (12) is

$$\sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} R^{(v)} - \lambda \Big( \sum_{v=1}^{V} \alpha^{(v)} - 1 \Big), \qquad (13)$$

where $\lambda$ is the Lagrange multiplier. Taking the derivative of Eq. (13) with respect to $\alpha^{(v)}$ and setting it to zero, we obtain

$$\alpha^{(v)} = \Big( \frac{\lambda}{\gamma R^{(v)}} \Big)^{\frac{1}{\gamma - 1}}. \qquad (14)$$

Substituting $\alpha^{(v)}$ from Eq. (14) into $\sum_{v=1}^{V} \alpha^{(v)} = 1$, we get

$$\alpha^{(v)} = \frac{ \big( \gamma R^{(v)} \big)^{\frac{1}{1-\gamma}} }{ \sum_{v=1}^{V} \big( \gamma R^{(v)} \big)^{\frac{1}{1-\gamma}} }. \qquad (15)$$

It is interesting to see that with only one parameter $\gamma$, we can control the weights of the different views. When $\gamma$ approaches $\infty$, we get equal weights. When $\gamma$ is close to 1, the weight of the view with the smallest $R^{(v)}$ is assigned 1, and the others are assigned 0.

With this, all the update rules are in place. We repeat the updates iteratively until convergence; the entire procedure is outlined in Algorithm 1. After obtaining the optimized $H_m$, standard spectral clustering (Ng, Jordan, and Weiss 2001) is performed on the graph built on $H_m$ via the $k$-NN algorithm.

### Time complexity

Our deep matrix factorization model consists of two stages, i.e., pre-training and fine-tuning, so we analyze them separately. To simplify the analysis, we assume the dimensions of all layers (i.e., the layer sizes) are the same, denoted $p$, and that the original feature dimensions of all views are the same, denoted $d$. $V$ is the number of views and $m$ is the number of layers. In the pre-training stage, the Semi-NMF process and the graph construction are the time-consuming parts. The complexity is of order $O\big( V m t_p (dnp + np^2 + pd^2 + pn^2 + dn^2) \big)$, where $t_p$ is the number of iterations needed for the Semi-NMF optimization to converge. Normally $p < d$, so the computational cost of the pre-training stage is $T_{pre.} = O\big( V m t_p (dnp + pd^2 + dn^2) \big)$. Similarly, in the fine-tuning stage, the time complexity is of order $T_{fine.} = O\big( V m t_f (dnp + pd^2 + pn^2) \big)$, where $t_f$ is the number of iterations of the fine-tuning stage. In total, the overall computational cost is $T_{total} = T_{pre.} + T_{fine.}$.

Algorithm 1: Optimization of Problem (3)

- Input: multi-view data $X^{(v)}$, tuning parameters $\gamma$ and $\beta$, layer sizes $p_i$, and the number of nearest neighbors $k$.
- Initialize: for all layers in each view, $(Z_i^{(v)}, H_i^{(v)}) \leftarrow \text{Semi-NMF}(H_{i-1}^{(v)}, p_i)$ (with $H_0^{(v)} = X^{(v)}$); set $\alpha^{(v)} \leftarrow 1/V$; build the $k$-NN graph $A^{(v)}$ on $X^{(v)}$.
- While not converged:
  - For all layers $i$ in each view $v$:
    - $\tilde{H}_i^{(v)} \leftarrow H_m$ if $i = m$, otherwise $\tilde{H}_i^{(v)} \leftarrow Z_{i+1}^{(v)} \tilde{H}_{i+1}^{(v)}$.
    - $\Phi \leftarrow \prod_{\tau=1}^{i-1} Z_\tau^{(v)}$; $Z_i^{(v)} \leftarrow \Phi^{\dagger} X^{(v)} \tilde{H}_i^{(v)\dagger}$.
    - Update $H_i^{(v)}$ via Eq. (5) if $i < m$, via Eq. (7) otherwise.
  - Update $\alpha^{(v)}$ via Eq. (15).
- Output: weight matrices $Z_i^{(v)}$, feature matrices $H_i^{(v)}$ ($i < m$), and the final-layer $H_m$.
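As a small companion to Algorithm 1, here is the closed-form view-weight update of Eq. (15) in NumPy (function and variable names are ours):

```python
import numpy as np

def update_alpha(R, gamma):
    """View weights of Eq. (15); R holds the per-view costs R^(v). gamma must not equal 1."""
    w = (gamma * np.asarray(R, dtype=float)) ** (1.0 / (1.0 - gamma))
    return w / w.sum()   # normalize so the weights sum to 1

# e.g., update_alpha([r1, r2, r3], gamma=0.5) returns the three weights alpha^(v)
```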
## Experiments

We choose three face image/video benchmarks for our experiments, since faces carry rich structural information, which is well suited to demonstrating the strengths of the deep Semi-NMF structure. A brief introduction of the datasets and preprocessing steps follows.

**Yale** consists of 165 images of 15 subjects in raw pixels. Each subject has 11 images taken under different conditions, e.g., facial expressions, illuminations, with/without glasses, lighting conditions, etc.

**Extended Yale B** consists of face images of 38 subjects. Each subject has 64 face images taken under various lighting conditions and poses. In this work, the first 10 subjects, i.e., 640 images, are used for the experiment.

**Notting-Hill** is a well-known video face benchmark (Zhang et al. 2009), generated from the movie "Notting Hill". There are 5 major casts, comprising 4660 faces in 76 tracks.

For these datasets, we follow the preprocessing strategy of (Cao et al. 2015). First, all images are resized to 48 × 48, and then three kinds of features are extracted, i.e., intensity, LBP (Ahonen, Hadid, and Pietikäinen 2006) and Gabor (Feichtinger and Strohmer 1998). Specifically, LBP is a 59-dimensional histogram over 9 × 10 pixel patches generated from the cropped images. The scale parameter λ in the Gabor wavelets is fixed to 4, at four orientations θ = {0°, 45°, 90°, 135°}, on cropped images of size 25 × 30 pixels.

We compare against the following baselines. (1) Best SV performs standard spectral clustering (Ng, Jordan, and Weiss 2001) on the features of each single view; we report the best performance. (2) Concat Fea concatenates all the features and then performs standard spectral clustering. (3) Concat PCA concatenates all the features, projects the original features into a low-dimensional subspace via PCA, and applies spectral clustering on the projected representation. (4) Co-Reg (SPC) (Kumar, Rai, and III 2011) co-regularizes the clustering hypotheses to enforce that the memberships from different views agree with each other. (5) Co-Train (SPC) (Kumar and III 2011) borrows the co-training strategy to alternately modify the graph structure of each view using the other views' information. (6) Min-D(isagreement) (de Sa 2005) builds a bipartite graph derived from the minimizing-disagreement idea. (7) MultiNMF (Liu et al. 2013) applies NMF to project each view's data to a common latent subspace; this method can roughly be considered a one-layer version of our proposed method. (8) NaMSC (Cao et al. 2015) first applies (Hu et al. 2014) to each view's data, then combines the learned representations and feeds them to spectral clustering. (9) DiMSC (Cao et al. 2015) investigates the complementary information of multi-view representations by introducing a diversity term, and is one of the most recent MVC approaches. We do not compare with the deep auto-encoder based methods (Andrew et al. 2013; Wang et al. 2015), because these CCA-based methods cannot fully utilize more than 2 views, which would lead to an unfair comparison.

To make a comprehensive evaluation, we use six different evaluation metrics: normalized mutual information (NMI), accuracy (ACC), adjusted rand index (AR), F-score, Precision and Recall. For details about the metrics, readers may refer to (Kumar and III 2011; Cao et al. 2015). For all metrics, a higher value denotes better performance. Different measurements favor different properties, so a comprehensive view can be acquired from the diverse results. Each experiment is repeated 10 times, and we report the mean values along with standard deviations.
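The paper does not list reference implementations for these metrics; a common way to compute NMI and ACC (the latter via the Hungarian algorithm for the best label matching) is sketched below, assuming scikit-learn and SciPy and integer-coded labels in $\{0, \ldots, k-1\}$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping between predicted and true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # co-occurrence counts
    row, col = linear_sum_assignment(-cost)   # Hungarian matching, maximizing matched counts
    return cost[row, col].sum() / y_true.size

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```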
Table 1: Results of 6 different metrics (mean ± standard deviation) on dataset Yale.

| Method | NMI | ACC | AR | F-score | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Best SV | 0.654 ± 0.009 | 0.616 ± 0.030 | 0.440 ± 0.011 | 0.475 ± 0.011 | 0.457 ± 0.011 | 0.495 ± 0.010 |
| Concat Fea | 0.641 ± 0.006 | 0.544 ± 0.038 | 0.392 ± 0.009 | 0.431 ± 0.008 | 0.415 ± 0.007 | 0.448 ± 0.008 |
| Concat PCA | 0.665 ± 0.037 | 0.578 ± 0.038 | 0.396 ± 0.011 | 0.434 ± 0.011 | 0.419 ± 0.012 | 0.450 ± 0.009 |
| Co-Reg | 0.648 ± 0.002 | 0.564 ± 0.000 | 0.436 ± 0.002 | 0.466 ± 0.000 | 0.455 ± 0.004 | 0.491 ± 0.003 |
| Co-Train | 0.672 ± 0.006 | 0.630 ± 0.001 | 0.452 ± 0.010 | 0.487 ± 0.009 | 0.470 ± 0.010 | 0.505 ± 0.007 |
| Min-D | 0.645 ± 0.005 | 0.615 ± 0.043 | 0.433 ± 0.006 | 0.470 ± 0.006 | 0.446 ± 0.005 | 0.496 ± 0.006 |
| MultiNMF | 0.690 ± 0.001 | 0.673 ± 0.001 | 0.495 ± 0.001 | 0.527 ± 0.000 | 0.512 ± 0.000 | 0.543 ± 0.000 |
| NaMSC | 0.671 ± 0.011 | 0.636 ± 0.000 | 0.475 ± 0.004 | 0.508 ± 0.007 | 0.492 ± 0.003 | 0.524 ± 0.004 |
| DiMSC | 0.727 ± 0.010 | 0.709 ± 0.003 | 0.535 ± 0.001 | 0.564 ± 0.002 | 0.543 ± 0.001 | 0.586 ± 0.003 |
| Ours | 0.782 ± 0.010 | 0.745 ± 0.011 | 0.579 ± 0.002 | 0.601 ± 0.002 | 0.598 ± 0.001 | 0.613 ± 0.002 |

Table 2: Results of 6 different metrics (mean ± standard deviation) on dataset Extended Yale B.

| Method | NMI | ACC | AR | F-score | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Best SV | 0.360 ± 0.016 | 0.366 ± 0.059 | 0.225 ± 0.018 | 0.303 ± 0.011 | 0.296 ± 0.010 | 0.310 ± 0.012 |
| Concat Fea | 0.147 ± 0.005 | 0.224 ± 0.012 | 0.064 ± 0.003 | 0.159 ± 0.002 | 0.155 ± 0.002 | 0.162 ± 0.002 |
| Concat PCA | 0.152 ± 0.003 | 0.232 ± 0.005 | 0.069 ± 0.002 | 0.161 ± 0.002 | 0.158 ± 0.001 | 0.164 ± 0.002 |
| Co-Reg | 0.151 ± 0.001 | 0.224 ± 0.000 | 0.066 ± 0.001 | 0.160 ± 0.000 | 0.157 ± 0.001 | 0.162 ± 0.000 |
| Co-Train | 0.302 ± 0.007 | 0.186 ± 0.001 | 0.043 ± 0.001 | 0.140 ± 0.001 | 0.137 ± 0.001 | 0.143 ± 0.002 |
| Min-D | 0.186 ± 0.003 | 0.242 ± 0.018 | 0.088 ± 0.001 | 0.181 ± 0.001 | 0.174 ± 0.001 | 0.189 ± 0.002 |
| MultiNMF | 0.377 ± 0.006 | 0.428 ± 0.002 | 0.231 ± 0.001 | 0.329 ± 0.001 | 0.298 ± 0.001 | 0.372 ± 0.002 |
| NaMSC | 0.594 ± 0.004 | 0.581 ± 0.013 | 0.380 ± 0.002 | 0.446 ± 0.004 | 0.411 ± 0.002 | 0.486 ± 0.001 |
| DiMSC | 0.635 ± 0.002 | 0.615 ± 0.003 | 0.453 ± 0.000 | 0.504 ± 0.006 | 0.481 ± 0.002 | 0.534 ± 0.001 |
| Ours | 0.649 ± 0.002 | 0.763 ± 0.001 | 0.512 ± 0.002 | 0.564 ± 0.001 | 0.525 ± 0.001 | 0.610 ± 0.001 |

Table 3: Results of 6 different metrics (mean ± standard deviation) on dataset Notting-Hill.

| Method | NMI | ACC | AR | F-score | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Best SV | 0.723 ± 0.008 | 0.813 ± 0.000 | 0.712 ± 0.020 | 0.775 ± 0.015 | 0.774 ± 0.018 | 0.776 ± 0.013 |
| Concat Fea | 0.628 ± 0.028 | 0.673 ± 0.033 | 0.612 ± 0.041 | 0.696 ± 0.032 | 0.699 ± 0.032 | 0.693 ± 0.031 |
| Concat PCA | 0.632 ± 0.009 | 0.733 ± 0.008 | 0.598 ± 0.015 | 0.685 ± 0.012 | 0.691 ± 0.010 | 0.680 ± 0.014 |
| Co-Reg | 0.660 ± 0.003 | 0.758 ± 0.000 | 0.616 ± 0.004 | 0.699 ± 0.000 | 0.705 ± 0.003 | 0.694 ± 0.003 |
| Co-Train | 0.766 ± 0.005 | 0.689 ± 0.027 | 0.589 ± 0.035 | 0.677 ± 0.026 | 0.688 ± 0.030 | 0.667 ± 0.023 |
| Min-D | 0.707 ± 0.003 | 0.791 ± 0.000 | 0.689 ± 0.002 | 0.758 ± 0.002 | 0.750 ± 0.002 | 0.765 ± 0.003 |
| MultiNMF | 0.752 ± 0.001 | 0.831 ± 0.001 | 0.762 ± 0.000 | 0.815 ± 0.000 | 0.804 ± 0.001 | 0.824 ± 0.001 |
| NaMSC | 0.730 ± 0.002 | 0.752 ± 0.013 | 0.666 ± 0.004 | 0.738 ± 0.005 | 0.746 ± 0.002 | 0.730 ± 0.011 |
| DiMSC | 0.799 ± 0.001 | 0.843 ± 0.021 | 0.787 ± 0.001 | 0.834 ± 0.001 | 0.822 ± 0.005 | 0.836 ± 0.009 |
| Ours | 0.797 ± 0.005 | 0.871 ± 0.009 | 0.803 ± 0.002 | 0.847 ± 0.002 | 0.826 ± 0.007 | 0.870 ± 0.001 |

Table 1 and Table 2 tabulate the results on the Yale and Extended Yale B datasets. Our method outperforms all the other competitors. On Yale, we raise the performance bar by around 7.57% in NMI, 5.08% in ACC, 8.22% in AR, 6.56% in F-score, 10.13% in Precision and 4.61% in Recall. On average, we improve over the state-of-the-art DiMSC by more than 7%. A possible reason why our method improves so much is that the image data in both Yale and Extended Yale B contain multiple factors, i.e., pose, expression, illumination, etc. Existing MVC methods involve only one layer of representation, e.g., the one-layer factor decomposition in MultiNMF or the self-representation coefficient matrix Z in NaMSC and DiMSC (Cao et al. 2015). In contrast, our approach extracts a meaningful representation layer by layer. Through this deep representation, we eliminate the influence of undesirable factors and keep the core information (i.e., class/identity information) in the final layer.

Table 3 lists the performance on the video data Notting-Hill. This dataset is more challenging than the previous two image datasets, since the illumination conditions vary dramatically and the source of lighting is arbitrary. Moreover, there is no fixed expression pattern in the Notting-Hill movie, in contrast to Yale and Extended Yale B. We observe from the table that our method reports superior results on five metrics. The only exception is NMI, where our performance is worse than DiMSC by only 0.25%. We therefore safely draw the conclusion that our proposed method generally achieves better clustering performance on the challenging video dataset Notting-Hill.

### Analysis

In this subsection, the robustness and stability of the proposed model are evaluated. The convergence property is first studied in terms of objective value and NMI performance. Then analytical experiments on three key model parameters, $\beta$, $\gamma$ and the layer size, are conducted.

**Convergence analysis.** In Theorem 1, we theoretically showed that the most complex update, that of $H_m$, satisfies the KKT conditions. To experimentally show the convergence property of the whole model, we compute the objective value of Eq. (3) in each iteration. The corresponding parameters $\gamma$, $\beta$ and layer size are set to 0.5, 0.1 and [100, 50], respectively. The objective value curve is plotted in red in Figure 2. We observe that the objective value decreases steadily and gradually converges after around 100 iterations. The average NMI (in blue) goes through two stages before converging: from iteration 1 to 14, the NMI increases dramatically; from iteration 15 to 30, it fluctuates slightly and reaches its best value around the convergence point. To be safe, the maximum number of iterations is set to 150 for all experiments.

Figure 2: Objective function value (red line) and NMI (blue line) with respect to iteration number on the Yale dataset, with parameters β = 0.1, γ = 0.5 and layer size [100, 50].

**Parameter analysis.** The proposed method has four sets of parameters, i.e., the balancing parameters $\beta$ and $\gamma$, the layer sizes $p_i$, and the number of nearest neighbors $k$ used when constructing the $k$-NN graph. Selecting $k$ in $k$-NN graph construction is an open problem (He and Niyogi 2004). Due to limited space, we only include the analysis of the first three parameters in this paper; we find, however, that $k = 5$ usually achieves relatively good results.
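Relating back to the convergence analysis above, the per-iteration objective of Eq. (3) that is plotted in Figure 2 can be evaluated with a few lines of NumPy; the function below is our own sketch (data structures and names assumed, not from the paper).

```python
import numpy as np

def objective(Xs, Zs, Hm, Ls, alphas, beta, gamma):
    """Value of Eq. (3): per-view weighted reconstruction error plus graph regularization.

    Xs:     list of view matrices X^(v), each d_v x n
    Zs:     list of lists of layer matrices [Z_1^(v), ..., Z_m^(v)]
    Hm:     consensus final-layer representation, k x n
    Ls:     list of graph Laplacians L^(v), each n x n
    alphas: per-view weights alpha^(v)
    """
    val = 0.0
    for X, Z_list, L, a in zip(Xs, Zs, Ls, alphas):
        Phi = np.linalg.multi_dot(Z_list) if len(Z_list) > 1 else Z_list[0]
        rec = np.linalg.norm(X - Phi @ Hm, 'fro') ** 2   # reconstruction term
        reg = beta * np.trace(Hm @ L @ Hm.T)             # graph regularizer
        val += (a ** gamma) * (rec + reg)
    return val
```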
Figure 3: NMI curves w.r.t. parameter γ on the Yale dataset with three different layer-size settings, i.e., {[100 50], [500 50], [500 200]}; β is set to 0.1.

Figure 3 shows the influence of the parameter $\gamma$ on the NMI result under three different layer-size settings, i.e., {[100 50], [500 50], [500 200]}. Parameter $\beta$ is set to 0.1. $\gamma$ is evaluated over the grid $\{5 \times 10^{-3}, 5 \times 10^{-2}, 5 \times 10^{-1}, 5 \times 10^{0}, 5 \times 10^{1}, 5 \times 10^{2}\}$. Note that, to avoid division by zero, $\gamma$ cannot be set to 1. We observe that the proposed method performs best when $\gamma = 0.5$ under all layer-size settings. In general, when $\gamma$ is of magnitude $10^{-1}$, $10^{-2}$ or $10^{-3}$, the performance is quite stable. We fix $\gamma = 0.5$ as the default in our experiments.

Figure 4: NMI curves w.r.t. parameter β on the Yale dataset with three different layer-size settings, i.e., {[100 50], [500 50], [500 200]}; γ is set to 0.5.

Figure 4 explores the sensitivity of our model to the parameter $\beta$. Considering the possible amplitude variations of the two terms in the objective function Eq. (3), we evaluate $\beta$ over the set $\{10^{3}, 10^{2}, 10^{1}, 10^{0}, 10^{-1}, 10^{-2}, 10^{-3}\}$. As can be seen, the average NMI results under the three layer-size settings are relatively steady, and slightly better when $\beta \in \{10^{-2}, 10^{-3}\}$. In practice, we choose $\beta = 0.01$ as the default.

For the layer-size analysis, we observe from Figures 3 and 4 that the setting [100 50] always performs best. Empirically, we find that the dimension of the last layer plays a more important role than the other layer sizes (the blue curves are always close to the red ones). In the Yale dataset, the ground-truth number of clusters is 15. When the last layer size is set to 200, it may introduce more noise than when it is set to 50, which is a possible reason why the green curves (i.e., layer size [500 200]) perform worst.

## Conclusion

In this paper, we proposed a deep matrix factorization approach for the MVC problem. Through multi-layer Semi-NMF, our method is able to eliminate the undesirable influences of diverse modalities while keeping only the class information in the output layer. With the guidance of multiple graphs, the learned common representation preserves the geometric structure of each view, especially the structure information shared across views. Extensive experimental results, comparing against nine baselines, validated the effectiveness of the proposed deep matrix factorization structure.

## Acknowledgments

This work is supported in part by the NSF IIS award 1651902, NSF CNS award 1314484, ONR award N00014-12-1-1028, ONR Young Investigator Award N00014-14-1-0484, and U.S. Army Research Office Young Investigator Award W911NF-14-1-0218.

## References

- Ahonen, T.; Hadid, A.; and Pietikäinen, M. 2006. Face description with local binary patterns: Application to face recognition. TPAMI 28(12):2037-2041.
- Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In ICML, 1247-1255.
- Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127.
- Cai, X.; Nie, F.; and Huang, H. 2013a. Multi-view k-means clustering on big data. In AAAI, 2598-2604.
- Cai, X.; Nie, F.; and Huang, H. 2013b. Multi-view k-means clustering on big data. In IJCAI.
- Cao, X.; Zhang, C.; Fu, H.; Liu, S.; and Zhang, H. 2015. Diversity-induced multi-view subspace clustering. In CVPR, 586-594.
- de Sa, V. R. 2005. Spectral clustering with two views. In ICML, 20-27.
- Ding, Z., and Fu, Y. 2014. Low-rank common subspace for multi-view learning. In ICDM, 110-119.
- Ding, Z., and Fu, Y. 2016. Robust multi-view subspace learning through dual low-rank decompositions. In AAAI, 1181-1187.
- Ding, C. H. Q.; Li, T.; and Jordan, M. I. 2010. Convex and semi-nonnegative matrix factorizations. TPAMI 32(1):45-55.
- Feichtinger, H. G., and Strohmer, T. 1998. Gabor Analysis and Algorithms: Theory and Applications. Applied and Numerical Harmonic Analysis. Birkhäuser.
- Gao, H.; Nie, F.; Li, X.; and Huang, H. 2015. Multi-view subspace clustering. In ICCV, 4238-4246.
- Georghiades, A. S.; Belhumeur, P. N.; and Kriegman, D. J. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI 23(6):643-660.
- Guan, N.; Tao, D.; Luo, Z.; and Yuan, B. 2012. NeNMF: An optimal gradient method for nonnegative matrix factorization. TSP 60(6):2882-2898.
- He, X., and Niyogi, P. 2004. Locality preserving projections. In NIPS, 153-160.
- Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313:504-507.
- Hu, H.; Lin, Z.; Feng, J.; and Zhou, J. 2014. Smooth representation clustering. In CVPR, 3834-3841.
- Kumar, A., and III, H. D. 2011. A co-training approach for multi-view spectral clustering. In ICML, 393-400.
- Kumar, A.; Rai, P.; and III, H. D. 2011. Co-regularized multi-view spectral clustering. In NIPS, 1413-1421.
- Li, J.; Kong, Y.; and Fu, Y. 2017. Sparse subspace clustering by learning approximation ℓ0 codes. In AAAI.
- Liu, J.; Wang, C.; Gao, J.; and Han, J. 2013. Multi-view clustering via joint nonnegative matrix factorization. In SDM, 252-260.
- Liu, H.; Liu, T.; Wu, J.; Tao, D.; and Fu, Y. 2015. Spectral ensemble clustering. In KDD.
- Liu, H.; Shao, M.; Li, S.; and Fu, Y. 2016. Infinite ensemble for image clustering. In KDD.
- Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2001. On spectral clustering: Analysis and an algorithm. In NIPS, 849-856.
- Samaria, F., and Harter, A. 1994. Parameterisation of a stochastic model for human face identification. In WACV, 138-142.
- Steinwart, I. 2015. Fully adaptive density-based clustering. Annals of Statistics 43(5):2132-2167.
- Tao, Z.; Liu, H.; Li, S.; and Fu, Y. 2016. Robust spectral ensemble clustering. In CIKM, 367-376.
- Trigeorgis, G.; Bousmalis, K.; Zafeiriou, S.; and Schuller, B. W. 2014. A deep semi-NMF model for learning hidden representations. In ICML, 1692-1700.
- von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395-416.
- Wang, W.; Arora, R.; Livescu, K.; and Bilmes, J. 2015. On deep multi-view representation learning. In ICML.
- Wang, S.; Ding, Z.; and Fu, Y. 2016. Coupled marginalized auto-encoders for cross-domain multi-view learning. In IJCAI, 2125-2131.
- Xu, J.; Han, J.; and Nie, F. 2016. Discriminatively embedded k-means for multi-view clustering. In CVPR.
- Xu, C.; Tao, D.; and Xu, C. 2013. A survey on multi-view learning. CoRR abs/1304.5634.
- Zhang, Y.; Xu, C.; Lu, H.; and Huang, Y. 2009. Character identification in feature-length films using global face-name matching. TMM 11(7):1276-1288.
- Zhang, X.; Zhao, L.; Zong, L.; Liu, X.; and Yu, H. 2014. Multi-view clustering via multi-manifold regularized nonnegative matrix factorization. In ICDM, 1103-1108.
- Zhang, X.; Zong, L.; Liu, X.; and Yu, H. 2015. Constrained NMF-based multi-view clustering on unmapped data. In AAAI, 3174-3180.
- Zhao, H., and Fu, Y. 2015. Dual-regularized multi-view outlier detection. In IJCAI, 4077-4083.
- Zhao, H.; Ding, Z.; Shao, M.; and Fu, Y. 2015. Part-level regularized semi-nonnegative coding for semi-supervised learning. In ICDM, 1123-1128.
- Zhao, H.; Liu, H.; and Fu, Y. 2016. Incomplete multi-modal visual data grouping. In IJCAI, 2392-2398.