# CPM-Nets: Cross Partial Multi-View Networks

Changqing Zhang^{1,2}, Zongbo Han^1, Yajie Cui^1, Huazhu Fu^3, Joey Tianyi Zhou^4, Qinghua Hu^{1,2}

^1 College of Intelligence and Computing, Tianjin University, Tianjin, China
^2 Tianjin Key Lab of Machine Learning, Tianjin, China
^3 Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
^4 Institute of High Performance Computing, A*STAR, Singapore

Corresponding author: J. T. Zhou.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Although multi-view learning has progressed rapidly in the past decades, it remains challenging due to the difficulty of modeling the complex correlations among different views, especially in the presence of missing views. To address this challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). In this framework, we first give a formal definition of completeness and versatility for multi-view representation, and then theoretically prove the versatility of the latent representation learned by our algorithm. To achieve completeness, the task of learning the latent multi-view representation is translated into a degradation process that mimics data transmission, so that an optimal tradeoff between consistency and complementarity across different views can be achieved. In contrast with methods that either complete the missing views or group samples according to view-missing patterns, our model fully exploits all samples and all views to produce a structured representation for interpretability. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-art methods.

1 Introduction

In real-world applications, data are usually represented by different views, including multiple modalities or multiple types of features. Many existing methods [1, 2, 3] empirically demonstrate that different views can complement each other, leading to improved performance. Unfortunately, the unknown and complex correlations among different views often disrupt the integration of different modalities in a model. Moreover, data with missing views further aggravate the modeling difficulty. Conventional multi-view learning usually assumes that each sample is associated with the same set of observed views and that all views are available for every sample. In practical applications, however, incomplete multi-view data are common [4, 5, 6, 7, 8]. For example, in medical data, different types of examinations are usually conducted for different subjects, and in web analysis, some web pages may contain texts, pictures and videos, while others may contain only one or two of these types, which produces data with missing views. The view-missing patterns (i.e., the combinations of available views) become even more complex for data with more views.

Projecting different views into a common space (e.g., CCA: Canonical Correlation Analysis and its variants [9, 10, 11]) is impeded by the view-missing issue. Several methods have been proposed to continue exploiting the correlation among different views. One straightforward way is to complete the missing views, after which off-the-shelf multi-view learning algorithms can be adopted. The missing views are typically blockwise, so low-rank based completion [12, 13] is not applicable, as has been widely recognized [5, 14]. Missing-modality imputation methods [15, 5] usually require samples with two paired modalities to train networks that predict the missing modality from the observed one.
To explore the complementarity among multiple views, another natural way is to manually group samples according to the availability of data sources [16], and subsequently learn multiple models on these groups for late fusion. Although this is more effective than learning on each single view, the grouping strategy is not flexible, especially for data with a large number of views. Accordingly, a challenging problem arises: how to effectively and flexibly exploit samples with arbitrary view-missing patterns? Our methodology is expected to provide the following merits: complete and structured representation - comprehensively encoding information from different views into a clustering-structured representation, and flexible integration - handling arbitrary view-missing patterns. To this end, we propose a novel algorithm, i.e., Cross Partial Multi-View Networks (CPM-Nets), for classification, as shown in Fig. 1. Benefiting from the common latent representation learned by the encoding networks, all samples and all views can be jointly exploited regardless of view-missing patterns. For the multi-view representation, CPM-Nets jointly considers multi-view complementarity and class distribution, making them mutually improve each other to obtain a representation that reflects the underlying patterns. Specifically, the latent representation encoded from the observations is complete and versatile and thus promotes prediction performance, while the clustering-like classification scheme in turn enhances the separability of the latent representation. Theoretical analysis and empirical results validate the effectiveness of the proposed CPM-Nets in exploiting partial multi-view data.

Figure 1: Illustration of Cross Partial Multi-View Networks. Given multi-view data with missing views (black blocks), the encoding networks degrade the complete latent representation into the available views (white blocks). Learning the multi-view representation according to the distributions of observations and classes promises to encode complementary information as well as provide accurate prediction.

1.1 Related Work

Multi-View Learning (MVL) aims to jointly utilize information from different views. Multi-view clustering algorithms [17, 18, 19, 20, 21] usually search for consistent clustering hypotheses across different views; representative methods include co-regularization based [17], co-training based [18] and high-order multi-view clustering [19]. Under the metric learning framework, multi-view classification methods [22, 23] jointly learn multiple metrics for multiple views. Representative multi-view representation learning methods are CCA based, including kernelized CCA [10], deep neural network based CCA [11, 24], and semi-paired and semi-supervised generalized correlation analysis (S2GCA) [25]. Cross-View Learning (CVL) essentially searches for mappings between two views and has been widely applied in real applications [26, 27, 28, 29, 30]. With adversarial training, the embedding spaces of two individual views can be learned and aligned simultaneously [27]. Cross-modal convolutional neural networks have been regularized to obtain a shared representation that is agnostic of the modality for cross-modal scene images [28].
Cross-view learning can also be utilized for missing-view imputation [31, 14]. For Partial Multi-View Learning (PMVL), existing strategies usually transform the incomplete case into a complete multi-view learning task. Imputation methods [5, 31] complete the missing views by leveraging the strength of deep neural networks. The grouping strategy [16] divides all samples according to the availability of data sources, and then multiple classifiers are learned for late fusion. Although effective, this strategy does not scale well for data with a large number of views or in the small-sample-size case. Although the KCCA based algorithm [8] can model incomplete data, it requires one complete (primary) view.

2 Cross Partial Multi-View Networks

Recently, there has been increasing interest in learning from data with multiple views, including multi-view learning and cross-view learning. In contrast, we focus on classification based on data with missing views, termed Partial Multi-View Classification (see Definition 2.1), where samples with different view-missing patterns are involved. The proposed cross partial multi-view networks enable comparability for samples with different combinations of views, instead of samples in two different views, which generalizes the concept of cross-view learning. There are three main challenges for partial multi-view classification: (1) how to project samples with arbitrary view-missing patterns (flexibility) into a common latent space (completeness) for comparability (Section 2.1)? (2) how to make the learned representation reflect the class distribution (structured representation) for separability (Section 2.2)? (3) how to reduce the gap between the representations obtained in the test stage and the training stage for consistency (Section 2.3)? For clarity, we first give the formal definition of partial multi-view classification:

Definition 2.1 (Partial Multi-View Classification (PMVC)) Given the training set $\{S_n, y_n\}_{n=1}^N$, where $S_n$ is a subset of the complete observations $\mathcal{X}_n = \{\mathbf{x}_n^{(v)}\}_{v=1}^V$ (i.e., $S_n \subseteq \mathcal{X}_n$) and $y_n$ is the class label, with $N$ and $V$ being the number of samples and views, respectively, PMVC trains a classifier on training data containing view-missing samples to classify a new instance $S$ with an arbitrary view-missing pattern.

2.1 Multi-View Complete Representation

Considering the first challenge, we aim to design a flexible algorithm that projects samples with arbitrary view-missing patterns into a common space, where the desired latent representation should encode the information from the observed views. Inspired by the reconstruction point of view [32], we give the definition of completeness for multi-view representation as follows:

Definition 2.2 (Completeness for Multi-View Representation) A multi-view representation $\mathbf{h}$ is complete if each observation $\mathbf{x}^{(v)}$ from $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(V)}\}$ can be reconstructed from it by a mapping $f_v(\cdot)$, i.e., $\mathbf{x}^{(v)} = f_v(\mathbf{h})$.

Intuitively, we can reconstruct each view from a complete representation in a numerically stable way. Furthermore, we show that completeness is achieved under the assumption [33] that each view is conditionally independent given the shared multi-view representation. Similar to each view from $\mathcal{X}$, the class label $y$ can also be considered as one (semantic) view; then we have

$$p(y, S|\mathbf{h}) = p(y|\mathbf{h})\, p(S|\mathbf{h}), \quad (1)$$

where $p(S|\mathbf{h}) = p(\mathbf{x}^{(1)}|\mathbf{h})\, p(\mathbf{x}^{(2)}|\mathbf{h}) \cdots p(\mathbf{x}^{(V)}|\mathbf{h})$. We can obtain the common representation by maximizing $p(y, S|\mathbf{h})$.
Based on the different views in $S$, we model the likelihood with respect to $\mathbf{h}$ given the observations $S$ as

$$p(S|\mathbf{h}) \propto e^{-\Delta(S, f(\mathbf{h}; \Theta_r))}, \quad (2)$$

where $\Theta_r$ are the parameters governing the reconstruction mapping $f(\cdot)$ from the common representation $\mathbf{h}$ to the partial observations $S$, with $\Delta(S, f(\mathbf{h}; \Theta_r))$ being the reconstruction loss. From the view of the class label, we model the likelihood with respect to $\mathbf{h}$ given the class label $y$ as

$$p(y|\mathbf{h}) \propto e^{-\Delta(y, g(\mathbf{h}; \Theta_c))}, \quad (3)$$

where $\Theta_c$ are the parameters governing the classification function $g(\cdot)$ based on the common representation $\mathbf{h}$, and $\Delta(y, g(\mathbf{h}; \Theta_c))$ defines the classification loss. Accordingly, assuming the data are independent and identically distributed (IID), the log-likelihood function is induced as

$$\mathcal{L}(\{\mathbf{h}_n\}_{n=1}^N, \Theta_r, \Theta_c) = \sum_{n=1}^N \ln p(y_n, S_n|\mathbf{h}_n) \propto -\sum_{n=1}^N \big[ \Delta(S_n, f(\mathbf{h}_n; \Theta_r)) + \Delta(y_n, g(\mathbf{h}_n; \Theta_c)) \big], \quad (4)$$

where $S_n$ denotes the available views of the $n$th sample. On one hand, we encode the information from the available views into a latent representation $\mathbf{h}_n$ and denote the encoding loss as $\Delta(S_n, f(\mathbf{h}_n; \Theta_r))$. On the other hand, the learned representation should be consistent with the class distribution, which is enforced by minimizing the loss $\Delta(y_n, g(\mathbf{h}_n; \Theta_c))$ to penalize disagreement with the class label. Effectively encoding information from different views is the key requirement for multi-view representation, so we seek a common representation that can recover the partial (available) observations. Accordingly, the following loss is induced:

$$\Delta(S_n, f(\mathbf{h}_n; \Theta_r)) = \ell_r(S_n, \mathbf{h}_n) = \sum_{v=1}^V s_{nv} \| f_v(\mathbf{h}_n; \Theta_r^{(v)}) - \mathbf{x}_n^{(v)} \|^2, \quad (5)$$

where $\Delta(S_n, f(\mathbf{h}_n; \Theta_r))$ is specialized as the reconstruction loss $\ell_r(S_n, \mathbf{h}_n)$, and $s_{nv}$ indicates the availability of the $v$th view for the $n$th sample, i.e., $s_{nv} = 1$ and $s_{nv} = 0$ indicate an available and an unavailable view, respectively. $f_v(\cdot\,; \Theta_r^{(v)})$ is the reconstruction network for the $v$th view, parameterized by $\Theta_r^{(v)}$. In this way, $\mathbf{h}_n$ encodes comprehensive information from the different available views, and different samples (regardless of their missing patterns) are associated with representations in a common space, making them comparable. Ideally, minimizing Eq. (5) induces a complete representation. Since the complete representation encodes information from different views, it should be versatile compared with each single view. We give the definition of versatility for multi-view representation as follows:

Definition 2.3 (Versatility for Multi-View Representation) Given the observations $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(V)}$ from $V$ views, the multi-view representation $\mathbf{h}$ is versatile if, for every view $v$ and every mapping $\phi(\cdot)$ with $\mathbf{y}^{(v)} = \phi(\mathbf{x}^{(v)})$, there exists a mapping $\psi(\cdot)$ satisfying $\mathbf{y}^{(v)} = \psi(\mathbf{h})$, where $\mathbf{h}$ is the corresponding multi-view representation for the sample $S = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(V)}\}$.

Accordingly, we have the following theoretical result:

Proposition 2.1 (Versatility of the Multi-View Representation from Eq. (5)) There exists a solution (with respect to the latent representation $\mathbf{h}$) to Eq. (5) that holds the versatility.

Proof 2.1 Ideally, according to Eq. (5), there exists $\mathbf{x}^{(v)} = f_v(\mathbf{h}; \Theta_r^{(v)})$, where $f_v(\cdot)$ is the mapping from $\mathbf{h}$ to $\mathbf{x}^{(v)}$. Hence, for every mapping $\phi(\cdot)$ with $\mathbf{y}^{(v)} = \phi(\mathbf{x}^{(v)})$, there exists a mapping $\psi(\cdot)$ satisfying $\mathbf{y}^{(v)} = \psi(\mathbf{h})$, obtained by defining $\psi(\cdot) = \phi(f_v(\cdot))$. This proves the versatility of the latent representation $\mathbf{h}$ based on the multi-view observations $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(V)}\}$.
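To make the reconstruction objective in Eq. (5) concrete, the following is a minimal PyTorch sketch of the masked reconstruction loss. The two-layer decoders, the hidden width of 256 and all names are illustrative assumptions, not the architecture used in the paper.

```python
# A minimal sketch of the masked reconstruction loss in Eq. (5):
# one decoder f_v per view maps the latent code h back to view v.
import torch
import torch.nn as nn

class ReconstructionNets(nn.Module):
    def __init__(self, latent_dim, view_dims):
        super().__init__()
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, d))
            for d in view_dims)

    def recon_loss(self, h, views, mask):
        # h: (N, K) latent codes; views[v]: (N, d_v) observations, with
        # missing entries zero-filled; mask: (N, V) holds s_nv in {0, 1}
        loss = h.new_zeros(())
        for v, f_v in enumerate(self.decoders):
            err = ((f_v(h) - views[v]) ** 2).sum(dim=1)  # ||f_v(h_n) - x_n^(v)||^2
            loss = loss + (mask[:, v].float() * err).sum()  # s_nv masks missing views
        return loss / h.shape[0]
```

As described below in Algorithm 1, this loss is minimized alternately over the decoder parameters and the latent codes.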
In practical cases, it is usually difficult to guarantee exact versatility for the latent representation; the goal is then to minimize the error $e_y = \sum_{v=1}^V \| \psi(\mathbf{h}) - \phi(\mathbf{x}^{(v)}) \|^2$ (i.e., $\sum_{v=1}^V \| \phi(f_v(\mathbf{h}; \Theta_r^{(v)})) - \phi(\mathbf{x}^{(v)}) \|^2$), which is inversely proportional to the degree of versatility. Fortunately, it is easy to show that $K e_r$, with $e_r = \sum_{v=1}^V \| f_v(\mathbf{h}; \Theta_r^{(v)}) - \mathbf{x}^{(v)} \|^2$ from Eq. (5), is an upper bound of $e_y$ if $\phi(\cdot)$ is Lipschitz continuous with Lipschitz constant $K$. Although the proof is given under the condition that all views are available, it is intuitive and straightforward to generalize the result to the view-missing case.

2.2 Classification on Structured Latent Representation

Multiclass classification remains challenging due to possibly confusing classes [34]. For the second challenge, we aim to ensure that the learned representation is structured for separability through a clustering-like loss. Specifically, we should minimize the following classification loss:

$$\Delta(y_n, \hat{y}) = \Delta(y_n, g(\mathbf{h}_n; \Theta_c)), \quad (6)$$

where $g(\mathbf{h}_n; \Theta_c) = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{\mathbf{h} \in T(y)} F(\mathbf{h}, \mathbf{h}_n)$ and $F(\mathbf{h}, \mathbf{h}_n) = \varphi(\mathbf{h}; \Theta_c)^T \varphi(\mathbf{h}_n; \Theta_c)$, with $\varphi(\cdot\,; \Theta_c)$ being the feature mapping function for $\mathbf{h}$ and $T(y)$ being the set of latent representations from class $y$. In our implementation, we set $\varphi(\mathbf{h}; \Theta_c) = \mathbf{h}$ for simplicity and effectiveness. By jointly considering classification and representation learning, the misclassification loss is specified as

$$\ell_c(y_n, y, \mathbf{h}_n) = \max_{y \in \mathcal{Y}} \Big( 0,\ \Delta(y_n, y) + \mathbb{E}_{\mathbf{h} \in T(y)} F(\mathbf{h}, \mathbf{h}_n) - \mathbb{E}_{\mathbf{h} \in T(y_n)} F(\mathbf{h}, \mathbf{h}_n) \Big). \quad (7)$$

Algorithm 1: Algorithm for CPM-Nets

/* Training */
Input: partial multi-view dataset $\mathcal{D} = \{S_n, y_n\}_{n=1}^N$, hyperparameter $\lambda$.
Initialize: $\{\mathbf{h}_n\}_{n=1}^N$ and $\{\Theta_r^{(v)}\}_{v=1}^V$ with random values.
while not converged do
    for $v = 1 : V$ do
        Update the network parameters $\Theta_r^{(v)}$ with gradient descent:
        $\Theta_r^{(v)} \leftarrow \Theta_r^{(v)} - \alpha\, \partial \big( \frac{1}{N} \sum_{n=1}^N \ell_r(S_n, \mathbf{h}_n; \Theta_r) \big) / \partial \Theta_r^{(v)}$;
    end
    for $n = 1 : N$ do
        Update the latent representation $\mathbf{h}_n$ with gradient descent:
        $\mathbf{h}_n \leftarrow \mathbf{h}_n - \alpha\, \partial \big( \frac{1}{N} \sum_{n=1}^N \ell_r(S_n, \mathbf{h}_n; \Theta_r) + \lambda \ell_c(y_n, y, \mathbf{h}_n) \big) / \partial \mathbf{h}_n$;
    end
end
Output: network parameters $\{\Theta_r^{(v)}\}_{v=1}^V$ and latent representations $\{\mathbf{h}_n\}_{n=1}^N$.

/* Test */
Train the retuned networks $\{\Theta_{rt}^{(v)}\}_{v=1}^V$ for test;
Calculate the latent representation of the test instance with the retuned networks;
Classify the test instance with $y = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{\mathbf{h} \in T(y)} F(\mathbf{h}, \mathbf{h}_{\text{test}})$.

Compared with the widely used parametric classification equipped with a cross-entropy loss, the clustering-like loss not only penalizes misclassification but also ensures a structured representation. Specifically, for a correctly classified sample, i.e., $\hat{y} = y_n$, there is no loss. For an incorrectly classified sample, i.e., $\hat{y} \neq y_n$, it enforces the similarity between $\mathbf{h}_n$ and the center corresponding to class $y_n$ to be larger than that between $\mathbf{h}_n$ and the center corresponding to class $\hat{y}$ (the wrong label) by a margin $\Delta(y_n, \hat{y})$. Hence, the proposed nonparametric loss naturally leads to a representation with clustering structure. Based on the above considerations, the overall objective function is induced as

$$\min_{\{\mathbf{h}_n\}_{n=1}^N, \Theta_r} \frac{1}{N} \sum_{n=1}^N \ell_r(S_n, \mathbf{h}_n; \Theta_r) + \lambda\, \ell_c(y_n, y, \mathbf{h}_n), \quad (8)$$

where $\lambda > 0$ balances the belief degrees of the information from the multiple views and from the class labels.
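As an illustration of the clustering-like loss in Eq. (7), below is a minimal sketch assuming $\varphi(\mathbf{h}) = \mathbf{h}$ as in our implementation. Tensor shapes, the default margin, and all names are illustrative; the sketch assumes integer class labels and that every class appears in the batch.

```python
# A minimal sketch of the clustering-like loss in Eq. (7) with phi(h) = h.
import torch

def clustering_loss(h, labels, margin=1.0):
    # h: (N, K) latent codes; labels: (N,) int64 class ids in {0, ..., C-1}
    C = int(labels.max().item()) + 1
    # Since F is a plain inner product, E_{h' in T(y)} F(h', h_n) equals
    # the inner product between h_n and the centroid of class y.
    centroids = torch.stack([h[labels == c].mean(dim=0) for c in range(C)])
    sims = h @ centroids.t()                          # (N, C): F(c_y, h_n)
    own = sims.gather(1, labels.view(-1, 1))          # F with the true class
    delta = (torch.arange(C, device=h.device).view(1, -1)
             != labels.view(-1, 1)).float() * margin  # Delta(y_n, y)
    # hinge over classes: max_y (0, Delta + F(c_y, h_n) - F(c_{y_n}, h_n));
    # the y = y_n entry is exactly 0, so the max already includes 0
    return torch.clamp(delta + sims - own, min=0).max(dim=1).values.mean()
```

Note that the loss vanishes exactly when each $\mathbf{h}_n$ is closer (in inner product) to its own class centroid than to every other centroid by the margin, which is what produces the clustering structure.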
2.3 Test: Towards Consistency with the Training Stage

The last challenge lies in the gap between the training and test stages in representation learning. To classify a test sample with incomplete views $S$, we need to obtain its common representation $\mathbf{h}$. A straightforward way is to optimize the objective $\min_{\mathbf{h}} \ell_r(S, f(\mathbf{h}; \Theta_r))$ to encode the information from $S$ into $\mathbf{h}$. This raises a new issue: how to ensure that the representations obtained in the test stage are consistent with those of the training stage? The gap originates from the difference between the objectives of the two stages. Specifically, at test time we can obtain the unified representation with $\mathbf{h} = \arg\min_{\mathbf{h}} \ell_r(S, f(\mathbf{h}; \Theta_r))$ and then conduct classification with $y = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{\mathbf{h}_n \in T(y)} F(\mathbf{h}, \mathbf{h}_n)$. However, this differs from representation learning in the training stage, which simultaneously considers the reconstruction and classification errors. To address this issue, we introduce a fine-tuning strategy based on $\{S_n, \mathbf{h}_n\}_{n=1}^N$ obtained after training, updating the networks $\{f_v(\mathbf{h}; \Theta_r^{(v)})\}_{v=1}^V$ for a consistent mapping from observations to the latent representation. Accordingly, in the test stage we obtain the retuned encoding networks $\{f'_v(\mathbf{h}; \Theta_{rt}^{(v)})\}_{v=1}^V$ by fine-tuning the networks $\{f_v(\mathbf{h}; \Theta_r^{(v)})\}_{v=1}^V$. Subsequently, we solve the objective $\min_{\mathbf{h}} \ell_r(S, f'(\mathbf{h}; \Theta_{rt}))$ to obtain a latent representation that is consistent with training. The optimization of the proposed CPM-Nets and the test procedure are summarized in Algorithm 1.
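A minimal sketch of the test stage of Algorithm 1 follows: with the retuned decoders frozen, the latent code of a test sample is obtained by gradient descent on the reconstruction loss over its available views, and the label is the class with the most similar centroid. It reuses the illustrative ReconstructionNets from the earlier sketch; the optimizer choice, step count and learning rate are assumptions.

```python
# A sketch of test-stage inference: h* = argmin_h l_r(S, f'(h)), then
# classification by the most similar class centroid.
import torch

def infer_and_classify(model, views, mask, centroids, latent_dim,
                       steps=200, lr=0.1):
    for p in model.parameters():          # freeze the retuned decoders
        p.requires_grad_(False)
    h = torch.zeros(mask.shape[0], latent_dim, requires_grad=True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):                # gradient descent over h only
        opt.zero_grad()
        loss = model.recon_loss(h, views, mask)
        loss.backward()
        opt.step()
    h = h.detach()
    return (h @ centroids.t()).argmax(dim=1)   # y = argmax_y F(c_y, h)
```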
2.4 Discussion on Key Components

CPM-Nets is composed of two key components, i.e., the encoding networks and the clustering-like classification, which differ from conventional designs; detailed explanations are therefore provided.

Encoding scheme. To encode the information from multiple views into a common representation, there is an alternative route, i.e., $\ell_r(S_n, \mathbf{h}_n) = \sum_{v=1}^V s_{nv} \| f(\mathbf{x}_n^{(v)}; \Theta^{(v)}) - \mathbf{h}_n \|^2$. This differs from the scheme used in our model in Eq. (5), i.e., $\ell_r(S_n, \mathbf{h}_n) = \sum_{v=1}^V s_{nv} \| f(\mathbf{h}_n; \Theta^{(v)}) - \mathbf{x}_n^{(v)} \|^2$. The underlying assumption in our model is that the information in the different views originates from a latent representation $\mathbf{h}$, which can therefore be mapped to each individual view. The alternative, in contrast, assumes that the latent representation can be obtained by mapping each single view, which is basically not the case in real applications. Ideally, minimizing the alternative loss would enforce the representations of different views to be the same, which is not reasonable, especially for highly independent views. From the viewpoint of information theory, the encoding network for the $v$th view can be considered a communication channel with a fixed property, i.e., $p(\mathbf{x}^{(v)}|\mathbf{h})$ for our model and $p(\mathbf{h}|\mathbf{x}^{(v)})$ for the alternative, where the degradation process mimics data transmission. It is therefore more reasonable to send comprehensive information and receive partial information, i.e., $p(\mathbf{x}^{(v)}|\mathbf{h})$, than its counterpart of sending partial data and receiving comprehensive data, i.e., $p(\mathbf{h}|\mathbf{x}^{(v)})$. The theoretical results in Section 2.1 also support this analysis.

Classification model. For classification, the widely used strategy is to learn a classification function based on $\mathbf{h}$, i.e., $y = f(\mathbf{h}; \Theta)$ parameterized by $\Theta$. Compared with this manner, the reasons for using the clustering-like classifier in our model are as follows. First, jointly learning the latent representation and a parameterized classifier is likely an under-constrained problem, which may find representations that fit the training data well but do not reflect the underlying patterns, so the generalization ability may be affected [35]. Second, the clustering-like classification produces compactness within the same class and separability between different classes for the learned representation, making the classifier interpretable. Third, the nonparametric design reduces the burden of parameter tuning and reflects a simpler inductive bias, which is especially beneficial in the small-sample-size regime [36].

3 Experiments

3.1 Experiment Setting

We conduct experiments on the following datasets. ORL^2 contains 10 facial images for each of 40 subjects. PIE^3: a subset containing 680 facial images of 68 subjects is used. Yale B: similar to previous work [37], we use a subset which contains 650 images of 10 subjects. For ORL, PIE and Yale B, three types of features are extracted: intensity, LBP and Gabor. CUB [38] contains different categories of birds; the first 10 categories are used, with deep visual features from GoogLeNet and text features from doc2vec [39] as the two views. Handwritten^4 contains 10 categories corresponding to digits 0 to 9, with 200 images per category and 6 types of image features. Animal consists of 10158 images from 50 classes, with two types of deep features extracted with DECAF [40] and VGG19 [41].

We compare the proposed CPM-Nets with the following methods: (1) FeatConcate simply concatenates the multiple types of features from different views. (2) CCA [9] maps multiple types of features into one common space and subsequently concatenates the low-dimensional features of the different views. (3) DCCA (Deep Canonical Correlation Analysis) [11] learns low-dimensional features with neural networks and concatenates them. (4) DCCAE (Deep Canonically Correlated AutoEncoders) [24] employs autoencoders for common representations and then combines the projected low-dimensional features. (5) KCCA (Kernelized CCA) [10] employs feature mappings induced by positive-definite kernels. (6) MDcR (Multi-view Dimensionality co-Reduction) [42] applies kernel matching to regularize the dependence across multiple views and projects each view onto a low-dimensional space. (7) DMF-MVC (Deep Semi-NMF for Multi-View Clustering) [43] utilizes a deep structure through semi-nonnegative matrix factorization to seek a common feature representation. (8) ITML (Information-Theoretic Metric Learning) [44] characterizes the metric using a Mahalanobis distance function and solves the problem as a particular Bregman optimization. (9) LMNN (Large Margin Nearest Neighbors) [45] searches for a Mahalanobis distance metric that optimizes the k-nearest-neighbors classifier. For the metric learning methods, the original features of the multiple views are concatenated, and the new representation is then obtained with the projection induced by the learned metric matrix.

^2 https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
^3 http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html
^4 https://archive.ics.uci.edu/ml/datasets/Multiple+Features

For all methods, we tune the parameters with 5-fold cross validation. For the CCA-based methods, we select the two views giving the best performance. For our CPM-Nets, we set the dimensionality $K$ of the latent representation from $\{64, 128, 256\}$ and tune the parameter $\lambda$ from $\{0.1, 1, 10\}$ for all datasets. We run each method 10 times and report the mean values and standard deviations. Please refer to the supplementary material for the details of the network architectures and parameter settings.
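For concreteness, here is a hypothetical sketch of the tuning protocol just described: 5-fold cross validation over $K \in \{64, 128, 256\}$ and $\lambda \in \{0.1, 1, 10\}$. The helper `train_and_eval` is a placeholder for training CPM-Nets on one fold and returning validation accuracy; it is not a function from the paper.

```python
# A hypothetical grid-search sketch for (K, lambda) with 5-fold CV.
from itertools import product

def select_hyperparams(folds, train_and_eval):
    # folds: list of (train_split, val_split) pairs from 5-fold CV
    def cv_score(K, lam):
        return sum(train_and_eval(tr, va, K, lam) for tr, va in folds) / len(folds)
    return max(product([64, 128, 256], [0.1, 1.0, 10.0]),
               key=lambda p: cv_score(*p))
```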
Figure 2: Performance comparison under different missing rates (η). [Six panels, one per dataset (panel (f): Handwritten), plotting accuracy (0%-100%) against η ∈ [0, 0.5] for DMF, CCA, FeatCon, KCCA, DCCA, DCCAE, MDcR, ITML, LMNN and Ours.]

3.2 Experimental Results

First, we evaluate our algorithm against state-of-the-art multi-view representation learning methods, investigating performance with respect to varying missing rates. The missing rate is defined as $\eta = \frac{\sum_v M_v}{V \times N}$, where $M_v$ denotes the number of samples missing the $v$th view. Since the datasets may have different numbers of views, samples are randomly selected as missing multi-view ones, and the missing views are selected at random while guaranteeing that at least one view is available. As a result, partial multi-view data with diverse missing patterns are obtained. For the compared methods, the missing views are filled with mean values computed from the available samples within the same class. From the results in Fig. 2, we make the following observations: (1) without missing views, our algorithm achieves highly competitive performance on all datasets, which validates its stability on complete multi-view data; (2) as the missing rate increases, the performance degradation of the compared methods is much larger than ours. Taking the results on ORL as an example, ours and LMNN obtain accuracies of 98.4% and 98.0%, respectively, while the gap becomes much larger as the missing rate increases; (3) our model is rather robust to view-missing data, performing relatively well even in heavily missing cases. For example, the performance decline on ORL is less than 5% as the missing rate increases from η = 0.0 to η = 0.3.

Furthermore, we also fill the missing views with a recently proposed imputation method, the Cascaded Residual Autoencoder (CRA) [5]. Since CRA requires a subset of samples with complete views for training, we set 50% of the data as complete-view samples and the rest as samples with missing views (missing rate η = 0.5). The comparison results are shown in Fig. 3. Filling with CRA is generally better than using mean values, since CRA captures the correlation among the different views. Although the missing views are imputed by CRA using the samples with complete views, our algorithm still demonstrates a clear advantage, performing best on all six datasets.

Figure 3: Performance comparison with view completion using mean values and the cascaded residual autoencoder (CRA) [5] (missing rate η = 0.5). [Two panels of accuracy bars, comparing FeatCon with CRA imputation, FeatCon with mean imputation, and Ours; left panel: ORL, PIE, Yale B; right panel: CUB, Animal, Handwritten.]
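As a side note on the protocol above, the following sketch shows one way the random view-missing patterns at a given missing rate could be generated. This is our reading of the described procedure, not code from the paper; it requires $\eta \le (V-1)/V$ so that every sample can keep one view.

```python
# A sketch of random view-missing pattern generation for a target
# missing rate eta = (sum_v M_v) / (V * N), keeping at least one
# observed view per sample.
import numpy as np

def random_missing_mask(n_samples, n_views, eta, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.ones((n_samples, n_views), dtype=int)   # s_nv = 1: observed
    n_missing = int(round(eta * n_samples * n_views))
    assert n_missing <= n_samples * (n_views - 1), "eta too large"
    while n_missing > 0:
        i = rng.integers(n_samples)
        j = rng.integers(n_views)
        # drop the view only if it is still observed and the sample
        # would keep at least one available view
        if mask[i, j] == 1 and mask[i].sum() > 1:
            mask[i, j] = 0
            n_missing -= 1
    return mask
```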
Figure 4: Visualization of representations with missing rate η = 0.5, where U and S indicate unsupervised and supervised representation learning, respectively. [Panels: (a) FeatCon (U), (b) DCCA (U), (c) Ours (U), (d) LMNN (S), (e) ITML (S), (f) Ours (S); points colored by digit class 0-9.] (Zoom in for best view.)

We visualize the representations from the different methods on Handwritten to investigate the improvement brought by CPM-Nets. As shown in Fig. 4, subfigures (a)-(c) show representations obtained in an unsupervised manner. The latent representation from our algorithm reveals the underlying class distribution much better. With label information introduced, the representations from CPM-Nets are further improved: the clusters become more compact and the margins between different classes become clearer, which validates the effectiveness of the clustering-like loss. It is noteworthy that in the experiments we jointly exploit all samples and all views under random view-missing patterns, demonstrating the flexibility of handling partial multi-view data, while Fig. 4 supports the claim of structured representation.

4 Conclusions

We proposed a novel algorithm for partial multi-view data classification, named CPM-Nets, which jointly exploits all samples and all views and is flexible for arbitrary view-missing patterns. Our algorithm focuses on learning a complete and thus versatile representation to handle the complex correlations among multiple views. The common representation also endows the model with the flexibility to handle data with an arbitrary number of views and complex view-missing patterns, which distinguishes it from existing ad hoc methods. Equipped with a clustering-like classification loss, the learned representation is well structured, making the classifier interpretable. We empirically validated that the proposed algorithm is relatively robust to heavy and complex view missing.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (61976151, 61602337, 61732011, 61702358). We also appreciate the discussion with Ganbin Zhou and the valuable comments from all the reviewers.

References

[1] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE TPAMI, 41(2):423-443, 2019.
[2] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.
[3] Paramveer Dhillon, Dean Foster, and Lyle Ungar. Multi-view learning of word embeddings via CCA. In NIPS, pages 199-207, 2011.
[4] Shao-Yuan Li, Yuan Jiang, and Zhi-Hua Zhou. Partial multi-view clustering. In AAAI, pages 1968-1974, 2014.
[5] Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In CVPR, pages 1405-1414, 2017.
[6] Xinwang Liu, Xinzhong Zhu, Miaomiao Li, Lei Wang, Chang Tang, Jianping Yin, Dinggang Shen, Huaimin Wang, and Wen Gao. Late fusion incomplete multi-view clustering. IEEE TPAMI, 2018.
[7] Mingxia Liu, Jun Zhang, Pew-Thian Yap, and Dinggang Shen. Diagnosis of Alzheimer's disease using view-aligned hypergraph learning with incomplete multi-modality data. In MICCAI, pages 308-316, 2016.
[8] Anusua Trivedi, Piyush Rai, Hal Daumé III, and Scott L DuVall. Multiview clustering with incomplete views. In NIPS Workshop, volume 224, 2010.
[9] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321-377, 1936.
[10] Shotaro Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.
[11] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247-1255, 2013.
[12] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.
[13] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. JMLR, 11(Aug):2287-2322, 2010.
[14] Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. Deep adversarial learning for multi-modality missing data completion. In KDD, pages 1158-1166, 2018.
[15] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, pages 689-696, 2011.
[16] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, and Jieping Ye. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data. In KDD, pages 1149-1157, 2012.
[17] Abhishek Kumar, Piyush Rai, and Hal Daumé. Co-regularized multi-view spectral clustering. In NIPS, pages 1413-1421, 2011.
[18] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In ICML, pages 393-400, 2011.
[19] Changqing Zhang, Huazhu Fu, Si Liu, Guangcan Liu, and Xiaochun Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582-1590, 2015.
[20] Changqing Zhang, Qinghua Hu, Huazhu Fu, Pengfei Zhu, and Xiaochun Cao. Latent multi-view subspace clustering. In CVPR, pages 4279-4287, 2017.
[21] Zhiyong Yang, Qianqian Xu, Weigang Zhang, Xiaochun Cao, and Qingming Huang. Split multiplicative multi-view subspace clustering. IEEE Transactions on Image Processing, 2019.
[22] Haichao Zhang, Thomas S Huang, Nasser M Nasrabadi, and Yanning Zhang. Heterogeneous multi-metric learning for multi-sensor fusion. In 14th International Conference on Information Fusion, pages 1-8, 2011.
[23] Heng Zhang, Vishal M Patel, and Rama Chellappa. Hierarchical multimodal metric learning for multimodal classification. In CVPR, pages 3057-3065, 2017.
[24] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, pages 1083-1092, 2015.
[25] Xiaohong Chen, Songcan Chen, Hui Xue, and Xudong Zhou. A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data. Pattern Recognition, 45(5):2005-2018, 2012.
[26] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM MM, pages 251-260, 2010.
[27] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Unsupervised cross-modal alignment of speech and text embedding spaces. In NIPS, pages 7365-7375, 2018.
[28] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Learning aligned cross-modal representations from weakly aligned data. In CVPR, pages 2940-2949, 2016.
[29] Joey Tianyi Zhou, Ivor W Tsang, Sinno Jialin Pan, and Mingkui Tan. Multi-class heterogeneous domain adaptation. JMLR, 20(57):1-31, 2019.
[30] Joey Tianyi Zhou, Sinno Jialin Pan, and Ivor W Tsang. A deep learning framework for hybrid heterogeneous transfer learning. Artificial Intelligence, 2019.
[31] Chao Shang, Aaron Palmer, Jiangwen Sun, Ko-Shin Chen, Jin Lu, and Jinbo Bi. VIGAN: Missing view imputation with generative adversarial networks. In ICBD, pages 766-775, 2017.
[32] Tai Sing Lee. Image representation using 2D Gabor wavelets. IEEE TPAMI, 18(10):959-971, 1996.
[33] Martha White, Xinhua Zhang, Dale Schuurmans, and Yao-liang Yu. Convex multi-view subspace learning. In NIPS, pages 1673-1681, 2012.
[34] Weiwei Liu, Ivor W Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. JMLR, 18(1):3300-3337, 2017.
[35] Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In NIPS, pages 1-11, 2018.
[36] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 1-11, 2017.
[37] Athinodoros S Georghiades, Peter N Belhumeur, and David J Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, (6):643-660, 2001.
[38] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[39] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188-1196, 2014.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[41] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[42] Changqing Zhang, Huazhu Fu, Qinghua Hu, Pengfei Zhu, and Xiaochun Cao. Flexible multi-view dimensionality co-reduction. IEEE TIP, 26(2):648-659, 2017.
[43] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In AAAI, pages 2921-2927, 2017.
[44] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209-216, 2007.
[45] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10(Feb):207-244, 2009.