Journal of Machine Learning Research 24 (2023) 1-32. Submitted 2/19; Revised 7/23; Published 10/23.

Multi-view Collaborative Gaussian Process Dynamical Systems

Shiliang Sun (slsun@cs.ecnu.edu.cn)
School of Computer Science and Technology, East China Normal University, Shanghai 200062, P. R. China
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, P. R. China

Jingjing Fei (jingjingfei16@163.com)
Jing Zhao (jzhao@cs.ecnu.edu.cn)
Liang Mao (lmao14@outlook.com)
School of Computer Science and Technology, East China Normal University, Shanghai 200062, P. R. China

Editor: Massimiliano Pontil

Abstract

Gaussian process dynamical systems (GPDSs) have shown their effectiveness in many machine learning tasks. However, when addressing multi-view data, current GPDSs do not explicitly model the dependence between private and shared latent variables; instead, they introduce a structurally and intrinsically discrete segmentation of the latent space. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which assumes that the private latent variable for each view is controlled by its dynamical prior and the shared latent variable. The relevance between private and shared latent variables can be learned automatically by optimization in the Bayesian framework. The model is capable of learning an effective latent representation and of generating novel data of one view given data of the other view. We evaluate our model on two-view data sets, and it obtains better performance than the state-of-the-art multi-view GPDSs.

Keywords: Gaussian process, multi-view machine learning, dynamical system, variational inference, multi-output modeling

1. Introduction

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006).
GPs are stochastic processes over real-valued functions and are completely specified by mean functions and covariance functions (Rasmussen and Williams, 2006). Recently, GPs have proved successful in various areas of machine learning (Lawrence and Jordan, 2005; Andreas and Carlos, 2007; Damianou et al., 2011; Lüthi et al., 2018; Feurer et al., 2018; Wei et al., 2019; Medina et al., 2019) because they provide flexible function approximation. For example, to implement nonlinear dimensionality reduction, GP latent variable models (GPLVMs) (Lawrence, 2004, 2005; Titsias and Lawrence, 2010) have been presented, which use global latent variables and assume conditional independence among multiple outputs. For modelling dynamics in sequential data, several Gaussian process dynamical systems (GPDSs) have been proposed, which extend GPLVMs by adding dynamical priors on the latent variables, such as GP dynamical models (GPDMs) (Wang et al., 2006), variational GPDSs (VGPDSs) (Damianou et al., 2011), variational dependent multi-output GPDSs (VDM-GPDSs) (Zhao and Sun, 2016) and collaborative Gaussian process dynamical systems (CGPDSs) (Zhao et al., 2018). Specifically, the GPDM models the dynamics by placing a Markov prior on the latent space and characterizes the variability among outputs by constructing the output variances with different parameters. The VGPDS employs a GP dynamical prior on the latent space, which is more flexible and can capture specific dynamical information, such as periodicity, with appropriate kernels. The VDM-GPDS models the dependence among multiple outputs and employs convolution processes to capture the multi-output dependence explicitly.

(©2023 Shiliang Sun, Jingjing Fei, Jing Zhao and Liang Mao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/19-094.html.)
The VDM-GPDS obtains better performance than the GPDM and VGPDS, but it is time-consuming to train owing to the introduced convolution processes. The CGPDS expresses each output as the sum of a global latent process and a local latent process, which can capture the universality and individuality of all outputs. Moreover, the CGPDS assumes that the latent processes are conditionally independent, which ensures that the resulting evidence lower bound decomposes across dimensions and allows stochastic optimization. We will detail CGPDSs in Section 2 in a self-contained form.

With the rapid development of information technology, more and more data exhibit multi-view characteristics, such as the URL link and text of a web document, the audio and image frames of a video, or the surrounding words and image of a web image. Data of different modalities often offer complementary information, and multi-view learning can exploit this information to learn representations that are more comprehensive and expressive than those of single-view learning (Sun et al., 2019). More specifically, multi-view learning uses one function to model each view and optimizes all functions jointly during training. Consensus and complementarity are the two core principles of multi-view learning. The consensus principle maximizes the agreement among the representations of different views, and the complementarity principle exploits the complementary information contained in different views to represent multi-view data comprehensively (Li et al., 2018). Since multi-view learning can use the consensus and complementarity properties of multiple views and exploit the redundant views of the same input data, it is often more natural and effective than single-view learning (Sun, 2013; Xu et al., 2013; Li et al., 2018; Ding et al., 2018). Recently, several models have extended GPLVMs or GPDSs to the scenario of multi-view learning.
The shared GPLVM assumes that each view is generated from the same low-dimensional latent variable corrupted by additive Gaussian noise (Shon et al., 2006). Furthermore, a new version of the shared GPLVM, the subspace GPLVM, was proposed (Ek and Lawrence, 2009), in which the latent space for each view is factorized into a shared part, which captures the information common across the views, and a private part, which explains the remaining variance. Salzmann et al. (2010) learned the dimensionality of the factorization by introducing regularizers. The manifold relevance determination (MRD) model (Damianou et al., 2012) improved on the hard segmentation between the private and shared latent variables and employed a soft segmentation of the latent space. Concretely, the MRD uses learned scales in the automatic relevance determination (ARD) kernels together with a pre-given threshold to decide whether a dimension is a private or a shared latent variable. This threshold needs to be specified manually and often varies across data sets, so its configuration requires expert knowledge and is time-consuming.

The above models do not explicitly model the correlation between private and shared latent variables (dimensions). This kind of model assumption brings a structurally and intrinsically discrete segmentation between the shared and private latent variables. On many real-world data sets, it is quite difficult to cleanly divide the latent space that generates the multi-view observations into shared and private latent information, because the two kinds of information are complexly coupled and interact with each other. For example, in a multi-view data set containing pictures of different faces under the same lighting condition, we can take the characteristics of the faces as private information and the lighting condition as shared information.
It is not difficult to see that intensely bright lighting can affect the characteristics of a face, such as the color of the skin. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which makes full use of the characteristics of multi-view data and the advantages of CGPDSs. The proposed model relaxes the discrete structural segmentation of the latent space and automatically learns the relevance between private and shared latent variables through optimization. Since the private latent variables are determined by their dynamical priors and the shared latent variable, McGPDSs can model more complex and abundant information in the data. Experiments on synthetic and real-world data sets also validate the superiority of the proposed McGPDSs. The contributions of our model are summarized as follows.
1) Our model extends the CGPDS to multi-view learning, and thus possesses the advantages of both multi-view learning and the CGPDS for modeling high-dimensional multi-output data.
2) Our model explicitly models the relationship between shared and private latent variables and automatically learns their relevance.
3) All parameters in our model can be learned through optimization.

The remainder of the paper is structured as follows. Section 2 introduces the related work, including multi-view learning, CGPDSs and several multi-view models based on GPLVMs and GPDSs. Section 3 presents the proposed model in detail. Section 4 describes the inference and learning techniques. Section 5 illustrates the procedure of prediction with McGPDSs. Section 6 provides extensive experimental evaluations to validate the effectiveness of our model, and Section 7 concludes the work and discusses future work.

2. Related Work

In this section, we first briefly review the related work on multi-view learning.
Then we give an introduction to CGPDSs (Zhao et al., 2018) and several multi-view models based on GPLVMs and GPDSs (Shon et al., 2006; Ek and Lawrence, 2009; Damianou et al., 2012).

2.1 Multi-view Learning

Multi-view learning is concerned with learning from data represented by multiple views. It has received increasing attention and has been applied widely. Wei et al. (2018) evaluated the quality of community-based question answering through transductive multi-view learning. Hu et al. (2018) proposed a shareable and individual multi-view metric learning approach for visual recognition. Puyol et al. (2018) described a method of regional multi-view learning for cardiac motion analysis, which was applied to the identification of dilated cardiomyopathy patients. Jing et al. (2018) employed low-rank multi-view embedding learning to predict the popularity of micro videos. Tulsiani et al. (2018) considered multi-view consistency as the supervisory signal for learning shape and pose prediction. In the literature, multi-view learning is closely related to other machine learning methods, such as active learning, domain adaptation and representation learning. More specifically, Muslea et al. (2002) combined co-testing and co-EM, where co-testing is a novel method for active learning with multiple views and co-EM is used to generate classifiers and select the unlabeled points with the largest amount of information for labeling. Muslea et al. (2006) improved co-testing by considering differences between strong and weak views and assuming that strong views carry more information. Domain adaptation solves the problem of adapting a model trained on a source domain to a target domain whose data distribution is largely different. Domain adaptation can be applied in the cross-language text classification task, where documents in different languages represent different views.
Co-training (Wan, 2009) and multi-view co-classification (Amini and Goutte, 2010) have been proposed and successfully applied to this task. Multi-view representation learning has been a promising research topic in recent years on account of its ability to provide abundant and complementary information for learning representations. Multi-view representation learning methods include generative methods, such as multi-modal topic learning (Cohn and Hofmann, 2001; Barnard et al., 2003; Blei and Jordan, 2003), multi-view sparse coding (Jia et al., 2010; Cao et al., 2013; Liu et al., 2014) and multi-view latent space Markov networks (Xing et al., 2012; Chen et al., 2010), and deep neural methods, such as multi-modal autoencoders (Ngiam et al., 2011; Feng et al., 2014; Wang et al., 2015), multi-modal Boltzmann machines (Srivastava and Salakhutdinov, 2012) and multi-modal recurrent neural networks (Karpathy and Fei-Fei, 2015; Mao et al., 2014; Donahue et al., 2015).

2.2 CGPDSs

CGPDSs aim to model multi-output sequential data. As a multi-output model, the CGPDS supposes that each output is the sum of a global latent process and a designed local latent process, so as to capture the dependence among multiple outputs while maintaining the unique characteristics of each output. Since standard Bayesian inference is analytically intractable, CGPDSs adopt variational inference and introduce inducing points to learn the model. Moreover, the evidence lower bound can be decomposed across dimensions thanks to the conditional independence of the outputs, which allows the parameters to be optimized in a stochastic optimization framework. Figure 1 shows the graphical model of CGPDSs. Given multi-output sequential data Y ∈ R^{N×D}, with y_n ∈ R^D being the observation at time t_n ∈ R_+, the CGPDS assumes that there are low-dimensional latent variables X ∈ R^{N×Q} (with Q ≪ D) that generate the observations. Moreover, a GP prior on the low-dimensional latent variables is used to model the dynamics, as in Damianou et al. (2011).
Specifically, the CGPDS is defined as a four-layer GPDS through the following generative process.

Figure 1: The graphical model for the CGPDSs. The gray solid circles represent observations, the black hollow circles represent latent variables, and the cyan hollow circles represent parameters.

p(X | t) = ∏_{q=1}^{Q} N(x_q | 0, K_{t,t}),

where x_q ∈ R^N is the qth column of X and K_{t,t} is the covariance matrix computed by κ_x(t, t′). Then,

p(h | X) = N(h | 0, H_{X,X}),   p({g_j}_{j=1}^{J} | X) = ∏_{j=1}^{J} N(g_j | 0, G^j_{X,X}),

where the latent processes h and {g_j}_{j=1}^{J} are GPs with input x, and H_{X,X} and G^j_{X,X} are covariance matrices computed by κ_h(x, x′) and κ^j_g(x, x′), respectively.

The CGPDS introduces the latent processes h and {g_j}_{j=1}^{J}, which is entirely different from previous GPDSs such as the VGPDS and VDM-GPDS. The VGPDS uses a single GP mapping from X to F (the noise-free version of the output Y), which can only learn the common information among multiple outputs, not the unique information of each output. The VDM-GPDS employs convolution processes to explicitly model the dependence among multiple outputs; however, its mapping from X to F involves an ND × ND matrix, which increases the computational complexity and prevents the model from scaling to large data sets. The CGPDS can capture both the dependence and the differences among multiple outputs with a relatively simple model structure. The outputs are generated as

p(y_d | g, h) = N(y_d | ℓ_d + h, β^{-1} I) = N(y_d | Σ_{j=1}^{J} w_{dj} g_j + h, β^{-1} I),   (1)

where h is the global latent process that captures the dependence among outputs, and ℓ_d is the local latent process specific to the dth output, constructed from the latent processes {g_j}_{j=1}^{J} and the weights {w_{dj}}. The weights {w_{dj}} are local parameters that differ across the D outputs, and β is the inverse variance of the white Gaussian noise. As shown in (1), the idea for constructing the output y_d is inspired by the COGP (Nguyen and Bonilla, 2014).
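As a concrete illustration, the four-layer generative process above can be sketched in a few lines of numpy. All kernels, sizes and parameter values below are illustrative assumptions (in particular, a single shared RBF kernel for h and every g_j), not the paper's experimental settings:

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
N, Q, J, D, beta = 50, 2, 3, 5, 100.0   # toy sizes, hypothetical
jitter = 1e-6 * np.eye(N)

# Layer 1: GP dynamical prior on the latent coordinates, x_q ~ N(0, K_tt).
t = np.linspace(0.0, 1.0, N)
K_tt = rbf(t, t, lengthscale=0.2) + jitter
X = rng.multivariate_normal(np.zeros(N), K_tt, size=Q).T        # N x Q

# Layer 2: global process h and local processes g_j, GPs indexed by X.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_xx = np.exp(-0.5 * sq) + jitter
h = rng.multivariate_normal(np.zeros(N), K_xx)                  # N
g = rng.multivariate_normal(np.zeros(N), K_xx, size=J)          # J x N

# Layer 3: outputs y_d = sum_j w_dj g_j + h + noise, as in Eq. (1).
W = rng.standard_normal((D, J))
Y = (W @ g).T + h[:, None] + rng.normal(0.0, beta**-0.5, (N, D))
```

Forward sampling like this is only meant to show the data flow t → X → (h, g_j) → Y of Eq. (1); the actual model is fit by variational inference rather than by sampling.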
The COGP models the dth output y_d as the weighted sum of the dth local latent process and J global latent processes, which involves (J + D) GPs in total. The CGPDS uses one global latent process h and a local latent process ℓ_d constructed from J (J ≪ D) latent processes {g_j}_{j=1}^{J}, which involves only (J + 1) GPs. In a word, CGPDSs can not only capture the dependence among multiple outputs but also maintain the specific characteristics of each output with fewer parameters. Last but not least, fewer parameters make the model easier to learn.

2.3 Multi-view Models Based on GPLVMs and GPDSs

In this section, we introduce related multi-view models based on GPLVMs and GPDSs, namely the shared GPLVM, the subspace GPLVM and the MRD. The shared GPLVM assumes that all observations are generated from the same low-dimensional latent variable with additive Gaussian noise. Figure 2(a) shows the graphical model of the shared GPLVM. The dotted line represents the back-mapping from the output space, which can constrain the latent space. The assumption of sharing the same latent variable for all views is far from perfect for many data sets, because it means that the data of all views share the main generating parameters. Ideally, the shared latent variable should connect all views while private latent variables differentiate them. The back-constraint from the second view to the latent space represents a bijective relationship between Y^(2) and X^(1,2). It means that the observations in the first view Y^(1) have to be accommodated by discarding information that does not exist in the second view Y^(2). This model can also be considered a feature selection model, because it uses information from one view to determine what is important for the other view. A new version of the shared GPLVM, the subspace GPLVM, introduces a private latent variable for each view and a shared latent variable for all views.
Figure 2(b) shows the graphical model of the subspace GPLVM. The subspace GPLVM learns a factorized latent representation within a single model: it directly concatenates the private latent variable of each view with the shared latent variable, and then generates the data of each view. For inference, the subspace GPLVM seeks the maximum a posteriori (MAP) solution for the latent space. The fact that the latent variables are not integrated out means that it is difficult to determine the structure of the latent space automatically. The idea of employing a factorized latent space in multi-view learning has been proposed in several works (Jia et al., 2011; Virtanen et al., 2012; Zhang et al., 2013). The MRD can also learn a factorized latent representation and relaxes the previous hard discrete segmentation of the latent space. Figure 2(c) shows the graphical model of the MRD with dynamics.

Figure 2: Development of multi-view models based on GPLVMs and GPDSs. (a) shows the shared GPLVM, where all the variance in the observations is captured by a single shared latent variable. (b) shows the subspace GPLVM, which introduces private latent variables to express the variance in each view. (c) shows the MRD, which uses a single latent variable and selects the shared and private latent dimensions according to the ARD weights w^(1) and w^(2) and a predetermined threshold. The shadowed nodes represent observations, the black hollow nodes represent latent variables, and the cyan nodes represent parameters.

In the MRD, a single latent variable X is used as the latent representation for all views, where each dimension of X represents private or shared latent information. The MRD adopts variational inference with inducing points in order to integrate out the latent variable X.
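The ARD-weight-plus-threshold segmentation that the MRD relies on can be sketched as follows. The kernel matches the ARD covariance form used by the MRD, while the weight values, the threshold value 0.1 and the helper names are hypothetical:

```python
import numpy as np

def ard_kernel(Xi, Xj, sigma2, w):
    """ARD covariance: sigma2 * exp(-0.5 * sum_q w_q (x_iq - x_jq)^2)."""
    d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2 * w).sum(-1)
    return sigma2 * np.exp(-0.5 * d2)

def mrd_partition(w1, w2, delta):
    """Split latent dimensions into (private to view 1, shared, private to
    view 2) by comparing the per-view ARD weights with the threshold delta."""
    Q = len(w1)
    shared = [q for q in range(Q) if w1[q] > delta and w2[q] > delta]
    priv1  = [q for q in range(Q) if w1[q] > delta and w2[q] <= delta]
    priv2  = [q for q in range(Q) if w1[q] <= delta and w2[q] > delta]
    return priv1, shared, priv2

# Hypothetical learned weights for Q = 4 dimensions: dimension 0 is shared,
# 1 is private to view 1, 2 is private to view 2, 3 is switched off in both.
w1 = np.array([0.9, 0.8, 0.02, 0.03])
w2 = np.array([0.7, 0.01, 0.85, 0.02])
priv1, shared, priv2 = mrd_partition(w1, w2, 0.1)
```

Note that the partition depends entirely on the hand-picked threshold, which is exactly the manual step the McGPDS is designed to remove.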
More precisely, the outputs of the two views Y^(1) and Y^(2) are assumed to be independent GPs with zero mean and an ARD covariance function, that is,

κ(x_i, x_j) = (σ_ard)^2 exp( −(1/2) Σ_{q=1}^{Q} w_q (x_{i,q} − x_{j,q})^2 ).

The two sets of ARD weights w^(1) and w^(2) in this model can be optimized in the Bayesian framework. An additional threshold δ has to be specified manually for each data set. By comparing the ARD weights with the threshold, the MRD determines whether each dimension is private or shared and divides the latent space into three subspaces, X = (X^(1), X^s, X^(2)). Here, X^s represents the shared subspace, which consists of the set of dimensions q ∈ {1, …, Q} with w^(1)_q > δ and w^(2)_q > δ. X^(1) and X^(2) are the private latent subspaces of the two views: X^(1) is composed of the dimensions with w^(1)_q > δ and w^(2)_q < δ, and analogously for X^(2) (w^(1)_q < δ and w^(2)_q > δ). There are two versions of the MRD model, one with dynamics (a GP prior on the latent variable) and one without.

3. Multi-view Collaborative Gaussian Process Dynamical System

In this section, we extend the CGPDS to the scenario of multi-view learning and propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model. Figure 3 shows the graphical model of the McGPDS. Specifically, we aim to model two views Y^(1) ∈ R^{N×D_1} and Y^(2) ∈ R^{N×D_2} in the same model, where y^(1)_n and y^(2)_n are the observations at time t_n ∈ R_+. We assume there is a shared low-dimensional latent variable X^(1,2) ∈ R^{N×Q} which governs the generation of the private low-dimensional latent variables X^(1) ∈ R^{N×Q} and X^(2) ∈ R^{N×Q}.

Figure 3: The graphical model for the McGPDS. The McGPDS explicitly models the dependence between private and shared latent variables and automatically learns their relevance. The shadowed nodes represent observations.
The black hollow nodes represent latent variables. The private low-dimensional latent variable for each view generates the corresponding observation. Moreover, we place GP priors on the low-dimensional latent variables to model the dynamics. Here, N is the number of training points, D_1 and D_2 are the dimensions of the two views, and Q denotes the dimension of the low-dimensional latent variables (with Q ≪ min(D_1, D_2)). The superscripts (1) and (2) correspond to the first and second view, respectively, and the superscript (1, 2) denotes information shared by the two views. Formally, the generative process is given as follows. The shared low-dimensional latent variable X^(1,2) is assumed to be a multi-dimensional GP indexed by time t, that is,

x^(1,2)_q(t) ~ GP(0, κ^(1,2)_x(t, t′)), q = 1, …, Q,   (2)

where the dimensions of the shared latent function x^(1,2)(t) are independently drawn from a GP with covariance function κ^(1,2)_x(t, t′) with parameters θ^(1,2)_x. Since the latent variable X^(1,2) is conditionally independent given t, we have

p(X^(1,2) | t) = ∏_{q=1}^{Q} N(x^(1,2)_q | 0, K^(1,2)_{t,t}),   (3)

where K^(1,2)_{t,t} is the covariance matrix computed by the kernel κ^(1,2)_x(t, t′). We also introduce two latent variables X̃^(1) and X̃^(2) which follow view-specific dynamical priors, i.e.,

p(X̃^(1) | t) = ∏_{q=1}^{Q} N(x̃^(1)_q | 0, K̃^(1)_{t,t}),   (4)

p(X̃^(2) | t) = ∏_{q=1}^{Q} N(x̃^(2)_q | 0, K̃^(2)_{t,t}),   (5)

where X̃^(1) and X̃^(2) are also assumed to be conditionally independent, and K̃^(1)_{t,t} and K̃^(2)_{t,t} are covariance matrices computed by the kernels κ̃^(1)_x(t, t′) and κ̃^(2)_x(t, t′), respectively. Let X̂^(1) be a noisy version of the shared latent variable X^(1,2), i.e., X̂^(1) ~ N(X̂^(1) | X^(1,2), ε^(1)).
The private latent variable X^(1) is defined as a convex combination of the view-specific latent variable X̃^(1) and X̂^(1), i.e.,

X^(1) = (1 − α^(1)) X̂^(1) + α^(1) X̃^(1),

with a combination weight α^(1) ∈ [0, 1] which adjusts the importance of the two components. The model can automatically learn the dependence between private and shared latent variables by optimizing α^(1). After integrating out X̂^(1), the conditional distribution of X^(1) given X^(1,2) and t is

p(X^(1) | X^(1,2), t) = ∏_{q=1}^{Q} N(x^(1)_q | (1 − α^(1)) x^(1,2)_q, (α^(1))^2 K̃^(1)_{t,t} + (1 − α^(1))^2 ε^(1) I).

Similarly, we define the private latent variable X^(2), with

p(X^(2) | X^(1,2), t) = ∏_{q=1}^{Q} N(x^(2)_q | (1 − α^(2)) x^(1,2)_q, (α^(2))^2 K̃^(2)_{t,t} + (1 − α^(2))^2 ε^(2) I),

where α^(2) ∈ [0, 1] and ε^(2) denotes the variance of the Gaussian noise in the second view. The structure of the latent space in our model is largely different from that of previous multi-view models based on GPLVMs and GPDSs, such as the shared GPLVM, the subspace GPLVM and the MRD. The shared GPLVM employs a single shared latent variable for all views, so all variance in the observations is shared and private information cannot be modeled. The subspace GPLVM introduces a factorized latent space in which each view is connected to an additional private latent space; since it uses MAP estimates, the structure of the latent space cannot be determined automatically. The MRD also employs a single latent space and decides whether a dimension is private or shared according to the weights of the ARD covariance functions and a manually specified threshold. All the above models either use a single latent variable or do not explicitly model the relationship between private and shared latent variables (dimensions). Our model explicitly models the relevance between the shared and private latent spaces, and this relevance can be automatically learned by optimizing the weights α^(1) and α^(2).
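A quick way to check the marginalization above is a scalar Monte Carlo experiment: drawing x̂ ~ N(x_shared, ε) and x̃ from its prior and combining them with weight α reproduces the mean (1 − α) x_shared and variance α² K̃ + (1 − α)² ε of p(X^(1) | X^(1,2), t). The numerical values below are arbitrary test settings:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, eps, k_tilde, n = 0.3, 0.05, 1.0, 400_000  # arbitrary test values

# Scalar instance of x^(1) = (1 - alpha) * xhat + alpha * xtilde, where
# xhat ~ N(x_shared, eps) is the noisy copy of the shared latent value and
# xtilde ~ N(0, k_tilde) follows the view-specific dynamical prior.
x_shared = 1.5
xhat = rng.normal(x_shared, np.sqrt(eps), n)
xtilde = rng.normal(0.0, np.sqrt(k_tilde), n)
x_priv = (1 - alpha) * xhat + alpha * xtilde

# Marginalizing xhat gives mean (1 - alpha) * x_shared and variance
# alpha^2 * k_tilde + (1 - alpha)^2 * eps, matching p(X^(1) | X^(1,2), t).
mean_pred = (1 - alpha) * x_shared
var_pred = alpha**2 * k_tilde + (1 - alpha)**2 * eps
```

With these settings the predicted mean and variance are 1.05 and 0.1145, and the empirical moments of `x_priv` agree to within sampling error.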
The mappings from X^(1) to Y^(1) and from X^(2) to Y^(2) in the McGPDS employ the same idea as the mapping from X to Y in the CGPDS (Zhao et al., 2018). Owing to the conditional independence assumption, the distributions of the outputs can be written as products over dimensions, that is,

p(Y^(1) | X^(1)) = ∏_{d=1}^{D_1} N( y^(1)_d | Σ_{j=1}^{J} w^(1)_{dj} g^(1)_j(X^(1)) + h^(1)(X^(1)), (β^(1))^{-1} I ),

p(Y^(2) | X^(2)) = ∏_{d=1}^{D_2} N( y^(2)_d | Σ_{j=1}^{J} w^(2)_{dj} g^(2)_j(X^(2)) + h^(2)(X^(2)), (β^(2))^{-1} I ),

where β^(1) and β^(2) are the inverse variances of the white Gaussian noise. The latent processes h^(1) and {g^(1)_j}_{j=1}^{J} are GPs indexed by the input X^(1); similarly, h^(2) and {g^(2)_j}_{j=1}^{J} are GPs indexed by X^(2). We have

h^(1)(x^(1)) ~ GP(0, κ^(1)_h(x^(1), x^(1)′)),   h^(2)(x^(2)) ~ GP(0, κ^(2)_h(x^(2), x^(2)′)),

g^(1)_j(x^(1)) ~ GP(0, κ^(1)j_g(x^(1), x^(1)′)),   g^(2)_j(x^(2)) ~ GP(0, κ^(2)j_g(x^(2), x^(2)′)),

where the kernels κ^(1)_h(x^(1), x^(1)′) and κ^(1)j_g(x^(1), x^(1)′) are parameterized by θ^(1)_h and θ^(1)j_g, respectively; similarly, θ^(2)_h and θ^(2)j_g are the parameters of κ^(2)_h(x^(2), x^(2)′) and κ^(2)j_g(x^(2), x^(2)′). These mappings differ from those of the shared GPLVM, subspace GPLVM and MRD, which employ one GP mapping per view to capture the common information of the multiple outputs. Those models cannot sufficiently model the characteristics of each output, while the mappings in our model capture both the differences and the dependence among multiple outputs.

4. Inference and Learning

Given the model assumptions, the joint distribution of the observations and latent variables of the proposed model is

p(Y^(1), Y^(2), H^(1), H^(2), G^(1), G^(2), X^(1), X^(2), X^(1,2)) = ∏_{K∈{(1),(2)}} p(Y^K | G^K, H^K) p(G^K, H^K | X^K) p(X^K | X^(1,2), t) · p(X^(1,2) | t),   (6)

where the superscript K ∈ {(1), (2)} of a variable indicates the view the variable corresponds to, and G^K = [(g^K_1)^⊤, …, (g^K_J)^⊤]^⊤. The marginal likelihood, commonly used as the objective of model learning, is obtained by integrating out all the latent variables. However, the private low-dimensional variables X^(1) and X^(2) cannot be integrated out, because they appear nonlinearly in the inverses of the kernel matrices G^(1)j_{X,X}, H^(1)_{X,X} and G^(2)j_{X,X}, H^(2)_{X,X}, respectively. Throughout the paper, covariance matrices are represented by bold uppercase characters with superscripts and subscripts: the corresponding GP can be inferred from the character, with K for x, H for h and G for g; the superscript indicates the view the GP belongs to, while the subscript indicates the inputs at which the covariance matrix is evaluated. Following Titsias and Lawrence (2010), we approximate the true posterior of the model by variational inference and derive a variational lower bound of the logarithmic marginal likelihood.

4.1 Variational Lower Bound

We introduce inducing points and adopt the structured variational inference method for our model. To train the proposed model, we minimize the KL divergence between the approximate posterior and the true posterior, which is equivalent to maximizing the evidence lower bound of the logarithmic marginal likelihood. First, we employ inducing variables to augment the model. Specifically, for each view K ∈ {(1), (2)} and each latent function, we introduce a set of M inducing variables. We use {u^K_j ∈ R^M}_{j=1}^{J} and v^K ∈ R^M to represent the values of g^K_j at the inducing inputs Z^{Kj}_g ∈ R^{M×Q} and the values of h^K at the inducing inputs Z^K_h ∈ R^{M×Q}, respectively. Denote U^K = [(u^K_1)^⊤, …, (u^K_J)^⊤]^⊤. Owing to the conditional independence assumption on the latent variables {g^K_j}_{j=1}^{J}, we have p(U^K | {Z^{Kj}_g}_{j=1}^{J}) = ∏_{j=1}^{J} N(u^K_j | 0, G^{Kj}_{Z,Z}). The distribution p(v^K) is also assumed to be zero-mean Gaussian with covariance matrix H^K_{Z,Z}.
The conditional Gaussian distributions are given as p(G^K | U^K, X^K) = ∏_{j=1}^{J} N(g^K_j | μ^{Kj}_g, G̃^{Kj}_{X,X}) with

μ^{Kj}_g = G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} u^K_j,   G̃^{Kj}_{X,X} = G^{Kj}_{X,X} − G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} G^{Kj}_{Z,X},

and p(H^K | V^K, X^K) = N(H^K | μ^K_h, H̃^K_{X,X}) with

μ^K_h = H^K_{X,Z} (H^K_{Z,Z})^{-1} v^K,   H̃^K_{X,X} = H^K_{X,X} − H^K_{X,Z} (H^K_{Z,Z})^{-1} H^K_{Z,X}.

Then, we introduce the joint variational distribution, which is assumed to factorize as q(Θ^(1)) q(Θ^(2)) q(X^(1)) q(X^(2)) q(X^(1,2)), where q(X^(1)) = N(X^(1) | μ^(1), S^(1)), q(X^(2)) = N(X^(2) | μ^(2), S^(2)) and q(X^(1,2)) = N(X^(1,2) | μ^(1,2), S^(1,2)). q(Θ^(1)) and q(Θ^(2)) are the variational distributions of the latent variables {G^(1), H^(1), U^(1), V^(1)} and {G^(2), H^(2), U^(2), V^(2)}, whose specific forms are defined as

q(Θ^K) = p(G^K | U^K, X^K) p(H^K | V^K, X^K) q(U^K) q(V^K), K ∈ {(1), (2)}.   (7)

Finally, given the above assumptions, the lower bound of the logarithmic marginal likelihood can be expressed as

F_v(q) = ∫ ∏_K q(Θ^K) q(X^K) q(X^(1,2)) log [ ∏_K p(Y^K | X^K) p(X^K | X^(1,2), t) p(X^(1,2) | t) / ( ∏_K q(Θ^K) q(X^K) q(X^(1,2)) ) ] dΘ^K dX^K dX^(1,2)
= −KL( q(X^(1)) q(X^(2)) q(X^(1,2)) || p(X^(1) | X^(1,2), t) p(X^(2) | X^(1,2), t) p(X^(1,2) | t) ) + Σ_K L̂^K, K ∈ {(1), (2)}.   (8)

The detailed calculation of the KL divergence is given below:

KL( q(X^(1)) q(X^(2)) q(X^(1,2)) || p(X^(1) | X^(1,2), t) p(X^(2) | X^(1,2), t) p(X^(1,2) | t) )
= (1/2) Σ_{q=1}^{Q} [ log |A^(1)| + log |A^(2)| + log |K^(1,2)_{t,t}| − log |S^(1,2)_q| − log |S^(1)_q| − log |S^(2)_q|
+ ( (1 − α^(1)) μ^(1,2)_q − μ^(1)_q )^⊤ (A^(1))^{-1} ( (1 − α^(1)) μ^(1,2)_q − μ^(1)_q )
+ ( (1 − α^(2)) μ^(1,2)_q − μ^(2)_q )^⊤ (A^(2))^{-1} ( (1 − α^(2)) μ^(1,2)_q − μ^(2)_q )
+ Tr( [ (1 − α^(1))^2 (A^(1))^{-1} + (1 − α^(2))^2 (A^(2))^{-1} ] S^(1,2)_q )
+ Tr( (K^(1,2)_{t,t})^{-1} [ μ^(1,2)_q (μ^(1,2)_q)^⊤ + S^(1,2)_q ] )
+ Tr( (A^(1))^{-1} S^(1)_q + (A^(2))^{-1} S^(2)_q ) ] + const,   (9)

where A^(1) and A^(2) denote (α^(1))^2 K̃^(1)_{t,t} + (1 − α^(1))^2 ε^(1) I and (α^(2))^2 K̃^(2)_{t,t} + (1 − α^(2))^2 ε^(2) I, respectively. Since the observations on different dimensions in each view are assumed to be conditionally independent, the term L̂^K can be decomposed across dimensions, which has the following form.
L̂^K = Σ_{d=1}^{D_K} { (1/2) log |H^K_{Z,Z}| − (1/2) log |β^K ψ^K_4 + H^K_{Z,Z}| + (1/2) Σ_{j=1}^{J} [ log |G^{Kj}_{Z,Z}| − log |β^K (w^K_{dj})^2 ψ^{Kj}_5 + G^{Kj}_{Z,Z}| ]
− (1/2) (y^K_d)^⊤ [ β^K I − Σ_{j=1}^{J} (β^K)^2 (w^K_{dj})^2 ψ^{Kj}_1 ( β^K (w^K_{dj})^2 ψ^{Kj}_5 + G^{Kj}_{Z,Z} )^{-1} (ψ^{Kj}_1)^⊤ − (β^K)^2 ψ^K_0 ( β^K ψ^K_4 + H^K_{Z,Z} )^{-1} (ψ^K_0)^⊤ ] y^K_d
− (β^K/2) ψ^K_2 + (β^K/2) Tr( ψ^K_4 (H^K_{Z,Z})^{-1} ) − (β^K/2) Σ_{j=1}^{J} (w^K_{dj})^2 ψ^{Kj}_3 + (β^K/2) Σ_{j=1}^{J} Tr( (w^K_{dj})^2 ψ^{Kj}_5 (G^{Kj}_{Z,Z})^{-1} ) + const },   (10)

where ψ^K_0 = ⟨H^K_{X,Z}⟩_{q(X^K)}, ψ^{Kj}_1 = ⟨G^{Kj}_{X,Z}⟩_{q(X^K)}, ψ^K_2 = Tr(⟨H^K_{X,X}⟩_{q(X^K)}), ψ^{Kj}_3 = Tr(⟨G^{Kj}_{X,X}⟩_{q(X^K)}), ψ^K_4 = ⟨H^K_{Z,X} H^K_{X,Z}⟩_{q(X^K)}, and ψ^{Kj}_5 = ⟨G^{Kj}_{Z,X} G^{Kj}_{X,Z}⟩_{q(X^K)}. Here ⟨·⟩_{q(X^K)} denotes expectation under the distribution q(X^K). The detailed computations of the evidence lower bound and the involved statistics are given in Appendices A and B, respectively. The computational cost of training the McGPDS is dominated by inverting the kernel matrices, and thus the computational complexity is O(V D (J + 1) M^3 + (V + 1) N^3), where V is the number of views.

4.2 Parameter Estimation

The parameters to be optimized in the proposed model include model parameters and variational parameters. The model parameters involve the hyperparameters of the kernel functions of the latent variables {g^(1), g^(2), h^(1), h^(2), X^(1), X^(2), X^(1,2)}, e.g., σ^2_f and α_q in the ARD kernel κ(x, x′) = σ^2_f exp( −(1/2) Σ_{q=1}^{Q} α_q (x_q − x′_q)^2 ), the inverse variances of the white Gaussian noise {β^(1), β^(2)}, the Gaussian noise variances {ε^(1), ε^(2)}, and the weights {W^(1), W^(2), α^(1), α^(2)}. The variational parameters include the means and covariances of the variational distributions, {μ^(1), S^(1), μ^(2), S^(2), μ^(1,2), S^(1,2)}, and the inducing inputs {Z^(1)_g, Z^(1)_h, Z^(2)_g, Z^(2)_h}. All the parameters are jointly optimized through the gradient descent method. Here we give the update rules for the variational means and covariance matrices, in which the optimization of the covariances employs the reparameterization trick inspired by Opper and Archambeau (2009). The derivation is analogous to that in Damianou et al. (2011) and Damianou et al.
(2016), to which we refer the readers for more details. The variational means in the private latent spaces can be optimized by the gradient descent method, and the gradient of the evidence lower bound w.r.t. the variational mean is given by

\frac{\partial \mathcal{L}}{\partial \mu^K_q} = \frac{\partial \hat{\mathcal{L}}^K}{\partial \mu^K_q} - (A^K_q)^{-1} \big( \mu^K_q - (1-\alpha^K) \mu^{(1,2)}_q \big).

The private variational covariance matrix S^K_q can be reparameterized as S^K_q = \big( (A^K_q)^{-1} + \mathrm{diag}(\lambda^K_q) \big)^{-1}, where \mathrm{diag}(\lambda^K_q) = -2\, \partial \mathcal{F}_v(q) / \partial S^K_q is an N \times N diagonal and positive definite matrix, w.r.t. which the gradient of the evidence lower bound is given by

\frac{\partial \mathcal{L}}{\partial \lambda^K_q} = -(S^K_q \circ S^K_q) \Big( \frac{\partial \hat{\mathcal{L}}^K}{\partial S^K_q} + \frac{\lambda^K_q}{2} \Big). (11)

The shared variational parameters \{\mu^{(1,2)}, S^{(1,2)}\} have analytical solutions. After updating the private variational parameters, we can update the shared variational parameters by the following equations.

\mu^{(1,2)}_q = S^{(1,2)}_q \Big[ (1-\alpha^{(1)}) (A^{(1)}_q)^{-1} \mu^{(1)}_q + (1-\alpha^{(2)}) (A^{(2)}_q)^{-1} \mu^{(2)}_q \Big], (12)

S^{(1,2)}_q = \Big[ (1-\alpha^{(1)})^2 (A^{(1)}_q)^{-1} + (1-\alpha^{(2)})^2 (A^{(2)}_q)^{-1} + (K^{(1,2)}_{t,t})^{-1} \Big]^{-1}. (13)

5. Prediction with the McGPDS

Given a trained McGPDS, which jointly models the observations of two views Y^{(1)} and Y^{(2)} and learns the shared latent space X^{(1,2)} and the private latent spaces X^{(1)} and X^{(2)}, we aim to generate the outputs of one view given the observations of the other view, for example, to generate Y^{(2)}_* \in \mathbb{R}^{N_* \times D_2} from Y^{(1)}_* \in \mathbb{R}^{N_* \times D_1}. The McPDS accomplishes this task in three steps, similar to MRD (Damianou et al., 2012). In the first step, we use variational inference again to derive the posterior distributions of the latent variables X^{(1)}_* \in \mathbb{R}^{N_* \times Q} and X^{(1,2)}_* \in \mathbb{R}^{N_* \times Q}, which are most likely to govern the generation of Y^{(1)}_*. We use q(X^{(1)}_*, X^{(1,2)}_*) to approximate p(X^{(1)}_*, X^{(1,2)}_* \mid Y^{(1)}_*). The approximate posterior distribution q(X^{(1)}_*, X^{(1,2)}_*) is the marginal distribution of q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*).
To obtain q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*), we maximize the variational lower bound of the marginal likelihood p(Y^{(1)}, Y^{(1)}_*),

\mathcal{F}^{(1)} = -\mathrm{KL}\big( q(X^{(1)}_*, X^{(1)})\, q(X^{(1,2)}_*, X^{(1,2)}) \,\big\|\, p(X^{(1)}_*, X^{(1)} \mid X^{(1,2)}_*, X^{(1,2)})\, p(X^{(1,2)}_*, X^{(1,2)}) \big) + \hat{\mathcal{L}}^{(1)}(Y^{(1)}_*, Y^{(1)}), (14)

where we've omitted the time inputs t and t_* for brevity. The lower bound can be maximized using the same method as for training. The detailed calculation of \mathcal{F}^{(1)} is given in Appendix C.

In the second step, we obtain the private latent variable, which is also essential for generating data of a view. Precisely, in order to generate observations Y^{(2)}_*, we need to obtain the private latent variable X^{(2)}_*. However, the observed test data of the first view Y^{(1)}_* alone can hardly provide information about data of the second view Y^{(2)}_*, so it is quite difficult to obtain an exact representation of X^{(2)}_*. Therefore, we resort to the latent variables learned from the training data, X^{(1,2)} and X^{(2)}, and employ the nearest neighbor method to obtain the private latent variable X^{(2)}_*. Specifically, we find the shared training latent variable in X^{(1,2)} that is closest to the X^{(1,2)}_* obtained in the first step, and take the variational distribution of the training private latent variable X^{(2)} at the corresponding indexes as the approximate posterior of the private latent variable X^{(2)}_*.

In the third step, we predict the output Y^{(2)}_* using the marginal posterior distribution of the latent variable, q(X^{(2)}_*), obtained in the second step. Specifically, Y^{(2)}_* can be calculated by

p(Y^{(2)}_*) = \int p(Y^{(2)}_* \mid G^{(2)}_*, H^{(2)}_*)\, p(G^{(2)}_* \mid X^{(2)}_*, U^{(2)})\, p(H^{(2)}_* \mid X^{(2)}_*, V^{(2)})\, q(U^{(2)})\, q(V^{(2)})\, q(X^{(2)}_*)\, dG^{(2)}_*\, dH^{(2)}_*\, dX^{(2)}_*\, dU^{(2)}\, dV^{(2)}. (15)

Note that the variational distributions q(U^{(2)}) and q(V^{(2)}) are obtained during the training phase and need not be optimized during prediction.
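The nearest-neighbor step of the second stage above can be sketched in a few lines of numpy. This is a hypothetical helper, not the authors' Matlab implementation; `mu_shared_test` and `mu_shared_train` stand for the variational means of the shared latent variables, and `X2_train` for the training private latent means of the second view:

```python
import numpy as np

def nearest_private_latents(mu_shared_test, mu_shared_train, X2_train, k=1):
    """For each test shared latent point, find its k closest training shared
    latent points and reuse their view-2 private latent variables
    (k = 1 is used in the experiments)."""
    # Pairwise squared Euclidean distances between test and training points.
    d = (np.sum(mu_shared_test**2, axis=1)[:, None]
         + np.sum(mu_shared_train**2, axis=1)[None, :]
         - 2.0 * mu_shared_test @ mu_shared_train.T)
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the nearest neighbours
    return X2_train[idx].mean(axis=1)    # reduces to a plain copy when k = 1
```

The same index lookup would also return the corresponding variational covariances in a full implementation.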
Algorithm 1 Prediction with the McGPDS
1: Input: training data for the two views Y^{(1)} and Y^{(2)}, an McGPDS model trained on the two-view data (Y^{(1)}, Y^{(2)}), and test data of the first view Y^{(1)}_*.
2: Output: generated observations of the second view Y^{(2)}_*.
3: Maximize the evidence lower bound of the marginal likelihood p(Y^{(1)}_*, Y^{(1)}) to obtain q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*).
4: Get the marginal distribution q(X^{(1)}_*, X^{(1,2)}_*) to obtain the test means \mu^{(1)}_*, \mu^{(1,2)}_* and covariances S^{(1)}_*, S^{(1,2)}_*.
5: Find the optimal \hat{\mu}^{(2)}_* and \hat{S}^{(2)}_* using the K-nearest neighbor method according to the distances between \mu^{(1,2)}_* and \mu^{(1,2)}.
6: q(X^{(2)}_*) \leftarrow \mathcal{N}(\hat{\mu}^{(2)}_*, \hat{S}^{(2)}_*).
7: Predict Y^{(2)}_* using Equation (15).

Since the integration in (15) is analytically intractable, we follow Damianou et al. (2011) to calculate the expectations of g^{(2)}_* and h^{(2)}_*, denoted E(g^{(2)}_*) and E(h^{(2)}_*), respectively, and estimate the corresponding covariance matrices with Monte Carlo sampling. The element-wise autocovariance matrices of g^{(2)}_* and h^{(2)}_* are denoted as V(g^{(2)}_*) and V(h^{(2)}_*), respectively.

E(h^{(2)}_*) = \psi^{(2)}_{0*}\, b^{(2)}_h, \qquad E(g^{(2)j}_*) = \psi^{(2)j}_{1*}\, b^{(2)j}_g,

V(h^{(2)}_{*n}) = (b^{(2)}_h)^\top \big( \psi^{(2)}_{4*n} - (\psi^{(2)}_{0*n})^\top \psi^{(2)}_{0*n} \big) b^{(2)}_h + \psi^{(2)}_{2*} - \mathrm{Tr}\Big[ \big( (H^{(2)}_{Z,Z})^{-1} - (H^{(2)}_{Z,Z} + \beta^{(2)} \psi^{(2)}_4)^{-1} \big) \psi^{(2)}_{4*n} \Big],

V(g^{(2)j}_{*n}) = (b^{(2)j}_g)^\top \big( \psi^{(2)j}_{5*n} - (\psi^{(2)j}_{1*n})^\top \psi^{(2)j}_{1*n} \big) b^{(2)j}_g + \psi^{(2)j}_{3*} - \mathrm{Tr}\Big[ \big( (G^{(2)j}_{Z,Z})^{-1} - (G^{(2)j}_{Z,Z} + \beta^{(2)} w_{dj}^2 \psi^{(2)j}_5)^{-1} \big) \psi^{(2)j}_{5*n} \Big],

where V(h^{(2)}_{*n}) denotes the n-th entry of V(h^{(2)}_*), and V(g^{(2)j}_{*n}) denotes the (n, j)-th entry of V(g^{(2)}_*).
Here,

b^{(2)}_h = \beta^{(2)} \big( H^{(2)}_{Z,Z} + \beta^{(2)} \psi^{(2)}_4 \big)^{-1} (\psi^{(2)}_0)^\top y^{(2)}, \qquad b^{(2)j}_g = \beta^{(2)} \big( G^{(2)j}_{Z,Z} + \beta^{(2)} \psi^{(2)j}_5 \big)^{-1} (\psi^{(2)j}_1)^\top y^{(2)},

\psi^{(2)}_{0*} = \langle H^{(2)}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)j}_{1*} = \langle G^{(2)j}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)}_{2*} = \mathrm{Tr}\big( \langle H^{(2)}_{X_*,X_*} \rangle_{q(X^{(2)}_*)} \big), \quad \psi^{(2)j}_{3*} = \mathrm{Tr}\big( \langle G^{(2)j}_{X_*,X_*} \rangle_{q(X^{(2)}_*)} \big),

\psi^{(2)}_{4*} = \langle H^{(2)}_{Z,X_*} H^{(2)}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)j}_{5*} = \langle G^{(2)j}_{Z,X_*} G^{(2)j}_{X_*,Z} \rangle_{q(X^{(2)}_*)},

\psi^{(2)}_{0*n} = \langle H^{(2)}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)j}_{1*n} = \langle G^{(2)j}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)}_{4*n} = \langle H^{(2)}_{Z,X_{*n}} H^{(2)}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)j}_{5*n} = \langle G^{(2)j}_{Z,X_{*n}} G^{(2)j}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})},

with n = 1, \ldots, N_*, d = 1, \ldots, D and j = 1, \ldots, J. Since Y^{(2)}_{*d} = \sum_{j=1}^{J} w^{(2)}_{dj} g^{(2)j}_* + h^{(2)}_*, d \in [1 \ldots D], the expectation and covariance of Y^{(2)}_{*d} are E(Y^{(2)}_{*d}) = \sum_{j=1}^{J} w^{(2)}_{dj} E(g^{(2)j}_*) + E(h^{(2)}_*) and V(Y^{(2)}_{*d}) = \sum_{j=1}^{J} (w^{(2)}_{dj})^2 V(g^{(2)j}_*) + V(h^{(2)}_*) + (\beta^{(2)})^{-1} I, where (y^{(2)})^\top = [(y^{(2)}_1)^\top, \ldots, (y^{(2)}_D)^\top]. The whole prediction process is shown in Algorithm 1.

6. Experiments

In order to validate the effectiveness of the proposed McGPDS, we conduct experiments on five multi-view datasets, including two synthetic datasets and three real-world datasets.¹ We evaluate our model on two different kinds of tasks. The first is recovering the structures of the latent variables when the correlation between the shared and private latent variables is strong. The second is generating data of one view given data of the other view. For comparison, all models are trained with the same initializations, and we set J = 1 in the proposed model. For the toy data experiments, we use a linear kernel without inducing points, and the dimension of each view's private latent variable is set to 1. For the real-world data experiments, we use the RBF kernel with the variance initialized to 1.

1. For an implementation of McGPDS in Matlab, see https://github.com/mcgpds/mcgpds.
We use 100 inducing points, and the dimension of each view's private latent variable is set to 5 unless otherwise stated. For all the experiments, α is initialized to 0.5 for each view, and the mixture weights in the output layer are independently initialized from a Gaussian distribution with zero mean and 0.01 variance. For the K-nearest neighbor method, we set K = 1. In the experiments, the shared GPLVM refers to the new version of the shared GPLVM, namely, the subspace GPLVM. For MRD, we follow the settings in Damianou et al. (2012). All experiments are repeated five times, and the average results are reported as the final results. The root mean square error (RMSE) and the mean standardized log loss (MSLL) are used as the performance measures. MSLL is the mean negative log probability of all the test data, where the predictive density is given by (15). The lower the RMSE and MSLL are, the better the performance is.

6.1 Toy Data

Figure 4: The results of McGPDS on the toy dataset: (a) private signal cos(π²t), (b) private signal cos(√5 πt), (c) shared signal sin(2πt). Red lines represent true signals, and blue lines represent recovered signals.

Figure 5: The results of MRD on the toy dataset: (a) private signal cos(π²t), (b) private signal cos(√5 πt), (c) shared signal sin(2πt). Red lines represent true signals, and blue lines represent recovered signals.

First, we conduct the experiment on a synthetic dataset similar to the one used by Salzmann et al. (2010) and Jia et al. (2010).
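The two evaluation measures just introduced, RMSE and MSLL, can be sketched in numpy (MSLL here follows the paper's stated definition: the mean negative log probability under an element-wise Gaussian predictive density; the function names are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all entries."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def msll(y_true, pred_mean, pred_var):
    """Mean negative log probability of the test data under a Gaussian
    predictive density with element-wise mean and variance."""
    nll = (0.5 * np.log(2.0 * np.pi * pred_var)
           + 0.5 * (y_true - pred_mean) ** 2 / pred_var)
    return float(np.mean(nll))
```

Lower values of both measures indicate better predictions, as in the tables below.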
We first generate three one-dimensional latent variables using three signals: cos(π²t) and cos(√5 πt), which generate the private latent variables, and sin(2πt), which generates the shared latent variable. Then, we use randomly generated projection matrices to map the one-dimensional private latent variables to a ten-dimensional space and the one-dimensional shared latent variable to a five-dimensional space. The two-view sequential data Y^{(1)} and Y^{(2)} are constructed by concatenating the ten-dimensional private variable of each view with the five-dimensional shared variable. Therefore, both generated sequences Y^{(1)} and Y^{(2)} have 15 dimensions in total, that is, y^{(1)}_i, y^{(2)}_i ∈ R^{15}. The proposed model is capable of learning the latent variables corresponding to the observed sequential data. We use the McGPDS with a linear kernel function to recover the latent signals: the private signals (cos(π²t) and cos(√5 πt)) and the shared signal (sin(2πt)). We compare our model with the state-of-the-art GP-based multi-view dynamical system, i.e., MRD with dynamics. Figure 4 shows the recovery results of the latent signals by our model. Specifically, Figures 4(a), (b) and (c) show the true signals as well as the signals recovered by McGPDS for cos(π²t), cos(√5 πt) and sin(2πt), respectively. As shown in Figure 4, the recovered signals almost exactly match the true signals (up to a translation), which demonstrates that our model has the ability to learn an effective latent representation even when the private latent variables are orthogonal to the shared latent variables. As a comparison, Figure 5 shows the results of MRD with dynamics on this toy dataset. Figures 5(a) and (b) show that the private signals recovered by MRD deviate significantly from the true signals in both views. The only recovered signal that matches the true signal is the shared signal, as shown in Figure 5(c).
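The construction of the two-view toy data described above can be sketched as follows (the sequence length of 100 and the particular random projections are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)

# One-dimensional latent signals: two private, one shared.
x1 = np.cos(np.pi**2 * t)[:, None]              # private signal, view 1
x2 = np.cos(np.sqrt(5.0) * np.pi * t)[:, None]  # private signal, view 2
xs = np.sin(2.0 * np.pi * t)[:, None]           # shared signal

# Randomly generated projection matrices: private -> 10-D, shared -> 5-D.
W1 = rng.standard_normal((1, 10))
W2 = rng.standard_normal((1, 10))
Ws = rng.standard_normal((1, 5))

# Each view concatenates its 10-D private part with the 5-D shared part.
Y1 = np.hstack([x1 @ W1, xs @ Ws])   # view 1: 100 x 15
Y2 = np.hstack([x2 @ W2, xs @ Ws])   # view 2: 100 x 15
```

The shared 5-D block is identical across views by construction, which is exactly the structure the shared latent variable is meant to recover.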
Figure 6: The learned α^{(1)} and α^{(2)} of McGPDS on the toy dataset with different variances of the private latent signal of view 2 (horizontal axis: variance of the private signal of the second view, from 0.1 to 1.0).

Next, we test the interpretability of the learned combination weights, α^{(1)} and α^{(2)}, on another synthetic dataset. The shared latent signal is generated from sin(2πt), and the private latent signals are generated from pure Gaussian noise. The variance of the Gaussian noise of the first view is fixed to 0.1, while that of the second view varies from 0.1 to 1. The observations of each view are constructed by concatenating its private signal and the shared signal. Under this construction, the first view contains almost only the shared signal, while the ratio of the private to the shared signal in the second view increases with the variance of the former. We plot the learned α^{(1)} and α^{(2)} against the variance of the private latent variable of view 2 in Figure 6. As expected, the learned α^{(2)} increases with the variance of the private signal of the second view, which coincides with the growing significance of the private signal. The learned α^{(1)} also increases, but at a slower rate, since large noise in the second view makes it harder to recover the shared signal, and the view-specific dynamics have to compensate.

6.2 Human Motion Data

In this experiment, we use the human motion data, which contain a set of 3D human poses and their corresponding silhouettes. The data were collected by Agarwal and Triggs (Ankur and Bill, 2006). We use 566 frames for training, which contain 5 sequences corresponding to walking motions in different directions. The test data are a separate walking sequence of 158 frames. The pose data are 63-dimensional joint location vectors, and the silhouette data are 100-dimensional histogram of oriented gradients (HOG) vectors.
We consider the task of generating data of one view given the other view; that is, we generate the corresponding 3D human poses given the silhouette data. We use the RBF kernel for all GPs and 100 inducing points for McGPDS. The dimensions of the shared and both private latent variables are set to 5 for all the models. As described in the previous section, given test data in the first view Y^{(1)}_{test}, McGPDS optimises the private latent variables of the first view X^{(1)}_{test} and the shared latent points X^{(1,2)}_{test}. Then, the training latent variables X^{(2)} of the second view are selected as the test private latent variables X^{(2)}_{test} according to the similarity between X^{(1,2)} and X^{(1,2)}_{test}. Finally, McGPDS generates a set of novel poses Y^{(2)}_{test} based on these selected training latent points X^{(2)}.

Table 1: The RMSE and MSLL on the human motion dataset.

Method | RMSE | MSLL
NNYspace | 2.65 ± 0.00 | -
NNXspace (X learned by MRD) | 3.19 ± 0.03 | -
NNXspace (X learned by McGPDS) | 2.40 ± 0.03 | -
Shared GPLVM | 5.15 ± 0.01 | 3.41 ± 0.17
MRD without dynamics | 5.03 ± 0.01 | 3.37 ± 0.03
MRD with dynamics | 2.65 ± 0.01 | 3.01 ± 0.25
Independent CGPDS | 2.69 ± 0.13 | 3.22 ± 0.23
McGPDS | 2.37 ± 0.03 | 2.60 ± 0.05
McGPDS+GPLVM | 2.62 ± 0.04 | 3.78 ± 0.25
McGPDS+Linear | 2.81 ± 0.15 | -

In this experiment, we compare our model with seven different methods: the nearest neighbor (NN) method in the silhouette space (NNYspace), the NN method in the X space with X learned by MRD, the NN method in the X space with X learned by McGPDS, the shared Gaussian process latent variable model (GPLVM), MRD without dynamics, MRD with dynamics, and the independent CGPDS model. NNYspace finds the predicted 3D pose from the training data whose silhouette is the closest to the corresponding test silhouette. Similarly, NNXspace finds the predicted 3D pose from the training data whose shared latent information is the closest to the corresponding shared information of the test data. The independent CGPDS model uses one CGPDS on each view independently.
To demonstrate the usefulness of the two key components of McGPDS, i.e., modelling the private latent variables using GPs with the mixture mean and covariance, and modelling the map from the private latent variables to the observations with a CGPDS, we conduct ablation studies for them. More specifically, we run two methods on the human motion dataset with the other settings unchanged: McGPDS+GPLVM, which is McGPDS with the prior of the private latent variables replaced by that of the GPLVM, and McGPDS+Linear, which is McGPDS with the output coupling layer replaced by a linear map.

Table 1 shows the RMSE and MSLL on the human motion dataset. As shown in Table 1, our model (McGPDS) obtains the lowest RMSE, 2.37 ± 0.03, and the lowest MSLL, 2.60 ± 0.05, which means that our model outperforms the state-of-the-art model (MRD with dynamics). Both McGPDS and MRD with dynamics outperform the independent CGPDS model, which confirms the usefulness of the shared latent space structures. In addition, NNXspace (X learned by McGPDS) performs better than NNXspace (X learned by MRD). The ablation studies also confirm the usefulness of the two key components. Figure 7 demonstrates the results visually. As shown in Figure 7, the 3D poses generated by our model are the closest to the true poses. To better understand the impact of the dimensionality and the number of inducing points in McGPDS, we plot the RMSE and MSLL against the total dimension of the private latent variables in Figure 8(a), and the RMSE, MSLL and training time against the number of inducing points in Figure 8(b). Figure 8(a) shows that the RMSE of McGPDS decreases as the total dimension

Figure 7: The results of generating 3D poses given silhouettes. The left-most side of each line represents the test silhouette.
The remaining parts, from left to right, are the true poses, the poses generated by MRD without dynamics, the poses generated by NNXspace (X learned by McGPDS), the poses generated by MRD with dynamics, and the poses generated by McGPDS, respectively.

of the private latent variables increases, implying that a larger latent space gives McGPDS more capability to capture the multi-view dynamics. The increase of the MSLL is possibly due to the increased number of variables, which encourages the model to upweight the KL divergence term in the ELBO, leading to an increase in the variance of the likelihood and thus in the MSLL. Figure 8(b) shows that the training time increases with the number of inducing points, while the impact of the latter on the RMSE and MSLL is moderate.

Figure 8: (a) RMSE and MSLL of McGPDS with different total dimensions of the private latent variables on the human motion dataset. (b) RMSE, MSLL and training time (hr) of McGPDS with different numbers of inducing points on the human motion dataset.

6.3 CUAVE Data

In this experiment, we employ the CUAVE data, which are composed of videos showing a person speaking Arabic numerals and the corresponding Mel frequency cepstral coefficient (MFCC) features of the audio signals. Each video frame is represented by a 3750-dimensional vector, and each MFCC feature is represented by a 13-dimensional vector. We use 194 frames of videos and MFCC features as training data and 51 frames of videos for testing. Our task is to

Table 2: The RMSE and MSLL on the CUAVE dataset.
Method | RMSE | MSLL
NNYspace | 1.31 ± 0.00 | -
NNXspace (X learned by MRD) | 1.70 ± 0.10 | -
NNXspace (X learned by McGPDS) | 1.38 ± 0.15 | -
Shared GPLVM | 1.61 ± 0.01 | 4.70 ± 0.27
MRD without dynamics | 1.29 ± 0.01 | 4.34 ± 0.13
MRD with dynamics | 1.24 ± 0.03 | 3.45 ± 0.20
McGPDS | 1.19 ± 0.03 | 1.94 ± 0.07

generate MFCC features given the frames of the videos. We use the RBF kernel for all GPs and 100 inducing points for McGPDS. The dimensions of the shared and both private latent variables are set to 5 for all the models. From Table 2, we can see that our model obtains the best performance (with the lowest RMSE, 1.19 ± 0.03, and the lowest MSLL, 1.94 ± 0.07) on the CUAVE dataset. The method NNXspace (X learned by McGPDS) is also better than NNXspace (X learned by MRD) on the CUAVE dataset. These results show that our model can obtain a more reasonable latent representation, and thus generate observations closer to the truth.

6.4 Classification

In the final experiment, we examine McGPDS on a classification task. We use the Oil dataset, which contains 1000 12-dimensional examples from 3 classes. The observations constitute the first view, while the corresponding labels are taken as the second view in the form of one-hot encoding. Following the setting of Damianou et al. (2012), we select 10 random subsets of the data with increasing numbers of training points and compare to the NN method in the data space.

Figure 9: Accuracy of McGPDS and NN on the Oil dataset (horizontal axis: number of training points).

Figure 9 shows that the accuracy of McGPDS is worse than NN when the training set is small and is comparable to NN as the number of training points increases. There are two possible reasons for the mediocre performance of McGPDS on small-size non-dynamic data. First, McGPDS uses three GPs to model the time dynamics, while the time stamps of non-dynamic data provide little, if not misleading, information about the observations.
Second, McGPDS uses a mixture of GPs to model the observations, while the observations of view 2 of the used dataset are just the one-hot representations of the labels. Both of these could potentially make McGPDS perform less well on non-dynamic small data. We leave the application of McGPDS to classification for future work.

7. Conclusion

In this paper, we have proposed the McGPDS, which extends the CGPDS to the scenario of multi-view learning with flexible and general modeling in the latent space. As a novel hierarchical multi-view framework, the McGPDS makes full use of the characteristics of multi-view data and the advantages of the CGPDS. The setting of the latent space is elastic and reasonable, where the relationship between the private and shared latent variables can be learned adaptively via optimizing the weights. We introduce inducing points and employ variational inference to integrate out the latent variables. The proposed model is trained by maximizing the evidence lower bound. The effectiveness of our model for multi-view learning has been empirically validated on synthetic and real-world two-view datasets. For future work, we will extend our model beyond the current two views. The methodology can be similar to the current scenario, but deriving the ELBO for more-than-two-view cases is non-trivial, and the applications of generating one or multiple views from other views will be more challenging.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Projects 62076096 and 62006078, Shanghai Municipal Project 20511100900, the Chenguang Program of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission under Grant 19CG25, the Open Research Fund of KLATASDS-MOE, and the Fundamental Research Funds for the Central Universities. Corresponding Author: Shiliang Sun.

Appendix A.
Derivation of the Evidence Lower Bound for Training

In this section, we give the detailed derivation of the evidence lower bound for the training data. Given two views of data, Y^{(1)} and Y^{(2)}, the joint probability distribution of the proposed model is given by

p(Y^{(1)}, Y^{(2)}, H^{(1)}, H^{(2)}, G^{(1)}, G^{(2)}, X^{(1)}, X^{(2)}, X^{(1,2)}) = p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(1)}, H^{(1)} \mid X^{(1)})\, p(G^{(2)}, H^{(2)} \mid X^{(2)})\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t). (16)

We obtain the marginal likelihood by integrating out the latent variables,

p(Y^{(1)}, Y^{(2)}) = \int p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(1)}, H^{(1)} \mid X^{(1)})\, p(G^{(2)}, H^{(2)} \mid X^{(2)})\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t)\, dH^{(1)} dH^{(2)} dG^{(1)} dG^{(2)} dX^{(1)} dX^{(2)} dX^{(1,2)}. (17)

Note that the integration w.r.t. X^{(1)} and X^{(2)} is intractable, because X^{(1)} appears nonlinearly in the inverses of the matrices G^{(1)}_{X,X} and H^{(1)}_{X,X}, and X^{(2)} appears nonlinearly in the inverses of the matrices G^{(2)}_{X,X} and H^{(2)}_{X,X}. Therefore, we introduce inducing variables U and V to augment the model and compute a lower bound on its logarithmic marginal likelihood. The augmented joint probability density takes the form

p(Y^{(1)}, Y^{(2)}, H^{(1)}, H^{(2)}, G^{(1)}, G^{(2)}, U^{(1)}, U^{(2)}, V^{(1)}, V^{(2)}, X^{(1)}, X^{(2)}, X^{(1,2)})
= p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(G^{(1)} \mid U^{(1)}, X^{(1)})\, p(H^{(1)} \mid V^{(1)}, X^{(1)})\, p(U^{(1)} \mid X^{(1)})\, p(V^{(1)} \mid X^{(1)})
\times p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(2)} \mid U^{(2)}, X^{(2)})\, p(H^{(2)} \mid V^{(2)}, X^{(2)})\, p(U^{(2)} \mid X^{(2)})\, p(V^{(2)} \mid X^{(2)})
\times p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t). (18)

In the above formula, p(U^{(1)} \mid X^{(1)}) and p(U^{(2)} \mid X^{(2)}) are zero-mean Gaussians with covariance matrices G^{(1)}_{Z,Z} and G^{(2)}_{Z,Z}, and p(V^{(1)} \mid X^{(1)}) and p(V^{(2)} \mid X^{(2)}) are zero-mean Gaussians with covariance matrices H^{(1)}_{Z,Z} and H^{(2)}_{Z,Z}. Precisely, they are expressed as

p(U^{(1)} \mid X^{(1)}) = \prod_{j=1}^{J} \mathcal{N}(u^{(1)}_j; 0, G^{(1)j}_{Z,Z}), (19)

p(U^{(2)} \mid X^{(2)}) = \prod_{j=1}^{J} \mathcal{N}(u^{(2)}_j; 0, G^{(2)j}_{Z,Z}), (20)

p(V^{(1)} \mid X^{(1)}) = \mathcal{N}(V^{(1)}; 0, H^{(1)}_{Z,Z}), (21)

p(V^{(2)} \mid X^{(2)}) = \mathcal{N}(V^{(2)}; 0, H^{(2)}_{Z,Z}).
(22)

The conditional distributions of the latent variables G and H given the inducing variables U and V are Gaussian, with the following forms:

p(G^K \mid U^K, X^K) = \prod_{j=1}^{J} \mathcal{N}(g^K_j; \mu^{Kj}_g, \tilde{K}^{Kj}_g), (23)

p(H^K \mid V^K, X^K) = \mathcal{N}(H^K; \mu^K_h, \tilde{K}^K_h), (24)

where K \in \{(1), (2)\}. The specific expressions for the related statistics are \mu^{Kj}_g = G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} u^K_j, \tilde{K}^{Kj}_g = G^{Kj}_{X,X} - G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} G^{Kj}_{Z,X}, \mu^K_h = H^K_{X,Z} (H^K_{Z,Z})^{-1} v^K and \tilde{K}^K_h = H^K_{X,X} - H^K_{X,Z} (H^K_{Z,Z})^{-1} H^K_{Z,X}.

We now adopt the variational inference method to approximately compute the integral. Specifically, we introduce a joint variational distribution q(\Omega) over all the latent variables, denoted by \Omega, which has the factorized form

q(\Omega) = q(\Theta^{(1)})\, q(\Theta^{(2)})\, q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}), (25)

q(X^{(1)}) = \mathcal{N}(X^{(1)} \mid \mu^{(1)}, S^{(1)}), \quad q(X^{(2)}) = \mathcal{N}(X^{(2)} \mid \mu^{(2)}, S^{(2)}), \quad q(X^{(1,2)}) = \mathcal{N}(X^{(1,2)} \mid \mu^{(1,2)}, S^{(1,2)}),

q(\Theta^{(1)}) = p(G^{(1)} \mid U^{(1)}, X^{(1)})\, p(H^{(1)} \mid V^{(1)}, X^{(1)})\, q(U^{(1)})\, q(V^{(1)}),

q(\Theta^{(2)}) = p(G^{(2)} \mid U^{(2)}, X^{(2)})\, p(H^{(2)} \mid V^{(2)}, X^{(2)})\, q(U^{(2)})\, q(V^{(2)}).

The evidence lower bound of the logarithmic marginal likelihood \log p(Y^{(1)}, Y^{(2)}) is

\mathcal{F}_v(q, \theta) = \int q(\Theta^{(1)})\, q(X^{(1)}) \log \frac{p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(G^{(1)} \mid X^{(1)})\, p(H^{(1)} \mid X^{(1)})}{q(\Theta^{(1)})}\, dG^{(1)} dH^{(1)} dX^{(1)}
+ \int q(\Theta^{(2)})\, q(X^{(2)}) \log \frac{p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(2)} \mid X^{(2)})\, p(H^{(2)} \mid X^{(2)})}{q(\Theta^{(2)})}\, dG^{(2)} dH^{(2)} dX^{(2)}
- \int q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \log \frac{q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)})}{p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t)}\, dX^{(1,2)} dX^{(1)} dX^{(2)}
= \hat{\mathcal{L}}^{(1)} + \hat{\mathcal{L}}^{(2)} - \mathrm{KL}\big( q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \,\big\|\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t) \big).
(26)

The detailed computation of the first term \hat{\mathcal{L}}^{(1)} in Equation (26) is given by

\hat{\mathcal{L}}^{(1)} = \int q(U^{(1)}, V^{(1)})\, q(X^{(1)}) \log \frac{p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)})\, p(U^{(1)}, V^{(1)})}{q(U^{(1)}, V^{(1)})}\, dU^{(1)} dV^{(1)} dX^{(1)}, (27)

where \log p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)}) in the lower bound can be lower bounded by

\log p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)}) \geq \big\langle \log p(Y^{(1)} \mid G^{(1)}, H^{(1)}) \big\rangle_{p(G^{(1)}, H^{(1)} \mid U^{(1)}, V^{(1)})}
= \sum_{d=1}^{D} \big\langle \log p(Y^{(1)}_d \mid G^{(1)}, H^{(1)}) \big\rangle_{p(G^{(1)} \mid U^{(1)})\, p(H^{(1)} \mid V^{(1)})}
= \sum_{d=1}^{D} \Big[ \log \mathcal{N}\Big( Y^{(1)}_d \,\Big|\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I \Big) - \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \tilde{K}^{(1)}_h \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \tilde{K}^{(1)j}_g \big) \Big]. (28)

As the outputs Y^{(1)} are conditionally independent, the lower bound can be written as a sum of D terms, that is, \hat{\mathcal{L}}^{(1)} = \sum_{d=1}^{D} \hat{\mathcal{L}}^{(1)}_d, where \hat{\mathcal{L}}^{(1)}_d is given by

\hat{\mathcal{L}}^{(1)}_d = \int q(u^{(1)}, v^{(1)})\, q(X^{(1)}) \log \frac{\mathcal{N}\big( y^{(1)}_d \mid \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I \big)\, p(u^{(1)}, v^{(1)})}{q(u^{(1)}, v^{(1)})}\, du^{(1)} dv^{(1)} dX^{(1)}
- \int \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \tilde{K}^{(1)}_h \big)\, q(X^{(1)})\, dX^{(1)} - \int \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \tilde{K}^{(1)j}_g \big)\, q(X^{(1)})\, dX^{(1)}.

By changing the integration order, we get

\hat{\mathcal{L}}^{(1)}_d = \int q(u^{(1)}, v^{(1)}) \log \frac{e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)})}{q(u^{(1)}, v^{(1)})}\, du^{(1)} dv^{(1)}
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big), (29)

where the optimal variational distribution q(u^{(1)}, v^{(1)}) for the d-th output that gives rise to this lower bound is

q(u^{(1)}, v^{(1)}) \propto e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)}).
(30)

The optimal variational distribution is analytically Gaussian,

q(u^{(1)}, v^{(1)}) = \mathcal{N}\Big( v^{(1)};\, H^{(1)}_{Z,Z} \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top \beta^{(1)} y^{(1)}_d,\; H^{(1)}_{Z,Z} \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} H^{(1)}_{Z,Z} \Big)
\times \prod_{j=1}^{J} \mathcal{N}\Big( u^{(1)}_j;\, G^{(1)j}_{Z,Z} \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top w^{(1)}_{dj} \beta^{(1)} y^{(1)}_d,\; G^{(1)j}_{Z,Z} \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} G^{(1)j}_{Z,Z} \Big), (31)

where \psi^{(1)}_0 = \langle H^{(1)}_{X,Z} \rangle_{q(X^{(1)})}, \psi^{(1)j}_1 = \langle G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})}, \psi^{(1)}_2 = \mathrm{Tr}(\langle H^{(1)}_{X,X} \rangle_{q(X^{(1)})}), \psi^{(1)j}_3 = \mathrm{Tr}(\langle G^{(1)j}_{X,X} \rangle_{q(X^{(1)})}), \psi^{(1)}_4 = \langle H^{(1)}_{Z,X} H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} and \psi^{(1)j}_5 = \langle G^{(1)j}_{Z,X} G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})}. Furthermore, the optimal lower bound can be obtained using Jensen's inequality,

\hat{\mathcal{L}}^{(1)}_d \geq \log \int e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)})\, du^{(1)} dv^{(1)}
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big)
= \log \bigg[ \frac{(\beta^{(1)})^{\frac{N}{2}}}{(2\pi)^{\frac{N}{2}}} \prod_{j=1}^{J} \frac{|G^{(1)j}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z}|^{\frac{1}{2}}} \cdot \frac{|H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} \exp\Big\{ -\frac{1}{2} (y^{(1)}_d)^\top F^{(1)}_d\, y^{(1)}_d \Big\} \bigg]
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big), (32)

where F^{(1)}_d = \beta^{(1)} I - \sum_{j=1}^{J} (\beta^{(1)})^2 (w^{(1)}_{dj})^2\, \psi^{(1)j}_1 \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top - (\beta^{(1)})^2\, \psi^{(1)}_0 \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top.
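The conditional statistics that recur throughout this appendix, a mean of the form G_{X,Z}(G_{Z,Z})^{-1}u and a covariance of the form K_{X,X} - K_{X,Z}(K_{Z,Z})^{-1}K_{Z,X}, can be computed stably via a Cholesky factorization. This is a generic numpy sketch of that linear algebra, not the authors' implementation; the jitter term is a standard numerical safeguard and not part of the model:

```python
import numpy as np

def sparse_gp_conditional(K_XX, K_XZ, K_ZZ, u, jitter=1e-8):
    """Conditional statistics of a GP given inducing variables u:
    mean = K_XZ K_ZZ^{-1} u,
    cov  = K_XX - K_XZ K_ZZ^{-1} K_ZX (Schur complement)."""
    M = K_ZZ.shape[0]
    L = np.linalg.cholesky(K_ZZ + jitter * np.eye(M))  # stabilised factor
    A = np.linalg.solve(L, K_XZ.T)                     # A = L^{-1} K_ZX
    mean = A.T @ np.linalg.solve(L, u)                 # K_XZ K_ZZ^{-1} u
    cov = K_XX - A.T @ A                               # K_XX - K_XZ K_ZZ^{-1} K_ZX
    return mean, cov
```

When the test inputs coincide with the inducing inputs, the mean recovers u and the covariance collapses to (numerically) zero, which is a convenient sanity check.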
Therefore, the closed form of the first term \hat{\mathcal{L}}^{(1)} in the lower bound (26) of the logarithmic marginal likelihood is given by

\hat{\mathcal{L}}^{(1)} = \sum_{d=1}^{D} \bigg[ \log \frac{(\beta^{(1)})^{\frac{N}{2}} |H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N}{2}} |\beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} + \frac{1}{2} \sum_{j=1}^{J} \log \frac{|G^{(1)j}_{Z,Z}|}{|\beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z}|}
- \frac{1}{2} (y^{(1)}_d)^\top \Big( \beta^{(1)} I - \sum_{j=1}^{J} (\beta^{(1)})^2 (w^{(1)}_{dj})^2\, \psi^{(1)j}_1 \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top - (\beta^{(1)})^2\, \psi^{(1)}_0 \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top \Big) y^{(1)}_d
- \frac{\beta^{(1)}}{2} \psi^{(1)}_2 + \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \psi^{(1)}_4 (H^{(1)}_{Z,Z})^{-1} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2 \psi^{(1)j}_3 + \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} \mathrm{Tr}\big( (w^{(1)}_{dj})^2 \psi^{(1)j}_5 (G^{(1)j}_{Z,Z})^{-1} \big) \bigg], (33)

and similarly for \hat{\mathcal{L}}^{(2)},

\hat{\mathcal{L}}^{(2)} = \sum_{d=1}^{D} \bigg[ \log \frac{(\beta^{(2)})^{\frac{N}{2}} |H^{(2)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N}{2}} |\beta^{(2)} \psi^{(2)}_4 + H^{(2)}_{Z,Z}|^{\frac{1}{2}}} + \frac{1}{2} \sum_{j=1}^{J} \log \frac{|G^{(2)j}_{Z,Z}|}{|\beta^{(2)} (w^{(2)}_{dj})^2 \psi^{(2)j}_5 + G^{(2)j}_{Z,Z}|}
- \frac{1}{2} (y^{(2)}_d)^\top \Big( \beta^{(2)} I - \sum_{j=1}^{J} (\beta^{(2)})^2 (w^{(2)}_{dj})^2\, \psi^{(2)j}_1 \big( \beta^{(2)} (w^{(2)}_{dj})^2 \psi^{(2)j}_5 + G^{(2)j}_{Z,Z} \big)^{-1} (\psi^{(2)j}_1)^\top - (\beta^{(2)})^2\, \psi^{(2)}_0 \big( \beta^{(2)} \psi^{(2)}_4 + H^{(2)}_{Z,Z} \big)^{-1} (\psi^{(2)}_0)^\top \Big) y^{(2)}_d
- \frac{\beta^{(2)}}{2} \psi^{(2)}_2 + \frac{\beta^{(2)}}{2} \mathrm{Tr}\big( \psi^{(2)}_4 (H^{(2)}_{Z,Z})^{-1} \big) - \frac{\beta^{(2)}}{2} \sum_{j=1}^{J} (w^{(2)}_{dj})^2 \psi^{(2)j}_3 + \frac{\beta^{(2)}}{2} \sum_{j=1}^{J} \mathrm{Tr}\big( (w^{(2)}_{dj})^2 \psi^{(2)j}_5 (G^{(2)j}_{Z,Z})^{-1} \big) \bigg]. (34)

For the calculation of the KL divergence, for simplification we employ A^{(1)}_q and A^{(2)}_q to represent (\alpha^{(1)})^2 K^{(1)}_{t,t} + (1-\alpha^{(1)})^2 \epsilon^{(1)} I and (\alpha^{(2)})^2 K^{(2)}_{t,t} + (1-\alpha^{(2)})^2 \epsilon^{(2)} I, respectively. Then the specific calculation is given below.

\mathrm{KL}\big( q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \,\big\|\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t) \big)
= \frac{1}{2} \Big[ \log|A^{(1)}_q| + \log|A^{(2)}_q| + \log|K^{(1,2)}_{t,t}| - \log|S^{(1,2)}_q| - \log|S^{(1)}_q| - \log|S^{(2)}_q|
+ \big( (1-\alpha^{(1)}) \mu^{(1,2)}_q - \mu^{(1)}_q \big)^\top (A^{(1)}_q)^{-1} \big( (1-\alpha^{(1)}) \mu^{(1,2)}_q - \mu^{(1)}_q \big)
+ \big( (1-\alpha^{(2)}) \mu^{(1,2)}_q - \mu^{(2)}_q \big)^\top (A^{(2)}_q)^{-1} \big( (1-\alpha^{(2)}) \mu^{(1,2)}_q - \mu^{(2)}_q \big)
+ \mathrm{Tr}\big[ \big( (1-\alpha^{(1)})^2 (A^{(1)}_q)^{-1} + (1-\alpha^{(2)})^2 (A^{(2)}_q)^{-1} \big) S^{(1,2)}_q \big]
+ \mathrm{Tr}\big[ (K^{(1,2)}_{t,t})^{-1} \big( \mu^{(1,2)}_q (\mu^{(1,2)}_q)^\top + S^{(1,2)}_q \big) \big]
+ \mathrm{Tr}\big[ (A^{(1)}_q)^{-1} S^{(1)}_q + (A^{(2)}_q)^{-1} S^{(2)}_q \big] \Big] + \mathrm{const}. (35)

Appendix B. Computation of the Statistics \psi_0, \psi_1, \psi_2, \psi_3, \psi_4, \psi_5

\psi^{(1)}_0, \psi^{(1)}_1, \psi^{(2)}_0 and \psi^{(2)}_1 are N \times M matrices. \psi^{(1)}_2, \psi^{(1)}_3, \psi^{(2)}_2 and \psi^{(2)}_3 are scalars. \psi^{(1)}_4, \psi^{(1)}_5, \psi^{(2)}_4 and \psi^{(2)}_5 are M \times M matrices (with one such block for each j in the case of \psi_5).
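The ψ statistics below reduce to Gaussian integrals of the ARD kernel. The first of them, ψ_0 = ⟨H_{X,Z}⟩_{q(X)}, can be sketched in numpy for a factorised Gaussian q(X) = ∏_n N(x_n | μ_n, diag(S_n)); the function names are hypothetical and the plain kernel is included for comparison:

```python
import numpy as np

def ard_kernel(X1, X2, sigma_f2, alpha):
    """ARD kernel k(x, x') = sigma_f^2 exp(-0.5 * sum_q alpha_q (x_q - x'_q)^2)."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2       # pairwise squared diffs
    return sigma_f2 * np.exp(-0.5 * np.sum(alpha * diff2, axis=2))

def psi0_statistic(mu, S, Z, sigma_f2, alpha):
    """<K_{X,Z}>_{q(X)} for the ARD kernel under q(X) = prod_n N(x_n | mu_n,
    diag(S_n)); mu and S are N x Q, Z is M x Q, alpha has length Q."""
    denom = S * alpha + 1.0                              # S_nq * alpha_q + 1
    scale = np.prod(denom, axis=1) ** -0.5               # per-row prefactor
    diff2 = (Z[None, :, :] - mu[:, None, :]) ** 2        # (z_mq - mu_nq)^2
    expo = -0.5 * np.sum(alpha * diff2 / denom[:, None, :], axis=2)
    return sigma_f2 * scale[:, None] * np.exp(expo)
```

With S → 0 (no posterior uncertainty), ψ_0 collapses to the plain kernel matrix evaluated at the means, which matches the structure of the formulas that follow.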
We use the ARD kernel \kappa_{\mathrm{ARD}}(x, x') = \sigma_f^2 \exp\big( -\frac{1}{2} \sum_{q=1}^{Q} \alpha_q (x_q - x'_q)^2 \big), and obtain

(\psi^{(1)}_0)_{n,m} = \big( \langle H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} \big)_{n,m} = \int \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_m)\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^2)^{(1)h} \prod_{q=1}^{Q} \big( S^{(1)}_{nq} \alpha^{(1)h}_q + 1 \big)^{-\frac{1}{2}} \exp\Big( -\frac{1}{2} \sum_{q=1}^{Q} \frac{(z^{(1)h}_{mq} - \mu^{(1)}_{nq})^2\, \alpha^{(1)h}_q}{S^{(1)}_{nq} \alpha^{(1)h}_q + 1} \Big), (36)

(\psi^{(1)j}_1)_{n,m} = \big( \langle G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})} \big)_{n,m} = \int \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_m)\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^2)^{(1)}_j \prod_{q=1}^{Q} \big( S^{(1)}_{nq} \alpha^{(1)}_{jq} + 1 \big)^{-\frac{1}{2}} \exp\Big( -\frac{1}{2} \sum_{q=1}^{Q} \frac{(z^{(1)}_{mq} - \mu^{(1)}_{nq})^2\, \alpha^{(1)}_{jq}}{S^{(1)}_{nq} \alpha^{(1)}_{jq} + 1} \Big), (37)

\psi^{(1)}_2 = \mathrm{Tr}\big( \langle H^{(1)}_{X,X} \rangle_{q(X^{(1)})} \big) = N (\sigma_f^2)^{(1)h}, (38)

\psi^{(1)j}_3 = \mathrm{Tr}\big( \langle G^{(1)j}_{X,X} \rangle_{q(X^{(1)})} \big) = N (\sigma_f^2)^{(1)}_j, (39)

(\psi^{(1)}_4)_{m,m'} = \big( \langle H^{(1)}_{Z,X} H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} \big)_{m,m'} = \sum_{n=1}^{N} \int \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_m)\, \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_{m'})\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^4)^{(1)h} \sum_{n=1}^{N} \Big[ \prod_{q=1}^{Q} \big( 2 \alpha^{(1)h}_q S^{(1)}_{nq} + 1 \big)^{-\frac{1}{2}} \Big] \exp\Big( -\sum_{q=1}^{Q} \Big[ \frac{\alpha^{(1)h}_q (z^{(1)h}_{mq} - z^{(1)h}_{m'q})^2}{4} + \frac{\alpha^{(1)h}_q \big( \mu^{(1)}_{nq} - \frac{z^{(1)h}_{mq} + z^{(1)h}_{m'q}}{2} \big)^2}{2 \alpha^{(1)h}_q S^{(1)}_{nq} + 1} \Big] \Big), (40)

(\psi^{(1)j}_5)_{m,m'} = \big( \langle G^{(1)j}_{Z,X} G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})} \big)_{m,m'} = \sum_{n=1}^{N} \int \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_m)\, \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_{m'})\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^4)^{(1)}_j \sum_{n=1}^{N} \Big[ \prod_{q=1}^{Q} \big( 2 \alpha^{(1)}_{jq} S^{(1)}_{nq} + 1 \big)^{-\frac{1}{2}} \Big] \exp\Big( -\sum_{q=1}^{Q} \Big[ \frac{\alpha^{(1)}_{jq} (z^{(1)}_{mq} - z^{(1)}_{m'q})^2}{4} + \frac{\alpha^{(1)}_{jq} \big( \mu^{(1)}_{nq} - \frac{z^{(1)}_{mq} + z^{(1)}_{m'q}}{2} \big)^2}{2 \alpha^{(1)}_{jq} S^{(1)}_{nq} + 1} \Big] \Big). (41)

The statistics \psi^{(2)}_0, \psi^{(2)}_1, \psi^{(2)}_2, \psi^{(2)}_3, \psi^{(2)}_4, \psi^{(2)}_5 for the second view have similar formulas.

Appendix C. Derivation of the Variational Lower Bound for Testing

Given test data in the first view Y^{(1)}_*, we maximize a variational lower bound on the logarithmic marginal likelihood \log p(Y^{(1)}, Y^{(1)}_*), which can be expressed as follows. For brevity, we've omitted the time inputs t and t_*.
$$
\begin{aligned}
\mathcal{F}^{(1)} &= \log \int p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})\,dX^{(1,2)}_{*}\,dX^{(1)}_{*}\,dX^{(1)}\,dX^{(1,2)} \\
&\geq \int q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,q(G^{(1)})\,q(H^{(1)}) \log \frac{p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})}{q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,q(G^{(1)})\,q(H^{(1)})}\,dX^{(1,2)}_{*}\,dX^{(1)}_{*}\,dX^{(1)}\,dX^{(1,2)}\,dG^{(1)}\,dH^{(1)} \\
&= \int q(G^{(1)})\,q(H^{(1)})\,q(X^{(1)}_{*}, X^{(1)}) \log \frac{p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})}{q(G^{(1)})\,q(H^{(1)})}\,dX^{(1)}_{*}\,dX^{(1)}\,dG^{(1)}\,dH^{(1)} \\
&\quad + \int q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)}) \log \frac{p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})}{q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})}\,dX^{(1)}_{*}\,dX^{(1,2)}_{*}\,dX^{(1)}\,dX^{(1,2)} \\
&= \tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)}) - \operatorname{KL}\big(q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,\|\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})\big).
\end{aligned} \tag{42}
$$

The quantity $\mathcal{F}^{(1)}$ can be maximized using the same method as for training. In addition, the parameters of the new variational distribution $q(X^{(1)}, X^{(1)}_{*})$ are jointly optimized because $X^{(1)}$ and $X^{(1)}_{*}$ are coupled in $q(X^{(1)}, X^{(1)}_{*})$, and so are those of $q(X^{(1,2)}, X^{(1,2)}_{*})$. Specifically, the quantity $\tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)})$ can be expressed as

$$
\begin{aligned}
\tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)}) = \sum_{d=1}^{D^{(1)}} \Bigg\{ & \log \frac{(\beta^{(1)})^{\frac{N+N_{*}}{2}} \, |H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N+N_{*}}{2}} \, |\beta^{(1)}\tilde{\psi}^{(1)}_{4} + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} + \sum_{j=1}^{J} \log \frac{|G^{(1)j}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5} + G^{(1)j}_{Z,Z}|^{\frac{1}{2}}} \\
& - \frac{1}{2}(\tilde{y}^{(1)}_{d})^{\top}\Big[\beta^{(1)}I - \sum_{j=1}^{J}(\beta^{(1)})^{2}(w^{(1)}_{dj})^{2}\,\tilde{\psi}^{(1)j}_{1}\big(\beta^{(1)}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5} + G^{(1)j}_{Z,Z}\big)^{-1}(\tilde{\psi}^{(1)j}_{1})^{\top} \\
& \qquad\qquad - (\beta^{(1)})^{2}\,\tilde{\psi}^{(1)}_{0}\big(\beta^{(1)}\tilde{\psi}^{(1)}_{4} + H^{(1)}_{Z,Z}\big)^{-1}(\tilde{\psi}^{(1)}_{0})^{\top}\Big]\,\tilde{y}^{(1)}_{d} \\
& - \frac{\beta^{(1)}}{2}\tilde{\psi}^{(1)}_{2} + \frac{\beta^{(1)}}{2}\operatorname{Tr}\big(\tilde{\psi}^{(1)}_{4}(H^{(1)}_{Z,Z})^{-1}\big) - \frac{\beta^{(1)}}{2}\sum_{j=1}^{J}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{3} + \frac{\beta^{(1)}}{2}\sum_{j=1}^{J}\operatorname{Tr}\big((w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5}(G^{(1)j}_{Z,Z})^{-1}\big)\Bigg\},
\end{aligned} \tag{43}
$$

and the KL divergence can be expressed as

$$
\begin{aligned}
& \operatorname{KL}\big(q(X^{(1)}, X^{(1)}_{*})\,q(X^{(1,2)}, X^{(1,2)}_{*})\,\|\,p(X^{(1)}, X^{(1)}_{*}|X^{(1,2)}, X^{(1,2)}_{*})\,p(X^{(1,2)}, X^{(1,2)}_{*})\big) \\
&= \frac{1}{2}\Big[\log|\tilde{A}^{(1)}_{q}| + \log|\tilde{K}^{(1,2)}_{t,t}| - \log|\tilde{S}^{(1,2)}_{q}| - \log|\tilde{S}^{(1)}_{q}| + \operatorname{Tr}\big((\tilde{A}^{(1)}_{q})^{-1}\tilde{S}^{(1)}_{q}\big) \\
&\quad + \big((1-\alpha^{(1)})\tilde{\mu}^{(1,2)}_{q} - \tilde{\mu}^{(1)}_{q}\big)^{\top}(\tilde{A}^{(1)}_{q})^{-1}\big((1-\alpha^{(1)})\tilde{\mu}^{(1,2)}_{q} - \tilde{\mu}^{(1)}_{q}\big) + \operatorname{Tr}\big((1-\alpha^{(1)})^{2}(\tilde{A}^{(1)}_{q})^{-1}\tilde{S}^{(1,2)}_{q}\big) \\
&\quad + \operatorname{Tr}\Big((\tilde{K}^{(1,2)}_{t,t})^{-1}\big(\tilde{\mu}^{(1,2)}_{q}(\tilde{\mu}^{(1,2)}_{q})^{\top} + \tilde{S}^{(1,2)}_{q}\big)\Big)\Big] + \text{const},
\end{aligned}
$$
(44)

where $\tilde{\psi}^{(1)}_{0} = \langle H^{(1)}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$, $\tilde{\psi}^{(1)j}_{1} = \langle G^{(1)j}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$, $\tilde{\psi}^{(1)}_{2} = \operatorname{Tr}\big(\langle H^{(1)}_{X,X}\rangle_{q(X^{(1)},X^{(1)}_{*})}\big)$, $\tilde{\psi}^{(1)j}_{3} = \operatorname{Tr}\big(\langle G^{(1)j}_{X,X}\rangle_{q(X^{(1)},X^{(1)}_{*})}\big)$, $\tilde{\psi}^{(1)}_{4} = \langle H^{(1)}_{Z,X}H^{(1)}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$ and $\tilde{\psi}^{(1)j}_{5} = \langle G^{(1)j}_{Z,X}G^{(1)j}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$.

References

M. R. Amini and C. Goutte. A co-classification approach to learning from multilingual corpora. Machine Learning, 79:105–121, 2010.

K. Andreas and G. Carlos. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th International Conference on Machine Learning, pages 449–456, 2007.

A. Ankur and T. Bill. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:1–8, 2006.

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

D. M. Blei and M. I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134, 2003.

T. Cao, V. Jojic, S. Modla, D. Powell, K. Czymmek, and M. Niethammer. Robust multimodal dictionary learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 259–266, 2013.

N. Chen, J. Zhu, and E. P. Xing. Predictive subspace learning for multi-view data: a large margin approach. Advances in Neural Information Processing Systems, 23:361–369, 2010.

D. A. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. Advances in Neural Information Processing Systems, 14:430–436, 2001.

A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian process dynamical systems. Advances in Neural Information Processing Systems, 24:2510–2518, 2011.

A. C. Damianou, C. H. Ek, M. K. Titsias, and N. D. Lawrence. Manifold relevance determination.
In Proceedings of the 29th International Conference on Machine Learning, pages 1–8, 2012.

A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17:1425–1486, 2016.

Z. Ding, M. Shao, and Y. Fu. Robust multi-view representation: A unified perspective from multi-view learning to domain adaptation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5434–5440, 2018.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

C. H. Ek and N. D. Lawrence. Shared Gaussian process latent variable models. PhD thesis, Oxford Brookes University, 2009.

F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd International Conference on Multimedia, pages 7–16, 2014.

M. Feurer, B. Letham, and E. Bakshy. Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles. In Proceedings of the 36th Automatic Machine Learning Workshop at International Conference on Machine Learning, pages 1–15, 2018.

J. Hu, J. Lu, and Y. Tan. Sharable and individual multi-view metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2281–2288, 2018.

Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. Advances in Neural Information Processing Systems, 23:982–990, 2010.

Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Proceedings of the 13th IEEE International Conference on Computer Vision, pages 2407–2414, 2011.

P. Jing, Y. Su, L. Nie, X. Bai, J. Liu, and M. Wang.
Low-rank multi-view embedding learning for micro-video popularity prediction. IEEE Transactions on Knowledge and Data Engineering, 30:1519–1532, 2018.

A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 17:329–336, 2004.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.

N. D. Lawrence and M. I. Jordan. Semi-supervised learning via Gaussian processes. Advances in Neural Information Processing Systems, 18:753–760, 2005.

Y. Li, M. Yang, and Z. M. Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, 10:1–20, 2018.

W. Liu, D. Tao, J. Cheng, and Y. Tang. Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding, 118:50–60, 2014.

M. Lüthi, T. Gerig, C. Jud, and T. Vetter. Gaussian process morphable models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:1860–1873, 2018.

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks. arXiv preprint arXiv:1412.6632, pages 1–17, 2014.

J. R. Medina, H. Borner, S. Endo, and S. Hirche. Impedance-based Gaussian processes for modeling human motor behavior in physical and non-physical interaction. IEEE Transactions on Biomedical Engineering, 63:1–12, 2019.

I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.

I. Muslea, S.
Minton, and C. A. Knoblock. Active learning with multiple views. Journal of Artificial Intelligence Research, 27:203–233, 2006.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689–696, 2011.

V. T. Nguyen and E. Bonilla. Collaborative multi-output Gaussian processes. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 643–652, 2014.

M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21:786–792, 2009.

E. Puyol, B. Ruijsink, B. Gerber, M. Amzulescu, H. Langet, M. De Craene, J. A. Schnabel, P. Piro, and A. P. King. Regional multi-view learning for cardiac motion analysis: Application to identification of dilated cardiomyopathy patients. IEEE Transactions on Biomedical Engineering, 65:1–9, 2018.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 701–708, 2010.

A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao. Learning shared latent structure for image synthesis and robotic imitation. Advances in Neural Information Processing Systems, 18:1233–1240, 2006.

N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, 25:2222–2230, 2012.

S. Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23:2031–2038, 2013.

S. Sun, L. Mao, Z. Dong, and L. Wu. Multiview Machine Learning. Springer, 1st edition, 2019.

M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.

S. Tulsiani, A.
Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2018.

S. Virtanen, Y. Jia, A. Klami, and T. Darrell. Factorized multi-modal topic model. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 1–9, 2012.

X. Wan. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 235–243, 2009.

J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. Advances in Neural Information Processing Systems, 19:1441–1448, 2006.

W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1083–1092, 2015.

H. Wei, P. Zhu, M. Liu, J. P. How, and S. Ferrari. Automatic pan-tilt camera control for learning Dirichlet process Gaussian process mixture models of multiple moving targets. IEEE Transactions on Automatic Control, 64:159–173, 2019.

X. Wei, H. Huang, L. Nie, F. Feng, R. Hong, and T. Chua. Quality matters: Assessing cQA pair quality via transductive multi-view learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4482–4488, 2018.

E. P. Xing, R. Yan, and A. G. Hauptmann. Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423, pages 1–9, 2012.

C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, pages 1–59, 2013.

C. Zhang, C. H. Ek, A. Damianou, and H. Kjellstrom. Factorized topic models. In Proceedings of the 1st International Conference on Learning Representations, pages 1–9, 2013.

J. Zhao and S. Sun. Variational dependent multi-output Gaussian process dynamical systems.
Journal of Machine Learning Research, 17:1–36, 2016.

J. Zhao, J. Fei, and S. Sun. A variant of Gaussian process dynamical systems. Technical report, East China Normal University, 2018.