Journal of Machine Learning Research 24 (2023) 1-32. Submitted 2/19; Revised 7/23; Published 10/23.

Multi-view Collaborative Gaussian Process Dynamical Systems

Shiliang Sun (slsun@cs.ecnu.edu.cn)
School of Computer Science and Technology, East China Normal University, Shanghai 200062, P. R. China
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, P. R. China

Jingjing Fei (jingjingfei16@163.com)
Jing Zhao (jzhao@cs.ecnu.edu.cn)
Liang Mao (lmao14@outlook.com)
School of Computer Science and Technology, East China Normal University, Shanghai 200062, P. R. China

Editor: Massimiliano Pontil

Abstract

Gaussian process dynamical systems (GPDSs) have shown their effectiveness in many machine learning tasks. However, when addressing multi-view data, current GPDSs do not explicitly model the dependence between private and shared latent variables; instead, they introduce a structurally and intrinsically discrete segmentation of the latent space. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which assumes that the private latent variable for each view is controlled by its dynamical prior and the shared latent variable. The relevance between private and shared latent variables can be learned automatically by optimization in the Bayesian framework. The model is capable of learning an effective latent representation and of generating novel data of one view given data of the other view. We evaluate our model on two-view data sets, and it obtains better performance than the state-of-the-art multi-view GPDSs.

Keywords: Gaussian process, multi-view machine learning, dynamical system, variational inference, multi-output modeling

1. Introduction

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006).
GPs are stochastic processes over real-valued functions and are completely specified by mean functions and covariance functions (Rasmussen and Williams, 2006). Recently, GPs have proved successful in various areas of machine learning (Lawrence and Jordan, 2005; Andreas and Carlos, 2007; Damianou et al., 2011; Lüthi et al., 2018; Feurer et al., 2018; Wei et al., 2019; Medina et al., 2019) because they provide flexible function approximation. For example, to implement nonlinear dimensionality reduction, GP latent variable models (GPLVMs) (Lawrence, 2004, 2005; Titsias and Lawrence, 2010) have been presented, which use global latent variables and assume conditional independence among multiple outputs. For modelling dynamics in sequential data, several Gaussian process dynamical systems (GPDSs) have been proposed, which extend GPLVMs by adding dynamical priors on the latent variables, such as GP dynamical models (GPDMs) (Wang et al., 2006), variational GPDSs (VGPDSs) (Damianou et al., 2011), variational dependent multi-output GPDSs (VDM-GPDSs) (Zhao and Sun, 2016) and collaborative Gaussian process dynamical systems (CGPDSs) (Zhao et al., 2018). Specifically, the GPDM models the dynamics by placing a Markov prior on the latent space and characterizes the variability among outputs by constructing the output variances with different parameters. The VGPDS employs a GP dynamical prior on the latent space, which is more flexible and can capture specific dynamical information, such as periodicity, with appropriate kernels. The VDM-GPDS models the dependence among multiple outputs and employs convolution processes to capture the multi-output dependence explicitly.

(©2023 Shiliang Sun, Jingjing Fei, Jing Zhao and Liang Mao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/19-094.html.)
The VDM-GPDS obtains better performance than the GPDM and VGPDS, but it is time-consuming to train owing to the introduced convolution processes. The CGPDS expresses each output as the sum of a global latent process and a local latent process, which can capture the universality and individuality of all outputs. Moreover, the CGPDS assumes that the latent processes are conditionally independent, which ensures that the resulting evidence lower bound decomposes across dimensions and allows stochastic optimization. We will detail CGPDSs in Section 2 in a self-contained form.

With the rapid development of information technology, more and more data exhibit multi-view characteristics, such as the URL link and text of a web document, the audio and image frames of a video, or the surrounding words and image of a web image. Data of different modalities often offer complementary information, and multi-view learning can exploit this information to learn representations that are more comprehensive and expressive than those of single-view learning (Sun et al., 2019). More specifically, multi-view learning uses one function to model each view and optimizes all functions jointly during training. Consensus and complementarity are the two core principles of multi-view learning. The consensus principle maximizes the agreement among the representations of different views, and the complementarity principle exploits the complementary information contained in different views to represent multi-view data comprehensively (Li et al., 2018). Since multi-view learning can use the consensus and complementarity properties of multiple views and exploit the redundant views of the same input data, it is often more natural and effective than single-view learning (Sun, 2013; Xu et al., 2013; Li et al., 2018; Ding et al., 2018). Recently, several models have extended GPLVMs or GPDSs to the scenario of multi-view learning.
The shared GPLVM assumes that each view is generated from the same low-dimensional latent variable corrupted by additive Gaussian noise (Shon et al., 2006). Furthermore, a new version of the shared GPLVM, the subspace GPLVM, was proposed (Ek and Lawrence, 2009), in which the latent space for each view is factorized into a shared part, which captures the information common across the views, and a private part, which explains the remaining variance. Salzmann et al. (2010) learned the dimensionality of the factorization by introducing regularizers. The manifold relevance determination (MRD) model (Damianou et al., 2012) improved on the hard segmentation between the private and shared latent variables and employed a soft segmentation of the latent space. Concretely, the MRD uses learned scales in the automatic relevance determination (ARD) kernels together with a pre-given threshold to decide whether a dimension is a private or a shared latent variable. This threshold needs to be specified manually and often varies across data sets, so its configuration requires expert knowledge and is time-consuming.

The above models do not explicitly model the correlation between private and shared latent variables (dimensions). This kind of model assumption brings a structurally and intrinsically discrete segmentation between the shared and private latent variables. On many real-world data sets, it is quite difficult to cleanly divide the latent space that generates the multi-view observations into shared and private latent information, because the two kinds of information are complexly coupled and interact with each other. For example, in a multi-view data set containing pictures of different faces under the same lighting condition, we can take the characteristics of the faces as private information and the lighting condition as shared information.
It is not difficult to see that intensely bright lighting can affect the characteristics of a face, such as the color of the skin. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which makes full use of the characteristics of multi-view data and the advantages of CGPDSs. The proposed model relaxes the discrete structural segmentation of the latent space and automatically learns the relevance between private and shared latent variables through optimization. Since the private latent variables are determined by their dynamical priors and the shared latent variable, McGPDSs can model more complex and abundant information in the data. Experiments on synthetic and real-world data sets also validate the superiority of the proposed McGPDSs. The contributions of our model are summarized as follows.
1) Our model extends the CGPDS to multi-view learning, and thus possesses the advantages of both multi-view learning and the CGPDS for modeling high-dimensional multi-output data.
2) Our model explicitly models the relationship between shared and private latent variables and automatically learns their relevance.
3) All parameters in our model can be learned through optimization.

The remainder of the paper is structured as follows. Section 2 introduces the related work, including multi-view learning, CGPDSs and several multi-view models based on GPLVMs and GPDSs. Section 3 presents the proposed model in detail. Section 4 describes the inference and learning techniques. Section 5 illustrates the procedure of prediction with McGPDSs. Section 6 provides extensive experimental evaluations to validate the effectiveness of our model, and Section 7 concludes the work and discusses future work.

2. Related Work

In this section, we first briefly review the related work on multi-view learning.
Then we give an introduction to CGPDSs (Zhao et al., 2018) and several multi-view models based on GPLVMs and GPDSs (Shon et al., 2006; Ek and Lawrence, 2009; Damianou et al., 2012).

2.1 Multi-view Learning

Multi-view learning is concerned with learning from data represented by multiple views. It has received increasing attention and has been applied widely. Wei et al. (2018) evaluated the quality of community-based question answering through transductive multi-view learning. Hu et al. (2018) proposed a shareable and individual multi-view metric learning approach for visual recognition. Puyol et al. (2018) described a method of regional multi-view learning for cardiac motion analysis, which was applied to the identification of dilated cardiomyopathy patients. Jing et al. (2018) employed low-rank multi-view embedding learning to predict the popularity of micro videos. Tulsiani et al. (2018) considered multi-view consistency as the supervisory signal for learning shape and pose prediction. In the literature, multi-view learning is closely related to other machine learning methods, such as active learning, domain adaptation and representation learning. More specifically, Muslea et al. (2002) combined co-testing and co-EM, where co-testing is a novel method for active learning with multiple views and co-EM is used to generate classifiers and select the unlabeled points with the largest amount of information for labeling. Muslea et al. (2006) improved co-testing by considering differences between strong and weak views and assuming that strong views carry more information. Domain adaptation solves the problem of adapting a model trained on a source domain to a target domain whose data distribution is largely different. Domain adaptation can be applied in the cross-language text classification task, where documents in different languages represent different views.
Co-training (Wan, 2009) and multi-view co-classification (Amini and Goutte, 2010) have been proposed and successfully applied to this task. Multi-view representation learning has been a promising research topic in recent years on account of its ability to provide abundant and complementary information for learning representations. Multi-view representation learning methods include generative methods, such as multi-modal topic learning (Cohn and Hofmann, 2001; Barnard et al., 2003; Blei and Jordan, 2003), multi-view sparse coding (Jia et al., 2010; Cao et al., 2013; Liu et al., 2014) and multi-view latent space Markov networks (Xing et al., 2012; Chen et al., 2010), and deep neural methods, such as multi-modal autoencoders (Ngiam et al., 2011; Feng et al., 2014; Wang et al., 2015), multi-modal Boltzmann machines (Srivastava and Salakhutdinov, 2012) and multi-modal recurrent neural networks (Karpathy and Fei-Fei, 2015; Mao et al., 2014; Donahue et al., 2015).

2.2 CGPDSs

CGPDSs aim to model multi-output sequential data. As a multi-output model, the CGPDS supposes that each output is the sum of a global latent process and a designed local latent process, so as to capture the dependence among multiple outputs while maintaining the unique characteristics of each output. Since standard Bayesian inference is analytically intractable, CGPDSs adopt variational inference and introduce inducing points to learn the model. Moreover, the evidence lower bound can be decomposed across dimensions thanks to the conditional independence of the outputs, which allows the parameters to be optimized in a stochastic optimization framework. Figure 1 shows the graphical model of CGPDSs. Given multi-output sequential data Y ∈ R^{N×D}, with y_n ∈ R^D being the observation at time t_n ∈ R_+, the CGPDS assumes that there are low-dimensional latent variables X ∈ R^{N×Q} (with Q ≪ D) that generate the observations. Moreover, a GP prior on the low-dimensional latent variables is used to model the dynamics, as in Damianou et al. (2011).
Specifically, the CGPDS is defined as a four-layer GPDS through the following generative process.

Figure 1: The graphical model for the CGPDSs. The gray solid circles represent observations, the black hollow circles represent latent variables, and the cyan hollow circles represent parameters.

p(X | t) = ∏_{q=1}^{Q} N(x_q | 0, K_{t,t}),

where x_q ∈ R^N is the qth column of X and K_{t,t} is the covariance matrix computed by κ_x(t, t′). Then,

p(h | X) = N(h | 0, H_{X,X}),   p({g_j}_{j=1}^{J} | X) = ∏_{j=1}^{J} N(g_j | 0, G^j_{X,X}),

where the latent processes h and {g_j}_{j=1}^{J} are GPs with input x, and H_{X,X} and G^j_{X,X} are covariance matrices computed by κ_h(x, x′) and κ^j_g(x, x′), respectively.

The CGPDS introduces the latent processes h and {g_j}_{j=1}^{J}, which is entirely different from previous GPDSs such as the VGPDS and VDM-GPDS. The VGPDS uses a single GP mapping from X to F (the noise-free version of the output Y), which can only learn the common information among multiple outputs, not the unique information of each output. The VDM-GPDS employs convolution processes to explicitly model the dependence among multiple outputs; however, its mapping from X to F involves an ND × ND matrix, which increases the computational complexity and prevents the model from scaling to large data sets. The CGPDS can capture both the dependence and the differences among multiple outputs with a relatively simple model structure. The outputs are generated as

p(y_d | g, h) = N(y_d | ℓ_d + h, β^{-1} I) = N(y_d | Σ_{j=1}^{J} w_{dj} g_j + h, β^{-1} I),   (1)

where h is the global latent process that captures the dependence among outputs, and ℓ_d is the local latent process specific to the dth output, constructed from the latent processes {g_j}_{j=1}^{J} and the weights {w_{dj}}. The weights {w_{dj}} are local parameters that differ across the D outputs, and β is the inverse variance of the white Gaussian noise. As shown in (1), the idea for constructing the output y_d is inspired by the COGP (Nguyen and Bonilla, 2014).
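As a concrete illustration, the four-layer generative process above can be sketched in a few lines of numpy. All kernels, sizes and parameter values below are illustrative assumptions (in particular, a single shared RBF kernel for h and every g_j), not the paper's experimental settings:

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
N, Q, J, D, beta = 50, 2, 3, 5, 100.0   # toy sizes, hypothetical
jitter = 1e-6 * np.eye(N)

# Layer 1: GP dynamical prior on the latent coordinates, x_q ~ N(0, K_tt).
t = np.linspace(0.0, 1.0, N)
K_tt = rbf(t, t, lengthscale=0.2) + jitter
X = rng.multivariate_normal(np.zeros(N), K_tt, size=Q).T        # N x Q

# Layer 2: global process h and local processes g_j, GPs indexed by X.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_xx = np.exp(-0.5 * sq) + jitter
h = rng.multivariate_normal(np.zeros(N), K_xx)                  # N
g = rng.multivariate_normal(np.zeros(N), K_xx, size=J)          # J x N

# Layer 3: outputs y_d = sum_j w_dj g_j + h + noise, as in Eq. (1).
W = rng.standard_normal((D, J))
Y = (W @ g).T + h[:, None] + rng.normal(0.0, beta**-0.5, (N, D))
```

Forward sampling like this is only meant to show the data flow t → X → (h, g_j) → Y of Eq. (1); the actual model is fit by variational inference rather than by sampling.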
The COGP models the dth output y_d as the weighted sum of the dth local latent process and J global latent processes, which involves (J + D) GPs in total. The CGPDS uses one global latent process h and a local latent process ℓ_d constructed from J (J ≪ D) latent processes {g_j}_{j=1}^{J}, which involves only (J + 1) GPs. In a word, CGPDSs can not only capture the dependence among multiple outputs but also maintain the specific characteristics of each output with fewer parameters. Last but not least, fewer parameters make the model easier to learn.

2.3 Multi-view Models Based on GPLVMs and GPDSs

In this section, we introduce related multi-view models based on GPLVMs and GPDSs, namely the shared GPLVM, the subspace GPLVM and the MRD. The shared GPLVM assumes that all observations are generated from the same low-dimensional latent variable with additive Gaussian noise. Figure 2(a) shows the graphical model of the shared GPLVM. The dotted line represents the back-mapping from the output space, which can constrain the latent space. The assumption of sharing the same latent variable for all views is far from perfect for many data sets, because it means that the data of all views share the main generating parameters. Ideally, the shared latent variable should connect all views while private latent variables differentiate them. The back-constraint from the second view to the latent space represents a bijective relationship between Y^(2) and X^(1,2). It means that the observations in the first view Y^(1) have to be accommodated by discarding information that does not exist in the second view Y^(2). This model can also be considered a feature selection model, because it uses information from one view to determine what is important for the other view. A new version of the shared GPLVM, the subspace GPLVM, introduces a private latent variable for each view and a shared latent variable for all views.
Figure 2(b) shows the graphical model of the subspace GPLVM. The subspace GPLVM learns a factorized latent representation within a single model: it directly concatenates the private latent variable of each view with the shared latent variable, and then generates the data of each view. For inference, the subspace GPLVM seeks the maximum a posteriori (MAP) solution for the latent space. The fact that the latent variables are not integrated out means that it is difficult to determine the structure of the latent space automatically. The idea of employing a factorized latent space in multi-view learning has been proposed in several works (Jia et al., 2011; Virtanen et al., 2012; Zhang et al., 2013). The MRD can also learn a factorized latent representation and relaxes the previous hard discrete segmentation of the latent space. Figure 2(c) shows the graphical model of the MRD with dynamics.

Figure 2: Development of multi-view models based on GPLVMs and GPDSs. (a) shows the shared GPLVM, where all the variance in the observations is captured by a single shared latent variable. (b) shows the subspace GPLVM, which introduces private latent variables to express the variance in each view. (c) shows the MRD, which uses a single latent variable and selects the shared and private latent dimensions according to the ARD weights w^(1) and w^(2) and a predetermined threshold. The shadowed nodes represent observations, the black hollow nodes represent latent variables, and the cyan nodes represent parameters.

In the MRD, a single latent variable X is used as the latent representation for all views, where each dimension of X represents private or shared latent information. The MRD adopts variational inference with inducing points in order to integrate out the latent variable X.
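The ARD-weight-plus-threshold segmentation that the MRD relies on can be sketched as follows. The kernel matches the ARD covariance form used by the MRD, while the weight values, the threshold value 0.1 and the helper names are hypothetical:

```python
import numpy as np

def ard_kernel(Xi, Xj, sigma2, w):
    """ARD covariance: sigma2 * exp(-0.5 * sum_q w_q (x_iq - x_jq)^2)."""
    d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2 * w).sum(-1)
    return sigma2 * np.exp(-0.5 * d2)

def mrd_partition(w1, w2, delta):
    """Split latent dimensions into (private to view 1, shared, private to
    view 2) by comparing the per-view ARD weights with the threshold delta."""
    Q = len(w1)
    shared = [q for q in range(Q) if w1[q] > delta and w2[q] > delta]
    priv1  = [q for q in range(Q) if w1[q] > delta and w2[q] <= delta]
    priv2  = [q for q in range(Q) if w1[q] <= delta and w2[q] > delta]
    return priv1, shared, priv2

# Hypothetical learned weights for Q = 4 dimensions: dimension 0 is shared,
# 1 is private to view 1, 2 is private to view 2, 3 is switched off in both.
w1 = np.array([0.9, 0.8, 0.02, 0.03])
w2 = np.array([0.7, 0.01, 0.85, 0.02])
priv1, shared, priv2 = mrd_partition(w1, w2, 0.1)
```

Note that the partition depends entirely on the hand-picked threshold, which is exactly the manual step the McGPDS is designed to remove.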
More precisely, the outputs of the two views Y^(1) and Y^(2) are assumed to be independent GPs with zero mean and an ARD covariance function, that is,

κ(x_i, x_j) = (σ_ard)^2 exp( −(1/2) Σ_{q=1}^{Q} w_q (x_{i,q} − x_{j,q})^2 ).

The two sets of ARD weights w^(1) and w^(2) in this model can be optimized in the Bayesian framework. An additional threshold δ has to be specified manually for each data set. By comparing the ARD weights with the threshold, the MRD determines whether each dimension is private or shared and divides the latent space into three subspaces, X = (X^(1), X^s, X^(2)). Here, X^s represents the shared subspace, which consists of the set of dimensions q ∈ {1, …, Q} with w^(1)_q > δ and w^(2)_q > δ. X^(1) and X^(2) are the private latent subspaces of the two views: X^(1) is composed of the dimensions with w^(1)_q > δ and w^(2)_q < δ, and analogously for X^(2) (w^(1)_q < δ and w^(2)_q > δ). There are two versions of the MRD model, one with dynamics (a GP prior on the latent variable) and one without.

3. Multi-view Collaborative Gaussian Process Dynamical System

In this section, we extend the CGPDS to the scenario of multi-view learning and propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model. Figure 3 shows the graphical model of the McGPDS. Specifically, we aim to model two views Y^(1) ∈ R^{N×D_1} and Y^(2) ∈ R^{N×D_2} in the same model, where y^(1)_n and y^(2)_n are the observations at time t_n ∈ R_+. We assume there is a shared low-dimensional latent variable X^(1,2) ∈ R^{N×Q} which governs the generation of the private low-dimensional latent variables X^(1) ∈ R^{N×Q} and X^(2) ∈ R^{N×Q}.

Figure 3: The graphical model for the McGPDS. The McGPDS explicitly models the dependence between private and shared latent variables and automatically learns their relevance. The shadowed nodes represent observations.
The black hollow nodes represent latent variables. The private low-dimensional latent variable for each view generates the corresponding observation. Moreover, we place GP priors on the low-dimensional latent variables to model the dynamics. Here, N is the number of training points, D_1 and D_2 are the dimensions of the two views, and Q denotes the dimension of the low-dimensional latent variables (with Q ≪ min(D_1, D_2)). The superscripts (1) and (2) correspond to the first and second view, respectively, and the superscript (1, 2) denotes information shared by the two views. Formally, the generative process is given as follows. The shared low-dimensional latent variable X^(1,2) is assumed to be a multi-dimensional GP indexed by time t, that is,

x^(1,2)_q(t) ~ GP(0, κ^(1,2)_x(t, t′)), q = 1, …, Q,   (2)

where the dimensions of the shared latent function x^(1,2)(t) are independently drawn from a GP with covariance function κ^(1,2)_x(t, t′) with parameters θ^(1,2)_x. Since the latent variable X^(1,2) is conditionally independent given t, we have

p(X^(1,2) | t) = ∏_{q=1}^{Q} N(x^(1,2)_q | 0, K^(1,2)_{t,t}),   (3)

where K^(1,2)_{t,t} is the covariance matrix computed by the kernel κ^(1,2)_x(t, t′). We also introduce two latent variables X̃^(1) and X̃^(2) which follow view-specific dynamical priors, i.e.,

p(X̃^(1) | t) = ∏_{q=1}^{Q} N(x̃^(1)_q | 0, K̃^(1)_{t,t}),   (4)

p(X̃^(2) | t) = ∏_{q=1}^{Q} N(x̃^(2)_q | 0, K̃^(2)_{t,t}),   (5)

where X̃^(1) and X̃^(2) are also assumed to be conditionally independent, and K̃^(1)_{t,t} and K̃^(2)_{t,t} are covariance matrices computed by the kernels κ̃^(1)_x(t, t′) and κ̃^(2)_x(t, t′), respectively. Let X̂^(1) be a noisy version of the shared latent variable X^(1,2), i.e., X̂^(1) ~ N(X̂^(1) | X^(1,2), ε^(1)).
The private latent variable X^(1) is defined as a convex combination of the view-specific latent variable X̃^(1) and X̂^(1), i.e.,

X^(1) = (1 − α^(1)) X̂^(1) + α^(1) X̃^(1),

with a combination weight α^(1) ∈ [0, 1] which adjusts the importance of the two components. The model can automatically learn the dependence between private and shared latent variables by optimizing α^(1). After integrating out X̂^(1), the conditional distribution of X^(1) given X^(1,2) and t is

p(X^(1) | X^(1,2), t) = ∏_{q=1}^{Q} N(x^(1)_q | (1 − α^(1)) x^(1,2)_q, (α^(1))^2 K̃^(1)_{t,t} + (1 − α^(1))^2 ε^(1) I).

Similarly, we define the private latent variable X^(2), with

p(X^(2) | X^(1,2), t) = ∏_{q=1}^{Q} N(x^(2)_q | (1 − α^(2)) x^(1,2)_q, (α^(2))^2 K̃^(2)_{t,t} + (1 − α^(2))^2 ε^(2) I),

where α^(2) ∈ [0, 1] and ε^(2) denotes the variance of the Gaussian noise in the second view. The structure of the latent space in our model is largely different from that of previous multi-view models based on GPLVMs and GPDSs, such as the shared GPLVM, the subspace GPLVM and the MRD. The shared GPLVM employs a single shared latent variable for all views, so all variance in the observations is shared and private information cannot be modeled. The subspace GPLVM introduces a factorized latent space in which each view is connected to an additional private latent space; since it uses MAP estimates, the structure of the latent space cannot be determined automatically. The MRD also employs a single latent space and decides whether a dimension is private or shared according to the weights of the ARD covariance functions and a manually specified threshold. All the above models either use a single latent variable or do not explicitly model the relationship between private and shared latent variables (dimensions). Our model explicitly models the relevance between the shared and private latent spaces, and this relevance can be automatically learned by optimizing the weights α^(1) and α^(2).
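A quick way to check the marginalization above is a scalar Monte Carlo experiment: drawing x̂ ~ N(x_shared, ε) and x̃ from its prior and combining them with weight α reproduces the mean (1 − α) x_shared and variance α² K̃ + (1 − α)² ε of p(X^(1) | X^(1,2), t). The numerical values below are arbitrary test settings:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, eps, k_tilde, n = 0.3, 0.05, 1.0, 400_000  # arbitrary test values

# Scalar instance of x^(1) = (1 - alpha) * xhat + alpha * xtilde, where
# xhat ~ N(x_shared, eps) is the noisy copy of the shared latent value and
# xtilde ~ N(0, k_tilde) follows the view-specific dynamical prior.
x_shared = 1.5
xhat = rng.normal(x_shared, np.sqrt(eps), n)
xtilde = rng.normal(0.0, np.sqrt(k_tilde), n)
x_priv = (1 - alpha) * xhat + alpha * xtilde

# Marginalizing xhat gives mean (1 - alpha) * x_shared and variance
# alpha^2 * k_tilde + (1 - alpha)^2 * eps, matching p(X^(1) | X^(1,2), t).
mean_pred = (1 - alpha) * x_shared
var_pred = alpha**2 * k_tilde + (1 - alpha)**2 * eps
```

With these settings the predicted mean and variance are 1.05 and 0.1145, and the empirical moments of `x_priv` agree to within sampling error.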
The mappings from X^(1) to Y^(1) and from X^(2) to Y^(2) in the McGPDS employ the same idea as the mapping from X to Y in the CGPDS (Zhao et al., 2018). Owing to the conditional independence assumption, the distributions of the outputs can be written as products over dimensions, that is,

p(Y^(1) | X^(1)) = ∏_{d=1}^{D_1} N( y^(1)_d | Σ_{j=1}^{J} w^(1)_{dj} g^(1)_j(X^(1)) + h^(1)(X^(1)), (β^(1))^{-1} I ),

p(Y^(2) | X^(2)) = ∏_{d=1}^{D_2} N( y^(2)_d | Σ_{j=1}^{J} w^(2)_{dj} g^(2)_j(X^(2)) + h^(2)(X^(2)), (β^(2))^{-1} I ),

where β^(1) and β^(2) are the inverse variances of the white Gaussian noise. The latent processes h^(1) and {g^(1)_j}_{j=1}^{J} are GPs indexed by the input X^(1); similarly, h^(2) and {g^(2)_j}_{j=1}^{J} are GPs indexed by X^(2). We have

h^(1)(x^(1)) ~ GP(0, κ^(1)_h(x^(1), x^(1)′)),   h^(2)(x^(2)) ~ GP(0, κ^(2)_h(x^(2), x^(2)′)),

g^(1)_j(x^(1)) ~ GP(0, κ^(1)j_g(x^(1), x^(1)′)),   g^(2)_j(x^(2)) ~ GP(0, κ^(2)j_g(x^(2), x^(2)′)),

where the kernels κ^(1)_h(x^(1), x^(1)′) and κ^(1)j_g(x^(1), x^(1)′) are parameterized by θ^(1)_h and θ^(1)j_g, respectively; similarly, θ^(2)_h and θ^(2)j_g are the parameters of κ^(2)_h(x^(2), x^(2)′) and κ^(2)j_g(x^(2), x^(2)′). These mappings differ from those of the shared GPLVM, subspace GPLVM and MRD, which employ one GP mapping per view to capture the common information of the multiple outputs. Those models cannot sufficiently model the characteristics of each output, while the mappings in our model capture both the differences and the dependence among multiple outputs.

4. Inference and Learning

Given the model assumptions, the joint distribution of the observations and latent variables of the proposed model is

p(Y^(1), Y^(2), H^(1), H^(2), G^(1), G^(2), X^(1), X^(2), X^(1,2)) = ∏_{K∈{(1),(2)}} p(Y^K | G^K, H^K) p(G^K, H^K | X^K) p(X^K | X^(1,2), t) · p(X^(1,2) | t),   (6)

where the superscript K ∈ {(1), (2)} of a variable indicates the view the variable corresponds to, and G^K = [(g^K_1)^⊤, …, (g^K_J)^⊤]^⊤. The marginal likelihood, commonly used as the objective of model learning, is obtained by integrating out all the latent variables. However, the private low-dimensional variables X^(1) and X^(2) cannot be integrated out, because they appear nonlinearly in the inverses of the kernel matrices G^(1)j_{X,X}, H^(1)_{X,X} and G^(2)j_{X,X}, H^(2)_{X,X}, respectively. Throughout the paper, covariance matrices are represented by bold uppercase characters with superscripts and subscripts: the corresponding GP can be inferred from the character, with K for x, H for h and G for g; the superscript indicates the view the GP belongs to, while the subscript indicates the inputs at which the covariance matrix is evaluated. Following Titsias and Lawrence (2010), we approximate the true posterior of the model by variational inference and derive a variational lower bound of the logarithmic marginal likelihood.

4.1 Variational Lower Bound

We introduce inducing points and adopt the structured variational inference method for our model. To train the proposed model, we minimize the KL divergence between the approximate posterior and the true posterior, which is equivalent to maximizing the evidence lower bound of the logarithmic marginal likelihood. First, we employ inducing variables to augment the model. Specifically, for each view K ∈ {(1), (2)} and each latent function, we introduce a set of M inducing variables. We use {u^K_j ∈ R^M}_{j=1}^{J} and v^K ∈ R^M to represent the values of g^K_j at the inducing inputs Z^{Kj}_g ∈ R^{M×Q} and the values of h^K at the inducing inputs Z^K_h ∈ R^{M×Q}, respectively. Denote U^K = [(u^K_1)^⊤, …, (u^K_J)^⊤]^⊤. Owing to the conditional independence assumption on the latent variables {g^K_j}_{j=1}^{J}, we have p(U^K | {Z^{Kj}_g}_{j=1}^{J}) = ∏_{j=1}^{J} N(u^K_j | 0, G^{Kj}_{Z,Z}). The distribution p(v^K) is also assumed to be zero-mean Gaussian with covariance matrix H^K_{Z,Z}.
The conditional Gaussian distributions are given as p(G^K | U^K, X^K) = ∏_{j=1}^{J} N(g^K_j | μ^{Kj}_g, G̃^{Kj}_{X,X}) with

μ^{Kj}_g = G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} u^K_j,   G̃^{Kj}_{X,X} = G^{Kj}_{X,X} − G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} G^{Kj}_{Z,X},

and p(H^K | V^K, X^K) = N(H^K | μ^K_h, H̃^K_{X,X}) with

μ^K_h = H^K_{X,Z} (H^K_{Z,Z})^{-1} v^K,   H̃^K_{X,X} = H^K_{X,X} − H^K_{X,Z} (H^K_{Z,Z})^{-1} H^K_{Z,X}.

Then, we introduce the joint variational distribution, which is assumed to factorize as q(Θ^(1)) q(Θ^(2)) q(X^(1)) q(X^(2)) q(X^(1,2)), where q(X^(1)) = N(X^(1) | μ^(1), S^(1)), q(X^(2)) = N(X^(2) | μ^(2), S^(2)) and q(X^(1,2)) = N(X^(1,2) | μ^(1,2), S^(1,2)). q(Θ^(1)) and q(Θ^(2)) are the variational distributions of the latent variables {G^(1), H^(1), U^(1), V^(1)} and {G^(2), H^(2), U^(2), V^(2)}, whose specific forms are defined as

q(Θ^K) = p(G^K | U^K, X^K) p(H^K | V^K, X^K) q(U^K) q(V^K), K ∈ {(1), (2)}.   (7)

Finally, given the above assumptions, the lower bound of the logarithmic marginal likelihood can be expressed as

F_v(q) = ∫ ∏_K q(Θ^K) q(X^K) q(X^(1,2)) log [ ∏_K p(Y^K | X^K) p(X^K | X^(1,2), t) p(X^(1,2) | t) / ( ∏_K q(Θ^K) q(X^K) q(X^(1,2)) ) ] dΘ^K dX^K dX^(1,2)
= −KL( q(X^(1)) q(X^(2)) q(X^(1,2)) || p(X^(1) | X^(1,2), t) p(X^(2) | X^(1,2), t) p(X^(1,2) | t) ) + Σ_K L̂^K, K ∈ {(1), (2)}.   (8)

The detailed calculation of the KL divergence is given below:

KL( q(X^(1)) q(X^(2)) q(X^(1,2)) || p(X^(1) | X^(1,2), t) p(X^(2) | X^(1,2), t) p(X^(1,2) | t) )
= (1/2) Σ_{q=1}^{Q} [ log |A^(1)| + log |A^(2)| + log |K^(1,2)_{t,t}| − log |S^(1,2)_q| − log |S^(1)_q| − log |S^(2)_q|
+ ( (1 − α^(1)) μ^(1,2)_q − μ^(1)_q )^⊤ (A^(1))^{-1} ( (1 − α^(1)) μ^(1,2)_q − μ^(1)_q )
+ ( (1 − α^(2)) μ^(1,2)_q − μ^(2)_q )^⊤ (A^(2))^{-1} ( (1 − α^(2)) μ^(1,2)_q − μ^(2)_q )
+ Tr( [ (1 − α^(1))^2 (A^(1))^{-1} + (1 − α^(2))^2 (A^(2))^{-1} ] S^(1,2)_q )
+ Tr( (K^(1,2)_{t,t})^{-1} [ μ^(1,2)_q (μ^(1,2)_q)^⊤ + S^(1,2)_q ] )
+ Tr( (A^(1))^{-1} S^(1)_q + (A^(2))^{-1} S^(2)_q ) ] + const,   (9)

where A^(1) and A^(2) denote (α^(1))^2 K̃^(1)_{t,t} + (1 − α^(1))^2 ε^(1) I and (α^(2))^2 K̃^(2)_{t,t} + (1 − α^(2))^2 ε^(2) I, respectively. Since the observations on different dimensions in each view are assumed to be conditionally independent, the term L̂^K can be decomposed across dimensions, which has the following form.
L̂^K = Σ_{d=1}^{D_K} { (1/2) log |H^K_{Z,Z}| − (1/2) log |β^K ψ^K_4 + H^K_{Z,Z}| + (1/2) Σ_{j=1}^{J} [ log |G^{Kj}_{Z,Z}| − log |β^K (w^K_{dj})^2 ψ^{Kj}_5 + G^{Kj}_{Z,Z}| ]
− (1/2) (y^K_d)^⊤ [ β^K I − Σ_{j=1}^{J} (β^K)^2 (w^K_{dj})^2 ψ^{Kj}_1 ( β^K (w^K_{dj})^2 ψ^{Kj}_5 + G^{Kj}_{Z,Z} )^{-1} (ψ^{Kj}_1)^⊤ − (β^K)^2 ψ^K_0 ( β^K ψ^K_4 + H^K_{Z,Z} )^{-1} (ψ^K_0)^⊤ ] y^K_d
− (β^K/2) ψ^K_2 + (β^K/2) Tr( ψ^K_4 (H^K_{Z,Z})^{-1} ) − (β^K/2) Σ_{j=1}^{J} (w^K_{dj})^2 ψ^{Kj}_3 + (β^K/2) Σ_{j=1}^{J} Tr( (w^K_{dj})^2 ψ^{Kj}_5 (G^{Kj}_{Z,Z})^{-1} ) + const },   (10)

where ψ^K_0 = ⟨H^K_{X,Z}⟩_{q(X^K)}, ψ^{Kj}_1 = ⟨G^{Kj}_{X,Z}⟩_{q(X^K)}, ψ^K_2 = Tr(⟨H^K_{X,X}⟩_{q(X^K)}), ψ^{Kj}_3 = Tr(⟨G^{Kj}_{X,X}⟩_{q(X^K)}), ψ^K_4 = ⟨H^K_{Z,X} H^K_{X,Z}⟩_{q(X^K)}, and ψ^{Kj}_5 = ⟨G^{Kj}_{Z,X} G^{Kj}_{X,Z}⟩_{q(X^K)}. Here ⟨·⟩_{q(X^K)} denotes expectation under the distribution q(X^K). The detailed computations of the evidence lower bound and the involved statistics are given in Appendices A and B, respectively. The computational cost of training the McGPDS is dominated by inverting the kernel matrices, and thus the computational complexity is O(V D (J + 1) M^3 + (V + 1) N^3), where V is the number of views.

4.2 Parameter Estimation

The parameters to be optimized in the proposed model include model parameters and variational parameters. The model parameters involve the hyperparameters of the kernel functions of the latent variables {g^(1), g^(2), h^(1), h^(2), X^(1), X^(2), X^(1,2)}, e.g., σ^2_f and α_q in the ARD kernel κ(x, x′) = σ^2_f exp( −(1/2) Σ_{q=1}^{Q} α_q (x_q − x′_q)^2 ), the inverse variances of the white Gaussian noise {β^(1), β^(2)}, the Gaussian noise variances {ε^(1), ε^(2)}, and the weights {W^(1), W^(2), α^(1), α^(2)}. The variational parameters include the means and covariances of the variational distributions, {μ^(1), S^(1), μ^(2), S^(2), μ^(1,2), S^(1,2)}, and the inducing inputs {Z^(1)_g, Z^(1)_h, Z^(2)_g, Z^(2)_h}. All the parameters are jointly optimized through the gradient descent method. Here we give the update rules for the variational means and covariance matrices, in which the optimization of the covariances employs the reparameterization trick inspired by Opper and Archambeau (2009). The derivation is analogous to that in Damianou et al. (2011) and Damianou et al.
(2016), to which we refer the readers for more details. The variational means in the private latent spaces can be optimized by the gradient descent method, and the gradient of the evidence lower bound w.r.t. the variational mean is given by

\frac{\partial \mathcal{L}}{\partial \mu^K_q} = \frac{\partial \hat{\mathcal{L}}^K}{\partial \mu^K_q} - (A^K_q)^{-1} \big( \mu^K_q - (1-\alpha^K) \mu^{(1,2)}_q \big).

The private variational covariance matrix S^K_q can be reparameterized as S^K_q = \big( (A^K_q)^{-1} + \mathrm{diag}(\lambda^K_q) \big)^{-1}, where \mathrm{diag}(\lambda^K_q) = -2\, \partial \mathcal{F}_v(q) / \partial S^K_q is an N \times N diagonal and positive definite matrix, w.r.t. which the gradient of the evidence lower bound is given by

\frac{\partial \mathcal{L}}{\partial \lambda^K_q} = -(S^K_q \circ S^K_q) \Big( \frac{\partial \hat{\mathcal{L}}^K}{\partial S^K_q} + \frac{\lambda^K_q}{2} \Big). (11)

The shared variational parameters \{\mu^{(1,2)}, S^{(1,2)}\} have analytical solutions. After updating the private variational parameters, we can update the shared variational parameters by the following equations.

\mu^{(1,2)}_q = S^{(1,2)}_q \Big[ (1-\alpha^{(1)}) (A^{(1)}_q)^{-1} \mu^{(1)}_q + (1-\alpha^{(2)}) (A^{(2)}_q)^{-1} \mu^{(2)}_q \Big], (12)

S^{(1,2)}_q = \Big[ (1-\alpha^{(1)})^2 (A^{(1)}_q)^{-1} + (1-\alpha^{(2)})^2 (A^{(2)}_q)^{-1} + (K^{(1,2)}_{t,t})^{-1} \Big]^{-1}. (13)

5. Prediction with the McGPDS

Given a trained McGPDS, which jointly models the observations of two views Y^{(1)} and Y^{(2)} and learns the shared latent space X^{(1,2)} and the private latent spaces X^{(1)} and X^{(2)}, we aim to generate the outputs of one view given the observations of the other view, for example, to generate Y^{(2)}_* \in \mathbb{R}^{N_* \times D_2} from Y^{(1)}_* \in \mathbb{R}^{N_* \times D_1}. The McPDS accomplishes this task in three steps, similar to MRD (Damianou et al., 2012). In the first step, we use variational inference again to derive the posterior distributions of the latent variables X^{(1)}_* \in \mathbb{R}^{N_* \times Q} and X^{(1,2)}_* \in \mathbb{R}^{N_* \times Q}, which are most likely to govern the generation of Y^{(1)}_*. We use q(X^{(1)}_*, X^{(1,2)}_*) to approximate p(X^{(1)}_*, X^{(1,2)}_* \mid Y^{(1)}_*). The approximate posterior distribution q(X^{(1)}_*, X^{(1,2)}_*) is the marginal distribution of q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*).
To obtain q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*), we maximize the variational lower bound of the marginal likelihood p(Y^{(1)}, Y^{(1)}_*),

\mathcal{F}^{(1)} = -\mathrm{KL}\big( q(X^{(1)}_*, X^{(1)})\, q(X^{(1,2)}_*, X^{(1,2)}) \,\big\|\, p(X^{(1)}_*, X^{(1)} \mid X^{(1,2)}_*, X^{(1,2)})\, p(X^{(1,2)}_*, X^{(1,2)}) \big) + \hat{\mathcal{L}}^{(1)}(Y^{(1)}_*, Y^{(1)}), (14)

where we've omitted the time inputs t and t_* for brevity. The lower bound can be maximized using the same method as for training. The detailed calculation of \mathcal{F}^{(1)} is given in Appendix C.

In the second step, we obtain the private latent variable, which is also essential for generating data of a view. Precisely, in order to generate observations Y^{(2)}_*, we need to obtain the private latent variable X^{(2)}_*. However, the observed test data of the first view Y^{(1)}_* alone can hardly provide information about data of the second view Y^{(2)}_*, so it is quite difficult to obtain an exact representation of X^{(2)}_*. Therefore, we resort to the latent variables learned from the training data, X^{(1,2)} and X^{(2)}, and employ the nearest neighbor method to obtain the private latent variable X^{(2)}_*. Specifically, we find the shared training latent variable in X^{(1,2)} that is closest to the X^{(1,2)}_* obtained in the first step, and take the variational distribution of the training private latent variable X^{(2)} at the corresponding indexes as the approximate posterior of the private latent variable X^{(2)}_*.

In the third step, we predict the output Y^{(2)}_* using the marginal posterior distribution of the latent variable, q(X^{(2)}_*), obtained in the second step. Specifically, Y^{(2)}_* can be calculated by

p(Y^{(2)}_*) = \int p(Y^{(2)}_* \mid G^{(2)}_*, H^{(2)}_*)\, p(G^{(2)}_* \mid X^{(2)}_*, U^{(2)})\, p(H^{(2)}_* \mid X^{(2)}_*, V^{(2)})\, q(U^{(2)})\, q(V^{(2)})\, q(X^{(2)}_*)\, dG^{(2)}_*\, dH^{(2)}_*\, dX^{(2)}_*\, dU^{(2)}\, dV^{(2)}. (15)

Note that the variational distributions q(U^{(2)}) and q(V^{(2)}) are obtained during the training phase and need not be optimized during prediction.
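The nearest-neighbor step of the second stage above can be sketched in a few lines of numpy. This is a hypothetical helper, not the authors' Matlab implementation; `mu_shared_test` and `mu_shared_train` stand for the variational means of the shared latent variables, and `X2_train` for the training private latent means of the second view:

```python
import numpy as np

def nearest_private_latents(mu_shared_test, mu_shared_train, X2_train, k=1):
    """For each test shared latent point, find its k closest training shared
    latent points and reuse their view-2 private latent variables
    (k = 1 is used in the experiments)."""
    # Pairwise squared Euclidean distances between test and training points.
    d = (np.sum(mu_shared_test**2, axis=1)[:, None]
         + np.sum(mu_shared_train**2, axis=1)[None, :]
         - 2.0 * mu_shared_test @ mu_shared_train.T)
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the nearest neighbours
    return X2_train[idx].mean(axis=1)    # reduces to a plain copy when k = 1
```

The same index lookup would also return the corresponding variational covariances in a full implementation.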
Algorithm 1 Prediction with the McGPDS
1: Input: training data for the two views Y^{(1)} and Y^{(2)}, an McGPDS model trained on the two-view data (Y^{(1)}, Y^{(2)}), and test data of the first view Y^{(1)}_*.
2: Output: generated observations of the second view Y^{(2)}_*.
3: Maximize the evidence lower bound of the marginal likelihood p(Y^{(1)}_*, Y^{(1)}) to obtain q(X^{(1)}, X^{(1,2)}, X^{(1)}_*, X^{(1,2)}_*).
4: Get the marginal distribution q(X^{(1)}_*, X^{(1,2)}_*) to obtain the test means \mu^{(1)}_*, \mu^{(1,2)}_* and covariances S^{(1)}_*, S^{(1,2)}_*.
5: Find the optimal \hat{\mu}^{(2)}_* and \hat{S}^{(2)}_* using the K-nearest neighbor method according to the distances between \mu^{(1,2)}_* and \mu^{(1,2)}.
6: q(X^{(2)}_*) \leftarrow \mathcal{N}(\hat{\mu}^{(2)}_*, \hat{S}^{(2)}_*).
7: Predict Y^{(2)}_* using Equation (15).

Since the integration in (15) is analytically intractable, we follow Damianou et al. (2011) to calculate the expectations of g^{(2)}_* and h^{(2)}_*, denoted E(g^{(2)}_*) and E(h^{(2)}_*), respectively, and estimate the corresponding covariance matrices with Monte Carlo sampling. The element-wise autocovariance matrices of g^{(2)}_* and h^{(2)}_* are denoted as V(g^{(2)}_*) and V(h^{(2)}_*), respectively.

E(h^{(2)}_*) = \psi^{(2)}_{0*}\, b^{(2)}_h, \qquad E(g^{(2)j}_*) = \psi^{(2)j}_{1*}\, b^{(2)j}_g,

V(h^{(2)}_{*n}) = (b^{(2)}_h)^\top \big( \psi^{(2)}_{4*n} - (\psi^{(2)}_{0*n})^\top \psi^{(2)}_{0*n} \big) b^{(2)}_h + \psi^{(2)}_{2*} - \mathrm{Tr}\Big[ \big( (H^{(2)}_{Z,Z})^{-1} - (H^{(2)}_{Z,Z} + \beta^{(2)} \psi^{(2)}_4)^{-1} \big) \psi^{(2)}_{4*n} \Big],

V(g^{(2)j}_{*n}) = (b^{(2)j}_g)^\top \big( \psi^{(2)j}_{5*n} - (\psi^{(2)j}_{1*n})^\top \psi^{(2)j}_{1*n} \big) b^{(2)j}_g + \psi^{(2)j}_{3*} - \mathrm{Tr}\Big[ \big( (G^{(2)j}_{Z,Z})^{-1} - (G^{(2)j}_{Z,Z} + \beta^{(2)} w_{dj}^2 \psi^{(2)j}_5)^{-1} \big) \psi^{(2)j}_{5*n} \Big],

where V(h^{(2)}_{*n}) denotes the n-th entry of V(h^{(2)}_*), and V(g^{(2)j}_{*n}) denotes the (n, j)-th entry of V(g^{(2)}_*).
Here,

b^{(2)}_h = \beta^{(2)} \big( H^{(2)}_{Z,Z} + \beta^{(2)} \psi^{(2)}_4 \big)^{-1} (\psi^{(2)}_0)^\top y^{(2)}, \qquad b^{(2)j}_g = \beta^{(2)} \big( G^{(2)j}_{Z,Z} + \beta^{(2)} \psi^{(2)j}_5 \big)^{-1} (\psi^{(2)j}_1)^\top y^{(2)},

\psi^{(2)}_{0*} = \langle H^{(2)}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)j}_{1*} = \langle G^{(2)j}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)}_{2*} = \mathrm{Tr}\big( \langle H^{(2)}_{X_*,X_*} \rangle_{q(X^{(2)}_*)} \big), \quad \psi^{(2)j}_{3*} = \mathrm{Tr}\big( \langle G^{(2)j}_{X_*,X_*} \rangle_{q(X^{(2)}_*)} \big),

\psi^{(2)}_{4*} = \langle H^{(2)}_{Z,X_*} H^{(2)}_{X_*,Z} \rangle_{q(X^{(2)}_*)}, \quad \psi^{(2)j}_{5*} = \langle G^{(2)j}_{Z,X_*} G^{(2)j}_{X_*,Z} \rangle_{q(X^{(2)}_*)},

\psi^{(2)}_{0*n} = \langle H^{(2)}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)j}_{1*n} = \langle G^{(2)j}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)}_{4*n} = \langle H^{(2)}_{Z,X_{*n}} H^{(2)}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})}, \quad \psi^{(2)j}_{5*n} = \langle G^{(2)j}_{Z,X_{*n}} G^{(2)j}_{X_{*n},Z} \rangle_{q(X^{(2)}_{*n})},

with n = 1, \ldots, N_*, d = 1, \ldots, D and j = 1, \ldots, J. Since Y^{(2)}_{*d} = \sum_{j=1}^{J} w^{(2)}_{dj} g^{(2)j}_* + h^{(2)}_*, d \in [1 \ldots D], the expectation and covariance of Y^{(2)}_{*d} are E(Y^{(2)}_{*d}) = \sum_{j=1}^{J} w^{(2)}_{dj} E(g^{(2)j}_*) + E(h^{(2)}_*) and V(Y^{(2)}_{*d}) = \sum_{j=1}^{J} (w^{(2)}_{dj})^2 V(g^{(2)j}_*) + V(h^{(2)}_*) + (\beta^{(2)})^{-1} I, where (y^{(2)})^\top = [(y^{(2)}_1)^\top, \ldots, (y^{(2)}_D)^\top]. The whole prediction process is shown in Algorithm 1.

6. Experiments

In order to validate the effectiveness of the proposed McGPDS, we conduct experiments on five multi-view datasets, including two synthetic datasets and three real-world datasets.¹ We evaluate our model on two different kinds of tasks. The first is recovering the structures of the latent variables when the correlation between the shared and private latent variables is strong. The second is generating data of one view given data of the other view. For comparison, all models are trained with the same initializations, and we set J = 1 in the proposed model. For the toy data experiments, we use a linear kernel without inducing points, and the dimension of each view's private latent variable is set to 1. For the real-world data experiments, we use the RBF kernel with the variance initialized to 1.

1. For an implementation of McGPDS in Matlab, see https://github.com/mcgpds/mcgpds.
We use 100 inducing points, and the dimension of each view's private latent variable is set to 5 unless otherwise stated. For all the experiments, α is initialized to 0.5 for each view, and the mixture weights in the output layer are independently initialized from a Gaussian distribution with zero mean and 0.01 variance. For the K-nearest neighbor method, we set K = 1. In the experiments, the shared GPLVM refers to the new version of the shared GPLVM, namely, the subspace GPLVM. For MRD, we follow the settings in Damianou et al. (2012). All experiments are repeated five times, and the average results are reported as the final results. The root mean square error (RMSE) and the mean standardized log loss (MSLL) are used as the performance measures. MSLL is the mean negative log probability of all the test data, where the predictive density is given by (15). The lower the RMSE and MSLL are, the better the performance is.

6.1 Toy Data

Figure 4: The results of McGPDS on the toy dataset: (a) private signal cos(π²t), (b) private signal cos(√5 πt), (c) shared signal sin(2πt). Red lines represent true signals, and blue lines represent recovered signals.

Figure 5: The results of MRD on the toy dataset: (a) private signal cos(π²t), (b) private signal cos(√5 πt), (c) shared signal sin(2πt). Red lines represent true signals, and blue lines represent recovered signals.

First, we conduct the experiment on a synthetic dataset similar to the one used by Salzmann et al. (2010) and Jia et al. (2010).
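The two evaluation measures just introduced, RMSE and MSLL, can be sketched in numpy (MSLL here follows the paper's stated definition: the mean negative log probability under an element-wise Gaussian predictive density; the function names are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all entries."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def msll(y_true, pred_mean, pred_var):
    """Mean negative log probability of the test data under a Gaussian
    predictive density with element-wise mean and variance."""
    nll = (0.5 * np.log(2.0 * np.pi * pred_var)
           + 0.5 * (y_true - pred_mean) ** 2 / pred_var)
    return float(np.mean(nll))
```

Lower values of both measures indicate better predictions, as in the tables below.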
We first generate three one-dimensional latent variables using three signals: cos(π²t) and cos(√5 πt), which generate the private latent variables, and sin(2πt), which generates the shared latent variable. Then, we use randomly generated projection matrices to map the one-dimensional private latent variables to a ten-dimensional space and the one-dimensional shared latent variable to a five-dimensional space. The two-view sequential data Y^{(1)} and Y^{(2)} are constructed by concatenating the ten-dimensional private variable of each view with the five-dimensional shared variable. Therefore, both generated sequences Y^{(1)} and Y^{(2)} have 15 dimensions in total, that is, y^{(1)}_i, y^{(2)}_i ∈ R^{15}. The proposed model is capable of learning the latent variables corresponding to the observed sequential data. We use the McGPDS with a linear kernel function to recover the latent signals: the private signals (cos(π²t) and cos(√5 πt)) and the shared signal (sin(2πt)). We compare our model with the state-of-the-art GP-based multi-view dynamical system, i.e., MRD with dynamics. Figure 4 shows the recovery results of the latent signals by our model. Specifically, Figures 4(a), (b) and (c) show the true signals as well as the signals recovered by McGPDS for cos(π²t), cos(√5 πt) and sin(2πt), respectively. As shown in Figure 4, the recovered signals almost exactly match the true signals (up to a translation), which demonstrates that our model has the ability to learn an effective latent representation even when the private latent variables are orthogonal to the shared latent variables. As a comparison, Figure 5 shows the results of MRD with dynamics on this toy dataset. Figures 5(a) and (b) show that the private signals recovered by MRD deviate significantly from the true signals in both views. The only recovered signal that matches the true signal is the shared signal, as shown in Figure 5(c).
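The construction of the two-view toy data described above can be sketched as follows (the sequence length of 100 and the particular random projections are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)

# One-dimensional latent signals: two private, one shared.
x1 = np.cos(np.pi**2 * t)[:, None]              # private signal, view 1
x2 = np.cos(np.sqrt(5.0) * np.pi * t)[:, None]  # private signal, view 2
xs = np.sin(2.0 * np.pi * t)[:, None]           # shared signal

# Randomly generated projection matrices: private -> 10-D, shared -> 5-D.
W1 = rng.standard_normal((1, 10))
W2 = rng.standard_normal((1, 10))
Ws = rng.standard_normal((1, 5))

# Each view concatenates its 10-D private part with the 5-D shared part.
Y1 = np.hstack([x1 @ W1, xs @ Ws])   # view 1: 100 x 15
Y2 = np.hstack([x2 @ W2, xs @ Ws])   # view 2: 100 x 15
```

The shared 5-D block is identical across views by construction, which is exactly the structure the shared latent variable is meant to recover.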
Figure 6: The learned α^{(1)} and α^{(2)} of McGPDS on the toy dataset with different variances of the private latent signal of view 2 (horizontal axis: variance of the private signal of the second view, from 0.1 to 1.0).

Next, we test the interpretability of the learned combination weights, α^{(1)} and α^{(2)}, on another synthetic dataset. The shared latent signal is generated from sin(2πt), and the private latent signals are generated from pure Gaussian noise. The variance of the Gaussian noise of the first view is fixed to 0.1, while that of the second view varies from 0.1 to 1. The observations of each view are constructed by concatenating its private signal and the shared signal. Under this construction, the first view contains almost only the shared signal, while the ratio of the private to the shared signal in the second view increases with the variance of the former. We plot the learned α^{(1)} and α^{(2)} against the variance of the private latent variable of view 2 in Figure 6. As expected, the learned α^{(2)} increases with the variance of the private signal of the second view, which coincides with the growing significance of the private signal. The learned α^{(1)} also increases, but at a slower rate, since large noise in the second view makes it harder to recover the shared signal, and the view-specific dynamics have to compensate.

6.2 Human Motion Data

In this experiment, we use the human motion data, which contain a set of 3D human poses and their corresponding silhouettes. The data were collected by Agarwal and Triggs (Ankur and Bill, 2006). We use 566 frames for training, which contain 5 sequences corresponding to walking motions in different directions. The test data are a separate walking sequence of 158 frames. The pose data are 63-dimensional joint location vectors, and the silhouette data are 100-dimensional histogram of oriented gradients (HOG) vectors.
We consider the task of generating data of one view given the other view; that is, we generate the corresponding 3D human poses given the silhouette data. We use the RBF kernel for all GPs and 100 inducing points for McGPDS. The dimensions of the shared and both private latent variables are set to 5 for all the models. As described in the previous section, given test data in the first view Y^{(1)}_{test}, McGPDS optimises the private latent variables of the first view X^{(1)}_{test} and the shared latent points X^{(1,2)}_{test}. Then, the training latent variables X^{(2)} of the second view are selected as the test private latent variables X^{(2)}_{test} according to the similarity between X^{(1,2)} and X^{(1,2)}_{test}. Finally, McGPDS generates a set of novel poses Y^{(2)}_{test} based on these selected training latent points X^{(2)}.

Table 1: The RMSE and MSLL on the human motion dataset.

Method | RMSE | MSLL
NNYspace | 2.65 ± 0.00 | -
NNXspace (X learned by MRD) | 3.19 ± 0.03 | -
NNXspace (X learned by McGPDS) | 2.40 ± 0.03 | -
Shared GPLVM | 5.15 ± 0.01 | 3.41 ± 0.17
MRD without dynamics | 5.03 ± 0.01 | 3.37 ± 0.03
MRD with dynamics | 2.65 ± 0.01 | 3.01 ± 0.25
Independent CGPDS | 2.69 ± 0.13 | 3.22 ± 0.23
McGPDS | 2.37 ± 0.03 | 2.60 ± 0.05
McGPDS+GPLVM | 2.62 ± 0.04 | 3.78 ± 0.25
McGPDS+Linear | 2.81 ± 0.15 | -

In this experiment, we compare our model with seven different methods: the nearest neighbor (NN) method in the silhouette space (NNYspace), the NN method in the X space with X learned by MRD, the NN method in the X space with X learned by McGPDS, the shared Gaussian process latent variable model (GPLVM), MRD without dynamics, MRD with dynamics, and the independent CGPDS model. NNYspace finds the predicted 3D pose from the training data whose silhouette is the closest to the corresponding test silhouette. Similarly, NNXspace finds the predicted 3D pose from the training data whose shared latent information is the closest to the corresponding shared information of the test data. The independent CGPDS model uses one CGPDS on each view independently.
To demonstrate the usefulness of the two key components of McGPDS, i.e., modelling the private latent variables using GPs with the mixture mean and covariance, and modelling the map from the private latent variables to the observations with a CGPDS, we conduct ablation studies for them. More specifically, we run two methods on the human motion dataset with the other settings unchanged: McGPDS+GPLVM, which is McGPDS with the prior of the private latent variables replaced by that of the GPLVM, and McGPDS+Linear, which is McGPDS with the output coupling layer replaced by a linear map.

Table 1 shows the RMSE and MSLL on the human motion dataset. As shown in Table 1, our model (McGPDS) obtains the lowest RMSE, 2.37 ± 0.03, and the lowest MSLL, 2.60 ± 0.05, which means that our model outperforms the state-of-the-art model (MRD with dynamics). Both McGPDS and MRD with dynamics outperform the independent CGPDS model, which confirms the usefulness of the shared latent space structures. In addition, NNXspace (X learned by McGPDS) performs better than NNXspace (X learned by MRD). The ablation studies also confirm the usefulness of the two key components. Figure 7 demonstrates the results visually. As shown in Figure 7, the 3D poses generated by our model are the closest to the true poses. To better understand the impact of the dimensionality and the number of inducing points in McGPDS, we plot the RMSE and MSLL against the total dimension of the private latent variables in Figure 8(a), and the RMSE, MSLL and training time against the number of inducing points in Figure 8(b). Figure 8(a) shows that the RMSE of McGPDS decreases as the total dimension

Figure 7: The results of generating 3D poses given silhouettes. The left-most side of each line represents the test silhouette.
The remaining parts, from left to right, are the true poses, the poses generated by MRD without dynamics, the poses generated by NNXspace (X learned by McGPDS), the poses generated by MRD with dynamics, and the poses generated by McGPDS, respectively.

of the private latent variables increases, implying that a larger latent space gives McGPDS more capability to capture the multi-view dynamics. The increase of the MSLL is possibly due to the increased number of variables, which encourages the model to upweight the KL divergence term in the ELBO, leading to an increase in the variance of the likelihood and thus in the MSLL. Figure 8(b) shows that the training time increases with the number of inducing points, while the impact of the latter on the RMSE and MSLL is moderate.

Figure 8: (a) RMSE and MSLL of McGPDS with different total dimensions of the private latent variables on the human motion dataset. (b) RMSE, MSLL and training time (hr) of McGPDS with different numbers of inducing points on the human motion dataset.

6.3 CUAVE Data

In this experiment, we employ the CUAVE data, which are composed of videos showing a person speaking Arabic numerals and the corresponding Mel frequency cepstral coefficient (MFCC) features of the audio signals. Each video frame is represented by a 3750-dimensional vector, and each MFCC feature is represented by a 13-dimensional vector. We use 194 frames of videos and MFCC features as training data and 51 frames of videos for testing. Our task is to

Table 2: The RMSE and MSLL on the CUAVE dataset.
Method | RMSE | MSLL
NNYspace | 1.31 ± 0.00 | -
NNXspace (X learned by MRD) | 1.70 ± 0.10 | -
NNXspace (X learned by McGPDS) | 1.38 ± 0.15 | -
Shared GPLVM | 1.61 ± 0.01 | 4.70 ± 0.27
MRD without dynamics | 1.29 ± 0.01 | 4.34 ± 0.13
MRD with dynamics | 1.24 ± 0.03 | 3.45 ± 0.20
McGPDS | 1.19 ± 0.03 | 1.94 ± 0.07

generate MFCC features given the frames of the videos. We use the RBF kernel for all GPs and 100 inducing points for McGPDS. The dimensions of the shared and both private latent variables are set to 5 for all the models. From Table 2, we can see that our model obtains the best performance (with the lowest RMSE, 1.19 ± 0.03, and the lowest MSLL, 1.94 ± 0.07) on the CUAVE dataset. The method NNXspace (X learned by McGPDS) is also better than NNXspace (X learned by MRD) on the CUAVE dataset. These results show that our model can obtain a more reasonable latent representation, and thus generate observations closer to the truth.

6.4 Classification

In the final experiment, we examine McGPDS on a classification task. We use the Oil dataset, which contains 1000 12-dimensional examples from 3 classes. The observations constitute the first view, while the corresponding labels are taken as the second view in the form of one-hot encoding. Following the setting of Damianou et al. (2012), we select 10 random subsets of the data with increasing numbers of training points and compare to the NN method in the data space.

Figure 9: Accuracy of McGPDS and NN on the Oil dataset (horizontal axis: number of training points).

Figure 9 shows that the accuracy of McGPDS is worse than NN when the training set is small and is comparable to NN as the number of training points increases. There are two possible reasons for the mediocre performance of McGPDS on small-size non-dynamic data. First, McGPDS uses three GPs to model the time dynamics, while the time stamps of non-dynamic data provide little, if not misleading, information about the observations.
Second, McGPDS uses a mixture of GPs to model the observations, while the observations of view 2 of the used dataset are just the one-hot representations of the labels. Both of these could potentially make McGPDS perform less well on non-dynamic small data. We leave the application of McGPDS to classification for future work.

7. Conclusion

In this paper, we have proposed the McGPDS, which extends the CGPDS to the scenario of multi-view learning with flexible and general modeling in the latent space. As a novel hierarchical multi-view framework, the McGPDS makes full use of the characteristics of multi-view data and the advantages of the CGPDS. The setting of the latent space is elastic and reasonable, where the relationship between the private and shared latent variables can be learned adaptively via optimizing the weights. We introduce inducing points and employ variational inference to integrate out the latent variables. The proposed model is trained by maximizing the evidence lower bound. The effectiveness of our model for multi-view learning has been empirically validated on synthetic and real-world two-view datasets. For future work, we will extend our model beyond the current two views. The methodology can be similar to the current scenario, but deriving the ELBO for more-than-two-view cases is non-trivial, and the applications of generating one or multiple views from other views will be more challenging.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Projects 62076096 and 62006078, Shanghai Municipal Project 20511100900, the Chenguang Program of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission under Grant 19CG25, the Open Research Fund of KLATASDS-MOE, and the Fundamental Research Funds for the Central Universities. Corresponding Author: Shiliang Sun.

Appendix A.
Derivation of the Evidence Lower Bound for Training

In this section, we give the detailed derivation of the evidence lower bound for the training data. Given two views of data, Y^{(1)} and Y^{(2)}, the joint probability distribution of the proposed model is given by

p(Y^{(1)}, Y^{(2)}, H^{(1)}, H^{(2)}, G^{(1)}, G^{(2)}, X^{(1)}, X^{(2)}, X^{(1,2)}) = p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(1)}, H^{(1)} \mid X^{(1)})\, p(G^{(2)}, H^{(2)} \mid X^{(2)})\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t). (16)

We obtain the marginal likelihood by integrating out the latent variables,

p(Y^{(1)}, Y^{(2)}) = \int p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(1)}, H^{(1)} \mid X^{(1)})\, p(G^{(2)}, H^{(2)} \mid X^{(2)})\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t)\, dH^{(1)} dH^{(2)} dG^{(1)} dG^{(2)} dX^{(1)} dX^{(2)} dX^{(1,2)}. (17)

Note that the integration w.r.t. X^{(1)} and X^{(2)} is intractable, because X^{(1)} appears nonlinearly in the inverses of the matrices G^{(1)}_{X,X} and H^{(1)}_{X,X}, and X^{(2)} appears nonlinearly in the inverses of the matrices G^{(2)}_{X,X} and H^{(2)}_{X,X}. Therefore, we introduce inducing variables U and V to augment the model and compute a lower bound on its logarithmic marginal likelihood. The augmented joint probability density takes the form

p(Y^{(1)}, Y^{(2)}, H^{(1)}, H^{(2)}, G^{(1)}, G^{(2)}, U^{(1)}, U^{(2)}, V^{(1)}, V^{(2)}, X^{(1)}, X^{(2)}, X^{(1,2)})
= p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(G^{(1)} \mid U^{(1)}, X^{(1)})\, p(H^{(1)} \mid V^{(1)}, X^{(1)})\, p(U^{(1)} \mid X^{(1)})\, p(V^{(1)} \mid X^{(1)})
\times p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(2)} \mid U^{(2)}, X^{(2)})\, p(H^{(2)} \mid V^{(2)}, X^{(2)})\, p(U^{(2)} \mid X^{(2)})\, p(V^{(2)} \mid X^{(2)})
\times p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t). (18)

In the above formula, p(U^{(1)} \mid X^{(1)}) and p(U^{(2)} \mid X^{(2)}) are zero-mean Gaussians with covariance matrices G^{(1)}_{Z,Z} and G^{(2)}_{Z,Z}, and p(V^{(1)} \mid X^{(1)}) and p(V^{(2)} \mid X^{(2)}) are zero-mean Gaussians with covariance matrices H^{(1)}_{Z,Z} and H^{(2)}_{Z,Z}. Precisely, they are expressed as

p(U^{(1)} \mid X^{(1)}) = \prod_{j=1}^{J} \mathcal{N}(u^{(1)}_j; 0, G^{(1)j}_{Z,Z}), (19)

p(U^{(2)} \mid X^{(2)}) = \prod_{j=1}^{J} \mathcal{N}(u^{(2)}_j; 0, G^{(2)j}_{Z,Z}), (20)

p(V^{(1)} \mid X^{(1)}) = \mathcal{N}(V^{(1)}; 0, H^{(1)}_{Z,Z}), (21)

p(V^{(2)} \mid X^{(2)}) = \mathcal{N}(V^{(2)}; 0, H^{(2)}_{Z,Z}).
(22)

The conditional distributions of the latent variables G and H given the inducing variables U and V are Gaussian, with the following forms:

p(G^K \mid U^K, X^K) = \prod_{j=1}^{J} \mathcal{N}(g^K_j; \mu^{Kj}_g, \tilde{K}^{Kj}_g), (23)

p(H^K \mid V^K, X^K) = \mathcal{N}(H^K; \mu^K_h, \tilde{K}^K_h), (24)

where K \in \{(1), (2)\}. The specific expressions for the related statistics are \mu^{Kj}_g = G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} u^K_j, \tilde{K}^{Kj}_g = G^{Kj}_{X,X} - G^{Kj}_{X,Z} (G^{Kj}_{Z,Z})^{-1} G^{Kj}_{Z,X}, \mu^K_h = H^K_{X,Z} (H^K_{Z,Z})^{-1} v^K and \tilde{K}^K_h = H^K_{X,X} - H^K_{X,Z} (H^K_{Z,Z})^{-1} H^K_{Z,X}.

We now adopt the variational inference method to approximately compute the integral. Specifically, we introduce a joint variational distribution q(\Omega) over all the latent variables, denoted by \Omega, which has the factorized form

q(\Omega) = q(\Theta^{(1)})\, q(\Theta^{(2)})\, q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}), (25)

q(X^{(1)}) = \mathcal{N}(X^{(1)} \mid \mu^{(1)}, S^{(1)}), \quad q(X^{(2)}) = \mathcal{N}(X^{(2)} \mid \mu^{(2)}, S^{(2)}), \quad q(X^{(1,2)}) = \mathcal{N}(X^{(1,2)} \mid \mu^{(1,2)}, S^{(1,2)}),

q(\Theta^{(1)}) = p(G^{(1)} \mid U^{(1)}, X^{(1)})\, p(H^{(1)} \mid V^{(1)}, X^{(1)})\, q(U^{(1)})\, q(V^{(1)}),

q(\Theta^{(2)}) = p(G^{(2)} \mid U^{(2)}, X^{(2)})\, p(H^{(2)} \mid V^{(2)}, X^{(2)})\, q(U^{(2)})\, q(V^{(2)}).

The evidence lower bound of the logarithmic marginal likelihood \log p(Y^{(1)}, Y^{(2)}) is

\mathcal{F}_v(q, \theta) = \int q(\Theta^{(1)})\, q(X^{(1)}) \log \frac{p(Y^{(1)} \mid G^{(1)}, H^{(1)})\, p(G^{(1)} \mid X^{(1)})\, p(H^{(1)} \mid X^{(1)})}{q(\Theta^{(1)})}\, dG^{(1)} dH^{(1)} dX^{(1)}
+ \int q(\Theta^{(2)})\, q(X^{(2)}) \log \frac{p(Y^{(2)} \mid G^{(2)}, H^{(2)})\, p(G^{(2)} \mid X^{(2)})\, p(H^{(2)} \mid X^{(2)})}{q(\Theta^{(2)})}\, dG^{(2)} dH^{(2)} dX^{(2)}
- \int q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \log \frac{q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)})}{p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t)}\, dX^{(1,2)} dX^{(1)} dX^{(2)}
= \hat{\mathcal{L}}^{(1)} + \hat{\mathcal{L}}^{(2)} - \mathrm{KL}\big( q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \,\big\|\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t) \big).
(26)

The detailed computation of the first term \hat{\mathcal{L}}^{(1)} in Equation (26) is given by

\hat{\mathcal{L}}^{(1)} = \int q(U^{(1)}, V^{(1)})\, q(X^{(1)}) \log \frac{p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)})\, p(U^{(1)}, V^{(1)})}{q(U^{(1)}, V^{(1)})}\, dU^{(1)} dV^{(1)} dX^{(1)}, (27)

where \log p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)}) in the lower bound can be lower bounded by

\log p(Y^{(1)} \mid U^{(1)}, V^{(1)}, X^{(1)}) \geq \big\langle \log p(Y^{(1)} \mid G^{(1)}, H^{(1)}) \big\rangle_{p(G^{(1)}, H^{(1)} \mid U^{(1)}, V^{(1)})}
= \sum_{d=1}^{D} \big\langle \log p(Y^{(1)}_d \mid G^{(1)}, H^{(1)}) \big\rangle_{p(G^{(1)} \mid U^{(1)})\, p(H^{(1)} \mid V^{(1)})}
= \sum_{d=1}^{D} \Big[ \log \mathcal{N}\Big( Y^{(1)}_d \,\Big|\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I \Big) - \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \tilde{K}^{(1)}_h \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \tilde{K}^{(1)j}_g \big) \Big]. (28)

As the outputs Y^{(1)} are conditionally independent, the lower bound can be written as a sum of D terms, that is, \hat{\mathcal{L}}^{(1)} = \sum_{d=1}^{D} \hat{\mathcal{L}}^{(1)}_d, where \hat{\mathcal{L}}^{(1)}_d is given by

\hat{\mathcal{L}}^{(1)}_d = \int q(u^{(1)}, v^{(1)})\, q(X^{(1)}) \log \frac{\mathcal{N}\big( y^{(1)}_d \mid \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I \big)\, p(u^{(1)}, v^{(1)})}{q(u^{(1)}, v^{(1)})}\, du^{(1)} dv^{(1)} dX^{(1)}
- \int \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \tilde{K}^{(1)}_h \big)\, q(X^{(1)})\, dX^{(1)} - \int \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \tilde{K}^{(1)j}_g \big)\, q(X^{(1)})\, dX^{(1)}.

By changing the integration order, we get

\hat{\mathcal{L}}^{(1)}_d = \int q(u^{(1)}, v^{(1)}) \log \frac{e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)})}{q(u^{(1)}, v^{(1)})}\, du^{(1)} dv^{(1)}
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big), (29)

where the optimal variational distribution q(u^{(1)}, v^{(1)}) for the d-th output that gives rise to this lower bound is

q(u^{(1)}, v^{(1)}) \propto e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)}).
(30)

The optimal variational distribution is analytically Gaussian,

q(u^{(1)}, v^{(1)}) = \mathcal{N}\Big( v^{(1)};\, H^{(1)}_{Z,Z} \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top \beta^{(1)} y^{(1)}_d,\; H^{(1)}_{Z,Z} \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} H^{(1)}_{Z,Z} \Big)
\times \prod_{j=1}^{J} \mathcal{N}\Big( u^{(1)}_j;\, G^{(1)j}_{Z,Z} \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top w^{(1)}_{dj} \beta^{(1)} y^{(1)}_d,\; G^{(1)j}_{Z,Z} \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} G^{(1)j}_{Z,Z} \Big), (31)

where \psi^{(1)}_0 = \langle H^{(1)}_{X,Z} \rangle_{q(X^{(1)})}, \psi^{(1)j}_1 = \langle G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})}, \psi^{(1)}_2 = \mathrm{Tr}(\langle H^{(1)}_{X,X} \rangle_{q(X^{(1)})}), \psi^{(1)j}_3 = \mathrm{Tr}(\langle G^{(1)j}_{X,X} \rangle_{q(X^{(1)})}), \psi^{(1)}_4 = \langle H^{(1)}_{Z,X} H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} and \psi^{(1)j}_5 = \langle G^{(1)j}_{Z,X} G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})}. Furthermore, the optimal lower bound can be obtained using Jensen's inequality,

\hat{\mathcal{L}}^{(1)}_d \geq \log \int e^{\langle \log \mathcal{N}( y^{(1)}_d;\, \sum_{j=1}^{J} w^{(1)}_{dj} \mu^{(1)j}_g + \mu^{(1)}_h,\, (\beta^{(1)})^{-1} I ) \rangle_{q(X^{(1)})}}\, p(u^{(1)}, v^{(1)})\, du^{(1)} dv^{(1)}
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big)
= \log \bigg[ \frac{(\beta^{(1)})^{\frac{N}{2}}}{(2\pi)^{\frac{N}{2}}} \prod_{j=1}^{J} \frac{|G^{(1)j}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z}|^{\frac{1}{2}}} \cdot \frac{|H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} \exp\Big\{ -\frac{1}{2} (y^{(1)}_d)^\top F^{(1)}_d\, y^{(1)}_d \Big\} \bigg]
- \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \langle \tilde{K}^{(1)}_h \rangle_{q(X^{(1)})} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2\, \mathrm{Tr}\big( \langle \tilde{K}^{(1)j}_g \rangle_{q(X^{(1)})} \big), (32)

where F^{(1)}_d = \beta^{(1)} I - \sum_{j=1}^{J} (\beta^{(1)})^2 (w^{(1)}_{dj})^2\, \psi^{(1)j}_1 \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top - (\beta^{(1)})^2\, \psi^{(1)}_0 \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top.
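The conditional statistics that recur throughout this appendix, a mean of the form G_{X,Z}(G_{Z,Z})^{-1}u and a covariance of the form K_{X,X} - K_{X,Z}(K_{Z,Z})^{-1}K_{Z,X}, can be computed stably via a Cholesky factorization. This is a generic numpy sketch of that linear algebra, not the authors' implementation; the jitter term is a standard numerical safeguard and not part of the model:

```python
import numpy as np

def sparse_gp_conditional(K_XX, K_XZ, K_ZZ, u, jitter=1e-8):
    """Conditional statistics of a GP given inducing variables u:
    mean = K_XZ K_ZZ^{-1} u,
    cov  = K_XX - K_XZ K_ZZ^{-1} K_ZX (Schur complement)."""
    M = K_ZZ.shape[0]
    L = np.linalg.cholesky(K_ZZ + jitter * np.eye(M))  # stabilised factor
    A = np.linalg.solve(L, K_XZ.T)                     # A = L^{-1} K_ZX
    mean = A.T @ np.linalg.solve(L, u)                 # K_XZ K_ZZ^{-1} u
    cov = K_XX - A.T @ A                               # K_XX - K_XZ K_ZZ^{-1} K_ZX
    return mean, cov
```

When the test inputs coincide with the inducing inputs, the mean recovers u and the covariance collapses to (numerically) zero, which is a convenient sanity check.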
Therefore, the closed form of the first term \hat{\mathcal{L}}^{(1)} in the lower bound (26) of the logarithmic marginal likelihood is given by

\hat{\mathcal{L}}^{(1)} = \sum_{d=1}^{D} \bigg[ \log \frac{(\beta^{(1)})^{\frac{N}{2}} |H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N}{2}} |\beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} + \frac{1}{2} \sum_{j=1}^{J} \log \frac{|G^{(1)j}_{Z,Z}|}{|\beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z}|}
- \frac{1}{2} (y^{(1)}_d)^\top \Big( \beta^{(1)} I - \sum_{j=1}^{J} (\beta^{(1)})^2 (w^{(1)}_{dj})^2\, \psi^{(1)j}_1 \big( \beta^{(1)} (w^{(1)}_{dj})^2 \psi^{(1)j}_5 + G^{(1)j}_{Z,Z} \big)^{-1} (\psi^{(1)j}_1)^\top - (\beta^{(1)})^2\, \psi^{(1)}_0 \big( \beta^{(1)} \psi^{(1)}_4 + H^{(1)}_{Z,Z} \big)^{-1} (\psi^{(1)}_0)^\top \Big) y^{(1)}_d
- \frac{\beta^{(1)}}{2} \psi^{(1)}_2 + \frac{\beta^{(1)}}{2} \mathrm{Tr}\big( \psi^{(1)}_4 (H^{(1)}_{Z,Z})^{-1} \big) - \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} (w^{(1)}_{dj})^2 \psi^{(1)j}_3 + \frac{\beta^{(1)}}{2} \sum_{j=1}^{J} \mathrm{Tr}\big( (w^{(1)}_{dj})^2 \psi^{(1)j}_5 (G^{(1)j}_{Z,Z})^{-1} \big) \bigg], (33)

and similarly for \hat{\mathcal{L}}^{(2)},

\hat{\mathcal{L}}^{(2)} = \sum_{d=1}^{D} \bigg[ \log \frac{(\beta^{(2)})^{\frac{N}{2}} |H^{(2)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N}{2}} |\beta^{(2)} \psi^{(2)}_4 + H^{(2)}_{Z,Z}|^{\frac{1}{2}}} + \frac{1}{2} \sum_{j=1}^{J} \log \frac{|G^{(2)j}_{Z,Z}|}{|\beta^{(2)} (w^{(2)}_{dj})^2 \psi^{(2)j}_5 + G^{(2)j}_{Z,Z}|}
- \frac{1}{2} (y^{(2)}_d)^\top \Big( \beta^{(2)} I - \sum_{j=1}^{J} (\beta^{(2)})^2 (w^{(2)}_{dj})^2\, \psi^{(2)j}_1 \big( \beta^{(2)} (w^{(2)}_{dj})^2 \psi^{(2)j}_5 + G^{(2)j}_{Z,Z} \big)^{-1} (\psi^{(2)j}_1)^\top - (\beta^{(2)})^2\, \psi^{(2)}_0 \big( \beta^{(2)} \psi^{(2)}_4 + H^{(2)}_{Z,Z} \big)^{-1} (\psi^{(2)}_0)^\top \Big) y^{(2)}_d
- \frac{\beta^{(2)}}{2} \psi^{(2)}_2 + \frac{\beta^{(2)}}{2} \mathrm{Tr}\big( \psi^{(2)}_4 (H^{(2)}_{Z,Z})^{-1} \big) - \frac{\beta^{(2)}}{2} \sum_{j=1}^{J} (w^{(2)}_{dj})^2 \psi^{(2)j}_3 + \frac{\beta^{(2)}}{2} \sum_{j=1}^{J} \mathrm{Tr}\big( (w^{(2)}_{dj})^2 \psi^{(2)j}_5 (G^{(2)j}_{Z,Z})^{-1} \big) \bigg]. (34)

For the calculation of the KL divergence, for simplification we employ A^{(1)}_q and A^{(2)}_q to represent (\alpha^{(1)})^2 K^{(1)}_{t,t} + (1-\alpha^{(1)})^2 \epsilon^{(1)} I and (\alpha^{(2)})^2 K^{(2)}_{t,t} + (1-\alpha^{(2)})^2 \epsilon^{(2)} I, respectively. Then the specific calculation is given below.

\mathrm{KL}\big( q(X^{(1)})\, q(X^{(2)})\, q(X^{(1,2)}) \,\big\|\, p(X^{(1)} \mid X^{(1,2)}, t)\, p(X^{(2)} \mid X^{(1,2)}, t)\, p(X^{(1,2)} \mid t) \big)
= \frac{1}{2} \Big[ \log|A^{(1)}_q| + \log|A^{(2)}_q| + \log|K^{(1,2)}_{t,t}| - \log|S^{(1,2)}_q| - \log|S^{(1)}_q| - \log|S^{(2)}_q|
+ \big( (1-\alpha^{(1)}) \mu^{(1,2)}_q - \mu^{(1)}_q \big)^\top (A^{(1)}_q)^{-1} \big( (1-\alpha^{(1)}) \mu^{(1,2)}_q - \mu^{(1)}_q \big)
+ \big( (1-\alpha^{(2)}) \mu^{(1,2)}_q - \mu^{(2)}_q \big)^\top (A^{(2)}_q)^{-1} \big( (1-\alpha^{(2)}) \mu^{(1,2)}_q - \mu^{(2)}_q \big)
+ \mathrm{Tr}\big[ \big( (1-\alpha^{(1)})^2 (A^{(1)}_q)^{-1} + (1-\alpha^{(2)})^2 (A^{(2)}_q)^{-1} \big) S^{(1,2)}_q \big]
+ \mathrm{Tr}\big[ (K^{(1,2)}_{t,t})^{-1} \big( \mu^{(1,2)}_q (\mu^{(1,2)}_q)^\top + S^{(1,2)}_q \big) \big]
+ \mathrm{Tr}\big[ (A^{(1)}_q)^{-1} S^{(1)}_q + (A^{(2)}_q)^{-1} S^{(2)}_q \big] \Big] + \mathrm{const}. (35)

Appendix B. Computation of the Statistics \psi_0, \psi_1, \psi_2, \psi_3, \psi_4, \psi_5

\psi^{(1)}_0, \psi^{(1)}_1, \psi^{(2)}_0 and \psi^{(2)}_1 are N \times M matrices. \psi^{(1)}_2, \psi^{(1)}_3, \psi^{(2)}_2 and \psi^{(2)}_3 are scalars. \psi^{(1)}_4, \psi^{(1)}_5, \psi^{(2)}_4 and \psi^{(2)}_5 are M \times M matrices (with one such block for each j in the case of \psi_5).
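The ψ statistics below reduce to Gaussian integrals of the ARD kernel. The first of them, ψ_0 = ⟨H_{X,Z}⟩_{q(X)}, can be sketched in numpy for a factorised Gaussian q(X) = ∏_n N(x_n | μ_n, diag(S_n)); the function names are hypothetical and the plain kernel is included for comparison:

```python
import numpy as np

def ard_kernel(X1, X2, sigma_f2, alpha):
    """ARD kernel k(x, x') = sigma_f^2 exp(-0.5 * sum_q alpha_q (x_q - x'_q)^2)."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2       # pairwise squared diffs
    return sigma_f2 * np.exp(-0.5 * np.sum(alpha * diff2, axis=2))

def psi0_statistic(mu, S, Z, sigma_f2, alpha):
    """<K_{X,Z}>_{q(X)} for the ARD kernel under q(X) = prod_n N(x_n | mu_n,
    diag(S_n)); mu and S are N x Q, Z is M x Q, alpha has length Q."""
    denom = S * alpha + 1.0                              # S_nq * alpha_q + 1
    scale = np.prod(denom, axis=1) ** -0.5               # per-row prefactor
    diff2 = (Z[None, :, :] - mu[:, None, :]) ** 2        # (z_mq - mu_nq)^2
    expo = -0.5 * np.sum(alpha * diff2 / denom[:, None, :], axis=2)
    return sigma_f2 * scale[:, None] * np.exp(expo)
```

With S → 0 (no posterior uncertainty), ψ_0 collapses to the plain kernel matrix evaluated at the means, which matches the structure of the formulas that follow.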
We use the ARD kernel \kappa_{\mathrm{ARD}}(x, x') = \sigma_f^2 \exp\big( -\frac{1}{2} \sum_{q=1}^{Q} \alpha_q (x_q - x'_q)^2 \big), and obtain

(\psi^{(1)}_0)_{n,m} = \big( \langle H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} \big)_{n,m} = \int \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_m)\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^2)^{(1)h} \prod_{q=1}^{Q} \big( S^{(1)}_{nq} \alpha^{(1)h}_q + 1 \big)^{-\frac{1}{2}} \exp\Big( -\frac{1}{2} \sum_{q=1}^{Q} \frac{(z^{(1)h}_{mq} - \mu^{(1)}_{nq})^2\, \alpha^{(1)h}_q}{S^{(1)}_{nq} \alpha^{(1)h}_q + 1} \Big), (36)

(\psi^{(1)j}_1)_{n,m} = \big( \langle G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})} \big)_{n,m} = \int \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_m)\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^2)^{(1)}_j \prod_{q=1}^{Q} \big( S^{(1)}_{nq} \alpha^{(1)}_{jq} + 1 \big)^{-\frac{1}{2}} \exp\Big( -\frac{1}{2} \sum_{q=1}^{Q} \frac{(z^{(1)}_{mq} - \mu^{(1)}_{nq})^2\, \alpha^{(1)}_{jq}}{S^{(1)}_{nq} \alpha^{(1)}_{jq} + 1} \Big), (37)

\psi^{(1)}_2 = \mathrm{Tr}\big( \langle H^{(1)}_{X,X} \rangle_{q(X^{(1)})} \big) = N (\sigma_f^2)^{(1)h}, (38)

\psi^{(1)j}_3 = \mathrm{Tr}\big( \langle G^{(1)j}_{X,X} \rangle_{q(X^{(1)})} \big) = N (\sigma_f^2)^{(1)}_j, (39)

(\psi^{(1)}_4)_{m,m'} = \big( \langle H^{(1)}_{Z,X} H^{(1)}_{X,Z} \rangle_{q(X^{(1)})} \big)_{m,m'} = \sum_{n=1}^{N} \int \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_m)\, \kappa^{(1)h}(x^{(1)}_n, z^{(1)h}_{m'})\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^4)^{(1)h} \sum_{n=1}^{N} \Big[ \prod_{q=1}^{Q} \big( 2 \alpha^{(1)h}_q S^{(1)}_{nq} + 1 \big)^{-\frac{1}{2}} \Big] \exp\Big( -\sum_{q=1}^{Q} \Big[ \frac{\alpha^{(1)h}_q (z^{(1)h}_{mq} - z^{(1)h}_{m'q})^2}{4} + \frac{\alpha^{(1)h}_q \big( \mu^{(1)}_{nq} - \frac{z^{(1)h}_{mq} + z^{(1)h}_{m'q}}{2} \big)^2}{2 \alpha^{(1)h}_q S^{(1)}_{nq} + 1} \Big] \Big), (40)

(\psi^{(1)j}_5)_{m,m'} = \big( \langle G^{(1)j}_{Z,X} G^{(1)j}_{X,Z} \rangle_{q(X^{(1)})} \big)_{m,m'} = \sum_{n=1}^{N} \int \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_m)\, \kappa^{(1)}_j(x^{(1)}_n, z^{(1)}_{m'})\, \mathcal{N}(x^{(1)}_n \mid \mu^{(1)}_n, S^{(1)}_n)\, dx^{(1)}_n
= (\sigma_f^4)^{(1)}_j \sum_{n=1}^{N} \Big[ \prod_{q=1}^{Q} \big( 2 \alpha^{(1)}_{jq} S^{(1)}_{nq} + 1 \big)^{-\frac{1}{2}} \Big] \exp\Big( -\sum_{q=1}^{Q} \Big[ \frac{\alpha^{(1)}_{jq} (z^{(1)}_{mq} - z^{(1)}_{m'q})^2}{4} + \frac{\alpha^{(1)}_{jq} \big( \mu^{(1)}_{nq} - \frac{z^{(1)}_{mq} + z^{(1)}_{m'q}}{2} \big)^2}{2 \alpha^{(1)}_{jq} S^{(1)}_{nq} + 1} \Big] \Big). (41)

The statistics \psi^{(2)}_0, \psi^{(2)}_1, \psi^{(2)}_2, \psi^{(2)}_3, \psi^{(2)}_4, \psi^{(2)}_5 for the second view have similar formulas.

Appendix C. Derivation of the Variational Lower Bound for Testing

Given test data in the first view Y^{(1)}_*, we maximize a variational lower bound on the logarithmic marginal likelihood \log p(Y^{(1)}, Y^{(1)}_*), which can be expressed as follows. For brevity, we've omitted the time inputs t and t_*.
$$
\begin{aligned}
\mathcal{F}^{(1)} &= \log \int p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})\,dX^{(1,2)}_{*}\,dX^{(1)}_{*}\,dX^{(1)}\,dX^{(1,2)} \\
&\geq \int q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,q(G^{(1)})\,q(H^{(1)}) \log \frac{p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})}{q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,q(G^{(1)})\,q(H^{(1)})}\,dX^{(1,2)}_{*}\,dX^{(1)}_{*}\,dX^{(1)}\,dX^{(1,2)}\,dG^{(1)}\,dH^{(1)} \\
&= \int q(G^{(1)})\,q(H^{(1)})\,q(X^{(1)}_{*}, X^{(1)}) \log \frac{p(Y^{(1)}_{*}, Y^{(1)}|X^{(1)}_{*}, X^{(1)})}{q(G^{(1)})\,q(H^{(1)})}\,dX^{(1)}_{*}\,dX^{(1)}\,dG^{(1)}\,dH^{(1)} \\
&\quad + \int q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)}) \log \frac{p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})}{q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})}\,dX^{(1)}_{*}\,dX^{(1,2)}_{*}\,dX^{(1)}\,dX^{(1,2)} \\
&= \tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)}) - \operatorname{KL}\big(q(X^{(1)}_{*}, X^{(1)})\,q(X^{(1,2)}_{*}, X^{(1,2)})\,\|\,p(X^{(1)}_{*}, X^{(1)}|X^{(1,2)}_{*}, X^{(1,2)})\,p(X^{(1,2)}_{*}, X^{(1,2)})\big).
\end{aligned} \tag{42}
$$

The quantity $\mathcal{F}^{(1)}$ can be maximized using the same method as for training. In addition, the parameters of the new variational distribution $q(X^{(1)}, X^{(1)}_{*})$ are jointly optimized because $X^{(1)}$ and $X^{(1)}_{*}$ are coupled in $q(X^{(1)}, X^{(1)}_{*})$, and so are those of $q(X^{(1,2)}, X^{(1,2)}_{*})$. Specifically, the quantity $\tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)})$ can be expressed as

$$
\begin{aligned}
\tilde{\mathcal{L}}^{(1)}(Y^{(1)}_{*}, Y^{(1)}) = \sum_{d=1}^{D^{(1)}} \Bigg\{ & \log \frac{(\beta^{(1)})^{\frac{N+N_{*}}{2}} \, |H^{(1)}_{Z,Z}|^{\frac{1}{2}}}{(2\pi)^{\frac{N+N_{*}}{2}} \, |\beta^{(1)}\tilde{\psi}^{(1)}_{4} + H^{(1)}_{Z,Z}|^{\frac{1}{2}}} + \sum_{j=1}^{J} \log \frac{|G^{(1)j}_{Z,Z}|^{\frac{1}{2}}}{|\beta^{(1)}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5} + G^{(1)j}_{Z,Z}|^{\frac{1}{2}}} \\
& - \frac{1}{2}(\tilde{y}^{(1)}_{d})^{\top}\Big[\beta^{(1)}I - \sum_{j=1}^{J}(\beta^{(1)})^{2}(w^{(1)}_{dj})^{2}\,\tilde{\psi}^{(1)j}_{1}\big(\beta^{(1)}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5} + G^{(1)j}_{Z,Z}\big)^{-1}(\tilde{\psi}^{(1)j}_{1})^{\top} \\
& \qquad\qquad - (\beta^{(1)})^{2}\,\tilde{\psi}^{(1)}_{0}\big(\beta^{(1)}\tilde{\psi}^{(1)}_{4} + H^{(1)}_{Z,Z}\big)^{-1}(\tilde{\psi}^{(1)}_{0})^{\top}\Big]\,\tilde{y}^{(1)}_{d} \\
& - \frac{\beta^{(1)}}{2}\tilde{\psi}^{(1)}_{2} + \frac{\beta^{(1)}}{2}\operatorname{Tr}\big(\tilde{\psi}^{(1)}_{4}(H^{(1)}_{Z,Z})^{-1}\big) - \frac{\beta^{(1)}}{2}\sum_{j=1}^{J}(w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{3} + \frac{\beta^{(1)}}{2}\sum_{j=1}^{J}\operatorname{Tr}\big((w^{(1)}_{dj})^{2}\tilde{\psi}^{(1)j}_{5}(G^{(1)j}_{Z,Z})^{-1}\big)\Bigg\},
\end{aligned} \tag{43}
$$

and the KL divergence can be expressed as

$$
\begin{aligned}
& \operatorname{KL}\big(q(X^{(1)}, X^{(1)}_{*})\,q(X^{(1,2)}, X^{(1,2)}_{*})\,\|\,p(X^{(1)}, X^{(1)}_{*}|X^{(1,2)}, X^{(1,2)}_{*})\,p(X^{(1,2)}, X^{(1,2)}_{*})\big) \\
&= \frac{1}{2}\Big[\log|\tilde{A}^{(1)}_{q}| + \log|\tilde{K}^{(1,2)}_{t,t}| - \log|\tilde{S}^{(1,2)}_{q}| - \log|\tilde{S}^{(1)}_{q}| + \operatorname{Tr}\big((\tilde{A}^{(1)}_{q})^{-1}\tilde{S}^{(1)}_{q}\big) \\
&\quad + \big((1-\alpha^{(1)})\tilde{\mu}^{(1,2)}_{q} - \tilde{\mu}^{(1)}_{q}\big)^{\top}(\tilde{A}^{(1)}_{q})^{-1}\big((1-\alpha^{(1)})\tilde{\mu}^{(1,2)}_{q} - \tilde{\mu}^{(1)}_{q}\big) + \operatorname{Tr}\big((1-\alpha^{(1)})^{2}(\tilde{A}^{(1)}_{q})^{-1}\tilde{S}^{(1,2)}_{q}\big) \\
&\quad + \operatorname{Tr}\Big((\tilde{K}^{(1,2)}_{t,t})^{-1}\big(\tilde{\mu}^{(1,2)}_{q}(\tilde{\mu}^{(1,2)}_{q})^{\top} + \tilde{S}^{(1,2)}_{q}\big)\Big)\Big] + \text{const},
\end{aligned}
$$
(44)

where $\tilde{\psi}^{(1)}_{0} = \langle H^{(1)}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$, $\tilde{\psi}^{(1)j}_{1} = \langle G^{(1)j}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$, $\tilde{\psi}^{(1)}_{2} = \operatorname{Tr}\big(\langle H^{(1)}_{X,X}\rangle_{q(X^{(1)},X^{(1)}_{*})}\big)$, $\tilde{\psi}^{(1)j}_{3} = \operatorname{Tr}\big(\langle G^{(1)j}_{X,X}\rangle_{q(X^{(1)},X^{(1)}_{*})}\big)$, $\tilde{\psi}^{(1)}_{4} = \langle H^{(1)}_{Z,X}H^{(1)}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$ and $\tilde{\psi}^{(1)j}_{5} = \langle G^{(1)j}_{Z,X}G^{(1)j}_{X,Z}\rangle_{q(X^{(1)},X^{(1)}_{*})}$.

References

M. R. Amini and C. Goutte. A co-classification approach to learning from multilingual corpora. Machine Learning, 79:105–121, 2010.

K. Andreas and G. Carlos. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th International Conference on Machine Learning, pages 449–456, 2007.

A. Ankur and T. Bill. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:1–8, 2006.

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

D. M. Blei and M. I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134, 2003.

T. Cao, V. Jojic, S. Modla, D. Powell, K. Czymmek, and M. Niethammer. Robust multimodal dictionary learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 259–266, 2013.

N. Chen, J. Zhu, and E. P. Xing. Predictive subspace learning for multi-view data: a large margin approach. Advances in Neural Information Processing Systems, 23:361–369, 2010.

D. A. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. Advances in Neural Information Processing Systems, 14:430–436, 2001.

A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian process dynamical systems. Advances in Neural Information Processing Systems, 24:2510–2518, 2011.

A. C. Damianou, C. H. Ek, M. K. Titsias, and N. D. Lawrence. Manifold relevance determination.
In Proceedings of the 29th International Conference on Machine Learning, pages 1–8, 2012.

A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17:1425–1486, 2016.

Z. Ding, M. Shao, and Y. Fu. Robust multi-view representation: A unified perspective from multi-view learning to domain adaptation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5434–5440, 2018.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

C. H. Ek and N. D. Lawrence. Shared Gaussian process latent variable models. PhD thesis, Oxford Brookes University, 2009.

F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd International Conference on Multimedia, pages 7–16, 2014.

M. Feurer, B. Letham, and E. Bakshy. Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles. In Proceedings of the 36th Automatic Machine Learning Workshop at International Conference on Machine Learning, pages 1–15, 2018.

J. Hu, J. Lu, and Y. Tan. Sharable and individual multi-view metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2281–2288, 2018.

Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. Advances in Neural Information Processing Systems, 23:982–990, 2010.

Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Proceedings of the 13th IEEE International Conference on Computer Vision, pages 2407–2414, 2011.

P. Jing, Y. Su, L. Nie, X. Bai, J. Liu, and M. Wang.
Low-rank multi-view embedding learning for micro-video popularity prediction. IEEE Transactions on Knowledge and Data Engineering, 30:1519–1532, 2018.

A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 17:329–336, 2004.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.

N. D. Lawrence and M. I. Jordan. Semi-supervised learning via Gaussian processes. Advances in Neural Information Processing Systems, 18:753–760, 2005.

Y. Li, M. Yang, and Z. M. Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, 10:1–20, 2018.

W. Liu, D. Tao, J. Cheng, and Y. Tang. Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding, 118:50–60, 2014.

M. Lüthi, T. Gerig, C. Jud, and T. Vetter. Gaussian process morphable models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:1860–1873, 2018.

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks. arXiv preprint arXiv:1412.6632, pages 1–17, 2014.

J. R. Medina, H. Borner, S. Endo, and S. Hirche. Impedance-based Gaussian processes for modeling human motor behavior in physical and non-physical interaction. IEEE Transactions on Biomedical Engineering, 63:1–12, 2019.

I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.

I. Muslea, S.
Minton, and C. A. Knoblock. Active learning with multiple views. Journal of Artificial Intelligence Research, 27:203–233, 2006.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689–696, 2011.

V. T. Nguyen and E. Bonilla. Collaborative multi-output Gaussian processes. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 643–652, 2014.

M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21:786–792, 2009.

E. Puyol, B. Ruijsink, B. Gerber, M. Amzulescu, H. Langet, M. De Craene, J. A. Schnabel, P. Piro, and A. P. King. Regional multi-view learning for cardiac motion analysis: Application to identification of dilated cardiomyopathy patients. IEEE Transactions on Biomedical Engineering, 65:1–9, 2018.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 701–708, 2010.

A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao. Learning shared latent structure for image synthesis and robotic imitation. Advances in Neural Information Processing Systems, 18:1233–1240, 2006.

N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, 25:2222–2230, 2012.

S. Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23:2031–2038, 2013.

S. Sun, L. Mao, Z. Dong, and L. Wu. Multiview Machine Learning. Springer, 1st edition, 2019.

M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.

S. Tulsiani, A.
Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2018.

S. Virtanen, Y. Jia, A. Klami, and T. Darrell. Factorized multi-modal topic model. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 1–9, 2012.

X. Wan. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 235–243, 2009.

J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. Advances in Neural Information Processing Systems, 19:1441–1448, 2006.

W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1083–1092, 2015.

H. Wei, P. Zhu, M. Liu, J. P. How, and S. Ferrari. Automatic pan-tilt camera control for learning Dirichlet process Gaussian process mixture models of multiple moving targets. IEEE Transactions on Automatic Control, 64:159–173, 2019.

X. Wei, H. Huang, L. Nie, F. Feng, R. Hong, and T. Chua. Quality matters: Assessing cQA pair quality via transductive multi-view learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4482–4488, 2018.

E. P. Xing, R. Yan, and A. G. Hauptmann. Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423, pages 1–9, 2012.

C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, pages 1–59, 2013.

C. Zhang, C. H. Ek, A. Damianou, and H. Kjellstrom. Factorized topic models. In Proceedings of the 1st International Conference on Learning Representations, pages 1–9, 2013.

J. Zhao and S. Sun. Variational dependent multi-output Gaussian process dynamical systems.
Journal of Machine Learning Research, 17:1–36, 2016.

J. Zhao, J. Fei, and S. Sun. A variant of Gaussian process dynamical systems. Technical report, East China Normal University, 2018.