Published as a conference paper at ICLR 2024

GRAPHICAL MULTIOUTPUT GAUSSIAN PROCESS WITH ATTENTION

Yijue Dai, Wenzhong Yan & Feng Yin
School of Science and Engineering
The Chinese University of Hong Kong, Shenzhen, China
yinfeng@cuhk.edu.cn, yijuedai@link.cuhk.edu.cn

ABSTRACT

Integrating information from multiple data sources while recognizing their dependence, and thereby enhancing the predictive performance of multi-output regression, are challenging tasks. Multioutput Gaussian Process (MOGP) methods offer outstanding solutions with tractable predictions and uncertainty quantification. However, their practical applications are hindered by high computational complexity and storage demand. Additionally, existing MOGP models suffer from model mismatch when dealing with non-Gaussian data. To improve the model representation ability in terms of flexibility, optimality, and scalability, this paper introduces a novel multi-output regression framework, termed Graphical MOGP (GMOGP), which is empowered by: (i) generating flexible Gaussian process priors consolidated from identified parents, (ii) providing dependent processes with attention-based graphical representations, and (iii) achieving Pareto optimal solutions of the kernel hyperparameters via a distributed learning framework. Numerical results confirm that the proposed GMOGP significantly outperforms state-of-the-art MOGP alternatives in predictive performance, as well as in time and memory efficiency, across various synthetic and real datasets.

1 INTRODUCTION

For inference on functions, the Gaussian Process (GP) serves as the most prominent Bayesian nonparametric model, offering not only function values but also predictive distributions. Capturing the input-output relationship with a multivariate Gaussian at finite locations grants tractable predictions and predictive uncertainty (Rasmussen & Williams, 2006).
Scenarios where resource constraints impede the collection of ample data samples motivate the development of Multioutput GP (MOGP) models, as the MOGP can correlate data from multiple diverse sources and obtain better predictive accuracy than isolated models. Practical applications of existing MOGP methods are widespread, such as time-series analysis (Dürichen et al., 2014), water resource forecasting (Pastrana-Cortés et al., 2023), and group-structured data clustering (Leroy et al., 2023).

The popular MOGP models are mainly rooted in the Linear Model of Coregionalization (LMC) proposed by Goulard & Voltz (1992). Modeling all outputs as linear combinations of shared latent GPs, the LMC correlates the diverse data by a covariance matrix. There also exist other MOGP models, such as (1) transformation-based models that treat outputs as augmented inputs, transforming the MOGP into a sequence of single-output GPs (Requeima et al., 2019); (2) discrepancy-based models, which focus on transferring information from low-fidelity outputs to progressively enhance the performance of high-fidelity target outputs (Perdikaris et al., 2017; Requeima et al., 2019). However, the challenges of providing understandable output correlations and a probable output hierarchy have impeded their advancement.

To date, improved learning performance of the LMC has been achieved by deriving representative covariance functions (Bonilla et al., 2007; Wilson et al., 2012; Dai et al., 2020; Chen et al., 2022), by adapting a common mean process (Leroy et al., 2022), and by introducing neural embeddings or efficient frameworks (Liu et al., 2022; Chung & Al Kontar, 2023). However, the predictive performance of the LMC-based models may be inferior to that of isolated GPs when sufficient training data exists for each output (Bruinsma et al., 2020). See the results in Section 5.

The corresponding author is Feng Yin. Our code is available at https://github.com/Blspdianna/GMOGP.
In this paper, we propose a novel multi-output regression framework, termed graphical MOGP (GMOGP). Unlike the LMC, which interprets the output correlations through a covariance built from a set of shared GPs, the GMOGP, built upon a probabilistic graphical model, can directly learn the conditional dependence and imply graphical representations of the multiple outputs. Moreover, learning the kernel hyperparameters of the GMOGP can be formulated as a multi-objective optimization (MOO) problem with Pareto optimal solutions. Further extensions by non-linear transformations make the transformed GMOGP (TGMOGP) able to fit non-Gaussian data. Enhanced predictive accuracy and efficiency validate the efficacy of the GMOGP across various synthetic/real datasets.

2 BACKGROUND

This section summarizes basic concepts and notations in Gaussian process regression (GPR), serving as the bedrock of the upcoming GMOGP framework detailed in Section 3.

2.1 GAUSSIAN PROCESS REGRESSION

A GP characterizes a distribution over functions fully by a mean function m(x) and a kernel function k(x, x′; θ), i.e.,

f(x) ∼ GP(m(x), k(x, x′; θ)). (1)

The commonly used squared exponential (SE) kernel, k_SE(x, x′; l, σ_f²) = σ_f² exp(−‖x − x′‖²/l²), is parametrized by a length-scale l and a signal variance σ_f² (Rasmussen & Williams, 2006). Given a dataset D : {X, y} = {x_n, y_n}_{n=1}^N, the GPR model can be described as:

y_n = f(x_n) + ϵ_n, n = 1, 2, . . . , N, (2)

where x_n ∈ R^d denotes an input location, and the noise ϵ_n is assumed i.i.d. with ϵ_n ∼ N(0, σ²). In particular, scalar-valued observations {y_n ∈ R}_{n=1}^N give rise to single-output GP (SOGP) models.

2.1.1 MULTIOUTPUT GAUSSIAN PROCESS

For multi-output regression, an MOGP can be derived as:

f(x) ∼ GP(m_M(x), K_M(x, x′; θ_M)), (3)

where the vector-valued function f(x) = [f^(1)(x), f^(2)(x), . . . , f^(S)(x)] (S > 1) is determined by the mean function m_M(x) ∈ R^S and the matrix-valued kernel function K_M(x, x′) ∈ R^{S×S}.
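As a quick illustration of the GP building blocks above, the following is a minimal NumPy sketch (not from the paper) of the SE kernel exactly as parametrized in Section 2.1, k_SE(x, x′; l, σ_f²) = σ_f² exp(−‖x − x′‖²/l²):

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, sigma_f=1.0):
    """SE kernel k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / l^2),
    matching the parametrization in Section 2.1."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / lengthscale**2)

# Prior covariance of f at three 2-D input locations (Eq. (1) evaluated at X).
X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]])
K = se_kernel(X, X)
```

The resulting Gram matrix is symmetric positive semi-definite with σ_f² on its diagonal; stacking such per-output-pair blocks is what yields the matrix-valued kernel K_M.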
Note that the hyperparameters θ_M are searched in a vector-valued reproducing kernel Hilbert space (RKHS). At finite locations X ∈ R^{N×d}, the prior and the likelihood of the MOGP have the following forms:

(Prior): p(f(X)) = N(m_M(X), K_M(X, X; θ_M)) (4)
(Likelihood): p(Y | f, X, Σ) = N(f(X), Σ ⊗ I_N), (5)

where Y = [y^(1), . . . , y^(i), . . . , y^(S)] ∈ R^{NS}, and y^(i) = {y_n^(i) ∈ R}_{n=1}^N are the labels of the ith output. The noise terms are assumed to have zero mean and a covariance matrix Σ = diag{σ_1², σ_2², . . . , σ_S²}. The kernel function for the MOGP prior evaluated at X is derived as follows:

K_M(X, X; θ_M) = [ K_11(X, X; θ_11) · · · K_1S(X, X; θ_1S) ; ⋮ ⋱ ⋮ ; K_S1(X, X; θ_S1) · · · K_SS(X, X; θ_SS) ] ∈ R^{SN×SN}. (6)

Generally, we train the MOGP models by minimizing their negative log marginal likelihood (NLL):

L(θ_M, Σ̄) ∝ Ȳᵀ (K_M(X, X; θ_M) + Σ̄)⁻¹ Ȳ + log |K_M(X, X; θ_M) + Σ̄|, (7)

with Ȳ = Y − m_M(X) and Σ̄ = Σ ⊗ I_N. Integrating over the parameter space differentiates the marginal likelihood methods from other non-Bayesian counterparts. The automatic incorporation of a trade-off between model fit and model complexity renders the marginal likelihood a valuable metric for model selection (Schölkopf et al., 2002; Micchelli & Pontil, 2005). Conditioning the joint Gaussian prior on the observations, the predictive distribution for a test input x* turns out to be:

p(f* | x*, X, Y, θ_M) = N(f̄*, V*) (8)

with (omitting kernel hyperparameters)

f̄* = K_M(x*, X)(K_M(X, X) + Σ̄)⁻¹ Ȳ (9)
V* = K_M(x*, x*) − K_M(x*, X)(K_M(X, X) + Σ̄)⁻¹ K_M(X, x*). (10)

In order to infer the underlying functions and correlate the multiple outputs, the LMC models tailor distinct coefficients to each output via Q shared independent GPs. The elements of their kernel functions (Eq. (6)) are formulated as (K_{i,i′}(x, x′)) = Σ_{q=1}^Q a_{iq} a_{i′q} k_q(x, x′), where i, i′ ∈ I, I := {1, 2, . . . , S}, and different modes of the coefficients a_{iq} a_{i′q} correspond to varied LMC variants (Alvarez et al., 2012; Liu et al., 2018).
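The predictive equations (8)-(10) reduce, for a single output and a zero-mean prior, to the familiar GP regression update; a self-contained NumPy sketch (illustrative, with an assumed 1-D SE kernel and toy noise level):

```python
import numpy as np

def se_kernel(X1, X2, l=1.0, sf=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sf**2 * np.exp(-d2 / l**2)

def gp_predict(X, y, Xs, noise_var=0.01):
    """Posterior mean (Eq. (9)) and covariance (Eq. (10)) for one output."""
    Kxx = se_kernel(X, X)
    Kxs = se_kernel(X, Xs)
    Kss = se_kernel(Xs, Xs)
    A = Kxx + noise_var * np.eye(len(y))
    alpha = np.linalg.solve(A, y)                 # (K + sigma^2 I)^{-1} y
    mean = Kxs.T @ alpha                          # Eq. (9)
    cov = Kss - Kxs.T @ np.linalg.solve(A, Kxs)   # Eq. (10)
    return mean, cov

# Fit a noiseless sine and predict at a held-out location.
X = np.linspace(0, 3, 20)
y = np.sin(X)
mean, cov = gp_predict(X, y, np.array([1.5]))
```

Near dense training data the posterior mean tracks the target closely and the posterior variance shrinks toward the noise floor.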
Although the LMC can generate dependent processes and measure output correlation, its applications are narrowed by two significant impediments: (i) the computational and storage burdens associated with the SN-dimensional correlation matrix, and (ii) model mismatch/inflexibility when the underlying likelihood (Eq. (5)) deviates from Gaussian distributions. Given these challenges, an alternative framework that can leverage the advantages of MOGPs, generate flexible prior/predictive distributions, and freely accommodate various efficient learning methods is considered in Section 3.

3 GRAPHICAL MULTIOUTPUT GAUSSIAN PROCESS

Ubiquitously, there exists interplay among multiple outputs, giving rise to varied research, e.g., graph construction, transfer learning, and multi-task learning (Qiao et al., 2018; Vandenhende et al., 2021). For instance, the dependence (denoted by arrows) between a target output (marked in yellow) and six other outputs can be described by the graphical representation shown in Figure 1(a). In multi-output regression, the strong bond between the trend of the target CO2 and Temperature has been well discovered (solid link), while the other bonds (e.g., GDP, Vehicle) are vague (dashed links) and worth exploring.

Figure 1: Illustrations of: (a) a graphical representation among the target CO2 and the other outputs, (b) an example of a directed acyclic graph (the cross means that there are no directed cycles).

As directed graphical models (a.k.a. Bayesian networks) can express the conditional dependence structure among random variables and make probabilistic statements for a broad class of distributions based on a specific graph (Koller & Friedman, 2009), we propose a graphical MOGP model to learn the output dependence jointly with the regression task. Mathematically, a graph is built with nodes and links.
In the directed graphical model, each node can represent random variables, while the links convey probabilistic dependence and have directionality indicated by arrows. For example, given nodes {v1, v2, . . . , v6} and the graph Gv shown in Figure 1(b), the joint distribution can be decomposed by repeatedly applying the product rule of probability:

p(v1, v2, . . . , v6) = p(v6 | v1, v2, v3, v4, v5) p(v5 | v1, v2, v3, v4) · · · p(v2 | v1) p(v1) (11)
 =(Gv) p(v6 | v1, v2, v3, v5) p(v5 | v2, v4) p(v4 | v1, v3) p(v3 | v2) p(v2) p(v1). (12)

Eq. (12) is derived according to the conditional independence, e.g., p(v6 | v1, v2, v3, v4, v5) = p(v6 | v1, v2, v3, v5), as there is no link from node v4 to v6 in the graph Gv. In more general terms, the joint distribution of multiple nodes v = {v_i}_{i∈I} defined by a graph can be expressed as:

p(v) = ∏_i p(v_i | pa_i), (13)

where pa_i denotes the parents¹ of v_i, e.g., pa_6 = {v1, v2, v3, v5} in Eq. (12). Note that the joint distribution only admits directed acyclic graphs; see more details in (Bishop & Nasrabadi, 2006).

¹If there exists a link going from a node i to a node j, the node i is regarded as the parent of the node j.

Accordingly, for S > 1 outputs, we can model each output as an SOGP, and generate multivariate Gaussian random variables evaluated at X (represented by node f_X^(j), j ∈ I). Following the result in Eq. (13), the joint distribution defined by the specific graph structure, with the target node f_X^(i) (representing the CO2) connected to the heads of arrows (see Figure 1(a)), can be derived as follows:

p(f_X^(1), f_X^(2), . . . , f_X^(S)) = p(f_X^(i) | pa_i) ∏_{j∈pa_i} p(f_X^(j)). (14)

Since the links are unknown in practice, the parents of the target node are assumed to contain all other nodes (except the target one) at first, i.e., pa_i = {f_X^(1), f_X^(2), . . . , f_X^(S)} \ {f_X^(i)}, and can be adjusted via an attention mechanism detailed in Section 3.2.
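The factorization in Eq. (13) can be checked numerically for the example graph Gv of Eq. (12). The sketch below uses binary nodes and an arbitrary logistic conditional (both illustrative assumptions, not part of the GMOGP model) and verifies that the product of conditionals defines a normalized joint distribution:

```python
import itertools
import math

# Parent sets read off the example DAG Gv in Eq. (12).
parents = {1: [], 2: [], 3: [2], 4: [1, 3], 5: [2, 4], 6: [1, 2, 3, 5]}

def cpd(value, parent_values):
    """Toy conditional p(v_i = value | pa_i): a logistic function of the
    number of active parents (purely illustrative)."""
    p1 = 1.0 / (1.0 + math.exp(-(sum(parent_values) - 0.5)))
    return p1 if value == 1 else 1.0 - p1

def joint(assignment):
    """Eq. (13): the joint is a product over nodes of p(v_i | pa_i)."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpd(assignment[i], [assignment[j] for j in pa])
    return p

# Any product of valid CPDs over a DAG sums to one over all assignments.
total = sum(joint(dict(zip(parents, bits)))
            for bits in itertools.product([0, 1], repeat=6))
```

The sum telescopes along a topological ordering of the DAG, which is why the factorization is automatically normalized.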
Alternately, we can treat each output as the target node (represented by f_X^(i), i ∈ I), and learn the conditional dependence on the other nodes. The diagrammatic illustration of the output bonds and the graph with respect to the different targets at the first stage (all vague) is given in Figure 2. Moreover, conditioning on the target node, the parents are dependent (see Remark 3.1), which paves the way to generating dependent processes.

Figure 2: Illustration of the bonds among six outputs and the dependence structure between the different targets (marked in yellow) and their parents (marked in purple) at the first stage.

Remark 3.1 (Conditional Dependence). Defined by the specific graph with joint distribution form p(f_X^(i), f_X^(j), f_X^(k)) = p(f_X^(i) | f_X^(j), f_X^(k)) p(f_X^(j)) p(f_X^(k)), if conditioning on the target f_X^(i), the conditional distribution p(f_X^(j), f_X^(k) | f_X^(i)) = p(f_X^(j)) p(f_X^(k)) p(f_X^(i) | f_X^(j), f_X^(k)) / p(f_X^(i)) does not factorize over j and k, yielding that the parents f_X^(j) and f_X^(k) are dependent, i.e., f_X^(j) ⊥̸ f_X^(k) | f_X^(i).

3.1 GRAPHICAL MULTIOUTPUT GAUSSIAN PROCESS PRIOR

Bayesian methods are invaluable within the machine learning community due to their ability to incorporate prior knowledge. Appropriate prior beliefs can represent detailed inductive bias, improve interpretability, and prevent overfitting (Theodoridis, 2015; Lotfi et al., 2022). In the GMOGP, we model the conditional distribution in Eq. (14) as Gaussian with aggregated information, namely,

p(f_X^(i) | pa_i) = N(f_X^(i) | Σ_{j∈pa_i} α_{i,j} f_X^(j) + m_i, k_{θ_i}(X, X)), i ∈ I, (15)

where α_{i,j} ∈ R and m_i are bias parameters added into the mean, and k_{θ_i}(X, X) ∈ R^{N×N} is a covariance function. Conditioning on the states of its parents, each target node is of the form:

f_X^(i) = Σ_{j∈pa_i} α_{i,j} f_X^(j) + m_i + ψ_i, (16)

with m_i and ψ_i ∼ N(0, k_{θ_i}(X, X)) characterizing the ith output.
Given the parents with p(f_X^(j)) = N(m_j, k_{θ_j}(X, X)), we can derive the GMOGP prior via E(f_X^(i)) = Σ_{j∈pa_i} α_{i,j} E(f_X^(j)) + m_i and cov(f_X^(i), f_X^(i)) = E[(f_X^(i) − E[f_X^(i)]) {Σ_{j∈pa_i} α_{i,j}(f_X^(j) − E[f_X^(j)]) + ψ_i}] (see Proposition 3.1).

Proposition 3.1 (GMOGP Prior). In the multi-output regression, defined by the specific graph with a target node and given its parents, the GMOGP prior for each target node (i ∈ I) follows p(f_X^(i)) = N(Σ_{j∈pa_i} α_{i,j} m_j + m_i, Σ_{j∈pa_i} α_{i,j}² k_{θ_j}(X, X) + k_{θ_i}(X, X)). The proof is given in Appendix A.1.

Following the noise setting in Section 2.1.1, the posterior of the ith output is proportional to the product of the prior and the likelihood, with hyperparameters Θ := {θ_j}_{j∈I} and α_i := {α_{i,j}}_{j∈pa_i}:

p(y^(i), f_X^(i); Θ, α_i, σ_i) = p(f_X^(i); Θ, α_i) p(y^(i) | f_X^(i); σ_i). (17)

Enhanced representation ability provided by graph structure in the input space has been demonstrated in recent graph GP methods, such as Ng et al. (2018) and Fang et al. (2021), in which the graph structures among the input features are well discovered. For the multiple outputs, the graph structure is unclear and intriguing. To learn the graph while improving the performance of the multi-output regression, we propose to incorporate the attention mechanism into the GMOGP to modify the parents, as attention-based graph construction methods have shown impressive performance, such as (Vaswani et al., 2017; Veličković et al., 2018; Kim & Oh, 2022).

3.2 LEARNING PARENTS WITH ATTENTION

Centering on the CO2 level (Figure 1(a)), the dependence on the number of vehicles, GDP level, and the degree of education is obscure. Intuitively, highly conditionally dependent parents deserve large coefficients {α_{i,j}}_{j∈pa_i}.
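Before turning to how the coefficients are learned, Proposition 3.1 can be sanity-checked by Monte Carlo: sample independent Gaussian parents and the noise term ψ_i of Eq. (16), form the target node, and compare the empirical moments against the closed form. The coefficients, means, and kernel matrices below are arbitrary toy values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                              # number of input locations
alphas = np.array([0.7, -0.3])     # attention coefficients alpha_{i,j}
m = [1.0, -2.0]                    # parent means m_j
m_i = 0.5                          # target bias

def rand_psd(n):
    """A random PSD matrix standing in for k_theta(X, X)."""
    A = rng.standard_normal((n, n))
    return A @ A.T / n + np.eye(n)

K_parents = [rand_psd(N) for _ in range(2)]
K_i = rand_psd(N)

# Proposition 3.1: closed-form GMOGP prior moments of the target node.
prior_mean = sum(a * mj for a, mj in zip(alphas, m)) + m_i
prior_cov = sum(a**2 * K for a, K in zip(alphas, K_parents)) + K_i

# Monte Carlo: f_i = sum_j alpha_{i,j} f_j + m_i + psi_i  (Eq. (16)).
T = 200_000
samples = np.zeros((T, N))
for a, mj, K in zip(alphas, m, K_parents):
    samples += a * rng.multivariate_normal(mj * np.ones(N), K, size=T)
samples += m_i + rng.multivariate_normal(np.zeros(N), K_i, size=T)
emp_mean = samples.mean(0)
emp_cov = np.cov(samples.T)
```

The empirical mean and covariance converge to the closed-form prior moments, matching the proposition.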
Without knowing the graph adjacency, such coefficients can be learned by using an attention mechanism α_{i,j} = exp(e_{i,j}) / (1 + Σ_{j′∈pa_i} exp(e_{i,j′})) and a scoring function:

e_{i,j} = LeakyReLU(⟨f_X^(i), f_X^(j)⟩ w_{ij} + c_{ij}), (18)

where the weight w_{ij} and bias c_{ij} are learning parameters. The inner product-based scoring function is widely used, as seen in works such as Vaswani et al. (2017) and Devlin et al. (2018), and can reveal geometric constructions involving angles, lengths, and distances (Schölkopf et al., 2002). An alternative scoring function employs the concatenation of inputs (Veličković et al., 2018; Brody et al., 2022). However, existing scoring functions parametrized by common parameter matrices exhibit static attention: a fixed node consistently achieves the highest attention (Brody et al., 2022). In Eq. (18), we introduce a modified scoring function, which can achieve dynamic attention with the aid of disparate learning parameters and a non-linear function. The learned coefficients on various real datasets are listed in Section 5 and Appendix B. In practice, we substitute the observation values y^(i), y^(j) at the initial learning phase. When α_{i,j} → 0, the parent set pa_i is adjusted by unlinking the node j (j ∈ I, j ≠ i).

For classic MOGPs, the dependence between outputs is measured by symmetric covariance matrices. In comparison, our GMOGP can not only learn a more flexible asymmetric dependence measure but also capture the covariance from cov(f_X^(j), f_X^(i)) = Σ_{j′∈pa_i} α_{i,j′} cov(f_X^(j), f_X^(j′)) + I_{ij} k_{θ_i}(X, X), where I_{ij} is the (i, j) element of the identity matrix (see Appendix A.1). In addition, the GMOGP can handle heterotopic data (X^(1) ≠ X^(2) ≠ · · · ≠ X^(S) ≠ X) by introducing an extra weight matrix into Eq. (18) for aligning extracted input features.

3.3 MODEL LEARNING AND INFERENCE

In Section 2.1.1, all kernel hyperparameters are learned through the marginal likelihood L(θ_M, Σ̄).
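Returning to the parent-selection rule of Section 3.2, note that the extra 1 in the softmax denominator is what lets all coefficients of a target shrink jointly toward zero, unlinking weak parents. A toy NumPy sketch of Eq. (18) (scalar weights, biases, and observation vectors chosen purely for illustration):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def attention_coefficients(y_i, Y_parents, w, c):
    """Eq. (18): score each candidate parent j via an inner product of the
    (initial-phase) observations, then normalize with a +1 in the denominator
    so that all alpha_{i,j} can jointly approach zero (unlinking)."""
    scores = np.array([leaky_relu(y_i @ y_j * w_j + c_j)
                       for y_j, w_j, c_j in zip(Y_parents, w, c)])
    e = np.exp(scores)
    return e / (1.0 + e.sum())

y1 = np.array([1.0, 0.5, -0.2])
candidates = [np.array([0.9, 0.6, -0.1]),     # strongly aligned candidate
              np.array([-0.05, 0.02, 0.01])]  # weakly aligned candidate
alpha = attention_coefficients(y1, candidates, w=[1.0, 1.0], c=[0.0, 0.0])
```

The aligned candidate receives the larger coefficient, and the coefficients sum to strictly less than one because of the extra 1 in the denominator.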
Both the computational/storage requirements and the learning challenges brought about by the high-dimensional covariance matrix and multiple vector-valued RKHSs impede the practical implementation of classic MOGPs (Yang et al., 2020). For the GMOGP model, each target node i, i ∈ I, has its own prior and likelihood (Eq. (17)). Accordingly, type II maximum likelihood can be conducted for every output, and the parameters corresponding to the ith output, i.e., γ^(i) := {Θ, α_i, σ_i}, can be updated via minimizing:

L^(i)(γ^(i)) ∝ (ỹ^(i))ᵀ (k_G^(i)(X, X) + σ_i² I_N)⁻¹ ỹ^(i) + log |k_G^(i)(X, X) + σ_i² I_N|, (19)

where ỹ^(i) = y^(i) − (Σ_{j∈pa_i} α_{i,j} m_j + m_i), and k_G^(i)(X, X) = Σ_{j∈pa_i} α_{i,j}² k_{θ_j}(X, X) + k_{θ_i}(X, X). Note that the size of the correlation matrix k_G^(i)(X, X) is much smaller than that of K_M(X, X) in Eq. (6), and the hyperparameters Θ are shared among the multiple outputs for information exchange.

Instead of solving a high-dimensional problem, the separate objectives with shared kernel hyperparameters can be modeled as a multi-objective optimization (MOO) problem, i.e., F(Θ) = [L^(1)(Θ), L^(2)(Θ), . . . , L^(S)(Θ)]ᵀ. Moreover, applying the weighted sum method with the objective function Σ_{i=1}^S w_i L^(i)(Θ), w_i > 0, to solve the MOO problem provides a sufficient condition for Pareto optimality² of the kernel hyperparameters (Marler & Arora, 2010). In the GMOGP, the weights can be set equal, since there is no priority among the outputs, and each marginal likelihood L^(i)(γ^(i)), i ∈ I, follows a Gaussian that can be normalized so that no single objective dominates.

²A solution point is Pareto optimal if it is not possible to move from that point and improve at least one objective function without causing detriment to any other objective function (see details in Appendix A.5).
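A minimal sketch of the per-output objective of Eq. (19) and its equal-weight scalarization follows (with an assumed shared SE kernel block, toy data, and mean terms set to zero for brevity; none of these choices come from the paper):

```python
import numpy as np

def output_nll(y_tilde, K_G, noise_var):
    """Eq. (19) up to constants: per-output negative log marginal likelihood
    with the consolidated kernel k_G^(i)."""
    A = K_G + noise_var * np.eye(len(y_tilde))
    fit = y_tilde @ np.linalg.solve(A, y_tilde)   # data-fit term
    _, logdet = np.linalg.slogdet(A)              # complexity penalty
    return fit + logdet

rng = np.random.default_rng(1)
N, S = 30, 3
X = rng.uniform(0, 1, (N, 1))
K = np.exp(-(X - X.T) ** 2)                       # shared SE kernel block
ys = [rng.multivariate_normal(np.zeros(N), K + 0.01 * np.eye(N))
      for _ in range(S)]

# Equal-weight scalarization of the MOO problem F(Theta).
objectives = [output_nll(y, K, 0.01) for y in ys]
total = sum(objectives) / S
```

Because the data are drawn with noise variance 0.01, the objective evaluated at the generating noise level is far lower than at a grossly misspecified one, which is exactly the fit-versus-complexity trade-off the marginal likelihood encodes.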
To summarize, the proposed GMOGP empowers the vanilla MOGPs with: (i) an attention-based asymmetric dependence measure (α_{i,j} ≠ α_{j,i}), (ii) lower computational and storage requirements, and (iii) Pareto optimal solutions of the kernel hyperparameters plus extra graphical representations. Figure 3 presents the detailed workflow of the GMOGP hyperparameter optimization scheme. As a separable model, each target node possesses distinct training samples and parents, undergoing an independent inference procedure (O(N³)). It contributes individual information via θ_i and aggregates dependent information through the shared θ_j, j ∈ pa_i, prompting the application of diverse distributed learning schemes. Key distinctions between the distributed GMOGP and the classic distributed GP (DGP) (Deisenroth & Ng, 2015) are summarized in Remark 3.2.

Figure 3: Illustrating the GMOGP workflow, each target output transmits individual knowledge while aggregating optimized information from its parents, enabling separable training and inference.

Remark 3.2 (Distributed scheme). The distributed framework can learn all parameters jointly and does not require adjustments, such as (i) simplifying approximations, e.g., the block diagonal approximation commonly employed in DGP approaches; (ii) extra prediction fusion methods, as the prediction at x* can be separately inferred from k_G^(i)(x*, X)(k_G^(i)(X, X) + σ_i² I_N)⁻¹ y^(i); (iii) data segmentation; and (iv) additional client selection methods, as we can apply the identified parents.

3.4 NON-GAUSSIAN GMOGP PRIOR

Gaussian priors on functions are widely used in the Bayesian machine learning literature, benefiting from posterior consistency, tractable inference, and closed-form solutions (Rasmussen & Williams, 2006). However, the case of a non-Gaussian prior/likelihood is often overlooked yet ubiquitous in reality, e.g., heavy-tailed, bounded, and multimodal data (Snelson et al., 2003; Sendera et al., 2021).
Typical remedies include the deep GP (DGP) (Damianou & Lawrence, 2013) and the transformed GP (TGP) (Rios & Tobar, 2019). The performance of the TGP, transformed by marginal flows, was tested in (Maroñas et al., 2021) and demonstrated competitiveness with the DGP while incurring lower computational costs. However, the naive application of marginal flows to MOGP models involves untraceable learning procedures and inefficient high-dimensional approximation approaches, as 1-D quadrature cannot be utilized. Lately, a TGP model tailored for multi-class classification has been proposed; it can generate dependent processes with different marginal flows from a single sample of a latent process (Maroñas & Hernández-Lobato, 2023). In the context of the multi-output regression problem addressed in this paper, a single latent function is neither sufficiently flexible nor accommodating of distinct prior beliefs. Also, the strong dependence imposed by the single GP sample can degrade regression performance, especially when dealing with low-quality or biased outputs.

In contrast, the GMOGP model is unconstrained in applying non-linear transformations (termed flows). Concretely, extra model flexibility can be gained by compositing K element-wise invertible transformations {G_{φ_k} : F ↦ F}_{k=0}^{K−1}. Correspondingly, each target node defined in Eq. (16) aggregates non-linear relations, i.e., G_{φ_k^(i)}(f_X^(i)) = G_{φ_k^(i)}(Σ_{j∈pa_i} α_{i,j} f_X^(j) + m_i + ψ_i). Applying the inverse function theorem and the change of variables formula iteratively (Rezende & Mohamed, 2015), we can derive the transformed GMOGP prior as:

p_{γ^(i),Φ_i}(f_{K,X}^(i) | G, X) = p_{γ^(i)}(f_{0,X}^(i)) ∏_{k=0}^{K−1} |det ∇G_{φ_k^(i)}(f_{k,X}^(i))|⁻¹, (20)

where f_{0,X}^(i) = f_X^(i), and f_{k+1,X}^(i) = G_{φ_k^(i)}(f_{k,X}^(i)) with parameters Φ_i = {φ_0^(i), φ_1^(i), . . . , φ_{K−1}^(i)}, such as φ_k^(i) = {ζ_k, ρ_k, λ_k} in the Sinh-Archsinh (SAL) flow: G_{φ_k^(i)}(x_k) = ζ_k sinh(ρ_k arcsinh(x_k) − λ_k).
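The SAL flow above is invertible with a tractable Jacobian, which is exactly what the change-of-variables product in Eq. (20) requires. A NumPy sketch of one SAL layer, its inverse, and the per-point log-Jacobian term (the parameter values are arbitrary, not from the paper):

```python
import numpy as np

def sal_flow(x, zeta=1.0, rho=1.0, lam=0.0):
    """Sinh-Archsinh (SAL) flow G(x) = zeta * sinh(rho * arcsinh(x) - lam)."""
    return zeta * np.sinh(rho * np.arcsinh(x) - lam)

def sal_flow_inv(y, zeta=1.0, rho=1.0, lam=0.0):
    """Analytic inverse: x = sinh((arcsinh(y / zeta) + lam) / rho)."""
    return np.sinh((np.arcsinh(y / zeta) + lam) / rho)

def sal_flow_logabsdet(x, zeta=1.0, rho=1.0, lam=0.0):
    """log|dG/dx|: the per-point Jacobian term entering the transformed
    GMOGP prior (Eq. (20)), since the flow acts element-wise."""
    dGdx = zeta * rho * np.cosh(rho * np.arcsinh(x) - lam) / np.sqrt(1 + x**2)
    return np.log(np.abs(dGdx))

x = np.linspace(-3, 3, 7)
params = dict(zeta=1.5, rho=2.0, lam=0.5)
y = sal_flow(x, **params)
```

Because the flow is element-wise, the determinant in Eq. (20) is simply the product of these scalar derivatives, so no high-dimensional Jacobian ever has to be formed.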
We refer readers to Rios (2020) and Maroñas et al. (2021) for available flows and validity analysis. Consolidating with the non-Gaussian prior, the transformed GMOGP (TGMOGP) can be easily implemented by adapting the off-the-shelf sparse variational inference algorithms in Hensman et al. (2015) or Maroñas et al. (2021) through optimizing the negative evidence lower bound (NELBO):

min_{γ^(i), Φ_i, u_0^(i), m_u^(i), K_u^(i); i=1,2,...,S} Σ_{i=1}^S { −E_{q(f_{0,X}^(i))}[log p(y^(i) | G_{Φ_i}(f_{0,X}^(i)))] + KL(q(u_0^(i)) ‖ p(u_0^(i))) }, (21)

with inducing points u_0^(i) ∈ R^M, M ≪ N, and q(u_0^(i)) = N(m_u^(i), K_u^(i)). The detailed derivation of the NELBO and the variational gap are elaborated in Appendix A.2. For inference, by the law of the unconscious statistician (LOTUS), the predictions of output i, i ∈ I, can be estimated from the first moment of the predictive distribution: p(y_*^(i)) = ∫ p(y_*^(i) | G_{Φ_i}(f_{0,X}^(i))) q(f_{0,X}^(i)) df_{0,X}^(i) (Snelson et al., 2003). In total, the dependencies of the variables in the GMOGP are summarized in Figure 4.

Figure 4: Visualizing variable dependencies in the GMOGP. Each target output has its own parents, transformations, and samples, with knowledge exchanged through the kernel hyperparameters θ_j, j ∈ pa_i.

4 RELATED WORK

Graph-based GP models have been widely developed in locally structured feature spaces and the relational learning discipline, resulting in the development of the graph GP (Ng et al., 2018) and the relational GP (Chu et al., 2006; Silva et al., 2007). Without knowing neighbor graph structures or a complete Laplacian matrix, attention-based graph construction methods have sprung up and have been shown to be successful in supervised learning problems (Veličković et al., 2018). The idea of finding neighborhoods using non-negative kernel (NNK) regression (Shekkizhar & Ortega, 2020) explores geometric understanding and connections to the classic matching pursuit approach (Tropp & Gilbert, 2007).
In the GMOGP framework, we can draw parallels with the kernel matching pursuit objective (Vincent & Bengio, 2002), where ŷ^(i) = k_G^(i)(X*, X) β can be regarded as matching through the constructed kernel dictionary with coefficients β = (k_G^(i)(X, X) + σ_i² I_N)⁻¹ y^(i). Moreover, attention mechanisms have been utilized in the multi-task learning community, such as the multi-task attention networks (MTAN) in (Liu et al., 2019). In addition, the weighted sum objective function is well-established for balancing task contributions (Vandenhende et al., 2021). Regarding the graph among the multiple outputs, the GMOGP introduces the probabilistic graphical model to MOGP regression, leading to an efficient MOGP alternative with attention-based flexible/non-Gaussian priors and various graphical representations.

Table 1: The average test RMSE of the synthetic experiments described in Section 5.1. All metrics are compared against the baselines: [1] isolated SOGPs, [2] LMC (de Wolff et al., 2021), [3] free-form task similarity model (FICM) (Bonilla et al., 2007), [4] Gaussian process regression network (GPRN) (Wilson et al., 2012), and [5] convolution process (CMOGP) (Alvarez & Lawrence, 2011). (l_NF: the flow parameters; V_m: the variational parameters.)

Model | Average RMSE | Test NLL | Kdim | Number of Parameters
[1] SOGP | 0.5653 ± 0.0023 | 0.4891 ± 0.0043 | N | 4S
[2] LMC | 0.5917 ± 0.0096 | 0.5543 ± 0.0506 | S·N | (2 + S)Q + 2S + 1
[3] FICM | 0.5544 ± 0.0046 | 0.4798 ± 0.0176 | S·N | (S(S + 5) + 4)/2
[4] GPRN | 0.5819 ± 0.0207 | 0.5787 ± 0.0445 | S·N | 2(S + Q) + 3
[5] CMOGP | 0.5539 ± 0.0089 | 0.4689 ± 0.0143 | S·N | (2 + S)Q + 2S + 1
[6] GMOGP | 0.5541 ± 0.0054 | 0.1636 ± 0.0143 | N | S(S + 5) + 1
[7] TGMOGP | 0.5343 ± 0.0023 | −0.6354 ± 0.0023 | N | S(S + 5) + 1 + 3l_NF + V_m

5 EXPERIMENTS

In this section, we benchmark the predictive performance of the proposed (transformed) GMOGP against baseline MOGP methods and isolated SOGPs using various synthetic and real-world datasets.
5.1 SYNTHETIC DATA EXPERIMENTS

A multi-output regression task with non-Gaussian noise and different function compositions is evaluated. Five outputs (S = 5) are generated by the following functions evaluated at X ∈ R^{1800×2}:

y^(1) = f_1(X) + ϵ_1, (22)
y^(2) = f_1(X) + f_2(X) + ϵ_2, (23)
y^(3) = sinh(2 arcsinh(f_1(X) + f_2(X)) + ϵ_3), (24)
y^(4) = 3 tanh(f_3(X) f_4(X) + f_1(X) + ϵ_4), (25)
y^(5) = 5 f_3(X) f_4(X) + ϵ_5, (26)

where f_1(x) = 2 cos(x_1 + x_2), f_2(x) = (x_1 + x_2)², f_3(x) = exp(|x_1 x_2| + 1), f_4(x) = log(x_1 + 3), and ϵ_1, ϵ_2, . . . , ϵ_5 are i.i.d. Gaussian noise with a common standard deviation 0.2. The detailed experiment settings and available marginal flows are elaborated in Appendix B.

Table 1 shows the compared predictive performance at 600 test points. The GMOGP is competitive with the FICM but exhibits a smaller test NLL, and surpasses the other baselines in both test RMSE and NLL. Given the test error for each output shown in Figure 5(c), the TGMOGP achieves improved performance, especially on the 3rd output, confirming the enhanced representation ability w.r.t. the non-Gaussian data. The inferior performance of the classic MOGP methods (even worse than the isolated SOGP) indicates inefficient model learning and correlation measures in high-dimensional space. Moreover, increasing the number of independent latent processes Q in LMC variants yields little performance gain (shown in Figure 5(b)). Other results are obtained with Q = S. By contrast, our TGMOGP achieves superior results consistently across all training sizes (Figure 5(a)).

Figure 5: The sub-figures show (a) the average RMSE versus the number of training samples, (b) the RMSE versus the number of latent independent GPs Q, and (c) the test error of each output.
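For concreteness, the generative process of Eqs. (22)-(26) can be sketched as follows; the input range of X and the random seed are assumptions, as the paper defers those settings to Appendix B:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1800
X = rng.uniform(-1, 1, (N, 2))     # input locations (range assumed)
x1, x2 = X[:, 0], X[:, 1]

f1 = 2 * np.cos(x1 + x2)
f2 = (x1 + x2) ** 2
f3 = np.exp(np.abs(x1 * x2) + 1)
f4 = np.log(x1 + 3)
eps = rng.normal(0.0, 0.2, (5, N))  # i.i.d. Gaussian noise, std 0.2

y1 = f1 + eps[0]                                # Eq. (22)
y2 = f1 + f2 + eps[1]                           # Eq. (23)
y3 = np.sinh(2 * np.arcsinh(f1 + f2) + eps[2])  # Eq. (24), non-Gaussian
y4 = 3 * np.tanh(f3 * f4 + f1 + eps[3])         # Eq. (25), bounded in (-3, 3)
y5 = 5 * f3 * f4 + eps[4]                       # Eq. (26)
Y = np.stack([y1, y2, y3, y4, y5])
```

The first two outputs are Gaussian perturbations of smooth functions, while the 3rd and 4th push the noise through sinh/tanh warpings, producing exactly the heavy-tailed and bounded behaviors the TGMOGP is designed to capture.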
Table 2: Comparison of test RMSE on real datasets, where SGPRN (Li et al., 2021) and the variational LMC (V-LMC) are tested. The shadowed results are learned with two distributed computing units.

Datasets | SOGP | V-LMC100 | FICM | SGPRN | GMOGP | TGMOGP100
JURA | 0.605 ± 0.01 | 0.443 ± 0.01 | 0.394 ± 0.05 | 0.438 ± 0.02 | 0.376 ± 0.01 | 0.382 ± 0.01
ECG | 0.245 ± 0.01 | 0.229 ± 0.01 | 0.222 ± 0.01 | 0.232 ± 0.02 | 0.219 ± 0.00 | 0.217 ± 0.00
EEG | 0.343 ± 0.05 | 0.207 ± 0.03 | 0.147 ± 0.03 | 0.261 ± 0.03 | 0.082 ± 0.01 | 0.117 ± 0.00
SARCOS1 | 1.139 ± 0.01 | 1.063 ± 0.01 | 0.792 ± 0.04 | 0.844 ± 0.04 | 0.643 ± 0.03 | 0.558 ± 0.00
KUKA | 0.05 ± 0.01 | 0.14 ± 0.01 | 0.03 ± 0.01 | 0.12 ± 0.02 | 0.02 / 0.02 ± 0.00 | 0.04 / 0.04 ± 0.01
Test NLL | −0.25 ± 0.01 | −0.51 ± 0.01 | −0.65 ± 0.02 | −0.55 ± 0.01 | −1.81 / −1.76 ± 0.02 | −3.49 / −3.48 ± 0.01
SARCOS2 | 0.26 ± 0.05 | 0.29 ± 0.04 | 0.33 ± 0.03 | − | 0.21 / 0.22 ± 0.02 | 0.16 / 0.16 ± 0.01
Time/Iter | 20.75 (s) | 370.2 (s) | 419.3 (s) | >2400 (s) | 32.41 / 21.07 (s) | 4.65 / 3.43 (s)

Table 3: Instances of the learned attention coefficient values from the GMOGP (α_{j,j} = 1).

Coefficient | α_{1,2} | α_{1,3} | α_{1,4} | α_{2,1} | α_{2,3} | α_{2,4} | α_{3,1} | α_{3,2} | α_{3,4} | α_{4,1} | α_{4,2} | α_{4,3}
SARCOS2 | 1.9e-4 | 0.998 | 2.9e-4 | 5.4e-4 | 6.9e-4 | 0.898 | 7.3e-5 | 5.1e-5 | 0.989 | 1.3e-3 | 0.966 | 6.7e-4

5.2 REAL-WORLD DATA EXPERIMENTS

In this section, we evaluate the predictive performance of the GMOGP alongside scalable MOGP models in real-world applications. The data descriptions and tasks are detailed in Appendix B.3, where we select the first 2k/20k training samples (denoted SARCOS1/2) from the SARCOS dataset to investigate how performance varies with different training sizes. The results in Table 2 show that our GMOGP outperforms the baseline models on all real datasets in accuracy and efficiency. For variational inference with 100 inducing points, the TGMOGP100 achieves lower predictive error than the V-LMC100 at a largely reduced time cost, and even surpasses the models trained with full data.
One probable reason is that the variational distribution generated from a free-form Gaussian in the TGMOGP100 is also transformed by the SAL flow, which improves the possibility of learning a sufficient statistic. Another reason is that the TGMOGP transformed by the SAL flow can describe asymmetric and non-Gaussian-tailed distributions (Jones & Pewsey, 2019), and the SARCOS data is negatively skewed with mean < median < mode. The trend illustrated in Figure 5(a) is also reflected by the different rankings across the two SARCOS datasets. As for the two larger datasets (KUKA (12k) and SARCOS2), we learn the GMOGP-based models through the distributed learning framework using two distributed computing units (the shadowed results), where comparable results are achieved with reduced time consumption. In a comparison of the test negative log-likelihood, the outstanding result of the TGMOGP model implies that it has a more appropriate model complexity.

The graphical representations learned along with the improved predictive performance on the multi-output regression tasks (KUKA and SARCOS2) can be read off from the attention coefficients (Table 3). In SARCOS2 (4 outputs), we can identify that the parent of the node representing the 2nd output is the node corresponding to the 4th output (α_{2,4} = 0.8978), since we unlink the other two nodes whose coefficients approach 0. Similarly, the dependence between the 3rd and 4th outputs is implied by α_{3,4} = 0.9998. To corroborate the graphical representations, we calculate the Pearson correlation coefficients between the 2nd/3rd and 4th outputs, which give correspondingly high values of 0.7744 and 0.9657. Other interpretations of a traffic prediction task are explored in Appendix B.3.

6 CONCLUSION

We propose the graphical multioutput Gaussian process (GMOGP) model, an innovative framework tailored for efficient multi-output regression and graphical representation learning.
Noteworthy model flexibility and optimality are verified by the superior predictive performance achieved across various synthetic and real datasets. Distributed learning frameworks and sparse variational inference methods can be directly applied to the proposed GMOGP framework, enabling it to handle large datasets and a great number of outputs. Moreover, varied graphical representations and conditional dependence endow the vanilla MOGP models with more representation ability.

7 ACKNOWLEDGMENTS

This work was supported by the NSFC under Grant No. 62271433, and in part by the Shenzhen Science and Technology Program under Grant No. JCYJ20220530143806016 and No. RCJC20210609104448114.

REFERENCES

Mauricio A Alvarez and Neil D Lawrence. Computationally efficient convolved multiple output Gaussian processes. The Journal of Machine Learning Research, 12:1459–1500, 2011.

Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

Christopher M Bishop and Nasser M Nasrabadi. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.

Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, volume 20, 2007.

Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=F72ximsx7C1.

Wessel Bruinsma, Eric Perim, William Tebbutt, Scott Hosking, Arno Solin, and Richard Turner. Scalable exact inference in multi-output Gaussian processes. In International Conference on Machine Learning, pp. 1190–1201. PMLR, 2020.

Yair Censor. Pareto optimality in multiobjective problems. Applied Mathematics and Optimization, 4(1):41–59, 1977.

Kai Chen, Twan Van Laarhoven, Elena Marchiori, Feng Yin, and Shuguang Cui.
Multitask Gaussian process with hierarchical latent interactions. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4148–4152. IEEE, 2022.

Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, and S Keerthi. Relational learning with Gaussian processes. Advances in Neural Information Processing Systems, 19, 2006.

Seokhyun Chung and Raed Al Kontar. Federated multi-output Gaussian processes. Technometrics, pp. 1–14, 2023.

Yijue Dai, Tianjian Zhang, Zhidi Lin, Feng Yin, Sergios Theodoridis, and Shuguang Cui. An interpretable and sample efficient deep kernel for Gaussian process. In Conference on Uncertainty in Artificial Intelligence, pp. 759–768. PMLR, 2020.

Andreas Damianou and Neil D Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215. PMLR, 2013.

Taco de Wolff, Alejandro Cuevas, and Felipe Tobar. MOGPTK: The multi-output Gaussian process toolkit. Neurocomputing, 424:49–53, 2021.

Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In International Conference on Machine Learning, pp. 1481–1490, July 2015.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Robert Dürichen, Marco AF Pimentel, Lei Clifton, Achim Schweikard, and David A Clifton. Multitask Gaussian processes for multivariate physiological time-series analysis. IEEE Transactions on Biomedical Engineering, 62(1):314–322, 2014.

Jinyuan Fang, Shangsong Liang, Zaiqiao Meng, and Qiang Zhang. Gaussian process with graph convolutional kernel for relational learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 353–363, 2021.

Michel Goulard and Marc Voltz. Linear coregionalization model: tools for estimation and choice of cross-variogram matrix.
Mathematical Geology, 24(3):269–286, 1992.

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pp. 351–360. PMLR, 2015.

Chris Jones and Arthur Pewsey. The sinh-arcsinh normal distribution. Significance, 16(2):6–7, 2019.

Dongkwan Kim and Alice Oh. How to find your friendly neighborhood: Graph attention design with self-supervision. arXiv preprint arXiv:2204.04879, 2022.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey. MAGMA: inference and prediction using multi-task Gaussian processes with common mean. Machine Learning, 111(5):1821–1849, 2022.

Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey. Cluster-specific predictions with multi-task Gaussian processes. Journal of Machine Learning Research, 24(5):1–49, 2023.

Shibo Li, Wei Xing, Robert M. Kirby, and Shandian Zhe. Scalable Gaussian process regression networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021.

Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018.

Haitao Liu, Jiaqi Ding, Xinyu Xie, Xiaomo Jiang, Yusong Zhao, and Xiaofang Wang. Scalable multi-task Gaussian processes with neural embedding of coregionalization. Knowledge-Based Systems, 247:108775, 2022.

Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1871–1880, 2019.

Sanae Lotfi, Pavel Izmailov, Gregory Benton, Micah Goldblum, and Andrew Gordon Wilson. Bayesian model selection, the marginal likelihood, and generalization. In International Conference on Machine Learning, pp. 14223–14247. PMLR, 2022.

R Timothy Marler and Jasbir S Arora.
The weighted sum method for multi-objective optimization: new insights. Structural and Multidisciplinary Optimization, 41:853–862, 2010.

Juan Maroñas and Daniel Hernández-Lobato. Efficient transformed Gaussian processes for non-stationary dependent multi-class classification. In International Conference on Machine Learning, pp. 24045–24081. PMLR, 2023.

Juan Maroñas, Oliver Hamelijnck, Jeremias Knoblauch, and Theodoros Damoulas. Transforming Gaussian processes with normalizing flows. In International Conference on Artificial Intelligence and Statistics, pp. 1081–1089. PMLR, 2021.

Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

Yin Cheng Ng, Nicolò Colombo, and Ricardo Silva. Bayesian semi-supervised learning with graph Gaussian processes. Advances in Neural Information Processing Systems, 31, 2018.

Julián David Pastrana-Cortés, David Augusto Cardenas-Peña, Mauricio Holguín-Londoño, Germán Castellanos-Domínguez, and Álvaro Ángel Orozco-Gutiérrez. Multi-output variational Gaussian process for daily forecasting of hydrological resources. Engineering Proceedings, 39(1):83, 2023.

Paris Perdikaris, Maziar Raissi, Andreas Damianou, Neil D Lawrence, and George Em Karniadakis. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2198):20160751, 2017.

Lishan Qiao, Limei Zhang, Songcan Chen, and Dinggang Shen. Data-driven graph construction and graph learning: A review. Neurocomputing, 312:336–351, 2018.

Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E Turner. The Gaussian process autoregressive regression model.
In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1860–1869. PMLR, 2019.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Gonzalo Rios. Transport Gaussian processes for regression. arXiv preprint arXiv:2001.11473, 2020.

Gonzalo Rios and Felipe Tobar. Compositionally-warped Gaussian processes. Neural Networks, 118:235–246, 2019.

Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Marcin Sendera, Jacek Tabor, Aleksandra Nowak, Andrzej Bedychaj, Massimiliano Patacchiola, Tomasz Trzcinski, Przemysław Spurek, and Maciej Zieba. Non-Gaussian Gaussian processes for few-shot regression. In Advances in Neural Information Processing Systems, volume 34, pp. 10285–10298, 2021.

Sarath Shekkizhar and Antonio Ortega. Graph construction from data by non-negative kernel regression. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3892–3896. IEEE, 2020.

Ricardo Silva, Wei Chu, and Zoubin Ghahramani. Hidden common cause relations in relational learning. Advances in Neural Information Processing Systems, 20, 2007.

Edward Snelson, Zoubin Ghahramani, and Carl Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems, volume 16, 2003.

S. Theodoridis. Machine Learning: A Bayesian and Optimization Perspective. Academic Press, 2015. ISBN 978-0-12-801522-3.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574. PMLR, 2009.

Joel A Tropp and Anna C Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48:165–187, 2002.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning, pp. 1139–1146, 2012.

Qiang Yang, Wei-Neng Chen, Tianlong Gu, Hu Jin, Wentao Mao, and Jun Zhang. An adaptive stochastic dominant learning swarm optimizer for high-dimensional optimization. IEEE Transactions on Cybernetics, 52(3):1960–1976, 2020.

A MATHEMATICAL APPENDIX

In Part A, we give some insights into and detailed derivations of the graphical multioutput Gaussian process (GMOGP) prior, the evidence lower bound (ELBO) of the transformed GMOGP, etc.

A.1 GMOGP PRIOR DERIVATION

In the main paper, we derive the target node $i$ of the GMOGP in the following aggregated form:
$$f_X^{(i)} = \sum_{j \in \mathrm{pa}_i} \alpha_{i,j} f_X^{(j)} + m_i + \psi_i. \quad (27)$$
Recall that each node represents the random variables generated by a single-output Gaussian process, i.e., $p(f_X^{(j)}) = \mathcal{N}(m_j, k_{\theta_j}(X, X))$ for $j \in \mathrm{pa}_i$, and $\psi_i \sim \mathcal{N}(0, k_{\theta_i}(X, X))$.
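As a numerical sanity check of this construction: drawing the parent processes and aggregating them per Eq. (27) reproduces, empirically, the aggregated covariance $\sum_{j \in \mathrm{pa}_i} \alpha_{i,j}^2 k_{\theta_j}(X,X) + k_{\theta_i}(X,X)$ derived in Eq. (37) below. A minimal sketch, assuming SE kernels, two parents, and zero means (all parameter values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20)[:, None]

def se(X1, X2, l=0.3, c=1.0):
    # Squared-exponential kernel c * exp(-||x - x'||^2 / l^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return c * np.exp(-d2 / l**2)

alpha = np.array([0.8, 0.3])                      # attention weights of 2 parents
K_par = [se(X, X, l=0.2), se(X, X, l=0.5)]        # parent kernels k_{theta_j}
K_own = se(X, X, l=0.4)                           # target's own kernel k_{theta_i}

# Draw samples of f^(i) = sum_j alpha_j f^(j) + psi_i (zero means), Eq. (27)
n_samp = 100000
jit = 1e-9 * np.eye(len(X))
samples = sum(a * rng.multivariate_normal(np.zeros(len(X)), K + jit, n_samp).T
              for a, K in zip(alpha, K_par))
samples += rng.multivariate_normal(np.zeros(len(X)), K_own + jit, n_samp).T

K_emp = np.cov(samples)                                        # empirical covariance
K_G = sum(a**2 * K for a, K in zip(alpha, K_par)) + K_own      # Eq. (37)
assert np.abs(K_emp - K_G).max() < 0.05
```

The empirical covariance of the aggregated samples matches the closed-form aggregated kernel up to Monte Carlo error, illustrating why the GMOGP prior stays a valid GP.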
Therefore, the GMOGP prior of the target node, $p(f_X^{(i)})$, is Gaussian with mean
$$\mathbb{E}(f_X^{(i)}) = \sum_{j \in \mathrm{pa}_i} \alpha_{i,j}\, \mathbb{E}(f_X^{(j)}) + m_i = \sum_{j \in \mathrm{pa}_i} \alpha_{i,j} m_j + m_i \quad (28)\text{-}(29)$$
and covariance
$$\mathrm{cov}(f_X^{(i)}, f_X^{(i)}) = \mathbb{E}\big[(f_X^{(i)} - \mathbb{E}[f_X^{(i)}])(f_X^{(i)} - \mathbb{E}[f_X^{(i)}])^T\big] \quad (30)$$
$$= \mathbb{E}\Big[(f_X^{(i)} - \mathbb{E}[f_X^{(i)}])\Big(\sum_{j \in \mathrm{pa}_i} \alpha_{i,j}\big(f_X^{(j)} - \mathbb{E}[f_X^{(j)}]\big) + \psi_i\Big)^T\Big] \quad (31)$$
$$= \sum_{j \in \mathrm{pa}_i} \alpha_{i,j}\, \mathrm{cov}(f_X^{(i)}, f_X^{(j)}) + k_{\theta_i}(X, X), \quad (32)$$
where the cross-covariance terms satisfy
$$\mathrm{cov}(f_X^{(i)}, f_X^{(j)}) = \mathrm{cov}(f_X^{(j)}, f_X^{(i)}) = \mathbb{E}\big[(f_X^{(j)} - \mathbb{E}[f_X^{(j)}])(f_X^{(i)} - \mathbb{E}[f_X^{(i)}])^T\big] \quad (33)\text{-}(34)$$
$$= \mathbb{E}\Big[(f_X^{(j)} - \mathbb{E}[f_X^{(j)}])\Big(\sum_{k \in \mathrm{pa}_i} \alpha_{i,k}\big(f_X^{(k)} - \mathbb{E}[f_X^{(k)}]\big) + \psi_i\Big)^T\Big] \quad (35)$$
$$= \sum_{k \in \mathrm{pa}_i} \alpha_{i,k}\, \mathrm{cov}(f_X^{(j)}, f_X^{(k)}) \qquad (i \neq j). \quad (36)$$
Notably, under the specific graphical representation (such as the decomposition in Figure 2), the other nodes $j$ have no parents, i.e., $\mathrm{pa}_j = \emptyset$ for $j \neq i$. Then, we have $\sum_{k \in \mathrm{pa}_i} \alpha_{i,k}\, \mathrm{cov}(f_X^{(j)}, f_X^{(k)}) = \alpha_{i,j}\, \mathrm{cov}(f_X^{(j)}, f_X^{(j)}) = \alpha_{i,j} k_{\theta_j}(X, X)$. As a result, substituting this into Eq.(32), we can derive the covariance of our target node $i$:
$$k_G^{(i)}(X, X) := \mathrm{cov}(f_X^{(i)}, f_X^{(i)}) = \sum_{j \in \mathrm{pa}_i} \alpha_{i,j}^2\, k_{\theta_j}(X, X) + k_{\theta_i}(X, X). \quad (37)$$

Remark A.1. In the context of probabilistic graphical models, only directed acyclic graphs (DAGs) can be generated and represented. Given a well-defined ordering w.r.t. the joint distribution (see Remark A.2), the covariance can be evaluated recursively starting from the lowest-numbered node.

Remark A.2 (Ordering). The decomposed product form of the joint distribution is asymmetrical, i.e., an implicit order must be chosen, such as $(v_1, v_2, \ldots, v_6)$ in Eq.(12). Following a different ordering, we obtain a different decomposition and hence a different graphical representation.

A.2 ELBO OF THE TRANSFORMED GMOGP

Following a similar derivation as (Titsias, 2009) and (Maroñas et al., 2021), the log marginal likelihood of output $i$, $i = 1, 2, \ldots$
, $S$, admits the following inequality (we denote specific locations $n$ with additional subscripts, such that $f_{KX_n}^{(i)} = G_{\Phi_i}(f_{0X_n}^{(i)}) = G_{\phi_{K-1}^{(i)}} \circ \cdots \circ G_{\phi_1^{(i)}} \circ G_{\phi_0^{(i)}}(f_{0X_n}^{(i)})$, with $f_{0X}^{(i)} = f_X^{(i)}$):
$$\log p(y^{(i)}) \geq \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{p\big(y^{(i)} \mid f_{KX}^{(i)}\big)\, p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}\right] = \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{\prod_n p\big(y_n^{(i)} \mid f_{KX_n}^{(i)}\big)\, p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}\right]$$
$$= \underbrace{\sum_n \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\big[\log p\big(y_n^{(i)} \mid f_{KX_n}^{(i)}\big)\big]}_{\mathrm{ELL}^{(i)}} + \underbrace{\mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}\right]}_{-\mathrm{KL}^{(i)}}, \quad (40)$$
where we define the approximate posterior as:
$$q\big(f_{KX}^{(i)}, u_K^{(i)}\big) = p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big)\, q\big(u_K^{(i)}\big) = p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big)\, q\big(u_0^{(i)}\big) \prod_{k=0}^{K-1} \Big|\det \nabla G_{\phi_k^{(i)}}\big(u_k^{(i)}\big)\Big|^{-1}, \quad (41)\text{-}(42)$$
and the conditional follows from the change of variables:
$$p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big) = \frac{p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{p\big(u_K^{(i)}\big)} = p\big(f_{0X}^{(i)}, u_0^{(i)}\big) \left|\det \begin{pmatrix} \nabla_{f} G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big) & \nabla_{u} G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big) \\ \nabla_{f} G_{\phi_k^{(i)}}\big(u_k^{(i)}\big) & \nabla_{u} G_{\phi_k^{(i)}}\big(u_k^{(i)}\big) \end{pmatrix}\right|^{-1} \Big/\; p\big(u_K^{(i)}\big). \quad (43)\text{-}(44)$$
For marginal flows, the off-diagonal blocks $\nabla_{u} G_{\phi_k^{(i)}}(f_{kX}^{(i)})$ and $\nabla_{f} G_{\phi_k^{(i)}}(u_k^{(i)})$ are evaluated to zero, so the determinant factorizes; the $u$-part then cancels against $p(u_K^{(i)}) = p(u_0^{(i)}) \prod_{k} |\det \nabla G_{\phi_k^{(i)}}(u_k^{(i)})|^{-1}$, yielding
$$p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big) = p\big(f_{0X}^{(i)} \mid u_0^{(i)}\big) \prod_{k=0}^{K-1} \Big|\det \nabla G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big)\Big|^{-1}. \quad (45)\text{-}(46)$$
Moreover, applying marginal flows (coordinate-wise), we can derive a more simplified form:
$$\det \nabla G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big) = \prod_n \frac{\mathrm{d}\, G_{\phi_k^{(i)}}\big(f_{kX_n}^{(i)}\big)}{\mathrm{d} f_{kX_n}^{(i)}}, \quad (47)$$
and we suppose
$$q\big(u_0^{(i)}\big) = \mathcal{N}\big(m_u^{(i)}, K_u^{(i)}\big) \quad (48)$$
with $m_u^{(i)} \in \mathbb{R}^M$ and $K_u^{(i)} \in \mathbb{R}^{M \times M}$ being the variational parameters. The detailed derivation and simplification of the expected log-likelihood (ELL) and KL divergence in Eq.(40) are specified separately in the next subsections.
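Because the flows act coordinate-wise, the determinant in Eq. (47) is just a product of scalar derivatives. A sketch of a single SAL layer (the flow used in the experiments, Eq. (90) in Appendix B.2) together with its inverse and log-determinant; the parameter values are arbitrary choices for illustration:

```python
import numpy as np

def sal_forward(f, a, b, c, d):
    # One Sinh-Arcsinh + Affine (SAL) layer: c * sinh(b * arcsinh(f) - a) + d
    return c * np.sinh(b * np.arcsinh(f) - a) + d

def sal_inverse(g, a, b, c, d):
    # Invert the layer: sinh((arcsinh((g - d) / c) + a) / b)
    return np.sinh((np.arcsinh((g - d) / c) + a) / b)

def sal_log_det(f, a, b, c, d):
    # Eq. (47): log|det grad G| = sum_n log|dG/df_n| for a marginal flow,
    # with dG/df = c * cosh(b * arcsinh(f) - a) * b / sqrt(f^2 + 1)
    dg = c * np.cosh(b * np.arcsinh(f) - a) * b / np.sqrt(f**2 + 1)
    return np.log(np.abs(dg)).sum()

f = np.linspace(-2.0, 2.0, 7)
params = (0.3, 1.2, 0.8, 0.1)          # a, b, c, d (illustrative values)
g = sal_forward(f, *params)
assert np.allclose(sal_inverse(g, *params), f)
```

The layer is strictly monotone for positive $b, c$, so the inverse and the elementwise log-determinant are well defined, which is what makes the cancellation in Eqs. (43)-(46) possible.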
A.2.1 KL DIVERGENCE FOR A TARGET OUTPUT

Following the approximate posterior in Eq.(42), the conditional term in the KL$^{(i)}$ can be canceled, namely:
$$-\mathrm{KL}^{(i)} = \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}\right] = \mathbb{E}_{q(u_K^{(i)})}\left[\log \frac{p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big)\, p\big(u_0^{(i)}\big) \prod_{k=0}^{K-1} \big|\det \nabla G_{\phi_k^{(i)}}\big(u_k^{(i)}\big)\big|^{-1}}{p\big(f_{KX}^{(i)} \mid u_K^{(i)}\big)\, q\big(u_0^{(i)}\big) \prod_{k=0}^{K-1} \big|\det \nabla G_{\phi_k^{(i)}}\big(u_k^{(i)}\big)\big|^{-1}}\right] \quad (49)\text{-}(50)$$
$$= \mathbb{E}_{q(u_K^{(i)})}\left[\log \frac{p\big(u_0^{(i)}\big)}{q\big(u_0^{(i)}\big)}\right] = \mathbb{E}_{q(u_0^{(i)})}\left[\log \frac{p\big(u_0^{(i)}\big)}{q\big(u_0^{(i)}\big)}\right] \triangleq -\mathrm{KL}\big[q\big(u_0^{(i)}\big)\,\|\,p\big(u_0^{(i)}\big)\big], \quad (51)\text{-}(52)$$
where we first marginalized $f_{KX}^{(i)}$ and then applied the law of the unconscious statistician (LOTUS). Concretely, given an invertible transformation $G(\cdot)$ (any such transformation implies a valid stochastic process; we refer the readers to (Rios, 2020) for more details), the expectation of any function $g(\cdot)$ under the distribution $p(u_K^{(i)})$ admits:
$$\mathbb{E}_{p(u_K^{(i)})}\big[g\big(u_K^{(i)}\big)\big] = \mathbb{E}_{p(u_0^{(i)})}\big[g\big(G_{\Phi_i}\big(u_0^{(i)}\big)\big)\big]. \quad (53)$$

A.2.2 EXPECTED LOG-LIKELIHOOD UNDER MARGINAL FLOWS

After integrating out $u_K^{(i)}$ and applying the LOTUS rule to the expectation with regard to $q(f_{KX}^{(i)})$, we can simplify the ELL$^{(i)}$ term in Eq.(40) as:
$$\mathrm{ELL}^{(i)} = \sum_n \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\big[\log p\big(y_n^{(i)} \mid f_{KX_n}^{(i)}\big)\big] = \sum_n \int \log p\big(y_n^{(i)} \mid f_{KX_n}^{(i)}\big)\, q\big(f_{KX}^{(i)} \mid u_K^{(i)}\big)\, q\big(u_K^{(i)}\big)\, \mathrm{d}u_K^{(i)}\, \mathrm{d}f_{KX}^{(i)} \quad (54)\text{-}(55)$$
$$= \sum_n \mathbb{E}_{q(f_{KX}^{(i)})}\big[\log p\big(y_n^{(i)} \mid f_{KX_n}^{(i)}\big)\big] = \sum_n \mathbb{E}_{q(f_{0X}^{(i)})}\big[\log p\big(y_n^{(i)} \mid G_{\Phi_i}\big(f_{0X_n}^{(i)}\big)\big)\big]. \quad (56)\text{-}(57)$$
The distribution $q(f_{0X}^{(i)})$ can be calculated by:
$$q\big(f_{0X}^{(i)}\big) = \int p\big(f_{0X}^{(i)} \mid u_0^{(i)}\big)\, q\big(u_0^{(i)}\big)\, \mathrm{d}u_0^{(i)} \quad (58)$$
$$= \mathcal{N}\Big(k_G^{(i)}(X, Z)\big(k_G^{(i)}(Z, Z)\big)^{-1} m_u^{(i)},\; k_G^{(i)}(X, X) - k_G^{(i)}(X, Z)\big(k_G^{(i)}(Z, Z)\big)^{-1}\big[k_G^{(i)}(Z, Z) - K_u^{(i)}\big]\big(k_G^{(i)}(Z, Z)\big)^{-1} k_G^{(i)}(Z, X)\Big), \quad (59)$$
with
$$p\big(u_0^{(i)}\big) = \mathcal{N}\big(0,\; k_G^{(i)}(Z, Z)\big), \quad (60)$$
$$p\big(f_{0X}^{(i)} \mid u_0^{(i)}\big) = \mathcal{N}\Big(k_G^{(i)}(X, Z)\big(k_G^{(i)}(Z, Z)\big)^{-1} u_0^{(i)},\; k_G^{(i)}(X, X) - k_G^{(i)}(X, Z)\big(k_G^{(i)}(Z, Z)\big)^{-1} k_G^{(i)}(Z, X)\Big), \quad (61)$$
where $Z \in \mathbb{R}^{M \times d}$ denotes the inducing point locations, and $X \in \mathbb{R}^{N \times d}$ are the data inputs.
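Eqs. (58)-(61) are the standard sparse-variational predictive marginal. A minimal numerical sketch with the kernel matrices assumed precomputed; the sanity check sets $q(u_0)$ to the prior and $Z = X$, in which case $q(f_{0X})$ must collapse to the prior marginal:

```python
import numpy as np

def variational_marginal(Kxx, Kxz, Kzz, m_u, S_u, jitter=1e-8):
    # q(f) = N( Kxz Kzz^{-1} m_u,
    #           Kxx - Kxz Kzz^{-1} (Kzz - S_u) Kzz^{-1} Kzx )   Eqs. (58)-(59)
    Kzz_j = Kzz + jitter * np.eye(Kzz.shape[0])
    A = np.linalg.solve(Kzz_j, Kxz.T)          # Kzz^{-1} Kzx
    mean = A.T @ m_u
    cov = Kxx - Kxz @ A + A.T @ S_u @ A
    return mean, cov

# Sanity check: with m_u = 0, S_u = Kzz (i.e. q(u) = prior) and Z = X,
# the variational marginal equals the prior marginal N(0, Kxx).
X = np.linspace(0, 1, 5)[:, None]
K = np.exp(-((X - X.T) ** 2) / 0.3**2)
mean, cov = variational_marginal(K, K, K, np.zeros(5), K)
assert np.allclose(mean, 0) and np.allclose(cov, K, atol=1e-5)
```

Maximizing the ELBO then pulls $m_u^{(i)}$ and $K_u^{(i)}$ away from this prior setting toward the (transformed) posterior over the inducing outputs.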
In conclusion, the resulting ELBO of the log marginal likelihood in Eq.(40) for output $i$ is given by:
$$\mathrm{ELBO}^{(i)} = \mathrm{ELL}^{(i)} - \mathrm{KL}^{(i)} = \mathbb{E}_{q(f_{0X}^{(i)})}\big[\log p\big(y^{(i)} \mid G_{\Phi_i}\big(f_{0X}^{(i)}\big)\big)\big] - \mathrm{KL}\big[q\big(u_0^{(i)}\big)\,\|\,p\big(u_0^{(i)}\big)\big] \quad (62)\text{-}(63)$$
$$= \sum_n \mathbb{E}_{q(f_{0X_n}^{(i)})}\big[\log p\big(y_n^{(i)} \mid G_{\Phi_i}\big(f_{0X_n}^{(i)}\big)\big)\big] - \mathrm{KL}\big[q\big(u_0^{(i)}\big)\,\|\,p\big(u_0^{(i)}\big)\big]. \quad (64)$$
Note that the observation $y_n^{(i)}$ only depends on the function evaluation at position $n$. Therefore, minimizing the objective function in the main paper (Eq.(19)) with zero mean is equivalent to:
$$\mathrm{P1}: \max_{\gamma^{(i)};\; i=1,2,\ldots,S}\; \sum_{i=1}^{S} \log p\big(y^{(i)}\big), \quad (65)$$
with $\gamma^{(i)} = \{\Theta, \alpha_i, \sigma_i\}$. According to the inequality in Eq.(40), every output admits a corresponding lower bound $\mathrm{ELBO}^{(i)}$, $i = 1, 2, \ldots, S$. Applying the following logic:
$$a < b,\; c < d \;\Rightarrow\; a + c < b + d, \qquad (a, b, c, d \in \mathbb{R}), \quad (66)$$
we can formulate the variational lower bound of the transformed GMOGP as follows:
$$\sum_{i=1}^{S} \log p\big(y^{(i)}\big) \geq \sum_{i=1}^{S} \underbrace{\Big(\mathbb{E}_{q(f_{0X}^{(i)})}\big[\log p\big(y^{(i)} \mid G_{\Phi_i}\big(f_{0X}^{(i)}\big)\big)\big] - \mathrm{KL}\big[q\big(u_0^{(i)}\big)\,\|\,p\big(u_0^{(i)}\big)\big]\Big)}_{\mathrm{ELBO}^{(i)}}. \quad (67)$$
Then, instead of solving the optimization P1 in Eq.(65), we can maximize its lower bound, namely:
$$\mathrm{P2}: \max_{\{\gamma^{(i)}, \Phi_i, Z^{(i)}, m_u^{(i)}, K_u^{(i)}\};\; i=1,2,\ldots,S}\; \sum_{i=1}^{S} \mathrm{ELBO}^{(i)}, \quad (68)$$
or equivalently minimize $-\sum_{i=1}^{S} \mathrm{ELBO}^{(i)}$, which recovers the negative ELBO (NELBO) in the main paper; via Eq.(64), the objective further decomposes over locations $n$. (69)-(70)

A.2.3 VARIATIONAL GAP

The gap between the evidence and the ELBO for each output can be measured as follows:
$$\log p\big(y^{(i)}\big) - \mathrm{ELBO}^{(i)} = \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\big[\log p\big(y^{(i)}\big)\big] - \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{p\big(y^{(i)} \mid f_{KX}^{(i)}\big)\, p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}\right] \quad (71)$$
$$= \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\big[\log q\big(f_{KX}^{(i)}, u_K^{(i)}\big)\big] - \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{p\big(y^{(i)} \mid f_{KX}^{(i)}\big)\, p\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{p\big(y^{(i)}\big)}\right] \quad (72)$$
$$= \mathbb{E}_{q(f_{KX}^{(i)}, u_K^{(i)})}\left[\log \frac{q\big(f_{KX}^{(i)}, u_K^{(i)}\big)}{p\big(f_{KX}^{(i)}, u_K^{(i)} \mid y^{(i)}\big)}\right] \quad (73)$$
$$=$$
$$\mathrm{KL}\big[q\big(f_{KX}^{(i)}, u_K^{(i)}\big)\,\|\,p\big(f_{KX}^{(i)}, u_K^{(i)} \mid y^{(i)}\big)\big]. \quad (74)$$
The gap is thus the Kullback-Leibler divergence between $q(f_{KX}^{(i)}, u_K^{(i)})$ and $p(f_{KX}^{(i)}, u_K^{(i)} \mid y^{(i)})$. This fact forms the basis of the variational inference algorithm for approximate Bayesian inference.

A.3 GMOGP PREDICTION COMPARISON

For the GMOGP model proposed in Section 3, we can derive the conditional distribution from the joint Gaussian:
$$p\big(y^{(i)}, f_X^{(i)}; \gamma^{(i)}\big) = p\big(f_X^{(i)}; \Theta, \alpha_i\big)\, p\big(y^{(i)} \mid f_X^{(i)}\big). \quad (75)$$
Correspondingly, the estimate for output $i$ at a test point $x_*$ can be calculated from:
$$\text{GMOGP:}\quad f^{(i)}(x_*) = k_G^{(i)}(x_*, X)\, \underbrace{\big(k_G^{(i)}(X, X) + \sigma_i^2 I_N\big)^{-1} y^{(i)}}_{\beta}. \quad (76)$$
The counterpart in the classic LMC model admits:
$$\text{LMC:}\quad f_*[i] = \sum_{q=1}^{Q} a_{iq} a_{jq}\, k_q(x_*, X)\, \underbrace{\bar{\beta}\big[(j-1)N + 1 : jN\big]}_{N\text{-dimensional vector}}, \quad (77)$$
where $f_*[i]$ denotes the $i$th element of the LMC prediction vector, and $\bar{\beta} = (K_M(X, X) + \Sigma)^{-1} Y$ represents the $NS$-dimensional vector calculated via the high-dimensional Gram matrix. Regarding output $i$, the prediction only relies on the elements from index $(j-1)N + 1$ to $jN$. In contrast, it is noticeable that the first factor on the right-hand side of Eqs. (76) and (77) is in both cases a composite kernel function of the test point $x_*$ and the supporting training points. As shown in the experiments on synthetic data, we find that little predictive performance gain is brought by increasing the number of latent independent processes (see Figure 5(b)). Moreover, the LMC with constant coefficients $a_{iq} a_{jq}$ may lack flexibility and representation ability. In the context of kernel methods, the predictions are determined by coefficients and a non-linear mapping into a reproducing kernel Hilbert space. For the above two models, the coefficients $\beta$ and $\bar{\beta}$ both convey the correlated information from the other observed output values. Distinct from the LMC, the GMOGP enjoys lower computational complexity and enables graph learning.
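In code, the GMOGP prediction in Eq. (76) is ordinary single-output GP regression with the aggregated kernel $k_G^{(i)}$; a sketch with the kernel matrices assumed given (the toy matrices below are illustrative):

```python
import numpy as np

def gmogp_predict(k_star, K_G, y, noise_var):
    # Eq. (76): f^(i)(x*) = k_G(x*, X) [k_G(X, X) + sigma_i^2 I_N]^{-1} y^(i)
    beta = np.linalg.solve(K_G + noise_var * np.eye(len(y)), y)
    return k_star @ beta

# With a (near-)noiseless model, predicting at a training input
# returns the training observation itself.
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = np.array([2.0, -1.0])
pred = gmogp_predict(K[0], K, y, 1e-10)
assert abs(pred - y[0]) < 1e-6
```

Only the $N$-dimensional vector $\beta$ is needed per output, which is what gives the GMOGP its per-output $\mathcal{O}(N^3)$ cost rather than the $\mathcal{O}((NS)^3)$ cost of the stacked LMC system.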
Another alternative view of the GMOGP is that it remedies the cancellation of inter-output transfer that the FICM proposed in Bonilla et al. (2007) suffers under noiseless observations: in the noiseless case with a block design, the FICM predictions for output $i$ depend only on the observations $y^{(i)}$. In other words, the information transfer among the other outputs cancels. Specifically, given the kernels, the prediction at a test point $x_*$ for output $i$ reduces to:
$$[f_*(x_*)][i] = \Big[\big(K_M \otimes k(x_*, X)\big)\big(K_M \otimes k(X, X)\big)^{-1} Y\Big][i] \quad (78)$$
$$= \Big[\Big(\big(K_M K_M^{-1}\big) \otimes \big(k(x_*, X)\, k(X, X)^{-1}\big)\Big) Y\Big][i] \quad (79)$$
$$= k(x_*, X)\, k(X, X)^{-1}\, y^{(i)}, \quad (80)$$
where $[f_*(x_*)][i]$ represents the $i$th element of the vector-valued prediction and $K_M$ denotes the inter-task (coregionalization) covariance matrix. Compared with the noiseless version of the GMOGP in Eq.(76), the information in the other outputs is still conveyed through the inner-product-based attention coefficients enrolled in the aggregated kernel function.

A.4 JOINT DISTRIBUTION OF MULTIPLE OUTPUTS

From the main paper, the joint distribution of all $S$ outputs defined by the specific graph structure can be formulated as:
$$p\big(f_X^{(1)}, f_X^{(2)}, \ldots, f_X^{(S)}\big) = p\big(f_X^{(i)} \mid \mathrm{pa}_i\big)\, p(\mathrm{pa}_i) = p\big(f_X^{(i)} \mid \mathrm{pa}_i\big) \prod_{j \in \mathrm{pa}_i} p\big(f_X^{(j)}\big). \quad (81)\text{-}(82)$$
In the GMOGP model, we model the target output with the following conditional distribution:
$$p\big(f_X^{(i)} \mid \mathrm{pa}_i\big) = \mathcal{N}\Big(f_X^{(i)} \,\Big|\, \sum_{j \in \mathrm{pa}_i} \alpha_{i,j} f_X^{(j)} + m_i,\; k_{\theta_i}(X, X)\Big), \quad i \in I, \quad (83)$$
and $p(f_X^{(j)}) = \mathcal{N}(m_j, k_{\theta_j}(X, X))$ for $j \in \mathrm{pa}_i$. After model training, given the learned parameters, we can obtain the corresponding joint distribution from Eq.(82).
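The factorization in Eqs. (81)-(83) can be checked numerically: with a single parent $j$ and zero means, the stacked vector $(f^{(j)}, f^{(i)})$ is jointly Gaussian with cross-covariance $\alpha_{i,j} k_{\theta_j}$, and the factorized density must match the joint one. An illustrative sketch, where random SPD matrices stand in for the kernel matrices:

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    # log N(x; mean, cov) for an n-dimensional Gaussian
    d = x - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet)

rng = np.random.default_rng(1)
N, alpha = 4, 0.7
A = rng.standard_normal((N, N)); Kj = A @ A.T + N * np.eye(N)  # parent kernel
B = rng.standard_normal((N, N)); Ki = B @ B.T + N * np.eye(N)  # target's own kernel
fj = rng.standard_normal(N)
fi = rng.standard_normal(N)

# Factorized: p(f^(j)) p(f^(i) | f^(j)) with f^(i) | pa_i ~ N(alpha f^(j), Ki)
log_fact = mvn_logpdf(fj, np.zeros(N), Kj) + mvn_logpdf(fi, alpha * fj, Ki)

# Joint Gaussian of the stacked vector (f^(j), f^(i))
joint_cov = np.block([[Kj, alpha * Kj],
                      [alpha * Kj, alpha**2 * Kj + Ki]])
log_joint = mvn_logpdf(np.concatenate([fj, fi]), np.zeros(2 * N), joint_cov)
assert np.isclose(log_fact, log_joint)
```

The equality holds for any evaluation point, since conditioning the joint Gaussian on $f^{(j)}$ recovers exactly the conditional in Eq. (83).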
While in the TGMOGP, since we transform the GMOGP prior with the marginal flow, the TGMOGP prior becomes:
$$p_{\gamma^{(i)}, \Phi_i}\big(f_{KX}^{(i)} \mid \mathcal{G}, X\big) = p_{\gamma^{(i)}}\big(f_X^{(i)}\big) \prod_{k=0}^{K-1} \Big|\det \nabla G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big)\Big|^{-1}. \quad (84)$$
Under the equality derived in Eq.(27), the invertible transformation of the target node $i$ can be represented as:
$$G_{\Phi_i}\big(f_X^{(i)}\big) = G_{\Phi_i}\Big(\sum_{j \in \mathrm{pa}_i} \alpha_{i,j} f_X^{(j)} + m_i + \psi_i\Big). \quad (85)$$
Given the parent nodes $\mathrm{pa}_i$, we have
$$p\big(G_{\Phi_i}\big(f_X^{(i)}\big) \mid \mathrm{pa}_i\big) = p\big(f_X^{(i)} \mid \mathrm{pa}_i\big) \prod_{k=0}^{K-1} \Big|\det \nabla G_{\phi_k^{(i)}}\big(f_{kX}^{(i)}\big)\Big|^{-1}. \quad (86)$$
Therefore, the resulting joint distribution can also be deduced from the learned coefficients and hyperparameters.

A.5 PARETO OPTIMALITY

Pareto optimality is a foundational concept in the optimization community. In single-objective problems, the Pareto optimal solution is unique. In multi-objective optimization, the multiple objectives need to be optimized simultaneously, looking for a set of Pareto optimal solutions (Censor, 1977). In short, the Pareto optimal solutions form a set of non-inferior solutions in the objective space, defining a boundary beyond which none of the objectives can be improved without sacrificing at least one of the others.

The benefits of solving multi-output regression as a multi-objective optimization (MOO) problem, instead of the classic high-dimensional optimization considered in the MOGP methods, are discussed here. Since the kernel hyperparameters $\Theta$ are shared among all outputs to achieve information exchange, we can build an MOO problem,
$$F(\Theta) = \big[L^{(1)}(\Theta), L^{(2)}(\Theta), \ldots, L^{(S)}(\Theta)\big]^T, \quad (87)$$
with $S$ objectives of the following form:
$$L^{(i)}\big(\gamma^{(i)}\big) \propto \big(y^{(i)}\big)^T \big(k_G^{(i)}(X, X) + \sigma_i^2 I_N\big)^{-1} y^{(i)} + \log\Big|k_G^{(i)}(X, X) + \sigma_i^2 I_N\Big|. \quad (88)$$
There exists competition among the multiple objectives, as the outputs correspond to multiple data sources and distinct hyperparameters.
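Each objective in Eq. (88) is, up to constants, the per-output negative log marginal likelihood; a minimal sketch:

```python
import numpy as np

def nll_objective(y, K_G, noise_var):
    # Eq. (88): y^T (K_G + sigma^2 I)^{-1} y + log det(K_G + sigma^2 I)
    C = K_G + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(C)                 # also verifies C is SPD
    quad = y @ np.linalg.solve(C, y)
    logdet = 2 * np.log(np.diag(L)).sum()
    return quad + logdet

# Degenerate check: with K_G = 0 and unit noise, the objective is ||y||^2.
y = np.array([1.0, -2.0, 0.5])
assert np.isclose(nll_objective(y, np.zeros((3, 3)), 1.0), y @ y)
```

Evaluating one such objective per output (each with its own $\alpha_i$, $\sigma_i$ but shared $\Theta$) yields the vector $F(\Theta)$ whose Pareto-optimal trade-offs the distributed framework searches for.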
The illustration of the competition between two objectives is given in Figure 6, where the Pareto optimal solutions are the points on the Pareto front. Without loss of generality, the Pareto optimal solutions for the kernel hyperparameters dominate the other feasible solutions, and they can obtain more stable performance than those learned by the high-dimensional optimization problem built in state-of-the-art (SOTA) MOGP methods. Furthermore, the objective functions for the target outputs are parameterized by distinct attention coefficients, all of which are simultaneously minimized.

Figure 6: The description of the Pareto optimal solutions (the Pareto front bounding the feasible solutions in the space of the two objectives) with respect to the multi-objective problem.

B ADDITIONAL DETAILS ON EXPERIMENTS

In this section, we specify the detailed experiment settings and learning parameters for all competing models, together with extra experimental results on various real-world datasets.

B.1 LEARNING PARAMETERS

For the competing methods, the primitive kernel function is selected to be the Squared Exponential (SE). Its hyperparameters comprise a length-scale $l$ and an output-scale/signal variance $c$, namely
$$k_{\mathrm{SE}}(x, x') = c \exp\big(-\|x - x'\|^2 / l^2\big). \quad (89)$$
The detailed learning parameters are specified in Table 4, where $l_{\mathrm{NF}}$ denotes the number of stacked transformations in the flow (specific parameters of the flow are listed in Section B.2), and $V_m$ represents the number of variational parameters (including the inducing points and the mean and covariance of the variational distribution). Also, $Q$ denotes the number of latent independent Gaussian processes used in the LMC variants, and $S$ represents the number of outputs.

Table 4: The detailed learning parameters for competing models.
| Acronym | Learning parameters (style and size) |
|---|---|
| [1] SOGP | noise: S, length-scale: S, output-scale: S, mean: S |
| [2] LMC | noise: S+1, mean: S, coefficients: SQ, length-scale: Q, output-scale: Q |
| [3] FICM | noise: S+1, mean: S, raw-variance: S(S+1)/2, length-scale: 1 |
| [4] GPRN | noise: S+1, mean: S, length-scale: Q+1, output-scale: Q+1 |
| [5] GMOGP | noise: S+1, mean: S, attention: S²+S, length-scale: S, output-scale: S |
| [6] TGMOGP | noise: S+1, mean: S, attention: S²+S, length-scale: S, output-scale: S, 4·l_NF, V_m |

B.2 MARGINAL FLOWS

In the experiments, we stack $K = 3$ or $4$ layers of a composite Sinh-Arcsinh flow with an Affine flow (SAL), which can be formulated as:
$$f_{KX}^{(i)} = c^{(i)} \sinh\Big(b^{(i)} \operatorname{arcsinh}\big(f_{K-1,X}^{(i)}\big) - a^{(i)}\Big) + d^{(i)}. \quad (90)$$
For each output $i$, $i \in I$, each layer of the flow has four free parameters, i.e., $a^{(i)}, b^{(i)}, c^{(i)}, d^{(i)} \in \mathbb{R}$. Some commonly used and available formulas for marginal flows are listed in Table 5. More discussion and validity analysis can be found in (Rios & Tobar, 2019).

Table 5: The list of common marginal flow types with input $f_X^{(i)}$. The parameters are constrained so that each individual flow is a strictly increasing/decreasing function.

| Flow | Forward | Inverse | Parameters |
|---|---|---|---|
| Log | $\log f_{0X}^{(i)}$ | $\exp f_{KX}^{(i)}$ | - |
| Softplus | $\log\big(\exp f_{0X}^{(i)} + 1\big)$ | $\log\big(\exp f_{KX}^{(i)} - 1\big)$ | - |
| Affine | $a + b f_{0X}^{(i)}$ | $\big(f_{KX}^{(i)} - a\big)/b$ | $a, b \in \mathbb{R}$ |
| Sinh-Arcsinh | $\sinh\big(b \operatorname{arcsinh}(f_{0X}^{(i)}) - a\big)$ | $\sinh\big(\tfrac{1}{b}\big(\operatorname{arcsinh}(f_{KX}^{(i)}) + a\big)\big)$ | $a, b \in \mathbb{R}$ |
| Boxcox | $\tfrac{1}{\lambda}\big(\operatorname{sgn}(f_{0X}^{(i)})\,\vert f_{0X}^{(i)}\vert^{\lambda} - 1\big)$ | $\operatorname{sgn}\big(\lambda f_{KX}^{(i)} + 1\big)\,\big\vert\lambda f_{KX}^{(i)} + 1\big\vert^{1/\lambda}$ | $\lambda \neq 0$ |
| TanH | $a \tanh\big(b f_{0X}^{(i)} + c\big) + d$ | $\tfrac{1}{b}\big(\operatorname{artanh}\big((f_{KX}^{(i)} - d)/a\big) - c\big)$ | $a, b, c, d \in \mathbb{R}$ |

B.3 REAL WORLD EXPERIMENTS

This section shows more detailed experimental results for the multi-output regression tasks across various real datasets. The data descriptions and tasks are demonstrated in Table 6.

Table 6: Description of the applied real datasets and multi-output regression tasks.
| Dataset | Ntrain | Ntest | DX | S | Tasks |
|---|---|---|---|---|---|
| JURA | 249 | 100 | 2 | 3 | Cadmium, nickel, and zinc concentration prediction in Jura |
| EEG | 256 | 100 | 1 | 7 | Extrapolation of sampled signals from electrodes on scalps |
| TRAFFIC | 500 | 172 | 1 | 6 | Traffic time-series prediction for three adjacent cells |
| ECG | 6000 | 2000 | 1 | 5 | Extrapolation of signals from electrodes on navel and head |
| KUKA | 12235 | 5325 | 21 | 7 | Torque prediction of seven joints for a lightweight arm |
| SARCOS1 | 2000 | 4449 | 21 | 4 | Torque prediction of the 2, 3, 4, 7 joints for an anthropomorphic arm |
| SARCOS2 | 20000 | 4449 | 21 | 4 | (same task as SARCOS1, larger training set) |

B.3.1 EEG DATA FITTING FOR THE FIRST OUTPUT

The main results of the proposed GMOGP models on the EEG dataset are shown in the main paper. Here, we plot the predictions for the output corresponding to electrode F0 and show more data-fitting details.

Figure 7: The predictive performance on the EEG data with respect to the signal of electrode F0. Sub-figures show the fitting results from different methods: (a) GMOGP, (b) FICM, (c) LMC, (d) SOGP, where the crosses represent the training data points, and the inverted triangles mark the locations of the 100 test points.

In Figure 7, we can see that the GMOGP outperforms the other competing methods with lower predictive uncertainty.

B.4 KUKA PREDICTIVE PERFORMANCE FOR THE MULTIPLE OUTPUTS

In Figure 8, we show the test RMSE of each output and compare the average RMSE over the seven outputs. The standardized KUKA data has close mean, median, and mode, which may be better described by Gaussian predictive distributions.
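The moment diagnostic underlying this remark (near-equal mean and median suggesting a plain Gaussian likelihood; clear skewness favoring a SAL-transformed model, as for SARCOS) can be sketched as follows. The data here is synthetic, standing in for the actual KUKA/SARCOS outputs:

```python
import numpy as np

def skew_diagnostic(y):
    # Fisher skewness; a clearly negative value (mean < median) hints that
    # a warped/transformed likelihood may fit better than a plain Gaussian.
    mu, med, sd = y.mean(), np.median(y), y.std()
    skew = ((y - mu) ** 3).mean() / sd**3
    return mu, med, skew

rng = np.random.default_rng(0)
sym = rng.standard_normal(100000)            # KUKA-like: roughly symmetric
neg = -rng.gamma(2.0, 1.0, 100000)           # SARCOS-like: negatively skewed
assert abs(skew_diagnostic(sym)[2]) < 0.05
assert skew_diagnostic(neg)[2] < -1.0
```

Running this kind of check per output is a cheap way to decide whether the flow layers of the TGMOGP are likely to pay off on a given dataset.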
Figure 8: The sub-figures show the test RMSE for outputs 1 to 7; the last two panels show the average RMSE and the time cost per iteration.

B.4.1 TRAFFIC DATA EXTRAPOLATION

In this section, we test our proposed GMOGP models on a real time-series dataset:

Traffic data: The data, containing 28 days of traffic measurements of 3 cells, are collected by communication base stations in China. For each cell, we gathered two traffic time series (the amount of traffic accumulated per hour over 28 days). Each traffic series contains 672 samples. We use the first 500 samples for training and the last 172 for testing. This multi-output regression task can be modeled as a GMOGP with six outputs, and the data are drawn in Figure 9.

Figure 9: The description of the traffic data from 3 neighboring cells over 28 days. The lines paired in black and red from top to bottom represent the traffic data of different orientations in the same cell.

Figure 10: Traffic data predictive results, namely, (a) average RMSE of different methods (SOGP, V-LMC, FICM, SGPRN, GMOGP, TGMOGP), and (b) learned attention coefficient values $\alpha$ (row $i$, column $j$ gives $\alpha_{i,j}$):

| 1.0 | 0.506 | 0.099 | 0.098 | 0.094 | 0.1 |
| 0.129 | 1.0 | 0.117 | 0.404 | 0.117 | 0.116 |
| 0.122 | 0.357 | 1.0 | 0.123 | 0.126 | 0.145 |
| 0.119 | 0.399 | 0.122 | 1.0 | 0.121 | 0.12 |
| 0.147 | 0.225 | 0.165 | 0.146 | 1.0 | 0.167 |
| 0.171 | 0.157 | 0.171 | 0.163 | 0.165 | 1.0 |

Figure 10(a) shows the average test RMSE on the traffic dataset over five runs.
The results demonstrate that the GMOGP-based models outperform the other MOGP methods and the SOGP model. Since the real data is non-smooth and contains zero values, the isolated SOGP model can hardly extrapolate well given the insufficient model flexibility of a primitive kernel. From the attention coefficients learned by the GMOGP in Figure 10(b), we can interpret the graphical representations and construct the joint distribution of all nodes. Intuitively, we find that output 1 has a strong conditional dependence on output 2, as indicated by the coefficient $\alpha_{1,2}$. In fact, the 1st and 2nd outputs are collected in the same cell and exhibit similar trends in practice. However, the other outputs, with different envelopes and regularity, provide little useful information when predicting the first output. Compared to the classic MOGP models that correlate all outputs without selection, our GMOGP provides an efficient alternative for multi-output regression.

C LIMITATIONS, EXTENSIONS, AND FUTURE WORK

The main limitations of the GMOGP can be mitigated through various avenues:
1. Interpretation of the attention coefficients: In this paper, the attention coefficients signify the dependence among nodes and the graph structure. However, interpretations of the coefficients learned by the TGMOGP and by SOTA graph attention networks remain vague and are anticipated in the Explainable Artificial Intelligence (XAI) field.
2. Dynamic dependence measures: The interaction or dependence may change dynamically with the inputs.
3. Extensions to online learning frameworks and graph construction schemes: Opportunities exist for extending the GMOGP to online learning frameworks and for enhancing graph construction schemes in an online setting.
4.
Adaptation of the attention mechanism/scoring function/dependence measure for different tasks: Consideration of diverse attention mechanisms, scoring functions, and dependence measures tailored to specific tasks in practice.
5. Exploration of different aggregation models (Eq. (16)).
6. Exploration of different objective functions and weighting schemes tailored for varied applications.

In the future, exploring the use of the GMOGP with various types of adaptive/input-dependent covariance structures and flows would be instructive. Additionally, extending the GMOGP to incorporate a data-dependent mean function, model multiple-fidelity outputs, and facilitate sequential forecasting would be of great interest. We hope the GMOGP will inspire further research into explainable networks, bridging benefits from learning models and statistics.