# Bayesian Alignments of Warped Multi-Output Gaussian Processes

Markus Kaiser (Siemens AG, Technical University of Munich), markus.kaiser@siemens.com
Clemens Otte (Siemens AG), clemens.otte@siemens.com
Thomas Runkler (Siemens AG, Technical University of Munich), thomas.runkler@siemens.com
Carl Henrik Ek (University of Bristol), carlhenrik.ek@bristol.ac.uk

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We propose a novel Bayesian approach to modelling nonlinear alignments of time series based on latent shared information. We apply the method to the real-world problem of finding common structure in the sensor data of wind turbines, introduced by the underlying latent and turbulent wind field. The proposed model allows for both arbitrary alignments of the inputs and non-parametric output warpings to transform the observations. This gives rise to multiple deep Gaussian process models connected via latent generating processes. We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem. We show results for an artificial data set and real-world data of two wind turbines.

1 Introduction

Many real-world systems are inherently hierarchical and connected. Ideally, a machine learning method should model and recognize such dependencies. Take wind power production, which is one of the major providers of renewable energy today, as an example: to optimize the efficiency of a wind turbine, the speed and pitch have to be controlled according to the local wind conditions (speed and direction). In a wind farm, turbines are typically equipped with sensors for wind speed and direction. The goal is to use these sensor data to produce accurate estimates and forecasts of the wind conditions at every turbine in the farm. For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed and direction [21, 4, 18]. In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm; furthermore, sensor faults may lead to data loss. This makes it extremely difficult to construct accurate analytical models of wind propagation in a farm. Standard approaches for extracting such information from data, e.g. generalized time warping [24], also fail at this task because they rely on a high signal-to-noise ratio. Instead, we want to construct Bayesian nonlinear dynamic data-based models for wind conditions and warpings which handle the stochastic nature of the system in a principled manner.

In this paper, we look at a generalization of this type of problem and propose a novel Bayesian approach to finding nonlinear alignments of time series based on latent shared information. We view the power production of different wind turbines as the outputs of a multi-output Gaussian process (MO-GP) [1] which models the latent wind fronts. We embed this model in a hierarchy, adding a layer of non-linear alignments on top and a layer of non-linear warpings [19, 14] below, which increases flexibility and encodes the original generative process.
We show how the resulting model can be interpreted as a group of deep Gaussian processes with the added benefit of covariances between different outputs. The imposed structure is used to formulate prior knowledge in a principled manner, restrict the representational power to physically plausible models, and recover the desired latent wind fronts and relative alignments. The presented model can be interpreted as a group of D deep GPs, all of which share one layer, which is a MO-GP. This MO-GP acts as an interface to share information between the different GPs, which are otherwise conditionally independent.

The paper has the following contributions: In Section 2, we propose the hierarchical, warped and aligned multi-output Gaussian process (AMO-GP). In Section 3, we present an efficient learning scheme via an approximation to the marginal likelihood, which allows us to fully exploit the regularization provided by our structure, yielding highly interpretable results. We show these properties for an artificial data set and for real-world data of two wind turbines in Section 4.

2 Model Definition

We are interested in formulating shared priors over a set of functions {f_d}_{d=1}^D using GPs, thereby directly parameterizing their interdependencies. In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information. Such interdependencies can be formulated via convolution processes (CPs) as proposed by Boyle and Frean [5], a generalization of the linear model of coregionalization (LMC) [13, 7]. In the CP framework, the output functions are the result of a convolution of the latent processes w_r with smoothing kernel functions T_{d,r} for each output f_d, defined as

f_d(x) = \sum_{r=1}^{R} \int T_{d,r}(x - z)\, w_r(z)\, \mathrm{d}z.    (1)

In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point x. We generalize this by allowing different alignments of the observations which depend on the position in the input space. This allows us to model the changing relative interaction times of the different latent wind fronts described in the introduction. We also assume that the dependent functions f_d are latent themselves and that the data we observe are generated via independent noisy nonlinear transformations of their values. Every function f_d is augmented with an alignment function a_d and a warping g_d, on which we place independent GP priors.

For simplicity, we assume that all outputs are evaluated at the same positions X = {x_n}_{n=1}^N. This can easily be generalized to different input sets for every output. In our application, the x_n are one-dimensional time indices. However, since the model can be generalized to multi-dimensional inputs, we do not restrict ourselves to the one-dimensional case. We note that in the multi-dimensional case, reasoning about priors on alignments can be challenging. We call the observations associated with the d-th function y_d and use the stacked vector y = (y_1, ..., y_D) to collect the data of all outputs. The final model is then given by

y_d = g_d(f_d(a_d(X))) + \epsilon_d,    (2)

where \epsilon_d \sim \mathcal{N}(0, \sigma^2_{y,d} I) is a noise term and the functions are applied element-wise. This encodes the generative process described above: for every turbine y_d, observations at positions X are generated by first aligning to the latent wind fronts using a_d, applying the front in f_d, imposing turbine-specific components via g_d, and adding noise via \epsilon_d.
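To make this generative story concrete, the following is a minimal Python sketch of Equations (1) and (2): a single white-noise latent process is drawn on a grid, convolved with a different squared-exponential smoothing kernel per output, and then passed through an alignment and a warping before noise is added. All concrete choices below (the grid, the particular alignment a_2, the tanh warping g_2, all constants) are illustrative assumptions, not the parameterization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized white-noise latent process w(z) on a fine grid.
z = np.linspace(-1.0, 2.0, 3000)
dz = z[1] - z[0]
w = rng.normal(size=z.shape) / np.sqrt(dz)  # scaled so grid sums approximate the white-noise integral


def smoothing_kernel(x, sigma, ell):
    # Squared-exponential smoothing kernel T_d(x).
    return sigma * np.exp(-0.5 * x ** 2 / ell ** 2)


def cp_output(t, sigma, ell):
    # f_d(t) = \int T_d(t - z) w(z) dz, approximated by a Riemann sum on the grid.
    return np.array([np.sum(smoothing_kernel(ti - z, sigma, ell) * w) * dz for ti in t])


# Shared inputs and illustrative alignments/warpings (assumptions, not the paper's choices).
X = np.linspace(0.0, 1.0, 200)
a1, a2 = X, X + 0.15 * np.sin(2.0 * np.pi * X)   # a_1 = id, a_2 a smooth nonlinear alignment
g1, g2 = (lambda f: f), np.tanh                   # g_1 = id, g_2 a simple warping

f1 = cp_output(a1, sigma=1.0, ell=0.10)           # both outputs are driven by the same w
f2 = cp_output(a2, sigma=0.8, ell=0.15)

sigma_y = 0.05
y1 = g1(f1) + sigma_y * rng.normal(size=X.shape)  # y_d = g_d(f_d(a_d(X))) + eps_d
y2 = g2(f2) + sigma_y * rng.normal(size=X.shape)
```

Because both outputs are built from the same latent draw w, they are correlated even though their smoothing kernels, alignments and warpings differ; this is exactly the shared-information mechanism the AMO-GP exploits.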
We assume independence between the a_d and g_d across outputs and apply GP priors of the form a_d ~ GP(id, k_{a,d}) and g_d ~ GP(id, k_{g,d}). By setting the prior mean to the identity function id(x) = x, the standard CP model is our default assumption. During learning, the model can choose the different a_d and g_d in a way that reveals the independent shared latent processes {w_r}_{r=1}^R, on which we also place GP priors w_r ~ GP(0, k_{u,r}). Similar to Boyle and Frean [5], we assume the latent processes to be independent white noise processes by setting cov[w_r(z), w_{r'}(z')] = \delta_{rr'}\,\delta_{zz'}. Under this prior, the f_d are also GPs with zero mean and

\mathrm{cov}[f_d(x), f_{d'}(x')] = \sum_{r=1}^{R} \int T_{d,r}(x - z)\, T_{d',r}(x' - z)\, \mathrm{d}z.

Figure 1: The graphical model of the AMO-GP with variational parameters (blue). A CP, informed by R latent processes, models shared information between multiple data sets with nonlinear alignments and warpings. This CP connects multiple deep GPs through a shared layer.

Figure 2: An artificial example of hierarchical composite data with multiple observations of shared latent information. This hierarchy generates two data sets using a dampened sine function which is never observed directly.

Using a squared exponential kernel for all T_{d,r}, the integral can be shown to have a closed-form solution. With {\sigma_{d,r}, \ell_{d,r}} denoting the kernel hyperparameters associated with T_{d,r}, it is given by

\mathrm{cov}[f_d(x), f_{d'}(x')] = \sum_{r=1}^{R} (2\pi)^{\frac{K}{2}}\, \sigma_{d,r}\, \sigma_{d',r} \prod_{k=1}^{K} \hat{\ell}^{-1}_{d,d',r,k}\, \exp\!\left(-\frac{1}{2} \sum_{k=1}^{K} \frac{(x_k - x'_k)^2}{\hat{\ell}^2_{d,d',r,k}}\right),    (3)

where x is K-dimensional and \hat{\ell}_{d,d',r,k} = \sqrt{\ell^2_{d,r,k} + \ell^2_{d',r,k}}.
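As a concrete check of this construction, consider the one-dimensional case (K = 1) with a single latent process and an unnormalized squared-exponential smoothing kernel T_d(x) = \sigma_d \exp(-x^2/(2\ell_d^2)). Under this particular (assumed) normalization the defining integral evaluates to \sqrt{2\pi}\,\sigma_d\sigma_{d'}\ell_d\ell_{d'}/\hat{\ell} \cdot \exp(-(x - x')^2/(2\hat{\ell}^2)) with \hat{\ell}^2 = \ell_d^2 + \ell_{d'}^2, so the structure of (3) is reproduced while the constant factor depends on how the smoothing kernels are normalized. A small numerical sketch:

```python
import numpy as np


def cp_cross_cov(x, x_prime, sigma_d, ell_d, sigma_e, ell_e):
    """cov[f_d(x), f_e(x')] for a CP with one white-noise latent process and
    1-D smoothing kernels T_d(x) = sigma_d * exp(-x^2 / (2 * ell_d^2))."""
    ell_hat_sq = ell_d ** 2 + ell_e ** 2  # combined squared length scale
    const = np.sqrt(2.0 * np.pi) * sigma_d * sigma_e * ell_d * ell_e / np.sqrt(ell_hat_sq)
    return const * np.exp(-0.5 * (x - x_prime) ** 2 / ell_hat_sq)


# Numerical check of the closed form against the defining integral.
z = np.linspace(-50.0, 50.0, 400_001)
T = lambda u, s, l: s * np.exp(-u ** 2 / (2.0 * l ** 2))
numeric = np.sum(T(0.3 - z, 1.0, 0.5) * T(1.1 - z, 0.8, 0.9)) * (z[1] - z[0])
print(numeric, cp_cross_cov(0.3, 1.1, 1.0, 0.5, 0.8, 0.9))  # the two values agree closely
```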
3 Variational Approximation

Since exact inference in this model is intractable, we present a variational approximation to the model's marginal likelihood in this section. A detailed derivation of the variational bound can be found in Appendix A. Analogously to y, we denote the random vectors which contain the function values of the respective functions and outputs as a and f. The joint probability distribution of the data can then be written as

p(y, f, a \mid X) = p(f \mid a) \prod_{d=1}^{D} p(y_d \mid f_d)\, p(a_d \mid X),    (4)

with a_d \mid X \sim \mathcal{N}(X,\, K_{a,d} + \sigma^2_{a,d} I), \quad f \mid a \sim \mathcal{N}(0,\, K_f + \sigma^2_f I), \quad y_d \mid f_d \sim \mathcal{N}(f_d,\, K_{g,d} + \sigma^2_{y,d} I).

Here, we use K to refer to the Gram matrices corresponding to the respective GPs. All parts except the convolution processes factorize over the different levels of the model as well as over the different outputs.

3.1 Variational Lower Bound

To approximate a single deep GP, that is, a single string of GPs stacked on top of each other, Hensman and Lawrence [11] proposed nested variational compression, in which every GP in the hierarchy is handled independently. In order to arrive at their lower bound, they make two variational approximations. First, they consider a variational approximation q(\hat{a}, u) = p(\hat{a} \mid u)\, q(u) to the true posterior of a single GP, first introduced by Titsias [22]. In this approximation, the original model is augmented with inducing variables u together with their inducing points Z, which are assumed to be latent observations of the same function and are thus jointly Gaussian with the observed data. In contrast to [22], the distribution q(u) is not chosen optimally but parameterized as q(u) = \mathcal{N}(u \mid m, S) and optimized, which gives rise to the scalable variational GP presented in [10]. Second, in order to apply this variational bound to the individual GPs recursively, uncertainties have to be propagated through subsequent layers, and inter-layer cross-dependencies are avoided using another variational approximation.

The variational lower bound for the AMO-GP is given by

\log p(y \mid X) \geq \sum_{d=1}^{D} \log \mathcal{N}\!\left(y_d \,\middle|\, \Psi_{g,d} K^{-1}_{u_{g,d} u_{g,d}} m_{g,d},\, \sigma^2_{y,d} I\right)
  - \sum_{d=1}^{D} \frac{1}{2\sigma^2_{a,d}} \operatorname{tr}(\Sigma_{a,d})
  - \frac{1}{2\sigma^2_{f}} \left(\psi_f - \operatorname{tr}\!\left(\Phi_f K^{-1}_{u_f u_f}\right)\right)
  - \sum_{d=1}^{D} \frac{1}{2\sigma^2_{y,d}} \left(\psi_{g,d} - \operatorname{tr}\!\left(\Phi_{g,d} K^{-1}_{u_{g,d} u_{g,d}}\right)\right)
  - \sum_{d=1}^{D} \operatorname{KL}\!\left(q(u_{a,d}) \,\|\, p(u_{a,d})\right)
  - \operatorname{KL}\!\left(q(u_f) \,\|\, p(u_f)\right)
  - \sum_{d=1}^{D} \operatorname{KL}\!\left(q(u_{y,d}) \,\|\, p(u_{y,d})\right)
  - \frac{1}{2\sigma^2_f} \operatorname{tr}\!\left(\left(\Phi_f - \Psi_f^{T}\Psi_f\right) K^{-1}_{u_f u_f} \left(m_f m_f^{T} + S_f\right) K^{-1}_{u_f u_f}\right)
  - \sum_{d=1}^{D} \frac{1}{2\sigma^2_{y,d}} \operatorname{tr}\!\left(\left(\Phi_{g,d} - \Psi_{g,d}^{T}\Psi_{g,d}\right) K^{-1}_{u_{g,d} u_{g,d}} \left(m_{g,d} m_{g,d}^{T} + S_{g,d}\right) K^{-1}_{u_{g,d} u_{g,d}}\right),    (5)

where KL denotes the Kullback-Leibler divergence. A detailed derivation can be found in Appendix A. The bound contains one Gaussian fit term per output dimension and a series of regularization terms for every GP in the hierarchy. The KL-divergences connect the variational approximations to the prior, and the different trace terms regularize the variances of the different GPs (for a detailed discussion see [11]). The bound depends on the hyperparameters of the kernels and likelihoods {\ell, \sigma} and on the variational parameters {Z_{l,d}, m_{l,d}, S_{l,d} \mid l \in \{a, f, g\}, d \in [D]}. It can be calculated in O(NM^2) time and factorizes along the data points, which enables stochastic optimization. Since each of the N data points is associated with one of the D outputs, the computational cost of the model is independent of D. Information is only shared between the different outputs via the inducing points in f. As the different outputs share a common function, increasing D allows us to reduce the number of variational parameters per output, because the shared function can still be represented completely.

Central components of this bound are expectations over kernel matrices, the three Ψ-statistics \psi_f = \mathbb{E}_{q(a)}[\operatorname{tr}(K_{ff})], \Psi_f = \mathbb{E}_{q(a)}[K_{fu}] and \Phi_f = \mathbb{E}_{q(a)}[K_{uf} K_{fu}]. Closed-form solutions for these statistics depend on the choice of kernel and are known for specific kernels, such as linear or RBF kernels, as shown for example in [8]. In the following subsection we give closed-form solutions for these statistics for the shared CP-layer of our model.

3.2 Convolution Kernel Expectations

The uncertainty about the first layer is captured by the variational distribution of the latent alignments a, given by q(a) = \mathcal{N}(\mu_a, \Sigma_a). Every aligned point in a corresponds to one output of f and ultimately to one of the y_d. Since the closed form of the multi-output kernel depends on the choice of outputs, we use the notation \hat{f}(a_n) to denote f_d(a_n) such that a_n is associated with output d. For simplicity, we only consider a single latent process w_r; since the latent processes are independent, the results can easily be generalized to multiple processes. Then, \psi_f is given by

\psi_f = \mathbb{E}_{q(a)}[\operatorname{tr}(K_{ff})] = \sum_{n=1}^{N} \hat{\sigma}^2_{nn}.    (6)

Similar to the notation \hat{f}(\cdot), we use \hat{\sigma}_{nn'} to denote the variance term associated with the covariance function \mathrm{cov}[\hat{f}(a_n), \hat{f}(a_{n'})] as given in (3). The expectation \Psi_f = \mathbb{E}_{q(a)}[K_{fu}], connecting the alignments and the pseudo inputs, is given by

(\Psi_f)_{ni} = \hat{\sigma}^2_{ni} \sqrt{\frac{(\Sigma_a)^{-1}_{nn}}{\hat{\ell}_{ni} + (\Sigma_a)^{-1}_{nn}}}\, \exp\!\left(-\frac{(\Sigma_a)^{-1}_{nn}\, \hat{\ell}_{ni}}{2\left((\Sigma_a)^{-1}_{nn} + \hat{\ell}_{ni}\right)} \left((\mu_a)_n - Z_i\right)^2\right),    (7)

where \hat{\ell}_{ni} is the combined length scale corresponding to the same kernel as \hat{\sigma}_{ni}. Lastly, \Phi_f = \mathbb{E}_{q(a)}[K_{uf} K_{fu}] connects alignments and pairs of pseudo inputs with the closed form

(\Phi_f)_{ij} = \sum_{n=1}^{N} \hat{\sigma}^2_{ni}\, \hat{\sigma}^2_{nj} \sqrt{\frac{(\Sigma_a)^{-1}_{nn}}{\hat{\ell}_{ni} + \hat{\ell}_{nj} + (\Sigma_a)^{-1}_{nn}}}\, \exp\!\left(-\frac{\hat{\ell}_{ni}\hat{\ell}_{nj}}{\hat{\ell}_{ni} + \hat{\ell}_{nj}} \frac{(Z_i - Z_j)^2}{2} - \frac{(\Sigma_a)^{-1}_{nn}\left(\hat{\ell}_{ni} + \hat{\ell}_{nj}\right)}{2\left((\Sigma_a)^{-1}_{nn} + \hat{\ell}_{ni} + \hat{\ell}_{nj}\right)} \left((\mu_a)_n - \frac{\hat{\ell}_{ni} Z_i + \hat{\ell}_{nj} Z_j}{\hat{\ell}_{ni} + \hat{\ell}_{nj}}\right)^2\right).    (8)

The Ψ-statistics factorize along the data and we only need to consider the diagonal entries of \Sigma_a. If all the data belong to the same output, the Ψ-statistics of the squared exponential kernel are recovered as a special case; this case is used for the output-specific warpings g_d.
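Because the Ψ-statistics are plain expectations under q(a), closed forms like the ones above can always be sanity-checked by Monte Carlo. The sketch below does this for a single-output stand-in with an ordinary squared-exponential kernel (the multi-output case only changes which \hat{\sigma} and \hat{\ell} enter each matrix element); all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)


def rbf(a, z, variance=1.0, lengthscale=0.7):
    # k(a, z) = variance * exp(-(a - z)^2 / (2 * lengthscale^2))
    return variance * np.exp(-0.5 * (a[:, None] - z[None, :]) ** 2 / lengthscale ** 2)


# Variational posterior over the alignments, q(a) = N(mu_a, diag(var_a)).
mu_a = np.array([0.1, 0.5, 0.9])
var_a = np.array([0.04, 0.09, 0.01])
Z = np.array([0.0, 0.4, 0.8, 1.2])  # pseudo inputs of the shared layer

# Monte Carlo estimates of the three Psi-statistics.
samples = mu_a + np.sqrt(var_a) * rng.normal(size=(20_000, mu_a.size))

psi_f = np.mean([np.trace(rbf(a, a)) for a in samples])              # E_q[tr(K_ff)]
Psi_f = np.mean([rbf(a, Z) for a in samples], axis=0)                 # E_q[K_fu]
Phi_f = np.mean([rbf(a, Z).T @ rbf(a, Z) for a in samples], axis=0)   # E_q[K_uf K_fu]

print(psi_f)                     # equals N * variance for a stationary kernel
print(Psi_f.shape, Phi_f.shape)  # (N, M) and (M, M)
```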
3.3 Model Interpretation

The graphical model shown in Figure 1 illustrates that the presented model can be interpreted as a group of D deep GPs, all of which share one layer, which is a CP. This CP acts as an interface to share information between the different GPs, which are otherwise conditionally independent. This modelling choice introduces a new quality to the model when compared to standard deep GPs with multiple output dimensions, since the latter are not able, in principle, to learn dependencies between the different outputs. Compared to standard multi-output GPs, the AMO-GP introduces more flexibility with respect to the shared information. CPs make strong assumptions about the relative alignments of the different outputs; that is, they assume constant time-offsets. The AMO-GP extends this by introducing a principled Bayesian treatment of general nonlinear alignments a_d, on which we can place informative priors derived from the problem at hand. Together with the warping layers g_d, our model can learn to share knowledge in an informative latent space learnt from the data.

Alternatively, this model can be interpreted as a shared and warped latent variable model with a very specific prior: the indices X are part of the prior for the latent space a_d(X) and specify a sense of order for the different data points y, which is augmented with uncertainty by the alignment functions. Using this order, the convolution processes enforce the covariance structure for the different data points specified by the smoothing kernels.

In order to derive an inference scheme, we need the ability to propagate uncertainties about the correct alignments and the latent shared information through subsequent layers. We adapted the approach of nested variational compression by Hensman and Lawrence [11], which is originally concerned with a single deep GP. The approximation is expanded to handle multiple GPs at once, yielding the bound in (5). The bound reflects the dependencies of the different outputs, as the sharing of information between the different deep GPs is approximated through the shared inducing variables u_f. Our main contribution for the inference scheme is the derivation of closed-form solutions for the Ψ-statistics of the convolution kernel in (6) to (8).

4 Experiments

In this section we show how to apply the AMO-GP to the task of finding common structure in time series observations. In this setting, we observe multiple time series T_d = (X_d, y_d) and assume that there exist latent time series which determine the observations. We will first apply the AMO-GP to an artificial data set in which we define a decomposed system of dependent time series by specifying a shared latent function generating the observations together with relative alignments and warpings for the different time series.

Figure 3: A comparison of the AMO-GP with other GP models: (a) shallow GP with RBF kernel, (b) multi-output GP with dependent RBF kernel, (c) deep GP with RBF kernels, (d) AMO-GP with (dependent) RBF kernels. The plots show mean predictions and a shaded area of two standard deviations. If available, the ground truth is displayed as a dashed line. Additional lines are noiseless samples drawn from the model. The shallow and deep GPs in Figures 3a and 3c model the data independently and revert back to the prior in y2. Because of the nonlinear alignment, the multi-output GP cannot model the data in Figure 3b. The AMO-GP in Figure 3d recovers the alignment and warping and shares information between the two outputs.
We will show that our model is able to recover this decomposition from the training data and compare the results to other approaches to modelling the data. Then we focus on a real-world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts at the two turbines.

4.1 Artificial data set

Our data set consists of two time series T1 and T2 generated from a dampened sine function. We choose the alignment of T1 and the warping of T2 to be the identity and apply a sigmoid warping to T1; the alignment of T2 is selected to be a quadratic function. This prevents us from ever observing the latent function directly. Figure 2 shows a visualization of this decomposed system of dependent time series. To obtain training data, we uniformly sampled 500 points from the two time series and added Gaussian noise. We subsequently removed parts of the training sets to explore the generalization behaviour of our model, resulting in |T1| = 450 and |T2| = 350.
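A short sketch of how a data set with this structure can be generated is given below; the paper does not state the exact constants of the dampened sine, the sigmoid or the quadratic alignment, so the specific functions, noise level and removed intervals here (chosen to match the test regions in Table 1) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

latent = lambda x: np.exp(-1.5 * x) * np.sin(6.0 * np.pi * x)  # dampened sine, never observed directly
sigmoid = lambda f: 1.0 / (1.0 + np.exp(-5.0 * f))             # warping applied to T1
quadratic = lambda x: x ** 2                                    # alignment of T2 on [0, 1]

X = rng.uniform(0.0, 1.0, size=500)                             # 500 uniformly sampled inputs
noise = 0.02

y1 = sigmoid(latent(X)) + noise * rng.normal(size=X.shape)      # identity alignment for T1
y2 = latent(quadratic(X)) + noise * rng.normal(size=X.shape)    # identity warping for T2

# Remove parts of the training sets (roughly |T1| = 450 and |T2| = 350 on average).
keep1 = ~((X > 0.7) & (X < 0.8))
keep2 = ~((X > 0.35) & (X < 0.65))
T1 = (X[keep1], y1[keep1])
T2 = (X[keep2], y2[keep2])
```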
We use this setup to train our model using squared exponential kernels, both in the conditionally independent GPs a_d and g_d and as smoothing kernels in f. We can always choose one alignment and one warping to be the identity function in order to constrain the shared latent spaces a and f and to provide a reference relative to which the other alignments and warpings are defined. Since we assume our artificial data simulates a physical system, we apply the prior knowledge that the alignment and warping processes have slower dynamics than the shared latent function, which should capture most of the observed dynamics. To this end, we applied priors to the a_d and g_d which prefer longer length scales and smaller variances compared to f. Otherwise, the model could easily get stuck in local minima, such as choosing the upper two layers to be identity functions and modelling the time series independently in the g_d. Additionally, our assumption of identity mean functions prevents pathological cases in which the complete model collapses to a constant function.

Figure 3d shows the AMO-GP's recovered function decomposition and joint predictions. The model successfully recovered a shared latent dampened sine function, a sigmoid warping for the first time series and an approximately quadratic alignment function for the second time series. In Figures 3a to 3c, we show the training results of a standard GP, a multi-output GP and a three-layer deep GP on the same data. For all of these models, we used RBF kernels and, in the case of the deep GP, applied priors similar to our model in order to avoid pathological cases.

Table 1: Test log-likelihoods for the models presented in Section 4.

Experiment   Test set             GP      MO-GP    DGP     AMO-GP (ours)
Artificial   [0.7, 0.8] on T1    -0.12   -0.053    0.025    1.54
Artificial   [0.35, 0.65] on T2  -0.19   -5.66    -0.30     0.72
Wind         [40, 45] on T2      -4.42   -2.31    -1.80    -1.43
Wind         [65, 75] on T2      -7.26   -0.73    -1.93    -0.69

In Table 1 we report test log-likelihoods for the presented models, which illustrate the qualitative differences between them. Because all models are non-parametric and converge well, repeating the experiments with different initializations leads to very similar likelihoods. Both the standard GP and the deep GP cannot learn dependencies between time series and revert back to the prior where no data is available. The deep GP has learned that two layers are enough to model the data, and the resulting model is essentially a Bayesian warped GP which has identified the sigmoid warping for T1. Uncertainties in the deep GP are placed in the middle-layer areas where no data are available for the respective time series, as sharing information between the two outputs is impossible. In contrast to the other two models, the multi-output GP can and must share information between the two time series. As discussed in Section 2, however, it is constrained to constant time-offsets and cannot model the nonlinear alignment in the data. Because of this, the model cannot recover the latent sine function and can only model one of the two outputs.

4.2 Pairs of wind turbines

This experiment is based on real data recorded from a pair of neighbouring wind turbines in a wind farm. The two time series T1 and T2, shown in gray in Figure 4, record the respective power generation of the two turbines over the course of one and a half hours, smoothed slightly using a rolling average over 60 seconds. There are 5400 data points for the first turbine (blue) and 4622 data points for the second turbine (green). We removed two intervals (drawn as dashed lines) from the second turbine's data set to inspect the behaviour of the model with missing data. This allows us to evaluate and compare the generative properties of our model in Figure 5.

The power generated by a wind turbine is mainly dependent on the speed of the wind fronts interacting with the turbine. For system identification tasks concerned with the behaviour of multiple wind turbines, associating the observations on different turbines due to the same wind fronts is an important task. However, it is usually not possible to directly measure these correspondences or wind propagation speeds between turbines, which means that there is no ground truth available.

Figure 4: The joint posterior for two time series y1 and y2 of power production for a pair of wind turbines. The top and bottom plots show the two observed time series with training data and dashed missing data. The AMO-GP recovers an uncertain relative alignment of the two time series, shown in the middle plot. High uncertainty about the alignment is placed in areas where multiple explanations are plausible due to the high amount of noise or missing data.

Figure 5: A comparison of noiseless samples drawn from (a) a GP, (b) a MO-GP, (c) a DGP and (d) the AMO-GP. The separation of uncertainties implied by the model structure of the AMO-GP gives rise to an informative model. Since the uncertainty in the generative process is mainly placed in the relative alignment shown in Figure 4, all samples in Figure 5d resemble the underlying data in structure.

An additional problem is that the shared latent wind conditions are superimposed by turbine-specific local turbulences. Since these local effects are of comparable amplitude to short-term changes of wind speed, it is challenging to decide which parts of the signal to explain away as noise and which parts to identify as the underlying shared process. Our goal is the simultaneous learning of the uncertain alignment in time a and of the shared latent wind condition f. Modelling the turbine-specific parts of the signals is not the objective, so they need to be explained by the Gaussian noise term.
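As a small practical aside, the light 60-second smoothing of the power signals described above can be reproduced with a standard rolling mean; the file name, column name and one-second sampling assumed below are not specified in the paper.

```python
import pandas as pd

# Assumed raw format: one power reading per second with a datetime index.
raw = pd.read_csv("turbine_power.csv", parse_dates=["timestamp"], index_col="timestamp")

# Rolling average over 60 seconds, as used to lightly smooth the two signals.
smoothed = raw["power"].rolling("60s").mean()
```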
We use a squared exponential kernel as a prior for the alignment functions a_d and as smoothing kernels in f. For the given data set, we can assume the output warpings g_d to be linear functions, because there is only one dimension, the power generation, which in this data set is of similar shape for both turbines. Again, we encode a preference for alignments with slow dynamics through a prior on the length scales of a_d. As the signal has turbine-specific autoregressive components, plausible alignments are not unique. To constrain the AMO-GP, we want it to prefer alignments close to the identity function, which we chose as the prior mean function.

Figure 4 shows the joint model learned from the data, in which a1 is chosen to be the identity function. The possible alignments identified match the physical conditions of the wind farm: for the given turbines, time offsets of up to six minutes are plausible, and for most wind conditions the offset is expected to be close to zero. In areas where the alignment is quite certain, however, the two time series are explained with comparable detail. The model is able to recover unambiguous associations well and successfully places high uncertainty on the alignment in areas where multiple explanations are plausible due to the noisy signal. As expected, the uncertainty about the alignment also grows where data for the second time series is missing. This uncertainty is propagated through the shared function and results in higher predictive variances for the second time series. Because of the factorization in the model, however, we can recover the uncertainties about the alignment and the shared latent function separately.

Figure 5 compares samples drawn from our model with samples drawn from a GP, a MO-GP and a DGP. The GP reverts to its prior when data is missing, while the MO-GP does not handle the short-term dynamics and smooths the signal enough that the nonlinear alignment can be approximated as constant. Samples drawn from a DGP model showcase the complexity of a DGP prior. Unconstrained composite GPs are hard to reason about and make the model very flexible in terms of representable functions. Since the model's evidence is very broad, the posterior is uninformed and inference is hard. Additionally, as discussed in Appendix B and [11], the nested variational compression bound tends to loosen with high uncertainties. The AMO-GP shows richer structure: due to the constraints imposed by the model, more robust inference leads to a more informed model. Samples show that it has learned that a maximum which is missing in the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations along the X-axis.

5 Conclusion

We have proposed the warped and aligned multi-output Gaussian process (AMO-GP), in which MO-GPs are embedded in a hierarchy to find shared structure in latent spaces. We extended convolution processes [5] with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model. This structure can be used both to regularize the model and to encode expert knowledge about specific parts of the system. By applying nested variational compression [11] to inference in these models, we presented a variational lower bound which combines a Bayesian treatment of all parts of the model with scalability via stochastic optimization.
We compared the model with GPs, deep GPs and multi-output GPs on an artificial data set and showed how the richer model structure allows the AMO-GP to pick up on latent structure which the other approaches cannot model. We then applied the AMO-GP to real-world data of two wind turbines and used the proposed hierarchy to model wind propagation in a wind farm, recovering information about the latent non-homogeneous wind field. With uncertainties decomposed along the hierarchy, our approach handles ambiguities introduced by the stochasticity of the wind in a principled manner. This indicates that the AMO-GP is a good approach for these kinds of dynamical systems, where multiple misaligned sensors measure the same latent effect.

6 Acknowledgement

The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report's contents lies with the authors.

References

[1] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. "Kernels for Vector-Valued Functions: a Review". arXiv:1106.6251 [cs, math, stat], June 2011.
[2] Mauricio A. Alvarez et al. "Efficient Multioutput Gaussian Processes through Variational Inducing Kernels". In: AISTATS. Vol. 9. 2010, pp. 25-32.
[3] Mauricio Alvarez and Neil D. Lawrence. "Sparse convolved Gaussian processes for multi-output regression". In: Advances in Neural Information Processing Systems. 2009, pp. 57-64.
[4] Eilyan Bitar and Pete Seiler. "Coordinated control of a wind turbine array for power maximization". In: American Control Conference (ACC), 2013. IEEE, 2013, pp. 2898-2904.
[5] Phillip Boyle and Marcus R. Frean. "Dependent Gaussian Processes". In: NIPS. Vol. 17. 2004, pp. 217-224.
[6] Phillip Boyle et al. Multiple output Gaussian process regression. Tech. rep. 2005.
[7] Timothy C. Coburn. Geostatistics for natural resources evaluation. Taylor & Francis Group, 2000.
[8] Andreas C. Damianou and Neil D. Lawrence. "Deep Gaussian Processes". arXiv:1211.0358 [cs, math, stat], Nov. 2012.
[9] David Duvenaud et al. Avoiding pathologies in very deep networks. 2014.
[10] James Hensman, Nicolo Fusi, and Neil D. Lawrence. "Gaussian Processes for Big Data". arXiv:1309.6835 [cs, stat], Sept. 2013.
[11] James Hensman and Neil D. Lawrence. "Nested Variational Compression in Deep Gaussian Processes". arXiv:1412.1370 [stat], Dec. 2014.
[12] James Hensman, Alex Matthews, and Zoubin Ghahramani. "Scalable Variational Gaussian Process Classification". arXiv:1411.2005 [stat], Nov. 2014.
[13] Andre G. Journel and Ch. J. Huijbregts. Mining geostatistics. Academic Press, 1978.
[14] Miguel Lázaro-Gredilla. "Bayesian warped Gaussian processes". In: Advances in Neural Information Processing Systems. 2012, pp. 1619-1627.
[15] Alexander G. de G. Matthews et al. "GPflow: A Gaussian process library using TensorFlow". In: Journal of Machine Learning Research 18.40 (2017), pp. 1-6.
[16] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
[17] Hugh Salimbeni and Marc Deisenroth. "Doubly Stochastic Variational Inference for Deep Gaussian Processes". arXiv:1705.08933 [stat], May 2017.
[18] J. G. Schepers and S. P. van der Pijl. "Improved modelling of wake aerodynamics and assessment of new farm control strategies". In: Journal of Physics: Conference Series. Vol. 75. IOP Publishing, 2007, p. 012039.
[19] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. "Warped Gaussian Processes". In: Advances in Neural Information Processing Systems. MIT Press, 2004, pp. 337-344.
[20] Jasper Snoek et al. "Input Warping for Bayesian Optimization of Non-stationary Functions". arXiv:1402.0929 [cs, stat], Feb. 2014.
[21] Maryam Soleimanzadeh and Rafael Wisniewski. "Controller design for a wind farm, considering both power and load aspects". In: Mechatronics 21.4 (2011), pp. 720-727.
[22] Michalis K. Titsias. "Variational Learning of Inducing Variables in Sparse Gaussian Processes". In: AISTATS. Vol. 5. 2009, pp. 567-574.
[23] Michalis K. Titsias and Neil D. Lawrence. "Bayesian Gaussian process latent variable model". In: International Conference on Artificial Intelligence and Statistics. 2010, pp. 844-851.
[24] Feng Zhou and Fernando De la Torre. "Generalized time warping for multi-modal alignment of human motion". In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1282-1289.