# infinitehorizon_gaussian_processes__ab1a7bfa.pdf Infinite-Horizon Gaussian Processes Arno Solin Aalto University arno.solin@aalto.fi James Hensman PROWLER.io james@prowler.io Richard E. Turner University of Cambridge ret26@cam.ac.uk Gaussian processes provide a flexible framework for forecasting, removing noise, and interpreting long temporal datasets. State space modelling (Kalman filtering) enables these non-parametric models to be deployed on long datasets by reducing the complexity to linear in the number of data points. The complexity is still cubic in the state dimension m which is an impediment to practical application. In certain special cases (Gaussian likelihood, regular spacing) the GP posterior will reach a steady posterior state when the data are very long. We leverage this and formulate an inference scheme for GPs with general likelihoods, where inference is based on single-sweep EP (assumed density filtering). The infinite-horizon model tackles the cubic cost in the state dimensionality and reduces the cost in the state dimension m to O(m2) per data point. The model is extended to online-learning of hyperparameters. We show examples for large finite-length modelling problems, and present how the method runs in real-time on a smartphone on a continuous data stream updated at 100 Hz. 1 Introduction Gaussian process (GP, [25]) models provide a plug & play interpretable approach to probabilistic modelling, and would perhaps be more widely applied if not for their associated computational complexity: naïve implementations of GPs require the construction and decomposition of a kernel matrix at cost O(n3), where n is the number of data. In this work, we consider GP time series (i.e. GPs with one input dimension). In this case, construction of the kernel matrix can be avoided by exploiting the (approximate) Markov structure of the process and re-writing the model as a linear Gaussian state space model, which can then be solved using Kalman filtering (see, e.g., [27]). The Kalman filter costs O(m3n), where m is the dimension of the state space. We propose the Infinite-Horizon GP approximation (IHGP), which reduces the cost to O(m2n). As m grows with the number of kernel components in the GP prior, this cost saving can be significant for many GP models where m can reach hundreds. For example, the automatic statistician [6] searches for kernels (on 1D datasets) using sums and products of kernels. The summing of two kernels results in the concatenation of the state space (sum of the ms) and a product of kernels results in the Kronecker sum of their statespaces (product of ms). This quickly results in very high state dimensions; we show results with a similarly constructed kernel in our experiments. We are concerned with real-time processing of long (or streaming) time-series with short and long length-scale components, and non-Gaussian noise/likelihood and potential non-stationary structure. We show how the IHGP can be applied in the streaming setting, including efficient estimation of the marginal likelihood and associated gradients, enabling on-line learning of hyper (kernel) parameters. We demonstrate this by applying our approach to a streaming dataset of two million points, as well as providing an implementation of the method on an i Phone, allowing on-line learning of a GP model of the phone s acceleration. This work was undertaken whilst AS was a Visiting Research Fellow with University of Cambridge. 32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada. 1 0.5 0 0.5 1 IHGP mean 95% quantiles Full GP 0 2 4 6 8 10 Length-scale, ℓ log p(y | θ) IHGP Full GP Figure 1: (Left) GP regression with n = 100 observations and a Matérn covariance function. The IHGP is close to exact far from boundaries, where the constant marginal variance assumption shows. (Right) Hyperparameters θ = (σ2 n, σ2, ℓ) optimised independently for both models. For data where a Gaussian noise assumption may not be appropriate, many approaches have been proposed for approximation (see, e.g., [21] for an overview). Here we show how to combine Assumed Density Filtering (ADF, a.k.a. single-sweep Expectation Propagation, EP [5, 12, 19]) with the IHGP. We are motivated by the application to Log-Gaussian Cox Processes (LGCP, [20]). Usually the LGCP model uses binning to avoid a doubly-intractable model; in this case it is desirable to have more bins in order to capture short-lengthscale effects, leading to more time points. Additionally, the desire to capture long-and-short-term effects means that the state space dimension m can be large. We show that our approach is effective on standard benchmarks (coal-mining disasters) as well as a much larger dataset (airline accidents). The structure of the paper is as follows. Sec. 2 covers the necessary background and notation related to GPs and state space solutions. Sec. 3 leverages the idea of steady-state filtering to derive IHGP. Sec. 4 illustrates the approach on several problems, and the supplementary material contains additional examples and a nomenclature for easier reading. Code implementations in MATLAB/C++/Objective-C and video examples of real-time operation are available at https://github.com/Aalto ML/IHGP. 2 Background We are concerned with GP models [25] admitting the form: f(t) GP(µ(t), κ(t, t )) and y | f Qn i=1 p(yi | f(ti)), where the data D = {(ti, yi)}n i=1 are input output pairs, µ(t) the mean function, and κ(t, t ) the covariance function of the GP prior. The likelihood factorizes over the observations. This family covers many standard modelling problems, including regression and classification tasks. Without loss of generality, we present the methodology for zero-mean (µ(t) := 0) GP priors. We approximate posteriors of the form (see [24] for an overview): q(f | D) = N(f | Kα, (K 1 + W) 1), (1) where Ki,j = κ(ti, tj) is the prior covariance matrix, α Rn, and the (likelihood precision) matrix is diagonal, W = diag(w). Elements of w Rn are non negative for log-concave likelihoods. The predictive mean and marginal variance for a test input t is µf, = k T α and σ2 f, = k k T (K + W 1) 1k . A probabilistic way of learning the hyperparameters θ of the covariance function (such as magnitude and scale) and the likelihood model (such as noise scale) is by maximizing the (log) marginal likelihood function p(y | θ) [25]. Numerous methods have been proposed for dealing with the prohibitive computational complexity of the matrix inverse in dealing with the latent function in Eq. (1). While general-purpose methods such as inducing input [4, 23, 30, 33], basis function projection [11, 17, 32], interpolation approaches [37], or stochastic approximations [10, 14] do not pose restrictions to the input dimensionality, they scale poorly in long time-series models by still needing to fill the extending domain (see discussion in [3]). For certain problems tree-structured approximations [3] or band-structured matrices can be leveraged. However, [8, 22, 26, 29] have shown that for one-dimensional GPs with high-order Markovian structure, an optimal representation (without approximations) is rewriting the GP in terms of a state space model and solving inference in linear time by sequential Kalman filtering methods. We will therefore focus on building upon the state space methodology. 2.1 State space GPs In one-dimensional GPs (time-series) the data points feature the special property of having a natural ordering. If the GP prior itself admits a Markovian structure, the GP model can be reformulated as a state space model. Recent work has focused on showing how many widely used covariance function can be either exactly (e.g., the half-integer Matérn class, polynomial, noise, constant) or approximately (e.g., the squared-exponential/RBF, rational quadratic, periodic, etc.) converted into state space models. In continuous time, a simple dynamical system able to represent these covariance functions is given by the following linear time-invariant stochastic differential equation (see [28]): f(t) = F f(t) + L w(t), yi p(yi | h T f(ti)), (2) where w(t) is an s-dimensional white noise process, and F Rm m, L Rm s, h Rm 1 are the feedback, noise effect, and measurement matrices, respectively. The driving process w(t) Rs is a multivariate white noise process with spectral density matrix Qc Rs s. The initial state is distributed according to f0 N(0, P0). For discrete input values ti, this translates into fi N(Ai 1fi 1, Qi 1), yi p(yi | h Tfi), (3) with f0 N(0, P0). The discrete-time dynamical model is solved through a matrix exponential Ai = exp(F ti), where ti = ti+1 ti 0. For stationary covariance functions, κ(t, t ) = κ(t t ), the process noise covariance is given by Qi = P Ai P AT i . The stationary state (corresponding to the initial state P0) is distributed by f N(0, P ) and the stationary covariance can be found by solving the Lyapunov equation P = F P + P FT + L Qc LT = 0. Appendix B shows an example of representing the Matérn (ν = 3/2) covariance function as a state space model. Other covariance functions have been listed in [31]. 2.2 Bayesian filtering The closed-form solution to the linear Bayesian filtering problem Eq. (3) with a Gaussian likelihood N(yi | h Tfi, σ2 n) is known as the Kalman filter [27]. The interest is in the following marginal distributions: p(fi | y1:i 1) = N(fi | mp i , Pp i ) (predictive distribution), p(fi | y1:i) = N(fi | mf i, Pf i) (filtering distribution), and p(yi | y1:i 1) = N(yi | vi, si) (decomposed marginal likelihood). The predictive state mean and covariance are given by mp i = Ai mf i 1 and Pp i = Ai Pf i 1 AT i + Qi. The so called innovation mean and variances vi and si are vi = yi h Tmp i and si = h TPp i h + σ2 n. (4) The log marginal likelihood can be evaluated during the filter update steps by log p(y) = Pn i=1 1 2(log 2πsi + v2 i /si). The filter mean and covariances are given by ki = Pp i h/si, mf i = mp i 1 + ki vi, Pf i = Pp i ki h TPp i , (5) where ki Rm represents the filter gain term. In batch inference, we are actually interested in the so called smoothing solution, p(f | D) corresponding to marginals p(fi | y1:n) = N(fi | ms i, Ps i). The smoother mean and covariance is solved by the backward recursion, from i = n 1 backwards to 1: ms i = mf i + Gi (ms i+1 mp i+1), Ps i = Pf i + Gi (Ps i+1 Pp i+1) GT i , (6) where Gi = Pf i AT i+1 [Pp i+1] 1 is the smoother gain at ti. The computational complexity is clearly linear in the number of data n (recursion repetitions), and cubic in the state dimension m due to matrix matrix multiplications, and the matrix inverse in calculation of Gi. 3 Infinite-horizon Gaussian processes We now tackle the cubic computational complexity in the state dimensionality by seeking infinitehorizon approximations to the Gaussian process. In Sec. 3.1 we revisit traditional steady-state Kalman filtering (for Gaussian likelihood, equidistant data) from quadratic filter design (see, e.g., [18] and [7] for an introduction), and extend it to provide approximations to the marginal likelihood and its gradients. Finally, we present an infinite-horizon framework for non-Gaussian likelihoods. 10 2 10 1 100 101 102 103 10 4 10 2 100 Likelihood variance, γ Elements of diag(Pp) 10 2 10 1 100 101 102 103 0 0.1 0.2 0.3 Likelihood variance, γ Figure 2: (Left) Interpolation of Pp (dots solved, solid interpolated). The dashed lines show elements in P (prior stationary state covariance). (Right) The Kalman gain k evaluated for the Pps. 3.1 Steady-state Kalman filter for t In steady-state Kalman filtering (see [7], Ch. 8.4, or [1], Ch. 4, for the traditional perspective) we assume t ℓeff, where ℓeffis the longest time scale in the covariance function, and equidistant observations in time (Ai := A and Qi := Q). After several ℓeff(as t ), the filter gain converges to the stationary limiting Kalman filter gain k. The resulting filter becomes time-invariant, which introduces approximation errors near the boundaries (cf. Fig. 1). In practice, we seek a stationary filter state covariance (corresponding to the stationary Kalman gain) ˆPf. Solving for this matrix thus corresponds to seeking a covariance that is equal between two consecutive filter recursions. Directly from the Kalman filtering forward prediction and update steps (in Eq. 5), we recover the recursion (by dropping dependency on the time step): ˆPp = A ˆPp AT A ˆPp h (h T ˆPp h + σ2 n) 1 h T ˆPp AT + Q. (7) This equation is of the form of a discrete algebraic Riccati equation (DARE, see, e.g., [15]), which is a type of nonlinear matrix equation that often arises in the context of infinite-horizon optimal control problems. Since σ2 n > 0, Q is P.S.D., and the associated state space model being both stabilizable and observable, the DARE has a unique stabilising solution for ˆPp that can be found either by iterating the Riccati equation or by matrix decompositions. The Schur method by Laub [16] solves the DARE in O(m3), is numerically stable, and widely available in matrix libraries (Python scipy.linalg.solve_discrete_are, MATLAB Control System Toolbox DARE, see also SLICOT routine SB02OD). The corresponding stationary gain is k = ˆPp h/(h T ˆPp h + σ2 n). Re-deriving the filter recursion with the stationary gain gives a simplified iteration for the filter mean (the covariance is now time-invariant): ˆmf i = (A k h TA) ˆmf i 1 + k yi and ˆPf = ˆPp k h T ˆPp, (8) for all i = 1, 2, . . . , n. This recursive iteration has a computational cost associated with one m m matrix vector multiplication, so the overall computational cost for the forward iteration is O(n m2) (as opposed to the O(n m3) in the Kalman filter). Marginal likelihood evaluation: The approximative log marginal likelihood comes out as a byproduct of the filter forward recursion: log p(y) n 2 log 2πˆs Pn i=1 ˆv2 i /(2 ˆs), where the stationary innovation covariance is given by ˆs = h T ˆPp h+σ2 n and the innovation mean by ˆvi = yi h TA ˆmf i 1. Steady-state backward pass: To obtain the complete infinite-horizon solution, we formally derive the solution corresponding to the smoothing distribution p(fi | y1:n) N(fi | ˆms i, ˆPs), where ˆP is the stationary state covariance. Establishing the backward recursion does not require taking any additional limits, as the smoother gain is only a function of consecutive filtering steps. Re-deriving the backward pass in Equation (6) gives the time-invariant smoother gain and posterior state covariance G = ˆPf AT [A ˆPf AT + Q] 1 and ˆPs = G ˆPs GT + ˆPf G (A ˆPf AT + Q) GT, (9) where ˆPs is implicitly defined in terms of the solution to a DARE. The backward iteration for the state mean: ˆms i = ˆmf i + G ( ˆms i+1 A ˆmf i). Even this recursion scales as O(n m2). Algorithm 1 Infinite-horizon Gaussian process (IHGP) inference. The GP prior is specified in terms of a state space model. After the setup cost on line 2, all operations are at most O(m2). 1: Input: {yi}, {A, Q, h, P0}, p(y | f) targets, model, likelihood 2: Set up Pp(γ), Ps(γ), and G(γ) for γ1:K solve DAREs for a set of likelihood variances, cost O(K m3) 3: mf 0 0; Pp 0 P0; γ0 = initialize 4: for i = 1 to n do 5: Evaluate Pp i Pp(γi 1) find predictive covariance 6: µf,i h TA mf i 1; σ2 f,i = h TPp i h latent 7: if Gaussian likelihood then 8: ηi yi; γi σ2 n,i if σ2 n,i := σ2 n, ki and Pf i become time-invariant 9: else 10: Match exp(νi fi τi f 2 i /2) N(fi | µf,i, σ2 f,i) mom = p(yi | fi) N(fi | µf,i, σ2 f,i) match moments 11: ηi νi/τi; γi τ 1 i equivalent update 12: end if 13: ki Pp i h/( σ2 f,i + γi) gain 14: mf i (A ki h TA) mf i 1 + ki ηi; Pf i Pp i ki γi k T i mean and covariance 15: end for 16: ms n mf n; Ps n Ps(γn) initialize backward pass 17: for i = n 1 to 1 do 18: ms i mf i + G(γi) (ms i+1 A mf i); Ps i Ps(γi) mean and covariance 19: end for 20: Return: µf,i = h Tms i, σ2 f,i = h TPs i h, log p(y) mean, variance, evidence 3.2 Infinite-horizon GPs for general likelihoods In IHGP, instead of using the true predictive covariance for propagation, we use the one obtained from the stationary state of a system with measurement noise fixed to the current measurement noise and regular spacing. The Kalman filter iterations can be used in solving approximate posteriors for models with general likelihoods in form of Eq. (1) by manipulating the innovation vi and si (see [22]). We derive a generalization of the steady-state iteration allowing for time-dependent measurement noise and non-Gaussian likelihoods. We re-formulate the DARE in Eq. (7) as an implicit function ˆPp : R+ Rm m of the likelihood variance, measurement noise , γ R+: Pp(γ) = A Pp(γ) AT A Pp(γ) h (h TPp(γ) h + γ) 1 h TPp(γ) AT + Q. (10) The elements in Pp are smooth functions in γ, and we set up an interpolation scheme inspired by Wilson and Nickisch [37] who use cubic convolutional interpolation [13] in their KISS-GP framework over a log-spaced one-dimensional grid of K points in γ for evaluation of ˆPp(γ). Fig. 2 shows results of K = 32 grid points (as dots) over γ = 10 2, . . . , 103 (this grid is used throughout the experiments). In the limit of γ the measurement has no effect, and the predictive covariance returns to the stationary covariance of the GP prior (dashed). Similarly, the corresponding gain terms k show the gains going to zero in the same limit. We set up a similar interpolation scheme for evaluating G(γ) and Ps(γ) following Eq. (9). Now, solving the DAREs and the smoother gain has been replaced by computationally cheap (one-dimensional) kernel interpolation. Alg. 1 presents the recursion in IHGP inference by considering a locally steady-state GP model derived from the previous section. As can be seen in Sec. 3.1, the predictive state on step i only depends on γi 1. For non-Gaussian inference we set up an EP [5, 12, 19] scheme which only requires one forward pass (assumed density filtering, see also unscented filtering [27]), and is thus well suited for streaming applications. We match the first two moments of p(yi | fi) and exp(τ fi ν f 2 i /2) w.r.t. latent values N(fi | µf,i, σ2 f,i) (denoted by mom = , implemented by quadrature). The steps of the backward pass are also only dependent on the local steady-state model, thus evaluated in terms of γis. Missing observations correspond to γi = , and the model could be generalized to non-equidistant time sampling by the scheme in Nickisch et al. [22] for calculating A( ti) and Q( ti). Table 1: Mean absolute error of IHGP w.r.t. SS, negative log-likelihoods, and running times. Mean over 10 repetitions reported; n = 1000. Regression Count data Classification Likelihood Gaussian Poisson Logit Probit MAE E[f(t )] 0.0095 0.0415 0.0741 0.0351 MAE V[f(t )] 0.0008 0.0024 0.0115 0.0079 NLL-FULL 1452.5 2645.5 618.9 614.4 NLL-SS 1452.5 2693.5 617.5 613.9 NLL-IHGP 1456.0 2699.3 625.1 618.2 tfull 0.18 s 6.17 s 11.78 s 9.93 s tss 0.04 s 0.13 s 0.13 s 0.11 s t IHGP 0.01 s 0.14 s 0.13 s 0.10 s 0 20 40 60 80 100 State dimensionality, m Running time (seconds) State space IHGP Figure 3: Empirical running time comparison for GP regression on n = 10,000 data points. Maximum RMSE in IHGP E[f(t )] < 0.001. 3.3 Online hyperparameter estimation Even though IHGP can be used in a batch setting, it is especially well suited for continuous data streams. In such applications, it is not practical to require several iterations over the data for optimising the hyperparameters as new data would arrive before the optimisation terminates. We propose a practical extension of IHGP for online estimation of hyperparameters θ by leveraging that (i) new batches of data are guaranteed to be available from the stream, (ii) IHGP only requires seeing each data point once for evaluating the marginal likelihood and its gradient, (iii) data can be non-stationary, requiring the hyperparameters to adapt. We formulate the hyperparameter optimisation problem as an incremental gradient descent (e.g., [2]) resembling stochastic gradient descent, but without the assumption of finding a stationary optimum. Starting from some initial set of hyperparameters θ0, for each new (mini) batch j of data y(j) in a window of size nmb, iterate θj = θj 1 + η log p(y(j) | θj 1), (11) where η is a learning-rate (step-size) parameter, and the gradient of the marginal likelihood is evaluated by the IHGP forward recursion. In a vanilla GP the windowing would introduce boundary effect due to growing marginal variance towards the boundaries, while in IHGP no edge effects are present as the data stream is seen to continue beyond any boundaries (cf. Fig. 1). 4 Experiments We provide extensive evaluation of the IHGP both in terms of simulated benchmarks and four real-world experiments in batch and online modes. 1860 1880 1900 1920 1940 1960 Time (years) Accident intensity, λ(t) Full (mean) Full (90% quantiles) State space (a) Intensity (EP vs. ADF) 1860 1880 1900 1920 1940 1960 Time (years) Accident intensity, λ(t) IHGP (mean) IHGP (90% quantiles) State space (b) Intensity (state space vs. infinite-horizon) Figure 4: A small-scale comparison study on the coal mining accident data (191 accidents in n = 200 bins). The data set is sufficiently small that full EP with naïve handling of the latent function can be conducted. Full EP is shown to work similarly as ADF (single-sweep EP) by state space modelling. We then compare ADF on state space (exact handling of the latent function) to ADF with the IHGP. Figures below decompose the intensity into components: log λ(t) = ftrend(t) + fyear(t) + fweek(t) 1920 1930 1940 1950 1960 1970 1980 1990 2000 2010 0 10 20 30 40 Time (years) Accident intensity, λ(t) 1920 1940 1960 1980 2000 (arb. unit) J F M A M J J A S O N D 1920 1960 2000 Sun Mon Tue Wed Thu Fri Sat 1920 1960 2000 Day-of-week Figure 5: Explanatory analysis of the aircraft accident data set (1210 accidents predicted in n = 35,959 daily bins) between years 1919 2018 by a log-Gaussian Cox process (Poisson likelihood). 4.1 Experimental validation In the toy examples, the data were simulated from yi = sinc(xi 6) + εi, εi N(0, 0.1) (see Fig. 1 for a visualization). The same function with thresholding was used in the classification examples in the Appendix. Table 1 shows comparisons for different log-concave likelihoods over a simulated data set with n = 1000. Example functions can be seen in Fig. 1 and Appendix E. The results are shown for a Matérn (ν = 3/2) with a full GP (naïve handling of latent, full EP as in [24]), state space (SS, exact state space model, ADF as in [22]), and IHGP. With m only 2, IHGP is not faster than SS, but approximation errors remain small. Fig. 3 shows experimental results for the computational benefits in a regression study, with state dimensionality m = 2, . . . , 100. Experiments run in Mathworks MATLAB (R2017b) on an Apple Mac Book Pro (2.3 GHz Intel Core i5, 16 Gb RAM). Both methods have linear time complexity in the number of data points, so the number of data points is fixed to n = 10,000. The GP prior is set up as an increasing-length sum of Matérn (ν = 3/2) kernels with different characteristic length-scales. The state space scheme follows O(m3) and IHGP is O(m2). 4.2 Log-Gaussian Cox processes A log Gaussian Cox process is an inhomogeneous Poisson process model for count data. The unknown intensity function λ(t) is modelled with a log-Gaussian process such that f(t) = log λ(t). The likelihood of the unknown function f is p({tj} | f) = exp( R exp(f(t)) dt + PN j=1 f(tj)). The likelihood requires non-trivial integration over the exponentiated GP, and thus instead the standard approach [20] is to consider locally constant intensity in subregions by discretising the interval into bins. This approximation corresponds to having a Poisson model for each bin. The likelihood becomes p({tj} | f) Qn i=1 Poisson(yi({tj}) | exp(f(ˆti))), where ˆti is the bin coordinate and yi the number of data points in it. This model reaches posterior consistency in the limit of bin width going to zero [34]. Thus it is expected that the accuracy improves with tighter binning. Coal mining disasters dataset: The data (available, e.g., in [35]) contain the dates of 191 coal mine explosions that killed ten or more people in Britain between years 1851 1962, which we discretize into n = 200 bins. We use a GP prior with a Matérn (ν = 5/2) covariance function that has an exact state space representation (state dimensionality m = 3) and thus no approximations regarding handling the latent are required. We optimise the characteristic length-scale and magnitude hyperparameters w.r.t. marginal likelihood in each model. Fig. 4 shows that full EP and state space ADF produce almost equivalent results, and IHGP ADF and state space ADF produce similar results. In IHGP the edge effects are clear around 1850 1860. (a) Holding in hand (c) Swinging (d) On table Figure 6: Screenshots of online adaptive IHGP running in real-time on an i Phone. The lower plot shows current hyperparameters (measurement noise is fixed to σ2 n = 1 for easier visualization) of the prior covariance function, with a trail of previous hyperparameters. The top part shows the last 2 seconds of accelerometer data (red), the GP mean, and 95% quantiles. The refresh rate for updating the hyperparameters and re-prediction is 10 Hz. Video examples are in the supplementary material. Airline accident dataset: As a more challenging regression problem we explain the time-dependent intensity of accidents and incidents of commercial aircraft. The data [22] consists of dates of 1210 incidents over the time-span of years 1919 2017. We use a bin width of one day and start from year 1900 ensure no edge effects (n = 43,099), and a prior covariance function (similar to [6, 36]) κ(t, t ) = κν=5/2 Mat. (t, t ) + κ1 year per (t, t ) κν=3/2 Mat. (t, t ) + κ1 week per (t, t ) κν=3/2 Mat. (t, t ) (12) capturing a trend, time-of-year variation (with decay), and day-of-week variation (with decay). This model has a state space representation of dimension m = 3 + 28 + 28 = 59. All hyperparameters (except time periods) were optimised w.r.t. marginal likelihood. Fig. 5 shows that we reproduce the time-of-year results from [22] and additionally recover a high-frequency time-of-week effect. 4.3 Electricity consumption We do explorative analysis of electricity consumption for one household [9] recorded every minute (in log k W) over 1,442 days (n = 2,075,259, with 25,979 missing observations). We assign the model a GP prior with a covariance function accounting for slow variation and daily periodicity (with decay). We fit a GP to the entire data with 2M data points by optimising the hyperparameters w.r.t. marginal likelihood (results shown in Appendix F) using BFGS. Total running time 624 s. The data is, however, inherently non-stationary due to the long time-horizon, where use of electricity has varied. We therefore also run IHGP online in a rolling-window of 10 days (nmb = 14,400, η = 0.001, window step size of 1 hr) and learn the hyperparameters online during the 34,348 incremental gradient steps (evaluation time per step 0.26 0.05 s). This leads to a non-stationary adaptive GP model which, e.g., learns to dampen the periodic component when the house is left vacant for days. Results shown in Appendix F in the supplement. 4.4 Real-time GPs for adaptive model fitting In the final experiment we implement the IHGP in C++ with wrappers in Objective-C for running as an app on an Apple i Phone 6s (i OS 11.3). We use the phone accelerometer x channel (sampled at 100 Hz) as an input and fit a GP to a window of 2 s with Gaussian likelihood and a Matérn (ν = 3/2) prior covariance function. We fix the measurement noise to σ2 n = 1 and use separate learning rates η = (0.1, 0.01) in online estimation of the magnitude scale and length-scale hyperparemeters. The GP is re-estimated every 0.1 s. Fig. 6 shows examples of various modes of data and how the GP has adapted to it. A video of the app in action is included in the web material together with the codes. 5 Discussion and conclusion We have presented Infinite-Horizon GPs, a novel approximation scheme for state space Gaussian processes, which reduces the time-complexity to O(m2n). There is a clear intuition to the approximation: As widely known, in GP regression the posterior marginal variance only depends on the distance between observations, and the likelihood variance. If both these are fixed, and t is larger than the largest length-scale in the prior, the posterior marginal variance reaches a stationary state. The intuition behind IHGP is that for every time instance, we adapt to the current likelihood variance, discard the Markov-trail, and start over by adapting to the current steady-state marginal posterior distribution. This approximation scheme is important especially in long (number of data in the thousands millions) or streaming (n growing without limit) data, and/or the GP prior has several components (m large). We showed examples of regression, count data, and classification tasks, and showed how IHGP can be used in interpreting non-stationary data streams both off-line (Sec. 4.3) and on-line (Sec. 4.4). Acknowledgments We thank the anonymous reviewers as well as Mark Rowland and Will Tebbutt for their comments on the manuscript. AS acknowledges funding from the Academy of Finland (grant number 308640). [1] B. D. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ, 1979. [2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Cambridge, MA, 1999. [3] T. D. Bui and R. E. Turner. Tree-structured Gaussian process approximations. In Advances in Neural Information Processing Systems (NIPS), pages 2213 2221, 2014. [4] T. D. Bui, J. Yan, and R. E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research (JMLR), 18(104):1 72, 2017. [5] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641 668, 2002. [6] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning (ICML), volume 28 of PMLR, pages 1166 1174, 2013. [7] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley & Sons, 2000. [8] J. Hartikainen and S. Särkkä. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 379 384, 2010. [9] G. Hébrail and A. Bérard. Individual household electric power consumption data set, 2012. URL https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Online: UCI Machine Learning Repository. [10] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), pages 282 290. AUAI Press, 2013. [11] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research (JMLR), 18(151):1 52, 2018. [12] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), pages 216 223. Morgan Kaufmann Publishers Inc., 2002. [13] R. G. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153 1160, 1981. [14] K. Krauth, E. V. Bonilla, K. Cutajar, and M. Filippone. Auto GP: Exploring the capabilities and limitations of Gaussian process models. In Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2017. [15] P. Lancaster and L. Rodman. Algebraic Riccati Equations. Clarendon Press, 1995. [16] A. Laub. A schur method for solving algebraic Riccati equations. IEEE Transactions on Automatic Control, 24(6):913 921, 1979. [17] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research (JMLR), 11:1865 1881, Jun 2010. [18] P. S. Maybeck. Stochastic Models, Estimation and Control, volume 1. Academic Press, New York, 1979. [19] T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), volume 17, pages 362 369, 2001. [20] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451 482, 1998. [21] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research (JMLR), 9(10):2035 2078, 2008. [22] H. Nickisch, A. Solin, and A. Grigorievskiy. State space Gaussian processes with non-Gaussian likelihood. In International Conference on Machine Learning (ICML), volume 80 of PMLR, pages 3789 3798, 2018. [23] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6(Dec):1939 1959, 2005. [24] C. E. Rasmussen and H. Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research (JMLR), 11:3011 3015, 2010. Software package: http://www.gaussianprocess. org/gpml/code. [25] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. [26] S. Reece and S. Roberts. An introduction to Gaussian processes for the Kalman filter expert. In Proceedings of the 13th Conference on Information Fusion (FUSION). IEEE, 2010. [27] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013. [28] S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, Cambridge, in press. [29] S. Särkkä, A. Solin, and J. Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine, 30(4):51 61, 2013. [30] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), pages 1257 1264. Curran Associates, Inc., 2006. [31] A. Solin. Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Doctoral dissertation, Aalto University, Helsinki, Finland, 2016. [32] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. ar Xiv preprint ar Xiv:1401.5508, 2014. [33] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of PMLR, pages 567 574, 2009. [34] S. T. Tokdar and J. K. Ghosh. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34 42, 2007. [35] J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research (JMLR), 14(Apr):1175 1179, 2013. Software package: http://research.cs.aalto.fi/pml/software/gpstuff. [36] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning (ICML), volume 28 of PMLR, pages 1067 1075, 2013. [37] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), volume 37 of PMLR, pages 1775 1784, 2015.