# a_functional_dynamic_boltzmann_machine__6299d22c.pdf

A Functional Dynamic Boltzmann Machine

Hiroshi Kajino IBM Research - Tokyo KAJINO@jp.ibm.com

Dynamic Boltzmann machines (Dy BMs) are recently developed generative models of a time series. They are designed to learn a time series by efﬁcient online learning algorithms, whilst taking long-term dependencies into account with help of eligibility traces, recursively updatable memory units storing descriptive statistics of all the past data. The current Dy BMs assume a ﬁnitedimensional time series and cannot be applied to a functional time series, in which the dimension goes to inﬁnity (e.g., spatiotemporal data on a continuous space). In this paper, we present a functional dynamic Boltzmann machine (F-Dy BM) as a generative model of a functional time series. A technical challenge is to devise an online learning algorithm with which F-Dy BM, consisting of functions and integrals, can learn a functional time series using only ﬁnite observations of it. We rise to the above challenge by combining a kernel-based function approximation method along with a statistical interpolation method and ﬁnally derive closed-form update rules. We design numerical experiments to empirically conﬁrm the effectiveness of our solutions. The experimental results demonstrate consistent error reductions as compared to baseline methods, from which we conclude the effectiveness of F-Dy BM for functional time series prediction.

1 Introduction This work is concerned with learning a time series for forecasting future events. In particular, we are focusing on a light-weight model that can be trained online while preserving the predictive performance as much as possible. Aside from recent high-performance but complex models like a family of recurrent neural networks (RNNs) [Rumelhart et al., 1986; Hochreiter and Schmidhuber, 1997; Sundermeyer et al., 2012], light-weight models are required when training and predicting on devices with low computational power such as mobile and Io T devices and when dealing with massive amount of stream data reported from a number of sensors. In this light, we focus on a family of vector autoregressive models (VARs) [L utkepohl, 2005], and above all,

Figure 1: Illustration of F-Dy BM, modeling a functional pattern f [t](x) deﬁned on a two dimensional space. Heat maps represent functional patterns. The current pattern depends on ﬁve past patterns and two eligibility traces (which summarize all the past patterns) through weight functions w[δ] and ul, respectively.

one of its state-of-the-art variants called dynamic Boltzmann machines (Dy BMs) [Osogami and Otsuka, 2015; Osogami, 2016; Dasgupta and Osogami, 2017]. Dy BMs are recently emerging generative models of a binary/real-valued multi-dimensional time series. One of their essential characteristics is a recursively updatable memory unit summarizing all the past data, which is dubbed as an eligibility trace. Its recursive update rule enables us to develop online learning algorithms for Dy BMs while capturing long-term dependencies of a time series to increase its predictive ability. Osogami [2016] reported up to 20-30% performance gain by eligibility traces. Therefore, we employ Dy BMs as a time-series modeling framework. In this paper, we present a new variant of Dy BMs called a functional dynamic Boltzmann machine (F-Dy BM), which is able to handle a partially-observable functional time series, where at each discrete time step t Z, ﬁnite evaluations of a function f [t](x): X R are given. Our F-Dy BM is mainly motivated by spatiotemporal data. Assume that a functional time series is a spatiotemporal time series collected from mo-

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

bile devices, where f [t](x) corresponds to a temperature, water quality, or air quality observed at location x X and time step t. This example implies two properties that cannot be handled by the existing Dy BMs designed for a vectorial time series. First, it is required to forecast a future value at any location x in a continuous space X, rather than ﬁnite ﬁxed locations as in the case with a vectorial time series, in which each dimension corresponds to each ﬁxed location. The existing Dy BMs are not able to forecast values at inﬁnite locations by themselves. Second, observational points can move and may appear and disappear abruptly, and in some cases, the identities of observations are lost due to privacy concerns, inhibiting us from constructing a vectorial time series. These two properties clearly highlight the essential difference between a vectorial time series and a functional one. To this end, we develop F-Dy BM along with an online learning algorithm. The development of F-Dy BM constitutes of a modeling task and an algorithm implementation task. The model of F-Dy BM can be derived rather straightforwardly based on the Gaussian Dy BM (G-Dy BM) [Osogami, 2016; Dasgupta and Osogami, 2017]; replacing vectors with functions, weight matrices with weight functions, and matrix-vector multiplications with integrals. As a result, we obtain a model of a functional time series as depicted in Fig. 1. On contrary, a learning algorithm cannot be directly derived in the same way as modeling because of the following three technical challenges. First, neither a functional time series nor weight functions can be represented in a computer having only a ﬁnite memory. Second, the model now involves integrals, which in general cannot be efﬁciently computed. Third, typically, only ﬁnite observations of a functional pattern are available at each time, which breaks the fully-observable assumption in G-Dy BM Our idea to overcome these difﬁculties is twofold. The ﬁrst idea, addressing the ﬁrst and second challenges, is to model weight functions by ﬁnite kernel-based basis functions. Then, we are able to represent model parameters by ﬁnite parameters, and furthermore, the kernel trick allows us to analytically compute the integrals, ﬁnally resulting in closed-form learning rules. Second, to address the third challenge, we employ a Gaussian process as a core component of our model, and estimate a functional pattern from ﬁnite observations by the maximum-a-posteriori (MAP) estimation. These ideas successfully lead to an online learning algorithm for F-Dy BM. The effectiveness of F-Dy BM is empirically demonstrated using ﬁve real spatiotemporal data sets. As we will discuss in the related work section, F-Dy BM can be interpreted as an extension of VAR and functional autoregression (FAR) [Bosq, 2000] as well as G-Dy BM; eligibility traces mainly differentiate F-Dy BM from FAR and G-Dy BM from VAR, and rigorous modeling of a functional time series differentiates FDy BM from G-Dy BM and FAR from VAR. Therefore, we design the experiment to validate the contribution of each of these innovations. The experimental results indicate that adding eligibility traces decreased the error by 12% on average, and the function-based modeling decreased the error by 11.7% on average as compared to a heuristic application of vector-based models. Hence, we conclude that F-Dy BM achieves substantial performance improvement because of the two features.

Notation. We employ the following mathematical conventions. For a matrix X = [x1 . . . x N] RN D

and a function f : RD R, we deﬁne f(X) := [f(x1) . . . f(x N)] RN. For matrices X and Y = [y1 . . . y M] RM D and a function K : RD RD R, we deﬁne K(X, Y ) as an N M matrix whose (n, m)- element corresponds to K(xn, ym). Let us denote a value associated with time t by f [t]. A sequence of values from time to t 1 (including t 1) is denoted by f [<t].

2 Problem Setting

We assume that observations are carried out in X (e.g., the Euclidean space in the case of a standard spatiotemporal data). Let us denote a signal observed at x X at time t Z by f [t](x) R. We call a function f [t] as a pattern. Our problem setting is that, at every time t, we are given ﬁnite observations of a pattern f [t], and are willing to predict the next pattern f [t+1]. Let us denote the observational points at time t

by X[t] = h x[t] 1 . . . x[t] N [t] i . The observational points as

well as the number of them, N [t], can be time-variant.

3 Gaussian Dynamic Boltzmann Machine

This section brieﬂy reviews a Gaussian dynamic Boltzmann machine (G-Dy BM) [Osogami, 2016; Dasgupta and Osogami, 2017], which extends the original binary-valued Dy BM [Osogami and Otsuka, 2015] to handle real values. G-Dy BM models an N-dimensional time series by N neural units. The n-th unit (n = 1, . . . , N) outputs a signal f [t] n R at time t. We call the set of signals emitted from all the neurons a pattern, which is denoted by f [t] = h f [t] 1 . . . f [t] N

i . G-Dy BM models the dynamics of the pattern as follows:

p(f [t] | f [<t]) =

n=1 N(f [t] n ; µ[t] n , σ2 n), (1)

δ=1 W [δ]f [t δ] +

l=1 Ulα[t 1] l , (2)

δ=1 λδ 1 l f [t δ] (l = 1, . . . , L), (3)

where parameters are b, {W [δ]}d δ=1, and {Ul}L l=1, and hyperparameters are d, L, {σn}N n=1, and {λl}L l=1.1 The pattern at time t is dependent on all the past patterns through the mean

µ[t] = h µ[t] 1 . . . µ[t] N

i (Eq. (1)). The mean at time t is calculated using two types of memory units (Eq. (2)). One stores the recent d patterns {f [t δ]}d δ=1, and the other called an eligibility trace, α[t 1] l (l = 1, . . . , L) summarizes all the past patterns (Eq. (3)).

1We ﬁx σn = 1 for all n.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

4 Functional Dynamic Boltzmann Machines

We present a functional dynamic Boltzmann machine (FDy BM), which models the dynamics of a functional pattern emitted from an inﬁnite number of neurons distributed in X. We ﬁrst present the model of F-Dy BM, and then, we develop an online learning algorithm, which trains F-Dy BM from ﬁnite observations of a functional pattern and enables us to forecast the next functional pattern.

We assume that each neuron is associated with location x X, instead of index n. Let us denote a pattern at time t by f [t] : X R. f [t](x) represents a signal of a neuron at x and time t. We derive F-Dy BM by extending Eqs. (1), (2), and (3) to Eqs. (4), (5), and (6) respectively as follows. Equation (1) is reﬁned by substituting a Gaussian process for the Gaussian distribution:

p(f [t] | f [<t]) = GP(f [t]; µ[t], Kσ2), (4)

where f [t] is a functional pattern, µ[t] : X R is the mean function at time t, which is dependent on all the past patterns f [<t], and Kσ2 : X X R is a covariance function.2 Equations (2) and (3) are reﬁned as follows:

µ[t](x) =b(x) +

X w[δ](x, x )f [t δ](x )dx

X ul(x, x )α[t 1] l (x )dx ,

α[t 1] l (x) =

δ=1 λδ 1 l f [t δ](x), (6)

where we substitute weight functions for weight matrices and integrals for matrix multiplications. Model parameters are b(x), {w[δ](x, x )}d δ=1, and {ul(x, x )}L l=1, and hyperparameters are d, L, Kσ2, and {λl}L l=1.

4.2 Online Learning Algorithm

So far, we have successfully compiled G-Dy BM for a functional pattern. However, as discussed in the introduction, there still remain the following three technical challenges to deploy F-Dy BM into the real world:

(i) The model involves functional parameters, b, w[δ], and ul, as well as functional patterns f [t](x), which cannot be stored using ﬁnite memory.

(ii) Equation (5) involves integrals of the weight functions and previous patterns. In general, numerical integration is an error-prone and time-consuming procedure, and should be avoided whenever possible.

(iii) Our problem setting assumes that only ﬁnite observations of a pattern are available at each time step.

2We employ the following shorthand: Kσ2(x, x ) := K(x, x )+ σ2δx,x with denoting the Kronecker delta by δx,x .

Main Ideas We introduce two main ideas to address the three technical challenges raised above. The ﬁrst idea is to assume a particular parametric form for functional parameters using ﬁnite parameters along with the covariance function Kσ2 of the Gaussian process. Then, we can represent all of the functional parameters by ﬁnitedimensional vectors, and in addition, the kernel trick allows us to avoid numerical integration, addressing both challenges, (i) and (ii). In speciﬁc, weight functions, w[δ](x, x ) and ul(x, x ), and the bias function b(x) are assumed to have the following parametric forms:

w[δ](x, x ) = Kσ2(x, P)W [δ]Kσ2(P, x ) (δ = 1, . . . , d), (7)

ul(x, x ) = Kσ2(x, P)Ul Kσ2(P, x ) (l = 1, . . . , L), (8) b(x) = Kσ2(x, P)b, (9)

where P = [p1 . . . p N] is a set of points in X, and W [δ], Ul RN M and b RN are parameters. Then, by substituting Eqs. (7), (8), and (9) for Eq. (5), we obtain

µ[t](x) = Kσ2(x, P)

δ=1 W [δ]f [t δ](P)

l=1 Ulα[t 1] l (P)

where the integrals can be analytically computed via the kernel trick because f [t δ](x) and α[t 1] l (x) belong to the reproducing Hilbert kernel space of Kσ2( , ) (see Eq. (4)). Thus, the two challenges (i) and (ii) are addressed. Another beneﬁt of the resultant model is that it is sufﬁcient to memorize patterns at the ﬁxed set of points, P, for computing Eq. (10), rather than the time-varying points X[t]. This is especially signiﬁcant when computing the eligibility trace using ﬁxed-sized memory. In fact, the eligibility trace α[t 1] l (P) can be deﬁned in a similar way to Eq. (3):

α[t 1] l (P) =

δ=1 λδ 1 l f [t δ](P), (11)

which is also recursively updatable. The second idea addressing the third challenge (iii), is to replace the functional patterns appearing in Eqs. (10) and (11) with their estimators. At time t, given f [t](X[t]) and µ[t](x), the MAP estimator of f [t](P) is obtained as

ˆf [t](P) = µ[t](P) + K(P, X[t])Kσ2(X[t], X[t]) 1d[t](X[t]), (12)

where we denote d[t](x) := f [t](x) µ[t](x). At each time t, we compute the MAP estimator ˆf [t](P) as Eq. (12) and use it as a proxy for f [t](P) in Eq. (10) from time t+1 onward. We choose to impute unobserved values by their estimators progressively, instead of relying on EM-based algorithms (e.g., [Shumway and Stoffer, 1982]), in order to keep the online algorithm as efﬁcient as possible.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Algorithm 1 An online learning algorithm for F-Dy BM.

1: for t = 0, 1, 2, . . . do 2: Predict the pattern at time t using µ[t](x) as Eq. (10). 3: Observe the pattern at time t, (X[t], f [t](X[t])). 4: Update parameters as Eqs. (13) (16). 5: Compute ˆf [t](P) as Eq. (12). 6: Update Q and E using ˆf [t](P) as Eq. (17). 7: end for

Implementation Putting everything together, we derive an online learning algorithm for F-Dy BM (Algorithm 1). It learns Θ = {b} {W [δ]}d δ=1 {Ul}L l=1 in an online way, and enables us to predict the next unseen pattern. Its input is hyperparameters, d, L, P RN D, Kσ2, and {λl}L l=1. First, two memory units, Q and E, and the model parameters Θ are initialized by ﬁlling 0, where Q is a queue storing { ˆf [t δ](P)}d δ=1, and E is a set of eligibility traces {α[t 1] l (P)}L l=1. For each time t = 0, 1, 2, . . . , we ﬁrst predict the next pattern as Eq. (10). Then, we observe (X[t], f [t](X[t])) and update Θ as

Θ Θ + η[t] L[t](Θ)

where η[t] is a learning rate, and L[t](Θ) is a conditional loglikelihood of the current observations, which is given by

L[t](Θ) := log p(f [t](X[t]) | f [<t]; Θ)

2d[t](X[t]) Kσ2(X[t], X[t]) 1d[t](X[t]) + C,

where C is a constant independent of Θ. The gradients of L[t](Θ) with respect to the model parameters are,

b L[t](Θ) = K(P, X[t])Kσ2(X[t], X[t]) 1d[t](X[t]),

W [δ] L[t](Θ) =

b L[t](Θ) ˆf [t δ](P) , (15)

Ul L[t](Θ) =

b L[t](Θ) α[t 1] l (P) . (16)

Finally, we update the memory units Q and E; ˆf [t](P) is computed as Eq. (12) and is enqueued to Q, and E is updated as

α[t] l (P) = λlα[t 1] l (P) + ˆf [t](P) (l = 1, . . . , L). (17)

5 Related Work As we have shown in the previous section, F-Dy BM is derived from G-Dy BM by taking the number of neurons to inﬁnity. Besides, Osogami [2016] and Dasgupta and Osogami [2017] discussed a close connection between G-Dy BM and VAR, which has been extensively used in econometrics. Therefore, we spare this section to discuss the relationships among a family of Dy BMs and VARs, as illustrated in Fig. 2. First, we discuss the relationships among VAR, G-Dy BM, and F-Dy BM. The relationship between F-Dy BM and GDy BM has been discussed so far; F-Dy BM extends G-Dy BM

Gaussian Dy BM

Vector Autoregression Functional Autoregression

Functional Dy BM

+ Eligibility traces + Online learning algorithm + Eligibility traces + Online learning algorithm

Figure 2: Relationship diagram among VAR, FAR, G-Dy BM, and F-Dy BM.

by increasing the number of neurons to inﬁnity. The relationship between G-Dy BM and VAR was discussed by Osogami [2016]; G-Dy BM extends VAR by adding eligibility traces. In fact, VAR(d), a VAR model of the d-th order can be derived by setting Ul = O (l = 1, . . . , L) in Eq. (2). The extension from VAR to G-Dy BM complies with the design principle of Dy BMs to learn a time series online, because the recursive update rule of eligibility traces enables us to equip G-Dy BM with an efﬁcient online learning algorithm. This research direction to extend the model while keeping its learning algorithm online is unique in the research domain of VARs, because most of the existing VARs do not focus on the online setting. This direction is also unique in neural network modeling for time-series data. In the same way as Dy BM, recurrent neural networks have a memory unit (i.e., a hidden unit) h[t] that summarizes all the historical patterns by recurrent connection deﬁned as,

h[t] = σ Whf [t] + Uhh[t 1] + bh ,

in the case of Elman network [Elman, 1990]. However, in order to train the parameters Wh, Uh, and bh, backpropagationthrough-time is required for gradient computation, and forward computation of h[t] from the beginning of a time series is also required for prediciton whenever the parameters are updated. Thus, most of the learning algorithms for RNNs are typically not suitable to the online setting as compared with those for Dy BMs. Note that the only exception is a family of echo-state networks [Jaeger and Haas, 2004], where the parameters above are ﬁxed, and only the connection from h[t] to the next pattern f [t+1] is learned, resulting in a simple linear regression task. Dasgupta and Osogami [2017] augment GDy BM by an echo-state network, and the resultant learning algorithm is almost as efﬁcient as the original G-Dy BM. Then, we review functional autoregression (FAR) [Bosq, 2000], and discuss its relationship with other models. FAR extends VAR by increasing the dimension of the vector space to inﬁnity in order to model a functional time series. In a similar way to G-Dy BM and VAR, we can show that F-Dy BM extends FAR by attaching eligibility traces. Besides, F-Dy BM is different from FAR in its application as well as the algorithmic aspects. While F-Dy BM considers X as a multidimensional space in general, FAR has been mostly applied to a one-dimensional case, i.e., X R, such as a periodic continuous time series. For example, suppose we have a continuous time series g(x) (x R) that has period 1, where x represents time. Then, for each time step t N, a functional time series f [t](x) is deﬁned as f [t](x) = g(t + x) (x [0, 1)). Such applications of FAR range over climate [Besse

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

et al., 2000], mortality and fertility rates [Hyndman and Ullah, 2007], the concentration of ozone [Damon and Guillas, 2002], and Eurodollar futures rates [Kargin and Onatski, 2008] to name a few. As for a learning algorithm, Didericksen et al. [2012] reviewed and compared two popular methods for training FAR. The ﬁrst one is referred to as estimated kernel (EK), which expands and approximates weight functions using ﬁnite functional principle components of functional data with the largest eigenvalues [Ramsay and Silverman, 2005]. EK may be preferable to our method in that EK forms basis functions adaptively dedicated to the data set at hand (while ours pre-deﬁnes basis functions). However, we avoid from employing EK as a learning algorithm because EK cannot be easily applied to our online setting. The second one is referred to as predictive factors (PF) [Kargin and Onatski, 2008], which employs basis functions that are helpful for prediction rather than explaining the data as EK does. In a similar reason to EK, we do not apply PF to F-Dy BM.

6 Experiment

We empirically investigate the effectiveness of F-Dy BM. FDy BM is unique in that it has eligibility traces and it models a functional time series rather than a vectorial time series. Thus, we design an experiment to quantitatively evaluate the beneﬁt from these two features.

6.1 Methods Evaluated We evaluated the four methods descried in the related work: VAR, FAR, G-Dy BM, and F-Dy BM. VAR and FAR are equipped with online learning algorithms in the same way as G-Dy BM and F-Dy BM. Since VAR and G-Dy BM cannot deal with a functional time series as they are, we rely on the following wrapping method.

Wrapping method. The basic idea is to associate each unit of G-Dy BM or VAR with a point in X and to interpolate a value at any point via a Gaussian process. Let us assume that the n-th unit is associated with a point pn X, and let us denote the set of the points by P = [p1 . . . p N] . Let us assume that a function f [t] is distributed according to the Gaussian process GP(0, Kσ2). Then, given observations f [t](X[t]), the signals at P can be estimated as

ˆf [t](P) = K(P, X[t])Kσ2(X[t], X[t]) 1f [t](X[t]).

Then, we feed ˆf [t](P) to VAR or G-Dy BM for online learning. Prediction can be also implemented using the Gaussian process. Given a prediction by VAR or G-Dy BM µ[t], a prediction at any point x X can be estimated as

ˆf [t](x) = K(x, P)Kσ2(P, P) 1µ[t].

We chose these four methods because performance comparison between F-Dy BM and G-Dy BM and that between FAR and VAR allow us to quantify the contribution of our rigorous treatment of a functional pattern, and performance comparison between F-Dy BM and FAR and that between GDy BM and VAR reveal the contribution of eligibility traces.

6.2 Data Sets We employ ﬁve real data sets for performance evaluation. In speciﬁc, since our primary motivating example is spatiotemporal prediction, we evaluate the performance on the following two types of spatiotemporal data sets.

NOAA Global Surface Temperature V4.01 We use a temperature data set provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA. It contains the global temperature anomalies from January 1880 to present, resulting in the length 1638.3 The globe is spatially gridded by 5 5 , and a temperature anomaly at each location is reported. We selected 392 locations that observe the temperature without interruption. For each observational location, we deﬁne x R3 to be the location on the earth represented by a point on a unit ball in the orthogonal coordinate system.

Air Data In addition to the temperature data, we employ four air quality data sets retrieved from Air Data4. We used the daily summary data of criteria gases (CO, NO2, O3, and SO2) from 1990 to November 2016. Each data set contains daily concentration data of each criteria gas measured at outdoor monitors located in the United States. Since each gas has interrelated but its own generating mechanism, their concentration data have different dynamics. Therefore, it is valuable to evaluate the predictive performance on each of the four data sets. We preprocessed the data sets as follows. For each data set, we extracted the data summarized by its primary standard of National Ambient Air Quality Standards. Then, we chose records with Event Type = None to exclude anomaly records caused by wildﬁre for example. After that, we dropped multiple records at the same day and the same location by selecting the ﬁrst record. Finally, we converted each concentration measurement c R+ to log(c+10 5), and regarded it as f [t](x), where t corresponds to a day, and x R3 is deﬁned in the same way as the NOAA data set. Each data set has the same length 9,831 and time-variant numbers of observational points N [t]. The means and standard deviations of N [t] of CO, NO2, O3, and SO2 are 401.3 87.6, 371.0 37.8, 870.0 260.9, and 540.7 107.5, respectively.

6.3 Protocol and Conﬁguration We design the following experimental protocol for performance evaluation. Assume that we have a time series and multiple instances of each model that have different sets of hyperparameters. Our aim is to choose the best instance from the set of candidates and evaluate its generalization ability. To do so, we ﬁrst divide the time series into training, validation, and test sets in chronological order with ratio 3 : 3 : 4.

3Retrieved from http://www.esrl.noaa.gov/psd/ data/gridded/data.noaaglobaltemp.html on Aug. 23, 2016. A temperature anomaly refers to a departure from a reference value such as a long-term average 4Air Data is maintained by the United States Environmental Protection Agency, and we retrieved it from Air Quality System Data Mart (http://aqsdr1.epa.gov/aqsweb/aqstmp/ airdata/download_files.html) on Dec. 28, 2016.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Table 1: Means and standard deviations of test RMSEs. NOAA CO NO2 O3 SO2 VAR 1.164 0.001 2.184 0.068 1.723 0.040 1.383 0.042 3.931 0.018 FAR 1.158 0.003 1.880 0.124 1.668 0.067 0.561 0.024 3.888 0.042 G-Dy BM 1.155 0.003 1.872 0.099 1.604 0.047 0.598 0.056 3.852 0.034 F-Dy BM 1.131 0.009 1.768 0.073 1.520 0.014 0.457 0.049 3.750 0.058

Then, we train each instance by running its online learning algorithm on the training set. The trained instances are then exposed to the validation set. At each step, each instance predicts one step ahead, and we compute the root-mean-square error (RMSE) using the prediction and the actual observation. The mean of the RMSEs across the entire validation set is computed for each instance, and we choose the instance with the smallest mean RMSE as the best. Finally, we evaluate the performance of the best one on the test set, and report the mean RMSE. Note that in our setting the algorithm runs through each subset of a time series only once, and the model parameters are always updated using a new observation. For all the models, we used the RBF kernel K(x, x ; γ) = exp( γ x x 2), and prepared the instances by all the combinations of the following hyperparameters: N = 25, d = 3, η[0] {2 n}23 n=19, σ2 {2 n}2 n=0, γ {2n}5 n=0 for all the models and L = 3, λ1 = 0.1, λ2 = 0.5, λ3 = 0.9 for G-Dy BM and F-Dy BM. Since the validation scores of VAR were sometimes exceptionally worse than the others, we additionally tried γ {2n}2 n= 3 for VAR. All the models are trained by using SGD, and the learning rate is controlled by rmsprop [Tieleman and Hinton, 2012]. Points P RN 3 were uniform-randomly sampled from [ 1, 1]3 and normalized to the unit vectors. Due to the randomness arising from P, we executed the above procedure for ten times, and report the mean and standard deviation for each model.

6.4 Results Experimental results are summarized in Table 1. We ﬁrst notice consistent performance improvements resulting from direct modeling of a functional time series when comparing FDy BM with G-Dy BM and FAR with VAR. In speciﬁc, the RMSE of F-Dy BM is 76 98% (92% on average) of that of GDy BM, and the RMSE of FAR is 41 99% (84% on average) of that of VAR. Also, performance comparison between FDy BM and FAR, and G-Dy BM and VAR demonstrates consistent beneﬁt from eligibility traces; the RMSE of F-Dy BM is 81 98% (92% on average) of that of FAR, and the RMSE of G-Dy BM is 43 99% (84% on average) of that of VAR. In summary, this numerical experiment demonstrates consistent performance improvements by turning on each feature of F-Dy BM, which veriﬁes its effectiveness.

7 Concluding Remark and Future Work

This paper has presented a functional dynamic Boltzmann machine (F-Dy BM) as an extension of Gaussian Dy BM with the aim of learning a functional time series online while still considering all the past patterns. Whereas it is relatively straightforward to extend the model, devising an online learning algorithm yields three technical challenges regarding how

to deal with and operate functions and how to learn from ﬁnite observations of a functional pattern. We have addressed these difﬁculties with two ideas: to model weight functions by a kernel function and ﬁnite parameters and to replace the functional patterns with their MAP estimators progressively. These two ideas successfully lead an online learning algorithm for F-Dy BM. We designed an experiment to demonstrate the effectiveness of two essential characteristics of FDy BM, i.e., eligibility traces and careful modeling of a functional time series. As a result, we have observed that eligibility traces decrease the error by 12.0% on average and the functional modeling decreases the error by 11.7%. One future direction will be to develop further more efﬁcient learning algorithms. Since our current implementation requires to solve a linear system of size N [t] at each time step, time complexity of each step amounts to O (N [t])3 (while the wrapping method requires O (N [t])3 + N 3 ). In the Gaussian process community, inference algorithms with much lower time complexity have been extensively studied [Qui nonero-Candela and Rasmussen, 2005; Rasmussen and Williams, 2006; Wilson et al., 2015], and we believe leveraging these innovations will lead to a scalable algorithm. Another research direction is to use F-Dy BM for language modeling, predicting the next word given the former words in a document. F-Dy BM can model this task by regarding a position in a document as a time step, a word vector obtained by word2vec [Mikolov et al., 2013] as x, and f [t](x) as a onehot representation of the word appearing at position t. Since F-Dy BM can leverage all the predecessor words and their similarities for prediction, we believe F-Dy BM will achieve more than moderate performance with much efﬁciency. Finally, it is fruitful to examine the theoretical properties of F-Dy BM, especially the estimation error induced by progressively replacing f [t δ](Q) with its MAP estimator in Eq. (10). Recently, Anava et al. [2015] have presented an online prediction algorithm for an AR model that can handle missing data, and they also have derived its regret bound. However, there are still several obstacles that hinder us from applying their theory to our algorithm. For example, since their theory does not assume the underlying model generating a time series, they consider the regret between a particular inefﬁcient predictor and its efﬁcient but approximated predictor, not the regret between their algorithm and the best possible AR model in the fully-observable case. Therefore, we believe that the ﬁrst step towards investigating the theoretical properties of F-Dy BM is to examine the regret bound against the fully-observable case.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

References [Anava et al., 2015] Oren Anava, Elad Hazan, and Assaf Zeevi. Online time series prediction with missing data. In Proceedings of the 32nd International Conference on Machine Learning, pages 2191 2199, 2015. [Besse et al., 2000] Philippe C. Besse, Herv e Cardot, and David B. Stephenson. Autoregressive forecasting of some functional climatic variations. Scandinavian Journal of Statistics, 27:673 687, 2000. [Bosq, 2000] Denis Bosq. Linear Processes in Function Spaces: Theory and Applications, chapter 5. Springer New York, 2000. [Damon and Guillas, 2002] Julien Damon and Serge Guillas. The inclusion of exogenous variables in functional autoregressive ozone forecasting. Environmetrics, 13:759 774, 2002. [Dasgupta and Osogami, 2017] Sakyasingha Dasgupta and Takayuki Osogami. Nonlinear dynamic Boltzmann machines for time-series prediction. In Proceedings of the Thirty-First AAAI Conference on Artiﬁcial Intelligence, 2017. [Didericksen et al., 2012] Devin Didericksen, Piotr Kokoszka, and Xi Zhang. Empirical properties of forecasts with the functional autoregressive model. Computational Statistics, 27(2):285 298, 2012. [Elman, 1990] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179 211, 1990. [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735 1780, 1997. [Hyndman and Ullah, 2007] Rob J. Hyndman and Md Shahid Ullah. Robust forecasting of mortality and fertility rates: a functional data approach. Computational Statistics & Data Analysis, 51(10):4942 4956, 2007. [Jaeger and Haas, 2004] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78 80, 2004. [Kargin and Onatski, 2008] Vladislav Kargin and Alexei Onatski. Curve forecasting by functional autoregression. Journal of Multivariate Analysis, 99(10):2508 2526, 2008. [L utkepohl, 2005] Helmut L utkepohl. New Introduction to Multiple Time Series Analysis. Part I. Springer Berlin Heidelberg, 2005. [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111 3119, 2013. [Osogami and Otsuka, 2015] Takayuki Osogami and Makoto Otsuka. Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity. Scientiﬁc Reports, 5:14149 EP , 09 2015.

[Osogami, 2016] Takayuki Osogami. Learning binary or real-valued time-series via spike-timing dependent plasticity. In The First NIPS Workshop Computing with Spikes, 2016. [Qui nonero-Candela and Rasmussen, 2005] Joaquin Qui nonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939 1959, 2005. [Ramsay and Silverman, 2005] James Ramsay and Bernard W. Silverman. Functional Data Analysis, chapter 8. Springer-Verlag New York, 2005. [Rasmussen and Williams, 2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning, chapter 8. MIT Press, 2006. [Rumelhart et al., 1986] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation, chapter 8, pages 318 362. MIT Press, 1986. [Shumway and Stoffer, 1982] Robert H. Shumway and David S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253 264, 1982. [Sundermeyer et al., 2012] Martin Sundermeyer, Ralf Schl uter, and Hermann Ney. LSTM neural networks for language modeling. In Interspeech, pages 194 197, 2012. [Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, coursera: Neural networks for machine learning, 2012. [Wilson et al., 2015] Andrew Gordon Wilson, Christopher Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. Technical report, eprint ar Xiv:1511.01870, 2015.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)