# Discriminative State-Space Models

Vitaly Kuznetsov, Google Research, New York, NY 10011, USA. vitaly@cims.nyu.edu
Mehryar Mohri, Courant Institute and Google Research, New York, NY 10011, USA. mohri@cims.nyu.edu

We introduce and analyze Discriminative State-Space Models for forecasting non-stationary time series. We provide data-dependent generalization guarantees for learning these models based on the recently introduced notion of discrepancy. We provide an in-depth analysis of the complexity of such models. We also study the generalization guarantees for several structural risk minimization approaches to this problem and provide an efficient implementation for one of them, which is based on a convex objective.

1 Introduction

Time series data is ubiquitous in many domains, including such diverse areas as finance, economics, climate science, healthcare, transportation and online advertisement. The field of time series analysis consists of many different problems, ranging from analysis to classification, anomaly detection, and forecasting. In this work, we focus on the problem of forecasting, which is probably one of the most challenging and important problems in the field.

Traditionally, time series analysis, and time series prediction in particular, has been approached from the perspective of generative modeling: a particular parametric generative model is postulated that is assumed to generate the observations, and these observations are then used to estimate the unknown parameters of the model. Autoregressive models are among the most commonly used types of generative models for time series [Engle, 1982, Bollerslev, 1986, Brockwell and Davis, 1986, Box and Jenkins, 1990, Hamilton, 1994]. These models typically assume that the stochastic process that generates the data is stationary up to some known transformation, such as differencing or composition with natural logarithms.

In many modern real-world applications, the stationarity assumption does not hold, which has led to the development of more flexible generative models that can account for non-stationarity in the underlying stochastic process. State-Space Models [Durbin and Koopman, 2012, Commandeur and Koopman, 2007, Kalman, 1960] provide a flexible framework that captures many such generative models as special cases, including autoregressive models, hidden Markov models, Gaussian linear dynamical systems and many other models. This framework typically assumes that the time series $Y$ is a noisy observation of some dynamical system $S$ that is hidden from the practitioner:

$$Y_t = h(S_t) + \epsilon_t, \qquad S_t = g(S_{t-1}) + \eta_t \quad \text{for all } t. \tag{1}$$

In (1), $h$ and $g$ are unknown functions estimated from data, $\{\epsilon_t\}$ and $\{\eta_t\}$ are sequences of random variables, and $\{S_t\}$ is an unobserved sequence of states of a hidden dynamical system.[^1] While this class of models provides a powerful and flexible framework for time series analysis, the theoretical learning properties of these models are not sufficiently well understood. The statistical guarantees available in the literature rely on strong assumptions about the noise terms (e.g. that $\{\epsilon_t\}$ and $\{\eta_t\}$ are Gaussian white noise). Furthermore, these results are typically asymptotic and require the model to be correctly specified.

[^1]: A more general formulation is given in terms of the distribution of $Y_t$: $p_h(Y_t \mid S_t)\, p_g(S_t \mid S_{t-1})$.
This last requirement places a significant burden on the practitioner, since the choice of the hidden state space is often a challenging problem and typically requires extensive domain knowledge.

In this work, we introduce and study Discriminative State-Space Models (DSSMs). We give a precise mathematical definition of this class of models in Section 2. Roughly speaking, a DSSM follows the same general structure as (1) and consists of a state predictor $g$ and an observation predictor $h$. However, no assumption is made about the form of the stochastic process that generates the observations. This family of models includes existing generative models and other state-based discriminative models (e.g. RNNs) as special cases, but also contains some novel algorithmic solutions explored in this paper.

The material we present is organized as follows. In Section 3, we generalize the notion of discrepancy, recently introduced by Kuznetsov and Mohri [2015], to derive learning guarantees for DSSMs. We show that our results can be viewed as a generalization of those of these authors. Our notion of discrepancy is finer, taking into account the structure of state-space representations, and leads to tighter learning guarantees. Additionally, our results provide the first high-probability generalization guarantees for state-space models with possibly incorrectly specified models.

Structural Risk Minimization (SRM) for DSSMs is analyzed in Section 4. As mentioned above, the choice of the state-space representation is a challenging problem, since it requires carefully balancing the accuracy of the model on the training sample against the complexity of the DSSM to avoid overfitting. We show that it is possible to adaptively learn a state-space representation in a principled manner using the SRM technique. This requires analyzing the complexity of several families of DSSMs of interest, which is done in Appendix B. In Section 5, we use our theory to design an efficient implementation of our SRM technique. Remarkably, the resulting optimization problem turns out to be convex. This should be contrasted with traditional SSMs, which are often derived via Maximum Likelihood Estimation (MLE) with a non-convex objective. We conclude with some promising preliminary experimental results in Appendix D.

2 Preliminaries

In this section, we introduce the general scenario of time series prediction as well as the broad family of DSSMs considered in this paper. We study the problem of time series forecasting in which the learner observes a realization $(X_1, Y_1), \ldots, (X_T, Y_T)$ of some stochastic process, with $(X_t, Y_t) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. We assume that the learner has access to a family of observation predictors $\mathcal{H} = \{h \colon \mathcal{X} \times \mathcal{S} \to \mathcal{Y}\}$ and state predictors $\mathcal{G} = \{g \colon \mathcal{X} \times \mathcal{S} \to \mathcal{S}\}$, where $\mathcal{S}$ is some pre-defined space. We refer to any pair $f = (h, g) \in \mathcal{H} \times \mathcal{G} = \mathcal{F}$ as a DSSM, which is used to make predictions as follows:

$$y_t = h(X_t, s_t), \qquad s_t = g(X_t, s_{t-1}) \quad \text{for all } t. \tag{2}$$

Observe that this formulation includes the hypothesis sets used in (1) as special cases. In our setting, $h$ and $g$ both accept an additional argument $x \in \mathcal{X}$. In practice, if $X_t = (Y_{t-1}, \ldots, Y_{t-p}) \in \mathcal{X} = \mathcal{Y}^p$ for some $p$, then $X_t$ represents some recent history of the stochastic process that is used to make a prediction of $Y_t$. More generally, $\mathcal{X}$ may also contain some additional side information. Elements of the output space $\mathcal{Y}$ may further be multi-dimensional, which covers both multi-variate time series forecasting and multi-step forecasting.
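To make the recursion in (2) concrete, here is a minimal sketch of how a DSSM produces forecasts, assuming scalar states and observations for simplicity; the function names and the linear-trend example pair $(h, g)$ are ours, not part of the paper:

```python
def dssm_forecast(h, g, xs, s0=0.0):
    """Run the DSSM recursion (2): s_t = g(x_t, s_{t-1}), y_t = h(x_t, s_t)."""
    preds, s = [], s0
    for x in xs:
        s = g(x, s)            # state update via the state predictor g
        preds.append(h(x, s))  # observation prediction via h
    return preds

# Example: a hypothetical linear-trend DSSM with g(x, s) = s + c and h(x, s) = s.
c = 0.5
preds = dssm_forecast(h=lambda x, s: s, g=lambda x, s: s + c, xs=range(10))
```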
The performance of the learner is measured using a bounded loss function $L \colon \mathcal{H} \times \mathcal{S} \times \mathcal{Z} \to [0, M]$, for some upper bound $M \ge 0$. A commonly used loss function is the squared loss: $L(h, s, z) = (h(x, s) - y)^2$ for $z = (x, y)$. The objective of the learner is to use the observed realization of the stochastic process up to time $T$ to determine a DSSM $f = (h, g) \in \mathcal{F}$ that has the smallest expected loss at time $T + 1$, conditioned on the given realization of the stochastic process:[^2]

$$\mathcal{L}_{T+1}(f \mid \mathbf{Z}_1^T) = \mathbb{E}\big[L(h, s_{T+1}, Z_{T+1}) \mid \mathbf{Z}_1^T\big],$$

where $s_t$, for all $t$, is specified by $g$ via the recursive computation in (2). We will use the notation $\mathbf{a}_s^r$ to denote $(a_s, a_{s+1}, \ldots, a_r)$.

[^2]: An alternative performance metric commonly considered in the time series literature is the averaged generalization error $\mathcal{L}_{T+1}(f) = \mathbb{E}[L(f, s_{T+1}, Z_{T+1})]$. The path-dependent generalization error that we consider in this work is a finer measure of performance, since it only takes into consideration the realized history of the stochastic process, as opposed to an average trajectory.

In the rest of this section, we introduce the tools needed for the analysis of this problem. The key technical tool that we require is the notion of state-space discrepancy:

$$\mathrm{disc}(\mathbf{s}) = \sup_{h \in \mathcal{H}} \left( \mathbb{E}\big[L(h, s_{T+1}, Z_{T+1}) \mid \mathbf{Z}_1^T\big] - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[L(h, s_t, Z_t) \mid \mathbf{Z}_1^{t-1}\big] \right),$$

where, for simplicity, we used the shorthand $\mathbf{s} = \mathbf{s}_1^{T+1}$. This definition is a strict generalization of the $q$-weighted discrepancy of Kuznetsov and Mohri [2015]. In particular, redefining $L(h, s, z) = s \, L(h, z)$ and setting $s_t = T q_t$ for $1 \le t \le T$ and $s_{T+1} = 1$ recovers the definition of the $q$-weighted discrepancy.

The discrepancy disc defines an integral probability pseudo-metric on the space of probability distributions that serves as a measure of the non-stationarity of the stochastic process $Z$ with respect to both the loss function $L$ and the hypothesis set $\mathcal{H}$, conditioned on the given state sequence $\mathbf{s}$. For example, if the process $Z$ is i.i.d., then we simply have $\mathrm{disc}(\mathbf{s}) = 0$ provided that $\mathbf{s}$ is a constant sequence. See [Cortes et al., 2017, Kuznetsov and Mohri, 2014, 2017, 2016, Zimin and Lampert, 2017] for further examples and bounds on the discrepancy in terms of other divergences. However, the most important property of the discrepancy $\mathrm{disc}(\mathbf{s})$ is that, as shown in Appendix C, under some additional mild assumptions, it can be estimated from data.

The learning guarantees that we present are given in terms of data-dependent measures of sequential complexity, such as the expected sequential covering number [Rakhlin et al., 2010], that are modified to account for the state-space structure in the hypothesis set. The following definition of a complete binary tree is used throughout this paper: a $\mathcal{Z}$-valued complete binary tree $\mathbf{z}$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ of $T$ mappings $\mathbf{z}_t \colon \{\pm 1\}^{t-1} \to \mathcal{Z}$, $t \in [1, T]$. A path in the tree is $\sigma = (\sigma_1, \ldots, \sigma_{T-1}) \in \{\pm 1\}^{T-1}$. We write $\mathbf{z}_t(\sigma)$ instead of $\mathbf{z}_t(\sigma_1, \ldots, \sigma_{t-1})$ to simplify the notation.

Let $\mathcal{R} = \mathcal{R}_0 \times \mathcal{G}$ be any function class, where $\mathcal{G}$ is a family of state predictors and $\mathcal{R}_0 = \{r \colon \mathcal{Z} \times \mathcal{S} \to \mathbb{R}\}$. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is a sequential $\alpha$-cover (with respect to the $\ell_p$ norm) of $\mathcal{R}$ on a tree $\mathbf{z}$ of depth $T$ if, for all $(r, g) \in \mathcal{R}$ and all $\sigma \in \{\pm 1\}^T$, there is $\mathbf{v} \in V$ such that

$$\left( \frac{1}{T} \sum_{t=1}^{T} \big| \mathbf{v}_t(\sigma) - r(\mathbf{z}_t(\sigma), s_t) \big|^p \right)^{\frac{1}{p}} \le \alpha, \qquad \text{where } s_t = g(\mathbf{z}_t(\sigma), s_{t-1}).$$

The (sequential) covering number $\mathcal{N}_p(\alpha, \mathcal{R}, \mathbf{z})$ on a given tree $\mathbf{z}$ is defined to be the size of the minimal sequential cover. We call $\mathcal{N}_p(\alpha, \mathcal{R}) = \sup_{\mathbf{z}} \mathcal{N}_p(\alpha, \mathcal{R}, \mathbf{z})$ the maximal covering number.
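To spell out the reduction mentioned above, the following worked substitution (a restatement of the definitions, not a new result) shows how the state-space discrepancy collapses to the $q$-weighted discrepancy of Kuznetsov and Mohri [2015]:

```latex
% Substituting L(h, s, z) = s L(h, z), s_t = T q_t for 1 <= t <= T, and
% s_{T+1} = 1 into the definition of disc(s) gives
\mathrm{disc}(\mathbf{s})
  = \sup_{h \in \mathcal{H}} \Big(
      \mathbb{E}\big[L(h, Z_{T+1}) \mid \mathbf{Z}_1^T\big]
      - \frac{1}{T} \sum_{t=1}^{T} T q_t \,
        \mathbb{E}\big[L(h, Z_t) \mid \mathbf{Z}_1^{t-1}\big]
    \Big)
  = \sup_{h \in \mathcal{H}} \Big(
      \mathbb{E}\big[L(h, Z_{T+1}) \mid \mathbf{Z}_1^T\big]
      - \sum_{t=1}^{T} q_t \,
        \mathbb{E}\big[L(h, Z_t) \mid \mathbf{Z}_1^{t-1}\big]
    \Big),
% which is exactly the q-weighted discrepancy.
```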
We define the expected covering number to be $\mathbb{E}_{\mathbf{z} \sim \mathcal{T}(\mathbf{p})}[\mathcal{N}_p(\alpha, \mathcal{R}, \mathbf{z})]$, where $\mathcal{T}(\mathbf{p})$ denotes the distribution of $\mathbf{z}$ implicitly defined via the following sampling procedure. Given a stochastic process distributed according to the distribution $\mathbf{p}$, with $p_t(\cdot \mid \mathbf{z}_1^{t-1})$ denoting the conditional distribution at time $t$, sample $Z_1, Z'_1$ from $p_1$ independently. In the left child of the root, sample $Z_2, Z'_2$ according to $p_2(\cdot \mid Z_1)$ and in the right child according to $p_2(\cdot \mid Z'_1)$, all independently of each other. More generally, for a node that can be reached by a path $(\sigma_1, \ldots, \sigma_t)$, we draw $Z_t, Z'_t$ according to $p_t(\cdot \mid S_1(\sigma_1), \ldots, S_{t-1}(\sigma_{t-1}))$, where $S_t(1) = Z_t$ and $S_t(-1) = Z'_t$. Expected sequential covering numbers are a finer measure of complexity, since they directly take into account the distribution of the underlying stochastic process. For further details on sequential complexity measures, we refer the reader to [Littlestone, 1987, Rakhlin et al., 2010, 2011, 2015a,b].

3 Theory

In this section, we present our generalization bounds for learning with DSSMs. For our first result, we assume that the sequence of states $\mathbf{s}$ (or equivalently the state predictor $g$) is fixed and we are only learning the observation predictor $h$.

Theorem 1. Fix $\mathbf{s} \in \mathcal{S}^{T+1}$. For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ and all $\alpha > 0$, the following inequality holds:

$$\mathcal{L}_{T+1}(h, \mathbf{s} \mid \mathbf{Z}_1^T) \le \frac{1}{T} \sum_{t=1}^{T} L(h, s_t, Z_t) + \mathrm{disc}(\mathbf{s}) + 2\alpha + M \sqrt{\frac{2}{T} \log \frac{\mathbb{E}_{\mathbf{v} \sim \mathcal{T}(\mathbf{P})}\big[\mathcal{N}_1(\alpha, \mathcal{R}_{\mathbf{s}}, \mathbf{v})\big]}{\delta}},$$

where $\mathcal{R}_{\mathbf{s}} = \{(z, s) \mapsto L(h, s, z) \colon h \in \mathcal{H}\} \times \{\mathbf{s}\}$.

The proof of Theorem 1 (as well as the proofs of all other results in this paper) is given in Appendix A. Note that this result is a generalization of the learning guarantees of Kuznetsov and Mohri [2015]. Indeed, setting $\mathbf{s} = (Tq_1, \ldots, Tq_T, 1)$ for some weight vector $\mathbf{q}$ and $L(h, s, z) = s \, L(h, z)$ recovers Corollary 2 of Kuznetsov and Mohri [2015]. Zimin and Lampert [2017] show that, under some additional assumptions on the underlying stochastic process (e.g. Markov processes, uniform martingales), it is possible to choose these weights to guarantee that the discrepancy $\mathrm{disc}(\mathbf{s})$ is small. Alternatively, Kuznetsov and Mohri [2015] show that if the distributions of the stochastic process at time $T + 1$ and over $[T - s, T]$ are sufficiently close (in terms of discrepancy), then $\mathrm{disc}(\mathbf{s})$ can be estimated from data. In Theorem 5 in Appendix C, we show that this property holds for arbitrary state sequences $\mathbf{s}$. Therefore, one can use the bound of Theorem 1, which can be computed from data, to search for the predictor $h \in \mathcal{H}$ that minimizes this quantity. The quality of the result will depend on the given state-space sequence $\mathbf{s}$. Our next result shows that it is possible to learn $h \in \mathcal{H}$ and $\mathbf{s}$ generated by some state predictor $g \in \mathcal{G}$ jointly.

Theorem 2. For any $\delta > 0$, with probability at least $1 - \delta$, for all $f = (h, g) \in \mathcal{H} \times \mathcal{G}$ and all $\alpha > 0$, the following inequality holds:

$$\mathcal{L}_{T+1}(f \mid \mathbf{Z}_1^T) \le \frac{1}{T} \sum_{t=1}^{T} L(h, s_t, Z_t) + \mathrm{disc}(\mathbf{s}) + 2\alpha + M \sqrt{\frac{2}{T} \log \frac{\mathbb{E}_{\mathbf{v} \sim \mathcal{T}(\mathbf{P})}\big[\mathcal{N}_1(\alpha, \mathcal{R}, \mathbf{v})\big]}{\delta}},$$

where $s_t = g(X_t, s_{t-1})$ for all $t$ and $\mathcal{R} = \{(z, s) \mapsto L(h, s, z) \colon h \in \mathcal{H}\} \times \mathcal{G}$.

The cost of this significantly more general result is a slightly larger complexity term, $\mathcal{N}_1(\alpha, \mathcal{R}, \mathbf{v}) \ge \mathcal{N}_1(\alpha, \mathcal{R}_{\mathbf{s}}, \mathbf{v})$. This bound is also much tighter than the one that can be obtained by applying the result of Kuznetsov and Mohri [2015] directly to $\mathcal{F} = \mathcal{H} \times \mathcal{G}$, which would lead to the same bound as in Theorem 2 but with $\mathrm{disc}(\mathbf{s})$ replaced by $\sup_{g \in \mathcal{G}} \mathrm{disc}(\mathbf{s})$. Not only is $\sup_{g \in \mathcal{G}} \mathrm{disc}(\mathbf{s})$ an upper bound on $\mathrm{disc}(\mathbf{s})$, but it is also possible to construct examples in which it leads to learning bounds that are too loose. Consider the stochastic process generated as follows. Let $X$ be uniformly distributed on $\{\pm 1\}$.
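The sampling procedure defining $\mathcal{T}(\mathbf{p})$ can be made concrete with a short sketch. The following draws a single random path of the tree, assuming user-supplied samplers for $p_1$ and $p_t(\cdot \mid \cdot)$; the interface names are ours, not part of the paper:

```python
import random

def sample_tree_path(T, sample_initial, sample_conditional):
    """Sample one random path of a Z-valued tree drawn from T(p).

    `sample_initial()` draws from p_1; `sample_conditional(history)` draws from
    p_t(. | history). At each level we draw the pair (Z_t, Z'_t) and a sign
    sigma_t; the conditioning history is extended with S_t(1) = Z_t or
    S_t(-1) = Z'_t, as in the definition above.
    """
    z, z_alt = sample_initial(), sample_initial()   # Z_1, Z'_1 i.i.d. from p_1
    history, path = [], []
    for _ in range(1, T):
        sigma = random.choice([1, -1])
        path.append(sigma)
        history.append(z if sigma == 1 else z_alt)  # S_t(sigma_t)
        z = sample_conditional(history)             # Z_{t+1} ~ p_{t+1}(. | history)
        z_alt = sample_conditional(history)         # Z'_{t+1}, independent copy
    return path, history

# Example with a Gaussian random walk: p_t(. | z_1^{t-1}) = N(z_{t-1}, 1).
path, hist = sample_tree_path(
    T=5,
    sample_initial=lambda: random.gauss(0.0, 1.0),
    sample_conditional=lambda h: random.gauss(h[-1], 1.0),
)
```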
Suppose $Y_1 = 1$ and, for all $t > 1$, $Y_t = Y_{t-1}$ if $X = 1$ and $Y_t = -Y_{t-1}$ otherwise. In other words, $Y$ is either a constant or a periodic sequence. If $L$ is the squared loss, then for $\mathcal{G} = \{g_1, g_2\}$ with $g_1(s) = s$ and $g_2(s) = -s$, and $\mathcal{H} = \{h\}$ with $h(s) = s$, for odd $T$, $\sup_{g \in \mathcal{G}} \mathrm{disc}(\mathbf{s}) \ge 1/2$. On the other hand, the bound in terms of $\mathrm{disc}(\mathbf{s})$ is much finer and helps us select the $g$ such that $\mathrm{disc}(\mathbf{s}) = 0$ for that $g$. This example shows that even for simple deterministic dynamics our learning bounds are finer than existing ones.

Since the guarantees of Theorem 2 are data-dependent and hold uniformly over $\mathcal{F}$, they allow us to seek a solution $f \in \mathcal{F}$ that directly optimizes this bound and that can be computed from the given sample. As our earlier example shows, the choice of the family of state predictors $\mathcal{G}$ is crucial to achieving good guarantees. For instance, if $\mathcal{G} = \{g_1\}$, then it may be impossible to obtain a non-trivial bound. In other words, if the family of state predictors is not rich enough, then it may not be possible to handle the non-stationarity of the data. On the other hand, if $\mathcal{G}$ is chosen to be too large, then the complexity term may be too large. In Section 4, we present an SRM technique that enables us to learn the state-space representation and adapt to non-stationarity in a principled way.

4 Structural Risk Minimization

Suppose we are given a sequence of families of observation predictors $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots \subseteq \mathcal{H}_n \subseteq \cdots$ and a sequence of families of state predictors $\mathcal{G}_1 \subseteq \mathcal{G}_2 \subseteq \cdots \subseteq \mathcal{G}_n \subseteq \cdots$. Let $\mathcal{R}_k = \{(s, z) \mapsto L(h, s, z) \colon h \in \mathcal{H}_k\} \times \mathcal{G}_k$ and $\mathcal{R} = \cup_{k=1}^{\infty} \mathcal{R}_k$. Consider the following objective function:

$$F(h, g, k) = \frac{1}{T} \sum_{t=1}^{T} L(h, s_t, Z_t) + \Delta(\mathbf{s}) + B_k + M \sqrt{\frac{\log k}{T}}, \tag{5}$$

where $\Delta(\mathbf{s})$ is any upper bound on $\mathrm{disc}(\mathbf{s})$ and $B_k$ is any upper bound on $M \sqrt{\frac{2}{T} \log \frac{\mathbb{E}_{\mathbf{v} \sim \mathcal{T}(\mathbf{P})}[\mathcal{N}_1(\alpha, \mathcal{R}_k, \mathbf{v})]}{\delta}}$. We present an estimable upper bound on $\mathrm{disc}(\mathbf{s})$ in Appendix C, which provides one particular choice for $\Delta(\mathbf{s})$. In Appendix B, we also prove upper bounds on the expected sequential covering numbers for several families of hypotheses. We then define the SRM solution as follows:

$$(\widetilde h, \widetilde g, \widetilde k) = \operatorname*{argmin}_{(h, g) \in \mathcal{H}_k \times \mathcal{G}_k,\; k \ge 1} F(h, g, k). \tag{6}$$

We also define $f^*$ by $f^* = (h^*, g^*) \in \operatorname*{argmin}_{f \in \mathcal{F}} \mathcal{L}_{T+1}(f \mid \mathbf{Z}_1^T)$. Then, the following result holds.

Theorem 3. For any $\delta > 0$, with probability at least $1 - \delta$, for all $\alpha > 0$, the following bound holds:

$$\mathcal{L}_{T+1}(\widetilde h, \widetilde g \mid \mathbf{Z}_1^T) \le \mathcal{L}_{T+1}(f^* \mid \mathbf{Z}_1^T) + 2\Delta(\mathbf{s}^*) + 2\alpha + 2B_{k(f^*)} + M \sqrt{\frac{2}{T} \log \frac{2}{\delta}},$$

where $s^*_t = g^*(X_t, s^*_{t-1})$, and where $k(f^*)$ is the smallest integer $k$ such that $f^* \in \mathcal{H}_k \times \mathcal{G}_k$.

Theorem 3 provides a learning guarantee for the solution of the SRM problem (5). This result guarantees for the SRM solution a performance close to that of the best-in-class model $f^*$, modulo a penalty term that includes the discrepancy (of the best-in-class state predictor), similar to the guarantees of Section 3. This guarantee can be viewed as a worst-case bound for when we are unsure whether the state-space predictor captures the non-stationarity of the problem correctly. However, in most cases, by introducing a state-space representation, we hope that it will help us model (at least to some degree) the non-stationarity of the underlying stochastic process. In what follows, we present a more optimistic best-case analysis which shows that, under some additional mild assumptions on the complexity of the hypothesis space with respect to the stochastic process, we can simultaneously simplify the SRM optimization and give tighter learning guarantees for this modified version.

Assumption 1 (Stability of state trajectories).
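The deterministic example above is easy to verify numerically. The sketch below is our own rendering (the choice $h(s) = s$ and $g \in \{s \mapsto s,\; s \mapsto -s\}$ follows the text): the matched state predictor incurs zero loss, while the mismatched one incurs a constant average loss, so the bound in terms of $\mathrm{disc}(\mathbf{s})$ lets us pick the right $g$.

```python
import random

def make_series(T):
    """The toy process from the text: Y is constant if X = 1, alternating otherwise."""
    x = random.choice([1, -1])
    ys = [1.0]
    for _ in range(T - 1):
        ys.append(ys[-1] if x == 1 else -ys[-1])
    return x, ys

def avg_sq_loss(g, ys):
    """Average squared loss of the DSSM with h(s) = s, seeding s_1 = y_1."""
    s, total = ys[0], 0.0
    for y in ys[1:]:
        s = g(s)                 # next state = next prediction, since h is identity
        total += (s - y) ** 2
    return total / (len(ys) - 1)

x, ys = make_series(101)
g_good = (lambda s: s) if x == 1 else (lambda s: -s)
g_bad = (lambda s: -s) if x == 1 else (lambda s: s)
assert avg_sq_loss(g_good, ys) == 0.0  # matched dynamics: zero loss
assert avg_sq_loss(g_bad, ys) == 2.0   # mismatched dynamics: constant average loss
```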
Assume that there is a decreasing function $r$ such that for any $\epsilon > 0$ and $\delta > 0$, with probability $1 - \delta$, if $(h^*, g^*) = \operatorname*{argmin}_{(h, g) \in \mathcal{F}} \mathcal{L}_{T+1}(h, g \mid \mathbf{Z}_1^T)$ and $(h, g) \in \mathcal{F}$ is such that

$$\frac{1}{T} \left| \sum_{t=1}^{T} \Big( \mathcal{L}_t(h, g \mid \mathbf{Z}_1^{t-1}) - \mathcal{L}_t(h^*, g^* \mid \mathbf{Z}_1^{t-1}) \Big) \right| \le \epsilon, \tag{7}$$

then the following holds:

$$\mathcal{L}_{T+1}(h, g \mid \mathbf{Z}_1^T) - \mathcal{L}_{T+1}(h^*, g^* \mid \mathbf{Z}_1^T) \le r(\epsilon). \tag{8}$$

Roughly speaking, this assumption states that, given a sequence of states $s_1, \ldots, s_T$ generated by $g$ such that the performance of some observation predictor $h$ along this sequence of states is close to the performance of the ideal pair $h^*$ along the ideal sequence generated by $g^*$, the performance of $h$ in the near future (at state $s_{T+1}$) will remain close to that of $h^*$ (at state $s^*_{T+1}$). Note that, in most cases of interest, $r$ has the form $r(\epsilon) = a\epsilon$, for some $a > 0$. Consider the following optimization problem, which is similar to (5) but omits the discrepancy upper bound $\Delta$:

$$F_0(h, g, k) = \frac{1}{T} \sum_{t=1}^{T} L(h, s_t, Z_t) + B_k + M \sqrt{\frac{\log k}{T}}. \tag{9}$$

We refer to $F_0$ as the optimistic SRM objective and we let $(h_0, g_0)$ be a minimizer of $F_0$. Then, we have the following learning guarantee.

Theorem 4. Under Assumption 1, for any $\delta > 0$, with probability at least $1 - \delta$, for all $\alpha > 0$, the inequality $\mathcal{L}_{T+1}(h_0, g_0 \mid \mathbf{Z}_1^T) - \mathcal{L}_{T+1}(f^* \mid \mathbf{Z}_1^T) < r(\epsilon)$ holds with

$$\epsilon = 2\alpha + 2B_{k(f^*)} + M \sqrt{\frac{2}{T} \log \frac{2}{\delta}},$$

where $s^*_t = g^*(X_t, s^*_{t-1})$, and where $k(f^*)$ is the smallest integer $k$ such that $f^* \in \mathcal{H}_k \times \mathcal{G}_k$.

We remark that a finer analysis can be used to show that Assumption 1 only needs to be satisfied for $k \ge k(f^*)$ for Theorem 4 to hold. Furthermore, observe that for linear functions $r(\epsilon) = a\epsilon$, one recovers a guarantee similar to the bound of Theorem 3, but with the discrepancy term omitted, making this result tighter. This result suggests that, in the optimistic scenario where our hypothesis set contains a good state predictor that can capture the non-stationarity of the data, it is possible to achieve a tighter guarantee that avoids the pessimistic discrepancy term. Note that increasing the capacity of the family of state predictors makes it easier to find such a good state predictor, but it may also make the learning problem harder and lead to the violation of Assumption 1. This further motivates the use of an SRM technique for this problem to find the right balance between capturing the non-stationarity of the data and the complexity of the models that are being used. Theorem 4 formalizes this intuition by providing theoretical guarantees for this approach.

We now consider several illustrative examples showing that this assumption holds in a variety of cases of interest. In all our examples, we use the squared loss, but it is possible to generalize all of them to other sufficiently regular losses.

Linear models. Let $\mathcal{F}$ be defined by $\mathcal{F} = \{f \colon y \mapsto w \cdot \Phi(y),\; \|w\| \le \Lambda\}$ for some $\Lambda > 0$ and some feature map $\Phi$. Consider the separable case where $Y_t = w^* \cdot \Phi(\mathbf{Y}_{t-p}^{t-1}) + \epsilon_t$, where $\epsilon_t$ represents white noise. One can verify that the following equality holds:

$$\mathcal{L}_t(h, g \mid \mathbf{Z}_1^{t-1}) - \mathcal{L}_t(h^*, g^* \mid \mathbf{Z}_1^{t-1}) = \big((w - w^*) \cdot \Phi(\mathbf{Y}_{t-p}^{t-1})\big)^2.$$

In view of that, it follows that the left-hand side of (7) is equal to $\frac{1}{T} \sum_{t=1}^{T} \big((w - w^*) \cdot \Phi(\mathbf{Y}_{t-p}^{t-1})\big)^2$, which is lower bounded by $(w_j - w^*_j)^2 \sigma_j$ for any coordinate $j \in [1, N]$. Thus, for any coordinate $j \in [1, N]$, by Hölder's inequality, we have

$$\mathcal{L}_{T+1}(h, g \mid \mathbf{Z}_1^T) - \mathcal{L}_{T+1}(h^*, g^* \mid \mathbf{Z}_1^T) \le \epsilon \, r \sum_{j=1}^{N} \frac{1}{\sigma_j},$$

where $\sigma_j = \frac{1}{T} \sum_{t=1}^{T} \Phi_j(\mathbf{Y}_{t-p}^{t-1})^2$ is the empirical variance of the $j$-th coordinate and $r = \sup_y \|\Phi(y)\|_1^2$ is the empirical $\ell_1$-norm radius of the data. The special case where $\Phi$ is the identity map covers standard autoregressive models. These often serve as basic building blocks for other state-space models, as discussed below. More generally, other feature maps may be induced by a positive definite kernel $K$.
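For completeness, here is the one-step computation behind the displayed equality, assuming only that the white noise $\epsilon_t$ has zero conditional mean given the past (a standard bias-variance step):

```latex
% With Y_t = w^* \cdot \Phi(\mathbf{Y}_{t-p}^{t-1}) + \epsilon_t and
% \mathbb{E}[\epsilon_t \mid \mathbf{Y}_1^{t-1}] = 0, write \Phi_t = \Phi(\mathbf{Y}_{t-p}^{t-1}):
\mathbb{E}\big[(w \cdot \Phi_t - Y_t)^2 \mid \mathbf{Y}_1^{t-1}\big]
  = \mathbb{E}\big[\big((w - w^*) \cdot \Phi_t - \epsilon_t\big)^2 \mid \mathbf{Y}_1^{t-1}\big]
  = \big((w - w^*) \cdot \Phi_t\big)^2
    + \mathbb{E}\big[\epsilon_t^2 \mid \mathbf{Y}_1^{t-1}\big],
% since the cross term vanishes; subtracting the same expression at w = w^*
% leaves exactly ((w - w^*) \cdot \Phi_t)^2.
```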
Alternatively, we may take as our hypothesis set $\mathcal{F}$ the convex hull of all decision trees of a certain depth $d$. In that case, we can view each coordinate $\Phi_j$ as the output of a particular decision tree on the given input.

Linear trend models. For simplicity, in this example, we consider univariate time series with a linear trend. However, this can easily be generalized to the multi-variate setting with different trend models. Define $\mathcal{G}$ as $\mathcal{G} = \{s \mapsto s + c \colon |c| \le \Lambda\}$ for some $\Lambda > 0$ and let $\mathcal{H}$ be a singleton consisting of the identity map. Assume that $Y_t = c^* t + \epsilon_t$, where $\epsilon_t$ is white noise. As in the previous example, it is easy to check that $\mathcal{L}_t(h, g \mid \mathbf{Z}_1^{t-1}) - \mathcal{L}_t(h^*, g^* \mid \mathbf{Z}_1^{t-1}) = |c - c^*|^2 t^2$. Therefore, in this case, the left-hand side of (7) reduces to $\frac{1}{6}(T + 1)(2T + 1)\,|c - c^*|^2$, and therefore, if $\epsilon = O(1/\sqrt{T})$, then we have $|c - c^*|^2 = O(1/T^{5/2})$ and thus (8) is $|c - c^*|^2 (T + 1)^2 = O(1/\sqrt{T})$.

Periodic signals. We study a multi-resolution setting where the time series of interest are modeled as a linear combination of periodic signals at different frequencies. We express this as a state-space model as follows. Define

$$A_d = \begin{pmatrix} -\mathbf{1} & 0 \\ I_{d-1} & \mathbf{0} \end{pmatrix},$$

where $-\mathbf{1}$ is a $(d-1)$-dimensional row vector of $-1$s, $\mathbf{0}$ is a $(d-1)$-dimensional column vector of $0$s and $I_{d-1}$ is the $(d-1)$-dimensional identity matrix. It is easy to verify that, under the map $s \mapsto A_d s$, the sequence $s_1 \cdot e_1, s_2 \cdot e_1, \ldots, s_t \cdot e_1, \ldots$, where $e_1 = (1, 0, \ldots, 0)^\top$, is a periodic sequence with period $d$. Let $D = \{d_1, \ldots, d_k\}$ be any collection of positive integers and let $A$ be the block-diagonal matrix with $A_{d_1}, \ldots, A_{d_k}$ on the diagonal. We set $\mathcal{G} = \{s \mapsto As\}$ and $\mathcal{H} = \{s \mapsto w \cdot s \colon \|w\|_1 < \Lambda\}$, where we also restrict $w$ to be non-zero only at coordinates $1, 1 + d_1, 1 + d_1 + d_2, \ldots, 1 + \sum_{j=1}^{k-1} d_j$. Once again, to simplify our presentation, we assume that $Y_t$ satisfies $Y_t = w^* \cdot s_t + \epsilon_t$. Using arguments similar to those of the previous examples, one can show that the left-hand side of (7) is lower bounded by $(w_j - w^*_j)^2 \sigma_j$ for any coordinate $j$, where $\sigma_j = \frac{1}{T} \sum_{t=1}^{T} s_{t,j}^2$. Therefore, as before, if (7) is upper bounded by $\epsilon > 0$, then (8) is upper bounded by $\epsilon \, r \sum_{j=1}^{N} \frac{1}{\sigma_j}$, where $r$ is the maximal radius of any state and $\sigma_j$ the variance of the $j$-th state sequence.

Trajectory ensembles. Note that, in our previous example, we did not exploit the fact that the sequences were periodic. Indeed, our argument holds for any $g$ that generates a multi-dimensional trajectory and any $h \in \mathcal{H} = \{s \mapsto w \cdot s \colon \|w\|_1 < \Lambda\}$, which can be interpreted as learning an ensemble of different state-space trajectories.

Structural Time Series Models (STSMs). STSMs are a popular family of state-space models that combine all of the previous examples. For this model, we use $(h, g) \in \mathcal{H} \times \mathcal{G}$ with the following structure: $h(x_t, s_t) = w \cdot \Phi(x_t) + c\,t + w' \cdot s_t$, where $s_t$ is a vector of periodic sequences as described in the previous examples and $x_t$ is the vector representing the most recent history of the time series. Note that our formulation is very general and allows for arbitrary feature maps $\Phi$ that can correspond either to kernel-based or tree-based models. Arguments similar to those given in the previous examples show that Assumption 1 holds in this case.

Shifting parameters. We consider a non-realizable case where $\mathcal{H}$ is a set of linear models but the data is generated according to the following procedure. The first $T/2$ rounds obey the formula $Y_t = w_0 \cdot Y_{t-1} + \epsilon_t$, the subsequent rounds the formula $Y_t = w^* \cdot Y_{t-1} + \epsilon_t$. Note that, in this case, we may have $\big| \frac{1}{T} \sum_{t=1}^{T} \big( \mathcal{L}_t(w_0 \mid \mathbf{Z}_1^{t-1}) - \mathcal{L}_t(w^* \mid \mathbf{Z}_1^{t-1}) \big) \big| = 0$. However, if $w_0$ and $w^*$ are sufficiently far apart, it is possible to show that there is a constant lower bound on $\mathcal{L}_{T+1}(w_0 \mid \mathbf{Z}_1^T) - \mathcal{L}_{T+1}(w^* \mid \mathbf{Z}_1^T)$.
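The periodic block $A_d$ can be checked numerically. The sketch below assumes our reconstruction of the matrix with a $-1$ row (the signs are our reading of the garbled original) and verifies that the first state coordinate is periodic with period $d$:

```python
import numpy as np

def seasonal_block(d):
    """The d x d block A_d sketched above: a row of -1s in the first d-1
    entries of the top row, a shifted identity below, and a zero last column."""
    A = np.zeros((d, d))
    A[0, : d - 1] = -1.0
    A[1:, : d - 1] = np.eye(d - 1)
    return A

d = 4
A = seasonal_block(d)
s = np.zeros(d)
s[0] = 1.0  # the initial state e_1
traj = []
for _ in range(3 * d):
    s = A @ s
    traj.append(s[0])  # the observed coordinate s_t . e_1
# The first coordinate repeats with period d.
assert np.allclose(traj[:d], traj[d : 2 * d])
```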
One approach to making Assumption 1 hold for this stochastic process is to choose $\mathcal{H}$ such that the resulting learning problem is separable. However, that requires us to know the exact nature of the underlying stochastic process. An alternative, agnostic approach is to consider a sequence of states (or equivalently weights) that can assign different weights $q_t$ to different training points.

Finally, observe that our learning guarantees in Sections 3 and 4 are expressed in terms of the expected sequential covering numbers of the family of DSSMs that we are seeking to learn. A priori, it is not clear whether it is possible to control the complexity of such models in a meaningful way. However, in Appendix B, we present explicit upper bounds on the expected sequential covering numbers of several families of DSSMs, including several of those discussed above: linear models, tree-based hypotheses, and trajectory ensembles.

5 Algorithmic Solutions

The generic SRM procedures described in Section 4 can lead to the design of a range of different algorithmic solutions for forecasting time series, depending on the choice of the families $\mathcal{H}_k$ and $\mathcal{G}_k$. The key challenge in the design of an algorithm for this setting is to come up with a tractable procedure for searching through sets of increasing complexity. In this section, we describe one such procedure, which leads to a boosting-style algorithm. Our algorithm learns a structural time series model by adaptively adding various structural sub-components to the model in order to balance model complexity and the ability of the model to handle non-stationarity in the data. We refer to our algorithm as Boosted Structural Time Series Models (BOOSTSM). We discuss BOOSTSM in the context of the squared loss, but most of our results can be straightforwardly extended to other convex loss functions. The hypothesis set used by our algorithm admits the following form:

$$\mathcal{H} = \{(x, s) \mapsto w \cdot \Phi(x) + w' \cdot s \colon \|w\|_1 \le \Lambda,\; \|w'\|_1 \le \Lambda'\}.$$

Each coordinate $\Phi_j$ is a binary-valued decision tree that maps its inputs to a bounded set. For simplicity, we also assume that $\Lambda = \Lambda' = 1$. We choose $\mathcal{G}$ to be any set of state trajectories. For instance, this set may include periodic or trend sequences as described in Section 4. Note that, to make the discussion concrete, we impose an $\ell_1$-constraint on the parameter vectors, but other regularization penalties are also possible. The particular choice of the regularization defined by $\mathcal{H}$ also leads to sparser solutions, which is an additional advantage given that our state-space representation is high-dimensional. For the squared loss and the aforementioned $\mathcal{H}$, the optimistic SRM objective (9) is given by

$$F(w, w') = \frac{1}{T} \sum_{t=1}^{T} \big(y_t - w \cdot \Phi(x_t) - w' \cdot s_t\big)^2 + \lambda\big(\|w\|_1 + \|w'\|_1\big), \tag{10}$$

where we omit the $\log k$ term because the index $k$ in our setting tracks the maximal depth of the trees and it suffices to restrict the search to the case $k < T$ since, for deeper trees, we can achieve zero empirical error. With this upper bound on $k$, the term $M\sqrt{\log(k)/T}$ is small and hence not included in the objective.

BOOSTSM$(S = ((x_t, y_t))_{t=1}^{T})$
1   $f_0 \leftarrow 0$
2   for $k \leftarrow 1$ to $K$ do
3       $j \leftarrow \operatorname{argmin}_j \big(\epsilon_{k,j} + \lambda\,\mathrm{sgn}(w_j)\big)$
4       $j' \leftarrow \operatorname{argmin}_{j'} \big(\delta_{k,j'} + \lambda\,\mathrm{sgn}(w'_{j'})\big)$
5       if $\epsilon_{k,j} + \lambda\,\mathrm{sgn}(w_j) \le \delta_{k,j'} + \lambda\,\mathrm{sgn}(w'_{j'})$ then
6           $\eta_k \leftarrow \operatorname{argmin}_{\eta} F(w + \eta e_j, w')$
7           $f_k \leftarrow f_{k-1} + \eta_k \Phi_j$
8       else $\eta_k \leftarrow \operatorname{argmin}_{\eta} F(w, w' + \eta e_{j'})$
9           $f_k \leftarrow f_{k-1} + \eta_k s_{\cdot, j'}$
10  return $f_K$

Figure 1: Pseudocode of the BOOSTSM algorithm. On lines 3 and 4, two candidates are selected to be added to the ensemble: a state trajectory with index $j'$ or a tree-based predictor with index $j$. Both of these minimize the corresponding subgradients within their family of weak learners.
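The following is a compact, hypothetical rendering of Figure 1 for the squared loss. It assumes the weak learners have been pre-evaluated into matrices `Phi` (tree outputs $\Phi_j(x_t)$) and `S` (state trajectories $s_{t,j'}$), both names ours. It also departs from Figure 1 in two named ways: the line search of lines 6 and 8 is done in closed form (soft-thresholding, exact for an $\ell_1$-penalized quadratic), and the coordinate is picked by exact objective decrease rather than by the subgradients of lines 3 and 4, a standard variant of greedy coordinate descent:

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding: exact minimizer step for an l1-penalized quadratic."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def boostsm(Phi, S, y, lam=0.1, K=100):
    """Greedy coordinate descent on the convex objective (10).

    Phi: (T, N) array of tree outputs; S: (T, N') array of state trajectories;
    y: (T,) array of targets. Returns the tree and trajectory weight vectors.
    """
    T = len(y)
    X = np.hstack([Phi, S])                 # one common pool of weak learners
    w = np.zeros(X.shape[1])
    a = (X ** 2).mean(axis=0)               # curvature a_j = (1/T) ||x_j||^2
    for _ in range(K):
        resid = y - X @ w
        c = (X * resid[:, None]).mean(axis=0)   # c_j = (1/T) <resid, x_j>
        b = c + a * w                            # partial-residual correlation
        # Closed-form single-coordinate minimizer of (10).
        w_cand = soft_threshold(b, lam / 2.0) / np.maximum(a, 1e-12)
        delta = w_cand - w
        # Exact decrease of (10) if only coordinate j moves to its candidate.
        decrease = 2 * c * delta - a * delta ** 2 \
                   - lam * (np.abs(w_cand) - np.abs(w))
        j = int(np.argmax(decrease))
        if decrease[j] <= 1e-12:
            break                                # no coordinate improves (10)
        w[j] = w_cand[j]
    return w[: Phi.shape[1]], w[Phi.shape[1]:]
```

With exact line search, both this selection rule and the subgradient rule of Figure 1 are greedy coordinate descent on the convex objective (10), so the linear convergence result of Luo and Tseng [1992] invoked in the next paragraph applies to either variant.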
Subgradients are defined in (11) below. The candidate with the smallest subgradient is added to the ensemble. The weight of the new ensemble member is found via line search (lines 6 and 8). The regularization penalty is directly derived from the bounds on the expected sequential covering numbers of $\mathcal{H}$ given in Appendix B, in Lemma 4 and Lemma 5.

Observe that (10) is a convex objective function. Our BOOSTSM algorithm is defined by the application of coordinate descent to this objective. Figure 1 gives its pseudocode. The algorithm proceeds in $K$ rounds. At each round, we either add a new predictor tree or a new state-space trajectory to the model, depending on which results in the greater decrease in the objective. In particular, we use the following definitions:

$$\epsilon_{k,j} = \frac{1}{T} \sum_{t=1}^{T} \big(y_t - f_{k-1}(x_t, s_t)\big)\,\Phi_j(x_t), \qquad \delta_{k,j} = \frac{1}{T} \sum_{t=1}^{T} \big(y_t - f_{k-1}(x_t, s_t)\big)\,s_{t,j}. \tag{11}$$

The subgradient in the tree-space direction $j$ at round $k$ is given by $\epsilon_{k,j} + \lambda\,\mathrm{sgn}(w_{k,j})$. We use the notation $w_k$ to denote the tree-space parameter vector after $k - 1$ rounds. Similarly, the subgradient in the trajectory-space direction $j'$ is given by $\delta_{k,j'} + \lambda\,\mathrm{sgn}(w'_{k,j'})$, where $w'_k$ represents the trajectory-space parameter vector after $k - 1$ rounds. By standard results in optimization theory [Luo and Tseng, 1992], BOOSTSM admits a linear convergence guarantee.

6 Conclusion

We introduced a new family of models for forecasting non-stationary time series, Discriminative State-Space Models. This family includes existing generative models and other state-based discriminative models (e.g. RNNs) as special cases, but also covers several novel algorithmic solutions explored in this paper. We presented an analysis of the problem of learning DSSMs in the most general setting of non-stationary stochastic processes and proved finite-sample data-dependent generalization bounds. These learning guarantees are novel even for traditional state-space models, since the existing guarantees are only asymptotic and require the model to be correctly specified. We fully analyzed the complexity of several DSSMs that are useful in practice. Finally, we also studied the generalization guarantees of several structural risk minimization approaches to this problem and provided an efficient implementation of one such algorithm, which is based on a convex objective. We report some promising preliminary experimental results in Appendix D.

Acknowledgments

This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662, as well as a Google Research Award.

References

Rakesh D. Barve and Philip M. Long. On the complexity of learning from drifting distributions. In COLT, 1996.

Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.

George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.

Peter J. Brockwell and Richard A. Davis. Time Series: Theory and Methods. Springer-Verlag, New York, 1986.

J. J. F. Commandeur and S. J. Koopman. An Introduction to State Space Time Series Analysis. OUP Oxford, 2007.

Corinna Cortes, Giulia DeSalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Multi-armed bandits with non-stationary rewards. CoRR, abs/1710.10657, 2017.

Victor H. De la Peña and Evarist Giné. Decoupling: From Dependence to Independence: Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and its Applications. Springer, New York, 1999.

J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford Statistical Science Series. OUP Oxford, 2012.

Robert Engle.
Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987-1007, 1982.

James D. Hamilton. Time Series Analysis. Princeton University Press, 1994.

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82(Series D), 1960.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, 2014.

Vitaly Kuznetsov and Mehryar Mohri. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems 28, pages 541-549, 2015.

Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and on-line learning. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, 2016.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93-117, 2017.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, 1991.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1987.

Zhi-Quan Luo and Paul Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015a.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. JMLR, 16(1), January 2015b.

Alexander Zimin and Christopher H. Lampert. Learning theory for conditional risk minimization. In AISTATS, 2017.