Online Time Series Prediction with Missing Data

Oren Anava (OANAVA@TX.TECHNION.AC.IL), Technion, Haifa, Israel
Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University, NY, USA
Assaf Zeevi (ASSAF@GSB.COLUMBIA.EDU), Columbia University, NY, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We consider the problem of time series prediction in the presence of missing data. We cast the problem as an online learning problem in which the goal of the learner is to minimize prediction error. We then devise an efficient algorithm for the problem, which is based on the autoregressive model, and does not assume any structure on the missing data nor on the mechanism that generates the time series. We show that our algorithm's performance asymptotically approaches the performance of the best AR predictor in hindsight, and corroborate the theoretical results with an empirical study on synthetic and real-world data.

1. Introduction

A time series is a sequence of signal observations, typically measured at uniform time intervals. Perhaps one of the most well-studied models for time series analysis and prediction is the autoregressive (AR) model. Roughly speaking, the AR model is based on the assumption that each observation can be represented as a (noisy) linear combination of some previous observations. This model has been successfully used in many real-world applications such as DNA microarray data analysis, stock market prediction, and noise cancellation, to name but a few.

Recently there has been growing interest in the problem of time series prediction in the presence of missing data, mainly in the proper learning setting, in which an underlying model is assumed and the goal is to recover its parameters. Most of the current work relies on statistical assumptions on the error terms, such as independence and Gaussian distribution. These assumptions allow the use of Maximum Likelihood (ML) techniques to recover consistent (and sometimes optimal) estimators for the model parameters. However, these assumptions are often not met in practice, causing the resulting estimators to be no longer consistent. Occasionally, additional assumptions on the structure of the missing data are added, and the statistical modeling becomes even more distant from the data.

In this paper we argue that assumptions on the model generating the time series and on the structure of its missing data can be relaxed in a substantial manner while still supporting the development of efficient methods to solve the problem. Our main contribution is a novel online learning approach for time series prediction with missing data, which allows the observations (along with the missing data) to be arbitrarily or even adversarially generated. The goal of this paper is to show that the new approach is theoretically more robust, and is thus capable of coping with a wider range of time series and missing data structures.

1.1. Informal Statement of Our Results

We cast the problem of AR prediction in the presence of missing data as an online learning problem. A major modeling challenge arises as AR prediction is not well defined when some of the data is missing. To overcome this issue, we define a new family of AR predictors; each such predictor makes use of its own past predictions to fill in the missing data, and then provides an AR prediction using the completed data.
To be slightly more formal, a predictor in this family has the following recursive form:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) \;=\; \sum_{k=1}^{p} \alpha_k X_{t-k}\,\mathbf{1}\{X_{t-k}\ \text{is revealed}\} \;+\; \sum_{k=1}^{p} \alpha_k \tilde{X}^{\mathrm{REC}}_{t-k}(\alpha)\,\mathbf{1}\{X_{t-k}\ \text{is not revealed}\},$$

where $X_t$ is the signal measured at time point $t$, and $\alpha \in \mathbb{R}^p$ is the vector of AR coefficients. Now, let $\ell_t(X_t, \tilde{X}_t)$ denote the loss suffered by predicting $\tilde{X}_t$ at time point $t$, and let $R_T$ be the corresponding regret term. Then, our main theorem is the following:

Theorem 3.1. Our algorithm generates an online sequence of predictions $\{\tilde{X}_t\}_{t=1}^{T}$, for which it holds that:

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t\big)\,\mathbf{1}\{X_t\ \text{is revealed}\} \;-\; \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha^\star)\big)\,\mathbf{1}\{X_t\ \text{is revealed}\} \;=\; O\big(\sqrt{F}\big),$$

where $F$ denotes the number of time points in which our algorithm received feedback, and $\alpha^\star$ is the minimizer of $\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\ \text{is revealed}\}$.

This result is somewhat surprising, as even the problem of finding the best AR predictor in hindsight is non-convex due to the recursive structure of the predictors we consider. The key observation that enables an efficient solution in this scheme relies on non-proper learning: it is possible to learn coefficients in a much larger class and compete against the best AR predictor. The complexity of this class is in fact exponential in the parameters of our problem, yet we prove that the learning of the new coefficients can be done efficiently due to some special characteristics. This idea was also successfully applied in the work of (Hazan et al., 2015), who considered the problem of low-rank classification with missing data.
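To make the recursive predictor above concrete, the following is a minimal sketch of how such predictions can be computed, assuming the signal is stored in a NumPy array with missing observations encoded as NaN (the function and variable names are illustrative and not taken from the paper):

```python
import numpy as np

def recursive_ar_predictions(x, alpha):
    """Sketch of the recursive predictor: missing observations (NaN) are
    replaced by the predictor's own past predictions before the AR(p)
    combination is formed."""
    alpha = np.asarray(alpha)
    p, T = len(alpha), len(x)
    proxy = np.zeros(T)   # completed series used inside the recursion
    preds = np.zeros(T)   # recursive prediction for every time point
    for t in range(T):
        window = proxy[max(0, t - p):t][::-1]           # proxy values at t-1, ..., t-p
        preds[t] = np.dot(alpha[:len(window)], window)
        # revealed observations enter the proxy as-is; missing ones are
        # filled with the prediction that was just made for them
        proxy[t] = x[t] if not np.isnan(x[t]) else preds[t]
    return preds
```

In words, a missing observation influences later predictions only through the value the predictor itself assigned to it, which is exactly the recursion above.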
1.2. Related Work

Several different approaches for time series prediction in the presence of missing data exist; we overview the major ones. Perhaps the earliest approach, originating in the control theory community, goes back to the work of (Kalman, 1960). In this work, the concept of state-space modeling is presented to deal with streams of noisy input data. Although the original work is not aimed at coping with missing data, it initiated a solid line of work that uses state-space modeling to impute missing data (Shumway & Stoffer, 1982; Sinopoli et al., 2004; Liu & Goldsmith, 2004; Wang et al., 2005). We refer the reader to (Durbin & Koopman, 2012) for a complete overview of time series analysis using state-space models.

Another increasingly common approach builds upon the concept of multiple imputation (Honaker & King, 2010). Roughly speaking, multiple imputation aims at inferring relevant information from the observed data (using known statistical methods) and using it to impute multiple values for each missing data point. The resulting multiple data sets are then treated as complete, which allows the use of standard statistical methods for the analysis task. Results from the different data sets are then combined using various simple procedures.

In the statistical literature, missing data are usually imputed using maximum likelihood estimators corresponding to a specific underlying model. Very often, these estimators cannot be computed efficiently, which motivates the use of Expectation Maximization (EM) algorithms. This approach was proposed by (Dempster et al., 1977), and is currently the most popular for dealing with missing data in time series. Essentially, the EM algorithm avoids the separate treatment of each of the exponentially many missing data patterns by using the following two-step procedure: in the E-step, missing observations are filled in with their conditional expectations given the observed data and the current estimate of the model parameters; and in the M-step, a new estimate of the model parameters is computed from the current version of the completed data. We note that the vast majority of the state-space modeling literature relies on EM techniques as well (for instance, see (Shumway & Stoffer, 1982; Sinopoli et al., 2004)).

One particular time series model that has received a great deal of attention in the statistical literature is the AR model. In the context of missing data, there are many works that assume this underlying model and differ in the assumptions on the missing data patterns: in (Dunsmuir & Robinson, 1981), a stochastic mechanism is assumed to generate the missing data; (Ding et al., 2010) consider a scarce pattern of the observed signal; and (Choong et al., 2009) rely on local similarity structures in the data. Distributional assumptions on the time series or on the patterns of the missing data are common to all of these works. To date, we are not aware of an approach that allows the signal (along with its missing data) to be generated arbitrarily, let alone adversarially. The only approach that allows for adversarially generated signals we are aware of is the recent result of (Anava et al., 2013), which does not account for missing data. Our work can be seen as an extension of the latter to the missing data setting.

2. Preliminaries and Model

A time series is a sequence of signal observations, measured at successive time points (usually spaced at uniform intervals). Let $X_t$ denote the signal measured at time $t$. The traditional AR model, parameterized by a lag $p$ and a coefficient vector $\alpha \in \mathbb{R}^p$, assumes that each observation complies with the formula

$$X_t = \sum_{k=1}^{p} \alpha_k X_{t-k} + \epsilon_t,$$

where $\{\epsilon_t\}_{t\in\mathbb{Z}}$ is assumed to be white noise. In words, the model assumes that $X_t$ is a noisy linear combination of the previous $p$ observations. Sometimes, an additional additive term $\alpha_0$ is included to indicate drift, but we will ignore this for simplicity. Notice that this does not increase the complexity of the problem, since we can simply raise the dimension of the vector $\alpha$ by one and assign the value 1 to the corresponding observation.

The motivation to use AR(p) models for signal prediction goes back to Wold's decomposition theorem. According to this theorem, a stationary signal $\{X_t\}_{t\in\mathbb{Z}}$ can be represented as an MA($\infty$) process. That is,

$$X_t = \sum_{i=1}^{\infty} \beta_i \epsilon_{t-i} + \epsilon_t,$$

where $\sum_{i=1}^{\infty} \beta_i^2 < \infty$, and $\{\epsilon_t\}_{t\in\mathbb{Z}}$ have zero mean and equal variance. If, in addition, $\{X_t\}_{t=1}^{T}$ is assumed to be invertible, we can represent it as an AR($\infty$) process. That is,

$$X_t = \sum_{i=1}^{\infty} \alpha_i X_{t-i} + \epsilon_t,$$

where $\{\alpha_i\}_{i=1}^{\infty}$ are uniquely defined. This representation, accompanied by the natural assumption that $\alpha_i$ decays fast enough as a function of $i$, motivates the use of AR(p) models for the task of signal prediction.
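As a small illustration of the AR(p) model equation above, the following sketch generates such a series from given coefficients with i.i.d. Gaussian noise (the noise level, seed, and names are arbitrary illustrative choices):

```python
import numpy as np

def simulate_ar(alpha, T, sigma=0.3, seed=0):
    """Generate X_t = sum_{k=1}^p alpha_k X_{t-k} + eps_t with eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha)
    p = len(alpha)
    x = np.zeros(T + p)                      # p zeros serve as the initial history
    for t in range(p, T + p):
        history = x[t - p:t][::-1]           # X_{t-1}, ..., X_{t-p}
        x[t] = np.dot(alpha, history) + rng.normal(0.0, sigma)
    return x[p:]
```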
2.1. The Online Setting for AR Prediction

Having motivated the use of AR models, we turn to the question of misspecification: what happens if we employ a model that does not comply with our data? Standard statistical methods (e.g., maximum likelihood) for estimating the AR coefficients are based on the assumption that the observations come from an AR model, and are thus not suitable when the model is misspecified. This drives the use of online learning based techniques to circumvent this issue.

Online learning is a well-established learning paradigm with both theoretical and practical appeal. The goal in this paradigm is to make sequential predictions, where the data, rather than being generated stochastically, is assumed to be chosen by an adversary that has full knowledge of our learning algorithm (see for instance (Cesa-Bianchi & Lugosi, 2006)). Specifically, the following setting is usually assumed in the context of time series prediction: at time point $t$, we need to make a prediction $\tilde{X}_t$, after which the true value of the signal $X_t$ is revealed, and we suffer a loss denoted by $\ell_t(X_t, \tilde{X}_t)$. Usually, $\ell_t$ is assumed to be convex with Lipschitz gradients.

When considering AR(p) prediction, we must define in advance the decision set $K \subset \mathbb{R}^p$, which stands for the class of AR coefficients against which we want to compete. We henceforth let $K = [-1, 1]^p$. Our prediction at time point $t$ then takes the form

$$\tilde{X}^{\mathrm{AR}}_t(\alpha^t) = \sum_{k=1}^{p} \alpha^t_k X_{t-k}, \qquad (1)$$

where $\alpha^t \in K$ is generated by our online algorithm. Here comes the punch of the online setting: our goal is to design an algorithm that generates a sequence $\{\alpha^t\}_{t=1}^{T}$ which is almost as good as the best (in hindsight) AR coefficients in $K$. More formally, we define the regret to be

$$R_T = \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big) - \min_{\alpha\in K}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha)\big),$$

and wish to design efficient algorithms whose regret grows sublinearly in $T$. Thus, even if the model is misspecified (meaning the best AR coefficients in $K$ have unsatisfactory predictive power), minimizing the regret term is still meaningful. There remains the question of how we can compete against the best AR coefficients in the presence of missing data; the latter is the main question we try to answer in this work.

2.2. Problem Definition

Throughout this work we consider the following setting (which accounts for missing data): at time point $t$, we need to make a prediction $\tilde{X}_t$, after which feedback in the form of the real signal $X_t$ is not necessarily revealed, and we suffer a loss denoted by $\ell_t(X_t, \tilde{X}_t)\,\mathbf{1}\{X_t\ \text{is revealed}\}$. That is, we suffer loss only if we receive feedback. Here also, $\ell_t$ is assumed to be convex with Lipschitz gradients. The problem arising in this setting is two-fold: first, we cannot provide an AR prediction at time $t$ (even given the vector $\alpha$) if some of the required past observations are missing. Second, the best AR predictor in hindsight is not well defined. To solve this problem, we define a family of recursive predictors, each of the form

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{p} \alpha_k X_{t-k}\,\mathbf{1}\{X_{t-k}\ \text{is revealed}\} + \sum_{k=1}^{p} \alpha_k \tilde{X}^{\mathrm{REC}}_{t-k}(\alpha)\,\mathbf{1}\{X_{t-k}\ \text{is not revealed}\}, \qquad (2)$$

where $\alpha \in K = [-1, 1]^p$. Essentially, a predictor in this family uses its own (updated) estimations as a proxy for the actual signal, and then provides an AR(p) prediction as if there were no missing data. The problem at hand then translates into minimizing the corresponding regret term.

2.3. Our Assumptions

Throughout the remainder of this work we assume that the following hold:

(1) $X_t \in [-1, 1]$ for all $t$. Here, the constant 1 can be replaced by any other constant $C < \infty$ (which is known in advance), but in order to simplify the writing we assume that $C = 1$.

(2) For all $t$ there exist at least $p$ successive time points in $\{t-d, \ldots, t-1\}$ for which we received feedback. This assumption makes sure that each prediction $\tilde{X}^{\mathrm{REC}}_t(\alpha)$ does not look back more than $d$ time points. This assumption can be completely removed, as discussed in Section 3.4.

Note that we do not assume an underlying AR(p) model that generates the signal, nor a statistical model according to which the missing observations are omitted from us.
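As a small, self-contained illustration of assumption (2), the sketch below checks, for a given 0/1 reveal mask, whether every length-$d$ history window contains at least $p$ successive revealed time points (the function name and mask encoding are illustrative):

```python
def assumption_2_holds(revealed, p, d):
    """revealed[t] = 1 iff feedback was received at time t (0/1 sequence).
    Checks that every window {t-d, ..., t-1} contains a run of at least p
    consecutive revealed time points."""
    T = len(revealed)
    for t in range(d, T):
        run, best = 0, 0
        for r in revealed[t - d:t]:
            run = run + 1 if r == 1 else 0
            best = max(best, run)
        if best < p:
            return False
    return True
```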
3. Our Approach

We briefly outline our approach to the problem at hand. Basically, observe that the prediction at time point $t$ can be written in the following form:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{d} p_k(\alpha)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha)$ is a polynomial in $\alpha$ that is determined by the structure of the missing data. Each polynomial $p_k$ potentially contains up to $2^{k-1}$ terms of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, such that $\sum_{m=1}^{j} i_m = k$. This means that the prediction at time point $t$ consists of up to $2^d$ terms of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, such that for each of them it holds that $\sum_{m=1}^{j} i_m \le d$. Notice that each such term is less than or equal to 1 in absolute value for the $\alpha$ that is the best recursive AR(p) predictor in hindsight, since we consider $K = [-1, 1]^p$.

The latter observation allows the use of non-proper learning techniques in this setting. Essentially, the idea is to learn a vector $w \in \mathbb{R}^{2^d}$ such that each entry in $w$ corresponds to a product of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, while ignoring the restrictions imposed on $w$ by $\alpha$. Obtaining a regret bound w.r.t. the best $w$ would imply a regret bound w.r.t. the best $\alpha$, yet an efficiency question would remain since the dimension of $w$ is $2^d$. In Section 3.3 we prove that the inner products in the space induced by this relaxation can be computed efficiently, which overall gives an efficient algorithm with a provable regret bound.

3.1. Notation and Definitions

We denote by $[n]$ the set $\{1, \ldots, n\}$, and use $X_{(t-d:t-1)} \in \{\mathbb{R} \cup \{*\}\}^d$ to denote the vector of values $X_{t-d}, \ldots, X_{t-1}$, where missing observations are encoded using $\{*\}$. The notation $\mathbf{1}\{X_t\}$ will be used as the indicator of the event $\{X_t\ \text{is revealed}\}$. Finally, denote

$$B_d = \Big\{\, w \in \mathbb{R}^{2^d-1} \;:\; \|w\|_2^2 \le 2^d \,\Big\}.$$

We say that the vector $b = (b_1, \ldots, b_d) \in \{0, 1\}^d$ is the binary representation of a number $n$ if $n = \sum_{k=1}^{d} b_k 2^{k-1}$. We denote by $b(n)$ the unique binary representation of $n$. For a given number $n$ and its binary representation $b(n) = (b_1, \ldots, b_d)$, we define $m(n)$ to be the maximal index $k$ such that $b_k = 1$; that is, $m(n) = \max\{k : 2^{k-1} \le n\}$.

The definition below links a structure of missing observations to a binary vector $b$. This, in turn, will be used to link a vector $w \in \mathbb{R}^{2^d-1}$ to a structure of missing observations.

Definition 1. Let $b = (b_1, \ldots, b_d) \in \{0, 1\}^d$ and set $m = \max\{k \,|\, b_k = 1\}$. We say that $b$ is a semi-valid path with respect to $X_{(t-d:t-1)}$ (and denote $b \sqsubseteq_{sv} X_{(t-d:t-1)}$) if $b_m = \mathbf{1}\{X_{t-m}\} = 1$ and $b_i \le 1 - \mathbf{1}\{X_{t-i}\}$ for all $i < m$. If, in addition, $(b_1, \ldots, b_m)$ does not contain $p$ successive zeros, we say that $b$ is a valid path with respect to $X_{(t-d:t-1)}$ (and denote $b \sqsubseteq_{v} X_{(t-d:t-1)}$).

Basically, each structure of missing data corresponds to many valid paths; each of these corresponds to a coefficient of a revealed signal value in Equation (2). To see that, note that

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{d} p_k\big(\alpha \,\big|\, X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha \,|\, X_{(t-d:t-1)})$ is a polynomial in $\alpha$ that is determined by the structure of the missing data in $X_{(t-d:t-1)}$.

Definition 2. For a given vector $X_{(t-d:t-1)}$, we define the function $\Phi(X_{(t-d:t-1)}) \in \mathbb{R}^{2^d-1}$ as follows:

$$\big[\Phi(X_{(t-d:t-1)})\big]_n = \begin{cases} X_{t-m(n)} & \text{if } b(n) \sqsubseteq_{sv} X_{(t-d:t-1)}, \\ 0 & \text{otherwise.} \end{cases}$$

Here, $\big[\Phi(X_{(t-d:t-1)})\big]_n$ denotes the $n$-th coordinate of the vector $\Phi(X_{(t-d:t-1)})$. Similarly, we define an auxiliary function $\Phi^p$ for valid paths:

$$\big[\Phi^p(X_{(t-d:t-1)})\big]_n = \begin{cases} X_{t-m(n)} & \text{if } b(n) \sqsubseteq_{v} X_{(t-d:t-1)}, \\ 0 & \text{otherwise.} \end{cases}$$
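The following brute-force sketch spells out these definitions for small $d$. It enumerates the binary representations $b(n)$, checks semi-validity and validity against a 0/1 reveal mask, and builds $\Phi$ (or $\Phi^p$) coordinate by coordinate. All names are illustrative, and the construction is exponential in $d$; it is only meant to make the definitions concrete (Section 3.3 shows that inner products of these vectors can be computed in O(d) without ever forming them).

```python
import numpy as np

def binary_rep(n, d):
    """b(n) = (b_1, ..., b_d) with n = sum_k b_k * 2^(k-1)."""
    return [(n >> (k - 1)) & 1 for k in range(1, d + 1)]

def m_of(b):
    """m(n): the maximal (1-based) index k with b_k = 1."""
    return max(k + 1 for k, bit in enumerate(b) if bit == 1)

def is_semi_valid(b, revealed):
    """revealed[k-1] = 1 iff X_{t-k} is revealed. A path is semi-valid iff its
    last one sits on a revealed entry and every earlier one sits on a missing entry."""
    m = m_of(b)
    if revealed[m - 1] != 1:
        return False
    return all(b[i] == 0 or revealed[i] == 0 for i in range(m - 1))

def is_valid(b, revealed, p):
    """Valid = semi-valid and no p successive zeros in (b_1, ..., b_m)."""
    if not is_semi_valid(b, revealed):
        return False
    run = 0
    for bit in b[:m_of(b)]:
        run = run + 1 if bit == 0 else 0
        if run >= p:
            return False
    return True

def phi(x_window, revealed, p=None, valid_only=False):
    """Brute-force Phi (or Phi^p, if valid_only=True) in R^(2^d - 1);
    x_window[k-1] holds X_{t-k}."""
    d = len(x_window)
    out = np.zeros(2 ** d - 1)
    for n in range(1, 2 ** d):
        b = binary_rep(n, d)
        ok = is_valid(b, revealed, p) if valid_only else is_semi_valid(b, revealed)
        if ok:
            out[n - 1] = x_window[m_of(b) - 1]
    return out
```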
From now on, we use the notation $\tilde{X}_t(w)$ for predictions of the form $w^\top \Phi(X_{(t-d:t-1)})$, and $\tilde{X}^p_t(w)$ for predictions of the form $w^\top \Phi^p(X_{(t-d:t-1)})$.

3.2. Algorithm and Analysis

We take a step back to present an online algorithm, LAZY OGD (Algorithm 1), which is aimed at minimizing regret in the general online learning framework. LAZY OGD is a special instance of the FTRL algorithm when $K$ is an $\ell_2$-ball with radius $D$.

Algorithm 1 LAZY OGD (on the $\ell_2$-ball with radius $D$)
1: Input: learning rate $\eta_t$.
2: Set $a_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Play $a_t$ and incur loss $f_t(a_t)$.
5:   Set $a_{t+1} = \dfrac{-\eta_t \sum_{i=1}^{t} \nabla f_i(a_i)}{\max\big\{1,\ \frac{\eta_t}{D}\big\|\sum_{i=1}^{t} \nabla f_i(a_i)\big\|\big\}}$.
6: end for

Let $\{f_t\}_{t=1}^{T}$ be convex loss functions for which $\max_{a,t} \|\nabla f_t(a)\| \le G$, and denote $a^\star = \arg\min_{\|a\|\le D} \sum_{t=1}^{T} f_t(a)$. Then, the regret of LAZY OGD is bounded as follows:

$$\sum_{t=1}^{T} f_t(a_t) - \min_{\|a\|\le D}\sum_{t=1}^{T} f_t(a) \;\le\; \sum_{t=1}^{T}\eta_t\big\|\nabla f_t(a_t)\big\|^2 + D\sqrt{\sum_{t=1}^{T}\big\|\nabla f_t(a_t)\big\|^2} \;\le\; 3GD\sqrt{T},$$

where the last inequality holds for the choice $\eta_t = D/(G\sqrt{t})$. A complete analysis can be found in (Hazan, 2011; Shalev-Shwartz, 2012).
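A minimal sketch of LAZY OGD in this form, assuming gradient oracles for the losses and a fixed learning rate (the names and the oracle interface are illustrative):

```python
import numpy as np

def lazy_ogd(grad_oracles, dim, D, eta):
    """Minimal sketch of LAZY OGD on the l2-ball of radius D.

    grad_oracles[t](a) returns the gradient of f_t at a. A fixed learning
    rate eta is used for simplicity (the text allows a time-varying eta_t)."""
    a = np.zeros(dim)
    grad_sum = np.zeros(dim)
    played = []
    for grad in grad_oracles:
        played.append(a.copy())                      # play a_t
        grad_sum += grad(a)                          # accumulate grad f_t(a_t)
        # lazy projection of -eta * grad_sum onto the l2-ball of radius D
        a = -eta * grad_sum / max(1.0, eta * np.linalg.norm(grad_sum) / D)
    return played
```

For instance, for squared losses of the form $f_t(a) = (a^\top x_t - y_t)^2$, the oracle for round $t$ would return $2(a^\top x_t - y_t)\,x_t$.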
Our algorithm (Algorithm 2) is an adaptation of LAZY OGD to $B_d$, but notice that in its current form it is inefficient (since the dimension of $B_d$ is exponential in $d$). This form is easier to analyze and is thus stated here; an efficient version of Algorithm 2 is presented in Section 3.3.

Algorithm 2
1: Input: learning rate $\eta_t$.
2: Set $w_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Play $w_t$ and incur $f_t(w_t) = \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\}$.
5:   Set $w_{t+1} = \dfrac{-\eta_t \sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)}{\max\big\{1,\ \eta_t\, 2^{-d/2}\big\|\sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)\big\|\big\}}$.
6: end for

In the sequel, we denote $D = \max_{w\in B_d} \|w\|$ and $G = \max_{w,t} \|\nabla_w \ell_t(X_t, \tilde{X}_t(w))\|$. From the definition of $B_d$ it follows directly that $D = 2^{d/2}$. The value of $G$ depends on the loss functions considered; for instance, if we consider the squared loss then $G = 2^{d/2}$. The following is our main theorem:

Theorem 3.1. Algorithm 2 generates an online sequence $\{w_t\}_{t=1}^{T}$ for which it holds that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} \;-\; \min_{\|\alpha\|_\infty \le 1} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}},$$

if we choose $\eta_t = D \big/ \big(G\sqrt{\sum_{\tau=1}^{t}\mathbf{1}\{X_\tau\}}\big)$.

Proof. Algorithm 2 is simply LAZY OGD on the decision set $B_d$. Thus, by applying it we get that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} - \min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}}. \qquad (3)$$

Now, we can write

$$\min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\,\mathbf{1}\{X_t\} \;\overset{(a)}{\le}\; \min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w)\big)\,\mathbf{1}\{X_t\} \;\overset{(b)}{\le}\; \min_{\|\alpha\|_\infty\le 1}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\}, \qquad (4)$$

where $\tilde{X}^p_t(w)$ is of the form $w^\top \Phi^p(X_{(t-d:t-1)})$. The two key inequalities above are explained as follows.

To prove (a), note that from the construction of $\Phi$ and $\Phi^p$ it follows that for any $t$, $\Phi^p(X_{(t-d:t-1)})$ can be written as

$$\big[\Phi^p(X_{(t-d:t-1)})\big]_n = \begin{cases} \big[\Phi(X_{(t-d:t-1)})\big]_n & \text{if } n \in N, \\ 0 & \text{otherwise,} \end{cases}$$

where $N$ is the set of all numbers $n \le 2^d - 1$ for which there exists $X \in \{\mathbb{R} \cup \{*\}\}^d$ such that $b(n) \sqsubseteq_v X$. In particular, if we denote

$$w^\star = \arg\min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w)\big)\,\mathbf{1}\{X_t\},$$

then w.l.o.g. we can assume that $w^\star_n = 0$ for all $n \notin N$. Now, note that for $w^\star$ it holds that $\tilde{X}^p_t(w^\star) = \tilde{X}_t(w^\star)$ for all $t$, which implies that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w^\star)\big)\,\mathbf{1}\{X_t\} = \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w^\star)\big)\,\mathbf{1}\{X_t\}.$$

Finally, since $\min_{w\in B_d} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\mathbf{1}\{X_t\} \le \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w^\star)\big)\mathbf{1}\{X_t\}$, the claim holds.

Now, to prove (b), let us first denote

$$\alpha^\star = \arg\min_{\|\alpha\|_\infty\le 1}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\}.$$

Next, for a given vector $b \in \{0, 1\}^d$ we define $I(b) \in [d]^d$ as follows:

$$\big[I(b)\big]_i = \begin{cases} i - \max\{j \,:\, j < i \text{ and } b_j = 1\} & \text{if } b_i = 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $\max\{j : j < i \text{ and } b_j = 1\} = 0$ if no such $j$ exists. In words, $I(b)$ is a vector holding, for each one in $b$, the distance to the nearest one to its left. For instance, $I((0, 1, 0, 1)) = (0, 2, 0, 2)$ and $I((0, 0, 1, 1)) = (0, 0, 3, 1)$. Now, consider the following construction of $w$:

$$w_n = \begin{cases} \prod_{i\,:\,b(n)_i=1} \alpha^\star_{[I(b(n))]_i} & \text{if } n \in N, \\ 0 & \text{otherwise.} \end{cases}$$

Our claim is that for the construction above, it holds that $\tilde{X}^p_t(w) = \tilde{X}^{\mathrm{REC}}_t(\alpha^\star)$ for all $t$. Let us look first at $\tilde{X}^{\mathrm{REC}}_t(\alpha^\star)$. By definition, it can be recursed until it encounters $p$ successive revealed observations, and thus (by assumption (2)) it looks back at most $d$ time points. Thus, it has the following structure:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha^\star) = \sum_{k=1}^{d} p_k\big(\alpha^\star \big| X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha^\star | X_{(t-d:t-1)})$ is a polynomial consisting of sums of products of $\alpha^\star_1, \ldots, \alpha^\star_p$. In fact, the polynomial $p_k(\alpha^\star | X_{(t-d:t-1)})$ is a sum of elements, each of them of the form $\prod_{i:\,b_i=1} \alpha^\star_{[I(b)]_i}$, where $b$ is a valid path w.r.t. $X_{(t-d:t-1)}$ and, in addition, $k = \max\{i : b_i = 1\}$. This property follows directly from Definition 1 in Section 3.1. Thus, we can write

$$\begin{aligned}
\tilde{X}^{\mathrm{REC}}_t(\alpha^\star) &= \sum_{k=1}^{d} p_k\big(\alpha^\star \big| X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\} \\
&= \sum_{k=1}^{d} \sum_{b\in\mathcal{X}^v_t} \Big(\prod_{i:\,b_i=1} \alpha^\star_{[I(b)]_i}\Big)\, X_{t-k}\,\mathbf{1}\{k = \max\{i : b_i = 1\}\} \\
&= \sum_{n=1}^{2^d-1} \Big(\prod_{i:\,b(n)_i=1} \alpha^\star_{[I(b(n))]_i}\Big)\, X_{t-m(n)}\,\mathbf{1}\{b(n) \sqsubseteq_v X_{(t-d:t-1)}\} \\
&= \sum_{n=1}^{2^d-1} w_n \big[\Phi^p(X_{(t-d:t-1)})\big]_n = w^\top \Phi^p(X_{(t-d:t-1)}) = \tilde{X}^p_t(w),
\end{aligned}$$

where $\mathcal{X}^v_t$ is the set of all valid paths w.r.t. $X_{(t-d:t-1)}$. Moreover, since each $|w_n| \le 1$, we have $\|w\|_2^2 \le 2^d - 1$ and thus $w \in B_d$. This establishes the claim in (b). Now, plugging (4) into (3) gives the stated result.

3.3. Efficient Implementation of Algorithm 2

Algorithm 3 Efficient Implementation of Algorithm 2
1: Input: learning rate $\eta$.
2: Set $w_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $\tilde{X}_t$ and incur loss $\ell_t\big(X_t, \tilde{X}_t\big)\,\mathbf{1}\{X_t\}$.
5:   Set the next prediction: $\tilde{X}_{t+1} = \dfrac{-\eta \sum_{\tau=1}^{t} \mathrm{Err}_\tau\, K(t+1, \tau)}{\max\Big\{1,\ \eta\sqrt{2^{-d}\sum_{s=1}^{t}\sum_{\tau=1}^{t} \mathrm{Err}_s\,\mathrm{Err}_\tau\, K(s, \tau)}\Big\}}$.
6: end for

Theorem 3.1 ignores computational considerations. Next, we show that $\tilde{X}_t(w_t)$ can in fact be generated efficiently. For $t = 1$ we have that $w_1 = 0$ and thus $\tilde{X}_1(w_1) = 0$. Next, assume that $\{\tilde{X}_\tau(w_\tau)\}_{\tau=1}^{t}$ are efficiently generated, and prove the claim for $\tilde{X}_{t+1}(w_{t+1})$. Now, notice that if we denote

$$\nabla f_\tau(w_\tau) = \nabla\ell_\tau(w_\tau)\,\mathbf{1}\{X_\tau\} = \mathrm{Err}_\tau\, \Phi(X_{(\tau-d:\tau-1)}),$$

the above implies that $\mathrm{Err}_\tau$ is known for all $\tau \le t$. Thus,

$$\tilde{X}_{t+1}(w_{t+1}) = \Phi(X_{(t-d+1:t)})^\top w_{t+1} = \frac{-\eta \sum_{\tau=1}^{t} \Phi(X_{(t-d+1:t)})^\top \nabla f_\tau(w_\tau)}{\max\big\{1,\ \eta\, 2^{-d/2} \big\|\sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)\big\|\big\}} = \frac{-\eta \sum_{\tau=1}^{t} \mathrm{Err}_\tau\, K(t+1, \tau)}{\max\Big\{1,\ \eta\sqrt{2^{-d}\sum_{s=1}^{t}\sum_{\tau=1}^{t} \mathrm{Err}_s\,\mathrm{Err}_\tau\, K(s, \tau)}\Big\}},$$

where $K(s, \tau) = \Phi(X_{(s-d:s-1)})^\top \Phi(X_{(\tau-d:\tau-1)})$. The above is efficient to generate if and only if $K(s, \tau)$ is efficient to compute for all $s$ and $\tau$. This computation can be done in O(d) operations despite the fact that $\Phi(X_{(s-d:s-1)}) \in \mathbb{R}^{2^d-1}$. To see that, we first define an auxiliary function $c$ as follows:

$$\big[c(s,\tau)\big]_k = \big[c\big(X_{(s-d:s-1)}, X_{(\tau-d:\tau-1)}\big)\big]_k = \sum_{i=1}^{k-1} \big(1 - \mathbf{1}\{X_{s-i}\}\big)\big(1 - \mathbf{1}\{X_{\tau-i}\}\big).$$

Essentially, $[c(s, \tau)]_k$ counts the number of common missing observations among the first $k-1$ entries of $X_{(s-d:s-1)}$ and $X_{(\tau-d:\tau-1)}$. Now, notice that

$$K(s,\tau) = \Phi(X_{(s-d:s-1)})^\top \Phi(X_{(\tau-d:\tau-1)}) = \sum_{n=1}^{2^d-1} X_{s-m(n)}\, X_{\tau-m(n)}\,\mathbf{1}\{b(n) \sqsubseteq_{sv} X_{(s-d:s-1)}\}\,\mathbf{1}\{b(n) \sqsubseteq_{sv} X_{(\tau-d:\tau-1)}\} = \sum_{k=1}^{d} 2^{[c(s,\tau)]_k}\, X_{s-k}\, X_{\tau-k}\,\mathbf{1}\{X_{s-k}\}\,\mathbf{1}\{X_{\tau-k}\},$$

where the last equality follows from Definition 1.
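The following is a minimal sketch of this O(d) kernel computation, assuming each window is given as a NumPy array ordered from lag 1 to lag d with missing entries encoded as NaN (names are illustrative):

```python
import numpy as np

def kernel_K(window_s, window_tau):
    """O(d) evaluation of K(s, tau) = <Phi(X_(s-d:s-1)), Phi(X_(tau-d:tau-1))>.
    window_s[k-1] holds X_{s-k} (NaN if missing), and likewise for window_tau."""
    K = 0.0
    common_missing = 0              # [c(s, tau)]_k: common missing entries at lags < k
    for xs, xt in zip(window_s, window_tau):
        if not (np.isnan(xs) or np.isnan(xt)):
            K += (2.0 ** common_missing) * xs * xt    # 2^{[c(s,tau)]_k} X_{s-k} X_{tau-k}
        common_missing += int(np.isnan(xs) and np.isnan(xt))
    return K
```

For small d, the result can be checked against a brute-force inner product of the $\Phi$ vectors constructed as in Section 3.1.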
3.4. Some Extensions

We briefly discuss two issues: the removal of assumption (2), and the replacement of $B_d$ with a smaller ball. As mentioned before, assumption (2) makes sure that each prediction $\tilde{X}^{\mathrm{REC}}_t(\alpha)$ considers at most $d$ past observations. However, our algorithm is still applicable if this assumption does not hold. The theoretical guarantee then gets a different interpretation: we are almost as good as the best recursive AR(p) predictor, but now the recursion is limited to at most $d$ past observations. Essentially, this is equivalent to defining a family of recursive AR predictors with bounded memory, and competing against the predictors in this family.

As for the second issue, recall that the dimension of the decision set $B_d$ is exponential in $d$. This affects us only through the regret bound, as we proved earlier that the computations can be done efficiently in our setting. To mitigate the effect on the regret bound, we define

$$\hat{B}_d = \Big\{\, w\in\mathbb{R}^{2^d-1} \;:\; \|w\|_2^2 \le d \,\Big\}$$

and state the following corollary:

Corollary 3.2. Algorithm 2 (run over $\hat{B}_d$) generates an online sequence $\{w_t\}_{t=1}^{T}$ for which it holds that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} \;-\; \min_{\alpha\in K} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}},$$

where $K = \big\{\alpha\in\mathbb{R}^p : \sum_{i=1}^{p} |\alpha_i| \le 1\big\}$.

The proof follows by a simple calculation and is thus omitted here. Note that in the above $D = \sqrt{d}$, and $G$ is again determined by the selection of the loss functions. This case captures natural scenarios in which the effect of past observations decays as they become more distant.

4. Illustrative Examples

The following experiments demonstrate the effectiveness of the proposed algorithm under various synthetic settings. Due to lack of space, we defer the experimental results on real-world data to the supplementary material.

4.1. Baselines

Most of the works on time series with missing observations consider what we call the offline setting: given a time series that contains missing observations, compute the model parameters (in our case, the AR coefficients) and/or impute the missing data. Our online setting can be seen as a sequential offline setting, in which at time $t$ we are given the time series values up to time $t-1$ and our task is to predict the signal at time $t$. In light of this, we adapt the offline baselines presented below to the online setting. We note that this adaptation does not weaken the offline baselines in any way, and we use it only for comparison purposes.

Yule-Walker estimator. We use the well-known Yule-Walker estimator, and adapt it to our setting as follows: at first, we initialize some AR coefficients. At time $t$, at our disposal is the data seen up to time $t-1$, where missing observations are filled in by the corresponding past predictions of the algorithm. Then, the AR coefficients are computed by solving the Yule-Walker equations, and a prediction is made accordingly.

Expectation Maximization (EM). Our EM baseline is based on the algorithm originally proposed by (Shumway & Stoffer, 1982). Roughly speaking, the algorithm assumes an underlying state-space model and employs the Kalman filter to estimate its parameters. The estimation is done by maximizing the log-likelihood using iterative EM steps, and the resulting Kalman smoothed estimator is used to complete the missing observations.

ARLSIMPUTE. This algorithm was proposed by (Choong et al., 2009), originally to cope with missing observations in DNA microarray data. The algorithm is based on the following iterative method: at the first iteration, initialize all the missing observations to 0. Then, in every iteration compute an LS estimator for the AR coefficients and update the values of the missing observations by maximizing the log-likelihood.

OGDIMPUTE. We propose another algorithm for the problem at hand, denoted Algorithm 4. Basically, the algorithm applies the standard Online Gradient Descent algorithm to learn the AR coefficients, while filling missing observations with their past predictions. Whereas this algorithm is very fast and simple to implement, its downside is the lack of a theoretical guarantee.

Algorithm 4 OGDIMPUTE
1: Input: learning rate $\eta$.
2: Initialize $\alpha^1 = 0$, and set $X_t = 0$ for $t \le 0$.
3: for $t = 1$ to $T$ do
4:   Predict $\tilde{X}^{\mathrm{AR}}_t(\alpha^t) = \sum_{i=1}^{p} \alpha^t_i X_{t-i}$.
5:   Observe loss $\ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big)\,\mathbf{1}\{X_t\}$.
6:   If $\mathbf{1}\{X_t\} = 0$, then set $X_t = \tilde{X}^{\mathrm{AR}}_t(\alpha^t)$.
7:   Set $\alpha^{t+1} = \Pi_K\big(\alpha^t - \eta\,\nabla\ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big)\,\mathbf{1}\{X_t\}\big)$.
8: end for
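A minimal sketch of OGDIMPUTE under the squared loss $\ell_t(X_t, Y) = (X_t - Y)^2$, with $K = [-1, 1]^p$ so that the projection $\Pi_K$ amounts to coordinate-wise clipping; the squared loss, the NaN encoding, and all names are illustrative assumptions rather than choices stated in the paper:

```python
import numpy as np

def ogd_impute(x, p, eta):
    """Sketch of OGDIMPUTE (Algorithm 4) for an observed series x in which
    missing values are encoded as NaN."""
    T = len(x)
    alpha = np.zeros(p)
    filled = np.zeros(T + p)                 # X_t = 0 for t <= 0
    preds = np.zeros(T)
    for t in range(T):
        hist = filled[t:t + p][::-1]         # X_{t-1}, ..., X_{t-p} (index shifted by p)
        preds[t] = np.dot(alpha, hist)
        if np.isnan(x[t]):                   # no feedback: impute with the prediction
            filled[t + p] = preds[t]
        else:                                # feedback: suffer loss and take a gradient step
            filled[t + p] = x[t]
            grad = 2.0 * (preds[t] - x[t]) * hist
            alpha = np.clip(alpha - eta * grad, -1.0, 1.0)
    return preds, alpha
```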
4.2. Generating the Synthetic Data

To compare the performance of the proposed algorithms we design several different settings (presented below). In order to ensure the stability of the results we average them over 50 runs. For our algorithm, we used $d = 3p$ in all considered settings. In our tables, we mark the best results in bold font, and add an asterisk to indicate a significance level of 0.05.

Setting 1 (sanity check). We generate a time series using the coefficient vector $\alpha = [0.6, -0.5, 0.4, -0.4, 0.3]$ and i.i.d. noise terms that are distributed $N(0, 0.3^2)$. We then omit some of the data points in a random manner (that is, each data point is omitted with a certain probability).

Setting 2 (AR mixture). Our motivation in this setting is to examine the behavior of the different algorithms when faced with changing environments. Thus, we consider a predefined set of AR coefficients, and generate time series by alternating between them in a random manner. We add an additive noise term which is distributed $\mathrm{Uni}[-0.5, 0.5]$. Here also, the missing data is omitted in a random manner.

Setting 3 (heteroscedasticity). Here we test the robustness of the different algorithms to unequally distributed noise terms. Thus, we generate a time series using the coefficient vector $\alpha = [0.11, -0.5]$, and noise terms that are distributed normally, with expectation equal to the value of the previous noise term and variance $0.3^2$. This implies that consecutive noise terms are positively correlated. The missing data points are chosen randomly here as well.
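For illustration, data for Setting 1 could be generated along the following lines (the omission probability q, the seed, and all names are arbitrary illustrative choices):

```python
import numpy as np

def generate_setting1(T, q=0.1, seed=0):
    """Setting 1: AR(5) series with coefficients [0.6, -0.5, 0.4, -0.4, 0.3],
    N(0, 0.3^2) noise, and each point omitted independently with probability q."""
    rng = np.random.default_rng(seed)
    alpha = np.array([0.6, -0.5, 0.4, -0.4, 0.3])
    p = len(alpha)
    x = np.zeros(T + p)
    for t in range(p, T + p):
        x[t] = np.dot(alpha, x[t - p:t][::-1]) + rng.normal(0.0, 0.3)
    x = x[p:]
    observed = x.copy()
    observed[rng.random(T) < q] = np.nan     # omit each point with probability q
    return x, observed
```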
[Figure 1 omitted: for each of the three settings (sanity check, AR mixture, heteroscedasticity), one panel shows MSE vs. time for 10% missing data and a second panel shows MSE@2000 vs. the percentage of missing data (0%-20%), for Yule-Walker, ARLSIMPUTE, EM (Kalman filter), OGDIMPUTE, and our algorithm.]

Figure 1. Experimental results for synthetic data.

| Algorithm | Sanity check 0% | Sanity check 10% | Sanity check 20% | AR mixture 0% | AR mixture 10% | AR mixture 20% | Heterosced. 0% | Heterosced. 10% | Heterosced. 20% |
|---|---|---|---|---|---|---|---|---|---|
| Yule-Walker | 0.1156 | 0.1270 | 0.1427 | 0.1343 | 0.1476 | 0.1496 | 0.0986 | 0.1023 | 0.1059 |
| ARLSIMPUTE | **0.1058** | **0.1160** | 0.1336 | 0.1344 | 0.1424 | 0.1417 | 0.1063 | 0.1150 | 0.1276 |
| EM (Kalman filter) | 0.1065 | 0.1170 | **0.1301** | 0.1101 | 0.1189 | **0.1244** | 0.1042 | 0.1078 | 0.1205 |
| OGDIMPUTE | 0.1071 | 0.1175 | 0.1305 | 0.1396 | 0.1530 | 0.1568 | 0.0945 | 0.0984 | 0.1028 |
| Our algorithm | 0.1085 | 0.1212 | 0.1447 | **0.1072*** | **0.1168*** | 0.1260 | **0.0937** | **0.0979** | **0.1018** |

Table 1. Experimental results for synthetic data. Column headers give the percentage of missing data in each setting.

As evident in Figure 1 and Table 1, our online algorithm outperforms the other algorithms when the time series exhibits a complicated structure. In the case where the error terms are Gaussian and the time series complies with the AR model (sanity check), we can see that all algorithms perform roughly the same, as can be expected from the theoretical guarantees. We point out that whereas the offline algorithms (especially the EM-based one and ARLSIMPUTE) require rather large computational power, the two online algorithms are fast and quite simple to implement.

5. Discussion and Conclusion

In this work we studied the problem of time series prediction using the AR model in the presence of missing data. We considered a setting in which the signal, along with the missing data, is allowed to be arbitrary. We then defined the notion of learning in this setting with respect to the best (in hindsight) AR predictor, and showed that we can be almost as good as this predictor. It remains for future work to study whether the dependence on the parameter d in the regret bound can be improved. It would also be interesting to study whether our approach could be extended to more complex time series models, such as ARCH, ARMA, and ARIMA.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 336078 (ERC-SUBLRN). We would also like to acknowledge the work of (Hazan et al., 2015), which inspired the non-proper learning technique used in this work.

References

Anava, Oren, Hazan, Elad, Mannor, Shie, and Shamir, Ohad. Online learning for time series prediction. arXiv preprint arXiv:1302.6927, 2013.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Choong, Miew Keen, Charbit, Maurice, and Yan, Hong. Autoregressive-model-based missing value estimation for DNA microarray time series data. Information Technology in Biomedicine, IEEE Transactions on, 13(1):131-137, 2009.

Dempster, Arthur P., Laird, Nan M., and Rubin, Donald B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.

Ding, Jie, Han, Lili, and Chen, Xiaoming. Time series AR modeling with missing observations based on the polynomial transformation. Mathematical and Computer Modelling, 51(5):527-536, 2010.

Dunsmuir, William and Robinson, P. M. Estimation of time series models in the presence of missing data. Journal of the American Statistical Association, 76(375):560-568, 1981.

Durbin, James and Koopman, Siem Jan. Time Series Analysis by State Space Methods. Number 38. Oxford University Press, 2012.

Hazan, Elad. The convex optimization approach to regret minimization. Optimization for Machine Learning, pp. 287, 2011.

Hazan, Elad, Livni, Roi, and Mansour, Yishay. Classification with low rank and missing data. CoRR, abs/1501.03273, 2015.

Honaker, James and King, Gary. What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561-581, 2010.

Kalman, Rudolph Emil. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35-45, 1960.

Liu, Xiangheng and Goldsmith, Andrea. Kalman filtering with partial observation losses. In Decision and Control, 2004. CDC. 43rd IEEE Conference on, volume 4, pp. 4180-4186. IEEE, 2004.

Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

Shumway, Robert H. and Stoffer, David S. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253-264, 1982.

Sinopoli, Bruno, Schenato, Luca, Franceschetti, Massimo, Poolla, Kameshwar, Jordan, Michael I., and Sastry, Shankar S. Kalman filtering with intermittent observations. Automatic Control, IEEE Transactions on, 49(9):1453-1464, 2004.

Wang, Zidong, Yang, Fuwen, Ho, Daniel W. C., and Liu, Xiaohui.
Robust finite-horizon filtering for stochastic systems with missing measurements. Signal Processing Letters, IEEE, 12(6):437-440, 2005.