Online Time Series Prediction with Missing Data

Oren Anava (OANAVA@TX.TECHNION.AC.IL), Technion, Haifa, Israel
Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University, NY, USA
Assaf Zeevi (ASSAF@GSB.COLUMBIA.EDU), Columbia University, NY, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We consider the problem of time series prediction in the presence of missing data. We cast the problem as an online learning problem in which the goal of the learner is to minimize prediction error. We then devise an efficient algorithm for the problem, which is based on the autoregressive model, and does not assume any structure on the missing data nor on the mechanism that generates the time series. We show that our algorithm's performance asymptotically approaches the performance of the best AR predictor in hindsight, and corroborate the theoretical results with an empirical study on synthetic and real-world data.

1. Introduction

A time series is a sequence of signal observations, typically measured at uniform time intervals. Perhaps one of the most well-studied models for time series analysis and prediction is the autoregressive (AR) model. Roughly speaking, the AR model is based on the assumption that each observation can be represented as a (noisy) linear combination of some previous observations. This model has been successfully used in many real-world applications such as DNA microarray data analysis, stock market prediction, and noise cancellation, to name but a few.

Recently there has been growing interest in the problem of time series prediction in the presence of missing data, mainly in the proper learning setting, in which an underlying model is assumed and the goal is to recover its parameters. Most of the current work relies on statistical assumptions on the error terms, such as independence and Gaussian distribution. These assumptions allow the use of Maximum Likelihood (ML) techniques to recover consistent (and sometimes optimal) estimators for the model parameters. However, these assumptions are often not met in practice, causing the resulting estimators to be no longer consistent. Occasionally, additional assumptions on the structure of the missing data are added, and the statistical modeling becomes even more distant from the data.

In this paper we argue that assumptions on the model generating the time series and on the structure of its missing data can be relaxed in a substantial manner while still supporting the development of efficient methods to solve the problem. Our main contribution is a novel online learning approach for time series prediction with missing data, which allows the observations (along with the missing data) to be arbitrarily or even adversarially generated. The goal of this paper is to show that the new approach is theoretically more robust, and is thus capable of coping with a wider range of time series and missing data structures.

1.1. Informal Statement of Our Results

We cast the problem of AR prediction in the presence of missing data as an online learning problem. A major modeling challenge arises as AR prediction is not well defined when some of the data is missing. To overcome this issue, we define a new family of AR predictors; each such predictor makes use of its own past predictions to fill in the missing data, and then provides an AR prediction using the completed data.
To be slightly more formal, a predictor in this family has the following recursive form:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) \;=\; \sum_{k=1}^{p} \alpha_k X_{t-k}\,\mathbf{1}\{X_{t-k}\ \text{is revealed}\} \;+\; \sum_{k=1}^{p} \alpha_k \tilde{X}^{\mathrm{REC}}_{t-k}(\alpha)\,\mathbf{1}\{X_{t-k}\ \text{is not revealed}\},$$

where $X_t$ is the signal measured at time point $t$, and $\alpha \in \mathbb{R}^p$ is the vector of AR coefficients. Now, let $\ell_t(X_t, \tilde{X}_t)$ denote the loss suffered by predicting $\tilde{X}_t$ at time point $t$, and let $R_T$ be the corresponding regret term. Then, our main theorem is the following:

Theorem 3.1. Our algorithm generates an online sequence of predictions $\{\tilde{X}_t\}_{t=1}^{T}$, for which it holds that:

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t\big)\,\mathbf{1}\{X_t\ \text{is revealed}\} \;-\; \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha^\star)\big)\,\mathbf{1}\{X_t\ \text{is revealed}\} \;=\; O\big(\sqrt{F}\big),$$

where $F$ denotes the number of time points in which our algorithm received feedback, and $\alpha^\star$ is the minimizer of $\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\ \text{is revealed}\}$.

This result is somewhat surprising, as even the problem of finding the best AR predictor in hindsight is non-convex due to the recursive structure of the predictors we consider. The key observation that enables an efficient solution in this scheme relies on non-proper learning: it is possible to learn coefficients in a much larger class and compete against the best AR predictor. The complexity of this class is in fact exponential in the parameters of our problem, yet we prove that the learning of the new coefficients can be done efficiently due to some special characteristics. This idea was also successfully applied in the work of (Hazan et al., 2015), who considered the problem of low-rank classification with missing data.
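To make the recursive predictor above concrete, the following is a minimal sketch of how such predictions can be computed, assuming the signal is stored in a NumPy array with missing observations encoded as NaN (the function and variable names are illustrative and not taken from the paper):

```python
import numpy as np

def recursive_ar_predictions(x, alpha):
    """Sketch of the recursive predictor: missing observations (NaN) are
    replaced by the predictor's own past predictions before the AR(p)
    combination is formed."""
    alpha = np.asarray(alpha)
    p, T = len(alpha), len(x)
    proxy = np.zeros(T)   # completed series used inside the recursion
    preds = np.zeros(T)   # recursive prediction for every time point
    for t in range(T):
        window = proxy[max(0, t - p):t][::-1]           # proxy values at t-1, ..., t-p
        preds[t] = np.dot(alpha[:len(window)], window)
        # revealed observations enter the proxy as-is; missing ones are
        # filled with the prediction that was just made for them
        proxy[t] = x[t] if not np.isnan(x[t]) else preds[t]
    return preds
```

In words, a missing observation influences later predictions only through the value the predictor itself assigned to it, which is exactly the recursion above.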
1.2. Related Work

Several different approaches for time series prediction in the presence of missing data exist; we overview the major ones. Perhaps the earliest approach, originating in the control theory community, goes back to the work of (Kalman, 1960). In this work, the concept of state-space modeling is presented to deal with streams of noisy input data. Although the original work is not aimed at coping with missing data, it initiated a solid line of work that uses state-space modeling to impute missing data (Shumway & Stoffer, 1982; Sinopoli et al., 2004; Liu & Goldsmith, 2004; Wang et al., 2005). We refer the reader to (Durbin & Koopman, 2012) for a complete overview of time series analysis using state-space models.

Another increasingly common approach builds upon the concept of multiple imputation (Honaker & King, 2010). Roughly speaking, multiple imputation aims at inferring relevant information from the observed data (using known statistical methods) and using it to impute multiple values for each missing data point. The resulting multiple data sets are then treated as complete, which allows the use of standard statistical methods for the analysis task. Results from the different data sets are then combined using various simple procedures.

In the statistical literature, missing data are usually imputed using maximum likelihood estimators corresponding to a specific underlying model. Very often, these estimators cannot be computed efficiently, which motivates the use of Expectation Maximization (EM) algorithms. This approach was proposed by (Dempster et al., 1977), and is currently the most popular for dealing with missing data in time series. Essentially, the EM algorithm avoids the separate treatment of each of the exponentially many missing data patterns by using the following two-step procedure: in the E-step, missing observations are filled in with their conditional expectations given the observed data and the current estimate of the model parameters; and in the M-step, a new estimate of the model parameters is computed from the current version of the completed data. We note that the vast majority of the state-space modeling literature relies on EM techniques as well (for instance, see (Shumway & Stoffer, 1982; Sinopoli et al., 2004)).

One particular time series model that has received a great deal of attention in the statistical literature is the AR model. In the context of missing data, there are many works that assume this underlying model and differ in the assumptions on the missing data patterns: in (Dunsmuir & Robinson, 1981), a stochastic mechanism is assumed to generate the missing data; (Ding et al., 2010) consider a scarce pattern of the observed signal; and (Choong et al., 2009) rely on local similarity structures in the data. Distributional assumptions on the time series or on the patterns of the missing data are common to all of these works. To date, we are not aware of an approach that allows the signal (along with its missing data) to be generated arbitrarily, let alone adversarially. The only approach that allows for adversarially generated signals we are aware of is the recent result of (Anava et al., 2013), which does not account for missing data. Our work can be seen as an extension of the latter to the missing data setting.

2. Preliminaries and Model

A time series is a sequence of signal observations, measured at successive time points (usually spaced at uniform intervals). Let $X_t$ denote the signal measured at time $t$. The traditional AR model, parameterized by a lag $p$ and a coefficient vector $\alpha \in \mathbb{R}^p$, assumes that each observation complies with the formula

$$X_t = \sum_{k=1}^{p} \alpha_k X_{t-k} + \epsilon_t,$$

where $\{\epsilon_t\}_{t\in\mathbb{Z}}$ is assumed to be white noise. In words, the model assumes that $X_t$ is a noisy linear combination of the previous $p$ observations. Sometimes, an additional additive term $\alpha_0$ is included to indicate drift, but we will ignore this for simplicity. Notice that this does not increase the complexity of the problem, since we can simply raise the dimension of the vector $\alpha$ by one and assign the value 1 to the corresponding observation.

The motivation to use AR(p) models for signal prediction goes back to Wold's decomposition theorem. According to this theorem, a stationary signal $\{X_t\}_{t\in\mathbb{Z}}$ can be represented as an MA($\infty$) process. That is,

$$X_t = \sum_{i=1}^{\infty} \beta_i \epsilon_{t-i} + \epsilon_t,$$

where $\sum_{i=1}^{\infty} \beta_i^2 < \infty$, and $\{\epsilon_t\}_{t\in\mathbb{Z}}$ have zero mean and equal variance. If, in addition, $\{X_t\}_{t=1}^{T}$ is assumed to be invertible, we can represent it as an AR($\infty$) process. That is,

$$X_t = \sum_{i=1}^{\infty} \alpha_i X_{t-i} + \epsilon_t,$$

where $\{\alpha_i\}_{i=1}^{\infty}$ are uniquely defined. This representation, accompanied by the natural assumption that $\alpha_i$ decays fast enough as a function of $i$, motivates the use of AR(p) models for the task of signal prediction.
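As a small illustration of the AR(p) model equation above, the following sketch generates such a series from given coefficients with i.i.d. Gaussian noise (the noise level, seed, and names are arbitrary illustrative choices):

```python
import numpy as np

def simulate_ar(alpha, T, sigma=0.3, seed=0):
    """Generate X_t = sum_{k=1}^p alpha_k X_{t-k} + eps_t with eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha)
    p = len(alpha)
    x = np.zeros(T + p)                      # p zeros serve as the initial history
    for t in range(p, T + p):
        history = x[t - p:t][::-1]           # X_{t-1}, ..., X_{t-p}
        x[t] = np.dot(alpha, history) + rng.normal(0.0, sigma)
    return x[p:]
```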
2.1. The Online Setting for AR Prediction

Having motivated the use of AR models, we turn to the question of misspecification: what happens if we employ a model that does not comply with our data? Standard statistical methods (e.g., maximum likelihood) for estimating the AR coefficients are based on the assumption that the observations come from an AR model, and are thus not suitable when the model is misspecified. This drives the use of online learning based techniques to circumvent this issue.

Online learning is a well-established learning paradigm with both theoretical and practical appeal. The goal in this paradigm is to make sequential predictions, where the data, rather than being generated stochastically, is assumed to be chosen by an adversary that has full knowledge of our learning algorithm (see for instance (Cesa-Bianchi & Lugosi, 2006)). Specifically, the following setting is usually assumed in the context of time series prediction: at time point $t$, we need to make a prediction $\tilde{X}_t$, after which the true value of the signal $X_t$ is revealed, and we suffer a loss denoted by $\ell_t(X_t, \tilde{X}_t)$. Usually, $\ell_t$ is assumed to be convex with Lipschitz gradients.

When considering AR(p) prediction, we must define in advance the decision set $K \subset \mathbb{R}^p$, which stands for the class of AR coefficients against which we want to compete. We henceforth let $K = [-1, 1]^p$. Our prediction at time point $t$ then takes the form

$$\tilde{X}^{\mathrm{AR}}_t(\alpha^t) = \sum_{k=1}^{p} \alpha^t_k X_{t-k}, \qquad (1)$$

where $\alpha^t \in K$ is generated by our online algorithm. Here comes the punch of the online setting: our goal is to design an algorithm that generates a sequence $\{\alpha^t\}_{t=1}^{T}$ which is almost as good as the best (in hindsight) AR coefficients in $K$. More formally, we define the regret to be

$$R_T = \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big) - \min_{\alpha\in K}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha)\big),$$

and wish to design efficient algorithms whose regret grows sublinearly in $T$. Thus, even if the model is misspecified (meaning the best AR coefficients in $K$ have unsatisfactory predictive power), minimizing the regret term is still meaningful. There remains the question of how we can compete against the best AR coefficients in the presence of missing data; the latter is the main question we try to answer in this work.

2.2. Problem Definition

Throughout this work we consider the following setting (which accounts for missing data): at time point $t$, we need to make a prediction $\tilde{X}_t$, after which feedback in the form of the real signal $X_t$ is not necessarily revealed, and we suffer a loss denoted by $\ell_t(X_t, \tilde{X}_t)\,\mathbf{1}\{X_t\ \text{is revealed}\}$. That is, we suffer loss only if we receive feedback. Here also, $\ell_t$ is assumed to be convex with Lipschitz gradients. The problem arising in this setting is two-fold: first, we cannot provide an AR prediction at time $t$ (even given the vector $\alpha$) if some of the required past observations are missing. Second, the best AR predictor in hindsight is not well defined. To solve this problem, we define a family of recursive predictors, each of the form

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{p} \alpha_k X_{t-k}\,\mathbf{1}\{X_{t-k}\ \text{is revealed}\} + \sum_{k=1}^{p} \alpha_k \tilde{X}^{\mathrm{REC}}_{t-k}(\alpha)\,\mathbf{1}\{X_{t-k}\ \text{is not revealed}\}, \qquad (2)$$

where $\alpha \in K = [-1, 1]^p$. Essentially, a predictor in this family uses its own (updated) estimations as a proxy for the actual signal, and then provides an AR(p) prediction as if there were no missing data. The problem at hand then translates into minimizing the corresponding regret term.

2.3. Our Assumptions

Throughout the remainder of this work we assume that the following hold:

(1) $X_t \in [-1, 1]$ for all $t$. Here, the constant 1 can be replaced by any other constant $C < \infty$ (which is known in advance), but in order to simplify the writing we assume that $C = 1$.

(2) For all $t$ there exist at least $p$ successive time points in $\{t-d, \ldots, t-1\}$ for which we received feedback. This assumption makes sure that each prediction $\tilde{X}^{\mathrm{REC}}_t(\alpha)$ does not look back more than $d$ time points. This assumption can be completely removed, as discussed in Section 3.4.

Note that we do not assume an underlying AR(p) model that generates the signal, nor a statistical model according to which the missing observations are omitted from us.
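As a small, self-contained illustration of assumption (2), the sketch below checks, for a given 0/1 reveal mask, whether every length-$d$ history window contains at least $p$ successive revealed time points (the function name and mask encoding are illustrative):

```python
def assumption_2_holds(revealed, p, d):
    """revealed[t] = 1 iff feedback was received at time t (0/1 sequence).
    Checks that every window {t-d, ..., t-1} contains a run of at least p
    consecutive revealed time points."""
    T = len(revealed)
    for t in range(d, T):
        run, best = 0, 0
        for r in revealed[t - d:t]:
            run = run + 1 if r == 1 else 0
            best = max(best, run)
        if best < p:
            return False
    return True
```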
3. Our Approach

We briefly outline our approach to the problem at hand. Basically, observe that the prediction at time point $t$ can be written in the following form:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{d} p_k(\alpha)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha)$ is a polynomial in $\alpha$ that is determined by the structure of the missing data. Each polynomial $p_k$ potentially contains up to $2^{k-1}$ terms of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, such that $\sum_{m=1}^{j} i_m = k$. This means that the prediction at time point $t$ consists of up to $2^d$ terms of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, such that for each of them it holds that $\sum_{m=1}^{j} i_m \le d$. Notice that each such term is less than or equal to 1 in absolute value for the $\alpha$ that is the best recursive AR(p) predictor in hindsight, since we consider $K = [-1, 1]^p$.

The latter observation allows the use of non-proper learning techniques in this setting. Essentially, the idea is to learn a vector $w \in \mathbb{R}^{2^d}$ such that each entry in $w$ corresponds to a product of the form $\alpha_{i_1}\cdots\alpha_{i_j}$, while ignoring the restrictions imposed on $w$ by $\alpha$. Obtaining a regret bound w.r.t. the best $w$ would imply a regret bound w.r.t. the best $\alpha$, yet an efficiency question would remain since the dimension of $w$ is $2^d$. In Section 3.3 we prove that the inner products in the space induced by this relaxation can be computed efficiently, which overall gives an efficient algorithm with a provable regret bound.

3.1. Notation and Definitions

We denote by $[n]$ the set $\{1, \ldots, n\}$, and use $X_{(t-d:t-1)} \in \{\mathbb{R} \cup \{*\}\}^d$ to denote the vector of values $X_{t-d}, \ldots, X_{t-1}$, where missing observations are encoded using $\{*\}$. The notation $\mathbf{1}\{X_t\}$ will be used as the indicator of the event $\{X_t\ \text{is revealed}\}$. Finally, denote

$$B_d = \Big\{\, w \in \mathbb{R}^{2^d-1} \;:\; \|w\|_2^2 \le 2^d \,\Big\}.$$

We say that the vector $b = (b_1, \ldots, b_d) \in \{0, 1\}^d$ is the binary representation of a number $n$ if $n = \sum_{k=1}^{d} b_k 2^{k-1}$. We denote by $b(n)$ the unique binary representation of $n$. For a given number $n$ and its binary representation $b(n) = (b_1, \ldots, b_d)$, we define $m(n)$ to be the maximal index $k$ such that $b_k = 1$; that is, $m(n) = \max\{k : 2^{k-1} \le n\}$.

The definition below links a structure of missing observations to a binary vector $b$. This, in turn, will be used to link a vector $w \in \mathbb{R}^{2^d-1}$ to a structure of missing observations.

Definition 1. Let $b = (b_1, \ldots, b_d) \in \{0, 1\}^d$ and set $m = \max\{k \,|\, b_k = 1\}$. We say that $b$ is a semi-valid path with respect to $X_{(t-d:t-1)}$ (and denote $b \sqsubseteq_{sv} X_{(t-d:t-1)}$) if $b_m = \mathbf{1}\{X_{t-m}\} = 1$ and $b_i \le 1 - \mathbf{1}\{X_{t-i}\}$ for all $i < m$. If, in addition, $(b_1, \ldots, b_m)$ does not contain $p$ successive zeros, we say that $b$ is a valid path with respect to $X_{(t-d:t-1)}$ (and denote $b \sqsubseteq_{v} X_{(t-d:t-1)}$).

Basically, each structure of missing data corresponds to many valid paths; each of these corresponds to a coefficient of a revealed signal value in Equation (2). To see that, note that

$$\tilde{X}^{\mathrm{REC}}_t(\alpha) = \sum_{k=1}^{d} p_k\big(\alpha \,\big|\, X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha \,|\, X_{(t-d:t-1)})$ is a polynomial in $\alpha$ that is determined by the structure of the missing data in $X_{(t-d:t-1)}$.

Definition 2. For a given vector $X_{(t-d:t-1)}$, we define the function $\Phi(X_{(t-d:t-1)}) \in \mathbb{R}^{2^d-1}$ as follows:

$$\big[\Phi(X_{(t-d:t-1)})\big]_n = \begin{cases} X_{t-m(n)} & \text{if } b(n) \sqsubseteq_{sv} X_{(t-d:t-1)}, \\ 0 & \text{otherwise.} \end{cases}$$

Here, $\big[\Phi(X_{(t-d:t-1)})\big]_n$ denotes the $n$-th coordinate of the vector $\Phi(X_{(t-d:t-1)})$. Similarly, we define an auxiliary function $\Phi^p$ for valid paths:

$$\big[\Phi^p(X_{(t-d:t-1)})\big]_n = \begin{cases} X_{t-m(n)} & \text{if } b(n) \sqsubseteq_{v} X_{(t-d:t-1)}, \\ 0 & \text{otherwise.} \end{cases}$$
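The following brute-force sketch spells out these definitions for small $d$. It enumerates the binary representations $b(n)$, checks semi-validity and validity against a 0/1 reveal mask, and builds $\Phi$ (or $\Phi^p$) coordinate by coordinate. All names are illustrative, and the construction is exponential in $d$; it is only meant to make the definitions concrete (Section 3.3 shows that inner products of these vectors can be computed in O(d) without ever forming them).

```python
import numpy as np

def binary_rep(n, d):
    """b(n) = (b_1, ..., b_d) with n = sum_k b_k * 2^(k-1)."""
    return [(n >> (k - 1)) & 1 for k in range(1, d + 1)]

def m_of(b):
    """m(n): the maximal (1-based) index k with b_k = 1."""
    return max(k + 1 for k, bit in enumerate(b) if bit == 1)

def is_semi_valid(b, revealed):
    """revealed[k-1] = 1 iff X_{t-k} is revealed. A path is semi-valid iff its
    last one sits on a revealed entry and every earlier one sits on a missing entry."""
    m = m_of(b)
    if revealed[m - 1] != 1:
        return False
    return all(b[i] == 0 or revealed[i] == 0 for i in range(m - 1))

def is_valid(b, revealed, p):
    """Valid = semi-valid and no p successive zeros in (b_1, ..., b_m)."""
    if not is_semi_valid(b, revealed):
        return False
    run = 0
    for bit in b[:m_of(b)]:
        run = run + 1 if bit == 0 else 0
        if run >= p:
            return False
    return True

def phi(x_window, revealed, p=None, valid_only=False):
    """Brute-force Phi (or Phi^p, if valid_only=True) in R^(2^d - 1);
    x_window[k-1] holds X_{t-k}."""
    d = len(x_window)
    out = np.zeros(2 ** d - 1)
    for n in range(1, 2 ** d):
        b = binary_rep(n, d)
        ok = is_valid(b, revealed, p) if valid_only else is_semi_valid(b, revealed)
        if ok:
            out[n - 1] = x_window[m_of(b) - 1]
    return out
```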
From now on, we use the notation $\tilde{X}_t(w)$ for predictions of the form $w^\top \Phi(X_{(t-d:t-1)})$, and $\tilde{X}^p_t(w)$ for predictions of the form $w^\top \Phi^p(X_{(t-d:t-1)})$.

3.2. Algorithm and Analysis

We take a step back to present an online algorithm, LAZY OGD (Algorithm 1), which is aimed at minimizing regret in the general online learning framework. LAZY OGD is a special instance of the FTRL algorithm when $K$ is an $\ell_2$-ball with radius $D$.

Algorithm 1 LAZY OGD (on the $\ell_2$-ball with radius $D$)
1: Input: learning rate $\eta_t$.
2: Set $a_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Play $a_t$ and incur loss $f_t(a_t)$.
5:   Set $a_{t+1} = \dfrac{-\eta_t \sum_{i=1}^{t} \nabla f_i(a_i)}{\max\big\{1,\ \frac{\eta_t}{D}\big\|\sum_{i=1}^{t} \nabla f_i(a_i)\big\|\big\}}$.
6: end for

Let $\{f_t\}_{t=1}^{T}$ be convex loss functions for which $\max_{a,t} \|\nabla f_t(a)\| \le G$, and denote $a^\star = \arg\min_{\|a\|\le D} \sum_{t=1}^{T} f_t(a)$. Then, the regret of LAZY OGD is bounded as follows:

$$\sum_{t=1}^{T} f_t(a_t) - \min_{\|a\|\le D}\sum_{t=1}^{T} f_t(a) \;\le\; \sum_{t=1}^{T}\eta_t\big\|\nabla f_t(a_t)\big\|^2 + D\sqrt{\sum_{t=1}^{T}\big\|\nabla f_t(a_t)\big\|^2} \;\le\; 3GD\sqrt{T},$$

where the last inequality holds for the choice $\eta_t = D/(G\sqrt{t})$. A complete analysis can be found in (Hazan, 2011; Shalev-Shwartz, 2012).
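A minimal sketch of LAZY OGD in this form, assuming gradient oracles for the losses and a fixed learning rate (the names and the oracle interface are illustrative):

```python
import numpy as np

def lazy_ogd(grad_oracles, dim, D, eta):
    """Minimal sketch of LAZY OGD on the l2-ball of radius D.

    grad_oracles[t](a) returns the gradient of f_t at a. A fixed learning
    rate eta is used for simplicity (the text allows a time-varying eta_t)."""
    a = np.zeros(dim)
    grad_sum = np.zeros(dim)
    played = []
    for grad in grad_oracles:
        played.append(a.copy())                      # play a_t
        grad_sum += grad(a)                          # accumulate grad f_t(a_t)
        # lazy projection of -eta * grad_sum onto the l2-ball of radius D
        a = -eta * grad_sum / max(1.0, eta * np.linalg.norm(grad_sum) / D)
    return played
```

For instance, for squared losses of the form $f_t(a) = (a^\top x_t - y_t)^2$, the oracle for round $t$ would return $2(a^\top x_t - y_t)\,x_t$.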
Our algorithm (Algorithm 2) is an adaptation of LAZY OGD to $B_d$, but notice that in its current form it is inefficient (since the dimension of $B_d$ is exponential in $d$). This form is easier to analyze and is thus stated here; an efficient version of Algorithm 2 is presented in Section 3.3.

Algorithm 2
1: Input: learning rate $\eta_t$.
2: Set $w_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Play $w_t$ and incur $f_t(w_t) = \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\}$.
5:   Set $w_{t+1} = \dfrac{-\eta_t \sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)}{\max\big\{1,\ \eta_t\, 2^{-d/2}\big\|\sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)\big\|\big\}}$.
6: end for

In the sequel, we denote $D = \max_{w\in B_d} \|w\|$ and $G = \max_{w,t} \|\nabla_w \ell_t(X_t, \tilde{X}_t(w))\|$. From the definition of $B_d$ it follows directly that $D = 2^{d/2}$. The value of $G$ depends on the loss functions considered; for instance, if we consider the squared loss then $G = 2^{d/2}$. The following is our main theorem:

Theorem 3.1. Algorithm 2 generates an online sequence $\{w_t\}_{t=1}^{T}$ for which it holds that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} \;-\; \min_{\|\alpha\|_\infty \le 1} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}},$$

if we choose $\eta_t = D \big/ \big(G\sqrt{\sum_{\tau=1}^{t}\mathbf{1}\{X_\tau\}}\big)$.

Proof. Algorithm 2 is simply LAZY OGD on the decision set $B_d$. Thus, by applying it we get that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} - \min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}}. \qquad (3)$$

Now, we can write

$$\min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\,\mathbf{1}\{X_t\} \;\overset{(a)}{\le}\; \min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w)\big)\,\mathbf{1}\{X_t\} \;\overset{(b)}{\le}\; \min_{\|\alpha\|_\infty\le 1}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\}, \qquad (4)$$

where $\tilde{X}^p_t(w)$ is of the form $w^\top \Phi^p(X_{(t-d:t-1)})$. The two key inequalities above are explained as follows.

To prove (a), note that from the construction of $\Phi$ and $\Phi^p$ it follows that for any $t$, $\Phi^p(X_{(t-d:t-1)})$ can be written as

$$\big[\Phi^p(X_{(t-d:t-1)})\big]_n = \begin{cases} \big[\Phi(X_{(t-d:t-1)})\big]_n & \text{if } n \in N, \\ 0 & \text{otherwise,} \end{cases}$$

where $N$ is the set of all numbers $n \le 2^d - 1$ for which there exists $X \in \{\mathbb{R} \cup \{*\}\}^d$ such that $b(n) \sqsubseteq_v X$. In particular, if we denote

$$w^\star = \arg\min_{w\in B_d}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w)\big)\,\mathbf{1}\{X_t\},$$

then w.l.o.g. we can assume that $w^\star_n = 0$ for all $n \notin N$. Now, note that for $w^\star$ it holds that $\tilde{X}^p_t(w^\star) = \tilde{X}_t(w^\star)$ for all $t$, which implies that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w^\star)\big)\,\mathbf{1}\{X_t\} = \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^p_t(w^\star)\big)\,\mathbf{1}\{X_t\}.$$

Finally, since $\min_{w\in B_d} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w)\big)\mathbf{1}\{X_t\} \le \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w^\star)\big)\mathbf{1}\{X_t\}$, the claim holds.

Now, to prove (b), let us first denote

$$\alpha^\star = \arg\min_{\|\alpha\|_\infty\le 1}\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\}.$$

Next, for a given vector $b \in \{0, 1\}^d$ we define $I(b) \in [d]^d$ as follows:

$$\big[I(b)\big]_i = \begin{cases} i - \max\{j \,:\, j < i \text{ and } b_j = 1\} & \text{if } b_i = 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $\max\{j : j < i \text{ and } b_j = 1\} = 0$ if no such $j$ exists. In words, $I(b)$ is a vector holding, for each one in $b$, the distance to the nearest one to its left. For instance, $I((0, 1, 0, 1)) = (0, 2, 0, 2)$ and $I((0, 0, 1, 1)) = (0, 0, 3, 1)$. Now, consider the following construction of $w$:

$$w_n = \begin{cases} \prod_{i\,:\,b(n)_i=1} \alpha^\star_{[I(b(n))]_i} & \text{if } n \in N, \\ 0 & \text{otherwise.} \end{cases}$$

Our claim is that for the construction above, it holds that $\tilde{X}^p_t(w) = \tilde{X}^{\mathrm{REC}}_t(\alpha^\star)$ for all $t$. Let us look first at $\tilde{X}^{\mathrm{REC}}_t(\alpha^\star)$. By definition, it can be recursed until it encounters $p$ successive revealed observations, and thus (by assumption (2)) it looks back at most $d$ time points. Thus, it has the following structure:

$$\tilde{X}^{\mathrm{REC}}_t(\alpha^\star) = \sum_{k=1}^{d} p_k\big(\alpha^\star \big| X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\},$$

where $p_k(\alpha^\star | X_{(t-d:t-1)})$ is a polynomial consisting of sums of products of $\alpha^\star_1, \ldots, \alpha^\star_p$. In fact, the polynomial $p_k(\alpha^\star | X_{(t-d:t-1)})$ is a sum of elements, each of them of the form $\prod_{i:\,b_i=1} \alpha^\star_{[I(b)]_i}$, where $b$ is a valid path w.r.t. $X_{(t-d:t-1)}$ and, in addition, $k = \max\{i : b_i = 1\}$. This property follows directly from Definition 1 in Section 3.1. Thus, we can write

$$\begin{aligned}
\tilde{X}^{\mathrm{REC}}_t(\alpha^\star) &= \sum_{k=1}^{d} p_k\big(\alpha^\star \big| X_{(t-d:t-1)}\big)\, X_{t-k}\,\mathbf{1}\{X_{t-k}\} \\
&= \sum_{k=1}^{d} \sum_{b\in\mathcal{X}^v_t} \Big(\prod_{i:\,b_i=1} \alpha^\star_{[I(b)]_i}\Big)\, X_{t-k}\,\mathbf{1}\{k = \max\{i : b_i = 1\}\} \\
&= \sum_{n=1}^{2^d-1} \Big(\prod_{i:\,b(n)_i=1} \alpha^\star_{[I(b(n))]_i}\Big)\, X_{t-m(n)}\,\mathbf{1}\{b(n) \sqsubseteq_v X_{(t-d:t-1)}\} \\
&= \sum_{n=1}^{2^d-1} w_n \big[\Phi^p(X_{(t-d:t-1)})\big]_n = w^\top \Phi^p(X_{(t-d:t-1)}) = \tilde{X}^p_t(w),
\end{aligned}$$

where $\mathcal{X}^v_t$ is the set of all valid paths w.r.t. $X_{(t-d:t-1)}$. Moreover, since each $|w_n| \le 1$, we have $\|w\|_2^2 \le 2^d - 1$ and thus $w \in B_d$. This establishes the claim in (b). Now, plugging (4) into (3) gives the stated result.

3.3. Efficient Implementation of Algorithm 2

Algorithm 3 Efficient Implementation of Algorithm 2
1: Input: learning rate $\eta$.
2: Set $w_1 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $\tilde{X}_t$ and incur loss $\ell_t\big(X_t, \tilde{X}_t\big)\,\mathbf{1}\{X_t\}$.
5:   Set the next prediction: $\tilde{X}_{t+1} = \dfrac{-\eta \sum_{\tau=1}^{t} \mathrm{Err}_\tau\, K(t+1, \tau)}{\max\Big\{1,\ \eta\sqrt{2^{-d}\sum_{s=1}^{t}\sum_{\tau=1}^{t} \mathrm{Err}_s\,\mathrm{Err}_\tau\, K(s, \tau)}\Big\}}$.
6: end for

Theorem 3.1 ignores computational considerations. Next, we show that $\tilde{X}_t(w_t)$ can in fact be generated efficiently. For $t = 1$ we have that $w_1 = 0$ and thus $\tilde{X}_1(w_1) = 0$. Next, assume that $\{\tilde{X}_\tau(w_\tau)\}_{\tau=1}^{t}$ are efficiently generated, and prove the claim for $\tilde{X}_{t+1}(w_{t+1})$. Now, notice that if we denote

$$\nabla f_\tau(w_\tau) = \nabla\ell_\tau(w_\tau)\,\mathbf{1}\{X_\tau\} = \mathrm{Err}_\tau\, \Phi(X_{(\tau-d:\tau-1)}),$$

the above implies that $\mathrm{Err}_\tau$ is known for all $\tau \le t$. Thus,

$$\tilde{X}_{t+1}(w_{t+1}) = \Phi(X_{(t-d+1:t)})^\top w_{t+1} = \frac{-\eta \sum_{\tau=1}^{t} \Phi(X_{(t-d+1:t)})^\top \nabla f_\tau(w_\tau)}{\max\big\{1,\ \eta\, 2^{-d/2} \big\|\sum_{\tau=1}^{t} \nabla f_\tau(w_\tau)\big\|\big\}} = \frac{-\eta \sum_{\tau=1}^{t} \mathrm{Err}_\tau\, K(t+1, \tau)}{\max\Big\{1,\ \eta\sqrt{2^{-d}\sum_{s=1}^{t}\sum_{\tau=1}^{t} \mathrm{Err}_s\,\mathrm{Err}_\tau\, K(s, \tau)}\Big\}},$$

where $K(s, \tau) = \Phi(X_{(s-d:s-1)})^\top \Phi(X_{(\tau-d:\tau-1)})$. The above is efficient to generate if and only if $K(s, \tau)$ is efficient to compute for all $s$ and $\tau$. This computation can be done in O(d) operations despite the fact that $\Phi(X_{(s-d:s-1)}) \in \mathbb{R}^{2^d-1}$. To see that, we first define an auxiliary function $c$ as follows:

$$\big[c(s,\tau)\big]_k = \big[c\big(X_{(s-d:s-1)}, X_{(\tau-d:\tau-1)}\big)\big]_k = \sum_{i=1}^{k-1} \big(1 - \mathbf{1}\{X_{s-i}\}\big)\big(1 - \mathbf{1}\{X_{\tau-i}\}\big).$$

Essentially, $[c(s, \tau)]_k$ counts the number of common missing observations among the first $k-1$ entries of $X_{(s-d:s-1)}$ and $X_{(\tau-d:\tau-1)}$. Now, notice that

$$K(s,\tau) = \Phi(X_{(s-d:s-1)})^\top \Phi(X_{(\tau-d:\tau-1)}) = \sum_{n=1}^{2^d-1} X_{s-m(n)}\, X_{\tau-m(n)}\,\mathbf{1}\{b(n) \sqsubseteq_{sv} X_{(s-d:s-1)}\}\,\mathbf{1}\{b(n) \sqsubseteq_{sv} X_{(\tau-d:\tau-1)}\} = \sum_{k=1}^{d} 2^{[c(s,\tau)]_k}\, X_{s-k}\, X_{\tau-k}\,\mathbf{1}\{X_{s-k}\}\,\mathbf{1}\{X_{\tau-k}\},$$

where the last equality follows from Definition 1.
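The following is a minimal sketch of this O(d) kernel computation, assuming each window is given as a NumPy array ordered from lag 1 to lag d with missing entries encoded as NaN (names are illustrative):

```python
import numpy as np

def kernel_K(window_s, window_tau):
    """O(d) evaluation of K(s, tau) = <Phi(X_(s-d:s-1)), Phi(X_(tau-d:tau-1))>.
    window_s[k-1] holds X_{s-k} (NaN if missing), and likewise for window_tau."""
    K = 0.0
    common_missing = 0              # [c(s, tau)]_k: common missing entries at lags < k
    for xs, xt in zip(window_s, window_tau):
        if not (np.isnan(xs) or np.isnan(xt)):
            K += (2.0 ** common_missing) * xs * xt    # 2^{[c(s,tau)]_k} X_{s-k} X_{tau-k}
        common_missing += int(np.isnan(xs) and np.isnan(xt))
    return K
```

For small d, the result can be checked against a brute-force inner product of the $\Phi$ vectors constructed as in Section 3.1.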
3.4. Some Extensions

We briefly discuss two issues: the removal of assumption (2), and the replacement of $B_d$ with a smaller ball. As mentioned before, assumption (2) makes sure that each prediction $\tilde{X}^{\mathrm{REC}}_t(\alpha)$ considers at most $d$ past observations. However, our algorithm is still applicable if this assumption does not hold. The theoretical guarantee then gets a different interpretation: we are almost as good as the best recursive AR(p) predictor, but now the recursion is limited to at most $d$ past observations. Essentially, this is equivalent to defining a family of recursive AR predictors with bounded memory, and competing against the predictors in this family.

As for the second issue, recall that the dimension of the decision set $B_d$ is exponential in $d$. This affects us only through the regret bound, as we proved earlier that the computations can be done efficiently in our setting. To mitigate the effect on the regret bound, we define

$$\hat{B}_d = \Big\{\, w\in\mathbb{R}^{2^d-1} \;:\; \|w\|_2^2 \le d \,\Big\}$$

and state the following corollary:

Corollary 3.2. Algorithm 2 (run over $\hat{B}_d$) generates an online sequence $\{w_t\}_{t=1}^{T}$ for which it holds that

$$\sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(w_t)\big)\,\mathbf{1}\{X_t\} \;-\; \min_{\alpha\in K} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}^{\mathrm{REC}}_t(\alpha)\big)\,\mathbf{1}\{X_t\} \;\le\; 3GD\sqrt{\textstyle\sum_{t=1}^{T}\mathbf{1}\{X_t\}},$$

where $K = \big\{\alpha\in\mathbb{R}^p : \sum_{i=1}^{p} |\alpha_i| \le 1\big\}$.

The proof follows by a simple calculation and is thus omitted here. Note that in the above $D = \sqrt{d}$, and $G$ is again determined by the selection of the loss functions. This case captures natural scenarios in which the effect of past observations decays as they become more distant.

4. Illustrative Examples

The following experiments demonstrate the effectiveness of the proposed algorithm under various synthetic settings. Due to lack of space, we defer the experimental results on real-world data to the supplementary material.

4.1. Baselines

Most of the works on time series with missing observations consider what we call the offline setting: given a time series that contains missing observations, compute the model parameters (in our case, the AR coefficients) and/or impute the missing data. Our online setting can be seen as a sequential offline setting, in which at time $t$ we are given the time series values up to time $t-1$ and our task is to predict the signal at time $t$. In light of this, we adapt the offline baselines presented below to the online setting. We note that this adaptation does not weaken the offline baselines in any way, and we use it only for comparison purposes.

Yule-Walker estimator. We use the well-known Yule-Walker estimator, and adapt it to our setting as follows: at first, we initialize some AR coefficients. At time $t$, at our disposal is the data seen up to time $t-1$, where missing observations are filled in by the corresponding past predictions of the algorithm. Then, the AR coefficients are computed by solving the Yule-Walker equations, and a prediction is made accordingly.

Expectation Maximization (EM). Our EM baseline is based on the algorithm originally proposed by (Shumway & Stoffer, 1982). Roughly speaking, the algorithm assumes an underlying state-space model and employs the Kalman filter to estimate its parameters. The estimation is done by maximizing the log-likelihood using iterative EM steps, and the resulting Kalman smoothed estimator is used to complete the missing observations.

ARLSIMPUTE. This algorithm was proposed by (Choong et al., 2009), originally to cope with missing observations in DNA microarray data. The algorithm is based on the following iterative method: at the first iteration, initialize all the missing observations to 0. Then, in every iteration compute an LS estimator for the AR coefficients and update the values of the missing observations by maximizing the log-likelihood.

OGDIMPUTE. We propose another algorithm for the problem at hand, denoted Algorithm 4. Basically, the algorithm applies the standard Online Gradient Descent algorithm to learn the AR coefficients, while filling missing observations with their past predictions. Whereas this algorithm is very fast and simple to implement, its downside is the lack of a theoretical guarantee.

Algorithm 4 OGDIMPUTE
1: Input: learning rate $\eta$.
2: Initialize $\alpha^1 = 0$, and set $X_t = 0$ for $t \le 0$.
3: for $t = 1$ to $T$ do
4:   Predict $\tilde{X}^{\mathrm{AR}}_t(\alpha^t) = \sum_{i=1}^{p} \alpha^t_i X_{t-i}$.
5:   Observe loss $\ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big)\,\mathbf{1}\{X_t\}$.
6:   If $\mathbf{1}\{X_t\} = 0$, then set $X_t = \tilde{X}^{\mathrm{AR}}_t(\alpha^t)$.
7:   Set $\alpha^{t+1} = \Pi_K\big(\alpha^t - \eta\,\nabla\ell_t\big(X_t, \tilde{X}^{\mathrm{AR}}_t(\alpha^t)\big)\,\mathbf{1}\{X_t\}\big)$.
8: end for
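A minimal sketch of OGDIMPUTE under the squared loss $\ell_t(X_t, Y) = (X_t - Y)^2$, with $K = [-1, 1]^p$ so that the projection $\Pi_K$ amounts to coordinate-wise clipping; the squared loss, the NaN encoding, and all names are illustrative assumptions rather than choices stated in the paper:

```python
import numpy as np

def ogd_impute(x, p, eta):
    """Sketch of OGDIMPUTE (Algorithm 4) for an observed series x in which
    missing values are encoded as NaN."""
    T = len(x)
    alpha = np.zeros(p)
    filled = np.zeros(T + p)                 # X_t = 0 for t <= 0
    preds = np.zeros(T)
    for t in range(T):
        hist = filled[t:t + p][::-1]         # X_{t-1}, ..., X_{t-p} (index shifted by p)
        preds[t] = np.dot(alpha, hist)
        if np.isnan(x[t]):                   # no feedback: impute with the prediction
            filled[t + p] = preds[t]
        else:                                # feedback: suffer loss and take a gradient step
            filled[t + p] = x[t]
            grad = 2.0 * (preds[t] - x[t]) * hist
            alpha = np.clip(alpha - eta * grad, -1.0, 1.0)
    return preds, alpha
```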
4.2. Generating the Synthetic Data

To compare the performance of the proposed algorithms we design several different settings (presented below). In order to ensure the stability of the results we average them over 50 runs. For our algorithm, we used $d = 3p$ in all considered settings. In our tables, we mark the best results in bold font, and add an asterisk to indicate a significance level of 0.05.

Setting 1 (sanity check). We generate a time series using the coefficient vector $\alpha = [0.6, -0.5, 0.4, -0.4, 0.3]$ and i.i.d. noise terms that are distributed $N(0, 0.3^2)$. We then omit some of the data points in a random manner (that is, each data point is omitted with a certain probability).

Setting 2 (AR mixture). Our motivation in this setting is to examine the behavior of the different algorithms when faced with changing environments. Thus, we consider a predefined set of AR coefficients, and generate time series by alternating between them in a random manner. We add an additive noise term which is distributed $\mathrm{Uni}[-0.5, 0.5]$. Here also, the missing data is omitted in a random manner.

Setting 3 (heteroscedasticity). Here we test the robustness of the different algorithms to unequally distributed noise terms. Thus, we generate a time series using the coefficient vector $\alpha = [0.11, -0.5]$, and noise terms that are distributed normally, with expectation equal to the value of the previous noise term and variance $0.3^2$. This implies that consecutive noise terms are positively correlated. The missing data points are chosen randomly here as well.
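For illustration, data for Setting 1 could be generated along the following lines (the omission probability q, the seed, and all names are arbitrary illustrative choices):

```python
import numpy as np

def generate_setting1(T, q=0.1, seed=0):
    """Setting 1: AR(5) series with coefficients [0.6, -0.5, 0.4, -0.4, 0.3],
    N(0, 0.3^2) noise, and each point omitted independently with probability q."""
    rng = np.random.default_rng(seed)
    alpha = np.array([0.6, -0.5, 0.4, -0.4, 0.3])
    p = len(alpha)
    x = np.zeros(T + p)
    for t in range(p, T + p):
        x[t] = np.dot(alpha, x[t - p:t][::-1]) + rng.normal(0.0, 0.3)
    x = x[p:]
    observed = x.copy()
    observed[rng.random(T) < q] = np.nan     # omit each point with probability q
    return x, observed
```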
[Figure 1 omitted: for each of the three settings (sanity check, AR mixture, heteroscedasticity), one panel shows MSE vs. time for 10% missing data and a second panel shows MSE@2000 vs. the percentage of missing data (0%-20%), for Yule-Walker, ARLSIMPUTE, EM (Kalman filter), OGDIMPUTE, and our algorithm.]

Figure 1. Experimental results for synthetic data.

| Algorithm | Sanity check 0% | Sanity check 10% | Sanity check 20% | AR mixture 0% | AR mixture 10% | AR mixture 20% | Heterosced. 0% | Heterosced. 10% | Heterosced. 20% |
|---|---|---|---|---|---|---|---|---|---|
| Yule-Walker | 0.1156 | 0.1270 | 0.1427 | 0.1343 | 0.1476 | 0.1496 | 0.0986 | 0.1023 | 0.1059 |
| ARLSIMPUTE | **0.1058** | **0.1160** | 0.1336 | 0.1344 | 0.1424 | 0.1417 | 0.1063 | 0.1150 | 0.1276 |
| EM (Kalman filter) | 0.1065 | 0.1170 | **0.1301** | 0.1101 | 0.1189 | **0.1244** | 0.1042 | 0.1078 | 0.1205 |
| OGDIMPUTE | 0.1071 | 0.1175 | 0.1305 | 0.1396 | 0.1530 | 0.1568 | 0.0945 | 0.0984 | 0.1028 |
| Our algorithm | 0.1085 | 0.1212 | 0.1447 | **0.1072*** | **0.1168*** | 0.1260 | **0.0937** | **0.0979** | **0.1018** |

Table 1. Experimental results for synthetic data. Column headers give the percentage of missing data in each setting.

As evident in Figure 1 and Table 1, our online algorithm outperforms the other algorithms when the time series exhibits a complicated structure. In the case where the error terms are Gaussian and the time series complies with the AR model (sanity check), we can see that all algorithms perform roughly the same, as can be expected from the theoretical guarantees. We point out that whereas the offline algorithms (especially the EM-based one and ARLSIMPUTE) require rather large computational power, the two online algorithms are fast and quite simple to implement.

5. Discussion and Conclusion

In this work we studied the problem of time series prediction using the AR model in the presence of missing data. We considered a setting in which the signal, along with the missing data, is allowed to be arbitrary. We then defined the notion of learning in this setting with respect to the best (in hindsight) AR predictor, and showed that we can be almost as good as this predictor. It remains for future work to study whether the dependence on the parameter d in the regret bound can be improved. It would also be interesting to study whether our approach could be extended to more complex time series models, such as ARCH, ARMA, and ARIMA.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 336078 (ERC-SUBLRN). We would also like to acknowledge the work of (Hazan et al., 2015), which inspired the non-proper learning technique used in this work.

References

Anava, Oren, Hazan, Elad, Mannor, Shie, and Shamir, Ohad. Online learning for time series prediction. arXiv preprint arXiv:1302.6927, 2013.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Choong, Miew Keen, Charbit, Maurice, and Yan, Hong. Autoregressive-model-based missing value estimation for DNA microarray time series data. Information Technology in Biomedicine, IEEE Transactions on, 13(1):131-137, 2009.

Dempster, Arthur P., Laird, Nan M., and Rubin, Donald B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.

Ding, Jie, Han, Lili, and Chen, Xiaoming. Time series AR modeling with missing observations based on the polynomial transformation. Mathematical and Computer Modelling, 51(5):527-536, 2010.

Dunsmuir, William and Robinson, P. M. Estimation of time series models in the presence of missing data. Journal of the American Statistical Association, 76(375):560-568, 1981.

Durbin, James and Koopman, Siem Jan. Time Series Analysis by State Space Methods. Number 38. Oxford University Press, 2012.

Hazan, Elad. The convex optimization approach to regret minimization. Optimization for Machine Learning, pp. 287, 2011.

Hazan, Elad, Livni, Roi, and Mansour, Yishay. Classification with low rank and missing data. CoRR, abs/1501.03273, 2015.

Honaker, James and King, Gary. What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561-581, 2010.

Kalman, Rudolph Emil. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35-45, 1960.

Liu, Xiangheng and Goldsmith, Andrea. Kalman filtering with partial observation losses. In Decision and Control, 2004. CDC. 43rd IEEE Conference on, volume 4, pp. 4180-4186. IEEE, 2004.

Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

Shumway, Robert H. and Stoffer, David S. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253-264, 1982.

Sinopoli, Bruno, Schenato, Luca, Franceschetti, Massimo, Poolla, Kameshwar, Jordan, Michael I., and Sastry, Shankar S. Kalman filtering with intermittent observations. Automatic Control, IEEE Transactions on, 49(9):1453-1464, 2004.

Wang, Zidong, Yang, Fuwen, Ho, Daniel W. C., and Liu, Xiaohui.
Robust finite-horizon filtering for stochastic systems with missing measurements. Signal Processing Letters, IEEE, 12(6):437-440, 2005.