# heteroscedastic_sequences_beyond_gaussianity__775483c3.pdf Heteroscedastic Sequences: Beyond Gaussianity Oren Anava OANAVA@TX.TECHNION.AC.IL Technion, Haifa, Israel Shie Mannor SHIE@EE.TECHNION.AC.IL Technion, Haifa, Israel We address the problem of sequential prediction in the heteroscedastic setting, when both the signal and its variance are assumed to depend on explanatory variables. By applying regret minimization techniques, we devise an efficient online learning algorithm for the problem, without assuming that the error terms comply with a specific distribution. We show that our algorithm can be adjusted to provide confidence bounds for its predictions, and provide an application to ARCH models. The theoretical results are corroborated by an empirical study. 1. Introduction Heteroscedasticity refers to the case in which the variability of the dependent variable (also called signal) is unequal across the range of values of the explanatory variables (also called features). In statistical modeling, the variability is usually characterized through the conditional variance, which is a key parameter in many statistical applications such as volatility estimation in finance, disease phenotypes prediction in medicine, and more. Much work has been done on parameter estimation and signal prediction using heteroscedastic models, mostly relying on statistical assumptions on the error terms such as Gaussianity or other symmetrical distributions. These assumptions allow the use of Maximum Likelihood (ML) techniques to recover consistent estimators for the signal and its conditional variance. However, if these assumptions are not met in practice, the resulting estimators are no longer consistent and the following statement from (Whittaker & Robinson, 1967) is sometimes quoted: Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation. In this paper we argue that traditional modeling assumptions on the signal generation can be substantially relaxed while still maintaining the ability to solve the problem. Moreover, we offer a novel online learning approach that allows the signal to be partially adversarial and partially stochastic. We show that our approach is more general than a ML-based approach, and is capable of coping with rather complex scenarios and models. An important aspect of our work is bridging the gap between a statistical approach and a pure online learning approach for sequential prediction problems. We claim that while the statistical approach fails to model real-world data due to strict distributional assumptions, the online learning approach fails to do so due to lack of such assumptions, and the actual truth lies somewhere in between them. 1.1. Main contribution In this work we propose a new approach to handle heteroscedasticity an online learning approach that does not require the error terms to be Gaussian, nor to comply with a specific distribution as is common in the statistical approach to the problem. The main contributions of this work are as follows: Casting the problem of heteroscedastic signal prediction as an online learning problem, in which two terms of regret are minimized in parallel: one captures the prediction accuracy, and the other measures the quality of the conditional variance estimation. This casting is the key idea that enables handling non-Gaussian error terms in this setting. Design of an online learning algorithm that is suitable to work with biased gradients. The necessity in our case arises since the conditional variance is not observed, and can only be estimated with bias. Heteroscedastic Sequences: Beyond Gaussianity Derivation of worst case confidence bounds for the prediction in each round, using regret analysis. Here, worst case refers to the distribution of the error terms, which is assumed to be unknown and might even vary from round to round. Application to ARCH Prediction in which the robustness of our approach to non-Gaussian error terms is demonstrated on synthetic data. 1.2. Related Work Heteroscedastic models have been extensively studied, mainly in the context of time series and regression. In regression, perhaps the earliest heteroscedastic model that was considered (see (Lee, 1973) for example) assumes the following generation of the signal: yt = u xt + ϵt t, (1) where xt is a known feature vector and ϵt N(0, σ2 t ) for σ2 t = v xt. In this setting, u and v are unknown to us and have to be regressed by the algorithm. Though many such regression algorithms exist, they are all based on the following principle: apply a known technique (e.g., least squares or maximum likelihood) to recover u, and then regress over ϵ2 t = (yt u xt)2 to recover v. A comparison of several such algorithms appears in (Amemiya, 1977). Clearly, the basic model (Equation (1)) suffers from many shortcomings, such as lack of generality and the need for strong assumptions on the error distribution. In the following years, many works addressed these issues and proposed more general frameworks to model heteroscedasticity. Initially, various parametric models of the conditional variance were considered: σt = σ(1 + |v xt|)λ σt = σ|v xt|λ (Box & Hill, 1974); σt = σeλv xt (Bickel, 1978); 1 + (v xt)2 (Jobson & Fuller, 1980). Later works showed that qualitatively similar results can be obtained for any smooth (and known in advance) function σt = σ(xt) (Davidian & Carroll, 1987; Muller & Stadtmuller, 1987). In a parallel line of work, (Fuller & Rao, 1978) circumvented the need to come up with a specific parametric form of the conditional variance by assuming that the signal is divided into finite number of groups, where the variance within each group is equal. Their results rely as well on Gaussian distributional assumptions of the error terms. The problem of conditional variance estimation was also studied in the context of nonparametric models. (Fan & Yao, 1998) considered two-dimensional strictly stationary processes {(yt, xt)} of the form yt = m(xt) + σ(xt)ϵt t, where E [ϵt | xt] = 0 and E ϵ2 t | xt = 1, and offered residual based estimators which are locally linear. The idea of nonparametric estimators was further studied in the works of (Yu & Jones, 2004) who proposed likelihoodbased locally linear estimators; (Brown et al., 2007) who applied the difference sequence idea to estimate the conditional variance; and (Mishra et al., 2010) who incorporated parametric and nonparametric estimators in a multiplicative way. In the field of time series analysis and prediction, it was the seminal work of (Engle, 1982), in which the autoregressive conditional heteroscedastic (ARCH) model was introduced, that led the development of a plethora of heteroscedastic models. Among the many extensions to the ARCH model, one can find the GARCH model (Bollerslev, 1986), the EGARCH model (Nelson, 1991), and many other models that were shown to be highly effective in practice. In the learning literature, the study of heteroscedasticity is rather limited. We note the work of (Zhu et al., 2013) that is aimed at coping with high-dimensional heteroscedastic data. Perhaps the closest works to ours, at least in spirit, are the works of (Anava et al., 2013; 2015) who considered the sequential prediction problem using the AR and ARMA models in a partially (or fully) adversarial setting. We also note the works of (Even-Dar et al., 2009; Yu et al., 2009), who considered a hybrid setting in the context of Markov decision processes. 2. Preliminaries and Model Before defining our setting, we provide some useful background about kernel methods and the framework of Online Convex Optimization (OCO). 2.1. Kernel Methods A kernel is a function k : X X R (for some X Rd), which is usually assumed to be continuous. A kernel is a Mercer kernel if for any finite set of points {x1, . . . , xn} the n n matrix K, where Kij is defined to be k(xi, xj), is positive semi-definite. For such kernels, it is well-known that there exists a Hilbert space H and a mapping φ : X H such that k(xi, xj) = φ(xi), φ(xj) , where , is the inner product in H. Two well-known examples of kernels are the polynomial kernel k(xi, xj) = 1 + x i xj p and the Gaussian kernel k(xi, xj) = exp 1 2σ2 xi xj 2 , which are particularly useful for modeling nonnegative entities. See (Shawe-Taylor & Cristianini, 2004) for more information on Kernel methods. Heteroscedastic Sequences: Beyond Gaussianity Algorithm 1 LAZY OGD (on the ℓ2 unit ball) 1: Input: learning rate ηt. 2: Set a1 = 0. 3: for t = 1 to T do 4: Play at and observe loss ℓt(at) 5: Set at+1 = ηt Pt i=1 ℓi(ai) max{1,ηt Pt i=1 ℓi(ai) } 6: end for In our context, we will use kernel functions to characterize the first and second moments of the signal. The motivation is to gain a modeling power without paying big computational costs. This important property does not come without drawbacks; perhaps the most prominent is the fact that not all online algorithms are compatible with kernels. We turn to present an online algorithm that is suited for kernels: LAZY ONLINE GRADIENT DESCENT (OGD), which is also known to be a special instance of FOLLOW THE REGULARIZED LEADER (FTRL) algorithm. 2.2. Online Convex Optimization and LAZY OGD One of the most well-studied frameworks of online learning is Online Convex Optimization (OCO). In this framework, an online player iteratively chooses an action at K, and then suffers loss that is equal to ℓt(at). The action set K is assumed to be a closed and bounded convex subset of Rd, and the loss functions {ℓt}T t=1 are assumed to be convex functions from K to [0, 1]. The performance of the player is measured using the regret criterion, defined as follows: RT (ℓ1, . . . , ℓT ) = t=1 ℓt(at) min a K where T is a predefined integer that denotes the total number of rounds played. The goal in this framework is to design efficient algorithms, whose regret grows sublinearly in T, corresponding to an average per-round regret going to zero as T increases. One of the popular algorithms for OCO is the LAZY OGD algorithm (Algorithm 1), for which the following regret bound is known1: RLazy T (ℓ1, . . . , ℓT ) = t=1 ℓt(at) min a K t=1 ηt ℓt(at) 2. (2) where a = arg mina K PT t=1 ℓt(a). If the action set K is assumed to be the ℓ2 unit ball, then we have RLazy T (ℓ1, . . . , ℓT ) 2G T for a properly chosen ηt and 1We use throughout the paper to denote the ℓ2-norm. G = maxa,t { ℓt(a) }. A complete analysis can be found in (Hazan, 2011; Shalev-Shwartz, 2012). 2.3. Online Prediction of Heteroscedastic Signals Assume the following iterative game between a player and nature (which might be adversarial): at round t, nature chooses xt X Rd and generates yt R such that E [ yt | xt ] = u 0 φ(xt) and Var [ yt | xt ] = v 0 ψ(xt), where φ and ψ are induced by a Mercer kernel, and u0 , v0 1 are set beforehand by nature. A special case, for instance, is the standard linear model yt = u 0 xt + q where the error terms ϵt are assumed to be distributed N(0, 1). Note that the adversarial behavior of nature is expressed in (1) the selection of xt; (2) the choice of the parameters u0, v0; and (3) the distribution of yt given xt. The player receives xt and has to provide a prediction yt = u t φ(xt) for yt and an estimation σ2 t = v t ψ(xt) for its conditional variance. The player, of course, is not aware of nature s selection of u0 and v0, but is aware of the functions φ and ψ (in the sense of knowing to compute the inner products φ(xi), φ(xj) and ψ(xi), ψ(xj) for any xi, xj X). After committing to yt and σ2 t , the player incurs two losses, one for the prediction error and the other for the conditional variance inaccuracy: ℓSig t (ut) = yt u t φ(xt) 2 ℓVar t (vt) = v t ψ(xt) v 0 ψ(xt) 2 . Naturally, the goal of the player is to (separately) minimize the sum of losses, over a predefined number of rounds T. Here also, we choose the regret to measure the performance of the online player. Thus, for the signal prediction task we have RT (ℓSig 1 , . . . , ℓSig T ) = t=1 ℓSig t (ut) min u 1 t=1 ℓSig t (u), and for the conditional variance estimation we have RT (ℓVar 1 , . . . , ℓVar T ) = t=1 ℓVar t (vt) min v 1 t=1 ℓVar t (v). (3) Notice that ℓVar t (v) cannot be directly computed since v 0 ψ(xt) is unknown at any stage, and thus a reasonable goal would be to minimize E RT (ℓVar 1 , . . . , ℓVar T ) . Readers, especially those familiar with online learning, might wonder if our setting cannot be strengthened in certain ways. First of all, in many online learning applications, nothing is assumed about the data generation, and we might Heteroscedastic Sequences: Beyond Gaussianity envision a scenario in which the signal yt is generated arbitrarily. However, this setting is ill-defined in the context of conditional variance estimation, as the notion of variance does not exist for non-stochastic signal. This fact causes the regret term in Equation (3), along with learning in this setting, to be meaningless. 2.4. Our Assumptions Throughout this work we assume the following: (1) For any t, it holds that E [ yt | xt ] = u 0 φ(xt) and Var [ yt | xt ] = v 0 ψ(xt), where φ and ψ are functions induced by a Mercer kernel, and u0 , v0 1. (2) It holds that φ(x) , ψ(x) 1 for any x X. (3) It holds that yt [ 1, 1] for any t. We note that these assumptions can be relaxed, and are merely here to simplify the exposition and calculations. In assumption (1), it would be sufficient to require E [ yt | xt ] u 0 φ(xt) and Var [ yt | xt ] v 0 ψ(xt), for a small enough bias (that would be added to our regret bound). Assumption (2) can be replaced by the assumption that φ(x) , ψ(x) C for some C < (which need not be known in advance, as the algorithm can be easily adjusted to handle this case by using a standard doubling trick). Assumption (3) can be relaxed to light tail assumption. That is, we can assume that yt lies in a finite interval with high probability (which holds in particular for Gaussian errors), without increasing the complexity of the problem at hand. 3. Our Approach The main challenge in our setting is the fact that the conditional variance is not revealed to us at any stage, and yet we wish to compete against the best conditional variance estimator in hindsight. To circumvent this issue, we use biased estimators of the conditional variance instead of the actual ones. The bias follows from the fact that the expected value of signal is also unknown, and can only be approximated by our online algorithm. Next, we take a step back from our setting, and present a general working scheme for OCO with biased gradient estimators. This scheme was considered before (e.g., in the work of (Huh & Rusmevichientong, 2013)), but the existing algorithms are not suitable to our setting due to lack of generality (and in particular, inability of coping with kernels). Throughout this section, we denote by Ft 1 the sigma-algebra that is generated by all the actions played up to round t, and all losses occurred up to round t 1. 3.1. OCO with Biased Gradient Estimators Consider the OCO framework described in Section 2.2, with the following change: after committing to an action at, the online player only receives a feedback in the form of a biased gradient estimator at at. For this framework, we can prove the following: Proposition 3.1. Let {ℓt}T t=1 be a sequence of loss functions, and { ℓt}T t=1 a sequence of corresponding approximation functions for which it holds that E[ ℓt(a) | Ft 1] = ℓt(a) + bt(a), for any t and a K. Denote by {at}T t=1 the sequence of actions that a first-order algorithm A outputs for {ht}T t=1, where ht(a) = ℓt(a) + a ( ℓt(at) ℓt(at)). Then, E[RT (ℓ1, . . . , ℓT )] = t=1 E [ℓt(at)] E BA T (h1, . . . , h T ) t=1 E (at a ) bt(at) , where a = arg mina K PT t=1 ℓt(a), and BA T (h1, . . . , h T ) is a regret bound of algorithm A applied to {ht}T t=1. Basically, the proposition states that one can provide biased gradient estimators as an input to any first-order online algorithm, and incur a corresponding additional term in the regret. The proof of this Proposition appears in the supplementary material. 3.2. Algorithm and Analysis We turn to present our algorithm (Algorithm 2) along with its analysis. We start by defining ℓVar t (vt) = v t ψ(xt) (yt u t φ(xt))2 2 , (4) which is an approximation to ℓVar t (vt). This definition plays an important role in our analysis, since ℓVar t (vt) is unobserved at any stage. Note that despite the inefficient representation of Algorithm 2, in practice the predictions and the variance estimations are generated efficiently using a simple kernel trick. This form is easier to analyze and is thus stated here. For Algorithm 2 we can prove the following: Theorem 3.2. Algorithm 2 generates online sequences {ut}T t=1 and {vt}T t=1, for which it holds that RT (ℓSig 1 , . . . , ℓSig T ) t=1 ℓSig t (ut) min u 1 t=1 ℓSig t (u) 8T 1/2, Heteroscedastic Sequences: Beyond Gaussianity Algorithm 2 1: Input: learning rates ηSig, ηVar. 2: Set u1 = 0 and v1 = 0. 3: for t = 1 to T do 4: Play ut and observe loss ℓSig t (ut) 5: Play vt and suffer loss ℓVar t (vt) 6: Set ut+1 = ηSig Pt i=1 ℓSig i (ui) max{1,ηSig Pt i=1 ℓSig i (ui) } 7: Set vt+1 = ηVar Pt i=1 ℓVar i (vi) max{1,ηVar Pt i=1 ℓVar i (vi) } 8: end for E RT (ℓVar 1 , . . . , ℓVar T ) t=1 E ℓVar t (vt) min v 1 t=1 ℓVar t (v) 64T 1/2, if we choose ηSig = ηVar = 1 2 Proof. Notice that applying Algorithm 2 to {ℓSig t }T t=1 is equivalent to applying LAZY OGD, and thus it trivially holds that RT (ℓSig 1 , . . . , ℓSig T ) 8T 1/2. For {ℓVar t }T t=1 we have to work somewhat harder. Our proof relies on the fact that Algorithm 2 generates an online sequence {ut}T t=1 for which it holds that t=1 E h u t φ(xt) u 0 φ(xt) 2i u φ(xt) u 0 φ(xt) 2 8T 1/2. This observation is proved in the supplementary material. This immediately implies that t=1 E h u t φ(xt) u 0 φ(xt) 2i 8T 1/2, since min u 1 PT t=1 u φ(xt) u 0 φ(xt) 2 = 0 for the case where u0 1, which holds by assumption (1). Next, we use the definition of ℓVar t in Equation 4 to derive E h ℓVar t (vt) | Ft 1 i = ℓVar t (vt) + bt(vt), where bt(vt) = 2ψ(xt)(u 0 φ(xt) u t φ(xt))2. Notice that bt(vt) does not depend on vt in this case. Defining ht(v) = ℓVar t (v)+v ( ℓVar t (vt) ℓVar t (vt)) and applying LAZY OGD to {ht}T t=1 gives (by Proposition 3.1): E[RT (ℓVar 1 , . . . , ℓVar T )] = t=1 E ℓVar t (vt) t=1 ℓVar t (v ) E [BT (h1, . . . , h T )] t=1 E (vt v ) bt(vt) , where v = arg minv PT t=1 ℓt(v), and BT (h1, . . . , h T ) is the regret bound of LAZY OGD for {ht}T t=1. It can be easily shown that E [BT (h1, . . . , h T )] 32T 1/2. Finally, we can bound t=1 E (vt v ) bt(vt) t=1 E (vt v ) ψ(xt)(u 0 φ(xt) u t φ(xt))2 t=1 E (u 0 φ(xt) u t φ(xt))2 32T 1/2, which completes the proof. 3.3. Worst-Case Confidence Bounds We are now interested in using the results from the previous section to generate confidence bounds for our prediction. More formally, given a probability α (0, 1) our task is to provide a sequence {ct}T t=1 for which it holds that: t=1 P |u t φ(xt) yt| ct α. (5) In words, the expected proportion of the predictions for which the distance to the actual signal exceeds the corresponding ct is at most α. The above trivially holds if we choose large enough constants {ct}T t=1, and thus we are interested not only in finding such constants, but also in showing that they are tight in a sense. To derive Equation (5), we somewhat abuse notations and define ℓSig t (u) = 1 c2 t yt u φ(xt) 2 and ℓVar t (v) = 1 c4 t v ψ(xt) v 0 ψ(xt) 2. Notice that here also we need an estimated loss for ℓVar t as it is not revealed to us. Thus, we define ℓVar t (v) = 1 v ψ(xt) yt u t φ(xt) 2 2 . Note that ct might be random, yet it must hold that E [ct | Ft 1] = ct. That is, ct is known given the actions played up to round t, and the losses occurred up to time t 1. Otherwise, the losses are not well defined. Now, we can prove the following: Proposition 3.3. Let ℓSig t , ℓVar t and ℓVar t be as defined above and let α (0, 1). Then, Algorithm 2 generates online sequences {ut}T t=1 and {vt}T t=1, for which it holds that: t=1 P |u t φ(xt) yt| ct α, 2 max{β,v t ψ(xt)} α and β = 16T 1/4. Heteroscedastic Sequences: Beyond Gaussianity Remark: The constant 2 can be further improved to 1 (infinitesimally) by adjusting the value assigned to β. For simplicity, we prove only the result stated in the claim. As mentioned before, finding a sequence {ct}T t=1 for which 1 T PT t=1 P |u t φ(xt) yt| ct α is meaningless, unless this sequence is tight in some sense. In our context, we need to show that there exist an error distribution for which the above holds in the other direction. Thus, we set some k 1, and define the following error distribution: ϵt = yt u 0 φ(xt) = v 0 ψ(xt), w.p. 1 2k2 0, w.p. 1 1 v 0 ψ(xt), w.p. 1 2k2 One can easily verify that for this distribution E [yt | xt] = u 0 φ(xt) and Var [yt | xt] = v 0 ψ(xt). Now, note that for the choice ct = k p v 0 ψ(xt) it holds that: P |yt u t φ(xt)| ct P (|ϵt| ct) = P |ϵt| k q v 0 ψ(xt) = 1 Setting k = q 1 α gives the result. 4. Extensions and Applications Here we extend the result of the previous section to several interesting cases. The first is the multivariate case, in which yt Rn and the conditional variance then takes the form of a matrix. The second is an extension of our approach to higher moments, which enable the derivation of tighter confidence bounds for the prediction. 4.1. The Multivariate Case In the multivariate case, the considered game is described as follows. At round t, nature chooses xt Rd and generates yt Rn such that: (1) E [ yt | xt ] = U0φ(xt), where φ : Rd R ˆd and U0 is an n ˆd matrix with U0 F 1. Here and on, F refers to the Frobenius norm. (2) Var [ yt | xt ] = V0ψ(xt), where ψ : Rd R d and V0 is an n n d tensor such that V0 F 1 and Vijk = Vjik for any i, j and k. Here also, the player receives xt and has to provide a prediction yt = Utφ(xt), and an estimation Σ2 t = Vtψ(xt) for its covariance matrix. After committing to yt and Σ2 t, the player suffers two losses: ℓSig t (Ut) = yt Utφ(xt) 2 , ℓVar t (Vt) = Vtψ(xt) V0ψ(xt) 2 F . The regret is defined accordingly. Notice that in this setting as well an estimation for ℓVar t is required, and thus we define an approximation ℓVar t as follows: V ψ(xt) (yt Utφ(xt))(yt Utφ(xt)) 2 Applying Algorithm 2 to the extended setting yields the following result: Corollary 4.1. Algorithm 2 generates online sequences {Ut}T t=1 and {Vt}T t=1, for which it holds that t=1 ℓSig t (Ut) min U F 1 t=1 ℓSig t (U) 8 t=1 E ℓVar t (Vt) min V F 1 t=1 ℓVar t (V ) = 64n The proof resembles the proof of Theorem 3.2, and is thus omitted here. 4.2. Higher Moments The motivation in this section is to refine the result from Section 3.3 to higher moments. That is, to derive confidence bounds for the prediction which account for higher moments (other than the first and the second). We present here only a high level description of our approach and defer the technical parts to future work. We start by providing some useful background. Recall that for a random variable X that has a moment generating function MX that is finite in some open interval I about 0, it holds that: (1) X has moments of all orders; and (2) we can represent MX(s) = P n=0 E[Xn]sn n! for s I. By Chernoff bounds, we know we can upper bound the tail events of X as follows: P(X x) e sx MX(s) for s > 0, and P(X x) e sx MX(s) for s < 0. The above can further be optimized over s to derive tight bounds. In our context, these facts can be used to generate confidence bounds of the form we are interested in (as in Equation (5)). Thus, assume that for each yt and its corresponding feature vector xt it holds that: E [(yt E [ yt ])n] = u n φn(xt), for n {1, . . . , k}. In words, the n-th moment of the error term is given by the inner product between some Heteroscedastic Sequences: Beyond Gaussianity vector un and a function of xt that is known in advance. This assumption generalizes the assumptions presented in Section 2 to higher moments. In addition, assume that P n=0 u n φn(xt)sn n! Pk n=1 u n φn(xt)sn n! for some k, which is independent of T. This assumption holds, for instance, when (yt E [ yt | xt ]) [ 1, 1]. Then, for a given yt we can derive the following bound: P (yt E [ yt ] c) e sc k X u n φn(xt)sn for s > 0, and the symmetric inequality for s < 0. This, again, can be optimized over s to derive the optimal bound. Notice, however, that u1, . . . , uk are unknown to us, and thus we need the regret analysis from which we can derive vectors u1,t, . . . , uk,t with the following guarantee: E [Rn T ] = t=1 E [ℓn t (un,t)] min u 1 t=1 ℓn t (u) = o(T), for n {1, . . . , k}. Here, as before, we use the definition ℓn t (u) = (u φn(xt) u n φn(xt))2 and its corresponding estimate ℓn t (u) = (u φn(xt) yn t )2 to derive the regret bounds. 5. Application to ARCH Models We turn to present an application of our result to ARCH models. We first provide some background and then proceed to formally define the adaptation to our framework and our main result. 5.1. Background Let {yt}T t=1 be a time series (that is, a series of signal observations). The traditional ARCH(p) (short for autoregressive conditional heteroskedasticity) model of (Engle, 1982) is parameterized by lag p and coefficient vectors u0, v0 Rp+1. The model assumes that yt is a noisy linear combination of the previous p observations. That is, yt = u0(0) + k=1 u0(k)yt k + ϵt. (6) The error term ϵt in this model is assumed to be split into a stochastic piece zt and a time-dependent standard deviation σt, characterizing the typical size of the error terms so that ϵt = σtzt. The random variable zt is usually assumed to be a white noise process. The term σ2 t complies with the following model: σ2 t = v0(0) + k=1 v0(k)ϵ2 t k, (7) where v0(0) > 0, and v0(k) 0 for all k > 0. Notice that the AR(p) model is a special case of the ARCH(p) model, where the v0(k) coefficients are all zero for k > 0. 5.2. Adaptation to Our Setting In our context, we will describe the setting as follows: First, some coefficient vectors (u0, v0) are fixed by the adversary. At each round t, the adversary generates ϵt from some zeromean distribution with variance σ2 t , and then use it to determine yt via Equation (6). We emphasize that (u0, v0) and the noise terms (along with their distribution) are not revealed to us at any point. At round t, we need to make a prediction yt for the signal and another prediction σ2 t for its conditional variance. After that, we incur two losses: ℓSig t (ut) = (yt yt(ut))2 and ℓVar t (vt) = σ2 t σ2 t (vt) 2 . Naturally, yt(ut) should be of the form yt(ut) = ut(0) Pp k=1 ut(k)yt k, but the main question is what should be the form of σ2 t (vt)? Recall that ϵt p, . . . , ϵt 1 are unobserved, and thus we cannot straightforwardly apply our approach (which requires knowing the feature vector). 5.3. Main Result Our main result relies on the fact that Proposition 3.1 does not require the feature vector to be explicitly given, but only that the gradient of the approximated loss are close enough to the gradient of the original loss. Thus, if we let ϵ2 t(ut) = (yt yt(ut))2 and consequently define ℓVar t (v) = ϵ2 t(ut) v(0) k=1 v(k) ϵ2 t k(ut k) then we can prove the following result: Corollary 5.1. Let ℓSig t , ℓVar t , and ℓVar t be as defined above. Then, Algorithm 2 generates online sequences {ut}T t=1 and {vt}T t=1, for which it holds that RT (ℓSig 1 , . . . , ℓSig T ) O(T 1/2), and also that E RT (ℓVar 1 , . . . , ℓVar T ) O(T 1/2), if we choose ηSig = ηVar = 1 2 The corollary relies on the fact that attaining a regret bound for ℓSig 1 , . . . , ℓSig T is a standard task in OCO. This regret bound implies that ϵ2 t(ut) is close in average to ϵ2 t, which in turn implies that ℓVar t is close to ℓVar t . 5.4. Experimental Results Most of the works on time series prediction consider what we call the offline setting: given a time series, compute the model parameters (in our case, the ARCH coefficients) and Heteroscedastic Sequences: Beyond Gaussianity 0 200 400 600 800 1000 0 Prediction MSE vs. time (Gaussian errors) MLE Algorithm 2 0 200 400 600 800 1000 0 Variance estimation MSE vs. time (Gaussian errors) MLE Algorithm 2 0 200 400 600 800 1000 0.2 Prediction MSE vs. time (Uniform errors) MLE Algorithm 2 0 200 400 600 800 1000 0 Variance estimation MSE vs. time (Uniform errors) MLE Algorithm 2 0 200 400 600 800 1000 0 Prediction MSE vs. time (discrete errors) MLE Algorithm 2 0 200 400 600 800 1000 0 Variance estimation MSE vs. time (discrete errors) MLE Algorithm 2 Figure 1. MSE of the prediction and the conditional variance estimation as a function of time. MSE@1000 for the signal prediction task MSE@1000 for the variance estimation task Gaussian errors Uniform errors Discrete errors Gaussian errors Uniform errors Discrete errors MLE 0.1823 (0.0159) 0.2037 (0.0137) 0.0580 (0.0119) 0.0157 (0.0139) 0.0138 (0.0121) 0.0625 (0.0207) Algorithm 2 0.1799 (0.0134) 0.2025 (0.0065) 0.0526 (0.0108) 0.0141 (0.0021) 0.0058* (0.0087) 0.0499* (0.0092) Table 1. MSE@1000 of the prediction and the conditional variance estimation. Bold font marks the best results, and asterisk indicates significance level of 0.05. Standard deviations of the results are presented in brackets. measure the prediction error. Our online setting can be seen as a sequential offline setting, in which at round t we are given the time series values up to round t 1 and our task is to predict the signal and its variance at round t. In light of this, we adapt the state-of-the-art offline baseline (MLE) to the online setting. Note that this adaptation does not weaken the offline baseline in any way, and is used only for comparison purposes. In the plots we present the average accumulated loss up to round t for t = 1, . . . , 1000. To test the robustness of our approach to different error distributions, we generate three time series using the ARCH model (Equations (6) and (7)) with u0 = (0, 0.55, 0.11) and v0 = (0.1, 0.25, 0.25), each differs only in its error distribution. We consider the following distributions of the stochastic piece of the error terms: zt N(0, 1); zt Uni( 3); and a discrete distribution k, w.p. 1 2k2 0, w.p. 1 1 k2 k, w.p. 1 2k2 20. In all cases, one can easily verify that E [zt] = 0 and E z2 t = 1. This implies that the first and second moments of yt conditioned on the history are equal. We apply our algorithm and the MLE baseline (Engle, 1982) to the three time series. Figure 1 presents the MSE of the prediction and the MSE of the estimated conditional variance with respect to the true conditional variance (the latter is known as the data is synthetically generated). Table 1 presents the MSE at the end of both tasks, where bold font marks the best results, and asterisk indicates significance level of 0.05. To ensure stability, we average the results over 50 runs. As evidenced by Figure 1 and Table 1, both algorithms perform roughly the same in the signal prediction task regardless of the error distribution. However, the online algorithm significantly surpasses the standard MLE in the variance estimation task, mainly when the time series exhibits some complicated error distribution (uniform or discrete). These empirical findings support the theoretical results and validate the generality of the online approach in practice. 6. Conclusion and Discussion In this paper we presented an approach for the problem of sequential signal prediction in heteroscedastic environment. The main novelty of our approach is the fact that we allow the signal to be partially adversarial, in contrast to traditional methods that require it to be fully stochastic. To date, we are not aware of many works pursuing this direction (even outside the scope of sequential prediction of heteroscedastic sequences). We hope to extend this approach more broadly to bridge the gap between the statistical approach and online learning. Heteroscedastic Sequences: Beyond Gaussianity Acknowledgments The research leading to these results has received funding from the European Community s Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL), and the European Unions Seventh Framework Programme (FP7/2007-2013) under grant agreement n 336078 ERC-SUBLRN. Amemiya, Takeshi. A note on a heteroscedastic model. Journal of Econometrics, 6(3):365 370, 1977. Anava, Oren, Hazan, Elad, Mannor, Shie, and Shamir, Ohad. Online learning for time series prediction. ar Xiv preprint ar Xiv:1302.6927, 2013. Anava, Oren, Hazan, Elad, and Zeevi, Assaf. Online time series prediction with missing data. In ICML, 2015. Bickel, Peter J. Using residuals robustly i: Tests for heteroscedasticity, nonlinearity. The Annals of Statistics, pp. 266 291, 1978. Bollerslev, Tim. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics, 31(3):307 327, 1986. Box, George E. P. and Hill, William J. Correcting inhomogeneity of variance with power transformation weighting. Technometrics, 16(3):385 389, 1974. doi: 10.1080/ 00401706.1974.10489207. Brown, Lawrence D, Levine, M, et al. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5):2219 2232, 2007. Davidian, Marie and Carroll, Raymond J. Variance function estimation. Journal of the American Statistical Association, 82(400):1079 1091, 1987. Engle, R.F. Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation. Econometrica, 50:987 1007, 1982. Even-Dar, Eyal, Kakade, Sham M, and Mansour, Yishay. Online markov decision processes. Mathematics of Operations Research, 34(3):726 736, 2009. Fan, Jianqing and Yao, Qiwei. Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3):645 660, 1998. Fuller, Wayne A. and Rao, J. N. K. Estimation for a linear regression model with unknown diagonal covariance matrix. The Annals of Statistics, 6(5):1149 1158, 09 1978. Hazan, Elad. The convex optimization approach to regret minimization. Optimization for machine learning, pp. 287, 2011. Huh, Woonghee Tim and Rusmevichientong, Paat. Online sequential optimization with biased gradients: theory and applications to censored demand. INFORMS Journal on Computing, 26(1):150 159, 2013. Jobson, JD and Fuller, WA. Least squares estimation when the covariance matrix and parameter vector are functionally related. Journal of the American Statistical Association, 75(369):176 181, 1980. Lee, T. C. Nonlinear methods in econometrics : S.M. Goldfeld and R.E. Quandt, (North-Holland Publ. Co., Amsterdam and London, 1972). Journal of Econometrics, 1 (4):399 401, December 1973. Mishra, Santosh, Su, Liangjun, and Ullah, Aman. Semiparametric estimator of time series conditional variance. Journal of Business & Economic Statistics, 28(2):256 274, 2010. Muller, Hans-Georg and Stadtmuller, Ulrich. Estimation of heteroscedasticity in regression analysis. Ann. Statist., 15(2):610 625, 06 1987. doi: 10.1214/aos/1176350364. Nelson, Daniel B. Conditional heteroskedasticity in asset returns: A new approach. Econometrica: Journal of the Econometric Society, pp. 347 370, 1991. Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2012. Shawe-Taylor, John and Cristianini, Nello. Kernel methods for pattern analysis. Cambridge university press, 2004. Whittaker, E. T. and Robinson, G. The calculus of observations: An introduction to numerical analysis, chapter 11, pp. 285 316. Dover Publications, 1967. Yu, Jia Yuan, Mannor, Shie, and Shimkin, Nahum. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3): 737 757, 2009. Yu, K and Jones, MC. Likelihood-based local linear estimation of the conditional variance function. Journal of the American Statistical Association, 99(465):139 144, 2004. Zhu, Liping, Dong, Yuexiao, and Li, Runze. Semiparametric estimation of conditional heteroscedasticity via single-index modeling. Statistica Sinica, 23(3):1215, 2013.