# online_forecasting_of_totalvariationbounded_sequences__883076d7.pdf Online Forecasting of Total-Variation-bounded Sequences Dheeraj Baby Department of Computer Science UC Santa Barbara dheeraj@ucsb.edu Yu-Xiang Wang Department of Computer Science UC Santa Barbara yuxiangw@cs.ucsb.edu We consider the problem of online forecasting of sequences of length n with total-variation at most Cn using observations contaminated by independent σsubgaussian noise. We design an O(n log n)-time algorithm that achieves a cumulative square error of O(n1/3C2/3 n σ4/3 + C2 n) with high probability. We also prove a lower bound that matches the upper bound in all parameters (up to a log(n) factor). To the best of our knowledge, this is the first polynomial-time algorithm that achieves the optimal O(n1/3) rate in forecasting total variation bounded sequences and the first algorithm that adapts to unknown Cn. Our proof techniques leverage the special localized structure of Haar wavelet basis and the adaptivity to unknown smoothness parameters in the classical wavelet smoothing [Donoho et al., 1998]. We also compare our model to the rich literature of dynamic regret minimization and nonstationary stochastic optimization, where our problem can be treated as a special case. We show that the workhorse in those settings online gradient descent and its variants with a fixed restarting schedule are instances of a class of linear forecasters that require a suboptimal regret of Ω( n). This implies that the use of more adaptive algorithms is necessary to obtain the optimal rate. 1 Introduction Nonparametric regression is a fundamental class of problems that has been studied for more than half a century in statistics and machine learning [Nadaraya, 1964, De Boor et al., 1978, Wahba, 1990, Donoho et al., 1998, Mallat, 1999, Scholkopf and Smola, 2001, Rasmussen and Williams, 2006]. It solves the following problem: Let yi = f(ui)+Noise for i = 1, ..., n. How can we estimate a function f using data points (u1, y1), ..., (un, yn) and the knowledge that f belongs to a function class F? Function class F typically imposes only weak regularity assumptions on the function f such as boundedness and smoothness, which makes nonparametric regression widely applicable to many real-life applications especially those with unknown physical processes. A recent and successful class of nonparametric regression technique called trend filtering [Steidl et al., 2006, Kim et al., 2009, Tibshirani, 2014, Wang et al., 2014] was shown to have the property of local adaptivity [Mammen and van de Geer, 1997] in both theory and practice. We say a nonparametric regression technique is locally adaptive if it can cater to local differences in smoothness, hence allowing more accurate estimation of functions with varying smoothness and abrupt changes. For example, for functions with bounded total variation (when F is a total variation class), standard nonparametric regression techniques such as kernel smoothing and smoothing splines have a mean square error (MSE) of O(n 1/2) while trend filtering has the optimal O(n 2/3). 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. Trend filtering is, however, a batch learning algorithm where one observes the entire dataset ahead of the time and makes inference about the past. This makes it inapplicable to the many time series problems that motivate the study of trend filtering in the first place [Kim et al., 2009]. These include influenza forecasting, inventory planning, economic policy-making, financial market prediction and so on. In particular, it is unclear whether the advantage of trend filtering methods in estimating functions with heterogeneous smoothness (e.g., sharp changes) would carry over to the online forecasting setting. The focus of this work is in developing theory and algorithms for locally adaptive online forecasting which predicts the immediate future value of a function with heterogeneous smoothness using only noisy observations from the past. 1.1 Problem Setup 1. Fix action time intervals 1, 2, ..., n 2. The player declares a forecasting strategy Ai : Ri 1 R for i = 1, ..., n. 3. An adversary chooses a sequence θ1:n = [θ1, θ2, . . . , θn]T Rn. 4. For every time point i = 1, ..., n: (a) We play xi = Ai(y1, ..., yi 1). (b) We receive a feedback yi = θi + Zi, where Zi is a zero-mean, independent subgaussian noise. 5. At the end, the player suffers a cumulative error Pn i=1 (xi θi)2. Figure 1: Nonparametric online forecasting model. The focus of the proposed work is to design a forecasting strategy that minimizes the expected cumulative square error. Note that the problem depends a lot on the choice of the sequence θi. Our primary interest is on sequences with bounded total variation (TV) so that Pn i=2|θi θi 1| Cn, but we will also talk about the adaptivity of our method to easier problems such as forecasting Sobolev and Holder functions. We propose a model for nonparametric online forecasting as described in Figure 1. This model can be re-framed in the language of the online convex optimization model with three differences. 1. We consider only quadratic loss functions of the form ℓt(x) = (x θt)2. 2. The learner receives independent noisy gradient feedback, rather than the exact gradient. 3. The criterion of interest is redefined as the dynamic regret [Zinkevich, 2003, Besbes et al., Rdynamic(A, ℓ1:n) := E t=1 inf xt ℓt(xt). The new criterion is called a dynamic regret because we are now comparing to a stronger dynamic baseline that chooses an optimal x in every round. Of course in general, the dynamic regret will be linear in n [Jadbabaie et al., 2015]. To make the problem non-trivial, we restrict our attention to sequences of ℓ1, ..., ℓn that are regular, which makes it possible to design algorithms with sublinear dynamic regret. In particular, we borrow ideas from the nonparametric regression literature and consider sequences [θ1, ..., θn] that are discretizations of functions in the continuous domain. Regularity assumptions emerge naturally as we consider canonical functions classes such as the Holder class, Sobolev class and Total Variation classes [see, e.g., Tsybakov, 2008, for a review]. 1.2 Assumptions We consolidate all the assumptions used in this work and provide necessary justifications for them. (A1) The time horizon for the online learner is known to be n. (A2) The parameter σ2 of subgaussian noise in the observations is known. (A3) The ground truth denoted by θ1:n = [θ1, ..., θn]T has its total variation bounded by some positive Cn, i.e., we take F to be the total variation class TV(Cn) := {θ1:n Rn : Dθ1:n 1 Cn} where D is the discrete difference operator. Here Dθ1:n = [θ2 θ1, . . . , θn θn 1]T . (A4) |θ1| U. The knowledge of σ2 in assumption (A2) is primarily used to get the optimal dependence of σ in minimax rate. This assumption can be relaxed in practice by using the Median Absolute Deviation estimator as described in Section 7.5 of Johnstone [2017] to estimate σ2 robustly. Assumption (A3) features a samples from a large class of functions with spatially inhomogeneous degree of smoothness. The functions residing in this class need not even be continuous. Our goal is to propose a policy that is locally adaptive whose empirical mean squared error converges at the minimax rate for this function class. We stress that we do not assume that the learner knows Cn. The problem is open and nontrivial even when Cn is known. Assumption (A4) is very mild as it puts restriction only to the first value of the sequence. This assumption controls the inevitable prediction error for the first point in the sequence. 1.3 Our Results The major contributions of this work are summarized below. It is known that the minimax MSE for smoothing sequences in the TV class is Ω(n 2/3). This implies a lowerbound of Ω(n1/3) for the dynamic regret in our setting. We present a policy ARROWS (Adaptive Restarting Rule for Online averaging using Wavelet Shrinkage) with a nearly minimax dynamic regret O(n1/3) and a run-time complexity of O(n log n). We show that a class of forecasting strategies including the popular Online Gradient Descent (OGD) with fixed restarts [Besbes et al., 2015], moving averages (MA) [Box and Jenkins, 1970] are fundamentally limited by Ω( n) regret. We also provide a more refined lower bound that characterized the dependence of U, Cn and σ, which certifies the adaptive optimality of ARROWS in all regimes. The bound also reveals a subtle price to pay when we move from the smoothing problem to the forecasting problem, which indicates the separation of the two problems when Cn/σ n1/4, a regime where the forecasting problem is strictly harder (See Figure 3). Lastly, we consider forecasting sequences in Sobolev classes and Holder classes and establish that ARROWS can automatically adapt to the optimal regret of these simpler function classes as well, while OGD and MA cannot, unless we change their tuning parameter (to behave suboptimally on the TV class). 2 Related Work The topic of this paper sits well in between two amazing bodies of literature: nonparametric regression and online learning. Our results therefore contribute to both fields and hopefully will inspire more interplay between the two communities. Throughout this paper when we refer O(n1/3) as the optimal regret, we assume the parameters of the problem are such that it is acheivable (see Figure 3). Nonparametric regression. As we mentioned before, our problem online nonparametric forecasting is motivated by the idea of using locally adaptive nonparametric regression for time series forecasting [Mammen and van de Geer, 1997, Kim et al., 2009, Tibshirani, 2014]. It is more challenging than standard nonparametric regression because we do not have access to the data in the future. While our proof techniques make use of several components (e.g., universal shrinkage) from the seminal work in wavelet smoothing [Donoho et al., 1990, 1998], the way we use them to construct and analyze our algorithm is new and more generally applicable for converting non-parametric regression methods to forecasting methods. Adaptive Online Learning. Our problem is also connected to a growing literature on adaptive online learning which aims at matching the performance of a stronger time-varying baseline [Zinkevich, 2003, Hall and Willett, 2013, Besbes et al., 2015, Chen et al., 2018b, Jadbabaie et al., 2015, Hazan and Seshadhri, 2007, Daniely et al., 2015, Yang et al., 2016, Zhang et al., 2018a,b, Chen et al., 2018a]. Many of these settings are highly general and we can apply their algorithms directly to our problem, but to the best of our knowledge, none of them achieves the optimal O(n1/3) dynamic regret. In the remainder of this section, we focus our discussion on how to apply the regret bounds in non-stationary stochastic optimization [Besbes et al., 2015, Chen et al., 2018b] to our problem while leaving more elaborate discussion with respect to alternative models (e.g. the constrained comparator approach [Zinkevich, 2003, Hall and Willett, 2013], adaptive regret [Jadbabaie et al., 2015, Zhang et al., 2018a], competitive ratio [Bansal et al., 2015, Chen et al., 2018a]), as well as the comparison to the classical time series models to Appendix A. Regret from Non-Stationary Stochastic Optimization The problem of non-stationary stochastic optimization is more general than our model because instead of considering only the quadratic functions, ℓt(x) = (x θt)2, they work with the more general class of strongly convex functions and general convex functions. They also consider both noisy gradient feedbacks (stochastic first order oracle) and noisy function value feedbacks (stochastic zeroth order oracle). In particular, Besbes et al. [2015] define a quantity Vn which captures the total amount of variation of the functions ℓ1:n using Vn := Pn 1 i=1 ℓi+1 ℓi . 1 Chen et al. [2018b] generalize the notion to Vn(p, q) := Pn 1 i=1 ℓi+1 ℓi q p 1/q for any 1 p, q + where p:= ( R | (x)|pdx)1/p is the standard Lp norm for functions2. Table 1 summarizes the known results under the non-stationary stochastic optimization setting. Table 1: Summary of known minimax dynamic regret in the non-stationary stochastic optimization model. Note that the choice of q does not affect the minimax rate in any way, but the choice of p does. - indicates that the no upper or lower bounds are known for that setting. Noisy gradient feedback Noisy function value feedback Assumptions on ℓ1:n p = + 1 p < + p = + 1 p < + Convex & Lipschitz Θ(n2/3V 1/3 n ) O(n 2p+d 3p+d Vn(p, q) p 3p+d ) - - Strongly convex & Smooth Θ(n1/2V 1/2 n ) Θ(n 2p+d 4p+d Vn(p, q) 2p 4p+d ) Θ(n2/3V 1/3 n ) Θ(n 4p+d 6p+d Vn(p, q) Our assumption on the underlying trend θ1:n F can be used to construct an upper bound of this quantity of variation Vn or Vn(p, q). As a result, the algorithms in non-stationary stochastic optimization and their dynamic regret bounds in Table 1 will apply to our problem (modulo additional restrictions on bounded domain). However, our preliminary investigation suggests that this direct reduction does not, in general, lead to optimal algorithms. We illustrate this observation in the following example. Example 1. Let F be the set of all bounded sequences in the total variation class TV (1). It can be worked out that Vn(p, q) = O(1) for all p, q. Therefore the smallest regret from [Besbes et al., 2015, Chen et al., 2018b] is obtained by taking p + , which gives us a regret of O(n1/2). Note that we expect the optimal regret to be O(n1/3) according to the theory of locally adaptive nonparametric regression. In Example 1, we have demonstrated that one cannot achieve the optimal dynamic regret using known results in non-stationary stochastic optimization. We show in section 3.1 that Restarting OGD algorithm has a fundamental lower bound of Ω( n) on dynamic regret in the TV class. Online nonparametric regression. As we finalize our manuscript, it comes to our attention that our problem of interest in Figure 1 can be cast as a special case of the online nonparametric regression problem [Rakhlin and Sridharan, 2014, Gaillard and Gerchinovitz, 2015]. The general result of Rakhlin and Sridharan [2014] implies the existence of an algorithm that enjoys a O(n1/3) regret for the TV class without explicitly constructing one, which shows that n1/3 is the minimax rate when Cn = O(1) (see more details in Appendix A). To the best of our knowledge, our proposed algorithm remains the first polynomial time algorithm with O(n1/3) regret and our results reveal more precise (optimal) upper and lower bounds on all parameters of the problem (see Section 3.4). 3 Main results We are now ready to present our main results. 1The Vn definition in [Besbes et al., 2015] for strongly convex functions are defined a bit differently, the is taken over the convex hull of minimizers. This creates some subtle confusions regarding our results which we explain in details in Appendix I. 2We define Vn(p, q) to be a factor of n 1/q times bigger than the original scaling presented in [Chen et al., 2018b] so the results become comparable to that of [Besbes et al., 2015]. 3.1 Limitations of Linear Forecasters Restarting OGD as discussed in Example 1, fails to achieve the optimal regret in our setting. A curious question to ask is whether it is the algorithm itself that fails or it is an artifact of a potentially suboptimal regret analysis. To answer this, let s consider the class of linear forecasters estimators that outputs a fixed linear transformation of the observations y1:n. The following preliminary result shows that Restarting OGD is a linear forecaster . By the results of Donoho et al. [1998], linear smoothers are fundamentally limited in their ability to estimate functions with heterogeneous smoothness. Since forecasting is harder than smoothing, this limitation gets directly translated to the setting of linear forecasters. Proposition 2. Online gradient descent with a fixed restart schedule is a linear forecaster. Therefore, it has a dynamic regret of at least Ω( n). Proof. First, observe that the stochastic gradient is of form 2(xt yt) where xt is what the agent played at time t and yt is the noisy observation θt + Independent noise. By the online gradient descent strategy with the fixed restart interval and an inductive argument, xt is a linear combination of y1, ..., yt 1 for any t. Therefore, the entire vector of predictions x1:t is a fixed linear transformation of y1:t 1. The fundamental lower bound for linear smoothers from Donoho et al. [1998] implies that this algorithm will have a regret of at least Ω( n). The proposition implies that we will need fundamentally new nonlinear algorithmic components to achieve the optimal O(n1/3) regret, if it is achievable at all! In this section, we present our policy ARROWS (Adaptive Restarting Rule for Online averaging using Wavelet Shrinkage). The following notations are introduced for describing the algorithm. th denotes start time of the current bin and t be the current time point. yth:t denotes the average of the y values for time steps indexed from th to t. pad0(yth, ..., yt) denotes the vector (yth yth:t, ..., yt yth:t)T zero-padded at the end till its length is a power of 2. i.e, a re-centered and padded version of observations. T(x) where x is a sequence of values, denotes the element-wise soft thresholding of the sequence with threshold σ p β log(n) H denotes the orthogonal discrete Haar wavelet transform matrix of proper dimensions Let Hx = α = [α1, α2, ..., αk]T where k being a power of 2 is the length of x. Then the vector [α2, ..., αk]T can be viewed as a concatenation of log2 k contiguous blocks represented by α[l], l = 0, ..., log2(k) 1. Each block α[l] at level l contains 2l coefficients. ARROWS: inputs - observed y values, time horizon n, std deviation σ, δ (0, 1], a hyperparameter β > 24 1. Initialize th = 1, new Bin = 1, y0 = 0 2. For t = 1 to n: (a) If new Bin == 1, predict xth t = yt 1, else predict xth t = yth:t 1 (b) set new Bin = 0, observe yt and suffer loss (xth t θt)2 (c) Let y = pad0(yth, ..., yt) and k be the padded length. (d) Let ˆα(th : t) = T(H y) (e) Restart Rule: If 1 k Plog2(k) 1 l=0 2l/2 ˆα(th : t)[l] 1> σ i. set new Bin = 1 ii. set th = t + 1 Our policy is the byproduct of following question: How can one lift a batch estimator that is minimax over the TV class to a minimax online algorithm? Restarting OGD when applied to our setting with squared error losses reduces to partitioning the duration of game into fixed size chunks and outputting online averages. As described in Section 3.1, this leads to suboptimal regret. However, the notion of averaging is still a good idea to keep. If within a time interval, the Total Variation (TV) is adequately small, then outputting sample averages is reasonable for minimizing the cumulative squared error. Once we encounter a bump in the variation, a good strategy is to restart the averaging procedure. Thus we need to adaptively detect intervals with low TV. For accomplishing this, we communicate with an oracle estimator whose output can be used to construct a lowerbound of TV within an interval. The decision to restart online averaging is based on the estimate of TV computed using this oracle. Such a decision rule introduces non-linearity and hence breaks free of the suboptimal world of linear forecasters. The oracle estimator we consider here is a slightly modified version of the soft thresholding estimator from Donoho [1995]. We capture the high level intuition behind steps (d) and (e) as follows. Computation of Haar coefficients involves smoothing adjacent regions of a signal and taking difference between them. So we can expect to construct a lowerbound of the total variation Dθ1:n 1 from these coeffcients. The extra thresholding step T(.) in (d) is done to denoise the Haar coefficients computed from noisy data. In step (e), a weighted L1 norm of denoised coefficients is used to lowerbound the total variation of the true signal. The multiplicative factors 2l/2 are introduced to make the lowerbound tighter. We restart online averaging once we detect a large enough variation. The first coefficient ˆα(th : t)1 is zero due to the re-centering caused by pad0 operation. The hyper-parameter β controls the degree to which we shrink the noisy wavelet coefficients. For sufficiently small β, It is almost equivalent to the universal soft-thresholding of [Donoho, 1995]. The optimal selection of β is described in Theorem 3. We refer to the duration between two consecutive restarts inclusive of the first restart but exclusive of the second as a bin. The policy identifies several bins across time, whose width is adaptively chosen. 0 5000 10000 15000 20000 25000 30000 ogd ma arrows truth 102 103 104 105 106 107 n ogd ma arrows n1/3logn line (nlogn)1/2 line Figure 2: An illustration of ARROWS on a sequence with heterogeneous smoothness. We compare qualitatively (on the left) and quantitatively (on the right) to two popular baselines: (a) restarting online gradient descent [Besbes et al., 2015]; (b) the moving averages [Box and Jenkins, 1970] with optimal parameter choices. As we can see, ARROWS achieves the optimal O(n1/3) regret while the baselines are both suboptimal. 3.3 Dynamic Regret of ARROWS In this section, we provide bounds for non-stationary regret and run-time of the policy. Theorem 3. Let the feedback be yt = θt + Zt, t = 1, . . . , n and Zt be independent, σ-subgaussian random variables. If β = 24 + 8 log(8/δ) log(n) , then with probability at least 1 δ, ARROWS achieves a dynamic regret of O(n1/3 Dθ1:n 2/3 1 σ4/3 + |θ1|2+ Dθ1:n 2 2+σ2) where O hides a logarithmic factor in n and 1/δ. Proof Sketch. Our policy is similar in spirit to restarting OGD but with an adaptive restart schedule. The key idea we used is to reduce the dynamic regret of our policy in probability roughly to a sum of squared error of a soft thresholding estimator and number of restarts. This was accomplished by using a Follow The Leader (FTL) reduction. For bounding the squared error part of the sum we modified the threshold value for the estimator in Donoho [1995] and proved high probability guarantees for the convergence of its empirical mean. To bound the number of times we restart, we first establish a connection between Haar coefficients and total variation. This is intuitive since computation of Haar coefficients can be viewed as smoothing the adjacent regions of a signal and taking their difference. Then we exploit a special condition called uniform shrinkage of the soft-thresholding estimator which helps to optimally bound the number of restarts with high probability. Theorem 3 provides an upper bound of the minimax dynamic regret for forecasting the TV class. Corollary 4. Suppose the ground truth θ1:n TV (Cn) and |θ1| U. Then Dθ1:n 1 Cn. By noting that Dθ1:n 2 Dθ1:n 1, under the setup in Theorem 3 ARROWS achieves a dynamic regret of O(n1/3C2/3 n σ4/3 + U 2 + C2 n + σ2) with probability at-least 1 δ. Remark 5 (Adaptivity to unknown parameters.). Observe that ARROWS does not require the knowledge of Cn.It adapts optimally to the unknown TV radius Cn := Dθ1:n 1 of the ground truth θ1:n. The adaptivity to n can be achieved by a standard doubling trick. σ, if unknown, can be robustly estimated from the first few observations by a Median Absolute Deviation estimator (eg. Section 7.5 of Johnstone [2017]), thanks to the sparsity of wavelet coefficients of TV bounded functions. 3.4 A lower bound on the minimax regret We now give a matching lower bound of the expected regret, which establishes that ARROWS is adaptively minimax. Proposition 6. Assume min{U, Cn} > 2πσ and n > 3, there is a universal constant c such that inf x1:n sup θ1:n TV(Cn) E t=1 (xt(y1:t 1) θt)2 # c(U 2 + C2 n + σ2 log n + n1/3C2/3 n σ4/3). The proof is deferred to the Appendix I. The result shows that our result in Theorem 3 is optimal up to a logarithmic term in n and 1/δ for almost all regimes (modulo trivial cases of extremely small min{U, Cn}/σ and n)3. Remark 7 (The price of forecasting). The result also shows that forecasting is strictly harder than smoothing. Observe that a term with C2 n is required even if σ = 0, whereas in the case of a one-step look-ahead oracle (or the smoothing algorithm that sees all n observations) does not have this term. This implies that the total amount of variation that any algorithm can handle while producing a sublinear regret has dropped from Cn = o(n) to Cn = o( n). Moreover, the regime where the n1/3C2/3 n σ4/3 term is meaningful only when Cn = o(n1/4). For the region where σn1/4 Cn σn1/2, the minimax regret is essentially proportional to C2 n. This is illustrated in Figure 3. We note that in much of the online learning literature, it is conventional to consider a slightly more restrictive setting with bounded domain, which could reduce the minimax regret. The following remark summarizes a variant of our results in this setting. Remark 8 (Minimax regret in bounded domain). If we consider predicting sequences from a subset of the TV (Cn) ball having an extra boundedness condition |θi| B for i = 1 . . . n, it can be shown that (see Appendix I) minimax regret is Ω min{n B2, nσ2, n1/3C2/3 n σ4/3} + B2 + min{n B2, BCn} + σ2 . In particular, forecasting is still strictly harder than smoothing due to the min{n B2, BCn} term in the bound. The discussion in Appendix I, shows a way of using ARROWS whose regret can match this lower bound. 3When both U and Cn are moderately small relative to σ, the lower bound will depend on σ a little differently because the estimation error goes to 0 faster than 1/ n. We know the minimax risk exactly for that case as well but it is somewhat messy [see e.g. Wasserman, 2006]. When they are both much smaller than σ, e.g., when min{U, Cn} σ/ n, then outputting 0 when we do not have enough information will be better than doing online averages. Figure 3: An illustration of the minimax (dynamic) regret of forecasters and smoothers as a function of Cn. The non-trivial regime for forecasting is when Cn lies between σ q n and σ n1/4 where forecasting is just as hard as smoothing. When Cn > σ n1/4, forecasting is harder than smoothing. The yellow region indicates the extra loss incurred by any minimax forecaster. The green region marks the extra loss incurred by a linear forecaster compared to minimax forecasting strategy. The figure demonstrates that linear forecasters are sub-optimal even in the non-trivial regime. When Cn > σ n1/2, it is impossible to design a forecasting strategy with sub-linear regret. For Cn > σ n, identity function is optimal estimator for smoothing and when when Cn < σ q n , online averaging is optimal for both problems. 3.5 The adaptivity of ARROWS to Sobolev and Holder classes It turns out that ARROWS is also adaptively optimal in forecasting sequences in the discrete Sobolev classes and the discrete Holder classes, which are defined as S(C n) = {θ1:n : Dθ1:n 2 C n}, H(B n) = {θ1:n : Dθ1:n B n}. These classes feature sequences that are more spatially homogeneous than those in the TV class. The minimax cumulative error of nonparametric estimation in the discrete Sobolev class is Θ(n2/3[C n]2/3σ4/3) [see e.g., Sadhanala et al., 2016, Theorem 5 and 6]. Corollary 9. Let the feedback be yt = θt + Zt where Zt is an independent, σ-subgaussian random variable. Let θ1:n S(C n) and |θ1| U. If β = 24 + 8 log(8/δ) log(n) , then with probability at least 1 δ, ARROWS achieves a dynamic regret of O(n2/3[C n]2/3σ4/3 + U 2 + [C n]2 + σ2) where O hides a logarithmic factor in n and 1/δ. Thus despite the fact that ARROWS is designed for total variation class, it adapts to the optimal rates of forecasting sequences that are spatially regular. To gain some intuition, let s minimally expand the Sobolev ball to a TV ball of radius Cn = n C n. The chosen scaling of Cn activates the embedding S(C n) TV (Cn) (see the illustration in Table 2) with both classes having same minimax rate in the batch setting. This implies that dynamic regret of ARROWS is simultaneously minimax optimal over S(C n) and TV (Cn) wrt the term containing n. It can be shown that ARROWS is optimal wrt to the additive [C n]2, U 2, σ2 terms as well. Minimaxity in Sobolev class implies minimaxity in Holder class since it is known that a Holder ball is sandwiched between two Sobolev balls having the same minimax rate [see e.g., Tibshirani, 2015]. A proof of the Corollary and related experiments are presented in Appendix F and J. 3.6 Fast computation Last but not least, we remark that there is a fast implementation of ARROWS that reduces the overall time-complexity for n step from O(n2) to O(n log n). Proposition 10. The run time of ARROWS is O(n log(n)), where n is the time horizon. The proof exploits the sequential structure of our policy and sparsity in wavelet transforms, which allows us to have O(log n) incremental updates in all but O(log n) steps. See Appendix G for details. Table 2: Minimax rates for cumulative error Pn i=1(ˆθi θ)2 in various settings and policies that achieve those rates. ARROWS is adaptively minimax across all of the described function classes while linear forecasters fail to perform optimally over the TV class. For simplicity, we assume U is small and hide a log n factors in all the forecasting rates. Class Minimax rate for Forecasting Minimax rate for Smoothing Minimax rate for Linear Forecasting TV Dθ1:n 1 Cn n1/3C2/3 n σ4/3 + C2 n + σ2 n1/3C2/3 n σ4/3 + σ2 n1/2Cnσ + C2 n + σ2 Sobolev Dθ1:n 2 C n n2/3[C n]2/3σ4/3 + [C n]2 + σ2 n2/3[C n]2/3σ4/3 + σ2 n2/3[C n]2/3σ4/3 + [C n]2 + σ2 Holder Dθ1:n Ln n L2/3 n σ4/3 + n L2 n + σ2 n L2/3 n σ4/3 + σ2 n L2/3 n σ4/3 + n L2 n + σ2 Minimax Algorithm ARROWS Wavelet Smoothing Trend Filtering Restarting OGD Moving Averages Holder class Holder class ! "" |f(x) f(y)| |x y| k D k2 1 pn (f 0(x))2dx 1 Sobolev class sup n2N+ 0 x1<...