# online_robust_nonstationary_estimation__7de49895.pdf

Online robust non-stationary estimation

Abishek Sankararaman Balakrishnan (Murali) Narayanaswamy

{abisanka, muralibn}@amazon.com, Amazon Web Services, Santa Clara CA, USA.

The real-time estimation of time-varying parameters from high-dimensional, heavytailed and corrupted data-streams is a common sub-routine in systems ranging from those for network monitoring and anomaly detection to those for traffic scheduling in data-centers. For estimation tasks that can be cast as minimizing a strongly convex loss function, we prove that an appropriately tuned version of the clipped Stochastic Gradient Descent (SGD) is simultaneously (i) adaptive to drift, (ii) robust to heavy-tailed inliers and arbitrary corruptions, (iii) requires no distributional knowledge and (iv) can be implemented in an online streaming fashion. All prior estimation algorithms have only been proven to posses a subset of these practical desiderata. A observation we make is that, neither the O 1

t learning rate for clipped SGD known to be optimal for strongly convex loss functions of a stationary data-stream, nor the O(1) learning rate known to be optimal for being adaptive to drift in a noiseless environment can be used. Instead, a learning rate of T α for α < 1 where T is the stream-length is needed to balance adaptivity to potential drift and to combat noise. We develop a new inductive argument and combine it with a martingale concentration result to derive high-probability under any learning rate on data-streams exhibiting arbitrary distribution shift - a proof strategy that may be of independent interest. Further, using the classical doubling-trick, we relax the knowledge of the stream length T. Ours is the first online estimation algorithm that is provably robust to heavy-tails, corruptions and distribution shift simultaneously. We complement our theoretical results empirically on synthetic and real data.

1 Introduction

Technology improvements have made it easier than ever to collect diverse telemetry at high resolution from any cyber or physical system, for both monitoring and control [31]. This in turn has led to a data deluge, where large amounts of data must be processed at scale [7, 16]. Given the scale and velocity of the data sources, offline processing to make predictions and decisions is computationally cumbersome. For real-time applications such as performance monitoring and anomaly detection, offline/batch processing results in stale predictions [6], [61], [54]. This necessitates computationally cheap, online algorithms that make predictions and decisions on high-dimensional streaming data [41], [21], [58]. Further in many applications, the challenge of being restricted to online algorithms is exacerbated by heavy-tails [2], [3], distribution shift [48], [28] and outliers/anomalies in the data-stream [29], [42], [35, 50]. In practice, although several heuristics to circumvent these issues have been designed [1, 66, 48, 28, 35, 42], there is no systematic study of the impact of these challenges on the achievable statistical accuracy. Motivated by these problems, we study algorithms for high-dimensional online statistical estimation and rigorously establish performance guarantees which exhibit graceful degradation with distribution shift and outliers in the data-stream.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

Figure 1: Figures (a) and (b) show a stream of independent samples (with some corruptions) from unit-variance Gaussian and Pareto distribution with shape parameter 2.001 respectively when the underlying true mean changes with time. Figure (c) shows that clipped-SGD with clip-value of 5 incurs lower regret in estimating the true time-varying mean when the learning rate is set to 1/

T as opposed to the standard 1/(t + 1) known to be optimal in a stationary environment.

1.1 Problem Setup

The online estimation problem we study is modeled as an interaction between an oblivious adversary and the estimator. The adversary before the interaction begins makes two choices - (i) a sequence of probability distributions P1, , PT on Rd and (ii) a sequence of measurable corruption functions (Ct)T t=1 where for every t [T]1, Ct : Rd t Rd t 1 Rp t 1 Rd, whose roles we explain below.

Sequential interaction protocol : At each time t [T], the vector Xt := Zt + Ct, where Zt Pt is sampled independently and Ct := Ct((Zs)s t, (Cs)s<t, (θs)s<t) is the corruption vector computed using the corruption function based on the past random samples and outputs. Xt is shown as input to the estimation algorithm. Subsequently after observing Xt, the estimation algorithm outputs an estimate θt Θ and incurs loss lt := θt θ t 2. Here, L : Rd Rp R is a strongly convex loss function, Θ Rp a closed convex set denoting the parameter space, and θ t := arg minθ Θ EZ Pt[L(Z, θ)] is the unique3 optimizer of the loss function with respect to the distribution Pt, chosen by the adversary at time t. For every time t, we denote this minimizer θ t Θ as the true parameter at time t. The incurred loss lt is un-observed by the algorithm since the true parameter θ t is unknown.

Formally, an estimation4 algorithm A := (At)T t=1 is a sequence of measurable functions such that for all t [T], the estimate output θt := At(Xt, , X1), where At : Rd t Rp is a function of all inputs upto and including time t. After outputting an estimate θt, the algorithm incurs an un-observed loss which is the L2 distance between the estimate and the true parameter.

Causal property of the adversary: Recall that at each time t, the corruption vector is given by Ct = Ct((Zs)s t, (Cs)s<t, (θs)s<t), which is based only on the past realizations Z1, , Zt, C1, , Ct 1 and the estimator s past outputs θ1, , θt 1. The adversary does not have access to future randomness Zt+1, , ZT while choosing the corruption vector Ct. This model of corruptions is stronger than those used in online robust estimation (cf. [19], [13]) since in our model, the adversary can choose the time instants of corruption while in previous models, the time instants of corruption are chosen randomly through i.i.d. coin flips. Our model of corruption locations are motivated by empirical observations that corruptions rarely occur randomly and are typically bunced [23, 42]. Thus we assume that the adversary has causal power to choose the location of corruptions rather than corruptions occuring at random instants.

Regret as a Performance Measure: The regret of the algorithm denoted by Reg T is defined as

t=1 θt θ t ,

1We denote by [T] := {1, , T} 2Throughout, we denote by as the L2 norm operator. 3Convexity implies existence and uniqueness of θ t Θ 4In the literature, this is sometimes also denoted as an optimization algorithm. c.f. [8]

the total cumulative loss. This definition is known as clean dynamic regret [13], since (i) θt is compared against the true parameter θ t , even if the input Xt at time t is corrupted, i.e., even if Ct = 0, and (ii) θt is compared to the time-varying optimal θ t . Technically the regret of an algorithm A is a random variable that is a function of the probability measures (Pt)T t=1 and the corruption functions (Ct)T t=1 and denoted as Reg(A) T ((Pt)T t=1, (Ct)T t=1). We will denote the regret as Reg T whenever it is clear from context.

We now define the data-stream complexity parameters impacting regret : (i) distribution shift and (ii) corruptions.

Distribution Shift5: The cumulative distance between successive optimum points ΦT := PT t=2 θ t θ t 1 measures the complexity of distribution-shift on the data-stream. This is a natural measure since if the estimator outputs θt Θ at time t, will incur an L2 loss lt = θt θ t .

Corruptions: The corruption measure is the count ΛT := PT t=1 1(Ct = 0), of the number of times the random sample Zt is corrupted before inputting to the estimator. The count ΛT is a random-variable since the vector Ct is a deterministic function of past random variables.

With this setup, the main desiderata we seek of an estimator is that it (i) is online : O(d) run-time per data point on a stream and O(d) overall memory, (ii) is free of distributional knowledge, i.e., does not require bounds on moments of (Pt)t 1, distribution shift or corruptions, (iv) attains sub-linear in T regret even when the distributions (Pt)t 1 are heavy-tailed, and (v) exhibits graceful degradation of regret with respect to distribution shift and corruptions on the data stream.

1.2 Main Contributions

1. Problem formulation and lower bounds: We formalize the problem of online estimation under drift and corruptions and define a dimension-free target regret benchmark in Equation (1) that captures all the aforementioned desiderata. We give lower bounds in Propositions 16.1 and 2.6 that shows the necessity of assumptions such as finite diameter for the set Θ Rd in which the unknown parameters θ t Θ lie. Further discussion in Table 1 and in Section 11.

2. High-probability upper bounds: Theorem 5.1 proves that a tuned version of clipped-SGD achieves all aforementioned desiderata. No previous algorithm achieves all the desiderata simultaneously (Tables 1 and 2). Further in Theorem 4.3, we show that if in addition the data-stream is sub-gaussian, then our bound is optimal in the sense when instantiated with special cases of having no drift and corruptions [24] or in the setting of no corruptions and 0 variance [43], it recovers known optimal results.

3. Algorithmic insights: Our key finding is that neither the learning rate of O(1) known to be optimal for SGD to adapt to distribution shift in the absence of noise, nor the O(1/t) learning rate known to be optimal in a stationary environment can be used. Instead a learning rate of T α for α < 1 is needed to combat noise and be adaptive to drift simultaneously. See also Figure 1. We show a lower bound in Proposition 6.1 that this insight is fundamental and not an artifact of our analysis. Knowledge of T is relaxed using the doubling trick in Section 14 in the Appendix. Our result improves prior state of art in [8] that show that in the absence of corruptions, SGD without clipping can give sub-linear regret bounds holding in expectation. However, in the presence of heavy-tailed noise, high probability guarantees are more insightful compared to only in expectation [63]. Our algorithmic contribution is that a tuned version of clipped SGD can give high probability sub-linear regret guarantees, that hold even in the presence of non-stationarities and corruptions.

4. Technical contributions in the proof techniques: From a technical perspective, we employ novel proof techniques to derive high-probability under any learning rate on data-streams exhibiting arbitrary distribution shift. The prior state-of-art analysis in [36, 55] are limited to the specialized setting when the learning rate is ηt = 1/(t + 1) and the data does not exhibit drift. To be adaptive to drift requires newer techniques even in the case of sub-gaussian tails to achieve dimension-free bounds (cf. Section 4.1). We used an induction based technique to apply martingale concentrations in Lemma 19.9. We build upon this technique using contradiction arguments in Lemma 20.10 to extend to the heavy-tailed case. These arguments are of independent interest to prove bounds for other iterative algorithms.

5Following convention [22], we use the term drift to refer to distribution shift

Setting Best lower bound Best upper bound

Stationary, no-corruptions Ω[

Trace(Σ) O[

Trace(Σ) (ΦT = 0, ΛT = 0) + p

νmax(Σ) ln(1/δ))] [10] + p

νmax(Σ) ln(T/δ))] [38] Stationary, corruptions Prop 2.6: Ω(ΛT D) Thm 5.1: e O(T 5 6 σ2 + T 3 4 DΛT ) (ΦT = 0, ΛT 0)

Non-stationary, no-corruptions [8]: Ω(T 2/3Φ1/3 T ) Thm 5.1: e O(T 2 3 ΦT + T 5 6 σ2) (ΦT 0, ΛT = 0). [8] gives bound in expectation only. Non-stationary, corruptions Prop 16.1: Ω(ΛT D) Thm 5.1: e O(T 2 3 ΦT + T 5 6 σ2 + T 3 4 ΛT D) (ΦT 0, ΛT 0)

Table 1: Regret bounds for online estimation with heavy-tailed data and strongly convex loss. The setting of ΦT = 0, ΛT = 0 (line 1) is characterized (upto log factors) in [10] and [38]. The first dimension free, high probability upper bounds are established in this paper for all other settings. For the setting of (ΦT 0, ΛT = 0), [8] gives a lower bound and an upper bound holding in expectation, while ours is the first high-probability upper bound. Except line 1, there is a gap between known lower and upper bounds, closing which is an open problem (see also Section 9).

1.3 Motivating application

Online Anomaly Detection (AD) problems consist of a loss function L( , ) and parameter θt that varies with time t, such that for the input Xt received at time t, the algorithm outputs anomaly score St := L(Xt, θt) ([47], [12]). The objective is to output lower scores for samples that are not corrupted, i.e., if at time t the data distribution is Zt Pt, then the optimal anomaly score S t := L(Xt, θ t ) is the one produced by the model minimizing the average anomaly score i.e., θ t arg minθ Θ EZ Pt[L(Z, θ)] [51]. The distance θt θ t between θt the model used at time t and the optimal θ t is a measure of AD performance degradation [32, 50], motivating our study of low regret algorithms.

Special cases of such problems include robust online mean estimation [19] and robust online linear regression [20], both of which we consider in simulations in Section 7. The online mean estimation corresponds to L(X, θ) := 1

2 X θ 2 2 since arg minθ EZ P[ Z θ 2] = EZ P[Z]. Online linear regression corresponds to splitting the input X Rd1+d2 as X := (X(1), X(2)), with X(1) Rd1 and X(2) Rd2 and using the re-construction loss L(X, θ) := 1

2 X(2) θT X(1) 2 2 as anomaly score. The linear regression setting of detecting anomalies by reconstructing one half of the input X from the other is popularly known as self-supervised AD models [52]. Empirically, in [30], L(X, θ) := (Id θ)X with θ Rd d a matrix of rank k < d and Id Rd d the identity matrix is used to detect anomalies. The linear model of [30] is extended to a non-linear setting in [42], using an auto-encoder [65]. In [17], a tree density model for L(X, θ) is studied. These papers focus is empirical and do not give statistical guarantees.

Organization of the Paper: In Section 2 we state the target benchmark regret along with the assumptions under which we seek it and appropriate lower bounds. In Section 3 we give the clipped SGD algorithm. In Section 4, we give results under the additional assumption of sub-gaussian tails. This helps build intuition and proof techniques for the general result which we give in Section 5. We state implications of our results in Section 6, provide empirical evidence in Section 7, discuss related work in Section 8 and conclude with open problems in Section 9.

2 Formalizing the desiderata

Before formalizing the desiderata in Equation (1), we state some mild assumptions standard in the study of statistical estimation under which we seek guarantees [55, 19].

2.1 Model Assumptions

Assumption 2.1 (Strong convexity). There exists 0 < m M < known to the algorithm, such that for all t, the function θ EZ Pt[L(Z, θ)] is M smooth and m strongly convex. This is a benign assumption that essentially states that properties of the loss function such as the convexity and smoothness are known to the estimation algorithm.

Methods Regret Moment Drift Corruption Free of distributional assumed tolerance tolerance knowledge

T) 2 0 0 [36] O(σT 1/p) p (1, 2] 0 0

d T+ 2 0 O(T) (d T)1/4Λ1/2 T ) Corollary. 4.8 O(T 2/3ΦT + T 2/3σ sub O(T 1 3 ) O(T 2 3 ) (This paper) +T 1 3 D 1 2 ΛT ) gaussian Theorem 4.3 O(T αΦT + T 2 α

2 σ sub O(T 1 α) O(T 2 α

2 ) 6 (This paper) +T α

2 D 1 2 ΛT ) gaussian Corollary 5.3 O(T 2/3ΦT + T 5/6σ2

2 O(T 1/3) O(T 1/4) (This paper) +T 3 4 D1/2ΛT ) Theorem 5.1 O(T 4α

3 ΦT + T 3 α

3 ) O(T 2 3α

2 ) (This paper) +T 3α

2 D 1 2 ΛT )

Table 2: Comparison of high-probability regret bounds of online estimation algorithms of strongly convex functions under various moment assumptions. [55] requires distance of the initial point to the unknown optimum. Theorem 4.3, Corollary 4.8, [19] and [55] requires a bound on σ. Further, [19] is not dimension-free as regret depends on d. We do not include [8] as their regret bounds only hold in expectation and not with high probability.

Assumption 2.2 (Convex domain). The algorithm knows a closed convex set Θ Rp such that for all t, θ t := arg minθ Θ EZ Pt[L(Z, θ)] Θ is the true parameter to be estimated at time t. Assumption 2.3 (Finite diameter). The diameter D := maxx,y Θ x y of Θ is finite, i.e., D < .

This is a necessary assumption when ΛT > 0 (Proposition 2.6). If ΛT = 0, we relax this assumption in Corollaries 4.5 and 5.4. Let Rt(θ) = EZ Pt[L(Z, θ)], denote the population risk at time t. Assumption 2.4 (Known finite upper bound on the true gradient). There exists a known finite G < such that supθ Θ Rt(θ) G. This is necessary for many online optimization algorithms [36, 55, 25, 27]. Assumption 2.5 (Existence of second moment). There exists a matrix Σ 0, unknown to the algorithm such that for all t and θ Θ, the covariance of the random-vector L(Z, θ) is bounded by Σ, i.e., EZ Pt[( L(Z, θ) Rt(θ)) ( L(Z, θ) Rt(θ))T ] Σ holds for all t and θ. The class of distributions admissible by this assumption is large including all sub-gaussian distributions and heavy-tailed distributions that do not have finite higher moments [25], [44]. Observe that no parameteric assumptions on the distribution family is made. Throughout, we denote by σ2 and νmax(Σ) as the trace and the largest eigen-value of Σ respectively. For a positive semi-definite matrix Σ Rp p, P(Σ) is the set of all probability measures satisfying assumptions 2.1, 2.4 and 2.5.

2.2 Target benchmark for algorithm design

We say that an estimator is free of distributional knowledge if it does not need as input bounds on either the moments of Pt nor on the stream complexity ΦT and ΛT . Our key focus - formalized below in Equation (1) - is: can we design an online algorithm without distributional assumptions that has sub-linear regret and that degrades gracefully with drift and corruption?

Can we design an online algorithm A without distributional knowledge, such that l, m, n < 1 such that Σ 0 Rp p, (Pt)T t=1 P(Σ) and (Ct)T t=1, the regret bound

Reg(A) T ((Pt)T t=1, (Ct)T t=1) O (T lΦT + T m + T n DΛT ).poly σ, log T

(1) holds with probability at-least 1 δ ?

6α (0, 1] is an input parameter.

Throughout, we use O( ) to hide poly-logarithmic factors. Our main result is Theorem 5.1, where we prove that tuned clipped SGD achieves this, and our other desiderata.

2.3 Why is Equation (1) a good benchmark for regret bounds with drift and corruption?

The benchmark demands robustness to heavy-tailed data, since regret is sub-linear for any distribution with finite second moment. The benchmark is dimension free, since dimension terms d, p do not appear in Equation (1). The finite diameter D only affects the bound due to corruptions and not the other terms. In particular, if ΛT = 0 almost-surely, then the benchmark seeks a valid bound even if D = . In general when ΛT > 0 however, dependence on D cannot be avoided as we show in Proposition 2.6 below (proof in Section 16). Proposition 2.6 (Finite diameter is necessary even in the absence of drift). There exists a strongly convex loss function L such that for every D > 0 and d N, domain Θ = {x Rd : x D} such that for every T N and every algorithm, there exists a probability measure P on Rd from which the un-corrupted samples are drawn from (i.e., ΦT = 0) and a sequence of corruption functions (Ct)T t=1, such that the regret is at-least Ω(ΛT D) with probability 1.

Thus, even if there is no drift, finite diameter assumption cannot be avoided. Proof is given in Section 16, relies on the modeling asumption that the corruption locations can be arbitrary. This lower bound is in contrast to prior work [19] which shows that finite diameter is not necessary in the absence of corruptions and when the time instants of corruptions are random. Further in Table 1, we compare the best known lower and upper bounds for the various settings in the presence and absence of distribution shifts and corruptions. As can be seen, in all cases involving either distribution shift or corruptions, our work is the first to give high-probability estimation regret bounds.

Equation (1) gives drift and corruption tolerance as additional performance measures: Since the regret explicitly depends on ΦT and ΛT we can define the drift tolerance and corruption tolerance as performance measures of the algorithm. Drift (Corruption) tolerance is the largest ΦT (largest ΛT ) for which we still guarantee sub-linear in T regret. Thus, if Equation (1) is satisfied by an algorithm, then the drift and corruption tolerance is O(T 1 l) and O(T 1 n) respectively. Thus, higher these tolerances, the better an algorithm is as they indicate that the algorithm s degradation with drifts and corruptions are more graceful. Our method is the only one to have non-zero drift tolerance (Table 2).

3 The clipped SGD Algorithm

In this section, we formally describe the algorithm in Algorithm 1 that achieves the desiderata. Algorithm 1 produces an estimate at time t is given by θt PΘ(θt 1 ηtclip( L(Xt, θt 1, λ)), where PΘ is the projection operator on to Θ and clip(x, c) := min(1, c/ x )x, x Rd, c 0.

Algorithm 1 Clipped-SGD [55]

1: Input: (ηt)t 1, λ > 0, θ0 Θ , T time-horizon 2: for each time t = 1, 2, , do 3: Receive sample Xt 4: Output θt PΘ (θt 1 ηtclip( L(Xt, θt 1), λ)) 5: end for

Intuition for why Algorithm 1 can achieve Equation (1): If there is no distribution shift or corruptions, clipped SGD with appropriate learning rate converges to the true parameter [55]. Now, if there are corruptions, clipping limits the impact on the overall regret. On the other hand, when a distribution shift occurs, the dynamics of Algorithm 1 is equivalent to restarting estimation under the new distribution, which will converge to the true parameter of this shifted distribution [55].

4 Regret bounds when the gradient vectors have sub-gaussian tails

Before giving the general result, we consider the special case of sub-gaussian distributions in this section. We do so because (i) the additional structure of sub-gaussian distributions enable us to prove stronger results in Corollaries 4.6 and 4.7 in the sequel, and (ii) the proof techniques from this special case enables us to build towards the proof of the general setting.

Definition 4.1 (Sub-gaussian random vectors, [57]). A random vector Y Rd with co-variance matrix Σ is said to be sub-gaussian, if for all λ > 0 and u Rd with u = 1, E[eλ Y,u ]

e λ2νmax(Σ)

2 , where νmax(Σ) is the highest eigen-value of Σ. Assumption 4.2 (Sub-gaussian assumption with known upper bound). For every t, and θ Θ, the 0 mean random vector L(Z, θ) Rt(θ), where Z Pt is sub-gaussian with covariance matrix upper-bounded in the positive semi-definite sense by an known positive-semi-definite matrix Σ. The following is the main result of this section.

Theorem 4.3 (Informal version of Theorem 18.1). Suppose Assumption 4.2 holds. For every δ (0, 1), if Algorithm 1 is run with constant step-size η = 1 m T α for α [0, 1] and clipping

value λ G + σ + O q

νmax(Σ) ln T

δ , then with probability at-least 1 δ,

Reg T O ΦT T α +

v u u t T 1+αΦT q

νmax(Σ) ln T

m | {z } Regret from distribution shift

νmax(Σ) ln T

| {z } Finite-sample estimation error

mλD | {z } Regret from corruptions

The main achievement in Theorem 4.3 is in explicitly identifying how the choice of α trade-offs the regret contributions from distribution shift, finite sample error and corruptions. Remark 4.4 (Corruption and drift tolerance). When Theorem 4.3 is instantiated with α (0, 1), Equation (2) yields a drift tolerance of O(T 1 α) and corruption tolerance of O(T 1 α

2 ). In particular, instantiating with α = 2/3 yields O(T 1/3) and O(T 2/3) corruption and drift tolerance respectively.

We now read off several corollaries from this result. Corollary 4.5 (Finite diameter assumption can be relaxed in the absence of corruptions). Let ΛT = 0 almost-surely, i.e., there are no adversarial corruptions. Then, under the settings of Theorem 4.3 even when the set Θ is unbounded, the regret bound given in Equation (2) holds. Corollary 4.6 (Optimal in the stationary case:). Under the assumptions of Theorem 4.3, if ΦT = ΛT = 0, then when Algorithm 1 is run with parameters λ = + and η = 1/m T, the regret bound Reg T O(

νmax(Σ) ln(T/δ) + σ)) holds with probability at-least 1 δ.

This corollary recovers the well known result [24] of convergence of SGD on strongly convex functions with sub-gaussian gradients. Corollary 4.7 (Optimal in the noiseless setting). Under the assumptions of Theorem 4.3, if ΛT = 0, and σ = 0, i.e., there is no noise, then running Algorithm 1 with λ = + and η = 1/m obtains regret bound of Reg T = O(ΦT ).

This result matches the lower bound in the noise-less setting [67] and recovers the upper-bound by previous analysis of online clipped-SGD specialized to the noiseless setting [43]. Corollary 4.8 (Special case of α = 2/3). When Theorem 4.3 is instantiated with α = 2/3, then w.p. at-least 1 δ, a regret bound of Reg T O((T 1/3ΦT + T 2/3

m + T 1/3ΛT )poly(σ, G, ln(T/δ))). Remark 4.9 (The excess risk can be bounded using smoothness). Oftentimes, the regret is also measured through the excess risk, i.e., PT t=1 EZ Pt[L(Z, θt) L(Z, θ t )]. Since we assume that L is M smooth (Assumption 2.1), Theorem 4.3 also bounds the excess risk regret by observing that EZ Pt[L(Z, θt) L(Z, θ t )] M

2 θt θ t 2.

4.1 Proof sketch for Theorem 4.3 and technical innovations

The full proof is Appendix 18. We first establish due to the condition on λ that if an input sample is not corrupted, then the gradient will never be clipped. Then in Lemma 19.6, we expand the one-step

recursion of the clipped SGD updates by exploiting the strong-convexity. In order to account for the effect of corruptions, we expand the recursion differently if a time-step is corrupted or not. If a time-step is corrupted, then we accumulate both a bias term of norm at-most λ and an appropriate variance term. If one were only interested in bounds in expectation, then the proof will follow the standard clipped SGD analysis as the expectation of the variance terms are 0.

To prove a high probability bound, a natural approach is to apply martingale concentrations to bound the variance terms as done in [24] in the absence of drifts and corruptions. A naive way to apply martingale concentrations would be to bound the variance error terms due to drifts and corruption by the worst case error by using the finite diameter D. However, this will not lead to the dimension free bound of Theorem 4.3 instead will lead to a bound in where the finite diameter D will also multiply the regret term due to finite sample error. In particular, this approach will not result in Corollary 4.5 of being able to relax the finite diameter assumption in the absence of corruptions.

We circumvent this issue by using an inductive argument to bound the variance terms in Lemma 19.9. Concretely, we prove the induction hypothesis that for each time t, error terms until time t is bounded by an appropriate function of the drifts and corruptions. To establish the induction for a time t, we condition on the event the hypothesis holds till time t 1, and employ martingale concentration to show the error at time t is small. Then we plug the martingale bound back into the one-step recursion and show the induction hypothesis also holds at time t. To get the final un-conditional result, we un-roll all the conditional events by an union bound.

5 Regret bounds in the general heavy-tailed case

The following is the main result of the paper, where we relax Assumption 4.2 and thus making the algorithm free of distributional knowledge and allowing for potentially heavy-tailed data.

Theorem 5.1 (Informal version of Theorem 20.1). When Algorithm 1 is run with clip value λ = 2GT α

3 , and step-size η = 1 m T α for α [0, 1], then for every δ (0, 1),

3 ΦT m3/2 + σT 1 2 + 5α

G | {z } Regret due to distribution shift

3 (Gσ)2 ln T

m | {z } Finite-sample estimation error

D m | {z } Regret from corruptions

(3) holds with probability at-least 1 δ.

The achievement in Theorem 5.1 is in explicitly identifying how the choice of α trade-offs the regret contributions from distribution shift, finite sample error and corruptions, despite heavy-tailed data. Remark 5.2 (Price for relaxing assumption 4.2 is sub-optimal bound in the stationary case:). The setting in Theorem 5.1 is weaker compared to that of Theorem 4.3 since (i) no sub-gaussian tail assumptions are made, and (ii) no knowledge of the upper-bound matrix Σ is assumed. The price for relaxing these assumptions is a weaker regret, specifically in the term due to finite-sample estimation error. This term scales as O(T 1 α

3 ) in Theorem 5.1 while it scales as O(T 1 α

2 ) in Theorem 4.3. In particular in the stationary case, when Equation (3) is instantiated with α = 1, the regret bound reads as Reg T = O(T 2/3σ2) which is weaker than the optimal O(

Tσ) for the stationary case [55, 39].

Corollary 5.3 (Setting α = 1/2). Under the conditions of Theorem 5.1, if Algorithm 1 is run with α = 1/2, then with probability at-least 1 δ, it satisfies a regret bound of Reg T O((T 2/3ΦT + T 11/12 ΦT + T 5/6 + T 3/4ΛT )poly(σ, G, 1/m, ln(T/δ))).

Corollary 5.4 (Finite diameter assumption can be relaxed in the absence of corruptions). Let ΛT = 0 almost-surely, i.e., there are no adversarial corruptions. Then, under the settings of Theorem 5.1 even when the set Θ is unbounded, the regret bound given in Equation (3) holds.

Remark 5.5 (Corruption and drift tolerance). Theorem 5.1 when instantiated with α (0, 1) yields a drift tolerance of O(T 1 4α

3 ) and corruption tolerance of O(T 1 3α

2 ). In particular, instantiating with α = 1/2 yields O(T 1/3) and O(T 1/4) corruption and drift tolerance respectively.

Figure 2: Plot of regret versus α over 10000 steps for mean-estimation. We observe that the best choice of α decreases as ΦT and/or ΛT increases. More details in Section 7.

5.1 Proof sketch of Theorem 5.1 and technical innovations

The proof for this case builds upon the techniques used to prove Theorem 4.3. Unlike in the subgaussian case, we cannot guarantee that only corrupted samples will be clipped in the general case. This introduces an additional bias and variance terms due to clipping of potential un-corrupted samples or inliers. The bias terms of the inliers are handled using techniques from prior works [25, 55]. We bound the variance terms using an inductive argument similar in spirit to that of Theorem 4.3. However, identifying the correct hypothesis involving drifts and outliers so that the induction argument holds is challenging. The challenge is because the induction hypothesis assumed till time t 1 impacts the martingale error term, which in-turn impacts whether the induction hypothesis will hold at time t. We introduce a free parameter to the regret term and deduce the induction hypothesis to hold if a quadratic equation in this parameter does not have any real roots. This extra term contributes to regret degradation compared to the sub-gaussian case (Lemma 20.10).

6 Insights and remarks

Known Time Horizon: This is a benign assumption and can be overcome by the standard doubling trick (Section 2.3 of [11]) with additional log2(T) factor regret (c.f. Appendix 14).

Price for being adaptive to distribution shift: Theorems 4.3 and 5.1 show that, in order to minimize regret due to statistical estimation, we need to set α = 1, i.e., choose a learning rate of O(1/T). This was shown to be optimal in in the absence of drift and corruptions both under sub-gaussian assumptions [24] and in heavy tail settings [55]. On the other hand, we see from Theorems 4.3 and 5.1, that in the presence of drift ΦT > 0, a learning rate α < 1 is sufficient to ensure sub-linear regret. The following lower bound (proved in Appendix 15) shows that α < 1 is also necessary.

Proposition 6.1 (Lower bound showing O(1/T) learning rate is not adaptive to drifts). There exists a loss function L such that for every T, there exists a sequence of probability measures (Pt)t with the diameter D 2 log(T), distribution shift ΦT 2 log(T), ΛT = 0, such that Algorithm 1 when run with λ 1 and step size ηt := 1 t+1 will incur regret at-least T

3 with probability 1.

Finite diameter D is necessary in general: We already saw in Proposition 2.6 that Ω(ΛT D) regret is necessary. We prove the lower-bound by considering mean-estimation in Section 16. Variants of Proposition 2.6 was observed in the literature (for ex. Proposition 1.3 [13], Lemma 6.1 in [33], Theorem D.1 in [14], Line 8 of Algorithm 3 in [19]). In theoretical statistics literature, Ω(ΛT D) is considered un-desirable [13] and thus the models studied restrict corruptions to occur at random times [13, 19] rather than arbitrary as in our setup. However, in practice corruptions are rarely random and typically occur close in time for instance due to machine failures or other external confounding factors [23, 42].

Regret bounds are dimension-free: The problem dimensions d, p do not appear in the regret bound. Moreover, the finite diameter D only appears in the regret term affecting through adversarial corruptions and not in the distribution shift and finite sample error terms. Further, Corollaries 4.5 and 5.4 give regret bounds that hold even when D = in the absence of corruptions.

7 Experiments

Theorems 4.3 and 5.1 give upper bounds on regret showing that tuning the learning rate depending on the amount of distribution shift and number of corruptions in the stream is beneficial. We empirically observe similar phenomena in Figures 2 and 3 (in the Appendix), thus indicating our theoretical observations are fundamental and not artifact of our bounds. We consider L(X, θ) = 1 2 X θ 2 corresponding to mean-estimation in Figure 2 and linear-regression of L(X, θ) = 1 2 X(2) θT X(1) 2, where X = (X(1), X(2)) with X(1) Rd 1, X(2) R, θ R(d 1) in Figure 3 in the Appendix in Section 12. We compare clipped-SGD with learning rate 1/m T α for variety of α with clipping λ = 2T 0.25 and use the doubling trick to remove dependence of T (i.e., use Algorithm 2). For the case of α = 1, we use the learning rate of ηt = 1/(m(t + 1)) and λ = 2

T as suggested in [55]. All plots are averaged over 30 runs and we plot the median along with the 5th and 95th quantile bands. We observe in Figures 2 and 3 that as ΦT or ΛT increases, the optimal α to use decreases, validating the theoretical insight. Further details in the Appendix in Section 12. Evaluations on real-data is conducted in the Appendix in Section 13.

8 Related Work

FITNESS for mean estimation is proposed in [50] that requires variance as input and is adaptive to drifts and corruptions under sub-gaussian distributions with per-sample computational complexity is O(d T). In contrast, our algorithm applies to any strongly convex function, does not require moment bounds, data can be heavy-tailed. The work of [8] studied regret bounds in the absence of corruptions and only give bounds in expectation. A sequence of papers in online estimation including [44, 46, 13, 25, 55, 50] have derived algorithms that are robust in different ways. However, none of them consider the impact of distribution shift. The works of [46, 13] study robust linear regression in the absence of drifts with [46] making Gaussian assumptions on both covariates and response while [13] makes Gaussian assumption only on the responses. The work of [55] show that clipped SGD attains near optimal rates for any strongly convex loss function in the absence of drifts and corruptions. The paper of [19] study online estimation of strongly convex loss functions with corruptions, but do not consider drift. Moreover, their regret bounds are not dimension free (cf. Table 2). Although the paper of [64] gives regret bounds for online learning, a more general setting compared to estimation, do not consider impact of drifts. High probability bounds for optimization with heavy-tailed data have been studied by [44, 25, 15, 60, 36] in recent times. However, none of their analysis considers the impact of drift and corruptions. More related work in Section 10.

9 Conclusions and Open Problems

We studied online estimation in the presence of drifts and corruptions and potentially heavy-tailed inlier data and proved regret bounds for clipped SGD for strongly convex loss functions. Ours is the first proof that an online algorithm can simultaneously be robust to heavy-tails and corruptions and adaptive to drift. A key result was to show how the choice of learning-rate trades off drift, finite sample error and corruptions. Our work leaves several interesting open problems.

The optimal choice of α in Theorems 4.3 and 4.3 is a trade-off between distribution shift, finite sample error and corruptions. Can the optimal α be set without knowing ΦT , ΛT or σ in advance?

In the absence of corruptions, can regret O((

T +T aΦ1 a T )poly(G, σ, log(T/δ)), for some a < 1 be achieved? Such an algorithm would simultaneously have both (i) O(T) distribution shift tolerance which is the best possible and matching the lower bound established in [50], and (ii) closing the gap of Remark 5.2 for the stationary case. Can m = 1/2 in Equation (1) be achieved in the general case with drift and corruptions? Theorem 5.1 shows that only m 2/3 is achievable. For the special case of stationary stream, [55] shows m = 1/2 is achievable, matching the lower bound that m 1/2 is necessary. (Open problem from [55]) Does there exists an online algorithm that can obtain the statistically optimal regret of O(

νmax(Σ) log(T/δ))) for a stationary stream?

[1] Giuseppe Aceto, Alessio Botta, Walter De Donato, and Antonio Pescap e. Cloud monitoring: A survey. Computer Networks, 57(9):2093 2115, 2013.

[2] Ashkan Aghdai, Fan Zhang, Nadun Dasanayake, Kang Xi, and H Jonathan Chao. Traffic measurement and analysis in an organic enterprise data center. In 2013 IEEE 14th International Conference on High Performance Switching and Routing (HPSR), pages 49 55. IEEE, 2013.

[3] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick Mc Keown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Computer Communication Review, 43(4):435 446, 2013.

[4] Arthur Asuncion and David Newman. Uci machine learning repository, 2007.

[5] Dheeraj Baby and Yu-Xiang Wang. Optimal dynamic regret in exp-concave online learning. In Conference on Learning Theory, pages 359 409. PMLR, 2021.

[6] Jarrod Bakker, Bryan Ng, Winston KG Seah, and Adrian Pekar. Traffic classification with machine learning in a live network. In 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pages 488 493. IEEE, 2019.

[7] Richard G Baraniuk. More is less: Signal processing and the data deluge. Science, 331(6018):717 719, 2011.

[8] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations research, 63(5):1227 1244, 2015.

[9] S ebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015.

[10] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l IHP Probabilit es et statistiques, volume 48, pages 1148 1185, 2012.

[11] Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

[12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):1 58, 2009.

[13] Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Online and distribution-free robustness: Regression and contextual bandits with huber contamination. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 684 695. IEEE, 2022.

[14] Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I Jordan, Nicolas Flammarion, and Peter L Bartlett. Optimal robust linear regression in nearly linear time. ar Xiv preprint ar Xiv:2007.08137, 2020.

[15] Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34:4883 4895, 2021.

[16] Guido Dartmann, Houbing Song, and Anke Schmeink. Big data analytics for cyber-physical systems: machine learning for the internet of things. Elsevier, 2019.

[17] Ibrahim Delibalta, Kaan Gokcesu, Mustafa Simsek, Lemi Baruh, and Suleyman S Kozat. Online anomaly detection with nested trees. IEEE Signal Processing Letters, 23(12):1867 1871, 2016.

[18] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. ar Xiv preprint ar Xiv:1911.05911, 2019.

[19] Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, and Thanasis Pittas. Streaming algorithms for high-dimensional robust statistics. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5061 5117. PMLR, 17 23 Jul 2022.

[20] Jiashi Feng, Huan Xu, and Shie Mannor. Outlier robust online learning. ar Xiv preprint ar Xiv:1701.00251, 2017.

[21] Massimo Gallo, Alessandro Finamore, Gwendal Simon, and Dario Rossi. Real-time deep learning based traffic analytics. In Proceedings of the SIGCOMM 20 Poster and Demo Sessions, pages 76 78. 2020.

[22] Jo ao Gama, Indr e ˇZliobait e, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1 37, 2014.

[23] Peng Gao, Xusheng Xiao, Zhichun Li, Fengyuan Xu, Sanjeev R Kulkarni, and Prateek Mittal. {AIQL}: Enabling efficient attack investigation from system monitoring data. In 2018 {USENIX} Annual Technical Conference ({USENIX}{ATC} 18), pages 113 126, 2018.

[24] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061 2089, 2013.

[25] Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042 15053, 2020.

[26] M Harries. Splice-2 comparative evaluation: electricity pricing (technical report unsw-csetr-9905). Artificial Intelligence Group, School of Computer Science and Engineering, The University of New South Wales, Sydney, 2052, 1999.

[27] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016.

[28] Cormac Herley. Automated detection of automated traffic. In 31st USENIX Security Symposium (USENIX Security 22), pages 1615 1632, 2022.

[29] T Ryan Hoens, Robi Polikar, and Nitesh V Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1:89 101, 2012.

[30] Hao Huang and Shiva Prasad Kasiviswanathan. Streaming anomaly detection using randomized matrix sketching. Proceedings of the VLDB Endowment, 9(3):192 203, 2015.

[31] Wazir Zada Khan, MH Rehman, Hussein Mohammed Zangoti, Muhammad Khalil Afzal, Nasrullah Armi, and Khaled Salah. Industrial internet of things: Recent advances, enabling technologies and open challenges. Computers & Electrical Engineering, 81:106522, 2020.

[32] Young-geun Kim, Yongchan Kwon, Hyunwoong Chang, and Myunghee Cho Paik. Lipschitz continuous autoencoders in application to anomaly detection. In International Conference on Artificial Intelligence and Statistics, pages 2507 2517. PMLR, 2020.

[33] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pages 1420 1430. PMLR, 2018.

[34] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

[35] Jia Li, Shimin Di, Yanyan Shen, and Lei Chen. Fluxev: a fast and effective unsupervised framework for time-series anomaly detection. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 824 832, 2021.

[36] Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises. ar Xiv preprint ar Xiv:2303.12277, 2023.

[37] G abor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, 2019.

[38] G abor Lugosi and Shahar Mendelson. Sub-gaussian estimators of the mean of a random vector. 2019.

[39] Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393 410, 2021.

[40] Chaitanya Manapragada, Mahsa Salehi, and Geoffrey I Webb. Extremely fast hoeffding adaptive tree. In 2022 IEEE International Conference on Data Mining (ICDM), pages 319 328. IEEE, 2022.

[41] M Hammad Mazhar and Zubair Shafiq. Real-time video quality of experience monitoring for https and quic. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 1331 1339. IEEE, 2018.

[42] Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai. Kitsune: an ensemble of autoencoders for online network intrusion detection. ar Xiv preprint ar Xiv:1802.09089, 2018.

[43] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 7195 7201. IEEE, 2016.

[44] Alexander V Nazin, Arkadi S Nemirovsky, Alexandre B Tsybakov, and Anatoli B Juditsky. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607 1627, 2019.

[45] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. Co RR, abs/1211.5063, 2(417):1, 2012.

[46] Scott Pesme and Nicolas Flammarion. Online robust regression via sgd on the l1 loss. Advances in Neural Information Processing Systems, 33:2540 2552, 2020.

[47] Maxim Raginsky, Rebecca M Willett, Corinne Horn, Jorge Silva, and Roummel F Marcia. Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8):5544 5562, 2012.

[48] Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. Time-series anomaly detection service at microsoft. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3009 3017, 2019.

[49] Byron P Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon Mc Gregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2-3):577 584, 2005.

[50] Abishek Sankararaman, Balakrishnan Narayanaswamy, Vikramank Y Singh, and Zhao Song. Fitness:(fine tune on new and similar samples) to detect anomalies in streams with drift and outliers. In International Conference on Machine Learning, pages 19153 19177. PMLR, 2022.

[51] Bernhard Sch olkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443 1471, 2001.

[52] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. ar Xiv preprint ar Xiv:2103.12051, 2021.

[53] Ohad Shamir. A variant of azuma s inequality for martingales with subgaussian tails. ar Xiv preprint ar Xiv:1110.2392, 2011.

[54] Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun. Taurus: a data plane architecture for per-packet ml. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1099 1114, 2022.

[55] Che-Ping Tsai, Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. Heavy-tailed streaming statistical estimation. In International Conference on Artificial Intelligence and Statistics, pages 1251 1282. PMLR, 2022.

[56] Tim van Erven, Sarah Sachs, Wouter M Koolen, and Wojciech Kotlowski. Robust online convex optimization in the presence of outliers. In Conference on Learning Theory, pages 4174 4194. PMLR, 2021.

[57] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

[58] Jonathan Vestin, Andreas Kassler, Deval Bhamare, Karl-Johan Grinnemo, Jan-Olof Andersson, and Gergely Pongracz. Programmable event detection for in-band network telemetry. In 2019 IEEE 8th international conference on cloud networking (Cloud Net), pages 1 6. IEEE, 2019.

[59] H Victor. A general class of exponential inequalities for martingales and ratios. The Annals of Probability, 27(1):537 564, 1999.

[60] Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. In Conference on Learning Theory, pages 65 102. PMLR, 2022.

[61] Liangcheng Yu, John Sonchack, and Vincent Liu. Mantis: Reactive programmable switches. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pages 296 309, 2020.

[62] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019.

[63] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383 15393, 2020.

[64] Jiujia Zhang and Ashok Cutkosky. Parameter-free regret in high probability with heavy tails. ar Xiv preprint ar Xiv:2210.14355, 2022.

[65] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. Advances in neural information processing systems, 31, 2018.

[66] Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 665 674, 2017.

[67] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928 936, 2003.

10 More related work

Comparison with [8]:The seminal work of [8] started the work on non-stationary estimation in the absence of corruptions. They establish a regret lower bound of Ω(Φ1/3T 2/3) for non-stationary estimation in the presence of non-stationarities. They also established an upper bound by showing that SGD without clipping yields matching regret upper bounds in expectation. The upper bound in Theorem 5.1 on the other hand gives high-probability regret bounds that hold even in the presence of corruptions, in addition to non-stationarities. Expressing an upper bound in high-probability is both technically challenging and algorithmically insightful. The insight we make is that for having high-probability bounds, we need to have clipped SGD. On the other hand, since [8] only give bounds holding in expectation, they do not need any clipping in their algorithms. The necessity of clipping in heavy-tailed settings is not an artifact of analysis, but is crucial for good empirical performance, as noted in recent works of [55]. Thus a conceptual contribution we make is that even in the absence of corruptions, a different algorithm compared to [8], namely that of clipping gradients is required to get regret bounds holding in high-probability. From a technical perspective, the proofs for high-probability bounds need different techniques as compared to [8]. For instance, we need several martingale and induction arguments to arrive at high-probability bounds in heavy tails while [8] have a much more simpler proofs just based on convexity, since they only give bounds in expectation.

Robust Stochastic Optimization The paper of [44] proves high-probability bounds for optimizing convex functions under heavy tailed gradients. However, they do not consider corruptions or drifts and assume finite diameter. In contrast our bounds explicitly depend on drifts and corruptions and can handle infinite diameter in the absence of corruptions. The work of [25] give high-probability bounds for offline clipped gradient descent based algorithms for optimizing both convex and strongly convex functions. However, they do not consider drift or corruptions. The work of [60] studies convergence of gradient descent methods in the absence of second moments, but only give bounds in expectation and do not consider drift or corruptions. The paper of [15] give high-probability bounds for stochastic optimization, but their analysis does not consider drift or corruptions. None of these papers simultaneously consider drifts, corruptions and heavy-tailed noise. The paper of [63] extend this to heavy-tailed noise under only the p (1, 2]th moment and non-convex functions. However, they only give results in expectation and in a setting without drift and corruptions.

Robust Statistics There is a growing literature [18, 37, 39, 10] that give near-optimal offline algorithms in the presence of corruptions but no drifts. Since our focus is to devise online algorithms that are adaptive to distribution shift, these algorithms are not directly applicable to our setting.

Dynamic Regret Minimization A parallel line of work concerns algorithms being adaptive to drifts for online convex optimization problems [27] where the data are arbitrary rather than sampled from a distribution [67, 65, 56, 5]. However, these algorithms assume that there are no noise in the gradients and no corruptions on the data stream.

Gradient clipping The work of [45] showed clipping gradients as a practical work-around to the exploding gradients problem. Theoretically, [62] study the effect of clipping in offline training under restrictive noise assumptions and do not consider impact of drifts and corruptions.

11 Key algorithmic challenge

All distribution changes initially look like corruptions

In this section, we give a simple example to see whay developing algorithms that are robust to drift and corruptions are challenging. Consider a simple example of estimating the mean in the absence of noise, i.e., when variance is 0. We will show two possible underlying scenarios that yield the same observation to the forecaster. In both scenarios, the observed samples are all 0 in the first T ΛT rounds and is R = 0 in all of the last ΛT rounds. The first scenario is one where the true mean in the first T ΛT samples is 0 and the true mean in the last ΛT samples is R, i.e., this scenario has no adversarial corruptions and a cumulative distribution shift of R. In the second scenario, the true mean for all rounds is 0, except the last ΛT samples are adversarially corrupted, i.e., this scenario has no distribution shift but has ΛT adversarially corrupted samples. If the true system is scenario 1, then the regret of a forecaster is the distance from the ground-truth R, while in the second scenario is the distance from the un-corrupted mean 0. Thus, any algorithm will incur in at-least one of the scenarios, a regret of at-least RΛT

2 . This example highlights the main tension in the problem : a

Figure 3: Over T = 10000 steps, we vary ΦT and ΛT keeping the clipping value λ = 2T 0.25 for linear regression. We observe that the best choice of α decreases as the distribution shift ΦT and/or corruptions ΛT increases.

distribution shift initially presents as outliers and the online algorithm needs samples and thus incurs regret in order to learn this to be the new normal .

12 Additional details on simulations

We give more details on experiments reported in Figures 2 and 3 for synthetic data.

Mean-Estimation Setup At each time t, we sample a 0 mean univariate random variable Yt from the Pareto distribution with mean 0, variance σ2 := 5 and tail parameter 3, so that it has a finite second moment. The un-corrupted d-dimensional vector at time t is denoted by Zt := (Yt µt)

d vt, where vt is a unit vector sampled uniformly on the sphere of d = 256 dimensions. The true means µt are chosen by running a random walk, i.e., µ0 = 0 and µt := µt 1 + wt, where wt N 0, 1 T 1 κ Id . This ensures that ΦT T κ. In Figures 2a, 2b, 2c, 2d , we set κ {0.5, 0.25, 0.15, 0} respectively. The instants of corruption are chosen at integer multiples of

T and the observed sample Xt = 0, if t = k

T , for some non-negative integer k. Otherwise, Xt = Zt. For a given underlying mean vector, we run the streaming setup 30 times, and plot the average along with 95% confidence interval on the regret estimates.

Linear Regression Setup: We generate the covariate Xt R256 from a generalized Pareto distribution with mean 0, variance σ2 := 1, i.e. m = 1. The true parameter at time 0 was set to θ0 := [d 0.5, , d 0.5]. At each time t, the true parameter θ t := µt 1 + wt where wt N 0, 1 T 1 κ Id . This ensures that ΦT T κ. In Figures 3a, 3b, 3c, 3d , we set κ {0.5, 0.25, 0.15, 0} respectively.At each time t, the response yt = Xt, θ t + nt, where nt is a Pareto distribution with 0 mean, variance σ2 = 3 and shape parameter 2.5. The instants of corruption are chosen at integer multiples of

T . At instants of corruption, the covariates are sampled from a normal distribution with identity covariance and mean 100. The response variable for a corrupted time point is set to 0. For a given set of underlying mean vector, we run the streaming setup 30 times, and plot the average along with 95% confidence interval on the regret estimates.

Methods: We compare clipped-SGD with learning rates ηt := 1/m T α and clipping values λ = 2T β for a variety of α and β. Further we implement the doubling trick given in Algorithm 2 so that clipped-SGD does not require knowledge of the time-horizon T. In addition, we also compare against the clipped-SGD version of [55] which uses the time-varying learning rate of 1/(t) learning and clipping value of 2

T. Note that the algorithm of [55] has strictly more information than the one we implement as it has knowledge of time horizon T for setting the clipping value, while the one we propose does not and relies on the doubling trick. Further, we do not assume that the set Θ is known and consider the un-projected clipped-SGD for our algorithm, while for that of [55], we consider the smallest L2 ball around the horizon that contains all the (θ t )T t=1. Thus, the algorithm we implement is truly parameter-free as it does not have any problem specific parameters in its implementation.

Observations: All plots in Figure 2 show both mean-estimation and linear regression for varying amounts of distribution shift ΦT , while keeping the number of corruptions ΛT =

T and clipping value λ = 2T 0.25. For both mean-estimation in Figures 2a, 2b, 2c, 2d and linear regression in Figures

Figure 4: Over T = 10000 steps, we vary ΦT while keeping the number of corruptions ΛT =

T and the clipping value λ = 2T 0.25 the same. The top rows (a) (d) are for the synthetic meanestimation and the bottom figures (e) (h) are for linear regression.

3a, 3b, 3c, 3d, we show that as the amount of distribution shift ΦT increases, the learning rate needs to be more aggressive to be adaptive to the distribution shift.

Sample paths In Figure 4, we plot a particular sample path over time for a variety of settings of α.

13 Real-data experiments

The key result of this section is that even on real-data with real performance metrics, adapting the learning-rate to the amount of distribution shift on a data-stream is beneficial. We consider two tasks to demonstrate this : online binary classification and online anomaly detection.

Dataset Stream-length T Task Dimension d Electricity NSW [26, 4] 45312 Binary classification 13 Mini Boone [49, 4] 130065 Binary Classification 50 MNIST [34] 11811 Anomaly Detection 784 Table 3: Real data-sets used.

13.1 Classification setting

Datasets: We consider two binary classification datasets available from the UCI repository [4] (i) NSW Electricity data [26] and (ii) Mini Boone data [49]. The NSW Electricity dataset consists of 45312 samples with each data-point having 6 numeric features and one categorical feature representing the day of week. After one-hot-encoding the categorical feature, we have 13 numeric features and a binary target. The Mini Boone dataset has 130065 samples with each data point having 50 numeric features and a binary target. We choose these two as they have been shown to be good binary classification benchmark for data-streams with drift [40]. In particular, the Mini Boone dataset is extreme where the first 36499 has target of 1 while all the last 93565 data points have target of 0. For both the datasets, the order of streaming is the same order in which they are collected. This is consistent with the evaluation protocol of streaming algorithms in the presence of drifts [40].

Figure 5: The plot of Regret with different choice of learning rate. In Figures (a) and (b), we plot the log(Regret), where lower is better, while in (c) we plot the Precision-Recall area under curve, where higher is better. We see for the Electricity and MNIST dataset that α 0.5 has the best performance, while for the Mini Boone dataset in (b), we see that α = 0 has the best performance. The result for Mini Boone is not surprising as that dataset is has very little noise (σ 0) compared to the Electricity dataset. Thus, the optimal choice of α is lower in Mini Boone compared to the Electricity data. Furthermore, across datasets, tasks and models, our theoretical insights hold demonstrating that they are not artifacts of analysis.

Model and Methods: For both data-sets, we consider a simple one layer Logistic regression as the model to train. We set the gradient clipping value λ = 1 for both datasets. For both these datasets, we deviate from our setup and employ the classical online learning evaluation [40]. At each time t, the covariate is first shown to the algorithm which then predicts the target. The prediction of the logistic regression is then threshold at 0.5. The instantaneous loss at time t is the indicator loss whether if predictor after applying the threshold matches the target. The covariate and label pair is then used to take one step of clipped SGD in order to get the model used at the next time step. We experiment with different learning rates and plot their impact in Figure 5.

Observations: We observe in Figure 5 that for the Electricity dataset which has more shifts and is noisy σ > 0, the optimal choice of α 0.5, while for the Mini Boone dataset has just one change point (i.e., ΦT 0) and is noiseless σ 0 (cf [40]). Thus the optimal α 0.

13.2 Online Anomaly Detection

Dataset : We considered the standard MNIST dataset [34]. We modified it to a stream containing drift and corruptions/anomalies as follows. We introduce abrupt change points by first streaming in all 0 s followed by 1 s and so on with the last of the stream being 9 s. This way, the stream has 9 abrupt change-points. In addition, within a change-point, the images are slowly rotating clockwise at a fixed angle of 1 degree rotation to model slowly varying change in between two abrupt changes. Anomalies are introduced at random with probability 0.05 by sampling a digit other than the current segment of time. For instance 5c, the third point on the stream is an anomaly since the digit is different from 0 which is the inlier digit till the first break-point. Similarly, the last but one the data point on the stream is an anomaly since it is different from 9, the inlier digit in that segment.

Methods: We flatten the image into a vector of 784 and consider a simple 2 layer auto-encoder with hidden dimension 512 and activation function of Re LU [66]. We initialize the network to be random at the beginning of the stream and consider clipped SGD with clip-value set to 5 for various learning rates as shown in Figure 5c. At each time t, the sample Xt is revealed to the algorithm, a single clipped gradient step is taken on this sample to update the model and then the anomaly score is produced from the resulting model.

Observation: In Figure 5c, we plot the anomaly detection accuracy measured through the average precision recall score, where higher the score indicates better AD performance. We see that neither too small nor too high a value of α obtains the highest performance.

/latexit>Xt

Abrupt change-points

Slowly varying rotations within a change-point

Figure 6: The data stream of MNIST digits with drifts and corruptions. The stream is ordered such that all digits labeled 0 arrive before digits labeled 1 and so-on. The abrupt change points are time instants when the next digit starts appearing. Within an abrupt change-point corresponding to a digit i {0, , 9} and time-step t, with probability 0.95, a digit i image is sampled from the MNIST dataset without replacement, rotated by angle qt. With probability 0.05, an image corresponding to a digit other than i is sampled, which is labeled (but unknown to the algorithm) as a true anomaly. These images are marked with the red-boundary in the figure. The rotation angle varies with time as qt+1 := qt + 1

t, i.e., the rotations are slowly varying. The goal of the algorithm is to detect the true anomalies. In this experiment (see also Table 3), T = 1181 and the number of anomalies was 563.

14 Removing the knowledge of time horizon T from Algorithm 1

In this section, we re-state the doubling trick from [11] to remove the dependence of the stream length T on the learning rate tuning in Theorems 4.3 and 5.1.

Algorithm 2 Online-Clipped-SGD without time horizon

1: Input: Learning rate exponent α [0, 1], clipping exponent β 1

2, θ0 Θ initialization, m, M 2: TIME-HORIZON 1, η 1, λ 2m(G + 1) 3: for each time t = 1, 2, , do 4: if t >TIME-HORIZON then 5: TIME-HORIZON 2 TIME-HORIZON 6: η 1 m TIME-HORIZONα . 7: λ 2m(G + 1)TIME-HORIZONβ

8: end if 9: Receive sample Xt 10: Output θt PΘ (θt 1 ηclip( L(Xt, θt 1), λ)) 11: end for

Proposition 14.1 ([11]). If for a given α, β, δ, T, clipped-SGD satisfies RT R(α, β, T) ln T

with probability at-least 1 δ, then Algorithm 2 will satisfy regret RT R(α, β, T)(log2(T))2 ln 2

with probability at-least 1 δ.

Proof. Over a time interval of T, there are log2(T) times when Line 5 of Algorithm 2 is executed. Denote by the time-interval between two successive executions of Line 5 as a phase of the algorithm. In the ith phase, the hypothesis of the proposition gives that with probability at-most δ, the regret in phase i denoted by R(i) R(α, β, T) ln 2i

δ . Thus, taking an union bound over all the log2(T)

phases, we get that with probability at-least 1 log2(T)δ, the total regret RT is bounded by

i=1 R(α, β, T) ln 2i

RT R(α, β, T)(log2(T))2 ln 2

15 Proof of Proposition 6.1

The proof is by construction. For each time t = 1, 2, , denote by the function ft(x) := 1

2(x 1 Ht)2, where Ht := Pt 1 s=1 1 s. Let the initial point θ0 = 0 and let us set the clipping value of clipped gradient descent to be larger than or equal to 1.

We will argue by induction that θt = Ht + 1

t for all t = 1, 2, . Thus, the loss suffered at time t is lt := 1 1

t , since the optimal point at all times t is 1 + Ht.

For the base-case of time 1, we see that the gradient of f1( ) at x = θ0 = 0 is 1. Thus, θ1 = 1 = H2 and the loss at time 1 is 0 = 1 1

1. Assume the induction hypothesis is true for all s = 1, , t for some t. Consider the time t + 1. The induction hypothesis tells that at time t, θt = Ht+1. At time t + 1, the function revealed is (x 1 Ht+1)2 and thus the gradient at any x = Ht+1 is 1. The step size at time t + 1 is ηt+1 = 1 t+1 and thus θt+1 = Ht+1 + 1 t+1 = Ht+2. Thus, the induction hypothesis holds.

This gives that the total cumulative regret is

16 Proofs of Lower Bounds on Estimation Regret

Before giving the proof of Proposition 2.6, we state and prove a simpler result that finite diameter is necessary in the presence of both non-stationarities and corruptions. Proposition 16.1 (Finite diameter is necessary in general). There exists a strongly convex loss function L such that for every D > 0, domain Θ = [ D/2, D/2] R and every T N, such that for every algorithm, there exists a sequence of probability measures (Pt)T t=1 and corruptions (Ct)T t=1 where regret is at-least Ω(ΛT D) with probability 1.

We prove the following Proposition on mean-estimation, which will prove Proposition 16.1. Proposition 16.2 (Mean estimation needs finite diameter). Let Θ = [ D/2, D/2] R, timehorizon T and the number of corruptions be ΛT T and L(Z, θ) = 1

2(Z θ)2. For every algorithm, there exists a sequence of probability measures (Pt)T t=1 and corruption functions (Ct)T t=1, such that the regret incurred is at-least Ω(ΛT D) with probability 1.

Proof. The crux of this lower-bound is that the time instants of corruptions are adversarially chosen. The proof follows by construction where the underlying ground-truth is one of two possibilities and there is no noise, i.e., the variance of the probability distributions chosen by the adversary are all 0. We will show that even against an oblivious adversary, Ω(ΛT D) regret is un-avoidable.

Assume that the adversary has only two choices either an environment in which the true mean at all times is D/2 or one in which the true mean in the first T ΛT samples is D/2 and the true mean in the last ΛT samples is D/2. If the adversary picks the first scenario then it does not corrupt any sample and since there is no noise, all the T observed samples in the first scenario will be equal to D/2. If

the adversary picks the second scenario as the environment, then it will corrupt the last ΛT samples to show as D/2 instead of the true D/2. Thus, in either choice of the adversary the observed set of T samples by the forecaster is deterministic D/2 for all T samples. Thus, the forecaster cannot distinguish whether the underlying environment is from scenario 1 or 2 from observations, even if the forecaster has knowledge of the two possible situations from which the adversary is choosing.

Thus, for any deterministic choice of outputs by the algorithm, the adversary can choose one of these two situations such that the regret on the last ΛT samples of the stream incurs regret at-least ΛT D

16.1 Proof of Proposition 2.6

Similar to Proposition 2.6, consider two scenarios for mean-estimation. In one scenario, the uncorrupted samples are all drawn from a Dirac mass at 0, but the first ΛT samples are corrupted with all d coordinates set to D/

d. In the other scenario, there are no corruptions and the un-corrupted samples are all from Dirac mass at location with all coordinates D/

d. In both situations, the first ΛT samples are identical. Thus, no estimator in the first ΛT samples can distinguish between these two scenarios and will incur regret at-least (ΛT D)/4.

17 Useful convexity based inequalities

In this section, we collect some inequalities that we will repeatedly use throughout the proofs. Throughout the rest of the paper, for every t [T], we denote by Rt(θ) := EZ Pt[L(Z, θ)] and by θ t := arg minθ Θ Rt(θ). Convexity of Rt( ) and the domain Θ being convex implies that θ t is well-defined, exists and is unique. Recall that for all time t, Rt(θ) is strongly convex with parameters M, m respectively. Thus, we know that

Rt(θ t ) Rt(θt 1) + Rt(θt 1), θ t θt 1 + m

2 θ t θt 1 2 2. (4)

Further since θ t := arg minθ Rt(θ), we have that

Rt(θt 1) Rt(θ t ) m

2 θt 1 θ t 2 2.

Putting these two together, we see that

Rt(θt 1), θt 1 θ t m θt 1 θ t 2 2. (5)

Also, We further use the following lemma.

Lemma 17.1 (Lemma 3.11 from [9]). Let f : Rd R be a M smooth and m strongly convex function. Then for all x, y Rd,

f(x) f(y), x y m M M + m x y 2 2 + 1 M + m f(x) f(y) 2 2.

By substituting x = θt 1, y = θ t and f( ) = Rt( ) and by leveraging the fact that Rt(θ t ) = 0, we get the inequality that

Rt(θt 1), θt 1 θ t m M m + M θt 1 θ t 2 2 + 1 M + m Rt(θt 1) 2 2.

Re-arranging, we see that

Rt(θt 1) 2 2 (M + m) Rt(θt 1), θt 1 θ t m M θt 1 θ t 2 2. (6)

Further, using the Cauchy-schwartz inequality that Rt(θt 1), θt 1 θ t Rt(θt 1) θt 1 θ t D Rt(θt 1) , we get from the preceeding inequality that

Rt(θt 1) (M + m)D. (7)

18 Theorem statement when the gradients are sub-gaussian with known upper-bound

Theorem 18.1 (Formal version of Theorem 4.3). Suppose Assumption 4.2 holds. For every δ (0, 1),

if Algorithm 1 is run with clipping value λ G + p

Tr(Σ) + C q

νmax(Σ) ln 2T

δ , where C is given in Proposition 19.1 and step-size η > 0, then with probability at-least 1 δ,

Reg T 2(l1 + ΦT )

νmax(Σ) ln 6T

v u u t TΦT C q

νmax(Σ) ln 2T

Corollary 18.2. Suppose Assumption 4.2 holds. For every δ (0, 1), if Algorithm 1 is run with

clipping value λ G + p

Tr(Σ) + C q

νmax(Σ) ln 2T

δ , where C is given in Proposition 19.1

and step-size η = 1 m T α for α [0, 1), then with probability at-least 1 δ,

v u u t T 1+αΦT C q

νmax(Σ) ln 2T

| {z } Regret due to distribution shift

νmax(Σ) ln 6T

| {z } Regret due to standard statistical error

| {z } Regret due to adversarial corruptions

The above theorem immediately yields the following corollary by substituting the learning rate as η = T 2/3.

Corollary 18.3. Suppose Assumption 4.2 holds. For every δ (0, 1), if Algorithm 1 is run with

clipping value λ G + p

Tr(Σ) + C q

νmax(Σ) ln 2T

δ , where C is given in Proposition 19.1

and step-size η = T 2/3

m , then with probability at-least 1 δ,

v u u t T 5/3ΦT C q

νmax(Σ) ln 2T

| {z } Regret due to distribution shift

νmax(Σ) ln 6T

| {z } Regret due to standard statistical error

+ O ΛT T 1/3

| {z } Regret due to adversarial corruptions

19 Proof of Theorem 18.1

The proof in this section is based on the following result on high-dimensional sub-gaussian random vectors.

Proposition 19.1 (Exercise 6.3.5 [57]). There exists an absolute constant C > 80, such that for every sub-gaussian random-vector Z with 0 mean random-vector and covariance matrix Σ and every δ (0, 1), the following

" Z 2 E[ Z 2] C

νmax(Σ) ln 2

holds, where νmax(Σ) is the largest eigen-value of the upper-bound covariance matrix Σ.

19.0.1 Notations

We follow the same proof architecture as that of [55]. We define three sequences of random variables (ψt)t 1 and ( ψ)t 1 as follows.

ψt := clip( L(Zt, θt 1), λ) Rt(θt 1), ψt := clip( L(Xt, θt 1), λ) Rt(θt 1), eψt := L(Zt, θt 1) Rt(θt 1).

These are random vectors since Zt Pt and both Ct and θt are measurable with respect to the sigma algebra generated by σ(Z1, , Zt, C1, , Ct 1). Clearly, for all times t 1, on the event that Ct = 0, ψt = ψt holds almost-surely. Furthermore, from triangle inequality almost-surely for all time t 1, we have

ψt 2 2 ψt 2 2 + 2λ21Ct =0. (11)

Recall from Assumption 2.5 that for every t and θ Θ, the covariance matrix of the random vector L(Z, θ) Rt(θ) is bounded from above in the positive semi-definite sense by Σ. Denote by Trace(Σ) and νmax(Σ) as the trace and the highest eigen value respectively of the matrix Σ. Assumption 4.2 implies the following two lemmas.

Denote by the event Eno-clip as

t=1 { L(Zt, θt 1) = clip( L(Zt, θt 1)} ,

t=1 {ψt = eψt}. (12)

19.0.2 Supporting Lemmas based on sub-gaussian concentrations

The main result of this section is stated at the end in Corollary 19.5. In order to state them, we build a series of useful bounds along the way. The following lemma bounds the estimation error of the true gradient using the sample Zt at each time t. Lemma 19.2.

sup eθ1, ,eθT Θ PZ1, ,ZT

sup t [T ] L(Zt, eθt) Rt(eθt) > C

νmax(Σ) ln 2T

In words, the above lemma states that for any sequence of eθ1, , eθT , the probability that the norm of the difference between L(Zt, eθt) and its expectation Rt(eθt) where Zt Pt exceeds

νmax(Σ) ln 2T

δ is bounded above by δ.

Proof. Fix any deterministic eθ1, , eθT Θ. Assumption 4.2 yields that for all time t, L(Zt, eθt) Rt(eθt) is a 0 mean sub-gaussian random vector with covariance matrix upper-bounded in the positive definite sense by Σ. Observe from Jensen s inequality that for all t and θ θ

E[ L(Zt, θ) Rt(θ) ] p

E[ L(Zt, θ) Rt(θ) 2] p

Tr(Σ), (13)

Now, Hoeffding s inequality in Proposition 19.1 gives that

L(Zt, eθt) Rt(eθt) C

νmax(Σ) ln 2T

+ E[ L(Zt, eθt) Rt(eθt) ],

νmax(Σ) ln 2T

holds with probability at-least 1 δ

T . Now, taking an union bound over t = {1, , T} gives that

sup t [T ] L(Zt, eθt) Rt(eθt) > C

νmax(Σ) ln 2T

Now, since eθ1, , eθT Θ was arbitrary, we can take a supremum on both sides and that concludes the proof.

Corollary 19.3. With probability at-least 1 δ, for all times t {1, , T},

νmax(Σ) ln 2T

Proof. The proof follows from applying the Lemma 19.2 to the set of random locations θ1, , θT .

Lemma 19.4. If Algorithm 1 is run with clipping value λ G+ p

νmax(Σ) ln 2T

δ , where C is given in Proposition 19.1, then with probability at-least 1 δ gradient is never clipped for any un-corrupted sample, i.e., with probability at-least 1 δ, event Eno-clip holds.

Proof of Lemma 19.4. From the definition, we know that the gradients are never clipped for the un-corrupted samples if L(θt 1, Zt) λ holds for all t.

From Corollary 19.3, we know that with probability at-least 1 δ, for all t {1, , T}, we have

L(Zt, θt 1) Rt(θt 1) C

νmax(Σ) ln 2T

Tr(Σ). (14)

From the triangle inequality we know that

L(Z, θt 1) L(Z, θt 1) Rt(θt 1) + Rt(θt 1) , L(Z, θt 1) Rt(θt 1) + sup θ Θ Rt(θ) ,

= L(Z, θt 1) Rt(θt 1) + G. (15)

Substituting the bound from Equation (14) into (15), we get that for every t [T],

L(Z, θt 1) G + p

νmax(Σ) ln 2T

holds with probability at-least 1 δ.

This immediately yields the following corollary.

Corollary 19.5. If Algorithm 1 is run with clipping value λ G+ p

νmax(Σ) ln 2T

where C is given in Proposition 19.1, then with probability at-least 1 2δ, for all times t [T],

νmax(Σ) ln 2T

Proof. On the event that both Corollary 19.3 and Lemma 19.4 holds, we have the result. Both those events fail to hold with probability at-most 2δ.

Denote by the event

νmax(Σ) ln 2T

Tr(Σ) . (17)

Corollary 19.5 states that P[Eψ] 1 2δ.

19.1 Lemmas based on expanding the gradient descent recursion

Lemma 19.6. Under event Eψ, for every time t [T],

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2 + 2ηB2

s=1 (1 ηm)t s 1 θs θ s, ψs+1 + 2ηB

s=1 (1 ηm)t s 1 θ s θ s+1

+ 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0, , (18)

where B = C q

νmax(Σ) ln 2T

Tr(Σ) holds.

19.1.1 Proof of Lemma 19.6

Proof. Consider any time t. We have

θt θ t 2 2 = PΘ(θt 1 ηclip(L(Xt, θt 1), λ)) θ t 2 2, (19)

(a) θt 1 ηclip(L(Xt, θt 1), λ) θ t 2 2, (20)

= θt 1 η( ψt + Rt(θt 1)) θ t 2 2,

= θt 1 θ t 2 2 + η2 ψt + Rt(θt 1) 2 2 2η θt 1 θ t , ψt + Rt(θt 1) ,

(b) θt 1 θ t 2 2 + 2η2 ψt 2 2 + 2η2 Rt(θt 1) 2 2 2η θt 1 θ t , ψt + Rt(θt 1) , (21)

Step (a) follows since Θ is a convex set, PΘ(θt) θ t θt θ t , since θ t Θ. In step (b), we use the fact that a + b 2 2 2 a 2 2 + 2 b 2 2, for all a, b Rd. Substituting Equation (6) into (21), we get that

θ t θt 2 2 θt 1 θ t 2 2 + 2η2 ψt 2 2 2η θt 1 θ t , ψt

+ 2η2 (M + m) Rt(θt 1), θt 1 θ t m M θt 1 θ t 2 2 2η Rt(θt 1), θt 1 θ t .

Re-arranging the equation above yields

θ t θt 2 2 (1 2η2m M) θt 1 θ t 2 2 + 2η2 ψt 2 2 2η θt 1 θ t , ψt 2η(1 η ((M + m)) Rt(θt 1), θt 1 θ t .

Further substituting Equation (5) into the display above yields that

θ t θt 2 2 (1 2ηm + 2η2m2) θt 1 θ t 2 2 + 2η2 ψt 2 2 2η θt 1 θ t , ψt ,

(1 ηm) θt 1 θ t 2 2 + 2η2 ψt 2 2 2η θt 1 θ t , ψt ,

where the inequality comes from the fact that if ηm < 1 = 2ηm 2η2m2 > ηm. We simplify the display above using the inequality in Equation (45) as,

θ t θt 2 2 (1 ηm) θt 1 θ t 1 2 2 + (1 ηm) θ t 1 θ t 2 2 + 2η2 ψt 2 2 + 4η2λ21Ct =0 2η θt 1 θ t , ψt + 2η θt 1 θ t , ψt ψt . (22)

Using the Cauchy-Schartz inequality that θt 1 θ t , ψt ψt θt 1 θ t ψt ψt 2λ θt 1 θ t 1Ct =0, where the last inequality comes from the fact that for all time t, ψt ψt 2λ1Ct =0 almost-surely. Plugging this into Equation (22) yields

θ t θt 2 2 (1 ηm) θt 1 θ t 1 2 2 + (1 ηm) θ t 1 θ t 2 2 + 2η2 ψt 2 2 + 4η2λ21Ct =0 2η θt 1 θ t , ψt + 4ηλ θt 1 θ t 1Ct =0. (23)

Using the fact that the diameter of the set Θ is DΘ now yields that

θ t θt 2 2 (1 ηm) θt 1 θ t 1 2 2 + (1 ηm) θ t 1 θ t 2 2 + 2η2 ψt 2 2 + 4η2λ21Ct =0 2η θt 1 θ t , ψt + 4ηλD1Ct =0.

Unrolling the recursion yields,

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2+

s=1 (1 ηm)s θ t s θ t s+1 2 2+2η2 t 1 X

s=1 (1 ηm)s 1 ψt s+1 2 2+

4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 2η

s=1 (1 ηm)s 1 θt s θ t s+1, ψt s+1 +4ηλD

s=1 (1 ηm)s 11Ct s+1 =0.

Using the fact that Pt 1 s=1(1 ηm)s 1 1 ηm, we get that

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

+ 2η2 t 1 X

s=1 (1 ηm)s 1 ψt s+1 2 2 | {z } Term I

s=1 (1 ηm)s 1 θt s θ t s+1, ψt s+1

| {z } Term II

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (24)

Changing the variables in the summation, the above inequality can be written as

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2

+ 2η2 t 1 X

s=2 (1 ηm)t s ψs 2 2 | {z } Term I

s=1 (1 ηm)t s 1 θs θ s+1, ψs+1

| {z } Term II

+ 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0. (25)

Simplifying term I from Corollary 19.5 since event Eψ holds by assumption of the Lemma, we get that with probability at-least 1 2δ,

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2

s=1 (1 ηm)t s 1 θs θ s, ψs+1

| {z } Term II

s=1 (1 ηm)t s 1 θ s θ s+1, ψs+1

+ 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0., (26)

where B = C q

νmax(Σ) ln 2T

Tr(Σ). Further, simplifying using Cauchy Schwartz inequality, we get the stated result of Lemma 19.6.

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2

s=1 (1 ηm)t s 1 θs θ s, ψs+1

| {z } Term II

s=1 (1 ηm)t s 1Φs

+ 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0., (27)

19.2 Martingale Analysis of Term II

Denote by the constant

Ut (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2 + 2ηB

s=1 (1 ηm)t s 1Φs

m + 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0

The following proposition holds by simple algebraic manipulation. Proposition 19.7. For every t [2, T] and s t,

(1 ηm)t s Us Ut.

Denote by the sequence (ξs)T s=1 of random vectors as

ξs = θ s θs if θ s θs 2 2Us 0 otherwise (29)

Denote by the event

s=1 (1 ηm)t s ξs, ψs+1

20Utνmax(Σ) ln 2T

Lemma 19.8.

P[Eξ] 1 δ. (31)

Recall the definition of event Eψ in Equation (17).

Corollary 19.5 give that P[Eψ] 1 2δ. Lemma 19.9. Under the events Eξ and Eψ,

θ t θt 2 2 2Ut, (32)

holds for all t [T].

19.3 Proof of Theorem 18.1

Proof. Corollary 19.5 gives that P[EC ψ] 2δ. Lemma 19.8 gives that P[Ec ξ] δ. An union bound argument thus gives that P[EC ψ EC ξ ] 2δ. Thus, we have that both events Eψ and Eξ hold with probability at-least 1 3δ, i.e.,

P[Eψ Eξ] 1 3δ.

Under the event Eψ Eξ, Lemma 19.9 gives that

Now, applying the fact that p P

i xi, we can simplify Ut as

2 (1 ηm) t 1

2 θ1 θ 1 2 +

s=1 (1 ηm) s 2 θ t s θ t s+1

νmax(Σ) ln 2T

νmax(Σ) ln 2T

s=1 (1 ηm) t s 1

s=1 (1 ηm) s 1

2 1Ct s+1 =0 + p

s=1 (1 ηm) s 1

2 1Ct s+1 =0

Observe the following deterministic identities that can be proven by switching the order of summations. X

s 1 (1 ηm) s 1

2 2 ηm, (34)

s=1 (1 ηm) s 1

2 1Ct s+1 =0 2ΛT

s=1 (1 ηm) s 2 θ t s θ t s+1 2ΦT

s=1 (1 ηm) t s 1

θ s θ s+1 2 Pt s=1 p

θ s θ s+1 ηm 2 TΦT

The first inequality above follows from the fact that for all x (0, 1), we have 1 1 1 x 2

last inequality follows from Cauchy-Schwartz inequality that PT i=1 xi q

T PT i=1 xi. Recall that

the regret Reg T := PT t=1 θt θ t . Thus we get from Equations (32) and (33) that with probability at-least 1 3δ,

Reg T 2(l1 + ΦT )

νmax(Σ) ln 2T

v u u t TΦT C q

νmax(Σ) ln 2T

holds with probability at-least 1 3δ. By a change of variables by re-parametrizing δ δ

3 yields the stated result.

19.4 Proofs of Lemma 19.9

Proof. We prove this lemma by induction on t. For the base case of t = 1, Equation (32) holds with probability 1 trivially by definition. Now for the induction hypothesis, assume Equation (32) holds for all s {1, , t 1} for some t > 1. Since we are on the event Eξ, we Equation (27) holds. Thus,

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)t s θ s θ s+1 2 2

s=1 (1 ηm)t s 1 θs θ s, ψs+1

| {z } Term II

s=1 (1 ηm)t s 1Φs

+ 4η2λ2 t 1 X

s=2 (1 ηm)t s1Cs =0 + 4ηλD

s=2 (1 ηm)t s1Cs =0. (39)

From the induction hypothesis, we have that θ s θs 2 2 2Us holds for all s {1, , t 1}. Thus,

s=1 (1 ηm)t s ξs, ψs+1 =

s=1 (1 ηm)t s θs θ s, ψs+1 .

Further since we are on the event Eξ, we have that

s=1 (1 ηm)t s ξs, ψs+1

20Utνmax(Σ) ln 2T

Thus, plugging this back into Equation (39), we get that

θ t θt 2 2 Ut +

80ηUtνmax(Σ) ln 2T

From Equation (28), we see that 2ηB2

m Ut. Substituting this in the above equation, we see that

θ t θt 2 2 Ut + Ut q

40νmax(Σ) ln 2T

The second inequality follows since B = C q

νmax(Σ) ln 2T

Tr(Σ), where C 40 as given

in Proposition 19.1. Thus,

40νmax(Σ) ln( 2T

19.5 Proof of Lemma 19.8

We first reproduce an useful result. Lemma 19.10 (Sub-gaussian Martingale Azuma-Hoeffding inequality [53]). Suppose Y1, , YT is a martingale sequence with respect to a filtration (Ft)T t=1, i.e., E[Yt|Ft 1] = 0 for all t. Further, suppose there exists deterministic non-negative numbers (σt)T t=1 such that for every λ R and

t [T], we have E[exp(λYt)|Ft 1] exp λ2σ2 t 2 almost-surely. Then for every a > 0,

28 PT t=1 σ2 t

In order to prove Lemma 19.8, we will show that for each t [T], the event

E(t) ξ = t 1 X

s=1 (1 ηm)t s ξs, ψs+1 2BUt

holds with probability at-least 1 δ

T . Then an union bound will conclude the proof of Lemma 19.8.

To do so, first observe that the sum Pt 1 s=1(1 ηm)t s ξs, ψs+1 is a martingale difference with respect to the filtration (Ft)T t=1. This follows from two facts.

Fact I ξs Fs is measurable with respect to the filtration generated by all the random vectors Z1, , Zs and the corruption vectors C1, , Cs. This follows since the adversary is causal and does not have access to future randomness to decide on corruption levels.

Fact II E[ψs+1|Fs] = 0 is un-biased.

Fact III, The sequence of random variables ((1 ηm)t s ξs, ψs+1 )t s=1 satisfies the premise of Lemma 19.10 with σs (1 ηm)2(t s) ξs 2νmax(Σ) 2(1 ηm)t s Utνmax(Σ), for all 1 s t.

The third fact follows from the definition that conditional on Fs, the covariance matrix of ψs+1 is upper-bounded by Σ and ξs 2 2Us for all s, and (1 ηm)t s Us Ut holds for all s t.

In order to apply Lemma 19.10, observe that the sum of variances Pt s=1 σ2 s = 2Utνmax(Σ) Pt s=1(1 ηm)t s 2Utνmax(Σ)

ηm . Applying Lemma 19.10, we get that

s=1 (1 ηm)t s ξs, ψs+1

20Utνmax(Σ) ln 2T

19.6 Proof of Corollary 4.5

The finite diameter that D < fact is only used in Lemma 19.6. From observing the proof of Lemma 19.6, it holds even if D = , albeit the bound is vacuous if ΛT > 0. Thus, the entire proof holds verbatim even if D = and gives non-trivial regret guarantees when ΛT = 0.

20 Theorem statement and Proofs in the general case from Section 5

Theorem 20.1 (Formal version of Theorem 5.1). For every δ (0, 1), if Algorithm 1 is run with clipping value λ 2G, and step-size η > 0, then with probability at-least 1 δ,

λ mη + 2λ m

σ + 2 2(l1 + ΦT )

ηm + 4σT r η

TΦT ηλm2 + 2

Corollary 20.2. For every δ (0, 1), if Algorithm 1 is run with clipping value λ = (2GT β), and step-size η = 1 m T α for α [0, 1), then with probability at-least 1 δ,

GT α+β + T 3α

! σΦT m3/2 + σ m3/2

ΦT T 1+2α β

| {z } Regret due to distribution shift

2 ) + T 1 α

2 + G2T 1+2β α σ2

| {z } Regret due to finite-sample estimation error

G2T 2β+α + T 3α

G + G3/2T 3β

| {z } Regret due to adversarial corruptions

Corollary 20.3. For every δ (0, 1), if Algorithm 1 is run with clipping value λ = 2GT α

3 , and step-size η = 1 m T α for α [0, 1], then with probability at-least 1 δ,

3 ΦT m3/2 + σT 1 2 + 5α

| {z } Regret due to distribution shift

3 (Gσ)2 ln T

| {z } Regret due to finite-sample estimation error

| {z } Regret due to adversarial corruptions

Corollary 20.4. For every δ (0, 1), if Algorithm 1 is run with clipping value λ max(2G, T 1/5), and step-size η = T 1/2

m , then with probability at-least 1 δ,

Reg T O T 5/6(ΦT + p

| {z } Regret due to distribution shift

T 5/6Dλ 1 2 Tr(Σ)

| {z } Regret due to finite-sample estimation error

T 2β+α + T 3α

| {z } Regret due to adversarial corruptions

20.1 Proof of Theorem 20.1

We follow a similar proof architecture as that of Theorem 18.1. We define two sequences of random variables (ψt)t 1 and ( ψ)t 1 as follows.

ψt := clip( L(Zt, θt 1), λ) Rt(θt 1),

ψt := clip( L(Xt, θt 1), λ) Rt(θt 1).

These are random vectors since Zt Pt and both Ct and θt are measurable with respect to the sigma algebra generated by σ(Z1, , Zt, C1, , Ct 1). Clearly, for all times t 1, on the event that Ct = 0, ψt = ψt holds almost-surely. Furthermore, from triangle inequality almost-surely for all time t 1, we have

ψt 2 2 ψt 2 2 + 2λ21Ct =0. (45)

Expanding the one-step recursion

Similar to the analysis carried out in the proof of Theorem 18.1, we consider any time t and write the recursion as

θt θ t 2 2 = PΘ(θt 1 ηclip(L(Xt, θt 1), λ)) θ t 2 2.

Now, this can be expanded verbatim as done in Section 19.1 to yield an identical replica of Equation 24.

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

+ 2η2 t 1 X

s=1 (1 ηm)s 1 ψt s+1 2 2 2η

s=1 (1 ηm)s 1 θt s θ t s+1, ψt s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (46)

Denote by ψt := ψ(b) t + ψ(v) t , where ψ(b) t := EZt[ψt|Ft 1] and ψ(v) t := ψt ψ(b) t . Using this in the display above and using that fact that a + b 2 2 2 a 2 2 + 2 b 2 2, we get

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

+ 4η2 t 1 X

s=1 (1 ηm)s 1 ψ(b) t s+1 2 2 + 4η2 t 1 X

s=1 (1 ηm)s 1 ψ(v) t s+1 2 2

s=1 (1 ηm)s 1 θt s θ t s+1, ψ(b) t s+1

s=1 (1 ηm)s 1 θt s θ t s+1, ψ(v) t s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (47)

Further simplifying by adding and subtracting EZt[ ψ(v) t 2 2|Ft 1] to be above display, we get

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

+ 4η2 t 1 X

s=1 (1 ηm)s 1 ψ(b) t s+1 2 2 + 4η2 t 1 X

s=1 (1 ηm)s 1EZt s+1[ ψ(v) t s+1 2 2|Ft s]

+ 4η2 t 1 X

s=1 (1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s])

s=1 (1 ηm)s 1 θt s θ t s, ψ(b) t s+1 2η

s=1 (1 ηm)s 1 θ t s θ t s+1, ψ(b) t s+1

s=1 (1 ηm)s 1 θt s θ t s, ψ(v) t s+1 2η

s=1 (1 ηm)s 1 θ t s θ t s+1, ψ(v) t s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (48)

Lemma 20.5. If λ 2 supθ Θ Rt(θ) = 2G, the following inequalities hold.

ψ(v) t 2λ (49)

ψ(b) t 2 4σ2

EZt[ ψ(v) t 2 2|Ft 1] 10σ2 (51)

Simplifying Equation (48) using bounds in Lemma 20.5, we get

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

s=1 (1 ηm)s 1 + 40η2σ2 t 1 X

s=1 (1 ηm)s 1

+ 4η2 t 1 X

s=1 (1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s+1])

s=1 (1 ηm)s 1 θt s θ t s ψ(b) t s+1 + 2η

s=1 (1 ηm)s 1 θ t s θ t s+1 ψ(b) t s+1

s=1 (1 ηm)s 1 θt s θ t s, ψ(v) t s+1 2η

s=1 (1 ηm)s 1 θ t s θ t s+1, ψ(v) t s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (52)

Further applying the bound that ψ(b) t 4σ2

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

s=1 (1 ηm)s 1 + 40η2σ2 t 1 X

s=1 (1 ηm)s 1

+ 4η2 t 1 X

s=1 (1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s])

s=1 (1 ηm)s 1 θt s θ t s + 8σ2η

s=1 (1 ηm)s 1 θ t s θ t s+1

s=1 (1 ηm)s 1 θt s θ t s, ψ(v) t s+1 2η

s=1 (1 ηm)s 1 θ t s θ t s+1, ψ(v) t s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (53)

We further simplify the underscored term above as

s=1 (1 ηm)s 1 θ t s θ t s+1, ψ(v) t s+1 2η

s=1 (1 ηm)s 1 θ t s θ t s+1 ψ(v) t ,

s=1 (1 ηm)s 1 θ t s θ t s+1 .

Using the inequality above into Equation (53) along with the bound that 4η2λ2 Pt 1 s=1(1 ηm)s 11Ct s+1 =0 4ηλ2ΛT

m , where ΛT := PT t=1 1Ct =0, we get

θ t θt 2 2 (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

m + 4η2 t 1 X

s=1 (1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s])

| {z } Term I

s=1 (1 ηm)s 1 θt s θ t s + 8σ2η

s=1 (1 ηm)s 1 θ t s θ t s+1

s=1 (1 ηm)s 1 θt s θ t s, ψ(v) t s+1

| {z } Term II

s=1 (1 ηm)s 1 θ t s θ t s+1

+ 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0 + 4ηλD

s=1 (1 ηm)s 11Ct s+1 =0. (54)

Denote by the event E(1) as

s=1 (1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s])

32η2λ2 ln 2T

+ 32η3/2λσ q

Lemma 20.6.

P[E(1)] 1 δ.

For every t {1, , T}, denote by the constant

Ut = (1 ηm)t 1 θ1 θ 1 2 2 +

s=1 (1 ηm)s θ t s θ t s+1 2 2

m +32η2λ2 ln 2T

+ 32η3/2λσ q

s=1 (1 ηm)s 1 θ t s θ t s+1

s=1 (1 ηm)s 1 θ t s θ t s+1 + 4η2λ2 t 1 X

s=1 (1 ηm)s 11Ct s+1 =0

s=1 (1 ηm)s 11Ct s+1 =0

Proposition 20.7. For every 1 s < t T,

(1 ηm)s 1Ut s Ut. (56)

For each t {1, , T}, denote by the event E(2) t as

E(2) t = 2η

s=1 (1 ηm)s 1 θt s θ t s, ψ(v) t s+1 4ηλ p

Observe from definition, P[E(2) 1 ] = 1. In order to bound the other events, the following lemma holds. Lemma 20.8. For each t {2, , T}

P[E(2) t |E(2) t 1, , E(2) 1 ] 1 δ

Denote by the event

t=1 E(2) t . (57)

Corollary 20.9.

P[E(2)] 1 δ,

where event E(2) is given in Equation (57).

Lemma 20.10. Under events E(1) and E(2), for all times t {1, , T},

where Q = 206σ2

We prove this lemma by induction. The base case of t = 1 holds trivially by definition.

By the induction hypothesis, assume that for some t > 1, l2 s QUs holds for all s < t. Then, from Equation (54) we have that

l2 t Ut + 8σ2η

s=1 (1 ηm)s 1 θt s θ t s + 4ηλ p

(a) Ut + 8σ2η

s=1 (1 ηm) s 1

(1 ηm)s QUt s + 4ηλ p

(56) Ut + 8σ2η

s=1 (1 ηm) s 1

QUt + 4ηλ p

(b) Ut + 64σ2

Step (a) follows from the induction hypothesis. Step (b) follows since 2 ηm > 1.

Claim If Q 206σ2 λ2mη + 4λ2

σ2 + 2, then Ut + 64σ2

mλ QUt + 16λ q

We prove this claim by contradiction. Assume that Q 206σ2 λ2mη + 4λ2

σ2 +2 and that Ut+ 64σ2

m > QUt. Re-arranging, we see that

mλ + 16λ r η

mλ + 16λ r η

mλ + 16λ r η

The last inequality follows from Equation (55), where Ut 40ησ2

m . Re-aranging the last inequality, we see that 64σ λ 40mη + 16λ

It is easy to verify that the above inequality and the fact that Q = 206σ2 λ2mη + 4λ2

σ2 + 2 cannot hold simultaneously. This can be checked by squaring both sides and solving for the root of the quadratic equation in Q.

20.2 Proof of Theorem 20.1

We know from Lemma 20.6 and Corollary 20.9, that with probability at-least 1 2δ, events E(1) E(2) holds. Lemma 20.10 gives that for all t {1, , T}, θt θ t 2 QUt, where Q is given in Lemma 20.10 and Ut is given in Equation (55). Now applying the inequality that x + y x+ y, we get that

σ + 2 (1 ηm) t 1

s=1 (1 ηm) s 2 θ t s θ t s+1

λm +7σ η m +6ηλ

2m1/4 +3σ η

s=1 (1 ηm) s 1

θ t s θ t s+1

s=1 (1 ηm) s 1

θ t s θ t s+1 + 2ηλ

s=1 (1 ηm) s 1

2 1Ct s+1 =0

s=1 (1 ηm) s 1

2 1Ct s+1 =0

Recalling that Reg T := PT t=1 θt θ t , we have by simplifying using the series expansions in Equations (34, 35, 36, 37), we get that

σ + 2 2(l1 + ΦT )

ηm + 4σT r η

TΦT ηλm2 + 2

20.3 Proof of Lemma 20.8

Proof. Fix a time t {1, , T}. We wish to bound the sum Pt 1 s=1 2η(1 ηm)s 1 θ t s θt s, ψ(v) t s+1 . In order to do so, we define the following sequence of random vectors

Denote by the sequence (ξs)T s=1 of random vectors as

ξs = θ s θs if θ s θs 2 QUs, 0 otherwise (58)

where Q is defined in Lemma 20.10. From the induction hypothesis, on the event E(2) t 1, , E(2) 1 , we have

s=1 2η(1 ηm)s 1 θ t s θt s, ψ(v) t s+1 =

s=1 2η(1 ηm)s 1 ξt s, ψ(v) t s+1 ,

holding almost-surely. We will now bound the sum Pt 1 s=1 2η(1 ηm)s 1 ξt s, ψ(v) t s+1 using Freedman s martinglae inequality. We know that,

|2η(1 ηm)s 1 ξt s, ψ(v) t s+1 | 2η(1 ηm)s 1 xit s ψ(v) t s+1 ,

Further, the sum of conditional variances can be bounded as

s=1 |2η(1 ηm)s 1 ξt s, ψ(v) t s+1 |2

s=1 |16λ2η2(1 ηm)s 1QUt,

Applying Freedman s inequality, we see that

s=1 2η(1 ηm)s 1 θ t s θt s, ψ(v) t s+1 4ηλ p

16η2λ2QUt ln2(T/δ) + 32λ2ηQUt

20.4 Useful Martingale concentration inequality

Lemma 20.11 (Freedman s inequality[59]). Suppose Y1, , YT is a bounded martingale with respect to a filtration (Ft)T t=0 with E[Yt|Ft 1] = 0 and P[|Yt| B] = 1 for all t {1, , T}. Denote by Vs := Ps n=1 Var(Yn|Fn 1) be the sum of conditional variances. Then, for every a, v > 0,

n [1, T] such that

t=1 Yt a and Vn v

Re-arranging the above inequality, we see that if

2 + 2v ln 2T

then the RHS of Equation (59) is bounded above by δ

20.5 Proof of Lemma 20.6

Proof of Lemma 20.6. Fix a t {1, , T}. For s {1, , t 1}, denote by the random variable Y (t) s := 4η2(1 ηm)s 1( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s]). Observe that the sequence (Y (t) s )t 1 s=1 is a martingale difference sequence with respect to the filtration (Gs)t 1 s=1, where Gs := Ft s. Furthermore, Lemma 20.5 gives that with probability 1, |Y (t) s | 4η2(4λ2 + 4λ2) 32η2λ2. We can bound the conditional variance as

s=1 Var(Y (t) s |Gs) 16η4 t 1 X

s=1 (1 ηm)2(s 1)EZt s[( ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s])2|Ft s],

49 16η48λ2 t 1 X

s=1 (1 ηm)2(s 1)EZt s[| ψ(v) t s+1 2 2 EZt s+1[ ψ(v) t s+1 2 2|Ft s]||Ft s],

128η4λ2 t 1 X

s=1 (1 ηm)2(s 1)(2EZt s[| ψ(v) t s+1 2 2|Ft s]),

51 2560η4λ2σ2 t 1 X

s=1 (1 ηm)2(s 1),

Now, putting B := 32η2λ2 and v = 2560η3λ2σ2

m , we get from Equation (60) that with probability at-least 1 δ/2,

Term I 32η2λ2 ln 2T

s 32η2λ2 ln 2T

2 + 5120η3λ2σ2

32η2λ2 ln 2T

+ 32η3/2λσ q

20.6 Proof of Lemma 20.5 reproduced from [25]

This follows from the analysis of [25] and reproduced here for completeness.

Proof of Equation (49)

ψ(v) t = ψt EZt[ψt|Ft 1] ,

(a) = clip( L(Zt, θt 1), λ) EZt[clip( L(Zt, θt 1), λ)|Ft 1], clip( L(Zt, θt 1), λ) + EZt[clip( L(Zt, θt 1), λ)|Ft 1] , λ + λ

Equality (a) follows from the fact that θt 1 Ft 1.

Proof of Equation (50)

Denote by the event ξt := 1 L(Zt,θt 1) λ and by χt := 1 L(Zt,θt 1) Rt(θt 1) > λ

2 . Since we have λ > 2G, we have the inequality that

ξt χt a.s. (61)

Then, we have that

clip L(t, θt 1)), λ) = L(Zt, θt 1)(1 ξt) + λ L(Zt, θt 1) L(Zt, θt 1)ξt.

Now, expanding out ψ(b) t , we get

ψ(b) t 2 = EZt[ψt|Ft 1] 2, = EZt[clip( L(Zt, θt 1), λ) Rt(θt 1)|Ft 1] 2,

L(Zt, θt 1) + λ L(Zt, θt 1) 1 L(Zt, θt 1)ξt Rt(θt 1) Ft 1

λ L(Zt, θt 1) 1 L(Zt, θt 1)ξt

λ L(Zt, θt 1) 1 L(Zt, θt 1) ξt

1 λ L(Zt, θt 1)

L(Zt, θt 1) ξt

EZt[ L(Zt, θt 1) ξt|Ft 1],

61 EZt[ L(Zt, θt 1) χt|Ft 1], EZt[ L(Zt, θt 1) Rt(θt 1) χt|Ft 1] + Rt(θt 1) EZt[χt|Ft 1],

EZt[ L(Zt, θt 1) Rt(θt 1) 2 2|Ft 1]EZt[χ2 t|Ft 1] + Rt(θt 1) EZt[χt|Ft 1],

EZt[χt|Ft 1] + λ

2 EZt[χt|Ft 1]. (62)

Now computing the probability

EZt[χt|Ft 1] = P[ L(Zt, θt 1) Rt(θt 1)|Ft 1 λ

= P[ L(Zt, θt 1) Rt(θt 1) 2 λ2

The last in-equality is from Chebychev s inequality. Substituting Equation (63) into Equation (62) yields the result.

Proof of Equation (51)

E[ ψ(v) t 2 2|Ft 1] (a) = EZt[ clip(L(Zt, θt 1), λ) EZt[clip(L(Zt, θt 1), λ)|Ft 1] 2 2|Ft 1],

(b) E[ clip(L(Zt, θt 1), λ) Rt(θt 1) 2 2|Ft 1]. (64)

Inequality (a) follows from the fact that θt 1 Ft 1 and step (b) follows from the fact that for any random vector Z and filtration F and Y F, EZ[ Z E[Z|F] 2 2|F] E[ Z Y 2 2|F] holds almost-surely. To simplify notation, we denote by Et := E[ |Ft 1]. We now bound Equation (64) to conclude.

Et[ clip(L(Zt, θt 1), λ) Rt(θt 1) 2 2] = Et[ L(Zt, θt 1) Rt(θt 1) 2 2(1 ξt)2]+

" λ L(Zt, θt 1) L(Zt, θt 1) Rt(θt 1)

(a) σ2 + Et

2 λ L(Zt, θt 1) L(Zt, θt 1)

2 + 2 Rt(θt 1) 2 2

2λ2Et[ξ2 t ],

2λ2Et[ξ2 t ],

63 σ2 + 5λ2

Inequality (a) follows from the fact that for any two vectors a, b, a + b 2 2 2 a 2 2 + 2 b 2 2, and the assumption that supθ Θ Et[ L(Z, θ) Rt(θ) 2 2] σ2.

20.7 Proof of Corollary 5.4

The finite diameter that D < fact is only used in Equation (46) which holds even if D = , albeit the bound is vacuous if ΛT > 0. Thus, the entire proof holds verbatim even if D = and gives non-trivial regret guarantees when ΛT = 0.