Unconstrained Robust Online Convex Optimization

Jiujia Zhang 1  Ashok Cutkosky 1

Abstract

This paper addresses online learning with corrupted feedback. Our learner is provided with potentially corrupted gradients $\tilde g_t$ instead of the true gradients $g_t$. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult unconstrained setting, in which our algorithm must maintain low regret with respect to any comparison point $u \in \mathbb{R}^d$. The unconstrained setting is significantly more challenging, as existing algorithms suffer extremely high regret even with very tiny amounts of corruption (which is not true in the case of a bounded domain). Our algorithms guarantee regret $\|u\| G(\sqrt{T} + k)$ when $G \ge \max_t \|g_t\|$ is known, where $k$ is a measure of the total amount of corruption. When $G$ is unknown, we incur an extra additive penalty of $(\|u\|^2 + G^2)k$.

1. Introduction

In this paper, we consider unconstrained online convex optimization (OCO) in the presence of adversarial corruptions. In general, OCO is a framework in which a learner iteratively outputs a prediction $w_t \in W$, then observes a vector $g_t = \nabla \ell_t(w_t)$ for some convex loss function $\ell_t : W \to \mathbb{R}$, and then incurs a loss of $\ell_t(w_t)$. The learner's performance over a time horizon $T$ is evaluated by the regret relative to a fixed competitor $u \in W$, denoted $R_T(u)$:

$$R_T(u) := \sum_{t=1}^T \ell_t(w_t) - \ell_t(u) \le \sum_{t=1}^T \langle g_t, w_t - u \rangle$$

The inequality above follows by convexity of $\ell_t$. Classical results in this field consider a bounded domain $W$ with known diameter $D$ and a Lipschitz bound $G \ge \max_t \|g_t\|$.

1 Department of Electrical and Computer Engineering, Boston University, Boston, USA. Correspondence to: Jiujia Zhang, Ashok Cutkosky.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
In this setting, the optimal bound is $R_T(u) \le O(GD\sqrt{T})$ (Zinkevich, 2003; Abernethy et al., 2008). Our work focuses on the unconstrained case $W = \mathbb{R}^d$, where it is typical to aim for a regret guarantee that scales not with a uniform diameter bound $D$, but with the norm of the comparator $\|u\|$. Such bounds are often called comparator adaptive (because they adapt to the comparator $u$), or parameter-free (because this adaptivity suggests that the algorithms require less hyperparameter tuning). In this unconstrained setting, the classical algorithms achieve $R_T(u) = O(\|u\| G\sqrt{T})$ (McMahan & Streeter, 2012; McMahan & Orabona, 2014; Orabona & Pál, 2016; Orabona, 2014), which is also optimal.

We are interested in a harder variant of the OCO framework with corrupted gradients. Specifically, instead of any direct information about the function $\ell_t$, after each round the learner is provided with a vector $\tilde g_t$ that should be interpreted as an estimate of $g_t = \nabla \ell_t(w_t)$. Our aim is to obtain a regret that scales as $\|u\| G(\sqrt{T} + k)$ for all $u \in W$, where $k$ is some measure of the degree to which $\tilde g_t \ne g_t$ that will be formally defined in Section 2. Roughly speaking, $k$ can be interpreted as the number of rounds in which $\tilde g_t \ne g_t$. Notably, the desired rate is robust to adversarial corruptions in the sense that it allows $k = O(\sqrt{T})$ before the bound becomes worse than the optimal result without corruptions.

Our dual challenges of corrupted $\tilde g_t$ and unconstrained $W$ are naturally motivated by problems in practice. The unconstrained setting is ubiquitous in machine learning: consider the classical logistic regression setting, for which it is unusual to impose constraints. The corrupted $\tilde g_t$, in contrast, is less commonly studied, but represents a common practical issue: the computed gradients may not be good estimates of a true gradient, whether due to the presence of statistical outliers, numerical precision issues in the gradient computation, or mislabeled or otherwise damaged data.
We distinguish two different settings in our results: one in which the algorithm is provided with prior knowledge of a number $G \ge \max_t \|g_t\|$, and one in which it is not. This is a common dichotomy in unconstrained OCO, even without corruptions. In the former case, the classical result of $O(\|u\| G\sqrt{T})$ is obtainable, while in the latter case it is not: instead, the optimal results are $R_T(u) \le \tilde O(\|u\| \max_t\|g_t\|\sqrt{T} + \|u\|^3 \max_t \|g_t\|)$ (Cutkosky, 2019; Mhammedi & Koolen, 2020), or $\tilde O(\|u\| \max_t\|g_t\|\sqrt{T} + \|u\|^2 + \max_t\|g_t\|^2)$ by Cutkosky & Mhammedi (2024). The latter excels particularly whenever $G$ is not excessively large relative to $\|u\|$.

To the best of our knowledge, the setting of unconstrained OCO with corruptions has not been studied before. Perhaps the closest works to ours are Zhang & Cutkosky (2022); Jun & Orabona (2019); van der Hoeven (2019) and van Erven et al. (2021). Zhang & Cutkosky (2022); Jun & Orabona (2019); van der Hoeven (2019) study the unconstrained setting, but assume that $\tilde g_t$ is a random value with $\mathbb{E}[\tilde g_t] = g_t$. In contrast, we assume no such stochastic structure on $\tilde g_t$. On the other hand, van Erven et al. (2021) does not make any assumptions about the nature of the corruptions, but assumes that $W$ has finite diameter $D$; they achieve a regret of $O(DG(\sqrt{T} + k))$. Our development borrows some ideas from van Erven et al. (2021) with the aim of bounding $R_T(u)$, but we face unique difficulties. As detailed in Section 3, the unconstrained setting means that even very small corruptions can have dire consequences (unlike in the constrained setting). Moreover, if the Lipschitz constant $G$ is not known, the problem becomes even more challenging; as detailed in Section 6.1, prior methods for handling unknown $G$ do not apply because we never learn $G$, even at the end of all $T$ rounds.
The notion of adversarial corruption is common in the field of robust statistics, with early efforts focusing primarily on the presence of outliers in linear regression (Huber, 2004; Cook, 2000; Thode, 2002). These inspired broader applications in machine learning, such as robust PCA (Candès et al., 2011), anomaly detection (Raginsky et al., 2012; Delibalta et al., 2016; Zhou & Paffenroth, 2017; Sankararaman et al., 2022), robust regression (Klivans et al., 2018; Cherapanamjeri et al., 2020; Chen et al., 2022), and mean estimation (Lugosi & Mendelson, 2021). For a comprehensive review of recent advances in this area, see Diakonikolas & Kane (2019). Adversarial corruption has also been studied in iterative optimization settings other than OCO, such as stochastic bandits (Lykouris et al., 2018; Gupta et al., 2019; Ito, 2021; Agarwal et al., 2021) and stochastic optimization (Chang et al., 2022; Sankararaman & Narayanaswamy, 2024).

Contributions and Organization  In the case that the algorithm is given prior knowledge of $G$, we provide an algorithm that achieves $R_T(u) = \tilde O(\|u\| G(\sqrt{T} + k))$ in Section 5.1, with a matching lower bound (see Section 5.2). Alternatively, when $G$ is unknown, a regret bound with an additional penalty of $(\|u\|^2 + G^2)k$ is attained (see Section 6.3). Meanwhile, we provide two specific applications of our results in Appendix D. First, we show that our method can be used to solve stochastic convex optimization problems in which some of the gradient computations are altered in an arbitrary way. Second, we solve a natural online version of a distributionally robust optimization problem. Before providing our main results, we introduce notation and define our corruption model in Section 2.

2. Notation and Problem Setup

Notation  For each $t$, $\ell_t : W \to \mathbb{R}$ is a convex function, where $W = \mathbb{R}^d$. Let $w_t \in W$ be the iterates of some online learning algorithm, and denote by $g_t = \nabla \ell_t(w_t)$ the true (sub)gradient. Let $\tilde g_t$ be the corrupted gradient observed by the learner.
Define $\mathbf{1}\{\cdot\}$ as the indicator function, with $\mathbf{1}\{\text{TRUE}\} = 1$ and $\mathbf{1}\{\text{FALSE}\} = 0$. We use $|\cdot|$ to denote the cardinality of a set, and $\|\cdot\|$ to denote the Euclidean norm. Denote $\mathbb{R}_+ = \{x \in \mathbb{R} : x \ge 0\}$. We define shorthand notation for the sets $[T] = \{1, 2, \ldots, T\}$ and $[a, T] = \{a, a+1, \ldots, T\}$ for any $a \in [T]$. We use $B \subseteq [T]$ to denote an index set, and $\bar B = [T] \setminus B$ for its complement. We use $O(\cdot)$ to hide constant factors and $\tilde O(\cdot)$ to additionally conceal any polylogarithmic factors.

Problem Setup  Instead of the true gradients $g_t$, our algorithms only receive potentially corrupted gradients $\tilde g_t$. Two natural measures to quantify corruptions are:

$$k_{\text{count}} := \sum_{t=1}^T \mathbf{1}\{\tilde g_t \ne g_t\} \tag{1}$$

$$k_{\text{deviation}} := \frac{1}{G} \sum_{t=1}^T \|\tilde g_t - g_t\| \tag{2}$$

where $G$ is a scalar that satisfies $G \ge \max_t \|g_t\|$ and is often referred to as the Lipschitz constant. The metric $k_{\text{count}}$ counts the rounds in which $\tilde g_t \ne g_t$, while allowing arbitrarily large deviations $\|\tilde g_t - g_t\|$ in those rounds. This is suitable for detecting outlier effects and is highlighted in studies such as van Erven et al. (2021); Sankararaman & Narayanaswamy (2024). Conversely, $k_{\text{deviation}}$ measures the cumulative deviation, accommodating corruption in every round, making it well suited for identifying subtle yet widespread errors or malicious activity, akin to the issues addressed in Lykouris et al. (2018); Gupta et al. (2019); Ito (2021); Agarwal et al. (2021); Chang et al. (2022).

Figure 1. KT-bettor with $\ell(w) = |w - 1|$ and comparator $u = 1$. (a)-(b): $T = 400$ and corruption happens during $t \in [300, 319]$. (c): Ratio between regrets with and without corruptions, for various total corrupted rounds $k \in \{20, 30, 40, 50, 60, 70\}$ and $T = k^2$.

In order to provide a unified way to study these two distinct corruption measures in Equations (1) and (2), we assume that our algorithm is provided with a number $k$ that satisfies:

$$|B| := \big|\{t \in [T] : \|\tilde g_t - g_t\| \ge G\}\big| \le k \tag{3}$$

$$\frac{1}{G} \sum_{t=1}^T \min\big(\|\tilde g_t - g_t\|, G\big) \le k \tag{4}$$

Intuitively, $B$ denotes the rounds with a large amount of corruption.
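To make these definitions concrete, here is a small self-contained Python sketch (our own illustration, not part of the paper; the helper name `corruption_measures` is hypothetical) that computes $k_{\text{count}}$, $k_{\text{deviation}}$, $|B|$, and the left-hand side of Equation (4) for given gradient sequences:

```python
import numpy as np

def corruption_measures(g, g_tilde, G):
    """Compute the corruption measures of Eqs. (1)-(2) and the
    unified quantities of Eqs. (3)-(4) for gradient sequences."""
    g, g_tilde = np.asarray(g, dtype=float), np.asarray(g_tilde, dtype=float)
    dev = np.linalg.norm(g_tilde - g, axis=-1)       # per-round deviation ||g~_t - g_t||
    k_count = int(np.sum(dev > 0))                   # rounds with any corruption (Eq. 1)
    k_deviation = float(np.sum(dev) / G)             # cumulative deviation / G (Eq. 2)
    big_B = int(np.sum(dev >= G))                    # |B|: rounds with large corruption (Eq. 3)
    unified = float(np.sum(np.minimum(dev, G)) / G)  # left-hand side of Eq. (4)
    return k_count, k_deviation, big_B, unified

# One heavily corrupted round out of three, with G = 1:
kc, kd, B, uni = corruption_measures([[1.0], [1.0], [1.0]],
                                     [[1.0], [3.0], [1.0]], G=1.0)
```

On this toy sequence both $|B|$ and the unified quantity are bounded by $\min(k_{\text{count}}, k_{\text{deviation}})$, as claimed in the text.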
Notice that

$$|B| \le \min\big(k_{\text{count}}, k_{\text{deviation}}\big), \qquad \frac{1}{G}\sum_{t=1}^T \min\big(\|\tilde g_t - g_t\|, G\big) \le \min\big(k_{\text{count}}, k_{\text{deviation}}\big)$$

Hence, an algorithm whose regret depends on a $k$ satisfying Equations (3) and (4) can be used with either $k = k_{\text{count}}$ or $k = k_{\text{deviation}}$, depending on which is more appropriate for the problem at hand.

3. Challenges in the Unconstrained Domain

Dealing with corruptions in an unconstrained domain is significantly more challenging than in a bounded domain. As an illustration of the added difficulty, suppose the corruptions are so small that $\|\tilde g_t - g_t\| \le G$ for all $t$. Then in a bounded $W$ with a diameter $D$, an algorithm that completely ignores the possibility of corruptions and directly runs on $\tilde g_t$ will have low regret. This can be seen as follows: since $\|u - w_t\| \le D$ for every $u, w_t \in W$, we have:

$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le \sum_{t=1}^T \langle \tilde g_t, w_t - u\rangle + \sum_{t=1}^T \|g_t - \tilde g_t\| \|w_t - u\| \le \sum_{t=1}^T \langle \tilde g_t, w_t - u\rangle + kGD$$

In this case, $\|u - w_t\| \le D$ prevents the algorithm from straying too far from the comparator $u$. The situation is much more difficult in the unconstrained setting. Algorithms for this setting typically produce outputs $w_t$ that potentially grow exponentially fast in order to quickly compete with comparators that are very far from the starting point. However, this also means the algorithm is especially fragile to corruption, since the growth of $\|w_t\|$ can be highly sensitive to deviations $\|\tilde g_t - g_t\|$. Even a small deviation could cause $w_t$ to move extremely far away and therefore incur very high regret. This phenomenon is illustrated in Figure 1 with the KT-bettor algorithm (Orabona & Pál, 2016), which is a standard example of an unconstrained learner. In Figure 1, we consider $\ell_t(w) = |w - 1|$ for all $t$. Figures 1a and 1b demonstrate $k = 20$ gradients being corrupted by setting $\tilde g_t = -g_t$ during rounds $t \in [300, 300 + k - 1] = [300, 319]$ over a time span of $T = k^2 = 400$. This results in an exponential deviation away from the comparator $u = 1$, and so incurs high regret.
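The fragility described above can be reproduced in a few lines of Python. The following is our own qualitative reconstruction of the Figure 1 experiment (a 1-d KT bettor with sign-flip corruption), not the authors' code, so exact magnitudes will differ from the paper's plots:

```python
import numpy as np

def kt_bettor(grads_fn, T, eps=1.0):
    """1-d Krichevsky-Trofimov coin-betting learner (Orabona & Pal, 2016):
    wealth_t = eps - sum_i g_i w_i, and w_t = (sum_{i<t} -g_i)/t * wealth_{t-1}."""
    wealth, gsum = eps, 0.0
    ws = []
    for t in range(1, T + 1):
        w = (gsum / t) * wealth   # bet the KT fraction of current wealth
        ws.append(w)
        g = grads_fn(t, w)        # possibly corrupted feedback
        wealth -= g * w
        gsum += -g
    return np.array(ws)

def make_grads(corrupt_rounds):
    def grads(t, w):
        g = 1.0 if w > 1 else -1.0          # subgradient of |w - 1|
        return -g if t in corrupt_rounds else g  # sign-flip corruption
    return grads

T, u = 400, 1.0
clean = kt_bettor(make_grads(set()), T)
dirty = kt_bettor(make_grads(set(range(300, 320))), T)  # k = 20 corrupted rounds
regret = lambda ws: float(np.sum(np.abs(ws - u)))       # since l_t(u) = 0
```

Comparing `regret(dirty)` against `regret(clean)` exhibits the extra regret caused by a short burst of corruption late in the game.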
Finally, we show that this problem becomes exacerbated as $k$ increases by simulating $k \in \{20, 30, 40, 50, 60, 70\}$ with $T = k^2$ in Figure 1c.

4. Robustification through Regularization

In this section, we outline our general algorithm-design recipe. When receiving possibly corrupted gradients $\tilde g_t$, we first employ a gradient clipping step with some threshold $h_t$ that outputs a clipped version $g^c_t$, defined as follows:

$$g^c_t = \frac{\tilde g_t}{\|\tilde g_t\|} \min\big(h_t, \|\tilde g_t\|\big) \tag{5}$$

This preprocessing step corrects some of the corruption effect when $h_t$ is appropriately chosen. For example, in the case $h_t = G \ge \max_t \|g_t\|$, $g^c_t$ is always less corrupted than $\tilde g_t$, as $\|g^c_t - g_t\| \le \|\tilde g_t - g_t\|$. Then $g^c_t$ is used as feedback to an online learner, yielding the following expression for $R_T(u)$:

$$R_T(u) \le \sum_{t=1}^T \langle g^c_t, w_t - u\rangle + \sum_{t=1}^T \langle g_t - g^c_t, w_t - u\rangle$$

Next, we incorporate a regularization function $r_t : W \to \mathbb{R}$. We can re-write the bound on $R_T(u)$ as the following:

$$R_T(u) \le \underbrace{\sum_{t=1}^T \langle g^c_t, w_t - u\rangle + r_t(w_t) - r_t(u)}_{:= R^A_T(u)} + \underbrace{\sum_{t=1}^T \langle g_t - g^c_t, w_t\rangle}_{:= \text{ERROR}} - \underbrace{\sum_{t=1}^T r_t(w_t)}_{:= \text{CORRECTION}} + \underbrace{\sum_{t=1}^T r_t(u) - \langle g_t - g^c_t, u\rangle}_{:= \text{BIAS}} \tag{6}$$

The interpretation is that $R_T(u)$ can be controlled through four components: ERROR, CORRECTION, BIAS, and a composite regret $R^A_T(u)$. Here, $R^A_T(u)$ is the regret of an online learner $A$ with respect to the losses $\langle g^c_t, w\rangle + r_t(w)$. Some information about the structure of $r_t$ may be known before $w_t$ is chosen, making this problem similar to online optimization with composite losses (Duchi et al., 2010). We summarize the general procedure as Algorithm 1. Depending on whether $G \ge \max_t \|g_t\|$ is known or not, the appropriate setting of $h_t$, the selection of the regularizer $r_t$, and the online learner $A$ differ. We apply this general framework to the case where $G$ is known in Section 5.1 first, and then consider the significantly more challenging unknown-$G$ setting in Section 6.

Algorithm 1 General Protocol
1: Input: Clipping thresholds $0 < h_1 \le \cdots \le h_{T+1}$. An online learning algorithm $A$.
A regularizer $r_t : W \to \mathbb{R}_+$.
2: for $t = 1$ to $T$ do
3:   Play $w_t$, receive $\tilde g_t$
4:   Compute $g^c_t$ as in Equation (5)
5:   Send $g^c_t, h_{t+1}$ to $A$ and get $w_{t+1}$
6: end for

5. Robust Learning with Knowledge of the Lipschitz Constant

In this section, we proceed under the simpler setting that $G \ge \max_t \|g_t\|$ is known a priori. We therefore will set $h_t = G$ for all iterations in the definition of $g^c_t$ (see Equation (5)).

5.1. The Algorithm and Regret Guarantee

The quantity ERROR as defined in Equation (6) can be upper bounded by:

$$\text{ERROR} \le \sum_{t \in B} \|g_t - g^c_t\| \|w_t\| + \sum_{t \in \bar B} \|g_t - g^c_t\| \|w_t\| \le \Big(2G|B| + \sum_{t \in \bar B} \|\tilde g_t - g_t\|\Big) \max_t \|w_t\| \le 2kG \max_t \|w_t\|$$

where $B$ is defined in Equation (3), and $\bar B = [T] \setminus B$. The second inequality uses $\|g_t - g^c_t\| \le \|g_t - \tilde g_t\| \le G$ for $t \in \bar B$, because $h_t = G \ge \|g_t\|$; the last inequality is due to the corruption model presented in Equation (4). This suggests that the worst-case value of ERROR is at most $O(kG \max_t \|w_t\|)$, which could potentially be exponential in $t$, as shown in Lemma 8 of Zhang & Cutkosky (2022). On the other hand, no matter which regularizer $r_t$ we choose, a worst-case upper bound on BIAS can be derived:

$$\text{BIAS} \le O\big(kG\|u\|\big) + \sum_{t=1}^T r_t(u)$$

To enable CORRECTION to cancel ERROR while ensuring BIAS does not grow too large, the ideal choice of regularizer $r_t$ should achieve the following bounds simultaneously:

$$\sum_{t=1}^T r_t(w_t) \ge \Omega\big(kG \max_t \|w_t\|\big), \qquad \sum_{t=1}^T r_t(u) \le O\big(kG\|u\|\big)$$

To accomplish this, we choose $r_t(w) = f_t(w)$ as displayed in Equation (7), which belongs to the family of Huber losses first proposed by Zhang & Cutkosky (2022):

$$f_t(w; c, p, \alpha) = c\,\sigma_t(w; p, \alpha)\,\big/\,S_t^{1 - 1/p} \tag{7}$$

$$S_t = \sum_{i=1}^t \|w_i\|^p + \alpha^p \tag{8}$$

and a piecewise function $\sigma_t$:

$$\sigma_t(w; p, \alpha) = \begin{cases} \|w\|^p, & \|w\| \le \|w_t\| \\ \big(p\|w\| - (p-1)\|w_t\|\big)\|w_t\|^{p-1}, & \text{otherwise} \end{cases} \tag{9}$$

This function behaves polynomially near $w_t$ and linearly otherwise. By setting $c = kG$, $p = \ln T$, $\alpha = \epsilon/k$, $f_t$ achieves the desired properties. See a detailed discussion of the characteristics of this family of losses in Zhang & Cutkosky (2022). For completeness, we include a proof of the relevant properties in Lemma B.1.
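The Huber-style regularizer of Equations (7)-(9) can be sketched directly in Python. This is our own illustration of the formulas (the function names are ours), useful for checking, e.g., that $\sigma_t$ is continuous at $\|w\| = \|w_t\|$ and grows only linearly beyond it:

```python
import numpy as np

def sigma_t(w, w_t, p):
    """Piecewise shape function of Eq. (9): ||w||^p inside the ball of
    radius ||w_t||, and linear (with matching value and slope) outside."""
    nw, nwt = np.linalg.norm(w), np.linalg.norm(w_t)
    if nw <= nwt:
        return nw ** p
    return (p * nw - (p - 1) * nwt) * nwt ** (p - 1)

def huber_reg(w, w_t, past_iterates, c, p, alpha):
    """Regularizer f_t of Eqs. (7)-(8):
    f_t(w) = c * sigma_t(w; p) / S_t^(1 - 1/p),  S_t = sum_i ||w_i||^p + alpha^p."""
    S_t = sum(np.linalg.norm(wi) ** p for wi in past_iterates) + alpha ** p
    return c * sigma_t(w, w_t, p) / S_t ** (1.0 - 1.0 / p)
```

With $p = 2$ and $\|w_t\| = 1$, for instance, $\sigma_t$ equals $\|w\|^2$ up to the boundary and then continues with constant slope, which is exactly why it grows more slowly than $\|w\|^2$ far from the origin (compare Figure 2).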
Overall, these bounds allow $\text{ERROR} - \text{CORRECTION} \le O(1)$ and $\text{BIAS} \le O(kG\|u\|)$. Armed with this peculiar regularization function, we now need an online learner that can control $R^A_T(u)$. This is similar to online learning with loss $\ell_t(w) := \langle g^c_t, w\rangle + r_t(w)$ at each round $t$. However, we cannot simply apply any standard OCO algorithm to these losses. The issue is that the regularizer $r_t$ is $\Theta(kG)$-Lipschitz rather than $G$-Lipschitz. So, a naive approach would result in $R^A_T(u) = O(\|u\| kG\sqrt{T})$, but we want our final regret to be only $O(\|u\| G(\sqrt{T} + k))$. The key to avoiding this multiplicative $k$-factor is to observe that $r_t$ is known ahead of time: it is a composite term in the loss, and so ideally the regret should only depend on $g^c_t$. Unfortunately, composite losses are not as well-studied in the unconstrained online learning literature. Moreover, our composite loss is somewhat non-standard in that the shape of the function $r_t$ depends slightly on $w_t$. Nevertheless, we show that the online mirror descent algorithm developed by Jacobsen & Cutkosky (2022) does in fact guarantee composite regret in our setting, that is, $R^A_T(u) \le \tilde O(\|u\| G\sqrt{T})$ (see Theorem C.2, and an explicit algorithmic update procedure in Algorithm 2). By combining these ingredients, we are ready to state the overall regret bound, whose proof is deferred to Appendix D.

Theorem 5.1. Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Set $h_t = G$ and $r_t = f_t$ as defined in Equation (7) with $c = kG$, $p = \ln T$, $\alpha = \epsilon/k$ for some $\epsilon > 0$, and use Algorithm 2 as the base algorithm $A$. Then Algorithm 1 guarantees:

$$R_T(u) \le \tilde O\Big[\epsilon G + \|u\| G\big(\sqrt{T} + k\big)\Big]$$

Theorem 5.1 shows that the penalty for corrupted gradients is at most $\tilde O(\|u\| G k)$. This result has a few intriguing properties. First, so long as $k \le \sqrt{T}$, the penalty is dominated by the standard uncorrupted regret bound $O(\|u\| G\sqrt{T})$. That is, we can tolerate $k$ up to $\sqrt{T}$ essentially "for free". Next, observe that for $u = 0$, the regret is at most $\tilde O(\epsilon G)$ no matter what $k$ is.
Constant regret at the origin is typical for unconstrained algorithms, but is especially remarkable in our corrupted setting. Imagine a scenario in which we define $0$ to represent some default action. Our bound then says that no matter how much corruption is present, we never do significantly worse than this default.

5.2. Lower Bounds

We present a lower bound in Theorem 5.2, with proof deferred to Appendix E. This result shows that the upper bound of Theorem 5.1 is tight with respect to a comparator $u$ of any pre-specified magnitude. In addition, we provide a second lower bound as Theorem E.5 in Appendix E, which matches even the logarithmic factors.

Theorem 5.2. For every $D > 0$, there exists a comparator $u \in \mathbb{R}^d$ with $\|u\| = D$, and sequences $g_1, \ldots, g_T$ and $\tilde g_1, \ldots, \tilde g_T$ with $\|g_t\|, \|\tilde g_t\| \le 1$ and $\sum_{t=1}^T \mathbf{1}\{\tilde g_t \ne g_t\} = k$, such that:

$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \ge \Omega\Big[\|u\|\big(\sqrt{T} + k\big)\Big]$$

6. Robust Learning with Unknown Lipschitz Constant

In this section, we consider the scenario where $G \ge \max_t \|g_t\|$ is unknown. We first discuss the additional challenges that must be addressed in Section 6.1, followed by intuition for selecting $h_t$ and $r_t$. Finally, we discuss the choice of the base algorithm $A$ and the regret guarantee.

6.1. Challenges with Unknown $G$

The combination of an unconstrained domain, unknown $G$, and corrupted gradients provides a set of challenges that stymie current techniques in the literature. Let us unpack these challenges carefully, starting from where our analysis in Section 5 breaks down. Previously, we employed a clipping threshold $h_t = G$ to automatically filter out $\tilde g_t$ values that were in some sense "obviously corrupted", replacing them with values that were only corrupted by at most $G$. Then, we employed a technical regularization scheme by setting $r_t = f_t$ to cancel out these bounded corruptions. This strategy is no longer available: we do not know $G$. A natural approach would be to maintain a time-varying threshold $h_t$ that estimates $G$ on the fly, and use it in place of the unknown constant $G$.
However, this is harder than it seems, because inevitably we will sometimes have $h_t < G$, and so we are likely to clip even an uncorrupted gradient: $g^c_t \ne \tilde g_t$ even when $\tilde g_t = g_t$. This means that, unlike in the known-$G$ case, our clipping operation is not purely benign. Nevertheless, methods in the literature for tackling the unknown-$G$ setting without corruptions often use exactly this approach, setting $h_t$ based on the largest gradient magnitude observed so far (see Lemma F.1). Based on this observation, we propose a simple way to maintain a threshold $h_t$ which provides a conservative lower-bound estimate of $G$. The threshold $h_t$ starts at some small value $h_1 = \tau_G > 0$ and stays constant until we observe $k + 1$ iterations in which $\|\tilde g_t\| \ge h_t$, at which point it doubles. This mechanism is named FILTER and is displayed as Algorithm 3 in Appendix F. Notice that $h_t$ only doubles if it is guaranteed that some true gradient $g_t$ satisfies $h_t \le \|g_t\| + G$: since at most $k$ rounds are heavily corrupted, among any $k+1$ large observations at least one is only mildly corrupted, so $h_t \le O(G)$ always.

Denote the rounds in which gradients are clipped as $\bar P = \{t \in [T] : \tilde g_t \ne g^c_t\}$; that is, $\bar P$ denotes the rounds with $h_t \le \|\tilde g_t\|$. The conditional doubling mechanism also ensures $|\bar P| \le O(k)$ (see Lemma F.2). This means only a small fraction of the $\tilde g_t$ are truncated before $h_t$ becomes a very accurate estimate of $G$. Hence, the total number of rounds vulnerable to truncation error is small. This FILTER strategy improves upon a method with a similar purpose in van Erven et al. (2021); it uses only constant space rather than $O(k)$ space.

Figure 2. Comparison of the Huber loss $f_t$ without scaling factor, $\sigma_t$ with $p = 3$ and $\|w_t\| = 1$, against $\|w\|^2$. $\|w\|^2$ always grows faster than $\sigma_t(w)$ far away from the origin.

Challenge 2: As a consequence of using $h_t$ from FILTER as the clipping threshold, we can decompose the ERROR term defined in Equation (6) by using $\tilde g_t = g^c_t$ for $t \in P := [T] \setminus \bar P$:

$$\text{ERROR} \le \underbrace{\sum_{t \in \bar P} \|g_t - g^c_t\| \|w_t\|}_{E_{\bar P}:\ \text{truncation error}} + \underbrace{\sum_{t \in P} \|g_t - \tilde g_t\| \|w_t\|}_{E_P:\ \text{corruption error}}$$

Therefore, we must choose a regularizer $r_t$ to offset these errors. We discuss how each error component is treated, and finally reveal the structure of $r_t$.
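The FILTER mechanism, together with the clipping step of Equation (5), can be sketched as follows. This is our own reconstruction from the prose (the actual Algorithm 3 in Appendix F may differ in details such as counter resets):

```python
import numpy as np

class Filter:
    """Sketch of the FILTER thresholding idea: the threshold h stays
    constant until k+1 gradients with ||g~|| >= h have been observed,
    and only then doubles. Since at most k rounds are heavily corrupted,
    a doubling certifies that h is not far above the true scale G."""

    def __init__(self, h1, k):
        self.h, self.k, self.hits = h1, k, 0

    def clip(self, g_tilde):
        g_tilde = np.asarray(g_tilde, dtype=float)
        norm = np.linalg.norm(g_tilde)
        # clipped gradient g^c_t of Eq. (5), using the current threshold h_t
        g_c = g_tilde if norm < self.h else g_tilde * (self.h / norm)
        if norm >= self.h:                # a "large" (clipped) observation
            self.hits += 1
            if self.hits == self.k + 1:   # k+1 large observations seen
                self.h *= 2.0             # conservative doubling
                self.hits = 0
        return g_c
```

Note that each round is clipped with the threshold in force *before* any doubling, matching the protocol of Algorithm 1, where $h_{t+1}$ is only used from the next round on.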
Truncation error $E_{\bar P}$ can be reduced by keeping $\|w_t\|$ from being too large. Thus, every time we experience some truncation error, we would like to use some regularization to force the learner to decrease $\|w_t\|$. To this end, we use a quadratic regularization penalty $\alpha_t \|w_t\|^2$, where $\alpha_t \ne 0$ only when $t \in \bar P$, so as to encourage $\|w_t\|$ to decrease only when we have evidence that truncation error has occurred. Then, for any round $t \in \bar P$, we bound the truncation error minus the quadratic regularizer $\alpha_t \|w_t\|^2$ as follows:

$$\|g_t - g^c_t\| \|w_t\| - \alpha_t \|w_t\|^2 \le \sup_{X \ge 0}\ \|g_t - g^c_t\| X - \alpha_t X^2 = \frac{\|g_t - g^c_t\|^2}{4\alpha_t}$$

Such a regularization scheme implies, overall:

$$E_{\bar P} - \sum_t \alpha_t \|w_t\|^2 \le \sum_{t \in \bar P} \frac{\|g_t - g^c_t\|^2}{4\alpha_t}$$

and the additional regularization bias is $\|u\|^2 \sum_t \alpha_t$. This means the truncation error will be controlled as $O(kG^2)$, and the additional bias from this regularization will also be a mild $O(k\|u\|^2)$, if the following conditions hold simultaneously:

$$\sum_{t \in \bar P} \frac{1}{\alpha_t} \le O(k), \qquad \sum_{t \in [T]} \alpha_t \le O(k)$$

Notice that $|\bar P| \le O(k)$. Thus, the following $\alpha_t$ fulfills both requirements:

$$\alpha_t = \gamma_\alpha\, \mathbf{1}\{t \in \bar P\} \tag{11}$$

by setting $\gamma_\alpha = O(1)$. Next, for managing the corruption error $E_P$, one might attempt to take the same approach by imposing an additional quadratic regularizer $\beta_t \|w_t\|^2$. A similar calculation to the one used in the control of the truncation error suggests that we can control the corruption error so long as $\sum_{t \in P} 1/\beta_t \le O(k)$ and $\sum_t \beta_t \le O(k)$ hold simultaneously. Unfortunately, $|P|$ could potentially be as large as $\Omega(T)$, and so it is not possible to pick constant $\beta_t$ for $t \in P$ satisfying these desired rates. Our remedy is to apply a milder regularization, combining a quadratic regularizer that is active only on a smaller subset $P_0 \subseteq P$ with the Huber regularization $f_t$, which is active at every round. To see how this works, suppose we divide the $T$ rounds into $N$ epochs, within each of which $\|w_t\|$ at most doubles. Specifically, we begin with some $z_1 = \tau_D > 0$.
Whenever $\|w_t\| \ge z_t$, we set $z_{t+1} = 2\|w_t\|$, and $z_{t+1} = z_t$ otherwise (formally summarized as TRACKER, Algorithm 4 in Appendix G). We choose $P_0 = \{t : z_{t+1} \ne z_t\}$, that is, the time steps $P_0 = \{\tau_1, \ldots, \tau_N\}$ at which the threshold doubles. For any two consecutive time steps $\tau_n, \tau_{n+1} \in P_0$, we define $[\tau_n : \tau_{n+1} - 1]$ as a single epoch, for each $n \le N$. Notice that for $t \in [\tau_n, \tau_{n+1} - 1]$, we must have $\|w_t\| \le 2\|w_{\tau_n}\|$. Further, the total corruption error incurred in one such epoch is:

$$\sum_{t=\tau_n}^{\tau_{n+1}-1} \|g_t - g^c_t\| \|w_t\| \le 2\|w_{\tau_n}\| \sum_{t=\tau_n}^{\tau_{n+1}-1} \|g_t - g^c_t\|$$

So, applying a sufficiently large regularization on exactly the indices in $P_0$ will ideally be enough to cancel all of the corruption error. Further, notice that $|P_0| \le O(\ln(\max_t \|w_t\|))$. Thus, one might hope to apply an approach similar to the one used for truncation error: use a regularizer $\beta_t \|w\|^2$ with $\beta_t = 0$ for $t \notin P_0$ and $\beta_t$ equal to some constant $\beta$ for $t \in P_0$. Unfortunately, this will yield a regularization bias of $\sum_{t \in P_0} \beta \|u\|^2 \le O(\beta \ln(\max_t \|w_t\|) \|u\|^2)$. While this seems benign, recall that $\max_t \|w_t\|$ may actually be exponential in $T$, so that $\ln(\max_t \|w_t\|)$ is still polynomial in $T$. To resolve this, we gradually attenuate $\beta_t$ by choosing:

$$\beta_t = \frac{\gamma_\beta\, \mathbf{1}\{z_{t+1} \ne z_t\}}{1 + \sum_{i=1}^t \mathbf{1}\{z_{i+1} \ne z_i\}} \tag{12}$$

With this choice and $\gamma_\beta = O(k)$, the following are guaranteed:

$$\sum_{t \in P_0} \frac{1}{\beta_t} \le O\Big(\frac{1}{k} \ln \max_t \|w_t\|\Big), \qquad \sum_{t \in [T]} \beta_t = O\big(k \ln\ln \max_t \|w_t\|\big) = \tilde O(k)$$

The first guarantee ensures that $E_P - \sum_t \beta_t \|w_t\|^2$ is at most $O(kG^2 \ln \max_t \|w_t\|)$, while the second ensures that the additional bias is only $\sum_{t=1}^T \beta_t \|u\|^2 \le \tilde O(k\|u\|^2)$. Now, we still need to handle the potentially-large $O(kG^2 \ln \max_t \|w_t\|)$ term. This remaining corruption bound can be controlled by adding a small multiple of the Huber regularizer $f_t$, whose correction term can absorb an $O(kG^2 \ln \max_t \|w_t\|)$ error even for $c \le G$. To summarize, our solution to Challenge 2 is to set the regularizer to:

$$r_t(w) = f_t(w) + a_t \|w\|^2, \qquad \text{where } a_t = \alpha_t + \beta_t.$$
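The TRACKER device and the attenuated weights of Equation (12) can be sketched together. This is our own reconstruction from the prose (the actual Algorithm 4 in Appendix G may differ in details):

```python
import numpy as np

class Tracker:
    """Sketch of TRACKER: z doubles past ||w_t|| whenever ||w_t|| >= z_t,
    which defines the epoch boundaries P_0. The quadratic weight beta_t
    of Eq. (12) is nonzero only on these doubling rounds, and shrinks
    with the number of doublings seen so far."""

    def __init__(self, z1, gamma_beta):
        self.z, self.gb, self.doublings = z1, gamma_beta, 0

    def beta(self, w):
        """Process iterate w_t and return beta_t."""
        if np.linalg.norm(w) >= self.z:
            self.doublings += 1                     # this round joins P_0
            self.z = 2.0 * np.linalg.norm(w)        # z_{t+1} = 2 ||w_t||
            return self.gb / (1 + self.doublings)   # attenuated weight, Eq. (12)
        return 0.0                                  # z_{t+1} = z_t, beta_t = 0
```

Because the $n$-th doubling round receives weight $\gamma_\beta/(1+n)$, the total $\sum_t \beta_t$ grows only like a harmonic sum over the $N = O(\ln \max_t \|w_t\|)$ doublings, which is the source of the $\ln\ln$ factor above.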
Overall, the sparsely applied quadratic regularization and the small Huber regularization applied at every round allow the following bounds, as shown in Lemma H.2: a bound on $\text{ERROR} - \text{CORRECTION}$, and

$$\text{BIAS} \le \|u\|^2 \sum_{t=1}^T a_t + \sum_{t=1}^T f_t(u) + kG\|u\|$$

With appropriately chosen constants, $\text{ERROR} - \text{CORRECTION} \le O(kG^2)$ and $\text{BIAS} \le O(kG\|u\| + k\|u\|^2)$ are achieved.

6.3. Base Algorithm and the Regret

With the regularizer $r_t$ chosen in the previous section, it remains to choose a learner $A$ which guarantees low regret:

$$R^A_T(u) := \sum_{t=1}^T \langle g^c_t, w_t - u\rangle + f_t(w_t) - f_t(u) + \sum_{t=1}^T a_t\big(\|w_t\|^2 - \|u\|^2\big)$$

However, the solution of Section 5.1, learning with the composite loss $\ell_t(w) = \langle g^c_t, w\rangle + f_t(w) + a_t\|w\|^2$, is no longer appropriate, because it requires the composite term to be known in round $t$. Unfortunately, in this case the quadratic component depends on $a_t = \alpha_t + \beta_t$, which is unknown until $g^c_t$ is revealed. Due to this quadratic component, we employ the epigraph-based regularization technique recently developed by Cutkosky & Mhammedi (2024) to construct $A$. Briefly, this technique lifts convex optimization problems in $W = \mathbb{R}^d$ to an augmented space $\mathbb{R}^{d+1}$. The first $d$ coordinates of the solution in the augmented space form the decision variable $w_t$, and the extra coordinate is a technical device used to modify the loss in a way that makes it easier to control the quadratic penalty. We incorporate this technique in our setup to obtain $R^A_T(u) \le \tilde O(\|u\| G\sqrt{T} + \|u\|^2 k)$, as provided in Theorem I.3. We leave a brief summary of the technique to Appendix I. Combining this with the error-correction parts from the previous section, we obtain a general regret guarantee as follows, with free parameters $c, \gamma_\alpha, \gamma_\beta$ that specify the Huber regularizer and the quadratic regularizers:

Theorem 6.1. Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Set $r_t(w) = f_t(w) + a_t\|w\|^2$ with $a_t = \alpha_t + \beta_t$, where $f_t$ is defined in Equation (7) with parameters $\alpha = \epsilon\tau_G/c$, $p = \ln T$, for some $\epsilon, c, \gamma_\alpha, \gamma_\beta, \tau_G, \tau_D > 0$, and use Algorithm 5 as the base algorithm $A$.
Then Algorithm 1 guarantees:

$$R_T(u) \le \tilde O\bigg[\epsilon[G]_{\tau_G} + k\tau_D[G]_{\tau_G} + \|u\|[G]_{\tau_G}\sqrt{T} + c\big(\|u\| + \tau_D\big) + \big(\gamma_\alpha(k+1) + \gamma_\beta\big)\|u\|^2 + \frac{(k+1)[G]^2_{\tau_G}}{\gamma_\alpha} + \frac{k^2[G]^2_{\tau_G}}{\gamma_\beta}\bigg]$$

where $[G]_{\tau_G} := \max(\tau_G, G)$. Notice that although it appears possible to set $c = 0$ in the above bound, there is a $\log(1/c)$ term hidden inside the $\tilde O$ that prevents excessively small values of $c$. Next, we provide two different ways to set $c, \gamma_\beta$. By setting both $c, \gamma_\beta = O(k)$ and $\tau_D = O(1/k)$, we obtain:

Corollary 6.2. With $c = k\tau_G$, $\gamma_\beta = k$, $\gamma_\alpha = 1$, $\tau_D = \epsilon/k$, and the rest of the parameters the same as in Theorem 6.1, Algorithm 1 guarantees a regret bound:

$$R_T(u) \le \tilde O\Big[\epsilon[G]_{\tau_G} + \|u\|[G]_{\tau_G}\big(\sqrt{T} + k\big) + (k+1)\big(\|u\|^2 + [G]^2_{\tau_G}\big)\Big]$$

where $[G]_{\tau_G} := \max(\tau_G, G)$.

Just as in the known-$G$ case, the parameter settings in Corollary 6.2 yield $\tilde O(\sqrt{T})$ regret so long as $k \le \sqrt{T}$, so that we can experience a significant amount of corruption without damaging the asymptotics of the regret bound. We can also achieve the desirable safety property of Theorem 5.1, in which the regret with respect to the baseline point $u = 0$ is constant no matter what $k$ is, via a different setting of the regularization parameters, as provided in Corollary 6.3:

Corollary 6.3. With $c = \tau_G$, $\gamma_\beta = k^2$, $\gamma_\alpha = k + 1$, $\tau_D = 1$, and the rest of the parameters the same as in Theorem 6.1, Algorithm 1 guarantees a regret bound:

$$R_T(u) \le \tilde O\Big[\epsilon[G]_{\tau_G} + (k+1)[G]_{\tau_G} + [G]^2_{\tau_G} + \|u\|[G]_{\tau_G}\big(\sqrt{T} + k + 1\big) + \|u\|^2(k+1)^2\Big]$$

where $[G]_{\tau_G} := \max(\tau_G, G)$.

However, in this case we now pay a larger penalty for $u \ne 0$ that scales with $k^2$ rather than $k$. This is a natural tradeoff: we ensured constant regret at the origin by increasing the regularization away from the origin. The overall regularization is stronger, which encourages smaller $\|w_t\|$. Thus, this algorithm configuration is likely to be advantageous when competing with small $\|u\|$.

7. Conclusion

In this paper, we considered unconstrained online convex optimization in which the learner only has access to potentially corrupted gradients $\tilde g_t$ instead of the true gradients $g_t$, with the corruption level measured by $k$.
In the case that $G \ge \max_t \|g_t\|$ is known, we provide an algorithm that achieves the optimal regret guarantee $\|u\| G(\sqrt{T} + k)$. When $G$ is unknown, it incurs an extra additive penalty of $(\|u\|^2 + G^2)k$. While the $\|u\|^2 + G^2$ penalty is optimal without corruption (Cutkosky & Mhammedi, 2024), it is unclear whether the multiplicative dependence on $k$ is optimal in the presence of corruption.

Impact Statement

This paper advances the theoretical understanding of online convex optimization, contributing to the mathematical foundations of machine learning. As it does not involve deployable systems or datasets, the broader societal and ethical implications are indirect, and no specific concerns need to be highlighted.

References

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pp. 415-424, 2008.

Agarwal, A., Agarwal, S., and Patil, P. Stochastic dueling bandits with adversarial corruption. In Algorithmic Learning Theory, pp. 217-248. PMLR, 2021.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization, volume 28. Princeton University Press, 2009.

Ben-Tal, A., Hazan, E., Koren, T., and Mannor, S. Oracle-based robust optimization via online learning. Operations Research, 63(3):628-638, 2015.

Bertsekas, D. Convex Optimization Theory, volume 1. Athena Scientific, 2009.

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1-37, 2011.

Chang, F.-C., Nabiei, F., Wu, P.-Y., Cioba, A., Vakili, S., and Bernacchia, A. Gradient descent: Robustness to adversarial corruption. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.

Chen, S., Koehler, F., Moitra, A., and Yau, M. Online and distribution-free robustness: Regression and contextual bandits with Huber contamination.
In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 684-695. IEEE, 2022.

Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M. I., Flammarion, N., and Bartlett, P. L. Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137, 2020.

Cook, R. D. Detection of influential observation in linear regression. Technometrics, 42(1):65-68, 2000.

Cutkosky, A. Algorithms and Lower Bounds for Parameter-free Online Learning. PhD thesis, Stanford University, 2018.

Cutkosky, A. Artificial constraints and hints for unbounded online learning. In Proceedings of the Thirty-Second Conference on Learning Theory, pp. 874-894, 2019.

Cutkosky, A. and Mhammedi, Z. Fully unconstrained online learning. Advances in Neural Information Processing Systems, 2024.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, pp. 1493-1529, 2018.

Delibalta, I., Gokcesu, K., Simsek, M., Baruh, L., and Kozat, S. S. Online anomaly detection with nested trees. IEEE Signal Processing Letters, 23(12):1867-1871, 2016.

Diakonikolas, I. and Kane, D. M. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In COLT, volume 10, pp. 14-26. Citeseer, 2010.

Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562-1578. PMLR, 2019.

Huber, P. J. Robust Statistics, volume 523. John Wiley & Sons, 2004.

Ito, S. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409-7420, 2021.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Loh, P.-L. and Raginsky, M.
(eds.), Proceedings of the Thirty-Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 4160–4211. PMLR, 2022. URL https://proceedings.mlr.press/v178/jacobsen22a.html.

Jun, K.-S. and Orabona, F. Parameter-free online convex optimization with sub-exponential noise. In Conference on Learning Theory, pp. 1802–1823. PMLR, 2019.

Klivans, A., Kothari, P. K., and Meka, R. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pp. 1420–1430. PMLR, 2018.

Levy, D., Carmon, Y., Duchi, J. C., and Sidford, A. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860, 2020.

Lugosi, G. and Mendelson, S. Robust multivariate mean estimation: the optimality of trimmed mean. 2021.

Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122, 2018.

McMahan, B. and Streeter, M. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pp. 2402–2410, 2012.

McMahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In COLT, pp. 1020–1039, 2014.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. Conference on Learning Theory, pp. 2858–2887, 2020.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems, 29, 2016.

Orabona, F. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pp. 1116–1124, 2014.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Orabona, F.
and Pál, D. Coin betting and parameter-free online learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 577–585. Curran Associates, Inc., 2016.

Oren, Y., Sagawa, S., Hashimoto, T. B., and Liang, P. Distributionally robust language modeling. arXiv preprint arXiv:1909.02060, 2019.

Qi, Q., Guo, Z., Xu, Y., Jin, R., and Yang, T. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34:10067–10080, 2021.

Raginsky, M., Willett, R. M., Horn, C., Silva, J., and Marcia, R. F. Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8):5544–5562, 2012.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Sankararaman, A. and Narayanaswamy, B. Online robust non-stationary estimation. Advances in Neural Information Processing Systems, 36, 2024.

Sankararaman, A., Narayanaswamy, B., Singh, V. Y., and Song, Z. Fitness: (Fine tune on new and similar samples) to detect anomalies in streams with drift and outliers. In International Conference on Machine Learning, pp. 19153–19177. PMLR, 2022.

Thode, H. C. Testing for Normality. CRC Press, 2002.

van der Hoeven, D. User-specified local differential privacy in unconstrained adaptive online learning. In Advances in Neural Information Processing Systems, pp. 14103–14112, 2019.

van Erven, T., Sachs, S., Koolen, W. M., and Kotlowski, W. Robust online convex optimization in the presence of outliers. In Conference on Learning Theory, pp. 4174–4194. PMLR, 2021.

Xie, S. M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P. S., Le, Q. V., Ma, T., and Yu, A. W.
DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2023.

Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. Advances in Neural Information Processing Systems, 35:8000–8012, 2022.

Zhang, Z., Cutkosky, A., and Paschalidis, I. PDE-based optimal strategy for unconstrained online learning. In International Conference on Machine Learning, pp. 26085–26115. PMLR, 2022.

Zhang, Z., Yang, H., Cutkosky, A., and Paschalidis, I. C. Improving adaptive online learning using refined discretization. In International Conference on Algorithmic Learning Theory, pp. 1208–1233. PMLR, 2024.

Zhou, C. and Paffenroth, R. C. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674, 2017.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

A. Unconstrained Online Convex Optimization with Hints

In the unconstrained setting, there are algorithms that require a uniform bound $G \ge \max_t \|g_t\|$ upfront and guarantee $O(\|u\|G\sqrt{T})$ regret (McMahan & Orabona, 2014; Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Zhang et al., 2022). In the case where $G$ is unknown, algorithms are usually devised through an intermediate step in a slightly idealized scenario: at each iteration $t$, the algorithm receives the gradient $g_t$ together with a hint $h_{t+1} = \max_{i \le t+1}\|g_i\|$. It turns out that with access to $h_t$ to guide the algorithm, the same regret $O(\|u\|h_T\sqrt{T})$ can be achieved (Cutkosky, 2019; Mhammedi & Koolen, 2020; Jacobsen & Cutkosky, 2022; Zhang et al., 2024).
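As a concrete illustration of the hint sequence above (a minimal sketch, not taken from any of the cited algorithms), $h_t = \max_{i \le t}\|g_i\|$ is simply a running maximum of observed gradient norms:

```python
import math

def running_max_hints(grads):
    """Return the idealized hints h_t = max_{i <= t} ||g_i||_2 for a list of
    gradient vectors. The sequence is non-decreasing by construction, matching
    the requirement 0 < h_1 <= h_2 <= ... <= h_T."""
    hints, h = [], 0.0
    for g in grads:
        h = max(h, math.sqrt(sum(x * x for x in g)))
        hints.append(h)
    return hints
```

For example, `running_max_hints([[3, 4], [1, 0]])` returns `[5.0, 5.0]`; in the unknown-$G$ setting, only the prefix maximum over rounds seen so far is available.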
In this paper, we also follow the same strategy: we first assume a good hint $h_t = \max_{i \le t}\|g_i\|$ is supplied to the algorithm, and eventually investigate the scenario where only the current best estimate $h_t \ge \max_{i \le t-1}\|g_i\|$ is available. Hence, most proofs in the appendix are written in terms of a time-varying sequence of hints $0 < h_1 \le \cdots \le h_T \le h_{T+1}$ to accommodate both the known-$G$ and unknown-$G$ cases.

B. Bounds on the Regularizer in Equation (7)

Our results are based on appropriate algebraic properties of $f_t$ as displayed in Equation (7), which were first studied by Zhang & Cutkosky (2022). We restate the relevant bounds as Lemma B.1 for completeness.

Lemma B.1 (Lemma 11 and Lemma 13 of Zhang & Cutkosky (2022)). Let $f_t : W \to \mathbb{R}_+$ be defined as follows for some $c \ge 0$, $\alpha > 0$ and $p \ge 1$:
$$f_t(w; c, p, \alpha) = \begin{cases} \dfrac{c\,(p\|w\| - (p-1)\|w_t\|)\,\|w_t\|^{p-1}}{\big(\sum_{i=1}^t \|w_i\|^p + \alpha^p\big)^{1-1/p}}, & \|w\| > \|w_t\| \\[3mm] \dfrac{c\,\|w\|^p}{\big(\sum_{i=1}^t \|w_i\|^p + \alpha^p\big)^{1-1/p}}, & \|w\| \le \|w_t\|. \end{cases}$$
Then
$$\sum_{t=1}^T f_t(w_t) \ge c\left[\Big(\sum_{t=1}^T \|w_t\|^p + \alpha^p\Big)^{1/p} - \alpha\right], \qquad \sum_{t=1}^T f_t(u) \le c\,p\,\|u\|\,T^{1/p}\left[\ln\Big(1 + \frac{\|u\|}{\alpha}\Big)^{\frac{p-1}{p}} + 1\right].$$
In particular, when $p = \ln T$ for $T \ge 3$:
$$\sum_{t=1}^T f_t(w_t) \ge c\Big(\max_t\|w_t\| - \alpha\Big), \qquad \sum_{t=1}^T f_t(u) \le 3c\ln T\,\|u\|\left[\ln\Big(1 + \frac{\|u\|}{\alpha}\Big) + 1\right].$$

Proof. The first set of bounds is the same as Lemma 13 of Zhang & Cutkosky (2022). For the second set of bounds: the lower bound is due to $\big(\sum_{t=1}^T \|w_t\|^p + \alpha^p\big)^{1/p} \ge \big(\sum_{t=1}^T \|w_t\|^p\big)^{1/p}$, followed by an application of Lemma 11 in Zhang & Cutkosky (2022); the upper bound is due to $x^q \le x + 1$ for $x > 0$ and $0 < q < 1$, where we set $x = \ln(1 + \|u\|/\alpha)$ and $q = (p-1)/p$, followed by $T^{1/\ln T} = e \le 3$.

C. Base Algorithms for Known G

In this section, we verify that the centered mirror descent framework developed by Jacobsen & Cutkosky (2022) (see their Algorithm 1, which we contextualize as Algorithm 2 here) automatically guarantees a composite regret $R^A_T(u)$ for free, and is thus a compatible candidate for achieving robust unconstrained learning. Algorithm 2 is an explicit update with two regularizers: the mirror map $\psi_t(w)$, given by
$$\psi_t(w) = 3\int_0^{\|w\|}\Psi_t'(x)\,dx, \qquad \Psi_t'(x) = \min_{\eta \le 1/h_t}\left[\eta V_t + \frac{1}{\eta}\ln(x/a_t + 1)\right],$$
and a composite term $f_t(w)$ ($\phi_t(w)$ in their notation).
That is,
$$w_{t+1} = \operatorname*{argmin}_w\; \langle g_t, w\rangle + D_{\psi_t}(w \,|\, w_t) + \Delta_t(w) + f_t(w) \qquad (14)$$
where $D_\psi(a \,|\, b) = \psi(a) - \psi(b) - \langle \nabla\psi(b), a - b\rangle$ is the Bregman divergence induced by $\psi$, and $\Delta_t(w) = D_{\psi_{t+1}}(w \,|\, w_1) - D_{\psi_t}(w \,|\, w_1)$. Notice that the first two terms in Equation (14) correspond to classical mirror descent with mirror map $\psi_t$; the third term $\Delta_t(w)$ encourages the iterate to stay close to the initial point $w_1$, echoing "centered" mirror descent; and $f_t$ is an arbitrary composite regularizer whose structure is completely known in time to produce $w_{t+1}$. Jacobsen & Cutkosky (2022) show that the above update with $f_t = 0$ ($\phi_t$ in their notation) guarantees $R_T(u) \le O(\|u\|h_T\sqrt{T})$ (their Theorem 6), which is optimal. Their composite regularizer was specifically tailored as $f_t(w) = O(\|g_t\|^2\|w\|)$ in order to attain optimal dynamic regret (see their Proposition 1). We verify that $R^A_T(u) \le O(\|u\|h_T\sqrt{T})$ when $f_t$ is defined as in Equation (7) for our purpose, which is formalized as Theorem C.2. In fact, the development of Jacobsen & Cutkosky (2022) attains the same bound for any arbitrary composite regularizer $f_t$ that the user wishes to exploit. We first present a helper lemma giving an equivalent form of the update in Equation (14):

Lemma C.1. Equation (14) is equivalent to
$$\tilde w_{t+1} = \operatorname*{argmin}_w\; \langle g_t, w\rangle + D_{\psi_t}(w \,|\, w_t) \qquad (15)$$
$$w_{t+1} = \operatorname*{argmin}_w\; D_{\psi_t}(w \,|\, \tilde w_{t+1}) + \Delta_t(w) + f_{t+1}(w) \qquad (16)$$

Proof. By first-order optimality, Equation (15) implies $g_t + \nabla\psi_t(\tilde w_{t+1}) - \nabla\psi_t(w_t) = 0$, i.e., $\tilde w_{t+1} = \nabla\psi_t^{-1}(\nabla\psi_t(w_t) - g_t)$. Substituting $\tilde w_{t+1}$ into Equation (16), we see the equivalence.

Theorem C.2. Suppose $0 < h_1 \le h_2 \le \cdots \le h_T$ and $g_1, \dots, g_T$ is an arbitrary sequence satisfying $\|g_t\| \le h_t$ for all $t \in [T]$. Then Algorithm 2 guarantees
$$R^A_T(u) := \sum_{t=1}^T \langle g_t, w_t - u\rangle + r_t(w_t) - r_t(u) \le O\big(\epsilon h_T + \|u\|h_T\sqrt{T}\big),$$
where $r_t$ is defined as in Equation (7) for arbitrary $c, p, \alpha > 0$.

Proof. First, we verify that Algorithm 2 is indeed the update corresponding to Equation (14). By Lemma C.1, Equation (14) is equivalent to the two-step update shown in Equations (15) and (16).
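To make the two-step structure of Lemma C.1 concrete, here is a minimal sketch in the simplest setting we can substitute: a Euclidean mirror map $\psi(w) = \|w\|^2/(2\eta)$, no centering term ($\Delta_t = 0$), and a quadratic composite regularizer $f_{t+1}(w) = \lambda\|w\|^2$. These choices are illustrative assumptions only (Algorithm 2 uses the log-barrier-style $\psi_t$ above); under them, Equation (15) is a gradient step and Equation (16) is a proximal step with a closed form:

```python
def two_step_update(w, g, eta, lam):
    """One round of the two-step update of Lemma C.1 under the illustrative
    choices psi(w) = ||w||^2/(2*eta), Delta_t = 0, f_{t+1}(w) = lam*||w||^2.

    Eq. (15): w_half = argmin_v <g, v> + ||v - w||^2/(2*eta)    (gradient step)
    Eq. (16): w_next = argmin_v ||v - w_half||^2/(2*eta) + lam*||v||^2  (prox)

    First-order optimality of Eq. (16) gives (w_next - w_half)/eta
    + 2*lam*w_next = 0, hence the closed form below.
    """
    w_half = [wi - eta * gi for wi, gi in zip(w, g)]
    return [wh / (1.0 + 2.0 * eta * lam) for wh in w_half]
```

The same step-then-shrink pattern is what the non-Euclidean update implements, except that there the scalar optimality equation must be solved numerically rather than in closed form.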
By first-order optimality of Equation (15), $\tilde w_{t+1} = \nabla\psi_t^{-1}(\nabla\psi_t(w_t) - g_t)$. Then, substituting into Equation (16):
$$\nabla\psi_{t+1}(w_{t+1}) + \nabla f_{t+1}(w_{t+1}) = \nabla\psi_t(w_t) - g_t := \theta_t.$$
Define $\Psi_t(\|w\|) := \psi_t(w) = 3\int_0^{\|w\|}\Psi_t'(x)\,dx$ and $R_{t+1}(x) := \frac{c\,p\,x^{p-1}}{(S_t + x^p)^{1-1/p}}$ (notice $\nabla f_{t+1}(w) = \frac{w}{\|w\|}R_{t+1}(\|w\|)$); substituting the derivatives yields the scalar map
$$L_{t+1}(x) := 3\Psi_{t+1}'(x) + R_{t+1}(x).$$
Thus, the solution $w_{t+1}$ satisfies the following for some $x \ge 0$:
$$w_{t+1} = x\,\frac{\theta_t}{\|\theta_t\|}, \qquad L_{t+1}(x) = \|\theta_t\|.$$
In particular, defining $F_t(x) = \ln(1 + x/a_t)$,
$$L_t(x) = \begin{cases} 6\sqrt{V_t F_t(x)} + R_t(x), & h_t\sqrt{F_t(x)} < \sqrt{V_t} \\ 3h_t F_t(x) + 3V_t/h_t + R_t(x), & \text{otherwise.} \end{cases}$$
The boundary case is $x^\star$ with $h_t\sqrt{F_t(x^\star)} = \sqrt{V_t}$, that is, $x^\star = a_t\big(\exp(V_t/h_t^2) - 1\big)$. Consider the cases: if $L_t(x) = \|\theta_t\| \le L_t(x^\star) = 6V_t/h_t + R_t(x^\star)$, then $x \le x^\star$ and $h_t\sqrt{F_t(x)} \le \sqrt{V_t}$, so the first branch is invoked. Similarly for the other case, so the complete update solves:
$$L_t(x) = \begin{cases} 6\sqrt{V_t F_t(x)} + R_t(x), & \|\theta_t\| \le 6V_t/h_t + R_t(x^\star) \\ 3h_t F_t(x) + 3V_t/h_t + R_t(x), & \text{otherwise.} \end{cases}$$
It remains to show the composite regret $R^A_T(u)$ guaranteed by Algorithm 2. Notice that the original proof of Theorem 6 of Jacobsen & Cutkosky (2022) relies on their generalized Lemma 1 with $f_t = 0$ (their $\phi_t$). For general $f_t$, their Lemma 1 implies:
$$R_T(u) \le \psi_{T+1}(u) + \sum_{t=1}^T f_t(u) + \sum_{t=1}^T\Big[\langle g_t, w_t - w_{t+1}\rangle - D_{\psi_t}(w_{t+1}\,|\,w_t) - \Delta_t(w_{t+1}) - f_{t+1}(w_{t+1})\Big].$$
Moving the relevant terms to the left-hand side:
$$R_T(u) + \sum_{t=1}^T\big[f_t(w_t) - f_t(u)\big] \le \psi_{T+1}(u) + \sum_{t=1}^T\Big[\langle g_t, w_t - w_{t+1}\rangle - D_{\psi_t}(w_{t+1}\,|\,w_t) - \Delta_t(w_{t+1})\Big] + f_1(w_1) - f_{T+1}(w_{T+1}).$$
Since $w_1 = 0$, we have $f_1(w_1) = 0$, and $f_{T+1}$ is non-negative, so
$$R^A_T(u) \le \psi_{T+1}(u) + \sum_{t=1}^T\Big[\langle g_t, w_t - w_{t+1}\rangle - D_{\psi_t}(w_{t+1}\,|\,w_t) - \Delta_t(w_{t+1})\Big].$$
Thus, following exactly the same proof as Theorem 6 of Jacobsen & Cutkosky (2022):
$$R^A_T(u) \le 4h_T + 6\|u\|\max\left\{\sqrt{V_{T+1}\ln\Big(1 + \frac{\|u\|B_{T+1}\ln^2(B_{T+1})}{\epsilon}\Big)},\; h_T\ln\Big(1 + \frac{\|u\|B_{T+1}\ln^2(B_{T+1})}{\epsilon}\Big)\right\},$$
which is $O(\epsilon h_T + \|u\|h_T\sqrt{T})$ after suppressing logarithmic factors.

Algorithm 2 Robust Online Learning by Exploiting Linear Offset

1: Input: time horizon $T$; initial value $\epsilon$; corruption parameter $k$; regularization parameters $c, p, \alpha$; hints $0 < h_1 \le h_2 \le \cdots \le h_{T+1}$.
2: Initialize: $\theta_1 = 0$, $C_1 = 0$, $N_1 = 4$, $B_1 = 4N_1$, $w_1 = 0$, $S_0 = \alpha^p$.
3: for $t = 1$ to $T$ do
4: Define $F_t(x) = \ln(1 + x/a_t)$, $R_{t+1}(x) = c\,p\,x^{p-1}/(S_t + x^p)^{1-1/p}$.
5: Output $w_t$, receive $g_t$, where $\|g_t\| \le h_t$.
6: Compute $\theta_{t+1} = \nabla\psi_t(w_t) - g_t$.
7: Update $C_{t+1} = C_t + \|g_t\|^2$, $N_{t+1} = N_t + \|g_t\|^2/h_t^2$, $B_{t+1} = B_t + 4N_t$.
8: Set $V_{t+1} = h_{t+1}^2 + C_{t+1}$ and $a_{t+1} = \epsilon/(B_{t+1}\ln^2(B_{t+1}))$.
9: Compute $x^\star = a_{t+1}\big(\exp(V_{t+1}/h_{t+1}^2) - 1\big)$.
10: Define $L_{t+1}(x) = \begin{cases} 6\sqrt{V_{t+1}F_{t+1}(x)} + R_{t+1}(x), & \|\theta_t\| \le 6V_{t+1}/h_{t+1} + R_{t+1}(x^\star) \\ 3h_{t+1}F_{t+1}(x) + 3V_{t+1}/h_{t+1} + R_{t+1}(x), & \text{otherwise.} \end{cases}$
11: Solve for $x_{t+1}$: $L_{t+1}(x_{t+1}) = \|\theta_t\|$ {solvable via bisection with $x_{t+1} \in [0, S_t^{1/p}]$}.
12: Update $w_{t+1} = x_{t+1}\,\theta_t/\|\theta_t\|$, $S_{t+1} = S_t + \|w_{t+1}\|^p$.
13: end for

We remark that the inverse problem in Line 11 of Algorithm 2 can be handled efficiently by the bisection method, since $L_{t+1}(x)$ is monotonically increasing. Notice that $0 = L_{t+1}(0) \le L_{t+1}(x_{t+1})$, so $0$ can be used as an initial lower bound. On the other hand, $R_{t+1}(x_{t+1}) \le L_{t+1}(x_{t+1}) = \|\theta_t\|$, thus $x_{t+1} \le R_{t+1}^{-1}(\|\theta_t\|)$, where $R_{t+1}^{-1}$ denotes the inverse of the increasing function $R_{t+1}$. For simplicity, we can always use $S_t^{1/p}$ as an initial upper bound.

D. Proof of Theorem 5.1 and Applications

We provide the regret guarantee of Algorithm 1 when using an instance of Algorithm 2 as the base learner $A$, as well as two applications, in the case $h_1 = \cdots = h_{T+1} = G$.

Theorem 5.1. Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Set $h_t = G$ and $r_t = f_t$ as defined in Equation (7) with $c = kG$, $p = \ln T$, $\alpha = \epsilon/k$ for some $\epsilon > 0$, and use Algorithm 2 as the base algorithm $A$. Then Algorithm 1 guarantees:
$$R_T(u) \le O\big(\epsilon G + \|u\|G(\sqrt{T} + k)\big).$$

Proof. The proof begins with the regret decomposition in Equation (6), displayed below for convenience.
$$\sum_{t=1}^T\langle g_t, w_t - u\rangle = \underbrace{\sum_{t=1}^T\big[\langle g^c_t, w_t - u\rangle + r_t(w_t) - r_t(u)\big]}_{:=R^A_T(u)} + \underbrace{\sum_{t=1}^T\langle g_t - g^c_t, w_t\rangle}_{:=\text{ERROR}} - \underbrace{\sum_{t=1}^T r_t(w_t)}_{:=\text{CORRECTION}} + \underbrace{\sum_{t=1}^T\big[r_t(u) - \langle g_t - g^c_t, u\rangle\big]}_{:=\text{BIAS}}$$
Define OFFSET := ERROR − CORRECTION. In Section 5.1 we have already shown that ERROR $\le kG\max_t\|w_t\|$; thus, by Lemma B.1 with the substitutions $c = kG$, $p = \ln T$, $\alpha = \epsilon/k$:
$$\text{OFFSET} \le kG\max_t\|w_t\| - \sum_{t=1}^T f_t(w_t) \le kG\max_t\|w_t\| - kG\big(\max_t\|w_t\| - \epsilon/k\big) = O(\epsilon G).$$
Similarly, by another application of Lemma B.1:
$$\text{BIAS} \le kG\|u\| + 3kG\ln T\,\|u\|\left[\ln\Big(1 + \frac{k\|u\|}{\epsilon}\Big) + 1\right] = O(kG\|u\|),$$
suppressing logarithmic factors. In addition, Algorithm 2 is set as $A$ in response to $g^c_t$ with $h_t = G$, so Theorem C.2 guarantees $R^A_T(u) \le O(\epsilon G + \|u\|G\sqrt{T})$. Combining the three parts completes the proof.

Here, we provide implications of Theorem 5.1 for stochastic convex optimization and for an online version of distributionally robust optimization.

Stochastic convex optimization with corruptions. OCO and convex stochastic optimization are connected through the classical online-to-batch conversion (Orabona, 2019). Below, we present the implication of Theorem 5.1 for stochastic convex optimization in a setting where $k$ gradient evaluations are arbitrarily corrupted.

Corollary D.1 (Stochastic Convex Optimization via Online-to-Batch). Suppose $L : W \to \mathbb{R}$ is convex, $\mathbb{E}[\ell_t(w)] = L(w)$, $g_t = \nabla\ell_t(w_t)$, and $\mathbb{E}_t[\|g_t\|] \le G$. Suppose Algorithm 1 has access to $\tilde g_t$ such that $\sum_{t=1}^T\mathbb{1}\{\tilde g_t \ne g_t\} \le k$. Then Algorithm 1 with $\bar w_T = \frac{1}{T}\sum_{t=1}^T w_t$ guarantees
$$\mathbb{E}\big[L(\bar w_T) - L(u)\big] \le O\left(\frac{\epsilon G + \|u\|G(\sqrt{T} + k)}{T}\right).$$

Proof. The proof leverages the standard online-to-batch conversion (Theorem 3.1 in Orabona (2019) with $\alpha_t = 1$), combined with the regret bound from Theorem 5.1.

Distributionally robust optimization. Distributionally robust optimization is a form of robust stochastic optimization on training data sampled from a distribution $P$ that is not the same as the population distribution $Q$ (Ben-Tal et al., 2009; 2015). Typically, $Q$ is taken to be uniform, but the actual data collection process might be biased, meaning $P$ differs from $Q$.
In this situation, stochastic optimization that treats each training example with equal weight is no longer appropriate. Namkoong & Duchi (2016) formalized this framework as the following model with respect to a set of losses $\ell_1, \dots, \ell_T$ and an uncertainty set $\mathcal{P}_k = \{P \in \Delta_T : D_f(P\|Q) \le C(k, T)\}$, where $D_f(P\|Q)$ is the $f$-divergence for a convex function $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$:
$$\operatorname*{argmin}_w\;\sup_{P \in \mathcal{P}_k}\sum_{t=1}^T p_t\ell_t(w).$$
The decision variable from the above formulation takes into account the worst-case distributional uncertainty, and hence is intuitively associated with improving generalization error given an appropriate uncertainty set $\mathcal{P}_k$ (Sagawa et al., 2019). Distributionally robust optimization is increasingly relevant in the training of large language models, where training data are sourced from different domains (Xie et al., 2023). This is because data from some domains are relatively atypical compared to others in representing the overall population distribution (Oren et al., 2019). Although empirical gains have been observed by incorporating distributionally robust optimization, scalability has always been a primary concern for model training (Levy et al., 2020; Qi et al., 2021). Therefore, we consider a natural online version of the distributionally robust optimization model proposed by Namkoong & Duchi (2016), with the online analogue of its metric formulated as:
$$\sup_{P \in \mathcal{P}_k}\sum_{t=1}^T p_t\big(\ell_t(w_t) - \ell_t(u)\big).$$
We present the implication of Algorithm 1 for this problem with respect to the total variation distance $D_{TV}$ and the Kullback–Leibler divergence $D_{KL}$. In particular, we assume $\ell_t$ is convex and $Q$ is uniform.

Corollary D.2 (Online Distributionally Robust Optimization). Suppose $g_t \in \partial\ell_t(w_t)$ and $\|g_t\| \le G$. Algorithm 1 run on $g_t$ guarantees
$$\sup_{P : D_{TV}(P\|Q) \le k/T}\sum_{t=1}^T p_t\big(\ell_t(w_t) - \ell_t(u)\big) \le O\left(\frac{\epsilon G + \|u\|G(\sqrt{T} + k)}{T}\right).$$
In addition, in the case where $D_{KL}(P\|Q) \le 2k^2/T^2$, the same guarantee is achieved.

Proof. We begin with the case $D_{TV}(P\|Q) = \frac{1}{2}\sum_{t=1}^T|p_t - q_t| \le \frac{k}{T}$, where $q_t = \frac{1}{T}$.
First, we link this quantity to the regret incurred by the algorithm running on $g_t$, and denote the unobservable gradient as $\bar g_t = \frac{p_t}{q_t}g_t = p_t T g_t$. Then:
$$\sum_{t=1}^T p_t\big(\ell_t(w_t) - \ell_t(u)\big) \le \sum_{t=1}^T p_t\langle g_t, w_t - u\rangle = \sum_{t=1}^T q_t\langle\bar g_t, w_t - u\rangle = \frac{1}{T}\sum_{t=1}^T\langle\bar g_t, w_t - u\rangle.$$
Since $\sum_{t=1}^T\|\bar g_t - g_t\| \le G\sum_{t=1}^T|1 - p_t/q_t| \le 2GT\cdot D_{TV}(P\|Q) \le 2kG$, the pair $(g_t, \bar g_t)$ satisfies Equation (2). Hence Theorem 5.1 provides the guarantee:
$$\sum_{t=1}^T p_t\big(\ell_t(w_t) - \ell_t(u)\big) \le O\left(\frac{\epsilon G + \|u\|G(\sqrt{T} + k)}{T}\right).$$
In terms of $D_{KL}$, we exploit Pinsker's inequality $D_{TV} \le \sqrt{D_{KL}/2}$; hence $D_{KL}(P\|Q) \le 2k^2/T^2$ yields $D_{TV}(P\|Q) \le k/T$ and the same result.

E. Lower Bounds

In this section, we present two types of matching lower bounds for Theorem 5.1: Theorem 5.2 provides a lower bound for a comparator $u \in \mathbb{R}^d$ with arbitrary magnitude $D > 0$, and Theorem E.5 is a lower bound with the log factors that appear in unconstrained OCO upper bounds. We begin by presenting a helper lemma that aids the analysis of Theorem 5.2, followed by the lemmas required for the proof of Theorem 5.2.

Lemma E.1. Suppose $z_1, z_2, \dots, z_T \in \{-1, +1\}$ are independent with equal probability. Then for every $t \in [T]$ and every $T \ge 1$:
$$\mathbb{E}\Big[z_t\,\mathrm{sign}\Big(\sum_{i=1}^T z_i\Big)\Big] \ge \frac{1}{\sqrt{16T}}.$$

Proof. Define $S_{-t} = \sum_{i \in [T]: i \ne t} z_i$. By conditioning on $z_t \in \{-1, +1\}$:
$$\mathbb{E}\big[z_t\,\mathrm{sign}(S_{-t} + z_t)\big] = \tfrac{1}{2}\mathbb{E}\big[\mathrm{sign}(S_{-t} + 1)\big] - \tfrac{1}{2}\mathbb{E}\big[\mathrm{sign}(S_{-t} - 1)\big] = \tfrac{1}{2}\sum_j\big(\mathrm{sign}(j+1) - \mathrm{sign}(j-1)\big)\,\mathbb{P}(S_{-t} = j),$$
where $j$ ranges over the support of $S_{-t}$. Let $T' = T - 1$ be the number of summands in $S_{-t}$, and consider two cases. If $T'$ is even, then $\mathrm{sign}(j+1) - \mathrm{sign}(j-1) = 2$ when $j = 0$ and $0$ otherwise, so the expression equals
$$\mathbb{P}(S_{-t} = 0) = \binom{T'}{T'/2}2^{-T'} \ge \frac{1}{2}\Big(\frac{T'}{2}\Big)^{-1/2} = \frac{1}{\sqrt{2T'}}.$$
Similarly, if $T'$ is odd, then by the symmetry of $S_{-t} = 1$ and $S_{-t} = -1$ the expression equals
$$\tfrac{1}{2}\big(\mathbb{P}(S_{-t} = 1) + \mathbb{P}(S_{-t} = -1)\big) = \binom{T'}{(T'+1)/2}2^{-T'} \ge \frac{1}{\sqrt{8(T'+1)}} = \frac{1}{\sqrt{8T}}.$$
Combining the two cases (and noting the case $T = 1$ is trivial) gives the bound $1/\sqrt{16T}$; by symmetry, $S_{-t}$ has the same distribution for every $t \in [T]$, so the bound holds uniformly in $t$.

Theorem 5.2. For every $D > 0$, there exists a comparator $u^\star \in \mathbb{R}^d$ with $\|u^\star\| = D$ and sequences $g_1, \dots, g_T$ and $\tilde g_1, \dots, \tilde g_T$ with $\|g_t\|, \|\tilde g_t\| \le 1$ and $\sum_{t=1}^T\mathbb{1}\{\tilde g_t \ne g_t\} = k$, such that
$$\mathbb{E}\left[\sum_{t=1}^T\langle g_t, w_t - u^\star\rangle\right] \ge \Omega\big(\|u^\star\|(\sqrt{T} + k)\big).$$

Proof. Consider the following random sequence: $z_{k+1}, z_{k+2}, \dots, z_T \in \{-1, +1\}$ independently with equal probability, and $z_1 = \cdots = z_k = \mathrm{sign}\big(\sum_{t=k+1}^T z_t\big)$.
The corrupted sequence is $\tilde z_1 = \cdots = \tilde z_k = 0$ and $\tilde z_t = z_t$ for $t \ge k+1$. Let $q \in \mathbb{R}^d$ be any unit vector, and suppose $g_t = z_t q$, $\tilde g_t = \tilde z_t q$ for $t \in [T]$. Select $u^\star = -D\,\mathrm{sign}\big(\sum_{t=k+1}^T z_t\big)q$. Thus:
$$\mathbb{E}[R_T(u^\star)] = \mathbb{E}\left[\sum_{t=1}^T\langle g_t, w_t - u^\star\rangle\right] = \sum_{t=1}^T\mathbb{E}\big[\langle\mathbb{E}_t[z_t]q, w_t\rangle\big] - \sum_{t=1}^T\mathbb{E}\big[\langle g_t, u^\star\rangle\big] \ge 0 + Dk + D\,\mathbb{E}\Big|\sum_{t=k+1}^T z_t\Big| = \Omega\big(\|u^\star\|(k + \sqrt{T})\big),$$
where each $w_t$ depends only on $\tilde z_1, \dots, \tilde z_{t-1}$ so that $\mathbb{E}_t[z_t] = 0$, and the last step is by Lemma E.1.

The second lower bound, in Theorem E.5, has matching log factors and uses the notion of the regret at the origin of an online learning algorithm, formalized as:
$$\sum_{t=1}^T\langle g_t, w_t - 0\rangle \le \epsilon. \qquad (17)$$
This condition implies that an algorithm maintaining small $\epsilon$ is inherently conservative: it will perform well if the comparator is close to the origin, but this behavior may come at the cost of performing poorly if the comparator is far from the origin. Before presenting the analysis of Theorem E.5, we first list previously established results on properties of the iterates $w_t$ produced by any algorithm with a constant regret guarantee at the origin as defined in Equation (17). Lemma E.2 originally appeared in Cutkosky (2018) and was re-interpreted by Orabona (2019); Lemma E.3 is from Zhang & Cutkosky (2022).

Lemma E.2 (Theorem 5.11 of Orabona (2019)). For any OLO algorithm that suffers constant regret at the origin (Equation (17)) with $\|g_t\| \le 1$, there exist $\beta_t \in \mathbb{R}^d$ with $\|\beta_t\| \le 1$ such that
$$w_t = \beta_t\left(\epsilon - \sum_{i=1}^{t-1}\langle g_i, w_i\rangle\right) \quad \text{for all } t \in [T].$$

Lemma E.3 (Lemma 8 of Zhang & Cutkosky (2022): Unconstrained OLO Iterate Growth). Suppose the assumptions of Lemma E.2 are satisfied. Then for every $t \in [T]$, $\|w_t\| \le \epsilon 2^{t-1}$.

We first derive a lower bound for algorithms satisfying the assumptions of Lemma E.2. The construction originally appeared in Theorem 5.12 of Orabona (2019). Finally, the lower bound in the context of adversarial corruptions is presented in Theorem E.5.

Lemma E.4 (Unconstrained OLO Lower Bound). Suppose the assumptions of Lemma E.2 are satisfied, and set $g_t = [g_{t,1}, 0, \dots, 0]$ with $g_{t,1} = g = 1$ for all $t \in [T]$. Then there exists $u \in \mathbb{R}^d$ with $\|u\| = 2\epsilon e^T$ such that
$$\sum_{t=1}^T\langle g_t, w_t - u\rangle \ge -\epsilon + \|u\|\sqrt{\frac{T}{30}\ln\Big(1 + \frac{\|u\|^2 T}{2\epsilon^2}\Big)}.$$

Proof. Let $r_t = \sum_{i=1}^t\langle g_i, w_i\rangle$.
Then $\epsilon - r_T = (\epsilon - r_{T-1}) - \langle g_T, w_T\rangle = (\epsilon - r_{T-1})\big(1 - \langle g_T, \beta_T\rangle\big)$, where the second equality uses Lemma E.2, $w_T = \beta_T(\epsilon - r_{T-1})$ for some $\|\beta_T\| \le 1$. Recursively expanding $r_{T-1}, r_{T-2}, \dots, r_1$ with Lemma E.2, for some $\beta_t$ with $\|\beta_t\| \le 1$:
$$\sum_{t=1}^T\langle g_t, w_t\rangle = \epsilon - \epsilon\prod_{t=1}^T\big(1 - \langle g_t, \beta_t\rangle\big) \ge \epsilon - \epsilon\prod_{t=1}^T\max_{\|\beta_t\|\le 1}\big(1 - \langle g_t, \beta_t\rangle\big) = \epsilon - \epsilon\prod_{t=1}^T(1 + |g|) = \epsilon - \epsilon\Big(1 + \frac{|g|^2T}{T}\Big)^T \ge \epsilon - \epsilon\exp\big(|g|^2T\big),$$
where we used the inequality $(1 + x/n)^n \le e^x$ with $n = T$, $x = |g|^2T$ for the last step. Since all $g_{t,1}$ are equal, $|g|^2T = |\sum_{t=1}^T g_{t,1}|^2/T$; rearranging the above, we have
$$\sum_{t=1}^T\langle g_t, w_t\rangle \ge \epsilon - f\Big(\sum_{t=1}^T g_{t,1}\Big), \qquad f(x) = \epsilon\exp(x^2/T). \qquad (18)$$
By Theorem K.2 part 1, $f = f^{\star\star}$. Then, by the definition of the double conjugate,
$$\sum_{t=1}^T\langle g_t, w_t\rangle \ge \epsilon - f^{\star\star}\Big(\sum_{t=1}^T g_{t,1}\Big) = \epsilon - \sup_{u_1}\left[\Big\langle\sum_{t=1}^T g_{t,1}, u_1\Big\rangle - f^\star(u_1)\right].$$
By Theorem K.2 part 2, the supremum is achieved at
$$u_1^\star = f'\Big(\sum_{t=1}^T g_{t,1}\Big) = \frac{2\epsilon\sum_{t=1}^T g_{t,1}}{T}\exp\left(\frac{\big(\sum_{t=1}^T g_{t,1}\big)^2}{T}\right).$$
Substituting $u_1^\star$ and setting $u^\star = [u_1^\star, 0, \dots, 0]$, Equation (18) becomes
$$\sum_{t=1}^T\langle g_t, w_t\rangle \ge \epsilon - \sum_{t=1}^T\langle g_t, u^\star\rangle + f^\star(u_1^\star),$$
and rearranging, we have
$$\sum_{t=1}^T\langle g_t, w_t - u^\star\rangle \ge -\epsilon + f^\star(u_1^\star). \qquad (19)$$
It remains to obtain a lower bound on $f^\star(u_1^\star)$. By Lemma K.4 and Lemma K.3, together with the observation that $0.6\ln\big(1 + T|u_1^\star|^2/(2\epsilon^2)\big) > 1.5$ (since $|u_1^\star| = 2\epsilon e^T$), Lemma K.5 yields
$$f^\star(u_1^\star) \ge |u_1^\star|\sqrt{\frac{T}{30}\ln\Big(1 + \frac{T|u_1^\star|^2}{2\epsilon^2}\Big)}.$$
Substituting this lower bound on $f^\star(u_1^\star)$ into Equation (19) and noting $\|u^\star\| = |u_1^\star|$ completes the proof.

Theorem E.5. For any algorithm that maintains Equation (17) for some $\epsilon > 0$, there exist sequences $g_1, \dots, g_T$ and $\tilde g_1, \dots, \tilde g_T$ with $\|g_t\|, \|\tilde g_t\| \le 1$ and $\sum_{t=1}^T\mathbb{1}\{\tilde g_t \ne g_t\} = k$, and a $u \in \mathbb{R}^d$, such that
$$\sum_{t=1}^T\langle g_t, w_t - u\rangle \ge \Omega\left(\epsilon + \|u\|\Big(\sqrt{T\ln\big(1 + \tfrac{\|u\|^2T}{2\epsilon^2}\big)} + k\Big)\right).$$

Proof. The proof strategy is as follows: an algorithm with the regret guarantee in Equation (17) suffers the matching lower bound of Lemma E.4 in response to $\tilde g_t$. Then, by reversing the direction of exactly $k$ gradients, and taking into account the growth behavior of $\|w_t\|$ (Lemma E.3) together with the particular hard comparator $u^\star$ constructed in Lemma E.4, we can show that the regret during those rounds builds up linearly.
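As a quick numerical sanity check on Lemma E.1 (purely illustrative, not part of the argument), one can enumerate all sign sequences for small $T$ and verify the claimed $\Omega(1/\sqrt{T})$ alignment, with the convention $\mathrm{sign}(0) = 0$:

```python
import itertools, math

def expected_alignment(T, t=0):
    """Exact E[z_t * sign(z_1 + ... + z_T)] over uniform z in {-1,+1}^T,
    with sign(0) = 0; by symmetry the value is the same for every t."""
    total = 0
    for z in itertools.product((-1, 1), repeat=T):
        s = sum(z)
        total += z[t] * (0 if s == 0 else (1 if s > 0 else -1))
    return total / 2 ** T

# The exact expectation dominates the 1/sqrt(16*T) rate read off from the
# proof of Lemma E.1 for every small horizon checked here.
for T in range(1, 11):
    assert expected_alignment(T) >= 1 / math.sqrt(16 * T)
```

For instance, `expected_alignment(2)` evaluates to `0.5` and `expected_alignment(4)` to `0.375`, comfortably above the respective thresholds $1/\sqrt{32}$ and $1/8$.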
Let $\tilde g_1, \dots, \tilde g_T$ with $\|\tilde g_t\| \le 1$ be the sequence defined in Lemma E.4, and suppose the algorithm operates on those gradients. Let $S = \{t \in [T] : \tilde g_t \ne g_t\}$ be the corrupted index set; we take $S = \{1, \dots, k\}$ and define the true gradients by $g_t = -\tilde g_t$ for $t \in S$ and $g_t = \tilde g_t$ otherwise. Then, by the lower bound presented in Lemma E.4 applied to $\tilde g_t$ with the comparator $u^\star$ constructed there ($\|u^\star\| = 2\epsilon e^T$):
$$\sum_{t=1}^T\langle g_t, w_t - u^\star\rangle = \sum_{t=1}^T\langle\tilde g_t, w_t - u^\star\rangle + \sum_{t \in S}\langle g_t - \tilde g_t, w_t - u^\star\rangle \ge \Omega\left(\epsilon + \|u^\star\|\sqrt{T\ln\big(1 + \tfrac{\|u^\star\|^2T}{2\epsilon^2}\big)}\right) + \sum_{t \in S}\langle g_t - \tilde g_t, w_t - u^\star\rangle.$$
For $t \in S$, $\langle g_t - \tilde g_t, w_t - u^\star\rangle = -2\langle\tilde g_t, w_t\rangle + 2\langle\tilde g_t, u^\star\rangle \ge -2\|w_t\| + 2\|u^\star\|$, since $\langle\tilde g_t, u^\star\rangle = \|u^\star\|$ by the construction in Lemma E.4. Finally, by Lemma E.3, $\|w_t\| \le \epsilon 2^{t-1}$, hence $\sum_{t \in S}\|w_t\| \le \sum_{t=1}^k\epsilon 2^{t-1} \le \epsilon 2^k \le \|u^\star\|/2$ (using $k \le T$ and $\|u^\star\| = 2\epsilon e^T$), so
$$\sum_{t=1}^T\langle g_t, w_t - u^\star\rangle \ge \Omega\left(\epsilon + \|u^\star\|\sqrt{T\ln\big(1 + \tfrac{\|u^\star\|^2T}{2\epsilon^2}\big)}\right) - \|u^\star\| + 2k\|u^\star\| = \Omega\left(\epsilon + \|u^\star\|\Big(\sqrt{T\ln\big(1 + \tfrac{\|u^\star\|^2T}{2\epsilon^2}\big)} + k\Big)\right).$$

F. Adaptive Thresholding

In this section, we formalize the adaptive thresholding and clipping mechanism, FILTER, summarized in Section 6.2. This mechanism relies on prior knowledge of the number of large corrupted gradients, which is naturally restricted by the corruption model in Equation (3). We present this observation as Lemma F.1, followed by FILTER as Algorithm 3 and its properties in Lemma F.2.

Lemma F.1. For $g_1, \dots, g_T$ and $\tilde g_1, \dots, \tilde g_T$ satisfying Equation (3), there are at most $k$ indices $t$ such that $\|\tilde g_t\| > 2G$.

Proof. Since $\|g_t\| \le G$, any $t$ with $\|\tilde g_t\| > 2G$ satisfies $\|\tilde g_t - g_t\| \ge \|\tilde g_t\| - \|g_t\| > G$. Hence $\{t \in [T] : \|\tilde g_t\| > 2G\} \subseteq B := \{t \in [T] : \|\tilde g_t - g_t\| > G\}$, and by Equation (3), $|\{t \in [T] : \|\tilde g_t\| > 2G\}| \le |B| \le k$.

Algorithm 3 FILTER: $k$-lag Thresholding and Gradient Clipping

1: Input: corruption parameter $k$; initial Lipschitz guess $\tau = \tau_G > 0$.
2: Initialize: filter threshold $h_1 = \tau$; counter $n = 0$; registered set $P = \{\}$.
3: for $t = 1$ to $T$ do
4:   Receive $\tilde g_t$.
5:   if $\|\tilde g_t\| > h_t$ then
6:     Set $g^c_t = \frac{\tilde g_t}{\|\tilde g_t\|}h_t$; update counter: $n = n + 1$.
7:     if $n = k$ then
8:       Update threshold: $h_{t+1} = 2h_t$; reset counter: $n = 0$.
9:     end if
10:  else
11:    Set $g^c_t = \tilde g_t$; register round: $P = P \cup \{t\}$.
12:    Maintain threshold: $h_{t+1} = h_t$.
13:  end if
14:  Output $g^c_t, h_{t+1}$.
15: end for

We display some convenient properties of Algorithm FILTER; notice that all quantities apart from $h_t$ are for assisting the analysis only.

Lemma F.2.
(Algorithm 3 properties) Suppose $g_t, \tilde g_t$ satisfy Equation (3) and Algorithm 3 receives $\tilde g_t$. Then its per-iteration outputs $g^c_t, h_{t+1}$ satisfy:
(1) $h_{t+1} = h_t$ for all $t \in P = \{t \in [T] : g^c_t = \tilde g_t\}$;
(2) $\|g^c_t\| \le h_t$ for all $t \in [T]$;
(3) $\tau = h_1 \le h_2 \le \cdots \le h_{T+1} \le \max(\tau, 4G)$;
(4) $|[T]\setminus P| \le (k+1)\max\big(\log_2\frac{8G}{\tau}, 1\big)$.

Proof. We show each property in turn.
(1) is guaranteed by Lines 11–12.
(2) Either Line 6 or Line 11 is invoked to compute $g^c_t$, and both satisfy the bound.
(3) $h_t$ is a non-decreasing sequence with $h_1 = \tau$ by construction, so it remains to upper bound $h_t$ for $t \in [T+1]$. The key to this proof is that there are at most $k$ indices with $\|\tilde g_t\| > 2G$, guaranteed by Equation (3) (see Lemma F.1). In the case where the initial value satisfies $\tau \ge 2G$, the threshold is never doubled, since each doubling requires $k+1$ gradients exceeding the current threshold (Lines 5–9). Now consider $\tau < 2G$, where the doubling $h_{t+1} = 2h_t$ (Line 8) is invoked at least once. Then $h_{T+1} = 2^N\tau$ for some $N \in [T]$, where $N$ is the number of times Line 8 is invoked. The final doubling requires at least $k+1$ observed gradients with $\|\tilde g_t\| > 2^{N-1}\tau$; thus, by Lemma F.1, $2^{N-1}\tau \le 2G$, i.e., $N \le \log_2\frac{4G}{\tau}$. Hence $h_{T+1} = 2^N\tau \le 4G$. Moreover, $h_t$ is non-decreasing, which completes the proof.
(4) $|[T]\setminus P|$ is associated with the number of times the threshold is doubled. By the proof of property (3), $2^{N-1}\tau \le 2G$, so $N \le \max\big(\log_2\frac{4G}{\tau}, 0\big)$ upper bounds the number of executions of Line 8. Each execution of Line 8 requires exactly $k+1$ clipped gradients, so at most $(k+1)\max\big(\log_2\frac{4G}{\tau}, 0\big)$ rounds are not registered in $P$ up to the last time step $t'$ at which Line 8 executes. For $T \ge t > t'$, fewer than $k+1$ rounds are not registered in $P$, since otherwise the threshold would have doubled again. Thus
$$|[T]\setminus P| \le (k+1)\max\Big(\log_2\frac{4G}{\tau}, 0\Big) + (k+1) \le (k+1)\max\Big(\log_2\frac{8G}{\tau}, 1\Big).$$

G.
Adaptive Tracking

We introduce TRACKER, a simple doubling mechanism for estimating $\max_t\|w_t\|$, shown as Algorithm 4. Its properties are displayed in Lemma G.1.

Algorithm 4 TRACKER: Track the Magnitude of $w_t$

1: Input: initial magnitude guess $\tau = \tau_D > 0$.
2: Initialize: threshold $z_1 = \tau$; (counter, set): $(n = 0, \mathcal{T}_n = \{\})$; checkpoint $t_0 = 1$.
3: for $t = 1$ to $T$ do
4:   Receive $w_t$.
5:   if $\|w_t\| > z_t$ then
6:     Double: $z_{t+1} = 2\|w_t\|$.
7:     Update counter: $n = n + 1$.
8:     Add a new checkpoint $t_n = t$; initialize a new set $\mathcal{T}_n = \{\}$.
9:   else
10:    Maintain: $z_{t+1} = z_t$.
11:  end if
12:  Register round: $\mathcal{T}_n = \mathcal{T}_n \cup \{t\}$.
13: end for

Lemma G.1. (Algorithm 4 properties) Algorithm 4 guarantees:
(1) $[T]$ is partitioned into $\mathcal{T}_0, \mathcal{T}_1, \dots, \mathcal{T}_N$ for some $N \le \max\big(0, \log_2(2\max_t\|w_t\|/\tau)\big)$;
(2) $\tau = z_t = z_{t+1}$ and $\|w_t\| \le \tau$ for all $t \in \mathcal{T}_0$;
(3) $\|w_t\| \le 2\|w_{t_n}\|$ for all $t \in \mathcal{T}_n$, $n \in [N]$;
(4) $\tau = z_1 \le z_2 \le \cdots \le z_{T+1} \le \max(\tau, 2\max_t\|w_t\|)$.

Proof. We show each property in turn.
(1) The partition property follows since $n$ is initialized to $0$ and incremented by $1$ (Line 7), a new set $\mathcal{T}_n$ is created whenever the counter updates (Line 8), and every $t \in [T]$ is assigned to the current $\mathcal{T}_n$ (Line 12). Each doubling at least doubles the threshold, so $z_{T+1} \ge 2^N\tau$; combined with $z_{T+1} \le 2\max_t\|w_t\|$ from property (4), we get $N \le \max\big(0, \log_2(2\max_t\|w_t\|/\tau)\big)$.
(2) During the period with $n = 0$, Line 6 is never executed, so $z_t = \tau$ and $\|w_t\| \le z_t = \tau$.
(3) By construction, $\mathcal{T}_n = \{t_n, t_n+1, \dots, t_{n+1}-1\}$ for $n \in [N-1]$ and $\mathcal{T}_N = \{t_N, \dots, T\}$. When $t = t_n$, the inequality holds trivially, so consider $t \in \mathcal{T}_n\setminus\{t_n\}$: Line 10 was triggered, hence $z_t = z_{t_n+1}$ and $\|w_t\| \le z_t$. On the other hand, $z_{t_n+1} = 2\|w_{t_n}\|$ by Line 6. Thus $\|w_t\| \le z_t = 2\|w_{t_n}\|$ for all $t \in \mathcal{T}_n\setminus\{t_n\}$.
(4) Since $z_1 = \tau$ and each $z_{t+1}$ is produced either by Line 6 (doubling, with $z_{t+1} = 2\|w_t\| > 2z_t$) or by Line 10 (maintaining), the sequence is non-decreasing. If Line 6 is never executed, then $z_{T+1} = z_1 = \tau$. Now consider the case where Line 6 is executed at least once, and let $t' \in [T]$ be the last time step at which Line 6 is executed.
Thus $z_{T+1} = z_T = \cdots = z_{t'+1} = 2\|w_{t'}\| \le 2\max_t\|w_t\|$; combined with $z_t$ being non-decreasing, we complete the proof.

H. Error Correction

We quantify the error-correction effect resulting from the trigger signals $\alpha_t, \beta_t$ of FILTER and TRACKER, respectively, and the chosen regularizer $r_t(w) = f_t(w) + a_t\|w\|^2$, as discussed in Section 6.3. We aim to bound OFFSET := ERROR − CORRECTION by splitting it into two components:
$$\text{OFFSET}_1 := \sum_{t \notin P}\|g_t - g^c_t\|\|w_t\| - \sum_{t=1}^T\alpha_t\|w_t\|^2 \quad \text{(due to adaptive clipping)},$$
$$\text{OFFSET}_2 := \sum_{t \in P}\|g_t - \tilde g_t\|\|w_t\| - \sum_{t=1}^T\beta_t\|w_t\|^2 \quad \text{(due to corruption)},$$
together with the comparator-related term
$$\text{BIAS} = \|u\|^2\sum_{t=1}^T a_t + \sum_{t=1}^T f_t(u) + \|u\|\sum_{t \in P}\|g_t - g^c_t\| + \|u\|\sum_{t \notin P}\|g_t - g^c_t\|.$$
We begin with a helper lemma, followed by the upper bounds.

Lemma H.1 (Error of Truncated Gradients). Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Define $g^c_t$ as in Equation (5) with $h_t \le \tau$ for some $\tau > 0$. Then
$$\sum_{t=1}^T\|g_t - g^c_t\| \le 2k\max(\tau, G).$$

Proof. By definition, $\|g_t - g^c_t\| \le G + \tau \le 2\max(\tau, G)$. Thus
$$\sum_{t=1}^T\|g_t - g^c_t\| = \sum_{t=1}^T\min\big(\|g_t - g^c_t\|, 2\max(\tau, G)\big) \le \frac{2\max(\tau, G)}{G}\sum_{t=1}^T\min\big(\|g_t - g^c_t\|, G\big) \le 2k\max(\tau, G),$$
where the last step is due to Equation (4).

Lemma H.2. Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Algorithms 3 and 4 are initialized with some $\tau_G, \tau_D > 0$; Algorithm 3 responds to $\tilde g_t$ and outputs $h_{t+1}$, while Algorithm 4 responds to an arbitrary sequence $w_t$ with $\max_t\|w_t\| \le \epsilon 2^{2T}$ and outputs $z_{t+1}$. Define $r_t(w) = f_t(w) + a_t\|w\|^2$, where $f_t$ is defined as in Equation (7) for some $c > 0$, $p = \ln T$, $\alpha = \epsilon\tau_G/c$, and $a_t = \alpha_t + \beta_t$ with $\alpha_t, \beta_t$ defined as in Equations (11) and (12). For some $\gamma_\alpha, \gamma_\beta > 0$:
$$\text{OFFSET} \le \frac{64\max(\tau_G, G)^2}{\gamma_\alpha}(k+1)\max\Big(\log_2\frac{8G}{\tau_G}, 1\Big) + \epsilon\tau_G + \frac{16k^2\max(\tau_G, G)^2}{\gamma_\beta}\ln\Big(\frac{64k^2\max(\tau_G, G)^2}{\gamma_\beta\,c\,\tau_D}\Big) + 2k\tau_D\max(\tau_G, G),$$
$$\text{BIAS} \le O\Big(\big(\gamma_\alpha(k+1) + \gamma_\beta\big)\|u\|^2 + c\|u\| + \|u\|\,k\max(\tau_G, G)\Big).$$

Proof. We bound the three components in turn.

OFFSET$_1$ (due to adaptive clipping):
$$\text{OFFSET}_1 = \sum_{t \notin P}\|g_t - g^c_t\|\|w_t\| - \sum_{t=1}^T\alpha_t\|w_t\|^2 \le \sum_{t \notin P}\big[(G + h_t)\|w_t\| - \alpha_t\|w_t\|^2\big]. \qquad (20)$$
For each fixed $t \notin P$, we have $A_t\|w_t\| - \alpha_t\|w_t\|^2 \le \sup_{X \ge 0}\big(A_tX - \alpha_tX^2\big) = \frac{A_t^2}{4\alpha_t}$, where $A_t = G + h_t > 0$.
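The elementary fact invoked here, $\sup_{X \ge 0}(AX - \alpha X^2) = A^2/(4\alpha)$ for $A, \alpha > 0$ (attained at $X = A/(2\alpha)$), is easy to confirm numerically; a throwaway check, not part of the analysis:

```python
def quad_sup(A, alpha, grid=200000):
    """Grid-maximize A*X - alpha*X^2 over X >= 0; the objective is negative
    for X > A/alpha, so searching [0, A/alpha] suffices."""
    x_max = A / alpha
    return max(A * (x_max * i / grid) - alpha * (x_max * i / grid) ** 2
               for i in range(grid + 1))

# Compare the grid maximum against the closed form A^2 / (4*alpha).
for A, alpha in [(1.0, 0.5), (3.0, 0.7), (10.0, 2.0)]:
    assert abs(quad_sup(A, alpha) - A * A / (4 * alpha)) < 1e-6
```

The same completing-the-square step reappears below in the analysis of OFFSET$_2$, with $2\sum\|g_t - \tilde g_t\|$ playing the role of $A$ and $\beta_{t_n}$ the role of $\alpha$.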
Hence, an upper bound on Equation (20) can be derived by substituting $\alpha_t = \gamma_\alpha$ for $t \notin P$:
$$\sum_{t \notin P}\frac{A_t^2}{4\alpha_t} = \frac{1}{4\gamma_\alpha}\sum_{t \notin P}(G + h_t)^2 \le \frac{(G + h_T)^2}{4\gamma_\alpha}|[T]\setminus P| \le \frac{64\max(\tau_G, G)^2}{\gamma_\alpha}(k+1)\max\Big(\log_2\frac{8G}{\tau_G}, 1\Big),$$
where the last inequality uses the upper bound on $|[T]\setminus P|$ from Lemma F.2 (4) together with $h_T \le \max(\tau_G, 4G)$ from Lemma F.2 (3).

OFFSET$_2$ (due to corruption): Note that $\text{OFFSET} \le \text{OFFSET}_1 + \text{OFFSET}_2 - \sum_{t=1}^T f_t(w_t)$. The upper bound is obtained through two steps: step 1 bounds $\text{OFFSET}_2$ by $O(G^2k\ln(\max_t\|w_t\|))$, and step 2 uses the remaining $-\sum_t f_t(w_t)$ term to remove the dependence on $\max_t\|w_t\|$, giving $O(G^2k)$ overall.

By construction,
$$\beta_t = \begin{cases} \gamma_\beta\,\dfrac{\mathbb{1}\{z_{t+1} \ne z_t\}}{1 + \sum_{i=1}^t\mathbb{1}\{z_{i+1} \ne z_i\}}, & t = t_n \text{ for some } n \in [N] \\ 0, & \text{otherwise.} \end{cases}$$
Proceeding with the analysis of step 1, where the second line uses the partition from Lemma G.1 (1) and the value of $\beta_t$ displayed above, and the third line uses Lemma G.1 (2) and (3):
$$\text{step 1} = \sum_{t \in P}\|g_t - \tilde g_t\|\|w_t\| - \sum_{t=1}^T\beta_t\|w_t\|^2 = \sum_{n=0}^N\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\|\|w_t\| - \sum_{n=1}^N\beta_{t_n}\|w_{t_n}\|^2$$
$$\le \tau_D\sum_{t \in P\cap\mathcal{T}_0}\|g_t - \tilde g_t\| + \sum_{n=1}^N\left[2\|w_{t_n}\|\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\| - \beta_{t_n}\|w_{t_n}\|^2\right].$$
For the first summand, we can define imaginary truncated gradients $\hat g_t = \frac{\tilde g_t}{\|\tilde g_t\|}\min(\|\tilde g_t\|, \tau_G)$, so that
$$\sum_{t \in P\cap\mathcal{T}_0}\|g_t - \tilde g_t\| \le \sum_{t=1}^T\|g_t - \hat g_t\| \le 2k\max(\tau_G, G)$$
by invoking Lemma H.1. Thus,
$$\text{step 1} \le 2k\tau_D\max(\tau_G, G) + \sum_{n=1}^N\left[2\|w_{t_n}\|\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\| - \beta_{t_n}\|w_{t_n}\|^2\right]. \qquad (21)$$
Now we analyze each summand over $n$ in Equation (21). For a fixed $n \in [N]$:
$$2\|w_{t_n}\|\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\| - \beta_{t_n}\|w_{t_n}\|^2 \le \sup_{X \ge 0}\left[2X\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\| - \beta_{t_n}X^2\right] = \frac{1}{\beta_{t_n}}\Big(\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\|\Big)^2$$
$$= \frac{1 + \sum_{i=1}^{t_n}\mathbb{1}\{z_{i+1} \ne z_i\}}{\gamma_\beta}\Big(\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\|\Big)^2 \le \frac{1 + N}{\gamma_\beta}\Big(\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\|\Big)^2,$$
where the last line is due to the total number of doublings being $N = \sum_{i=1}^T\mathbb{1}\{z_{i+1} \ne z_i\}$. Now, substituting back into Equation (21):
$$\text{step 1} \le 2k\tau_D\max(\tau_G, G) + \frac{2(1+N)}{\gamma_\beta}\sum_{n=1}^N\Big(\sum_{t \in P\cap\mathcal{T}_n}\|g_t - \tilde g_t\|\Big)^2 \le 2k\tau_D\max(\tau_G, G) + \frac{2(1+N)}{\gamma_\beta}\Big(\sum_{t=1}^T\|g_t - g^c_t\|\Big)^2$$
$$\le 2k\tau_D\max(\tau_G, G) + \frac{2(1+N)}{\gamma_\beta}\big(2k\max(h_T, G)\big)^2,$$
where the last step is due to Lemma H.1, noticing $\|g^c_t\| \le h_T$ for all $t \in [T]$.
This means we have obtained an upper bound for step 1:
$$\text{step 1} \le 2k\tau_D\max(\tau_G,G) + \frac{8k^2\max(\tau_G,G)^2}{\gamma_\beta}\Big(1 + \max\Big(0, \log_2\tfrac{2\max_t\|w_t\|}{\tau_D}\Big)\Big) \le 2k\tau_D\max(\tau_G,G) + \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}\ln\Big(2 + \tfrac{4\max_t\|w_t\|}{\tau_D}\Big)$$
where the second step is due to Lemma G.1 (1). Thus, it suffices to bound step 2, defined as follows, to obtain a bound for OFFSET$_2$ that is independent of $\max_t\|w_t\|$:
$$\text{step 2} := \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}\ln\Big(2 + \tfrac{4\max_t\|w_t\|}{\tau_D}\Big) - \sum_{t=1}^T r_t(w_t)$$
Invoking Lemma B.1 with $\alpha = \epsilon\tau_G/c$:
$$\text{step 2} \le \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}\ln\Big(2 + \tfrac{4\max_t\|w_t\|}{\tau_D}\Big) - c\max_t\|w_t\| + \epsilon\tau_G$$
For $A, B > 0$, $A\ln(2+X) - BX$ attains its supremum at $X = A/B - 2$ whenever $A/B > 2$, hence
$$\sup_{X}\ A\ln(2+X) - BX = A\ln(A/B) - A + 2B.$$
Substituting $X = 4\max_t\|w_t\|/\tau_D$, $A = \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}$, and $B = \frac{c\tau_D}{4}$:
$$\text{step 2} \le \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}\ln\frac{64k^2\max(\tau_G,G)^2}{\gamma_\beta c\tau_D} + \frac{c\tau_D}{2} + \epsilon\tau_G$$
Thus step 1 and step 2 together imply
$$\mathrm{OFFSET}_2 \le \epsilon\tau_G + \frac{16k^2\max(\tau_G,G)^2}{\gamma_\beta}\ln\frac{64k^2\max(\tau_G,G)^2}{\gamma_\beta c\tau_D} + \frac{c\tau_D}{2} + 2k\tau_D\max(\tau_G,G).$$

BIAS (comparator-related term):
$$\mathrm{BIAS} := \|u\|^2\sum_t a_t + \sum_t f_t(u) + \|u\|\sum_{t\in P}\|\tilde g_t - \tilde g_t^c\| + \|u\|\sum_{t\notin P}\|\tilde g_t - \tilde g_t^c\| \le \|u\|^2\sum_t a_t + \sum_t f_t(u) + \|u\|\sum_{t=1}^T\|\tilde g_t - \tilde g_t^c\|$$
$$\le \|u\|^2\sum_t a_t + \sum_t f_t(u) + 2\|u\|k\max(h_T, G) \le \|u\|^2\sum_t a_t + \sum_t f_t(u) + 16\|u\|k\max(\tau_G, G) \quad (22)$$
where the last inequality uses $h_T \le 8\max(\tau_G, G)$. It remains to show that the first two terms in Equation (22) are of the desired order. For the first summand, $\sum_t a_t = \sum_t\alpha_t + \sum_t\beta_t$. By definition and Lemma F.2 (4):
$$\sum_t\alpha_t \le \gamma_\alpha|\bar P| \le \gamma_\alpha(k+1)\max\Big(\log_2\tfrac{8G}{\tau_G},\,1\Big) = O(\gamma_\alpha(k+1))$$
and Lemma G.1 (1) implies $N \le O(\ln\max_t\|w_t\|)$, which together with the condition $\max_t\|w_t\| \le \epsilon 2^{2T}$ implies:
$$\sum_t\beta_t = \gamma_\beta\sum_t\frac{1\{z_{t+1}\ne z_t\}}{1+\sum_{i=1}^t 1\{z_{i+1}\ne z_i\}} \le \gamma_\beta\ln(1+N) = O(\gamma_\beta)$$
The second term in Equation (22) can be upper bounded by Lemma B.1, substituting $\alpha = \epsilon\tau_G/c$:
$$\sum_{t=1}^T f_t(u) \le 3c\ln(T)\,\|u\|$$
so that
$$\mathrm{BIAS} \le O\Big(\big(\gamma_\alpha(k+1) + \gamma_\beta\big)\|u\|^2 + c\|u\| + \|u\|k\max(\tau_G, G)\Big).$$
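The log-linear supremum used in step 2 can also be verified numerically. A minimal sketch with arbitrary test constants satisfying $A/B > 2$:

```python
import math

# Check of the supremum used in step 2:
#   f(X) = A*ln(2 + X) - B*X has f'(X) = A/(2+X) - B = 0 at X* = A/B - 2,
#   with f(X*) = A*ln(A/B) - A + 2*B (valid when A/B > 2).
A, B = 5.0, 1.0
x_star = A / B - 2
closed_form = A * math.log(A / B) - A + 2 * B
assert abs(A * math.log(2 + x_star) - B * x_star - closed_form) < 1e-12
# f is concave, so no grid point exceeds the closed form:
for i in range(100_000):
    X = i * 1e-3
    assert A * math.log(2 + X) - B * X <= closed_form + 1e-9
```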
Base Algorithms for Unknown G

In this section, we show that the epigraph-based regularization scheme developed by Cutkosky & Mhammedi (2024) guarantees an appropriate composite regret, defined as:
$$R^A_T(u) := \sum_{t=1}^T\langle\tilde g_t^c, w_t - u\rangle + r_t(w_t) - r_t(u)$$
when $r_t(w) = f_t(w) + a_t\|w\|^2$ and the sequence $a_t$ satisfies the assumptions of Lemma H.2. That is,
$$R^A_T(u) = \sum_{t=1}^T\langle\tilde g_t^c, w_t - u\rangle + f_t(w_t) - f_t(u) + \sum_{t=1}^T a_t\big(\|w_t\|^2 - \|u\|^2\big).$$
Thus, it can be used as a base algorithm $A$ supplied to Algorithm 1 when the bound $G \ge \max_t\|g_t\|$ is unknown. The proof is identical to that of Cutkosky & Mhammedi (2024), except that $a_t$ behaves differently here. Nevertheless, we summarize the procedure as Algorithm 5 and show that the attained regret guarantee still suffices for our aim.

Epigraph-based regularization is a geometric reparameterization: it suffices to design two learners, $A_w$ producing $w_t$ and $A_y$ producing $y_t$, where $(w_t, y_t) \in W := \{(w, y) : y \ge \|w\|^2\} \subset \mathbb{R}^{d+1}$. Then it suffices to design algorithms to bound:
$$\sum_{t=1}^T\langle\tilde g_t^c, w_t - u\rangle + f_t(w_t) - f_t(u) + \sum_{t=1}^T a_t\big(y_t - \|u\|^2\big) \quad (23)$$
This is a sum of two regrets for the pair $(w_t, y_t)$, subject to $y_t \ge \|w_t\|^2$. The problem can be solved using a pair of unconstrained learners $(A_{\hat w}, A_{\hat y})$ that produce $(\hat w_t, \hat y_t) \in \mathbb{R}^{d+1}$ and guarantee the regret bounds:
$$R^{A_{\hat w}}_T(u) := \sum_{t=1}^T\langle\tilde g_t^c, \hat w_t - u\rangle + f_t(\hat w_t) - f_t(u), \qquad R^{A_{\hat y}}_T(\|u\|^2) := \sum_{t=1}^T a_t\big(\hat y_t - \|u\|^2\big),$$
together with the black-box unconstrained-to-constrained conversion of Cutkosky & Orabona (2018, Theorem 3) to enforce the constraint: this involves a projection $\Pi_W: \mathbb{R}^{d+1} \to W := \{(w, y) : y \ge \|w\|^2\}$ (line 7) and a certain technical correction to the gradient feedback (lines 11-15). These procedures output $(w_t, y_t) = \Pi_W((\hat w_t, \hat y_t))$, which satisfies the constraint and guarantees $R^{A_w}_T(u) \le R^{A_{\hat w}}_T(u)$ and $R^{A_y}_T(\|u\|^2) \le R^{A_{\hat y}}_T(\|u\|^2)$. In particular, we choose Algorithm 2 as $A_w$ and Algorithm 4 of Jacobsen & Cutkosky (2022) as $A_y$. We summarize the entire procedure as Algorithm 5, and its guarantee on $R^A_T(u)$ is stated as Theorem I.3. We first present a useful definition and a helper lemma.

Definition I.1.
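The reason the epigraph substitution suffices can be checked directly: any pair learner with $y_t \ge \|w_t\|^2$, compared against the feasible point $(u, \|u\|^2)$, dominates the quadratic-regularized regret term. A minimal numeric sketch of this one-line observation (random test points, not the paper's algorithm):

```python
import random

# If y_t >= ||w_t||^2 and a_t >= 0, then comparing against (u, ||u||^2) in W gives
#   a_t * (y_t - ||u||^2) >= a_t * (||w_t||^2 - ||u||^2),
# so bounding the epigraph regret (23) bounds the quadratic-regularized regret.
random.seed(1)
for _ in range(1000):
    w = [random.gauss(0, 1) for _ in range(3)]
    u = [random.gauss(0, 1) for _ in range(3)]
    a = random.random()                      # a_t >= 0
    sq_w = sum(x * x for x in w)
    sq_u = sum(x * x for x in u)
    y = sq_w + random.random()               # any feasible y >= ||w||^2
    assert a * (y - sq_u) >= a * (sq_w - sq_u) - 1e-12
```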
For the set $W = \{(w, y) : y \ge \|w\|^2\} \subset \mathbb{R}^{d+1}$, arbitrary $(w, y) \in W$ and $(\hat w, \hat y) \in \mathbb{R}^{d+1}$, and some $h_t, \gamma > 0$:
(1) norm: $\|(w, y)\|_t = \sqrt{h_t^2\|w\|^2 + \gamma^2 y^2}$
(2) dual norm: $\|(w, y)\|_{*,t} = \sqrt{\|w\|^2/h_t^2 + y^2/\gamma^2}$
(3) distance of $(\hat w, \hat y)$ to $W$: $S_t((\hat w, \hat y)) = \inf_{(w,y)\in W}\|(w, y) - (\hat w, \hat y)\|_t$
(4) subgradient at $(\hat w, \hat y)$, where $(\bar w, \bar y)$ attains the infimum in (3):
$$\nabla S_t((\hat w, \hat y)) = \Bigg(\frac{h_t^2(\hat w - \bar w)}{\sqrt{h_t^2\|\hat w - \bar w\|^2 + \gamma^2(\hat y - \bar y)^2}},\ \frac{\gamma^2(\hat y - \bar y)}{\sqrt{h_t^2\|\hat w - \bar w\|^2 + \gamma^2(\hat y - \bar y)^2}}\Bigg)$$
(5) projection map: $\Pi^t_W((\hat w, \hat y)) = \mathrm{argmin}_{(w,y)\in W}\|(w, y) - (\hat w, \hat y)\|_t$

Lemma I.2. In the notation of Definition I.1, if $\|g_t\| \le h_t$ and $a_t \in [0, \gamma]$, and $(\delta^w_t, \delta^y_t) = \|(g_t, a_t)\|_{*,t}\,\nabla S_t((\hat w_t, \hat y_t))$, then $\|\delta^w_t\| \le \sqrt{2}\,h_t$ and $|\delta^y_t| \le \sqrt{2}\,\gamma$.

Proof. Since $\|g_t\| \le h_t$ and $a_t \in [0, \gamma]$, we have $\|(g_t, a_t)\|_{*,t} \le \sqrt{2}$. On the other hand, $\|\nabla S_t((\hat w, \hat y))\|_{*,t} = 1$, and therefore
$$\|(\delta^w_t, \delta^y_t)\|_{*,t}^2 = \frac{\|\delta^w_t\|^2}{h_t^2} + \frac{|\delta^y_t|^2}{\gamma^2} \le 2.$$
This implies both $\|\delta^w_t\|^2/h_t^2 \le 2$ and $|\delta^y_t|^2/\gamma^2 \le 2$.

Theorem I.3. Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4), and we have access to $\tilde g_t^c$ as defined in Equation (5) with $h_t$ provided by FILTER (Algorithm 3). With $\alpha = \epsilon/c$ and $\gamma = \gamma_\alpha + \gamma_\beta$ for some $\epsilon, c, \gamma_\alpha, \gamma_\beta, \tau_G, \tau_D > 0$, Algorithm 5 guarantees:
$$R^A_T(u) \le O\Big(\epsilon h_T + \|u\|h_T\sqrt{T} + \|u\|^2\big(\gamma_\alpha(k+1) + \gamma_\beta\big)\Big)$$
In addition, the produced iterates satisfy $\max_t\|w_t\| \le \epsilon 2^{2T}$.

Proof. Recall that $\hat w_t, \hat y_t$ denote the outputs of the unconstrained learners and $(w_t, y_t)$ their projection onto $W$ in Algorithm 5. We begin the analysis from Equation (23):
$$\sum_{t=1}^T\langle\tilde g_t^c, w_t - u\rangle + f_t(w_t) - f_t(u) + \sum_{t=1}^T a_t\big(y_t - \|u\|^2\big)$$
By Cutkosky & Orabona (2018, Theorem 3),
$$\le \sum_{t=1}^T\langle\tilde g_t^c + \delta^w_t, \hat w_t - u\rangle + f_t(w_t) - f_t(u) + \sum_{t=1}^T\big(a_t + \delta^y_t\big)\big(\hat y_t - \|u\|^2\big)$$
Notice $\|\hat w_t\| \ge \|w_t\|$, thus $f_t(w_t) \le f_t(\hat w_t)$:
$$\le \sum_{t=1}^T\langle\tilde g_t^c + \delta^w_t, \hat w_t - u\rangle + f_t(\hat w_t) - f_t(u) + \sum_{t=1}^T\big(a_t + \delta^y_t\big)\big(\hat y_t - \|u\|^2\big) \quad (24)$$
Since $a_t = \alpha_t + \beta_t \le \gamma_\alpha + \gamma_\beta = \gamma$, Lemma I.2 gives $\|\tilde g_t^c + \delta^w_t\| \le h_t + \sqrt{2}\,h_t \le 3h_t$ and $|a_t + \delta^y_t| \le \gamma + \sqrt{2}\,\gamma \le 3\gamma$.
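Lemma I.2 only uses the norms, not the actual projection, so it can be exercised numerically at arbitrary points. The sketch below (random test points; $(\bar w, \bar y)$ here is any point playing the role of the minimizer, since the closed form in Definition I.1 (4) is self-normalizing) checks $\|(g, a)\|_{*,t} \le \sqrt{2}$ and the resulting bounds on $(\delta^w, \delta^y)$:

```python
import math
import random

# Numeric check of Lemma I.2, using the closed forms from Definition I.1.
random.seed(2)
for _ in range(1000):
    d = 4
    h, gam = random.uniform(0.1, 5), random.uniform(0.1, 5)
    g = [random.gauss(0, 1) for _ in range(d)]
    ng = math.sqrt(sum(x * x for x in g))
    g = [x * h * random.random() / ng for x in g]   # enforce ||g|| <= h
    a = gam * random.random()                       # enforce a in [0, gamma]
    dw = [random.gauss(0, 1) for _ in range(d)]     # w_hat - w_bar
    dy = random.gauss(0, 1)                         # y_hat - y_bar
    norm_t = math.sqrt(h**2 * sum(x * x for x in dw) + gam**2 * dy**2)
    grad_w = [h**2 * x / norm_t for x in dw]        # nabla S_t, w-component
    grad_y = gam**2 * dy / norm_t                   # nabla S_t, y-component
    # ||nabla S_t||_{*,t} = 1 by construction:
    assert abs(math.sqrt(sum(x*x for x in grad_w)/h**2 + grad_y**2/gam**2) - 1) < 1e-9
    scal = math.sqrt(sum(x * x for x in g) / h**2 + a**2 / gam**2)
    assert scal <= math.sqrt(2) + 1e-12             # ||(g, a)||_{*,t} <= sqrt(2)
    delta_w = [scal * x for x in grad_w]
    delta_y = scal * grad_y
    assert math.sqrt(sum(x * x for x in delta_w)) <= math.sqrt(2) * h + 1e-9
    assert abs(delta_y) <= math.sqrt(2) * gam + 1e-9
```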
By Theorem C.2, $R^{A_{\hat w}}_T(u) \le O(\epsilon h_T + \|u\|h_T\sqrt{T})$, and Theorem 10 of Cutkosky & Mhammedi (2024) shows
$$R^{A_{\hat y}}_T(\|u\|^2) \le O\Big(\|u\|^2\sqrt{\gamma^2 + \gamma\textstyle\sum_t a_t}\Big).$$
Thus, we can bound Equation (24) by combining both bounds:
$$\le O\Big(\epsilon h_T + \|u\|h_T\sqrt{T} + \|u\|^2\sqrt{\gamma^2 + \gamma\textstyle\sum_t a_t}\Big)$$
Since $\sum_t a_t = \sum_t\alpha_t + \sum_t\beta_t$, the same computation as in Lemma H.2 gives $\sum_t\alpha_t \le O(\gamma_\alpha(k+1))$ and $\sum_t\beta_t \le O(\gamma_\beta)$:
$$\le O\Big(\epsilon h_T + \|u\|h_T\sqrt{T} + \|u\|^2\sqrt{\gamma^2 + \gamma\big(\gamma_\alpha(k+1) + \gamma_\beta\big)}\Big)$$
and since $\gamma = \gamma_\alpha + \gamma_\beta \le \gamma_\alpha(k+1) + \gamma_\beta$:
$$\le O\Big(\epsilon h_T + \|u\|h_T\sqrt{T} + \|u\|^2\big(\gamma_\alpha(k+1) + \gamma_\beta\big)\Big)$$

Algorithm 5 Robust Online Learning By Exploiting In-Time Offset
1: Input: time horizon $T$, FILTER (Algorithm 3), TRACKER (Algorithm 4), an algorithm $A_y$ with the optimal rate in the parameter-free literature (e.g., Jacobsen & Cutkosky (2022), Algorithm 4), corruption parameter $k$, base-algorithm parameter $\epsilon$, regularization parameters $c, \alpha, \gamma_\alpha, \gamma_\beta, \gamma$.
2: Initialize: Algorithm 2 as $A_w$ with $\epsilon$. Initialize $A_y$ with $\epsilon$. Initialize FILTER with $\tau_G$ (outputs $h_t$, a conservative lower-bound guess for $G$). Initialize TRACKER with $\tau_D$ (outputs $z_t$, a conservative lower-bound guess for $\max_t\|w_t\|$).
3: for $t = 1$ to $T$ do
4: Receive $\hat w_t$ from $A_w$; receive $\hat y_t$ from $A_y$.
5: Compute the operators in Definition I.1 with $h_t, \gamma$.
6: # Explicit projection of $(\hat w_t, \hat y_t)$ through the projection map $\Pi^t_W$ of Definition I.1.
7: Compute the projection: $(w_t, y_t) = \Pi^t_W((\hat w_t, \hat y_t))$.
8: Play $w_t$; receive $\tilde g_t^c, h_{t+1}$ from FILTER. Send $w_t$ to TRACKER and receive $z_{t+1}$.
9: Compute $\alpha_t, \beta_t$ as defined in Equations (11), (12).
10: Compute the quadratic-regularizer weight: $a_t = \alpha_t + \beta_t$.
11: # Compute the gradient-correction direction $(\delta^w_t, \delta^y_t)$ with $\|\cdot\|_{*,t}$ and $S_t$ as in Definition I.1.
12: Compute: $(\delta^w_t, \delta^y_t) = \|(\tilde g_t^c, a_t)\|_{*,t}\,\nabla S_t((\hat w_t, \hat y_t))$.
13: # Send corrected gradients:
14: Send $\big(\tfrac12(\tilde g_t^c + \delta^w_t),\ \tfrac32 h_{t+1}\big)$ to $A_w$.
15: Send $\tfrac12(a_t + \delta^y_t)$ and $\tfrac32\gamma$ to $A_y$.
16: end for

J. Regret Analysis with Unknown G

We provide the regret guarantee of Algorithm 1 when using an instance of Algorithm 5 as the base learner $A$.

Theorem J.1 (Restated Theorem 6.1).
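The projection in line 7 of Algorithm 5 can be computed by one-dimensional root finding. The sketch below is a hypothetical implementation (not the paper's code), assuming the weighted norm of Definition I.1: infeasible points project to the boundary $y = \|w\|^2$, where stationarity forces $w = s\,\hat w$ for a scalar $s \in (0, 1]$ found by bisection on a cubic:

```python
import math

def project_epigraph(w_hat, y_hat, h, gam, tol=1e-12):
    """Weighted projection onto W = {(w, y): y >= ||w||^2} under
    ||(w, y)||_t^2 = h^2 ||w||^2 + gam^2 y^2 (sketch of line 7, Algorithm 5)."""
    r2 = sum(x * x for x in w_hat)
    if y_hat >= r2:                       # already feasible
        return list(w_hat), y_hat
    # On the boundary, minimizing h^2||w - w_hat||^2 + gam^2(||w||^2 - y_hat)^2
    # at w = s*w_hat gives the stationarity condition
    #   phi(s) = s*(h^2 + 2*gam^2*(s^2*r2 - y_hat)) - h^2 = 0,
    # with phi(0) = -h^2 < 0 and phi(1) = 2*gam^2*(r2 - y_hat) > 0 here.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        s = (lo + hi) / 2
        if s * (h**2 + 2 * gam**2 * (s**2 * r2 - y_hat)) - h**2 < 0:
            lo = s
        else:
            hi = s
    s = (lo + hi) / 2
    w = [s * x for x in w_hat]
    return w, sum(x * x for x in w)       # boundary point: y = ||w||^2

# For w_hat = (3, 4), y_hat = 1, h = 1, gam = 0.5, the cubic root is s = 0.4.
w, y = project_epigraph([3.0, 4.0], 1.0, h=1.0, gam=0.5)
assert abs(y - sum(x * x for x in w)) < 1e-8
# The projection beats arbitrary feasible candidates such as (0, 0):
obj = sum((a - b)**2 for a, b in zip(w, [3.0, 4.0])) + 0.25 * (y - 1.0)**2
assert obj <= sum(x * x for x in [3.0, 4.0]) + 0.25 * 1.0
```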
Suppose $g_t, \tilde g_t$ satisfy the assumptions in Equations (3) and (4). Set $r_t(w) = f_t(w) + a_t\|w\|^2$, where $f_t$ is defined in Equation (7) with parameters $\alpha = \epsilon\tau_G/c$, $p = \ln T$, $\gamma = \gamma_\alpha + \gamma_\beta$, for some $\epsilon, c, \gamma_\alpha, \gamma_\beta, \tau_G, \tau_D > 0$. Then there exists a base algorithm $A$ such that Algorithm 1 guarantees:
$$R_T(u) \le O\bigg(8\epsilon\max(\tau_G, G) + \|u\|\max(\tau_G, G)\big(\sqrt{T} + k\big) + c\|u\| + \big(\gamma_\alpha(k+1) + \gamma_\beta\big)\|u\|^2 + \frac{16k^2\max(\tau_G, G)^2}{\gamma_\beta}\ln\frac{64k^2\max(\tau_G, G)^2}{\gamma_\beta c\tau_D} + \frac{c\tau_D}{2} + 2k\tau_D\max(\tau_G, G) + \frac{64(k+1)\max(\tau_G, G)^2}{\gamma_\alpha}\max\Big(\log_2\tfrac{8G}{\tau_G},\,1\Big)\bigg)$$

Proof. Recalling the definition of $R_T(u)$, the proof is completed by combining Lemma H.2 and Theorem I.3.

K. Fenchel Conjugate

Here we collect basic properties of the Fenchel conjugate (see references such as Bertsekas (2009) and Orabona (2019)) and previously established lemmas used in Appendix E, for completeness.

Definition K.1. Let $f: \mathbb{R}^d \to [-\infty, \infty]$. The Fenchel conjugate $f^*$ is defined as
$$f^*(\theta) = \sup_{x\in\mathbb{R}^d}\,\langle\theta, x\rangle - f(x),$$
and the double conjugate $f^{**}$ is defined as
$$f^{**}(x) = \sup_{\theta\in\mathbb{R}^d}\,\langle\theta, x\rangle - f^*(\theta).$$

Theorem K.2. Let $f: \mathbb{R}^d \to (-\infty, \infty]$.
1. $f(x) = f^{**}(x)$ for all $x \in \mathbb{R}^d$ iff $f$ is convex and lower semicontinuous.
2. $\langle\theta, x\rangle - f(x)$ achieves its supremum in $x$ at $x = x^\star$ iff $x^\star \in \partial f^*(\theta)$.

Lemma K.3 (Theorem A.32 of Orabona (2019)). The Lambert function $W: \mathbb{R}_+ \to \mathbb{R}_+$ is defined by $x = W(x)\exp(W(x))$ for $x > 0$, and satisfies $W(x) > 0.6\ln(1 + x)$ for $x > 0$.

Lemma K.4 (Theorem A.3 of Orabona (2019)). Let $a, b > 0$ and $f(x) = b\exp(x^2/(2a))$. Then the Fenchel conjugate is
$$f^*(\theta) = |\theta|\sqrt{a\,W(a\theta^2/b^2)} - b\exp\big(W(a\theta^2/b^2)/2\big),$$
where $W(\cdot)$ is the Lambert function.

Proof. The proof is based on rearranging the condition $x \ge 3/2$; given $x > 0$, dividing both sides by $x$ and rearranging completes the proof.
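Both Lambert-function facts in Appendix K can be checked numerically. The sketch below implements the principal branch $W$ by Newton iteration (a self-contained stand-in for a library routine such as `scipy.special.lambertw`), verifies the lower bound of Lemma K.3 at a few points, and compares the closed-form conjugate of Lemma K.4 against a brute-force supremum:

```python
import math

def lambert_w(x, iters=60):
    """Newton iteration for the principal branch W(x), x > 0: solves w*e^w = x."""
    w = math.log(1 + x)                  # initial guess, >= W(x) on x > 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (1 + w))
    return w

# Lemma K.3: W(x) > 0.6 * ln(1 + x) for x > 0.
for x in [1e-3, 0.1, 1.0, 10.0, 1e3]:
    assert lambert_w(x) > 0.6 * math.log(1 + x)

# Lemma K.4 (numeric): for f(x) = b*exp(x^2/(2a)), the conjugate
# f*(theta) = sup_x theta*x - f(x) matches the Lambert-function closed form.
a, b, theta = 2.0, 1.0, 3.0
W = lambert_w(a * theta**2 / b**2)
closed_form = abs(theta) * math.sqrt(a * W) - b * math.exp(W / 2)
grid_sup = max(theta * (i * 1e-4) - b * math.exp((i * 1e-4) ** 2 / (2 * a))
               for i in range(100_000))
assert abs(grid_sup - closed_form) < 1e-4
```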