# Discounted Adaptive Online Learning: Towards Better Regularization

Zhiyu Zhang, David Bombara, Heng Yang (Harvard University). Correspondence to: Zhiyu Zhang, David Bombara, Heng Yang.

Future versions available at https://arxiv.org/abs/2402.02720. Link to the code: https://github.com/ComputationalRobotics/discounted-adaptive.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. We study online learning in adversarial nonstationary environments. Since the future can be very different from the past, a critical challenge is to gracefully forget the history while new data comes in. To formalize this intuition, we revisit the discounted regret in online convex optimization, and propose an adaptive (i.e., instance optimal), FTRL-based algorithm that improves the widespread non-adaptive baseline, gradient descent with a constant learning rate. From a practical perspective, this refines the classical idea of regularization in lifelong learning: we show that designing better regularizers can be guided by the principled theory of adaptive online optimization. Complementing this result, we also consider the (Gibbs & Candes, 2021)-style online conformal prediction problem, where the goal is to sequentially predict the uncertainty sets of a black-box machine learning model. We show that the FTRL nature of our algorithm can simplify the conventional gradient-descent-based analysis, leading to instance-dependent performance guarantees.

1. Introduction

Online learning can be broadly defined as a sequential decision making problem, where each decision leverages the knowledge learned from previous observations. However, while forgetting is often thought of as the opposite of learning, the two concepts are actually coherent due to the distribution shifts encountered in practice. Think about deploying a drone in the wild: a common subtask is to learn its time-varying dynamics model on the fly, but when doing that, we have to exclude the obsolete data that possibly contradicts the current environment. This leads to a natural challenge for the resulting online learning problem: how do we handle the possible shortage of data after forgetting?

One potential solution is to inject suitable inductive bias into the algorithm, which is among the most important ideas in machine learning. Specifically, the inductive bias refers to our prior belief about the ground truth, before observing the data of interest. Regarding the drone example, physics provides general principles on the evolution of nature, while modern foundation models can encode diverse world knowledge from large-scale pre-training. The point is that even though forgetting reduces the amount of online data, one could still exploit such inductive bias to improve the learning performance.

Despite this natural intuition, algorithmically achieving it remains a nontrivial task. In particular, the associated online learning algorithm has two considerations to trade off (i.e., the inductive bias and the online data), and optimally balancing them requires going beyond simple heuristics. The present work studies this problem from a theoretical perspective, based on a discounted variant of Online Convex Optimization (OCO; Zinkevich 2003; Cesa-Bianchi & Lugosi 2006).
Our results emphasize the importance of regularization in nonstationary online learning: on an algorithm that gradually forgets the online data, one could use regularizers to inject the inductive bias, which the algorithm never forgets. Furthermore, designing better optimizers can be guided by the theory of adaptive OCO.

1.1. Contribution

This paper presents new results on two related topics: (Section 2) nonstationary online learning, and (Section 3) Online Conformal Prediction (OCP; Gibbs & Candes, 2021).

First, we consider the discounted OCO problem, formally introduced in Section 2. It is known that under standard assumptions on the complexity of the problem, the minimax optimal discounted regret bound can be achieved by an extremely simple and widespread baseline: Online Gradient Descent with a constant (i.e., non-annealing and horizon-independent) learning rate, denoted as constant-LR OGD. In this paper, we propose an adaptive algorithm based on Follow the Regularized Leader (FTRL; Abernethy et al., 2008), such that the minimax optimality of the OGD baseline is improved to instance optimality, and all assumptions beyond convexity are removed. More concretely, this is achieved by combining an undiscounted adaptive OCO algorithm (Zhang et al., 2024) with a simple rescaling trick: the latter can convert scale-free (Orabona & Pál, 2018) upper bounds on the standard undiscounted regret into bounds on the discounted regret, which could be of independent interest.

In practice, compared to existing nonstationary OCO algorithms that minimize the dynamic regret or the strongly adaptive regret, our algorithm eschews the standard bilevel aggregation procedure (Hazan & Seshadhri, 2009; Daniely et al., 2015; Zhang et al., 2018a), and is thus computationally more efficient. Compared to the default approach in continual / lifelong learning based on constant-LR OGD (Abel et al., 2023), our algorithm uses a nontrivial data-dependent regularizer to adaptively exploit the available inductive bias. Furthermore, the design of this regularizer follows from the principled theory of adaptive OCO, rather than heuristics.

Next, we consider OCP, a downstream online learning task with set-membership predictions. To combat the distribution shifts commonly found in practice, recent works (Gibbs & Candes, 2021; Gibbs & Candès, 2022; Bhatnagar et al., 2023) applied nonstationary OCO algorithms (such as constant-LR OGD) to this setting. The twist is that besides appropriate regret bounds, one has to establish coverage guarantees as well. In this setting, our discounted OCO algorithm leads to strong instance-dependent guarantees, e.g., adaptivity to the targeted coverage level. Notably, since our algorithm is built on the FTRL framework rather than gradient descent, the associated coverage guarantee follows directly from the stability of its iterates. This exemplifies the analytical strength of FTRL in OCP (over gradient descent), which, to our knowledge, has not been demonstrated in the literature.

Finally, we complement the above theoretical results with OCP experiments (Section 4).

1.2. Related work

This paper explores the connection between two separate topics in online learning: discounting and adaptivity.
Discounting. Motivated by the intuition that the recent history is more important than the distant past, the discounted regret has been studied by a series of works on nonstationary online learning (Cesa-Bianchi & Lugosi, 2006; Freund & Hsu, 2008; Chernov & Zhdanov, 2010; Kapralov & Panigrahy, 2011; Cesa-Bianchi et al., 2012; Brown & Sandholm, 2019). Most of them do not consider adaptivity, although the under-appreciated (Kapralov & Panigrahy, 2011) presents important early ideas. Recently, the discounted regret seems to have lost its theoretical popularity to the dynamic regret and the strongly adaptive regret, which we survey in Appendix A. However, due to its computational efficiency, the idea of discounting is still prevalent in practice, as exemplified by the success of the ADAM optimizer (Kingma & Ba, 2014). Concurrent to this work, Ahn et al. (2024) presented a conversion from the discounted regret to the dynamic regret. Independently, Jacobsen & Cutkosky (2024) studied the dynamic and strongly adaptive regret of discounted algorithms, in the context of online linear regression.

Adaptivity. Focusing on stationary environments, adaptive OCO concerns going beyond the conventional worst case guarantees (Orabona, 2023a, Chapters 4 and 9). More than a decade of research effort has culminated in a series of OCO algorithms that do not rely on any extra assumption beyond convexity (Cutkosky, 2019; Mhammedi & Koolen, 2020; Chen et al., 2021; Jacobsen & Cutkosky, 2022; Zhang et al., 2024; Cutkosky & Mhammedi, 2024), which this paper builds on. Different from non-adaptive algorithms like OGD, such adaptivity crucially relies on various sophisticated forms of regularization (McMahan & Orabona, 2014; Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Zhang et al., 2022). This naturally resonates with the crucial role of regularization in forgetful continual / lifelong learning settings (De Lange et al., 2021), but to our knowledge, a quantitative connection has not been established in the literature. By studying adaptivity on the discounted regret, we aim to fill this gap.

Regarding the OCP problem (the second half of this paper), related works are discussed in Section 3.

1.3. Notation

Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm. $\Pi_{\mathcal{X}}(x)$ is the Euclidean projection of $x$ onto a closed convex set $\mathcal{X}$. The diameter of a set $\mathcal{X}$ is $\sup_{x,y\in\mathcal{X}}\|x-y\|$. For two integers $a\le b$, $[a:b]$ is the set of all integers $c$ such that $a\le c\le b$. The brackets are removed when used in a subscript, denoting a tuple with indices in $[a:b]$. If $a > b$, then the product $\prod_{i=a}^{b}\lambda_i := 1$. $\log$ means the natural logarithm. $0$ is a zero vector whose dimension depends on the context. We define the imaginary error function as $\mathrm{erfi}(x) := \int_0^x \exp(u^2)\,du$; this is scaled by $\sqrt{\pi}/2$ from the conventional definition, and can thus also be queried from standard software packages like SciPy and JAX.

2. Discounted Adaptivity

Setting. The first half of this paper is about discounted Online Convex Optimization (OCO), a two-person repeated game between a player we control and an adversarial environment. Different from the standard OCO problem (Zinkevich, 2003), there is also an expert that sequentially selects the discount factors for the player. Specifically, in each round we consider the following interaction protocol.

1. We (the player) make a prediction $x_t \in \mathcal{X}$ using past observations, where $\mathcal{X} \subseteq \mathbb{R}^d$ is closed and convex.
2. The environment picks a convex loss function $l_t : \mathcal{X} \to \mathbb{R}$, and reveals a subgradient $g_t \in \partial l_t(x_t)$ to the player.
3. The expert picks a discount factor $\lambda_{t-1} \in (0, \infty)$ and reveals it to the player. (The one-round delay in the index is due to our regret definition: the loss function $l_t$ is undiscounted in the $t$-th round.)
4. The environment can choose to terminate the game. If so, let $T$ be the total number of rounds.

At the end of the game, the environment can also choose any fixed prediction $u \in \mathcal{X}$, called a comparator. Without knowing the environment and the expert beforehand, our (the player's) goal is to guarantee low discounted regret,

$$R^{\lambda_{1:T}}_T(l_{1:T}, u) := \sum_{t=1}^{T}\left(\prod_{i=t}^{T-1}\lambda_i\right)\left[l_t(x_t) - l_t(u)\right]. \tag{1}$$

We say an algorithm is minimax or non-adaptive if, given an uncertainty set $S$, it upper-bounds the worst case regret $\sup_{(l_{1:T}, u)\in S} R^{\lambda_{1:T}}_T(l_{1:T}, u)$. This paper aims to design adaptive algorithms that directly bound $R^{\lambda_{1:T}}_T(l_{1:T}, u)$ by a function of the problem instance (i.e., both the losses $l_{1:T}$ and the comparator $u$).

Discounting vs forgetting. To motivate the above discounted setting, suppose $\lambda_t \equiv 1$. Then, Eq.(1) recovers the standard undiscounted regret in OCO, from which we can further upper-bound the total loss of the player, $\sum_{t=1}^T l_t(x_t)$. However, the effectiveness of this follow-up argument relies on the existence of a comparator $u$ with low total loss $\sum_{t=1}^T l_t(u)$. We call such an environment "stationary". This paper concerns the nonstationary environments violating the previous argument. Here, the expert provides discount factors $\lambda_{1:T}$ to the player as side information, suggesting the usefulness of each observation for future predictions. For example, with $\lambda_t \equiv \lambda < 1$, the weight of past losses decays quickly in Eq.(1), therefore the corresponding algorithm is motivated to "forget the past", matching the intuition developed in Section 1. A more extreme case is when $\lambda_t$ takes values in $\{0, 1\}$: then the problem reduces to a restarting variant of standard OCO, where each restart is triggered by $\lambda_t = 0$.

Overall, this paper treats the discount factors $\lambda_{1:T}$ as part of the problem description, and focuses on establishing tight upper bounds on Eq.(1). Choosing $\lambda_{1:T}$ online is an important issue, which we defer to future work.

Inductive bias. In the typical application of lifelong learning, one would use the output $x_t$ of an online learning algorithm as the parameter of the underlying machine learning model; therefore we consider the inductive bias as a fixed prediction $x \in \mathcal{X}$ known at the beginning of the game (possibly obtained from pre-training). Intuitively, simply predicting $x_t = x$ would work "decently well", and by further adjusting it during the game, the goal is to correct its time-varying imperfection using the sequentially revealed online data. Consistent with the arguments from Section 1, an adaptive algorithm that optimally exploits $x$ would perform better than algorithms that do not. Following this intuition, we consider the initialization $x_1 = x$ in the discounted OCO game. Without loss of generality, we assume $x = 0$ throughout this section, since a different $x$ can be implemented by simply shifting the coordinates.
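Before moving on, it may help to see the performance metric in code. The following is a minimal Python sketch (our own illustration, not part of the paper's algorithms) of the discounted regret in Eq.(1); the loss values and discount sequence below are placeholders.

```python
import numpy as np

def discounted_regret(player_losses, comparator_losses, discounts):
    """Discounted regret, Eq.(1): sum_t (prod_{i=t}^{T-1} lambda_i) * [l_t(x_t) - l_t(u)].

    `player_losses[t]` is l_{t+1}(x_{t+1}), `comparator_losses[t]` is l_{t+1}(u), and
    `discounts[i]` plays the role of lambda_{i+1}; only lambda_1, ..., lambda_{T-1}
    enter the weights, so the loss of the last round is undiscounted.
    """
    T = len(player_losses)
    regret = 0.0
    for t in range(T):
        weight = np.prod(discounts[t:T - 1])  # empty product in the last round equals 1
        regret += weight * (player_losses[t] - comparator_losses[t])
    return regret

# Example: constant discount lambda = 0.99 over T = 5 rounds.
print(discounted_regret([1.0, 0.8, 0.6, 0.5, 0.4],
                        [0.9, 0.7, 0.7, 0.6, 0.3],
                        [0.99] * 4))
```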
2.1. Preliminary

We begin by introducing the widespread non-adaptive baseline, constant-LR OGD. To this end, notice that the main difference between the discounted regret Eq.(1) and the well-studied undiscounted regret ($\lambda_t \equiv 1$) is the effective time horizon. Instead of the maximum length $T$, in the $t$-th round Eq.(1) concerns an exponentially weighted look-back window of length

$$H_t := \sum_{i=1}^{t}\prod_{j=i}^{t-1}\lambda_j^2, \tag{2}$$

which is roughly $\min[(1-\lambda^2)^{-1}, T]$ in the special case of $\lambda_t \equiv \lambda < 1$. For later use, we also define the discounted gradient variance $V_t$ and the discounted Lipschitz constant $G_t$,

$$V_t := \sum_{i=1}^{t}\left(\prod_{j=i}^{t-1}\lambda_j^2\right)\|g_i\|^2; \qquad G_t := \max_{i\in[1:t]}\left(\prod_{j=i}^{t-1}\lambda_j\right)\|g_i\|. \tag{3}$$

A classical wisdom in online learning is that the learning rates of OGD should be inversely proportional to the square root of the time horizon (Orabona, 2023a, Chapter 2). Combining it with the effective time horizon discussed above, the following result is folklore.

Online Gradient Descent. Consider the OGD update rule: after observing the loss gradient $g_t$, we pick a learning rate $\eta_t$, take a gradient step and project the update back onto the domain $\mathcal{X}$, i.e., $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t g_t)$.

Theorem 1 (Abridged Theorem 6 and 7). If the loss functions are all $G$-Lipschitz, the diameter of the domain is at most $D$, and the discount factor $\lambda_t = \lambda \in (0, 1)$, then OGD with a constant learning rate $\eta_t = DG^{-1}\sqrt{1-\lambda^2}$ guarantees for all $T = \Omega(\frac{1}{1-\lambda})$,
$$\sup_{l_{1:T},u} R^{\lambda}_T(l_{1:T}, u) = O\big(DG\sqrt{H_T}\big).$$
Conversely, fix any variance budget $V \in (0, G^2 H_T]$ and any comparator $u$ such that $u, -u \in \mathcal{X}$. For any algorithm, there exists a loss sequence such that $V_T$ defined in Eq.(3) satisfies $V_T = V$, and
$$\max\left[R^{\lambda_{1:T}}_T(l_{1:T}, u),\; R^{\lambda_{1:T}}_T(l_{1:T}, -u)\right] = \Omega\big(\|u\|\sqrt{V_T}\big).$$

Picking the domain $\mathcal{X}$ as a norm ball centered at the origin, one could see that the worst case bound of constant-LR OGD is minimax optimal under the Lipschitzness and bounded-domain assumptions, which forms the theoretical foundation of this common practice. Nonetheless, there is an instance-dependent gap that illustrates a natural direction to improve the algorithm: removing the supremum, and directly aiming for

$$R^{\lambda_{1:T}}_T(l_{1:T}, u) = O\big(\|u\|\sqrt{V_T}\big). \tag{4}$$

Such a bound is never worse than the $O(DG\sqrt{H_T})$ bound of constant-LR OGD (under the same assumptions), while the associated algorithm can be agnostic to both $D$ and $G$. This is the key strength of adaptivity: without imposing any artificial structural assumption, the algorithm performs as if it knows the correct assumption from the beginning.

2.2. Main result

To pursue Eq.(4), our main idea is a simple rescaling trick. In the undiscounted setting ($\lambda_t \equiv 1$), Eq.(4) has been almost achieved by several recent works (Cutkosky, 2019; Mhammedi & Koolen, 2020; Jacobsen & Cutkosky, 2022; Zhang et al., 2023), modulo necessary residual factors. For any such algorithm $\mathcal{A}$, let $R_{T,\mathcal{A}}(g_{1:T}, u)$ denote its undiscounted regret with respect to linear losses $\langle g_t, \cdot\rangle$ and comparator $u$, i.e., Eq.(1) with $\lambda_t \equiv 1$.

Next, consider the discounted regret with general $\lambda_{1:T}$. We take an aforementioned algorithm $\mathcal{A}$ and apply it to a sequence of surrogate loss gradients $\hat g_{1:T}$, where

$$\hat g_t := \left(\prod_{i=1}^{t-1}\lambda_i\right)^{-1} g_t. \tag{5}$$

If $\lambda_1, \ldots, \lambda_T < 1$, this amounts to upweighting recent losses, or equivalently, forgetting older ones. The generated prediction sequence $x_{1:T}$ satisfies a discounted regret bound that scales with $R_{T,\mathcal{A}}(\hat g_{1:T}, u)$, the undiscounted regret of the base algorithm $\mathcal{A}$ on $\hat g_{1:T}$:

$$R^{\lambda_{1:T}}_T(l_{1:T}, u) \le \sum_{t=1}^{T}\left(\prod_{i=t}^{T-1}\lambda_i\right)\langle g_t, x_t - u\rangle = \left(\prod_{i=1}^{T-1}\lambda_i\right) \sum_{t=1}^{T}\langle \hat g_t, x_t - u\rangle \le \left(\prod_{i=1}^{T-1}\lambda_i\right) R_{T,\mathcal{A}}(\hat g_{1:T}, u). \tag{6}$$

Within this trick, a particular challenge is that even if all the actual loss functions $l_t$ are Lipschitz in a time-uniform manner (i.e., there exists $G$ such that $\max_t \|g_t\| \le G$), the surrogate loss gradients $\hat g_t$ are not. Therefore, the base algorithm $\mathcal{A}$ cannot rely on any a priori knowledge or estimate of the time-uniform Lipschitz constant, similar to the scale-free property (Orabona & Pál, 2018) in adaptive online learning. (A scale-free algorithm generates the same predictions $x_{1:T}$ if all the gradients $g_{1:T}$ are scaled by an arbitrary $c > 0$. However, we need a bit more, since an algorithm can be scale-free even if it requires an estimate of the time-uniform Lipschitz constant at the beginning; Mhammedi & Koolen, 2020; Jacobsen & Cutkosky, 2022.)
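As an illustration of the reduction in Eq.(5) and Eq.(6), the following sketch wraps an arbitrary undiscounted base learner by feeding it the rescaled surrogate gradients. The `base_algorithm` interface (`predict()` / `update(g)`) is our own simplification of whatever scale-free algorithm one plugs in, and the literal running product below is a conceptual sketch rather than a numerically robust implementation (the product of $1/\lambda_i$ grows exponentially; the closed-form rule of Theorem 2 below avoids this issue by rescaling internal statistics instead).

```python
class DiscountedToUndiscounted:
    """Reduce discounted regret to undiscounted regret via the rescaling trick.

    Feeds the scale-free base algorithm A the surrogate gradients of Eq.(5),
    g_hat_t = (prod_{i=1}^{t-1} lambda_i)^{-1} g_t, so that A's undiscounted regret
    on g_hat_{1:T} upper-bounds the discounted regret as in Eq.(6).
    """

    def __init__(self, base_algorithm):
        self.base = base_algorithm   # any scale-free learner exposing predict() / update(g)
        self.inv_prod = 1.0          # running value of (prod_{i=1}^{t-1} lambda_i)^{-1}

    def predict(self):
        return self.base.predict()

    def update(self, grad, discount):
        # `discount` is lambda_{t-1}, revealed together with g_t (pass 1.0 in round 1).
        self.inv_prod /= discount
        self.base.update(self.inv_prod * grad)
        # Caution: inv_prod grows exponentially when lambda < 1; a practical
        # implementation rescales the base algorithm's internal statistics instead.
```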
To make this concrete, we first forego the adaptivity to the comparator $u$ and analyze an example based on the famous ADAGRAD (Duchi et al., 2011). The obtained algorithm will be a building block of our main results.

Gradient adaptive OGD. Proposed for the undiscounted setting, ADAGRAD (Duchi et al., 2011) represents OGD with gradient-dependent learning rates. Using it as the base algorithm $\mathcal{A}$ leads to the following RMSPROP-like prediction rule and its discounted regret bound.

Theorem 2. If the diameter of $\mathcal{X}$ is at most $D$, then OGD with learning rate $\eta_t = D V_t^{-1/2}$ guarantees for all $T \in \mathbb{N}_+$ and loss sequence $l_{1:T}$,
$$\sup_u R^{\lambda_{1:T}}_T(l_{1:T}, u) \le 3D\sqrt{V_T}.$$

As one would hope for, the bound strictly improves constant-LR OGD while matching the lower bound in its $V_T$ dependence. It is tempting to seek an even better learning rate $\eta_t$ that improves the remaining $D$ to $\|u\|$, but such a direction leads to a dead end (Remark B.1).
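In code, the prediction rule of Theorem 2 amounts to projected gradient descent whose learning rate is governed by the discounted gradient variance $V_t$ of Eq.(3), maintained recursively. The sketch below is our own rendering for the special case of a Euclidean-ball domain (so that the diameter is twice the radius); it illustrates the update rather than reproducing the paper's implementation.

```python
import numpy as np

class DiscountedAdaGradOGD:
    """OGD with learning rate eta_t = D / sqrt(V_t), where V_t is the discounted
    gradient variance of Eq.(3), maintained as V_t = lambda_{t-1}^2 * V_{t-1} + ||g_t||^2."""

    def __init__(self, dim, radius):
        self.x = np.zeros(dim)   # current prediction x_t (initialized at the inductive bias 0)
        self.radius = radius     # domain = Euclidean ball of this radius, diameter D = 2 * radius
        self.V = 0.0             # discounted gradient variance V_t

    def predict(self):
        return self.x

    def update(self, grad, discount):
        # `discount` is lambda_{t-1}, revealed together with the gradient g_t.
        self.V = discount ** 2 * self.V + float(np.dot(grad, grad))
        if self.V == 0.0:
            return                                        # zero gradients so far: no update needed
        eta = 2.0 * self.radius / np.sqrt(self.V)          # eta_t = D * V_t^{-1/2}
        x_new = self.x - eta * grad
        norm = np.linalg.norm(x_new)
        if norm > self.radius:                             # project back onto the ball
            x_new *= self.radius / norm
        self.x = x_new
```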
To solve this problem, we will resort to the Follow the Regularized Leader (FTRL) framework (Orabona, 2023a, Chapter 7) instead of OGD. The key intuition (Fang et al., 2022; Jacobsen & Cutkosky, 2022) is that FTRL is stronger since it memorizes the initialization, whereas OGD without extra regularization does not. This is particularly important in the discounted setting: FTRL gradually forgets the past online data but not the inductive bias $x$, whereas OGD forgets everything altogether.

Simultaneous adaptivity. To achieve simultaneous adaptivity to both $l_{1:T}$ and $u$, things can get subtle. As mentioned earlier, there are a number of choices for the base algorithm $\mathcal{A}$. We will adopt the algorithm from (Zhang et al., 2024), surveyed in Appendix B.2, which offers an important benefit (i.e., no explicit $T$-dependence, Remark B.4). Without loss of generality, assume the domain $\mathcal{X} = \mathbb{R}^d$. (Due to Cutkosky, 2020, Theorem 2, given any unconstrained algorithm that operates on $\mathcal{X} = \mathbb{R}^d$, we can impose any closed and convex constraint without changing its regret bound.)

Overall, our discounted algorithm employs the polar-decomposition technique from (Cutkosky & Orabona, 2018): using polar coordinates, predicting $x_t \in \mathbb{R}^d$ (to chase the optimal comparator $u$) can be decomposed into two independent tasks, learning the good direction $u/\|u\|$ and the good magnitude $\|u\|$. The direction is learned by the RMSPROP-like algorithm from Theorem 2, while the magnitude is learned by a discounted variant of the erfi-potential learner (Zhang et al., 2024, Algorithm 1) that operates on the nonnegative real line $[0, \infty)$. This magnitude learner itself will be important for the OCP application, therefore we present its pseudocode as Algorithm 1.

Algorithm 1: 1D magnitude learner on $[0, \infty)$.
Require: Hyperparameter $\varepsilon > 0$ (default $\varepsilon = 1$).
1: Initialize parameters $v_1 = 0$, $s_1 = 0$, $h_1 = 0$.
2: for $t = 1, 2, \ldots$ do
3: If $h_t = 0$, define the unprojected prediction $\tilde x_t = 0$. Otherwise, define $\tilde x_t$ by the erfi-potential rule of (Zhang et al., 2024) evaluated at the discounted statistics $(s_t, v_t, h_t)$: with $\mathrm{erfi}(x) := \int_0^x \exp(u^2)\,du$ (which can be queried from SciPy and JAX), the closed form combines $\mathrm{erfi}$ evaluated at $s_t$ scaled by the regularized variance $v_t + 2h_t s_t + 16h_t^2$ with the factor $\exp\!\big(s_t^2 / (4(v_t + 2h_t s_t + 16h_t^2))\big)$.
4: Predict $x_t = \Pi_{[0,\infty)}(\tilde x_t)$, the projection of $\tilde x_t$ onto the domain $[0, \infty)$.
5: Receive the 1D loss gradient $g_t \in \mathbb{R}$ and the discount factor $\lambda_{t-1} \in (0, \infty)$.
6: Clip $g_t$ by defining $g_{t,\mathrm{clip}} = \Pi_{[-\lambda_{t-1}h_t,\, \lambda_{t-1}h_t]}(g_t)$, and update $h_{t+1} = \max(\lambda_{t-1}h_t, |g_t|)$.
7: If $g_{t,\mathrm{clip}}\,\tilde x_t < g_{t,\mathrm{clip}}\,x_t$, define a surrogate loss gradient $\bar g_{t,\mathrm{clip}} = 0$. Otherwise, $\bar g_{t,\mathrm{clip}} = g_{t,\mathrm{clip}}$.
8: Update $v_{t+1} = \lambda_{t-1}^2 v_t + \bar g_{t,\mathrm{clip}}^2$, $s_{t+1} = \lambda_{t-1} s_t - \bar g_{t,\mathrm{clip}}$.
9: end for

Algorithm 1 has the following intuition. At its center is a special instance of FTRL built on the adaptive OCO theory, which generates the prediction $\tilde x_t$ using the discounted gradient variance $v_t$, the discounted gradient sum $s_t$, and the discounted Lipschitz constant $h_t$. Complementing this core component, two additional ideas are applied to fix certain technical problems: (i) the unconstrained-to-constrained reduction from (Cutkosky & Orabona, 2018; Cutkosky, 2020), and (ii) the hint-and-clipping technique from (Cutkosky, 2019). The readers are referred to Appendix B.2 for a detailed explanation.
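For readers who prefer code, here is a minimal Python sketch of Algorithm 1's bookkeeping (clipping, hints, and the discounted statistics). The closed-form erfi-potential prediction of line 3 is abstracted behind the placeholder `erfi_potential_prediction`, which is not spelled out here and would have to be filled in from (Zhang et al., 2024) and Appendix B.2; everything else follows lines 4 to 8.

```python
import numpy as np

def erfi_potential_prediction(s, v, h, eps):
    """Placeholder for line 3 of Algorithm 1: the FTRL prediction induced by the
    erfi potential of (Zhang et al., 2024), evaluated at the discounted statistics
    (s, v, h) with confidence hyperparameter eps.  The actual closed form can be
    evaluated with scipy.special.erfi (mind the sqrt(pi)/2 scaling); not reproduced here."""
    raise NotImplementedError

class MagnitudeLearner:
    """1D magnitude learner on [0, infinity), following Algorithm 1's bookkeeping."""

    def __init__(self, eps=1.0):
        self.eps = eps
        self.v = 0.0      # discounted gradient variance v_t
        self.s = 0.0      # discounted (negative) gradient sum s_t
        self.h = 0.0      # discounted Lipschitz hint h_t
        self.x_raw = 0.0  # unprojected prediction
        self.x = 0.0      # projected prediction

    def predict(self):
        # Lines 3-4: unprojected prediction, then projection onto [0, infinity).
        self.x_raw = 0.0 if self.h == 0.0 else erfi_potential_prediction(
            self.s, self.v, self.h, self.eps)
        self.x = max(self.x_raw, 0.0)
        return self.x

    def update(self, g, discount):
        # Line 5: receive the gradient g_t and the discount factor lambda_{t-1}.
        lam = discount
        # Line 6: clip g_t to [-lam * h_t, lam * h_t] and update the hint.
        g_clip = float(np.clip(g, -lam * self.h, lam * self.h))
        h_next = max(lam * self.h, abs(g))
        # Line 7: constrained-domain surrogate gradient (Cutkosky & Orabona, 2018).
        g_sur = 0.0 if g_clip * self.x_raw < g_clip * self.x else g_clip
        # Line 8: discounted updates of the variance and the gradient sum.
        self.v = lam ** 2 * self.v + g_sur ** 2
        self.s = lam * self.s - g_sur
        self.h = h_next
```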
The following theorem is proved in Appendix B.3.

Theorem 3. Given any hyperparameter $\varepsilon > 0$, Algorithm 1 guarantees for all time horizons $T \in \mathbb{N}_+$, loss sequences $l_{1:T}$, comparators $u \in [0, \infty)$ and stability window lengths $\tau \in [1:T]$,
$$R^{\lambda_{1:T}}_T(l_{1:T}, u) \le \varepsilon\sqrt{V_T + 2G_T S + 16G_T^2} + u\,(S + G_T) + \left(\max_{t\in[T-\tau+1:T]} x_t + \Big(\prod_{t=T-\tau}^{T-1}\lambda_t\Big)\max_{t\in[1:T-\tau]} x_t\right)(S + G_T),$$
where
$$S = 8G_T\left(1+\sqrt{\log(2u\varepsilon^{-1}+1)}\right)^2 + \sqrt{V_T + 16G_T^2}\left(1+\sqrt{\log(2u\varepsilon^{-1}+1)}\right).$$

This result might seem a bit intimidating, so let us take a few steps to interpret it. In particular, we want to justify the appropriate asymptotic regime to consider ($V_T \gg G_T^2$ and $u \gg \varepsilon$), such that our main result on $\mathbb{R}^d$ (Theorem 4) can use the big-O notation to improve clarity.

First, one would typically expect $V_T \gg G_T^2$, since when $\lambda_t = \lambda \in (0,1)$ and $|g_t| = G$ for all $t$, we have $G_T = G$ and $V_T = H_T G^2 \approx (1-\lambda^2)^{-1}G^2$. As long as the discount factor $\lambda_t$ is close enough to 1, the condition $V_T \gg G_T^2$ is likely to hold for general/practical gradient sequences as well.

Second, the hyperparameter $\varepsilon$ serves as a prior guess of the comparator $u$. If the guess is correct ($\varepsilon = u$), then by assuming $V_T \gg G_T^2$ and $\max_{t\in[1:T]} x_t = O(u)$ (roughly, the predictions are stable), the regret bound becomes
$$R^{\lambda_{1:T}}_T(l_{1:T}, u) = O\big(u\sqrt{V_T}\big), \tag{7}$$
exactly matching the lower bound. Realistically such an oracle tuning is illegal, since $\varepsilon$ is selected at the beginning of the game, while reasonable comparators $u$ are hidden before all the loss functions are revealed. This is where our algorithm shines: as long as $\varepsilon$ is moderately small, i.e., $O(u)$, we would have the $O(uS)$ term dominating the regret bound, which only suffers a multiplicative logarithmic penalty relative to the impossible oracle-optimal rate Eq.(7). In comparison, it is well known that the regret bound of constant-LR OGD with learning rate $\eta$ depends polynomially on $\eta$ and $\eta^{-1}$ (cf. the proof of Theorem 6), which means our algorithm is provably more robust to suboptimal hyperparameter tuning.

Third, we explain the use of $\tau$. In nonstationary environments, the range of $x_t$ can vary significantly over time, therefore a time-uniform characterization of its stability ($\max_{t\in[1:T]} x_t$, Remark B.2) could be overly conservative. We use a stability window of arbitrary length $\tau$ to divide the time horizon into two parts: the earlier part is forgotten rapidly (due to the $\prod_{t=T-\tau}^{T-1}\lambda_t$ multiplier), so only the later part really matters.

Example 1. Suppose again that $\lambda_t = \lambda \in (0,1)$. If $\lambda \approx 1$, the forgetting multiplier can be approximated by
$$\prod_{t=T-\tau}^{T-1}\lambda_t = \lambda^\tau = \left(\lambda^{\frac{1}{1-\lambda}}\right)^{(1-\lambda)\tau} \approx e^{(\lambda-1)\tau}. \qquad \text{(Lemma B.1)}$$
Consider $\lambda = 0.99$ for example: $\tau = 700$ ensures $\lambda^\tau \approx e^{-7} < 10^{-3}$. That is, if the past range of $x_t$ is negligible after a $10^{-3}$ attenuation, then our regret bound only depends on the localized stability of $x_t$ evaluated over the recent 700 rounds, which is a small fraction of (let's say) $T$ in the millions. More intuitively, it means "past mistakes do not matter".

Given the 1D magnitude learner and its guarantee, the extension to $\mathbb{R}^d$ is now standard (Cutkosky & Orabona, 2018). We defer the pseudocode to Appendix B.3.

Theorem 4 (Main result). Given any hyperparameter $\varepsilon > 0$, Algorithm 3 in Appendix B.3 guarantees for all $T \in \mathbb{N}_+$, loss gradients $g_{1:T}$ and comparators $u \in \mathbb{R}^d$,
$$R^{\lambda_{1:T}}_T(l_{1:T}, u) \le O\left(\max_{t\in[1:T]}\|x_t\|\sqrt{V_T\log(\|u\|\varepsilon^{-1})} + \|u\|\,G_T\log(\|u\|\varepsilon^{-1})\right),$$
where $O(\cdot)$ is in the regime of large $V_T$ ($V_T \gg G_T^2$) and large $\|u\|$ ($\|u\| \gg \varepsilon$). Furthermore, for $u = 0$,
$$R^{\lambda_{1:T}}_T(l_{1:T}, 0) \le O\left(\varepsilon\sqrt{V_T} + G_T\max_{t\in[1:T]}\|x_t\|\right).$$

Similar to Theorem 3, the iterate stability term can be split into two parts using an arbitrary $\tau$. The key message is that if the iterates are indeed stable ($\max_{t\in[1:T]}\|x_t\| = O(\|u\|)$) and we suppress all logarithmic factors with $\tilde O(\cdot)$, then
$$R^{\lambda_{1:T}}_T(l_{1:T}, u) \le \tilde O\big(\|u\|\sqrt{V_T}\big).$$
It matches the lower bound and improves all the aforementioned algorithms. Granted, there are several nuances in this statement, but they are in general necessary even in the undiscounted special case (Orabona, 2023a, Chapter 9).

Finally, we summarize a range of practical strengths. As shown earlier, the proposed algorithm is robust to hyperparameter tuning. Observations and mistakes from the distant past (which are possibly misleading for the future) are appropriately forgotten, such that the algorithm runs consistently over its lifetime. Compared to the typical aggregation framework in nonstationary online learning (Daniely et al., 2015; Zhang et al., 2018a), our algorithm runs faster and never explicitly restarts (although we require given discount factors to softly restart). In addition, compared to constant-LR OGD that also tries to minimize the discounted regret, our algorithm makes better use of the inductive bias $x$: in the general case of $x \ne 0$, our discounted regret bound scales with $\|u - x\|$. This exemplifies the crucial role of regularization, designed from the adaptive OCO theory.
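The $\mathbb{R}^d$ extension deferred to Appendix B.3 follows the standard polar-decomposition reduction of (Cutkosky & Orabona, 2018). As a rough illustration (our own sketch, not the paper's Algorithm 3), one can run a direction learner on the unit ball and the 1D magnitude learner of Algorithm 1 in parallel, combine their outputs multiplicatively, and split the gradient feedback accordingly; `DiscountedAdaGradOGD` and `MagnitudeLearner` refer to the earlier sketches.

```python
import numpy as np

class PolarDecompositionLearner:
    """Predict x_t = y_t * z_t, where z_t (direction) lives on the unit ball and
    y_t >= 0 (magnitude) is produced by the 1D learner of Algorithm 1.  The direction
    learner receives the raw gradient g_t; the magnitude learner receives <g_t, z_t>."""

    def __init__(self, dim, eps=1.0):
        self.direction = DiscountedAdaGradOGD(dim, radius=1.0)  # unit-ball domain
        self.magnitude = MagnitudeLearner(eps=eps)
        self.z = np.zeros(dim)
        self.y = 0.0

    def predict(self):
        # Call predict() before update() in each round.
        self.z = self.direction.predict()
        self.y = self.magnitude.predict()
        # A nonzero inductive bias x could be handled by predicting x + y_t * z_t.
        return self.y * self.z

    def update(self, grad, discount):
        self.magnitude.update(float(np.dot(grad, self.z)), discount)
        self.direction.update(grad, discount)
```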
3. Online Conformal Prediction

Complementing the above result, we now consider its application in conformal prediction (Vovk et al., 2005), a framework that quantifies the uncertainty of black-box ML models. Specifically, we study its online version (Gibbs & Candes, 2021; Bastani et al., 2022; Gibbs & Candès, 2022; Zaffran et al., 2022; Bhatnagar et al., 2023), where no statistical assumptions (e.g., exchangeability) are imposed at all. Our setting follows the nicely written (Bhatnagar et al., 2023, Section 2), and the readers are referred to (Roth, 2022; Angelopoulos & Bates, 2023) for additional background.

3.1. Preliminary

Similar to OCO, OCP is again a two-person repeated game. Let a constant $\alpha \in (0, 1)$ be the targeted miscoverage rate fixed before the game starts. At the beginning of the $t$-th round, we receive a set-valued function $C_t : \mathbb{R}_{\ge 0} \to 2^{\mathcal{Y}}$ mapping any radius parameter $r \in [0, \infty)$ to a subset $C_t(r)$ of the label space $\mathcal{Y}$. The $C_t$ function is nested: for any $r > r'$, we have $C_t(r') \subseteq C_t(r) \subseteq \mathcal{Y}$. Then,

1. We pick a radius parameter $r_t \in [0, \infty)$ and output the prediction set $C_t(r_t)$.
2. The environment reveals the optimal radius $r^*_t \in [0, \infty)$. Intuitively, our prediction set $C_t(r_t)$ is large enough only if $r_t > r^*_t$.
3. Our performance is evaluated by the pinball loss $l^{(\alpha,*)}_t(r_t)$, where for all $r \in [0, \infty)$,
$$l^{(\alpha,*)}_t(r) := \begin{cases}\alpha\,(r - r^*_t), & r > r^*_t,\\ (\alpha-1)\,(r - r^*_t), & \text{else.}\end{cases} \tag{8}$$

To demonstrate this framework, here is a simple 1D forecasting example. Suppose there is a base ML model that in each round makes a prediction $\hat x_t \in \mathbb{R}$ of the true time series $x^*_t \in \mathbb{R}$. On top of that, we wrap such a point prediction $\hat x_t$ by a confidence set prediction $C_t(r_t) = (\hat x_t - r_t, \hat x_t + r_t)$. Ideally we want $C_t(r_t)$ to cover the true series $x^*_t$, and this can be checked after $x^*_t$ is revealed. That is, by defining the optimal radius $r^*_t = |x^*_t - \hat x_t|$, we claim success if $r_t > r^*_t$.

Quite naturally, since $r^*_t$ is arbitrary, it is impossible to ensure coverage unless $r_t$ is meaninglessly large. A reasonable objective is then asking our empirical (marginal) coverage rate to be approximately $1-\alpha$, which amounts to showing
$$\frac{1}{T}\sum_{t=1}^T \mathbf{1}\left[r_t \le r^*_t\right] - \alpha = o(1), \tag{9}$$
and this is beautifully equivalent to characterizing the cumulative subgradients of the pinball loss, $\sum_{t=1}^T \nabla l^{(\alpha,*)}_t(r_t) = o(T)$. (At the singular point $r^*_t$, define $\nabla l^{(\alpha,*)}_t(r^*_t) = \alpha - 1$.) One catch is that there are trivial predictors satisfying Eq.(9) (Bastani et al., 2022), e.g., alternating between $r_t = \infty$ and $r_t = 0$ independently of the data, so one needs an extra measure to rule them out. Such a secondary objective can be a regret bound on the pinball loss,
$$\sum_{t=1}^T l^{(\alpha,*)}_t(r_t) - \sum_{t=1}^T l^{(\alpha,*)}_t(u) = o(T) \tag{10}$$
for any fixed comparator $u$, which motivates using OCO algorithms to select $r_t$.

How does nonstationarity enter the picture? Since the proposal of OCP in (Gibbs & Candes, 2021), the main emphasis has been on problems with distribution shifts, which traditional conformal prediction methods based on exchangeability and data splitting have trouble dealing with. For example, the popular ACI algorithm (Gibbs & Candes, 2021) essentially uses constant-LR OGD; as we have shown, this is inconsistent with minimizing the standard regret Eq.(10), and the right OCO performance metric that justifies it could be a nonstationary one (e.g., the discounted regret). In a similar spirit, (Gibbs & Candès, 2022; Bhatnagar et al., 2023) apply dynamic and strongly adaptive OCO algorithms, effectively analyzing the subinterval variants of Eq.(9) and Eq.(10).

Another key ingredient of OCP is assuming the optimal radius $\max_t r^*_t \le D$ for some $D > 0$, which is often reasonable in practice and important for the coverage guarantee. As opposed to prior works (Bastani et al., 2022; Bhatnagar et al., 2023) that require knowing $D$ at the beginning to initialize properly, we seek an adaptive algorithm agnostic to this oracle knowledge.

Our goal. Overall, we aim to show that without knowing $D$, applying Algorithm 1 leads to discounted adaptive versions of the marginal coverage bound Eq.(9) and the regret bound Eq.(10). This offers advantages over the ACI-like approach that tackles similar sliding-window objectives using constant-LR OGD. Along the way, we demonstrate how the structure of OCP allows controlling the iterate stability of Algorithm 1 (or generally, FTRL algorithms), which then makes the proof of coverage fairly easy. Due to the limited space, experiments are presented in Appendix D.
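To fix ideas, the following sketch implements the pinball subgradient of Eq.(8) and the 1D forecasting example above, with the radius selected by the constant-learning-rate OGD baseline that the text identifies with ACI; swapping the update rule for the magnitude learner of Algorithm 1 gives the ACP variant studied next. All names and parameters below are illustrative.

```python
import numpy as np

def pinball_subgradient(r, r_star, alpha):
    """Subgradient of the pinball loss, Eq.(8): alpha if r > r_star, alpha - 1 otherwise."""
    return alpha if r > r_star else alpha - 1.0

def run_ocp(x_hat, x_true, alpha=0.1, eta=0.05, r_init=0.0):
    """1D forecasting example: wrap point predictions x_hat by intervals of radius r_t,
    updating r_t by constant-LR OGD on the pinball loss (the ACI-style baseline)."""
    r, covered = r_init, []
    for pred, truth in zip(x_hat, x_true):
        r_star = abs(truth - pred)          # optimal radius revealed in hindsight
        covered.append(r > r_star)          # did (pred - r, pred + r) cover the truth?
        g = pinball_subgradient(r, r_star, alpha)
        r = max(r - eta * g, 0.0)           # gradient step, projected onto [0, infinity)
    return np.mean(covered)                 # empirical coverage, ideally close to 1 - alpha

# Example usage on synthetic data:
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(size=1000))
preds = truth + rng.normal(scale=0.5, size=1000)
print(run_ocp(preds, truth))
```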
3.2. Main result

Beyond pinball loss. From now on, define ACP as the OCP algorithm that uses Algorithm 1 to select $r_t$ (see Appendix C.1 for pseudocode). We make a major generalization: instead of using subgradients of the pinball loss Eq.(8) to update Algorithm 1, we use subgradients $g^\dagger_t \in \partial f^\dagger_t(r_t)$, where $f^\dagger_t(r)$ is any convex function minimized at $r^*_t$, and right at $r_t = r^*_t$ we take $g^\dagger_t = 0$ without loss of generality. This includes the pinball loss $l^{(\alpha,*)}_t(r)$ as a special case. Notably, $f^\dagger_t(r)$ does not need to be globally Lipschitz, which unleashes the full power of our base algorithm. Put together, ACP takes a confidence hyperparameter $\varepsilon$ and a sequence of discount factors $\lambda_{1:T}$ determined by our objectives. We assume $\max_t r^*_t \le D$, but $D$ is unknown to ACP.

Strength of FTRL. The key to our result is Lemma 3.1, connecting the prediction $r_{t+1}$ to the discounted coverage metric
$$S^\dagger_t := \sum_{i=1}^{t}\left(\prod_{j=i}^{t-1}\lambda_j\right) g^\dagger_i. \tag{11}$$
If $\lambda_t = 1$ for all $t$ and we use the pinball loss to define $g^\dagger_{1:T}$, then $|S^\dagger_T|/T$ recovers Eq.(9). The point is that if $\lambda_t = \lambda < 1$, then just like the intuition throughout this paper, Eq.(11) is essentially a sliding-window coverage metric. Associated algorithms would gradually forget the past, which intuitively counters the distribution shifts.

Lemma 3.1 (Abridged Lemma C.2). ACP guarantees for all $t \in \mathbb{N}_+$,
$$\big|S^\dagger_t\big| \le O\left(\sqrt{V^\dagger_t\log(r_{t+1}\varepsilon^{-1})} + G^\dagger_t\log(r_{t+1}\varepsilon^{-1})\right),$$
where $V^\dagger_t$ and $G^\dagger_t$ are defined on the OCP loss gradients $g^\dagger_{1:t}$ using Eq.(3), and $O(\cdot)$ is in the regime of $r_{t+1} \gg \varepsilon$.

The proof of this lemma is a bit involved, but the high-level idea is very simple: if we use an FTRL algorithm (rather than OGD) as the OCO subroutine for OCP, then the radius prediction is roughly $r_{t+1} \approx \psi\big(S^\dagger_t / \sqrt{V^\dagger_t}\big)$ for some prediction function $\psi$ (if the algorithm is not adaptive enough then the denominator is $\sqrt{H_t}$ instead), which means $S^\dagger_t \le \sqrt{V^\dagger_t}\,\psi^{-1}(r_{t+1}) = O(\sqrt{H_t})$. Dividing both sides by $H_t$ yields the desired coverage guarantee. In comparison, the parallel analysis using OGD can be much more complicated (e.g., the grouping argument in Bhatnagar et al., 2023) due to the absence of $S^\dagger_t$ in the explicit update rule. Such an analytical strength of FTRL seems to be overlooked in the OCP literature.

We also note that although the above lemma depends only logarithmically on $r_{t+1}$, without any problem structure the latter could still be large (exponential in $t$), which would invalidate this approach. The remaining step is showing that if the underlying optimal radius $r^*_t$ is time-uniformly bounded by $D$, then even without knowing $D$, we can replace $r_{t+1}$ in the above lemma by $O(D)$. Intuitively this should make sense; materializing it carefully gives us the final result.

Theorem 5. Without knowing $D$, ACP guarantees that for all $T \in \mathbb{N}_+$, we have the discounted coverage bound
$$\big|S^\dagger_T\big| \le O\left(\sqrt{V^\dagger_T\log(D\varepsilon^{-1})} + G^\dagger_T\log(D\varepsilon^{-1})\right),$$
and the discounted regret bound from Theorem 3.

To interpret this result, we focus on the coverage bound, since the discounted regret bound has been discussed extensively in Section 2. First, consider the pinball loss and the undiscounted setting $\lambda_t = 1$, where our bound can be directly compared to prior works. For OGD, (Gibbs & Candes, 2021, Proposition 1) shows that the learning rate $\eta = D/\sqrt{T}$ (as suggested by regret minimization) achieves $|S^\dagger_T| = O(\sqrt{T})$, while (Bhatnagar et al., 2023, Theorem 2) shows that $\eta_t = D/\sqrt{V_t}$ (i.e., ADAGRAD) achieves $|S^\dagger_T| = O(\alpha^{-2}T^{3/4}\log T)$. Although the latter is empirically strong, the theory is a bit unsatisfying, as one would expect the gradient-adaptive approach to be a pure upgrade (plus, the bound blows up as $\alpha \to 0$).
Our Theorem 5 is in some sense "the right fix", as essentially, $|S^\dagger_T| = O(\sqrt{V^\dagger_T}) \le O(\sqrt{T})$. This is primarily due to the strength of FTRL over OGD, which we hope to demonstrate. Besides, our algorithm also improves the regret bound of these baselines while being agnostic to $D$. All these algorithms can be extended to the sliding-window setting (e.g., the algorithm from Gibbs & Candes, 2021 becomes constant-LR OGD, which is the version actually applied in practice), and the above comparison still roughly holds. There is just one catch: OGD algorithms make coverage guarantees on exact sliding windows $[T-H+1:T]$ of length $H$, whereas our algorithm bounds the coverage on a slightly different exponential window of effective length $H_T$, Eq.(11). Nonetheless, all these algorithms also guarantee the same type of discounted regret bounds, which are still comparable as in Section 2.

Adaptivity in OCP. Next, we discuss the benefits of gradient adaptivity more concretely in OCP. Suppose again that $\lambda_t = 1$ and we use the pinball loss. Then, instead of the non-adaptive bound $|S^\dagger_T| = O(\sqrt{T})$, we have
$$\big|S^\dagger_T\big| = \tilde O\left(\sqrt{\alpha^2\sum_{t=1}^T \mathbf{1}[r_t > r^*_t] + (\alpha-1)^2\sum_{t=1}^T \mathbf{1}[r_t \le r^*_t]}\right).$$
Since asymptotically the miscoverage rate is $\alpha$, we have $\sum_{t=1}^T \mathbf{1}[r_t \le r^*_t] \approx \alpha T$, which means the bound is roughly $\tilde O(\sqrt{\alpha(1-\alpha)T})$. That is, the gradient adaptivity in OCO translates to the target rate adaptivity in OCP. In terms of the sample complexity, it is then straightforward to see that the OGD baseline requires at least $\varepsilon^{-2}$ rounds to guarantee an empirical marginal coverage rate within $[\alpha-\varepsilon, \alpha+\varepsilon]$, while our algorithm requires at least $\alpha\varepsilon^{-2}$ rounds. Very concretely, in the typical setting of $\alpha = 0.05$ (i.e., we want 95% confidence sets), our algorithm only requires $1/20$ as many samples compared to the OGD baseline. Besides the coverage bound, such an $\alpha$-dependent saving applies to the regret bound as well.

4. Experiment

Finally, we demonstrate the practicality of our algorithm in OCP experiments. Our setup closely builds on (Bhatnagar et al., 2023): except for our own algorithms, we adopt the implementation of the baselines and the evaluation procedure from there. Details are deferred to Appendix D.

Setup. We consider image classification in a sequential setting, where each image is subject to a corruption of time-varying strength. Given a base machine learning model that generates the $C_t$ function (by scoring all the possible labels), the goal of OCP is to select the radius parameter $r_t$, which yields a set of predicted labels. Ideally, we want such a set to contain the true label while being as small as possible. The targeted miscoverage rate $\alpha$ is selected as 0.1.

We test three versions of our algorithm: MAGL-D is our Algorithm 1 with $\varepsilon = 1$ and $\lambda_t = 0.999$; MAGL is its undiscounted version ($\lambda_t = 1$); and MAGDIS is a much simplified variant of Algorithm 1 that basically sets $h_t = 0$. We are aware that a possible complaint about our approach is that the algorithm is too complicated, but as we will show, this simplified version, presented as Algorithm 5, also demonstrates strong empirical performance despite losing the performance guarantee.

The baselines we test are summarized in Table 1, following the implementation of (Bhatnagar et al., 2023).
| Method | Avg. Coverage | Avg. Width | LCE100 | Runtime |
|---|---|---|---|---|
| Simple OGD | 0.899 | 125.0 | 0.11 | 1.00 ± 0.03 |
| SF-OGD | 0.899 | 124.5 | 0.09 | 1.05 ± 0.03 |
| SAOCP | 0.883 | 119.2 | 0.10 | 11.07 ± 0.19 |
| Split Conformal | 0.843 | 129.5 | 0.47 | 1.57 ± 0.04 |
| NEx Conformal | 0.892 | 123.0 | 0.14 | 2.22 ± 0.02 |
| FACI | 0.889 | 123.4 | 0.12 | 2.98 ± 0.07 |
| FACI-S | 0.892 | 124.7 | 0.11 | 2.17 ± 0.07 |
| Alg. 1: MAGL-D | 0.884 | 118.5 | 0.08 | 1.06 ± 0.03 |
| Alg. 2: MAGL | 0.894 | 122.1 | 0.09 | 1.05 ± 0.02 |
| Alg. 5: MAGDIS | 0.888 | 122.1 | 0.07 | 1.02 ± 0.04 |

Table 1. Performance of different methods. The first seven rows are baselines, and the last three rows are our algorithms. The runtime, plus/minus its standard deviation, is normalized by the mean of Simple OGD's runtime; averages are taken over ten trials per algorithm.

In particular, SF-OGD (Bhatnagar et al., 2023) is equivalent to ADAGRAD with oracle tuning: by definition it requires knowing the maximum possible radius $D$ to set the learning rate, and in practice, $D$ is estimated from an offline dataset (which is a form of oracle tuning from the theoretical perspective). To test the effect of such tuning, we create another baseline called Simple OGD, which is simply SF-OGD with its estimate of $D$ set to 1. We emphasize that despite its name, Simple OGD is still a gradient-adaptive algorithm.

Four metrics are evaluated, and we define them formally in Appendix D. First, the average coverage measures the empirical coverage rate over the entire time horizon. Similarly, the average width refers to the average size of the prediction set, also over the entire time horizon. Ideally we want the average coverage to be close to $1 - \alpha = 0.9$, and if that is satisfied, a lower average width is better. Different from these two, the local coverage error (LCE100) measures the deviation of the local empirical coverage rate (over the worst sliding time window of length 100) from the target rate 0.9; this is arguably the most important metric (due to the distribution shifts), and lower is better. Finally, we also report the runtime of all the algorithms, normalized by that of Simple OGD.

Result. The results are summarized in Table 1. Among all the algorithms tested, our algorithms achieve the lowest local coverage error (LCE100). In terms of metrics on the entire time horizon, our algorithms also demonstrate competitive performance compared to the baselines: although the average coverage is worse than that of Simple OGD and SF-OGD, the average width is lower. In addition, our algorithms run almost as fast as Simple OGD and SF-OGD, and importantly, they are significantly faster than SAOCP (Bhatnagar et al., 2023), an aggregation algorithm that minimizes the strongly adaptive regret. Comparing the three versions of our algorithm, as one would expect, the discounted version (MAGL-D) improves on the undiscounted version (MAGL). Remarkably, the much simplified MAGDIS achieves competitive performance despite the lack of performance guarantees.

5. Conclusion

This work revisits the classical notion of discounted regret using recently developed techniques in adaptive online learning. In particular, we propose a discounted, simultaneously adaptive algorithm (with respect to both the loss sequence and the comparator), with demonstrated practical benefits in online conformal prediction. Along the way, we propose (i) a simple rescaling trick to minimize the discounted regret; and (ii) an FTRL-based analytical strategy to guarantee coverage in online conformal prediction.
More broadly, we show that the adaptive OCO theory can help designing better regularization mechanisms for continual / lifelong learning, where forgetting is necessary. Moving forward, we hope the present work can help revive the community s interest in the discounted regret. For nonstationary online learning, it is a simpler metric to study than the alternatives, while offering certain practical advantages (Appendix A). We leave the online selection of the discount factors as an important open question. In addition, there are recent works (Cutkosky et al., 2023; Ahn et al., 2024) suggesting the connection between discounting and deep learning optimization (e.g., ADAM), which could be an exciting direction for future research. Acknowledgements This research was supported by Harvard University. Discounted Adaptive Online Learning: Towards Better Regularization Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. The algorithms in this paper can broadly help learning, prediction, and decision-making in dynamic environments with distribution shifts. The main focus is theoretical, therefore we do not foresee any negative societal impact. Abel, D., Barreto, A., Van Roy, B., Precup, D., van Hasselt, H., and Singh, S. A definition of continual reinforcement learning. ar Xiv preprint ar Xiv:2307.11046, 2023. Abernethy, J., Hazan, E. E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, pp. 263 273, 2008. Adamskiy, D., Koolen, W. M., Chernov, A., and Vovk, V. A closer look at adaptive regret. Journal of Machine Learning Research, 17(1):706 726, 2016. Ahn, K., Zhang, Z., Kook, Y., and Dai, Y. Understanding Adam optimizer via online learning of updates: Adam is FTRL in disguise. ar Xiv preprint ar Xiv:2402.01567, 2024. Angelopoulos, A. N. and Bates, S. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494 591, 2023. Baby, D. and Wang, Y.-X. Optimal dynamic regret in expconcave online learning. In Conference on Learning Theory, pp. 359 409. PMLR, 2021. Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. In International Conference on Artificial Intelligence and Statistics, pp. 1805 1845. PMLR, 2022. Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 845, 2023. Bastani, O., Gupta, V., Jung, C., Noarov, G., Ramalingam, R., and Roth, A. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 35:29362 29373, 2022. Bhatnagar, A., Wang, H., Xiong, C., and Bai, Y. Improved online conformal prediction via strongly adaptive online learning. In International Conference on Machine Learning, pp. 2337 2363. PMLR, 2023. Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1829 1836, 2019. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006. Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. Mirror descent meets fixed share (and feels no regret). Advances in Neural Information Processing Systems, 25, 2012. Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pp. 1216 1259. 
PMLR, 2021. Chernov, A. and Zhdanov, F. Prediction with expert advice under discounted loss. In International Conference on Algorithmic Learning Theory, pp. 255 269. Springer, 2010. Cutkosky, A. Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pp. 874 894. PMLR, 2019. Cutkosky, A. Parameter-free, dynamic, and stronglyadaptive online learning. In International Conference on Machine Learning, pp. 2250 2259. PMLR, 2020. Cutkosky, A. and Mhammedi, Z. Fully unconstrained online learning. ar Xiv preprint 2405.20540, 2024. Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in banach spaces. In Conference On Learning Theory, pp. 1493 1529. PMLR, 2018. Cutkosky, A., Mehta, H., and Orabona, F. Optimal, stochastic, non-smooth, non-convex optimization through onlineto-non-convex conversion. In International Conference on Machine Learning, pp. 6643 6670. PMLR, 2023. Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International Conference on Machine Learning, pp. 1405 1411. PMLR, 2015. De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366 3385, 2021. Drenska, N. and Kohn, R. V. Prediction with expert advice: A PDE perspective. Journal of Nonlinear Science, 30(1): 137 173, 2020. Discounted Adaptive Online Learning: Towards Better Regularization Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011. Fang, H., Harvey, N. J., Portella, V. S., and Friedlander, M. P. Online mirror descent and dual averaging: keeping pace in the dynamic case. Journal of Machine Learning Research, 23(1):5271 5308, 2022. Freund, Y. and Hsu, D. A new hedging algorithm and its application to inferring latent random variables. ar Xiv preprint ar Xiv:0806.4802, 2008. Gibbs, I. and Candes, E. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660 1672, 2021. Gibbs, I. and Cand es, E. Conformal inference for online prediction with arbitrary distribution shifts. ar Xiv preprint ar Xiv:2208.08401, 2022. Haagerup, U. The best constants in the khintchine inequality. Studia Mathematica, 70(3):231 283, 1981. Harvey, N. J., Liaw, C., Perkins, E., and Randhawa, S. Optimal anytime regret with two experts. Mathematical Statistics and Learning, 6(1):87 142, 2023. Hazan, E. and Seshadhri, C. Efficient learning algorithms for changing environments. In International Conference on Machine Learning, pp. 393 400, 2009. Ivgi, M., Hinder, O., and Carmon, Y. Dog is SGD s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, pp. 14465 14499. PMLR, 2023. Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Conference on Learning Theory, pp. 4160 4211. PMLR, 2022. Jacobsen, A. and Cutkosky, A. Online linear regression in dynamic environments via discounting. ar Xiv preprint ar Xiv:2405.19175, 2024. Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pp. 943 951. PMLR, 2017. Kapralov, M. and Panigrahy, R. Prediction strategies without loss. Advances in Neural Information Processing Systems, 24, 2011. 
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Lu, Z. and Hazan, E. On the computational efficiency of adaptive and dynamic regret minimization. ar Xiv preprint ar Xiv:2207.00646, 2023. Mc Mahan, H. B. and Orabona, F. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pp. 1020 1039. PMLR, 2014. Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Conference on Learning Theory, pp. 2858 2887. PMLR, 2020. Orabona, F. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2023a. Orabona, F. Normalized gradients for all. ar Xiv preprint ar Xiv:2308.05621, 2023b. Orabona, F. and P al, D. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016. Orabona, F. and P al, D. Scale-free online learning. Theoretical Computer Science, 716:50 69, 2018. Orabona, F. and P al, D. Parameter-free stochastic optimization of variationally coherent functions. ar Xiv preprint ar Xiv:2102.00236, 2021. Roth, A. Uncertain: Modern topics in uncertainty estimation, 2022. Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world, volume 29. Springer, 2005. Zaffran, M., F eron, O., Goude, Y., Josse, J., and Dieuleveut, A. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pp. 25834 25866. PMLR, 2022. Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. Advances in neural information processing systems, 31, 2018a. Zhang, L., Yang, T., and Zhou, Z.-H. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, pp. 5882 5891. PMLR, 2018b. Zhang, Z., Cutkosky, A., and Paschalidis, I. PDE-based optimal strategy for unconstrained online learning. In International Conference on Machine Learning, pp. 26085 26115. PMLR, 2022. Zhang, Z., Cutkosky, A., and Paschalidis, I. C. Unconstrained dynamic regret via sparse coding. ar Xiv preprint ar Xiv:2301.13349, 2023. Zhang, Z., Yang, H., Cutkosky, A., and Paschalidis, I. C. Improving adaptive online learning using refined discretization. In International Conference on Algorithmic Learning Theory, pp. 1208 1233. PMLR, 2024. Discounted Adaptive Online Learning: Towards Better Regularization Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Dynamic regret of convex and smooth functions. Advances in Neural Information Processing Systems, 33:12510 12520, 2020. Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928 936, 2003. Discounted Adaptive Online Learning: Towards Better Regularization Organization Appendix A surveys the related topic of dynamic and strongly adaptive regret. Appendix B and C contain the details of discounted online learning and the OCP application, respectively. Details of the experiments are presented in Appendix D. A. Dynamic and Strongly Adaptive Regret This section surveys the recent predominant approach in nonstationary online learning, namely the aggregation-type algorithms that minimize either the dynamic regret or the strongly adaptive regret. We discuss the main idea behind these algorithms, as well as the possible practical concerns. Definition For clarity, let us assume the diameter of the domain X is at most D. 
Generalizing the undiscounted static regret, the Zinkevich-style dynamic regret (Zinkevich, 2003; Zhang et al., 2018a;b; Zhao et al., 2020; Baby & Wang, 2021; 2022; Jacobsen & Cutkosky, 2022; Lu & Hazan, 2023; Zhang et al., 2023) allows the comparator u X to be time-varying. Formally, the goal is to upper-bound RT (l1:T , u1:T ) := t=1 lt(ut), for all comparator sequences u1, . . . , u T . If the losses are G-Lipschitz, then a typical dynamic regret bound has the form RT (l1:T , u1:T ) O G p where Pu := PT 1 t=1 ut ut+1 is called the path length of the u1:T sequence. Alternatively, the (strongly) adaptive regret (Hazan & Seshadhri, 2009; Daniely et al., 2015; Adamskiy et al., 2016; Jun et al., 2017; Cutkosky, 2020; Lu & Hazan, 2023) generalizes the undiscounted static regret to subintervals of the time horizon. Formally, for a time interval I [1 : T], we define RI(l I, u) := X t I lt(xt) X and with G-Lipschitz losses, the typical goal is to show that simultaneously on all time intervals I, RI(l I, u) O DG p regardless of the specific loss sequence and the comparator. Note that the name strongly adaptive is due to historical reasons; in general it is narrower that the recent concept of adaptive online learning in the literature. The latter (adopted in this work) refers to achieving instance-dependent performance guarantees. Hidden in the above definitions is an implicit dependence on the nonstationarity of the environment. Taking the dynamic regret for example,7 the standard follow-up reasoning is bounding the total loss of the algorithm, PT t=1 lt(xt), through the oracle inequality, T X t=1 lt(xt) inf u1:T t=1 lt(ut) + O G p Although the dynamic regret bound holds for all comparator sequences u1:T , the only important ones are those with low total loss, PT t=1 lt(ut). In this way, the path length Pu of the important comparator sequence essentially measures the variation of the loss sequence l1:T . 7For the strongly adaptive regret, a similar argument can be made using the length |I| of important time intervals , where the environment is almost stationary and a time-invariant comparator u X induces low loss. Discounted Adaptive Online Learning: Towards Better Regularization Algorithm Due to the intricate connections between these two performance metrics (Zhang et al., 2018b; Cutkosky, 2020; Baby & Wang, 2021), all known algorithms that minimize either of them share the same two-level compositional design philosophy: 1. The low level maintains a class of base online learning algorithms in parallel, each targeting a different nonstationarity level of the environment that is possibly correct. 2. The high level aggregates these base algorithms, in order to adapt to the true nonstationarity level unknown beforehand. Within this procedure, a particularly important consideration is the range of targeted nonstationarity levels. This determines the amount of base algorithms maintained at the same time, thus affects the overall statistical and computational performance. Practical concern Following the above reasoning, we argue that minimizing either the dynamic regret or the strongly adaptive regret requires targeting a wide range of nonstationarity levels, which is sometimes impractical. For example, in the dynamic regret, the path length Pu of the important comparator sequence can take any value from Θ(1) to Θ(T), whose ratio grows with T. 
From the computational perspective, the consequence is that the standard model selection approach on an exponentially spaced grid of Pu requires maintaining O(log t) base algorithms in the t-th round,8 making the entire algorithm slower over time. Furthermore, excessively conflicting beliefs Pu = Θ(1) and Pu = Θ(T) are maintained simultaneously, with the former advocating convergence and the latter on the exact contrary. As we demonstrate shortly, this may cause statistical failures : even in periodic environments where Pu = Θ(T) is the only correct belief , the algorithm may still be deceived by the possibility of Pu = Θ(1) and converge to trivial time-invariant decisions. Possible example of failure Consider a 1D OCO problem inspired by time series forecasting, where the loss functions l1:T are the quadratic loss with respect to a ground truth z1:T sequence. That is, lt(x) = (x zt)2 . Let the ground truth be a unit square wave with period 2H: zt = ( 1) (t 1)/H . Furthermore, we set the domain X = [ 2, 2], such that the Lipschitz constant G = 6. Consider any algorithm with an optimal Pu-dependent dynamic regret bound, or an optimal strongly adaptive regret bound for 1D OCO (with bounded domain and Lipschitz losses). Then, as a consequence, the algorithm also guarantees a sublinear static regret bound, T X t=1 lt(xt) min u X t=1 lt(u) + o(T). Suppose the time horizon T is a multiple of 2H. Then, t=1 lt(u) = T 2 (u 1)2 + T 2 (u + 1)2 = T(u2 + 1), which means that the optimal fixed comparator is u = 0. It suggests that the xt sequence generated by the algorithm converges to the trivial time-invariant prediction 0 in the long run, even though the ground truth zt does not converge. (Technically, the algorithm can do better by learning the periodicity of the ground truth and changing the direction accordingly, e.g., always predicting a nonzero xt of the same sign as zt. However, since the algorithm is designed for adversarial environments which are not necessarily periodic, such a strategic behavior is unlikely to hold.) To be more specific, we expect the xt sequence to track the ground truth zt sequence, but less and less responsively as t increases (which could be called degradation ), and eventually xt converges to 0. We performed preliminary numerical experiments to validate this hypothesis. The expected degradation is clearly observed on the algorithm from (Jacobsen & Cutkosky, 2022), but for the structurally simpler meta-expert algorithm from (Zhang et al., 2018a), the degradation of the xt sequence is not significant given the scale of our preliminary experiments. We defer a thorough investigation of this problem to future works. 8A notable exception is (Lu & Hazan, 2023), which shows that log log t computation per round is achievable at the expense of slightly larger regret bounds. Discounted Adaptive Online Learning: Towards Better Regularization Takeaway Back to our main argument, we aim to show that for a nonstationary online learning algorithm, it could be impractical to simultaneously maintain a wide range of beliefs on the correct nonstationarity level (of the environment). In some sense, this is on the contrary of a common perception of the field (i.e., the adaptivity to the true nonstationarity level is always better). Our study of discounted algorithms is partially motivated by this idea: such discounted algorithms only target one nonstationarity level, which is specified by the given discount factor. B. 
Detail of Section 2 Appendix B.1 contains omitted proofs from Section 2, excluding the analysis of our main algorithm. Appendix B.2 introduces the undiscounted algorithm from (Zhang et al., 2024) and its guarantees; these are applied to our reduction from Section 2.2. Appendix B.3 proves our main result, i.e., discounted regret bounds that adapt simultaneously to l1:T and u. B.1. Preliminary proofs We first prove an auxiliary lemma. Lemma B.1. For all λ (0, 1), λ 1 1 λ e 1. Moreover, limλ 1 λ 1 1 λ = e 1. Proof of Lemma B.1. For the first part of the lemma, taking log on both sides, it suffices to show 1 1 λ log λ 1. This holds due to log λ λ 1. The second part is due to limλ 1 (1 λ) 1 log λ = 1. The following theorem characterizes non-adaptive OGD. Theorem 6. If the loss functions are all G-Lipschitz and the diameter of the domain diam(X) is at most D, then OGD with learning rate ηt = DG 1H 1/2 t guarantees for all T N+, sup l1:T ,u Rλ1:T T (l1:T , u) 3 Furthermore, if the discount factor λt = λ for some λ (0, 1), then OGD with a time-invariant learning rate ηt = DG 1 1 λ2 guarantees for all T 1 sup l1:T ,u Rλ T (l1:T , u) 3 Proof of Theorem 6. We start with the first part of the theorem. The standard analysis of gradient descent centers around the descent lemma (Orabona, 2023a, Lemma 2.12), gt, xt u 1 2ηt xt u 2 1 2ηt xt+1 u 2 + ηt Applying convexity and taking a telescopic sum, Rλ1:T T (l1:T , u) [lt(xt) lt(u)] 2ηt xt u 2 1 2ηt xt+1 u 2 + ηt Discounted Adaptive Online Learning: Towards Better Regularization = 1 2η1 x1 u 2 T 1 Y gt 2 . (12) Now, consider the second sum on the RHS. Notice that Ht = λ2 t 1Ht 1 + 1, λ2 t 1Ht 1 + 1 q Therefore, we can apply the diameter condition xt u D (and the Lipschitzness gt G as well), and use a telescopic sum again to obtain Rλ1:T T (l1:T , u) D2 t=1 H 1/2 t To proceed, define t=1 H 1/2 t We now show that Sum T 2 HT by induction. When T = 1, we have H1 = 1 and Sum1 = H 1/2 1 = 1, therefore Sum1 2 H1. When T > 1, starting from the induction hypothesis Sum T 1 2 p Sum T = H 1/2 T + λT 1Sum T 1 H 1/2 T + 2λT 1 p Applying HT = λ2 T 1HT 1 + 1, Sum T (λ2 T 1HT 1 + 1) 1/2 + 2 q λ2 T 1HT 1 2 q λ2 T 1HT 1 + 1 = 2 p where the inequality is due to the concavity of square root. Combining everything above leads to the first part of the theorem. As for the second part (time-invariant λ < 1), we follow the same procedure until Eq.(12). The learning rate ηt is independent of t; denote it as η = DG 1 1 λ2. Then, Rλ T (l1:T , u) λT 1 2η x1 u 2 + 1 λ t=2 λT t xt u 2 + η t=1 λT t gt 2 2η + D2(1 λ) t=2 λT t + ηG2 2η + D2(1 λ) Discounted Adaptive Online Learning: Towards Better Regularization 1 λ2 . (1 λ2 2(1 λ)) Following the definition of HT , in the case of time-invariant λ < 1, Due to Lemma B.1, if T 1 2(1 λ) 1, we have λ2T e 1, therefore Combining the above completes the proof. Accompanying the upper bounds, the following theorem characterizes an instance-dependent discounted regret lower bound. Theorem 7. Consider any combination of time horizon T N+, discount factors λ1:T , Lipschitz constant G > 0, and variance budget V 0, G2HT , where HT is defined in Eq.(2). Furthermore, consider any nonzero u X that satisfies u X. For any OCO algorithm that possibly depends on these quantities, there exists a sequence of linear losses lt(x) = gt, x such that gt G for all t [1 : T], and max h Rλ1:T T (l1:T , u), Rλ1:T T (l1:T , u) i 1 Proof of Theorem 7. The proof is a mild generalization of the undiscounted argument, e.g., (Orabona, 2023a, Theorem 5.1). 
We provide it here for completeness. We first define a random sequence of loss gradients. Let ε1, . . . , εT be a sequence of iid Rademacher random variables: εt equals 1 with probability 1 2 each. Also, let V PT t=1 QT 1 i=t λ2 i . Then, we define the loss gradient sequence g1:T as gt = Lεt u u . Notice that L G. Now consider the regret with respect to l1:T , the random linear losses induced by the random gradient sequence g1:T . Rλ1:T T l1:T , u = ! L u u, xt εt Rλ1:T T l1:T , u = gt, xt + u = ! L u u, xt εt + E h max h Rλ1:T T ( l1:T , u), Rλ1:T T ( l1:T , u) ii ! L u u, xt εt ! L u E [ u, xt εt] + L u E Discounted Adaptive Online Learning: Towards Better Regularization (Khintchine inequality (Haagerup, 1981)) Finally, note that the above lower-bounds the expectation. There exists a loss gradient sequence g1:T with gt {Lu/ u , Lu/ u }, such that max h Rλ1:T T (l1:T , u), Rλ1:T T (l1:T , u) i 1 The next theorem, presented in Section 2.2, analyzes gradient adaptive OGD. Theorem 2. If the diameter of X is at most D, then OGD with learning rate ηt = DV 1/2 t guarantees for all T N+ and loss sequence l1:T , sup u Rλ1:T T (l1:T , u) 3 Proof of Theorem 2. As shown in (Orabona, 2023a, Chapter 4.2), applying OGD with the undiscounted gradient-dependent learning rate ηt = D q Pt i=1 ˆgi 2 (13) to surrogate linear losses ˆgt, guarantees the undiscounted regret bound t=1 ˆgt, xt u 3 For the discounted setting, we follow the rescaling trick from Section 2.2. First, consider the effective prediction rule when ˆgt is defined according to Eq.(5). xt D q Pt i=1 ˆgi 2 ˆgt xt D r Pt i=1 Qi 1 j=1 λ 2 j gi 2 xt D r Pt i=1 Qt 1 j=i λ2 j gi 2 gt which is exactly the prediction rule this theorem proposes. As for the regret bound, we have t=1 ˆgt, xt u 3 t=1 ˆgt 2 = 3 Plugging it into Eq.(6) and using the definition of VT from Eq.(3) complete the proof. Discounted Adaptive Online Learning: Towards Better Regularization Remark B.1 (Failure of OGD). As mentioned in Section 2.2, one might suspect that OGD with an even better learning rate than Theorem 2 (i.e., possibly using the oracle knowledge of u ) can further improve the regret bound to O( u VT ), matching the lower bound (Theorem 7). We now explain where this intuition could come from, and why it is misleading. Consider the undiscounted setting (λt = 1). For any scaling factor α, it is well-known that OGD with learning rate η = αG 1T 1/2 achieves the undiscounted regret bound O (α + u 2 α 1)G T (Orabona, 2023a, Chapter 2.1), which becomes O( u G T) if the oracle tuning α = u is allowed. This might suggest that the remaining suboptimality of Theorem 2 can be closed by a similar oracle tuning, i.e., setting ηt = u Vt . (14) However, such an analogy misses a key point: the aforementioned nice property of α-scaled OGD only holds with time-invariant learning rates, and therefore, it can be found that OGD with the time-varying learning rate Eq.(14) does not yield the O( u G VT ) bound we aim for, let alone the impossibility of oracle tuning. Broadly speaking, it is known (but sometimes overlooked) that OGD has certain fundamental incompatibility with time-varying learning rates, but this can be fixed by an extra regularization centered at the origin (Orabona & P al, 2018; Fang et al., 2022; Jacobsen & Cutkosky, 2022). Such a procedure encodes the initialization into the algorithm, which essentially makes it similar to FTRL. B.2. Algorithm from (Zhang et al., 2024) In this subsection we introduce the algorithm from (Zhang et al., 2024) and its guarantee. 
Proposed for the undiscounted setting (λt = 1), it achieves a scale-free regret bound that simultaneously adapts to both the loss sequence and the comparator. Following the polar-decomposition technique from (Cutkosky & Orabona, 2018), the main component of this algorithm is a subroutine (a standalone one dimensional OCO algorithm) that operates on the positive real line [0, ). We present the pseudocode of this subroutine as Algorithm 2. Technically it is a combination of (Zhang et al., 2024, Algorithm 1 + part of Algorithm 2) and (Cutkosky, 2019, Algorithm 2). Except the choice of the Φ function which uses an unusual continuous-time analysis, everything else is somewhat standard in the community. Here is an overview of the idea. The core component is the potential function Φh,ε(v, s) that takes two input arguments, the observed gradient variance v and the observed gradient sum s. For now let us briefly suppress the dependence on auxiliary parameters h and ε. At the beginning of the t-th round, with the observations of past gradients g1:t 1, we ideally (there are two twists explained later) would like to define its summaries (sometimes called sufficient statistics) i=1 g2 i , st = and predict xt = 2Φ(vt, st), i.e., the partial derivative of the potential function Φ with respect to its second argument, evaluated at the pair (vt, st). This is the standard procedure of the potential method (Cesa-Bianchi & Lugosi, 2006; Mc Mahan & Orabona, 2014; Mhammedi & Koolen, 2020) in online learning, which is associated to a well-established analytical strategy (Mc Mahan & Orabona, 2014). Equivalently, one may also interpret this procedure as Follow the Regularized Leader (FTRL) (Orabona, 2023a, Chapter 7.3) with linearized losses: the potential function Φ is essentially the convex conjugate of a FTRL regularizer (Zhang et al., 2022, Section 3.1). A bit more on the design of Φ: the intuition is that h is typically much smaller than v and s, so if we set h = 0 (rigorously this is not allowed since the prediction would be a bit too aggressive; but just for our intuitive discussion this is fine), then the potential function is morally Φ(v, s) ε v 0 erfi(u)du 1 which is associated to the prediction function 2Φ(v, s) ε erfi s 2 v 0 exp(u2)du. Discounted Adaptive Online Learning: Towards Better Regularization Algorithm 2 ((Zhang et al., 2024, Algorithm 1 and 2) + (Cutkosky, 2019, Algorithm 2)) Undiscounted 1D magnitude learner on [0, ). Require: Hyperparameter ε > 0 (default ε = 1). 1: Initialize parameters v1 = 0, s1 = 0, h1 = 0. Define the following (h, ε)-parameterized potential function Φh,ε(v, s) with two input arguments v and s. Φh,ε(v, s) := ε p v + 2hs + 16h2 0 erfi(u)du 1 2: for t = 1, 2, . . . do 3: If ht = 0, define an unprojected prediction xt = 0; otherwise, define xt = 2Φht,ε(vt, st) vt + 2htst + 16h2 t vt + 2htst + 16h2 t exp s2 t 4(vt + 2htst + 16h2 t) 4: Predict xt = Π[0, ) ( xt), the projection of xt to the domain [0, ). 5: Receive the loss gradient gt R. 6: Clip the gradient by defining gt,clip = Π[ ht,ht] (gt), and let ht+1 = max (ht, |gt|). 7: Define a surrogate loss gradient gt,clip R, where ( gt,clip, gt,clip xt gt,clipxt, 0, else. 8: Update vt+1 = vt + g2 t,clip, st+1 = st gt,clip. 
9: end for The use of the erfi function might seem obscure, but essentially it is rooted in a recent trend (Drenska & Kohn, 2020; Zhang et al., 2022; Harvey et al., 2023) connecting online learning to stochastic calculus: we scale the OCO game towards its continuous-time limit and solve the obtained Backward Heat Equation. Prior to this trend, the predominant idea was using (Mc Mahan & Orabona, 2014; Orabona & P al, 2016; Mhammedi & Koolen, 2020) Φ(v, s) ε v exp s2 2Φ(v, s) εconstant s v3/2 exp s2 It can be shown that the erfi prediction rule is quantitatively stronger, and quite importantly, it makes the hyperparameter ε unitless (Zhang et al., 2024, Section 5). In contrast, in the more classical potential function Eq.(15), ε carries the unit of gradient squared , which means the algorithm requires a guess of the time-uniform Lipschitz constant maxt gt at the beginning of the game. Due to the discussion in Section 2.2 (right after introducing the rescaling trick), this suffers from certain suboptimality (Remark B.4) when applied with rescaling. Although most of the heavy lifting is handled by the choice of Φ, a remaining issue is that with h = 0, the prediction xt = 2Φ(vt, st) is not necessarily positive, which violates the domain constraint [0, ). To fix this issue, we adopt the technique from (Cutkosky & Orabona, 2018; Cutkosky, 2020) which has become a standard tool for the community: predict xt = Π[0, )( xt) which is the projection of xt = 2Φ(vt, st) to the domain [0, ), and update the sufficient statistics (vt+1, st+1) using a surrogate loss gradient gt,clip (Line 7 of Algorithm 2) instead of gt,clip (the clipping will be explained later; for now we might regard gt,clip = gt, the actual gradient). The intuition behind Line 7 is that, if the unprojected prediction xt = 2Φ(vt, st) is already negative and the actual loss gradient gt,clip encourages it to be more negative , then we set gt,clip = 0 in the update of the (vt+1, st+1) pair to Discounted Adaptive Online Learning: Towards Better Regularization avoid this undesirable behavior (unprojected prediction drifting away from the domain). In other situations, it is fine to use gt,clip directly in the update of (vt+1, st+1), therefore we simply set gt,clip = gt,clip. Rigorously, (Cutkosky, 2020, Theorem 2) shows that for all comparator u [0, ), gt,clip, xt u gt,clip, xt u . That is, as long as the unprojected prediction sequence x1:T guarantees a good regret bound (in an improper manner, i.e., violating the domain constraint), then the projected prediction sequence x1:T also guarantees a good regret bound, but properly. Another twist is related to updating the auxiliary parameter h, which was previously ignored. Due to the typical limitation of FTRL algorithms, we have to guess the range of gt (i.e., a time-varying Lipschitz constant Gt such that gt Gt) right before making the prediction xt, and ht serves as this guess. (Cutkosky, 2019) suggests using the range of past loss gradients ht = maxi [1:t 1] |gi| to guess that |gt| ht. Surely this could be wrong: in that case (|gt| > ht), we clip gt to [ ht, ht] before sending it to the (vt+1, st+1) update. We also clarify a possible confusion related to guessing the Lipschitzness . We have argued that if an algorithm requires guessing the time-uniform Lipschitz constant maxt gt at the beginning of the game, then it does not serve as a good base algorithm A in our rescaling trick. The use of ht is different: it is updated online as a guess of gt , which is fine (to apply with rescaling). 
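To make the preceding description concrete, here is a rough, unoptimized Python sketch of Algorithm 2 (using numpy and scipy.special.erfi). Rather than transcribing the closed-form derivative in Line 3, it evaluates the potential $\Phi_{h,\varepsilon}$ directly, via the identity $\int_0^x \mathrm{erfi}(u)\,du = x\,\mathrm{erfi}(x) - (e^{x^2} - 1)/\sqrt{\pi}$, and takes its partial derivative in $s$ by a finite difference. The function names, the numerical guard, and this differentiation shortcut are our own choices, so the sketch conveys the mechanics of the algorithm but is not the paper's implementation.

```python
# A rough sketch of the undiscounted 1D magnitude learner (Algorithm 2).
# Assumption: the s-derivative of the potential is computed by finite
# differences instead of the paper's closed form.
import numpy as np
from scipy.special import erfi

def int_erfi(x):
    """Integral of erfi from 0 to x, in closed form."""
    return x * erfi(x) - (np.exp(x ** 2) - 1.0) / np.sqrt(np.pi)

def potential(v, s, h, eps):
    """Phi_{h,eps}(v, s) = eps * sqrt(v + 2hs + 16h^2) * (int_erfi(s / (2 sqrt(...))) - 1)."""
    w = max(v + 2.0 * h * s + 16.0 * h ** 2, 1e-12)  # small guard, for the sketch only
    return eps * np.sqrt(w) * (int_erfi(s / (2.0 * np.sqrt(w))) - 1.0)

def d2_potential(v, s, h, eps, delta=1e-6):
    """Finite-difference partial derivative of Phi with respect to its second argument."""
    return (potential(v, s + delta, h, eps) - potential(v, s - delta, h, eps)) / (2.0 * delta)

def magnitude_learner(gradients, eps=1.0):
    """Run the learner on a sequence of 1D loss gradients; return its predictions."""
    v, s, h = 0.0, 0.0, 0.0
    xs = []
    for g in gradients:
        x_tilde = 0.0 if h == 0.0 else d2_potential(v, s, h, eps)
        x = max(x_tilde, 0.0)                 # project onto [0, inf)
        xs.append(x)
        g_clip = float(np.clip(g, -h, h))     # clip to the current hint h_t
        h = max(h, abs(g))                    # update the hint
        # Surrogate gradient: drop g_clip if it pushes the unprojected iterate
        # further outside the domain (x_tilde < 0 and g_clip > 0).
        g_surr = g_clip if g_clip * x_tilde >= g_clip * x else 0.0
        v, s = v + g_surr ** 2, s - g_surr
    return xs

# Tiny usage example with hypothetical gradients.
rng = np.random.default_rng(0)
print(magnitude_learner(rng.uniform(-1, 1, size=20))[-5:])
```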
In terms of the concrete guarantee, the following theorem characterizes the undiscounted regret bound of Algorithm 2. The proof is a straightforward corollary of (Zhang et al., 2024, Lemma B.2) and (Cutkosky, 2020, Theorem 2), therefore omitted. Theorem 8 ((Cutkosky, 2020; Zhang et al., 2024)). Algorithm 2 guarantees for all time horizon T N+, loss gradients g1:T and comparator u [0, ), t=1 gt(xt u) ε t=1 g2 t + 2GS + 16G2 + u S + t=1 |gt gt,clip| |xt u| , where G = maxt [1:T ] |gt| and S = 8G 1 + p log(2uε 1 + 1) 2 + 2 t=1 g2 t + 16G2 1 + p log(2uε 1 + 1) . Remark B.2 (Iterate stability). The above bound has an iterate stability term PT t=1 |gt gt,clip| |xt u|, which can be further upper-bounded by (u+maxt [1:T ] xt)GT . Similar characterizations of stability have appeared broadly in stochastic optimization before (Orabona & P al, 2021; Ivgi et al., 2023; Orabona, 2023b). For the general adversarial setting we consider, (Cutkosky, 2019) suggests using artificial constraints to trade maxt xt for terms that only depend on g1:T and u. We do not take this route because (i) it makes our discounted algorithm more complicated but does not seem to improve the performance; and (ii) even without artificial constraints, the prediction magnitude is indeed controlled in our main application (online conformal prediction), and possibly in the downstream stochastic setting as well (following (Orabona, 2023b)). Given the above undiscounted 1D subroutine, we can extend it to Rd following the standard polar-decomposition technique (Cutkosky & Orabona, 2018). Overall, the algorithm becomes the λt = 1 special case of Algorithm 3 (our main algorithm presented in Section B.3). This is as expected, since the discounted setting is a strict generalization. The following theorem is essentially (Zhang et al., 2024, Theorem 2 + the discussion after that). Theorem 9 ((Zhang et al., 2024)). With λt = 1 for all t, Algorithm 3 from Section B.3 guarantees for all T N+, loss gradients g1:T and comparator u Rd, t=1 gt, xt u O u p VT log( u ε 1) u GT log( u ε 1) + t=1 gt gt,clip xt u , where VT and GT are defined in Eq.(3), and O( ) is in the regime of large VT (VT GT ) and large u ( u ε). Furthermore, if the comparator u = 0, we have t=1 gt, xt O ε p t=1 gt gt,clip xt . Discounted Adaptive Online Learning: Towards Better Regularization B.3. Analysis of the main algorithm This subsection presents our main discounted algorithm and its regret bound. We start from its 1D magnitude learner. Theorem 3. Given any hyperparameter ε > 0, Algorithm 1 guarantees for all time horizon T N+, loss sequence l1:T , comparator u [0, ) and stability window length τ [1 : T], Rλ1:T T (l1:T , u) ε q VT + 2GT S + 16G2 T + u (S + GT ) + max t [T τ+1:T ] xt ! max t [1:T τ] xt where S = 8GT 1 + p log(2uε 1 + 1) 2 + 2 q VT + 16G2 T 1 + p log(2uε 1 + 1) . Proof of Theorem 3. The proof mostly follows from carefully checking the equivalence of the following two algorithms: (i) Algorithm 1 (the discounted algorithm) on a sequence of loss gradients g1:T , and (ii) Algorithm 2 (the undiscounted algorithm) on the sequence of scaled surrogate gradients, Qt 1 i=1 λ 1 i gt; t [1 : T]. Since the quantities in Algorithm 1 and 2 follow the same notation, we separate them by adding a superscript D on quantities in Algorithm 1, and ND in their Algorithm 2 counterparts. We show this by induction: suppose for some t [1 : T], we have s ND t , v D t = v ND t , h D t = Such an induction hypothesis holds for t = 1. 
Then, from the prediction rules (and the fact that all the discount factors are strictly positive), the unprojected predictions x D t = x ND t , and the projected predictions x D t = x ND t . Now consider the gradient clipping. Since g ND t = Qt 1 i=1 λ 1 i g D t , g ND t,clip = Π[ h ND t ,h ND t ] g ND t = c 1Π[ ch ND t ,ch ND t ] cg ND t ( c > 0) Π[ λt 1h D t ,λt 1h D t ] g D t (c = Qt 1 i=1 λi) g D t,clip. Then, due to x D t = x ND t and x D t = x ND t , we have g ND t,clip = Qt 1 i=1 λ 1 i g D t,clip. Finally, consider the updates of s D t+1, v D t+1 and h D t+1, as well as their Algorithm 2 counterparts. s D t+1 = λt 1s D t g D t,clip = g ND t,clip = ! s ND t g ND t,clip = h D t+1 = max λt 1h D t , g D t = max That is, the induction hypothesis holds for t + 1, and therefore we have shown the equivalence of the considered two algorithms. Discounted Adaptive Online Learning: Towards Better Regularization As for the regret bound of Algorithm 1, combining the reduction Eq.(6) and the regret bound of Algorithm 2 (Theorem 8 from Appendix B.2) immediately gives us Rλ1:T T (l1:T , u) ε q VT + 2GT S + 16G2 T + u S + |gt gt,clip| |xt u| , where S = 8GT 1 + p log(2uε 1 + 1) 2 + 2 q VT + 16G2 T 1 + p log(2uε 1 + 1) . The remaining clipping error term can be bounded similarly as (Cutkosky, 2019, Theorem 2). For any τ [1 : T], |gt gt,clip| |xt u| max t [1:T τ] |xt u| |gt gt,clip| = max t [1:T τ] |xt u| (ht+1 λt 1ht) (from Algorithm 1) = max t [1:T τ] |xt u| = max t [1:T τ] |xt u| max t [1:T τ] xt + u T τ 1 Y |gt gt,clip| |xt u| max t [T τ+1:T ] xt + u " T 1 Y Finally, multiplying QT 1 t=1 λt and using GT = h T +1 and GT τ = h T τ+1 complete the proof. As for the extension to Rd following (Cutkosky & Orabona, 2018), we present the pseudocode as Algorithm 3. There is a small twist: when applying Algorithm 1 as the 1D subroutine, in its Line 6 we set gt,clip = gt, and ht+1 is given by the meta-algorithm. That is, the gradient clipping is handled on the high level (Algorithm 3) rather than the low level (Algorithm 1). Algorithm 3 Discounted adaptivity on Rd. 1: Define A1d as a minor variant of Algorithm 1 (with hyperparameter ε), where its Line 6 is replaced by: Set gt,clip = gt, and receive a hint ht+1. 2: Define AB as the algorithm from Theorem 2, on the d-dimensional unit L2 norm ball (with D = 2). 3: Initialize h1 = 0. 4: for t = 1, 2, . . . do 5: Query A1d for its prediction yt R. 6: Query AB for its prediction wt Rd; wt 1. 7: Predict xt = wtyt, receive the loss gradient gt Rd and the discount factor λt (0, ). 8: Update ht+1 = max (λt 1ht, gt ), and define gt,clip = gtλt 1ht/ht+1 (if ht+1 = 0 then gt,clip = 0). 9: Send gt,clip, wt and gt,clip as the surrogate loss gradients to A1d and AB, respectively. 10: Send the discount factor λt to A1d and AB. 11: Send the hint ht+1 to A1d. 12: end for Algorithm 3 induces our main theorem (Theorem 4). The proof combines the undiscounted regret bound (Theorem 9) and our rescaling trick. It is almost the same as the above proof of Theorem 3, therefore omitted. Discounted Adaptive Online Learning: Towards Better Regularization Remark B.3 (Importance of forgetting). Different from the undiscounted algorithm in (Zhang et al., 2024), the above discounted generalization rapidly forgets the past observations (which could be misleading for the future). Such an idea might even be useful in stochastic convex optimization as well, in particular regarding functions with spatially inhomogeneous Lipschitzness. 
An example is the quadratic function, which is both strongly convex and smooth. Globally it is not Lipschitz, but near its minimizer the Lipschitz constant is small. When optimizing such a function, if an optimization algorithm internally estimates the Lipschitz constant from its historical observations, then this estimate could be overly conservative as the iterates move closer to the minimizer. An intuitive solution is to gradually forget the past, just like the idea of discounted online learning algorithms. Rigorously characterizing this intuition could be an exciting direction, which we defer to future works. Remark B.4 (Benefit of (Zhang et al., 2024)). We used (Zhang et al., 2024) as the base algorithm in our scaling trick, but there are other options (Mhammedi & Koolen, 2020; Jacobsen & Cutkosky, 2022). The problem of such alternatives is that, they still need an estimate of the time-uniform Lipschitz constant at the beginning of the game, due to certain unit inconsistency (discussed in Appendix B.2, e.g., Eq.(15)). In the undiscounted setting, they guarantee t=1 gt, xt u O u p VT log( u T) u GT log( u T) + t=1 gt gt,clip xt u , as opposed to Theorem 9 in this paper. After the scaling trick, the T dependence in this bound will be transferred to the obtained discounted regret bound, such that the latter also depends on the end time T. In other words, such a discounted regret bound is not lifetime consistent . C. Detail of Section 3 This section presents omitted details of our OCP application. First, we present the pseudocode of ACP in Appendix C.1, with the notations (relevant quantities are with the superscipt ) from the OCP setting. All the concrete proofs are presented in Appendix C.2. C.1. Pseudocode of the OCP algorithm The pseudocode of ACP is Algorithm 4. This is equivalent to directly applying Algorithm 1, our main 1D OCO algorithm, to the setting of Section 3. In particular, the problem structure of OCP allows removing the surrogate loss construction (Line 7 of Algorithm 1), which makes the algorithm slightly simpler. To see this, notice that the surrogate loss gt,clip there is only needed (i.e., does not equal gt,clip) when the unprojected prediction xt < 0 and the projected prediction xt = 0. In the OCP notation, this means that our radius prediction rt = 0 r t , therefore the subgradient g t evaluated at rt satisfies g t 0. Back to the notation of Algorithm 1, we have gt 0, therefore after clipping gt,clip 0. Putting things together, gt,clip ( xt xt) 0. That is, the condition in Algorithm 1 that triggers gt,clip = 0 is impossible, therefore gt,clip always equals gt,clip. C.2. Omitted proofs Moving to the analysis, we first prove the key lemma connecting the prediction magnitude of ACP to the coverage metric S t , Eq.(11). This is divided into two steps. First (Lemma C.1), we approximate the prediction rule of ACP , such that rt+1 characterizes the discounted sum of clipped gradients g i,clip, (16) which is an internal quantity of Algorithm 4. The second step (Lemma C.2) is connecting S t,clip to the unclipped version S t , Eq.(11). This is the version presented in the main paper. Discounted Adaptive Online Learning: Towards Better Regularization Algorithm 4 The proposed OCP algorithm ACP . Require: Hyperparameter ε > 0 (default ε = 1). 1: Initialize S 0,clip = 0, V 0,clip = 0, G 0 = 0. 2: for t = 1, 2, . . . do 3: If G t 1 = 0, define the unprojected prediction rt = 0. 
Otherwise, rt = ε erfi V t 1,clip + 2G t 1S t 1,clip + 16 G t 1 2 V t 1,clip + 2G t 1S t 1,clip + 16 G t 1 2 exp S t 1,clip 2 4 V t 1,clip + 2G t 1S t 1,clip + 16 G t 1 2 4: Predict rt = Π[0, ) ( rt). 5: Receive the OCP loss gradient g t R and the discount factor λt 1 (0, ). 6: Clip g t by defining g t,clip = Π[ λt 1G t 1,λt 1G t 1] (g t ) . 7: Compute running statistics of past observations, g i,clip, V t,clip = g i,clip 2 , G t = max i [1:t] Lemma C.1. Consider S t,clip from Eq.(16). ACP (Algorithm 4) guarantees for all t, S t,clip 2 q V t,clip 1 + p log (1 + 2rt+1ε 1) + 13G t 1 + p log (1 + 2rt+1ε 1) 2 , where V t,clip and G t are internal quantities of Algorithm 4. Proof of Lemma C.1. Throughout this proof we will consider the internal variables of Algorithm 4. The superscript are always removed to simplify the notation. Assume without loss of generality that internally in Algorithm 4, Gt = 0 for the considered t. Otherwise, all the gradients g1, . . . , gt are zero, which makes the statement of the lemma obvious. To upper-bound St,clip, the analysis is somewhat similar to the control of Fenchel conjugate in (Zhang et al., 2024, Lemma B.1), although the latter serves a different purpose. Consider the input argument of the erfi function in the definition of rt+1. There are two cases. Case 1: St,clip < 2 p Vt,clip + 2Gt St,clip + 16G2 t. With straightforward algebra, S2 t,clip 8Gt St,clip 4 Vt,clip + 16G2 t < 0, St,clip 4Gt + q 16G2 t + 4 (Vt,clip + 16G2 t) 2 p Vt,clip + 13Gt. Case 2: St,clip 2 p Vt,clip + 2Gt St,clip + 16G2 t. Since Gt = 0, Algorithm 4 predicts rt+1 = Π[0, )( rt+1), where rt+1 = ε erfi St,clip 2 p Vt,clip + 2Gt St,clip + 16G2 t Discounted Adaptive Online Learning: Towards Better Regularization Vt,clip + 2Gt St,clip + 16G2 t exp " S2 t,clip 4(Vt,clip + 2Gt St,clip + 16G2 t) Due to a lower estimate of the erfi function (Zhang et al., 2024, Lemma A.3), for all x 1, erfi(x) exp(x2)/2x. Then, St,clip 2 p Vt,clip + 2Gt St,clip + 16G2 t Vt,clip + 2Gt St,clip + 16G2 t St,clip exp " S2 t,clip 4(Vt,clip + 2Gt St,clip + 16G2 t) and simple algebra characterizes the multiplier on the RHS, p Vt,clip + 2Gt St,clip + 16G2 t St,clip 2Gt p Vt,clip + 2Gt St,clip + 16G2 t = Vt,clip + 16G2 t St,clip p Vt,clip + 2Gt St,clip + 16G2 t 0. St,clip 2 p Vt,clip + 2Gt St,clip + 16G2 t " S2 t,clip 4(Vt,clip + 2Gt St,clip + 16G2 t) Vt,clip + 2Gt St,clip + 16G2 t 2St,clip Gt p Vt,clip + 2Gt St,clip + 16G2 t St,clip 2 p Vt,clip + 2Gt St,clip + 16G2 t Due to another estimate of erfi 1 (Zhang et al., 2024, Lemma A.4), for all x 0 we have erfi 1(x) 1+ p log(x + 1). Then, St,clip 2 q Vt,clip + 2Gt St,clip + 16G2 t erfi 1 2rt+1ε 1 Vt,clip + 2Gt St,clip + 16G2 t 1 + p log (1 + 2rt+1ε 1) . Similar to the algebra of Case 1, St,clip 2 p Vt,clip 1 + p log (1 + 2rt+1ε 1) + 13Gt 1 + p log (1 + 2rt+1ε 1) 2 . Combining the two cases, we have St,clip 2 p Vt,clip 1 + p log (1 + 2rt+1ε 1) + 13Gt 1 + p log (1 + 2rt+1ε 1) 2 . That is, St,clip is now bounded from the above by O( p On the other side, we now consider bounding St,clip from below. This is a classical induction argument similar to (Zhang et al., 2024, Lemma 4.1). Consider any time index i [1 : t], If Si 1,clip 0, then according to the prediction rule of Algorithm 4, we have ri = 0 regardless of the value of Gi 1. Then, due to the structure of OCP, we have gi 0, thus gi,clip 0 and Si,clip = λi 1Si 1,clip gi,clip λi 1Si 1,clip. If Si 1,clip > 0, then Si,clip λi 1Si 1,clip |gi,clip| Gi. 
Combining these two and using an induction from S0,clip = 0, we have St,clip Gt. Summarizing the above completes the proof. Discounted Adaptive Online Learning: Towards Better Regularization The following lemma is the full version of Lemma 3.1 presented in Section 3; we further use V t,clip V t there. Lemma C.2. ACP guarantees for all t, V t,clip 1 + p log (1 + 2rt+1ε 1) + 14G t 1 + p log (1 + 2rt+1ε 1) 2 , where V t,clip and G t are internal quantities of Algorithm 4. Proof of Lemma C.2. Given Lemma C.1, the remaining task is connecting |S t,clip| to |S t |. To this end, |S t | S t,clip + g i g i,clip , and the sum on the RHS is at most G t following the proof of Theorem 3. The next lemma exploits the bounded domain assumption. Lemma C.3. If maxtr t D, then without knowing D, ACP guarantees for all t, V t,clip 1 + p log (1 + 2Dε 1) + 15G t 1 + p log (1 + 2Dε 1) 2 , where V t,clip and G t are internal quantities of Algorithm 4. Proof of Lemma C.3. Consider S t,clip defined in Eq.(16). We now use induction to show that S t,clip 2 q V t,clip 1 + p log (1 + 2Dε 1) + 14G t 1 + p log (1 + 2Dε 1) 2 , and after that, we complete the proof using |S t | S t,clip + G t , just like Lemma C.2. Concretely, note that such a statement on S t,clip trivially holds for t = 1. Then, suppose it holds for any t. In the t + 1-th round, there are two cases. Case 1: rt+1 D. Due to Lemma C.1, we have S t,clip 2 q V t,clip 1 + p log (1 + 2rt+1ε 1) + 13G t 1 + p log (1 + 2rt+1ε 1) 2 V t,clip 1 + p log (1 + 2Dε 1) + 13G t 1 + p log (1 + 2Dε 1) 2 . S t+1,clip λt S t,clip + g t+1,clip λ2 t V t,clip 1 + p log (1 + 2Dε 1) + 13λt G t 1 + p log (1 + 2Dε 1) 2 + g t+1,clip V t+1,clip 1 + p log (1 + 2Dε 1) + 14G t+1 1 + p log (1 + 2Dε 1) 2 . Case 2: rt+1 > D. In this case, rt+1 > r t+1 so the OCP gradient g t+1 0, and the clipped gradient g t+1,clip 0. Meanwhile, in order to have rt+1 > D 0, we must have S t,clip 0 from the prediction rule. Then, due to the same signs of g t+1,clip and S t,clip, we have S t+1,clip = λt S t,clip g t+1,clip Discounted Adaptive Online Learning: Towards Better Regularization λt S t,clip g t+1,clip λ2 t V t,clip 1 + p log (1 + 2Dε 1) + 14λt G t 1 + p log (1 + 2Dε 1) 2 V t+1,clip 1 + p log (1 + 2Dε 1) + 14G t+1 1 + p log (1 + 2Dε 1) 2 . Combining the two cases, the induction statement holds in the t + 1-th round. Finally we use |S t | S t,clip + G t . With everything above, Theorem 5 is a simple corollary. Theorem 5. Without knowing D, ACP guarantees that for all T N+, we have the discounted coverage bound V T log(Dε 1) G T log(Dε 1) , and the discounted regret bound from Theorem 3. Proof of Theorem 5. The regret bound trivially applies. The coverage bound follows from Lemma C.3 and the fact that V t,clip V t . Then we consider the asymptotic regime of D ε. D. Detail of Experiment This section presents details of our experiment. Our setup builds on the great work of (Bhatnagar et al., 2023). Setup We test our OCP algorithm (Algorithm 4, which is based on OCO Algorithm 1 and 2) in the context of classifying altered images that arrive in a sequential manner. Given a parameterized prediction set Ct( ) dependent on the label provided by a base ML model, at each time step our algorithms predict the radius rt that corresponds to the uncertainty of the base model s prediction, resulting in the prediction set Ct(rt). We adopt the procedure, code, and base model from (Bhatnagar et al., 2023). 
Given a sequence of images, we expect that if the images are increasingly corrupted by blur, noise, and other factors, the prediction set size must increase to account for the deviation from the base model's training distribution. We hypothesize that the rate of such a distribution shift also affects the OCP algorithms' performance, thus we test the cases where the corruption levels shift suddenly versus gradually. Our algorithms are specifically designed to not use knowledge of the maximum magnitude $D$ of the optimal radius $r^*_t$ (i.e., the maximum uncertainty level). In some prediction scenarios, it is conceivable that $D$ is impossible to know a priori and thus cannot be used as a hyperparameter. In contrast, certain algorithms from the literature use an empirical estimate of $D$ from an offline dataset, which amounts to "oracle tuning". These include the Strongly Adaptive Online Conformal Prediction (SAOCP) and Scale-Free Online Gradient Descent (SF-OGD) algorithms proposed by (Bhatnagar et al., 2023). In our study, we create a modified version of SF-OGD, called Simple OGD, that does not use such oracle tuning. The only hyperparameter we set is the learning rate, which we set to 1 for Simple OGD. Note that despite the name, Simple OGD is also gradient adaptive, which differs from the ACI algorithm of (Gibbs & Candes, 2021).

Baselines  We perform OCP with ten different algorithms:

1. MAGL-D: We test our Algorithm 1 with $\varepsilon = 1$ and discount factor $\lambda_t = 0.999$, which we name MAGL-D (Magnitude Learner with Lipschitz Constant Estimate and Discounting).

2. MAGL: We test Algorithm 2, the undiscounted algorithm that uses the running estimate $h_t$ of the Lipschitz constant, with $\varepsilon = 1$. We name this algorithm MAGL (Magnitude Learner with Lipschitz Constant Estimate).

3. MAGDIS: We also test Algorithm 5 with $\varepsilon = 1$ and $\lambda_t = 0.999$, which is a simplified version of Algorithm 1 that essentially sets $h_t = 0$, does not clip $g_t$, and initializes $v_1 > 0$. We name this algorithm MAGDIS (Magnitude Learner with Discounting).

Using the implementation from (Bhatnagar et al., 2023), we also obtain results for SAOCP, SF-OGD, Simple OGD, Standard Split Conformal Prediction (SCP) (Vovk et al., 2005), Non-Exchangeable SCP (NExCP) (Barber et al., 2023), Fully-Adaptive Conformal Inference (FACI) (Gibbs & Candès, 2022), and FACI-S (Bhatnagar et al., 2023).

Algorithm 5  Modified 1D magnitude learner on $[0, \infty)$.
Require: Hyperparameter $\varepsilon > 0$.
1: Initialize parameters $v_1 > 0$, $s_1 = 0$.
2: for $t = 1, 2, \ldots$ do
3: Define the unprojected prediction $\tilde{x}_t = \varepsilon \, \mathrm{erfi}\!\left(s_t / (2\sqrt{v_t})\right)$.
4: Predict $x_t = \Pi_{[0,\infty)}(\tilde{x}_t)$, the projection of $\tilde{x}_t$ to the domain $[0, \infty)$.
5: Receive the 1D loss gradient $g_t \in \mathbb{R}$ and the discount factor $\lambda_{t-1} \in (0, 1]$.
6: If $g_t \tilde{x}_t < g_t x_t$, define a surrogate loss gradient $\tilde{g}_t = 0$. Otherwise, $\tilde{g}_t = g_t$.
7: Update $v_{t+1} = \lambda_{t-1}^2 v_t + \tilde{g}_t^2$, $s_{t+1} = \lambda_{t-1} s_t - \tilde{g}_t$.
8: end for

Metrics  We choose a targeted coverage rate of 90% for all experiments, which means $\alpha = 0.1$. To quantify the algorithms' performance, we follow the definitions from (Bhatnagar et al., 2023) to evaluate the coverage (local and average), the prediction width (local and average), the worst-case local coverage error (LCE$_k$), and the runtime. They are defined as follows. First, let $Y_t$ be the true label of the $t$-th image, and for brevity, let $\hat{C}_t := C_t(r_t)$ be the $t$-th prediction set over the labels.
Then, $\mathrm{err}_t$ is the indicator of miscoverage at time $t$:

$$\mathrm{err}_t := \mathbf{1}\left[ Y_t \notin \hat{C}_t \right],$$

i.e., $\mathrm{err}_t = 1$ if the prediction set $\hat{C}_t$ does not include the true label $Y_t$.

Local Coverage  Over a sliding window of length $k$, the local coverage is defined as

$$\mathrm{LocalCoverage}(t) := \frac{1}{k} \sum_{i=t}^{t+k-1} (1 - \mathrm{err}_i).$$

For all trials, we used an interval length of $k = 100$.

Average Coverage  The average coverage is similarly defined, but averaged over the total number of time steps $T$. For all experiments, $T = 6011$.

$$\mathrm{Avg.\ Coverage} := \frac{1}{T} \sum_{t=1}^{T} (1 - \mathrm{err}_t).$$

Local Width  The local width is the cardinality of $\hat{C}_t$, averaged over the length-$k$ time window:

$$\mathrm{LocalWidth}(t) := \frac{1}{k} \sum_{i=t}^{t+k-1} |\hat{C}_i|.$$

It is compared to the best fixed local width, defined as follows. If we let $C^*_t := C_t(r^*_t)$ denote the optimal prediction set had we known the optimal radius $r^*_t$ beforehand, then the best fixed local width $\mathrm{LocalWidth}^*(t)$ is defined as the $(1 - \alpha)$-quantile of $\{|C^*_i| \,;\, i \in [t : t + k - 1]\}$.

Average Width  Similarly, the average width is defined as

$$\mathrm{Avg.\ Width} := \frac{1}{T} \sum_{t=1}^{T} |\hat{C}_t|.$$

Local Coverage Error (LCE)  The LCE over sliding windows of length $k$ is defined as

$$\mathrm{LCE}_k := \max_{[\tau, \tau+k-1] \subseteq [1, T]} \left| \alpha - \frac{1}{k} \sum_{i=\tau}^{\tau+k-1} \mathrm{err}_i \right|.$$

Essentially, it is the largest deviation of the empirical miscoverage rate (evaluated over sliding time windows of length $k$) from the targeted miscoverage rate $\alpha$. For concreteness, a short code sketch of these metrics is given at the end of this section.

Main results  Our main results are shown in Figure 1 and Table 1 (Section 4). The purpose of Figure 1 is to demonstrate the dependence of the local coverage and the local width on (1) sudden and (2) gradual distribution shifts (i.e., the time-varying corruption level).9 The purpose of Table 1 is to summarize the performance of our algorithms and compare them to the baselines; the results there correspond to the case of sudden distribution shift.

Figure 1. The local coverage (first row), local width (second row), and corresponding corruption level (third row) of our algorithms. Results are obtained using corrupted versions of Tiny ImageNet, with time-varying corruption level (distribution shift). (Left) Results for sudden changes in corruption level. (Right) Results for gradual changes in corruption level. The distribution shifts every 500 steps. Moving averages are plotted with a window size of 100 time steps (k = 100). (Legend: Best Fixed, MAGL-D, MAGL, MAGDIS.)

Figure 1 justifies the validity of our algorithms. Specifically, the local coverage of our algorithms fluctuates around the target coverage of 0.9 for both sudden and gradual distribution shifts. Similarly, the local width approximately replicates the best fixed local width $\mathrm{LocalWidth}^*(t)$, as shown in Figure 1 (middle row).

Hyperparameter sensitivity  We also test the algorithms' sensitivities to the offline estimate of $D$, which we denote as $D_{\mathrm{est}}$. The baselines SAOCP and SF-OGD require this parameter to initialize, while Simple OGD and all three of our algorithms do not. To this end, we rerun the experiments above and change the ratio of $D_{\mathrm{est}}$ to the true value $D$. The following settings are tested: $D_{\mathrm{est}}/D \in \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. We plot the influence of $D_{\mathrm{est}}$ on the average coverage and the average width in Figure 2. For these experiments, we use the case where the image corruptions are sudden. Coverage much larger or much smaller than 0.9 is not desirable; given satisfactory coverage, lower width is desirable.
As $D_{\mathrm{est}}/D$ increases, both the average coverage and the average width increase for SAOCP and SF-OGD; conversely, both decrease as $D_{\mathrm{est}}/D$ decreases. When $D_{\mathrm{est}} = D$, Simple OGD and SF-OGD have approximately equivalent performance. This is not to say that using an offline estimate of $D$ does not improve performance. The catch is that, for this particular experimental dataset, setting the learning rate of Simple OGD to 1 coincidentally happens to be a well-picked learning rate, given the true value of the maximum radius magnitude $D$. In contrast, our magnitude learners remain unaffected by variations in $D_{\mathrm{est}}/D$, since they do not need $D_{\mathrm{est}}$ to initialize. The baselines SCP, NExCP, FACI, and FACI-S are not presented in Figure 2, so as to better highlight the performance of SAOCP, SF-OGD, and our magnitude learners. Benchmarks on these additional baselines can be found in (Bhatnagar et al., 2023).

9For better visibility, a 1D Gaussian filter is applied to the local width and the local coverage plots, same as in (Bhatnagar et al., 2023).

Figure 2. The average coverage (first row) and average width (second row) as a function of the estimated maximum radius $D_{\mathrm{est}}$ relative to the true radius $D$. The performance of SF-OGD and SAOCP (Bhatnagar et al., 2023) is sensitive to $D_{\mathrm{est}}/D$. Averages are taken over the entire time horizon, where the total number of time steps is T = 6011. (Legend: Simple OGD, SF-OGD, SAOCP, MAGL-D, MAGL, MAGDIS.)

Figure 3. The runtime per time step of each algorithm, normalized to the runtime of Simple OGD. The runtime of SAOCP is the longest, due to it being a meta-algorithm that initializes a new SF-OGD instance at each time step.

Runtime  We also measure the time each algorithm takes to complete a single prediction. Runtime results are provided in Figure 3, normalized relative to the runtime of Simple OGD. Note the runtime standard deviations in Table 1: accounting for them, the runtime differences between Simple OGD, SF-OGD, MAGL, and MAGDIS are negligible. The runtime of SAOCP is the longest because it is a meta-algorithm that initializes a new SF-OGD instance at each time step; c.f., Appendix A.
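For concreteness, here is a small Python sketch of the evaluation metrics defined above. It assumes err is the 0/1 miscoverage sequence and widths is the sequence of prediction-set cardinalities; the helper names are ours, not from the released code.

```python
# A small sketch of the evaluation metrics above; `err` is the 0/1 miscoverage
# sequence (err_t = 1 means Y_t was not covered) and `widths` holds |C_hat_t|.
import numpy as np

def local_coverage(err, t, k=100):
    """Fraction of covered rounds in the window [t, t+k-1]."""
    return np.mean(1.0 - err[t : t + k])

def average_coverage(err):
    return np.mean(1.0 - err)

def local_width(widths, t, k=100):
    return np.mean(widths[t : t + k])

def average_width(widths):
    return np.mean(widths)

def lce(err, alpha=0.1, k=100):
    """Worst deviation of the windowed miscoverage rate from the target alpha."""
    windowed = np.convolve(err, np.ones(k) / k, mode="valid")  # all length-k windows
    return np.max(np.abs(windowed - alpha))
```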