# Gradient-Variation Online Learning under Generalized Smoothness

Yan-Feng Xie, Peng Zhao, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, China
School of Artificial Intelligence, Nanjing University, China
{xieyf, zhaop, zhouzh}@lamda.nju.edu.cn
Correspondence: Peng Zhao. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic optimization, and hence has received increasing attention. Existing results often require the smoothness condition by imposing a fixed bound on gradient Lipschitzness, which may be unrealistic in practice. Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-variation online learning under generalized smoothness. We extend the classic optimistic mirror descent algorithm to derive gradient-variation regret by analyzing stability over the optimization trajectory and exploiting smoothness locally. Then, we explore universal online learning, designing a single algorithm enjoying the optimal gradient-variation regrets for convex and strongly convex functions simultaneously, without requiring prior knowledge of curvature. This algorithm adopts a two-layer structure with a meta-algorithm running over a group of base-learners. To ensure favorable guarantees, we design a new Lipschitz-adaptive meta-algorithm, capable of handling potentially unbounded gradients while ensuring a second-order bound to effectively ensemble the base-learners. Finally, we provide applications to fast-rate convergence in games and stochastic extended adversarial optimization.

1 Introduction

We consider online convex optimization (OCO) [Hazan, 2016; Orabona, 2019], a flexible framework that models decision-making problems in an online fashion. At each round $t \in [T]$, an online learner is required to submit a decision $x_t$ from a convex compact set $\mathcal{X} \subseteq \mathbb{R}^d$, and the environment reveals a convex function $f_t : \mathcal{X} \to \mathbb{R}$. Then the learner suffers a loss $f_t(x_t)$ and updates her decision. The standard performance measure is the regret [Zinkevich, 2003], which benchmarks the cumulative loss of the learner against the best decision in hindsight, formally defined as
$$\mathrm{REG}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \quad (1)$$
Regret bounds of $O(\sqrt{T})$ and $O(\frac{1}{\lambda}\log T)$ are established for convex and $\lambda$-strongly convex functions, respectively [Zinkevich, 2003; Hazan et al., 2007]. While these results are known to be minimax optimal [Abernethy et al., 2008], in this paper we are more interested in obtaining gradient-variation regret guarantees, which replace the dependence on $T$ in the regret bounds by variations in the gradients of online functions [Chiang et al., 2012], defined as
$$V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2. \quad (2)$$
This quantity can be as small as a constant in stable environments where online functions remain fixed, and is at most $O(T)$ in the worst case under standard OCO assumptions, safeguarding the minimax results.
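To make the quantity in Eq. (2) concrete, here is a small numerical sketch we add for illustration: it estimates $V_T$ for a sequence of slowly drifting one-dimensional quadratics, approximating the supremum over $\mathcal{X} = [-1, 1]$ by a grid (the drift model and grid are our own illustrative choices, not part of the paper).

```python
import numpy as np

# Drifting quadratics f_t(x) = 0.5 * (x - c_t)^2 on X = [-1, 1], so that
# grad f_t(x) = x - c_t and the per-round variation is exactly (c_t - c_{t-1})^2.
T = 100
rng = np.random.default_rng(0)
c = np.cumsum(0.01 * rng.standard_normal(T))  # slowly drifting minimizers c_t
xs = np.linspace(-1.0, 1.0, 201)              # grid approximating sup over X

V_T = sum(
    np.max(((xs - c[t]) - (xs - c[t - 1])) ** 2)  # sup_x ||grad f_t - grad f_{t-1}||^2
    for t in range(1, T)
)
print(V_T, np.sum(np.diff(c) ** 2))  # the two values agree: V_T tracks the drift
```

With drift of scale $0.01$ per round, $V_T$ stays far below the worst-case $O(T)$, which is precisely the regime where gradient-variation bounds pay off.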
Besides this favorable adaptivity, recent studies have shown close relationships of gradient-variation online learning to various fields, including fast convergence in games [Rakhlin and Sridharan, 2013b; Syrgkanis et al., 2015; Zhang et al., 2022b] and robust stochastic optimization [Sachs et al., 2022; Chen et al., 2024]; hence it has received increasing attention [Zhao et al., 2020; Yan et al., 2023; Tsai et al., 2023; Ataee Tarzanagh et al., 2024; Zhao et al., 2024]. In online learning, the smoothness assumption is proved to be necessary for first-order algorithms to achieve gradient-variation regret bounds, as discussed in Remark 1 of Yang et al. [2014] and restated as Proposition 2 in Appendix B. Previous works typically rely on the global $L$-smoothness condition, imposing a fixed upper bound on the gradient Lipschitzness, i.e., requiring $\|\nabla^2 f_t(x)\|_2 \le L$ for all $t \in [T]$ and $x \in \mathcal{X}$. However, this global assumption restricts the applicability of the theory to loss functions that are quadratically bounded from above. Furthermore, recent studies in neural network optimization have observed phenomena where the global smoothness condition fails to model optimization dynamics effectively, especially for important types of neural networks such as LSTMs [Zhang et al., 2020b] and Transformers [Crawshaw et al., 2022]. Therefore, modern optimization has devoted efforts to generalizing the smoothness condition. For example, Zhang et al. [2020b] introduce $(L_0, L_1)$-smoothness, which assumes $\|\nabla^2 f(x)\|_2 \le L_0 + L_1\|\nabla f(x)\|_2$ for an offline objective function $f(\cdot)$. A notable generalization is the recent proposal of the $\ell$-smoothness condition [Li et al., 2023], which assumes $\|\nabla^2 f(x)\|_2 \le \ell(\|\nabla f(x)\|_2)$ with a link function $\ell(\cdot)$, significantly broadening previous assumptions through the flexibility of $\ell(\cdot)$. Given this, it is natural to ask how to design online algorithms that exploit generalized smoothness and obtain favorable gradient-variation regret guarantees.

In this paper, we provide a systematic study of gradient-variation online learning under generalized smoothness. We extend the classic optimistic online mirror descent (optimistic OMD) algorithm [Chiang et al., 2012; Rakhlin and Sridharan, 2013a] to derive gradient-variation regret bounds, achieving $O(\sqrt{V_T})$ regret and $O(\log V_T)$ regret for convex and strongly convex functions under generalized smoothness, respectively. We emphasize the importance of stability analysis across the optimization trajectory, which allows generalized smoothness to be exploited effectively in a local manner. Specifically, optimistic OMD maintains two sequences: the submitted decisions $\{x_t\}_{t=1}^{T}$ and the intermediate decisions $\{\widehat{x}_t\}_{t=1}^{T}$. We need to control algorithmic stability via appropriate step-size tuning and optimism design, ensuring that $x_t$ is sufficiently close to $\widehat{x}_t$ so as to exploit local smoothness at $\widehat{x}_t$.

Based on this development, we investigate universal online learning [van Erven and Koolen, 2016; Wang et al., 2019; Mhammedi et al., 2019; Zhang et al., 2022a; Yan et al., 2023; Yang et al., 2024], where the learner aims to design a single algorithm that simultaneously attains the optimal regret for both convex and strongly convex functions without prior knowledge of the curvature information. For this scenario, a common wisdom is to adopt an online ensemble consisting of a meta-base two-layer structure to handle the environmental uncertainty [Zhao et al., 2024], i.e., the unknown curvature of loss functions, where a meta-algorithm runs over a set of base-learners with different configurations.
The base-learners are basically instantiations of the developed variants of optimistic OMD, as mentioned earlier. However, designing the meta-algorithm is non-trivial and raises new challenges. The first challenge comes from the potentially unbounded smoothness, which might lead to unbounded Lipschitz constants as well. This requires the meta-algorithm to be Lipschitz-adaptive, adapting to the Lipschitzness on the fly. Furthermore, we also expect it to provide a second-order regret bound, technically required when analyzing the ensemble errors, and to enable predictions with optimism, thereby producing the gradient variation. The second challenge is the complexity introduced by the combination procedure inherent in the ensemble method, which further complicates the smoothness estimation, making it difficult to properly tune the meta-algorithm and exploit smoothness.

To this end, we address both challenges with a function-variation-to-gradient-variation conversion and a newly designed Lipschitz-adaptive meta-algorithm. The conversion technique, drawing inspiration from Bai et al. [2022], decouples the design between the meta and base levels and derives the gradient variation directly from function values, allowing us to avoid the cancellation-based analysis [Yan et al., 2023] for utilizing smoothness at the meta level. Nevertheless, this conversion requires the meta-algorithm to handle heterogeneous inputs due to certain technical considerations, and we are not aware of available algorithms satisfying all the requirements, motivating us to design a new algorithm. Based on optimistic Adapt-ML-Prod [Wei et al., 2016] and the clipping technique [Cutkosky, 2019], we present a new Lipschitz-adaptive meta-algorithm with a simpler algorithmic design, which can be of independent interest. With this algorithm, we can apply the function-variation-to-gradient-variation conversion to achieve the optimal results for both convex and strongly convex functions, up to doubly logarithmic factors in $T$, without knowing the curvature.

Our findings for gradient-variation online learning are useful for several important applications, including fast-convergence online games [Rakhlin and Sridharan, 2013a; Syrgkanis et al., 2015] and stochastic extended adversarial online learning [Sachs et al., 2022], where we establish new results under the generalized smoothness condition.

The rest of the paper is organized as follows. Section 2 provides preliminaries and key ideas for exploiting generalized smoothness throughout the trajectory. In Section 3 we study universal online learning and present our key meta-algorithm. Section 4 discusses applications. Related work is provided in Appendix A. All proofs can be found in the remaining appendices (Appendices B-D).

2 Gradient-Variation Online Learning under Generalized Smoothness

In this section, we first introduce the problem setup, including the formal definition of generalized smoothness and other assumptions used in the paper. We then extend the optimistic online mirror descent framework to achieve gradient-variation regret bounds under generalized smoothness.

2.1 Problem Setup: Generalized Smoothness and Assumptions

Recent studies [Zhang et al., 2020b; Chen et al., 2023b] extend the global smoothness condition by allowing the smoothness to positively correlate with the gradient norm, where a particular function is required to model this relationship.
Zhang et al. [2020b] introduce the $(L_0, L_1)$-smoothness condition, where the smoothness is upper bounded by a linear function of the gradient norm, i.e., $\|\nabla^2 f(x)\|_2 \le L_0 + L_1\|\nabla f(x)\|_2$. Li et al. [2023] further generalize this by imposing a weaker assumption on the link function and propose the generalized smoothness defined as follows.

Definition 1 ($\ell$-smoothness). A twice-differentiable function $f : \mathcal{X} \to \mathbb{R}$ is called $\ell$-smooth for some non-decreasing continuous link function $\ell : [0, +\infty) \to (0, +\infty)$ if it satisfies $\|\nabla^2 f(x)\|_2 \le \ell(\|\nabla f(x)\|_2)$ for any $x \in \mathcal{X}$.

The mild requirement on the link function $\ell(\cdot)$ allows for considerable generality. By selecting a linear link function, $\ell$-smoothness immediately recovers $(L_0, L_1)$-smoothness [Zhang et al., 2020b]. Furthermore, it has been shown that $\ell$-smoothness covers a wide class of functions including rational, logarithmic, and self-concordant functions [Li et al., 2023]. Based on this generalized smoothness notion, we now provide the formal assumption on the smoothness of online functions.

Assumption 1 (generalized smoothness). The online function $f_t : \mathcal{X} \to \mathbb{R}$ is $\ell_t$-smooth in an open set containing $\mathcal{X} \subseteq \mathbb{R}^d$ for $t \in [T]$, and the learner can query $\ell_t$ at any given point $x \in \mathcal{X}$.

We also require a standard bounded-domain assumption from the OCO literature [Hazan, 2016].

Assumption 2 (bounded domain). The feasible domain $\mathcal{X} \subseteq \mathbb{R}^d$, which contains the origin $0$, is non-empty and closed with its diameter bounded by $D$, i.e., $\|x - y\|_2 \le D$ for any $x, y \in \mathcal{X}$.

We do not assume prior knowledge of the Lipschitz constant of the online functions. In fact, the unboundedness of the smoothness may result in unbounded Lipschitz constants. If a Lipschitz upper bound were known, the generalized smoothness condition would be trivialized, as it would allow us to directly compute an upper bound on the smoothness constant. Furthermore, following the discussion in Jacobsen and Cutkosky [2023, Page 2, second paragraph on the right], we assume that there exist finite but unknown upper bounds $G$ and $L$ on the Lipschitzness and smoothness to ensure the theoretical results are valid. Note that these quantities only appear in the final regret bounds, and our algorithms do not use them as inputs. Throughout the paper, we use the $O(\cdot)$-notation to hide constants and the $\widetilde{O}(\cdot)$-notation to omit poly-logarithmic factors in $T$.

2.2 Algorithmic Framework

We choose optimistic online mirror descent (optimistic OMD) [Rakhlin and Sridharan, 2013a] as the algorithmic framework, which provides a unified view for designing and analyzing many online algorithms. Compared to classic OMD [Nemirovskij and Yudin, 1985; Beck and Teboulle, 2003], optimistic OMD predicts with side information, an optimistic vector $M_t \in \mathbb{R}^d$. This optimistic vector, also known as the optimism, serves as a prediction of the incoming function $f_{t+1}(\cdot)$, leading to tighter regret bounds when accurate. Optimistic OMD updates the decisions in two steps:
$$x_t = \arg\min_{x \in \mathcal{X}} \big\{\langle M_t, x\rangle + \mathcal{D}_{\psi_t}(x, \widehat{x}_t)\big\}, \qquad \widehat{x}_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{\langle \nabla f_t(x_t), x\rangle + \mathcal{D}_{\psi_t}(x, \widehat{x}_t)\big\}, \quad (3)$$
where $\mathcal{D}_{\psi_t}(x, y) = \psi_t(x) - \psi_t(y) - \langle\nabla\psi_t(y), x - y\rangle$ is the Bregman divergence associated with the regularizer $\psi_t : \mathcal{X} \to \mathbb{R}$. Optimistic OMD maintains two sequences: the sequence of submitted decisions $\{x_t\}_{t=1}^{T}$ and that of intermediate decisions $\{\widehat{x}_t\}_{t=1}^{T}$. Although a simplified optimistic OMD with a one-step update per round exists [Joulani et al., 2020], we will demonstrate later that tuning the step size based on intermediate decisions is crucial for adapting to generalized smoothness.
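Before instantiating this framework, the following minimal Python sketch (ours, not from the paper) shows the Euclidean specialization used in Section 2.3 below, combining the update (4) with the locally tuned step size of Theorem 1; the gradient oracle `grad`, the link function `ell`, and the ball-shaped domain are illustrative placeholders.

```python
import numpy as np

def project(y, radius):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def optimistic_omd(T, d, grad, ell, radius=1.0, D=2.0):
    """Sketch of optimistic OMD, Eq. (4), with the tuning of Eq. (7).
    grad(t, x) returns grad f_t(x), with grad(0, .) = 0 since f_0 = 0;
    ell is the link function of Assumption 1."""
    xhat = np.zeros(d)
    var = 1.0        # 1 + sum_s ||grad f_s(x_s) - grad f_{s-1}(x_s)||_2^2
    cap = np.inf     # running min of 1 / (4 * local smoothness estimate)
    decisions = []
    for t in range(1, T + 1):
        g_opt = grad(t - 1, xhat)                  # optimism M_t = grad f_{t-1}(xhat_t)
        L_hat = ell(2.0 * np.linalg.norm(g_opt))   # locally estimated smoothness
        cap = min(cap, 1.0 / (4.0 * L_hat))
        eta = min(D / np.sqrt(var), cap)           # step size, cf. Eq. (7)
        x = project(xhat - eta * g_opt, radius)    # x_t: optimistic step
        g = grad(t, x)
        xhat = project(xhat - eta * g, radius)     # xhat_{t+1}: mirror step
        var += np.linalg.norm(g - grad(t - 1, x)) ** 2  # self-confident statistic
        decisions.append(x)
    return decisions
```

The two projections per round mirror the two-step structure of (3): the step size is decided at $\widehat{x}_t$ before $x_t$ is produced, which is exactly what allows the local smoothness estimate to be computed in advance.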
2.3 Gradient-Variation Regret for Convex and Strongly Convex Functions

When minimizing convex or strongly convex functions, we set the regularizer as $\psi_t(x) = \frac{1}{2\eta_t}\|x\|_2^2$, and optimistic OMD updates via the following steps:
$$x_t = \Pi_{\mathcal{X}}\big[\widehat{x}_t - \eta_t M_t\big], \qquad \widehat{x}_{t+1} = \Pi_{\mathcal{X}}\big[\widehat{x}_t - \eta_t\nabla f_t(x_t)\big], \quad (4)$$
where $\Pi_{\mathcal{X}}[y] = \arg\min_{x\in\mathcal{X}}\|x - y\|_2$ denotes the Euclidean projection operator.

Next, we briefly review approaches for obtaining the gradient-variation bound under global smoothness. This bound typically follows from the regret analysis of optimistic OMD:
$$\mathrm{REG}_T \le \sum_{t=1}^{T}\frac{\|x^\star - \widehat{x}_t\|_2^2 - \|x^\star - \widehat{x}_{t+1}\|_2^2}{2\eta_t} + \sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - M_t\|_2^2 - \sum_{t=1}^{T}\frac{1}{2\eta_t}\big(\|x_t - \widehat{x}_t\|_2^2 + \|\widehat{x}_t - x_{t-1}\|_2^2\big). \quad (5)$$
On the right-hand side, the second term is known as the stability term, while the third one consists of negative terms that can be further bounded by $-O(\sum_{t=1}^{T}\frac{1}{\eta_t}\|x_t - x_{t-1}\|_2^2)$. Previous studies [Chiang et al., 2012; Zhao et al., 2024] on gradient-variation regret under global smoothness often set the optimism as $M_t = \nabla f_{t-1}(x_{t-1})$, such that the positive stability term can be upper bounded by $2\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + 2\eta_t\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(x_{t-1})\|_2^2$, where the first part can be directly converted to the desired gradient variation and the second part is at most $2\eta_tL^2\|x_t - x_{t-1}\|_2^2$ under global smoothness. Given the smoothness constant $L$, tuning the step size as $\eta_t \le 1/(4L)$ ensures $O(\eta_tL^2\|x_t - x_{t-1}\|_2^2 - \frac{1}{\eta_t}\|x_t - x_{t-1}\|_2^2) \le 0$, thus yielding the gradient-variation bound.

However, under generalized smoothness, we do not have a global parameter $L$ for setting the step sizes, and the smoothness constants are related to the decisions. To follow the previous approach, optimistic OMD would require the smoothness constant before generating $x_t$ to tune the step size, ensuring that the negative terms are large enough to cancel $2\eta_t\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(x_{t-1})\|_2^2$. Nevertheless, the smoothness constant between $x_t$ and $x_{t-1}$ can only be evaluated after updating to $x_t$, resulting in a contradiction. Unlike offline optimization, where the function is fixed and smoothness constants can be shown to decrease along the trajectory [Li et al., 2023], in online optimization the online functions change at each round, preventing the reuse of previous smoothness estimations.

To address this challenge, our key idea is to perform a trajectory-wise analysis and configure the algorithm using the smoothness estimated so far. The key technical lemma is the local smoothness property of $\ell_t$-smooth functions [Li et al., 2023], which allows the smoothness constant between two points to be estimated in advance, provided that the two points are close enough.

Lemma 1 (local smoothness [Li et al., 2023, Lemma 3.3]). Suppose $f : \mathcal{X} \to \mathbb{R}$ is $\ell$-smooth. For $x, y \in \mathcal{X}$ such that $\|x - y\|_2 \le \frac{\|\nabla f(x)\|_2}{\ell(2\|\nabla f(x)\|_2)}$, it holds that $\|\nabla f(x) - \nabla f(y)\|_2 \le \ell(2\|\nabla f(x)\|_2)\,\|x - y\|_2$.

Recall that in the update procedure (3) of optimistic OMD, the submitted decision $x_t$ is updated based on the intermediate decision $\widehat{x}_t$. Therefore, it is convenient to control their distance and then exploit the local smoothness at the point $\widehat{x}_t$. Specifically, we set the optimism $M_t = \nabla f_{t-1}(\widehat{x}_t)$ and the step size $\eta_t \le 1/(4\widehat{L}_{t-1})$, where $\widehat{L}_{t-1} = \ell_{t-1}(2\|\nabla f_{t-1}(\widehat{x}_t)\|_2)$ denotes the locally estimated smoothness and is used to tune the step size. This configuration leads to $\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2$ for the second term in Eq. (5), which can be further upper bounded as
$$\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2 \le 2\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + 2\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2. \quad (6)$$
The first part is basically the favorable gradient variation, so it suffices to handle the second part.
Performing the stability analysis for OMD and noticing the step-size setting, it can be verified that $\|x_t - \widehat{x}_t\|_2 \le \eta_t\|\nabla f_{t-1}(\widehat{x}_t)\|_2 \le \|\nabla f_{t-1}(\widehat{x}_t)\|_2/(4\widehat{L}_{t-1})$. This satisfies the criterion for applying Lemma 1 to the $\ell_{t-1}$-smooth function $f_{t-1}(\cdot)$, allowing us to upper bound the second term in (6) by $O(\widehat{L}_{t-1}^2\|x_t - \widehat{x}_t\|_2^2)$. We have clipped $\eta_t$ by $1/(4\widehat{L}_{t-1})$, thereby ensuring the negative term is sufficient to cancel out this positive term. Below, we summarize the result for convex functions.

Theorem 1. Under Assumptions 1-2 and assuming the online functions are convex, setting the optimism as $M_t = \nabla f_{t-1}(\widehat{x}_t)$ with $f_0(\cdot) = 0$, and the step sizes as $\eta_1 = D$ and, for $t \ge 2$,
$$\eta_t = \min\Bigg\{\frac{D}{\sqrt{1 + \sum_{s=1}^{t-1}\|\nabla f_s(x_s) - \nabla f_{s-1}(x_s)\|_2^2}},\ \min_{s\in[t]}\frac{1}{4\ell_{s-1}(2\|\nabla f_{s-1}(\widehat{x}_s)\|_2)}\Bigg\}, \quad (7)$$
optimistic OMD in (4) ensures the regret bound
$$\mathrm{REG}_T \le O\big(D\sqrt{V_T} + \widehat{L}_{\max}D^2\big),$$
where $V_T = \sum_{t=2}^{T}\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ measures the gradient variation and $\widehat{L}_{\max} = \max_{t\in[T]}\widehat{L}_t$ is the maximum smoothness constant over the optimization trajectory.

This result implies a tighter bound in scenarios where the environments change slowly, i.e., $V_T = O(1)$. Meanwhile, it safeguards the worst-case optimal result since $V_T \le O(T)$ holds in all cases. When $\ell_t(\cdot) \equiv L$ for $t \in [T]$, the $\ell_t$-smoothness condition degenerates to the classic global $L$-smoothness condition, and our result implies an $O(D\sqrt{V_T} + LD^2)$ bound, which matches the best-known gradient-variation regret bounds with a first-order oracle [Chiang et al., 2012; Yan et al., 2023; Zhao et al., 2024], even in terms of the dependence on $D$ and $L$. Compared to offline optimization, our result depends on $\widehat{L}_{\max}$, the maximum smoothness constant along the trajectory. This dependence arises from the adversarial nature of online learning, where the loss functions chosen in consecutive rounds may differ significantly, making it hopeless to leverage previous estimates of smoothness to improve the dependence.

We further provide an improved gradient-variation regret bound for strongly convex functions, with step-size tuning based on a recent result under global smoothness [Chen et al., 2024, Section 3.4].

Theorem 2. Under Assumptions 1-2 and assuming the online functions are $\lambda$-strongly convex, setting the optimism as $M_t = \nabla f_{t-1}(\widehat{x}_t)$ with $f_0(\cdot) = 0$, and the step sizes as $\eta_1 = 2/\lambda$ and, for $t \ge 2$, $\eta_t = 2/(\lambda t + 16\max_{s\in[t]}\ell_{s-1}(2\|\nabla f_{s-1}(\widehat{x}_s)\|_2))$, optimistic OMD in (4) ensures the regret bound $\mathrm{REG}_T \le O\big(\frac{1}{\lambda}\log V_T + \widehat{L}_{\max}D^2\big)$, where $\widehat{L}_{\max} = \max_{t\in[T]}\widehat{L}_t$.

The above theorem requires knowledge of the curvature parameter $\lambda$. In Section 3, we design a universal method that removes this requirement and achieves the optimal guarantees for convex and strongly convex functions simultaneously without knowing $\lambda$. In Appendix B.2, we discuss the challenge of obtaining a gradient-variation bound for exp-concave functions under Assumption 1.

3 Universal Online Learning under Generalized Smoothness

Classic online learning algorithms require the curvature information of online functions as algorithmic parameters to achieve favorable regret guarantees. However, obtaining these curvature parameters can be difficult in practice. This challenge motivates the recent study of universal online learning [van Erven and Koolen, 2016; Cutkosky and Boahen, 2017; Wang et al., 2019; Mhammedi et al., 2019; Zhang et al., 2022a; Yan et al., 2023; Yang et al., 2024], which aims to design a single algorithm that achieves the optimal regrets without knowing the curvature information. In this section, we study universal online learning with gradient-variation regret under generalized smoothness.
3.1 Reviewing Related Work and Techniques

We review related work on gradient-variation universal online learning under global smoothness [Zhang et al., 2022a; Yan et al., 2023]. To handle the unknown curvature, universal online learning algorithms utilize a two-layer structure, consisting of a meta-algorithm that ensembles a group of base-learners. Each base-learner optimizes functions with a specific convex curvature, while the meta-algorithm is designed to ensure that ensemble errors do not ruin the base-learners' guarantees. Denoting by $N$ the number of base-learners, the decision $x_t = \sum_{i\in[N]}p_{t,i}x_{t,i}$ submitted by a two-layer algorithm comprises two key components: $p_t \in \Delta_N$, the weights provided by the meta-algorithm, and $x_{t,i}$, the decision of the $i$-th base-learner. The analysis of a universal algorithm begins by decomposing the regret into two parts against any base-learner. In particular, we choose the base-learner with the best performance (whose index $i^\star$ is unknown) as the benchmark:
$$\mathrm{REG}_T = \underbrace{\sum_{t=1}^{T}f_t(x_t) - \sum_{t=1}^{T}f_t(x_{t,i^\star})}_{\textsc{meta-reg}} + \underbrace{\sum_{t=1}^{T}f_t(x_{t,i^\star}) - \min_{x\in\mathcal{X}}\sum_{t=1}^{T}f_t(x)}_{\textsc{base-reg}}, \quad (8)$$
where the first part is the meta-regret, evaluating the meta-algorithm's performance against the best base-learner, and the second part, known as the base-regret, measures the best base-learner's performance.

Zhang et al. [2022a] advocate a simple approach that employs meta-algorithms with second-order regret guarantees, which facilitates the analysis at the meta level. Specifically, they use Adapt-ML-Prod [Gaillard et al., 2014] as the meta-algorithm, showing that the meta-regret for strongly convex and exp-concave functions is constant by exploiting the negative terms from convexity. Consider $\lambda$-strongly convex functions as an example, and assume the $i^\star$-th base-learner ensures the optimal $O(\frac{1}{\lambda}\log V_T)$ base-regret. At the meta level, Zhang et al. [2022a] pass the linearized regret $r_{t,i} = \langle\nabla f_t(x_t), x_t - x_{t,i}\rangle$ to the meta-algorithm for each base-learner. By strong convexity and the guarantees of Adapt-ML-Prod, the meta-regret can be bounded by a constant:
$$\textsc{meta-reg} \le \widetilde{O}\Bigg(\sqrt{\sum_{t=1}^{T}r_{t,i^\star}^2}\Bigg) - \frac{\lambda}{2}\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2 \le O(1),$$
where the last inequality follows from $\sqrt{\sum_t\langle\nabla f_t(x_t), x_t - x_{t,i^\star}\rangle^2} \le \widehat{G}_{\max}\sqrt{\sum_t\|x_t - x_{t,i^\star}\|_2^2}$, which is then canceled by the negative terms via the AM-GM inequality. By leveraging the negative terms from strong convexity, the meta-regret can be well bounded, allowing the overall regret to be dominated by the base-regret, which is in turn controlled by selecting appropriate base-algorithms.

However, this method is unsuitable for producing the gradient-variation bound for convex functions. To address this, Yan et al. [2023] propose to use a meta-algorithm that ensures an optimistic second-order regret bound while providing additional negative terms $-\sum_t\|p_t - p_{t-1}\|_1^2$. Besides showing that the meta-regret is constant for strongly convex and exp-concave functions following the previous approach, with newly designed optimism, Yan et al. [2023] prove that the meta-regret for convex functions can be roughly bounded by
$$\widetilde{O}\big(\sqrt{V_T}\big) + L^2\sum_{t=1}^{T}\|p_t - p_{t-1}\|_1^2 + L^2\sum_{t=1}^{T}\sum_{i=1}^{N}p_{t,i}\|x_{t,i} - x_{t-1,i}\|_2^2.$$
The first term is the gradient variation, matching the optimal order for convex functions. Yan et al. [2023] demonstrate that the remaining stability terms can be canceled through the collaboration between the base and meta levels [Zhao et al., 2024] with prior knowledge of the global smoothness constant $L$, thus obtaining near-optimal gradient-variation bounds for convex functions as well.
Nevertheless, the employed meta-algorithm itself has a two-layer structure, resulting in a three-layer structure for the overall algorithm, which is relatively complicated.

3.2 Key Challenges and Main Ideas

We aim to design a universal algorithm with the optimal gradient-variation bounds under generalized smoothness, which presents two challenges. First, the Lipschitz constant of the online functions is unknown to the meta-algorithm, which therefore needs to be Lipschitz-adaptive, provide a second-order regret, and enable predictions with optimism. Second, the combination step of the ensemble method further complicates the estimation of the smoothness constants, making it challenging to tune the algorithms properly and to cancel the stability terms as Yan et al. [2023] did.

We tackle the second challenge by utilizing a function-variation-to-gradient-variation conversion to derive the gradient-variation bounds at the meta level, drawing inspiration from the development of dynamic regret minimization [Bai et al., 2022]. This conversion technique decouples the meta and base levels, allowing us to avoid cancellation-based analysis. To illustrate, suppose a meta-algorithm ensures an $O(\sqrt{\sum_t(\ell_{t,i^\star} - m_{t,i^\star})^2})$ regret bound provided the optimism $m_t = (m_{t,1},\ldots,m_{t,N})$. By setting $\ell_{t,i} = f_t(x_{t,i}) - f_t(x_{\mathrm{ref}})$ and $m_{t,i} = f_{t-1}(x_{t,i}) - f_{t-1}(x_{\mathrm{ref}})$, where $x_{\mathrm{ref}}$ is a fixed reference point, the meta-regret bound becomes $O(\sqrt{\sum_t[(f_t(x_{t,i^\star}) - f_{t-1}(x_{t,i^\star})) - (f_t(x_{\mathrm{ref}}) - f_{t-1}(x_{\mathrm{ref}}))]^2})$. By the mean value theorem, $[(f_t(x_{t,i^\star}) - f_{t-1}(x_{t,i^\star})) - (f_t(x_{\mathrm{ref}}) - f_{t-1}(x_{\mathrm{ref}}))]^2$ equals the first term below, which can be further upper bounded by the gradient variation:
$$\big[\langle\nabla f_t(\xi_{t,i^\star}) - \nabla f_{t-1}(\xi_{t,i^\star}),\ x_{t,i^\star} - x_{\mathrm{ref}}\rangle\big]^2 \le D^2\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2.$$
This technique brings hope for minimizing convex functions; a concrete sketch of the conversion follows this subsection.

To develop a universal method, our first attempt is to utilize MsMwC-Master [Chen et al., 2021] as the meta-algorithm, which satisfies all three requirements imposed by the first challenge. Nevertheless, the heterogeneous inputs at the meta level present another obstacle to this approach. The heterogeneity arises because we pass $r_{t,i} = f_t(x_t) - f_t(x_{t,i})$ to the meta-algorithm for the base-learner responsible for convex functions, leveraging the conversion technique, while $r_{t,i} = \langle\nabla f_t(x_t), x_t - x_{t,i}\rangle$ for base-learners minimizing strongly convex functions. It remains unclear how to adapt MsMwC-Master [Chen et al., 2021] to our heterogeneous inputs, as MsMwC-Master requires a loss vector $\ell_t$ as input, and its guarantee is for the regret of the form $r_{t,i} = \langle p_t, \ell_t\rangle - \ell_{t,i}$; such an $\ell_t$ cannot be retrieved from our design above. Fortunately, we observe that the Prod-type algorithms [Cesa-Bianchi et al., 2007; Gaillard et al., 2014; Wei et al., 2016] are friendly to heterogeneous inputs. Technically, the Prod-type algorithms provide the same guarantees as long as $\sum_{i\in[N]}p_{t,i}r_{t,i} \le 0$, thanks to the potential-based analysis. Therefore, aside from the requirements of the first challenge, we expect the meta-algorithm to admit an analysis similar to the Prod-type algorithms, motivating us to design a new meta-algorithm.

As a byproduct, we present a universal algorithm with a two-layer structure under global smoothness using the developed techniques. It is more efficient than that of Yan et al. [2023] and attains the optimal gradient-variation guarantees for convex, strongly convex, and exp-concave functions, at the cost of additional function-value queries. We defer the algorithms and regret bounds to Appendix C.5.
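To make the conversion concrete, the following sketch (our illustration; the callables `f_t`, `f_prev`, and the reference point are placeholders) builds the surrogate losses and optimism that the convex base-learner feeds to the meta-algorithm:

```python
import numpy as np

def meta_inputs(f_t, f_prev, base_decisions, x_ref):
    """Function-variation-to-gradient-variation conversion (Section 3.2).
    f_t and f_prev evaluate f_t and f_{t-1}; base_decisions is the list of
    the base-learners' decisions x_{t,i}; x_ref is a fixed point in X."""
    loss = np.array([f_t(x) - f_t(x_ref) for x in base_decisions])            # ell_{t,i}
    optimism = np.array([f_prev(x) - f_prev(x_ref) for x in base_decisions])  # m_{t,i}
    return loss, optimism
```

By the mean value theorem, each squared excess $(\ell_{t,i} - m_{t,i})^2$ is at most $D^2\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$, so a second-order meta-regret bound on these inputs directly yields the gradient variation.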
3.3 A New Lipschitz-Adaptive Meta-Algorithm

In Algorithm 1, we present our meta-algorithm, which builds on optimistic Adapt-ML-Prod [Wei et al., 2016] and incorporates the clipping technique introduced by Cutkosky [2019] and further refined by Chen et al. [2021]. This algorithm, described in the language of Prediction with Experts' Advice (PEA), may be of independent interest beyond adapting to generalized smoothness. Apart from satisfying all the expected requirements, this algorithm offers a simpler design that does not require a forced restart, as opposed to MsMwC-Master [Chen et al., 2021]. This efficiency improvement is achieved through a simple self-confident learning rate in Line 6 of Algorithm 1, unlike MsMwC-Master, which uses fixed learning rates and thus requires restarts. In essence, our approach incorporates the clipping mechanism by adding $B_t^2$ to the denominator and removing the threshold on learning rates commonly applied in prior Prod-type algorithms [Gaillard et al., 2014; Wei et al., 2016]. This term $B_t^2$ acts as a threshold, ensuring that $\eta_{t,i}|\widetilde{r}_{t,i} - m_{t,i}| \le 1/2$, a critical condition in the analysis (typically satisfied when the Lipschitz constant is provided for prior Prod-type algorithms). Lipschitz-adaptive algorithms may be sensitive to the choice of $B_0$, while in our case $B_0 = \Theta(1/\log T)$ is sufficient and does not ruin the guarantees, as the $O(\log\log)$ factor is often treated as a constant [Gaillard et al., 2014; Luo and Schapire, 2015]. Theorem 3 summarizes the guarantee of Algorithm 1, with the proof provided in Appendix C.3.

Theorem 3. Setting $m_{t,i} = \langle p_t, \ell_{t-1}\rangle - \ell_{t-1,i}$ in Algorithm 1 ensures that, for any $i \in [N]$, $\sum_{t=1}^{T}\langle p_t, \ell_t\rangle - \sum_{t=1}^{T}\ell_{t,i}$ is bounded by
$$O\Bigg(\big(\log N + \log(B_T + \log T)\big)\Bigg(\sqrt{\sum_{t=1}^{T}(r_{t,i} - m_{t,i})^2} + B_T\Bigg)\Bigg) \le \widetilde{O}\Bigg(\sqrt{\sum_{t=1}^{T}\|\ell_t - \ell_{t-1}\|_\infty^2} + B_T\Bigg),$$
where $B_T = \max\{B_0, \max_{t\in[T]}\|r_t - m_t\|_\infty\}$.

Algorithm 1 Lipschitz-Adaptive Optimistic Adapt-ML-Prod
Input: prior information of the scale $B_0$, the number of experts $N$.
1: Initialization: set $w_{1,i} = 1$, $m_{1,i} = 0$, and $\eta_{1,i} = 1/\sqrt{1 + 4B_0^2}$ for all $i \in [N]$.
2: for $t = 1$ to $T$ do
3:   Update the weight for $i \in [N]$: $\widetilde{w}_{t,i} = w_{t,i}\exp(\eta_{t,i}m_{t,i})$;
4:   Calculate the decision $p_t \in \Delta_N$ with $p_{t,i} = \frac{\eta_{t,i}\widetilde{w}_{t,i}}{\sum_{j\in[N]}\eta_{t,j}\widetilde{w}_{t,j}}$ and submit it;
5:   Receive $r_t$, update $B_t = \max\{B_{t-1}, \|r_t - m_t\|_\infty\}$, and build $\widetilde{r}_{t,i} = m_{t,i} + \frac{B_{t-1}}{B_t}(r_{t,i} - m_{t,i})$;
6:   Update the learning rate for $i \in [N]$: $\eta_{t+1,i} = \sqrt{\frac{1}{1 + \sum_{s=1}^{t}(\widetilde{r}_{s,i} - m_{s,i})^2 + 4B_t^2}}$;
7:   Update the weight for $i \in [N]$: $w_{t+1,i} = \big(w_{t,i}\exp\big(\eta_{t,i}\widetilde{r}_{t,i} - \eta_{t,i}^2(\widetilde{r}_{t,i} - m_{t,i})^2\big)\big)^{\eta_{t+1,i}/\eta_{t,i}}$.
8: end for

Algorithm 2 Universal Gradient-Variation Online Learning under Generalized Smoothness
Input: curvature coefficient pool $\mathcal{H}$, number of base-learners $N$, prior information of the scale $B_0$.
1: Initialization: send $N$ and $B_0$ to the meta-algorithm; for each $\lambda \in \mathcal{H}$, initialize an instance of the algorithm in Theorem 2 with it; initialize an instance of the algorithm in Theorem 1.
2: for $t = 1$ to $T$ do
3:   Obtain $p_t$ from the meta-algorithm and $x_{t,i}$ from each base-learner $i \in [N]$;
4:   Submit $x_t = \sum_{i\in[N]}p_{t,i}x_{t,i}$;
5:   Receive $f_t(\cdot)$ and send it to each base-learner for its update;
6:   For strongly convex base-learners: set $r_{t,i} = \langle\nabla f_t(x_t), x_t - x_{t,i}\rangle$;
7:   For the convex base-learner: set $r_{t,i} = f_t(x_t) - f_t(x_{t,i})$;
8:   Send $m_{t+1,i} = f_t(x_t) - f_t(x_{t,i})$ to the meta-algorithm for $i \in [N]$.
9: end for

Our algorithm improves efficiency at the cost of an additional factor of $O(\sqrt{\log N})$. This factor is negligible for universal online learning, since we set $N = O(\log T)$ and the resulting $O(\log\log T)$ factor is ignorable.
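To complement the pseudocode, here is a minimal Python rendering of Algorithm 1 (our sketch; the callbacks `optimism_fn` and `regret_fn` supplying $m_t$ and $r_t$ are placeholders that, in Algorithm 2, would be built from the base-learners' feedback):

```python
import numpy as np

def lipschitz_adaptive_prod(T, N, B0, optimism_fn, regret_fn):
    """Sketch of Algorithm 1: Lipschitz-adaptive optimistic Adapt-ML-Prod."""
    w = np.ones(N)                                        # w_{1,i} = 1
    eta = np.full(N, 1.0 / np.sqrt(1.0 + 4.0 * B0 ** 2))  # eta_{1,i}
    B = B0
    sq_sum = np.zeros(N)          # running sum of (r~_{s,i} - m_{s,i})^2
    for t in range(1, T + 1):
        m = optimism_fn(t)                                # optimism m_t (line 3)
        w_tilde = w * np.exp(eta * m)
        p = eta * w_tilde / np.sum(eta * w_tilde)         # decision p_t (line 4)
        r = regret_fn(t, p)                               # receive r_t
        B_prev, B = B, max(B, np.max(np.abs(r - m)))      # line 5
        r_clip = m + (B_prev / B) * (r - m)               # clipped regret r~_t
        sq_sum += (r_clip - m) ** 2
        eta_next = 1.0 / np.sqrt(1.0 + sq_sum + 4.0 * B ** 2)  # line 6
        # line 7: Prod-style potential update, raised to eta_{t+1}/eta_t
        w = (w * np.exp(eta * r_clip - eta ** 2 * (r_clip - m) ** 2)) ** (eta_next / eta)
        eta = eta_next
    return p
```

The clipping in line 5 rescales the deviation back to the previous scale $B_{t-1}$, so that $\eta_{t,i}|\widetilde{r}_{t,i} - m_{t,i}| \le 1/2$ holds even though the true scale is only learned on the fly.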
Considering other related Lipschitz-adaptive algorithms, Mhammedi et al. [2019] obtain a regret bound of $O(\sqrt{\sum_t(r_{t,i^\star})^2(\log N + \log\log(B_TT))} + B_T\log N)$, which offers better dependence on the dominant term $\sqrt{\sum_t(r_{t,i^\star})^2}$, but it is unclear how to incorporate optimism. Chen et al. [2021] achieve a bound of $O(\sqrt{\sum_t(\ell_{t,i^\star} - m_{t,i^\star})^2\log(NT)} + B_T\log(NT))$ with a two-layer algorithm; however, the $O(\sqrt{\log T})$ term would ruin the desired $O(\log V_T)$ bound for strongly convex functions. We remark that the compared methods enjoy other strengths not discussed here, such as the ability to compete with an arbitrary competitor $x \in \Delta_N$ and the versatility to handle various learning scenarios, while our method suffices for our purpose and is the only option that tackles all the challenges mentioned in Section 3.2.

Lastly, notice that the optimism $m_t$ involves the decision $p_t$, which might be improper since $p_t$ depends on $m_t$ as well. We refer readers to Appendix C.1 for efficiently setting $m_t$ through a univariate binary search. We emphasize that the optimism $m_t$ in our algorithm is not chosen arbitrarily. In Line 5, we clip the regret with the optimism to keep them on the same scale. The performance is then evaluated based on the clipped regret $\widetilde{r}_{t,i}$. For analytical purposes, it is essential that $\langle p_t, \widetilde{r}_t\rangle \le 0$, and a sufficient condition for this is ensuring $\langle p_t, m_t\rangle \le 0$, which imposes an additional requirement on $m_t$. In Appendix C.2, we discuss how this requirement introduces challenges for exp-concave function minimization in universal online learning.

3.4 Overall Algorithm and Regret Guarantees

The function-variation-to-gradient-variation technique decouples the design of universal methods into the base and meta levels, and we are now ready to combine the proposed components. We employ the algorithms in Section 2.3 as the base-learners and use Algorithm 1 as the meta-learner, arriving at Algorithm 2. Theorem 4 presents its guarantee, with the proof in Appendix C.4.

Theorem 4. Under Assumptions 1-2 and assuming a global lower bound $\underline{f}$ such that $\underline{f} \le f_t(x)$ for any $x \in \mathcal{X}$, $t \in [T]$, setting $N = \lceil\log_2 T\rceil + 1$, defining the curvature coefficient pool $\mathcal{H} = \{2^{i-1}/T : i \in [N-1]\}$, and properly specifying $B_0$, Algorithm 2 simultaneously ensures:
$$\mathrm{REG}_T \le \begin{cases} O\big(\sqrt{V_T}\log B_T\big), & \text{(convex)},\\[2pt] O\big(\tfrac{1}{\lambda}\log V_T + \widehat{G}_{\max}^2\log^2(B_T)/\lambda + B_T\log B_T\big), & \text{($\lambda$-strongly convex)}, \end{cases}$$
where $\lambda \in [1/T, 1]$, $B_T = O(\max\{B_0, D\max_{t\in[T]}\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2\})$, and $\widehat{G}_{\max}$ is the Lipschitz constant on the optimization trajectory.

Without loss of generality, we assume $\lambda \in [1/T, 1]$ for strongly convex functions. If $\lambda < 1/T$, the optimal $O((\log V_T)/\lambda)$ bound would imply linear regret, in which case we treat the functions as general convex functions. If $\lambda > 1$, our result is slightly worse than the optimal one by a negligible constant factor. This simplification is also employed by Zhang et al. [2022a] and Yan et al. [2023].

Remark 1. This theorem additionally requires a lower bound on the loss functions, which is used to perform the binary search when setting the optimism. We defer the details of the binary search to Appendix C.1. This assumption is also employed in recent parameter-free optimization [Hazan and Kakade, 2019; Attia and Koren, 2024; Khaled and Jin, 2024], and we can simply choose $\underline{f} = 0$ in empirical risk minimization settings [Hazan and Kakade, 2019].

4 Applications

In this section, we demonstrate the importance of our results by providing two applications (the SEA model and online games), where new results are directly implied by our findings.
4.1 Stochastically Extended Adversarial (SEA) Model

The stochastically extended adversarial (SEA) model [Sachs et al., 2022] interpolates between adversarial and stochastic online optimization. It assumes that the environments select the loss function $f_t(\cdot)$ from a distribution $\mathcal{D}_t$. The adversarial nature is characterized by shifts in the distribution $\mathcal{D}_t$; when $\mathcal{D}_t$ remains constant, the model captures the environments' stochastic behavior. The following quantities measure the levels of adversarial and stochastic behaviors in the environments:
$$\Sigma_{1:T}^2 = \sum_{t=2}^{T}\sup_{x\in\mathcal{X}}\|\nabla F_t(x) - \nabla F_{t-1}(x)\|_2^2, \qquad \sigma_{1:T}^2 = \sum_{t=1}^{T}\sup_{x\in\mathcal{X}}\mathbb{E}\big[\|\nabla f_t(x) - \nabla F_t(x)\|_2^2\big], \quad (9)$$
where we denote by $F_t(x) = \mathbb{E}_{f_t\sim\mathcal{D}_t}[f_t(x)]$ the expected loss. In the above, $\Sigma_{1:T}^2$ represents the adversarial shift of the distributions, and $\sigma_{1:T}^2$ denotes the stochastic variance. Sachs et al. [2022] prove an $O(\sqrt{\sigma_{1:T}^2 + \Sigma_{1:T}^2})$ regret for convex functions, and a refined regret bound of $O((\sigma_{\max}^2 + \Sigma_{\max}^2)\log(\sigma_{1:T}^2 + \Sigma_{1:T}^2))$ for strongly convex functions is obtained by Chen et al. [2023a] and Sachs et al. [2023], where $\sigma_{\max}^2 = \max_{t\in[T]}\sup_{x\in\mathcal{X}}\mathbb{E}[\|\nabla f_t(x) - \nabla F_t(x)\|_2^2]$ and $\Sigma_{\max}^2 = \max_{t\in[T]}\sup_{x\in\mathcal{X}}\|\nabla F_t(x) - \nabla F_{t-1}(x)\|_2^2$. Yan et al. [2023] present a universal method that obtains $\widetilde{O}(\sqrt{\sigma_{1:T}^2 + \Sigma_{1:T}^2})$ and $O((\sigma_{\max}^2 + \Sigma_{\max}^2)\log(\sigma_{1:T}^2 + \Sigma_{1:T}^2))$ bounds for convex and strongly convex functions, respectively. However, these results require the global smoothness assumption.

Our result in Section 3 implies a new finding for the SEA model, relaxing the assumption from global smoothness to generalized smoothness while adapting to unknown curvature, summarized in Corollary 1. The proof can be found in Appendix D.1.

Corollary 1. Under Assumptions 1-2 and assuming a global lower bound $\underline{f}$ for the loss functions such that $\underline{f} \le f_t(x)$ for any $x \in \mathcal{X}$, $t \in [T]$, setting $N = \lceil\log_2 T\rceil + 1$, defining the curvature coefficient pool $\mathcal{H} = \{2^{i-1}/T : i \in [N-1]\}$, and specifying $B_0$ with a specific value, then, under the SEA model, Algorithm 2 simultaneously ensures:
$$\mathbb{E}[\mathrm{REG}_T] \le \begin{cases} O\big((\sqrt{\widetilde{\sigma}_{1:T}^2} + \sqrt{\Sigma_{1:T}^2})\log(\widehat{G}_{\max}D)\big), & \text{(convex)},\\[2pt] O\big((\widetilde{\sigma}_{\max}^2 + \Sigma_{\max}^2)\log(\widetilde{\sigma}_{1:T}^2 + \Sigma_{1:T}^2)\big), & \text{(strongly convex)}, \end{cases}$$
where $\widehat{G}_{\max}$ is the maximum empirical Lipschitz constant, $\widetilde{\sigma}_{1:T}^2 = \sum_{t=1}^{T}\mathbb{E}[\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla F_t(x)\|_2^2]$, and $\widetilde{\sigma}_{\max}^2 = \mathbb{E}[\max_{t\in[T]}\sup_{x\in\mathcal{X}}\|\nabla f_t(x) - \nabla F_t(x)\|_2^2]$.

Note that, in real-world streaming learning applications, this corollary can offer a more general depiction of data throughput under limited computing resources [Zhou, 2024; Wang et al., 2024], given the connection between these scenarios and the SEA model [Chen et al., 2024, Section 5.6]. Our result depends on $\widetilde{\sigma}_{1:T}^2$, a quantity larger than $\sigma_{1:T}^2$ that can still track the stochastic variance. This is because our algorithm utilizes information revealed after $x_{t-1}$ to generate $x_t$; we refer interested readers to the discussion by Chen et al. [2024] on this subtle issue. It is currently unknown how to improve this dependence even under the global smoothness condition, since we need to leverage the function variation to produce the gradient variation, inevitably involving such afterward information.

4.2 Fast Rates in Games

Our second application explores the min-max game, aiming to achieve an $\varepsilon$-approximate solution to the problem $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}f(x, y)$ at an $O(1/T)$ fast convergence rate. Here, we assume that $f(\cdot, y)$ is convex for any $y \in \mathcal{Y}$ and, correspondingly, $f(x, \cdot)$ is concave for any given $x \in \mathcal{X}$. Additionally, we assume that both $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$ are bounded convex sets. Pioneering research [Syrgkanis et al., 2015] demonstrates that optimistic algorithms can reach a convergence rate of $O(1/T)$ by leveraging gradient variation; a minimal sketch of such optimistic dynamics appears below.
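The sketch below (ours, using a fixed step size valid for this globally smooth bilinear toy problem rather than the adaptive tuning of Theorem 1) runs the optimistic dynamics of Eq. (10) on $f(x, y) = x^\top Ay$ over unit balls and monitors the duality gap of the ergodic average:

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 5
A = rng.standard_normal((n, m))

def F(z):
    # Monotone operator of f(x, y) = x^T A y: F(z) = (grad_x f, -grad_y f)
    x, y = z[:n], z[n:]
    return np.concatenate([A @ y, -A.T @ x])

def proj_ball(v, radius=1.0):
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def proj_Z(z):
    # Z = X x Y: project the x- and y-blocks onto their unit balls separately
    return np.concatenate([proj_ball(z[:n]), proj_ball(z[n:])])

eta = 0.5 / np.linalg.norm(A, 2)      # valid fixed step: F is ||A||_2-Lipschitz here
zhat = np.zeros(n + m)
zbar = np.zeros(n + m)
T = 2000
for _ in range(T):
    z = proj_Z(zhat - eta * F(zhat))  # z_t: play with optimism F(zhat_t)
    zhat = proj_Z(zhat - eta * F(z))  # zhat_{t+1}
    zbar += z / T                     # ergodic average, cf. Corollary 2
xbar, ybar = zbar[:n], zbar[n:]
gap = np.linalg.norm(A.T @ xbar) + np.linalg.norm(A @ ybar)  # duality gap on unit balls
print(gap)  # shrinks at the O(1/T) rate
```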
However, these results are limited to the global smoothness condition. In this part, we demonstrate that our findings in Section 2 directly imply a new algorithm that can exploit generalized smoothness. Following the notations of Nemirovski [2004], we define $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$, $z = (x, y) \in \mathcal{Z}$, and introduce an operator $F : \mathcal{Z} \to \mathbb{R}^{n+m}$ with $F(z) = (\nabla_x f(x, y), -\nabla_y f(x, y))$. We extend the concept of $\ell$-smoothness to the min-max optimization setting as follows.

Definition 2 ($\ell$-smoothness for min-max games). A differentiable convex-concave function $f : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ is called $\ell$-smooth with a non-decreasing link function $\ell : [0, +\infty) \to (0, +\infty)$ if it satisfies: for any $z_1, z_2 \in \mathcal{Z}$, if $z_1, z_2 \in \mathcal{B}\big(z, \frac{\|F(z)\|_2}{\ell(2\|F(z)\|_2)}\big)$, then $\|F(z_1) - F(z_2)\|_2 \le \ell(2\|F(z)\|_2)\,\|z_1 - z_2\|_2$, where $\mathcal{B}(z, r)$ denotes the Euclidean ball centered at the point $z$ with radius $r$.

This definition is a counterpart to Definition 1 for the min-max game, but weaker, as it drops the twice-differentiability requirement. An $\varepsilon$-approximate solution $(x^\star, y^\star)$ to the min-max game is formally defined by $f(x^\star, y) - \varepsilon \le f(x^\star, y^\star) \le f(x, y^\star) + \varepsilon$ for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$. To achieve a fast convergence rate to the solution, our result in Section 2 can be directly applied to the min-max game, yielding the following tailored algorithm for min-max optimization:
$$z_t = \Pi_{\mathcal{Z}}\big[\widehat{z}_t - \eta_tF(\widehat{z}_t)\big], \qquad \widehat{z}_{t+1} = \Pi_{\mathcal{Z}}\big[\widehat{z}_t - \eta_tF(z_t)\big]. \quad (10)$$
In the above, we set the optimism at the point $\widehat{z}_t$ in order to exploit the smoothness locally. We state our result in Corollary 2, with the proof available in Appendix D.2.

Corollary 2. Assume that the convex-concave function $f(x, y)$ is $\ell$-smooth and the domain $\mathcal{Z}$ is bounded with diameter $D$. By applying the tuning strategy described in Theorem 1, and defining the final approximate solution as $\bar{z}_T = \frac{1}{T}\sum_{t=1}^{T}z_t$, where $z_t$ is generated by Eq. (10), we achieve an $\varepsilon$-approximate solution with a convergence rate of $O(1/T)$.

5 Conclusion

In this paper, we provide a systematic study of gradient-variation online learning under the generalized smoothness condition. We exploit trajectory-wise smoothness to achieve the optimal regret bounds: $O(\sqrt{V_T})$ for convex functions and $O(\log V_T)$ for strongly convex functions, respectively. We further consider more complicated online learning scenarios, motivating us to design a new Lipschitz-adaptive meta-algorithm, which can be of independent interest. Hinging on this algorithm together with the function-variation-to-gradient-variation technique, we design a universal algorithm that guarantees the optimal results for convex and strongly convex functions simultaneously without knowing the curvature. In addition, our findings directly imply new results in stochastic extended adversarial online learning and fast-rate games under generalized smoothness. An important direction for future research is to explore whether our method can be extended to the one-gradient feedback model, where the learner receives only the gradient information of the decision submitted in each round. Another interesting problem is to exploit exp-concavity in gradient-variation online learning under generalized smoothness.

Acknowledgements

This research was supported by the National Key R&D Program of China (2022ZD0114800), NSFC (U23A20382), and Jiangsu SF (BK20220776). Peng Zhao was supported in part by the Xiaomi Foundation.
References

J. D. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Conference on Learning Theory (COLT), pages 415-424, 2008.
D. Ataee Tarzanagh, P. Nazari, B. Hou, L. Shen, and L. Balzano. Online bilevel optimization: Regret analysis of online alternating gradient methods. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2854-2862, 2024.
A. Attia and T. Koren. How free is parameter-free stochastic optimization? In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 2009-2034, 2024.
P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.
Y. Bai, Y.-J. Zhang, P. Zhao, M. Sugiyama, and Z.-H. Zhou. Adapting to online label shift with provable guarantees. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 29960-29974, 2022.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321-352, 2007.
L. Chen, H. Luo, and C. Wei. Impossible tuning made possible: A new expert algorithm and its applications. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 1216-1259, 2021.
S. Chen, W.-W. Tu, P. Zhao, and L. Zhang. Optimistic online mirror descent for bridging stochastic and adversarial online convex optimization. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 5002-5035, 2023a.
S. Chen, Y.-J. Zhang, W.-W. Tu, P. Zhao, and L. Zhang. Optimistic online mirror descent for bridging stochastic and adversarial online convex optimization. Journal of Machine Learning Research, 25(178):1-62, 2024.
Z. Chen, Y. Zhou, Y. Liang, and Z. Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 5396-5427, 2023b.
C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proceedings of the 25th Conference on Learning Theory (COLT), pages 6.1-6.20, 2012.
M. Crawshaw, M. Liu, F. Orabona, W. Zhang, and Z. Zhuang. Robustness to unbounded smoothness of generalized SignSGD. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 9955-9968, 2022.
A. Cutkosky. Artificial constraints and hints for unbounded online learning. In Proceedings of the 32nd Conference on Learning Theory (COLT), pages 874-894, 2019.
A. Cutkosky and K. Boahen. Stochastic and adversarial online learning without hyperparameters. In Advances in Neural Information Processing Systems 30 (NIPS), pages 5059-5067, 2017.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
M. Faw, L. Rout, C. Caramanis, and S. Shakkottai. Beyond uniform smoothness: A stopped analysis of adaptive SGD. In Proceedings of the 36th Conference on Learning Theory (COLT), pages 89-160, 2023.
P. Gaillard, G. Stoltz, and T. van Erven. A second-order bound with excess losses. In Proceedings of the 27th Conference on Learning Theory (COLT), pages 176-196, 2014.
E. Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.
E. Hazan and S. Kakade. Revisiting the Polyak step size. arXiv preprint, arXiv:1905.00313, 2019.
E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
A. Jacobsen and A. Cutkosky. Parameter-free mirror descent. In Proceedings of the 35th Conference on Learning Theory (COLT), pages 4160-4211, 2022.
A. Jacobsen and A. Cutkosky. Unconstrained online learning with unbounded losses. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 14590-14630, 2023.
P. Joulani, A. György, and C. Szepesvári. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, variance reduction, and variational bounds. Theoretical Computer Science, 808:108-138, 2020.
A. Khaled and C. Jin. Tuning-free stochastic optimization. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 23622-23661, 2024.
B. Kulis and P. L. Bartlett. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 575-582, 2010.
H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333-354, 2018.
H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 1286-1304, 2015.
Z. Mhammedi, W. M. Koolen, and T. van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Proceedings of the 32nd Conference on Learning Theory (COLT), pages 2490-2511, 2019.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.
A. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in optimization. SIAM Review, 27(2):264-265, 1985.
Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
C. Nicolò and O. Francesco. Temporal variability in implicit online learning. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 12377-12387, 2020.
F. Orabona. A Modern Introduction to Online Learning. arXiv preprint, arXiv:1912.13213, 2019.
F. Orabona and D. Pál. Scale-free online learning. Theoretical Computer Science, 716:50-69, 2018.
A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 993-1019, 2013a.
A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26 (NIPS), pages 3066-3074, 2013b.
A. Reisizadeh, H. Li, S. Das, and A. Jadbabaie. Variance-reduced clipping for non-convex optimization. arXiv preprint, arXiv:2303.00883, 2023.
S. Sachs, H. Hadiji, T. van Erven, and C. Guzmán. Between stochastic and adversarial online convex optimization: Improved regret bounds via smoothness. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 691-702, 2022.
S. Sachs, H. Hadiji, T. van Erven, and C. Guzmán. Accelerated rates between stochastic and adversarial online convex optimization. arXiv preprint, arXiv:2303.03272, 2023.
V. Syrgkanis, A. Agarwal, H. Luo, and R. E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2989-2997, 2015.
M. Telgarsky. Stochastic linear optimization never overfits with quadratically-bounded losses on general data. In Proceedings of the 35th Conference on Learning Theory (COLT), pages 5453-5488, 2022.
C. Tsai, Y. Lin, and Y. Li. Data-dependent bounds for online portfolio selection without Lipschitzness and smoothness. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 62764-62791, 2023.
T. van Erven and W. M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666-3674, 2016.
B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions. In Proceedings of the 36th Conference on Learning Theory (COLT), pages 161-190, 2023.
G. Wang, S. Lu, and L. Zhang. Adaptivity and optimality: A universal algorithm for online convex optimization. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pages 659-668, 2019.
J. Wang, M. Yu, P. Zhao, and Z.-H. Zhou. Learning with adaptive resource allocation. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 52099-52116, 2024.
C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3972-3980, 2016.
Y.-H. Yan, P. Zhao, and Z.-H. Zhou. Universal online learning with gradient variations: A multi-layer online ensemble approach. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 37682-37715, 2023.
T. Yang, M. Mahdavi, R. Jin, and S. Zhu. Regret bounded by gradual variation for online convex optimization. Machine Learning, 95(2):183-223, 2014.
W. Yang, Y. Wang, P. Zhao, and L. Zhang. Universal online convex optimization with 1 projection per round. arXiv preprint, arXiv:2405.19705, 2024.
B. Zhang, J. Jin, C. Fang, and L. Wang. Improved analysis of clipping algorithms for non-convex optimization. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 15511-15521, 2020a.
J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020b.
L. Zhang, G. Wang, J. Yi, and T. Yang. A simple yet universal strategy for online convex optimization. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26605-26623, 2022a.
M. Zhang, P. Zhao, H. Luo, and Z.-H. Zhou. No-regret learning in time-varying zero-sum games. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26772-26808, 2022b.
P. Zhao, Y.-J. Zhang, L. Zhang, and Z.-H. Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 12510-12520, 2020.
P. Zhao, Y.-J. Zhang, L. Zhang, and Z.-H. Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98):1-52, 2024.
Z.-H. Zhou. Learnability with time-sharing computational resource concerns. National Science Review, 11(10):nwae204, 2024.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928-936, 2003.

A Related Work

In this section, we review related work on gradient-variation online learning, generalized smoothness conditions, and Lipschitz-adaptive algorithms.

A.1 Gradient-Variation Online Learning

The gradient-variation quantity defined in Eq. (2) was first introduced by Chiang et al. [2012] for globally smooth functions. Chiang et al. [2012] obtain $O(\sqrt{V_T})$ and $O(\frac{d}{\alpha}\log V_T)$ regret bounds for general convex and $\alpha$-exp-concave functions, respectively. Zhang et al. [2022a] later achieve an $O(\frac{1}{\lambda}\log V_T)$ result for $\lambda$-strongly convex functions. Considering non-stationary environments, Zhao et al. [2020] study gradient variation under the dynamic regret, a strengthened measure that requires the learner to compete with a sequence of time-varying comparators. Their work revives the study of gradient-variation online learning, especially revealing the importance of the stability analysis in the two-layer online ensemble. Their results are further improved by Zhao et al. [2024], who require only one gradient per round with the same optimal guarantees, through collaboration between the meta and base levels. For universal online learning [van Erven and Koolen, 2016], several recent studies try to derive gradient-variation bounds in this context [Zhang et al., 2022a; Sachs et al., 2023; Yan et al., 2023].

Gradient variation demonstrates its importance through its close connection with many online learning problems. Pioneering works [Rakhlin and Sridharan, 2013a; Syrgkanis et al., 2015; Zhang et al., 2022b] demonstrate the importance of gradient variation via a useful property (Regret bounded by Variation in Utilities, RVU) for achieving fast-rate convergence in multi-player games. Recently, Sachs et al. [2022] and Chen et al. [2024] reveal that the gradient variation is also essential in bridging stochastic and adversarial convex optimization. However, all the mentioned works require the global smoothness assumption. As highlighted in Proposition 2 in Appendix B, the smoothness condition is necessary to obtain gradient-variation bounds for algorithms under the first-order oracle assumption [Nesterov, 2018]. As exceptions, Jacobsen and Cutkosky [2022] and Bai et al. [2022] obtain gradient-variation bounds without the smoothness assumption, but they require implicit updates [Kulis and Bartlett, 2010; Nicolò and Francesco, 2020] per round, which may be inefficient in online settings. Our work considers generalized smoothness, and we develop first-order methods to achieve gradient-variation bounds.

A.2 Generalized Smoothness

Generalized smoothness has received increasing attention in recent years, since the analysis under standard global smoothness is insufficient to depict the dynamics of neural network optimization. Based on empirical observations of the relationship between the smoothness and the gradients of LSTMs, Zhang et al. [2020b] relax the global smoothness assumption to the $(L_0, L_1)$-smoothness condition, which assumes $\|\nabla^2 f(x)\|_2 \le L_0 + L_1\|\nabla f(x)\|_2$ for an offline objective function $f : \mathcal{X} \to \mathbb{R}$; for instance, $f(x) = e^x$ on the real line satisfies $f''(x) = e^x = |f'(x)|$ and is thus $(0, 1)$-smooth despite admitting no global smoothness constant. With this new smoothness assumption, Zhang et al. [2020b] explain the importance of gradient clipping in neural network training.
There are a variety of subsequent works developed for different methods and settings [Zhang et al., 2020a; Crawshaw et al., 2022; Reisizadeh et al., 2023; Faw et al., 2023; Wang et al., 2023]. There are also studies further generalizing the $(L_0, L_1)$-smoothness condition. Chen et al. [2023b] introduce the $\alpha$-symmetric smoothness condition, which symmetrizes the $(L_0, L_1)$-smoothness condition and allows the smoothness to depend polynomially on the gradient. Notably, Li et al. [2023] propose $\ell$-smoothness, defined as $\|\nabla^2 f(x)\|_2 \le \ell(\|\nabla f(x)\|_2)$. This condition does not specify a particular form for the link function $\ell(\cdot)$, other than some mild assumptions on its properties, thereby allowing great generality. Telgarsky [2022] introduces the concept of $(G, L)$-quadratically bounded functions, which aims to generalize the Lipschitz condition as $\|\nabla f(x)\|_2 \le G + L\|x - x_0\|_2$ with a reference point $x_0 \in \mathcal{X}$. Though this notion covers the standard global smoothness condition, it lacks a detailed depiction of the relationship between the smoothness and the gradients, hindering further research into understanding optimization dynamics, such as the role of gradient clipping [Zhang et al., 2020b]. Jacobsen and Cutkosky [2023] investigate online convex optimization under this constraint, where the online functions may be unbounded; however, it is unclear how to obtain the gradient-variation regret in their context using their methods. We also mention another generalization of the standard smoothness called relative smoothness [Lu et al., 2018]: a function $f : \mathcal{X} \to \mathbb{R}$ is $L$-smooth relative to a reference function $h : \mathcal{X} \to \mathbb{R}$ if $f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + L\,\mathcal{D}_h(x, y)$ for any $x, y \in \operatorname{int}\mathcal{X}$, where $\mathcal{D}_h(x, y)$ denotes the Bregman divergence induced by $h$. However, the global constant $L$ is still required to tune the algorithm, and studying this notion is beyond the scope of our paper.

A.3 Lipschitz-Adaptive Algorithms

The upper bound $G$ of the gradients and the diameter $D$ of the bounded feasible domain are often required to build online algorithms. An algorithm that requires the diameter $D$ of the bounded feasible domain but not the upper bound of the gradients is known to be Lipschitz-adaptive [Duchi et al., 2011; Orabona and Pál, 2018; Cutkosky, 2019; Mhammedi et al., 2019]. For the general OCO setting, to the best of our knowledge, there is no Lipschitz-adaptive algorithm that ensures a gradient-variation bound. When specialized to the Prediction with Experts' Advice (PEA) setting [Cesa-Bianchi and Lugosi, 2006], Chen et al. [2021] design a two-layer algorithm with a restarting mechanism that can obtain a gradient-variation bound. In this paper, we develop a new Lipschitz-adaptive algorithm with a lightweight design, which guarantees the gradient-variation bound as well and is, to the best of our knowledge, the only option for our purposes.

B Omitted Details for Section 2

In this section, we first justify the necessity of smoothness for first-order online algorithms in Appendix B.1. In Appendix B.2, we present the omitted discussion of the challenge in designing an algorithm for exp-concave functions. We introduce a key lemma in Appendix B.3, which abstracts the key idea of exploiting generalized smoothness. The proofs of Theorem 1 and Theorem 2 are then provided in Appendix B.4 and Appendix B.5, respectively.
B.1 Necessity of Smoothness

We first provide a lower bound on the convergence rate of first-order methods for convex functions.

Proposition 1 (Theorem 3.2.1 of Nesterov [2018]). There exists a $G$-Lipschitz and convex function $f: \mathbb{R}^d \mapsto \mathbb{R}$ with $\|x_1 - x^\star\| \le D$, where $x^\star \in \arg\min_{x \in \mathbb{R}^d} f(x)$, such that
$$f(x_t) - \min_{x \in \mathbb{R}^d} f(x) \ge \frac{GD}{2(2 + \sqrt{t})}$$
for any optimization scheme that generates a sequence $\{x_t\}$ satisfying $x_t \in x_1 + \operatorname{Lin}\{\nabla f(x_1), \ldots, \nabla f(x_{t-1})\}$, where $\operatorname{Lin}\{a_1, \ldots, a_t\}$ denotes the linear span of the vectors $a_1, \ldots, a_t$.

Yang et al. [2014] prove the necessity of smoothness for obtaining gradient-variation bounds with convex functions using first-order online algorithms. We formally present this idea below.

Proposition 2 (Remark 1 of Yang et al. [2014]). The smoothness assumption on $G$-Lipschitz and convex online functions $f_t: \mathcal{X} \mapsto \mathbb{R}$ is necessary for any online algorithm whose decisions are linear combinations of the queried gradients when no projections are made, and which ensures
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le O\big(\sqrt{V_T}\big) + \text{constant},$$
with only $cT$ gradient queries in total, where $c \in \mathbb{N}$ is a constant independent of $T$ and of environmental parameters such as the Lipschitz constant and the smoothness constant.

Proof. We prove this proposition by contradiction. Assume that the guarantee holds while the online functions $f_t$ are non-smooth. Consider a special case where the feasible domain is $\mathcal{X} = \mathbb{R}^d$, i.e., no projections are made, and all online functions are equal: $f_1 = \cdots = f_T = f$. Notice that, in this case, the gradient variation is zero and therefore
$$\sum_{t=1}^{T} f(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f(x) \le O(1),$$
which implies that $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ achieves an $O(1/T)$ convergence rate (by Jensen's inequality). It remains to check whether this convergence rate contradicts Proposition 1. Denote by $x'_1, \ldots, x'_{cT}$ the points at which the algorithm queries gradients. Because the decisions are linear combinations of the queried gradients when no projections are made, we have
$$\bar{x}_T \in x_1 + \operatorname{Lin}\{\nabla f(x'_1), \ldots, \nabla f(x'_{cT})\},$$
which indicates that there exists a method achieving an $O(c/T) = O(1/T)$ convergence rate under the protocol of Proposition 1, contradicting the lower bound.

B.2 Challenge for Exp-Concave Functions in Regret Minimization

In Section 2, we design algorithms for convex and strongly convex functions, respectively. However, gradient-variation regret for exp-concave functions [Hazan et al., 2007] under generalized smoothness has not yet been addressed. For online learning with globally smooth functions, it has been demonstrated that an $O(d\log V_T)$ regret is attainable for exp-concave functions [Chiang et al., 2012], realized by an optimistic variant of the online Newton step algorithm [Hazan et al., 2007] using the last-round gradient as optimism, i.e., $M_t = \nabla f_{t-1}(x_{t-1})$. However, it remains unclear how to obtain a general optimistic bound of order $O(d\log D_T)$ with $D_T = \sum_{t=1}^{T}\|\nabla f_t(x_t) - M_t\|_2^2$ for arbitrary optimism $\{M_t\}_{t=1}^{T}$. Our technique for achieving gradient-variation regret under generalized smoothness relies on a flexible choice of optimism (which may not be the last-round gradient) and on step size tuning (which requires a trajectory-wise stability analysis), making the extension to exp-concave functions challenging. We leave this as an open question for future research.

B.3 Lemma for Regret Minimization

Below, we present a lemma that leverages the local smoothness over the optimization trajectory to derive gradient variation within the OMD framework. This lemma can be applied in the analysis of both convex and strongly convex functions.
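For readability, we also recall the two-sequence structure of optimistic OMD that the following lemma refers to; this is a restatement for convenience, and the precise form together with the equation numbering of Eq. (3)-(4) is given in the main text:
$$x_t = \operatorname*{arg\,min}_{x \in \mathcal{X}}\big\{ \langle M_t, x \rangle + D_{\psi_t}(x, \widehat{x}_t) \big\}, \qquad \widehat{x}_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}}\big\{ \langle \nabla f_t(x_t), x \rangle + D_{\psi_t}(x, \widehat{x}_t) \big\},$$
where $M_t$ is the optimism and $D_{\psi_t}$ denotes the Bregman divergence induced by the time-varying regularizer $\psi_t$.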
Lemma 2. Under Assumptions 1 and 2, by selecting the regularizer $\psi_t(x) = \frac{1}{2\eta_t}\|x\|_2^2$, setting step sizes satisfying $\eta_{t+1} \le \eta_t$ and $\eta_t \le 1/(4\widehat{L}_{t-1})$, where $\widehat{L}_{t-1} = \ell_{t-1}(2\|\nabla f_{t-1}(\widehat{x}_t)\|_2)$, and choosing the optimism $M_t = \nabla f_{t-1}(\widehat{x}_t)$, the OMD in Eq. (4) ensures the regret bound
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle \le \sum_{t=1}^{T}\frac{\|x^\star - \widehat{x}_t\|_2^2 - \|x^\star - \widehat{x}_{t+1}\|_2^2}{2\eta_t} + \sum_{t=1}^{T}\Big( \eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 - \frac{1}{4\eta_t}\|x_t - \widehat{x}_t\|_2^2 \Big), \quad (11)$$
where $x^\star \in \arg\min_{x \in \mathcal{X}}\sum_{t=1}^{T} f_t(x)$ denotes the best decision in hindsight.

Proof. Following the standard analysis of OMD (Lemma 3), OMD with the optimism chosen as $M_t = \nabla f_{t-1}(\widehat{x}_t)$ ensures the regret bound
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle \le \underbrace{\sum_{t=1}^{T}\frac{\|x^\star - \widehat{x}_t\|_2^2 - \|x^\star - \widehat{x}_{t+1}\|_2^2}{2\eta_t}}_{\textsc{Term-A}} + \underbrace{\sum_{t=1}^{T}\frac{\eta_t}{2}\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2}_{\textsc{Term-B}} - \underbrace{\sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2}_{\textsc{Term-C}},$$
where the first term of Lemma 3 is handled by $\langle a, b\rangle \le \frac{\eta_t}{2}\|a\|_2^2 + \frac{1}{2\eta_t}\|b\|_2^2$, absorbing the term $D_{\psi_t}(\widehat{x}_{t+1}, x_t)$.

The analysis of Term-A is straightforward under Assumption 2:
$$\textsc{Term-A} \le \frac{1}{2\eta_1}\|x^\star - \widehat{x}_1\|_2^2 + \sum_{t=2}^{T}\Big( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \Big)\|x^\star - \widehat{x}_t\|_2^2 \le \frac{D^2}{2\eta_1} + \sum_{t=2}^{T}\Big( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \Big)D^2 = \frac{D^2}{2\eta_T}.$$

Next, we analyze Term-B under the generalized smoothness condition. Using $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$, we decompose Term-B into the following two terms:
$$\textsc{Term-B} \le \sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + \sum_{t=2}^{T}\eta_t\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2, \quad (12)$$
where the first term is the desirable gradient variation and the second term requires further analysis under generalized smoothness. Given the optimism setting and the step size tuning, we demonstrate that $x_t$ and $\widehat{x}_t$ are sufficiently close to apply local smoothness:
$$\|x_t - \widehat{x}_t\|_2 \le \eta_t\|\nabla f_{t-1}(\widehat{x}_t)\|_2 \le \frac{\|\nabla f_{t-1}(\widehat{x}_t)\|_2}{4\widehat{L}_{t-1}}.$$
Here the first inequality follows from the Pythagorean theorem and the second from the step size tuning. Therefore, we can apply Lemma 1 to bound the gradient deviation in Eq. (12) by
$$\sum_{t=2}^{T}\eta_t\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2 \le \sum_{t=2}^{T}\eta_t\widehat{L}_{t-1}^2\|x_t - \widehat{x}_t\|_2^2.$$
Finally, combining Term-A, Term-B, and Term-C, we obtain
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle \le \frac{D^2}{2\eta_T} + \sum_{t=1}^{T}\Big( \eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 - \frac{1}{4\eta_t}\|x_t - \widehat{x}_t\|_2^2 \Big) + \sum_{t=1}^{T}\Big( \eta_t\widehat{L}_{t-1}^2 - \frac{1}{4\eta_t} \Big)\|x_t - \widehat{x}_t\|_2^2$$
$$\le \frac{D^2}{2\eta_T} + \sum_{t=1}^{T}\Big( \eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 - \frac{1}{4\eta_t}\|x_t - \widehat{x}_t\|_2^2 \Big),$$
where the second inequality uses the setting $\eta_t \le 1/(4\widehat{L}_{t-1})$. Hence, we finish the proof.

B.4 Proof of Theorem 1

Proof. The step size tuning in Eq. (7) of Theorem 1 satisfies the criteria for applying Lemma 2, namely $\eta_{t+1} \le \eta_t$ and $\eta_t \le 1/(4\widehat{L}_{t-1})$ with $\widehat{L}_{t-1} = \ell_{t-1}(2\|\nabla f_{t-1}(\widehat{x}_t)\|_2)$. Therefore, by Lemma 2 and the convexity of the loss functions, we obtain
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^\star) \le \sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle \le \frac{D^2}{2\eta_T} + \sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2,$$
where $x^\star \in \arg\min_{x \in \mathcal{X}}\sum_{t=1}^{T} f_t(x)$. The first term can be bounded as
$$\frac{D^2}{2\eta_T} \le 2\widehat{L}_{\max}D^2 + D\sqrt{\sum_{t=2}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2} + D = O\Big( \widehat{L}_{\max}D^2 + D\sqrt{V_T} \Big),$$
where we denote by $\widehat{L}_{\max} = \max_{t \in [T]}\widehat{L}_t$. For the second term, we apply the self-confident tuning lemma (Lemma 5) with $f(x) = 1/\sqrt{x}$ in the lemma statement:
$$\sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 \le D\sum_{t=1}^{T}\frac{\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2}{\sqrt{1 + \sum_{s=1}^{t-1}\|\nabla f_s(x_s) - \nabla f_{s-1}(x_s)\|_2^2}} \le 2D\sqrt{1 + \sum_{s=1}^{T}\|\nabla f_s(x_s) - \nabla f_{s-1}(x_s)\|_2^2} + D\max_{t \in [T]}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 \le O\Big( D\sqrt{V_T} + \widehat{G}_{\max}^2 D \Big),$$
where $\widehat{G}_{\max}$ denotes the maximum empirically observed Lipschitz constant. The additional term $O(\widehat{G}_{\max}^2 D)$ results from the lack of knowledge of $\widehat{G}_{\max}$. However, we can improve this term to $O(\widehat{G}_{\max} D)$ by incorporating the clipping technique [Cutkosky, 2019; Chen et al., 2021] into the OMD framework. In the main text, we avoid introducing this method to keep the approach uncluttered; our goal there is to illustrate how to adapt to generalized smoothness at the base level. The details of this refined approach are provided in Appendix B.6.

B.5 Proof of Theorem 2

Proof. For $\lambda$-strongly convex functions, we tune the step size as $\eta_t = 2/(\lambda t + 16\max_{s \in [t-1]}\widehat{L}_s)$, where $\widehat{L}_t = \ell_t(2\|\nabla f_t(\widehat{x}_{t+1})\|_2)$ denotes the locally estimated smoothness constant.
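For intuition before the formal analysis, here is a minimal sketch of this update loop in the Euclidean case (a sketch only, not the paper's pseudocode; the ball projection standing in for $\Pi_{\mathcal{X}}$, the link function `ell`, and the gradient oracles `grads` are illustrative assumptions):

```python
import numpy as np

def project(x, radius=1.0):
    # Euclidean projection onto an l2 ball (stand-in for the projection
    # onto the feasible domain X).
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def optimistic_omd_strongly_convex(grads, x_hat0, ell, lam):
    """Optimistic OMD with step size
        eta_t = 2 / (lam * t + 16 * max_{s in [t-1]} L_hat_s),
    where L_hat_s = ell(2 * ||grad f_s(x_hat_{s+1})||) is the locally
    estimated smoothness constant. `grads[t]` returns grad f_{t+1}(x).
    """
    x_hat = np.asarray(x_hat0, dtype=float)
    L_hist, decisions = [0.0], []   # empty history treated as 0 here
    for t, grad_t in enumerate(grads, start=1):
        eta = 2.0 / (lam * t + 16.0 * max(L_hist))
        # Optimism M_t = grad f_{t-1}(x_hat_t); none at the first round.
        m = grads[t - 2](x_hat) if t >= 2 else np.zeros_like(x_hat)
        x = project(x_hat - eta * m)              # submitted decision x_t
        x_hat = project(x_hat - eta * grad_t(x))  # intermediate x_hat_{t+1}
        L_hist.append(ell(2.0 * np.linalg.norm(grad_t(x_hat))))
        decisions.append(x)
    return decisions
```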
By the property of strongly convex functions and Eq. (11) in Lemma 2, we have
$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^\star) \le \sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle - \frac{\lambda}{2}\sum_{t=1}^{T}\|x^\star - x_t\|_2^2 \le \underbrace{\sum_{t=1}^{T}\Big( \frac{\|x^\star - \widehat{x}_t\|_2^2 - \|x^\star - \widehat{x}_{t+1}\|_2^2}{2\eta_t} - \frac{\lambda}{2}\|x^\star - x_t\|_2^2 \Big)}_{\textsc{Term-A}} + \underbrace{\sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2}_{\textsc{Term-B}} - \underbrace{\sum_{t=1}^{T}\frac{1}{4\eta_t}\|x_t - \widehat{x}_t\|_2^2}_{\textsc{Term-C}}.$$

Unlike the case of convex functions, Term-A involves negative terms derived from the strong convexity of the loss functions, which require a slightly different analysis. Since $\frac{1}{2\eta_t} = \frac{\lambda t}{4} + 4\max_{s \in [t-1]}\widehat{L}_s$, we have
$$\textsc{Term-A} \le \frac{1}{2\eta_1}\|x^\star - \widehat{x}_1\|_2^2 + \sum_{t=2}^{T}\Big( \frac{\lambda}{4} + 4\max_{s \in [t-1]}\widehat{L}_s - 4\max_{s \in [t-2]}\widehat{L}_s \Big)\|x^\star - \widehat{x}_t\|_2^2 - \sum_{t=1}^{T}\frac{\lambda}{2}\|x^\star - x_t\|_2^2$$
$$\le \frac{\lambda D^2}{4} + 4D^2\max_{s \in [T]}\widehat{L}_s + \sum_{t=1}^{T}\Big( \frac{\lambda}{4}\|x^\star - \widehat{x}_t\|_2^2 - \frac{\lambda}{2}\|x^\star - x_t\|_2^2 \Big) \qquad \text{(bounded domain assumption)}$$
$$\le \frac{\lambda D^2}{4} + 4D^2\widehat{L}_{\max} + \sum_{t=1}^{T}\frac{\lambda}{2}\|x_t - \widehat{x}_t\|_2^2 \qquad \big( \|x^\star - \widehat{x}_t\|_2^2 \le 2\|x^\star - x_t\|_2^2 + 2\|x_t - \widehat{x}_t\|_2^2 \big)$$
$$\le \frac{\lambda D^2}{4} + 4D^2\widehat{L}_{\max} + \sum_{t=1}^{T}\frac{\lambda\eta_t^2}{2}\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2 \qquad \text{(Lemma 4)}$$
$$\le \frac{\lambda D^2}{4} + 4D^2\widehat{L}_{\max} + \sum_{t=1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2 \qquad (\lambda\eta_t \le 2)$$
$$\le \frac{\lambda D^2}{4} + 4D^2\widehat{L}_{\max} + \sum_{t=1}^{T}2\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + \sum_{t=1}^{T}2\eta_t\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2,$$
where the last term can be cancelled by Term-C (as in the proof of Lemma 2), and the third term is of the same order as Term-B. Therefore, we only need to focus on Term-B.

For Term-B, we follow the analysis of Chen et al. [2024], which applies a simpler argument for self-confident tuning. First, define $\bar{V}_T = \sum_{t=1}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2$ and pick the split point $\alpha = \lceil \bar{V}_T/(4\widehat{G}_{\max}^2) \rceil$. Dividing the time horizon into $[1, \alpha]$ and $[\alpha+1, T]$, and using $\eta_t \le 2/(\lambda t)$, we can upper bound Term-B (together with the same-order term from Term-A) as
$$2\sum_{t=1}^{\alpha}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + 2\sum_{t=\alpha+1}^{T}\eta_t\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 \le 4\Big( \max_{s \in [T]}\|\nabla f_s(x_s) - \nabla f_{s-1}(x_s)\|_2^2 \Big)\sum_{t=1}^{\alpha}\frac{1}{\lambda t} + \frac{4}{\lambda(\alpha+1)}\sum_{t=\alpha+1}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 \quad (13)$$
$$\le \frac{16\widehat{G}_{\max}^2}{\lambda}(1 + \ln\alpha) + \frac{4}{\lambda(\alpha+1)}\bar{V}_T \le \frac{16\widehat{G}_{\max}^2}{\lambda}\ln\Big( \sum_{t=1}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + 1 \Big) + \frac{16\widehat{G}_{\max}^2 + 4}{\lambda},$$
where we use $\widehat{G}_{\max}$ to denote the maximum empirically estimated Lipschitz constant. Therefore,
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star) \le O\Big( \frac{\widehat{G}_{\max}^2}{\lambda}\log V_T + \widehat{L}_{\max}D^2 + \lambda D^2 \Big),$$
which finishes the proof.

B.6 OMD Incorporating the Clipping Technique

In Appendix B.4, an additional term $O(\widehat{G}_{\max}^2 D)$ shows up in the final regret bound, resulting from the lack of knowledge of $\widehat{G}_{\max}$. In this subsection, we improve this term to $O(\widehat{G}_{\max} D)$ by incorporating the clipping technique [Cutkosky, 2019; Chen et al., 2021] into the OMD framework. The modified OMD algorithm is defined as
$$x_t = \Pi_{\mathcal{X}}\big[ \widehat{x}_t - \eta_t\nabla f_{t-1}(\widehat{x}_t) \big], \qquad \widehat{x}_{t+1} = \Pi_{\mathcal{X}}\big[ \widehat{x}_t - \eta_t\widetilde{g}_t \big], \quad (14)$$
where $\widetilde{g}_t = \nabla f_{t-1}(\widehat{x}_t) + \frac{B_{t-1}}{B_t}\big( \nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t) \big)$ is a clipped gradient with the maintained threshold $B_t = \max\{B_0, \max_{s \in [t]}\|\nabla f_s(x_s) - \nabla f_{s-1}(\widehat{x}_s)\|_2\}$. Notice that $x_t$ still updates from $\widehat{x}_t$ in the same manner as in Theorem 1; therefore, a similar analysis applies when exploiting the smoothness locally. Correspondingly, we provide the following step size tuning:
$$\eta_t = \min\left\{ \frac{D}{\sqrt{B_{t-1}^2 + \sum_{s=1}^{t-1}\|\widetilde{g}_s - \nabla f_{s-1}(\widehat{x}_s)\|_2^2}},\ \min_{s \in [t]}\frac{1}{4\widehat{L}_{s-1}} \right\}, \quad (15)$$
where the key modification is adding $B_{t-1}^2$ to the denominator to facilitate the tuning analysis, and we denote by $\widehat{L}_t = \ell_t(2\|\nabla f_t(\widehat{x}_{t+1})\|_2)$. The following theorem presents the guarantee.

Theorem 5. Under Assumptions 1-2, assuming the online functions are convex, OMD as in Eq. (14) with the step sizes in Eq. (15) ensures
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}}\sum_{t=1}^{T} f_t(x) \le 3D\sqrt{2V_T} + 4\widehat{L}_{\max}D^2 + 5\widehat{G}_{\max}D,$$
where $V_T = \sum_{t=2}^{T}\sup_{x \in \mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ is the gradient variation, $\widehat{L}_{\max} = \max_{t \in [T]}\widehat{L}_t$ is the maximum smoothness constant over the optimization trajectory, and $\widehat{G}_{\max}$ is the maximum empirically estimated Lipschitz constant.

Proof. First, we prove that the clipping technique incurs only an additive constant in the regret:
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t) - \widetilde{g}_t, x_t - x^\star \rangle = \sum_{t=1}^{T}\frac{B_t - B_{t-1}}{B_t}\langle \nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t), x_t - x^\star \rangle \le D\sum_{t=1}^{T}(B_t - B_{t-1})\cdot\frac{\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2}{B_t} \le D\sum_{t=1}^{T}(B_t - B_{t-1}) = D(B_T - B_0) = O(\widehat{G}_{\max}D).$$
In the following analysis, we focus on the regret associated with the clipped gradient $\widetilde{g}_t$.
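To make the clipping mechanism of Eq. (14) concrete, here is a small sketch of the threshold and surrogate-gradient computation (a sketch under the definitions above, with illustrative names, not the authors' implementation):

```python
import numpy as np

def clipped_surrogate(grad_prev_at_xhat, grad_curr_at_x, B_prev, B0=1.0):
    """Clipped gradient of Eq. (14):
        g_tilde = grad f_{t-1}(x_hat_t)
                  + (B_{t-1} / B_t) * (grad f_t(x_t) - grad f_{t-1}(x_hat_t)),
    with threshold B_t = max(B_0, max_{s<=t} ||grad f_s(x_s) - grad f_{s-1}(x_hat_s)||).
    Returns the surrogate gradient and the updated threshold."""
    deviation = grad_curr_at_x - grad_prev_at_xhat
    # Running maximum keeps B_t nondecreasing, starting from B_0.
    B_curr = max(B_prev, B0, np.linalg.norm(deviation))
    g_tilde = grad_prev_at_xhat + (B_prev / B_curr) * deviation
    return g_tilde, B_curr
```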
Following the standard analysis of OMD (Lemma 3), with simple calculations we arrive at
$$\sum_{t=1}^{T}\langle \widetilde{g}_t, x_t - x^\star \rangle \le \underbrace{\frac{D^2}{2\eta_T}}_{\textsc{Term-A}} + \underbrace{\sum_{t=1}^{T}\eta_t\|\widetilde{g}_t - \nabla f_{t-1}(\widehat{x}_t)\|_2^2}_{\textsc{Term-B}} - \underbrace{\sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2}_{\textsc{Term-C}}. \quad (16)$$

By the step size tuning, Term-A can be bounded as
$$\textsc{Term-A} \le 2\widehat{L}_{\max}D^2 + D\sqrt{B_T^2 + \sum_{s=1}^{T}\|\widetilde{g}_s - \nabla f_{s-1}(\widehat{x}_s)\|_2^2} \le 2\widehat{L}_{\max}D^2 + \widehat{G}_{\max}D + D\sqrt{\sum_{s=2}^{T}\frac{B_{s-1}^2}{B_s^2}\|\nabla f_s(x_s) - \nabla f_{s-1}(\widehat{x}_s)\|_2^2} \le 2\widehat{L}_{\max}D^2 + \widehat{G}_{\max}D + \underbrace{D\sqrt{\sum_{s=2}^{T}\|\nabla f_s(x_s) - \nabla f_{s-1}(\widehat{x}_s)\|_2^2}}_{\textsc{Term-A-Var}}. \quad (17)$$

For Term-B in (16), the self-confident tuning lemma (Lemma 6) yields
$$\textsc{Term-B} \le D\sum_{t=1}^{T}\frac{\|\widetilde{g}_t - \nabla f_{t-1}(\widehat{x}_t)\|_2^2}{\sqrt{B_{t-1}^2 + \sum_{s=1}^{t-1}\|\widetilde{g}_s - \nabla f_{s-1}(\widehat{x}_s)\|_2^2}} \le D\sum_{t=1}^{T}\frac{\|\widetilde{g}_t - \nabla f_{t-1}(\widehat{x}_t)\|_2^2}{\sqrt{\sum_{s=1}^{t}\|\widetilde{g}_s - \nabla f_{s-1}(\widehat{x}_s)\|_2^2}} \le 2D\sqrt{\sum_{t=1}^{T}\|\widetilde{g}_t - \nabla f_{t-1}(\widehat{x}_t)\|_2^2} \le \underbrace{2D\sqrt{\sum_{t=2}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2}}_{\textsc{Term-B-Var}} + 2\widehat{G}_{\max}D. \quad (18)$$

Combining Term-A-Var in (17), Term-B-Var in (18), and the negative Term-C in (16):
$$\textsc{Term-A-Var} + \textsc{Term-B-Var} - \textsc{Term-C} \le 3D\sqrt{2\sum_{t=2}^{T}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + 2\sum_{t=2}^{T}\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2} - \sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2$$
$$\le 3D\sqrt{2V_T} + 5D\sqrt{\sum_{t=2}^{T}\|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\widehat{x}_t)\|_2^2} - \sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2 \le 3D\sqrt{2V_T} + 5D\sqrt{\sum_{t=2}^{T}\widehat{L}_{t-1}^2\|x_t - \widehat{x}_t\|_2^2} - \sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2$$
$$\le 3D\sqrt{2V_T} + 16\widehat{L}_{\max}D^2 + \sum_{t=2}^{T}2\max_{s \in [t-1]}\widehat{L}_s\|x_t - \widehat{x}_t\|_2^2 - \sum_{t=1}^{T}\frac{1}{2\eta_t}\|x_t - \widehat{x}_t\|_2^2 \le 3D\sqrt{2V_T} + 16\widehat{L}_{\max}D^2,$$
where the third line applies Lemma 1, the fourth line applies the AM-GM inequality $2\sqrt{ab} \le a + b$, and the final inequality uses $\eta_t \le 1/(4\max_{s \in [t-1]}\widehat{L}_s)$ so that Term-C cancels the remaining positive terms. Combining each component, we conclude the proof:
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^\star \rangle \le \sum_{t=1}^{T}\langle \widetilde{g}_t, x_t - x^\star \rangle + 2\widehat{G}_{\max}D \le 3D\sqrt{2V_T} + 4\widehat{L}_{\max}D^2 + 5\widehat{G}_{\max}D.$$

C Omitted Details for Section 3

In this section, we present the omitted details for Section 3. Appendix C.1 discusses how to set the optimism $m_t$ efficiently via a binary search. Appendix C.2 provides the omitted discussion of the challenge in universal online learning with exp-concave functions. Proofs of the theorems in Section 3 are included in Appendix C.3 and Appendix C.4. Appendix C.5 gives a simple universal algorithm under the global smoothness condition, which improves the optimality and efficiency of the method of Yan et al. [2023] at a cost of additional function value queries.

C.1 Discussion on Optimism

In this part, we discuss how to set the optimism $m_t$, which involves the decision $p_t$. We recall the steps for incorporating the optimism and generating the decision $p_t$ in Algorithm 1:
$$\widetilde{w}_{t,i} = w_{t,i}\exp(\eta_{t,i}m_{t,i}), \qquad p_{t,i} = \frac{\eta_{t,i}\widetilde{w}_{t,i}}{\sum_{j \in [N]}\eta_{t,j}\widetilde{w}_{t,j}}. \quad (19)$$
For generality, we set the optimism as $m_{t,i} = f\big(\sum_{j \in [N]}p_{t,j}x_j\big) - f(x_i)$, where $f: \mathcal{X} \mapsto \mathbb{R}$ is a convex and continuous function and $x_i \in \mathcal{X}$ is a decision available before setting the optimism. Notice that $\widetilde{w}_{t,i}$ requires $p_t$ for its update, while $p_t$ is produced from $\widetilde{w}_{t,i}$, resulting in a circular dependency. Following Wei et al. [2016], $p_t$ can be computed via a binary search. We define $\alpha = f\big(\sum_{j \in [N]}p_{t,j}x_j\big)$; then the weight $\widetilde{w}_{t,i}$ is a function of $\alpha$, namely $\widetilde{w}_{t,i}(\alpha) = w_{t,i}\exp(\eta_{t,i}(\alpha - f(x_i)))$. Furthermore, with the update formulation in (19), the decision $p_{t,i}$ is a function of $\alpha$ as well: $p_{t,i}(\alpha) = \frac{\eta_{t,i}\widetilde{w}_{t,i}(\alpha)}{\sum_{j \in [N]}\eta_{t,j}\widetilde{w}_{t,j}(\alpha)}$. Introducing $g(\alpha) = f\big(\sum_{j \in [N]}p_{t,j}(\alpha)x_j\big)$, solving for $p_t$ is equivalent to solving the fixed-point equation $g(\alpha) = \alpha$.

Below, we prove the existence of a solution to $g(\alpha) = \alpha$. Provided a lower bound $\underline{f}$ of the function $f(\cdot)$, and by the convexity of $f(\cdot)$, the search range of $\alpha$ is restricted to $\underline{f} \le \alpha \le \max_{i \in [N]}\{f(x_i)\}$, so $\alpha$ is bounded. The continuity of $f(\cdot)$ implies the continuity of $g(\alpha)$. The choice $\alpha = \underline{f}$ results in $g(\alpha) - \alpha = f\big(\sum_{j}p_{t,j}(\alpha)x_j\big) - \underline{f} \ge 0$, and the choice $\alpha = \max_{i \in [N]}\{f(x_i)\}$ implies $g(\alpha) - \alpha \le \sum_{j \in [N]}p_{t,j}(\alpha)f(x_j) - \max_{i \in [N]}\{f(x_i)\} \le 0$ by Jensen's inequality, indicating that a solution to $g(\alpha) = \alpha$ exists by the intermediate value theorem.
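Under the stated assumptions, the fixed-point computation can be sketched as follows (a sketch consistent with the procedure described above; the helper name `solve_optimism` is hypothetical, `f` is the convex function, `xs` the available decisions, `w` and `eta` the weight and learning-rate arrays, and `lo`, `hi` the search range just derived):

```python
import numpy as np

def solve_optimism(f, xs, w, eta, lo, hi, tol=1e-6):
    """Binary search for alpha with g(alpha) = alpha, where
    p_i(alpha) is proportional to eta_i * w_i * exp(eta_i * (alpha - f(x_i)))
    and g(alpha) = f(sum_i p_i(alpha) * x_i). The derivation above gives
    g(lo) >= lo and g(hi) <= hi, so a crossing exists by continuity."""
    f_vals = np.array([f(x) for x in xs])

    def g(alpha):
        logits = eta * (alpha - f_vals)
        logits -= logits.max()          # shared shift cancels in normalization
        p = eta * w * np.exp(logits)
        p /= p.sum()
        return f(np.tensordot(p, np.stack(xs), axes=1))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) >= mid:
            lo = mid                    # the crossing lies in [mid, hi]
        else:
            hi = mid
    return 0.5 * (lo + hi)
```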
Using a binary search over $[\underline{f}, \max_{i \in [N]}\{f(x_i)\}]$, we can approximate $\alpha$ within error $O(1/T)$ in $O(\log T)$ iterations. The above argument requires a lower bound on $f(\cdot)$ to determine the search range of $\alpha$. In the simpler case where $f(x_i) = \ell_i$ and $f\big(\sum_{i \in [N]}p_{t,i}x_i\big) = \sum_{i \in [N]}p_{t,i}\ell_i$, we can drop the requirement of a lower bound, since $\alpha$ falls within $[\min_{i \in [N]}\{\ell_i\}, \max_{i \in [N]}\{\ell_i\}]$ owing to the linear structure.

C.2 Challenge for Exp-Concave Functions in Universal Online Learning

In Section 3.4, we present Algorithm 2, which achieves gradient-variation bounds for convex and strongly convex functions simultaneously. However, it does not guarantee such bounds for exp-concave functions. Besides the challenges discussed in Appendix B.2, a new obstacle arises in designing the meta-algorithm. We require the optimism to satisfy $\langle p_t, m_t \rangle \le 0$, since we pass heterogeneous inputs to the meta-algorithm, and we therefore set $m_{t,i} = f_{t-1}(x_t) - f_{t-1}(x_{t,i})$ for all base-learners. This optimism design is suitable for strongly convex functions, as the term $\sqrt{\sum_t (f_{t-1}(x_t) - f_{t-1}(x_{t,i}))^2}$ introduced by the optimism can be bounded by $\widehat{G}_{\max}\sqrt{\sum_t \|x_t - x_{t,i}\|_2^2}$ and cancelled by the negative term $-\sum_t \lambda\|x_t - x_{t,i}\|_2^2$ from strong convexity. However, for exp-concave functions, the negative term $-\sum_t \langle \nabla f_t(x_t), x_t - x_{t,i} \rangle^2$ from exp-concavity may not suffice to cancel $\sqrt{\sum_t (f_{t-1}(x_t) - f_{t-1}(x_{t,i}))^2}$. We leave this as an open problem for future exploration.

C.3 Proof of Theorem 3

In this subsection, we prove a slightly generalized version of Theorem 3, which does not specify the optimism and instead imposes conditions only on $r_t$. One can verify that the optimism setting in Theorem 3 satisfies the requirement of Theorem 6.

Theorem 6. By setting $r_{t,i}$ such that $\sum_{i=1}^{N}p_{t,i}\bar{r}_{t,i} \le 0$, Algorithm 1 ensures that, for any $i \in [N]$, the regret $\sum_{t=1}^{T}\langle p_t, \ell_t \rangle - \sum_{t=1}^{T}\ell_{t,i}$ is bounded by
$$O\left( \left( \sqrt{\sum_{t=1}^{T}(r_{t,i} - m_{t,i})^2} + B_T \right)\big( \log N + \log(B_T + \log T) \big) \right),$$
where $\bar{r}_{t,i} = m_{t,i} + \frac{B_{t-1}}{B_t}(r_{t,i} - m_{t,i})$ denotes the clipped input and $B_T = \max\{B_0, \max_{t \in [T]}\|r_t - m_t\|_\infty\}$.

Proof. First, we demonstrate that the clipping technique [Chen et al., 2021; Cutkosky, 2019] incurs only an additive constant in the regret:
$$\sum_{t=1}^{T}(r_{t,i} - \bar{r}_{t,i}) = \sum_{t=1}^{T}\Big( r_{t,i} - m_{t,i} - \frac{B_{t-1}}{B_t}(r_{t,i} - m_{t,i}) \Big) = \sum_{t=1}^{T}\frac{B_t - B_{t-1}}{B_t}(r_{t,i} - m_{t,i}) \le \sum_{t=1}^{T}\frac{B_t - B_{t-1}}{B_t}\|r_t - m_t\|_\infty \le \sum_{t=1}^{T}(B_t - B_{t-1}) = B_T - B_0. \quad (20)$$

In the following, we focus on the analysis associated with the clipped input $\bar{r}_{t,i}$. Following previous work [Wei et al., 2016], define $W_t = \sum_{i=1}^{N}w_{t,i}$, the summation of weights at time $t$, which serves as the potential to be analyzed. Next, we upper bound $\ln W_{T+1}$. By the inequality $x \le x^{\alpha} + (\alpha - 1)/e$ for $x > 0$ and $\alpha \ge 1$, for any $i \in [N]$ we have
$$w_{T+1,i} \le (w_{T+1,i})^{\frac{\eta_{T,i}}{\eta_{T+1,i}}} + \frac{1}{e}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big). \quad (21)$$
Based on the updates in Line 7 of Algorithm 1, we bound the first term on the right-hand side as
$$(w_{T+1,i})^{\frac{\eta_{T,i}}{\eta_{T+1,i}}} = w_{T,i}\exp\big( \eta_{T,i}\bar{r}_{T,i} - \eta_{T,i}^2(\bar{r}_{T,i} - m_{T,i})^2 \big) = \widetilde{w}_{T,i}\exp\big( \eta_{T,i}(\bar{r}_{T,i} - m_{T,i}) - \eta_{T,i}^2(\bar{r}_{T,i} - m_{T,i})^2 \big) \le \widetilde{w}_{T,i}\big( 1 + \eta_{T,i}(\bar{r}_{T,i} - m_{T,i}) \big). \quad (22)$$
The last inequality uses $\exp(x - x^2) \le 1 + x$ for $x \ge -1/2$; verifying this range condition is the crucial requirement for a Lipschitz-adaptive meta-algorithm. By the tuning of the learning rates and the clipping technique, the range of $\eta_{T,i}(\bar{r}_{T,i} - m_{T,i})$ is well controlled:
$$\eta_{T,i}|\bar{r}_{T,i} - m_{T,i}| = \eta_{T,i}\frac{B_{T-1}}{B_T}|r_{T,i} - m_{T,i}| \le \eta_{T,i}B_{T-1} \le \frac{1}{2},$$
which meets the criterion for applying the mentioned inequality.
By plugging inequality (22) into (21), we can further analyze the weights of all experts at time $T$. Recall from Algorithm 1 the learning rate setting $\eta_{t,i} = 1/\sqrt{1 + \sum_{s=1}^{t-1}(\bar{r}_{s,i} - m_{s,i})^2 + 4B_{t-1}^2}$. We have
$$W_{T+1} = \sum_{i=1}^{N}w_{T+1,i} \le \sum_{i=1}^{N}\widetilde{w}_{T,i}\big( 1 + \eta_{T,i}(\bar{r}_{T,i} - m_{T,i}) \big) + \frac{1}{e}\sum_{i=1}^{N}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big)$$
$$= \sum_{i=1}^{N}\widetilde{w}_{T,i}(1 - \eta_{T,i}m_{T,i}) + \sum_{i=1}^{N}\eta_{T,i}\widetilde{w}_{T,i}\bar{r}_{T,i} + \frac{1}{e}\sum_{i=1}^{N}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big) \le \sum_{i=1}^{N}\widetilde{w}_{T,i}\exp(-\eta_{T,i}m_{T,i}) + \Big( \sum_{j=1}^{N}\eta_{T,j}\widetilde{w}_{T,j} \Big)\sum_{i=1}^{N}p_{T,i}\bar{r}_{T,i} + \frac{1}{e}\sum_{i=1}^{N}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big) \le \sum_{i=1}^{N}w_{T,i} + \frac{1}{e}\sum_{i=1}^{N}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big),$$
where we apply $1 - x \le \exp(-x)$ for any $x \in \mathbb{R}$, and the last inequality is by the assumption in the theorem statement that $\sum_{i=1}^{N}p_{t,i}\bar{r}_{t,i} \le 0$ for any $t \in [T]$. Now we can upper bound $W_{T+1}$ inductively:
$$W_{T+1} \le W_T + \frac{1}{e}\sum_{i=1}^{N}\Big( \frac{\eta_{T,i}}{\eta_{T+1,i}} - 1 \Big) \le \cdots \le N + \frac{1}{e}\sum_{t=1}^{T}\sum_{i=1}^{N}\Big( \frac{\eta_{t,i}}{\eta_{t+1,i}} - 1 \Big), \quad (23)$$
where the last inequality is by induction. It remains to analyze the deviations of the learning rates. We present the following analysis tailored to the new learning rate setting: for any $i \in [N]$,
$$\frac{\eta_{t,i}}{\eta_{t+1,i}} - 1 = \sqrt{\frac{1 + \sum_{s=1}^{t}(\bar{r}_{s,i} - m_{s,i})^2 + 4B_t^2}{1 + \sum_{s=1}^{t-1}(\bar{r}_{s,i} - m_{s,i})^2 + 4B_{t-1}^2}} - 1 = \sqrt{1 + \frac{4B_t^2 - 4B_{t-1}^2 + (\bar{r}_{t,i} - m_{t,i})^2}{1 + \sum_{s=1}^{t-1}(\bar{r}_{s,i} - m_{s,i})^2 + 4B_{t-1}^2}} - 1 \le \frac{1}{2}\cdot\frac{4B_t^2 - 4B_{t-1}^2 + (\bar{r}_{t,i} - m_{t,i})^2}{1 + \sum_{s=1}^{t-1}(\bar{r}_{s,i} - m_{s,i})^2 + 4B_{t-1}^2} \quad (\sqrt{1 + x} \le 1 + \tfrac{x}{2})$$
$$= \frac{1}{2}\cdot\frac{\phi_{t,i}}{1 + 4B_0^2 + \sum_{s=1}^{t-1}\phi_{s,i}}, \qquad \big( \phi_{t,i} = 4B_t^2 - 4B_{t-1}^2 + (\bar{r}_{t,i} - m_{t,i})^2 \ge 0 \big)$$
By Lemma 5 with $f(x) = 1/x$, $a_0 = 1 + 4B_0^2$, and $a_t = \phi_{t,i}$ in the lemma statement, summing the preceding inequality over $t$ from 1 to $T$ gives
$$\frac{1}{e}\sum_{t=1}^{T}\Big( \frac{\eta_{t,i}}{\eta_{t+1,i}} - 1 \Big) \le \frac{2B_T^2}{1 + 4B_0^2} + \frac{1}{2}\ln\Big( 1 + \frac{(T + 4)B_T^2}{1 + 4B_0^2} \Big). \quad (24)$$
Combining (23) and (24), we can upper bound $\ln W_{T+1}$ as
$$\ln W_{T+1} \le \ln\Big( 1 + \frac{B_T^2}{1 + 4B_0^2} + \frac{1}{2e}\ln\Big( 1 + \frac{TB_T^2}{1 + 4B_0^2} \Big) \Big) + \ln N. \quad (25)$$

In the other direction, we lower bound $\ln W_{T+1} \ge \ln w_{T+1,i}$ with an inductive argument:
$$\frac{1}{\eta_{T+1,i}}\ln w_{T+1,i} = \frac{1}{\eta_{T,i}}\ln w_{T,i} + \bar{r}_{T,i} - \eta_{T,i}(\bar{r}_{T,i} - m_{T,i})^2 = \cdots = \frac{1}{\eta_{1,i}}\ln w_{1,i} + \sum_{t=1}^{T}\bar{r}_{t,i} - \sum_{t=1}^{T}\eta_{t,i}(\bar{r}_{t,i} - m_{t,i})^2 = \sum_{t=1}^{T}\bar{r}_{t,i} - \sum_{t=1}^{T}\eta_{t,i}(\bar{r}_{t,i} - m_{t,i})^2. \qquad (w_{1,i} = 1)$$
Rearranging the above equality and recalling Eq. (20), we have
$$\sum_{t=1}^{T}r_{t,i} \le \sum_{t=1}^{T}\eta_{t,i}(\bar{r}_{t,i} - m_{t,i})^2 + \frac{1}{\eta_{T+1,i}}\Big( \ln\Big( 1 + \frac{B_T^2}{1 + 4B_0^2} + \frac{1}{2e}\ln\Big( 1 + \frac{TB_T^2}{1 + 4B_0^2} \Big) \Big) + \ln N \Big) + B_T,$$
where the second term is of order
$$O\left( \sqrt{B_T^2 + \sum_{t=1}^{T}(r_{t,i} - m_{t,i})^2}\,\big( \log(B_T + \log(TB_T)) + \log N \big) \right).$$
As for the first term, applying Lemma 6, it can be bounded as
$$\sum_{t=1}^{T}\eta_{t,i}(\bar{r}_{t,i} - m_{t,i})^2 = \sum_{t=1}^{T}\frac{(\bar{r}_{t,i} - m_{t,i})^2}{\sqrt{1 + 4B_{t-1}^2 + \sum_{s=1}^{t-1}(\bar{r}_{s,i} - m_{s,i})^2}} \le \sum_{t=1}^{T}\frac{(\bar{r}_{t,i} - m_{t,i})^2}{\sqrt{1 + \sum_{s=1}^{t}(\bar{r}_{s,i} - m_{s,i})^2}} \le 2\sqrt{\sum_{t=1}^{T}(\bar{r}_{t,i} - m_{t,i})^2} = 2\sqrt{\sum_{t=1}^{T}\frac{B_{t-1}^2}{B_t^2}(r_{t,i} - m_{t,i})^2} \le 2\sqrt{\sum_{t=1}^{T}(r_{t,i} - m_{t,i})^2}.$$
Thus, the proof is complete.

C.4 Proof of Theorem 4

Proof. We first decompose the static regret with respect to base-learner $i$ into two parts, as presented in Eq. (8). The base-regret is guaranteed by the corresponding base-learner via Theorem 1 and Theorem 2, so we mainly focus on the analysis of the meta-regret by leveraging Theorem 6. First, we verify the condition $\sum_{i \in [N]}p_{t,i}\bar{r}_{t,i} \le 0$. Without loss of generality, assume the first base-learner is for convex functions. Recall that we set $r_{t,i} = \langle \nabla f_t(x_t), x_t - x_{t,i} \rangle$ for the strongly convex learners $i \in [2, N]$, $r_{t,1} = f_t(x_t) - f_t(x_{t,1})$ for the convex learner, and the optimism $m_{t,i} = f_{t-1}(x_t) - f_{t-1}(x_{t,i})$ for each base-learner $i \in [N]$. Therefore,
$$\sum_{i \in [N]}p_{t,i}\bar{r}_{t,i} = \frac{B_t - B_{t-1}}{B_t}\Big( f_{t-1}(x_t) - \sum_{i \in [N]}p_{t,i}f_{t-1}(x_{t,i}) \Big) + \frac{B_{t-1}}{B_t}\Big( p_{t,1}\big( f_t(x_t) - f_t(x_{t,1}) \big) + \sum_{i \in [2,N]}p_{t,i}\langle \nabla f_t(x_t), x_t - x_{t,i} \rangle \Big)$$
$$\le \frac{B_{t-1}}{B_t}\Big( p_{t,1}\langle \nabla f_t(x_t), x_t - x_{t,1} \rangle + \sum_{i \in [2,N]}p_{t,i}\langle \nabla f_t(x_t), x_t - x_{t,i} \rangle \Big) = \frac{B_{t-1}}{B_t}\Big\langle \nabla f_t(x_t), x_t - \sum_{i \in [N]}p_{t,i}x_{t,i} \Big\rangle = 0,$$
where the inequality uses Jensen's inequality $f_{t-1}(x_t) \le \sum_{i}p_{t,i}f_{t-1}(x_{t,i})$ (since $x_t = \sum_{i}p_{t,i}x_{t,i}$) together with the convexity of $f_t$. Therefore, Theorem 6 is applicable for analyzing the meta-regret in universal online learning.
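To illustrate the heterogeneous inputs just verified, here is a small sketch (hypothetical names; the convex learner is assumed to sit at index 0):

```python
import numpy as np

def meta_inputs(grad_t_at_xt, f_t, f_prev, x_t, base_decisions, convex_idx=0):
    """Heterogeneous meta-algorithm inputs as described above:
    r_{t,1} = f_t(x_t) - f_t(x_{t,1}) for the convex base-learner,
    r_{t,i} = <grad f_t(x_t), x_t - x_{t,i}> for strongly convex learners,
    and optimism m_{t,i} = f_{t-1}(x_t) - f_{t-1}(x_{t,i}) for every learner.
    With x_t = sum_i p_{t,i} x_{t,i}, convexity yields
    sum_i p_{t,i} r_bar_{t,i} <= 0, the condition of Theorem 6."""
    r, m = [], []
    for i, x_i in enumerate(base_decisions):
        if i == convex_idx:
            r.append(f_t(x_t) - f_t(x_i))
        else:
            r.append(float(np.dot(grad_t_at_xt, x_t - x_i)))
        m.append(f_prev(x_t) - f_prev(x_i))
    return np.array(r), np.array(m)
```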
In the following, we prove the regret bounds for convex and strongly convex functions, respectively.

Convex functions. By Theorem 1, the base-regret is bounded by $O(\sqrt{V_T})$. As for the meta-regret, by the setting of inputs and optimism, Theorem 6 bounds it by
$$O\left( \sqrt{\sum_{t=1}^{T}\big( f_t(x_t) - f_t(x_{t,1}) - (f_{t-1}(x_t) - f_{t-1}(x_{t,1})) \big)^2\,C_T} + B_T\log B_T \right) \le O\left( \sqrt{\sum_{t=1}^{T}\langle \nabla f_t(\xi_{t,1}) - \nabla f_{t-1}(\xi_{t,1}), x_t - x_{t,1} \rangle^2\,C_T} + B_T\log B_T \right) \le O\left( D\sqrt{\sum_{t=2}^{T}\sup_{x \in \mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2\,C_T} + B_T\log B_T \right),$$
where we denote by $C_T = O(\log(B_T + \log(B_T T)))$ and the second step applies the mean value theorem. Combining the base-regret and meta-regret, we conclude that the static regret for convex functions is bounded by $O(\sqrt{V_T\log(B_T + \log(B_T T))})$, where $B_T = O(\widehat{G}_{\max}D)$ with $\widehat{G}_{\max}$ denoting the maximum Lipschitz constant.

Strongly convex functions. For $\lambda^\star$-strongly convex functions with $\lambda^\star \in [1/T, 1]$, by the construction of the curvature coefficient pool $\mathcal{H}$, there exists $i^\star \in [2, N]$ such that $\lambda_{i^\star} \le \lambda^\star \le 2\lambda_{i^\star}$. For this $i^\star$-th base-learner, the base-regret is upper bounded by $O((\log V_T)/\lambda^\star)$ via Theorem 2, up to a multiplicative factor of 2. The meta-regret can be bounded as follows:
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle - \frac{\lambda^\star}{2}\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2 \le O\left( \sqrt{\sum_{t=1}^{T}\big( \langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle - (f_{t-1}(x_t) - f_{t-1}(x_{t,i^\star})) \big)^2\,C_T} + B_T C_T \right) - \frac{\lambda^\star}{2}\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2$$
$$\le O\left( \sqrt{\sum_{t=1}^{T}\big( \langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle^2 + \langle \nabla f_{t-1}(\xi_{t,i^\star}), x_t - x_{t,i^\star} \rangle^2 \big)\,C_T} + B_T C_T \right) - \frac{\lambda^\star}{2}\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2$$
$$\le O\left( \widehat{G}_{\max}\sqrt{\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2\,C_T} + B_T C_T \right) - \frac{\lambda^\star}{2}\sum_{t=1}^{T}\|x_t - x_{t,i^\star}\|_2^2 \le O\left( \frac{\widehat{G}_{\max}^2 C_T^2}{\lambda^\star} + B_T\log B_T \right) = O\left( \frac{\widehat{G}_{\max}^2\log^2 B_T}{\lambda^\star} + B_T\log B_T \right),$$
where $\widehat{G}_{\max}$ denotes the maximum Lipschitz constant on the optimization trajectory, we treat $\log\log T$ factors as constants, and $B_T = O(\widehat{G}_{\max}D)$. The second line uses the mean value theorem; the third line follows from Lemma 7, which ensures
$$|\langle \nabla f_{t-1}(\xi_{t,i^\star}), x_t - x_{t,i^\star} \rangle| \le \max\big\{ |\langle \nabla f_{t-1}(x_{t,i^\star}), x_t - x_{t,i^\star} \rangle|, |\langle \nabla f_{t-1}(x_t), x_t - x_{t,i^\star} \rangle| \big\},$$
so our result depends only on the Lipschitz constant over the optimization trajectory; the last step applies the AM-GM inequality. The above statements show that the meta-regret is bounded by constants. Together with the base-regret guarantee, the proof for strongly convex functions is complete.

C.5 A Simple Universal Algorithm under Global Smoothness

As a byproduct, our techniques can be used to design a simpler two-layer universal algorithm that achieves the optimal gradient-variation regret bounds for convex, strongly convex, and exp-concave functions simultaneously, thereby improving upon the results of Yan et al. [2023]. The crux involves leveraging the function-variation-to-gradient-variation technique for convex functions, and following the strategy of Zhang et al.
[2022a] to demonstrate that the meta-regret is bounded by constants for strongly convex and exp-concave functions, at a cost of $O(\log T)$ function value queries per round. Given the global smoothness constant and the Lipschitz constant, our algorithm does not need to be Lipschitz-adaptive and is thus suitable for more general optimism settings. We present this idea in Algorithm 3. In contrast to Algorithm 2, which is designed under generalized smoothness, this algorithm additionally guarantees a gradient-variation bound for exp-concave functions.

Algorithm 3 Universal Gradient-Variation Online Learning under Global Smoothness
Input: curvature coefficient pools $\mathcal{H}_{\text{sc}}, \mathcal{H}_{\text{exp}}$, number of base-learners $N$, optimistic Adapt-ML-Prod [Wei et al., 2016] as the meta-algorithm, base-algorithms $\mathcal{A}_{\text{cvx}}, \mathcal{A}_{\text{sc}}, \mathcal{A}_{\text{exp}}$.
1: Initialization: Pass $N$ to the meta-algorithm; initialize a base-learner with $\mathcal{A}_{\text{cvx}}$; for each $\lambda \in \mathcal{H}_{\text{sc}}$, initialize a base-learner with $\mathcal{A}_{\text{sc}}$; for each $\alpha \in \mathcal{H}_{\text{exp}}$, initialize a base-learner with $\mathcal{A}_{\text{exp}}$.
2: for $t = 1$ to $T$ do
3: Obtain $p_t$ from the meta-algorithm and $x_{t,i}$ from each base-learner $i \in [N]$;
4: Submit $x_t = \sum_{i \in [N]}p_{t,i}x_{t,i}$;
5: Receive $f_t(x_t)$ and send it to each base-learner for update;
6: For strongly convex and exp-concave learners: set $r_{t,i} = \langle \nabla f_t(x_t), x_t - x_{t,i} \rangle$;
7: For the convex learner: set $r_{t,1} = f_t(x_t) - f_t(x_{t,1})$;
8: Send $r_t$ to the meta-algorithm;
9: Send $m_{t+1,1} = f_t(x_t) - f_t(x_{t,1})$ and $m_{t+1,i} = 0$ for $i \in [2, N]$ to the meta-algorithm.
10: end for

Below, we present the theoretical guarantee for Algorithm 3.

Corollary 3. Under Assumption 2, and assuming the loss functions are $L$-smooth and $G$-Lipschitz, we set $N = 2\log_2 T + 1$ and define the curvature coefficient pools as $\mathcal{H}_{\text{sc}} = \mathcal{H}_{\text{exp}} = \{2^{i-1}/T : i \in [(N-1)/2]\}$. By selecting suitable base-algorithms, for any $\lambda, \alpha \in [1/T, 1]$, Algorithm 3 simultaneously ensures:
$O(\sqrt{V_T\log\log T})$ for convex functions;
$O\big(\frac{1}{\lambda}\log V_T + \frac{G^2\log\log T}{\lambda}\big)$ for $\lambda$-strongly convex functions;
$O\big(\frac{d}{\alpha}\log V_T + \frac{\log\log T}{\alpha}\big)$ for $\alpha$-exp-concave functions.

Proof. When the Lipschitz constant $G$ is known, optimistic Adapt-ML-Prod [Wei et al., 2016] ensures a meta-regret of $O(\sqrt{\sum_{t=1}^{T}(r_{t,i} - m_{t,i})^2\log\log T})$, so the dependence on logarithmic terms improves compared with Theorem 4. The proofs for convex and strongly convex functions are nearly identical to those of Theorem 4 in Appendix C.4 and are thus omitted. We highlight the importance of the function-variation-to-gradient-variation technique, which bounds the meta-regret by $O(\sqrt{V_T\log\log T})$ without the cancellation-based analysis.

Next, we show that the meta-regret for $\alpha^\star$-exp-concave functions is bounded by a constant. By construction of the curvature pool $\mathcal{H}_{\text{exp}}$, there exists a base-learner $i^\star$ with input curvature $\alpha_{i^\star}$ satisfying $\alpha_{i^\star} \le \alpha^\star \le 2\alpha_{i^\star}$. Decomposing the regret against this specific base-learner and using the definition of exp-concave functions, we have
$$\sum_{t=1}^{T}f_t(x_t) - f_t(x_{t,i^\star}) \le \sum_{t=1}^{T}\Big( \langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle - \frac{\alpha^\star}{2}\langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle^2 \Big) \le O\left( \sqrt{\sum_{t=1}^{T}(r_{t,i^\star} - m_{t,i^\star})^2\log\log T} \right) - \frac{\alpha^\star}{2}\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle^2$$
$$= O\left( \sqrt{\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle^2\log\log T} \right) - \frac{\alpha^\star}{2}\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x_{t,i^\star} \rangle^2 \le O\left( \frac{\log\log T}{\alpha^\star} \right),$$
where the equality follows from the settings $r_{t,i} = \langle \nabla f_t(x_t), x_t - x_{t,i} \rangle$ and $m_{t,i} = 0$, and the final step applies the AM-GM inequality. Therefore, choosing a base-algorithm that ensures a regret bound of $O(\frac{d}{\alpha}\log V_T)$ completes the proof.

D Omitted Details for Section 4

This section provides the omitted proofs for our two applications.

D.1 Proof of Corollary 1

Proof. We prove this corollary in a black-box manner, thanks to the gradient-variation bound we derived. By Theorem 4, for convex functions, Algorithm 2 ensures
$$\text{Reg}_T \le O\left( \sqrt{\sum_{t=1}^{T}\sup_{x \in \mathcal{X}}\|\nabla f_t(x) - \nabla F_t(x)\|_2^2 + \sum_{t=2}^{T}\sup_{x \in \mathcal{X}}\|\nabla F_t(x) - \nabla F_{t-1}(x)\|_2^2} \right),$$
where we decompose $\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ as
$$O\big( \|\nabla f_t(x) - \nabla F_t(x)\|_2^2 + \|\nabla F_t(x) - \nabla F_{t-1}(x)\|_2^2 + \|\nabla F_{t-1}(x) - \nabla f_{t-1}(x)\|_2^2 \big). \quad (26)$$
Finally, taking expectations on both sides and leveraging the concavity of the square root, we have
$$\mathbb{E}[\text{Reg}_T] \le O\left( \sqrt{\sum_{t=1}^{T}\mathbb{E}\Big[ \sup_{x \in \mathcal{X}}\|\nabla f_t(x) - \nabla F_t(x)\|_2^2 \Big] + \sum_{t=2}^{T}\mathbb{E}\Big[ \sup_{x \in \mathcal{X}}\|\nabla F_t(x) - \nabla F_{t-1}(x)\|_2^2 \Big]} \right),$$
which is of the desired order for convex functions.

For strongly convex functions, as shown in Eq. (13) in the proof of Theorem 2, the multiplicative factor $\widehat{G}_{\max}$ can be replaced by the more refined factor $\max_{t \in [T]}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2$. Then Algorithm 2 ensures the following bound for strongly convex functions:
$$\text{Reg}_T \le O\left( \frac{1}{\lambda}\max_{t \in [T]}\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2\cdot\log\Big( \sum_{t=2}^{T}\sup_{x \in \mathcal{X}}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 \Big) \right).$$
By applying a similar argument as in Eq.
(26) to decompose the gradient variation, taking expectations on both sides, and leveraging the concavity of the logarithm, we conclude the proof.

D.2 Proof of Corollary 2

Proof. The step size for optimistic OMD in (10) is $\eta_t = \min\big\{ D, \min_{s \in [t]}\frac{1}{4\ell(2\|F(\widehat{z}_s)\|_2)} \big\}$. By the convexity-concavity of the objective function $f(\cdot, \cdot)$, for any $z = (x, y) \in \mathcal{Z}$ we can linearize the gap of an $\varepsilon$-approximate solution as
$$f(\bar{x}_T, y) - f(x, \bar{y}_T) \le \frac{1}{T}\sum_{t=1}^{T}\langle F(z_t), z_t - z \rangle.$$
Following the proof of Lemma 2, we can demonstrate that optimistic OMD in (10) ensures
$$\sum_{t=1}^{T}\langle F(z_t), z_t - z \rangle \le O\left( \frac{\max_{t \in [T]}\|z_t - z\|_2^2}{\eta_T} \right) + \sum_{t=1}^{T}\Big( \eta_t\|F(z_t) - F(\widehat{z}_t)\|_2^2 - \frac{1}{\eta_t}\|z_t - \widehat{z}_t\|_2^2 \Big).$$
The first term on the right-hand side is of order $O(\widehat{L}_{\max}D^2)$ by the step size configuration, where $\widehat{L}_{\max}$ denotes the maximum smoothness constant on the optimization trajectory. By the $\ell$-smoothness in Definition 2, the second term can be bounded as $\eta_t\|F(z_t) - F(\widehat{z}_t)\|_2^2 \le O\big( \eta_t\,\ell(2\|F(\widehat{z}_t)\|_2)^2\,\|z_t - \widehat{z}_t\|_2^2 \big)$, which is cancelled by the negative terms. Therefore, $\sum_{t=1}^{T}\langle F(z_t), z_t - z \rangle$ is bounded by a constant; thus, the convergence to an $\varepsilon$-approximate solution is at rate $O(1/T)$, implying fast convergence.

E Supporting Lemmas

E.1 Lemmas for Optimistic OMD

We first present the generic lemma for optimistic OMD with dynamic regret [Zhao et al., 2024], which recovers the standard static regret by setting $u_1 = \cdots = u_T = x^\star \in \arg\min_{x \in \mathcal{X}}\sum_{t=1}^{T}f_t(x)$.

Lemma 3 (Theorem 1 of Zhao et al. [2024]). Optimistic OMD specialized to Eq. (3) satisfies
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - u_t \rangle \le \sum_{t=1}^{T}\langle \nabla f_t(x_t) - M_t, x_t - \widehat{x}_{t+1} \rangle + \sum_{t=1}^{T}\big( D_{\psi_t}(u_t, \widehat{x}_t) - D_{\psi_t}(u_t, \widehat{x}_{t+1}) \big) - \sum_{t=1}^{T}\big( D_{\psi_t}(\widehat{x}_{t+1}, x_t) + D_{\psi_t}(x_t, \widehat{x}_t) \big),$$
where $u_1, \ldots, u_T \in \mathcal{X}$ are arbitrary comparators in the feasible domain.

The next lemma, known as the stability lemma, establishes an upper bound on the proximity of successive decisions in terms of the gradients used for the updates.

Lemma 4 (Proposition 7 of Chiang et al. [2012]). Consider the following two updates: (i) $x' = \arg\min_{x \in \mathcal{X}}\{\langle g, x \rangle + D_\psi(x, c)\}$, and (ii) $x'' = \arg\min_{x \in \mathcal{X}}\{\langle g', x \rangle + D_\psi(x, c)\}$. When the regularizer $\psi: \mathcal{X} \mapsto \mathbb{R}$ is a $\lambda$-strongly convex function with respect to the norm $\|\cdot\|$, we have $\lambda\|x' - x''\| \le \|g - g'\|_\star$.

E.2 Self-Confident Tuning Lemmas

In this part, we provide some useful lemmas for analyzing the self-confident tuning strategy.

Lemma 5 (extension of Lemma 14 of Gaillard et al. [2014]). Let $a_0 > 0$ and $a_t \in [0, B]$ for all $t \in [T]$, and let $f: (0, +\infty) \mapsto [0, +\infty)$ be a nonincreasing function. Then
$$\sum_{t=1}^{T}a_t f\Big( \sum_{s=0}^{t-1}a_s \Big) \le B f(a_0) + \int_{a_0}^{\sum_{t=0}^{T}a_t}f(u)\,\mathrm{d}u.$$

Lemma 6 (Lemma 3.5 of Auer et al. [2002]). Let $a_1, \ldots, a_T$ and $\delta$ be non-negative real numbers. Then
$$\sum_{t=1}^{T}\frac{a_t}{\sqrt{\delta + \sum_{s=1}^{t}a_s}} \le 2\sqrt{\delta + \sum_{t=1}^{T}a_t}.$$

E.3 Technical Lemma

Lemma 7. Let $f: \mathcal{X} \mapsto \mathbb{R}$ be a convex, twice differentiable function. Then for any $x, y \in \operatorname{int}\mathcal{X}$ and $\lambda \in [0, 1]$, we have
$$|\langle \nabla f(\lambda x + (1 - \lambda)y), x - y \rangle| \le \max\big\{ |\langle \nabla f(x), x - y \rangle|, |\langle \nabla f(y), x - y \rangle| \big\}.$$
Proof. Define $\phi(t) = f(tx + (1 - t)y)$. For this univariate convex function, $\phi'(t) = \langle \nabla f(tx + (1 - t)y), x - y \rangle$. By the convexity of $\phi$, $\phi'$ is nondecreasing, hence $|\phi'(\lambda)| \le \max\{|\phi'(0)|, |\phi'(1)|\}$, concluding the proof.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: In the abstract, we present the contributions sequentially through ordinal words, and we discuss the contributions further in the introduction.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: One limitation of our work is the requirement for multiple gradient queries. Another limitation is that we cannot obtain the gradient-variation results for exp-concave functions due to technical reasons.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The assumptions are provided in Section 2. The main theoretical results are presented from Section 2 to Section 4. All proofs can be found in the appendices; Appendix B to Appendix D contain the proofs of the main results.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: This paper does not include experiments.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This paper does not include experiments, and no data or code will be provided.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This paper does not include experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This paper does not include experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The research conducted in the paper conforms with the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This paper only focuses on online learning theory, and we are not aware of any potential societal impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper does not include experiments, and there are no risks of misuse of data or models.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This paper does not include experiments, and we do not use existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This paper does not include experiments, and we do not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper focuses on online learning theory and does not involve crowdsourcing or research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper focuses on online learning theory and does not involve crowdsourcing or research with human subjects.