Smoothed Online Convex Optimization with Delayed Feedback

Sifan Yang1,2, Wenhao Yang1,2, Wei Jiang1, Yuanyu Wan3 and Lijun Zhang1,2
1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
3School of Software Technology, Zhejiang University, Ningbo 315100, China
{yangsf, yangwh, jiangw, zhanglj}@lamda.nju.edu.cn, wanyy@zju.edu.cn

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

Abstract

Smoothed online convex optimization (SOCO), in which the online player incurs both a hitting cost and a switching cost for changing its decisions, has garnered significant attention in recent years. While existing studies typically assume that the gradient information is revealed immediately, this assumption may not hold in some real-world applications. To overcome this limitation, we investigate SOCO with delayed feedback, and develop two online algorithms that minimize the dynamic regret with switching cost. Firstly, we extend Mild-OGD, an existing algorithm that adopts the meta-expert framework for online convex optimization with delayed feedback, to account for the switching cost. Specifically, we analyze the switching cost of the expert-algorithm of Mild-OGD, and then modify its meta-algorithm to incorporate this cost when assigning the weight to each expert. We demonstrate that our proposed method, Smelt-DOGD, achieves an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound with switching cost, where $\bar{d}$ is the maximum delay and $P_T$ is the path-length. Secondly, we develop an efficient variant that reduces the number of projections per round from $O(\log T)$ to $1$ while maintaining the same theoretical guarantee. The key idea is to construct a new surrogate loss defined over a simpler domain for the expert-algorithms, so that these experts do not need to perform complex projection operations in each round.
Finally, we conduct experiments to validate the effectiveness and efficiency of our algorithms.

1 Introduction

This paper investigates smoothed online convex optimization (SOCO) [Goel and Wierman, 2019], a popular variant of online convex optimization (OCO) [Shalev-Shwartz, 2012]. SOCO is motivated by real-world scenarios where changes in states introduce additional costs, such as thermal management [Zanini et al., 2010], dynamic right-sizing for data centers [Lin et al., 2011], video streaming [Joseph and de Veciana, 2012], spatiotemporal sequence prediction [Kim et al., 2015], speech animation [Kim et al., 2015], etc. (Lijun Zhang is the corresponding author.) Specifically, SOCO is typically formulated as a game between a player and an adversary. In each round $t \in [T]$, the player begins by selecting a decision $x_t$ from a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^n$, where $n$ is the dimensionality. When the online player makes a decision $x_t$, the adversary simultaneously chooses a convex loss function $f_t(\cdot): \mathcal{X} \mapsto \mathbb{R}$. The player then incurs both a hitting cost $f_t(x_t)$ and a switching cost $m(x_t, x_{t-1})$ for changing its decisions between rounds. For general convex functions, a natural choice of switching cost is the distance between successive decisions, i.e., $m(x_t, x_{t-1}) = \|x_t - x_{t-1}\|$ [Zhang et al., 2022]. Following previous work [Chen et al., 2018; Zhang et al., 2021], we adopt the dynamic regret with switching cost to measure the performance of the online player, which is formally defined as

$$\text{D-R-SC}(u_1, \dots, u_T) = \sum_{t=1}^{T}\big(f_t(x_t) + \|x_t - x_{t-1}\|\big) - \sum_{t=1}^{T} f_t(u_t), \qquad (1)$$

where $u_1, \dots, u_T \in \mathcal{X}$ is a sequence of arbitrary feasible comparators. Over the past years, a few algorithms [Li et al., 2018; Chen et al., 2018; Goel et al., 2019; Zhang et al., 2021; Zhang et al., 2022] have been proposed to minimize the dynamic regret with switching cost for SOCO.
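For concreteness, the two quantities in (1) and (2) can be computed directly from the decision and comparator sequences. The following is a minimal Python sketch (the helper names are ours, and the Euclidean norm is used as the switching cost):

```python
import numpy as np

def dynamic_regret_with_switching_cost(losses, xs, us, x0):
    """Definition (1): sum_t [f_t(x_t) + ||x_t - x_{t-1}||] - sum_t f_t(u_t).

    losses: list of callables f_t; xs, us: lists of decision/comparator vectors.
    """
    prev = x0
    total = 0.0
    for f, x, u in zip(losses, xs, us):
        total += f(x) + np.linalg.norm(x - prev) - f(u)
        prev = x
    return total

def path_length(us):
    """Definition (2): P_T = sum_{t=2}^T ||u_t - u_{t-1}||."""
    return float(sum(np.linalg.norm(us[t] - us[t - 1]) for t in range(1, len(us))))
```

Note that the switching cost charges the player for movement, while the comparator sequence is charged only through its hitting costs; its movement enters the bounds via the path-length $P_T$.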
The optimal dynamic regret bound for SOCO is $O(\sqrt{T(P_T+1)})$ [Zhang et al., 2021], where

$$P_T(u_1, \dots, u_T) = \sum_{t=2}^{T} \|u_t - u_{t-1}\| \qquad (2)$$

denotes the path-length of an arbitrary comparator sequence $u_1, \dots, u_T \in \mathcal{X}$. However, existing work typically assumes that information about the hitting cost, such as the gradient $\nabla f_t(x_t)$, can be obtained immediately after the player makes a decision $x_t$, which is often hard to satisfy in practical scenarios. For example, in online advertising, the decision involves selecting ad delivery strategies for users. Advertisers adjust their strategies based on the interactions of users, such as advertisement clicks, but there is often a delay in receiving this feedback [McMahan et al., 2013; Joulani et al., 2013; He et al., 2014; Wan et al., 2022a; Wan et al., 2022c; Wan et al., 2023; Yang et al., 2025]. Motivated by this issue, we focus on SOCO with arbitrary delays, in which the gradient $\nabla f_t(x_t)$ of the loss function arrives at the end of round $t + d_t - 1$, where $d_t \ge 1$ represents an arbitrary delay at round $t$. Delayed feedback has been investigated in the standard OCO setting over the past decades [Weinberger and Ordentlich, 2002; Joulani et al., 2013; Quanrud and Khashabi, 2015; Korotin et al., 2020; Wang et al., 2021; Wan et al., 2022b; Wan et al., 2022c]. Recently, Wan et al. [2024] propose a delayed variant of OGD [Zinkevich, 2003], termed delayed online gradient descent (DOGD), and establish an $O(\sqrt{\bar{d}T}(P_T+1))$ dynamic regret bound, where $\bar{d}$ is the maximum delay and $P_T$ is the path-length defined in (2). To further enhance the dynamic regret bound, they develop a two-layer algorithm, named Mild-OGD, which runs several DOGD instances as expert-algorithms in parallel and uses delayed Hedge [Korotin et al., 2020] as the meta-algorithm to track the best one. Mild-OGD achieves an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound.
However, it lacks consideration of the switching cost, which restricts its applicability to SOCO. Additionally, since it maintains $O(\log T)$ expert-algorithms simultaneously, Mild-OGD requires multiple projections onto the feasible domain per round, which can be computationally expensive when the feasible domain is complex.

In this paper, we develop a smoothed variant of Mild-OGD, named smoothed multiple delayed online gradient descent (Smelt-DOGD), to deal with SOCO under delayed feedback. Similar to Mild-OGD, Smelt-DOGD adopts the meta-expert framework but explicitly accounts for the switching cost. Specifically, we prove that the expert-algorithm, DOGD, can obtain an $O(\sqrt{\bar{d}T}(P_T+1))$ dynamic regret bound with switching cost. Then we modify the meta-algorithm by additionally considering the switching cost when deriving the weight of each expert. Smelt-DOGD achieves an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound, which is of the same order as the dynamic regret bound without switching cost [Wan et al., 2024]. Moreover, if the delays do not alter the arrival order of the gradients, the dynamic regret bound with switching cost can be further improved to $O(\sqrt{S(P_T+1)})$, where $S = \sum_{t=1}^{T} d_t \le \bar{d}T$ represents the sum of delays. Furthermore, to develop an efficient method, we employ a black-box reduction technique [Cutkosky and Orabona, 2018; Cutkosky, 2020; Zhao et al., 2022], which reduces the projection complexity from $O(\log T)$ to $1$. In particular, we simplify the optimization task over the complicated domain $\mathcal{X}$ to an optimization problem within the smallest Euclidean ball $\mathcal{Y}$ containing $\mathcal{X}$, so that the projection onto $\mathcal{Y}$ is much cheaper. We construct a surrogate loss over the domain $\mathcal{Y}$, which is minimized by the expert-algorithms. In this way, each expert-algorithm only performs a simple rescaling operation instead of a complex projection, and the meta-algorithm projects the weighted decision onto the domain $\mathcal{X}$ just once per round.
The theoretical analysis indicates that the efficient variant achieves a regret bound of the same order as Smelt-DOGD. Finally, we conduct numerical experiments on online classification and online regression problems, and the results demonstrate the superiority and efficiency of our algorithms.

2 Related Work

In this section, we provide a brief review of related research on dynamic regret, SOCO and OCO with arbitrary delays.

2.1 Dynamic Regret

Dynamic regret was introduced by Zinkevich [2003] to deal with changing environments, where the optimal decision may vary over time. It is defined as the difference between the loss of the online player and that of a sequence of arbitrary comparators $u_1, \dots, u_T \in \mathcal{X}$, typically formulated as $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t)$. Zinkevich [2003] first establishes an $O(\sqrt{T}(1+P_T))$ dynamic regret bound for OGD, where $P_T$ is defined in (2). To further reduce the dynamic regret, Zhang et al. [2018a] propose a two-layer algorithm, named Ader, which achieves the optimal $O(\sqrt{T(P_T+1)})$ dynamic regret bound. Specifically, Ader runs $O(\log T)$ OGD instances with different learning rates simultaneously and combines them using Hedge [Freund and Schapire, 1997]. In recent years, several studies have achieved tighter dynamic regret bounds by leveraging the curvature of loss functions, such as exponential concavity [Baby and Wang, 2021] and strong convexity [Baby and Wang, 2022]. The two-layer structure of Ader has also gained widespread adoption [Zhang et al., 2020; Zhang et al., 2021; Wang et al., 2024; Yang et al., 2024a]. While $u_1, \dots, u_T \in \mathcal{X}$ can be an arbitrary sequence in dynamic regret, several studies [Jadbabaie et al., 2015; Mokhtari et al., 2016; Zhang et al., 2017; Zhang et al., 2018b; Wan et al., 2021; Wang et al., 2021] investigate the restricted dynamic regret, where the comparator sequence consists of the minimizers of the online functions, i.e., $u_t = \arg\min_{x \in \mathcal{X}} f_t(x)$. However, as mentioned in Zhang et al.
[2018a], the restricted dynamic regret is too pessimistic and may lead to overfitting in some problems.

2.2 Smoothed Online Convex Optimization

Since changes of decisions may introduce additional cost in numerous real-world scenarios, SOCO has received extensive attention in the machine learning community. In the literature, there are two performance metrics designed for SOCO: the competitive ratio and the dynamic regret with switching cost. While various algorithms have been proposed to reduce the competitive ratio, all of them are limited to the lookahead setting, where the learner can observe the hitting cost $f_t(\cdot)$ before making the decision, and may rely on strict conditions, such as strong convexity and quadratic growth [Lin et al., 2011; Bansal et al., 2015; Chen et al., 2015; Li et al., 2018; Goel and Wierman, 2019; Li et al., 2020; Zhang et al., 2021; Wang et al., 2021]. For this reason, our subsequent discussion primarily focuses on the dynamic regret with switching cost, which requires weaker assumptions. Li et al. [2018] consider a setting where the online player has access to the next $W$ hitting costs and propose receding horizon gradient descent (RHGD) to minimize the dynamic regret with switching cost. However, they choose the switching cost in the form of a quadratic function, i.e., $\|x_t - x_{t-1}\|^2/2$, which may not be well-suited for general convex functions. Moreover, their analysis relies on stringent conditions, such as strong convexity and smoothness, which are difficult to satisfy in practical applications. Online balanced descent (OBD) [Chen et al., 2018], a classic algorithm in SOCO, achieves an $O(\sqrt{TL})$ dynamic regret bound with switching cost, where $L$ is an upper bound on the path-length, i.e., $P_T = \sum_{t=2}^{T}\|u_t - u_{t-1}\| \le L$. Notably, this bound is non-adaptive, as it depends on $L$ rather than the actual path-length $P_T$.
Later, regularized OBD (R-OBD) [Goel et al., 2019] improves the dynamic regret bound with switching cost to $O(\sqrt{T P_{T,\ell_2}})$, where $P_{T,\ell_2} = \sum_{t=2}^{T}\|u_t - u_{t-1}\|^2$. Although the regret bound of R-OBD is adaptive, R-OBD chooses the squared $\ell_2$-norm as the switching cost. To overcome this limitation, Zhang et al. [2021] propose Smoothed Ader (SAder), a smoothed variant of Ader. Similar to Ader [Zhang et al., 2018a], SAder employs a two-layer structure and is proven to attain an $O(\sqrt{T(P_T+1)})$ dynamic regret bound with switching cost.

2.3 OCO with Delayed Feedback

Weinberger and Ordentlich [2002] first consider the setting in which the feedback on each decision $x_t$ is received at the end of round $t + d - 1$, where $d$ is a fixed delay, and develop a method for adapting classic OCO algorithms to the delayed setting. They choose the static regret as the performance metric and demonstrate that if a vanilla OCO algorithm achieves a regret bound of $R(T)$ in the non-delayed setting, this method can obtain a regret bound of $dR(T/d)$. However, in practice, delays are typically arbitrary, which limits the applicability of this method. To deal with OCO with arbitrary delays, Joulani et al. [2013] extend the above technique, which can attain a regret bound of $\bar{d}R(T/\bar{d})$ for a traditional OCO algorithm with $R(T)$ regret, where $\bar{d}$ is the maximum delay. Later, Quanrud and Khashabi [2015] propose a variant of OGD to handle OCO with delays, which achieves a static regret bound of $O(\sqrt{S})$, where $S = \sum_{t=1}^{T} d_t \ge T$ is the sum of delays over $T$ rounds. In recent years, Korotin et al. [2020] consider the problem of prediction with expert advice and propose delayed Hedge, achieving an $O(\sqrt{S})$ static regret bound. However, all the aforementioned studies focus on the static regret, which is only meaningful in environments where at least one fixed decision can minimize the cumulative loss, and is not suitable for non-stationary environments in which the optimal decision varies over time.
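The fixed-delay reduction of Weinberger and Ordentlich [2002] can be made concrete: run $d$ independent copies of the base algorithm in round-robin, so that each copy has already received its own feedback by the time it acts again, and each copy faces an effectively undelayed sequence of $T/d$ rounds. A minimal sketch (the `predict`/`update` interface of the base learner is our own illustrative assumption, not an API from the cited works):

```python
class RoundRobinDelayWrapper:
    """Sketch of the fixed-delay reduction: d independent copies of a base
    learner are served in round-robin, and the feedback for round t (which
    arrives d rounds later) is routed back to the copy that produced x_t."""

    def __init__(self, base_learner_factory, d):
        self.learners = [base_learner_factory() for _ in range(d)]
        self.d = d

    def predict(self, t):
        # round t (1-indexed) is served by copy (t - 1) mod d
        return self.learners[(t - 1) % self.d].predict()

    def feedback(self, t, grad):
        # route the delayed gradient of round t to the responsible copy
        self.learners[(t - 1) % self.d].update(grad)
```

Since each copy accumulates at most $R(T/d)$ regret and there are $d$ copies, the total regret is $dR(T/d)$, as stated above.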
To address this issue, Wan et al. [2024] investigate the dynamic regret under delayed OCO and develop DOGD, a variant of OGD, which achieves a dynamic regret bound of $O(\sqrt{\bar{d}T}(P_T+1))$. DOGD queries the gradient $\nabla f_t(x_t)$ at round $t$ and, because of the delayed feedback, performs a gradient descent step only with each gradient received at round $t$. To further reduce the dynamic regret bound, Wan et al. [2024] design Mild-OGD, which runs several DOGD instances in parallel and uses delayed Hedge [Korotin et al., 2020] to track the best one. While Mild-OGD is proven to obtain an optimal $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound, it is inefficient, as it performs $O(\log T)$ projection operations in each round.

3 Main Results

In this section, we first start with the necessary assumptions and then present our Smelt-DOGD with theoretical guarantees. Finally, we develop an efficient version of Smelt-DOGD.

3.1 Assumptions

We adopt the common assumptions of online convex optimization (OCO) [Shalev-Shwartz, 2012].

Assumption 1. (Convexity) All loss functions $f_t(\cdot)$ are convex over the domain $\mathcal{X}$.

Assumption 2. (Bounded gradient norms) The norms of the gradients of all loss functions over the domain $\mathcal{X}$ are bounded by $G$, i.e., $\|\nabla f_t(x)\| \le G$ for all $x \in \mathcal{X}$ and $t \in [T]$.

Assumption 3. (Bounded domain) The diameter of the domain $\mathcal{X}$ is at most $D$, i.e., $\max_{x,y \in \mathcal{X}} \|x - y\| \le D$, and $0 \in \mathcal{X}$.

While the above assumptions are enough to handle SOCO with delayed feedback, previous work [McMahan and Streeter, 2014; Wan et al., 2024] additionally introduces the following assumption to achieve a tighter regret bound.

Assumption 4. (In-Order property) The delays do not alter the arrival order of the queried gradients, i.e., for any $1 \le i < j \le T$, $\nabla f_i(x_i)$ arrives no later than $\nabla f_j(x_j)$.

Although Assumption 4 may appear overly stringent, it is indeed satisfied in certain applications.
For instance, in parallel and distributed optimization, the delay $d_t$ in each round tends to gradually increase, i.e., $d_t \le d_{t+1}$, meaning that the gradient queried first is more likely to arrive first [McMahan and Streeter, 2014; Zhou et al., 2018]. Therefore, it is reasonable to assume that the delays enjoy the In-Order property. Notably, Assumption 4 is not always required, and we will explicitly state the requirement in the subsequent theorems.

3.2 Smelt-DOGD

Due to its lack of consideration of the switching cost, Mild-OGD cannot be directly applied to SOCO with delayed feedback. To overcome this limitation, we first provide a novel analysis of its expert-algorithm, DOGD, and then modify the meta-algorithm to account for the switching cost. Our proposed method is outlined below.

Expert-algorithm. As described in Algorithm 1, we choose DOGD as the expert-algorithm. The input of the expert-algorithm is the learning rate $\eta$. Due to the delayed feedback, we receive a set of gradients $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$ in each round $t$, where $\mathcal{F}_t = \{k \in [T] \mid k + d_k - 1 = t\}$ represents the index set of queried gradients arriving in round $t$, and the elements of this set are sorted in ascending order. For each gradient received in round $t$, the expert-algorithm performs a gradient descent update based on its arrival order and then projects the result onto the feasible domain, i.e.,

$$z^{\eta}_{\tau+1} = \Pi_{\mathcal{X}}\big[z^{\eta}_{\tau} - \eta \nabla f_k(x_k)\big], \qquad (3)$$

where $\Pi_{\mathcal{X}}[\cdot]$ denotes the projection onto the domain $\mathcal{X}$.

Algorithm 1 Smelt-DOGD: Expert-algorithm
Require: A learning rate $\eta$
1: Initialize $z^{\eta}_1 = 0$ and $\tau = 1$
2: for time step $t = 1$ to $T$ do
3: Submit $x^{\eta}_t = z^{\eta}_{\tau}$ to the meta-algorithm
4: Receive gradients $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$ from the meta-algorithm
5: for $k \in \mathcal{F}_t$ do
6: Compute $z^{\eta}_{\tau+1} = \Pi_{\mathcal{X}}[z^{\eta}_{\tau} - \eta \nabla f_k(x_k)]$
7: Set $\tau = \tau + 1$
8: end for
9: end for

Although Wan et al.
[2024] have established the dynamic regret bound for DOGD, they do not consider the switching cost. In the following, we derive the dynamic regret bound with switching cost for DOGD.

Theorem 1. Under Assumptions 1, 2 and 3, for any comparator sequence $u_1, \dots, u_T \in \mathcal{X}$, by setting $\eta = \frac{D}{G\sqrt{\sum_{t=1}^{T} m_t}}$, DOGD ensures

$$\text{D-R-SC}(u_1, \dots, u_T) \le \frac{D^2 + D P_T}{\eta} + \eta G^2 \sum_{t=1}^{T} m_t + C = O\big(\sqrt{S}\,(P_T+1) + C\big),$$

where $m_t = t - \sum_{i=1}^{t} |\mathcal{F}_i|$, $S = \sum_{t=1}^{T} d_t \le \bar{d}T$ is the sum of delays, $\bar{d}$ is the maximum delay, and

$$C = \begin{cases} 0, & \text{if Assumption 4 holds}, \\ \min\{DGT,\, 2\bar{d}GP_T\}, & \text{otherwise}. \end{cases} \qquad (4)$$

Remark 1. From the analysis of Wan et al. [2024], it is direct to verify that $\min\{DGT, 2\bar{d}GP_T\} \le G\sqrt{2\bar{d}TDP_T}$ and $\sum_{t=1}^{T} m_t \le \sum_{t=1}^{T} d_t = S \le \bar{d}T$. Therefore, Theorem 1 implies that the dynamic regret with switching cost of DOGD is $O(\sqrt{\bar{d}T}(P_T+1))$ in the worst case, and improves to $O(\sqrt{S}(P_T+1))$ when Assumption 4 holds.

Remark 2. It is worth noting that if we set the learning rate $\eta = \frac{\sqrt{D(D+P_T)}}{G\sqrt{\sum_{t=1}^{T} m_t}}$, DOGD enjoys an $O(\sqrt{\bar{d}T(P_T+1)})$ regret bound, which is better than the bound in Theorem 1. This indicates that if we knew the path-length beforehand, we could tune the learning rate to obtain an improved bound. To deal with the uncertainty of the path-length, we run multiple instances of DOGD with different learning rates and track the best one using the meta-algorithm outlined below.

Meta-algorithm. As summarized in Algorithm 2, the meta-algorithm takes two input parameters: the step size $\alpha$ and the set $\mathcal{H}$ of learning rates for the experts.
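For intuition, the delayed update loop of the expert-algorithm (Algorithm 1) can be sketched as follows. We use a Euclidean ball as an illustrative feasible set, since the paper's $\Pi_{\mathcal{X}}$ is a projection onto a general convex domain; all names here are ours:

```python
import numpy as np

def project_onto_ball(z, radius):
    """Illustrative stand-in for the general projection Pi_X."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

class DOGDExpert:
    """Sketch of Algorithm 1: z starts at 0, and each delayed gradient that
    arrives triggers one projected descent step, in arrival order."""

    def __init__(self, eta, dim, radius):
        self.eta = eta
        self.radius = radius
        self.z = np.zeros(dim)  # z^eta_1 = 0

    def decide(self):
        # Step 3: submit x^eta_t = z^eta_tau to the meta-algorithm
        return self.z.copy()

    def receive(self, arrived_gradients):
        # Steps 5-8: one update per gradient in F_t, keeping arrival order
        for g in arrived_gradients:
            self.z = project_onto_ball(self.z - self.eta * g, self.radius)
```

Note that the counter $\tau$ of Algorithm 1 is implicit here: the decision simply stays fixed in rounds where no gradient arrives.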
Algorithm 2 Smelt-DOGD: Meta-algorithm
Require: A parameter $\alpha$ and a set $\mathcal{H}$ containing learning rates for experts
1: Activate a set of experts $\{E^{\eta_i} \mid \eta_i \in \mathcal{H}\}$ by invoking the expert-algorithm for each learning rate $\eta_i \in \mathcal{H}$
2: Sort the learning rates in ascending order, set $w^{\eta_i}_1 = \frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}$, and let $x_1, \{x^{\eta_i}_1\}_{i=1}^{N}$ be any point in $\mathcal{X}$
3: for time step $t = 1$ to $T$ do
4: Receive $x^{\eta_i}_t$ from each expert $E^{\eta_i}$
5: Play the decision $x_t = \sum_{\eta_i \in \mathcal{H}} w^{\eta_i}_t x^{\eta_i}_t$
6: Query $\nabla f_t(x_t)$ and receive $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$
7: Update the weight of each expert according to (6)
8: Send $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$ to each expert $E^{\eta_i}$
9: end for

We first activate a set of experts $\{E^{\eta_i} \mid \eta_i \in \mathcal{H}\}$ by invoking the expert-algorithm for each $\eta_i \in \mathcal{H}$ in Step 1. It is worth noting that the learning rates $\eta_i$ are arranged in ascending order, i.e., $\eta_1 \le \dots \le \eta_{|\mathcal{H}|}$. In Step 2, we initialize the weight $w^{\eta_i}_1$ of each expert as

$$w^{\eta_i}_1 = \frac{|\mathcal{H}|+1}{i(i+1)|\mathcal{H}|}. \qquad (5)$$

In each round $t \in [T]$, the meta-algorithm receives the decision $x^{\eta_i}_t$ from each expert-algorithm $E^{\eta_i}$ and then computes a weighted decision $x_t = \sum_{\eta_i \in \mathcal{H}} w^{\eta_i}_t x^{\eta_i}_t$. Subsequently, the meta-algorithm queries the gradient $\nabla f_t(x_t)$ of the current decision $x_t$. Due to the delayed feedback, it only receives the results of previous queries $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$. The meta-algorithm then updates the weight $w^{\eta_i}_{t+1}$ of each expert according to

$$w^{\eta_i}_{t+1} = \frac{w^{\eta_i}_t \, e^{-\alpha\left(\sum_{k \in \mathcal{F}_t} s_k(x^{\eta_i}_k) + \|x^{\eta_i}_t - x^{\eta_i}_{t-1}\|\right)}}{\sum_{\mu \in \mathcal{H}} w^{\mu}_t \, e^{-\alpha\left(\sum_{k \in \mathcal{F}_t} s_k(x^{\mu}_k) + \|x^{\mu}_t - x^{\mu}_{t-1}\|\right)}}, \qquad (6)$$

where $\alpha$ is a parameter and $s_k(x^{\eta_i}_k) = \langle \nabla f_k(x_k), x^{\eta_i}_k - x_k \rangle$; this surrogate loss avoids the inconsistent delays between the meta-algorithm and the expert-algorithms. Diverging from Mild-OGD, Smelt-DOGD incorporates the switching cost into the loss of each expert-algorithm when measuring its performance in (6). Lastly, the meta-algorithm sends the queried gradients $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$ to each expert $E^{\eta_i}$. We then present the following theorem to establish the theoretical guarantee for Smelt-DOGD.

Theorem 2.
Under Assumptions 1, 2 and 3, for any comparator sequence $u_1, \dots, u_T \in \mathcal{X}$, by setting

$$\mathcal{H} = \left\{ \eta_i = \frac{2^{i-1}D}{G\sqrt{\beta}} \,\middle|\, i = 1, \dots, N \right\}, \qquad \alpha = \frac{1}{(G+1)D\sqrt{\beta}},$$

where $N = \lceil \frac{1}{2}\log_2(T+1) \rceil + 1$, $\beta = \sum_{t=1}^{T} m_t$ and $m_t = t - \sum_{i=1}^{t} |\mathcal{F}_i|$, Smelt-DOGD ensures

$$\text{D-R-SC}(u_1, \dots, u_T) \le O\big(\sqrt{S(P_T+1)} + C\big),$$

where $S = \sum_{t=1}^{T} d_t$ and $C$ is defined in (4).

Remark 3. Theorem 2 implies that Smelt-DOGD attains an $O(\sqrt{S(P_T+1)} + C)$ bound for the dynamic regret with switching cost, which is of the same order as that of Mild-OGD. If Assumption 4 holds, Smelt-DOGD achieves a regret bound of $O(\sqrt{S(P_T+1)})$. Conversely, if Assumption 4 is not satisfied, Smelt-DOGD obtains an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound, where $\bar{d}$ is the maximum delay.

3.3 Efficient Smelt-DOGD

In Smelt-DOGD, each expert-algorithm is required to project its decision point $z^{\eta}_{\tau}$ onto the feasible domain $\mathcal{X}$ in (3) in every round. Since Smelt-DOGD simultaneously runs $O(\log T)$ expert-algorithm instances, it requires $O(\log T)$ projections on average per round, which may be computationally expensive, especially when the feasible domain $\mathcal{X}$ is complex. To overcome this issue, we develop an efficient version of Smelt-DOGD by using a black-box reduction technique [Cutkosky and Orabona, 2018; Cutkosky, 2020; Zhao et al., 2022; Yang et al., 2024b], which reduces the projection complexity from $O(\log T)$ to $1$. The theoretical analysis demonstrates that the efficient variant achieves a dynamic regret bound of the same order as Smelt-DOGD. Although this reduction has previously been utilized to reduce the projection complexity in standard OCO, our results demonstrate for the first time that it is also applicable to SOCO with delayed feedback. The key idea of our reduction is to replace the expensive projection onto the feasible domain $\mathcal{X}$ in the expert-algorithm with a cheaper rescaling operation.
Firstly, we introduce a simpler domain $\mathcal{Y}$ to replace the feasible domain $\mathcal{X}$. In particular, we choose $\mathcal{Y}$ as the smallest Euclidean ball containing the feasible domain, i.e., $\mathcal{Y} = \{x \mid \|x\| \le D\} \supseteq \mathcal{X}$. Then, following the approach of previous work [Cutkosky, 2020], we construct a surrogate loss function $g_t: \mathcal{Y} \mapsto \mathbb{R}$ over the domain $\mathcal{Y}$:

$$g_t(y^{\eta}_t) = \langle \nabla f_t(x_t), y^{\eta}_t - y_t \rangle - I_{\{\langle \nabla f_t(x_t), v_t \rangle < 0\}} \, \langle \nabla f_t(x_t), v_t \rangle \, M_{\mathcal{X}}(y^{\eta}_t), \qquad (7)$$

where $M_{\mathcal{X}}(y) = \inf_{x \in \mathcal{X}} \|y - x\|$ is the distance between $y$ and the domain $\mathcal{X}$, $v_t = (y_t - x_t)/\|y_t - x_t\|$ is a unit vector indicating the projection direction, and $I_{\{\cdot\}}$ is the indicator function defined as $I_{\{A\}} := 1$ if $A$ holds and $0$ otherwise. Our expert-algorithm minimizes the surrogate loss $g_t(y^{\eta}_t)$, so that it only performs a simple rescaling operation onto the surrogate domain $\mathcal{Y}$. In each round, the weighted decision point $y_t$ is projected back onto the feasible domain $\mathcal{X}$ only once, in the meta-algorithm. Applying this reduction to Smelt-DOGD, we develop the efficient version of Smelt-DOGD, as described below.

Expert-algorithm. As shown in Algorithm 3, our expert-algorithm performs the gradient descent step using the gradient $\nabla g_k(y_k)$ of the surrogate loss over the domain $\mathcal{Y}$.
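A useful property of (7) is that its gradient at any $y \in \mathcal{Y}$ depends only on $\nabla f_t(x_t)$ and the projection $\Pi_{\mathcal{X}}[y]$. A minimal sketch of this computation (function names are ours; the two cases follow Lemma 1 of Zhao et al. [2022]):

```python
import numpy as np

def surrogate_gradient(grad_f, y, x_proj, tol=1e-12):
    """Gradient of the surrogate loss g_t at y, given grad_f = grad f_t(x_t)
    and x_proj = Pi_X[y] (already available from the meta-algorithm).

    If <grad_f, v> >= 0 the surrogate gradient equals grad_f; otherwise the
    component of grad_f along the outward direction v is removed.
    """
    diff = y - x_proj
    dist = np.linalg.norm(diff)
    if dist <= tol:          # y already lies in X
        return grad_f
    v = diff / dist          # unit direction from Pi_X[y] to y
    inner = float(grad_f @ v)
    if inner >= 0:
        return grad_f
    return grad_f - inner * v
```

In both cases $\|\nabla g_t(y)\| \le \|\nabla f_t(x_t)\|$, so the gradient-norm bound of Assumption 2 is essentially preserved on the surrogate domain.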
In each round, after receiving the gradient set $\{\nabla g_k(y_k) \mid k \in \mathcal{F}_t\}$ from the meta-algorithm, the expert-algorithm utilizes the gradient of the surrogate loss to update its own decisions. After updating the decision $\hat{z}^{\eta}_{\tau+1}$ in Step 6, we only need a scaling operation to bring this decision back into the surrogate domain $\mathcal{Y}$:

$$z^{\eta}_{\tau+1} = \hat{z}^{\eta}_{\tau+1} \, I_{\{\|\hat{z}^{\eta}_{\tau+1}\| \le D\}} + \frac{D}{\|\hat{z}^{\eta}_{\tau+1}\|} \, \hat{z}^{\eta}_{\tau+1} \, I_{\{\|\hat{z}^{\eta}_{\tau+1}\| > D\}}, \qquad (8)$$

which is much more efficient than a general projection operation.

Algorithm 3 Efficient Smelt-DOGD: Expert-algorithm
Require: A learning rate $\eta$
1: Initialize $z^{\eta}_1 = 0$ and $\tau = 1$
2: for time step $t = 1$ to $T$ do
3: Submit $y^{\eta}_t = z^{\eta}_{\tau}$ to the meta-algorithm
4: Receive gradients $\{\nabla g_k(y_k) \mid k \in \mathcal{F}_t\}$ from the meta-algorithm
5: for $k \in \mathcal{F}_t$ do
6: Compute $\hat{z}^{\eta}_{\tau+1} = z^{\eta}_{\tau} - \eta \nabla g_k(y_k)$
7: Project $\hat{z}^{\eta}_{\tau+1}$ onto the domain $\mathcal{Y}$ according to (8)
8: Set $\tau = \tau + 1$
9: end for
10: end for

Algorithm 4 Efficient Smelt-DOGD: Meta-algorithm
Require: A parameter $\alpha$ and a set $\mathcal{H}$ containing learning rates for experts
1: Activate a set of experts $\{E^{\eta_i} \mid \eta_i \in \mathcal{H}\}$ by invoking the expert-algorithm for each learning rate $\eta_i \in \mathcal{H}$
2: Set $w^{\eta_i}_1$ of each expert according to (5) and initialize $x_1, \{y^{\eta_i}_1\}_{i=1}^{N}$ to be any point in $\mathcal{X}$
3: for time step $t = 1$ to $T$ do
4: Receive $y^{\eta_i}_t$ from each expert $E^{\eta_i}$
5: Compute the decision $y_{t+1} = \sum_{\eta_i \in \mathcal{H}} w^{\eta_i}_t y^{\eta_i}_t$
6: Submit $x_{t+1} = \Pi_{\mathcal{X}}[y_{t+1}]$
7: Query $\nabla f_t(x_t)$ and receive $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$
8: For each gradient in $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$, calculate $\{\nabla g_k(y_k) \mid k \in \mathcal{F}_t\}$
9: Update the weight of each expert according to (9)
10: Send $\{\nabla g_k(y_k) \mid k \in \mathcal{F}_t\}$ to each expert $E^{\eta_i}$
11: end for

Meta-algorithm. Our meta-algorithm is presented in Algorithm 4. Similar to Smelt-DOGD, we first activate a set of experts $\{E^{\eta_i} \mid \eta_i \in \mathcal{H}\}$ with different learning rates. In each round $t$, the meta-algorithm receives the decisions $y^{\eta_i}_t$ from the expert-algorithms and computes their weighted sum $y_{t+1} = \sum_{\eta_i \in \mathcal{H}} w^{\eta_i}_t y^{\eta_i}_t$.
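The rescaling step (8) admits a one-line implementation whose cost is $O(n)$ regardless of how complicated $\mathcal{X}$ is; a minimal sketch:

```python
import numpy as np

def rescale_to_ball(z_hat, D):
    """Operation (8): keep z_hat if ||z_hat|| <= D, otherwise scale it back
    onto the boundary of the ball Y = {x : ||x|| <= D}.  Inside each expert,
    this replaces the (potentially expensive) general projection Pi_X."""
    norm = np.linalg.norm(z_hat)
    return z_hat if norm <= D else (D / norm) * z_hat
```

This is exactly the Euclidean projection onto the ball $\mathcal{Y}$, which is why the experts can work over $\mathcal{Y}$ without ever touching $\Pi_{\mathcal{X}}$.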
Since these decisions $y^{\eta_i}_t$ lie within the domain $\mathcal{Y}$, the result $y_{t+1}$ needs to be projected onto the feasible domain $\mathcal{X}$, which is the only projection performed in each round. After receiving the gradient information, we calculate the gradients $\{\nabla g_k(y_k) \mid k \in \mathcal{F}_t\}$ from $\{\nabla f_k(x_k) \mid k \in \mathcal{F}_t\}$. According to Lemma 1 of Zhao et al. [2022], it is not hard to verify that for any $y \in \mathcal{Y}$, when $\langle \nabla f_t(x_t), v_t \rangle \ge 0$, we have $\nabla g_t(y) = \nabla f_t(x_t)$, where $v_t = (y - \Pi_{\mathcal{X}}[y])/\|y - \Pi_{\mathcal{X}}[y]\|$; when $\langle \nabla f_t(x_t), v_t \rangle < 0$, we have $\nabla g_t(y) = \nabla f_t(x_t) - \langle \nabla f_t(x_t), v_t \rangle v_t$. Notably, deriving the gradient $\nabla g_t(y_t)$ requires computing $\Pi_{\mathcal{X}}[y_t]$; however, we can simply reuse the result $x_t$ from Step 6 of Algorithm 4. We also incorporate the switching cost of each expert-algorithm over the domain $\mathcal{Y}$ into the weight update:

$$w^{\eta_i}_{t+1} = \frac{w^{\eta_i}_t \, e^{-\alpha\left(\sum_{k \in \mathcal{F}_t} h_k(y^{\eta_i}_k) + \|y^{\eta_i}_t - y^{\eta_i}_{t-1}\|\right)}}{\sum_{\mu \in \mathcal{H}} w^{\mu}_t \, e^{-\alpha\left(\sum_{k \in \mathcal{F}_t} h_k(y^{\mu}_k) + \|y^{\mu}_t - y^{\mu}_{t-1}\|\right)}}, \qquad (9)$$

where $h_k(y^{\eta_i}_k) = \langle \nabla g_k(y_k), y^{\eta_i}_k - y_k \rangle$. From the above discussion, it is evident that our efficient Smelt-DOGD reduces the projection complexity from $O(\log T)$ to $1$. Next, we present the dynamic regret bound with switching cost for the efficient Smelt-DOGD.

Theorem 3. Under Assumptions 1, 2 and 3, for any comparator sequence $u_1, \dots, u_T \in \mathcal{X}$, by setting

$$\mathcal{H} = \left\{ \eta_i = \frac{2^{i}D}{G\sqrt{\beta}} \,\middle|\, i = 1, \dots, N \right\}, \qquad \alpha = \frac{1}{2(G+2)D\sqrt{\beta}},$$

where $N = \lceil \frac{1}{2}\log_2(T+1) \rceil + 1$, $\beta = \sum_{t=1}^{T} m_t$ and $m_t = t - \sum_{i=1}^{t} |\mathcal{F}_i|$, the efficient Smelt-DOGD ensures

$$\text{D-R-SC}(u_1, \dots, u_T) \le O\big(\sqrt{S(P_T+1)} + C\big),$$

where $S = \sum_{t=1}^{T} d_t$ and $C$ is defined in (4).

Remark 4. Theorem 3 demonstrates that the efficient Smelt-DOGD achieves an $O(\sqrt{S(P_T+1)} + C)$ dynamic regret with switching cost, which implies that it reduces the projection complexity of Smelt-DOGD from $O(\log T)$ to $1$ while preserving the order of its regret bound.
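The weight updates (6) and (9) share the same form: an exponential reweighting by each expert's delayed surrogate loss plus its own switching cost, followed by normalization. A vectorized sketch, together with the initialization (5) (function names are ours):

```python
import numpy as np

def initial_weights(num_experts):
    """Initialization (5): w_1^{eta_i} = (N + 1) / (i (i + 1) N), i = 1..N.
    The telescoping sum 1/(i(i+1)) = 1/i - 1/(i+1) makes the weights sum to 1."""
    i = np.arange(1, num_experts + 1)
    return (num_experts + 1) / (i * (i + 1) * num_experts)

def update_weights(weights, expert_losses, alpha):
    """One step of (6)/(9): expert_losses[i] is the sum of the delayed
    surrogate losses over F_t plus expert i's switching cost in round t."""
    w = weights * np.exp(-alpha * np.asarray(expert_losses))
    return w / w.sum()
```

Because the switching cost enters the exponent, experts that move a lot are penalized even when their hitting costs are small, which is precisely the modification that distinguishes Smelt-DOGD from Mild-OGD.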
4 Experiments

In this section, we evaluate the performance of the proposed Smelt-DOGD and efficient Smelt-DOGD against an existing algorithm for OCO with delayed feedback, Mild-OGD [Wan et al., 2024], through numerical experiments. We set the parameters of Smelt-DOGD and efficient Smelt-DOGD according to Theorems 2 and 3, respectively. As for Mild-OGD, we follow the recommendations of Theorem 3 in Wan et al. [2024].

4.1 Online Classification

We implement online classification on the ijcnn1 dataset from LIBSVM Data [Chang and Lin, 2011; Prokhorov, 2001]. Let $T$ denote the total number of rounds. In each round $t \in [T]$, a batch of training examples $\{(x_{t,1}, y_{t,1}), \dots, (x_{t,m}, y_{t,m})\}$ arrives, where $(x_{t,i}, y_{t,i}) \in [-1,1]^n \times \{-1,1\}$ for $i = 1, \dots, m$. The online learner aims to predict a linear model $w_t \in \mathcal{W}$ and suffers both a hitting cost $f_t(w_t) = \frac{1}{m}\sum_{i=1}^{m} |w_t^{\top} x_{t,i} - y_{t,i}|$ and a switching cost $\|w_t - w_{t-1}\|$. In practical applications, there is often a trade-off parameter on the switching cost, i.e., $\lambda \|w_t - w_{t-1}\|$, so the total loss in each round is $H_t(w_t) = \frac{1}{m}\sum_{i=1}^{m} |w_t^{\top} x_{t,i} - y_{t,i}| + \lambda \|w_t - w_{t-1}\|$. Although Smelt-DOGD only considers the setting where $\lambda = 1$, we can rescale the loss function to satisfy this requirement. Specifically, we redefine the hitting cost as $f_t(w_t)/\lambda$, and our methods then minimize the loss function $h_t(w_t) = \frac{1}{\lambda m}\sum_{i=1}^{m} |w_t^{\top} x_{t,i} - y_{t,i}| + \|w_t - w_{t-1}\|$. In this experiment, we set the domain diameter to $D = 10$ and follow Zhao et al. [2022] in choosing the domain $\mathcal{W}$ as an ellipsoid $\mathcal{W} = \{w \in \mathbb{R}^n \mid w^{\top} E w \le \lambda_{\min}(E)(D/2)^2\}$, where $E$ is a certain diagonal matrix and $\lambda_{\min}(E)$ denotes its minimum eigenvalue. To simulate a changing environment, we flip the labels of the samples every 1000 iterations. For this dataset, the dimensionality is $n = 22$, and we set $T = 4000$, batch size $m = 256$, $G = 22$ and $\lambda = 10$; the delay $d_t$ is selected uniformly at random from $[1, 5]$.
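The $\lambda$-rescaling used above preserves the relative ordering of the algorithms, since $H_t(w) = \lambda \cdot h_t(w)$. A small sketch of the per-round losses from the classification experiment (function names are ours):

```python
import numpy as np

def hitting_loss(w, X_batch, y_batch):
    """f_t(w) = (1/m) * sum_i |w^T x_{t,i} - y_{t,i}|."""
    return float(np.mean(np.abs(X_batch @ w - y_batch)))

def rescaled_round_loss(w, w_prev, X_batch, y_batch, lam):
    """h_t(w) = f_t(w) / lam + ||w - w_prev||: after dividing the hitting
    cost by lam, the switching cost has unit weight, as the analysis
    requires.  The original total loss is recovered as H_t(w) = lam * h_t(w)."""
    return hitting_loss(w, X_batch, y_batch) / lam + float(np.linalg.norm(w - w_prev))
```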
The results, including the cumulative loss $\sum_{t=1}^{T} h_t(w_t)$, the instantaneous loss $h_t(w_t)$ and the running time (in seconds), are presented in Figure 1 with error bars. All curves are averaged over 10 runs with different random seeds. As can be seen, both of our proposed methods suffer less cumulative loss than Mild-OGD. Moreover, our efficient Smelt-DOGD achieves performance comparable to Smelt-DOGD, while running approximately 4 times faster thanks to the improved projection complexity.

4.2 Online Regression

In this experiment, we consider a least mean square regression problem. In each round $t$, a small batch of training examples $\{(x_{t,1}, y_{t,1}), \dots, (x_{t,m}, y_{t,m})\}$ arrives, and simultaneously, the learner makes a prediction $w_t$ of the unknown parameter. The learner suffers a loss defined as $f_t(w_t) = \frac{1}{m}\sum_{i=1}^{m}(w_t^{\top} x_{t,i} - y_{t,i})^2 + \lambda \|w_t - w_{t-1}\|$, where $m$ is the batch size and $\lambda$ is the trade-off parameter. We again aim to minimize the rescaled function $h_t(w_t) = \frac{1}{\lambda m}\sum_{i=1}^{m}(w_t^{\top} x_{t,i} - y_{t,i})^2 + \|w_t - w_{t-1}\|$. We conduct the experiments on a synthetic dataset, constructed through the following process. We first sample the ground-truth vector $w^*$ from an ellipsoid $\mathcal{W} = \{w \in \mathbb{R}^n \mid w^{\top} E w \le \lambda_{\min}(E)(D/2)^2\}$, where $D$ is set to 10, $E$ is a certain diagonal matrix and $\lambda_{\min}(E)$ denotes its minimum eigenvalue. As for the feature vectors $x_{t,i}$, we sample them uniformly at random from $[-1,1]^n$. Afterwards, we generate $y_{t,i}$ according to a linear model $y_{t,i} = \langle w^*, x_{t,i} \rangle + \epsilon_t$, where the noise $\epsilon_t$ is drawn from a normal distribution with mean 0 and standard deviation 0.1. We set the dimensionality $n = 500$, $T = 4000$, batch size $m = 128$, trade-off parameter $\lambda = 10$ and $D = 200$; the delay $d_t$ is selected uniformly at random from $[1, 5]$. To simulate a changing environment, we flip $w^*$ every 1000 iterations. We report the cumulative loss $\sum_{t=1}^{T} h_t(w_t)$, the instantaneous loss $h_t(w_t)$ and the running time (in seconds) in Figure 2 with error bars.
All curves are averaged over 5 runs with different random seeds.

Figure 1: Results for the online classification: (a) cumulative loss, (b) instantaneous loss and (c) running time versus the number of iterations, for Mild-OGD, Smelt-DOGD and efficient Smelt-DOGD.

Figure 2: Results for the online regression: (a) cumulative loss, (b) instantaneous loss and (c) running time versus the number of iterations, for Mild-OGD, Smelt-DOGD and efficient Smelt-DOGD.

As shown in Figure 2, our Smelt-DOGD and efficient Smelt-DOGD suffer less loss than Mild-OGD. Additionally, efficient Smelt-DOGD is significantly more efficient than the other methods, albeit with a slight compromise in cumulative loss. To summarize, the empirical results demonstrate the effectiveness of our methods in addressing the switching cost, as well as the superiority of efficient Smelt-DOGD in terms of running time.

5 Conclusion

In this paper, we investigate SOCO with arbitrary delays. We extend Mild-OGD, a two-layer algorithm for OCO with delayed feedback, to account for the switching cost. We first provide a novel analysis of its expert-algorithm, and then take the switching cost into consideration when assigning the weight to each expert in the meta-algorithm. Our proposed method, Smelt-DOGD, achieves an $O(\sqrt{\bar{d}T(P_T+1)})$ dynamic regret bound with switching cost. Moreover, to reduce the computational overhead, we employ a black-box technique to develop an efficient version of Smelt-DOGD, which performs only a single projection per round while keeping the order of its regret bound. Finally, the experimental results further validate the superiority of our methods.
Acknowledgments
This work was partially supported by NSFC (U23A20382, 62361146852), and the Collaborative Innovation Center of Novel Software Technology and Industrialization. The authors would like to thank the anonymous reviewers for their constructive suggestions.

References
[Baby and Wang, 2021] Dheeraj Baby and Yu-Xiang Wang. Optimal dynamic regret in exp-concave online learning. In Proceedings of the 34th Annual Conference on Learning Theory, pages 359–409, 2021.
[Baby and Wang, 2022] Dheeraj Baby and Yu-Xiang Wang. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1805–1845, 2022.
[Bansal et al., 2015] Nikhil Bansal, Anupam Gupta, Ravishankar Krishnaswamy, Kirk Pruhs, Kevin Schewior, and Cliff Stein. A 2-competitive algorithm for online convex optimization with switching costs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2015.
[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.
[Chen et al., 2015] Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan LH Andrew. Online convex optimization using predictions. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 191–204, 2015.
[Chen et al., 2018] Niangjun Chen, Gautam Goel, and Adam Wierman. Smoothed online convex optimization in high dimensions via online balanced descent. In Proceedings of the 31st Annual Conference on Learning Theory, pages 1574–1594, 2018.
[Cutkosky and Orabona, 2018] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the 31st Annual Conference on Learning Theory, pages 1493–1529, 2018.
[Cutkosky, 2020] Ashok Cutkosky. Parameter-free, dynamic, and strongly-adaptive online learning. In Proceedings of the 37th International Conference on Machine Learning, pages 2250–2259, 2020.
[Freund and Schapire, 1997] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[Goel and Wierman, 2019] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and LQR control. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 2504–2513, 2019.
[Goel et al., 2019] Gautam Goel, Yiheng Lin, Haoyuan Sun, and Adam Wierman. Beyond online balanced descent: An optimal algorithm for smoothed online optimization. In Advances in Neural Information Processing Systems 32, pages 1875–1885, 2019.
[He et al., 2014] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the 8th International Workshop on Data Mining for Online Advertising, pages 1–9, 2014.
[Jadbabaie et al., 2015] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 398–406, 2015.
[Joseph and de Veciana, 2012] Vinay Joseph and Gustavo de Veciana. Jointly optimizing multi-user rate adaptation for video transport over wireless systems: Mean-fairness-variability tradeoffs. In Proceedings of the 31st Annual IEEE International Conference on Computer Communications, pages 567–575, 2012.
[Joulani et al., 2013] Pooria Joulani, Andras Gyorgy, and Csaba Szepesvari. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pages 1453–1461, 2013.
[Kim et al., 2015] Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews. A decision tree framework for spatiotemporal sequence prediction. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 577–586, 2015.
[Korotin et al., 2020] Alexander Korotin, Vladimir V'yugin, and Evgeny Burnaev. Adaptive hedging under delayed feedback. Neurocomputing, 397:356–368, 2020.
[Li et al., 2018] Yingying Li, Guannan Qu, and Na Li. Using predictions in online optimization with switching costs: A fast algorithm and a fundamental limit. In 2018 Annual American Control Conference, pages 3008–3013, 2018.
[Li et al., 2020] Yingying Li, Guannan Qu, and Na Li. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. IEEE Transactions on Automatic Control, pages 4761–4768, 2020.
[Lin et al., 2011] Minghong Lin, Adam Wierman, Lachlan LH Andrew, and Eno Thereska. Dynamic right-sizing for power-proportional data centers. In Proceedings of the 30th IEEE International Conference on Computer Communications, pages 1098–1106, 2011.
[McMahan and Streeter, 2014] Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems 27, pages 2915–2923, 2014.
[McMahan et al., 2013] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230, 2013.
[Mokhtari et al., 2016] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 7195–7201, 2016.
[Prokhorov, 2001] Danil Prokhorov. IJCNN 2001 neural network competition. Slide presentation in IJCNN, 1(97):38, 2001.
[Quanrud and Khashabi, 2015] Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems 28, pages 1270–1278, 2015.
[Shalev-Shwartz, 2012] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[Wan et al., 2021] Yuanyu Wan, Bo Xue, and Lijun Zhang. Projection-free online learning in dynamic environments. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 10067–10075, 2021.
[Wan et al., 2022a] Yuanyu Wan, Wei-Wei Tu, and Lijun Zhang. Online Frank-Wolfe with arbitrary delays. In Advances in Neural Information Processing Systems 35, pages 19703–19715, 2022.
[Wan et al., 2022b] Yuanyu Wan, Wei-Wei Tu, and Lijun Zhang. Online strongly convex optimization with unknown delays. Machine Learning, 111(3):871–893, 2022.
[Wan et al., 2023] Yuanyu Wan, Yibo Wang, Chang Yao, Wei-Wei Tu, and Lijun Zhang. Projection-free online learning with arbitrary delays. arXiv preprint arXiv:2204.04964, 2023.
[Wan et al., 2024] Yuanyu Wan, Chang Yao, Mingli Song, and Lijun Zhang. Non-stationary online convex optimization with arbitrary delays. In Proceedings of the 41st International Conference on Machine Learning, 2024.
[Wang et al., 2021] Juncheng Wang, Ben Liang, Min Dong, Gary Boudreau, and Hatem Abou-Zeid. Delay-tolerant constrained OCO with application to network resource allocation. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pages 1–10, 2021.
[Wang et al., 2024] Yibo Wang, Wenhao Yang, Wei Jiang, Shiyin Lu, Bin Wang, Haihong Tang, Yuanyu Wan, and Lijun Zhang. Non-stationary projection-free online learning with dynamic and adaptive regret guarantees. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, pages 15671–15679, 2024.
[Weinberger and Ordentlich, 2002] Marcelo J. Weinberger and Erik Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, 2002.
[Yang et al., 2024a] Wenhao Yang, Wei Jiang, Yibo Wang, Ping Yang, Yao Hu, and Lijun Zhang. Small-loss adaptive regret for online convex optimization. In Proceedings of the 41st International Conference on Machine Learning, pages 56156–56195, 2024.
[Yang et al., 2024b] Wenhao Yang, Yibo Wang, Peng Zhao, and Lijun Zhang. Universal online convex optimization with 1 projection per round. In Advances in Neural Information Processing Systems 37, pages 31438–31472, 2024.
[Yang et al., 2025] Sifan Yang, Yuanyu Wan, and Lijun Zhang. Online nonsubmodular optimization with delayed feedback in the bandit setting. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, pages 21992–22000, 2025.
[Zanini et al., 2010] Francesco Zanini, David Atienza, Giovanni De Micheli, and Stephen P. Boyd. Online convex optimization-based algorithm for thermal management of MPSoCs. In Proceedings of the 20th Symposium on Great Lakes Symposium on VLSI, pages 203–208, 2010.
[Zhang et al., 2017] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732–741, 2017.
[Zhang et al., 2018a] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pages 2250–2259, 2018.
[Zhang et al., 2018b] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5882–5891, 2018.
[Zhang et al., 2020] Lijun Zhang, Shiyin Lu, and Tianbao Yang. Minimizing dynamic regret and adaptive regret simultaneously. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 309–319, 2020.
[Zhang et al., 2021] Lijun Zhang, Wei Jiang, Shiyin Lu, and Tianbao Yang. Revisiting smoothed online learning. In Advances in Neural Information Processing Systems 34, pages 13599–13612, 2021.
[Zhang et al., 2022] Lijun Zhang, Wei Jiang, Jinfeng Yi, and Tianbao Yang. Smoothed online convex optimization based on discounted-normal-predictor. In Advances in Neural Information Processing Systems 35, pages 4928–4942, 2022.
[Zhao et al., 2022] Peng Zhao, Yan-Feng Xie, Lijun Zhang, and Zhi-Hua Zhou. Efficient methods for non-stationary online learning. In Advances in Neural Information Processing Systems 35, pages 11573–11585, 2022.
[Zhou et al., 2018] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. Distributed asynchronous optimization with unbounded delays: How slow can you go? In Proceedings of the 35th International Conference on Machine Learning, pages 5970–5979, 2018.
[Zinkevich, 2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.