# online_frankwolfe_with_arbitrary_delays__be6ba690.pdf

Online Frank-Wolfe with Arbitrary Delays

Yuanyu Wan1,2, Wei-Wei Tu3,4, Lijun Zhang4,

1School of Software Technology, Zhejiang University, Ningbo, China 2Zhejiang University-China Southern Power Grid Joint Research Centre on AI, Hangzhou, China 34Paradigm Inc., Beijing, China 4National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China wanyy@zju.edu.cn, tuweiwei@4paradigm.com, zhanglj@lamda.nju.edu.cn

The online Frank-Wolfe (OFW) method has gained much popularity for online convex optimization due to its projection-free property. Previous studies show that OFW can attain an O(T 3/4) regret bound for convex losses and an O(T 2/3) regret bound for strongly convex losses. However, they assume that each gradient queried by OFW is revealed immediately, which may not hold in practice and limits the application of OFW. To address this limitation, we propose a delayed variant of OFW, which allows gradients to be delayed by arbitrary rounds. The main idea is to perform an update similar to OFW after receiving any delayed gradient, and play the latest decision for each round. Despite its simplicity, we prove that our delayed variant of OFW is able to achieve an O(T 3/4 + d T 1/4) regret bound for convex losses and an O(T 2/3 + d log T) regret bound for strongly convex losses, where d is the maximum delay. This is quite surprising since under a relatively large amount of delay (e.g., d = O(

T) for convex losses and d = O(T 2/3/ log T) for strongly convex losses), the delayed variant of OFW enjoys the same regret bound as that of the original OFW.

1 Introduction

Online convex optimization (OCO) has become a leading paradigm for online learning due to its capability to model various problems from diverse domains such as online routing, online collaborative ﬁltering, and online advertisement [Hazan, 2016]. In general, it is formulated as a structured repeated game between a player and an adversary. In each round t, the player ﬁrst chooses a decision xt from a convex decision set K Rn, where n is the dimensionality. Then, the adversary selects a convex function ft(x) : K 7 R, and the player suffers a loss ft(xt). The player aims to choose decisions such that the regret R(T) = PT t=1 ft(xt) minx K PT t=1 ft(x) is sublinear in the number of total rounds T. Online gradient descent (OGD) is a standard method for OCO, which enjoys an O(

T) regret bound for convex losses [Zinkevich, 2003] and an O(log T) regret bound for strongly convex losses [Hazan et al., 2007]. However, it needs to compute a projection onto the decision set to ensure the feasibility of each decision, which is computationally expensive for complex decision sets [Hazan and Kale, 2012].

To tackle this computational issue, Hazan and Kale [2012] propose the online Frank-Wolfe (OFW) method, which has become one of the most commonly used algorithms for OCO over complex decision sets. The main advantage of OFW is its projection-free property: instead of performing the projection operation, it utilizes a linear optimization step to select a feasible decision, which could be much more efﬁcient. For example, in the problem of online collaborative ﬁltering, the decision set

Lijun Zhang is the corresponding author.

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

consists of matrices with a bounded trace norm, and the linear optimization step is at least an order of magnitude faster than the projection operation [Hazan and Kale, 2012]. Moreover, it has been shown that OFW achieves an O(T 3/4) regret bound for convex losses [Hazan and Kale, 2012, Hazan, 2016] and an O(T 2/3) regret bound for strongly convex losses [Wan and Zhang, 2021, Garber and Kretzu, 2021], which are the best known regret bounds of projection-free methods without further assumptions.

However, OFW requires that the gradient ft(xt) is revealed immediately after making the decision xt, which is not necessarily satisﬁed in reality. For example, in the previously mentioned online collaborative ﬁltering [Hazan and Kale, 2012], the decision is a prediction of a user-item rating matrix, and the corresponding gradient depends on the true rating of a user on a item, which may not be decided by the user immediately. Therefore, it is natural to consider a more practical setting, where the gradient ft(xt) arrives at the end of round t + dt 1, and dt 1 denotes an arbitrary delay. To handle this setting, one potential way is to combine OFW with an existing black-box technique for converting any traditional OCO algorithm into this delayed setting [Joulani et al., 2013]. To be precise, this black-box technique is to pool independent instances of OFW, each of which acts as a learner in the non-delayed setting over a subsequence of rounds. In each round, a single instance will be taken out from the pool, which makes a decision and then waits for its feedback before rejoining the pool. If the pool is empty, a new instance of OFW will be added to it. Moreover, according to Joulani et al. [2013], their black-box technique is able to attain a regret bound of d R(T/d) by combining with a traditional OCO algorithm with R(T) regret, where d is the maximum delay. As a result, combining it with OFW will attain an O(d1/4T 3/4) regret bound for convex losses and an O(d1/3T 2/3) regret bound for strongly convex losses, which magnify the regret bounds of OFW in the non-delayed setting by a coefﬁcient depending the delay. Thus, it is natural to ask whether the effect of delay can be further reduced.

In this paper, we give an afﬁrmative answer by developing a simple method called delayed OFW, which is robust to a relatively large amount of delay for both convex and strongly convex losses. Different from the black-box technique that needs to maintain multiple instances of OFW [Joulani et al., 2013], our delayed OFW is a natural extension of OFW in the delayed setting, which updates the decision similar to OFW after receiving any delayed gradient, and plays the latest decision for each round. Our theoretical contributions are summarized as follows.

First, we prove that our delayed OFW attains an O(T 3/4 + d T 1/4) regret bound for convex losses, where d is the maximum delay, which matches the O(T 3/4) regret bound in the non-delayed setting as long as d does not exceed O(

Second, we prove that our delayed OFW attains an O(T 2/3 + d log T) regret bound for strongly convex losses, which matches the O(T 2/3) regret bound in the non-delayed setting as long as d does not exceed O(T 2/3/ log T).

Therefore, our regret bounds are strictly better than those achieved by combining the black-box technique [Joulani et al., 2013] with OFW, when the term involving d in them is not dominant. Furthermore, simulation experiments are conducted to verify the performance of our delayed OFW.

2 Related work

In this section, we brieﬂy review related work on projection-free algorithms for OCO, and OCO under delayed feedback.

2.1 Projection-free algorithms for OCO

The OFW method [Hazan and Kale, 2012, Hazan, 2016] is the ﬁrst projection-free algorithm for OCO, which is an online extension of the classical Frank-Wolfe method [Frank and Wolfe, 1956, Jaggi, 2013]. For convex losses, OFW ﬁrst chooses an arbitrary x1 K, and then iteratively updates its decision by the following linear optimization step

vt argmin x K Ft(xt), x , xt+1 = xt + σt(vt xt) (1)

where Ft(x) is a surrogate loss function deﬁned as

i=1 fi(xi), x + x x1 2 2 (2)

and η, σt are two parameters. By setting parameters appropriately, it can attain an O(T 3/4) regret bound for convex losses.

If losses are convex and smooth, Hazan and Minasyan [2020] propose a randomized projection-free method, which is based on a classical OCO method called follow the perturbed leader [Kalai and Vempala, 2005], and achieve a regret bound of O(T 2/3). Recently, Wan and Zhang [2021] prove that OFW can achieve an O(T 2/3) regret bound for strongly convex losses. Speciﬁcally, to utilize the strong convexity of losses, they redeﬁne Ft(x) in (2) to

fi(xi), x + β

where β is the modulus of the strong convexity. The same regret bound is concurrently established by Garber and Kretzu [2021] in a similar way.

Moreover, projection-free OCO algorithms, which are able to achieve an O(

T) regret bound for convex losses and an O(log T) regret bound for strongly convex losses, have been proposed for polyhedral sets [Garber and Hazan, 2016], smooth sets [Levy and Krause, 2019], and special sets that are accessible through a membership oracle [Mhammedi, 2022] or a separation oracle [Garber and Kretzu, 2022], respectively. Although regret bounds of these algorithms are better than those of OFW, they can only be efﬁciently implemented in some special cases, and thus are still not able to replace OFW. In addition to these algorithms, Wan and Zhang [2021] show that by setting σt with a line search technique, OFW can adaptively utilize the strong convexity of decision sets to attain an O(T 2/3) regret bound for convex losses and an O(

T) regret bound for strongly convex losses. Besides the standard setting, projection-free algorithms for other OCO scenarios such as the distributed setting [Zhang et al., 2017, Wan et al., 2020, 2022b] and the bandit setting [Chen et al., 2019, Garber and Kretzu, 2020, 2021] have also been proposed.

However, despite this great ﬂourish of research on projection-free OCO algorithms, the practical problem of delayed feedback has not been considered.

2.2 OCO under delayed feedback

The starting point for studies on OCO under delayed feedback is the seminal work of Weinberger and Ordentlich [2002], which focuses on a special case with a ﬁxed delay d , i.e., the feedback for each decision xt is received at the end of round t + d 1. They propose a technique that can convert any traditional OCO algorithm for the non-delayed setting into this delayed setting. Speciﬁcally, their technique is to run d instances of a traditional OCO algorithm, where each instance is used every d rounds, which is feasible because the feedback for any decision xt must be available for selecting the decision xt+d . If the traditional OCO algorithm enjoys a regret bound of R(T) in the non-delayed setting, they prove that this technique achieves a regret bound of d R(T/d ). As a result, by combining with OGD, the regret bound of this technique is O(

d T) for convex losses and O(d log T) for strongly convex losses. However, it needs to run d instances of OGD, which requires more storage and computational resources than the original OGD. To address this problem, Langford et al. [2009] study the same delayed setting, and show that simply performing each gradient descent step in the original OGD with a delayed gradient can also achieve the O(

d T) regret bound for convex losses and the O(d log T) regret bound for strongly convex losses.

Furthermore, Joulani et al. [2013] extend the technique in Weinberger and Ordentlich [2002] to handle a more general setting, where each feedback is delayed by arbitrary rounds. For this setting, their technique is able to attain a regret bound of d R(T/d) by combining with a traditional OCO algorithm with R(T) regret, where d is the maximum delay. Note that although this technique can also convert OFW to the delayed setting, it can only attain an O(d1/4T 3/4) regret bound for convex losses and an O(d1/3T 2/3) regret bound for strongly convex losses, which cannot match regret bounds of OFW in the non-delayed setting as long as d is larger than Ω(1). Additionally, similar to the technique

in Weinberger and Ordentlich [2002], it also needs to run multiple instances of a traditional OCO algorithm, which could be prohibitively resource-intensive. Many studies [Quanrud and Khashabi, 2015, Joulani et al., 2016, Li et al., 2019, Flaspohler et al., 2021, Wan et al., 2022a] have proposed delayed OCO algorithms, which only require the same storage and computational resources as in the non-delayed setting, but do not consider projection-free algorithms.

3 Main results

In this section, we ﬁrst introduce necessary preliminaries including the problem setting, deﬁnitions, and assumptions. Then, we present our delayed OFW and the corresponding theoretical guarantees for convex and strongly convex losses, respectively.

3.1 Preliminaries

We consider the problem of OCO with arbitrary delays [Joulani et al., 2013, Quanrud and Khashabi, 2015]. Similar to the standard OCO, in each round t = 1, . . . , T, the player ﬁrst chooses a decision xt from the decision set K, and then the adversary selects a convex function ft(x). However, different from the standard OCO, the gradient gt = ft(xt) is revealed at the end of round t + dt 1, where dt 1 denotes an arbitrary delay. As a result, the player actually receives gradients {gk|k Ft} at the end of round t, where Ft = {k|k + dk 1 = t}. Then, we recall the standard deﬁnition for strongly convex functions [Boyd and Vandenberghe, 2004].

Deﬁnition 1 A function f(x) : K R is called β-strongly convex over K if for all x, y K, it holds that f(y) f(x) + f(x), y x + β

Finally, we introduce two assumptions, which are commonly used in studies about OCO [Shalev Shwartz, 2011, Hazan, 2016].

Assumption 1 The gradients of all loss functions are bounded by G, i.e., it holds that ft(x) 2 G for any x K and t [T].

Assumption 2 The diameter of the decision set K is bounded by D, i.e., it holds that x y 2 D for any x, y K.

3.2 Delayed OFW for convex losses

As presented in (1) and (2), the original OFW for convex losses requires the gradient gt before making the decision xt+1. However, in the problem of OCO with arbitrary delays, this requirement is not necessarily satisﬁed, because the player actually receives gradients {gk|k Ft} at each round t, which may not contain gt. To address this limitation, we propose a natural generalization of OFW to this delayed problem, which is described as follows.

The main idea is to update the decision similar to OFW for each received gradient, and play the latest decision for each round. In this way, there exist some intermediate decisions that are not really played. To facilitate presentations, we introduce an additional notation yτ to denote the τ-th intermediate decision. Moreover, we denote the sum of τ received gradients by gτ. Initially, we choose an arbitrary vector y1 K and set τ = 1, g0 = 0. At each round t = 1, . . . , T, we play the latest decision xt = yτ and query the gradient gt = ft(xt). Then, we receive delayed gradients that are queried in a set of rounds Ft, and perform the following steps for each received gradient.

Speciﬁcally, for any k Ft, inspired by (2) of the original OFW, we ﬁrst compute gτ = gτ 1 + gk and deﬁne

Fτ(y) = η gτ, y + y y1 2 2.

Then, similar to (1) of the original OFW, we perform the following update

vτ argmin y K Fτ(yτ), y , yτ+1 = yτ + στ(vτ yτ)

where στ is a parameter. Following Wan and Zhang [2021], it is set by a line search rule

στ = argmin σ [0,1] σ(vτ yτ), Fτ(yτ) + σ2 vτ yτ 2 2. (4)

Algorithm 1 Delayed OFW for Convex Losses

1: Input: η 2: Initialization: choose an arbitrary vector y1 K and set τ = 1, g0 = 0 3: for t = 1, 2, . . . , T do 4: Play xt = yτ and query gt = ft(xt) 5: Receive a set of delayed gradients {gk|k Ft} 6: for k Ft do 7: Update gτ = gτ 1 + gk and deﬁne Fτ(y) = η gτ, y + y y1 2 2 8: Compute vτ argminy K Fτ(yτ), y 9: Update yτ+1 = yτ + στ(vτ yτ) with στ in (4) and set τ = τ + 1 10: end for 11: end for

Finally, we update τ = τ + 1 so that τ still indexes the latest intermediate decision.

The detailed procedures are summarized in Algorithm 1, which is named as delayed OFW for convex losses. Let d = max{d1, . . . , d T }. We establish the following theorem with respect to the regret of Algorithm 1.

Theorem 1 For any x K, under Assumptions 1 and 2, Algorithm 1 with η = D

2G(T +2)3/4 has

t=1 ft(x ) = O(T 3/4 + d T 1/4).

Theorem 1 shows that without knowing the value of d, our Algorithm 1 can attain an O(T 3/4+d T 1/4) regret bound for convex losses with arbitrary delays. This bound matches the O(T 3/4) regret bound of OFW in the non-delayed setting [Hazan, 2016], as long as d does not exceed O(

T). Moreover, it is better than the O(d1/4T 3/4) regret bound achieved by combining the technique of Joulani et al. [2013] and the O(T 3/4) regret bound of OFW for convex losses, as long as d does not exceed O(T 2/3).

3.3 Delayed OFW for strongly convex losses

We proceed to handle β-strongly convex losses by slightly modifying Algorithm 1. Recall that in the standard OCO without delays, the critical idea of utilizing the strong convexity of losses is to replace the surrogate loss function in (1) by that in (3) [Wan and Zhang, 2021]. The main difference is that the regularization term in (3) is about all historical decisions, instead of only the initial decision.

Inspired by (3), we ﬁrst redeﬁne Fτ(y) in Algorithm 1 to

Fτ(y) = gτ, y +

2 y yi 2 2.

Second, since Fτ(y) is modiﬁed, we adjust the line search rule to

στ = argmin σ [0,1] σ(vτ yτ), Fτ(yτ) + βτσ2

2 vτ yτ 2 2. (5)

The detailed procedures are summarized in Algorithm 2, which is named as delayed OFW for strongly convex losses. Then, we establish the following theorem about the regret of Algorithm 2.

Theorem 2 Suppose all losses are β-strongly convex and Assumptions 1 and 2 hold. For any x K, Algorithm 2 has T X

t=1 ft(x ) = O(T 2/3 + d log T).

Theorem 2 shows that our Algorithm 2 can attain an O(T 2/3 + d log T) regret bound for strongly convex losses with arbitrary delays. First, this bound is better than the O(T 3/4 +d T 1/4) regret bound

Algorithm 2 Delayed OFW for Strongly Convex Losses

1: Input: β 2: Initialization: choose an arbitrary vector y1 K and set τ = 1, g0 = 0 3: for t = 1, 2, . . . , T do 4: Play xt = yτ and query gt = ft(xt) 5: Receive a set of delayed gradients {gk|k Ft} 6: for k Ft do 7: Update gτ = gτ 1 + gk and deﬁne Fτ(y) = gτ, y + Pτ i=1 β 2 y yi 2 2 8: Compute vτ argminy K Fτ(yτ), y 9: Update yτ+1 = yτ + στ(vτ yτ) with στ in (5) and set τ = τ + 1 10: end for 11: end for

in Theorem 1, which is established by only using the convexity condition. Second, it matches the O(T 2/3) regret bound of OFW for strongly convex losses in the non-delayed setting [Wan and Zhang, 2021], as long as d does not exceed O(T 2/3/ log T). Third, it is better than the O(d1/3T 2/3) regret bound achieved by combining the technique of Joulani et al. [2013] and the O(T 2/3) regret bound of OFW for strongly convex losses, as long as d does not exceed O(T/(log T)3/2).

4 Theoretical analysis

In this section, we ﬁrst introduce some insights and preliminaries for our analysis, and then prove Theorems 1 and 2 by introducing some lemmas. Due to the limitation of space, the proofs for those lemmas are provided in the appendix.

4.1 Insights and preliminaries

Note that in the non-delayed setting, the regret bound of OFW is worse than that of OGD due to the gap between the linear optimization step and the projection operation. Thus, in the delayed setting, it is natural to conjecture that both the linear optimization step and the delay will affect the regret of our delayed OFW. To keep the same regret bound as the original OFW, our key insight is to prove that the combined effect of the linear optimization step and the delay can be additive, instead of being multiplicative, and the effect of the delay can be weaker than that of the linear optimization step for a relatively large amount of delay.

Moreover, there are some necessary preliminaries for our analysis. First, because of the effect of delays, there may exist some gradients that arrive after round T. Although our Algorithms 1 and 2 do not need to use these gradients, they are useful for the analysis. As a result, in our analysis, we further set xt = yτ and perform steps 5 to 10 of Algorithms 1 and 2 for any t = T +1, . . . , T +d 1. In this way, all gradients g1, g2, . . . , g T queried by our algorithms are utilized, which produces decisions y1, y2, . . . , y T +1.

Then, let τt = 1 + Pt 1 i=1 |Fi| for any t [T + d]. It is not hard to verify that our Algorithms 1 and 2 ensure that xt = yτt (6) for any t [T + d 1]. Next, we deﬁne

It = , if |Ft| = 0, {τt, τt + 1, . . . , τt+1 1}, otherwise. (7)

Let s = min {t|t [T + d 1], |Ft| > 0}. It is not hard to verify that

T +d 1 t=s It = [T], Ii Ij = , i = j, (8)

T +d 1 t=s Ft = [T], Fi Fj = , i = j. (9) To facilitate the analysis, we further denote the time-stamp of the τ-th gradient used in the update of our Algorithms 1 and 2 by cτ. To help understanding, one can imagine that our Algorithms 1 and 2 also implement cτ = k in their step 7. If |Ft| = 0, we have {cτt, . . . , cτt+1 1} = Ft. (10)

By using this notation, Fτ(y) deﬁned in Algorithms 1 and 2 are respectively equivalent to

i=1 gci, y + y y1 2 2, (11)

i=1 gci, y +

2 y yi 2 2. (12)

4.2 Proof of Theorem 1

Let t = t + dt 1 for any t [T]. According to the convexity of ft(x), we have

ft(xt) ft(x ) gt, xt x = gt, xt x + gt, xt xt gt, xt x + G xt xt 2 = gt, yτt x + G yτt yτt 2 where the last equality is due to (6).

Let A = PT t=1 G yτt yτt 2. By summing over t = 1, . . . , T, we have

t=1 gt, xt x

t=1 gt, yτt x + A. (13)

Then, we bound the ﬁrst term in the right side of (13) as follows

t=1 gt, yτt x =

i Ft gi, yτi+di 1 x

i Ft gi, yτt x =

i=τt gci, yτt x

i=τt ( gci, yi x + gci, yτt yi )

t=1 gct, yt x +

i=τt gci, yτt yi

where the ﬁrst equality is due to (9), the second equality is due to i + di 1 = t for any i Ft, the third equality is due to (10), and the last equality is due to (7) and (8).

Then, let B = PT t=1 gct, yt x and C = PT +d 1 t=s Pτt+1 1 i=τt G yτt yi 2. By combining (13), (14), and gci, yτt yi G yτt yi 2, we have

t=1 ft(x ) A + B + C. (15)

Next, we proceed to bound terms A, B, and C. Speciﬁcally, we ﬁrst establish the following bound for the sum of terms A and C by carefully analyzing the distance yτt yτt 2 in the term A and the distance yτt yi 2 in the term C.

Lemma 1 Let y t = argminy K Ft 1(y) for any t [T + 1], where Ft(y) is deﬁned in (11). Suppose Assumption 1 and 2 hold, and there exist some constants γ > 0 and 0 < α 1 such that Ft 1(yt) Ft 1(y t ) γ(t + 2) α for any t [T + 1]. Algorithm 1 ensures

A + C 3d GD + 4Gd γ + 8G γ

2 α T 1 α/2 + 3ηG2d T

where A = PT t=1 G yτt yτt 2 and C = PT +d 1 t=s Pτt+1 1 i=τt G yτt yi 2.

Note that Lemma 1 introduces an assumption about yt and Ft 1(y). According to our Algorithm 1, yt is actually generated by approximately minimizing Ft 1(y) with a linear optimization step. Therefore, by following the analysis of the original OFW [Hazan, 2016], we show that this assumption can be satisﬁed with γ = 8D2 and α = 1/2.

Lemma 2 Let y t = argminy K Ft 1(y) for any t [T +1], where Ft(y) is deﬁned in (11). Under Assumptions 1 and 2, for any t [T + 1], Algorithm 1 with η = D

2G(T +2)3/4 has

Ft 1(yt) Ft 1(y t ) 8D2 t + 2.

Then, by combining η = D

2G(T +2)3/4 with Lemmas 1 and 2, we have

A + C (3 + 8

2GD 3 T 3/4 + 3GDd T 1/4

2 = O(T 3/4 + d T 1/4). (16)

Furthermore, by following the analysis of the original OFW [Hazan, 2016], we establish an upper bound for the term B.

Lemma 3 Under Assumptions 1 and 2, for any x K, Algorithm 1 with η = D

2G(T +2)3/4 ensures

t=1 gct, yt x 11

2GD(T + 2)3/4

3 + GDT 1/4

Finally, by combining (15), (16), and Lemma 3, we complete this proof.

4.3 Proof of Theorem 2

Since ft(x) is β-strongly convex, we have

t=1 gt, xt x

2 xt x 2 2. (17)

Then, we note that the ﬁrst term in the right side of (17) can be bounded by reusing (13) and (14). Speciﬁcally, we have

t=1 ft(x ) A + C +

t=1 gct, yt x

2 xt x 2 2 (18)

where terms A and C are deﬁned in the proof of Theorem 1.

Next, we consider the last term in the right side of (18). For any yt, xt, x K, we have

yt x 2 2 = yt xt 2 2 + xt x 2 2 + 2 yt xt, xt x

3D yt xt 2 + xt x 2 2 where the last inequality is due to 2 yt xt, xt x 2 yt xt 2 xt x 2 and Assumption 2.

Let B = PT t=1 gct, yt x PT t=1 β 2 yt x 2 2 and E = PT t=1 3βD

2 yt yτt 2. By combining the above inequality and (6) with (18), we have

t=1 [ft(xt) ft(x )] A + C + B + E. (19)

Then, we proceed to establish upper bounds for terms A, C, and E by carefully analyzing the distance yτt yτt 2 in the term A, the distance yτt yi 2 in the term C, and the distance yt yτt 2 in the term E.

Lemma 4 Let y t = argminy K Ft 1(y) for any t = 2, . . . , T + 1, where Ft(y) is deﬁned in (12). Suppose Assumption 1 and 2 hold, all losses are β-strongly convex, and there exist some constants γ > 0 and 0 α < 1 such that Ft 1(yt) Ft 1(y t ) γ(t 1)α for any t = 2, . . . , T + 1. Algorithm 1 ensures

2βγ + 6D 2βγ

1 + α T (1+α)/2 + 3βd D2 + 3D(G + βD)d ln T,

A + C 3d GD + 4G(G + βD)d(1 + ln T)

β + 4d G r2γ

β 8G 1 + αT (1+α)/2,

where A = PT t=1 G yτt yτt 2, C = PT +d 1 t=s Pτt+1 1 i=τt G yτt yi 2, and E = PT t=1 3βD

2 yt yτt 2.

Note that Lemma 4 also introduces an assumption about yt and Ft 1(y). According to our Algorithm 2, yt is actually generated by approximately minimizing Ft 1(y) with a linear optimization step. Therefore, by following the analysis of OFW for strongly convex losses [Wan and Zhang, 2021], we show that this assumption is satisﬁed with γ = 16(G + 2βD)2/β and α = 1/3.

Lemma 5 Let y t = argminy K Ft 1(y) for any t = 2, . . . , T + 1, where Ft(y) is deﬁned in (12). Suppose Assumption 1 and 2 hold, and all losses are β-strongly convex. For any t = 2, . . . , T + 1, Algorithm 2 has

Ft 1(yt) Ft 1(y t ) 16(G + 2βD)2(t 1)1/3

By combining Lemmas 4 and 5, we have

A + C + E = O T (1+α)/2 + d log T = O(T 2/3 + d log T). (20)

Furthermore, by following the analysis of OFW for strongly convex losses [Wan and Zhang, 2021], we establish an upper bound for the third term in (19).

Lemma 6 Suppose Assumption 1 and 2 hold, and all losses are β-strongly convex. For any x K, Algorithm 2 ensures

2(G + 2βD)2T 2/3

β + 2(G + 2βD)2 ln T

β + (G + βD)D

where B = PT t=1 gct, yt x PT t=1 β 2 yt x 2 2.

Finally, by combining (19), (20), and Lemma 6, we complete this proof.

5 Experiments

In this section, we conduct simulation experiments to verify the performance of our delayed OFW for convex losses and strongly convex losses. All algorithms are implemented wtih Matlab R2017a and tested on a laptop with 2.4GHz CPU and 16GB memory. The code is available from https: //github.com/yuanyuwan/Neur IPS22.

Speciﬁcally, we consider a delayed OCO problem with T = 10000 total rounds over the decision set K = x R1000 x 1 200 , which satisﬁes Assumption 2 with D = 400 due to x y 2 x 2 + y 2 x 1 + y 1 400 for any x, y K. Inspired by Li et al. [2019], in each round t [T], the adversary chooses the following loss function

ft(x) = x 2 + bt, x

where each element of bt R1000 is independently and uniformly sampled from [ 1, 1]. For any x K, since ft(x) = 2x + bt, it is easy to verify that ft(x) 2 2 x 2 + bt 2 2 x 1 + bt 2 400 +

1000, which implies that this loss function satisﬁes Assumption 1 with G = 400 +

1000. We also notice that this loss function satisﬁes the deﬁnition of β-strongly convex function (i.e., Deﬁnition 1) with β = 2. Moreover, different values of the maximum delay d in the set {1, 51, 101, . . . , 501} are tried in our experiments. For each speciﬁc d, to simulate arbitrary delays, dt is independently and uniformly sampled from the set { 0.1d , 0.1d + 1, 0.1d + 2, . . . , d}.

We compare our proposed algorithms against the combination of the technique in Joulani et al. [2013] and OFW. Our delayed OFW for convex losses and strongly convex losses are denoted as DOFW (i.e., Algorithm 1) and DOFWsc (i.e., Algorithm 2), respectively. Similarly, since the technique in Joulani et al. [2013] is named as black-box online learning under delayed feedback (BOLD), we denote the combination of BOLD with OFW for convex losses and strongly convex losses as BOLD-OFW and BOLD-OFWsc, respectively. The parameters of all algorithms are set as what their corresponding theories suggest. Speciﬁcally, according to our Theorems 1 and 2, we set η = D

2G(T +2)3/4 for our DOFW and set β = 2 for our DOFWsc. According to Joulani et al. [2013], BOLD-OFW

0 100 200 300 400 500 0

(a) Different d

0 2000 4000 6000 8000 10000 0

(b) d = 501

Figure 1: Comparisons of our DOFW and DOFWsc against BOLD-OFW and BOLD-OFWsc.

and BOLD-OFWsc will maintain several instances of OFW for convex and strongly convex losses, respectively. Note that in the non-delayed case with d = 1 our delayed OFW actually reduces to the original OFW. Therefore, our Theorems 1 and 2 can also be utilized to choose the parameters for each instance of OFW maintained in BOLD-OFW and BOLD-OFWsc. To be precise, in BOLD-OFW, we set η = D

2G(T/d+2)3/4 for each instance of OFW for convex losses, since the total rounds of each instance is roughly T/d. In BOLD-OFWsc, we only need to set β = 2 for each instance of OFW for strongly convex losses. Moreover, for our delayed OFW and each instance of OFW, the initial decision is set to 1/50, where 1 denotes the all-ones vector.

Fig. 1(a) shows the total loss of T rounds for each algorithm under different values of the maximum delay d. First, when d = 1, the total loss of our DOFW is the same as that of BOLD-OFW and the total loss of our DOFWsc is the same as that of BOLD-OFWsc, which is reasonable because in this case DOFW and BOLD-OFW reduce to the original OFW for convex losses, and DOFWsc and BOLDOFWsc reduce to the original OFW for strongly convex losses. Second, for d = 51, 101, 151, . . . , 501, our DOFW and DOFWsc are better than BOLD-OFW and BOLD-OFWsc respectively, which clearly veriﬁes the advantage of our algorithms in the delayed setting. It is worthy to notice that d = 501 is larger than T 2/3. Moreover, for our DOFW and DOFWsc, when d increases from 1 to 501, the growth of the total loss is very slow, which is consistent with the dependence of our regret bounds on d. Fig. 1(b) shows the cumulative loss for each algorithm when d = 501. As the number of iterations increases, the cumulative loss of BOLD-OFW and BOLD-OFWsc increase much faster than that of our algorithms.

6 Conclusion and future work

In this paper, we propose delayed OFW for OCO with arbitrary delays. For convex losses, we show that it attains an O(T 3/4 +d T 1/4) regret bound, which matches the O(T 3/4) regret bound of OFW in the non-delayed setting, as long as d does not exceed O(

T). When losses are strongly convex, we further prove that it can attain an O(T 2/3 + d log T) regret bound, which matches the O(T 2/3) regret bound of OFW in the non-delayed setting, as long as d does not exceed O(T 2/3/ log T). Simulation experiments demonstrate the performance of delayed OFW in the delayed setting.

This paper only extends the classical OFW to the delayed setting. In the future, we will investigate how to develop delayed variants for other projection-free online algorithms.

Acknowledgments

This work was partially supported by NSFC (61921006, 62122037), and Jiangsu SF (BK20200064).

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Lin Chen, Mingrui Zhang, and Amin Karbasi. Projection-free bandit convex optimization. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics, pages 2047 2056, 2019.

Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In Proceedings of the 38th International Conference on Machine Learning, pages 3363 3373, 2021.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1 2):95 110, 1956.

Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493 1528, 2016.

Dan Garber and Ben Kretzu. Improved regret bounds for projection-free bandit convex optimization. In Proceedings of the 23rd International Conference on Artiﬁcial Intelligence and Statistics, pages 2196 2206, 2020.

Dan Garber and Ben Kretzu. Revisiting projection-free online learning: the strongly convex case. In Proceedings of the 24th International Conference on Artiﬁcial Intelligence and Statistics, pages 3592 3600, 2021.

Dan Garber and Ben Kretzu. New projection-free algorithms for online convex optimization with adaptive regret guarantees. In Proceedings of 35th Conference on Learning Theory, pages 2326 2359, 2022.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2 (3 4):157 325, 2016.

Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, pages 1843 1850, 2012.

Elad Hazan and Edgar Minasyan. Faster projection-free online learning. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 1877 1893, 2020.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007.

Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427 435, 2013.

Pooria Joulani, András György, and Csaba Szepesvári. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pages 1453 1461, 2013.

Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Uniﬁed analysis and adaptive-gradient algorithms. Proceedings of the 30th AAAI Conference on Artiﬁcial Intelligence, pages 1744 1750, 2016.

Adam Kalai and Santosh Vempala. Efﬁcient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005.

John Langford, Alexander J. Smola, and Martin Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems 22, pages 2331 2339, 2009.

Kﬁr Y. Levy and Andreas Krause. Projection free online learning over smooth sets. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics, pages 1458 1466, 2019.

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis. Bandit online learning with unknown delays. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics, pages 993 1002, 2019.

Zakaria Mhammedi. Efﬁcient projection-free online convex optimization with membership oracle. In Proceedings of 35th Conference on Learning Theory, pages 5314 5390, 2022.

Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems 28, pages 1270 1278, 2015.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2011.

Yuanyu Wan and Lijun Zhang. Projection-free online learning over strongly convex sets. In Proceedings of the 35th AAAI Conference on Artiﬁcial Intelligence, pages 10076 10084, 2021.

Yuanyu Wan, Wei-Wei Tu, and Lijun Zhang. Projection-free distributed online convex optimization with O(

T) communication complexity. In Proceedings of the 37th International Conference on Machine Learning, pages 9818 9828, 2020.

Yuanyu Wan, Wei-Wei Tu, and Lijun Zhang. Online strongly convex optimization with unknown delays. Machine Learning, 111(3):871 893, 2022a.

Yuanyu Wan, Guanghui Wang, Wei-Wei Tu, and Lijun Zhang. Projection-free distributed online learning with sublinear communication complexity. Journal of Machine Learning Research, 23 (172):1 53, 2022b.

Marcelo J. Weinberger and Erik Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959 1976, 2002.

Wenpeng Zhang, Peilin Zhao, Wenwu Zhu, Steven C. H. Hoi, and Tong Zhang. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4054 4062, 2017.

Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928 936, 2003.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reﬂect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes]

(c) Did you discuss any potential negative societal impacts of your work? [N/A] This work is mostly theoretical and the societal impacts discussion is not applicable. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A]

(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identiﬁable information or offensive content? [N/A] Only synthetic data are used in this work, which do not contain personally identiﬁable information and offensive content. 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]