# Projection-free Online Learning over Strongly Convex Sets

Yuanyu Wan¹, Lijun Zhang¹,²
¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
²Pazhou Lab, Guangzhou 510330, China
{wanyy, zhanglj}@lamda.nju.edu.cn
Lijun Zhang is the corresponding author.

## Abstract

To efficiently solve online problems with complicated constraints, projection-free algorithms, including online Frank-Wolfe (OFW) and its variants, have received significant interest recently. However, in the general case, existing efficient projection-free algorithms only achieve a regret bound of $O(T^{3/4})$, which is worse than the regret of projection-based algorithms, where $T$ is the number of decision rounds. In this paper, we study the special case of online learning over strongly convex sets, for which we first prove that OFW can enjoy a better regret bound of $O(T^{2/3})$ for general convex losses. The key idea is to refine the decaying step-size in the original OFW by a simple line search rule. Furthermore, for strongly convex losses, we propose a strongly convex variant of OFW by redefining the surrogate loss function in OFW. We show that it achieves a regret bound of $O(T^{2/3})$ over general convex sets and a better regret bound of $O(\sqrt{T})$ over strongly convex sets.

## Introduction

Online convex optimization (OCO) is a powerful framework that has been used to model and solve problems from diverse domains such as online routing (Awerbuch and Kleinberg 2004, 2008) and online portfolio selection (Blum and Kalai 1999; Agarwal et al. 2006; Luo, Wei, and Zheng 2018). In OCO, an online player plays a repeated game of $T$ rounds against an adversary. At each round $t$, the player chooses a decision $x_t$ from a convex set $\mathcal{K}$. After that, a convex function $f_t(x): \mathcal{K} \to \mathbb{R}$ chosen by the adversary is revealed, and incurs a loss $f_t(x_t)$ to the player. The goal of the player is to choose decisions so that the regret, defined as
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x),$$
is minimized. Over the past decades, various algorithms such as online gradient descent (OGD) (Zinkevich 2003), online Newton step (Hazan, Agarwal, and Kale 2007) and follow-the-regularized-leader (Shalev-Shwartz 2007; Shalev-Shwartz and Singer 2007) have been proposed to yield optimal regret bounds under different scenarios.

To ensure the feasibility of each decision, a common way in these algorithms is to apply a projection operation to any infeasible decision. However, in many high-dimensional problems with complicated decision sets, the projection step becomes a computational bottleneck (Zhang et al. 2013; Chen et al. 2016; Yang, Lin, and Zhang 2017). For example, in semi-definite programs (Hazan 2008) and multiclass classification (Hazan and Luo 2016), it amounts to computing the singular value decomposition (SVD) of a matrix when the decision set consists of all matrices with bounded trace norm. To tackle this computational bottleneck, Hazan and Kale (2012) proposed a projection-free algorithm called online Frank-Wolfe (OFW) that replaces the time-consuming projection step with a more efficient linear optimization step. Taking matrix completion as an example again, the linear optimization step can be carried out by computing the top singular vector pair of a matrix, which is significantly faster than computing the SVD.
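To make the computational contrast concrete, here is a minimal sketch (ours, not part of the paper; the function names and the use of NumPy/SciPy are assumptions) of the two oracles over a trace norm ball of radius $r$: the linear optimization step only needs the top singular vector pair of the gradient matrix, while a Euclidean projection requires a full SVD followed by a projection of the singular values.

```python
import numpy as np
from scipy.sparse.linalg import svds


def linear_opt_trace_ball(grad, r):
    """Linear optimization step: argmin_{||X||_* <= r} <grad, X>.

    Only the top singular vector pair of `grad` is needed, so iterative
    solvers (a few power iterations) suffice.
    """
    u, _, vt = svds(grad, k=1)            # top singular vectors of grad
    return -r * np.outer(u[:, 0], vt[0])  # extreme point of the trace norm ball


def project_trace_ball(x, r):
    """Euclidean projection onto {X : ||X||_* <= r}: needs a full SVD."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    if s.sum() <= r:
        return x
    # project the singular values onto the simplex {s >= 0, sum(s) = r}
    mu = np.sort(s)[::-1]
    cum = np.cumsum(mu) - r
    rho = np.nonzero(mu - cum / np.arange(1, len(mu) + 1) > 0)[0][-1]
    theta = cum[rho] / (rho + 1.0)
    return (u * np.maximum(s - theta, 0.0)) @ vt
```

For a $d \times d$ matrix, the full SVD costs roughly $O(d^3)$ time, whereas the top singular vector pair can be approximated in time roughly proportional to the number of nonzero entries, which is the source of the efficiency gap discussed above.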
Although OFW can efficiently handle complicated decision sets, it only attains a regret bound of $O(T^{3/4})$ for the general OCO, which is worse than the optimal $O(\sqrt{T})$ regret bound achieved by projection-based algorithms. More specifically, OFW is an online extension of an offline algorithm called Frank-Wolfe (FW) (Frank and Wolfe 1956; Jaggi 2013) that iteratively performs linear optimization steps to minimize a convex and smooth function. In each round, OFW updates the decision by utilizing a single step of FW to minimize a surrogate loss function, which implies that the approximation error caused by the single step of FW could be the main reason for the regret gap between projection-based algorithms and OFW. Recently, Garber and Hazan (2015) made a quadratic improvement in the convergence rate of FW for strongly convex and smooth offline optimization over strongly convex sets compared to the general case. They used a simple line search rule to choose the step-size of FW, which allows FW to converge faster even if the strong convexity of the decision set is unknown. It is therefore natural to ask whether the faster convergence of FW can be utilized to improve the regret of OFW.

In this paper, we give an affirmative answer by improving OFW to achieve a regret bound of $O(T^{2/3})$ over strongly convex sets, which is better than the original $O(T^{3/4})$ regret bound. Inspired by Garber and Hazan (2015), the key idea is to refine the decaying step-size in the original OFW by a simple line search rule. Furthermore, we study OCO with strongly convex losses, and propose a strongly convex variant of OFW (SC-OFW). Different from OFW, we introduce a new surrogate loss function from Garber and Hazan (2016) to utilize the strong convexity of losses, which has been used to achieve an $O(\log T)$ regret bound for strongly convex OCO over polyhedral sets. Theoretical analysis reveals that SC-OFW for strongly convex OCO attains a regret bound of $O(T^{2/3})$ over general convex sets¹ and a better regret bound of $O(\sqrt{T})$ over strongly convex sets.

¹ While this paper was under review, we noticed that a concurrent work (Garber and Kretzu 2020b) also proposed an algorithm similar to SC-OFW and established the $O(T^{2/3})$ regret bound.

| Algorithm | Extra Condition on Loss | Extra Condition on $\mathcal{K}$ | Regret Bound |
|---|---|---|---|
| OFW | | | $O(T^{3/4})$ |
| LLOO-OCO | | polyhedral | $O(\sqrt{T})$ |
| LLOO-OCO | strongly convex | polyhedral | $O(\log T)$ |
| Fast OGD | | smooth | $O(\sqrt{T})$ |
| Fast OGD | strongly convex | smooth | $O(\log T)$ |
| OSPF | smooth | | $O(T^{2/3})$ |
| OFW with Line Search (this work) | | strongly convex | $O(T^{2/3})$ |
| SC-OFW (this work) | strongly convex | | $O(T^{2/3})$ |
| SC-OFW (this work) | strongly convex | strongly convex | $O(\sqrt{T})$ |

Table 1: Comparisons of regret bounds for efficient projection-free online algorithms including OFW (Hazan and Kale 2012; Hazan 2016), LLOO-OCO (Garber and Hazan 2016), Fast OGD (Levy and Krause 2019), OSPF (Hazan and Minasyan 2020) and our algorithms.

## Related Work

In this section, we briefly review the related work about efficient projection-free algorithms for OCO, the time complexity of which is a constant in each round.

OFW (Hazan and Kale 2012; Hazan 2016) is the first projection-free algorithm for OCO, which attains a regret bound of $O(T^{3/4})$ by only performing one linear optimization step at each round. Recently, some studies have proposed projection-free online algorithms which are as efficient as OFW and attain better regret bounds for special cases of OCO. If the decision set is a polytope, Garber and Hazan (2016) proposed a variant of OFW named LLOO-based online convex optimization (LLOO-OCO), which enjoys an $O(\sqrt{T})$ regret bound for convex losses and an $O(\log T)$ regret bound for strongly convex losses.
For OCO over smooth sets, Levy and Krause (2019) proposed a projection-free variant of OGD via devising a fast approximate projection for such sets, and established $O(\sqrt{T})$ and $O(\log T)$ regret bounds for convex and strongly convex losses, respectively. Besides these improvements for OCO over special decision sets, Hazan and Minasyan (2020) proposed the online smooth projection free algorithm (OSPF) for OCO with smooth losses, which is a randomized method and achieves an expected regret bound of $O(T^{2/3})$.

Furthermore, OFW has been extended to two practical scenarios. The first scenario is the bandit setting (Flaxman, Kalai, and McMahan 2005; Bubeck et al. 2015), where only the loss value is available to the player. Chen, Zhang, and Karbasi (2019) proposed the first bandit variant of OFW, and established an expected regret bound of $O(T^{4/5})$. Later, two improved bandit variants of OFW were proposed to enjoy better expected regret bounds on the order of $O(T^{3/4})$ for convex losses (Garber and Kretzu 2020a) and $\tilde{O}(T^{2/3})$ for strongly convex losses (Garber and Kretzu 2020b). The second scenario is the distributed setting (Duchi, Agarwal, and Wainwright 2011; Hosseini, Chapman, and Mesbahi 2013), where many players are distributed over a network and each player can only access its local loss function. The first projection-free algorithm for distributed OCO was proposed by Zhang et al. (2017), which requires a communication complexity of $O(T)$. Recently, Wan, Tu, and Zhang (2020) further reduced the communication complexity from $O(T)$ to $O(\sqrt{T})$.

Despite this great progress on projection-free online learning, for the general OCO, OFW is still the best known efficient projection-free algorithm and the regret bound of $O(T^{3/4})$ has remained unchanged. In this paper, we will study OCO over strongly convex sets, and improve the regret bound to $O(T^{2/3})$ and $O(\sqrt{T})$ for convex and strongly convex losses, respectively. The detailed comparisons between efficient projection-free algorithms are summarized in Table 1.

## Main Results

In this section, we first introduce necessary preliminaries including common notations, definitions and assumptions. Then, we present an improved regret bound for OFW over strongly convex sets by utilizing the line search. Finally, we introduce our SC-OFW algorithm for strongly convex OCO as well as its theoretical guarantees.

### Preliminaries

In this paper, the convex set $\mathcal{K}$ belongs to a finite-dimensional vector space $\mathbb{E}$. For any $x, y \in \mathcal{K}$, the inner product is denoted by $\langle x, y \rangle$. We recall two standard definitions for smooth and strongly convex functions (Boyd and Vandenberghe 2004), respectively.

**Definition 1** Let $f(x): \mathcal{K} \to \mathbb{R}$ be a function over $\mathcal{K}$. It is called $\beta$-smooth over $\mathcal{K}$ if for all $x, y \in \mathcal{K}$
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|_2^2.$$

**Definition 2** Let $f(x): \mathcal{K} \to \mathbb{R}$ be a function over $\mathcal{K}$. It is called $\alpha$-strongly convex over $\mathcal{K}$ if for all $x, y \in \mathcal{K}$
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|_2^2.$$

When $f(x): \mathcal{K} \to \mathbb{R}$ is an $\alpha$-strongly convex function and $x^* = \operatorname{argmin}_{x \in \mathcal{K}} f(x)$, Garber and Hazan (2015) have proved that for any $x \in \mathcal{K}$
$$\frac{\alpha}{2}\|x - x^*\|_2^2 \leq f(x) - f(x^*) \tag{1}$$
and
$$\|\nabla f(x)\|_2 \geq \sqrt{\frac{\alpha}{2}}\sqrt{f(x) - f(x^*)}. \tag{2}$$
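For completeness, here is a short derivation of (1) and (2) under the stated assumptions (it is our own sketch, not reproduced from the paper); it only uses Definition 2, the convexity of $f$, and the first-order optimality condition of $x^*$ over $\mathcal{K}$.

```latex
% (1): apply Definition 2 with y = x and x = x^*; since x^* minimizes f over K,
%      first-order optimality gives <\nabla f(x^*), x - x^*> >= 0 for all x in K:
\begin{align*}
f(x) &\ge f(x^*) + \langle \nabla f(x^*),\, x - x^* \rangle + \tfrac{\alpha}{2}\|x - x^*\|_2^2
      \ge f(x^*) + \tfrac{\alpha}{2}\|x - x^*\|_2^2.
\end{align*}
% (2): by convexity and the Cauchy-Schwarz inequality, then bounding the distance via (1):
\begin{align*}
f(x) - f(x^*) &\le \langle \nabla f(x),\, x - x^* \rangle
               \le \|\nabla f(x)\|_2\, \|x - x^*\|_2
               \le \|\nabla f(x)\|_2 \sqrt{\tfrac{2}{\alpha}\bigl(f(x) - f(x^*)\bigr)},
\end{align*}
% and dividing both sides by \sqrt{f(x) - f(x^*)} (when it is nonzero) yields (2).
```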
Then, we introduce the definition of the strongly convex set, which has been well studied in offline optimization (Levitin and Polyak 1966; Demyanov and Rubinov 1970; Dunn 1979; Garber and Hazan 2015; Rector-Brooks, Wang, and Mozafari 2019).

**Definition 3** A convex set $\mathcal{K} \subseteq \mathbb{E}$ is called $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ if for any $x, y \in \mathcal{K}$, $\gamma \in [0, 1]$ and $z \in \mathbb{E}$ such that $\|z\| = 1$, it holds that
$$\gamma x + (1 - \gamma)y + \gamma(1 - \gamma)\frac{\alpha}{2}\|x - y\|^2 z \in \mathcal{K}.$$

As shown by Garber and Hazan (2015), various balls induced by $\ell_p$ norms, Schatten norms and group norms are strongly convex, which are commonly used to constrain the decision. For example, the $\ell_p$ norm ball defined as $\mathcal{K} = \{x \in \mathbb{R}^d \mid \|x\|_p \leq r\}$ is $\frac{(p-1)d^{\frac{1}{2}-\frac{1}{p}}}{r}$-strongly convex with respect to the $\ell_2$ norm for any $p \in (1, 2]$ (Garber and Hazan 2015, Corollary 1).

We also introduce two common assumptions in OCO (Shalev-Shwartz 2011; Hazan 2016).

**Assumption 1** The diameter of the convex decision set $\mathcal{K}$ is bounded by $D$, i.e., $\|x - y\|_2 \leq D$ for any $x, y \in \mathcal{K}$.

**Assumption 2** At each round $t$, the loss function $f_t(x)$ is $G$-Lipschitz over $\mathcal{K}$, i.e., $|f_t(x) - f_t(y)| \leq G\|x - y\|_2$ for any $x, y \in \mathcal{K}$.

### OFW with Line Search

For the general OCO, OFW (Hazan and Kale 2012; Hazan 2016) arbitrarily chooses $x_1$ from $\mathcal{K}$, and then iteratively updates its decision as the following linear optimization step
$$v_t \in \operatorname*{argmin}_{x \in \mathcal{K}} \langle \nabla F_t(x_t), x \rangle, \qquad x_{t+1} = x_t + \sigma_t(v_t - x_t),$$
where
$$F_t(x) = \eta \sum_{\tau=1}^{t} \langle \nabla f_\tau(x_\tau), x \rangle + \|x - x_1\|_2^2 \tag{3}$$
is the surrogate loss function, and $\sigma_t$ and $\eta$ are two parameters. According to Hazan (2016), OFW with $\eta = O(T^{-3/4})$ and $\sigma_t = O(t^{-1/2})$ attains the regret bound of $O(T^{3/4})$. However, this decaying step-size $\sigma_t = O(t^{-1/2})$ cannot utilize the property of strongly convex sets. Inspired by a recent progress on FW over strongly convex sets (Garber and Hazan 2015), we utilize a line search rule to choose the parameter $\sigma_t$ as
$$\sigma_t = \operatorname*{argmin}_{\sigma \in [0,1]} \sigma\langle v_t - x_t, \nabla F_t(x_t) \rangle + \sigma^2\|v_t - x_t\|_2^2.$$
The detailed procedures are summarized in Algorithm 1.

**Algorithm 1** OFW with Line Search
1: Input: feasible set $\mathcal{K}$, $\eta$
2: Initialization: choose $x_1 \in \mathcal{K}$
3: for $t = 1, \dots, T$ do
4: Define $F_t(x) = \eta \sum_{\tau=1}^{t} \langle \nabla f_\tau(x_\tau), x \rangle + \|x - x_1\|_2^2$
5: $v_t \in \operatorname{argmin}_{x \in \mathcal{K}} \langle \nabla F_t(x_t), x \rangle$
6: $\sigma_t = \operatorname{argmin}_{\sigma \in [0,1]} \sigma\langle v_t - x_t, \nabla F_t(x_t) \rangle + \sigma^2\|v_t - x_t\|_2^2$
7: $x_{t+1} = x_t + \sigma_t(v_t - x_t)$
8: end for

From previous discussions, the approximation error of minimizing $F_t(x)$ by a single step of FW has a significant impact on the regret of OFW. Therefore, we first present the following lemma with respect to the approximation error for Algorithm 1 over strongly convex sets.

**Lemma 1** Let $\mathcal{K}$ be an $\alpha_K$-strongly convex set with respect to the $\ell_2$ norm. Let $x_t^* = \operatorname{argmin}_{x \in \mathcal{K}} F_{t-1}(x)$ for any $t \in [T+1]$, where $F_t(x)$ is defined in (3). Then, for any $t \in [T+1]$, Algorithm 1 with $\eta = \frac{D}{2G(T+2)^{2/3}}$ has
$$F_{t-1}(x_t) - F_{t-1}(x_t^*) \leq \epsilon_t = \frac{C}{(t+2)^{2/3}},$$
where $C = \max\left\{4D^2, \frac{4096}{3\alpha_K^2}\right\}$.

We find that the approximation error incurred by a single step of FW is upper bounded by $O(t^{-2/3})$ for Algorithm 1 over strongly convex sets. For a clear comparison, we note that the approximation error for the original OFW over general convex sets has a worse bound of $O(1/\sqrt{t})$ (Hazan 2016, Lemma 7.3).

By applying Lemma 1 and further analyzing the property of the decisions $x_1^*, \dots, x_T^*$ defined in Lemma 1, we prove that OFW with line search enjoys the following regret bound over strongly convex sets.

**Theorem 1** Let $\mathcal{K}$ be an $\alpha_K$-strongly convex set with respect to the $\ell_2$ norm and $C = \max\left\{4D^2, \frac{4096}{3\alpha_K^2}\right\}$. For any $x^* \in \mathcal{K}$, Algorithm 1 with $\eta = \frac{D}{2G(T+2)^{2/3}}$ ensures
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \leq \frac{11G\sqrt{C}}{4}(T+2)^{2/3}.$$
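Because the objective in step 6 of Algorithm 1 is a quadratic in $\sigma$, the line search has a closed form: $\sigma_t = \min\bigl\{1, \max\bigl\{0, -\langle v_t - x_t, \nabla F_t(x_t)\rangle / (2\|v_t - x_t\|_2^2)\bigr\}\bigr\}$. The sketch below is ours, not the authors' code; the linear optimization oracle `lin_opt` and the gradient callback `grad_f` are assumed interfaces, and decisions are assumed to be NumPy arrays.

```python
import numpy as np


def ofw_line_search(lin_opt, grad_f, x1, T, D, G):
    """Sketch of Algorithm 1 (OFW with line search).

    lin_opt(g)   -- linear optimization oracle: argmin_{x in K} <g, x>
    grad_f(t, x) -- gradient of the loss revealed at round t, evaluated at x
    """
    eta = D / (2.0 * G * (T + 2) ** (2.0 / 3.0))   # parameter from Lemma 1 / Theorem 1
    x = x1.copy()
    grad_sum = np.zeros_like(x1)                   # running sum of observed gradients
    decisions = []
    for t in range(1, T + 1):
        decisions.append(x.copy())
        grad_sum += grad_f(t, x)
        nabla_F = eta * grad_sum + 2.0 * (x - x1)  # gradient of F_t at x_t
        v = lin_opt(nabla_F)                       # single linear optimization step
        d = v - x
        dd = float(np.vdot(d, d))
        # closed-form minimizer of sigma*<d, nabla_F> + sigma^2 * ||d||^2 over [0, 1]
        sigma = 0.0 if dd == 0.0 else float(np.clip(-np.vdot(d, nabla_F) / (2.0 * dd), 0.0, 1.0))
        x = x + sigma * d
    return decisions
```

Since $v_t$ minimizes $\langle \nabla F_t(x_t), \cdot \rangle$ over $\mathcal{K}$, the numerator $-\langle v_t - x_t, \nabla F_t(x_t)\rangle$ is nonnegative, and $x_{t+1}$ stays feasible because it is a convex combination of $x_t$ and $v_t$.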
We find that the $O(T^{2/3})$ regret bound in Theorem 1 is better than the $O(T^{3/4})$ bound achieved by previous studies for the original OFW over general convex sets (Hazan and Kale 2012; Hazan 2016).

### SC-OFW for Strongly Convex Losses

Moreover, we propose a strongly convex variant of OFW (SC-OFW), which is inspired by Garber and Hazan (2016). To be precise, Garber and Hazan (2016) have proposed a variant of OFW for strongly convex OCO over polyhedral sets, which enjoys the $O(\log T)$ regret bound. Compared with the original OFW, their algorithm has two main differences. First, a local linear optimization step for polyhedral sets is developed to replace the linear optimization step utilized in OFW. Second, to handle $\lambda$-strongly convex losses, $F_t(x)$ is redefined as
$$F_t(x) = \sum_{\tau=1}^{t}\left(\langle \nabla f_\tau(x_\tau), x \rangle + \frac{\lambda}{2}\|x - x_\tau\|_2^2\right) + c\,(\cdots), \tag{4}$$
where $c$ is a parameter that depends on properties of polyhedral sets. Since this paper considers OCO over strongly convex sets, our SC-OFW adopts the linear optimization step utilized in the original OFW, and simplifies $F_t(x)$ in (4) as
$$F_t(x) = \sum_{\tau=1}^{t}\left(\langle \nabla f_\tau(x_\tau), x \rangle + \frac{\lambda}{2}\|x - x_\tau\|_2^2\right) \tag{5}$$
by simply setting $c = 0$. The detailed procedures are summarized in Algorithm 2, where the parameter $\sigma_t$ is selected by the line search as
$$\sigma_t = \operatorname*{argmin}_{\sigma \in [0,1]} \sigma\langle v_t - x_t, \nabla F_t(x_t) \rangle + \frac{\sigma^2\lambda t}{2}\|v_t - x_t\|_2^2.$$
To the best of our knowledge, SC-OFW is the first projection-free algorithm for strongly convex OCO over any decision set.

**Algorithm 2** Strongly Convex Variant of OFW (SC-OFW)
1: Input: feasible set $\mathcal{K}$, $\lambda$
2: Initialization: choose $x_1 \in \mathcal{K}$
3: for $t = 1, \dots, T$ do
4: Define $F_t(x) = \sum_{\tau=1}^{t}\left(\langle \nabla f_\tau(x_\tau), x \rangle + \frac{\lambda}{2}\|x - x_\tau\|_2^2\right)$
5: $v_t \in \operatorname{argmin}_{x \in \mathcal{K}} \langle \nabla F_t(x_t), x \rangle$
6: $\sigma_t = \operatorname{argmin}_{\sigma \in [0,1]} \sigma\langle v_t - x_t, \nabla F_t(x_t) \rangle + \frac{\sigma^2\lambda t}{2}\|v_t - x_t\|_2^2$
7: $x_{t+1} = x_t + \sigma_t(v_t - x_t)$
8: end for

In the following lemma, we upper bound the approximation error of minimizing the surrogate loss function for SC-OFW over strongly convex sets.

**Lemma 2** Let $\mathcal{K}$ be an $\alpha_K$-strongly convex set with respect to the $\ell_2$ norm. Let $x_t^* = \operatorname{argmin}_{x \in \mathcal{K}} F_{t-1}(x)$ for any $t = 2, \dots, T+1$, where $F_t(x)$ is defined in (5). Then, for any $t = 2, \dots, T+1$, Algorithm 2 has
$$F_{t-1}(x_t) - F_{t-1}(x_t^*) \leq C,$$
where $C = \max\left\{\frac{4(G+\lambda D)^2}{\lambda}, \cdots\right\}$ is a constant determined by $G$, $\lambda$, $D$ and $\alpha_K$.

Furthermore, based on Lemma 2, we provide the following theorem with respect to the regret bound of SC-OFW over strongly convex sets.

**Theorem 2** Let $\mathcal{K}$ be an $\alpha_K$-strongly convex set with respect to the $\ell_2$ norm. If $f_t(x)$ is $\lambda$-strongly convex for any $t \in [T]$, then for any $x^* \in \mathcal{K}$, Algorithm 2 ensures
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \leq C\sqrt{2T} + C\ln T,$$
where $C$ is defined as in Lemma 2.

From Theorem 2, we achieve a regret bound of $O(\sqrt{T})$ for strongly convex losses, which is better than the above $O(T^{2/3})$ bound for general convex losses. Moreover, we show the regret bound of Algorithm 2 over any general convex set $\mathcal{K}$ in the following theorem.

**Theorem 3** If $f_t(x)$ is $\lambda$-strongly convex for any $t \in [T]$, then for any $x^* \in \mathcal{K}$, Algorithm 2 ensures
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \leq 3CT^{2/3},$$
where $C = \frac{16(G+\lambda D)^2}{\lambda}$.

By comparing Theorem 3 with Theorem 2, we can find that our Algorithm 2 enjoys a better regret bound when the decision set is strongly convex.
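As with Algorithm 1, the line search in step 6 of Algorithm 2 has a closed form because the objective is a quadratic in $\sigma$. The sketch below is ours, not the authors' code; the oracle and gradient interfaces are assumptions, and the running sums implement $\nabla F_t(x_t)$ for the surrogate loss in (5).

```python
import numpy as np


def sc_ofw(lin_opt, grad_f, x1, T, lam):
    """Sketch of Algorithm 2 (SC-OFW) for lambda-strongly convex losses.

    lin_opt(g)   -- linear optimization oracle: argmin_{x in K} <g, x>
    grad_f(t, x) -- gradient of the loss revealed at round t, evaluated at x
    """
    x = x1.copy()
    grad_sum = np.zeros_like(x1)   # sum of nabla f_tau(x_tau)
    x_sum = np.zeros_like(x1)      # sum of past decisions x_tau
    decisions = []
    for t in range(1, T + 1):
        decisions.append(x.copy())
        grad_sum += grad_f(t, x)
        x_sum += x
        # F_t(x) = sum_tau (<grad_tau, x> + (lam/2)||x - x_tau||^2), so
        # nabla F_t(x_t) = grad_sum + lam * (t * x_t - x_sum)
        nabla_F = grad_sum + lam * (t * x - x_sum)
        v = lin_opt(nabla_F)
        d = v - x
        dd = float(np.vdot(d, d))
        # minimize sigma*<d, nabla_F> + (sigma^2 * lam * t / 2) * ||d||^2 over [0, 1]
        sigma = 0.0 if dd == 0.0 else float(np.clip(-np.vdot(d, nabla_F) / (lam * t * dd), 0.0, 1.0))
        x = x + sigma * d
    return decisions
```

Again, $\langle v_t - x_t, \nabla F_t(x_t)\rangle \leq 0$ by the choice of $v_t$, so the clipped step-size is well-defined and the update remains a convex combination of feasible points.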
## Theoretical Analysis

In this section, we prove Theorems 1 and 2. The omitted proofs can be found in the full version (Wan and Zhang 2020).

### Proof of Theorem 1

In the beginning, we define $x_t^* = \operatorname{argmin}_{x \in \mathcal{K}} F_{t-1}(x)$ for any $t = 1, \dots, T+1$, where $F_t(x)$ is defined in (3). Since each function $f_t(x)$ is convex, we have
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \leq \sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x^* \rangle = \underbrace{\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t - x_t^* \rangle}_{A} + \underbrace{\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t^* - x^* \rangle}_{B}. \tag{6}$$
So, we will upper bound the regret by analyzing $A$ and $B$.

By applying Lemma 1, we have
$$A \leq \sum_{t=1}^{T}\|\nabla f_t(x_t)\|_2\|x_t - x_t^*\|_2 \leq G\sum_{t=1}^{T}\sqrt{F_{t-1}(x_t) - F_{t-1}(x_t^*)} \leq G\sum_{t=1}^{T}\frac{\sqrt{C}}{(t+2)^{1/3}} \leq \frac{3G\sqrt{C}(T+2)^{2/3}}{2}, \tag{7}$$
where the second inequality is due to (1) and the fact that $F_{t-1}(x)$ is $2$-strongly convex, and the last inequality is due to $\sum_{t=1}^{T}(t+2)^{-1/3} \leq 3(T+2)^{2/3}/2$.

To bound $B$, we introduce the following lemma.

**Lemma 3** (Lemma 6.6 of Garber and Hazan (2016)) Let $\{f_t(x)\}_{t=1}^{T}$ be a sequence of loss functions and let $x_t^* \in \operatorname{argmin}_{x \in \mathcal{K}}\sum_{\tau=1}^{t} f_\tau(x)$ for any $t \in [T]$. Then, it holds that
$$\sum_{t=1}^{T} f_t(x_t^*) - \min_{x \in \mathcal{K}}\sum_{t=1}^{T} f_t(x) \leq 0.$$

To apply Lemma 3, we define $\tilde{f}_1(x) = \eta\langle \nabla f_1(x_1), x \rangle + \|x - x_1\|_2^2$ for $t = 1$ and $\tilde{f}_t(x) = \eta\langle \nabla f_t(x_t), x \rangle$ for any $t \geq 2$. We recall that $F_t(x) = \sum_{\tau=1}^{t}\tilde{f}_\tau(x)$ and $x_{t+1}^* = \operatorname{argmin}_{x \in \mathcal{K}} F_t(x)$ for any $t = 1, \dots, T$. Then, by applying Lemma 3 to $\{\tilde{f}_t(x)\}_{t=1}^{T}$, we have
$$\sum_{t=1}^{T}\tilde{f}_t(x_{t+1}^*) - \sum_{t=1}^{T}\tilde{f}_t(x^*) \leq 0,$$
which implies that
$$\sum_{t=1}^{T}\langle \nabla f_t(x_t), x_{t+1}^* - x^* \rangle \leq \frac{\|x^* - x_1\|_2^2 - \|x_2^* - x_1\|_2^2}{\eta} \leq \frac{D^2}{\eta}, \tag{8}$$
where the last inequality is due to $\|x_2^* - x_1\|_2^2 \geq 0$ and Assumption 1.

Moreover, by combining the fact that $F_t(x)$ is $2$-strongly convex with (1), we have
$$\|x_t^* - x_{t+1}^*\|_2^2 \leq F_t(x_t^*) - F_t(x_{t+1}^*) = F_{t-1}(x_t^*) - F_{t-1}(x_{t+1}^*) + \eta\langle \nabla f_t(x_t), x_t^* - x_{t+1}^* \rangle \leq \eta\|\nabla f_t(x_t)\|_2\|x_t^* - x_{t+1}^*\|_2,$$
which implies that
$$\|x_t^* - x_{t+1}^*\|_2 \leq \eta\|\nabla f_t(x_t)\|_2 \leq \eta G. \tag{9}$$
By combining (8), (9) and $\eta = \frac{D}{2G(T+2)^{2/3}}$, we have
$$B = \sum_{t=1}^{T}\langle \nabla f_t(x_t), x_{t+1}^* - x^* \rangle + \sum_{t=1}^{T}\langle \nabla f_t(x_t), x_t^* - x_{t+1}^* \rangle \leq \frac{D^2}{\eta} + \sum_{t=1}^{T}\|\nabla f_t(x_t)\|_2\|x_t^* - x_{t+1}^*\|_2 \leq 2DG(T+2)^{2/3} + \frac{DG(T+2)^{1/3}}{2} \leq G\sqrt{C}(T+2)^{2/3} + \frac{G\sqrt{C}(T+2)^{2/3}}{4}, \tag{10}$$
where the last inequality is due to $D \leq \sqrt{C}/2$ and $(T+2)^{1/3} \leq (T+2)^{2/3}$ for any $T \geq 1$.

By combining (7) and (10), we complete the proof.

### Proof of Theorem 2

Let $\tilde{f}_t(x) = \langle \nabla f_t(x_t), x \rangle + \frac{\lambda}{2}\|x - x_t\|_2^2$ for any $t \in [T]$ and $x_t^* = \operatorname{argmin}_{x \in \mathcal{K}} F_{t-1}(x)$ for any $t = 2, \dots, T+1$. Since each function $f_t(x)$ is $\lambda$-strongly convex, we have $f_t(x_t) - f_t(x^*) \leq \langle \nabla f_t(x_t), x_t - x^* \rangle - \frac{\lambda}{2}\|x_t - x^*\|_2^2 = \tilde{f}_t(x_t) - \tilde{f}_t(x^*)$, which implies that
$$\sum_{t=1}^{T}\bigl(f_t(x_t) - f_t(x^*)\bigr) \leq \underbrace{\sum_{t=1}^{T}\bigl(\tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*)\bigr)}_{A} + \underbrace{\sum_{t=1}^{T}\bigl(\tilde{f}_t(x_{t+1}^*) - \tilde{f}_t(x^*)\bigr)}_{B}.$$
So, we will derive a regret bound by analyzing $A$ and $B$. To bound $A$, we introduce the following lemma.

**Lemma 4** (Lemma 6.7 of Garber and Hazan (2016)) For any $t \in [T]$, the function $\tilde{f}_t(x) = \langle \nabla f_t(x_t), x \rangle + \frac{\lambda}{2}\|x - x_t\|_2^2$ is $(G + \lambda D)$-Lipschitz over $\mathcal{K}$.

By applying Lemma 4, for any $t = 3, \dots, T+1$, we have
$$F_{t-1}(x_{t-1}^*) - F_{t-1}(x_t^*) = F_{t-2}(x_{t-1}^*) - F_{t-2}(x_t^*) + \tilde{f}_{t-1}(x_{t-1}^*) - \tilde{f}_{t-1}(x_t^*) \leq (G + \lambda D)\|x_{t-1}^* - x_t^*\|_2.$$
Moreover, since each $F_t(x)$ is $t\lambda$-strongly convex, for any $t = 3, \dots, T+1$, we have
$$\|x_{t-1}^* - x_t^*\|_2^2 \leq \frac{2\bigl(F_{t-1}(x_{t-1}^*) - F_{t-1}(x_t^*)\bigr)}{(t-1)\lambda} \leq \frac{2(G + \lambda D)\|x_{t-1}^* - x_t^*\|_2}{(t-1)\lambda}.$$
Therefore, for any $t = 3, \dots, T+1$, we have
$$\|x_{t-1}^* - x_t^*\|_2 \leq \frac{2(G + \lambda D)}{(t-1)\lambda}. \tag{11}$$
By applying Lemmas 2 and 4, we have
$$\sum_{t=2}^{T}\bigl(\tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*)\bigr) \leq \sum_{t=2}^{T}(G+\lambda D)\|x_t - x_{t+1}^*\|_2 \leq (G+\lambda D)\left(\sum_{t=2}^{T}\sqrt{\frac{2C}{(t-1)\lambda}} + \sum_{t=2}^{T}\frac{2(G+\lambda D)}{t\lambda}\right) \leq 2(G+\lambda D)\sqrt{\frac{2CT}{\lambda}} + \frac{2(G+\lambda D)^2\ln T}{\lambda} \leq C\sqrt{2T} + \frac{C\ln T}{2},$$
where the second inequality is due to $\sqrt{(t-1)\lambda}\,\|x_t - x_t^*\|_2 \leq \sqrt{2\bigl(F_{t-1}(x_t) - F_{t-1}(x_t^*)\bigr)} \leq \sqrt{2C}$ for $t \geq 2$ and (11), the third inequality is due to $\sum_{t=2}^{T}(t-1)^{-1/2} \leq 2\sqrt{T}$ and $\sum_{t=2}^{T} t^{-1} \leq \ln T$, and the last inequality is due to $2(G+\lambda D) \leq \sqrt{\lambda C}$.
Moreover, due to $\|\nabla \tilde{f}_1(x_1)\|_2 = \|\nabla f_1(x_1)\|_2 \leq G$ and $\|x_1 - x_2^*\|_2 \leq D$, we have
$$A = \tilde{f}_1(x_1) - \tilde{f}_1(x_2^*) + \sum_{t=2}^{T}\bigl(\tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*)\bigr) = \langle \nabla f_1(x_1), x_1 - x_2^* \rangle - \frac{\lambda}{2}\|x_2^* - x_1\|_2^2 + \sum_{t=2}^{T}\bigl(\tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*)\bigr) \leq GD + C\sqrt{2T} + \frac{C\ln T}{2} \leq C\sqrt{2T} + C\ln T, \tag{12}$$
where the last inequality is due to $GD \leq \frac{C}{16} \leq \frac{C\ln T}{2}$ for any $T \geq 2$, and (12) can be directly verified when $T = 1$.
Moreover, we note that $F_t(x) = \sum_{\tau=1}^{t}\tilde{f}_\tau(x)$ and $x_{t+1}^* = \operatorname{argmin}_{x \in \mathcal{K}} F_t(x)$ for any $t \in [T]$. By applying Lemma 3 to $\{\tilde{f}_t(x)\}_{t=1}^{T}$, we have
$$B = \sum_{t=1}^{T}\tilde{f}_t(x_{t+1}^*) - \sum_{t=1}^{T}\tilde{f}_t(x^*) \leq 0. \tag{13}$$
By combining (12) and (13), we complete the proof.

### Proof of Lemma 1

For brevity, we define $h_t = F_{t-1}(x_t) - F_{t-1}(x_t^*)$ for $t \in [T+1]$ and $h_t(x_{t-1}) = F_{t-1}(x_{t-1}) - F_{t-1}(x_t^*)$ for $t = 2, \dots, T+1$. For $t = 1$, since $x_1 = \operatorname{argmin}_{x \in \mathcal{K}}\|x - x_1\|_2^2$, we have
$$h_1 = F_0(x_1) - F_0(x_1^*) = 0 \leq \epsilon_1. \tag{14}$$
For any $T + 1 \geq t \geq 2$, assuming $h_{t-1} \leq \epsilon_{t-1}$, we have
$$h_t(x_{t-1}) = F_{t-2}(x_{t-1}) - F_{t-2}(x_t^*) + \eta\langle \nabla f_{t-1}(x_{t-1}), x_{t-1} - x_t^* \rangle \leq F_{t-2}(x_{t-1}) - F_{t-2}(x_{t-1}^*) + \eta\|\nabla f_{t-1}(x_{t-1})\|_2\bigl(\|x_{t-1}^* - x_t^*\|_2 + \|x_{t-1} - x_{t-1}^*\|_2\bigr) \leq \epsilon_{t-1} + \eta^2 G^2 + \eta G\sqrt{\epsilon_{t-1}}, \tag{15}$$
where the first inequality is due to $x_{t-1}^* = \operatorname{argmin}_{x \in \mathcal{K}} F_{t-2}(x)$, and the last inequality is due to $\|x_{t-1} - x_{t-1}^*\|_2 \leq \sqrt{F_{t-2}(x_{t-1}) - F_{t-2}(x_{t-1}^*)} \leq \sqrt{\epsilon_{t-1}}$ and (9). Then, by substituting $\eta = \frac{D}{2G(T+2)^{2/3}}$ into (15), we have
$$h_t(x_{t-1}) \leq \epsilon_{t-1} + \frac{D^2}{4(T+2)^{4/3}} + \frac{D\sqrt{C}}{2(T+2)^{2/3}(t+1)^{1/3}} \leq \epsilon_{t-1} + \frac{\epsilon_{t-1}}{16(t+1)^{2/3}} + \frac{\epsilon_{t-1}}{4(t+1)^{1/3}} \leq \epsilon_{t-1}\left(1 + \frac{1}{2(t+1)^{1/3}}\right), \tag{16}$$
where the second inequality is due to $T \geq t - 1$ and $D \leq \sqrt{C}/2$.
Then, to bound $h_t = F_{t-1}(x_t) - F_{t-1}(x_t^*)$ by $\epsilon_t$, we further introduce the following lemma.

**Lemma 5** (Derived from Lemma 1 of Garber and Hazan (2015)) Let $f(x): \mathcal{K} \to \mathbb{R}$ be a convex and $\beta_f$-smooth function, where $\mathcal{K}$ is $\alpha_K$-strongly convex with respect to the $\ell_2$ norm. Moreover, let $x_{\mathrm{in}} \in \mathcal{K}$ and $x_{\mathrm{out}} = x_{\mathrm{in}} + \sigma^*(v - x_{\mathrm{in}})$, where $v \in \operatorname{argmin}_{x \in \mathcal{K}}\langle \nabla f(x_{\mathrm{in}}), x \rangle$ and $\sigma^* = \operatorname{argmin}_{\sigma \in [0,1]} \sigma\langle v - x_{\mathrm{in}}, \nabla f(x_{\mathrm{in}}) \rangle + \frac{\sigma^2\beta_f}{2}\|v - x_{\mathrm{in}}\|_2^2$. For any $x^* \in \operatorname{argmin}_{x \in \mathcal{K}} f(x)$, we have
$$f(x_{\mathrm{out}}) - f(x^*) \leq \bigl(f(x_{\mathrm{in}}) - f(x^*)\bigr)\max\left\{\frac{1}{2},\, 1 - \frac{\alpha_K\|\nabla f(x_{\mathrm{in}})\|_2}{8\beta_f}\right\}.$$

We note that $F_{t-1}(x)$ is $2$-smooth for any $t \in [T+1]$. By applying Lemma 5 with $f(x) = F_{t-1}(x)$ and $x_{\mathrm{in}} = x_{t-1}$, for any $t \in [T+1]$, we have $x_{\mathrm{out}} = x_t$ and
$$h_t \leq h_t(x_{t-1})\max\left\{\frac{1}{2},\, 1 - \frac{\alpha_K\|\nabla F_{t-1}(x_{t-1})\|_2}{16}\right\}. \tag{17}$$
Because of (16), (17) and $1 + \frac{1}{2(t+1)^{1/3}} \leq \frac{3}{2}$, if $\frac{1}{2} \leq \frac{\alpha_K\|\nabla F_{t-1}(x_{t-1})\|_2}{16}$, it is easy to verify that
$$h_t \leq \frac{h_t(x_{t-1})}{2} \leq \frac{3C}{4(t+1)^{2/3}} = \frac{C}{(t+2)^{2/3}}\cdot\frac{3(t+2)^{2/3}}{4(t+1)^{2/3}} \leq \frac{C}{(t+2)^{2/3}} = \epsilon_t, \tag{18}$$
where the last inequality is due to $\frac{3(t+2)^{2/3}}{4(t+1)^{2/3}} \leq 1$ for any $t \geq 2$. Then, if $\frac{1}{2} > \frac{\alpha_K\|\nabla F_{t-1}(x_{t-1})\|_2}{16}$, there exist two cases. First, if $h_t(x_{t-1}) \leq \frac{3C}{4(t+1)^{2/3}}$, it is easy to verify that
$$h_t \leq h_t(x_{t-1}) \leq \frac{3C}{4(t+1)^{2/3}} \leq \epsilon_t, \tag{19}$$
where the last inequality has been proved in (18). Second, if $h_t(x_{t-1}) > \frac{3C}{4(t+1)^{2/3}}$, we have
$$h_t \leq h_t(x_{t-1})\left(1 - \frac{\alpha_K\|\nabla F_{t-1}(x_{t-1})\|_2}{16}\right) \leq \epsilon_{t-1}\left(1 + \frac{1}{2(t+1)^{1/3}}\right)\left(1 - \frac{\alpha_K\|\nabla F_{t-1}(x_{t-1})\|_2}{16}\right) \leq \epsilon_{t-1}\left(1 + \frac{1}{2(t+1)^{1/3}}\right)\left(1 - \frac{\alpha_K\sqrt{3C}}{32(t+1)^{1/3}}\right),$$
where the second inequality is due to (16) and the third inequality is due to (2). Since $(t+2)^{2/3} \leq (t+1)^{2/3} + 1$ for any $t \geq 0$, it is easy to verify that
$$\epsilon_{t-1}\left(1 + \frac{1}{2(t+1)^{1/3}}\right) \leq \epsilon_t\left(1 + \frac{2}{(t+1)^{1/3}}\right),$$
which further implies that
$$h_t \leq \epsilon_t\left(1 + \frac{2}{(t+1)^{1/3}}\right)\left(1 - \frac{\alpha_K\sqrt{3C}}{32(t+1)^{1/3}}\right) \leq \epsilon_t\left(1 + \frac{2}{(t+1)^{1/3}}\right)\left(1 - \frac{2}{(t+1)^{1/3}}\right) \leq \epsilon_t, \tag{20}$$
where the second inequality is due to $\alpha_K\sqrt{3C} \geq 64$. By combining (14), (18), (19) and (20), we complete the proof.

## Conclusion and Future Work

In this paper, we first prove that the classical OFW algorithm with line search attains an $O(T^{2/3})$ regret bound for OCO over strongly convex sets, which is better than the $O(T^{3/4})$ regret bound for the general OCO. Furthermore, for strongly convex losses, we introduce a strongly convex variant of OFW, and prove that it achieves a regret bound of $O(T^{2/3})$ over general convex sets and a better regret bound of $O(\sqrt{T})$ over strongly convex sets. An open question is whether the regret of OFW and its strongly convex variant over strongly convex sets can be further improved if the losses are smooth. We note that Hazan and Minasyan (2020) have proposed a projection-free algorithm for OCO over general convex sets, and established an improved regret bound of $O(T^{2/3})$ by taking advantage of the smoothness.

## Acknowledgments

This work was partially supported by the NSFC (61921006), the NSFC-NRF Joint Research Project (61861146001), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

## References

Agarwal, A.; Hazan, E.; Kale, S.; and Schapire, R. E. 2006. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, 9–16.
Awerbuch, B.; and Kleinberg, R. 2008. Online linear optimization and adaptive routing. Journal of Computer and System Sciences 74(1): 97–114.
Awerbuch, B.; and Kleinberg, R. D. 2004. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 45–53.
Blum, A.; and Kalai, A. 1999. Universal portfolios with and without transaction costs. Machine Learning 35(3): 193–205.
Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Bubeck, S.; Dekel, O.; Koren, T.; and Peres, Y. 2015. Bandit convex optimization: $\sqrt{T}$ regret in one dimension. In Proceedings of the 28th Conference on Learning Theory, 266–278.
Chen, J.; Yang, T.; Lin, Q.; Zhang, L.; and Chang, Y. 2016. Optimal stochastic strongly convex optimization with a logarithmic number of projections. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 122–131.
Chen, L.; Zhang, M.; and Karbasi, A. 2019. Projection-free bandit convex optimization. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2047–2056.
Demyanov, V. F.; and Rubinov, A. M. 1970. Approximate Methods in Optimization Problems. Elsevier Publishing Company.
Duchi, J. C.; Agarwal, A.; and Wainwright, M. J. 2011. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control 57(3): 592–606.
Dunn, J. C. 1979. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17(2): 187–211.
Flaxman, A. D.; Kalai, A. T.; and McMahan, H. B. 2005. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 385–394.
Frank, M.; and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1–2): 95–110.
Garber, D.; and Hazan, E. 2015. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, 541–549.
Garber, D.; and Hazan, E. 2016. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3): 1493–1528.
Garber, D.; and Kretzu, B. 2020a. Improved regret bounds for projection-free bandit convex optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2196–2206.
Garber, D.; and Kretzu, B. 2020b. Revisiting projection-free online learning: the strongly convex case. arXiv e-prints, arXiv:2010.07572.
Hazan, E. 2008. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, 306–316.
Hazan, E. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3–4): 157–325.
Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2): 169–192.
Hazan, E.; and Kale, S. 2012. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 1843–1850.
Hazan, E.; and Luo, H. 2016. Variance-reduced and projection-free stochastic optimization. In Proceedings of the 33rd International Conference on Machine Learning, 1263–1271.
Hazan, E.; and Minasyan, E. 2020. Faster projection-free online learning. In Proceedings of the 33rd Annual Conference on Learning Theory, 1877–1893.
Hosseini, S.; Chapman, A.; and Mesbahi, M. 2013. Online distributed optimization via dual averaging. In 52nd IEEE Conference on Decision and Control, 1484–1489.
Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 427–435.
Levitin, E. S.; and Polyak, B. T. 1966. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics 6: 1–50.
Levy, K. Y.; and Krause, A. 2019. Projection free online learning over smooth sets. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 1458–1466.
Luo, H.; Wei, C.-Y.; and Zheng, K. 2018. Efficient online portfolio with logarithmic regret. In Advances in Neural Information Processing Systems 31, 8235–8245.
Rector-Brooks, J.; Wang, J.-K.; and Mozafari, B. 2019. Revisiting projection-free optimization for strongly convex constraint sets. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 1576–1583.
Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms, and Applications. Ph.D. thesis, The Hebrew University of Jerusalem.
Shalev-Shwartz, S. 2011. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2): 107–194.
Shalev-Shwartz, S.; and Singer, Y. 2007. A primal-dual perspective of online learning algorithms. Machine Learning 69(2–3): 115–142.
Wan, Y.; Tu, W.-W.; and Zhang, L. 2020. Projection-free distributed online convex optimization with $O(\sqrt{T})$ communication complexity. In Proceedings of the 37th International Conference on Machine Learning, 9818–9828.
Wan, Y.; and Zhang, L. 2020. Projection-free online learning over strongly convex sets. arXiv e-prints, arXiv:2010.08177.
Yang, T.; Lin, Q.; and Zhang, L. 2017. A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning, 3901–3910.
Zhang, L.; Yang, T.; Jin, R.; and He, X. 2013. $O(\log T)$ projections for stochastic optimization of smooth and strongly convex functions. In Proceedings of the 30th International Conference on Machine Learning, 1121–1129.
Zhang, W.; Zhao, P.; Zhu, W.; Hoi, S. C. H.; and Zhang, T. 2017. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning, 4054–4062.
Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 928–936.