# projectionfree_online_learning_in_dynamic_environments__ffac28f4.pdf

Projection-free Online Learning in Dynamic Environments

Yuanyu Wan, Bo Xue, Lijun Zhang

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China {wanyy, xueb, zhanglj}@lamda.nju.edu.cn

To efﬁciently solve high-dimensional problems with complicated constraints, projection-free online learning has received ever-increasing research interest. However, previous studies either focused on static regret that is not suitable for dynamic environments, or only established the dynamic regret bound under the smoothness of losses. In this paper, without the condition of the smoothness, we propose a novel projection-free online algorithm, and achieve an O(max{T 2/3V 1/3 T ,

T}) dynamic regret bound for convex functions and an O(max{ TVT log T, log T}) dynamic regret bound for strongly convex functions, where T is the time horizon and VT denotes the variation of loss functions. Speciﬁcally, we ﬁrst improve an existing projection-free algorithm called online conditional gradient (OCG) to enjoy small dynamic regret bounds with the prior knowledge of VT . To work with unknowable VT , we maintain multiple instances of the improved OCG that can handle different functional variations, and combine them with a meta-algorithm that can track the best one. Experimental results validate the efﬁciency and effectiveness of our algorithm.

Introduction Online convex optimization (OCO) is a powerful framework for online learning, which enjoys both computational efﬁciency and theoretical guarantees (Shalev-Shwartz 2011). According to the protocol of OCO, it is a repeated game between a learner and an adversary. In each round t, the learner ﬁrst chooses a decision xt K, where K is a convex decision set. Then, the adversary reveals a convex loss function ft(x) : K R, and the learner suffers a loss ft(xt). To measure the performance of the learner, the static regret with respect to the best ﬁxed decision is commonly used, which is deﬁned as

RS = R(T) =

t=1 ft(xt) min x K

where T is the total number of rounds. Over the past decades, various online algorithms have been put forward to minimize the static regret under differ-

Lijun Zhang is the corresponding author. Copyright c 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

ent scenarios, which can be divided into projection-based algorithms (Zinkevich 2003; Hazan, Agarwal, and Kale 2007; Shalev-Shwartz and Singer 2007) and projection-free algorithms (Hazan and Kale 2012; Hazan 2016; Garber and Hazan 2016). Speciﬁcally, projection-based algorithms such as online gradient descent (OGD) (Zinkevich 2003) and regularized follow the leader (RFTL) (Shalev-Shwartz and Singer 2007; Hazan 2016) perform one projection step in each round, which could be computationally expensive for high-dimensional problems with complicated constraints. In contrast, projection-free algorithms such as online conditional gradient (OCG) (Hazan and Kale 2012; Hazan 2016) and its variants replace the projection step with one linear optimization step, which can be carried out efﬁciently and have received ever-increasing attention (Zhang et al. 2017b; Chen et al. 2018; Chen, Zhang, and Karbasi 2019; Wan, Tu, and Zhang 2020; Wan and Zhang 2021). Although the static regret has been extensively studied for projection-based and projection-free algorithms, it is not suitable to measure the performance of the learner in dynamic environments, where the best decision may frequently change. To address this limitation, recent advances in OCO focused on dynamic regret which measures the performance of the learner against a sequence of local minimizers

RD = R(x 1, , x T ) =

t=1 ft(x t ) (1)

where x t argminx K ft(x) is a local minimizer, and proposed many projection-based algorithms for minimizing the dynamic regret (Besbes, Gur, and Zeevi 2015; Jadbabaie et al. 2015; Mokhtari et al. 2016; Yang et al. 2016; Zhang et al. 2018; Zhang, Lu, and Yang 2020; Zhang 2020). Due to the arbitrary ﬂuctuation in the loss functions, it is impossible to achieve a sub-linear dynamic regret bound unless introducing some conditions on the comparator sequence or the function sequence (Jadbabaie et al. 2015). A common condition introduced by previous studies (Besbes, Gur, and Zeevi 2015; Zhang et al. 2018) is that the functional variation deﬁned as

t=2 max x K |ft(x) ft 1(x)|

is sub-linear in T. If the value of VT is not small than 1 and given, Besbes, Gur, and Zeevi (2015) achieved an

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

O(T 2/3V 1/3 T ) dynamic regret bound for convex functions and an O(log T TVT ) dynamic regret bound for strongly convex functions by properly restarting OGD. Moreover, Zhang et al. (2018) showed that maintaining multiple OGD and combining them carefully are able to achieve almost the same dynamic regret bounds for convex and strongly convex functions without any prior knowledge of VT . However, the projection step could limit their applications. While recent studies (Kalhan et al. 2019, 2020) proposed projection-free algorithms for dynamic environments, they only established the dynamic regret bound for smooth functions. To tackle this limitation, we propose a novel projection-free algorithm named as Multi OCG+ to minimize the dynamic regret without the smoothness. Speciﬁcally, we ﬁrst develop an improved variant of OCG named as OCG+, which achieves an O(max{T 2/3V 1/3 T ,

T}) dynamic regret bound for convex functions and an O(max{ TVT log T, log T}) dynamic regret bound for strongly convex functions with the prior knowledge of VT . Compared with the original OCG, there exist three critical changes.

Besides the surrogate loss function deﬁned in OCG, our OCG+ further introduces a new surrogate loss function from Garber and Hazan (2016) to utilize the strong convexity.

The restarting strategy (Besbes, Gur, and Zeevi 2015) has been utilized in our OCG+ to handle a speciﬁc functional variation.

Different from OCG that only performs 1 linear optimization step in each round, our OCG+ performs multiple linear optimization steps.

Note that the third change is inspired by previous studies (Chen et al. 2018; Xie et al. 2020; Hazan and Minasyan 2020), which have utilized multiple linear optimization steps to improve the static regret of projection-free algorithms. Furthermore, to handle the unknown VT , our Multi-OCG+ maintains multiple instances of OCG+ with different restarting frequencies, and combines them with a meta-algorithm that can track the best one. We prove that the dynamic regret bounds of Multi-OCG+ are on the same order as those bounds of OCG+.

Related Work In this section, we brieﬂy review the related work on the static and dynamic regret in the context of OCO.

Static Regret Since the pioneering work of Zinkevich (2003), the static regret has been extensively studied under different scenarios (Hazan, Agarwal, and Kale 2007; Shalev-Shwartz 2011; Hazan 2016). For convex functions, there are several algorithms such as the classical OGD (Zinkevich 2003) and RFTL (Hazan 2016) that achieved the O(

T) static regret bound. For strongly convex functions, Hazan, Agarwal, and Kale (2007) established the O(log T) static regret bound for OGD. The O(

T) rate for convex functions and the

O(log T) rate for strongly convex functions are known to be minimax optimal (Abernethy et al. 2008). However, in each round, these algorithms need to perform one projection step, which could be computationally expensive for highdimensional problems with complicated constraints (Hazan and Kale 2012). For example, OGD needs to perform the following projection step

xt+1 = argmin x K x (xt ηt ft(xt)) 2 2

where ηt is a parameter. To tackle this computational bottleneck, OCG (Hazan and Kale 2012; Hazan 2016) was put forward, which is the ﬁrst projection-free online algorithm for OCO. The key idea of OCG is to replace the projection step with one linear optimization step, as follows

vt = argmin x K

Ft(xt) x , xt+1 = xt + σt(vt xt)

where Ft(x) = η Pt 1 i=1 fi(xi) x + x x1 2 2 is a surrogate loss function, η and σt are parameters, which can be carried out efﬁciently. However, the static regret bound of OCG is O(T 3/4), which is worse than the optimal O(

T) bound for convex functions. Recently, projection-free algorithms with optimal static regret for both convex and strongly convex functions have been proposed for special decision sets such as polytope (Garber and Hazan 2016) and smooth set (Levy and Krause 2019). Furthermore, if O(T 3/2) linear optimization steps are allowed in each round, Chen et al. (2018) have achieved the optimal static regret bound O(

T) for convex and smooth functions. In the same setting, Xie et al. (2020) reduced the number of linear optimization steps to O(T) while achieved the static regret bound of O(

T log T). Hazan and Minasyan (2020) proposed a projection-free algorithm by estimating the expected decision of follow the perturbed leader (FPL) algorithm (Hazan 2016) with enough samples, each of which is computed by one linear optimization step. If T linear optimization steps are allowed in each round, their algorithm attains a regret bound of O(

d T) for convex functions, where d is the dimensionality and the dependence on d is caused by the randomized regularization in FPL.

Dynamic Regret In the pioneering work of Zinkevich (2003), a more general deﬁnition of dynamic regret is proposed to measure the performance of the learner against any sequence of comparators

R(u1, , u T ) =

t=1 ft(ut) (2)

where u1, , u T K. To bound the general dynamic regret, Zinkevich (2003) introduced the path-length deﬁned as P(u1, , u T ) = PT t=2 ut ut 1 2 and proved that OGD enjoys a general dynamic regret bound of O(

T (1 + P(u1, , u T ))). Similarly, using a dynamic model Φt( ) to predict a reference point for the t-th round, Hall and Willett (2013) introduced a variant of the path-length deﬁned as P (u1, , u T ) =

PT t=2 ut Φt(ut 1) 2 and proposed a novel algorithm to achieve a general dynamic regret bound of O

T (1 + P (u1, , u T ))). Furthermore, Zhang, Lu, and Zhou (2018) proposed a serial of novel algorithms that reduce the general dynamic regret bounds to O( p

T(1 + P(u1, , u T ))) and O( p

T (1 + P (u1, , u T ))), respectively. Because of x t argminx K ft(x), it is easy to verify that

R(u1, , u T ) R(x 1, , x T )

which implies that the dynamic regret deﬁned in (1) can be treated as the worst case of the general one deﬁned in (2). Additionally, there also exist many studies that directly investigate the worst case under different scenarios (Besbes, Gur, and Zeevi 2015; Jadbabaie et al. 2015; Mokhtari et al. 2016; Yang et al. 2016; Zhang et al. 2017a, 2018). However, the above algorithms are based on the projection step, which could limit their applications. Recently, Kalhan et al. (2019, 2020) proposed two projectionfree algorithms termed OFW and Meta-Frank Wolfe for dynamic environments. For smooth functions, they proved that OFW and Meta-Frank Wolfe with O(T a) linear optimization steps per round have the dynamic regret bound of O(

T(1 + VT + DT )) and O(1 + VT + T 1 a + RA D) respectively, where DT = PT t=2 ft(xt) ft 1(xt 1) 2 2 and RA D denotes the dynamic regret of an online linear optimization oracle. Compared with Kalhan et al. (2019, 2020), our work has signiﬁcant differences. First, we do not require the smoothness of functions, which is not satisﬁed by some common losses such as the hinge loss and absolute loss. Second, we further utilize the strongly convexity to achieve a better regret bound, which was not not studied by Kalhan et al. (2019, 2020). Third, as long as VT is sublinear in T, our regret of O(max{T 2/3V 1/3 T ,

T}) for convex functions is sublinear in T. By contrast, the regret of OFW could be O(T) if VT = O(

T) or DT = O(T). Moreover, since RA D could be O(T) in the worst case, it is not comparable between the regret of Meta-Frank Wolfe and our regret.

Main Results In this section, we ﬁrst introduce necessary preliminaries. Then, we present our OCG+ that is an improved variant of OCG, and establish small dynamic regret bounds with the prior knowledge of VT . Finally, we present our Multi OCG+, which enjoy similar dynamic regret bounds without any prior knowledge of VT .

Preliminaries Following previous studies on OCO (Hazan and Kale 2012; Zhang, Lu, and Zhou 2018), we introduce two common assumptions, which will be used to bound the dynamic regret.

Assumption 1 The diameter of the convex decision set K is bounded by D, i.e., x y 2 D for any x K, y K.

Assumption 2 At each round t, the loss function ft(x) is G-Lipschitz over K, i.e., |ft(x) ft(y)| G x y 2 for any x K, y K.

Algorithm 1 CG

1: Input: feasible set K, K, F(x), xin 2: z0 = xin 3: for k = 0, 1, , K 1 do 4: vk argmin x K

5: σk = argmin σ [0,1] {F(zk + σ(vk zk))}

6: zk+1 = zk + σk(vk zk) 7: end for 8: return xout = z K

Then, we recall the standard deﬁnitions for smooth and strongly convex functions (Boyd and Vandenberghe 2004). Deﬁnition 1 Let f(x) : K R be a function over K. It is called β-smooth over K if for all x K, y K

f(y) f(x) + f(x) (y x) + β

Deﬁnition 2 Let f(x) : K R be a function over K. It is called α-strongly convex over K if for all x K, y K

f(y) f(x) + f(x) (y x) + α

Furthermore, we introduce conditional gradient (CG) (Frank and Wolfe 1956; Jaggi 2013), which will be utilized as a subroutine of our proposed algorithms. Given a function F(x) : K R and an initial point z0 = xin K, the idea of CG is to iteratively perform linear optimization step K times

vk = argmin x K

F(zk) x , zk+1 = zk + σk(vk zk)

where σk = argminσ [0,1] {F(zk + σ(vk zk))} is selected by line search. The detailed procedures of CG are summarized in Algorithm 1. Compared with performing linear optimization step once, CG can output a point xout such that F(xout) is small enough. Note that CG is originally an algorithm for ofﬂine optimization, and OCG proposed by Hazan and Kale (2012) is its extension for online learning. However, OCG only performs 1 linear optimization in each round, and its static regret bound O(T 3/4) is worse that the optimal bound O(

T). As a result, directly applying the restarting strategy (Besbes, Gur, and Zeevi 2015) to OCG could not achieve the optimal dynamic regret. Although previous studies improved the static regret of OCG by performing multiple linear optimization steps in each round, there exist some limitations. Chen et al. (2018) and Xie et al. (2020) used stochastic gradients to update decision, and their results required the smoothness of loss functions. The regret bound of Hazan and Minasyan (2020) has additional dependence on the dimensionality d, which is caused by the randomized regularization. Different from these studies that employed some randomized methods, we utilize CG that is a deterministic method to improve the static regret of OCG, and further propose a projection-free algorithm with small dynamic regret bounds for convex and strongly convex functions.

Algorithm 2 OCG+

1: Input: feasible set K, the modulus of strong convexity λ, Kγ, γ 2: Set ηγ = D G γ 3: for t = 1, , T do 4: if t mod γ = 1 then 5: sγ = t and choose xγ t K 6: end if 7: if λ = 0 then 8: F γ t+1(x) = ηγ Pt i=sγ fi(xγ i ) x + x xγ sγ 2 2 9: else 10: F γ t+1(x) = Pt i=sγ fi(xγ i ) x + λ

2 x xγ i 2 2

2 x xγ sγ 2 2 11: end if 12: xγ t+1 = CG(K, Kγ, F γ t+1(x), xγ t ) 13: end for

OCG+ for a Knowable Function Variation To cope with dynamic environments, we investigate the dynamic regret deﬁned in (1), which measures the performance of the learner against a sequence of local minimizers. Following previous studies (Besbes, Gur, and Zeevi 2015; Zhang et al. 2018), the goal of this paper is to upper bound the dynamic regret by the functional variation VT . By utilizing CG as a subroutine, the detailed procedures of our algorithm are presented in Algorithm 2, where λ is the modulus of strong convexity of loss functions, γ and Kγ are parameters, which is named as improved online conditional gradient (OCG+). Compared with the original OCG, we have made three critical changes to achieve small dynamic regret bounds. First, besides the surrogate loss function for convex functions in line 8 of OCG+, a new surrogate loss function F γ t+1(x) in line 10 is introduced from Garber and Hazan (2016), which can utilize the strong convexity. Note that Garber and Hazan (2016) utilized this surrogate loss function to propose a projection-free online algorithm over polyhedral sets, which only performs one linear optimization step in each round and attains the optimal static regret. By contrast, our algorithm are designed to minimize dynamic regret over any convex decision set, which requires multiple linear optimization steps in each round. Second, we utilize the restarting strategy (Besbes, Gur, and Zeevi 2015) in our OCG+ to handle a speciﬁc functional variation. The restarting frequency of our OCG+ is determined by the parameter γ. Let r = T/γ , qi = (i 1)γ+1 for i = 1, , r and qr+1 = T + 1. It is easy to verify that OCG+ essentially performs the same steps on time intervals

[q1, q2 1], [q2, q3 1], , [qr, qr+1 1] (3)

successively. Third, during the j-th time interval in (3), our OCG+ invokes CG shown in Algorithm 1 as

xγ t+1 = CG(K, Kγ, F γ t+1(x), xγ t )

to choose the decision xγ t+1. The following two theorems present the dynamic regret bounds of Algorithm 2 for convex and strongly convex functions, respectively.

Theorem 1 Under Assumptions 1 and 2, for convex losses, Algorithm 2 with γ T and Kγ = γ ensures

RD 8TGD γ + 2γVT .

Theorem 2 Under Assumptions 1 and 2, for λ-strongly convex losses, Algorithm 2 with γ T and Kγ = γ2 ensures

RD 2T(c1 + c2 ln(γ + 1))

where c1 = λD2

2 + 2(G + λD)D and c2 = 2(G+λD)2

Based on Theorems 1 and 2, we derive the speciﬁc dynamic regret bounds of Algorithm 2. Corollary 1 Assume that ft(x) is convex for any t [T]. If VT p

1/T, under Assumptions 1 and 2, Algorithm 2 with γ = (T/VT )2/3 and Kγ = γ ensures

2GD + 2)T 2/3V 1/3 T .

Otherwise, under Assumptions 1 and 2, Algorithm 2 with γ = T and Kγ = γ achieves

Corollary 2 Let c1 = λD2

2 + 2(G + λD)D and c2 =

λ . Assume that ft(x) is λ-strongly convex for any t [T]. If VT ln(T + 1)/T, under Assumptions 1 and 2, Algorithm 2 with γ = jp

T ln(T + 1)/VT k and Kγ = γ2 ensures

RD (4c1 + 4c2 + 2) p

TVT ln(T + 1).

Otherwise, under Assumptions 1 and 2, Algorithm 2 with γ = T and Kγ = γ2 achieves

RD 2c1 + (2c2 + 2) ln(T + 1).

Remark From Corollaries 1 and 2, if the value of VT is available, our OCG+ achieves an O(max{T 2/3V 1/3 T ,

T}) dynamic regret bound for convex functions and an O(max{ TVT log T, log T}) dynamic regret bound for strongly convex functions. If VT 1, these bounds reduce to O(T 2/3V 1/3 T ) for convex functions which matches the minimax rate proved by Besbes, Gur, and Zeevi (2015), and O( TVT log T) for strongly convex losses which nearly matches the minimax rate of O( TVT ) proved by Besbes, Gur, and Zeevi (2015) up to a polylogarithmic factor.

Multi-OCG+ for Unknown Function Variations However, in practice, the value of VT could be unknown, which limits the applications of OCG+. To tackle this limitation, an efﬁcient strategy to search the optimal γ for OCG+ is required. Note that similar problems also exist in previous studies on OCO (van Erven and Koolen 2016; Zhang, Lu, and Zhou 2018; Wang, Lu, and Zhang 2019; Wang et al. 2020), where it is necessary to search the optimal parameter for other algorithms. The main idea of their solutions is to run multiple instances of their algorithms with different parameters,

and combine them with a meta-algorithm based on the exponentially weighted average forecaster (Cesa-Bianchi and Lugosi 2006), which is able to track the best instance. Inspired by this idea, we deﬁne a set H = {γ1, , γN} of N values for parameter γ, and activate a set of experts {Eγ|γ H} by invoking OCG+ shown in Algorithm 2 as

Eγ = OCG+(K, λ, Kγ, γ).

In each round, we simultaneously run these experts {Eγ|γ H} to generate a set of decisions {xγ t |γ H}. Then, we adopt the exponentially weighted average forecaster as follows to generate a weight wγ t for each expert Eγ, and combine {xγ t |γ H} as xt = P γ H wγ t xγ t . According to Cesa Bianchi and Lugosi (2006), the initial weight of each expert Eγi is set to be wγi 1 = C i(i+1) where C = 1 + 1 N is used to normalize these weights. In each round t, after observing the loss function, the weights of experts are updated as

wγ t+1 = wγ t exp( τft(xγ t )) P γ H wγ t exp( τft(xγ t ))

where τ > 0 is a constant. The detailed procedures of our algorithm are summarized in Algorithm 3, and it is named as Multi-OCG+. Compared with OCG+, our Multi-OCG+ does not require any prior knowledge of VT and enjoys the following dynamic regret bounds. Theorem 3 Assume that ft(x) is convex for any t [T]. Let H = γi = 2i|i = 0, , N where N = log2(T) , and τ = p

8/TG2D2. For each expert in {Eγ|γ H}, let Kγ = γ. Under Assumptions 1 and 2, Algorithm 3 ensures

RD max n c3

T, c4T 2/3V 1/3 T o

TG2D2/8 (1 + 2 ln N)

where c3 = 8

2GD + 2 and c4 = 16GD + 2.

Theorem 4 Assume that ft(x) is λ-strongly convex for any t [T]. Let H = γi = 2i|i = 0, , N where N = log2(T) , and τ = λ/G2. For each expert in {Eγ|γ H}, let Kγ = γ2. Under Assumptions 1 and 2, Algorithm 3 ensures

4c1 + (4c2 + 2) ln(T + 1) + 2G2

(8c1 + 8c2 + 2) p

TVT ln(T + 1) + 2G2

where c1 = λD2

2 + 2(G + λD)D and c2 = 2(G+λD)2

Remark To compare with previous studies, we consider the case of VT 1. For convex functions, Theorem 3 shows that the dynamic regret bound of our Algorithm 3 is on the order of O(T 2/3V 1/3 T ) where we treat the double logarithmic as a constant, which also matches the minimax rate proved by Besbes, Gur, and Zeevi (2015). For strongly-convex functions, Theorem 4 shows that the dynamic regret bound of our Algorithm 3 is on the order of

Algorithm 3 Multi-OCG+

1: Input: feasible set K, strong convexity parameter λ, τ, H = {γ1, , γN} and Kγ, γ H 2: Activate a set of experts {Eγ|γ H} by invoking Algorithm 2 as Eγ = OCG+(K, λ, Kγ, γ) 3: Set wγi 1 = C i(i+1) for i [N], where C = 1 + 1

N 4: for t = 1, , T do 5: Receive xγ t from each expert Eγ and set xt = P γ H wγ t xγ t

6: Set wγ t+1 = wγ t exp( τft(xγ t )) P

γ H wγ t exp( τft(xγ t )), γ H

O( TVT log T), which nearly matches the minimax rate proved by Besbes, Gur, and Zeevi (2015) up to a polylogarithmic factor. Although Besbes, Gur, and Zeevi (2015) established similar dynamic regret bounds, their algorithm needs to know the upper bound of VT and is not projectionfree. Besides, our bound for convex functions is slightly better than the bound O(T 2/3V 1/3 T log1/3 T) achieved by another projection-based algorithm (Zhang et al. 2018) that does not require any prior knowledge of VT .

Theoretical Analysis

In this section, we only provide the proofs of Theorems 1 and 2. Due to the limitation of space, we postpone the omitted proofs to the supplementary material.

Proof of Theorem 1

Let r = T/γ , qi = (i 1)γ + 1 for i = 1, , r and qr+1 = T + 1. First, we have

t=1 ft(xγ t )

t=1 min x K ft(x)

t=qi ft(xγ t )

t=qi min x K ft(x)

t=qi ft(xγ t ) min x K

| {z } :=ai

t=qi min x K ft(x)

To bound ai, we introduce the following lemma, which gives the static regret bound of Algorithm 2 over each interval in (3).

Lemma 1 Assume that ft(x) is convex for any t [T]. Let r = T/γ , qi = (i 1)γ + 1 for i = 1, , r and qr+1 =

T + 1. Under Assumptions 1 and 2, for any x K and j [r], Algorithm 2 with Kγ = γ ensures

t=qj ft(xγ t )

t=qj ft(x ) 4GD γ.

It is not hard to bound ai using Lemma 1 as below

t=qi ft(xγ t ) min x K

t=qi ft(x) 4GD γ. (5)

To upper bound bi, we follow the proof of Theorem 3 in Zhang et al. (2018) as below

bi = min x K

t=qi ft(x t )

t=qi ft(x qi)

t=qi ft(x t )

γ max t [qi,qi+1 1]

ft(x qi) ft(x t ) .

For brevity, let

t=qi+1 max x K |ft(x) ft 1(x)|.

Then, for any t [qi, qi+1 1], we have

ft(x qi) ft(x t )

=ft(x qi) fqi(x qi) + fqi(x qi) ft(x t )

ft(x qi) fqi(x qi) + fqi(x t ) ft(x t )

Combining (7) with (6), we have

t=qi min x K ft(x) 2γVT (i). (8)

Substituting (5) and (8) into (4), we have

t=1 ft(xγ t )

t=1 min x K ft(x) 4r GD γ + 2γ

8TGD γ + 2γVT

where the last inequality is due to r T

γ and Pr i=1 VT (i) VT .

Proof of Theorem 2 Similar to the proof of Theorem 1, we introduce the following lemma, which gives the static regret bound of Algorithm 2 over each interval in (3) for strongly convex functions.

Lemma 2 Let r = T/γ , qi = (i 1)γ+1 for i = 1, , r and qr+1 = T + 1. If each ft(x) is λ-strongly convex and

Assumptions 1 and 2 hold, for any x K and j [r], Algorithm 2 with Kγ = γ2 ensures

t=qj ft(xγ t )

t=qj ft(x )

2 + 2(G + λD)D + 2(G + λD)2 ln(γ + 1)

Let r = T/γ , qi = (i 1)γ + 1 for i = 1, , r and qr+1 = T + 1. It is not hard to verify that (4) and (8) still hold. So, we only need to bound ai in (4) using Lemma 2 as below

ai c1 + c2 ln(γ + 1). (9)

Then, substituting (9) and (8) into (4), we have

t=1 ft(xγ t )

t=1 min x K ft(x)

r(c1 + c2 ln(γ + 1)) + 2γ

2T(c1 + c2 ln(γ + 1))

where the last inequality is due to r T

γ and Pr i=1 VT (i) VT .

Experiments In this section, we perform numerical experiments in dynamic environments to verify the efﬁciency and effectiveness of our Multi-OCG+. All algorithms are implemented wtih Matlab R2016b and tested on a linux machine with 2.4GHz CPU and 768GB RAM.

Settings and Datasets Following Hazan and Kale (2012), we consider the online matrix completion problem, the goal of which is to construct a low-rank matrix X Rm n that can approximate a matrix M Rm n by observing its entries according to the online setting. At each round t, the learner ﬁrst chooses a matrix X such that X r, where X denotes the trace norm of X and r is a constant. Then, the learner receives a loss function ft(X) = P

(i,j) OBt |Xij Mij| where OBt [m] [n]. We use a publicly available dataset Movie Lens 100K1, which originally contains 100000 ratings in {1, 2, 3, 4, 5} by 943 users on 1682 movies and can be denoted as {(ik, jk, Mikjk)}100000 k=1 . To create dynamic environments, we slightly modify the dataset such that the optimal low-rank approximation matrix changes at certain rounds. To be precise, we construct a larger dataset denoted as {(ik, jk, Mikjk)}300000 k=1 by combining three copies of the original Movie Lens 100K and ﬂipping the original value of Mikjk by multiplying 1 for any k = 100001, , 200000. For simplicity, this dataset is equally divided into T = 3000

1https://grouplens.org/datasets/movielens/100k/

Number of Iterations 0 500 1000 1500 2000 2500 3000

Cumulative Loss

8 RFTL CBCE-RFTL Multi-OCG+

(a) Comparison of cumulative loss

Number of Iterations 0 500 1000 1500 2000 2500 3000

Cumulative Runtime (s)

RFTL CBCE-RFTL Multi-OCG+

(b) Comparison of cumulative runtime

Figure 1: Experimental results for online matrix completion in dynamic environments

partitions according to its original sequence, and we denote the set of (i, j) in the t-th partition as OBt. In this way, the optimal low-rank approximation matrix will change after each 1000 rounds. Moreover, we set r = 5000, following Hazan and Kale (2012).

Baselines and Results Note that for any t = 1, , T, ft(X) is not strongly convex. So, we ﬁrst compare our Multi-OCG+ against RFTL (Hazan 2016) to demonstrate that simply running an algorithm with optimal static regret cannot deal with dynamic environments. Speciﬁcally, RFTL updates as

xt+1 = argmin x K

i=1 fi(xi) x + x x1 2 2

where η is a parameter and we set η = c/

T by tuning the constant c from the set {0.1, 1.0, , 1e6}. To further verify the performance of our Multi-OCG+, we also compare it against the projection-based algorithm proposed by Zhang et al. (2018), which achieves the existing best dynamic regret bound without any prior knowledge of environment changing. Similar as the framework of our Multi-OCG+, their algorithm also consists of two parts: A set of expert {EI|I I}, each of which is an instance of the expert-algorithm running over an interval I I, where I is the geometric covering intervals proposed by Daniely, Gonen, and Shalev-Shwartz (2015); A meta-algorithm CBCE (Jun et al. 2017), which is able to combine the decisions generated by active experts in each round. We note that the expert-algorithm could be any online algorithm with an O(

T) static regret bound. In our experiments, we select RFTL as the expert-algorithm, and denote this algorithm as CBCE-RFTL. Speciﬁcally, CBCERFTL contains two parameters including the prior distribution π |I| over all experts and the learning rate ηI

for each expert EI. Following Jun et al. (2017), we set π as the uniform distribution, and set ηI = c/ p

|I|, where c is selected from {0.1, 1.0, , 1e6}. For our Multi-OCG+, we set H = γi = 2i|i = 0, , log2(T) . Since ft(X) is not strongly convex, the parameter τ is set to be s/

T, where s is selected from {1e 4, 1e 3, , 1.0}. Besides, the parameter ηγ of each expert Eγ is set to be c/ γ, where c is selected from {0.1, 1.0, , 1e6}. Although in theory we may need to set Kγ = γ for our Multi-OCG+ to achieve an optimal dynamic regret bound for convex loss functions, we ﬁnd that Multi-OCG+ with a much smaller K can also achieve good performance in our experiments. Therefore, we simply set Kγ = 4 for our Multi-OCG+ to reduce the time cost. Figure 1 shows the cumulative loss and runtime of each algorithm in dynamic environments. Obviously, the performance of RFTL becomes worse after the environment changes, which shows that RFTL cannot deal with dynamic environments. By contrast, CBCE-RFTL and our Multi OCG+ catch up with changing environments very fast. Furthermore, our Multi-OCG+ outperforms CBCE-RFTL, and is signiﬁcantly faster than it, which veriﬁes the advantages of our algorithm in the dynamic regret and time cost.

In this paper, we propose a projection-free online algorithm named as Multi-OCG+ to minimize the dynamic regret without the smoothness. According to theoretical analysis, our Multi-OCG+ enjoys an optimal dynamic regret bound of O(max{T 2/3V 1/3 T ,

T}) for convex functions, which does not require any prior knowledge of VT . Furthermore, for strongly convex functions, Multi OCG+ achieves a nearly optimal dynamic regret bound of O(max{ TVT log T, log T}). Experiments in dynamic environments demonstrate the efﬁciency and effectiveness of our Multi-OCG+.

Acknowledgments This work was partially supported by the NSFC (61976112), Jiangsu SF (BK20200064), and Open Research Projects of Zhejiang Lab (NO. 2021KB0AB02).

References Abernethy, J. D.; Bartlett, P. L.; Rakhlin, A.; and Tewari, A. 2008. Optimal stragies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, 415 424.

Besbes, O.; Gur, Y.; and Zeevi, A. 2015. Non-stationary stochastic optimization. Operations Research 63(5): 1227 1244.

Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Cesa-Bianchi, N.; and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Chen, L.; Harshaw, C.; Hassani, H.; and Karbasi, A. 2018. Projection-free online optimization with stochastic gradient: From convexity to submodularity. In Proceedings of the 35th International Conference on Machine Learning, 813 822.

Chen, L.; Zhang, M.; and Karbasi, A. 2019. Projection-free bandit convex optimization. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics, 2047 2056.

Daniely, A.; Gonen, A.; and Shalev-Shwartz, S. 2015. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, 1405 1411.

Frank, M.; and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1 2): 95 110.

Garber, D.; and Hazan, E. 2016. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3): 1493 1528.

Hall, E. C.; and Willett, R. M. 2013. Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, 579 587.

Hazan, E. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3 4): 157 325.

Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2): 169 192.

Hazan, E.; and Kale, S. 2012. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 1843 1850.

Hazan, E.; and Minasyan, E. 2020. Faster projection-free online learning. In Proceedings of the 33rd Annual Conference on Learning Theory, 1877 1893.

Jadbabaie, A.; Rakhlin, A.; Shahrampour, S.; and Sridharan, K. 2015. Online optimization: Competing with dynamic

comparators. In Proceedings of the 18th International Conference on Artiﬁcial Intelligence and Statistics, 398 406. Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 427 435. Jun, K.-S.; Orabona, F.; Wright, S.; and Willett, R. 2017. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, 943 951. Kalhan, D. S.; Bedi, A. S.; Koppel, A.; Rajawat, K.; Gupta, A.; and Banerjee, A. 2020. Projection free dynamic online learning. In Proceedings of the 45th IEEE International Conference on Acoustics, Speech and Signal Processing, 3957 3961. Kalhan, D. S.; Bedi, A. S.; Koppel, A.; Rajawat, K.; Hassani, H.; Gupta, A.; and Banerjee, A. 2019. Dynamic online learning via Frank-Wolfe algorithm. Levy, K. Y.; and Krause, A. 2019. Projection free online learning over smooth sets. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics, 1458 1466. Mokhtari, A.; Shahrampour, S.; Jadbabaie, A.; and Ribeiro, A. 2016. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of 55th Conference on Decision and Control, 7195 7201. Shalev-Shwartz, S. 2011. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2): 107 194. Shalev-Shwartz, S.; and Singer, Y. 2007. A primal-dual perspective of online learning algorithm. Machine Learning 69(2 3): 115 142. van Erven, T.; and Koolen, W. M. 2016. Meta Grad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, 3666 3674. Wan, Y.; Tu, W.-W.; and Zhang, L. 2020. Projection-free distributed online convex optimization with O(

T) communication complexity. In Proceedings of the 37th International Conference on Machine Learning, 9818 9828. Wan, Y.; and Zhang, L. 2021. Projection-free online learning over strongly convex sets. In Proceedings of the 35th AAAI Conference on Artiﬁcial Intelligence. Wang, G.; Lu, S.; Hu, Y.; and Zhang, L. 2020. Adapting to smoothness: A more universal algorithm for online convex optimization. In Proceedings of the 34th AAAI Conference on Artiﬁcial Intelligence, 6162 6169. Wang, G.; Lu, S.; and Zhang, L. 2019. Adaptivity and optimality: A universal algorithm for online convex optimization. In Proceedings of 35th Conference on Uncertainty in Artiﬁcial Intelligence. Xie, J.; Shen, Z.; Zhang, C.; Wang, B.; and Qian, H. 2020. Efﬁcient projection-free online methods with stochastic recursive gradient. In Proceedings of the 34th AAAI Conference on Artiﬁcial Intelligence, 6446 6453.

Yang, T.; Zhang, L.; Jin, R.; and Yi, J. 2016. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning. Zhang, L. 2020. Online learning in changing environments. In Proceedings of the 29th International Joint Conference on Artiﬁcial Intelligence, 5178 5178. Zhang, L.; Lu, S.; and Yang, T. 2020. Minimizing dynamic regret and adaptive regret simultaneously. In Proceedings of the 23rd International Conference on Artiﬁcial Intelligence and Statistics, 309 319. Zhang, L.; Lu, S.; and Zhou, Z.-H. 2018. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, 1323 1333. Zhang, L.; Yang, T.; Jin, R.; and Zhou, Z.-H. 2018. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, 5877 5886. Zhang, L.; Yang, T.; Yi, J.; Jin, R.; and Zhou, Z.-H. 2017a. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, 732 741. Zhang, W.; Zhao, P.; Zhu, W.; Hoi, S. C. H.; and Zhang, T. 2017b. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning, 4054 4062. Zinkevich, M. 2003. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 928 936.