Equipping Experts/Bandits with Long-term Memory

Kai Zheng¹,² (zhengk92@pku.edu.cn), Haipeng Luo³ (haipengl@usc.edu), Ilias Diakonikolas⁴ (ilias.diakonikolas@gmail.com), Liwei Wang¹,² (wanglw@cis.pku.edu.cn)

¹ Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
² Center for Data Science, Peking University
³ University of Southern California
⁴ University of Wisconsin-Madison

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth [8], by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with K actions and T rounds, using our framework we develop various algorithms with a regret bound of order $O(\sqrt{T(S\ln T + n\ln K)})$ compared to any sequence of experts with at most $S-1$ switches among $n \le \min\{S, K\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework, we also achieve the best of both stochastic and adversarial environments simultaneously. This resolves an open problem of Warmuth and Koolen [35]. Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandits, a sharp contrast with the non-contextual case.

1 Introduction

In this work, we propose a black-box reduction for obtaining long-term memory guarantees for two fundamental problems in online learning: the expert problem [17] and the multi-armed bandit (MAB) problem [6]. In both problems, a learner interacts with the environment for T rounds, with K fixed available actions. At each round, the environment decides the loss for each action while simultaneously the learner selects one of the actions and suffers the loss of this action. In the expert problem, the learner observes the loss of every action at the end of each round (a.k.a. full-information feedback), while in MAB, the learner only observes the loss of the selected action (a.k.a. bandit feedback).

For both problems, the classical performance measure is the learner's (static) regret, defined as the difference between the learner's total loss and the loss of the best fixed action. It is well-known that the minimax optimal regret is $\Theta(\sqrt{T\ln K})$ [17] and $\Theta(\sqrt{TK})$ [6, 4] for the expert problem and MAB respectively. Comparing against a fixed action, however, does not always lead to meaningful guarantees, especially when the environment is non-stationary and no single fixed action performs well. To address this issue, prior work has considered a stronger measure called switching/tracking/shifting regret, which is the difference between the learner's total loss and the loss of a sequence of actions with at most $S-1$ switches. Various existing algorithms (including some black-box approaches) achieve the following switching regret:

  $O(\sqrt{TS\ln(TK)})$ for the expert problem [23, 21, 1, 27, 24],   (1)
  $O(\sqrt{TKS\ln(TK)})$ for multi-armed bandits [6, 28].   (2)

We call these typical switching regret bounds. Such bounds essentially imply that the learner pays the worst-case static regret for each switch in the benchmark sequence.
While this makes sense in the worst case, intuitively one would hope to perform better when the benchmark sequence frequently switches back to previous actions, as long as the algorithm remembers which actions have performed well previously. Indeed, for the expert problem, algorithms with long-term memory were developed whose switching regret scales with $n\ln K$ rather than $S\ln K$, where $n \le \min\{S, K\}$ is the number of distinct actions in the benchmark sequence [8, 2, 13].¹ Although there is no known lower bound, this regret bound essentially matches the one achieved by a computationally inefficient approach of running Hedge over all benchmark sequences with S switches among n experts, an approach that usually leads to the information-theoretically optimal regret guarantee. Compared to the typical switching regret bound of form (1) (which can be written as $O(\sqrt{T(S\ln T + S\ln K)})$), this long-term memory guarantee implies that the learner pays the worst-case static regret only for each distinct action encountered in the benchmark sequence, and pays less for each switch, especially when n is very small. Algorithms with long-term memory guarantees have been found to have better empirical performance [8], and have been applied to practical problems such as TCP round-trip time estimation [30], intrusion detection systems [29], and multi-agent systems [31]. We are not aware of any similar studies for the bandit setting.

Overview of our contributions. The main contribution of this work is to propose a simple black-box approach to equip expert or MAB algorithms with long-term memory and to achieve switching regret guarantees of a similar flavor to those of [8, 2, 13]. The key idea of our approach is to utilize a variant of the confidence-rated expert framework of [7], and to use a sub-routine to learn the confidence/importance of each action at each time step. Importantly, this sub-routine is itself an expert/bandit algorithm over only two actions and needs to enjoy some typical switching regret guarantee (for example of form (1) for the expert problem). In other words, our approach reduces the problem of obtaining long-term memory to the well-studied problem of achieving typical switching regret. Compared to existing methods [8, 2, 13], the advantages of our approach are the following:

1. While existing methods are all restricted to variants of the classical Hedge algorithm [17], our approach allows one to plug in a variety of existing algorithms and to obtain a range of different algorithms with switching regret $O(\sqrt{T(S\ln T + n\ln K)})$. (Section 3.1)

2. Due to this flexibility, by plugging in specific adaptive algorithms, we develop a parameter-free algorithm whose switching regret is simultaneously $O(\sqrt{T(S\ln T + n\ln K)})$ in the worst case and $O(S\ln T + n\ln(K\ln T))$ if the losses are piece-wise stochastic (see Section 2 for the formal definition). This is a generalization of previous best-of-both-worlds results for static or switching regret [19, 27], and resolves an open problem of Warmuth and Koolen [35]. The best previous bound for the stochastic case is $O(S\ln(TK\ln T))$ [27]. (Section 3.2)

3. Our framework allows us to derive the first nontrivial long-term memory guarantees for the bandit setting, while existing approaches fail to do so (more discussion to follow). For example, when n is a constant and the losses are sparse, our algorithm achieves switching regret $O(S^{1/3}T^{2/3} + K^3\ln T)$ for MAB, which is better than the typical bound (2) when S and K are large.
For example, when $S = \Theta(T^{7/10})$ and $K = \Theta(T^{3/10})$, our bound is of order $O(T^{9/10}\ln T)$ while bound (2) becomes vacuous (linear in T), demonstrating a strict separation in learnability. (Section 4)

To motivate our results on long-term memory guarantees for MAB, a few remarks are in order. It is not hard to verify that existing approaches achieve switching regret $O(\sqrt{TK(S\ln T + n\ln K)})$ for MAB. However, the polynomial dependence on the number of actions K makes the improvement of this bound over the typical bound (2) negligible. It is well-known that such polynomial dependence on K is unavoidable in the worst case due to the bandit feedback. This motivates us to consider situations where the necessary dependence on K is much smaller. In particular, Bubeck et al. [10] recently showed that if the loss vectors are ρ-sparse, then a static regret bound of order $O(\sqrt{T\rho\ln K} + K\ln T)$ is achievable, exhibiting a much more favorable dependence on K. We therefore focus on this sparse MAB problem and study what nontrivial switching regret bounds are achievable.

We first show that a bound of order $O(\sqrt{T\rho S\ln(KT)} + KS\ln T)$, a natural generalization of the typical switching regret bound (2) to the sparse setting, is impossible. In fact, we show that for any S the worst-case switching regret is at least $\Omega(\sqrt{TKS})$, even when $\rho = 2$. Since achieving switching regret for MAB can be seen as a special case of contextual bandits [6, 26], this negative result also implies that, surprisingly, sparse losses do not help improve the worst-case regret for contextual bandits, which is a sharp contrast with the non-contextual case studied in [10] (see Theorem 6 and Corollary 7). Despite this negative result, however, as mentioned we are able to utilize our general framework to still obtain improvements over bound (2) when n is small. Our construction is fairly sophisticated, requiring a special sub-routine that uses a novel one-sided log-barrier regularizer and admits a new kind of local-norm guarantee, which may be of independent interest.

¹The setting considered in [8, 2] is in fact slightly different from, yet closely related to, the expert problem. One can easily translate their regret bounds into the bounds we present here.

2 Preliminaries

Throughout the paper, we use [m] to denote the set {1, ..., m} for an integer m. The learning protocol for the expert problem and MAB with K actions and T rounds is as follows. For each time t = 1, ..., T: (1) the learner first randomly selects an action $I_t \in [K]$ according to a distribution $p_t \in \Delta_K$ (the (K−1)-dimensional simplex); (2) simultaneously the environment decides the loss vector $\ell_t \in [-1, 1]^K$; (3) the learner suffers loss $\ell_t(I_t)$ and observes either $\ell_t$ in the expert problem (full-information feedback) or only $\ell_t(I_t)$ in MAB (bandit feedback). For any sequence of T actions $i_1, \ldots, i_T \in [K]$, the expected regret of the learner against this sequence is defined as

  $R(i_{1:T}) = \mathbb{E}\big[\sum_{t=1}^T r_t(i_t)\big]$,

where the expectation is with respect to both the learner and the environment, and $r_t(i)$, the instantaneous regret (against action i), is defined as $p_t^\top \ell_t - \ell_t(i)$. When $i_1 = \cdots = i_T$, this becomes the traditional static regret against a fixed action.

Most existing works on switching regret impose a constraint on the number of switches in the benchmark sequence: $\sum_{t=2}^T \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$. In other words, the sequence can be decomposed into S disjoint intervals, each with a fixed comparator as in static regret.
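For concreteness, the following is a minimal simulation sketch of this protocol and of the quantities just defined, in the full-information case. The predict()/update() learner interface and all names are illustrative only and not part of the paper; the regret is computed with the learner's expected loss $p_t^\top\ell_t$, so no sampling is needed.

```python
import numpy as np

def switching_regret(learner, losses, benchmark):
    """Run a full-information expert learner on a T x K loss matrix and measure its
    expected regret against a benchmark action sequence i_1, ..., i_T, together with
    the number of switches in that sequence (the quantity constrained to be <= S - 1)."""
    benchmark = np.asarray(benchmark)
    regret = 0.0
    for loss, comparator in zip(losses, benchmark):
        p = learner.predict()                  # p_t: the learner's distribution over [K]
        regret += p @ loss - loss[comparator]  # instantaneous regret r_t(i_t) = p_t^T l_t - l_t(i_t)
        learner.update(loss)                   # full-information feedback: the whole vector l_t
    switches = int(np.sum(benchmark[1:] != benchmark[:-1]))
    return regret, switches
```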
Typical switching regret bounds hold for any sequence satisfying this constraint and are in terms of T, K, and S, as in Eq. (1) and Eq. (2). The number of switches, however, does not fully characterize the difficulty of the problem. Intuitively, a sequence that frequently switches back to previous actions should be an easier benchmark for an algorithm with long-term memory that remembers which actions performed well in the past. To encode this intuition, prior works [8, 2, 13] introduced another parameter $n = |\{i_1, \ldots, i_T\}|$, the number of distinct actions in the sequence, to quantify the difficulty of the problem, and developed switching regret bounds in terms of T, K, S, and n. Clearly one has $n \le \min\{S, K\}$, and we are especially interested in the case when $n \ll \min\{S, K\}$, which is natural if the data exhibits some periodic pattern. Our goal is to understand what improvements are achievable in this case and how to design algorithms that can leverage this property via a unified framework.

Stochastic setting. In general, we do not make any assumptions on how the losses are generated by the environment, which is known as the adversarial setting in the literature. We do, however, develop an algorithm (for the expert problem) that enjoys the best of both worlds: it not only enjoys a robust worst-case guarantee in the adversarial setting, but also achieves much smaller logarithmic regret in a stochastic setting. Specifically, in this stochastic setting, without loss of generality, we assume the n distinct actions in $\{i_1, \ldots, i_T\}$ are $1, \ldots, n$. It is further assumed that for each $i \in [n]$ there exists a constant gap $\alpha_i > 0$ such that $\mathbb{E}_t[\ell_t(j) - \ell_t(i)] \ge \alpha_i$ for all $j \neq i$ and all t such that $i_t = i$, where the expectation is with respect to the randomness of the environment conditioned on the history up to the beginning of round t. In other words, at every time step the algorithm is compared to the best action, whose expected loss is separated by a constant from those of the other actions. This is a natural generalization of the stochastic setting studied for static regret or typical switching regret [19, 27].

Algorithm 1: A Simple Reduction for Long-term Memory
  1  Input: expert algorithm A learning over K actions with a static regret guarantee (cf. Condition 1), expert algorithms $A_1, \ldots, A_K$ learning over two actions {0, 1} with a switching regret guarantee (cf. Condition 2), parameter $\eta \le 1/5$
  2  for t = 1, 2, ... do
  3    Receive sampling distribution $w_t \in \Delta_K$ from A
  4    Receive sampling probability $z_t(i)$ for action 1 from $A_i$ for each $i \in [K]$
  5    Sample $I_t \sim p_t$ where $p_t(i) \propto z_t(i)w_t(i)$ for all i, and receive $\ell_t \in [-1, 1]^K$
  6    Feed loss vector $c_t$ to A, where $c_t(i) = z_t(i)(\ell_t(i) - p_t^\top\ell_t) = -z_t(i)r_t(i)$ with $r_t(i) = p_t^\top\ell_t - \ell_t(i)$
  7    Feed loss vector $(0,\ 5\eta - r_t(i))$ to $A_i$ for each $i \in [K]$

Confidence-rated actions. Our approach makes use of the confidence-rated expert setting of Blum and Mansour [7], a generalization of the expert problem (and of the sleeping expert problem [18]). The protocol of this setting is the same as that of the expert problem, except that at the beginning of each round the learner first receives a confidence score $z_t(i)$ for each action i. The regret against a fixed action i is also scaled by its confidence and is now defined as $\mathbb{E}\big[\sum_{t=1}^T z_t(i)r_t(i)\big]$. The expert problem is clearly the special case with $z_t(i) = 1$ for all t and i. There are a number of known examples showing why this formulation is useful, and our work will add one more to this list.
To obtain a bound on this new regret measure, one can in fact simply reduce it to the regular expert problem [7, 19, 27]. Specifically, let A be some expert algorithm over the same K actions producing sampling distributions $w_1, \ldots, w_T \in \Delta_K$. The reduction works by sampling $I_t$ according to $p_t$ such that $p_t(i) \propto z_t(i)w_t(i)$ for all i, and then feeding $c_t$ to A where $c_t(i) = z_t(i)(\ell_t(i) - p_t^\top\ell_t)$ for all i. Note that by the definition of $p_t$ one has $w_t^\top c_t = \sum_i w_t(i)z_t(i)(\ell_t(i) - p_t^\top\ell_t) = 0$. Therefore, one can directly equate the confidence-rated regret and the regular static regret of the reduced problem: $\mathbb{E}\big[\sum_{t=1}^T z_t(i)r_t(i)\big] = \mathbb{E}\big[\sum_{t=1}^T (w_t^\top c_t - c_t(i))\big]$.

3 General Framework for the Expert Problem

In this section, we introduce our general framework for obtaining long-term memory regret bounds and demonstrate how it leads to various new algorithms for the expert problem. We start with a simpler version and then move on to a more elaborate construction that is essential for obtaining best-of-both-worlds results.

3.1 A simple approach for adversarial losses

A simple version of our approach is described in Algorithm 1. At a high level, it simply makes use of the confidence-rated action framework described in Section 2. The reduction to the standard expert problem is executed in Lines 5 and 6, with a black-box expert algorithm A. It remains to specify how to come up with the confidence scores $z_t(i)$. We propose to learn these scores via a separate black-box expert algorithm $A_i$ for each i. More specifically, each $A_i$ learns over two actions 0 and 1, where action 0 corresponds to confidence score 0 and action 1 corresponds to score 1. Therefore, the probability of picking action 1 at time t naturally represents a confidence score between 0 and 1, which we denote by $z_t(i)$, overloading the notation (Line 4). As for the losses fed to $A_i$, we fix the loss of action 0 to be 0 (since shifting losses by the same amount has no real effect), and set the loss of action 1 to be $5\eta - r_t(i)$ (Line 7). The role of the term $r_t(i)$ is intuitively clear: the larger the loss of action i compared to that of the algorithm, the less confident we should be about it. The role of the constant bias term $5\eta$ will become clear in the analysis (in fact, it can even be removed at the cost of a worse bound; see Appendix B.2).

Finally, we specify what properties we require from the black-box algorithms $A, A_1, \ldots, A_K$. In short, A needs to ensure a static regret bound, while $A_1, \ldots, A_K$ need to ensure a switching regret bound. See Figure 1 for an illustration of our reduction. The trick is that since $A_1, \ldots, A_K$ learn over only two actions, this construction helps us to separate the dependence on K and the number of switches S.

Figure 1: Illustration of the reduction. The main idea of our approach is to reduce the problem of obtaining a long-term memory guarantee to the confidence-rated expert problem and the problem of obtaining switching regret. Algorithms for the latter learn and provide confidence scores to the confidence-rated expert problem. The reduction from the confidence-rated expert problem to obtaining standard static regret is known [7].
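For intuition, here is a minimal sketch of one round of Algorithm 1 under the sign conventions above, assuming the black-box algorithms A and $A_1, \ldots, A_K$ expose the same hypothetical predict()/update() interface used in the earlier sketch; the interface and names are illustrative only.

```python
import numpy as np

def algorithm1_round(A, subs, eta, loss, rng):
    """One round of Algorithm 1 (Lines 3-7). A: expert algorithm over K actions;
    subs: list of K two-action algorithms A_i; loss: loss vector l_t in [-1, 1]^K."""
    w = A.predict()                                   # Line 3: distribution over K actions
    z = np.array([Ai.predict()[1] for Ai in subs])    # Line 4: prob. of action 1 = confidence z_t(i)
    p = z * w
    p /= p.sum()                                      # Line 5: p_t(i) proportional to z_t(i) w_t(i)
    action = rng.choice(len(w), p=p)
    r = p @ loss - loss                               # instantaneous regrets r_t(i) = p_t^T l_t - l_t(i)
    A.update(-z * r)                                  # Line 6: c_t(i) = z_t(i)(l_t(i) - p_t^T l_t)
    for Ai, r_i in zip(subs, r):
        Ai.update(np.array([0.0, 5 * eta - r_i]))     # Line 7: losses (0, 5*eta - r_t(i))
    return action
```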
These (static or switching) regret bounds could be the standard worst-case T-dependent bounds mentioned in Section 1, in which case we would obtain looser long-term memory guarantees (specifically, n times worse; see Appendix B.2). Instead, we require these bounds to be data-dependent and, in particular, of the form specified below.

Condition 1. There exists a constant C > 0 such that for any $\eta \in (0, 1/5]$ and any loss sequence $c_1, \ldots, c_T \in [-2, 2]^K$, algorithm A (possibly with knowledge of η) produces sampling distributions $w_1, \ldots, w_T \in \Delta_K$ and ensures one of the following static regret bounds:

  $\sum_{t=1}^T w_t^\top c_t - \sum_{t=1}^T c_t(i) \le C\Big(\frac{\ln K}{\eta} + \eta\sum_{t=1}^T |c_t(i)|\Big), \quad \forall i \in [K]$,   (3)
  $\sum_{t=1}^T w_t^\top c_t - \sum_{t=1}^T c_t(i) \le C\Big(\frac{\ln K}{\eta} + \eta\sum_{t=1}^T |w_t^\top c_t - c_t(i)|\Big), \quad \forall i \in [K]$.   (4)

Condition 2. There exists a constant C > 0 such that for any $\eta \in (0, 1/5]$, any loss sequence $h_1, \ldots, h_T \in [-3, 3]^2$, and any $S \in [T]$, algorithm $A_i$ (possibly with knowledge of η) produces sampling distributions $q_1, \ldots, q_T \in \Delta_2$ and ensures one of the following switching regret bounds against any sequence $b_1, \ldots, b_T \in \{0, 1\}$ with $\sum_{t=2}^T \mathbf{1}\{b_t \neq b_{t-1}\} \le S - 1$:²

  $\sum_{t=1}^T q_t^\top h_t - \sum_{t=1}^T h_t(b_t) \le C\Big(\frac{S\ln T}{\eta} + \eta\sum_{t=1}^T |h_t(b_t)|\Big)$,   (5)
  $\sum_{t=1}^T q_t^\top h_t - \sum_{t=1}^T h_t(b_t) \le C\Big(\frac{S\ln T}{\eta} + \eta\sum_{t=1}^T |q_t^\top h_t - h_t(b_t)|\Big)$,   (6)
  $\sum_{t=1}^T q_t^\top h_t - \sum_{t=1}^T h_t(b_t) \le C\Big(\frac{S\ln T}{\eta} + \eta\sum_{t=1}^T \sum_{b\in\{0,1\}} q_t(b)|h_t(b)|\Big)$.   (7)

We emphasize that these data-dependent bounds are all standard in the online learning literature,³ and we provide a few examples below (see Appendix A for brief proofs).

Proposition 1. The following algorithms all satisfy Condition 1: variants of Hedge [20, 34], Prod [12], Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].

Proposition 2. The following algorithms all satisfy Condition 2: Fixed-share [23], a variant of Fixed-share (Algorithm 5 in Appendix A), and AdaNormalHedge.TV [27].

²In terms of the notation in Algorithm 1, $q_t = (1 - z_t(i),\ z_t(i))$.
³In fact, most standard bounds replace the absolute values we use here with squares, leading to even smaller bounds (up to a constant). We choose the looser versions with absolute values since this makes the conditions weaker while still being sufficient for all of our analysis.

Algorithm 2: A Parameter-free Reduction for Best-of-both-worlds
  1  Define: $M = \lceil\log_2(T/5)\rceil + 1$, $\eta_j = \min\{1/5,\ 2^{j-1}/T\}$ for $j \in [M]$
  2  Input: expert algorithm A learning over KM actions with a static regret guarantee (cf. Condition 3), expert algorithms $\{A_{ij}\}_{i\in[K], j\in[M]}$ learning over two actions {0, 1} with a switching regret guarantee (cf. Condition 2)
  3  for t = 1, 2, ... do
  4    Receive sampling distribution $w_t \in \Delta_{KM}$ from A
  5    Receive sampling probability $z_t(i, j)$ for action 1 from $A_{ij}$ for each $i \in [K]$ and $j \in [M]$
  6    Sample $I_t \sim p_t$ where $p_t(i) \propto \sum_{j=1}^M z_t(i, j)w_t(i, j)$ for all i, and receive $\ell_t \in [-1, 1]^K$
  7    Feed loss vector $c_t$ to A, where $c_t(i, j) = z_t(i, j)(\ell_t(i) - p_t^\top\ell_t) = -z_t(i, j)r_t(i)$ with $r_t(i) = p_t^\top\ell_t - \ell_t(i)$
  8    Feed loss vector $(0,\ 5\eta_j|r_t(i)| - r_t(i))$ to $A_{ij}$ for each $i \in [K]$ and $j \in [M]$

We are now ready to state the main result for Algorithm 1 (see Appendix B.1 for the proof).

Theorem 3. Suppose Conditions 1 and 2 both hold. With $\eta = \min\big\{\frac{1}{5}, \sqrt{\frac{S\ln T + n\ln K}{T}}\big\}$, Algorithm 1 ensures $R(i_{1:T}) = O\big(\sqrt{T(S\ln T + n\ln K)}\big)$ for any loss sequence $\ell_1, \ldots, \ell_T$ and benchmark sequence $i_1, \ldots, i_T$ such that $\sum_{t=2}^T \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$.

Our bound in Theorem 3 is slightly worse than the existing bounds of [8, 2],⁴ but it still improves over the typical switching regret $O(\sqrt{T(S\ln T + S\ln K)})$ (Eq. (1)), especially when n is small and S and K are large.
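As a concrete illustration of the kind of sub-routine one might plug in for the $A_i$'s (cf. Proposition 2), the following is a minimal sketch of Fixed-share over two actions: a Hedge step followed by mixing in a little of the uniform distribution, which is what yields switching (tracking) regret. The parameter values shown are placeholders for illustration, not the tuned choices analyzed in Appendix A.

```python
import numpy as np

class TwoActionFixedShare:
    """Minimal Fixed-share [23] over the two actions {0, 1}."""
    def __init__(self, eta=0.1, alpha=0.01):
        self.eta = eta                  # learning rate (illustrative value)
        self.alpha = alpha              # share / mixing parameter (illustrative value)
        self.q = np.array([0.5, 0.5])   # q_t; q_t[1] plays the role of z_t(i)

    def predict(self):
        return self.q

    def update(self, h):
        """h: length-2 loss vector (h_t(0), h_t(1))."""
        v = self.q * np.exp(-self.eta * np.asarray(h))   # Hedge step
        v /= v.sum()
        self.q = (1 - self.alpha) * v + self.alpha / 2   # share step: mix with uniform
```

An object like this exposes the same predict()/update() interface assumed in the earlier sketches, so it can serve as one of the two-action sub-routines in the Algorithm 1 sketch above.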
To better understand the implications of our bounds, consider the following thought experiment. If the learner knew the switch points (that is, $\{t : i_t \neq i_{t-1}\}$), which naturally divide the whole game into S intervals, she could simply pick any algorithm with optimal static regret ($\sqrt{\#\mathrm{rounds}\cdot\ln K}$) and run S instances of it, one for each interval, which, via a direct application of the Cauchy-Schwarz inequality, leads to switching regret $\sqrt{TS\ln K}$. Compared to bound (1), this implies that the price of not knowing the switch points is $\sqrt{TS\ln T}$. Similarly, if the learner knew not only the switch points but also which intervals share the same competitor, then she could naturally run n instances of the static algorithm, one for each set of intervals with the same competitor. Again by the Cauchy-Schwarz inequality, this leads to switching regret $\sqrt{Tn\ln K}$. Therefore, our bound implies that the price of not having any prior information about the benchmark sequence is still $\sqrt{TS\ln T}$.

Compared to existing methods, our framework is more flexible and allows one to plug in any combination of the algorithms listed in Propositions 1 and 2. This flexibility is crucial and allows us to solve the problems discussed in the following sections. The approach of [2] makes use of a sleeping expert framework, a special case of the confidence-rated expert framework. However, their approach is not a general reduction and does not allow plugging in different algorithms. Finally, we note that our construction also shares some similarity with the black-box approach of [14] for a multi-task learning problem.

3.2 Best of both worlds

To further demonstrate the power of our approach, we now show how to use our framework to construct a parameter-free algorithm that enjoys the best of both adversarial and stochastic environments, resolving the open problem of [35] (see Algorithm 2). The key is to derive an adaptive switching regret bound that replaces the dependence on T by the sum of the magnitudes of the instantaneous regrets, $\sum_t |r_t(i)|$, which previous works [19, 27] show is sufficient for adapting to the stochastic setting and achieving logarithmic regret.

⁴In fact, using the adaptive guarantees of AdaNormalHedge [27] or iProd/Squint [25], which replace the ln K dependence in Eq. (4) by a KL-divergence term, one can further improve the term $n\ln K$ in our bound to $n\ln\frac{K}{n}$, matching previous bounds. Since this improvement is small, we omit the details.

To achieve this goal, the first modification we need is to change the bias term in the loss of action 1 for $A_i$ from $5\eta$ to $5\eta|r_t(i)|$. Following the proof of Theorem 3, one can show that the dependence on $|\{t : i_t = i\}|$ now becomes $\sum_{t: i_t = i} |r_t(i)|$ for the regret against i. If we could tune η optimally in terms of this data-dependent quantity, this would imply logarithmic regret in the stochastic setting by the same reasoning as in [19, 27]. However, the difficulty is that the optimal tuning of η is unknown beforehand, and, more importantly, different actions require tuning η differently. To address this issue, at a high level we discretize the learning rate and pick M = Θ(ln T) exponentially increasing values (Line 1), and then make M = Θ(ln T) copies of each action $i \in [K]$, one for each learning rate $\eta_j$. More specifically, this means that the number of actions for A increases from K to KM, and so does the number of sub-routines with switching regret, now denoted $A_{ij}$ for $i \in [K]$ and $j \in [M]$. Different copies of an action i share the same loss $\ell_t(i)$ for A, while action 1 for $A_{ij}$ now suffers loss $5\eta_j|r_t(i)| - r_t(i)$ (Line 8). The rest of the construction remains the same. Note that selecting a copy of an action is the same as selecting the corresponding action, which explains the update rule for the sampling probability $p_t$ in Line 6, which marginalizes over j. Also note that for a vector in $\mathbb{R}^{KM}$ (e.g., $w_t$, $c_t$, $z_t$), we use (i, j) to index its coordinates for $i \in [K]$ and $j \in [M]$.
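A minimal sketch of the per-round bookkeeping of Algorithm 2 (Lines 6-8), assuming the meta-algorithm A and the sub-routines $A_{ij}$ have already supplied $w_t$ and $z_t$; the array layout and function name are illustrative only.

```python
import numpy as np

def algorithm2_round(w, z, loss, etas, rng):
    """One round of Algorithm 2, Lines 6-8.
    w: (K, M) sampling weights from A over the K*M copies.
    z: (K, M) probabilities of action 1 from the sub-routines A_ij.
    loss: (K,) loss vector l_t.  etas: (M,) learning-rate grid eta_1, ..., eta_M."""
    p = (z * w).sum(axis=1)                 # Line 6: marginalize over the M copies
    p /= p.sum()
    action = rng.choice(len(loss), p=p)
    r = p @ loss - loss                     # instantaneous regrets r_t(i) = p_t^T l_t - l_t(i)
    c = -z * r[:, None]                     # Line 7: c_t(i, j) = z_t(i, j)(l_t(i) - p_t^T l_t)
    sub_losses = 5 * etas[None, :] * np.abs(r)[:, None] - r[:, None]   # Line 8: action-1 losses
    return action, c, sub_losses
```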
Finally, with this new construction, we need algorithm A to exhibit a more adaptive static regret bound and, in some sense, to be aware of the fact that different actions now correspond to different learning rates. More precisely, we replace Condition 1 with the following condition.

Condition 3. There exists a constant C > 0 such that for any $\eta_1, \ldots, \eta_M \in (0, 1/5]$ and any loss sequence $c_1, \ldots, c_T \in [-2, 2]^{KM}$, algorithm A (possibly with knowledge of $\eta_1, \ldots, \eta_M$) produces sampling distributions $w_1, \ldots, w_T \in \Delta_{KM}$ and ensures the following static regret bound for all $i \in [K]$ and $j \in [M]$:⁵

  $\sum_{t=1}^T w_t^\top c_t - \sum_{t=1}^T c_t(i, j) \le C\Big(\frac{\ln(KM)}{\eta_j} + \eta_j\sum_{t=1}^T |w_t^\top c_t - c_t(i, j)|\Big)$.   (8)

Once again, this requirement is achievable by many existing algorithms, and we provide some examples below (see Appendix A for proofs).

Proposition 4. The following algorithms all satisfy Condition 3: a variant of Hedge (Algorithm 6 in Appendix A), Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].

We now state our main result for Algorithm 2 (see Appendix B.3 for the proof).

Theorem 5. Suppose algorithm A satisfies Condition 3 and $\{A_{ij}\}_{i\in[K], j\in[M]}$ all satisfy Condition 2. Algorithm 2 ensures that for any benchmark sequence $i_1, \ldots, i_T$ such that $\sum_{t=2}^T \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$, the following hold:
  In the adversarial setting, we have $R(i_{1:T}) = O\big(\sqrt{T(S\ln T + n\ln(K\ln T))}\big)$;
  In the stochastic setting (defined in Section 2), we have $R(i_{1:T}) = O\big(\sum_{i=1}^n \frac{S_i\ln T + \ln(K\ln T)}{\alpha_i}\big)$, where $S_i = 1 + \sum_{t=2}^T \mathbf{1}\{(i_{t-1} = i \wedge i_t \neq i) \vee (i_{t-1} \neq i \wedge i_t = i)\}$ satisfies $\sum_{i\in[n]} S_i \le 3S$.⁶

In other words, at a negligible price of ln ln T for the adversarial setting, our algorithm achieves logarithmic regret in the stochastic setting with favorable dependence on S and n. The best prior result is achieved by AdaNormalHedge.TV [27], with regret $O\big(\sqrt{T(S\ln(TK\ln T))}\big)$ for the adversarial case and $O\big(\sum_{i=1}^n \frac{S_i\ln(TK\ln T)}{\alpha_i}\big)$ for the stochastic case. We also remark that a variant of the algorithm of [8] with a doubling trick can achieve a guarantee similar to ours, but weaker in the sense that each $\alpha_i$ is replaced by $\min_i \alpha_i$. To the best of our knowledge this was previously unknown, and we provide the details in Appendix B.4 for completeness.

⁵In fact, an analogue of Eq. (3) with individual learning rates would also suffice, but we are not aware of any algorithms achieving such a guarantee.
⁶This definition of $S_i$ is the same as the one in the proof of Theorem 3.

Algorithm 3: A Sparse MAB Algorithm with Long-term Memory
  1  Input: parameters $\eta \le \frac{1}{500}$, γ, δ
  2  Define: regularizers $\psi(w) = \frac{1}{\eta}\sum_{i=1}^K w(i)\ln w(i) + \gamma\sum_{i=1}^K \ln\frac{1}{w(i)}$ and $\phi(z) = \frac{1}{z}$, and Bregman divergence $D_\phi(z, z') = \phi(z) - \phi(z') - \phi'(z')(z - z')$
  3  Initialize: $w_1 = \frac{1}{K}\mathbf{1}$ where $\mathbf{1} \in \mathbb{R}^K$ is the all-one vector, and $z_1(i) = 1$ for all $i \in [K]$
  4  for t = 1, 2, ... do
  5    Compute $\bar p_t = (1 - \eta)p_t + \frac{\eta}{K}\mathbf{1}$ where $p_t(i) \propto z_t(i)w_t(i)$ for all i
  6    Sample $I_t \sim \bar p_t$, receive $\ell_t(I_t)$, and construct the loss estimator $\hat\ell_t(i) = \frac{\ell_t(i)}{\bar p_t(i)}\mathbf{1}\{i = I_t\}$ for all i
  7    Set $r_t(i) = p_t^\top\hat\ell_t - \hat\ell_t(i)$ and $c_t(i) = -z_t(i)r_t(i) - \eta z_t(i)\hat\ell_t(i)^2$ for each $i \in [K]$
  8    Update $w_{t+1} = \mathrm{argmin}_{w\in\Delta_K}\ \sum_{\tau=1}^t w^\top c_\tau + \psi(w)$   (update of A)
  9    Update $z_{t+1}(i) = \mathrm{argmin}_{z\in[\delta,1]}\ -r_t(i)\,z + D_\phi(z, z_t(i))$ for each $i \in [K]$   (update of $A_i$)

4 Long-term Memory under Bandit Feedback

In this section, we move on to the bandit setting, where the learner only observes the loss of the selected action, $\ell_t(I_t)$, instead of $\ell_t$.
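Under bandit feedback, the algorithms in this section rely on the standard unbiased importance-weighted loss estimator (Line 6 of Algorithm 3). As a quick reference, a minimal standalone sketch; the function name and interface are illustrative only.

```python
import numpy as np

def importance_weighted_estimate(loss_of_pulled, pulled, sampling_dist):
    """Unbiased estimator of the full loss vector from bandit feedback:
    hat_l(i) = l(I_t) / p(I_t) if i == I_t and 0 otherwise, where I_t ~ p.
    Taking the expectation over I_t recovers l(i) for every arm i."""
    est = np.zeros_like(sampling_dist)
    est[pulled] = loss_of_pulled / sampling_dist[pulled]
    return est
```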
As mentioned in Section 1, one could directly generalize the approach of [8, 2, 13] to obtain a bound of order $O(\sqrt{TK(S\ln T + n\ln K)})$, a natural generalization of the full-information guarantee, but such a bound is not a meaningful improvement over (2), due to the polynomial dependence on K that is unavoidable for MAB in the worst case. Therefore, we consider a special case where the dependence on K is much smaller: the sparse MAB problem [10]. Specifically, in this setting we make the additional assumption that all loss vectors are ρ-sparse for some $\rho \in [K]$, that is, $\|\ell_t\|_0 \le \rho$ for all t. It was shown in [10] that for sparse MAB the static regret is of order $O(\sqrt{T\rho\ln K} + K\ln T)$, exhibiting a much more favorable dependence on K.

Negative result. To the best of our knowledge, there are no prior results on switching regret for sparse MAB. In light of bound (2), a natural conjecture would be that a switching regret of $O(\sqrt{T\rho S\ln(KT)} + KS\ln T)$ is achievable with S switches. Perhaps surprisingly, we show that this is in fact impossible.

Theorem 6. For any T, S, and $K \ge 2$ and any MAB algorithm, there exists a sequence of loss vectors that are 2-sparse such that the switching regret of this algorithm is at least $\Omega(\sqrt{TKS})$.

The high-level idea of the proof is to force the algorithm to over-focus on one good action and thus miss an even better action later. This is similar to the constructions of [15, Lemma 3] and [37, Theorem 4.1], and we defer the proof to Appendix C.1. This negative result implies that sparsity does not help improve the typical switching regret bound (2). In fact, since switching regret for MAB can be seen as a special case of the contextual bandit problem [6, 26], this result also immediately implies the following corollary, a sharp contrast with the positive result for the non-contextual case mentioned earlier (see Appendix C.1 for the definition of contextual bandits and related discussions).

Corollary 7. Sparse losses do not help improve the worst-case regret for contextual bandits.

Long-term memory to the rescue. Despite the above negative results, we next show how long-term memory can still help improve the switching regret for sparse MAB. Specifically, we use our general framework to develop a MAB algorithm whose switching regret is smaller than $O(\sqrt{TKS})$ whenever ρ and n are small while S and K are large. Note that this does not contradict Theorem 6, since in the construction of its proof n is as large as min{S, K}.

At a high level, our algorithm (Algorithm 3) works by constructing the standard unbiased importance-weighted loss estimator $\hat\ell_t$ (Line 6) and plugging it into our general framework (Algorithm 1). However, we emphasize that it is highly nontrivial to control the variance of these estimators without incurring a bad dependence on K in this framework, where two types of sub-routines interact with each other. To address this issue, we design specialized sub-algorithms A and $A_i$ to learn $w_t$ and $z_t(i)$ respectively.

For learning $w_t$, we essentially deploy the algorithm of [10] for sparse MAB, which is an instance of the standard follow-the-regularized-leader algorithm with a special hybrid regularizer combining the entropy and the log-barrier (Lines 2 and 8). However, note that the loss $c_t$ we feed to this algorithm is not sparse, so we cannot directly apply the guarantee from [10]; it turns out that one can still utilize the implicit exploration of this algorithm, as shown in our analysis.
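A minimal numerical sketch of the FTRL update in Line 8 of Algorithm 3 with the hybrid (negative-entropy plus log-barrier) regularizer. The constrained minimization is solved here with a generic off-the-shelf solver purely for illustration; the solver choice, tolerances, and clipping are ours and not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_ftrl_update(cumulative_loss, eta, gamma):
    """w_{t+1} = argmin_{w in simplex} <w, C_t> + psi(w), where C_t = c_1 + ... + c_t and
    psi(w) = (1/eta) * sum_i w_i ln w_i + gamma * sum_i ln(1/w_i)  (Line 2 of Algorithm 3)."""
    K = len(cumulative_loss)

    def objective(w):
        return (w @ cumulative_loss
                + (1.0 / eta) * np.sum(w * np.log(w))   # negative-entropy part
                - gamma * np.sum(np.log(w)))            # log-barrier part: gamma * sum ln(1/w_i)

    result = minimize(
        objective,
        x0=np.full(K, 1.0 / K),                                   # start from uniform
        bounds=[(1e-12, 1.0)] * K,                                # keep w strictly positive
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        method="SLSQP",
    )
    w = np.clip(result.x, 1e-12, None)
    return w / w.sum()
```

In practice this minimizer can be computed more efficiently (e.g., with a Newton step on the one-dimensional dual), but the generic form above suffices to illustrate the update.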
Compared to Algorithm 1, we also incorporate an extra bias term $\eta z_t(i)\hat\ell_t(i)^2$ in the definition of $c_t$ (Line 7), which is important for canceling the large variance of the loss estimator.

For learning $z_t(i)$ for each i, we design a new algorithm that is an instance of the standard Online Mirror Descent algorithm (see e.g. [22]). Recall that this is a one-dimensional problem, as we are trying to learn the distribution $(1 - z_t(i),\ z_t(i))$ over the actions {0, 1}. We design a special one-dimensional regularizer $\phi(z) = \frac{1}{z}$, which can be seen as a one-sided log-barrier,⁷ to bias towards action 1. Technically, this provides a special local-norm guarantee that is critical for our analysis and may be of independent interest (see Lemma 14 in Appendix C.2). In addition, we remove the bias term in the loss for action 1 (so it is only $-r_t(i)$ now), as it does not help in the bandit case, and we also force $z_t(i)$ to be at least δ for some parameter δ, which is important for achieving switching regret. Line 9 summarizes the update for $z_t(i)$. Finally, we also enforce a small amount of uniform exploration by sampling $I_t$ from $\bar p_t$, a smoothed version of $p_t$ (Line 5).

We present the main result for our algorithm below (proven in Appendix C.2).

Theorem 8. With $\eta = \max\big\{\big(\frac{S}{\rho^2 T}\big)^{1/3}, \sqrt{\frac{\rho}{T}}\big\}$, $\delta = \sqrt{\frac{S}{T\eta n}}$, and $\gamma = 200K^2$, Algorithm 3 ensures

  $R(i_{1:T}) = O\big((\rho S)^{1/3}(nT)^{2/3} + n\sqrt{T\rho\ln K} + nK^3\ln T\big)$   (9)

for any sequence of ρ-sparse losses $\ell_1, \ldots, \ell_T$ and any benchmark sequence $i_1, \ldots, i_T$ such that $\sum_{t=2}^T \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$.

In the case when ρ and n are constants, our bound (9) becomes $O(S^{1/3}T^{2/3} + K^3\ln T)$, which improves over the existing bound $O(\sqrt{TKS\ln(TK)})$ when $(T/S)^{1/3} < K < (TS)^{1/5}$ (also recall the example in Section 1 where our bound is sublinear in T while existing bounds become vacuous). As a final remark, one might wonder whether similar best-of-both-worlds results are also possible for MAB in terms of switching regret, given the positive results for static regret [9, 33, 5, 32, 36, 38]. We point out that the answer is negative: the proof of [37, Theorem 4.1] implicitly implies that, even with one switch, logarithmic regret is impossible for MAB in the stochastic setting.

5 Conclusion

In this work, we propose a simple reduction-based approach to obtaining long-term memory regret guarantees. By plugging various existing algorithms into this framework, we not only obtain new algorithms for this problem in the adversarial case, but also resolve the open problem of Warmuth and Koolen [35], which asks for a single algorithm achieving the best of both stochastic and adversarial environments in this setup. We also extend our results to the bandit setting and show both negative and positive results. One clear open question is whether our bound for the bandit case (Theorem 8) can be improved, and more generally what the best achievable bound is in this case.

Acknowledgments. The authors would like to thank Alekh Agarwal, Sébastien Bubeck, Dylan Foster, Wouter Koolen, Manfred Warmuth, and Chen-Yu Wei for helpful discussions. Kai Zheng and Liwei Wang were supported by the National Key R&D Program of China (no. 2018YFB1402600) and BJNSF (L172037). Haipeng Luo was supported by NSF Grant IIS-1755781. Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.

⁷The usual log-barrier regularizer (see e.g. [16, 3, 36]) would be $\ln\frac{1}{z} + \ln\frac{1}{1-z}$ in this case.

References

[1] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret.
In International Conference on Algorithmic Learning Theory, pages 290–304. Springer, 2012.
[2] D. Adamskiy, M. K. Warmuth, and W. M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems, pages 135–143, 2012.
[3] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017.
[4] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
[5] P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120, 2016.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[7] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.
[8] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
[9] S. Bubeck and A. Slivkins. The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, 2012.
[10] S. Bubeck, M. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In Algorithmic Learning Theory, pages 111–127, 2018.
[11] S. Bubeck, Y. Li, H. Luo, and C.-Y. Wei. Improved path-length regret bounds for bandits. In Conference on Learning Theory, 2019.
[12] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
[13] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, pages 980–988, 2012.
[14] P. Christiano. Manipulation-resistant online learning. PhD thesis, University of California, Berkeley, 2017.
[15] A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411, 2015.
[16] D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and E. Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734–4742, 2016.
[17] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[18] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, 1997.
[19] P. Gaillard, G. Stoltz, and T. Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196, 2014.
[20] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.
[21] E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity (ECCC), volume 14, 2007.
[22] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[23] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
[24] K.-S. Jun, F. Orabona, S. Wright, R. Willett, et al.
Online learning for changing environments using coin betting. Electronic Journal of Statistics, 11(2):5282–5310, 2017.
[25] W. M. Koolen and T. Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.
[26] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
[27] H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304, 2015.
[28] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Conference on Learning Theory, pages 1739–1776, 2018.
[29] H. T. Nguyen and K. Franke. Adaptive intrusion detection system via online machine learning. In 2012 12th International Conference on Hybrid Intelligent Systems (HIS), pages 271–277. IEEE, 2012.
[30] B. A. A. Nunes, K. Veenstra, W. Ballenthin, S. Lukin, and K. Obraczka. A machine learning framework for TCP round-trip time estimation. EURASIP Journal on Wireless Communications and Networking, 2014(1):47, 2014.
[31] T. Santarra. Communicating Plans in Ad Hoc Multiagent Teams. PhD thesis, UC Santa Cruz, 2019.
[32] Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, 2017.
[33] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295, 2014.
[34] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pages 1593–1601, 2014.
[35] M. K. Warmuth and W. M. Koolen. Open problem: Shifting experts on easy data. In Conference on Learning Theory, pages 1295–1298, 2014.
[36] C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, pages 1263–1291, 2018.
[37] C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems, pages 3972–3980, 2016.
[38] J. Zimmert, H. Luo, and C.-Y. Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, 2019.