Online Learning with Imperfect Hints

Aditya Bhaskara 1  Ashok Cutkosky 2 3  Ravi Kumar 2  Manish Purohit 2

Abstract

We consider a variant of the classical online linear optimization problem in which at every step, the online player receives a hint vector before choosing the action for that round. Rather surprisingly, it was shown that if the hint vector is guaranteed to have a positive correlation with the cost vector, then the online player can achieve a regret of $O(\log T)$, thus significantly improving over the $O(\sqrt{T})$ regret in the general setting. However, the result and analysis require the correlation property at all time steps, thus raising the natural question: can we design online learning algorithms that are resilient to bad hints? In this paper we develop algorithms and nearly matching lower bounds for online learning with imperfect directional hints. Our algorithms are oblivious to the quality of the hints, and the regret bounds interpolate between the always-correlated hints case and the no-hints case. Our results also generalize, simplify, and improve upon previous results on optimistic regret bounds, which can be viewed as an additive version of hints.

1University of Utah, Salt Lake City, Utah, USA. 2Google Research, Mountain View, California, USA. 3Boston University, Boston, Massachusetts, USA. Correspondence to: Aditya Bhaskara , Ashok Cutkosky , Ravi Kumar , Manish Purohit .

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

In the standard online convex optimization model (Zinkevich, 2003), at each time step $t$, an algorithm first plays a point $x_t$ in a convex set, and then the system responds with a convex loss function. The loss incurred by the algorithm is the function evaluated at the point $x_t$. The performance of an algorithm is measured using the concept of regret: the difference between the total loss the algorithm incurs and the loss of the best fixed point it could have played in hindsight. Algorithms with sub-linear regret are hence desirable. The framework of online convex optimization is quite powerful and general, and has been extensively studied. Many important problems, such as portfolio selection, learning from a mixture of experts, matrix completion, recommendation systems, and certain online combinatorial optimization problems, can be cast in this framework. For a detailed exposition, see the books by Hazan (2016) and Shalev-Shwartz (2011).

An important special case of online convex optimization is when the loss function is actually linear, i.e., the loss function is given by a cost vector. In this case, algorithms with regret $O(\sqrt{T})$, where $T$ is the number of steps, are known (Zinkevich, 2003; Kalai & Vempala, 2005); furthermore, this bound is also optimal (Cesa-Bianchi & Lugosi, 2006). In fact, from a regret point of view, the linear case is the hardest, since if the loss function is strongly convex, then there are algorithms achieving only $O(\log T)$ regret (Hazan et al., 2007).

There has been some effort to better understand the regret landscape of linear loss functions, especially on how to circumvent the pessimistic $\Omega(\sqrt{T})$ barrier. A particularly intriguing line of work was initiated by Hazan & Megiddo (2007), who modeled a notion of predictability in online learning settings. In their model, the algorithm knows the first coordinate of the cost vector at all time steps.
Under this assumption, they showed a regret bound of $O\big(\frac{d^2}{\alpha}\log T\big)$ when the convex set is the Euclidean ball, where $\alpha$ is the magnitude of the first coordinate that is known to the algorithm and $d$ is the dimension of the space. Their work was subsequently generalized and extended by Dekel et al. (2017), who considered a scenario where the online algorithm is provided with a directional hint at each step; this hint is assumed to be always weakly but positively correlated with the cost vector. They showed a regret bound of $O\big(\frac{d}{\alpha}\log T\big)$, where $\alpha$ is the amount of correlation present in the hint.

The biggest drawback of these previous works is that they require the hints to be helpful at every time step. Clearly, this is a stringent requirement that may easily fail to hold, especially if the hints are provided by, say, a learning algorithm! In such a scenario, one can only expect the hints to be good on average, or to have other probabilistic guarantees of goodness. This means in particular that some of the hints could potentially be very misleading. Since the algorithm is oblivious to the quality of each individual hint, it is desirable to have an algorithm that is both consistent and robust: it should utilize the good hints as well as possible to minimize regret, while at the same time not being damaged too much by bad hints. Specifically, the algorithm should never incur worse than $O(\sqrt{T})$ regret, as otherwise it would have been better off not using any hints at all! Such ML-provided hints and their role in improving combinatorial online algorithms have generated a lot of recent interest for problems such as caching (Lykouris & Vassilvitskii, 2018; Rohatgi, 2020; Kumar et al., 2020), ski-rental (Kumar et al., 2018), bipartite matching (Kumar et al., 2019), and scheduling (Kumar et al., 2018; Lattanzi et al., 2020). This serves as another motivation for our work.

Formulation. We consider the online convex optimization problem with a linear loss function in the presence of hints that can be imperfect. At each time step $t$, the algorithm is provided with a hint vector $h_t$. After the algorithm plays a point $x_t$, a cost vector $c_t$ is revealed and the algorithm incurs a loss of $\langle c_t, x_t\rangle$. The hint vector $h_t$ typically gives non-trivial information about $c_t$. Formally, given a parameter $\alpha$, a hint $h_t$ is said to be good if it satisfies $\langle c_t, h_t\rangle \ge \alpha \|c_t\|^2$ and bad otherwise.
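To make the definition concrete, the following sketch (in Python with NumPy; the function name and array conventions are ours, not the paper's) classifies each time step as good or bad for a given $\alpha$:

```python
import numpy as np

def classify_hints(costs, hints, alpha):
    """Return the index sets G_{T,alpha} (good) and B_{T,alpha} (bad):
    a hint h_t is good if <c_t, h_t> >= alpha * ||c_t||^2."""
    good, bad = [], []
    for t, (c, h) in enumerate(zip(costs, hints)):
        if np.dot(c, h) >= alpha * np.dot(c, c):
            good.append(t)
        else:
            bad.append(t)
    return good, bad
```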
Our results. We design an algorithm that achieves a regret bound that smoothly interpolates between the two extreme cases: when the hints $h_t$ are good at all time steps, and when the hints are arbitrarily wrong. In particular, for any $\alpha > 0$, we obtain a regret of $O\big(\sqrt{B} + \frac{\sqrt{1+B}}{\alpha}\log(1 + T - B)\big)$, where $B$ is the number of time steps when the hints are bad, i.e., $\langle c_t, h_t\rangle < \alpha\|c_t\|^2$. The dependence on $B$ turns out to be nearly optimal, as we show in Section 4. We also generalize these results to the case when the underlying feasible space is $(q,\mu)$-uniformly convex and show matching lower bounds. For the formal statements, see Theorems 3.1 and A.3.

Surprisingly, our algorithm simultaneously also yields improved regret guarantees when the hint $h_t$ is viewed as an additive estimate of the cost vector: a hint is good if $\|c_t - h_t\|$ is small. This notion of hint was considered by Rakhlin & Sridharan (2013); Hazan & Kale (2010); Mohri & Yang (2016); Steinhardt & Liang (2014), who gave regret bounds of the form $O\big(\sqrt{\sum_{t=1}^T \|c_t - h_t\|^2}\big)$. We achieve a regret of $O\big(\sqrt{\sum_{t=1}^T (\|c_t - h_t\|^2 - \|h_t\|^2)}\big)$, up to logarithmic factors (see Corollary 3.7).

Even when restricted to the special case where the hints are all good, our result improves upon the regret bound of Dekel et al. (2017) in multiple ways. First, our regret bound is dimension-free, i.e., better by a factor of the dimension of the space. Second, our algorithm is significantly faster: their work relied on expensive matrix computations requiring $O(d^2)$ time per round, while our algorithm runs in $O(d)$ time, matching simple gradient descent. Third, our proofs are simpler, as we rely on loss functions that are easily seen to be strongly convex (as opposed to proving exp-concavity). Furthermore, for the case of $q > 2$, Dekel et al. (2017) only obtained comparable regret bounds when all the hints are in the same direction; we generalize this in two ways, allowing different hints at each step and a small number of bad hints.

Finally, we consider the unconstrained variant of online optimization, where the algorithm is allowed to play any point $x_t \in \mathbb{B}$, while achieving a regret bound that depends on $\|u\|$ for all $u \in \mathbb{B}$. This setting is discussed in Section 5.

2. Preliminaries

Let $\mathbb{B}$ be a real Banach space with norm $\|\cdot\|$ and let $\mathbb{B}^*$ be its dual space with norm $\|\cdot\|_*$. Let $c = c_1, c_2, \ldots$ be cost vectors in $\mathbb{B}^*$ such that $\|c_t\|_* \le 1$. In the classical online learning setting, $c_1, c_2, \ldots$ arrive one by one, and at time $t$ an algorithm $A$ responds with a vector $x_t \in \mathbb{B}$ before $c_t$ arrives. The regret of the algorithm $A$ for a vector $u \in \mathbb{B}$ is
\[ R_A(u, c, T) = \sum_{t=1}^T \langle c_t, x_t - u\rangle, \]
where we use the $\langle\cdot,\cdot\rangle$ notation to denote the application of a dual vector in $\mathbb{B}^*$ to a vector in $\mathbb{B}$. (For instance, if $\mathbb{B}$ is the space $\mathbb{R}^d$ with $\|\cdot\|$ being the $\ell_2$-norm, we have $\mathbb{B}^* = \mathbb{B}$, and $\langle\cdot,\cdot\rangle$ corresponds to the standard inner product.)

We consider the case when there are hints available to the algorithm. Let $h = h_1, h_2, \ldots$ be the hints, where each hint $h_t \in \mathbb{B}$, $\|h_t\| \le 1$, is available to the algorithm $A$ at time $t$; this hint is available before $A$ responds with $x_t$. The regret definition is the same and is denoted $R_A(u, c, T \mid h)$.

The hints need not be perfect. To capture this, let $\alpha > 0$ be a fixed threshold. We define $G_{T,\alpha}$ to be the set of indices $t$ where the hint $h_t$ is good, i.e., has a large correlation with $c_t$; similarly, we define $B_{T,\alpha}$ to be the set of indices where the hint is bad. Formally, we define:
\[ G_{T,\alpha} = \{t \le T : \langle c_t, h_t\rangle \ge \alpha\|c_t\|_*^2\}, \quad\text{and}\quad B_{T,\alpha} = \{t \le T : \langle c_t, h_t\rangle < \alpha\|c_t\|_*^2\}. \]
Let $B_T = B_{T,0}$, i.e., the set of time steps when $h_t$ is negatively correlated with $c_t$. We will also use a compressed-sum notation for indexed variables: $a_{1:t} = \sum_{i=1}^t a_i$.

Let $K = \{x \in \mathbb{B} : \|x\| \le 1\}$. We consider two settings: a constrained setting where we must choose $x_t \in K$, and an unconstrained setting sans this restriction. In the former case, we will be concerned only with bounding $R_A(u, c, T)$ for $u \in K$, while in the latter we will consider any $u \in \mathbb{B}$.

Finally, we establish some notation about convex functions and spaces. For a convex function $f$, we use $\partial f(x) \subseteq \mathbb{B}^*$ to denote the set of subgradients of $f$ at $x$. We say that $f$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|$ if for all $x, y$ and $g \in \partial f(x)$, we have $f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\mu}{2}\|x - y\|^2$. We say that the Banach space $\mathbb{B}$ is $\mu$-strongly convex if the function $\frac12\|x\|^2$ is $\mu$-strongly convex with respect to $\|\cdot\|$ for some $\mu > 0$. We note that this notion is equivalent to the definition of strong convexity of a space used in Dekel et al. (2017); e.g., see the discussion after Definition 4.16 in Pisier (2016). Further, a Banach space is reflexive if the natural injection $i : \mathbb{B} \to \mathbb{B}^{**}$ given by $\langle i(x), c\rangle = \langle c, x\rangle$ is an isomorphism of Banach spaces.
Note that all finite-dimensional Banach spaces are reflexive. Throughout this paper, we assume that $\mathbb{B}$ is reflexive and $\mu$-strongly convex. A typical example is $\mathbb{B} = \mathbb{R}^d$ with $\|\cdot\|$ equal to the standard $\ell_2$ norm; in this case $\mathbb{B}$ is reflexive and 1-strongly convex.

3. Constrained Learning with Imperfect Hints

We first consider the constrained setting of the problem, in which the online algorithm must choose a point $x_t \in K$ at all time steps $t \le T$. To illustrate our main ideas, we first focus on the case when the Banach space $\mathbb{B}$ is $\mu$-strongly convex. Our techniques also extend to general $(q,\mu)$-uniformly convex spaces, and we present this extension in Appendix A.

Theorem 3.1. Consider the online linear optimization problem over a Banach space with a $\mu$-strongly convex norm, where at every step we receive a hint vector $h_t$ and need to output a point $x_t \in K$. Then there is an efficient algorithm that, for any $\alpha > 0$, achieves regret
\[ O\left( \sqrt{\frac{\sum_{t\in B_{T,\alpha}} \|c_t\|_*^2}{\mu}} \;+\; \frac{r_T}{\mu\alpha}\,\log\Big(1 + \mu\sum_{t\in G_{T,\alpha}} \|c_t\|_*^2\Big) \right), \quad\text{where } r_T = \sqrt{1 + \sum_{t\in B_T} |\langle c_t, h_t\rangle|}. \]

We remark on the order of quantifiers in the theorem: the bound holds for any $\alpha > 0$, and the algorithm itself is oblivious to $\alpha$. Thus, if we have $B$ bad hints (i.e., $|B_{T,\alpha}| = B$), then $r_T \le \sqrt{1+B}$ and the number of good steps is at most $T - B$, so we obtain an upper bound of $O\big(\sqrt{B} + \frac{\sqrt{1+B}}{\alpha}\log(1 + T - B)\big)$. Also, the bound is never larger than $O(\sqrt{T})$: if $\alpha$ is so large that $G_{T,\alpha} = \emptyset$, then the first term is the only one that remains, and it is at most $O(\sqrt{T/\mu})$.

Outline of the algorithm. Our algorithm (denoted ALG) can be best viewed as a procedure that interacts with an inner online convex optimization subroutine, which we denote by $A$. At every step, ALG receives a prediction $\bar{x}_t$ from $A$, which it modifies using the hint $h_t$ to produce the played point $\hat{x}_t$. Then the algorithm receives $c_t$, using which it produces a function $\ell_t$ (which depends on $h_t$, $c_t$, and an additional parameter $r_t$ that ALG maintains). This function, along with relevant parameters, is passed to $A$. The key properties that we show are: (a) the regret of ALG can be related to the regret of the procedure $A$, and (b) the functions $\ell_t$ are strongly convex, and thus the regret of $A$ can be bounded efficiently using known techniques. The parameter $r_t$ encapsulates the confidence in the hints seen so far. Algorithms 1 and 2 describe the procedures ALG and $A$.

Algorithm 1 OLO with imperfect hints (Procedure ALG)
input: hints $h_t$ followed by cost vectors $c_t$
Define $\lambda_0 = 1/\mu$ and $r_1 = 1$.
for $t = 1 \ldots T$ do
  Get hint $h_t$
  Get $\bar{x}_t$ from procedure $A$, and set $\hat{x}_t = \bar{x}_t + \frac{\|\bar{x}_t\|^2 - 1}{2 r_t}\, h_t$
  Play $\hat{x}_t$ and receive cost $c_t \in \mathbb{B}^*$
  if $\langle c_t, h_t\rangle < 0$ then set $r_{t+1} = \sqrt{r_t^2 + |\langle c_t, h_t\rangle|}$, else set $r_{t+1} = r_t$
  Define $\sigma_t = |\langle c_t, h_t\rangle|\mu / r_t$
  Define $\lambda_t$ as the solution to $\lambda_t = \frac{\|c_t\|_*^2}{\sigma_{1:t} + \mu\lambda_{1:t}}$
  Send the loss function $\ell_{h_t, r_t, c_t}(x)$ and $\lambda_t$ to procedure $A$ // (loss function defined in (1))
end for

Algorithm 2 FTRL with adaptive rate (Procedure $A$)
input: convex functions $\ell_t$, parameters $\lambda_t$
At $t = 1$ return $\bar{x}_1 = 0$
for $t = 2 \ldots T$ do
  Output $\bar{x}_t := \operatorname{argmin}_{\|x\|\le 1}\; \ell_{1:t-1}(x) + \frac{\lambda_{0:t-1}}{2}\|x\|^2$
  Receive loss $\ell_t$ and parameter $\lambda_t$
end for

Intuitively, given a prediction $\bar{x}_t$, we should be able to improve the loss $\langle c_t, \bar{x}_t\rangle$ by instead playing $\hat{x}_t = \bar{x}_t - h_t$, assuming the hint $h_t$ is positively correlated with $c_t$. However, there are two immediate problems with this approach. First, if $h_t$ is negatively correlated with $c_t$, then we have actually increased the loss. Second, this shift may cause $\hat{x}_t$ to leave the set $K$, which is not allowed. We address both concerns by setting $\hat{x}_t = \bar{x}_t - \delta_{r_t}(\bar{x}_t)\, h_t$, where $\delta_{r_t}(\bar{x}_t) = \frac{1 - \|\bar{x}_t\|^2}{2 r_t}$ is a carefully chosen scale factor.
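To make the two procedures concrete, here is a minimal runnable sketch, specialized to $\mathbb{B} = \mathbb{R}^d$ with the $\ell_2$ norm (so $\mu = 1$). All variable names are ours, and the FTRL step of Algorithm 2 is computed in closed form: in this special case the argmin is a scaled negative gradient sum projected onto the unit ball. This is an illustration of Algorithms 1 and 2 under these assumptions, not a reference implementation.

```python
import numpy as np

def alg_with_hints(hints, costs):
    """Sketch of Algorithm 1 (ALG) driving the FTRL subroutine of Algorithm 2,
    specialized to R^d with the l2 norm (mu = 1). Returns the total loss."""
    T, d = costs.shape
    lam_sum = 1.0           # lambda_{0:t-1}; initialized to lambda_0 = 1/mu = 1
    sigma_sum = 0.0         # sigma_{1:t-1}, total strong convexity of past surrogates
    grad_sum = np.zeros(d)  # c_{1:t-1}, the linear part of ell_{1:t-1}
    r = 1.0                 # r_t, the confidence parameter
    total_loss = 0.0
    for t in range(T):
        h, c = hints[t], costs[t]
        # Algorithm 2 (FTRL): argmin_{||x|| <= 1} <c_{1:t-1}, x> + (S/2)||x||^2,
        # where S collects the quadratic parts of the surrogate losses (1) and the
        # regularizers; the minimizer is -grad_sum/S projected onto the unit ball.
        S = sigma_sum + lam_sum
        x_bar = -grad_sum / S
        nrm = np.linalg.norm(x_bar)
        if nrm > 1.0:
            x_bar = x_bar / nrm
        # shift the prediction against the hint, scaled to preserve feasibility
        x_hat = x_bar - (1.0 - np.dot(x_bar, x_bar)) / (2.0 * r) * h
        total_loss += float(np.dot(c, x_hat))
        corr = float(np.dot(c, h))
        sigma_t = abs(corr) / r          # strong convexity of the new surrogate
        sigma_sum += sigma_t
        if corr < 0.0:                   # bad hint: grow the confidence parameter
            r = np.sqrt(r * r + abs(corr))
        # lambda_t solves lambda_t = ||c_t||^2 / (sigma_{1:t} + lambda_{1:t});
        # since lambda_{1:t} contains lambda_t, this is a quadratic equation
        b = sigma_sum + (lam_sum - 1.0)  # sigma_{1:t} + lambda_{1:t-1}
        lam_t = (-b + np.sqrt(b * b + 4.0 * np.dot(c, c))) / 2.0
        lam_sum += lam_t
        grad_sum += c                    # pass the linear part of (1) to FTRL
    return total_loss
```

Note that when a hint is perfectly correlated with the cost, the shift term strictly reduces the per-round loss whenever $\|\bar{x}_t\| < 1$, which is the mechanism behind the logarithmic regret.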
The surrogate loss function used in the algorithm is:
\[ \ell_{h_t, r_t, c_t}(x) = \langle c_t, x\rangle + \frac{|\langle c_t, h_t\rangle|}{2 r_t}\,\big(\|x\|^2 - 1\big). \tag{1} \]
It is clear from the description that, as the algorithm proceeds, $r_t$ is monotonically non-decreasing, and hence $r_t \ge 1$ for all $t$. We first demonstrate that Algorithm 1 always plays a feasible point, i.e., $\hat{x}_t \in K$ for all $t$.

Lemma 3.2. For any $t$, $\|\hat{x}_t\| \le 1$. In other words, the point $\hat{x}_t$ played by Algorithm 1 is always feasible.

Proof. From the description of $A$, $\|\bar{x}_t\| \le 1$. Thus, since $r_t \ge 1$ and $\|h_t\| \le 1$, by the triangle inequality we have
\[ \|\hat{x}_t\| \le \|\bar{x}_t\| + \frac{1 - \|\bar{x}_t\|^2}{2 r_t}\,\|h_t\| \le \|\bar{x}_t\| + \frac{(1 - \|\bar{x}_t\|)(1 + \|\bar{x}_t\|)}{2}. \]
This is clearly $\le 1$, as $\|\bar{x}_t\| \le 1$ (so that $(1 + \|\bar{x}_t\|)/2 \le 1$).

We next establish some basic properties of the surrogate loss function.

Lemma 3.3. Let $\ell_t$ denote $\ell_{h_t,r_t,c_t}$ defined in (1). This function satisfies:
1. If $\langle c_t, h_t\rangle \ge 0$, then $\ell_t(\bar{x}_t) = \langle c_t, \hat{x}_t\rangle$.
2. If $\langle c_t, h_t\rangle < 0$, then $\langle c_t, \hat{x}_t\rangle \le \ell_t(\bar{x}_t) + \frac{|\langle c_t, h_t\rangle|}{r_t}$.
3. For all $u \in \mathbb{B}$ with $\|u\| \le 1$, $\ell_t(u) \le \langle c_t, u\rangle$.
4. $\ell_t(x)$ is $\frac{|\langle c_t, h_t\rangle|\mu}{r_t}$-strongly convex.
5. $\ell_t(x)$ is $2\|c_t\|_*$-Lipschitz on $K$.

Proof. The first three properties are immediate from the definitions of $\ell_t$ and $\hat{x}_t$, together with the facts that $\|\bar{x}_t\| \le 1$ and $r_t \ge 1$. The fourth follows from the fact that $\frac12\|x\|^2$ is $\mu$-strongly convex, and that adding a convex function to a strongly convex function preserves strong convexity. The last property is a consequence of the fact that $\|x\|^2$ is 2-Lipschitz inside the unit ball (which follows from $\|x\|^2 - \|y\|^2 = (\|x\| + \|y\|)(\|x\| - \|y\|)$), together with $|\langle c_t, h_t\rangle| \le \|c_t\|_*$ and $r_t \ge 1$.

This implies the following lemma, which is crucial for our argument: it relates the regret of ALG to the regret of FTRL (procedure $A$). Recall the definition of $B_T$ from before (the time steps when the hints are negatively correlated with the cost vector).

Lemma 3.4. Let $u \in \mathbb{B}$ satisfy $\|u\| \le 1$, and let $\ell_t$ be shorthand for $\ell_{h_t,r_t,c_t}$ as before. Then
\[ R_{ALG}(u, c, T) \le R_A(u, \ell, T) + \sum_{t\in B_T} \frac{|\langle c_t, h_t\rangle|}{r_t}. \tag{2} \]

Proof. By definition and Property 3 of Lemma 3.3,
\[ R_{ALG}(u, c, T) = \sum_t \big(\langle c_t, \hat{x}_t\rangle - \langle c_t, u\rangle\big) \le \sum_t \big(\langle c_t, \hat{x}_t\rangle - \ell_t(u)\big). \]
Now, using the first two properties: when the hint is positively correlated, i.e., $\langle c_t, h_t\rangle \ge 0$, we have $\langle c_t, \hat{x}_t\rangle = \ell_t(\bar{x}_t)$; otherwise (i.e., $t \in B_T$), we have $\langle c_t, \hat{x}_t\rangle \le \ell_t(\bar{x}_t) + \frac{|\langle c_t, h_t\rangle|}{r_t}$. Summing over $t$ completes the proof of the lemma.

We bound the first term in (2) using known results for FTRL, and the second term by the following simple lemma.

Lemma 3.5. From our definition of $r_t$, we have
\[ \sum_{t\in B_T} \frac{|\langle c_t, h_t\rangle|}{r_t} \le 2\sqrt{1 + \sum_{t\in B_T} |\langle c_t, h_t\rangle|} = 2 r_T. \]

Proof. From our algorithm, note that $r_t$ is precisely $\sqrt{1 + \sum_{\tau < t,\,\tau\in B_T} |\langle c_\tau, h_\tau\rangle|}$. Writing $a_t = |\langle c_t, h_t\rangle| \le 1$ for $t \in B_T$, we have $r_t^2 \ge a_{1:t}$ (with the sums restricted to $B_T$), and the claim follows from the standard inequality $\sum_t a_t/\sqrt{a_{1:t}} \le 2\sqrt{a_{1:T}} \le 2 r_T$.

Lemma 3.6. For any $\alpha > 0$ and $\|u\| \le 1$, the regret $R_A(u, \ell, T)$ is at most
\[ O\left( \sqrt{\frac{\sum_{t\in B_{T,\alpha}} \|c_t\|_*^2}{\mu}} \;+\; \frac{r_T}{\mu\alpha}\,\log\Big(1 + \mu\sum_{t\in G_{T,\alpha}} \|c_t\|_*^2\Big) \right), \quad\text{where } r_T = \sqrt{1 + \sum_{t\in B_T} |\langle c_t, h_t\rangle|}. \]

Proof. Note that $\ell_t = \ell_{h_t,r_t,c_t}$ is $\sigma_t$-strongly convex, as we observed earlier, so that the function $\ell_{1:t}(x) + \frac{\lambda_{0:t-1}}{2}\|x\|^2$ is $(\sigma_{1:t} + \mu\lambda_{0:t-1})$-strongly convex. Then, using the analysis of the FTRL procedure (Theorem 1 of McMahan (2017)), letting $g_t$ be an arbitrary subgradient of $\ell_t$ at $\bar{x}_t$, we obtain:
\[ R_A(u, \ell, T) \le \lambda_{0:T} + \sum_{t=1}^T \frac{\|g_t\|_*^2}{\sigma_{1:t} + \mu\lambda_{0:t-1}}. \]
Since $\ell_t$ is $2\|c_t\|_*$-Lipschitz (Lemma 3.3), we have $\|g_t\|_*^2 \le 4\|c_t\|_*^2$, so the regret is at most:
\[ \lambda_{0:T} + 4\sum_{t=1}^T \frac{\|c_t\|_*^2}{\sigma_{1:t} + \mu\lambda_{0:t-1}}. \]
Next, observe that since $\|c_t\|_* \le 1$, we must have $\lambda_t \le \frac{1}{\mu} = \lambda_0$ for all $t$, so that $\lambda_{0:t-1} \ge \lambda_{1:t}$. Therefore the regret is at most
\[ \frac{1}{\mu} + \sum_{t=1}^T \left( \lambda_t + \frac{4\|c_t\|_*^2}{\sigma_{1:t} + \mu\lambda_{1:t}} \right). \]
Now, we can use our choice of $\lambda_t$ to appeal to a result of Hazan et al. (2008); see Lemma 3.1 of their paper. (We also reproduce a slightly more general version of this result in Lemma B.10 for completeness.) This lets us replace our choice of $\lambda_t$ with any other choice, up to constant factors, yielding:
\[ R_A(u, \ell, T) \le \frac{1}{\mu} + 4\,\min_{\lambda'}\, \sum_{t=1}^T \left( \lambda'_t + \frac{\|c_t\|_*^2}{\sigma_{1:t} + \mu\lambda'_{0:t}} \right). \]
Let us now show how to pick $\lambda'_t$ depending on the parameter $\alpha > 0$, giving the bound in the lemma.
Define $Q_\alpha = \sum_{t\in B_{T,\alpha}} \|c_t\|_*^2$, i.e., the total squared norm over the time steps where the desired correlation condition between the hint and the cost vector is not met. Now set $\lambda'_1 = \sqrt{(1+Q_\alpha)/\mu}$ and $\lambda'_t = 0$ for $t > 1$. Then $R_A(u, \ell, T)$ is at most:
\[ \frac{1}{\mu} + 4\sqrt{\frac{1+Q_\alpha}{\mu}} + 4\sum_{t=1}^T \frac{\|c_t\|_*^2}{\sigma_{1:t} + \sqrt{\mu(1+Q_\alpha)}}. \]
We can separate the sum into $t \in B_{T,\alpha}$ and the indices outside (i.e., in $G_{T,\alpha}$). This gives:
\[ \sum_{t=1}^T \frac{\|c_t\|_*^2}{\sigma_{1:t} + \sqrt{\mu(1+Q_\alpha)}} \le \frac{Q_\alpha}{\sqrt{\mu(1+Q_\alpha)}} + \sum_{t\in G_{T,\alpha}} \frac{\|c_t\|_*^2}{1 + \sigma_{1:t}}, \]
where in the second sum we lower bounded $\sqrt{\mu(1+Q_\alpha)}$ by $1$ (assuming $\mu(1+Q_\alpha) \ge 1$; otherwise the constants change slightly). The first term is clearly at most $\sqrt{Q_\alpha/\mu}$. To analyze the second term, we use the fact that for any $t \in G_{T,\alpha}$ we have $\sigma_t \ge \frac{\alpha\mu\|c_t\|_*^2}{r_t} \ge \frac{\alpha\mu\|c_t\|_*^2}{r_T}$, where the last step used the monotonicity of $r_t$. Thus, denoting the numbers $\{\|c_t\|_*^2\}_{t\in G_{T,\alpha}}$ by $w_1, w_2, \ldots, w_m$ (in order), we have
\[ \sum_{t\in G_{T,\alpha}} \frac{\|c_t\|_*^2}{1+\sigma_{1:t}} \le \sum_{i=1}^m \frac{w_i}{1 + \frac{\alpha\mu}{r_T}\, w_{1:i}} = \frac{r_T}{\alpha\mu}\sum_{i=1}^m \frac{w_i}{\frac{r_T}{\alpha\mu} + w_{1:i}} \le \frac{r_T}{\alpha\mu} \int_{r_T/(\alpha\mu)}^{\,w_{1:m} + r_T/(\alpha\mu)} \frac{dx}{x} = \frac{r_T}{\alpha\mu}\,\log\Big(1 + \frac{\alpha\mu}{r_T}\, w_{1:m}\Big). \]
Since $\frac{\alpha\mu}{r_T} \le \mu$ (as $r_T \ge 1$ and we may take $\alpha \le 1$), we can bound this by $\frac{r_T}{\alpha\mu}\log(1 + \mu w_{1:m})$. Recalling the definition of $r_T$, the proof follows.

Theorem 3.1 now follows immediately from Lemmas 3.4, 3.5, and 3.6.

Remark. The regret bound in Theorem 3.1 has two important terms. The first term depends on the sum of the squared norms of the cost vectors over all the time indices $t \in B_{T,\alpha}$ when the hint vector was not strongly correlated with the cost. As we show in Section 4, such a dependence is unavoidable. The second term is
\[ \frac{r_T}{\mu\alpha}\,\log\Big(1 + \mu\sum_{t\in G_{T,\alpha}}\|c_t\|_*^2\Big) = \frac{\sqrt{1 + \sum_{t\in B_T}|\langle c_t, h_t\rangle|}}{\mu\alpha}\,\log\Big(1 + \mu\sum_{t\in G_{T,\alpha}}\|c_t\|_*^2\Big) \le \frac{\sqrt{1+|B_T|}}{\mu\alpha}\,\log(1 + \mu|G_{T,\alpha}|). \]
In the special case when all hints are $\alpha$-correlated, we have $|B_T| = |B_{T,\alpha}| = 0$ and $|G_{T,\alpha}| = T$, which improves upon the regret bounds of Dekel et al. (2017), since we drop the dependence on the dimension.

In Appendix A, we show that our algorithm directly extends to the case when the underlying Banach space $\mathbb{B}$ is $(q,\mu)$-uniformly convex for $q > 2$, yielding a regret bound of roughly $O\big(T^{(q-2)/(q-1)}\big)$.

3.1. Recovering and improving optimistic bounds

In this section we relate our notion of hints in the constrained setting to the idea of optimistic regret. For simplicity, we focus on the case that $\mathbb{B}$ is a Hilbert space and $\|\cdot\|$ is the Hilbert space norm (or, for concreteness, that $\mathbb{B} = \mathbb{R}^d$ and $\|\cdot\|$ is the $\ell_2$ norm). In this setting we can write $\mathbb{B}^* = \mathbb{B}$ and $\|\cdot\|_* = \|\cdot\|$. Recall that prior optimistic algorithms (e.g., Rakhlin & Sridharan (2013)) achieve regret bounds of the form:
\[ R(u, c, T) = O\Big( \sqrt{\textstyle\sum_{t=1}^T \|c_t - h_t\|^2} \Big). \]
Interestingly, in the unconstrained case, Cutkosky (2019) achieves regret
\[ \tilde O\Big( \sqrt{\textstyle\sum_{t=1}^T \big(\|c_t - h_t\|^2 - \|h_t\|^2\big)} \Big), \]
which sacrifices a logarithmic factor to improve $\|c_t - h_t\|^2$ to $\|c_t - h_t\|^2 - \|h_t\|^2$. However, their construction failed to achieve such a result when there are constraints. Here, we show that in fact our same algorithm, with no modifications, obtains this refined optimistic bound when constrained to the unit ball. Specifically, we have the following result:

Corollary 3.7. Let $\mathbb{B}$ be a Hilbert space. Then Algorithm 1 guarantees regret on the unit ball $K$:
\[ R_{ALG}(u, c, T) \le \frac12 + 8\sqrt{Z} + 16\sqrt{Z}\,\log(1 + T), \quad\text{where } Z = 1 + \sum_{t=1}^T \max\big(\|c_t - h_t\|^2 - \|h_t\|^2,\, 0\big). \]

Proof. Recall that in a Hilbert space, $\mu = 1$ and $q = p = 2$. Then, looking at the regret bound of Theorem 3.1, we have
\[ R_{ALG}(u, c, T) \le \frac12 + \sqrt{\textstyle\sum_{t\in B_{T,\alpha}} \|c_t\|^2} + 2 r_T + \frac{4 r_T}{\alpha}\,\log\Big(1 + \textstyle\sum_{t\in G_{T,\alpha}} \|c_t\|^2\Big), \]
where $r_T = \sqrt{1 + \sum_{t\in B_T} |\langle c_t, h_t\rangle|}$. Next, notice that for any $t \in B_T$, we have
\[ |\langle c_t, h_t\rangle| = -\langle c_t, h_t\rangle \le \frac{\|c_t\|^2}{2} - \langle c_t, h_t\rangle = \frac12\big(\|c_t - h_t\|^2 - \|h_t\|^2\big), \]
so that $r_T \le \sqrt{1 + \frac12 \sum_{t=1}^T \max(\|c_t - h_t\|^2 - \|h_t\|^2, 0)} \le \sqrt{Z}$. Further, if we set $\alpha = \frac14$, then for any $t \in B_{T,\alpha}$, we have
\[ \|c_t\|^2 \le \|c_t\|^2 + \big(\|c_t\|^2 - 4\langle c_t, h_t\rangle\big) = 2\big(\|c_t - h_t\|^2 - \|h_t\|^2\big), \]
and hence $\sum_{t\in B_{T,\alpha}} \|c_t\|^2 \le 2\sum_{t=1}^T \max(\|c_t - h_t\|^2 - \|h_t\|^2, 0) \le 2Z$. Putting all this together (and using $\sum_{t\in G_{T,\alpha}} \|c_t\|^2 \le T$), then over-approximating constants, we can conclude the proof.
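As a quick numerical illustration of the quantity $Z$ (the setup below, including the noise scale, is our own example), the refined quantity can stay bounded even when the classical optimistic quantity $\sum_t \|c_t - h_t\|^2$ grows linearly in $T$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 10
costs = rng.normal(size=(T, d))
costs /= np.linalg.norm(costs, axis=1, keepdims=True)   # ||c_t|| = 1
hints = costs + 0.3 * rng.normal(size=(T, d))           # noisy additive hints
hints /= np.maximum(np.linalg.norm(hints, axis=1, keepdims=True), 1.0)  # ||h_t|| <= 1
classic = float(np.sum(np.linalg.norm(costs - hints, axis=1) ** 2))
diff = np.linalg.norm(costs - hints, axis=1) ** 2 - np.linalg.norm(hints, axis=1) ** 2
Z = 1.0 + float(np.sum(np.maximum(diff, 0.0)))
print(f"classic optimistic sum: {classic:.1f}, refined Z: {Z:.1f}")
```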
4. Lower Bounds

We now show that the regret bounds achieved by our algorithms are near-optimal. Recall that the regret bound had two terms: one corresponding to hints that are negatively correlated with $c_t$, and one corresponding to hints that are positively correlated, but not correlated "enough". Our first lower bound shows that even the second term is necessary.

4.1. Bad hints are uncorrelated

Assume that we are in Euclidean space with the standard $\ell_2$ norm, where the algorithm needs to play a point in the unit ball. We show the following:

Theorem 4.1. There exists a sequence of hint vectors $h_1, h_2, \ldots$ and cost vectors $c_1, c_2, \ldots$ with the following properties: (a) $\langle c_t, h_t\rangle \ge 0$ for all $t$, (b) for all but $B$ time steps, we have $\langle c_t, h_t\rangle = \|c_t\|$ (i.e., the hints are perfect), and (c) any online learning algorithm that plays given the hints incurs an expected regret of $\Omega(\sqrt{B})$.

Proof. We consider the following example in two dimensions, with orthogonal unit vectors $e_1$ and $e_2$. For the first $B$ time steps, suppose that $h_t = e_2$ and $c_t = \pm e_1$, where the sign is chosen uniformly at random at each step. Now, let $z = c_1 + \cdots + c_B$. For the rest of the time steps, suppose that $h_t = c_t = z/\|z\|$. In other words, we have the standard one-dimensional hard instance in the first $B$ steps (which incurs an expected regret of $\Omega(\sqrt{B})$), followed by time steps where the hints are perfect. Any online algorithm incurs an expected loss of $0$ on the first $B$ steps (and a loss of at least $-(T-B)$ on the rest of the steps), while $\mathbb{E}[\|z\|^2] = B$, so that $\mathbb{E}[\|z\|] = \Omega(\sqrt{B})$. Playing the fixed vector $-z/\|z\|$ at all the time steps incurs a total loss of $-(T-B) - \|z\|$. Thus the expected regret is $\Omega(\sqrt{B})$.

The proof above (as well as the ones that follow) exhibits a distribution over instances for which any deterministic algorithm incurs an expected regret of $\Omega(\sqrt{B})$. Applying Yao's lemma (e.g., Motwani & Raghavan (1995)), the regret lower bound therefore applies to randomized algorithms as well.

4.2. Bad hints are spread out over time

Theorem 4.1 takes advantage of an adversarial placement of bad hints: by placing all the useless hints at the beginning of the game, we force the algorithm to experience high regret that it cannot recover from. It turns out that such overtly adversarial placements are not necessary: even if the bad hints are randomly distributed over time, the algorithm must still suffer high regret. (We note that in this case, we no longer have $\langle c_t, h_t\rangle \ge 0$ in all rounds.)

Theorem 4.2. Consider the one-dimensional problem with domain the unit interval $[-1, 1]$. Suppose $h_t = 1$ for all $t$ and that each $c_t$ independently takes value $p - 1$ with probability $p$ and value $p$ with probability $1 - p$, for $p = B/T$ and $B \le T/4$. Then the expected number of bad hints is $B$, and the expected regret of any algorithm is $\Omega(\sqrt{B})$.

Proof. Note that a hint is negatively correlated with the cost exactly when the cost is negative, which happens with probability $p$. Thus the expected number of bad hints is $pT = B$. Now at each step, we have $\mathbb{E}[c_t] = 0$. Thus, whatever $x_t$ the algorithm plays, we have $\mathbb{E}\big[\sum_{t=1}^T c_t x_t\big] = 0$; that is, the expected loss of the algorithm is $0$. Finally, the scalar $z = c_{1:T}$ satisfies $\mathbb{E}[|z|] \ge \sqrt{B}/2$. Therefore, compared to the best point in hindsight, namely $-\operatorname{sign}(z)$, the expected regret is at least $\sqrt{B}/2$.
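The instance in Theorem 4.2 is easy to simulate; the short sketch below (parameter choices ours) checks that roughly $B$ hints are bad, while the best fixed point in hindsight beats any zero-expected-loss strategy by about $\sqrt{B}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, B = 100_000, 400               # requires B <= T / 4
p = B / T
# h_t = 1 always; c_t = p - 1 w.p. p (a bad hint), else c_t = p; E[c_t] = 0
c = np.where(rng.random(T) < p, p - 1.0, p)
print("bad hints:", int(np.sum(c < 0)))   # ~ B in expectation
# any algorithm has expected loss 0, while the comparator u = -sign(c_{1:T})
# earns -|c_{1:T}|, so the expected regret is E|c_{1:T}| ~ sqrt(B)
print("hindsight gap:", abs(c.sum()), " sqrt(B):", np.sqrt(B))
```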
4.3. A lower bound for the $\ell_q$ norm

Next, we show that even when the hint is always $\Omega(1)$-correlated with the cost, our upper bound for general $q$ (which is $T^{(q-2)/(q-1)}$) is optimal in the class of dimension-free bounds.

Theorem 4.3. There exists a sequence of hints $h_1, h_2, \ldots$ and costs $c_1, c_2, \ldots$ in $\mathbb{R}^{T+1}$ such that (a) $\langle c_t, h_t\rangle \ge \Omega(1)$ for all $t$, and (b) any online learning algorithm that plays given the hints incurs an expected regret of $\Omega\big(T^{\frac{q-2}{q-1}}\big)$.

Proof. Let $e_0, e_1, e_2, \ldots, e_T$ be orthogonal, unit length vectors in our space, and suppose that at time $t = 1, 2, 3, \ldots$ we have $c_t = e_0 \pm e_t$ and $h_t = e_0$, where the sign is chosen uniformly at random (to keep all the vectors of norm $\le 1$, we can normalize $c_t$; this does not change the analysis, so we skip this step). Now, suppose the algorithm plays vectors $x_1, x_2, \ldots$. The total expected loss is exactly $\sum_t \langle e_0, x_t\rangle$, which has magnitude at most $T$. Let us construct a vector $u$ with $\|u\|_q \le 1$ that achieves a much smaller inner product with the total cost. Denote $z = c_{1:T} = T e_0 + \sum_{t=1}^T \sigma_t e_t$, for some signs $\sigma_t$. Define $u = \sum_{t=0}^T \beta_t e_t$, where:
\[ \beta_0 = -\Big(1 - \frac{3}{2q}\, T^{-\frac{1}{q-1}}\Big); \qquad \beta_t = -\sigma_t\, T^{-\frac{1}{q-1}} \text{ for } t \ge 1. \]
We have $\sum_{t=1}^T |\beta_t|^q = T \cdot T^{-\frac{q}{q-1}} = T^{-\frac{1}{q-1}}$. Next, we make the simple observation that for any $\gamma < 1/2$,
\[ 1 - \frac{3\gamma}{2} \le e^{-3\gamma/2} \le 1 - \gamma. \]
Using this with $\gamma = T^{-1/(q-1)}$, we have $|\beta_0|^q \le 1 - T^{-\frac{1}{q-1}}$, and thus $\|u\|_q \le 1$. Next, we have
\[ \langle z, u\rangle = T\beta_0 - T\cdot T^{-\frac{1}{q-1}} = -T + \frac{3}{2q}\, T^{1-\frac{1}{q-1}} - T^{1-\frac{1}{q-1}} \le -T - \frac14\, T^{\frac{q-2}{q-1}}, \]
using $T^{1-\frac{1}{q-1}} = T^{\frac{q-2}{q-1}}$ and $\frac{3}{2q} < \frac34$ for $q > 2$. Thus, compared to the point $u$, any algorithm has an expected regret of $\Omega\big(T^{(q-2)/(q-1)}\big)$. This completes the proof.

Indeed, even if we allow a dependence on the dimension, obtaining a $\log T$ regret is impossible for $q > 2$. We refer to Appendix C for a regret lower bound (quite similar to the above) even in two dimensions. In this case, the lower bound interpolates between $\log T$ (which we achieve for $q = 2$) and $\sqrt{T}$ (which is achievable if we lose a factor linear in the dimension).

5. Unconstrained Learning with Hints

We now consider the unconstrained setting, where the online algorithm is allowed to output any $x \in \mathbb{B}$. In this section, we show that the unconstrained setting is much simpler than the constrained version of the problem.

Recall the definition of $B_{T,\alpha}$. For the unconstrained setting, we work with a more relaxed notion of bad hints. Let $B'_{T,\alpha}$ be the smallest set of indices such that
\[ \sum_{t\in[T]\setminus B'_{T,\alpha}} \langle c_t, h_t\rangle \;\ge\; \alpha \sum_{t\in[T]\setminus B'_{T,\alpha}} \|c_t\|^2. \]
We observe that, by definition, for any $\alpha > 0$ we have $|B'_{T,\alpha}| \le |B_{T,\alpha}|$.

Our algorithm is essentially a black-box reduction to a standard parameter-free online linear optimization algorithm without any hints, and follows the framework of adding independent online learning algorithms (Cutkosky, 2019). In fact, our algorithm is identical to the optimistic online learning algorithm of Cutkosky (2019); however, we are able to obtain better regret guarantees through a tighter analysis. A sketch of the combination appears below. Denote $C_T = \sum_{t=1}^T \|c_t\|^2$.
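Concretely, the combination can be sketched as follows (the `play`/`update` interface is our own abstraction of a parameter-free algorithm $A$ such as that of Cutkosky & Orabona (2018); Lemma 5.1 below makes the guarantee precise):

```python
import numpy as np

class HintedUnconstrained:
    """Sketch of the reduction behind Lemma 5.1: run two independent copies of
    a parameter-free OLO algorithm A, one on the raw cost vectors and one on
    scalar losses built from the hints, and combine their plays."""

    def __init__(self, vector_learner, scalar_learner):
        self.A_vec = vector_learner     # produces x_t from c_1, ..., c_{t-1}
        self.A_scalar = scalar_learner  # produces y_t from -<c_1,h_1>, ...

    def play(self, h_t):
        return self.A_vec.play() - self.A_scalar.play() * h_t  # z_t = x_t - y_t h_t

    def update(self, c_t, h_t):
        self.A_vec.update(c_t)
        self.A_scalar.update(-float(np.dot(c_t, h_t)))
```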
Lemma 5.1. Let $A$ be a parameter-free online linear optimization algorithm that guarantees a regret bound of $R_A(u, c, T) \le f(\|u\|, C_T, \epsilon)$ for all $\epsilon > 0$, for some function $f(\cdot,\cdot,\cdot)$ with $f(0, \cdot, \epsilon) \le \epsilon$ that is monotone in all three parameters. Then there exists an algorithm $B$ for online learning with hints that guarantees the regret bound:
\[ R_B(u, c, T \mid h) \;\le\; \min\Big\{\, f(\|u\|, C_T, \epsilon) + \epsilon,\;\; \inf_{0 \le y \le \|u\|}\Big( 2 f(\|u\|, C_T, \epsilon) - y\sum_{t=1}^T \langle c_t, h_t\rangle \Big) \Big\}. \]

Proof. We design an algorithm $B$ that utilizes the provided online learning algorithm $A$ in two distinct settings. First, let $x_t \in \mathbb{B}$ be the output of algorithm $A$ in response to the loss vectors $c_1, \ldots, c_{t-1} \in \mathbb{B}^*$. We also use algorithm $A$ in the scalar (i.e., $\mathbb{R}$) setting by providing $-\langle c_t, h_t\rangle$ as the losses: let $y_t$ be the output of algorithm $A$ in response to the losses $-\langle c_1, h_1\rangle, \ldots, -\langle c_{t-1}, h_{t-1}\rangle$. On receiving the hints $h_1, \ldots, h_t$ and the losses for the previous time steps $c_1, \ldots, c_{t-1}$, our algorithm $B$ outputs $z_t = x_t - y_t h_t$. Then for all $u \in \mathbb{B}$ and any $y \ge 0$, we have
\[ R_B(u, c, T \mid h) = \sum_{t=1}^T \langle c_t, z_t - u\rangle = \sum_{t=1}^T \langle c_t, x_t - u\rangle - \sum_{t=1}^T y_t \langle c_t, h_t\rangle = \sum_{t=1}^T \langle c_t, x_t - u\rangle + \sum_{t=1}^T \langle c_t, h_t\rangle (y - y_t) - y\sum_{t=1}^T \langle c_t, h_t\rangle, \]
and hence
\[ R_B(u, c, T \mid h) \le f(\|u\|, C_T, \epsilon) + f\Big(|y|,\, \sum_{t=1}^T \langle c_t, h_t\rangle^2,\, \epsilon\Big) - y\sum_{t=1}^T \langle c_t, h_t\rangle, \]
using the regret bounds guaranteed by algorithm $A$ in both settings. Setting $y = 0$ (and using $f(0, \cdot, \epsilon) \le \epsilon$) suffices to obtain the first part of the regret bound. To obtain the second part, we use $\langle c_t, h_t\rangle^2 \le \|c_t\|^2$ and the monotonicity of $f$ to obtain, for any $0 \le y \le \|u\|$,
\[ R_B(u, c, T \mid h) \le 2 f(\|u\|, C_T, \epsilon) - y\sum_{t=1}^T \langle c_t, h_t\rangle. \]

We are now ready to present our main result for unconstrained online learning with hints.

Theorem 5.2. For the unconstrained online linear optimization problem with hints, for any $\alpha > 0$, there exists an algorithm $B$ that guarantees, for any $u \in \mathbb{B}$ and $\epsilon > 0$:
\[ R_B(u, c, T \mid h) \;\le\; O\left( \epsilon + \|u\|\,\log\Big(1 + \frac{T\|u\|}{\epsilon}\Big)\Big(1 + \frac{\sqrt{|B'_{T,\alpha}|}}{\alpha}\Big) \right). \]

Proof. An algorithm $A$ that satisfies the properties of Lemma 5.1 is provided by Cutkosky & Orabona (2018). Their algorithm guarantees
\[ f(\|u\|, C_T, \epsilon) = \epsilon + 8\|u\|\log\Big(\frac{8\|u\|^2(1+4C_T)^{4.5}}{\epsilon}\Big) + 8\|u\|\sqrt{C_T\Big(2 + \log\Big(\frac{5\|u\|^2(2+8C_T)^9}{\epsilon^2} + 1\Big)\Big)}. \]
Similar algorithms with differing constants and dependencies on the $c_t$ are described in Jun & Orabona (2019); Orabona & Pál (2016); Cutkosky & Sarlos (2019); McMahan & Orabona (2014); Foster et al. (2018; 2017); Kempka et al. (2019); van der Hoeven (2019). Applying Lemma 5.1 with this algorithm $A$, we get
\[ R_B(u, c, T \mid h) \le \inf_{0 \le y \le \|u\|}\Big( 2\epsilon + Q_1 + \|u\| Q_2\sqrt{C_T} - y\sum_{t=1}^T \langle c_t, h_t\rangle \Big), \tag{4} \]
where, for brevity, we let $Q_1 = 16\|u\|\log\big(8\|u\|^2(1+4C_T)^{4.5}/\epsilon\big)$ and $Q_2 = 16\sqrt{2 + \log\big(5\|u\|^2(2+8C_T)^9/\epsilon^2 + 1\big)}$. However, by the definition of $B'_{T,\alpha}$, we have
\[ \sum_{t=1}^T \langle c_t, h_t\rangle = \sum_{t\in[T]\setminus B'_{T,\alpha}} \langle c_t, h_t\rangle + \sum_{t\in B'_{T,\alpha}} \langle c_t, h_t\rangle \ge \alpha\sum_{t=1}^T \|c_t\|^2 + \sum_{t\in B'_{T,\alpha}} \big(\langle c_t, h_t\rangle - \alpha\|c_t\|^2\big) \ge \alpha C_T - 2|B'_{T,\alpha}|. \]
Substituting back into (4) and using $y = \|u\|/\sqrt{|B'_{T,\alpha}|}$ (assuming $|B'_{T,\alpha}| \ge 1$; if it is zero, take $y = \|u\|$ and the same conclusion follows directly), we get
\[ R_B(u, c, T \mid h) \le 2\epsilon + Q_1 + \|u\| Q_2\sqrt{C_T} - \frac{\|u\|\alpha C_T}{\sqrt{|B'_{T,\alpha}|}} + 2\|u\|\sqrt{|B'_{T,\alpha}|}. \tag{5} \]
However, for any $C_T$, we have
\[ \|u\| Q_2 \sqrt{C_T} - \frac{\|u\|\alpha C_T}{\sqrt{|B'_{T,\alpha}|}} \le \frac{\|u\| Q_2^2 \sqrt{|B'_{T,\alpha}|}}{4\alpha}; \]
indeed, this follows since $ax - bx^2 \le a^2/(4b)$ for all $x \ge 0$ (applied with $x = \sqrt{C_T}$). Thus (5) yields that $R_B(u, c, T \mid h)$ is at most:
\[ 2\epsilon + 16\|u\|\log\Big(\frac{8\|u\|^2(1+4C_T)^{4.5}}{\epsilon}\Big) + 64\|u\|\Big(2 + \log\Big(\frac{5\|u\|^2(2+8C_T)^9}{\epsilon^2} + 1\Big)\Big)\frac{\sqrt{|B'_{T,\alpha}|}}{\alpha} + 2\|u\|\sqrt{|B'_{T,\alpha}|}, \]
which is $O\big(\epsilon + \|u\|\log(1 + T\|u\|/\epsilon)(1 + \sqrt{|B'_{T,\alpha}|}/\alpha)\big)$, since $C_T \le T$.

The bound of Theorem 5.2 is similar to our results in the constrained setting, but now we have replaced $B_{T,\alpha}$ with the relaxed quantity $B'_{T,\alpha}$. The unconstrained algorithm requires the good hints to be good only on average, while the constrained algorithm required each individual good hint to be good. This is a significant relaxation: consider our lower bound argument of Theorem 4.1 with $B = T/2$, in which $\langle c_t, h_t\rangle$ is $0$ for the first $T/2$ rounds and $1$ afterwards. A constrained algorithm must suffer $\Omega(\sqrt{T})$ regret in this setting, but in the unconstrained case the hints are $\frac12$-correlated on average, and so the algorithm will suffer only $O(\log T)$ regret. It is strictly easier to take advantage of hints in the unconstrained setting than in the constrained setting.

6. Conclusions

In this work we obtained an algorithm for online linear optimization in the presence of imperfect hints. Our algorithm generalizes previous results that used hints in online optimization to obtain improved regret bounds but were not robust against hints that are not guaranteed to be good. By tolerating bad hints while retaining near-optimal regret bounds, our work thus makes it possible for the hints themselves to be derived from a learning oracle.

References

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cutkosky, A. Combining online learning guarantees. In COLT, pp. 895–913, 2019.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In COLT, pp. 1493–1529, 2018.

Cutkosky, A. and Sarlos, T. Matrix-free preconditioning in online learning. In ICML, pp. 1455–1464, 2019.
Dekel, O., Flajolet, A., Haghtalab, N., and Jaillet, P. Online learning with a hint. In NIPS, pp. 5299–5308, 2017.

Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. Parameter-free online learning via model selection. In NIPS, pp. 6020–6030, 2017.

Foster, D. J., Rakhlin, A., and Sridharan, K. Online learning: Sufficient statistics and the Burkholder method. In COLT, pp. 3028–3064, 2018.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.

Hazan, E. and Megiddo, N. Online learning with prior knowledge. In COLT, pp. 499–513, 2007.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Hazan, E., Rakhlin, A., and Bartlett, P. L. Adaptive online gradient descent. In NIPS, pp. 65–72, 2008.

Jun, K.-S. and Orabona, F. Parameter-free online convex optimization with sub-exponential noise. In COLT, pp. 1802–1823, 2019.

Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. JCSS, 71(3):291–307, 2005.

Kempka, M., Kotlowski, W., and Warmuth, M. K. Adaptive scale-invariant online algorithms for learning linear models. In ICML, pp. 3321–3330, 2019.

Kumar, R., Purohit, M., and Svitkina, Z. Improving online algorithms using ML predictions. In NeurIPS, pp. 9661–9670, 2018.

Kumar, R., Purohit, M., Schild, A., Svitkina, Z., and Vee, E. Semi-online bipartite matching. In ITCS, pp. 50:1–50:20, 2019.

Kumar, R., Purohit, M., Svitkina, Z., and Vee, E. Interleaved caching with access graphs. In SODA, pp. 1846–1858, 2020.

Lattanzi, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Online scheduling via learned weights. In SODA, pp. 1859–1877, 2020.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS, pp. 983–992, 2019.

Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. In ICML, pp. 3302–3311, 2018.

McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. JMLR, 18(1):3117–3166, 2017.

McMahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In COLT, pp. 1020–1039, 2014.

Mohri, M. and Yang, S. Accelerating online convex optimization via adaptive prediction. In AISTATS, pp. 848–856, 2016.

Motwani, R. and Raghavan, P. Randomized Algorithms. Cambridge University Press, 1995.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In NIPS, pp. 577–585, 2016.

Pisier, G. Martingales in Banach Spaces. Cambridge University Press, 2016.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT, pp. 993–1019, 2013.

Rohatgi, D. Near-optimal bounds for online caching with machine learned advice. In SODA, pp. 1834–1845, 2020.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Steinhardt, J. and Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, pp. 1593–1601, 2014.

van der Hoeven, D. User-specified local differential privacy in unconstrained adaptive online learning. In NeurIPS, pp. 14080–14089, 2019.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pp. 928–936, 2003.