Logarithmic Regret from Sublinear Hints

Aditya Bhaskara (Department of Computer Science, University of Utah, Salt Lake City, UT; bhaskaraaditya@gmail.com)
Ashok Cutkosky (Department of Electrical and Computer Engineering, Boston University, Boston, MA; ashok@cutkosky.com)
Ravi Kumar (Google Research, Mountain View, CA; ravi.k53@gmail.com)
Manish Purohit (Google Research, Mountain View, CA; mpurohit@google.com)

We consider the online linear optimization problem, where at every step the algorithm plays a point $x_t$ in the unit ball and suffers loss $\langle c_t, x_t\rangle$ for some cost vector $c_t$ that is then revealed to the algorithm. Recent work showed that if an algorithm receives a hint $h_t$ that has non-trivial correlation with $c_t$ before it plays $x_t$, then it can achieve a regret guarantee of $O(\log T)$, improving on the bound of $\Theta(\sqrt{T})$ in the standard setting. In this work, we study the question of whether an algorithm really requires a hint at every time step. Somewhat surprisingly, we show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model; in contrast, we also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret. We give two applications of our result, to the well-studied setting of optimistic regret bounds and to the problem of online learning with abstention.

1 Introduction

There has been a spate of work on improving the performance of online algorithms with the help of externally available hints. The goal of these works is to circumvent worst-case bounds and exploit the capability of machine-learned models that can potentially provide such hints. There have been two main lines of study. The first is for combinatorial problems, where the goal has been to improve the competitive ratio of online algorithms; problems considered here include ski-rental [9, 17], caching [13, 19, 26], scheduling [1, 12, 17, 22], matching [16, 18], etc. The second is in the learning theory setting, where the goal has been to improve the regret of online optimization algorithms. A series of recent papers showed how to achieve better regret guarantees, assuming that we have a hint about the cost function before the algorithm makes a choice. For many variants of the online convex optimization problem, works such as [25, 8, 2, 3] studied the power of having prior information about the cost function. Works such as [29] have also studied improved regret bounds in partial-information or bandit settings. In all these works, a desirable property is consistency, which demands better performance with better quality hints, together with robustness, which guarantees a certain level of performance even with poor quality or adversarially bad hints.

Recall the standard online optimization model [30], which is a game between an algorithm and an adversary. In each round, the algorithm plays a point and the adversary responds with a cost function that is visible to the algorithm; the cost in this round is measured by evaluating the cost function on the played point. In online linear optimization, the cost function is linear. The regret of an algorithm is the worst-case difference between the cost of the algorithm and the cost of an algorithm that plays a fixed point in each round. A natural way to incorporate hints or prior information is to give the algorithm access to a hint about the cost function at a given round before it chooses a point to play at that round.
In this sense, hints are available gratis in every round [25]. The availability of a hint in each round seems natural in some settings, e.g., when cost functions change gradually with time [25, 4]. However, it can be prohibitive in many others, for instance, if hints are obtained by using expensive side information, or if they are generated by running a computationally expensive ML model. Furthermore, hints can also be wasteful when the problem instance, or even a large sub-instance, is such that the algorithm cannot really derive any substantial benefit from their presence. This leads to the question of strengthening the online learning with hints model by making it parsimonious. In this work, we pursue this direction and offer the algorithm the flexibility to choose whether to ask for a hint before it plays its point. The onus is then on the algorithm to know when to ask for a hint and how to use it judiciously, while ensuring both consistency and robustness. The performance of such an algorithm is measured not only by its regret but also by the number of hints it uses.

Our contributions. We now present a high-level summary of our results; for ease of exposition, we defer the formal statements to the respective sections. All our results are for the problem of online linear optimization (OLO) when the domain is the unit $\ell_2$ ball; the cost function at every step is defined by a cost vector $c_t$, and the cost (or loss) is the inner product of the played point with $c_t$. We call a hint perfect if it is the same as the cost vector at that time step, good if it is weakly correlated with the cost vector, and bad otherwise.

As our main result, we show that for OLO in which hints are guaranteed to be good whenever the algorithm asks for one, there is an efficient randomized algorithm that obtains $O(\log T)$ regret using only $O(\sqrt{T})$ hints. We extend our result to the case when $|B|$ hints can be bad (chosen in an oblivious manner, as we discuss later), and give an algorithm that achieves a regret bound of $O(\sqrt{|B|}\,\log T)$ while still asking for only $O(\sqrt{T})$ hints. It is interesting to contrast our result with prior work: Dekel et al. [8] obtained an algorithm with $O(\log T)$ regret when a good hint is available in every round, and Bhaskara et al. [2] made this result robust, obtaining a regret bound of $O(\sqrt{|B|}\,\log T)$ when there are $|B|$ bad hints. Our result improves upon these works by showing the same asymptotic regret bounds, but using only $O(\sqrt{T})$ hints. Our result also has implications for optimistic regret bounds [25] (where we obtain the same results, but with fewer hints) and for online learning with abstention [24] (where we can bound the number of abstentions).

We also show two lower bounds that establish the optimality of our algorithm. The first concerns the minimum number of hints needed to get $O(\log T)$ regret: we show that any (potentially randomized) algorithm that uses $o(\sqrt{T})$ hints will suffer a regret of $\Omega(\sqrt{T})$. The second and more surprising result concerns the role of randomness: we show that any deterministic algorithm that obtains $O(\log T)$ regret must use $\Omega(T/\log T)$ hints, even if each of them is perfect. This shows the significance of having a randomized algorithm and an oblivious adversary.
Finally, we extend our results to the unconstrained OLO setting (see Section 6), where we design a deterministic algorithm that obtains $O(\log^{3/2} T)$ regret (suitably defined for the unconstrained case) when all hints are good, and a randomized algorithm that obtains $O(\log T)$ regret and can be extended to handle bad hints.

There are three aspects of our results that we find surprising. The first is the very possibility of obtaining $O(\log T)$ regret using only a sublinear number of hints. The second is the sharp threshold on the number of hints needed to obtain logarithmic regret: the regret does not degrade gracefully once the number of hints is $o(\sqrt{T})$. The third is the deterministic vs. randomized separation between the constrained and the unconstrained cases when all queried hints are good.

We now present some intuition for why our result is plausible. Consider the standard worst-case adversary for OLO: random mean-zero costs. This case is hard because the learner achieves zero expected cost, while the competitor achieves total cost $-\Theta(\sqrt{T})$. However, if we simply query a hint $h_t$ on $O(\sqrt{T}/\epsilon)$ rounds and play $-h_t$ on those rounds, each such round incurs cost at most $-\epsilon$, which is enough to cancel out the $\sqrt{T}$ regret while making only $O(\sqrt{T})$ hint queries (for constant $\epsilon$). Thus, the standard worst-case instances are actually easy with hints. More details on this example and an outline of our algorithm are provided at the start of Section 3.

Organization. Section 2 provides the necessary background. The main algorithm and analysis for the constrained OLO case are in Section 3. The extensions and applications of this can be found in Section 4. Section 5 contains the lower bounds, and Section 6 contains the algorithms and analyses for the unconstrained case. All missing proofs are in the Supplementary Material.

2 Preliminaries

Let $\|\cdot\|$ denote the $\ell_2$-norm and $B^d = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ denote the unit $\ell_2$ ball in $\mathbb{R}^d$. We use the compressed sum notation $c_{1:t}$ to denote $\sum_{i=1}^t c_i$ and $\|c\|^2_{1:t}$ to denote $\sum_{i=1}^t \|c_i\|^2$. Let $\vec c = c_1, \dots, c_T$ be a sequence of cost vectors, and let $[T] = \{1, \dots, T\}$.

OLO problem and regret. The constrained online linear optimization (OLO) problem is modeled as a game over $T$ rounds. At each time $t \in [T]$, an algorithm $A$ plays a vector $x_t \in B^d$, and then an adversary responds with a cost vector $c_t \in B^d$. The algorithm incurs cost (or loss) $\langle x_t, c_t\rangle$ at time $t$. The total cost incurred by the algorithm is $\mathrm{cost}_A(\vec c) = \sum_{t=1}^T \langle x_t, c_t\rangle$. The regret of the algorithm $A$ with respect to a comparator (or benchmark) vector $u \in B^d$ is $R_A(u, \vec c) = \mathrm{cost}_A(\vec c) - \mathrm{cost}_{A_u}(\vec c) = \sum_{t=1}^T \langle x_t - u, c_t\rangle$, where $A_u$ is the algorithm that always plays $u$ at every time step. The regret of an algorithm $A$ is its worst-case regret with respect to all $u \in B^d$: $R_A(\vec c) = \sup_{u \in B^d} R_A(u, \vec c)$.

Hints and query cost. Let $\epsilon > 0$ be fixed and known. In this paper we consider the OLO setting where, at any round $t$, before choosing $x_t$, the algorithm $A$ is allowed to obtain a hint $h_t \in B^d$. If $\langle h_t, c_t\rangle \ge \epsilon\|c_t\|^2$, we say that the hint is $\epsilon$-good. If $A$ opts to obtain a hint at time $t$, then it incurs a query cost of $\|c_t\|^2$; the query cost is $0$ if no hint was obtained at time $t$. The definition of regret stays the same, and we denote it by $R_{A,\epsilon}(\vec c)$. The total query cost of $A$ is given by $Q_{A,\epsilon}(\vec c) = \sum_{t=1}^T \mathbf{1}_t \|c_t\|^2$, where $\mathbf{1}_t$ is an indicator denoting whether $A$ queried for a hint at time $t$. Note that the algorithm does not actually know the query cost for a round until the end of the round. If $A$ is a randomized algorithm, the notions of expected regret and expected query cost follow naturally.

We consider the setting in which the adversarial choice of the hint $h_t$ and cost vector $c_t$ at time $t$ is oblivious to whether the algorithm queries for a hint at time $t$, but can depend adaptively on all previous decisions. More generally, we consider the case where some subset of the hints are bad, in the sense that $\langle c_t, h_t\rangle < \epsilon\|c_t\|^2$; we let $B$ denote the set of such indices $t$. Although we assume $\epsilon$ is known to our algorithms, we do not assume any information about $B$. Further, our algorithm is charged $\|c_t\|^2$ for querying a hint even if the hint was bad.
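To make the bookkeeping of the two performance measures concrete, here is a minimal sketch (ours, not the paper's) of the query protocol above; the `learner` and `hint_oracle` interfaces are hypothetical placeholders used only for illustration.

```python
# A minimal sketch (ours) of the OLO-with-hints protocol: it plays one game and
# returns the regret against the best fixed point in the unit ball and the query
# cost Q_{A,eps}.  The `learner` and `hint_oracle` objects are hypothetical
# interfaces, not part of the paper.
import numpy as np

def run_protocol(learner, hint_oracle, costs):
    alg_cost, query_cost = 0.0, 0.0
    for t, c_t in enumerate(costs):                    # each ||c_t|| <= 1
        wants_hint = learner.wants_hint(t)
        h_t = hint_oracle(t) if wants_hint else None   # hint revealed before x_t is chosen
        x_t = learner.play(t, h_t)                     # ||x_t|| <= 1
        alg_cost += float(np.dot(c_t, x_t))            # loss <x_t, c_t>
        if wants_hint:
            query_cost += float(np.dot(c_t, c_t))      # charged ||c_t||^2, known only at round's end
        learner.observe(t, c_t)                        # c_t is revealed after playing
    best_fixed_cost = -np.linalg.norm(np.sum(costs, axis=0))  # achieved by u = -c_{1:T}/||c_{1:T}||
    return alg_cost - best_fixed_cost, query_cost
```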
3 Main algorithm

Intuition and outline. The high-level intuition behind our algorithm is the following. Suppose for a moment that $\epsilon = 1/2$ and each hint is $\epsilon$-correlated with the corresponding cost. Now suppose the cost vectors $c_1, \dots, c_T$ are random unit vectors, as in the standard tight example for FTRL. In this case, if an algorithm were to make a hint query in the first $4\sqrt{T}$ steps, set $x_t = -h_t$ in those steps, and play FTRL subsequently, then the cost incurred by the algorithm would be less than $-2\sqrt{T}$ in the first $4\sqrt{T}$ steps, and $0$ (in expectation) subsequently. On the other hand, for random vectors we have $\|c_{1:T}\| \le 2\sqrt{T}$ with high probability, and thus the best vector in hindsight achieves a total cost of at least $-\|c_{1:T}\| \ge -2\sqrt{T}$. Thus the algorithm above actually incurs regret at most $0$.

It turns out that the key to the above argument is $\|c_{1:T}\|$ being small. In fact, suppose that the $c_t$ are unit vectors, and assume that $\|c_{1:T}\| \le T/4$. Now, suppose the algorithm makes a hint query at $10\sqrt{T}$ random indices, sets $x_t = -h_t$ in those steps, and uses FTRL in the other steps. One can show that the cost incurred by the algorithm is at most $-5\sqrt{T}$ plus the cost of the FTRL steps. Since the cost in the FTRL steps is within $O(\sqrt{T})$ of the cost incurred by the competitor $u$, we can show that the regret is once again at most $0$. The missing subtlety here is accounting for the cost of $u$ on the query steps, but using the bound on $\|c_{1:T}\|$, this can be adequately controlled if the queries are made at random.

The above outline suggests that the difficult case is $\|c_{1:T}\|$ being large. However, prior work of Huang et al. [11] showed that when the domain is the unit ball, FTL achieves logarithmic regret if we have $\|c_{1:t}\| = \Omega(t)$ for all $t$. Our algorithm can be viewed as a combination of the two ideas above. If we could identify the largest round $S$ for which $\|c_{1:S}\|$ is small, then we could perform uniform sampling until round $S$ and use FTL subsequently, and hope to obtain logarithmic regret overall (although it is not obvious that the two bounds compose). The problem is that we do not know the value of $S$. We thus view the problem of picking a sampling probability as a one-dimensional OLO problem in itself, and show that an online gradient descent (OGD) algorithm achieves low regret when compared to all sequences that have the structure of being equal to some $\delta$ until a certain time and $0$ thereafter (which captures the setting above). The overall algorithm thus picks the querying probability using this OGD procedure, and otherwise resorts to FTRL, as in the outline above.
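Before turning to the details, the following quick numerical check (ours, not the paper's) illustrates the first example above in $d = 2$ with $\epsilon = 1/2$; for simplicity it plays the origin instead of FTRL on the non-hint rounds, and the exact hint construction is our own choice.

```python
# A quick numerical check (ours) of the intuition above: random unit costs,
# exactly eps-good hints, play -h_t on the first 4*sqrt(T) rounds and the origin
# afterwards.  The hint rounds alone contribute about -2*sqrt(T), while the best
# fixed comparator only reaches about -Theta(sqrt(T)), so the regret is negative.
import numpy as np

rng = np.random.default_rng(0)
T, eps = 10_000, 0.5
num_hint_rounds = int(4 * np.sqrt(T))

alg_cost, c_sum = 0.0, np.zeros(2)
for t in range(T):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c = np.array([np.cos(theta), np.sin(theta)])            # random unit cost vector
    c_perp = np.array([-c[1], c[0]])
    sign = rng.choice([-1.0, 1.0])
    h = eps * c + sign * np.sqrt(1.0 - eps**2) * c_perp      # exactly eps-good unit hint
    x = -h if t < num_hint_rounds else np.zeros(2)           # use the hint, else play 0
    alg_cost += float(np.dot(c, x))
    c_sum += c

comparator_cost = -np.linalg.norm(c_sum)                     # best fixed point in hindsight
print(f"algorithm: {alg_cost:.1f}  comparator: {comparator_cost:.1f}  "
      f"regret: {alg_cost - comparator_cost:.1f}")
```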
We now present the details of the overall algorithm (Section 3.3) and the two main subroutines it uses (Sections 3.1 and 3.2).

3.1 A sharper analysis of FTRL

In this section we consider the classic adaptive Follow the Regularized Leader (FTRL) algorithm $A_{\mathrm{ftrl}}$ (Algorithm 1) and show a regret bound that is better than the usual one if the length of the aggregate cost vector grows rapidly after a certain initial period. For convenience, let $\sigma_t = \|c_t\|^2$. Define the regularizer terms as $r_0 = 1$ and, for $t \ge 1$,
$$r_t = \sqrt{1+\sigma_{1:t}} - \sqrt{1+\sigma_{1:t-1}}. \qquad (1)$$
By definition, we have $r_{0:t} = \sqrt{1+\sigma_{1:t}}$. Furthermore, we have $r_t < 1$ for all $t$, since $\sigma_t = \|c_t\|^2 \le 1$. The FTRL algorithm $A_{\mathrm{ftrl}}$ then plays the points $x_1, x_2, \dots$, which are defined as
$$x_{t+1} = \operatorname*{argmin}_{x \in B^d}\ \langle c_{1:t}, x\rangle + \frac{r_{0:t}}{2}\|x\|^2.$$

Algorithm 1 Adaptive FTRL $A_{\mathrm{ftrl}}$.
    $x_1 \leftarrow 0$, $r_0 \leftarrow 1$
    for $t = 1, \dots, T$ do
        Play point $x_t$; receive cost $c_t$
        $r_{0:t} \leftarrow \sqrt{1+\sigma_{1:t}}$
        $x_{t+1} \leftarrow \operatorname*{argmin}_{x \in B^d}\ \langle c_{1:t}, x\rangle + \frac{r_{0:t}}{2}\|x\|^2$
    end for

Algorithm 2 OGD with shrinking domain $A_{\mathrm{ogd}}$.
    Require: Parameter $\lambda \ge 1$
    $p_1 \leftarrow 0$, $D_0 \leftarrow [0, 1]$
    for $t = 1, \dots, T$ do
        Play point $p_t$; receive cost $z_t$ and $\sigma_t \le 1$   {$\sigma_t$ will eventually be set to $\|c_t\|^2$}
        $D_t \leftarrow [0, \min(1, \lambda/\sqrt{1+\sigma_{1:t}})]$
        $\eta_t \leftarrow \lambda/(1+\sigma_{1:t})$
        $p_{t+1} \leftarrow \Pi_{D_t}(p_t - \eta_t z_t)$, where $\Pi_{D_t}$ is the projection onto the interval $D_t$
    end for

We will show that $A_{\mathrm{ftrl}}$ satisfies the following refined regret guarantee.

Theorem 3.1. Consider $A_{\mathrm{ftrl}}$ on a sequence $c_t$ of cost vectors, and let $\epsilon \in (0, 1)$ be any parameter. Suppose that $S$ is an index such that for all $t > S$, $\|c_{1:t}\| \ge \frac{\epsilon}{4}(1+\sigma_{1:t})$ (recall $\sigma_t = \|c_t\|^2$). Then:
1. For all $N \in [T]$ and for any $\|u\| \le 1$, we have $\sum_{t=1}^N \langle c_t, x_t - u\rangle \le 4.5\sqrt{1+\sigma_{1:N}}$.
2. For the index $S$ defined above, we have the refined regret bound:
$$\sum_{t=1}^T \langle c_t, x_t - u\rangle \le \frac{2}{\epsilon} + \|c_{1:S}\| + \sum_{t=1}^S \langle c_t, x_t\rangle + \frac{18 + 8\log(1+\sigma_{1:T})}{\epsilon}. \qquad (3)$$

Part (1) of the theorem follows from the standard analysis of FTRL; we include the proof in Appendix B.1 for completeness. Part (2) is a novel contribution, where we show that if $\|c_{1:t}\|$ grows quickly enough, then the subsequent regret is small. It can be viewed as a generalization of a result of Huang et al. [11], who proved such a regret bound for $S = 0$.

3.2 Switch-once dynamic regret

In this section we show a regret bound against all time-varying comparators of a certain form. More formally, we consider the one-dimensional OLO problem where the costs are $z_t$ and $\lambda \ge 1$ is a known parameter. We also assume that at time $t$, the algorithm is provided with a parameter $\sigma_t \in [0, 1]$ that gives some extra information about the sequence of comparators of interest. Thus the modified OLO game can be described as follows. For $t = 1, 2, \dots$, the algorithm first plays $p_t \in [0, 1]$, and then $z_t, \sigma_t$ are revealed; $z_t$ always satisfies $z_t^2 \le 4\sigma_t$. We wish to minimize the regret with respect to a class of comparator sequences $(q_t)_{t=1}^T$ (defined below), i.e., minimize $\sum_t z_t(p_t - q_t)$ over all sequences in the class. We remark that for the purposes of this subsection, we can think of $\sigma_t$ as $z_t^2$. (The more general setup is needed when we use this in Algorithm 3.)

Definition 3.2 (Valid-in-hindsight sequences). We say that a sequence $(q_t)_{t=1}^T$ is valid in hindsight if there exist an $S \in [T]$ and a $\delta \in [0, 1]$ such that:
1. $q_t = \delta$ for all $t \le S$ and $q_t = 0$ for $t > S$.
2. At the switching point, we have $\delta^2 \le \frac{\lambda^2}{1+\sigma_{1:S}}$.

We now show that a variant of online gradient descent (OGD) with a shrinking domain achieves low regret with respect to all valid-in-hindsight sequences; we call this $A_{\mathrm{ogd}}$ (see Algorithm 2).

Theorem 3.3. Let $\lambda \ge 1$ be a given parameter, and let $(z_t)_{t=1}^T$ be any sequence of cost values satisfying $z_t^2 \le 4\sigma_t$. Let $(q_t)_{t=1}^T$ be a valid-in-hindsight sequence. The points $p_t$ produced by $A_{\mathrm{ogd}}$ then satisfy
$$\sum_{t=1}^T z_t(p_t - q_t) \le \lambda\left(1 + 3\log(1+\sigma_{1:T})\right).$$
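As a concrete reference point, the following is a direct transcription (ours) of Algorithm 2 into Python, as we read the pseudocode above; the step size, the shrinking interval, and the projection mirror the description.

```python
# A minimal Python transcription (ours) of Algorithm 2, OGD with a shrinking
# domain: p_{t+1} is a projected gradient step on the linear loss z_t * p, and
# the feasible interval shrinks like lambda / sqrt(1 + sigma_{1:t}).
class ShrinkingDomainOGD:
    def __init__(self, lam: float):
        self.lam = lam          # parameter lambda >= 1
        self.p = 0.0            # p_1 = 0, D_0 = [0, 1]
        self.sigma_sum = 0.0    # running sum sigma_{1:t}

    def play(self) -> float:
        return self.p

    def update(self, z_t: float, sigma_t: float) -> None:
        """z_t: scalar cost with z_t**2 <= 4*sigma_t; sigma_t in [0, 1]."""
        self.sigma_sum += sigma_t
        upper = min(1.0, self.lam / (1.0 + self.sigma_sum) ** 0.5)  # D_t = [0, upper]
        eta_t = self.lam / (1.0 + self.sigma_sum)                   # step size eta_t
        self.p = min(max(self.p - eta_t * z_t, 0.0), upper)         # project onto D_t
```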
3.3 Full algorithm

In this section we present the full algorithm, which utilizes the above two ingredients.

Algorithm 3 Algorithm with hints $A_{\mathrm{hints}}$.
    Require: Parameter $\epsilon$
    Initialize an instance of $A_{\mathrm{ftrl}}$ and an instance of $A_{\mathrm{ogd}}$ with parameter $\lambda = 10/\epsilon$
    for $t = 1, \dots, T$ do
        Receive $p_t$ from $A_{\mathrm{ogd}}$; receive $x_t$ from $A_{\mathrm{ftrl}}$
        With probability $p_t$, get a hint $h_t$ and play $\hat x_t = -h_t$; otherwise, play $\hat x_t = x_t$
        Receive $c_t$
        Send $c_t$ to $A_{\mathrm{ftrl}}$ as the $t$-th cost; send $z_t = -(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle)$ and $\sigma_t = \|c_t\|^2$ to $A_{\mathrm{ogd}}$
    end for

Theorem 3.4. When $B = \emptyset$,
$$\mathbb{E}[R_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le \frac{78 + 38\log(1+\|c\|^2_{1:T})}{\epsilon} \quad\text{and}\quad \mathbb{E}[Q_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le \frac{20\sqrt{\|c\|^2_{1:T}}}{\epsilon}.$$

Proof. Let us first bound the expected cost of querying the hints. From the description of $A_{\mathrm{ogd}}$ (Algorithm 2), because of the shrinking domain, we have $p_t \le |D_{t-1}| \le \frac{10}{\epsilon\sqrt{1+\sigma_{1:t-1}}}$. At time $t$, the expected query cost paid by the algorithm is $p_t\|c_t\|^2$. Using the above, we can bound the total expected query cost as
$$\sum_{t=1}^T \frac{10\,\sigma_t}{\epsilon\sqrt{1+\sigma_{1:t-1}}} \le \frac{20\sqrt{\sigma_{1:T}}}{\epsilon} = \frac{20\sqrt{\|c\|^2_{1:T}}}{\epsilon},$$
where the last inequality follows from concavity of the square root function (e.g., [20, Lemma 4]).

Now we proceed to the more challenging task of bounding the expected regret. We start by noting that the expected loss on any round $t$ is simply $\mathbb{E}[\langle c_t, \hat x_t\rangle] = p_t\langle c_t, -h_t\rangle + (1-p_t)\langle c_t, x_t\rangle$. Therefore the expected regret for a fixed $u$ is
$$\sum_{t=1}^T \mathbb{E}[\langle c_t, \hat x_t - u\rangle] = \sum_{t=1}^T \Big[p_t\langle c_t, -h_t - x_t\rangle + \langle c_t, x_t - u\rangle\Big] \le \sum_{t=1}^T \Big[-p_t\big(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle\big) + \langle c_t, x_t - u\rangle\Big],$$
where we have used the fact that the hint $h_t$ is $\epsilon$-good. The main claim is then the following.

Lemma 3.5. For the choice of $p_t, x_t$ defined in $A_{\mathrm{hints}}$ (Algorithm 3) and any $\|u\| \le 1$, we have
$$\sum_{t=1}^T \Big[-p_t\big(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle\big) + \langle c_t, x_t - u\rangle\Big] \le \frac{78 + 38\log(1+\|c\|^2_{1:T})}{\epsilon}.$$

Assuming this, the bound on expected regret easily follows, completing the proof.

We now focus on proving Lemma 3.5.

Proof of Lemma 3.5. The key idea is to prove the existence of a valid-in-hindsight sequence $(q_t)_{t=1}^T$ (Definition 3.2) such that when $p_t$ on the LHS is replaced with $q_t$, the sum is $O(\log(T)/\epsilon)$. The guarantee of Algorithm 2 (i.e., Theorem 3.3) then completes the proof. Specifically, since we set $\lambda = 10/\epsilon$ and $z_t = -(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle)$, Theorem 3.3 guarantees that for any valid-in-hindsight sequence $(q_t)_{t=1}^T$, we have
$$\sum_{t=1}^T \Big[-p_t\big(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle\big) + \langle c_t, x_t - u\rangle\Big] \le \sum_{t=1}^T \Big[-q_t\big(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle\big) + \langle c_t, x_t - u\rangle\Big] + \frac{10\big(1 + 3\log(1+\|c\|^2_{1:T})\big)}{\epsilon}.$$

Let us define $S$ to be the largest index in $[T]$ such that $\|c_{1:S}\| \le \frac{\epsilon}{4}(1+\sigma_{1:S})$. Firstly, by Theorem 3.1 (part 2), for any such index $S$ we have
$$\sum_{t=1}^T \langle c_t, x_t - u\rangle \le \frac{2}{\epsilon} + \|c_{1:S}\| + \sum_{t=1}^S \langle c_t, x_t\rangle + \frac{18 + 8\log(1+\sigma_{1:T})}{\epsilon}. \qquad (5)$$
Let $\psi = \frac{2}{\epsilon} + \|c_{1:S}\| + \sum_{t=1}^S \langle c_t, x_t\rangle$ denote the excess over the logarithmic term. Note that by Theorem 3.1 (part 1), for the vector $u = -\frac{c_{1:S}}{\|c_{1:S}\|}$, we have $\|c_{1:S}\| + \sum_{t=1}^S \langle c_t, x_t\rangle = \sum_{t=1}^S \langle c_t, x_t - u\rangle \le 4.5\sqrt{1+\sigma_{1:S}}$. Thus, we have $\psi \le \frac{2}{\epsilon} + 4.5\sqrt{1+\sigma_{1:S}}$.

Now, note that if $1+\sigma_{1:S} \le \frac{100}{\epsilon^2}$, then $\psi \le \frac{50}{\epsilon}$; thus, by setting $q_t = 0$ for all $t$ (which is clearly valid in hindsight), the proof follows. Further, if $\sigma_{1:S} \le 1$, then clearly we can again set $q_t = 0$ for all $t$ to complete the proof. Thus, we assume in the remainder of the proof that $1+\sigma_{1:S} > \frac{100}{\epsilon^2}$ (so in particular $\sqrt{1+\sigma_{1:S}} > 10/\epsilon$) and that $\sigma_{1:S} > 1$; in this case $\psi \le 5\sqrt{1+\sigma_{1:S}}$.

Our goal now is to construct a valid-in-hindsight switch-once sequence that has value $q_t = \delta \in [0, 1]$ for $t \le S$ and $q_t = 0$ for $t > S$, such that we also have
$$\delta\left(\epsilon\,\sigma_{1:S} + \sum_{t \le S}\langle c_t, x_t\rangle\right) \ge 5\sqrt{1+\sigma_{1:S}}. \qquad (6)$$
First, let us lower bound the term in the parentheses on the LHS above. We do so using the following claim.

Claim. $\epsilon\,\sigma_{1:S} + \sum_{t \le S}\langle c_t, x_t\rangle \ge \frac{\epsilon}{2}(1+\sigma_{1:S})$.

Proof of claim. Suppose for contradiction that $\sum_{t \le S}\langle c_t, x_t\rangle < \frac{\epsilon}{2}(1-\sigma_{1:S})$. By the definition of $S$, we have $\|c_{1:S}\| \le \frac{\epsilon}{4}(1+\sigma_{1:S})$; combining these two bounds with Theorem 3.1 (part 1) and the standing assumption $\sqrt{1+\sigma_{1:S}} > 10/\epsilon$ forces $\sigma_{1:S}$ to be at most $1$. Since we assumed $\sigma_{1:S} > 1$, this is a contradiction, so the claim holds.

Thus, in order to satisfy (6), we simply choose
$$\delta = \frac{10}{\epsilon\sqrt{1+\sigma_{1:S}}}.$$
By assumption, this lies in $[0, 1]$, and further, for $\lambda = 10/\epsilon$, the $(q_t)$ defined above is a valid-in-hindsight sequence. (Indeed, by the claim, $\delta\big(\epsilon\sigma_{1:S} + \sum_{t\le S}\langle c_t, x_t\rangle\big) \ge \delta\cdot\frac{\epsilon}{2}(1+\sigma_{1:S}) = 5\sqrt{1+\sigma_{1:S}}$, so (6) holds.)

Combining the fact that $\psi \le 5\sqrt{1+\sigma_{1:S}}$ with (6) — note that $\sum_t -q_t\big(\epsilon\sigma_t + \langle c_t, x_t\rangle\big) = -\delta\big(\epsilon\sigma_{1:S} + \sum_{t\le S}\langle c_t, x_t\rangle\big) \le -\psi$, which cancels the $\psi$ term in (5) — we have that
$$\sum_{t=1}^T \Big[-q_t\big(\epsilon\|c_t\|^2 + \langle c_t, x_t\rangle\big) + \langle c_t, x_t - u\rangle\Big] \le \frac{18 + 8\log(1+\sigma_{1:T})}{\epsilon}.$$
Now, appealing to the guarantee of Theorem 3.3 as discussed earlier, we can replace $q_t$ with $p_t$ by suffering an additional logarithmic term on the RHS. Combining all of these cases completes the proof of Lemma 3.5, and thus also the proof of Theorem 3.4.
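For concreteness, here is a compact end-to-end sketch (ours) of Algorithm 3 as reconstructed above, using the closed-form solution of the FTRL step on the unit ball; the hint oracle `get_hint` is an assumed interface, and the sketch is meant to mirror the pseudocode rather than serve as a reference implementation.

```python
# An end-to-end sketch (ours) of Algorithm 3: adaptive FTRL supplies x_t, the
# shrinking-domain OGD supplies the query probability p_t, and with probability
# p_t the algorithm queries a hint and plays -h_t.  `get_hint(t)` is an assumed
# interface, not part of the paper.
import numpy as np

def a_hints(costs, get_hint, eps, rng=np.random.default_rng(0)):
    """costs: list of numpy vectors with ||c_t|| <= 1."""
    d = len(costs[0])
    lam = 10.0 / eps                       # OGD parameter lambda = 10 / eps
    c_sum = np.zeros(d)                    # c_{1:t-1}
    sigma_sum = 0.0                        # sigma_{1:t-1}
    p = 0.0                                # OGD iterate, p_1 = 0
    total_loss, query_cost = 0.0, 0.0
    for t, c_t in enumerate(costs):
        # FTRL point: argmin_{||x||<=1} <c_{1:t-1}, x> + (r_{0:t-1}/2) ||x||^2
        r = np.sqrt(1.0 + sigma_sum)
        x_t = -c_sum / max(r, np.linalg.norm(c_sum))
        # query decision
        if rng.random() < p:
            h_t = get_hint(t)
            played = -h_t                  # use the hint
            query_cost += float(c_t @ c_t)
        else:
            played = x_t
        total_loss += float(c_t @ played)
        # feedback: FTRL receives c_t; OGD receives z_t and sigma_t
        sigma_t = float(c_t @ c_t)
        z_t = -(eps * sigma_t + float(c_t @ x_t))
        c_sum += c_t
        sigma_sum += sigma_t
        upper = min(1.0, lam / np.sqrt(1.0 + sigma_sum))   # shrinking domain D_t
        eta = lam / (1.0 + sigma_sum)
        p = min(max(p - eta * z_t, 0.0), upper)
    return total_loss, query_cost
```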
4 Extensions and applications

In the following subsections, we extend the analysis of Algorithm 3 to disparate settings: we consider robustness to uninformative or bad hints, the more classical optimistic regret setting, and online learning with abstention.

4.1 Bad hints

First, we extend Theorem 3.4 to the case $B \ne \emptyset$ by carefully accounting for the regret incurred during rounds with $t \in B$ and making crucial use of the shrinking domain $D_t$.

Theorem 4.1. For any $B$,
$$\mathbb{E}[R_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le \frac{78 + 38\log(1+\|c\|^2_{1:T})}{\epsilon} + O\!\left(\sqrt{\sum_{t\in B}\big(\|c_t\|^2 + \|h_t\|^2\big)}\,\log(1+\|c\|^2_{1:T})\right), \quad\text{and}\quad \mathbb{E}[Q_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le \frac{20\sqrt{\|c\|^2_{1:T}}}{\epsilon}.$$

4.2 Optimistic bounds

Next, we show that our results have implications for optimistic regret bounds (e.g., [10, 6, 25, 27, 23]). The standard optimistic regret bound takes the form $O\big(\sqrt{\sum_{t=1}^T \|c_t - h_t\|^2}\big)$. We will show that the same result can be obtained (up to logarithmic factors) even while only looking at $O(\sqrt{T})$ hints. The approach is very simple: if we set $\epsilon = \frac14$, then a little calculation shows that for $t \in B$, $\|c_t\|^2 + \|h_t\|^2 = O(\|c_t - h_t\|^2)$, so that Theorem 4.1 directly implies the desired result.
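Concretely, for any $t \in B$ with $\epsilon = \frac14$ we have $\langle c_t, h_t\rangle < \frac14\|c_t\|^2$, and hence
$$\|c_t - h_t\|^2 = \|c_t\|^2 - 2\langle c_t, h_t\rangle + \|h_t\|^2 > \tfrac12\|c_t\|^2 + \|h_t\|^2 \ge \tfrac12\big(\|c_t\|^2 + \|h_t\|^2\big),$$
so one can take the implied constant above to be $2$.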
Theorem 4.2. Set $\epsilon = \frac14$. Then
$$\mathbb{E}[R_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le 312 + 152\log(1+\|c\|^2_{1:T}) + O\!\left(\sqrt{\sum_{t=1}^T \|c_t - h_t\|^2}\,\log T\right), \quad\text{and}\quad \mathbb{E}[Q_{A_{\mathrm{hints}},\epsilon}(\vec c)] \le 80\sqrt{\|c\|^2_{1:T}}.$$

4.3 Online learning with abstention

Finally, we apply our algorithm to the problem of online learning with abstention. In this variant of the OLO game, instead of being provided with hints, the learner is allowed to abstain on any given round. When the learner abstains, it receives a loss of $-\epsilon\|c_t\|^2$ for some known $\epsilon$, but pays a query cost of $\|c_t\|^2$. The regret is again the total loss suffered by the learner minus the total loss suffered by the best fixed comparator, which does not abstain. This setting is very similar to the scenario studied by [24] in the expert setting, but in addition to moving from the simplex to the unit ball, we ask for a more detailed guarantee from the learner: it is not allowed to abstain too often, as measured by the query cost. In this setting, our Algorithm 3 works essentially out of the box: every time the algorithm would query a hint, we instead simply choose to abstain. This procedure then guarantees the following.

Theorem 4.3. In the online learning with abstention model, the variant of Algorithm 3 that abstains whenever the original algorithm would ask for a hint guarantees expected regret at most
$$\frac{78 + 38\log(1+\|c\|^2_{1:T})}{\epsilon}.$$
Further, the expected query cost is at most $\frac{20\sqrt{\|c\|^2_{1:T}}}{\epsilon}$.

Proof. Since we abstain with probability $p_t$ and otherwise play $x_t$, the expected regret is $\sum_{t=1}^T\big[-p_t\,\epsilon\|c_t\|^2 + (1-p_t)\langle c_t, x_t\rangle - \langle c_t, u\rangle\big]$, which is exactly the quantity bounded in Lemma 3.5. Thus the regret bound follows directly from Lemma 3.5, and the query cost bound follows by the identical argument as in Theorem 3.4.

5 Lower bounds

In this section we first show that the regret bound obtained in Theorem 3.4 is essentially tight. Next, we show that randomness is necessary in our algorithms.

Theorem 5.1. Let $\epsilon \in (0, 1]$ be any parameter, and suppose $A$ is an algorithm for OLO with hints that makes $o(\sqrt{T}/\epsilon)$ hint queries. Then there exists a sequence of cost vectors $c_t$ and hints $h_t$ of unit length such that (a) in any round $t$ where a hint is asked, $\langle c_t, h_t\rangle \ge \epsilon\|c_t\|^2$, and (b) the regret of $A$ on this input sequence is $\Omega(\sqrt{T})$.

Proof. We will construct a distribution over inputs $\{c_t, h_t\}$ and argue that any deterministic algorithm has expected regret $\Omega(\sqrt{T})$ on inputs drawn from this distribution. By the minimax theorem, we then have a lower bound for any (possibly randomized) algorithm $A$. We consider two-dimensional inputs. At each step, $h_t$ is chosen to be a uniformly random vector on the unit circle (in $\mathbb{R}^2$). The cost is set to $c_t = \epsilon h_t \pm \sqrt{1-\epsilon^2}\,h_t^\perp$, where $h_t^\perp$ is a unit vector orthogonal to $h_t$ and the sign is chosen uniformly at random; note that $\|c_t\| = 1$ and $\langle c_t, h_t\rangle = \epsilon\|c_t\|^2$, so the hint is $\epsilon$-good. Now, for any deterministic algorithm, if $A$ queries $h_t$ at time $t$, then it can play a point $a h_t + b h_t^\perp$ for scalars $a, b$. In expectation over the sign, this has inner product $\epsilon a$ with $c_t$, and thus the expected cost incurred by $A$ in this step is at least $-\epsilon$ (since $|a| \le 1$). If $A$ does not query $h_t$, then $c_t$ is simply a random unit vector, and thus the expected cost incurred by $A$ in this step is $0$. Next, consider the value $\min_{\|u\|\le 1}\sum_t\langle c_t, u\rangle$, i.e., the best cost in hindsight; this is clearly $-\|\sum_t c_t\|$. By construction, each $c_t$ is a uniformly random vector on the unit circle in $\mathbb{R}^2$ (and the choices are independent across $t$). Thus we have $\mathbb{E}\big[\|\sum_t c_t\|\big] = \Omega(\sqrt{T})$. (This follows from properties of sums of independent random unit vectors; see the Supplementary Material for a proof.) Thus, if the algorithm makes $K$ queries, then the expected regret is at least $-\epsilon K + \Omega(\sqrt{T})$; this quantity is $\Omega(\sqrt{T})$ as long as $K = o(\sqrt{T}/\epsilon)$, completing the proof.

We next show that for deterministic algorithms, making $O(\sqrt{T})$ hint queries is insufficient for obtaining $o(\sqrt{T})$ regret, even if the hints provided are always $1$-good (i.e., the hints are perfect).

Theorem 5.2. Let $A$ be any deterministic algorithm for OLO with hints that makes at most $C\sqrt{T} < T/2$ queries, for some parameter $C > 0$. Then there is a sequence of cost vectors $c_t$ and hints $h_t$ of unit length such that (a) $h_t = c_t$ whenever $A$ makes a hint query, and (b) the regret of $A$ on this input sequence is at least $\Omega(\sqrt{T}/C)$.

We remark that by setting $C$ appropriately, we can also show that for a deterministic algorithm to achieve logarithmic regret, it needs to make $\Omega(T/\log T)$ hint queries.
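The construction in the proof of Theorem 5.1 is easy to visualize numerically; the following small simulation (ours, not the paper's) draws the two-dimensional instance and reports the scale of the comparator's advantage.

```python
# A small simulation (ours) of the Theorem 5.1 construction in R^2: h_t is a
# random direction and c_t = eps*h_t +/- sqrt(1-eps^2)*h_t_perp, so a queried
# hint is worth at most -eps per round, while the best fixed point in hindsight
# has cost -||c_{1:T}|| = -Omega(sqrt(T)) in expectation.
import numpy as np

rng = np.random.default_rng(1)
T, eps = 20_000, 0.25
c_sum = np.zeros(2)
for _ in range(T):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    h = np.array([np.cos(theta), np.sin(theta)])
    h_perp = np.array([-h[1], h[0]])
    c = eps * h + rng.choice([-1.0, 1.0]) * np.sqrt(1.0 - eps**2) * h_perp
    c_sum += c                                   # each c_t is a uniform random unit vector

best_in_hindsight = -np.linalg.norm(c_sum)       # expected value is -Omega(sqrt(T))
print(f"comparator cost ~ {best_in_hindsight:.1f}; "
      f"K queries gain at most {eps}*K, so K = o(sqrt(T)/eps) cannot close the gap")
```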
6 Unconstrained setting

In this section, we consider unconstrained online learning, in which the domain is all of $\mathbb{R}^d$. In this setting, it is unreasonable to define the regret using a supremum over all comparison points $u \in \mathbb{R}^d$, as this would invariably lead to infinite regret. Instead, we bound the regret as a function of $u$. For example, when hints are not available, standard results provide bounds of the form [7, 15, 28, 21, 5]:
$$\sum_{t=1}^T \langle c_t, x_t - u\rangle \le \delta + A\|u\|\sqrt{\sum_{t=1}^T \|c_t\|^2\,\log(\|u\|T/\delta + 1)} + B\|u\|\log(\|u\|T/\delta + 1), \qquad (7)$$
for constants $A$ and $B$ and any user-specified $\delta > 0$. Using such algorithms as building blocks, we design an algorithm with $O(\sqrt{T})$ expected query cost whose regret against every comparator $u$ scales as $O(\|u\|/\epsilon)$ up to logarithmic factors. The algorithm is somewhat simpler than in the constrained case: we take an ordinary algorithm that does not use hints and subtract the hints from its predictions. Intuitively, each subtraction decreases the regret by roughly $\epsilon\|c_t\|^2$, so we need only $O(\sqrt{T})$ such events. With constraints, this approach is untenable because subtracting the hint might violate the constraint, but there is no such problem in the unconstrained setting. Instead, the primary difficulty is that we need the regret to decrease not by $\epsilon\|c_t\|^2$ but by $\epsilon\|u\|\,\|c_t\|^2$ for some unknown $\|u\|$. This is accomplished by learning a scaling factor $y_t$ that is applied to the hints. Moreover, in the case that all hints are guaranteed to be $\epsilon$-good, we devise a deterministic algorithm for the unconstrained setting. In light of Theorem 5.2, this establishes a surprising separation between the unconstrained and constrained settings. For the deterministic approach, we directly measure the total query cost and simply query a hint whenever this cost is still below our desired budget. Note that this strategy fails if we allow bad hints, as the adversary could provide a bad hint every time we ask for one. The full algorithm is presented in Algorithm 4, with the randomized and deterministic analyses provided by Theorems 6.1 and 6.2.

Algorithm 4 Algorithm with hints (unconstrained case).
    Require: Parameters $\epsilon$, $\delta$, $K$; a $d$-dimensional unconstrained OLO algorithm $A_{\mathrm{unc}}$ and a one-dimensional unconstrained OLO algorithm $A_{\mathrm{unc\text{-}1D}}$ guaranteeing (7)
    for $t = 1, \dots, T$ do
        {Randomized version} $\xi_t \leftarrow 1$ with probability $\min(q_t, 1)$, where $q_t$ is a sampling rate governed by the budget parameter $K$; $0$ otherwise
        {Deterministic version} $\xi_t \leftarrow 1$ iff $1 + \sum_{i=1}^{t-1}\xi_i\langle c_i, h_i\rangle \le K$
        Receive $w_t \in \mathbb{R}^d$ from $A_{\mathrm{unc}}$; receive $y_t \in \mathbb{R}$ from $A_{\mathrm{unc\text{-}1D}}$
        If $\xi_t = 1$, get hint $h_t$
        Play $x_t = w_t - \xi_t\, y_t\, h_t$; receive cost $c_t$
        Send $c_t$ to $A_{\mathrm{unc}}$ as the $t$-th cost; send $g_t = -\xi_t\langle h_t, c_t\rangle \in \mathbb{R}$ to $A_{\mathrm{unc\text{-}1D}}$ as the $t$-th cost
    end for

Theorem 6.1. The randomized version of Algorithm 4 guarantees expected regret at most
$$O\!\left(\frac{\|u\|\sqrt{T}\,\log(\|u\|T/\delta)}{K} + \|u\|\log(\|u\|T/\delta)\log\log(T\|u\|/\delta) + \|u\|\sqrt{\sum_{t\in B}\|h_t\|^2\,\log T}\right),$$
with expected query cost at most $2K$.

Theorem 6.2. If $B = \emptyset$, then the deterministic version of Algorithm 4 guarantees
$$\sum_{t=1}^T \langle c_t, x_t - u\rangle \le 2\delta + O\!\left(\|u\|\log(\|u\|T/\delta + 1) + \|u\|\log^{3/2}(\|u\|T/\delta)\log\log(\|u\|T/\delta)\right),$$
with a query cost at most $2K$.
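The reduction in Algorithm 4 can be summarized by the following schematic sketch (ours); the base learners $A_{\mathrm{unc}}$ and $A_{\mathrm{unc\text{-}1D}}$ and the query rule `query_prob` are abstract placeholders rather than the paper's exact instantiations.

```python
# A schematic sketch (ours) of the randomized reduction in Algorithm 4: an
# unconstrained base learner proposes w_t, a one-dimensional learner proposes a
# hint scale y_t, and on query rounds the played point is w_t - y_t * h_t.
# The learner interfaces and the query rule are placeholders, not the paper's.
import numpy as np

def unconstrained_with_hints(costs, get_hint, base_learner, scale_learner,
                             query_prob, rng=np.random.default_rng(0)):
    total_loss, query_cost = 0.0, 0.0
    for t, c_t in enumerate(costs):
        w_t = base_learner.play()              # w_t in R^d from A_unc
        y_t = scale_learner.play()             # y_t in R from A_unc-1D
        g_t = 0.0                              # scalar cost for the 1-D scale learner
        x_t = w_t
        if rng.random() < query_prob(t):       # query decision (placeholder rule)
            h_t = get_hint(t)
            x_t = w_t - y_t * h_t              # subtract the scaled hint
            query_cost += float(c_t @ c_t)
            g_t = -float(h_t @ c_t)
        total_loss += float(c_t @ x_t)
        base_learner.update(c_t)               # send c_t to A_unc
        scale_learner.update(g_t)              # send g_t to A_unc-1D
    return total_loss, query_cost
```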
7 Conclusions

In this paper, we consider OLO where an algorithm has query access to hints, in both the constrained and unconstrained settings. Surprisingly, we show that it is possible to obtain logarithmic expected regret by querying for hints at only $O(\sqrt{T})$ time steps. Our work also demonstrates an intriguing separation between randomized and deterministic algorithms for constrained online learning. While our algorithms need to know $\epsilon$, an open question is to obtain an algorithm that can operate without knowing $\epsilon$. Extending our model to the bandit setting is also an interesting research direction.

Acknowledgments and Disclosure of Funding

Aditya Bhaskara acknowledges the support of NSF awards 2047288 and 2008688.

References

[1] Etienne Bamas, Andreas Maggiori, Lars Rohwedder, and Ola Svensson. Learning augmented energy minimization via speed scaling. In NeurIPS, 2020.
[2] Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In ICML, 2020.
[3] Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online linear optimization with many hints. In NeurIPS, 2020.
[4] Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret bounds for bandits. In COLT, pages 508–528, 2019.
[5] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Impossible tuning made possible: A new expert algorithm and its applications. In COLT, 2021.
[6] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT, 2012.
[7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In COLT, pages 1493–1529, 2018.
[8] Ofer Dekel, Arthur Flajolet, Nika Haghtalab, and Patrick Jaillet. Online learning with a hint. In NIPS, pages 5299–5308, 2017.
[9] Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In ICML, pages 2319–2327, 2019.
[10] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.
[11] Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. JMLR, 18(145):1–31, 2017.
[12] Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, and Manish Purohit. Non-clairvoyant scheduling with predictions. In SPAA, pages 285–294, 2021.
[13] Zhihao Jiang, Debmalya Panigrahi, and Kevin Sun. Online algorithms for weighted paging with predictions. In ICALP, pages 69:1–69:18, 2020.
[14] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. JCSS, 71(3):291–307, 2005.
[15] Michal Kempka, Wojciech Kotlowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In ICML, pages 3321–3330, 2019.
[16] Ravi Kumar, Manish Purohit, Aaron Schild, Zoya Svitkina, and Erik Vee. Semi-online bipartite matching. In ITCS, pages 50:1–50:20, 2019.
[17] Ravi Kumar, Manish Purohit, and Zoya Svitkina. Improving online algorithms using ML predictions. In NeurIPS, pages 9661–9670, 2018.
[18] Thomas Lavastida, Benjamin Moseley, R. Ravi, and Chenyang Xu. Learnable and instance-robust predictions for online matching, flows and load balancing. In ESA, pages 59:1–59:17, 2021.
[19] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In ICML, pages 3296–3305, 2018.
[20] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. JMLR, 18(1):3117–3166, 2017.
[21] Zakaria Mhammedi and Wouter M. Koolen. Lipschitz and comparator-norm adaptivity in online learning. In COLT, pages 2858–2887, 2020.
[22] Michael Mitzenmacher. Scheduling with predictions and the price of misprediction. In ITCS, pages 14:1–14:18, 2020.
[23] Mehryar Mohri and Scott Yang. Accelerating online convex optimization via adaptive prediction. In AISTATS, pages 848–856, 2016.
[24] Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In COLT, pages 3030–3048, 2020.
[25] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, pages 993–1019, 2013.
[26] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In SODA, pages 1834–1845, 2020.
[27] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, 2014.
[28] Dirk van der Hoeven. User-specified local differential privacy in unconstrained adaptive online learning. In NeurIPS, pages 14080–14089, 2019.
[29] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In COLT, pages 1263–1291, 2018.
[30] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.