# Online Linear Optimization with Many Hints

Aditya Bhaskara (Department of Computer Science, University of Utah, Salt Lake City, UT; bhaskaraaditya@gmail.com)
Ashok Cutkosky (Dept. of Electrical and Computer Engineering, Boston University, Boston, MA; ashok@cutkosky.com)
Ravi Kumar (Google Research, Mountain View, CA; ravi.k53@gmail.com)
Manish Purohit (Google Research, Mountain View, CA; mpurohit@google.com)

Abstract

We study an online linear optimization (OLO) problem in which the learner is provided access to $K$ "hint" vectors in each round prior to making a decision. In this setting, we devise an algorithm that obtains logarithmic regret whenever there exists a convex combination of the $K$ hints that has positive correlation with the cost vectors. This significantly extends prior work that considered only the case $K = 1$. To accomplish this, we develop a way to combine many arbitrary OLO algorithms to obtain regret that is only a logarithmic factor worse than the minimum regret of the original algorithms in hindsight; this result is of independent interest.

1 Introduction

In this paper we consider a variant of the classic online linear optimization (OLO) problem [28]. In OLO, at each time step, an algorithm must play a point $x_t$ in some convex set $X \subseteq \mathbb{R}^d$, and then it is presented with a cost vector $c_t$ and incurs loss $\langle c_t, x_t\rangle$. This process repeats for $T$ time steps. The algorithm's performance is measured via the regret relative to some comparison point $u \in X$, defined as $\sum_{t=1}^T \langle c_t, x_t - u\rangle$. This problem is of fundamental interest in a variety of fields. OLO algorithms are directly applicable for solving the learning with expert advice problem as well as online convex optimization [4]. Further, in machine learning, one frequently encounters stochastic convex optimization problems, which may be solved via online convex optimization through the online-to-batch conversion [3]. Many of the popular optimization algorithms used in machine learning practice today (e.g., [10, 18]) can be analyzed within the OLO framework. For more details and further applications, we refer the interested reader to the excellent texts [4, 12, 24].

OLO is well understood from an algorithmic viewpoint. For the vanilla version of the problem, algorithms with regret $O(\sqrt{T})$ are known [28, 16] and this bound is tight [4]. An interesting line of research has been to identify situations and conditions under which the regret can be substantially smaller than $\sqrt{T}$. Towards this, Dekel et al. [9] proposed the study of OLO augmented with hints; their work was motivated by an earlier work of Hazan and Megiddo [14]. In their setup, the algorithm has access to a hint at each time step before it responds, and this hint is guaranteed to be more than $\epsilon$-correlated with the cost vector. They obtained an algorithm with a regret of $O((d/\epsilon)\log T)$, where $d$ is the dimension of the space. Very recently, Bhaskara et al. [2] generalized their results to the case when the hints can be arbitrary, i.e., not necessarily weakly positively correlated at each time step. They obtain an algorithm with a dimension-free regret bound of roughly $O((\sqrt{B_\epsilon}/\epsilon)\log T)$, where $B_\epsilon$ is the number of (bad) time steps at which the hint is less than $\epsilon$-correlated with the cost vector. While this line of work gives a promising way to go beyond $\sqrt{T}$ regret, in many situations it is not clear how to obtain a single hint sequence that correlates well with the cost vector in most time steps.
Prior work on optimism [13, 23, 26] has suggested using costs from earlier time steps, costs from earlier batches, or even predictions from other learning algorithms. This suggests that it is often possible to obtain multiple sources that each provide a hint sequence, and we may hope that an appropriate combination of them correlates well with the cost vector in most time steps. In this work, we focus on this natural setting in which multiple (arbitrary) hints are available to the algorithm at each time step. If some aggregate of the hints is helpful, we would like to perform as well as if we knew this aggregate a priori. As we discuss in Section 3.2, this is difficult because the benefit of aggregating multiple hints is a nonlinear function of the benefits of the individual hints. Even if all the hints are individually bad, an algorithm may be able to gain significantly from using some convex combination of the hints.

Our results. Let $K$ be the number of hints available at each time step. We obtain an online learning algorithm for the constrained case, where the responses of the algorithm must lie inside the unit ball. Our algorithm obtains a regret of roughly $O\big(\sqrt{B_\epsilon \log T}/\epsilon + (\log T + (\log T)(\log K))/\epsilon\big)$, where $B_\epsilon$ is the number of time steps at which the best convex combination of the hints is less than $\epsilon$-correlated with the cost vector. We refer to Theorem 5 for the formal guarantees. We also obtain lower bounds showing that the dependence of the regret on both $K$ and $\epsilon$ is essentially tight (Section 3.3).

Our algorithm is designed in two stages. In the first stage, we assume that the optimal threshold $\epsilon$ is known. We build an algorithm based on carefully defining a smoothed hinge loss function that captures the performance over the entire simplex of hints, and then using Mirror Descent on these losses. The second stage eliminates the assumption that $\epsilon$ is known by developing a new combiner algorithm. This is a general randomized procedure that combines a collection of online learning algorithms and achieves regret only logarithmically worse than the minimum regret of the original algorithms. (This combiner is of independent interest and we show a few applications outside our main theme.) For the unconstrained setting (defined formally below), we develop an algorithm that achieves a (relative) regret of roughly $O\big((\log T)(\sqrt{B_\epsilon}/\epsilon + \sqrt{\log K}/\epsilon)\big)$, where $B_\epsilon$ is once again defined as before. Our algorithm thus competes with the best convex combination of the hints.

2 Preliminaries

Let $[T] = \{1, \ldots, T\}$. In the classical online learning setting, at each time $t \in [T]$, an algorithm $A$ responds with a vector $x_t \in \mathbb{R}^d$. After the response, a cost vector $c_t \in \mathbb{R}^d$ is revealed and the algorithm incurs a cost of $\langle c_t, x_t\rangle$. We assume that $\|c_t\| \le 1$ for all $t \le T$, where $\|\cdot\|$ always denotes the 2-norm unless specified otherwise. The regret of the algorithm $A$ for a vector $u \in \mathbb{R}^d$ is
$$R_A(u, \vec c) = R_A(u, \vec c, T) = \sum_{t=1}^T \langle c_t, x_t - u\rangle.$$
A hint is a vector $h \in \mathbb{R}^d$ with $\|h\| \le 1$, and $\vec h = (h_1, h_2, \ldots)$ is a sequence of hints. We consider the case when there are multiple hints available to the algorithm $A$. In each round $t$, the algorithm $A$ gets $K$ hints $h^{(1)}_t, \ldots, h^{(K)}_t$ before it responds with $x_t$. While some of the hint sequences might be good and others might be misleading, our goal is to design an algorithm that does nearly as well as if we were just given the best sequence of hints. Let $H = \{\vec h^{(1)}, \ldots, \vec h^{(K)}\}$ denote the set of hint sequences. The regret definition is the same as always and is denoted $R_A(u, \vec c \mid H)$.
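To fix ideas, the following minimal Python sketch pins down the interaction order of this protocol. The learner interface (`respond`/`observe_cost`) and the function names are our own illustration, not part of the paper.

```python
import numpy as np

def run_olo_with_hints(learner, hints, costs):
    """Simulate OLO with K hints per round; return the per-round losses.

    hints[t] is a (K, d) array of hint vectors (each with norm <= 1);
    costs[t] is a d-vector with norm <= 1, revealed only after the response.
    """
    losses = []
    for h_t, c_t in zip(hints, costs):
        x_t = learner.respond(h_t)        # learner sees the K hints, then plays x_t
        losses.append(float(c_t @ x_t))   # incurs <c_t, x_t>
        learner.observe_cost(c_t)         # cost revealed after the response
    return losses

def regret(losses, costs, u):
    """Regret versus a fixed comparator u: sum_t <c_t, x_t - u>."""
    return sum(losses) - sum(float(c @ u) for c in costs)
```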
Let $\Delta_K \subseteq \mathbb{R}^K$ denote the simplex. Given a sequence $\vec w = (w_1, w_2, \ldots)$ of vectors in $\Delta_K$, we write $H(\vec w)$ to denote the sequence of hints whose $t$th hint is $\sum_{i=1}^K w^{(i)}_t h^{(i)}_t$, where $w^{(i)}_t$ denotes the $i$th component of $w_t$. If $\vec w$ is a constant sequence $(w, w, \ldots)$, then we write $H(w)$ instead of $H(\vec w)$.

Let $\epsilon > 0$ be a fixed threshold. For a fixed hint sequence $\vec h$, we define $B^{\vec h}_\epsilon$ to be the set of all time steps where the hint $h_t$ is bad, i.e., less than $\epsilon$-correlated with the cost $c_t$. Formally,
$$B^{\vec h}_\epsilon = \big\{t \in [T] : \langle c_t, h_t\rangle < \epsilon\|c_t\|^2\big\}.$$

We consider two settings in which to measure the worst-case regret of an algorithm. In the constrained setting, we are given some set $\mathcal{B}$ and the worst-case regret of $A$ is defined as $R_A(\mathcal{B}, \vec c \mid H) = \sup_{u \in \mathcal{B}} R_A(u, \vec c \mid H)$; in this paper we take $\mathcal{B} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, the unit ball. In the unconstrained setting, the regret of $A$ is measured over all $u \in \mathbb{R}^d$ and we denote it by $R_A(u, \vec c \mid H)$, which we will bound uniformly by a function of $u$.

2.1 Single hint case

We now recall and mildly improve the results of [2] for the case that there is a single hint at every time step (i.e., $K = 1$). We consider the case of fixed and known $\epsilon$; note that the algorithm of [2] is agnostic to $\epsilon$, but we show that by committing to a fixed $\epsilon$ we can improve the regret bound. We will remove this dependence on a known $\epsilon$ later in Section 4. The modification to both the algorithm and the analysis is not hard, and so we defer the proof to Appendix A.

Theorem 1. For any $0 < \epsilon < 1$, there exists an algorithm 1-HINT$_\epsilon$ that runs in $O(d)$ time per update, takes a single hint sequence $\vec h$, and guarantees regret:
$$R_{\text{1-HINT}_\epsilon}(\mathcal{B}, \vec c \mid \{\vec h\}) \le \frac{1}{\epsilon}\sqrt{(\log T)\Big(\sum_{t\in B^{\vec h}_\epsilon}\|c_t\|^2 + \sum_{t=1}^T \max\big(0, -\langle c_t, h_t\rangle\big)\Big)} \le O\left(\frac{\sqrt{(\log T)\,|B^{\vec h}_\epsilon|}}{\epsilon}\right).$$
In contrast, the bound in [2] had a factor of $(\log T)/\epsilon$ instead of $\sqrt{\log T}/\epsilon$ in the first term.

3 Constrained setting: Known $\epsilon$

Recall that in the constrained setting, the algorithm must always respond with $x_t \in \mathcal{B}$, the unit ball. Our main result is a version of Theorem 1 for $K > 1$, and it extends the previous works of [2, 9]. The high-level approach is quite natural: we design a meta-learner that maintains a loss for each hint sequence at each time, and at time $t$ uses the losses to decide on an appropriate convex combination $w_t$ of the hints $\{h^{(i)}_t\}_{i=1}^K$. We then run an instance of the single-hint algorithm, 1-HINT, using this combination as the provided hint. There are two main challenges with this approach. First, the regret bound of 1-HINT depends on the quantity $|B^\epsilon_{H(\vec w)}|$, which depends on the convex combination $w_t$ used at each step $t$, and it is not clear how to relate it to the corresponding terms for the individual hint sequences. Second, the regret bound assumes knowledge of $\epsilon$, while our final goal is to compete with the best possible (unknown) $\epsilon$. We deal with the second challenge in Section 4 by designing a general combination algorithm. In this section we address the first challenge; all the algorithms in this section assume a fixed and known value of $\epsilon$. Any omitted proofs can be found in Appendix B.

3.1 Multiplicative weights on hint sequences

We first show a result weaker than the main result of this section (Theorem 5). The algorithm is conceptually simpler, and it demonstrates what one obtains by using a simple multiplicative weight update (MWU) rule to learn the best hint sequence among the $K$ sequences, and then using Theorem 1 with the learned hint sequence. Since the single-hint regret bound (Theorem 1) depends only on the number of time steps when the hint has poor correlation with the cost vector, an MWU algorithm with binary losses suffices. In particular, if $\vec h^{\mathrm{MW}}$ denotes the hint sequence obtained from the multiplicative weights algorithm, we can show that $|B^{\vec h^{\mathrm{MW}}}_\epsilon| \le O\big(\min_{i\in[K]} |B^{\vec h^{(i)}}_\epsilon| + \log K\big)$. We defer the proof of the following theorem to Appendix B.1.

Theorem 2. Let $\epsilon \in (0, 1)$ be given. There exists a randomized algorithm $A_{\mathrm{MW}}$ for OLO with $K$ hint sequences that has a regret bound of
$$\mathbb{E}[R_{A_{\mathrm{MW}}}(\mathcal{B}, \vec c \mid H)] \le O\left(\min_{i\in[K]}\frac{\sqrt{(\log T)\big(|B^{\vec h^{(i)}}_\epsilon| + \log K\big)} + \log T}{\epsilon}\right).$$

Note that this is usually weaker than Theorem 5 because it competes only with the best individual hint sequence, and not necessarily with the best convex combination of hints. It can be a better bound only if $K \gg T$, so that $\log K = \omega(\log T)$.
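As an illustration of this reduction, here is a minimal sketch of an MWU hint selector with binary losses (cf. the standard multiplicative-weights method [1]). The sampling scheme and fixed learning rate `eta` are our own simplifications; the actual algorithm and constants are those of Appendix B.1.

```python
import numpy as np

def mwu_hint_selector(hints, costs, eps, eta=0.5):
    """Sketch: multiplicative weights over K hint sequences with binary losses.

    At round t we sample one hint sequence from the current weights and use its
    hint; after seeing c_t, every sequence whose hint was less than
    eps-correlated with c_t (a "bad" step) is penalized. The sampled hints would
    then be fed to the single-hint algorithm 1-HINT, as in Theorem 2.
    """
    K = hints[0].shape[0]
    log_w = np.zeros(K)                       # log-weights, for numerical stability
    chosen = []
    for h_t, c_t in zip(hints, costs):        # h_t: (K, d) array, c_t: (d,) array
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i = np.random.choice(K, p=p)          # sample a hint sequence
        chosen.append(h_t[i])
        bad = (h_t @ c_t) < eps * (c_t @ c_t)  # binary loss: 1 on bad steps
        log_w -= eta * bad.astype(float)       # multiplicative penalty
    return chosen
```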
3.2 Smoothed hinge loss

Algorithm 1 K-HINTS$_\epsilon$
Input: parameter $\epsilon$.
Define $\psi(w) = \log K + \sum_{i=1}^K w^{(i)} \log w^{(i)}$.
Initialize 1-HINT$_{\epsilon/2}$; initialize $w_1 \leftarrow (1/K, \ldots, 1/K) \in \Delta_K$.
For $t = 1, \ldots, T$:
  1. Get hints $h^{(1)}_t, \ldots, h^{(K)}_t$.
  2. Send $h_t \leftarrow \sum_{i=1}^K w^{(i)}_t h^{(i)}_t$ to 1-HINT$_{\epsilon/2}$ and get $x_t$ from 1-HINT$_{\epsilon/2}$.
  3. Respond $x_t$, receive cost $c_t$, and send $c_t$ to 1-HINT$_{\epsilon/2}$.
  4. Set $\ell_t(w) = \phi\big(\langle c_t, \sum_{i=1}^K w^{(i)} h^{(i)}_t\rangle, \epsilon\|c_t\|^2\big)$ and $g_t \leftarrow \nabla \ell_t(w_t)$.
  5. Set $w_{t+1} \leftarrow \operatorname{argmin}_{w\in\Delta_K}\ \langle g_{1:t}, w\rangle + \sqrt{\tfrac{\sum_{\tau=1}^t \|g_\tau\|_\infty^2}{\log K}}\,\psi(w)$.

The multiplicative weights approach allows us to obtain regret guarantees that depend on the number of bad hints in the best of the $K$ hint sequences. But what we would really like is for the regret bound to scale with the number of bad hints in the best convex combination of the hint sequences. This can be a significant gain: consider the setting in which $K = 2$, $\epsilon = \frac14$, and $\|c_t\| = 1$ for all $t$, where on even iterations $t$ we have $\langle c_t, h^{(1)}_t\rangle = -1/4$ while on odd iterations $\langle c_t, h^{(1)}_t\rangle = 1$. Suppose $h^{(2)}_t$ is the reverse: it has high correlation ($\langle c_t, h^{(2)}_t\rangle = 1$) on even iterations and negative correlation ($\langle c_t, h^{(2)}_t\rangle = -1/4$) on odd iterations. Then both $h^{(1)}$ and $h^{(2)}$ have $T/2$ "bad" hints, but the convex combination $\frac{h^{(1)}_t + h^{(2)}_t}{2}$ has correlation $3/8 \ge \epsilon$ at every step, and hence no bad hints! This highlights the fundamental problem with the multiplicative weights approach: linear combinations of hints might result in much better performance than the corresponding linear combination of the respective performances of the hints.

We address this issue by considering a specially crafted loss function that more accurately captures performance over the entire simplex of hints. Intuitively, we would like to design a loss function such that for any $w \in \Delta_K$, the loss $\ell_t(w)$ is low if and only if $h_t(w) = \sum_{i=1}^K w^{(i)} h^{(i)}_t$ has the desired correlation $\epsilon\|c_t\|^2$ with $c_t$. Once we have the appropriate loss function, we can then use an online learning algorithm on the losses $\ell_t$ to obtain the desired convex combination of hints at each time step. Formally, the following smoothed version of the hinge loss is adequate for our purposes:
$$\phi(a, b) = \begin{cases} 0 & a > b, \\ \frac{1}{b}(b - a)^2 & a \in [0, b], \\ b - 2a & a < 0. \end{cases} \tag{1}$$
For any $w \in \Delta_K$, we define the loss function as $\ell_t(w) = \phi(\langle c_t, h_t(w)\rangle, \epsilon\|c_t\|^2)$, where $h_t(w) = \sum_{i=1}^K w^{(i)} h^{(i)}_t$ and $\phi(\cdot,\cdot)$ is as defined in (1). We first present several important properties of this loss function in the following proposition.

Proposition 3. Let $\epsilon \in (0, 1)$ be fixed and for all $t \in [T]$, let $\ell_t(w) = \phi(\langle c_t, h_t(w)\rangle, \epsilon\|c_t\|^2)$. Then:
(a) $\ell_t$ is convex and non-negative.
(b) If $h_t(w)$ is $\epsilon$-good (i.e., $\langle c_t, h_t(w)\rangle \ge \epsilon\|c_t\|^2$), then $\ell_t(w) = 0$ and $0 \in \partial \ell_t(w)$.
(c) If $h_t(w)$ is not $(\epsilon/2)$-good (i.e., $\langle c_t, h_t(w)\rangle < \epsilon\|c_t\|^2/2$), then $\ell_t(w) \ge \epsilon\|c_t\|^2/4$.
(d) $\ell_t$ is 2-Lipschitz with respect to the 1-norm.
(e) $\|\nabla \ell_t(w)\|_\infty^2 \le \frac{4}{\epsilon}\,\ell_t(w)$ for all $w \in \Delta_K$.
(f) $\ell_t(w) \le \epsilon\|c_t\|^2 + 2\max\big(0, -\langle c_t, h_t(w)\rangle\big)$.

Proof. Properties (a) and (b) are immediate from the definition of $\phi(\cdot,\cdot)$. For property (c), if $\langle c_t, h_t(w)\rangle < 0$, then we have $\ell_t(w) = \epsilon\|c_t\|^2 - 2\langle c_t, h_t(w)\rangle \ge \epsilon\|c_t\|^2$. On the other hand, if $0 \le \langle c_t, h_t(w)\rangle < \epsilon\|c_t\|^2/2$, then we have $\ell_t(w) = \frac{(\epsilon\|c_t\|^2 - \langle c_t, h_t(w)\rangle)^2}{\epsilon\|c_t\|^2} \ge \epsilon\|c_t\|^2/4$. For the next properties, define $f : \mathbb{R} \to \mathbb{R}$ by $f(x) = \phi(x, \epsilon\|c_t\|^2)$. By manually computing derivatives of $f$, we can see that $f$ is 2-Lipschitz and $\frac{2}{\epsilon\|c_t\|^2}$-smooth. Further, since $|\langle c_t, h^{(i)}_t\rangle| \le 1$ for all $i$, the map $g(w) = \langle c_t, h_t(w)\rangle$ is 1-Lipschitz with respect to the 1-norm. Therefore $\ell_t$ must be 2-Lipschitz with respect to the 1-norm, proving (d). By inspecting the derivatives of $f$, we see that $f'(x)^2 \le \frac{4}{\epsilon\|c_t\|^2}\,f(x)$. Further, we have $\nabla \ell_t(w)^{(i)} = \langle c_t, h^{(i)}_t\rangle\, f'(\langle c_t, h_t(w)\rangle)$. Therefore $\|\nabla \ell_t(w)\|_\infty \le \|c_t\|\,|f'(\langle c_t, h_t(w)\rangle)|$, from which (e) follows. For (f), we observe that $f(x) \le \epsilon\|c_t\|^2 + 2\max(0, -x)$. □
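A small numerical sketch of the loss (1), checked on the two-hint example above; the one-dimensional cost and the variable names are our own choices for illustration.

```python
import numpy as np

def phi(a, b):
    """Smoothed hinge loss of equation (1): zero for a > b, quadratic on [0, b],
    linear with slope -2 for a < 0."""
    if a > b:
        return 0.0
    if a >= 0:
        return (b - a) ** 2 / b
    return b - 2 * a

def loss(w, hints_t, c_t, eps):
    """ell_t(w) = phi(<c_t, h_t(w)>, eps * ||c_t||^2) for w in the simplex."""
    h_w = hints_t.T @ w                        # h_t(w) = sum_i w^(i) h_t^(i)
    return phi(float(c_t @ h_w), eps * float(c_t @ c_t))

# The K = 2 example from the text (one-dimensional, ||c_t|| = 1):
eps = 0.25
c = np.array([1.0])
h1_even, h1_odd = np.array([-0.25]), np.array([1.0])   # bad on even steps
h2_even, h2_odd = np.array([1.0]), np.array([-0.25])   # bad on odd steps
w = np.array([0.5, 0.5])
for H in (np.stack([h1_even, h2_even]), np.stack([h1_odd, h2_odd])):
    print(loss(w, H, c, eps))   # 0.0 on both parities: the mixture is never bad
```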
We are now ready to present our final algorithm, K-HINTS. At each time step $t$, we first choose $w_t \in \Delta_K$ via FTRL with an entropic regularizer on the losses $\ell_t$ (see the last line of Algorithm 1). We then supply the learned hint $h_t(w_t) = \sum_{i=1}^K w^{(i)}_t h^{(i)}_t$ to an instance of the single-hint algorithm. For technical reasons, we use the single-hint algorithm 1-HINT$_{\epsilon/2}$, in which the desired correlation with the cost vector is set to $\epsilon/2$ instead of $\epsilon$. Algorithm 1 presents the complete pseudocode. The performance of the FTRL subroutine can be bounded via classical results on FTRL (see [19]) used in concert with the smoothness of the losses $\ell_t$, following [25]. The final result is the following Proposition 4, which we prove in Appendix B.

Proposition 4. Let $w_t \in \Delta_K$ be chosen via FTRL on the losses $\ell_t$ as in Algorithm 1. Then, for any $w_\star \in \Delta_K$, we have
$$\sum_{t=1}^T \ell_t(w_t) \le 2\sum_{t=1}^T \ell_t(w_\star) + \frac{22\log K}{\epsilon}.$$

With this proposition, we can prove the main result of this section:

Theorem 5. Let $\epsilon \in (0, 1)$ be given. Then K-HINTS on OLO with $K$ hint sequences guarantees:
$$R_{\text{K-HINTS}_\epsilon}(\mathcal{B}, \vec c \mid H) \le O\left(\min_{w\in\Delta_K}\left[\frac{1}{\epsilon}\sqrt{(\log T)\Big(\sum_{t\in B^{\epsilon}_{H(w)}}\|c_t\|^2 + \sum_{t=1}^T \max\big(0, -\langle c_t, h_t(w)\rangle\big)\Big)} + \frac{(\log T)(\log K)}{\epsilon}\right]\right)$$
$$\le O\left(\min_{w\in\Delta_K}\frac{\sqrt{(\log T)\,|B^{\epsilon}_{H(w)}|} + \log T + (\log T)(\log K)}{\epsilon}\right).$$
In the above, $h_t(w) = \sum_{i=1}^K w^{(i)} h^{(i)}_t$ is the $t$th hint of the sequence $H(w)$ for $w \in \Delta_K$.

Proof. Let $w_\star$ be an arbitrary element of $\Delta_K$. By Proposition 3(f), we have $\ell_t(w_\star) \le \epsilon\|c_t\|^2 + 2\max(0, -\langle c_t, h_t(w_\star)\rangle)$ for all $t$, and $\ell_t(w_\star) = 0$ if $\langle c_t, h_t(w_\star)\rangle \ge \epsilon\|c_t\|^2$. Therefore,
$$\sum_{t=1}^T \ell_t(w_\star) \le \sum_{t\in B^{\epsilon}_{H(w_\star)}}\Big(\epsilon\|c_t\|^2 + 2\max\big(0, -\langle c_t, h_t(w_\star)\rangle\big)\Big) = Q, \tag{2}$$
where we have defined the variable $Q = \sum_{t\in B^{\epsilon}_{H(w_\star)}}\big(\epsilon\|c_t\|^2 + 2\max(0, -\langle c_t, h_t(w_\star)\rangle)\big)$. Further, by definition of the smoothed hinge loss, we have $\ell_t(w_t) \ge \max(0, -\langle c_t, h_t(w_t)\rangle)$ for all $t \in [T]$. Therefore, by Proposition 4 and (2), we have
$$\sum_{t=1}^T \max\big(0, -\langle c_t, h_t(w_t)\rangle\big) \le \sum_{t=1}^T \ell_t(w_t) \le 2Q + \frac{22\log K}{\epsilon}. \tag{3}$$
Also, since the loss function is always non-negative, we have
$$\sum_{t=1}^T \ell_t(w_t) \ge \sum_{t\in B^{\epsilon/2}_{H(\vec w)}} \ell_t(w_t) \ge \frac{\epsilon}{4}\sum_{t\in B^{\epsilon/2}_{H(\vec w)}}\|c_t\|^2,$$
where the second inequality uses Proposition 3(c). Once again, using Proposition 4 and (2), we have
$$\sum_{t\in B^{\epsilon/2}_{H(\vec w)}}\|c_t\|^2 \le \frac{4}{\epsilon}\left(2Q + \frac{22\log K}{\epsilon}\right). \tag{4}$$
Finally, recall that we have sent the hint sequence $H(\vec w) = (h_1(w_1), \ldots, h_T(w_T))$ to the algorithm 1-HINT$_{\epsilon/2}$. Thus by Theorem 1, we have
$$R_{\text{K-HINTS}_\epsilon}(\mathcal{B}, \vec c \mid H) \le \frac{2}{\epsilon}\sqrt{(\log T)\Big(\sum_{t\in B^{\epsilon/2}_{H(\vec w)}}\|c_t\|^2 + \sum_{t=1}^T \max\big(0, -\langle c_t, h_t(w_t)\rangle\big)\Big)};$$
substituting (3) and (4),
$$R_{\text{K-HINTS}_\epsilon}(\mathcal{B}, \vec c \mid H) \le \frac{2}{\epsilon}\sqrt{(\log T)\left(\frac{4}{\epsilon} + 1\right)\left(2Q + \frac{22\log K}{\epsilon}\right)}. \tag{5}$$
The final result now follows from the definition of $Q$ and simple calculations. □

Non-negatively correlated hints. Recall that in the case $K = 1$, [9] obtains a regret of $O((\log T)/\epsilon)$ when all the hints are $\epsilon$-correlated with $c_t$. A weaker assumption is to have $\langle h_t, c_t\rangle \ge 0$ at all steps, with the $\epsilon$-correlation property holding at all but $B$ time steps. In this case, [2] showed that the regret must be at least $\Omega(\sqrt{B})$, and also gave an algorithm that achieves a regret of $O\big(\sqrt{B} + \frac{\log T}{\epsilon}\big)$. Using Theorem 5, we obtain this bound for general $K$.

Corollary 6. Consider OLO with $K$ hint sequences where for every $t$ and every hint $h^{(i)}_t$, we have the property that $\langle h^{(i)}_t, c_t\rangle \ge 0$. Further, suppose that for some $\epsilon > 0$, there exists an (unknown) convex combination $w$ such that for the hint sequence $H(w)$, the number of hints that do not satisfy $\langle h_t(w), c_t\rangle \ge \epsilon\|c_t\|^2$ is at most $B$. Then there exists an algorithm that achieves regret at most
$$O\left(\sqrt{B} + \frac{\log T + \sqrt{\log K}}{\epsilon}\right).$$
Specifically, observe that under the non-negative correlation assumption we have $\max(0, -\langle c_t, h_t(w_t)\rangle) = 0$ for all $t$, so the corresponding term vanishes before substituting to obtain (5); this gives the desired bound.
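To make the weight-learning step of Algorithm 1 concrete, here is a minimal sketch in Python. It uses the closed-form softmax solution of the entropic-regularized FTRL step and the gradient formula from the proof of Proposition 3; the wiring to 1-HINT$_{\epsilon/2}$ is omitted and the class interface is our own.

```python
import numpy as np

class KHints:
    """Sketch of the weight-learning part of Algorithm 1 (assumes K >= 2)."""

    def __init__(self, K, eps):
        self.K, self.eps = K, eps
        self.g_sum = np.zeros(K)      # g_{1:t}, accumulated gradients
        self.g_sq = 0.0               # sum_tau ||g_tau||_inf^2

    def weights(self):
        """FTRL with entropic regularizer over the simplex has a softmax solution:
        w proportional to exp(-g_{1:t} / scale), scale = sqrt(g_sq / log K)."""
        scale = np.sqrt(self.g_sq / np.log(self.K)) if self.g_sq > 0 else 1.0
        z = -self.g_sum / scale
        w = np.exp(z - z.max())
        return w / w.sum()

    def combined_hint(self, hints_t):
        """h_t(w_t), to be sent to 1-HINT_{eps/2}; hints_t is a (K, d) array."""
        return hints_t.T @ self.weights()

    def update(self, hints_t, c_t):
        """Gradient step on ell_t(w) = phi(<c_t, h_t(w)>, eps*||c_t||^2) at w_t."""
        b = self.eps * float(c_t @ c_t)
        if b <= 0:                             # zero cost: loss is identically 0
            return
        w = self.weights()
        a = float(c_t @ (hints_t.T @ w))
        if a > b:
            fprime = 0.0
        elif a >= 0:
            fprime = -2.0 * (b - a) / b
        else:
            fprime = -2.0
        g = fprime * (hints_t @ c_t)           # d ell_t / d w^(i) = <c_t, h^(i)> f'(a)
        self.g_sum += g
        self.g_sq += float(np.max(np.abs(g)) ** 2)
```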
3.3 Lower bounds

In this section we provide some lower bounds, focusing on the dependence on $K$ and $\epsilon$. Our primary technique is to specify hint sequences and costs such that, even given the hint, the cost is $\epsilon$-correlated with some combination of hints, but is otherwise a random variable with mean 0 and variance 1. The high variance in the costs guarantees nearly $\sqrt{T}$ regret, which we express in terms of $\epsilon$ and $K$ to achieve our bounds. Omitted proofs can be found in Appendix C. We begin with a lower bound showing that the dependence on $\sqrt{\log K}$ holds even in one dimension.

Theorem 7. For any $\epsilon$ and $T \ge \frac{1}{\epsilon^2}\log\frac{1}{\epsilon}$, there exists a sequence $\vec c$ of costs and a set $H$ of hint sequences, $|H| = K$ for some $K$, such that: (i) there is a convex combination of the $K$ hints that always has correlation $\epsilon$ with the costs, and (ii) the regret of any online algorithm is at least $\Omega\big(\sqrt{\log K}\big)$.

Next, we show that a dependence on $1/\epsilon$ is also unavoidable:

Theorem 8. In the two-dimensional constrained setting, there is a sequence $\vec h$ of hints and $\vec c$ of costs ($K = 1$) such that: (i) for all $t$, $\langle h_t, c_t\rangle \ge \epsilon$, and (ii) the regret of any online algorithm is at least $\Omega(1/\epsilon)$.

Together, these bounds show that the $\sqrt{\log K}$ and $1/\epsilon$ terms in our upper bounds are necessary. There is a gap between our upper and lower bounds in terms of the dependence on $\log T$, as well as the gap between $\frac{\sqrt{\log K}}{\epsilon}$ and $\max\{\sqrt{\log K}, 1/\epsilon\}$.

4 Combining learners

Algorithm 2 Deterministic combiner $C_{\mathrm{det}}$
Input: online algorithms $A_1, \ldots, A_K$.
Reset $A_1$; set $i \leftarrow 1$, $\gamma \leftarrow 1$, $\tau \leftarrow 1$, $r_{i,\gamma} \leftarrow 0$.
For $t = 1, \ldots, T$:
  1. Get $y_\tau$ from $A_i$ and respond $x_t \leftarrow y_\tau$.
  2. Get cost $c_t$; define $g_\tau \leftarrow c_t$ and send $g_\tau$ to $A_i$ as its $\tau$th cost.
  3. Set $r_{i,\gamma} \leftarrow \sup_{u\in\mathcal{B}}\sum_{\tau'=1}^{\tau}\langle g_{\tau'}, y_{\tau'} - u\rangle$.
  4. If $r_{i,\gamma} > \gamma$: if $i = K$, set $\gamma \leftarrow 2\gamma$; then set $i \leftarrow (i \bmod K) + 1$, $\tau \leftarrow 1$, $r_{i,\gamma} \leftarrow 0$, and reset $A_i$. Otherwise, set $\tau \leftarrow \tau + 1$.

In Section 3, we presented an algorithm for online learning with multiple hints. However, the algorithm required knowing $\epsilon$, the desired correlation between a hint and the cost vector $c_t$. In this section, we eliminate this assumption. To do so, we design a generic way to combine incomparable-in-foresight regret guarantees obtained by different algorithms, and essentially achieve the best regret among them in hindsight. With this combiner, handling unknown $\epsilon$ is easy: instantiate K-HINTS for different values of $\epsilon$ and apply the combiner to get the best among them. The results in this section apply in the constrained setting, and to both the hints setting and the classical no-hints case (see [5] for analogous results that apply in the unconstrained setting when the base algorithms are "parameter-free"). These combiner algorithms are of independent interest and lead to other applications in the constrained online learning setting that we elaborate on in Appendix E. For technical reasons, we need the following definition of a "monotone regret bound". Essentially all regret bounds known for online linear optimization satisfy this definition.

Definition 9 (Monotone regret bound). An online learning algorithm $A$ is associated with a monotone regret bound $S([a, b], \vec c)$ if $S(\cdot,\cdot)$ is such that when $A$ is run on only the costs $c_a, \ldots, c_b$, producing outputs $x_a, \ldots, x_b$, we have the guarantee
$$\sup_{u\in\mathcal{B}}\sum_{t=a}^{b}\langle c_t, x_t - u\rangle \le S([a, b], \vec c),$$
and further it satisfies $S([a', b'], \vec c) \le S([a, b], \vec c)$ for all sequences $\vec c$ whenever $[a', b'] \subseteq [a, b]$.

Note that if an algorithm $A$ has a monotone regret bound $S(\cdot,\cdot)$, then the total regret experienced by $A$ is bounded as $R_A(\mathcal{B}, \vec c) \le S([1, T], \vec c)$.

4.1 Deterministic combiner

We first design a simple deterministic algorithm $C_{\mathrm{det}}$ that combines $K$ online learning algorithms with monotone regret bounds and obtains regret at most $O(K)$ times the regret suffered by the best algorithm on any given cost sequence. The combiner starts with an initial guess of the regret $\gamma$ and guesses that the first algorithm is the best, playing its predictions. It keeps trusting the current choice of the best algorithm until the regret it incurs exceeds the current guess $\gamma$; once that happens, it switches to the next algorithm. Once all the algorithms have been tried, it doubles the guess $\gamma$ and starts over. Notice that this does not require knowledge of the bounds $S_i$: the combiner competes with the true regrets experienced on the sequence, rather than merely with the best bound that present analysis is capable of delivering.

Theorem 10. Suppose $A_1, \ldots, A_K$ are deterministic OLO algorithms associated with monotone regret bounds $S_1, \ldots, S_K$. Suppose that for all $t$, $\sup_{x,y\in\mathcal{B}}\langle c_t, x - y\rangle \le 1$. Then we have:
$$R_{C_{\mathrm{det}}}(\mathcal{B}, \vec c) \le O\Big(K\,\min_i S_i([1, T], \vec c)\Big).$$

Proof sketch. We give a brief sketch here and defer the formal proof to Appendix D. We can divide the operation of Algorithm 2 into phases in which $\gamma$ is constant. In each phase, Algorithm 2 incurs a regret of at most $\gamma + 1$ from each of the $K$ algorithms, for a total regret of at most $K(\gamma + 1)$. Let $P$ denote the total number of phases and let $j = \operatorname{argmin}_i S_i([1, T], \vec c)$ be the algorithm with the least total regret. In the $(P-1)$th phase, algorithm $A_j$ must have incurred a regret of at least $2^{P-2}$ (otherwise we would not have reached the $P$th phase). Since we assume that $S_j$ is a monotone regret bound, it follows that $\min_i S_i([1, T], \vec c) \ge 2^{P-2}$ and hence $P \le \max\big(1,\, 2 + \log_2(\min_i S_i([1, T], \vec c))\big)$. Since $\gamma = 2^{p-1}$ in phase $p$, we can bound the total regret incurred by Algorithm 2 as
$$\sum_{p=1}^{P} K(2^{p-1} + 1) \le K(P + 2^P) \le K\,2^{P+1} \le O\Big(K\,\min_i S_i([1, T], \vec c)\Big).$$
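A minimal sketch of $C_{\mathrm{det}}$ over the unit ball, where the inner supremum evaluates in closed form as $\sup_{u\in\mathcal{B}}\sum_{\tau}\langle g_\tau, y_\tau - u\rangle = \sum_\tau\langle g_\tau, y_\tau\rangle + \|\sum_\tau g_\tau\|$; the factory-based "reset" and the learner interface are our own framing.

```python
import numpy as np

def run_det_combiner(make_learners, costs):
    """Sketch of Algorithm 2: follow one base algorithm at a time, switching when
    its phase regret exceeds gamma, and doubling gamma once all K are exhausted.

    make_learners: list of zero-arg factories, so "Reset A_i" = build a fresh copy.
    """
    K = len(make_learners)
    i, gamma = 0, 1.0
    learner = make_learners[i]()
    loss_sum, g_sum = 0.0, None          # running regret statistics for A_i
    plays = []
    for c_t in costs:
        y = learner.respond()            # current algorithm's play
        plays.append(y)
        learner.observe_cost(c_t)
        loss_sum += float(c_t @ y)
        g_sum = c_t.copy() if g_sum is None else g_sum + c_t
        # worst-case regret of A_i in this phase over the unit ball:
        r = loss_sum + float(np.linalg.norm(g_sum))
        if r > gamma:                    # A_i exhausted its budget
            if i == K - 1:
                gamma *= 2               # all K tried: double the guess
            i = (i + 1) % K
            learner = make_learners[i]() # reset the next algorithm
            loss_sum, g_sum = 0.0, None
    return plays
```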
4.2 Randomized combiner

The deterministic combiner $C_{\mathrm{det}}$, while achieving the best regret among $A_1, \ldots, A_K$, incurs a factor of $K$. We now show that using randomization, this factor can be reduced to $O(\log K)$ in expectation. Intuitively, $C_{\mathrm{det}}$ incurs the factor $K$ because it might be unlucky and have to cycle through all $K$ algorithms even after it correctly guesses $\gamma$. We can avoid this worst-case behavior by selecting the base algorithm uniformly at random, rather than in a deterministic order. We formally describe this randomized combiner $C_{\mathrm{rand}}$ in Algorithm 4 in Appendix D. Informally, within each phase with constant $\gamma$, at each time step $C_{\mathrm{rand}}$ simulates all $K$ algorithms and maintains a candidate set $C$ of algorithms that have incurred regret at most $\gamma$ so far. Once the current algorithm incurs regret $\gamma$, $C_{\mathrm{rand}}$ selects the next algorithm uniformly at random from the set $C$. Suppose the algorithms in $C$ are ranked by the first time they incur regret $\gamma$. Since an algorithm $A_i$ is chosen uniformly at random, in expectation, by the time $A_i$ incurs regret $\gamma$, half of the algorithms in $C$ have already incurred at least $\gamma$ regret, and thus the size of $C$ halves with each switch. We can therefore argue that we cycle through only $O(\log K)$ base algorithms in each phase. We defer the formal proof of the following theorem to Appendix D.

Theorem 11. Suppose $A_1, \ldots, A_K$ are deterministic OLO algorithms with monotone regret bounds $S_1, \ldots, S_K$, and suppose that for all $t$, $\sup_{x,y\in\mathcal{B}}\langle c_t, x - y\rangle \le 1$. Then for any fixed sequence $\vec c$ of costs (i.e., an oblivious adversary), we have
$$\mathbb{E}[R_{C_{\mathrm{rand}}}(\mathcal{B}, \vec c)] \le O\Big(\log_2(K + 1)\,\min_i S_i([1, T], \vec c)\Big).$$
Further, if $\vec c$ is allowed to depend on the algorithm's randomness (i.e., an adaptive adversary), then
$$R_{C_{\mathrm{rand}}}(\mathcal{B}, \vec c) \le O\Big(K\,\min_i S_i([1, T], \vec c)\Big).$$
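A sketch of the idea behind $C_{\mathrm{rand}}$ (the authoritative version is Algorithm 4 in Appendix D; the phase bookkeeping below is our simplification):

```python
import numpy as np

def run_rand_combiner(make_learners, costs, rng=np.random.default_rng(0)):
    """Sketch of C_rand: all K algorithms are simulated in parallel; we follow
    one, and when it exceeds the budget gamma we jump to a uniformly random
    algorithm among those still within budget (the candidate set C)."""
    K = len(make_learners)
    gamma = 1.0

    def fresh():
        # (re)start all K algorithms and their phase statistics
        return [f() for f in make_learners], [0.0] * K, [None] * K

    learners, loss_sums, g_sums = fresh()
    i = int(rng.integers(K))
    plays = []
    for c_t in costs:
        ys = [ln.respond() for ln in learners]
        plays.append(ys[i])                      # play the followed algorithm
        for j, ln in enumerate(learners):
            ln.observe_cost(c_t)                 # simulate every algorithm
            loss_sums[j] += float(c_t @ ys[j])
            g_sums[j] = c_t.copy() if g_sums[j] is None else g_sums[j] + c_t

        def phase_regret(j):                     # unit-ball worst-case regret
            return loss_sums[j] + float(np.linalg.norm(g_sums[j]))

        if phase_regret(i) > gamma:
            C = [j for j in range(K) if phase_regret(j) <= gamma]
            if not C:                            # everyone exhausted: new phase
                gamma *= 2
                learners, loss_sums, g_sums = fresh()
                C = list(range(K))
            i = int(rng.choice(C))               # uniform pick from candidates
    return plays
```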
4.3 Constrained setting: Unknown $\epsilon$

For any fixed $\epsilon > 0$, Theorem 5 yields a monotone regret bound. For $1 \le i \le \log T$, let $A_i$ denote the instantiation of Algorithm 1 with $\epsilon_i = 2^{-i}$. By Theorem 5, each algorithm $A_i$ is associated with a monotone regret bound $S_i(\cdot,\cdot)$ such that
$$R_{A_i}(\mathcal{B}, \vec c) \le S_i([1, T], \vec c) = O\left(\min_{w\in\Delta_K}\frac{\sqrt{(\log T)\,|B^{\epsilon_i}_{H(w)}|} + \log T + (\log T)(\log K)}{\epsilon_i}\right).$$
Further, since $|B^{\epsilon_{i+1}}_{H(w)}| \le |B^{\epsilon_i}_{H(w)}|$, we have $S_{i+1}(\cdot, \vec c) \le 2 S_i(\cdot, \vec c)$. Applying Theorem 11 to these $\log T$ algorithms thus yields the following result.

Theorem 12. Given a set $H = \{\vec h^{(1)}, \ldots, \vec h^{(K)}\}$ of hint sequences, there exists a randomized algorithm $A$ such that for any fixed sequence of cost vectors $\vec c$, the expected regret $\mathbb{E}[R_A(\mathcal{B}, \vec c \mid H)]$ is at most
$$O\left(\min_{\epsilon\in(0,1)}\,\min_{w\in\Delta_K}\,(\log\log T)\,\frac{\sqrt{(\log T)\,|B^{\epsilon}_{H(w)}|} + \log T + (\log T)(\log K)}{\epsilon}\right).$$

5 Unconstrained setting

In this section, we develop an algorithm that leverages multiple hints in the unconstrained setting. Recall that in this setting, the output $x_t$ and the comparison point $u$ are allowed to range over all of $\mathbb{R}^d$. Thus we cannot hope to bound the regret by a uniform constant for all $u$; instead, we bound the regret as a function of $\|u\|$. This setting has seen increased interest [6, 7, 11, 21, 22], and recently the notion of hints has also been studied here [2, 5]. Here, we consider multiple hints in the unconstrained setting. Unlike in the constrained case, this algorithm does not need to know $\epsilon$ and hence does not need the combiner. The algorithm again competes with the best convex combination of the hints. Following [2, 5], our algorithm initializes $K + 1$ unconstrained online learners. The first online learner ignores the hints and attempts to output $x_t$ to minimize the regret. Each of the following $K$ online learners is restricted to output real numbers $y^{(i)}_t$ for $i = 1, \ldots, K$, rather than points in $\mathbb{R}^d$. The final output of our algorithm is then given by $\hat x_t = x_t + \sum_{i=1}^K y^{(i)}_t h^{(i)}_t$. Intuitively, the $i$th one-dimensional algorithm is attempting to learn how useful the $i$th hint sequence is. Upon receiving the cost $c_t$, we provide the $i$th one-dimensional algorithm with the cost $\langle c_t, h^{(i)}_t\rangle$. Note that we are leaning heavily on the lack of constraints in this construction. Our regret bound is given in Theorem 13, proved in Appendix F.

Theorem 13. There is an algorithm $A$ for the unconstrained setting such that for any $u \in \mathbb{R}^d$, any $\epsilon \in (0, 1)$, and any $w \in \Delta_K$, we have
$$R_A(u, \vec c \mid H) = O\left(\|u\|\,(\log T)\left(\frac{\sqrt{|B^{\epsilon}_{H(w)}|}}{\epsilon} + \frac{\sqrt{\log K}}{\epsilon}\right)\right).$$
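A minimal sketch of this $K+1$ learner reduction; the base learners are placeholders for any unconstrained OLO algorithm (e.g., a coin-betting learner [22]), and the interface is our own.

```python
import numpy as np

class UnconstrainedCombiner:
    """Sketch of the K+1 learner reduction of Section 5: a d-dimensional base
    learner plus one scalar learner per hint sequence; the scalar learners are
    fed the induced costs <c_t, h_t^(i)>."""

    def __init__(self, base, scalars):
        self.base = base          # unconstrained learner over R^d
        self.scalars = scalars    # K unconstrained learners over R

    def respond(self, hints_t):
        self.h_t = hints_t                       # (K, d): keep hints for update
        x_t = self.base.respond()
        y_t = np.array([s.respond() for s in self.scalars])
        return x_t + hints_t.T @ y_t             # x_hat_t = x_t + sum_i y_i h^(i)

    def observe_cost(self, c_t):
        self.base.observe_cost(c_t)
        for s, g in zip(self.scalars, self.h_t @ c_t):
            s.observe_cost(float(g))             # scalar cost <c_t, h_t^(i)>
```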
6 Conclusions

In this paper we obtained algorithms for online linear optimization in the presence of many hints that can be imperfect. Besides generalizing previous results on online optimization with hints, our contributions include a simple scheme for combining arbitrary learners that seems to have broader applications. Interesting future research directions include tightening the dependence on $\epsilon$ in various cases and exploring the possibility of improved bounds for specific online optimization problems.

Broader Impact

Our work focuses on theoretical foundations. Online learning methods have had direct impact in domains such as online advertising, but the methods developed here are primarily used to improve other optimization procedures, and thus have only an indirect impact. We believe that there are no adverse ethical aspects or potentially negative societal consequences of our work.

Acknowledgments and Disclosure of Funding

Aditya Bhaskara is partially supported by NSF (CCF-2008688) and by a Google Faculty Research Award.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
[2] Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In ICML, 2020.
[3] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. In NIPS, pages 359-366, 2002.
[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[5] Ashok Cutkosky. Combining online learning guarantees. In COLT, pages 895-913, 2019.
[6] Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In COLT, pages 643-677, 2017.
[7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In COLT, pages 1493-1529, 2018.
[8] Ashok Cutkosky and Tamas Sarlos. Matrix-free preconditioning in online learning. In ICML, pages 1455-1464, 2019.
[9] Ofer Dekel, Arthur Flajolet, Nika Haghtalab, and Patrick Jaillet. Online learning with a hint. In NIPS, pages 5299-5308, 2017.
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121-2159, 2011.
[11] Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Online learning: Sufficient statistics and the Burkholder method. In COLT, pages 3028-3064, 2018.
[12] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.
[13] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165-188, 2010.
[14] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In COLT, pages 499-513, 2007.
[15] Elad Hazan, Alexander Rakhlin, and Peter L. Bartlett. Adaptive online gradient descent. In NIPS, pages 65-72, 2008.
[16] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. JCSS, 71(3):291-307, 2005.
[17] Michal Kempka, Wojciech Kotlowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In ICML, pages 3321-3330, 2019.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[19] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. JMLR, 18(1):3117-3166, 2017.
[20] Zakaria Mhammedi and Wouter M. Koolen. Lipschitz and comparator-norm adaptivity in online learning. In COLT, pages 2858-2887, 2020.
[21] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In NIPS, pages 1116-1124, 2014.
[22] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In NIPS, pages 577-585, 2016.
[23] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, pages 993-1019, 2013.
[24] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.
[25] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In NIPS, pages 2199-2207, 2010.
[26] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, pages 1593-1601, 2014.
[27] Dirk van der Hoeven. User-specified local differential privacy in unconstrained adaptive online learning. In NeurIPS, pages 14080-14089, 2019.
[28] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.