Online Learning with Sublinear Best-Action Queries

Matteo Russo (Sapienza University of Rome, Italy) mrusso@diag.uniroma1.it
Andrea Celli (Bocconi University, Italy) andrea.celli2@unibocconi.it
Riccardo Colini-Baldeschi (Meta, Central Applied Science, UK) rickuz@meta.com
Federico Fusco (Sapienza University of Rome, Italy) fuscof@diag.uniroma1.it
Daniel Haimovich (Meta, Central Applied Science, UK) danielha@meta.com
Dima Karamshuk (Meta, Central Applied Science, UK) karamshuk@meta.com
Stefano Leonardi (Sapienza University of Rome, Italy) leonardi@diag.uniroma1.it
Niek Tax (Meta, Central Applied Science, UK) niek@meta.com

Abstract

In online learning, a decision maker repeatedly selects one of a set of actions, with the goal of minimizing the overall loss incurred. Following the recent line of research on algorithms endowed with additional predictive features, we revisit this problem by allowing the decision maker to acquire additional information on the actions to be selected. In particular, we study the power of best-action queries, which reveal beforehand the identity of the best action at a given time step. In practice, predictive features may be expensive, so we allow the decision maker to issue at most k such queries. We establish tight bounds on the performance any algorithm can achieve when given access to k best-action queries for different types of feedback models. In particular, we prove that in the full feedback model, k queries are enough to achieve an optimal regret of $\Theta(\min\{\sqrt{T}, T/k\})$. This finding highlights the significant multiplicative advantage in the regret rate achievable with even a modest (sublinear) number $k \in \Omega(\sqrt{T})$ of queries. Additionally, we study the challenging setting in which the only available feedback is obtained during the time steps corresponding to the k best-action queries. There, we provide a tight regret rate of $\Theta(\min\{T/\sqrt{k}, T^2/k^2\})$, which improves over the standard $\Theta(T/\sqrt{k})$ regret rate for label efficient prediction for $k \in \Omega(T^{2/3})$.

1 Introduction

Online learning is a foundational problem in machine learning. In its simplest version, a decision maker repeatedly interacts with a fixed set of n actions over a time horizon T. At each time, the decision maker needs to choose one of these actions; subsequently, it receives an action-dependent loss and observes some feedback. These losses are generated by an omniscient (but oblivious) adversary and are only revealed on-the-go. The goal of the decision maker is to design a learning algorithm that achieves small regret with respect to the best fixed action in hindsight, i.e., a small difference between the decision maker's loss and that of the best fixed action. Several online learning algorithms have been developed, characterized by optimal instance-independent regret bounds depending on the feedback model [Cesa-Bianchi and Lugosi, 2006, Slivkins, 2019].

Following the recent literature on algorithms with machine learning-based predictions (see, e.g., the survey by Mitzenmacher and Vassilvitskii [2020]), we study the case where the learner is allowed to issue a limited number of best-action queries to an oracle that reveals the identity of the best action for that step, so that the learner can choose it. This setting is motivated by scenarios in which obtaining accurate predictions on the optimal choice among numerous actions is possible, but comes with significant costs and time constraints.
For instance, consider an online platform that continuously moderates posted content (e.g., Meta [Meta, a,b] or Google [Google]), and the online learning problem it faces: posts are generated one after the other, and the platform's task consists of deciding whether or not to flag each piece of content as harmful. In this application, the platform may do so via (i) content moderation actions that are based on (expert) human reviews (which play the role of best-action queries), and (ii) automated content moderation decisions, i.e., decisions made by an online learning algorithm. Due to budget or time constraints, access to human reviewing is a scarce resource, and the platform can only employ external reviewers at most k times.

While the idea of incorporating hints or queries into online learning models has already been studied [e.g., Bhaskara et al., 2021b, 2023b], we are the first to study best-action queries. Bhaskara et al. [2021b] focus on online linear optimization with full feedback, with a query model which outputs vectors that are correlated with the actual losses. In the case of optimization over strongly convex domains, the regret bound improves from $\sqrt{T}$ to $\log T$, even for learners that receive hints only $O(\sqrt{T})$ times. In an alternative model, Bhaskara et al. [2023b] study comparison queries that allow the decision maker to know in advance, at each time, which among a small number of actions has the best loss. In this model, probing 2 arms is sufficient to achieve time-independent regret bounds for online linear and convex optimization with full feedback, and in the stochastic multi-armed bandit problem. Our model differs from previous ones in two directions: (i) the online learner issues at most k queries (differently from Bhaskara et al. [2023b]), and (ii) these queries are purely ordinal and the domain is not strongly convex* (so that the logarithmic bound in Bhaskara et al. [2021b] does not apply). Another potential way of addressing how to efficiently use limited moderation resources is the learning-with-abstention model, where the learner can "abstain" from making a decision and instead use additional resources to receive the optimal response. This includes the KWIK framework and full-information models with abstention [Li et al., 2011, Sayedi et al., 2010, Zhang and Chaudhuri, 2016, Cortes et al., 2018, Neu and Zhivotovskiy, 2020, Gangrade et al., 2021].

1.1 Our Model

In our model, an online learner repeatedly interacts with n actions over a time horizon T. At the beginning of each time $t \in [T]$, the learner chooses one of these actions $i_t$ and suffers a loss $\ell_t(i_t)$, generated by an (oblivious) adversary, that may depend on both the action and the time; then, it observes some feedback. In this paper, we allow the learning algorithm A to issue a best-action query for at most k out of the T times. When the learner issues a query, an oracle reveals the identity of the best action at that time, $i^\star_t$, so that the learner can select it. The quality of a learning algorithm is measured via the regret: the difference between its performance and that of the best fixed action in hindsight. The regret of an algorithm A against a sequence of losses $\ell \in [0,1]^{n \times T}$ reads
$$R_T(A, \ell) = \sum_{t \in [T]} \mathbb{E}\left[\ell_t(i_t)\right] - \min_{i \in [n]} \sum_{t \in [T]} \ell_t(i),$$
where the expectation runs over the (possibly) randomized decisions of the algorithm. We are interested in designing learning algorithms that perform well, i.e., suffer sublinear regret against all possible sequences of losses.
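To make the interaction protocol concrete, here is a minimal Python sketch (ours, purely illustrative) of a full-feedback run of a generic learner with a budget of k best-action queries; the interface names (`act`, `update`, `run_with_queries`) are our own and not taken from the paper.

```python
import numpy as np

def run_with_queries(learner, losses, k, rng):
    """Simulate one full-feedback run of the protocol of Section 1.1.

    losses: array of shape (T, n); losses[t, i] is the loss of action i at time t.
    learner: any object exposing act() -> action index and update(loss_vector).
    k: budget of best-action queries, issued here on a uniformly random subset of steps.
    Returns the realized regret with respect to the best fixed action in hindsight.
    """
    T, n = losses.shape
    query_steps = set(rng.choice(T, size=k, replace=False))  # where queries are issued
    total_loss = 0.0
    for t in range(T):
        if t in query_steps:
            action = int(np.argmin(losses[t]))  # the oracle reveals the best action i_t^*
        else:
            action = learner.act()
        total_loss += losses[t, action]
        learner.update(losses[t])               # full feedback: the whole loss vector
    best_fixed = losses.sum(axis=0).min()       # benchmark: best fixed action in hindsight
    return total_loss - best_fixed
```

In the label efficient model studied later, the only change is that `learner.update` is called solely on query steps.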
Since we ask for guarantees against every loss sequence, we denote with $R_T(A)$ (without the dependence on ℓ) the worst-case regret of A: $R_T(A) = \sup_{\ell} R_T(A, \ell)$. The minimax regret of a learning problem is then the regret achieved by the best algorithm. In our paper, we pinpoint the exact minimax regret rates for the problems studied.

*The model with n discrete actions can be captured by a linear function where each entry corresponds to the loss of a specific discrete action, and the continuous action space is the probability simplex over the n discrete actions (which is not strongly convex). We adopt the notational convention that [x] stands for the set of the first x natural numbers.

For the sake of simplicity, we denote with $L_T(i)$ the total loss incurred by action i over the whole time horizon: $L_T(i) = \sum_{t \in [T]} \ell_t(i)$. In contrast, $L^{\min}_T = \sum_{t \in [T]} \ell_t(i^\star_t)$ denotes the sum of the minimum losses at each time. Finally, $L_T(A_k) = \mathbb{E}\big[\sum_{t \in [T]} \ell_t(i_t)\big]$ denotes the expected total loss of an algorithm $A_k$ issuing at most k best-action queries.

1.2 Our Results

Full feedback. We start with full feedback (a.k.a. prediction with experts), where the learner observes at each time the losses of all the actions after the action is chosen, i.e., the whole loss vector $\ell_t = (\ell_t(1), \ell_t(2), \ldots, \ell_t(n))$ after action $i_t$ is chosen. We obtain the following results. We show that by combining the Hedge algorithm with k queries issued uniformly at random (Algorithm 1), we obtain a regret rate of $O(\min\{\sqrt{T}, T/k\})$ up to logarithmic factors; see Theorem 2.2. In Theorem 3.3, we complement this positive result with a tight lower bound: we design a family of instances that forces every algorithm with k queries to suffer a matching regret. It is surprising that issuing a sublinear number of queries, say $k = T^{\alpha}$ for $\alpha \in (1/2, 1)$, yields a significant improvement in the regret, which becomes $T^{1-\alpha}$. Note that the total loss incurred by any algorithm in the k query rounds is at most k, which only affects the overall regret in an additive way; nevertheless, we prove that issuing k queries has an impact on the regret that is multiplicative in k. For instance, $T^{2/3}$ queries are enough to decrease the regret to $T^{1/3}$, which is well below the $\Theta(\sqrt{T})$ minimax regret for prediction with experts without queries [Cesa-Bianchi and Lugosi, 2006].

Label efficient feedback. We then proceed to study a partial feedback model inspired by the label efficient paradigm [Cesa-Bianchi and Lugosi, 2006]. In the label-efficient prediction problem, the learner only observes the losses in a few selected times. With label efficient feedback, the learner only observes feedback during the k times where the best-action queries are issued (but only after choosing the action). We obtain the following results. We modify the (label-efficient) Hedge algorithm [Cesa-Bianchi and Lugosi, 2006] to achieve a regret rate of $O(\min\{T/\sqrt{k}, T^2/k^2\})$; see Theorem 2.4. In Theorem 3.4, we show that it is not possible to improve on these rates: there exist instances on which every algorithm suffers $\Omega(T/\sqrt{k})$ regret if $k \in O(T^{2/3})$, and $\Omega(T^2/k^2)$ regret if $k \in \Omega(T^{2/3})$. We observe that our algorithms improve (for $k \in \Omega(T^{2/3})$) on the regret rate of label efficient prediction with k steps of feedback, which is of order $\Theta(T/\sqrt{k})$ [Chapter 6.2 in Cesa-Bianchi and Lugosi, 2006]. Also in this case, we observe the surprising multiplicative impact of the k queries on the regret. For instance, $k = T^{3/4}$ queries (which affect a total loss of the same order but leave $\Theta(T)$ rounds unaffected) are enough to achieve a regret of order $T^2/k^2 = \sqrt{T}$; the short calculation below makes these comparisons concrete.
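As a back-of-the-envelope illustration (ours, not a statement from the paper), writing the query budget as $k = T^{\alpha}$ and ignoring logarithmic factors makes the multiplicative effect of the queries explicit:

```latex
\begin{align*}
\text{Full feedback:} \quad
  & \min\{\sqrt{T},\, T/k\} = T^{1-\alpha}
    && \text{for } \alpha \in (1/2, 1],\\
\text{Label efficient:} \quad
  & \min\{T/\sqrt{k},\, T^2/k^2\} = T^{2-2\alpha}
    && \text{for } \alpha \in (2/3, 1],\\
\text{Example } (\alpha = 3/4): \quad
  & T^{2-2\alpha} = T^{1/2}
    \quad \text{vs.} \quad T/\sqrt{k} = T^{5/8}
    && \text{for label efficient prediction without queries.}
\end{align*}
```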
Stochastic i.i.d. setting. In Section B, we derive (up to polylogarithmic factors) the above results in the stochastic setting, but with different algorithms. Namely, with full feedback, we show that Follow-The-Leader [Auer et al., 2003] achieves regret $\widetilde{\Theta}(T/k)$ when $k \in \Omega(\sqrt{T})$, and $\widetilde{\Theta}(\sqrt{T})$ otherwise. With label efficient feedback, Explore-Then-Commit [Perchet et al., 2015] achieves $\widetilde{\Theta}(T^2/k^2)$ when $k \in \Omega(T^{2/3})$, and $\widetilde{\Theta}(T/\sqrt{k})$ otherwise. (We use $\widetilde{\Theta}(x)$ to hide polylogarithmic terms in x.)

1.3 Technical Challenges

Issuing a best-action query has the effect of inducing a possibly negative instantaneous regret. In fact, the benchmark that the algorithm is compared against at time t is the loss of the best fixed action over the whole time horizon, which may naturally be larger, at time t, than the loss of the best action $i^\star_t$ at that specific time. Overall, the magnitude of the total negative regret generated by the k queries is at most O(k), which is typically sublinear in T and only affects the overall regret additively. We prove that this sublinear number of queries nonetheless has a (surprising) multiplicative effect on the overall regret bound.

Our analysis reveals that any algorithm employing a uniform querying strategy has a total loss that decomposes into two terms: a k/T fraction of the best dynamic loss, i.e., the sum of the smallest losses at each time, plus the total loss of the same algorithm operating without queries, scaled by a factor of 1 − k/T. Additionally, within our proof, we conduct a more refined analysis of the Hedge algorithm in the no-querying setting. Specifically, we express the regret as a function of the difference between the best dynamic loss and the best fixed cumulative loss in hindsight. This formulation yields a noteworthy improvement in the regret rate compared to the conventional bound of $O(\sqrt{T})$ whenever the discrepancy between the best dynamic loss and the best fixed cumulative loss in hindsight is relatively small.

Concerning lower bounds, their construction is more challenging than in the no-queries counterpart. In fact, we need to design hard instances against any query-augmented learning algorithm, and provide bounds that are tight in both k and T. The lower bound instances we construct are stochastic; this implies that the minimax regret rates we find are tight even in the stochastic model, where the losses are drawn i.i.d. from a fixed but unknown distribution. To exemplify this, let us consider the following natural instance, which provides a simple proof of the $\Omega(\sqrt{T})$ regret lower bound in the adversarial setting (without queries). The instance is composed of two arms, whose losses are i.i.d. Bernoulli with parameter 1/2. Any learning algorithm incurs an expected loss of T/2, while the best fixed arm in hindsight is expected to save an extra $\Theta(\sqrt{T})$ term (a simple corollary of the expected distance of a random walk). Now, if the learner is given the power to issue (order of) $\sqrt{T}$ queries, then its regret naturally drops from (order of) $\sqrt{T}$ to a constant; the small simulation below illustrates this effect.
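The following Monte Carlo sketch (ours, purely illustrative) checks this intuition numerically. On this instance each query saves 1/4 in expectation (the minimum of two independent Bernoulli(1/2) losses is 1/4 in expectation instead of 1/2), so a budget of a small multiple of $\sqrt{T}$ offsets the $\Theta(\sqrt{T})$ advantage of the best fixed arm; all constants below are our own choices.

```python
import numpy as np

def two_arm_bernoulli_regret(T, k, runs=500, seed=0):
    """Average regret on two i.i.d. Bernoulli(1/2) arms with k best-action queries."""
    rng = np.random.default_rng(seed)
    regrets = []
    for _ in range(runs):
        losses = rng.integers(0, 2, size=(T, 2))                 # losses[t, i] in {0, 1}
        # On non-query steps both arms have expected loss 1/2, so for the expected
        # regret it does not matter which one the learner picks; we fix arm 0.
        alg_loss = losses[:, 0].sum()
        if k > 0:
            q = rng.choice(T, size=k, replace=False)              # query steps
            alg_loss += (losses[q].min(axis=1) - losses[q, 0]).sum()  # best arm on those steps
        best_fixed = losses.sum(axis=0).min()
        regrets.append(alg_loss - best_fixed)
    return float(np.mean(regrets))

if __name__ == "__main__":
    for T in (10_000, 40_000, 160_000):
        k = int(1.2 * np.sqrt(T))
        print(f"T={T}: no queries ~{two_arm_bernoulli_regret(T, 0):.1f}, "
              f"k={k} queries ~{two_arm_bernoulli_regret(T, k):.1f}")
```

On typical runs, the no-query column grows like $\sqrt{T}$, while the queried column remains an order of magnitude smaller (and can even be negative, since queries may induce negative regret, as discussed above).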
2 Hedge with k Best-Action Queries

In this section, we propose two algorithms that address online learning with full and label efficient feedback. They are built by combining the Hedge algorithm [Chapter 2 in Cesa-Bianchi and Lugosi, 2006] with uniform queries. We start by presenting some known properties of Hedge (in full feedback) that are crucial for the main results of this section.

Lemma 2.1. Consider the Hedge algorithm $\mathrm{Hedge}_\eta(\tilde{\ell})$ run on a loss sequence $\tilde{\ell} \in [0, U]^{n \times T}$ with learning rate $\eta < 1/U$. Then, for all actions $i \in [n]$, it holds that
$$\tilde{L}_T(\mathrm{Hedge}_\eta(\tilde{\ell})) \le \frac{\tilde{L}_T(i)}{1 - U\eta} + \frac{\log n}{\eta(1 - U\eta)},$$
where $\tilde{L}_T(\mathrm{Hedge}_\eta(\tilde{\ell})) = \sum_{t \in [T], i \in [n]} p_t(i)\,\tilde{\ell}_t(i)$ is the expected cumulative loss of $\mathrm{Hedge}_\eta(\tilde{\ell})$ and $\tilde{L}_T(i)$ is the cumulative loss of action i.

Proof. We have that, by definition,
$$W_{T+1} \ge w_{T+1}(i) = w_1(i) \prod_{t \in [T]} \exp\big(-\eta\,\tilde{\ell}_t(i)\big) = \exp\big(-\eta\,\tilde{L}_T(i)\big).$$
We also know that
$$\frac{W_{t+1}}{W_t} = \sum_{i \in [n]} p_t(i)\exp\big(-\eta\,\tilde{\ell}_t(i)\big) \le 1 - \eta \sum_{i \in [n]} p_t(i)\,\tilde{\ell}_t(i) + \eta^2 \sum_{i \in [n]} p_t(i)\,\tilde{\ell}_t^2(i)$$
(using $e^{-z} \le 1 - z + z^2$ for $z \ge 0$), which, combined with the earlier bound and taking logarithms of both sides (since they are both positive), gives
$$-\eta\,\tilde{L}_T(i) \le \log W_{T+1} = \log n + \sum_{t \in [T]} \log\frac{W_{t+1}}{W_t} \le \log n - \eta \!\!\sum_{t \in [T], i \in [n]}\!\! p_t(i)\,\tilde{\ell}_t(i) + \eta^2 \!\!\sum_{t \in [T], i \in [n]}\!\! p_t(i)\,\tilde{\ell}_t^2(i) \le \log n - (\eta - U\eta^2) \!\!\sum_{t \in [T], i \in [n]}\!\! p_t(i)\,\tilde{\ell}_t(i).$$
The second inequality above follows from $\log(1+z) \le z$ for all $z > -1$, and the third by observing that $\tilde{\ell}_t^2(i) \le U\,\tilde{\ell}_t(i)$, since $\tilde{\ell}_t(i) \in [0, U]$. The lemma follows by rearranging the terms above.

2.1 Full Feedback: An $O\!\left(\frac{T \log n}{k}\right)$ Regret Bound

Theorem 2.2. Consider the problem of online learning with full feedback and k best-action queries; then there exists an algorithm $A_k$ that guarantees
$$R_T(A_k) \le O\!\left(\min\left\{\sqrt{T \log n},\ \frac{T \log n}{k}\right\}\right).$$

We prove that Algorithm 1 exhibits the desired regret guarantees. In particular, as we illustrate next, this algorithm performs uniform querying, i.e., it chooses a uniformly random subset $Q \subseteq [T]$ of size k where to issue best-action queries. In the case of uniform queries, and when the feedback and the actions taken by the algorithm are not correlated, a useful simplification can be made.

Observation 2.3. Let $A_0$ be an algorithm with no querying power, with full or label-efficient feedback. Consider $A_k$, obtained from $A_0$ by equipping it with k uniformly random queries across the time horizon T, or with an independent query at each time step with probability k/T. Similarly, let $i^0_t$ and $i_t$ be the actions selected by $A_0$ and $A_k$ at time t. Then, for all adversarial sequences $\ell \in [0,1]^{n \times T}$ of action losses,
$$\mathbb{E}[\ell_t(i_t)] = \left(1 - \frac{k}{T}\right)\mathbb{E}\big[\ell_t(i^0_t)\big] + \frac{k}{T}\,\mathbb{E}\big[\ell_t(i^\star_t)\big] \quad \text{for all } t \in [T], \qquad \text{and thus} \qquad L_T(A_k) = \left(1 - \frac{k}{T}\right) L_T(A_0) + \frac{k}{T}\, L^{\min}_T. \quad (1)$$

Algorithm 1 Hedge with Best-Action Queries
1: Input: sequence of losses $\ell_t(i)$ and query budget $k \in [T]$
2: Sample k out of T time steps uniformly at random and denote this random set by Q
3: Set $\eta = \sqrt{\log n / T}$ if $k/T \le \sqrt{\log n / T}$, otherwise $\eta = k/T$
4: Initialize $w_1(i) = 1$ for all $i \in [n]$
5: for $t \in [T]$ do
6:    if $t \in Q$ then
7:       Observe $i^\star_t = \arg\min_{i \in [n]} \ell_t(i)$
8:       Select action $i^\star_t$
9:    else
10:      Let $W_t = \sum_{i \in [n]} w_t(i)$
11:      Select action i with probability $p_t(i) = w_t(i)/W_t$
12:   Observe $\ell_t(i)$ for all $i \in [n]$
13:   Update $w_{t+1}(i) = w_t(i)\exp\big(-\eta(\ell_t(i) - \ell_t(i^\star_t))\big)$ for all $i \in [n]$

Proof of Theorem 2.2. Let us first note that Algorithm 1 without queries is an instantiation of $\mathrm{Hedge}_\eta(\tilde{\ell})$ applied to the losses $\tilde{\ell}_t(i) = \ell_t(i) - \ell_t(i^\star_t) \in [0, 1]$. Let this algorithm be denoted by $A_0$. Then, applying Lemma 2.1 (with U = 1) and expanding the terms, we obtain
$$L_T(A_0) \le \frac{L_T(i)}{1-\eta} + \frac{\log n}{\eta(1-\eta)} - \frac{\eta\, L^{\min}_T}{1-\eta}.$$
Let $A_k$ be Algorithm 1 with k best-action queries. By Observation 2.3, specifically (1), it holds that
$$L_T(A_k) \le \left(1 - \frac{k}{T}\right)\frac{L_T(i)}{1-\eta} + \left(1 - \frac{k}{T}\right)\frac{\log n}{\eta(1-\eta)} - \left(1 - \frac{k}{T}\right)\frac{\eta\, L^{\min}_T}{1-\eta} + \frac{k}{T}\,L^{\min}_T,$$
so that
$$R_T(A_k, i) \le \left(1 - \frac{k}{T}\right)\frac{\log n}{\eta(1-\eta)} + \frac{T\eta - k}{T(1-\eta)}\big(L_T(i) - L^{\min}_T\big) \le O\!\left(\min\left\{\sqrt{T \log n},\ \frac{T \log n}{k}\right\}\right),$$
where the last inequality holds by setting $\eta = \max\big\{\sqrt{\log n / T},\, k/T\big\}$.

A minimal implementation sketch of Algorithm 1 is given below; after that, we provide some technical intuition for why Theorem 2.2 should hold.
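The following Python sketch is one possible rendering of Algorithm 1 under full feedback (ours, illustrative). It fixes a few choices the pseudocode leaves implicit: weights are updated at every step, and the learning rate is $\eta = \max\{\sqrt{\log n/T}, k/T\}$, as in the proof of Theorem 2.2.

```python
import numpy as np

def hedge_with_queries(losses, k, seed=0):
    """Hedge with k best-action queries under full feedback, cf. Algorithm 1.

    losses: array of shape (T, n) with entries in [0, 1].
    Returns the total loss incurred over the horizon.
    """
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    eta = max(np.sqrt(np.log(n) / T), k / T)                  # learning rate of Theorem 2.2
    query_steps = set(rng.choice(T, size=k, replace=False))   # the uniformly random set Q
    log_w = np.zeros(n)                                       # log-weights, w_1(i) = 1
    total_loss = 0.0
    for t in range(T):
        best = int(np.argmin(losses[t]))                      # i_t^*, needed for the update
        if t in query_steps:
            action = best                                     # the oracle reveals i_t^*
        else:
            p = np.exp(log_w - log_w.max())                   # p_t(i) proportional to w_t(i)
            p /= p.sum()
            action = int(rng.choice(n, p=p))
        total_loss += losses[t, action]
        log_w -= eta * (losses[t] - losses[t, best])          # update on the shifted losses
    return total_loss
```

The label efficient variant analyzed in Theorem 2.4 differs only in the querying and update rules: a Bernoulli($\hat{k}/T$) coin decides whether to query (until the budget is exhausted), and the weights are updated only on query steps, using the shifted losses rescaled by $T/\hat{k}$.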
The additive term k T Lmin T (given in Equation (1)) alters the choice of the learning rate, which is allowed to be more aggressive, thus impacting the regret in a multiplicative way. To be more specific, in the usual Hedge performance analysis, we do not care about the negative term ηLmin T 1 η . This negative term together with the additive k T Lmin T term given by best-action queries allows us to set the optimal η to be larger than the usual (order of) 1/ T. In other words, the additive impact of the k T Lmin T term permits a multiplicative gain in regret as the learning rate η is modified and increased. 2.2 Label Efficient Feedback: An O( T 2 log n k2 ) Regret Bound We extend Algorithm 1 to a setting where feedback is given only during querying time steps, with the only difference that the update rule is performed just after the querying time steps and nowhere else across the time horizon. We prove the following theorem: Theorem 2.4. For all adversarial sequences ℓ [0, 1]n T of action losses and for all k q 2 1, in the label efficient query model, there exists an algorithm Ak that guarantees RT (Ak) 2 min k , T 2 log n Algorithm description. We first describe the algorithm Ak we use: Let Xt Ber ˆk/T be a Bernoulli random variable, for some ˆk k to be specified later. Ak issues a best action query if, at time step t, Xt = 1 and unless the query budget is exhausted. Once the query budget is exhausted, the algorithm stops querying. Otherwise, it performs the usual update rule on losses ˆℓt(i) = T ˆk (ℓt(i) ℓt(i t )) I {Xt = 1}. The algorithm then simply selects action It = i t if Xt = 1 and action It = i with probability pt(i) if Xt = 0. Moreover, we denote by X t = (X1, . . . , Xt), I t = (I1, . . . , It). For the sake of the analysis, we introduce another algorithm A k, which is the same as algorithm Ak with the only (but crucial) difference that it issues a query if and only if Xt = 1, regardless of whether or not query budget is exhausted. We thus bound the regret of Ak in terms of the regret of A k. Lemma 2.5. For all adversarial sequences ℓ [0, 1]n T of action losses, in the label efficient query model, algorithm A k guarantees RT (A k) min ˆk , T 2 log n Proof. For algorithm A k, we have that its counterpart without queries, A 0, is an instantiation Hedgeη( ℓ), with ℓ= ˆℓ. Thus, by Lemma 2.1, we obtain ˆLT (A k) = X t [T ],i [n] pt(i) ˆℓt(i) 1 1 T ˆk η ˆLT (i) + log n as long as η < ˆk/T. We now recognize that, since E h ˆℓt(i) | X t 1, I t 1 i = ℓt(i) ℓt(i t ), then i [n] E pt(i) ˆℓt(i) X t 1, I t 1 i [n] pt(i) (ℓt(i) ℓt(i t )). Therefore, by the tower property of expectation applied around (2), we have LT (A 0) = X i [n] E h pt(i) ˆℓt(i) i = X i [n] pt(i) (ℓt(i) ℓt(i t )) ˆk η LT (i) Lmin T + log n ˆk ˆk Tη LT (i) + log n Tη ˆk Tη Lmin T , where the last inequality holds since Tη ˆk. By Observation 2.3, we get LT (A k) 1 k ˆk ˆk Tη LT (i) + log n ˆk Tη Lmin T + k which means that the regret is upper bounded by RT (A k, i) ˆk log n η(ˆk Tη) + T 2η kˆk T(ˆk Tη) (LT (i) Lmin T ) min ˆk , T 2 log n where the last inequality holds by setting η = max 1 T , and then by noticing, in the latter case, that ˆk kˆk With this lemma, we are ready to prove Theorem 2.4, where we bound the regret of algorithm Ak. Proof of Theorem 2.4. We consider event E = {|Q| k}, so that, slightly abusing notation, we write the regret of algorithm Ak as RT (Ak) = E [RT (Ak) | E] P [E] + E RT (Ak) | E P E E [RT (Ak) | E] + T P E . 
To upper bound the second summand above, we have T P E T exp 2(k + 1 ˆk)2 by Hoeffding s inequality applied on the binomial random variable |Q| with expectation ˆk, and as long as ˆk k q For what concerns the first summand, we recognize that under event E, Ak and A k are exactly the same algorithm. Thus, it holds that E [RT (Ak, i) | E] = E [RT (A k, i) | E]. Moreover, if E RT (Ak) | E 0, and since P [E] 1 1/T, then E [RT (A k, i) | E] T T 1 RT (A k, i) T T 1 min ˆk , T 2 log n by Lemma 2.5. If, instead, E RT (Ak) | E < 0, we also know that, by an identical derivation to (3), E RT (Ak) | E P E 1. Therefore, E [RT (A k, i) | E] T T 1 (RT (A k, i) + 1) T T 1 ˆk , T 2 log n again by Lemma 2.5. Overall, we obtain RT (Ak) 2 min k , T 2 log n 3 Lower Bounds In this section, we construct two of randomized instances of the learning problem, which induce a tight lower bound on the minimax regret rates for both feedback models. We define the random variables Zt as the feedback observed by the algorithm at the end of time t. In the full feedback model, Zt = ℓt, while in the label efficient setting, Zt = ℓt only if a query is issued at time t (and Zt is an empty n-dimensional vector otherwise). Furthermore, we denote with Z t the array containing the feedback Z1, Z2, . . . , Zt until time t. We start by describing the two stochastic instances, which are characterized by two distributions over two losses. As a notational convention, we denote the losses with ℓt, and introduce two probability measures P+, P (and their corresponding expectations E+, E ). Let ε, q [0, 1] be two parameters we set later (with ε q), we have that the losses of the n = 2 actions are distributed as follows: (ℓt(1), ℓt(2)) = (1, 1) w.p. 1 2 under both P+ and P (0, 0) w.p. 1 2 2q under both P+ and P (0, 1) w.p. q + ε under P+ and w.p. q ε under P (1, 0) w.p. q ε under P+ and w.p. q + ε under P We now introduce and prove a general Lemma on the expected regret R T (Ak) suffered by any deterministic algorithm Ak which issues at most k queries, against the i.i.d. sequence of valuations drawn according to P . Since we want a result that holds for both feedback models, we introduce the random set F which contains the times where Ak actually observes the losses; note, NF = |F| is equal to T in full feedback and to the number queries issued in the partial information model. Lemma 3.1. For any determinstic algorithm Ak which issues at most k best-action queries, we have: R+ T (Ak) + R T (Ak) exp 5ε2 q E+ [NF ] (T k)ε Proof. The best action i under P+, respectively P is the first, respectively the second, one, with an expected loss of E [ℓt(i )] = 1 2 + q ε. (4) On the other hand, the best realized action i t yields an expected loss of E [ℓt(i t )] = E [min{ℓt(1), ℓt(2)}] = (q ε). (5) Moreover, if the algorithm chooses a suboptimal action it = i , its expected instantaneous regret is: E+ [ℓt(2)] E+ [ℓt(i )] = E [ℓt(1)] E [ℓt(i )] = 2ε. (6) Let now N+, respectively N , be the random variable that counts the number of times that Ak selects action 1, respectively 2, in times that are not in F (i.e., where the choice of 1 is not due to a query). Combining (4),(5), and (6), we have the following: R T (Ak) E "X t/ F (ℓt(it) ℓt(i )) + X t F (ℓt(i t ) ℓt(i )) = 2ε E [N ] (q ε)E [NF ] . Since NF k, and N+ + N T k, we have the following bound on the regret: R+ T (Ak) P+ N+ T k (T k)ε (q ε)k R T (Ak) P N+ > T k (T k)ε (q ε)k. Summing the above two expressions, we obtain R+ T (Ak) + R T (Ak) P+ N+ T k + P N+ > T k (T k)ε 2(q ε)k. 
At this point, we apply the Bretagnolle-Huber Inequality [Theorem 14.2 in Lattimore and Szepesvári, 2020] to bound the first term on the right-hand side of the above inequality: + P N+ > T k 2 exp DKL P+ Z T , P Z T where P Z T is the push-forward measure on all the possible sequences of feedback observed by Ak under P . The lemma is concluded by combining the above inequality with the following claim, and plugging it into (7). Claim 3.2. It holds that DKL P+ Z T , P Z T q E+ [NF ] . Proof. Let us observe that, once we fix the feedback history until time t 1, i.e., fix a realization of the feedback Z t 1, we have that DKL P+ Zt|Z t 1, P Zt|Z t 1 = I {t F} DKL P+ Zt|t F , P Zt|t F , where P+ Zt|Z t 1 (respectively P Zt|Z t 1) is the push-forward measure over {0, 1}2 when losses are drawn according to P+ (respectively P ), conditioning on the previous observations. The equality above holds because (i) algorithm Ak observes feedback if and only t F (by definition), and (ii) Ak is deterministic and whether or not t F may only depend on the past. We now upper bound the KL-divergence term above: DKL P+ Zt|Z t 1, P Zt|Z t 1 = (q + ε) log 1 + 2ε q ε + (q ε) log 1 2ε q + ε (q + ε) 2ε q ε (q ε) 2ε q + ε = 4ε2q q2 ε2 5ε2 where the first inequality follows from log(1 + z) z for all z R, and the last holds as long as we choose ε < q 5. To complete our derivation, we express the overall KL divergence exploiting the tower property of conditional expectation: DKL P+ Z T , P Z T t [T ] E+ h DKL P+ Zt|Z t 1, P Zt|Z t 1 q E+ [NF ] , where expectation is taken over all possible feedback realizations Z t 1, and the last inequality follows by earlier derivations. This concludes the proof of the lemma. We now show how to use the above lemma to derive the lower bounds. We start with full feedback. Theorem 3.3. In the full feedback query model, for all k [T], we have the following lower bounds: For any algorithm Ak that has access to k < c0 T queries, it holds that RT (Ak) c0 For any algorithm Ak that has access to k c0 T queries, it holds that RT (Ak) c1 T Where c0 = 1/(e8 5) and c1 = 1/(320e2) are universal constants. Proof. In full feedback, the algorithm always observes the losses, so that the feedback variable Zt = ℓt and NF = T. We prove the Theorem via Yao s minimax Theorem: we prove that any deterministic algorithm Ak fails against the random instance composed as follows: with probability 1/2 the losses are drawn i.i.d. according to P+, otherwise, they are drawn i.i.d. according to P . We can then apply Lemma 3.1 and obtain that the expected regret of Ak against such mixture is equal to E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) exp 5ε2 2 2(q ε)k. (8) Now, if k < c0 T, we set ε = 2 5T and q = 1 4 in (8) to get E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) Otherwise, if k c0 T, then consider the following choice of the parameters: ε = 1 40ek + 4e 1 40e T and q = 5ε2T = T 320e2k2 + 4e 1 160e2k + (4e 1)2 320e2T . Plugging these parameters in (8), we get: E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) Tε 4e 5ε2k T + 1 1 = T 320e2k + 4e 1 160e2 + (4e 2)2k 320e2T T 320e2k . A similar analysis can be carried over for the label efficient setting. Theorem 3.4. In the label efficient feedback model, for all k [T], we have the following lower bounds: For any algorithm Ak that has access to k < c0 T k queries, it holds that RT (Ak) c0 T 4 For any algorithm Ak that has access to k c0 T k queries, it holds that RT (Ak) c1 T 2 Where c0 = 1/(e8 5) and c1 = 1/(320e2) are universal constants. Proof. 
In the label efficient feedback model, Zt is meaningful only for times in F. We then prove the result by Yao s minimax principle, by showing that any determinisitic algorithm Ak which issues at most k queries suffers the desired regret against the instance that uniformly chooses between P+ and P . We can apply Lemma 3.1 (nothing that NF k) to get: E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) exp 5ε2 Once again, we have two cases. If k < c0 T k, we choose ε = 2 5k and q = 1 E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) T 2e8 Otherwise, if k c0 T k, and we choose ε = T 40ek2 + 4e 1 40ek and q = 5ε2k = T 2 320e2k3 + (4e 1)T 320e2k , to obtain E [RT (Ak)] = 1 2(R+ T (Ak) + R T (Ak)) Tε 4e 5ε2k2 + 1 1 320e2k2 + T(4e 1) 160e2k + (4e 1)2 4 Conclusions Our work introduces best-action queries in the context of online learning. We provide tight minimax regret in both the full feedback model and in the label efficient one. We establish that leveraging a sublinear number of best action queries is enough to improve significantly the regret rates achievable without best-action queries. Promising avenues for future research involve integrating best-action queries with diverse feedback forms, extending beyond full feedback, such as bandit feedback, partial monitoring, and feedback graphs (where, in particular, Lemma 2.1 does not hold). Moreover, our work only studies the case where queries are perfect, i.e., the queried oracle gives the correct identity of the best action at that time step with probability 1. Imagining a noisy oracle that gives the correct identity of the best action only with probability 1/n + δ is also an interesting future direction this work leaves open. Acknowledgments Andrea Celli is partially supported by MUR - PRIN 2022 project 2022R45NBB funded by the Next Generation EU program. Federico Fusco, Stefano Leonardi and Matteo Russo are partially supported by the ERC Advanced Grant 788893 AMDROMA Algorithmic and Mechanism Design Research in Online Markets , by the FAIR (Future Artificial Intelligence Research) project PE0000013, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence), and by the PNRR MUR project IR0000013-So Big Data.it. Stefano Leonardi and Matteo Russo are also partially supported by the MUR PRIN grant 2022EKNE5K (Learning in Markets and Society). A. Aamand, J. Y. Chen, H. L. Nguyen, S. Silwal, and A. Vakilian. Improved frequency estimation algorithms with and without predictions. Co RR, abs/2312.07535, 2023. A. Antoniadis, H. Broersma, and Y. Meng. Online graph coloring with predictions. Co RR, abs/2312.00601, 2023a. A. Antoniadis, C. Coester, M. Eliás, A. Polak, and B. Simon. Online metric algorithms with untrusted predictions. ACM Trans. Algorithms, 19(2):19:1 19:34, 2023b. A. Antoniadis, C. Coester, M. Eliás, A. Polak, and B. Simon. Mixing predictions for online metric algorithms. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 969 983. PMLR, 2023c. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. 32(1):48 77, 2003. X. Bai and C. Coester. Sorting with predictions. In Neur IPS, 2023. A. Bhaskara and K. Munagala. Competing against adaptive strategies in online learning via hints. In AISTATS, volume 206 of Proceedings of Machine Learning Research, pages 10409 10424. PMLR, 2023. A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit. Online learning with imperfect hints. 
In ICML, volume 119 of Proceedings of Machine Learning Research, pages 822 831. PMLR, 2020. A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit. Power of hints for online learning with movement costs. In AISTATS, volume 130 of Proceedings of Machine Learning Research, pages 2818 2826. PMLR, 2021a. A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit. Logarithmic regret from sublinear hints. In Neur IPS, pages 28222 28232, 2021b. A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit. Bandit online linear optimization with hints and queries. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 2313 2336. PMLR, 2023a. A. Bhaskara, S. Gollapudi, S. Im, K. Kollias, and K. Munagala. Online learning and bandits with queried hints. In ITCS, volume 251 of LIPIcs, pages 16:1 16:24. Schloss Dagstuhl - Leibniz Zentrum für Informatik, 2023b. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. D. Cheng, X. Zhou, and B. Ji. Understanding the role of feedback in online learning with switching costs. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 5521 5543. PMLR, 2023. C. Cortes, G. De Salvo, C. Gentile, M. Mohri, and S. Yang. Online learning with abstention. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 1067 1075. PMLR, 2018. O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet. Online learning with a hint. In NIPS, pages 5299 5308, 2017. A. Gangrade, A. Kag, A. Cutkosky, and V. Saligrama. Online selective classification with limited feedback. In Neur IPS, pages 14529 14541, 2021. S. Gollapudi and D. Panigrahi. Online algorithms for rent-or-buy with expert advice. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 2319 2327. PMLR, 2019. Google. Google Support: how automation is used in content moderation. https://support. google.com/adspolicy/answer/13584894. Accessed: 2024-05-22. Z. Jiang, D. Panigrahi, and K. Sun. Online algorithms for weighted paging with predictions. ACM Trans. Algorithms, 18(4):39:1 39:27, 2022. S. Lattanzi, T. Lavastida, B. Moseley, and S. Vassilvitskii. Online scheduling via learned weights. In SODA, pages 1859 1877. SIAM, 2020. T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. L. Li, M. L. Littman, T. J. Walsh, and A. L. Strehl. Knows what it knows: a framework for self-aware learning. Mach. Learn., 82(3):399 443, 2011. T. Lykouris and S. Vassilvitskii. Competitive caching with machine learned advice. J. ACM, 68(4): 24:1 24:25, 2021. Meta. Meta Transparency Center: how we label violations. https://transparency.fb.com/ en-gb/policies/improving/content-actioned-metric/, a. Accessed: 2024-05-22. Meta. Meta Transparency Center: prioritizing content review. https://transparency.fb.com/ en-gb/policies/improving/prioritizing-content-review/, b. Accessed: 2024-05-22. M. Mitzenmacher and S. Vassilvitskii. Algorithms with predictions. In Beyond the Worst-Case Analysis of Algorithms, pages 646 662. Cambridge University Press, 2020. G. Neu and N. Zhivotovskiy. Fast rates for online prediction with abstention. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 3030 3048. PMLR, 2020. V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, page 1456. JMLR.org, 2015. M. Purohit, Z. Svitkina, and R. Kumar. Improving online algorithms via ML predictions. In Neur IPS, pages 9684 9693, 2018. A. Sayedi, M. Zadimoghaddam, and A. Blum. 
Trading off mistakes and don t-know predictions. In NIPS, pages 2092 2100. Curran Associates, Inc., 2010. M. Shi, X. Lin, and L. Jiao. Power-of-2-arms for bandit learning with switching costs. In Mobi Hoc, pages 131 140. ACM, 2022. A. Slivkins. Introduction to multi-armed bandits. Found. Trends Mach. Learn., 12(1-2):1 286, 2019. C. Zhang and K. Chaudhuri. The extended littlestone s dimension for learning with mistakes and abstentions. In COLT, volume 49 of JMLR Workshop and Conference Proceedings, pages 1584 1616. JMLR.org, 2016. A Omitted Proofs A.1 Proof of Lemma 2.1 Lemma 2.1. Consider the Hedge algorithm Hedgeη( ℓ) run on loss sequence ℓ [0, U]n T with learning rate η < 1/U. Then, for all action i [n], it holds that, LT (Hedgeη( ℓ)) 1 1 Uη LT (i) + log n where LT (Hedgeη( ℓ)) = P t [T ],i [n] pt(i) ℓt(i) is the expected cumulative loss of Hedgeη( ℓ) and LT (i) is the cumulative loss of action i. Proof. We have that, by definition, WT +1 w T +1(i)wt(i) exp η ℓt(i) = Y t [T ] exp η ℓt(i) = exp η LT (i) . We also know that i [n] pt(i) ℓt(i) + η2 X i [n] pt(i) ℓ2 t(i) i [n] pt(i) ℓt(i) + η2 X i [n] pt(i) ℓ2 t(i) which combined with the earlier bound and taking logarithms of both sides, gives η LT (i) log n + X i [n] pt(i) ℓt(i) + η2 X i [n] pt(i) ℓ2 t(i) t [T ],i [n] pt(i) ℓt(i) + η2 X t [T ],i [n] pt(i) ℓ2 t(i) log n (η Uη2) X t [T ],i [n] pt(i) ℓt(i). The second inequality above follows from log(1 + z) z for z R, and the third by observing that ℓ2 t(i) U ℓt(i), since ℓt(i) [0, U]. The lemma follows by rearranging the terms above. B Best-Action Queries in the Stochastic i.i.d. Setting In this section, we show how the results provided in Section 2 can be obtained in the stochastic i.i.d. setting (defined next) using the Follow-The-Leader algorithm, albeit a suboptimal dependence in n. In the stochastic i.i.d. setting, each action is associated with a fixed but unknown distribution Di supported in [0, 1]. Action i s loss at time step t, ℓt(i), is drawn independently from distribution Di. We denote with µ(i) the expected loss of action i, and with i = arg mini [n] µ(i) be the action of lowest expected loss. We measure the performance of a learning algorithm A, by considering its regret with respect to the best action distribution: t [T ] E [ℓt(it) ℓt(i )] . We denote by i = µ(i) µ(i ) be the gap between the expected loss of the best action and that of action i, and by Ψi = E [|ℓ(i) ℓ(i )|] the expected absolute value of such gap (note, we omit the dependence on time as the losses are drawn i.i.d. across time). Whenever the learner issues a query, then the identity of the action i t with best realized loss is revealed, so that the learner gets, in expectation, E [ℓt(i t )] = E mini [n] ℓt(i) . B.1 Useful Facts and Observations The following facts and simple results are particularly useful for analyzing algorithms in the stochastic generation model. Fact B.1 (Bernstein s Inequality). Let Y1, . . . , Ym be independent random variables such that Pm τ=1 E [Yτ] = µ and P [|Yτ| c] = 1, for c > 0 and all τ [m]. Then, for γ > 0, we have where σ2 = Pm τ=1 E Y 2 τ E [Yτ]2. Moreover, when Yτ s are also identically distributed, we have σ2 = m(E Y 2 τ E [Yτ]2) m E [|Yτ|] =: mΨ. Thus, Proposition B.2. Let T, n N be the time horizon and number of experts respectively, with T n 1. For all distributions Di of actions losses, and any algorithm A, its regret is 2 (Ψ ), (9) where Ψ := E [|mini =i ℓ(i) ℓ(i )|] and := E [mini =i ℓ(i) ℓ(i )]. Proof. We first observe that, by the i.i.d. 
assumption, we can always make the algorithm query in the first k time steps, without losing generality. Therefore, as Q = {t | t k}, the algorithm suffers E mini [n] ℓt(i) in the first k time steps, which, by independence, gives t k E min i [n] ℓt(i) = k E min i [n] ℓ(i) . We also have the following useful observation that allows us to rewrite the maximum in a convenient form. Namely, min i [n] ℓ(i) = ℓ(i ) + min i =i ℓ(i) min i =i ℓ(i) ℓ(i ) The claim follows by the definitions of Ψ and . Thus, in the stochastic case, the goal of a regret-minimizing algorithm is to minimize the first summand above since the second is characteristic of the instance at hand. Observation B.3. Let T, n N be the time horizon and number of experts respectively, with T n 1. For all distributions Di of actions losses, Ψi i Ψ . Proof. Since E minj [n] ℓ(j) E [min{ℓ(i ), ℓ(i)}] for all i [n], we have Ψ = 2 E min j [n] ℓ(j) ℓ(i ) 2 E [min{ℓ(i), ℓ(i )} ℓ(i )] = i Ψi, where the last equality also follows from Equation (10) applied to actions i, i only. B.2 Label Efficient Feedback:Explore-Then-Commit with Ω(T 2/3) Best-Action Queries We consider the label efficient feedback case first, where feedback is given only during querying time steps. The techniques used in this section are key for the full feedback case, and easier to illustrate. We provide a simple variation of the Explore-Then-Commit (ETC) algorithm Perchet et al. [2015] that, in the first k Ω(T 2/3) time steps, is given free access to the identity of the best action before committing to the action to select. We have the following theorem: Algorithm 2 Explore-Then-Commit with Best-Action Queries Input: Sequence of losses ℓt(i) Di and k [T] query budget for t [T] do if t k then Observe i t = arg mini [n] ℓt(i) Select action i t Observe ℓt(i) for all i [n] else Select action j = arg mini [n] µk(i) Theorem B.4. Let T, n N be the time horizon and number of actions respectively, with T n 1. For all distributions Di of actions losses, Algorithm 2 with query access k [4T/9] guarantees RT (Ak) min 2k , 2n T 2 ln T We split the proof of this theorem in two lemmas that immediately imply it, one for k < T 2/3 and one otherwise. Lemma B.5. If k < T 2/3, then RT (Ak) 3T q Proof. For all actions i, we define the clean event as Ei = {|µk(i) µ(i)| < ε}, where ε = q 2k . Similarly, let E be the intersection of all the Ei s for i [n]. By Hoeffding s inequality [Equation (5.8) in Chapter 5 of Lattimore and Szepesvári, 2020], we have the following: P E = P i [n] Ei X i [n] P Ei 2n exp 2kε2 1 Consider now the expected instantaneous regret suffered by ETC at time t + 1. We have two cases: Either the clean event holds, so that the instantaneous regret is at most 2εt, or it does not, in which case the instantaneous regret is at most 1. By the law of total probability, we have the following: E [ℓt+1(it) ℓt+1(i)] P Et + E [ℓt+1(it) ℓt+1(i)|Et] 1 T + 2εt 3εt. Summing the expected instantaneous regrets for all t yields the desired bound of as we use the lower bound Ψ 0 to say that in the first k times the regret the algorithm suffers is at most 0. Lemma B.6. If T 2/3 k 4T/9, then RT (Ak) 2n T 2 ln T Proof. We start by showing the statement for 2 actions, and then generalize to n. The case of 2 actions. Let us recall that, by Proposition B.2, we have RT (Ak) = |{t k + 1 | it = 2}| k where we have assumed that µ1 µ2, without loss of generality. We suffer positive regret in the last T k time steps when we select action 2 over 1, which happens if and only if µk(1) µk(2). 
Hence, the first summand of the above regret expression simply becomes (T k) P [ µk(2) µk(1)]. P [ µk(2) µk(1)] = P t=1 ℓt(1) ℓt(2) 0 so if we define gt = ℓt(1) ℓt(2), whose expected value is E [gt] = and P [|gt| 1] = 1, then, by Fact B.8, t=1 gt + k k We distinguish two cases, namely 2 2Ψ+ ln T k and 2 2Ψ+ > ln T k . In the latter case, we get that RT (Ak) = (T k) P [ µk(2) µk(1)] k T exp ( ln T) = 1. Otherwise, we obtain that Ψ ln T 1 , and so RT (Ak) = (T k) P [ µk(2) µk(1)] k The first inequality holds since the maximizing = 4T 3k 2k2 ln T, which is consistent with 0 Ψ 1 since T 2/3 k 4T/9. Generalizing to n actions. We again assume that the first is the best action, and apply Proposition B.2, to obtain E I µk(j) max i =j µk(i) (ℓt(j) ℓt(1)) k j =1 j P µk(j) max i =j µk(i) k j =1 j P [ µk(j) µk(1)] k which holds by the independence of realizations across time steps. Let us define gt(j) = ℓt(1) ℓt(j), so that, Ψj = E [|gt(j)|], E [gt(j)] = j and P [|gt(j)| 1] = 1. Then, by Fact B.8, we obtain P [ µk(j) µk(1)] = P t=1 gt(j) 0 k 2 j 2Ψj + j Recall that, by Observation B.3, Ψ Ψj j. Therefore, let us sum and subtract to the regret upper bound the sum k 2n P j =1(Ψj j), and get k 2 j 2Ψj + j j =1 (Ψj j) (Ψ ) k 2 j 2Ψj + j For each j = 1, an identical derivation to the one in the case of two actions would yield the same regret rate, with k/n in place of k. Thus, overall, we obtain RT (Ak) 2n T 2 ln T which concludes the proof. B.3 Full Feedback: Follow-The-Leader with Ω T Best-Action Queries We provide an algorithm that, equipped with best-action queries for k time steps and with full feedback, achieves regret O(T/k) for k Ω( T) otherwise. This means that, with the addition of feedback, fewer queries are needed to switch to a much smaller regret rate. The idea is to use the Follow-The-Leader paradigm Auer et al. [2003]. That is, we select the best action in the first k time steps since queries give us its identity for free, and successively, we select the action maximizing the empirical average so far, µt 1(i) = 1 t 1 Pt 1 τ=1 ℓτ(i). Algorithm 3 Follow-The-Leader with Best-Action Queries Input: Sequence of losses ℓt(i) Di and k [T] query budget for t [T] do if t k then Observe i t = arg mini [n] ℓt(i) Select action i t Observe ℓt(i) for all i [n] else Select action j = arg mini [n] µt 1(i) We have the following guarantee: Theorem B.7. Let T, n N be the time horizon and number of actions respectively, with T n 1. For all distributions Di of actions losses, Algorithm 3 with query access k 2 T guarantees RT (Ak) min 3 p 2T log 2n T, 5n T For our purposes, only a weaker version of Fact B.1 is needed: Fact B.8. Let Y1, . . . , Ym be i.i.d. random variables such that E [Yτ] = Y , E [|Yτ|] = ΨY , and P [|Yτ| 1] = 1, for all τ [m]. Then, it holds that exp m 2 Y 2ΨY + Y We split the proof of this theorem in two lemmas that immediately imply it, one for k < 2 T and one otherwise. Next, we subsume the notation of Theorem B.7. Lemma B.9. If k < 2 T, then RT (Ak) 3 2T log 2n T. Proof. For all times t and actions i, we define the clean event as Ei,t = {|µt(i) µ(i)| < εt}, where εt = q 2t . Similarly, let Et be the intersection of all the Ei,t s for i [n]. By Hoeffding s inequality [Equation (5.8) in Chapter 5 of Lattimore and Szepesvári, 2020], we have the following: P Et = P i [n] Ei,t X i [n] P Ei,t 2n exp 2tε2 t 1 Consider now the expected instantaneous regret suffered by FTL at time t + 1. 
We have two cases: Either the clean event holds, so that the instantaneous regret is at most 2εt, or it does not, in which case the instantaneous regret is at most 1. By the law of total probability, we have the following: E [ℓt+1(it) ℓt+1(i)] P Et + E [ℓt+1(it) ℓt+1(i)|Et] 1 T + 2εt 3εt. Summing the expected instantaneous regrets for all t yields the desired bound of RT (Ak) 3 p 2T log 2n T, as we use the lower bound Ψ 0 to say that in the first k times the regret the algorithm suffers is at most 0. Lemma B.10. If k 2 T, then RT (Ak) 5n T Proof. We start by showing the statement for 2 actions, and then generalize to n. The case of 2 actions. We again assume that µ1 µ2 without loss of generality. FTL s regret, thus, reads RT (Ak) = E t=k+1 (ℓt(2) ℓt(1)) I { µt 1(2) µt 1(1)} t=k+1 P [ µt 1(2) µt 1(1)] k t=k+1 exp t 2 2 (Ψ ), (11) where the second equality follows from the independence of step t from all the preceding steps, and the inequality by Fact B.8. We distinguish three cases, namely 2 2Ψ+ 1 T < 2 2Ψ+ ln T 2 2Ψ+ > ln T k . In the first case, we have that Ψ 2 (T 1) and so, RT (Ak) T k where the second inequality follows since the maximizing the expression is = 2 k + 3 4T . This implies that Ψ T 2k2 3 32T + 1 2k, which is consistent with 0 Ψ 1 since k 2 In the last case, 2 2Ψ+ > ln T RT (Ak) = (T k) P [ µt 1(2) µt 1(1)] k T exp ( ln T) = 1. We are left to show the case where 1 T < 2 2Ψ+ ln T k . To this end, let us observe that, for all T 2, it holds that T X t=m+1 exp t for all 1 m, r T. This is the case because t=m+1 exp t = e k/r e T/r e1/r 1 re k/r r, since 1 + z ez for all z R. Therefore, RT (Ak) 2 2Ψ + 2(Ψ ) = Ψ 4 We distinguish two subcases: On the one hand, assume that > 8 k, then 4 k 2 < 0. Since Ψ k 2 2 ln T 2 , we have RT (Ak) k 2 4 ln T + 3k 8 + 4 + 1 ln T since the expression is maximized for = 8+ln T 2k > 8 k, as long as T > e8. This implies that Ψ 8 k ln T ln T 8k , which is consistent with 0 Ψ 1. On the other hand, 8 k, and thus 2 0. We use that Ψ T 2 2 , and get RT (Ak) T 2 since the expression is maximized for = 4 k + 3 2T < 8 k. This implies that Ψ 8T k + 3 8T , which is consistent with 0 Ψ 1 since k 2 Generalizing to n actions. The generalization is very similar to the one in the proof of B.4. Indeed, the regret equals E I µt 1(j) min i =j µt 1(i) (ℓt(j) ℓt(1)) k j =1 j P µt 1(j) min i =j µt 1(i) k j =1 j P [ µt 1(j) µt 1(1)] k t 2 j 2Ψj + j where the last inequality follows by Fact B.8. Identically to the proof of Theorem B.4, we sum and subtract the term k 2n P j =1(Ψj j), and get t 2 j 2Ψj + j j =1 (Ψj j) (Ψ ) t 2 j 2Ψj + j 2n (Ψj j) . For each j = 1, an identical derivation to the one in the case of two actions would yield the same regret rate, with k/n in place of k. Thus, overall, we obtain RT (Ak) 5n T which concludes the proof. C Further Related Work Correlated hints. A first model that is close to ours has been introduced by Dekel et al. [2017], which studies Online Linear Optimization when the learner has access to vectors correlated with the actual losses. Surprisingly, this type of hint guarantees an exponential improvement in the regret bound, which is logarithmic in the time horizon when the optimization domain is strongly convex. Subsequently, Bhaskara et al. [2020] generalize such results in the presence of imperfect hints, while Bhaskara et al. [2023b] prove that Ω( T) hints are enough to achieve the logarithmic regret bound. 
While these works share significant similarities with ours, we stress that their results are not applicable to the standard prediction with experts model, given the non-strong convexity of the probability simplex over the actions.

Queries/Ordinal hints. Bhaskara et al. [2023b] study online learning algorithms augmented with ordinal queries of the following type: a query takes as input a small subset of the actions and returns the identity of the best one. With this twist, it is possible to bring the regret down to O(1) by observing only 2 experts' losses in advance at each time. Although our best-action query is stronger (as it compares all the actions), our learner is constrained in the number of times it can issue such queries.

Algorithms with predictions. Other works that explore the interplay between hints and feedback are [Bhaskara et al., 2021a, Shi et al., 2022, Cheng et al., 2023, Bhaskara and Munagala, 2023, Bhaskara et al., 2023a]. More broadly, our work follows the literature on enhancing performance through external information (predictions). This framework has been extensively applied to a variety of online problems to model partial information about the input sequence that can be fruitfully exploited, if accurate, to improve the performance of the algorithms. Examples include sorting [Bai and Coester, 2023], frequency estimation [Aamand et al., 2023], various online problems [Purohit et al., 2018, Gollapudi and Panigrahi, 2019] such as metrical task systems [Antoniadis et al., 2023b,c], graph coloring [Antoniadis et al., 2023a], caching [Lykouris and Vassilvitskii, 2021], scheduling [Lattanzi et al., 2020, Jiang et al., 2022], and several others.
Indeed, our algorithms blindly trust the query, and the bounds presented in our analysis rely on the assumption that the query is perfect. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] . Justification: Please refer to sections 2 and 3. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] . Justification: The paper does not contain any experimental results. Guidelines: The answer NA means that the paper does not include experiments. 
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] . Justification: The paper does not contain any experimental results. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. 
- If only a subset of experiments is reproducible, the authors should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not contain any experimental results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not contain any experimental results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not contain any experimental results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We study a basic theoretical question that has been studied before.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper is theoretical in nature and does not have any significant societal impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.