# on_adaptivity_in_informationconstrained_online_learning__8020bf0f.pdf

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

On Adaptivity in Information-Constrained Online Learning

Siddharth Mitra,1 Aditya Gopalan2
1Chennai Mathematical Institute, smitra@cmi.ac.in
2Indian Institute of Science, aditya@iisc.ac.in

Copyright 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

We study how to adapt to smoothly-varying ("easy") environments in well-known online learning problems where acquiring information is expensive. For the problem of label efficient prediction, which is a budgeted version of prediction with expert advice, we present an online algorithm whose regret depends optimally on the number of labels allowed and $Q^*$ (the quadratic variation of the losses of the best action in hindsight), along with a parameter-free counterpart whose regret depends optimally on $Q$ (the quadratic variation of the losses of all the actions). These quantities can be significantly smaller than $T$ (the total time horizon), yielding an improvement over existing, variation-independent results for the problem. We then extend our analysis to handle label efficient prediction with bandit (partial) feedback, i.e., label efficient bandits. Our work builds upon the framework of optimistic online mirror descent, and leverages second order corrections along with a carefully designed hybrid regularizer that encodes the constrained information structure of the problem. We then consider revealing action partial monitoring games (a version of label efficient prediction with additive information costs), which in general are known to lie in the hard class of games having minimax regret of order $T^{2/3}$. We provide a strategy with an $O((Q^* T)^{1/3})$ bound for revealing action games, along with one with an $O((QT)^{1/3})$ bound for the full class of hard partial monitoring games, both being strict improvements over current bounds.

## 1 Introduction

Online learning is a branch of machine learning that is concerned with the problem of dynamically optimizing utility (or loss) over time in the face of uncertainty, and gives valuable principles to reason about acting under uncertainty. The study of online learning has developed along two concrete lines insofar as modeling the uncertain environment is concerned. On one hand, there is a rich body of work on learning in stochastic environments from an average-case point of view, such as i.i.d. multi-armed bandits (see, e.g., the survey of (Bubeck, Cesa-Bianchi, and others 2012)), online learning in Markov decision processes (Jaksch, Ortner, and Auer 2010; Azar, Osband, and Munos 2017), stochastic partial monitoring (Bartók, Pál, and Szepesvári 2011), etc., which often yields performance guarantees that are strong but can closely depend on the stochastic models at hand. On the other hand, much work has been devoted to studying non-stochastic (or arbitrary, or adversarial) models of environments from a worst-case point of view, e.g., prediction with experts, bandits, and partial monitoring problems, to name a few (Cesa-Bianchi and Lugosi 2006), which naturally yields rather pessimistic guarantees. Recent efforts have focused on bridging this spectrum of modeling structure in online learning problems as arising from non-stochastic environments with loss function sequences exhibiting adequate temporal regularity.
These include the derivation of first-order regret bounds, i.e., adapting to loss sequences with low loss of the best action (Allenberg et al. 2006), and second-order bounds, i.e., adapting to loss sequences with low variation, in prediction with experts (Rakhlin and Sridharan 2012; Steinhardt and Liang 2014) and benign multi-armed bandits (Hazan and Kale 2011; Bubeck et al. 2019; Bubeck, Cohen, and Li 2017; Wei and Luo 2018). In this regard, this paper is an attempt to extend our understanding of adaptivity to low variation in several standard online learning problems where information comes at a cost, namely label efficient prediction (Cesa-Bianchi, Lugosi, and Stoltz 2005), label efficient bandits (Cesa-Bianchi and Lugosi 2006), and classes of partial monitoring problems (Bartók et al. 2014). In the process, we uncover new ways of using existing online learning techniques within the Online Mirror Descent (OMD) family, and make partial progress towards a program of studying the impact of easy (i.e., slowly-varying) environments in information-constrained online learning and partial monitoring problems. Our specific contributions are:

1. For the label efficient prediction game with expert advice, we give a learning algorithm with a regret bound of $O\big(\sqrt{(Q^* T \log K)/n}\big)$, where $Q^*$ is the quadratic variation of the best expert, $T$ is the time horizon of the game, $K$ is the number of experts, and $n$ is the bound on label queries; the bound holds for all regimes except when $nQ^*/T = O(K^2)$ (up to logarithmic factors). We follow this up with an algorithm with an unconditional regret guarantee of $O\big(\sqrt{(QT \log K)/n}\big)$ that holds for any label query budget $n$ and total quadratic variation $Q$. Our algorithms are based on the optimistic OMD framework, but with new combinations of the negative entropy and log-barrier regularization that are best suited to the label efficient game's information structure.

2. We generalize the results to label efficient bandits, where one receives bandit (i.e., for only the chosen expert) feedback at only up to $n$ chosen time instants, and obtain $O\big(\sqrt{(Q^* T K)/n}\big)$ regret. We also show that our upper bounds on regret for label efficient prediction and label efficient bandits are tight in their dependence on $Q$ and $n$ by demonstrating variation-dependent fundamental lower bounds on regret.

3. We show that adapting to low variation is also possible in the class of hard partial monitoring games as per the taxonomy of partial monitoring problems by (Bartók et al. 2014), where we show an algorithm that achieves $O\big((QTK)^{1/3}\big)$ regret. To the best of our knowledge, this is the first algorithm exhibiting instance-dependent bounds for partial monitoring.

## Problem Setup and Notation

A label efficient prediction game proceeds for $T$ rounds with $K \le T$ arms or "experts". In each round (time instant) $t$, the learner selects an arm $i_t \in [K] := \{1, 2, \ldots, K\}$. Simultaneously, the adversary chooses a loss vector $\ell_t \in [0,1]^K$, where $\ell_{t,i}$ is the loss of arm $i$ at time $t$. At each round, the learner can additionally choose to observe the full loss vector $\ell_t$, provided the number of times it has done so in the past has not exceeded a given positive integer $n \le T$ that represents an information budget or constraint. We work in the oblivious adversarial setting where $\ell_t$ does not depend on the previous actions of the learner $i_1, i_2, \ldots, i_{t-1}$; this is akin to the adversary fixing the (worst-possible) sequence of loss vectors in advance.
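To make the interaction concrete, the following minimal Python sketch simulates the label efficient protocol just described. The environment interface, the Bernoulli query rule with rate $\epsilon = n/T$, and the uniform "learner" are illustrative placeholders of ours (not part of the paper); a real algorithm such as Algorithm 2 below would replace the uniform play and use the revealed vectors.

```python
import numpy as np

def simulate_label_efficient_game(losses, n, rng=None):
    """Toy simulation of the label efficient prediction protocol.

    losses : (T, K) array of oblivious adversarial losses in [0, 1].
    n      : expected label budget; feedback is requested w.p. eps = n / T.
    The 'learner' here plays uniformly at random -- a placeholder showing
    where a real algorithm would plug in.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, K = losses.shape
    eps = n / T
    total_loss, queries = 0.0, 0
    for t in range(T):
        x_t = np.ones(K) / K          # placeholder for the learner's distribution
        i_t = rng.choice(K, p=x_t)    # play an arm
        total_loss += losses[t, i_t]
        if rng.random() < eps:        # ask for full feedback within the (expected) budget
            observed = losses[t]      # full loss vector revealed; a real learner updates here
            queries += 1
    best_fixed = losses.sum(axis=0).min()
    return total_loss - best_fixed, queries

# Example: T = 1000 rounds, K = 5 arms, budget n = 100 labels in expectation.
rng = np.random.default_rng(0)
regret, used = simulate_label_efficient_game(rng.random((1000, 5)), n=100, rng=rng)
```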
The learner's goal is to minimize its expected regret, defined as
$$ \max_{i\in[K]}\;\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,i_t} - \sum_{t=1}^{T}\ell_{t,i}\right], $$
where the expectation is taken with respect to the learner's randomness. Given a convex function $R$ over $\Omega$, we denote by $D_R$ the Bregman divergence with respect to $R$, defined as $D_R(x,y) \triangleq R(x) - R(y) - \langle\nabla R(y),\, x - y\rangle$ for all $x, y \in \Omega$. For any point $u \in \mathbb{R}^K$, we define the local norm at $x$ with respect to $R$ as $\|u\|_x = \sqrt{u^\top \nabla^2 R(x)\, u}$, and the corresponding dual norm as $\|u\|_{x,*} = \sqrt{u^\top (\nabla^2 R(x))^{-1} u}$. We denote by $\epsilon$ the fraction of time we are allowed the full loss vector, i.e., $\epsilon = n/T$. The parameter $\epsilon$ can be seen as a way to model the constraint on information defined by the problem. The quadratic variation for a loss vector sequence $\ell_1, \ldots, \ell_T$ is defined by $Q = \sum_{t=1}^{T}\|\ell_t - \mu_T\|_2^2$ with $\mu_s = \frac{1}{s}\sum_{t=1}^{s}\ell_t$. Additionally, the quadratic variation of the best arm(s) is $Q^* = \sum_{t=1}^{T}(\ell_{t,i^*} - \mu_{T,i^*})^2$, where $\mu_{s,i} = \frac{1}{s}\sum_{t=1}^{s}\ell_{t,i}$ and $i^* = \arg\min_{i\in[K]}\sum_{t=1}^{T}\ell_{t,i}$.

## 2 Key Ideas and Algorithms

### Optimistic OMD

The underlying framework behind our algorithms is that of Online Mirror Descent (OMD) (Hazan 2016, e.g.). The vanilla update rule of (active) mirror descent can be written as $x_t = \arg\min_{x\in\Omega}\{\langle x, \ell_{t-1}\rangle + D_R(x, x_{t-1})\}$. On the other hand, our updates are:
$$ x_t = \arg\min_{x\in\Omega}\;\{\langle x, \epsilon m_t\rangle + D_R(x, x'_t)\} \qquad (1) $$
$$ x'_{t+1} = \arg\min_{x\in\Omega}\;\{\langle x, \epsilon\hat{\ell}_t + a_t\rangle + D_R(x, x'_t)\} \qquad (2) $$
where $\epsilon = n/T$, $m_t$ corresponds to optimistic estimates of the loss vectors (which we will also refer to as messages), and $a_t$ denotes a second order correction that we explicitly define later. (Footnote 1: "Optimistic" is used to denote the fact that we would be best off if these estimates were exactly the upcoming loss. Indeed, if $m_t$ were $\ell_t$, it would be equivalent to 1-step lookahead, known to yield low regret.) Throughout the paper, $\hat{\ell}_t$ is used to denote an (unbiased) estimate of $\ell_t$ that the learner constructs at time $t$. Optimistic OMD with second order corrections was first studied in (Wei and Luo 2018), whereas its Follow-the-Regularized-Leader (FTRL) counterpart was introduced earlier by (Steinhardt and Liang 2014). Both of these approaches build upon the general optimistic OMD framework of (Rakhlin and Sridharan 2012) and (Chiang et al. 2012). We define our updates with scaled losses and messages, where we reiterate that the scaling factor $\epsilon$ reflects the limitation on information. This scaling also impacts our second order corrections, which are $\eta\epsilon^2(\hat{\ell}_t - m_t)^2$. It is worthwhile to note that this is explicitly different from the $\eta\epsilon(\hat{\ell}_t - m_t)^2$ that one may expect in light of the analysis done in (Wei and Luo 2018), or the $\eta(\hat{\ell}_t - m_t)^2$ one would anticipate when following (Steinhardt and Liang 2014). One may argue that our update rules are equivalent to dividing throughout by $\epsilon$, or, put differently, to merging an $\epsilon$ into the step size, and this is indeed true. However, the point we would like to emphasize is that no matter how one defines the updates, the second order correction $a_t$ can be seen to incorporate the problem dependent parameter $\epsilon$. This tuning of the second order correction based on $\epsilon$ is different from what one observes for the full information problem (Steinhardt and Liang 2014) or for bandits (Wei and Luo 2018). The second order corrections represent a further penalty on arms which are deviating from their respective messages, and these corrections are what enable us to furnish best arm dependent bounds. As usual, the arm we play is still sampled from the distribution $x_t$ given by equation (1).
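The argmins in (1)-(2) generally have no closed form (the hybrid regularizer introduced below is one such case), but they can be realized with a generic constrained solver. The sketch below is a minimal illustration of ours, not the paper's implementation: the negative-entropy $R$, the step size, and the numbers are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def bregman(R, gradR, x, y):
    """D_R(x, y) = R(x) - R(y) - <grad R(y), x - y>."""
    return R(x) - R(y) - gradR(y) @ (x - y)

def omd_step(R, gradR, x_prev, g):
    """Numerically solve argmin_{x in simplex} <x, g> + D_R(x, x_prev)."""
    dim = len(x_prev)
    obj = lambda x: x @ g + bregman(R, gradR, x, x_prev)
    res = minimize(obj, x_prev, method="SLSQP",
                   bounds=[(1e-9, 1.0)] * dim,
                   constraints=({"type": "eq", "fun": lambda x: x.sum() - 1.0},))
    x = np.clip(res.x, 1e-9, None)
    return x / x.sum()

# One round of updates (1)-(2) with an illustrative negative-entropy R (eta = 0.1).
eta = 0.1
R = lambda x: (x * np.log(x)).sum() / eta
gradR = lambda x: (np.log(x) + 1.0) / eta

K, eps = 4, 0.2                                    # eps = n / T
x_prime = np.ones(K) / K                           # x'_t
m_t = np.array([0.5, 0.4, 0.6, 0.5])               # optimistic message
x_t = omd_step(R, gradR, x_prime, eps * m_t)       # update (1): the played distribution
l_hat = np.array([0.7, 0.2, 0.6, 0.5])             # some loss estimate
a_t = 6 * eta * eps**2 * (l_hat - m_t) ** 2        # second-order correction
x_prime_next = omd_step(R, gradR, x_prime, eps * l_hat + a_t)   # update (2)
```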
### Challenges & Our Choice of Regularization

We briefly discuss the challenges posed by label efficient prediction and how our choice of regularizer addresses these. When shifting away from the classical prediction with expert advice problem to any limited feedback (i.e., over experts or arms) information structure, one usually works with importance-weighted estimates of the loss vectors constructed using the observed (limited) feedback (called inverse propensity weighting estimation). This is indeed the case with label efficient prediction; however, the probabilities in the denominator remain fixed at $\epsilon$, unlike in bandits where the $x_{t,i}$ in the denominator can be arbitrarily small. Consequently, one may be led to believe that the standard negative entropic regularizer, as is typically used for full information (Steinhardt and Liang 2014), will suffice for the more general but related label efficient prediction. However, maintaining the $|\eta\hat{\ell}_{t,i}| \le 1$ inequality, which is standard in analyses similar to Exp3, imposes a strict bound of $\eta \le \epsilon$. Since low quadratic variation, on the other hand, would encourage one to set an aggressive learning rate $\eta$, this makes the applicability of the algorithm rather limited, and even then, with marginal gain. Put crisply, it is desirable that low quadratic variation should lead an algorithm to choose an aggressive learning rate, and negative entropy fails to maintain a stability property, key in obtaining OMD regret bounds, in such situations. (Footnote 2: In the sense of successive points being sufficiently close to each other. Please refer to Lemma 14 in https://arxiv.org/abs/1910.08805.) The log-barrier regularizer, used by (Wei and Luo 2018) for bandit feedback, certainly guarantees this; however, using the log-barrier blindly translates to a $\sqrt{K}$ dependence on the number of arms $K$. These challenges place label efficient prediction with slowly varying losses in a unique position, as one requires enough curvature to ensure stability, yet must not let this added curvature significantly hinder exploration. Our solution is to use a hybrid regularizer, that is, a weighted sum of the negative entropic regularizer and the log-barrier regularizer:
$$ R(x) \;=\; \frac{1}{\eta}\sum_{i=1}^{K} x_i\log x_i \;+\; \frac{1}{K\eta}\sum_{i=1}^{K}\log\frac{1}{x_i}. $$
This regularizer has been of recent interest due to the work of (Bubeck et al. 2019) and (Bubeck, Cohen, and Li 2017), but the weights chosen for the two components are highly application-specific and tend to reflect the nature of the problem. Table 1 summarizes the choices made in related work and in this paper.

| Reference | Feedback | Negentropy : log-barrier ratio used |
| --- | --- | --- |
| (Bubeck, Cohen, and Li 2017) | Bandit | $1 : 2\eta$ |
| (Wei and Luo 2018) | Bandit | $0 : 1$ |
| (Bubeck et al. 2019) | Bandit | $K/\eta : 1/\eta = K : 1$ |
| (Steinhardt and Liang 2014) | Full information | $1 : 0$ |
| This work | Label efficient, full information | $1/\eta : 1/(K\eta) = K : 1$ |
| This work | Label efficient, bandit feedback | $0 : 1$ |

Table 1: Choice of regularization (negative entropy vs. logarithmic barrier) in OMD for exploiting regularity.

As reported above, we only require the log-barrier to guarantee stability, and we therefore associate a small (roughly $1/(K\eta)$) weight to it and a dominant mass of $1/\eta$ to negative entropy. The additional $1/K$ factor in the log-barrier weight is carefully chosen to exactly cancel the $K$ in the leading $K\log T$ term generated by the log-barrier component, and consequently, to avoid a $\sqrt{K}$ dependence on the number of arms in the final regret bound.
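For intuition on why the small log-barrier weight restores stability, a routine computation (ours, not quoted from the paper) gives the Hessian of the hybrid regularizer and the resulting control on the dual local norm:
$$
\nabla^2 R(x) \;=\; \mathrm{diag}\!\left(\frac{1}{\eta x_i} + \frac{1}{K\eta x_i^2}\right)_{i=1}^{K},
\qquad
\|u\|_{x,*}^2 \;=\; \sum_{i=1}^{K} \frac{u_i^2}{\tfrac{1}{\eta x_i} + \tfrac{1}{K\eta x_i^2}}
\;\le\; K\eta \sum_{i=1}^{K} x_i^2\, u_i^2 .
$$
The $1/(K\eta x_i^2)$ term grows quadratically as $x_i \to 0$, which keeps the dual local norm of large loss estimates under control even for aggressive choices of $\eta$; this is the stability property referred to above.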
### Reservoir Sampling

When considering quadratic variation as a measure of adaptivity, a natural message to pass is the mean of the previous loss history, that is, $m_t = \mu_{t-1} = \frac{1}{t-1}\sum_{s=1}^{t-1}\ell_s$. However, the constraint on information prohibits us from having the full history, and we therefore have to settle for some estimate of the mean. Reservoir sampling, first used for this purpose in (Hazan and Kale 2011), solves this very problem. Specifically, by allocating roughly $k(1+\log T)$ rounds for reservoir sampling (where we choose $k$ to be $\log T$), reservoir sampling gives us estimates $\hat{\mu}_t$ such that $\mathbb{E}[\hat{\mu}_t] = \mu_t$ and $\mathrm{Var}[\hat{\mu}_t] \le Q/(kt)$. It does so by maintaining a carefully constructed reservoir $S$ of size $k$, the elements of which are then averaged to output the estimate of the mean. Our message $m_t$ at any time $t$ is the average of the vectors contained in the reservoir $S$. We specify the reservoir sampling algorithm in Algorithm 1.

    Algorithm 1 RESERVOIR SAMPLING
    1: Input: reservoir S, reservoir size k, stream ℓ_1, ℓ_2, ...
    2: for t = 1, 2, ..., k do
    3:    Include ℓ_t in S
    4: end for
    5: for t = k+1, ... do
    6:    b_t ~ Bern(k/t)
    7:    if b_t = 1 then
    8:       Include ℓ_t in S by replacing a uniformly-at-random chosen element of S with ℓ_t
    9:    end if
    10: end for

### 2.1 Main Algorithm

Algorithm 2 builds upon the preceding ideas and, as stated, is specifically for the label efficient prediction problem discussed thus far. The algorithms required for the extensions we provide in Section 4 are based upon Algorithm 2 with a few minor differences. Also, in the interest of brevity, we have excluded the explicit mention of the reservoir sampling steps in Algorithm 2. Before we proceed, we state our choice of messages, loss estimates, and second order corrections in Table 2. Our messages, for all sections, will be $m_{t,i} = \hat{\mu}_{t-1,i}$. Note that throughout the paper, the random variable $d_t = 1$ signifies that we ask for feedback at time $t$, and $d_t = 0$ otherwise. Additionally, note that we consider not exceeding the budget of $n$ in expectation; however, there is a standard reduction to get a high probability guarantee, which can be found in (Cesa-Bianchi and Lugosi 2006).

| Problem | Section | Loss estimate $\hat{\ell}_{t,i}$ | Correction $a_t$ | Regret bound |
| --- | --- | --- | --- | --- |
| Label efficient prediction | 2.1, 3 | $\frac{\ell_{t,i} - m_{t,i}}{\epsilon}\mathbf{1}\{d_t{=}1\} + m_{t,i}$ | $6\eta\epsilon^2(\hat{\ell}_{t,i} - m_{t,i})^2$ | $O\big(\sqrt{(Q^*T\log K)/n}\big)$ |
| Label efficient bandits | 4.1 | $\frac{\ell_{t,i} - m_{t,i}}{\epsilon x_{t,i}}\mathbf{1}\{d_t{=}1, i_t{=}i\} + m_{t,i}$ | $6\eta\epsilon^2 x_{t,i}(\hat{\ell}_{t,i} - m_{t,i})^2$ | $O\big(\sqrt{(Q^*TK)/n}\big)$ |
| Revealing action games | 4.2 | $\frac{\ell_{t,i} - m_{t,i}}{\alpha}\mathbf{1}\{d_t{=}1\} + m_{t,i}$ | $6\eta\alpha^2(\hat{\ell}_{t,i} - m_{t,i})^2$ | $O\big((Q^*T)^{1/3}\big)$ |
| Hard partial monitoring | 4.2 | $\frac{\ell_{t,i} - m_{t,i}}{x_{t,j}}\mathbf{1}\{i_t{=}j\} + m_{t,i}$ | $0$ | $O\big((QTK)^{1/3}\big)$ |

Table 2: Overview of loss estimates, second order corrections, and the corresponding upper bounds on regret.

    Algorithm 2 ADAPTIVE LABEL EFFICIENT PREDICTION
    1: Input: R(x) = (1/η) Σ_{i=1}^{K} x_i log x_i + (1/(Kη)) Σ_{i=1}^{K} log(1/x_i), η, ε
    2: Initialize: x'_1 = argmin_{x∈Ω} R(x)
    3: for t = 1, 2, ..., T do
    4:    Draw d_t ~ Bern(ε)
    5:    x_t = argmin_{x∈Ω} { ⟨x, εm_t⟩ + D_R(x, x'_t) }
    6:    Play i_t ~ x_t, and if d_t = 1, observe ℓ_t
    7:    Construct ℓ̂_t = (ℓ_t − m_t)/ε · 1{d_t=1} + m_t
    8:    Let a_t = 6ηε²(ℓ̂_t − m_t)²
    9:    Update: x'_{t+1} = argmin_{x∈Ω} { ⟨x, εℓ̂_t + a_t⟩ + D_R(x, x'_t) }
    10: end for
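A self-contained sketch of the reservoir-based message construction that Algorithm 2 consumes is given below; it follows Algorithm 1, with the class name, the reservoir size, and the tiny stream chosen by us for illustration.

```python
import random

class Reservoir:
    """Reservoir sampling (Algorithm 1): keeps k uniformly sampled loss vectors
    from the stream seen so far; their entrywise average is the message m_t."""

    def __init__(self, k, seed=0):
        self.k = k
        self.t = 0              # number of stream elements seen
        self.S = []             # the reservoir
        self.rng = random.Random(seed)

    def observe(self, loss_vec):
        self.t += 1
        if len(self.S) < self.k:
            self.S.append(loss_vec)                    # fill phase: t = 1, ..., k
        elif self.rng.random() < self.k / self.t:
            # replace a uniformly chosen element of S with the new vector
            self.S[self.rng.randrange(self.k)] = loss_vec

    def message(self):
        """Entrywise average of the reservoir: the optimistic message m_t."""
        if not self.S:
            return None
        dim = len(self.S[0])
        return [sum(v[i] for v in self.S) / len(self.S) for i in range(dim)]

# Usage sketch: feed each fully observed loss vector (rounds with d_t = 1) into
# the reservoir and use reservoir.message() as m_t in updates (1)-(2).
reservoir = Reservoir(k=5)
for vec in ([0.2, 0.8], [0.4, 0.6], [0.3, 0.7]):
    reservoir.observe(list(vec))
m_t = reservoir.message()
```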
## 3 Results and Analysis

We now give a general regret result for the OMD updates (1) and (2). It spells out the condition we must maintain to ultimately enable best arm dependent bounds, while also demonstrating the price of limited information on regret, which is the additional $1/\epsilon$ factor. The proofs for all results in this section and subsequent sections are available in the full version of this paper. (Footnote 3: The full version is available at https://arxiv.org/abs/1910.08805.)

Lemma 1. For the update rules (1) and (2), if
$$ \langle x_t - x'_{t+1},\; \epsilon(\hat{\ell}_t - m_t) + a_t\rangle \;\le\; \langle x_t, a_t\rangle, \qquad a_t \ge 0, \qquad (3) $$
then, for all $u \in \Omega$, we have
$$ \epsilon\,\langle x_t - u, \hat{\ell}_t\rangle \;\le\; D_R(u, x'_t) - D_R(u, x'_{t+1}) + \langle u, a_t\rangle - P_t, \qquad (4) $$
where $P_t \triangleq D_R(x'_{t+1}, x_t) + D_R(x_t, x'_t) \ge 0$.

Note that when $a_t = 0$ is employed in the updates (1)-(2), i.e., no second order corrections, the first term in (3) can directly be handled using Hölder's inequality (in some norm in which $R$ is strongly convex). Doing so allows us to cancel the unwanted $\|x_t - x'_{t+1}\|^2$ term using the $D_R(x'_{t+1}, x_t)$ term in $P_t$ (which follows by strong convexity), while retaining the crucial $\|\hat{\ell}_t - m_t\|^2$ variance term. However, with general second order corrections ($a_t \neq 0$), the key variance term is $\langle u, a_t\rangle$, as it corresponds to the best arm's second moment under a suitably chosen $u$, and the responsibility of cancelling the entire first term of (3) now falls upon $\langle x_t, a_t\rangle$. Under limited information, negative entropy is unable to maintain this, and we therefore have to incorporate the log barrier function (see also (Wei and Luo 2018)). We now state our main result for adaptive label efficient prediction, which bounds the regret of Algorithm 2.

Theorem 2. For $a_t = 6\eta\epsilon^2(\hat{\ell}_t - m_t)^2$, $\hat{\ell}_t = \frac{\ell_t - m_t}{\epsilon}\mathbf{1}\{d_t=1\} + m_t$, $\epsilon = n/T$, and $\eta \le \frac{1}{162K}$, where the sequence of messages $m_t$ is generated using the reservoir sampling scheme, the expected regret of Algorithm 2 satisfies
$$ \mathbb{E}[R_T] \;\le\; \frac{\log K + \log T}{\epsilon\eta} + 18\,\eta\, Q^*. $$
Furthermore, if $\epsilon Q^* \ge 1458\,K^2\log(KT)$, then $\mathbb{E}[R_T] = O\big(\sqrt{(Q^* T\log(KT))/n}\big)$ with an optimal choice of $\eta$.

Consider a concrete example of a game played for time $T$ (with, say, a label budget of $n = \sqrt{T}$), where we anticipate $Q^* \le \sqrt{T}$. In this scenario, if we were to run the standard label efficient prediction algorithm as given in (Cesa-Bianchi, Lugosi, and Stoltz 2005), we would get a regret bound of $O(T^{3/4})$; following an FTRL with negative entropy-based strategy (Footnote 4: As done in (Steinhardt and Liang 2014) for prediction with experts.) would be inapplicable in this setting due to the constraint we highlight in Section 2; however, Algorithm 2 would incur $\tilde{O}(\sqrt{T})$ regret, a marked improvement. Also, note that because of the full vector feedback, it is not required to allocate any rounds exclusively for reservoir sampling. This fact is reflected in not having to incur any additive penalty for reservoir sampling.

Proof sketch of Theorem 2. The result of Theorem 2 follows rather straightforwardly from (4). The key part of the proof lies in showing that the choice of messages, second-order corrections, and loss estimators satisfies (3). To show that (3), i.e., $\langle x_t - x'_{t+1}, \epsilon(\hat{\ell}_t - m_t) + a_t\rangle \le \langle x_t, a_t\rangle$, is satisfied, we upper bound the left hand side by $\|x_t - x'_{t+1}\|_{x_t}\,\|\epsilon(\hat{\ell}_t - m_t) + a_t\|_{x_t,*}$, and show that this is indeed upper bounded by $\langle x_t, a_t\rangle$. We then relate both $\|x_t - x'_{t+1}\|_{x_t}$ and $\|\epsilon(\hat{\ell}_t - m_t) + a_t\|_{x_t,*}$, and show that our choice of estimator, messages, and second-order corrections guarantees that $\|\epsilon(\hat{\ell}_t - m_t) + a_t\|_{x_t,*}$ is small.
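For completeness, the tuning of $\eta$ behind the final statement of Theorem 2 is the usual balancing of the two terms in the displayed bound (a short derivation, assuming that bound):
$$
\eta^{*} = \sqrt{\frac{\log K + \log T}{18\,\epsilon\, Q^{*}}}
\;\Longrightarrow\;
\frac{\log K + \log T}{\epsilon\,\eta^{*}} + 18\,\eta^{*} Q^{*}
= 2\sqrt{\frac{18\,Q^{*}(\log K + \log T)}{\epsilon}}
= O\!\left(\sqrt{\frac{Q^{*}\,T\,\log(KT)}{n}}\right),
$$
and the constraint $\eta^{*} \le 1/(162K)$ is satisfied exactly when $\epsilon Q^{*} \ge 162^{2}K^{2}(\log K + \log T)/18 = 1458\,K^{2}\log(KT)$, which is the condition appearing in the theorem.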
Theorem 2 is slightly restricted in scope, due to the lower bound required on $\epsilon Q^*$, in its ability to attain the optimal regret scaling with quadratic variation. We now proceed to discuss what can be said without any constraint on $\epsilon Q^*$. Specifically, we will provide an algorithm obtaining $O\big(\sqrt{(QT\log K)/n}\big)$ regret under all scenarios, the trade-off being that we will be penalized by $Q$ instead of $Q^*$. In settings where the $\epsilon Q^*$ condition does not hold and incurring regret in terms of $Q$ is not unfavourable (as an extreme example, consider constant variation on all arms, with very limited feedback), the strategy below will certainly be of use.

The algorithm, again based on OMD, foregoes second order corrections and has updates defined by:
$$ x_t = \arg\min_{x\in\Omega}\;\{\langle x, \epsilon m_t\rangle + D_R(x, x'_t)\} \qquad (5) $$
$$ x'_{t+1} = \arg\min_{x\in\Omega}\;\{\langle x, \epsilon\hat{\ell}_t\rangle + D_R(x, x'_t)\} \qquad (6) $$
Without second order corrections, the $\epsilon$ term can be folded into the regularizer, and the updates reduce to the ones studied in (Rakhlin and Sridharan 2012). For updates (5) and (6), we have the following analogue of Lemma 1, and then, consequently, the analogue of Theorem 2. We include these here in the interest of completeness, but equivalent statements can be found in (Rakhlin and Sridharan 2012).

Lemma 3. For any $u \in \Omega$, updates (5) and (6) guarantee that
$$ \epsilon\,\langle x_t - u, \hat{\ell}_t\rangle \;\le\; D_R(u, x'_t) - D_R(u, x'_{t+1}) + \langle x_t - x'_{t+1},\, \epsilon\hat{\ell}_t - \epsilon m_t\rangle - D_R(x'_{t+1}, x_t) - D_R(x_t, x'_t). $$

Theorem 4. For $R = \frac{1}{\eta}\sum_{i=1}^K x_i\log x_i$, $\hat{\ell}_t = \frac{\ell_t - m_t}{\epsilon}\mathbf{1}\{d_t=1\} + m_t$, $\epsilon = n/T$, and $\eta > 0$, where the sequence of messages is generated using the reservoir sampling scheme, Algorithm 2 with $a_t = 0$ yields
$$ \mathbb{E}[R_T] \;\le\; \frac{\log K}{\epsilon\eta} + O(\eta Q). $$
Optimally tuning $\eta$ yields an $O\big(\sqrt{(QT\log K)/n}\big)$ bound.

Understanding more deeply how the constraint of Theorem 2 can be sidestepped to yield a universal algorithm dependent on $Q^*$ remains a direction of future interest.

### Parameter-Free Algorithms

Note that we have assumed knowledge of $T$, $Q$, and $Q^*$ when optimising for the fixed step size $\eta$ in the above discussion. This is often not possible, and we now briefly discuss the extent to which we can obtain parameter-free algorithms. In Theorem 5 we claim that we can choose $\eta$ adaptively for the $Q$-dependent bound we present in Theorem 4. (Footnote 5: Note that, similarly to (Hazan and Kale 2011), we still assume knowledge of $T$, but this can be circumvented using standard tricks.) It remains open whether a $Q^*$-dependent bound (or, in general, any bound depending on a non-monotone quantity) can be made parameter free even for the standard prediction with expert advice problem. The challenge is essentially that our primary tool for sidestepping prior knowledge of a parameter, the doubling trick, is inapplicable for non-monotone quantities. Even freeing algorithms from prior knowledge of nondecreasing arm dependent quantities, such as $\max_i Q_i$, remains open for limited information setups (i.e., anything outside prediction with expert advice) due to the lack of a clear auxiliary term one can observe. In Algorithm 3, we proceed in epochs (or rounds) such that $\eta$ remains fixed per epoch. Denote by $\eta_\alpha$ the value of $\eta$ in epoch $\alpha$. We will write $T_\alpha$ for the first time instance in epoch $\alpha$.

    Algorithm 3 PARAMETER FREE ADAPTIVE LABEL EFFICIENT PREDICTION
    1: Initialize: η = √(2 log K)/ε, T_1 = 1, t = 1
    2: for α = 1, 2, ... do
    3:    x'_t = argmin_{x∈Ω} R(x)
    4:    while t ≤ T do
    5:       Draw d_t ~ Bern(ε), update x_t according to (5)
    6:       Play i_t ~ x_t and if d_t = 1, observe ℓ_t
    7:       Update x'_{t+1} according to (6)
    8:       if Σ_{s=T_α}^{t} Σ_{i=1}^{K} (ℓ̂_{s,i} − m_{s,i})² ≥ 2 log K / (ε² η_α²) then
    9:          η ← η/2, T_{α+1} ← t, t ← t + 1
    10:         break
    11:      end if
    12:      t ← t + 1
    13:   end while
    14: end for

Theorem 5. For the conditions mentioned in Theorem 4, Algorithm 3 (a parameter free algorithm) achieves
$$ \mathbb{E}[R_T] \;=\; O\big(\sqrt{(QT\log K)/n}\,\big), $$
up to additive lower-order terms.
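When the regularizer is the pure negative entropy used in Theorem 4 and Algorithm 3, both argmins in (5) and (6) have the familiar multiplicative-weights closed form over the simplex, so no numerical solver is needed. The sketch below is our own illustration of that special case, with made-up numbers.

```python
import numpy as np

def entropic_omd_pair(x_prime, m_t, l_hat, eta, eps):
    """Updates (5)-(6) for R(x) = (1/eta) * sum_i x_i log x_i on the simplex.

    argmin_x <x, g> + D_R(x, y) has the closed form x_i proportional to
    y_i * exp(-eta * g_i), followed by normalization.
    """
    # (5): the played distribution, shifted by the optimistic message
    x_t = x_prime * np.exp(-eta * eps * m_t)
    x_t /= x_t.sum()
    # (6): the next intermediate iterate, shifted by the loss estimate
    x_prime_next = x_prime * np.exp(-eta * eps * l_hat)
    x_prime_next /= x_prime_next.sum()
    return x_t, x_prime_next

# One illustrative round with K = 3 arms and eps = n / T = 0.1:
x_prime = np.ones(3) / 3
m_t = np.array([0.5, 0.3, 0.6])       # reservoir-based message
l_hat = np.array([0.5, 0.3, 1.4])     # importance-weighted loss estimate
x_t, x_prime = entropic_omd_pair(x_prime, m_t, l_hat, eta=0.5, eps=0.1)
```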
## 4 Adapting to Slowly Varying Losses in Other Information-constrained Games

We will now investigate exploiting the regularity of losses in a variety of other settings with implicit or explicit information constraints. We will first focus on bandit feedback, following which we will briefly discuss partial monitoring.

### 4.1 Label Efficient Bandits

The change here is in the feedback the learner receives when asking for information. Instead of receiving the full loss vector, the learner now only receives the loss of the played arm $i_t$, i.e., the $i_t$-th coordinate of $\ell_t$. We will continue to use the same update rules (1) and (2) here. What changes most importantly is the regularizer, which will now solely be the log barrier regularizer $R = \frac{1}{\eta}\sum_{i=1}^{K}\log\frac{1}{x_i}$. Note that the coefficient of the log barrier is also $1/\eta$, instead of the earlier $1/(K\eta)$. The loss estimates and second order corrections will also change, and these are all mentioned in Table 2. We now state the main theorem for label efficient bandits.

Theorem 6. For $a_{t,i} = 6\eta\epsilon^2 x_{t,i}(\hat{\ell}_{t,i} - m_{t,i})^2$, $\hat{\ell}_{t,i} = \frac{\ell_{t,i} - m_{t,i}}{\epsilon x_{t,i}}\mathbf{1}\{d_t=1, i_t=i\} + m_{t,i}$, $\epsilon = n/T$, and $\eta \le \frac{1}{162K}$, where the sequence of messages $m_t$ is given by reservoir sampling, the regret of Algorithm 2 modified for label efficient bandits satisfies
$$ \mathbb{E}[R_T] \;\le\; \frac{K\log T}{\epsilon\eta} + 18\,\eta\, Q^* + K(\log T)^2. $$

Note that since we are in the bandit feedback setting, we now reserve certain rounds solely for reservoir sampling. This is reflected in the additive $K(\log T)^2$ term in the regret: there are now $(\log T)^2$ rounds allotted to each of the $K$ arms, hence the term. There will also be a few minor changes in the algorithm, primarily corresponding to the appropriate execution of reservoir sampling under bandit feedback.

### 4.2 Partial Monitoring

We will now discuss adaptivity in partial monitoring games. A partial monitoring game $G = (L, H)$ is defined by a pair $L$ and $H$ of $K \times N$ matrices. Both matrices are visible to the learner and the adversary. At each time $t$, the learner selects a row (or arm, action) $i_t \in [K]$ and the opponent chooses a column $y_t \in [N]$. The learner then incurs a loss of $\ell(i_t, y_t)$ and observes feedback $h(i_t, y_t)$. (Footnote 6: We are considering oblivious adversarial opponents as before, and further take entries of $H$ to be in $[0,1]$. The assumption on the entries is not major since the learner can always appropriately encode the original entries by numbers.) When clear from context, we will denote by $\ell(i, t)$ the loss of arm $i$ at time $t$ and by $h(i, t)$ the feedback of arm $i$ at time $t$. The expected regret here is
$$ \max_{i\in[K]}\;\mathbb{E}\left[\sum_{t=1}^{T}\ell(i_t, y_t) - \sum_{t=1}^{T}\ell(i, y_t)\right]. $$

#### Revealing Action Partial Monitoring

First consider the class of partial monitoring games with a revealing action, that is, suppose $H$ has a row with $N$ distinct elements. It is clear that if the learner plays this row, they receive full information regarding which column the adversary has chosen. The cost of playing this row very well defines which class this game falls into (see, for example, the spam game discussed in (Lattimore and Szepesvári 2019)), but in general, the minimax regret of these games scales as $T^{2/3}$, and these games therefore fall in the hard class of games. Revealing action games and label efficient prediction differ in the way they charge the learner for information. For label efficient prediction, we have seen that there is a fixed number of times (budget) one can obtain information, but there is no additional cost of doing so. In revealing action games, however, there is a loss associated with each time the learner asks for information. We will now show a reduction from this class of games to the standard label efficient prediction we discussed in Sections 2 and 3. Let the cost of playing the revealing action be $c = \max_{b\in[N]} L(a, b)$, where $a \in [K]$ is the revealing action row of $L$. Suppose $\alpha$ is the probability with which we play the revealing action at each round. Here $\alpha$ corresponds to the $\epsilon$ from earlier sections; however, $\alpha$ is now a free parameter. (Footnote 7: Note that the update rules (1) and (2) will now also have $\alpha$ in place of $\epsilon$.) We will still run reservoir sampling in the background as before to obtain the optimistic messages $m_t$.
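Concretely, the per-round sampling implied by this reduction can be sketched as follows. This is an illustration of ours, not the paper's code: names such as `revealing_row`, the list-based reservoir, and the example matrices are placeholders.

```python
import numpy as np

def revealing_action_round(x_t, alpha, revealing_row, L, H, y_t, reservoir, rng):
    """One round of the revealing-action reduction (illustrative sketch).

    With prob. alpha play the revealing row: its feedback has distinct entries,
    so it identifies the adversary's column y_t; since L is known, the full loss
    column L[:, y_t] can then be fed to the reservoir producing the messages m_t.
    Otherwise, sample the arm from the OMD distribution x_t.
    """
    if rng.random() < alpha:
        i_t = revealing_row
        h_obs = H[i_t, y_t]                              # revealed feedback
        y_rec = int(np.argmin(np.abs(H[i_t] - h_obs)))   # distinct entries => recovers y_t
        reservoir.append(L[:, y_rec])                    # full-information sample for m_t
    else:
        i_t = rng.choice(len(x_t), p=x_t)                # exploit the OMD iterate
    return i_t, L[i_t, y_t]                              # arm played and loss incurred

# Usage: K = 3 actions, N = 2 outcomes, row 0 is revealing (distinct H entries).
rng = np.random.default_rng(0)
L = np.array([[0.9, 0.9], [0.0, 1.0], [1.0, 0.0]])
H = np.array([[0.0, 1.0], [0.5, 0.5], [0.5, 0.5]])
reservoir = []
i_t, loss = revealing_action_round(np.array([0.2, 0.4, 0.4]), 0.1, 0, L, H, y_t=1,
                                   reservoir=reservoir, rng=rng)
```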
Now, in this light, the following theorem can be seen to follow from Theorem 2.

Theorem 7. For $a_t = 6\eta\alpha^2(\hat{\ell}_t - m_t)^2$, $\hat{\ell}_t = \frac{\ell_t - m_t}{\alpha}\mathbf{1}\{d_t=1\} + m_t$, $\alpha \le 1$, and $\eta \le \frac{1}{162K}$, where the sequence of messages $m_t$ is generated using reservoir sampling, the expected regret of Algorithm 2 modified for revealing action partial monitoring games with loss entries in $[0,1]$ satisfies
$$ \mathbb{E}[R_T] \;\le\; \frac{\log K + \log T}{\alpha\eta} + 18\,\eta\,Q^* + \alpha T c + (\log T)^2. $$
Optimising the parameters $\eta$ and $\alpha$ yields a bound of $O\big((Q^* T\log K)^{1/3}\big)$.

Note that now we will again have to allocate rounds specifically for reservoir sampling, as was the case with bandits, hence the additive $(\log T)^2$ term. The added $\alpha T c$ corresponds to the cost paid for playing the revealing action.

#### Hard Partial Monitoring Games

We now turn to the hard class of partial monitoring games. As mentioned in (Piccolboni and Schindelhauer 2008) and (Cesa-Bianchi and Lugosi 2006), we will assume that there exists a matrix $W$ such that $L = WH$. This is not an unreasonable assumption: if it does not hold for the given $L$ and $H$, one can suitably modify them (see (Piccolboni and Schindelhauer 2008)) to ensure $L' = W'H'$, and if this condition continues to fail after appropriate modifications, (Piccolboni and Schindelhauer 2008) show that sublinear regret is not possible for the original $G = (L, H)$. Observe that $L = WH$ allows us to write $\ell(i, t) = \sum_{j\in[K]} w(i,j)h(j,t)$. Therefore,
$$ \hat{\ell}(i, t) \;=\; \sum_{j\in[K]} \frac{\big(w(i,j)h(j,t) - m_{t,i}\big)\mathbf{1}\{i_t = j\}}{x_{t,j}} \;+\; m_{t,i} $$
is now an unbiased estimate of $\ell(i, t)$. Here $m_t$ is still the vector of optimistic messages, where $m_{t,i}$ corresponds to an estimate of the average loss incurred by arm $i$ till time $t$. These will still be obtained using reservoir sampling, and we will maintain a separate reservoir for each arm $i \in [K]$. Note that since $\ell(i,t) = \sum_{j\in[K]} w(i,j)h(j,t)$ and the matrices $L$, $W$, and $H$ are all visible to the learner, playing action $r$ at time $t$, for example, will allow the learner to observe the $r$-th component $w(i,r)h(r,t)$ of the loss of each action $i \in [K]$. Therefore, by maintaining an estimate (reservoir) for each component, we will be able to maintain an estimate for each arm. Now, for these games we will use optimistic OMD without second order corrections (Rakhlin and Sridharan 2012; Chiang et al. 2012). The update rules are the same as equations (5) and (6) without the $\epsilon$ term. Additionally, the arm we play will be sampled from $w_t$, where $w_t = (1-\gamma)x_t + \frac{\gamma}{K}\mathbf{1}$. The forced exploration is necessary to ensure a minimum mass on all arms. Note that the structure defined by $\ell(i,t) = \sum_{j\in[K]} w(i,j)h(j,t)$ means that we potentially have to play all arms to maintain unbiased estimates of any arm. This forced exploration is unavoidable (see (Cesa-Bianchi and Lugosi 2006)).

Theorem 8. Given $G = (L, H)$ with loss entries in $[0,1]$, a matrix $W$ such that $L = WH$, $\eta > 0$, and $R = \frac{1}{\eta}\sum_{i=1}^K x_i\log x_i$, the update rules (5) and (6) (omitting the $\epsilon$), mixed with $\gamma$ forced exploration, satisfy
$$ \mathbb{E}[R_T] \;\le\; \frac{\log K}{\eta} + \frac{\eta K Q}{2\gamma} + \gamma T. $$
Optimising for $\eta$ and $\gamma$ gives us a regret of $O\big((QTK)^{1/3}\big)$. Note here the strong dependence on $K$, which is an outcome of each $\ell(i,t)$ depending on potentially all $K$ other actions.
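A small numerical example of the $L = WH$ structure may help: playing a single row $r$ reveals $h(r,t)$, and therefore the $r$-th summand $w(i,r)h(r,t)$ of every arm's loss, which is what makes the per-component reservoirs possible. The matrices below are arbitrary illustrations of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 3, 4
W = rng.random((K, K))            # assumed decomposition matrix
H = rng.random((K, N))            # feedback matrix
L = W @ H                         # loss matrix satisfying L = W H by construction

y_t = 2                           # adversary's column at time t
r = 1                             # row (action) played by the learner
h_obs = H[r, y_t]                 # the only feedback received this round

# The r-th component of *every* arm's loss is now known, since W is known:
observed_components = W[:, r] * h_obs            # w(i, r) * h(r, t) for all i
full_losses = L[:, y_t]                          # equals sum_j w(i, j) * h(j, y_t)
assert np.allclose(full_losses, W @ H[:, y_t])   # the L = W H identity, per column
```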
## 5 Lower Bounds

We now prove explicit quadratic variation-based lower bounds for (standard) label efficient prediction and label efficient bandits. By capturing both the constraint on information as well as the quadratic variation of the loss sequence, our lower bounds generalize and improve upon existing lower bounds. We extend the lower bounds for label efficient prediction to further incorporate the quadratic variation of the loss sequence, and enhance the quadratic variation dependent lower bounds for multi-armed bandits to also include the constraint on information by bringing in the number of labels the learner can observe ($n$). Our bounds will be proven in a two-step manner similar to that in (Gerchinovitz and Lattimore 2016). The main feature of step 1 (the lemma step) is that of centering the Bernoulli random variables around a parameter $\alpha$ instead of $1/2$, which leads the regret bound to involve the $\alpha(1-\alpha)$ term corresponding to the variance of the Bernoulli distribution. Step 2 (the theorem step) builds upon step 1 and shows the existence of a loss sequence belonging to an $\alpha$-variation ball (defined below) which also incurs regret of the same order. Recall the quadratic variation for a given loss sequence: $Q = \sum_{t=1}^{T}\|\ell_t - \mu_T\|_2^2 \le TK/4$. Now, for $\alpha \in [0, 1/4]$, define an $\alpha$-variation ball as
$$ \mathcal{V}_\alpha \;\triangleq\; \Big\{\{\ell_t\}_{t=1}^{T} \;:\; \frac{Q}{TK} \le \alpha\Big\}. $$
Theorems 10 and 12, after incorporating $Q \approx \alpha TK$, give us lower bounds of $\Omega\big(\sqrt{QT\log(K-1)/(Kn)}\big)$ and $\Omega\big(\sqrt{QT/n}\big)$, respectively. Our corresponding upper bounds are $O\big(\sqrt{(QT\log K)/n}\big)$ and $O\big(\sqrt{QTK/n}\big)$. (Footnote 8: We upper bound all of our $Q^*$-dependent upper bounds by $Q$ so as to consistently compare with the lower bounds. Note that $Q^*$ and $Q$ are in general incomparable, and all that can be said is that $Q^* \le Q$.) Comparing the two tells us that our strategies are optimal in their dependence on $Q$ and on the constraint on information indicated by $n$. There is, however, a gap of $\sqrt{K}$. This gap was mentioned in (Gerchinovitz and Lattimore 2016) for the specific case of the multi-armed bandit problem, and was closed recently in (Bubeck, Cohen, and Li 2017). Barring the easy-to-see $\Omega\big(\sqrt{(Q\log K)/K}\big)$ lower bound for prediction with expert advice (which is also what Theorem 10 translates to for $n = T$), we are unaware of other fundamental $Q$-based lower bounds for prediction with expert advice. The upper bounds for prediction with expert advice, however, are of $O\big(\sqrt{Q\log K}\big)$ ((Hazan and Kale 2010), (Steinhardt and Liang 2014), etc.), and this again suggests the $\sqrt{K}$ gap. Closing this gap for prediction with expert advice, label efficient prediction, and label efficient bandits remains open, as does the question of finding $Q^*$-dependent lower bounds.

### Label Efficient Prediction (Full Information)

As mentioned previously, the main difference here from the standard label efficient prediction lower bound proof (Cesa-Bianchi, Lugosi, and Stoltz 2005) is that of centering the Bernoulli random variables around a parameter $\alpha$, which is responsible for ultimately bringing out the quadratic variation of the sequence. Our main statements for label efficient prediction are as follows.

Lemma 9. Let $\alpha \in (0,1)$, $K \ge 2$, and $T \ge n \ge \frac{c^2\log(K-1)}{1-\alpha}$. Then, for any randomized strategy for the label efficient prediction problem, there exists a loss sequence under which
$$ \mathbb{E}[R_T] \;\ge\; c\,T\sqrt{\frac{\alpha(1-\alpha)\log(K-1)}{n}} $$
for an absolute constant $c > 0$.

Theorem 10. Let $K \ge 2$, $T \ge n \ge \max\{32\log(K-1),\, 256\log T\}$, and $\max\big\{\frac{32\log T}{n}, \frac{8\log(K-1)}{n}\big\} \le \alpha \le \frac{1}{4}$. Then, for any randomized strategy for the label efficient prediction problem,
$$ \max_{\{\ell_t\}\in\mathcal{V}_\alpha}\; \mathbb{E}[R_T] \;\ge\; 0.36\,T\sqrt{\frac{\alpha\log(K-1)}{n}}. $$
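The passage from the $\alpha$-dependent statements to the $Q$-dependent forms quoted earlier is a one-line substitution, shown here for Theorem 10:
$$
T\sqrt{\frac{\alpha\log(K-1)}{n}}\;\Bigg|_{\alpha = Q/(TK)}
\;=\; T\sqrt{\frac{Q\log(K-1)}{TKn}}
\;=\; \sqrt{\frac{QT\log(K-1)}{Kn}},
$$
and, with the same $\alpha = Q/(TK)$, Theorem 12's $T\sqrt{\alpha K/n}$ below becomes $\sqrt{QT/n}$.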
### Label Efficient Bandits

The main difference here from standard bandit proofs is that now the total number of revealed labels (each label is now a single loss vector entry) cannot exceed $n$. Hence, the $\sum_{i\in[K]} N_i(t-1)$ term which appears in the analysis is upper bounded by $n$ (where $N_i(t-1)$ denotes the number of pulls of arm $i$ up to time $t-1$).

Lemma 11. Let $\alpha \in (0,1)$, $K \ge 2$, and $T \ge n \ge \frac{K}{4(1-\alpha)}$. Then, for any randomized strategy for the label efficient bandit problem, there exists a loss sequence under which
$$ \mathbb{E}[R_T] \;=\; \Omega\left(T\sqrt{\frac{\alpha(1-\alpha)K}{n}}\right). $$

Theorem 12. Let $K \ge 2$, $T \ge n \ge \max\{32K,\, 384\log T\}$, and $\frac{2c\log T}{n} \le \alpha \le \frac{1}{4}$, with $c = (4/9)^2(\sqrt{3}-1)^2/12$. Then, for any randomized strategy for the label efficient bandit problem,
$$ \max_{\{\ell_t\}\in\mathcal{V}_\alpha}\; \mathbb{E}[R_T] \;\ge\; 0.04\,T\sqrt{\frac{\alpha K}{n}}. $$

## 6 Conclusion

We consider problems lying at the intersection of two relevant questions in online learning: how does one adapt to slowly varying data, and what best can be done under a constraint on information? As far as we know, the proposed algorithms are the first to jointly address both of these questions. There remain plenty of open problems in the area. Seeing to what extent universal $Q^*$-dependent algorithms can be obtained in information-starved settings is a direction of future interest, as is closing the $\sqrt{K}$ gap highlighted in Section 5. Moreover, extending the notion of adaptivity in partial monitoring games to locally observable games, and, even more interestingly, to locally observable sub-games within hard games, also remains open. Higher order lower bounds for partial monitoring games have also not been studied, and one wonders to what extent adaptivity can help in partial monitoring.

## 7 Acknowledgments

Siddharth Mitra would like to thank GD Raghava and Prathamesh Mayekar for many helpful discussions.

## References

Allenberg, C.; Auer, P.; Györfi, L.; and Ottucsák, G. 2006. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory, 229-243. Springer.

Azar, M. G.; Osband, I.; and Munos, R. 2017. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 263-272. JMLR.org.

Bartók, G.; Foster, D. P.; Pál, D.; Rakhlin, A.; and Szepesvári, C. 2014. Partial monitoring: Classification, regret bounds, and algorithms. Mathematics of Operations Research 39(4):967-997.

Bartók, G.; Pál, D.; and Szepesvári, C. 2011. Minimax regret of finite partial-monitoring games in stochastic environments. In Proceedings of the 24th Annual Conference on Learning Theory, 133-154.

Bubeck, S.; Li, Y.; Luo, H.; and Wei, C.-Y. 2019. Improved path-length regret bounds for bandits. CoRR abs/1901.10604.

Bubeck, S.; Cesa-Bianchi, N.; et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1-122.

Bubeck, S.; Cohen, M. B.; and Li, Y. 2017. Sparsity, variance and curvature in multi-armed bandits. In ALT.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press.

Cesa-Bianchi, N.; Lugosi, G.; and Stoltz, G. 2005. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory 51(6):2152-2162.

Chiang, C.; Yang, T.; Lee, C.; Mahdavi, M.; Lu, C.; Jin, R.; and Zhu, S. 2012. Online optimization with gradual variations. In COLT, volume 23 of JMLR Proceedings, 6.1-6.20. JMLR.org.

Gerchinovitz, S., and Lattimore, T. 2016. Refined lower bounds for adversarial bandits. In NIPS, 1190-1198.

Hazan, E., and Kale, S. 2010. Extracting certainty from uncertainty: regret bounded by variation in costs. Machine Learning 80(2):165-188.

Hazan, E., and Kale, S. 2011. Better algorithms for benign bandits. J. Mach. Learn. Res. 12:1287-1311.

Hazan, E. 2016. Introduction to online convex optimization. Found. Trends Optim. 2(3-4):157-325.
Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11(Apr):1563-1600.

Lattimore, T., and Szepesvári, C. 2019. Cleaning up the neighborhood: A full classification for adversarial partial monitoring. In ALT, volume 98 of Proceedings of Machine Learning Research, 529-556. PMLR.

Piccolboni, A., and Schindelhauer, C. 2008. Discrete prediction games with arbitrary feedback and loss (extended abstract). 208-223.

Rakhlin, A., and Sridharan, K. 2012. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728.

Steinhardt, J., and Liang, P. S. 2014. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML.

Wei, C.-Y., and Luo, H. 2018. More adaptive algorithms for adversarial bandits.