# Comparator-Adaptive Convex Bandits

Dirk van der Hoeven, Mathematical Institute, Leiden University, dirk@dirkvanderhoeven.com
Ashok Cutkosky, Boston University, ashok@cutkosky.com
Haipeng Luo, Computer Science Department, University of Southern California, haipengl@usc.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart. Specifically, we develop convex bandit algorithms with regret bounds that are small whenever the norm of the comparator is small. We first use techniques from the full-information setting to develop comparator-adaptive algorithms for linear bandits. Then, we extend the ideas to convex bandits with Lipschitz or smooth loss functions, using a new variant of the standard single-point gradient estimator and carefully designed surrogate losses.

## 1 Introduction

In many situations, information is readily available. For example, if a gambler were to bet on the outcome of a football game, he can observe the outcome of the game regardless of what bet he made. In other situations, information is scarce. For example, the gambler could be deciding what to eat for dinner: should I eat a salad, a pizza, a sandwich, or not at all? These actions will result in different and unknown outcomes, but the gambler will only see the outcome of the action he actually takes, with one notable exception: not eating results in a predetermined outcome of being very hungry.

These two situations are instantiations of two different settings in online convex optimization: the full information setting and the bandit setting. More formally, both settings are sequential decision making problems where in each round $t = 1, \ldots, T$, a learner has to make a prediction $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$ and an adversary provides a convex loss function $\ell_t : \mathcal{W} \to \mathbb{R}$. Afterwards, in the full information setting [27] the learner has access to the loss function $\ell_t$, while in the bandit setting [19, 13] the learner only receives the loss evaluated at the prediction, that is, $\ell_t(w_t)$. In both settings the goal is to minimize the regret with respect to some benchmark point $u$ in hindsight, referred to as the comparator. More specifically, the regret against $u$ is the difference between the total loss incurred by the predictions of the learner and that of the comparator: $R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u)$. When the learner's strategy is randomized, we measure the performance by the expected regret $\mathbb{E}[R_T(u)]$.

Standard algorithms in both the full information setting and the bandit setting assume that the learner's decision space $\mathcal{W}$ is a convex compact set and achieve sublinear regret against the optimal comparator in this set: $u^\star = \arg\min_{u \in \mathcal{W}} \sum_{t=1}^T \ell_t(u)$. To tune these standard algorithms optimally, however, one requires knowledge of the norm of the comparator $\|u^\star\|$, which is unknown. A common workaround is to simply tune the algorithms in terms of the worst-case norm $\max_{u \in \mathcal{W}} \|u\|$, assumed to be 1 without loss of generality. This results in worst-case bounds that do not take advantage of the case when $\|u^\star\|$ is small. For example, when the loss functions are $L$-Lipschitz, classic Online Gradient Descent [27] guarantees $R_T(u) = O(L\sqrt{T})$ in the full information setting, while the algorithm of [13] guarantees $\mathbb{E}[R_T(u)] = O(dLT^{3/4})$ in the bandit setting, both of which are independent of $\|u\|$.
**Table 1:** Summary of main results. Regret is measured with respect to the total loss of an arbitrary point $u \in \mathbb{R}^d$ in the unconstrained setting, or an arbitrary point $u \in \mathcal{W}$ in the constrained setting with a decision space $\mathcal{W}$ contained in the unit ball. $T$ is the total number of rounds, $1/c$ is the radius of the largest ball contained in $\mathcal{W}$, and $\nu$ is the self-concordance parameter. Both $c$ and $\nu$ are bounded by $O(d)$.

| Loss functions ($L$-Lipschitz) | Regret for unconstrained setting | Regret for constrained setting |
| --- | --- | --- |
| Linear (Section 3) | $\tilde{O}(\lVert u\rVert\, d L \sqrt{T})$ | $\tilde{O}(\lVert u\rVert\, c d L \sqrt{T})$ |
| Convex (Sections 4.1 and 4.2) | $\tilde{O}(\lVert u\rVert\, L d\, T^{3/4})$ | $\tilde{O}(\lVert u\rVert\, c L d\, T^{3/4})$ |
| Convex and $\beta$-smooth (Section 4.1) | $\tilde{O}(\max\{\lVert u\rVert^2, \lVert u\rVert\}\,\beta (d L T)^{2/3})$ | – |

Recently, there has been a series of works in the full information setting that addresses this problem by developing comparator-adaptive algorithms, whose regret against $u$ depends on $\|u\|$ for all $u \in \mathcal{W}$ simultaneously (see for example McMahan and Orabona [22], Orabona and Pál [23], Foster et al. [14], Cutkosky and Boahen [9], Kotlowski [20], Cutkosky and Orabona [10], Foster et al. [16], Jun and Orabona [18], Van der Hoeven [25]). These bounds are often no worse than the standard worst-case bounds, but can be much smaller when there exists a comparator with small norm and reasonably small total loss. Moreover, most of these results also hold for the so-called unconstrained setting where $\mathcal{W} = \mathbb{R}^d$, that is, both the learner's predictions and the comparator can be any point in $\mathbb{R}^d$. For example, Cutkosky and Orabona [10] achieve $R_T(u) = \tilde{O}(\|u\| L\sqrt{T})$ for all $u$, in both the constrained and unconstrained settings, under full information feedback (throughout the paper, the notation $\tilde{O}$ hides logarithmic dependence on the parameters $T$, $\|u\|$, and $L$).

While developing comparator-adaptive algorithms is relatively well understood at this point in the full information setting, to the best of our knowledge this has not been studied at all for the more challenging bandit setting. In this work, we make the first attempt in this direction and develop comparator-adaptive algorithms for several situations, including learning with linear losses, general convex losses, and convex and smooth losses, for both the constrained and unconstrained settings. Our results are summarized in Table 1. Ignoring other parameters for simplicity, for the linear case we achieve $\tilde{O}(\|u\|\sqrt{T})$ regret (Section 3); for the general convex case we achieve $\tilde{O}(\|u\| T^{3/4})$ regret in both the constrained and unconstrained settings (Sections 4.1 and 4.2); and for the convex and smooth case we achieve $\tilde{O}(\max\{\|u\|^2, \|u\|\}\,\beta(dLT)^{2/3})$ regret in the unconstrained setting (Section 4.1).

In order to achieve our results for the convex case, we require an assumption on the loss, namely that the value of $\ell_t(0)$ is known for all $t$ (for the linear case, this clearly holds since $\ell_t(0) = 0$). While restrictive at first sight, we believe that there are abundant applications where this assumption holds. As one instance, in control or reinforcement learning problems, $0$ may represent some nominal action which has a known outcome: not eating results in hunger, or buying zero inventory will result in zero revenue. Another application is a classification problem where the features are not revealed to the learner. For example, end-users of a prediction service may not feel comfortable revealing their information to the service. Instead, they may be willing to do some local computation and report the loss of the service's model. Most classification models (e.g. logistic regression) have the property that the loss of the $0$ parameter is a known constant regardless of the data, and so this situation would also fit into our framework. Common loss functions that satisfy this assumption are linear loss, logistic loss, and hinge loss.
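As a concrete check of this assumption (our illustration, not part of the paper), the following snippet verifies that the logistic loss of the all-zeros parameter equals $\log 2$ for every example, so $\ell_t(0)$ is known to the learner without ever observing the data:

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss for one example (x, y) with label y in {-1, +1}."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

rng = np.random.default_rng(0)
d = 5
for _ in range(3):
    x = rng.normal(size=d)              # features the learner never sees
    y = rng.choice([-1.0, 1.0])
    # At w = 0 the loss is log(1 + e^0) = log 2 regardless of (x, y),
    # so ell_t(0) is a known constant, as required in the convex case.
    assert np.isclose(logistic_loss(np.zeros(d), x, y), np.log(2.0))
print("logistic loss at the zero parameter is always", np.log(2.0))
```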
**Techniques** Our algorithms are based on sophisticated extensions of the black-box reduction introduced by Cutkosky and Orabona [10], which separately learns the magnitude and the direction of the prediction. To make the reduction work in the bandit setting, however, new ideas are required, including designing an appropriate surrogate loss function and a new variant of the standard one-point gradient estimator with time-varying parameters. Note that Cutkosky and Orabona [10] also propose a method to convert any unconstrained algorithm to a constrained one in the full information setting, but this does not work in the bandit setting for technical reasons. Instead, we take a different approach by constraining the magnitude of the prediction directly.

**Related work** As mentioned, there has been a line of recent work on comparator-adaptive algorithms for the full information setting. Most of them do not transfer to the bandit setting, except for the approach of Cutkosky and Orabona [10], from which we draw heavy inspiration. To the best of our knowledge, comparator-adaptive bandit algorithms have not been studied before. Achieving adaptivity in a broader sense is generally hard for problems with bandit feedback; see negative results such as [12, 21] as well as recent progress such as [7, 15]. In terms of worst-case (non-adaptive) regret, the seminal work of [1] is the first to achieve $O(\sqrt{T})$ regret for bandits with linear losses, and [19, 13] are the first to achieve sublinear regret for the general convex case. Over the past decade, the latter result has been improved in many different ways [2, 24, 3, 17], and regret of order $O(\sqrt{T})$ under no extra assumptions was recently achieved [4, 5, 6]. However, these $O(\sqrt{T})$ bounds are achieved by very complicated algorithms that incur a huge dependence on the dimension $d$. Our algorithms are more aligned with the simpler ones with milder dimension dependence [1, 13, 24] and achieve the same dependence on $T$ in the different cases. How to achieve comparator-adaptive regret of order $O(\sqrt{T})$ for the general convex case is an important future direction.

## 2 Preliminaries

In this section, we describe our notation, state the definitions we use, and introduce the bandit convex optimization setting formally. We also describe the black-box reduction of [10] that we will use throughout the paper.

**Notation and definitions** The inner product between vectors $g \in \mathbb{R}^d$ and $w \in \mathbb{R}^d$ is denoted by $\langle w, g\rangle$. $\mathbb{R}_+$ denotes the set of positive numbers. The Fenchel conjugate $F^\star$ of a convex function $F$ is defined as $F^\star(w) = \sup_g \langle w, g\rangle - F(g)$. $\|\cdot\|$ denotes a norm and $\|g\|_\star = \sup_{w : \|w\| \le 1} \langle w, g\rangle$ denotes the dual norm of $g$. The Bregman divergence associated with a convex function $F$ between points $x$ and $y$ is denoted by $B_F(x \,\|\, y) = F(x) - F(y) - \langle \nabla F(y), x - y\rangle$, where $\nabla F(x)$ denotes the gradient of $F$ evaluated at $x$. The unit ball equipped with norm $\|\cdot\|$ is denoted by $B_{\|\cdot\|} = \{w : \|w\| \le 1\}$, and the unit sphere with norm $\|\cdot\|$ is denoted by $S_{\|\cdot\|} = \{w : \|w\| = 1\}$. The unit ball and sphere with norm $\|\cdot\|_2$ are denoted by $B$ and $S$ respectively. $x \sim U(Z)$ denotes that $x$ follows the uniform distribution over $Z$. We say a function $f$ is $\beta$-smooth over the set $\mathcal{W}$ if $f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|x - y\|_2^2$ for all $x, y \in \mathcal{W}$. We say a function $f$ is $L$-Lipschitz over the set $\mathcal{W}$ if $|f(y) - f(x)| \le L\|y - x\|_2$ for all $x, y \in \mathcal{W}$.
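As a standard worked instance of these definitions (our addition, for concreteness), take $F(x) = \frac{1}{2}\|x\|_2^2$. Then

$$F^\star(w) = \sup_g \langle w, g\rangle - \tfrac{1}{2}\|g\|_2^2 = \tfrac{1}{2}\|w\|_2^2, \qquad B_F(x \,\|\, y) = \tfrac{1}{2}\|x\|_2^2 - \tfrac{1}{2}\|y\|_2^2 - \langle y, x - y\rangle = \tfrac{1}{2}\|x - y\|_2^2,$$

and $F$ is $1$-smooth on all of $\mathbb{R}^d$ and $1$-Lipschitz over the unit ball $B$.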
Throughout the paper we will assume that $\beta, L \ge 1$. Also, by mild abuse of notation, we use $\nabla f(x)$ to indicate an arbitrary subgradient of a convex function $f$ at $x$.

All of our algorithms are reductions that use prior algorithms in disparate ways to obtain our new results. In order for these reductions to work, we need some assumptions on the base algorithms. We encapsulate these assumptions in interfaces that describe the inputs, outputs, and guarantees of an algorithm rather than its actual operation (see Interfaces 3 and 4 for examples). We can use specific algorithms from the literature to implement these interfaces, but our results depend only on the properties described in the interfaces.

### 2.1 Bandit Convex Optimization

The bandit convex optimization protocol proceeds in rounds $t = 1, \ldots, T$. In each round $t$ the learner plays $w_t \in \mathcal{W} \subseteq \mathbb{R}^d$. Simultaneously, the environment picks an $L$-Lipschitz convex loss function $\ell_t : \mathcal{W} \to \mathbb{R}$, after which the learner observes $\ell_t(w_t)$. Importantly, the learner only observes the loss function evaluated at $w_t$, not the function itself. This forces the learner to play random points and estimate the feedback he wants to use to update $w_t$. Therefore, in the bandit feedback setting, the goal is to bound the expected regret $\mathbb{E}[R_T(u)]$, where the expectation is with respect to both the learner and the environment. We make a distinction between linear bandits, where $\ell_t(w) = \langle w, g_t\rangle$, and convex bandits, where $\ell_t$ can be any $L$-Lipschitz convex function. Throughout the paper, if $\mathcal{W} \neq \mathbb{R}^d$ we assume that $\mathcal{W}$ is compact, has a non-empty interior, and contains $0$. Without loss of generality we assume that $\frac{1}{c}B \subseteq \mathcal{W} \subseteq B$ for some $c \ge 1$. Some of our bounds depend on $c$, which, without loss of generality, is always bounded by $d$, due to a reshaping trick discussed in [13].

### 2.2 Black-Box Reductions with Full Information

Our algorithms are based on a black-box reduction from [10] for the full information setting (see Algorithm 1). The reduction works as follows. In each round $t$ the algorithm plays $w_t = v_t z_t$, where $z_t \in Z$, for some domain $Z$, is the prediction of a constrained algorithm $\mathcal{A}_Z$, and $v_t$ is the prediction of a one-dimensional algorithm $\mathcal{A}_V$. The goal of $\mathcal{A}_Z$ is to learn the direction of the comparator while the goal of $\mathcal{A}_V$ is to learn the norm of the comparator. Let $g_t$ be the gradient of $\ell_t$ at $w_t$, which is known to the algorithm in the full information setting. We feed $g_t$ as feedback to $\mathcal{A}_Z$ and $\langle z_t, g_t\rangle$ as feedback to $\mathcal{A}_V$. Although the original presentation considers only $Z = B$, we will need to extend the analysis to more general domains.

**Algorithm 1** Black-Box Reduction with Full Information

1. Input: direction algorithm $\mathcal{A}_Z$ and scaling algorithm $\mathcal{A}_V$
2. for $t = 1, \ldots, T$ do
3. Get $z_t \in Z$ from $\mathcal{A}_Z$
4. Get $v_t \in \mathbb{R}$ from algorithm $\mathcal{A}_V$
5. Play $w_t = v_t z_t$, receive $g_t$
6. Send $g_t$ to algorithm $\mathcal{A}_Z$ as the $t$-th loss vector
7. Send $\langle z_t, g_t\rangle$ to algorithm $\mathcal{A}_V$ as the $t$-th loss value
8. end for

As outlined by Cutkosky and Orabona [10], the regret of Algorithm 1 decomposes into two parts. The first part of the regret is for learning the norm of $u$ and is controlled by algorithm $\mathcal{A}_V$. The second part of the regret is for learning the direction of $u$ and is controlled by $\mathcal{A}_Z$. The proof is provided in Appendix A for completeness.

**Lemma 1.** Let $R^V_T(\|u\|) = \sum_{t=1}^T (v_t - \|u\|)\langle z_t, g_t\rangle$ be the regret for learning $\|u\|$ by algorithm $\mathcal{A}_V$ and let $R^Z_T\big(\frac{u}{\|u\|}\big) = \sum_{t=1}^T \big\langle z_t - \frac{u}{\|u\|}, g_t\big\rangle$ be the regret for learning $\frac{u}{\|u\|}$ by $\mathcal{A}_Z$. Then Algorithm 1 satisfies

$$R_T(u) = R^V_T(\|u\|) + \|u\|\, R^Z_T\Big(\tfrac{u}{\|u\|}\Big). \tag{1}$$
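For intuition (our note; the formal proof is in Appendix A), when the losses are linear, $\ell_t(w) = \langle w, g_t\rangle$, the decomposition in Lemma 1 is a one-line computation:

$$\sum_{t=1}^T \ell_t(w_t) - \ell_t(u) = \sum_{t=1}^T \langle v_t z_t - u, g_t\rangle = \sum_{t=1}^T (v_t - \|u\|)\langle z_t, g_t\rangle + \|u\|\sum_{t=1}^T \Big\langle z_t - \tfrac{u}{\|u\|}, g_t\Big\rangle = R^V_T(\|u\|) + \|u\|\, R^Z_T\Big(\tfrac{u}{\|u\|}\Big).$$

For general convex $\ell_t$ with $g_t = \nabla\ell_t(w_t)$, the same chain applied to the linearized losses upper-bounds the regret, since $\ell_t(w_t) - \ell_t(u) \le \langle w_t - u, g_t\rangle$ by convexity.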
Cutkosky and Orabona [10] provide an algorithm that ensures $R^V_T(\|u\|) = \tilde{O}(1 + \|u\| L\sqrt{T})$, given that $\|g_t\|_\star \le L$. This algorithm satisfies the requirements described later in Interface 3, and will be used throughout this paper.

## 3 Comparator-Adaptive Linear Bandits

We now apply the reduction of Section 2.2 to develop comparator-adaptive algorithms for linear bandits. We will see that in the unconstrained case the reduction works almost without modification, but in the constrained case we will need to be more careful to enforce the constraints.

### 3.1 Unconstrained Linear Bandits

We begin by discussing the unconstrained linear bandit setting, which turns out to be the easiest setting we consider. Following Algorithm 1, we will still play $w_t = v_t z_t$. However, instead of taking a fixed $z_t$ from a full-information algorithm, we take a random $z_t$ from a bandit algorithm.

**Algorithm 2** Black-Box Reduction for Linear Bandits

1. Input: constrained linear bandit algorithm $\mathcal{A}_Z$ and unconstrained 1-d algorithm $\mathcal{A}_V$
2. for $t = 1, \ldots, T$ do
3. Get $z_t \in Z$ from $\mathcal{A}_Z$
4. Get $v_t \in \mathbb{R}$ from $\mathcal{A}_V$
5. Play $w_t = v_t z_t$
6. Receive loss $\langle w_t, g_t\rangle$
7. Compute $\mathcal{L}_t = \frac{1}{v_t}\langle w_t, g_t\rangle = \langle z_t, g_t\rangle$
8. Send $\mathcal{L}_t$ to algorithm $\mathcal{A}_Z$ as the $t$-th loss value
9. Send $\mathcal{L}_t$ to algorithm $\mathcal{A}_V$ as the $t$-th loss value
10. end for

**Interface 3** Scale Learning Interface (see example implementation in [10])

1. Input: a line segment $l \subseteq \mathbb{R}$
2. for $t = 1, \ldots, T$ do
3. Play $v_t \in l$
4. Receive loss value $g_t$ such that $|g_t| \le L_V$
5. end for
6. Ensure: for all $\hat{v} \in l$, $\sum_{t=1}^T (v_t - \hat{v})\, g_t = \tilde{O}\big(1 + |\hat{v}| L_V\sqrt{T}\big)$

**Interface 4** Direction Learning Interface for Linear Bandits (see example implementation in [26])

1. Input: domain $Z$
2. for $t = 1, \ldots, T$ do
3. Play $z_t \in Z$
4. Receive loss value $\langle z_t, g_t\rangle$ such that $|\langle z_t, g_t\rangle| \le L$
5. end for
6. Ensure: for all $u \in Z$, $\mathbb{E}\big[\sum_{t=1}^T \langle z_t - u, g_t\rangle\big] = \tilde{O}\big(dL\sqrt{T}\big)$

Importantly, we can recover $\langle z_t, g_t\rangle$ exactly since $\frac{1}{v_t}\langle w_t, g_t\rangle = \langle z_t, g_t\rangle$. This means that we have enough information to send appropriate feedback to both $\mathcal{A}_V$ and $\mathcal{A}_Z$ and apply the argument of Lemma 1. Interestingly, we use a full-information one-dimensional algorithm for $\mathcal{A}_V$, and only need $\mathcal{A}_Z$ to take bandit input. This is because $\mathcal{A}_V$ gets full information in the form of $\langle z_t, g_t\rangle$. The algorithm $\mathcal{A}_Z$ for learning the direction, on the other hand, must now be a bandit algorithm because intuitively we do not immediately get the full direction information $g_t$ from the value of the loss alone. We will need this algorithm to fulfil the requirements described by Interface 4. One such algorithm is given by continuous Exponential Weights on a constrained set (see Van der Hoeven et al. [26, Section 6] for details).

Our unconstrained linear bandit algorithm is then constructed from Algorithm 2 by choosing an algorithm that implements Interface 4 as $\mathcal{A}_Z$ and Interface 3 with $l = \mathbb{R}_+$ as $\mathcal{A}_V$. Plugging in the guarantees of the individual algorithms and taking the expectation of (1), the total expected regret is $\tilde{O}(1 + \|u\| dL\sqrt{T})$. Compared to the full information setting we pay an extra factor of $d$ in the regret bound, which is unavoidable given the bandit feedback [11]. The formal result is below.

**Theorem 1.** Suppose $\mathcal{A}_Z$ implements Interface 4 with domain $Z = B$ and $\mathcal{A}_V$ implements Interface 3 with $l = \mathbb{R}_+$. Then Algorithm 2 satisfies, for all $u \in \mathbb{R}^d$, $\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + \|u\| dL\sqrt{T}\big)$.
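As an illustration (ours, not the paper's), a minimal Python sketch of Algorithm 2 is below; `direction_alg` and `scale_alg` are hypothetical objects standing in for implementations of Interfaces 4 and 3 with `predict`/`update` methods. The key point is that dividing the observed scalar loss by $v_t$ recovers $\langle z_t, g_t\rangle$, which is full information for the scale learner and valid bandit feedback for the direction learner.

```python
import numpy as np

def linear_bandit_reduction(direction_alg, scale_alg, bandit_loss, T):
    """Sketch of Algorithm 2: comparator-adaptive linear bandits.

    bandit_loss(t, w) returns only the scalar <w, g_t>; the vector g_t is never observed.
    direction_alg plays z_t in Z (Interface 4); scale_alg plays v_t > 0 (Interface 3).
    """
    for t in range(T):
        z = direction_alg.predict()        # z_t from the direction learner
        v = scale_alg.predict()            # v_t from the 1-d scale learner (v_t > 0)
        w = v * z                          # play w_t = v_t z_t
        observed = bandit_loss(t, w)       # observe <w_t, g_t> only
        recovered = observed / v           # = <z_t, g_t>, recovered exactly
        direction_alg.update(recovered)    # bandit feedback for A_Z
        scale_alg.update(recovered)        # full-information feedback for A_V
```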
### 3.2 Constrained Linear Bandits

The algorithm in the previous section only works for $\mathcal{W} = \mathbb{R}^d$. In this section, we consider a compact set $\mathcal{W} \subset \mathbb{R}^d$. In the full-information setting, Cutkosky and Orabona [10] provide a projection technique for producing constrained algorithms from unconstrained ones. Unfortunately, this technique does not translate directly to the bandit setting, and we must be more careful in designing our constrained linear bandit algorithm.

The key idea is to constrain the internal scaling algorithm $\mathcal{A}_V$, rather than attempting to constrain the final predictions $w_t$. Enforcing constraints on the scaling algorithm's outputs $v_t$ will naturally translate into a constraint on the final predictions $w_t$. To produce a constrained linear bandit algorithm, we again use Algorithm 2, but now we instantiate $\mathcal{A}_V$ implementing Interface 3 with $l = [0, 1]$ rather than $l = \mathbb{R}_+$, and instantiate $\mathcal{A}_Z$ implementing Interface 4 with $Z = \mathcal{W}$ rather than $Z = B$. As in the unconstrained setting, this allows us to feed full information feedback to $\mathcal{A}_V$. At the same time, restricting Interface 3 to $l = [0, 1]$ also guarantees that $w_t \in \mathcal{W}$. The regret bound of this algorithm is given in Theorem 2. The proof follows from combining Lemma 1 with the guarantees of Interfaces 3 and 4 and can be found in Appendix B.

**Theorem 2.** Suppose $\mathcal{A}_Z$ implements Interface 4 with domain $Z = \mathcal{W}$ and $\mathcal{A}_V$ implements Interface 3 with $l = [0, 1]$. Then Algorithm 2 satisfies, for all $u \in \mathcal{W}$, $\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + \|u\| c d L\sqrt{T}\big)$.

If $\mathcal{W}$ is a unit ball, then $c = 1$. For other shapes of $\mathcal{W}$, recall that $c$ is at most $d$, which leads to a regret bound of $\tilde{O}\big(1 + \|u\| d^2 L\sqrt{T}\big)$.

## 4 Comparator-Adaptive Convex Bandits

In the general convex bandit problem, it is not clear how to use the single evaluation point feedback $\ell_t(w_t)$ to derive any useful information about $\ell_t$. Fortunately, Flaxman et al. [13] solved this problem by using randomness to extract the gradients of a smoothed version of $\ell_t$. To adapt to the norm of the comparator, we employ the following tweaked version of the smoothing used by Flaxman et al. [13]:

$$\ell^v_t(w) = \mathbb{E}_{b \sim U(B)}\big[\ell_t(w + v\delta b)\big], \tag{2}$$

where $v, \delta > 0$. In contrast to prior work using this framework, our smoothing now depends on the scaling parameter $v$. Lemma 2 gives the gradient of $\ell^v_t(w)$ and is a straightforward adaptation of Lemma 2.1 by Flaxman et al. [13].

**Lemma 2.** For $\delta \in (0, 1]$ and $v > 0$:

$$\nabla \ell^v_t(w) = \frac{d}{v\delta}\,\mathbb{E}_{s \sim U(S)}\big[\ell_t(w + v\delta s)\, s\big]. \tag{3}$$

With this lemma, we can estimate the gradient of the smoothed version of $\ell_t$ by evaluating $\ell_t$ at a random point, essentially converting the convex problem to a linear problem, except that one also needs to control the bias introduced by smoothing. Note that this estimate scales with $\frac{1}{v}$, which can be problematic if $v$ is small. To deal with this issue, we require one extra assumption: the value of $\ell_t(0)$ is known to the learner. As discussed in Section 1, this assumption holds for several applications, including some control or reinforcement learning problems, where $0$ represents a nominal action with a known outcome. Furthermore, certain loss functions satisfy the assumption by default, such as linear loss, logistic loss, and hinge loss. Without loss of generality we assume that $\ell_t(0) = 0$, as we can always shift $\ell_t$ without changing the regret.
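As a numerical sanity check of Lemma 2 (our illustration, not from the paper), the following snippet averages the one-point estimate $\frac{d}{v\delta}\ell_t(w + v\delta s)s$ over many uniform draws of $s$ from the sphere; for a quadratic loss the smoothed gradient coincides with the true gradient, so the average should match it closely:

```python
import numpy as np

def one_point_gradient_estimate(loss, w, v, delta, n_samples=100_000, seed=0):
    """Monte Carlo average of (d / (v * delta)) * loss(w + v*delta*s) * s
    with s uniform on the unit sphere S, approximating eq. (3)."""
    rng = np.random.default_rng(seed)
    d = w.shape[0]
    s = rng.normal(size=(n_samples, d))
    s /= np.linalg.norm(s, axis=1, keepdims=True)        # uniform points on S
    values = np.array([loss(w + v * delta * si) for si in s])
    return (d / (v * delta)) * (values[:, None] * s).mean(axis=0)

# A smooth convex loss with a known gradient, used only for this check.
A = np.diag([1.0, 2.0, 3.0])
loss = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

w = np.array([0.3, -0.2, 0.5])
print(one_point_gradient_estimate(loss, w, v=1.0, delta=0.05))
print(grad(w))  # the two printed vectors should be close
```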
Our general algorithm template is provided in Algorithm 5. It incorporates the ideas of Algorithm 2, but adds new smoothing and regularization elements in order to deal with the more general situation at hand. More specifically, it again makes use of the subroutine $\mathcal{A}_V$, which learns the scaling. The direction is learned by Online Gradient Descent [27], as was also done by Flaxman et al. [13].

**Algorithm 5** Black-Box Comparator-Adaptive Convex Bandit Algorithm

1. Input: scaling algorithm $\mathcal{A}_V$, $\delta \in (0, 1]$, $\alpha \in [0, 1]$, domain $Z \subseteq B$, and learning rate $\eta$
2. Set $z_1 = 0$
3. for $t = 1, \ldots, T$ do
4. Get $v_t$ from $\mathcal{A}_V$
5. Sample $s_t \sim U(S)$
6. Set $w_t = v_t(z_t + \delta s_t)$
7. Play $w_t$
8. Receive $\ell_t(w_t)$
9. Set $\hat{g}_t = \frac{d}{v_t\delta}\ell_t(w_t)\, s_t$
10. if $\ell_t$ is $\beta$-smooth then
11. Set $\tilde{\ell}_t(v) = v\langle z_t, \hat{g}_t\rangle + \beta\delta^2 v^2$
12. else
13. Set $\tilde{\ell}_t(v) = v\langle z_t, \hat{g}_t\rangle + 2\delta L|v|$
14. end if
15. Send $\tilde{\ell}_t(v_t)$ to algorithm $\mathcal{A}_V$ as the $t$-th loss value
16. Update $z_{t+1} = \arg\min_{z \in (1-\alpha)Z}\ \eta\langle z, \hat{g}_t\rangle + \|z_t - z\|_2^2$
17. end for

Given $z_t$ and $v_t$, our algorithm plays the point $w_t = v_t(z_t + \delta s_t)$ for some parameter $\delta$ and $s_t$ drawn uniformly at random from $S$. By equation (3), we have

$$\mathbb{E}_{s_t}\Big[\frac{d}{v_t\delta}\,\ell_t(w_t)\, s_t\Big] = \nabla\ell^{v_t}_t(v_t z_t). \tag{4}$$

This means that we can use $\hat{g}_t = \frac{d}{v_t\delta}\ell_t(w_t)s_t$ as an approximate gradient estimate, and we send this $\hat{g}_t$ to Online Gradient Descent as the feedback. In other words, Online Gradient Descent itself is essentially dealing with a full-information problem with gradient feedback and is required to ensure a regret bound $\mathbb{E}\big[\sum_{t=1}^T\langle z_t - u, \hat{g}_t\rangle\big] = \tilde{O}(dL\sqrt{T}/\delta)$ for all $u$ in some domain $Z$. For technical reasons, we will also need to enforce $z_t \in (1-\alpha)Z$ for some $\alpha \in [0, 1]$. This restriction will be necessary in the constrained setting to ensure $v_t(z_t + \delta s_t) \in \mathcal{W}$.

Next, to specify the feedback to the scale-learning black-box $\mathcal{A}_V$, we define a surrogate loss function $\tilde{\ell}_t(v)$ which contains a linear term $v\langle z_t, \hat{g}_t\rangle$ and also a regularization term (see Algorithm 5 for the exact definition). The feedback to $\mathcal{A}_V$ is then $\tilde{\ell}_t(v_t)$. Therefore, $\mathcal{A}_V$ is essentially learning these surrogate losses, also with full gradient information. The regularization term is added to deal with the bias introduced by smoothing. This term does not appear in prior work on convex bandits, and it is one of the key components needed to ensure that the final regret is in terms of the unknown $\|u\|$. Algorithm 5 should be seen as the analogue of the black-box reduction of Algorithm 1, but for bandit feedback instead of full information. The expected regret guarantee of Algorithm 5 is shown below, and the proof can be found in Appendix C.

**Lemma 3.** Suppose $\mathcal{A}_V$ implements Interface 3 with $l \subseteq \mathbb{R}_+$, that $w_t \in \mathcal{W}$ for all $t$, and let $L_V = \max_t |\nabla\tilde{\ell}_t(v_t)|$. Suppose that $\ell_t(0) = 0$. Then Algorithm 5 with $\delta \in (0, 1]$, $\alpha \in [0, 1]$, and $\eta = \sqrt{\frac{\delta^2}{4(dL)^2 T}}$ satisfies, for all $u$ with $\|u\| \in l$ and all $r > 0$ with $\frac{ru}{\|u\|} \in Z$,

$$\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + T\delta L\|u\| + \|u\| L_V\sqrt{T} + 2\alpha\|u\| T L\big).$$

In addition, if $\ell_t$ is also $\beta$-smooth for all $t$, then we have

$$\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + T\delta^2\beta\|u\|^2 + \|u\| L_V\sqrt{T} + 2\alpha\|u\| T L\big).$$

This bound has two main quantities not obviously under our direct control: the assumption that the $w_t$ lie in $\mathcal{W}$, and the value of $L_V$, which is a bound on $|\nabla\tilde{\ell}_t(v_t)|$. In the remainder of this section we will specify the various instantiations of Algorithm 5 that guarantee that $w_t \in \mathcal{W}$ and that $L_V$ is suitably bounded: two for the unconstrained setting and one for the constrained setting. The $2\alpha\|u\|TL$ term is due to $z_t \in (1-\alpha)Z$ rather than $z_t \in Z$, which induces a small amount of bias. The $r$ in Lemma 3 is there to ensure that we satisfy the requirements for Online Gradient Descent to have a suitable regret bound. For unconstrained convex bandits $r = 1$. For constrained convex bandits we will see that $\frac{1}{r} = c$ (recall that we assume that $\frac{1}{c}B \subseteq \mathcal{W} \subseteq B$).
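To make the moving parts of Algorithm 5 concrete, here is a minimal Python sketch of the unconstrained Lipschitz case ($\alpha = 0$, $Z = B$, non-smooth surrogate); this is our illustration under stated assumptions, not the paper's implementation. The `scale_alg` object is a hypothetical stand-in for any implementation of Interface 3, and a plain projected gradient step onto a ball plays the role of the arg-min update in line 16.

```python
import numpy as np

def comparator_adaptive_convex_bandit(scale_alg, bandit_loss, T, d, L, delta,
                                      alpha=0.0, radius=1.0, seed=0):
    """Sketch of Algorithm 5 (unconstrained, Lipschitz case).

    bandit_loss(t, w) returns ell_t(w) only, and we assume ell_t(0) = 0.
    scale_alg plays v_t > 0 and consumes the surrogate loss value tilde-ell_t(v_t).
    """
    eta = np.sqrt(delta**2 / (4 * (d * L) ** 2 * T))    # learning rate from Lemma 3
    z = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in range(T):
        v = scale_alg.predict()                          # v_t from Interface 3
        s = rng.normal(size=d)
        s /= np.linalg.norm(s)                           # s_t uniform on the sphere S
        w = v * (z + delta * s)                          # play w_t = v_t (z_t + delta s_t)
        value = bandit_loss(t, w)                        # observe ell_t(w_t)
        g_hat = (d / (v * delta)) * value * s            # gradient estimate, eq. (4)
        surrogate = v * float(z @ g_hat) + 2 * delta * L * abs(v)
        scale_alg.update(surrogate)                      # line 15: send tilde-ell_t(v_t)
        z = z - eta * g_hat                              # gradient step on the direction
        cap = (1.0 - alpha) * radius                     # project onto (1 - alpha) * ball
        norm = np.linalg.norm(z)
        if norm > cap:
            z *= cap / norm
    return z
```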
### 4.1 Unconstrained Convex Bandits

In this section we instantiate Algorithm 5 and derive regret bounds for either general convex losses or convex and smooth losses. We start with general convex losses. Since $\mathcal{W} = \mathbb{R}^d$, we do not need to ensure that $z_t + \delta s_t \in \mathcal{W}$ and we can safely set $\alpha = 0$. This choice guarantees that $z_t + \delta s_t \in 2B$ and that $|\nabla\tilde{\ell}_t(v_t)| \le \frac{2dL}{\delta} + 2\delta L$. Then, Lemma 3 directly leads to Theorem 3 (the proof is deferred to Appendix C.1).

**Theorem 3.** Suppose $\mathcal{A}_V$ implements Interface 3 with $l = \mathbb{R}_+$ and that $\ell_t(0) = 0$. Then Algorithm 5 with $\delta = \min\{1, (d/T)^{1/4}\}$, $Z = B$, $\alpha = 0$, and $\eta = \sqrt{\frac{\delta^2}{4(dL)^2 T}}$ satisfies, for all $u \in \mathbb{R}^d$, $\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + \|u\| L d\, T^{3/4}\big)$.

For unconstrained smooth bandits, we face an extra challenge. To bound the regret of Algorithm 5, $|\nabla\tilde{\ell}_t(v_t)| = |\langle z_t, \hat{g}_t\rangle + 2\beta\delta^2 v_t|$ must be bounded. In contrast to the linear or Lipschitz cases, in the smooth case $\tilde{\ell}_t$ is not Lipschitz over $\mathbb{R}_+$. We address this by artificially constraining $v_t$. Specifically, we ensure that $v_t \le \frac{1}{\delta^3}$, which implies $|\delta^2 v_t| = O(\frac{1}{\delta})$. This makes the Lipschitz constant of $\tilde{\ell}_t$ dominated by the gradient estimate $\hat{g}_t$ rather than by the regularization. To see how this affects the regret bound, consider two cases: $\|u\|_2 \le \frac{1}{\delta^3}$ and $\|u\|_2 > \frac{1}{\delta^3}$. If $\|u\|_2 \le \frac{1}{\delta^3}$ then we have not hurt anything by constraining $v_t$, since $\|u\|_2$ satisfies the same constraint. If instead $\|u\|_2 > \frac{1}{\delta^3}$ then the consequences for the regret bound are not immediately clear. However, following a similar technique in [8], we use the fact that the regret against $0$ is $O(1)$ and the Lipschitz assumption to show that we have added a penalty of only $O(\|u\|_2 LT)$:

$$\mathbb{E}[R_T(u)] = \mathbb{E}[R_T(0)] + \sum_{t=1}^T \mathbb{E}[\ell_t(0) - \ell_t(u)] = O(1 + \|u\|_2 LT).$$

Since $\|u\|_2 > \frac{1}{\delta^3}$, the penalty for constraining $v_t$ is $O(\|u\|_2 LT) = O(\|u\|_2^2 L\delta^3 T)$, which is $O(\|u\|_2^2 L\sqrt{T})$ if we set $\delta = O(T^{-1/6})$. The formal result can be found below and its proof can be found in Appendix C.1.

**Theorem 4.** Suppose $\mathcal{A}_V$ implements Interface 3 with $l = (0, \frac{1}{\delta^3}]$, that $\ell_t$ is $\beta$-smooth for all $t$, and that $\ell_t(0) = 0$. Then Algorithm 5 with $\delta = \min\{1, (dL)^{1/3} T^{-1/6}\}$, $Z = B$, $\alpha = 0$, and $\eta = \sqrt{\frac{\delta^2}{4(dL)^2 T}}$ satisfies, for all $u \in \mathbb{R}^d$, $\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + \max\{\|u\|^2, \|u\|\}\,\beta(dLT)^{2/3} + \max\{\|u\|_2^2, \|u\|\}\, dL^2\beta\sqrt{T}\big)$.

### 4.2 Constrained Convex Bandits

For the constrained setting we set $Z = \mathcal{W}$ and $\alpha = \delta$. This ensures that $v_t(z_t + \delta s_t) \in \mathcal{W}$, and we can apply Lemma 3 to obtain the regret bound in Theorem 5 below. Compared to the unconstrained setting, the regret bound now scales with $c$, which is due to the reshaping trick discussed in [13].

**Theorem 5.** Suppose $\mathcal{A}_V$ implements Interface 3 with $l = (0, 1]$ and that $\ell_t(0) = 0$. Then Algorithm 5 with $\delta = \min\{1, (d/T)^{1/4}\}$, $Z = \mathcal{W}$, $\alpha = \delta$, and $\eta = \sqrt{\frac{\delta^2}{4(dL)^2 T}}$ satisfies, for all $u \in \mathcal{W}$, $\mathbb{E}[R_T(u)] = \tilde{O}\big(1 + (\|u\|_2 + c\|u\|)\, d\, T^{3/4} + c\|u\|\, dL\sqrt{T}\big)$.

## 5 Conclusion

In this paper, we develop the first algorithms that have comparator-adaptive regret bounds for various bandit convex optimization problems. The regret bounds of our algorithms scale with $\|u\|$, which may yield smaller regret in favourable settings. For future research, there are a number of interesting open questions. First, our current results do not encompass improved rates for smooth losses on constrained domains. At first blush, one might feel this is relatively straightforward via methods based on self-concordance [24], but it turns out that while such techniques provide good direction-learning algorithms, they may cause the gradients provided to the scaling algorithm to blow up. Secondly, there is an important class of loss functions for which we did not obtain norm-adaptive regret bounds: smooth and strongly convex losses. It is known that in this case an expected regret bound of $O(d\sqrt{T})$ can be efficiently achieved [17]. However, to achieve this regret bound the algorithm of Hazan and Levy [17] uses a clever exploration scheme, which unfortunately leads to sub-optimal regret bounds for our algorithms.
## Broader Impact

Our contribution is primarily theoretical, and we do not foresee any negative ethical or societal impact.

## Acknowledgments and Disclosure of Funding

Dirk van der Hoeven was supported by the Netherlands Organization for Scientific Research (NWO grant TOP2EW.15.211). Haipeng Luo was supported by NSF Awards IIS-1755781 and IIS-1943607. Ashok Cutkosky was employed at Google Research while working on this project.

## References

[1] Abernethy, J., Hazan, E., and Rakhlin, A. (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory (COLT), pages 263–274.

[2] Agarwal, A., Dekel, O., and Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), pages 28–40. Citeseer.

[3] Agarwal, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Rakhlin, A. (2011). Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035–1043.

[4] Bubeck, S., Dekel, O., Koren, T., and Peres, Y. (2015). Bandit convex optimization: $\sqrt{T}$ regret in one dimension. In Conference on Learning Theory (COLT), pages 266–278.

[5] Bubeck, S. and Eldan, R. (2016). Multi-scale exploration of convex functions and bandit convex optimization. In Conference on Learning Theory (COLT), pages 583–589.

[6] Bubeck, S., Lee, Y. T., and Eldan, R. (2017). Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72–85. ACM.

[7] Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. In Conference on Learning Theory (COLT), pages 696–726.

[8] Cutkosky, A. (2019). Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory (COLT), pages 874–894.

[9] Cutkosky, A. and Boahen, K. (2017). Online learning without prior information. In Conference on Learning Theory (COLT), pages 643–677.

[10] Cutkosky, A. and Orabona, F. (2018). Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory (COLT), pages 1493–1529.

[11] Dani, V., Kakade, S. M., and Hayes, T. P. (2008). The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352.

[12] Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411.

[13] Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics.

[14] Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. (2017). Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020–6030.

[15] Foster, D. J., Krishnamurthy, A., and Luo, H. (2019). Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pages 14714–14725.

[16] Foster, D. J., Rakhlin, A., and Sridharan, K. (2018). Online learning: Sufficient statistics and the Burkholder method. In Conference on Learning Theory (COLT), pages 3028–3064.

[17] Hazan, E. and Levy, K. (2014). Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792.
[18] Jun, K.-S. and Orabona, F. (2019). Parameter-free online convex optimization with sub-exponential noise. In Conference on Learning Theory (COLT), pages 1802–1823.

[19] Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704.

[20] Kotlowski, W. (2017). Scale-invariant unconstrained online learning. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pages 412–433.

[21] Lattimore, T. (2015). The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, pages 208–216.

[22] McMahan, H. B. and Orabona, F. (2014). Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory (COLT), pages 1020–1039.

[23] Orabona, F. and Pál, D. (2016). Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pages 577–585.

[24] Saha, A. and Tewari, A. (2011). Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 636–642.

[25] Van der Hoeven, D. (2019). User-specified local differential privacy in unconstrained adaptive online learning. In Advances in Neural Information Processing Systems, pages 14080–14089.

[26] Van der Hoeven, D., Van Erven, T., and Kotlowski, W. (2018). The many faces of exponential weights in online learning. In Conference on Learning Theory (COLT), pages 2067–2092.

[27] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936.