Optimistic Bandit Convex Optimization

Mehryar Mohri
Courant Institute and Google
251 Mercer Street, New York, NY 10012
mohri@cims.nyu.edu

Scott Yang
Courant Institute
251 Mercer Street, New York, NY 10012
yangs@cims.nyu.edu

Abstract

We introduce the general and powerful scheme of predicting information re-use in optimization algorithms. This allows us to devise a computationally efficient algorithm for bandit convex optimization with new state-of-the-art guarantees for both Lipschitz loss functions and loss functions with Lipschitz gradients. This is the first algorithm admitting both a polynomial time complexity and a regret polynomial in the dimension of the action space that improves upon the original regret bound for Lipschitz loss functions, achieving a regret of $\tilde O\big(T^{11/16} d^{3/8}\big)$. Our algorithm further improves upon the best existing polynomial-in-dimension bound (both computationally and in terms of regret) for loss functions with Lipschitz gradients, achieving a regret of $\tilde O\big(T^{8/13} d^{5/3}\big)$.

1 Introduction

Bandit convex optimization (BCO) is a key framework for modeling learning problems with sequential data under partial feedback. In the BCO scenario, at each round, the learner selects a point (or action) in a bounded convex set and observes the value at that point of a convex loss function determined by an adversary. The feedback received is limited to that information: no gradient or any other higher-order information about the function is provided to the learner. The learner's objective is to minimize his regret, that is, the difference between his cumulative loss over a finite number of rounds and the loss of the best fixed action in hindsight. The limited feedback makes the BCO setup relevant to a number of applications, including online advertising. On the other hand, it also makes the problem notoriously difficult and requires the learner to find a careful trade-off between exploration and exploitation.

While it has been the subject of extensive study in recent years, the fundamental BCO problem remains one of the most challenging scenarios in machine learning, where several questions concerning optimality guarantees remain open. The original work of Flaxman et al. [2005] showed that a regret of $\tilde O(T^{5/6})$ is achievable for bounded loss functions and of $\tilde O(T^{3/4})$ for Lipschitz loss functions (the latter bound is also given in [Kleinberg, 2004]), both of which are still the best known results given by explicit algorithms. Agarwal et al. [2010] introduced an algorithm that maintains a regret of $\tilde O(T^{2/3})$ for loss functions that are both Lipschitz and strongly convex, which is also still state-of-the-art. For functions that are Lipschitz and also admit Lipschitz gradients, Saha and Tewari [2011] designed an algorithm with a regret of $\tilde O(T^{2/3})$, a result that was recently improved to $\tilde O(T^{5/8})$ by Dekel et al. [2015].

Here, we further improve upon these bounds, both in the Lipschitz and in the Lipschitz-gradient settings. By incorporating the novel and powerful idea of predicting information re-use, we introduce an algorithm with a regret bound of $\tilde O\big(T^{11/16} d^{3/8}\big)$ for Lipschitz loss functions. Similarly, our algorithm also achieves the best regret guarantee among computationally tractable algorithms for loss functions with Lipschitz gradients: $\tilde O\big(T^{8/13} d^{5/3}\big)$. Both bounds admit a relatively mild dependency on the dimension of the action space.
We note that the recent remarkable work of [Bubeck et al., 2015, Bubeck and Eldan, 2015] has proven the existence of algorithms that can attain a regret of $\tilde O(T^{1/2})$, which matches the known lower bound $\Omega(T^{1/2})$ given by Dani et al. Thus, the dependency of our bounds with respect to $T$ is not optimal. Furthermore, two recent unpublished manuscripts, [Hazan and Li, 2016] and [Bubeck et al., 2016], present algorithms achieving regret $\tilde O(T^{1/2})$. These results, once verified, would be ground-breaking contributions to the literature. However, unlike our algorithms, the regret bounds for both of these algorithms admit a large dependency on the dimension $d$ of the action space: exponential for [Hazan and Li, 2016], $d^{9.5}$ for [Bubeck et al., 2016]. One hope is that the novel ideas introduced by Hazan and Li [2016] (the application of the ellipsoid method with a restart button and lower convex envelopes) or those of Bubeck et al. [2016] (which also make use of the restart idea but introduce a very original kernel method) could be combined with those presented in this paper to derive algorithms with the most favorable guarantees with respect to both $T$ and $d$.

We begin by formally introducing our notation and setup. We then highlight some of the essential ideas in previous work before introducing our new key insight. Next, we give a detailed description of our algorithm, for which we prove theoretical guarantees in several settings.

2 Preliminaries

2.1 BCO scenario

The scenario of bandit convex optimization, which dates back to [Flaxman et al., 2005], is a sequential prediction problem on a convex compact domain $K \subseteq \mathbb{R}^d$. At each round $t \in [1, T]$, the learner selects a (possibly) randomized action $x_t \in K$ and incurs the loss $f_t(x_t)$ based on a convex function $f_t \colon K \to \mathbb{R}$ chosen by the adversary. We assume that the adversary is oblivious, so that the loss functions are independent of the player's actions. The objective of the learner is to minimize his regret with respect to the optimal static action in hindsight, that is, if we denote by $\mathcal{A}$ the learner's randomized algorithm, the following quantity:
$$\mathrm{Reg}_T(\mathcal{A}) = \mathbb{E}\left[\sum_{t=1}^T f_t(x_t)\right] - \min_{x \in K} \sum_{t=1}^T f_t(x).$$

We will denote by $D$ the diameter of the action space $K$ in the Euclidean norm: $D = \sup_{x, y \in K} \|x - y\|_2$. Throughout this paper, we will often use different induced norms. We will denote by $\|\cdot\|_A$ the norm induced by a symmetric positive definite (SPD) matrix $A \succ 0$, defined for all $x \in \mathbb{R}^d$ by $\|x\|_A = \sqrt{x^\top A x}$. Moreover, we will denote by $\|\cdot\|_{A,*}$ its dual norm, given by $\|\cdot\|_{A^{-1}}$. To simplify the notation, we will write $\|\cdot\|_x$ instead of $\|\cdot\|_{\nabla^2 R(x)}$ when the convex and twice differentiable function $R \colon \mathrm{int}(K) \to \mathbb{R}$ is clear from the context. Here, $\mathrm{int}(K)$ is the set interior of $K$.

We will consider different levels of regularity for the functions $f_t$ selected by the adversary. We will always assume that they are uniformly bounded by some constant $C > 0$, that is, $|f_t(x)| \le C$ for all $t \in [1, T]$ and $x \in K$, and, by shifting the loss functions upwards by at most $C$, we will also assume, without loss of generality, that they are non-negative: $f_t \ge 0$ for all $t \in [1, T]$. Moreover, we will always assume that $f_t$ is Lipschitz on $K$ (henceforth denoted $C^{0,1}(K)$):
$$\forall t \in [1, T],\ \forall x, y \in K, \quad |f_t(x) - f_t(y)| \le L\,\|x - y\|_2.$$
In some instances, we will further assume that the functions admit $H$-Lipschitz gradients on the interior of the domain (henceforth denoted $C^{1,1}(\mathrm{int}(K))$):
$$\exists H > 0 \colon\ \forall t \in [1, T],\ \forall x, y \in \mathrm{int}(K), \quad \|\nabla f_t(x) - \nabla f_t(y)\|_2 \le H\,\|x - y\|_2.$$
Since each $f_t$ is convex, it admits a subgradient at any point in $K$.
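To make the feedback model concrete, the following minimal Python sketch simulates the BCO interaction loop; the learner interface and loss list are illustrative scaffolding of our own, not constructs from the paper.

```python
import numpy as np

def run_bco(learner, losses, T):
    """Minimal simulation of the BCO protocol (illustrative; `learner` is any
    object with act()/update() methods, `losses` a list of convex functions).

    Each round: the learner picks a point in K, then observes ONLY the scalar
    value of the adversary's convex loss at that point -- no gradients.
    """
    total = 0.0
    for t in range(T):
        y_t = learner.act()        # action y_t must lie in the feasible set K
        value = losses[t](y_t)     # bandit feedback: a single scalar
        learner.update(value)      # the learner never sees f_t itself
        total += value
    return total                   # regret compares this to the best fixed x
```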
We denote by $g_t$ an element of the subgradient of $f_t$ at the point $x_t \in K$ selected by the learner at round $t$. When the losses are $C^{1,1}$, the only element of the subgradient is the gradient, and $g_t = \nabla f_t(x_t)$. We will use the shorthand $v_{1:t} = \sum_{s=1}^t v_s$ to denote the sum of $t$ vectors $v_1, \ldots, v_t$. In particular, $g_{1:t}$ will denote the sum of the subgradients $g_s$ for $s \in [1, t]$. Lastly, we will denote by $B_1(0) = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ the $d$-dimensional Euclidean ball of radius one and by $\partial B_1(0)$ the unit sphere.

2.2 Follow-the-regularized-leader template

A standard algorithm in online learning, both in the bandit and in the full-information setting, is the follow-the-regularized-leader (FTRL) algorithm. At each round, the algorithm selects the action that minimizes the cumulative linearized loss augmented with a regularization term $R \colon K \to \mathbb{R}$. Thus, the action $x_{t+1}$ is defined as follows:
$$x_{t+1} = \operatorname*{argmin}_{x \in K}\ \eta\, g_{1:t}^\top x + R(x),$$
where $\eta > 0$ is a learning rate that determines the trade-off between greedy optimization and regularization. If we had access to the subgradients at each round, then FTRL with $R(x) = \|x\|_2^2$ and $\eta = \frac{1}{\sqrt{T}}$ would yield a regret of $O(\sqrt{dT})$, which is known to be optimal. But since we only have access to the loss function values $f_t(x_t)$, and since the loss functions change at each round, a more refined strategy is needed.

2.2.1 One-point gradient estimates and surrogate losses

One key insight into the bandit convex optimization problem, due to Flaxman et al. [2005], is that the subgradient of a smoothed version of the loss function can be estimated by sampling and rescaling around the point the algorithm originally intended to play.

Lemma 1 ([Flaxman et al., 2005, Saha and Tewari, 2011]). Let $f \colon K \to \mathbb{R}$ be an arbitrary function (not necessarily differentiable) and let $U(\partial B_1(0))$ denote the uniform distribution over the unit sphere. Then, for any $\delta > 0$ and any SPD matrix $A \succ 0$, the function $\hat f$ defined for all $x \in K$ by $\hat f(x) = \mathbb{E}_{u \sim U(\partial B_1(0))}[f(x + \delta A u)]$ is differentiable over $\mathrm{int}(K)$ and, for any $x \in \mathrm{int}(K)$, $\hat g = \frac{d}{\delta} f(x + \delta A u) A^{-1} u$ is an unbiased estimate of $\nabla \hat f(x)$:
$$\mathbb{E}_{u \sim U(\partial B_1(0))}\left[\frac{d}{\delta}\, f(x + \delta A u)\, A^{-1} u\right] = \nabla \hat f(x).$$

The result shows that if at each round $t$ we sample $u_t \sim U(\partial B_1(0))$, define an SPD matrix $A_t$, and play the point $y_t = x_t + \delta A_t u_t$ (assuming that $y_t \in K$), then $\hat g_t = \frac{d}{\delta} f_t(x_t + \delta A_t u_t) A_t^{-1} u_t$ is an unbiased estimate of the gradient of $\hat f_t$ at the point $x_t$ originally intended: $\mathbb{E}[\hat g_t] = \nabla \hat f_t(x_t)$. Thus, we can use FTRL with these smoothed gradient estimates, $x_{t+1} = \operatorname{argmin}_{x \in K} \eta\, \hat g_{1:t}^\top x + R(x)$, at the cost of the approximation error from $f_t$ to $\hat f_t$. Furthermore, the norm of these gradient estimates can be bounded.

Lemma 2. Let $\delta > 0$, $u_t \in \partial B_1(0)$, and $A_t \succ 0$. Then the norm of $\hat g_t = \frac{d}{\delta} f_t(x_t + \delta A_t u_t) A_t^{-1} u_t$ can be bounded as follows:
$$\|\hat g_t\|_2 \le \frac{Cd}{\delta}\, \|A_t^{-1} u_t\|_2.$$

Proof. Since $f_t$ is bounded by $C$, we can write
$$\|\hat g_t\|_2^2 = \frac{d^2}{\delta^2}\, f_t(y_t)^2\, u_t^\top A_t^{-1} A_t^{-1} u_t \le \frac{d^2 C^2}{\delta^2}\, \|A_t^{-1} u_t\|_2^2.$$

This gives us a bound on the Lipschitz constant of $\hat f_t$ in terms of $d$, $\delta$, and $C$.
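As an illustration of Lemma 1, the following NumPy sketch implements the one-point estimator and checks unbiasedness numerically; the test function and all constants are our own choices, not values from the paper.

```python
import numpy as np

def one_point_gradient(f, x, A, delta, rng):
    """One-point estimate of grad f_hat(x), where
    f_hat(x) = E_u[f(x + delta * A @ u)] with u uniform on the unit sphere,
    as in Lemma 1. A minimal sketch; f, x, A, delta are caller-supplied."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                          # u ~ U(unit sphere)
    return (d / delta) * f(x + delta * A @ u) * np.linalg.solve(A, u)

# Sanity check of unbiasedness (our own example): for f(z) = ||z||^2 and
# A = I, f_hat(x) = ||x||^2 + delta^2, so grad f_hat(x) = 2x.
rng = np.random.default_rng(0)
d, delta, A = 5, 0.5, np.eye(5)
f = lambda z: float(z @ z)
x = np.ones(d)
avg = np.mean([one_point_gradient(f, x, A, delta, rng) for _ in range(100000)], axis=0)
print(np.round(avg, 1))                             # close to 2 * x = [2, ..., 2]
```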
2.2.2 Self-concordant barrier as regularization

When sampling to derive a gradient estimate, we need to ensure that the point sampled lies within the feasible set $K$. A second key idea in the BCO problem, due to Abernethy et al. [2008], is to design ellipsoids that are always contained in the feasible set. This is done using tools from the theory of interior-point methods in convex optimization.

Definition 1 (Definition 2.3.1 [Nesterov and Nemirovskii, 1994]). Let $K \subseteq \mathbb{R}^d$ be closed convex, and let $\nu \ge 0$. A $C^3$ function $R \colon \mathrm{int}(K) \to \mathbb{R}$ is a $\nu$-self-concordant barrier for $K$ if for any sequence $(z_s)_{s=1}^\infty$ with $z_s \to \partial K$ we have $R(z_s) \to \infty$, and if for all $x \in \mathrm{int}(K)$ and $y \in \mathbb{R}^d$, the following inequalities hold:
$$|\nabla^3 R(x)[y, y, y]| \le 2\,\|y\|_x^3, \qquad |\nabla R(x)^\top y| \le \nu^{1/2}\,\|y\|_x.$$

Since self-concordant barriers are preserved under translation, we will always assume for convenience that $\min_{x \in K} R(x) = 0$. Nesterov and Nemirovskii [1994] show that any $d$-dimensional closed convex set admits an $O(d)$-self-concordant barrier. This allows us to always choose a self-concordant barrier as regularization. We will use several other key properties of self-concordant barriers in this work, all of which are stated precisely in Appendix 7.1.

3 Previous work

The original paper by Flaxman et al. [2005] sampled indiscriminately around spheres and projected back onto the feasible set at each round. This yielded a regret of $\tilde O(T^{3/4})$ for $C^{0,1}$ loss functions. The follow-up work of Saha and Tewari [2011] showed that for $C^{1,1}$ loss functions, one can run FTRL with a self-concordant barrier as regularization and sample around the Dikin ellipsoid to attain an improved regret bound of $\tilde O(T^{2/3})$. More recently, Dekel et al. [2015] showed that by averaging the smoothed gradient estimates and still using the self-concordant barrier as regularization, one can achieve a regret of $\tilde O(T^{5/8})$. Specifically, denote by $\bar g_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i}$ the average of the past $k+1$ incurred gradients, where $\hat g_{t-i} = 0$ for $t - i \le 0$. Then we can play FTRL on these averaged smoothed gradient estimates,
$$x_{t+1} = \operatorname*{argmin}_{x \in K}\ \eta\, \bar g_{1:t}^\top x + R(x),$$
to attain the better guarantee.

Abernethy and Rakhlin [2009] derive a generic estimate for FTRL algorithms with self-concordant barriers as regularization:

Lemma 3 (Theorems 2.2 and 2.3 of [Abernethy and Rakhlin, 2009]). Let $K$ be a closed convex set in $\mathbb{R}^d$ and let $R$ be a $\nu$-self-concordant barrier for $K$. Let $\{g_t\}_{t=1}^T \subset \mathbb{R}^d$ and $\eta > 0$ be such that $\eta \|g_t\|_{x_t,*} \le 1/4$ for all $t \in [1, T]$. Then the FTRL update $x_{t+1} = \operatorname{argmin}_{x \in K} \eta\, g_{1:t}^\top x + R(x)$ admits the following guarantees:
$$\|x_t - x_{t+1}\|_{x_t} \le 2\eta\, \|g_t\|_{x_t,*}, \qquad \forall x \in K, \quad \sum_{t=1}^T g_t^\top (x_t - x) \le \frac{R(x)}{\eta} + 2\eta \sum_{t=1}^T \|g_t\|_{x_t,*}^2.$$

By Lemma 2, if we use FTRL with smoothed gradients, then the upper bound in this lemma can be further bounded by
$$\frac{R(x)}{\eta} + \frac{2\eta\, T C^2 d^2}{\delta^2}.$$
Furthermore, the regret is then bounded by the sum of this upper bound and the cost of approximating $f_t$ with $\hat f_t$. On the other hand, Dekel et al. [2015] showed that if we use FTRL with averaged smoothed gradients instead, then the upper bound in this lemma can be bounded as
$$\frac{R(x)}{\eta} + 2\eta\, T \left(\frac{C^2 d^2}{\delta^2 (k+1)} + 2 D^2 L^2\right).$$
The extra factor $k+1$ in the denominator, at the cost of now approximating $f_t$ with $\bar f_t$, is what contributes to their improved regret result. In general, finding surrogate losses that can both be approximated accurately and admit only a mild variance is a delicate task, and it is not clear how the constructions presented above can be improved.

4 Algorithm

4.1 Predicting the predictable

Rather than designing a newer and better surrogate loss, our strategy will be to exploit the structure of the current state-of-the-art method. Specifically, we draw upon the technique of predictable sequences from [Rakhlin and Sridharan, 2013]. The idea here is to allow the learner to preemptively guess the gradient at the next step and optimize for it in the FTRL update. Specifically, if $\tilde g_{t+1}$ is an estimate of the time-$(t+1)$ gradient $g_{t+1}$ based on information up to time $t$, then the learner should play:
$$x_{t+1} = \operatorname*{argmin}_{x \in K}\ \eta\, (g_{1:t} + \tilde g_{t+1})^\top x + R(x).$$
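Before stating the guarantee, here is what a single optimistic update looks like in a deliberately simplified setting: we use the Euclidean regularizer over a centered ball, our own choice for illustration, whereas the paper's algorithm uses a self-concordant barrier over a general convex body.

```python
import numpy as np

def optimistic_ftrl_step(g_cum, g_next_guess, eta, radius=1.0):
    """One optimistic FTRL update, sketched with R(x) = ||x||^2 / 2 over a
    centered Euclidean ball (our simplification, not the paper's setup).

    Solves argmin_{||x|| <= radius} eta * (g_cum + g_next_guess) @ x + R(x).
    Up to an additive constant the objective equals ||x + eta * g||^2 / 2,
    so the minimizer is -eta * g projected onto the ball.
    """
    x = -eta * (g_cum + g_next_guess)
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x
```

The quality of the guess `g_next_guess` is what controls the regret: as Lemma 4 below makes precise, only the prediction error $\|g_t - \tilde g_t\|$ enters the bound.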
This optimistic FTRL algorithm admits the following guarantee:

Lemma 4 (Lemma 1 [Rakhlin and Sridharan, 2013]). Let $K$ be a closed convex set in $\mathbb{R}^d$, and let $R$ be a $\nu$-self-concordant barrier for $K$. Let $\{g_t\}_{t=1}^T, \{\tilde g_t\}_{t=1}^T \subset \mathbb{R}^d$ and $\eta > 0$ be such that $\eta \|g_t - \tilde g_t\|_{x_t,*} \le 1/4$ for all $t \in [1, T]$. Then the FTRL update $x_{t+1} = \operatorname{argmin}_{x \in K} \eta\, (g_{1:t} + \tilde g_{t+1})^\top x + R(x)$ admits the following guarantee:
$$\forall x \in K, \quad \sum_{t=1}^T g_t^\top (x_t - x) \le \frac{R(x)}{\eta} + 2\eta \sum_{t=1}^T \|g_t - \tilde g_t\|_{x_t,*}^2.$$

In general, it is not clear what would be a good prediction candidate. Indeed, this is why Rakhlin and Sridharan [2013] called this algorithm optimistic FTRL. However, notice that if we elect to play the averaged smoothed losses as in [Dekel et al., 2015], then the update at each time uses $\bar g_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i}$. This implies that the time-$(t+1)$ gradient is $\bar g_{t+1} = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t+1-i}$, which includes the smoothed gradients from time $t+1$ down to time $t-(k-1)$. The key insight here is that, at time $t$, all but the $(t+1)$-th gradient are known! This means that if we predict
$$\tilde g_{t+1} = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t+1-i} - \frac{1}{k+1}\, \hat g_{t+1} = \frac{1}{k+1} \sum_{i=1}^{k} \hat g_{t+1-i},$$
then the first term in the bound of Lemma 4 will be in terms of
$$\bar g_t - \tilde g_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i} - \frac{1}{k+1} \sum_{i=1}^{k} \hat g_{t-i} = \frac{1}{k+1}\, \hat g_t.$$
In other words, all but the time-$t$ smoothed gradient will cancel out. Essentially, we are predicting the predictable portion of the averaged gradient and guaranteeing that the optimism will pay off. Moreover, where we gained a factor of $\frac{1}{k+1}$ in the averaged-loss case, we should expect to gain a factor of $\frac{1}{(k+1)^2}$ by using this optimistic prediction.

Note that this technique of optimistically predicting the variance reduction is widely applicable. As alluded to with the reference to [Schmidt et al., 2013], many variance-reduction-type techniques, particularly in stochastic optimization, use historical information in their estimates (e.g., SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014]). In these cases, it is possible to predict the information re-use and improve the convergence rates of each algorithm.

4.2 Description and pseudocode

Here, we give a detailed description of our algorithm, OPTIMISTICBCO. At each round $t$, the algorithm uses a sample $u_t$ from the uniform distribution over the unit sphere to define an unbiased estimate of the gradient of $\hat f_t$, a smoothed version of the loss function $f_t$, as described in Section 2.2.1: $\hat g_t = \frac{d}{\delta} f_t(y_t) (\nabla^2 R(x_t))^{1/2} u_t$. Next, the trailing average of these unbiased estimates over a fixed window of length $k+1$ is computed: $\bar g_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i}$. The remaining steps executed at each round coincide with the follow-the-regularized-leader update with a self-concordant barrier used as a regularizer, augmented with an optimistic prediction of the next round's trailing average. As described in Section 4.1, all but one of the terms in the trailing average are known, and we predict their occurrence:
$$\tilde g_{t+1} = \frac{1}{k+1} \sum_{i=1}^{k} \hat g_{t+1-i}, \qquad x_{t+1} = \operatorname*{argmin}_{x \in K}\ \eta\, (\bar g_{1:t} + \tilde g_{t+1})^\top x + R(x).$$
Note that Theorem 3 implies that the actual point we play, $y_t$, is always a feasible point in $K$. Figure 1 presents the pseudocode of the algorithm.

OPTIMISTICBCO($R, \delta, \eta, k, x_1$)
 1  for $t \leftarrow 1$ to $T$ do
 2      $u_t \leftarrow$ SAMPLE($U(\partial B_1(0))$)
 3      $y_t \leftarrow x_t + \delta\, (\nabla^2 R(x_t))^{-1/2} u_t$
 4      PLAY($y_t$)
 5      $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
 6      $\hat g_t \leftarrow \frac{d}{\delta} f_t(y_t) (\nabla^2 R(x_t))^{1/2} u_t$
 7      $\bar g_t \leftarrow \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i}$
 8      $\tilde g_{t+1} \leftarrow \frac{1}{k+1} \sum_{i=1}^{k} \hat g_{t+1-i}$
 9      $x_{t+1} \leftarrow \operatorname{argmin}_{x \in K}\ \eta\, (\bar g_{1:t} + \tilde g_{t+1})^\top x + R(x)$
10  return $\sum_{t=1}^T f_t(y_t)$

Figure 1: Pseudocode of OPTIMISTICBCO, with $R \colon \mathrm{int}(K) \to \mathbb{R}$, $\delta \in (0, 1]$, $\eta > 0$, $k \in \mathbb{Z}_{\ge 0}$, and $x_1 \in K$.
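For concreteness, the following self-contained sketch mirrors the pseudocode of Figure 1 on one concrete instance: $K$ is the Euclidean unit ball with the self-concordant barrier $R(x) = -\log(1 - \|x\|^2)$, and the argmin on line 9 is approximated by damped Newton steps. These instantiation and solver choices are our own, not prescriptions from the paper.

```python
import numpy as np

def barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), a self-concordant
    barrier for the unit ball (our concrete choice of K and R)."""
    s = 1.0 - x @ x                        # requires ||x|| < 1
    g = 2.0 * x / s
    return g, (2.0 / s) * np.eye(len(x)) + np.outer(g, g)

def ftrl_argmin(g_lin, eta, x0, iters=25):
    """Approximate argmin_x eta * g_lin @ x + R(x) by damped Newton steps,
    which keep the iterates strictly inside the ball."""
    x = x0.copy()
    for _ in range(iters):
        gR, H = barrier_grad_hess(x)
        grad = eta * g_lin + gR
        step = np.linalg.solve(H, grad)
        lam = np.sqrt(grad @ step)         # Newton decrement
        x -= step / (1.0 + lam)
    return x

def optimistic_bco(loss_fns, d, T, eta, delta, k, seed=0):
    """Sketch of OPTIMISTICBCO; returns the cumulative loss, as in Figure 1."""
    rng = np.random.default_rng(seed)
    x, g_hats, g_bar_cum = np.zeros(d), [], np.zeros(d)
    total = 0.0
    for t in range(T):
        _, H = barrier_grad_hess(x)
        w, V = np.linalg.eigh(H)           # eigendecomposition for H^(+-1/2)
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)             # u_t ~ U(unit sphere)
        y = x + delta * (V @ ((w ** -0.5) * (V.T @ u)))   # line 3: Dikin sample
        val = loss_fns[t](y)               # bandit feedback: value only
        total += val
        g_hats.append((d / delta) * val * (V @ ((w ** 0.5) * (V.T @ u))))
        window = g_hats[-(k + 1):]         # missing early terms count as 0
        g_bar_cum += sum(window) / (k + 1)                # running sum of bar-g
        g_tilde = sum(g_hats[-k:]) / (k + 1) if k else np.zeros(d)
        x = ftrl_argmin(g_bar_cum + g_tilde, eta, x)      # line 9
    return total
```

As an illustrative run, `optimistic_bco([lambda z: float(np.sum((z - 0.1) ** 2))] * 200, d=3, T=200, eta=0.01, delta=0.2, k=3)` executes the sketch on a fixed quadratic loss over the ball.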
5 Regret guarantees

In this section, we state our main results, which are regret guarantees for OPTIMISTICBCO in the $C^{0,1}$ and $C^{1,1}$ cases. We also highlight the analysis and proofs for each regime.

5.1 Main results

The following is our main result for the $C^{0,1}$ case.

Theorem 1 ($C^{0,1}$ regret). Let $K \subseteq \mathbb{R}^d$ be a convex set with diameter $D$ and $(f_t)_{t=1}^T$ a sequence of loss functions with each $f_t \colon K \to \mathbb{R}_+$ $C$-bounded and $L$-Lipschitz. Let $R$ be a $\nu$-self-concordant barrier for $K$. Then, for $\eta k \le \frac{\delta}{12Cd}$, the regret of OPTIMISTICBCO can be bounded as follows:
$$\mathrm{Reg}_T(\text{OPTIMISTICBCO}) \le \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2\eta k C d D}{\delta (k+1)}\, LT + \frac{2\eta\, C^2 d^2 T}{\delta^2 (k+1)^2} + \frac{1}{\eta}\, \nu \log(1/\epsilon).$$
In particular, for $\eta = T^{-11/16} d^{-3/8}$, $\delta = T^{-5/16} d^{3/8}$, and $k = T^{1/8} d^{1/4}$, the following guarantee holds for the regret of the algorithm:
$$\mathrm{Reg}_T(\text{OPTIMISTICBCO}) = \tilde O\big(T^{11/16} d^{3/8}\big).$$

The above result is the first improvement, in terms of $T$, on the regret for Lipschitz losses since the original algorithm of Flaxman et al. [2005] that is realizable by a concrete algorithm and polynomial in both dimension and time (both computationally and in terms of regret).

Theorem 2 ($C^{1,1}$ regret). Let $K \subseteq \mathbb{R}^d$ be a convex set with diameter $D$ and $(f_t)_{t=1}^T$ a sequence of loss functions with each $f_t \colon K \to \mathbb{R}_+$ $C$-bounded, $L$-Lipschitz, and $H$-smooth. Let $R$ be a $\nu$-self-concordant barrier for $K$. Then, for $\eta k \le \frac{\delta}{12d}$, the regret of OPTIMISTICBCO can be bounded as follows:
$$\mathrm{Reg}_T(\text{OPTIMISTICBCO}) \le \epsilon L T + H\delta^2 D^2 T + (TL + DHT)\, \frac{2\eta k C d D}{\delta (k+1)} + \frac{1}{\eta}\, \nu \log(1/\epsilon) + Ck + \frac{2\eta\, C^2 d^2 T}{\delta^2 (k+1)^2}.$$
In particular, for $\eta = T^{-8/13} d^{-5/6}$, $\delta = T^{-5/26} d^{1/3}$, and $k = T^{1/13} d^{5/3}$, the following guarantee holds for the regret of the algorithm:
$$\mathrm{Reg}_T(\text{OPTIMISTICBCO}) = \tilde O\big(T^{8/13} d^{5/3}\big).$$

This result is currently the best polynomial-in-time regret bound that is also polynomial in the dimension of the action space (both computationally and in terms of regret). It improves upon the work of Saha and Tewari [2011] and Dekel et al. [2015].
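As a quick sanity check on the parameter choices in Theorem 1 (our own arithmetic, with $C$, $D$, $L$, $\nu$, and logarithmic factors treated as constants), the main terms of the bound balance at the stated rate:

```latex
% With eta = T^{-11/16} d^{-3/8}, delta = T^{-5/16} d^{3/8}, k = T^{1/8} d^{1/4}:
\[
L \delta D T \;=\; \Theta\!\left(T^{11/16} d^{3/8}\right),
\qquad
\frac{\eta\, C^2 d^2 T}{\delta^2 (k+1)^2}
  \;=\; \Theta\!\left(T^{-\frac{11}{16} + 1 + \frac{10}{16} - \frac{4}{16}}\,
         d^{-\frac{3}{8} + 2 - \frac{6}{8} - \frac{4}{8}}\right)
  \;=\; \Theta\!\left(T^{11/16} d^{3/8}\right),
\qquad
\frac{\nu \log(1/\epsilon)}{\eta} \;=\; \Theta\!\left(T^{11/16} d^{3/8}\right),
\]
% while, for epsilon polynomially small in T, the remaining terms are of
% lower order, giving the stated regret \tilde{O}(T^{11/16} d^{3/8}).
```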
We now explain the analysis of both results, starting with Theorem 1 for $C^{0,1}$ losses.

5.2 $C^{0,1}$ analysis

Our analysis proceeds in two steps. We first modularize the cost of approximating the original losses $f_t(y_t)$ incurred with the averaged smoothed losses that we treat as surrogate losses. Then we show that the algorithm minimizes the regret against the surrogate losses effectively. The proofs of all lemmas in this section are presented in Appendix 7.2.

Lemma 5 ($C^{0,1}$ structural bound on true losses in terms of smoothed losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \colon K \to \mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \subseteq \mathbb{R}^d$. Denote $\hat f_t(x) = \mathbb{E}_{u \sim U(\partial B_1(0))}[f_t(x + \delta A_t u)]$, $\hat g_t = \frac{d}{\delta} f_t(y_t) A_t^{-1} u_t$, and $y_t = x_t + \delta A_t u_t$ for arbitrary $A_t$, $\delta$, and $u_t$. Let $x^* = \operatorname{argmin}_{x \in K} \sum_{t=1}^T f_t(x)$, and let $x_\epsilon^* \in \operatorname{argmin}_{y \in K,\, \mathrm{dist}(y, \partial K) > \epsilon} \|y - x^*\|$. Assume that we play $y_t$ at every round. Then the following structural estimate holds:
$$\mathrm{Reg}_T(\mathcal{A}) = \mathbb{E}\left[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\right] \le \epsilon L T + 2L\delta D T + \mathbb{E}\left[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x_\epsilon^*)\right].$$

Thus, at the price of $\epsilon L T + 2L\delta D T$, it suffices to look at the performance of the averaged losses for the algorithm. Notice that the only assumptions we have made so far are that we play points sampled on an ellipsoid around the desired point scaled by $\delta$ and that the loss functions are Lipschitz.

Lemma 6 ($C^{0,1}$ structural bound on smoothed losses in terms of averaged losses). Under the assumptions and notation of Lemma 5, denote furthermore
$$\bar f_t(x) = \frac{1}{k+1} \sum_{i=0}^{k} \hat f_{t-i}(x), \qquad \bar g_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat g_{t-i}.$$
Assume that we play $y_t$ at every round. Then we have the structural estimate:
$$\mathbb{E}\left[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x_\epsilon^*)\right] \le \frac{Ck}{2} + LT \sup_{t \in [1,T],\, i \in [0, \min(k,t)]} \mathbb{E}[\|x_{t-i} - x_t\|_2] + \mathbb{E}\left[\sum_{t=1}^T \bar g_t^\top (x_t - x_\epsilon^*)\right].$$

While we use averaged smoothed losses as in [Dekel et al., 2015], the analysis in this lemma is actually somewhat different. Because Dekel et al. [2015] always assume that the loss functions are in $C^{1,1}$, they elect to use the following decomposition:
$$\hat f_t(x_t) - \hat f_t(x_\epsilon^*) = \left[\hat f_t(x_t) - \bar f_t(x_t)\right] + \left[\bar f_t(x_t) - \bar f_t(x_\epsilon^*)\right] + \left[\bar f_t(x_\epsilon^*) - \hat f_t(x_\epsilon^*)\right].$$
This is because they can relate $\nabla \bar f_t(x) = \frac{1}{k+1} \sum_{i=0}^{k} \nabla \hat f_{t-i}(x)$ to $\bar g_t = \frac{1}{k+1} \sum_{i=0}^{k} \nabla \hat f_{t-i}(x_{t-i})$ using the fact that the gradients are Lipschitz. Since the gradients of $C^{0,1}$ functions are not Lipschitz, we cannot use the same analysis. Instead, we use the decomposition
$$\hat f_t(x_t) - \hat f_t(x_\epsilon^*) = \left[\hat f_t(x_t) - \hat f_{t-i}(x_{t-i})\right] + \left[\hat f_{t-i}(x_{t-i}) - \hat f_t(x_\epsilon^*)\right].$$

The next lemma affirms that we do indeed get the improved $\frac{1}{(k+1)^2}$ factor from predicting the predictable component of the average gradient.

Lemma 7 ($C^{0,1}$ algorithmic bound on the averaged losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \colon K \to \mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \subseteq \mathbb{R}^d$. Let $x^* = \operatorname{argmin}_{x \in K} \sum_{t=1}^T f_t(x)$, and let $x_\epsilon^* \in \operatorname{argmin}_{y \in K,\, \mathrm{dist}(y, \partial K) > \epsilon} \|y - x^*\|$. Assume that we play according to the algorithm with $\eta k \le \frac{\delta}{12Cd}$. Then we maintain the following guarantee:
$$\mathbb{E}\left[\sum_{t=1}^T \bar g_t^\top (x_t - x_\epsilon^*)\right] \le \frac{2\eta\, C^2 d^2 T}{\delta^2 (k+1)^2} + \frac{1}{\eta}\, R(x_\epsilon^*).$$

So far, we have demonstrated a bound on the regret of the form:
$$\mathrm{Reg}_T(\mathcal{A}) \le \epsilon L T + 2L\delta D T + \frac{Ck}{2} + LT \sup_{t \in [1,T],\, i \in [0, \min(k,t)]} \mathbb{E}[\|x_{t-i} - x_t\|_2] + \frac{2\eta\, C^2 d^2 T}{\delta^2 (k+1)^2} + \frac{1}{\eta}\, R(x_\epsilon^*).$$
Thus, it remains to find a tight bound on $\sup_{t \in [1,T],\, i \in [0, \min(k,t)]} \mathbb{E}[\|x_{t-i} - x_t\|_2]$, which measures the stability of the actions across the history that we average over. This result is similar to that of Dekel et al. [2015], except that we additionally need to account for the optimistic gradient prediction used.

Lemma 8 ($C^{0,1}$ algorithmic bound on the stability of actions). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \colon K \to \mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \subseteq \mathbb{R}^d$. Assume that we play according to the algorithm with $\eta k \le \frac{\delta}{12Cd}$. Then the following estimate holds:
$$\mathbb{E}[\|x_{t-i} - x_t\|_2] \le \frac{2\eta k\, C d}{\delta (k+1)}\, D.$$

Proof of Theorem 1. Putting all the pieces together from Lemmas 5, 6, 7, and 8 shows that
$$\mathrm{Reg}_T(\mathcal{A}) \le \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2\eta k\, C d D}{\delta (k+1)}\, LT + \frac{2\eta\, C^2 d^2 T}{\delta^2 (k+1)^2} + \frac{1}{\eta}\, R(x_\epsilon^*).$$
Since $x_\epsilon^*$ is at least $\epsilon$ away from the boundary, it follows from [Abernethy and Rakhlin, 2009] that $R(x_\epsilon^*) \le \nu \log(1/\epsilon)$. Plugging in the stated quantities for $\eta$, $k$, and $\delta$ yields the result.

5.3 $C^{1,1}$ analysis

The analysis of the $C^{1,1}$ regret bound is similar to the $C^{0,1}$ case. The only difference is that we leverage the higher regularity of the losses to provide a more refined estimate of the cost of approximating $f_t$ with $\bar f_t$. Apart from that, we reuse the bounds derived in Lemmas 6, 7, and 8. The proof of the following lemma, along with that of Theorem 2, is provided in Appendix 7.3.

Lemma 9 ($C^{1,1}$ structural bound on true losses in terms of smoothed losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \colon K \to \mathbb{R}_+$ is $C$-bounded, $L$-Lipschitz, and $H$-smooth, where $K \subseteq \mathbb{R}^d$. Denote $\hat f_t(x) = \mathbb{E}_{u \sim U(\partial B_1(0))}[f_t(x + \delta A_t u)]$, $\hat g_t = \frac{d}{\delta} f_t(y_t) A_t^{-1} u_t$, and $y_t = x_t + \delta A_t u_t$ for arbitrary $A_t$, $\delta$, and $u_t$. Let $x^* = \operatorname{argmin}_{x \in K} \sum_{t=1}^T f_t(x)$, and let $x_\epsilon^* \in \operatorname{argmin}_{y \in K,\, \mathrm{dist}(y, \partial K) > \epsilon} \|y - x^*\|$. Assume that we play $y_t$ at every round. Then the following structural estimate holds:
$$\mathrm{Reg}_T(\mathcal{A}) = \mathbb{E}\left[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\right] \le \epsilon L T + 2H\delta^2 D^2 T + \mathbb{E}\left[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x_\epsilon^*)\right].$$

6 Conclusion

We designed a computationally efficient algorithm for bandit convex optimization admitting state-of-the-art guarantees for $C^{0,1}$ and $C^{1,1}$ loss functions. This was achieved using the general and powerful technique of predicting predictable information re-use. The ideas we describe here are directly applicable to other areas of optimization, in particular stochastic optimization.

Acknowledgements

This work was partly funded by NSF CCF-1535987 and IIS-1618662, and by an NSF GRFP fellowship (DGE-1342536).
References

J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In COLT, 2009.

J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263–274, 2008.

A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40, 2010.

S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex optimization. CoRR, abs/1507.06580, 2015.

S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: $\sqrt{T}$ regret in one dimension. CoRR, abs/1502.06398, 2015.

S. Bubeck, R. Eldan, and Y. T. Lee. Kernel-based methods for bandit convex optimization. CoRR, abs/1607.03084, 2016.

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.

O. Dekel, R. Eldan, and T. Koren. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In NIPS, pages 2908–2916, 2015.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA, pages 385–394, 2005.

E. Hazan and Y. Li. An optimal algorithm for bandit convex optimization. CoRR, abs/1603.04350, 2016.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.

R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, pages 697–704, 2004.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York, NY, USA, 2004.

Y. Nesterov and A. Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. Studies in Applied Mathematics. Society for Industrial and Applied Mathematics, 1994. ISBN 9781611970791.

A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, pages 993–1019, 2013.

A. Saha and A. Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In AISTATS, pages 636–642, 2011.

M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. CoRR, abs/1309.2388, 2013.