# Bandit Convex Optimization: Towards Tight Bounds

Elad Hazan
Technion - Israel Institute of Technology
Haifa 32000, Israel
ehazan@ie.technion.ac.il

Kfir Y. Levy
Technion - Israel Institute of Technology
Haifa 32000, Israel
kfiryl@tx.technion.ac.il

Bandit Convex Optimization (BCO) is a fundamental framework for decision making under uncertainty, which generalizes many problems from the realm of online and statistical learning. While the special case of linear cost functions is well understood, the attainable regret for BCO with nonlinear losses remains an important open question. In this paper we take a step towards understanding the best attainable regret bounds for BCO: we give an efficient and near-optimal regret algorithm for BCO with strongly-convex and smooth loss functions. In contrast to previous works on BCO, which use time-invariant exploration schemes, our method employs an exploration scheme that shrinks with time.

## 1 Introduction

The power of the Online Convex Optimization (OCO) framework lies in its ability to generalize many problems from the realm of online and statistical learning, and to supply universal tools for solving them. Extensive investigation throughout the last decade has yielded efficient algorithms with worst-case guarantees. This has led many practitioners to embrace the OCO framework in modeling and solving real-world problems.

One of the greatest challenges in OCO is finding tight bounds for the problem of Bandit Convex Optimization (BCO). In this bandit setting the learner observes the loss function only at the point that she has chosen. Hence, the learner has to balance between exploiting the information she has gathered and exploring new data. The seminal work of [5] elegantly resolves this exploration-exploitation dilemma by devising a combined explore-exploit gradient descent algorithm, obtaining a bound of $O(T^{3/4})$ on the expected regret for the general case of an adversary playing bounded and Lipschitz-continuous convex losses.

In this paper we investigate the BCO setting assuming that the adversary is limited to inflicting strongly-convex and smooth losses and that the player may choose points from a constrained decision set. In this setting we devise an efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T})$. This rate is the best possible up to logarithmic factors, as implied by a recent work of [11], which cleverly obtains a lower bound of $\Omega(\sqrt{T})$ for the same setting.

During our analysis, we develop a full-information algorithm that takes advantage of the strong-convexity of the loss functions and uses a self-concordant barrier as a regularization term. This algorithm enables us to perform shrinking exploration, which is a key ingredient in our BCO algorithm. In contrast, all previous works on BCO use a time-invariant exploration scheme.

This paper is organized as follows. In Section 2 we introduce our setting and review necessary preliminaries regarding self-concordant barriers. In Section 3 we discuss schemes to perform single-point gradient estimation, then define first-order online methods and analyze the performance of such methods when they receive noisy gradient estimates. Our main result is described and analyzed in Section 4; Section 5 concludes.

| Setting | Full-Info. | BCO |
|---|---|---|
| Convex | $\Theta(\sqrt{T})$ | $O(T^{3/4})$ |
| Linear | $\Theta(\sqrt{T})$ | $\tilde{\Theta}(\sqrt{T})$ |
| Smooth | $\Theta(\sqrt{T})$ | $O(T^{2/3})$ |
| Str.-Convex | $\Theta(\log T)$ | $O(T^{2/3})$ |
| Str.-Convex & Smooth | $\Theta(\log T)$ | **$\tilde{O}(\sqrt{T})$ [Thm. 10]** |

Table 1: Known regret bounds in the Full-Info./BCO settings; in each BCO setting the best known lower bound is $\Omega(\sqrt{T})$. Our new result is highlighted, and improves upon the previous $O(T^{2/3})$ bound.
### 1.1 Prior work

For BCO with general convex loss functions, almost simultaneously with [5], a bound of $O(T^{3/4})$ was also obtained by [7] for the setting of Lipschitz-continuous convex losses. In contrast, the best known lower bound for this problem is $\Omega(\sqrt{T})$, proved for the easier full-information setting.

In case the adversary is limited to using linear losses, it can be shown that the player does not pay for exploration; this property was used by [4] to devise the Geometric Hedge algorithm, which achieves an optimal regret rate of $O(\sqrt{T})$. Later [1], inspired by interior point methods, devised the first efficient algorithm that attains the same nearly-optimal regret rate for this setup of bandit linear optimization.

For some special classes of nonlinear convex losses, several works build on ideas from [5] to achieve improved upper bounds for BCO. In the case of convex and smooth losses, [9] attained an upper bound of $O(T^{2/3})$. The same regret rate of $O(T^{2/3})$ was achieved by [2] in the case of strongly-convex losses. For the special case of unconstrained BCO with strongly-convex and smooth losses, [2] obtained a regret of $O(\sqrt{T})$. A recent paper by Shamir [11] significantly advanced our understanding of BCO by devising a lower bound of $\Omega(\sqrt{T})$ for the setting of strongly-convex and smooth BCO; the latter implies the tightness of our bound. A comprehensive survey by Bubeck and Cesa-Bianchi [3] provides a review of the bandit optimization literature in both the stochastic and online settings.

## 2 Setting and Background

**Notation:** Throughout this paper we denote by $\|\cdot\|$ the $\ell_2$-norm when referring to vectors, and use the same notation for the spectral norm when referring to matrices. We denote by $\mathbb{B}^n$ and $\mathbb{S}^n$ the $n$-dimensional Euclidean unit ball and unit sphere, and by $v \sim \mathbb{B}^n$ and $u \sim \mathbb{S}^n$ random variables chosen uniformly from these sets. The symbol $I$ is used for the identity matrix (its dimension will be clear from the context). For a positive definite matrix $A \succ 0$ we denote by $A^{1/2}$ the matrix $B$ such that $B^\top B = A$, and by $A^{-1/2}$ the inverse of $B$. Finally, we denote $[N] := \{1, \dots, N\}$.

### 2.1 Bandit Convex Optimization

We consider a repeated game of $T$ rounds between a player and an adversary; at each round $t \in [T]$:

1. the player chooses a point $x_t \in \mathcal{K}$;
2. the adversary independently chooses a loss function $f_t \in \mathcal{F}$;
3. the player suffers a loss $f_t(x_t)$ and receives a feedback $\mathcal{F}_t$.

In the OCO (Online Convex Optimization) framework we assume that the decision set $\mathcal{K}$ is convex and that all functions in $\mathcal{F}$ are convex. Our paper focuses on adversaries limited to choosing functions from the set $\mathcal{F}_{\sigma,\beta}$, the set of all $\sigma$-strongly-convex and $\beta$-smooth functions. We also limit ourselves to oblivious adversaries, where the loss sequence $\{f_t\}_{t=1}^T$ is predetermined and is therefore independent of the player's choices. Note that in this case the best point in hindsight is also independent of the player's choices. We further assume that the loss functions are defined over the entire space $\mathbb{R}^n$ and are strongly-convex and smooth there; yet the player may only choose points from a constrained set $\mathcal{K}$.

Let us define the regret of $\mathcal{A}$, and its regret with respect to a comparator $w \in \mathcal{K}$:

$$\mathrm{Regret}^{\mathcal{A}}_T = \sum_{t=1}^T f_t(x_t) - \min_{w \in \mathcal{K}} \sum_{t=1}^T f_t(w), \qquad \mathrm{Regret}^{\mathcal{A}}_T(w) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(w).$$

The player aims at minimizing his regret, and we are interested in players that ensure an $o(T)$ regret for any loss sequence that the adversary may choose. The player learns through the feedback $\mathcal{F}_t$ received in response to his actions. In the full-information setting, he receives the loss function $f_t$ itself as feedback, usually by means of a gradient oracle, i.e., the decision maker has access to the gradient of the loss function at any point in the decision set. In the BCO setting, by contrast, the feedback is $f_t(x_t)$, i.e., the value of the loss function only at the point that he has chosen, and the player aims at minimizing his expected regret, $\mathbb{E}[\mathrm{Regret}^{\mathcal{A}}_T]$. A minimal sketch of this protocol is given below.
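The following Python sketch is only meant to make the interaction and the regret definition concrete; the `player` object with `act()`/`feed()` methods and the callable losses are hypothetical stand-ins of ours, not part of the paper.

```python
def run_bco(player, losses):
    """Run the bandit protocol: the player observes only the scalar f_t(x_t),
    never the loss function f_t itself.

    `player` is a hypothetical stand-in with act() returning a point of K and
    feed(value) receiving the bandit feedback; `losses` is the oblivious
    adversary's predetermined sequence f_1, ..., f_T of callables.
    """
    points = []
    for f_t in losses:
        x_t = player.act()         # step 1: player picks x_t in K
        player.feed(f_t(x_t))      # steps 2-3: suffer and observe only f_t(x_t)
        points.append(x_t)
    return points


def regret(points, losses, w):
    """Regret_T(w) = sum_t f_t(x_t) - sum_t f_t(w) for a fixed comparator w."""
    return sum(f(x) for f, x in zip(losses, points)) - sum(f(w) for f in losses)
```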
### 2.2 Strong Convexity and Smoothness

As mentioned in the last subsection, we consider an adversary limited to choosing loss functions from the set $\mathcal{F}_{\sigma,\beta}$, the set of $\sigma$-strongly-convex and $\beta$-smooth functions; here we define these properties.

**Definition 1 (Strong Convexity).** We say that a function $f: \mathbb{R}^n \to \mathbb{R}$ is $\sigma$-strongly convex over the set $\mathcal{K}$ if for all $x, y \in \mathcal{K}$ it holds that

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\sigma}{2}\|x - y\|^2. \tag{1}$$

**Definition 2 (Smoothness).** We say that a convex function $f: \mathbb{R}^n \to \mathbb{R}$ is $\beta$-smooth over the set $\mathcal{K}$ if the following holds:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|x - y\|^2, \qquad \forall x, y \in \mathcal{K}. \tag{2}$$

### 2.3 Self-Concordant Barriers

Interior point methods are polynomial-time algorithms for solving constrained convex optimization programs. The main tool in these methods is a barrier function that encodes the constrained set and enables the use of fast unconstrained optimization machinery. More on this subject can be found in [8].

Let $\mathcal{K} \subset \mathbb{R}^n$ be a convex set with a non-empty interior $\mathrm{int}(\mathcal{K})$.

**Definition 3.** A function $R: \mathrm{int}(\mathcal{K}) \to \mathbb{R}$ is called $\nu$-self-concordant if:

1. $R$ is three times continuously differentiable and convex, and approaches infinity along any sequence of points approaching the boundary of $\mathcal{K}$.
2. For every $h \in \mathbb{R}^n$ and $x \in \mathrm{int}(\mathcal{K})$ the following holds:
$$|\nabla^3 R(x)[h, h, h]| \le 2\big(\nabla^2 R(x)[h, h]\big)^{3/2} \quad \text{and} \quad |\nabla R(x)[h]| \le \nu^{1/2}\big(\nabla^2 R(x)[h, h]\big)^{1/2},$$
where $\nabla^3 R(x)[h, h, h] := \frac{\partial^3}{\partial t_1 \partial t_2 \partial t_3} R(x + t_1 h + t_2 h + t_3 h)\big|_{t_1 = t_2 = t_3 = 0}$.

Our algorithm requires a $\nu$-self-concordant barrier over $\mathcal{K}$, and its regret depends on $\nu$. It is well known that any convex set in $\mathbb{R}^n$ admits such a barrier with $\nu = O(n)$ ($\nu$ might be much smaller), and that most interesting convex sets admit a self-concordant barrier that is efficiently represented.

The Hessian of a self-concordant barrier induces a local norm at every $x \in \mathrm{int}(\mathcal{K})$; we denote this norm by $\|\cdot\|_x$ and its dual by $\|\cdot\|_x^*$, defined for all $h \in \mathbb{R}^n$ by

$$\|h\|_x = \sqrt{h^\top \nabla^2 R(x)\, h}, \qquad \|h\|_x^* = \sqrt{h^\top \big(\nabla^2 R(x)\big)^{-1} h},$$

where we assume that $\nabla^2 R(x)$ always has full rank.

The following fact is a key ingredient in the sampling scheme of BCO algorithms [1, 9]. Let $R$ be a self-concordant barrier and $x \in \mathrm{int}(\mathcal{K})$; then the Dikin ellipsoid,

$$W_1(x) := \{y \in \mathbb{R}^n : \|y - x\|_x \le 1\}, \tag{3}$$

i.e., the $\|\cdot\|_x$-unit ball centered at $x$, is completely contained in $\mathcal{K}$.

Our regret analysis requires a bound on $R(y) - R(x)$; hence, we will find the following lemma useful:

**Lemma 4.** Let $R$ be a $\nu$-self-concordant function over $\mathcal{K}$. Then:

$$R(y) - R(x) \le \nu \log \frac{1}{1 - \pi_x(y)}, \qquad \forall x, y \in \mathrm{int}(\mathcal{K}),$$

where $\pi_x(y) = \inf\{t \ge 0 : x + t^{-1}(y - x) \in \mathcal{K}\}$ for $x, y \in \mathrm{int}(\mathcal{K})$.

Note that $\pi_x(y)$ is called the Minkowski function and always lies in $[0, 1]$. Moreover, as $y$ approaches the boundary of $\mathcal{K}$, $\pi_x(y) \to 1$.

## 3 Single Point Gradient Estimation and Noisy First-Order Methods

### 3.1 Single Point Gradient Estimation

A main component of BCO algorithms is a randomized sampling scheme for constructing gradient estimates. Here, we survey the previous schemes as well as the more general scheme that we use.

**Spherical estimators:** Flaxman et al. [5] introduced a method that produces single-point gradient estimates through spherical sampling. These estimates are then inserted into a full-information procedure that chooses the next decision point for the player. A minimal sketch of this scheme follows; its formal guarantee is stated as Lemma 5 below.
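The sketch below is a minimal numpy rendering of the spherical sampling scheme of [5], under our own illustrative assumptions; the sanity check at the end uses the loss $f(x) = \|x\|^2$, for which the smoothed gradient equals $2x$ exactly.

```python
import numpy as np

def spherical_gradient_estimate(f, x, delta, rng):
    """Single-point estimate (n/delta) * f(x + delta*u) * u with u ~ S^n; by
    Lemma 5 below, its expectation is the gradient of the smoothed version."""
    n = x.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                    # u uniform on the unit sphere
    return (n / delta) * f(x + delta * u) * u

# Sanity check on f(x) = ||x||^2, whose smoothed version has gradient 2x:
rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5, 0.25])
avg = np.mean([spherical_gradient_estimate(lambda z: z @ z, x0, 0.1, rng)
               for _ in range(100_000)], axis=0)   # approx. [2.0, -1.0, 0.5]
```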
Interestingly, these gradient estimates are unbiased predictions of the gradients of a smoothed version of the function, which we define next. Let $\delta > 0$ and $v \sim \mathbb{B}^n$; the smoothed version of a function $f: \mathbb{R}^n \to \mathbb{R}$ is defined as follows:

$$\hat{f}(x) = \mathbb{E}[f(x + \delta v)]. \tag{4}$$

The next lemma of [5] ties the gradients of $\hat{f}$ to an estimate based on samples of $f$:

**Lemma 5.** Let $u \sim \mathbb{S}^n$, and consider the smoothed version $\hat{f}$ defined in Equation (4). Then the following applies:

$$\nabla \hat{f}(x) = \mathbb{E}\Big[\frac{n}{\delta} f(x + \delta u)\, u\Big]. \tag{5}$$

Therefore, $\frac{n}{\delta} f(x + \delta u)\, u$ is an unbiased estimator of the gradient of the smoothed version.

*Figure 1: Dikin ellipsoid sampling schemes. (a) Eigenpoles sampling; (b) continuous sampling; (c) shrinking sampling.*

**Ellipsoidal estimators:** Abernethy et al. [1] introduced the idea of sampling from an ellipsoid (specifically the Dikin ellipsoid) rather than a sphere in the context of BCO. They restricted the sampling to the eigenpoles of the ellipsoid (Fig. 1a). A more general method of sampling continuously from an ellipsoid was introduced in [9] (Fig. 1b). We shall see later that our algorithm uses a shrinking-sampling scheme (Fig. 1c), which is crucial in achieving the $\tilde{O}(\sqrt{T})$ regret bound. The following result of [9] shows that we can sample $f$ non-uniformly over all directions and create an unbiased gradient estimate of a respective smoothed version:

**Corollary 6.** Let $f: \mathbb{R}^n \to \mathbb{R}$ be a continuous function, let $A \in \mathbb{R}^{n \times n}$ be invertible, and let $v \sim \mathbb{B}^n$, $u \sim \mathbb{S}^n$. Define the smoothed version of $f$ with respect to $A$:

$$\hat{f}(x) = \mathbb{E}[f(x + A v)]. \tag{6}$$

Then the following holds:

$$\nabla \hat{f}(x) = \mathbb{E}\big[n f(x + A u)\, A^{-1} u\big]. \tag{7}$$

Note that if $A \succ 0$ then $\{A u : u \in \mathbb{S}^n\}$ is the boundary of an ellipsoid.

Our next lemma shows that the smoothed version preserves the strong-convexity of $f$, and that we can measure the distance between $\hat{f}$ and $f$ using the spectral norm of $A^2$:

**Lemma 7.** Consider a function $f: \mathbb{R}^n \to \mathbb{R}$ and a positive definite matrix $A \in \mathbb{R}^{n \times n}$. Let $\hat{f}$ be the smoothed version of $f$ with respect to $A$ as defined in Equation (6). Then the following holds:

- If $f$ is $\sigma$-strongly convex, then so is $\hat{f}$.
- If $f$ is convex and $\beta$-smooth, and $\lambda_{\max}$ is the largest eigenvalue of $A$, then
$$0 \le \hat{f}(x) - f(x) \le \frac{\beta}{2}\|A^2\|_2 = \frac{\beta}{2}\lambda_{\max}^2. \tag{8}$$

**Remark:** Lemma 7 also holds if we define the smoothed version of $f$ as $\hat{f}(x) = \mathbb{E}_{u \sim \mathbb{S}^n}[f(x + A u)]$, i.e., an average of the original function values over the unit sphere rather than the unit ball as in Equation (6). The proof is similar to that of Lemma 7.

### 3.2 Noisy First-Order Methods

Our algorithm utilizes a full-information online algorithm, but instead of providing this method with exact gradient values, we feed it noisy estimates of the gradients. In what follows we define first-order online algorithms and present a lemma that analyzes the regret of such an algorithm when it receives noisy gradients.

**Definition 8 (First-Order Online Algorithm).** Let $\mathcal{A}$ be an OCO algorithm receiving an arbitrary sequence of differentiable convex loss functions $f_1, \dots, f_T$, and providing points $x_1 \leftarrow \mathcal{A}$ and $x_t \leftarrow \mathcal{A}(f_1, \dots, f_{t-1})$. Suppose that $\mathcal{A}$ requires all loss functions to belong to some set $\mathcal{F}_0$. Then $\mathcal{A}$ is called a first-order online algorithm if the following holds:

- Adding a linear function to a member of $\mathcal{F}_0$ keeps it in $\mathcal{F}_0$; i.e., for every $f \in \mathcal{F}_0$ and $a \in \mathbb{R}^n$, also $f + a^\top x \in \mathcal{F}_0$.
- The algorithm's choices depend only on the gradient values taken at the past choices of $\mathcal{A}$, i.e.:
$$\mathcal{A}(f_1, \dots, f_{t-1}) = \mathcal{A}\big(\nabla f_1(x_1), \dots, \nabla f_{t-1}(x_{t-1})\big), \qquad \forall t \in [T].$$

The sketch below illustrates this setup: a standard first-order method driven by the single-point gradient estimates of Corollary 6.
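In the following sketch, projected online gradient descent is merely a familiar stand-in for a first-order online algorithm in the sense of Definition 8, and all names and parameter choices are ours, not the paper's; the estimates it receives are unbiased for the smoothed losses, which is exactly the condition Lemma 9 below requires.

```python
import numpy as np

def ellipsoidal_gradient_estimate(f, x, A, rng):
    """Corollary 6: with u ~ S^n, n * f(x + A u) * A^{-1} u is an unbiased
    estimate of the gradient of the smoothed version E_{v ~ B^n}[f(x + A v)]."""
    n = x.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                     # u uniform on the unit sphere
    return n * f(x + A @ u) * np.linalg.solve(A, u)

def noisy_ogd(losses, n, T, eta, radius, rng, delta=0.1):
    """Projected online gradient descent driven by the noisy estimates above,
    over the Euclidean ball of the given radius (our stand-in for K)."""
    z = np.zeros(n)
    A = delta * np.eye(n)                      # spherical special case, A > 0
    points = []
    for t in range(T):
        points.append(z.copy())
        g = ellipsoidal_gradient_estimate(losses[t], z, A, rng)
        z = z - eta * g                        # gradient step on the estimate
        nz = np.linalg.norm(z)
        if nz > radius:                        # Euclidean projection onto K
            z *= radius / nz
    return points
```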
The following lemma is a generalization of Lemma 3.1 from [5]:

**Lemma 9.** Let $w$ be a fixed point in $\mathcal{K}$. Let $\mathcal{A}$ be a first-order online algorithm receiving a sequence of differentiable convex loss functions $f_1, \dots, f_T: \mathcal{K} \to \mathbb{R}$ ($f_{t+1}$ possibly depending on $z_1, \dots, z_t$), where $z_1, \dots, z_T$ are defined as follows: $z_1 \leftarrow \mathcal{A}$, $z_t \leftarrow \mathcal{A}(g_1, \dots, g_{t-1})$, and the $g_t$'s are vector-valued random variables such that

$$\mathbb{E}[g_t \mid z_1, f_1, \dots, z_t, f_t] = \nabla f_t(z_t).$$

Then, if $\mathcal{A}$ ensures a regret bound of the form $\mathrm{Regret}^{\mathcal{A}}_T \le B_{\mathcal{A}}\big(\nabla f_1(x_1), \dots, \nabla f_T(x_T)\big)$ in the full-information case, then in the case of noisy gradients it ensures the following bound:

$$\mathbb{E}\Big[\sum_{t=1}^T f_t(z_t)\Big] - \sum_{t=1}^T f_t(w) \le \mathbb{E}\big[B_{\mathcal{A}}(g_1, \dots, g_T)\big].$$

## 4 Main Result and Analysis

The following is the main theorem of this paper:

**Theorem 10.** Let $\mathcal{K}$ be a convex set with diameter $D_{\mathcal{K}}$, and let $R$ be a $\nu$-self-concordant barrier over $\mathcal{K}$. Then, in the BCO setting where the adversary is limited to choosing $\beta$-smooth and $\sigma$-strongly-convex functions with $|f_t(x)| \le L$ for all $x \in \mathcal{K}$, the expected regret of Algorithm 1 with $\eta = \sqrt{\frac{(\nu + 2\beta/\sigma)\log T}{2 n^2 L^2 T}}$ is upper bounded as

$$\mathbb{E}[\mathrm{Regret}_T] \le 4 n L \sqrt{(\nu + 2\beta/\sigma)\, T \log T} + 2L + \frac{\beta D_{\mathcal{K}}^2}{2} = \tilde{O}(\sqrt{T}),$$

whenever $T / \log T \ge 2(\nu + 2\beta/\sigma)$.

**Algorithm 1** BCO Algorithm for Str.-convex & Smooth Losses

```
Input: η > 0, σ > 0, ν-self-concordant barrier R
Choose x_1 = argmin_{x ∈ K} R(x)
for t = 1, 2, ..., T do
  Define B_t = (∇²R(x_t) + ησt·I)^{-1/2}
  Draw u ~ S^n
  Play y_t = x_t + B_t u
  Observe f_t(x_t + B_t u) and define g_t = n·f_t(x_t + B_t u)·B_t^{-1} u
  Update x_{t+1} = argmin_{x ∈ K} Σ_{τ=1}^{t} [g_τ^⊤ x + (σ/2)·||x − x_τ||²] + η^{-1}·R(x)
end for
```

Algorithm 1 shrinks the exploration magnitude with time (Fig. 1c); this is enabled thanks to the strong-convexity of the losses. It also updates according to a full-information first-order algorithm denoted FTARL-σ, which is defined below. This algorithm is a variant of the FTRL methodology as defined in [6, 10].

**Algorithm 2** FTARL-σ

```
Input: η > 0, ν-self-concordant barrier R
Choose x_1 = argmin_{x ∈ K} R(x)
for t = 1, 2, ..., T do
  Receive ∇h_t(x_t)
  Output x_{t+1} = argmin_{x ∈ K} Σ_{τ=1}^{t} [∇h_τ(x_τ)^⊤ x + (σ/2)·||x − x_τ||²] + η^{-1}·R(x)
end for
```

Next we give a proof sketch of Theorem 10.

*Proof sketch of Theorem 10.* Let us decompose the expected regret of Algorithm 1 with respect to $w \in \mathcal{K}$:

$$\mathbb{E}[\mathrm{Regret}_T(w)] := \sum_{t=1}^T \mathbb{E}[f_t(y_t) - f_t(w)] = \sum_{t=1}^T \mathbb{E}[f_t(y_t) - f_t(x_t)] \tag{9}$$
$$+ \sum_{t=1}^T \mathbb{E}\big[f_t(x_t) - \hat{f}_t(x_t)\big] \tag{10}$$
$$- \sum_{t=1}^T \mathbb{E}\big[f_t(w) - \hat{f}_t(w)\big] \tag{11}$$
$$+ \sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big] \tag{12}$$

where the expectation is taken with respect to the player's choices, and $\hat{f}_t$ is defined as $\hat{f}_t(x) = \mathbb{E}[f_t(x + B_t v)]$ for $x \in \mathcal{K}$; here $v \sim \mathbb{B}^n$ and the smoothing matrix $B_t$ is defined in Algorithm 1.

The sampling scheme used by Algorithm 1 yields an unbiased gradient estimate $g_t$ of the smoothed version $\hat{f}_t$, which is then fed to FTARL-σ (Algorithm 2). We can therefore interpret Algorithm 1 as performing a noisy first-order method (FTARL-σ) over the smoothed versions. The $x_t$'s in Algorithm 1 are the outputs of FTARL-σ, thus the term in Equation (12) is associated with *exploitation*. The other terms in Equations (9)-(11) measure the cost of sampling away from $x_t$ and the distance between the smoothed version and the original function; hence these terms are associated with *exploration*. In what follows we analyze these terms separately and show that Algorithm 1 achieves an expected regret of $\tilde{O}(\sqrt{T})$.

**The Exploration Terms:** The following hold by Lemma 7 and the remark that follows it:

$$\mathbb{E}[f_t(y_t) - f_t(x_t)] = \mathbb{E}\big[\mathbb{E}_u[f_t(x_t + B_t u) \mid x_t] - f_t(x_t)\big] \le \tfrac{\beta}{2}\,\mathbb{E}\|B_t^2\|_2 \le \frac{\beta}{2\eta\sigma t} \tag{13}$$
$$\mathbb{E}[\hat{f}_t(w) - f_t(w)] = \mathbb{E}\big[\mathbb{E}[\hat{f}_t(w) - f_t(w) \mid x_t]\big] \le \tfrac{\beta}{2}\,\mathbb{E}\|B_t^2\|_2 \le \frac{\beta}{2\eta\sigma t} \tag{14}$$
$$\mathbb{E}[f_t(x_t) - \hat{f}_t(x_t)] = \mathbb{E}\big[\mathbb{E}[f_t(x_t) - \hat{f}_t(x_t) \mid x_t]\big] \le 0 \tag{15}$$

where $\|B_t^2\|_2 \le 1/(\eta\sigma t)$ follows from the definition of $B_t$ and the fact that $\nabla^2 R(x_t)$ is positive definite. A schematic implementation of Algorithm 1 appears below.
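To make the structure of Algorithm 1 concrete, here is a schematic Python sketch for the special case where $\mathcal{K}$ is the unit Euclidean ball with the standard barrier $R(x) = -\log(1 - \|x\|^2)$; the damped-Newton inner solver and all parameter choices are our own illustrative stand-ins (any interior-point solver for the FTARL-σ objective would do), so this illustrates the shrinking exploration scheme rather than reproducing the paper's exact implementation.

```python
import numpy as np

def barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), a self-concordant
    barrier for the unit Euclidean ball (our stand-in for K)."""
    s = 1.0 - x @ x
    return 2.0 * x / s, 2.0 * np.eye(len(x)) / s + 4.0 * np.outer(x, x) / s**2

def ftarl_update(grads, anchors, sigma, eta, x, iters=25):
    """FTARL-sigma step: minimize sum_tau [g_tau.x + (sigma/2)||x - x_tau||^2]
    + R(x)/eta over the ball, via damped Newton iterations."""
    g_sum, t, n = np.sum(grads, axis=0), len(grads), len(x)
    for _ in range(iters):
        bg, bh = barrier_grad_hess(x)
        grad = g_sum + sigma * (t * x - np.sum(anchors, axis=0)) + bg / eta
        step = np.linalg.solve(sigma * t * np.eye(n) + bh / eta, grad)
        alpha = 1.0
        while (x - alpha * step) @ (x - alpha * step) >= 1.0:
            alpha *= 0.5                       # backtrack to stay interior
        x = x - alpha * step
    return x

def bco_algorithm(losses, n, T, sigma, eta, rng):
    x = np.zeros(n)                            # x_1 = argmin_K R(x)
    grads, anchors, played = [], [], []
    for t in range(1, T + 1):
        _, bh = barrier_grad_hess(x)
        # B_t = (hess R(x_t) + eta*sigma*t*I)^{-1/2}: exploration shrinks in t
        w, V = np.linalg.eigh(bh + eta * sigma * t * np.eye(n))
        B = V @ np.diag(w ** -0.5) @ V.T
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)                 # u ~ S^n
        y = x + B @ u                          # play inside the Dikin ellipsoid
        g = n * losses[t - 1](y) * np.linalg.solve(B, u)  # grad estimate of f_hat_t
        grads.append(g); anchors.append(x); played.append(y)
        x = ftarl_update(grads, anchors, sigma, eta, x)
    return played
```

For instance, `losses` could be quadratics $f_t(x) = \|x - c_t\|^2$ (so $\sigma = \beta = 2$) for some hidden sequence of centers $c_t$ in the ball.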
**The Exploitation Term:** The next lemma bounds the regret of FTARL-σ in the full-information setting:

**Lemma 11.** Let $R$ be a self-concordant barrier over a convex set $\mathcal{K}$, and let $\eta > 0$. Consider an online player receiving $\sigma$-strongly-convex loss functions $h_1, \dots, h_T$ and choosing points according to FTARL-σ (Algorithm 2), with $\eta \|\nabla h_t(x_t)\|_t^* \le 1/2$ for all $t \in [T]$. Then for all $w \in \mathcal{K}$ the player's regret is upper bounded as follows:

$$\sum_{t=1}^T h_t(x_t) - \sum_{t=1}^T h_t(w) \le 2\eta \sum_{t=1}^T \big(\|\nabla h_t(x_t)\|_t^*\big)^2 + \eta^{-1} R(w),$$

where $(\|a\|_t^*)^2 = a^\top \big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} a$.

Note that Algorithm 1 uses the estimates $g_t$ as inputs to FTARL-σ. Using Corollary 6 we can show that the $g_t$'s are unbiased estimates of the gradients of the smoothed versions $\hat{f}_t$. Using the regret bound of the above lemma and the unbiasedness of the $g_t$'s, Lemma 9 ensures:

$$\sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big] \le 2\eta \sum_{t=1}^T \mathbb{E}\big[(\|g_t\|_t^*)^2\big] + \eta^{-1} R(w). \tag{16}$$

By the definitions of $g_t$ and $B_t$, and recalling that $|f_t(x)| \le L$ for all $x \in \mathcal{K}$, we can bound:

$$\mathbb{E}\big[(\|g_t\|_t^*)^2 \mid x_t\big] = \mathbb{E}\Big[n^2 \big(f_t(x_t + B_t u)\big)^2\, u^\top B_t^{-1} \big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} B_t^{-1} u \,\Big|\, x_t\Big] \le (n L)^2,$$

where the last inequality holds since $B_t^{-1}\big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} B_t^{-1} = I$ by the definition of $B_t$.

**Concluding:** Plugging the latter into Equation (16) and combining Equations (9)-(16), we get:

$$\mathbb{E}[\mathrm{Regret}_T(w)] \le 2\eta (n L)^2 T + \eta^{-1}\big(R(w) + 2\beta\sigma^{-1} \log T\big). \tag{17}$$

Recall that $x_1 = \arg\min_{x \in \mathcal{K}} R(x)$, and assume w.l.o.g. that $R(x_1) = 0$ (we can always add a constant to $R$). Thus, for a point $w \in \mathcal{K}$ such that $\pi_{x_1}(w) \le 1 - T^{-1}$, Lemma 4 ensures that $R(w) \le \nu \log T$. Combining the latter with Equation (17) and the choice of $\eta$ in Theorem 10 assures an expected regret bounded by $4 n L \sqrt{(\nu + 2\beta\sigma^{-1})\, T \log T}$. For $w \in \mathcal{K}$ such that $\pi_{x_1}(w) > 1 - T^{-1}$, we can always find $w' \in \mathcal{K}$ such that $\|w - w'\| \le O(T^{-1})$ and $\pi_{x_1}(w') \le 1 - T^{-1}$; using the Lipschitzness of the $f_t$'s, Theorem 10 holds.

**Correctness:** Note that Algorithm 1 chooses points from the set $\{x_t + (\nabla^2 R(x_t) + \eta\sigma t I)^{-1/2} u : u \in \mathbb{S}^n\}$, which is inside the Dikin ellipsoid and therefore belongs to $\mathcal{K}$ (the Dikin ellipsoid is always contained in $\mathcal{K}$).

## 5 Summary and open questions

We have presented an efficient algorithm that attains near-optimal regret for the setting of BCO with strongly-convex and smooth losses, advancing our understanding of optimal regret rates for bandit learning. Perhaps the most important question in bandit learning remains the resolution of the attainable regret bounds for smooth but non-strongly-convex losses, for strongly-convex but non-smooth losses, and for general convex cost functions (see Table 1). Ideally, this should be accompanied by an efficient algorithm, although understanding the optimal rates up to polylogarithmic factors would be a significant advancement by itself.

## Acknowledgements

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 336078 (ERC-SUBLRN).

## References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263-274, 2008.
[2] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28-40, 2010.
[3] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[4] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007.
[5] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385-394, 2005.
[6] Elad Hazan. A survey: The convex optimization approach to regret minimization. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, pages 287-302. MIT Press, 2011.
[7] Robert D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, volume 17, pages 697-704, 2004.
[8] Arkadii Nemirovskii. Interior point polynomial time methods in convex programming. Lecture Notes, 2004.
[9] Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In AISTATS, pages 636-642, 2011.
[10] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[11] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3-24, 2013.