# Bandit Convex Optimization: Towards Tight Bounds

Elad Hazan
Technion - Israel Institute of Technology
Haifa 32000, Israel
ehazan@ie.technion.ac.il

Kfir Y. Levy
Technion - Israel Institute of Technology
Haifa 32000, Israel
kfiryl@tx.technion.ac.il

Bandit Convex Optimization (BCO) is a fundamental framework for decision making under uncertainty, which generalizes many problems from the realm of online and statistical learning. While the special case of linear cost functions is well understood, the attainable regret for BCO with nonlinear losses remains an important open question. In this paper we take a step towards understanding the best attainable regret bounds for BCO: we give an efficient and near-optimal regret algorithm for BCO with strongly-convex and smooth loss functions. In contrast to previous works on BCO, which use time-invariant exploration schemes, our method employs an exploration scheme that shrinks with time.

## 1 Introduction

The power of the Online Convex Optimization (OCO) framework lies in its ability to generalize many problems from the realm of online and statistical learning, and to supply universal tools for solving them. Extensive investigation throughout the last decade has yielded efficient algorithms with worst-case guarantees. This has led many practitioners to embrace the OCO framework in modeling and solving real-world problems.

One of the greatest challenges in OCO is finding tight bounds for the problem of Bandit Convex Optimization (BCO). In this bandit setting the learner observes the loss function only at the point that she has chosen. Hence, the learner has to balance between exploiting the information she has gathered and exploring new data. The seminal work of [5] elegantly resolves this exploration-exploitation dilemma by devising a combined explore-exploit gradient descent algorithm, obtaining a bound of $O(T^{3/4})$ on the expected regret for the general case of an adversary playing bounded and Lipschitz-continuous convex losses.

In this paper we investigate the BCO setting assuming that the adversary is limited to inflicting strongly-convex and smooth losses and that the player may choose points from a constrained decision set. In this setting we devise an efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T})$. This rate is the best possible up to logarithmic factors, as implied by a recent work of [11], which cleverly obtains a lower bound of $\Omega(\sqrt{T})$ for the same setting.

During our analysis, we develop a full-information algorithm that takes advantage of the strong-convexity of the loss functions and uses a self-concordant barrier as a regularization term. This algorithm enables us to perform shrinking exploration, which is a key ingredient in our BCO algorithm. In contrast, all previous works on BCO use a time-invariant exploration scheme.

This paper is organized as follows. In Section 2 we introduce our setting and review necessary preliminaries regarding self-concordant barriers. In Section 3 we discuss schemes to perform single-point gradient estimation, then define first-order online methods and analyze the performance of such methods when they receive noisy gradient estimates. Our main result is described and analyzed in Section 4; Section 5 concludes.

| Setting | Full-Info. | BCO |
|---|---|---|
| Convex | $\Theta(\sqrt{T})$ | $O(T^{3/4})$ |
| Linear | $\Theta(\sqrt{T})$ | $\tilde{\Theta}(\sqrt{T})$ |
| Smooth | $\Theta(\sqrt{T})$ | $O(T^{2/3})$ |
| Str.-Convex | $\Theta(\log T)$ | $O(T^{2/3})$ |
| Str.-Convex & Smooth | $\Theta(\log T)$ | **$\tilde{O}(\sqrt{T})$ [Thm. 10]** |

Table 1: Known regret bounds in the Full-Info./BCO settings; in each BCO setting the best known lower bound is $\Omega(\sqrt{T})$. Our new result is highlighted, and improves upon the previous $O(T^{2/3})$ bound.
### 1.1 Prior work

For BCO with general convex loss functions, almost simultaneously with [5], a bound of $O(T^{3/4})$ was also obtained by [7] for the setting of Lipschitz-continuous convex losses. In contrast, the best known lower bound for this problem is $\Omega(\sqrt{T})$, proved for the easier full-information setting.

In case the adversary is limited to using linear losses, it can be shown that the player does not pay for exploration; this property was used by [4] to devise the Geometric Hedge algorithm, which achieves an optimal regret rate of $O(\sqrt{T})$. Later [1], inspired by interior point methods, devised the first efficient algorithm that attains the same nearly-optimal regret rate for this setup of bandit linear optimization.

For some special classes of nonlinear convex losses, several works build on ideas from [5] to achieve improved upper bounds for BCO. In the case of convex and smooth losses, [9] attained an upper bound of $O(T^{2/3})$. The same regret rate of $O(T^{2/3})$ was achieved by [2] in the case of strongly-convex losses. For the special case of unconstrained BCO with strongly-convex and smooth losses, [2] obtained a regret of $O(\sqrt{T})$. A recent paper by Shamir [11] significantly advanced our understanding of BCO by devising a lower bound of $\Omega(\sqrt{T})$ for the setting of strongly-convex and smooth BCO; the latter implies the tightness of our bound. A comprehensive survey by Bubeck and Cesa-Bianchi [3] provides a review of the bandit optimization literature in both the stochastic and online settings.

## 2 Setting and Background

**Notation:** Throughout this paper we denote by $\|\cdot\|$ the $\ell_2$-norm when referring to vectors, and use the same notation for the spectral norm when referring to matrices. We denote by $\mathbb{B}^n$ and $\mathbb{S}^n$ the $n$-dimensional Euclidean unit ball and unit sphere, and by $v \sim \mathbb{B}^n$ and $u \sim \mathbb{S}^n$ random variables chosen uniformly from these sets. The symbol $I$ is used for the identity matrix (its dimension will be clear from the context). For a positive definite matrix $A \succ 0$ we denote by $A^{1/2}$ the matrix $B$ such that $B^\top B = A$, and by $A^{-1/2}$ the inverse of $B$. Finally, we denote $[N] := \{1, \dots, N\}$.

### 2.1 Bandit Convex Optimization

We consider a repeated game of $T$ rounds between a player and an adversary; at each round $t \in [T]$:

1. the player chooses a point $x_t \in \mathcal{K}$;
2. the adversary independently chooses a loss function $f_t \in \mathcal{F}$;
3. the player suffers a loss $f_t(x_t)$ and receives a feedback $\mathcal{F}_t$.

In the OCO (Online Convex Optimization) framework we assume that the decision set $\mathcal{K}$ is convex and that all functions in $\mathcal{F}$ are convex. Our paper focuses on adversaries limited to choosing functions from the set $\mathcal{F}_{\sigma,\beta}$, the set of all $\sigma$-strongly-convex and $\beta$-smooth functions. We also limit ourselves to oblivious adversaries, where the loss sequence $\{f_t\}_{t=1}^T$ is predetermined and is therefore independent of the player's choices. Note that in this case the best point in hindsight is also independent of the player's choices. We further assume that the loss functions are defined over the entire space $\mathbb{R}^n$ and are strongly-convex and smooth there; yet the player may only choose points from a constrained set $\mathcal{K}$.

Let us define the regret of $\mathcal{A}$, and its regret with respect to a comparator $w \in \mathcal{K}$:

$$\mathrm{Regret}^{\mathcal{A}}_T = \sum_{t=1}^T f_t(x_t) - \min_{w \in \mathcal{K}} \sum_{t=1}^T f_t(w), \qquad \mathrm{Regret}^{\mathcal{A}}_T(w) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(w).$$

The player aims at minimizing his regret, and we are interested in players that ensure an $o(T)$ regret for any loss sequence that the adversary may choose. The player learns through the feedback $\mathcal{F}_t$ received in response to his actions. In the full-information setting, he receives the loss function $f_t$ itself as feedback, usually by means of a gradient oracle, i.e., the decision maker has access to the gradient of the loss function at any point in the decision set. In the BCO setting, by contrast, the feedback is $f_t(x_t)$, i.e., the value of the loss function only at the point that he has chosen, and the player aims at minimizing his expected regret, $\mathbb{E}[\mathrm{Regret}^{\mathcal{A}}_T]$. A minimal sketch of this protocol is given below.
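The following Python sketch is only meant to make the interaction and the regret definition concrete; the `player` object with `act()`/`feed()` methods and the callable losses are hypothetical stand-ins of ours, not part of the paper.

```python
def run_bco(player, losses):
    """Run the bandit protocol: the player observes only the scalar f_t(x_t),
    never the loss function f_t itself.

    `player` is a hypothetical stand-in with act() returning a point of K and
    feed(value) receiving the bandit feedback; `losses` is the oblivious
    adversary's predetermined sequence f_1, ..., f_T of callables.
    """
    points = []
    for f_t in losses:
        x_t = player.act()         # step 1: player picks x_t in K
        player.feed(f_t(x_t))      # steps 2-3: suffer and observe only f_t(x_t)
        points.append(x_t)
    return points


def regret(points, losses, w):
    """Regret_T(w) = sum_t f_t(x_t) - sum_t f_t(w) for a fixed comparator w."""
    return sum(f(x) for f, x in zip(losses, points)) - sum(f(w) for f in losses)
```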
### 2.2 Strong Convexity and Smoothness

As mentioned in the last subsection, we consider an adversary limited to choosing loss functions from the set $\mathcal{F}_{\sigma,\beta}$, the set of $\sigma$-strongly-convex and $\beta$-smooth functions; here we define these properties.

**Definition 1 (Strong Convexity).** We say that a function $f: \mathbb{R}^n \to \mathbb{R}$ is $\sigma$-strongly convex over the set $\mathcal{K}$ if for all $x, y \in \mathcal{K}$ it holds that

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\sigma}{2}\|x - y\|^2. \tag{1}$$

**Definition 2 (Smoothness).** We say that a convex function $f: \mathbb{R}^n \to \mathbb{R}$ is $\beta$-smooth over the set $\mathcal{K}$ if the following holds:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|x - y\|^2, \qquad \forall x, y \in \mathcal{K}. \tag{2}$$

### 2.3 Self-Concordant Barriers

Interior point methods are polynomial-time algorithms for solving constrained convex optimization programs. The main tool in these methods is a barrier function that encodes the constrained set and enables the use of fast unconstrained optimization machinery. More on this subject can be found in [8].

Let $\mathcal{K} \subset \mathbb{R}^n$ be a convex set with a non-empty interior $\mathrm{int}(\mathcal{K})$.

**Definition 3.** A function $R: \mathrm{int}(\mathcal{K}) \to \mathbb{R}$ is called $\nu$-self-concordant if:

1. $R$ is three times continuously differentiable and convex, and approaches infinity along any sequence of points approaching the boundary of $\mathcal{K}$.
2. For every $h \in \mathbb{R}^n$ and $x \in \mathrm{int}(\mathcal{K})$ the following holds:
$$|\nabla^3 R(x)[h, h, h]| \le 2\big(\nabla^2 R(x)[h, h]\big)^{3/2} \quad \text{and} \quad |\nabla R(x)[h]| \le \nu^{1/2}\big(\nabla^2 R(x)[h, h]\big)^{1/2},$$
where $\nabla^3 R(x)[h, h, h] := \frac{\partial^3}{\partial t_1 \partial t_2 \partial t_3} R(x + t_1 h + t_2 h + t_3 h)\big|_{t_1 = t_2 = t_3 = 0}$.

Our algorithm requires a $\nu$-self-concordant barrier over $\mathcal{K}$, and its regret depends on $\nu$. It is well known that any convex set in $\mathbb{R}^n$ admits such a barrier with $\nu = O(n)$ ($\nu$ might be much smaller), and that most interesting convex sets admit a self-concordant barrier that is efficiently represented.

The Hessian of a self-concordant barrier induces a local norm at every $x \in \mathrm{int}(\mathcal{K})$; we denote this norm by $\|\cdot\|_x$ and its dual by $\|\cdot\|_x^*$, defined for all $h \in \mathbb{R}^n$ by

$$\|h\|_x = \sqrt{h^\top \nabla^2 R(x)\, h}, \qquad \|h\|_x^* = \sqrt{h^\top \big(\nabla^2 R(x)\big)^{-1} h},$$

where we assume that $\nabla^2 R(x)$ always has full rank.

The following fact is a key ingredient in the sampling scheme of BCO algorithms [1, 9]. Let $R$ be a self-concordant barrier and $x \in \mathrm{int}(\mathcal{K})$; then the Dikin ellipsoid,

$$W_1(x) := \{y \in \mathbb{R}^n : \|y - x\|_x \le 1\}, \tag{3}$$

i.e., the $\|\cdot\|_x$-unit ball centered at $x$, is completely contained in $\mathcal{K}$.

Our regret analysis requires a bound on $R(y) - R(x)$; hence, we will find the following lemma useful:

**Lemma 4.** Let $R$ be a $\nu$-self-concordant function over $\mathcal{K}$. Then:

$$R(y) - R(x) \le \nu \log \frac{1}{1 - \pi_x(y)}, \qquad \forall x, y \in \mathrm{int}(\mathcal{K}),$$

where $\pi_x(y) = \inf\{t \ge 0 : x + t^{-1}(y - x) \in \mathcal{K}\}$ for $x, y \in \mathrm{int}(\mathcal{K})$.

Note that $\pi_x(y)$ is called the Minkowski function and always lies in $[0, 1]$. Moreover, as $y$ approaches the boundary of $\mathcal{K}$, $\pi_x(y) \to 1$.

## 3 Single Point Gradient Estimation and Noisy First-Order Methods

### 3.1 Single Point Gradient Estimation

A main component of BCO algorithms is a randomized sampling scheme for constructing gradient estimates. Here, we survey the previous schemes as well as the more general scheme that we use.

**Spherical estimators:** Flaxman et al. [5] introduced a method that produces single-point gradient estimates through spherical sampling. These estimates are then inserted into a full-information procedure that chooses the next decision point for the player. A minimal sketch of this scheme follows; its formal guarantee is stated as Lemma 5 below.
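The sketch below is a minimal numpy rendering of the spherical sampling scheme of [5], under our own illustrative assumptions; the sanity check at the end uses the loss $f(x) = \|x\|^2$, for which the smoothed gradient equals $2x$ exactly.

```python
import numpy as np

def spherical_gradient_estimate(f, x, delta, rng):
    """Single-point estimate (n/delta) * f(x + delta*u) * u with u ~ S^n; by
    Lemma 5 below, its expectation is the gradient of the smoothed version."""
    n = x.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                    # u uniform on the unit sphere
    return (n / delta) * f(x + delta * u) * u

# Sanity check on f(x) = ||x||^2, whose smoothed version has gradient 2x:
rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5, 0.25])
avg = np.mean([spherical_gradient_estimate(lambda z: z @ z, x0, 0.1, rng)
               for _ in range(100_000)], axis=0)   # approx. [2.0, -1.0, 0.5]
```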
Interestingly, these gradient estimates are unbiased predictions of the gradients of a smoothed version of the function, which we define next. Let $\delta > 0$ and $v \sim \mathbb{B}^n$; the smoothed version of a function $f: \mathbb{R}^n \to \mathbb{R}$ is defined as follows:

$$\hat{f}(x) = \mathbb{E}[f(x + \delta v)]. \tag{4}$$

The next lemma of [5] ties the gradients of $\hat{f}$ to an estimate based on samples of $f$:

**Lemma 5.** Let $u \sim \mathbb{S}^n$, and consider the smoothed version $\hat{f}$ defined in Equation (4). Then the following applies:

$$\nabla \hat{f}(x) = \mathbb{E}\Big[\frac{n}{\delta} f(x + \delta u)\, u\Big]. \tag{5}$$

Therefore, $\frac{n}{\delta} f(x + \delta u)\, u$ is an unbiased estimator of the gradient of the smoothed version.

*Figure 1: Dikin ellipsoid sampling schemes. (a) Eigenpoles sampling; (b) continuous sampling; (c) shrinking sampling.*

**Ellipsoidal estimators:** Abernethy et al. [1] introduced the idea of sampling from an ellipsoid (specifically the Dikin ellipsoid) rather than a sphere in the context of BCO. They restricted the sampling to the eigenpoles of the ellipsoid (Fig. 1a). A more general method of sampling continuously from an ellipsoid was introduced in [9] (Fig. 1b). We shall see later that our algorithm uses a shrinking-sampling scheme (Fig. 1c), which is crucial in achieving the $\tilde{O}(\sqrt{T})$ regret bound. The following result of [9] shows that we can sample $f$ non-uniformly over all directions and create an unbiased gradient estimate of a respective smoothed version:

**Corollary 6.** Let $f: \mathbb{R}^n \to \mathbb{R}$ be a continuous function, let $A \in \mathbb{R}^{n \times n}$ be invertible, and let $v \sim \mathbb{B}^n$, $u \sim \mathbb{S}^n$. Define the smoothed version of $f$ with respect to $A$:

$$\hat{f}(x) = \mathbb{E}[f(x + A v)]. \tag{6}$$

Then the following holds:

$$\nabla \hat{f}(x) = \mathbb{E}\big[n f(x + A u)\, A^{-1} u\big]. \tag{7}$$

Note that if $A \succ 0$ then $\{A u : u \in \mathbb{S}^n\}$ is the boundary of an ellipsoid.

Our next lemma shows that the smoothed version preserves the strong-convexity of $f$, and that we can measure the distance between $\hat{f}$ and $f$ using the spectral norm of $A^2$:

**Lemma 7.** Consider a function $f: \mathbb{R}^n \to \mathbb{R}$ and a positive definite matrix $A \in \mathbb{R}^{n \times n}$. Let $\hat{f}$ be the smoothed version of $f$ with respect to $A$ as defined in Equation (6). Then the following holds:

- If $f$ is $\sigma$-strongly convex, then so is $\hat{f}$.
- If $f$ is convex and $\beta$-smooth, and $\lambda_{\max}$ is the largest eigenvalue of $A$, then
$$0 \le \hat{f}(x) - f(x) \le \frac{\beta}{2}\|A^2\|_2 = \frac{\beta}{2}\lambda_{\max}^2. \tag{8}$$

**Remark:** Lemma 7 also holds if we define the smoothed version of $f$ as $\hat{f}(x) = \mathbb{E}_{u \sim \mathbb{S}^n}[f(x + A u)]$, i.e., an average of the original function values over the unit sphere rather than the unit ball as in Equation (6). The proof is similar to that of Lemma 7.

### 3.2 Noisy First-Order Methods

Our algorithm utilizes a full-information online algorithm, but instead of providing this method with exact gradient values, we feed it noisy estimates of the gradients. In what follows we define first-order online algorithms and present a lemma that analyzes the regret of such an algorithm when it receives noisy gradients.

**Definition 8 (First-Order Online Algorithm).** Let $\mathcal{A}$ be an OCO algorithm receiving an arbitrary sequence of differentiable convex loss functions $f_1, \dots, f_T$, and providing points $x_1 \leftarrow \mathcal{A}$ and $x_t \leftarrow \mathcal{A}(f_1, \dots, f_{t-1})$. Suppose that $\mathcal{A}$ requires all loss functions to belong to some set $\mathcal{F}_0$. Then $\mathcal{A}$ is called a first-order online algorithm if the following holds:

- Adding a linear function to a member of $\mathcal{F}_0$ keeps it in $\mathcal{F}_0$; i.e., for every $f \in \mathcal{F}_0$ and $a \in \mathbb{R}^n$, also $f + a^\top x \in \mathcal{F}_0$.
- The algorithm's choices depend only on the gradient values taken at the past choices of $\mathcal{A}$, i.e.:
$$\mathcal{A}(f_1, \dots, f_{t-1}) = \mathcal{A}\big(\nabla f_1(x_1), \dots, \nabla f_{t-1}(x_{t-1})\big), \qquad \forall t \in [T].$$

The sketch below illustrates this setup: a standard first-order method driven by the single-point gradient estimates of Corollary 6.
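In the following sketch, projected online gradient descent is merely a familiar stand-in for a first-order online algorithm in the sense of Definition 8, and all names and parameter choices are ours, not the paper's; the estimates it receives are unbiased for the smoothed losses, which is exactly the condition Lemma 9 below requires.

```python
import numpy as np

def ellipsoidal_gradient_estimate(f, x, A, rng):
    """Corollary 6: with u ~ S^n, n * f(x + A u) * A^{-1} u is an unbiased
    estimate of the gradient of the smoothed version E_{v ~ B^n}[f(x + A v)]."""
    n = x.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                     # u uniform on the unit sphere
    return n * f(x + A @ u) * np.linalg.solve(A, u)

def noisy_ogd(losses, n, T, eta, radius, rng, delta=0.1):
    """Projected online gradient descent driven by the noisy estimates above,
    over the Euclidean ball of the given radius (our stand-in for K)."""
    z = np.zeros(n)
    A = delta * np.eye(n)                      # spherical special case, A > 0
    points = []
    for t in range(T):
        points.append(z.copy())
        g = ellipsoidal_gradient_estimate(losses[t], z, A, rng)
        z = z - eta * g                        # gradient step on the estimate
        nz = np.linalg.norm(z)
        if nz > radius:                        # Euclidean projection onto K
            z *= radius / nz
    return points
```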
The following lemma is a generalization of Lemma 3.1 from [5]:

**Lemma 9.** Let $w$ be a fixed point in $\mathcal{K}$. Let $\mathcal{A}$ be a first-order online algorithm receiving a sequence of differentiable convex loss functions $f_1, \dots, f_T: \mathcal{K} \to \mathbb{R}$ ($f_{t+1}$ possibly depending on $z_1, \dots, z_t$), where $z_1, \dots, z_T$ are defined as follows: $z_1 \leftarrow \mathcal{A}$, $z_t \leftarrow \mathcal{A}(g_1, \dots, g_{t-1})$, and the $g_t$'s are vector-valued random variables such that

$$\mathbb{E}[g_t \mid z_1, f_1, \dots, z_t, f_t] = \nabla f_t(z_t).$$

Then, if $\mathcal{A}$ ensures a regret bound of the form $\mathrm{Regret}^{\mathcal{A}}_T \le B_{\mathcal{A}}\big(\nabla f_1(x_1), \dots, \nabla f_T(x_T)\big)$ in the full-information case, then in the case of noisy gradients it ensures the following bound:

$$\mathbb{E}\Big[\sum_{t=1}^T f_t(z_t)\Big] - \sum_{t=1}^T f_t(w) \le \mathbb{E}\big[B_{\mathcal{A}}(g_1, \dots, g_T)\big].$$

## 4 Main Result and Analysis

The following is the main theorem of this paper:

**Theorem 10.** Let $\mathcal{K}$ be a convex set with diameter $D_{\mathcal{K}}$, and let $R$ be a $\nu$-self-concordant barrier over $\mathcal{K}$. Then, in the BCO setting where the adversary is limited to choosing $\beta$-smooth and $\sigma$-strongly-convex functions with $|f_t(x)| \le L$ for all $x \in \mathcal{K}$, the expected regret of Algorithm 1 with $\eta = \sqrt{\frac{(\nu + 2\beta/\sigma)\log T}{2 n^2 L^2 T}}$ is upper bounded as

$$\mathbb{E}[\mathrm{Regret}_T] \le 4 n L \sqrt{(\nu + 2\beta/\sigma)\, T \log T} + 2L + \frac{\beta D_{\mathcal{K}}^2}{2} = \tilde{O}(\sqrt{T}),$$

whenever $T / \log T \ge 2(\nu + 2\beta/\sigma)$.

**Algorithm 1** BCO Algorithm for Str.-convex & Smooth Losses

```
Input: η > 0, σ > 0, ν-self-concordant barrier R
Choose x_1 = argmin_{x ∈ K} R(x)
for t = 1, 2, ..., T do
  Define B_t = (∇²R(x_t) + ησt·I)^{-1/2}
  Draw u ~ S^n
  Play y_t = x_t + B_t u
  Observe f_t(x_t + B_t u) and define g_t = n·f_t(x_t + B_t u)·B_t^{-1} u
  Update x_{t+1} = argmin_{x ∈ K} Σ_{τ=1}^{t} [g_τ^⊤ x + (σ/2)·||x − x_τ||²] + η^{-1}·R(x)
end for
```

Algorithm 1 shrinks the exploration magnitude with time (Fig. 1c); this is enabled thanks to the strong-convexity of the losses. It also updates according to a full-information first-order algorithm denoted FTARL-σ, which is defined below. This algorithm is a variant of the FTRL methodology as defined in [6, 10].

**Algorithm 2** FTARL-σ

```
Input: η > 0, ν-self-concordant barrier R
Choose x_1 = argmin_{x ∈ K} R(x)
for t = 1, 2, ..., T do
  Receive ∇h_t(x_t)
  Output x_{t+1} = argmin_{x ∈ K} Σ_{τ=1}^{t} [∇h_τ(x_τ)^⊤ x + (σ/2)·||x − x_τ||²] + η^{-1}·R(x)
end for
```

Next we give a proof sketch of Theorem 10.

*Proof sketch of Theorem 10.* Let us decompose the expected regret of Algorithm 1 with respect to $w \in \mathcal{K}$:

$$\mathbb{E}[\mathrm{Regret}_T(w)] := \sum_{t=1}^T \mathbb{E}[f_t(y_t) - f_t(w)] = \sum_{t=1}^T \mathbb{E}[f_t(y_t) - f_t(x_t)] \tag{9}$$
$$+ \sum_{t=1}^T \mathbb{E}\big[f_t(x_t) - \hat{f}_t(x_t)\big] \tag{10}$$
$$- \sum_{t=1}^T \mathbb{E}\big[f_t(w) - \hat{f}_t(w)\big] \tag{11}$$
$$+ \sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big] \tag{12}$$

where the expectation is taken with respect to the player's choices, and $\hat{f}_t$ is defined as $\hat{f}_t(x) = \mathbb{E}[f_t(x + B_t v)]$ for $x \in \mathcal{K}$; here $v \sim \mathbb{B}^n$ and the smoothing matrix $B_t$ is defined in Algorithm 1.

The sampling scheme used by Algorithm 1 yields an unbiased gradient estimate $g_t$ of the smoothed version $\hat{f}_t$, which is then fed to FTARL-σ (Algorithm 2). We can therefore interpret Algorithm 1 as performing a noisy first-order method (FTARL-σ) over the smoothed versions. The $x_t$'s in Algorithm 1 are the outputs of FTARL-σ, thus the term in Equation (12) is associated with *exploitation*. The other terms in Equations (9)-(11) measure the cost of sampling away from $x_t$ and the distance between the smoothed version and the original function; hence these terms are associated with *exploration*. In what follows we analyze these terms separately and show that Algorithm 1 achieves an expected regret of $\tilde{O}(\sqrt{T})$.

**The Exploration Terms:** The following hold by Lemma 7 and the remark that follows it:

$$\mathbb{E}[f_t(y_t) - f_t(x_t)] = \mathbb{E}\big[\mathbb{E}_u[f_t(x_t + B_t u) \mid x_t] - f_t(x_t)\big] \le \tfrac{\beta}{2}\,\mathbb{E}\|B_t^2\|_2 \le \frac{\beta}{2\eta\sigma t} \tag{13}$$
$$\mathbb{E}[\hat{f}_t(w) - f_t(w)] = \mathbb{E}\big[\mathbb{E}[\hat{f}_t(w) - f_t(w) \mid x_t]\big] \le \tfrac{\beta}{2}\,\mathbb{E}\|B_t^2\|_2 \le \frac{\beta}{2\eta\sigma t} \tag{14}$$
$$\mathbb{E}[f_t(x_t) - \hat{f}_t(x_t)] = \mathbb{E}\big[\mathbb{E}[f_t(x_t) - \hat{f}_t(x_t) \mid x_t]\big] \le 0 \tag{15}$$

where $\|B_t^2\|_2 \le 1/(\eta\sigma t)$ follows from the definition of $B_t$ and the fact that $\nabla^2 R(x_t)$ is positive definite. A schematic implementation of Algorithm 1 appears below.
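To make the structure of Algorithm 1 concrete, here is a schematic Python sketch for the special case where $\mathcal{K}$ is the unit Euclidean ball with the standard barrier $R(x) = -\log(1 - \|x\|^2)$; the damped-Newton inner solver and all parameter choices are our own illustrative stand-ins (any interior-point solver for the FTARL-σ objective would do), so this illustrates the shrinking exploration scheme rather than reproducing the paper's exact implementation.

```python
import numpy as np

def barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), a self-concordant
    barrier for the unit Euclidean ball (our stand-in for K)."""
    s = 1.0 - x @ x
    return 2.0 * x / s, 2.0 * np.eye(len(x)) / s + 4.0 * np.outer(x, x) / s**2

def ftarl_update(grads, anchors, sigma, eta, x, iters=25):
    """FTARL-sigma step: minimize sum_tau [g_tau.x + (sigma/2)||x - x_tau||^2]
    + R(x)/eta over the ball, via damped Newton iterations."""
    g_sum, t, n = np.sum(grads, axis=0), len(grads), len(x)
    for _ in range(iters):
        bg, bh = barrier_grad_hess(x)
        grad = g_sum + sigma * (t * x - np.sum(anchors, axis=0)) + bg / eta
        step = np.linalg.solve(sigma * t * np.eye(n) + bh / eta, grad)
        alpha = 1.0
        while (x - alpha * step) @ (x - alpha * step) >= 1.0:
            alpha *= 0.5                       # backtrack to stay interior
        x = x - alpha * step
    return x

def bco_algorithm(losses, n, T, sigma, eta, rng):
    x = np.zeros(n)                            # x_1 = argmin_K R(x)
    grads, anchors, played = [], [], []
    for t in range(1, T + 1):
        _, bh = barrier_grad_hess(x)
        # B_t = (hess R(x_t) + eta*sigma*t*I)^{-1/2}: exploration shrinks in t
        w, V = np.linalg.eigh(bh + eta * sigma * t * np.eye(n))
        B = V @ np.diag(w ** -0.5) @ V.T
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)                 # u ~ S^n
        y = x + B @ u                          # play inside the Dikin ellipsoid
        g = n * losses[t - 1](y) * np.linalg.solve(B, u)  # grad estimate of f_hat_t
        grads.append(g); anchors.append(x); played.append(y)
        x = ftarl_update(grads, anchors, sigma, eta, x)
    return played
```

For instance, `losses` could be quadratics $f_t(x) = \|x - c_t\|^2$ (so $\sigma = \beta = 2$) for some hidden sequence of centers $c_t$ in the ball.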
**The Exploitation Term:** The next lemma bounds the regret of FTARL-σ in the full-information setting:

**Lemma 11.** Let $R$ be a self-concordant barrier over a convex set $\mathcal{K}$, and let $\eta > 0$. Consider an online player receiving $\sigma$-strongly-convex loss functions $h_1, \dots, h_T$ and choosing points according to FTARL-σ (Algorithm 2), with $\eta \|\nabla h_t(x_t)\|_t^* \le 1/2$ for all $t \in [T]$. Then for all $w \in \mathcal{K}$ the player's regret is upper bounded as follows:

$$\sum_{t=1}^T h_t(x_t) - \sum_{t=1}^T h_t(w) \le 2\eta \sum_{t=1}^T \big(\|\nabla h_t(x_t)\|_t^*\big)^2 + \eta^{-1} R(w),$$

where $(\|a\|_t^*)^2 = a^\top \big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} a$.

Note that Algorithm 1 uses the estimates $g_t$ as inputs to FTARL-σ. Using Corollary 6 we can show that the $g_t$'s are unbiased estimates of the gradients of the smoothed versions $\hat{f}_t$. Using the regret bound of the above lemma and the unbiasedness of the $g_t$'s, Lemma 9 ensures:

$$\sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(w)\big] \le 2\eta \sum_{t=1}^T \mathbb{E}\big[(\|g_t\|_t^*)^2\big] + \eta^{-1} R(w). \tag{16}$$

By the definitions of $g_t$ and $B_t$, and recalling that $|f_t(x)| \le L$ for all $x \in \mathcal{K}$, we can bound:

$$\mathbb{E}\big[(\|g_t\|_t^*)^2 \mid x_t\big] = \mathbb{E}\Big[n^2 \big(f_t(x_t + B_t u)\big)^2\, u^\top B_t^{-1} \big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} B_t^{-1} u \,\Big|\, x_t\Big] \le (n L)^2,$$

where the last inequality holds since $B_t^{-1}\big(\nabla^2 R(x_t) + \eta\sigma t I\big)^{-1} B_t^{-1} = I$ by the definition of $B_t$.

**Concluding:** Plugging the latter into Equation (16) and combining Equations (9)-(16), we get:

$$\mathbb{E}[\mathrm{Regret}_T(w)] \le 2\eta (n L)^2 T + \eta^{-1}\big(R(w) + 2\beta\sigma^{-1} \log T\big). \tag{17}$$

Recall that $x_1 = \arg\min_{x \in \mathcal{K}} R(x)$, and assume w.l.o.g. that $R(x_1) = 0$ (we can always add a constant to $R$). Thus, for a point $w \in \mathcal{K}$ such that $\pi_{x_1}(w) \le 1 - T^{-1}$, Lemma 4 ensures that $R(w) \le \nu \log T$. Combining the latter with Equation (17) and the choice of $\eta$ in Theorem 10 assures an expected regret bounded by $4 n L \sqrt{(\nu + 2\beta\sigma^{-1})\, T \log T}$. For $w \in \mathcal{K}$ such that $\pi_{x_1}(w) > 1 - T^{-1}$, we can always find $w' \in \mathcal{K}$ such that $\|w - w'\| \le O(T^{-1})$ and $\pi_{x_1}(w') \le 1 - T^{-1}$; using the Lipschitzness of the $f_t$'s, Theorem 10 holds.

**Correctness:** Note that Algorithm 1 chooses points from the set $\{x_t + (\nabla^2 R(x_t) + \eta\sigma t I)^{-1/2} u : u \in \mathbb{S}^n\}$, which is inside the Dikin ellipsoid and therefore belongs to $\mathcal{K}$ (the Dikin ellipsoid is always contained in $\mathcal{K}$).

## 5 Summary and open questions

We have presented an efficient algorithm that attains near-optimal regret for the setting of BCO with strongly-convex and smooth losses, advancing our understanding of optimal regret rates for bandit learning. Perhaps the most important question in bandit learning remains the resolution of the attainable regret bounds for smooth but non-strongly-convex losses, for strongly-convex but non-smooth losses, and for general convex cost functions (see Table 1). Ideally, this should be accompanied by an efficient algorithm, although understanding the optimal rates up to polylogarithmic factors would be a significant advancement by itself.

## Acknowledgements

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 336078 (ERC-SUBLRN).

## References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263-274, 2008.
[2] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28-40, 2010.
[3] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[4] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007.
[5] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385-394, 2005.
[6] Elad Hazan. A survey: The convex optimization approach to regret minimization. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, pages 287-302. MIT Press, 2011.
[7] Robert D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, volume 17, pages 697-704, 2004.
[8] Arkadii Nemirovskii. Interior point polynomial time methods in convex programming. Lecture Notes, 2004.
[9] Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In AISTATS, pages 636-642, 2011.
[10] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[11] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3-24, 2013.