The Hedge Algorithm on a Continuum

Walid Krichene WALID@EECS.BERKELEY.EDU
University of California, 652 Sutardja Dai Hall, Berkeley, CA 94720 USA

Maximilian Balandat BALANDAT@EECS.BERKELEY.EDU
University of California, 736 Sutardja Dai Hall, Berkeley, CA 94720 USA

Claire Tomlin TOMLIN@EECS.BERKELEY.EDU
University of California, 721 Sutardja Dai Hall, Berkeley, CA 94720 USA

Alexandre Bayen BAYEN@BERKELEY.EDU
University of California, 642 Sutardja Dai Hall, Berkeley, CA 94720 USA

Abstract

We consider an online optimization problem on a compact subset $S \subset \mathbb{R}^n$ (not necessarily convex), in which a decision maker chooses, at each iteration $t$, a probability distribution $x^{(t)}$ over $S$, and seeks to minimize a cumulative expected loss, $\sum_{t=1}^{T} \mathbb{E}_{s \sim x^{(t)}}[\ell^{(t)}(s)]$, where $\ell^{(t)}$ is a Lipschitz loss function revealed at the end of iteration $t$. Building on previous work, we propose a generalized Hedge algorithm and show a $O(\sqrt{t \log t})$ bound on the regret when the losses are uniformly Lipschitz and $S$ is uniformly fat (a weaker condition than convexity). Finally, we propose a generalization to the dual averaging method on the set of Lebesgue-continuous distributions over $S$.

1. Introduction

We consider the online optimization setting used by Zinkevich (2003) and Hazan et al. (2007), where a decision maker chooses, at each iteration $t$, a probability distribution $x^{(t)}$ over some compact feasible set $S \subset \mathbb{R}^n$, and incurs a loss $\mathbb{E}_{s \sim x^{(t)}}[\ell^{(t)}(s)]$. When choosing $x^{(t)}$, the decision maker only has access to $(\ell^{(\tau)}(\cdot))_{1 \le \tau \le t-1}$, i.e. the loss functions up to iteration $t-1$. The cumulative regret is a natural measure of performance in sequential decision problems; it was introduced by Hannan (1957) in the context of repeated games, then later used in the analysis of general sequential decision problems, see for example Cesa-Bianchi & Lugosi (2006) and Bubeck & Cesa-Bianchi (2012). The cumulative regret $R^{(t)}$ at iteration $t$ is defined as the difference between the loss incurred by the decision maker and the loss of the best fixed decision in hindsight, that is, $R^{(t)} = \sum_{\tau=1}^{t} \mathbb{E}_{s \sim x^{(\tau)}}[\ell^{(\tau)}(s)] - \inf_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s)$.

Zinkevich (2003) shows that if the feasible set $S$ is convex, and the loss functions $\ell^{(t)}$ are convex with a bounded gradient, a simple online gradient descent algorithm achieves a $O(\sqrt{t})$ cumulative regret. Hazan et al. (2007) show that if the loss functions $\ell^{(t)}$ are exp-concave, uniformly in $t$, a generalized Hedge algorithm achieves a $O(\log t)$ regret.

The Hedge algorithm, also known as the multiplicative weights update (Arora et al., 2012), has been extensively studied in the discrete case, i.e. when $S$ is a discrete set. The Hedge algorithm was introduced as the exponentially weighted average forecaster by Littlestone & Warmuth (1989). It has also been analyzed in the context of convex optimization, and is known as the exponentiated gradient method (Kivinen & Warmuth, 1997), or the entropic descent method (Beck & Teboulle, 2003). The Hedge algorithm is simple to implement and to analyze, and achieves sublinear regret in the discrete case whenever the loss functions $\ell^{(t)}$ are uniformly bounded. More precisely, if the action set has size $N$ and the learning rates have a $1/\sqrt{t}$ decay rate, then the regret is bounded by $O(\sqrt{t \log N})$; see for example the analysis in (Bubeck & Cesa-Bianchi, 2012).
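To make this discrete baseline concrete, the following is a minimal sketch (illustrative, not from the paper) of discrete Hedge with the $1/\sqrt{t}$ rate schedule just mentioned; the loss matrix and the schedule constant are assumptions chosen for the example.

```python
import numpy as np

def hedge_discrete(losses, eta):
    """Discrete Hedge. `losses` is a (T, N) array of per-action losses in [0, M];
    `eta` maps a round index t >= 1 to the learning rate eta_t."""
    T, N = losses.shape
    cum_loss = np.zeros(N)          # cumulative loss L^(t) of each action
    player_loss = 0.0               # cumulative expected loss of the player
    for t in range(T):
        w = np.exp(-eta(t + 1) * (cum_loss - cum_loss.min()))  # shifted for stability
        x = w / w.sum()             # distribution x^(t) over the N actions
        player_loss += x @ losses[t]
        cum_loss += losses[t]
    return player_loss - cum_loss.min()   # regret against the best fixed action

# With eta_t = sqrt(8 log N / t), the regret is O(sqrt(T log N)).
rng = np.random.default_rng(0)
print(hedge_discrete(rng.uniform(size=(1000, 10)),
                     lambda t: np.sqrt(8 * np.log(10) / t)))
```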
We seek to generalize this regret bound to a setting in which the action set is a continuum, while making only mild assumptions on the geometry of the set and the class of loss functions. The logarithmic regret bound achieved by Hazan et al. (2007) requires the feasible set to be convex, and the loss functions to be exp-concave. We extend their analysis to a less restrictive class of problems, which only requires uniform fatness of the action set (a weaker condition than convexity) and uniform Lipschitz continuity of the loss functions. We show that under such assumptions the Hedge algorithm achieves a $O(\sqrt{t \log t})$ regret. Table 1 summarizes the regret bounds for these problem classes.

| | Gradient descent (Zinkevich) | Hedge (Hazan et al.) | Hedge (this paper) |
|---|---|---|---|
| Assumptions on $\ell^{(t)}$ | convex | $\alpha$-exp-concave | uniformly $L$-Lipschitz |
| Assumptions on $S$ | convex | convex | $v$-uniformly fat |
| Learning rates | $1/\sqrt{t}$ | constant | $\sqrt{\log t}/\sqrt{t}$ |
| $R^{(t)}/t$ | $O(1/\sqrt{t})$ | $O(t^{-1}\log t)$ | $O(\sqrt{\log t / t})$ |

Table 1. Some regret upper bounds for different classes of losses.

The online optimization model on a continuum has various applications, including machine learning (Hazan et al., 2007), portfolio optimization (Cover, 1991; Blum & Kalai, 1999), pricing with uncertain demand, and transmission power control over noisy channels (Cope, 2009). By relaxing the assumptions of convexity of the feasible set and exp-concavity of the loss functions, we extend the class of problems for which the Hedge algorithm provides bounds on the worst-case regret. For example, in the context of portfolio optimization, this allows for non-convex diversification constraints and non-convex transaction costs (Xidonas & Mavrotas, 2014).

In Section 2, we derive a general regret bound for Lipschitz losses. In Section 3, we specialize this bound to convex feasible sets, then relax the convexity assumption and show that the Hedge algorithm guarantees sublinear regret on uniformly fat sets. In Section 4, we study the dual averaging method and prove a regret bound on the set of Lebesgue-continuous distributions on $S$, then discuss how one can recover the Hedge regret bound as a special case. In Section 5, we compare the Hedge algorithm to learning on a finite cover of $S$. Finally, we illustrate these theoretical results with a numerical example in Section 6.

2. A General Regret Bound on Metric Spaces

Consider a compact metric space $(S, d)$ where $d$ is a distance function. Let $\nu$ be a reference probability measure on $S$, and denote by $\Delta_\nu(S)$ the set of probability measures that are absolutely continuous with respect to $\nu$. Let $\ell^{(t)} \in C^0(S, \mathbb{R}_+)$ denote the loss function at iteration $t$. We assume that the losses are bounded uniformly in $t$, i.e. there exists $M > 0$ such that $\ell^{(t)}(s) \in [0, M]$ for all $t \in \mathbb{N}$ and all $s \in S$. The decision maker chooses, at iteration $t$, a distribution over $S$, i.e. an element of $\Delta_\nu(S)$. Its density w.r.t. $\nu$ will be denoted by $x^{(t)}$. The Hedge algorithm with initial density $x^{(0)}$ and learning rates $(\eta_t)$ is defined by the sequence $(x^{(t)})$ of densities as follows:

$$x^{(t+1)}(s) = \frac{1}{Z^{(t)}}\, x^{(0)}(s)\, \exp\Big(-\eta_{t+1} \sum_{\tau=1}^{t} \ell^{(\tau)}(s)\Big) \qquad (1)$$

where $Z^{(t)}$ is the appropriate normalization constant, i.e. $Z^{(t)} = \mathbb{E}_{s \sim x^{(0)}}\big[\exp\big(-\eta_{t+1} \sum_{\tau=1}^{t} \ell^{(\tau)}(s)\big)\big]$. The Hedge algorithm is summarized in Algorithm 1.

Algorithm 1 Hedge algorithm with initial density $x^{(0)}$ and learning rates $(\eta_t)$
1: for $t \in \mathbb{N}$ do
2:   Choose action $s \sim x^{(t)}$
3:   Observe loss function $\ell^{(t)}$
4:   Update $x^{(t+1)}(s) \propto x^{(0)}(s)\, \exp\big(-\eta_{t+1} \sum_{\tau=1}^{t} \ell^{(\tau)}(s)\big)$
5: end for
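As an illustration (not part of the paper), here is a minimal sketch of Algorithm 1 on the unit box $S = [0,1]^2$ with $\nu$ the Lebesgue measure and $x^{(0)}$ uniform, assuming the losses are nonnegative and can be evaluated pointwise. Sampling from $x^{(t)}$ is done by rejection against the uniform proposal, which is valid since $\exp(-\eta L(s)) \le 1$; the acceptance rate degrades as $\eta_t L^{(t)}$ grows, and Section 6 describes a case where the Hedge density can instead be sampled in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hedge(cum_loss, eta, dim=2, max_tries=200000):
    """Draw s ~ x^(t), with density proportional to exp(-eta * L(s)) on [0,1]^dim,
    by rejection sampling. The uniform proposal dominates the unnormalized
    target because L >= 0 implies exp(-eta * L(s)) <= 1."""
    for _ in range(max_tries):
        s = rng.uniform(size=dim)                  # proposal from x^(0)
        if rng.uniform() < np.exp(-eta * cum_loss(s)):
            return s
    raise RuntimeError("no sample accepted; increase max_tries")

# Run Algorithm 1 for T rounds with 1-Lipschitz losses l_t(s) = |s - mu_t|_1 / 2.
T, past = 50, []
cum_loss = lambda s: sum(l(s) for l in past)
for t in range(1, T + 1):
    s = sample_hedge(cum_loss, np.sqrt(np.log(t + 1) / (t + 1)))  # play s ~ x^(t)
    mu = rng.uniform(size=2)                                      # l^(t) revealed
    past.append(lambda u, mu=mu: np.abs(u - mu).sum() / 2)        # accumulate it
```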
Define $r^{(t)}(s) = \mathbb{E}_{u \sim x^{(t)}}[\ell^{(t)}(u)] - \ell^{(t)}(s)$ as the instantaneous regret function at iteration $t$, and

$$R^{(t)} = \sup_{s \in S} \sum_{\tau=1}^{t} r^{(\tau)}(s) = \sum_{\tau=1}^{t} \mathbb{E}_{s \sim x^{(\tau)}}[\ell^{(\tau)}(s)] - \inf_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s) \qquad (2)$$

as the cumulative regret. The first term in this expression is the expected cumulative loss of the decision maker. The second term is the infimum of the cumulative loss function, which will be denoted by $L^{(t)}(s) := \sum_{\tau=1}^{t} \ell^{(\tau)}(s)$. Since $S$ is compact and $L^{(t)}$ is continuous, the infimum of $L^{(t)}$ is attained on $S$. We write $s_t^\star \in \arg\min_{s \in S} L^{(t)}(s)$ for a minimizer.

We start by giving a first regret bound for Lipschitz-continuous losses. This bound can be obtained as a consequence of Theorem 4.6 of (Audibert, 2009), and a similar result is proved by (Dalalyan & Salmon, 2012).

Lemma 1. The Hedge algorithm with non-increasing learning rates $(\eta_t)$ guarantees

$$R^{(t)} \le \frac{M^2}{8} \sum_{\tau=1}^{t} \eta_\tau + \xi(\eta_t, L^{(t)}) - L^{(t)}(s_t^\star) \qquad (3)$$

where $\xi : \mathbb{R}_+ \times L^2(S) \to \mathbb{R}$ is given by

$$\xi(\eta, L) = -\frac{1}{\eta} \log \int_S x^{(0)}(s) \exp(-\eta L(s))\, \nu(ds).$$

Proof. We have

$$\xi(\eta_{t+1}, L^{(t+1)}) - \xi(\eta_{t+1}, L^{(t)}) = -\frac{1}{\eta_{t+1}} \log \frac{\int x^{(0)}(s) \exp\big(-\eta_{t+1} \sum_{\tau=1}^{t+1} \ell^{(\tau)}(s)\big)\, \nu(ds)}{\int x^{(0)}(u) \exp\big(-\eta_{t+1} \sum_{\tau=1}^{t} \ell^{(\tau)}(u)\big)\, \nu(du)} = -\frac{1}{\eta_{t+1}} \log \int x^{(t+1)}(s) \exp\big(-\eta_{t+1} \ell^{(t+1)}(s)\big)\, \nu(ds) = -\frac{1}{\eta_{t+1}} \log \mathbb{E}_{x^{(t+1)}}\big[\exp\big(-\eta_{t+1} \ell^{(t+1)}\big)\big] \ge \mathbb{E}_{x^{(t+1)}}\big[\ell^{(t+1)}\big] - \frac{\eta_{t+1} M^2}{8},$$

where the last inequality follows from Hoeffding's lemma. Summing these inequalities, we find that

$$\sum_{\tau=1}^{t} \big(\xi(\eta_\tau, L^{(\tau)}) - \xi(\eta_\tau, L^{(\tau-1)})\big) \ge \sum_{\tau=1}^{t} \mathbb{E}_{x^{(\tau)}}[\ell^{(\tau)}] - \frac{M^2}{8} \sum_{\tau=1}^{t} \eta_\tau.$$

Rearranging, and observing that $\xi(\eta, L^{(0)}) = \xi(\eta, 0) = -\frac{1}{\eta} \log \int x^{(0)}(s)\, \nu(ds) = 0$, we obtain

$$\sum_{\tau=1}^{t} \mathbb{E}_{x^{(\tau)}}[\ell^{(\tau)}] \le \frac{M^2}{8} \sum_{\tau=1}^{t} \eta_\tau + \sum_{\tau=1}^{t-1} \big(\xi(\eta_\tau, L^{(\tau)}) - \xi(\eta_{\tau+1}, L^{(\tau)})\big) + \xi(\eta_t, L^{(t)}). \qquad (4)$$

Next, we show that each term of the second sum in (4) is non-positive. Since $\eta_{t+1} \le \eta_t$ by assumption, it suffices to show that, for any bounded Lipschitz function $f$, $\eta \mapsto \xi(\eta, f)$ is non-increasing. Calculating the partial derivative w.r.t. $\eta$ (using dominated convergence to differentiate under the integral), and writing $x_f$ for the density $x_f(s) = Z_f^{-1}\, x^{(0)}(s) \exp(-\eta f(s))$ with $Z_f$ the corresponding normalization constant, we obtain

$$\frac{\partial \xi}{\partial \eta}(\eta, f) = \frac{1}{\eta^2} \log \int x^{(0)}(s) e^{-\eta f(s)}\, \nu(ds) + \frac{1}{\eta} \cdot \frac{\int f(s)\, x^{(0)}(s) e^{-\eta f(s)}\, \nu(ds)}{\int x^{(0)}(s) e^{-\eta f(s)}\, \nu(ds)} = \frac{1}{\eta^2} \log Z_f + \frac{1}{\eta} \int f(s)\, x_f(s)\, \nu(ds) = \frac{1}{\eta^2} \log Z_f - \frac{1}{\eta^2} \int \log\big(e^{-\eta f(s)}\big)\, x_f(s)\, \nu(ds) = -\frac{1}{\eta^2} \int \log \frac{e^{-\eta f(s)}}{Z_f}\, x_f(s)\, \nu(ds) = -\frac{1}{\eta^2} \int \log \frac{x_f(s)}{x^{(0)}(s)}\, x_f(s)\, \nu(ds) = -\frac{1}{\eta^2} D_{\mathrm{KL}}(x_f, x^{(0)}). \qquad (5)$$

Thus $\frac{\partial \xi}{\partial \eta}(\eta, f)$ is proportional to the negative Kullback-Leibler divergence of $x_f$ with respect to $x^{(0)}$, therefore $\eta \mapsto \xi(\eta, f)$ is non-increasing. The bound (4) then reduces to

$$\sum_{\tau=1}^{t} \mathbb{E}_{x^{(\tau)}}[\ell^{(\tau)}] \le \frac{M^2}{8} \sum_{\tau=1}^{t} \eta_\tau + \xi(\eta_t, L^{(t)}),$$

and we conclude by subtracting $L^{(t)}(s_t^\star) = \min_{s \in S} L^{(t)}(s)$ from both sides. ∎
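The monotonicity of $\eta \mapsto \xi(\eta, f)$, which drives the proof, is easy to check numerically. The following sketch (not from the paper) takes $S = [0,1]$, $\nu$ Lebesgue, $x^{(0)}$ uniform, and an arbitrary bounded Lipschitz loss $f$, and verifies that $\xi$ decreases from $\mathbb{E}_{x^{(0)}}[f]$ (as $\eta \to 0$) toward $\min_s f(s)$ (as $\eta \to \infty$).

```python
import numpy as np

s = np.linspace(0.0, 1.0, 20001)                        # grid on S = [0, 1]
f = 0.5 * (s - 0.3) ** 2 + 0.1 * np.sin(8 * s) + 0.2    # bounded Lipschitz loss

def xi(eta):
    # xi(eta, f) = -(1/eta) log \int x0(s) exp(-eta f(s)) ds; with x0 uniform
    # on [0,1], the integral is the grid average of exp(-eta * f).
    return -np.log(np.mean(np.exp(-eta * f))) / eta

etas = np.geomspace(1e-3, 1e3, 400)
vals = np.array([xi(e) for e in etas])
assert np.all(np.diff(vals) <= 0)     # xi is non-increasing in eta, as in (5)
print(vals[0], f.mean())              # ~ E_{x0}[f] for small eta
print(vals[-1], f.min())              # ~ min_s f(s) for large eta
```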
Next, we refine this regret bound by bounding the difference $\xi(\eta_t, L^{(t)}) - L^{(t)}(s_t^\star)$. For any measurable subset $A \subseteq S$, we define the diameter $D(A) := \sup_{s, s' \in A} d(s, s')$ and the generalized volume $V_{x^{(0)}}(A) := \int_A x^{(0)}(s)\, \nu(ds)$.

Lemma 2. Suppose that the loss functions $\ell^{(t)}$ are $L$-Lipschitz uniformly in $t$. Consider a sequence $(S_t)$ of measurable subsets of $S$, such that $s_t^\star \in S_t$ for all $t$. Then the Hedge algorithm with non-increasing learning rates $(\eta_t)$ guarantees the following bound on the regret:

$$R^{(t)} \le \frac{M^2}{8} \sum_{\tau=1}^{t} \eta_\tau + t L D(S_t) - \frac{1}{\eta_t} \log V_{x^{(0)}}(S_t). \qquad (6)$$

Proof. Since the loss functions $\ell^{(t)}$ are uniformly $L$-Lipschitz, we have for all $s \in S_t$, $|\ell^{(\tau)}(s) - \ell^{(\tau)}(s_t^\star)| \le L\, d(s, s_t^\star) \le L D(S_t)$. Hence $\ell^{(\tau)}(s) \le \ell^{(\tau)}(s_t^\star) + L D(S_t)$, and $L^{(t)}(s) \le L^{(t)}(s_t^\star) + t L D(S_t)$. Therefore

$$\xi(\eta_t, L^{(t)}) \le -\frac{1}{\eta_t} \log \int_{S_t} x^{(0)}(s) \exp\big(-\eta_t L^{(t)}(s)\big)\, \nu(ds) \le -\frac{1}{\eta_t} \log \int_{S_t} x^{(0)}(s) \exp\big(-\eta_t (L^{(t)}(s_t^\star) + t L D(S_t))\big)\, \nu(ds) = L^{(t)}(s_t^\star) + t L D(S_t) - \frac{1}{\eta_t} \log \int_{S_t} x^{(0)}(s)\, \nu(ds) = L^{(t)}(s_t^\star) + t L D(S_t) - \frac{1}{\eta_t} \log V_{x^{(0)}}(S_t).$$

Combining this with Lemma 1 concludes the proof. ∎

Lemma 2 provides a regret bound in terms of any sequence $(S_t)$ of subsets of $S$, with each $S_t$ containing $s_t^\star$, an optimal decision in hindsight. However, this bound is only useful if one can construct such a sequence with appropriate relative decay rates of the diameters and generalized volumes. More precisely, we have the following corollary.

Corollary 1. Consider the Hedge algorithm with learning rates $(\eta_t)$ such that $\sum_{\tau=1}^{t} \eta_\tau = o(t)$. Suppose that there exists a sequence $(S_t)$ of subsets with $s_t^\star \in S_t$ for all $t$, and such that $D(S_t) = o(1)$ and $-\log V_{x^{(0)}}(S_t) = o(t \eta_t)$. Then the regret grows sublinearly, i.e. $\limsup_t R^{(t)}/t \le 0$.

In the next section, we give sufficient conditions on the action set $S$ that guarantee the existence of such a sequence.

3. Sublinear Regret in $\mathbb{R}^n$

We now restrict our attention to finite-dimensional Euclidean spaces. Let $S$ be a compact subset of $\mathbb{R}^n$. We start with the simple case of convex $S$, and construct a sequence $(S_t)$ using a homothetic transformation centered at $s_t^\star$, similarly to (Blum & Kalai, 1999). Unless stated otherwise, we make the following assumption for the remainder of this paper:

Assumption 1. The reference measure is the Lebesgue measure $\lambda$, and the initial distribution $x^{(0)}$ is the Lebesgue-uniform distribution over $S$, i.e. $x^{(0)}(s) = \frac{1}{\lambda(S)}$.

3.1. Sublinear Regret on Convex Sets

Lemma 3. If $S$ is a convex compact subset of $\mathbb{R}^n$ and $x^{(0)}$ is the Lebesgue-uniform probability density over $S$, then for any ratio $d_t \in (0, 1]$ the set $S_t$ defined by the homothetic transformation

$$S_t = \{\, s_t^\star + d_t (s - s_t^\star) : s \in S \,\} \qquad (7)$$

has diameter $D(S_t) = d_t D(S)$ and generalized volume $V_{x^{(0)}}(S_t) = \frac{\lambda(S_t)}{\lambda(S)} = d_t^n$.

Proof. We have

$$D(S_t) = \sup_{s, s' \in S_t} \|s - s'\| = \sup_{s, s' \in S} \big\| s_t^\star + d_t (s - s_t^\star) - s_t^\star - d_t (s' - s_t^\star) \big\| = d_t \sup_{s, s' \in S} \|s - s'\|.$$

Furthermore, using the change of variable $s' = s_t^\star + d_t (s - s_t^\star)$, $s \in S$, we can write

$$V_{x^{(0)}}(S_t) = \int_{S_t} x^{(0)}(s')\, \lambda(ds') = \frac{1}{\lambda(S)} \int_S |\det(d_t I_n)|\, \lambda(ds) = d_t^n. \;\; \blacksquare$$

Theorem 1 (Hedge on convex compact subsets of $\mathbb{R}^n$). Let $S \subset \mathbb{R}^n$ be convex and compact, and suppose that the $\ell^{(t)}$ are $L$-Lipschitz uniformly in $t$. Then under the Hedge algorithm with learning rates $\eta_t = \theta t^{-\alpha} \sqrt{\log t}$, $\alpha \in [0, 1)$, we have

$$\frac{R^{(t)}}{t} \le \frac{M^2 \theta}{8(1-\alpha)} \frac{\sqrt{\log t}}{t^{\alpha}} + \frac{L D(S)}{t} + \frac{n}{\theta} \frac{\sqrt{\log t}}{t^{1-\alpha}}. \qquad (8)$$

In particular, the per-round regret is $O(t^{-\alpha'} \sqrt{\log t})$, where $\alpha' = \min(\alpha, 1-\alpha)$.

Proof. Constructing the sequence $(S_t)$ as in Lemma 3, it follows from the regret bound (6) that

$$\frac{R^{(t)}}{t} \le \frac{M^2}{8t} \sum_{\tau=1}^{t} \eta_\tau + L D(S)\, d_t - \frac{n \log d_t}{t \eta_t}.$$

Bounding $\sum_{\tau=1}^{t} \eta_\tau \le \theta \sqrt{\log t} \sum_{\tau=1}^{t} \tau^{-\alpha} \le \theta \sqrt{\log t} \int_0^t \tau^{-\alpha}\, d\tau = \frac{\theta\, t^{1-\alpha} \sqrt{\log t}}{1-\alpha}$, we have

$$\frac{R^{(t)}}{t} \le \frac{M^2 \theta}{8(1-\alpha)} \frac{\sqrt{\log t}}{t^{\alpha}} + L D(S)\, d_t + \frac{n \log(1/d_t)}{\theta\, t^{1-\alpha} \sqrt{\log t}}.$$

Now (8) follows by taking $d_t = 1/t$. ∎

Corollary 2. Under the assumptions of Theorem 1, with $\eta_t = \eta$ constant, we have $\frac{R^{(t)}}{t} \le \frac{M^2 \eta}{8} + \frac{L D(S)}{t} + \frac{n \log t}{\eta t}$. For a given horizon $T$, we can choose $\eta = \frac{1}{M}\sqrt{8 n \log T / T}$ to minimize this bound, for which

$$\frac{R^{(T)}}{T} \le M \sqrt{\frac{n \log T}{2T}} + \frac{L D(S)}{T}. \qquad (9)$$

3.2. Sublinear Regret on Uniformly Fat Sets

The convexity assumption can, in fact, be relaxed, while keeping the same asymptotic rate of the regret. Intuitively, to be able to use the sequence $(S_t)$ of sets as constructed in Lemma 3, it suffices to find, for each $s_t^\star$, a convex set $K_t$ containing $s_t^\star$ such that its volume $V_{x^{(0)}}(K_t) = \frac{\lambda(K_t)}{\lambda(S)}$ is uniformly bounded below. This motivates the following relaxation of convexity.

Definition 3.1 (Uniform fatness). A set $S \subset \mathbb{R}^n$ is $v$-uniformly fat w.r.t. the density $x$ if, for all $s \in S$, there exists a convex set $K_s \subseteq S$ such that $s \in K_s$ and $V_x(K_s) = \int_{K_s} x(s')\, \lambda(ds') \ge v$.

Intuitively, the uniform fatness property ensures that there is sufficient volume around any point of the set, so that it is possible to assign sufficient probability mass around the optimal point $s_t^\star$ in particular. Note that uniform fatness excludes isolated points, but does not require the set to be connected.

Figure 1. Illustration of the uniform fatness condition (left) and the construction of the set $S_t = s_t^\star + d_t (K_{s_t^\star} - s_t^\star)$ in the proof of Theorem 2 (right).
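As a toy illustration of Definition 3.1 (not from the paper), consider the nonconvex L-shaped set $S = [0,1]^2 \setminus (0.5, 1]^2$: every point of $S$ lies in one of three axis-aligned squares of side $1/2$ contained in $S$, so $S$ is $v$-uniformly fat w.r.t. the uniform density with $v = (1/4)/\lambda(S) = 1/3$. The sketch below checks the covering property on random points of $S$.

```python
import numpy as np

# Three side-1/2 squares contained in the L-shape S = [0,1]^2 \ (0.5,1]^2.
squares = [((0.0, 0.0), (0.5, 0.5)),
           ((0.5, 0.0), (1.0, 0.5)),
           ((0.0, 0.5), (0.5, 1.0))]

def witness(s):
    """Return a convex K_s (one of the squares) containing s, or None."""
    for (x0, y0), (x1, y1) in squares:
        if x0 <= s[0] <= x1 and y0 <= s[1] <= y1:
            return ((x0, y0), (x1, y1))
    return None

rng = np.random.default_rng(2)
pts = rng.uniform(size=(100000, 2))
pts = pts[~((pts[:, 0] > 0.5) & (pts[:, 1] > 0.5))]   # keep only the points in S
assert all(witness(s) is not None for s in pts)
print("S is v-uniformly fat with v =", 0.25 / 0.75)   # each K_s has V(K_s) = 1/3
```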
We are now ready to give a regret bound for the Hedge algorithm on uniformly fat sets.

Theorem 2. Let $x^{(0)}$ be Lebesgue-uniform, and suppose that $S$ is $v$-uniformly fat w.r.t. $x^{(0)}$ and that the loss functions are $L$-Lipschitz uniformly in time. Then the regret of the Hedge algorithm with learning rates $\eta_t = \theta t^{-\alpha} \sqrt{\log t}$, $\alpha \in [0, 1)$, satisfies

$$\frac{R^{(t)}}{t} \le \frac{M^2 \theta}{8(1-\alpha)} \frac{\sqrt{\log t}}{t^{\alpha}} + \frac{L D(S)}{t} + \frac{n \log t + \log \frac{1}{v}}{\theta\, t^{1-\alpha} \sqrt{\log t}}. \qquad (10)$$

In particular, if $S$ is convex, then it is 1-uniformly fat, and Theorem 1 becomes a special case of Theorem 2.

Proof. Since $S$ is $v$-uniformly fat, for all $t$ there exists a convex measurable subset $K_{s_t^\star} \subseteq S$ with $s_t^\star \in K_{s_t^\star}$ and $V_{x^{(0)}}(K_{s_t^\star}) \ge v$. Similarly to (7), define $S_t$ as the homothetic transformation of $K_{s_t^\star}$ with center $s_t^\star$ and ratio $d_t$. By Lemma 3, we have $D(S_t) = d_t D(K_{s_t^\star}) \le d_t D(S)$ and $V_{x^{(0)}}(S_t) = d_t^n\, V_{x^{(0)}}(K_{s_t^\star}) \ge d_t^n v$. Applying the regret bound (6) with $\eta_t = \theta t^{-\alpha} \sqrt{\log t}$, we have

$$\frac{R^{(t)}}{t} \le \frac{M^2 \theta}{8(1-\alpha)} \frac{\sqrt{\log t}}{t^{\alpha}} + L D(S_t) - \frac{\log V_{x^{(0)}}(S_t)}{\theta\, t^{1-\alpha} \sqrt{\log t}} \le \frac{M^2 \theta}{8(1-\alpha)} \frac{\sqrt{\log t}}{t^{\alpha}} + d_t\, L D(S) - \frac{\log(d_t^n v)}{\theta\, t^{1-\alpha} \sqrt{\log t}},$$

and we conclude by taking $d_t = 1/t$. ∎

Corollary 3. Under the assumptions of Theorem 2, with constant learning rate $\eta_t = \eta$, we have $\frac{R^{(t)}}{t} \le \frac{M^2 \eta}{8} + \frac{L D(S)}{t} + \frac{n \log t - \log v}{\eta t}$. For a given horizon $T$, we can choose $\eta_T = M^{-1} \sqrt{8 (n \log T - \log v)/T}$ to minimize this bound, for which

$$\frac{R^{(T)}}{T} \le M \sqrt{\frac{n \log T - \log v}{2T}} + \frac{L D(S)}{T}. \qquad (11)$$

Remark 1. If $S \subset \mathbb{R}^n$ is a lower-dimensional manifold, it is not uniformly fat with respect to the Lebesgue measure on $\mathbb{R}^n$. However, if it is homeomorphic to a uniformly fat $S' \subset \mathbb{R}^m$, $m < n$, then one can run the Hedge algorithm on $S'$ instead.
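To make Corollary 3 concrete, the tuned constant rate and the resulting bound can be evaluated directly. The sketch below (an illustration, not from the paper) plugs in the constants of the numerical example of Section 6 ($M = 10$, $L = 5$, $D(S) = 5.83$, $v = 0.273$, $n = 2$, $T = 10^4$).

```python
import numpy as np

M, L, D, v, n, T = 10.0, 5.0, 5.83, 0.273, 2, 10**4

eta_T = np.sqrt(8 * (n * np.log(T) - np.log(v)) / T) / M      # Corollary 3 rate
bound = M * np.sqrt((n * np.log(T) - np.log(v)) / (2 * T)) + L * D / T
print(f"eta_T = {eta_T:.4f}")    # ~ 0.0126: very small, cf. footnote 5 below
print(f"bound = {bound:.3f}")    # per-round regret bound ~ 0.317 at horizon T
```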
4. Dual Averaging on $L^2(S)$

In this section, we study a more general family of algorithms based on the dual averaging method. Dual averaging (Nesterov, 2009) is a general method for solving constrained optimization problems. It was applied to online learning on a convex subset of $\mathbb{R}^n$, for example in (Xiao, 2010) and (Bubeck, 2014). Building on these ideas, we propose to apply it to our problem of learning on uniformly fat sets.

Consider a Hilbert space $E$, and a feasible set $X \subseteq E$, assumed closed and convex. Given a sequence $(\ell^{(t)})$ of linear functionals in the dual space $E^\star$, the method projects, at each step, the cumulative dual vector $L^{(t)} = \sum_{\tau=1}^{t} \ell^{(\tau)}$ onto the feasible set, using a Bregman projection. This is summarized in Algorithm 2. In constrained convex optimization, one seeks to minimize a convex function $f$ over $X$, and the dual vectors $\ell^{(t)}$ are taken to be subgradients of $f$ at the current iterate; but dual averaging provides regret guarantees without requiring $\ell^{(t)}$ to be subgradient vectors.

Algorithm 2 Dual averaging method with input sequence $(\ell^{(t)})$ and learning rates $(\eta_t)$
1: for $t \in \mathbb{N}$ do
2:   $L^{(t)} = \sum_{\tau=1}^{t} \ell^{(\tau)}$
3:   $x^{(t+1)} = \arg\min_{x \in X} \big\langle L^{(t)}, x \big\rangle + \frac{1}{\eta_{t+1}} \psi(x) \qquad (12)$
4: end for

The function $\psi$ in Algorithm 2 is assumed to be $\ell_\psi$-strongly convex with respect to a norm¹ $\|\cdot\|$ on $E$, that is, $\psi(x) \ge \psi(y) + \langle \nabla \psi(y), x - y \rangle + \frac{\ell_\psi}{2} \|x - y\|^2$ for all $x, y \in X$. To simplify the discussion, we also assume, without loss of generality, that $\inf_{x \in X} \psi(x) = 0$.

¹ The reference norm need not necessarily be the norm induced by the inner product on $E$.

The dual averaging update (12) can be written in terms of the Legendre-Fenchel transform of $\psi$: let $\psi^\star(L) = -\inf_{x \in X} \big( \psi(x) - \langle L, x \rangle \big)$. Note that the infimum is attained and the minimizer is unique since $\psi$ is strongly convex and $X$ is closed and convex (Theorem 11.9 in (Bauschke & Combettes, 2011)). Then $\nabla \psi^\star(L) = \arg\min_{x \in X} \big( \psi(x) - \langle L, x \rangle \big)$, and the update (12) can be written $x^{(t+1)} = \nabla \psi^\star(-\eta_{t+1} L^{(t)})$.

To apply the dual averaging method to our problem of learning on $S$, let $E = L^2(S)$, the Lebesgue space of square-integrable functions² on $S$, endowed with the inner product $\langle f, g \rangle = \int_S f(s) g(s)\, \lambda(ds)$, and the feasible set $X := \{ f \in L^2(S) : f \ge 0 \text{ a.e. and } \int_S f(s)\, \lambda(ds) = 1 \}$. Note that while $X$ is closed and convex, it may be unbounded³. An element $f \in X$ will be identified with the probability distribution with density $f$. The dual space is $E^\star = L^2(S)$, and since $S$ is compact, $E^\star$ contains, in particular, the set $C^0(S)$ of continuous functions on $S$.

² An element of $E$ is, in fact, an equivalence class of functions equal almost everywhere.
³ Consider for example the simple case $S = [0, 1]$ and the sequence $f_n = n \mathbf{1}_{[0, 1/n]}$, for which $\|f_n\|_1 = 1$ but $\|f_n\|_2 = \sqrt{n}$.

We next show that the regret of the dual averaging method grows sublinearly under appropriate assumptions on the feasible set $S$ and the regularizer $\psi$. The result extends the regret bound of (Nesterov, 2009) to $L^2(S)$. We will use the following lemma, which can be proved following Lemma 1 in (Nesterov, 2009), mutatis mutandis.

Lemma 4. If $\psi$ is $\ell_\psi$-strongly convex w.r.t. $\|\cdot\|$, then $\psi^\star$ is $\frac{1}{\ell_\psi}$-smooth w.r.t. the dual norm $\|\cdot\|_\star$, that is, for all $x, y$,

$$\psi^\star(x) - \psi^\star(y) - \langle \nabla \psi^\star(y), x - y \rangle \le \frac{1}{2 \ell_\psi} \|x - y\|_\star^2.$$

Lemma 5 (Dual averaging regret bound). Suppose that there exists $M > 0$ such that $\|\ell^{(t)}\|_\star \le M$ for all $t$. Then under the dual averaging method with non-increasing learning rates $(\eta_t)$, for all $x \in X$,

$$\sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} - x \big\rangle \le \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2 \ell_\psi} \sum_{\tau=1}^{t} \eta_\tau. \qquad (13)$$

Proof. We use a similar argument to the proof of Lemma 1. Define the potential function $\xi : \mathbb{R}_+ \times L^2(S) \to \mathbb{R}$,

$$\xi(\eta, L) = -\frac{1}{\eta} \psi^\star(-\eta L) = \frac{1}{\eta} \inf_{x \in X} \big( \langle \eta L, x \rangle + \psi(x) \big).$$

We first show the following inequality:

$$\big\langle x^{(t)}, \ell^{(t)} \big\rangle \le \xi(\eta_t, L^{(t)}) - \xi(\eta_{t-1}, L^{(t-1)}) + \frac{\eta_t}{2 \ell_\psi} \|\ell^{(t)}\|_\star^2. \qquad (14)$$

Since $\psi$ is $\ell_\psi$-strongly convex, by Lemma 4, $\psi^\star$ is $\frac{1}{\ell_\psi}$-smooth, therefore

$$\psi^\star(-\eta_t L^{(t)}) \le \psi^\star(-\eta_t L^{(t-1)}) + \big\langle \nabla \psi^\star(-\eta_t L^{(t-1)}), -\eta_t \ell^{(t)} \big\rangle + \frac{1}{2 \ell_\psi} \|\eta_t \ell^{(t)}\|_\star^2 = \psi^\star(-\eta_t L^{(t-1)}) - \eta_t \big\langle x^{(t)}, \ell^{(t)} \big\rangle + \frac{\eta_t^2}{2 \ell_\psi} \|\ell^{(t)}\|_\star^2.$$

Dividing by $\eta_t$ and rearranging, we have $\big\langle x^{(t)}, \ell^{(t)} \big\rangle \le \xi(\eta_t, L^{(t)}) - \xi(\eta_t, L^{(t-1)}) + \frac{\eta_t}{2 \ell_\psi} \|\ell^{(t)}\|_\star^2$. To prove (14), it then suffices to show that $\eta \mapsto \xi(\eta, L)$ is non-increasing, since $\eta_t \le \eta_{t-1}$. Taking the derivative with respect to $\eta$,

$$\frac{\partial \xi}{\partial \eta}(\eta, L) = \frac{1}{\eta^2} \psi^\star(-\eta L) + \frac{1}{\eta} \big\langle L, \nabla \psi^\star(-\eta L) \big\rangle = \frac{1}{\eta^2} \big( \psi^\star(-\eta L) + \langle \eta L, \nabla \psi^\star(-\eta L) \rangle \big) \le \frac{1}{\eta^2} \psi^\star(0) \;\; \text{by convexity of } \psi^\star = -\frac{1}{\eta^2} \inf_{x \in X} \psi(x) = 0,$$

which proves inequality (14). Summing over $\tau \in \{1, \ldots, t\}$, and using the bound on $\|\ell^{(t)}\|_\star$, we have

$$\sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} \big\rangle \le \xi(\eta_t, L^{(t)}) - \xi(\eta_0, L^{(0)}) + \frac{M^2}{2 \ell_\psi} \sum_{\tau=1}^{t} \eta_\tau.$$

Finally, by definition of $\xi$, $\xi(\eta_0, L^{(0)}) = \frac{1}{\eta_0} \inf_{x \in X} \psi(x) = 0$ and, for any $x \in X$,

$$\xi(\eta_t, L^{(t)}) = \frac{1}{\eta_t} \inf_{x' \in X} \big( \langle \eta_t L^{(t)}, x' \rangle + \psi(x') \big) \le \big\langle L^{(t)}, x \big\rangle + \frac{1}{\eta_t} \psi(x),$$

so that

$$\sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} \big\rangle \le \big\langle L^{(t)}, x \big\rangle + \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2 \ell_\psi} \sum_{\tau=1}^{t} \eta_\tau,$$

which proves the claim. ∎
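As a finite-dimensional illustration of update (12) (a sketch, not from the paper), take $X$ to be the probability simplex in $\mathbb{R}^N$ and $\psi(x) = \frac{1}{2}\|x\|_2^2$, which is 1-strongly convex w.r.t. $\|\cdot\|_2$; the argmin in (12) is then the Euclidean projection of $-\eta_{t+1} L^{(t)}$ onto the simplex. Example 1 below gives the entropic counterpart that recovers Hedge.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

def dual_averaging(loss_vectors, eta):
    """Update (12) with psi(x) = 0.5 * ||x||^2 on the simplex:
    x^(t+1) = argmin_x <L^(t), x> + ||x||^2 / (2 eta_{t+1})
            = project_simplex(-eta_{t+1} * L^(t))."""
    L = np.zeros(len(loss_vectors[0]))
    x = np.full_like(L, 1.0 / len(L))     # x^(1) = argmin psi = uniform
    iterates = [x]
    for t, ell in enumerate(loss_vectors, start=1):
        L = L + ell
        iterates.append(project_simplex(-eta(t + 1) * L))
    return iterates

rng = np.random.default_rng(3)
xs = dual_averaging(rng.uniform(size=(200, 5)), lambda t: 1.0 / np.sqrt(t))
print(xs[-1])    # concentrates on coordinates with small cumulative loss
```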
Lemma 5 gives an upper bound on the regret with respect to elements of $X$, the set of Lebesgue-continuous densities. This also provides a bound on the regret (2) with respect to elements of $S$, as defined in Section 2, by observing that when $S$ is uniformly fat⁴,

$$R^{(t)} = \sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} \big\rangle - \min_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s) = \sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} \big\rangle - \inf_{x \in X} \Big\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \Big\rangle.$$

⁴ Proof in the supplementary material.

In particular, if $\psi$ is bounded on $X$, then it follows from Lemma 5 that

$$R^{(t)} \le \frac{1}{\eta_t} \sup_{x \in X} \psi(x) + \frac{M^2}{2 \ell_\psi} \sum_{\tau=1}^{t} \eta_\tau,$$

which implies that the regret is sublinear for $\eta_t = \Theta(t^{-\alpha})$, $\alpha \in (0, 1)$, for any sequence of continuous losses. However, when $X$ is unbounded, so is $\psi$ by strong convexity. But one can still obtain a sublinear bound on the regret for Lipschitz losses, as stated in the following theorem.

Theorem 3 (Dual averaging regret for Lipschitz losses). Suppose that $\ell^{(t)}$ is $L$-Lipschitz and $\|\ell^{(t)}\|_\star \le M$, uniformly in $t$. Then the dual averaging method with learning rates $(\eta_t)$ guarantees the following bound on the regret: for any positive sequence $(d_t)$,

$$\frac{R^{(t)}}{t} \le \frac{M^2}{2 \ell_\psi t} \sum_{\tau=1}^{t} \eta_\tau + L D(S)\, d_t + \frac{1}{t \eta_t} \inf_{x \in B_t} \psi(x), \qquad (15)$$

where $B_t \subset X$ denotes the set of Lebesgue-continuous densities supported on $B(s_t^\star, D(S) d_t)$.

Proof. Since the losses are $L$-Lipschitz, we have, for all $x \in B_t$,

$$\Big\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \Big\rangle = \int \sum_{\tau=1}^{t} \ell^{(\tau)}(s)\, x(s)\, \lambda(ds) \le \int \sum_{\tau=1}^{t} \big( \ell^{(\tau)}(s_t^\star) + L D(S) d_t \big)\, x(s)\, \lambda(ds) = L^{(t)}(s_t^\star) + t L D(S)\, d_t.$$

Thus, for all $x \in B_t$,

$$R^{(t)} = \sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} \big\rangle - L^{(t)}(s_t^\star) \le \sum_{\tau=1}^{t} \big\langle \ell^{(\tau)}, x^{(\tau)} - x \big\rangle + t L D(S)\, d_t \le \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2 \ell_\psi} \sum_{\tau=1}^{t} \eta_\tau + t L D(S)\, d_t,$$

where the last inequality uses Lemma 5. We conclude by dividing by $t$ and taking the infimum over $x \in B_t$. ∎

In the finite-dimensional case, the Hedge algorithm is known to be an instance of the dual averaging method, when the feasible set $X$ is the simplex and the distance-generating function $\psi$ is the negative entropy, as observed by Nesterov (2009), Beck & Teboulle (2003), and many others. This is also true in the infinite-dimensional case, as discussed for example in (Audibert, 2009) and in Chapter 5 of (Catoni, 2004). Next, we apply the regret bound of Theorem 3 to the Hedge algorithm.

Example 1 (Hedge algorithm, or entropic dual averaging). Let the reference measure be the Lebesgue measure on $S$, denoted by $\lambda$, and let $\psi$ be the generalized negative entropy, defined by

$$\psi(f) = \int_S f(s) \log f(s)\, \lambda(ds) + \log \lambda(S). \qquad (16)$$

Note that $\psi(f)$ is the Kullback-Leibler divergence of $f$ with respect to the Lebesgue-uniform distribution. By Pinsker's inequality, $\psi$ is 1-strongly convex with respect to the total variation norm $\|f\| = \int |f(s)|\, \lambda(ds)$. Furthermore, $\psi$ is non-negative since it is minimal at the Lebesgue-uniform distribution, for which it takes the value 0. One can show that $\psi^\star(f) = \log \frac{1}{\lambda(S)} \int_S \exp(f(s))\, \lambda(ds)$, thus the definitions of $\xi(\eta, L)$ in Lemmas 1 and 5 coincide. With this choice of $X$ and $\psi$, the dual averaging update can be shown to be equal to the Hedge density (1):

Proposition 1. Let $L^{(t)} \in E^\star$, and consider the dual averaging iteration (12) with $\psi$ the negative entropy (16). Then the solution $x^{(t+1)}$ is given by the Hedge update rule

$$x^{(t+1)}(s) = \frac{1}{Z^{(t)}} e^{-\eta_{t+1} L^{(t)}(s)}$$

with normalization constant $Z^{(t)} = \int_S e^{-\eta_{t+1} L^{(t)}(s)}\, \lambda(ds)$.

A proof is provided in the supplementary material for completeness. By Theorem 3, the regret of the Hedge algorithm with Lipschitz losses is then bounded as follows:

$$\frac{R^{(t)}}{t} \le \frac{M^2}{2t} \sum_{\tau=1}^{t} \eta_\tau + L D(S)\, d_t + \frac{1}{t \eta_t} \inf_{x \in B_t} \psi(x).$$

In order to bound the last term, consider a subset $S_t \subseteq B(s_t^\star, D(S) d_t)$, and take $x$ to be the uniform distribution over $S_t$ (thus $x \in B_t$). Then

$$\psi(x) = \log \lambda(S) + \int_{S_t} \frac{1}{\lambda(S_t)} \log \frac{1}{\lambda(S_t)}\, \lambda(ds) = \log \frac{\lambda(S)}{\lambda(S_t)}.$$

Now, if $S$ is $v$-uniformly fat, then there exists a convex set $K_t$ containing $s_t^\star$ such that $\lambda(K_t)/\lambda(S) \ge v$, and constructing $S_t$ as in the proof of Theorem 2, as the homothetic transform of $K_t$ with center $s_t^\star$ and ratio $d_t$, we have $D(S_t) \le D(S) d_t$ and $\lambda(S_t)/\lambda(S) \ge v d_t^n$; therefore $\inf_{x \in B_t} \psi(x) \le -\log(v d_t^n)$, and taking $d_t = 1/t$,

$$\frac{R^{(t)}}{t} \le \frac{M^2}{2t} \sum_{\tau=1}^{t} \eta_\tau + \frac{L D(S)}{t} + \frac{n \log t - \log v}{t \eta_t},$$

which results in a regret bound similar to Theorem 2.
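Proposition 1 can be sanity-checked numerically after discretizing $S$ into $N$ cells of equal volume, in which case $\psi$ in (16) becomes $x \mapsto \sum_i x_i \log(N x_i)$ on the simplex and the Hedge density becomes a softmax; a generic solver applied to (12) should recover it. This is a check, not from the paper, and it assumes scipy is available.

```python
import numpy as np
from scipy.optimize import minimize

N, eta = 20, 0.7
rng = np.random.default_rng(4)
L = rng.uniform(0, 5, size=N)                    # cumulative losses per cell

w = np.exp(-eta * L)
hedge = w / w.sum()                              # closed form of Proposition 1

def objective(x):                                # <L, x> + psi(x) / eta
    return L @ x + (x @ np.log(N * x)) / eta

res = minimize(objective, np.full(N, 1.0 / N), bounds=[(1e-12, 1.0)] * N,
               constraints={"type": "eq", "fun": lambda x: x.sum() - 1.0})
print(np.abs(res.x - hedge).max())               # ~ 0 up to solver tolerance
```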
5. Learning on a Finite Cover

In this section, we briefly compare our method to a related idea: for a given horizon, compute a finite cover of the set, such that the maximum difference of losses on each element of the cover is small enough, and then run a discrete learning algorithm on the finite cover. More precisely, fixing a horizon $T$ and a constant $\epsilon_T > 0$, suppose that we can compute a finite cover $\mathcal{A}_T$ of $S$ such that, for all $S_T \in \mathcal{A}_T$,

$$\sup_{s, s' \in S_T} |\ell^{(t)}(s) - \ell^{(t)}(s')| \le \epsilon_T. \qquad (17)$$

If we call $\bar{R}^{(t)}$ the regret with respect to the discrete set, then running the discrete Hedge algorithm on this finite cover with learning rate $\eta$ guarantees (Bubeck & Cesa-Bianchi, 2012) that

$$\frac{\bar{R}^{(T)}}{T} \le \frac{M^2 \eta}{8} + \frac{\log |\mathcal{A}_T|}{\eta T},$$

and the optimal $\eta$ given the horizon $T$ yields the regret bound $\bar{R}^{(T)}/T = O\big(\sqrt{\log |\mathcal{A}_T| / T}\big)$. Since we incur at most $\epsilon_T$ additional per-round regret due to the variation of losses within each element of the cover, we have that $R^{(T)}/T = O\big(\sqrt{\log |\mathcal{A}_T| / T} + \epsilon_T\big)$. Since the loss functions are $L$-Lipschitz, a sufficient condition for (17) to hold is to have each element $S_T$ of the cover have diameter $D(S_T) \le \epsilon_T / L$. Thus, the size of the cover is typically $|\mathcal{A}_T| = O(1/\epsilon_T^n)$. Under this estimate of the size of $\mathcal{A}_T$, we have $R^{(T)}/T = O\big(\sqrt{n \log(1/\epsilon_T)/T} + \epsilon_T\big)$, and choosing $\epsilon_T = 1/\sqrt{T}$, we obtain the bound

$$\frac{R^{(T)}}{T} = O\Big(\sqrt{\frac{n \log T}{T}}\Big),$$

which matches the regret rate of Corollary 3.

The Hedge algorithm on uniformly fat sets is conceptually similar to the idea of working with a finite cover. This is most visible in the proof of Theorem 2, where we rely on the existence of a set around $s_t^\star$ with an appropriate relationship between diameter and volume. To apply the Hedge algorithm, one needs to sample from the distributions $x^{(t)}$, without having to explicitly construct a cover. Sampling from the distribution can be more tractable, as is the case in the example of Section 6.

6. Numerical Results

We test our algorithm on a numerical example in $\mathbb{R}^2$ with convex quadratic loss functions of the form

$$\ell^{(t)}(s) = \tfrac{1}{2}(s - \mu_t)^\top Q_t (s - \mu_t) + c_t,$$

restricted to the domain $S \subset \mathbb{R}^2$ shown in Figure 2.

Figure 2. The set $S$ for the numerical example.

If $x^{(0)}$ is the uniform distribution over $S$, one can show that

$$x^{(t+1)}(s) \propto |\bar{Q}_t|^{1/2} \exp\big(-\tfrac{1}{2}(s - \bar{\mu}_t)^\top \bar{Q}_t (s - \bar{\mu}_t)\big)$$

on $S$, so $x^{(t+1)}$ is a multivariate Gaussian distribution restricted to $S$, with precision (inverse covariance) matrix $\bar{Q}_t = \eta_{t+1} \sum_{\tau \le t} Q_\tau$ and mean $\bar{\mu}_t = \bar{Q}_t^{-1}\, \eta_{t+1} \sum_{\tau \le t} Q_\tau \mu_\tau$. Hence the size of the parameter space required to represent the cumulative loss (and thus the Hedge densities) is independent of the horizon.

Remark 2. Since the Hedge distributions are, in this case, multivariate Gaussians restricted to $S$, sampling from these distributions can be done efficiently. This example is one instance of a problem in which one can directly sample from the Hedge distributions without having to maintain a discrete cover. More generally, the complexity of the Hedge algorithm depends on the complexity of sampling from the Hedge distributions.

For a simulation horizon of $T = 10^4$, we randomly generated the parameters $\mu_t$, $Q_t$ and $c_t$ of the loss functions subject to the uniform bounds $M = 10$ and $L = 5$. The set $S$ has diameter $D = 5.83$ and is $v$-uniformly fat with $v = 0.273$. We simulated the algorithm 2500 times for each of the different choices of the learning rates. Figure 3 shows means (solid lines), regret bounds (dashed lines) and regions between the 10% and 90% quantiles (shaded) of the per-round cumulative regret over these simulations.
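A sketch of the sampling step used here (illustrative; the $Q_t$, $\mu_t$ below are stand-ins, and the box is a stand-in for the set $S$ of Figure 2): accumulate the natural parameters of the Gaussian and rejection-sample against membership in $S$.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_restricted_gaussian(Q_bar, mu_bar, in_S, max_tries=100000):
    """Sample from N(mu_bar, Q_bar^{-1}) restricted to S: draw from the
    unrestricted Gaussian and keep the first draw that lands in S."""
    cov = np.linalg.inv(Q_bar)
    for _ in range(max_tries):
        s = rng.multivariate_normal(mu_bar, cov)
        if in_S(s):
            return s
    raise RuntimeError("Gaussian mass on S too small; increase max_tries")

# Representing x^(t+1) takes O(n^2) parameters, independent of the horizon:
eta = 0.05
Qs = [np.eye(2) + (lambda A: A @ A.T)(rng.normal(size=(2, 2))) for _ in range(10)]
mus = [rng.uniform(-1.0, 1.0, size=2) for _ in range(10)]
Q_bar = eta * sum(Qs)                                   # precision matrix
mu_bar = np.linalg.solve(Q_bar, eta * sum(Q @ m for Q, m in zip(Qs, mus)))
in_S = lambda s: np.all(np.abs(s) <= 2.0)               # stand-in for S
print(sample_restricted_gaussian(Q_bar, mu_bar, in_S))
```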
To determine the regret, the computation of the best choice in hindsight for each period $t$ was performed by solving multiple quadratic programs on a convex decomposition of $S$.

Figure 3. Mean time-average cumulative regret (solid), 10% and 90% quantiles (shaded regions) and worst-case bounds (dashed).

We first verify from Figure 3 that the regret bound (10) is satisfied. Since the loss functions are generated randomly (and not adversarially), the regrets observed in simulation are much smaller than the theoretical regret bounds. We further note that the optimal constant learning rates $\eta_{\mathrm{opt}}$ from Corollary 3 are outperformed by higher constant learning rates⁵, an observation familiar from the discrete case (Even-Dar et al., 2008; Koolen et al., 2014).

⁵ Due to the rather weak assumptions on the action set and the loss functions, the optimal learning rate $\eta_{\mathrm{opt}}(T)$ for the known horizon $T$ derived from Corollary 3 is very small.

Finally, Table 2 compares the decay rates of the per-round cumulative regret (solid lines in Figure 3), estimated using a linear regression of $\log \frac{R^{(t)}}{t}$ on $\log t$, with those of the corresponding theoretical bound (10) (dashed lines). The observed rates of decay are higher than those of the theoretical bounds.

| $\eta_t$ | simulation | bound | bound ($t \to \infty$) |
|---|---|---|---|
| $\propto t^{-0.15}$ | 0.778 | 0.173 | 0.150 |
| $\propto t^{-0.3}$ | 0.644 | 0.397 | 0.300 |

Table 2. Decay rates of the per-round regret.
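The decay rates in the "simulation" column are slopes of such a log-log regression; below is a sketch of the estimate (with synthetic stand-in data, since the simulation traces are not reproduced here).

```python
import numpy as np

def decay_rate(ts, per_round_regret):
    """Fit log(R(t)/t) ~ c - rate * log(t); return the estimated rate."""
    slope, _ = np.polyfit(np.log(ts), np.log(per_round_regret), 1)
    return -slope

ts = np.arange(10, 10001)
noise = np.exp(0.05 * np.random.default_rng(6).normal(size=ts.size))
trace = 3.0 * ts ** -0.778 * noise     # synthetic trace with known rate 0.778
print(decay_rate(ts, trace))           # recovers ~ 0.778
```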
7. Concluding Remarks

We studied an online optimization problem over a compact subset $S$ of $\mathbb{R}^n$. Previous work shows that when $S$ is convex, one can achieve $O(\sqrt{t})$ cumulative regret using a gradient descent method when the losses are convex, and $O(\log t)$ cumulative regret using a generalized Hedge algorithm when the losses are uniformly exp-concave. We consider Lipschitz losses, and relax the convexity assumption on $S$. In particular, we show that as long as the set is uniformly fat, i.e. there exists a convex set of minimal volume $v$ around every point of the set, the generalized Hedge algorithm achieves $O(\sqrt{t \log t})$ regret. We further proved a regret bound for the dual averaging method, a generalization of the Hedge algorithm.

A question which remains open is whether the uniform fatness condition is necessary for a given class of algorithms to achieve sublinear regret. A related problem, which is not studied here, is bandit learning on a continuum, in which only the current loss value $\ell^{(t)}(s^{(t)})$ is revealed. This problem is studied for example by Bubeck et al. (2011) for Lipschitz losses and general topological spaces. A hierarchical algorithm is proposed, which achieves a $O(t^{d_1} (\log t)^{d_2})$ regret, where $d_1, d_2 < 1$ depend on the geometry of the problem. The algorithm requires explicitly computing a hierarchical cover of the set. One question is whether one can generalize the Hedge algorithm to such a bandit setting, so that sublinear regret can be achieved without the need to explicitly maintain a cover.

References

Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Audibert, Jean-Yves. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.

Bauschke, Heinz H. and Combettes, Patrick L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, 2011.

Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, May 2003.

Blum, Avrim and Kalai, Adam. Universal portfolios with and without transaction costs. Machine Learning, 35(3):193–205, 1999.

Bubeck, Sébastien. Theory of convex optimization for machine learning. arXiv, 2014.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Bubeck, Sébastien, Munos, Rémi, Stoltz, Gilles, and Szepesvári, Csaba. X-armed bandits. Journal of Machine Learning Research (JMLR), 12(12):1587–1627, 2011.

Catoni, Olivier. Statistical Learning Theory and Stochastic Optimization, École d'Été de Probabilités de Saint-Flour XXXI-2001, volume 1851. Springer, 2004.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cope, Eric W. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243–1253, June 2009.

Cover, Thomas M. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

Dalalyan, Arnak S. and Salmon, Joseph. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012.

Even-Dar, Eyal, Kearns, Michael, Mansour, Yishay, and Wortman, Jennifer. Regret to the best vs. regret to the average. Machine Learning, 72(1-2):21–37, 2008.

Hannan, James. Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97–139, 1957.

Hazan, Elad, Agarwal, Amit, and Kale, Satyen. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Kivinen, Jyrki and Warmuth, Manfred K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Koolen, Wouter M., van Erven, Tim, and Grünwald, Peter D. Learning the learning rate for prediction with expert advice. In Advances in Neural Information Processing Systems (NIPS) 27, pp. 2294–2302, December 2014.

Littlestone, Nick and Warmuth, Manfred K. The weighted majority algorithm. In 30th Annual Symposium on Foundations of Computer Science, pp. 256–261. IEEE, 1989.

Nesterov, Yurii. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, December 2010.

Xidonas, Panos and Mavrotas, George. Multiobjective portfolio optimization with non-convex policy constraints: Evidence from the Eurostoxx 50. The European Journal of Finance, 20(11):957–977, 2014.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pp. 928–936, 2003.