# Online Learning with Kernel Losses

Aldo Pacchiano*¹, Niladri S. Chatterji*¹, Peter L. Bartlett¹

*Equal contribution. ¹University of California, Berkeley. Correspondence to: Aldo Pacchiano, Niladri S. Chatterji.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigen-decay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigen-decay ($\mu_j \le O(j^{-\beta})$), we find that the regret is bounded by $R_n \le O\big(n^{\beta/(2(\beta-1))}\big)$, while under the assumption of exponential eigen-decay ($\mu_j \le O(e^{-\beta j})$) we get an even tighter bound on the regret, $R_n \le O(n^{1/2})$. When the eigen-decay is polynomial we also show a non-matching minimax lower bound on the regret of $R_n \ge \Omega\big(n^{(\beta+1)/(2\beta)}\big)$, and a lower bound of $R_n \ge \Omega(n^{1/2})$ when the decay in the eigenvalues is exponentially fast. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm.

## 1. Introduction

In adversarial online learning, a player interacts with an unknown and arbitrary adversary over a sequence of rounds. At each round, the player chooses an action from an action space and incurs a loss associated with that chosen action. The loss functions are determined by the adversary and are fixed at the beginning of each round. After choosing an action the player observes some feedback, which can help guide the choice of actions in subsequent rounds. The most common feedback model is the full information model, where the player has access to the entire loss function at the end of each round. Another, more challenging feedback model is the partial information or bandit feedback model, where at the end of the round the player only observes the loss associated with the action chosen in that particular round. There are also other feedback models in between and beyond the full and bandit information models, many of which have also been studied in detail.

A figure of merit that is often used to judge online learning algorithms is the notion of regret, which compares the player's actions to the best single action in hindsight (defined formally in Section 1.2). When the underlying action space is a continuous and compact (possibly convex) set and the losses are linear or convex functions over this set, there are many algorithms known to attain sub-linear, and sometimes optimal, regret in both these feedback settings.

In this work we present a generalization of the well-studied adversarial online linear learning framework. In our paper, at each round the player selects an action $a \in \mathcal{A} \subset \mathbb{R}^d$. This action is mapped to an element of a reproducing kernel Hilbert space (RKHS) generated by a mapping $K(\cdot,\cdot)$. The function $K(\cdot,\cdot)$ is a kernel map, that is, it can be thought of as an inner product in an appropriate Hilbert space $\mathcal{H}$. The kernel map can be expressed as $K(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$, where $\Phi(\cdot) \in \mathbb{R}^D$ is the associated feature map. Thus at each round the loss is $\langle \Phi(a), w\rangle_{\mathcal{H}}$, where $w \in \mathcal{H}$ is the adversary's action.
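To make the protocol concrete, the following minimal sketch (ours, not from the paper) evaluates one round's kernel loss with an explicit feature map for the degree-2 polynomial kernel, one of the example losses listed below; the helper name `poly2_features` and the toy dimensions are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def poly2_features(x):
    """Explicit feature map for K(x, y) = (1 + x^T y)^2, so that
    K(x, y) = poly2_features(x) @ poly2_features(y)."""
    d = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a = rng.standard_normal(3)          # the player's action a in A
y = rng.standard_normal(3)          # a rank-one adversary: w = Phi(y)
w = poly2_features(y)

loss = poly2_features(a) @ w        # the loss <Phi(a), w>_H for this round
assert np.isclose(loss, (1.0 + a @ y) ** 2)   # matches K(a, y), as the kernel identity promises
```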
In the full information setting, as feedback, the player has access to the entire adversarial loss function $\langle \cdot, w\rangle_{\mathcal{H}}$. In the bandit setting the player is only presented with the value of the loss, $\langle \Phi(a), w\rangle_{\mathcal{H}}$. Notice that this class of losses is much more general than ordinary linear losses and includes potentially non-linear and non-convex losses such as:

1. Linear losses: $\langle a, w\rangle_{\mathcal{H}} = a^\top w$. This loss is well studied in both the bandit and full information settings. We shall see that our regret bounds match the bounds established in the literature for these losses.

2. Quadratic losses: $\langle \Phi(a), (W, b)\rangle_{\mathcal{H}} = a^\top W a + b^\top a$, where $W$ is a symmetric matrix and $b$ is a vector. Convex quadratic losses have been well studied under full information feedback as the online eigenvector decomposition problem. Our work establishes regret bounds in the full information setting and also under the mostly unexplored bandit feedback.

3. Gaussian losses: $\langle \Phi(a), \Phi(y)\rangle_{\mathcal{H}} = \exp\big(-\|a - y\|_2^2 / 2\sigma^2\big)$. We provide regret bounds for kernel losses not commonly studied before, like Gaussian losses, which have a different loss profile than linear or convex losses.

4. Polynomial losses: for example $\langle \Phi(a), \Phi(y)\rangle_{\mathcal{H}} = (1 + a^\top y)^2$. We also provide regret bounds for polynomial kernel losses, which are potentially non-convex, under both partial and full information settings. Specifically, in the full information setting we study posynomial losses (see the discussion in Section 4.3).

### 1.1. Related Work

Adversarial online convex bandits were introduced and first studied by (33; 22). The problem most closely related to our work is the case where the losses are linear, introduced earlier by (35; 7). In this setting (20; 18; 13) proposed the EXP2 (Expanded Exp) algorithm under different choices of exploration distributions: (20) worked with the uniform distribution over the barycentric spanner of the set, in (18) this distribution was the uniform distribution over the set, and (13) use the exploration distribution given by John's theorem, which leads to a regret bound of $O((dn \log(N))^{1/2})$, where $N$ is the number of actions, $n$ is the number of rounds and $d$ is the dimension of the losses. For this same problem, when the set $\mathcal{A}$ is convex and compact, (1) analyzed mirror descent to get a regret bound of $O(d\sqrt{\theta n \log(n)})$ for some $\theta > 0$. For the case of general convex losses with bandit feedback, (15) recently proposed a poly-time algorithm that has a regret guarantee of $O(d^{9.5}\sqrt{n})$, which is optimal in its dependence on the number of rounds $n$. Previous work on this problem includes (2; 41; 27; 21; 14; 28) in the adversarial setting under different assumptions on the structure of the convex losses, and (3), who studied this problem in the stochastic setting (for an extended bibliography of the work on online convex bandits see (15)). (46) study stochastic kernelized contextual bandits with a modified UCB algorithm and obtain a regret bound similar to ours, $R_n \lesssim \sqrt{\tilde{d}\, n}$, where $\tilde{d}$ is the effective dimension, which depends on the eigen-decay of the kernel. This problem was also studied previously for loss functions drawn from Gaussian processes in (44). Online learning under bandit feedback has also been studied when the losses are non-parametric, for example when the losses are Lipschitz (16; 40).

In the full information case, the online optimization framework with convex losses was first introduced by (49). The conditional gradient descent algorithm (a modification of which we study in this work) for convex losses in this setting
was introduced and analyzed by (31) and then improved subsequently by (26). The exponential weights algorithm, which we modify and use multiple times in this paper, has a rich history and has been applied to various online as well as offline settings. The particular case where the losses are convex quadratic functions has been well studied in the full information setting; this problem is also called online eigenvector decomposition or online PCA. Very recently, (4) established a regret bound of $O(\sqrt{n})$ for this problem by presenting an efficient algorithm that achieves this rate, a modified exponential weights strategy called follow the compressed leader. Previous results for this problem were established in both adversarial and stochastic settings by modifications of exponential weights, gradient descent and follow the perturbed leader algorithms (6; 45; 47; 48; 32; 23). In the full information setting there has also been work on analyzing gradient descent and mirror descent in RKHS spaces (36; 8). However, in these papers the player is allowed to play any action in a bounded set in Hilbert space, while in our paper the player is constrained to only play rank-one actions, that is, the player chooses an action in $\mathcal{A}$ which gets mapped to an action in the RKHS.

CONTRIBUTIONS

Our primary contribution is to extend the linear bandits framework to more general classes of kernel losses. We present an algorithm in this setting and provide a regret bound for it. We provide a more detailed analysis of the regret when we make assumptions on the eigen-decay of the kernel. In particular, when we assume polynomial eigen-decay of the kernel ($\mu_j \le O(j^{-\beta})$) we can guarantee that the regret is bounded as $R_n \le O\big(n^{\frac{\beta}{2(\beta-1)}}\big)$. Under exponential eigen-decay we can guarantee an even sharper bound on the regret of $R_n \le O(n^{1/2})$. We also provide a minimax lower bound on the regret of $R_n \ge \Omega\big(n^{(\beta+1)/(2\beta)}\big)$ and $R_n \ge \Omega(n^{1/2})$ under the polynomial and exponential eigen-decay assumptions, respectively. We analyze an exponential weights algorithm and a conditional gradient algorithm for the full information case, where we do not need to assume any conditions on the eigen-decay. Finally, we provide a couple of applications of our framework: (i) general quadratic losses (not necessarily convex) with linear terms, which we can solve efficiently in the full information setting; (ii) a computationally efficient algorithm when the underlying losses are posynomials (a special class of polynomials).

ORGANIZATION OF THE PAPER

In the next section we introduce the notation and definitions. In Section 2 we present our algorithm under bandit feedback and present regret bounds for this algorithm. In Section 3 we study the problem in the full information setting. In Section 4 we present two applications of our framework, and prove that our algorithms are computationally efficient in these settings. All the proofs, technical details and experiments are relegated to the appendix.

### 1.2. Notation, main definitions and setting

Here we introduce definitions and notational conventions used throughout the paper. In each round $t \in \{1, \ldots, n\}$, the player chooses an action vector $a_t \in \mathcal{A} \subset \mathbb{R}^d$. The underlying kernel function at each round is $K(\cdot,\cdot)$, a map from $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that is a kernel map and has an associated separable reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with an inner product $\langle \cdot, \cdot\rangle_{\mathcal{H}}$ (for more properties of kernel maps and RKHS see (42)).
Let $\Phi(\cdot) : \mathbb{R}^d \to \mathbb{R}^D$ denote a feature map of $K(\cdot,\cdot)$ such that for every $x, y$ we have $K(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$. Note that the dimension of the RKHS, $D$, could be infinite (for example for the Gaussian kernel over $[0, 1]^d$). We let the adversary choose a vector in $\mathcal{H}$, $w_t \in \mathcal{W} \subset \mathbb{R}^D$, and at each round the loss incurred by the player is $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$. We assume that the adversary is oblivious, that is, $w_t$ may be a function of the previous actions of the player $(a_1, \ldots, a_{t-1})$ but is unaware of the randomness used to generate $a_t$. We let the size of the sets $\mathcal{A}$, $\mathcal{W}$ be bounded in kernel norm, that is,

$$\sup_{a \in \mathcal{A}} K(a, a) \le G^2 \quad \text{and} \quad \sup_{w \in \mathcal{W}} \langle w, w\rangle_{\mathcal{H}} \le G^2. \tag{1}$$

(We set the bound on the size of both sets to be the same for ease of exposition, but they could be different and would only change the constants in our results.) Throughout this paper we assume a rank-one learner, that is, in each round the player can pick a vector $v \in \mathcal{H}$ such that $v = \Phi(a)$ for some $a \in \mathbb{R}^d$. We now formally define the notion of expected regret.

**Definition 1 (Expected regret).** The expected regret of an algorithm $\mathcal{M}$ after $n$ rounds is defined as
$$R_n := \mathbb{E}\left[\sum_{t=1}^{n} \langle \Phi(a_t), w_t\rangle_{\mathcal{H}}\right] - \sum_{t=1}^{n} \langle \Phi(a^\star), w_t\rangle_{\mathcal{H}},$$
where $a^\star = \arg\inf_{a \in \mathcal{A}} \big\{\sum_{t=1}^{n} \langle \Phi(a), w_t\rangle_{\mathcal{H}}\big\}$ and the expectation is over the randomness in the algorithm.

Essentially this amounts to comparing against the best single action $a^\star$ in hindsight. Our hope will be to find a randomized strategy such that the regret grows sub-linearly with the number of rounds $n$. In what follows we will omit the subscript $\mathcal{H}$ from the inner product whenever it is clear from the context that it refers to the RKHS inner product.

To establish regret guarantees we will find that it is essential to work with finite dimensional kernels when working under bandit feedback (more details regarding this appear in the proof of the regret bound of Algorithm 2). General kernel maps can have infinite dimensional feature maps, thus we will require the construction of a finite dimensional kernel that uniformly approximates the original kernel $K(\cdot,\cdot)$. This motivates the definition of $\epsilon$-approximate kernels.

**Definition 2 ($\epsilon$-approximate kernels).** Let $K_1$ and $K_2$ be two kernels over $\mathcal{A} \times \mathcal{A}$ and let $\epsilon > 0$. We say $K_2$ is an $\epsilon$-approximation of $K_1$ if for all $x, y \in \mathcal{A}$, $|K_1(x, y) - K_2(x, y)| \le \epsilon$.

## 2. Bandit Feedback Setting

In this section we present our results on kernel bandits. In the bandit setting we assume the player knows the underlying kernel function $K(\cdot,\cdot)$; however, at each round, after the player plays a vector $a_t$, only the value of the loss associated with that action, $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$, is revealed to the player and not the action of the adversary $w_t$. We also assume that the player's action set $\mathcal{A}$ has finite cardinality (this assumption can be relaxed to let $\mathcal{A}$ be a compact set when $K$ is Lipschitz continuous; in that setting we can instead work with an appropriately fine approximating cover of the set $\mathcal{A}$). This is a generalization of the well studied adversarial linear bandits problem.

As we will see in subsequent sections, to guarantee a bound on the regret in the bandit setting our algorithm will build an estimate of the adversary's action $w_t$. This becomes impossible if $w_t$ is infinite dimensional ($D = \infty$). To circumvent this, we will construct a finite dimensional proxy kernel that is an $\epsilon$-approximation of $K$. Whenever no approximate kernel is needed, for example when $D < \infty$, we allow the adversary to choose an action $w_t \in \mathcal{W} \subset \mathbb{R}^D$ without imposing extra requirements on the set $\mathcal{W}$ other than being bounded in $\mathcal{H}$ norm. When $D$ is infinite we impose an additional constraint on the adversary to also select rank-one actions at each round, that is, $w_t = \Phi(y_t)$ where $y_t \in \mathbb{R}^d$. Next we present a discussion of the procedure to construct a finite dimensional kernel that approximates the original kernel well.
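As a quick illustration of Definition 2 (ours, not from the paper), the sketch below checks, on a finite grid of points, whether a finite-dimensional kernel approximates a Gaussian kernel to within a tolerance. Random Fourier features are used purely as a stand-in finite-dimensional approximation; the paper's own construction, based on Mercer truncation and kernel PCA, is developed in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, d, p = 1.0, 2, 2000

def gauss_kernel(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# A finite-dimensional kernel K2(x, y) = phi(x) @ phi(y) built from random
# Fourier features (an illustrative stand-in, not the paper's construction).
W = rng.standard_normal((p, d)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=p)
phi = lambda x: np.sqrt(2.0 / p) * np.cos(W @ x + b)
approx_kernel = lambda x, y: phi(x) @ phi(y)

# Empirical check of the eps-approximation condition on a grid of points in A.
grid = rng.uniform(-1.0, 1.0, size=(40, d))
gap = max(abs(gauss_kernel(x, y) - approx_kernel(x, y)) for x in grid for y in grid)
print(f"sup |K1 - K2| on the grid: {gap:.3f}")  # small => an eps-approximation in the sense of Definition 2
```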
### 2.1. Construction of the finite dimensional kernel

To construct the finite dimensional kernel we will rely crucially on Mercer's theorem. We first recall a couple of useful definitions.

**Definition 3.** Let $\mathcal{A} \subset \mathbb{R}^d$ and let $\mathbb{P}$ be a probability measure supported over $\mathcal{A}$. We denote by $L^2(\mathcal{A}; \mathbb{P})$ the space of square integrable functions over $\mathcal{A}$ with respect to the measure $\mathbb{P}$,
$$L^2(\mathcal{A}; \mathbb{P}) := \left\{ f : \mathcal{A} \to \mathbb{R} \;\Big|\; \int_{\mathcal{A}} f^2(x)\, d\mathbb{P}(x) < \infty \right\}.$$

**Definition 4.** A kernel $K : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ is square integrable with respect to the measure $\mathbb{P}$ over $\mathcal{A}$ if $\int_{\mathcal{A} \times \mathcal{A}} K^2(x, y)\, d\mathbb{P}(x)\, d\mathbb{P}(y) < \infty$.

Now we are ready to present Mercer's theorem (38) (see (19)).

**Theorem 5 (Mercer's Theorem).** Let $\mathcal{A} \subset \mathbb{R}^d$ be compact and let $\mathbb{P}$ be a finite Borel measure with support $\mathcal{A}$. Suppose $K$ is a continuous, square integrable, positive definite kernel on $\mathcal{A}$, and define a positive definite operator $T_K : L^2(\mathcal{A}; \mathbb{P}) \to L^2(\mathcal{A}; \mathbb{P})$ by
$$(T_K f)(\cdot) := \int_{\mathcal{A}} K(\cdot, x) f(x)\, d\mathbb{P}(x).$$
Then there exists a sequence of eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ that form an orthonormal basis of $L^2(\mathcal{A}; \mathbb{P})$ consisting of eigenfunctions of $T_K$, and an associated sequence of non-negative eigenvalues $\{\mu_j\}_{j=1}^{\infty}$ such that $T_K(\phi_j) = \mu_j \phi_j$ for $j = 1, 2, \ldots$. Moreover, the kernel function can be represented as
$$K(u, v) = \sum_{i=1}^{\infty} \mu_i \phi_i(u)\phi_i(v), \tag{3}$$
where the convergence of the series holds uniformly.

Mercer's theorem yields a natural way to construct a feature map $\Phi(x)$ for $K$ by defining the $i$th component of the feature map to be $\Phi(x)_i := \sqrt{\mu_i}\,\phi_i(x)$. With this choice of feature map the eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ are orthogonal under the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. (To see this, observe that the function $\phi_i$ can be expressed as a vector in the RKHS with a non-zero entry only in its $i$th coordinate and zeros everywhere else, so for any two such vectors $v_i$ and $v_j$ with $i \neq j$ we have $\langle v_i, v_j\rangle_{\mathcal{H}} = 0$.)

Armed with Mercer's theorem we first present a simple deterministic procedure to obtain a finite dimensional $\epsilon$-approximate kernel of $K$. Essentially, when the eigenfunctions of the kernel are uniformly bounded, $\sup_{x \in \mathcal{A}} |\phi_j(x)| \le B$ for all $j$, and the eigenvalues decay at a suitable rate, we can truncate the series in (3) to get a finite dimensional approximation.

**Lemma 6.** Given $\epsilon > 0$, let $\{\mu_j\}_{j=1}^{\infty}$ be the Mercer operator eigenvalues of $K$ under a finite Borel measure $\mathbb{P}$ with support $\mathcal{A}$, with eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$ and $\mu_1 \ge \mu_2 \ge \cdots$. Further assume that $\sup_{j \in \mathbb{N}} \sup_{x \in \mathcal{A}} |\phi_j(x)| \le B$ for some $B < \infty$. Let $m(\epsilon)$ be such that $\sum_{j=m+1}^{\infty} \mu_j \le \frac{\epsilon}{4B^2}$. Then the truncated feature map
$$\Phi^o_m(x)_i := \begin{cases} \sqrt{\mu_i}\,\phi_i(x) & \text{if } i \le m, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$
induces a kernel $\hat{K}^o_m(x, y) := \langle \Phi^o_m(x), \Phi^o_m(y)\rangle_{\mathcal{H}} = \sum_{j=1}^{m} \mu_j \phi_j(x)\phi_j(y)$, for all $(x, y) \in \mathcal{A} \times \mathcal{A}$, that is an $\epsilon/4$-approximation of $K$. The Hilbert space induced by $\hat{K}^o_m$ is a subspace of the original Hilbert space $\mathcal{H}$.

The proof of this lemma is a simple application of Mercer's theorem and is relegated to Appendix C. If we have access to the eigenfunctions of $K$ we can construct and work with $\hat{K}^o_m$ because, as Lemma 6 shows, $\hat{K}^o_m$ is an $\epsilon/4$-approximation to $K$. Additionally, $\hat{K}^o_m$ also has the same first $m$ Mercer eigenvalues and eigenfunctions under $\mathbb{P}$ as $K$. Unfortunately, in most applications of interest the analytical computation of the eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ is not possible. We can get around this by building an estimate of the eigenfunctions using samples from $\mathbb{P}$, leveraging results from kernel principal component analysis (PCA).

**Definition 7.** Let $S_m$ be the subspace of $\mathcal{H}$ spanned by the first $m$ eigenvectors of the covariance matrix $\mathbb{E}_{x \sim \mathbb{P}}\big[\Phi(x)\Phi(x)^\top\big]$.
This corresponds to the span of the eigenfunctions $\phi_1, \phi_2, \ldots, \phi_m$ in $\mathcal{H}$ (this holds as the $i$th eigenvector of the covariance matrix is supported only on its $i$th coordinate, combined with the fact that $\{\phi_i\}_{i=1}^{\infty}$ are orthonormal under the $L^2(\mathcal{A}; \mathbb{P})$ inner product). Define the linear projection operator $P_{S_m} : \mathcal{H} \to \mathcal{H}$ that projects onto the subspace $S_m$; that is, $P_{S_m}(v + v^{\perp}) = v$ if $v \in S_m$ and $v^{\perp} \in S_m^{\perp}$.

**Remark 8.** The feature map $\Phi^o_m(x)$ is a projection of the complete feature map onto this subspace, $\Phi^o_m(x) = P_{S_m}(\Phi(x))$.

Let $x_1, x_2, \ldots, x_p \sim \mathbb{P}$ be $p$ i.i.d. samples and construct the sample (kernel) covariance matrix $\hat{\Sigma} := \frac{1}{p}\sum_{i=1}^{p} \Phi(x_i)\Phi(x_i)^\top$. Let $\hat{S}_m$ be the subspace spanned by the top $m$ eigenvectors of $\hat{\Sigma}$. Define the stochastic feature map $\Phi_m(x) := P_{\hat{S}_m}(\Phi(x))$, the feature map defined by projecting $\Phi(x)$ onto the random subspace $\hat{S}_m$. Intuitively we would expect that if the number of samples $p$ is large enough, then the kernel defined by the feature map $\Phi_m(x)$, $\hat{K}_m(x, y) = \langle \Phi_m(x), \Phi_m(y)\rangle_{\mathcal{H}}$, will also be an $\epsilon$-approximation of the original kernel $K$. The following theorem formalizes this claim.

**Theorem 9.** Let $m$, $\mathbb{P}$ be defined as in Lemma 6. Define the $m$th level eigen-gap as $\delta_m = \frac{1}{2}(\mu_m - \mu_{m+1})$. Also let $B_m = \frac{2G}{\delta_m}\big(1 + \sqrt{\alpha/2}\big)$, $2\delta_m > \epsilon > 0$ and $p \ge \frac{2 B_m^2 G^2}{\epsilon}$. Then the finite dimensional kernels $\hat{K}^o_m$ and $\hat{K}_m$ satisfy the following properties with probability $1 - e^{-\alpha}$:

1. $\sup_{x, y \in \mathcal{A}} |K(x, y) - \hat{K}_m(x, y)| \le \epsilon$.
2. The Mercer eigenvalues $\mu^{(p)}_1 \ge \cdots \ge \mu^{(p)}_m$ and $\mu_1 \ge \cdots \ge \mu_m$ of $\hat{K}_m$ and $\hat{K}^o_m$ are close: $\sup_{i=1,\ldots,m} |\mu^{(p)}_i - \mu_i| \le \epsilon/2$.

**Algorithm 1** Finite dimensional proxy construction
Input: kernel $K$, effective dimension $m$, set $\mathcal{A}$, measure $\mathbb{P}$, bias tolerance $\epsilon > 0$, number of samples $p$.
Returns: finite proxy feature map $\Phi_m(\cdot)$.
  Sample $x_1, \ldots, x_p \sim \mathbb{P}$.
  Construct the sample Gram matrix $\hat{K}_{i,j} = \frac{1}{p} K(x_i, x_j)$.
  Calculate the top $m$ eigenvectors of $\hat{K}$: $\{\omega_1, \omega_2, \ldots, \omega_m\}$.
  For $j = 1, \ldots, m$: set $v_j = \sum_{k=1}^{p} \omega_{jk} \Phi(x_k)$, where $\omega_{jk}$ is the $k$th entry of $\omega_j$.
  Define the feature map
  $$\Phi_m(x) := \begin{pmatrix} \langle v_1, \Phi(x)\rangle_{\mathcal{H}} \\ \vdots \\ \langle v_m, \Phi(x)\rangle_{\mathcal{H}} \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{p} \omega_{1k}\, K(x_k, x) \\ \vdots \\ \sum_{k=1}^{p} \omega_{mk}\, K(x_k, x) \end{pmatrix}.$$

Theorem 9 shows that, given $\epsilon > 0$, the finite dimensional proxy $\hat{K}_m$ is an $\epsilon$-approximation of $K$ with high probability as long as a sufficiently large number of samples is used. Furthermore, the top $m$ eigenvalues of the second moment matrix of $K$ are at most $\epsilon/2$ away from the eigenvalues of the second moment matrix of $\hat{K}_m$ under $\mathbb{P}$. To construct $\Phi_m(\cdot)$ we need to calculate the top $m$ eigenvectors of the sample covariance matrix $\hat{\Sigma}$; however, it is equivalent to calculate the top $m$ eigenvectors of the sample Gram matrix $\hat{K}$ and use them to construct the eigenvectors of $\hat{\Sigma}$ (for more details see Appendix B, where we review the basics of kernel PCA).
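The following sketch (ours) mirrors the construction of Algorithm 1 using standard kernel-PCA normalization: the Gram eigenvectors are rescaled so that the corresponding RKHS vectors $v_j$ have unit norm, which makes $\Phi_m(x)^\top \Phi_m(y)$ equal to the projected kernel $\hat{K}_m(x, y)$. The exact scaling in the paper's pseudocode may differ by constants; the Gaussian kernel, the sample size and the truncation level below are arbitrary choices for illustration.

```python
import numpy as np

def build_proxy_feature_map(kernel, samples, m):
    """Finite-dimensional proxy feature map via kernel PCA, in the spirit of Algorithm 1.

    Returns phi_m with phi_m(x) in R^m so that phi_m(x) @ phi_m(y) is the kernel
    projected onto the top-m empirical eigenspace, plus estimates of the top-m
    Mercer eigenvalues."""
    p = len(samples)
    G = np.array([[kernel(xi, xj) for xj in samples] for xi in samples])
    evals, evecs = np.linalg.eigh(G)                  # ascending order
    top = np.argsort(evals)[::-1][:m]                 # indices of the top-m eigenpairs
    nu, U = evals[top], evecs[:, top]                 # Gram-matrix eigenpairs
    mercer_estimates = nu / p                         # estimates of the top-m Mercer eigenvalues
    coef = U / np.sqrt(nu)                            # columns: coefficients of unit-norm RKHS eigenvectors v_j

    def phi_m(x):
        k_x = np.array([kernel(xi, x) for xi in samples])   # (K(x_1, x), ..., K(x_p, x))
        return coef.T @ k_x                                   # coordinates <v_j, Phi(x)>_H

    return phi_m, mercer_estimates

# Usage: Gaussian kernel on A = [-1, 1]^2 with P uniform (illustrative choices).
rng = np.random.default_rng(0)
kern = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
xs = rng.uniform(-1, 1, size=(300, 2))
phi_m, mu_hat = build_proxy_feature_map(kern, xs, m=15)
x, y = rng.uniform(-1, 1, 2), rng.uniform(-1, 1, 2)
print(kern(x, y), phi_m(x) @ phi_m(y))   # the proxy kernel should be close to K(x, y)
```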
### 2.2. Bandits Exponential Weights

In this section we present a modified version of exponential weights adapted to work with kernel losses. Exponential weights has been analyzed extensively when applied to linear losses under bandit feedback (20; 17; 13). Two technical challenges make it hard to directly adapt those algorithms to our setting. The first challenge is that at each round we need to estimate the adversarial action $w_t$. If the feature map of the kernel is finite dimensional this is easy to handle; however, when the feature map is infinite dimensional this becomes challenging and we need to build an approximate feature map $\Phi_m(\cdot)$ using Algorithm 1. This introduces a bias in our estimate of the adversarial action $w_t$, and we will need to control the contribution of this bias in our regret analysis. The second challenge will be to lower bound the minimum eigenvalue of the kernel covariance matrix, as we will need to invert this matrix to estimate $w_t$. For general kernels, which are infinite dimensional, the minimum eigenvalue is zero. To resolve this we will again turn to our construction of a finite dimensional proxy kernel.

### 2.3. Bandit Algorithm and Regret Bound

In our exponential weights algorithm we first build the finite dimensional proxy kernel $\hat{K}_m$ using Algorithm 1. The rest of the algorithm is then almost identical to the exponential weights algorithm (EXP2) studied for linear bandits in (20; 17; 13). In Algorithm 2 we set the exploration distribution $\nu^{\mathcal{A}}_J$ to be such that it induces John's distribution ($\nu_J$) over $\Phi_m(\mathcal{A}) := \{\Phi_m(a) \in \mathbb{R}^m : a \in \mathcal{A}\}$ (first introduced as an exploration distribution in (13); a short discussion is also presented in Appendix H.1). Note that for finite sets it is possible to build the minimal volume ellipsoid containing $\mathrm{conv}(\Phi_m(\mathcal{A}))$, John's ellipsoid, and John's distribution in polynomial time (24). (It is thus possible to construct $\nu_J$ over $\Phi_m(\mathcal{A})$ in polynomial time; and since $\mathcal{A}$ is a finite set, using $\Phi_m(\cdot)$ and $\nu_J$ it is also possible to construct $\nu^{\mathcal{A}}_J$ efficiently.) We assume without loss of generality that the center of the set $\mathcal{A}$ is such that John's ellipsoid is centered at the origin. If we know beforehand the behavior of the eigen-decay of the Mercer eigenvalues of $K$ under the measure $\mu$ we will be able to choose our tuning parameters optimally. In our algorithm we also build and invert the exact covariance matrix $\Sigma^{(t)}_m$; however, this can be relaxed and we can work with a sample covariance matrix instead. We analyze the required sample complexity and the error introduced by this additional step in Appendix D.

**Algorithm 2** Bandit Information: Exponential Weights
Input: set $\mathcal{A}$, learning rate $\eta > 0$, mixing coefficient $\gamma > 0$, number of rounds $n$, uniform distribution $\mu$ over $\mathcal{A}$, exploration distribution $\nu^{\mathcal{A}}_J$ over $\mathcal{A}$, kernel map $K$, effective dimension $m(\epsilon)$, number of samples $p$.
  Build the kernel $\hat{K}_m$ with feature map $\Phi_m(\cdot)$ using Algorithm 1 with kernel $K$, dimension $m$, distribution $\mu$, bias tolerance $\epsilon$ and number of samples $p$.
  Set $q_1(a) = \nu^{\mathcal{A}}_J(a)$.
  For $t = 1, \ldots, n$:
    set $p_t = \gamma \nu^{\mathcal{A}}_J + (1 - \gamma) q_t$
    choose $a_t \sim p_t$
    observe $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$
    build the covariance matrix $\Sigma^{(t)}_m = \mathbb{E}_{x \sim p_t}\big[\Phi_m(x)\Phi_m(x)^\top\big]$
    compute the estimate $\hat{w}_t = \big(\Sigma^{(t)}_m\big)^{-1} \Phi_m(a_t)\, \langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$
    update $q_{t+1}(a) \propto q_t(a) \exp\big(-\eta\, \langle \hat{w}_t, \Phi_m(a)\rangle_{\mathcal{H}}\big)$

We now state the main result of this paper, which is an upper bound on the regret of Algorithm 2.

**Theorem 10.** Let $\mu_i$ be the $i$th Mercer operator eigenvalue of $K$ for the uniform measure $\mu$ over $\mathcal{A}$. Let $m$, $p$, $\alpha$ and $\epsilon$ be chosen as specified by the conditions in Theorem 9. Let the mixing coefficient be chosen such that $\gamma = \eta G^4 m$. Then Algorithm 2, with probability $1 - e^{-\alpha}$, has regret bounded by
$$R_n \le \gamma n + (e - 2) G^4 \eta m n + 3\epsilon n + \frac{1}{\eta}\log(|\mathcal{A}|).$$

We prove this theorem in Appendix A. Note that this is similar to the regret rate attained for adversarial linear bandits in (20; 18; 13), with an additional term $3\epsilon n$ that accounts for the bias in our loss estimates $\hat{w}_t$. In our regret bounds the parameter $m$ plays the role of the effective dimension and will be determined by the rate of the eigen-decay of the kernel. When the underlying Hilbert space is finite dimensional (as is the case when the losses are linear) our regret bound recovers exactly the results of previous work (that is, $\epsilon = 0$ and $m = d$).
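A minimal simulation in the spirit of Algorithm 2 (ours, not the paper's code): a finite action set with an exact three-dimensional feature map, so no kernel approximation or bias term is needed, and uniform exploration standing in for John's distribution. The toy adversary below is fixed across rounds.

```python
import numpy as np

def kernel_exp_weights(features, losses, eta, gamma, n, rng):
    """One run of an exponential-weights bandit loop in the spirit of Algorithm 2.

    features[a] is Phi_m(a) for each arm a; losses(t, a) returns the scalar
    loss <Phi(a_t), w_t>."""
    num_arms, m = features.shape
    explore = np.ones(num_arms) / num_arms          # uniform exploration (stand-in for John's distribution)
    q = np.ones(num_arms) / num_arms
    picked = []
    for t in range(n):
        p = gamma * explore + (1 - gamma) * q       # p_t = gamma * nu + (1 - gamma) * q_t
        a = rng.choice(num_arms, p=p)
        loss = losses(t, a)
        picked.append(loss)
        cov = (features * p[:, None]).T @ features  # Sigma_t = E_{x~p_t}[Phi_m(x) Phi_m(x)^T]
        w_hat = np.linalg.solve(cov, features[a]) * loss   # estimate of the adversary's action
        q = q * np.exp(-eta * features @ w_hat)     # multiplicative update
        q /= q.sum()
    return np.array(picked)

# Usage with a toy 1-d action set, quadratic feature map and a fixed oblivious adversary.
rng = np.random.default_rng(0)
actions = np.linspace(-1, 1, 25)
features = np.stack([np.ones_like(actions), actions, actions ** 2], axis=1)  # Phi_m(a) = (1, a, a^2)
w_star = np.array([0.0, 1.0, 2.0])
losses = lambda t, a: features[a] @ w_star
run = kernel_exp_weights(features, losses, eta=0.05, gamma=0.1, n=2000, rng=rng)
print("average loss:", run.mean(), "best arm loss:", (features @ w_star).min())
```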
Next we state different characteristic eigenvalue decay profiles.

**Definition 11 (Eigenvalue decay).** Let the Mercer operator eigenvalues of a kernel $K$ with respect to a measure $\mathbb{P}$ over a set $\mathcal{A}$ be denoted by $\mu_1 \ge \mu_2 \ge \cdots$.

1. $K$ is said to have $(C, \beta)$-polynomial eigenvalue decay (with $\beta > 1$) if for all $j \in \mathbb{N}$ we have $\mu_j \le C j^{-\beta}$.
2. $K$ is said to have $(C, \beta)$-exponential eigenvalue decay if for all $j \in \mathbb{N}$ we have $\mu_j \le C e^{-\beta j}$.

Under assumptions on the eigen-decay we can establish bounds on the effective dimension $m$ and on $\mu_m$, so that the condition stated in Lemma 6 is satisfied and we are guaranteed to build an $\epsilon$-approximate kernel $\hat{K}_m$. We establish bounds on $m$ in Proposition 33, presented in Appendix C.1.

**Corollary 12.** Let the conditions stated in Theorem 10 hold. Then Algorithm 2 has its regret bounded by the following rates with probability $1 - e^{-\alpha}$.

1. If $K$ has $(C, \beta)$-polynomial eigenvalue decay under the measure $\mu$, with $\beta > 2$, then by choosing the effective dimension $m = \big[\frac{4CB^2}{(\beta - 1)\epsilon}\big]^{1/(\beta - 1)}$, with the learning rate $\eta$ and the bias tolerance $\epsilon$ tuned accordingly (the optimal choices depend on $C$, $B$, $G$, $\beta$ and $\log(|\mathcal{A}|)$), the expected regret is bounded by $R_n \le O\big(n^{\frac{\beta}{2(\beta - 1)}}\big)$.
2. If $K$ has $(C, \beta)$-exponential eigenvalue decay under the measure $\mu$, then by choosing $m = \frac{1}{\beta}\log\big(\frac{4CB^2}{\beta\epsilon}\big)$, with $\eta$ and $\epsilon$ tuned accordingly and $n$ large enough so that $\epsilon < 1$, the expected regret is bounded by $R_n \le O\big(\sqrt{G^4\log(|\mathcal{A}|)\, n}\,\log(n)\big)$.

**Remark 13.** Under the $(C, \beta)$-polynomial eigen-decay condition we have that the regret is upper bounded by $R_n \le O\big(n^{\frac{\beta}{2(\beta - 1)}}\big)$, while when we have $(C, \beta)$-exponential eigen-decay we almost recover the adversarial linear bandits regret rate (up to logarithmic factors), with $R_n \le O(n^{1/2}\log(n))$.

One way to interpret the results of Corollary 12, in contrast to the regret bounds obtained for linear losses, is the following. We introduce additional parameters into our analysis to handle the infinite dimensionality of our feature vectors: the effective dimension $m$ and the bias of our estimate $\epsilon$. When the effective dimension $m$ is chosen to be large we can build an estimate of the adversarial action $\hat{w}_t$ which has low bias; however, this estimate has large variance ($O(m)$). On the other hand, if we choose $m$ to be small we can build a low variance estimate of the adversarial action, but with high bias ($\epsilon$ is large). We trade these off optimally to get the regret bounds established above. In the case of exponential decay we find that the choice $m = O(\log(n))$ is optimal, hence the regret bound only degrades by a logarithmic factor in terms of $n$ as compared to linear losses (where $m$ would be a constant). When we have polynomial decay, the effective dimension is larger, $m = O\big(n^{\frac{1}{2(\beta - 1)}}\big)$, which leads to worse bounds on the expected regret. Note that asymptotically, as $\beta \to \infty$, the regret bound approaches $n^{1/2}$, which aligns well with the intuition that the effective dimension is small; while as $\beta \to 2$ (the effective dimension $m \to \infty$) the regret bound becomes close to linear in $n$.

We can also show minimax lower bounds for these two settings that come close to matching the upper bounds.

**Proposition 14 (informal).** For any algorithm used by the player, there exists a strategy for the adversary such that $R_n \ge \Omega\big(n^{\frac{\beta + 1}{2\beta}}\big)$ whenever $\mu_j = O(j^{-\beta})$, while when the decay is exponential $R_n \ge \Omega(n^{1/2})$.

The lower bounds follow by a modification of the arguments used to prove lower bounds for linear bandits. For a complete proof see Appendix E.
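A small helper (ours) that computes a sufficient truncation level $m$ from the tail condition of Lemma 6, using integral-test bounds on the eigenvalue tails for the two decay profiles of Definition 11; the constants in the paper's exact tuning (Corollary 12 and Proposition 33) may differ.

```python
import math

def effective_dim(eps, C, beta, B, decay="polynomial"):
    """Smallest truncation level m with sum_{j>m} mu_j <= eps / (4 B^2),
    via integral-test tail bounds (a sketch; the paper's constants may differ)."""
    budget = eps / (4 * B ** 2)
    if decay == "polynomial":        # mu_j <= C j^{-beta}, beta > 1: tail <= C m^{1-beta} / (beta - 1)
        return math.ceil((C / ((beta - 1) * budget)) ** (1 / (beta - 1)))
    else:                            # mu_j <= C e^{-beta j}: tail <= C e^{-beta m} / beta
        return math.ceil(math.log(C / (beta * budget)) / beta)

print(effective_dim(eps=0.01, C=1.0, B=1.0, beta=3.0))                        # polynomial decay
print(effective_dim(eps=0.01, C=1.0, B=1.0, beta=1.0, decay="exponential"))   # exponential decay
```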
## 3. Full Information Setting

### 3.1. Full Information Exponential Weights

We begin by presenting a version of the exponential weights algorithm, Algorithm 3, adapted to our setup. In each round we sample an action vector $a_t \in \mathcal{A}$ from the exponential weights distribution $p_t$. After observing the loss $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$ we update the distribution by a multiplicative factor, $\exp(-\eta\,\langle w_t, \Phi(a)\rangle_{\mathcal{H}})$. In the algorithm presented we choose the initial distribution $p_1(a)$ to be uniform over the set $\mathcal{A}$; however, we note that alternative initial distributions with support over the whole set could also be considered.

**Algorithm 3** Full Information: Exponential Weights
Input: set $\mathcal{A}$, learning rate $\eta > 0$, number of rounds $n$.
  Set $p_1(a)$ to be the uniform distribution over $\mathcal{A}$.
  For $t = 1, \ldots, n$:
    choose $a_t \sim p_t$
    observe $w_t$
    update $p_{t+1}(a) \propto p_t(a)\exp\big(-\eta\,\langle w_t, \Phi(a)\rangle_{\mathcal{H}}\big)$

We can establish a sub-linear regret of $O(\sqrt{n})$ for the exponential weights algorithm.

**Theorem 15.** Assume that in Algorithm 3 the step size $\eta$ is chosen to be $\eta = \sqrt{\frac{\log(\mathrm{vol}(\mathcal{A}))}{e - 2}}\cdot\frac{1}{G^2 n^{1/2}}$, with $n$ large enough such that $\sqrt{\frac{\log(\mathrm{vol}(\mathcal{A}))}{e - 2}}\cdot\frac{1}{n^{1/2}} \le 1$. Then the expected regret after $n$ rounds is bounded by $R_n \le 2\sqrt{(e - 2)\log(\mathrm{vol}(\mathcal{A}))}\; G^2 n^{1/2}$.

We prove this regret bound in Appendix F.1.

### 3.2. Conditional Gradient Descent

Next we present an online conditional gradient (Frank-Wolfe) method (26) adapted for kernel losses. The conditional gradient method is a well studied algorithm in both the online and offline settings (for a review see (25)). Its main advantage is that, as opposed to projected gradient descent and related methods, the projection step is avoided. At each round the conditional gradient method involves the optimization of a linear (kernel) objective function over $\mathcal{A}$ to get a point $v_t \in \mathcal{A}$. Next we update the mean action $X_{t+1}$ by re-weighting the previous mean action $X_t$ by $(1 - \gamma_t)$ and weighting our new action $v_t$ by $\gamma_t$.

**Algorithm 4** Full Information: Conditional Gradient
Input: set $\mathcal{A}$, number of rounds $n$, initial action $a_1 \in \mathcal{A}$, inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, learning rate $\eta$, mixing rates $\{\gamma_t\}_{t=1}^{n}$.
  $X_1 = \Phi(a_1)$; choose $D_1$ such that $\mathbb{E}_{x \sim D_1}[\Phi(x)] = X_1$.
  For $t = 1, 2, \ldots, n$:
    sample $a_t \sim D_t$
    observe the adversarial action $w_t$
    define $F_t(Y) := \eta \sum_{s=1}^{t} \langle w_s, Y\rangle_{\mathcal{H}} + \|Y - X_1\|_{\mathcal{H}}^2$
    compute $v_t = \arg\min_{a \in \mathcal{A}} \langle \nabla F_t(X_t), \Phi(a)\rangle_{\mathcal{H}}$
    update the mean $X_{t+1} = (1 - \gamma_t)X_t + \gamma_t \Phi(v_t)$
    choose $D_{t+1}$ such that $\mathbb{E}_{x \sim D_{t+1}}[\Phi(x)] = X_{t+1}$.

Note that this construction also automatically suggests a distribution over $a_1, v_1, v_2, \ldots, v_t \in \mathcal{A}$ such that $X_{t+1}$ is a convex combination of $\Phi(a_1), \Phi(v_1), \ldots, \Phi(v_t)$. For this algorithm we can prove a regret bound of $O(n^{3/4})$ (presented in Appendix F.2).

**Theorem 16.** Let the step size be $\eta = \frac{1}{2 n^{3/4}}$ and let the mixing rates be $\gamma_t = \min\{1, 2/t^{1/2}\}$. Then Algorithm 4 attains regret $R_n \le 8 G^2 n^{3/4}$.
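A sketch of the conditional gradient loop of Algorithm 4 (ours): the action set is a finite grid so the linear subproblem can be solved by enumeration, and, to keep the sketch short, the mean action $X_t$ is played directly instead of sampling $a_t$ from a distribution with that mean; the step sizes follow Theorem 16.

```python
import numpy as np

def kernel_conditional_gradient(phi, actions, adversary_ws, eta):
    """Full-information conditional gradient (Frank-Wolfe) in the spirit of Algorithm 4.

    phi(a) is the feature map, actions a finite set (so the linear subproblem is
    solved by enumeration), adversary_ws[t] the revealed w_t."""
    feats = np.stack([phi(a) for a in actions])
    X = feats[0].copy()                       # X_1 = Phi(a_1)
    X1 = X.copy()
    cum_w = np.zeros_like(X)
    realized = []
    for t, w in enumerate(adversary_ws, start=1):
        realized.append(X @ w)                # loss of the round (playing the mean, a toy simplification)
        cum_w += w
        grad = eta * cum_w + 2 * (X - X1)     # gradient of F_t(Y) = eta <sum_s w_s, Y> + ||Y - X_1||^2
        v = feats[np.argmin(feats @ grad)]    # linear subproblem over Phi(A)
        gamma_t = min(1.0, 2.0 / np.sqrt(t))
        X = (1 - gamma_t) * X + gamma_t * v   # update the mean action
    return np.array(realized)

# Usage: quadratic feature map on a 1-d grid, adversary fixed across rounds.
phi = lambda a: np.array([1.0, a, a ** 2])
acts = np.linspace(-1, 1, 41)
ws = [np.array([0.0, 1.0, 2.0]) for _ in range(500)]
loss = kernel_conditional_gradient(phi, acts, ws, eta=1.0 / (2 * 500 ** 0.75))
print("average loss:", loss.mean())
```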
## 4. Applications

### 4.1. General Quadratic Losses

The first example of losses that we present are general quadratic losses. At each round the adversary can choose a symmetric (not necessarily positive semi-definite) matrix $A_t \in \mathbb{R}^{d \times d}$ and a vector $b_t \in \mathbb{R}^d$, with a constraint on the norm of the matrix and vector such that $\|A_t\|_F^2 + \|b_t\|_2^2 \le G^2$. If we embed this pair into a Hilbert space, we get a kernel loss defined as $\langle \Phi(x), (A, b)\rangle_{\mathcal{H}} = x^\top A x + b^\top x$, where $\Phi(x) = (x x^\top, x)$ is the associated feature map for any $x \in \mathcal{A}$, and the inner product in the Hilbert space is defined as the concatenation of the trace inner product on the first coordinate and the Euclidean inner product on the second coordinate. The cumulative loss that the player would aspire to minimize is $\sum_{t=1}^{n} x_t^\top A_t x_t + b_t^\top x_t$.

The setting without the linear term, that is, when $b_t = 0$ and the matrices $A_t$ are positive semi-definite, has been well studied previously (47; 48; 23; 4). However, when the matrix is not positive semi-definite (making the losses non-convex) and there is a linear term, regret guarantees and tractable algorithms have not been studied, even in the full information case. As this is a kernel loss, we have regret bounds for these losses. We demonstrate in the subsequent sections that in the full information case it is also possible to run our algorithms efficiently. First, for exponential weights, we show that sampling is efficient for these losses.

**Lemma 17 (proof in Appendix F.1).** Let $B \in \mathbb{R}^{d \times d}$ be a symmetric matrix and $b \in \mathbb{R}^d$. Sampling from $q(a) \propto \exp(a^\top B a + a^\top b)$ for $\|a\|_2 \le 1$, $a \in \mathbb{R}^d$, is tractable in $O(d^4)$ time.

### 4.2. Guarantees for Conditional Gradient Descent

We now demonstrate that conditional gradient descent can also be run efficiently when the adversary plays a general quadratic loss. At each round conditional gradient descent requires the player to solve the optimization problem $v_t = \arg\min_{a \in \mathcal{A}} \langle \nabla F_t(X_t), \Phi(a)\rangle_{\mathcal{H}}$. When the set of actions is $\mathcal{A} = \{a \in \mathbb{R}^d : \|a\|_2 \le 1\}$, then under quadratic losses this problem becomes

$$v_t = \arg\min_{a \in \mathcal{A}} \; a^\top B a + b^\top a, \tag{5}$$

for an appropriate matrix $B$ and vector $b$ that can be calculated by aggregating the adversary's actions up to step $t$. Observe that the optimization problem (5) is a quadratically constrained quadratic program (QCQP) given our choice of $\mathcal{A}$. The dual problem is the semi-definite program (SDP)

$$\max_{\mu \ge 0,\, t}\; -\mu - t \quad \text{s.t.} \quad \begin{pmatrix} B + \mu I & b/2 \\ b^\top/2 & t \end{pmatrix} \succeq 0.$$

For this particular program, with a norm ball constraint set, it is known that the duality gap is zero provided Slater's condition holds, that is, strong duality holds (see Annex B.1 of (12)).
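For the subproblem in (5), the sketch below (ours) computes the global minimizer of a possibly indefinite quadratic over the unit ball via the classical trust-region characterization, a practical alternative to solving the SDP dual stated above; the function name and tolerances are illustrative.

```python
import numpy as np

def min_quadratic_on_ball(B, b, tol=1e-10):
    """Globally minimize a^T B a + b^T a over ||a||_2 <= 1 (the subproblem in Eq. (5)),
    using the trust-region characterization a(mu) = -(B + mu I)^{-1} b / 2 with
    mu >= max(0, -lambda_min(B)); a sketch, not the paper's own routine."""
    d = len(b)
    evals, evecs = np.linalg.eigh(B)
    lam_min = evals[0]

    def a_of(mu):
        return -np.linalg.solve(B + mu * np.eye(d), b) / 2.0

    # Interior solution: B is positive definite and the unconstrained minimizer lies in the ball.
    if lam_min > tol:
        a0 = a_of(0.0)
        if np.linalg.norm(a0) <= 1.0:
            return a0

    lo = max(0.0, -lam_min) + tol
    if np.linalg.norm(a_of(lo)) < 1.0:
        # "Hard case": pad with the eigenvector of lambda_min to reach the boundary.
        a = a_of(lo)
        tau = np.sqrt(max(0.0, 1.0 - a @ a))
        return a + tau * evecs[:, 0]

    hi = lo + 1.0
    while np.linalg.norm(a_of(hi)) > 1.0:     # bracket the boundary multiplier
        hi *= 2.0
    for _ in range(200):                       # bisection on ||a(mu)|| = 1
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(a_of(mid)) > 1.0 else (lo, mid)
    return a_of(hi)

# Usage: an indefinite B, so the loss is non-convex but the subproblem is still tractable.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4)); B = (M + M.T) / 2; b = rng.standard_normal(4)
a_star = min_quadratic_on_ball(B, b)
print(np.linalg.norm(a_star), a_star @ B @ a_star + b @ a_star)
```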
Additionally we all choose the initial action to be the solution to the optimization problem, a1 = argmin a A Observe that the objective function is a posynomial subject to a posynomial inequality constraint. This is a geometric program that can be solved efficiently by changing variables and converting into a convex program (Section 2.5 in (11)). At each round of the conditional gradient descent algorithm requires us to solve the optimization problem, vt = argmin a A η s=1 wt + 2(Xt Φ(a1)), Φ(a) H. (6) Given that posynomials are closed under addition, and given our choice of a1, the objective function (6) is still a posynomial and the constraint is a posynomial inequality. This can again be cast as a geometric program that can be solved efficiently at each round. It would be interesting to explore and study more kernel losses for which we have regret guarantees and for which our algorithms are also computationally efficient. Online learning with kernel losses [1] Jacob Duncan Abernethy, Elad Hazan, and Alexander Rakhlin. An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, 2008. [2] Alekh Agarwal and Ofer Dekel. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory, pages 28 40, 2010. [3] Alekh Agarwal, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035 1043, 2011. [4] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster algorithms for matrix multiplicative weight updates. ar Xiv preprint ar Xiv:1701.01722, 2017. [5] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121 164, 2012. [6] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 227 236. ACM, 2007. [7] Baruch Awerbuch and Robert D Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirtysixth annual ACM symposium on Theory of computing, pages 45 53. ACM, 2004. [8] Maximilian Balandat, Walid Krichene, Claire Tomlin, and Alexandre Bayen. Minimizing regret on reflexive Banach spaces and learning Nash equilibria in continuous zero-sum games. ar Xiv preprint ar Xiv:1606.01261, 2016. [9] Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1 58, 1997. [10] Peter Bartlett. Learning in sequential decision problems. Lecture Notes Stat 260/CS 294-102, 2014. [11] Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and engineering, 8(1):67, 2007. [12] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [13] S ebastien Bubeck, Nicolo Cesa-Bianchi, and Sham Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 41 1. Microtome, 2012. [14] S ebastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex optimization: T regret in one dimension. In Conference on Learning Theory, pages 266 278, 2015. [15] S ebastien Bubeck, Ronen Eldan, and Yin Tat Lee. Kernel-based methods for bandit convex optimization. ar Xiv preprint ar Xiv:1607.03084, 2016. 
[16] Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, pages 465–481, 2017.

[17] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[18] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[19] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[20] Varsha Dani, Sham M Kakade, and Thomas P Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008.

[21] Ofer Dekel, Ronen Eldan, and Tomer Koren. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In Advances in Neural Information Processing Systems, pages 2926–2934, 2015.

[22] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

[23] Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In International Conference on Machine Learning, pages 560–568, 2015.

[24] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer Science & Business Media, 2012.

[25] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[26] Elad Hazan and Satyen Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.

[27] Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792, 2014.

[28] Elad Hazan and Yuanzhi Li. An optimal algorithm for bandit convex optimization. arXiv preprint arXiv:1603.04350, 2016.

[29] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[30] Alan J Hoffman and Helmut W Wielandt. The variation of the spectrum of a normal matrix. In Selected Papers of Alan J Hoffman: With Commentary, pages 118–120. World Scientific, 2003.

[31] Martin Jaggi. Convex optimization without projection steps. arXiv preprint arXiv:1108.1170, 2011.

[32] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[33] Robert D Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.

[34] László Lovász and Santosh Vempala. The geometry of log-concave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.

[35] H Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, pages 109–123. Springer, 2004.

[36] H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.
[37] Shahar Mendelson and Alain Pajor. On singular values of matrices with independent rows. Bernoulli, 12(5):761–773, 2006.

[38] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909.

[39] Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.

[40] Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression with general loss functions. arXiv preprint arXiv:1501.06598, 2015.

[41] Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 636–642, 2011.

[42] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[43] Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.

[44] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[45] Koji Tsuda, Gunnar Rätsch, and Manfred K Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6(Jun):995–1018, 2005.

[46] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[47] Manfred K Warmuth and Dima Kuzmin. Online variance minimization. In International Conference on Computational Learning Theory, pages 514–528. Springer, 2006.

[48] Manfred K Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.

[49] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[50] Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1649–1656, 2006.