# Online Learning with Kernel Losses

Aldo Pacchiano*¹, Niladri S. Chatterji*¹, Peter L. Bartlett¹

*Equal contribution. ¹University of California, Berkeley. Correspondence to: Aldo Pacchiano, Niladri S. Chatterji.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigen-decay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigen-decay ($\mu_j \le O(j^{-\beta})$), we find that the regret is bounded by $R_n \le O\big(n^{\beta/(2(\beta-1))}\big)$, while under the assumption of exponential eigen-decay ($\mu_j \le O(e^{-\beta j})$) we get an even tighter bound on the regret, $R_n \le O(n^{1/2})$. When the eigen-decay is polynomial we also show a non-matching minimax lower bound on the regret of $R_n \ge \Omega\big(n^{(\beta+1)/(2\beta)}\big)$, and a lower bound of $R_n \ge \Omega(n^{1/2})$ when the decay in the eigenvalues is exponentially fast. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm.

## 1. Introduction

In adversarial online learning, a player interacts with an unknown and arbitrary adversary over a sequence of rounds. At each round, the player chooses an action from an action space and incurs a loss associated with that chosen action. The loss functions are determined by the adversary and are fixed at the beginning of each round. After choosing an action the player observes some feedback, which can help guide the choice of actions in subsequent rounds. The most common feedback model is the full information model, where the player has access to the entire loss function at the end of each round. Another, more challenging feedback model is the partial information or bandit feedback model, where at the end of the round the player only observes the loss associated with the action chosen in that particular round. There are also other feedback models in between and beyond the full and bandit information models, many of which have also been studied in detail.

A figure of merit that is often used to judge online learning algorithms is the notion of regret, which compares the player's actions to the best single action in hindsight (defined formally in Section 1.2). When the underlying action space is a continuous and compact (possibly convex) set and the losses are linear or convex functions over this set, there are many algorithms known to attain sub-linear, and sometimes optimal, regret in both these feedback settings.

In this work we present a generalization of the well-studied adversarial online linear learning framework. In our paper, at each round the player selects an action $a \in \mathcal{A} \subset \mathbb{R}^d$. This action is mapped to an element of a reproducing kernel Hilbert space (RKHS) generated by a mapping $K(\cdot,\cdot)$. The function $K(\cdot,\cdot)$ is a kernel map, that is, it can be thought of as an inner product in an appropriate Hilbert space $\mathcal{H}$. The kernel map can be expressed as $K(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$, where $\Phi(\cdot) \in \mathbb{R}^D$ is the associated feature map. Thus at each round the loss is $\langle \Phi(a), w\rangle_{\mathcal{H}}$, where $w \in \mathcal{H}$ is the adversary's action.
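To make the protocol concrete, the following minimal sketch (ours, not from the paper) evaluates one round's kernel loss with an explicit feature map for the degree-2 polynomial kernel, one of the example losses listed below; the helper name `poly2_features` and the toy dimensions are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def poly2_features(x):
    """Explicit feature map for K(x, y) = (1 + x^T y)^2, so that
    K(x, y) = poly2_features(x) @ poly2_features(y)."""
    d = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a = rng.standard_normal(3)          # the player's action a in A
y = rng.standard_normal(3)          # a rank-one adversary: w = Phi(y)
w = poly2_features(y)

loss = poly2_features(a) @ w        # the loss <Phi(a), w>_H for this round
assert np.isclose(loss, (1.0 + a @ y) ** 2)   # matches K(a, y), as the kernel identity promises
```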
In the full information setting, as feedback, the player has access to the entire adversarial loss function $\langle \cdot, w\rangle_{\mathcal{H}}$. In the bandit setting the player is only presented with the value of the loss, $\langle \Phi(a), w\rangle_{\mathcal{H}}$. Notice that this class of losses is much more general than ordinary linear losses and includes potentially non-linear and non-convex losses such as:

1. Linear losses: $\langle a, w\rangle_{\mathcal{H}} = a^\top w$. This loss is well studied in both the bandit and full information settings. We shall see that our regret bounds match the bounds established in the literature for these losses.

2. Quadratic losses: $\langle \Phi(a), (W, b)\rangle_{\mathcal{H}} = a^\top W a + b^\top a$, where $W$ is a symmetric matrix and $b$ is a vector. Convex quadratic losses have been well studied under full information feedback as the online eigenvector decomposition problem. Our work establishes regret bounds in the full information setting and also under the mostly unexplored bandit feedback.

3. Gaussian losses: $\langle \Phi(a), \Phi(y)\rangle_{\mathcal{H}} = \exp\big(-\|a - y\|_2^2 / 2\sigma^2\big)$. We provide regret bounds for kernel losses not commonly studied before, like Gaussian losses, which have a different loss profile than linear or convex losses.

4. Polynomial losses: for example $\langle \Phi(a), \Phi(y)\rangle_{\mathcal{H}} = (1 + a^\top y)^2$. We also provide regret bounds for polynomial kernel losses, which are potentially non-convex, under both partial and full information settings. Specifically, in the full information setting we study posynomial losses (see the discussion in Section 4.3).

### 1.1. Related Work

Adversarial online convex bandits were introduced and first studied by (33; 22). The problem most closely related to our work is the case where the losses are linear, introduced earlier by (35; 7). In this setting (20; 18; 13) proposed the EXP2 (Expanded Exp) algorithm under different choices of exploration distributions: (20) worked with the uniform distribution over the barycentric spanner of the set, in (18) this distribution was the uniform distribution over the set, and (13) use the exploration distribution given by John's theorem, which leads to a regret bound of $O((dn \log(N))^{1/2})$, where $N$ is the number of actions, $n$ is the number of rounds and $d$ is the dimension of the losses. For this same problem, when the set $\mathcal{A}$ is convex and compact, (1) analyzed mirror descent to get a regret bound of $O(d\sqrt{\theta n \log(n)})$ for some $\theta > 0$. For the case of general convex losses with bandit feedback, (15) recently proposed a poly-time algorithm that has a regret guarantee of $O(d^{9.5}\sqrt{n})$, which is optimal in its dependence on the number of rounds $n$. Previous work on this problem includes (2; 41; 27; 21; 14; 28) in the adversarial setting under different assumptions on the structure of the convex losses, and (3), who studied this problem in the stochastic setting (for an extended bibliography of the work on online convex bandits see (15)). (46) study stochastic kernelized contextual bandits with a modified UCB algorithm and obtain a regret bound similar to ours, $R_n \lesssim \sqrt{\tilde{d}\, n}$, where $\tilde{d}$ is the effective dimension, which depends on the eigen-decay of the kernel. This problem was also studied previously for loss functions drawn from Gaussian processes in (44). Online learning under bandit feedback has also been studied when the losses are non-parametric, for example when the losses are Lipschitz (16; 40).

In the full information case, the online optimization framework with convex losses was first introduced by (49). The conditional gradient descent algorithm (a modification of which we study in this work) for convex losses in this setting
was introduced and analyzed by (31) and then improved subsequently by (26). The exponential weights algorithm, which we modify and use multiple times in this paper, has a rich history and has been applied to various online as well as offline settings. The particular case where the losses are convex quadratic functions has been well studied in the full information setting; this problem is also called online eigenvector decomposition or online PCA. Very recently, (4) established a regret bound of $O(\sqrt{n})$ for this problem by presenting an efficient algorithm that achieves this rate, a modified exponential weights strategy called follow the compressed leader. Previous results for this problem were established in both adversarial and stochastic settings by modifications of exponential weights, gradient descent and follow the perturbed leader algorithms (6; 45; 47; 48; 32; 23). In the full information setting there has also been work on analyzing gradient descent and mirror descent in RKHS spaces (36; 8). However, in these papers the player is allowed to play any action in a bounded set in Hilbert space, while in our paper the player is constrained to only play rank-one actions, that is, the player chooses an action in $\mathcal{A}$ which gets mapped to an action in the RKHS.

CONTRIBUTIONS

Our primary contribution is to extend the linear bandits framework to more general classes of kernel losses. We present an algorithm in this setting and provide a regret bound for it. We provide a more detailed analysis of the regret when we make assumptions on the eigen-decay of the kernel. In particular, when we assume polynomial eigen-decay of the kernel ($\mu_j \le O(j^{-\beta})$) we can guarantee that the regret is bounded as $R_n \le O\big(n^{\frac{\beta}{2(\beta-1)}}\big)$. Under exponential eigen-decay we can guarantee an even sharper bound on the regret of $R_n \le O(n^{1/2})$. We also provide a minimax lower bound on the regret of $R_n \ge \Omega\big(n^{(\beta+1)/(2\beta)}\big)$ and $R_n \ge \Omega(n^{1/2})$ under the polynomial and exponential eigen-decay assumptions, respectively. We analyze an exponential weights algorithm and a conditional gradient algorithm for the full information case, where we do not need to assume any conditions on the eigen-decay. Finally, we provide a couple of applications of our framework: (i) general quadratic losses (not necessarily convex) with linear terms, which we can solve efficiently in the full information setting; (ii) a computationally efficient algorithm when the underlying losses are posynomials (a special class of polynomials).

ORGANIZATION OF THE PAPER

In the next section we introduce the notation and definitions. In Section 2 we present our algorithm under bandit feedback and present regret bounds for this algorithm. In Section 3 we study the problem in the full information setting. In Section 4 we present two applications of our framework, and prove that our algorithms are computationally efficient in these settings. All the proofs, technical details and experiments are relegated to the appendix.

### 1.2. Notation, main definitions and setting

Here we introduce definitions and notational conventions used throughout the paper. In each round $t \in \{1, \ldots, n\}$, the player chooses an action vector $a_t \in \mathcal{A} \subset \mathbb{R}^d$. The underlying kernel function at each round is $K(\cdot,\cdot)$, a map from $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that is a kernel map and has an associated separable reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with an inner product $\langle \cdot, \cdot\rangle_{\mathcal{H}}$ (for more properties of kernel maps and RKHS see (42)).
Let $\Phi(\cdot) : \mathbb{R}^d \to \mathbb{R}^D$ denote a feature map of $K(\cdot,\cdot)$ such that for every $x, y$ we have $K(x, y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$. Note that the dimension of the RKHS, $D$, could be infinite (for example for the Gaussian kernel over $[0, 1]^d$). We let the adversary choose a vector in $\mathcal{H}$, $w_t \in \mathcal{W} \subset \mathbb{R}^D$, and at each round the loss incurred by the player is $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$. We assume that the adversary is oblivious, that is, $w_t$ may be a function of the previous actions of the player $(a_1, \ldots, a_{t-1})$ but is unaware of the randomness used to generate $a_t$. We let the size of the sets $\mathcal{A}$, $\mathcal{W}$ be bounded in kernel norm, that is,

$$\sup_{a \in \mathcal{A}} K(a, a) \le G^2 \quad \text{and} \quad \sup_{w \in \mathcal{W}} \langle w, w\rangle_{\mathcal{H}} \le G^2. \tag{1}$$

(We set the bound on the size of both sets to be the same for ease of exposition, but they could be different and would only change the constants in our results.) Throughout this paper we assume a rank-one learner, that is, in each round the player can pick a vector $v \in \mathcal{H}$ such that $v = \Phi(a)$ for some $a \in \mathbb{R}^d$. We now formally define the notion of expected regret.

**Definition 1 (Expected regret).** The expected regret of an algorithm $\mathcal{M}$ after $n$ rounds is defined as
$$R_n := \mathbb{E}\left[\sum_{t=1}^{n} \langle \Phi(a_t), w_t\rangle_{\mathcal{H}}\right] - \sum_{t=1}^{n} \langle \Phi(a^\star), w_t\rangle_{\mathcal{H}},$$
where $a^\star = \arg\inf_{a \in \mathcal{A}} \big\{\sum_{t=1}^{n} \langle \Phi(a), w_t\rangle_{\mathcal{H}}\big\}$ and the expectation is over the randomness in the algorithm.

Essentially this amounts to comparing against the best single action $a^\star$ in hindsight. Our hope will be to find a randomized strategy such that the regret grows sub-linearly with the number of rounds $n$. In what follows we will omit the subscript $\mathcal{H}$ from the inner product whenever it is clear from the context that it refers to the RKHS inner product.

To establish regret guarantees we will find that it is essential to work with finite dimensional kernels when working under bandit feedback (more details regarding this appear in the proof of the regret bound of Algorithm 2). General kernel maps can have infinite dimensional feature maps, thus we will require the construction of a finite dimensional kernel that uniformly approximates the original kernel $K(\cdot,\cdot)$. This motivates the definition of $\epsilon$-approximate kernels.

**Definition 2 ($\epsilon$-approximate kernels).** Let $K_1$ and $K_2$ be two kernels over $\mathcal{A} \times \mathcal{A}$ and let $\epsilon > 0$. We say $K_2$ is an $\epsilon$-approximation of $K_1$ if for all $x, y \in \mathcal{A}$, $|K_1(x, y) - K_2(x, y)| \le \epsilon$.

## 2. Bandit Feedback Setting

In this section we present our results on kernel bandits. In the bandit setting we assume the player knows the underlying kernel function $K(\cdot,\cdot)$; however, at each round, after the player plays a vector $a_t$, only the value of the loss associated with that action, $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$, is revealed to the player and not the action of the adversary $w_t$. We also assume that the player's action set $\mathcal{A}$ has finite cardinality (this assumption can be relaxed to let $\mathcal{A}$ be a compact set when $K$ is Lipschitz continuous; in that setting we can instead work with an appropriately fine approximating cover of the set $\mathcal{A}$). This is a generalization of the well studied adversarial linear bandits problem.

As we will see in subsequent sections, to guarantee a bound on the regret in the bandit setting our algorithm will build an estimate of the adversary's action $w_t$. This becomes impossible if $w_t$ is infinite dimensional ($D = \infty$). To circumvent this, we will construct a finite dimensional proxy kernel that is an $\epsilon$-approximation of $K$. Whenever no approximate kernel is needed, for example when $D < \infty$, we allow the adversary to choose an action $w_t \in \mathcal{W} \subset \mathbb{R}^D$ without imposing extra requirements on the set $\mathcal{W}$ other than being bounded in $\mathcal{H}$ norm. When $D$ is infinite we impose an additional constraint on the adversary to also select rank-one actions at each round, that is, $w_t = \Phi(y_t)$ where $y_t \in \mathbb{R}^d$. Next we present a discussion of the procedure to construct a finite dimensional kernel that approximates the original kernel well.
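As a quick illustration of Definition 2 (ours, not from the paper), the sketch below checks, on a finite grid of points, whether a finite-dimensional kernel approximates a Gaussian kernel to within a tolerance. Random Fourier features are used purely as a stand-in finite-dimensional approximation; the paper's own construction, based on Mercer truncation and kernel PCA, is developed in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, d, p = 1.0, 2, 2000

def gauss_kernel(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# A finite-dimensional kernel K2(x, y) = phi(x) @ phi(y) built from random
# Fourier features (an illustrative stand-in, not the paper's construction).
W = rng.standard_normal((p, d)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=p)
phi = lambda x: np.sqrt(2.0 / p) * np.cos(W @ x + b)
approx_kernel = lambda x, y: phi(x) @ phi(y)

# Empirical check of the eps-approximation condition on a grid of points in A.
grid = rng.uniform(-1.0, 1.0, size=(40, d))
gap = max(abs(gauss_kernel(x, y) - approx_kernel(x, y)) for x in grid for y in grid)
print(f"sup |K1 - K2| on the grid: {gap:.3f}")  # small => an eps-approximation in the sense of Definition 2
```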
### 2.1. Construction of the finite dimensional kernel

To construct the finite dimensional kernel we will rely crucially on Mercer's theorem. We first recall a couple of useful definitions.

**Definition 3.** Let $\mathcal{A} \subset \mathbb{R}^d$ and let $\mathbb{P}$ be a probability measure supported over $\mathcal{A}$. We denote by $L^2(\mathcal{A}; \mathbb{P})$ the space of square integrable functions over $\mathcal{A}$ with respect to the measure $\mathbb{P}$,
$$L^2(\mathcal{A}; \mathbb{P}) := \left\{ f : \mathcal{A} \to \mathbb{R} \;\Big|\; \int_{\mathcal{A}} f^2(x)\, d\mathbb{P}(x) < \infty \right\}.$$

**Definition 4.** A kernel $K : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ is square integrable with respect to the measure $\mathbb{P}$ over $\mathcal{A}$ if $\int_{\mathcal{A} \times \mathcal{A}} K^2(x, y)\, d\mathbb{P}(x)\, d\mathbb{P}(y) < \infty$.

Now we are ready to present Mercer's theorem (38) (see (19)).

**Theorem 5 (Mercer's Theorem).** Let $\mathcal{A} \subset \mathbb{R}^d$ be compact and let $\mathbb{P}$ be a finite Borel measure with support $\mathcal{A}$. Suppose $K$ is a continuous, square integrable, positive definite kernel on $\mathcal{A}$, and define a positive definite operator $T_K : L^2(\mathcal{A}; \mathbb{P}) \to L^2(\mathcal{A}; \mathbb{P})$ by
$$(T_K f)(\cdot) := \int_{\mathcal{A}} K(\cdot, x) f(x)\, d\mathbb{P}(x).$$
Then there exists a sequence of eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ that form an orthonormal basis of $L^2(\mathcal{A}; \mathbb{P})$ consisting of eigenfunctions of $T_K$, and an associated sequence of non-negative eigenvalues $\{\mu_j\}_{j=1}^{\infty}$ such that $T_K(\phi_j) = \mu_j \phi_j$ for $j = 1, 2, \ldots$. Moreover, the kernel function can be represented as
$$K(u, v) = \sum_{i=1}^{\infty} \mu_i \phi_i(u)\phi_i(v), \tag{3}$$
where the convergence of the series holds uniformly.

Mercer's theorem yields a natural way to construct a feature map $\Phi(x)$ for $K$ by defining the $i$th component of the feature map to be $\Phi(x)_i := \sqrt{\mu_i}\,\phi_i(x)$. With this choice of feature map the eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ are orthogonal under the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. (To see this, observe that the function $\phi_i$ can be expressed as a vector in the RKHS with a non-zero entry only in its $i$th coordinate and zeros everywhere else, so for any two such vectors $v_i$ and $v_j$ with $i \neq j$ we have $\langle v_i, v_j\rangle_{\mathcal{H}} = 0$.)

Armed with Mercer's theorem we first present a simple deterministic procedure to obtain a finite dimensional $\epsilon$-approximate kernel of $K$. Essentially, when the eigenfunctions of the kernel are uniformly bounded, $\sup_{x \in \mathcal{A}} |\phi_j(x)| \le B$ for all $j$, and the eigenvalues decay at a suitable rate, we can truncate the series in (3) to get a finite dimensional approximation.

**Lemma 6.** Given $\epsilon > 0$, let $\{\mu_j\}_{j=1}^{\infty}$ be the Mercer operator eigenvalues of $K$ under a finite Borel measure $\mathbb{P}$ with support $\mathcal{A}$, with eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$ and $\mu_1 \ge \mu_2 \ge \cdots$. Further assume that $\sup_{j \in \mathbb{N}} \sup_{x \in \mathcal{A}} |\phi_j(x)| \le B$ for some $B < \infty$. Let $m(\epsilon)$ be such that $\sum_{j=m+1}^{\infty} \mu_j \le \frac{\epsilon}{4B^2}$. Then the truncated feature map
$$\Phi^o_m(x)_i := \begin{cases} \sqrt{\mu_i}\,\phi_i(x) & \text{if } i \le m, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$
induces a kernel $\hat{K}^o_m(x, y) := \langle \Phi^o_m(x), \Phi^o_m(y)\rangle_{\mathcal{H}} = \sum_{j=1}^{m} \mu_j \phi_j(x)\phi_j(y)$, for all $(x, y) \in \mathcal{A} \times \mathcal{A}$, that is an $\epsilon/4$-approximation of $K$. The Hilbert space induced by $\hat{K}^o_m$ is a subspace of the original Hilbert space $\mathcal{H}$.

The proof of this lemma is a simple application of Mercer's theorem and is relegated to Appendix C. If we have access to the eigenfunctions of $K$ we can construct and work with $\hat{K}^o_m$ because, as Lemma 6 shows, $\hat{K}^o_m$ is an $\epsilon/4$-approximation to $K$. Additionally, $\hat{K}^o_m$ also has the same first $m$ Mercer eigenvalues and eigenfunctions under $\mathbb{P}$ as $K$. Unfortunately, in most applications of interest the analytical computation of the eigenfunctions $\{\phi_i\}_{i=1}^{\infty}$ is not possible. We can get around this by building an estimate of the eigenfunctions using samples from $\mathbb{P}$, leveraging results from kernel principal component analysis (PCA).

**Definition 7.** Let $S_m$ be the subspace of $\mathcal{H}$ spanned by the first $m$ eigenvectors of the covariance matrix $\mathbb{E}_{x \sim \mathbb{P}}\big[\Phi(x)\Phi(x)^\top\big]$.
This corresponds to the span of the eigenfunctions $\phi_1, \phi_2, \ldots, \phi_m$ in $\mathcal{H}$ (this holds as the $i$th eigenvector of the covariance matrix is supported only on its $i$th coordinate, combined with the fact that $\{\phi_i\}_{i=1}^{\infty}$ are orthonormal under the $L^2(\mathcal{A}; \mathbb{P})$ inner product). Define the linear projection operator $P_{S_m} : \mathcal{H} \to \mathcal{H}$ that projects onto the subspace $S_m$; that is, $P_{S_m}(v + v^{\perp}) = v$ if $v \in S_m$ and $v^{\perp} \in S_m^{\perp}$.

**Remark 8.** The feature map $\Phi^o_m(x)$ is a projection of the complete feature map onto this subspace, $\Phi^o_m(x) = P_{S_m}(\Phi(x))$.

Let $x_1, x_2, \ldots, x_p \sim \mathbb{P}$ be $p$ i.i.d. samples and construct the sample (kernel) covariance matrix $\hat{\Sigma} := \frac{1}{p}\sum_{i=1}^{p} \Phi(x_i)\Phi(x_i)^\top$. Let $\hat{S}_m$ be the subspace spanned by the top $m$ eigenvectors of $\hat{\Sigma}$. Define the stochastic feature map $\Phi_m(x) := P_{\hat{S}_m}(\Phi(x))$, the feature map defined by projecting $\Phi(x)$ onto the random subspace $\hat{S}_m$. Intuitively we would expect that if the number of samples $p$ is large enough, then the kernel defined by the feature map $\Phi_m(x)$, $\hat{K}_m(x, y) = \langle \Phi_m(x), \Phi_m(y)\rangle_{\mathcal{H}}$, will also be an $\epsilon$-approximation of the original kernel $K$. The following theorem formalizes this claim.

**Theorem 9.** Let $m$, $\mathbb{P}$ be defined as in Lemma 6. Define the $m$th level eigen-gap as $\delta_m = \frac{1}{2}(\mu_m - \mu_{m+1})$. Also let $B_m = \frac{2G}{\delta_m}\big(1 + \sqrt{\alpha/2}\big)$, $2\delta_m > \epsilon > 0$ and $p \ge \frac{2 B_m^2 G^2}{\epsilon}$. Then the finite dimensional kernels $\hat{K}^o_m$ and $\hat{K}_m$ satisfy the following properties with probability $1 - e^{-\alpha}$:

1. $\sup_{x, y \in \mathcal{A}} |K(x, y) - \hat{K}_m(x, y)| \le \epsilon$.
2. The Mercer eigenvalues $\mu^{(p)}_1 \ge \cdots \ge \mu^{(p)}_m$ and $\mu_1 \ge \cdots \ge \mu_m$ of $\hat{K}_m$ and $\hat{K}^o_m$ are close: $\sup_{i=1,\ldots,m} |\mu^{(p)}_i - \mu_i| \le \epsilon/2$.

**Algorithm 1** Finite dimensional proxy construction
Input: kernel $K$, effective dimension $m$, set $\mathcal{A}$, measure $\mathbb{P}$, bias tolerance $\epsilon > 0$, number of samples $p$.
Returns: finite proxy feature map $\Phi_m(\cdot)$.
  Sample $x_1, \ldots, x_p \sim \mathbb{P}$.
  Construct the sample Gram matrix $\hat{K}_{i,j} = \frac{1}{p} K(x_i, x_j)$.
  Calculate the top $m$ eigenvectors of $\hat{K}$: $\{\omega_1, \omega_2, \ldots, \omega_m\}$.
  For $j = 1, \ldots, m$: set $v_j = \sum_{k=1}^{p} \omega_{jk} \Phi(x_k)$, where $\omega_{jk}$ is the $k$th entry of $\omega_j$.
  Define the feature map
  $$\Phi_m(x) := \begin{pmatrix} \langle v_1, \Phi(x)\rangle_{\mathcal{H}} \\ \vdots \\ \langle v_m, \Phi(x)\rangle_{\mathcal{H}} \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{p} \omega_{1k}\, K(x_k, x) \\ \vdots \\ \sum_{k=1}^{p} \omega_{mk}\, K(x_k, x) \end{pmatrix}.$$

Theorem 9 shows that, given $\epsilon > 0$, the finite dimensional proxy $\hat{K}_m$ is an $\epsilon$-approximation of $K$ with high probability as long as a sufficiently large number of samples is used. Furthermore, the top $m$ eigenvalues of the second moment matrix of $K$ are at most $\epsilon/2$ away from the eigenvalues of the second moment matrix of $\hat{K}_m$ under $\mathbb{P}$. To construct $\Phi_m(\cdot)$ we need to calculate the top $m$ eigenvectors of the sample covariance matrix $\hat{\Sigma}$; however, it is equivalent to calculate the top $m$ eigenvectors of the sample Gram matrix $\hat{K}$ and use them to construct the eigenvectors of $\hat{\Sigma}$ (for more details see Appendix B, where we review the basics of kernel PCA).
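The following sketch (ours) mirrors the construction of Algorithm 1 using standard kernel-PCA normalization: the Gram eigenvectors are rescaled so that the corresponding RKHS vectors $v_j$ have unit norm, which makes $\Phi_m(x)^\top \Phi_m(y)$ equal to the projected kernel $\hat{K}_m(x, y)$. The exact scaling in the paper's pseudocode may differ by constants; the Gaussian kernel, the sample size and the truncation level below are arbitrary choices for illustration.

```python
import numpy as np

def build_proxy_feature_map(kernel, samples, m):
    """Finite-dimensional proxy feature map via kernel PCA, in the spirit of Algorithm 1.

    Returns phi_m with phi_m(x) in R^m so that phi_m(x) @ phi_m(y) is the kernel
    projected onto the top-m empirical eigenspace, plus estimates of the top-m
    Mercer eigenvalues."""
    p = len(samples)
    G = np.array([[kernel(xi, xj) for xj in samples] for xi in samples])
    evals, evecs = np.linalg.eigh(G)                  # ascending order
    top = np.argsort(evals)[::-1][:m]                 # indices of the top-m eigenpairs
    nu, U = evals[top], evecs[:, top]                 # Gram-matrix eigenpairs
    mercer_estimates = nu / p                         # estimates of the top-m Mercer eigenvalues
    coef = U / np.sqrt(nu)                            # columns: coefficients of unit-norm RKHS eigenvectors v_j

    def phi_m(x):
        k_x = np.array([kernel(xi, x) for xi in samples])   # (K(x_1, x), ..., K(x_p, x))
        return coef.T @ k_x                                   # coordinates <v_j, Phi(x)>_H

    return phi_m, mercer_estimates

# Usage: Gaussian kernel on A = [-1, 1]^2 with P uniform (illustrative choices).
rng = np.random.default_rng(0)
kern = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
xs = rng.uniform(-1, 1, size=(300, 2))
phi_m, mu_hat = build_proxy_feature_map(kern, xs, m=15)
x, y = rng.uniform(-1, 1, 2), rng.uniform(-1, 1, 2)
print(kern(x, y), phi_m(x) @ phi_m(y))   # the proxy kernel should be close to K(x, y)
```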
### 2.2. Bandits Exponential Weights

In this section we present a modified version of exponential weights adapted to work with kernel losses. Exponential weights has been analyzed extensively when applied to linear losses under bandit feedback (20; 17; 13). Two technical challenges make it hard to directly adapt those algorithms to our setting. The first challenge is that at each round we need to estimate the adversarial action $w_t$. If the feature map of the kernel is finite dimensional this is easy to handle; however, when the feature map is infinite dimensional this becomes challenging and we need to build an approximate feature map $\Phi_m(\cdot)$ using Algorithm 1. This introduces a bias in our estimate of the adversarial action $w_t$, and we will need to control the contribution of this bias in our regret analysis. The second challenge will be to lower bound the minimum eigenvalue of the kernel covariance matrix, as we will need to invert this matrix to estimate $w_t$. For general kernels, which are infinite dimensional, the minimum eigenvalue is zero. To resolve this we will again turn to our construction of a finite dimensional proxy kernel.

### 2.3. Bandit Algorithm and Regret Bound

In our exponential weights algorithm we first build the finite dimensional proxy kernel $\hat{K}_m$ using Algorithm 1. The rest of the algorithm is then almost identical to the exponential weights algorithm (EXP2) studied for linear bandits in (20; 17; 13). In Algorithm 2 we set the exploration distribution $\nu^{\mathcal{A}}_J$ to be such that it induces John's distribution ($\nu_J$) over $\Phi_m(\mathcal{A}) := \{\Phi_m(a) \in \mathbb{R}^m : a \in \mathcal{A}\}$ (first introduced as an exploration distribution in (13); a short discussion is also presented in Appendix H.1). Note that for finite sets it is possible to build the minimal volume ellipsoid containing $\mathrm{conv}(\Phi_m(\mathcal{A}))$, John's ellipsoid, and John's distribution in polynomial time (24). (It is thus possible to construct $\nu_J$ over $\Phi_m(\mathcal{A})$ in polynomial time; and since $\mathcal{A}$ is a finite set, using $\Phi_m(\cdot)$ and $\nu_J$ it is also possible to construct $\nu^{\mathcal{A}}_J$ efficiently.) We assume without loss of generality that the center of the set $\mathcal{A}$ is such that John's ellipsoid is centered at the origin. If we know beforehand the behavior of the eigen-decay of the Mercer eigenvalues of $K$ under the measure $\mu$ we will be able to choose our tuning parameters optimally. In our algorithm we also build and invert the exact covariance matrix $\Sigma^{(t)}_m$; however, this can be relaxed and we can work with a sample covariance matrix instead. We analyze the required sample complexity and the error introduced by this additional step in Appendix D.

**Algorithm 2** Bandit Information: Exponential Weights
Input: set $\mathcal{A}$, learning rate $\eta > 0$, mixing coefficient $\gamma > 0$, number of rounds $n$, uniform distribution $\mu$ over $\mathcal{A}$, exploration distribution $\nu^{\mathcal{A}}_J$ over $\mathcal{A}$, kernel map $K$, effective dimension $m(\epsilon)$, number of samples $p$.
  Build the kernel $\hat{K}_m$ with feature map $\Phi_m(\cdot)$ using Algorithm 1 with kernel $K$, dimension $m$, distribution $\mu$, bias tolerance $\epsilon$ and number of samples $p$.
  Set $q_1(a) = \nu^{\mathcal{A}}_J(a)$.
  For $t = 1, \ldots, n$:
    set $p_t = \gamma \nu^{\mathcal{A}}_J + (1 - \gamma) q_t$
    choose $a_t \sim p_t$
    observe $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$
    build the covariance matrix $\Sigma^{(t)}_m = \mathbb{E}_{x \sim p_t}\big[\Phi_m(x)\Phi_m(x)^\top\big]$
    compute the estimate $\hat{w}_t = \big(\Sigma^{(t)}_m\big)^{-1} \Phi_m(a_t)\, \langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$
    update $q_{t+1}(a) \propto q_t(a) \exp\big(-\eta\, \langle \hat{w}_t, \Phi_m(a)\rangle_{\mathcal{H}}\big)$

We now state the main result of this paper, which is an upper bound on the regret of Algorithm 2.

**Theorem 10.** Let $\mu_i$ be the $i$th Mercer operator eigenvalue of $K$ for the uniform measure $\mu$ over $\mathcal{A}$. Let $m$, $p$, $\alpha$ and $\epsilon$ be chosen as specified by the conditions in Theorem 9. Let the mixing coefficient be chosen such that $\gamma = \eta G^4 m$. Then Algorithm 2, with probability $1 - e^{-\alpha}$, has regret bounded by
$$R_n \le \gamma n + (e - 2) G^4 \eta m n + 3\epsilon n + \frac{1}{\eta}\log(|\mathcal{A}|).$$

We prove this theorem in Appendix A. Note that this is similar to the regret rate attained for adversarial linear bandits in (20; 18; 13), with an additional term $3\epsilon n$ that accounts for the bias in our loss estimates $\hat{w}_t$. In our regret bounds the parameter $m$ plays the role of the effective dimension and will be determined by the rate of the eigen-decay of the kernel. When the underlying Hilbert space is finite dimensional (as is the case when the losses are linear) our regret bound recovers exactly the results of previous work (that is, $\epsilon = 0$ and $m = d$).
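A minimal simulation in the spirit of Algorithm 2 (ours, not the paper's code): a finite action set with an exact three-dimensional feature map, so no kernel approximation or bias term is needed, and uniform exploration standing in for John's distribution. The toy adversary below is fixed across rounds.

```python
import numpy as np

def kernel_exp_weights(features, losses, eta, gamma, n, rng):
    """One run of an exponential-weights bandit loop in the spirit of Algorithm 2.

    features[a] is Phi_m(a) for each arm a; losses(t, a) returns the scalar
    loss <Phi(a_t), w_t>."""
    num_arms, m = features.shape
    explore = np.ones(num_arms) / num_arms          # uniform exploration (stand-in for John's distribution)
    q = np.ones(num_arms) / num_arms
    picked = []
    for t in range(n):
        p = gamma * explore + (1 - gamma) * q       # p_t = gamma * nu + (1 - gamma) * q_t
        a = rng.choice(num_arms, p=p)
        loss = losses(t, a)
        picked.append(loss)
        cov = (features * p[:, None]).T @ features  # Sigma_t = E_{x~p_t}[Phi_m(x) Phi_m(x)^T]
        w_hat = np.linalg.solve(cov, features[a]) * loss   # estimate of the adversary's action
        q = q * np.exp(-eta * features @ w_hat)     # multiplicative update
        q /= q.sum()
    return np.array(picked)

# Usage with a toy 1-d action set, quadratic feature map and a fixed oblivious adversary.
rng = np.random.default_rng(0)
actions = np.linspace(-1, 1, 25)
features = np.stack([np.ones_like(actions), actions, actions ** 2], axis=1)  # Phi_m(a) = (1, a, a^2)
w_star = np.array([0.0, 1.0, 2.0])
losses = lambda t, a: features[a] @ w_star
run = kernel_exp_weights(features, losses, eta=0.05, gamma=0.1, n=2000, rng=rng)
print("average loss:", run.mean(), "best arm loss:", (features @ w_star).min())
```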
Next we state different characteristic eigenvalue decay profiles.

**Definition 11 (Eigenvalue decay).** Let the Mercer operator eigenvalues of a kernel $K$ with respect to a measure $\mathbb{P}$ over a set $\mathcal{A}$ be denoted by $\mu_1 \ge \mu_2 \ge \cdots$.

1. $K$ is said to have $(C, \beta)$-polynomial eigenvalue decay (with $\beta > 1$) if for all $j \in \mathbb{N}$ we have $\mu_j \le C j^{-\beta}$.
2. $K$ is said to have $(C, \beta)$-exponential eigenvalue decay if for all $j \in \mathbb{N}$ we have $\mu_j \le C e^{-\beta j}$.

Under assumptions on the eigen-decay we can establish bounds on the effective dimension $m$ and on $\mu_m$, so that the condition stated in Lemma 6 is satisfied and we are guaranteed to build an $\epsilon$-approximate kernel $\hat{K}_m$. We establish bounds on $m$ in Proposition 33, presented in Appendix C.1.

**Corollary 12.** Let the conditions stated in Theorem 10 hold. Then Algorithm 2 has its regret bounded by the following rates with probability $1 - e^{-\alpha}$.

1. If $K$ has $(C, \beta)$-polynomial eigenvalue decay under the measure $\mu$, with $\beta > 2$, then by choosing the effective dimension $m = \big[\frac{4CB^2}{(\beta - 1)\epsilon}\big]^{1/(\beta - 1)}$, with the learning rate $\eta$ and the bias tolerance $\epsilon$ tuned accordingly (the optimal choices depend on $C$, $B$, $G$, $\beta$ and $\log(|\mathcal{A}|)$), the expected regret is bounded by $R_n \le O\big(n^{\frac{\beta}{2(\beta - 1)}}\big)$.
2. If $K$ has $(C, \beta)$-exponential eigenvalue decay under the measure $\mu$, then by choosing $m = \frac{1}{\beta}\log\big(\frac{4CB^2}{\beta\epsilon}\big)$, with $\eta$ and $\epsilon$ tuned accordingly and $n$ large enough so that $\epsilon < 1$, the expected regret is bounded by $R_n \le O\big(\sqrt{G^4\log(|\mathcal{A}|)\, n}\,\log(n)\big)$.

**Remark 13.** Under the $(C, \beta)$-polynomial eigen-decay condition we have that the regret is upper bounded by $R_n \le O\big(n^{\frac{\beta}{2(\beta - 1)}}\big)$, while when we have $(C, \beta)$-exponential eigen-decay we almost recover the adversarial linear bandits regret rate (up to logarithmic factors), with $R_n \le O(n^{1/2}\log(n))$.

One way to interpret the results of Corollary 12, in contrast to the regret bounds obtained for linear losses, is the following. We introduce additional parameters into our analysis to handle the infinite dimensionality of our feature vectors: the effective dimension $m$ and the bias of our estimate $\epsilon$. When the effective dimension $m$ is chosen to be large we can build an estimate of the adversarial action $\hat{w}_t$ which has low bias; however, this estimate has large variance ($O(m)$). On the other hand, if we choose $m$ to be small we can build a low variance estimate of the adversarial action, but with high bias ($\epsilon$ is large). We trade these off optimally to get the regret bounds established above. In the case of exponential decay we find that the choice $m = O(\log(n))$ is optimal, hence the regret bound only degrades by a logarithmic factor in terms of $n$ as compared to linear losses (where $m$ would be a constant). When we have polynomial decay, the effective dimension is larger, $m = O\big(n^{\frac{1}{2(\beta - 1)}}\big)$, which leads to worse bounds on the expected regret. Note that asymptotically, as $\beta \to \infty$, the regret bound approaches $n^{1/2}$, which aligns well with the intuition that the effective dimension is small; while as $\beta \to 2$ (the effective dimension $m \to \infty$) the regret bound becomes close to linear in $n$.

We can also show minimax lower bounds for these two settings that come close to matching the upper bounds.

**Proposition 14 (informal).** For any algorithm used by the player, there exists a strategy for the adversary such that $R_n \ge \Omega\big(n^{\frac{\beta + 1}{2\beta}}\big)$ whenever $\mu_j = O(j^{-\beta})$, while when the decay is exponential $R_n \ge \Omega(n^{1/2})$.

The lower bounds follow by a modification of the arguments used to prove lower bounds for linear bandits. For a complete proof see Appendix E.
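A small helper (ours) that computes a sufficient truncation level $m$ from the tail condition of Lemma 6, using integral-test bounds on the eigenvalue tails for the two decay profiles of Definition 11; the constants in the paper's exact tuning (Corollary 12 and Proposition 33) may differ.

```python
import math

def effective_dim(eps, C, beta, B, decay="polynomial"):
    """Smallest truncation level m with sum_{j>m} mu_j <= eps / (4 B^2),
    via integral-test tail bounds (a sketch; the paper's constants may differ)."""
    budget = eps / (4 * B ** 2)
    if decay == "polynomial":        # mu_j <= C j^{-beta}, beta > 1: tail <= C m^{1-beta} / (beta - 1)
        return math.ceil((C / ((beta - 1) * budget)) ** (1 / (beta - 1)))
    else:                            # mu_j <= C e^{-beta j}: tail <= C e^{-beta m} / beta
        return math.ceil(math.log(C / (beta * budget)) / beta)

print(effective_dim(eps=0.01, C=1.0, B=1.0, beta=3.0))                        # polynomial decay
print(effective_dim(eps=0.01, C=1.0, B=1.0, beta=1.0, decay="exponential"))   # exponential decay
```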
## 3. Full Information Setting

### 3.1. Full Information Exponential Weights

We begin by presenting a version of the exponential weights algorithm, Algorithm 3, adapted to our setup. In each round we sample an action vector $a_t \in \mathcal{A}$ from the exponential weights distribution $p_t$. After observing the loss $\langle \Phi(a_t), w_t\rangle_{\mathcal{H}}$ we update the distribution by a multiplicative factor, $\exp(-\eta\,\langle w_t, \Phi(a)\rangle_{\mathcal{H}})$. In the algorithm presented we choose the initial distribution $p_1(a)$ to be uniform over the set $\mathcal{A}$; however, we note that alternative initial distributions with support over the whole set could also be considered.

**Algorithm 3** Full Information: Exponential Weights
Input: set $\mathcal{A}$, learning rate $\eta > 0$, number of rounds $n$.
  Set $p_1(a)$ to be the uniform distribution over $\mathcal{A}$.
  For $t = 1, \ldots, n$:
    choose $a_t \sim p_t$
    observe $w_t$
    update $p_{t+1}(a) \propto p_t(a)\exp\big(-\eta\,\langle w_t, \Phi(a)\rangle_{\mathcal{H}}\big)$

We can establish a sub-linear regret of $O(\sqrt{n})$ for the exponential weights algorithm.

**Theorem 15.** Assume that in Algorithm 3 the step size $\eta$ is chosen to be $\eta = \sqrt{\frac{\log(\mathrm{vol}(\mathcal{A}))}{e - 2}}\cdot\frac{1}{G^2 n^{1/2}}$, with $n$ large enough such that $\sqrt{\frac{\log(\mathrm{vol}(\mathcal{A}))}{e - 2}}\cdot\frac{1}{n^{1/2}} \le 1$. Then the expected regret after $n$ rounds is bounded by $R_n \le 2\sqrt{(e - 2)\log(\mathrm{vol}(\mathcal{A}))}\; G^2 n^{1/2}$.

We prove this regret bound in Appendix F.1.

### 3.2. Conditional Gradient Descent

Next we present an online conditional gradient (Frank-Wolfe) method (26) adapted for kernel losses. The conditional gradient method is a well studied algorithm in both the online and offline settings (for a review see (25)). Its main advantage is that, as opposed to projected gradient descent and related methods, the projection step is avoided. At each round the conditional gradient method involves the optimization of a linear (kernel) objective function over $\mathcal{A}$ to get a point $v_t \in \mathcal{A}$. Next we update the mean action $X_{t+1}$ by re-weighting the previous mean action $X_t$ by $(1 - \gamma_t)$ and weighting our new action $v_t$ by $\gamma_t$.

**Algorithm 4** Full Information: Conditional Gradient
Input: set $\mathcal{A}$, number of rounds $n$, initial action $a_1 \in \mathcal{A}$, inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, learning rate $\eta$, mixing rates $\{\gamma_t\}_{t=1}^{n}$.
  $X_1 = \Phi(a_1)$; choose $D_1$ such that $\mathbb{E}_{x \sim D_1}[\Phi(x)] = X_1$.
  For $t = 1, 2, \ldots, n$:
    sample $a_t \sim D_t$
    observe the adversarial action $w_t$
    define $F_t(Y) := \eta \sum_{s=1}^{t} \langle w_s, Y\rangle_{\mathcal{H}} + \|Y - X_1\|_{\mathcal{H}}^2$
    compute $v_t = \arg\min_{a \in \mathcal{A}} \langle \nabla F_t(X_t), \Phi(a)\rangle_{\mathcal{H}}$
    update the mean $X_{t+1} = (1 - \gamma_t)X_t + \gamma_t \Phi(v_t)$
    choose $D_{t+1}$ such that $\mathbb{E}_{x \sim D_{t+1}}[\Phi(x)] = X_{t+1}$.

Note that this construction also automatically suggests a distribution over $a_1, v_1, v_2, \ldots, v_t \in \mathcal{A}$ such that $X_{t+1}$ is a convex combination of $\Phi(a_1), \Phi(v_1), \ldots, \Phi(v_t)$. For this algorithm we can prove a regret bound of $O(n^{3/4})$ (presented in Appendix F.2).

**Theorem 16.** Let the step size be $\eta = \frac{1}{2 n^{3/4}}$ and let the mixing rates be $\gamma_t = \min\{1, 2/t^{1/2}\}$. Then Algorithm 4 attains regret $R_n \le 8 G^2 n^{3/4}$.
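A sketch of the conditional gradient loop of Algorithm 4 (ours): the action set is a finite grid so the linear subproblem can be solved by enumeration, and, to keep the sketch short, the mean action $X_t$ is played directly instead of sampling $a_t$ from a distribution with that mean; the step sizes follow Theorem 16.

```python
import numpy as np

def kernel_conditional_gradient(phi, actions, adversary_ws, eta):
    """Full-information conditional gradient (Frank-Wolfe) in the spirit of Algorithm 4.

    phi(a) is the feature map, actions a finite set (so the linear subproblem is
    solved by enumeration), adversary_ws[t] the revealed w_t."""
    feats = np.stack([phi(a) for a in actions])
    X = feats[0].copy()                       # X_1 = Phi(a_1)
    X1 = X.copy()
    cum_w = np.zeros_like(X)
    realized = []
    for t, w in enumerate(adversary_ws, start=1):
        realized.append(X @ w)                # loss of the round (playing the mean, a toy simplification)
        cum_w += w
        grad = eta * cum_w + 2 * (X - X1)     # gradient of F_t(Y) = eta <sum_s w_s, Y> + ||Y - X_1||^2
        v = feats[np.argmin(feats @ grad)]    # linear subproblem over Phi(A)
        gamma_t = min(1.0, 2.0 / np.sqrt(t))
        X = (1 - gamma_t) * X + gamma_t * v   # update the mean action
    return np.array(realized)

# Usage: quadratic feature map on a 1-d grid, adversary fixed across rounds.
phi = lambda a: np.array([1.0, a, a ** 2])
acts = np.linspace(-1, 1, 41)
ws = [np.array([0.0, 1.0, 2.0]) for _ in range(500)]
loss = kernel_conditional_gradient(phi, acts, ws, eta=1.0 / (2 * 500 ** 0.75))
print("average loss:", loss.mean())
```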
## 4. Applications

### 4.1. General Quadratic Losses

The first example of losses that we present are general quadratic losses. At each round the adversary can choose a symmetric (not necessarily positive semi-definite) matrix $A_t \in \mathbb{R}^{d \times d}$ and a vector $b_t \in \mathbb{R}^d$, with a constraint on the norm of the matrix and vector such that $\|A_t\|_F^2 + \|b_t\|_2^2 \le G^2$. If we embed this pair into a Hilbert space, we get a kernel loss defined as $\langle \Phi(x), (A, b)\rangle_{\mathcal{H}} = x^\top A x + b^\top x$, where $\Phi(x) = (x x^\top, x)$ is the associated feature map for any $x \in \mathcal{A}$, and the inner product in the Hilbert space is defined as the concatenation of the trace inner product on the first coordinate and the Euclidean inner product on the second coordinate. The cumulative loss that the player would aspire to minimize is $\sum_{t=1}^{n} x_t^\top A_t x_t + b_t^\top x_t$.

The setting without the linear term, that is, when $b_t = 0$ and the matrices $A_t$ are positive semi-definite, has been well studied previously (47; 48; 23; 4). However, when the matrix is not positive semi-definite (making the losses non-convex) and there is a linear term, regret guarantees and tractable algorithms have not been studied, even in the full information case. As this is a kernel loss, we have regret bounds for these losses. We demonstrate in the subsequent sections that in the full information case it is also possible to run our algorithms efficiently. First, for exponential weights, we show that sampling is efficient for these losses.

**Lemma 17 (proof in Appendix F.1).** Let $B \in \mathbb{R}^{d \times d}$ be a symmetric matrix and $b \in \mathbb{R}^d$. Sampling from $q(a) \propto \exp(a^\top B a + a^\top b)$ for $\|a\|_2 \le 1$, $a \in \mathbb{R}^d$, is tractable in $O(d^4)$ time.

### 4.2. Guarantees for Conditional Gradient Descent

We now demonstrate that conditional gradient descent can also be run efficiently when the adversary plays a general quadratic loss. At each round conditional gradient descent requires the player to solve the optimization problem $v_t = \arg\min_{a \in \mathcal{A}} \langle \nabla F_t(X_t), \Phi(a)\rangle_{\mathcal{H}}$. When the set of actions is $\mathcal{A} = \{a \in \mathbb{R}^d : \|a\|_2 \le 1\}$, then under quadratic losses this problem becomes

$$v_t = \arg\min_{a \in \mathcal{A}} \; a^\top B a + b^\top a, \tag{5}$$

for an appropriate matrix $B$ and vector $b$ that can be calculated by aggregating the adversary's actions up to step $t$. Observe that the optimization problem (5) is a quadratically constrained quadratic program (QCQP) given our choice of $\mathcal{A}$. The dual problem is the semi-definite program (SDP)

$$\max_{\mu \ge 0,\, t}\; -\mu - t \quad \text{s.t.} \quad \begin{pmatrix} B + \mu I & b/2 \\ b^\top/2 & t \end{pmatrix} \succeq 0.$$

For this particular program, with a norm ball constraint set, it is known that the duality gap is zero provided Slater's condition holds, that is, strong duality holds (see Annex B.1 of (12)).
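For the subproblem in (5), the sketch below (ours) computes the global minimizer of a possibly indefinite quadratic over the unit ball via the classical trust-region characterization, a practical alternative to solving the SDP dual stated above; the function name and tolerances are illustrative.

```python
import numpy as np

def min_quadratic_on_ball(B, b, tol=1e-10):
    """Globally minimize a^T B a + b^T a over ||a||_2 <= 1 (the subproblem in Eq. (5)),
    using the trust-region characterization a(mu) = -(B + mu I)^{-1} b / 2 with
    mu >= max(0, -lambda_min(B)); a sketch, not the paper's own routine."""
    d = len(b)
    evals, evecs = np.linalg.eigh(B)
    lam_min = evals[0]

    def a_of(mu):
        return -np.linalg.solve(B + mu * np.eye(d), b) / 2.0

    # Interior solution: B is positive definite and the unconstrained minimizer lies in the ball.
    if lam_min > tol:
        a0 = a_of(0.0)
        if np.linalg.norm(a0) <= 1.0:
            return a0

    lo = max(0.0, -lam_min) + tol
    if np.linalg.norm(a_of(lo)) < 1.0:
        # "Hard case": pad with the eigenvector of lambda_min to reach the boundary.
        a = a_of(lo)
        tau = np.sqrt(max(0.0, 1.0 - a @ a))
        return a + tau * evecs[:, 0]

    hi = lo + 1.0
    while np.linalg.norm(a_of(hi)) > 1.0:     # bracket the boundary multiplier
        hi *= 2.0
    for _ in range(200):                       # bisection on ||a(mu)|| = 1
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(a_of(mid)) > 1.0 else (lo, mid)
    return a_of(hi)

# Usage: an indefinite B, so the loss is non-convex but the subproblem is still tractable.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4)); B = (M + M.T) / 2; b = rng.standard_normal(4)
a_star = min_quadratic_on_ball(B, b)
print(np.linalg.norm(a_star), a_star @ B @ a_star + b @ a_star)
```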
Additionally we all choose the initial action to be the solution to the optimization problem, a1 = argmin a A Observe that the objective function is a posynomial subject to a posynomial inequality constraint. This is a geometric program that can be solved efficiently by changing variables and converting into a convex program (Section 2.5 in (11)). At each round of the conditional gradient descent algorithm requires us to solve the optimization problem, vt = argmin a A η s=1 wt + 2(Xt Φ(a1)), Φ(a) H. (6) Given that posynomials are closed under addition, and given our choice of a1, the objective function (6) is still a posynomial and the constraint is a posynomial inequality. This can again be cast as a geometric program that can be solved efficiently at each round. It would be interesting to explore and study more kernel losses for which we have regret guarantees and for which our algorithms are also computationally efficient. Online learning with kernel losses [1] Jacob Duncan Abernethy, Elad Hazan, and Alexander Rakhlin. An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, 2008. [2] Alekh Agarwal and Ofer Dekel. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory, pages 28 40, 2010. [3] Alekh Agarwal, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035 1043, 2011. [4] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster algorithms for matrix multiplicative weight updates. ar Xiv preprint ar Xiv:1701.01722, 2017. [5] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121 164, 2012. [6] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 227 236. ACM, 2007. [7] Baruch Awerbuch and Robert D Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirtysixth annual ACM symposium on Theory of computing, pages 45 53. ACM, 2004. [8] Maximilian Balandat, Walid Krichene, Claire Tomlin, and Alexandre Bayen. Minimizing regret on reflexive Banach spaces and learning Nash equilibria in continuous zero-sum games. ar Xiv preprint ar Xiv:1606.01261, 2016. [9] Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1 58, 1997. [10] Peter Bartlett. Learning in sequential decision problems. Lecture Notes Stat 260/CS 294-102, 2014. [11] Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and engineering, 8(1):67, 2007. [12] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [13] S ebastien Bubeck, Nicolo Cesa-Bianchi, and Sham Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 41 1. Microtome, 2012. [14] S ebastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex optimization: T regret in one dimension. In Conference on Learning Theory, pages 266 278, 2015. [15] S ebastien Bubeck, Ronen Eldan, and Yin Tat Lee. Kernel-based methods for bandit convex optimization. ar Xiv preprint ar Xiv:1607.03084, 2016. 
[16] Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, pages 465–481, 2017.

[17] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[18] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[19] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[20] Varsha Dani, Sham M Kakade, and Thomas P Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008.

[21] Ofer Dekel, Ronen Eldan, and Tomer Koren. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In Advances in Neural Information Processing Systems, pages 2926–2934, 2015.

[22] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

[23] Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In International Conference on Machine Learning, pages 560–568, 2015.

[24] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer Science & Business Media, 2012.

[25] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[26] Elad Hazan and Satyen Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.

[27] Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792, 2014.

[28] Elad Hazan and Yuanzhi Li. An optimal algorithm for bandit convex optimization. arXiv preprint arXiv:1603.04350, 2016.

[29] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[30] Alan J Hoffman and Helmut W Wielandt. The variation of the spectrum of a normal matrix. In Selected Papers of Alan J Hoffman: With Commentary, pages 118–120. World Scientific, 2003.

[31] Martin Jaggi. Convex optimization without projection steps. arXiv preprint arXiv:1108.1170, 2011.

[32] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[33] Robert D Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.

[34] László Lovász and Santosh Vempala. The geometry of log-concave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.

[35] H Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, pages 109–123. Springer, 2004.

[36] H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.
[37] Shahar Mendelson and Alain Pajor. On singular values of matrices with independent rows. Bernoulli, 12(5):761–773, 2006.

[38] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909.

[39] Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.

[40] Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression with general loss functions. arXiv preprint arXiv:1501.06598, 2015.

[41] Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 636–642, 2011.

[42] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[43] Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.

[44] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[45] Koji Tsuda, Gunnar Rätsch, and Manfred K Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6(Jun):995–1018, 2005.

[46] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[47] Manfred K Warmuth and Dima Kuzmin. Online variance minimization. In International Conference on Computational Learning Theory, pages 514–528. Springer, 2006.

[48] Manfred K Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.

[49] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[50] Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1649–1656, 2006.