# Shifting Regret, Mirror Descent, and Matrices

András György A.GYORGY@IMPERIAL.AC.UK
Dept. of Electrical and Electronic Engineering, Imperial College London, London, SW7 2BT, UK

Csaba Szepesvári SZEPESVA@UALBERTA.CA
Dept. of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada

Abstract

We consider the problem of online prediction in changing environments. In this framework the performance of a predictor is evaluated as the loss relative to an arbitrarily changing predictor, whose individual components come from a base class of predictors. Typical results in the literature consider different base classes (experts, linear predictors on the simplex, etc.) separately. Introducing an arbitrary mapping inside the mirror descent algorithm, we provide a framework that unifies and extends existing results. As an example, we prove new shifting regret bounds for matrix prediction problems.

1. Introduction

In the standard online learning framework, the goal of the forecaster is to compete with a set of static reference predictors. However, this goal is only meaningful if a static predictor can be expected to perform well on the given problem. When the environment changes over time, it makes more sense to consider dynamic, time-varying reference predictors. In this paper we consider the problem where the goal of the forecaster is to compete with switching predictors that can switch between elements of a base predictor class and, for each prediction, mimic the forecast of the currently selected base predictor.

This problem has received substantial attention in both learning theory and information theory, resulting in several algorithms that can compete with switching predictors. Most of these algorithms are based on variants of the exponentially weighted average prediction method, bearing different computational advantages depending on the base predictor class: variants of the fixed-share algorithm of Herbster & Warmuth (1998) are used when the base class is small, while variants of the transition-diagram-based method of Willems (1996) are applied for large base reference classes that admit efficient solutions for the static prediction problem. While the algorithms for small expert classes achieve near-optimal behavior with complexity that is linear both in the time horizon $T$ and the number of experts $N$, the algorithms for large classes can typically be implemented with $O(T^2 \log N)$ or $O(T^2 d)$ complexity, where $d$ is the dimension of the expert set when it is infinite. Computationally efficient combinations of the two methods have been proposed for large base predictor classes (Willems & Krom, 1997; Hazan & Seshadhri, 2009; György et al., 2012); their computational complexity is almost linear in $T$, they maintain the mild dependence on the size of the expert class, and they only slightly deteriorate the performance (see György et al. 2012 for an overview of tracking algorithms). In fact, these methods are general reduction methods that can transform any low-regret algorithm into one with low switching (or tracking) regret: here the regret scales with the complexity of the comparator sequence, measured by the number of switches.
Another measure, introduced by Hazan & Seshadhri (2009) and called the adaptive regret, considers the regret over contiguous time intervals; recently a stronger version of the same concept was introduced by Daniely et al. (2015) (see also Adamskiy et al. 2012). Although strongly adaptive regret and adaptive regret are stronger measures than switching regret, the algorithms developed for these problems are essentially identical, and can be shown to perform well under all of these criteria.

A notable case when (near-)optimal performance relative to switching predictors is achievable with computational complexity that is linear both in time and in the dimension of the predictors is online linear and convex optimization: here, Herbster & Warmuth (2001) unified earlier methods and combined gradient-descent-type algorithms with projections, while Zinkevich (2003) showed that the mirror descent algorithm (e.g., Nemirovski & Yudin, 1998; Beck & Teboulle, 2003) with a quadratic regularizer also enjoys favorable performance guarantees. In these problems the complexity of the reference predictors is typically measured by some norm of the variation of their predictions, measuring how much the predictor shifts over time; hence the resulting bounds are usually called shifting bounds (when the prediction set is discrete, the complexity measure usually reduces to the number of switches). Note that the general wrapper algorithms derived for switching regret (see Hazan & Seshadhri, 2009; György et al., 2012; Adamskiy et al., 2012; Daniely et al., 2015) are not directly applicable to obtain shifting bounds. Recently, Cesa-Bianchi et al. (2012) combined the tracking results of Herbster & Warmuth (1998) and Bousquet & Warmuth (2002) with the projection ideas of Herbster & Warmuth (2001) to obtain a projected exponential weighting scheme for linear/convex optimization that improves upon previous results. Finally, Hall & Willett (2013) included models of the temporal behavior of the optimal predictor in the mirror descent algorithm when the Bregman divergence used as the regularizer is bounded.

In this paper we present a unified view and analysis of algorithms derived for online learning in changing environments (including tracking, shifting, adaptive, and strongly adaptive regret), and extend the results of Cesa-Bianchi et al. (2012) to cover any instantiation of the mirror descent algorithm. In particular, after the projection step of the mirror descent algorithm, we allow another, arbitrary transformation of the prediction (this is similar to the algorithm of Hall & Willett 2013, but their motivation is different and, as a result, they give regret bounds of a different kind). We give sufficient conditions under which this twisted predictor achieves good shifting regret bounds. We extend these results by providing shifting bounds for contiguous time intervals, extending the recently introduced strongly adaptive regret notion of Daniely et al. (2015). As an example, we apply the results to prove shifting bounds for matrix prediction problems; this is the first such result for the matrix case with non-stationary comparators.

2. Preliminaries

We consider the following standard set-up of online convex optimization. Let $d > 0$ be an integer and let $X$ be a $d$-dimensional real vector space equipped with an inner product $\langle \cdot, \cdot\rangle$. Let $\|\cdot\|$ denote a norm on $X$, and let $\|\cdot\|_*$ denote its dual norm: for any $u \in X$, $\|u\|_* = \sup_{v \in X,\, \|v\| \le 1} |\langle u, v\rangle|$.
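As a standard concrete instance of this duality (which also reappears in Example 4 below), the dual of the $p$-norm on $\mathbb{R}^d$ is the $q$-norm with the conjugate exponent:

```latex
\|u\|_* \;=\; \sup_{\|v\|_p \le 1} |\langle u, v\rangle| \;=\; \|u\|_q ,
\qquad \frac{1}{p} + \frac{1}{q} = 1 ;
```

for example, the dual of $\|\cdot\|_1$ is the max-norm $\|\cdot\|_\infty$.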
For any vector $v \in X$, $v_i$ denotes its $i$th coordinate ($1 \le i \le d$), that is, $v = (v_1, \ldots, v_d)$. For simplicity, the reader may think of $X$ as the $d$-dimensional Euclidean space $\mathbb{R}^d$ with $\langle u, v\rangle = \sum_{i=1}^d u_i v_i$. For integers $q \le s$, $[q, s]$ denotes the set of integers $\{i : q \le i \le s\}$. We use $\mathbf{1}$ to denote the $d$-dimensional vector whose entries are all 1.

Online optimization is a repeated game. In each round $t = 1, 2, \ldots$ of the game, the forecaster chooses a prediction $w_t$ from some decision set $K \subset X$, and the environment chooses a loss function $\ell_t : K \to \mathbb{R}$ from a class $\mathcal{L}$ of functions mapping from $K$ to $\mathbb{R}$. At the end of round $t$, the loss function $\ell_t$ is revealed to the forecaster and the forecaster incurs the loss $\ell_t(w_t)$. In this paper we consider online convex optimization, where $K$ is non-empty, convex, and compact, and the loss functions $\ell_t$ are convex and bounded. The goal of the forecaster is to keep its cumulative loss $\hat{L}_T = \sum_{t=1}^T \ell_t(w_t)$ as small as possible for some (or any) time horizon $T$. While minimizing the loss $\hat{L}_T$ is clearly not possible in general, we aim at comparing the loss of the forecaster to the loss of an arbitrary predictor sequence $u_1^T = (u_1, \ldots, u_T) \in K^T$, defined as $L_T(u_1^T) = \sum_{t=1}^T \ell_t(u_t)$. The regret of the forecaster against $u_1^T$ is defined as $R(u_1^T) = \hat{L}_T - L_T(u_1^T)$.

Instead of considering the regret on the whole time interval $[1, T]$ only, we will consider the strongly adaptive notion of regret (Daniely et al., 2015), which bounds the regret of the algorithm on any interval $[q, s]$ for any $1 \le q \le s \le T$. More precisely, we will consider this interval regret against a changing predictor sequence $u_q^s = (u_q, \ldots, u_s)$, defined as $R(u_q^s) = \hat{L}_{q:s} - L(u_q^s)$, where $\hat{L}_{q:s} = \sum_{t=q}^s \ell_t(w_t)$ and $L(u_q^s) = \sum_{t=q}^s \ell_t(u_t)$ (note that, to simplify the notation, the time interval in the regret is only indicated through the indices of $u_q^s$).

By the convexity of the losses, for any $u \in K$,
$$\ell_t(w_t) - \ell_t(u) \le \langle \nabla\ell_t(w_t),\, w_t - u\rangle , \qquad (1)$$
where $\nabla\ell_t$ denotes a subgradient of $\ell_t$; hence we will focus on bounding $\langle \nabla\ell_t(w_t), w_t - u\rangle$. For brevity, in the remainder of the paper we will use the notation $f_t = \nabla\ell_t(w_t)$. We will also assume that $f_t$ is bounded.

The algorithm studied is based on the mirror descent (MD) algorithm (Nemirovski & Yudin, 1998; Beck & Teboulle, 2003). To define the mirror descent algorithm, we need some extra definitions. Let $A \subset X$ be a convex set; we will consider competing with predictors taking values in $K \cap A$, which is assumed to be non-empty. Let $R : A \to \mathbb{R}$ be a Legendre function: $R$ is strictly convex, its derivative $\nabla R(u)$ exists for any $u \in A^\circ$ ($A^\circ$ denotes the interior of $A$), and $\|\nabla R(u)\| \to \infty$ as $u$ approaches the boundary of $A$. The Bregman divergence $D_R : A \times A^\circ \to \mathbb{R}$ with respect to $R$ is defined, for any $(u, v) \in A \times A^\circ$, as
$$D_R(u, v) = R(u) - R(v) - \langle \nabla R(v),\, u - v\rangle .$$

3. The twisted mirror descent algorithm

Starting at a point $w_1 \in K \cap A^\circ$, the mirror descent algorithm recursively predicts, at time $t + 1$,
$$w_{t+1} = \operatorname{argmin}_{u \in K \cap A} \big[\eta_t \langle f_t, u\rangle + D_R(u, w_t)\big],$$
where $\eta_t > 0$ (recall that $f_t = \nabla\ell_t(w_t)$ denotes a subgradient of $\ell_t$ at $w_t$). We consider a generalization of the mirror descent algorithm, given in Algorithm 1. We call this algorithm the twisted mirror descent (TMD) algorithm. The main point is that once the standard minimization step of the mirror descent algorithm is performed, the resulting value $v_t$ is transformed using some function $\phi_t : K \times \mathcal{L}^{t-1} \to K \cap A$ to get the final prediction $w_t = \phi_t(v_t, \ell_1, \ldots, \ell_{t-1})$.
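To make the scheme concrete, here is a minimal sketch of one TMD round, assuming the Euclidean regularizer $R(u) = \frac{1}{2}\|u\|_2^2$ (for which the minimization step reduces to a projected gradient step); the names `project` and `phi` are illustrative and not from the paper.

```python
import numpy as np

def tmd_step(w, grad, eta, project, phi):
    """One round of twisted mirror descent (a sketch).

    Assumes R(u) = ||u||_2^2 / 2, so the mirror-descent minimization is a
    gradient step followed by a Bregman (here: Euclidean) projection onto K,
    cf. equation (2) below; `phi` is the twist map phi_t applied afterwards.
    """
    v_tilde = w - eta * grad   # unconstrained minimum (tilde v_{t+1})
    v = project(v_tilde)       # projection onto K, giving v_{t+1}
    return phi(v)              # twisted prediction w_{t+1} = phi(v_{t+1})

# Example: K is the Euclidean ball of radius D/2 and phi is the identity,
# which recovers plain mirror descent, as noted in the text.
D = 2.0
project_ball = lambda u: u * min(1.0, (D / 2) / np.linalg.norm(u))
w = tmd_step(np.array([1.0, -0.5]), np.array([0.2, 0.1]), eta=0.1,
             project=project_ball, phi=lambda v: v)
```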
Algorithm 1 Twisted mirror descent.
1. Set $w_1 \in K \cap A^\circ$.
2. At time $t = 1, 2, \ldots$, predict $w_t$ and compute
$$v_{t+1} = \operatorname{argmin}_{u \in K \cap A} \big[\eta_t \langle \nabla\ell_t(w_t), u\rangle + D_R(u, w_t)\big], \qquad w_{t+1} = \phi_{t+1}(v_{t+1}, \ell_1, \ldots, \ell_t) .$$

While this mapping is admittedly overly general (it allows maps that do not functionally depend on $v_t$), several algorithms from the literature can be written in this form using a special function $\phi_t$ with some structure (Herbster & Warmuth, 2001; Bousquet & Warmuth, 2002; Cesa-Bianchi et al., 2012; Hall & Willett, 2013). In particular, Hall & Willett (2013) propose a variant where $\phi_t$ is time-invariant and depends on $v_t$ only. We follow this line of work and consider only $\phi_t : K \to K \cap A$ and $w_t = \phi_t(v_t)$. Obviously, when $\phi_t$ is the identity, the TMD algorithm becomes the standard mirror descent algorithm. Further specific choices of $\phi_t$ will be discussed later.

In the analysis of the algorithm we will also use the unconstrained minimum
$$\tilde{v}_{t+1} = \operatorname{argmin}_{u \in A^\circ} \big[\eta_t \langle \nabla\ell_t(w_t), u\rangle + D_R(u, w_t)\big] .$$
It is well known that, due to our assumptions, both $v_t$ and $\tilde{v}_t$ are unique and belong to $A^\circ$, and
$$v_{t+1} = \operatorname{argmin}_{v \in K \cap A} D_R(v, \tilde{v}_{t+1}) . \qquad (2)$$
This latter equality suggests the implementation whereby first $\tilde{v}_{t+1}$ is computed and then the obtained point is projected back to $K \cap A$ to obtain $v_{t+1}$. However, we will also find (2) useful in our proofs.

Similarly to the two standard analyses of the mirror descent algorithm, we will analyze the TMD algorithm based on a well-known lemma, which goes back at least to the work of Herbster & Warmuth (2001):

Lemma 1. Let $w \in A^\circ$, $g \in X$, and define $v = \operatorname{argmin}_{w' \in K \cap A}[\langle g, w'\rangle + D_R(w', w)]$ and $\tilde{v} = \operatorname{argmin}_{w' \in A^\circ}[\langle g, w'\rangle + D_R(w', w)]$. Then, for any $u \in K \cap A$,
$$\langle g, w - u\rangle \le D_R(u, w) - D_R(u, v) + D_R(w, \tilde{v}) - D_R(v, \tilde{v}) .$$

If, in addition, $R$ is $\sigma$-strongly convex with respect to the norm $\|\cdot\|$, that is, $D_R(u, v) \ge \frac{\sigma}{2}\|u - v\|^2$ for all $u \in A$, $v \in A^\circ$, then one can show that
$$D_R(w, \tilde{v}) - D_R(v, \tilde{v}) \le \frac{\|g\|_*^2}{2\sigma} . \qquad (3)$$
This yields the so-called prox-lemma (Beck & Teboulle, 2003; Nemirovski et al., 2009), stating that
$$\langle g, w - u\rangle \le D_R(u, w) - D_R(u, v) + \frac{\|g\|_*^2}{2\sigma} .$$
In our subsequent statements we will usually state the general bound of Lemma 1 and use the nicer-looking prox-lemma bounds in the examples, except for the matrix prediction case in Section 4, where the divergence form gives qualitatively better results.

3.1. Shifting regret

In this section we generalize the results of Cesa-Bianchi et al. (2012), who only considered prediction on the simplex, that is, when $K = A = \Delta_d$ is the $d$-dimensional probability simplex, $R(v) = \sum_{i=1}^d (v_i \log v_i - v_i)$ for any vector $v = (v_1, \ldots, v_d) \in A$, and the resulting Bregman divergence is $D_R(v, v') = \sum_{i=1}^d \big(v_i \log\frac{v_i}{v'_i} - v_i + v'_i\big)$ for $v' = (v'_1, \ldots, v'_d) \in A^\circ$.¹ The results show that the regret of TMD relative to $u_1^T$ scales with the speed of change of $u_1^T$.

We start with a few simple reformulations of Lemma 1 that help in writing telescoping terms and defining meaningful mappings $\phi_t$. The first result is a generalization of the decomposition given by Cesa-Bianchi et al. (2012) for the simplex.

¹ Throughout the paper we use the definition $0 \log 0 = 0$, which results in a continuous extension of $R$ and $D_R$ at the boundary of $A$.

Lemma 2. Assuming $A$ contains the $\mathbf{0}$ vector, the following bound holds for the TMD algorithm for any $t = 1, 2, \ldots$ and any $u_t, u_{t+1} \in K \cap A$:
$$\langle f_t, w_t - u_t\rangle \le \frac{1}{\eta_t}\Big[ D_R(u_t, w_t) - D_R(u_{t+1}, w_{t+1}) + R(u_{t+1}) - R(u_t) + D_R(\mathbf{0}, w_{t+1}) - D_R(\mathbf{0}, v_{t+1}) + \big\langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}),\, u_t\big\rangle + \big\langle \nabla R(w_{t+1}),\, u_t - u_{t+1}\big\rangle + D_R(w_t, \tilde{v}_{t+1}) - D_R(v_{t+1}, \tilde{v}_{t+1}) \Big] ,$$
where $\tilde{v}_{t+1} = \operatorname{argmin}_{u \in A^\circ} \big[\eta_t \langle \nabla\ell_t(w_t), u\rangle + D_R(u, w_t)\big]$.

Proof.
The lemma is an easy application of Lemma 1 to the TMD algorithm with $g = \eta_t f_t$, $w = w_t$, $v = v_{t+1}$, $\tilde{v} = \tilde{v}_{t+1}$, and $u = u_t$, followed by the decomposition
$$D_R(u_t, w_t) - D_R(u_t, v_{t+1}) = \big(D_R(u_t, w_t) - D_R(u_{t+1}, w_{t+1})\big) + \big(D_R(u_{t+1}, w_{t+1}) - D_R(u_t, v_{t+1})\big)$$
and some algebra:
$$\begin{aligned} D_R(u_{t+1}, w_{t+1}) - D_R(u_t, v_{t+1}) &= R(u_{t+1}) - R(w_{t+1}) - \langle \nabla R(w_{t+1}), u_{t+1} - w_{t+1}\rangle - R(u_t) + R(v_{t+1}) + \langle \nabla R(v_{t+1}), u_t - v_{t+1}\rangle \\ &= R(u_{t+1}) - R(u_t) + \langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}), u_t\rangle + \langle \nabla R(w_{t+1}), u_t - u_{t+1}\rangle \\ &\qquad + R(v_{t+1}) - R(w_{t+1}) + \langle \nabla R(w_{t+1}), w_{t+1}\rangle - \langle \nabla R(v_{t+1}), v_{t+1}\rangle , \end{aligned}$$
where the last three terms are exactly $D_R(\mathbf{0}, w_{t+1}) - D_R(\mathbf{0}, v_{t+1})$. Putting everything together gives the statement of the lemma.

To make use of the above result, one needs to define the mappings $\phi_t$ so as to keep the following three terms small: $D_R(\mathbf{0}, w_{t+1}) - D_R(\mathbf{0}, v_{t+1})$; $\langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}), u_t\rangle$; and $\langle \nabla R(w_{t+1}), u_t - u_{t+1}\rangle$. The following theorem shows that controlling these quantities by an appropriate choice of $\phi_t$ indeed results in a meaningful bound.

Theorem 3. Assume that, for all $t$, TMD is run with $\eta_t = \eta > 0$ and with a choice of the mappings $\phi_t$ guaranteeing
$$D_R(\mathbf{0}, w_{t+1}) - D_R(\mathbf{0}, v_{t+1}) \le L_t , \qquad \sup_{u \in (K \cap A)\setminus\{\mathbf{0}\}} \frac{\langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}),\, u\rangle}{\|u\|} \le M_t , \qquad \|\nabla R(w_{t+1})\|_* \le N_t$$
for some $L_t, M_t, N_t \in \mathbb{R}$. Then, for any comparator sequence $u_1^T \in (K \cap A)^T$ and any interval $[q, s] \subseteq [1, T]$, the regret $R(u_q^s)$ is bounded from above by
$$\frac{1}{\eta}\Bigg( D_R(u_q, w_q) - D_R(u_{s+1}, w_{s+1}) + R(u_{s+1}) - R(u_q) + \sum_{t=q}^{s}\big(L_t + M_t\|u_t\| + N_t\|u_t - u_{t+1}\|\big) + \sum_{t=q}^{s}\big(D_R(w_t, \tilde{v}_{t+1}) - D_R(v_{t+1}, \tilde{v}_{t+1})\big) \Bigg) , \qquad (4)$$
where $u_{T+1} \in K \cap A$ is arbitrary² and $\tilde{v}_{t+1}$ is defined in Lemma 2.

² In fact, for any $s$, in the regret bound for $R(u_q^s)$, the value of $u_{s+1}$ can be chosen freely to optimize the bound; the above form is presented for the sake of simplicity. This holds for any of our later regret bounds, too, but we will not repeat this note.

Proof. The Cauchy–Schwarz inequality implies that the second inner product in Lemma 2 can be bounded as $\langle \nabla R(w_{t+1}), u_t - u_{t+1}\rangle \le \|\nabla R(w_{t+1})\|_* \|u_t - u_{t+1}\| \le N_t\|u_t - u_{t+1}\|$, while the conditions on $\phi_t$ imply $\langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}), u_t\rangle \le M_t\|u_t\|$. Applying these bounds in Lemma 2 shows
$$\langle f_t, w_t - u_t\rangle \le \frac{1}{\eta}\Big( D_R(u_t, w_t) - D_R(u_{t+1}, w_{t+1}) + R(u_{t+1}) - R(u_t) + L_t + M_t\|u_t\| + N_t\|u_t - u_{t+1}\| + D_R(w_t, \tilde{v}_{t+1}) - D_R(v_{t+1}, \tilde{v}_{t+1}) \Big) .$$
Summing this inequality for all $q \le t \le s$, the statement of the theorem follows immediately by (1).

The above result is a typical example of a regret bound with respect to a time-varying reference sequence $u_1^T$, as it depends on the variation of $u_q^s$: assuming $L_t = L$ and $M_t = M$ for all $t$, the dependence is on the total norm $\sum_{t=q}^{s}\|u_t\|$ and the variation $D_V(u_q^s) = \sum_{t=q}^{s-1}\|u_t - u_{t+1}\|$ of the sequence (note that $u_{s+1}$ can always be chosen equal to $u_s$ when we express the bound in the theorem).

Example 4. The simplest example where TMD, and actually the pure MD, works is the case when we use a $p$-norm regularizer with $p \in (1, 2]$ over a ball, that is, $A = X = \mathbb{R}^d$, $R(u) = \frac{1}{2}\|u\|_p^2$, and $K = \{u \in \mathbb{R}^d : \|u\|_p \le D/2\}$. In this case the dual norm is the $q$-norm with $q = p/(p-1)$. Furthermore, $D_R$ is known to be $(p-1)$-strongly convex with respect to the $p$-norm. Thus, assuming $\|f_t\|_q \le G$, (3) implies that $D_R(w_t, \tilde{v}_{t+1}) - D_R(v_{t+1}, \tilde{v}_{t+1}) \le \eta^2 G^2/(2(p-1))$. It is easy to see that in this setup the identity mapping $\phi_t(v) = v$ is a good choice (reducing TMD to MD), giving $L_t = M_t = 0$ and $N_t = D/2$, since $\|\nabla R(u)\|_q = \|u\|_p$. Selecting $w_1 = \mathbf{0}$, we have $D_R(u_1, w_1) = R(u_1) \le D^2/8$ for any $u_1 \in K$, and setting $u_{T+1} = u_T$ yields $R(u_{T+1}) \le D^2/8$, giving the regret bound
$$R(u_1^T) \le \frac{D^2 + 2D\,D_V(u_1^T)}{4\eta} + \frac{\eta T G^2}{2(p-1)} ,$$
where $D_V(u_1^T) = \sum_{t=1}^{T-1}\|u_t - u_{t+1}\|_p$ (for simplicity and illustrative purposes, we consider the regret only over the whole interval $[1, T]$).
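For illustration, here is a minimal sketch of the Example 4 instantiation, that is, plain MD with the $p$-norm regularizer. The link-function formulas below are the standard ones for $R(u) = \frac{1}{2}\|u\|_p^2$, and under this assumption the Bregman projection onto the $p$-norm ball reduces to a radial scaling; the function name and numbers are illustrative only.

```python
import numpy as np

def pnorm_md_step(w, f, eta, p, D):
    """One MD step with R(u) = ||u||_p^2 / 2 over the p-norm ball of
    radius D/2 (the setting of Example 4; identity twist map). A sketch."""
    q = p / (p - 1)                                   # dual exponent

    def link(u, r):
        # Gradient map of (1/2)||.||_r^2; with r = q it inverts the r = p map.
        nrm = np.linalg.norm(u, ord=r)
        if nrm == 0:
            return np.zeros_like(u)
        return np.sign(u) * np.abs(u) ** (r - 1) * nrm ** (2 - r)

    theta = link(w, p) - eta * f                      # gradient step in the dual
    v = link(theta, q)                                # map back via inverse of grad R
    nrm = np.linalg.norm(v, ord=p)
    return v if nrm <= D / 2 else v * (D / 2) / nrm   # Bregman projection: scaling

w = pnorm_md_step(np.array([0.3, -0.2, 0.1]),
                  f=np.array([1.0, 0.0, -1.0]), eta=0.1, p=1.5, D=2.0)
```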
Optimizing $\eta$ independently of $D_V(u_1^T)$ gives $\eta = \frac{D}{G}\sqrt{\frac{p-1}{2T}}$ and results in the bound
$$R(u_1^T) \le G\big(D + D_V(u_1^T)\big)\sqrt{\frac{T}{2(p-1)}} .$$
Optimizing $\eta$ also as a function of an a priori known upper bound $D_V \ge D_V(u_1^T)$, we get
$$R(u_1^T) \le G\sqrt{\frac{T\big(D^2 + 2D\,D_V\big)}{2(p-1)}} .$$

A slightly different (sometimes improved) version of Theorem 3 can be obtained if we can give coordinatewise conditions on the gradients of $R$ in the theorem. In the following we consider the case when $X = \mathbb{R}^d$, all coordinates of the gradients of $R$ are non-positive, and the predictors are taken from the non-negative orthant. In what follows, $\nabla_i R$ denotes the $i$th coordinate of the gradient of $R$. We also let $D^+_{TV}(u, v) = \sum_{i=1}^d \max\{u_i - v_i, 0\}$ for all $u, v \in \mathbb{R}^d$, which satisfies $D^+_{TV}(u, v) = \frac{1}{2}\|u - v\|_1$, and equals the total variation distance, when $\|u\|_1 = \|v\|_1$.

Theorem 5. Assume $K \subset [0, \infty)^d$ and $\nabla_i R(u) \le 0$ for all $u \in K \cap A$ and all $i \in [1, d]$. Suppose that, for all $t$, TMD is run with $\eta_t = \eta > 0$ and an $R$ that is $\sigma$-strongly convex with respect to $\|\cdot\|$, and with a choice of the mappings $\phi_t$ guaranteeing
$$D_R(\mathbf{0}, w_{t+1}) - D_R(\mathbf{0}, v_{t+1}) \le L_t , \qquad \nabla_i R(v_{t+1}) - \nabla_i R(w_{t+1}) \le M_t , \qquad -\nabla_i R(w_{t+1}) \le N_t$$
for all $i \in [1, d]$ and some $L_t, M_t, N_t \in \mathbb{R}$. Then, for any comparator sequence $u_1^T \in (K \cap A)^T$ and any interval $[q, s] \subseteq [1, T]$, the regret $R(u_q^s)$ is bounded from above by
$$\frac{1}{\eta}\Bigg( D_R(u_q, w_q) - D_R(u_{s+1}, w_{s+1}) + R(u_{s+1}) - R(u_q) + \sum_{t=q}^{s}\Big(L_t + M_t\big(\|u_t\|_1 - D^+_{TV}(u_{t+1}, u_t)\big) + N_t\,D^+_{TV}(u_{t+1}, u_t)\Big) \Bigg) + \frac{\eta}{2\sigma}\sum_{t=q}^{s}\|f_t\|_*^2 ,$$
where $u_{T+1} \in K \cap A$ is arbitrary.

Proof. The proof of the theorem follows the same lines as that of Theorem 3. The slight difference is in how the inner products in Lemma 2 are bounded: using the fact that $\nabla_i R(v_{t+1}) \le 0$ and splitting the coordinates according to whether $u_{t,i} \le u_{t+1,i}$ or $u_{t,i} > u_{t+1,i}$, one obtains
$$\langle \nabla R(v_{t+1}) - \nabla R(w_{t+1}), u_t\rangle + \langle \nabla R(w_{t+1}), u_t - u_{t+1}\rangle \le N_t\,D^+_{TV}(u_{t+1}, u_t) + M_t\big(\|u_t\|_1 - D^+_{TV}(u_{t+1}, u_t)\big) .$$
The proof can be finished in the same way as that of Theorem 3, also using (3) to bound the difference of the divergence terms in the last line of the bound of Lemma 2.

Example 6. The above theorem is very useful when one works on the simplex, as in Cesa-Bianchi et al. (2012). Then $K = \Delta_d$ is the $d$-dimensional probability simplex, $R(v) = \sum_{i=1}^d (v_i \log v_i - v_i)$ for any vector $v = (v_1, \ldots, v_d) \in K$, and the resulting Bregman divergence is $D_R(v, v') = \sum_{i=1}^d v_i \log\frac{v_i}{v'_i}$ for $v' = (v'_1, \ldots, v'_d) \in K$. Note that in this case the norm is the 1-norm and $\sigma = 1$ (by Pinsker's inequality). Then, selecting $\phi_t$ as the fixed-share update of Herbster & Warmuth (1998), that is,
$$w_{t+1} = \phi_t(v_{t+1}) = (1 - \alpha)v_{t+1} + \frac{\alpha}{d}\mathbf{1} \qquad (5)$$
for some $\alpha > 0$, the assumptions of the theorem are satisfied, giving rise to the bound in Proposition 1 of Cesa-Bianchi et al. (2012) as follows: First, we have $L_t = 0$, $M_t = \log\frac{1}{1-\alpha}$, and $N_t = \log\frac{d}{\alpha}$. Letting $u_{T+1} = u_T$ and $m(u_1^T) = \frac{1}{2}\sum_{t=1}^{T-1}\|u_{t+1} - u_t\|_1$, assuming each $f_t \in [0, 1]^d$, and starting the algorithm from the uniform distribution $w_1 = \mathbf{1}/d$, the bound becomes
$$R(u_1^T) \le \frac{\log d + \big(T - m(u_1^T)\big)\log\frac{1}{1-\alpha} + m(u_1^T)\log\frac{d}{\alpha}}{\eta} + \frac{\eta T}{8} .$$
The $1/8$ factor instead of $1/2$ in the last term can be obtained by shifting $f_t$ to $[-1/2, 1/2]^d$, which does not change the linearized regret. This result exactly recovers the corresponding bounds of Herbster & Warmuth (1998) and Cesa-Bianchi et al. (2012). The slight improvement compared to Theorem 3 is the appearance of the $m(u_1^T)$ term in the factor multiplying $\log\frac{1}{1-\alpha}$, and the appearance of the factor $1/2$ in the definition of $m(u_1^T)$.
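As an illustration of Example 6, the following sketch implements one TMD round on the simplex with the entropy regularizer: the mirror descent step is the exponentiated-gradient update (the KL projection onto the simplex amounts to a normalization), and $\phi_t$ is the fixed-share map (5). The function name and parameter values are illustrative only.

```python
import numpy as np

def fixed_share_step(w, f, eta, alpha):
    """One TMD round on the probability simplex with the entropy
    regularizer and the fixed-share twist map (a sketch)."""
    v = w * np.exp(-eta * f)             # unconstrained entropy-MD minimum
    v /= v.sum()                         # Bregman (KL) projection onto the simplex
    d = len(w)
    return (1 - alpha) * v + alpha / d   # fixed-share map (5): phi_t(v_{t+1})

w = np.full(4, 1 / 4)                    # uniform start w_1 = 1/d
w = fixed_share_step(w, f=np.array([0.9, 0.1, 0.4, 0.4]), eta=0.5, alpha=0.01)
```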
4. Application to linear prediction over trace-bounded positive definite matrices

In this section we consider the application of the previous results to a natural online matrix-prediction problem, taken from Hazan et al. (2012), who showed that a number of matrix-valued prediction problems, such as collaborative filtering, gambling, and max-cut, can be reduced to this common problem. Here we show how TMD can be applied to this problem to compete with a changing sequence of matrices for a more general family of matrix prediction problems, thereby extending, among other things, the scope of the results of Hazan et al. (2012). Further, our results also show that, contrary to what Hazan et al. (2012) claim, standard mirror descent analysis is sufficient to get nontrivial results for matrix prediction, at least as long as one uses Lemma 1 instead of the prox-lemma.³

³ In particular, Hazan et al. (2012) wrote in the second half of page 38.2: "In contrast, a direct application of the online mirror descent framework to this problem yields a regret of $O(\sqrt{n^2 T})$." In their paper, the matrices are $n \times n$, which implies that this bound is trivial, as they explain. Our results show that a general online mirror descent framework, in particular Theorem 3, gives bounds of the desired scaling behavior.

In order to define the problem, we need some notation. Let $S$ denote the vector space of $N \times N$ real-valued symmetric matrices, equipped with the inner product $\langle X, Y\rangle = \operatorname{tr}(X^\top Y)$. Further, let $S_{++} \subset S$ (and $S_+ \subset S$) denote the set of $N \times N$ real-valued positive definite (respectively, positive semi-definite) matrices. Let $\|X\|$ denote the trace norm: $\|X\| = \|X\|_{\mathrm{tr}} = \sum_{i=1}^N |\lambda_i(X)|$, where $\lambda_i(X)$ is the $i$th eigenvalue of the matrix $X \in S$. The dual norm of the trace norm is the spectral (or operator) norm: $\|X\|_* = \max_{1 \le i \le N} |\lambda_i(X)|$. Note that Hölder's inequality holds by duality, that is, $\langle X, Y\rangle \le \|X\| \|Y\|_*$ for $X, Y \in S$.

Let $\tau$ be a positive number. The competitor set $K$ is chosen to be a closed, convex, non-empty subset of $K_\tau = \{X \in S_+ : \|X\| \le \tau\}$. Note that $K_\tau$ is a closed, convex set. The loss is assumed to be linear: $\ell_t(X) = \langle F_t, X\rangle$, where $F_t \in S$ is assumed to have bounded dual norm, that is, $F_t \in \mathcal{L}$ for some $\mathcal{L} \subseteq \mathcal{L}_\gamma$ with $\mathcal{L}_\gamma = \{F \in S : \|F^2\| \le \gamma\}$. We will consider the $(\tau, \gamma)$-matrix prediction problem, where we want to compete with the best matrix from $K \subseteq K_\tau$ in hindsight, given a sequence of loss matrices $F_1, \ldots, F_T \in \mathcal{L}$. For simplicity, we assume that $\frac{\tau}{N} I_{N \times N} \in K$, where $I_{N \times N}$ denotes the $N \times N$ identity matrix.

Hazan et al. (2012) considered a special $(\tau, \gamma)$-matrix prediction problem: in their setup, $K$ is a subset of $K_{\tau,\beta} = \{X \in S_+ : \|X\| \le \tau,\ X_{i,i} \le \beta,\ 1 \le i \le N\}$ for some $\beta > 0$, and $\mathcal{L}$ is a subset of $\widehat{\mathcal{L}}_\gamma = \{F \in S : \|F^2\| \le \gamma,\ F^2 \text{ is diagonal}\}$. The significance of this problem is that several interesting matrix prediction problems, such as online collaborative filtering, gambling, and max-cut, can be reduced to this form, which they call the $(\beta, \tau, \gamma)$-matrix prediction problem. Note that $K_{\tau,\beta} \subseteq K_\tau$ and $\widehat{\mathcal{L}}_\gamma \subseteq \mathcal{L}_\gamma$; thus this is a strictly smaller set of problems. Also note that $\frac{\tau}{N} I_{N \times N} \in K_{\tau,\beta}$ implies $\tau \le \beta N$.

Let us now consider how TMD can be applied to these settings (to emphasize that the iterates are matrices, we will write capital letters $V_t$ and $W_t$ instead of $v_t$ and $w_t$, respectively). Unsurprisingly, we choose the unnormalized negative entropy regularizer to instantiate TMD. To introduce this, define the application of a function $f : \mathbb{R} \to \mathbb{R}$ to a symmetric matrix $X$ as $f(X) = \sum_{i=1}^N f(\lambda_i) u_i u_i^\top$, where $X = \sum_{i=1}^N \lambda_i u_i u_i^\top$ is an eigendecomposition of $X$. Note that $f(X)$ is well defined.
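As a quick illustration (not from the paper), this transfer of a scalar function to a symmetric matrix is a direct computation from an eigendecomposition, and it is all that is needed to evaluate the regularizer and the updates below, e.g., via the matrix logarithm and exponential.

```python
import numpy as np

def apply_to_symmetric(f, X):
    """Compute f(X) = sum_i f(lambda_i) u_i u_i^T for symmetric X via an
    eigendecomposition (a sketch; assumes X is a symmetric ndarray)."""
    lam, U = np.linalg.eigh(X)      # eigenvalues and orthonormal eigenvectors
    return (U * f(lam)) @ U.T       # U diag(f(lambda)) U^T

# e.g. the matrix logarithm, which appears in nabla R(X) = log(X) below
X = np.array([[2.0, 0.5],
              [0.5, 1.0]])
log_X = apply_to_symmetric(np.log, X)
```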
For $X$ positive definite, we let $R(X)$ denote the unnormalized negative entropy of $X$:
$$R(X) = \operatorname{tr}\big(X \log(X) - X\big) .$$
It is well known that $R$ is a Legendre function over $A = S_+$. In particular, the derivative of $R$ exists on $A^\circ = S_{++}$ and satisfies $\nabla R(X) = \log(X)$. Thus, the underlying Bregman divergence is equal to
$$D_R(X, Y) = \operatorname{tr}\big(X \log X - X \log Y - X + Y\big) .$$
For brevity, we will call $D_R(X, Y)$ the relative entropy of $X$ with respect to $Y$.

It remains to choose the mappings $\phi_t : K_\tau \to K_\tau$. For $0 \le \alpha \le 1$ and $0 < c_\phi$, let $f_\alpha(\lambda) = (1 - \alpha)\lambda + \alpha c_\phi/N$ and define
$$\phi_t(X) = f_\alpha(X) = (1 - \alpha)X + \frac{\alpha c_\phi}{N} I_{N \times N} . \qquad (6)$$
Note that the latter is a generalization of the fixed-share update (5). To see why the second equality holds, notice that since $X$ is symmetric, for the eigendecomposition $X = \sum_{i=1}^N \lambda_i u_i u_i^\top$ it holds that $\sum_{i=1}^N u_i u_i^\top = I_{N \times N}$. Furthermore, by the convexity of $K$, if $\frac{c_\phi}{N} I_{N \times N} \in K$, then $\phi_t(X) \in K$ for any $X \in K$. That is, in this case $\phi_t$ is a valid transformation in TMD. From Theorem 3 we get the following result:

Theorem 7. Assume $\eta, \gamma > 0$ are such that $\eta\sqrt{\gamma} \le 1$, and let $\eta_t = \eta > 0$ for all $t > 0$. Let $K \subseteq K_\tau$ be a closed convex set with $\frac{\tau}{N} I_{N \times N} \in K$, and let $\phi_t$ be defined by (6) with $c_\phi = \tau$. If $\tau(1 - \alpha + \alpha/N) \ge 1$, define $\bar{N} = \log\big((1 - \alpha + \alpha/N)\tau\big)$; otherwise, let $\bar{N} = \log\frac{N}{\alpha\tau}$. Let $F_1^T, U_1^T$ be sequences such that $U_t \in K$ and $F_t \in \mathcal{L}$ for some $\mathcal{L} \subseteq \mathcal{L}_\gamma$. Let $W_1 = \frac{\tau}{N} I_{N \times N}$ and let $W_2^T$ be the sequence of matrices chosen by TMD for $t \in [2, T]$, with $W_t \in K$, when the adversary's choices are $F_1^T$. Finally, let $R(U_1^T) = \sum_{t=1}^T \big(\langle F_t, W_t\rangle - \langle F_t, U_t\rangle\big)$ be the regret of TMD against $U_1^T$ on this sequence. Then, defining $U_{T+1} = U_1$,
$$R(U_1^T) \le \frac{1}{\eta}\Bigg( \tau + \|U_1\|(\log N - 1) + \alpha\tau T + \log\frac{1}{1-\alpha}\sum_{t=1}^{T}\|U_t\| + \bar{N}\sum_{t=1}^{T}\|U_t - U_{t+1}\| \Bigg) + \eta T \sup_{W \in K, F \in \mathcal{L}} \langle W, F^2\rangle .$$

Before presenting the proof of the theorem, which is a direct consequence of Theorem 3 and standard techniques for analyzing the mirror descent algorithm for positive semi-definite matrices (see, e.g., Tsuda et al., 2006; Arora & Kale, 2007; Hazan et al., 2012), we present regret bounds for two special cases, the $(\tau, \gamma)$- and $(\beta, \tau, \gamma)$-matrix prediction problems. Note that Theorem 13 of Hazan et al. (2012) follows from this result (when $U_t$ is a constant sequence). This clearly demonstrates the power of Theorem 3.

Corollary 8. Suppose the assumptions of Theorem 7 hold and $T > 1$.⁴ Set $\alpha = 1/T$ and suppose $\tau \ge 1 + \frac{N-1}{TN - N + 1}$.⁵ Then, for the $(\tau, \gamma)$-matrix prediction problem, selecting $\eta = \sqrt{\frac{1 + \log N}{\gamma T}}$, we have
$$R(U_1^T) = O\Bigg( \tau\sqrt{\gamma T (1 + \log N)} + \sqrt{\frac{\gamma T}{1 + \log N}}\,\log(\tau)\sum_{t=1}^{T-1}\|U_t - U_{t+1}\| \Bigg) .$$
For the $(\beta, \tau, \gamma)$-matrix prediction problem, selecting $\eta = \sqrt{\frac{\tau(1 + \log N)}{\beta\gamma T}}$, we have
$$R(U_1^T) = O\Bigg( \sqrt{\beta\gamma\tau T (1 + \log N)} + \sqrt{\frac{\beta\gamma T}{\tau(1 + \log N)}}\,\log(\tau)\sum_{t=1}^{T-1}\|U_t - U_{t+1}\| \Bigg) .$$

⁴ The inequality $\eta\sqrt{\gamma} \le 1$ implies a larger lower bound on $T$.
⁵ Similar results can be derived for $\tau < 1 + \frac{N-1}{TN - N + 1}$.

Proof. We apply Theorem 7. By the definition of $\alpha$, $\alpha\tau T = \tau$. Since $U_t \in K$ and $\log\frac{1}{1-\alpha} = \log\frac{T}{T-1} \le \frac{1}{T-1}$, we have $\|U_1\| + \log\frac{1}{1-\alpha}\sum_{t=1}^{T}\|U_t\| = O(\tau)$. By the condition on $\tau$, $\bar{N} \le \log\tau$. Moreover, $\|U_T - U_1\| \le \sum_{t=1}^{T-1}\|U_t - U_{t+1}\|$. Finally, for $W \in K_\tau$ and $F \in \mathcal{L}_\gamma$, $\langle W, F^2\rangle \le \|W\| \|F^2\|_* \le \tau\gamma$, while for $W \in K_{\tau,\beta}$ and $F \in \widehat{\mathcal{L}}_\gamma$, $\langle W, F^2\rangle \le \beta\gamma$. Putting everything together and optimizing the value of $\eta$ without the $\sum_{t=1}^{T-1}\|U_t - U_{t+1}\|$ term, we obtain the bounds of the corollary.⁶

⁶ In fact, the constants for the $(\tau, \gamma)$-matrix prediction problem can be slightly improved by using that $D_R$ is $1/\tau$-strongly convex on $K_\tau$ and applying (3) (i.e., the prox-lemma), but this is left out for simplicity.

Proof of Theorem 7. The result follows from Theorem 3 once we choose appropriate values for the parameters of that theorem and verify its conditions. Let us first choose values for $L_t$, $M_t$, and $N_t$. Fix a value of $t$. We need to select $L_t$ so that $D_R(0, W_{t+1}) - D_R(0, V_{t+1}) \le L_t$. Since $D_R(0, Y) = \operatorname{tr}(Y)$, we have
$$D_R(0, W_{t+1}) - D_R(0, V_{t+1}) = \operatorname{tr}(W_{t+1} - V_{t+1}) = \alpha\big(\tau - \operatorname{tr}(V_{t+1})\big) \le \alpha\tau ,$$
thanks to $V_{t+1} \in S_+$. Hence, we choose $L_t = \alpha\tau$.
Now, consider the condition
$$\sup_{U \in (K \cap S_{++})\setminus\{0\}} \frac{\langle \nabla R(V_{t+1}) - \nabla R(W_{t+1}),\, U\rangle}{\|U\|} \le M_t$$
(the domain of $\nabla R$ is $S_{++}$). Fix some $U \in K \cap S_+$ and let $V_{t+1} = \sum_{i=1}^N \lambda_i z_i z_i^\top$ be an eigendecomposition of $V_{t+1}$. By the definition of $\phi$, $W_{t+1} = \sum_{i=1}^N \big((1 - \alpha)\lambda_i + \alpha\tau/N\big) z_i z_i^\top$. As noted earlier, $\nabla R(X) = \log(X)$. Hence,
$$Z := \nabla R(V_{t+1}) - \nabla R(W_{t+1}) = \sum_{i=1}^N \log\frac{\lambda_i}{(1-\alpha)\lambda_i + \alpha\tau/N}\, z_i z_i^\top \qquad \text{and} \qquad \langle Z, U\rangle = \sum_{i=1}^N \log\frac{\lambda_i}{(1-\alpha)\lambda_i + \alpha\tau/N}\, \langle z_i z_i^\top, U\rangle .$$
Now, since $U$ is positive semi-definite, $\langle z_i z_i^\top, U\rangle = \operatorname{tr}(z_i z_i^\top U) = z_i^\top U z_i \ge 0$. Therefore,
$$\langle Z, U\rangle \le C \sum_{i=1}^N \langle z_i z_i^\top, U\rangle , \qquad \text{where } C = \max_{1 \le j \le N} \log\frac{\lambda_j}{(1-\alpha)\lambda_j + \alpha\tau/N} .$$
Furthermore,
$$\log\frac{\lambda_j}{(1-\alpha)\lambda_j + \alpha\tau/N} \le \log\frac{\lambda_j}{(1-\alpha)\lambda_j} = \log\frac{1}{1-\alpha} ,$$
and hence $C \le \log\frac{1}{1-\alpha}$. Introduce $\bar{Z} = \sum_{i=1}^N \log\frac{1}{1-\alpha}\, z_i z_i^\top$. Thus $\langle Z, U\rangle \le \langle \bar{Z}, U\rangle$, and from Hölder's inequality we get $\langle \bar{Z}, U\rangle \le \|\bar{Z}\|_* \|U\| = \log\frac{1}{1-\alpha}\|U\|$, and so we choose $M_t = \log\frac{1}{1-\alpha}$.

Let us now turn to the choice of $N_t$. We need to choose $N_t$ such that
$$\|\nabla R(W_{t+1})\|_* \le N_t . \qquad (7)$$
We have $\|\nabla R(W_{t+1})\|_* = \max_{1 \le i \le N} \big|\log\big((1-\alpha)\lambda_i + \alpha\tau/N\big)\big|$. A simple case analysis gives that this is upper bounded by $\bar{N}$, which can thus be chosen as the value of $N_t$.

Now, let us bound $D_R(W_t, \tilde{V}_{t+1})$. For this, we use the following lemma, which can be extracted from Tsuda et al. (2006), Arora & Kale (2007), or Hazan et al. (2012):

Lemma 9. Let $R$ be the negentropy regularizer, $F \in S$ with $\|F\|_* \le 1$, $W \in S_{++}$, and $\tilde{V} = \operatorname{argmin}_{V \in S_+} \langle F, V\rangle + D_R(V, W)$. Then
$$D_R(W, \tilde{V}) \le \langle W, F^2\rangle .$$

Proof. Note that $\tilde{V} = \nabla R^{-1}\big(\nabla R(W) - F\big) = \exp\big(\log(W) - F\big)$. Plugging this into the definition of $D_R$, we get
$$D_R(W, \tilde{V}) = \operatorname{tr}\big(W \log W - W \log \tilde{V} - W + \tilde{V}\big) = \operatorname{tr}\big(W \log W - W(\log W - F) - W + \tilde{V}\big) = \operatorname{tr}\big(WF - W + \exp(\log(W) - F)\big) .$$
By the Golden–Thompson inequality, $\operatorname{tr}\big(\exp(\log(W) - F)\big) \le \operatorname{tr}\big(W \exp(-F)\big)$. Now, for any $A \in S$ with $\|A\|_* \le 1$, $\exp(A) \preceq I_{N \times N} + A + A^2$ (for matrices $A, B \in S$, $A \preceq B$ denotes $B - A \in S_+$). Further, for any $W, A, B \in S_+$, $A \preceq B$ implies $\langle W, A\rangle \le \langle W, B\rangle$. Hence, $\operatorname{tr}\big(W \exp(-F)\big) \le \operatorname{tr}\big(W(I_{N \times N} - F + F^2)\big)$. Putting the inequalities together and cancelling terms, we get the claimed inequality.

Using this lemma with $F = \eta F_t$ and $W = W_t$ (since $\eta\sqrt{\gamma} \le 1$ by assumption, $\|\eta F_t\|_* \le 1$), we get $D_R(W_t, \tilde{V}_{t+1}) \le \eta^2 \langle W_t, F_t^2\rangle$. Moreover, $D_R(V_{t+1}, \tilde{V}_{t+1}) \ge 0$.

To finish, we choose $U_{T+1} = U_1$, and so it remains to bound $D_R(U_1, W_1)$. Let $U_1 = \sum_{i=1}^N \lambda_i z_i z_i^\top$ be the eigendecomposition of $U_1$. Since $U_1$ is symmetric, we can write $W_1 = \frac{\tau}{N} I_{N \times N} = \frac{\tau}{N}\sum_{i=1}^N z_i z_i^\top$. Hence,
$$D_R(U_1, W_1) = \tau + \sum_{i=1}^N \Big(\lambda_i \log\frac{\lambda_i}{\tau/N} - \lambda_i\Big) = \tau + \|U_1\|\log N + \sum_{i=1}^N \lambda_i \log\frac{\lambda_i}{\tau} - \|U_1\| \le \tau + \|U_1\|(\log N - 1) ,$$
where we used that $\log(\lambda_i/\tau) \le 0$, since $0 \le \lambda_i \le \tau$, and that $\sum_{i=1}^N \lambda_i = \|U_1\|$. Plugging the bounds obtained into (4), we get
$$R(U_1^T) \le \frac{1}{\eta}\Big( \tau + \|U_1\|(\log N - 1) + \alpha\tau T + \log\frac{1}{1-\alpha}\sum_{t=1}^{T}\|U_t\| + \bar{N}\sum_{t=1}^{T}\|U_t - U_{t+1}\| \Big) + \eta T \sup_{W \in K, F \in \mathcal{L}} \langle W, F^2\rangle ,$$
which is the desired bound.

5. Conclusion

We presented a unifying framework for deriving mirror-descent-based algorithms for online learning in changing environments. A generic result was provided that indicated how mirror descent algorithms can be modified to obtain shifting regret bounds and shifting regret bounds over intervals. As corollaries, we derived existing variants of the mirror descent algorithm (for various problems) and recovered their shifting regret bounds, and we also derived a new matrix prediction algorithm for tracking, together with the first shifting bound for matrix prediction problems.
Acknowledgements

We would like to thank Yaoliang Yu and the anonymous reviewers for their insightful comments on earlier versions of this paper. This work was supported in part by Alberta Innovates Technology Futures through the Alberta Ingenuity Centre for Machine Learning, and by NSERC.

References

D. Adamskiy, W. M. Koolen, A. V. Chernov, and V. Vovk. A closer look at adaptive regret. In Algorithmic Learning Theory – 23rd International Conference, ALT 2012, Lyon, France, October 29–31, 2012, Proceedings, pp. 290–304, 2012.

S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. In STOC, pp. 227–236, 2007.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, Nov. 2002.

N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In P. L. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 989–997, 2012.

A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pp. 1405–1411, 2015.

A. György, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, IT-58(11):6709–6725, Nov. 2012.

E. C. Hall and R. M. Willett. Dynamical models and tracking regret in online convex programming. In Proc. 30th International Conference on Machine Learning (ICML 2013), volume 28 of JMLR Workshop and Conference Proceedings, Atlanta, GA, June 2013.

E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proc. 26th Annual International Conference on Machine Learning, pp. 393–400. ACM, 2009.

E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In COLT, 2012.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1998.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2006.

F. M. J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE Transactions on Information Theory, IT-42:2210–2217, Nov. 1996.

F. M. J. Willems and M. Krom. Live-and-die coding for binary piecewise i.i.d. sources. In Proc. 1997 IEEE Int. Symp. Inform. Theory, pp. 68, Ulm, Germany, June–July 1997.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. 20th International Conference on Machine Learning (ICML 2003), Washington, DC, 2003.