Online Learning for Structured Loss Spaces

Siddharth Barman, Aditya Gopalan, Aadirupa Saha
Indian Institute of Science, Bangalore 560012
{barman, aditya, aadirupa}@iisc.ac.in

Abstract

We consider prediction with expert advice when the loss vectors are assumed to lie in a set described by the sum of atomic norm balls. We derive a regret bound for a general version of the online mirror descent (OMD) algorithm that uses a combination of regularizers, each adapted to the constituent atomic norms. The general result recovers standard OMD regret bounds, and yields regret bounds for new structured settings where the loss vectors are (i) noisy versions of vectors from a low-dimensional subspace, (ii) sparse vectors corrupted with noise, and (iii) sparse perturbations of low-rank vectors. For the problem of online learning with structured losses, we also show lower bounds on regret in terms of rank and sparsity of the loss vectors, which imply lower bounds for the above additive loss settings as well.

A full version of this paper is available at https://arxiv.org/abs/1706.04125.

1 Introduction

Online learning problems, such as prediction with expert advice (Cesa-Bianchi and Lugosi 2006) and online convex optimization (Zinkevich 2003), involve a learner who sequentially makes decisions from a decision set. The learner seeks to minimize her total loss over a sequence of loss functions, unknown at the beginning but revealed causally. Specifically, she attempts to achieve low regret, for each sequence in a class of loss sequences, with respect to the best single decision point in hindsight. The theory of online learning has, by now, yielded flexible and elegant algorithmic techniques that enjoy provably sublinear regret in the time horizon of plays.

Regret bounds for online learning algorithms typically hold across inputs (loss function sequences) that have little or no structure. For instance, for the prediction with experts problem, the exponentially weighted forecaster (Cesa-Bianchi and Lugosi 2006) is known to achieve an expected regret of $O(\sqrt{T \ln N})$ over any sequence of $N$-dimensional loss vectors with coordinates bounded in $[0, 1]$; here $T$ is the number of rounds of play. There is often, however, more structure in the inputs of online learning problems beyond elementary $\ell_\infty$-type constraints, which a learner with a priori knowledge can hope to exploit in order to improve her performance. A notable example is when the loss vectors for the prediction with experts problem come from a low-dimensional subspace (Hazan et al. 2016). This is often the case in recommender systems based on latent factor models (Koren, Bell, and Volinsky 2009), where users and items are represented in terms of their feature or attribute vectors, typically of small dimension. Under a bilinear model for the utility of a user-item pair, each user's utility across all items becomes a vector from a subspace of dimension at most the size of the feature vectors. (Hazan et al. 2016) show that in this setup the learner can limit her regret to $O(\sqrt{dT})$ when each loss vector comes from a $d$-dimensional subspace of $\mathbb{R}^N$. If $d \ll N$ (in fact, $d = o(\ln N)$), then this is potentially advantageous over a more general best-experts algorithm like Exponential Weights.
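To make this comparison concrete (the numbers here are ours, for illustration only): with $N = 10^6$ experts we have $\ln N \approx 13.8$, so for a loss sequence drawn from a subspace of dimension $d = 4$,

$$\sqrt{T \ln N} \approx 3.7\sqrt{T} \qquad \text{versus} \qquad \sqrt{dT} = 2\sqrt{T},$$

and the gap widens as $N$ grows while $d$ stays fixed.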
This example is interesting not only because it shows that geometric or structural properties known in advance can help the learner achieve order-wise better regret, but also because it opens up the possibility of studying whether other, arguably more realistic, forms of structure can be exploited, such as sparsity in the input (or, more generally, small norm) and, more importantly, additive combinations of such structures, e.g., low-rank losses added to losses of small $\ell_2$-norm, which expresses losses that are noisy perturbations of a low-dimensional subspace. In this paper, we take a step in this direction and develop a framework for online learning problems with structured losses.

Our Results and Techniques: We consider the prediction with experts problem with loss sequences in which each element (loss vector) belongs to a set that respects structural constraints. Specifically, we assume that the loss vectors belong to a sum of atomic norm balls (Chandrasekaran et al. 2012), i.e., centrally symmetric, convex, compact sets with their centroids at the origin, say $A + B$, where the sum of the sets $A$ and $B$ is in the Minkowski sense, $A + B = \{a + b : a \in A,\, b \in B\}$. For this setup, which we call online learning with additive loss spaces, we show a general regret guarantee for an online mirror descent (OMD) algorithm that uses a combination of regularizer functions, each of which is adapted to a constituent atomic norm of $A$ and $B$, respectively. Specializing this result for a variety of loss-function sets recovers standard OMD regret guarantees for strongly convex regularizers (Shalev-Shwartz 2012a), and subsumes a result for the online low-rank problem (Hazan et al. 2016). More importantly, this allows us to obtain new results from old regret guarantees for settings such as noisy low rank (where losses are perturbations of vectors from a low-dimensional subspace), noisy sparse (where losses are perturbations of sparse vectors), and sparse low rank (where losses are sparse perturbations of vectors from a low-dimensional subspace); see Tables 1 and 2.

Another contribution of this work is to show lower bounds on regret for online learning with structured losses. We derive a generic lower bound on regret, for any algorithm for the prediction with experts problem, using structured (in terms of sparsity and dimension) loss vectors. This result allows us to derive regret lower bounds in a variety of individual and additive loss space settings, including sparse, noisy, low-rank, noisy low-rank, and noisy sparse losses.

Related work. The work that is perhaps closest in spirit to ours is that of (Hazan et al. 2016), who study the best experts problem when the loss vectors all come from a low-dimensional subspace of the ambient space. A key result of theirs is that the online mirror descent (OMD) algorithm, used with a suitable regularizer, improves the regret to depend only on the low rank and not the ambient dimension. More broadly, OMD theory provides regret bounds depending on properties of the regularizer and the geometry of the loss and decision spaces (Shalev-Shwartz 2012b). We are able to notably generalize this to the more flexible setting of additive losses. Online learning with structure has been studied in the recent past from the point of view of overall sequence complexity or hardness (learning with "easy data").
This includes work that shows algorithms enjoying first- and second-order regret bounds (Cesa-Bianchi, Mansour, and Stoltz 2007), and with performance depending on the quadratic variation of the inputs (Hazan and Kale 2010; Steinhardt and Liang 2014). There is also recent work on achieving regret scaling with the covering number of the sequence of observed loss vectors (Cohen and Mannor 2017), which is another notion of easy data. Our problem formulation explores a different notion of learning with easy data, in which the adversary, instead of being constrained to choose loss sequences with low total magnitude or variation, chooses loss vectors from sets with enough geometric structure (atomic norm balls).

2 Notation and Preliminaries

For an integer $n \in \mathbb{Z}_+$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a vector $x \in \mathbb{R}^n$, $x_i$ denotes the $i$th component of $x$. The $p$-norm of $x$ is defined as $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, for $1 \le p < \infty$. Write $\|x\|_\infty := \max_{i \in [n]} |x_i|$ and $\|x\|_0 := |\{i \mid x_i \neq 0\}|$. If $\|\cdot\|$ is a norm defined on a closed convex set $\Omega \subseteq \mathbb{R}^n$, then its corresponding dual norm is defined as $\|u\|_* = \sup_{x \in \Omega:\, \|x\| \le 1} x^\top u$, where $x^\top u = \sum_i x_i u_i$ is the standard inner product. It follows that the dual of the standard $p$-norm ($p \ge 1$) is the $q$-norm, where $q$ is the Hölder conjugate of $p$, i.e., $\frac{1}{p} + \frac{1}{q} = 1$. The $n$-probability simplex is defined as $\Delta_n = \{x \in [0, 1]^n \mid \sum_{i=1}^n x_i = 1\}$. Given any set $A \subseteq \mathbb{R}^n$, we denote the convex hull of $A$ by $\mathrm{conv}(A)$. Clearly, when $A = \{e_1, e_2, \ldots, e_n\}$, we have $\mathrm{conv}(A) = \Delta_n$, where $e_i \in [0, 1]^n$ denotes the $i$th standard basis vector of $\mathbb{R}^n$.

2.1 Atomic Norm and its Dual (Chandrasekaran et al. 2012)

The notion of an atomic norm, along with its dual, provides a unified framework for addressing structured loss spaces and will be used extensively in the paper. Let $A \subseteq \mathbb{R}^n$ be a set which is convex, compact, and centrally symmetric about the origin (i.e., $a \in A$ if and only if $-a \in A$). The atomic norm induced by the set $A$ is defined as $\|x\|_A := \inf\{t > 0 \mid x \in t A\}$, for $x \in \mathbb{R}^n$. The dual of the atomic norm induced by $A$ is the support function of $A$ (Boyd and Vandenberghe 2004); formally, $\|x\|^*_A := \sup\{x \cdot z \mid z \in A\}$, for $x \in \mathbb{R}^n$. For example, if the set $A$ is the convex hull of all unit-norm one-sparse vectors, i.e., $A := \mathrm{conv}(\{\pm e_i\}_{i=1}^n)$, then the corresponding atomic norm is the standard $\ell_1$-norm, $\|x\|_1 = \sum_i |x_i|$.

2.2 Problem setup

We consider the online learning problem of learning with expert advice from a collection of $N$ experts (Cesa-Bianchi and Lugosi 2006). In each round $t = 1, 2, \ldots, T$, the learner receives advice from each of the $N$ experts, following which the learner selects an expert from a distribution $p_t \in \Delta_N$, maintained over the experts, whose advice is to be followed. Upon this, the adversary reveals the losses incurred by the $N$ experts, $l_t = (l_t(1), l_t(2), \ldots, l_t(N)) \in [0, 1]^N$, where $l_t(i)$ is the loss incurred by the $i$th expert. The learner suffers an expected loss of $\mathbb{E}_{I_t \sim p_t}[l_t(I_t)] = \sum_{i=1}^N p_t(i)\, l_t(i)$. If the game is played for a total of $T$ rounds, then the objective of the learner is to minimize the expected cumulative regret, defined as

$\mathbb{E}[\mathrm{Regret}_T] = \sum_{t=1}^{T} p_t \cdot l_t - \min_{i \in [N]} \sum_{t=1}^{T} l_t(i).$

It is well known that, without any further assumptions on the losses $l_t$, the best achievable regret for this problem is $\Theta(\sqrt{T \ln N})$. Indeed, the exponential weights algorithm, or the Hedge algorithm, achieves regret $O(\sqrt{T \ln N})$ (Arora, Hazan, and Kale 2012, Theorem 2.3), and a matching lower bound exists as well (Cesa-Bianchi and Lugosi 2006, Theorem 3.7).
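As a point of reference for the structured settings that follow, here is a minimal sketch (ours, not from the paper) of the Hedge forecaster and its realized regret; the function name `hedge_regret` and the learning-rate choice below are illustrative, not prescribed by the paper.

```python
import numpy as np

def hedge_regret(loss_matrix, eta):
    """Exponential-weights (Hedge) forecaster on a T x N matrix of [0, 1] losses;
    returns the realized regret against the best single expert in hindsight."""
    T, N = loss_matrix.shape
    w = np.ones(N)
    learner_loss = 0.0
    for t in range(T):
        p = w / w.sum()                         # p_t on the simplex
        learner_loss += p @ loss_matrix[t]      # expected loss this round
        w *= np.exp(-eta * loss_matrix[t])      # multiplicative-weight update
    return learner_loss - loss_matrix.sum(axis=0).min()

# A standard tuning for [0, 1] losses is eta = sqrt(8 ln N / T).
rng = np.random.default_rng(0)
L = rng.random((1000, 16))
print(hedge_regret(L, np.sqrt(8 * np.log(16) / 1000)))
```

The update is oblivious to any structure in the loss vectors, which is exactly what the remainder of the paper seeks to exploit.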
Now, a very natural question to ask is: can a better (smaller) regret be achieved if the loss sequence has more structure? Suppose the loss vectors $(l_t)_{t=1}^T$ all belong to a common structured loss space $L \subseteq [0, 1]^N$, such as:

1. Sparse loss space: $L = \{l \in [0, 1]^N \mid \|l\|_0 = s\}$. Here, $s \in [N]$ is the sparsity parameter.

2. Spherical loss space: $L = \{l \in [0, 1]^N \mid \|l\|_A^2 = l^\top A\, l \le \epsilon\}$, where $A$ is a positive definite matrix and $\epsilon > 0$.

3. Noisy loss space: $L = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon\}$, $\epsilon > 0$. Note that noisy losses are a special class of spherical losses, obtained with $A = I_N$, the identity matrix.

4. Low-rank loss space: $L = \{l \in [0, 1]^N \mid l = Uv\}$, where the rank of the matrix $U \in \mathbb{R}^{N \times d}$ is equal to $d \in [N]$ and $v \in \mathbb{R}^d$ (as mentioned previously, such loss vectors were considered by (Hazan et al. 2016)).

5. Additive loss space: $L = L_1 + L_2$ (Minkowski sum). More formally, $L = \{l = l_1 + l_2 \mid l_1 \in L_1 \text{ and } l_2 \in L_2\}$, where $L_1 \subseteq [0, 1]^N$ and $L_2 \subseteq [0, 1]^N$ are structured loss spaces themselves. (Note that, in the problem setup at hand, the learner observes only the combined loss vectors $l_t$, and does not have access to the components $l_{1t}$ or $l_{2t}$.) Examples include any combination of the previously described loss spaces, such as the low-rank + noisy space.

The Exponential Weights or Hedge algorithm achieves $O(\sqrt{T \ln N})$ regret (Cesa-Bianchi and Lugosi 2006; Shalev-Shwartz 2012b) in all of the above settings. The relevant question is whether the geometry of such loss spaces can be exploited in a principled fashion to achieve improved regret guarantees (possibly independent of $\ln N$). In other words, can we come up with algorithms for the above cases such that the regret is $O(\sqrt{\omega T})$, where $\omega < \ln N$? We will show that, for all of the above loss spaces, we can obtain a regret factor $\omega$ that is order-wise better than $\ln N$. In particular, we will establish these regret bounds by employing the Online Mirror Descent algorithm (described below) with the right choice of atomic norms. Furthermore, using this algorithm, we will also develop a framework to obtain new regret bounds from old. That is, we show that if we have an online mirror descent setup for $L_1$ and $L_2$, then we can in fact obtain a low-regret algorithm for the additive loss space $L_1 + L_2$.

2.3 Online Mirror Descent

In this section, we give a brief introduction to the Online Mirror Descent (OMD) algorithm (Bubeck 2011; Shalev-Shwartz 2012b), which is a subgradient-descent-based method for online convex optimization with a suitably chosen regularizer. A reader well versed with the analysis of OMD may skip the statement of Theorem 3 and proceed to the next section. OMD generalizes the basic mirror descent algorithm used for offline optimization problems (see, e.g., (Beck and Teboulle 2003)). Before detailing the algorithm, we recall a few relevant definitions.

Definition 1 (Bregman Divergence). Let $\Omega \subseteq \mathbb{R}^n$ be a convex set, and $f : \Omega \to \mathbb{R}$ be a strictly convex and differentiable function. Then the Bregman divergence associated with $f$, denoted by $B_f : \Omega \times \Omega \to \mathbb{R}$, is defined as $B_f(u, v) = f(u) - f(v) - (u - v)^\top \nabla f(v)$, for $u, v \in \Omega$.

Definition 2 (Strong Convexity). Let $\Omega \subseteq \mathbb{R}^n$ be a convex set, and $f : \Omega \to \mathbb{R}$ be a differentiable function. Then $f$ is called $\alpha$-strongly convex over $\Omega$ with respect to the norm $\|\cdot\|$ iff for all $x, y \in \Omega$, $f(x) - f(y) - (\nabla f(y))^\top (x - y) \ge \frac{\alpha}{2}\|x - y\|^2$. Equivalently, a continuous, twice differentiable function $f$ over $\Omega$ is $\alpha$-strongly convex iff for all $x, w \in \Omega$ we have $x^\top \nabla^2 f(w)\, x \ge \alpha \|x\|^2$.

We now describe the OMD algorithm for the online learning problem setup (Sec. 2.2).

Algorithm 1: Online Mirror Descent (OMD)
1: Parameters: learning rate $\eta > 0$
2: Convex set $\Omega \subseteq \mathbb{R}^N$ such that $\Delta_N \subseteq \Omega$
3: Strictly convex, differentiable function $R : \Omega \to \mathbb{R}$
4: Initialize: $p_1 = \mathrm{argmin}_{p \in \Delta_N} R(p)$
5: for $t = 1, 2, \ldots, T$ do
6:   Play $p_t \in \Delta_N$
7:   Receive loss vector $l_t \in [0, 1]^N$
8:   Incur loss $p_t \cdot l_t$
9:   Update:
10:    $\nabla R(\tilde{p}_{t+1}) = \nabla R(p_t) - \eta\, l_t$ (assuming $\tilde{p}_{t+1} \in \Omega$)
11:    $p_{t+1} = \mathrm{argmin}_{p \in \Delta_N} B_R(p, \tilde{p}_{t+1})$
12: end for

The following regret guarantee for the above algorithm is well known.

Theorem 3 (OMD regret bound (Theorem 5.2, (Bubeck 2011))). Let the loss vectors $\{l_t\}_{t=1}^T$ belong to a loss space $L \subseteq [0, 1]^N$ which is bounded with respect to an (arbitrary) norm $\|\cdot\|$; in particular, for any $l \in L$ we have $\|l\| \le G$. Furthermore, let $\Omega \supseteq \Delta_N$ be a convex set, and $R : \Omega \to \mathbb{R}$ be a strictly convex, differentiable function that satisfies $R(p) - R(p_1) \le D^2$ for a parameter $D \in \mathbb{R}$ and all $p \in \Delta_N$, where $p_1 := \mathrm{argmin}_{p \in \Delta_N} R(p)$. Also, let the restriction of $R$ to $\Delta_N$ be $\alpha$-strongly convex with respect to $\|\cdot\|_*$, the dual norm of $\|\cdot\|$. Then, the regret of the OMD algorithm with set $\Omega$, regularizer function $R$, and learning rate $\eta > 0$, over $T$ rounds, satisfies

$\mathrm{Regret}_T(\mathrm{OMD}(\eta)) = \sum_{t=1}^{T} p_t \cdot l_t - \min_{i \in [N]} \sum_{t=1}^{T} l_t(i) \;\le\; \frac{D^2}{\eta} + \frac{\eta\, G^2\, T}{2\alpha},$

where $p_1, p_2, \ldots, p_T$ denote the sequential predictions of the algorithm over the $T$ rounds. Moreover, setting $\eta^* = \frac{D}{G}\sqrt{\frac{2\alpha}{T}}$ (i.e., minimizing the right-hand side of the above bound), we have $\mathrm{Regret}_T(\mathrm{OMD}(\eta^*)) \le D G \sqrt{\frac{2T}{\alpha}}$.

Table 1: OMD Regret Bounds for Structured Loss Spaces
Loss Space | Regret Bound | Atomic Norm | Regularizer
s-Sparse | $O\!\big(\sqrt{\ln(s+1)\,T}\big)$ | $\frac{1}{2}\|x\|_p$, with $p = 2\ln(s+1)$ | $\|x\|_q^2$, with $q = \frac{p}{p-1}$
Spherical | $O\!\big(\sqrt{\epsilon\,\lambda_{\max}(A^{-1})\,T}\big)$ | $\frac{1}{\sqrt{\epsilon}}\|x\|_A$ | $\epsilon\, x^\top A^{-1} x$
Noisy | $O\!\big(\sqrt{\epsilon\,T}\big)$ | $\frac{1}{\sqrt{\epsilon}}\|x\|_2$ | $\epsilon\, x^\top x$

3 Online Mirror Descent for Structured Losses

This section shows that, for specific structured loss spaces, instantiating the OMD algorithm with the right choice of norm and regularizer $R$ leads to regret guarantees that improve over the standard $O(\sqrt{T \ln N})$ bound. Proofs of these results appear in the full version of the paper.

1. Sparse loss space: $L = \{l \in [0, 1]^N \mid \|l\|_0 = s\}$, with $s \in [N]$ the loss sparsity parameter. Using the $q$-norm regularizer $R(x) = \|x\|_q^2 = \big(\sum_{i=1}^N |x_i|^q\big)^{2/q}$, where $q = \frac{\ln s'}{\ln s' - 1}$ and $s' = (s+1)^2$, we get $\mathrm{Regret}_T = O\!\big(\sqrt{\ln(s+1)\,T}\big)$.

2. Spherical loss space: $L = \{l \in [0, 1]^N \mid \|l\|_A^2 = l^\top A\, l \le \epsilon\}$, where $A$ is a positive definite matrix and $\epsilon > 0$. Using the square of the ellipsoidal norm as the regularizer, $R(x) = \epsilon \|x\|_{A^{-1}}^2 = \epsilon\, x^\top A^{-1} x$, we get $\mathrm{Regret}_T = O\!\big(\sqrt{\lambda_{\max}(A^{-1})\,\epsilon\,T}\big)$, where $\lambda_{\max}(A^{-1})$ denotes the maximum eigenvalue of $A^{-1}$.

3. Noisy loss space: $L = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon\}$, $\epsilon > 0$. Using the square of the standard Euclidean norm as the regularizer, $R(x) = \epsilon\|x\|_2^2$, we get $\mathrm{Regret}_T = O\!\big(\sqrt{\epsilon\,T}\big)$. Note that the noisy-loss case is a special case of the spherical-loss setting, with $A = A^{-1} = I_N$.

(Hazan et al. 2016) have also used OMD to address loss vectors that belong to a low-dimensional subspace. Specifically, if the loss space is $L = \{l \in [0, 1]^N \mid l = Uv\}$, with $U \in \mathbb{R}^{N \times d}$ a rank-$d$ matrix and $v \in \mathbb{R}^d$, they show that the regularizer $R(x) = \|x\|_H^2 = x^\top H x$ (where $H = I_N + U M U^\top$, $M$ is the matrix corresponding to the Löwner-John ellipsoid (Hazan et al. 2016) of $L$, and $I_N$ is the identity matrix) leads to the regret bound $\mathrm{Regret}_T = O\!\big(\sqrt{dT}\big)$.

In addition, for the standard loss space $L = [0, 1]^N$, one can execute the OMD algorithm with the unnormalized negative entropy, $R(x) = \sum_{i=1}^N x_i \log x_i - \sum_{i=1}^N x_i$, as the regularizer to obtain $\mathrm{Regret}_T = O\!\big(\sqrt{T \ln N}\big)$. Note that this regret bound is the same as that of the Hedge algorithm. In fact, it can be verified that, with the above choice of regularizer, the OMD algorithm exactly reduces to Hedge (Bubeck 2011).

4 Online Learning for Additive Losses

We now present a key result of this paper, which enables us to obtain new regret bounds from old.
In particular, we will develop a framework that provides a low-regret OMD algorithm for an additive loss space $L = L_1 + L_2$, using the OMD setups of the constituent loss spaces $L_1$ and $L_2$. Specifically, we detail how to choose an appropriate regularizer for losses from $L$ and, hence, construct a low-regret OMD algorithm.

Theorem 4 (Main Result). Let $L_1, L_2 \subseteq [0, 1]^N$ be two loss spaces such that $L_1 \subseteq A_1$ and $L_2 \subseteq A_2$, where $A_1, A_2 \subseteq \mathbb{R}^N$ are two centrally symmetric, convex, compact sets. We observe a sequence of loss vectors $\{l_t\}_{t=1}^T$ such that in any round $t \in [T]$, $l_t = l_{1t} + l_{2t}$, where $l_{1t} \in L_1$ and $l_{2t} \in L_2$. Consider two differentiable, strictly convex functions $R_1 : \Omega_1 \to \mathbb{R}$ and $R_2 : \Omega_2 \to \mathbb{R}$, where $\Omega_1, \Omega_2 \supseteq \Delta_N$ are convex sets. Suppose the restrictions of $R_1$ and $R_2$ to $\Delta_N$ are, respectively, $\alpha_1$- and $\alpha_2$-strongly convex with respect to the dual norms $\|\cdot\|^*_{A_1}$ and $\|\cdot\|^*_{A_2}$. Also, let parameters $D_1$ and $D_2$ be such that $R_1(p) - R_1(p_1) \le D_1^2$ and $R_2(p) - R_2(p_1) \le D_2^2$ for all $p \in \Delta_N$, where $p_1 := \mathrm{argmin}_{p \in \Delta_N}\big(R_1(p) + R_2(p)\big)$. Then, with learning rate $\eta = \sqrt{\frac{(D_1^2 + D_2^2)\min(\alpha_1, \alpha_2)}{T}}$, regularizer $R := R_1 + R_2$, and $p_1$ as the initial prediction, the regret of the OMD algorithm is bounded as

$\mathrm{Regret}_T \;\le\; 2\sqrt{\frac{(D_1^2 + D_2^2)\, T}{\min(\alpha_1, \alpha_2)}}.$

A proof of the above theorem appears in Section 4.2.

Table 2: Our Results for Additive Loss Spaces
Loss Space | Regret Bound | Atomic Norm | Regularizer
$d$-Low Rank + $\epsilon$-Noise | $2\sqrt{(16d + \epsilon)\,T}$ | $A = A_1 + A_2$, where $A_1 = \{x \in \mathbb{R}^N \mid x^\top H^{-1} x \le 1\}$, $A_2 = \{x \in \mathbb{R}^N \mid \frac{1}{\epsilon} x^\top x \le 1\}$ | $\|x\|_H^2 + \epsilon\|x\|_2^2$
$s$-Sparse + $\epsilon$-Noise | $2\sqrt{2(1 + \epsilon)\ln(s+1)\,T}$ | $A = A_1 + A_2$, where $A_1 = \{x \in \mathbb{R}^N \mid \frac{1}{2}\|x\|_p \le 1\}$, $A_2 = \{x \in \mathbb{R}^N \mid \frac{1}{\epsilon} x^\top x \le 1\}$ | $\|x\|_q^2 + \epsilon\|x\|_2^2$
$d$-Low Rank + $s$-Sparse | $2\sqrt{2(16d + 1)\ln(s+1)\,T}$ | $A = A_1 + A_2$, where $A_1 = \{x \in \mathbb{R}^N \mid x^\top H^{-1} x \le 1\}$, $A_2 = \{x \in \mathbb{R}^N \mid \frac{1}{2}\|x\|_p \le 1\}$ | $\|x\|_H^2 + \|x\|_q^2$

Remark 5. There exist loss spaces $L_1$ and $L_2$ such that the OMD algorithm obtained via Theorem 4 provides an order-wise optimal regret bound for the additive loss space $L = L_1 + L_2$; see Appendix A for specific examples.

The above theorem leads to the following corollary.

Corollary 6 (New Regret Bounds from Old). Suppose $L_1, L_2 \subseteq [0, 1]^N$ are two loss spaces such that $\|l\|_{A_1} \le 1$ for all $l \in L_1$ and $\|l\|_{A_2} \le 1$ for all $l \in L_2$, where $A_1, A_2 \subseteq \mathbb{R}^N$ are two centrally symmetric, convex, compact sets. Also, suppose there exist two strictly convex, differentiable functions $R_1 : \Omega_1 \to \mathbb{R}$ and $R_2 : \Omega_2 \to \mathbb{R}$ ($\Omega_1, \Omega_2 \supseteq \Delta_N$ convex) such that OMD with regularizer functions $R_1$ and $R_2$ gives regret bounds of $D_1\sqrt{2T/\alpha_1}$ and $D_2\sqrt{2T/\alpha_2}$ over the loss spaces $L_1$ and $L_2$, respectively. Here, $\alpha_1$ ($\alpha_2$) is the strong convexity parameter of $R_1$ ($R_2$) over $\Delta_N$ with respect to the dual atomic norm $\|\cdot\|^*_{A_1}$ ($\|\cdot\|^*_{A_2}$). In addition, let $D_1$ and $D_2$ be parameters such that, for all $p \in \Delta_N$, $R_1(p) - R_1(p^*_1) \le D_1^2$ with $p^*_1 = \mathrm{argmin}_{q \in \Delta_N} R_1(q)$, and $R_2(p) - R_2(p^*_2) \le D_2^2$ with $p^*_2 = \mathrm{argmin}_{q \in \Delta_N} R_2(q)$. Then, for the additive loss space $L = L_1 + L_2$, the OMD algorithm with regularizer function $R = R_1 + R_2$, initial prediction $p_1 = \mathrm{argmin}_{p \in \Delta_N}\big(R_1(p) + R_2(p)\big)$, and learning rate $\eta = \sqrt{\frac{(D_1^2 + D_2^2)\min(\alpha_1, \alpha_2)}{T}}$ enjoys the regret guarantee

$\mathrm{Regret}_T \;\le\; 2\sqrt{\frac{(D_1^2 + D_2^2)\, T}{\min(\alpha_1, \alpha_2)}}.$

Note that we can prove this corollary using Theorem 4 by simply verifying the inequalities $R_1(p) - R_1(p_1) \le D_1^2$ and $R_2(p) - R_2(p_1) \le D_2^2$ for all $p \in \Delta_N$ and $p_1 := \mathrm{argmin}_{q \in \Delta_N}\big(R_1(q) + R_2(q)\big)$. This follows since $R_1(p^*_1) \le R_1(p_1)$ and $R_2(p^*_2) \le R_2(p_1)$; recall that $p^*_1 := \mathrm{argmin}_{q \in \Delta_N} R_1(q)$ and $p^*_2 := \mathrm{argmin}_{q \in \Delta_N} R_2(q)$.

4.1 Applications of the main result

In this section, we derive novel regret bounds for additive loss spaces ($L = L_1 + L_2$) wherein the individual components ($L_1$ and $L_2$) are loss spaces that were considered in Section 3; a small implementation sketch of OMD with such a combined quadratic regularizer is given below, before the formal statements.
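The following sketch (ours, not from the paper) runs Algorithm 1 over the simplex with an arbitrary quadratic regularizer $R(x) = x^\top Q x$, so that a combined regularizer such as the noisy low-rank choice of Corollary 7 can be supplied as $Q = H + \epsilon I_N$. The function name `omd_quadratic` and the SLSQP-based Bregman projection are illustrative implementation choices, not part of the paper's algorithm description.

```python
import numpy as np
from scipy.optimize import minimize

def omd_quadratic(losses, Q, eta):
    """OMD (Algorithm 1) over the probability simplex with a quadratic
    regularizer R(x) = x^T Q x (Q symmetric positive definite).
    For an additive regularizer as in Theorem 4, pass e.g. Q = H + eps * I_N.
    losses: T x N array of loss vectors."""
    T, N = losses.shape
    Q = np.asarray(Q, dtype=float)
    constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * N

    def bregman_project(y):
        # For quadratic R, B_R(p, y) = (p - y)^T Q (p - y); minimize it over the simplex.
        obj = lambda p: (p - y) @ Q @ (p - y)
        jac = lambda p: 2.0 * Q @ (p - y)
        res = minimize(obj, np.full(N, 1.0 / N), jac=jac, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        return res.x

    p = bregman_project(np.zeros(N))    # p_1 = argmin_{p in simplex} R(p)
    cum_learner, cum_losses = 0.0, np.zeros(N)
    for t in range(T):
        l = losses[t]
        cum_learner += p @ l            # expected loss of the prediction p_t
        cum_losses += l
        # Mirror step: grad R(y) = 2 Q y, so grad R(y) = grad R(p) - eta * l
        # gives y = p - (eta / 2) * Q^{-1} l.
        y = p - 0.5 * eta * np.linalg.solve(Q, l)
        p = bregman_project(y)          # Bregman projection back onto the simplex
    return cum_learner - cum_losses.min()   # realized regret
```

As a usage note, one might take a random $N \times d$ matrix $U$ and set `Q = np.eye(N) + U @ U.T + eps * np.eye(N)` as a crude stand-in for $H + \epsilon I_N$; the paper's $H$ involves the Löwner-John ellipsoid of the loss space, which we do not compute here. The corollaries below make the corresponding regret guarantees precise.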
These results are derived by applying Theorem 4; detailed proofs appear in the full version of the paper.

Corollary 7 (Noisy Low-Rank Losses). Suppose $L_1 = \{l \in [0, 1]^N \mid l = Uv\}$ is a $d$-dimensional loss space ($1 \le d \le \ln N$), perturbed with noisy losses from $L_2 = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon\}$, $\epsilon > 0$. Then, the regret of the OMD algorithm over the loss space $L = L_1 + L_2$ with regularizer $R(x) = x^\top H x + \epsilon\|x\|_2^2$ and learning rate $\eta$ chosen as in Theorem 4 is upper bounded as $\mathrm{Regret}_T \le 2\sqrt{(16d + \epsilon)\,T}$.

Corollary 8 (Noisy Sparse Losses). Suppose $L_1 = \{l \in [0, 1]^N \mid \|l\|_0 = s\}$ is an $s$-sparse loss space ($s \in [N]$), perturbed with noisy losses from $L_2 = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon\}$, $\epsilon > 0$. Then, the regret of the OMD algorithm over the loss space $L = L_1 + L_2$ with regularizer $R(x) = \|x\|_q^2 + \epsilon\|x\|_2^2$ and learning rate $\eta = \sqrt{\frac{1+\epsilon}{(2\ln(s+1)-1)\,T}}$ is upper bounded as $\mathrm{Regret}_T \le 2\sqrt{2(1 + \epsilon)\ln(s+1)\,T}$.

Corollary 9 (Low-Rank Losses with Sparse Noise). Suppose $L_1 = \{l \in [0, 1]^N \mid l = Uv\}$ is a rank-$d$ loss space ($1 \le d \le \ln N$), perturbed with $s$-sparse losses from $L_2 = \{l \in [0, 1]^N \mid \|l\|_0 = s\}$, $s \in [N]$. Then, the regret of the OMD algorithm over the loss space $L = L_1 + L_2$ with regularizer $R(x) = x^\top H x + \|x\|_q^2$ and learning rate $\eta = \sqrt{\frac{16d+1}{(2\ln(s+1)-1)\,T}}$ is upper bounded as $\mathrm{Regret}_T \le 2\sqrt{2(16d + 1)\ln(s+1)\,T}$.

4.2 Proof of Theorem 4

Before proving the theorem, we establish the following useful lemmas. Let $A_1, A_2$ be any two convex, compact, centrally symmetric subsets of $\mathbb{R}^n$ and let $A = A_1 + A_2$ (Minkowski sum). Then $A$ is also convex, compact, and centrally symmetric. This follows from the fact that $\mathrm{conv}(X) + \mathrm{conv}(Y) = \mathrm{conv}(X + Y)$ for any $X, Y \subseteq \mathbb{R}^n$. In addition, we have:

Lemma 10. $\|x\|_A \le \max\{\|x_1\|_{A_1}, \|x_2\|_{A_2}\}$, where $x = x_1 + x_2$ with $x_1 \in A_1$ and $x_2 \in A_2$.

Proof. Recall the definition of the atomic norm $\|\cdot\|_A$ from Section 2.1. For any $x = x_1 + x_2 \in \mathbb{R}^n$, let $t_1 = \|x_1\|_{A_1}$ and $t_2 = \|x_2\|_{A_2}$. Clearly, $x = x_1 + x_2 \in (t_1 A_1 + t_2 A_2) \subseteq t(A_1 + A_2)$, where $t = \max\{t_1, t_2\}$. The claim now follows directly from the definition of the atomic norm $\|\cdot\|_A$.

Lemma 11. $\|x\|^*_A = \|x\|^*_{A_1} + \|x\|^*_{A_2}$, for all $x \in \mathbb{R}^n$.

Proof. For any $x \in \mathbb{R}^n$,
$\|x\|^*_A = \sup\{x \cdot z \mid z \in A\} = \sup\{x \cdot (z_1 + z_2) \mid z_1 \in A_1,\, z_2 \in A_2\} = \sup\{x \cdot z_1 \mid z_1 \in A_1\} + \sup\{x \cdot z_2 \mid z_2 \in A_2\} = \|x\|^*_{A_1} + \|x\|^*_{A_2}.$

Lemma 12. Let $\Omega \subseteq \mathbb{R}^n$ be a convex set. Consider two differentiable functions $R_1 : \mathbb{R}^n \to \mathbb{R}$ and $R_2 : \mathbb{R}^n \to \mathbb{R}$ that are, respectively, $\alpha_1$- and $\alpha_2$-strongly convex with respect to $\|\cdot\|^*_{A_1}$ and $\|\cdot\|^*_{A_2}$ over $\Omega$. Then $R = R_1 + R_2$ is $\alpha = \frac{1}{2}\min(\alpha_1, \alpha_2)$-strongly convex with respect to $\|\cdot\|^*_A$ over $\Omega$.

Proof. For any $x, y \in \Omega$, with $\alpha = \frac{1}{2}\min(\alpha_1, \alpha_2)$,
$R(x) - R(y) - \nabla R(y)^\top(x - y) = \big(R_1(x) - R_1(y) - \nabla R_1(y)^\top(x - y)\big) + \big(R_2(x) - R_2(y) - \nabla R_2(y)^\top(x - y)\big) \ge \frac{\alpha_1}{2}\big(\|x - y\|^*_{A_1}\big)^2 + \frac{\alpha_2}{2}\big(\|x - y\|^*_{A_2}\big)^2 \ge \frac{\alpha}{2}\Big(2\big(\|x - y\|^*_{A_1}\big)^2 + 2\big(\|x - y\|^*_{A_2}\big)^2\Big)/2 \cdot 2 \ge \frac{\alpha}{2}\big(\|x - y\|^*_{A_1} + \|x - y\|^*_{A_2}\big)^2 \quad \text{(since } 2(a^2 + b^2) \ge (a + b)^2 \text{ for all } a, b \in \mathbb{R}\text{)} = \frac{\alpha}{2}\big(\|x - y\|^*_{A}\big)^2 \quad \text{(via Lemma 11)}.$
Hence, $R = R_1 + R_2$ is $\alpha = \frac{1}{2}\min(\alpha_1, \alpha_2)$-strongly convex with respect to $\|\cdot\|^*_A$ over $\Omega$.

Proof of Theorem 4. Consider the norm $\|\cdot\| = \|\cdot\|_A$ and its dual norm $\|\cdot\|_* = \|\cdot\|^*_A$. Note that:
1. Lemma 10, along with the bounds $\|l_1\|_{A_1} \le 1$ and $\|l_2\|_{A_2} \le 1$, implies that $\|l\|_A \le 1$ for any $l = l_1 + l_2 \in L$. Hence, $L \subseteq A$ and we may take $G = 1$.
2. For any $p \in \Delta_N$, $R(p) - R(p_1) = \big(R_1(p) - R_1(p_1)\big) + \big(R_2(p) - R_2(p_1)\big) \le D_1^2 + D_2^2$. Hence, $D = \sqrt{D_1^2 + D_2^2}$.
3. $R(x) = R_1(x) + R_2(x)$ is $\frac{\min\{\alpha_1, \alpha_2\}}{2}$-strongly convex with respect to $\|\cdot\|^*_A$ over $\Delta_N$ (Lemma 12). Hence, $\alpha = \frac{\min\{\alpha_1, \alpha_2\}}{2}$.
The result now follows by applying Theorem 3.

5 Lower Bounds

In this section we derive lower bounds for the problem of learning with expert advice, for various structured loss spaces. Relevant definitions and missing proofs can be found in the full version of the paper.
We first state a lower bound for a general loss space $L \subseteq \mathbb{R}^N$ (Theorem 13), based on a lower-bound result of (Ben-David, Pál, and Shalev-Shwartz 2009) for online learning of binary hypothesis classes in terms of the Littlestone dimension. (For online learning problems, the Littlestone dimension characterizes the complexity of a hypothesis class learnable in an online fashion; further details can be found in the full version of the paper.)

Theorem 13 (A Generic Lower Bound). Given parameters $V > 0$ and $s > 0$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in \{0, s\}^N$ of rank $V$ and sparsity $2^V \le N$ (i.e., $\mathrm{rank}([l_1, l_2, \ldots, l_T]) = V$ and $\|l_t\|_0 = 2^V$ for all $t \in [T]$) such that $\mathrm{Regret}_T = \Omega\!\big(s\sqrt{VT}\big)$.

The following corollary is a direct consequence of Theorem 13.

Corollary 14. Given parameters $V \in [\ln N]$ and $s > 0$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in [-s, s]^N$ of VC dimension $V$ (i.e., $VC(\{l_1, l_2, \ldots, l_T\}) = V$) such that $\mathrm{Regret}_T = \Omega\!\big(s\sqrt{VT}\big)$. (The full version of the paper provides a definition of VC dimension along with relevant references.)

Proof. Consider the set of loss vectors $L = \{l \in \{0, s\}^N \mid \|l\|_0 \le 2^V\}$. From the definition of VC dimension, it follows that $VC(L) = V$. Hence, Theorem 13 implies the stated claim.

Next we instantiate Theorem 13 to derive regret lower bounds for the structured loss spaces introduced in Section 3. In particular, we begin by stating a lower bound for sparse loss vectors.

Corollary 15 (Lower Bound for Sparse Losses). Given $k \in [N]$ and $s > 0$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in [-s, s]^N$ of sparsity $k \le N$ (i.e., $\|l_t\|_0 = k$ for all $t \in [T]$) such that $\mathrm{Regret}_T = \Omega\!\big(s\sqrt{T \ln k}\big)$.

Along the same lines, Theorem 13 leads to a lower bound for losses with small $\ell_p$ norm.

Corollary 16 (Lower Bound for $\ell_p$ Losses). Given $p \in [\ln N]$ and $s > 0$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in [-s, s]^N$ of $\ell_p$ norm at most $s$ (i.e., $\|l_t\|_p \le s$) such that $\mathrm{Regret}_T = \Omega\!\big(s\sqrt{pT}\big)$.

Proof. Consider the set of all $2^p$-sparse loss vectors in $[-\frac{s}{2}, \frac{s}{2}]^N$. Any such loss vector $l$ has $\|l\|_p \le s$. The stated claim now follows by applying Theorem 13 with parameters $\frac{s}{2}$ and $V = p$.

Corollary 17 (Lower Bound for Noisy Losses). Given $\epsilon > 0$ and any online learning algorithm, there exists a sequence of $\epsilon$-noisy loss vectors $l_1, l_2, \ldots, l_T \in [-\sqrt{\epsilon}, \sqrt{\epsilon}]^N$ (i.e., $\|l_t\|_2^2 \le \epsilon$) such that $\mathrm{Regret}_T = \Omega\!\big(\sqrt{\epsilon T}\big)$.

Proof. Consider the set of all $2$-sparse loss vectors in $[-\sqrt{\epsilon/2}, \sqrt{\epsilon/2}]^N$. Clearly, any such loss vector $l$ has $\|l\|_2^2 \le \epsilon$. Hence, with parameters $s = \sqrt{\epsilon/2}$ and $V = 1$, the result follows directly from Theorem 13.

Remark 18. Note that Theorem 13 (with parameters $V = d$ and $s = 1$) recovers the lower bound for low-rank loss spaces established by (Hazan et al. 2016): given $1 \le d \le \ln N$ and any online learning algorithm, there exists a sequence of rank-$d$ loss vectors $l_1, l_2, \ldots, l_T \in [-1, 1]^N$ such that $\mathrm{Regret}_T = \Omega\!\big(\sqrt{dT}\big)$.

We next derive regret lower bounds for a few instances of additive loss spaces.

Corollary 19 (Lower Bound for Noisy Low Rank). Given parameters $\epsilon > 0$ and $d \in [\ln N]$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in [-(1 + \epsilon), (1 + \epsilon)]^N$, where $l_t = l_{t1} + l_{t2}$ with $l_{t1} \in \{l \in [-1, 1]^N \mid l = Uv\}$ (for a rank-$d$ matrix $U \in \mathbb{R}^{N \times d}$) and $\|l_{t2}\|_2^2 \le \epsilon$, such that $\mathrm{Regret}_T = \Omega\!\Big(\big(1 + \sqrt{\epsilon/2^d}\big)\sqrt{dT}\Big)$.

Proof. Let $N = 2^d$. Consider the matrix $H \in \{-1, 1\}^{N \times d}$ whose $2^d$ rows are the $2^d$ vertices of the $d$-dimensional hypercube $\{-1, 1\}^d$. Let
$L_1 = \{H(:, 1), \ldots, H(:, d)\}$, i.e., the $d$ columns of $H$, and $L_2 = \{l \in [-\sqrt{\epsilon/2^d}, \sqrt{\epsilon/2^d}]^N \mid \|l\|_2^2 = \epsilon\}$. Note that any loss vector in $L_2$ is $2^d$-sparse. Consider $L = L_1 + L_2$. The result now follows from Theorem 13, noting that with $s = 1 + \sqrt{\epsilon/2^d}$ and $V = d$, the lower-bounding loss vectors $l_1, \ldots, l_T$ guaranteed by Theorem 13 are contained in $L$.

Corollary 20 (Lower Bound for Noisy Sparse). Given parameters $\epsilon > 0$ and $k \in [N]$, along with any online learning algorithm, there exists a sequence of loss vectors $l_1, l_2, \ldots, l_T \in [-(1 + \epsilon), (1 + \epsilon)]^N$, where $l_t = l_{t1} + l_{t2}$ with $l_{t1} \in \{l \in [-1, 1]^N \mid \|l\|_0 \le k\}$ and $\|l_{t2}\|_2^2 = \epsilon$, such that $\mathrm{Regret}_T = \Omega\!\Big(\big(1 + \sqrt{\epsilon/k}\big)\sqrt{T \ln k}\Big)$.

Proof. Consider the loss spaces $L_1 = \{l \mid \|l\|_0 = k, \text{ with nonzero entries in } \{-1, 1\}\}$ and $L_2 = \{l \in [-\sqrt{\epsilon/k}, \sqrt{\epsilon/k}]^N \mid \|l\|_2^2 = \epsilon\}$. Note that any loss vector in $L_2$ is $k$-sparse. Write $L = L_1 + L_2$. The corollary now follows from Theorem 13, noting that with $s = 1 + \sqrt{\epsilon/k}$ and $V = \ln k$, the lower-bounding loss vectors $l_1, \ldots, l_T$ guaranteed by Theorem 13 are contained in $L$.

6 Conclusion

In this paper, we have developed a theoretical framework for online learning with structured losses, namely the broad class of problems with additive loss spaces. The framework yields both algorithms that generalize standard online mirror descent and novel regret upper bounds for relevant settings such as noisy + sparse, noisy + low-rank, and sparse + low-rank losses. In addition, we have derived lower bounds, i.e., fundamental limits on regret, for a variety of online learning problems with structured loss spaces.

In light of these results, tightening the gap between the upper and lower bounds for structured loss spaces is a natural open problem. Another relevant thread of research is to study settings wherein the learner knows that the loss space is structured, but is oblivious to the exact instantiation of the loss space; e.g., the losses might be perturbations of vectors from a low-dimensional subspace, but, a priori, the learning algorithm might not know the underlying subspace. (The results of (Hazan et al. 2016) address the noiseless version of this problem.) Addressing structured loss spaces in bandit settings also remains an interesting direction for future work.

A Tight Examples for Theorem 4

In this section, we present loss spaces $L_1$ and $L_2$ such that the OMD algorithm obtained via Theorem 4 provides an order-wise optimal regret guarantee for the additive loss space $L = L_1 + L_2$.

Composition of Low Ranks: Let $L_1 = \{l \in [0, 1]^N \mid l = U_1 v\}$ and $L_2 = \{l \in [0, 1]^N \mid l = U_2 v\}$ be loss spaces of rank $d_1$ and $d_2$, respectively (i.e., the ranks of the matrices $U_1$ and $U_2$ are, respectively, $d_1$ and $d_2$), with $d_1 + d_2 \le \ln N$. Consider the regularizer $R(x) = x^\top (H_1 + H_2)\, x$, where $H_1 = I_N + U_1 M_1 U_1^\top$ and $H_2 = I_N + U_2 M_2 U_2^\top$, with $M_1$ and $M_2$ being the Löwner-John ellipsoid matrices for $L_1$ and $L_2$. That is, $R(x) = R_1(x) + R_2(x)$, where $R_1(x)$ and $R_2(x)$ are the regularizers for $L_1$ and $L_2$, respectively. Theorem 4 asserts that the OMD algorithm with regularizer $R$ over the loss space $L = L_1 + L_2$ achieves a regret bound of $O\!\big(\sqrt{(d_1 + d_2)\,T}\big)$. This regret guarantee is tight, since $\mathrm{Rank}(L)$ can be as high as $d_1 + d_2$ and, hence, we get a nearly matching lower bound by applying the result of (Hazan et al. 2016); see also Remark 18 in Section 5.

Composition of Noise: Let the loss spaces be $L_1 = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon_1\}$ and $L_2 = \{l \in [0, 1]^N \mid \|l\|_2^2 \le \epsilon_2\}$. Then, via an instantiation of Theorem 4, we get that the regret of the OMD algorithm over the loss space $L = L_1 + L_2$, with regularizer $R(x) = (\epsilon_1 + \epsilon_2)\|x\|_2^2$ (and $\eta$ chosen as in Theorem 4), is upper bounded as $O\!\big(\sqrt{(\epsilon_1 + \epsilon_2)\,T}\big)$.
Again, modulo constants, this is the best possible regret guarantee for $L$; see Corollary 17.

References

Arora, S.; Hazan, E.; and Kale, S. 2012. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121-164.

Beck, A., and Teboulle, M. 2003. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31(3):167-175.

Ben-David, S.; Pál, D.; and Shalev-Shwartz, S. 2009. Agnostic online learning. In COLT.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. New York, NY, USA: Cambridge University Press.

Bubeck, S. 2011. Introduction to online optimization. Lecture Notes, Princeton University.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Cesa-Bianchi, N.; Mansour, Y.; and Stoltz, G. 2007. Improved second-order bounds for prediction with expert advice. Machine Learning 66(2):321-352.

Chandrasekaran, V.; Recht, B.; Parrilo, P. A.; and Willsky, A. S. 2012. The convex geometry of linear inverse problems. Foundations of Computational Mathematics 12(6):805-849.

Cohen, A., and Mannor, S. 2017. Online learning with many experts. arXiv preprint arXiv:1702.07870.

Hazan, E., and Kale, S. 2010. Extracting certainty from uncertainty: regret bounded by variation in costs. Machine Learning 80(2):165-188.

Hazan, E.; Koren, T.; Livni, R.; and Mansour, Y. 2016. Online learning with low rank experts. In Proceedings of the 29th Conference on Learning Theory (COLT 2016), New York, USA, June 23-26, 2016, 1096-1114.

Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30-37.

Shalev-Shwartz, S. 2012a. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2).

Shalev-Shwartz, S. 2012b. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2):107-194.

Steinhardt, J., and Liang, P. 2014. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, 1593-1601.

Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning.