Published in Transactions on Machine Learning Research (09/2022)

Multitask Online Mirror Descent

Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it)
Department of Computer Science, University of Milan, Italy

Pierre Laforgue (pierre.laforgue@unimi.it)
Department of Computer Science, University of Milan, Italy

Andrea Paudice (andrea.paudice@unimi.it)
Department of Computer Science, University of Milan, Italy
Istituto Italiano di Tecnologia, Genova, Italy

Massimiliano Pontil (massimiliano.pontil@iit.it)
Istituto Italiano di Tecnologia, Genova, Italy
University College London, United Kingdom

Reviewed on OpenReview: https://openreview.net/forum?id=zwRX9kkKzj

Abstract

We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\,\sqrt{T}$, where $\sigma^2$ is the task variance according to the geometry induced by the regularizer, $N$ is the number of tasks, and $T$ is the time horizon. Whenever tasks are similar, that is $\sigma^2 \le 1$, our method improves upon the $\sqrt{NT}$ bound obtained by running independent OMDs on each task. We further provide a matching lower bound, and show that our multitask extensions of Online Gradient Descent and Exponentiated Gradient, two major instances of OMD, enjoy closed-form updates, making them easy to use in practice. Finally, we present experiments which support our theoretical findings.

1 Introduction

In multitask learning (Caruana, 1997), one faces a set of tasks to solve, and tries to leverage their similarities to learn faster. Task similarity is often formalized in terms of Euclidean distances among the best performing models for each task; see Evgeniou & Pontil (2004) for an example.
However, in online convex optimization, and Online Mirror Descent (OMD) in particular, it is well known that using different geometries to measure distances in the model space can bring substantial advantages; see, e.g., Hazan (2016); Orabona (2019). For instance, when the model space is the probability simplex in $\mathbb{R}^d$, running OMD with the KL divergence (corresponding to an entropic regularizer) allows one to learn at a rate depending only logarithmically on $d$. It is thus natural to investigate to what extent measuring task similarities using geometries that are possibly non-Euclidean could improve the analysis of online multitask learning.

From an application perspective, typical online multitask scenarios include federated learning applications for mobile users (e.g., personalized recommendation or health monitoring) or for smart homes (e.g., energy consumption prediction), mobile sensor networks for environmental monitoring, or even networked weather forecasting. These scenarios fit well with online learning, as new data is being generated all the time, and they require different losses and decision sets, motivating the design of a general framework.

In this work, we introduce MT-OMD, a multitask generalization of OMD which applies to any strongly convex regularizer. We present a regret analysis establishing that MT-OMD outperforms OMD (run independently on each task) whenever tasks are similar according to the geometry induced by the regularizer. Our work builds on the multitask extension of the Perceptron algorithm developed in Cavallanti et al. (2010), where prior knowledge about task similarities is expressed through a symmetric positive definite interaction matrix $A$. Typically, $A = I + L$, where $L$ is the Laplacian of a task relatedness graph with adjacency matrix $W$.
The authors then show that the number of mistakes depends on $\sum_i \|u_i\|_2^2 + \sum_{i,j} W_{ij}\,\|u_i - u_j\|_2^2$, where each $u_i$ denotes the best model for task $i$. This expression can be seen as a measure of task dispersion with respect to the matrix $W$ and the norm $\|\cdot\|_2$. The Euclidean norm appears because the Perceptron is an instance of OMD for the hinge loss with the Euclidean regularizer, so that distances in the model space are measured through the corresponding Bregman divergence, which is the squared Euclidean distance. For an arbitrary strongly convex regularizer $\psi$, the regret of OMD is controlled by a Bregman divergence and a term inversely proportional to the curvature of the regularizer. The key challenge we face is how to extend the OMD regularizer to the multitask setting so that the dispersion term captures task similarities. A natural strategy would be to choose a regularizer whose Bregman divergence features $\sum_{i,j} W_{ij}\,B_\psi(u_i, u_j)$. Although this mimics the Euclidean dispersion term of the Perceptron, the associated regularizer has a small curvature, compromising the divergence/curvature balance which, as we said, controls the regret. Observing that the Perceptron's dispersion term can be rewritten as $\|\mathbf{A}^{1/2}\mathbf{u}\|_2^2$, where $\mathbf{A} = A \otimes I_d$ is a block version (across tasks) of $A$ and $\mathbf{u}$ is the concatenation of the reference vectors $u_i$, our solution consists in using the regularizer $\boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,)$, where $\boldsymbol{\psi}$ is the compound version of any base regularizer $\psi$ defined on the model space. While exhibiting the right curvature, this regularizer still has the drawback that $\mathbf{A}^{1/2}\mathbf{u}$ might be outside the domain of $\boldsymbol{\psi}$. To get around this difficulty, we introduce a notion of variance aligned with the geometry induced by $\psi$, such that the corresponding Bregman divergence $B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{u}, \mathbf{A}^{1/2}\mathbf{v}\big)$ is always defined for sets of tasks with small variance.
We then show that the Bregman divergence can be upper bounded in terms of the task variance $\sigma^2$, and by tuning the matrix $A$ appropriately we obtain a regret bound for MT-OMD that scales as $\sqrt{1 + \sigma^2(N-1)}$. In contrast, the regret of independent OMDs scales as $\sqrt{N}$, highlighting the advantage brought by MT-OMD when tasks have a small variance. We stress that this improvement is independent of the chosen regularizer, thereby offering a substantial acceleration in a wide range of scenarios. To keep the exposition simple, we first work with a fixed and known $\sigma^2$. We then show an extension of MT-OMD that does not require any prior knowledge of the task similarity.

The rest of the paper is organized as follows. In Section 2, we introduce the multitask online learning problem and describe MT-OMD, our multitask extension to solve it. In Section 3, we derive a regret analysis for MT-OMD, which highlights its advantage when tasks are similar. Section 4 is devoted to algorithmic implementations, and Section 5 to experiments.

Related work. Starting from the seminal work by Caruana (1997), multitask learning has been intensively studied for more than two decades; see Zhang & Yang (2021) for a recent survey. Similarly to Cavallanti et al. (2010), our work is inspired by the Laplacian multitask framework of Evgeniou et al. (2005). This framework has been extended to kernel-based learning (Sheldon, 2008), kernel-based unsupervised learning (Gu et al., 2011), contextual bandits (Cesa-Bianchi et al., 2013), spectral clustering (Yang et al., 2014), stratified model learning (Tuck et al., 2021), and, more recently, federated learning (Dinh et al., 2021). See also Herbster & Lever (2009) for different applications of Laplacians in online learning. A multitask version of OMD has been previously proposed by Kakade et al. (2012). Their approach, unlike ours, is cast in terms of matrix learning, and uses group norms and Schatten p-norm regularizers.
Their bounds scale with the diameter of the model space according to these norms (as opposed to scaling with the task variance, as in our analysis). Moreover, their learning bias is limited to the choice of the matrix norm regularizer and does not explicitly include a notion of task similarity matrix. Abernethy et al. (2007); Dekel et al. (2007) investigate different multitask extensions of online learning; see also Alquier et al. (2017); Finn et al. (2019); Balcan et al. (2019); Denevi et al. (2019) for related extensions to meta-learning. Some online multitask applications are studied in Pillonetto et al. (2008); Li et al. (2014; 2019), but without providing any regret analyses. Saha et al. (2011); Zhang et al. (2018) extend the results of Cavallanti et al. (2010) to dynamically updated interaction matrices; however, no regret bounds are provided. Murugesan et al. (2016) look at distributed online classification and prove regret bounds, but they are not applicable in our asynchronous model. Other approaches for learning task similarities include Zhang & Yeung (2010); Pentina & Lampert (2017); Shui et al. (2019). We finally note the recent work by Boursier et al. (2022), which establishes multitask learning guarantees with trace norm regularization when the number of samples per task is small, and that by Laforgue et al. (2022), which learns jointly the tasks and their structure, but only with the Euclidean regularizer and under the assumption that the task activations are stochastic.

Although our asynchronous multitask setting is identical to that of Cavallanti et al. (2010), we emphasize that our work extends theirs well beyond the fact that we consider arbitrary convex losses instead of just the hinge loss. Algorithmically, MT-OMD generalizes the Multitask Perceptron in much the same way OMD generalizes the standard Perceptron. From a technical point of view, Theorem 1 in Cavallanti et al.
(2010) is a direct consequence of the Kernel Perceptron Theorem, and is therefore limited to Euclidean geometries. Instead, our work provides a complete analysis of all regularizers of the form $\boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,)$. Although Cavallanti et al. (2010) also contains a non-Euclidean p-norm extension of the Multitask Perceptron, we point out that their extension is based on a regularizer of the form $\|\mathbf{A}\mathbf{u}\|_p^2$. This is different from MT-OMD for p-norms, which instead uses $\sum_i \big\|(\mathbf{A}^{1/2}\mathbf{u})^{(i)}\big\|_p^2$. As a consequence, their bound is worse than ours (see Appendix C for technical details), does not feature any variance term, and does not specialize to the Euclidean case when $p = 2$. Note that our analysis on the simplex is also completely novel as far as we know.

2 Multitask Online Learning

We now describe the multitask online learning problem, and introduce our approach to solve it. We use a cooperative and asynchronous multiagent formalism: the online algorithm is run in a distributed fashion by communicating agents, which however make predictions at different time steps.

Problem formulation and reminders on OMD. We consider an online convex optimization setting with a set of $N$ agents, each learning a possibly different task on a common convex decision set $V \subset \mathbb{R}^d$. At each time step $t = 1, 2, \ldots$ some agent $i_t \in \{1, \ldots, N\}$ makes a prediction $x_t \in V$ for its task, incurs the loss $\ell_t(x_t)$, and observes a subgradient of $\ell_t$ at $x_t$, where $\ell_t$ is a convex loss function. We say that $i_t$ is the active agent at time $t$. Both the sequence $i_1, i_2, \ldots$ of active agents and the sequence $\ell_1, \ell_2, \ldots$ of convex losses are chosen adversarially and hidden from the agents. Note that the algorithm we propose is deterministic, so that $\ell_t$ may indifferently be chosen before $x_t$ is predicted (oblivious adversary) or after. Our goal is to minimize the multitask regret, which is defined as the sum of the individual regrets:

$$R_T = \sum_{i=1}^{N} \left( \sum_{t \colon i_t = i} \ell_t(x_t) - \inf_{u \in V} \sum_{t \colon i_t = i} \ell_t(u) \right) = \sum_{t=1}^{T} \ell_t(x_t) - \sum_{i=1}^{N} \inf_{u \in V} \sum_{t \colon i_t = i} \ell_t(u).$$
(1)

A natural idea to minimize Equation (1) is to run $N$ independent OMDs, one for each agent. Recall that OMD refers to a family of algorithms, typically used to minimize a regret of the form $\sum_t \ell_t(x_t) - \inf_{u \in V} \sum_t \ell_t(u)$, for any sequence of proper convex loss functions $\ell_t$. An instance of OMD is parameterized by a $\lambda$-strongly convex regularizer $\psi \colon \mathbb{R}^d \to \mathbb{R}$, and has the update rule

$$x_{t+1} = \arg\min_{x \in V}\ \langle \eta_t g_t, x \rangle + B_\psi(x, x_t), \qquad (2)$$

where $g_t \in \mathbb{R}^d$ is a subgradient of $\ell_t$ at the point $x_t$, $B_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle$ denotes the Bregman divergence associated with $\psi$, and $\eta_t > 0$ is a tunable learning rate. Standard results allow one to bound the regret achieved by the sequence of iterates produced by OMD. For a fixed $\eta$ and any initial point $x_1 \in V$, we have (Orabona, 2019, Theorem 6.8) that for all $u \in V$

$$\sum_{t=1}^{T} \ell_t(x_t) - \ell_t(u) \le \frac{B_\psi(u, x_1)}{\eta} + \frac{\eta}{2\lambda} \sum_{t=1}^{T} \|g_t\|_*^2, \qquad (3)$$

with $\|\cdot\|_*$ the dual norm of the norm with respect to which $\psi$ is strongly convex (see Definition 4 in the Appendix). The choice of the regularizer $\psi$ shapes the above bound through the quantities $B_\psi(u, x_1)$ and $\|g_t\|_*$. When $\psi = \frac{1}{2}\|\cdot\|_2^2$, we have $B_\psi(x, y) = \frac{1}{2}\|x - y\|_2^2$, $\|\cdot\|_* = \|\cdot\|_2$, $\lambda = 1$, and the algorithm is called Online Gradient Descent (OGD). However, depending on the problem, a different choice of the regularizer might better capture the underlying geometry. A well-known example is Exponentiated Gradient (EG), an instance of OMD in which $V$ is the probability simplex in $\mathbb{R}^d$, namely $V = \Delta := \{x \in \mathbb{R}_+^d : \sum_j x_j = 1\}$. EG uses the negative entropy regularizer $x \mapsto \sum_j x_j \ln(x_j)$, and assuming that $\|g_t\|_\infty \le L_g$, one achieves bounds of order $O(L_g \sqrt{T \ln d})$, while OGD yields bounds of order $O(L_g \sqrt{dT})$. We emphasize that our cooperative extension adapts to several types of regularizers, and can therefore exploit these improvements with respect to the dependence on $d$; see Proposition 8. Let $C$ be a generic constant such that $C\sqrt{T}$ bounds the regret incurred by the chosen OMD (e.g., $C = L_g\sqrt{\ln d}$, or $C = L_g\sqrt{d}$ above).
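For concreteness, the two OMD instances just recalled (OGD and EG) can be sketched as follows. This is a minimal sketch, not the paper's code: the decision set for OGD is taken to be the Euclidean unit ball as an illustrative choice, and the function names are ours.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}.
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ogd_step(x, g, eta, radius=1.0):
    # OMD with psi = 0.5 * ||.||_2^2: the mirror step reduces to a plain
    # gradient step followed by a Euclidean projection onto V.
    return project_l2_ball(x - eta * g, radius)

def eg_step(x, g, eta):
    # OMD on the simplex with the negative-entropy regularizer:
    # multiplicative update followed by renormalization.
    w = x * np.exp(-eta * g)
    return w / w.sum()
```

Both steps are instances of the same template (2); only the Bregman divergence changes, which is why EG shifts probability mass multiplicatively while OGD moves additively.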
Then, by Jensen's inequality, the multitask regret of $N$ independent OMDs satisfies

$$R_T \le \sum_{i=1}^{N} C\sqrt{T_i} \le C\sqrt{NT}, \qquad (4)$$

where $T_i = \sum_{t=1}^{T} \mathbb{I}\{i_t = i\}$ denotes the number of times agent $i$ was active. Our goal is to show that introducing communication between the agents may significantly improve on Equation (4) with respect to the dependence on $N$.

A multitask extension. We now describe our multitask OMD approach. To gain some insight, we first focus on OGD. For $i \le N$ and $t \le T$, let $x_{i,t} \in \mathbb{R}^d$ denote the prediction maintained by agent $i$ at time step $t$. By completing the square in Equation (2) for $\psi = \psi_{\mathrm{Euc}} := \frac{1}{2}\|\cdot\|_2^2$, the independent OGD updates can be rewritten, for all $i \le N$ and $t$ such that $i_t = i$, as

$$x_{i,t+1} = \Pi_{V, \|\cdot\|_2}\big(x_{i,t} - \eta_t g_t\big), \qquad (5)$$

where $\Pi_{V, \|\cdot\|}$ denotes the projection operator onto the convex set $V$ according to the norm $\|\cdot\|$, that is, $\Pi_{V, \|\cdot\|}(x) = \arg\min_{y \in V} \|x - y\|$. Our analysis relies on compound representations, which we explain next. We use bold notation to refer to compound vectors, so that for $u_1, \ldots, u_N \in \mathbb{R}^d$, the compound vector is $\mathbf{u} = [u_1, \ldots, u_N] \in \mathbb{R}^{Nd}$. For $i \le N$, we use $\mathbf{u}^{(i)}$ to refer to the $i$th block of $\mathbf{u}$, so that $\mathbf{u}^{(i)} = u_i$ in the above example. Thus $\mathbf{x}_t$ is the compound vector of the $(x_{i,t})_{i=1}^{N}$, with $x_t = \mathbf{x}_t^{(i_t)}$, and the multitask regret rewrites as $R_T(\mathbf{u}) = \sum_{t=1}^{T} \ell_t\big(\mathbf{x}_t^{(i_t)}\big) - \ell_t\big(\mathbf{u}^{(i_t)}\big)$. For any set $V \subset \mathbb{R}^d$, let $\mathbf{V} = V^N \subset \mathbb{R}^{Nd}$ denote the compound set such that $\mathbf{u} \in \mathbf{V}$ is equivalent to $\mathbf{u}^{(i)} \in V$ for all $i \le N$. Equipped with this notation, the independent OGD updates of Equation (5) rewrite as

$$\mathbf{x}_{t+1} = \Pi_{\mathbf{V}, \|\cdot\|_2}\big(\mathbf{x}_t - \eta_t \mathbf{g}_t\big), \qquad (6)$$

with $\mathbf{g}_t \in \mathbb{R}^{Nd}$ such that $\mathbf{g}_t^{(i)} = g_t$ for $i = i_t$, and $0_{\mathbb{R}^d}$ otherwise. In other words, only the active agent has a non-zero gradient and therefore makes an update. Our goal is to incorporate communication into this independent update. To that end, we consider the general idea of sharing updates by considering (sub)gradients of the form $\mathbf{A}^{-1}\mathbf{g}_t$, where $\mathbf{A}^{-1} \in \mathbb{R}^{Nd \times Nd}$ is a shorthand notation for $A^{-1} \otimes I_d$ and $A \in \mathbb{R}^{N \times N}$ is any symmetric positive definite interaction matrix.
Note that $A$ is a parameter of the algorithm playing the role of a learning bias. While our central result (Theorem 1) holds for any choice of $A$, our more specialized bounds (see Propositions 5 to 8) apply to a parameterized family of matrices $A$. A simple computation shows that $\big(\mathbf{A}^{-1}\mathbf{g}_t\big)^{(i)} = A^{-1}_{i i_t}\, g_t$. Thus, every agent $i$ makes an update proportional to $A^{-1}_{i i_t}$ at each time step $t$. In other words, the active agent (the only one to suffer a loss) shares its update with the other agents. Results in Section 3 are proved by designing a matrix $A^{-1}$ (or equivalently $A$) such that $A^{-1}_{i i_t}$ captures the similarity between tasks $i$ and $i_t$. Intuitively, the active agent $i_t$ should share its update (gradient) with another agent $i$ to the extent that their respective tasks are similar. Overall, denoting by $\|\mathbf{u}\|_{\mathbf{M}} = \sqrt{\mathbf{u}^\top \mathbf{M}\,\mathbf{u}}$ the Mahalanobis norm of $\mathbf{u}$, the MT-OGD update writes

$$\mathbf{x}_{t+1} = \Pi_{\mathbf{V}, \|\cdot\|_{\mathbf{A}}}\big(\mathbf{x}_t - \eta_t\, \mathbf{A}^{-1}\mathbf{g}_t\big). \qquad (7)$$

In comparison to Equation (6), the need for changing the norm in the projection, although unclear at first sight, can be explained in multiple ways. First, it is key to the analysis, as we see in the proof of Theorem 1. Second, it can be interpreted as another way of exchanging information between agents; see Remark 1. Finally, note that the update in Equation (7) can be decomposed as

$$\tilde{\mathbf{x}}_{t+1} = \arg\min_{\mathbf{x} \in \mathbb{R}^{Nd}}\ \langle \eta_t \mathbf{g}_t, \mathbf{x} \rangle + \tfrac{1}{2}\|\mathbf{x} - \mathbf{x}_t\|_{\mathbf{A}}^2, \qquad \mathbf{x}_{t+1} = \arg\min_{\mathbf{x} \in \mathbf{V}}\ \tfrac{1}{2}\|\mathbf{x} - \tilde{\mathbf{x}}_{t+1}\|_{\mathbf{A}}^2, \qquad (8)$$

showing that it is natural to keep the same norm in both updates. Most importantly, Equation (8) tells us that the MT-OGD update rule is actually an OMD update, see, e.g., (Orabona, 2019, Section 6.4), with regularizer $\mathbf{x} \mapsto \frac{1}{2}\|\mathbf{x}\|_{\mathbf{A}}^2 = \boldsymbol{\psi}_{\mathrm{Euc}}\big(\mathbf{A}^{1/2}\mathbf{x}\big)$. This provides a natural path for extending our multitask approach to any regularizer. Given a base regularizer $\psi \colon \mathbb{R}^d \to \mathbb{R}$, the compound regularizer $\boldsymbol{\psi}$ is given by $\boldsymbol{\psi} \colon \mathbf{x} \in \mathbb{R}^{Nd} \mapsto \sum_{i=1}^{N} \psi\big(\mathbf{x}^{(i)}\big)$. When there exists a function $\varphi \colon \mathbb{R} \to \mathbb{R}$ such that $\psi(x) = \sum_j \varphi(x_j)$, the compound regularizer is the natural extension of $\psi$ to $\mathbb{R}^{Nd}$.
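One step of the MT-OGD update of Equation (7) can be sketched as follows. This is a minimal sketch under simplifying assumptions: we take the unconstrained case $V = \mathbb{R}^d$ (so the projection is the identity), and the function names are ours. For the clique matrix $A(b) = I_N + bL$ used later, the inverse is available in closed form since $L^2 = L$.

```python
import numpy as np

def interaction_matrix_inv(n_tasks, b):
    # A(b) = I_N + b * L, with L = I_N - (1/N) 11^T. Because L is a
    # projection (L^2 = L), the inverse is A(b)^{-1} = I_N - b/(1+b) * L.
    L = np.eye(n_tasks) - np.ones((n_tasks, n_tasks)) / n_tasks
    return np.eye(n_tasks) - (b / (1.0 + b)) * L

def mt_ogd_step(X, i_t, g_t, eta, A_inv):
    # X has shape (N, d): row i is agent i's current iterate. Only the
    # active agent i_t suffers a loss, but its gradient is shared: row i
    # moves by eta * A^{-1}[i, i_t] * g_t, as in Equation (7) with V = R^d.
    return X - eta * np.outer(A_inv[:, i_t], g_t)
```

With $b = 0$ the inverse is the identity, so only the active row moves and the procedure reduces to independent OGDs, matching the remark that $A = I_N$ recovers the independent case.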
Note, however, that the relationship can be more complex, e.g., when $\psi(x) = \frac{1}{2}\|x\|_p^2$. Using the regularizer $\boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,)$, whose associated divergence is $B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{x}, \mathbf{A}^{1/2}\mathbf{x}'\big)$, the MT-OMD update thus reads

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x} \in \mathbf{V}}\ \langle \eta_t \mathbf{g}_t, \mathbf{x} \rangle + B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{x}, \mathbf{A}^{1/2}\mathbf{x}_t\big). \qquad (9)$$

Clearly, if $\psi = \psi_{\mathrm{Euc}}$, we recover the MT-OGD update. Observe also that whenever $A = I_N$, MT-OMD is equivalent to $N$ independent OMDs. We conclude this exposition with a remark shedding light on the way MT-OMD introduces communication between agents.

Remark 1 (Communication mechanism in MT-OMD). Denoting $\mathbf{y}_t = \mathbf{A}^{1/2}\mathbf{x}_t$, Equation (9) rewrites as

$$\mathbf{x}_{t+1} = \mathbf{A}^{-1/2} \arg\min_{\mathbf{y} \in \mathbf{A}^{1/2}(\mathbf{V})}\ \langle \eta_t\, \mathbf{A}^{-1/2}\mathbf{g}_t, \mathbf{y} \rangle + B_{\boldsymbol{\psi}}(\mathbf{y}, \mathbf{y}_t). \qquad (10)$$

The two occurrences of $\mathbf{A}^{-1/2}$ reveal that agents communicate in two distinct ways: one through the shared update (the innermost occurrence of $\mathbf{A}^{-1/2}$), and one through computing the final prediction $\mathbf{x}_{t+1}$ as a linear combination of the solution to the optimization problem. Multiplying Equation (10) by $\mathbf{A}^{1/2}$, MT-OMD can also be seen as standard OMD on the transformed iterate $\mathbf{y}_t$.

3 Regret Analysis

We now provide a regret analysis for MT-OMD. We start with a general theorem presenting two bounds, for constant and time-varying learning rates. These results are then instantiated for different types of regularizers and variances in Propositions 2 to 8. The main difficulty is to characterize the strong convexity of $\boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,)$; see Lemmas 11 and 12 in the Appendix. Throughout the section, $V \subset \mathbb{R}^d$ is a convex set of comparators, and $(\ell_t)_{t=1}^{T}$ is a sequence of proper convex loss functions chosen by the adversary. Note that all technical proofs can be found in Appendix A.

Theorem 1. Let $\psi \colon \mathbb{R}^d \to \mathbb{R}$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ on $V$, let $A \in \mathbb{R}^{N \times N}$ be symmetric positive definite, and set $\mathbf{x}_1 \in \mathbf{V}$. Then, MT-OMD with $\eta_t := \eta$ produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}$, $R_T(\mathbf{u})$ is bounded by

$$\frac{B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{u}, \mathbf{A}^{1/2}\mathbf{x}_1\big)}{\eta} + \max_{i \le N} \big[A^{-1}\big]_{ii}\ \frac{\eta}{2\lambda} \sum_{t=1}^{T} \|g_t\|_*^2. \qquad (11)$$
Moreover, for any sequence of nonincreasing learning rates $(\eta_t)_{t=1}^{T}$, MT-OMD produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}$, $R_T(\mathbf{u})$ is bounded by

$$\max_{t \le T} \frac{B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{u}, \mathbf{A}^{1/2}\mathbf{x}_t\big)}{\eta_T} + \max_{i \le N} \big[A^{-1}\big]_{ii}\ \frac{1}{2\lambda} \sum_{t=1}^{T} \eta_t \|g_t\|_*^2. \qquad (12)$$

3.1 Multitask Online Gradient Descent

For $\psi = \frac{1}{2}\|\cdot\|_2^2$, $A = I_N$ (independent updates), unit-norm reference vectors $(\mathbf{u}^{(i)})_{i=1}^{N}$, $L_g$-Lipschitz losses, and $\mathbf{x}_1 = 0$, the bound of Equation (11) becomes $ND^2/2\eta + \eta T L_g^2/2$. Choosing $\eta = D\sqrt{N}/(L_g\sqrt{T})$, we recover the $DL_g\sqrt{NT}$ bound of Equation (4). Our goal is to design interaction matrices $A$ that make Equation (11) smaller. In the absence of additional assumptions on the set of comparators, it is however impossible to get a systematic improvement: the bound is a sum of two terms, and introducing interactions typically reduces one term but increases the other. To get around this difficulty, we introduce a simple condition on the task similarity that allows us to control the increase of $B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{u}, \mathbf{A}^{1/2}\mathbf{x}_1\big)$ for a carefully designed class of interaction matrices.

Definition 1. Let $\|\cdot\| \colon \mathbb{R}^d \to \mathbb{R}$ be any norm, and let $\bar{u} = (1/N)\sum_{i=1}^{N} \mathbf{u}^{(i)}$, for any $\mathbf{u} \in \mathbb{R}^{Nd}$. We define the variance of $\mathbf{u}$ with respect to $\|\cdot\|$ as

$$\mathrm{Var}_{\|\cdot\|}(\mathbf{u}) = \frac{1}{N-1} \sum_{i=1}^{N} \big\|\mathbf{u}^{(i)} - \bar{u}\big\|^2.$$

Let $D = \sup_{u \in V} \|u\|$, and $\sigma > 0$. The comparators with variance smaller than $\sigma^2 D^2$ are denoted by

$$\mathbf{V}_{\|\cdot\|, \sigma} = \big\{ \mathbf{u} \in \mathbf{V} : \mathrm{Var}_{\|\cdot\|}(\mathbf{u}) \le \sigma^2 D^2 \big\}. \qquad (13)$$

For sets of comparators of the form of Equation (13), we show that MT-OGD achieves significant improvements over its independent counterpart. The rationale behind this gain is fairly natural: the tasks associated with comparators in Equation (13) are similar due to the variance constraint, so that communication indeed helps. Note that condition Equation (13) does not enforce any restriction on the norms of the individual $\mathbf{u}^{(i)}$, and is much more complex than a simple rescaling of the feasible set by $\sigma^2$.
For instance, one could imagine task vectors highly concentrated around some vector $u_0$ whose norm is $D$: the individual norms are close to $D$, but the task variance is small. This is precisely the construction used in the separation result (Proposition 4). As MT-OMD leverages the additional information of the task variance (unavailable in the independent case), it is expected that an improvement should be possible. The problems of how to use this extra information and what improvement can be achieved through it are addressed in the rest of this section. To that end, we first assume $\sigma^2$ to be known. This assumption can be seen as a learning bias, analogous to the knowledge of the diameter $D$ in standard OGD bounds. In Section 3.4, we then detail a Hedge-based extension of MT-OGD that does not require the knowledge of $\sigma^2$ and only suffers an additional regret of order $\sqrt{T \log N}$.

The class of interaction matrices we consider is defined as follows. Let $L = I_N - \mathbf{1}\mathbf{1}^\top/N$. We consider matrices of the form $A(b) = I_N + bL$, where $b \ge 0$ quantifies the magnitude of the communication. For more intuition about this choice, see Section 3.4. We can now state a first result highlighting the advantage brought by MT-OGD.

Proposition 2. Let $\psi = \frac{1}{2}\|\cdot\|_2^2$, $D = \sup_{x \in V} \|x\|_2$, and $\sigma \le 1$. Assume that $\|\nabla\ell_t(x)\|_2 \le L_g$ for all $t \le T$ and any $x \in V$. Set $b = N$, $\mathbf{x}_1 = 0$, and $\eta = D\sqrt{N(N+1)\big(1 + (N-1)\sigma^2\big)}\big/\big(L_g\sqrt{2T}\big)$. Then, MT-OGD produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}_{\|\cdot\|_2, \sigma}$

$$R_T(\mathbf{u}) \le DL_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{2T}. \qquad (14)$$

Proof sketch. With $\psi = \frac{1}{2}\|\cdot\|_2^2$ and $\mathbf{x}_1 = 0$, we have $2 B_{\boldsymbol{\psi}}\big(A(b)^{1/2}\mathbf{u}, A(b)^{1/2}\mathbf{0}\big) = \|\mathbf{u}\|_2^2 + b(N-1)\,\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})$, which is smaller than $ND^2\big(1 + b\,\frac{N-1}{N}\,\sigma^2\big)$. Then, it is easy to check that $\big[A(b)^{-1}\big]_{ii} = \frac{b+N}{(1+b)N}$ for all $i \le N$. Substituting these values into Equation (11), we obtain

$$R_T(\mathbf{u}) \le \frac{ND^2\big(1 + b\,\frac{N-1}{N}\,\sigma^2\big)}{2\eta} + \frac{\eta T L_g^2}{2}\cdot\frac{b+N}{(1+b)N}.$$

Finally, set $\eta = \frac{ND}{L_g}\sqrt{\big(1 + b\,\frac{N-1}{N}\,\sigma^2\big)\,\frac{1+b}{(b+N)T}}$ and $b = N$.

Thus, MT-OGD enjoys a $\sqrt{1 + \sigma^2(N-1)}$ dependence, which is smaller than $\sqrt{N}$ when tasks have a variance smaller than 1.
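The identity invoked in the proof sketch, $2B_{\boldsymbol{\psi}}\big(A(b)^{1/2}\mathbf{u}, 0\big) = \|\mathbf{u}\|_2^2 + b(N-1)\,\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})$, is easy to verify numerically. A small sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, b = 5, 3, 2.0

U = rng.normal(size=(N, d))            # rows are the task comparators u^{(i)}
u = U.reshape(-1)                      # compound vector in R^{Nd}

L = np.eye(N) - np.ones((N, N)) / N    # Laplacian of the 1/N-weighted clique
A = np.eye(N) + b * L                  # interaction matrix A(b)
A_big = np.kron(A, np.eye(d))          # block version A(b) x I_d

# Var_{||.||_2}(u) = (1/(N-1)) * sum_i ||u^{(i)} - mean||_2^2
var = ((U - U.mean(axis=0)) ** 2).sum() / (N - 1)

lhs = u @ A_big @ u                    # = 2 B_psi(A^{1/2}u, 0) for psi = 0.5||.||_2^2
rhs = u @ u + b * (N - 1) * var
print(abs(lhs - rhs) < 1e-9)
```

The identity holds because $\mathbf{u}^\top(A(b)\otimes I_d)\mathbf{u} = \|\mathbf{u}\|_2^2 + b\,\mathbf{u}^\top(L\otimes I_d)\mathbf{u}$, and the Laplacian quadratic form equals $\sum_i\|\mathbf{u}^{(i)}-\bar u\|_2^2 = (N-1)\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})$.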
When $\sigma = 0$ (all tasks are equal), MT-OGD scales as if there were only one task. When $\sigma \ge 1$, the analysis suggests choosing $b = 0$, i.e., $A = I_N$, and one recovers the performance of independent OGDs. Note that the additional $\sqrt{2}$ factor in Equation (14) can be removed for the limit cases through a better optimization in $b$: the bound obtained in the proof actually reads $DL_g\sqrt{F(\sigma)}\sqrt{T}$, with $F(0) = 1$ and $F(1) = N$. However, the function $F$ lacks interpretability outside of the limit cases (for details see Appendix A.2), motivating our choice to present the looser but more interpretable bound of Equation (14). For a large $N$, we have $\sqrt{1 + \sigma^2(N-1)} \approx \sigma\sqrt{N}$. The improvement brought by MT-OGD is thus roughly proportional to the square root of the task variance. From now on, we refer to this gain as the multitask acceleration. This improvement achieved by MT-OGD is actually optimal up to constants, as revealed by the following lower bound, which is only $1/4$ of Equation (14).

Proposition 3. Under the conditions of Proposition 2, the regret of any algorithm satisfies

$$\sup_{\mathbf{u} \in \mathbf{V}_{\|\cdot\|_2, \sigma}} R_T(\mathbf{u}) \ge \frac{1}{4}\, DL_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{2T}.$$

Another way to gain intuition about Equation (14) is to compare it to the lower bound for OGD run independently on each task (IT-OGD). The following separation result shows that MT-OGD may strictly improve over IT-OGD.

Proposition 4. Let $d \ge 9$, $N = 2d$, and $\sigma \le 1$ to be tuned later. Then, there exists $\mathbf{u} \in \mathbf{V}_{\|\cdot\|_2, \sigma}$ such that $R_T^{\mathrm{IT\text{-}OGD}}(\mathbf{u}) \ge c\,DL_g\sqrt{NT}$ for an absolute constant $c > 0$. Proposition 2 then yields that for any $\sigma^2 < \frac{N-16}{18N-16}$,

$$R_T^{\mathrm{IT\text{-}OGD}}(\mathbf{u}) > R_T^{\mathrm{MT\text{-}OGD}}(\mathbf{u}).$$

3.2 Extension to Arbitrary Norm Regularizers

A natural question is: can the multitask acceleration be achieved with other regularizers? Indeed, the proof of Proposition 2 crucially relies on the fact that the Bregman divergence can be exactly expressed in terms of $\|\mathbf{u}\|_2^2$ and $\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})$.
In the following proposition, we show that such an improvement is also possible for all regularizers of the form $\frac{1}{2}\|\cdot\|^2$, for arbitrary norms $\|\cdot\|$, up to an additional multiplicative constant. A crucial application is the use of the $p$-norm on the probability simplex, which is known to exhibit a logarithmic dependence on $d$ for a well-chosen $p$.

Proposition 5. Let $\|\cdot\| \colon \mathbb{R}^d \to \mathbb{R}$ be any norm, $\psi = \frac{1}{2}\|\cdot\|^2$, $D = \sup_{x \in V} \|x\|$, and $\sigma \le 1$. Assume that $\|\nabla\ell_t(x)\|_* \le L_g$ for all $t \le T$, $x \in V$. Set $b = N$, $\mathbf{x}_1 = 0$, and $\eta = D\sqrt{N(N+1)\big(1 + (N-1)\sigma^2\big)}\big/\big(L_g\sqrt{2T}\big)$. Then, MT-OMD produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}_{\|\cdot\|, \sigma}$

$$R_T(\mathbf{u}) \le 2\, DL_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{2T}.$$

In particular, for $d \ge 3$ and $V = \Delta$, choosing $\|\cdot\| = \|\cdot\|_p$ with $p = 2\ln d/(2\ln d - 1)$, and assuming that $\|\nabla\ell_t(x)\|_\infty \le L_g$, it holds for all $\mathbf{u} \in \mathbf{\Delta}_{\|\cdot\|_p, \sigma}$

$$R_T(\mathbf{u}) \le L_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{16\,e\,T \ln d}.$$

In comparison, under the same assumptions, the bound of Equation (14) would read $L_g\sqrt{1 + \sigma^2(N-1)}\,\sqrt{2dT}$.

Projecting onto $\mathbf{V}_\sigma$. Propositions 2 and 5 reveal that whenever tasks are similar (i.e., whenever $\mathbf{u} \in \mathbf{V}_\sigma$), using the regularizer $\boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,)$ with $A \ne I_N$ accelerates the convergence. However, this is not the only way to leverage the small variance condition. For instance, one may also use this information to directly project onto $\mathbf{V}_\sigma \subset \mathbf{V}$, by considering the update

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x} \in \mathbf{V}_\sigma}\ \langle \eta_t \mathbf{g}_t, \mathbf{x} \rangle + B_{\boldsymbol{\psi}}\big(\mathbf{A}^{1/2}\mathbf{x}, \mathbf{A}^{1/2}\mathbf{x}_t\big). \qquad (15)$$

Although not necessary in general (Propositions 2 and 5 show that communicating the gradients is sufficient to get an improvement), this refinement presents several advantages. First, it might be simpler to compute in practice; see Section 4. Second, it allows for adaptive learning rates, which preserve the guarantees while being independent of the horizon $T$ (Proposition 6). Finally, it allows one to derive $L^\star$ bounds with the multitask acceleration for smooth loss functions (Proposition 7). Results are stated for arbitrary norms, but bounds sharper by a factor $\sqrt{2}$ can be obtained for $\|\cdot\|_2$.

Proposition 6.
Let $\|\cdot\| \colon \mathbb{R}^d \to \mathbb{R}$ be any norm, $\psi = \frac{1}{2}\|\cdot\|^2$, $D = \sup_{x \in V} \|x\|$, and $\sigma \le 1$. Set $b = N$, and $\eta_t = D\sqrt{N(N+1)\big(1 + (N-1)\sigma^2\big)}\,\Big(\sum_{s=1}^{t} \|g_s\|_*^2\Big)^{-1/2}$. Then, Equation (15) produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}_{\|\cdot\|, \sigma}$

$$R_T(\mathbf{u}) \le 8\, D \sqrt{1 + \sigma^2(N-1)}\ \sqrt{\sum_{t=1}^{T} \|g_t\|_*^2}.$$

Proposition 7. Let $\|\cdot\| \colon \mathbb{R}^d \to \mathbb{R}$ be any norm, $\psi = \frac{1}{2}\|\cdot\|^2$, $D = \sup_{x \in V} \|x\|$, and $\sigma \le 1$. Assume that the $\ell_t$ are $M$-smooth, i.e., $\|\nabla\ell_t(x) - \nabla\ell_t(y)\|_* \le M\|x - y\|$ for all $t \le T$ and any $x, y \in V$. Set $b$ and $\eta_t$ as in Proposition 6. Then, the update of Equation (15) produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{V}_{\|\cdot\|, \sigma}$

$$R_T(\mathbf{u}) \le 16\, D \sqrt{1 + \sigma^2(N-1)}\left( MD\sqrt{1 + \sigma^2(N-1)} + \sqrt{M \sum_{t=1}^{T} \ell_t\big(\mathbf{u}^{(i_t)}\big)} \right).$$

Remark 2 (Strongly convex and exp-concave losses). Another popular assumption used to derive improved regret bounds is that the losses are strongly convex or exp-concave. Recall that a function $f$ is said to be $\alpha$-exp-concave if $\exp(-\alpha f)$ is concave (for instance, the logistic loss of a linear predictor with norm bounded by $U$ and unit-norm inputs is $\exp(-2U)/2$-exp-concave). In this context, standard single-task algorithms achieve improved regret bounds of order $\ln T$; see, e.g., (Orabona, 2019, Corollary 7.24 and Section 7.10). We highlight that a simple adaptation of the single-task analysis is not enough to exhibit a multitask acceleration in these cases. Indeed, the cornerstone of our analysis is to leverage the compound representation, in which the interactions are more easily analyzed. On the other hand, the strong convexity or exp-concavity of the losses provides a sharper control on the instantaneous regret that depends on the norm of the active predictor/comparator only, and cannot be extended to the compound framework. Another way to look at the problem is to recall that FTRL deals with strongly convex losses by choosing $\psi \equiv 0$. This means $\tilde{\boldsymbol{\psi}} = \boldsymbol{\psi}(\mathbf{A}^{1/2}\,\cdot\,) \equiv 0$, such that MT-FTRL is equivalent to independent FTRL and no multitask improvement can be achieved.
3.3 Regularizers on the Simplex

As seen in Propositions 2 to 7, MT-OMD induces a multitask acceleration in a wide range of settings, involving different regularizers (Euclidean norm, $p$-norms) and various kinds of loss functions (Lipschitz continuous, smooth). This systematic gain suggests that the multitask acceleration essentially derives from our approach, and is completely orthogonal to the improvements achievable by choosing the regularizer appropriately. Bounds combining both benefits are actually derived in the second claim of Proposition 5. However, all regularizers studied so far share a crucial feature: they are defined on the entire space $\mathbb{R}^d$. As a consequence, the divergence $B_{\boldsymbol{\psi}}(\mathbf{A}^{1/2}\mathbf{u}, \mathbf{A}^{1/2}\mathbf{x})$ is always well defined, which might not be true in general, for instance when the comparator set is the probability simplex $\Delta$. A workaround consists in assigning the value $+\infty$ to the Bregman divergence whenever either of its arguments is outside of the compound simplex $\mathbf{\Delta} = \Delta^N$. The choice of the interaction matrix $A$ then becomes critical to prevent the bound from exploding, and calls for a new definition of the variance. Indeed, note that for $i \le N$ we have

$$\big(A(b)^{1/2}\mathbf{u}\big)^{(i)} = \sqrt{1+b}\;\mathbf{u}^{(i)} + \big(1 - \sqrt{1+b}\big)\,\bar{u}.$$

If all $\mathbf{u}^{(i)}$ are equal (say to $u_0 \in \Delta$), then all $\big(A(b)^{1/2}\mathbf{u}\big)^{(i)}$ are also equal to $u_0$, and $A(b)^{1/2}\mathbf{u} \in \mathbf{\Delta}$. However, if they are not all equal, by definition of $\bar{u}$ there exist $j \le d$ and $i \le N$ such that $\mathbf{u}^{(i)}_j < \bar{u}_j$. Then, for $b$ large enough, $\sqrt{1+b}\,\mathbf{u}^{(i)}_j + \big(1 - \sqrt{1+b}\big)\bar{u}_j$ becomes negative, and $\big(A(b)^{1/2}\mathbf{u}\big)^{(i)}$ falls outside the simplex. Luckily, the maximum acceptable value for $b$ can easily be deduced from the following variance definition.

Definition 2. Let $\mathbf{u} \in \mathbf{\Delta}$. For all $j \le d$, let $u_j^{\max} = \max_{i \le N} \mathbf{u}^{(i)}_j$ and $u_j^{\min} = \min_{i \le N} \mathbf{u}^{(i)}_j$. Then, with the convention $0/0 = 0$, we define

$$\mathrm{Var}_{\Delta}(\mathbf{u}) = \max_{j \le d} \frac{u_j^{\max} - u_j^{\min}}{u_j^{\max}},$$

and for any $\sigma \le 1$,

$$\mathbf{\Delta}_{\sigma} = \big\{ \mathbf{u} \in \mathbf{\Delta} : \mathrm{Var}_{\Delta}(\mathbf{u}) \le \sigma^2 \big\}.$$
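A small numerical sketch of this definition and of the critical magnitude it yields (variable names are ours): for a compound point with $\mathrm{Var}_\Delta(\mathbf{u}) = \sigma^2$, taking $b = (1-\sigma^2)/\sigma^2$ keeps every block of $A(b)^{1/2}\mathbf{u}$ inside the simplex.

```python
import numpy as np

def simplex_variance(U):
    # Var_Delta from Definition 2: max_j (u_j^max - u_j^min) / u_j^max,
    # with the convention 0/0 = 0. Rows of U are points of the simplex.
    umax, umin = U.max(axis=0), U.min(axis=0)
    ratios = np.divide(umax - umin, umax, out=np.zeros_like(umax), where=umax > 0)
    return ratios.max()

rng = np.random.default_rng(1)
N, d = 4, 6
base = rng.dirichlet(np.ones(d))
U = 0.9 * base + 0.1 * rng.dirichlet(np.ones(d), size=N)  # N similar simplex points

sigma2 = simplex_variance(U)
b = (1.0 - sigma2) / sigma2            # maximum acceptable communication magnitude

s = np.sqrt(1.0 + b)
U_transformed = s * U + (1.0 - s) * U.mean(axis=0)  # blocks of A(b)^{1/2} u

print(np.all(U_transformed >= -1e-12))              # entries stay nonnegative
print(np.allclose(U_transformed.sum(axis=1), 1.0))  # rows still sum to one
```

Row sums are preserved automatically, since the transform is an affine combination of points with unit coordinate sum; the variance bound is exactly what guarantees nonnegativity.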
Equipped with this new variance definition, we can now analyze regularizers defined on the simplex.

Proposition 8. Let $\psi \colon \Delta \to \mathbb{R}$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$, and such that there exist $x^\star \in \Delta$ and $C < +\infty$ with $B_\psi(x, x^\star) \le C$ for all $x \in \Delta$. Let $\sigma \le 1$, and assume that $\|\nabla\ell_t(x)\|_* \le L_g$ for all $t \le T$ and $x \in \Delta$. Set $b = (1 - \sigma^2)/\sigma^2$, $\mathbf{x}_1 = [x^\star, \ldots, x^\star]$, and $\eta = N\sqrt{2\lambda(1+b)C}\big/\big(L_g\sqrt{(b+N)T}\big)$. Then, MT-OMD produces a sequence of iterates $(\mathbf{x}_t)_{t=1}^{T}$ such that for all $\mathbf{u} \in \mathbf{\Delta}_{\sigma}$

$$R_T(\mathbf{u}) \le L_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{\frac{2CT}{\lambda}}.$$

For the negative entropy we have $x^\star = [1/d, \ldots, 1/d]$ and $C = \ln d$. With subgradients satisfying $\|\nabla\ell_t(x)\|_\infty \le L_g$, we obtain

$$R_T(\mathbf{u}) \le L_g \sqrt{1 + \sigma^2(N-1)}\ \sqrt{2\,T \ln d}.$$

Proposition 8 shows that the multitask acceleration is not an artifact of the Euclidean geometry, but rather a general feature of MT-OMD, as long as the variance definition is aligned with the geometry of the problem.

3.4 Adaptivity to the Task Variance

Most of the results presented so far require the knowledge of the task variance $\sigma^2$. We now present a Hedge-based extension of MT-OMD, denoted Hedge-MT-OMD, that does not require any prior information on $\sigma^2$. First, note that for $\sigma^2 \ge 1$, MT-OMD becomes equivalent to independent OMDs. A simple approach then consists in using Hedge, see, e.g., (Orabona, 2019, Section 6.8), over a set of experts, each running an instance of MT-OMD with a different value of $\sigma^2$ chosen on a uniform grid of the interval $[0, 1]$.¹ We can show that Hedge-MT-OGD only suffers an additional regret of order $\sqrt{T \log N}$ against MT-OGD run with exact knowledge of $\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})$.

Theorem 9. Let $D = \sup_{x \in V} \|x\|_2$, and assume that $\|\nabla\ell_t(x)\|_2 \le L_g$ for all $t \le T$, $x \in V$. Then, for all $\mathbf{u} \in \mathbf{V}$, the regret of Hedge-MT-OGD exceeds by at most an additional term of order $\sqrt{T \log N}$ the regret bound of MT-OGD run with $\sigma^2 = \min\big\{\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u})/D^2,\, 1\big\}$.

Variance definition and choice of $A$. Note that we have

$$(N-1)\,\mathrm{Var}_{\|\cdot\|_2}(\mathbf{u}) = \frac{1}{N}\sum_{i<j} \big\|\mathbf{u}^{(i)} - \mathbf{u}^{(j)}\big\|_2^2 = \mathbf{u}^\top \mathbf{L}\,\mathbf{u},$$

where $L = I_N - \mathbf{1}\mathbf{1}^\top/N$ is the Laplacian of the weighted clique graph over $\{1, \ldots, N\}$, with edges of weight $1/N$.
A natural extension then consists in considering variances of the form

    Var_W²(u) = ½ Σ_{i,j=1}^N W_{ij} ‖u^{(i)} − u^{(j)}‖_2² = uᵀ L_W u ,

for any adjacency matrix W with Laplacian L_W. For instance, if we expect the tasks to be concentrated in clusters, it is natural to set W_{ij} = 1 if u^{(i)} and u^{(j)} (are thought to) belong to the same cluster, and 0 otherwise. This local version is interesting, as it allows one to satisfy the variance condition with a smaller σ, which improves the MT-OMD regret bound. Note that the proof of Theorem 1 can be readily adapted to this definition by considering the class of interaction matrices {A(b) = I_N + b L_W}. The bound, however, features max_{i ≤ N} [A(b)^{−1}]_{ii}, which depends on W in a nontrivial way and requires a case-by-case analysis, preventing us from stating a general result for an arbitrary W. Considering even more general matrices A, i.e., matrices that do not write as I_N + bL, suffers from the same problem (one then also needs to compute B_ψ(A^{1/2}u, A^{1/2}v) on a case-by-case basis), and no longer enjoys the variance interpretation seen above. Furthermore, note that Proposition 2 is obtained by minimizing Equation (11) with respect to A. For matrices of the form A(b), this tradeoff only depends on b, and is thus much easier to solve than for general matrices. Finally, we stress that local variances can similarly be considered on the simplex. Instead of involving the global u_j^max, the variance formula then features, for each task/node, a local maximum (respectively minimum) over its neighbours.

¹ Note that Hedge-MT-OGD computes the loss subgradients at arbitrary points (corresponding to the experts' predictions).

3.5 Additional Remarks

We conclude this section with two additional results. The first one draws an interesting connection between our multitask framework and the notion of dynamic regret, while the second emphasizes the importance of the asynchronous nature of the activations.
Remark 3 (Connection to dynamic regret). Note that our online multitask setting can be viewed as a special case of dynamic regret, where each comparator u_t in the sequence u_1, u_2, … belongs to an unknown set {u⋆_1, …, u⋆_N} of known cardinality N. Moreover, at the beginning of each time step t, the learner is told the index i_t ∈ {1, …, N} of the comparator u_t against which the regret is measured at time t. As a consequence, any algorithm for dynamic regret minimization can be used in our setting. In the Euclidean case, the optimal dynamic regret bound is of order √((D² + D V_T) T), where D is the Euclidean diameter of the decision space and V_T = Σ_{t=1}^{T−1} ‖u_{t+1} − u_t‖_2. In order to facilitate the comparison to our bound √((D² + D²(N−1)σ²) T), let us assume that the sequence of adversarial activations is such that i_t ≠ i_{t−1}, and that all pairs of distinct elements of {u⋆_1, …, u⋆_N} appear with the same frequency as consecutive comparators (u_t, u_{t+1}). Let P_T = Σ_{t=1}^{T−1} ‖u_{t+1} − u_t‖_2². We have

    P_T = (T/(N(N−1))) Σ_{i ≠ j} ‖u⋆_i − u⋆_j‖_2² = 2T Var_2(u⋆) ≤ 2Tσ²D² ,

so that, for σ² = Var_2(u⋆)/D², our upper bound reads √(D²T + (N−1)P_T/2). When all the increments ‖u_{t+1} − u_t‖_2 are of the same order we have V_T ≈ √(T P_T), and a direct computation then shows that our bound improves over the dynamic regret one as soon as T ≥ (N−1)/√2.

We finally derive a regret bound in the case where several agents are active at each time step.

Proposition 10. Consider the setting of Proposition 2, but assume now that at each time step t a subset of agents A_t ⊂ {1, …, N}, of cardinality |A_t| = p, is chosen by the adversary and asked to make predictions. Recall that in this case the regret of an independent approach is of order D L_g √(pNT). Then, MT-OMD run with b = N/p and η = ND √(1 + pσ²(N−1)) / (L_g √(p(1+p)T)) produces a sequence of iterates (x_t)_{t=1}^T such that for all u ∈ V̄_{2,σ}

    R_T(u) ≤ D L_g √(1 + σ²(N−1)) √((p^{1/2} + p^{3/2}) T) .

Note that for p = 1, we recover exactly Proposition 2.
Proposition 10 highlights that having asynchronous activations is critical: the more agents are active at a single time step, i.e., the bigger p, the bigger √(1 + σ²(N−1)) √(p^{1/2} + p^{3/2}), i.e., the smaller the multitask acceleration. In the extreme case where p = N, our multitask framework reduces to a standard online convex optimization problem, with diameter √N D and gradients with Euclidean norms bounded by √N L_g. Standard lower bounds are then of order N D L_g √T, confirming that no multitask acceleration is possible.

4 Algorithms

We now show that MT-OGD and MT-EG enjoy closed-form updates, making them easy to implement. Note that the MT-OGD derivation is valid for any positive definite matrix A, while MT-EG requires A^{−1/2} to be stochastic. This is verified by matrices of the form A = I_N + L_W (Lemma 13).

MT-OGD. Let V = {u ∈ R^d : ‖u‖_2 ≤ D}, and σ ≤ 1. Recall that V̄ = V^N = {u ∈ R^{Nd} : ‖u‖_{2,∞} ≤ D}, and V̄_{2,σ} = {u ∈ V̄ : Var_2(u) ≤ σ²D²}. Solving the first equation in Equation (8), we obtain that the iterate x_{t+1} produced by MT-OGD is the solution to

    min { ‖x_t − η_t A^{−1} ḡ_t − x‖²_A : ‖x‖_{2,∞} ≤ D } .

However, computing this update is made difficult by the discrepancy between the norms used in the objective and in the constraint. A simple workaround consists in minimizing over the Mahalanobis ball V_A = {u ∈ R^{Nd} : ‖u‖²_A ≤ (1 + bσ²) N D²} instead. It is easy to check that V̄_{2,σ} ⊂ V_A, so that every result derived in Section 3 for MT-OGD remains valid (only the fact that comparators and iterates are in V_A is actually used). With the substitution y_t = A^{1/2} x_t, the MT-OGD update then rewrites (see Appendix B.1 for technical details)

    y_{t+1} = Proj( y_t − η_t A^{−1/2} ḡ_t , √((1 + bσ²)N) D ) ,   (16)

where Proj(x, τ) = min{1, τ/‖x‖_2} x. Note that Equation (16) can easily be turned back into an update on x_t through the inverse substitution. In practice, however, x_t only needs to be computed to make the predictions. The pseudo-code of the algorithm is given in Algorithm 1.
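Before the pseudo-code, the transformed update of Equation (16) can be sketched in a few lines. The code below is our own illustration, not the authors' implementation: it uses the closed form A(b)^{±1/2} = c I_N + (1 − c) 11ᵀ/N with c = (1+b)^{±1/2}, a hypothetical square loss with a common target across tasks, and an illustrative 1/√t learning-rate schedule.

```python
# Minimal sketch (our own code) of the MT-OGD update of Equation (16).
# Dimensions, losses, and the data stream below are hypothetical.
import math, random

N, d, D, sigma2, Lg, T = 5, 3, 1.0, 0.1, 1.0, 500
b = N  # choice validating Proposition 2's bound
radius = math.sqrt((1 + b * sigma2) * N) * D  # radius of the ball V_A

def apply_Ahalf(y, power):
    """Blockwise product A(b)^{power} y, for power in {+1/2, -1/2}."""
    c = (1 + b) ** power
    mean = [sum(y[i][k] for i in range(N)) / N for k in range(d)]
    return [[c * y[i][k] + (1 - c) * mean[k] for k in range(d)] for i in range(N)]

y = [[0.0] * d for _ in range(N)]
random.seed(0)
for t in range(1, T + 1):
    x = apply_Ahalf(y, 0.5)            # compound prediction x_t = A^{1/2} y_t
    i_t = random.randrange(N)          # adversary activates agent i_t
    target = [0.3] * d                 # toy common target => very similar tasks
    g = [x[i_t][k] - target[k] for k in range(d)]  # grad of (1/2)||x - target||^2
    gbar = [[g[k] if i == i_t else 0.0 for k in range(d)] for i in range(N)]
    step = apply_Ahalf(gbar, -0.5)     # A^{-1/2} gbar
    eta = N * D / (Lg * math.sqrt(t))  # illustrative 1/sqrt(t) schedule
    z = [[y[i][k] - eta * step[i][k] for k in range(d)] for i in range(N)]
    norm = math.sqrt(sum(v * v for row in z for v in row))
    scale = min(1.0, radius / norm) if norm > 0 else 1.0   # Proj(., radius)
    y = [[scale * v for v in row] for row in z]

x = apply_Ahalf(y, 0.5)
# after many rounds, every agent's prediction approaches the common target
assert all(abs(x[i][k] - 0.3) < 0.1 for i in range(N) for k in range(d))
```

Since the toy tasks share one target (zero variance), the shared component of A^{−1/2}ḡ_t pulls every agent toward it even when inactive, which is exactly the mechanism behind the multitask acceleration.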
Algorithm 1 MT-OGD
input: a positive definite interaction matrix A ∈ R^{N×N}, σ² > 0 (task variance), D > 0 (radius of the comparator set), (η_t)_t > 0 (learning-rate schedule)
initialization: A ← A ⊗ I_d ∈ R^{Nd×Nd}, x_1 = 0, y_1 = A^{−1/2} x_1
1: Compute A^{1/2} and A^{−1/2}
2: for time steps t = 1, 2, … do
3:   Compute the compound prediction x_t = A^{1/2} y_t
4:   Activate agent i_t and predict x_t^{(i_t)}
5:   Get g_t ∈ ∂ℓ_t(x_t^{(i_t)}) and compute ḡ_t
6:   Compute y_{t+1} = min{ 1, √((1+bσ²)N) D / ‖y_t − η_t A^{−1/2} ḡ_t‖_2 } (y_t − η_t A^{−1/2} ḡ_t)

MT-EG. Using Equation (8) with y_t = A^{1/2} x_t, MT-EG reads

    ỹ_{t+1} = argmin_{y ∈ R^{Nd}} η ⟨A^{−1/2} ḡ_t, y⟩ + B_ψ̄(y, y_t) ,
    y_{t+1} = argmin_{y ∈ A^{1/2}(Δ̄)} B_ψ̄(y, ỹ_{t+1}) ,   (17)

where ψ̄ is the compound negative entropy regularizer, such that ψ̄(x) = Σ_{i=1}^N Σ_{j=1}^d x_j^{(i)} ln x_j^{(i)}. One can show (see Appendix B.2 for details) that the update can be rewritten, for all i ≤ N and j ≤ d, as

    ỹ^{(i)}_{t+1,j} = y^{(i)}_{t,j} exp( −η [A^{−1/2}]_{i i_t} g_{t,j} − 1 ) ,   y^{(i)}_{t+1,j} = ỹ^{(i)}_{t+1,j} / Σ_{k=1}^d ỹ^{(i)}_{t+1,k} .

Combining both equations, we finally obtain

    y^{(i)}_{t+1,j} = y^{(i)}_{t,j} e^{−η [A^{−1/2}]_{i i_t} g_{t,j}} / Σ_{k=1}^d y^{(i)}_{t,k} e^{−η [A^{−1/2}]_{i i_t} g_{t,k}} .   (18)

Update Equation (18) enjoys a natural interpretation: each block y^{(i)} performs an individual standard EG update, but with gradient [A^{−1/2}]_{i i_t} g_t. When A = I_N, only the active block is updated. Otherwise, the update of block i is proportional to [A^{−1/2}]_{i i_t}, which quantifies the similarity between tasks i and i_t. The pseudo-code of the algorithm is given in Algorithm 2. Although this work focuses on OMD for clarity, note that considering Follow-the-Regularized-Leader (see, e.g., Orabona, 2019, Section 7) with ψ̃ = ψ̄(A^{1/2} ·) would yield similar bounds. This would allow, for instance, the use of time-varying learning rates with entropic regularization.

Algorithm 2 MT-EG
input: a positive definite interaction matrix A ∈ R^{N×N}, η > 0 (learning rate)
initialization: A ← A ⊗ I_d ∈ R^{Nd×Nd}, x_1 = 1/d ∈ R^{Nd} (uniform blocks), y_1 = A^{−1/2} x_1
1: Compute A^{1/2} and A^{−1/2}
2: for time steps t = 1, 2, . . .
do
3:   Compute the compound prediction x_t = A^{1/2} y_t
4:   Activate agent i_t and predict x_t^{(i_t)}
5:   Get g_t ∈ ∂ℓ_t(x_t^{(i_t)}) and compute ḡ_t
6:   Compute y^{(i)}_{t+1,j} = y^{(i)}_{t,j} e^{−η [A^{−1/2}]_{i i_t} g_{t,j}} / Σ_{k=1}^d y^{(i)}_{t,k} e^{−η [A^{−1/2}]_{i i_t} g_{t,k}}, for all i ∈ [N], j ∈ [d]

5 Experiments

In this section, we empirically compare the performance of Hedge-MT-OGD/EG against two natural alternatives: an independent-task approach (IT-OGD/EG), where the agents do not communicate, and a single-task approach (ST-OGD/EG), where a single model is learned and shared by all agents. Note that both the IT and ST approaches are special cases of MT-OMD, obtained respectively with the choices b = 0 (i.e., σ² ≥ 1) and b = +∞ (i.e., σ² = 0). In Appendix D we report an additional experiment in which we empirically validate the dependence of the performance of MT-OGD on the task variance.

Online Gradient Descent. For this experiment, we use the Lenk dataset (Lenk et al., 1996; Argyriou et al., 2007). It consists of 2880 computer ratings made by 180 individuals (the tasks) on the basis of 14 binary features. Each computer is rated on a discrete scale from 0 to 10, expressing the likelihood of the individual buying that computer. We run Hedge-MT-OGD using the clique interaction matrix A = (1+N) I_N − 11ᵀ and the square loss. For all algorithms, the value of η is set according to the optimal theoretical value, see Proposition 2. In Hedge-MT-OGD, the variance σ² is learned over a set of 5 experts uniformly spaced on [0, 1]. For simplicity, we use D = 1 and compute the resulting Lipschitz constant accordingly. Results are reported in Figure 1(a).

Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset consisting of 62 classes (images of digits, small and capital letters). To speed up computation, we reduced the number of features from 784 down to 10 through a standard dimensionality reduction method.
We created 61 binary classification tasks by considering the 0 digit class against each of the other classes. To each task, we assigned 10 examples (5 positive, 5 negative), randomly chosen from the set of examples for that task. We considered linear logistic regression and ran Hedge-MT-EG with the parameterized clique interaction matrix A(b) = (1+b) I_N − (b/N) 11ᵀ. The value of b is set according to the theoretical value (which depends on σ², see Proposition 8), while σ² is learned over a set of 5 experts uniformly spaced on [0, 1]. For all algorithms, the value of η is set according to the optimal theoretical values. Results are reported in Figure 1(b).

6 Conclusion

We introduced and analyzed MT-OMD, a multitask extension of OMD whose regret is shown to improve as the task variance, expressed in terms of the geometry induced by the regularizer, decreases. We provided a unifying analysis and a single algorithm that explain when multitask acceleration is possible for a given geometry, and how to achieve it. Natural and interesting directions for future research include: (1) analyzing the multitask acceleration in combination with other properties, such as strongly convex losses, and (2) designing and analyzing an extension of MT-OMD that is adaptive to the best interaction matrix.

Jacob Abernethy, Peter Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Proceedings of the 20th Annual Conference on Learning Theory, pp. 484–498, 2007.

Figure 1: Comparison between multitask (MT), independent-task (IT), and single-task (ST) OGD and EG on the Lenk and EMNIST datasets: (a) Lenk (OGD, cumulative square loss); (b) EMNIST (EG, cumulative logistic loss). We plot the cumulative losses against time. Lenk is known to work well in multi-task settings, and indeed Hedge-MT-OGD performs significantly better than both baselines.
On the other hand, EMNIST has a variance significantly higher than Lenk. However, even in this unfavorable scenario, Hedge-MT-EG still outperforms the baselines, though by a small margin.

Götz Alefeld and Norbert Schneider. On square roots of M-matrices. Linear Algebra and its Applications, 42:119–132, 1982.

Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 261–269, 2017.

Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pp. 25–32, 2007.

Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 424–433, 2019.

Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

Etienne Boursier, Mikhail Konobeev, and Nicolas Flammarion. Trace norm regularization for multi-task learning with scarce data. arXiv preprint arXiv:2202.06742, 2022.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, 2010.

Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, pp. 737–745, 2013.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Ofer Dekel, Philip M. Long, and Yoram Singer.
Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8(10):2233–2264, 2007.

Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems 32, pp. 13089–13099, 2019.

Canh T. Dinh, Tung T. Vu, Nguyen H. Tran, Minh N. Dao, and Hongyu Zhang. FedU: A unified framework for federated multi-task learning with Laplacian regularization. arXiv preprint arXiv:2102.07148, 2021.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117, 2004.

Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(4):615–637, 2005.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 1920–1930, 2019.

Claudio Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.

Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pp. 171–183, 1997.

Quanquan Gu, Zhenhui Li, and Jiawei Han. Learning a kernel for multi-task clustering. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 368–373, 2011.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Mark Herbster and Guy Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In Proceedings of the 22nd Conference on Learning Theory, 2009.

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari.
Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(1):1865–1890, 2012.

Pierre Laforgue, Andrea Della Vecchia, Nicolò Cesa-Bianchi, and Lorenzo Rosasco. AdaTask: Adaptive multitask online learning. arXiv preprint arXiv:2205.15802, 2022.

Peter J. Lenk, Wayne S. DeSarbo, Paul E. Green, and Martin R. Young. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173–191, 1996.

Guangxia Li, Steven C. H. Hoi, Kuiyu Chang, Wenting Liu, and Ramesh Jain. Collaborative online multitask learning. IEEE Transactions on Knowledge and Data Engineering, 26(8):1866–1876, 2014.

Rui Li, Fenglong Ma, Wenjun Jiang, and Jing Gao. Online federated multitask learning. In Proceedings of the 7th IEEE International Conference on Big Data, pp. 215–220, 2019.

Keerthiram Murugesan, Hanxiao Liu, Jaime Carbonell, and Yiming Yang. Adaptive smoothed online multitask learning. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 4296–4304, 2016.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Anastasia Pentina and Christoph H. Lampert. Multi-task learning with labeled and unlabeled tasks. In Proceedings of the 34th International Conference on Machine Learning, pp. 2807–2816, 2017.

Gianluigi Pillonetto, Francesco Dinuzzo, and Giuseppe De Nicolao. Bayesian online multitask learning of Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2):193–205, 2008.

Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, et al. Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 643–651, 2011.

Shai Shalev-Shwartz. Online learning: Theory, algorithms, and applications.
PhD thesis, The Hebrew University of Jerusalem, 2007.

Daniel Sheldon. Graphical multi-task learning. Technical report, Computing and Information Science Technical Reports, Cornell University, 2008.

Changjian Shui, Mahdieh Abbasi, Louis-Émile Robitaille, Boyu Wang, and Christian Gagné. A principled approach for learning task similarity in multitask learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3446–3452, 2019.

Suvrit Sra. Fast projections onto mixed-norm balls with applications. Data Mining and Knowledge Discovery, 25(2):358–377, 2012.

Jonathan Tuck, Shane Barratt, and Stephen Boyd. A distributed method for fitting Laplacian regularized stratified models. Journal of Machine Learning Research, 22(60):1–37, 2021.

Yang Yang, Zhigang Ma, Yi Yang, Feiping Nie, and Heng Tao Shen. Multitask spectral clustering by exploring intertask correlation. IEEE Transactions on Cybernetics, 45(5):1083–1094, 2014.

Chi Zhang, Peilin Zhao, Shuji Hao, Yeng Chai Soh, Bu Sung Lee, Chunyan Miao, and Steven C. H. Hoi. Distributed multi-task classification: a decentralized online learning approach. Machine Learning, 107(4):727–747, 2018.

Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.

Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 733, 2010.

A Technical Proofs

This appendix gathers all the technical proofs of the results stated in the article. First, we provide a notation table recalling the most important notation used along the paper.

    T ∈ N – Time horizon
    N ∈ N – Number of tasks
    d ∈ N – Dimension of the single reference vectors
    u_i ∈ R^d, for i ≤ N – Reference vector (best model) for task i
    u = [u_1, …, u_N] ∈ R^{Nd} – Compound reference vector
    u^{(i)} ∈ R^d – Block i of the compound vector u (= u_i here)
    ū = (1/N) Σ_i u^{(i)} – Average reference vector
    Var_‖·‖(u) = (1/(N−1)) Σ_i ‖u^{(i)} − ū‖² – Variance of the reference vectors w.r.t. the norm ‖·‖
    V ⊂ R^d – Generic set of comparators for a single task
    D = sup_{u ∈ V} ‖u‖ – Half diameter of V w.r.t. ‖·‖
    V̄ = {u ∈ R^{Nd} : ∀i ≤ N, u^{(i)} ∈ V} – Generic set of compound comparators
    V̄_{‖·‖,σ} = {u ∈ V̄ : Var_‖·‖(u) ≤ σ²D²} – Comparators in V̄ with small variance
    u_j^min = min_{i ≤ N} u_j^{(i)} – Minimum (among tasks) of component j ≤ d
    u_j^max = max_{i ≤ N} u_j^{(i)} – Maximum (among tasks) of component j ≤ d
    Var_Δ(u) = max_{j ≤ d} (1 − u_j^min/u_j^max)² – Variance aligned with the probability simplex
    Δ = {u ∈ R_+^d : Σ_j u_j = 1} – Probability simplex in R^d
    Δ̄ = {u ∈ R^{Nd} : ∀i ≤ N, u^{(i)} ∈ Δ} – Compound probability simplex
    Δ̄_σ = {u ∈ Δ̄ : Var_Δ(u) ≤ σ²} – Comparators in Δ̄ with small variance
    1 ∈ R^N – Vector filled with 1's
    I_N ∈ R^{N×N} – Identity matrix of dimension N
    A ∈ R^{N×N} – Generic interaction matrix
    A(b) = (1+b) I_N − (b/N) 11ᵀ – Parameterized interaction matrix
    A ⊗ I_d ∈ R^{Nd×Nd} – Block version (Kronecker product with I_d) of A, also denoted A when clear from context
    A(b) ⊗ I_d – Block version of A(b)
    x_{i,t} ∈ R^d, for i ≤ N and t ≤ T – Prediction maintained by agent i at time t
    x_t = [x_{1,t}, …, x_{N,t}] ∈ R^{Nd} – Compound predictions maintained at time t
    i_t ≤ N – Active agent at time t (chosen by the adversary)
    ℓ_t: R^d → R – Loss function at time t (chosen by the adversary)
    g_t = ∇ℓ_t(x_t^{(i_t)}) ∈ R^d – Subgradient of ℓ_t at point x_t^{(i_t)}
    ḡ_t = [0, …, 0, g_t, 0, …, 0] ∈ R^{Nd} – Compound subgradient (null outside block i_t)
    ψ: V → R – Generic base regularizer
    B_ψ: V × V → R – Bregman divergence induced by ψ
    ψ̄: u ∈ R^{Nd} ↦ Σ_i ψ(u^{(i)}) – Compound regularizer
    ψ̃ = ψ̄(A^{1/2} ·) – Regularizer of interest

Table 1: Notation table.

A.1 Proof of Theorem 1

The first important building block of our proof consists in characterizing the strong convexity of ψ̃ (Lemma 11). To that end, we need to introduce the following definition and notation about norms.

Definition 3. Let N: R^d → R and N′: R^N → R be two norms.
We define the mixed norm ‖·‖_{N,N′}: R^{Nd} → R as follows.² For all x ∈ R^{Nd}, ‖x‖_{N,N′} is given by the N′ norm of the vector of R^N composed of the N norms of the blocks (x^{(i)})_{i=1}^N. Formally, it reads

    for all x ∈ R^{Nd},  ‖x‖_{N,N′} = N′( N(x^{(1)}), …, N(x^{(N)}) ) .

When N = ‖·‖_p and N′ = ‖·‖_q, for p, q ∈ [1, +∞], one recovers the standard ℓ_{p,q} mixed norm. We use the shortcut notations ‖·‖_{p,N′} := ‖·‖_{‖·‖_p, N′} and ‖·‖_{N,q} := ‖·‖_{N, ‖·‖_q}. Let s ∈ N, ‖·‖: R^s → R be any norm, and M ∈ R^{s×s} be a symmetric positive definite matrix. We define the Mahalanobis version of ‖·‖, denoted ‖·‖_M, as

    for all x ∈ R^s,  ‖x‖_M = ‖M^{1/2} x‖ .

Notice that for ‖·‖ = ‖·‖_2 we recover the standard Mahalanobis norm, such that ‖x‖_{2,M} = √(xᵀ M x). For simplicity, and when it is clear from the context, we use the shortcut notation ‖·‖_M = ‖·‖_{2,M}. We are now equipped to establish the strong convexity of ψ̃.

Lemma 11. Assume that the base regularizer ψ: R^d → R is λ-strongly convex with respect to some norm ‖·‖. Then, the associated regularizer ψ̃ is λ-strongly convex with respect to the Mahalanobis mixed norm ‖·‖_{‖·‖,2,A}.

Proof. First, notice that for every x ∈ R^{Nd} it holds

    H_ψ̃(x) = A^{1/2} H_ψ̄(A^{1/2} x) A^{1/2} ,

where H_ψ̄(x) ∈ R^{Nd×Nd} denotes the Hessian matrix of the compound regularizer ψ̄ at point x. Moreover, due to the definition ψ̄: x ∈ R^{Nd} ↦ Σ_{i=1}^N ψ(x^{(i)}), the matrix H_ψ̄(x) is block diagonal, namely

    H_ψ̄(x) = diag( H_ψ(x^{(1)}), H_ψ(x^{(2)}), …, H_ψ(x^{(N)}) ) .

Assuming that ψ is λ-strongly convex with respect to the norm ‖·‖, it thus holds for all x, v ∈ R^{Nd}

    ⟨H_ψ̃(x) v, v⟩ = ⟨H_ψ̄(A^{1/2}x) A^{1/2}v, A^{1/2}v⟩ = Σ_{i=1}^N ⟨H_ψ((A^{1/2}x)^{(i)}) (A^{1/2}v)^{(i)}, (A^{1/2}v)^{(i)}⟩ ≥ λ Σ_{i=1}^N ‖(A^{1/2}v)^{(i)}‖² = λ ‖v‖²_{‖·‖,2,A} .

The second important intermediate result (Lemma 12) deals with dual norms, which we now recall.

² Note that a sufficient condition for ‖·‖_{N,N′} to be a norm is that N′(z_1, …, z_N) = N′(|z_1|, …, |z_N|). In Lemma 11 we use N′ = ‖·‖_2, which satisfies this property.

Definition 4. Let s ∈ N, and ‖·‖: R^s → R be any norm. The dual norm of ‖·‖, denoted ‖·‖_⋆, is defined as

    for all x ∈ R^s,  ‖x‖_⋆ = sup_{y ∈ R^s : ‖y‖ ≤ 1} ⟨x, y⟩ .
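As a small sanity check of these definitions (our own toy example, with illustrative values), one can evaluate a mixed norm on blocks and verify a dual norm by brute force:

```python
# Our own illustration of Definitions 3 and 4: a mixed norm ||x||_{1,2}
# (l1 norm of each block, then l2 across blocks), and a brute-force check
# that the dual of the l1 norm is the l_infinity norm.
import math

def mixed_norm(x_blocks, inner, outer):
    """||x||_{N,N'}: apply `inner` to each block, then `outer` across blocks."""
    return outer([inner(blk) for blk in x_blocks])

l1 = lambda z: sum(abs(v) for v in z)
l2 = lambda z: math.sqrt(sum(v * v for v in z))

x = [[1.0, -2.0], [0.5, 0.5]]          # N = 2 blocks of dimension d = 2
assert abs(mixed_norm(x, l1, l2) - math.sqrt(10)) < 1e-12   # sqrt(3^2 + 1^2)

# Dual norm (Definition 4): sup_{||y||_1 <= 1} <v, y> equals ||v||_inf.
v = [3.0, -7.0]
grid = ((t / 100, s * (1 - abs(t) / 100)) for t in range(-100, 101) for s in (-1, 1))
best = max(a * v[0] + b * v[1] for a, b in grid)  # scan of the l1 unit sphere
assert abs(best - 7.0) < 1e-9
```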
Lemma 12. Let p, q ∈ [1, +∞], and p⋆, q⋆ their conjugates, such that 1/p + 1/p⋆ = 1/q + 1/q⋆ = 1. Then it holds that (‖·‖_p)_⋆ = ‖·‖_{p⋆} and (‖·‖_{p,q})_⋆ = ‖·‖_{p⋆,q⋆}. Let s ∈ N, ‖·‖: R^s → R be any norm, and M ∈ R^{s×s} be a symmetric positive definite matrix. Then it holds that (‖·‖_M)_⋆ = ‖·‖_{⋆,M^{−1}}. Assume furthermore that s can be decomposed as s = s_1 s_2, for s_1, s_2 ∈ N. Then, for any norm N: R^{s_1} → R it holds that

    (‖·‖_{N,2})_⋆ = ‖·‖_{N⋆,2} .

In particular, combining the last two results yields (‖·‖_{‖·‖,2,A})_⋆ = ‖·‖_{‖·‖⋆,2,A^{−1}}.

Proof. The first result is standard in convex analysis, see e.g. Appendix A.1.6 in Boyd & Vandenberghe (2004). The second result is due to Sra (2012), see Lemma 3 therein. The last two results rely on the following equality (see e.g. Example 3.27 in Boyd & Vandenberghe, 2004): for any norm ‖·‖ it holds

    (½ ‖·‖²)_⋆ = ½ (‖·‖_⋆)² ,   (19)

where f_⋆ denotes the Fenchel–Legendre conjugate of f, defined as f_⋆(x) = sup_y ⟨x, y⟩ − f(y), see Bauschke & Combettes (2011). Applying Equation (19) to ‖·‖_M, we get for all x ∈ R^s

    ½ ((‖·‖_M)_⋆)²(x) = sup_{y ∈ R^s} ⟨x, y⟩ − ½ ‖M^{1/2} y‖² = sup_{z ∈ R^s} ⟨M^{−1/2} x, z⟩ − ½ ‖z‖² = ½ ((‖·‖_⋆)(M^{−1/2} x))² = ½ ‖x‖²_{⋆,M^{−1}} ,

using the substitution z = M^{1/2} y. Applying Equation (19) to ‖·‖_{N,2}, we get for any x ∈ R^{s_1 s_2}

    ½ ((‖·‖_{N,2})_⋆)²(x) = sup_{y ∈ R^{s_1 s_2}} ⟨x, y⟩ − ½ ‖y‖²_{N,2} = sup_{y ∈ R^{s_1 s_2}} Σ_i ⟨x^{(i)}, y^{(i)}⟩ − ½ Σ_i N(y^{(i)})² = Σ_i sup_{y^{(i)} ∈ R^{s_1}} ⟨x^{(i)}, y^{(i)}⟩ − ½ N(y^{(i)})² = Σ_i ½ N_⋆(x^{(i)})² = ½ ‖x‖²_{N⋆,2} .

We are now ready to prove Theorem 1. The proof follows from standard arguments used to analyze OMD, combined with Lemmas 11 and 12.

Proof of Theorem 1. First, notice that the compound representation allows us to write

    Σ_{i=1}^N [ Σ_{t: i_t=i} ℓ_t(x_{i,t}) − inf_{u ∈ V} Σ_{t: i_t=i} ℓ_t(u) ] = sup_{u ∈ V̄} Σ_{t=1}^T ℓ_t(x_t^{(i_t)}) − ℓ_t(u^{(i_t)}) .

Next, for any u ∈ V̄, the convexity and sub-differentiability of the ℓ_t implies

    Σ_{t=1}^T ℓ_t(x_t^{(i_t)}) − ℓ_t(u^{(i_t)}) ≤ Σ_{t=1}^T ⟨g_t, x_t^{(i_t)} − u^{(i_t)}⟩ = Σ_{t=1}^T ⟨ḡ_t, x_t − u⟩ .

Now, observe that update Equation (9) actually defines an OMD on the iterates x_t, for the sequence of gradients (ḡ_t)_{t=1}^T, and with regularizer ψ̃. Recall also that ψ̃ is λ-strongly convex with respect to ‖·‖_{‖·‖,2,A} (Lemma 11), whose dual norm is ‖·‖_{‖·‖⋆,2,A^{−1}} (Lemma 12). Applying the standard OMD bound of Equation (3), we thus obtain that for all u ∈ V̄ and η > 0 it holds

    Σ_{t=1}^T ⟨ḡ_t, x_t − u⟩ ≤ B_ψ̃(u, x_1)/η + (η/2λ) Σ_{t=1}^T ‖ḡ_t‖²_{‖·‖⋆,2,A^{−1}} .
Then, we use that for all x, y ∈ R^{Nd} it holds B_ψ̃(x, y) = B_ψ̄(A^{1/2}x, A^{1/2}y), and that

    ‖ḡ_t‖²_{‖·‖⋆,2,A^{−1}} = Σ_{i=1}^N ‖(A^{−1/2} ḡ_t)^{(i)}‖_⋆² = Σ_{i=1}^N [A^{−1/2}]²_{i i_t} ‖g_t‖_⋆² = [A^{−1}]_{i_t i_t} ‖g_t‖_⋆² ≤ max_{i ≤ N} [A^{−1}]_{ii} ‖g_t‖_⋆² .

Combining all the arguments, we finally obtain

    R_T(u) ≤ B_ψ̄(A^{1/2}u, A^{1/2}x_1)/η + max_{i ≤ N} [A^{−1}]_{ii} (η/2λ) Σ_{t=1}^T ‖g_t‖_⋆² .

The second claim of the theorem is obtained in a similar fashion. Standard OMD results (Orabona, 2019, Theorem 6.8) give

    R_T(u) ≤ max_{t ≤ T} B_ψ̃(u, x_t)/η_T + (1/2λ) Σ_{t=1}^T η_t ‖ḡ_t‖²_{‖·‖⋆,2,A^{−1}} .

Replacing B_ψ̃(u, x_t) and ‖ḡ_t‖²_{‖·‖⋆,2,A^{−1}} with B_ψ̄(A^{1/2}u, A^{1/2}x_t) and max_{i ≤ N} [A^{−1}]_{ii} ‖g_t‖_⋆², respectively, yields the desired result.

A.2 Proof of Proposition 2

First, it is easy to check that for A(b) = (1+b) I_N − (b/N) 11ᵀ we have

    A(b)^{1/2} = √(1+b) I_N + (1 − √(1+b)) 11ᵀ/N ,   A(b)^{−1} = (1/(1+b)) ( I_N + (b/N) 11ᵀ ) .

This implies

    max_{i ≤ N} [A(b)^{−1}]_{ii} = (b+N)/((1+b)N) ,   (20)

and

    2 B_ψ̄(A(b)^{1/2}u, A(b)^{1/2}0) = Σ_{i=1}^N ‖(A(b)^{1/2}u)^{(i)}‖_2² = Σ_{i=1}^N ‖√(1+b)(u^{(i)} − ū) + ū‖_2² = (1+b) Σ_{i=1}^N ‖u^{(i)} − ū‖_2² + N‖ū‖_2² = Σ_{i=1}^N ‖u^{(i)}‖_2² + b(N−1) Var_2(u) ,   (21)

where the cross terms vanish since Σ_i (u^{(i)} − ū) = 0. Substituting Equations (20) and (21) into Equation (11), and using the definition of V̄_{2,σ}, we get that R_T(u) is smaller than

    [Σ_i ‖u^{(i)}‖_2² + b(N−1)Var_2(u)]/(2η) + (η T L_g²/2) (b+N)/((1+b)N) ≤ ND²(1 + bσ²(N−1)/N)/(2η) + (η T L_g²/2) (b+N)/((1+b)N) .

Setting η = (ND/L_g) √((1 + bσ²(N−1)/N)(1+b)/((b+N)T)), we get

    R_T(u) ≤ D L_g √( (1 + bσ²(N−1)/N) (b+N)/(1+b) · T ) .

We now optimize over b. Letting α = σ²(N−1)/N, the optimality condition writes

    (1 + αN + 2αb⋆)(1 + b⋆) = (1 + αb⋆)(b⋆ + N)  ⟺  αb⋆² + 2αb⋆ + 1 + (α−1)N = 0  ⟺  b⋆ = √((1−α)(N−1)/α) − 1 .

And we have (1 + αb⋆)(b⋆+N)/(1+b⋆) = 1 + αN + 2αb⋆. Hence, the bound becomes D L_g √(F(σ) T), with

    F(σ) = 1 + αN + 2αb⋆ = 1 + σ²(N−1) − 2α + 2√(α(1−α)(N−1)) .

By the AM–GM inequality, 2√(α(1−α)(N−1)) ≤ α(N−1) + (1−α) ≤ 1 + σ²(N−1), so that F(σ) ≤ 2(1 + σ²(N−1)), and the bound becomes D L_g √(1 + σ²(N−1)) √(2T), a value which can also be achieved directly by setting b = N above.
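The closed forms above are easy to verify numerically; the following check (our own toy values) validates Equations (20) and (21):

```python
# Numerical check (our own toy values) of the closed forms in the proof of
# Proposition 2: A(b) = (1+b) I_N - (b/N) 11^T, its inverse
# A(b)^{-1} = (I_N + (b/N) 11^T)/(1+b), the diagonal of Equation (20),
# and the identity of Equation (21).
import math

N, b, d = 4, 2.5, 3
A = [[(1 + b) * (i == j) - b / N for j in range(N)] for i in range(N)]
Ainv = [[((i == j) + b / N) / (1 + b) for j in range(N)] for i in range(N)]
prod = [[sum(A[i][k] * Ainv[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
assert all(abs(prod[i][j] - (i == j)) < 1e-12 for i in range(N) for j in range(N))
# Equation (20): max_i [A(b)^{-1}]_{ii} = (b + N) / ((1 + b) N)
assert abs(max(Ainv[i][i] for i in range(N)) - (b + N) / ((1 + b) * N)) < 1e-12

# Equation (21): ||A(b)^{1/2} u||_2^2 = sum_i ||u^(i)||^2 + b (N-1) Var_2(u),
# applying A(b)^{1/2} blockwise as sqrt(1+b) u^(i) + (1 - sqrt(1+b)) ubar.
u = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
ubar = [sum(u[i][k] for i in range(N)) / N for k in range(d)]
s = math.sqrt(1 + b)
Au = [[s * u[i][k] + (1 - s) * ubar[k] for k in range(d)] for i in range(N)]
lhs = sum(v * v for row in Au for v in row)
var2 = sum((u[i][k] - ubar[k]) ** 2 for i in range(N) for k in range(d)) / (N - 1)
rhs = sum(v * v for row in u for v in row) + b * (N - 1) * var2
assert abs(lhs - rhs) < 1e-9
```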
A.3 Proof of Proposition 3

This lower bound is proved by adapting the standard lower bound for (single-agent) Online Linear Optimization, see, e.g., (Orabona, 2019, Chapter 5). We consider linear losses, i.e., we assume that for all t ≤ T there exists g_t ∈ R^d such that ℓ_t(x) = ⟨g_t, x⟩. Our goal is to show that for any sequence x_1, …, x_T we can construct a sequence of losses (or equivalently of g_t) such that the regret is lower bounded. Recall that d is the dimension of the set V and D > 0 its radius, and let σ < 1. Assume that N is even and smaller than 2d, and for all i ∈ {1, …, N/2} let

    w^{(2i−1)} = √((N−1)/N) Dσ e_i ,   w^{(2i)} = −√((N−1)/N) Dσ e_i ,

where (e_i)_{i ≤ d} is the canonical basis of R^d. It is easy to check that ‖w^{(i)}‖_2 ≤ D, and that Var_2(w) = σ²D², such that w ∈ V̄_{2,σ}. Assume that N divides T, and that the agents are activated in a cyclic fashion, i.e., i_t = 1 + (t mod N). We introduce the family of gradient vectors

    g_t = ε_{⌈t/N⌉} L_g w^{(i_t)} / ‖w^{(i_t)}‖_2 ,   (22)

where ε_1, …, ε_{T/N} are valued in {−1, 1}, with exact values to be determined later on. Indeed, we show that for any sequence of predictions, there exists a choice of ε_1, …, ε_{T/N} such that the regret is lower bounded. To that end, we use the fact that for any function F: {−1, 1}^{T/N} → R, and any probability distribution P with support in {−1, 1}, we have

    sup_{ε ∈ {−1,1}^{T/N}} F(ε) ≥ E_{ε ∼ P^{T/N}} [F(ε)] .

In particular, we choose P to be the Rademacher distribution, such that P{ε = 1} = P{ε = −1} = 1/2, and all expectations are now understood to be taken with respect to this distribution. Note that we have ‖g_t‖_2 ≤ L_g and E_ε[g_t] = 0, for all t. Given any sequence x_1, …, x_T, we can use the gradients of Equation (22) and obtain

    sup_{ε} R_T ≥ E[ Σ_{t=1}^T ⟨g_t, x_t^{(i_t)}⟩ − min_{u ∈ V̄_{2,σ}} Σ_{t=1}^T ⟨g_t, u^{(i_t)}⟩ ]
      = E[ max_{u ∈ V̄_{2,σ}} − Σ_{t=1}^T ε_{⌈t/N⌉} (L_g/‖w^{(i_t)}‖_2) ⟨w^{(i_t)}, u^{(i_t)}⟩ ]
      ≥ (L_g/‖w^{(1)}‖_2) E[ max_{u ∈ {−w, w}} − Σ_{τ=1}^{T/N} ε_τ ⟨w, u⟩ ]
      = (L_g ‖w‖_2²/‖w^{(1)}‖_2) E| Σ_{τ=1}^{T/N} ε_τ |   (23)
      ≥ (L_g ‖w‖_2²/‖w^{(1)}‖_2) √(T/(2N))   (24)
      = σ D L_g √((N−1)T/2) ,

where in Equation (23) we used ‖w‖_2² = (N−1)D²σ² (together with the fact that all the ‖w^{(i)}‖_2 are equal and that each agent is activated once per block of N rounds), and in Equation (24) we used the Khintchine inequality, see, e.g., (Cesa-Bianchi & Lugosi, 2006, Lemma 8.2). We now combine this lower bound with the one obtained by choosing i_t = 1 for all t and applying the standard single-agent lower bound D L_g √(T/2), see, e.g., (Orabona, 2019, Theorem 5.1). Combining these two bounds, we obtain that the regret is lower bounded by

    D L_g √(T/2) max{ 1, σ √(N−1) } ,

which is at least 1/4 of the upper bound of Equation (14). Hence, with knowledge of σ², there is no algorithm with a better multitask acceleration than MT-OMD, up to constant factors.

A.4 Proof of Proposition 4

Consider the following setting. Let N = 2d, σ < 1, and u_0 ∈ R^d be such that ‖u_0‖_2² = 1 − σ², and set for all i ≤ d:

    u^{(2i−1)} = u_0 − √((N−1)/N) σ e_i ,   u^{(2i)} = u_0 + √((N−1)/N) σ e_i ,

where (e_i)_{i ≤ d} is the canonical basis of R^d. It is easy to check that ‖u^{(i)}‖_2² ≥ 1 − 2σ² for all i ≤ N, and that Var_2(u) = σ². Then, a standard lower bound for OGD (Orabona, 2019, Theorem 5.1) applied to the N individual independent OGDs of IT-OGD, with linear losses, unit-norm loss gradients, and cyclic activations such that T_i = T/N, gives that

    R_T ≥ (1/4) √(2NT(1 − 2σ²)) .

This lower bound is strictly greater than the MT-OGD upper bound of Equation (14) as soon as N(1 − 2σ²)/8 ≥ 2(1 + σ²(N−1)), or equivalently as soon as σ² ≤ (N−16)/(18N−16).

A.5 Proof of Proposition 5

As discussed in the main body, the analysis for general norms is made more complex by the fact that Equation (21) does not hold with equality anymore. Instead, a series of approximations leveraging the properties of norms is required.
Indeed, for any norm ‖·‖ and regularizer ψ = ½‖·‖², it holds

    2 B_ψ̄(A(b)^{1/2}u, A(b)^{1/2}0) = Σ_{i=1}^N ‖(A(b)^{1/2}u)^{(i)}‖² = Σ_{i=1}^N ‖ū + √(1+b)(u^{(i)} − ū)‖² ≤ 2N‖ū‖² + 2(1+b) Σ_{i=1}^N ‖u^{(i)} − ū‖² ≤ 2ND² + 2(1+b)(N−1)σ²D² ≤ 4ND²(1 + bσ²(N−1)/N) ,   (25)

using ‖a + c‖² ≤ 2(‖a‖² + ‖c‖²), ‖ū‖ ≤ D, and (N−1)σ² ≤ N. In comparison to Equation (21), the bound is thus multiplied by 4. The rest of the proof (i.e., the optimization over η and b) is similar to that of Proposition 2, and we obtain that for all u ∈ V̄_{‖·‖,σ} it holds:

    R_T(u) ≤ 2 D L_g √(1 + σ²(N−1)) √(2T) .

An interesting use case unlocked by the previous result is the use of the p-norm (for p ∈ [1, 2]) on the probability simplex Δ = {x ∈ R_+^d : Σ_j x_j = 1}. Indeed, by a careful tuning of p, one can derive bounds scaling as √(ln d), instead of √d for OGD. Interestingly, this improvement is orthogonal to our multitask acceleration, so that it is possible to benefit from both. Recall that the regularizer ½‖·‖_p² is (p−1)-strongly convex with respect to ‖·‖_p (Shalev-Shwartz, 2007, Lemma 17), and that the dual norm of ‖·‖_p is ‖·‖_q, with q such that 1/p + 1/q = 1. Consider V = Δ, and loss functions such that ‖g_t‖_∞ ≤ L for all g_t ∈ ∂ℓ_t(x), x ∈ Δ. One can check that ‖g_t‖_q² ≤ L² d^{2/q}. Then, substituting Equation (25) into Equation (11) and using the previous remark, we get that for all u ∈ V̄_{p,σ} it holds

    R_T(u) ≤ 2ND²(1 + bσ²(N−1)/N)/η + η (b+N)/((1+b)N) · T L² d^{2/q}/(2(p−1)) ≤ (2DL d^{1/q}/√(p−1)) √( (1 + bσ²(N−1)/N) (b+N)/(1+b) · T ) ,

with the choice η = 2ND √( (1+b)(1 + bσ²(N−1)/N)(p−1) / ((b+N) T L² d^{2/q}) ). Hence, the bound obtained is the product of two terms, one depending only on b, the other only on p. We can thus optimize them independently. The term in b is the same as in the previous proofs, so we can reuse the analysis made for Proposition 2. The term in d is the same as in the original proofs of Grove et al. (1997) and Gentile (2003), and the optimization is thus similar. We repeat it here for completeness. One has d^{1/q}/√(p−1) = √(q−1) d^{1/q} ≤ √q d^{1/q}. By differentiating with respect to q, the last term is minimized at q = 2 ln d, with value √(2e ln d). The final bound obtained is

    R_T(u) ≤ 2DL √(2e ln d) √(2(1 + σ²(N−1)) T) = DL √( (1 + σ²(N−1)) · 16e T ln d ) .

We conclude with a few remarks.
First, in order to ensure that q ≥ 2 (we need p ≤ 2), we may assume that d ≥ 3. Second, note that the improvement in the dependence on d comes at the price of a stronger variance condition, as we have Var₂(u) ≤ Var_p(u) for all p ≤ 2. If one is interested in a condition independent of d (indeed, p depends on q, which depends on d), the variance with respect to ‖·‖₁ can be used. Finally, note that we have used D = sup_{x∈Δ} ‖x‖_p ≤ sup_{x∈Δ} ‖x‖₁ = 1.

A.6 Proof of Proposition 6

This proposition builds upon the second claim of Theorem 1. We start by detailing the proof for the specific choice ‖·‖ = ‖·‖₂. Observe that for ψ = ½‖·‖₂² and all u, x ∈ V_{2,σ} we have

B_ψ(A^{1/2}u, A^{1/2}x) = ½‖u − x‖²_A ≤ ‖u‖²_A + ‖x‖²_A ≤ 2ND²(1 + b(N−1)σ²/N) .

Substituting into Equation (12), we obtain

R_T(u) ≤ 2ND²(1 + b(N−1)σ²/N)/η_T + (b+N)/(2(1+b)N) Σ_{t=1}^T η_t ‖g_t‖₂²
 = (b+N)/(2(1+b)N) ( D̃²/η_T + Σ_{t=1}^T η_t ‖g_t‖₂² ) ,

with D̃² = 4N²D²(1+b)(1 + b(N−1)σ²/N)/(b+N). Using

η_t = D̃ / √(2 Σ_{i≤t} ‖g_i‖₂²) = ND √( 2(1+b)(1 + b(N−1)σ²/N) / ((b+N) Σ_{i≤t} ‖g_i‖₂²) ) ,

(Orabona, 2019, Lemma 4.13) gives

R_T(u) ≤ (b+N)/((1+b)N) D̃ √(2 Σ_{t=1}^T ‖g_t‖₂²) = 2√2 D √( (b+N)(1 + b(N−1)σ²/N) Σ_{t=1}^T ‖g_t‖₂² / (1+b) ) .   (26)

Choosing b = N concludes the proof. In comparison to Proposition 2 (and assuming that ‖g_t‖₂ ≤ L_g), the bound is multiplied by 2√2: the √2 multiplication is due to the choice of η_t, while the other 2 = √4 multiplication comes from the upper bound on max_{t≤T} B_ψ(A^{1/2}u, A^{1/2}x_t), which is 4 times bigger than the upper bound on B_ψ(A^{1/2}u, A^{1/2}x₁). The proof for any norm follows the same path. The only difference is in bounding max_{t≤T} B_ψ(A^{1/2}u, A^{1/2}x_t), which is 4 times bigger than the same quantity in the Euclidean case, see Equation (25). Therefore, an additional √4 = 2 factor is added.

A.7 Proof of Proposition 7

Using Equation (26) with b = N, the smoothness of the losses gives

R_T(u) ≤ 4D √( (1 + σ²(N−1)) Σ_{t=1}^T ℓ_t(x_t^{(i_t)}) ) .

Using Lemma 4.20 in Orabona (2019), we obtain

R_T(u) ≤ 8D √(1 + σ²(N−1)) ( D √(1 + σ²(N−1)) + √( Σ_{t=1}^T ℓ_t(u^{(i_t)}) ) ) .

As for Proposition 6, the claim for general losses is obtained by multiplying the bound by 2.
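The tuning of q above is easy to verify numerically. The following sketch (our own; the function name `phi` is ours) checks that q = 2 ln d minimizes √q · d^{1/q} and that the minimum value equals √(2e ln d):

```python
import math

def phi(q, d):
    # the term sqrt(q) * d**(1/q) appearing in the p-norm tuning
    return math.sqrt(q) * d ** (1.0 / q)

d = 1000
q_star = 2 * math.log(d)                          # claimed minimizer
val_star = math.sqrt(2 * math.e * math.log(d))    # claimed minimum value

# closed-form value matches phi at q_star: d**(1/(2 ln d)) = e**(1/2)
assert abs(phi(q_star, d) - val_star) < 1e-9

# q_star beats a fine grid of alternative values of q
qs = [2 + 0.01 * k for k in range(1, 5000)]
assert all(phi(q_star, d) <= phi(q, d) + 1e-12 for q in qs)
```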
A.8 Proof of Proposition 8

Let x₁ = [x̄, …, x̄], and observe that A(b)^{1/2}x₁ = x₁. Then, for all u such that A(b)^{1/2}u ∈ Δ^N (Δ being the probability simplex), we have

B_ψ(A(b)^{1/2}u, A(b)^{1/2}x₁) = B_ψ(A(b)^{1/2}u, x₁) = Σ_{i=1}^N B_ψ( (A(b)^{1/2}u)^{(i)}, x̄ ) ≤ NC .   (27)

Plugging Equation (27) into Equation (11), we obtain for all such u

R_T(u) ≤ NC/η + η (b+N) T L_g² / (2λ(1+b)N) ≤ L_g √( 2(b+N) C T / (λ(1+b)) ) ,   (28)

where we have set η = N √( 2λC(1+b) / ((b+N) T L_g²) ). The next natural question is how to choose b. As the bound in Equation (28) is decreasing in b, one is encouraged to choose b as large as possible. However, recall that Equation (27) requires A(b)^{1/2}u to be in Δ^N. So the optimal choice is b* = max{b ≥ 0 : A(b)^{1/2}u ∈ Δ^N}. This value unfortunately depends on u and cannot be used uniformly over Δ^N. However, the variance condition for σ allows us to choose a global b, as we show now. Let u ∈ Δ^N_σ. Recall that for all i ≤ N we have

(A(b)^{1/2}u)^{(i)} = √(1+b) u^{(i)} + (1 − √(1+b)) ū .

We have to check that these vectors are in the simplex for all i ≤ N. There are two conditions that a vector x should verify to be in the simplex: (i) 1ᵀx = 1, and (ii) x_j ≥ 0 for all j ≤ d. It is immediate to see that the first condition is always satisfied. To analyze the second condition, we recall the following notation from Definition 2: for j ≤ d,

u_j^max = max_{i≤N} u_j^{(i)} ,  u_j^min = min_{i≤N} u_j^{(i)} .

Now, let j ≤ d. A sufficient condition for (ii) to hold for every (A(b)^{1/2}u)^{(i)}, i ≤ N, is

√(1+b) u_j^min + (1 − √(1+b)) u_j^max ≥ 0 ,

or, equivalently,

√(1+b) ≤ u_j^max / (u_j^max − u_j^min) .

Since this condition must hold for every j ≤ d, the overall condition is

√(1+b) ≤ min_{j≤d} u_j^max / (u_j^max − u_j^min) ,

which is satisfied by the choice b = 1/σ² − 1 = (1 − σ²)/σ² under the variance condition of Definition 2. This condition is quite intuitive. If all best models are equal, i.e., u_j^min = u_j^max for all j ≤ d, one can choose b = +∞, and achieve a bound independent of N. On the contrary, if there exists j ≤ d such that u_j^max = 1 and u_j^min = 0, i.e., two of the reference vectors sit at different corners of the simplex, then b = 0 is the only possible choice, and a dependence on N is unavoidable.
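The closed form of A(b)^{1/2} used throughout this proof can be sanity-checked numerically. A minimal sketch (our own), verifying the per-task expression and the fact that coordinate sums equal to one are preserved (condition (i) above; the nonnegativity condition (ii) additionally requires the constraint on b derived in the text):

```python
import numpy as np

N, d, b = 4, 3, 2.0
rng = np.random.default_rng(0)

# A(b) = (1+b) I_N - (b/N) 11^T has eigenvalue 1 on the all-ones direction
# and (1+b) on its orthogonal complement, hence
# A(b)^{1/2} = sqrt(1+b) I_N + (1 - sqrt(1+b)) 11^T / N.
A = (1 + b) * np.eye(N) - (b / N) * np.ones((N, N))
A_half = np.sqrt(1 + b) * np.eye(N) + (1 - np.sqrt(1 + b)) * np.ones((N, N)) / N
assert np.allclose(A_half @ A_half, A)

# per-task closed form: (A^{1/2}u)^{(i)} = sqrt(1+b) u^{(i)} + (1 - sqrt(1+b)) u_bar
U = rng.random((N, d))
U /= U.sum(axis=1, keepdims=True)     # put each task's vector on the simplex
lhs = A_half @ U
rhs = np.sqrt(1 + b) * U + (1 - np.sqrt(1 + b)) * U.mean(axis=0)
assert np.allclose(lhs, rhs)

# condition (i): coordinates of each (A^{1/2}u)^{(i)} still sum to one
assert np.allclose(lhs.sum(axis=1), 1.0)
```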
Finally, with b = (1 − σ²)/σ² ≥ 0, we have (b+N)/(1+b) = 1 + σ²(N−1), and bound Equation (28) becomes

L_g √( 2CT(1 + σ²(N−1)) / λ ) .

A.9 Proof of Theorem 9

Let ε > 0, and consider the covering C_ε = {ε, 2ε, …, 1}, of cardinality 1/ε. Let u ∈ V; we want to derive an upper bound on the regret achieved by the best expert in C_ε against u. First, assume that Var₂(u) ≤ 1, and let σ̄² = inf{z ∈ C_ε : Var₂(u) ≤ z}. Note that by definition we have Var₂(u) ≤ σ̄² ≤ Var₂(u) + ε. Now, let us bound the regret of MT-OGD run with σ² = σ̄². Recall that for any fixed learning rate η, the regret of MT-OGD is bounded by

ND²(1 + (N−1)Var₂(u))/(2η) + η T L_g²/N ≤ ND²(1 + (N−1)σ̄²)/(2η) + η T L_g²/N .   (29)

MT-OGD run with σ² = σ̄² uses η = ND √(1 + σ̄²(N−1)) / (L_g √(2T)). Plugging this into Equation (29), we get that the regret is upper bounded by

D L_g √(2T(1 + σ̄²(N−1))) ≤ D L_g √(2T(1 + (N−1) min{Var₂(u), 1})) + D L_g √(2T(N−1)ε) ,   (30)

using σ̄² ≤ Var₂(u) + ε and √(a+b) ≤ √a + √b. On the other hand, if Var₂(u) ≥ 1, we know that MT-OGD run with σ² = 1 is equivalent to independent OGDs and has a regret bounded by

D L_g √(2TN) = D L_g √(2T(1 + (N−1) min{Var₂(u), 1})) .   (31)

Combining Equation (30) and Equation (31), we know that in any case the best expert in C_ε always has a regret against u smaller than

D L_g √(2T(1 + (N−1) min{Var₂(u), 1})) + D L_g √(2T(N−1)ε) .   (32)

Let us now compute the regret of Hedge-MT-OGD against the best expert in C_ε. By the analysis of Hedge with linear combination of the experts, we know that it is bounded by (Orabona, 2019)

L √(2T log(1/ε)) ,   (33)

where L is an upper bound on the infinity norm of the gradients received by Hedge. The latter are equal to the vectors of losses incurred by the different experts at each time step. With linear(ized) losses, and denoting by x_t^expert the prediction of one expert at time step t, we have

ℓ_t(x_t^expert) = ⟨g_t, x_t^expert⟩ = ⟨g_t, x_t^{expert,(i_t)}⟩ ≤ ‖g_t‖₂ ‖x_t^{expert,(i_t)}‖₂ ≤ D L_g .

Hence, L ≤ D L_g.
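The covering argument translates directly into an implementation. Below is a minimal sketch (our own; the expert losses are random placeholders standing in for the linearized losses ⟨g_t, x_t^expert⟩ incurred by the MT-OGD instances) of Hedge run over the grid C_ε:

```python
import math
import random

# Hedge over the covering grid C_eps = {eps, 2*eps, ..., 1}; each "expert" is
# an MT-OGD instance run with one candidate variance parameter from the grid.
eps = 0.25
grid = [eps * (k + 1) for k in range(int(1 / eps))]   # candidate sigma^2 values
K, T = len(grid), 200
eta = math.sqrt(8 * math.log(K) / T)   # standard Hedge rate for losses in [0, 1]

random.seed(0)
weights = [1.0 / K] * K
for t in range(T):
    # placeholder losses in [0, 1]; in Hedge-MT-OGD these are the experts' losses
    losses = [random.random() for _ in range(K)]
    weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(weights)
    weights = [w / z for w in weights]   # normalize: play the weighted mixture

assert abs(sum(weights) - 1.0) < 1e-9 and all(w > 0 for w in weights)
```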
Now, for any u ∈ V, the regret of Hedge-MT-OGD against u is upper bounded by the sum of: (1) the regret of Hedge with respect to the best expert in C_ε, which is upper bounded by Equation (33), and (2) the regret of the best expert in C_ε against u, which is upper bounded by Equation (32). Summing the two upper bounds and setting ε = 1/N yields the desired result.

A.10 Proof of Proposition 10

Let (ℓ_{t,i})_{i∈A_t} be the losses associated with the active agents at time step t. For MT-OMD run with matrix A = A(b) = (1+b)I_N − (b/N)11ᵀ, for b ≥ 0, and constant learning rate η > 0, we have

Σ_{t=1}^T Σ_{i∈A_t} ( ℓ_{t,i}(x_t^{(i)}) − ℓ_{t,i}(u^{(i)}) ) ≤ ND²(1 + b(N−1)σ²/N)/η + (η/2) Σ_{t=1}^T ‖g_t‖²_{A^{−1}} ,   (34)

where g_t ∈ R^{Nd} is the compound gradient vector with non-zero blocks at the indices present in A_t. Recall that A(b)^{−1} = (1/(1+b)) I_N + (b/(1+b)) 11ᵀ/N. We have

‖g_t‖²_{A^{−1}} = Σ_{i,j} A^{−1}_{ij} ⟨g_t^{(i)}, g_t^{(j)}⟩ ≤ Σ_{i∈A_t} A^{−1}_{ii} ‖g_t^{(i)}‖₂² + Σ_{i≠j∈A_t} A^{−1}_{ij} ⟨g_t^{(i)}, g_t^{(j)}⟩
 ≤ p L_g² (b+N)/((1+b)N) + p(p−1) L_g² b/((1+b)N) = p L_g² (pb+N)/((1+b)N) .

Substituting into Equation (34) and setting b = pN, we have

R_T(u) ≤ ND²(1 + pσ²(N−1))/η + η T L_g² (p^{1/2} + p^{3/2})/(2N) .

Setting η = (ND/L_g) √( 2(1 + pσ²(N−1)) / (T(p^{1/2} + p^{3/2})) ), we finally get

R_T(u) ≤ D L_g √( 2T(1 + pσ²(N−1))(p^{1/2} + p^{3/2}) ) .

B Derivation of the Algorithms

In this appendix we gather all the technical details about the algorithms.

B.1 Details on MT-OGD

With the change of feasible set, the iterate x_{t+1} produced by MT-OGD is the solution to

min_{x∈R^{Nd}} ‖x_t − η_t A^{−1}g_t − x‖²_A ,  s.t. ‖x‖²_A ≤ (1 + bσ²)ND² ,

or again

min_{x∈R^{Nd}} ‖A^{1/2}x_t − η_t A^{−1/2}g_t − A^{1/2}x‖₂² ,  s.t. ‖A^{1/2}x‖₂² ≤ (1 + bσ²)ND² .

Introducing the notation y_t = A^{1/2}x_t, the update on the y_t's writes

y_{t+1} = Proj( y_t − η_t A^{−1/2}g_t , √((1 + bσ²)N) D ) ,

where Proj(·, r) denotes the Euclidean projection onto the ball of radius r.

B.2 Details on MT-EG

Recall that for the negative entropy regularizer we have

B_ψ(x, y) = Σ_{i=1}^N Σ_{j=1}^d x_j^{(i)} ln( x_j^{(i)} / y_j^{(i)} ) .

Writing out the objective function of the first update, we obtain

⟨η A^{−1/2} g_t, y⟩ + B_ψ(y, y_t) = Σ_{i=1}^N Σ_{j=1}^d ( η (A^{−1/2})_{i,i_t} g_{t,j} y_j^{(i)} + y_j^{(i)} ln( y_j^{(i)} / y_{t,j}^{(i)} ) ) .
For all i ≤ N and j ≤ d, differentiating with respect to y_j^{(i)} and setting the gradient to 0, we get

ỹ_{t+1,j}^{(i)} = y_{t,j}^{(i)} exp( −η (A^{−1/2})_{i,i_t} g_{t,j} − 1 ) .   (35)

We focus now on the second update, Equation (17). The constraint y ∈ A^{1/2}(Δ^N) rewrites A^{−1/2}y ∈ Δ^N, or equivalently

∀ i ≤ N:  1ᵀ (A^{−1/2}y)^{(i)} = 1 ,   ∀ i ≤ N, j ≤ d:  (A^{−1/2}y)_j^{(i)} ≥ 0 .

We introduce the matrix Y ∈ R^{N×d} such that Y_{kl} = y_l^{(k)}. Then it holds for all i ≤ N

1ᵀ (A^{−1/2}y)^{(i)} = Σ_{k=1}^N (A^{−1/2})_{ik} Σ_{l=1}^d Y_{kl} = (A^{−1/2} Y 1)_i .

Hence, using that A^{−1/2} is stochastic, the first constraint rewrites

A^{−1/2} Y 1 = 1  ⟺  Y 1 = 1  ⟺  ∀ i ≤ N: 1ᵀ y^{(i)} = 1 .   (36)

Similarly, the second constraint reads

∀ k ≤ N, j ≤ d:  (A^{−1/2} Y)_{kj} ≥ 0 .   (37)

The Lagrangian associated with Problem Equation (17) writes

L(Y, Λ, µ) = Σ_{i,j} Y_{ij} ln( Y_{ij} / ỹ_{t+1,j}^{(i)} ) − Σ_{k,j} Λ_{kj} (A^{−1/2}Y)_{kj} + Σ_i µ_i ( Σ_j Y_{ij} − 1 ) ,

with Λ ∈ R^{N×d} and µ ∈ R^N the Lagrange multipliers associated with constraints Equation (37) and Equation (36) respectively. For all i ≤ N, j ≤ d, differentiating with respect to Y_{ij} and setting the gradient to 0 yields

[Y^{(t+1)}]_{ij} := y_{t+1,j}^{(i)} = ỹ_{t+1,j}^{(i)} e^{ (A^{−1/2}Λ)_{ij} − µ_i − 1 } .   (38)

Furthermore, the complementary slackness conditions write, for all i ≤ N and j ≤ d,

Λ_{ij} (A^{−1/2} Y^{(t+1)})_{ij} = 0 .

However, Equation (38) gives Y^{(t+1)}_{ij} > 0, and matrix A^{−1/2} is stochastic (see Lemma 13), so we have (A^{−1/2} Y^{(t+1)})_{ij} > 0, and consequently Λ_{ij} = 0. Then y_{t+1,j}^{(i)} = ỹ_{t+1,j}^{(i)} e^{−µ_i − 1}. Using 1ᵀ y_{t+1}^{(i)} = 1, we get e^{−µ_i − 1} = 1 / Σ_{k=1}^d ỹ_{t+1,k}^{(i)}. Substituting Equation (35), we get for all i ≤ N

y_{t+1,j}^{(i)} = ỹ_{t+1,j}^{(i)} / Σ_{k=1}^d ỹ_{t+1,k}^{(i)} = y_{t,j}^{(i)} e^{−η (A^{−1/2})_{i,i_t} g_{t,j}} / Σ_{k=1}^d y_{t,k}^{(i)} e^{−η (A^{−1/2})_{i,i_t} g_{t,k}} .

Lemma 13. Let A = I_N + L, where L is the Laplacian matrix associated with any weighted undirected graph on {1, …, N}. Then A^{−1} and A^{−1/2} are (doubly) stochastic matrices.

Proof.
It is immediate to see from the definition of A that A^{−1} and A^{−1/2} are both symmetric and satisfy A^{−1}1 = A^{−1/2}1 = 1. It remains to check that their entries are nonnegative. To that end, we use the fact that inverses of M-matrices are entrywise nonnegative. Matrix A is a non-singular M-matrix, as its off-diagonal entries are nonpositive (i.e., A is a Z-matrix) and its eigenvalues are positive. As a consequence, A^{1/2} is also an M-matrix (Alefeld & Schneider, 1982, Theorem 4), which concludes the proof.

C Technical Comparison to Cavallanti et al. (2010)

First, we would like to remind the reader that, unlike Cavallanti et al. (2010): (i) we tackle any subdifferentiable loss function, and not only the hinge loss for classification; (ii) we address any strongly convex regularizer, and not only the squared Euclidean norm; (iii) our proofs provide clear insights into the regularizer's behaviour, and are not just black-box applications of the Kernel Perceptron theorem; (iv) we provide (matching) lower bounds; (v) we develop what we believe is the correct generalization to p-norms, see the technical details below; (vi) we provide a unifying framework and a general analysis that allow us to deal with non-Euclidean geometries on the simplex; (vii) we are adaptive to the task variance σ².

About the p-norm extension, the two regularizers used in Cavallanti et al. (2010) and in this paper are fundamentally different, as they write respectively

ψ(u) = ‖Au‖²_p  and  ψ(u) = Σ_{i=1}^N ‖(A^{1/2}u)^{(i)}‖²_p .

A first advantage of our method is that we recover the Euclidean framework by setting p = 2, which is not the case for the regularizer used in Cavallanti et al. (2010). This makes us believe that we propose the right way to address p-norms, and that a general theory of multitask acceleration is worth developing to avoid misinterpretations of this kind. A second advantage is that our bounds have a much better dependence on the task variance σ² and the number of tasks N.
Recall that after several approximations, see Appendix A.5, our bound for p-norms on the simplex features the variance term

√(1 + (N−1)σ²) ≤ 1 + σ√(N−1) ,   (39)

where σ² is an upper bound on the task variance according to ‖·‖₁, such that

Var₁(u) = (1/(N−1)) Σ_{i=1}^N ‖u^{(i)} − ū‖₁² ≤ σ² .

On the other hand, the bound of Theorem 6 in Cavallanti et al. (2010) features the variance term

‖(1+N)u^{(i)} − N ū‖₁ ≤ ‖u^{(i)}‖₁ + N ‖u^{(i)} − ū‖₁ ≤ 1 + Nσ .   (40)

From Equations (39) and (40), it is clear that our bounds exhibit a much better dependence on σ² and N. We further highlight that the bound of Theorem 6 in Cavallanti et al. (2010) contains an extra term which is the square of Equation (40), although it is not multiplied by any time-related quantity.

D Additional Experiment

We report the results of a synthetic experiment where we plot the multitask regret of MT-OGD against the standard deviation σ of the reference vectors. Specifically, we consider a regression problem in R² with respect to the square loss. We set D = 1 and generate N = 4 tasks, defining the reference vectors as in the proof of Proposition 3 (see Appendix A.3), so that the task variance is set to the desired value. Each task consists of a sequence of 10 losses, so that T = 40. The losses for the i-th task are generated by first sampling an instance x from the unit sphere centered at 1, and then computing its label as y = ⟨u^{(i)}, x⟩ + ϵ, where ϵ is an independent Gaussian noise with variance 0.1, truncated to [−1, 1]. We plot the final multitask regret R_T averaged over 30 runs. MT-OGD is tuned with the optimal parameter choices for b and η according to the theory. Results are shown in Figure 2. As expected, the regret of MT-OGD increases linearly with σ, as suggested by the upper bound Equation (14) and the lower bound in Proposition 3. For the sake of comparison, we also benchmark IT-OGD and report its best average performance.
The switch in performance occurs slightly before σ = 1, which is also expected: aside from the dependence on σ, the bounds for MT-OGD scale with √(2T) (instead of √T for IT-OGD), and are thus slightly worse at σ = 1.

Figure 2: Multitask regret R_T of MT-OGD and IT-OGD for T = 40, against the standard deviation σ.
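For reproducibility, the data-generating process of the experiment can be sketched as follows (our own code; the paired reference-vector recipe of Appendix A.4 is used here as a stand-in for the construction cited in the text, and the exact sampling details are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T, sigma = 4, 2, 40, 0.5

# reference vectors with the prescribed task variance, built as pairs
# u0 -/+ sqrt((N-1)/N) * sigma * e_i (cf. Appendix A.4)
u0 = rng.standard_normal(d)
u0 *= np.sqrt(max(0.0, 1 - sigma**2)) / np.linalg.norm(u0)
U = np.stack([u0 + s * np.sqrt((N - 1) / N) * sigma * e
              for e in np.eye(d) for s in (-1.0, 1.0)])

# the pairs cancel, so the task mean is u0 and the task variance is sigma^2
var2 = ((U - U.mean(axis=0)) ** 2).sum() / (N - 1)
assert np.allclose(U.mean(axis=0), u0) and np.isclose(var2, sigma**2)

X, Y, tasks = [], [], []
for t in range(T):
    i = t % N                                            # cyclic task activations
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                               # instance on the unit sphere
    eps = np.clip(rng.normal(0.0, np.sqrt(0.1)), -1, 1)  # truncated Gaussian noise
    X.append(x); Y.append(U[i] @ x + eps); tasks.append(i)
```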