# online_multitask_learning_with_longterm_memory__96a03550.pdf

Online Multitask Learning with Long-Term Memory

Mark Herbster, Stephen Pasteris, Lisa Tse

Department of Computer Science

University College London

London United Kingdom (m.herbster|s.pasteris|l.tse)@cs.ucl.ac.uk

We introduce a novel online multitask setting. In this setting each task is partitioned into a sequence of segments that is unknown to the learner. Associated with each segment is a hypothesis from some hypothesis class. We give algorithms that are designed to exploit the scenario where there are many such segments but signiﬁcantly fewer associated hypotheses. We prove regret bounds that hold for any segmentation of the tasks and any association of hypotheses to the segments. In the single-task setting this is equivalent to switching with long-term memory in the sense of [1]. We provide an algorithm that predicts on each trial in time linear in the number of hypotheses when the hypothesis class is ﬁnite. We also consider inﬁnite hypothesis classes from reproducing kernel Hilbert spaces for which we give an algorithm whose per trial time complexity is cubic in the number of cumulative trials. In the single-task special case this is the ﬁrst example of an efﬁcient regret-bounded switching algorithm with long-term memory for a non-parametric hypothesis class.

1 Introduction

We consider a model of online prediction in a non-stationary environment with multiple interrelated tasks. Associated with each task is an asynchronous data stream. As an example, consider a scenario where a team of drones may need to decontaminate an area of toxic waste. In this example, the tasks correspond to drones. Each drone is receiving a data stream from its sensors. The data streams are non-stationary but interdependent as the drones are travelling within a common site. At any point in time, a drone receives an instance x and is required to predict its label y. The aim is to minimize mispredictions. As is standard in regret-bounded learning we have no statistical assumptions on the data-generation process. Instead, we aim to predict well relative to some hypothesis class of predictors. Unlike a standard regret model, where we aim to predict well in comparison to a single hypothesis, we instead aim to predict well relative to a completely unknown sequence of hypotheses in each task s data stream, as illustrated by the coloring in Figure 1. Each mode (color) corresponds to a distinct hypothesis from the hypothesis class. A switch is said to have occurred whenever we move between modes temporally within the same task.

Thus in task 1, there are three modes and four switches. We are particularly motivated by the case that a mode once present will possibly recur multiple times even within different tasks, i.e., modes switches. We will give algorithms and regret bounds for ﬁnite hypothesis classes

(the experts model [2, 3, 4]) and for inﬁnite non-parametric Reproducing Kernel Hilbert Space (RKHS) [5] hypothesis classes.

The paper is organized as follows. In the next section, we introduce our formal model for online switching multitask learning. In doing so we provide a brief review of some related online learning results which enable us to provide a prospectus for attainable regret bounds. This is done by

34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada.

Figure 1: A Coloring of Data Streams (5 tasks, 6 modes, and 11 switches).

For = 1 to T do

Receive task 2 [s] . Set i ; t σ( ) . Receive instance x xi

t 2 X . Predict ˆy ˆyi

t 2 { 1, 1} . Receive label y yi

t 2 { 1, 1} . Incur Loss L01(y , ˆy ) .

Figure 2: The Switching Multitask Model

considering the bounds achievable by non-polynomial time algorithms. We then provide a brief survey of related work as well as our notational conventions. In Sections 3 and 4 we provide algorithms and bounds for ﬁnite hypothesis classes and RKHS hypothesis classes, respectively. Finally, we provide a few concluding remarks in Section 5. The supplementary appendices contain our proofs.

2 Online Learning with Switching, Memory, and Multiple Tasks

We review the models and regret bounds for online learning in the single-task, switching, and switching with memory models as background for our multitask switching model with memory.

In the single-task online model a learner receives data sequentially so that on a trial t = 1, . . . , T: 1) the learner receives an instance xt 2 X from the environment, then 2) predicts a label ˆyt 2 { 1, 1},

then 3) receives a label from the environment yt 2 { 1, 1} and then 4) incurs a zero-one loss L01(yt, ˆyt) := [yt 6= ˆyt]. There are no probabilistic assumptions on how the environment generates its instances or their labels; it is an arbitrary process which in fact may be adversarial. The only restriction on the environment is that it does not see the learner s ˆyt until after it reveals yt. The learner s aim will be to compete with a hypothesis class of predictors H { 1, 1}X so as to minimize its expected regret, RT (h) := PT

t=1 E[L01(yt, ˆyt)] L01(yt, h(xt)) for every hypothesis h 2 H, where the expectation is with respect to the learner s internal randomization.

In this paper we will consider two types of hypothesis classes: a ﬁnite set of hypotheses Hﬁn, and a set HK induced by a kernel K. A multiplicative weight (MW) algorithm [6] that achieves a regret bound1 of the form

log(|Hﬁn|)T

(8h 2 Hﬁn) (1)

was given in [7] for ﬁnite hypothesis classes. This is a special case of the framework of prediction with expert advice introduced in [2, 3]. Given a reproducing kernel K : X X ! < we denote the induced norm of the reproducing kernel Hilbert space (RKHS) HK as k k K (for details on RKHS see [5] also Appendix C.2). Given an instance sequence x := (x1, . . . , x T ), we let H(x)

K := {h 2 HK : h(xt) 2 { 1, 1}, 8t 2 [T]} denote the functions in HK that are binary-valued on the sequence. An analysis of online gradient descent (OGDK) with the hinge loss, kernel K and randomized prediction [8, see e.g., Ch. 2 & 3] (proof included in Appendix C.3 for completeness) gives an expected regret bound of

K maxt2[T ] K(xt, xt).

In the switching single-task model the hypothesis becomes a sequence of hypotheses h = (h1, h2, . . . , h T ) 2 HT and the regret is RT (h) := PT

t=1 E[L01(yt, ˆyt)] L01(yt, ht(xt)). Two parameters of interest are the number of switches k := PT 1

t=1 [ht 6= ht+1] and the number of modes m := | [T

t=1 {ht}|, i.e., the number of the distinct hypotheses in the sequence. In this work we are

1Technically, when we say that an algorithm achieves a bound, it may be that the algorithm depends on a small set of parameters which we have then assumed are tuned optimally.

interested in long-term memory, that is, algorithms and bounds that are designed to exploit the case of m k.

The methodology of [9] may be used to derive an expected regret bound for Hﬁn in the switching single-task model of the form RT (h) 2 O(

(k log(|Hﬁn|) + k log(T/k))T). Freund in [10] posed an open problem to improve the results of [9] in the case of long-term memory (m k). Freund gave counting arguments that led to an exponential-time algorithm with a regret bound of RT (h) 2 O(

(m log(|Hﬁn|) + k log m + k log(T/k))T). In [1] an efﬁcient algorithm was given with nearly this bound, except for a small additional additive T log log T term under the square root. For the hypothesis class H(x)

K we may give non-memory bounds of the form RT (h) 2

k maxt khtk2

KT) by using a simple modiﬁcation [11] of OGDK (see Appendix C.3). To

the best of our knowledge there are no previous long-term memory bounds for H(x)

K (however see the discussion of [12] in Section 2.2); these will be a special case of our multitask model, to be introduced next.

2.1 Switching Multitask Model

In Figure 2 we illustrate the protocol for our multitask model. The model is essentially the same as the switching single-task model, except that we now have s tasks. On each (global) trial the environment reveals the active task 2 [s]. The ordering of tasks chosen by the environment is arbitrary, and therefore the active task may change on every (global) trial . We use the following notational convention: (global time) i

t (local time) where i = , t = σ( ) and σ( ) := P

j=1[ j = ]. Thus x xi

t, etc., where the mapping is determined implicitly by the task vector 2 [s]T . Each task i 2 [s] has its own data pair (instance, label) sequence (xi

1), . . . , (xi

T i) where T = T 1 + . . . + T s. The multitask hypotheses multiset is denoted as h = (h1, . . . , h T ) (h1

1, . . . , h1

T 1, . . . , hs

1, . . . , hs

T s) 2 HT . In the multitask model we denote the number of switches

as k(h ) := Ps

t+1], the set of modes as m(h ) := [s

multitask regret as RT (h ) := Ps

t=1 E[L01(yi

t)). In the following, we give motivating upper bounds based on exponential-time algorithms induced by meta-experts. We provide a lower bound with respect to H(x)

K in Proposition 4.

The idea of meta-experts is to take the base class of hypotheses and to construct a class of meta-hypotheses by combining the original hypotheses to form new ones, and then apply an MW

algorithm to the constructed class; in other words, we reduce the meta-model to the base-model. In our setting, the base class is Hﬁn { 1, 1}X and our meta-hypothesis class will be some H0 { 1, 1}X 0 where X 0 := {(x, t, i) : x 2 X, t 2 [T i], i 2 [s]}. To construct this set we deﬁne

H(k, m, s, Hﬁn, T 1, . . . , T s) := {(h1

1, . . . , hs

T s) = h 2 HT

ﬁn : k = k( h), m = |m( h)|} and then observe that for each h 2 H we may deﬁne an h0 : X 0 ! { 1, 1} via h0((x, t, i)) := hi

t(x), where hi

t is an element of h. We thus construct H0 by converting each h 2 H to an h0 2 H0. Hence we have reduced the switching multitask model to the single-task model with respect to H0. We proceed to obtain a bound by observing that the cardinality of H is bounded above by

where n = |Hﬁn|. If we then substitute into (1) and then further upper bound we have

RT (h ) 2 O

(m log(n/m) + s log m + k log m + k log((T s)/k))T

for any h 2 HT

ﬁn such that k = k(h ) and m = |m(h )|. The drawback is that the algorithm requires exponential time. In Section 3 we will give an algorithm whose time to predict per trial is O(|Hﬁn|) and whose bound is equivalent up to constant factors.

We cannot directly adapt the above argument to obtain an algorithm and bound for H(x)

K since the cardinality, in general, is inﬁnite, and additionally we do not know x in advance. However, the structure of the argument is the same. Instead of using hypotheses from H(x)

K as building blocks to construct meta-hypotheses, we use multiple instantiations of an online algorithm for H(x)

K as our building blocks. We let AK := {a[1], . . . , a[m]} denote our set of m instantiations that will act as a surrogate for the hypothesis class H(x)

K . We then construct the set, AK(k, m, s, T 1, . . . , T s) := { a 2 AT

K : k = k( a), m = |m( a)|}. Each a 2 AK now deﬁnes a meta-algorithm for the multitask

setting. That is, given an online multitask data sequence (xi

1), . . . , (xj

T j), each element of a will color the corresponding data pair with one of the m instantiations (we will use the function : {(t, i) : t 2 [T i], i 2 [s]} ! [m] to denote this mapping with respect to a). Each instantiation will receive as inputs only the online sequence of the data pairs corresponding to its color ; likewise, the prediction of meta-algorithm a will be that of the instantiation active on that trial. We will use as our base algorithm OGDK. Thus for the meta-algorithm a we have from (2),

for any received instance sequence x 2 X T and for any h[1], . . . , h[m] 2 H(x)

K . The MW algorithm [3, 2, 4] does not work just for hypothesis classes; more generally, it works for collections of algorithms. Hence we may run the MW as a meta-meta-algorithm to combine all of the metaalgorithms a 2 AK. Thus by substituting the loss for each meta-algorithm a (the R.H.S. of (4)) into (1) and using the upper bound

ms(m 1)k for the cardinality of AK, we obtain (using upper bounds for binomial coefﬁcients and the inequality P

RT (h ) 2 O

h2m(h ) khk2

K + s log m + k log m + k log((T s)/k))T

for any received instance sequence x 2 X T and for any h 2 H(x)

T such that k = k(h ) and m = |m(h )|.

The terms m log(n/m) (assuming m n) and P

h2m(h ) khk2

K may be viewed as learner complexities, i.e., the price we pay for identifying the hypotheses that ﬁt the modes. A salient feature of long-term memory bounds is that although the data pairs associated with each hypothesis are intermixed in the multitask sequence, we pay the learner complexity only modestly in terms of potentially leading multiplicative constants. A switching algorithm without long-term memory forgets and pays the full price for a mode on every switch or new task. We gave exponential-time

algorithms for Hﬁn and H(x)

K with O(1) leading multiplicative constants in the discussion leading to (3) and (5). We give efﬁcient algorithms for ﬁnite hypothesis classes and RKHS hypothesis classes in Sections 3 and 4, with time complexities of O(n) and O(T 3) per trial, and in terms of learner complexities they gain only leading multiplicative constants of O(1) and O(log T).

2.2 Related Work

In this section we brieﬂy describe other related work in the online setting that considers either switching or multitask models.

The ﬁrst result for switching in the experts model was the WML algorithm [3] which was generalized in [9]. There is an extensive literature building on those papers, with some prominent results including [1, 13, 14, 15, 16, 17, 15, 18, 19, 20, 21, 22]. Relevant for our model are those papers [1, 14, 17, 15, 20, 21, 22] that address the problem of long-term memory (m k), in particular [1, 14, 17].

Analogous to the problem of long-term memory in online learning is the problem of catastrophic forgetting in artiﬁcial neural network research [23, 24]. That is the problem of how a system can adapt to new information without forgetting the old. In online learning that is the problem of how an algorithm can both quickly adapt its prediction hypothesis and recall a previously successful prediction hypothesis when needed. In the experts model this problem was ﬁrst addressed by [1], which gave an algorithm that stores each of its past state vectors, and then at each update mixes these vectors into the current state vector. In [14], an algorithm and bounds were given that extended the base comparision class of experts to include Bernoulli models. An improved algorithm with a Bayesian intepretation based on the idea of circadian specialists was given for this setting in [17]. Our construction of Algorithm 1 was based on this methodology.

The problem of linear regression with long term memory was posed as an open problem in [17, Sec. 5]. Algorithm 2 gives an algorithm for linear interpolation in a RKHS with a regret bound that reﬂects long-term memory. Switching linear prediction has been considered in [11, 25, 26, 12]. Only [12] addresses the issue of long-term memory. The methodology of [12] is a direct inspiration for Algorithm 2. We signiﬁcantly extend the result of [12, Eq. (1)]. Their result was i) restricted to a

mistake as opposed to a regret bound, ii) restricted to ﬁnite positive deﬁnite matrices and iii) in their mistake bound the term analogous to P

h2m(h ) khk2

K was increased by a multiplicative factor of O(|m(h )|), a signiﬁcantly weaker result.

Multitask learning has been considered extensively in the batch setting, with some prominent early results including [27, 28, 29]. In the online multitask expert setting [30, 31, 32, 17] considered a model which may be seen as a special case of ours where each task is associated only with a single hypothesis, i.e., no internal switching within a task. Also in the expert setting [33, 34] considered models where the prediction was made for all tasks simultaneously. In [34] the aim was to predict well relative to a set of possible predeﬁned task interrelationships and in [33] the interrelationships were to be discovered algorithmically. The online multitask linear prediction setting was considered in [35, 36, 37]. The models of [36, 37] are similar to ours, but like previous work in the expert setting, these models are limited to one hypothesis per task. In the work of [35], the predictions were made for all tasks simultaneously through a joint loss function.

2.3 Preliminaries

For any positive integer m, we deﬁne [m] := {1, 2, . . . , m}. For any predicate [PRED] := 1 if PRED is true and equals 0 otherwise, and for any x 2 <, [x]+ := x[x > 0]. We denote the inner product of vectors as both x, w 2 <n as hx, wi = x w = Pn

i=1 xiwi, component-wise multiplication x w := (x1w1, . . . , xnwn) and the norm as kwk =

hw, wi. If f : < ! < and x 2 <n then f(x) := (f(x1), . . . , f(xn)). The xth-coordinate vector is denoted ex

X := ([x = z])z2X; we commonly abbreviate this to ex. We denote the probability simplex as H := {h 2 [0, 1]H} \ {h : P

h2H = 1} and set n := [n]. We denote the binary entropy as H(p) := p log 1

p +(1 p) log 1 1 p. If v 2 H then h v denotes that h is a random sample from the probability vector v over the set H. For vectors p 2 <m and q 2 <n we deﬁne [p; q] 2 <m+n to be the concatenation of p and q, which we regard as a column vector. Hence [p; q]

>[ p; q] = p> p + q> q.

The notation M + and

M denotes the pseudo-inverse and the unique positive square root, respectively, of a positive semi-deﬁnite matrix M. The trace of a square matrix is denoted by tr(Y ) := Pn

i=1 Yii for Y 2 <n n. The m m identity matrix is denoted Im. A function K : X X ! < is a strictly positive deﬁnite (SPD) kernel iff for every ﬁnite X X the matrix K(x, x0)x,x02X is symmetric and strictly positive deﬁnite, for example, the Gaussian kernel. In addition, we deﬁne Sm to be the set of m m symmetric matrices and let Sm

++ be the subset of positive semideﬁnite and strictly positive deﬁnite matrices, respectively. We deﬁne the squared radius of M 2 Sm

+ as RM := maxi2[m] M +

ii . The (undirected) graph Laplacian matrix is deﬁned by L := D A where D is the degree matrix and A is the adjacency matrix. The corresponding (strictly) positive deﬁnite PDLaplacian of an m-vertex connected graph is L := L + R 1

3 Finite Hypothesis Classes

In this section we present the algorithm and the regret bound for ﬁnite hypothesis classes, with proofs given in Appendix A. The design and analysis of the algorithm is inspired by [17], which considers a Bayesian setting where, on each trial, each hypothesis h gives an estimated probability P(y = ˆy|h) of the outcome y . The idea is for the learner to predict a probability ˆP(y = ˆy) and the loss incurred is the log loss, log(1/ ˆP(y )). Our algorithm, on the other hand, is framed in the well known Allocation setting [38] where the learner must play, on trial , a vector v 2 n and incurs a loss of c v where all components of c are in [0, 1].

To gain some intuition about the algorithm we observe the following. The algorithm maintains and updates the following vectors: a global probability vector 2 Hfin and the local task weight vectors w1

t , . . . , ws

t 2 [0, 1]Hfin. Given an hypothesis h 2 Hﬁn, the scalar

h represents our conﬁdence , on trial , that hypothesis h is in m(h ). For a given task i, hypothesis h 2 Hﬁn, and

local time t, the scalar wi

t,h represents our conﬁdence that h = hi

t if we knew that h was in m(h ). Putting together,

σ( ),h represents our conﬁdence, on trial , that h = hi

t. The weights and wi

t (for tasks i) are designed in such a way that, not only do they store all the information required by the algorithm, but also on each trial we need only update and w

t . Thus the algorithm

Algorithm 1 Predicting Hﬁn in a switching multitask setting.

Parameters: Hﬁn { 1, 1}X ; s, m, k, T 2 N

Initialization: n := |Hﬁn|; 1 1

n; µ := 1 m; w1

1 µ1 ; := 1 k T s; φ := k (m 1)(T s) and

+ (m 1)(T s)H

k (m 1)(T s)

2 T For = 1, . . . , T

Receive task 2 [s] . Receive x 2 X . Set i ; t σ( ) . Predict

, ˆh v , ˆy ˆh (x ) .

Receive y 2 { 1, 1} . Update:

i) 8h 2 Hﬁn, c

h = L01(h(x ), y ) ii) δ wi

iii) β ( wi

t)/( δ) iv) 1 wi

v) +1 vi) wi

t+1 (φ(1 wi

predicts in O(n) time per trial and requires O(sn) space. We bound the regret of the algorithm in the following theorem. Theorem 1. The expected regret of Algorithm 1 with parameters Hﬁn { 1, 1}X ; s, m, k, T 2 N and

+ (m 1)(T s)H

k (m 1)(T s)

is bounded above by

for any h 2 HT

ﬁn such that k = k(h ), m |m(h )|, m > 1. Furthermore,

+ s(log(m) + 1) + k

log(m 1) + 2 log

In further comparison to [17] we observe that we can obtain bounds for the log loss with our algorithm by deﬁning ˆP(y = ˆy) := P

h P(y = ˆy|h) and redeﬁning c

log(P(y = ˆy|h)) in the update. The resultant theorem then matches the bound of [17, Thm. 4] for single-task learning with long-term memory (s = 1) and the bound of [17, Thm. 6] for multitask learning with no switching (k = 0).

4 RKHS Hypothesis Classes

Our algorithm and its analysis builds on the algorithm for online inductive matrix completion with side-information (IMCSI) from [39, Theorem 1, Algorithm 2 and Proposition 4]. IMCSI is an example of a matrix multiplicative weight algorithm [40, 6]. We give notation and background from [39] to provide insight.

The max-norm (or γ2 norm [41]) of a matrix U 2 <m n is deﬁned by

k Ukmax := min P Q>=U

max 1 i m k Pik max

1 j n k Qjk

where the minimum is over all matrices P 2 <m d and Q 2 <n d and every integer d. We denote the class of m d row-normalized matrices as N m,d := { ˆP <m d :

--- = 1, i 2 [m]}. The quasi-dimension of a matrix is deﬁned as follows.

Deﬁnition 2 ([39, Equation (3)]). The quasi-dimension of a matrix U 2 <m n with respect to M 2 Sm

M,N(U) := min ˆ P ˆ Q>=γU

where the inﬁmum is over all row-normalized matrices ˆP 2 N m,d and ˆQ 2 N n,d and every integer d. If the inﬁmum does not exist then Dγ

M,N(U) := +1 (The inﬁmum exists iff k Ukmax 1/γ).

The algorithm IMCSI addresses the problem of the online prediction of a binary comparator matrix U with side information. The side information is supplied as a pair of kernels over the row indices and the

column indices. In [39, Theorem 1] a regret bound O(

( b D/γ2)T) is given, where 1/γ2 k Uk2

max and b D Dγ

M,N(U) are parameters of the algorithm that serve as upper estimates on k Uk2

M,N(U). The ﬁrst estimate 1/γ2 is an upper bound on the squared max-norm (Eq. (6)) which like

the trace-norm may be seen as a proxy for the rank of the matrix [42]. The second estimate b D is an upper bound of the quasi-dimension (Eq. (7)) which measures the quality of the side-information. The quasi-dimension depends upon the best factorization (1/γ) ˆP ˆQ> = U, which will be smaller when the row ˆP (column ˆQ) factors are in congruence with the row (column) kernel. We bound the quasi-dimension in Theorem 47 in Appendix B as a key step to proving Theorem 3.

In the reduction of our problem to a matrix completion problem with side information, the row indices correspond to the domain of the learner-supplied kernel K and the column indices correspond to the temporal dimension. On each trial we receive an x (a.k.a. xi

t). Thus the column of the comparator matrix (now H) corresponding to time will contain the entries H = (h (xυ))υ2[T ]. Although we are predicting functions that are changing over time, the underlying assumption is that the change is sporadic; otherwise it is infeasible to prove a non-vacuous bound. Thus we expect Hi

t+1 and as such our column side-information kernel should reﬂect this expectation. Topologically we would therefore expect a kernel to present as s separate time paths, where nearness in time is nearness on the path. In the following we introduce the path-tree-kernel (the essence of the construction was ﬁrst introduced in [43]), which satisﬁes this expectation in the single-task case. We then adapt this construction to the multitask setting.

A path-tree kernel P : [T] [T] ! <, is formed via the Laplacian of a fully complete binary tree with N := 2dlog2 T e+1 1 vertices. The path corresponds to the ﬁrst T leaves of the tree, numbered sequentially from the leftmost to the rightmost leaf of the ﬁrst T leaves. Denote this Laplacian as L where the path is identiﬁed with [T] and the remaining vertices are identiﬁed with [N] \ [T]. Then using the deﬁnition L := L +

L we deﬁne P( , υ) := (L )+

υ where , υ 2 [T]. We extend the path-tree kernel to a multitask-path-tree kernel by dividing the path into s contiguous segments, where segment i is a path of length T i, and the task vector 2 [s]T determines the mapping from global trial to task and local trial σ( ). We deﬁne P ,T 1,...,T s : [T] [T] ! < as P ,T 1,...,T s( , υ) := P

i=1 T i + σ( ), P υ 1

i=1 T i + σ(υ)

. Observe we do not need to know the task vector in advance; we only require upper bounds on the lengths of the tasks to be able to use this kernel. Finally, we note that it is perhaps surprising that we use a tree rather than a path directly. We discuss this issue following Lemma 49 in Appendix B.

Algorithm 2 requires O(t3) time per trial t since we need to compute the eigendecomposition of three O(t) O(t) matrices as well as sum O(t) O(t) matrices up to t times. We bound the regret of the algorithm as follows. Theorem 3. The expected regret of Algorithm 2 with upper estimates, k k(h ), m |m(h )|,

ˆC C(h ) :=

K + 2(s + k 1)mdlog2 Te2 + 2m2

K max 2[T ] K(x , x ), and learning rate =

ˆ C log(2T )

2T m is bounded by

2 ˆC T log(2T) (8)

with received instance sequence x 2 X T and for any h 2 H(x)

Algorithm 2 Predicting H(x)

K in a switching multitask setting.

Parameters: Tasks s 2 N, task lengths T 1, . . . , T s 2 N, T := Ps

i=1 T i, learning rate: > 0, complexity estimate: ˆC > 0, modes: m 2 [T], SPD Kernel K : X X ! <, P := P ,T 1,...,T s : [T] [T] ! <, with max 2[T ] K(x , x ) ˆ X2

K, and ˆ X2

P := 2dlog2 Te.

Initialization: U ; , X 1 ; , T 1 ; . For = 1, . . . , T

Receive task 2 [s] . Receive x 2 X . Set i ; t σ( ); xi

t x . Deﬁne

K := (K(x, z))x,z2X [{x } ; P := ( P( , υ)) ,υ2T [{ } ,

I|X |+|T |+2 +

Y UNIFORM( γ, γ) ; y tr

t := ˆy sign( y Y ) .

Receive label yi

t := y 2 { 1, 1} . If y y 1 pm then

U U [ {t} , X +1 X [ {x }, and T +1 T [ { } .

Else X +1 X and T +1 T .

Comparing roughly to the bound of the exponential-time algorithm (see (5)), we see that the log m term has been replaced by an m term and we have gained a multiplicative factor of log 2T. From the perspective of long-term memory, we note that the potentially dominant learner complexity term P

h2m(h ) khk2

K has only increased by a slight log 2T term. To gain more insight into the problem we also have the following simple lower bound. Proposition 4. For any (randomized) algorithm and any s, k, m, Γ 2 N, with k + s m > 1 and Γ m log2 m, there exists a kernel K and a T0 2 N such that for every T T0:

E[L01(y , ˆy )] L01(y , h (x )) 2

(Γ + s log m + k log m) T

for some multitask sequence (x1, y1), . . . , (x T , y T ) 2 (X { 1, 1})T and some h 2 [H(x)

such that m |m(h )|, k k(h ), P

h2m(h ) khk2

K |m(h )| log2 m, where X2

K = max 2[T ] K(x , x ).

Comparing the above proposition to the bound of the exponential-time algorithm (see (5)), the most striking difference is the absence of the log T terms. We conjecture that these terms are not necessary for the 0-1 loss. A proof of Theorem 3 and a proof sketch of Proposition 4 are given in Appendix B.

5 Discussion

We have presented a novel multitask setting which generalizes single-task switching under the longterm memory setting. We gave algorithms for ﬁnite hypothesis classes and for RKHS hypothesis classes with per trial prediction times of O(n) and O(T 3). We proved upper bounds on the regret for both cases as well as a lower bound in the RKHS case. An open problem is to resolve the gap in the RKHS case. On the algorithmic side, both algorithms depend on a number of parameters. There is extensive research in online learning methods to design parameter-free methods. Can some of these methods be applied here (see e.g., [44])? For a non-parametric hypothesis class, intuitively it

seems we must expect some time complexity dependence on T. However can we perhaps utilize decay methods such as [45, 46] or sketching methods [47] that have had success in simpler models to improve running times? More broadly, for what other inﬁnite hypothesis classes can we give efﬁcient regret-bounded algorithms in this switching multitask setting with long-term memory?

6 Acknowledgements

This research was supported by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the ofﬁcial policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. This research was further supported by the Engineering and Physical Sciences Research Council grant number EP/L015242/1.

Broader Impact

In general this work does not present any speciﬁc foreseeable societal consequence in the authors joint opinion.

This is foundational research in regret-bounded online learning. As such it is not targeted towards any particular application area. Although this research may have societal impact for good or for ill in the future, we cannot foresee the shape and the extent.

Funding Transparency Statement

The authors were supported by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under agreement number W911NF-16-3-0001 and by the Engineering and Physical Sciences Research Council grant number EP/L015242/1.

Competing Interests

The authors assert no competing interests.

[1] O. Bousquet and M.K. Warmuth. Tracking a small set of experts by mixing past posteriors.

Journal of Machine Learning Research, 3:363 396, 2003.

[2] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on

Computational Learning Theory, COLT 90, pages 371 386, 1990.

[3] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput.,

108(2):212 261, February 1994.

[4] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[5] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337 404, 1950.

[6] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a

meta-algorithm and applications. Theory of Computing, 8(6):121 164, 2012.

[7] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire,

and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427 485, May 1997.

[8] S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and

Trends R in Machine Learning, 4(2):107 194, 2011.

[9] M. Herbster and M.K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151 178,

[10] Y. Freund. Private communication, 2000. Also posted on http://www.learning-theory.org.

[11] M. Herbster and M.K. Warmuth. Tracking the best linear predictor. Journal of Machine

Learning Research, 1:281 309, 2001.

[12] M. Herbster, S. Pasteris, and S. Pontil. Predicting a switching sequence of graph labelings.

Journal of Machine Learning Research, 16:2003 2022, 2015.

[13] A. György, T. Linder, and G. Lugosi. Tracking the best of many experts. In Proceedings 18th

Annual Conference on Learning Theory, pages 204 216, 2005.

[14] Wouter M. Koolen and Tim van Erven. Freezing and sleeping: Tracking experts that learn by

evolving past posteriors, 2010.

[15] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets ﬁxed share (and

feels no regret). In Advances in Neural Information Processing Systems 24, pages 989 997, 2012.

[16] A. György, T. Linder, and G. Lugosi. Efﬁcient tracking of large classes of experts. IEEE

Transactions on Information Theory, 58(11):6709 6725, Nov 2012.

[17] Wouter M. Koolen, Dmitry Adamskiy, and Manfred K. Warmuth. Putting bayes to sleep. In

Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS 12, pages 135 143, Red Hook, NY, USA, 2012. Curran Associates Inc.

[18] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In

Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT 12, pages 290 304, 2012.

[19] A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings

of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pages 1405 1411, 2015.

[20] Jaouad Mourtada and Odalric-Ambrym Maillard. Efﬁcient tracking of a growing number of

experts. In Proceedings of the 28th International Conference on Algorithmic Learning Theory (ALT), volume 76 of Proceedings of Machine Learning Research, pages 517 539, 2017.

[21] Maria-Florina Balcan, Travis Dick, and Dravyansh Sharma. Online optimization of piecewise

lipschitz functions in changing environments. Co RR, abs/1907.09137, 2019.

[22] Kai Zheng, Haipeng Luo, Ilias Diakonikolas, and Liwei Wang. Equipping experts/bandits with

long-term memory. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5929 5939. Curran Associates, Inc., 2019.

[23] Michael Mccloskey and Neil J. Cohen. Catastrophic interference in connectionist networks:

The sequential learning problem. The Psychology of Learning and Motivation, 24:104 169, 1989.

[24] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive

Sciences, 3(4):128 135, 1999.

[25] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions

on Signal Processing, 52:2165 2176, 2004.

[26] N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron.

In Proceedings of the 18th Conference on Learning Theory, pages 483 498, 2006.

[27] Jonathan Baxter. Learning internal representations. In Proceedings of the Eighth Annual

Conference on Computational Learning Theory, COLT ?95, page 311?320, New York, NY, USA, 1995. Association for Computing Machinery.

[28] Rich Caruana. Multitask learning. Machine Learning, 28(1):41 75, 1997.

[29] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi task learning. In Proceedings

of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109 117, New York, NY, USA, 2004. Association for Computing Machinery.

[30] J. Abernethy, P. Bartlett, and A. Rakhlin. Multitask learning with expert advice. In Proceedings

20th Annual Conference on Learning Theory, pages 484 498, 2007.

[31] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques

for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, Oct 2008.

[32] S. Avishek, R. Piyush, H. Daumé III, and S. Venkatasubramanian. Online learning of multiple

tasks and their relationships. In Proceedings of the 14th International Conference on Artiﬁcial Intelligence and Statistics, pages 643 651, 2011.

[33] Alexander Rakhlin, Jacob D. Abernethy, and Peter L. Bartlett. Online discovery of similarity

mappings. In Zoubin Ghahramani, editor, Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, volume 227 of ACM International Conference Proceeding Series, pages 767 774. ACM, 2007.

[34] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with

hard constraints. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.

[35] O. Dekel, P.M. Long, and Y. Singer. Online learning of multiple tasks with a shared loss.

Journal of Machine Learning Research, 8(10):2233 2264, 2007.

[36] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classiﬁ-

cation. Journal of Machine Learning Research, 1:2901 2934, 2010.

[37] Christoph Hirnschall, Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Coordinated

online learning with applications to learning user preferences, 2017.

[38] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning

and an application to boosting. J. Comput. Syst. Sci., 55(1):119?139, August 1997.

[39] Mark Herbster, Stephen Pasteris, and Lisa Tse. Online matrix completion with side information.

In Advances in Neural Information Processing Systems 33. 2020.

[40] K. Tsuda, G. Rätsch, and M.K. Warmuth. Matrix exponentiated gradient updates for on-line

learning and bregman projection. Journal of Machine Learning Research, 6:995 1018, 2005.

[41] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign

matrices. Combinatorica, 27(4):439 463, 2007.

[42] Jason D Lee, Ben Recht, Nathan Srebro, Joel Tropp, and Russ R Salakhutdinov. Practical large-

scale optimization for max-norm regularization. In J. D. Lafferty, C. K. I. Williams, J. Shawe Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1297 1305. Curran Associates, Inc., 2010.

[43] M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Advances

in Neural Information Processing Systems 21, pages 649 656, 2008.

[44] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS16, pages 577 585, Red Hook, NY, USA, 2016. Curran Associates Inc. [45] Ofer Dekel, Shai Shalev-shwartz, and Yoram Singer. The forgetron: A kernel-based perceptron

on a ﬁxed budget. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 259 266. MIT Press, 2006. [46] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane

with a simple budget perceptron. Mach. Learn., 69(2-3):143 167, 2007. [47] Yair Carmon, John C. Duchi, Aaron Sidford, and Kevin Tian. A rank-1 sketch for matrix

multiplicative weights. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 589 623. PMLR, 2019. [48] Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. Using and combin-

ing predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, STOC 97, pages 334 343, New York, NY, USA, 1997. Association for

Computing Machinery. [49] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural

Information Processing Systems 19, pages 577 584, 2006. [50] Douglas Klein and Milan Randic. Resistance distance. Journal of Mathematical Chemistry,

12:81 95, 12 1993. [51] M. Herbster, M. Pontil, and S. Rojas-Galeano. Fast prediction on a tree. In Advances in Neural

Information Processing Systems, pages 657 664, 2009. [52] A.B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on

the Mathematical Theory of Automata, pages 615 622, 1962. [53] J. Forster, N. Schmitt, and H.U. Simon. Estimating the optimal margins of embeddings in

euclidean half spaces. In Proceedings Computational Learning Theory, pages 402 415, 2001. [54] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold

algorithm. Machine Learning, 2:285 318, April 1988. [55] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT 2009

- The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009,

2009. [56] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and

Other Kernel-based Learning Methods. Cambridge University Press, 2000. [57] Martin Zinkevich. Online Convex Programming and Generalized Inﬁnitesimal Gradient Ascent.

In Proceedings, Twentieth International Conference on Machine Learning, volume 2, pages 928 935, 2003. [58] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth. Worst-case quadratic loss bounds for on-line

prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7(3):604 619, 1996. Earlier version in 6th COLT, 1993.