# Online Learning with Transductive Regret

Mehryar Mohri
Courant Institute and Google Research, New York, NY
mohri@cims.nyu.edu

Scott Yang†
D. E. Shaw & Co., New York, NY
yangs@cims.nyu.edu

†Work done at the Courant Institute of Mathematical Sciences.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## Abstract

We study online learning with the general notion of transductive regret, that is, regret with respect to modification rules applying to expert sequences (as opposed to single experts) that are representable by weighted finite-state transducers. We show how transductive regret generalizes existing notions of regret, including: (1) external regret; (2) internal regret; (3) swap regret; and (4) conditional swap regret. We present a general and efficient online learning algorithm for minimizing transductive regret. We further extend it to design efficient algorithms for the time-selection and sleeping expert settings. A by-product of our study is an algorithm for swap regret that, under mild assumptions, is more efficient than existing ones, as well as a substantially more efficient algorithm for time-selection swap regret.

## 1 Introduction

Online learning is a general framework for sequential prediction. Within that framework, a widely adopted setting is that of prediction with expert advice [Littlestone and Warmuth, 1994, Cesa-Bianchi and Lugosi, 2006], where the algorithm maintains a distribution over a set of experts. At each round, the loss assigned to each expert is revealed. The algorithm then incurs the expected value of these losses for its current distribution and next updates its distribution. The standard benchmark for the algorithm in this scenario is the external regret, that is, the difference between its cumulative loss and that of the best (static) expert in hindsight. However, while this benchmark is useful in a variety of contexts and has led to the design of numerous effective online learning algorithms, it may not constitute a useful criterion in common cases where no single fixed expert performs well over the full course of the algorithm's interaction with the environment. This has led to several extensions of the notion of external regret, along two main directions.

The first is an extension of the notion of regret so that the learner's algorithm is compared against a competitor class consisting of dynamic sequences of experts. Research in this direction started with the work of Herbster and Warmuth [1998] on tracking the best expert, who studied the scenario of learning against the best sequence of experts with at most k switches. This work has been subsequently improved [Monteleoni and Jaakkola, 2003], generalized [Vovk, 1999, Cesa-Bianchi et al., 2012, Koolen and de Rooij, 2013], and modified [Hazan and Seshadhri, 2009, Adamskiy et al., 2012, Daniely et al., 2015]. More recently, an efficient algorithm with favorable regret guarantees has been given for the general case of a competitor class consisting of sequences of experts represented by a (weighted) finite automaton [Mohri and Yang, 2017, 2018]. This includes as special cases the previous competitor classes considered in the literature.

The second direction is to consider competitor classes based on modifications of the learner's sequence of actions. This approach began with the notion of internal regret [Foster and Vohra, 1997, Hart and Mas-Colell, 2000], which considers how much better an algorithm could have performed if it had
switched all instances of playing one action with another. This was subsequently generalized to the notion of swap regret [Blum and Mansour, 2007], which considers all possible in-time modifications of a learner's action sequence. More recently, Mohri and Yang [2014] introduced the notion of conditional swap regret, which considers all possible modifications of a learner's action sequence that depend on some fixed bounded history. Odalric and Munos [2011] also studied regret against history-dependent modifications and presented computationally tractable algorithms (with suboptimal regret guarantees) when the comparator class can be organized into a small number of equivalence classes.

In this paper, we consider the second direction and study regret with respect to modification rules. We first present an efficient online algorithm for minimizing swap regret (Section 3). We then introduce the notion of transductive regret in Section 4, that is, the regret of the learner's algorithm with respect to modification rules representable by a family of weighted finite-state transducers (WFSTs). This definition generalizes the existing notions of external, internal, swap, and conditional swap regret, and includes modification rules that apply to expert sequences, as opposed to single experts. Moreover, we present efficient algorithms for minimizing transductive regret. We further extend transductive regret to the time-selection setting (Section 5) and present efficient algorithms for minimizing time-selection transductive regret. These algorithms significantly improve upon existing state-of-the-art algorithms in the special case of time-selection swap regret. Finally, in Section 6, we extend transductive regret to the sleeping experts setting and present new and efficient algorithms for minimizing sleeping transductive regret.

## 2 Preliminaries and notation

We consider the setting of prediction with expert advice with a set $\Sigma$ of $N$ experts. At each round $t \in [T]$, an online algorithm $A$ selects a distribution $p_t$ over $\Sigma$, the adversary reveals a loss vector $l_t \in [0, 1]^N$, where $l_t(x)$ is the loss of expert $x \in \Sigma$, and the algorithm incurs the expected loss $p_t \cdot l_t$. Let $\Phi$ denote a set of modification functions mapping the expert set to itself. The objective of the algorithm is to minimize its $\Phi$-regret, $\mathrm{Reg}_T(A, \Phi)$, defined as the difference between its cumulative expected loss and that of the best modification of its sequence in hindsight:

$$\mathrm{Reg}_T(A, \Phi) = \max_{\varphi \in \Phi} \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}[l_t(\varphi(x_t))].$$

This definition coincides with the standard notion of external regret [Cesa-Bianchi and Lugosi, 2006] when $\Phi$ is reduced to the family of constant functions, $\Phi_{\mathrm{ext}} = \{\varphi_a \colon \Sigma \to \Sigma : a \in \Sigma,\ \forall x \in \Sigma,\ \varphi_a(x) = a\}$; with the notion of internal regret [Foster and Vohra, 1997] when $\Phi$ is the family of functions that only switch two actions, $\Phi_{\mathrm{int}} = \{\varphi_{a,b} \colon \Sigma \to \Sigma : a, b \in \Sigma,\ \varphi_{a,b}(x) = b\,1_{x=a} + a\,1_{x=b} + x\,1_{x \neq a,b}\}$; and with the notion of swap regret [Blum and Mansour, 2007] when $\Phi$ consists of all possible functions mapping $\Sigma$ to itself, $\Phi_{\mathrm{swap}} = \Sigma^\Sigma$. In Section 4, we will introduce a more general notion of regret with modification rules applying to expert sequences, as opposed to single experts.

There are known algorithms achieving an external regret in $O(\sqrt{T \log N})$ with a per-iteration computational cost in $O(N)$ [Cesa-Bianchi and Lugosi, 2006], an internal regret in $O(\sqrt{T \log N})$ with a per-iteration computational cost in $O(N^3)$ [Stoltz and Lugosi, 2005], and a swap regret in $O(\sqrt{TN \log N})$ with a per-iteration computational cost in $O(N^3)$ [Blum and Mansour, 2007].
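As a concrete illustration of these $\Phi$-regret notions (our addition, not part of the original paper; the function names are ours), the following Python sketch computes the empirical external and swap regrets of a sequence of distributions against observed losses. For the swap family, the maximizing $\varphi$ decomposes per expert, which is what makes the computation tractable:

```python
import numpy as np

def external_regret(P, L):
    """Empirical external regret of distributions P (T x N) under losses L (T x N):
    the algorithm's cumulative expected loss minus that of the best fixed expert."""
    algo_loss = np.sum(P * L)          # sum_t <p_t, l_t>
    return algo_loss - L.sum(axis=0).min()

def swap_regret(P, L):
    """Empirical swap regret: the best phi in Sigma^Sigma can be chosen
    coordinate-wise, routing each expert i's mass to the expert j minimizing
    the i-weighted cumulative loss sum_t p_t[i] * l_t[j]."""
    algo_loss = np.sum(P * L)
    weighted = P.T @ L                 # weighted[i, j] = sum_t p_t[i] * l_t[j]
    return algo_loss - weighted.min(axis=1).sum()

# Example: T = 1000 rounds, N = 5 experts, uniform play.
# P = np.full((1000, 5), 0.2); L = np.random.rand(1000, 5)
# print(external_regret(P, L), swap_regret(P, L))
```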
## 3 Efficient online algorithm for swap regret

In this section, we present an online algorithm, FASTSWAP, that achieves the same swap regret guarantee as the algorithm of Blum and Mansour [2007], $O(\sqrt{TN \log N})$, but admits the more favorable per-iteration complexity of $O(N^2 \log T)$, under some mild assumptions.

Existing online algorithms for internal or swap regret minimization require, at each round, solving for a fixed point of an $N \times N$ stochastic matrix [Foster and Vohra, 1997, Stoltz and Lugosi, 2005, Blum and Mansour, 2007]. For example, the algorithm of Blum and Mansour [2007] is based on a meta-algorithm $A$ that makes use of $N$ external regret minimization sub-algorithms $\{A_i\}_{i \in [N]}$ (see Figure 1). Sub-algorithm $A_i$ is specialized in guaranteeing low regret against swapping expert $i$ with any other expert $j$. The meta-algorithm $A$ maintains a distribution $p_t$ over the experts and, at each round $t$, assigns to sub-algorithm $A_i$ only a fraction of the loss, $p_{t,i}\, l_t$, and receives the distribution $q_i$ (over the experts) returned by $A_i$.

[Figure 1: Illustration of the swap regret algorithm of Blum and Mansour [2007] or the FASTSWAP algorithm, which use a meta-algorithm to control a set of N external regret minimizing algorithms.]

At each round $t$, the distribution $p_t$ is selected to be the fixed point of the $N \times N$ stochastic matrix $Q_t = [q_1 \cdots q_N]^\top$. Thus, $p_t = p_t Q_t$ is the stationary distribution of the Markov process defined by $Q_t$. This choice of the distribution is natural to ensure that the learner's sequence of actions is competitive against a family of modifications, since it is invariant under a mapping that relates to this family of modifications. The computation of a fixed point involves solving a linear system of equations; thus, the per-round complexity of these algorithms is in $O(N^3)$ using standard methods (or $O(N^{2.373})$ using the method of Coppersmith and Winograd). To improve upon this complexity in the setting of internal regret, Greenwald et al. [2008] estimate the fixed point by applying, at each round, a single power iteration to some stochastic matrix. Their algorithm runs in $O(N^2)$ time per iteration, but at the price of a weaker regret guarantee.

Here, we describe an efficient algorithm for swap regret, FASTSWAP. Algorithm 1 gives its pseudocode. As with the algorithm of Blum and Mansour [2007], FASTSWAP is based on a meta-algorithm $A$ making use of $N$ external regret minimization sub-algorithms $\{A_i\}_{i \in [N]}$. However, unlike the algorithm of Blum and Mansour [2007], which explicitly computes the stationary distribution of $Q_t$ at round $t$, or that of Greenwald et al. [2008], which applies a single power iteration at each round, our algorithm applies multiple modified power iterations at round $t$ ($\tau_t$ power iterations). Our modified power iterations are based on the REDUCEDPOWERMETHOD (RPM) algorithm introduced by Nesterov and Nemirovski [2015]. Unlike the algorithm of Greenwald et al. [2008], FASTSWAP uses a specific initial distribution at each round, applies the power method to a modification of the original stochastic matrix, and uses, as an approximation, an average of all the iterates at that round.

Algorithm 1: FASTSWAP$((A_i)_{i=1}^N)$, where $A_1, \ldots, A_N$ are external regret minimization algorithms.

```
for t ← 1 to T:
    for i ← 1 to N: q_i ← QUERY(A_i)
    Q_t ← [q_1 ⋯ q_N]^⊤
    for j ← 1 to N: c_j ← min_{i∈[N]} (Q_t)_{i,j}
    if τ_t < N:
        p_t^0 ← c;  p_t ← c
        for τ ← 1 to τ_t:
            p_t^τ ← (p_t^{τ−1})^⊤ (Q_t − 1⃗ c^⊤);  p_t ← p_t + p_t^τ
        p_t ← p_t / ‖p_t‖_1
    else:
        p_t ← FIXED-POINT(Q_t)
    x_t ← SAMPLE(p_t);  l_t ← RECEIVELOSS()
    for i ← 1 to N: ATTRIBUTELOSS(p_t[i] l_t, A_i)
```
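To make the fixed-point approximation concrete, here is a short NumPy sketch (ours, not the paper's code) of the reduced power iterations in Algorithm 1, following the Reduced Power Method of Nesterov and Nemirovski [2015]. Subtracting the rank-one matrix $\vec{1}c^\top$, with $c_j = \min_i (Q_t)_{i,j}$, makes the iteration a contraction whenever $\beta = \sum_j c_j > 0$, and summing the iterates recovers the stationary distribution, since $\pi$ satisfies $\pi(I - R) = c$ for $R = Q - \vec{1}c^\top$, hence $\pi = \sum_{k \ge 0} c R^k$:

```python
import numpy as np

def rpm_stationary(Q, num_iters):
    """Approximate the stationary distribution p = pQ of a row-stochastic
    matrix Q via reduced power iterations (a sketch of Algorithm 1's inner loop).

    Assumes beta = sum_j min_i Q[i, j] > 0, as in Theorem 1; the truncation
    error then decays geometrically, like (1 - beta)^num_iters.
    """
    n = Q.shape[0]
    c = Q.min(axis=0)                  # c_j = min_i Q[i, j]
    R = Q - np.outer(np.ones(n), c)    # reduced matrix Q - 1 c^T; row sums 1 - beta
    p_iter = c.copy()                  # start from c, as in the RPM
    p_sum = c.copy()
    for _ in range(num_iters):
        p_iter = p_iter @ R            # one reduced power iteration
        p_sum += p_iter                # accumulate the partial series sum_k c R^k
    return p_sum / p_sum.sum()         # renormalize the truncated series

# Sanity check against the exact fixed point:
# Q = np.random.dirichlet(np.ones(50), size=50)   # random 50x50 stochastic matrix
# p = rpm_stationary(Q, 500)
# assert np.allclose(p, p @ Q, atol=1e-3)
```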
**Theorem 1.** *Let $A_1, \ldots, A_N$ be external regret minimizing algorithms admitting data-dependent regret bounds of the form $O(\sqrt{L_T(A_i) \log N})$, where $L_T(A_i)$ is the cumulative loss of $A_i$ after $T$ rounds. Assume that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant $\beta > 0$. Then, FASTSWAP achieves a swap regret in $O(\sqrt{TN \log N})$ with a per-iteration complexity in $O\big(N^2 \min\{\log T / \log(1/(1-\beta)), N\}\big)$.*

[Figure 2: (i) Example of a WFST T: $I_T = 0$, $\mathrm{ilab}[E_T[0]] = \{a, b\}$, $\mathrm{olab}[E_T[1]] = \{b\}$, $E_T[2] = \{(0, a, b, 1, 1), (0, b, a, 1, 1)\}$. (ii) Family of swap WFSTs $T_\varphi$, with $\varphi \colon \{a, b, c\} \to \{a, b, c\}$. (iii) A more general example of a WFST, over investment actions such as Apple, IBM, gold, silver, and sell, with weighted transitions such as Apple:IBM/0.4, gold:silver/0.4, and sell:Apple/0.7.]

The proof is given in Appendix D. It is based on a stability analysis bounding the additive regret term due to using an approximation of the fixed-point distribution, and on the property that $\tau_t$ iterations of the reduced power method ensure a $1/\sqrt{t}$-approximation, where $t$ is the number of rounds. The favorable complexity of our algorithm requires an assumption on the sum of the minimal probabilities assigned to an expert by the sub-algorithms at each round. This is a reasonable assumption, which one would expect to hold in practice if all the external regret minimizing sub-algorithms are the same, since the true losses assigned to each column of the stochastic matrix are then the same and the rescaling based on the distribution $p_t$ is uniform over each row. Furthermore, since the number of iterations sufficient for a good approximation can be efficiently estimated, our algorithm can determine when it is worthwhile to switch to standard fixed-point methods, that is, when the condition $\tau_t > N$ holds. Thus, the time complexity of our algorithm is never worse than that of Blum and Mansour [2007].

## 4 Online algorithm for transductive regret

In this section, we consider a more general notion of regret than swap regret, where the family of modification functions applies to sequences instead of just to single experts. We will consider sequence-to-sequence mappings that can be represented by finite-state transducers. In fact, more generally, we will allow weights to be used for these mappings and will consider weighted finite-state transducers. This will lead us to define the notion of transductive regret, where the cumulative loss of an algorithm's sequence of actions is compared to that of the sequences that are images of its action sequence via a transducer mapping. As we shall see, this is an extremely flexible definition that admits as special cases the standard notions of external, internal, and swap regret. We will start with some preliminary definitions and concepts related to transducers.

### 4.1 Weighted finite-state transducer definitions

A weighted finite-state transducer (WFST) T is a finite automaton whose transitions are augmented with an output label and a real-valued weight, in addition to the familiar input label. Figure 2(i) shows a simple example. We will assume both input and output labels to be elements of the alphabet $\Sigma$, which denotes the set of experts.
$\Sigma^*$ denotes the set of all strings over the alphabet $\Sigma$. We denote by $E_T$ the set of transitions of T and, for any transition $e \in E_T$, we denote by $\mathrm{ilab}[e]$ its input label, by $\mathrm{olab}[e]$ its output label, and by $w[e]$ its weight. For any state $u$ of T, we denote by $E_T[u]$ the set of transitions leaving $u$. We also extend the definition of ilab to sets and denote by $\mathrm{ilab}[E_T[u]]$ the set of input labels of the transitions in $E_T[u]$. We assume that T admits a single initial state, which we denote by $I_T$. For any state $u$ and string $x \in \Sigma^*$, we denote by $\delta_T(u, x)$ the set of states reached from $u$ by reading string $x$ as input; in particular, $\delta_T(I_T, x)$ denotes the set of states reached from the initial state by reading string $x$ as input. The input (or output) label of a path is obtained by concatenating the input (output) transition labels along that path. The weight of a path is obtained by multiplying its transition weights. A path from the initial state to a final state is called an accepting path. A WFST maps the input label of each accepting path to its output label, with the weight of that path.

The WFSTs we consider may be non-deterministic, that is, they may admit states with multiple outgoing transitions sharing the same input label. However, we will assume that, at any state, outgoing transitions sharing the same input label admit the same destination state. We will further require that, at any state, the set of output labels of the outgoing transitions be contained in the set of input labels of the same transitions. This requirement is natural for our definition of regret: our learner will use input-label experts and will compete against sequences of output-label experts; thus, the algorithm should have the option of selecting any expert sequence it must compete against. Finally, we will assume that our WFSTs are stochastic, that is, for any state $u$ and input label $a \in \Sigma$, we have $\sum_{e \in E_T[u, a]} w[e] = 1$, where $E_T[u, a]$ denotes the set of transitions leaving $u$ with input label $a$. The class of WFSTs thereby defined is broad and, as we shall see, includes the families defining external, internal, and swap regret.

### 4.2 Transductive regret

Given any WFST T, let $\mathcal{T}$ be a family of WFSTs with the same alphabet $\Sigma$, the same set of states $Q$, the same initial state $I$ and final states $F$, but with different output labels and weights. Thus, we can write $I_{\mathcal{T}}$, $F_{\mathcal{T}}$, $Q_{\mathcal{T}}$, and $\delta_{\mathcal{T}}$ without any ambiguity. We will also use the notation $E_{\mathcal{T}}$ when we refer to the transitions of a transducer within the family $\mathcal{T}$ in a way that does not depend on the output labels or weights. We define the learner's transductive regret with respect to $\mathcal{T}$ as follows:

$$\mathrm{Reg}_T(A, \mathcal{T}) = \max_{\mathsf{T} \in \mathcal{T}} \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}\Bigg[\sum_{e \in E_{\mathsf{T}}[\delta_{\mathsf{T}}(I_{\mathsf{T}}, x_{1:t-1}),\, x_t]} w[e]\, l_t(\mathrm{olab}[e])\Bigg].$$

This measures the maximum difference between the expected loss of the sequence $x_1^T$ played by $A$ and the expected loss of a competitor sequence, that is, a sequence image of $x_1^T$ by some $\mathsf{T} \in \mathcal{T}$, where the expectation for competing sequences is both over the distributions $p_t$ and the transition weights $w[e]$ of $\mathsf{T}$.

We also assume that the family $\mathcal{T}$ does not admit proper non-empty invariant subsets of labels out of any state, i.e., for any state $u$, there exists no proper subset $E \subsetneq E_{\mathcal{T}}[u]$ for which the inclusion $\mathrm{olab}[E] \subseteq \mathrm{ilab}[E]$ holds for all $\mathsf{T} \in \mathcal{T}$. This is not a strict requirement, but it allows us to avoid cases of degenerate competitor classes.

As an example, consider the family of WFSTs $T_a$, $a \in \Sigma$, with a single state $Q = I = F = \{0\}$ and with $T_a$ defined by self-loop transitions with all input labels $b \in \Sigma$ and the same output label $a$, with uniform weights. Thus, $T_a$ maps all labels to $a$. Then, the notion of transductive regret with $\mathcal{T} = \{T_a : a \in \Sigma\}$ coincides with that of external regret.
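The following Python sketch (our illustration; the class and method names are hypothetical) encodes the conventions above: transitions are tuples (source, input label, output label, weight, destination), transitions sharing a (state, input label) pair share a destination, and stochasticity means the weights over each $E_T[u, a]$ sum to one. The single-state family $\{T_a\}$ just described is a one-liner in this representation:

```python
from collections import defaultdict

class WFST:
    """A weighted finite-state transducer over an expert alphabet Sigma.

    Transitions are tuples (src, ilab, olab, weight, dst); as assumed in the
    text, transitions leaving a state with the same input label all share
    the same destination state.
    """
    def __init__(self, initial_state, transitions):
        self.initial = initial_state
        self.edges = defaultdict(list)        # (state, ilab) -> [(olab, w, dst)]
        for src, ilab, olab, w, dst in transitions:
            self.edges[(src, ilab)].append((olab, w, dst))

    def is_stochastic(self, tol=1e-9):
        """Check that the weights over E_T[u, a] sum to 1 for every (u, a)."""
        return all(abs(sum(w for _, w, _ in group) - 1.0) <= tol
                   for group in self.edges.values())

    def step(self, state, ilab):
        """The destination state delta_T(u, a), unique under our assumptions."""
        return self.edges[(state, ilab)][0][2]

    def output_distribution(self, state, ilab):
        """Pairs (olab, w) over the transitions E_T[u, a]."""
        return [(olab, w) for olab, w, _ in self.edges[(state, ilab)]]

def external_wfst(a, alphabet):
    """The single-state transducer T_a mapping every input label to a."""
    return WFST(0, [(0, x, a, 1.0, 0) for x in alphabet])

# T_gold = external_wfst("gold", ["Apple", "IBM", "gold", "silver"])
# assert T_gold.is_stochastic()
```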
Similarly, consider the family of WFSTs $T_\varphi$, $\varphi \colon \Sigma \to \Sigma$, with a single state $Q = I = F = \{0\}$ and with $T_\varphi$ defined by self-loop transitions with input label $a \in \Sigma$ and output label $\varphi(a)$, all weights uniform. Thus, $T_\varphi$ maps a symbol $a$ to $\varphi(a)$. Then, the notion of transductive regret with $\mathcal{T} = \{T_\varphi : \varphi \in \Sigma^\Sigma\}$ coincides with that of swap regret (see Figure 2(ii)). The more general notion of k-gram conditional swap regret presented in Mohri and Yang [2014] can also be modeled as transductive regret with respect to a family of WFSTs (k-gram WFSTs). We present additional figures illustrating all of these examples in Appendix A.

In general, it may be desirable to design WFSTs intended for a specific task, so that an algorithm is robust against some sequence modifications more than others. In fact, such WFSTs may have been learned from past data. The definition of transductive regret is flexible and can accommodate such settings, both because a transducer can conveniently help model mappings and because the transition weights help distinguish alternatives. For instance, consider a scenario where each action naturally admits a different swapping subset, which may be only a small subset of all actions. As an example, an investor may only be expected to pick the best strategy from within a similar class of strategies: instead of buying IBM, the investor could have bought Apple or Microsoft, and instead of buying gold, he could have bought silver or bronze. One can also imagine a setting where, along the sequences, some new alternatives become possible while others are excluded. Moreover, one may wish to assign different weights to some sequence modifications, or to penalize the investor for choosing strategies that are negatively correlated with recent choices. The algorithms in this work are flexible enough to accommodate these environments, which can be straightforwardly modeled by a WFST. We give a simple example in Figure 2(iii) and another illustration in Figure 5 in Appendix A, which can be easily generalized. Notice that, as we shall see later, in the case where the maximum out-degree of any state in the WFST (the size of the swapping subset) is bounded by a mild constant independent of the number of actions, our transductive regret bounds can be very favorable.
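Continuing the sketch above (again our illustration, with hypothetical helper names, building on the WFST class sketched earlier), the swap family of Figure 2(ii) is another single-state construction, and the competitor term in the transductive regret can be evaluated for a realized play sequence by walking the transducer and charging the weighted losses of the output labels:

```python
def swap_wfst(phi, alphabet):
    """The single-state transducer T_phi of Figure 2(ii): input a, output phi(a)."""
    return WFST(0, [(0, x, phi[x], 1.0, 0) for x in alphabet])

def transduced_loss(T, plays, losses):
    """Cumulative competitor loss for a realized play sequence: at round t,
    from state delta_T(I_T, x_{1:t-1}), charge sum_e w[e] * l_t(olab[e]) over
    the transitions with input label x_t, then advance the state."""
    state, total = T.initial, 0.0
    for x_t, l_t in zip(plays, losses):       # l_t: dict expert -> loss in [0, 1]
        total += sum(w * l_t[olab]
                     for olab, w in T.output_distribution(state, x_t))
        state = T.step(state, x_t)
    return total

# With T = swap_wfst({"a": "b", "b": "a", "c": "c"}, ["a", "b", "c"]), the
# transduced loss charges l_t("b") whenever "a" was played, and vice versa.
```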
### 4.3 Algorithm

We now present an algorithm, FASTTRANSDUCE, seeking to minimize the transductive regret given a family $\mathcal{T}$ of WFSTs. Our algorithm is an extension of FASTSWAP. As in that algorithm, a meta-algorithm is used that assigns partial losses to external regret minimization slave algorithms and combines the distributions it receives from these algorithms via multiple reduced power method iterations. The meta-algorithm tracks the state reached in the WFST and maintains a set of external regret minimizing algorithms that help the learner perform well at every state. Thus, here, we need one external regret minimization algorithm $A_{u,i}$ for each state $u$ reached at time $t$ after reading sequence $x_{1:t-1}$ and each $i \in \Sigma$ labeling an outgoing transition at $u$. The pseudocode of this algorithm is provided in Appendix B.

Let $|E_{\mathcal{T}}|_{\mathrm{in}}$ denote the sum over states of the number of transitions with distinct input labels at each state of $\mathcal{T}$, that is, $|E_{\mathcal{T}}|_{\mathrm{in}} = \sum_{u \in Q_{\mathcal{T}}} |\mathrm{ilab}[E_{\mathcal{T}}[u]]|$. $|E_{\mathcal{T}}|_{\mathrm{in}}$ is upper bounded by the total number of transitions $|E_{\mathcal{T}}|$. The following regret guarantee and computational complexity hold for FASTTRANSDUCE.

**Theorem 2.** *Let $(A_{u,i})_{u \in Q,\, i \in \mathrm{ilab}[E_{\mathcal{T}}[u]]}$ be external regret minimizing algorithms admitting data-dependent regret bounds of the form $O(\sqrt{L_T(A_{u,i}) \log N})$, where $L_T(A_{u,i})$ is the cumulative loss of $A_{u,i}$ after $T$ rounds. Assume that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant $\beta > 0$. Then, FASTTRANSDUCE achieves a transductive regret against $\mathcal{T}$ that is in $O(\sqrt{T |E_{\mathcal{T}}|_{\mathrm{in}} \log N})$, with a per-iteration complexity in $O\big(N^2 \min\{\log T / \log(1/(1-\beta)), N\}\big)$.*

The proof is given in Appendix E. The regret guarantee of FASTTRANSDUCE matches that of the swap regret algorithm of Blum and Mansour [2007] or FASTSWAP when $\mathcal{T}$ is chosen to be the family of swap transducers, and it matches the conditional k-gram swap regret guarantee of Mohri and Yang [2014] when $\mathcal{T}$ is chosen to be the family of k-gram swap transducers. Additionally, its computational complexity is typically more favorable than that of the algorithms previously presented in the literature when the assumption on $\beta$ holds, and it is never worse. Remarkably, the computational complexity of FASTTRANSDUCE is comparable to the cost of FASTSWAP, even though FASTTRANSDUCE is a regret minimization algorithm against an arbitrary family of finite-state transducers. This is because only the external regret minimizing algorithms that correspond to the current state need to be updated at each round.

## 5 Time-selection transductive regret

In this section, we extend the notion of time-selection functions with modification rules to the setting of transductive regret and present an algorithm that achieves the same regret guarantee as Khot and Ponnuswami [2008] in their specific setting, but with a substantially more favorable computational complexity.

Time-selection functions were first introduced by Lehrer [2003] as Boolean functions that determine which subset of times is relevant in the calculation of regret. This concept was relaxed to the real-valued setting by Blum and Mansour [2007], who considered time-selection functions taking values in $[0, 1]$. The authors introduced an algorithm which, for $K$ modification rules and $M$ time-selection functions, guarantees a regret in $O(\sqrt{TN \log(MK)})$ and admits a per-iteration complexity in $O(\max\{NKM, N^3\})$. For swap regret with time-selection functions, this corresponds to a regret bound of $O(\sqrt{TN^2 \log(MN)})$ and a per-iteration computational cost in $O(N^{N+1} M)$. Khot and Ponnuswami [2008] improved upon this result and presented an algorithm with a regret bound in $O(\sqrt{T \log(MK)})$ and a per-iteration computational cost in $O(\max\{MK, N^3\})$, which is still prohibitively expensive for swap regret, since it is in $O(N^N M)$.

Algorithm 2: FASTTIMESELECTTRANSDUCE$(\mathcal{I}, \mathcal{T}, A_{\mathcal{I}}, (A_{I,u,i})_{I \in \mathcal{I},\, u \in Q_{\mathcal{T}},\, i \in \mathrm{ilab}[E_{\mathcal{T}}[u]]})$, where $A_{\mathcal{I}}$ and the $A_{I,u,i}$ are external regret algorithms.

```
u ← I_𝒯
for t ← 1 to T:
    q ← QUERY(A_ℐ)                                  # distribution over ℐ
    Q_{t,u} ← 0;  Z_t ← 0
    for each I ∈ ℐ:
        for each i ∈ ilab[E_𝒯[u]]: q_{I,i} ← QUERY(A_{I,u,i})
        M_{t,u,I} ← [q_{I,1} 1_{1∈ilab[E_𝒯[u]]}; … ; q_{I,N} 1_{N∈ilab[E_𝒯[u]]}]
        Q_{t,u} ← Q_{t,u} + I(t) q[I] M_{t,u,I};  Z_t ← Z_t + I(t) q[I]
    Q_{t,u} ← Q_{t,u} / Z_t
    for j ← 1 to N: c_j ← min_{i∈ilab[E_𝒯[u]]} (Q_{t,u})_{i,j} · 1_{j∈ilab[E_𝒯[u]]}
    if τ_t < N:
        p_t^0 ← c;  p_t ← c
        for τ ← 1 to τ_t:
            p_t^τ ← (p_t^{τ−1})^⊤ (Q_{t,u} − 1⃗ c^⊤);  p_t ← p_t + p_t^τ
        p_t ← p_t / ‖p_t‖_1
    else:
        p_t ← FIXED-POINT(Q_{t,u})
    x_t ← SAMPLE(p_t);  l_t ← RECEIVELOSS()
    for each I ∈ ℐ:
        l̃_t[I] ← p_t^⊤ M_{t,u,I} l_t
        for each i ∈ ilab[E_𝒯[u]]: ATTRIBUTELOSS(A_{I,u,i}, p_t[i] I(t) l_t)
    ATTRIBUTELOSS(A_ℐ, l̃_t);  u ← δ_𝒯(u, x_t)
```
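As a small illustration of the combination step in Algorithm 2 (a sketch under our reading of the pseudocode; names are ours), the matrices contributed by the different time-selection functions are averaged with weights $I(t)\, q_t[I]$ and normalized by $Z_t$, which preserves row-stochasticity, so the same fixed-point approximation as before applies:

```python
import numpy as np

def combine_time_selection(M, q, I_t):
    """Form Q_{t,u} = (sum_I I(t) q_t[I] M_{t,u,I}) / Z_t.

    M:   dict id -> (N x N) row-stochastic matrix M_{t,u,I}
    q:   dict id -> weight q_t[I] from the meta algorithm A_I over I
    I_t: dict id -> the value I(t) of that time-selection function at round t
    """
    Z_t = sum(I_t[k] * q[k] for k in M)              # normalizer Z_t
    Q = sum(I_t[k] * q[k] * M[k] for k in M) / Z_t   # convex combination
    return Q, Z_t

# Since Q_{t,u} is a convex combination of row-stochastic matrices, it is
# itself row-stochastic, and rpm_stationary(Q, num_iters) applies unchanged.
```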
We now formally define the scenario of online learning with time-selection transductive regret. Let $\mathcal{I} \subseteq [0, 1]^{\mathbb{N}}$ be a family of time-selection functions. Each time-selection function $I \in \mathcal{I}$ determines the importance of the instantaneous regret at each round. The time-selection transductive regret is then defined as:

$$\mathrm{Reg}_T(A, \mathcal{I}, \mathcal{T}) = \max_{I \in \mathcal{I},\, \mathsf{T} \in \mathcal{T}} \sum_{t=1}^T I(t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^T I(t)\, \mathbb{E}_{x_t \sim p_t}\Bigg[\sum_{e \in E_{\mathsf{T}}[\delta_{\mathsf{T}}(I_{\mathsf{T}}, x_{1:t-1}),\, x_t]} w[e]\, l_t(\mathrm{olab}[e])\Bigg].$$

When the family of transducers admits a single state, this definition coincides with the notion of time-selection regret studied by Blum and Mansour [2007] and Khot and Ponnuswami [2008]. Time-selection transductive regret is a more difficult benchmark than transductive regret because the learner must account for only a subset of the rounds being relevant, in addition to playing a strategy that is robust against a large set of possible transductions.

To handle this scenario, we propose the following strategy. We maintain an external regret minimizing algorithm $A_{\mathcal{I}}$ over the set of time-selection functions. This algorithm will be responsible for ensuring that our strategy is competitive against the a posteriori optimal time-selection function. We also maintain $|\mathcal{I}||Q|N$ other external regret minimizing algorithms, $\{A_{I,u,i}\}_{I \in \mathcal{I},\, u \in Q_{\mathcal{T}},\, i \in \mathrm{ilab}[E_{\mathcal{T}}[u]]}$, which will ensure that our algorithm is robust against each of the modification rules and the potential transductions. We then use a meta-algorithm to assign appropriate surrogate losses to each of these external regret minimizing algorithms and to combine them to form a stochastic matrix. As in FASTTRANSDUCE, this meta-algorithm also approximates the stationary distribution of the matrix and uses that as the learner's strategy. We call this algorithm FASTTIMESELECTTRANSDUCE; its pseudocode is given in Algorithm 2.

**Theorem 3.** *Let $(A_{I,u,i})_{I \in \mathcal{I},\, u \in Q_{\mathcal{T}},\, i \in \mathrm{ilab}[E_{\mathcal{T}}[u]]}$ be external regret minimizing algorithms admitting data-dependent regret bounds of the form $O(\sqrt{L_T(A_{I,u,i}) \log N})$, where $L_T(A_{I,u,i})$ is the cumulative loss of $A_{I,u,i}$ after $T$ rounds. Let $A_{\mathcal{I}}$ be an external regret minimizing algorithm over $\mathcal{I}$ that admits a regret in $O(\sqrt{T \log |\mathcal{I}|})$ after $T$ rounds. Assume further that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant $\beta > 0$. Then, FASTTIMESELECTTRANSDUCE achieves a time-selection transductive regret with respect to the time-selection family $\mathcal{I}$ and WFST family $\mathcal{T}$ that is in $O\big(\sqrt{T (\log |\mathcal{I}| + |E_{\mathcal{T}}|_{\mathrm{in}} \log N)}\big)$, with a per-iteration complexity in $O\big(N^2 \min\{\log T / \log(1/(1-\beta)), N\}\big)$.*

In particular, Theorem 3 implies that FASTTIMESELECTTRANSDUCE achieves the same time-selection swap regret guarantee as the algorithm of Khot and Ponnuswami [2008], but with a per-round computational cost that is only in $O\big(N^2 \min\{\log T / \log(1/(1-\beta)), N\}\big)$, as opposed to $O(|\mathcal{I}| N^N)$: an exponential improvement. Notice that this significant improvement does not require any assumption on $\beta$ (it holds even for $\beta = 0$).

## 6 Sleeping transductive regret

The standard setting of prediction with expert advice can be extended to the sleeping experts scenario studied by Freund et al. [1997], where, at each round, a subset of the experts is asleep and thus unavailable to the learner. The sleeping experts setting has been used to model problems appearing in text categorization [Cohen and Singer, 1999], calendar scheduling [Blum, 1997], and learning how to formulate search-engine queries [Cohen and Singer, 1996].
The standard benchmark in this setting is the sleeping regret, that is, the difference between the cumulative expected loss of the learner and the cumulative expected loss of the best static distribution over the experts, restricted to and normalized over the set $A_t$ of awake experts at each round $t$:

$$\sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \min_{u} \sum_{t=1}^T \mathbb{E}_{x_t \sim u^{A_t}}[l_t(x_t)].$$

Here, for any distribution $p$, we use the notation $p^{A_t} = p|_{A_t} \big/ \sum_{i \in A_t} p_i$, with $p|_A(a) = p(a)\, 1_{a \in A}$ for any $a \in \Sigma$ and $A \subseteq \Sigma$. An alternative definition of sleeping regret, studied and bounded by Freund et al. [1997], is the following:

$$\sum_{t=1}^T u(A_t)\, \mathbb{E}_{x_t \sim p_t^{A_t}}[l_t(x_t)] - \sum_{t=1}^T \mathbb{E}_{x_t \sim u}[1_{x_t \in A_t}\, l_t(x_t)].$$

This is also the definition we adopt in our analysis. Note that if $u(A_t)$ does not vary with $t$, then the two definitions differ only by a multiplicative constant. By generalizing the results of Freund et al. [1997] to arbitrary losses, that is, beyond those that satisfy equation (6) in their paper, one can show that there exist algorithms with sleeping regret in

$$O\Bigg(\sqrt{\sum_{t=1}^T u^*(A_t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)]\, \log N}\Bigg), \tag{5}$$

where $u^*$ maximizes the expression to be bounded.

In this section, we extend this definition of sleeping regret to sleeping transductive regret, that is, the difference between the learner's cumulative expected loss and the cumulative expected loss of any transduction of the learner's actions among a family of finite-state transducers, where the weights of the transductions are normalized over the set of awake experts. The sleeping transductive regret can be expressed as follows:

$$\mathrm{Reg}_T(A, \mathcal{T}, (A_t)_{t=1}^T) = \max_{\mathsf{T} \in \mathcal{T},\, u} \sum_{t=1}^T u(A_t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}\Bigg[\sum_{e \in E_{\mathsf{T}}[\delta_{\mathsf{T}}(I_{\mathsf{T}}, x_{1:t-1}),\, x_t]} (u|_{A_t})(\mathrm{olab}[e])\, w[e]\, l_t(\mathrm{olab}[e])\Bigg].$$

When all experts are awake at every round, i.e., $A_t = \Sigma$, the sleeping transductive regret reduces to the standard transductive regret. When the family of transducers corresponds to that of swap regret, we recover a natural definition of sleeping swap regret:

$$\max_{\varphi \in \Phi_{\mathrm{swap}},\, u \in \Delta_N} \sum_{t=1}^T u(A_t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] - \sum_{t=1}^T \mathbb{E}_{x_t \sim p_t}\big[u(\varphi(x_t))\, 1_{\varphi(x_t) \in A_t}\, l_t(\varphi(x_t))\big].$$

We now present an efficient algorithm for minimizing sleeping transductive regret, FASTSLEEPTRANSDUCE. As with FASTTRANSDUCE, this algorithm uses a meta-algorithm with multiple regret minimizing sub-algorithms and a fixed-point approximation to compute the learner's strategy. However, since FASTSLEEPTRANSDUCE minimizes sleeping transductive regret, it uses sleeping regret minimizing sub-algorithms (i.e., those with regret guarantees of the form (5)). The meta-algorithm also designs a different stochastic matrix. The pseudocode of this algorithm is given in Appendix C.

**Theorem 4.** *Assume that the sleeping regret minimizing algorithms used as inputs to FASTSLEEPTRANSDUCE achieve data-dependent regret bounds such that, if an algorithm selects the distributions $(p_t)_{t=1}^T$ and observes losses $(l_t)_{t=1}^T$ with awake sets $(A_t)_{t=1}^T$, then its regret is in $O\big(\sqrt{\sum_{t=1}^T u^*(A_t)\, \mathbb{E}_{x_t \sim p_t}[l_t(x_t)] \log N}\big)$. Assume further that, at each round, the sum of the minimal probabilities given to an expert by these algorithms is bounded below by some constant $\beta > 0$. Then, the sleeping transductive regret $\mathrm{Reg}_T(\mathrm{FASTSLEEPTRANSDUCE}, \mathcal{T}, (A_t)_{t=1}^T)$ of FASTSLEEPTRANSDUCE is upper bounded by $O\big(\sqrt{\sum_{t=1}^T u(A_t)\, |E_{\mathcal{T}}|_{\mathrm{in}} \log N}\big)$. Moreover, FASTSLEEPTRANSDUCE admits a per-iteration complexity in $O\big(N^2 \min\{\log T / \log(1/(1-\beta)), N\}\big)$.*
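For concreteness, the awake-set restriction and the two sums appearing in the sleeping regret definitions of this section can be written as follows (our sketch; the helper names are illustrative, and $p_t$ is assumed to already be supported on the awake set):

```python
import numpy as np

def restrict_and_normalize(u, awake):
    """u^A = u|_A / sum_{i in A} u_i, for a distribution u over N experts and
    an awake set given as a boolean mask; u|_A(a) = u(a) * 1_{a in A}."""
    v = np.where(awake, u, 0.0)
    return v / v.sum()              # assumes u places some mass on awake experts

def sleeping_regret_for_u(u, P, L, masks):
    """The difference in Freund et al.'s definition for a fixed comparator u:
    sum_t u(A_t) E_{x~p_t}[l_t(x)]  -  sum_t E_{x~u}[1_{x in A_t} l_t(x)]."""
    learner = sum(u[a].sum() * (p @ l) for p, l, a in zip(P, L, masks))
    comparator = sum((u * a * l).sum() for l, a in zip(L, masks))
    return learner - comparator
```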
## 7 Experiments

In this section, we present some toy experiments illustrating the effectiveness of the Reduced Power Method for approximating the stationary distribution in FASTSWAP. We considered $n$ base learners, where $n \in \{40, 80, 120, 160, 200\}$, each using the weighted-majority algorithm [Littlestone and Warmuth, 1994]. We generated losses as i.i.d. normal random variables with means in $(0.1, 0.9)$ (chosen randomly) and standard deviation equal to $0.1$, capping the losses above and below so that they remain in $[0, 1]$. We ran FASTSWAP for 10,000 rounds in each simulation and repeated each simulation 16 times.

[Figure 3: Maximum values of $\tau_t$ and minimum values of $\beta$ in the FASTSWAP experiments. The vertical bars represent the standard deviation across 16 instantiations of the same simulation.]

The plot of the maximum $\tau_t$ for each simulation is shown in Figure 3. Across all simulations, the maximum $\tau_t$ attained was 4, so that at most 4 iterations of the RPM were needed on any given round to obtain a sufficient approximation. Thus, the per-iteration cost in these simulations was indeed in $\tilde{O}(N^2)$, an improvement over the $O(N^3)$ cost of prior work.

## 8 Conclusion

We introduced the notion of transductive regret, extended it to the time-selection and sleeping experts settings, and presented efficient online learning algorithms for all of these settings with sublinear transductive regret guarantees. We both generalized the existing theory and gave more efficient algorithms in existing subcases. The algorithms and results in this paper can be further extended to the case of fully non-deterministic weighted finite-state transducers.

## Acknowledgments

We thank Avrim Blum for informing us of an existing lower bound for swap regret proven by Auer [2017]. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

## References

D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In ALT, pages 290–304. Springer, 2012.

P. Auer. Personal communication, 2017.

A. Blum. Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997.

A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In NIPS, pages 980–988, 2012.

W. W. Cohen and Y. Singer. Learning to query the web. In AAAI Workshop on Internet-Based Information Systems, 1996.

W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141–173, 1999.

A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of ICML, pages 1405–1411, 2015.

D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40–55, 1997.

Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In STOC, pages 334–343. ACM, 1997.

A. Greenwald, Z. Li, and W. Schudy. More efficient internal-regret-minimizing algorithms. In COLT, pages 239–250, 2008.

S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

E. Hazan and S. Kale. Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In NIPS, pages 625–632, 2008.

E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of ICML, pages 393–400. ACM, 2009.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

S. Khot and A. K. Ponnuswami. Minimizing wide range regret with time selection functions.
In 21st Annual Conference on Learning Theory (COLT), 2008.

W. M. Koolen and S. de Rooij. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168–7185, 2013.

E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115, 2003.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

M. Mohri and S. Yang. Conditional swap regret and conditional correlated equilibrium. In NIPS, pages 1314–1322, 2014.

M. Mohri and S. Yang. Online learning with expert automata. arXiv:1705.00132, 2017. URL http://arxiv.org/abs/1705.00132.

M. Mohri and S. Yang. Competing with automata-based expert sequences. In AISTATS, 2018.

C. Monteleoni and T. S. Jaakkola. Online learning of non-stationary sequences. In NIPS, 2003.

Y. Nesterov and A. Nemirovski. Finding the stationary states of Markov chains by iterative methods. Applied Mathematics and Computation, 255:58–65, 2015.

N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory, volume 1. Cambridge University Press, Cambridge, 2007.

M. Odalric and R. Munos. Adaptive bandits: Towards the best history-dependent strategy. In AISTATS, pages 570–578, 2011.

G. Stoltz and G. Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59(1):125–159, 2005.

V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, 1999.