# Online Learning with Optimism and Delay

Genevieve Flaspohler 1 2, Francesco Orabona 3, Judah Cohen 4, Soukayna Mouatadid 5, Miruna Oprescu 6, Paulo Orenstein 7, Lester Mackey 6

1 Dept. of EECS, Massachusetts Institute of Technology; 2 Dept. of AOSE, Woods Hole Oceanographic Institution; 3 Dept. of ECE, Boston University; 4 Atmospheric and Environmental Research; 5 Dept. of CS, University of Toronto; 6 Microsoft Research New England; 7 Instituto de Matemática Pura e Aplicada. Correspondence to: Genevieve Flaspohler.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms DORM, DORM+, and AdaHedgeD arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.

1. Introduction

Online learning is a sequential decision-making paradigm in which a learner is pitted against a potentially adversarial environment (Shalev-Shwartz, 2007; Orabona, 2019). At time $t$, the learner must select a play $w_t$ from some set of possible plays $W$. The environment then reveals the loss function $\ell_t$, and the learner pays the cost $\ell_t(w_t)$. The learner uses information collected in previous rounds to improve its plays in subsequent rounds. Optimistic online learners additionally make use of side information or hints about expected future losses to improve their plays. Over a period of length $T$, the goal of the learner is to minimize regret, an objective that quantifies the performance gap between the learner and the best possible constant play in retrospect in some competitor set $U$:
$$\mathrm{Regret}_T = \sup_{u \in U} \sum_{t=1}^T \ell_t(w_t) - \ell_t(u).$$
Adversarial online learning algorithms provide robust performance in many complex real-world online prediction problems such as climate or weather forecasting.

In traditional online learning paradigms, the loss for round $t$ is revealed to the learner immediately at the end of round $t$. However, many real-world applications produce delayed feedback, i.e., the loss for round $t$ is not available until round $t + D$ for some delay period $D$.¹ Existing delayed online learning algorithms achieve optimal worst-case regret rates against adversarial loss sequences, but each has drawbacks when deployed for real applications with short horizons $T$. Some use only a small fraction of the data to train each learner (Weinberger & Ordentlich, 2002; Joulani et al., 2013); others tune their parameters using uniform bounds on future gradients that are often challenging to obtain or overly conservative in applications (McMahan & Streeter, 2014; Quanrud & Khashabi, 2015; Joulani et al., 2016; Korotin et al., 2020; Hsieh et al., 2020). Only the concurrent work of Hsieh et al. (2020, Thm. 13) can make use of optimistic hints, and only for the special case of unconstrained online gradient descent.
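To fix ideas, the delayed feedback protocol described above can be summarized by the following schematic loop. This is a minimal sketch under stated assumptions: the `learner` interface, the loss representation, and the constant delay $D$ are illustrative choices of ours, not the interface of any released library.

```python
# Schematic delayed online learning protocol with constant delay D:
# the learner chooses w_t every round but only sees the loss from round t - D.
def run_delayed_online_learning(learner, losses, D):
    """losses[t] is a callable returning (loss value, subgradient) at a play."""
    T = len(losses)
    revealed_grads, total_loss = [], 0.0
    for t in range(T):
        w_t = learner.play()           # play chosen before any round-t feedback
        value, grad = losses[t](w_t)   # environment evaluates the play at round t
        revealed_grads.append(grad)    # ... but the learner does not see it yet
        total_loss += value
        if t - D >= 0:                 # only the feedback from round t - D arrives now
            learner.update(revealed_grads[t - D])
    return total_loss
```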
In this work, we aim to develop robust and practical algorithms for real-world delayed online learning. To this end, we introduce three novel algorithms, DORM, DORM+, and AdaHedgeD, that use every observation to train the learner, have no parameters to tune, exhibit optimal worst-case regret rates under delay, and enjoy improved performance when accurate hints for unobserved losses are available. We begin by formulating delayed online learning as a special case of optimistic online learning and use this delay-as-optimism perspective to develop:

1. A formal reduction of delayed online learning to optimistic online learning (Lems. 1 and 2),
2. The first optimistic tuning-free and self-tuning algorithms with optimal regret guarantees under delay (DORM, DORM+, and AdaHedgeD),
3. A tightening of standard optimistic online learning regret bounds that reveals the robustness of optimistic algorithms to inaccurate hints (Thms. 3 and 4),
4. The first general analysis of follow-the-regularized-leader (Thms. 5 and 10) and online mirror descent algorithms (Thm. 6) with optimism and delay, and
5. The first meta-algorithm for learning a low-regret optimism strategy under delay (Thm. 13).

¹ Our initial presentation will assume constant delay $D$, but we provide extensions to variable and unbounded delays in App. O.

We validate our algorithms on the problem of subseasonal forecasting in Sec. 7. Subseasonal forecasting, the prediction of precipitation and temperature 2-6 weeks in advance, is a crucial task for allocating water resources and preparing for weather extremes (White et al., 2017). Subseasonal forecasting presents several challenges for online learning algorithms. First, real-time subseasonal forecasting suffers from delayed feedback: multiple forecasts are issued before receiving feedback on the first. Second, the regret horizons are short: a common evaluation period for semimonthly forecasting is one year, resulting in 26 total forecasts. Third, forecasters cannot have difficult-to-tune parameters in real-time, practical deployments. We demonstrate that our algorithms DORM, DORM+, and AdaHedgeD successfully overcome these challenges and achieve consistently low regret compared to the best forecasting models. Our Python library for Optimistic Online Learning under Delay (PoolD) and experiment code are available at https://github.com/geflaspohler/poold.

Notation  For integers $a, b$, we use the shorthand $[b] \triangleq \{1, \ldots, b\}$ and $g_{a:b} \triangleq \sum_{i=a}^{b} g_i$. We say a function $f$ is proper if it is somewhere finite and never $-\infty$. We let $\partial f(w) = \{g \in \mathbb{R}^d : f(u) \geq f(w) + \langle g, u - w\rangle, \ \forall u \in \mathbb{R}^d\}$ denote the set of subgradients of $f$ at $w \in \mathbb{R}^d$ and say $f$ is $\mu$-strongly convex over a convex set $W \subseteq \operatorname{int}\operatorname{dom} f$ with respect to $\|\cdot\|$ with dual norm $\|\cdot\|_*$ if, $\forall w, u \in W$ and $g \in \partial f(w)$, we have $f(u) \geq f(w) + \langle g, u - w\rangle + \frac{\mu}{2}\|w - u\|^2$. For differentiable $\psi$, we define the Bregman divergence $B_\psi(w, u) \triangleq \psi(w) - \psi(u) - \langle \nabla\psi(u), w - u\rangle$. We define $\operatorname{diam}(W) = \sup_{w, w' \in W}\|w - w'\|$, $(r)_+ \triangleq \max(r, 0)$, and $\min(r, s)_+ \triangleq (\min(r, s))_+$.
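As a concrete instance of these definitions (and of the negative-entropy regularizer that reappears in Sec. 5.2), consider $\psi(w) = \sum_{j=1}^d w_j\ln(w_j) + \ln(d)$ on the simplex $\triangle^{d-1}$. The following worked identities are standard facts rather than results of this paper:
$$\nabla\psi(u)_j = 1 + \ln(u_j), \qquad B_\psi(w, u) = \sum_{j=1}^d w_j\ln\frac{w_j}{u_j} \quad\text{for } w, u \in \triangle^{d-1},$$
so the Bregman divergence of the negative entropy is the KL divergence, and by Pinsker's inequality $\psi$ is 1-strongly convex with respect to $\|\cdot\|_1$ on the simplex, whose dual norm is $\|\cdot\|_\infty$.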
2. Preliminaries: Optimistic Online Learning

Standard online learning algorithms, such as follow the regularized leader (FTRL) and online mirror descent (OMD), achieve optimal worst-case regret against adversarial loss sequences (Orabona, 2019). However, many loss sequences encountered in applications are not truly adversarial. Optimistic online learning algorithms aim to improve performance when loss sequences are partially predictable, while remaining robust to adversarial sequences (see, e.g., Azoury & Warmuth, 2001; Chiang et al., 2012; Rakhlin & Sridharan, 2013b; Steinhardt & Liang, 2014). In optimistic online learning, the learner is provided with a hint in the form of a pseudo-loss $\tilde\ell_t$ at the start of round $t$ that represents a guess for the true unknown loss. The online learner can incorporate this hint before making play $w_t$. In standard formulations of optimistic online learning, the convex pseudo-loss $\tilde\ell_t(w_t)$ is added to the standard FTRL or OMD regularized objective function and leads to optimistic variants of these algorithms: optimistic FTRL (OFTRL, Rakhlin & Sridharan, 2013a) and single-step optimistic OMD (SOOMD, Joulani et al., 2017, Sec. 7.2). Let $\tilde g_t \in \partial\tilde\ell_t(w_{t-1})$ and $g_t \in \partial\ell_t(w_t)$ denote subgradients of the pseudo-loss and true loss respectively. The inclusion of an optimistic hint leads to the following linearized update rules for play $w_{t+1}$:
$$w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle g_{1:t} + \tilde g_{t+1}, w\rangle + \lambda\psi(w), \tag{OFTRL}$$
$$w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle g_t + \tilde g_{t+1} - \tilde g_t, w\rangle + B_{\lambda\psi}(w, w_t) \quad\text{with } \tilde g_0 = 0 \text{ and arbitrary } w_0, \tag{SOOMD}$$
where $\tilde g_{t+1} \in \mathbb{R}^d$ is the hint subgradient, $\lambda \geq 0$ is a regularization parameter, and $\psi$ is a proper regularization function that is 1-strongly convex with respect to a norm $\|\cdot\|$. The optimistic learner enjoys reduced regret whenever the hinting error $\|g_{t+1} - \tilde g_{t+1}\|_*$ is small (Rakhlin & Sridharan, 2013a; Joulani et al., 2017). Common choices of optimistic hints include the last observed subgradient or the average of previously observed subgradients (Rakhlin & Sridharan, 2013a). We note that the standard FTRL and OMD updates can be recovered by setting the optimistic hints to zero.

3. Online Learning with Optimism and Delay

In the delayed feedback setting with constant delay of length $D$, the learner only observes $(\ell_i)_{i=1}^{t-D}$ before making play $w_{t+1}$. In this setting, we propose counterparts of the OFTRL and SOOMD online learning algorithms, which we call optimistic delayed FTRL (ODFTRL) and delayed optimistic online mirror descent (DOOMD) respectively:
$$w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle g_{1:t-D} + h_{t+1}, w\rangle + \lambda\psi(w), \tag{ODFTRL}$$
$$w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle g_{t-D} + h_{t+1} - h_t, w\rangle + B_{\lambda\psi}(w, w_t) \quad\text{with } h_0 \triangleq 0 \text{ and arbitrary } w_0, \tag{DOOMD}$$
for hint vector $h_{t+1}$. Our use of the notation $h_{t+1}$ instead of $\tilde g_{t+1}$ for the optimistic hint here is suggestive. Our regret analysis in Thms. 5 and 6 reveals that, instead of hinting only for the future loss subgradient, delayed online learners should use hints $h_t$ that guess at the summed subgradients of all delayed and future losses: $h_t \approx \sum_{s=t-D}^{t} g_s$.

3.1. Delay as Optimism

To analyze the regret of the ODFTRL and DOOMD algorithms, we make use of the first key insight of this paper: learning with delay is a special case of learning with optimism. In particular, ODFTRL and DOOMD are instances of OFTRL and SOOMD respectively with a particularly bad choice of optimistic hint $\tilde g_{t+1}$ that deletes the unobserved loss subgradients $g_{t-D+1:t}$.

Lemma 1 (ODFTRL is OFTRL with a bad hint). ODFTRL is OFTRL with $\tilde g_{t+1} = h_{t+1} - \sum_{s=t-D+1}^{t} g_s$.

Lemma 2 (DOOMD is SOOMD with a bad hint). DOOMD is SOOMD with $\tilde g_{t+1} = \tilde g_t + g_{t-D} - g_t + h_{t+1} - h_t = h_{t+1} - \sum_{s=t-D+1}^{t} g_s$.

The implication of this reduction of delayed online learning to optimistic online learning is that any regret bound shown for undelayed OFTRL or SOOMD immediately yields a regret bound for ODFTRL and DOOMD under delay.
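For intuition, here is a minimal sketch of the ODFTRL update on the probability simplex with the negative-entropy regularizer, where the argmin has the familiar exponential-weights closed form. The specialization to this regularizer and the function names are ours; the general update allows any 1-strongly convex $\psi$.

```python
import numpy as np

def oftrl_negentropy(cum_grad, hint, lam):
    """OFTRL on the simplex with negative entropy:
    argmin_w <cum_grad + hint, w> + lam * sum_j w_j log w_j  =  softmax(-(cum_grad + hint)/lam)."""
    z = -(cum_grad + hint) / lam
    z = z - z.max()                    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def odftrl_negentropy(observed_grads, hint, lam):
    """ODFTRL (cf. Lem. 1): OFTRL fed only the observed subgradients g_{1:t-D},
    i.e. OFTRL run with the 'bad hint' that deletes the unobserved g_{t-D+1:t}."""
    cum_observed = np.sum(observed_grads, axis=0) if len(observed_grads) else 0.0
    return oftrl_negentropy(cum_observed, hint, lam)
```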
As we demonstrate in the remainder of the paper, this novel connection between delayed and optimistic online learning allows us to bound the regret of optimistic, self-tuning, and tuning-free algorithms for the first time under delay. Finally, it is worth reflecting on the key property of OFTRL and SOOMD that enables the delay-to-optimism reduction: each algorithm depends on $g_t$ and $\tilde g_{t+1}$ only through the sum $g_{1:t} + \tilde g_{t+1}$.² For the bad hints of Lems. 1 and 2, these sums are observable even though $g_t$ and $\tilde g_{t+1}$ are not separately observable at time $t$ due to delay.

² For SOOMD, $g_t + \tilde g_{t+1} - \tilde g_t = g_{1:t} + \tilde g_{t+1} - (g_{1:t-1} + \tilde g_t)$.

A number of alternatives to SOOMD have been proposed for optimistic OMD (Chiang et al., 2012; Rakhlin & Sridharan, 2013a;b; Kamalaruban, 2016). Unlike SOOMD, these procedures all incorporate optimism in two steps, as in the updates
$$w_{t+1/2} = \operatorname*{argmin}_{w \in W}\ \langle g_t, w\rangle + B_{\lambda\psi}(w, w_{t-1/2}) \quad\text{and}\quad w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle \tilde g_{t+1}, w\rangle + B_{\lambda\psi}(w, w_{t+1/2}) \tag{1}$$
described in Rakhlin & Sridharan (2013a, Sec. 2.2). It is unclear how to reduce delayed OMD to an instance of one of these two-step procedures, as knowledge of the unobserved $g_t$ is needed to carry out the first step.

3.2. Delayed and Optimistic Regret Bounds

To demonstrate the utility of our delay-as-optimism perspective, we first present the following new regret bounds for OFTRL and SOOMD, proved in Apps. B and C respectively.

Theorem 3 (OFTRL regret). If $\psi$ is nonnegative, then, for all $u \in W$, the OFTRL iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq \lambda\psi(u) + \frac{1}{\lambda}\sum_{t=1}^T \operatorname{huber}(\|g_t - \tilde g_t\|_*, \|g_t\|_*).$$

Theorem 4 (SOOMD regret). If $\psi$ is differentiable and $\tilde g_{T+1} \triangleq 0$, then, for all $u \in W$, the SOOMD iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq B_{\lambda\psi}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^T \operatorname{huber}(\|g_t - \tilde g_t\|_*, \|g_t + \tilde g_{t+1} - \tilde g_t\|_*).$$

Both results feature the robust Huber penalty (Huber, 1964)
$$\operatorname{huber}(x, y) \triangleq \tfrac{1}{2}x^2 - \tfrac{1}{2}(|x| - |y|)_+^2 \leq \min\big(\tfrac{1}{2}x^2,\ |x||y|\big)$$
in place of the more common squared error term $\frac{1}{2}\|g_t - \tilde g_t\|_*^2$. As a result, Thms. 3 and 4 strictly improve the rate-optimal OFTRL and SOOMD regret bounds of Rakhlin & Sridharan (2013a); Mohri & Yang (2016); Orabona (2019, Thm. 7.28) and Joulani et al. (2017, Sec. 7.2) by revealing a previously undocumented robustness to inaccurate hints $\tilde g_t$. We will use this robustness to large hint error $\|g_t - \tilde g_t\|_*$ to establish optimal regret bounds under delay. As an immediate consequence of this regret analysis and our delay-as-optimism perspective, we obtain the first general analyses of FTRL and OMD with optimism and delay.

Theorem 5 (ODFTRL regret). If $\psi$ is nonnegative, then, for all $u \in W$, the ODFTRL iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq \lambda\psi(u) + \frac{1}{\lambda}\sum_{t=1}^T b_{t,F} \quad\text{for}\quad b_{t,F} \triangleq \operatorname{huber}\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} g_s\big\|_*,\ \|g_t\|_*\Big).$$

Theorem 6 (DOOMD regret). If $\psi$ is differentiable and $h_{T+1} \triangleq g_{T-D+1:T}$, then, for all $u \in W$, the DOOMD iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq B_{\lambda\psi}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^T b_{t,O} \quad\text{for}\quad b_{t,O} \triangleq \operatorname{huber}\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} g_s\big\|_*,\ \|g_{t-D} + h_{t+1} - h_t\|_*\Big).$$

Our results show a compounding of regret due to delay: the $b_{t,F}$ term of Thm. 5 is of size $O(D+1)$ whenever $\|h_t\|_*$ is $O(D+1)$, and the same holds for $b_{t,O}$ of Thm. 6 if $\|h_{t+1} - h_t\|_*$ is $O(1)$. An optimal setting of $\lambda$ therefore delivers $O(\sqrt{(D+1)T})$ regret, yielding the minimax optimal rate for adversarial learning under delay (Weinberger & Ordentlich, 2002). Thms. 5 and 6 also reveal the heightened value of optimism in the presence of delay: in addition to providing an effective guess of the future subgradient $g_t$, an optimistic hint can approximate the missing delayed feedback ($\sum_{s=t-D}^{t-1} g_s$) and thereby significantly reduce the penalty of delay.
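To make the role of the huber penalty concrete, the short sketch below implements the form given above and contrasts it with the usual squared-error penalty; it is an illustration of the reconstructed formula, not library code.

```python
def huber(x, y):
    """Robust penalty from Thms. 3-6: (1/2)x^2 - (1/2)(|x| - |y|)_+^2  <=  min((1/2)x^2, |x||y|)."""
    return 0.5 * x**2 - 0.5 * max(abs(x) - abs(y), 0.0)**2

# With an accurate hint, the error x is small and huber behaves like (1/2)x^2;
# with a wildly inaccurate hint, the penalty is capped near |x||y| instead of growing as x^2.
hint_error, grad_norm = 100.0, 1.0
print(0.5 * hint_error**2)           # 5000.0  (squared-error penalty)
print(huber(hint_error, grad_norm))  # 99.5    (huber penalty)
```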
If, on the other hand, the hints are a poor proxy for the missing loss subgradients, the novel huber term ensures that we still only pay the minimax optimal $\sqrt{D+1}$ penalty for delayed feedback.

Related work  A classical approach to delayed feedback in online learning is the so-called replication strategy, in which $D+1$ distinct learners take turns observing and responding to feedback (Weinberger & Ordentlich, 2002; Joulani et al., 2013; Agarwal & Duchi, 2011; Mesterharm, 2005). While minimax optimal in adversarial settings, this strategy has the disadvantage that each learner sees only $\frac{T}{D+1}$ losses and is completely isolated from the other replicates, exacerbating the problem of short prediction horizons. In contrast, we develop and analyze non-replicated delayed online learning strategies that use a combination of optimistic hinting and self-tuned regularization to mitigate the effects of delay while retaining optimal worst-case behavior. To our knowledge, Thm. 5 and its adaptive generalization Thm. 10 provide the first general analysis of delayed FTRL with optimism, apart from the concurrent work of Hsieh et al. (2020, Thm. 1). Hsieh et al. (2020, Thm. 13) and Quanrud & Khashabi (2015, Thm. 2.1) focus only on delayed gradient descent, Korotin et al. (2020) study General Hedging, and Joulani et al. (2016, Thm. 4) and Quanrud & Khashabi (2015, Thm. A.5) study non-optimistic OMD under delay. Thms. 5, 6, and 10 strengthen these results from the literature, which feature a sum of subgradient norms ($\sum_{s=t-D}^{t-1}\|g_s\|_*$ or $D\|g_t\|_*$) in place of $\|h_t - \sum_{s=t-D}^{t-1} g_s\|_*$. Even in the absence of optimism, the latter can be significantly smaller: e.g., if the gradients $g_s$ are i.i.d. mean-zero vectors, the former has size $\Omega(D)$ while the latter has expectation $O(\sqrt{D})$. In the absence of optimism, McMahan & Streeter (2014) obtain a bound comparable to Thm. 5 for the special case of one-dimensional unconstrained online gradient descent. In the absence of delay, Cutkosky (2019) introduces meta-algorithms for imbuing learning procedures with optimism while remaining robust to inaccurate hints; however, unlike OFTRL and SOOMD, the procedures of Cutkosky require separate observation of $\tilde g_{t+1}$ and each $g_t$, making them unsuitable for our delay-to-optimism reduction.

3.3. Tuning Regularizers with Optimism and Delay

The online learning algorithms introduced so far all include a regularization parameter $\lambda$. In theory and in practice, these algorithms only achieve low regret if the regularization parameter $\lambda$ is chosen appropriately. In standard FTRL, for example, one such setting that achieves optimal regret is $\lambda = \sqrt{\sum_{t=1}^T \|g_t\|_*^2 \big/ \sup_{u \in U}\psi(u)}$. This choice, however, cannot be used in practice as it relies on knowledge of all future unobserved loss subgradients. To make use of online learning algorithms, the tuning parameter $\lambda$ is often set using coarse upper bounds on, e.g., the maximum possible subgradient norm. However, these bounds are often very conservative and lead to poor real-world performance. In the following sections, we introduce two strategies for tuning regularization with optimism and delay. Sec. 4 introduces the DORM and DORM+ algorithms, variants of ODFTRL and DOOMD that are entirely tuning-free. Sec. 5 introduces the AdaHedgeD algorithm, an adaptive variant of ODFTRL that is self-tuning; a sequence of regularization parameters $\lambda_t$ is set automatically using new, tighter bounds on algorithm regret.
All three algorithms achieve the minimax optimal regret rate under delay, support optimism, and have strong real-world performance as shown in Sec. 7.

4. Tuning-free Learning with Optimism

Regret matching (RM) (Blackwell, 1956; Hart & Mas-Colell, 2000) and regret matching+ (RM+) (Tammelin et al., 2015) are online learning algorithms that have strong empirical performance. RM was developed to find correlated equilibria in two-player games and is commonly used to minimize regret over the simplex. RM+ is a modification of RM designed to accelerate convergence that was used to effectively solve the game of heads-up limit Texas Hold'em poker (Bowling et al., 2015). RM and RM+ support neither optimistic hints nor delayed feedback, and known regret bounds have a suboptimal scaling with respect to the problem dimension $d$ (Cesa-Bianchi & Lugosi, 2006; Orabona & Pál, 2015). To extend these algorithms to the delayed and optimistic setting and recover the optimal regret rate, we introduce our generalizations, delayed optimistic regret matching (DORM),
$$w_{t+1} = \tilde w_{t+1}/\langle \mathbf{1}, \tilde w_{t+1}\rangle \quad\text{for}\quad \tilde w_{t+1} \triangleq \max\big(\mathbf{0},\ (r_{1:t-D} + h_{t+1})/\lambda\big)^{q-1}, \tag{DORM}$$
and delayed optimistic regret matching+ (DORM+),
$$w_{t+1} = \tilde w_{t+1}/\langle \mathbf{1}, \tilde w_{t+1}\rangle \quad\text{for}\quad h_0 = \tilde w_0 \triangleq \mathbf{0}, \quad \tilde w_{t+1} \triangleq \max\big(\mathbf{0},\ \tilde w_t^{1/(q-1)} + (r_{t-D} + h_{t+1} - h_t)/\lambda\big)^{q-1}, \tag{DORM+}$$
where the max and exponentiation are applied entrywise. Each algorithm makes use of an instantaneous regret vector $r_t \triangleq \mathbf{1}\langle g_t, w_t\rangle - g_t$ that quantifies the relative performance of each expert with respect to the play $w_t$ and the linearized loss subgradient $g_t$. The updates also include a parameter $q \geq 2$ and its conjugate exponent $p = q/(q-1)$ that is set to recover the minimax optimal scaling of regret with the number of experts (see Cor. 9). We note that DORM and DORM+ recover the standard RM and RM+ algorithms when $D = 0$, $\lambda = 1$, $q = 2$, and $h_t = 0$ for all $t$.

4.1. Tuning-free Regret Bounds

To bound the regret of the DORM and DORM+ plays, we prove that DORM is an instance of ODFTRL and DORM+ is an instance of DOOMD. This connection enables us to immediately provide regret guarantees for these regret-matching algorithms under delayed feedback and with optimism. We first highlight a remarkable property of DORM and DORM+ that is the basis of their tuning-free nature. Under mild conditions, the normalized DORM and DORM+ iterates $w_t$ are independent of the choice of regularization parameter $\lambda$.

Lemma 7 (DORM and DORM+ are independent of $\lambda$). If the subgradient $g_t$ and hint $h_{t+1}$ only depend on $\lambda$ through $(w_s, \lambda^{q-1}\tilde w_s, g_{s-1}, h_s)_{s \leq t}$ and $(w_s, \lambda^{q-1}\tilde w_s, g_s, h_s)_{s \leq t}$ respectively, then the DORM and DORM+ iterates $(w_t)_{t \geq 1}$ are independent of the choice of $\lambda > 0$.

Lem. 7, proved in App. E, implies that DORM and DORM+ are automatically optimally tuned with respect to $\lambda$, even when run with a default value of $\lambda = 1$. Hence, these algorithms are tuning-free, a very appealing property for real-world deployments of online learning. To show that DORM and DORM+ also achieve optimal regret scaling under delay, we connect them to ODFTRL and DOOMD operating on the nonnegative orthant with a special surrogate loss $\hat\ell_t$ (see App. D for our proof).

Lemma 8 (DORM is ODFTRL and DORM+ is DOOMD). The DORM and DORM+ iterates are proportional to ODFTRL and DOOMD iterates respectively with $W \triangleq \mathbb{R}^d_+$, $\psi(\tilde w) = \frac{1}{2(p-1)}\|\tilde w\|_p^2$, and loss $\hat\ell_t(\tilde w) = \langle \tilde w, -r_t\rangle$.
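For concreteness, here is a minimal sketch of a single DORM+ step with the default $\lambda = 1$ justified by Lem. 7. It follows the recursion as written above, maintaining the state $m_t \triangleq \tilde w_t^{1/(q-1)}$; the function and variable names are ours rather than part of the PoolD library.

```python
import numpy as np

def dorm_plus_step(m_prev, r_delayed, h_new, h_old, q):
    """One DORM+ update with lambda = 1.

    m_prev    : previous state m_t = tilde_w_t^{1/(q-1)}           (d,)
    r_delayed : newly revealed instantaneous regret r_{t-D}        (d,)
    h_new     : current hint h_{t+1};  h_old : previous hint h_t   (d,)
    Returns the new state m and the normalized play w on the simplex.
    """
    m = np.maximum(0.0, m_prev + r_delayed + h_new - h_old)
    w_tilde = m ** (q - 1)                       # unnormalized play
    total = w_tilde.sum()
    w = w_tilde / total if total > 0 else np.full(len(m), 1.0 / len(m))
    return m, w

# The caller supplies r_{t-D} = <g_{t-D}, w_{t-D}> * 1 - g_{t-D} once that feedback arrives.
# With q = 2, D = 0, and zero hints, this reduces to the classical RM+ update.
```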
Lem. 8 enables the following optimally-tuned regret bounds for DORM and DORM+ run with any choice of $\lambda$.

Corollary 9 (DORM and DORM+ regret). Under the assumptions of Lem. 7, for all $u \in \triangle^{d-1}$ and any choice of $\lambda > 0$, the DORM and DORM+ iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq \inf_{\lambda' > 0}\ \frac{\lambda'\|u\|_p^2}{2(p-1)} + \frac{1}{\lambda'}\sum_{t=1}^T b_{t,q},$$
where $h_{T+1} \triangleq r_{T-D+1:T}$ and, for each $c \in [2, \infty]$,
$$b_{t,c} \triangleq \operatorname{huber}\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} r_s\big\|_c,\ \|r_t\|_c\Big) \ \text{(DORM)} \quad\text{and}\quad b_{t,c} \triangleq \operatorname{huber}\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} r_s\big\|_c,\ \|r_{t-D} + h_{t+1} - h_t\|_c\Big) \ \text{(DORM+)}.$$
If, in addition, $q = \operatorname{argmin}_{q' \geq 2} d^{2/q'}(q'-1)$, then $\mathrm{Regret}_T(u) \leq \sqrt{2(2\log_2(d) - 1)\sum_{t=1}^T b_{t,q}}$.

Cor. 9, proved in App. F, suggests a natural hinting strategy for reducing the regret of DORM and DORM+: predict the sum of unobserved instantaneous regrets $\sum_{s=t-D}^{t} r_s$. We explore this strategy empirically in Sec. 7. Cor. 9 also highlights the value of the $q$ parameter in DORM and DORM+: using the easily computed value $q = \operatorname{argmin}_{q' \geq 2} d^{2/q'}(q'-1)$ yields the minimax optimal $\log_2(d)$ dependence of regret on dimension (Cesa-Bianchi & Lugosi, 2006; Orabona & Pál, 2015). By Lem. 8, setting $q$ in this way is equivalent to selecting a robust $p$-norm regularizer (Gentile, 2003) for the underlying ODFTRL and DOOMD problems.

Related work  Without delay, Farina et al. (2021) independently developed optimistic versions of RM and RM+ by reducing them to OFTRL and a two-step variant of optimistic OMD (1). Unlike SOOMD, this two-step optimistic OMD requires separate observation of $\tilde g_{t+1}$ and $g_t$, making it unsuitable for our delay-as-optimism reduction and resulting in a different algorithm from DORM+ even when $D = 0$. In addition, their regret bounds and prior bounds for RM and RM+ (special cases of DORM and DORM+ with $q = 2$) have suboptimal regret when the dimension $d$ is large (Bowling et al., 2015; Zinkevich et al., 2007).

5. Self-tuned Learning with Optimism

In this section, we analyze an adaptive version of ODFTRL with time-varying regularization $\lambda_t$ and develop strategies for setting $\lambda_t$ appropriately in the presence of optimism and delay. We begin with a new general regret analysis of optimistic delayed adaptive FTRL (ODAFTRL),
$$w_{t+1} = \operatorname*{argmin}_{w \in W}\ \langle g_{1:t-D} + h_{t+1}, w\rangle + \lambda_{t+1}\psi(w), \tag{ODAFTRL}$$
where $h_{t+1} \in \mathbb{R}^d$ is an arbitrary hint vector revealed before $w_{t+1}$ is generated, $\psi$ is 1-strongly convex with respect to a norm $\|\cdot\|$, and $\lambda_t \geq 0$ is a regularization parameter.

Theorem 10 (ODAFTRL regret). If $\psi$ is nonnegative and $\lambda_t$ is non-decreasing in $t$, then, for all $u \in W$, the ODAFTRL iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq \lambda_T\psi(u) + \sum_{t=1}^T \min\Big(\frac{b_{t,F}}{\lambda_t},\ a_{t,F}\Big)$$
with
$$b_{t,F} \triangleq \operatorname{huber}\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} g_s\big\|_*,\ \|g_t\|_*\Big) \quad\text{and}\quad a_{t,F} \triangleq \operatorname{diam}(W)\,\min\Big(\big\|h_t - \textstyle\sum_{s=t-D}^{t} g_s\big\|_*,\ \|g_t\|_*\Big). \tag{2}$$

The proof of this result in App. G builds on a new regret bound for undelayed optimistic adaptive FTRL (OAFTRL). In the absence of delay ($D = 0$), Thm. 10 strictly improves existing regret bounds (Rakhlin & Sridharan, 2013a; Mohri & Yang, 2016; Joulani et al., 2017) for OAFTRL by providing tighter guarantees whenever the hinting error $\|h_t - \sum_{s=t-D}^{t} g_s\|_*$ is larger than the subgradient magnitude $\|g_t\|_*$. In the presence of delay, Thm. 10 benefits both from robustness to hinting error in the worst case and the ability to exploit accurate hints in the best case. The bounded-domain factors $a_{t,F}$ strengthen both standard OAFTRL regret bounds and the concurrent bound of Hsieh et al. (2020, Thm. 1) when $\operatorname{diam}(W)$ is small and will enable us to design practical $\lambda_t$-tuning strategies under delay without any prior knowledge of unobserved subgradients. We now turn to these self-tuning protocols.
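The tuning strategies of Secs. 5.1 and 5.2 are driven by the observable quantities $a_{t,F}$ and $b_{t,F}$ of (2) once the delayed feedback for round $t$ arrives. A small helper for computing them might look as follows; it assumes the huber form reconstructed in Sec. 3.2, and the function names are ours.

```python
import numpy as np

def huber(x, y):
    # huber(x, y) = (1/2)x^2 - (1/2)(|x| - |y|)_+^2
    return 0.5 * x**2 - 0.5 * max(abs(x) - abs(y), 0.0)**2

def odaftrl_terms(h_t, delayed_grads, g_t, diam, dual_norm=np.linalg.norm):
    """Compute (a_tF, b_tF) from (2) for one round.

    h_t           : hint used at round t                        (d,)
    delayed_grads : list of the subgradients g_{t-D}, ..., g_t  (each (d,))
    g_t           : subgradient of round t                      (d,)
    diam          : diameter of the play set W
    dual_norm     : the dual norm ||.||_* (Euclidean by default)
    """
    err = dual_norm(h_t - np.sum(delayed_grads, axis=0))  # hint error ||h_t - sum g_s||_*
    g_norm = dual_norm(g_t)
    b_tF = huber(err, g_norm)
    a_tF = diam * min(err, g_norm)
    return a_tF, b_tF
```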
5.1. Conservative Tuning with Delayed Upper Bound

Setting aside the $a_{t,F}$ bounded-domain factors in Thm. 10 for now, the adaptive sequence $\lambda_t = \sqrt{\sum_{s=1}^{t} b_{s,F} \big/ \sup_{u \in U}\psi(u)}$ is known to be a near-optimal minimizer of the ODAFTRL regret bound (McMahan, 2017, Lemma 1). However, this value is unobservable at time $t$. A common strategy is to play the conservative value $\lambda_t = \sqrt{\big((D+1)B_0 + \sum_{s=1}^{t-D-1} b_{s,F}\big)\big/ \sup_{u \in U}\psi(u)}$, where $B_0$ is a uniform upper bound on the unobserved $b_{s,F}$ terms (Joulani et al., 2016; McMahan & Streeter, 2014). In practice, this requires computing an a priori upper bound on any subgradient norm that could possibly arise and often leads to extreme over-regularization (see Sec. 7). As a preliminary step towards fully adaptive settings of $\lambda_t$, we analyze in App. H a new delayed upper bound (DUB) tuning strategy which relies only on observed $b_{s,F}$ terms and does not require upper bounds for future losses.

Theorem 11 (DUB regret). Fix $\alpha > 0$, and, for $a_{t,F}$, $b_{t,F}$ as in (2), consider the delayed upper bound (DUB) sequence
$$\lambda_{t+1} \triangleq \frac{2}{\alpha}\max_{j \leq t-D-1} a_{j-D+1:j,F} + \frac{1}{\alpha}\sqrt{\sum_{i=1}^{t-D-1} a_{i,F}^2 + 2\alpha\, b_{i,F}}. \tag{DUB}$$
If $\psi$ is nonnegative, then, for all $u \in W$, the ODAFTRL iterates $w_t$ satisfy
$$\mathrm{Regret}_T(u) \leq \Big(\frac{\psi(u)}{\alpha} + 1\Big)\Big(2\max_{t \in [T]} a_{t-D:t-1,F} + \sqrt{\sum_{t=1}^T a_{t,F}^2 + 2\alpha\, b_{t,F}}\Big).$$

As desired, the DUB setting of $\lambda_t$ depends only on previously observed $a_{t,F}$ and $b_{t,F}$ terms and achieves optimal regret scaling with the delay period $D$. However, the terms $a_{t,F}$, $b_{t,F}$ are themselves potentially loose upper bounds for the instantaneous regret at time $t$. In the following section, we show how the DUB regularization setting can be refined further to produce AdaHedgeD adaptive regularization.

5.2. Refined Tuning with AdaHedgeD

As noted by Erven et al. (2011); de Rooij et al. (2014); Orabona (2019), the effectiveness of an adaptive regularization setting $\lambda_t$ that uses an upper bound on regret (such as $b_{t,F}$) relies heavily on the tightness of that bound. In practice, we want to set $\lambda_t$ using as tight a bound as possible. Our next result introduces a new tuning sequence that can be used with delayed feedback and is inspired by the popular AdaHedge algorithm (Erven et al., 2011). It makes use of the tightened regret analysis underlying Thm. 10 to enable tighter settings of $\lambda_t$ compared to DUB, while still controlling algorithm regret (see proof in App. I).

Theorem 12 (AdaHedgeD regret). Fix $\alpha > 0$, and consider the delayed AdaHedge-style (AdaHedgeD) sequence
$$\lambda_{t+1} \triangleq \frac{1}{\alpha}\sum_{s=1}^{t-D} \delta_s \quad\text{for}\quad \delta_t \triangleq \min\Big(F_{t+1}(w_t, \lambda_t) - F_{t+1}(\bar w_t, \lambda_t),\ \langle g_t, w_t - \bar w_t\rangle,\ F_{t+1}(\hat w_t, \lambda_t) - F_{t+1}(\bar w_t, \lambda_t) + \langle g_t, w_t - \hat w_t\rangle\Big)_+ \tag{AdaHedgeD}$$
with
$$\bar w_t \triangleq \operatorname*{argmin}_{w \in W} F_{t+1}(w, \lambda_t), \qquad \hat w_t \triangleq \operatorname*{argmin}_{w \in W} F_{t+1}(w, \lambda_t) + \min\Big(\frac{\|g_t\|_*}{\|h_t - g_{t-D:t}\|_*},\ 1\Big)\langle h_t - g_{t-D:t}, w\rangle, \tag{3}$$
and $F_{t+1}(w, \lambda_t) \triangleq \lambda_t\psi(w) + \langle g_{1:t}, w\rangle$. If $\psi$ is nonnegative, then, for all $u \in W$, the ODAFTRL iterates satisfy
$$\mathrm{Regret}_T(u) \leq \Big(\frac{\psi(u)}{\alpha} + 1\Big)\Big(2\max_{t \in [T]} a_{t-D:t-1,F} + \sqrt{\sum_{t=1}^T a_{t,F}^2 + 2\alpha\, b_{t,F}}\Big).$$

Remarkably, Thm. 12 yields a minimax optimal $O(\sqrt{(D+1)T} + D)$ dependence on the delay parameter and nearly matches the Thm. 5 regret of the optimal constant-$\lambda$ tuning. Although this regret bound is identical to that in Thm. 11, in practice the $\lambda_t$ values produced by AdaHedgeD can be orders of magnitude smaller than those of DUB, granting additional adaptivity. We evaluate the practical implications of these $\lambda_t$ settings in Sec. 7. As a final note, when $\psi$ is bounded on $U$, we recommend choosing $\alpha = \sup_{u \in U}\psi(u)$ so that $\frac{\psi(u)}{\alpha} \leq 1$. For negative entropy regularization $\psi(u) = \sum_{j=1}^d u_j\ln(u_j) + \ln(d)$ on the simplex $U = W = \triangle^{d-1}$, this yields $\alpha = \ln(d)$ and a regret bound with minimax optimal $\ln(d)$ dependence on $d$ (Cesa-Bianchi & Lugosi, 2006; Orabona & Pál, 2015).
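As a recap of Sec. 5.1, the sketch below computes the DUB schedule from the running $a_{s,F}$ and $b_{s,F}$ statistics; the AdaHedgeD schedule differs only in accumulating the tighter $\delta_s$ increments instead. This follows the reconstructed form of Thm. 11 under our own function names and is not code from the PoolD library.

```python
import numpy as np

def dub_lambda(a_hist, b_hist, D, alpha):
    """lambda_{t+1} from the DUB sequence, given the observed histories
    a_hist = [a_{1,F}, ..., a_{t-D-1,F}] and b_hist = [b_{1,F}, ..., b_{t-D-1,F}]."""
    if not a_hist:
        return 0.0
    # running max over j of the trailing window sums a_{j-D+1:j,F}
    window_max = max(sum(a_hist[max(0, j - D + 1): j + 1]) for j in range(len(a_hist)))
    tail = np.sqrt(sum(a**2 + 2 * alpha * b for a, b in zip(a_hist, b_hist)))
    return (2.0 / alpha) * window_max + (1.0 / alpha) * tail
```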
Related work  Our AdaHedgeD $\delta_t$ terms differ from standard AdaHedge increments (see, e.g., Orabona, 2019, Sec. 7.6) due to the accommodation of delay, the incorporation of optimism, and the inclusion of the final two terms in the min. These non-standard terms are central to reducing the impact of delay on our regret bounds. Prior and concurrent approaches to adaptive tuning under delay do not incorporate optimism and require an explicit upper bound on all future subgradient norms, a quantity which is often difficult to obtain or very loose (McMahan & Streeter, 2014; Joulani et al., 2016; Hsieh et al., 2020). Our optimistic algorithms, DUB and AdaHedgeD, admit comparable regret guarantees (Thms. 11 and 12) but require no prior knowledge of future subgradients.

6. Learning to Hint with Delay

As we have seen, optimistic hints play an important role in online learning under delay: effective hinting can counteract the increase in regret under delay. In this section, we consider the problem of choosing amongst several competing hinting strategies. We show that this problem can again be treated as a delayed online learning problem. In the following, we will call the original online learning problem the base problem and the learning-to-hint problem the hinting problem. Suppose that, at time $t$, we observe the hints of $m$ different hinters arranged into a $d \times m$ matrix $H_t$. Each column of $H_t$ is one hinter's best estimate of the sum of missing loss subgradients $g_{t-D:t}$. Our aim is to output a sequence of combined hints $h_t(\omega_t) \triangleq H_t\omega_t$ with low regret relative to the best constant combination strategy $\omega \in \Omega \triangleq \triangle^{m-1}$ in hindsight. To achieve this using delayed online learning, we make use of a convex loss function $l_t(\omega)$ for the hint learner that upper bounds the base learner regret.

Assumption 1 (Convex regret bound). For any hint sequence $(h_t)_{t=1}^T$ and $u \in U$, the base problem admits the regret bound $\mathrm{Regret}_T(u) \leq C_0(u) + C_1(u)\sqrt{\sum_{t=1}^T f_t(h_t)}$ for $C_1(u) \geq 0$ and convex functions $f_t$ independent of $u$.

As we detail in App. K, Assump. 1 holds for all of the learning algorithms introduced in this paper. For example, by Cor. 9, if the base learner is DORM, we may choose $C_0(u) = 0$, $C_1(u) = \|u\|_p\sqrt{2/(p-1)}$, and the $O(D)$ convex function $f_t(h_t) = \|r_t\|_q\big\|h_t - \sum_{s=t-D}^{t} r_s\big\|_q \geq b_{t,q}$.³ For any base learner satisfying Assump. 1, we choose $l_t(\omega) = f_t(H_t\omega)$ as our hinting loss, use the tuning-free DORM+ algorithm to output the combination weights $\omega_t$ on each round, and provide the hint $h_t(\omega_t) = H_t\omega_t$ to the base learner. The following result, proved in App. J, shows that this learning-to-hint strategy performs nearly as well as the best constant hint combination strategy in retrospect.

³ The alternative choice $f_t(h_t) = \frac{1}{2}\big\|h_t - \sum_{s=t-D}^{t} r_s\big\|_q^2$ also bounds regret but may have size $\Omega(D^2)$ rather than $O(D)$.

Theorem 13 (Learning to hint regret). Suppose the base problem satisfies Assump. 1 and the hinting problem is solved with DORM+ hint iterates $\omega_t$, hinting losses $l_t(\omega) = f_t(H_t\omega)$, no meta-hints for the hinting problem, and $q = \operatorname{argmin}_{q' \geq 2} m^{2/q'}(q'-1)$. Then the base problem with hints $h_t(\omega_t) = H_t\omega_t$ satisfies
$$\mathrm{Regret}_T(u) \leq C_0(u) + C_1(u)\sqrt{\inf_{\omega \in \Omega}\sum_{t=1}^T f_t(h_t(\omega)) + \sqrt{2(2\log_2(m) - 1)\sum_{t=1}^T \operatorname{huber}(\xi_t, \zeta_t)}}$$
for $\xi_t \triangleq 4(D+1)\sum_{s=t-D}^{t}\|\gamma_s\|_\infty^2$, $\gamma_t \in \partial l_t(\omega_t)$, and $\zeta_t \triangleq 4\|\gamma_{t-D}\|_\infty\sum_{s=t-D}^{t}\|\gamma_s\|_\infty$.

To quantify the size of this regret bound, consider again the DORM base learner with $f_t(h_t) = \|r_t\|_q\|h_t - \sum_{s=t-D}^{t} r_s\|_q$. By Lem. 26 in App. K, $\|\gamma_t\|_\infty \leq d^{1/q}\|H_t\|_\infty\|r_t\|_q$ for $\|H_t\|_\infty$ the maximum absolute entry of $H_t$. Each column of $H_t$ is a sum of $D+1$ subgradient hints, so $\|H_t\|_\infty$ is $O(D+1)$. Thus, for this choice of hinter loss, the $\operatorname{huber}(\xi_t, \zeta_t)$ term is $O((D+1)^3)$, and the hint learner suffers only $O(T^{1/4}(D+1)^{3/4})$ additional regret from learning to hint. Notably, this additive regret penalty is $O(\sqrt{(D+1)T})$ if $D = O(T)$ (and $o(\sqrt{(D+1)T})$ when $D = o(T)$), so the learning-to-hint strategy of Thm. 13 preserves minimax optimal regret rates.
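To make the hinting problem concrete, the sketch below forms the combined hint $H_t\omega_t$ and a subgradient $\gamma_t \in \partial l_t(\omega_t)$ for the DORM hinting loss $f_t(h) = \|r_t\|_q\|h - \sum_{s=t-D}^{t} r_s\|_q$ discussed above; this subgradient can then be fed to the DORM+ meta-learner. It is an illustrative sketch with our own function names, not the PoolD implementation.

```python
import numpy as np

def combined_hint(H, omega):
    """h_t(omega) = H_t omega: convex combination of the m hinters' columns."""
    return H @ omega

def hinting_loss_grad(H, omega, r_t, r_sum, q):
    """Subgradient of l_t(omega) = ||r_t||_q * ||H omega - r_sum||_q with respect to omega,
    where r_sum = sum_{s=t-D}^t r_s is the target the hints try to predict."""
    x = H @ omega - r_sum
    norm_x = np.linalg.norm(x, ord=q)
    if norm_x == 0.0:
        return np.zeros_like(omega)
    grad_h = np.sign(x) * np.abs(x) ** (q - 1) / norm_x ** (q - 1)  # gradient of ||.||_q at x
    return np.linalg.norm(r_t, ord=q) * (H.T @ grad_h)              # chain rule through H
```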
Related work  Rakhlin & Sridharan (2013a, Sec. 4.1) propose and analyze a method to learn optimism strategies for a two-step OMD base learner. Unlike Thm. 13, the approach does not accommodate delay, and the analyzed regret is only with respect to single hinting strategies $\omega \in \{e_j\}_{j \in [m]}$ rather than combination strategies $\omega \in \triangle^{m-1}$.

7. Experiments

We now apply the online learning techniques developed in this paper to the problem of adaptive ensembling for subseasonal forecasting. Our experiments are based on the subseasonal forecasting data of Flaspohler et al. (2021) that provides the forecasts of $d = 6$ machine learning and physics-based models for both temperature and precipitation at two forecast horizons: 3-4 weeks and 5-6 weeks. In operational subseasonal forecasting, feedback is delayed; models make $D = 2$ or $3$ forecasts (depending on the forecast horizon) before receiving feedback. We use delayed, optimistic online learning to play a time-varying convex combination of input models and compete with the best input model over a year-long prediction period ($T = 26$ semimonthly dates). The loss function is the geographic root-mean squared error (RMSE) across 514 locations in the Western United States. We evaluate the relative merits of the delayed online learning techniques presented by computing yearly regret and mean RMSE for the ensemble plays made by the online learner in each year from 2011-2020. Unless otherwise specified, all online learning algorithms use the "recent g" hint, which approximates each unobserved subgradient at time $t$ with the most recent observed subgradient $g_{t-D-1}$. See App. L for full experimental details, App. N for algorithmic details, and App. M for extended experimental results.

Competing with the best input model  The primary benefit of online learning in this setting is its ability to achieve small average regret, i.e., to perform nearly as well as the best input model in the competitor set $U$ without knowing which is best in advance. We run our three delayed online learners DORM, DORM+, and AdaHedgeD on all four subseasonal prediction tasks and measure their average loss. The average yearly RMSE for the three online learning algorithms and the six input models is shown in Table 1. The DORM+ algorithm tracks the performance of the best input model for all tasks except Temp. 5-6w.

Table 1: Average RMSE of the 2011-2020 semimonthly forecasts: The average RMSE for online learning algorithms (left) and input models (right) over a 10-year evaluation period. In each task, the online learners compare favorably with the best input model and learn to downweight the lower-performing candidates.

| Task | AdaHedgeD | DORM | DORM+ | Model1 | Model2 | Model3 | Model4 | Model5 | Model6 |
|---|---|---|---|---|---|---|---|---|---|
| Precip. 3-4w | 21.726 | 21.731 | 21.675 | 21.973 | 22.431 | 22.357 | 21.978 | 21.986 | 23.344 |
| Precip. 5-6w | 21.868 | 21.957 | 21.838 | 22.030 | 22.570 | 22.383 | 22.004 | 21.993 | 23.257 |
| Temp. 3-4w | 2.273 | 2.259 | 2.247 | 2.253 | 2.352 | 2.394 | 2.277 | 2.319 | 2.508 |
| Temp. 5-6w | 2.316 | 2.316 | 2.303 | 2.270 | 2.368 | 2.459 | 2.278 | 2.317 | 2.569 |
Figure 1: Overall performance: Yearly cumulative regret under RMSE loss for the Precip. 3-4w task. The zero line corresponds to the performance of the best input model in a given year.

All online learning algorithms achieve negative regret for both precipitation tasks. Fig. 1 shows the yearly cumulative regret (in terms of the RMSE loss) of the online learning algorithms over the 10-year evaluation period. There are several years (e.g., 2012, 2014, 2020) in which all online learning algorithms substantially outperform the best input forecasting model. The consistently low regret year-to-year of DORM+ compared to DORM and AdaHedgeD makes it a promising candidate for real-world delayed subseasonal forecasting. Notably, RM+ (a special case of DORM+) is known to have small tracking regret, i.e., it competes well even with strategies that switch between input models a bounded number of times (Tammelin et al., 2015, Thm. 2). We suspect that this is one source of DORM+'s superior performance. We also note that the self-tuned AdaHedgeD performs comparably to the optimally-tuned DORM, demonstrating the effectiveness of our self-tuning strategy.

Impact of regularization  We evaluate the impact of the three regularization strategies developed in this paper: 1) the upper bound DUB strategy, 2) the tighter AdaHedgeD strategy, and 3) the DORM+ algorithm that is tuning-free. This tuning-free property has evident practical benefits, as this section demonstrates. Fig. 2 shows the yearly regret of the DUB, AdaHedgeD, and DORM+ algorithms. A consistent pattern appears in the yearly regret: DUB has moderate positive regret, AdaHedgeD has both the largest positive and negative regret values, and DORM+ sits between these two extremes. If we examine the weights played by each algorithm (Fig. 3), the weights of DUB and AdaHedgeD appear respectively over- and under-regularized compared to DORM+ (the top model for this task). DUB's use of the upper bound $b_{t,F}$ results in a very large regularization setting ($\lambda_T = 142.881$) and a virtually uniform weight setting. AdaHedgeD's tighter bound $\delta_t$ produces a value of $\lambda_T = 3.005$ that is two orders of magnitude smaller. However, in this short-horizon forecasting setting, AdaHedgeD's aggressive plays result in higher average RMSE. By nature of its $\lambda_t$-free updates, DORM+ produces more moderately regularized plays $w_t$ and negative regret.

To replicate or not to replicate  In this section, we compare the performance of replicated and non-replicated variants of our DORM+ algorithm. Both algorithms perform well (see App. M.3), but in all tasks, DORM+ outperforms replicated DORM+ (in which $D+1$ independent copies of DORM+ make staggered predictions). Fig. 4 provides an example of the weight plots produced by the replication strategy in the Temp. 5-6w task with $D = 3$. The separate nature of the replicated learner's plays is evident in the weight plots and leads to an average RMSE of 2.315, versus 2.303 for DORM+ in the Temp. 5-6w task.

Learning to hint  Finally, we examine the effect of optimism on the DORM+ algorithms and the ability of our learning-to-hint strategy to recover the performance of the best optimism strategy in retrospect.
Following the hint construction protocol in App. N.2, we run the DORM+ base algorithm with $m = 4$ subgradient hinting strategies: $\hat g_s = g_{t-D-1}$ (recent g), $\hat g_s = g_{s-D-1}$ (prev g), $\hat g_s = \frac{D+1}{t-D-1}\,g_{1:t-D-1}$ (mean g), or $\hat g_s = 0$ (none). We also use DORM+ as the meta-algorithm for hint learning to produce the "learned" optimism strategy that plays a convex combination of the four hinters. In Fig. 5, we first note that several optimism strategies outperform the "none" hinter, confirming the value of optimism in reducing regret. The "learned" variant of DORM+ avoids the worst-case performance of the individual hinters in any given year (e.g., 2015), while staying competitive with the best strategy (although it does not outperform the dominant "recent g" strategy overall). We believe the performance of the online hinter could be further improved by developing tighter convex bounds on the regret of the base problem in the spirit of Assump. 1.

Figure 2: Regret of regularizers: Yearly cumulative regret (in terms of the RMSE loss) for the three regularization strategies for the Temp. 3-4w task (average RMSE: AdaHedgeD 2.273, DORM+ 2.247, DUB 2.258).

Figure 3: Impact of regularization: The plays $w_t$ of online learning algorithms used to combine the input models for the Temp. 3-4w task in the 2020 evaluation year. The weights of DUB and AdaHedgeD appear respectively over- and under-regularized compared to DORM+ (the top model for this task) due to their selection of regularization strength $\lambda_t$ ($\lambda_T = 142.881$ for DUB and $\lambda_T = 3.005$ for AdaHedgeD).

Figure 4: To replicate or not to replicate: The plays $w_t$ of the standard DORM+ and replicated DORM+ algorithms for the Temp. 5-6w task in the final evaluation year.

Figure 5: Learning to hint: Yearly cumulative regret (in terms of the RMSE loss) for the adaptive hinting and four constant hinting strategies for the Precip. 3-4w task (average RMSE: learned 21.745, mean g 21.830, none 21.796, prev g 21.727, recent g 21.675).

8. Conclusion

In this work, we confronted the challenges of delayed feedback and short regret horizons in online learning with optimism, developing practical non-replicated, self-tuned and tuning-free algorithms with optimal regret guarantees. Our delay-as-optimism reduction and our refined analysis of optimistic learning produced novel regret bounds for both optimistic and delayed online learning and elucidated the connections between these two problems. Within the subseasonal forecasting domain, we demonstrated that delayed online learning methods can produce zero-regret forecast ensembles that perform robustly from year to year. Our results highlighted DORM+ as a particularly promising candidate due to its tuning-free nature and small tracking regret.
In future work, we are excited to further develop optimism strategies under delay by 1) employing tighter convex loss bounds on the regret of the base algorithm to improve the learning-to-hint algorithm, 2) exploring the relative impact of hinting for past ($g_{t-D:t-1}$) versus future ($g_t$) missing subgradients (see App. M.5 for an initial exploration), and 3) developing adaptive self-tuning variants of the DOOMD algorithm. Within the subseasonal domain, we plan to leverage the flexibility of our optimism formulation to explore hinting strategies that use meteorological expertise to improve beyond the generic mean and past subgradient hints and to deploy our open-source subseasonal forecasting algorithms operationally.

Acknowledgements

This work was supported by Microsoft AI for Earth, an NSF GRFP, and the NSF grants no. 1925930 "Collaborative Research: TRIPODS Institute for Optimization and Learning", no. 1908111 "AF: Small: Collaborative Research: New Representations for Learning Algorithms and Secure Computation", and no. 2022446 "Foundations of Data Science Institute". FO also thanks Nicolò Cesa-Bianchi and Christian Kroer for discussions on RM and RM+.

References

Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, 2001.

Blackwell, D. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1-8, 1956.

Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold'em poker is solved. Science, 347(6218):145-149, 2015. ISSN 0036-8075. doi: 10.1126/science.1259433.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Mannor, S., Srebro, N., and Williamson, R. C. (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23, pp. 6.1-6.20, Edinburgh, Scotland, 25-27 Jun 2012.

Cutkosky, A. Combining online learning guarantees. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 895-913, Phoenix, USA, 25-28 Jun 2019. PMLR.

Danskin, J. M. The Theory of Max-Min and its Application to Weapons Allocation Problems, volume 5. Springer Science & Business Media, 2012.

de Rooij, S., van Erven, T., Grünwald, P. D., and Koolen, W. M. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(37):1281-1316, 2014.

Erven, T., Koolen, W. M., Rooij, S., and Grünwald, P. Adaptive hedge. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 24, pp. 1656-1664. Curran Associates, Inc., 2011.

Farina, G., Kroer, C., and Sandholm, T. Faster game solving via predictive Blackwell approachability: Connecting regret matching and mirror descent. Proceedings of the AAAI Conference on Artificial Intelligence, 35(6):5363-5371, May 2021.
Flaspohler, G., Orabona, F., Cohen, J., Mouatadid, S., Oprescu, M., Orenstein, P., and Mackey, L. Replication Data for: Online Learning with Optimism and Delay, 2021. URL https://doi.org/10.7910/DVN/IOCFCY.

Gentile, C. The robustness of the p-norm algorithms. Machine Learning, 53(3):265-299, 2003.

Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000.

Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. Multi-agent online optimization with delays: Asynchronicity, adaptivity, and optimism. arXiv preprint arXiv:2012.11579, 2020.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964. doi: 10.1214/aoms/1177703732. URL https://doi.org/10.1214/aoms/1177703732.

Hwang, J., Orenstein, P., Cohen, J., Pfeiffer, K., and Mackey, L. Improving subseasonal forecasting in the western U.S. with machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2325-2335, 2019.

Joulani, P., György, A., and Szepesvári, C. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453-1461, 2013.

Joulani, P., György, A., and Szepesvári, C. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Joulani, P., György, A., and Szepesvári, C. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory, pp. 681-720. PMLR, 2017.

Kamalaruban, P. Improved optimistic mirror descent for sparsity and curvature. arXiv preprint arXiv:1609.02383, 2016.

Koolen, W., Van Erven, T., and Grünwald, P. Learning the learning rate for prediction with expert advice. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 27, pp. 2294-2302. Curran Associates, Inc., 2014.

Korotin, A., V'yugin, V., and Burnaev, E. Adaptive hedging under delayed feedback. Neurocomputing, 397:356-368, 2020.

Liu, J. and Wright, S. J. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351-376, 2015.

Liu, J., Wright, S., Ré, C., Bittorf, V., and Sridhar, S. An asynchronous parallel stochastic coordinate descent algorithm. In International Conference on Machine Learning, pp. 469-477. PMLR, 2014.

McMahan, B. and Streeter, M. Delay-tolerant algorithms for asynchronous distributed online learning. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 27, pp. 2915-2923. Curran Associates, Inc., 2014.

McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117-3166, 2017.

McQuade, S. and Monteleoni, C. Global climate model tracking using geospatial neighborhoods. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, 2012.

Mesterharm, C. On-line learning with delayed label feedback. In International Conference on Algorithmic Learning Theory, pp. 399-413. Springer, 2005.

Mohri, M. and Yang, S. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pp. 848-856. PMLR, 2016.
Monteleoni, C. and Jaakkola, T. Online learning of non-stationary sequences. In Thrun, S., Saul, L., and Schölkopf, B. (eds.), Advances in Neural Information Processing Systems, volume 16, pp. 1093-1100. MIT Press, 2004.

Monteleoni, C., Schmidt, G. A., Saroha, S., and Asplund, E. Tracking climate models. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(4):372-392, 2011.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

Nowak, K., Beardsley, J., Brekke, L. D., Ferguson, I., and Raff, D. Subseasonal prediction for water management: Reclamation forecast rodeo I and II. In 100th American Meteorological Society Annual Meeting. AMS, 2020.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Orabona, F. and Pál, D. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory, pp. 287-301. Springer, 2015.

Orabona, F. and Pál, D. Optimal non-asymptotic lower bound on the minimax regret of learning with expert advice. arXiv preprint arXiv:1511.02176, 2015.

Quanrud, K. and Khashabi, D. Online learning with adversarial delays. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28, pp. 1270-1278, 2015.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Shalev-Shwartz, S. and Steinwart, I. (eds.), Proceedings of the 26th Annual Conference on Learning Theory, pp. 993-1019. PMLR, 2013a.

Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, pp. 3066-3074. Curran Associates, Inc., 2013b.

Recht, B., Ré, C., Wright, S., and Niu, F. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 24, pp. 693-701. Curran Associates, Inc., 2011.

Rockafellar, R. T. Convex Analysis, volume 36. Princeton University Press, 1970.

Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

Sra, S., Yu, A. W., Li, M., and Smola, A. AdaDelay: Delay adaptive distributed stochastic optimization. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pp. 957-965. PMLR, 2016.

Steinhardt, J. and Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pp. 1593-1601, 2014.

Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Weinberger, M. J. and Ordentlich, E. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959-1976, 2002.
White, C. J., Carlsen, H., Robertson, A. W., Klein, R. J., Lazo, J. K., Kumar, A., Vitart, F., Coughlan de Perez, E., Ray, A. J., Murray, V., et al. Potential applications of subseasonal-to-seasonal (S2S) predictions. Meteorological Applications, 24(3):315-325, 2017.

Zinkevich, M., Johanson, M., Bowling, M. H., and Piccione, C. Regret minimization in games with incomplete information. In Platt, J., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.