# Online PAC-Bayes Learning

Maxime Haddouche (Inria and University College London, France and UK)
Benjamin Guedj (Inria and University College London, France and UK)

Abstract: Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially non-convex, paving the way to promising developments in online learning.

## 1 Introduction

Batch learning is arguably the dominant learning paradigm, in which we aim to design the best predictor by collecting a training dataset which is then used for inference or prediction. Classical algorithms such as SVMs [see Cristianini et al., 2000, among many others] or feedforward neural networks [Svozil et al., 1997] are popular examples of efficient batch learning. While the mathematics of batch learning constitutes a vibrant and well-understood research field, in practice it might not be aligned with the way practitioners collect data, which can be sequential when too much information is available at a given time (e.g. the number of micro-transactions made in finance on a daily basis). Indeed, batch learning is not designed to properly handle dynamic systems. Online learning (OL) [Zinkevich, 2003, Shalev-Shwartz, 2012, Hazan, 2016] fills this gap by treating data as a continuous stream with a potentially changing learning goal. OL has been studied with convex optimisation tools and the celebrated notion of regret, which measures the discrepancy between the cumulative sum of losses incurred by a specific algorithm at each datum and that of the optimal strategy. It led to many fruitful results on the predictive efficiency of optimisation algorithms such as Online Gradient Descent (OGD) and Online Newton Step, through static regret [Zinkevich, 2003, Hazan et al., 2007]. OL is flexible enough to incorporate external expert advice into classical algorithms, with the optimistic point of view that such advice is useful for training [Rakhlin and Sridharan, 2013a,b], yielding optimistic regret bounds. Modern extensions also allow comparison with moving strategies through dynamic regret [see e.g. Yang et al., 2016, Zhang et al., 2018, Zhao et al., 2020]. However, this notion of regret has been challenged recently: for instance, Wintenberger [2021] chose to control an expected cumulative loss through PAC inequalities in order to deal with the case of stochastic loss functions. Statements holding with arbitrarily large probability are widely used in learning, and especially within the PAC-Bayes theory. Since its emergence in the late 90s, the PAC-Bayes theory (see the seminal works of Shawe-Taylor and Williamson, 1997, McAllester, 1998, 1999 and the recent surveys by Guedj, 2019, Alquier, 2021) has been a powerful tool to obtain generalisation bounds and to derive efficient learning algorithms. Classical PAC-Bayes generalisation bounds help to understand how a learning algorithm may perform on future similar batches of data.
More precisely, PAC-Bayes learning exploits the Bayesian paradigm of explaining a learning problem through a meaningful distribution over a space of candidate predictors [see e.g. Maurer, 2004, Catoni, 2007, Tolstikhin and Seldin, 2013, Mhammedi et al., 2019]. An active line of research in PAC-Bayes learning is to overcome classical assumptions such as data-free priors, bounded losses or iid data [see Lever et al., 2010, 2013, Alquier and Guedj, 2018, Holland, 2019, Rivasplata et al., 2020, Haddouche et al., 2021, Guedj and Pujol, 2021] while remaining in a batch learning spirit. Finally, a pioneering line of work led by Seldin et al. [2012a,b] on PAC-Bayes learning for martingales, independently developed by Gerchinovitz [2011], Foster et al. [2015], Li et al. [2018], boosted PAC-Bayes learning by providing sparsity regret bounds, adaptive regret bounds and online algorithms for clustering.

Our contributions. Our goal is to provide a general online framework for PAC-Bayesian learning. Our main contribution (Thm. 2.3 in Sec. 2) is a general bound which is then used to derive several online PAC-Bayesian results (as developed in Secs. 3 and 4). More specifically, we derive two types of bounds: online PAC-Bayesian training and test bounds. Training bounds exhibit online procedures, while test bounds provide efficiency guarantees. We then propose several algorithms with their associated training and test bounds, as well as a short series of experiments to evaluate the consistency of our online PAC-Bayesian approach. Our efficiency criterion is not the classical regret but an expected cumulative loss close to the one of Wintenberger [2021]. More precisely, Sec. 3 proposes a stable yet time-consuming Gibbs-based algorithm, while Sec. 4 proposes time-efficient yet volatile algorithms. We emphasise that our PAC-Bayesian results only require a bounded loss to hold: no assumption is made on the data distribution, priors can be data-dependent, and we do not require any convexity assumption on the loss, as commonly assumed in the OL framework.

Outline. Sec. 2 introduces the theoretical framework as well as our main result. Sec. 3 presents an online PAC-Bayesian algorithm and draws links between PAC-Bayes and OL results. Sec. 4 details online PAC-Bayesian disintegrated procedures with reduced computational time, and Sec. 5 gathers supporting experiments. We include reminders on OL and PAC-Bayes in Appendices A.1 and C; Appendix B provides a discussion of our main result. All proofs are deferred to Appendix D.

## 2 An online PAC-Bayesian bound

We establish a novel PAC-Bayesian theorem (which in turn will be particularised in Sec. 3) which overcomes the classical limitations of data-independent priors and iid data. We call our main result an online PAC-Bayesian bound as it allows us to consider a sequence of priors which may depend on the past, and a sequence of posteriors that can dynamically evolve as well. Indeed, we follow the online learning paradigm, which considers a continuous stream of data that the algorithm has to process on the fly, adjusting its outputs at each time step w.r.t. the arrival of new data and the past. In the PAC-Bayesian framework, this paradigm translates as follows: from an initial (still data-free) prior $Q_1 = P$ and a data sample $S = (z_1, \ldots, z_m)$, we design a sequence of posteriors $(Q_i)_{1 \le i \le m}$ where $Q_i = f(Q_1, \ldots, Q_{i-1}, z_i)$.
Framework. Consider a data space $\mathcal{Z}$ (which can contain inputs only, or pairs of inputs/outputs). We fix an integer $m > 0$ and our data sample $S \in \mathcal{Z}^m$ is drawn from an unknown distribution $\mu$. We do not make any assumption on $\mu$. We set a sequence of priors, starting with $P_1 = P$ a data-free distribution, and $(P_i)_{i \ge 2}$ such that for each $i$, $P_i$ is $\mathcal{F}_{i-1}$-measurable, where $(\mathcal{F}_i)_{i \ge 0}$ is a filtration adapted to $S$. For $P, Q \in \mathcal{M}_1(\mathcal{H})$, the notation $Q \ll P$ indicates that $Q$ is absolutely continuous w.r.t. $P$ (i.e. $Q(A) = 0$ if $P(A) = 0$ for measurable $A \subseteq \mathcal{H}$). We also denote by $(Q_i)$ our sequence of candidate posteriors; there is no restriction on what $Q_i$ can be. In what follows we fix a filtration $(\mathcal{F}_i)_{i \ge 0}$ and we denote by KL the Kullback-Leibler divergence between two distributions. We consider a predictor space $\mathcal{H}$ and a loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$ bounded by a real constant $K > 0$. This loss defines the (potentially moving) learning objective. We denote by $\mathcal{M}_1(\mathcal{H})$ the set of all probability distributions on $\mathcal{H}$. We now introduce the notion of stochastic kernel [Rivasplata et al., 2020], which properly formalises data-dependent measures within the PAC-Bayes framework. First, for a fixed predictor space $\mathcal{H}$, we set $\Sigma_{\mathcal{H}}$ to be the considered $\sigma$-algebra on $\mathcal{H}$.

Definition 2.1 (Stochastic kernels). A stochastic kernel from $\mathcal{Z}^m$ to $\mathcal{H}$ is defined as a mapping $Q : \mathcal{Z}^m \times \Sigma_{\mathcal{H}} \to [0, 1]$ where:
- for any $B \in \Sigma_{\mathcal{H}}$, the function $S = (z_1, \ldots, z_m) \mapsto Q(S, B)$ is measurable;
- for any $S \in \mathcal{Z}^m$, the function $B \mapsto Q(S, B)$ is a probability measure over $\mathcal{H}$.

We denote by $\mathrm{Stoch}(\mathcal{Z}^m, \mathcal{H})$ the set of all stochastic kernels from $\mathcal{Z}^m$ to $\mathcal{H}$, and for a fixed $S$ we set $Q_S := Q(S, \cdot)$, the data-dependent measure associated to the sample $S$ through $Q$. From now on, to refer to a distribution $Q_S$ depending on a dataset $S$, we introduce a stochastic kernel $Q(\cdot, \cdot)$ such that $Q_S = Q(S, \cdot)$. Note that this notation is perfectly suited to the case when $Q_S$ is obtained from an algorithmic procedure $A$: in this case the stochastic kernel $Q$ of interest is the learning algorithm $A$. We use this notion to characterise our sequence of priors.

Definition 2.2. We say that a sequence of stochastic kernels $(P_i)_{i=1..m}$ is an online predictive sequence if (i) for all $i \ge 1$ and $S \in \mathcal{Z}^m$, $P_i(S, \cdot)$ is $\mathcal{F}_{i-1}$-measurable, and (ii) for all $i \ge 2$, $P_i(S, \cdot) \ll P_{i-1}(S, \cdot)$.

Note that (ii) implies that for all $i$, $P_i(S, \cdot) \ll P_1(S, \cdot)$, with $P_1(S, \cdot)$ a data-free measure (yet a classical prior in the PAC-Bayesian theory). We can now state our main result.

Theorem 2.3. For any distribution $\mu$ over $\mathcal{Z}^m$, any $\lambda > 0$, any online predictive sequence (used as priors) $(P_i)$ and any sequence of stochastic kernels $(Q_i)$, we have with probability $1-\delta$ over the sample $S \sim \mu$, for the data-dependent measures $Q_{i,S} := Q_i(S, \cdot)$ and $P_{i,S} := P_i(S, \cdot)$:

$$\sum_{i=1}^m \mathbb{E}_{h_i \sim Q_{i,S}}\big[\mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}]\big] \;\le\; \sum_{i=1}^m \mathbb{E}_{h_i \sim Q_{i,S}}[\ell(h_i, z_i)] + \sum_{i=1}^m \frac{\mathrm{KL}(Q_{i,S} \,\|\, P_{i,S})}{\lambda} + \frac{\lambda m K^2}{2} + \frac{\log(1/\delta)}{\lambda}.$$

Remark 2.4. For the sake of clarity, we assimilate in what follows the stochastic kernels $Q_i, P_i$ to the data-dependent distributions $Q_i(S, \cdot), P_i(S, \cdot)$. An online predictive sequence is then also assimilated to a sequence of data-dependent distributions. Concretely, this leads to the switch of notation $Q_{i,S} \to Q_i$ in Thm. 2.3. The reason for this switch is that, even though stochastic kernels are the right theoretical structure to state our main result, we consider in Secs. 3 and 4 practical algorithmic extensions which focus only on data-dependent distributions, hence the need to lighten our notation.

The proof is deferred to Appendix D.1. See Appendix B for context and discussions.
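To make Definition 2.2 concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of one admissible online predictive sequence: Gaussian priors whose mean at step $i$ is the running average of the first $i-1$ data points, so each $P_i(S, \cdot)$ only depends on $z_1, \ldots, z_{i-1}$ and is therefore $\mathcal{F}_{i-1}$-measurable, while the shared covariance makes all priors mutually absolutely continuous.

```python
import numpy as np

def online_predictive_sequence(S, sigma=1.0):
    """Build the (mean, covariance) parameters of priors P_1, ..., P_m.

    P_1 is data-free (zero mean); for i >= 2, the mean of P_i is the running
    average of z_1, ..., z_{i-1}, so P_i(S, .) is F_{i-1}-measurable.
    All P_i are Gaussians with the same covariance, hence condition (ii)
    of Definition 2.2 (absolute continuity) holds.
    """
    m, d = S.shape
    means = [np.zeros(d)]                       # P_1: data-free prior
    for i in range(2, m + 1):
        means.append(S[:i - 1].mean(axis=0))    # uses z_1, ..., z_{i-1} only
    return [(mu, sigma**2 * np.eye(d)) for mu in means]

S = np.random.default_rng(0).normal(size=(5, 2))   # toy sample: m = 5, d = 2
for i, (mu, _) in enumerate(online_predictive_sequence(S), start=1):
    print(f"P_{i} mean: {np.round(mu, 3)}")
```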
A batch-to-online conversion. First, we remark that our bound slightly exceeds the OL framework: that framework would require our posterior sequence to be an online predictive sequence as well, which is not the case here (for any $i$, the distribution $Q_{i,S}$ can depend on the whole dataset $S$). This is a consequence of our proof method (see Appendix D.1), which is classically known as a "batch-to-online" conversion (as opposed to the "online-to-batch" procedures, as in Dekel and Singer, 2005). In other words, we exploited PAC-Bayesian tools designed for a fixed batch of data to obtain a dynamic result. This is why we refer to our bound as online: it allows us to consider sequences of priors and posteriors that can dynamically evolve.

Analysis of the different terms in the bound. Our PAC-Bayesian bound formally differs in many points from classical ones. On the left-hand side, the sum of the averaged expected losses conditioned on the past appears. Having such a sum of expectations instead of a single one is necessary to assess the quality of all our predictions: because data may be dependent, one cannot consider a single expectation as in the iid case. We also stress that taking an online predictive sequence as priors leads to controlling losses conditioned on the past, which differs from classical PAC-Bayes results designed to bound the expected loss. This term, while original in the PAC-Bayesian framework (to the best of our knowledge), recently appeared (in a modified form) in Wintenberger [2021, Prop. 3]. See Appendix B.2 for further discussion. On the right-hand side, online counterparts of classical PAC-Bayes terms appear. At time $i$, the measure $Q_i$ (i.e. $Q_{i,S}$, according to Remark 2.4) has a tradeoff to achieve between an overfitted prediction of $z_i$ (the case $Q_i = \delta_{z_i}$, where $\delta$ is a Dirac measure) and a too weak impact of the new data with regard to our prior knowledge (the case $Q_i = P_i$). The quantity $\lambda > 0$ can be seen as a regulariser adjusting the relative impact of both terms.

Influence of $\lambda$. The quantity $\lambda$ also plays a crucial role in the bound, as it is involved in an explicit tradeoff between the KL terms, the confidence term $\log(1/\delta)$ and the residual term $\lambda m K^2/2$. This idea of seeing $\lambda$ as a trading parameter is not new [Thiemann et al., 2017, Germain et al., 2016]. However, the results from Thiemann et al. [2017] hold w.p. $1-\delta$ simultaneously for all $\lambda$, while ours and those of Germain et al. [2016] hold for each fixed $\lambda$ w.p. $1-\delta$, which is weaker and requires discretising $\mathbb{R}_+$ onto a grid to estimate the optimal $\lambda$.
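To illustrate this grid strategy (our own sketch; the paper does not spell it out), one can evaluate the residual terms on the right-hand side of Thm. 2.3 over a finite grid of $\lambda$ values, replacing $\delta$ by $\delta/|\mathrm{grid}|$ so that a union bound keeps the statement valid simultaneously over the grid:

```python
import numpy as np

def rhs_residual(lam, m, K, kl_sum, delta, grid_size=1):
    """Residual terms on the right-hand side of Thm. 2.3 for a fixed lambda.

    kl_sum stands for the sum of the KL terms; grid_size > 1 accounts for a
    union bound over a finite grid of lambdas (delta -> delta / grid_size).
    """
    return kl_sum / lam + lam * m * K**2 / 2 + np.log(grid_size / delta) / lam

m, K, kl_sum, delta = 1000, 1.0, 5.0, 0.05
grid = np.logspace(-3, 1, 50)                 # candidate lambdas
vals = [rhs_residual(lam, m, K, kl_sum, delta, grid.size) for lam in grid]
print(f"best lambda: {grid[np.argmin(vals)]:.4f}, residual: {min(vals):.2f}")
```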
We now move on to the design of online PAC-Bayesian algorithms.

## 3 An online PAC-Bayesian procedure

OL algorithms (we refer to Hazan, 2016 for an introduction to the field) produce sequences of predictors by progressively updating the current predictor (see Appendix A.1 for an example). Recall that, in the OL framework, an algorithm outputs at time $i$ a predictor which is $\mathcal{F}_{i-1}$-measurable. Here, our goal is to design an online procedure derived from Thm. 2.3 which outputs an online predictive sequence (assimilated, according to Remark 2.4, to a sequence of distributions).

Online PAC-Bayesian (OPB) training bound. We state a corollary of our main result which paves the way to an online algorithm. This constructive procedure motivates the name online PAC-Bayesian training bound (OPBTRAIN in short).

Corollary 3.1 (OPBTRAIN). For any distribution $\mu$ over $\mathcal{Z}^m$, any $\lambda > 0$ and any online predictive sequences $\hat{Q}, P$, the following holds with probability $1-\delta$ over the sample $S \sim \mu$:

$$\sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_{i+1}}\big[\mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}]\big] \;\le\; \sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_{i+1}}[\ell(h_i, z_i)] + \sum_{i=1}^m \frac{\mathrm{KL}(\hat{Q}_{i+1} \,\|\, P_i)}{\lambda} + \frac{\lambda m K^2}{2} + \frac{\log(1/\delta)}{\lambda}.$$

Here, $\lambda$ is seen as a scale parameter, as specified below. The proof consists in applying Thm. 2.3 with $Q_i = \hat{Q}_{i+1}$ and $P_i$ for all $i$. Note that in this case our posterior sequence is an online predictive sequence, in order to fit the OL framework. Corollary 3.1 suggests designing $\hat{Q}$ as follows, assuming we have drawn a dataset $S = \{z_1, \ldots, z_m\}$ and fixed a scale parameter $\lambda > 0$ and an online predictive sequence $(P_i)$:

$$\hat{Q}_1 = P_1, \qquad \forall i \ge 1, \quad \hat{Q}_{i+1} = \operatorname*{argmin}_{Q} \; \mathbb{E}_{h_i \sim Q}[\ell(h_i, z_i)] + \frac{\mathrm{KL}(Q \,\|\, P_i)}{\lambda}, \tag{1}$$

which leads to the explicit formulation

$$\frac{d\hat{Q}_{i+1}}{dP_i}(h) = \frac{\exp(-\lambda \ell(h, z_i))}{\mathbb{E}_{h \sim P_i}[\exp(-\lambda \ell(h, z_i))]}. \tag{2}$$

Thus the formulation of Eq. (2), which has been highlighted by Catoni [2003, Sec. 5.1], shows that our online procedure produces Gibbs posteriors. So, PAC-Bayesian theory provides sound justification for the somewhat intuitive online procedure in Eq. (1): at time $i$, we adjust our new measure $\hat{Q}_{i+1}$ by optimising a tradeoff between the impact of the newly arrived data $z_i$ and that of our prior knowledge. Notice that $\hat{Q}$ is an online predictive sequence: $\hat{Q}_i$ is $\mathcal{F}_{i-1}$-measurable for all $i$, as it depends only on $\hat{Q}_{i-1}$ and $z_{i-1}$. Furthermore, one has $\hat{Q}_i \ll \hat{Q}_{i-1}$ for all $i$, as $\hat{Q}_i$ is defined as an argmin and the KL term is finite if and only if $\hat{Q}_i$ is absolutely continuous w.r.t. $\hat{Q}_{i-1}$.

Remark 3.2. In Corollary 3.1, while the right-hand side is the reason we considered Eq. (1), the left-hand side still needs to be analysed. It expresses how well the posterior $\hat{Q}_{i+1}$ (designed from $\hat{Q}_i, z_i$) generalises on average to any new draw of $z_i$. More precisely, this term measures how much the training of $\hat{Q}_{i+1}$ is overfitting on $z_i$. A low value of it ensures that our online predictive sequence, which is obtained from a single dataset, is robust to the randomness of $S$, hence the interest of optimising the right-hand side of the bound. This is a supplementary reason to refer to Corollary 3.1 as an OPBTRAIN bound: it provides robustness guarantees for our training.

Online PAC-Bayesian (OPB) test bound. However, Corollary 3.1 does not say whether $\hat{Q}_{i+1}$ will produce good predictors to minimise $\ell(\cdot, z_{i+1})$, which is the objective of $\hat{Q}_{i+1}$ in the OL framework (we only have access to the past to predict the future). We then need an online PAC-Bayesian test bound (OPBTEST bound) to quantify our predictions' accuracy. We now derive an OPBTEST bound from Thm. 2.3.

Corollary 3.3 (OPBTEST). For any distribution $\mu$ over $\mathcal{Z}^m$, any $\lambda > 0$, and any online predictive sequence $(\hat{Q}_i)$, the following holds with probability $1-\delta$ over the sample $S \sim \mu$:

$$\sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_i}\big[\mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}]\big] \;\le\; \sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_i}[\ell(h_i, z_i)] + \frac{\lambda m K^2}{2} + \frac{\log(1/\delta)}{\lambda}.$$

Optimising in $\lambda$ gives $\lambda = \sqrt{2\log(1/\delta)/(m K^2)}$ and ensures that

$$\sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_i}\big[\mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}]\big] \;\le\; \sum_{i=1}^m \mathbb{E}_{h_i \sim \hat{Q}_i}[\ell(h_i, z_i)] + O\Big(\sqrt{\log(1/\delta)\, K^2 m}\Big).$$

The proof consists in applying Thm. 2.3 with $Q_i = \hat{Q}_i = P_i$ for all $i$. Corollary 3.3 quantifies how efficient our predictions will be. Indeed, the left-hand side of this bound expresses, for all $i$, how good $\hat{Q}_i$ is at predicting $z_i$ (on average), which is what $\hat{Q}_i$ is designed for. Note that the $\lambda$ involved here can differ from the scale parameter of Eq. (1): it is now a way to balance the tradeoff between the two last terms of the bound. The strength of this bound is that, since $\hat{Q}$ is an online predictive sequence, the Kullback-Leibler terms vanish, leaving terms depending only on hyperparameters.
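The update of Eq. (2) becomes a two-line computation when $\mathcal{H}$ is replaced by a finite grid of predictors, since the exponential moment is then a weighted sum. Below is a minimal sketch (our toy instantiation, not the authors' code), with the prior $P_i$ taken to be the previous posterior, the choice studied in the next paragraphs:

```python
import numpy as np

def gibbs_update(prior, losses_i, lam):
    """One step of Eq. (2) on a finite predictor grid: the density of
    Q_hat_{i+1} w.r.t. the prior P_i is exp(-lam * loss) / (exp. moment)."""
    w = prior * np.exp(-lam * losses_i)
    return w / w.sum()

rng = np.random.default_rng(0)
H = np.linspace(-1.0, 1.0, 101)              # finite grid of scalar predictors
z = rng.normal(0.3, 0.1, size=50)            # data stream
q = np.full(H.size, 1.0 / H.size)            # Q_hat_1 = P_1: uniform prior
for zi in z:
    q = gibbs_update(q, (H - zi) ** 2, lam=1.0)   # here P_i = Q_hat_i
print(f"posterior mean after m steps: {(q * H).sum():.3f}")  # close to 0.3
```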
Links with previous approaches. We now present a specific case of Corollary 3.1 where we choose as priors the online predictive sequence $\hat{Q}$ itself (i.e. in Thm. 2.3 we choose $Q_i = \hat{Q}_{i+1}$, $P_i = \hat{Q}_i$). The reason we focus on this specific case is that it enables us to build strong links between PAC-Bayes and OL. We then adapt our OPBTRAIN bound (Corollary 3.1). The online procedure becomes

$$\hat{Q}_1 = P, \qquad \forall i \ge 1, \quad \hat{Q}_{i+1} = \operatorname*{argmin}_{Q} \; \mathbb{E}_{h_i \sim Q}[\ell(h_i, z_i)] + \frac{\mathrm{KL}(Q \,\|\, \hat{Q}_i)}{\lambda}, \tag{3}$$

which leads to the explicit formulation

$$\frac{d\hat{Q}_{i+1}}{d\hat{Q}_i}(h) = \frac{\exp(-\lambda \ell(h, z_i))}{\mathbb{E}_{h \sim \hat{Q}_i}[\exp(-\lambda \ell(h, z_i))]}.$$

Links with classical PAC-Bayesian bounds. We note that the optimal measure in this case is such that, at any time $i$, $d\hat{Q}_{i+1}(h) \propto \exp(-\lambda \ell(h, z_i))\, d\hat{Q}_i(h)$, hence $d\hat{Q}_{m+1}(h) \propto \exp\big(-\lambda \sum_{i=1}^m \ell(h, z_i)\big)\, d\hat{Q}_1(h)$. One recognises, up to a multiplicative constant, the optimised measure of Catoni [2007, Th. 1.2.6], which solves

$$\operatorname*{argmin}_{Q} \; \mathbb{E}_{h \sim Q}\Big[\sum_{i=1}^m \ell(h, z_i)\Big] + \frac{\mathrm{KL}(Q \,\|\, \hat{Q}_1)}{\lambda};$$

thus one sees that in this case, the output of our online procedure after $m$ steps coincides with Catoni's output. This shows the consistency of our general procedure, which recovers a classical result within an online framework: when too many data are available, treating data sequentially until time $m$ leads to the same Gibbs posterior as if we were treating the whole dataset as a batch.

Analogy with Online Gradient Descent (OGD). We propose an analogy between the procedure of Eq. (3) and the celebrated OGD algorithm (see Appendix A.1 for a recap). First we remark that our minimisation problem is equivalent to $\operatorname{argmin}_Q \lambda\, \mathbb{E}_{h_i \sim Q}[\ell(h_i, z_i)] + \mathrm{KL}(Q \| \hat{Q}_i)$. Then we assume that for any $i$, $\hat{Q}_i = \mathcal{N}(\hat{m}_i, I_d)$ with $\hat{m}_i \in \mathbb{R}^d$, and we set $L_i(\hat{m}_i) = \mathbb{E}_{h_i \sim \hat{Q}_i}[\ell(h_i, z_i)]$. The minimisation problem becomes

$$\operatorname*{argmin}_{\hat{m}} \; \lambda L_i(\hat{m}) + \frac{1}{2}\|\hat{m} - \hat{m}_i\|^2.$$

Using the first-order Taylor expansion $L_i(\hat{m}) \approx L_i(\hat{m}_i) + \langle \hat{m} - \hat{m}_i, \nabla L_i(\hat{m}_i)\rangle$, the argmin finally transforms into the optimisation step

$$\hat{m}_{i+1} = \hat{m}_i - \lambda \nabla L_i(\hat{m}_i),$$

which is exactly OGD on the loss sequence $(L_i)$. We draw an analogy between the scale parameter $\lambda$ and the step size in OGD: the KL term translates the influence of the previous point, and the expected loss gives the gradient. This analogy has already been exploited in Shalev-Shwartz [2012], where $\mathbb{E}_{h_i \sim q_\mu}[\ell(h_i, z_i)] =: L_i(\mu)$ is approximated by $\mu^\top \nabla L_i(\mu_i)$, $q_\mu$ being their considered online predictive sequence. Finally, we remark that the optimal rate in Corollary 3.3 is a $O(\sqrt{m})$, which is comparable to the best rate of Shalev-Shwartz [2012, Eq. (2.5)] (see Proposition A.2).

Comparison with previous work. We acknowledge that the procedure of Eq. (3) already appeared in the literature. Li et al. [2018, Alg. 1] propose a Gibbs procedure somewhat similar to ours, the main difference being the addition of a surrogate of the true loss at each time step. Within the OL literature, the idea of updating measures online has been recently studied, for instance in Chérief-Abdellatif et al. [2019]. More precisely, our procedure is similar to their Streaming Variational Bayes (SVB) algorithm, a slight difference being that they approximated the expected loss similarly to Shalev-Shwartz [2012]. The guarantees Chérief-Abdellatif et al. [2019] provide for SVB hold for Gaussian priors and come at the cost of additional constraints that do not allow for arbitrary aggregation strategies, contrary to what Corollary 3.1 proposes. Their bounds are deterministic and use tools and assumptions from convex optimisation (such as convex expected losses), while ours are probabilistic and use measure-theoretic tools, which allows us to relax these assumptions.
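The equivalence with Catoni's batch Gibbs posterior noted above can be checked numerically on a finite predictor grid; the following small sketch (our own check, under the same toy setting as above) verifies that the sequential updates of Eq. (3) and the batch Gibbs posterior coincide up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.linspace(-2.0, 2.0, 201)                 # finite predictor grid
z = rng.normal(0.5, 0.2, size=100)
lam = 2.0
loss = (H[None, :] - z[:, None]) ** 2           # loss[i, j]: loss of h_j on z_i

# Sequential Gibbs updates of Eq. (3) from a uniform data-free prior.
q_seq = np.full(H.size, 1.0 / H.size)
for i in range(z.size):
    q_seq = q_seq * np.exp(-lam * loss[i])
    q_seq /= q_seq.sum()

# Catoni's batch Gibbs posterior over the whole dataset.
q_batch = np.exp(-lam * loss.sum(axis=0))
q_batch /= q_batch.sum()

print(f"max deviation: {np.abs(q_seq - q_batch).max():.2e}")   # ~ 1e-16
```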
Strength of our result. We emphasise two points. First, to the best of our knowledge, Corollary 3.1 is the first bound which theoretically suggests Eq. (3) as a learning algorithm. Second, we stress that Eq. (3) is a particular case of Corollary 3.1, and our result can lead to other fruitful routes. For instance, consider the idea of adding noise to our measures at each time step to avoid overfitting (this idea has been used e.g. in Neelakantan et al., 2015 in the context of deep neural networks): if our online predictive sequence $(\hat{Q}_i)$ can be defined through a sequence of parameter vectors $\hat{\mu}$, then we can define $P_i$ by adding a small noise to $\hat{\mu}_i$, thus giving more freedom through stochasticity. Our procedure thus led us to the Gibbs posteriors of Catoni. However, in practice, Gaussian distributions are preferred [e.g. Dziugaite and Roy, 2017, Rivasplata et al., 2019, Perez-Ortiz et al., 2021b,a, Pérez-Ortiz et al., 2021]. That is why we focus next on new online PAC-Bayesian algorithms involving Gaussian distributions.

## 4 Disintegrated online algorithms for Gaussian distributions

We dig deeper into the field of disintegrated PAC-Bayesian bounds, originally explored by Catoni [2007], Blanchard and Fleuret [2007], further studied by Alquier and Biau [2013], Guedj and Alquier [2013] and recently developed by Rivasplata et al. [2020], Viallard et al. [2021] (see Appendix C for a short presentation of the bounds we adapted and used). The strength of the disintegrated approach is that we directly have guarantees on the random draw of a single predictor, which avoids considering expectations over the predictor space. This fact is particularly significant in our work, as the procedure given in Eq. (2) requires the estimation of an exponential moment to be efficient, which may be costly. We show that disintegrated PAC-Bayesian bounds can be adapted to the OL framework, and that they have the potential to generate proper online algorithms with low computational cost and sound efficiency guarantees.

Online PAC-Bayesian disintegrated (OPBD) training bounds. We present a general form for online PAC-Bayesian disintegrated (OPBD) training bounds. The terminology comes from the way we craft those bounds: starting from PAC-Bayesian disintegrated bounds, we use the same tools as in Thm. 2.3 to create the first online PAC-Bayesian disintegrated bounds. OPBD training bounds have the following form: for any online predictive sequences $\hat{Q}, P$ and any $\lambda > 0$, w.p. $1-\delta$ over $S \sim \mu$ and $(h_1, \ldots, h_m) \sim \hat{Q}_2 \otimes \cdots \otimes \hat{Q}_{m+1}$,

$$\sum_{i=1}^m \mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}] \;\le\; \sum_{i=1}^m \ell(h_i, z_i) + \sum_{i=1}^m \Psi(h_i, \hat{Q}_{i+1}, P_i) + \Phi(m), \tag{4}$$

with $\Psi, \Phi$ being real-valued functions. $\Psi$ controls the behaviour of $\hat{Q}_{i+1}$ w.r.t. the $\mathcal{F}_{i-1}$-measurable prior $P_i$: if $\Psi$ has no dependency on $h_i$, this control is global, otherwise it is local. Note that those functions may also depend on $\lambda, \delta$; however, since these are fixed parameters, we do not make the dependencies explicit. Similarly to Corollary 3.1, this kind of bound allows us to derive a learning algorithm (cf. Algorithm 1) which outputs an online predictive sequence $\hat{Q}$. Finally, we draw $(h_1, \ldots, h_m) \sim \hat{Q}_2 \otimes \cdots \otimes \hat{Q}_{m+1}$ (and not $\hat{Q}_1 \otimes \cdots \otimes \hat{Q}_m$) since an OPBD training bound is designed to theoretically justify an OPBD procedure, in the same way Corollary 3.1 justified Eq. (1).
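To illustrate the computational point made above, here is a toy sketch (ours, under arbitrary toy choices of loss and distribution) contrasting the Monte Carlo estimation of the expected-loss term appearing in Eq. (1) with the single draw used by a disintegrated procedure: the former needs many loss evaluations per time step, the latter only one.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 10, 0.5
w = np.zeros(d)                                 # mean of Q = N(w, sigma^2 I_d)
z = rng.normal(size=d)
loss = lambda h: np.clip((h - z) ** 2, 0.0, 1.0).mean()   # bounded toy loss

# OPB-style term: E_{h ~ Q}[loss(h, z)], estimated by Monte Carlo.
mc = np.mean([loss(w + sigma * rng.normal(size=d)) for _ in range(1000)])

# OPBD-style term: a single draw h = w + eps (one loss evaluation).
single = loss(w + sigma * rng.normal(size=d))
print(f"Monte Carlo estimate: {mc:.3f} | single disintegrated draw: {single:.3f}")
```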
Why focus on Gaussian measures? The reason is that a Gaussian variable $h \sim \mathcal{N}(w, \sigma^2 I_d)$ can be written as $h = w + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)$, and this expression totally defines $h$ ($I_d$ being the identity matrix).

A general OPBD algorithm for Gaussian measures with fixed variance. We use an idea presented in Viallard et al. [2021], which restricts the measure set to Gaussians on $\mathbb{R}^d$ with known and fixed covariance matrix $\sigma^2 I_d$. We then present in Algorithm 1 a general algorithm (derived from an OPBD training bound) for Gaussian measures with fixed variance, which outputs a sequence of Gaussians $\hat{Q}_i = \mathcal{N}(\hat{w}_i, \sigma^2 I_d)$ from a prior sequence $P_i = \mathcal{N}(w'_i, \sigma^2 I_d)$, where for each $i$, $w'_i$ is $\mathcal{F}_{i-1}$-measurable. Because the variance is fixed, each distribution is uniquely defined by its mean; we thus identify $\hat{Q}_i$ with $\hat{w}_i$, and $P_i$ with $w'_i$.

```
Algorithm 1: A general OPBD algorithm for Gaussian measures with fixed variance.
  Parameters:     time horizon m, scale parameter λ
  Initialisation: variance σ², initial mean ŵ_1 ∈ R^d
  for each iteration i in 1..m do
      Observe z_i, w'_i and draw ε_i ~ N(0, σ² I_d)
      ŵ_{i+1} := argmin_{w ∈ R^d}  ℓ(w + ε_i, z_i) + Ψ(w + ε_i, w, w'_i)
  Return (ŵ_i)_{i=1..m+1}
```

At each time $i$, Algorithm 1 requires the draw of $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_d)$. Doing so, we generate the randomness of $h_i$ (because our bound holds for a single draw of $(h_1, \ldots, h_m) \sim \hat{Q}_2 \otimes \cdots \otimes \hat{Q}_{m+1}$); we then write $h_i = w + \varepsilon_i$ and optimise w.r.t. $w$ to find $\hat{w}_{i+1}$.

Bounds of interest. We present two possible choices of pairs $(\Psi, \Phi)$, derived from the disintegrated results presented in Appendix C. Doing so, we make explicit two ready-to-use declinations of Algorithm 1.

Corollary 4.1. For any distribution $\mu$ over $\mathcal{Z}^m$, any online predictive sequences of Gaussian measures with fixed variance $\hat{Q}_i = \mathcal{N}(\hat{w}_i, \sigma^2 I_d)$ and $P_i = \mathcal{N}(w'_i, \sigma^2 I_d)$, any $\lambda > 0$, w.p. $1-\delta$ over $S \sim \mu$ and $(h_i = \hat{w}_{i+1} + \varepsilon_i)_{i=1..m} \sim \hat{Q}_2 \otimes \cdots \otimes \hat{Q}_{m+1}$, the bound of Eq. (4) holds for the two following pairs $(\Psi, \Phi)$:

$$\Psi_1(h_i, \hat{w}_{i+1}, w'_i) = \frac{\|\hat{w}_{i+1} + \varepsilon_i - w'_i\|^2 - \|\varepsilon_i\|^2}{2\sigma^2 \lambda}, \qquad \Phi_1(m) = \frac{\lambda m K^2}{2} + \frac{\log(1/\delta)}{\lambda}, \tag{5}$$

$$\Psi_2(h_i, \hat{w}_{i+1}, w'_i) = \frac{\|\hat{w}_{i+1} - w'_i\|^2}{2\sigma^2 \lambda}, \qquad \Phi_2(m) = \lambda m K^2 + \frac{3\log(1/\delta)}{\lambda}, \tag{6}$$

where the indices 1, 2 denote whether the functions have been derived from the adapted theorems of Rivasplata et al. [2020] or Viallard et al. [2021], recalled in Appendix C. We can then use Algorithm 1 with Eq. (5) or Eq. (6). The proof is deferred to Appendix D.2. Note that in Corollary 4.1 we identified $\hat{Q}_i$ with $\hat{w}_i$, and that in the last formula $\Psi_2$ has no dependency on $h_i$.

Comparison with Eq. (1). The main difference with Eq. (1), brought by the disintegrated framework, is that the optimisation route does not include an expectation within the optimisation objective. The main advantage is a weaker computational cost when we restrict ourselves to Gaussian distributions. The main weakness is a lack of stability, as our algorithm now depends at time $i$ on $\ell(w + \varepsilon_i, z_i)$, hence on $\varepsilon_i$ directly. We note that Eq. (5) is less stable than Eq. (6), as it involves another dependency on $\varepsilon_i$ through $\Psi_1$. The reason is that Rivasplata et al. [2020] proposed a bound involving a disintegrated KL divergence, while Viallard et al. [2021] proposed a result involving a Rényi divergence, avoiding a dependency on $\varepsilon_i$. We refer to Appendix C for a detailed statement of those properties.
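Below is one possible Python instantiation of Algorithm 1 (a sketch of ours, not the authors' code): we take the prior mean $w'_i$ to be the previous mean $\hat{w}_i$ (one admissible $\mathcal{F}_{i-1}$-measurable choice), assume a differentiable loss, and approximate the inner argmin with a few gradient steps; the penalty gradients correspond to the $\Psi_1$ and $\Psi_2$ terms above.

```python
import numpy as np

def opbd_gaussian(zs, loss_grad, d, lam, sigma, psi="psi2",
                  inner_steps=20, lr=0.05, seed=0):
    """Sketch of Algorithm 1 for Gaussians N(w_i, sigma^2 I_d).

    The prior mean w'_i is the previous mean, and the inner argmin is
    approximated by a few gradient steps (a solver choice of ours)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                               # initial mean w_1
    means = [w.copy()]
    for z in zs:
        eps = sigma * rng.normal(size=d)          # randomness of h_i = w + eps
        w_prior = w.copy()
        for _ in range(inner_steps):
            g = loss_grad(w + eps, z)             # gradient of the loss term
            if psi == "psi1":                     # disintegrated KL, Eq. (5)
                g = g + (w + eps - w_prior) / (lam * sigma**2)
            else:                                 # Renyi-2 penalty, Eq. (6)
                g = g + (w - w_prior) / (lam * sigma**2)
            w = w - lr * g
        means.append(w.copy())
    return means

# Toy usage: online least squares with z_i = (x_i, y_i).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=200)
sq_grad = lambda h, z: -2.0 * (z[1] - z[0] @ h) * z[0]
means = opbd_gaussian(list(zip(X, y)), sq_grad, d=5, lam=50.0, sigma=0.1)
print(np.round(means[-1], 2))                     # roughly the all-ones vector
```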
Comparison with van der Hoeven et al. [2018]. Theorem 3 of van der Hoeven et al. [2018] recovers OGD from the exponential weights algorithm by taking a sequence of moving distributions that are Gaussians with fixed variance, which is exactly what we consider here. From this, they retrieve the classical OGD algorithm as well as its classical convergence rate. Let us compare our results with theirs. First, if we fix a single step in their bound and make two traditional assumptions for OGD (a finite diameter $D$ of the convex set and a uniform bound $G$ on the loss gradients), we recover for OGD (greedy GD in van der Hoeven et al., 2018) a rate of $\frac{D^2}{2\sigma^2} + \frac{\sigma^2 T G^2}{2}$. This is, up to constants and notation changes, exactly our $\Psi_i$ ($i \in \{1, 2\}$). Also, we notice a difference in the way Gaussian distributions are used: Theorem 3 of van der Hoeven et al. [2018] is based on their Lemma 1, which provides guarantees for the expected regret. This is a clear incentive to consider as predictors the means of the successive Gaussians of interest. On the contrary, Corollary 4.1 involves a supplementary level of randomness by considering predictors $h_i$ drawn from our Gaussians. This additional randomness appears in our optimisation process (Algorithm 1). Finally, notice that van der Hoeven et al. [2018] based their whole work on the use of a KL divergence, while Corollary 4.1 exploits not only a disintegrated KL ($\Psi_1$) but also a Rényi $\alpha$-divergence ($\Psi_2$). Note that we propose a result only for $\alpha = 2$ for the sake of space constraints, but any other value of $\alpha$ leads to another optimisation objective to explore.

OPBD test bounds. Similarly to what we did in Sec. 3, we also provide OPBD test bounds giving efficiency guarantees for online predictive sequences (e.g. the output of Algorithm 1). Our proposed bounds have the following general form: for any online predictive sequence $\hat{Q}$ and any $\lambda > 0$, w.p. $1-\delta$ over $S$ and $(h_1, \ldots, h_m) \sim \hat{Q}_1 \otimes \cdots \otimes \hat{Q}_m$,

$$\sum_{i=1}^m \mathbb{E}[\ell(h_i, z_i) \mid \mathcal{F}_{i-1}] \;\le\; \sum_{i=1}^m \ell(h_i, z_i) + \Phi(m), \tag{7}$$

with $\Phi$ a real-valued function (possibly dependent on $\lambda, \delta$, though this is not made explicit here). Note that our predictors $(h_1, \ldots, h_m)$ are now drawn from $\hat{Q}_1 \otimes \cdots \otimes \hat{Q}_m$. Thus, the left-hand side of the bound considers an $h_i$ drawn from an $\mathcal{F}_{i-1}$-measurable distribution, evaluated on $\ell(\cdot, z_i)$: this is effectively a measure of the prediction performance. We now state a corollary which gives disintegrated guarantees for any online predictive sequence.

Corollary 4.2. For any distribution $\mu$ over $\mathcal{Z}^m$, any $\lambda > 0$, and any online predictive sequence $(\hat{Q}_i)$, with probability $1-\delta$ over the sample $S \sim \mu$ and the predictors $(h_1, \ldots, h_m) \sim \hat{Q}_1 \otimes \cdots \otimes \hat{Q}_m$, the bound of Eq. (7) holds with

$$\Phi_1(m) = \frac{\lambda m K^2}{2} + \frac{\log(1/\delta)}{\lambda}, \qquad \Phi_2(m) = 2\lambda m K^2 + \frac{\log(1/\delta)}{\lambda},$$

where the indices 1, 2 denote whether the functions have been derived from the adapted theorems of Rivasplata et al. [2020] or Viallard et al. [2021], recalled in Appendix C. The optimised $\lambda$ gives in both cases a $O(\sqrt{m \log(1/\delta)})$ rate. The proof is deferred to Appendix D.2.

## 5 Experiments

We adapt the experimental framework introduced in Chérief-Abdellatif et al. [2019, Sec. 5] to our algorithms (anonymised code available here). We conduct experiments on several real-life datasets, in classification and linear regression. Our objective is twofold: check the convergence of our learning methods and compare their efficiency with classical algorithms. We first introduce our experimental setup.

Algorithms. We consider four online methods of interest: the OPB algorithm of Eq. (3), which updates a Gibbs posterior through time, instantiated with two different priors $\hat{Q}_1$ (a Gaussian distribution and a Laplace one); and Algorithm 1 with the functions $\Psi_1, \Psi_2$ from Corollary 4.1. To assess efficiency, we implement the classical OGD (as described in Alg. 1 of Zinkevich, 2003, and sketched below) and the SVB method of Chérief-Abdellatif et al. [2019].
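For reference, here is a minimal sketch of the OGD baseline (Alg. 1 of Zinkevich, 2003) with the learning rate $\eta = 1/\sqrt{m}$ used in our parameter settings below; the projection onto a bounded convex set is kept optional in this sketch.

```python
import numpy as np

def ogd(zs, loss_grad, d, eta, radius=None):
    """Online Gradient Descent: w_{i+1} = Pi(w_i - eta * grad_w loss(w_i, z_i)),
    with Pi an optional Euclidean projection onto the ball of given radius."""
    w = np.zeros(d)
    iterates = [w.copy()]
    for z in zs:
        w = w - eta * loss_grad(w, z)
        if radius is not None:
            n = np.linalg.norm(w)
            if n > radius:                      # project back onto the ball
                w = w * (radius / n)
        iterates.append(w.copy())
    return iterates

# Toy usage on the squared loss of the regression experiments below.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
zs = list(zip(X, X @ np.ones(3)))
grad = lambda w, z: -2.0 * (z[1] - z[0] @ w) * z[0]
ws = ogd(zs, grad, d=3, eta=1 / np.sqrt(len(zs)))
print(np.round(ws[-1], 2))                      # approaches the all-ones vector
```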
Binary classification. At each round $i$, the learner receives a data point $x_i \in \mathbb{R}^d$ and predicts its label $y_i \in \{-1, +1\}$ using $\langle x_i, h_i \rangle$, with $h_i = \mathbb{E}_{h \sim \hat{Q}_i}[h]$ for OPB methods or $h_i$ drawn under $\hat{Q}_i$ for OPBD methods. The adversary reveals the true value $y_i$, then the learner suffers the loss $\ell(h_i, z_i) = (1 - y_i \langle x_i, h_i \rangle)_+$ with $z_i = (x_i, y_i)$, where $a_+ = a$ if $a > 0$ and $a_+ = 0$ otherwise. This loss is unbounded but can be thresholded.

[Figure 1: Averaged cumulative losses for all four considered datasets. "Gibbs Gauss" denotes OPB with Gaussian prior, "Gibbs Laplace" denotes OPB with Laplace prior, "OPBD Riva" denotes OPBD with $\Psi_1$, "OPBD Via" denotes OPBD with $\Psi_2$.]

Linear regression. At each round $i$, the learner receives a set of features $x_i \in \mathbb{R}^d$ and predicts $y_i \in \mathbb{R}$ using $\langle x_i, h_i \rangle$, with $h_i = \mathbb{E}_{h \sim \hat{Q}_i}[h]$ for SVB and OPB methods or $h_i$ drawn under $\hat{Q}_i$ for OPBD methods. Then the adversary reveals the true value $y_i$ and the learner suffers the loss $\ell(h_i, z_i) = (y_i - \langle x_i, h_i \rangle)^2$ with $z_i = (x_i, y_i)$. This loss is unbounded but can be thresholded.

Datasets. We consider four real-world datasets: two for classification (Breast Cancer and Pima Indians) and two for regression (Boston Housing and California Housing). All datasets except Pima Indians have been directly extracted from sklearn [Pedregosa et al., 2011]. The Breast Cancer dataset [Street et al., 1993] is available here and comes from the UCI ML repository, as does the Boston Housing dataset [Belsley et al., 2005], which can be obtained here. The California Housing dataset [Pace and Barry, 1997] comes from the StatLib repository and is available here. Finally, the Pima Indians dataset [Smith et al., 1988] has been recovered from this Kaggle repository. Note that we randomly permuted the observations to avoid learning an irrelevant human ordering of the data (such as date or label).

Parameter settings. We ran our experiments on a 2021 MacBook Pro with an M1 chip and 16 GB RAM. For OGD, the initialisation point is $0_{\mathbb{R}^d}$ and the learning rate is set to $\eta = 1/\sqrt{m}$. For SVB, the mean is initialised to $0_{\mathbb{R}^d}$ and the covariance matrix to $\mathrm{Diag}(1)$; the step at time $i$ is $\eta_i = 0.1/\sqrt{i}$. For both OPB algorithms with Gibbs posteriors, we chose $\lambda = 1/m$; as priors, we took respectively a centred Gaussian vector with covariance matrix $\mathrm{Diag}(\sigma^2)$ ($\sigma = 1.5$) and an iid vector following the standard Laplace distribution. For the OPBD algorithm with $\Psi_1$, we chose $\lambda = 10^{-4}/m$; the initial mean is $0_{\mathbb{R}^d}$ and the fixed covariance matrix is $\mathrm{Diag}(\sigma^2)$ with $\sigma = 3 \cdot 10^{-3}$. For the OPBD algorithm with $\Psi_2$, we chose $\lambda = 2 \cdot 10^{-3}/m$; the initial mean is $0_{\mathbb{R}^d}$ and the covariance matrix is $\mathrm{Diag}(\sigma^2)$ with $\sigma = 10^{-2}$. The reason for these higher scale parameter and variance is that $\Psi_1$, from Rivasplata et al. [2020], is more stochastic (yet unstable) than $\Psi_2$, from Viallard et al. [2021].

Experimental results. For each dataset, we plot the evolution of the averaged cumulative loss $\frac{1}{t} \sum_{i=1}^t \ell(h_i, z_i)$ as a function of the step $t = 1, \ldots, m$, where $m$ is the dataset size and $h_i$ is the decision made by the learner at step $i$. The results are gathered in Fig. 1.
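For completeness, the plotted quantity is straightforward to compute from the per-round losses; a short sketch:

```python
import numpy as np

def averaged_cumulative_loss(per_round_losses):
    """(1/t) * sum_{i<=t} loss(h_i, z_i) for t = 1..m, as plotted in Fig. 1."""
    losses = np.asarray(per_round_losses, dtype=float)
    return np.cumsum(losses) / np.arange(1, losses.size + 1)

print(averaged_cumulative_loss([1.0, 0.5, 0.3]))   # [1.0, 0.75, 0.6]
```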
Empirical findings. OPB with Gaussian prior ("Gibbs Gauss") outperforms OGD on all datasets except California Housing (on which this method is not implemented), while OPB with Laplace prior ("Gibbs Laplace") always fails w.r.t. OGD. OPB methods fail to compete with SVB on the Boston Housing dataset. OPBD methods compete with SVB on regression problems and clearly outperform OGD on classification tasks. OPBD with $\Psi_2$ (labelled "OPBD Via" in Fig. 1) performs better on the California Housing dataset, while OPBD with $\Psi_1$ (labelled "OPBD Riva") is more efficient on the Boston Housing dataset. Both methods perform roughly equivalently on classification tasks. This brief experimental validation shows the consistency of all our online procedures, as we observe a visible decrease of the cumulative losses through time. It particularly shows that OPBD procedures improve on OGD on these datasets. We refer to Appendix E for an additional table gathering the error bars of our OPBD methods.

Why do we perform better than OGD? As stated in Sec. 4, OGD can be recovered as a Gaussian approximation of the exponential weights algorithm (EWA). Thus, a legitimate question is why we perform better than OGD when our OPBD methods are also based on a Gaussian surrogate of EWA. van der Hoeven et al. [2018] only used Gaussian distributions with fixed variance as a technical tool, the considered predictors being the Gaussian means. In our work, we exploited a richer characteristic of our distributions, in the sense that our predictors are points sampled from our Gaussians and not only their means. This also has consequences for our learning algorithm: at time $i$ of Algorithm 1, the optimisation step involves a noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_d)$. Thus, we believe that OPBD methods should perform at least as well as OGD. We write "at least" as we think that the higher flexibility due to this additional level of randomness might result in slightly better empirical performances, as seen on the few datasets of Fig. 1.

## 6 Conclusion

We establish links between online learning and PAC-Bayes. We show that PAC-Bayesian bounds are useful to derive new OL algorithms, and we prove sound theoretical guarantees for such algorithms. We emphasise that all of our results stand for any bounded loss; in particular, no convexity assumption is needed. Having no convexity assumption on the loss paves the way to exciting future practical studies, starting with spiking neural networks, which are investigated in an online fashion (see Lobo et al., 2020 for a recent survey). A follow-up question on the theoretical part is whether we can relax the bounded loss assumption: we leave this for future work.

Acknowledgements. We warmly thank the reviewers and the Area Chair, who provided insightful comments and suggestions which greatly helped us improve our manuscript. B.G. acknowledges partial support by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1; B.G. also acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02.

## References

P. Alquier. User-friendly introduction to PAC-Bayes bounds, 2021. URL https://arxiv.org/abs/2110.11216.

P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Research, 14(1), 2013.

P. Alquier and B. Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018. ISSN 1573-0565. URL http://dx.doi.org/10.1007/s10994-017-5690-0.

P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(236):1–41, 2016. URL http://jmlr.org/papers/v17/15-290.html.

D. A. Belsley, E. Kuh, and R. E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons, 2005.

G. Blanchard and F. Fleuret. Occam's hammer.
In International Conference on Computational Learning Theory, pages 112–126. Springer, 2007.

O. Catoni. A PAC-Bayesian approach to adaptive classification. Preprint, 840, 2003.

O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series 56. IMS, Beachwood, OH, 2007.

B.-E. Chérief-Abdellatif, P. Alquier, and M. E. Khan. A generalization bound for online variational inference. In Asian Conference on Machine Learning, pages 662–677. PMLR, 2019.

N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.

O. Dekel and Y. Singer. Data-driven online to batch conversions. Advances in Neural Information Processing Systems, 18, 2005.

G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

D. J. Foster, A. Rakhlin, and K. Sridharan. Adaptive Online Learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf.

S. Gerchinovitz. Sparsity Regret Bounds for Individual Sequences in Online Linear Regression. In S. M. Kakade and U. von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 377–396, Budapest, Hungary, 09–11 Jun 2011. PMLR. URL https://proceedings.mlr.press/v19/gerchinovitz11a.html.

P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. Advances in Neural Information Processing Systems, 29, 2016.

M. Gil, F. Alajaji, and T. Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.

B. Guedj. A Primer on PAC-Bayesian Learning. In Proceedings of the second congress of the French Mathematical Society, 2019.

B. Guedj and P. Alquier. PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7:264–291, 2013. doi: 10.1214/13-EJS771. URL https://doi.org/10.1214/13-EJS771.

B. Guedj and L. Pujol. Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds. Entropy, 23(11), 2021. ISSN 1099-4300. doi: 10.3390/e23111529. URL https://www.mdpi.com/1099-4300/23/11/1529.

M. Haddouche, B. Guedj, O. Rivasplata, and J. Shawe-Taylor. PAC-Bayes unleashed: generalisation bounds with unbounded losses. Entropy, 23(10):1330, 2021.

E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.

M. Holland. PAC-Bayes under potentially heavy tails. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2715–2724. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8539-pac-bayes-under-potentially-heavy-tails.pdf.

G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-Dependent PAC-Bayes Priors. In M. Hutter, F. Stephan, V. Vovk, and T.
Zeugmann, editors, Algorithmic Learning Theory, pages 119–133, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-16108-7.

G. Lever, F. Laviolette, and J. Shawe-Taylor. Tighter PAC-Bayes Bounds through Distribution-Dependent Priors. Theoretical Computer Science, 473:4–28, Feb. 2013. ISSN 0304-3975. URL https://doi.org/10.1016/j.tcs.2012.10.013.

L. Li, B. Guedj, and S. Loustau. A quasi-Bayesian perspective to online clustering. Electronic Journal of Statistics, 12(2):3071–3113, 2018. doi: 10.1214/18-EJS1479. URL https://doi.org/10.1214/18-EJS1479.

J. L. Lobo, J. Del Ser, A. Bifet, and N. Kasabov. Spiking neural networks and online learning: An overview and perspectives. Neural Networks, 121:88–100, 2020.

A. Maurer. A note on the PAC Bayesian theorem. arXiv preprint cs/0411099, 2004.

D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational Learning Theory, pages 230–234. ACM, 1998.

D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on Computational Learning Theory, pages 164–170. ACM, 1999.

Z. Mhammedi, P. Grünwald, and B. Guedj. PAC-Bayes Un-Expected Bernstein Inequality. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 12202–12213. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9387-pac-bayes-un-expected-bernstein-inequality.pdf.

A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

M. Perez-Ortiz, O. Rivasplata, B. Guedj, M. Gleeson, J. Zhang, J. Shawe-Taylor, M. Bober, and J. Kittler. Learning PAC-Bayes priors for probabilistic neural networks. Submitted, 2021a. URL https://arxiv.org/abs/2109.10304.

M. Perez-Ortiz, O. Rivasplata, E. Parrado-Hernandez, B. Guedj, and J. Shawe-Taylor. Progress in self-certified neural networks. In NeurIPS 2021 workshop Bayesian Deep Learning [BDL], 2021b. URL http://bayesiandeeplearning.org/2021/papers/38.pdf.

M. Pérez-Ortiz, O. Rivasplata, J. Shawe-Taylor, and C. Szepesvári. Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22, 2021.

A. Rakhlin and K. Sridharan. Online Learning with Predictable Sequences. In S. Shalev-Shwartz and I. Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 993–1019, Princeton, NJ, USA, 12–14 Jun 2013a. PMLR. URL https://proceedings.mlr.press/v30/Rakhlin13.html.

S. Rakhlin and K. Sridharan. Optimization, Learning, and Games with Predictable Sequences. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013b. URL https://proceedings.neurips.cc/paper/2013/file/f0dd4a99fba6075a9494772b58f95280-Paper.pdf.

O. Rivasplata, V. M. Tankasali, and C. Szepesvári. PAC-Bayes with backprop.
arXiv preprint arXiv:1908.07380, 2019.

O. Rivasplata, I. Kuzborskij, C. Szepesvári, and J. Shawe-Taylor. PAC-Bayes analysis beyond the usual bounds. Advances in Neural Information Processing Systems, 33:16833–16845, 2020.

Y. Seldin, N. Cesa-Bianchi, P. Auer, F. Laviolette, and J. Shawe-Taylor. PAC-Bayes-Bernstein Inequality for Martingales and its Application to Multiarmed Bandits. In D. Glowacka, L. Dorard, and J. Shawe-Taylor, editors, Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, volume 26 of Proceedings of Machine Learning Research, pages 98–111, Bellevue, Washington, USA, 02 Jul 2012a. PMLR. URL https://proceedings.mlr.press/v26/seldin12a.html.

Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian Inequalities for Martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012b. doi: 10.1109/TIT.2012.2211334.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on Computational Learning Theory, pages 2–9. ACM, 1997.

J. W. Smith, J. E. Everhart, W. Dickson, W. C. Knowler, and R. S. Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the annual symposium on computer application in medical care, page 261. American Medical Informatics Association, 1988.

W. N. Street, W. H. Wolberg, and O. L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In Biomedical image processing and biomedical visualization, volume 1905, pages 861–870. SPIE, 1993.

D. Svozil, V. Kvasnicka, and J. Pospichal. Introduction to multi-layer feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems, 39(1):43–62, 1997.

N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin. A strongly quasiconvex PAC-Bayesian bound. In International Conference on Algorithmic Learning Theory, pages 466–492. PMLR, 2017.

I. O. Tolstikhin and Y. Seldin. PAC-Bayes-Empirical-Bernstein Inequality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/a97da629b098b75c294dffdc3e463904-Paper.pdf.

D. van der Hoeven, T. van Erven, and W. Kotłowski. The many faces of exponential weights in online learning. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 2067–2092. PMLR, 06–09 Jul 2018. URL https://proceedings.mlr.press/v75/hoeven18a.html.

P. Viallard, P. Germain, A. Habrard, and E. Morvant. A general framework for the derandomization of PAC-Bayesian bounds. 2021. URL https://arxiv.org/abs/2102.08649.

O. Wintenberger. Stochastic Online Convex Optimization; Application to probabilistic time series forecasting. arXiv preprint arXiv:2102.00729, 2021.

T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning, pages 449–457. PMLR, 2016.

L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic Regret of Strongly Adaptive Methods. In J. Dy and A.
Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5882–5891. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/zhang18o.html.

P. Zhao, Y.-J. Zhang, L. Zhang, and Z.-H. Zhou. Dynamic Regret of Convex and Smooth Functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12510–12520. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/939314105ce8701e67489642ef4d49e8-Paper.pdf.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 928–936, 2003.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] From the abstract to the conclusion, we made explicit our main assumption (bounded loss).
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Due to the theoretical nature of our contribution, we do not foresee immediate societal impact. If anything, we certainly hope a better theoretical understanding of online learning algorithms can foster a more responsible use by practitioners.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] For all theorems and corollaries.
   (b) Did you include complete proofs of all theoretical results? [Yes] But deferred to Appendix D due to space constraints.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have included the URL in Sec. 5; note that this is an anonymous repository.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All details are gathered in Sec. 5 so that all readers can replicate our experiments and validate our findings.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] We chose to consider the classical online gradient descent as main comparison.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We specified the machine we used in the Parameter settings paragraph of Sec. 5.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We specified the origin of our datasets in the Datasets paragraph of Sec. 5.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] All the original code is available on the anonymised repository mentioned in Sec. 5.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]