# Online Classification with Predictions

Vinod Raman
Department of Statistics, University of Michigan, Ann Arbor, MI 48104
vkraman@umich.edu

Ambuj Tewari
Department of Statistics, University of Michigan, Ann Arbor, MI 48104
tewaria@umich.edu

Abstract

We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worst-case analysis by using machine-learned predictions and distributional assumptions respectively.

1 Introduction

In online classification, Nature plays a game with a learner over $T \in \mathbb{N}$ rounds. In each round $t \in [T]$, Nature selects a labeled example $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ and reveals just the example $x_t$ to the learner. The learner, using the history of the game $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$ and the current example $x_t$, makes a potentially randomized prediction $\hat{y}_t \in \mathcal{Y}$. Finally, Nature reveals the true label $y_t$ and the learner suffers the loss $\mathbb{1}\{\hat{y}_t \neq y_t\}$. Given access to a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ consisting of functions $h : \mathcal{X} \to \mathcal{Y}$, the goal of the learner is to minimize its regret: the difference between its cumulative mistakes and that of the best fixed hypothesis $h \in \mathcal{H}$ in hindsight. We say a class $\mathcal{H}$ is online learnable if there exists a learning algorithm that achieves vanishing average regret for any, potentially adversarially chosen, stream of labeled examples $(x_1, y_1), \dots, (x_T, y_T)$. Canonically, one also distinguishes between the realizable and agnostic settings. In the realizable setting, Nature must choose a stream $(x_1, y_1), \dots, (x_T, y_T)$ such that there exists an $h \in \mathcal{H}$ for which $h(x_t) = y_t$ for all $t \in [T]$. On the other hand, in the agnostic setting, no such assumption on the stream is placed.

Due to applications in spam filtering, image recognition, and language modeling, online classification has had a long, rich history in statistical learning theory. In a seminal work, Littlestone [1987] provided a sharp quantitative characterization of which binary hypothesis classes $\mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}$ are online learnable in the realizable setting. This characterization was in terms of the finiteness of a combinatorial dimension called the Littlestone dimension. Twenty-two years later, Ben-David et al. [2009] proved that the Littlestone dimension continues to characterize the online learnability of binary hypothesis classes in the agnostic setting. Later, Daniely et al. [2011] generalized the Littlestone dimension to multiclass hypothesis classes $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, and showed that it fully characterizes multiclass online learnability when the label space $\mathcal{Y}$ is finite. More recently, Hanneke et al. [2023] extended this result to show that the multiclass Littlestone dimension continues to characterize multiclass online learnability even when $\mathcal{Y}$ is unbounded. While elegant, the characterization of online classification in terms of the Littlestone dimension is often interpreted as an impossibility result [Haghtalab, 2018].
Indeed, due to the restrictive nature of the Littlestone dimension, even simple classes like the one-dimensional thresholds $\mathcal{H}_{\mathrm{thresh}} = \{x \mapsto \mathbb{1}\{x \geq a\} : a \in \mathbb{N}\}$ are not online learnable in the realizable setting. This hardness arises mainly due to a worst-case analysis: the adversary is allowed to choose any sequence of labeled examples, even possibly adapting to the learner's strategy. In many situations, however, the sequence of data is easy and a worst-case analysis is too pessimistic. For example, if one were to use daily temperatures to predict snowfall, it is unlikely that temperatures will vary rapidly within a given week. Moreover, one might have access to temperature forecasting models that can accurately predict future temperatures. This motivates a beyond-worst-case analysis of online classification algorithms that proves guarantees adapting to the easiness of the example stream.

The push for a beyond-worst-case analysis has its roots in classical algorithm design [Roughgarden, 2021]. Of recent interest is Algorithms with Predictions (AwP), a specific sub-field of beyond-worst-case analysis of algorithms [Mitzenmacher and Vassilvitskii, 2022]. Here, classical algorithms are given additional information about the problem instance in the form of machine-learned predictions. Augmented with these predictions, the algorithm's goal is to perform optimally on a per-input basis when the predictions are good (known as consistency), while always ensuring optimal worst-case guarantees (known as robustness). Ideally, algorithms are also smooth, obtaining performance guarantees that interpolate between instance and worst-case optimality as a function of prediction quality. After a successful application to learned index structures [Kraska et al., 2018], there has been an explosion of work designing algorithms whose guarantees depend on the quality of available, machine-learned predictions [Mitzenmacher and Vassilvitskii, 2022]. For example, machine-learned predictions have been used to achieve more efficient data structures [Lin et al., 2022], faster runtimes [Chen et al., 2022, Ergun et al., 2021], better accuracy-space tradeoffs for streaming algorithms [Hsu et al., 2019], and improved performance bounds for online algorithms [Purohit et al., 2018]. Despite this vast literature, the accuracy benefits of machine-learned predictions for online classification are, to the best of our knowledge, unknown.

In this work, we bridge the gap between AwP and online classification. In contrast to previous work, which goes beyond a worst-case analysis in online classification through smoothness or other distributional assumptions [Haghtalab et al., 2020, Block et al., 2022, Wu et al., 2023], we give the learner access to a Predictor, a forecasting algorithm that predicts future examples in the data stream. The learner, before predicting a label $\hat{y}_t$, can query the Predictor and receive predictions $\hat{x}_{t+1}, \dots, \hat{x}_T$ of the future examples. The learner can then use the history of the game $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$, the current example $x_t$, and the predictions $\hat{x}_{t+1}, \dots, \hat{x}_T$ to output a label $\hat{y}_t$. We allow Predictors to be adaptive: they can change their predictions of future examples based on the actual realizations of past examples. From this perspective, Predictors are also online learners, and we quantify the predictability of example streams through their mistake bounds.
In this work, we seek to design online learning algorithms whose expected regret, given black-box access to a Predictor, degrades gracefully with the quality of the Predictor's predictions. In doing so, we are also interested in understanding how access to a Predictor may impact the characterization of online learnability. In particular, given a Predictor, when can online learnability become easier than in the standard, worst-case setup? Guided by these objectives, we make the following contributions.

(1) In the realizable and agnostic settings, we design online learners that, using black-box access to a Predictor, adapt to the easiness of the example stream. When the predictions of the Predictor are good, our learner's expected mistakes/regret significantly improves upon the worst-case guarantee. When the Predictor's predictions are bad, the expected mistakes/regret of our learner matches the optimal worst-case expected mistake bound/regret. Finally, our learner's expected mistake bound/regret degrades gracefully with the quality of the Predictor's predictions.

(2) We show that having black-box access to a good Predictor can make learning much easier than in the standard, worst-case setting. More precisely, good Predictors allow offline learnable classes to become online learnable. In this paper, we take the offline setting to be transductive online learning [Ben-David et al., 1997, Hanneke et al., 2024], where Nature reveals the entire sequence of examples $x_1, \dots, x_T$ (but not the labels $y_1, \dots, y_T$) to the learner before the game begins. Many offline learnable classes are not online learnable. For example, when $\mathcal{Y} = \{0, 1\}$, transductive online learnability is characterized by the finiteness of the VC dimension, the same dimension that characterizes PAC learnability. Thus, our result is analogous to that in smoothed online classification, where PAC learnability is also sufficient for online learnability [Haghtalab et al., 2020, Block et al., 2022].

A notable property of our realizable and agnostic online learners is their use of black-box access to a transductive online learner to make predictions. In this sense, our proof strategies involve reducing online classification with predictions to transductive online learning. For both contributions (1) and (2), we consider only the realizable setting in the main text. The results and arguments for the agnostic setting are nearly identical and thus deferred to Appendix F.

1.1 Related Works

Online Algorithms with Predictions. Online Algorithms with Predictions (OAwP) has emerged as an important paradigm lying at the intersection of classical online algorithm design and machine learning. Many fundamental online decision-making problems, including ski rental [Gollapudi and Panigrahi, 2019, Wang et al., 2020, Bamas et al., 2020], online scheduling [Lattanzi et al., 2020, Wei and Zhang, 2020, Scully et al., 2021], online facility location [Almanza et al., 2021, Jiang et al., 2021], caching [Lykouris and Vassilvitskii, 2021, Elias et al., 2024], and metrical task systems [Antoniadis et al., 2023], have been analyzed under this framework. Recently, Elias et al. [2024] considered a model where the predictor is allowed to learn and adapt its predictions based on the observed data. This is in contrast to previous work on learning-augmented online algorithms, where predictions are made by machine learning models trained on historical data, and are thus static and non-adaptive to the current task at hand.
Elias et al. [2024] study a number of fundamental problems, like caching and scheduling, and show how explicitly designed predictors can lead to improved performance bounds. In this work, we consider a model similar to that of Elias et al. [2024], where the predictions available to the learning algorithms are not fixed, but rather adapt to the true sequence of data processed by the learning algorithm. However, unlike Elias et al. [2024], we do not hand-craft these predictions, but rather assume our learning algorithms have black-box access to a machine-learned prediction algorithm.

Transductive Online Learning. In the transductive online learning setting, Nature reveals the entire sequence of examples $x_1, \dots, x_T$ to the learner before the game begins. The goal of the learner is to predict the corresponding labels $y_1, \dots, y_T$ in order, receiving the true label $y_t$ only after making the prediction $\hat{y}_t$ for example $x_t$. First studied by Ben-David et al. [1997], recent work by Hanneke et al. [2024] has established the minimax rates on expected mistakes/regret in the realizable/agnostic settings. In the context of online classification with predictions, one can think of the transductive online learning setting as a special case where the Predictor never makes mistakes.

Smoothed Online Classification. In addition to AwP, smoothed analysis [Spielman and Teng, 2009] is another important sub-field of beyond-worst-case analysis of algorithms. By placing distributional assumptions on the input, one can typically go beyond computational and information-theoretic bottlenecks due to worst-case inputs. To this end, Rakhlin et al. [2011], Haghtalab [2018], Haghtalab et al. [2020], Block et al. [2022] consider a smoothed online classification model. Here, the adversary has to choose and draw examples from sufficiently anti-concentrated distributions. For binary classification, Haghtalab [2018] and Haghtalab et al. [2020] showed that smoothed online learnability is as easy as PAC learnability. That is, the finiteness of a smaller combinatorial parameter called the VC dimension is sufficient for smoothed online classification. In this work, we also go beyond the worst-case analysis standard in online classification, but consider a different model where the adversary is constrained to reveal a sequence of examples that is predictable. In this model, we also show that the VC dimension can be sufficient for online learnability.

2 Preliminaries

Let $\mathcal{X}$ denote an example space and $\mathcal{Y}$ denote the label space. We make no assumptions about $\mathcal{Y}$, so it can be unbounded (e.g., $\mathcal{Y} = \mathbb{N}$). Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ denote a hypothesis class. For a set $A$, let $A^* = \bigcup_{n=0}^{\infty} A^n$ denote the set of all finite sequences of elements of $A$. Moreover, we let $A^{\leq n}$ denote the set of all sequences of elements of $A$ of size at most $n$. Then, $\mathcal{X}^*$ denotes the set of all finite sequences of examples in $\mathcal{X}$, and $\mathcal{Z} \subseteq \mathcal{X}^*$ denotes a particular family of such sequences. We abbreviate a sequence $z_1, \dots, z_T$ by $z_{1:T}$. Finally, for $a, b, c \in \mathbb{R}$, we let $a \wedge b \wedge c = \min\{a, b, c\}$.

2.1 Online Classification

In online classification, a learner $\mathcal{A}$ plays a repeated game against Nature over $T \in \mathbb{N}$ rounds. In each round $t \in [T]$, Nature picks a labeled example $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ and reveals $x_t$ to the learner. The learner makes a randomized prediction $\hat{y}_t \in \mathcal{Y}$. Finally, Nature reveals the true label $y_t$ and the learner suffers the 0-1 loss $\mathbb{1}\{\hat{y}_t \neq y_t\}$.
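To fix ideas, the following minimal sketch (ours, not the paper's; all interface names are hypothetical) implements this repeated game and the mistake/regret bookkeeping, taking $\mathcal{H}$ to be a finite iterable of callables for illustration.

```python
from typing import Sequence, Tuple

def online_classification_game(learner, stream: Sequence[Tuple], H) -> dict:
    """Run the online classification protocol on a labeled stream.

    `learner` is any object exposing predict(history, x) and update(x, y);
    `stream` is the sequence (x_1, y_1), ..., (x_T, y_T) chosen by Nature.
    """
    history, mistakes = [], 0
    for (x_t, y_t) in stream:
        y_hat = learner.predict(history, x_t)   # possibly randomized prediction
        mistakes += int(y_hat != y_t)           # 0-1 loss
        learner.update(x_t, y_t)                # learner observes the true label
        history.append((x_t, y_t))
    # Regret: learner's mistakes minus those of the best fixed hypothesis in hindsight.
    best_fixed = min(sum(int(h(x) != y) for (x, y) in stream) for h in H)
    return {"mistakes": mistakes, "regret": mistakes - best_fixed}
```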
Given a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, the goal of the learner is to minimize its expected regret,

$$R_{\mathcal{A}}(T, \mathcal{H}) := \sup_{x_{1:T} \in \mathcal{X}^T}\, \sup_{y_{1:T} \in \mathcal{Y}^T}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\} - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq y_t\}\Bigg],$$

where the expectation is only over the randomness of the learner. A hypothesis class $\mathcal{H}$ is said to be online learnable if there exists a (potentially randomized) online learning algorithm $\mathcal{A}$ such that $R_{\mathcal{A}}(T, \mathcal{H}) = o(T)$. If it is guaranteed that the learner always observes a sequence of examples labeled by some hypothesis $h \in \mathcal{H}$, then we say we are in the realizable setting and the goal of the learner is to minimize its expected cumulative mistakes,

$$M_{\mathcal{A}}(T, \mathcal{H}) := \sup_{x_{1:T} \in \mathcal{X}^T}\, \sup_{h \in \mathcal{H}}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq h(x_t)\}\Bigg],$$

where again the expectation is taken only with respect to the randomness of the learner. It is well known that the finiteness of the multiclass extension of the Littlestone dimension (Ldim) characterizes realizable and agnostic online learnability [Littlestone, 1987, Daniely et al., 2011, Hanneke et al., 2023]. See Appendix A for complete definitions.

2.2 Online Classification with Predictions

Motivated by the fact that real-world example streams $x_{1:T}$ are far from worst-case, we give our learner $\mathcal{A}$ black-box access to a Predictor $\mathcal{P}$, defined algorithmically in Algorithm 1 and formally in Definition 1. In the rest of the paper, we abuse notation by not explicitly indicating that $\mathcal{P}$ takes its own past predictions as input. That is, given a sequence $x_{1:T} \in \mathcal{X}^T$, we let $\mathcal{P}(x_{1:t})$ denote its prediction on the $t$-th round.

Definition 1 (Predictor). A Predictor $\mathcal{P} : \mathcal{X}^* \times (\mathcal{X}^T)^* \to \Pi(\mathcal{X}^T)$ is a map that takes in a sequence of instances $x_1, x_2, \dots$, its own past predictions $\hat{x}^1_{1:T}, \hat{x}^2_{1:T}, \dots$, and outputs a distribution $\hat{\mu} \in \Pi(\mathcal{X}^T)$. The Predictor makes its next prediction by sampling $\hat{x}_{1:T} \sim \hat{\mu}$.

Algorithm 1 Predictor P
Input: Time horizon T
1 for t = 1, ..., T do
2   Nature reveals the true example $x_t$.
3   Observe $x_t$, update, and make a (potentially randomized) prediction $\hat{x}^t_{1:T}$.

Remark. We highlight that our Predictors are very general and can also use side information, in addition to the past examples, to make predictions about future examples. For example, if the examples are daily average temperatures, then Predictors can also use other covariates, like humidity, precipitation, and wind speed, to predict future temperatures.

In each round $t \in [T]$, the learner $\mathcal{A}$ can query the Predictor $\mathcal{P}$ to get a sense of what examples it will observe in the future. Then, the learner $\mathcal{A}$ can use the history $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$, the current example $x_t$, and the future predicted examples to classify the current example. Protocol 2 makes explicit the interaction between the learner, the Predictor, and Nature. Note that, in every round $t \in [T]$, the Predictor $\mathcal{P}$ makes a prediction about the entire sequence of $T$ examples, even those that it has observed in the past. This is mainly for notational convenience, as we assume that our Predictors are consistent.

Assumption 1 (Consistency). A Predictor is consistent if for every sequence $x_{1:T} \in \mathcal{X}^T$ and every time point $t \in [T]$, the prediction $\hat{x}_{1:T} = \mathcal{P}(x_{1:t})$ satisfies the property that $\hat{x}_{1:t} = x_{1:t}$.

Protocol 2 Online Learning with a Predictor
Input: Predictor P, Hypothesis class H, Time horizon T
1 for t = 1, ..., T do
2   Nature reveals the true example $x_t$.
3   The Predictor $\mathcal{P}$ observes $x_t$, updates, and reveals its predictions $\hat{x}^t_{1:T}$.
4   Learner makes a randomized prediction $\hat{y}_t$ using $\hat{x}^t_{1:T}$, $x_t$, and $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$.
5   Nature reveals the true label $y_t$ to the learner.
6   Learner suffers loss $\mathbb{1}\{\hat{y}_t \neq y_t\}$ and updates itself.
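The interaction in Protocol 2 can be sketched as follows. This is an illustrative rendering under assumed `predict`/`update` interfaces, not an implementation from the paper; the assertion encodes Assumption 1 (consistency).

```python
from typing import List, Sequence, Tuple

def run_protocol_2(learner, predictor, stream: Sequence[Tuple]) -> int:
    """Sketch of Protocol 2: online learning with a Predictor.

    `predictor.predict(prefix)` returns a full-horizon guess x_hat[0:T] whose
    first len(prefix) entries must equal the observed prefix (consistency).
    """
    history: List[Tuple] = []
    mistakes = 0
    for t, (x_t, y_t) in enumerate(stream):
        observed = [x for (x, _) in history] + [x_t]
        x_hat = predictor.predict(observed)              # predictions for rounds 1..T
        assert x_hat[: t + 1] == observed                # Assumption 1 (consistency)
        y_hat = learner.predict(history, x_t, x_hat[t + 1 :])  # uses future guesses
        mistakes += int(y_hat != y_t)
        learner.update(x_t, y_t)
        history.append((x_t, y_t))
    return mistakes
```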
Although stated as an assumption, it is without loss of generality that Predictors are consistent: any inconsistent Predictor can be made consistent by hard-coding its input into its output. In addition to consistency, we assume that our Predictors are lazy.

Assumption 2 (Laziness). A Predictor is lazy if for every sequence $x_{1:T} \in \mathcal{X}^T$ and every $t \in [T]$, if $\mathcal{P}(x_{1:t-1})_t = x_t$, then $\mathcal{P}(x_{1:t}) = \mathcal{P}(x_{1:t-1})$. That is, $\mathcal{P}$ does not change its prediction if it is correct.

Since Predictors are also online learners, the assumption of laziness is also mild: non-lazy online learners can be generically converted into lazy ones [Littlestone, 1987, 1989]. We always assume that Predictors are consistent and lazy and drop these qualifiers for the rest of the paper.

Remark. We highlight that Predictors are adaptive and change their predictions based on the realizations of past examples. This is in contrast to the existing literature in OAwP, where machine-learned predictions are often static. Nevertheless, our framework is more general and captures the setting where predictions of examples are made once and fixed throughout the game. Indeed, consider the consistent, lazy Predictor that fixes a sequence $z_{1:T} \in \mathcal{X}^T$ before the game begins, and for every $t \in [T]$, outputs the predictions $\hat{x}^t_{1:T}$ such that $\hat{x}^t_{1:t} = x_{1:t}$ and $\hat{x}^t_{t+1:T} = z_{t+1:T}$ (see the sketch after Section 2.3 below).

Ideally, when given access to a Predictor $\mathcal{P}$, the expected regret of $\mathcal{A}$ should degrade gracefully with the quality of $\mathcal{P}$'s predictions. To this end, we quantify the performance of a Predictor $\mathcal{P}$ through

$$M_{\mathcal{P}}(x_{1:T}) := \mathbb{E}\Bigg[\sum_{t=2}^{T} \mathbb{1}\{\mathcal{P}(x_{1:t-1})_t \neq x_t\}\Bigg],$$

the expected number of mistakes that $\mathcal{P}$ makes on a sequence of examples $x_{1:T} \in \mathcal{X}^T$. In Section 3, we design an online learner whose expected regret/mistake bound on a stream $(x_1, y_1), \dots, (x_T, y_T)$ can be written in terms of $M_{\mathcal{P}}(x_{1:T})$.

2.3 Predictability

Predictors and their mistake bounds offer us the ability to define and quantify a notion of easiness for example streams $x_{1:T}$. In particular, we can distinguish between example streams that are predictable and unpredictable. To do so, let $\mathcal{Z} \subseteq \mathcal{X}^*$ denote a collection of finite sequences of examples. By restricting Nature to playing example streams in $\mathcal{Z}$, we can define the analogous notions of minimax expected regret,

$$R_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) := \sup_{x_{1:T} \in \mathcal{Z}}\, \sup_{y_{1:T} \in \mathcal{Y}^T}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\} - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq y_t\}\Bigg],$$

and minimax expected mistakes,

$$M_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) := \sup_{x_{1:T} \in \mathcal{Z}}\, \sup_{h \in \mathcal{H}}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq h(x_t)\}\Bigg].$$

As usual, we say that a tuple $(\mathcal{H}, \mathcal{Z})$ is online and realizable online learnable if $\inf_{\mathcal{A}} R_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) = o(T)$ and $\inf_{\mathcal{A}} M_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) = o(T)$ respectively. If $\mathcal{Z} = \mathcal{X}^*$, then the definitions above recover the standard, worst-case online classification setup. However, in the more general case where $\mathcal{Z} \subsetneq \mathcal{X}^*$, we can use the existence of good Predictors $\mathcal{P}$ and their mistake bounds to quantify the "easiness" of a stream class $\mathcal{Z}$. That is, we say $\mathcal{Z}$ is predictable if there exists a consistent, lazy Predictor $\mathcal{P}$ such that $M_{\mathcal{P}}(T, \mathcal{Z}) := \sup_{x_{1:T} \in \mathcal{Z}} M_{\mathcal{P}}(x_{1:T}) = o(T)$.

Definition 2 (Predictability). A class $\mathcal{Z} \subseteq \mathcal{X}^*$ is predictable if and only if $\inf_{\mathcal{P}} M_{\mathcal{P}}(T, \mathcal{Z}) = o(T)$.

Definition 2 provides a qualitative definition of what it means for a sequence of examples to be predictable, and therefore "easy". If $\mathcal{Z} \subseteq \mathcal{X}^*$ is a predictable class of example streams, then a stream $x_{1:T} \in \mathcal{X}^T$ is predictable if $x_{1:T} \in \mathcal{Z}$. By having access to a good Predictor, sequences of examples that previously exhibited worst-case behavior now become predictable.
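As a concrete instance of the Remark above, here is a minimal sketch (ours; names are hypothetical) of the fixed-sequence Predictor, together with an empirical counter matching the definition of $M_{\mathcal{P}}(x_{1:T})$ for deterministic Predictors.

```python
from typing import List, Sequence

class StaticPredictor:
    """Fixed-sequence Predictor from the Remark: commits to z_{1:T} up front."""

    def __init__(self, z: Sequence):
        self.z = list(z)  # committed guesses for all T rounds

    def predict(self, observed_prefix: Sequence) -> List:
        # Consistency: echo the observed prefix x_{1:t}, then append the
        # committed suffix z_{t+1:T}. Laziness holds because the output only
        # changes at positions where a past guess turned out to be wrong.
        t = len(observed_prefix)
        return list(observed_prefix) + self.z[t:]

def num_mistakes(predictor, x: Sequence) -> int:
    """Count of rounds t >= 2 where the prediction made after seeing
    x_{1:t-1} got position t wrong, matching M_P(x_{1:T})."""
    return sum(
        int(predictor.predict(x[: t - 1])[t - 1] != x[t - 1])
        for t in range(2, len(x) + 1)
    )
```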
One natural predictable collection of streams are those induced by easy-to-learn discrete-time dynamical systems [Raman et al., 2024]. That is, let $\mathcal{X}$ be the state space for a finite collection $\mathcal{G}$ of transition functions. Then, given an initial state $x_0 \in \mathcal{X}$, one can consider the stream class $\mathcal{Z}$ to be the set of all trajectories induced by transition functions in $\mathcal{G}$. In Section 3, we show that for such classes of predictable examples, offline learnability is sufficient for online learnability.

2.4 Offline Learnability

In the classical analysis of online algorithms, one competes against the best offline solution. In the context of online classification, this amounts to comparing online learnability to offline learnability, where we interpret the offline setting as the case where Nature reveals the sequence of examples $x_1, \dots, x_T$ before the game begins. In particular, compared to the standard online learning setting, in the "offline" version, the learner knows the sequence of examples $x_1, \dots, x_T$ before the game begins, and its goal is to predict the corresponding labels $y_1, \dots, y_T$. This setting was recently named "Transductive Online Learning" [Hanneke et al., 2024], and the minimax rates in both the realizable and agnostic settings have been established [Ben-David et al., 1997, Hanneke et al., 2023]. For the remainder of the paper, we will use offline and transductive online learnability interchangeably. For a randomized offline learner $\mathcal{B}$, we let

$$R_{\mathcal{B}}(T, \mathcal{H}) := \sup_{x_{1:T} \in \mathcal{X}^T}\, \sup_{y_{1:T} \in \mathcal{Y}^T}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{B}_{x_{1:T}}(x_t) \neq y_t\} - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq y_t\}\Bigg]$$

denote its minimax expected regret and

$$M_{\mathcal{B}}(T, \mathcal{H}) := \sup_{h \in \mathcal{H}}\, \sup_{x_{1:T} \in \mathcal{X}^T}\, \mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{B}_{x_{1:T}}(x_t) \neq h(x_t)\}\Bigg]$$

denote its minimax expected mistakes. We use the notation $\mathcal{B}_{x_{1:T}}$ to indicate that $\mathcal{B}$ was initialized with the sequence $x_{1:T}$ before the game begins. If $M_{\mathcal{B}}(T, \mathcal{H}) = o(T)$ or $R_{\mathcal{B}}(T, \mathcal{H}) = o(T)$, then we say that $\mathcal{B}$ is a no-regret offline learner. It turns out that realizable and agnostic offline learnability are equivalent [Hanneke et al., 2024]. That is, $M_{\mathcal{B}}(T, \mathcal{H}) = o(T) \iff R_{\mathcal{B}}(T, \mathcal{H}) = o(T)$. Thus, we say a class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ is offline learnable if and only if there exists a no-regret offline learner for $\mathcal{H}$. When $|\mathcal{Y}| = 2$, Ben-David et al. [1997] and Hanneke et al. [2023] show that the finiteness of a combinatorial dimension called the Vapnik-Chervonenkis (VC) dimension (or equivalently, PAC learnability) is sufficient for offline learnability (see Appendix A for complete definitions).

Lemma 2.1 (Ben-David et al. [1997], Hanneke et al. [2024]). For every $\mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}$, there exists a deterministic offline learner $\mathcal{B}$ such that $M_{\mathcal{B}}(T, \mathcal{H}) = O(\mathrm{VC}(\mathcal{H}) \log_2 T)$, where $\mathrm{VC}(\mathcal{H})$ is the VC dimension of $\mathcal{H}$.

In Section 3, we use the upper bound in Lemma 2.1 to prove that PAC learnability of $\mathcal{H}$ implies $(\mathcal{H}, \mathcal{Z})$ online learnability when $\mathcal{Z}$ is predictable. Interestingly, Hanneke et al. [2024] also establish a trichotomy in the minimax expected mistakes for offline learning in the realizable setting. That is, for any $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ with $|\mathcal{Y}| < \infty$, the quantity $M_{\mathcal{B}}(T, \mathcal{H})$ can only be $\Theta(1)$, $\Theta(\log_2 T)$, or $\Theta(T)$. On the other hand, in the agnostic setting, $R_{\mathcal{B}}(T, \mathcal{H})$ can be $\tilde{\Theta}(\sqrt{T})$ or $\Theta(T)$, where $\tilde{\Theta}$ hides poly-logarithmic factors in $T$. Our main result in Section 3 shows that offline learnability is sufficient for online learnability under predictable examples. The following technical lemma will be important in proving this.

Lemma 2.2 [Ceccherini-Silberstein et al., 2017, Lemma 5.17]. Let $g : \mathbb{Z}_+ \to \mathbb{R}_+$ be a positive sublinear function. Then, $g$ is bounded from above by a concave sublinear function $f : \mathbb{R}_+ \to \mathbb{R}_+$.

In light of Lemma 2.2, we let $\bar{f}$ denote the smallest concave sublinear function upper bounding the positive sublinear function $f$. For example, our regret bounds in Section 3 will often be in terms of $\bar{M}_{\mathcal{B}}(T, \mathcal{H})$. Although in full generality $M_{\mathcal{B}}(T, \mathcal{H}) \leq \bar{M}_{\mathcal{B}}(T, \mathcal{H})$, in many cases we have equality. For example, when $|\mathcal{Y}| = 2$, the trichotomy of expected minimax rates established by Theorem 4.1 in Hanneke et al. [2024] shows that $M_{\mathcal{B}}(T, \mathcal{H}) = \bar{M}_{\mathcal{B}}(T, \mathcal{H})$.
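To make the transductive interface concrete, here is an illustrative sketch (ours; all names are assumptions) of an offline learner $\mathcal{B}_{x_{1:T}}$: it is handed the example sequence up front and then predicts labels in order. The Halving-style instance below is a standard realizable learner for a finite binary class, making at most $\log_2 |\mathcal{H}|$ mistakes; it only illustrates the interface and is not the learner behind Lemma 2.1.

```python
from typing import Callable, List, Sequence

class TransductiveHalvingLearner:
    """Illustrative offline (transductive) learner for a finite binary class."""

    def __init__(self, x_seq: Sequence, H: List[Callable]):
        # The full example sequence is revealed before the game begins, as in
        # B_{x_{1:T}}. Halving itself does not need it, but we keep it to
        # match the transductive interface used by our reductions later.
        self.x_seq = list(x_seq)
        self.version_space = list(H)  # hypotheses consistent with labels so far

    def predict(self, x) -> int:
        votes = sum(h(x) for h in self.version_space)
        return int(2 * votes >= len(self.version_space))  # majority vote

    def update(self, x, y: int) -> None:
        # In the realizable setting, each mistake at least halves the version
        # space, giving at most log2(|H|) mistakes overall.
        self.version_space = [h for h in self.version_space if h(x) == y]
```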
3 Adaptive Rates in the Realizable Setting

In this section, we design learning algorithms whose expected mistake bounds, given black-box access to a Predictor $\mathcal{P}$ and offline learner $\mathcal{B}$, adapt to the quality of predictions by $\mathcal{P}$ and $\mathcal{B}$. Our main quantitative result is stated below.

Theorem 3.1 (Realizable upper bound). For every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, Predictor $\mathcal{P}$, and no-regret offline learner $\mathcal{B}$, there exists an online learner $\mathcal{A}$ such that for every realizable stream $(x_1, y_1), \dots, (x_T, y_T)$, $\mathcal{A}$ makes at most

$$3\Bigg(\underbrace{L(\mathcal{H})}_{(i)} \;\wedge\; \underbrace{(M_{\mathcal{P}}(x_{1:T}) + 1)\, M_{\mathcal{B}}(T, \mathcal{H})}_{(ii)} \;\wedge\; \underbrace{6\,(M_{\mathcal{P}}(x_{1:T}) + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(x_{1:T}) + 1} + 1,\, \mathcal{H}\Big) + 6\log_2 T}_{(iii)}\Bigg) + 5$$

mistakes in expectation, where $L(\mathcal{H})$ is the Littlestone dimension of $\mathcal{H}$.

We highlight some important consequences of Theorem 3.1. Firstly, when $M_{\mathcal{P}}(x_{1:T}) = 0$, the expected mistake bound of $\mathcal{A}$ matches (up to constant factors) that of the offline learner $\mathcal{B}$. Thus, when $M_{\mathcal{P}}(x_{1:T}) = 0$ and $\mathcal{B}$ is a minimax optimal offline learner, our learner $\mathcal{A}$ performs as well as the best offline learner. Secondly, the expected mistake bound of $\mathcal{A}$ is always at most $3 L(\mathcal{H}) + 5$, the minimax worst-case mistake bound up to constant factors. Thus, our learner $\mathcal{A}$ never does worse than the worst-case mistake bound. Thirdly, the expected mistake bound of $\mathcal{A}$ gracefully interpolates between the offline and worst-case optimal rates as a function of $M_{\mathcal{P}}(x_{1:T})$. In Section 3.3, we show that the dependence of $\mathcal{A}$'s mistake bound on $M_{\mathcal{P}}(x_{1:T})$ and $M_{\mathcal{B}}(T, \mathcal{H})$ can be tight. Lastly, we highlight that Theorem 3.1 makes no assumption about the size of $\mathcal{Y}$.

With respect to learnability, Corollary 3.2 shows that offline learnability of $\mathcal{H}$ is sufficient for online learnability under predictable examples.

Corollary 3.2 (Offline learnability implies realizable online learnability with predictable examples). For every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ and $\mathcal{Z} \subseteq \mathcal{X}^*$: if $\mathcal{Z}$ is predictable and $\mathcal{H}$ is offline learnable, then $(\mathcal{H}, \mathcal{Z})$ is realizable online learnable.

This follows from a slight modification of the proof of Theorem 3.1, along with the fact that the term $(M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\,\bar{M}_{\mathcal{B}}\big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1}, \mathcal{H}\big) = o(T)$ when $M_{\mathcal{B}}(T, \mathcal{H}) = o(T)$ and $M_{\mathcal{P}}(T, \mathcal{Z}) = o(T)$. In addition, we can also establish a quantitative version of Corollary 3.2 for VC classes.

Corollary 3.3. For every $\mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}$, Predictor $\mathcal{P}$, and $\mathcal{Z} \subseteq \mathcal{X}^*$, there exists an online learner $\mathcal{A}$ such that

$$M_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) = O\Bigg(\mathrm{VC}(\mathcal{H})\,(M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, \log_2\Big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1}\Big)\Bigg).$$

We prove both Corollaries 3.2 and 3.3 in Appendix C. Corollary 3.3 shows that PAC learnability implies online learnability under predictable examples. Moreover, for VC classes, when $M_{\mathcal{P}}(T, \mathcal{Z}) = 0$, the upper bound in Corollary 3.3 exactly matches that of Lemma 2.1. An analogous corollary in terms of the Natarajan dimension (see Appendix A for its definition) holds when $|\mathcal{Y}| < \infty$.

The remainder of this section is dedicated to proving Theorem 3.1. The proof involves constructing three different online learners, with expected mistake bounds (i), (ii), and (iii) respectively, and then running the Deterministic Weighted Majority Algorithm (DWMA) using these learners as experts [Arora et al., 2012]. The following guarantee of the DWMA, along with upper bounds (i), (ii), and (iii), gives the upper bound in Theorem 3.1 (see Appendix D for the complete proof).

Lemma 3.4 (DWMA guarantee [Arora et al., 2012]). The DWMA run with $N$ experts and learning rate $\eta = 1/2$ makes at most $3(\min_{i \in [N]} M_i + \log_2 N)$ mistakes, where $M_i$ is the number of mistakes made by expert $i \in [N]$.
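For concreteness, here is a minimal sketch of the DWMA with $\eta = 1/2$: a standard weighted-majority implementation under assumed expert interfaces, not code from the paper.

```python
from typing import List, Sequence, Tuple

def weighted_majority(experts: List, stream: Sequence[Tuple]) -> int:
    """Deterministic Weighted Majority over expert predictions (eta = 1/2).

    Each expert exposes predict(x) / update(x, y). Weights of erring experts
    are multiplied by (1 - eta) = 1/2; the learner outputs the weighted
    majority label each round.
    """
    eta = 0.5
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, y in stream:
        preds = [e.predict(x) for e in experts]
        votes = {}
        for w, p in zip(weights, preds):
            votes[p] = votes.get(p, 0.0) + w
        y_hat = max(votes, key=votes.get)        # weighted majority vote
        mistakes += int(y_hat != y)
        for i, (e, p) in enumerate(zip(experts, preds)):
            if p != y:
                weights[i] *= (1 - eta)          # penalize erring experts
            e.update(x, y)
    return mistakes
```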
The online learner obtaining the upper bound $L(\mathcal{H})$ is the celebrated Standard Optimal Algorithm [Littlestone, 1987, Daniely et al., 2011], and thus we omit the details here. Our second and third learners are described in Sections 3.1 and 3.2 respectively. Finally, in Section 3.3, we give a lower bound showing that our upper bound in Theorem 3.1 can be tight.

3.1 Proof of upper bound (ii) in Theorem 3.1

Consider a lazy, consistent Predictor $\mathcal{P}$. Given any sequence of examples $x_{1:T} \in \mathcal{X}^T$, the Predictor $\mathcal{P}$ makes $c \in \mathbb{N}$ mistakes at some time points $t_1, \dots, t_c \in [T]$. Since $\mathcal{P}$ may be randomized, both $c$ and $t_1, \dots, t_c$ are random variables. Crucially, since $\mathcal{P}$ is lazy, for every $i \in \{0, \dots, c\}$, the predictions made by $\mathcal{P}$ on time points strictly between $t_i$ and $t_{i+1}$ are correct and remain unchanged, where we take $t_0 = 0$ and $t_{c+1} = T + 1$. This means that on round $t_i$, we have that $\hat{x}^{t_i}_{t_i:t_{i+1}-1} = x_{t_i:t_{i+1}-1}$. Therefore, initializing a fresh copy of an offline learner $\mathcal{B}$ with the predictions $\hat{x}^{t_i}_{t_i:T}$ ensures that $\mathcal{B}$ makes at most $M_{\mathcal{B}}(T - t_i + 1, \mathcal{H})$ mistakes on the stream $(x_{t_i}, y_{t_i}), \dots, (x_{t_{i+1}-1}, y_{t_{i+1}-1})$. Repeating this argument for all adjacent pairs of time points in $\{t_1, \dots, t_c\}$ gives the following strategy: initialize a new offline learner $\mathcal{B}$ every time $\mathcal{P}$ makes a mistake, and use $\mathcal{B}$ to make predictions until the next time $\mathcal{P}$ makes a mistake. Algorithm 3 implements this idea (a runnable sketch follows the pseudocode).

Algorithm 3 Online Learner
Input: Hypothesis class H, Offline learner B, Time horizon T
1 Initialize: i = 0
for t = 1, ..., T do
2   Receive $x_t$ from Nature.
3   Receive predictions $\hat{x}^t_{1:T}$ from Predictor $\mathcal{P}$ such that $\hat{x}^t_{1:t} = x_{1:t}$.
4   if $t = 1$ or $\hat{x}^{t-1}_t \neq x_t$ (i.e., $\mathcal{P}$ made a mistake) then
5     Let $\mathcal{B}_i$ be a new copy of $\mathcal{B}$ initialized with the sequence $\hat{x}^t_{t:T}$ and set $i \leftarrow i + 1$.
6   Query $\mathcal{B}_i$ on example $x_t$ and play its returned prediction $\hat{y}_t$.
7   Receive the true label $y_t$ from Nature and pass it to $\mathcal{B}_i$.
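A compact sketch of Algorithm 3's control flow (hypothetical interfaces matching the earlier sketches; rounds are zero-indexed): a fresh offline learner is spawned exactly when the Predictor's previous guess for the current example was wrong.

```python
from typing import Callable, List, Sequence, Tuple

def algorithm_3(make_offline_learner: Callable, predictor,
                stream: Sequence[Tuple]) -> int:
    """Sketch of Algorithm 3: restart an offline learner on Predictor mistakes.

    `make_offline_learner(x_hat_suffix)` returns a fresh transductive learner
    initialized with the Predictor's guess of the remaining examples.
    """
    mistakes, B, prev_x_hat = 0, None, None
    history: List = []
    for t, (x_t, y_t) in enumerate(stream):
        observed = [x for (x, _) in history] + [x_t]
        x_hat = predictor.predict(observed)      # consistent: x_hat[:t+1] == observed
        if t == 0 or prev_x_hat[t] != x_t:       # Predictor erred on this round
            B = make_offline_learner(x_hat[t:])  # fresh copy of B on predicted suffix
        y_pred = B.predict(x_t)
        mistakes += int(y_pred != y_t)
        B.update(x_t, y_t)
        history.append((x_t, y_t))
        prev_x_hat = x_hat
    return mistakes
```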
Lemma 3.5. For every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, Predictor $\mathcal{P}$, no-regret offline learner $\mathcal{B}$, and realizable stream $(x_1, y_1), \dots, (x_T, y_T)$, Algorithm 3 makes at most $(M_{\mathcal{P}}(x_{1:T}) + 1)\, M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation.

Proof. Let $\mathcal{A}$ denote Algorithm 3, let $(x_1, y_1), \dots, (x_T, y_T)$ denote the realizable stream to be observed by $\mathcal{A}$, and let $h^\star \in \mathcal{H}$ be the labeling hypothesis. Let $c$ be the random variable denoting the number of mistakes made by Predictor $\mathcal{P}$ on the stream, and let $t_1, \dots, t_c$ be the random variables denoting the time points where $\mathcal{P}$ makes these errors (i.e., $\hat{x}^{t_i - 1}_{t_i} \neq x_{t_i}$). Note that $t_i \geq 2$ for all $i \in [c]$. We will show pointwise, for every value of $c$ and $t_1, \dots, t_c$, that $\mathcal{A}$ makes at most $(c + 1)\, M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation over the randomness of $\mathcal{B}$. Taking an outer expectation with respect to the randomness of $\mathcal{P}$ and using the fact that $\mathbb{E}[c] = M_{\mathcal{P}}(x_{1:T})$ completes the proof.

First, consider the case where $c = 0$ (i.e., $\mathcal{P}$ makes no mistakes). Then, since $\mathcal{P}$ is lazy, we have that $\hat{x}^t_{1:T} = x_{1:T}$ for every $t \in [T]$. Thus, line 5 fires exactly once on round $t = 1$, $\mathcal{A}$ initializes an offline learner $\mathcal{B}_1$ with $x_{1:T}$, and $\mathcal{A}$ uses $\mathcal{B}_1$ to make its prediction on all rounds. Thus, $\mathcal{A}$ makes at most $M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation.

Now, let $c > 0$ and $t_1, \dots, t_c$ be the time points where $\mathcal{P}$ errs. Partition the sequence $1, \dots, T$ into the disjoint intervals $(1, \dots, t_1 - 1), (t_1, \dots, t_2 - 1), \dots, (t_c, \dots, T)$. Define $t_0 := 1$ and $t_{c+1} := T$. Fix an $i \in \{0, \dots, c\}$. Observe that for every $j \in \{t_i, \dots, t_{i+1} - 1\}$, we have that $\hat{x}^j_{1:t_{i+1}-1} = x_{1:t_{i+1}-1}$. This comes from the fact that $\mathcal{P}$ does not err on time points $t_i + 1, \dots, t_{i+1} - 1$ and is both consistent and lazy (see Assumptions 1 and 2). Thus, line 5 fires on round $t_i$, $\mathcal{A}$ initializes an offline learner $\mathcal{B}_i$ with the sequence $\hat{x}^{t_i}_{t_i:T} = x_{t_i:t_{i+1}-1} \circ \hat{x}^{t_i}_{t_{i+1}:T}$, and $\mathcal{A}$ uses $\mathcal{B}_i$ to make predictions on the time points $t_i, \dots, t_{i+1} - 1$. Note that line 5 does not fire on time points $t_i + 1, \dots, t_{i+1} - 1$. Consider the hypothetical labeled stream of examples

$$(\hat{x}^{t_i}_{t_i}, h^\star(\hat{x}^{t_i}_{t_i})), \dots, (\hat{x}^{t_i}_{T}, h^\star(\hat{x}^{t_i}_{T})) = (x_{t_i}, y_{t_i}), \dots, (x_{t_{i+1}-1}, y_{t_{i+1}-1}), (\hat{x}^{t_i}_{t_{i+1}}, h^\star(\hat{x}^{t_i}_{t_{i+1}})), \dots, (\hat{x}^{t_i}_{T}, h^\star(\hat{x}^{t_i}_{T})).$$

By definition, $\mathcal{B}_i$, after being initialized with $\hat{x}^{t_i}_{t_i:T}$, makes at most $M_{\mathcal{B}}(T - t_i + 1, \mathcal{H})$ mistakes in expectation when simulated on the stream $(\hat{x}^{t_i}_{t_i}, h^\star(\hat{x}^{t_i}_{t_i})), \dots, (\hat{x}^{t_i}_{T}, h^\star(\hat{x}^{t_i}_{T}))$. Thus, $\mathcal{B}_i$ makes at most $M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation on the prefix $(\hat{x}^{t_i}_{t_i}, h^\star(\hat{x}^{t_i}_{t_i})), \dots, (\hat{x}^{t_i}_{t_{i+1}-1}, h^\star(\hat{x}^{t_i}_{t_{i+1}-1})) = (x_{t_i}, y_{t_i}), \dots, (x_{t_{i+1}-1}, y_{t_{i+1}-1})$. Since on time point $t_i$, $\mathcal{A}$ instantiates $\mathcal{B}_i$ with the sequence $\hat{x}^{t_i}_{t_i:T}$ and proceeds to simulate $\mathcal{B}_i$ on the sequence of labeled examples $(x_{t_i}, y_{t_i}), \dots, (x_{t_{i+1}-1}, y_{t_{i+1}-1})$, $\mathcal{A}$ makes at most $M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation on the sequence $(x_{t_i}, y_{t_i}), \dots, (x_{t_{i+1}-1}, y_{t_{i+1}-1})$. Since the interval $i$ was chosen arbitrarily, the above analysis holds for every $i \in \{0, \dots, c\}$, and therefore $\mathcal{A}$ makes at most $(c + 1)\, M_{\mathcal{B}}(T, \mathcal{H})$ mistakes in expectation over the entire stream.

3.2 Proof sketch of upper bound (iii) in Theorem 3.1

When $M_{\mathcal{B}}(T, \mathcal{H})$ is large (i.e., $\Omega(\sqrt{T})$), upper bound (ii) is sub-optimal. Indeed, if $t_1, \dots, t_c$ denote the time points where $\mathcal{P}$ makes mistakes on the stream $x_{1:T}$, then Algorithm 3 initializes offline learners with sequences of length $T - t_i + 1$. The resulting mistake bounds of these offline learners are then on the order of $M_{\mathcal{B}}(T - t_i + 1, \mathcal{H})$, which can be large if $t_1, \dots, t_c$ are evenly spaced across the time horizon. To overcome this, we construct a family $\mathcal{E}$ of online learners, each of which explicitly controls the length of the sequences that offline learners can be initialized with. Finally, we run the DWMA using $\mathcal{E}$ as its set of experts.

Our family of online learners is parameterized by integers $c \in \{0, \dots, T - 1\}$. Given an input $c \in \{0, \dots, T - 1\}$, the online learner parameterized by $c$ partitions the stream into $c + 1$ roughly equally sized parts of size $\frac{T}{c+1}$ and runs a fresh copy of Algorithm 3 on each partition. In this way, the online learner parameterized by $c$ ensures that offline learners are initialized with time horizons at most $\lceil \frac{T}{c+1} \rceil$. Algorithm 4 formalizes this online learner (a sketch of its block structure follows), and Lemma 3.6, whose proof is in Appendix B, bounds its expected number of mistakes.

Algorithm 4 Expert(c)
Input: Copy of Algorithm 3 denoted K, Offline learner B, Time horizon T
1 Initialize: $\tilde{t}_i = \lceil \frac{iT}{c+1} \rceil$ for $i \in \{1, \dots, c\}$, $\tilde{t}_0 = 0$, and $\tilde{t}_{c+1} = T$.
2 for t = 1, ..., T do
3   Let $i \in \{0, \dots, c\}$ be such that $t \in \{\tilde{t}_i + 1, \dots, \tilde{t}_{i+1}\}$.
4   if $t = \tilde{t}_i + 1$ then
5     Let $\mathcal{K}_i$ be a new copy of $\mathcal{K}$ initialized with time horizon $\tilde{t}_{i+1} - \tilde{t}_i$ and a new copy of $\mathcal{B}$.
6   Receive $x_t$ from Nature.
7   Receive predictions $\hat{x}^t_{1:T}$ from Predictor $\mathcal{P}$ such that $\hat{x}^t_{1:t} = x_{1:t}$.
8   Forward $x_t$ and $\hat{x}^t_{\tilde{t}_i+1:\tilde{t}_{i+1}}$ to $\mathcal{K}_i$ via lines 2 and 3 of Algorithm 3 respectively.
9   Receive $\hat{y}_t$ from $\mathcal{K}_i$ via line 6 of Algorithm 3 and predict $\hat{y}_t$.
10  Receive the true label $y_t$ and forward it to $\mathcal{K}_i$ via line 7 of Algorithm 3.
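The block structure of Algorithm 4 can be sketched as follows. This is illustrative only: it reuses the `algorithm_3` sketch above and elides how the Predictor's full-horizon outputs are re-indexed within each block.

```python
import math
from typing import Callable, Sequence, Tuple

def expert_c(c: int, predictor, make_offline_learner: Callable,
             stream: Sequence[Tuple]) -> int:
    """Sketch of Expert(c): split the horizon into c + 1 near-equal blocks and
    run a fresh copy of Algorithm 3 on each, capping the horizon (and hence
    each offline learner's mistake bound) at roughly T / (c + 1) per block.
    """
    T = len(stream)
    boundaries = [math.ceil(i * T / (c + 1)) for i in range(c + 2)]  # t_0..t_{c+1}
    mistakes = 0
    for i in range(c + 1):
        block = stream[boundaries[i]:boundaries[i + 1]]
        # Each block is handled by an independent copy of Algorithm 3.
        mistakes += algorithm_3(make_offline_learner, predictor, block)
    return mistakes
```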
Lemma 3.6 (Expert guarantee). For any $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, Predictor $\mathcal{P}$, and no-regret offline learner $\mathcal{B}$, Algorithm 4, given as input $c \in \{0, \dots, T - 1\}$, makes at most

$$(M_{\mathcal{P}}(x_{1:T}) + c + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{c + 1} + 1,\, \mathcal{H}\Big)$$

mistakes in expectation on any realizable stream $(x_1, y_1), \dots, (x_T, y_T)$.

Note that when $c = 0$ and $M_{\mathcal{B}}(T, \mathcal{H}) = \bar{M}_{\mathcal{B}}(T, \mathcal{H})$, this bound reduces to the one in Lemma 3.5 up to a constant factor. On the other hand, using $c = \lceil M_{\mathcal{P}}(x_{1:T}) \rceil$ gives the upper bound

$$2\,(M_{\mathcal{P}}(x_{1:T}) + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(x_{1:T}) + 1} + 1,\, \mathcal{H}\Big).$$

Since $\mathcal{E}$ contains an Expert parameterized by every $c \in \{0, \dots, T - 1\}$, there always exists an expert $E_{\lceil M_{\mathcal{P}}(x_{1:T}) \rceil} \in \mathcal{E}$ initialized with input $c = \lceil M_{\mathcal{P}}(x_{1:T}) \rceil$. Running the DWMA using this set of experts $\mathcal{E}$ on the data stream $(x_1, y_1), \dots, (x_T, y_T)$ ensures that our learner does not perform much worse than $E_{\lceil M_{\mathcal{P}}(x_{1:T}) \rceil}$. Algorithm 5 formalizes this idea, and Lemma 3.7 is proved in Appendix B.

Algorithm 5 Online Learner
Input: Hypothesis class H, Offline learner B, Time horizon T
1 For every $b \in \{0, \dots, T - 1\}$, let $E_b$ denote Algorithm 4 parameterized by input $b$.
2 Run the DWMA using $\{E_b\}_{b \in \{0, \dots, T-1\}}$ over the stream $(x_1, y_1), \dots, (x_T, y_T)$.

Lemma 3.7. For every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, Predictor $\mathcal{P}$, and no-regret offline learner $\mathcal{B}$, Algorithm 5 makes at most

$$6\,(M_{\mathcal{P}}(x_{1:T}) + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(x_{1:T}) + 1} + 1,\, \mathcal{H}\Big) + 6\log_2 T$$

mistakes in expectation on any realizable stream $(x_1, y_1), \dots, (x_T, y_T)$.

3.3 Lower bounds

In light of Theorem 3.1, it is natural to ask whether the upper bounds derived in Section 3 are tight. A notable feature in upper bounds (ii) and (iii) is the product of the two mistake bounds $M_{\mathcal{P}}(x_{1:T})$ and $M_{\mathcal{B}}(T, \mathcal{H})$. Can this product be replaced by a sum? Unfortunately, no: Theorem 3.8 shows that the upper bound in Theorem 3.1 can be tight.

Theorem 3.8. Let $\mathcal{X} = [0, 1] \cup \{\star\}$, $\mathcal{Y} = \{0, 1\}$, and $\mathcal{H} = \{x \mapsto \mathbb{1}\{x \geq a\}\,\mathbb{1}\{x \neq \star\} : a \in (0, 1)\}$. Let $T, n \in \mathbb{N}$ be such that $n + 1$ divides $T$ and $\frac{T}{n+1} + 1 = 2^k$ for some $k \in \mathbb{N}$. Then, there exists a Predictor $\mathcal{P}$ such that for every online learner $\mathcal{A}$ that uses $\mathcal{P}$ according to Protocol 2, there exists a realizable stream $(x_1, y_1), \dots, (x_T, y_T)$ such that $M_{\mathcal{P}}(x_{1:T}) = n$ but

$$\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] \geq \frac{n + 1}{2} \log_2\Big(\frac{T}{n + 1}\Big).$$

Theorem 3.8 shows that the upper bound in Theorem 3.1 is tight up to an additive factor of $\log_2 T$, because Lemma 2.1 gives that $\inf_{\mathcal{B}} M_{\mathcal{B}}(T, \mathcal{H}) = O(\mathrm{VC}(\mathcal{H}) \log_2 T)$ and $\mathrm{VC}(\mathcal{H}) = 1$. The proof of Theorem 3.8 is technical and provided in Appendix E. Our proof involves four steps. First, we construct a class of streams $\mathcal{Z}_n \subseteq \mathcal{X}^*$. Then, using $\mathcal{Z}_n$, we construct a deterministic, lazy, consistent Predictor $\mathcal{P}$ such that $\mathcal{P}$ makes mistakes exactly on time points $\{\frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$ for every stream $x_{1:T} \in \mathcal{Z}_n$. Next, whenever $x_{1:T} \in \mathcal{Z}_n$, we establish an equivalence between the game defined by Protocol 2 when given access to Predictor $\mathcal{P}$ and Online Classification with Peeks, a different game where there is no Predictor, but the learner observes the next $\frac{T}{n+1}$ examples at time points $t \in \{1, \frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$. Finally, for Online Classification with Peeks, we give a strategy for Nature such that it can force any online learner to make $\frac{(n+1) \log_2(\frac{T}{n+1})}{2}$ mistakes in expectation while ensuring that its selected stream satisfies $x_{1:T} \in \mathcal{Z}_n$ and $\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq y_t\} = 0$. A key component of the fourth step is the stream constructed by [Hanneke et al., 2024, Claim 3.4] to show that the minimax number of mistakes for classes with infinite Ldim is at least $\log_2 T$ in the offline setting.

Remark. Although Theorem 3.8 is stated using the class of one-dimensional thresholds, it can be adapted to hold for any VC class with infinite Ldim, as these classes embed thresholds [Alon et al., 2019, Theorem 3].
4 Discussion

In this paper, we initiated the study of online classification when the learner has access to machine-learned predictions about future examples. There are many interesting directions for future research, and we list two below. Firstly, we only considered the classification setting, and it would be interesting to extend our results to online scalar-valued regression. Secondly, we measure the performance of a Predictor through its mistake bounds. When $\mathcal{X}$ is continuous, this might be an unrealistic measure of performance. Thus, it would be interesting to see whether our results can be generalized to the case where $\mathcal{X}$ is continuous and the guarantee of Predictors is defined in terms of $\ell_p$ losses.

Acknowledgments and Disclosure of Funding

VR acknowledges the support from the NSF Graduate Research Fellowship Program.

References

Matteo Almanza, Flavio Chierichetti, Silvio Lattanzi, Alessandro Panconesi, and Giuseppe Re. Online facility location with multiple advice. Advances in Neural Information Processing Systems, 34:4661–4673, 2021.

Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.

Antonios Antoniadis, Christian Coester, Marek Eliáš, Adam Polak, and Bertrand Simon. Online metric algorithms with untrusted predictions. ACM Transactions on Algorithms, 19(2):1–34, 2023.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Etienne Bamas, Andreas Maggiori, and Ola Svensson. The primal-dual method for learning augmented algorithms. Advances in Neural Information Processing Systems, 33:20083–20094, 2020.

Shai Ben-David, Eyal Kushilevitz, and Yishay Mansour. Online learning versus offline learning. Machine Learning, 29:45–63, 1997.

Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

Adam Block, Yuval Dagan, Noah Golowich, and Alexander Rakhlin. Smoothed online learning is as easy as statistical learning. In Conference on Learning Theory, pages 1716–1786. PMLR, 2022.

T. Ceccherini-Silberstein, M. Salvatori, and E. Sava-Huss. Groups, Graphs and Random Walks. London Mathematical Society Lecture Note Series. Cambridge University Press, 2017. doi: 10.1017/9781316576571.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Justin Chen, Sandeep Silwal, Ali Vakilian, and Fred Zhang. Faster fundamental graph algorithms via learned predictions. In International Conference on Machine Learning, pages 3583–3602. PMLR, 2022.

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 207–232, Budapest, Hungary, 09–11 Jun 2011. PMLR.

Marek Eliáš, Haim Kaplan, Yishay Mansour, and Shay Moran. Learning-augmented algorithms with explicit predictors. arXiv preprint arXiv:2403.07413, 2024.

Jon C. Ergun, Zhili Feng, Sandeep Silwal, David P. Woodruff, and Samson Zhou. Learning-augmented k-means clustering. arXiv preprint arXiv:2110.14094, 2021.

Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In International Conference on Machine Learning, pages 2319–2327. PMLR, 2019.
Nika Haghtalab. Foundation of Machine Learning, by the People, for the People. PhD thesis, Carnegie Mellon University, 2018.

Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis of online and differentially private learning. Advances in Neural Information Processing Systems, 33:9203–9215, 2020.

Steve Hanneke, Shay Moran, Vinod Raman, Unique Subedi, and Ambuj Tewari. Multiclass online learning and uniform convergence. In Proceedings of the 36th Annual Conference on Learning Theory (COLT), 2023.

Steve Hanneke, Shay Moran, and Jonathan Shafer. A trichotomy for transductive online learning. Advances in Neural Information Processing Systems, 36, 2024.

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations, 2019.

Shaofeng H.-C. Jiang, Erzhi Liu, You Lyu, Zhihao Gavin Tang, and Yubo Zhang. Online facility location with predictions. arXiv preprint arXiv:2110.08840, 2021.

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504, 2018.

Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1859–1877. SIAM, 2020.

Honghao Lin, Tian Luo, and David Woodruff. Learning augmented binary search trees. In International Conference on Machine Learning, pages 13431–13440. PMLR, 2022.

Nicholas Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, University of California, Santa Cruz, 1989.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1987.

Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. Journal of the ACM (JACM), 68(4):1–25, 2021.

Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33–35, 2022.

Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ML predictions. Advances in Neural Information Processing Systems, 31, 2018.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. Advances in Neural Information Processing Systems, 24, 2011.

Vinod Raman, Unique Subedi, and Ambuj Tewari. The complexity of sequential prediction in dynamical systems. arXiv preprint arXiv:2402.06614, 2024.

Tim Roughgarden. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2021.

Ziv Scully, Isaac Grosof, and Michael Mitzenmacher. Uniform bounds for scheduling with job size estimates. arXiv preprint arXiv:2110.00633, 2021.

Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis: an attempt to explain the behavior of algorithms in practice. Communications of the ACM, 52(10):76–84, 2009.

Shufan Wang, Jian Li, and Shiqiang Wang. Online algorithms for multi-shop ski rental with machine learned advice. Advances in Neural Information Processing Systems, 33:8150–8160, 2020.

Alexander Wei and Fred Zhang. Optimal robustness-consistency trade-offs for learning-augmented online algorithms. Advances in Neural Information Processing Systems, 33:8042–8053, 2020.

Changlong Wu, Ananth Grama, and Wojciech Szpankowski. Online learning in dynamically changing environments.
In The Thirty Sixth Annual Conference on Learning Theory, pages 325–358. PMLR, 2023.

A Combinatorial dimensions

In this section, we review existing combinatorial dimensions in statistical learning theory. We start with the VC and Natarajan dimensions, which characterize PAC learnability when $|\mathcal{Y}| = 2$ and $|\mathcal{Y}| < \infty$ respectively.

Definition 3 (VC dimension). A set $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ is shattered by $\mathcal{H}$ if for every $y_1, \dots, y_n \in \{0, 1\}$, there exists $h \in \mathcal{H}$ such that for all $i \in [n]$, $h(x_i) = y_i$. The VC dimension of $\mathcal{H}$, denoted $\mathrm{VC}(\mathcal{H})$, is defined as the largest natural number $n \in \mathbb{N}$ such that there exists a set $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ that is shattered by $\mathcal{H}$.

Definition 4 (Natarajan dimension). A set $S = \{x_1, \dots, x_d\}$ is shattered by a multiclass function class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ if there exist two witness functions $f, g : S \to \mathcal{Y}$ such that $f(x_i) \neq g(x_i)$ for all $i \in [d]$, and for every $\sigma \in \{0, 1\}^d$, there exists a function $h_\sigma \in \mathcal{H}$ such that for all $i \in [d]$, we have

$$h_\sigma(x_i) = \begin{cases} f(x_i) & \text{if } \sigma_i = 1, \\ g(x_i) & \text{if } \sigma_i = 0. \end{cases}$$

The Natarajan dimension of $\mathcal{H}$, denoted $\mathrm{N}(\mathcal{H})$, is the size of the largest shattered set $S \subseteq \mathcal{X}$. If the size of the shattered set can be arbitrarily large, we say that $\mathrm{N}(\mathcal{H}) = \infty$. We note that $\mathrm{N}(\mathcal{H}) = \mathrm{VC}(\mathcal{H})$ whenever $|\mathcal{Y}| = 2$.

Next, we move to the online setting, where the Littlestone dimension (Ldim) characterizes multiclass online learnability. To define the Ldim, we first define a Littlestone tree and a notion of shattering.

Definition 5 (Littlestone tree). A Littlestone tree of depth $d$ is a complete binary tree of depth $d$ where the internal nodes are labeled by examples of $\mathcal{X}$ and the left and right outgoing edges from each internal node are labeled by 0 and 1 respectively. Given a Littlestone tree $\mathcal{T}$ of depth $d$, a root-to-leaf path down $\mathcal{T}$ is a bitstring $\sigma \in \{0, 1\}^d$ indicating whether to go left ($\sigma_i = 0$) or right ($\sigma_i = 1$) at each depth $i \in [d]$. A path $\sigma \in \{0, 1\}^d$ down $\mathcal{T}$ gives a sequence of labeled examples $\{(x_i, \sigma_i)\}_{i=1}^d$, where $x_i$ is the example labeling the internal node reached by following the prefix $(\sigma_1, \dots, \sigma_{i-1})$ down the tree. A hypothesis $h_\sigma \in \mathcal{H}$ shatters a path $\sigma \in \{0, 1\}^d$ if for every $i \in [d]$, we have $h_\sigma(x_i) = \sigma_i$. In other words, $h_\sigma$ is consistent with the labeled examples obtained by following $\sigma$. A Littlestone tree $\mathcal{T}$ is shattered by $\mathcal{H}$ if for every root-to-leaf path $\sigma$ down $\mathcal{T}$, there exists a hypothesis $h_\sigma \in \mathcal{H}$ that shatters it.

Using this notion of shattering, we define the Littlestone dimension of a hypothesis class.

Definition 6 (Littlestone dimension). The Littlestone dimension of $\mathcal{H}$, denoted $L(\mathcal{H})$, is the largest $d \in \mathbb{N}$ such that there exists a Littlestone tree $\mathcal{T}$ of depth $d$ shattered by $\mathcal{H}$. If there exist shattered Littlestone trees $\mathcal{T}$ of arbitrarily large depth, then we say that $L(\mathcal{H}) = \infty$.

Finally, the following notion of shattering is useful when proving the lower bound in Appendix E.

Definition 7 (Threshold shattering). A sequence $(x_1, \dots, x_k) \in \mathcal{X}^k$ is threshold-shattered by $\mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}$ if there exists $(h_1, \dots, h_k) \in \mathcal{H}^k$ such that $h_i(x_j) = \mathbb{1}\{j \geq i\}$ for all $i, j \in [k]$.
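Definition 3 can be checked by brute force for finite classes; the sketch below (ours, for illustration only) verifies, for example, that thresholds have VC dimension 1.

```python
from itertools import product
from typing import Callable, List, Sequence

def is_shattered(points: Sequence, H: List[Callable]) -> bool:
    """Check Definition 3 by brute force: every 0/1 labeling of `points` must
    be realized by some hypothesis in the (finite) class H."""
    return all(
        any(all(h(x) == y for x, y in zip(points, labels)) for h in H)
        for labels in product([0, 1], repeat=len(points))
    )

# Thresholds over a finite grid: h_a(x) = 1{x >= a}. Any single point is
# shattered, but no pair x1 < x2 can be, since the labeling (1, 0) is
# unrealizable by a threshold; hence the VC dimension is 1.
H_thresh = [(lambda x, a=a: int(x >= a)) for a in range(10)]
assert is_shattered([5], H_thresh)
assert not is_shattered([3, 7], H_thresh)
```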
B Proof of Lemmas 3.6 and 3.7

Proof (of Lemma 3.6). Let $(x_1, y_1), \dots, (x_T, y_T)$ be the realizable stream to be observed by the Expert. For every $i \in \{0, \dots, c\}$, let $m_i$ be the random variable denoting the number of mistakes made by $\mathcal{P}$ in rounds $\{\tilde{t}_i + 1, \dots, \tilde{t}_{i+1}\}$. Recall that $\tilde{t}_0 = 0$ and $\tilde{t}_{c+1} = T$. Let $M = \sum_{i=0}^{c} m_i$ be the random variable denoting the total number of mistakes made by $\mathcal{P}$ on the realizable stream. Finally, let $\mathcal{A}$ denote Algorithm 4. Observe that

$$\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] = \mathbb{E}\Bigg[\sum_{i=0}^{c} \sum_{t=\tilde{t}_i+1}^{\tilde{t}_{i+1}} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] = \mathbb{E}\Bigg[\sum_{i=0}^{c} \sum_{t=\tilde{t}_i+1}^{\tilde{t}_{i+1}} \mathbb{1}\{\mathcal{K}_i(x_t) \neq y_t\}\Bigg] \leq \mathbb{E}\Bigg[\sum_{i=0}^{c} (m_i + 1)\, \bar{M}_{\mathcal{B}}(\tilde{t}_{i+1} - \tilde{t}_i, \mathcal{H})\Bigg],$$

where the first inequality follows from the guarantee of $\mathcal{K}$ and Lemma 2.2. Using Jensen's inequality (applied to the concave function $\bar{M}_{\mathcal{B}}(\cdot, \mathcal{H})$ with weights $m_i + 1$), we get that

$$\sum_{i=0}^{c} (m_i + 1)\, \bar{M}_{\mathcal{B}}(\tilde{t}_{i+1} - \tilde{t}_i, \mathcal{H}) \leq \Bigg(\sum_{i=0}^{c} (m_i + 1)\Bigg) \bar{M}_{\mathcal{B}}\Bigg(\frac{\sum_{i=0}^{c} (m_i + 1)(\tilde{t}_{i+1} - \tilde{t}_i)}{\sum_{i=0}^{c} (m_i + 1)}, \mathcal{H}\Bigg) = (M + c + 1)\, \bar{M}_{\mathcal{B}}\Bigg(\frac{\sum_{i=0}^{c} (m_i + 1)(\tilde{t}_{i+1} - \tilde{t}_i)}{M + c + 1}, \mathcal{H}\Bigg)$$

$$= (M + c + 1)\, \bar{M}_{\mathcal{B}}\Bigg(\frac{\sum_{i=0}^{c} m_i(\tilde{t}_{i+1} - \tilde{t}_i) + T}{M + c + 1}, \mathcal{H}\Bigg).$$

Using the fact that $\tilde{t}_{i+1} - \tilde{t}_i \leq \lceil \frac{T}{c+1} \rceil \leq \frac{T}{c+1} + 1$, we have

$$(M + c + 1)\, \bar{M}_{\mathcal{B}}\Bigg(\frac{\sum_{i=0}^{c} m_i(\tilde{t}_{i+1} - \tilde{t}_i) + T}{M + c + 1}, \mathcal{H}\Bigg) \leq (M + c + 1)\, \bar{M}_{\mathcal{B}}\Bigg(\frac{\frac{T}{c+1}M + M + T}{M + c + 1}, \mathcal{H}\Bigg) \leq (M + c + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{c + 1} + 1, \mathcal{H}\Big).$$

Taking expectations with respect to the randomness of $\mathcal{P}$ and using $\mathbb{E}[M] = M_{\mathcal{P}}(x_{1:T})$ yields the bound

$$(M_{\mathcal{P}}(x_{1:T}) + c + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{c + 1} + 1, \mathcal{H}\Big),$$

which completes the proof.

Proof (of Lemma 3.7). Let $(x_1, y_1), \dots, (x_T, y_T)$ be the realizable stream to be observed by the learner. Let $\mathcal{A}$ denote the online learner in Algorithm 5. By the guarantee of the DWMA (Lemma 3.4), we have

$$\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] \leq 3\,\mathbb{E}\Bigg[\inf_{b \in \{0, \dots, T-1\}} \sum_{t=1}^{T} \mathbb{1}\{E_b(x_t) \neq y_t\}\Bigg] + 3\log_2 T \leq 3\,\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{E_{\lceil M_{\mathcal{P}}(x_{1:T}) \rceil}(x_t) \neq y_t\}\Bigg] + 3\log_2 T$$

$$\leq 3\big(M_{\mathcal{P}}(x_{1:T}) + \lceil M_{\mathcal{P}}(x_{1:T}) \rceil + 1\big)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{\lceil M_{\mathcal{P}}(x_{1:T}) \rceil + 1} + 1, \mathcal{H}\Big) + 3\log_2 T \leq 6\big(M_{\mathcal{P}}(x_{1:T}) + 1\big)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(x_{1:T}) + 1} + 1, \mathcal{H}\Big) + 6\log_2 T,$$

where the last two inequalities follow from Lemma 3.6 and the fact that $M_{\mathcal{P}}(x_{1:T}) \leq \lceil M_{\mathcal{P}}(x_{1:T}) \rceil \leq M_{\mathcal{P}}(x_{1:T}) + 1$.

C Proof of Corollaries 3.2 and 3.3

Using Theorem 3.1, we first show that for every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, Predictor $\mathcal{P}$, $\mathcal{Z} \subseteq \mathcal{X}^*$, and no-regret offline learner $\mathcal{B}$, we have that

$$\inf_{\mathcal{A}} M_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z}) = O\Bigg(L(\mathcal{H}) \,\wedge\, (M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, M_{\mathcal{B}}(T, \mathcal{H}) \,\wedge\, \bigg[(M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1}, \mathcal{H}\Big) + \log_2 T\bigg]\Bigg). \quad (1)$$

Proof. It suffices to show that Algorithms 3 and 5 have expected mistake bounds $O\big((M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, M_{\mathcal{B}}(T, \mathcal{H})\big)$ and $O\big((M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, \bar{M}_{\mathcal{B}}\big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1}, \mathcal{H}\big) + \log_2 T\big)$ respectively. To see that Algorithm 3's mistake bound is $O\big((M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, M_{\mathcal{B}}(T, \mathcal{H})\big)$, note that $M_{\mathcal{P}}(x_{1:T}) \leq M_{\mathcal{P}}(T, \mathcal{Z})$ for every $x_{1:T} \in \mathcal{Z}$. To see that Algorithm 5's expected mistake bound is $O\big((M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, \bar{M}_{\mathcal{B}}\big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1}, \mathcal{H}\big) + \log_2 T\big)$, we follow the exact same proof strategy as in the proof of Lemma 3.7, but pick a different expert when upper bounding the expected number of mistakes. Namely, following the same steps as in the proof of Lemma 3.7, we have that

$$\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] \leq 3\,\mathbb{E}\Bigg[\inf_{b \in \{0, \dots, T-1\}} \sum_{t=1}^{T} \mathbb{1}\{E_b(x_t) \neq y_t\}\Bigg] + 3\log_2 T,$$

where $\mathcal{A}$ denotes Algorithm 5. Picking $b = \lceil M_{\mathcal{P}}(T, \mathcal{Z}) \rceil$, using Lemma 3.6, and the fact that $M_{\mathcal{P}}(x_{1:T}) \leq M_{\mathcal{P}}(T, \mathcal{Z})$ gives the desired upper bound on $\mathbb{E}\big[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\big]$ of

$$6\,(M_{\mathcal{P}}(T, \mathcal{Z}) + 1)\, \bar{M}_{\mathcal{B}}\Big(\frac{T}{M_{\mathcal{P}}(T, \mathcal{Z}) + 1} + 1, \mathcal{H}\Big) + 6\log_2 T,$$

completing the proof.

Corollary 3.2 follows from the fact that the upper bound on $\inf_{\mathcal{A}} M_{\mathcal{A}}(T, \mathcal{H}, \mathcal{Z})$ is sublinear whenever $M_{\mathcal{P}}(T, \mathcal{Z}) = o(T)$ and $M_{\mathcal{B}}(T, \mathcal{H}) = o(T)$. To get Corollary 3.3, recall that by Lemma 2.1, there exists an offline learner $\mathcal{B}$ such that $M_{\mathcal{B}}(T, \mathcal{H}) = O(\mathrm{VC}(\mathcal{H}) \log_2 T)$. Plugging this bound into upper bound (1) completes the proof.

D Proof of Theorem 3.1

Let $\mathcal{A}$ denote the DWMA run using the Standard Optimal Algorithm (SOA), Algorithm 3, and Algorithm 5 as its three experts. Then, for any realizable stream $(x_1, y_1), \dots, (x_T, y_T)$, Lemma 3.4 gives that

$$\mathbb{E}\Bigg[\sum_{t=1}^{T} \mathbb{1}\{\mathcal{A}(x_t) \neq y_t\}\Bigg] \leq 3\,\mathbb{E}\Big[\min_{i \in [3]} M_i + \log_2 3\Big] \leq 3 \min_{i \in [3]} \mathbb{E}[M_i] + 5,$$

where we take $M_1$, $M_2$, and $M_3$ to be the number of mistakes made by the SOA, Algorithm 3, and Algorithm 5 respectively. Note that $M_2$ and $M_3$ are random variables, since $\mathcal{B}$ and $\mathcal{P}$ may be randomized algorithms. Finally, using Lemma 3.5, Lemma 3.7, and the fact that the SOA makes at most $L(\mathcal{H})$ mistakes on any realizable stream [Littlestone, 1987] completes the proof of Theorem 3.1.

E Proof of Theorem 3.8

Let $\mathcal{X} = \mathbb{R} \cup \{\star\}$ and $\mathcal{H} = \{x \mapsto \mathbb{1}\{x \geq a\}\,\mathbb{1}\{x \neq \star\} : a \in \mathbb{R}\}$. Let $T, n \in \mathbb{N}$ be chosen such that $T$ is a multiple of $n + 1$ and $\frac{T}{n+1} + 1 = 2^k$ for some $k \in \mathbb{N}$. Our proof of Theorem 3.8 is in four steps, as described below.

(1) We construct a class of streams $\mathcal{Z}_n \subseteq \mathcal{X}^*$.

(2) Using $\mathcal{Z}_n$, we construct a deterministic, lazy, consistent Predictor $\mathcal{P}$ such that $\mathcal{P}$ makes mistakes exactly on time points $\{\frac{T}{n+1} + 1, \frac{2T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$ for every stream $x_{1:T} \in \mathcal{Z}_n$.
(3) When $x_{1:T} \in \mathcal{Z}_n$, we establish an equivalence between the game defined by Protocol 2 when given access to Predictor $\mathcal{P}$ and Online Classification with Peeks, a different game where there is no Predictor, but the learner observes the next $\frac{T}{n+1}$ examples $x_{t : t + \frac{T}{n+1} - 1}$ at time points $t \in \{1, \frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$.

(4) For Online Classification with Peeks, we give a strategy for Nature such that it can force any online learner to make $\frac{(n+1) \log_2(\frac{T}{n+1})}{2}$ mistakes in expectation while ensuring that the stream of labeled examples it picks, $(x_1, y_1), \dots, (x_T, y_T)$, satisfies the constraints that $x_{1:T} \in \mathcal{Z}_n$ and $\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq y_t\} = 0$.

Composing steps 1–4 shows the existence of a Predictor $\mathcal{P}$ such that for any learner $\mathcal{A}$ playing Protocol 2 using $\mathcal{P}$, there exists a realizable stream where $\mathcal{A}$ makes at least $\frac{n+1}{2} \log_2(\frac{T}{n+1})$ mistakes in expectation.

Step 1: Construction of $\mathcal{Z}_n$. Let $\mathcal{S}$ be the set of all strictly increasing sequences of real numbers in $(0, 1)$ of size $\frac{T}{n+1}$. Fix a function $f : \mathbb{R}^2 \to \mathcal{S}$ which, given $a < b \in \mathbb{R}$, outputs an element of $\mathcal{S}$ that lies strictly in between $a$ and $b$. For example, given $a < b \in \mathbb{R}$, the function $f$ can output the sequence of $\frac{T}{n+1}$ evenly spaced real numbers strictly between $a$ and $b$. Let $\mathrm{Dyd} : \mathcal{S} \to \mathcal{X}^{\frac{T}{n+1}}$ be a function that reorders its input $S \in \mathcal{S}$ in dyadic order. Namely, if $S = (x_1, \dots, x_N)$ where $N + 1 = 2^k$ for some $k \in \mathbb{N}$, then $\mathrm{Dyd}(S)$ is

$$\Big(x_{\frac{N+1}{2}},\; x_{\frac{N+1}{4}},\; x_{\frac{3(N+1)}{4}},\; x_{\frac{N+1}{8}},\; \dots,\; x_{\frac{(2^k - 1)(N+1)}{2^k}}\Big).$$

See the proof of Claim 3.4 in Hanneke et al. [2024] for a more detailed description of the dyadic order. On the other hand, let $\mathrm{Sort} : \mathcal{X}^{\frac{T}{n+1}} \to \mathcal{S}$ be a function that reorders its input in increasing order. Let $\mathcal{J} := \{1, \dots, \frac{T}{n+1} + 1\}^{\leq n}$ be the set of all sequences of indices of length at most $n$ taking values in $\{1, \dots, \frac{T}{n+1} + 1\}$. For the remainder of this section, we will use $S_i$ to denote the $i$-th element of a sequence $S \in \mathcal{S}$. Moreover, for any two sequences $S^1, S^2 \in \mathcal{S}$, we say $S^1 < S^2$ if $S^1_{|S^1|} < S^2_1$. That is, $S^1 < S^2$ if $S^1$ lies strictly to the left of $S^2$.

Algorithm 6 Stream Generator (SG)
Input: $S^0 \in \mathcal{S}$, $j_{1:m} \in \mathcal{J}$
1 Initialize: $a_0 = 0$, $b_0 = 1$
2 for i = 1, ..., m do
3   if $j_i = 1$ then
4     $S^i \leftarrow f(a_{i-1}, S^{i-1}_1)$
5     $a_i \leftarrow a_{i-1}$
6     $b_i \leftarrow S^{i-1}_1$
7   else if $j_i = \frac{T}{n+1} + 1$ then
8     $S^i \leftarrow f(S^{i-1}_{T/(n+1)}, b_{i-1})$
9     $a_i \leftarrow S^{i-1}_{T/(n+1)}$
10    $b_i \leftarrow b_{i-1}$
11  else
12    $S^i \leftarrow f(S^{i-1}_{j_i - 1}, S^{i-1}_{j_i})$
13    $a_i \leftarrow S^{i-1}_{j_i - 1}$
14    $b_i \leftarrow S^{i-1}_{j_i}$
15  end
16 end
17 Return: $\mathrm{Dyd}(S^0) \circ \dots \circ \mathrm{Dyd}(S^m)$

We construct a stream for every sequence $j_{1:m} \in \mathcal{J}$, $m \leq n$, algorithmically as follows. Fix $S^0 := f(0, 1) \in \mathcal{S}$ and let SG denote Algorithm 6. Let

$$\mathcal{Z}_n = \big\{\mathrm{SG}(S^0, j_{1:m}) : j_{1:m} \in \mathcal{J},\; m \in \{1, \dots, n\}\big\}$$

denote the stream class generated by applying SG to inputs $S^0$ and $j_{1:m}$ for every $j_{1:m} \in \mathcal{J}$, $m \leq n$. We make four important observations about $\mathcal{Z}_n$, which we will use to construct a Predictor that can reconstruct $S^i$ given $S^{i-1}$ and the first example of the block $\mathrm{Dyd}(S^i)$.

Observation 1. For every sequence $x_{1:T} \in \mathcal{Z}_n$, we have that $x_{1:T/(n+1)} = \mathrm{Dyd}(S^0)$.

The first observation follows from the fact that the same initial sequence $S^0$ is used to generate every stream in $\mathcal{Z}_n$.

Observation 2. For any pair $j^1_{1:n}, j^2_{1:n} \in \mathcal{J}$ and $m \leq n$, if $j^1_{1:m} = j^2_{1:m}$, then $\mathrm{SG}(S^0, j^1_{1:m}) = \mathrm{SG}(S^0, j^2_{1:m})$.

The second observation follows from the fact that SG is deterministic.

Observation 3. For every $x_{1:T} \in \mathcal{Z}_n$ such that $x_{1:T} := \mathrm{SG}(S^0, j_{1:n}) = \mathrm{Dyd}(S^0) \circ \dots \circ \mathrm{Dyd}(S^n)$, the index $j_i$ can be computed exactly using only $S^{i-1}$ and $S^i_1$, for every $i \in [n]$.

To see the third observation, fix some $x_{1:T} \in \mathcal{Z}_n$. Then, there exists a sequence $S^1, \dots, S^n \in \mathcal{S}$ such that $x_{1:T} = \mathrm{Dyd}(S^0) \circ \dots \circ \mathrm{Dyd}(S^n)$, as well as a sequence $(a_0, b_0), \dots, (a_n, b_n)$. In addition, there exists a $j_{1:n} \in \mathcal{J}$ such that $x_{1:T} = \mathrm{SG}(S^0, j_{1:n})$. Fix $i \in [n]$ and consider $S^{i-1}$ and $S^i$.
By definition of Algorithm 6, there exists an index $q \in \{1, \dots, \frac{T}{n+1} + 1\}$ such that $S^i = f(S^{i-1}_{q-1}, S^{i-1}_q)$, where we take $S^{i-1}_0 = a_{i-1}$ and $S^{i-1}_{T/(n+1)+1} = b_{i-1}$. We claim that the index $q$ is unique. This follows from the fact that the collection $\{f(S^{i-1}_j, S^{i-1}_{j+1})\}_{j=0}^{T/(n+1)}$ is pairwise disjoint, since $a_{i-1} = S^{i-1}_0 < S^{i-1}_1 < \dots < S^{i-1}_{T/(n+1)} < S^{i-1}_{T/(n+1)+1} = b_{i-1}$. Finally, we claim that $S^{i-1}$ and the element $S^i_1$ identify the index $q$. This follows because

$$f(a_{i-1}, S^{i-1}_1) < f(S^{i-1}_1, S^{i-1}_2) < \dots < f(S^{i-1}_{T/(n+1)-1}, S^{i-1}_{T/(n+1)}) < f(S^{i-1}_{T/(n+1)}, b_{i-1}),$$

and thus $q$ is the smallest index $p \in \{1, \dots, \frac{T}{n+1}\}$ such that $S^i_1 < S^{i-1}_p$, and $q = \frac{T}{n+1} + 1$ if no such $p$ exists.

Observation 4. Fix a sequence $j_{1:n} \in \mathcal{J}$ and let $\mathrm{Dyd}(S^0) \circ \dots \circ \mathrm{Dyd}(S^n) = \mathrm{SG}(S^0, j_{1:n})$. For every $i, p \in [n]$ such that $i \leq p$, we have that: (i) $S^p < S^{i-1}$ if $j_i = 1$; (ii) $S^{i-1}_{j_i - 1} < S^p < S^{i-1}_{j_i}$ if $2 \leq j_i \leq \frac{T}{n+1}$; (iii) $S^{i-1} < S^p$ if $j_i = \frac{T}{n+1} + 1$.

The fourth observation follows from the fact that for every $i \in [n]$ and index $j_i \in \{1, \dots, \frac{T}{n+1} + 1\}$, the remaining sequence of sets $S^i, S^{i+1}, \dots, S^n$ all lie in the interval $(S^{i-1}_{j_i - 1}, S^{i-1}_{j_i})$ by design of Algorithm 6, where again we take $S^{i-1}_0 = a_{i-1} < S^{i-1}_1$ and $S^{i-1}_{T/(n+1)+1} = b_{i-1} > S^{i-1}_{T/(n+1)}$.

Step 2: Constructing a Predictor for $\mathcal{Z}_n$. We now show that Algorithm 7 is a lazy, consistent Predictor for $\mathcal{Z}_n$ that only makes mistakes at time points $\{\frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$.

Lemma E.1. For any sequence $x_{1:T} \in \mathcal{Z}_n$, Algorithm 7 is a lazy, consistent Predictor that makes mistakes only at time points $\{\frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$.

Proof. Let $\mathcal{P}$ denote Algorithm 7 and fix $x_{1:T} \in \mathcal{Z}_n$. Then, there exist $S^1, \dots, S^n \in \mathcal{S}$ and a sequence of indices $j_{1:n} \in \mathcal{J}$ such that $x_{1:T} = \mathrm{Dyd}(S^0) \circ \mathrm{Dyd}(S^1) \circ \dots \circ \mathrm{Dyd}(S^n) = \mathrm{SG}(S^0, j_{1:n})$. We now prove that $\mathcal{P}$ makes mistakes only on time points $\{\frac{T}{n+1} + 1, \dots, \frac{nT}{n+1} + 1\}$ and nowhere else. Our proof is by induction using the following inductive hypothesis: for every $i \in \{1, \dots, n\}$, $\mathcal{P}$ (i) sets $J_{1:i} = j_{1:i}$ on round $\frac{iT}{n+1} + 1$; (ii) makes mistakes on rounds $\{\frac{T}{n+1} + 1, \frac{2T}{n+1} + 1, \dots, \frac{iT}{n+1} + 1\}$ and nowhere in between.

Algorithm 7 Predictor for $\mathcal{Z}_n$
Input: $\mathcal{Z}_n$
1 Initialize: $J = ()$
2 for t = 1, ..., T do
3   Receive $x_t$.
4   if $t = 1$ then
5     Set $\hat{x}^t_{1:T} = \mathrm{Dyd}(S^0) \circ \hat{x}_{T/(n+1)+1:T}$, where $\hat{x}_{T/(n+1)+1:T} = (\star, \dots, \star)$.
6   else if $t = \frac{iT}{n+1} + 1$ for some $i \in \{1, \dots, n\}$ then
7     Let $S = \mathrm{Sort}(x_{t - T/(n+1) : t-1})$ be the last $\frac{T}{n+1}$ examples sorted in increasing order.
8     Find the smallest $j \in \{1, \dots, \frac{T}{n+1}\}$ such that $x_t < S_j$. If no such $j$ exists, set $j = \frac{T}{n+1} + 1$.
9     Update $J \leftarrow J \circ j$.
10    Set $\hat{x}^t_{1:T} = \mathrm{SG}(S^0, J) \circ \hat{x}_{t + T/(n+1) : T}$, where $\hat{x}_{t + T/(n+1) : T} = (\star, \dots, \star)$.
11  else
12    Set $\hat{x}^t_{1:T} \leftarrow \hat{x}^{t-1}_{1:T}$.
13  end
14  Predict $\hat{x}^t_{1:T}$.

For the base case, let $i = 1$. $\mathcal{P}$ does not make any mistakes on rounds $\{1, 2, \dots, \frac{T}{n+1}\}$, since it knows $S^0$ (via $\mathcal{Z}_n$), computes $x_{1:T/(n+1)} = \mathrm{Dyd}(S^0)$ in line 5, and does not change its prediction until round $\frac{iT}{n+1} + 1$ based on line 6. At time point $t_1 = \frac{T}{n+1} + 1$, $\mathcal{P}$ makes a mistake, since $\hat{x}^{t_1 - 1}_{t_1} = \star \neq x_{t_1}$. Moreover, using Observation 3, the index $j \in \{1, \dots, \frac{T}{n+1} + 1\}$ computed in round $t_1$ on line 8 matches $j_1$. Thus, we have that $J_1 = j_1$. This completes the base case.

Now for the induction step, let $i \in \{2, \dots, n\}$, and suppose that the inductive hypothesis is true for $i - 1$. This means that $\mathcal{P}$: (i) sets $J_{1:i-1} = j_{1:i-1}$ on round $\frac{(i-1)T}{n+1} + 1$; (ii) makes mistakes on rounds $\{\frac{T}{n+1} + 1, \frac{2T}{n+1} + 1, \dots, \frac{(i-1)T}{n+1} + 1\}$ and nowhere in between. We need to show that $\mathcal{P}$ sets $J_i = j_i$ on round $\frac{iT}{n+1} + 1$, makes no mistakes between rounds $\frac{(i-1)T}{n+1} + 2$ and $\frac{iT}{n+1}$, but makes a mistake at round $\frac{iT}{n+1} + 1$.
Step 3: Equivalence to Online Classification with Peeks

For any stream x_{1:T} ∈ Z_n, having access to the Predictor specified by Algorithm 7 implies that at every t ∈ {1, T/(n+1) + 1, ..., nT/(n+1) + 1}, the learner observes predictions x̂^t_{1:T} where x̂^t_{1:t−1} = x_{1:t−1}, x̂^t_{t:t+T/(n+1)−1} = x_{t:t+T/(n+1)−1}, and x̂^t_{t+T/(n+1):T} = (⊥, ..., ⊥). Accordingly, at the timepoints t ∈ {1, T/(n+1) + 1, ..., nT/(n+1) + 1}, the learner observes the next T/(n+1) − 1 examples x_{t+1:t+T/(n+1)−1} in the stream, but learns nothing about the future examples x_{t+T/(n+1):T}. In addition, for timepoints in between those in {1, T/(n+1) + 1, ..., nT/(n+1) + 1}, the learner does not observe any new information from P, since by line 12 in Algorithm 7, x̂^i_{1:T} = x̂^{i+r}_{1:T} for every i ∈ {1, T/(n+1) + 1, ..., nT/(n+1) + 1} and r ∈ {1, ..., T/(n+1) − 1}. As a result, whenever x_{1:T} ∈ Z_n, Protocol 2 with the Predictor specified by Algorithm 7 is equivalent to a setting we call Online Classification with Peeks, where there is no Predictor, but the learner observes the next T/(n+1) − 1 examples exactly at the timepoints t ∈ {1, T/(n+1) + 1, ..., nT/(n+1) + 1}. Indeed, by having knowledge of the next T/(n+1) − 1 examples exactly at these timepoints, a learner for Online Classification with Peeks can simulate a Predictor that acts like Algorithm 7. Likewise, a learner for Online Classification with Predictions can use Algorithm 7 to simulate an adversary that reveals the next T/(n+1) − 1 examples exactly at these timepoints. Accordingly, we consider Online Classification with Peeks for the rest of the proof and show how Nature can force the lower bound in Theorem 3.8 in this new setting.

Step 4: Nature's Strategy for Online Classification with Peeks

Let A be any online learner and consider the game where the learner A observes the next T/(n+1) − 1 examples at the timepoints {1, T/(n+1) + 1, ..., nT/(n+1) + 1}. We construct a hard stream for A in this setting. We first describe a minimax optimal offline strategy for Nature when it is forced to play a sequence of examples S ∈ S sorted in dyadic order.

Algorithm 8 Nature's Minimax Offline Strategy
Input: S̃ = Dyd(S) for some S ∈ S, version space V ⊆ {0, 1}^X
1 Initialize: V_1 = V
2 Reveal S̃ to the learner A.
3 for t = 1, ..., T/(n+1) do
4   Observe the probability p̂_t of A predicting label 1 on S̃_t.
5   if p̂_t ≥ 1/2 then
6     If there exists h ∈ V_t such that h(S̃_t) = 0, reveal true label y_t = 0. Else, reveal y_t = 1.
7   else
8     If there exists h ∈ V_t such that h(S̃_t) = 1, reveal true label y_t = 1. Else, reveal y_t = 0.
9   Update V_{t+1} = {h ∈ V_t : h(S̃_t) = y_t}.
10 end
11 Return: true labels y_1, ..., y_{T/(n+1)} and version space V_{T/(n+1)+1}
Lemma E.2. For any learner A, S̃ = Dyd(S), and version space V ⊆ {0, 1}^X, Algorithm 8 forces A to make at least (1/2) · log₂(T/(n+1)) mistakes in expectation if S is threshold-shattered (Definition 7) by V.

Proof. The lemma follows directly from Theorem 3.4 in Hanneke et al. [2024]. For the definition of threshold-shattering, see Appendix A.

Note that for every input S̃ = Dyd(S) and V ⊆ {0, 1}^X to Algorithm 8, its output version space V_{|S̃|+1} is non-empty and consistent with the sequence (S̃_1, y_1), ..., (S̃_{T/(n+1)}, y_{T/(n+1)}) as long as |V| > 0. This property will be crucial when proving Lemma E.4.

We are now ready to describe Nature's strategy for Online Classification with Peeks. The pseudocode is provided in Algorithm 9. We establish a series of important lemmas.

Lemma E.3. For every learner A, if (x_1, y_1), ..., (x_T, y_T) is the stream output by Algorithm 9 when playing against A, then x_{1:T} ∈ Z_n.

Proof. Fix a learner A and let (x_1, y_1), ..., (x_T, y_T) denote the output of Algorithm 9 when playing against A. Let j_{1:n} denote the sequence of indices output by Algorithm 9. Then, since SG is deterministic, by lines 6-7 in Algorithm 9 we have that x_{1:T} = SG(S⁰, (j_1, ..., j_n)) ∈ Z_n.

Lemma E.4. For every learner A, if (x_1, y_1), ..., (x_T, y_T) is the stream output by Algorithm 9 when playing against A, then (x_1, y_1), ..., (x_T, y_T) is realizable by H.

Algorithm 9 Nature's Strategy for Online Classification with Peeks
Input: Learner A, hypothesis class H
1 Initialize: V_1 = H
2 for i = 1, ..., n + 1 do
3   if i = 1 then
4     Set x_{1:T/(n+1)} = Dyd(S⁰) and reveal it to the learner A.
5   else
6     Compute S = SG(S⁰, (j_1, ..., j_{i−1})).
7     Let x_{(i−1)T/(n+1)+1:iT/(n+1)} be the last T/(n+1) examples in S and reveal them to the learner A.
8   Play against A according to Algorithm 8 using x_{(i−1)T/(n+1)+1:iT/(n+1)} and version space V_i.
9   Let y_{(i−1)T/(n+1)+1}, ..., y_{iT/(n+1)} be the returned labels and V_{i+1} ⊆ V_i be the returned version space.
10  Let (x̃_{(i−1)T/(n+1)+1}, ỹ_{(i−1)T/(n+1)+1}), ..., (x̃_{iT/(n+1)}, ỹ_{iT/(n+1)}) be the labeled examples (x_{(i−1)T/(n+1)+1}, y_{(i−1)T/(n+1)+1}), ..., (x_{iT/(n+1)}, y_{iT/(n+1)}) sorted in increasing order with respect to the examples.
11  if ỹ_{iT/(n+1)} = 1 then
12    Set j_i = T/(n+1) + 1.
13  else
14    Set j_i to be the smallest p ∈ {1, ..., T/(n+1)} such that ỹ_{(i−1)T/(n+1)+p} = 0.
15 end
16 Return: stream (x_1, y_1), ..., (x_T, y_T), indices j_{1:n}, and version spaces V_1, ..., V_{n+2}.

Proof. Fix a learner A and let (x_1, y_1), ..., (x_T, y_T) be the output of Algorithm 9 when playing against A. Let V_2, ..., V_{n+2} be the sequence of version spaces output by Algorithm 9. It suffices to show that V_{n+2} is non-empty and consistent with (x_1, y_1), ..., (x_T, y_T). Our proof is by induction using the following hypothesis: V_{i+1} is non-empty and consistent with the sequence (x_1, y_1), ..., (x_{iT/(n+1)}, y_{iT/(n+1)}).

For the base case, let i = 1. Then, by Algorithm 8, line 8 in Algorithm 9, and the fact that |V_1| = |H| > 0, we have that |V_2| > 0 and V_2 is consistent with (x_1, y_1), ..., (x_{T/(n+1)}, y_{T/(n+1)}). Now consider some i ≥ 2 and suppose the induction hypothesis is true for i − 1. Then we know that |V_i| > 0 and V_i is consistent with (x_1, y_1), ..., (x_{(i−1)T/(n+1)}, y_{(i−1)T/(n+1)}). Again, by design of Algorithm 8 and line 9 in Algorithm 9, it follows that |V_{i+1}| > 0 and V_{i+1} is consistent with (x_{(i−1)T/(n+1)+1}, y_{(i−1)T/(n+1)+1}), ..., (x_{iT/(n+1)}, y_{iT/(n+1)}). Since V_{i+1} ⊆ V_i and V_i is consistent with (x_1, y_1), ..., (x_{(i−1)T/(n+1)}, y_{(i−1)T/(n+1)}), we get that V_{i+1} is consistent with (x_1, y_1), ..., (x_{iT/(n+1)}, y_{iT/(n+1)}), completing the induction step.
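For intuition, the following sketch plays out Algorithm 8 against recorded learner probabilities, assuming a finite version space represented as Python dicts from examples to labels; this finite-dict representation and the name natures_offline_strategy are our own.

```python
def natures_offline_strategy(examples, learner_probs, version_space):
    # Algorithm 8: answer opposite to the learner's majority guess whenever
    # some hypothesis in the version space allows it; otherwise concede.
    labels, V = [], list(version_space)
    for x, p_hat in zip(examples, learner_probs):
        desired = 0 if p_hat >= 0.5 else 1   # the label the learner is leaning away from
        y = desired if any(h[x] == desired for h in V) else 1 - desired
        labels.append(y)
        V = [h for h in V if h[x] == y]      # version-space update (line 9)
    return labels, V
```

Note that if version_space is non-empty, the returned version space is non-empty and consistent with the revealed labels, which is the property invoked in the proofs of Lemmas E.4 and E.5.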
Lemma E.5. For every learner A, if (x_1, y_1), ..., (x_T, y_T) and V_1, ..., V_{n+2} are the stream and version spaces output by Algorithm 9 when playing against A, then for every i ∈ {1, ..., n + 1}, the version space V_i threshold-shatters x_{(i−1)T/(n+1)+1:iT/(n+1)}.

Proof. Fix a learner A and let (x_1, y_1), ..., (x_T, y_T) denote the output of Algorithm 9 playing against A. Let j_{1:n} and V_1, ..., V_{n+2} denote the sequences of indices and version spaces output by Algorithm 9 respectively. Note that x_{1:T} = SG(S⁰, j_{1:n}). Moreover, for every i ∈ {2, ..., n}, we have that x_{1:iT/(n+1)} = SG(S⁰, (j_1, ..., j_{i−1})) by lines 6-7.

Fix an i ∈ {1, ..., n + 1}. It suffices to show that the threshold hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} belong to V_i. Our proof is by induction. For the base case, since V_1 = H, it trivially follows that the hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} belong to V_1. Now, suppose that the hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} belong to V_m for some m < i. We show that V_{m+1} also contains these hypotheses. Recall that V_{m+1} ⊆ V_m is the subset of V_m that is consistent with the labeled data (x_{(m−1)T/(n+1)+1}, y_{(m−1)T/(n+1)+1}), ..., (x_{mT/(n+1)}, y_{mT/(n+1)}) and is the result of running Algorithm 8 with input version space V_m and sequence x_{(m−1)T/(n+1)+1:mT/(n+1)}. It suffices to show that the hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} are also consistent with (x_{(m−1)T/(n+1)+1}, y_{(m−1)T/(n+1)+1}), ..., (x_{mT/(n+1)}, y_{mT/(n+1)}).

To show this, recall that j_m is the index computed in lines 11-14 of Algorithm 9 on round m. Let (x̃_{(m−1)T/(n+1)+1}, ỹ_{(m−1)T/(n+1)+1}), ..., (x̃_{mT/(n+1)}, ỹ_{mT/(n+1)}) be the sample sorted in increasing order by examples. There are three cases to consider.

Suppose j_m = 1. Then ỹ_{(m−1)T/(n+1)+1} = 0, and it must be the case that ỹ_{(m−1)T/(n+1)+p} = 0 for all p ∈ {2, ..., T/(n+1)}. Since x_{1:mT/(n+1)} = SG(S⁰, (j_1, ..., j_{m−1})), by definition of Algorithm 6, we have that the last T/(n+1) entries of SG(S⁰, (j_1, ..., j_m)) all lie strictly to the left of x̃_{(m−1)T/(n+1)+1}. Moreover, by Observation 4, this is true of the last T/(n+1) entries of SG(S⁰, (j_1, ..., j_m, q_{m+1}, ..., q_{i−1})) for any q_{m+1}, ..., q_{i−1} ∈ {1, ..., T/(n+1) + 1}. Therefore, we must have that x_{(i−1)T/(n+1)+1:iT/(n+1)}, which are the last T/(n+1) entries of SG(S⁰, (j_1, ..., j_{i−1})), lie strictly to the left of x̃_{(m−1)T/(n+1)+1}, implying that their associated hypotheses output 0 on all of (x̃_{(m−1)T/(n+1)+1}, ỹ_{(m−1)T/(n+1)+1}), ..., (x̃_{mT/(n+1)}, ỹ_{mT/(n+1)}) as needed.

By symmetry, when j_m = T/(n+1) + 1, we have that x_{(i−1)T/(n+1)+1:iT/(n+1)} lies strictly to the right of x̃_{mT/(n+1)}, implying that their associated hypotheses output 1 on all of (x̃_{(m−1)T/(n+1)+1}, ỹ_{(m−1)T/(n+1)+1}), ..., (x̃_{mT/(n+1)}, ỹ_{mT/(n+1)}) as needed.

Now consider the case where j_m ∈ {2, ..., T/(n+1)}. Then, by Algorithm 6 and Observation 4, for any q_{m+1}, ..., q_{i−1} ∈ {1, ..., T/(n+1) + 1}, the last T/(n+1) entries of SG(S⁰, (j_1, ..., j_m, q_{m+1}, ..., q_{i−1})) lie strictly in between x̃_{(m−1)T/(n+1)+j_m−1} and x̃_{(m−1)T/(n+1)+j_m}. Thus, the hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} output 1 on the examples x̃_{(m−1)T/(n+1)+1:(m−1)T/(n+1)+j_m−1} and 0 on the examples x̃_{(m−1)T/(n+1)+j_m:mT/(n+1)}. Finally, note that by definition of j_m, it must be the case that ỹ_{(m−1)T/(n+1)+1:(m−1)T/(n+1)+j_m−1} = (1, ..., 1) and ỹ_{(m−1)T/(n+1)+j_m:mT/(n+1)} = (0, ..., 0).
Thus, once again the hypotheses parameterized by x_{(i−1)T/(n+1)+1:iT/(n+1)} are consistent with the sample (x̃_{(m−1)T/(n+1)+1}, ỹ_{(m−1)T/(n+1)+1}), ..., (x̃_{mT/(n+1)}, ỹ_{mT/(n+1)}). This shows that these hypotheses are contained in V_{m+1}, completing the induction step.

Step 5: Completing the proof of Theorem 3.8

We are now ready to complete the proof of Theorem 3.8, which follows from composing Lemmas E.1, E.2, E.3, E.4, and E.5. Namely, Lemma E.1 and the discussion in Step 3 show that there exists a Predictor P such that for any learner A playing according to Protocol 2, Online Classification with Predictions is equivalent to Online Classification with Peeks whenever the stream (x_1, y_1), ..., (x_T, y_T) selected by the adversary satisfies the constraint that x_{1:T} ∈ Z_n. Lemmas E.3 and E.4 show that for any learner A, Nature playing according to Algorithm 9 guarantees that the resulting sequence (x_1, y_1), ..., (x_T, y_T) satisfies both the constraint that x_{1:T} ∈ Z_n and realizability by H. Thus, for the Predictor P specified by Algorithm 7 and Nature playing according to Algorithm 9, Online Classification with Predictions is equivalent to Online Classification with Peeks. Finally, for Online Classification with Peeks, combining Lemmas E.2 and E.5 shows that for any learner A, Nature, by playing according to Algorithm 9, guarantees that A makes at least (1/2) · log₂(T/(n+1)) mistakes in expectation every T/(n+1) rounds. Thus, Nature forces A to make at least ((n+1)/2) · log₂(T/(n+1)) mistakes in expectation by the end of the game, completing the proof.

F Adaptive Rates in the Agnostic Setting

In this section, we consider the harder agnostic setting and prove results analogous to those in Section 3. Our main quantitative result is the agnostic analog of Theorem 3.1.

Theorem F.1 (Agnostic upper bound). For every H ⊆ Y^X, Predictor P, and no-regret offline learner B, there exists an online learner A such that for every stream (x_1, y_1), ..., (x_T, y_T), A's expected regret is at most

min{ √(L(H) · T · log₂(eT)) , 2(M_P(x_{1:T}) + 1) · R_B(T/(M_P(x_{1:T}) + 1) + 1, H) + √(T log₂ T) },

where we refer to the first term of the minimum as upper bound (i) and the second as upper bound (ii).

With respect to learnability, Corollary F.2 shows that offline learnability of H is sufficient for online learnability under predictable examples.

Corollary F.2 (Offline learnability ⟹ agnostic online learnability with predictable examples). For every H ⊆ Y^X and stream class Z, if Z is predictable and H is offline learnable, then (H, Z) is agnostic online learnable.

In addition, we can also establish a quantitative version of Corollary F.2 for VC classes.

Corollary F.3. For every H ⊆ {0, 1}^X, Predictor P, stream class Z, and no-regret offline learner B, there exists an online learner A such that

R_A(T, H, Z) = O( (M_P(T, Z) + 1) · √( VC(H) · (T/(M_P(T, Z) + 1)) · log₂(T/(M_P(T, Z) + 1)) ) ).

The proof of Corollary F.3 is in Section F.4. The remainder of this section is dedicated to proving Theorem F.1 and Corollary F.3. The proof is similar to the realizable case. It involves constructing two online learners with expected regret bounds (i) and (ii) respectively, and then running the celebrated Randomized Exponential Weights Algorithm (REWA) using these learners as experts [Cesa-Bianchi and Lugosi, 2006]. The following guarantee of REWA, along with upper bounds (i) and (ii), gives the upper bound in Theorem F.1.

Lemma F.4 (REWA guarantee [Cesa-Bianchi and Lugosi, 2006]). The expected regret of REWA when run with N experts and learning rate η = √(8 ln(N)/T) is at most min_{i∈[N]} M_i + √(T log₂ N), where M_i is the number of mistakes made by expert i ∈ [N].
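As a reference point, here is a minimal sketch of REWA for the 0-1 loss with full-information feedback; the list-based interface is ours, and we use the learning rate from Lemma F.4.

```python
import numpy as np

def rewa(expert_preds, labels, rng=np.random.default_rng(0)):
    # expert_preds[t][i] is expert i's prediction on round t; labels[t] is y_t.
    T, N = len(expert_preds), len(expert_preds[0])
    eta = np.sqrt(8 * np.log(N) / T)                 # learning rate from Lemma F.4
    losses, mistakes = np.zeros(N), 0
    for t in range(T):
        w = np.exp(-eta * (losses - losses.min()))   # numerically stabilized weights
        i = rng.choice(N, p=w / w.sum())             # sample an expert prop. to exp(-eta * loss)
        mistakes += int(expert_preds[t][i] != labels[t])
        losses += np.array([p != labels[t] for p in expert_preds[t]], dtype=float)
    return mistakes
```

Sampling an expert, rather than averaging predictions, is what makes the guarantee hold for the non-convex 0-1 loss.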
The online learner obtaining the regret bound √(L(H) · T · log₂(eT)) is the generic agnostic online learner from Hanneke et al. [2023], so we omit the details here. Our second learner is described in Section F.2 and uses Algorithm 3 as a subroutine. The following lemma, bounding the expected regret of Algorithm 3 in the agnostic setting, will be crucial.

Lemma F.5. For every H ⊆ Y^X, Predictor P, no-regret offline learner B, and stream (x_1, y_1), ..., (x_T, y_T), the expected regret of Algorithm 3 is at most (M_P(x_{1:T}) + 1) · R_B(T, H).

F.1 Proof of Lemma F.5

The proof closely follows that of Lemma 3.5.

Proof. Let A denote Algorithm 3 and (x_1, y_1), ..., (x_T, y_T) denote the stream to be observed by A. Let c be the random variable denoting the number of mistakes made by Predictor P on the stream and t_1, ..., t_c be the random variables denoting the timepoints where P makes these errors (i.e., x̂^{t_i − 1}_{t_i} ≠ x_{t_i}). Note that t_i ≥ 2 for all i ∈ [c]. We will show, pointwise for every value of c and t_1, ..., t_c, that the regret of A is at most (c + 1) · R_B(T, H) in expectation over the randomness of B. Taking an outer expectation with respect to the randomness of P and using the fact that E[c] = M_P(x_{1:T}) completes the proof.

First, consider the case where c = 0 (i.e., P makes no mistakes). Then, since P is lazy, we have that x̂^t_{1:T} = x_{1:T} for every t ∈ [T]. Thus, line 5 fires exactly once on round t = 1, A initializes an offline learner B_1 with x_{1:T}, and A uses B_1 to make its predictions on all rounds. Thus, the expected regret of A is at most R_B(T, H).

Now, let c > 0 and t_1, ..., t_c be the timepoints where P errs. Partition the sequence 1, ..., T into the disjoint intervals (1, ..., t_1 − 1), (t_1, ..., t_2 − 1), ..., (t_c, ..., T). Define t_0 := 1 and t_{c+1} := T + 1. Fix an i ∈ {0, ..., c}. Then, for every j ∈ {t_i, ..., t_{i+1} − 1}, we have that x̂^j_{1:t_{i+1}−1} = x_{1:t_{i+1}−1}. This comes from the fact that P does not err on timepoints t_i + 1, ..., t_{i+1} − 1 and is both consistent and lazy (see Assumptions 1 and 2). Thus, line 5 fires on round t_i, A initializes an offline learner B_i with the sequence x̂^{t_i}_{t_i:T} = x_{t_i:t_{i+1}−1} ◦ x̂^{t_i}_{t_{i+1}:T}, and A uses B_i to make predictions on all the timepoints t_i, ..., t_{i+1} − 1. Note that line 5 does not fire on timepoints t_i + 1, ..., t_{i+1} − 1.

Let h_i ∈ argmin_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t} be an optimal hypothesis for the interval (t_i, ..., t_{i+1} − 1). Let y^i_t = y_t for t_i ≤ t ≤ t_{i+1} − 1 and y^i_t = h_i(x̂^{t_i}_t) for all t ≥ t_{i+1}. Then, note that

inf_{h∈H} Σ_{t=t_i}^{T} 1{h(x̂^{t_i}_t) ≠ y^i_t} = inf_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t}.

Now, consider the hypothetical labeled stream (x̂^{t_i}_{t_i}, y^i_{t_i}), ..., (x̂^{t_i}_T, y^i_T) = (x_{t_i}, y_{t_i}), ..., (x_{t_{i+1}−1}, y_{t_{i+1}−1}), (x̂^{t_i}_{t_{i+1}}, y^i_{t_{i+1}}), ..., (x̂^{t_i}_T, y^i_T). By definition, B_i, after being initialized with x̂^{t_i}_{t_i:T}, makes at most

inf_{h∈H} Σ_{t=t_i}^{T} 1{h(x̂^{t_i}_t) ≠ y^i_t} + R_B(T − t_i + 1, H) = inf_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t} + R_B(T − t_i + 1, H)

mistakes in expectation when simulated on the stream (x̂^{t_i}_{t_i}, y^i_{t_i}), ..., (x̂^{t_i}_T, y^i_T). Thus, B_i makes at most inf_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t} + R_B(T − t_i + 1, H) mistakes in expectation on the prefix (x̂^{t_i}_{t_i}, y^i_{t_i}), ..., (x̂^{t_i}_{t_{i+1}−1}, y^i_{t_{i+1}−1}) = (x_{t_i}, y_{t_i}), ..., (x_{t_{i+1}−1}, y_{t_{i+1}−1}). Since on timepoint t_i, A instantiates B_i with the sequence x̂^{t_i}_{t_i:T} and proceeds to simulate B_i on the sequence of labeled examples (x_{t_i}, y_{t_i}), ..., (x_{t_{i+1}−1}, y_{t_{i+1}−1}), A makes at most inf_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t} + R_B(T − t_i + 1, H) mistakes in expectation on the sequence (x_{t_i}, y_{t_i}), ..., (x_{t_{i+1}−1}, y_{t_{i+1}−1}). Since the interval i was chosen arbitrarily, this is true for every i ∈ {0, ..., c}, and therefore

E[ Σ_{t=1}^{T} 1{A(x_t) ≠ y_t} ] = Σ_{i=0}^{c} E[ Σ_{t=t_i}^{t_{i+1}−1} 1{A(x_t) ≠ y_t} ]
≤ Σ_{i=0}^{c} ( inf_{h∈H} Σ_{t=t_i}^{t_{i+1}−1} 1{h(x_t) ≠ y_t} + R_B(T − t_i + 1, H) )
≤ inf_{h∈H} Σ_{t=1}^{T} 1{h(x_t) ≠ y_t} + (c + 1) · R_B(T, H),

as desired.
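The restart structure of Algorithm 3 that this proof leans on can be summarized by the following sketch; the interfaces predictor.forecast, learner.predict, and learner.update are hypothetical stand-ins of our own, not the paper's notation.

```python
def restarting_learner(predictor, make_offline_learner, stream):
    # Sketch of Algorithm 3's restart structure: whenever the standing
    # forecast is contradicted by a revealed example ("line 5 fires"),
    # initialize a fresh offline learner B_i on the Predictor's new forecast.
    forecast, learner, predictions = None, None, []
    for t, (x, y) in enumerate(stream):
        if forecast is None or forecast[t] != x:
            forecast = predictor.forecast(t, x)           # full forecast of the stream
            learner = make_offline_learner(forecast[t:])  # B_i is given the forecast suffix
        predictions.append(learner.predict(x))
        learner.update(x, y)                              # reveal the true label y_t
    return predictions
```

Each restart costs at most one R_B term, and the forecast is contradicted M_P(x_{1:T}) times in expectation, which is the source of the (M_P(x_{1:T}) + 1) · R_B(T, H) bound.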
F.2 Proof of upper bound (ii) in Theorem F.1

The proof of upper bound (ii) in Theorem F.1 closely follows the proof of upper bound (iii) in Theorem 3.1 from the realizable setting. The main idea is to run REWA using the same experts defined in Algorithm 4 and to bound the expected regret in terms of the expected regret of K from Lemma F.5. We show that Algorithm 5, using REWA in line 3 and the experts in Algorithm 4 with their guarantee from Lemma F.5, achieves upper bound (ii) in Theorem F.1.

Let (x_1, y_1), ..., (x_T, y_T) be the stream to be observed by the learner. Let A denote the online learner in Algorithm 5 using REWA in line 3 of Algorithm 5. By the guarantee of REWA, we have that

E[ Σ_{t=1}^{T} 1{A(x_t) ≠ y_t} ] ≤ E[ inf_{b∈{0,...,T−1}} Σ_{t=1}^{T} 1{E_b(x_t) ≠ y_t} ] + √(T log₂ T)
≤ E[ Σ_{t=1}^{T} 1{E_{M_P(x_{1:T})}(x_t) ≠ y_t} ] + √(T log₂ T)
= E[ Σ_{i=0}^{M_P(x_{1:T})} Σ_{t=t̃_i+1}^{t̃_{i+1}} 1{K_i(x_t) ≠ y_t} ] + √(T log₂ T)
≤ E[ Σ_{i=0}^{M_P(x_{1:T})} ( (m_i + 1) · R_B(t̃_{i+1} − t̃_i, H) + inf_{h∈H} Σ_{t=t̃_i+1}^{t̃_{i+1}} 1{h(x_t) ≠ y_t} ) ] + √(T log₂ T)
≤ E[ Σ_{i=0}^{M_P(x_{1:T})} (m_i + 1) · R_B(t̃_{i+1} − t̃_i, H) ] + inf_{h∈H} Σ_{t=1}^{T} 1{h(x_t) ≠ y_t} + √(T log₂ T)
≤ inf_{h∈H} Σ_{t=1}^{T} 1{h(x_t) ≠ y_t} + 2(M_P(x_{1:T}) + 1) · R_B(T/(M_P(x_{1:T}) + 1) + 1, H) + √(T log₂ T),

where the fourth inequality uses the guarantee of K from Lemma F.5 and the last inequality follows using an identical argument as in the proof of Lemma 3.6, since R_B(T, H) is a concave, sublinear function of T.

F.3 Proof of Theorem F.1

Let A denote REWA run using the generic agnostic online learner from Hanneke et al. [2023] and the algorithm described in Section F.2 as its two experts. Then, for any stream (x_1, y_1), ..., (x_T, y_T), Lemma F.4 gives that

E[ Σ_{t=1}^{T} 1{A(x_t) ≠ y_t} ] ≤ E[ min_{i∈[2]} M_i ] + √(T log₂ 2) ≤ min_{i∈[2]} E[M_i] + √(T log₂ 2),

where we take M_1 and M_2 to be the number of mistakes made by the generic agnostic online learner from Hanneke et al. [2023] and the algorithm described in Section F.2 respectively. Note that M_1 and M_2 are random variables. Finally, using [Hanneke et al., 2023, Theorem 4] as well as upper bound (ii) completes the proof of Theorem F.1.

F.4 Proof of Corollaries F.2 and F.3

The proof of the generic upper bound on R_A(T, H, Z) follows by using the same learner A as in the proof of upper bound (ii) in Theorem F.1. However, this time we bound

inf_{b∈{0,...,T−1}} Σ_{t=1}^{T} 1{E_b(x_t) ≠ y_t} ≤ Σ_{t=1}^{T} 1{E_{M_P(T,Z)}(x_t) ≠ y_t}

and use an identical analysis as in the proof of upper bound (ii) and Lemma 3.6 to get

R_A(T, H, Z) = O( min{ √(L(H) · T · log₂ T) , (M_P(T, Z) + 1) · R_B(T/(M_P(T, Z) + 1), H) + √(T log₂ T) } ).

Corollary F.2 follows from the fact that R_A(T, H, Z) = o(T) if M_P(T, Z) = o(T) and R_B(T, H) = o(T). To get the upper bound in Corollary F.3, it suffices to plug the upper bound R_B(T, H) = O(√(VC(H) · T · log₂ T)), given by Theorem 6.1 of Hanneke et al. [2024], into the above upper bound on R_A(T, H, Z).

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims in the abstract and introduction are proven in Section 3 and Appendix F. In particular, Theorem 3.1, Corollary 3.2, and Corollary 3.3 in the main text cover the main claims made about the realizable setting. The results in Appendix F cover the main claims made about the agnostic setting.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The assumptions about Predictors are explicitly stated in Section 2. Section 4 discusses limitations and future work. One example of a limitation is the fact that measuring the performance of the Predictor using the 0-1 loss may be restrictive if X is continuous.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All theoretical results are numbered and cross-referenced. Assumptions are made clear in the theorem statements. All new theoretical results have a complete, detailed proof either in the main text or the appendix. Theorem 3.1 follows immediately from Lemmas 3.4, 3.5, 3.6, and 3.7. Lemma 3.4 is a well-known result. Lemma 3.5 is proven in detail in the main text. Lemmas 3.6 and 3.7 are proven in Appendix B. Theorem 3.8 is proven in Appendix E. Corollaries 3.2 and 3.3 are proven in Appendix C. All theoretical results for the agnostic setting are stated and proven in Appendix F.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: This paper has no experiments.
5. Open Access to Data and Code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This paper has no experiments.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This paper has no experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This paper has no experiments.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This paper has no experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have read and made sure that our paper conforms to the NeurIPS Code of Ethics. We have also made sure to preserve anonymity.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work is theoretical and contributes to our understanding of machine learning algorithms. Beyond theoretical insights that can be used to develop better algorithms, our work does not have direct societal impacts.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper has no experiments and therefore poses no such risks.

12. Licenses for Existing Assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This paper does not use existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This paper does not release new assets.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.