Universal Rates of Empirical Risk Minimization

Steve Hanneke, Department of Computer Science, Purdue University, steve.hanneke@gmail.com
Mingyue Xu, Department of Computer Science, Purdue University, xu1864@purdue.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

The well-known empirical risk minimization (ERM) principle is the basis of many widely used machine learning algorithms, and plays an essential role in the classical PAC theory. A common description of a learning algorithm's performance is its so-called "learning curve", that is, the decay of the expected error as a function of the input sample size. As the PAC model fails to explain the behavior of learning curves, recent research has explored an alternative universal learning model and has ultimately revealed a distinction between optimal universal and uniform learning rates (Bousquet et al., 2021). However, a basic understanding of such differences with a particular focus on the ERM principle has yet to be developed. In this paper, we consider the problem of universal learning by ERM in the realizable case and study the possible universal rates. Our main result is a fundamental tetrachotomy: there are only four possible universal learning rates by ERM, namely, the learning curves of any concept class learnable by ERM decay either at e^{-n}, 1/n, log(n)/n, or arbitrarily slow rates. Moreover, we provide a complete characterization of which concept classes fall into each of these categories, via new complexity structures. We also develop new combinatorial dimensions which supply sharp asymptotically-valid constant factors for these rates, whenever possible.

1 Introduction

Classical statistical learning theory mainly focuses on the celebrated PAC (Probably Approximately Correct) model (Vapnik and Chervonenkis, 1974; Valiant, 1984), with emphasis on supervised learning. A particular setting therein, called the realizable case, has been extensively studied. Complemented by the "no-free-lunch" theorem (Antos and Lugosi, 1996), the PAC framework, which adopts a minimax perspective, can only explain the best worst-case learning rate of a learning algorithm over all realizable distributions. Such learning rates are thus also called the uniform rates. However, the uniform rates can only capture the upper envelope of all learning curves, and are too coarse to explain practical machine learning performance. This is because real-world data is rarely worst-case, and the data source is typically fixed in a given learning scenario. Indeed, Cohn and Tesauro (1990, 1992) observed from experiments that practical learning rates can be much faster than is predicted by PAC theory. Moreover, many theoretical works (Schuurmans, 1997; Koltchinskii and Beznosova, 2005; Audibert and Tsybakov, 2007, etc.) were able to prove faster-than-uniform rates for certain learning problems, though requiring additional modelling assumptions. To distinguish them from the uniform rates, these rates are named the universal rates, and they were formalized recently by Bousquet et al. (2021) via a distribution-dependent framework. Unlike the simple dichotomy of the optimal uniform rates ("every concept class H has a uniform rate that is either linear VC(H)/n or bounded away from zero"), the optimal universal rates are captured by a trichotomy: every concept class H has a universal rate that is either exponential, linear, or arbitrarily slow (see Thm. 1.6 of Bousquet et al., 2021).
In supervised learning, a family of successful learners called empirical risk minimization (ERM) consists of all learning algorithms that output a sample-consistent classifier. In other words, an ERM algorithm is any learning rule which outputs a concept in H that minimizes the empirical error (see Appendix A for a formal definition). For notational simplicity, we first introduce the following definition.

Definition 1 (Version space, Mitchell, 1977). Let H be a concept class and S_n := {(x_i, y_i)}_{i=1}^n be a dataset. The version space induced by S_n, denoted by V_{S_n}(H) (or V_n(H) for short), is defined as V_{S_n}(H) := {h ∈ H : h(x_i) = y_i, ∀i ∈ [n]}.

Now given labeled samples S_n := {(x_i, y_i)}_{i=1}^n, an ERM algorithm is any learning algorithm that outputs a concept in the sample-induced version space, that is, a sequence of universally measurable functions A_n : S_n ↦ ĥ_n ∈ V_{S_n}(H), n ∈ N. Throughout this paper, we will simply denote an ERM algorithm by its output predictors {ĥ_n}_{n∈N}.

It is well known that the ERM principle plays an important role in understanding general uniform learnability: a concept class is uniformly learnable if and only if it can be learned by ERM. However, while the optimal VC(H)/n rate is achievable by some improper learner (Hanneke, 2016a), ERM algorithms can at best achieve a uniform rate of (VC(H)/n) log(n/VC(H)). Moreover, such a gap has been shown to be unavoidable in general (Auer and Ortner, 2007), which leaves a challenging question to study: what are the sufficient and necessary conditions on H for the entire family of ERM algorithms to achieve the optimal error? Indeed, many subsequent works have been devoted to improving the logarithmic factor in specific scenarios. The work of Giné and Koltchinskii (2006) refined the bound by replacing log(n/VC(H)) with log(θ(VC(H)/n)), where θ(·) is called the disagreement coefficient. Based on this, Hanneke and Yang (2015) proposed a new data-dependent bound with log(n̂_{1:n}/VC(H)), where n̂_{1:n} is a quantity related to the version space compression set size (a.k.a. the empirical teaching dimension). As a milestone, the work of Hanneke (2016b) proved an upper bound (VC(H)/n) log(s_H/VC(H)) and a lower bound (VC(H) + log(s_H))/n, where s_H is called the star number of H (see Definition 4 in Section 2). Though not quite matching, these two bounds together yield an optimal linear rate when s_H < ∞. Thereafter, the uniform rates by ERM can be described as a trichotomy: every concept class H has a uniform rate by ERM that is exactly one of the following: 1/n, log(n)/n, or bounded away from zero.

From a practical perspective, many ERM-based algorithms are designed and widely applied in different areas of machine learning, such as logistic regression and SVM, the CAL algorithm in active learning, and the gradient descent (GD) algorithm in deep learning. Since the worst-case nature of the PAC model is too pessimistic to reflect the practice of machine learning, understanding the distribution-dependent performance of ERM algorithms is of great significance. However, unlike the distinction between the optimal uniform and universal rates, which has been fully understood, how much faster universal learning can be than uniform learning, in particular by ERM, remains unclear. Furthermore, we lack a complete theory characterizing the universal rates by ERM, though certain specific scenarios that admit faster rates by ERM have been discovered (Schuurmans, 1997; van Handel, 2013).
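To make the version-space view of ERM concrete, the following Python sketch (ours, not part of the paper) represents concepts as dictionaries over a finite instance space, forms the version space induced by a labeled sample as in Definition 1, and returns a consistent concept. The optional worst_case flag picks the consistent concept with the largest true error, mimicking the adversarial "worst-case" ERM analyzed throughout the paper; all function and variable names are illustrative.

```python
def version_space(H, sample):
    """Concepts in H (dicts x -> {0,1}) consistent with the labeled sample."""
    return [h for h in H if all(h[x] == y for x, y in sample)]

def erm(H, sample, true_error=None, worst_case=False):
    """Return a sample-consistent concept; optionally the worst consistent one."""
    V = version_space(H, sample)
    if not V:
        raise ValueError("sample is not realizable by H")
    if worst_case and true_error is not None:
        # adversarial tie-breaking among the ERM outputs
        return max(V, key=true_error)
    return V[0]

# Tiny example: all threshold classifiers h_t(x) = 1(x >= t) on {0,...,5}.
X = range(6)
H = [{x: int(x >= t) for x in X} for t in range(7)]
sample = [(1, 0), (4, 1)]
print(erm(H, sample))
```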
In this paper, we aim to answer the following fundamental question:

Question 1. Given a concept class H, what are the possible rates at which H can be universally learned by ERM?

We start with some basic preliminaries. We consider an instance space X and a concept class H ⊆ {0,1}^X. Given a probability distribution P on X × {0,1}, the error rate of a classifier h : X → {0,1} is defined as er_P(h) := P((x, y) ∈ X × {0,1} : h(x) ≠ y). A distribution P is called realizable with respect to H, denoted by P ∈ RE(H), if inf_{h∈H} er_P(h) = 0. Note that in this definition, the h* satisfying er_P(h*) = inf_{h∈H} er_P(h) is called the target concept of the learning problem, and is not necessarily in H. We may also say that P is a realizable distribution centered at h*. Given an integer n, we denote by S_n := {(x_i, y_i)}_{i=1}^n ~ P^n an i.i.d. P-distributed dataset. In the universal learning framework, the performance of a learning algorithm is commonly measured by its learning curve (Bousquet et al., 2021; Hanneke et al., 2022; Bousquet et al., 2023), that is, the decay of the expected error rate E[er_P(ĥ_n)] as a function of the sample size n. With these settings in place, we are now able to formalize the problem of universal learning by ERM.

Definition 2 (Universal learning by ERM). Let H be a concept class, and let R(n) → 0 be a rate function.

- We say H is universally learnable at rate R by ERM, if for every distribution P ∈ RE(H), there exist parameters C, c > 0 such that for every ERM algorithm, E[er_P(ĥ_n)] ≤ CR(cn), for all n ∈ N.
- H is not universally learnable at rate faster than R by ERM, if there exist a distribution P ∈ RE(H) and parameters C, c > 0 such that there exists an ERM algorithm satisfying E[er_P(ĥ_n)] ≥ CR(cn), for infinitely many n ∈ N.
- H is universally learnable with exact rate R by ERM, if H is universally learnable at rate R by ERM, and is not universally learnable at rate faster than R by ERM.
- H requires arbitrarily slow rates to be universally learned by ERM, if for every rate function R(n) → 0, H is not universally learnable at rate faster than R by ERM.

Remark 1. The above definition inherits the structure of the definition of optimal universal learning (Definition 1.4 of Bousquet et al., 2021). Here, we are actually considering the "worst-case" ERM algorithm, which is consistent with the PAC theory. A crucial difference between this definition and the PAC one is that here the constants C, c > 0 are allowed to depend on the distribution P. In other words, the PAC model can be defined similarly, but requires uniform constants C, c > 0. Consequently, H is universally learnable at rate R by ERM if it is PAC learnable at rate R by ERM.

Remark 2. It is not hard to see that the error rate achieved by any ERM algorithm, given S_n ~ P^n as input, is at most sup_{h∈V_{S_n}(H)} er_P(h). Furthermore, for any distribution P ∈ RE(H), there exist ERM algorithms obtaining error rates arbitrarily close to this value. Hence, to obtain upper bounds on the universal rates by ERM, it suffices to bound the random variable sup_{h∈V_{S_n}(H)} er_P(h), where a common technique is to bound P(sup_{h∈V_{S_n}(H)} er_P(h) > ε) = P(∃h ∈ V_{S_n}(H) : er_P(h) > ε). To obtain lower bounds, we need to construct specific "hard" distributions.

1.1 Basic examples

In order to develop some initial intuition for what universal learning rates are possible for ERM, we first introduce several basic examples that illustrate the possibilities described in Section 1.2 below.
To convince the reader, we provide direct analysis (without using our characterization in Section 1.2) for these examples (see details in Appendix B.1).

Example 1 (e^{-n} learning rate). Any finite class H is universally learnable at exponential rate (Schuurmans, 1997). Indeed, according to that analysis, such exponential rates can also be achieved by any ERM algorithm.

Example 2 (1/n learning rate). Let H_thresh,N := {h_t : t ∈ N} be the class of all threshold classifiers on the natural numbers, where h_t(x) := 1(x ≥ t), for all x ∈ N. H_thresh,N is universally learnable at exponential rate since this class does not have an infinite Littlestone tree (Bousquet et al., 2021). However, ERM algorithms cannot guarantee such exponential rates, but only linear rates at best, when encountering certain realizable distributions centered at the target concept h_all-0's, the function that labels every point zero (see Appendix A).

Example 3 (log(n)/n learning rate). Let X = N and H_singleton,N := {h_t : t ∈ X} be the class of all singletons on X, where h_t(x) := 1(x = t), for all x ∈ N. It is clear that VC(H_singleton,N) = 1. Note that H_singleton,N is universally learnable at exponential rate since it has finite Littlestone dimension LD(H_singleton,N) = 1. However, the exact universal rate by ERM is instead log(n)/n. This is because H_singleton,N admits certain realizable distributions centered at h_all-0's. Indeed, it is an example where the universal rate by ERM matches the uniform rate, up to a distribution-dependent constant.

Example 4 (Arbitrarily slow learning rate). Let X = ⋃_{i∈N} X_i be the disjoint union of finite sets with |X_i| = 2^i, for all i ∈ N. For each i ∈ N, let H_i := {h_S := 1_S : S ⊆ X_i, |S| ≥ 2^{i-1}} and consider the concept class H = ⋃_{i∈N} H_i. H is universally learnable at exponential rate since it does not have an infinite Littlestone tree. However, a bad ERM algorithm can perform arbitrarily slowly.

Example 5 (Not Glivenko-Cantelli but learnable by ERM). Let X = [0,1], H := {1_S : S ⊆ X, |S| < ∞}, and let P be the uniform (Lebesgue) distribution on [0,1]. H is universally learnable at exponential rate (no infinite Littlestone tree). Moreover, H is not a universal Glivenko-Cantelli class for P (van Handel, 2013), but is still universally learnable by ERM. However, consider the class H ∪ {h_all-1's}, which is still not a universal Glivenko-Cantelli class for P; it is no longer universally learnable by ERM, since for the realizable distribution with marginal P and all labels equal to 1, a bad ERM algorithm can output the indicator of a finite set containing the sample points, and thus have er_P(ĥ_n) = 1 regardless of the sample size.

The above examples indicate that the cases of universal learning by ERM do not match those of uniform learning, but contain at least five possibilities: every nontrivial concept class H is either universally learnable at exponential rate (but not faster), or is universally learnable at linear rate (but not faster), or is universally learnable at the slightly slower than linear rate log(n)/n (but not faster), or is universally learnable but necessarily with arbitrarily slow rates, or is not universally learnable at all. Throughout this paper, we only consider the case where the given concept class is universally learnable by ERM. We leave it as an open question whether there exists a nice characterization that determines universal learnability by ERM.

1.2 Main results

In this section, we summarize the main results of this paper. The examples in Section 1.1 reveal that there are at least four possible universal rates by ERM. Interestingly, we find that these are also the only possibilities. The following two theorems constitute the main results of this work.
In particular, Theorem 1 gives a complete answer to Question 1. It expresses a fundamental tetrachotomy: there are exactly four possibilities for the universal learning rates by ERM, namely exponential, linear, log(n)/n, or arbitrarily slow rates. Moreover, Theorem 2 refines the answer by pointing out for which realizable distributions (targets) these universal rates are sharp.

Theorem 1 (Universal rates for ERM). For every class H with |H| ≥ 3, the following hold:

- H is universally learnable by ERM with exact rate e^{-n} if and only if |H| < ∞.
- H is universally learnable by ERM with exact rate 1/n if and only if |H| = ∞ and H does not have an infinite star-eluder sequence.
- H is universally learnable by ERM with exact rate log(n)/n if and only if H has an infinite star-eluder sequence and VC(H) < ∞.
- H requires at least arbitrarily slow rates to be learned by ERM if and only if VC(H) = ∞.

Remark 3. The formal definition of the star-eluder sequence can be found in Section 2. While the separation between the exact e^{-n} and 1/n rates is determined by the cardinality of the class, and the separation between the exact log(n)/n and arbitrarily slow rates is determined by the VC dimension of the class, whether there exists a simple combinatorial quantity that determines the separation between the exact 1/n and log(n)/n rates is unclear and might be an interesting direction for future work. One might expect the star number s_H (Definition 4) to be the correct characterization here, but unfortunately it turns out not to be (see details in Section 4 and Appendix B.3).

Based on Theorem 1, a distinction between the performance of ERM algorithms and that of optimal universal learning algorithms can be revealed, which we present in the following table (the definitions required in the "Case" column are deferred to Section 2, and examples can be found in Appendix B.2).

Optimal rate | Exact rate by ERM | Case | Example
e^{-n} | 1/n | infinite eluder sequence but no infinite Littlestone tree | Example 12
e^{-n} | log(n)/n | infinite star-eluder sequence but no infinite Littlestone tree | Example 13
e^{-n} | arbitrarily slow | infinite VC-eluder sequence but no infinite Littlestone tree | Example 15
1/n | log(n)/n | infinite star-eluder sequence but no infinite VCL tree | Example 14
1/n | arbitrarily slow | infinite VC-eluder sequence but no infinite VCL tree | Example 16

Furthermore, the distinction between the universal rates and the uniform rates by ERM can also be fully captured, and is depicted schematically in Figure 1, as an analogy to Fig. 4 of Bousquet et al. (2021). Besides the examples in Section 1.1, we also need the following additional example, concerning the Littlestone dimension, to appear in the diagram.

Example 6 (log(n)/n learning rate and unbounded Littlestone dimension). We consider here the class of two-dimensional halfspaces, that is, X := R^2 and H_halfspaces,R := {1(w·x + b ≥ 0) : w ∈ R^2, b ∈ R}. It is a classical fact that for any integer d, the class of halfspaces on R^d has finite VC dimension, but has an infinite Littlestone tree, and thus has unbounded Littlestone dimension (Shalev-Shwartz and Ben-David, 2014). Finally, to show that this class is universally learnable by ERM at exact log(n)/n rate, we simply consider the subset S^1 ⊆ X (the unit circle); it is indeed an infinite star set of H_halfspaces,R centered at h_all-0's, and thus yields an infinite star-eluder sequence.

As a complement to Theorem 1, the following Theorem 2 gives target-specified universal rates.
We say a target concept h* is universally learnable by ERM with exact rate R if the conditions of Definition 2 hold when the realizable distributions P are restricted to those centered at h*. In other words, saying that H is universally learnable with exact rate R is equivalent to saying that all realizable target concepts are universally learnable with exact rate R. Concretely, for each of the four possible rates stated in Theorem 1, Theorem 2 specifies the target concepts that can be learned at that exact rate by ERM.

Figure 1: A Venn diagram depicting the tetrachotomy of the universal rates by ERM and its relation with the uniform rates characterized by the VC dimension and the star number. (Diagram region labels: arbitrary rate, universal Glivenko-Cantelli, Littlestone, linear, exponential.)

Theorem 2 (Target-specified universal rates). For every class H with |H| ≥ 3, and a target concept h*, the following hold:

- h* is universally learnable by ERM with exact rate e^{-n} if and only if H does not have an infinite eluder sequence centered at h*.
- h* is universally learnable by ERM with exact rate 1/n if and only if H has an infinite eluder sequence centered at h*, but does not have an infinite star-eluder sequence centered at h*.
- h* is universally learnable by ERM with exact rate log(n)/n if and only if H has an infinite star-eluder sequence centered at h*, but does not have an infinite VC-eluder sequence centered at h*.
- h* requires at least arbitrarily slow rates to be universally learned by ERM if and only if H has an infinite VC-eluder sequence centered at h*.

All detailed proofs appear in Appendix D. We also provide a brief overview of the main idea of each proof, as well as some related concepts, in Section 2. An additional part of this work presents a fine-grained analysis (Bousquet et al., 2023) of the universal rates by ERM, which complements the coarse rates used in Theorem 1. Concretely, we provide a characterization of sharp distribution-free constant factors for the ERM universal rates, whenever possible. The characterization is based on two newly developed combinatorial dimensions, called the star-eluder dimension (or SE dimension) and the VC-eluder dimension (or VCE dimension) (Definition 9). We say "whenever possible" because distribution-free constants are unavailable in certain cases (Remark 16). Such a characterization can also be considered a refinement of the classical PAC theory, in the sense that it is sometimes better but only asymptotically valid. Due to space limitations, we defer the definition of fine-grained rates and related results to Appendix C.

1.3 Related works

PAC learning by ERM. The performance of consistent learning rules (including ERM algorithms) in the PAC (distribution-free) framework has been extensively studied. For VC classes, Blumer et al. (1989) gave a log(n)/n upper bound on the uniform learning rate. Despite the well-known equivalence between uniform learnability and uniform learnability by the ERM principle (Vapnik and Chervonenkis, 1971), the best upper bounds for general ERM algorithms differ from the optimal sample complexity by an unavoidable logarithmic factor (Auer and Ortner, 2007). By analyzing the disagreement coefficient of the version space, the works of Giné and Koltchinskii (2006) and Hanneke (2009) refined the logarithmic factor in certain scenarios.
Furthermore, beyond being a relevant measure in the context of active learning (Cohn et al., 1994; El-Yaniv and Wiener, 2012; Hanneke, 2011, 2014), the region of disagreement of the version space was found to admit an interpretation as a sample compression scheme, with its size known as the version space compression set size (Wiener et al., 2015; Hanneke and Yang, 2015). Based on this, the label complexity of the CAL algorithm can be converted into a bound on the error rates of all consistent PAC learners (Hanneke, 2016b). Finally, Hanneke and Yang (2015); Hanneke (2016b) introduced a simple combinatorial quantity named the star number, and guaranteed that a concept class with finite star number can be uniformly learned at linear rate.

Universal Learning. As observed in empirical experiments, the actual learning rates on real-world data can be much faster than those described by the PAC theory (Cohn and Tesauro, 1990, 1992). The work of Benedek and Itai (1988) considered a setting lying in between the PAC setting and the universal setting, called nonuniform learning, in which the learning rate may depend on the target concept but is still uniform over the marginal distributions. After that, Schuurmans (1997) studied classes of concept chains and revealed a distinction between exponential and linear rates, along with a theoretical guarantee. Later, improved learning rates were obtained for various practical learning algorithms such as stochastic gradient descent and kernel methods (Koltchinskii and Beznosova, 2005; Audibert and Tsybakov, 2007; Pillaud-Vivien et al., 2018, etc.). Additionally, van Handel (2013) studied the uniform convergence property from a universal perspective, and proposed the universal Glivenko-Cantelli property. Very recently, the universal (distribution-dependent) learning framework was formalized by Bousquet et al. (2021), in which a complete theory of (optimal) universal learnability was obtained as well. After that, Bousquet et al. (2023) carried out a fine-grained analysis of the distribution-free "tail" of the universal learning curves by characterizing the optimal constant factor. As generalizations, Kalavasis et al. (2022); Hanneke et al. (2023) studied universal rates for multiclass classification, and Hanneke et al. (2022) studied universal learning rates in an interactive learning setting.

2 Technical overview

In this section, we discuss some technical aspects of the derivation of our main results in Section 1.2. Our analysis of the universal learning rates by ERM is based on three new types of complexity structures, named the eluder sequence, the star-eluder sequence and the VC-eluder sequence. More details can be found in Sections 3-4, and all technical proofs are deferred to Appendix D.

Definition 3 (Realizable data). Let H be a concept class on an instance space X. We say that a (finite or infinite) data sequence {(x_1, y_1), (x_2, y_2), ...} in X × {0,1} is realizable (with respect to H) if for every n ∈ N, there exists h_n ∈ H such that h_n(x_i) = y_i for all i ∈ [n].

Definition 4 (Star number). Let X be an instance space and H be a concept class. We define the region of disagreement of H as DIS(H) := {x ∈ X : ∃h, g ∈ H s.t. h(x) ≠ g(x)}. Let h be a classifier. The star number of h, denoted by s_h(H) or s_h for short, is defined to be the largest integer s such that there exist distinct points x_1, ..., x_s ∈ X and concepts h_1, ..., h_s ∈ H satisfying DIS({h, h_i}) ∩ {x_1, ..., x_s} = {x_i}, for every 1 ≤ i ≤ s.
(We say {x_1, ..., x_s} is a star set of H centered at h.) If no such largest integer s exists, we define s_h = ∞. The star number of H, denoted by s(H) or s_H, is defined to be the maximum possible cardinality of a star set of H, or s_H = ∞ if no such maximum exists.

Remark 4. From this definition, it is clear that the star number s_H of H satisfies s_H ≥ VC(H). Indeed, any set {x_1, ..., x_d} that is shattered by H is also a star set of H, based on the following reasoning: since {x_1, ..., x_d} is shattered by H, there exists h ∈ H such that h(x_1) = ... = h(x_d) = 0; moreover, for any i ∈ [d], there exists h_i ∈ H satisfying h_i(x_i) = 1 and h_i(x_j) = 0 for all j ≠ i. An immediate implication is that a VC-eluder sequence is always a star-eluder sequence (see Definition 6 and Definition 7 below).

With these basic definitions in hand, we next define the three aforementioned sequences:

Definition 5 (Eluder sequence). Let H be a concept class. We say that H has an eluder sequence {(x_1, y_1), ..., (x_d, y_d)} if it is realizable and for every integer k ∈ [d], there exists h_k ∈ H such that h_k(x_i) = y_i for all i < k and h_k(x_k) ≠ y_k. The eluder dimension of H, denoted by E(H), is defined to be the largest integer d ≥ 1 such that H has an eluder sequence {(x_1, y_1), (x_2, y_2), ..., (x_d, y_d)}. If no such largest d exists, we say H has an infinite eluder sequence and define E(H) = ∞. We say an infinite eluder sequence {(x_1, y_1), (x_2, y_2), ...} is centered at h if h(x_i) = y_i for all i ∈ N.

Remark 5. It has been proved that max{s_H, log(LD(H))} ≤ E(H) ≤ 4^{max{s_H, 2LD(H)}} (Li et al., 2022, Thm. 8), where LD(H) is the Littlestone dimension of H. Moreover, the very recent work of Hanneke (2024) proved that E(H) ≤ |H| ≤ 2^{s_H · LD(H)}, which implies that any concept class with finite star number and finite Littlestone dimension must be a finite class.

Before proceeding to the next two definitions, we define a sequence of integers {n_k}_{k∈N} by n_1 = 0 and n_k := C(k, 2) = k(k-1)/2 for all k > 1, which satisfies n_{k+1} - n_k = k for all k ∈ N.

Definition 6 (Star-eluder sequence). Let H be a concept class and h be a classifier. We say that H has an infinite star-eluder sequence {(x_1, y_1), (x_2, y_2), ...} centered at h, if it is realizable and for every integer k ≥ 1, {x_{n_k+1}, ..., x_{n_k+k}} is a star set of V_{n_k}(H) centered at h.

Definition 7 (VC-eluder sequence). Let H be a concept class and h be a classifier. We say that H has an infinite VC-eluder sequence {(x_1, y_1), (x_2, y_2), ...} centered at h, if it is realizable and labelled by h, and for every integer k ≥ 1, {x_{n_k+1}, ..., x_{n_k+k}} is a shattered set of V_{n_k}(H).

Remark 6. In the definitions of the eluder sequence and the VC-eluder sequence, "{(x_1, y_1), (x_2, y_2), ...} centered at h" simply means that the sequence is labelled by h. However, the words "centered at" in the definition of the star-eluder sequence are more meaningful. In this paper, we give them a uniform name in order to make Theorem 2 look consistent.

Remark 7. An infinite star-eluder (VC-eluder) sequence requires the version space to keep having star (shattered) sets of ever-increasing size. If the size cannot grow indefinitely, the largest possible size of the star (shattered) set is called the star-eluder (VC-eluder) dimension (Definition 9), which plays an important role in our fine-grained analysis (Appendix C).
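As a purely illustrative companion to Definitions 4-6, the following brute-force Python sketch (ours, not from the paper) checks whether a labeled sequence is an eluder sequence for a finite class, and whether a finite set of points is a star set centered at a given hypothesis; concepts are again represented as dictionaries over a finite toy domain.

```python
def consistent(h, pairs):
    """h agrees with every labeled pair in `pairs`."""
    return all(h[x] == y for x, y in pairs)

def is_eluder_sequence(H, seq):
    """Definition 5: for each k, some h_k in H agrees with the prefix seq[:k-1]
    but disagrees on seq[k-1]; the (finite) sequence must also be realizable."""
    for k in range(1, len(seq) + 1):
        prefix, (xk, yk) = seq[:k - 1], seq[k - 1]
        if not any(consistent(h, prefix) and h[xk] != yk for h in H):
            return False
    # realizability: some concept in H is consistent with the whole finite sequence
    return any(consistent(h, seq) for h in H)

def is_star_set(H, h0, points):
    """Definition 4: each x in `points` is the unique disagreement (within the set)
    of some h in H with the center h0."""
    return all(
        any(h[x] != h0[x] and all(h[z] == h0[z] for z in points if z != x) for h in H)
        for x in points
    )

# Singletons on {0,...,9}: the whole domain is a star set centered at the all-zero function.
X = range(10)
H = [{x: int(x == t) for x in X} for t in X]
h0 = {x: 0 for x in X}
print(is_star_set(H, h0, list(X)))                      # True
print(is_eluder_sequence(H, [(0, 0), (1, 0), (2, 0)]))  # True
```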
To distinguish the notion of a star-eluder (VC-eluder) sequence here from the d-star-eluder (d-VC-eluder) sequence defined in Appendix C, we may call the construction in Definition 6 an infinite strong star-eluder sequence, and the construction in Definition 7 an infinite strong VC-eluder sequence.

Proof Sketch of Theorems 1 and 2. The proof of Theorem 1 is divided into two parts (Section 3 and Section 4). Roughly speaking, for each equivalence therein, we first characterize the exact universal rates by ERM via the three aforementioned sequences (see Theorems 3-6 in Section 3). For the sufficiency we have to prove a lower bound together with an upper bound, since we are establishing the "exact" universal rates. The lower bound is established by constructing a realizable distribution on the existing infinite sequence, and the derivation of the upper bound is strongly related to the classical PAC theory. To prove the necessity, we argue by contradiction. Then, in Section 4, we establish equivalent characterizations via well-known complexity measures, whenever possible. Theorem 2 is an associated target-dependent version, and is proved directly by the corresponding lemmas in Section 3. The complete proof structure for Theorem 1 can be summarized as follows:

- For the first bullet, we start by proving that H is universally learnable with exact rate e^{-n} if and only if H does not have an infinite eluder sequence (Theorem 3), and then we extend the equivalence by showing that H does not have an infinite eluder sequence if and only if H is a finite class (Lemma 8).
- For the second bullet, we prove that H is universally learnable with exact rate 1/n if and only if H has an infinite eluder sequence but does not have an infinite star-eluder sequence (Theorem 4). Then the desired equivalence follows immediately from the first bullet.
- For the third bullet, we prove that H is universally learnable with exact rate log(n)/n if and only if H has an infinite star-eluder sequence but does not have an infinite VC-eluder sequence (Theorem 5). The desired equivalence comes in conjunction with the claim that H has an infinite VC-eluder sequence if and only if H has infinite VC dimension (Lemma 9).
- Finally, for the last bullet, it suffices to prove that H requires at least arbitrarily slow rates to be universally learned by ERM if and only if H has an infinite VC-eluder sequence (Theorem 6).

3 Exact universal rates

Sections 3 and 4 of this paper are devoted to the proof ideas of Theorems 1 and 2, with further details. In this section, we give a complete characterization of the four possible exact universal rates by ERM (e^{-n}, 1/n, log(n)/n and arbitrarily slow rates) via the existence or nonexistence of the three combinatorial sequences defined in Section 2. For each of the following "if and only if" results (Theorems 3-6), we are required to prove both the sufficiency and the necessity. The sufficiency consists of both an upper bound and a lower bound, since we are proving exact universal rates. The necessity follows by contradiction, given that the rates are exact. All technical proofs are deferred to Appendix D.1.

3.1 Exponential rates

Theorem 3. H is universally learnable by ERM with exact rate e^{-n} if and only if H does not have an infinite eluder sequence.

The lower bound for the sufficiency is straightforward and was established by Schuurmans (1997).

Lemma 1 (e^{-n} lower bound).
Given a concept class H, for any learning algorithm ĥ_n, there exists a realizable distribution P with respect to H such that E[er_P(ĥ_n)] ≥ 2^{-(n+2)} for infinitely many n, which implies that H is not universally learnable at rate faster than the exponential rate e^{-n}.

Remark 8. Note that this lower bound is actually stronger than required, in the sense that it holds for any learning algorithm (not only for ERM algorithms).

Lemma 2 (e^{-n} upper bound). If H does not have an infinite eluder sequence (centered at h*), then H (resp. h*) is universally learnable by ERM at rate e^{-n}.

Proof of Theorem 3. The sufficiency follows directly from the lower bound in Lemma 1 together with the upper bound in Lemma 2. Furthermore, Lemma 3 in Section 3.2 proves that the existence of an infinite eluder sequence leads to a linear lower bound on the ERM universal rates. Therefore, the necessity follows by contradiction.

3.2 Linear rates

Theorem 4. H is universally learnable by ERM with exact rate 1/n if and only if H has an infinite eluder sequence but does not have an infinite star-eluder sequence.

Lemma 3 (1/n lower bound). If H has an infinite eluder sequence centered at h*, then h* is not universally learnable by ERM at rate faster than 1/n.

Lemma 4 (1/n upper bound). If H does not have an infinite star-eluder sequence (centered at h*), then H (resp. h*) is universally learnable by ERM at rate 1/n.

Proof of Theorem 4. To prove the sufficiency, on one hand, the existence of an infinite eluder sequence implies a linear lower bound by Lemma 3. On the other hand, if H does not have an infinite star-eluder sequence, Lemma 4 yields a linear upper bound. The necessity can be proved by contradiction: concretely, if either of the two conditions fails, the universal rate will be either e^{-n} or at least log(n)/n, based on Lemma 2 in Section 3.1 and Lemma 5 in Section 3.3.

3.3 log(n)/n rates

Theorem 5. H is universally learnable by ERM with exact rate log(n)/n if and only if H has an infinite star-eluder sequence but does not have an infinite VC-eluder sequence.

Lemma 5 (log(n)/n lower bound). If H has an infinite star-eluder sequence centered at h*, then h* is not universally learnable by ERM at rate faster than log(n)/n.

Remark 9. Note that the conclusion in Remark 5 explains why the intersection of "infinite Littlestone classes" and "classes with finite star number" is empty in Figure 1. However, we mention in Remark 3 that an infinite star number does not guarantee an infinite star-eluder sequence (see Appendix B.3 for details). Hence, Remark 5 cannot explain why the intersection of "infinite Littlestone classes" and "classes that are learnable at linear rate by ERM" is also empty. To address this problem, we give the following additional result:

Proposition 1. Any infinite concept class H has either an infinite star-eluder sequence or infinite Littlestone dimension.

Lemma 6 (log(n)/n upper bound). If H does not have an infinite VC-eluder sequence (centered at h*), then H (resp. h*) is universally learnable by ERM at rate log(n)/n.

Proof of Theorem 5. To prove the sufficiency, on one hand, if H has an infinite star-eluder sequence, the universal rates have a log(n)/n lower bound by Lemma 5. On the other hand, if H does not have an infinite VC-eluder sequence, then Lemma 6 yields a log(n)/n upper bound. The necessity can be proved by contradiction, based on Lemma 4 in Section 3.2 and Lemma 7 in Section 3.4 below.

3.4 Arbitrarily slow rates

Theorem 6.
H requires at least arbitrarily slow rates to be learned by ERM if and only if H has an infinite VC-eluder sequence.

Proof of Theorem 6. Given the necessity, proved by Lemma 6 in Section 3.3, it suffices to prove the sufficiency, which is completed by the following Lemma 7.

Lemma 7 (Arbitrarily slow rates). If H has an infinite VC-eluder sequence centered at h*, then h* requires at least arbitrarily slow rates to be universally learned by ERM.

4 Equivalent characterizations

In Section 3, it was shown that the eluder sequence, the star-eluder sequence and the VC-eluder sequence are the correct characterizations of the exact universal learning rates by ERM. However, their definitions are somewhat non-intuitive. Therefore, in this section, we aim to build connections between these combinatorial sequences and some well-understood complexity measures, which then give rise to our Theorem 1. Concretely, we have the following two equivalences (see Appendix D.2 for their complete proofs).

Lemma 8. H has an infinite eluder sequence if and only if |H| = ∞.

Lemma 9. H has an infinite VC-eluder sequence if and only if VC(H) = ∞.

Perhaps surprisingly, unlike the above two equivalences, s_H = ∞ is not equivalent to the existence of an infinite star-eluder sequence. Indeed, it is straightforward from the definition that if H has an infinite star-eluder sequence, then it must have s_H = ∞. However, the converse is not true.

Proposition 2. s_H = ∞ if H has an infinite star-eluder sequence. Moreover, there exist concept classes H with s_H = ∞ that do not have any infinite star-eluder sequence.

Remark 10. Based on the results in Section 3, the proposition essentially states that the gap between the 1/n and log(n)/n exact universal rates by ERM is not characterized by the star number s_H. We wonder whether there is some other simple combinatorial quantity that determines this gap, which would be a valuable direction for future work.

Why is the case of the star-eluder sequence different from the other two structures? We suspect that the distinction arises from the following: unlike the eluder sequence and the VC-eluder sequence, the centered concept of a star-eluder sequence is much more meaningful (see Remark 6). Concretely, within its definition, the set of the following k points is not only required to be a star set of the version space V_{n_k}(H), but is required to be centered at the same labelling target. This intuitively suggests that there might exist a class such that, for arbitrarily large integers k, it witnesses a star set of size k, but with a k-specific center (different for different k). Such a class (e.g., Examples 19, 20 in Appendix B.3) does have infinite star number but does not have an infinite star-eluder sequence. Indeed, the relations between the star-related notions (star number, star-eluder dimension, star set and star-eluder sequence) turn out to be more complicated than expected, and we defer them to Appendix B.3.

5 Appendix Summary

Due to page limitations, we leave some interesting results, as well as all the proofs, to the Appendices, which are briefly summarized below. Given the extra required notations and definitions in Appendix A and related technical lemmas in Appendix E, the main body of the supplements consists of three parts, namely Appendices B, C and D. Specifically, Appendix B contains three sub-parts. In Appendix B.1, we provide direct mathematical analysis (without using our characterization in Section 1.2) for the basic examples in Section 1.1.
In Appendix B.2, we provide details of the examples that appeared in Section 1.2. These examples are carefully constructed, providing evidence that ERM algorithms cannot guarantee the optimal universal rates (Bousquet et al., 2021). In Appendix B.3, we construct nuanced examples to distinguish between the following notions: the star number s_H (Definition 4), the star-eluder dimension SE(H) (Definition 9), star sets (Definition 4) and star-eluder sequences (Definition 6). These examples show the reader why our characterization in Theorem 1 uses the star-eluder sequence rather than the star number (see our discussions in Remarks 3 and 10). Appendix C presents a fine-grained analysis of the asymptotic rate of decay of the universal learning curves by ERM, whenever possible. This is an analogue of the optimal fine-grained universal learning curves studied in Bousquet et al. (2023). Concretely, we provide a characterization of sharp distribution-free constant factors for the ERM universal rates. Our characterization of these constant factors is based on two newly developed combinatorial dimensions, namely the star-eluder dimension (or SE dimension) and the VC-eluder dimension (or VCE dimension) (Definition 9). We say "whenever possible" because distribution-free constants are unavailable in certain cases (see our discussion in Remark 16). Such a characterization can be considered a refinement of the classical PAC theory, in the sense that it is sometimes better but only asymptotically valid. Finally, Appendix D includes all the missing proofs for the theorems and lemmas that appeared in previous sections.

6 Conclusion and Future Directions

In this paper, we reveal a fundamental tetrachotomy of the universal learning rates achieved by the ERM principle and provide a complete characterization of the exact universal rates via appropriate complexity structures. Additionally, by introducing new combinatorial dimensions, we are able to characterize sharp asymptotically-valid constant factors for these rates, whenever possible. While only the realizable case is considered in this paper, we believe analogous results can be extended to different learning scenarios such as the agnostic case. Generalizing the results from binary classification to multiclass classification would be another valuable future direction. Moreover, since this paper considers the "worst-case" ERM by its nature, studying the universal rates of the "best-case" ERM is also an interesting problem, which we leave for future work.

References

Antos, A. and Lugosi, G. (1996), Strong minimax lower bounds for learning, in Proceedings of the 9th Annual Conference on Computational Learning Theory, pp. 303-309.

Audibert, J.-Y. and Tsybakov, A. B. (2007), Fast learning rates for plug-in classifiers, The Annals of Statistics, 35, 608-633.

Auer, P. and Ortner, R. (2007), A new PAC bound for intersection-closed concept classes, Machine Learning, 66, 151-163.

Benedek, G. M. and Itai, A. (1988), Nonuniform learnability, in Automata, Languages and Programming: 15th International Colloquium, Tampere, Finland, July 11-15, 1988, Proceedings 15, Springer, pp. 82-92.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989), Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM (JACM), 36, 929-965.

Bousquet, O., Hanneke, S., Moran, S., Shafer, J., and Tolstikhin, I. (2023), Fine-grained distribution-dependent learning curves, in Proceedings of the 36th Annual Conference on Learning Theory, PMLR, pp. 5890-5924.
Bousquet, O., Hanneke, S., Moran, S., Van Handel, R., and Yehudayoff, A. (2021), A theory of universal learning, in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 532-541.

Cohn, D., Atlas, L., and Ladner, R. (1994), Improving generalization with active learning, Machine Learning, 15, 201-221.

Cohn, D. and Tesauro, G. (1990), Can neural networks do better than the Vapnik-Chervonenkis bounds? Advances in Neural Information Processing Systems, 3.

Cohn, D. and Tesauro, G. (1992), How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4, 249-269.

El-Yaniv, R. and Wiener, Y. (2012), Active learning via perfect selective classification, Journal of Machine Learning Research, 13, 255-279.

Giné, E. and Koltchinskii, V. (2006), Concentration inequalities and asymptotic results for ratio type empirical processes, The Annals of Probability, 34, 1143-1216.

Hanneke, S. (2009), Theoretical foundations of active learning, Carnegie Mellon University.

Hanneke, S. (2011), Rates of convergence in active learning, The Annals of Statistics, 39, 333-361.

Hanneke, S. (2014), Theory of disagreement-based active learning, Foundations and Trends in Machine Learning, 7, 131-309.

Hanneke, S. (2016a), The optimal sample complexity of PAC learning, Journal of Machine Learning Research, 17, 1-15.

Hanneke, S. (2016b), Refined error bounds for several learning algorithms, Journal of Machine Learning Research, 17, 1-55.

Hanneke, S. (2024), The star number and eluder dimension: Elementary observations about the dimensions of disagreement, in Proceedings of the 37th Annual Conference on Learning Theory.

Hanneke, S., Karbasi, A., Moran, S., and Velegkas, G. (2022), Universal rates for interactive learning, Advances in Neural Information Processing Systems, 35, 28657-28669.

Hanneke, S., Moran, S., and Zhang, Q. (2023), Universal rates for multiclass learning, in Proceedings of the 36th Annual Conference on Learning Theory, PMLR, pp. 5615-5681.

Hanneke, S. and Yang, L. (2015), Minimax analysis of active learning, Journal of Machine Learning Research, 16, 3487-3602.

Kalavasis, A., Velegkas, G., and Karbasi, A. (2022), Multiclass learnability beyond the PAC framework: Universal rates and partial concept classes, Advances in Neural Information Processing Systems, 35, 20809-20822.

Koltchinskii, V. and Beznosova, O. (2005), Exponential convergence rates in classification, in Proceedings of the International Conference on Computational Learning Theory, Springer, pp. 295-307.

Li, G., Kamath, P., Foster, D. J., and Srebro, N. (2022), Understanding the eluder dimension, Advances in Neural Information Processing Systems, 35, 23737-23750.

Littlestone, N. and Warmuth, M. (1986), Relating data compression and learnability, Unpublished manuscript.

Mitchell, T. M. (1977), Version spaces: A candidate elimination approach to rule learning, in Proceedings of the 5th International Joint Conference on Artificial Intelligence, vol. 1, pp. 305-310.

Pillaud-Vivien, L., Rudi, A., and Bach, F. (2018), Exponential convergence of testing error for stochastic gradient methods, in Conference on Learning Theory, PMLR, pp. 250-296.

Ramsey, F. P. (1987), On a problem of formal logic, in Classic Papers in Combinatorics, Springer, pp. 1-24.

Sauer, N. (1972), On the density of families of sets, Journal of Combinatorial Theory, Series A, 13, 145-147.

Schuurmans, D. (1997), Characterizing rational versus exponential learning curves, Journal of Computer and System Sciences, 55, 140-160.

Shalev-Shwartz, S. and Ben-David, S.
(2014), Understanding machine learning: From theory to algorithms, Cambridge University Press.

Valiant, L. G. (1984), A theory of the learnable, Communications of the ACM, 27, 1134-1142.

van Handel, R. (2013), The universal Glivenko-Cantelli property, Probability Theory and Related Fields, 155, 911-934.

Vapnik, V. and Chervonenkis, A. (1974), Theory of Pattern Recognition, Nauka, Moscow.

Vapnik, V. N. and Chervonenkis, A. Y. (1971), On uniform convergence of the frequencies of events to their probabilities, Theory of Probability and its Applications, 16, 264-280.

Wiener, Y., Hanneke, S., and El-Yaniv, R. (2015), A compression technique for analyzing disagreement-based active learning, Journal of Machine Learning Research, 16, 713-745.

A Preliminaries

Notation 1. We denote by N the set of all natural numbers {0, 1, ...}. For any n ∈ N, we denote [n] := {1, ..., n}.

Notation 2. For any x > 0, we redefine ln(x) := ln(x ∨ e) and log(x) := log_2(x ∨ 2). Moreover, for correctness, we also adopt the conventions that ln(0) = log(0) = 0 and 0·ln(∞) = 0·log(∞) = 0. It is then also reasonable to define 0·ln(0/0) = 0·log(0/0) = 0.

Notation 3. For any R-valued functions f and g, we write f(x) ≲ g(x) if there exists a finite numerical constant c > 0 such that f(x) ≤ c·g(x) for all x ∈ R. For example, ln(x) ≲ log(x) and log(x) ≲ ln(x).

Notation 4. Let X be an instance space. We write h_all-0's and h_all-1's to denote the hypotheses that output all zero labels and all one labels, respectively, that is, h_all-0's(x) = 0 and h_all-1's(x) = 1 for all x ∈ X.

Notation 5. For an infinite union of spaces (X_1 ∪ X_2 ∪ ...) and an integer k, we write X_{>k} to denote the union of the suffix (X_{k+1} ∪ X_{k+2} ∪ ...).

Definition 8 (Empirical risk minimization). Let H be a concept class on an instance space X. For every n ∈ N, let S_n := {(x_i, y_i)}_{i=1}^n ∈ (X × {0,1})^n be a set of samples. A learning algorithm that outputs {ĥ_n}_{n∈N} is called an Empirical Risk Minimization (ERM) algorithm if it satisfies ĥ_n ∈ argmin_{h∈H} êr_{S_n}(h) := argmin_{h∈H} { (1/n) Σ_{i=1}^n 1(h(x_i) ≠ y_i) } for all n ∈ N, where êr_{S_n}(ĥ_n) is called the empirical error rate of ĥ_n on S_n. It is clear that êr_{S_n}(ĥ_n) = 0 when P ∈ RE(H).

B Detailed examples

In this appendix, we provide further examples. Specifically, in Appendix B.1, we present direct analysis (without using our newly developed characterization) of each example illustrated in Section 1.1. The aim of the examples in Appendix B.2 is to reveal that ERM algorithms can sometimes be optimal in the universal learning framework, but sometimes not, and to compare their performance with that of optimal universal learning algorithms. Finally, we also provide additional examples related to the star number in Appendix B.3, as complements to Proposition 2 in Section 4.

B.1 Details of examples in Section 1.1

We provide direct analysis of the examples illustrated in Section 1.1, without using the characterization in Theorem 1. Concretely, Examples 8 and 9 illustrate scenarios where linear universal rates occur. Example 10 specifies a case where the universal rate by ERM matches the uniform rate log(n)/n. Example 11 specifies a case where extremely fast universal learning is achievable, but where some bad ERM algorithms can give rise to arbitrarily slow rates.

Example 7 (Example 1 restated). Any finite class H is universally learnable at exponential rate by ERM.
To show this, for any realizable distribution P with respect to H, we have

E[er_P(ĥ_n)] ≤ P(∃h ∈ H : er_P(h) > 0, êr_{S_n}(h) = 0)
≤ Σ_{h∈H : er_P(h)>0} P(êr_{S_n}(h) = 0)    (union bound)
= Σ_{h∈H : er_P(h)>0} (1 - er_P(h))^n
≤ |H| · (1 - min_{h∈H : er_P(h)>0} er_P(h))^n
≤ |H| · exp(-n · min_{h∈H : er_P(h)>0} er_P(h)),    (using 1 - t ≤ e^{-t})

Example 8 (Example 2 restated). Let H_thresh,N := {h_t : t ∈ N} be the class of all threshold classifiers on the space of natural numbers, defined by h_t(x) := 1(x ≥ t). H_thresh,N is universally learnable at exponential rate since this concept class does not have an infinite Littlestone tree (Bousquet et al., 2021). In what follows, we show that the worst-case ERM cannot achieve such an exponential rate, but has rate 1/n.

Let h_{t*} ∈ H_thresh,N be the target hypothesis. Given a dataset S_n, let h_{t̂} = ERM(S_n) be the output of an ERM algorithm. First consider any realizable distribution P whose labels satisfy y = 0 for all x = t < t* and y = 1 for all x = t ≥ t* (P-almost surely), and define t_l := max{t < t* : P(t) > 0}. According to the definition of threshold classifiers, if the dataset S_n contains at least one copy of both t_l and t*, then er_P(h_{t̂}) = 0. Therefore, we have

E[er_P(h_{t̂})] ≤ P(er_P(h_{t̂}) > 0) ≤ (1 - P(t*))^n + (1 - P(t_l))^n.

Note that for any ε ∈ (0,1), 1 - ε ≤ e^{-ε}; it follows immediately that

E[er_P(h_{t̂})] ≤ (1 - P(t*))^n + (1 - P(t_l))^n ≤ e^{-nP(t*)} + e^{-nP(t_l)} ≤ 2e^{-n·min{P(t*), P(t_l)}}.

However, let us now consider the distribution P satisfying P{(t,0)} = 2^{-t} and P{(t,1)} = 0 for all t ≥ 1. Note that P is also realizable with respect to H_thresh,N according to the definition, since

inf_{h ∈ H_thresh,N} er_P(h) = inf_{t∈N} er_P(h_t) = inf_{t∈N} P(x ≥ t) = 0.

Given a dataset S_n := {(x_i, y_i)}_{i=1}^n ~ P^n, let t_n := max_{i∈[n]} x_i be the largest point in the dataset. It is straightforward that the worst-case ERM outputs h_{t_n+1}, and we have

E[er_P(h_{t_n+1})] = Σ_{t≥1} P(x ≥ t+1) · P(max_{i∈[n]} x_i = t) = Σ_{t≥1} 2^{-t} [(1 - 2^{-t})^n - (1 - 2^{-(t-1)})^n].

On one hand, keeping only the term with t = log_2(n), for n of the form 2^k, the above series can be lower bounded by 1/(18n) for infinitely many n. This implies that H_thresh,N is not learnable by ERM at rate faster than 1/n. On the other hand, we have the following upper bound for all n:

Σ_{t≥1} 2^{-t} [(1 - 2^{-t})^n - (1 - 2^{-(t-1)})^n] ≤ Σ_{t≥1} 2^{-t} (1 - 2^{-t})^n ≤ 2 ∫_0^1 (1 - ε)^n dε = 2/(n+1),

which implies that H_thresh,N is indeed learnable by ERM at the linear rate 1/n. In conclusion, H_thresh,N is universally learnable by ERM with exact rate 1/n.

Example 9 (Threshold classifier on R). This example serves as a complement to Example 8. Here, we show that ERM algorithms can sometimes be optimal for universal learning. Specifically, let H_thresh,R := {h_t : t ∈ R} be the class of all threshold classifiers on the real line, defined by h_t(x) := 1(x ≥ t), x ∈ R. It has been shown that H_thresh,R is universally learnable with optimal linear rate (Schuurmans, 1997). To show that H_thresh,R is also universally learnable with exact linear rate by ERM, we only need to prove an upper bound. To this end, let h_{t*} ∈ H_thresh,R be the target hypothesis. Consider any realizable distribution P whose labels satisfy y = 0 for x < t* and y = 1 for x ≥ t* (P-almost surely), a dataset S_n, and ε ∈ (0,1); let h_{t̂} = ERM(S_n) be the output of an ERM algorithm. Now let A and B be the minimal regions to the left and to the right of t* ∈ R, respectively, such that P(A) = P(B) = ε. Note that the worst-case ERM output h_{t̂} ∈ H_thresh,R can have er_P(h_{t̂}) ≥ ε only if at least one of A and B contains no sample point.
Therefore, it follows that

E[er_P(h_{t̂})] = ∫_0^1 P(er_P(h_{t̂}) ≥ ε) dε ≤ ∫_0^1 2(1 - ε)^n dε ≤ ∫_0^1 2e^{-nε} dε ≤ 2/n.

Note that this analysis also applies to the realizable distributions with target concept h_all-0's. Therefore, H_thresh,R is universally learnable by ERM with exact rate 1/n.

Example 10 (Example 3 restated). Let X = N and let H_singleton,N := {h_t : t ∈ X} be the class of all singletons on X, where h_t(x) := 1(x = t), for all x ∈ X. It is clear that VC(H_singleton,N) = 1. Note that H_singleton,N is universally learnable at exponential rate since it does not have an infinite Littlestone tree (actually, we have LD(H_singleton,N) = 1). In what follows, we show that the worst-case ERM algorithm has exact universal rate log(n)/n. To obtain the exact rate by ERM for universally learning H_singleton,N, consider a uniform marginal distribution over {1, 2, ..., ⌈1/ε⌉} with all-zero labels, for ε ∈ (0,1). If the dataset S_n does not contain a copy of some point 1 ≤ x ≤ 1/ε, the worst-case ERM can place the label 1 at x, and thus has error rate er_P(ĥ_n) ≥ ε. Based on the Coupon Collector's problem, we know that to have E[er_P(ĥ_n)] ≤ ε, we need n = Ω(ε^{-1} log(1/ε)). In other words, E[er_P(ĥ_n)] = Ω(log(n)/n); that is, H_singleton,N is not universally learnable by ERM at rate faster than log(n)/n. Finally, the classical PAC theory yields the matching upper bound, and thus log(n)/n is tight.

Example 11 (Example 4 restated). Let X = ⋃_{i∈N} X_i be the disjoint union of finite sets with |X_i| = 2^i. For each i ∈ N, let H_i := {h_S := 1_S : S ⊆ X_i, |S| ≥ 2^{i-1}}, and consider the concept class H = ⋃_{i∈N} H_i. In what follows, we show that the worst-case ERM can be arbitrarily slow in learning this class. Given any rate function R(n) → 0, let {n_t}_{t≥1} and {i_t}_{t≥1} be two strictly increasing sequences such that the sequence {p_t := 2^{i_t - 2}/n_t : t ≥ 1} satisfies: {p_t}_{t≥1} is decreasing, Σ_{t≥1} p_t ≤ 1, and p_t ≥ 4R(n_t). We consider any ERM algorithm with the following property: if the data S_n = {(x_i, y_i)}_{i=1}^n satisfies y_i = 0 for all i ∈ [n], it outputs ĥ_n ∈ H_{i_{T_n}} with T_n := min{t : ∃h ∈ H_{i_t} s.t. h(x_1) = ... = h(x_n) = 0}. We construct the following distribution P: P{(x, 0)} = 2^{-i_t} p_t for all x ∈ X_{i_t}, t ∈ N, and we set P{(x', 0)} = 1 - Σ_{t≥1} p_t for some arbitrary choice of x' ∉ ⋃_{t∈N} X_{i_t}. Since

inf_{h∈H} er_P(h) = inf_{i∈N} inf_{h∈H_i} er_P(h) ≤ inf_{i∈N} er_P(h_{X_i}) ≤ inf_{t∈N} er_P(h_{X_{i_t}}) = inf_{t∈N} P{(x, 0) : x ∈ X_{i_t}} = 0,

we know that P is realizable with respect to H. Finally, we claim that the ERM defined above behaves poorly on P, by showing that E[er_P(ĥ_n)] ≥ R(n) for infinitely many n. To this end, note that for a dataset S_{n_t} = {(x_i, y_i)}_{i=1}^{n_t} ~ P^{n_t}, and for any t ∈ N, it holds that

P(T_{n_t} > t) ≤ P(|{j ∈ [n_t] : x_j ∈ X_{i_t}}| ≥ 2^{i_t - 1}) = P(Σ_{j=1}^{n_t} 1{x_j ∈ X_{i_t}} ≥ 2^{i_t - 1}) ≤ n_t p_t / 2^{i_t - 1} = 1/2,

where the last inequality follows from Markov's inequality. Therefore,

E[er_P(ĥ_{n_t})] ≥ 2R(n_t) · P(er_P(ĥ_{n_t}) ≥ 2R(n_t)) ≥ 2R(n_t) · P(er_P(ĥ_{n_t}) ≥ p_t/2)    (since p_t ≥ 4R(n_t))
≥ 2R(n_t) · P(er_P(ĥ_{n_t}) ≥ p_t/2 | T_{n_t} ≤ t) · P(T_{n_t} ≤ t) = 2R(n_t) · P(T_{n_t} ≤ t) ≥ R(n_t).

B.2 Optimal universal rates versus exact universal rates by ERM

In this section, we provide evidence that ERM algorithms cannot guarantee the best achievable universal learning rates. Recall that the optimal universal learning rates and the associated characterization have been fully understood by Bousquet et al. (2021), which we first restate:

Theorem 7 (Bousquet et al., 2021, Theorem 1.9). For every concept class H with |H| ≥ 3, the following hold:

- H is universally learnable with optimal rate e^{-n} if H does not have an infinite Littlestone tree.
- H is universally learnable with optimal rate 1/n if H has an infinite Littlestone tree but does not have an infinite (strong) VCL tree.
- H requires arbitrarily slow rates if H has an infinite (strong) VCL tree.

Based on our Theorem 1, to distinguish the optimal universal rates from the exact universal rates by ERM, we have to distinguish their corresponding characterizations. Indeed, the sequences we defined in Section 2 are strongly related to the Littlestone tree and the VCL tree in Theorem 7. According to the definitions, it is not hard to verify all of the following relations:

- Every branch of a Littlestone tree is an eluder sequence. Hence, if H does not have an infinite eluder sequence, then H must not have an infinite Littlestone tree, and thus can be universally learned with optimal exponential rate. However, there exists a class H having an infinite eluder sequence but no infinite Littlestone tree (see Example 12 below). This implies that such a class cannot be universally learned by ERM at rate faster than 1/n, but can be learned by some other "optimal" learning algorithm at rate e^{-n}.
- Every branch of a (strong) VCL tree is a VC-eluder sequence, and also a star-eluder sequence. Therefore, if H does not have an infinite star-eluder sequence, then it must not have an infinite VCL tree, and thus can be universally learned with optimal linear rate. However, there exists a concept class that has an infinite star-eluder sequence, but does not have an infinite VCL tree (see Example 14 below). Furthermore, there also exists a concept class that has an infinite star-eluder sequence, but does not even have an infinite Littlestone tree (see Example 13 below). These two examples imply that there exist classes that cannot be universally learned by ERM at rate faster than log(n)/n, but can be learned by some other "optimal" learning algorithms at 1/n or even e^{-n} rates.
- Moreover, there exists a concept class that has an infinite VC-eluder sequence, but does not have an infinite VCL tree (see Example 16 below), or even no infinite Littlestone tree (see Example 15 below). Such examples imply that there exist classes that require arbitrarily slow rates to be universally learned by ERM, but can be learned by some other "optimal" learning algorithms at 1/n or even e^{-n} rates.

To summarize, we illustrate all the distinctions in the following table.

Optimal rate | Exact rate by ERM | Case | Example
e^{-n} | 1/n | infinite eluder sequence but no infinite Littlestone tree | Example 12
e^{-n} | log(n)/n | infinite star-eluder sequence but no infinite Littlestone tree | Example 13
e^{-n} | arbitrarily slow | infinite VC-eluder sequence but no infinite Littlestone tree | Example 15
1/n | log(n)/n | infinite star-eluder sequence but no infinite VCL tree | Example 14
1/n | arbitrarily slow | infinite VC-eluder sequence but no infinite VCL tree | Example 16

Example 12 (Infinite eluder sequence but no infinite Littlestone tree). A simple example is given in Example 8, where H = H_thresh,N is the class of all threshold classifiers on N. Note that H does not have an infinite Littlestone tree, but any infinite sequence {(x_1, 0), (x_2, 0), ...} with x_1 < x_2 < ... is an infinite eluder sequence of H centered at h_all-0's. In particular, h_all-0's is the only realizable target that allows an infinite eluder sequence. In other words, for H, every realizable distribution with target concept h ∈ H is universally learnable by ERM at exponential rate, except the special target h_all-0's, which matches our analysis in Example 8.
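As a small numerical illustration of Example 12 (and of the 1/n analysis in Example 8), the following Python sketch simulates the worst-case ERM for thresholds on N under a geometric distribution with all-zero labels; the specific distribution, sample sizes, and constants are our own choices and not prescribed by the paper. The scaled average error avg*n stays roughly constant, consistent with a linear rather than exponential rate.

```python
import random

def sample_point(rng):
    """Draw x from P(x = t) = 2^{-t}, t = 1, 2, ...; all labels are 0."""
    t = 1
    while rng.random() < 0.5:
        t += 1
    return t

def worst_erm_error(n, rng):
    """Worst-case ERM for thresholds h_t(x) = 1(x >= t): output t = (max sample) + 1.
    Its error is P(x >= t_max + 1) = 2^{-t_max}."""
    t_max = max(sample_point(rng) for _ in range(n))
    return 2.0 ** (-t_max)

rng = random.Random(0)
for n in (10, 100, 1000):
    avg = sum(worst_erm_error(n, rng) for _ in range(2000)) / 2000
    print(n, avg, avg * n)   # avg * n stays roughly constant (linear rate)
```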
Example 13 (Infinite star-eluder sequence but no infinite Littlestone tree). Let X := S k N Xk be the disjoint union of finite sets with |Xk| = k and H := S k 1 Hk, where Hk := {1x : x Xk}. Note that this is exactly singletons on an infinite domain and we have the following hold: 1. H does not have an infinite Littlestone tree since for any root x X, the subclass {h H : h(x) = 1} has only size 1, and thus the corresponding subtree of the Littlestone tree must be finite. 2. H has an infinite star-eluder sequence. Indeed, any infinite sequence {(x1, 0), (x2, 0), . . .} with xk Xk for all k 1, is an infinite star-eluder sequence. To see this, note that for any k N, and any nk, the version space Vnk(H) contains S j>nk Hj. Therefore, {(xnk+1, 0), (xnk+2, 0), . . . , (xnk+k, 0)} is a star set of Vnk(H) centered at hall-0 s, witnessed by concepts {1{xnk+1}, 1{xnk+2}, . . . , 1{xnk+k}}. Example 14 (Infinite star-eluder sequence but no infinite VCL tree). Let X1 and H1 be defined in Example 13, let X2 = R and H2 = Hthresh,R be the class of all threshold classifiers on R. Note that H2 has an infinite Littlestone tree. Now we define X := X1 X2 and H := H1 H2, and have the following hold: 1. H does not have an infinite VCL tree since for any fixed root x X, the subclass {h H : h(x) = 1} has a VC dimension only 1, and thus the corresponding subtree of the VCL tree must be finite. 2. H has an infinite star-eluder sequence (see Example 13). Example 15 (Infinite VC-eluder sequence but no infinite Littlestone tree). Let X := S k N Xk be the disjoint union of finite sets with |Xk| = k and H := S k 1 Hk, where Hk := {1S : S Xk}. We have the following hold: 1. H does not have an infinite Littlestone tree since for any root x X, the subclass {h H : h(x) = 1} is finite, and thus the corresponding subtree of the Littlestone tree must be finite. 2. H has an infinite VC-eluder sequence. Indeed, any sequence {(x1, 0), (x2, 0), (x3, 0), . . .} with xnk+1, . . . , xnk+k Xk for all k 1, is an infinite VC-eluder sequence. Furthermore, it has been argued that VC(H) = (Ex.2.3 Bousquet et al., 2021), which is consistent with our Lemma 9 in Section 4. Example 16 (Infinite VC-eluder sequence but no infinite VCL tree). Let X1 and H1 be defined in Example 15, let X2 = R and H2 = Hthresh,R be the class of all threshold classifiers on R. Note that H2 has an infinite Littlestone tree. Now we define X := X1 X2 and H := H1 H2, and have the following hold: 1. H does not have an infinite VCL tree since for any fixed root x X, the subclass {h H : h(x) = 1} has a bounded VC dimension, and thus the corresponding subtree of the VCL-tree must be finite. 2. H has an infinite VC-eluder sequence (see Example 15). B.3 Star-related notions In this section, we provide examples to distinguish between the following star-related notions: star number s H (Definition 4), the star-eluder dimension SE(H) (Definition 9), star set (Definition 4) and star eluder sequence (Definition 6). In particular, Example 17 reveals that having an infinite star number of h does not guarantee that H has an infinite star-eluder sequence centered at the same target h . Note that if s H = always yields an infinite star set, then we can simply choose this infinite star set to be an infinite star-eluder sequence. Unfortunately, Example 18 fails the conjecture. Furthermore, Proposition 2 in Section 4 is convinced by Example 19. Finally, Example 20 gives an instance that SE(H) = and infinite star-eluder sequence are not equivalent as well. 
For comparison, we recall that E(H) = is equivalent to an infinite eluder sequence, and VCE(H) = is equivalent to an infinite VC-eluder sequence (see a discussion in Appendix C). Example 17 (Infinite star number and infinite star-eluder sequence with different centers). Let us recall Example 3, where Hsingleton,N is the class of singletons on natural numbers. According to the analysis in Example 13, we know that Hsingleton,N has an infinite star number of hall-0 s, and also an infinite star-eluder sequence centered at hall-0 s. Now we slightly change the setting: Let X := S k N Xk be the disjoint union of finite sets with |Xk| = k (one may simply assume X := N). Denote Xk := {xk,1, . . . , xk,k} and define hk,i(x) := 1{x = xk,i or x / Xk}, for all 1 i k. We let H := {hk,i, k N, 1 i k}, and have the following hold: 1. shall-0 s = : Given arbitrarily large integer k, {(xk,1, 0), (xk,2, 0), . . . , (xk,k, 0)} is a star set centered at hall-0 s, witnessed by hypotheses {hk,i, 1 i k}. 2. shall-1 s = : Given arbitrarily large integer k, {(x1,1, 1), (x2,1, 1), . . . , (xk,1, 1)} is a star set centered at hall-1 s, witnessed by hypotheses {hi,2, 1 i k}. 3. H has an infinite star-eluder sequence centered at hall-1 s: Indeed, {(x1,1, 1), (x2,1, 1), . . .} is an example of infinite star-eluder sequence. 4. H does not have an infinite star-eluder sequence centered at hall-0 s. Example 18 (Infinite star number but no infinite star set). We slightly change the setting in Example 17: Let X := S k N Xk be the disjoint union of finite sets with |Xk| = k (one may again simply assume X := N). Denote Xk := {xk,1, . . . , xk,k}, let hk,i(x) := 1{x = xk,i or x X>k}, for all 1 i k and k N, and let H := {hk,i, 1 i k, k N}. We have the following hold: 1. s H = since H has a star set of arbitrarily large finite size. 2. H does not have an infinite star set. It is worthwhile to mention that in this example, H does have an infinite star-eluder sequence {(x1,1, 0), (x2,1, 0), (x2,2, 0), . . .} centered at hall-0 s. Hence, an infinite star set is an infinite stareluder sequence, but not the only possibility. Example 19 (Infinite star number but no infinite star-eluder sequence). We slightly change the setting in Example 18 as follow: Let X := S k N Xk be the disjoint union of finite sets with |Xk| = k (one may again simply assume X := N). Denote Xk := {xk,1, . . . , xk,k}, let hk,i(x) := 1{x = xk,i or x Xt) (x X 0 such that If VCE(H) < , then E h er P (ˆhn) i α VCE(H) n , for infinitely many n N, (1) E h er P (ˆhn) i β VCE(H) log n n + 2 n/2κ , n N, (2) where κ = κ(P) is a distribution-dependent constant. If SE(H) < , then E h er P (ˆhn) i α VCE(H) + log (SE(H)) n , for infinitely many n N, (3) E h er P (ˆhn) i β VCE(H) n log SE(H) + 2 n/2ˆκ , n N, (4) where ˆκ = ˆκ(P) is a distribution-dependent constant. Remark 15. Our proofs use α = 1/20 and β = 160. Remark 16. When SE(H) = VCE(H) = 0, E(H) < , or SE(H) = VCE(H) = 1, E(H) = , we still have the bounds (3) and (4) since we define log (0) = 0, 0 log (0/0) = 0 and log (x) := log (x 2) for any x > 0. Moreover, we remark that neither VCE(H) = nor SE(H) = is considered in these fine-grained rates. This is because when VCE(H) = , arbitrarily slow rates cannot admit distribution-free constants. And when SE(H) = , it is still impossible because it does not guarantee an infinite star-eluder sequence, that is, the lower bound of (1) cannot be increased to log (n)/n, and is the sharpest one we can have here. Remark 17. 
It is not hard to understand that a target-specified version of fine-grained universal rates by ERM is also derivable, based on a centered version of the star-eluder dimension SEh and VC-eluder dimension VCEh . Remark 18. It is worth noting that, when SE(H) < , there is a mismatch between the lower bound and the upper bound. This serves as an analogy to the mismatch between Cor.12 and Thm.13 in Hanneke (2016b), and certain demonstrating examples have been exhibited in Hanneke and Yang (2015). In the following two examples, we provide evidence that such a gap does exist, in a sense that both the upper bound and the lower bound can sometimes be tight for some classes. Roughly speaking, if an infinite SE(H)-star-eluder sequence in H is also an infinite VCE(H)-VC-eluder sequence (see Definition 9), then VCE(H) n log ( SE(H) VCE(H)) is the optimal rate, otherwise VCE(H)+log (SE(H)) n is optimal. Example 21 (Optimal (VCE(H) + log (SE(H)))/n rate). We construct a concept class H such that an infinite VCE(H)-VC-eluder sequence and an infinite SE(H)-star-eluder sequence cannot be realized by an infinite sequence. To this end, we slightly change the example presented in Appendix D.2 of Hanneke and Yang (2015), which yields the tightness of a lower bound (VC(H) + log (s H))/n. Specifically, let d, s > 0 be two integers satisfying d s. Let X := Z \ {0} := X1 X2, where X1 := N \ {0} and X2 := N \ {0} = X1. We can also write X1 = (X1,0 X1,1 ), where X1,k := {ks + 1, . . . , (k + 1)s} for all k N, X2 = (X2,0 X2,1 ), where X2,k := { (k + 1)d, . . . , kd 1} for all k N. Now we let H := H1 H2 satisfying VCE(H) = d and SE(H) = s, where H1 := {hk,j : j X1,k, k N}, where hk,j(x) := 1(x = j or x X1,>k). H2 := {hk,S : S X2,k, k N}, where hk,S(x) := 1(x S or x X2,>k). In particular, X1 itself is an infinite s-star-eluder sequence centered at hall-0 s, and X2 itself is an infinite d-VC-eluder sequence, but they do not intersect. To show that the upper bound can be decreased to match the lower bound, we simply note that for any infinite s-star-eluder sequence, its associated VC-eluder dimension is exactly 1, resulting in a log (s)/n upper bound. For any infinite d-VC-eluder sequence, its associated star-eluder dimension is also d, resulting in a d/n upper bound. The maximum of the two upper bounds yields the desired one. Example 22 (Optimal (VCE(H)/n) log (SE(H)/VCE(H)) rate). We construct a concept class H such that there exists an infinite sequence in H which is both an infinite VCE(H)-VC-eluder sequence and an infinite SE(H)-star-eluder sequence. To this end, we slightly change the example presented in Appendix D.1 of Hanneke and Yang (2015), which yields the tightness of an upper bound (VC(H)/n) log (s H/VC(H)). Specifically, let d, s > 0 be two integers satisfying d s. Let X := N and for every k N, define hk,S(x) := 1(x S or x > (k + 1)s) for every subset S {ks + 1, . . . , (k + 1)s} with |S| d. Let H := {hk,S : S {ks + 1, . . . , (k + 1)s}, |S| d, k N}. Note that for this class, we have VCE(H) = d, SE(H) = s and there exists an infinite sequence serving as an infinite d-VC-eluder sequence as well as an infinite s-star-eluder sequence. To show in this case that the lower bound can be increased to match the upper bound, the realizable distribution that witnesses this rate is referred to Appendix D.1.1 of Hanneke and Yang (2015). Now let us turn to the proof of Theorem 8, which is based on the following two lemmas and within each an upper bound as well as a lower bound are established. Lemma 10. 
For every concept class H with |H| 3, if VCE(H) < , then the following hold: E h er P (ˆhn) i VCE(H) 18n , for infinitely many n N, E h er P (ˆhn) i 28VCE(H) log n n + 2 n/2κ , n N, where κ = κ(P) is a distribution-dependent constant. Remark 19. Note that a concept class with its VCE dimension finite can either have an infinite star-eluder sequence or not, which results in a difference (of a logarithmic factor) in the upper bound and lower bound stated in Lemma 10. Recall that VC(H) < yields a uniform upper bound VC(H) log (n)/n. On one hand, VCE(H) = implies that H has a shattered set of arbitrarily large size, which further implies an unbounded VC dimension. On the other hand, according to Lemma 9, VC(H) = implies that H has an infinite VC-eluder sequence, and thus VCE(H) = holds as well. Therefore, VCE(H) = if and only if VC(H) = if and only if H has an infinite VC-eluder sequence. Moreover, when VCE(H) < , a trivial observation is VCE(H) VC(H) < . However, the following example reveals that VCE and VC are not the same dimension, namely, there exists a class H having strictly VCE(H) < VC(H) (see the following Example 23). Therefore, Lemma 10 sometimes reflects an improvement over the classical uniform bound. Example 23 (VCE(H) < VC(H) < ). To make it more convincing, we provide an example of infinite classes here. Let X1 be a finite set of size d, and X2 be an infinite instance space that is disjoint with X1. For simplicity, one may assume that X1 := { d, (d 1), . . . , 1} and X2 := N. We define X := X1 X2 and let H := {h S,k := 1S {k}, S X1, k N}. This class has VC(H) = (d + 1) but VCE(H) = 1 since there is no infinite 2-VC-eluder sequence. Similarly, we can also construct an example that witnesses strictly SE(H) < s H < . Lemma 11. For every concept class H with |H| 3, if SE(H) < , then the following hold: E h er P (ˆhn) i log (SE(H)) 12n , for infinitely many n N, E h er P (ˆhn) i 160VCE(H) n log SE(H) + 2 n/2ˆκ , n N, where ˆκ = ˆκ(P) is a distribution-dependent constant. Remark 20. Note that in Theorem 8, the lower bound appears as VCE(H)+log (SE(H)) n . Indeed, SE(H) < immediately implies VCE(H) < and then the lower bound in Lemma 10 holds. Combining with the lower bound in Lemma 11 will give us the desired result in Theorem 8 with some sufficiently small constant, e.g. α = 1/20. Remarkably, when both SE(H) and VCE(H) are finite, either of VCE(H) and log (SE(H)) can be larger than the other, and we provide the following examples for evidence. Therefore, none of the quantities can be removed in the lower bound. Example 24 (VCE(H) < log (SE(H)) < ). Let X := S k N Xk be the disjoint union of finite sets with |Xk| = d < . We denote Xk := {xk,1, . . . , xk,d}, for every k N, let hk,j(x) := 1{x = xk,j or x X>k}, for every k N and every 1 j d, and finally let H := {hk,j, 1 j d, k N}. For this class, we have VCE(H) = 2 < log (SE(H)) = log d for a sufficiently large d. Example 25 (log (SE(H)) < VCE(H) < ). Let X := S k N Xk be the disjoint union of finite sets with |Xk| = d < . We denote Xk := {xk,1, . . . , xk,d}, for every k N, let hk,S(x) := 1{x S or x X>k}, for every k N and every subset S Xk, and finally let H := {hk,S, S Xk, k N}. For this class, we have log (SE(H)) = log d < VCE(H) = d. D.1 Omitted Proofs in Section 3 Proposition 3 (Proposition 1 restated). Any infinite concept class H has either an infinite star-eluder sequence or infinite Littlestone dimension. To prove the proposition, we first introduce a new complexity structure named the threshold sequence. 
Definition 11 (Threshold sequence). Let H be a concept class. We say that H has an infinite threshold sequence {(x_1, y_1), (x_2, y_2), . . .} if it is realizable and, for every integer k, there exists h_k ∈ H such that h_k(x_i) = y_i for all i < k and h_k(x_i) ≠ y_i for all i ≥ k. We say an infinite threshold sequence {(x_1, y_1), (x_2, y_2), . . .} is centered at h if h(x_i) = y_i for all i ∈ N. The following claim turns out to be an alternative formulation of the proposition. Claim 1. Any infinite class H has either an infinite star set or an infinite threshold sequence. Given that the claim holds, the proof of the proposition is straightforward: an infinite star set is itself an infinite star-eluder sequence, and an infinite threshold sequence gives rise to infinite Littlestone dimension since H then has Littlestone trees of arbitrarily large depth (an easy example is Hthresh,N). Therefore, it suffices to prove Claim 1. The remaining proof relies on a connection to classical Ramsey theory, which we briefly introduce as follows. The classical Ramsey theorem states that one can always find monochromatic cliques in any edge coloring of a sufficiently large complete graph. Specifically, let r be a positive integer; a simple two-color version of Ramsey's theorem states that there exists a smallest positive integer R(r, r), called the (diagonal) Ramsey number, such that every red-blue edge coloring of the complete graph on R(r, r) vertices contains either a red clique on r vertices or a blue clique on r vertices. However, we will need the following extension of the theorem to infinite graphs. Theorem 9 (Infinite Ramsey's theorem, Ramsey, 1987). For any countably infinite vertex set, if the edges of its complete graph are colored with finitely many colors, then there is an infinite monochromatic clique. Proof of Claim 1. Based on Lemma 8, we know that any infinite class H has an infinite eluder sequence. Let {(x_1, y_1), (x_2, y_2), . . .} be an infinite eluder sequence centered at h*, that is, for any j ∈ N, there exists h_j ∈ H such that h_j(x_i) = y_i = h*(x_i) for all i < j and h_j(x_j) ≠ y_j = h*(x_j). We aim to show that there exists an infinite subsequence {(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), . . .} that is either an infinite star set centered at h* or an infinite threshold sequence centered at h*. To this end, we view the infinite eluder sequence {(x_1, y_1), (x_2, y_2), . . .} as a red-blue coloring of an infinite complete graph as follows: let the vertices be indexed by N; then, for every edge e_{i,j} with integers i > j, we color it red if h_j(x_i) = y_i and blue otherwise. Note that for any infinite subsequence {(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), . . .}, if the infinite subgraph on the vertices {i_1, i_2, . . .} is monochromatically red, then h_{i_j}(x_{i_k}) = y_{i_k} = h*(x_{i_k}) for all integers k > j. Since h_{i_j}(x_{i_k}) = y_{i_k} = h*(x_{i_k}) for all integers k < j and h_{i_j}(x_{i_j}) ≠ y_{i_j} = h*(x_{i_j}), it follows that {(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), . . .} is an infinite star set centered at h*, witnessed by {h_{i_j}}_{j∈N}. Moreover, if the infinite subgraph on the vertices {i_1, i_2, . . .} is monochromatically blue, it is not hard to verify that {(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), . . .} is an infinite threshold sequence centered at h*. The proof is completed by applying the infinite Ramsey theorem. Lemma 12 (Lemma 1 restated). 
Given a concept class H, for any learning algorithm ˆhn, there exists a realizable distribution P with respect to H such that E[er P (ˆhn)] 2 (n+2) for infinitely many n, which implies that H is not universally learnable at rate faster than exponential e n. Proof of Lemma 12. We prove the lemma by using the probabilistic method". Let us consider non-trivially that |H| > 2, let h1, h2 H and x, x X such that h1(x) = h2(x) = y and h1(x ) = h2(x ). Now for any learning algorithm ˆhn, we define the following two realizable distributions P0 and P1, where Pi{(x, y)} = 0.5 and Pi{(x , i)} = 0.5, i {0, 1}. Let I Bernoulli(0.5), and conditioned on I, let Sn := {(x1, y1), (x2, y2), . . . , (xn, yn)} and (xn+1, yn+1) be i.i.d. samples from PI that the learning algorithm ˆhn is trained on. We note that E h P ˆhn(xn+1) = yn+1 Sn, I i 1 2P x1 = . . . = xn = x, xn+1 = x = 2 (n+2). Furthermore, by the law of total probability, we have E h P ˆhn(xn+1) = yn+1 Sn, I i =1 i {0,1} E h P ˆhn(xn+1) = yn+1 Sn, I = i I = i i max i {0,1} E h P ˆhn(xn+1) = yn+1 Sn, I = i I = i i . The above two inequalities imply that for every n, there exists in {0, 1} such that E h P ˆhn(xn+1) = yn+1 Sn, I = in I = in i = E[er Pin(ˆhn)] 2 (n+2). In particular, by the pigeonhole principle, there exists i {0, 1} such that in = i infinitely often, which completes the proof. Lemma 13 (Lemma 2 restated). If H does not have an infinite eluder sequence centered at h , then h is universally learnable by ERM at rate e n. Proof of Lemma 13. Since H does not have an infinite eluder sequence centered at h , then for any realizable distribution P centered at h and data sequence S := {(x1, h (x1)), (x2, h (x2)), . . .} P N, we have # n t N : t > t s.t. h VSt(H) : h(xt ) = h (xt ) o < . For the largest such integer t, we further have P( h VSt(H) : h(xt ) = h (xt )) = 1 for some t := t (S) > t. This is true because the probability decays exponentially. Therefore, we have lim n PS P N (P (x X : h Vn(H) s.t. h(x) = h (x)) = 0) = 1, which implies that there is a distribution-dependent positive integer k := k(P) < such that P (P (x X : h Vk(H) s.t. h(x) = h (x)) = 0) 1/2. Now for any integer n > k, we split the dataset Sn P n into n/k parts with each one sized at least k, denoted by Sn,1, . . . , Sn, n/k . It holds then P (P (x X : h Vn(H) s.t. h(x) = h (x)) = 0) P i {1, . . . , n/k } : P x X : h VSn,i(H) s.t. h(x) = h (x) = 0 i=1 P P x X : h VSn,i(H) s.t. h(x) = h (x) = 0 2 n/k , which also holds for n k. Finally, it follows that E h er P (ˆhn) i P ( h Vn(H) : er P (h) > 0) =P (P (x X : h Vn(H) s.t. h(x) = h (x)) = 0) 2 n/k , n N. Lemma 14 (Lemma 3 restated). If H has an infinite eluder sequence centered at h , then h is not universally learnable by ERM at rate faster than 1/n. Proof of Lemma 14. Let {(x1, y1), (x2, y2), . . .} be an infinite eluder sequence centered at h , we consider the following distribution P: P{(xi, yi)} = 2 i and P{(xi, 1 yi)} = 0 for all i N. Note that P is realizable (with respect to H) with target h . Given a dataset Sn := {(xi, yi)}n i=1 P n, let the worst-case ERM outputs ˆhn := ERM(Sn). For any t N, if Sn does not contain any copy of the points in {xi, i > t}, we have er P (ˆhn) 2 t. The probability of such event is i=1 1 {Xi {xt+1, xt+2, . . .}} = 0 i=1 P (Xi {x1, . . . , xt}) = 1 2 t n . Therefore, it follows immediately that E h er P (ˆhn) i t=1 2 t 1 2 t n 1 where the second inequality follows from choosing t = log n . Lemma 15 (Lemma 4 restated). 
If H does not have an infinite star-eluder sequence centered at h , then h is universally learnable by ERM at rate 1/n. Before proceeding to the proof of Lemma 15, we first introduce several useful tools. The following definition of the sample compression scheme was originally stated in Littlestone and Warmuth (1986). Definition 12 (Sample compression scheme). Let H be a concept class and Sn := {(xi, yi)}n i=1. A sample compression scheme for H consists of two maps (κ, ρ) such that the following hold: The compression map κ takes Sn to T := κ(Sn), for some T S t=0{(x, y) Sn}t. The reconstruction function ρ takes T to ρ(T) : X {0, 1}. The size of the sample compression scheme (κ, ρ) is defined as max Sn (X {0,1})n |κ(Sn)|. A sample compression scheme (κ, ρ) is called sample-consistent for H, if for any realizable distribution P with respect to H and Sn := {(xi, yi)}n i=1 P n, it holds that ˆer Sn(ρ(κ(Sn))) = 0. A sample compression scheme (κ, ρ) is called stable if for every subsequence S satisfying κ(Sn) S Sn, it holds that ρ(κ(S )) = ρ(κ(Sn)), that is, removing any non-compression point from Sn does not change the classifier returned by the sample compression scheme. Definition 13 (Version space compression set). Let H be a concept class and P be a realizable distribution with respect to H. For any n N, and any dataset Sn := {(xi, yi)}n i=1 P n, the version space compression set ˆCn is defined to be the smallest subset of Sn satisfying VSn(H) = V ˆCn(H). Furthermore, we define the version space compression set size as ˆn(Sn) := | ˆCn|, which is a data-dependent quantity. Finally, we define ˆn1:n := max1 m n ˆm(Sm), which is also datadependent. However, ˆn1:n is not only just dependent on the full sample, but is also dependent on any prefix of the sample (that is, the order of the sample). Remark 21. It has been argued in Wiener et al. (2015) that the region of disagreement of the version space DIS(Vn(H)) can be described as a compression scheme, where the size of the compression scheme is exactly the version space compression set size ˆn(Sn). With these definitions in hand, we are now able to prove the lemma. Proof of Lemma 15. Let P be a realizable distribution with respect to H centered at h , let Sn := {(xi, yi)}n i=1 P n be a dataset and ˆCn Sn be the corresponding version space compression set with size | ˆCn| = ˆn(Sn). We let (κ, ρ) be a sample compression scheme of size ˆn(Sn) defined by κ(Sn) = ˆCn and ρ( ˆCn) = ˆhn. Since any ERM algorithm will output predictors {ˆhn}n N satisfying ˆhn Vn(H), it is clear that ˆer Sn (ρ (κ (Sn))) = i=1 1 {ρ (κ (Sn)) (xi) = yi} = i=1 1 n ˆhn(xi) = yi o = 0, and thus it is sample-consistent. Furthermore, let S be any subsequence satisfying ˆCn S Sn. On one hand, we have VSn(H) VS (H). On the other hand, we also have VS (H) V ˆ Cn(H) = VSn(H). Therefore, we conclude VSn(H) = VS (H), and thus ρ(κ(S )) = ρ(κ(Sn)), that is, the compression scheme (κ, ρ) is also stable. Now we can apply Lemma 24, and then obtain E h er P (ˆhn) i = E [er P (ρ (κ (Sn)))] E[ˆn(Sn)] Indeed, it has been proved that ˆn(Sn) sh (Thm.13 Hanneke and Yang, 2015), and for completeness, we prove it as in Lemma 25 in Appendix E. The only remaining concern is that the fact H does not have an infinite star-eluder sequence centered at h " does not guarantee sh < . However, it essentially states that the version space will eventually have a bounded star number centered at h . Since H does not have an infinite star-eluder sequence centered at h , for any sequence S := {(x1, y1), (x2, y2), . 
. .} P N, there exists a data-dependent integer k := k(S) < such that sh (Vn k(H)) < k. Moreover, we know there exists a distribution-dependent constant factor k := k(P) < such that k(S) k(P) with probability at least 1/2. By using a similar argument in the proof of Lemma 15, we have E h er P (ˆhn) i k n + 2 n/k , n N, which proves a target-specified linear upper bound. Lemma 16 (Lemma 5 restated). If H has an infinite star-eluder sequence centered at h , then h is not universally learnable by ERM at rate faster than log (n)/n. Proof of Lemma 16. Suppose that S := {(x1, y1), (x2, y2), . . .} is an infinite star-eluder sequence in H centered at h , that is h (xi) = yi for all i N. For notation simplicity, let X1 := {x1}, X2 := {x2, x3}, X3 := {x4, x5, x6}, . . . , Xk := {xnk+1, . . . , xnk+k}, . . ., with nk := k 2 . We consider a strictly increasing sequence {kt}t N that will be specified later, and only put non-zero probability masses on these disjoint sets Xkt with t N. Then, let X := S t N Xkt be a union of disjoint finite sets with |Xkt| = kt, and consider the following marginal distribution PX on X: PX {x Xkt} = 2 t and PX (x) = 2 t/kt, x Xkt. It immediately implies the joint distribution P := P(PX , h ) that is realizable with respect to H: P {(x, h (x))} = 2 t/kt, P {(x, 1 h (x))} = 0, x Xkt, k N. Now for any n N, we let Sn := {(xi, yi)}n i=1 P n and consider the event E := E1 E2, where E1 := {Sn does not contain a copy of any point in Xk>t} , E2 := {Sn does not contain a copy of at least one point in Xkt} . If E happens, the worst-case ERM can output some ˆhn VSn(H) such that er P (ˆhn) 2 t/kt. This is because: Vnkt(H) VSn,k z) Var[ˆnkt] z 2. By choosing z = p 2Var[ˆnkt], we have with probability at least 1/2, ˆnkt > E [ˆnkt] p 2Var [ˆnkt] kt 2t log kt π In particular, when kt 38, it holds that log kt 2π/ 3, and thus ˆnkt > 2t 1kt log kt with probability at least 1/2. Altogether, we have for any n 2t 1kt log kt, P (E2) P (n < ˆnkt) P n 2t 1kt log kt, 2t 1kt log kt < ˆnkt 1/2. Moreover, to characterize the probability of E1, note that for any x PX , P(x Xk>t) = 2 kt, which implies immediately that for any n N, P(E1) = (1 2 kt)n. Now for any integer kt 38, we let n = 2t 1kt log kt, and have (for infinitely many n) that P er P (ˆhn) 2 t P (E) P (E1) P (E2) 1 2 1 2 kt n , which implies further E h er P (ˆhn) i 2 t kt P er P (ˆhn) 2 t 1 kt2t+1 1 2 kt n =: ηn,t. Finally, by choosing kt = Ω(2t), we can guarantee that n η 1 n,t log η 1 n,t. Applying Lemma 29, we have E[er P (ˆhn)] ηn,t log (n)/n, for infinitely many n. Lemma 17 (Lemma 6 restated). If H does not have an infinite VC-eluder sequence centered at h , then h is universally learnable by ERM at log (n)/n rate. Proof of Lemma 17. We first prove that any class H is universally learnable by ERM at log (n)/n rate if VC(H) < . For any realizable distribution P with respect to H, we let S2n := {(xi, yi)}2n i=1 P 2n, and denote Sn := {(xi, yi)}n i=1 and Tn := {(xi, yi)}2n i=n+1, namely, the ghost samples". Given ϵ (0, 1), Lemma 31 states that for any n 8/ϵ, P ( h H : ˆer Sn(h) = 0 and er P (h) > ϵ) 2P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) . Moreover, Lemma 32 states that for any n VC(H)/2, P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) 2en VC(H) VC(H) 2 nϵ/2. Altogether, we have P er P (ˆhn) > ϵ P ( h H : ˆer Sn(h) = 0 and er P (h) > ϵ) 2 2en VC(H) for any n max{8/ϵ, VC(H)/2}. 
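Before this tail bound is integrated in the next step, here is a quick numerical sanity check (ours, not part of the proof): a direct transcription of the bound min{1, 2(2en/VC(H))^{VC(H)} 2^{−nε/2}}, integrated over ε, stays within a constant factor of VC(H) log(n/VC(H))/n for the illustrative values of n and VC(H) chosen below (the log base and the grid size are our own choices).

```python
import math

def tail_bound(n, d, eps):
    """Transcription of the tail bound derived above:
    P(er > eps) <= min(1, 2 * (2en/d)^d * 2^(-n*eps/2)), valid for n >= max(8/eps, d/2).
    For eps < 8/n the bound is vacuous (capped at 1), which only loosens the integral."""
    return min(1.0, 2.0 * (2 * math.e * n / d) ** d * 2.0 ** (-n * eps / 2))

def integrated_bound(n, d, steps=100000):
    # Numerically integrate the tail bound over eps in (0, 1], mimicking E[er] = integral of P(er > eps).
    return sum(tail_bound(n, d, (i + 0.5) / steps) for i in range(steps)) / steps

if __name__ == "__main__":
    d = 3
    for n in (200, 400, 800, 1600):
        ratio = integrated_bound(n, d) / (d * math.log(n / d) / n)
        print(n, round(ratio, 2))  # ratio stays bounded, consistent with an O(d log(n/d)/n) expectation bound
```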
Finally, the upper bound on the expectation can be derived via the follow analysis (which will be used several times later): let VC(H) log 2en VC(H) and then by letting the RHS of (5) =: δ, we have VC(H) log 2en VC(H) When ϵ ϵn, we of course still have P(er P (ˆhn) > ϵ) 1. It follows that for all n VC(H)/2, E h er P (ˆhn) i = Z 1 0 P er P (ˆhn) > ϵ dϵ 8 n P er P (ˆhn) > ϵ dϵ + Z 8 n 0 P er P (ˆhn) > ϵ dϵ 8 n P er P (ˆhn) > ϵ dϵ + Z 1 ϵn P er P (ˆhn) > ϵ dϵ + Z 8 n 0 P er P (ˆhn) > ϵ dϵ ϵn 2 2en VC(H) VC(H) 2 nϵ/2dϵ n log 2en VC(H) + 2 + 2 ln (2) n log n VC(H) For n VC(H)/2, the result is trivial. Now to prove a target-specified upper bound, let P be any realizable distribution centered at the target concept h with an associated marginal distribution denoted by PX . Suppose that H does not have an infinite VC-eluder sequence centered at h , then there is a largest target-dependent integer d := d(h ) < such that there exists an infinite d-VC-eluder sequence centered at h , but no infinite (d + 1)-VC-eluder sequence centered at h (see Definition 9). We know that there exists a positive distribution-dependent integer k := k(P) < such that P(VC(Vk(H)) d) 1/2. We let Sn := {(xi, yi)}n i=1 P n be a dataset, and consider the event En := {VC(V n/2 (H)) d}. The probability of this event can be characterized as follow: for any n N, assume first n k, we then split the dataset Sn P n into n/k parts with each one sized at least k, denoted by Sn,1, . . . , Sn, n/k . For every 1 i n/k , we know that the corresponding induced version space has VC dimension VC(VSn,i(H)) d with probability at least 1/2. Note that Vn(H) = T 1 i n/k VSn,i(H) satisfies VC(Vn(H)) VC(VSn,i(H)), for every 1 i n/k . Therefore, we have P(VC(Vn(H)) > d) P( 1 i n/k : VC(VSn,i(H)) > d) 2 n/k . Note that this bound also holds when n < k since a probability is always at most 1. Altogether, we obtain P( En) = P(VC(V n/2 (H)) > d) 2 n/2k . Conditioning on this event, the previous analysis of the uniform rate log (n)/n can be applied since the version space has a bounded VC dimension d. Finally, we have that for all n N, E h er P (ˆhn) i = Z 0 P er P (ˆhn) > ϵ dϵ P er P (ˆhn) > ϵ En + P ( En) dϵ d where both d := d(h ) and k := k(P) are distribution-dependent constants. In conclusion, h is universally learnable by ERM at log (n)/n rate. Lemma 18 (Lemma 7 restated). If H has an infinite VC-eluder sequence centered at h , then h requires at least arbitrarily slow rates to be universally learned by ERM. Proof of Lemma 18. Let S := {(x1, y1), (x2, y2), . . .} be an infinite VC-eluder sequence in H centered at h , that is, h (xi) = yi for all i N. We inherit the notations used in Lemma 5 by letting Xk := {xnk+1, . . . , xnk+k} with nk := k 2 , for all k N. Let X := S k N Xk be a union of disjoint finite sets with |Xk| = k, and consider the following marginal distribution PX on X: PX {x Xk} = pk and PX (x) = pk/k, x Xk, where {pk}k N is a sequence of probabilities satisfying P k 1 pk 1 that will be specified later. It implies immediately the following realizable (joint) distribution P := P(PX , h ): P {(x, h (x))} = pk/k, P {(x, 1 h (x))} = 0, x Xk, k N. Our remaining target is to show that for any rate function R(n) 0, H cannot be universally learned by the worst-case ERM at rate faster than R(n) under the distribution P. To this end, we let Sn := {(xi, yi)}n i=1 P n. 
For any t N and any j [kt], we consider the following event En,k,t,j := Sn does not contain a copy of any point in Xk>t {xnkt+j} , where {kt}t N is an increasing sequence of integers that will be specified later. If En,k,t,j happens, then the worst-case ERM algorithm can output some ˆhn Vnkt(H) such that er P (ˆhn) pkt/kt, that is the classifier that predicts incorrectly on the missing" point in Xkt. Moreover, to characterize the probability of event En,k,t,j, we have P(En,k,t,j) = (1 P k>kt pk pkt/kt)n. Therefore, we make a construction by applying Lemma 30, and finally get for all t N, E h er P ˆhnt i X j [kt] pkt P (Ent,k,t,j) pkt k>kt pk pkt nt R (nt) . D.2 Omitted Proofs in Section 4 Lemma 19 (Lemma 8 restated). H has an infinite eluder sequence if and only if |H| = . Proof of Lemma 19. The necessity is straightforward. To show the sufficiency, we construct such an infinite eluder sequence via the following procedure: pick some x1 X such that both V(x1,0)(H) := {h H : h(x1) = 0} and V(x1,1)(H) := {h H : h(x1) = 1} are non-empty. Such a point x1 must exist since otherwise we will have |H| = 1. Furthermore, we know that at least one of them is infinite, since otherwise we will have |H| < . We assume, without loss of generality, that |V(x1,0)(H)| = . Then we pick some x2 X such that both V{(x1,0),(x2,0)}(H) := {h H : h(x1) = 0, h(x2) = 0} and V{(x1,0),(x2,1)}(H) := {h H : h(x1) = 0, h(x2) = 1} are non-empty. Note that such an x2 exists, because otherwise we will have |V(x1,0)(H)| < . Again, at least one of them is infinite for the same reason, and then we choose x3 from that infinite one. Following a similar procedure, we can get an infinite sequence {(x1, y1), (x2, y2), . . .}, where {x1, x2, . . .} are chosen to keep the separates of version space non-empty, and {y1, y2, . . .} are chosen to keep the version space infinite. Now note that for any i N, we can find some hi VSi(H) = , where Si = {(x1, 1 y1), (x2, 1 y2), . . . , (xi 1, 1 yi 1), (xi, yi)}. According to Definition 5, we know that {(x1, y1), (x2, y2), . . .} is an infinite eluder sequence consistent with H. Lemma 20 (Lemma 9 restated). H has an infinite VC-eluder sequence if and only if VC(H) = . Proof of Lemma 20. According to Definition 7, the necessity is straightforward, i.e. we must have VC(H) = if H has an infinite VC-eluder sequence. It remains to prove the sufficiency, that is, VC(H) = yields the existence of an infinite VC-eluder sequence. We construct an infinite VC-eluder sequence via the following procedure: Let x1 X be any point, then at least one of V{(x1,0)}(H) and V{(x1,1)}(H) has an infinite VC dimension, which is because otherwise H = V{(x1,y1)}(H) V{(x1,1 y1)}(H) will have a finite VC dimension based on Lemma 28. Let y1 {0, 1} such that V{(x1,y1)}(H) has an infinite VC dimension, and let {x2, x3} be a shattered set of V{(x1,y1)}(H). Similarly, we know least one of the following four subclasses V{(x1,y1),(x2,0),(x3,0)}(H), V{(x1,y1),(x2,0),(x3,1)}(H), V{(x1,y1),(x2,1),(x3,0)}(H), V{(x1,y1),(x2,1),(x3,1)}(H) has an infinite VC dimension since otherwise VC(V{(x1,y1)}(H)) < will lead to a contradiction. We then pick labels y2, y3 {0, 1} such that V{(x1,y1),(x2,y2),(x3,y3)}(H) has an infinite VC dimension. For notation simplicity, let S := {(x1, y1), (x2, y2), . . .} and let nk := k 2 . Inductively, for any k N, if VS1+2+ +(k 1)(H) = VSnk (H) has an infinite VC dimension, let {xnk+1, . . . , xnk+k} be a shattered set of VSnk (H). Lemma 28 yields the existence of a set of labels {ynk+1, . . . 
, ynk+k} {0, 1}k such that VSnk+1(H) has an infinite VC dimension. Otherwise, VC VSnk (H) = VC (ynk+1,...,ynk+k) {0,1}k VSnk+1(H) 2k+4 max VC VSnk+1(H) < , which leads us to a contradiction! By such a construction, the returned infinite sequence S := {(x1, y1), (x2, y2), . . .} is an infinite VC-eluder sequence consistent with H. D.3 Omitted Proofs in Appendix C Lemma 21 (Lemma 10 restated). For every concept class H with |H| 3, if VCE(H) < , then the following hold: E h er P (ˆhn) i VCE(H) 18n , for infinitely many n N, E h er P (ˆhn) i 28VCE(H) log n n + 2 n/2κ , n N, where κ = κ(P) is a distribution-dependent constant. Proof of Lemma 21. Let H be a concept class with VCE(H) = d < . Note that when VCE(H) = 0, the results hold trivially, and hence we consider only d 1 in the remaining part of the proof. To show the upper bound, let P be a realizable distribution with respect to H, and for any n N, let Sn := {(x1, y1), . . . , (xn, yn)} P n be a dataset. Since VCE(H) = d, for any infinite sequence S := {(x1, y1), (x2, y2), . . .} P N, there exists a (smallest) non-negative integer k = k(S) < such that VC(Vk(H)) d. For any ERM algorithm A, let ˆhn := AH(Sn) Vn(H), which can also be written as ˆhn := ˆhn,k := AVk(H) (Sk+1:n), for every k [n]. Recall that for any ϵ (0, 1), using the same argument as in the proof of Lemma 6, we can get P er P (ˆhn,k) > ϵ 2 2e(n k) d 2 (n k)ϵ/2, n k + max {8/ϵ, d/2} . (6) Now for any n N, we consider the event En := {VC(V n/2 (H)) d}. Applying the inequality P(A) P(A|B) + P( B), we have P er P (ˆhn) > ϵ P er P (ˆhn, n/2 ) > ϵ En + P ( En) . (7) Let the RHS of (6) =: δ (0, 1), then for the first probability in (7), we have P er P (ˆhn) > ϵ En = P er P (ˆhn, n/2 ) > ϵ VC(V n/2 (H)) d δ, for all n max {16/ϵ, d} n/2 + max {8/ϵ, d/2}, and also d log 2e(n n + 1 =: ϵn. (8) To characterize the second probability in (7), we define κ = κ(P), a distribution-dependent quantity, to be the smallest integer such that k(S) κ with probability at least 1/2, where the randomness is from the data sequence S. Note that such an integer κ exists since otherwise there will exist at least an infinite (d + 1)-VC-eluder sequence. We then prove: Claim 2. For any n N, P( En) 2 n/2κ . Proof of Claim 2. For any n N, assume first that n κ, we then split the dataset Sn P n into n/κ parts with each one sized κ, denoted by Sn,1, . . . , Sn, n/κ . According to the definition, for every 1 i n/κ , we know that the induced version space has VC dimension VC(VSn,i(H)) d with probability at least 1/2. Note that Vn(H) = T 1 i n/κ VSn,i(H) satisfies VC(Vn(H)) VC(VSn,i(H)), for all 1 i n/κ . Therefore, we have P (VC(Vn(H)) > d) P 1 i n/κ : VC(VSn,i(H)) > d 2 n/κ , which also holds when n < κ. Finally, P( En) = P(VC(V n/2 (H)) > d) 2 n/2κ . Putting together, we have that for all n d, E h er P (ˆhn) i = Z 1 16/n P er P (ˆhn) > ϵ dϵ + Z 16/n 0 P er P (ˆhn) > ϵ dϵ 16/n P er P (ˆhn, n/2 ) > ϵ En dϵ + Z 16/n 0 P er P (ˆhn, n/2 ) > ϵ En dϵ + Z 1 0 P ( En) dϵ (8) ϵn + Z 1 ϵn P er P (ˆhn, n/2 ) > ϵ En dϵ + Z 1 0 P ( En) dϵ (6) ϵn + 3 Z d 2 nϵ/4dϵ + Z 1 0 P ( En) dϵ + 4 + 12 ln (2) 0 P ( En) dϵ + 4 + 12 ln (2) n + 2 n/2κ 28d log n n + 2 n/2κ . When n d, the upper bound is trivial. To show the lower bound, let S := {(x1, y1), (x2, y2), . . .} be any infinite d-VC-eluder sequence that is consistent with H. We denote Xk := {xkd d+1, . . . 
, xkd} for all integers k 1, and consider the following realizable distribution P: P {(xkd d+j, ykd d+j)} = pk/d, P {(xkd d+j, 1 ykd d+j)} = 0, 1 j d, k 1, where {pk}k 1 is a sequence of probabilities satisfying P k 1 pk 1, which will be specified later. We use a similar argument in the proof of Lemma 7, but instead of considering an arbitrarily slow rate function R(n) 0, we consider here R(n) := d/n. Specifically, let Sn := {(xi, yi)}n i=1 P n be a dataset. Note that for any k N and any j [d], if the dataset Sn does not contain any copy of the points in X>k {xkd d+j} := S t>k Xt {xkd d+j}, the worst-case ERM can have an error rate er P (ˆhn) pk/d. The probability of such event is P(Pn i=1 1{Xi X>k {xkd d+j}} = 0) = Qn i=1 P(Xi / X>k {xkd d+j}) = (1 P t>k pt pk/d)n. Based on Lemma 30, for the rate function R(n) := d/n, there exist probabilities {pk}k 1 satisfying P k 1 pk = 1, two increasing sequences of integers {kt}t 1 and {nt}t 1, and a constant 1/2 C 1 such that P k>kt pk 1/nt and pkt = C d/nt. Therefore, it follows that for all integers t 1 (and thus for infinitely many n N), E h er P ˆhnt i C d nt d 18nt . Lemma 22 (Lemma 11 restated). For every concept class H with |H| 3, if 1 SE(H) < , then the following hold: E h er P (ˆhn) i log (SE(H)) 12n , for infinitely many n N, E h er P (ˆhn) i 160VCE(H) n log SE(H) + 2 n/2ˆκ , n N, where ˆκ = ˆκ(P) is a distribution-dependent constant. Proof of Lemma 22. Let H be a concept class with SE(H) = s < and VCE(H) = d < . To prove the upper bound, let P be a realizable distribution with respect to H centered at h, and for any n N, let Sn := {(x1, y1), . . . , (xn, yn)} P n be a dataset. Indeed, a distribution-free upper bound for any consistent learning rule has been proved in Hanneke (2016b) (see Lemma 26 in Appendix E), which states that for any δ (0, 1) and any n N, with probability at least 1 δ, sup h Vn(H) er P (h) 8 VC(H) ln 49esh VC(H) + 37 + 8 ln 6 Since H does not have an infinite (d + 1)-VC-eluder sequence, and also does not have an infinite (s + 1)-star-eluder sequence, for any infinite sequence S := {(x1, y1), (x2, y2), . . .} P N, there exists a (smallest) non-negative integer k = k(S) < such that VC(Vk(H)) d and sh(Vk(H)) s. Following a similar argument in the proof of Lemma 10, we define ˆκ = ˆκ(P), a distributiondependent quantity, to be the smallest integer such that k(S) ˆκ with probability at least 1/2, and then consider the following event ˆEn := {VC(V n/2 (H)) d, sh(V n/2 (H)) s} with probability P( ˆEn) 2 n/2ˆκ . For notation simplicity, let us denote by ϵn := 8 n(d ln ( 49es d + 37) + 8 ln (6)), then conditioning on ˆEn, we have that for all n N, sup ˆhn Vn(H) er P (ˆhn) 0 P er P (ˆhn) > ϵ dϵ 0 P er P (ˆhn) > ϵ ˆEn dϵ + Z 1 0 P er P (ˆhn) > ϵ ˆEn dϵ + Z 1 ϵn P er P (ˆhn) > ϵ ˆEn dϵ + Z 1 (9) ϵn + Z 1 dϵ + 2 n/2ˆκ d + 37 + 8 ln (6) + 64 n + 2 n/2ˆκ n log (49e + 37)s + 64 ln (6) + 64 n + 2 n/2ˆκ 160d + 2 n/2ˆκ . To show the lower bound, let S := {(x1, y1), (x2, y2), . . .} be any infinite d-star-eluder sequence that is consistent with H. We denote Xk := {xkd d+1, . . . , xkd} for every integer k 1, and consider the following realizable distribution P (with the same center of S): P {(xkd d+j, ykd d+j)} = pk/d, P {(xkd d+j, 1 ykd d+j)} = 0, 1 j d, k 1, where {pk}k 1 is a sequence of probabilities satisfying P k 1 pk 1, which will be specified later. 
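The argument below treats ˆn_k, the first time the sample covers all d points of X_k (each carrying probability p_k/d), as a coupon-collector time with mean (d/p_k)·H_d. As a quick Monte Carlo sanity check (ours, with illustrative parameter values rather than the construction's actual p_k), the following sketch compares a simulated mean collection time against this formula.

```python
import random

def collection_time(d, p_k, trials=2000):
    """Average number of i.i.d. draws until all d points of a block X_k are observed,
    when each individual point has probability p_k / d and the remaining mass falls
    outside X_k. Illustrative parameters only."""
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < d:
            draws += 1
            if random.random() < p_k:           # the draw lands in X_k ...
                seen.add(random.randrange(d))   # ... uniformly over its d points
        total += draws
    return total / trials

if __name__ == "__main__":
    d, p_k = 50, 0.2
    harmonic = sum(1.0 / j for j in range(1, d + 1))
    print(round(collection_time(d, p_k)), round(d / p_k * harmonic))  # the two values should be close
```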
Let Sn := {(xi, yi)}n i=1 P n be a dataset and consider the event E := E1 E2, where E1 := {Sn does not contain a copy of any point in X>k} , E2 := {Sn does not contain a copy of at least one point in Xk} . If the event E happens, the worst-case ERM can have an error rate er P (ˆhn) pk/d. This is because {xkd d+1, . . . , xkd} is a star set (with the same center) of Vn(H) Vkd d(H), and so we know that for any 1 j d, there exists hk,j Vn(H) such that hk,j(xkd d+j) = ykd d+j, and the ERM outputting hk,j will have such an error rate. The probability of E1 follows simply as P(E1) = (1 P t>k pt)n. Moreover, characterizing the probability of E2 can be approached as an instance of the so-called Coupon Collector s Problem. Specifically, we let ˆnk := min {n N : Xk Sn} , which can be represented as a sum Pd j=1 Gj of independent geometric random variables Gj Geometric( d+1 j d pk) for 1 j d, with the following properties E [ˆnk] = Pd j=1 E [Gj] = Pd j=1 d p 1 k d+1 j = d pk Pd j=1 1 d+1 j = d pk Hd Var [ˆnk] = Pd j=1 Var [Gj] < Pd j=1 d+1 j d pk 2 < π2d2 where Hd is dth harmonic number satisfying Hd log d, for all d 1. Then the standard Chebyshev s inequality implies that P(|ˆnk E[ˆnk]| > z) Var[ˆnk] z 2. By choosing z = p 2Var[ˆnk], we have with probability at least 1/2, ˆnk > E [ˆnk] p 2Var [ˆnk] d In particular, when d 38, it holds that log d 2π/ 3, and thus ˆnk > p 1 k d log d/2, with probability at least 1/2. Altogether, we have for any n p 1 k d log d/2, P (E2) P (n < ˆnk) P n p 1 k d log d/2, p 1 k d log d/2 < ˆnk 1/2. Now for all d 38, it follows from the proceeding analysis that P er P (ˆhn) pk P (E) P (E1) P (E2) 1 which further implies that for all d 38, E h er P (ˆhn) i pk d P er P (ˆhn) pk for all n p 1 k d log d/2. By letting nk = p 1 k d log d/2, we have for all k N (infinitely many n N), E h er P ˆhnk i pk !nk = log d where the last inequality follows from choosing probabilities {pk}k 1 satisfying P t>k pt 1/nk. When 1 d < 38, the result is trivial. E Technical lemmas Lemma 23 (Chernoff s bound). Let Z1, . . . , Zn be independent random variables in {0, 1}, let Z := 1 n Pn i=1 Zi. For all t (0, 1), we have P Z (1 t)E[ Z] e n E[ Z]t2 Lemma 24. Let H be a concept class, and (κ, ρ) be a stable sample compression scheme of size ˆn(Sn) < n that is sample-consistent for H given data Sn. For any realizable distribution P with respect to H and Sn := {(xi, yi)}n i=1 P n, let ˆhn := ρ(κ(Sn)). Then it holds E h er P (ˆhn) i E[ˆn(Sn)] Proof of Lemma 24. We claim that if (xn+1, yn+1) satisfies ρ(κ(Sn))(xn+1) = yn+1, we must have (xn+1, yn+1) κ(Sn+1). Suppose not, then there is a subsequence Sn := Sn+1 \ {(xn+1, yn+1)} satisfying κ(Sn+1) Sn Sn+1 and ρ(κ(Sn))(xn+1) = yn+1 = ρ(κ(Sn+1))(xn+1) based on the sample-consistency of the compression scheme, which contradicts to our assumption that (κ, ρ) is stable. Now by the exchangeability of random variables {xi}i 1, we have E h er P (ˆhn) i =ESn [P {ρ (κ (Sn)) (xn+1) = yn+1}] =ESn+1 [1 {ρ (κ (Sn)) (xn+1) = yn+1}] i=1 E [1 {ρ (κ (Sn+1 \ {(xi, yi)})) (xi) = yi}] i=1 E [1 {(xi, yi) κ (Sn+1)}] E[ˆn(Sn)] Lemma 25 (Hanneke and Yang, 2015, Lemma 44). Let H be a concept class, and P be a realizable distribution centered at the target h. For any n N, let Sn := {(xi, yi)}n i=1 P n be a dataset, and let ˆCn be the version space compression set (Definition 13), that is, the smallest subset of Sn such that V ˆCn(H) = VSn(H). We have | ˆCn| sh. Proof of Lemma 25. We assume that the dataset Sn is consistent with the target h, i.e. h(xj) = yj for all j [n]. 
Note that, if there exists (xj, yj) ˆCn such that every hypothesis g V ˆCn\{(xj,yj)}(H) satisfies g(xj) = h(xj), then we have V ˆCn\{(xj,yj)}(H) = V ˆCn(H) = VSn(H), which contradicts the definition of the version space compression set as the smallest subset. Therefore, for any (xj, yj) ˆCn, there exists g V ˆCn\{(xj,yj)}(H) such that g(xj) = h(xj). Moreover, note that g V ˆCn\{(xj,yj)}(H)" is equivalent to saying g(x) = y = h(x), for all (x, y) ˆCn \ {(xj, yj)}", which precisely matches the definition of a star set centered at h, that is, ˆCn is a star set for H centered at h, witnessed by those hypotheses g s. We must have | ˆCn| sh. Lemma 26 (Hanneke, 2016b, Theorem 11). Let H be a concept class, and P be a realizable distribution with respect to H centered at h, let Sn := {(xi, yi)}n i=1 P n be a dataset, for any n N. Then for any δ (0, 1) and any n N, we have with probability at least 1 δ, sup h Vn(H) er P (h) 8 VC(H) ln 49eˆn1:n VC(H) + 37 + 8 ln 6 where the data-dependent quantity ˆn1:n is defined in Definition 13 satisfying ˆn1:n sh. Lemma 27 (Sauer s lemma, Sauer, 1972). Let H be a concept class with VC(H) < defined on X and Sn := {(xi, yi)}n i=1 (X {0, 1})n. Then for all n N, it holds that In particular, if n VC(H), H(Sn) en VC(H) Lemma 28 (VC dimension of unions). Let N, T N and H1, . . . , HN be concept classes with max1 i N VC(Hi) T, then it holds 2 log N + 4T. Proof of Lemma 28. According to Sauer s lemma (Lemma 27), for any i N and Sn := {(xi, yi)}n i=1 with n T, we have Then we can upper bound the number of possible classifications (of Sn) by the union SN i=1 Hi as Hi(Sn) (10) Let n = VC(SN i=1 Hi) and Sn be a set shattered by SN i=1 Hi, the LHS of (11) is exactly 2n, and thus T n log N + T log en n 2 log N + 4T, where the last step follows from the fact that m s + q log (em/q) implies m 2s + 4q, for any s 0 and m q 1. Lemma 29 (Shalev-Shwartz and Ben-David, 2014, Lemma A.1). Let a > 0, then x 2a log a implies x a log x. Conversely, x < a log x implies x < 2a log a. Lemma 30 (Bousquet et al., 2021, Lemma 5.12). For any function R(n) 0, there exist probabilities {pt}t N satisfying P t 1 pt = 1, two increasing sequences of integers {nt}t N and {kt}t N, and a constant 1/2 C 1 such that the following hold for all t N: k>kt pk 1 nt . (2) ntpkt kt. (3) pkt = CR(nt). Lemma 31 (Ghost samples). Let H be a concept class and P be a realizable distribution with respect to H. Let S2n := {(xi, yi)}2n i=1 P 2n, Sn := {(xi, yi)}n i=1 and Tn := {(xi, yi)}2n i=n+1. Then for any ϵ (0, 1) and n 8/ϵ, it holds P ( h H : ˆer Sn(h) = 0 and er P (h) > ϵ) 2P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) . Proof of Lemma 31. If there exists h H such that ˆer Sn(h) = 0 and er P (h) > ϵ, since Tn is independent of Sn, by applying the Chernoff s bound (Lemma 23), we have P ˆer Tn (h) ϵ/2 h = P i=n+1 1 {h(xi) = yi} ϵ er P (h) > ϵ Then for any n 8/ϵ, it follows P ˆer Tn (h) > ϵ/2 h = 1 P ˆer Tn (h) ϵ/2 h > 1 exp n nϵ which completes the proof. Lemma 32 (Random swaps). Let H be a concept class with VC(H) < and P be a realizable distribution with respect to H. Let S2n := {(xi, yi)}2n i=1 P 2n, Sn := {(xi, yi)}n i=1 and Tn := {(xi, yi)}2n i=n+1. Then for any ϵ (0, 1) and n VC(H)/2, it holds P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) 2en VC(H) 2 nϵ/2. Proof of Lemma 32. We prove the lemma by using the random swaps" technique. Specifically, we define σ1, . . . , σn to be independent random variables with σi Unif({i, n + i}) for all 1 i n, which are also independent of S2n. 
For notation simplicity, we denote by σn+i to be the remaining element in {i, n + i} \ {σi}. Now we let Sσ := {(xσ1, yσ1), . . . , (xσn, yσn)} and Tσ := {(xσn+1, yσn+1), . . . , (xσ2n, yσ2n)}, and note that Sσ Tσ follows the same distribution as S2n. Hence, we have P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) = P ( h H : ˆer Sσ (h) = 0 and ˆer Tσ (h) > ϵ/2) = P (Y1, . . . , Y2n) H(S2n) : 1 n Pn i=1 1 {yσi = Yσi} = 0, 1 n Pn i=1 1 yσn+i = Yσn+i > ϵ/2 Lo TP = E P (Y1, . . . , Y2n) H(S2n) : 1 n Pn i=1 1 {yσi = Yσi} = 0, 1 n Pn i=1 1 yσn+i = Yσn+i > ϵ/2 Union bound E (Y1,...,Y2n) H(S2n) P 1 n Pn i=1 1 {yσi = Yσi} = 0, 1 n Pn i=1 1 yσn+i = Yσn+i > ϵ/2 Next, we consider that given S2n, how possibly that the following event happens i=1 1 {yσi = Yσi} = 0 and 1 i=1 1 yσn+i = Yσn+i > ϵ for a given labeling Y := (Y1, . . . , Y2n) H(S2n). Indeed, if EY happens, then there must exist at least nϵ/2 indices i n such that either yi = Yi, yn+i = Yn+i or yi = Yi, yn+i = Yn+i, otherwise, the difference between 1 n Pn i=1 1{yσi = Yσi} and 1 n Pn i=1 1{yσn+i = Yσn+i} is less than ϵ/2. Based on this and the distribution of σi s, we have P 1 n Pn i=1 1 {yσi = Yσi} = 0, 1 n Pn i=1 1 yσn+i = Yσn+i > ϵ/2 Plugging into (12), we finally get for all n VC(H)/2, P ( h H : ˆer Sn(h) = 0 and ˆer Tn (h) > ϵ/2) E (Y1,...,Y2n) H(S2n) 2 nϵ E [H(S2n)] 2 nϵ/2 VC(H) 2 nϵ/2. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The main results made in the abstract are discussed with details in Section 1.2. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Certain limitations are discussed in the paper (see e.g. Sections 1.1, 2). Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. 
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The theoretical model that we are working on is formalized in Section 1. Due to space limitations, the complete proofs are postponed to the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: The paper is theoretical in its nature and does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. 
While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper is theoretical in its nature and does not include experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: The paper is theoretical in its nature and does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. 
7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [NA]

Justification: The paper is theoretical in nature and does not include experiments.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification: The paper is theoretical in nature and does not include experiments.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in this paper follows the NeurIPS Code of Ethics.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: The paper is theoretical in nature and, as far as we can discern, holds no foreseeable societal impacts.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper is theoretical in nature and poses no such risks.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: The paper is theoretical in nature and does not use existing assets.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper is theoretical in nature and does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.