Universal Rates for Active Learning

Steve Hanneke (Purdue University, steve.hanneke@gmail.com)
Amin Karbasi (Yale University, amin.karbasi@yale.edu)
Shay Moran (Technion, Google Research, smoran@technion.ac.il)
Grigoris Velegkas (Yale University, grigoris.velegkas@yale.edu)

Abstract

In this work we study the problem of actively learning binary classifiers from a given concept class, i.e., learning by utilizing unlabeled data and submitting targeted queries about their labels to a domain expert. We evaluate the quality of our solutions by considering the learning curves they induce, i.e., the rate of decrease of the misclassification probability as the number of label queries increases. The majority of the literature on active learning has focused on obtaining uniform guarantees on the error rate, which can only explain the upper envelope of the learning curves over families of different data-generating distributions. We diverge from this line of work and focus on the distribution-dependent framework of universal learning, whose goal is to obtain guarantees that hold for any fixed distribution, but do not apply uniformly over all distributions. We provide a complete characterization of the optimal learning rates that are achievable by algorithms that have to specify the number of unlabeled examples they use ahead of their execution. Moreover, we identify combinatorial complexity measures that give rise to each case of our tetrachotomic characterization. This resolves an open question that was posed by Balcan et al. (2010). As a byproduct of our main result, we develop an active learning algorithm for partial concept classes that achieves exponential learning rates in the uniform setting.

1 Introduction

The most prototypical type of learning is that of supervised learning, where an algorithm is given as input n labeled data points sampled i.i.d.
from some unknown distribution, and the goal is to output a function that has low probability of misclassifying new data from the same distribution. One caveat of this passive model of learning is that it fails to capture settings in which unlabeled data is easily accessible, but obtaining labels is costly. A natural example that fits this description is webpage classification. It is easy for any web crawler to collect information about billions of different webpages in a very short amount of time; however, understanding which category a webpage belongs to usually requires human feedback. Perhaps the most commonly used way to model this problem is through the active learning framework, in which the learner is given access to a large stream of unlabeled data and a budget of n queries that it can submit to a domain expert in order to obtain the label of some data point. The learner's goal is to submit queries for the labels of the most informative examples and, as a result, eliminate some redundancy in the information content of labeled data. In this paper we study the optimal learning rates that are achievable by an active learning algorithm. In the passive learning setting, it is common to measure the quality of an algorithm by its learning curve, i.e., by plotting the decay of its error rate as the number of training examples increases.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

In contrast, in the active learning setting, the resource we are interested in is the number of label queries the algorithm requests, so it is natural to consider the rate of decay of the misclassification probability as the number of queries n increases. Most of the prior work on active learning has focused on obtaining uniform guarantees on the error rate, i.e., guarantees that hold uniformly over the data-generating distributions.
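To make the query-budget model concrete, here is a minimal pool-based sketch (our own illustration, not an algorithm from this paper): the learner scans the unlabeled stream and spends a label query only on points where its current version space disagrees. All names (`active_learn`, `query_label`) are assumptions for the example.

```python
def active_learn(unlabeled, query_label, n_queries, hypotheses):
    """Toy version-space active learner: spend a label query only on points
    where the surviving hypotheses disagree. Illustrative sketch only."""
    version_space = list(hypotheses)
    budget = n_queries
    for x in unlabeled:
        preds = {h(x) for h in version_space}
        if len(preds) > 1 and budget > 0:  # informative point: query it
            y = query_label(x)
            budget -= 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space[0]  # any hypothesis consistent with all queries
```

With thresholds on a small grid, for instance, only the points that still disambiguate the version space cost a query; points on which all surviving hypotheses agree are labeled for free.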
In this work we follow a different path and view the problem through the lens of universal rates, a framework recently introduced by Bousquet et al. (2021). In this framework we aim for guarantees that hold for every distribution, but that do not hold uniformly over them. In other words, we allow for distribution-dependent constants in the error rate. To make the distinction between uniform and universal rates clear, we first recall what uniform learnability of a hypothesis class H means. We say that H is uniformly learnable at rate R(n) if there exist a learning rule $\hat h_n$ and constants $c, C > 0$ such that, for every realizable distribution $P$, $\mathbb{E}[\mathrm{er}_P(\hat h_n)] \le C \cdot R(c \cdot n)$ for all $n \in \mathbb{N}$. The above expression states that there exist a learning rule $\hat h_n$ and distribution-independent constants such that for all realizable distributions the expected error rate of the classifier is at most $C \cdot R(c \cdot n)$. The difference in the definition of universal learnability is that we swap the order of the quantifiers. To be more precise, we say that a class H is universally learnable at rate R(n) if there exists a learning rule $\hat h_n$ such that, for every realizable distribution $P$, there exist constants $c, C > 0$ for which $\mathbb{E}[\mathrm{er}_P(\hat h_n)] \le C \cdot R(c \cdot n)$ for all $n \in \mathbb{N}$. Note that in this definition the constants $c, C$ are distribution-dependent. As is evident from our main result and from prior results in universal learning (Bousquet et al., 2021; Hanneke et al., 2022; Kalavasis et al., 2022; Bousquet et al., 2022; Hanneke et al., 2023), this change in the definition affects the landscape of the optimal learning rates significantly.

1.1 Related Work

Active Learning. There has been a very long line of work deriving theoretical guarantees for active learning, both in the realizable and the agnostic setting (Cohn et al., 1994; Dasgupta, 2005; Balcan et al., 2006; Hanneke, 2007b,a; Dasgupta et al., 2007; Hanneke, 2009; Balcan et al., 2009, 2010; Hanneke, 2012, 2014; Wiener et al., 2015; Hanneke and Yang, 2015; Beygelzimer et al., 2016).
As we alluded to before, most of these works focus on obtaining minimax guarantees, i.e., the bounds they provide hold in the worst case over a family of distributions. To be more precise, while many of these results have distribution-dependent guarantees, they are typically expressed in a way that aims to match a lower bound on the minimax performance over a family of distributions, with respect to some parameter such as the disagreement coefficient. For these reasons, the results in these works do not capture the full spectrum of universal rates. We also remark that a few works do study universal rates, e.g., Balcan et al. (2010); Hanneke (2012); Yang and Hanneke (2013), but none of them derives a complete characterization of the optimal rates.

Universal Rates. The study of universal learning rates was put forth in the seminal work of Bousquet et al. (2021), who derived a complete characterization of the optimal rates in the supervised learning setting. Later, Kalavasis et al. (2022) extended these results to the multiclass setting with a bounded number of classes, and Hanneke et al. (2023) improved upon this result by characterizing multiclass classification with an infinite number of labels. Subsequently, the work of Bousquet et al. (2022) derived more fine-grained results for binary classification compared to Bousquet et al. (2021). The work most closely related to ours is Hanneke et al. (2022), which derives a complete characterization of the optimal learning rates in a very general interactive learning setting. In that setting, the learner is allowed to submit arbitrary binary-valued queries about the unlabeled data. We provide a detailed comparison between our results and theirs in Section 1.5. Very recently, Attias et al. (2024) studied universal rates in the context of regression.

Partial Concept Classes.
The vast majority of the literature in learning theory has focused on total concept classes, i.e., classes that consist of functions that are defined everywhere on the instance domain. Recently, Alon et al. (2021) proposed a learning theory of partial concept classes, i.e., classes that consist of functions that can be undefined on some parts of the instance domain. Later, Kalavasis et al. (2022) extended some of the results to the multiclass setting with a finite number of labels. Recently, Cheung et al. (2023) studied partial concept classes in the context of online learning and showed that there are such classes that are online learnable, yet none of their extensions to total concept classes is online learnable. The advantage of partial concepts is that they provide a convenient way to express data-dependent constraints. En route to obtaining our main result, we design active learning algorithms for partial concept classes. For a more detailed discussion of partial concepts we refer the reader to Appendix A.2.

1.2 Formal Setting

Learning Model. We now present formally the learning setting that we consider in this work. There is a domain $X$, which we assume to be a Polish space, and a concept class $\mathcal{H} \subseteq \{0,1\}^X$, which satisfies standard measurability assumptions (see Definition A.1). We define a classifier $h : X \to \{0,1\}$ to be a universally measurable function, and its error rate is defined as $\mathrm{er}_P(h) := P[(x, y) : h(x) \neq y]$, where $P$ is the data-generating distribution over $X \times \{0,1\}$. When $P$ is clear from the context we may drop the subscript $P$ in $\mathrm{er}_P(h)$. We call $P$ realizable with respect to the hypothesis class $\mathcal{H}$ if $\inf_{h \in \mathcal{H}} \mathrm{er}_P(h) = 0$. We denote by $P_X$ the marginal distribution of $P$ on $X$.

Active Learning Model. We define an active learning algorithm to be a sequence of universally measurable functions which, given access to a stream of unlabeled data points from $X$ that are drawn i.i.d.
from $P_X$ and a label query budget $n$, output a classifier $\hat h_n : X \to \{0,1\}$. In this work we consider a non-adaptive active learning model with respect to the unlabeled data. In particular, the learning algorithm needs to specify a function $u : \mathbb{N} \to \mathbb{N}$ so that $u(n)$ is the number of unlabeled points it observes and $n$ is the number of points for which it can request the label. The number $u(n)$ is specified before the execution of the algorithm and cannot be modified based on the realization of the unlabeled sequence or the answers it gets for the labels of the points it queries. We place no bound whatsoever on the function $u(\cdot)$. Also, the algorithm can only request the labels of points it has observed, not arbitrary points from $X$. We emphasize that the label requests the algorithm makes can be adaptive, and can depend on answers to previous label queries. To the best of our knowledge, the active learning algorithms that have been proposed in the literature either fit into this model or can be modified to satisfy the non-adaptivity restriction with negligible performance loss.

Learning Rates. We now define formally what it means for an algorithm to achieve a learning rate R(n) in the universal learning model. We adopt the definition of Bousquet et al. (2021).

Definition 1.1 (Learning Rates (Bousquet et al., 2021)). Fix a concept class $\mathcal{H}$, and let $R : \mathbb{N} \to [0,1]$ with $R(n) \to 0$ as $n \to \infty$ be a rate function, where $n$ is the label query budget of the learner.

- $\mathcal{H}$ is learnable at rate $R$ if there is a learning algorithm $\hat h_n$ such that for every realizable distribution $P$, there exist $c, C > 0$ for which $\mathbb{E}[\mathrm{er}(\hat h_n)] \le C R(cn)$ for all $n \in \mathbb{N}$.
- $\mathcal{H}$ is not learnable at rate faster than $R$ if for every learning algorithm $\hat h_n$ there exist a realizable distribution $P$ and $c, C > 0$ for which $\mathbb{E}[\mathrm{er}(\hat h_n)] \ge C R(cn)$ for infinitely many $n \in \mathbb{N}$.
- $\mathcal{H}$ is learnable with optimal rate $R$ if it is learnable at rate $R$ and it is not learnable at rate faster than $R$.
- $\mathcal{H}$ admits arbitrarily fast rates if, for every rate function $R$, it is learnable at rate $R$.
- $\mathcal{H}$ requires arbitrarily slow rates if, for every rate function $R$, it is not learnable at rate faster than $R$.

Combinatorial Measures. We now define some combinatorial complexity measures that our characterization relies on. To make the presentation easier to follow, we provide informal definitions; for the formal ones, we refer the reader to Appendix A.3. We first describe the Littlestone tree, which was introduced by Bousquet et al. (2021).

Definition 1.2 (Littlestone Tree, Informal (see Definition A.8) (Bousquet et al., 2021)). A Littlestone tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete binary tree of depth $d$ whose nodes are labeled by elements of $X$ and whose edges to the left and right child are labeled by 0 and 1, respectively. We require that for every level $0 \le n < d$ and every path from the root to a node at level $n$ there is some $h \in \mathcal{H}$ that realizes this path. We say that $\mathcal{H}$ has an infinite Littlestone tree if it has a Littlestone tree of depth $d = \infty$.

We underline that this notion can be thought of as an infinite extension of the Littlestone dimension (Littlestone, 1988) of $\mathcal{H}$. Recall that the Littlestone dimension is defined to be the largest $d \in \mathbb{N}$ for which $\mathcal{H}$ has a Littlestone tree of such depth, and it is $\infty$ if one can construct Littlestone trees of arbitrary depth. Crucially, this is not the same as having a single tree whose depth is infinite, so one can see that infinite Littlestone dimension is not the same as having an infinite Littlestone tree. We next give the definition of the Vapnik-Chervonenkis-Littlestone (VCL) tree, which was introduced by Bousquet et al. (2021).

Definition 1.3 (VCL Tree, Informal (see Definition A.10) (Bousquet et al., 2021)). A VCL tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete tree of depth $d$ such that every node at level $0 \le n < d$ is labeled by an element of $X^{n+1}$, has branching factor $2^{n+1}$, and the $2^{n+1}$ edges connecting a node to its children are labeled by the elements of $\{0,1\}^{n+1}$.
We require that for every node at any level $0 \le n < d$, the path from the root to this node is realized by some $h \in \mathcal{H}$. We say that $\mathcal{H}$ has an infinite VCL tree if it has a VCL tree of depth $d = \infty$. Intuitively, the VCL tree combines the notions of the Littlestone tree and the VC dimension (Vapnik and Chervonenkis, 1971; Blumer et al., 1989). The differences between the Littlestone tree and the VCL tree are that in the latter the size of the nodes increases linearly with the level and the branching factor increases exponentially, whereas in the former all the nodes are singletons and the branching factor is always two. We are now ready to introduce a new combinatorial measure, which we call the star tree.

Definition 1.4 (Star Tree, Informal (see Definition A.9)). A star tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete tree of depth $d$ such that every node at level $0 \le n < d$ is labeled by an element of $(X \times \{0,1\})^{n+1}$, has branching factor $n+1$, and the $n+1$ edges connecting a node to its children are labeled by $\{0, \dots, n\}$. The label of the edge indicates the element of the node whose label along the path is flipped. We require that for every node at level $0 \le n < d$, the path from the root to this node is realized by $\mathcal{H}$. We say that $\mathcal{H}$ has an infinite star tree if it has a star tree of depth $d = \infty$.

Essentially, this definition combines the structure of a Littlestone tree with the notion of the star number (Hanneke and Yang, 2015). Every node at level $n$ consists of $n+1$ labeled points and there are $n+1$ edges attached to it. Each edge indicates which of the $n+1$ points of the labeled node has its label flipped along every path that this edge is part of.

1.3 Main Results

We are now ready to state the main results of this work. Our first main result is a complete characterization of the optimal learning rates that a class $\mathcal{H}$ admits in the active learning setting.

Theorem 1.5. For every concept class $\mathcal{H}$ exactly one of the following cases holds.

- $\mathcal{H}$ is actively learnable at arbitrarily fast rates.
- $\mathcal{H}$ is actively learnable at an optimal rate $e^{-n}$.
- $\mathcal{H}$ is actively learnable at $o(1/n)$ rates, but requires rates arbitrarily close to $1/n$.
- $\mathcal{H}$ requires arbitrarily slow rates for active learning.

Our next result characterizes exactly when these rates occur, by specifying combinatorial complexity measures of $\mathcal{H}$ that determine which case of the tetrachotomy it falls into.

Theorem 1.6. For every concept class $\mathcal{H}$ the following hold.

- If $\mathcal{H}$ does not have an infinite Littlestone tree, then it is learnable at arbitrarily fast rates.
- If $\mathcal{H}$ has an infinite Littlestone tree but does not have an infinite star tree, then it is learnable at optimal rate $e^{-n}$.
- If $\mathcal{H}$ has an infinite star tree but does not have an infinite VCL tree, then it is learnable at optimal rate $o(1/n)$, but requires rates arbitrarily close to $1/n$.
- If $\mathcal{H}$ has an infinite VCL tree, then it requires arbitrarily slow rates.

We remark that the landscape of the optimal rates looks significantly different compared to the passive setting (Bousquet et al., 2021) and the general interactive learning setting (Hanneke et al., 2022). Our main result answers an open question posed in Balcan et al. (2010), which asks for necessary and sufficient conditions for learnability at an exponential rate in the active learning setting.

1.4 Examples

We now present examples of classes that witness each case of our tetrachotomic characterization.

Arbitrarily fast rates: Bousquet et al. (2021) give examples with no infinite Littlestone tree (e.g., thresholds on the integers, and positive halfspaces on $\mathbb{N}^d$), hence learnable at arbitrarily fast rates.

Exponential rates: Famously, threshold classifiers $[a, \infty)$ over $\mathbb{R}$ have an infinite Littlestone tree and are actively learnable at exponential rate (even uniformly).
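As a hedged illustration of the threshold case, the classical binary-search active learner localizes the decision boundary among m pool points with roughly log2(m) label queries, so the error under the pool distribution decays exponentially in the budget n. The function names below are ours, not the paper's.

```python
def learn_threshold(points, query_label, n_queries):
    """Binary-search active learner for thresholds h_a(x) = 1[x >= a] on R.
    After n queries on a sorted pool, the boundary is localized to an
    interval containing at most |pool| / 2^n pool points, so the empirical
    error decays exponentially in n. Illustrative sketch only."""
    pts = sorted(points)
    lo, hi = 0, len(pts)  # first index with label 1 lies in [lo, hi]
    for _ in range(n_queries):
        if lo >= hi:
            break
        mid = (lo + hi) // 2
        if query_label(pts[mid]) == 1:  # label 1 => boundary at or before mid
            hi = mid
        else:                           # label 0 => boundary strictly after mid
            lo = mid + 1
    a_hat = pts[lo] if lo < len(pts) else float("inf")
    return lambda x: 1 if x >= a_hat else 0
```

With a pool of 100 points, 7 queries already suffice to recover the target's labels on the entire pool.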
A more interesting example is the class of interval classifiers $[a, b]$ on $\mathbb{R}$, which (also famously) has uniform rate $1/n$ for active learning (it has infinite star number), and has an infinite Littlestone tree, but it has no infinite star tree: for any choice of $x$ in the root node, the edge labeling $x$ as 1 yields a version space with star number equal to 4 (it is effectively 2 disjoint threshold problems), so the depth of that subtree is bounded. Therefore, by our theory, it is actively learnable at $e^{-n}$ universal rate (in stark contrast to the uniform rate $1/n$).

Sublinear rates: Halfspaces on $\mathbb{R}^d$ have an infinite star tree but no infinite VCL tree. To construct that star tree, the key observation is that any set in convex position is a star set (Balcan et al., 2010).

Arbitrarily slow rates: Bousquet et al. (2021) provide examples of classes with an infinite VCL tree, such as the class of all classifiers, so these require arbitrarily slow rates in our setting as well.

1.5 Comparison to Supervised Learning and the General Interactive Learning Setting

We now compare our characterization to the results of Bousquet et al. (2021), which prove an analogous result in the supervised learning setting. Let us first recall their main result, which shows that in this learning model a class is universally learnable at:

- Exponential rate $e^{-n}$ if and only if it does not have an infinite Littlestone tree.
- Linear rate $1/n$ if and only if it has an infinite Littlestone tree but does not have an infinite VCL tree.
- Arbitrarily slow rates if and only if it has an infinite VCL tree.

As one can see, our results illustrate the advantage that active learning algorithms have over their supervised counterparts. We underline that the only case where active learning does not offer an improvement compared to the passive setting is when $\mathcal{H}$ has an infinite VCL tree. Let us now discuss the interactive learning model that was considered by Hanneke et al. (2022).
In this general model of interaction, the learner is allowed to ask arbitrary binary queries about any subset of the unlabeled data. In particular, these queries include, but are not limited to, label queries, comparison queries, and general membership queries. They show that the optimal rates that any class $\mathcal{H}$ admits are the following:

- Arbitrarily fast if and only if it does not have an infinite Littlestone tree.
- Exponential if and only if it has an infinite Littlestone tree but not an infinite VCL tree.
- Arbitrarily slow if and only if it has an infinite VCL tree.

We underline that the algorithm of Hanneke et al. (2022) that achieves arbitrarily fast rates uses only label queries, and the number of unlabeled data points $u(n)$ that it uses can be chosen statically prior to the execution of the algorithm. Therefore, this result applies to the setting we consider in our work. Moreover, the lower bounds they provide also apply to our setting, since (i) the queries that Hanneke et al. (2022) consider are more general, and (ii) their lower bounds hold even when the learner knows the marginal distribution $P_X$. As is evident from our main result, the active learning setting provides a richer landscape of optimal rates compared to the general interactive learning setting of Hanneke et al. (2022). In particular, the mere absence of an infinite VCL tree for $\mathcal{H}$ does not necessarily imply that we can achieve (at least) exponential rates. To get this result, Hanneke et al. (2022) make strong use of these general queries in order to provide an algorithm that has a binary-search flavor and can learn partial concept classes with finite VC dimension at an exponential rate. Our main result shows that such guarantees are not achievable using only label queries.

1.6 Technical Challenges

The most technically innovative part of our work is the $o(1/n)$ algorithm. Let us set up some terminology to facilitate our discussion.
The version space of a concept class, given a labeled dataset $S$, is the set of concepts that classify all the elements of $S$ correctly. Moreover, the VCL game is a Gale-Stewart game defined by Bousquet et al. (2021) that was used in their passive learning algorithm achieving $1/n$ universal rate. The original (passive) learner from Bousquet et al. (2021) partitions a portion of the data into some number $B(n)$ of batches, from which they construct partial concept classes of finite VC dimension by using the batches to play the VCL game against the player's winning strategy; for each resulting partial concept class they run a separate learner on another portion of data and return a majority vote of the resulting classifiers. The latter part of this strategy could never work in the active learning setting: the number of batches $B(n)$ must be an increasing function of $n$, and applying an active learner to each partial concept class separately would require each to use (nearly) $\Omega(n)$ queries (to get the $o(1/n)$ guarantee there), so the total number of queries would be (nearly) $\Omega(B(n) \cdot n) \gg n$, violating the label budget $n$. To resolve this, we ended up completely re-imagining how to use these partial concept classes. Rather than running a separate algorithm for each class, we run a single active learning algorithm, where individual decisions of whether to query involve voting over the partial concept classes. We first extend an algorithm of Hanneke (2012) (for total concepts) to achieve $o(1/n)$ rate for partial concept VC classes (this required completely re-formulating Hanneke's analysis). The resulting algorithm involves estimating the probability that a random $k$-tuple is VC-shattered by certain constrained version spaces. We replace this with an estimated average (over a select subset of partial concept classes) of this shattering probability, which we show composes appropriately with the analysis of the algorithm to obtain the $o(1/n)$ rate.
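The voting idea can be caricatured as follows: a single learner polls the version space of each partial concept class and queries a point only when enough classes report disagreement on it. This toy sketch is our own illustration with made-up names, and it substitutes a simple disagreement vote for the paper's shattering-probability estimates; it is only meant to show the mechanism of one query decision serving all classes.

```python
def majority_query_rule(classes, labeled_sample, x, threshold=0.5):
    """Decide whether to spend a label query on x by voting over several
    (partial) concept classes. Each class votes 'query' when its version
    space -- the hypotheses consistent with the labeled sample wherever
    they are defined (None = undefined) -- still disagrees on x."""
    votes = 0
    for cls in classes:
        version_space = [
            h for h in cls
            if all(h(xi) in (yi, None) for xi, yi in labeled_sample)
        ]
        predictions = {h(x) for h in version_space} - {None}
        if len(predictions) > 1:  # this class's version space disagrees on x
            votes += 1
    return votes > threshold * len(classes)
```

In the actual algorithm the per-class signal is an estimate of the probability that a random tuple is shattered by a constrained version space, averaged over a selected subset of classes, rather than this raw disagreement check.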
Comparing our work to the general interactive learning setting considered by Hanneke et al. (2022), the main differences are that (i) we design a new algorithm that uses only label queries and achieves exponential rates when $\mathcal{H}$ does not have an infinite star tree, a combinatorial measure we introduce in this work, (ii) we prove an $o(1/n)$ lower bound in the setting where $\mathcal{H}$ has an infinite star tree, and (iii) we propose a novel active learning algorithm that achieves sublinear rates when $\mathcal{H}$ does not have an infinite VCL tree.

2 Arbitrarily Fast Rates

As we explained before, the first case of the tetrachotomy in our characterization is a direct implication of the results of Hanneke et al. (2022). To be more precise, they design an algorithm which achieves arbitrarily fast rates using only label queries. In order to do that, they need the number of unlabeled points $u(n)$ to be an arbitrarily fast increasing function that, nevertheless, can be specified in a non-adaptive manner prior to the execution of the algorithm. The result is summarized in Theorem 2.1.

Theorem 2.1 (Hanneke et al. (2022)). If $\mathcal{H}$ does not have an infinite Littlestone tree, it is actively learnable at arbitrarily fast rates.

For completeness, we present their algorithm in Figure 1. Similarly to the upper bound, the lower bound is an immediate consequence of a result from Hanneke et al. (2022), since the learner can submit more informative queries in their model.

Theorem 2.2 (Hanneke et al. (2022)). If $\mathcal{H}$ has an infinite Littlestone tree, then $\mathcal{H}$ is not actively learnable at rate faster than exponential $e^{-n}$. This holds even if $P_X$ is known to the learner.

3 Exponentially Fast Rates

In this section we prove the second case in the tetrachotomy we have stated, i.e., that $\mathcal{H}$ is learnable at an exponential rate if and only if it has an infinite Littlestone tree and does not have an infinite star tree. Our proof consists of two parts.
First, we show that if $\mathcal{H}$ does not have an infinite star tree, it is learnable at an exponentially fast rate. Then, we show that whenever $\mathcal{H}$ has an infinite star tree, the best achievable rate cannot exceed $o(1/n)$. The omitted details can be found in Appendix C.

3.1 Exponential Rates Algorithm: High-Level Overview

The high-level approach behind the exponential rates algorithm follows the same spirit as the approaches of Bousquet et al. (2021); Hanneke et al. (2022). First, we design an appropriate Gale-Stewart game (cf. Appendix A.2), i.e., a game between a learner and an adversary, associated with $\mathcal{H}$, in which the learner has a winning strategy if and only if $\mathcal{H}$ does not have an infinite star tree. The next step is to show that, in the limit, the winning strategy of the learner gives rise to a partial concept class $\mathcal{F}$ that has finite star number (see Definition A.7). Since this result is asymptotic, our approach is to consider several instances of this game, execute them for a finite number of steps, and obtain a partial concept class from each one. Then, we aggregate these classes into a majority class. The intuition is that, with high probability, most of the games will have induced classes whose star number is bounded by a distribution-dependent constant; we can then show that the majority class also has bounded star number, and that this bound is distribution-dependent. Thus, our task boils down to actively learning a partial concept class whose star number is finite. In order to do that, we extend the approach that Hanneke and Yang (2015) used for total concept classes to the regime of partial classes. We believe that this result could be of independent interest.

3.2 The Star Tree Game

We first outline the Gale-Stewart game we use. Recall that every node of a star tree at depth $n$ consists of $n+1$ points along with their labels.
There are $n+1$ edges that connect the node with its children, and the label of every edge indicates the point whose label is flipped along any path that uses this edge. Let us now describe the game $\mathcal{G}$ that we use in this setting. In every round $\tau \ge 1$ we have the following interaction between the learner $P_L$ and the adversary $P_A$:

- Player $P_A$ chooses points $(\bar\xi_\tau, \bar\zeta_\tau) = (\xi^0_\tau, \dots, \xi^{\tau-1}_\tau, \zeta^0_\tau, \dots, \zeta^{\tau-1}_\tau) \in X^\tau \times \{0,1\}^\tau$.
- Player $P_L$ chooses $\eta_\tau \in \{0, \dots, \tau-1\}$.
- Player $P_L$ wins the game in round $\tau$ if
$$\mathcal{H}_{\bar\xi_1, \bar\zeta_1, \eta_1, \dots, \bar\xi_\tau, \bar\zeta_\tau, \eta_\tau} := \left\{ h \in \mathcal{H} : \begin{array}{ll} h(\xi^i_s) = \zeta^i_s, & \text{if } \eta_s \neq i \\ h(\xi^i_s) = 1 - \zeta^i_s, & \text{if } \eta_s = i \end{array}, \ 1 \le s \le \tau, \ 0 \le i < s \right\} = \emptyset.$$

It is easy to see that the winning condition for $P_L$ is finitely decidable (cf. Appendix A.2), hence $\mathcal{G}$ is a Gale-Stewart game. Recall that this means exactly one of the adversary and the learner has a winning strategy. Using a result regarding the measurability of winning strategies in Gale-Stewart games that was shown by Bousquet et al. (2021) (see Theorem A.3), we can prove the following connection between $\mathcal{G}$ and the existence of infinite star trees (see Appendix C.1 for the proof).

Lemma 3.1. The class $\mathcal{H}$ does not have an infinite star tree if and only if $P_L$ has a universally measurable winning strategy in $\mathcal{G}$.

The first step in our approach is to use some $\Theta(n)$ label queries in order to obtain the labels of $\Theta(n)$ many points. The idea is to use these labeled points to simulate the Gale-Stewart game we described above (see Appendix C.3 for the details). The main technical issue we need to handle is that we have no control over the number of rounds the game needs in order to terminate. Using ideas that have appeared in the universal learning literature, we use a portion of these labeled points to estimate a number $\hat t_n$ so that, with at least some constant probability over the dataset, the game will terminate within $\hat t_n$ many rounds. Then, we split the remainder of the labeled dataset into batches of size $\hat t_n$ and run the game on each batch.
The outcome of each game gives us a pattern-avoidance function, i.e., a function that takes as input a tuple of arbitrarily labeled points and changes the label of one of them so that the resulting labeled tuple is not consistent with the data-generating distribution $P$. In other words, the output of this function could not have been generated by $P$. A technical complication we need to handle is that we obtain multiple such pattern-avoidance functions, some of which are incorrect; to make the presentation cleaner we explain the idea using a single pattern-avoidance function $\tilde g_{t^\star}$ that produces inconsistent labels. The formal setting is handled in Appendix C.4. One way to think about this function is that it provides data-dependent constraints. Thus, it is natural to express such a constraint through a partial concept class. We define
$$\mathcal{F} = \left\{ f : X \to \{0, 1, \star\} \,:\, \big((x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_{t^\star}, f(x_{t^\star}))\big) \notin \mathrm{image}(\tilde g_{t^\star}), \ \forall (x_1, \dots, x_{t^\star}) \in X^{t^\star} \right\}.$$
Notice that the constraint we have placed on $\mathcal{F}$ is satisfied if $f(x_i) = \star$ for some $i \in [t^\star]$. We consider the natural extension of the notion of star sets to the case of partial concept classes, i.e., we say that a labeled set $S$ with labels in $\{0,1\}$ is a star set if $S$ and its adjacent sets $S'$, whose labels are still restricted to be in $\{0,1\}$, are obtainable using functions from $\mathcal{F}$ (see Definition A.7). A key observation is that the star number of $\mathcal{F}$ is bounded by $t^\star - 1$. Another difficulty we need to overcome is that we do not know $t^\star$, since it is a random variable that depends on the realized sequence, and its distribution might have heavy tails. To make our approach easier to follow, let us first assume that we do know $t^\star$. Then, our task boils down to actively learning a partial concept class.

3.3 Active Learning of Partial Concept Classes with Finite Star Number

We now present an algorithm that achieves exponential rates when actively learning a partial concept class $\mathcal{F}$ that has star number $s < \infty$.
The idea of our approach is to reduce the problem of actively learning a partial concept class to the well-studied problem of actively learning a total concept class. The algorithm is presented in Figure 2. Let us explain its high-level ideas. First, we consider a large enough set of unlabeled data. Our goal is to find their labels using logarithmically many queries. To do that, we consider the uniform distribution over these unlabeled examples. Then, we use an algorithm from Hanneke and Yang (2015) (see Theorem C.1) which guarantees exponential rates when applied to a class with finite star number. Because the underlying distribution on the sample is uniform, with high probability the algorithm will find the correct labels of all the points. Finally, we feed these labeled examples to the one-inclusion graph algorithm (Theorem A.5) to get the desired result. We are now ready to state our theorem. The proof is postponed to Appendix C.2.

Theorem 3.2. There exists an active learning algorithm $\mathcal{A}$ for a partial concept class $\mathcal{F}$ which, given a label budget $n$ and access to unlabeled samples from a realizable distribution $P$, returns a classifier $\hat h_n$ such that $\mathbb{E}_P[\mathrm{er}(\hat h_n)] \le c_1 d\, e^{-c_2 n / s}$, where $c_1, c_2$ are absolute numerical constants, and $s$ and $d$ are the star number and VC dimension of $\mathcal{F}$, respectively.

3.4 Slower than Exponential is Sublinear

The next step in the characterization is to show that if $\mathcal{H}$ has an infinite star tree, then it does not admit rates faster than sublinear. The proof starts by picking a random path on the infinite star tree. The target distribution is supported only on nodes of the selected path. Then, given some algorithm $\hat h_n$, we distribute the mass of the target probability distribution across the path, potentially skipping some of its nodes, in a way that creates an infinite sequence $n_{i_1}, n_{i_2}, \dots$, so that when the learner has label budget $n_{i_j}$, with some constant probability, it will only observe unlabeled points up to level $k_{i_j}$.
Moreover, with at least some constant probability, it does not query the point of that level whose label is flipped along the target path. On that event, it makes a mistake with probability at least C·p_{i_j}/k_{i_j}, where C is some absolute constant. Our choice of p_{i_j}, k_{i_j} guarantees that R(n_{i_j}) > p_{i_j}/k_{i_j}, where R(·) is the target sublinear rate function. Finally, we apply Fatou's lemma to get the desired result. For the full proof and the formal theorem statement, we refer the reader to Appendix C.5.

4 Sublinear Rates

Our approach to achieving sublinear rates in the setting where H does not have an infinite VCL tree shares some high-level ideas with the one in Section 3, but many technical challenges make it significantly more involved. The main obstacle is that there is no active learning algorithm that achieves sublinear rates for VC classes uniformly over all realizable distributions. Recall that in the exponential rates setting, such an algorithm does exist (Hanneke and Yang, 2015). Instead, the sublinear rates algorithm for VC classes from Hanneke (2012) depends on distribution-dependent constants in the sample complexity. The omitted details from this section can be found in Appendix D. Instead of the star tree Gale-Stewart game that was used to get the exponential rates guarantee, we use the VCL Gale-Stewart game of Bousquet et al. (2021) (cf. Figure 6). To be more precise, we use n/5 of the label budget to get the labels of n/5 unlabeled points that come i.i.d. from P_X. Then, we execute the VCL game on Θ(√n) different batches of size Θ(√n). Each of these games induces a partial concept class. We show how to obtain a P-dependent bound on the VC dimension that holds for most of these classes. Moreover, we show that for most of these classes the data-generating distribution P is realizable.

¹In Appendix C.4 we show how to obtain an estimate.
Finally, we design a single active learning algorithm that combines information from all these classes and achieves sublinear learning rates. This algorithm builds upon Hanneke (2012) but is modified to work with partial concept classes instead of total concept classes. This requires a very different analysis and is the most technically involved part of our work. Let us now explain the main ideas of this algorithm and the challenges behind it. As we mentioned before, the number of queries that the algorithm from Hanneke (2012) needs in order to achieve the sublinear error rate depends on the underlying data-generating distribution. Thus, we cannot simply take a large enough number of unlabeled samples, consider the uniform distribution over them, and run the algorithm on this distribution: the learning rate for H would then depend on the uniform distribution U_S over the sample S and not on the data-generating distribution P. To illustrate the ideas of the algorithm, we consider five different streams of i.i.d. (unlabeled) data S₁, S₂, S₃, S₄, S₅. For the purposes of the subsequent discussion, we can imagine that these streams have infinite size, but as explained in the description of the algorithm, we only need poly(n) unlabeled points. Let us first explain the use of S₅. We use n/5 of our query budget to obtain the labels of its first n/5 points and then we use them to train a supervised learning algorithm from Bousquet et al. (2021) that achieves linear rate O(1/n). This is used for technical reasons in our analysis, in order to ensure that the classifier we output has at most linear error rate no matter how the active learning component of our algorithm behaves. Next, we use n/5 of the query budget to obtain the labels of the first n/5 points from S₁. Then, we run the VCL game separately on Θ(√n) batches of these labeled examples and obtain √n different pattern-avoidance functions ŷ^i_{n/5}, each taking as input ℓ^i_{n/5} points (cf. Appendix D.2), where i ∈ [√n].
Let

F^i_{n/5} := { f : X → {0, 1, ⋆} : (f(x₁), ..., f(x_{ℓ^i_{n/5}})) ≠ ŷ^i_{n/5}(x₁, ..., x_{ℓ^i_{n/5}}), ∀(x₁, ..., x_{ℓ^i_{n/5}}) ∈ X^{ℓ^i_{n/5}} }, i ∈ [√n].

First, we show that for a (1 − o(1))-fraction of these partial concept classes, P is a realizable distribution. Intuitively, this means that the partial concept class we obtain by running the VCL game on √n many points is the same as the class we would have obtained if we were to run the game on infinitely many points (cf. Lemma D.4). Next, we need to estimate some number d̂_n ∈ ℕ which, as n → ∞, converges to the 9/10-quantile d* of the distribution of the VC dimension of the partial concept classes that are obtained by running the VCL game on infinitely many samples. We let

d̂_n := min{ d ∈ ℕ : ∃ i₁ < i₂ < ... < i_{(9/10)√n} ∈ [√n] such that d(F^{i_j}_{n/5}) ≤ d, ∀j ∈ [(9/10)√n] },

where d(F) denotes the VC dimension of the class F. Lemma D.5 shows that, for large enough n, d̂_n = d* with high probability, where d* ∈ ℕ is such that, with probability at least 9/10 over the random draw² of the partial class, its VC dimension is at most d*. One technical complication we need to handle is that the concept classes we have obtained are estimated from a game on √n many points instead of infinitely many points, so an o(1)-fraction of them do not correspond to samples from the correct distribution. To handle this, we use a robust version of the well-known Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (cf. Theorem D.1) for estimating the CDF. Next, we use another n/5 of the query budget to obtain the labels of the first n/5 points of S₂. We will make two distinctions regarding this stream. For the purposes of the analysis, we consider a fixed stream of infinitely many i.i.d. samples from P and denote it by S₂^∞, but in the actual algorithm we use a dataset of size Θ(n), which we denote by S₂^n. We define

V^i_{n/5} := { f ∈ F^i_{n/5} : f(x) = y for the first n/5 points (x, y) in S₂^∞ }

to be the version space of F^i_{n/5} defined on the first n/5 examples of S₂^∞.
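The estimator d̂_n is, in essence, a quantile estimate computed from √n noisy draws of a VC dimension, made robust through the DKW deviation band. The following self-contained sketch uses hypothetical toy numbers and is not the paper's exact estimator:

```python
import math

def dkw_upper_quantile(samples, q=0.9, delta=0.05):
    """Smallest value d whose empirical CDF, after subtracting the DKW
    band eps = sqrt(log(2/delta) / (2m)), still certifies CDF(d) >= q;
    with probability at least 1 - delta the true q-quantile is <= d."""
    m = len(samples)
    eps = math.sqrt(math.log(2 / delta) / (2 * m))
    xs = sorted(samples)
    for i, d in enumerate(xs):
        if (i + 1) / m - eps >= q:   # empirical CDF minus DKW band
            return d
    return xs[-1]                    # fall back to the largest draw

# Toy draws of a "VC dimension": 95% of the classes have dimension 3,
# a small corrupted fraction (games that have not converged) reports 10.
draws = [3] * 950 + [10] * 50
d_hat = dkw_upper_quantile(draws, q=0.9)
```

The o(1)-fraction of classes coming from unconverged games plays the role of the corrupted draws here; the DKW band absorbs both the sampling error and that corruption, which is the purpose of the robust DKW variant (Theorem D.1) in the paper.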
Moreover, we let V_{d*,n/5} be a random version space sampled from the following process: we run the VCL game on an infinite stream of labeled data from P to get a pattern-avoidance function ỹ, and then we define the partial concept class F̃ in the same way as before. If the VC dimension of this class is greater than d*, we discard it and restart the process. Otherwise, we let V_{d*,n/5} be the version space of F̃ on the first n/5 labeled points of S₂^∞. We take d* to be the 9/10-quantile of the VC dimension distribution described before (cf. Lemma D.5). Given some k ∈ ℕ, let

p_{n,k} := E[ P^k(x₁, ..., x_k VC shattered by V_{d*,n/5}) | S₂^∞ ],

where the expectation is over the draw of V_{d*,n/5}, given the fixed S₂^∞. Let k* ∈ ℕ be the largest number such that lim_{n→∞} p_{n,k*} ≠ 0. Notice that since, by definition, the VC dimension is bounded by d*, such a number k* exists. From here on, we only consider the version spaces V^i_{n/5} that are obtained from some partial class with VC dimension at most d̂_n. Let

p_{n,k,i} := P^k(x₁, ..., x_k VC shattered by V^i_{n/5}).

Notice that for every k ≤ d̂_n and every i ∈ [√n] we can estimate this quantity to arbitrary precision using only unlabeled examples. We denote these estimates by p̂_{n,k,i}. We now consider the first n² unlabeled points of the third data stream S₃. For each such point X, let

p^X_{n,k} := E[ P^k(x₁, ..., x_k, X VC shattered by V_{d*,n/5}) | S₂^∞, X ].

Moreover, for each y ∈ {0, 1}, let

V^{(X,y)}_{d*,n/5} := { f ∈ V_{d*,n/5} : f(X) = y },
p^{(X,y)}_{n,k} := E[ P^k(x₁, ..., x_k VC shattered by V^{(X,y)}_{d*,n/5}) | S₂^∞, X ], y ∈ {0, 1}.

We define the quantities p^X_{n,k,i}, p^{(X,y)}_{n,k,i} in the same way for the realized version spaces. Again, by Hoeffding's bound, we can estimate these quantities to arbitrary precision using unlabeled data. Similarly as before, we denote these estimates by p̂^X_{n,k,i}, p̂^{(X,y)}_{n,k,i}.

²This random draw is induced by an execution of the VCL game on infinitely many i.i.d. points from P.
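Quantities such as p_{n,k,i} are estimated by plain Monte Carlo over unlabeled draws, with Hoeffding's inequality controlling the precision. Here is a toy sketch for k = 1 and a version space of 1-D thresholds; every name is illustrative (the paper works with general partial concept classes):

```python
import random

def is_shattered(x, a, b):
    """For the version space {f_t : t in [a, b]} of thresholds on [0, 1]
    (f_t(x) = 1 iff x >= t), a single point x is shattered exactly when
    both labels are achievable, i.e. when a <= x < b."""
    return a <= x < b

def mc_shattering_prob(a, b, n_samples, rng):
    # Monte Carlo estimate using unlabeled draws only; by Hoeffding the
    # error is O(sqrt(log(1/delta) / n_samples)) with prob. 1 - delta.
    hits = sum(is_shattered(rng.random(), a, b) for _ in range(n_samples))
    return hits / n_samples

rng = random.Random(0)
p_hat = mc_shattering_prob(0.3, 0.5, 200_000, rng)   # true value is 0.2
```

The point of the sketch is that no labels are consumed: shattering is a property of the version space and the marginal distribution alone, which is why the paper can afford arbitrary precision for these estimates within the unlabeled budget.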
The idea is to make use of Lemma D.4 and show that the classes obtained from a VCL game that has not converged only affect our estimates by some o(1) term. This is formalized in Proposition D.6. Thus, we can use p̂_{n,k,i}, p̂^X_{n,k,i}, p̂^{(X,y)}_{n,k,i} in order to estimate p_{n,k}, p^X_{n,k}, p^{(X,y)}_{n,k}. We denote these estimates by p̂_{n,k}, p̂^X_{n,k}, p̂^{(X,y)}_{n,k}. These are the key quantities we use to infer the labels of unlabeled points. Our algorithm tries to infer the label of each unlabeled point X ∈ S₃ (cf. Figure 7) in the following way: if p̂^X_{n,k} ≥ p̂_{n,k}/2, then we query the label of X; otherwise, we infer the label to be argmax_{y ∈ {0,1}} p̂^{(X,y)}_{n,k}. Lemma D.7 shows that the inferences are correct when n is large enough. The main remaining ingredient of the proof is to show that the number of label queries we submit is sublinear in n. For that, it is sufficient to show that the probability that we query the label of a point is o(1). Lemma D.8 shows that when k = k*, this is indeed the case. Finally, since we do not know the true value of k*, we run the algorithm for every k ≤ d̂_n. The active learning component of the algorithm gives us d̂_n + 1 different labeled datasets, which we use to train d̂_n + 1 instances of a supervised learning algorithm, such as the one from Bousquet et al. (2021). Our analysis so far has shown that, for sufficiently large n, at least one of these datasets will be correctly labeled and of size ω(n). Thus, since the supervised learning algorithm achieves a rate linear in the size of its training set, at least one of these executions gives a classifier that has error o(1/n). The last step is to run a tournament among the d̂_n + 2 different classifiers³ to choose the best one. This is handled by Lemma D.9 (Hanneke, 2012). The main result of this section (cf. Theorem D.10) follows as a corollary of the results we have discussed. All the steps are summarized in Figure 7. Lastly, a lower bound from Hanneke et al.
(2022) completes our characterization (cf. Theorem D.11).

5 Conclusion

In this work we have provided a complete characterization of the optimal learning rates in active learning. It is an open question whether it also holds when the learner knows the distribution P_X.

Acknowledgments

Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N0001419-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Shay Moran is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the European Union (ERC, GENERALIZATION, 101039692). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. Grigoris Velegkas was supported in part by the AI Institute for Learning-Enabled Optimization at Scale (TILOS).

³Recall that we have one more classifier from S₅.

References

N. Alon, S. Hanneke, R. Holzman, and S. Moran. 2021. A Theory of PAC Learnability of Partial Concept Classes. In Proceedings of the 62nd Annual Symposium on Foundations of Computer Science.
Idan Attias, Steve Hanneke, Alkis Kalavasis, Amin Karbasi, and Grigoris Velegkas. 2024. Universal Rates for Regression: Separations between Cut-Off and Absolute Loss. In The Thirty Seventh Annual Conference on Learning Theory. PMLR, 359–405.
M.-F. Balcan, A. Beygelzimer, and J. Langford. 2006. Agnostic Active Learning. In Proceedings of the 23rd International Conference on Machine Learning.
M.-F. Balcan, A. Beygelzimer, and J. Langford. 2009. Agnostic Active Learning. J. Comput. System Sci. 75, 1 (2009), 78–89.
M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. 2010. The True Sample Complexity of Active Learning.
Machine Learning 80, 2–3 (2010), 111–139.
A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. 2016. Search Improves Label for Active Learning. In Advances in Neural Information Processing Systems 29.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. 1989. Learnability and the Vapnik–Chervonenkis Dimension. Journal of the Association for Computing Machinery 36, 4 (1989), 929–965.
Olivier Bousquet, Steve Hanneke, Shay Moran, Jonathan Shafer, and Ilya Tolstikhin. 2022. Fine-Grained Distribution-Dependent Learning Curves. arXiv preprint arXiv:2208.14615 (2022).
O. Bousquet, S. Hanneke, S. Moran, R. van Handel, and A. Yehudayoff. 2021. A Theory of Universal Learning. In Proceedings of the 53rd Annual ACM Symposium on the Theory of Computing.
Tsun-Ming Cheung, Hamed Hatami, Pooya Hatami, and Kaave Hosseini. 2023. Online Learning and Disambiguations of Partial Concept Classes. arXiv preprint arXiv:2303.17578 (2023).
D. Cohn, L. Atlas, and R. Ladner. 1994. Improving Generalization with Active Learning. Machine Learning 15, 2 (1994), 201–221.
Donald L. Cohn. 2013. Measure Theory. Vol. 1. Springer.
S. Dasgupta. 2005. Coarse Sample Complexity Bounds for Active Learning. In Advances in Neural Information Processing Systems 18.
S. Dasgupta, D. Hsu, and C. Monteleoni. 2007. A General Agnostic Active Learning Algorithm. In Advances in Neural Information Processing Systems 20.
David Gale and Frank M. Stewart. 1953. Infinite games with perfect information. Contributions to the Theory of Games 2 (1953), 245–266.
S. Hanneke. 2007a. A Bound on the Label Complexity of Agnostic Active Learning. In Proceedings of the 24th International Conference on Machine Learning.
S. Hanneke. 2007b. Teaching Dimension and the Complexity of Active Learning. In Proceedings of the 20th Conference on Learning Theory.
S. Hanneke. 2009. Theoretical Foundations of Active Learning. Ph.D. Dissertation. Machine Learning Department, School of Computer Science, Carnegie Mellon University.
S.
Hanneke. 2012. Activized Learning: Transforming Passive to Active with Improved Label Complexity. Journal of Machine Learning Research 13, 5 (2012), 1469–1587.
S. Hanneke. 2014. Theory of Disagreement-Based Active Learning. Foundations and Trends in Machine Learning 7, 2–3 (2014), 131–309.
Steve Hanneke, Amin Karbasi, Shay Moran, and Grigoris Velegkas. 2022. Universal Rates for Interactive Learning. In Advances in Neural Information Processing Systems.
Steve Hanneke, Shay Moran, and Qian Zhang. 2023. Universal Rates for Multiclass Learning. In The Thirty Sixth Annual Conference on Learning Theory. PMLR, 5615–5681.
S. Hanneke and L. Yang. 2015. Minimax Analysis of Active Learning. Journal of Machine Learning Research 16, 12 (2015), 3487–3602.
D. Haussler, N. Littlestone, and M. Warmuth. 1994. Predicting {0, 1}-Functions on Randomly Drawn Points. Information and Computation 115, 2 (1994), 248–292.
Wilfrid Hodges. 1993. Model Theory. Cambridge University Press.
Alkis Kalavasis, Grigoris Velegkas, and Amin Karbasi. 2022. Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes. arXiv preprint arXiv:2210.02297 (2022).
Alexander Kechris. 2012. Classical Descriptive Set Theory. Vol. 156. Springer Science & Business Media.
N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2 (1988), 285–318.
V. Vapnik and A. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 2 (1971), 264–280.
Y. Wiener, S. Hanneke, and R. El-Yaniv. 2015. A Compression Technique for Analyzing Disagreement-Based Active Learning. Journal of Machine Learning Research 16, 4 (2015), 713–745.
L. Yang and S. Hanneke. 2013. Activized Learning with Uniform Classification Noise. In Proceedings of the 30th International Conference on Machine Learning.
A Omitted Details from Section 1

A.1 Formal Definition of Learning Setting

We recall some basic notions of measures and probabilities on Polish spaces. For a detailed treatment, the reader is referred to Kechris (2012); Cohn (2013). Our presentation follows Bousquet et al. (2021).

Polish Spaces. A Polish space is a separable topological space that admits a complete metric. For example, this category includes ℝ^n, any compact metric space, any separable Banach space, etc.

Universally Measurable Functions. Let F be the Borel σ-field on some Polish space X and let µ be a probability measure. We denote by F_µ the completion of F under µ, i.e., the collection of all subsets of X that differ from a Borel set on a set of zero measure. A set B ⊆ X is called universally measurable if B ∈ F_µ for every probability measure µ. Moreover, a function f : X → Y is called universally measurable if f⁻¹(B) is universally measurable for any universally measurable set B. An important property is that universally measurable sets and functions on Polish spaces behave the same as Borel sets and functions, from a probabilistic point of view. We are now ready to provide the definition of measurability of a concept class H.

Definition A.1 (Measurability of H). Let X be a Polish space. We say that a concept class H ⊆ {0, 1}^X is measurable if there is a Polish space Θ and a Borel-measurable map h : X × Θ → {0, 1} so that H = {h(·, θ) : θ ∈ Θ}.

We underline that Definition A.1 is very general and its only requirement is that H can be parameterized in some reasonable way.

A.2 Omitted Preliminaries

Gale-Stewart Games. We briefly discuss some basic and useful facts about Gale-Stewart games, a concept that was recently introduced to the learning theory community by Bousquet et al. (2021). Our discussion follows Bousquet et al. (2021); Hanneke et al. (2022). We refer to Bousquet et al. (2021) and the references therein for a more detailed presentation. Let us fix sequences of sets X_t, Y_t for t ≥ 1.
We consider infinite games between two players, a learner P_L and an adversary P_A, where in each round t ≥ 1, P_A selects an element x_t ∈ X_t, and then P_L selects an element y_t ∈ Y_t. The rules of the game are defined by a set W ⊆ ∏_{t ≥ 1}(X_t × Y_t) of winning sequences for P_L. This means that after an infinite sequence of consecutive plays x₁, y₁, x₂, y₂, ..., we say that P_L wins if (x₁, y₁, x₂, y₂, ...) ∈ W; otherwise the winner is P_A. A strategy is a rule used by each of the players to determine their next move, given the current state and the history of the game. Formally, a strategy for P_A is a sequence of functions f_t : ∏_{s < t}(X_s × Y_s) → X_t, and strategies for P_L are defined analogously.

Partial Concept Classes. Many natural data-dependent constraints can be expressed through partial concepts. For instance, linear classification in ℝ^d with margin γ > 0 can be formulated as a partial class: we say that a sample (x₁, y₁), ..., (x_n, y_n) ∈ ℝ^d × {0, 1} is (R, γ)-separable if all the points x₁, ..., x_n lie in a (Euclidean) ball of radius R, the 0-labeled examples and the 1-labeled examples are linearly separable, and the (Euclidean) distance between the 0-labeled examples and the 1-labeled examples is at least 2γ. Then, the class

F_{R,γ} = { f : ℝ^d → {0, 1, ⋆} : ∀(x₁, ..., x_n) ∈ supp(f), ((x₁, f(x₁)), ..., (x_n, f(x_n))) is (R, γ)-separable },

where supp(f) is the set of all points x where f(x) ≠ ⋆, expresses the set of functions that satisfy these constraints. Remarkably, the VC dimension of F_{R,γ} is bounded by O(R²/γ²) (Alon et al., 2021). Perhaps surprisingly, even though the PAC learnability of partial classes is characterized by the VC dimension (as is the case with total classes), the ERM algorithm provably fails to learn partial classes. Thus, the algorithmic landscape is much richer and more complicated compared to total classes. Moreover, there is no algorithmic way to extend a partial concept class to a total concept class without significantly increasing its VC dimension. For details, we refer to Alon et al. (2021).

One-Inclusion Graph Algorithm. We state formally the guarantees of the one-inclusion graph algorithm (Haussler et al., 1994).

Theorem A.5 (One-Inclusion Graph Algorithm (Haussler et al., 1994)).
For any (total) concept class H whose VC dimension is bounded by d < ∞, there is an algorithm A : (X × {0, 1})* × X → {0, 1} such that for any n ∈ ℕ and any sequence (x₁, y₁), ..., (x_n, y_n) ∈ (X × {0, 1})^n that is realizable w.r.t. H,

(1/n!) Σ_{σ ∈ Sym(n)} 1{A(x_{σ(1)}, y_{σ(1)}, ..., x_{σ(n−1)}, y_{σ(n−1)}, x_{σ(n)}) ≠ y_{σ(n)}} ≤ d/n,

where Sym(n) denotes the symmetric group of permutations of {1, ..., n}.

In particular, Theorem A.5 immediately implies that if (x₁, y₁), ..., (x_n, y_n) are i.i.d. from P, then the classifier ĥ_n(·) := A(x₁, y₁, ..., x_n, y_n, ·) has E[er(ĥ_n)] ≤ d/(n+1). We remark that Alon et al. (2021) showed that this result also holds for partial concept classes.

A.3 Omitted Definitions

Definition A.6 (VC Dimension of Partial Concept Classes (Alon et al., 2021)). For a partial concept class F ⊆ {0, 1, ⋆}^X, the VC dimension of F is defined to be the largest number d ∈ ℕ such that there exists (x₁, ..., x_d) ∈ X^d with {(f(x₁), ..., f(x_d)) : f ∈ F} ⊇ {0, 1}^d. Such a sequence (x₁, ..., x_d) is said to be shattered by F. If there is no bound on d, we say that the VC dimension is ∞.

The following definition of the star number is an adaptation of the definition in Hanneke and Yang (2015).

Definition A.7 (Star Number of Partial Concept Classes). For a partial concept class F ⊆ {0, 1, ⋆}^X, the star number of F is defined to be the largest number s ∈ ℕ such that there exist (x₁, ..., x_s) ∈ X^s, f₀ ∈ F, and, for every i ∈ [s], f_i ∈ F with f_i(x_i) = 1 − f₀(x_i) and f_i(x_j) = f₀(x_j) ≠ ⋆ for all j ∈ [s] \ {i}. If there is no bound on s, we say that the star number is ∞.

Definition A.8 (Littlestone Tree (Bousquet et al., 2021)). A Littlestone tree for H ⊆ {0, 1}^X is a complete binary tree of depth d whose internal nodes are labeled by X, and whose two edges connecting a node to its children are labeled by 0 and 1, such that every path of length at most d emanating from the root is consistent with a concept h ∈ H. Formally, a Littlestone tree is a collection

{ x_u : u ∈ {0, 1}^ℓ, 0 ≤ ℓ < d } = {x_∅} ∪ {x₀, x₁} ∪ {x₀₀, x₀₁, x₁₀, x₁₁} ∪ ...
such that for every path y ∈ {0, 1}^d and every n < d, there exists h ∈ H so that h(x_{y≤ℓ}) = y_{ℓ+1} for all 0 ≤ ℓ ≤ n. We say that H has an infinite Littlestone tree if there is a Littlestone tree for H of depth d = ∞.

Definition A.9 (Star Tree). A star tree for H ⊆ {0, 1}^X of depth d consists of a collection

{ (x_u, z_u) ∈ X^{ℓ+1} × {0, 1}^{ℓ+1} : u ∈ {0} × {0, 1} × ... × {0, 1, ..., ℓ − 1}, 0 ≤ ℓ < d }

such that for every level n < d and every y ∈ {0} × {0, 1} × ... × {0, 1, ..., n − 1}, there exists some h ∈ H such that h(x^i_{y≤k}) = f(z^i_{y≤k}, y_{k+1}) for all 0 ≤ i ≤ k and 0 ≤ k ≤ n, where we denote y≤k = (y₁, y₂, ..., y_k), x_{y≤k} = (x⁰_{y≤k}, ..., x^k_{y≤k}), z_{y≤k} = (z⁰_{y≤k}, ..., z^k_{y≤k}), and

f(z^i_{y≤k}, y_{k+1}) = z^i_{y≤k} if y_{k+1} ≠ i, and 1 − z^i_{y≤k} otherwise.

We say that H has an infinite star tree if it has a star tree of depth d = ∞.

Definition A.10 (VCL Tree (Bousquet et al., 2021)). A Vapnik-Chervonenkis-Littlestone (VCL) tree for H ⊆ {0, 1}^X of depth d consists of a collection

{ x_u ∈ X^{ℓ+1} : u ∈ {0, 1}¹ × {0, 1}² × ... × {0, 1}^ℓ, 0 ≤ ℓ < d }.

In the process of Figure 5, if (x_{t−τ_{t−1}+1}, y_{t−τ_{t−1}+1}, ..., x_t, y_t) ∉ image(g̃_{t−1}), then τ_t = τ_{t−1} and g̃_t = g̃_{t−1}. Lemma C.2 shows that, with probability one, the complementary event occurs only finitely many times.

Proof. Suppose that (x_{t−τ_{t−1}+1}, ..., x_t, y_{t−τ_{t−1}+1}, ..., y_t) ∈ image(g̃_{t−1}) happens for an infinite sequence of times t₁, t₂, .... Since η_{τ_t} is a winning strategy for the learner in G, we have that for some finite t it holds that H_{ξ₁,ζ₁,η₁,...,ξ_t,ζ_t,η_t} = ∅ (see Equation (1)), where ξ_i, ζ_i, η_i are defined in Figure 5. Hence, we have arrived at a contradiction.

Similarly as in Bousquet et al. (2021), we remark the following about the measurability of the functions.

Remark C.3 (Adapted from Bousquet et al. (2021)). The strategy τ_t depends in a universally measurable way on x_{≤t}, y_{≤t}. Moreover, the functions g̃_t are universally measurable with respect to x_{≤t}, y_{≤t} and the arguments of their input. To be more precise, for all t ≥ 0, there exist universally measurable functions T_t : (X × {0, 1})^t → {1, ..., t + 1} and G̃_t that realize τ_t and g̃_t as functions of the history.

Lemma C.4. It holds that P[per_{τ_t}(g̃_t) > 0] → 0 as t → ∞.

Proof. We know that P is realizable, so we can pick a sequence of hypotheses h_k ∈ H such that er(h_k) ≤ 2^{−k} for all k ∈ ℕ.
Taking a union bound over all s ≤ t, we have that

Σ_{k ∈ ℕ} P[h_k(X_s) ≠ Y_s for some s ≤ t] ≤ t Σ_{k ∈ ℕ} er(h_k) < ∞.

Thus, by the Borel-Cantelli lemma, with probability one it holds that h(X_s) = Y_s for all s ≤ t, for some h ∈ H. Let

T = sup{ s ≥ 1 : (X_{s−τ_{s−1}+1}, Y_{s−τ_{s−1}+1}, ..., X_s, Y_s) ∈ image(g̃_{s−1}) }.

Then, we know from Lemma C.2 that T is finite with probability one. Recall the law of large numbers for m-dependent sequences: if Z₁, Z₂, ... is an i.i.d. sequence of random variables and f is a bounded measurable function of m variables, then

lim_{n→∞} (1/n) Σ_{i=1}^{n} f(Z_{i+1}, ..., Z_{i+m}) = E[f(Z₁, ..., Z_m)] almost surely,

which follows by splitting the sum into m interleaved sums of i.i.d. terms. Thus, we have that

P[per_{τ_t}(g̃_t) = 0] = P[ Σ_{s=t+1}^{∞} 1{(X_s, Y_s, ..., X_{s+τ_t−1}, Y_{s+τ_t−1}) ∈ image(g̃_t)} = 0 ]
≥ P[ Σ_{s=t+1}^{∞} 1{(X_s, Y_s, ..., X_{s+τ_t−1}, Y_{s+τ_t−1}) ∈ image(g̃_t)} = 0, T ≤ t ] = P[T ≤ t].

We know that T is finite with probability one, so we have that P[per_{τ_t}(g̃_t) > 0] ≤ P[T > t] → 0 as t → ∞.

Lemma C.4 guarantees that for some t* ∈ ℕ, the Gale-Stewart game played on a set of t* points will have converged with probability at least 7/8, i.e., Pr[per(g̃_{t*}) > 0] ≤ 1/8. This number t* depends on the distribution and is unknown to the learner. We first show how the learner can estimate a similar t̂_n from the data. Our approach is an adaptation of Lemma 5.10 in Bousquet et al. (2021), but we present the proof for completeness.

Lemma C.5 (Adaptation of Lemma 5.10 in Bousquet et al. (2021)). For any N ∈ ℕ, there exists a universally measurable t̂_N = t̂_N(X₁, Y₁, ..., X_{⌊N/2⌋}, Y_{⌊N/2⌋}) whose definition does not depend on P so that the following holds. Let the critical time t* be such that Pr[per(g̃_{t*}) > 0] ≤ 1/8, where the probability is over the training set of the algorithm. There exist C, c > 0 that depend on P, t* but not on N so that Pr[t̂_N ∈ T*] ≥ 1 − Ce^{−cN}, where the probability is over the training of the estimator t̂_N and T* is the set T* = {1 ≤ t ≤ t* : Pr[per(g̃_t) > 0] ≤ 3/8}.

Proof. We split this labeled set into two parts. The idea is to use the first one to run the Gale-Stewart game and the other one to estimate whether it has converged or not.
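The split-sample idea in this proof can be sketched abstractly: train one pattern function per batch on the first part of the data, then estimate the non-convergence probability by the fraction of trained functions whose forbidden patterns still occur on a held-out part. Everything below (the "game", the convergence rule, the holdout check) is a hypothetical stand-in for the process of Figure 5:

```python
def fraction_nonconverged(batches, holdout, train_game, hits_holdout):
    """Split-sample convergence test: train one pattern function per
    batch, then return the fraction of them that still realize a
    forbidden pattern on the held-out data."""
    gs = [train_game(batch) for batch in batches]
    bad = sum(1 for g in gs if hits_holdout(g, holdout))
    return bad / len(gs)

# Toy stand-ins: a "game" that converges iff its batch contains a 0,
# and a pattern that keeps occurring on the holdout iff not converged.
train_game = lambda batch: (0 in batch)       # True = converged
hits_holdout = lambda g, holdout: not g
batches = [[0, 1], [1, 1], [0, 0], [1, 0]]
frac = fraction_nonconverged(batches, [0, 1], train_game, hits_holdout)
```

The key structural point mirrored here is independence: each trained function depends only on its own batch, so the batch outcomes are i.i.d. draws, and the held-out data is independent of all of them, which is what licenses the Hoeffding bounds in the proof.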
For each 1 ≤ t ≤ N/4 and 1 ≤ i ≤ ⌊N/(4t)⌋, we let

τ^i_t := T_t(X_{(i−1)t+1}, Y_{(i−1)t+1}, ..., X_{it}, Y_{it}),
g̃^i_t(w₁, z₁, ..., w_{τ^i_t}, z_{τ^i_t}) := G̃_t(X_{(i−1)t+1}, Y_{(i−1)t+1}, ..., X_{it}, Y_{it}, w₁, z₁, ..., w_{τ^i_t}, z_{τ^i_t})

be the estimated quantities when we run Figure 5 on the i-th batch of the data. Next, for each t we estimate Pr[per(g̃_t) > 0] by the fraction of the g̃^i_t for which there is a tuple in the second quarter of the data that belongs to the image of g̃^i_t. For every fixed t, the data that the functions (g̃^i_t)_{i ≤ ⌊N/(4t)⌋} are trained on are independent of each other and of the second quarter of the training set. This means that we can view every g̃^i_t, i ≤ ⌊N/(4t)⌋, as an independent draw from the distribution of g̃_t. To estimate whether the game has converged, we use the second quarter of the training set. We let

ê_t = (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{∃s : ⌊N/4⌋ + 1 ≤ s ≤ ⌊N/2⌋ − τ^i_t, (X_{s+1}, Y_{s+1}, ..., X_{s+τ^i_t}, Y_{s+τ^i_t}) ∈ image(g̃^i_t)}.

Now observe that, with probability one,

ê_t ≤ e_t := (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{per(g̃^i_t) > 0},

since a function whose patterns have probability zero almost surely produces no pattern in the data. We define

t̂_N = inf{ t ≤ N/4 : ê_t < 1/4 },

where we assume that inf ∅ = ∞. We now want to bound the probability that t̂_N > t*. Using Hoeffding's inequality we get that

Pr[t̂_N > t*] ≤ Pr[ê_{t*} ≥ 1/4] ≤ Pr[e_{t*} − E[e_{t*}] ≥ 1/8] ≤ e^{−⌊N/(4t*)⌋/32}.

This implies that t̂_N ≤ t* except for an event with exponentially small probability. Moreover, for every 1 ≤ t ≤ t* with Pr[per(g̃_t) > 0] > 3/8, there is some ε > 0 such that Pr[per(g̃_t) > ε] > 5/16 (this holds by continuity). Now fix some 1 ≤ t ≤ t* such that Pr[per(g̃_t) > 0] > 3/8 (if it exists). Then, using Hoeffding's inequality again, we get that

Pr[ (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{per(g̃^i_t) > ε} < 1/4 ] ≤ e^{−⌊N/(4t)⌋/128}.

Whenever g is a function such that per(g) > ε, then

Pr[ ∃s : ⌊N/4⌋ + 1 ≤ s ≤ ⌊N/2⌋ − τ, (X_{s+1}, Y_{s+1}, ..., X_{s+τ}, Y_{s+τ}) ∈ image(g) ] ≥ 1 − (1 − ε)^{⌊(N−4)/(4τ)⌋},

since there are ⌊(N−4)/(4τ)⌋ disjoint intervals of length τ in the second quarter. As we mentioned before, the (g̃^i_t)_{i ≤ ⌊N/(4t)⌋} are independent of (X_s, Y_s)_{s > ⌊N/4⌋}.
Thus, applying a union bound, with probability at least 1 − (N/(4t))(1 − ε)^{⌊(N−4)/(4t*)⌋}, every g̃^i_t with per_{τ^i_t}(g̃^i_t) > ε makes at least one error on the second quarter of the training set, where we used that τ^i_t ≤ t*. Thus, we get that

Pr[t̂_N = t] ≤ Pr[ê_t < 1/4] ≤ (N/(4t))(1 − ε)^{⌊(N−4)/(4t*)⌋} + e^{−⌊N/(4t)⌋/128}.

Using the previous estimates and applying a union bound over 1 ≤ t ≤ t*, we get that

Pr[t̂_N ∉ T*] ≤ e^{−⌊N/(4t*)⌋/32} + t* N (1 − ε)^{⌊(N−4)/(4t*)⌋} + t* e^{−⌊N/(4t*)⌋/128} ≤ Ce^{−cN},

for some constants C, c > 0.

Essentially, Lemma C.5 tells us that we can estimate a batch size t̂_n, using only information from the data, so that, with probability 1 − Ce^{−cn}, the star tree game will have converged when we run it on t̂_n labeled points. Following the approach of Hanneke et al. (2022), the idea is to run the game multiple times, then create the partial concept classes that are induced by each execution, and, finally, aggregate them into a majority class (see Figure 3). We show that if most of the games have converged, then the majority class has bounded star number. This is made formal in Lemma C.6. We remark that Hanneke et al. (2022) showed a similar result regarding the VC dimension of the majority class.

Lemma C.6. The majority class F^{N̂}_m defined in Figure 3 has star number bounded by some distribution-dependent constant ŝ, with probability at least 1 − e^{−cN̂}, for some absolute constant c > 0.

Proof. We know that there exists some s* such that, with probability at least 9/10, the star number of any partial concept class F_i is at most s*. We let X_i = 1{star number of F_i > s*}. Notice that all the X_i's are i.i.d. Bernoulli random variables with p ≤ 1/10, since the data that induce the classes F_i are i.i.d.
Thus, we can use Hoeffding's inequality to bound the probability that more than a 2/10-fraction of them have star number greater than s* as follows:

Pr[ Σ_{i=1}^{N̂} X_i ≥ (2/10)N̂ ] ≤ Pr[ Σ_{i=1}^{N̂} X_i − E[Σ_{i=1}^{N̂} X_i] ≥ (1/10)N̂ ] ≤ e^{−N̂/50}.

We let E₁ be the complement of the event above and we condition on it for the rest of the proof. So, we know that at least (8/10)N̂ of the partial concept classes F_i have star number bounded by s*. We will bound the size ℓ of any star set of F^{N̂}_m. For any star set S = (x₁, y₁, ..., x_ℓ, y_ℓ) ∈ (X × {0, 1})^ℓ of F^{N̂}_m, it holds for every j ∈ [ℓ] that

(1/N̂) Σ_{i=1}^{N̂} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > 9/16,

by the definition of the majority class. Summing over j ∈ [ℓ] and swapping the order of summation gives

Σ_{i=1}^{N̂} Σ_{j=1}^{ℓ} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > (9/16) ℓ N̂.

Using Markov's inequality (an averaging argument over the classes, each inner sum being at most ℓ), at least a 2/9-fraction of the classes F_i must satisfy

Σ_{j=1}^{ℓ} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > (7/16) ℓ.

Thus, we have shown that at least a 2/9-fraction of the partial concept classes witness at least a 7/16-fraction of the star patterns. By the definition of the star number, this means that all these classes have star number at least (7/16)ℓ − 1: we can drop from S all the points that correspond to star patterns that the class does not realize, and the remaining set of labeled points is a star set. Since 2/9 + 8/10 > 1, at least one of these classes has star number at most s*, so (7/16)ℓ − 1 ≤ s*, which means that ℓ ≤ (16/7)(s* + 1). Thus, we have shown that the star number of F^{N̂}_m is at most (16/7)(s* + 1) =: ŝ, with probability at least 1 − e^{−cN̂}.

Similarly as in Hanneke et al. (2022), we define a sequence of universally measurable functions G_ℓ : (X × {0, 1})^ℓ → {0, 1},

G_ℓ(x₁, y₁, ..., x_ℓ, y_ℓ) = 1{∃f ∈ F^{N̂}_m : (f(x₁), ..., f(x_ℓ)) = (y₁, ..., y_ℓ)}.
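The majority class of Figure 3 admits a labeling exactly when enough of the individual classes do. A toy sketch of such a 9/16-vote, with the classes represented as hypothetical finite sets of labelings (the paper's classes are of course infinite partial classes):

```python
def majority_indicator(classes, labeling, threshold=9/16):
    """Indicator of the majority class on a labeling: returns 1 iff at
    least a `threshold`-fraction of the classes realizes the labeling."""
    votes = sum(1 for F in classes if labeling in F)
    return 1 if votes >= threshold * len(classes) else 0

# Toy example: four classes over two points. The labeling (0, 1) is
# realized by three of them (3 >= 9/16 * 4 = 2.25); (1, 1) by only one.
classes = [
    {(0, 1), (0, 0)},
    {(0, 1)},
    {(0, 1), (1, 1)},
    {(1, 0)},
]
```

The aggregation is robust by design: a labeling rejected by the majority must be rejected by at least one of the correct (converged) classes, since the incorrect classes cannot on their own push the vote below the 9/16 threshold; this is the mechanism exploited in the proof of Lemma C.7.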
An equivalent interpretation of the previous lemma is that the star number of the partial class on which {G_ℓ}_{ℓ∈ℕ} returns 1 has a distribution-dependent bound, with high probability. Lemma C.7 is an important component in the derivation of our result; it shows that {G_ℓ}_{ℓ∈ℕ} return 1 on all finite subsets of the correctly labeled data sequence, with high probability. This result follows similarly to Lemma 7 in Hanneke et al. (2022).

Lemma C.7. The {G_ℓ}_{ℓ∈ℕ} as defined above are universally measurable functions, and with probability at least 1 − Ce^{−cN} for some C, c > 0, the following hold: (i) the star number of the class

F_G = { f : X → {0, 1, ⋆} : ∀ℓ ∈ ℕ, ∀(x₁, ..., x_ℓ) ∈ X^ℓ, G_ℓ(x₁, f(x₁), ..., x_ℓ, f(x_ℓ)) = 1 }

has a finite distribution-dependent upper bound ŝ; (ii) for every ℓ ∈ ℕ and (x₁, y₁, ..., x_ℓ, y_ℓ) ~ P^ℓ, G_ℓ(x₁, y₁, ..., x_ℓ, y_ℓ) = 1 with conditional probability one (given G_ℓ).

Proof. We condition on the event E₀ described in Lemma C.6. Notice that, by definition, a labeled sequence S = (x₁, y₁, ..., x_ℓ, y_ℓ) ∈ (X × {0, 1})^ℓ is a star set of the class F_G if and only if S is a star set of F^{N̂}_m. Hence, the bound on the star number follows immediately from Lemma C.6. We also condition on the event E₁ described in Lemma C.5. We now bound the probability that at least (13/32)N̂ of the functions g̃^i_{t̂_N} have per(g̃^i_{t̂_N}) > 0. For any t ∈ T*, using Hoeffding's inequality we get that

Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0} ≥ (13/32)N̂ ] ≤ Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0} − E[Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0}] ≥ (1/32)N̂ ] ≤ exp(−N̂/512).

Since we know that t̂_N ∈ T*, taking a union bound over all 1 ≤ t ≤ t* we get that

Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_{t̂_N}) > 0} ≥ (13/32)N̂ ] ≤ t* exp(−N̂/512).

Thus, except for an event with exponentially small probability, at least (19/32)N̂ of the functions g̃^i_{t̂_N} have per(g̃^i_{t̂_N}) = 0, i.e., with probability one their forbidden patterns never occur under P. Notice that, by definition, if the partial class F_i that corresponds to such a function cannot produce a labeling y^ℓ ∈ {0, 1}^ℓ of some tuple x^ℓ ∈ X^ℓ where x^ℓ ~ P^ℓ_X, we can infer that this y^ℓ is not the correct labeling, and this inference is valid with probability one over the draw of x^ℓ.
Thus, with probability one, if the $9/16$-majority cannot produce some labeling, then at least one partial class $F_j$ that corresponds to a correct $\tilde g^j_{\hat t_N}$ cannot produce this labeling, so it is not the correct one. As a result, with probability one, if $F^{\hat N}_m$ cannot produce $y^\ell$ for $x^\ell$, we know that this is not the correct labeling. The measurability of $\{G_\ell\}_{\ell \in \mathbb{N}}$ follows from the measurability of all the $\tilde g^i_{\hat t_N}, i \in [\hat N]$. The proof of the lemma follows by noticing that the probability of all the events we have conditioned on can be bounded by $1 - C' e^{-c' \hat N}$ and that $\hat N = \lfloor N/(2\hat t_N) \rfloor \ge \lfloor N/(2t^*) \rfloor$.

We are now ready to prove the result about active learning at exponential universal rates. Our approach is summarized in Figure 4.

Theorem C.8. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ does not have an infinite star tree. Then, $H$ admits an active learning algorithm that achieves exponential rates.

Proof. First, notice that the Gale-Stewart subroutine (Figure 3) uses $n/2$ points and the exponential rates algorithm for partial classes uses at most $e^{cn}$ points. This is because we define a uniform distribution over these points during its execution. To prove the learning rate guarantees, we first condition on the event $E_0$ described in Lemma C.7. Then, we have some partial concept class $F_G$ whose star number is bounded by some distribution-dependent number $\hat s$ and which has the property that $\forall \ell \in \mathbb{N}$, for $(x_1, y_1, \ldots, x_\ell, y_\ell) \sim P^\ell$, $f(x_1) = y_1, \ldots, f(x_\ell) = y_\ell$ for some $f \in F_G$, with probability one. Thus, the conditions of Theorem C.1 are satisfied and the conditional error rate of the output of our algorithm is $\mathbb{E}[\mathrm{er}(\hat h_n) \mid E_0] \le c_1 \hat d e^{-c_2 n/\hat s}$. Since $E_0$ happens with probability at least $1 - Ce^{-cn}$, we see that the unconditional error of the output is
$$\mathbb{E}[\mathrm{er}(\hat h_n)] \le c_1 \hat d e^{-c_2 n/\hat s} + Ce^{-cn} \le C' e^{-c'n},$$
for some distribution-dependent constants $c', C' > 0$.

C.5 Sublinear Rates Lower Bound

Theorem C.9. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ has an infinite star tree and let $R(n)$ be a rate function with $R(n) = o(1/n)$. Then, $H$ requires rate $R(n)$.

Proof. Let
$$t = \Big\{(x_u, z_u) \in \mathcal{X}^{\ell+1} \times \{0,1\}^{\ell+1} : u \in \{0\} \times \{0,1\} \times \cdots \times \{0,1,\ldots,\ell\},\ \ell \ge 0\Big\}$$
be an infinite star tree. We also fix the learning algorithm $\hat h_n$. Recall that $\{u_n\}_{n \in \mathbb{N}}$ denotes the number of unlabeled samples that $\hat h_n$ uses. We first pick a path on the tree uniformly at random. To be more precise, let $y = (y_1, y_2, \ldots)$ be a sequence of independent random variables where $y_i$ is uniformly distributed in $\{0, \ldots, i-1\}$. The idea is that $y$ specifies a random path on the tree. We define the target distribution $P^y$ in an inductive way by specifying sequences $\{k_i, p_i, n_i\}_{i \ge 1}$ and putting mass $p_i$ on the node of the random path at level $k_i$. Conditional on the given node, we distribute the mass among its points uniformly. Let us now be formal. For the base of the induction, we let $p_2 = 1/2$ and $k_1 = 1$. We will specify the value of $p_1$ later. For the inductive step, for any $i \ge 2$, given the values of $p_i, k_{i-1}$, we will specify $k_i, p_{i+1}$. We let $k_i$ be the smallest integer so that $k_i > k_{i-1}$ and $R(\lceil k_i/8 \rceil) < p_i/k_i$. Notice that since $R(n) = o(1/n)$, $k_i < \infty$. We let $n_i = \lceil k_i/8 \rceil$ and $p_{i+1} = \min\{1/(4u_{n_i}),\ p_i/4\}$. Notice that since $\{p_i > 0\}_{i \ge 2}$ and $\sum_{i \ge 2} p_i < 1$ (this holds because $p_{i+1} \le p_i/4$), by setting $p_1 = 1 - \sum_{i \ge 2} p_i$ the sequence $\{p_i\}_{i \ge 1}$ can be used as a probability distribution. We are now ready to define $P^y$ on $\mathcal{X} \times \{0,1\}$. To make the notation simpler, for any $y$ and for $k \ge 1$, let
$$w_j^{y_{\le k-1}} = \begin{cases} 1 - z_j^{y_{\le k-1}}, & \text{if } j = y_k, \\ z_j^{y_{\le k-1}}, & \text{otherwise.} \end{cases}$$
For $i \ge 1$, let
$$P^y\big(x_j^{y_{\le k_i-1}}, w_j^{y_{\le k_i-1}}\big) = \frac{p_i}{k_i}, \qquad 0 \le j \le k_i - 1.$$
Notice that, by the definition of $p_i$, this is indeed a probability distribution. We now show that $P^y$ is realizable. Since $t$ is a star tree, for every $n$ there exists some $h \in H$ such that $h(x_j^{y_{\le k-1}}) = w_j^{y_{\le k-1}}$ for all $0 \le j \le k-1$ and $1 \le k \le n$. Thus, for the probability of error of $h$ we have that
$$\mathrm{er}^y(h) := P^y\big[(x, y) \in \mathcal{X} \times \{0,1\} : h(x) \ne y\big] \le \sum_{i : k_i > n} p_i,$$
so by taking $n \to \infty$ we see that $P^y$ is realizable for every $y$. Moreover, we note that the mapping $y \mapsto P^y$ is measurable. We now let $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots$ be i.i.d. samples drawn from $P^y$.
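The inductive construction of the sequences $\{k_i, p_i, n_i\}$ is easy to instantiate numerically. A minimal sketch, assuming for illustration the rate $R(n) = 1/n^2$ (which is $o(1/n)$) and, hypothetically, $u_n = n$ unlabeled samples at budget $n$; neither choice comes from the proof:

```python
import math

def build_sequences(R, u, levels=4):
    # Sketch of the inductive construction from the proof of Theorem C.9:
    # k_i is the smallest integer with k_i > k_{i-1} and R(ceil(k_i/8)) < p_i/k_i,
    # n_i = ceil(k_i/8), and p_{i+1} = min(1/(4*u(n_i)), p_i/4).
    k = {1: 1}          # base case: k_1 = 1
    p = {2: 0.5}        # base case: p_2 = 1/2 (p_1 is fixed at the end)
    n = {}
    for i in range(2, levels + 1):
        ki = k[i - 1] + 1
        while not R(math.ceil(ki / 8)) < p[i] / ki:
            ki += 1     # finite since R(n) = o(1/n)
        k[i] = ki
        n[i] = math.ceil(ki / 8)
        p[i + 1] = min(1 / (4 * u(n[i])), p[i] / 4)
    # p_{i+1} <= p_i/4 guarantees sum_{i>=2} p_i < 1, so p_1 > 0.
    p[1] = 1 - sum(p[i] for i in range(2, levels + 2))
    return k, p, n

# Illustrative (hypothetical) choices of the rate function and of u_n:
R = lambda m: 1 / m ** 2
u = lambda m: m
k, p, n = build_sequences(R, u)
assert p[1] > 0 and abs(sum(p.values()) - 1) < 1e-9
assert all(R(n[i]) < p[i] / k[i] for i in (2, 3, 4))
```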
We can draw them as $X = x_J^{y_{\le T-1}}$, $Y = w_J^{y_{\le T-1}}$, $X_j = x_{J_j}^{y_{\le T_j-1}}$, $Y_j = w_{J_j}^{y_{\le T_j-1}}$, where $(T, J), (T_1, J_1), (T_2, J_2), \ldots$ are i.i.d. random variables, independent of the path $y$, with $\Pr[T = k_i, J = j] = p_i/k_i$ for all $i \ge 1$, $0 \le j \le k_i - 1$. For all $i \in \mathbb{N}$, let us define the event $E^i_0 = \{T_1 \le k_i, T_2 \le k_i, \ldots, T_{u_{n_i}} \le k_i\}$, under which the algorithm has not received any unlabeled points from any level $k_j, j > i$. Let us also call $E^i_1$ the event that $\hat h_{n_i}$ does not query the point whose label is flipped, i.e., the point $x_{y_{k_i}}^{y_{\le k_i - 1}}$. Also, we denote by $E^i_2$ the event that the learner classifies at least one point of level $k_i$ incorrectly. Then, we have that
$$\Pr[E^i_2 \mid E^i_1, E^i_0] \ge 1 - \frac{1}{(7/8)k_i},$$
since there are at least $(7/8)k_i$ points on this level that the learner has not queried, and under the events we have conditioned on, the flipped point along the path is chosen uniformly at random among these points. We now bound $\Pr[E^i_0]$. Notice that, since $p_{i+1} \le p_i/4$,
$$\sum_{j > i} p_j \le \frac{4}{3}\,p_{i+1} \le \frac{1}{3u_{n_i}},$$
so $\Pr[E^i_0] \ge (1 - 1/(3u_{n_i}))^{u_{n_i}} \ge 2/3$. Finally, we need to bound $\Pr[E^i_1 \mid E^i_0]$. Notice that $\Pr[E^i_1 \mid E^i_0] \ge 7/8$, since the learner has $n_i$ label queries and there are at least $8n_i$ points on this level. Putting it together, we can bound
$$\Pr[E^i_2] = \Pr[E^i_2 \mid E^i_1, E^i_0]\,\Pr[E^i_1 \mid E^i_0]\,\Pr[E^i_0] \ge \frac{9}{20}$$
for all large enough $i$, which using reverse Markov's inequality implies that $\Pr_y\big[\Pr[E^i_2 \mid y] \ge 9/40\big] \ge 9/40$. Moreover, notice that
$$\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{p_i}{k_i}\,\Pr[E^i_2 \mid y] \ge R(n_i)\,\Pr[E^i_2 \mid y].$$
Thus, we can see that $\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{9}{40}R(n_i)$ with probability at least $9/40$ over the draw of $y$. Using Fatou's lemma, we have that
$$\mathbb{E}_y\Big[\limsup_i \frac{1}{R(n_i)}\Pr[\hat h_{n_i}(X) \ne Y \mid y]\Big] \ge \mathbb{E}_y\Big[\limsup_i \frac{1}{R(n_i)}\min\Big\{\Pr[\hat h_{n_i}(X) \ne Y \mid y], \frac{9}{40}R(n_i)\Big\}\Big] \ge \limsup_i \frac{1}{R(n_i)}\,\mathbb{E}_y\Big[\min\Big\{\Pr[\hat h_{n_i}(X) \ne Y \mid y], \frac{9}{40}R(n_i)\Big\}\Big] \ge \limsup_i \frac{1}{R(n_i)}\cdot\frac{9}{40}R(n_i)\cdot\Pr_y\Big[\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{9}{40}R(n_i)\Big] \ge \frac{9}{40}\cdot\frac{9}{40},$$
where the first inequality is by the definition of min, the second inequality follows from Fatou's lemma, and the third inequality follows from Markov's inequality. We remark that Fatou's lemma can be applied here since
$$\frac{1}{R(n_i)}\min\Big\{\Pr\big[\hat h_{n_i}(X) \ne Y \mid y\big], \frac{9}{40}R(n_i)\Big\} \le \frac{9}{40}.$$
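The "reverse Markov" step invoked above is the following elementary inequality, stated generically for a random variable $Z$ taking values in $[0,1]$ and any threshold $t$ below its mean (here it is applied with $Z = \Pr[E^i_2 \mid y]$):

```latex
% For Z \in [0,1] and any t \in (0, \mathbb{E}[Z]):
\mathbb{E}[Z] \;\le\; t \cdot \Pr[Z < t] \;+\; 1 \cdot \Pr[Z \ge t]
\;\le\; t + \Pr[Z \ge t],
\qquad\text{hence}\qquad
\Pr[Z \ge t] \;\ge\; \mathbb{E}[Z] - t .
```

In particular, a lower bound of $c$ on $\mathbb{E}[Z]$ yields $\Pr[Z \ge c/2] \ge c/2$.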
Thus, there must exist a realization of $y$ such that $\mathbb{E}[\mathrm{er}^y(\hat h_n) \mid y] \ge C \cdot R(n)$ infinitely often, where $C$ is some absolute constant. Choosing $P = P^y$ for this realization of $y$ concludes the proof.

D Omitted Details from Section 4

D.1 Useful Tools

Theorem D.1 (Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality). Let $F$ be the CDF of some distribution $D$ supported on $\mathbb{R}$, $n \in \mathbb{N}$, and let $F_n$ be the empirical CDF obtained from $n$ i.i.d. samples from $D$. Then, $\forall \varepsilon > 0$ we have that
$$\Pr\Big[\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon\Big] \le 2e^{-2n\varepsilon^2}.$$

Figure 6: The VCL Game (Bousquet et al., 2021).
VCL Game Subroutine (Bousquet et al., 2021): Input is a labeled sequence $\{(x_1, y_1), \ldots, (x_N, y_N)\}$.
1. Let $\tau_0 \leftarrow 0$.
2. In every step $t \ge 1$: If $\eta_{\tau_{t-1}}(\xi_1, \ldots, \xi_{\tau_{t-1}-1}, x_{t-\tau_{t-1}+1}, \ldots, x_t) = (y_{t-\tau_{t-1}+1}, \ldots, y_t)$: Let $\xi_{\tau_{t-1}} \leftarrow (x_{t-\tau_{t-1}+1}, \ldots, x_t)$ and $\tau_t \leftarrow \tau_{t-1} + 1$. Else, $\tau_t \leftarrow \tau_{t-1}$.

D.2 The VCL Game

For completeness, we provide various algorithms and results that appeared in Bousquet et al. (2021); Hanneke et al. (2022), which we utilize in our approach to get sublinear learning rates. It is instructive to start with an informal description of a pattern avoidance function $g : \mathcal{X}^k \to \{0,1\}^k$. Intuitively, this function takes as input a $k$-tuple of unlabeled data generated by a realizable distribution $P_{\mathcal{X}}$ and it returns a labeling of these data that is not the true labeling under $P$. They define
$$\mathrm{per}(g) = P^k\big[(x_1, y_1), \ldots, (x_k, y_k) : g(x_1, \ldots, x_k) = (y_1, \ldots, y_k)\big].$$
Bousquet et al. (2021) provide a way to obtain such a pattern avoidance function for any realizable distribution. This is obtained by the VCL game in Figure 6, which returns a function $\hat y_{\tau_t}$. Their following result shows that as the number of points $N$ used in the VCL game grows, the probability that we obtain a pattern avoidance function that is incorrect goes to zero.

Lemma D.2 (Correct Pattern Avoidance Function (Restatement of Lemma 5.7 of Bousquet et al. (2021))). Consider the VCL game in Figure 6. Then, $\Pr[\mathrm{per}(\hat y_{\tau_t}) > 0] \to 0$ as $N \to \infty$.

Finally, we remark that Bousquet et al.
(2021) showed the following result.

Theorem D.3 (Supervised Learning Linear Rates (Bousquet et al., 2021)). If $H$ does not have an infinite VCL tree, there exists a supervised learning algorithm that can learn $H$ at a linear rate.

D.3 Omitted Details of the Sublinear Rates Algorithm

Lemma D.4. Let $F^i_{\sqrt{n}/5}$, $i \in [\sqrt{n}]$, be the partial concept classes that are obtained by playing the VCL game on batches of $\sqrt{n}/5$ examples that are generated i.i.d. from $P$. Then, with probability at least $1 - e^{-C n^{1/4}}$, where $C > 0$ is some absolute constant, $P$ is a realizable distribution for a $(1 - o(1))$ fraction of these classes.

Proof. Let $\hat y^i_{\sqrt{n}/5}$ be the pattern avoidance function that is obtained from the $i$-th batch of $\sqrt{n}/5$ i.i.d. samples from $P$. Since we know that $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] \to 0$ as $n \to \infty$ (cf. Lemma D.2), for any $\varepsilon > 0$ there is some $b_\varepsilon$ such that if the batch size $b = \sqrt{n}/5$ is at least $b_\varepsilon$, then $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] < \varepsilon$. Hence, for every $n$ there is some $\varepsilon_n = o(1)$ so that when the batch size is $\sqrt{n}/5$ we have that $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] < \varepsilon_n$. Notice that whenever $\mathrm{per}(\hat y^i_{\sqrt{n}/5}) = 0$, the distribution $P$ is realizable with respect to the corresponding class $F^i_{\sqrt{n}/5}$. Let us set $\delta_n = n^{-1/8}$. Using Hoeffding's inequality, we have that with probability at least $1 - e^{-C' \delta_n^2 \sqrt{n}} = 1 - e^{-C' n^{1/4}}$, where $C'$ is some absolute constant, at least a $(1 - \varepsilon_n - \delta_n)$ fraction of the $\hat y^i_{\sqrt{n}/5}$ will have $\mathrm{per}(\hat y^i_{\sqrt{n}/5}) = 0$. Since $\delta_n + \varepsilon_n = o(1)$ we have shown the claim.

Lemma D.5. Let $\tilde P$ be the distribution of the VC dimension of the partial concept classes that are obtained by playing the VCL game on infinitely many i.i.d. samples from $P$. Let $d^*$ be the $9/10$-quantile of $\tilde P$. Then, for $n \ge n_P$, where $n_P$ is some $P$-dependent constant, we have that $\hat d_n = d^*$ with probability at least $1 - 3e^{-C_P n^{1/4}}$, where $C_P$ is a constant that depends on the data-generating distribution $P$.

Proof. Consider the following experiment. We sample an infinite sequence of labeled points from $P$ and we run the VCL game (cf. Figure 6) on this sequence.
Let $\hat y$ be the pattern avoidance function that is obtained from that game, defined over $\hat\ell$ many points, and let
$$\hat F := \big\{f : \mathcal{X} \to \{0,1,\star\} : (f(x_1), \ldots, f(x_{\hat\ell})) \ne \hat y(x_1, \ldots, x_{\hat\ell})\big\}.$$
Let $d^* \in \mathbb{N}$ be the smallest number such that
$$\Pr[d(\hat F) \le d^*] > 9/10.$$
Let $F$ be the CDF of the distribution of the VC dimension of $\hat F$, let $d_1 < d^* < d_2$ be the largest and smallest numbers, respectively, such that $F(d_1) \ne F(d^*)$ and $F(d_2) \ne F(d^*)$, and let $C_P = \min\{F(d^*) - F(d_1), F(d_2) - F(d^*)\}/4$. Let $F_k$ be the empirical CDF of this distribution obtained from $k$ i.i.d. samples. Notice that whenever we estimate $F$ with accuracy $C_P$, the empirical $9/10$-quantile will be the same as the true $9/10$-quantile. The DKW inequality (Theorem D.1) shows that this happens with probability at least $1 - 2e^{-2C_P^2 k}$. Let us call this event $E_1$ and condition on it. Next, we need to handle the fact that we do not obtain samples from partial concept classes that are generated from VCL games on infinitely many i.i.d. samples from $P$, but merely on $\sqrt{n}/5$ many such samples. Let us consider the following experiment. We run the VCL game on $k$ streams of infinitely many i.i.d. samples from $P$, and if the game has not terminated within the first $\sqrt{n}/5$ rounds, we rewind the pattern avoidance function to the one that is obtained in that round. Then, Lemma D.4 shows that, with probability at least $1 - e^{-C n^{1/4}}$, we will only rewind the pattern avoidance function of an $o(1)$ fraction of these games; the rest will have converged. We call this event $E_2$ and we condition on it. Let us denote by $\hat F_k$ the empirical CDF of the VC dimension of the partial concept classes obtained by this experiment. Under $E_2$, we have that $\sup_{x \in \mathbb{N}} |F_k(x) - \hat F_k(x)| = o(1)$. Thus, by taking $n$ large enough so that the $o(1)$ term is smaller than $C_P$, we see that the estimate $\hat d_n$ obtained from the samples of the truncated VCL game also equals $d^*$. The bound on the probability of error follows from a union bound over $E_1, E_2$.

Proposition D.6.
For $n > n_P$, where $n_P$ is some distribution-dependent constant, given $m = O(n^3)$ unlabeled samples we can estimate $p_{n,k}, p^X_{n,k}, p^{(X,y)}_{n,k}$ with accuracy $o(1)$, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$, where $C_P, C'_P > 0$ are distribution-dependent constants.

Proof. As we alluded to before, using $O(n^3)$ unlabeled samples in total we can estimate the quantities $p_{n,k,i}, p^X_{n,k,i}, p^{(X,y)}_{n,k,i}$ with accuracy $1/n$ for each version space $V^i_{n/5}$, with probability at least $1 - C_1 n e^{-C_2 n}$, where $C_1, C_2 > 0$ are absolute constants. Let us denote these estimates $\hat p_{n,k,i}, \hat p^X_{n,k,i}, \hat p^{(X,y)}_{n,k,i}$. We condition on this event for the rest of the proof. Assume that $n$ is large enough and condition on the events of Lemma D.4 and Lemma D.5. Let us denote by $i_1, \ldots, i_{k'}$ the indices of the partial concept classes whose VC dimension is bounded by $\hat d_n = d^*$. By definition, $k' \ge (9/10)\sqrt{n}$. Consider the following statistical experiment. We run the VCL game on infinitely many i.i.d. samples from $P$, we define the partial concept class obtained from the pattern avoidance function of the game, and if the class has VC dimension at most $d^*$ we keep the sample, otherwise we discard it. Then, we define its version space on the first $n/5$ samples of $S_2$. Let us call $\tilde V^i$ the version space obtained by the $i$-th sample of this experiment. Then, $P^k(x_1, \ldots, x_k \text{ are shattered by } \tilde V^i)$ is an unbiased sample from the distribution whose expected value $p_{n,k}$ we are trying to estimate, and similarly for the rest of the quantities we have defined. Let us call these samples $\tilde p_{n,k,i}, \tilde p^X_{n,k,i}, \tilde p^{(X,y)}_{n,k,i}$ and denote by $\tilde S_n, \tilde S^X_n, \tilde S^{(X,y)}_n$ their empirical averages over $(9/10)\sqrt{n}$ many samples. Hoeffding's inequality gives us that $|\tilde S_n - p_{n,k}| = o(1)$, $|\tilde S^X_n - p^X_{n,k}| = o(1)$, $|\tilde S^{(X,y)}_n - p^{(X,y)}_{n,k}| = o(1)$, with probability at least $1 - Ce^{-C' n^{1/4}}$, for some absolute constants $C, C' > 0$. We condition on this event.
Notice that, under the events we have conditioned on, the samples $p_{n,k,i}, p^X_{n,k,i}, p^{(X,y)}_{n,k,i}$ we obtain from running the VCL game on $O(\sqrt{n})$ many samples differ from the samples $\tilde p_{n,k,i}, \tilde p^X_{n,k,i}, \tilde p^{(X,y)}_{n,k,i}$ on a sublinear number of terms. Moreover, the absolute difference on the terms they disagree on is bounded by 1. Thus, if we denote by $S_n, S^X_n, S^{(X,y)}_n$ the averages of these values, we have that $|\tilde S_n - S_n| = o(1)$, $|\tilde S^X_n - S^X_n| = o(1)$, $|\tilde S^{(X,y)}_n - S^{(X,y)}_n| = o(1)$. Finally, let $\hat S_n, \hat S^X_n, \hat S^{(X,y)}_n$ be the averages of $\hat p_{n,k,i}, \hat p^X_{n,k,i}, \hat p^{(X,y)}_{n,k,i}$. Since $|\hat p_{n,k,i} - p_{n,k,i}| = o(1)$, $|\hat p^X_{n,k,i} - p^X_{n,k,i}| = o(1)$, $|\hat p^{(X,y)}_{n,k,i} - p^{(X,y)}_{n,k,i}| = o(1)$, using the triangle inequality on all the previous estimates we see that $|\hat S_n - p_{n,k}| = o(1)$, $|\hat S^X_n - p^X_{n,k}| = o(1)$, $|\hat S^{(X,y)}_n - p^{(X,y)}_{n,k}| = o(1)$, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$.

(Footnote to the definition of $C_P$ in the proof of Lemma D.5: if only one of these numbers exists, then we ignore the one that does not exist in the definition of $C_P$. If neither of them exists, then the estimation task is trivial.)

Lemma D.7. Let $n > C_{S_2}$, where $C_{S_2}$ is some constant that depends on $S_2$ and $P$. Then, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$, for every $k \le k^*$, if $\hat p^X_{n,k} < \hat p_{n,k}/2$ then $\hat y = \mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{n,k}$ is the correct label of $X$.

Proof. Let $n > C_{S_2}$ be large enough so that for all $k \le k^*$ we have that $p_{n,k} \le (100/99)\lim_{n \to \infty} p_{n,k}$ and the guarantee from Proposition D.6 gives us that
$$\hat p^X_{n,k} < \hat p_{n,k}/2 \implies p^X_{n,k} < (3/4)\,p_{n,k},$$
$$p^{(X,y)}_{n,k} \ge (99/100)\,p_{n,k} \implies \hat p^{(X,y)}_{n,k} \ge (95/100)\,\hat p_{n,k},$$
$$p^{(X,y)}_{n,k} < (80/100)\,p_{n,k} \implies \hat p^{(X,y)}_{n,k} < (90/100)\,\hat p_{n,k}.$$
We condition on these events for the rest of the proof. Since $p_{n,k} \le (100/99)\lim_{n \to \infty} p_{n,k}$, for the correct label $y^*$ we have that $p^{(X,y^*)}_{n,k} \ge (99/100)\,p_{n,k}$. Moreover, since $p^X_{n,k} < (3/4)\,p_{n,k}$, for the other label $\bar y = 1 - y^*$ it holds that $p^{(X,\bar y)}_{n,k} < (80/100)\,p_{n,k}$. Thus, under the events we have conditioned on, $\mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{n,k}$ will indeed be the correct label. The correctness probability follows directly from Proposition D.6.

Lemma D.8.
For $k = k^*$ we have that $\Pr\big[\hat p^X_{n,k^*} \ge \hat p_{n,k^*}/2\big] = o(1)$.

Proof. Essentially, the proof of this result follows by Markov's inequality. Let us consider some large enough $n$ and condition on the event described in Proposition D.6. Then,
$$\hat p^X_{n,k^*} \ge \hat p_{n,k^*}/2 \implies p^X_{n,k^*} \ge p_{n,k^*}/4,$$
so it suffices to bound $\Pr\big[p^X_{n,k^*} \ge p_{n,k^*}/4\big]$. Using Markov's inequality, we have that
$$\Pr\big[p^X_{n,k^*} \ge p_{n,k^*}/4\big] \le \frac{4\,\mathbb{E}[p^X_{n,k^*}]}{p_{n,k^*}} = o(1),$$
where we have used the fact that $\mathbb{E}[p^X_{n,k^*}] = p_{n,k^*+1}$ and that $p_{n,k^*}$ is bounded below by a constant.

We now state the guarantees of Active Select (Hanneke, 2012). We note that the original statement of the result did not specify the size of the unlabeled dataset that the algorithm uses, but a straightforward adaptation of it shows that with $O(n^3)$ many unlabeled data its error guarantee increases by $O(1/n^2)$.

Lemma D.9 (Hanneke (2012)). A call to Active Select (cf. Figure 8) with $N$ classifiers $\{h_1, \ldots, h_N\}$, a query budget of $m$, and $O(N^2 n^3)$ unlabeled points makes at most $m$ label queries, and if $h_{\hat k}$ is the classifier it returns, then with probability at least $1 - C_1 N e^{-C_2 m/(N^2 \log(N))}$, we have $\mathrm{er}(h_{\hat k}) \le 2\min_{k \in [N]} \mathrm{er}(h_k) + O(1/n^2)$.

We are now ready to prove the main result in this section. The algorithm is summarized in Figure 7.

Theorem D.10. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ does not have an infinite VCL tree. Then, $H$ admits an active learning algorithm that achieves sublinear rates.

Proof. We have all the main ingredients we need to prove our result. Let us fix some $S_2$ whose elements are i.i.d. from $P$. Conditioning on the results of Lemma D.7 and Lemma D.8, we have that for large enough $n$, which depends on $S_2$, we can obtain a labeled dataset $L_n$ whose size is $\omega(n)$. Feeding this dataset into the supervised learning algorithm from Bousquet et al. (2021) gives us a classifier $\tilde h_n$ that, conditioned on $S_2$ and the events $E$ we have described, has error $\mathbb{E}[\mathrm{er}(\tilde h_n) \mid S_2, E] = o(1/n)$. Moreover, the probability of the complement of $E$ is also $o(1/n)$, so we get $\mathbb{E}[\mathrm{er}(\tilde h_n) \mid S_2] = o(1/n)$.
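The query-saving rule justified by Lemma D.7 and Lemma D.8 can be sketched in code: when the estimate $\hat p^X_{n,k}$ falls below half of $\hat p_{n,k}$, the label is inferred via the argmax over the estimates $\hat p^{(X,y)}_{n,k}$; otherwise a label query is made. The numeric estimates below are hypothetical, chosen only to exercise both branches:

```python
def infer_or_query(p_hat, pX_hat, pXy_hat):
    """Sketch of the decision rule from the sublinear-rates algorithm.
    p_hat:   estimate of p_{n,k}
    pX_hat:  estimate of p^X_{n,k}
    pXy_hat: dict {0: estimate of p^{(X,0)}_{n,k}, 1: estimate of p^{(X,1)}_{n,k}}
    Returns ('infer', label) when the test of Lemma D.7 fires,
    and ('query', None) when a label query must be spent instead."""
    if pX_hat < p_hat / 2:
        label = max((0, 1), key=lambda y: pXy_hat[y])
        return ('infer', label)
    return ('query', None)

# Hypothetical estimates: the test fires, and label 1 dominates the argmax.
assert infer_or_query(0.8, 0.3, {0: 0.05, 1: 0.25}) == ('infer', 1)
# Hypothetical estimates: the test does not fire, so the label is queried.
assert infer_or_query(0.8, 0.5, {0: 0.2, 1: 0.3}) == ('query', None)
```

Lemma D.8 guarantees that at $k = k^*$ the rule fires on all but an $o(1)$ fraction of points, which is what makes the labeled dataset $L_n$ of size $\omega(n)$ affordable within the query budget.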
Similarly, for the output $\hat h_n$ of Active Select, we have that $\mathbb{E}[\mathrm{er}(\hat h_n) \mid S_2] = o(1/n)$ and $\mathbb{E}[\mathrm{er}(\hat h_n)] = O(1/n)$, where the second bound comes from the fact that we feed into Active Select the output of the passive learning algorithm $\hat h^p_n$ from Bousquet et al. (2021) (cf. Theorem D.3), whose bound does not depend on $S_2$. Let us call $R(n)$ the error rate of our algorithm. To show that $\mathbb{E}[R(n)] = o(1/n)$ it suffices to prove that $\limsup_n \mathbb{E}[n R(n)] = 0$. Fatou's lemma gives us that
$$\limsup_n n\,\mathbb{E}[R(n)] = \limsup_n \mathbb{E}_{S_2}\big[\mathbb{E}[n R(n) \mid S_2]\big] \le \mathbb{E}_{S_2}\big[\limsup_n \mathbb{E}[n R(n) \mid S_2]\big] = 0,$$
where Fatou's lemma applies since $\mathbb{E}[n R(n) \mid S_2] \le C_P$, where $C_P$ is some $P$-dependent constant. This is because, almost surely over $S_2$, the expected error of $\hat h^p_n$ is bounded by $C'_P/n$.

D.4 Arbitrarily Slow Rates Lower Bound

Theorem D.11 (Hanneke et al. (2022)). If $H$ has an infinite VCL tree, no active learning algorithm can achieve rates that are faster than arbitrarily slow.

D.5 Omitted Figures

Figure 7: Sublinear Rates Algorithm.
Sublinear Rates Algorithm [Adaptation from Hanneke (2012)]. Input is a label budget $N$ and a passive learning algorithm $A^p$.
1. Let $S_1, S_2, S_3, S_4, S_5, U$ be sets of $O(N^3)$ unlabeled points that are i.i.d. from $P_{\mathcal{X}}$.
2. Request the labels of the first $N/5$ points of $S_5$ and run $A^p$ on this set. Let $\hat h^p_N$ be its output.
3. Request the labels of the first $N/5$ points of $S_1$, split them into batches of size $O(\sqrt{N})$, and run the VCL game on each batch (cf. Figure 6).
4. Let $\hat y^i_{\sqrt{N}/5}$, $i \in [\sqrt{N}]$, be the pattern avoidance functions from the previous step, and for every $i$ let
$$F^i_{\sqrt{N}/5} := \big\{f : \mathcal{X} \to \{0,1,\star\} : (f(x_1), \ldots, f(x_{\ell_i})) \ne \hat y^i_{\sqrt{N}/5}(x_1, \ldots, x_{\ell_i})\big\}.$$
5. Let
$$\hat d_N := \min\big\{d \in \mathbb{N} : \exists\, i_1 < i_2 < \cdots < i_{(9/10)\sqrt{N}} \text{ with } d(F^{i_j}_{\sqrt{N}/5}) \le d,\ \forall j \in [(9/10)\sqrt{N}]\big\}.$$
6. Request the labels of the first $N/5$ points of $S_2$ and let $V^i_{N/5} := \{f \in F^i_{\sqrt{N}/5} : f(x) = y \text{ for these } N/5 \text{ points}\}$.
7. For $k = 0, \ldots, \hat d_N$:
(a) Let $\tilde S_3$ be the next $N^2$ examples in $S_3$ and let $\tilde m_N = N/(5(\hat d_N + 1))$ be the label budget.
(b) Let $\hat S = \{\}$.
(c) For each point $X \in \tilde S_3$:
i. For each $y \in \{0,1\}$, estimate $\hat p_{N,k}, \hat p^X_{N,k}, \hat p^{(X,y)}_{N,k}$ using the $V^i_{N/5}$ and fresh unlabeled points from $U$, as described in Proposition D.6.
ii. If $\hat p^X_{N,k} < \hat p_{N,k}/2$, infer $y' = \mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{N,k}$; otherwise request the label $y'$ of $X$.
iii. Append $(X, y')$ to $\hat S$ and update the label budget $\tilde m_N$. If $\tilde m_N = 0$, jump to Step (d).
(d) Run $A^p$ on $\hat S$ and let $\hat h_k$ be its output.
8. Return Active Select$\big(\{\hat h_0, \ldots, \hat h_{\hat d_N}, \hat h^p_N\}, N/5, S_4\big)$ (cf. Figure 8).

Figure 8: Active Select Algorithm (Hanneke, 2012).
Active Select (Hanneke, 2012). Input is a set of classifiers $\{h_1, \ldots, h_L\}$, label budget $m$, sequence of unlabeled examples $U$.
1. For each $j, k \in \{1, \ldots, L\}$ s.t. $j < k$:
(a) Let $R_{jk}$ be the first $\big\lfloor \frac{m}{j(L-j)\ln(eL)} \big\rfloor$ points in $U \cap \{x : h_j(x) \ne h_k(x)\}$ (if such a value exists).
(b) Request the labels for $R_{jk}$ and let $Q_{jk}$ be the set of labeled examples.
(c) Let $m_{kj} = \mathrm{er}_{Q_{jk}}(h_k)$.
2. Return $h_{\hat k}$, where $\hat k = \max\{k \in \{1, \ldots, L\} : \max_j$