Universal Rates for Active Learning

Steve Hanneke (Purdue University, steve.hanneke@gmail.com)
Amin Karbasi (Yale University, amin.karbasi@yale.edu)
Shay Moran (Technion, Google Research, smoran@technion.ac.il)
Grigoris Velegkas (Yale University, grigoris.velegkas@yale.edu)

Abstract

In this work we study the problem of actively learning binary classifiers from a given concept class, i.e., learning by utilizing unlabeled data and submitting targeted queries about their labels to a domain expert. We evaluate the quality of our solutions by considering the learning curves they induce, i.e., the rate of decrease of the misclassification probability as the number of label queries increases. The majority of the literature on active learning has focused on obtaining uniform guarantees on the error rate, which can only explain the upper envelope of the learning curves over families of different data-generating distributions. We diverge from this line of work and focus on the distribution-dependent framework of universal learning, whose goal is to obtain guarantees that hold for any fixed distribution, but do not apply uniformly over all distributions. We provide a complete characterization of the optimal learning rates that are achievable by algorithms that have to specify the number of unlabeled examples they use ahead of their execution. Moreover, we identify combinatorial complexity measures that give rise to each case of our tetrachotomic characterization. This resolves an open question that was posed by Balcan et al. (2010). As a byproduct of our main result, we develop an active learning algorithm for partial concept classes that achieves exponential learning rates in the uniform setting.

1 Introduction

The most prototypical type of learning is that of supervised learning, where an algorithm is given as input n labeled data points sampled i.i.d.
from some unknown distribution, and the goal is to output a function that has low probability of misclassifying new data from the same distribution. One caveat of this passive model of learning is that it fails to capture settings in which unlabeled data is easily accessible, but obtaining labels is costly. A natural example that fits this description is webpage classification. It is easy for any web crawler to collect information about billions of different webpages in a very short amount of time; however, understanding which category a webpage belongs to usually requires human feedback. Perhaps the most commonly used way to model this problem is through the active learning framework, in which the learner is given access to a large stream of unlabeled data and a budget of n queries that it can submit to a domain expert in order to obtain the label of some data point. The learner's goal is to submit queries for the labels of the most informative examples and, as a result, eliminate some redundancy in the information content of labeled data. In this paper we study the optimal learning rates that are achievable by an active learning algorithm. In the passive learning setting, it is common to measure the quality of an algorithm by its learning curve, i.e., by plotting the decay of its error rate as the number of training examples increases.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

In contrast, in the active learning setting, the resource we are interested in is the number of label queries the algorithm requests, so it is natural to consider the rate of decay of the misclassification probability as the number of queries n increases. Most of the prior work on active learning has focused on obtaining uniform guarantees on the error rate, i.e., guarantees that hold uniformly over the data-generating distributions.
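To make the query-budget model concrete, here is a minimal pool-based sketch (our own illustration, not an algorithm from this paper): the learner scans the unlabeled stream and spends a label query only on points where its current version space disagrees. All names (`active_learn`, `query_label`) are assumptions for the example.

```python
def active_learn(unlabeled, query_label, n_queries, hypotheses):
    """Toy version-space active learner: spend a label query only on points
    where the surviving hypotheses disagree. Illustrative sketch only."""
    version_space = list(hypotheses)
    budget = n_queries
    for x in unlabeled:
        preds = {h(x) for h in version_space}
        if len(preds) > 1 and budget > 0:  # informative point: query it
            y = query_label(x)
            budget -= 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space[0]  # any hypothesis consistent with all queries
```

With thresholds on a small grid, for instance, only the points that still disambiguate the version space cost a query; points on which all surviving hypotheses agree are labeled for free.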
In this work we follow a different path and view the problem through the lens of universal rates, a framework recently introduced by Bousquet et al. (2021). In this framework we aim for guarantees that hold for every distribution, but that do not hold uniformly over them. In other words, we allow for distribution-dependent constants in the error rate. To make the distinction between uniform and universal rates clear, we first recall what uniform learnability of a hypothesis class H means. We say that H is uniformly learnable at rate R(n) if there exist a learning rule $\hat h_n$ and constants $c, C > 0$ such that, for every realizable distribution $P$, $\mathbb{E}[\mathrm{er}_P(\hat h_n)] \le C \cdot R(c \cdot n)$ for all $n \in \mathbb{N}$. The above expression states that there exist a learning rule $\hat h_n$ and distribution-independent constants such that for all realizable distributions the expected error rate of the classifier is at most $C \cdot R(c \cdot n)$. The difference in the definition of universal learnability is that we swap the order of the quantifiers. To be more precise, we say that a class H is universally learnable at rate R(n) if there exists a learning rule $\hat h_n$ such that, for every realizable distribution $P$, there exist constants $c, C > 0$ for which $\mathbb{E}[\mathrm{er}_P(\hat h_n)] \le C \cdot R(c \cdot n)$ for all $n \in \mathbb{N}$. Note that in this definition the constants $c, C$ are distribution-dependent. As is evident from our main result and from prior results in universal learning (Bousquet et al., 2021; Hanneke et al., 2022; Kalavasis et al., 2022; Bousquet et al., 2022; Hanneke et al., 2023), this change in the definition affects the landscape of the optimal learning rates significantly.

1.1 Related Work

Active Learning. There has been a very long line of work deriving theoretical guarantees for active learning, both in the realizable and the agnostic setting (Cohn et al., 1994; Dasgupta, 2005; Balcan et al., 2006; Hanneke, 2007b,a; Dasgupta et al., 2007; Hanneke, 2009; Balcan et al., 2009, 2010; Hanneke, 2012, 2014; Wiener et al., 2015; Hanneke and Yang, 2015; Beygelzimer et al., 2016).
As we alluded to before, most of these works focus on obtaining minimax guarantees, i.e., the bounds they provide hold in the worst case over a family of distributions. To be more precise, while many of these results have distribution-dependent guarantees, they are typically expressed in a way that aims to match a lower bound on the minimax performance over a family of distributions, with respect to some parameter such as the disagreement coefficient. For these reasons, the results in these works do not capture the full spectrum of universal rates. We also remark that a few works do study universal rates, e.g., Balcan et al. (2010); Hanneke (2012); Yang and Hanneke (2013), but none of them derives a complete characterization of the optimal rates.

Universal Rates. The study of universal learning rates was put forth in the seminal work of Bousquet et al. (2021), who derived a complete characterization of the optimal rates in the supervised learning setting. Later, Kalavasis et al. (2022) extended these results to the multiclass setting with a bounded number of classes, and Hanneke et al. (2023) improved upon this result by characterizing multiclass classification with an infinite number of labels. Subsequently, the work of Bousquet et al. (2022) derived more fine-grained results for binary classification compared to Bousquet et al. (2021). The work most closely related to ours is Hanneke et al. (2022), which derives a complete characterization of the optimal learning rates in a very general interactive learning setting. In that setting, the learner is allowed to submit arbitrary binary-valued queries about the unlabeled data. We provide a detailed comparison between our results and theirs in Section 1.5. Very recently, Attias et al. (2024) studied universal rates in the context of regression.

Partial Concept Classes.
The vast majority of the literature in learning theory has focused on total concept classes, i.e., classes that consist of functions that are defined everywhere on the instance domain. Recently, Alon et al. (2021) proposed a learning theory of partial concept classes, i.e., classes that consist of functions that can be undefined on some parts of the instance domain. Later, Kalavasis et al. (2022) extended some of the results to the multiclass setting with a finite number of labels. Recently, Cheung et al. (2023) studied partial concept classes in the context of online learning and showed that there are such classes that are online learnable, yet none of their extensions to total concept classes is online learnable. The advantage of partial concepts is that they provide a convenient way to express data-dependent constraints. En route to obtaining our main result, we design active learning algorithms for partial concept classes. For a more detailed discussion of partial concepts we refer the reader to Appendix A.2.

1.2 Formal Setting

Learning Model. We now present formally the learning setting that we consider in this work. There is a domain $X$, which we assume to be a Polish space, and a concept class $\mathcal{H} \subseteq \{0,1\}^X$, which satisfies standard measurability assumptions (see Definition A.1). We define a classifier $h : X \to \{0,1\}$ to be a universally measurable function, and its error rate is defined as $\mathrm{er}_P(h) := P[(x, y) : h(x) \neq y]$, where $P$ is the data-generating distribution over $X \times \{0,1\}$. When $P$ is clear from the context we may drop the subscript $P$ in $\mathrm{er}_P(h)$. We call $P$ realizable with respect to the hypothesis class $\mathcal{H}$ if $\inf_{h \in \mathcal{H}} \mathrm{er}_P(h) = 0$. We denote by $P_X$ the marginal distribution of $P$ on $X$.

Active Learning Model. We define an active learning algorithm to be a sequence of universally measurable functions which, given access to a stream of unlabeled data points from $X$ that are drawn i.i.d.
from $P_X$ and a label query budget $n$, output a classifier $\hat h_n : X \to \{0,1\}$. In this work we consider a non-adaptive active learning model with respect to the unlabeled data. In particular, the learning algorithm needs to specify a function $u : \mathbb{N} \to \mathbb{N}$ so that $u(n)$ is the number of unlabeled points it observes and $n$ is the number of points for which it can request the label. The number $u(n)$ is specified before the execution of the algorithm and cannot be modified based on the realization of the unlabeled sequence or the answers it gets for the labels of the points it queries. We place no bound whatsoever on the function $u(\cdot)$. Also, the algorithm can only request the labels of points it has observed, not arbitrary points from $X$. We emphasize that the label requests the algorithm makes can be adaptive, and can depend on answers to previous label queries. To the best of our knowledge, the active learning algorithms that have been proposed in the literature either fit into this model or can be modified to satisfy the non-adaptivity restriction with negligible performance loss.

Learning Rates. We now define formally what it means for an algorithm to achieve a learning rate R(n) in the universal learning model. We adopt the definition of Bousquet et al. (2021).

Definition 1.1 (Learning Rates (Bousquet et al., 2021)). Fix a concept class $\mathcal{H}$, and let $R : \mathbb{N} \to [0,1]$ with $R(n) \to 0$ as $n \to \infty$ be a rate function, where $n$ is the label query budget of the learner.

- $\mathcal{H}$ is learnable at rate $R$ if there is a learning algorithm $\hat h_n$ such that for every realizable distribution $P$, there exist $c, C > 0$ for which $\mathbb{E}[\mathrm{er}(\hat h_n)] \le C R(cn)$ for all $n \in \mathbb{N}$.
- $\mathcal{H}$ is not learnable at rate faster than $R$ if for every learning algorithm $\hat h_n$ there exist a realizable distribution $P$ and $c, C > 0$ for which $\mathbb{E}[\mathrm{er}(\hat h_n)] \ge C R(cn)$ for infinitely many $n \in \mathbb{N}$.
- $\mathcal{H}$ is learnable with optimal rate $R$ if it is learnable at rate $R$ and it is not learnable at rate faster than $R$.
- $\mathcal{H}$ admits arbitrarily fast rates if, for every rate function $R$, it is learnable at rate $R$.
- $\mathcal{H}$ requires arbitrarily slow rates if, for every rate function $R$, it is not learnable at rate faster than $R$.

Combinatorial Measures. We now define some combinatorial complexity measures that our characterization relies on. To make the presentation easier to follow, we provide informal definitions; for the formal ones, we refer the reader to Appendix A.3. We first describe the Littlestone tree, which was introduced by Bousquet et al. (2021).

Definition 1.2 (Littlestone Tree, Informal (see Definition A.8) (Bousquet et al., 2021)). A Littlestone tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete binary tree of depth $d$ whose nodes are labeled by elements of $X$ and whose edges to the left and right child are labeled by 0 and 1, respectively. We require that for every level $0 \le n < d$ and every path from the root to a node at level $n$ there is some $h \in \mathcal{H}$ that realizes this path. We say that $\mathcal{H}$ has an infinite Littlestone tree if it has a Littlestone tree of depth $d = \infty$.

We underline that this notion can be thought of as an infinite extension of the Littlestone dimension (Littlestone, 1988) of $\mathcal{H}$. Recall that the Littlestone dimension is defined to be the largest $d \in \mathbb{N}$ for which $\mathcal{H}$ has a Littlestone tree of such depth, and it is $\infty$ if one can construct Littlestone trees of arbitrary depth. Crucially, this is not the same as having a single tree whose depth is infinite, so one can see that infinite Littlestone dimension is not the same as having an infinite Littlestone tree. We next give the definition of the Vapnik-Chervonenkis-Littlestone (VCL) tree, which was introduced by Bousquet et al. (2021).

Definition 1.3 (VCL Tree, Informal (see Definition A.10) (Bousquet et al., 2021)). A VCL tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete tree of depth $d$ such that every node at level $0 \le n < d$ is labeled by an element of $X^{n+1}$, has branching factor $2^{n+1}$, and the $2^{n+1}$ edges connecting a node to its children are labeled by the elements of $\{0,1\}^{n+1}$.
We require that for every node at any level $0 \le n < d$, the path from the root to this node is realized by some $h \in \mathcal{H}$. We say that $\mathcal{H}$ has an infinite VCL tree if it has a VCL tree of depth $d = \infty$. Intuitively, the VCL tree combines the notions of the Littlestone tree and the VC dimension (Vapnik and Chervonenkis, 1971; Blumer et al., 1989). The differences between the Littlestone tree and the VCL tree are that in the latter the size of the nodes increases linearly with the level and the branching factor increases exponentially, whereas in the former all the nodes are singletons and the branching factor is always two. We are now ready to introduce a new combinatorial measure, which we call the star tree.

Definition 1.4 (Star Tree, Informal (see Definition A.9)). A star tree for $\mathcal{H} \subseteq \{0,1\}^X$ is a complete tree of depth $d$ such that every node at level $0 \le n < d$ is labeled by an element of $(X \times \{0,1\})^{n+1}$, has branching factor $n+1$, and the $n+1$ edges connecting a node to its children are labeled by $\{0, \dots, n\}$. The label of the edge indicates the element of the node whose label along the path is flipped. We require that for every node at level $0 \le n < d$, the path from the root to this node is realized by $\mathcal{H}$. We say that $\mathcal{H}$ has an infinite star tree if it has a star tree of depth $d = \infty$.

Essentially, this definition combines the structure of a Littlestone tree with the notion of the star number (Hanneke and Yang, 2015). Every node at level $n$ consists of $n+1$ labeled points and there are $n+1$ edges attached to it. Each edge indicates which of the $n+1$ points of the labeled node has its label flipped along every path that this edge is part of.

1.3 Main Results

We are now ready to state the main results of this work. Our first main result is a complete characterization of the optimal learning rates that a class $\mathcal{H}$ admits in the active learning setting.

Theorem 1.5. For every concept class $\mathcal{H}$ exactly one of the following cases holds.

- $\mathcal{H}$ is actively learnable at arbitrarily fast rates.
- $\mathcal{H}$ is actively learnable at an optimal rate $e^{-n}$.
- $\mathcal{H}$ is actively learnable at $o(1/n)$ rates, but requires rates arbitrarily close to $1/n$.
- $\mathcal{H}$ requires arbitrarily slow rates for active learning.

Our next result characterizes exactly when these rates occur, by specifying combinatorial complexity measures of $\mathcal{H}$ that determine which case of the tetrachotomy it falls into.

Theorem 1.6. For every concept class $\mathcal{H}$ the following hold.

- If $\mathcal{H}$ does not have an infinite Littlestone tree, then it is learnable at arbitrarily fast rates.
- If $\mathcal{H}$ has an infinite Littlestone tree but does not have an infinite star tree, then it is learnable at optimal rate $e^{-n}$.
- If $\mathcal{H}$ has an infinite star tree but does not have an infinite VCL tree, then it is learnable at optimal rate $o(1/n)$, but requires rates arbitrarily close to $1/n$.
- If $\mathcal{H}$ has an infinite VCL tree, then it requires arbitrarily slow rates.

We remark that the landscape of the optimal rates looks significantly different compared to the passive setting (Bousquet et al., 2021) and the general interactive learning setting (Hanneke et al., 2022). Our main result answers an open question posed in Balcan et al. (2010), which asks for necessary and sufficient conditions for learnability at an exponential rate in the active learning setting.

1.4 Examples

We now present examples of classes that witness each case of our tetrachotomic characterization.

Arbitrarily fast rates: Bousquet et al. (2021) give examples with no infinite Littlestone tree (e.g., thresholds on the integers, and positive halfspaces on $\mathbb{N}^d$), hence learnable at arbitrarily fast rates.

Exponential rates: Famously, threshold classifiers $[a, \infty)$ over $\mathbb{R}$ have an infinite Littlestone tree and are actively learnable at exponential rate (even uniformly).
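As a hedged illustration of the threshold case, the classical binary-search active learner localizes the decision boundary among m pool points with roughly log2(m) label queries, so the error under the pool distribution decays exponentially in the budget n. The function names below are ours, not the paper's.

```python
def learn_threshold(points, query_label, n_queries):
    """Binary-search active learner for thresholds h_a(x) = 1[x >= a] on R.
    After n queries on a sorted pool, the boundary is localized to an
    interval containing at most |pool| / 2^n pool points, so the empirical
    error decays exponentially in n. Illustrative sketch only."""
    pts = sorted(points)
    lo, hi = 0, len(pts)  # first index with label 1 lies in [lo, hi]
    for _ in range(n_queries):
        if lo >= hi:
            break
        mid = (lo + hi) // 2
        if query_label(pts[mid]) == 1:  # label 1 => boundary at or before mid
            hi = mid
        else:                           # label 0 => boundary strictly after mid
            lo = mid + 1
    a_hat = pts[lo] if lo < len(pts) else float("inf")
    return lambda x: 1 if x >= a_hat else 0
```

With a pool of 100 points, 7 queries already suffice to recover the target's labels on the entire pool.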
A more interesting example is the class of interval classifiers $[a, b]$ on $\mathbb{R}$, which (also famously) has uniform rate $1/n$ for active learning (it has infinite star number), and has an infinite Littlestone tree, but it has no infinite star tree: for any choice of $x$ in the root node, the edge labeling $x$ as 1 yields a version space with star number equal to 4 (it is effectively 2 disjoint threshold problems), so the depth of that subtree is bounded. Therefore, by our theory, it is actively learnable at $e^{-n}$ universal rate (in stark contrast to the uniform rate $1/n$).

Sublinear rates: Halfspaces on $\mathbb{R}^d$ have an infinite star tree but no infinite VCL tree. To construct that star tree, the key observation is that any set in convex position is a star set (Balcan et al., 2010).

Arbitrarily slow rates: Bousquet et al. (2021) provide examples of classes with an infinite VCL tree, such as the class of all classifiers, so these require arbitrarily slow rates in our setting as well.

1.5 Comparison to Supervised Learning and the General Interactive Learning Setting

We now compare our characterization to the results of Bousquet et al. (2021), which prove an analogous result in the supervised learning setting. Let us first recall their main result, which shows that in this learning model a class is universally learnable at:

- Exponential rate $e^{-n}$ if and only if it does not have an infinite Littlestone tree.
- Linear rate $1/n$ if and only if it has an infinite Littlestone tree but does not have an infinite VCL tree.
- Arbitrarily slow rates if and only if it has an infinite VCL tree.

As one can see, our results illustrate the advantage that active learning algorithms have over their supervised counterparts. We underline that the only case where active learning does not offer an improvement compared to the passive setting is when $\mathcal{H}$ has an infinite VCL tree. Let us now discuss the interactive learning model that was considered by Hanneke et al. (2022).
In this general model of interaction, the learner is allowed to ask arbitrary binary queries about any subset of the unlabeled data. In particular, these queries include, but are not limited to, label queries, comparison queries, and general membership queries. They show that the optimal rates that any class $\mathcal{H}$ admits are the following:

- Arbitrarily fast if and only if it does not have an infinite Littlestone tree.
- Exponential if and only if it has an infinite Littlestone tree but not an infinite VCL tree.
- Arbitrarily slow if and only if it has an infinite VCL tree.

We underline that the algorithm of Hanneke et al. (2022) that achieves arbitrarily fast rates uses only label queries, and the number of unlabeled data points $u(n)$ that it uses can be chosen statically prior to the execution of the algorithm. Therefore, this result applies to the setting we consider in our work. Moreover, the lower bounds they provide also apply to our setting, since (i) the queries that Hanneke et al. (2022) consider are more general, and (ii) their lower bounds hold even when the learner knows the marginal distribution $P_X$. As is evident from our main result, the active learning setting provides a richer landscape of optimal rates compared to the general interactive learning setting of Hanneke et al. (2022). In particular, the mere absence of an infinite VCL tree for $\mathcal{H}$ does not necessarily imply that we can achieve (at least) exponential rates. To get this result, Hanneke et al. (2022) make strong use of these general queries in order to provide an algorithm that has a binary-search flavor and can learn partial concept classes with finite VC dimension at an exponential rate. Our main result shows that such guarantees are not achievable using only label queries.

1.6 Technical Challenges

The most technically innovative part of our work is the $o(1/n)$ algorithm. Let us set up some terminology to facilitate our discussion.
The version space of a concept class, given a labeled dataset $S$, is the set of concepts that classify all the elements of $S$ correctly. Moreover, the VCL game is a Gale-Stewart game defined by Bousquet et al. (2021) that was used in their passive learning algorithm achieving $1/n$ universal rate. The original (passive) learner from Bousquet et al. (2021) partitions a portion of the data into some number $B(n)$ of batches, from which they construct partial concept classes of finite VC dimension by using the batches to play the VCL game against the player's winning strategy; for each resulting partial concept class they run a separate learner on another portion of data and return a majority vote of the resulting classifiers. The latter part of this strategy could never work in the active learning setting: the number of batches $B(n)$ must be an increasing function of $n$, and applying an active learner to each partial concept class separately would require each to use (nearly) $\Omega(n)$ queries (to get the $o(1/n)$ guarantee there), so the total number of queries would be (nearly) $\Omega(B(n) \cdot n) \gg n$, violating the label budget $n$. To resolve this, we ended up completely re-imagining how to use these partial concept classes. Rather than running a separate algorithm for each class, we run a single active learning algorithm, where individual decisions of whether to query involve voting over the partial concept classes. We first extend an algorithm of Hanneke (2012) (for total concepts) to achieve $o(1/n)$ rate for partial concept VC classes (this required completely re-formulating Hanneke's analysis). The resulting algorithm involves estimating the probability that a random $k$-tuple is VC-shattered by certain constrained version spaces. We replace this with an estimated average (over a select subset of partial concept classes) of this shattering probability, which we show composes appropriately with the analysis of the algorithm to obtain the $o(1/n)$ rate.
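The voting idea can be caricatured as follows: a single learner polls the version space of each partial concept class and queries a point only when enough classes report disagreement on it. This toy sketch is our own illustration with made-up names, and it substitutes a simple disagreement vote for the paper's shattering-probability estimates; it is only meant to show the mechanism of one query decision serving all classes.

```python
def majority_query_rule(classes, labeled_sample, x, threshold=0.5):
    """Decide whether to spend a label query on x by voting over several
    (partial) concept classes. Each class votes 'query' when its version
    space -- the hypotheses consistent with the labeled sample wherever
    they are defined (None = undefined) -- still disagrees on x."""
    votes = 0
    for cls in classes:
        version_space = [
            h for h in cls
            if all(h(xi) in (yi, None) for xi, yi in labeled_sample)
        ]
        predictions = {h(x) for h in version_space} - {None}
        if len(predictions) > 1:  # this class's version space disagrees on x
            votes += 1
    return votes > threshold * len(classes)
```

In the actual algorithm the per-class signal is an estimate of the probability that a random tuple is shattered by a constrained version space, averaged over a selected subset of classes, rather than this raw disagreement check.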
Comparing our work to the general interactive learning setting considered by Hanneke et al. (2022), the main differences are that (i) we design a new algorithm that uses only label queries and achieves exponential rates when $\mathcal{H}$ does not have an infinite star tree, a combinatorial measure we introduce in this work, (ii) we prove an $o(1/n)$ lower bound in the setting where $\mathcal{H}$ has an infinite star tree, and (iii) we propose a novel active learning algorithm that achieves sublinear rates when $\mathcal{H}$ does not have an infinite VCL tree.

2 Arbitrarily Fast Rates

As we explained before, the first case of the tetrachotomy in our characterization is a direct implication of the results of Hanneke et al. (2022). To be more precise, they design an algorithm which achieves arbitrarily fast rates using only label queries. In order to do that, they need the number of unlabeled points $u(n)$ to be an arbitrarily fast increasing function that, nevertheless, can be specified in a non-adaptive manner prior to the execution of the algorithm. The result is summarized in Theorem 2.1.

Theorem 2.1 (Hanneke et al. (2022)). If $\mathcal{H}$ does not have an infinite Littlestone tree, it is actively learnable at arbitrarily fast rates.

For completeness, we present their algorithm in Figure 1. Similarly to the upper bound, the lower bound is an immediate consequence of a result from Hanneke et al. (2022), since the learner can submit more informative queries in their model.

Theorem 2.2 (Hanneke et al. (2022)). If $\mathcal{H}$ has an infinite Littlestone tree, then $\mathcal{H}$ is not actively learnable at rate faster than exponential $e^{-n}$. This holds even if $P_X$ is known to the learner.

3 Exponentially Fast Rates

In this section we prove the second case in the tetrachotomy we have stated, i.e., that $\mathcal{H}$ is learnable at an exponential rate if and only if it has an infinite Littlestone tree and does not have an infinite star tree. Our proof consists of two parts.
First, we show that if $\mathcal{H}$ does not have an infinite star tree, it is learnable at an exponentially fast rate. Then, we show that whenever $\mathcal{H}$ has an infinite star tree, the best achievable rate cannot exceed $o(1/n)$. The omitted details can be found in Appendix C.

3.1 Exponential Rates Algorithm: High-Level Overview

The high-level approach behind the exponential rates algorithm follows the same spirit as the approaches of Bousquet et al. (2021); Hanneke et al. (2022). First, we design an appropriate Gale-Stewart game (cf. Appendix A.2), i.e., a game between a learner and an adversary, associated with $\mathcal{H}$, in which the learner has a winning strategy if and only if $\mathcal{H}$ does not have an infinite star tree. The next step is to show that, in the limit, the winning strategy of the learner gives rise to a partial concept class $\mathcal{F}$ that has finite star number (see Definition A.7). Since this result is asymptotic, our approach is to consider several instances of this game, execute them for a finite number of steps, and obtain a partial concept class from each one. Then, we aggregate these classes into a majority class. The intuition is that, with high probability, most of the games will have induced classes whose star number is bounded by a distribution-dependent constant; we can then show that the majority class also has bounded star number, and that this bound is distribution-dependent. Thus, our task boils down to actively learning a partial concept class whose star number is finite. In order to do that, we extend the approach that Hanneke and Yang (2015) used for total concept classes to the regime of partial classes. We believe that this result could be of independent interest.

3.2 The Star Tree Game

We first outline the Gale-Stewart game we use. Recall that every node of a star tree at depth $n$ consists of $n+1$ points along with their labels.
There are $n+1$ edges that connect the node with its children, and the label of every edge indicates the point whose label is flipped along any path that uses this edge. Let us now describe the game $\mathcal{G}$ that we use in this setting. In every round $\tau \ge 1$ we have the following interaction between the learner $P_L$ and the adversary $P_A$:

- Player $P_A$ chooses points $(\bar\xi_\tau, \bar\zeta_\tau) = (\xi^0_\tau, \dots, \xi^{\tau-1}_\tau, \zeta^0_\tau, \dots, \zeta^{\tau-1}_\tau) \in X^\tau \times \{0,1\}^\tau$.
- Player $P_L$ chooses $\eta_\tau \in \{0, \dots, \tau-1\}$.
- Player $P_L$ wins the game in round $\tau$ if
$$\mathcal{H}_{\bar\xi_1, \bar\zeta_1, \eta_1, \dots, \bar\xi_\tau, \bar\zeta_\tau, \eta_\tau} := \left\{ h \in \mathcal{H} : \begin{array}{ll} h(\xi^i_s) = \zeta^i_s, & \text{if } \eta_s \neq i \\ h(\xi^i_s) = 1 - \zeta^i_s, & \text{if } \eta_s = i \end{array}, \ 1 \le s \le \tau, \ 0 \le i < s \right\} = \emptyset.$$

It is easy to see that the winning condition for $P_L$ is finitely decidable (cf. Appendix A.2), hence $\mathcal{G}$ is a Gale-Stewart game. Recall that this means exactly one of the adversary and the learner has a winning strategy. Using a result regarding the measurability of winning strategies in Gale-Stewart games that was shown by Bousquet et al. (2021) (see Theorem A.3), we can prove the following connection between $\mathcal{G}$ and the existence of infinite star trees (see Appendix C.1 for the proof).

Lemma 3.1. The class $\mathcal{H}$ does not have an infinite star tree if and only if $P_L$ has a universally measurable winning strategy in $\mathcal{G}$.

The first step in our approach is to use some $\Theta(n)$ label queries in order to obtain the labels of $\Theta(n)$ many points. The idea is to use these labeled points to simulate the Gale-Stewart game we described above (see Appendix C.3 for the details). The main technical issue we need to handle is that we have no control over the number of rounds the game needs in order to terminate. Using ideas that have appeared in the universal learning literature, we use a portion of these labeled points to estimate a number $\hat t_n$ so that, with at least some constant probability over the dataset, the game will terminate within $\hat t_n$ many rounds. Then, we split the remainder of the labeled dataset into batches of size $\hat t_n$ and run the game on each batch.
The outcome of each game gives us a pattern-avoidance function, i.e., a function that takes as input a tuple of arbitrarily labeled points and changes the label of one of them so that the resulting labeled tuple is not consistent with the data-generating distribution $P$. In other words, the output of this function could not have been generated by $P$. A technical complication we need to handle is that we obtain multiple such pattern-avoidance functions, some of which are incorrect; to make the presentation cleaner we explain the idea using a single pattern-avoidance function $\tilde g_{t^\star}$ that produces inconsistent labels. The formal setting is handled in Appendix C.4. One way to think about this function is that it provides data-dependent constraints. Thus, it is natural to express such a constraint through a partial concept class. We define
$$\mathcal{F} = \left\{ f : X \to \{0, 1, \star\} \,:\, \big((x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_{t^\star}, f(x_{t^\star}))\big) \notin \mathrm{image}(\tilde g_{t^\star}), \ \forall (x_1, \dots, x_{t^\star}) \in X^{t^\star} \right\}.$$
Notice that the constraint we have placed on $\mathcal{F}$ is satisfied if $f(x_i) = \star$ for some $i \in [t^\star]$. We consider the natural extension of the notion of star sets to the case of partial concept classes, i.e., we say that a labeled set $S$ with labels in $\{0,1\}$ is a star set if $S$ and its adjacent sets $S'$, whose labels are still restricted to be in $\{0,1\}$, are obtainable using functions from $\mathcal{F}$ (see Definition A.7). A key observation is that the star number of $\mathcal{F}$ is bounded by $t^\star - 1$. Another difficulty we need to overcome is that we do not know $t^\star$, since it is a random variable that depends on the realized sequence, and its distribution might have heavy tails. To make our approach easier to follow, let us first assume that we do know $t^\star$. Then, our task boils down to actively learning a partial concept class.

3.3 Active Learning of Partial Concept Classes with Finite Star Number

We now present an algorithm that achieves exponential rates when actively learning a partial concept class $\mathcal{F}$ that has star number $s < \infty$.
The idea of our approach is to reduce the problem of actively learning a partial concept class to the well-studied problem of actively learning a total concept class. The algorithm is presented in Figure 2. Let us explain its high-level ideas. First, we consider a large enough set of unlabeled data. Our goal is to find their labels using logarithmically many queries. To do that, we consider the uniform distribution over these unlabeled examples. Then, we use an algorithm from Hanneke and Yang (2015) (see Theorem C.1) which guarantees exponential rates when applied to a class with finite star number. Because the underlying distribution on the sample is uniform, with high probability the algorithm will find the correct labels of all the points. Finally, we feed these labeled examples to the one-inclusion graph algorithm (Theorem A.5) to get the desired result. We are now ready to state our theorem. The proof is postponed to Appendix C.2.

Theorem 3.2. There exists an active learning algorithm $\mathcal{A}$ for a partial concept class $\mathcal{F}$ which, given a label budget $n$ and access to unlabeled samples from a realizable distribution $P$, returns a classifier $\hat h_n$ such that $\mathbb{E}_P[\mathrm{er}(\hat h_n)] \le c_1 d\, e^{-c_2 n / s}$, where $c_1, c_2$ are absolute numerical constants, and $s$ and $d$ are the star number and VC dimension of $\mathcal{F}$, respectively.

3.4 Slower than Exponential is Sublinear

The next step in the characterization is to show that if $\mathcal{H}$ has an infinite star tree, then it does not admit rates faster than sublinear. The proof starts by picking a random path on the infinite star tree. The target distribution is supported only on nodes of the selected path. Then, given some algorithm $\hat h_n$, we distribute the mass of the target probability distribution across the path, potentially skipping some of its nodes, in a way that creates an infinite sequence $n_{i_1}, n_{i_2}, \dots$, so that when the learner has label budget $n_{i_j}$, with some constant probability, it will only observe unlabeled points up to level $k_{i_j}$.
Moreover, with at least some constant probability, it does not query the point of that level whose label is flipped along the target path. On that event, it makes a mistake with probability at least C·p_{i_j}/k_{i_j}, where C is some absolute constant. Our choice of p_{i_j}, k_{i_j} guarantees that R(n_{i_j}) > p_{i_j}/k_{i_j}, where R(·) is the target sublinear rate function. Finally, we apply Fatou's lemma to get the desired result. For the full proof and the formal theorem statement, we refer the reader to Appendix C.5.

4 Sublinear Rates

Our approach to achieving sublinear rates in the setting where H does not have an infinite VCL tree shares some high-level ideas with the one in Section 3, but many technical challenges make it significantly more involved. The main obstacle is that there is no active learning algorithm that achieves sublinear rates for VC classes uniformly over all realizable distributions. Recall that in the exponential rates setting, such an algorithm does exist (Hanneke and Yang, 2015). Instead, the sublinear rates algorithm for VC classes from Hanneke (2012) depends on distribution-dependent constants in the sample complexity. The omitted details from this section can be found in Appendix D. Instead of the star tree Gale-Stewart game that was used to get the exponential rates guarantee, we use the VCL Gale-Stewart game of Bousquet et al. (2021) (cf. Figure 6). To be more precise, we use n/5 of the label budget to get the labels of n/5 unlabeled points that come i.i.d. from P_X. Then, we execute the VCL game on Θ(√n) different batches of size Θ(√n). Each of these games induces a partial concept class. We show how to obtain a P-dependent bound on the VC dimension that holds for most of these classes. Moreover, we show that for most of these classes the data-generating distribution P is realizable.

¹In Appendix C.4 we show how to obtain an estimate.
Finally, we design a single active learning algorithm that combines information from all these classes and achieves sublinear learning rates. This algorithm builds upon Hanneke (2012) but is modified to work with partial concept classes instead of total concept classes. This requires a very different analysis and is the most technically involved part of our work. Let us now explain the main ideas of this algorithm and the challenges behind it. As we mentioned before, the number of queries that the algorithm from Hanneke (2012) needs in order to achieve the sublinear error rate depends on the underlying data-generating distribution. Thus, we cannot simply take a large enough number of unlabeled samples, consider the uniform distribution over them, and run the algorithm on this distribution: the learning rate for H would then depend on the uniform distribution U_S over the sample S and not on the data-generating distribution P. To illustrate the ideas of the algorithm, we consider five different streams of i.i.d. (unlabeled) data S₁, S₂, S₃, S₄, S₅. For the purposes of the subsequent discussion, we can imagine that these streams have infinite size, but as explained in the description of the algorithm, we only need poly(n) unlabeled points. Let us first explain the use of S₅. We use n/5 of our query budget to obtain the labels of its first n/5 points and then we use them to train a supervised learning algorithm from Bousquet et al. (2021) that achieves linear rate O(1/n). This is used for technical reasons in our analysis, in order to ensure that the classifier we output has at most linear error rate no matter how the active learning component of our algorithm behaves. Next, we use n/5 of the query budget to obtain the labels of the first n/5 points from S₁. Then, we run the VCL game separately on Θ(√n) batches of these labeled examples and obtain √n different pattern-avoidance functions ŷ^i_{n/5}, each taking as input ℓ^i_{n/5} points (cf. Appendix D.2), where i ∈ [√n].
Let

F^i_{n/5} := { f : X → {0, 1, ⋆} : (f(x₁), ..., f(x_{ℓ^i_{n/5}})) ≠ ŷ^i_{n/5}(x₁, ..., x_{ℓ^i_{n/5}}), ∀(x₁, ..., x_{ℓ^i_{n/5}}) ∈ X^{ℓ^i_{n/5}} }, i ∈ [√n].

First, we show that for a (1 − o(1))-fraction of these partial concept classes, P is a realizable distribution. Intuitively, this means that the partial concept class we obtain by running the VCL game on √n many points is the same as the class we would have obtained if we were to run the game on infinitely many points (cf. Lemma D.4). Next, we need to estimate some number d̂_n ∈ ℕ which, as n → ∞, converges to the 9/10-quantile d* of the distribution of the VC dimension of the partial concept classes that are obtained by running the VCL game on infinitely many samples. We let

d̂_n := min{ d ∈ ℕ : ∃ i₁ < i₂ < ... < i_{(9/10)√n} ∈ [√n] such that d(F^{i_j}_{n/5}) ≤ d, ∀j ∈ [(9/10)√n] },

where d(F) denotes the VC dimension of the class F. Lemma D.5 shows that, for large enough n, d̂_n = d* with high probability, where d* ∈ ℕ is such that, with probability at least 9/10 over the random draw² of the partial class, its VC dimension is at most d*. One technical complication we need to handle is that the concept classes we have obtained are estimated from a game on √n many points instead of infinitely many points, so an o(1)-fraction of them do not correspond to samples from the correct distribution. To handle this, we use a robust version of the well-known Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (cf. Theorem D.1) for estimating the CDF. Next, we use another n/5 of the query budget to obtain the labels of the first n/5 points of S₂. We will make two distinctions regarding this stream. For the purposes of the analysis, we consider a fixed stream of infinitely many i.i.d. samples from P and denote it by S₂^∞, but in the actual algorithm we use a dataset of size Θ(n), which we denote by S₂^n. We define

V^i_{n/5} := { f ∈ F^i_{n/5} : f(x) = y for the first n/5 points (x, y) in S₂^∞ }

to be the version space of F^i_{n/5} defined on the first n/5 examples of S₂^∞.
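The estimator d̂_n is, in essence, a quantile estimate computed from √n noisy draws of a VC dimension, made robust through the DKW deviation band. The following self-contained sketch uses hypothetical toy numbers and is not the paper's exact estimator:

```python
import math

def dkw_upper_quantile(samples, q=0.9, delta=0.05):
    """Smallest value d whose empirical CDF, after subtracting the DKW
    band eps = sqrt(log(2/delta) / (2m)), still certifies CDF(d) >= q;
    with probability at least 1 - delta the true q-quantile is <= d."""
    m = len(samples)
    eps = math.sqrt(math.log(2 / delta) / (2 * m))
    xs = sorted(samples)
    for i, d in enumerate(xs):
        if (i + 1) / m - eps >= q:   # empirical CDF minus DKW band
            return d
    return xs[-1]                    # fall back to the largest draw

# Toy draws of a "VC dimension": 95% of the classes have dimension 3,
# a small corrupted fraction (games that have not converged) reports 10.
draws = [3] * 950 + [10] * 50
d_hat = dkw_upper_quantile(draws, q=0.9)
```

The o(1)-fraction of classes coming from unconverged games plays the role of the corrupted draws here; the DKW band absorbs both the sampling error and that corruption, which is the purpose of the robust DKW variant (Theorem D.1) in the paper.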
Moreover, we let V_{d*,n/5} be a random version space sampled from the following process: we run the VCL game on an infinite stream of labeled data from P to get a pattern-avoidance function ỹ, and then we define the partial concept class F̃ in the same way as before. If the VC dimension of this class is greater than d*, we discard it and restart the process. Otherwise, we let V_{d*,n/5} be the version space of F̃ on the first n/5 labeled points of S₂^∞. We take d* to be the 9/10-quantile of the VC dimension distribution described before (cf. Lemma D.5). Given some k ∈ ℕ, let

p_{n,k} := E[ P^k(x₁, ..., x_k VC shattered by V_{d*,n/5}) | S₂^∞ ],

where the expectation is over the draw of V_{d*,n/5}, given the fixed S₂^∞. Let k* ∈ ℕ be the largest number such that lim_{n→∞} p_{n,k*} ≠ 0. Notice that since, by definition, the VC dimension is bounded by d*, such a number k* exists. From here on, we only consider the version spaces V^i_{n/5} that are obtained from some partial class with VC dimension at most d̂_n. Let

p_{n,k,i} := P^k(x₁, ..., x_k VC shattered by V^i_{n/5}).

Notice that for every k ≤ d̂_n and every i ∈ [√n] we can estimate this quantity to arbitrary precision using only unlabeled examples. We denote these estimates by p̂_{n,k,i}. We now consider the first n² unlabeled points of the third data stream S₃. For each such point X, let

p^X_{n,k} := E[ P^k(x₁, ..., x_k, X VC shattered by V_{d*,n/5}) | S₂^∞, X ].

Moreover, for each y ∈ {0, 1}, let

V^{(X,y)}_{d*,n/5} := { f ∈ V_{d*,n/5} : f(X) = y },
p^{(X,y)}_{n,k} := E[ P^k(x₁, ..., x_k VC shattered by V^{(X,y)}_{d*,n/5}) | S₂^∞, X ], y ∈ {0, 1}.

We define the quantities p^X_{n,k,i}, p^{(X,y)}_{n,k,i} in the same way for the realized version spaces. Again, by Hoeffding's bound, we can estimate these quantities to arbitrary precision using unlabeled data. Similarly as before, we denote these estimates by p̂^X_{n,k,i}, p̂^{(X,y)}_{n,k,i}.

²This random draw is induced by an execution of the VCL game on infinitely many i.i.d. points from P.
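Quantities such as p_{n,k,i} are estimated by plain Monte Carlo over unlabeled draws, with Hoeffding's inequality controlling the precision. Here is a toy sketch for k = 1 and a version space of 1-D thresholds; every name is illustrative (the paper works with general partial concept classes):

```python
import random

def is_shattered(x, a, b):
    """For the version space {f_t : t in [a, b]} of thresholds on [0, 1]
    (f_t(x) = 1 iff x >= t), a single point x is shattered exactly when
    both labels are achievable, i.e. when a <= x < b."""
    return a <= x < b

def mc_shattering_prob(a, b, n_samples, rng):
    # Monte Carlo estimate using unlabeled draws only; by Hoeffding the
    # error is O(sqrt(log(1/delta) / n_samples)) with prob. 1 - delta.
    hits = sum(is_shattered(rng.random(), a, b) for _ in range(n_samples))
    return hits / n_samples

rng = random.Random(0)
p_hat = mc_shattering_prob(0.3, 0.5, 200_000, rng)   # true value is 0.2
```

The point of the sketch is that no labels are consumed: shattering is a property of the version space and the marginal distribution alone, which is why the paper can afford arbitrary precision for these estimates within the unlabeled budget.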
The idea is to make use of Lemma D.4 and show that the classes obtained from a VCL game that has not converged only affect our estimates by some o(1) term. This is formalized in Proposition D.6. Thus, we can use p̂_{n,k,i}, p̂^X_{n,k,i}, p̂^{(X,y)}_{n,k,i} in order to estimate p_{n,k}, p^X_{n,k}, p^{(X,y)}_{n,k}. We denote these estimates by p̂_{n,k}, p̂^X_{n,k}, p̂^{(X,y)}_{n,k}. These are the key quantities we use to infer the labels of unlabeled points. Our algorithm tries to infer the label of each unlabeled point X ∈ S₃ (cf. Figure 7) in the following way: if p̂^X_{n,k} ≥ p̂_{n,k}/2, then we query the label of X; otherwise, we infer the label to be argmax_{y ∈ {0,1}} p̂^{(X,y)}_{n,k}. Lemma D.7 shows that the inferences are correct when n is large enough. The main remaining ingredient of the proof is to show that the number of label queries we submit is sublinear in n. For that, it is sufficient to show that the probability that we query the label of a point is o(1). Lemma D.8 shows that when k = k*, this is indeed the case. Finally, since we do not know the true value of k*, we run the algorithm for every k ≤ d̂_n. The active learning component of the algorithm gives us d̂_n + 1 different labeled datasets, which we use to train d̂_n + 1 instances of a supervised learning algorithm, such as the one from Bousquet et al. (2021). Our analysis so far has shown that, for sufficiently large n, at least one of these datasets will be correctly labeled and of size ω(n). Thus, since the supervised learning algorithm achieves a rate linear in the size of its training set, at least one of these executions gives a classifier that has error o(1/n). The last step is to run a tournament among the d̂_n + 2 different classifiers³ to choose the best one. This is handled by Lemma D.9 (Hanneke, 2012). The main result of this section (cf. Theorem D.10) follows as a corollary of the results we have discussed. All the steps are summarized in Figure 7. Lastly, a lower bound from Hanneke et al.
(2022) completes our characterization (cf. Theorem D.11).

5 Conclusion

In this work we have provided a complete characterization of the optimal learning rates in active learning. It is an open question whether it also holds when the learner knows the distribution P_X.

Acknowledgments

Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N0001419-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Shay Moran is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the European Union (ERC, GENERALIZATION, 101039692). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. Grigoris Velegkas was supported in part by the AI Institute for Learning-Enabled Optimization at Scale (TILOS).

³Recall that we have one more classifier from S₅.

References

N. Alon, S. Hanneke, R. Holzman, and S. Moran. 2021. A Theory of PAC Learnability of Partial Concept Classes. In Proceedings of the 62nd Annual Symposium on Foundations of Computer Science.
Idan Attias, Steve Hanneke, Alkis Kalavasis, Amin Karbasi, and Grigoris Velegkas. 2024. Universal Rates for Regression: Separations between Cut-Off and Absolute Loss. In The Thirty Seventh Annual Conference on Learning Theory. PMLR, 359–405.
M.-F. Balcan, A. Beygelzimer, and J. Langford. 2006. Agnostic Active Learning. In Proceedings of the 23rd International Conference on Machine Learning.
M.-F. Balcan, A. Beygelzimer, and J. Langford. 2009. Agnostic Active Learning. J. Comput. System Sci. 75, 1 (2009), 78–89.
M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. 2010. The True Sample Complexity of Active Learning.
Machine Learning 80, 2–3 (2010), 111–139.
A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. 2016. Search Improves Label for Active Learning. In Advances in Neural Information Processing Systems 29.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. 1989. Learnability and the Vapnik–Chervonenkis Dimension. Journal of the Association for Computing Machinery 36, 4 (1989), 929–965.
Olivier Bousquet, Steve Hanneke, Shay Moran, Jonathan Shafer, and Ilya Tolstikhin. 2022. Fine-Grained Distribution-Dependent Learning Curves. arXiv preprint arXiv:2208.14615 (2022).
O. Bousquet, S. Hanneke, S. Moran, R. van Handel, and A. Yehudayoff. 2021. A Theory of Universal Learning. In Proceedings of the 53rd Annual ACM Symposium on the Theory of Computing.
Tsun-Ming Cheung, Hamed Hatami, Pooya Hatami, and Kaave Hosseini. 2023. Online Learning and Disambiguations of Partial Concept Classes. arXiv preprint arXiv:2303.17578 (2023).
D. Cohn, L. Atlas, and R. Ladner. 1994. Improving Generalization with Active Learning. Machine Learning 15, 2 (1994), 201–221.
Donald L. Cohn. 2013. Measure Theory. Vol. 1. Springer.
S. Dasgupta. 2005. Coarse Sample Complexity Bounds for Active Learning. In Advances in Neural Information Processing Systems 18.
S. Dasgupta, D. Hsu, and C. Monteleoni. 2007. A General Agnostic Active Learning Algorithm. In Advances in Neural Information Processing Systems 20.
David Gale and Frank M. Stewart. 1953. Infinite games with perfect information. Contributions to the Theory of Games 2 (1953), 245–266.
S. Hanneke. 2007a. A Bound on the Label Complexity of Agnostic Active Learning. In Proceedings of the 24th International Conference on Machine Learning.
S. Hanneke. 2007b. Teaching Dimension and the Complexity of Active Learning. In Proceedings of the 20th Conference on Learning Theory.
S. Hanneke. 2009. Theoretical Foundations of Active Learning. Ph.D. Dissertation. Machine Learning Department, School of Computer Science, Carnegie Mellon University.
S.
Hanneke. 2012. Activized Learning: Transforming Passive to Active with Improved Label Complexity. Journal of Machine Learning Research 13, 5 (2012), 1469–1587.
S. Hanneke. 2014. Theory of Disagreement-Based Active Learning. Foundations and Trends in Machine Learning 7, 2–3 (2014), 131–309.
Steve Hanneke, Amin Karbasi, Shay Moran, and Grigoris Velegkas. 2022. Universal Rates for Interactive Learning. In Advances in Neural Information Processing Systems.
Steve Hanneke, Shay Moran, and Qian Zhang. 2023. Universal Rates for Multiclass Learning. In The Thirty Sixth Annual Conference on Learning Theory. PMLR, 5615–5681.
S. Hanneke and L. Yang. 2015. Minimax Analysis of Active Learning. Journal of Machine Learning Research 16, 12 (2015), 3487–3602.
D. Haussler, N. Littlestone, and M. Warmuth. 1994. Predicting {0, 1}-Functions on Randomly Drawn Points. Information and Computation 115, 2 (1994), 248–292.
Wilfrid Hodges. 1993. Model Theory. Cambridge University Press.
Alkis Kalavasis, Grigoris Velegkas, and Amin Karbasi. 2022. Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes. arXiv preprint arXiv:2210.02297 (2022).
Alexander Kechris. 2012. Classical Descriptive Set Theory. Vol. 156. Springer Science & Business Media.
N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2 (1988), 285–318.
V. Vapnik and A. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 2 (1971), 264–280.
Y. Wiener, S. Hanneke, and R. El-Yaniv. 2015. A Compression Technique for Analyzing Disagreement-Based Active Learning. Journal of Machine Learning Research 16, 4 (2015), 713–745.
L. Yang and S. Hanneke. 2013. Activized Learning with Uniform Classification Noise. In Proceedings of the 30th International Conference on Machine Learning.
A Omitted Details from Section 1

A.1 Formal Definition of Learning Setting

We recall some basic notions of measures and probabilities on Polish spaces. For a detailed treatment, the reader is referred to Kechris (2012); Cohn (2013). Our presentation follows Bousquet et al. (2021).

Polish Spaces. A Polish space is a separable topological space that admits a complete metric. For example, this category includes ℝ^n, any compact metric space, any separable Banach space, etc.

Universally Measurable Functions. Let F be the Borel σ-field on some Polish space X and let µ be a probability measure. We denote by F_µ the completion of F under µ, i.e., the collection of all subsets of X that differ from a Borel set on a set of zero measure. A set B ⊆ X is called universally measurable if B ∈ F_µ for every probability measure µ. Moreover, a function f : X → Y is called universally measurable if f⁻¹(B) is universally measurable for any universally measurable set B. An important property is that universally measurable sets and functions on Polish spaces behave the same as Borel sets and functions, from a probabilistic point of view. We are now ready to provide the definition of measurability of a concept class H.

Definition A.1 (Measurability of H). Let X be a Polish space. We say that a concept class H ⊆ {0, 1}^X is measurable if there is a Polish space Θ and a Borel-measurable map h : X × Θ → {0, 1} so that H = {h(·, θ) : θ ∈ Θ}.

We underline that Definition A.1 is very general and its only requirement is that H can be parameterized in some reasonable way.

A.2 Omitted Preliminaries

Gale-Stewart Games. We briefly discuss some basic and useful facts about Gale-Stewart games, a concept that was recently introduced to the learning theory community by Bousquet et al. (2021). Our discussion follows Bousquet et al. (2021); Hanneke et al. (2022). We refer to Bousquet et al. (2021) and the references therein for a more detailed presentation. Let us fix sequences of sets X_t, Y_t for t ≥ 1.
We consider infinite games between two players, a learner P_L and an adversary P_A, where in each round t ≥ 1, P_A selects an element x_t ∈ X_t, and then P_L selects an element y_t ∈ Y_t. The rules of the game are defined by a set W ⊆ ∏_{t ≥ 1}(X_t × Y_t) of winning sequences for P_L. This means that after an infinite sequence of consecutive plays x₁, y₁, x₂, y₂, ..., we say that P_L wins if (x₁, y₁, x₂, y₂, ...) ∈ W; otherwise the winner is P_A. A strategy is a rule used by each of the players to determine their next move, given the current state and the history of the game. Formally, a strategy for P_A is a sequence of functions f_t : ∏_{s < t}(X_s × Y_s) → X_t, and strategies for P_L are defined analogously.

Partial Concept Classes. Many natural data-dependent constraints can be expressed through partial concepts. For instance, linear classification in ℝ^d with margin γ > 0 can be formulated as a partial class: we say that a sample (x₁, y₁), ..., (x_n, y_n) ∈ ℝ^d × {0, 1} is (R, γ)-separable if all the points x₁, ..., x_n lie in a (Euclidean) ball of radius R, the 0-labeled examples and the 1-labeled examples are linearly separable, and the (Euclidean) distance between the 0-labeled examples and the 1-labeled examples is at least 2γ. Then, the class

F_{R,γ} = { f : ℝ^d → {0, 1, ⋆} : ∀(x₁, ..., x_n) ∈ supp(f), ((x₁, f(x₁)), ..., (x_n, f(x_n))) is (R, γ)-separable },

where supp(f) is the set of all points x where f(x) ≠ ⋆, expresses the set of functions that satisfy these constraints. Remarkably, the VC dimension of F_{R,γ} is bounded by O(R²/γ²) (Alon et al., 2021). Perhaps surprisingly, even though the PAC learnability of partial classes is characterized by the VC dimension (as is the case with total classes), the ERM algorithm provably fails to learn partial classes. Thus, the algorithmic landscape is much richer and more complicated compared to total classes. Moreover, there is no algorithmic way to extend a partial concept class to a total concept class without significantly increasing its VC dimension. For details, we refer to Alon et al. (2021).

One-Inclusion Graph Algorithm. We state formally the guarantees of the one-inclusion graph algorithm (Haussler et al., 1994).

Theorem A.5 (One-Inclusion Graph Algorithm (Haussler et al., 1994)).
For any (total) concept class H whose VC dimension is bounded by d < ∞, there is an algorithm A : (X × {0, 1})* × X → {0, 1} such that for any n ∈ ℕ and any sequence (x₁, y₁), ..., (x_n, y_n) ∈ (X × {0, 1})^n that is realizable w.r.t. H,

(1/n!) Σ_{σ ∈ Sym(n)} 1{A(x_{σ(1)}, y_{σ(1)}, ..., x_{σ(n−1)}, y_{σ(n−1)}, x_{σ(n)}) ≠ y_{σ(n)}} ≤ d/n,

where Sym(n) denotes the symmetric group of permutations of {1, ..., n}.

In particular, Theorem A.5 immediately implies that if (x₁, y₁), ..., (x_n, y_n) are i.i.d. from P, then the classifier ĥ_n(·) := A(x₁, y₁, ..., x_n, y_n, ·) has E[er(ĥ_n)] ≤ d/(n+1). We remark that Alon et al. (2021) showed that this result also holds for partial concept classes.

A.3 Omitted Definitions

Definition A.6 (VC Dimension of Partial Concept Classes (Alon et al., 2021)). For a partial concept class F ⊆ {0, 1, ⋆}^X, the VC dimension of F is defined to be the largest number d ∈ ℕ such that there exists (x₁, ..., x_d) ∈ X^d with {(f(x₁), ..., f(x_d)) : f ∈ F} ⊇ {0, 1}^d. Such a sequence (x₁, ..., x_d) is said to be shattered by F. If there is no bound on d, we say that the VC dimension is ∞.

The following definition of the star number is an adaptation of the definition in Hanneke and Yang (2015).

Definition A.7 (Star Number of Partial Concept Classes). For a partial concept class F ⊆ {0, 1, ⋆}^X, the star number of F is defined to be the largest number s ∈ ℕ such that there exist (x₁, ..., x_s) ∈ X^s, f₀ ∈ F, and, for every i ∈ [s], f_i ∈ F with f_i(x_i) = 1 − f₀(x_i) and f_i(x_j) = f₀(x_j) ≠ ⋆ for all j ∈ [s] \ {i}. If there is no bound on s, we say that the star number is ∞.

Definition A.8 (Littlestone Tree (Bousquet et al., 2021)). A Littlestone tree for H ⊆ {0, 1}^X is a complete binary tree of depth d whose internal nodes are labeled by X, and whose two edges connecting a node to its children are labeled by 0 and 1, such that every path of length at most d emanating from the root is consistent with a concept h ∈ H. Formally, a Littlestone tree is a collection

{ x_u : u ∈ {0, 1}^ℓ, 0 ≤ ℓ < d } = {x_∅} ∪ {x₀, x₁} ∪ {x₀₀, x₀₁, x₁₀, x₁₁} ∪ ...
such that for every path y ∈ {0, 1}^d and every n < d, there exists h ∈ H so that h(x_{y≤ℓ}) = y_{ℓ+1} for all 0 ≤ ℓ ≤ n. We say that H has an infinite Littlestone tree if there is a Littlestone tree for H of depth d = ∞.

Definition A.9 (Star Tree). A star tree for H ⊆ {0, 1}^X of depth d consists of a collection

{ (x_u, z_u) ∈ X^{ℓ+1} × {0, 1}^{ℓ+1} : u ∈ {0} × {0, 1} × ... × {0, 1, ..., ℓ − 1}, 0 ≤ ℓ < d }

such that for every level n < d and every y ∈ {0} × {0, 1} × ... × {0, 1, ..., n − 1}, there exists some h ∈ H such that h(x^i_{y≤k}) = f(z^i_{y≤k}, y_{k+1}) for all 0 ≤ i ≤ k and 0 ≤ k ≤ n, where we denote y≤k = (y₁, y₂, ..., y_k), x_{y≤k} = (x⁰_{y≤k}, ..., x^k_{y≤k}), z_{y≤k} = (z⁰_{y≤k}, ..., z^k_{y≤k}), and

f(z^i_{y≤k}, y_{k+1}) = z^i_{y≤k} if y_{k+1} ≠ i, and 1 − z^i_{y≤k} otherwise.

We say that H has an infinite star tree if it has a star tree of depth d = ∞.

Definition A.10 (VCL Tree (Bousquet et al., 2021)). A Vapnik-Chervonenkis-Littlestone (VCL) tree for H ⊆ {0, 1}^X of depth d consists of a collection

{ x_u ∈ X^{ℓ+1} : u ∈ {0, 1}¹ × {0, 1}² × ... × {0, 1}^ℓ, 0 ≤ ℓ < d }.

In the process of Figure 5, if (x_{t−τ_{t−1}+1}, y_{t−τ_{t−1}+1}, ..., x_t, y_t) ∉ image(g̃_{t−1}), then τ_t = τ_{t−1} and g̃_t = g̃_{t−1}. Lemma C.2 shows that, with probability one, the complementary event occurs only finitely many times.

Proof. Suppose that (x_{t−τ_{t−1}+1}, ..., x_t, y_{t−τ_{t−1}+1}, ..., y_t) ∈ image(g̃_{t−1}) happens for an infinite sequence of times t₁, t₂, .... Since η_{τ_t} is a winning strategy for the learner in G, we have that for some finite t it holds that H_{ξ₁,ζ₁,η₁,...,ξ_t,ζ_t,η_t} = ∅ (see Equation (1)), where ξ_i, ζ_i, η_i are defined in Figure 5. Hence, we have arrived at a contradiction.

Similarly as in Bousquet et al. (2021), we remark the following about the measurability of the functions.

Remark C.3 (Adapted from Bousquet et al. (2021)). The strategy τ_t depends in a universally measurable way on x_{≤t}, y_{≤t}. Moreover, the functions g̃_t are universally measurable with respect to x_{≤t}, y_{≤t} and the arguments of their input. To be more precise, for all t ≥ 0, there exist universally measurable functions T_t : (X × {0, 1})^t → {1, ..., t + 1} and G̃_t that realize τ_t and g̃_t as functions of the history.

Lemma C.4. It holds that P[per_{τ_t}(g̃_t) > 0] → 0 as t → ∞.

Proof. We know that P is realizable, so we can pick a sequence of hypotheses h_k ∈ H such that er(h_k) ≤ 2^{−k} for all k ∈ ℕ.
Taking a union bound over all s ≤ t, we have that

Σ_{k ∈ ℕ} P[h_k(X_s) ≠ Y_s for some s ≤ t] ≤ t Σ_{k ∈ ℕ} er(h_k) < ∞.

Thus, by the Borel-Cantelli lemma, with probability one it holds that h(X_s) = Y_s for all s ≤ t, for some h ∈ H. Let

T = sup{ s ≥ 1 : (X_{s−τ_{s−1}+1}, Y_{s−τ_{s−1}+1}, ..., X_s, Y_s) ∈ image(g̃_{s−1}) }.

Then, we know from Lemma C.2 that T is finite with probability one. Recall the law of large numbers for m-dependent sequences: if Z₁, Z₂, ... is an i.i.d. sequence of random variables and f is a bounded measurable function of m variables, then

lim_{n→∞} (1/n) Σ_{i=1}^{n} f(Z_{i+1}, ..., Z_{i+m}) = E[f(Z₁, ..., Z_m)] almost surely,

which follows by splitting the sum into m interleaved sums of i.i.d. terms. Thus, we have that

P[per_{τ_t}(g̃_t) = 0] = P[ Σ_{s=t+1}^{∞} 1{(X_s, Y_s, ..., X_{s+τ_t−1}, Y_{s+τ_t−1}) ∈ image(g̃_t)} = 0 ]
≥ P[ Σ_{s=t+1}^{∞} 1{(X_s, Y_s, ..., X_{s+τ_t−1}, Y_{s+τ_t−1}) ∈ image(g̃_t)} = 0, T ≤ t ] = P[T ≤ t].

We know that T is finite with probability one, so we have that P[per_{τ_t}(g̃_t) > 0] ≤ P[T > t] → 0 as t → ∞.

Lemma C.4 guarantees that for some t* ∈ ℕ, the Gale-Stewart game played on a set of t* points will have converged with probability at least 7/8, i.e., Pr[per(g̃_{t*}) > 0] ≤ 1/8. This number t* depends on the distribution and is unknown to the learner. We first show how the learner can estimate a similar t̂_n from the data. Our approach is an adaptation of Lemma 5.10 in Bousquet et al. (2021), but we present the proof for completeness.

Lemma C.5 (Adaptation of Lemma 5.10 in Bousquet et al. (2021)). For any N ∈ ℕ, there exists a universally measurable t̂_N = t̂_N(X₁, Y₁, ..., X_{⌊N/2⌋}, Y_{⌊N/2⌋}) whose definition does not depend on P so that the following holds. Let the critical time t* be such that Pr[per(g̃_{t*}) > 0] ≤ 1/8, where the probability is over the training set of the algorithm. There exist C, c > 0 that depend on P, t* but not on N so that Pr[t̂_N ∈ T*] ≥ 1 − Ce^{−cN}, where the probability is over the training of the estimator t̂_N and T* is the set T* = {1 ≤ t ≤ t* : Pr[per(g̃_t) > 0] ≤ 3/8}.

Proof. We split this labeled set into two parts. The idea is to use the first one to run the Gale-Stewart game and the other one to estimate whether it has converged or not.
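The split-sample idea in this proof can be sketched abstractly: train one pattern function per batch on the first part of the data, then estimate the non-convergence probability by the fraction of trained functions whose forbidden patterns still occur on a held-out part. Everything below (the "game", the convergence rule, the holdout check) is a hypothetical stand-in for the process of Figure 5:

```python
def fraction_nonconverged(batches, holdout, train_game, hits_holdout):
    """Split-sample convergence test: train one pattern function per
    batch, then return the fraction of them that still realize a
    forbidden pattern on the held-out data."""
    gs = [train_game(batch) for batch in batches]
    bad = sum(1 for g in gs if hits_holdout(g, holdout))
    return bad / len(gs)

# Toy stand-ins: a "game" that converges iff its batch contains a 0,
# and a pattern that keeps occurring on the holdout iff not converged.
train_game = lambda batch: (0 in batch)       # True = converged
hits_holdout = lambda g, holdout: not g
batches = [[0, 1], [1, 1], [0, 0], [1, 0]]
frac = fraction_nonconverged(batches, [0, 1], train_game, hits_holdout)
```

The key structural point mirrored here is independence: each trained function depends only on its own batch, so the batch outcomes are i.i.d. draws, and the held-out data is independent of all of them, which is what licenses the Hoeffding bounds in the proof.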
For each 1 ≤ t ≤ N/4 and 1 ≤ i ≤ ⌊N/(4t)⌋, we let

τ^i_t := T_t(X_{(i−1)t+1}, Y_{(i−1)t+1}, ..., X_{it}, Y_{it}),
g̃^i_t(w₁, z₁, ..., w_{τ^i_t}, z_{τ^i_t}) := G̃_t(X_{(i−1)t+1}, Y_{(i−1)t+1}, ..., X_{it}, Y_{it}, w₁, z₁, ..., w_{τ^i_t}, z_{τ^i_t})

be the estimated quantities when we run Figure 5 on the i-th batch of the data. Next, for each t we estimate Pr[per(g̃_t) > 0] by the fraction of the g̃^i_t for which there is a tuple in the second quarter of the data that belongs to the image of g̃^i_t. For every fixed t, the data that the functions (g̃^i_t)_{i ≤ ⌊N/(4t)⌋} are trained on are independent of each other and of the second quarter of the training set. This means that we can view every g̃^i_t, i ≤ ⌊N/(4t)⌋, as an independent draw from the distribution of g̃_t. To estimate whether the game has converged, we use the second quarter of the training set. We let

ê_t = (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{∃s : ⌊N/4⌋ + 1 ≤ s ≤ ⌊N/2⌋ − τ^i_t, (X_{s+1}, Y_{s+1}, ..., X_{s+τ^i_t}, Y_{s+τ^i_t}) ∈ image(g̃^i_t)}.

Now observe that, with probability one,

ê_t ≤ e_t := (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{per(g̃^i_t) > 0},

since a function whose patterns have probability zero almost surely produces no pattern in the data. We define

t̂_N = inf{ t ≤ N/4 : ê_t < 1/4 },

where we assume that inf ∅ = ∞. We now want to bound the probability that t̂_N > t*. Using Hoeffding's inequality we get that

Pr[t̂_N > t*] ≤ Pr[ê_{t*} ≥ 1/4] ≤ Pr[e_{t*} − E[e_{t*}] ≥ 1/8] ≤ e^{−⌊N/(4t*)⌋/32}.

This implies that t̂_N ≤ t* except for an event with exponentially small probability. Moreover, for every 1 ≤ t ≤ t* with Pr[per(g̃_t) > 0] > 3/8, there is some ε > 0 such that Pr[per(g̃_t) > ε] > 5/16 (this holds by continuity). Now fix some 1 ≤ t ≤ t* such that Pr[per(g̃_t) > 0] > 3/8 (if it exists). Then, using Hoeffding's inequality again, we get that

Pr[ (1/⌊N/(4t)⌋) Σ_{i=1}^{⌊N/(4t)⌋} 1{per(g̃^i_t) > ε} < 1/4 ] ≤ e^{−⌊N/(4t)⌋/128}.

Whenever g is a function such that per(g) > ε, then

Pr[ ∃s : ⌊N/4⌋ + 1 ≤ s ≤ ⌊N/2⌋ − τ, (X_{s+1}, Y_{s+1}, ..., X_{s+τ}, Y_{s+τ}) ∈ image(g) ] ≥ 1 − (1 − ε)^{⌊(N−4)/(4τ)⌋},

since there are ⌊(N−4)/(4τ)⌋ disjoint intervals of length τ in the second quarter. As we mentioned before, the (g̃^i_t)_{i ≤ ⌊N/(4t)⌋} are independent of (X_s, Y_s)_{s > ⌊N/4⌋}.
Thus, applying a union bound, with probability at least 1 − (N/(4t))(1 − ε)^{⌊(N−4)/(4t*)⌋}, every g̃^i_t with per_{τ^i_t}(g̃^i_t) > ε makes at least one error on the second quarter of the training set, where we used that τ^i_t ≤ t*. Thus, we get that

Pr[t̂_N = t] ≤ Pr[ê_t < 1/4] ≤ (N/(4t))(1 − ε)^{⌊(N−4)/(4t*)⌋} + e^{−⌊N/(4t)⌋/128}.

Using the previous estimates and applying a union bound over 1 ≤ t ≤ t*, we get that

Pr[t̂_N ∉ T*] ≤ e^{−⌊N/(4t*)⌋/32} + t* N (1 − ε)^{⌊(N−4)/(4t*)⌋} + t* e^{−⌊N/(4t*)⌋/128} ≤ Ce^{−cN},

for some constants C, c > 0.

Essentially, Lemma C.5 tells us that we can estimate a batch size t̂_n, using only information from the data, so that, with probability 1 − Ce^{−cn}, the star tree game will have converged when we run it on t̂_n labeled points. Following the approach of Hanneke et al. (2022), the idea is to run the game multiple times, then create the partial concept classes that are induced by each execution, and, finally, aggregate them into a majority class (see Figure 3). We show that if most of the games have converged, then the majority class has bounded star number. This is made formal in Lemma C.6. We remark that Hanneke et al. (2022) showed a similar result regarding the VC dimension of the majority class.

Lemma C.6. The majority class F^{N̂}_m defined in Figure 3 has star number bounded by some distribution-dependent constant ŝ, with probability at least 1 − e^{−cN̂}, for some absolute constant c > 0.

Proof. We know that there exists some s* such that, with probability at least 9/10, the star number of any partial concept class F_i is at most s*. We let X_i = 1{star number of F_i > s*}. Notice that all the X_i's are i.i.d. Bernoulli random variables with p ≤ 1/10, since the data that induce the classes F_i are i.i.d.
Thus, we can use Hoeffding's inequality to bound the probability that more than a 2/10-fraction of them have star number greater than s* as follows:

Pr[ Σ_{i=1}^{N̂} X_i ≥ (2/10)N̂ ] ≤ Pr[ Σ_{i=1}^{N̂} X_i − E[Σ_{i=1}^{N̂} X_i] ≥ (1/10)N̂ ] ≤ e^{−N̂/50}.

We let E₁ be the complement of the event above and we condition on it for the rest of the proof. So, we know that at least (8/10)N̂ of the partial concept classes F_i have star number bounded by s*. We will bound the size ℓ of any star set of F^{N̂}_m. For any star set S = (x₁, y₁, ..., x_ℓ, y_ℓ) ∈ (X × {0, 1})^ℓ of F^{N̂}_m, it holds for every j ∈ [ℓ] that

(1/N̂) Σ_{i=1}^{N̂} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > 9/16,

by the definition of the majority class. Summing over j ∈ [ℓ] and swapping the order of summation gives

Σ_{i=1}^{N̂} Σ_{j=1}^{ℓ} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > (9/16) ℓ N̂.

Using Markov's inequality (an averaging argument over the classes, each inner sum being at most ℓ), at least a 2/9-fraction of the classes F_i must satisfy

Σ_{j=1}^{ℓ} 1{∃f ∈ F_i : f(x_j) = 1 − y_j, f(x_p) = y_p ∀p ∈ [ℓ] \ {j}} > (7/16) ℓ.

Thus, we have shown that at least a 2/9-fraction of the partial concept classes witness at least a 7/16-fraction of the star patterns. By the definition of the star number, this means that all these classes have star number at least (7/16)ℓ − 1: we can drop from S all the points that correspond to star patterns that the class does not realize, and the remaining set of labeled points is a star set. Since 2/9 + 8/10 > 1, at least one of these classes has star number at most s*, so (7/16)ℓ − 1 ≤ s*, which means that ℓ ≤ (16/7)(s* + 1). Thus, we have shown that the star number of F^{N̂}_m is at most (16/7)(s* + 1) =: ŝ, with probability at least 1 − e^{−cN̂}.

Similarly as in Hanneke et al. (2022), we define a sequence of universally measurable functions G_ℓ : (X × {0, 1})^ℓ → {0, 1},

G_ℓ(x₁, y₁, ..., x_ℓ, y_ℓ) = 1{∃f ∈ F^{N̂}_m : (f(x₁), ..., f(x_ℓ)) = (y₁, ..., y_ℓ)}.
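The majority class of Figure 3 admits a labeling exactly when enough of the individual classes do. A toy sketch of such a 9/16-vote, with the classes represented as hypothetical finite sets of labelings (the paper's classes are of course infinite partial classes):

```python
def majority_indicator(classes, labeling, threshold=9/16):
    """Indicator of the majority class on a labeling: returns 1 iff at
    least a `threshold`-fraction of the classes realizes the labeling."""
    votes = sum(1 for F in classes if labeling in F)
    return 1 if votes >= threshold * len(classes) else 0

# Toy example: four classes over two points. The labeling (0, 1) is
# realized by three of them (3 >= 9/16 * 4 = 2.25); (1, 1) by only one.
classes = [
    {(0, 1), (0, 0)},
    {(0, 1)},
    {(0, 1), (1, 1)},
    {(1, 0)},
]
```

The aggregation is robust by design: a labeling rejected by the majority must be rejected by at least one of the correct (converged) classes, since the incorrect classes cannot on their own push the vote below the 9/16 threshold; this is the mechanism exploited in the proof of Lemma C.7.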
An equivalent interpretation of the previous lemma is that the star number of the partial class on which {G_ℓ}_{ℓ∈ℕ} returns 1 has a distribution-dependent bound, with high probability. Lemma C.7 is an important component in the derivation of our result; it shows that {G_ℓ}_{ℓ∈ℕ} return 1 on all finite subsets of the correctly labeled data sequence, with high probability. This result follows similarly to Lemma 7 in Hanneke et al. (2022).

Lemma C.7. The {G_ℓ}_{ℓ∈ℕ} as defined above are universally measurable functions, and with probability at least 1 − Ce^{−cN} for some C, c > 0, the following hold: (i) the star number of the class

F_G = { f : X → {0, 1, ⋆} : ∀ℓ ∈ ℕ, ∀(x₁, ..., x_ℓ) ∈ X^ℓ, G_ℓ(x₁, f(x₁), ..., x_ℓ, f(x_ℓ)) = 1 }

has a finite distribution-dependent upper bound ŝ; (ii) for every ℓ ∈ ℕ and (x₁, y₁, ..., x_ℓ, y_ℓ) ~ P^ℓ, G_ℓ(x₁, y₁, ..., x_ℓ, y_ℓ) = 1 with conditional probability one (given G_ℓ).

Proof. We condition on the event E₀ described in Lemma C.6. Notice that, by definition, a labeled sequence S = (x₁, y₁, ..., x_ℓ, y_ℓ) ∈ (X × {0, 1})^ℓ is a star set of the class F_G if and only if S is a star set of F^{N̂}_m. Hence, the bound on the star number follows immediately from Lemma C.6. We also condition on the event E₁ described in Lemma C.5. We now bound the probability that at least (13/32)N̂ of the functions g̃^i_{t̂_N} have per(g̃^i_{t̂_N}) > 0. For any t ∈ T*, using Hoeffding's inequality we get that

Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0} ≥ (13/32)N̂ ] ≤ Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0} − E[Σ_{i=1}^{N̂} 1{per(g̃^i_t) > 0}] ≥ (1/32)N̂ ] ≤ exp(−N̂/512).

Since we know that t̂_N ∈ T*, taking a union bound over all 1 ≤ t ≤ t* we get that

Pr[ Σ_{i=1}^{N̂} 1{per(g̃^i_{t̂_N}) > 0} ≥ (13/32)N̂ ] ≤ t* exp(−N̂/512).

Thus, except for an event with exponentially small probability, at least (19/32)N̂ of the functions g̃^i_{t̂_N} have per(g̃^i_{t̂_N}) = 0, i.e., with probability one their forbidden patterns never occur under P. Notice that, by definition, if the partial class F_i that corresponds to such a function cannot produce a labeling y^ℓ ∈ {0, 1}^ℓ of some tuple x^ℓ ∈ X^ℓ where x^ℓ ~ P^ℓ_X, we can infer that this y^ℓ is not the correct labeling, and this inference is valid with probability one over the draw of x^ℓ.
Thus, with probability one, if the $9/16$-majority cannot produce some labeling, then at least one partial class $F_j$ that corresponds to a correct $\tilde g^j_{\hat t_N}$ cannot produce this labeling, so it is not the correct one. As a result, with probability one, if $F^{\hat N}_m$ cannot produce $y^\ell$ for $x^\ell$, we know that this is not the correct labeling. The measurability of $\{G_\ell\}_{\ell \in \mathbb{N}}$ follows from the measurability of all the $\tilde g^i_{\hat t_N}, i \in [\hat N]$. The proof of the lemma follows by noticing that the probability of all the events we have conditioned on can be bounded by $1 - C' e^{-c' \hat N}$ and that $\hat N = \lfloor N/(2\hat t_N) \rfloor \ge \lfloor N/(2t^*) \rfloor$.

We are now ready to prove the result about active learning at exponential universal rates. Our approach is summarized in Figure 4.

Theorem C.8. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ does not have an infinite star tree. Then, $H$ admits an active learning algorithm that achieves exponential rates.

Proof. First, notice that the Gale-Stewart subroutine (Figure 3) uses $n/2$ points and the exponential rates algorithm for partial classes uses at most $e^{cn}$ points. This is because we define a uniform distribution over these points during its execution. To prove the learning rate guarantees, we first condition on the event $E_0$ described in Lemma C.7. Then, we have some partial concept class $F_G$ whose star number is bounded by some distribution-dependent number $\hat s$ and which has the property that $\forall \ell \in \mathbb{N}$, for $(x_1, y_1, \ldots, x_\ell, y_\ell) \sim P^\ell$, $f(x_1) = y_1, \ldots, f(x_\ell) = y_\ell$ for some $f \in F_G$, with probability one. Thus, the conditions of Theorem C.1 are satisfied and the conditional error rate of the output of our algorithm is $\mathbb{E}[\mathrm{er}(\hat h_n) \mid E_0] \le c_1 \hat d e^{-c_2 n/\hat s}$. Since $E_0$ happens with probability at least $1 - Ce^{-cn}$, we see that the unconditional error of the output is
$$\mathbb{E}[\mathrm{er}(\hat h_n)] \le c_1 \hat d e^{-c_2 n/\hat s} + Ce^{-cn} \le C' e^{-c'n},$$
for some distribution-dependent constants $c', C' > 0$.

C.5 Sublinear Rates Lower Bound

Theorem C.9. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ has an infinite star tree and let $R(n)$ be a rate function with $R(n) = o(1/n)$. Then, $H$ requires rate $R(n)$.

Proof. Let
$$t = \Big\{(x_u, z_u) \in \mathcal{X}^{\ell+1} \times \{0,1\}^{\ell+1} : u \in \{0\} \times \{0,1\} \times \cdots \times \{0,1,\ldots,\ell\},\ \ell \ge 0\Big\}$$
be an infinite star tree. We also fix the learning algorithm $\hat h_n$. Recall that $\{u_n\}_{n \in \mathbb{N}}$ denotes the number of unlabeled samples that $\hat h_n$ uses. We first pick a path on the tree uniformly at random. To be more precise, let $y = (y_1, y_2, \ldots)$ be a sequence of independent random variables where $y_i$ is uniformly distributed in $\{0, \ldots, i-1\}$. The idea is that $y$ specifies a random path on the tree. We define the target distribution $P^y$ in an inductive way by specifying sequences $\{k_i, p_i, n_i\}_{i \ge 1}$ and putting mass $p_i$ on the node of the random path at level $k_i$. Conditional on the given node, we distribute the mass among its points uniformly. Let us now be formal. For the base of the induction, we let $p_2 = 1/2$ and $k_1 = 1$. We will specify the value of $p_1$ later. For the inductive step, for any $i \ge 2$, given the values of $p_i, k_{i-1}$, we will specify $k_i, p_{i+1}$. We let $k_i$ be the smallest integer so that $k_i > k_{i-1}$ and $R(\lceil k_i/8 \rceil) < p_i/k_i$. Notice that since $R(n) = o(1/n)$, $k_i < \infty$. We let $n_i = \lceil k_i/8 \rceil$ and $p_{i+1} = \min\{1/(4u_{n_i}),\ p_i/4\}$. Notice that since $\{p_i > 0\}_{i \ge 2}$ and $\sum_{i \ge 2} p_i < 1$ (this holds because $p_{i+1} \le p_i/4$), by setting $p_1 = 1 - \sum_{i \ge 2} p_i$ the sequence $\{p_i\}_{i \ge 1}$ can be used as a probability distribution. We are now ready to define $P^y$ on $\mathcal{X} \times \{0,1\}$. To make the notation simpler, for any $y$ and for $k \ge 1$, let
$$w_j^{y_{\le k-1}} = \begin{cases} 1 - z_j^{y_{\le k-1}}, & \text{if } j = y_k, \\ z_j^{y_{\le k-1}}, & \text{otherwise.} \end{cases}$$
For $i \ge 1$, let
$$P^y\big(x_j^{y_{\le k_i-1}}, w_j^{y_{\le k_i-1}}\big) = \frac{p_i}{k_i}, \qquad 0 \le j \le k_i - 1.$$
Notice that, by the definition of $p_i$, this is indeed a probability distribution. We now show that $P^y$ is realizable. Since $t$ is a star tree, for every $n$ there exists some $h \in H$ such that $h(x_j^{y_{\le k-1}}) = w_j^{y_{\le k-1}}$ for all $0 \le j \le k-1$ and $1 \le k \le n$. Thus, for the probability of error of $h$ we have that
$$\mathrm{er}^y(h) := P^y\big[(x, y) \in \mathcal{X} \times \{0,1\} : h(x) \ne y\big] \le \sum_{i : k_i > n} p_i,$$
so by taking $n \to \infty$ we see that $P^y$ is realizable for every $y$. Moreover, we note that the mapping $y \mapsto P^y$ is measurable. We now let $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots$ be i.i.d. samples drawn from $P^y$.
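The inductive construction of the sequences $\{k_i, p_i, n_i\}$ is easy to instantiate numerically. A minimal sketch, assuming for illustration the rate $R(n) = 1/n^2$ (which is $o(1/n)$) and, hypothetically, $u_n = n$ unlabeled samples at budget $n$; neither choice comes from the proof:

```python
import math

def build_sequences(R, u, levels=4):
    # Sketch of the inductive construction from the proof of Theorem C.9:
    # k_i is the smallest integer with k_i > k_{i-1} and R(ceil(k_i/8)) < p_i/k_i,
    # n_i = ceil(k_i/8), and p_{i+1} = min(1/(4*u(n_i)), p_i/4).
    k = {1: 1}          # base case: k_1 = 1
    p = {2: 0.5}        # base case: p_2 = 1/2 (p_1 is fixed at the end)
    n = {}
    for i in range(2, levels + 1):
        ki = k[i - 1] + 1
        while not R(math.ceil(ki / 8)) < p[i] / ki:
            ki += 1     # finite since R(n) = o(1/n)
        k[i] = ki
        n[i] = math.ceil(ki / 8)
        p[i + 1] = min(1 / (4 * u(n[i])), p[i] / 4)
    # p_{i+1} <= p_i/4 guarantees sum_{i>=2} p_i < 1, so p_1 > 0.
    p[1] = 1 - sum(p[i] for i in range(2, levels + 2))
    return k, p, n

# Illustrative (hypothetical) choices of the rate function and of u_n:
R = lambda m: 1 / m ** 2
u = lambda m: m
k, p, n = build_sequences(R, u)
assert p[1] > 0 and abs(sum(p.values()) - 1) < 1e-9
assert all(R(n[i]) < p[i] / k[i] for i in (2, 3, 4))
```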
We can draw them as $X = x_J^{y_{\le T-1}}$, $Y = w_J^{y_{\le T-1}}$, $X_j = x_{J_j}^{y_{\le T_j-1}}$, $Y_j = w_{J_j}^{y_{\le T_j-1}}$, where $(T, J), (T_1, J_1), (T_2, J_2), \ldots$ are i.i.d. random variables, independent of the path $y$, with $\Pr[T = k_i, J = j] = p_i/k_i$ for all $i \ge 1$, $0 \le j \le k_i - 1$. For all $i \in \mathbb{N}$, let us define the event $E^i_0 = \{T_1 \le k_i, T_2 \le k_i, \ldots, T_{u_{n_i}} \le k_i\}$, under which the algorithm has not received any unlabeled points from any level $k_j, j > i$. Let us also call $E^i_1$ the event that $\hat h_{n_i}$ does not query the point whose label is flipped, i.e., the point $x_{y_{k_i}}^{y_{\le k_i - 1}}$. Also, we denote by $E^i_2$ the event that the learner classifies at least one point of level $k_i$ incorrectly. Then, we have that
$$\Pr[E^i_2 \mid E^i_1, E^i_0] \ge 1 - \frac{1}{(7/8)k_i},$$
since there are at least $(7/8)k_i$ points on this level that the learner has not queried, and under the events we have conditioned on, the flipped point along the path is chosen uniformly at random among these points. We now bound $\Pr[E^i_0]$. Notice that, since $p_{i+1} \le p_i/4$,
$$\sum_{j > i} p_j \le \frac{4}{3}\,p_{i+1} \le \frac{1}{3u_{n_i}},$$
so $\Pr[E^i_0] \ge (1 - 1/(3u_{n_i}))^{u_{n_i}} \ge 2/3$. Finally, we need to bound $\Pr[E^i_1 \mid E^i_0]$. Notice that $\Pr[E^i_1 \mid E^i_0] \ge 7/8$, since the learner has $n_i$ label queries and there are at least $8n_i$ points on this level. Putting it together, we can bound
$$\Pr[E^i_2] = \Pr[E^i_2 \mid E^i_1, E^i_0]\,\Pr[E^i_1 \mid E^i_0]\,\Pr[E^i_0] \ge \frac{9}{20}$$
for all large enough $i$, which using reverse Markov's inequality implies that $\Pr_y\big[\Pr[E^i_2 \mid y] \ge 9/40\big] \ge 9/40$. Moreover, notice that
$$\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{p_i}{k_i}\,\Pr[E^i_2 \mid y] \ge R(n_i)\,\Pr[E^i_2 \mid y].$$
Thus, we can see that $\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{9}{40}R(n_i)$ with probability at least $9/40$ over the draw of $y$. Using Fatou's lemma, we have that
$$\mathbb{E}_y\Big[\limsup_i \frac{1}{R(n_i)}\Pr[\hat h_{n_i}(X) \ne Y \mid y]\Big] \ge \mathbb{E}_y\Big[\limsup_i \frac{1}{R(n_i)}\min\Big\{\Pr[\hat h_{n_i}(X) \ne Y \mid y], \frac{9}{40}R(n_i)\Big\}\Big] \ge \limsup_i \frac{1}{R(n_i)}\,\mathbb{E}_y\Big[\min\Big\{\Pr[\hat h_{n_i}(X) \ne Y \mid y], \frac{9}{40}R(n_i)\Big\}\Big] \ge \limsup_i \frac{1}{R(n_i)}\cdot\frac{9}{40}R(n_i)\cdot\Pr_y\Big[\Pr[\hat h_{n_i}(X) \ne Y \mid y] \ge \frac{9}{40}R(n_i)\Big] \ge \frac{9}{40}\cdot\frac{9}{40},$$
where the first inequality is by the definition of min, the second inequality follows from Fatou's lemma, and the third inequality follows from Markov's inequality. We remark that Fatou's lemma can be applied here since
$$\frac{1}{R(n_i)}\min\Big\{\Pr\big[\hat h_{n_i}(X) \ne Y \mid y\big], \frac{9}{40}R(n_i)\Big\} \le \frac{9}{40}.$$
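The "reverse Markov" step invoked above is the following elementary inequality, stated generically for a random variable $Z$ taking values in $[0,1]$ and any threshold $t$ below its mean (here it is applied with $Z = \Pr[E^i_2 \mid y]$):

```latex
% For Z \in [0,1] and any t \in (0, \mathbb{E}[Z]):
\mathbb{E}[Z] \;\le\; t \cdot \Pr[Z < t] \;+\; 1 \cdot \Pr[Z \ge t]
\;\le\; t + \Pr[Z \ge t],
\qquad\text{hence}\qquad
\Pr[Z \ge t] \;\ge\; \mathbb{E}[Z] - t .
```

In particular, a lower bound of $c$ on $\mathbb{E}[Z]$ yields $\Pr[Z \ge c/2] \ge c/2$.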
Thus, there must exist a realization of $y$ such that $\mathbb{E}[\mathrm{er}^y(\hat h_n) \mid y] \ge C \cdot R(n)$ infinitely often, where $C$ is some absolute constant. Choosing $P = P^y$ for this realization of $y$ concludes the proof.

D Omitted Details from Section 4

D.1 Useful Tools

Theorem D.1 (Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality). Let $F$ be the CDF of some distribution $D$ supported on $\mathbb{R}$, $n \in \mathbb{N}$, and let $F_n$ be the empirical CDF obtained from $n$ i.i.d. samples from $D$. Then, $\forall \varepsilon > 0$ we have that
$$\Pr\Big[\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon\Big] \le 2e^{-2n\varepsilon^2}.$$

Figure 6: The VCL Game (Bousquet et al., 2021).
VCL Game Subroutine (Bousquet et al., 2021): Input is a labeled sequence $\{(x_1, y_1), \ldots, (x_N, y_N)\}$.
1. Let $\tau_0 \leftarrow 0$.
2. In every step $t \ge 1$: If $\eta_{\tau_{t-1}}(\xi_1, \ldots, \xi_{\tau_{t-1}-1}, x_{t-\tau_{t-1}+1}, \ldots, x_t) = (y_{t-\tau_{t-1}+1}, \ldots, y_t)$: Let $\xi_{\tau_{t-1}} \leftarrow (x_{t-\tau_{t-1}+1}, \ldots, x_t)$ and $\tau_t \leftarrow \tau_{t-1} + 1$. Else, $\tau_t \leftarrow \tau_{t-1}$.

D.2 The VCL Game

For completeness, we provide various algorithms and results that appeared in Bousquet et al. (2021); Hanneke et al. (2022), which we utilize in our approach to get sublinear learning rates. It is instructive to start with an informal description of a pattern avoidance function $g : \mathcal{X}^k \to \{0,1\}^k$. Intuitively, this function takes as input a $k$-tuple of unlabeled data generated by a realizable distribution $P_{\mathcal{X}}$ and it returns a labeling of these data that is not the true labeling under $P$. They define
$$\mathrm{per}(g) = P^k\big[(x_1, y_1), \ldots, (x_k, y_k) : g(x_1, \ldots, x_k) = (y_1, \ldots, y_k)\big].$$
Bousquet et al. (2021) provide a way to obtain such a pattern avoidance function for any realizable distribution. This is obtained by the VCL game in Figure 6, which returns a function $\hat y_{\tau_t}$. Their following result shows that as the number of points $N$ used in the VCL game grows, the probability that we obtain a pattern avoidance function that is incorrect goes to zero.

Lemma D.2 (Correct Pattern Avoidance Function (Restatement of Lemma 5.7 of Bousquet et al. (2021))). Consider the VCL game in Figure 6. Then, $\Pr[\mathrm{per}(\hat y_{\tau_t}) > 0] \to 0$ as $N \to \infty$.

Finally, we remark that Bousquet et al.
(2021) showed the following result.

Theorem D.3 (Supervised Learning Linear Rates (Bousquet et al., 2021)). If $H$ does not have an infinite VCL tree, there exists a supervised learning algorithm that can learn $H$ at a linear rate.

D.3 Omitted Details of the Sublinear Rates Algorithm

Lemma D.4. Let $F^i_{\sqrt{n}/5}$, $i \in [\sqrt{n}]$, be the partial concept classes that are obtained by playing the VCL game on batches of $\sqrt{n}/5$ examples that are generated i.i.d. from $P$. Then, with probability at least $1 - e^{-C n^{1/4}}$, where $C > 0$ is some absolute constant, $P$ is a realizable distribution for a $(1 - o(1))$ fraction of these classes.

Proof. Let $\hat y^i_{\sqrt{n}/5}$ be the pattern avoidance function that is obtained from the $i$-th batch of $\sqrt{n}/5$ i.i.d. samples from $P$. Since we know that $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] \to 0$ as $n \to \infty$ (cf. Lemma D.2), for any $\varepsilon > 0$ there is some $b_\varepsilon$ such that if the batch size $b = \sqrt{n}/5$ is at least $b_\varepsilon$, then $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] < \varepsilon$. Hence, for every $n$ there is some $\varepsilon_n = o(1)$ so that when the batch size is $\sqrt{n}/5$ we have that $\Pr[\mathrm{per}(\hat y^i_{\sqrt{n}/5}) > 0] < \varepsilon_n$. Notice that whenever $\mathrm{per}(\hat y^i_{\sqrt{n}/5}) = 0$, the distribution $P$ is realizable with respect to the corresponding class $F^i_{\sqrt{n}/5}$. Let us set $\delta_n = n^{-1/8}$. Using Hoeffding's inequality, we have that with probability at least $1 - e^{-C' \delta_n^2 \sqrt{n}} = 1 - e^{-C' n^{1/4}}$, where $C'$ is some absolute constant, at least a $(1 - \varepsilon_n - \delta_n)$ fraction of the $\hat y^i_{\sqrt{n}/5}$ will have $\mathrm{per}(\hat y^i_{\sqrt{n}/5}) = 0$. Since $\delta_n + \varepsilon_n = o(1)$ we have shown the claim.

Lemma D.5. Let $\tilde P$ be the distribution of the VC dimension of the partial concept classes that are obtained by playing the VCL game on infinitely many i.i.d. samples from $P$. Let $d^*$ be the $9/10$-quantile of $\tilde P$. Then, for $n \ge n_P$, where $n_P$ is some $P$-dependent constant, we have that $\hat d_n = d^*$ with probability at least $1 - 3e^{-C_P n^{1/4}}$, where $C_P$ is a constant that depends on the data-generating distribution $P$.

Proof. Consider the following experiment. We sample an infinite sequence of labeled points from $P$ and we run the VCL game (cf. Figure 6) on this sequence.
Let $\hat y$ be the pattern avoidance function that is obtained from that game, defined over $\hat\ell$ many points, and let
$$\hat F := \big\{f : \mathcal{X} \to \{0,1,\star\} : (f(x_1), \ldots, f(x_{\hat\ell})) \ne \hat y(x_1, \ldots, x_{\hat\ell})\big\}.$$
Let $d^* \in \mathbb{N}$ be the smallest number such that
$$\Pr[d(\hat F) \le d^*] > 9/10.$$
Let $F$ be the CDF of the distribution of the VC dimension of $\hat F$, let $d_1 < d^* < d_2$ be the largest and smallest numbers, respectively, such that $F(d_1) \ne F(d^*)$ and $F(d_2) \ne F(d^*)$, and let $C_P = \min\{F(d^*) - F(d_1), F(d_2) - F(d^*)\}/4$. Let $F_k$ be the empirical CDF of this distribution obtained from $k$ i.i.d. samples. Notice that whenever we estimate $F$ with accuracy $C_P$, the empirical $9/10$-quantile will be the same as the true $9/10$-quantile. The DKW inequality (Theorem D.1) shows that this happens with probability at least $1 - 2e^{-2C_P^2 k}$. Let us call this event $E_1$ and condition on it. Next, we need to handle the fact that we do not obtain samples from partial concept classes that are generated from VCL games on infinitely many i.i.d. samples from $P$, but merely on $\sqrt{n}/5$ many such samples. Let us consider the following experiment. We run the VCL game on $k$ streams of infinitely many i.i.d. samples from $P$, and if the game has not terminated within the first $\sqrt{n}/5$ rounds, we rewind the pattern avoidance function to the one that is obtained in that round. Then, Lemma D.4 shows that, with probability at least $1 - e^{-C n^{1/4}}$, we will only rewind the pattern avoidance function of an $o(1)$ fraction of these games; the rest will have converged. We call this event $E_2$ and we condition on it. Let us denote by $\hat F_k$ the empirical CDF of the VC dimension of the partial concept classes obtained by this experiment. Under $E_2$, we have that $\sup_{x \in \mathbb{N}} |F_k(x) - \hat F_k(x)| = o(1)$. Thus, by taking $n$ large enough so that the $o(1)$ term is smaller than $C_P$, we see that the estimate $\hat d_n$ obtained from the samples of the truncated VCL game also equals $d^*$. The bound on the probability of error follows from a union bound over $E_1, E_2$.

Proposition D.6.
For $n > n_P$, where $n_P$ is some distribution-dependent constant, given $m = O(n^3)$ unlabeled samples we can estimate $p_{n,k}, p^X_{n,k}, p^{(X,y)}_{n,k}$ with accuracy $o(1)$, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$, where $C_P, C'_P > 0$ are distribution-dependent constants.

Proof. As we alluded to before, using $O(n^3)$ unlabeled samples in total we can estimate the quantities $p_{n,k,i}, p^X_{n,k,i}, p^{(X,y)}_{n,k,i}$ with accuracy $1/n$ for each version space $V^i_{n/5}$, with probability at least $1 - C_1 n e^{-C_2 n}$, where $C_1, C_2 > 0$ are absolute constants. Let us denote these estimates $\hat p_{n,k,i}, \hat p^X_{n,k,i}, \hat p^{(X,y)}_{n,k,i}$. We condition on this event for the rest of the proof. Assume that $n$ is large enough and condition on the events of Lemma D.4 and Lemma D.5. Let us denote by $i_1, \ldots, i_{k'}$ the indices of the partial concept classes whose VC dimension is bounded by $\hat d_n = d^*$. By definition, $k' \ge (9/10)\sqrt{n}$. Consider the following statistical experiment. We run the VCL game on infinitely many i.i.d. samples from $P$, we define the partial concept class obtained from the pattern avoidance function of the game, and if the class has VC dimension at most $d^*$ we keep the sample, otherwise we discard it. Then, we define its version space on the first $n/5$ samples of $S_2$. Let us call $\tilde V^i$ the version space obtained by the $i$-th sample of this experiment. Then, $P^k(x_1, \ldots, x_k \text{ are shattered by } \tilde V^i)$ is an unbiased sample from the distribution whose expected value $p_{n,k}$ we are trying to estimate, and similarly for the rest of the quantities we have defined. Let us call these samples $\tilde p_{n,k,i}, \tilde p^X_{n,k,i}, \tilde p^{(X,y)}_{n,k,i}$ and denote by $\tilde S_n, \tilde S^X_n, \tilde S^{(X,y)}_n$ their empirical averages over $(9/10)\sqrt{n}$ many samples. Hoeffding's inequality gives us that $|\tilde S_n - p_{n,k}| = o(1)$, $|\tilde S^X_n - p^X_{n,k}| = o(1)$, $|\tilde S^{(X,y)}_n - p^{(X,y)}_{n,k}| = o(1)$, with probability at least $1 - Ce^{-C' n^{1/4}}$, for some absolute constants $C, C' > 0$. We condition on this event.
Notice that, under the events we have conditioned on, the samples $p_{n,k,i}, p^X_{n,k,i}, p^{(X,y)}_{n,k,i}$ we obtain from running the VCL game on $O(\sqrt{n})$ many samples differ from the samples $\tilde p_{n,k,i}, \tilde p^X_{n,k,i}, \tilde p^{(X,y)}_{n,k,i}$ on a sublinear number of terms. Moreover, the absolute difference on the terms they disagree on is bounded by 1. Thus, if we denote by $S_n, S^X_n, S^{(X,y)}_n$ the averages of these values, we have that $|\tilde S_n - S_n| = o(1)$, $|\tilde S^X_n - S^X_n| = o(1)$, $|\tilde S^{(X,y)}_n - S^{(X,y)}_n| = o(1)$. Finally, let $\hat S_n, \hat S^X_n, \hat S^{(X,y)}_n$ be the averages of $\hat p_{n,k,i}, \hat p^X_{n,k,i}, \hat p^{(X,y)}_{n,k,i}$. Since $|\hat p_{n,k,i} - p_{n,k,i}| = o(1)$, $|\hat p^X_{n,k,i} - p^X_{n,k,i}| = o(1)$, $|\hat p^{(X,y)}_{n,k,i} - p^{(X,y)}_{n,k,i}| = o(1)$, using the triangle inequality on all the previous estimates we see that $|\hat S_n - p_{n,k}| = o(1)$, $|\hat S^X_n - p^X_{n,k}| = o(1)$, $|\hat S^{(X,y)}_n - p^{(X,y)}_{n,k}| = o(1)$, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$.

(Footnote to the definition of $C_P$ in the proof of Lemma D.5: if only one of these numbers exists, then we ignore the one that does not exist in the definition of $C_P$. If neither of them exists, then the estimation task is trivial.)

Lemma D.7. Let $n > C_{S_2}$, where $C_{S_2}$ is some constant that depends on $S_2$ and $P$. Then, with probability at least $1 - C_P e^{-C'_P n^{1/4}}$, for every $k \le k^*$, if $\hat p^X_{n,k} < \hat p_{n,k}/2$ then $\hat y = \mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{n,k}$ is the correct label of $X$.

Proof. Let $n > C_{S_2}$ be large enough so that for all $k \le k^*$ we have that $p_{n,k} \le (100/99)\lim_{n \to \infty} p_{n,k}$ and the guarantee from Proposition D.6 gives us that
$$\hat p^X_{n,k} < \hat p_{n,k}/2 \implies p^X_{n,k} < (3/4)\,p_{n,k},$$
$$p^{(X,y)}_{n,k} \ge (99/100)\,p_{n,k} \implies \hat p^{(X,y)}_{n,k} \ge (95/100)\,\hat p_{n,k},$$
$$p^{(X,y)}_{n,k} < (80/100)\,p_{n,k} \implies \hat p^{(X,y)}_{n,k} < (90/100)\,\hat p_{n,k}.$$
We condition on these events for the rest of the proof. Since $p_{n,k} \le (100/99)\lim_{n \to \infty} p_{n,k}$, for the correct label $y^*$ we have that $p^{(X,y^*)}_{n,k} \ge (99/100)\,p_{n,k}$. Moreover, since $p^X_{n,k} < (3/4)\,p_{n,k}$, for the other label $\bar y = 1 - y^*$ it holds that $p^{(X,\bar y)}_{n,k} < (80/100)\,p_{n,k}$. Thus, under the events we have conditioned on, $\mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{n,k}$ will indeed be the correct label. The correctness probability follows directly from Proposition D.6.

Lemma D.8.
For $k = k^*$ we have that $\Pr\big[\hat p^X_{n,k^*} \ge \hat p_{n,k^*}/2\big] = o(1)$.

Proof. Essentially, the proof of this result follows by Markov's inequality. Let us consider some large enough $n$ and condition on the event described in Proposition D.6. Then,
$$\hat p^X_{n,k^*} \ge \hat p_{n,k^*}/2 \implies p^X_{n,k^*} \ge p_{n,k^*}/4,$$
so it suffices to bound $\Pr\big[p^X_{n,k^*} \ge p_{n,k^*}/4\big]$. Using Markov's inequality, we have that
$$\Pr\big[p^X_{n,k^*} \ge p_{n,k^*}/4\big] \le \frac{4\,\mathbb{E}[p^X_{n,k^*}]}{p_{n,k^*}} = o(1),$$
where we have used the fact that $\mathbb{E}[p^X_{n,k^*}] = p_{n,k^*+1}$ and that $p_{n,k^*}$ is bounded below by a constant.

We now state the guarantees of Active Select (Hanneke, 2012). We note that the original statement of the result did not specify the size of the unlabeled dataset that the algorithm uses, but a straightforward adaptation of it shows that with $O(n^3)$ many unlabeled data its error guarantee increases by $O(1/n^2)$.

Lemma D.9 (Hanneke (2012)). A call to Active Select (cf. Figure 8) with $N$ classifiers $\{h_1, \ldots, h_N\}$, a query budget of $m$, and $O(N^2 n^3)$ unlabeled points makes at most $m$ label queries, and if $h_{\hat k}$ is the classifier it returns, then with probability at least $1 - C_1 N e^{-C_2 m/(N^2 \log(N))}$, we have $\mathrm{er}(h_{\hat k}) \le 2\min_{k \in [N]} \mathrm{er}(h_k) + O(1/n^2)$.

We are now ready to prove the main result in this section. The algorithm is summarized in Figure 7.

Theorem D.10. Assume that $H \subseteq \{0,1\}^{\mathcal{X}}$ does not have an infinite VCL tree. Then, $H$ admits an active learning algorithm that achieves sublinear rates.

Proof. We have all the main ingredients we need to prove our result. Let us fix some $S_2$ whose elements are i.i.d. from $P$. Conditioning on the results of Lemma D.7 and Lemma D.8, we have that for large enough $n$, which depends on $S_2$, we can obtain a labeled dataset $L_n$ whose size is $\omega(n)$. Feeding this dataset into the supervised learning algorithm from Bousquet et al. (2021) gives us a classifier $\tilde h_n$ that, conditioned on $S_2$ and the events $E$ we have described, has error $\mathbb{E}[\mathrm{er}(\tilde h_n) \mid S_2, E] = o(1/n)$. Moreover, the probability of the complement of $E$ is also $o(1/n)$, so we get $\mathbb{E}[\mathrm{er}(\tilde h_n) \mid S_2] = o(1/n)$.
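The query-saving rule justified by Lemma D.7 and Lemma D.8 can be sketched in code: when the estimate $\hat p^X_{n,k}$ falls below half of $\hat p_{n,k}$, the label is inferred via the argmax over the estimates $\hat p^{(X,y)}_{n,k}$; otherwise a label query is made. The numeric estimates below are hypothetical, chosen only to exercise both branches:

```python
def infer_or_query(p_hat, pX_hat, pXy_hat):
    """Sketch of the decision rule from the sublinear-rates algorithm.
    p_hat:   estimate of p_{n,k}
    pX_hat:  estimate of p^X_{n,k}
    pXy_hat: dict {0: estimate of p^{(X,0)}_{n,k}, 1: estimate of p^{(X,1)}_{n,k}}
    Returns ('infer', label) when the test of Lemma D.7 fires,
    and ('query', None) when a label query must be spent instead."""
    if pX_hat < p_hat / 2:
        label = max((0, 1), key=lambda y: pXy_hat[y])
        return ('infer', label)
    return ('query', None)

# Hypothetical estimates: the test fires, and label 1 dominates the argmax.
assert infer_or_query(0.8, 0.3, {0: 0.05, 1: 0.25}) == ('infer', 1)
# Hypothetical estimates: the test does not fire, so the label is queried.
assert infer_or_query(0.8, 0.5, {0: 0.2, 1: 0.3}) == ('query', None)
```

Lemma D.8 guarantees that at $k = k^*$ the rule fires on all but an $o(1)$ fraction of points, which is what makes the labeled dataset $L_n$ of size $\omega(n)$ affordable within the query budget.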
Similarly, for the output $\hat h_n$ of Active Select, we have that $\mathbb{E}[\mathrm{er}(\hat h_n) \mid S_2] = o(1/n)$ and $\mathbb{E}[\mathrm{er}(\hat h_n)] = O(1/n)$, where the second bound comes from the fact that we feed into Active Select the output of the passive learning algorithm $\hat h^p_n$ from Bousquet et al. (2021) (cf. Theorem D.3), whose bound does not depend on $S_2$. Let us call $R(n)$ the error rate of our algorithm. To show that $\mathbb{E}[R(n)] = o(1/n)$ it suffices to prove that $\limsup_n \mathbb{E}[n R(n)] = 0$. Fatou's lemma gives us that
$$\limsup_n n\,\mathbb{E}[R(n)] = \limsup_n \mathbb{E}_{S_2}\big[\mathbb{E}[n R(n) \mid S_2]\big] \le \mathbb{E}_{S_2}\big[\limsup_n \mathbb{E}[n R(n) \mid S_2]\big] = 0,$$
where Fatou's lemma applies since $\mathbb{E}[n R(n) \mid S_2] \le C_P$, where $C_P$ is some $P$-dependent constant. This is because, almost surely over $S_2$, the expected error of $\hat h^p_n$ is bounded by $C'_P/n$.

D.4 Arbitrarily Slow Rates Lower Bound

Theorem D.11 (Hanneke et al. (2022)). If $H$ has an infinite VCL tree, no active learning algorithm can achieve rates that are faster than arbitrarily slow.

D.5 Omitted Figures

Figure 7: Sublinear Rates Algorithm.
Sublinear Rates Algorithm [Adaptation from Hanneke (2012)]. Input is a label budget $N$ and a passive learning algorithm $A^p$.
1. Let $S_1, S_2, S_3, S_4, S_5, U$ be sets of $O(N^3)$ unlabeled points that are i.i.d. from $P_{\mathcal{X}}$.
2. Request the labels of the first $N/5$ points of $S_5$ and run $A^p$ on this set. Let $\hat h^p_N$ be its output.
3. Request the labels of the first $N/5$ points of $S_1$, split them into batches of size $O(\sqrt{N})$, and run the VCL game on each batch (cf. Figure 6).
4. Let $\hat y^i_{\sqrt{N}/5}$, $i \in [\sqrt{N}]$, be the pattern avoidance functions from the previous step, and for every $i$ let
$$F^i_{\sqrt{N}/5} := \big\{f : \mathcal{X} \to \{0,1,\star\} : (f(x_1), \ldots, f(x_{\ell_i})) \ne \hat y^i_{\sqrt{N}/5}(x_1, \ldots, x_{\ell_i})\big\}.$$
5. Let
$$\hat d_N := \min\big\{d \in \mathbb{N} : \exists\, i_1 < i_2 < \cdots < i_{(9/10)\sqrt{N}} \text{ with } d(F^{i_j}_{\sqrt{N}/5}) \le d,\ \forall j \in [(9/10)\sqrt{N}]\big\}.$$
6. Request the labels of the first $N/5$ points of $S_2$ and let $V^i_{N/5} := \{f \in F^i_{\sqrt{N}/5} : f(x) = y \text{ for these } N/5 \text{ points}\}$.
7. For $k = 0, \ldots, \hat d_N$:
(a) Let $\tilde S_3$ be the next $N^2$ examples in $S_3$ and let $\tilde m_N = N/(5(\hat d_N + 1))$ be the label budget.
(b) Let $\hat S = \{\}$.
(c) For each point $X \in \tilde S_3$:
i. For each $y \in \{0,1\}$, estimate $\hat p_{N,k}, \hat p^X_{N,k}, \hat p^{(X,y)}_{N,k}$ using the $V^i_{N/5}$ and fresh unlabeled points from $U$, as described in Proposition D.6.
ii. If $\hat p^X_{N,k} < \hat p_{N,k}/2$, infer $y' = \mathrm{argmax}_{y \in \{0,1\}} \hat p^{(X,y)}_{N,k}$; otherwise request the label $y'$ of $X$.
iii. Append $(X, y')$ to $\hat S$ and update the label budget $\tilde m_N$. If $\tilde m_N = 0$, jump to Step (d).
(d) Run $A^p$ on $\hat S$ and let $\hat h_k$ be its output.
8. Return Active Select$\big(\{\hat h_0, \ldots, \hat h_{\hat d_N}, \hat h^p_N\}, N/5, S_4\big)$ (cf. Figure 8).

Figure 8: Active Select Algorithm (Hanneke, 2012).
Active Select (Hanneke, 2012). Input is a set of classifiers $\{h_1, \ldots, h_L\}$, label budget $m$, sequence of unlabeled examples $U$.
1. For each $j, k \in \{1, \ldots, L\}$ s.t. $j < k$:
(a) Let $R_{jk}$ be the first $\big\lfloor \frac{m}{j(L-j)\ln(eL)} \big\rfloor$ points in $U \cap \{x : h_j(x) \ne h_k(x)\}$ (if such a value exists).
(b) Request the labels for $R_{jk}$ and let $Q_{jk}$ be the set of labeled examples.
(c) Let $m_{kj} = \mathrm{er}_{Q_{jk}}(h_k)$.
2. Return $h_{\hat k}$, where $\hat k = \max\{k \in \{1, \ldots, L\} : \max_j$