# Active Learning Through a Covering Lens

Ofer Yehuda, Avihu Dekel, Guy Hacohen, Daphna Weinshall
School of Computer Science & Engineering, Edmond and Lily Safra Center for Brain Sciences
The Hebrew University of Jerusalem, Jerusalem 91904, Israel
{ofer.yehuda,avihu.dekel,guy.hacohen,daphna}@mail.huji.ac.il

## Abstract

Deep active learning aims to reduce the annotation cost of training deep models, which are notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which endow the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a covering lens, proposing ProbCover, a new active learning algorithm for the low-budget regime, which seeks to maximize probability coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high-budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state of the art in the low-budget regime on several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods while using far fewer labels. Code is available at https://github.com/avihu111/TypiClust.

## 1 Introduction

For the most part, deep learning technology critically depends on access to large amounts of annotated data. Yet annotations are costly, and remain so even in our era of Big Data. Deep active learning (AL) aims to alleviate this problem by improving the utility of the annotated data. Specifically, given a fixed budget b of examples that can be annotated, and some deep learner, AL algorithms aim to query those b examples that will most benefit this learner.

In order to optimally choose unlabeled examples to be annotated, most deep AL strategies follow some combination of two main principles: 1) Uncertainty sampling [e.g., 26, 48, 3, 14, 15, 36, 25], in which examples that the learner is most uncertain about are picked, to maximize the added value of the new annotations. 2) Diversity sampling [e.g., 1, 22, 16, 38, 18, 40, 37, 17, 47, 42, 46], in which examples are chosen from diverse regions of the data distribution, to represent it wholly and reduce redundancy in the annotation.

Most AL methods fail to improve over random selection when the annotation budget is very small [35, 39, 7, 32, 55, 21, 2], a phenomenon sometimes termed "cold start" [8, 49, 16, 50, 23]. When the budget contains only a few examples, these methods struggle to improve the model's performance, and may even fail to reach the accuracy of the random baseline. Recently, it was shown that uncertainty sampling is inherently unsuited for the low-budget regime, which may explain the cold start phenomenon [19]. The low-budget scenario is relevant in many applications, especially those requiring an expert tagger
whose time is expensive (e.g., a radiologist tagger for tumor detection). If we want to expand deep learning to new domains, overcoming the cold start problem is an increasingly important task.

Figure 1: ProbCover selection (top) vs. Coreset selection (bottom) of 5/20/50 samples (out of 600), shown in panels (a)-(c). Selected points are marked by ×, color-coded by density (see the color bar on the right). Density is measured using Gaussian kernel density estimation, and the covered area is marked in light blue. Coreset attempts to minimize the ball size, constrained by complete coverage, while ProbCover attempts to maximize coverage, constrained by a fixed ball size. Note that especially at low budgets, such as 5 samples, Coreset selects only outliers of the distribution (yellow), while ProbCover selects from dense regions of the distribution (red).

In this work, we focus on understanding the very low budget regime of AL, where the budget of b examples cannot dependably represent the annotated data distribution. To face this challenge, in Sections 2.1-2.2 we model the problem as Max Probability Cover, defined as follows: given some data distribution and a radius δ, select the b examples that maximize the probability of the union of balls of radius δ around each example. We further show that under a separation assumption that is realistic in semantic embedding spaces, Max Probability Cover is well suited to the nearest-neighbor classification model, in that it minimizes an upper bound on its generalization error.

In Section 2.4 we show a connection with existing deep AL methods, like Coreset [37], and explain why those methods are more suitable for the high-budget regime than the low-budget regime. This phenomenon is visualized in Fig. 1, where we see that with only a few examples to choose, Coreset, an AL strategy that employs the diversity sampling principle, chooses distant and often abnormal points, while ProbCover chooses representative examples.

When using the empirical data distribution, we further show that Max Probability Cover can be reduced to Max Coverage, a classical NP-hard problem [34] (see Section 2.2). To obtain a practical AL strategy, in Section 3 we adapt a greedy algorithm for the selection of b examples from a fixed finite set of unlabeled examples (the training set), which guarantees a $1-\frac{1}{e}$ approximation to the problem. We call this new method ProbCover.

In Section 4 we empirically evaluate the performance of ProbCover on several computer vision datasets, including CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and its subsets. ProbCover is shown to significantly outperform all alternative deep AL methods in the very low-budget regime. Additionally, ProbCover improves the performance of state-of-the-art semi-supervised methods, which until recently were thought to make AL redundant [6], allowing for the learning of computer vision tasks with very few annotated examples.

Relation to prior art. Recent work investigated AL methods based on an approximation to a facility location problem [45, 30, 54, 37, 43], which is a variant of the covering problem. In the minimax facility location problem [13], the entire distribution is covered with a fixed number of balls, which can vary in size, whereas in ProbCover the size of the balls is fixed, and we are allowed to cover only part of the total distribution.
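In symbols, and anticipating the formal statements in Sections 2.2 and 2.4, the two covering objectives can be contrasted as follows (a restatement in the paper's own notation, where $B_\delta(c)$ is the $\delta$-ball around $c$, $d$ is the distance in the embedding space, and $b$ is the budget):

```latex
% Coreset / minimax facility location: cover everything, minimizing the ball radius
\min_{L \subseteq X,\; |L| = b} \ \max_{x \in X} \ \min_{c \in L} \ d(x, c)

% Max Probability Cover (ProbCover): fix the radius delta, maximizing the covered probability mass
\max_{L \subseteq X,\; |L| = b} \ P\Big( \bigcup_{c \in L} B_\delta(c) \Big)
```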
While this difference may seem minor, in the low-budget regime, when the budget is not large enough to represent the data, the examples chosen by the facility location problem are not representative (as illustrated in Fig. 1), which leads to poor performance (as shown in Fig. 10).

Summary of contribution. (i) We develop a theoretical framework for analyzing AL strategies in embedding spaces, with "dual" low- and high-budget interpretations. (ii) We introduce ProbCover, a low-budget AL strategy motivated by our framework, which significantly outperforms other methods in the low-budget regime. (iii) We demonstrate the outstanding competence of ProbCover in semi-supervised learning with very few labeled examples.

## 2 Theoretical Analysis

To address the challenge of active learning with low budgets, we adopt a point-coverage framework. Specifically, we analyze the generalization of the Nearest Neighbor (NN) classification model, as this model depends exclusively on distances from a set of training examples and does not involve any additional inductive bias. Thus in Section 2.1 we develop a bound on the generalization error of 1-NN models. It follows from the analysis (see discussion below) that if full coverage is required, which is only practical in the high-budget regime, the minimization of this bound translates to the minimax facility location problem, which is known to be NP-hard and which the Coreset AL algorithm by Sener and Savarese [37] is designed to approximate.

In contrast, in the low-budget regime, the aforementioned bound can best be optimized by seeking a labeled set L whose probability of covering the unlabeled set is maximal. In Section 2.2 we show that this problem is also NP-hard. Furthermore, when the data distribution is not known and is approximated by the empirical distribution, we show that it is equivalent to the classical Max Coverage problem. ProbCover, described in Section 3, is designed to solve this problem. In Section 2.4 we discuss a sense in which the high-budget and low-budget problems are dual.

### 2.1 Bounding the Generalization Error

We shall now derive a bound on the generalization error of the 1 Nearest Neighbor (1-NN) classifier. We start with some necessary notations and definitions. Most important is the assumption of δ-purity, which states that most of the time, points that are less than δ apart have the same label. We then prove a lemma, showing that given a labeled set L and the coverage it achieves, and given the δ-purity assumption, the probability of a point being inside this cover and still being falsely labeled is small. From this, we finally derive a bound on the generalization error, which is stated in Thm. 1.

Notations. Let $\mathcal{X}$ denote the input domain, whose underlying probability function is denoted $P$, and let $\mathcal{Y} = [k]$ denote the target domain. Assume that a true labeling function $f: \mathcal{X} \to \mathcal{Y}$ exists. Let $X = \{x_i\}_{i=1}^{m}$ denote an unlabeled set of points, and $b \le m$ the annotation budget. Let $L \subseteq X$ denote the labeled set, where $|L| = b$. Let $B_\delta(x) = \{x' : \|x - x'\|_2 \le \delta\}$ denote a ball centered at $x$ of radius $\delta$. Let $C \triangleq C(L, \delta) = \bigcup_{x \in L} B_\delta(x)$ denote the region covered by $\delta$-balls centered at the labeled examples in $L$. We call $C(L, \delta)$ the covered region and $P(C)$ the coverage.

Definition 2.1. We say that a ball $B_\delta(x)$ is pure if $\forall x' \in B_\delta(x): f(x') = f(x)$.

Definition 2.2. We define the purity of $\delta$ as $\pi(\delta) = P\left(\{x : B_\delta(x) \text{ is pure}\}\right)$.

Notice that $\pi(\delta)$ is monotonically decreasing. Let $\hat{f}$ denote the 1-NN classifier based on $L$.
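To make these definitions concrete, below is a minimal sketch (not taken from the paper's code) that estimates the empirical coverage $P(C(L, \delta))$ and the empirical purity $\pi(\delta)$ from a feature matrix; the function names and the dense pairwise-distance computation are illustrative assumptions, and true labels are used here only to ground Definition 2.2 (Section 3.2 later replaces them with pseudo-labels).

```python
import numpy as np

def coverage(features, labeled_idx, delta):
    """Empirical P(C(L, delta)): fraction of points within delta of some labeled example."""
    L = features[labeled_idx]                                                      # (b, d)
    dists = np.linalg.norm(features[:, None, :] - L[None, :, :], axis=-1)          # (m, b)
    return float((dists.min(axis=1) <= delta).mean())

def purity(features, labels, delta):
    """Empirical pi(delta): fraction of points whose delta-ball contains only same-label points."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)   # (m, m), O(m^2) memory
    pure = [(labels[dists[i] <= delta] == labels[i]).all() for i in range(len(features))]
    return float(np.mean(pure))
```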
We split the covered region $C(L, \delta)$ into two sets:

$$C_{\mathrm{right}} = \{x \in C : \hat{f}(x) = f(x)\}, \qquad C_{\mathrm{wrong}} = C \setminus C_{\mathrm{right}}.$$

Lemma 1. $C_{\mathrm{wrong}} \subseteq \{x : B_\delta(x) \text{ is not pure}\}$.

Proof. Let $x \in C_{\mathrm{wrong}}$, and let $c \in L$ denote the nearest labeled neighbor of $x$. Then they have the same predicted label, $\hat{f}(x) = \hat{f}(c)$, and $f(c) = \hat{f}(c)$ because $c$ is labeled. Since $x$ is wrongly labeled, $\hat{f}(x) \ne f(x)$, which implies that $f(c) = \hat{f}(c) = \hat{f}(x) \ne f(x)$. Finally, since $x \in C_{\mathrm{wrong}} \subseteq C$ lies in the covered region, $d(x, c) \le \delta$, which means that $c \in B_\delta(x)$ with a different label, and so $B_\delta(x)$ is not pure.

Corollary 1. $P(C_{\mathrm{wrong}}) \le P(\{x : B_\delta(x) \text{ is not pure}\}) = 1 - \pi(\delta)$.

Theorem 1. The generalization error of the 1-NN classifier $\hat{f}$ is bounded as follows:

$$\mathbb{E}\left[\mathbb{1}_{\hat{f}(x) \ne f(x)}\right] \le \left(1 - P(C(L, \delta))\right) + \left(1 - \pi(\delta)\right). \tag{1}$$

Proof.

$$\begin{aligned}
\mathbb{E}\left[\mathbb{1}_{\hat{f}(x) \ne f(x)}\right] &= \mathbb{E}\left[\mathbb{1}_{f(x) \ne \hat{f}(x)} \mathbb{1}_{x \notin C}\right] + \mathbb{E}\left[\mathbb{1}_{f(x) \ne \hat{f}(x)} \mathbb{1}_{x \in C}\right] \\
&\le P(x \notin C) + \mathbb{E}\left[\mathbb{1}_{f \ne \hat{f}} \mathbb{1}_{x \in C_{\mathrm{right}}}\right] + \mathbb{E}\left[\mathbb{1}_{f(x) \ne \hat{f}(x)} \mathbb{1}_{x \in C_{\mathrm{wrong}}}\right] \\
&\le P(x \notin C) + 0 + P(x \in C_{\mathrm{wrong}}) \\
&\le \left(1 - P(C(L, \delta))\right) + \left(1 - \pi(\delta)\right).
\end{aligned}$$

Note that (1) gives us a different bound for different δ values, which also depends on the labeled set L. This bound introduces a trade-off: as δ increases, the coverage increases, but the purity decreases. Ideally, we should seek a pair {δ, L} that achieves the tightest bound.

Discussion. We can interpret (1) in the context of two boundary conditions of AL: high budget and low budget. In the high-budget regime, achieving full coverage $P(C) = 1$ is feasible as we have many points, and the remaining challenge is to reduce $1 - \pi(\delta)$. Accordingly, since $\pi(\delta)$ is monotonically decreasing, we seek to minimize δ subject to the constraint $P(C) = 1$. This is similar to Coreset [37]. In the low-budget regime, full coverage entails very low purity, which (if sufficiently low) makes the bound trivially 1. Thus, instead of insisting on full coverage, we fix a δ that yields a "large enough" purity $\pi(\delta) > 0$, and then seek a labeled set L that maximizes the coverage $P(C)$. We call this problem Max Probability Cover.

### 2.2 Max Probability Cover

Definition 2.3 (Max Probability Cover). Fix $\delta > 0$, and find a subset $L \subseteq X$, $|L| = b$, that maximizes the probability of the covered region $P(C(L, \delta))$:

$$\operatorname*{argmax}_{L \subseteq X,\ |L| = b} P\Big(\bigcup_{x \in L} B_\delta(x)\Big) \tag{2}$$

An optimal solution to (2) would minimize the bound in (1) when δ is fixed. Unfortunately, when moving to practical settings there are two obstacles. The first is complexity:

Theorem 2. Max Probability Cover is NP-hard.

Proof. (Sketch, see full proof in App. B.) We construct a reduction from an established NP-hard problem (Max Coverage, see Def. A.1) to Max Probability Cover. For the collection of subsets $S = \{S_1, \ldots, S_m\}$, we consider the space $\mathbb{R}^m$ and a collection of δ-balls $\{B_\delta(x_i)\}_{i=1}^{m}$ with the exhaustive intersection property. This means that any subset of the balls has at least one point that is contained in all the balls in the subset, but not contained in any other ball (see example in Fig. 2). The existence of such a collection of balls in $\mathbb{R}^m$, for every $m$, is proved in Lemma 3 (see App. B). We then assign each $S_i$ to $B_\delta(x_i)$, and each element in $S_i$ is mapped to a point in the intersection of all the balls assigned to the subsets that contain it. Each such point then defines a Dirac measure, the normalized sum of which determines a probability distribution on $\mathbb{R}^m$. The selection of δ-balls that solves Max Probability Cover can then be translated back to a selection of subsets, which solves the original Max Coverage problem.

Figure 2: Illustration in $\mathbb{R}^2$ of exhaustive intersection (see Def. B.2). (a) With 3 balls, every subset of balls has a point that is contained only in that specific subset.
Thus this set of 3 balls has the exhaustive intersection property. (b) With 4 balls, any point in the intersection of two opposite balls is also contained in at least one other ball. Thus this set of 4 balls does not have the exhaustive intersection property. In the drawing, the region of intersection between the red and blue balls is outlined in purple, while the points within this region that are unique to this pair are marked in light purple. Note that in example (b), this set is empty.

### 2.3 Using the Empirical Distribution

When employing Max Probability Cover, the second practical problem concerns the data distribution, which is hardly ever known a priori. In fact, even when known, the subsequent probabilistic computations are often intractable and hard to approximate. Instead, we may use the empirical distribution $\hat{P}(A) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{x_i \in A}$ as an approximation, which gives us the following useful result:

Proposition. When $P$ is the empirical distribution $\hat{P}$, the Max Probability Cover objective function is equivalent to the Max Coverage objective, with $\{B_\delta(x_i) \cap X\}_{i=1}^{m}$ as the collection of subsets.

Proof. Given a labeled set $L = \{x_i\}_{i=1}^{b}$, we show equality of the objectives up to the constant $\frac{1}{|X|}$:

$$\hat{P}\Big(\bigcup_{i} B_\delta(x_i)\Big) = \hat{P}\big(\{y \in \mathbb{R}^d \mid \exists i,\ \|x_i - y\| \le \delta\}\big) = \frac{1}{|X|} \big|\{x \in X \mid \exists i,\ \|x_i - x\| \le \delta\}\big| = \frac{1}{|X|} \Big|\bigcup_{i=1}^{b} \big(B_\delta(x_i) \cap X\big)\Big|.$$

### 2.4 The "Duality" of Max Probability Cover and Coreset

The Coreset AL method by Sener and Savarese [37] minimizes the objective

$$\delta(L) = \max_{x \in X} \min_{c \in L} d(x, c) = \min\Big\{\delta \in \mathbb{R}^{+} : X \subseteq \bigcup_{c \in L} B_\delta(c)\Big\}.$$

We can rewrite the above in the language of distributions as

$$\delta'(L) = \min\Big\{\delta \in \mathbb{R}^{+} : P\Big(\bigcup_{c \in L} B_\delta(c)\Big) = 1\Big\}.$$

If we use the empirical distribution, then $\delta(L) = \delta'(L)$. In this framework we can say that Max Probability Cover and Coreset are dual problems in the following loose sense:

1. Max Probability Cover minimizes the generalization error bound (1) when we fix δ and seek to maximize the coverage, which is suitable for the low-budget regime.
2. Coreset minimizes the generalization error bound (1) when we fix the coverage at 1 and minimize δ, which is suitable for the high-budget regime, because only then can we fix the coverage at 1.

This duality is visualized in Fig. 1.

## 3 Method: ProbCover

To deliver a practical method, we first note that our approach implicitly relies on the existence of a good embedding space [4, 10, 53], where distance is correlated with semantic similarity, and where similar points are likely to bunch together in high-density regions. As is now customary [e.g., 30, 19], we use an embedding space derived by training a self-supervised task over the large unlabeled pool. In such a space, similar labels often correspond to short distances, making 1-NN classification suitable, and also providing for the existence of large enough δ-balls with good purity and coverage properties.

Secondly, we note that Max Coverage is NP-hard and cannot be solved efficiently. Instead, as its objective is submodular and monotone [27], we use the greedy approximation algorithm that achieves a $1 - \frac{1}{e}$ approximation [27]. A better approximation is impractical, as shown in App. D.1. See App. E for an additional time and space complexity analysis. Below, we describe the greedy algorithm in Section 3.1, and the estimation of the ball size δ in Section 3.2.
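Section 3.1 gives the precise graph-based pseudo-code; as a preview, the following is a minimal Python sketch of the same greedy selection under stated assumptions (dense pairwise distances, NumPy bookkeeping). It is an illustration, not the authors' implementation, which uses a sparse adjacency graph (App. E).

```python
import numpy as np

def probcover_greedy(features, delta, budget, labeled_idx=()):
    """Greedily pick points whose delta-balls cover the most still-uncovered examples."""
    m = len(features)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    covers = dists <= delta                        # covers[i, j]: point j lies in the delta-ball of point i

    uncovered = np.ones(m, dtype=bool)
    for c in labeled_idx:                          # already-labeled points cover their delta-balls
        uncovered &= ~covers[c]

    queries = []
    for _ in range(budget):
        gains = covers[:, uncovered].sum(axis=1)   # out-degree restricted to uncovered points
        c = int(np.argmax(gains))                  # highest out-degree wins
        queries.append(c)
        uncovered &= ~covers[c]                    # remove all points newly covered by c
    return queries
```

Because the coverage objective is monotone and submodular [27], this greedy loop inherits the $1 - 1/e$ approximation guarantee mentioned above.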
### 3.1 Greedy Algorithm

Algorithm 1: ProbCover
Input: unlabeled pool $U$, labeled pool $L$, budget $b$, ball size $\delta$
Output: a set of points to query
  $X \leftarrow$ embedding of a representation learning algorithm on $U \cup L$
  $G = (V = X,\ E = \{(x, x') : x' \in B_\delta(x)\})$
  for all $c \in L$ do
    remove from $E$ the incoming edges to covered vertices, $\{(x', x) \in E : (c, x) \in E\}$
  end for
  Queries $\leftarrow \emptyset$
  for $i = 1, \ldots, b$ do
    add $c \in U$ with the highest out-degree in $G$ to Queries
    remove from $E$ the incoming edges to covered vertices, $\{(x', x) \in E : (c, x) \in E\}$
  end for
  return Queries

The algorithm (see Alg. 1 for pseudo-code) proceeds as follows. First, construct a directed graph $G = (V, E)$, with $V = X$ the embedding of the data, and $(x, x') \in E \iff x' \in B_\delta(x) \iff d(x, x') \le \delta$. In $G$, each vertex represents a specific example, and there is an edge between two vertices $(x, x')$ if $x'$ is covered by the δ-ball centered at $x$ (distances are measured in the embedding space). The algorithm then performs $b$ iterations of the following two steps: (i) pick the vertex $x_{\max}$ with the highest out-degree for annotation; (ii) remove all incoming edges to $x_{\max}$ and its neighbors.

As ProbCover uses a sparse representation of the adjacency graph, it is able to scale to large datasets while requiring limited space resources. The complexity analysis of the algorithm, and specifically the complexity of constructing the adjacency graph and of the sample selection, is discussed in App. E.

### 3.2 Estimating δ

Our algorithm requires the specification of the hyper-parameter δ, the ball radius, whose value depends on details of the embedding space (see App. C.1 for the embeddings used). In choosing δ, we need to consider the trade-off between large coverage $P(C)$ and high purity $\pi(\delta)$. We resolve this trade-off with the following heuristic: we pick the largest δ possible, while maintaining purity above a certain threshold $\alpha \in (0, 1)$. Specifically,

$$\delta^{*} = \max\{\delta : \pi(\delta) \ge \alpha\}.$$

Importantly, α is more intuitive to tune, and is kept constant across different datasets (unlike δ).

We still need to estimate the purity $\pi(\delta)$, which depends on the labels, from unlabeled data. To this end, we estimate purity using unsupervised representation learning and clustering. First, we cluster the self-supervised features using k-means, with k equal to the number of classes. For a given δ, we compute the purity $\pi(\delta)$ using the cluster assignments as pseudo-labels for each example. Searching for the best δ, we repeat the process and pick the largest δ such that at least $\alpha = 0.95$ of the balls are pure. In Fig. 3, we plot the percentage of pure balls across different datasets as a function of δ, where the dashed line represents the δ chosen by ProbCover.

Figure 3: Ball purity as a function of δ, estimated from the unlabeled data (see text), for (a) CIFAR-10, (b) CIFAR-100, (c) Tiny ImageNet, (d) ImageNet. The dashed line marks the largest δ, after which purity drops below α = 0.95.

## 4 Empirical Results

We report a set of empirical results, comparing ProbCover to other AL strategies in a variety of settings. We focus on the very low budget regime, with a budget size b of a similar order of magnitude as the number of classes. Note that since the data is picked from an unlabeled pool, chances are that the initial labeled set will not be balanced across classes, and in the early stages of training some classes will almost always be missing. ProbCover's excellent performance nevertheless, as seen below, demonstrates its robustness in the presence of this hurdle.
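As a brief aside before the experimental details, the δ-selection heuristic of Section 3.2 is straightforward to reproduce. The sketch below assumes precomputed self-supervised features and uses scikit-learn's k-means for the pseudo-labels; it illustrates the heuristic with a dense distance matrix and a user-supplied grid of candidate δ values, and is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_delta(features, num_classes, candidate_deltas, alpha=0.95, seed=0):
    """Largest delta whose pseudo-label ball purity is at least alpha (Section 3.2 heuristic)."""
    pseudo = KMeans(n_clusters=num_classes, random_state=seed).fit_predict(features)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)  # dense O(m^2) sketch
    best = None
    for delta in sorted(candidate_deltas):
        # a ball is pure if every point within delta of its center shares the center's pseudo-label
        pure = np.array([(pseudo[dists[i] <= delta] == pseudo[i]).all() for i in range(len(features))])
        if pure.mean() >= alpha:
            best = delta        # purity decreases monotonically in delta, so keep the largest feasible value
        else:
            break
    return best
```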
### 4.1 Methodology

Three deep AL frameworks are evaluated: (i) Fully supervised: train a ResNet-18 only on the annotated data, as a fully supervised task. (ii) Semi-supervised by transfer learning: create a representation of the data by training with a self-supervised task on the unlabeled data, then construct a 1-NN classifier using the ensuing representation in a supervised manner. This framework is intended to capture the basic benefits of semi-supervised learning, regardless of the added benefits provided by modern semi-supervised learning methods and the more sophisticated derivation of pseudo-labels. (iii) Fully semi-supervised: train a competitive semi-supervised model on both the annotated and unlabeled data. In our experiments we use FlexMatch by Zhang et al. [51].

In frameworks (i) and (ii) we adopt the evaluation kit created by Munjal et al. [33], in which we can compare multiple deep AL strategies in a principled way. In framework (iii), we adopt the code and hyper-parameters provided by FlexMatch.

When evaluating frameworks (i) and (ii), we compare ProbCover to the following deep AL strategies as baselines: (1) Random: query uniformly at random. (2)-(4) query examples with the lowest score, using the following basic scores: (2) Uncertainty: the maximal softmax output, (3) Margin: the margin between the two highest softmax outputs, (4) Entropy: the inverse entropy of the softmax outputs. (5) BADGE [1]. (6) DBAL [15]. (7) TypiClust [19]. (8) BALD [26]. (9) W-Dist [30], see also App. D.4. (10) Coreset [37]. We note that while most baseline methods are suitable for the high-budget regime, TypiClust and W-Dist are also suitable for the low-budget regime. Similarly to ProbCover, TypiClust requires a good embedding space to work properly. When comparing ProbCover and TypiClust, and in order to avoid possible confounds, we use the same embedding space for both methods.

These AL methods are evaluated on the following classification datasets: CIFAR-10/100 [28], Tiny ImageNet [29], ImageNet [12] and its subsets (following Van Gansbeke et al. [41]). For CIFAR-10/100 and Tiny ImageNet, we use as input the embedding of SimCLR [9] across all methods. For ImageNet we use as input the embedding of DINO [5] throughout. Results on ImageNet-50/100 are deferred to App. D.1. Details concerning specific networks and hyper-parameters can be found in App. C, and in the attached code in the supplementary material.

When evaluating frameworks (i) and (ii), we perform 5 active learning rounds, querying a fixed budget of b examples in each round. In framework (iii), as FlexMatch is computationally demanding, we only evaluate methods on their initial pool selection capabilities.

### 4.2 Main Results

(i) Fully supervised framework. We evaluate different AL methods based on the performance of a deep neural network trained directly on the raw queried data. In each round, we query b samples, where b is equal to the number of classes in the dataset, and train a ResNet-18 on the accumulated queried set. We repeat this for 5 active learning rounds, and plot the mean accuracy over 5 repetitions (3 for ImageNet) in Fig. 4 (see App. D.1 for additional results).

Figure 4: Framework (i), fully supervised: the performance of ProbCover is compared with baseline AL strategies on image classification tasks in the low-budget regime, for (a) CIFAR-10, (b) CIFAR-100, (c) Tiny ImageNet, (d) ImageNet. The budget b guarantees on average 1 sample per class, thus the initial sample may be imbalanced.
The final average test accuracy in each iteration is reported, using 5 repetitions (3 for ImageNet). The shaded area reflects the standard error across repetitions.

(ii) Semi-supervised by transfer learning. In this framework, we make use of pretrained self-supervised features, and measure classification performance using the 1-NN classifier. Accordingly, each point is classified by the label of its nearest neighbor (within the selected labeled set L) in the self-supervised feature space. In low budgets, this framework outperforms the fully supervised framework (i), though it is not as effective as the full-blown semi-supervised learning framework (iii). This supports the generality of our findings, which are not limited to any specific semi-supervised method. Similarly to Fig. 4, in Fig. 5 we plot the mean accuracy over 5 repetitions for the different tasks.

Figure 5: Comparative evaluation of framework (ii), semi-supervised by transfer learning, for (a) CIFAR-10, (b) CIFAR-100, (c) Tiny ImageNet, (d) ImageNet; see the caption of Fig. 4.

(iii) Semi-supervised framework. We compare the performance of different AL strategies used prior to running FlexMatch, a state-of-the-art semi-supervised method. In Fig. 6 we show results with 3 repetitions of FlexMatch, using the labeled sets provided by the different AL strategies and a budget b equal to the number of classes. We see that ProbCover outperforms random sampling and the other AL baselines by a large margin. We note that in agreement with previous works [6, 19], AL strategies that are suited for high budgets do not improve on the results of random sampling, while AL strategies that are suited for low budgets achieve large improvements.

### 4.3 Ablation Study

We report a set of ablation studies, evaluating the added value of each step of ProbCover.

Figure 6: Framework (iii), semi-supervised: comparison of AL strategies in a semi-supervised task, for (a) CIFAR-10, (b) CIFAR-100, (c) Tiny ImageNet. Each bar shows the mean test accuracy after 3 repetitions of FlexMatch trained using b labeled examples, where b is equal to the number of classes in each task. Error bars denote the standard error.

Random initial selection. When following the uncertainty sampling principle, as many AL methods do, a trained learner is needed. Any such method therefore requires a non-empty initial pool of labeled examples to train a rudimentary learner, from which uncertainty selection can be bootstrapped. In the set of methods evaluated here (see Section 4.1), only two, ProbCover and TypiClust, are not affected by this problem. This can be seen in Fig. 4, noting that only these two methods do better than random in the initial step. Is this the only reason they outperform other methods in low budgets? To address this question, we repeat the experiments reported in Figs. 4a-4b, using an initial random set of annotated examples across the board, for all methods. Results are reported in Fig. 7. When comparing Figs. 4a-4b and Fig. 7, we see that the advantage of ProbCover and TypiClust goes beyond the initial set selection, and remains in effect even when this factor is eliminated.

Figure 7: Random initial pool in the supervised framework, with an average of 1 sample per class, for (a) CIFAR-10, (b) CIFAR-100.

Figure 8: Comparison of ProbCover applied to the raw data vs. the embedding space, for (a) CIFAR-10, (b) CIFAR-100.

RGB space distances. As discussed in Section 3, our approach relies on the existence of a good embedding space, where distance is correlated with semantic similarity.
We now verify this claim by repeating the basic fully supervised experiments (Fig. 4) with one difference: ProbCover can only use the original RGB space representation to compute distances. Results are shown in Fig. 8. When comparing the original ProbCover with its variant using the RGB space, a significant drop in performance is seen, as expected, demonstrating the importance of the semantic embedding space.

The interaction between δ and budget size. To understand the interaction between the hyper-parameter δ and the budget b, we repeat our basic experiments (Fig. 4) with different choices of δ and b using CIFAR-10. For each pair (δ, b), we select an initial pool of b examples using ProbCover with δ-balls, and report the difference in accuracy from the selection of b random points. Average results across 3 repetitions are shown in Fig. 9 as a function of b. We see that as the budget b increases, smaller values of δ are preferred.

Coreset vs. ProbCover. In Section 2.4 we argue that ProbCover is suitable for low budgets, while Coreset is suitable for high budgets. To verify this claim, we compare their performance under the following 3 setups, while using the same embedding space, and report results on CIFAR-10:

Low budget: select an initial pool of 100 samples using the SimCLR representation.

High budget: train a model on 5K randomly selected examples, then select an additional set of 5K examples using the learner's latent representation. This is the setup used by Sener and Savarese [37].

Mid budget: same as the high budget setup, except that the initial pool size and the added budget are 1K.

Figure 9: The accuracy difference between ProbCover with different δ values and the outcome of b random samples (average over 3 repetitions).

Figure 10: Comparing the performance of ProbCover and Coreset under the supervised framework in different budget regimes: (a) low budget, (b) mid budget, (c) high budget. The low budget panel shows an initial pool selection of 100 samples. The mid/high budget panels start with 1K/5K samples and query an additional 1K/5K samples (see text).

Results are reported in Fig. 10. In the low-budget regime, ProbCover outperforms Coreset, as would be expected. In the mid-budget regime, where the feature space of the learner is informative, only ProbCover achieves a significant improvement over random selection. In the high-budget regime, Coreset improves over random selection, while ProbCover is the least effective.

## 5 Summary and Discussion

We study the problem of AL in the low-budget regime. We model the problem as Max Probability Cover, showing that under certain assumptions on the data distribution, which are likely to hold in self-supervised embedding spaces, it optimizes an upper bound on the generalization error of a 1-NN classifier. We devise an AL strategy termed ProbCover, which approximates the optimal solution. We empirically evaluate it in supervised and semi-supervised frameworks across different datasets, showing that ProbCover significantly outperforms other methods in the low-budget regime.

In future work we intend to investigate: (i) possible avenues for improving the choice of δ, by making use of already known labels or by inferring a score for δ through the topology of the resulting covering graph; (ii) extensions of the current formulation of Max Probability Cover, making δ, the radius of the balls, dependent on the samples rather than uniform; (iii) soft-coverage approaches, where the covering notion is not binary but some continuous measure, which may allow us to do away with δ altogether.
## Acknowledgments

This work was supported by the Israeli Ministry of Science and Technology, and by the Gatsby Charitable Foundation. We are grateful to our dedicated NeurIPS AC, who acted upon the lengthy discussion between us and the reviewers, as seen on OpenReview.

## References

[1] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

[2] Josh Attenberg and Foster Provost. Why label when you can search? Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 423-432, 2010.

[3] William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9368-9377, 2018.

[4] Javad Zolfaghari Bengar, Joost van de Weijer, Bartlomiej Twardowski, and Bogdan Raducanu. Reducing label effort: Self-supervised meets active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1631-1639, 2021.

[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

[6] Yao-Chun Chan, Mingchen Li, and Samet Oymak. On the marginal benefit of active learning: Does self-supervision eat its cake? In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3455-3459. IEEE, 2021.

[7] Akshay L. Chandra, Sai Vikas Desai, Chaitanya Devaguptapu, and Vineeth N. Balasubramanian. On initial pools for deep active learning. In NeurIPS 2020 Workshop on Pre-registration in Machine Learning, pages 14-32. PMLR, 2021.

[8] Liangyu Chen, Yutong Bai, Siyu Huang, Yongyi Lu, Bihan Wen, Alan L. Yuille, and Zongwei Zhou. Making your first choice: To address cold start problem in vision active learning. arXiv preprint arXiv:2210.02442, 2022.

[9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597-1607. PMLR, 2020.

[10] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[11] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702-703, 2020.

[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

[13] Reza Zanjirani Farahani and Masoud Hekmatfar.
Facility location: Concepts, models, algorithms and case studies. Springer Science & Business Media, 2009.

[14] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059. PMLR, 2016.

[15] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning, pages 1183-1192. PMLR, 2017.

[16] Mingfei Gao, Zizhao Zhang, Guo Yu, Sercan Ö. Arık, Larry S. Davis, and Tomas Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost. In European Conference on Computer Vision, pages 510-526. Springer, 2020.

[17] Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941, 2017.

[18] Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. arXiv preprint arXiv:1907.06347, 2019.

[19] Guy Hacohen, Avihu Dekel, and Daphna Weinshall. Active learning on a budget: Opposite strategies suit high and low budgets. arXiv preprint arXiv:2202.02794, 2022.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[21] Tao He, Xiaoming Jin, Guiguang Ding, Lan Yi, and Chenggang Yan. Towards better uncertainty sampling: Active learning with multiple views for deep convolutional neural network. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1360-1365. IEEE, 2019.

[22] Seul Gi Hong, Heonjin Ha, Junmo Kim, and Min-Kook Choi. Deep active learning with augmentation-based consistency estimation. arXiv preprint arXiv:2011.02666, 2020.

[23] Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In International Conference on Machine Learning, pages 766-774. PMLR, 2014.

[24] Harry B. Hunt III, Madhav V. Marathe, Venkatesh Radhakrishnan, Shankar S. Ravi, Daniel J. Rosenkrantz, and Richard E. Stearns. NC-approximation schemes for NP- and PSPACE-hard problems for geometric graphs. Journal of Algorithms, 26(2):238-274, 1998.

[25] Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2372-2379. IEEE, 2009.

[26] Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. Advances in Neural Information Processing Systems, 32:7026-7037, 2019.

[27] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3:71-104, 2014.

[28] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Online, 2009.

[29] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.

[30] Rafid Mahmood, Sanja Fidler, and Marc T. Law. Low budget active learning via Wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.

[31] Dániel Marx. Efficient approximation schemes for geometric problems? In European Symposium on Algorithms, pages 448-459. Springer, 2005.

[32] Sudhanshu Mittal, Maxim Tatarchenko, Özgün Çiçek, and Thomas Brox. Parting with illusions about deep active learning. arXiv preprint arXiv:1912.05361, 2019.

[33] Prateek Munjal, N. Hayat, Munawar Hayat, J.
Sourati, and S. Khan. Towards robust and reproducible active learning using neural networks. arXiv preprint arXiv:2002.09564, 2020.

[34] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265-294, 1978.

[35] Kossar Pourahmadi, Parsa Nooralinejad, and Hamed Pirsiavash. A simple baseline for low-budget active learning. arXiv preprint arXiv:2110.12033, 2021.

[36] Hiranmayi Ranganathan, Hemanth Venkateswara, Shayok Chakraborty, and Sethuraman Panchanathan. Deep active learning for image classification. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3934-3938. IEEE, 2017.

[37] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

[38] Changjian Shui, Fan Zhou, Christian Gagné, and Boyu Wang. Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pages 1308-1318. PMLR, 2020.

[39] Oriane Siméoni, Mateusz Budnik, Yannis Avrithis, and Guillaume Gravier. Rethinking deep active learning: Using unlabeled data at model training. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 1220-1227. IEEE, 2021.

[40] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972-5981, 2019.

[41] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. SCAN: Learning to classify images without labels. In European Conference on Computer Vision, pages 268-285. Springer, 2020.

[42] Zengmao Wang, Bo Du, Lefei Zhang, and Liangpei Zhang. A batch-mode active learning framework by querying discriminative and representative samples for hyperspectral image classification. Neurocomputing, 179:88-100, 2016.

[43] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954-1963. PMLR, 2015.

[44] Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, and Misha Pavel. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. Advances in Neural Information Processing Systems, 21, 2008.

[45] Ziting Wen, Oscar Pizarro, and Stefan Williams. Active self-semi-supervised learning for few labeled samples fast training. arXiv preprint arXiv:2203.04560, 2022.

[46] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113-127, 2015.

[47] Changchang Yin, Buyue Qian, Shilei Cao, Xiaoyu Li, Jishang Wei, Qinghua Zheng, and Ian Davidson. Deep similarity-based batch mode active learning with exploration-exploitation. In 2017 IEEE International Conference on Data Mining (ICDM), pages 575-584. IEEE, 2017.

[48] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93-102, 2019.

[49] Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, and Chao Zhang. Cold-start data selection for few-shot language model fine-tuning: A prompt-based uncertainty propagation approach. arXiv preprint arXiv:2209.06995, 2022.
[50] Michelle Yuan, Hsuan-Tien Lin, and Jordan L. Boyd-Graber. Cold-start active learning through self-supervised language modeling. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7935-7948. Association for Computational Linguistics, 2020.

[51] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. CoRR, abs/2110.08263, 2021.

[52] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.

[53] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.

[54] Fedor Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.

[55] Yu Zhu, Jinghao Lin, Shibi He, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. Addressing the item cold-start problem by attribute-driven active learning. IEEE Transactions on Knowledge and Data Engineering, 32(4):631-644, 2020. doi: 10.1109/TKDE.2019.2891530.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Throughout the entire paper.
   (c) Did you discuss any potential negative societal impacts of your work? [No] Irrelevant for this work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] Section 2.1.
   (b) Did you include complete proofs of all theoretical results? [Yes] Section 2, and App. B.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We included the necessary information for reproducibility. Furthermore, the code will be published upon acceptance.
   (b) Did you specify all the training details (e.g., data splits, hyper-parameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Section 4.1.
   (b) Did you mention the license of the assets? [Yes] All assets are publicly available.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] The data is publicly available.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] Irrelevant.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]