Published as a conference paper at ICLR 2023

LEVERAGING IMPORTANCE WEIGHTS IN SUBSET SELECTION

Gui Citovsky¹, Giulia De Salvo¹, Sanjiv Kumar¹, Srikumar Ramalingam¹, Afshin Rostamizadeh¹, Yunjuan Wang²
¹Google Research, New York, NY, 10011
²Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218
{gcitovsky,giuliad,sanjivk,rostami,rsrikumar}@google.com
ywang509@jhu.edu
(This work was done when the author was interning at Google.)

ABSTRACT

We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.

1 INTRODUCTION

Deep neural networks have shown remarkable success in several domains such as computer vision and natural language processing. In many tasks, this is achieved by heavily relying on extremely large labeled datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on such datasets also incurs high computational costs. With the growing size of datasets in various domains, algorithm scalability is a real and imminent challenge that needs to be addressed. One promising way to solve this problem is data subset selection, where the learner aims to find the most informative subset from a large number of training samples to approximate (or even improve upon) training with the entire training set. Such ideas have been extensively studied in k-means and k-median clustering (Har-Peled & Mazumdar, 2004), subspace approximation (Feldman et al., 2010), computational geometry (Agarwal et al., 2005), and density estimation (Turner et al., 2021), to name a few.

One particular approach to data subsampling involves the computation of coresets, which are weighted subsets of a dataset that can act as a proxy for the whole dataset when solving some optimization task. Coreset algorithms are primarily motivated by theoretical guarantees that bound the difference between the training loss (or other such objective) over the coreset and that over the full dataset, under different assumptions on the losses and hypothesis classes (Mai et al., 2021; Munteanu et al., 2018; Curtin et al., 2019; Karnin & Liberty, 2019). However, in practice, most competitive subset selection algorithms that are designed for general loss functions and arbitrary function classes focus only on selecting informative subsets of the data and typically do not assign weights to the selected examples. These methods are, for example, based on some notion of model uncertainty (Scheffer et al., 2001), information gain (Argamon-Engelson & Dagan, 1999), loss gradients (Paul et al., 2021; Ash et al., 2019), or diversity (Sener & Savarese, 2018).
Counter to this trend, we show that weighting the selected samples can be very beneficial. In this work, we present a subset selection algorithm called IWeS that is designed for general loss functions and hypothesis classes and that selects examples by importance sampling, a theoretically motivated and unbiased sampling technique. Importance sampling is conducted according to a specially crafted probability distribution and, importantly, each sampled example is weighted inversely proportionally to its sampling probability when computing the training loss. We develop two types of sampling probability for different practical requirements (e.g. computational constraints and label availability), but in both cases, the sampling probability is based on the example's entropy-based score computed using a previously trained model. We note that the IWeS algorithm is similar to the IWAL active learning algorithm of Beygelzimer et al. (2009), as both are based on importance sampling. However, in contrast to IWAL, IWeS uses a different sampling probability definition, with a focus on providing a practical method that is amenable to large deep networks and complex hypothesis classes.

Through extensive experiments, we find that the IWeS algorithm is competitive for deep neural networks over several datasets. We compare our algorithm against four types of baselines whose sampling strategies leverage: the model's uncertainty over examples, diversity of selected examples, gradient information, and random sampling. Finally, we analyze a closely related albeit less practical algorithm that inspires the design of IWeS, called IWeS-V, proving that it admits generalization and sampling rate guarantees that hold for general loss functions and hypothesis classes. The contributions of this work can be summarized as follows:

1. We present the Importance Weighted Subset Selection (IWeS) algorithm that selects examples by importance sampling with a sampling probability based on a model's entropy, which is applicable to (and practical for) arbitrary model families including modern deep networks. In addition to the subset selection framework, IWeS also works in the active learning setting where the examples are unlabeled at selection time.
2. We demonstrate that IWeS achieves significant improvement over several baselines (Random, Margin, Least-Confident, Entropy, Coreset, BADGE) using the VGG16 model for six common multi-class datasets (CIFAR10, CIFAR10-corrupted, CIFAR100, SVHN, Eurosat, Fashion MNIST), and using the ResNet101 model for the large-scale multi-label Open Images dataset.
3. We provide a theoretical analysis for a closely related algorithm, IWeS-V, in Section 4. We prove an $O(1/\sqrt{T})$ generalization bound, which depends on the full training dataset size $T$. We further give a new definition of the disagreement coefficient and prove a sampling rate bound by leveraging label information, which is tighter than the label complexity bound provided by Beygelzimer et al. (2009) that does not use label information.

1.1 RELATED WORK

Uncertainty. Uncertainty sampling, which selects examples that the model is least confident on, is favored by practitioners (Mussmann & Liang, 2018) and is rather competitive among many recent algorithms (Yang & Loog, 2018).
Uncertainty can be measured through entropy (Argamon-Engelson & Dagan, 1999), least confidence (Culotta & McCallum, 2005), or, most popular, the margin between the most likely and the second most likely labels (Scheffer et al., 2001). More recent works measure the model uncertainty indirectly, such as selecting examples based on an estimated loss (Yoo & Kweon, 2019) or leveraging variational autoencoders and adversarial networks to find points not well represented by the current labeled data (Sinha et al., 2019; Kim et al., 2021). Beygelzimer et al. (2009) make use of a disagreement-based notion of uncertainty and construct an importance weighted predictor with theoretical guarantees called IWAL, which is further enhanced by Cortes et al. (2019). However, IWAL is not directly suitable for use with complex hypothesis spaces, such as deep networks, since it requires solving a non-trivial optimization over a subset of the hypothesis class, the so-called version space, in order to compute sampling probabilities. We further discuss these difficulties in Section 4.

Diversity. In another line of research, subsets are selected by enforcing diversity, as in the FASS (Wei et al., 2015) and Coreset (Sener & Savarese, 2018) algorithms. Wei et al. (2015) introduce a submodular sampling objective that trades off between uncertainty and diversity by finding a diverse set of samples from amongst those that the current trained model is most uncertain about. This idea was further explored by Kaushal et al. (2019), who designed a unified framework for data subset selection with facility location and dispersion-based diversity functions. Sener & Savarese (2018) show that the task of identifying a coreset in an active learning setting can be mapped to solving the k-center problem. Further recent works related to the coreset idea are Mirzasoleiman et al. (2020) and Killamsetty et al. (2021), where the algorithms select representative subsets of the training data to minimize the estimation error between the weighted gradient of the selected subset and the full gradient.

Loss Gradient. Another class of algorithms selects a subset by leveraging the loss gradients. For example, the GRAND score (Paul et al., 2021), or the closely related EL2N score, leverages the average gradient across several different independently trained models to measure the importance of each sample. However, as such, it requires training several neural networks, which is computationally expensive. BADGE (Ash et al., 2019) is a sampling strategy for deep neural networks that uses k-MEANS++ on the gradient embedding of the networks to balance between uncertainty and diversity. ISAL (Liu et al., 2021) measures the potential impact of each example by leveraging the loss gradient and Hessian. Finally, for the sake of completeness, we note that importance weighting type approaches have also been used for the selection of examples within an SGD minibatch (Katharopoulos & Fleuret, 2018; Johnson & Guestrin, 2018), which can be thought of as a change to the training procedure itself. In contrast, the problem setting we consider in this work requires explicitly producing a (weighted) subset of the training data and treats the training procedure itself as a black-box. These are a sampling of data subset selection algorithms, and we refer the reader to Guo et al. (2022) for a more detailed survey.
In this work, we choose at least one algorithm from each of the categories mentioned above, in particular Margin (Scheffer et al., 2001), BADGE (Ash et al., 2019), and Coreset (Sener & Savarese, 2018), to compare against empirically in Section 3. However, before that, we first formally define the IWeS algorithm in the following section.

2 THE IWES ALGORITHM

Algorithm 1 Importance Weighted Subset Selection (IWeS)
Require: Labeled pool P, seed set size k_0, subset batch size k, number of iterations R, weight cap parameter u.
1: Initialize the subset S = ∅.
2: S_0 ← Draw k_0 examples from P uniformly at random.
3: Set S = {(x, y, 1) : (x, y) ∈ S_0} and P = P \ S_0.
4: for r = 1, 2, . . . , R do
5:   Set S_r = ∅.
6:   Train f_r, g_r on S using the weighted loss and independent random initializations.
7:   while |S_r| < k do
8:     Select (x, y) uniformly at random from P.
9:     Set p(x, y) using the entropy-based disagreement or entropy criterion shown in Eq. (1) or (2).
10:    Q ∼ Bernoulli(p(x, y)).
11:    if Q = 1 then
12:      Set S_r = S_r ∪ {(x, y, min{1/p(x, y), u})} and P = P \ {(x, y)}.
13:    end if
14:  end while
15:  S = S ∪ S_r.
16: end for
17: Train f_{R+1} on S using the weighted loss.
18: return S, f_{R+1}

We consider a practical batch streaming setting, where an algorithm processes one example at a time without updating its state until a batch of examples is selected. That is, as in standard streaming settings, the algorithm receives a labeled example and decides whether or not to include it in the selected subset. Yet, the algorithm is only allowed to update its state after a fixed batch of examples has been selected, in order to limit the overhead costs (e.g. this typically includes retraining models and extracting gradients). Unlike the pool-based setting, where the algorithm receives the entire labeled pool beforehand, a batch streaming setting can be more appropriate when facing a vast training data pool, since the algorithm can process a subset of the pool without iterating over the whole pool. Note that any batch streaming algorithm can also be used in a pool-based setting, by simply streaming through the pool in a uniformly random order.

At a high level, the IWeS algorithm selects examples by importance sampling where the sampling probability is based on the entropy of models trained on previously selected data. We define two sampling probabilities that allow us to trade off between performance and computational cost, and that cover both a label-aware setting and an active learning setting with lower label annotation cost. As we will subsequently see, these sampling definitions are both easy to use and work well in practice.

To define the algorithm in more detail, we let $X \subseteq \mathbb{R}^d$ and $Y = \{1, \ldots, c\}$ denote the input space and the multi-class label space, respectively. We assume the data $(x, y)$ is drawn from an unknown joint distribution $D$ on $X \times Y$. Let $H = \{h : X \to Z\}$ be the hypothesis class consisting of functions mapping from $X$ to some prediction space $Z \subseteq \mathbb{R}^{Y}$, and let $\ell : Z \times Y \to \mathbb{R}$ denote the loss. The pseudocode of IWeS is shown in Algorithm 1. Initially, a seed set $S_0$ ($|S_0| = k_0$) is selected uniformly at random from the labeled pool $P$. Then the algorithm proceeds in rounds $r \in [1, \ldots, R]$ and consists of two main components: training and sampling. At the training step of round $r$, the model(s) are trained using the importance-weighted loss, namely $f_r = \arg\min_{h \in H} \sum_{(x,y,w) \in S} w\,\ell(h(x), y)$, on the subset $S$ selected so far in the previous $r - 1$ rounds.
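To make the batch streaming loop concrete, the following minimal Python sketch mirrors the structure of Algorithm 1. It is an illustration only, not the authors' implementation: `train_weighted` and `sampling_prob` are assumed helper functions, the former training model(s) on importance-weighted triples and the latter returning $p(x, y)$ according to one of the criteria defined next in Eqs. (1) and (2).

```python
import random

def iwes_select(pool, k0, k, R, u, train_weighted, sampling_prob):
    """Minimal sketch of Algorithm 1 (IWeS).

    pool            -- list of labeled examples (x, y)
    train_weighted  -- assumed helper: trains model(s) on (x, y, weight) triples
                       by minimizing the importance-weighted loss
    sampling_prob   -- assumed helper: returns p(x, y) from Eq. (1) or Eq. (2)
    """
    random.shuffle(pool)
    seed, pool = pool[:k0], pool[k0:]
    subset = [(x, y, 1.0) for (x, y) in seed]          # seed examples get weight 1

    for _ in range(R):
        models = train_weighted(subset)                # e.g. (f_r, g_r) for IWeS-dis
        batch = []
        while len(batch) < k and pool:
            i = random.randrange(len(pool))            # draw (x, y) uniformly from P
            x, y = pool[i]
            p = sampling_prob(models, x, y)
            if random.random() < p:                    # Q ~ Bernoulli(p(x, y))
                batch.append((x, y, min(1.0 / p, u)))  # clipped importance weight
                pool.pop(i)                            # selected examples leave the pool
        subset += batch

    final_model = train_weighted(subset)
    return subset, final_model
```

The only state carried across examples within a round is the pair of frozen models, which is what makes the per-example decision cheap in the batch streaming setting.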
Depending on the sampling strategy, we may need to randomly initialize two models $f_r, g_r$, in which case they are trained independently on the same selected subset $S$, but with different random initializations. At the sampling step of round $r$, the IWeS algorithm calculates a sampling probability for an example $(x, y) \in P$ based on one of the following definitions:

Entropy-based Disagreement. We define the sampling probability based on the disagreement of two functions with respect to entropy restricted to the labeled example $(x, y)$. That is,
$$p(x, y) = \big| P_{f_r}(y|x)\log_2 P_{f_r}(y|x) - P_{g_r}(y|x)\log_2 P_{g_r}(y|x) \big| \qquad (1)$$
where $P_{f_r}(y|x)$ is the probability of class $y$ under model $f_r$ given example $x$. If the two functions $f_r, g_r$ agree on the labeled example $(x, y)$, then $p(x, y)$ will be small and the example will be less likely to be selected. This definition is the closest to the IWeS-V algorithm analyzed in Section 4 and achieves the best performance when the computational cost of training two models is not an issue. In Appendix A, we show an efficient version of entropy-based disagreement that utilizes only one model and achieves similar performance.

Entropy. We define the sampling probability by the normalized entropy of the model $f_r$ trained on past selected examples:
$$p(x, \cdot) = -\sum_{y' \in Y} P_{f_r}(y'|x)\log_2 P_{f_r}(y'|x) \,/\, \log_2 |Y|. \qquad (2)$$
The sampling probability $p(x, \cdot)$ is high whenever the model class probabilities $P_{f_r}(y'|x)$ are close to $1/|Y|$, which is when the model is not confident about its prediction, as it effectively randomly selects a label from $Y$. This definition does not use the label $y$ and thus can be used in an active learning setting where the algorithm can only access unlabeled examples. Another advantage is that it only requires training one model, thereby saving some computational cost. We note that entropy-based sampling has been used in algorithms such as uncertainty sampling, as discussed in the related work section, but using entropy to define importance weights has not been done in past literature.

Based on one of these definitions, the IWeS algorithm then decides whether to include the example in the selected subset $S$ by flipping a coin $Q$ with the chosen sampling probability $p(x, y)$. If the example is selected, the example's corresponding weight $w$ is set to $1/p(x, y)$, and the example is removed from the labeled pool, $P = P \setminus \{(x, y)\}$. This process is repeated until $k$ examples have been selected. Below we use IWeS-dis as an abbreviation for the IWeS algorithm with the entropy-based disagreement sampling probability and IWeS-ent for the IWeS algorithm with the entropy sampling probability.

The weighted loss used to train the model can be written as $\frac{1}{|P|}\sum_{i \in P} \frac{Q_i}{p(x_i, y_i)}\,\ell(f(x_i), y_i)$ and it is an unbiased estimator of the population risk $E_{(x,y)\sim D}[\ell(f(x), y)]$. Yet such an estimator can have a large variance when the model is highly confident in its prediction, that is, whenever $P_{f_r}(y|x)$ is large and hence $p(x, y)$ is small. This may lead to training instability, and one pragmatic approach to addressing this issue is to clip the importance sampling weights (Ionides, 2008; Swaminathan & Joachims, 2015). Thus, in our algorithm, we let $u$ be an upper bound on the weight of any selected example. Although this clipping strategy introduces an additional parameter, we find it is not too sensitive and, as mentioned in the empirical section, set it to a fixed constant throughout our evaluation.
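Both criteria and the clipped importance weight can be computed directly from the models' class probabilities. The sketch below is a minimal illustration under the definitions above (not the authors' code); `probs_f` and `probs_g` are assumed to be softmax output vectors over the |Y| classes for a single example.

```python
import numpy as np

def p_entropy_disagreement(probs_f, probs_g, y):
    """Eq. (1): disagreement of two models on the entropy term of the true label y."""
    eps = 1e-12  # avoid log(0)
    term_f = probs_f[y] * np.log2(probs_f[y] + eps)
    term_g = probs_g[y] * np.log2(probs_g[y] + eps)
    return abs(term_f - term_g)

def p_entropy(probs_f):
    """Eq. (2): entropy of the model's prediction, normalized to [0, 1]."""
    eps = 1e-12
    ent = -np.sum(probs_f * np.log2(probs_f + eps))
    return ent / np.log2(len(probs_f))

def clipped_weight(p, u):
    """Importance weight 1/p, capped at u to limit variance."""
    return min(1.0 / p, u) if p > 0 else u
```

In both cases the per-example cost at selection time is one or two forward passes of the frozen model(s), which is what keeps the sampling step cheap relative to retraining.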
The primary computational cost of the algorithm, at each sampling round $r$, comes from (1) training the model and (2) the inference cost associated with computing the sampling probabilities on a constant fraction of the unlabeled pool of examples. As described in the next section, aside from random sampling, all compared sampling algorithms incur similar re-training and model scoring costs at each sampling round.

3 EMPIRICAL EVALUATION

We compare IWeS with state-of-the-art baselines on several image classification benchmarks. Specifically, we consider six multi-class datasets (CIFAR10, CIFAR100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), EUROSAT (Helber et al., 2019), CIFAR10 Corrupted (Hendrycks & Dietterich, 2019), Fashion MNIST (Xiao et al., 2017)) and one large-scale multi-label Open Images dataset (Krasin et al., 2017). In the multi-class setting, each image is associated with only one label. On the other hand, the multi-label Open Images dataset consists of 19,957 classes over 9M images, where each image contains binary labels for a small subset of the classes (on average 6 labels per image). Further details of each dataset can be found in Table 1 and Table 2 in the appendix.

For all experiments, we consider a diverse set of standard baselines from both the subset selection and active learning literature (discussed in Section 1.1).

Uncertainty Sampling selects the top $k$ examples on which the current model admits the highest uncertainty. There are three popular ways of defining the model uncertainty $s(x)$ of an example $x$, namely margin sampling, entropy sampling, and least confidence sampling, and all are based on $P_f[\hat{y}|x]$, the probability of class $\hat{y}$ given example $x$ according to the model $f$. Margin sampling defines the model uncertainty of an example $x$ as $s(x) = 1 - (P_f[\hat{y}_1|x] - P_f[\hat{y}_2|x])$, where $\hat{y}_1 = \arg\max_{y\in Y} P_f[y|x]$ and $\hat{y}_2 = \arg\max_{y\in Y\setminus \hat{y}_1} P_f[y|x]$ are the first and second most probable classes for model $f$. For entropy sampling, the model uncertainty is defined as $s(x) = -\sum_{y\in Y} P_f(y|x)\log(P_f(y|x))$, while for least confidence sampling, it is defined as $s(x) = 1 - \max_{y\in Y} P_f(y|x)$.

BADGE of Ash et al. (2019) selects $k$ examples by using the k-MEANS++ seeding algorithm on the gradient vectors, computed with respect to the penultimate layer using the most likely labels given by the latest model checkpoint.

Coreset (k-Center) of Sener & Savarese (2018) selects a subset of examples using their embeddings derived from the penultimate layer of the latest model checkpoint. In particular, the $k$ examples are chosen using a greedy 2-approximation algorithm for the k-center problem.

Random Sampling selects $k$ examples uniformly at random.
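For concreteness, the three uncertainty scores above can be computed from a model's softmax outputs as in the following minimal sketch (an illustration under the stated definitions, not the authors' implementation); `probs` is assumed to be a 2-D array of class probabilities with one row per example.

```python
import numpy as np

def margin_score(probs):
    """1 - (top1 - top2): larger means more uncertain."""
    part = np.sort(probs, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def entropy_score(probs):
    """Prediction entropy: larger means more uncertain."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def least_confidence_score(probs):
    """1 - max class probability: larger means more uncertain."""
    return 1.0 - np.max(probs, axis=1)

def select_top_k(scores, k):
    """Uncertainty sampling keeps the k highest-scoring examples."""
    return np.argsort(scores)[-k:]
```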
3.1 MULTI-CLASS EXPERIMENTS

Here, we compare the IWeS algorithm against the baselines on the six multi-class image datasets. We use the VGG16 architecture with weights that were pre-trained on ImageNet, and add two fully-connected 4096-dimensional layers and a final prediction layer; Xavier uniform initialization is used for the final layers. For each dataset, we tune the learning rate by choosing the rate from the set {0.001, 0.002, 0.005, 0.01, 0.1} that achieves the best model performance on the seed set. We use batch SGD with the selected learning rate and fix SGD's batch size to 100. At each sampling round $r$, the model is trained to convergence on all past selected examples for at least 20 epochs.

For IWeS, we set the weight capping parameter to 2 for all datasets except for CIFAR10, for which we decreased it to 1.5 in order to reduce training instability. The embedding layer for BADGE and Coreset is extracted from the penultimate layer and has a dimension of 4096. The effective dimension of the gradient vector in BADGE grows with the number of labels, which is problematic for CIFAR100 as it has 100 classes. More specifically, the runtime of BADGE is $O(dkT)$, which can be large for CIFAR100 since the dimension of the gradient vector from the penultimate layer is $d = 4096 \times 100$, the size of the labeled pool is T = 50K, and the number of examples selected in each round is k = 5K. In order to address this inefficiency for CIFAR100, we split the labeled pool randomly into 100 partitions and ran separate instances of the algorithm in each partition with batch size k/100.

Each algorithm is initialized with a seed set that is sampled uniformly at random from the pool. After that, sampling proceeds in a series of rounds $r$ where the model is frozen until a batch of $k$ examples is selected. The seed set size and sampling batch size $k$ are set to 1K for CIFAR10, SVHN, EUROSAT, CIFAR10 Corrupted, and Fashion MNIST, and to 5K for CIFAR100. The initial seed set size allows at least 50 initial examples per class on average. Each experiment was repeated for 5 trials. Any trial that encountered divergent training, i.e. the resulting model accuracy is more than three standard errors below the model's accuracy on the seed set, was dropped. We note that this happened infrequently (less than 3% of the time) and all reported averaged results have at least 3 trials.

Figure 1: Accuracy of VGG16 when trained on examples selected by IWeS-dis and baseline algorithms.
Figure 2: Accuracy of VGG16 when trained on examples selected by IWeS-ent, IWeS-dis, margin sampling, and random sampling.

Figure 1 shows the mean and standard error of the VGG16 model's accuracy on a held-out test set, comparing IWeS-dis to the baseline methods. The IWeS-dis algorithm either outperforms or matches the performance of the baseline algorithms on all datasets. We also find that margin sampling consistently performs well against the remaining baseline algorithms and that BADGE either matches the performance of margin sampling or slightly underperforms on some datasets (Eurosat, Fashion MNIST, CIFAR100). Coreset admits a similar and at times slightly poorer performance compared to random sampling.

Next, Figure 2 compares the two variants of our algorithm: IWeS-dis and IWeS-ent. We find that IWeS-dis performs slightly better than IWeS-ent on most of the datasets. This is not surprising since the IWeS-dis sampling probability leverages label information and more computational power, i.e. it trains two models. As explained in Section 4, it also better fits our theoretical motivation. Nevertheless, it is important to note that IWeS-ent, without the label information, still consistently outperforms or matches the performance of the baselines on all the datasets.

3.2 MULTI-LABEL OPENIMAGES EXPERIMENTS

Here, we evaluate the performance of the IWeS algorithm on Open Images v6. We train a ResNet101 model implemented using tf-slim on 64 two-core Cloud TPU v4 accelerators, and apply batch SGD with a batch size of 6144 and an initial learning rate of $10^{-4}$ that decays logarithmically every $5 \times 10^{8}$ examples.
We add a global pooling layer with a fully connected layer of 128 dimensions as the final layers of the network, which is needed by BADGE and Coreset. The model is initialized with weights that were pre-trained on the validation split using 150K SGD steps, and at each sampling round, the model is trained on all past selected examples for an additional 15K SGD steps.

Figure 3: Pooled Average Precision of ResNet101 trained on examples selected by IWeS-ent and the baseline algorithms.

In the previous section, our results showed that the IWeS-dis algorithm only slightly outperforms the IWeS-ent algorithm on a few datasets. Moreover, since IWeS-dis requires training two neural networks, which is computationally expensive in this scenario, we only test the performance of IWeS-ent. Since IWeS-ent does not use the label information, we can also measure the performance of the algorithm in an active learning setting.

Since Open Images is a multi-label dataset, the sampling algorithms must not only select the image but also the class. That is, each example selected by an algorithm consists of an image-class pair with a corresponding binary label indicating whether the corresponding class is present in the image or not. In order to adapt IWeS-ent to the multi-label setting, the entropy sampling probability for each image-class pair is defined as
$$p(x, \cdot) = -P_{f_r}(y|x)\log_2 P_{f_r}(y|x) - \big(1 - P_{f_r}(y|x)\big)\log_2\big(1 - P_{f_r}(y|x)\big),$$
where $P_{f_r}(y|x)$ is the model class probability of a positive label at round $r$ (a short sketch of this computation is given at the end of this subsection). A seed set of size 300K is sampled uniformly at random from the pool, and at each sampling round $r$, the algorithms select 100K examples. Similarly to the previous section, in order to run BADGE on Open Images, we divide the pool into 100 partitions and run separate instances of the algorithm in each partition. For IWeS, the weight capping parameter is set to 10.

Figure 3 shows the mean and standard error across 5 trials of the pooled average precision (Pooled AP) metric for each algorithm. As the number of selected examples increases, IWeS-ent outperforms all other baseline methods on the Open Images dataset. We also find that BADGE performs similarly to or even slightly worse than the uncertainty-based sampling algorithms when the number of selected examples is smaller than 800K, and then outperforms all uncertainty-based sampling as the number of selected examples increases. Coreset initially performs better than random sampling, but at later sampling rounds, it admits a similar performance to random sampling.
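As referenced above, the per-image-class sampling probability is simply the binary entropy of the model's positive-class probability. The following is a minimal sketch (not the authors' code); `prob_positive` is an assumed array of sigmoid outputs, one entry per image-class pair.

```python
import numpy as np

def multilabel_entropy_prob(prob_positive):
    """Binary entropy of P(label present | image), used as the IWeS-ent
    sampling probability for each image-class pair. It is maximal (1.0) at
    p = 0.5 and zero when the model is certain either way."""
    eps = 1e-12
    p = np.clip(prob_positive, eps, 1.0 - eps)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)
```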
4 THEORETICAL MOTIVATION

In order to theoretically motivate the IWeS algorithm, we analyze a closely related algorithm, which we call IWeS-V, adapted from the IWAL algorithm of Beygelzimer et al. (2009). We prove that IWeS-V admits generalization bounds that scale with the dataset size $T$ and sampling rate bounds that are in terms of a new disagreement coefficient tailored to the subset selection framework. Below, we let $L(h) = E_{(x,y)\sim D}[\ell(h(x), y)]$ denote the expected loss of hypothesis $h \in H$, and $h^* = \arg\min_{h\in H} L(h)$ the best-in-class hypothesis. Without loss of generality, we consider a bounded loss $\ell : Z \times Y \to [0, 1]$ mapping to the interval $[0, 1]$; such a loss can be obtained from any bounded loss after normalization. For simplicity, we assume $H$ is a finite set, but our results can be easily extended by standard covering arguments to more general hypothesis sets such as finite VC-classes. The IWeS-V algorithm operates on i.i.d. examples $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ drawn from $D$ sequentially.

It maintains a version space $H_t$ at every time $t$, with $H_1 = H$. At time $t$, IWeS-V flips a coin $Q_t \in \{0, 1\}$ with bias $p_t$ defined as
$$p_t = \max_{f,g \in H_t} \ell(f(x_t), y_t) - \ell(g(x_t), y_t), \qquad (3)$$
where
$$H_t = \Big\{ h \in H_{t-1} : \tfrac{1}{t-1}\sum_{s=1}^{t-1} \tfrac{Q_s}{p_s}\,\ell(h(x_s), y_s) \le \min_{h' \in H_{t-1}} \tfrac{1}{t-1}\sum_{s=1}^{t-1} \tfrac{Q_s}{p_s}\,\ell(h'(x_s), y_s) + \Delta_{t-1} \Big\}$$
with $\Delta_{t-1} = \sqrt{\tfrac{8\log(2T(T+1)|H|^2/\delta)}{t-1}}$. The example is selected if $Q_t = 1$ and otherwise it is discarded. The main idea behind this algorithm is thus to define a sampling probability in terms of the disagreement between two hypotheses $f, g$ that are not too far from the best model trained on the past selected data, i.e. $\min_{h \in H_{t-1}} \tfrac{1}{t-1}\sum_{s=1}^{t-1} \tfrac{Q_s}{p_s}\ell(h(x_s), y_s)$. The formal IWeS-V pseudocode (Algorithm 2) and all the theorem proofs can be found in Appendix B.

For general, e.g. non-linear, hypothesis classes it is computationally infeasible to find two hypotheses $f, g \in H_t$ that maximize the expression in equation (3). This impracticality of IWeS-V is the main reason why we developed the IWeS algorithm of the previous section. This drawback is also shared by the IWAL algorithm of Beygelzimer et al. (2009), which computes a sampling probability very similar to that of equation (3), but with an additional maximization over the choice of $y \in Y$ in the definition of the sampling probability $p_t$.

Before continuing, we explain how our practical algorithm IWeS-dis, specifically using the sampling probability in equation (1), is closely related to the IWeS-V algorithm. Recall that the IWeS algorithm trains two models $f$ and $g$, each minimizing the importance-weighted loss on the data sampled so far. Therefore, each model exhibits reasonable training loss, i.e. both are expected to be included in the version space $H_t$ of good hypotheses, while the different random initializations (in the case of non-convex neural network hypotheses) result in models that still differ in certain regions of the feature space. Thus, the difference in equation (1) can be thought of as a less aggressive version of the difference found in the maximization of equation (3). Another dissimilarity between the two is that the IWeS-dis algorithm is defined for the batch streaming setting, while the IWeS-V algorithm and its analysis are developed for the streaming setting. Said differently, the IWeS-V algorithm can be seen as a special case of the IWeS-dis algorithm with a target subset size of 1. To extend the theoretical guarantees of IWeS-V to the batch streaming setting, we can follow an analysis similar to that developed by Amin et al. (2020) to find that the effects of delayed feedback in the batch streaming setting are in fact mild compared to the streaming setting.
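For intuition, the IWeS-V sampling step is straightforward to write down when $H$ is a small finite set, since the version space can be maintained explicitly. The following minimal sketch is an illustration only (for realistic model classes this maximization is exactly the intractable step discussed above); the per-hypothesis loss functions are assumed to be given.

```python
import math
import random

def iwes_v_step(t, x_t, y_t, version_space, weighted_losses, losses, T, H_size, delta):
    """One step of IWeS-V over a finite hypothesis set.

    version_space   -- list of hypothesis indices currently in H_t
    weighted_losses -- running importance-weighted loss sum per hypothesis
    losses          -- losses[h] is a function (x, y) -> loss in [0, 1]
    """
    # Shrink the version space using the importance-weighted empirical losses.
    if t > 1:
        slack = math.sqrt(8 * math.log(2 * T * (T + 1) * H_size**2 / delta) / (t - 1))
        avg = {h: weighted_losses[h] / (t - 1) for h in version_space}
        best = min(avg.values())
        version_space = [h for h in version_space if avg[h] <= best + slack]

    # Eq. (3): disagreement of the surviving hypotheses on the current example.
    vals = [losses[h](x_t, y_t) for h in version_space]
    p_t = max(vals) - min(vals)

    # Flip the coin; if selected, accumulate the importance-weighted losses.
    q_t = 1 if random.random() < p_t else 0
    if q_t == 1:
        for h in version_space:
            weighted_losses[h] += losses[h](x_t, y_t) / p_t
    return p_t, q_t, version_space
```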
4.1 GENERALIZATION BOUND

Next, we turn to the topic of generalization guarantees, and we first review an existing bound for coreset-based algorithms. The guarantees of coreset algorithms are generally focused on showing that a model's training loss on the selected subset is close to the same model's training loss on the whole dataset. That is, given a dataset $P = \{(x_i, y_i)\}_{i=1}^T \sim D^T$, the learner seeks to select a subset of $m < T$ examples $S = \{(x'_i, y'_i)\}_{i=1}^m$ along with a corresponding set of weights $w_1, \ldots, w_m$ such that for some small $\epsilon > 0$ and for all $h \in H$, the additive error coreset guarantee holds:
$$\Big| \sum_{i=1}^m w_i\,\ell(h(x'_i), y'_i) - \sum_{i=1}^T \ell(h(x_i), y_i) \Big| \le \epsilon T.$$
The following proposition, which is a minor extension of Fact 8 of Karnin & Liberty (2019), allows us to convert a coreset guarantee into a generalization guarantee.

Proposition 4.1. Let $\hat{h} = \arg\min_{h\in H} \sum_{i=1}^m w_i\,\ell(h(x'_i), y'_i)$, and suppose the additive error coreset guarantee holds with parameter $\epsilon > 0$. Then for any $\delta > 0$, with probability at least $1 - \delta$, it holds that
$$L(\hat{h}) \le L(h^*) + 2\epsilon + 2\sqrt{\ln(4/\delta)/2T}.$$

As shown above, the generalization guarantee depends linearly on $\epsilon$, which in turn depends on the size of the subset $m$. To give a few examples, Karnin & Liberty (2019) show that for hypotheses that are defined as analytic functions of dot products (e.g. generalized linear models) this dependence on $m$ is $\epsilon = O(1/m)$, while for more complex kernel density estimator type models the dependence is $\epsilon = O(1/\sqrt{m})$. See Mai et al. (2021), Table 1, for examples of the dependency between $\epsilon$ and $m$ under different data distribution assumptions (e.g. uniform, deterministic, $\ell_1$ Lewis) and for specific loss functions (e.g. log loss, hinge loss).

We now provide a generalization guarantee for the IWeS-V algorithm, which depends on the labeled pool size $T$. The proof follows from that in Beygelzimer et al. (2009).

Theorem 4.2. Let $h^* \in H$ be the minimizer of the expected loss, $h^* = \arg\min_{h\in H} L(h)$. For any $\delta > 0$, with probability at least $1 - \delta$, for any $t \ge 1$ with $t \in \{1, 2, \ldots, T\}$, we have that $h^* \in H_t$ and that $L(f) - L(g) \le 2\Delta_{t-1}$ for any $f, g \in H_t$. In particular, if $h_T$ is the output of IWeS-V, then $L(h_T) - L(h^*) \le 2\Delta_{T-1} = O\big(\sqrt{\log(T/\delta)/T}\big)$.

Unlike the distribution-specific and loss-specific theoretical guarantees proposed in the coreset literature, Theorem 4.2 holds for any bounded loss function and general hypothesis classes. If we ignore log terms and consider the more complex kernel density estimator class of hypotheses, the coreset method of Karnin & Liberty (2019) requires $m = O(T)$ coreset samples in order to achieve an overall $O(1/\sqrt{T})$ generalization bound. As we will see in the next section, the required IWeS sampling rate can also be as high as $O(T)$, but critically it is scaled by the best-in-class loss, which in favorable cases is significantly smaller than one.

4.2 SAMPLING RATE BOUNDS

Hanneke (2007) proves that the expected number of labeled examples needed to train a model in an active learning setting can be characterized in terms of the disagreement coefficient of the learning problem. Later, Beygelzimer et al. (2009) generalize this notion to arbitrary loss functions, and in this work, we further generalize it to the subset selection setting. Recall that the disagreement coefficient $\theta_{AL}$ of Beygelzimer et al. (2009) for the active learning setting is defined as
$$\theta_{AL} = \sup_{r \ge 0} \frac{E_{x}\big[\max_{h \in B_{AL}(h^*, r)} \max_{y\in Y} |\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r},$$
where $B_{AL}(h^*, r) = \{h \in H : \rho_{AL}(h, h^*) \le r\}$ with $\rho_{AL}(f, g) = E_{x}\big[\sup_{y\in Y} |\ell(f(x), y) - \ell(g(x), y)|\big]$. Informally, this coefficient quantifies how much disagreement there is among the set of classifiers that are close to the best-in-class hypothesis. In the subset selection setting, labels are available at sampling time and, thus, we are able to define the following disagreement coefficient:

Definition 4.1. Let $\rho_S(f, g) = E_{(x,y)\sim D}[|\ell(f(x), y) - \ell(g(x), y)|]$ and $B_S(h^*, r) = \{h \in H : \rho_S(h, h^*) \le r\}$ for $r \ge 0$. The disagreement coefficient in the subset selection setting is defined as
$$\theta_S = \sup_{r \ge 0} \frac{E_{(x,y)\sim D}\big[\max_{h \in B_S(h^*, r)} |\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r}.$$

The main difference between the above coefficient and that of Beygelzimer et al. (2009) is that there is no supremum over all labels $y \in Y$, both in the definition of the distance $\rho$ and in the coefficient's numerator. Instead, the supremum is replaced with an expectation over the label space.
The following theorem leverages $\theta_S$ to derive an upper bound on the expected number of selected examples for the IWeS-V algorithm. Below, let $F_t = \{(x_i, y_i, Q_i)\}_{i=1}^t$ denote the observations of the algorithm up to time $t$.

Theorem 4.3. For any $\delta > 0$, with probability at least $1 - \delta$, the expected sampling rate of the IWeS-V algorithm is
$$\sum_{t=1}^T E_{(x_t,y_t)\sim D}\big[p_t \mid F_{t-1}\big] = O\Big(\theta_S\big(L(h^*)\,T + \sqrt{T\log(T/\delta)}\big)\Big).$$

Suppressing lower order terms, the above expected sampling rate bound is small whenever the product of the disagreement coefficient and the expected loss of the best-in-class hypothesis is small. In such cases, by combining the above theorem with the generalization guarantee, it holds that IWeS-V returns a hypothesis trained on only a fraction of the points that generalizes as well as a hypothesis trained on the full dataset of size $T$. Theorem 4.3 can be further improved by adapting the ideas found in Cortes et al. (2019) to the IWeS-V algorithm; see Appendix B.4 for this enhanced analysis.

The form of this sampling rate bound is similar to that of Beygelzimer et al. (2009). More concretely, under the assumption that the loss function has bounded slope asymmetry, that is,
$$K_\ell = \sup_{z, z' \in Z} \frac{\max_{y\in Y} |\ell(z, y) - \ell(z', y)|}{\min_{y\in Y} |\ell(z, y) - \ell(z', y)|}$$
is bounded, with probability at least $1 - \delta$, the expected number of examples selected by the IWAL algorithm is $O\big(\theta_{AL} K_\ell\big(L(h^*)T + \sqrt{T\log(T/\delta)}\big)\big)$. Thus, the main difference between the sampling rate bound of the IWAL algorithm and that of the IWeS-V algorithm is the factor that depends on the two disagreement coefficients: $\theta_{AL}K_\ell$ versus $\theta_S$. Since $\theta_S$ leverages the label information, we may expect it to provide a tighter bound compared to the label-independent disagreement coefficient $\theta_{AL}$. Theorem 4.4 shows that this is indeed the case.

Theorem 4.4. If the loss function has bounded slope asymmetry $K_\ell$, then $\theta_S \le \theta_{AL}K_\ell$.

The above theorem, in conjunction with the sampling rate guarantees, thus shows that the sampling rate bound of IWeS-V in Theorem 4.3 is tighter than the sampling rate bound of the IWAL algorithm.

5 CONCLUSION

In this paper we have introduced a subset selection algorithm, IWeS, that is designed for arbitrary hypothesis classes, including deep networks. We have shown that the IWeS algorithm outperforms several natural and important baselines across multiple datasets. In addition, we have developed an initial theoretical motivation for our approach based on the importance weighted sampling mechanism. A natural next step is enforcing a notion of diversity, as it will likely provide improved performance in the large-batch sampling setting; thus, we plan to adapt the diversity-based method in Citovsky et al. (2021) by replacing the uncertainty sampling component with the IWeS algorithm.

REFERENCES

Pankaj K Agarwal, Sariel Har-Peled, Kasturi R Varadarajan, et al. Geometric approximation via coresets. Combinatorial and Computational Geometry, 2005.

Kareem Amin, Corinna Cortes, Giulia De Salvo, and Afshin Rostamizadeh. Understanding the effects of batching in online active learning. In International Conference on Artificial Intelligence and Statistics, 2020.

Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 1999.

Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2019.
Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In International Conference on Machine Learning, 2009.

Gui Citovsky, Giulia De Salvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, and Sanjiv Kumar. Batch active learning at scale. In Advances in Neural Information Processing Systems, 2021.

Corinna Cortes, Giulia De Salvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Region-based active learning. In International Conference on Artificial Intelligence and Statistics, 2019.

Aron Culotta and Andrew McCallum. Reducing labeling effort for structured prediction tasks. In Association for the Advancement of Artificial Intelligence, 2005.

Ryan R Curtin, Sungjin Im, Ben Moseley, Kirk Pruhs, and Alireza Samadian. On coresets for regularized loss minimization. arXiv preprint arXiv:1905.10845, 2019.

Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 2010.

Chengcheng Guo, Bo Zhao, and Yanbing Bai. DeepCore: A comprehensive library for coreset selection in deep learning. Database and Expert Systems Applications, 2022.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning, 2007.

Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, 2004.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.

Edward L Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 2008.

Tyler B Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems, 2018.

Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory, 2019.

Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning, 2018.

Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.

Krishnateja Killamsetty, S Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, 2021.

Kwanyoung Kim, Dongwon Park, Kwang In Kim, and Se Young Chun. Task-aware variational adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. OpenImages: A public dataset for large-scale multi-label and multi-class image classification, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario, 2009.

Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

Tung Mai, Cameron Musco, and Anup Rao. Coresets for classification simplified and strengthened. In Advances in Neural Information Processing Systems, 2021.

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, 2020.

Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression. In Advances in Neural Information Processing Systems, 2018.

Stephen Mussmann and Percy S Liang. Uncertainty sampling is preconditioned stochastic gradient descent on zero-one loss. In Advances in Neural Information Processing Systems, 2018.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems, 2021.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In International Symposium on Intelligent Data Analysis, 2001.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015.

Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, 2021.

Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, 2015.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yazhou Yang and Marco Loog. A benchmark and comparison of active learning for logistic regression. Pattern Recognition, 2018.

Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

Supplementary Material

A EXTENDED ALGORITHMIC AND EXPERIMENTAL DETAILS

A.1 DATASET DETAILS

Table 1 and Table 2 provide further details of the six multi-class classification datasets and the multi-label Open Images dataset, respectively.
| Dataset | Train | Test | # Classes | Image Size | Description |
|---|---|---|---|---|---|
| CIFAR10 | 50,000 | 10,000 | 10 | 32×32×3 | classify the object in the image |
| SVHN | 73,257 | 26,032 | 10 | 32×32×3 | classify street view house numbers |
| CIFAR Corrupted | 7,614 | 2,386 | 10 | 32×32×3 | classify the corrupted object in the image |
| Eurosat | 8,000 | 5,000 | 10 | 64×64×3 | classify land use and land cover satellite images |
| Fashion MNIST | 60,000 | 10,000 | 10 | 32×32×3 | classify the type of clothes |
| CIFAR100 | 50,000 | 10,000 | 100 | 32×32×3 | classify the object in the image |

Table 1: Multi-class classification dataset statistics.

| Split | Images | Positives | Negatives |
|---|---|---|---|
| Train | 9,011,219 | 19,856,086 | 37,668,266 |
| Validation | 41,620 | 367,263 | 228,076 |
| Test | 125,436 | 1,110,124 | 689,759 |

Table 2: Open Images Dataset v6 statistics by data split.

A.2 MORE EXPERIMENT DETAILS AND RESULTS

When running IWeS, at each iteration, we pass over the labeled pool sequentially in a uniform random order, removing each selected example from the pool. If we exhaust the entire pool before selecting $k$ examples, we start again at the beginning of the sequence and iterate over the remaining examples. In order to reduce the number of passes required for the smaller datasets, we scale the sampling probabilities of the remaining points uniformly by $1 + \frac{j}{10}$, where $j$ is the number of passes so far. Figure 4 compares IWeS-dis and IWeS-ent for the additional three datasets that we omit in Section 3.

Figure 4: Accuracy of VGG16 trained on examples selected by IWeS-ent, IWeS-dis, margin sampling and random sampling.

A.3 EFFICIENT VERSION OF ENTROPY-BASED DISAGREEMENT

Here we develop an efficient version of entropy-based disagreement, which we call IWeS-loss, by defining the sampling probability to be proportional to the model's entropy restricted to the labeled example $(x, y)$ as follows:
$$p(x, y) = -P_{f_r}(y|x)\log P_{f_r}(y|x), \qquad (4)$$
where $P_{f_r}(y|x)$ is the probability of class $y$ under model $f_r$ given example $x$. As $P_{f_r}(y|x)$ increases from 0 to 1, this sampling probability first increases and then decreases. Thus, the sampling probability is high whenever the model is not confident about its prediction. Unlike IWeS-dis, this definition only requires training one model, thereby saving some computational cost. Figure 5 compares the two variants IWeS-dis and IWeS-loss. We find that the performance is similar across all datasets, with IWeS-loss behaving slightly better than IWeS-dis in the first two sampling iterations (i.e. Fashion MNIST, CIFAR10, Eurosat), and IWeS-dis slightly outperforming IWeS-loss as the number of sampling iterations increases (i.e. SVHN, Eurosat).

Figure 5: Accuracy of VGG16 trained on examples selected by IWeS-loss and IWeS-dis.

B PROOFS OF THEORETICAL GUARANTEES

B.1 PROOF OF PROPOSITION 4.1

Proposition 4.1. Let $\hat{h} = \arg\min_{h\in H}\sum_{i=1}^m w_i\,\ell(h(x'_i), y'_i)$, and suppose the additive error coreset guarantee holds with parameter $\epsilon > 0$. Then for any $\delta > 0$, with probability at least $1 - \delta$, it holds that $L(\hat{h}) \le L(h^*) + 2\epsilon + 2\sqrt{\ln(4/\delta)/2T}$.

Proof of Proposition 4.1.
$$L(\hat{h}) = E_{(x,y)\sim D}[\ell(\hat{h}(x), y)]$$
$$\le \tfrac{1}{T}\sum_{i=1}^T \ell(\hat{h}(x_i), y_i) + \sqrt{\ln(4/\delta)/2T} \qquad \text{(Hoeffding's inequality, using } 0 \le \ell(\cdot) \le 1\text{)}$$
$$\le \tfrac{1}{T}\sum_{i=1}^m w_i\,\ell(\hat{h}(x'_i), y'_i) + \epsilon + \sqrt{\ln(4/\delta)/2T} \qquad \text{(additive error coreset guarantee)}$$
$$\le \tfrac{1}{T}\sum_{i=1}^m w_i\,\ell(h^*(x'_i), y'_i) + \epsilon + \sqrt{\ln(4/\delta)/2T} \qquad \text{(by the definition of } \hat{h}\text{)}$$
$$\le \tfrac{1}{T}\sum_{i=1}^T \ell(h^*(x_i), y_i) + 2\epsilon + \sqrt{\ln(4/\delta)/2T} \qquad \text{(additive error coreset guarantee)}$$
$$\le E_{(x,y)\sim D}[\ell(h^*(x), y)] + 2\epsilon + 2\sqrt{\ln(4/\delta)/2T} \qquad \text{(Hoeffding's inequality)}$$
$$= L(h^*) + 2\epsilon + 2\sqrt{\ln(4/\delta)/2T}.$$

B.2 PROOFS FOR STATEMENTS IN SECTION 4.1

Algorithm 2 contains detailed pseudocode of the IWeS-V algorithm referred to in the main body of the paper.

Algorithm 2 IWeS-V
Require: Labeled pool P.
1: Initialize: S_0 = ∅, H_0 = H.
2: for t = 1, 2, . . . , T do
3:   Sample (x_t, y_t) uniformly at random from P.
4:   Update: $H_t = \big\{ h \in H_{t-1} : \tfrac{1}{t-1}\sum_{i=1}^{t-1}\tfrac{Q_i}{p_i}\ell(h(x_i), y_i) \le \min_{h' \in H_{t-1}} \tfrac{1}{t-1}\sum_{i=1}^{t-1}\tfrac{Q_i}{p_i}\ell(h'(x_i), y_i) + \Delta_{t-1} \big\}$, where $\Delta_{t-1} = \sqrt{\tfrac{8\log(2T(T+1)|H|^2/\delta)}{t-1}}$.
5:   Set $p_t = \max_{f,g\in H_t} \ell(f(x_t), y_t) - \ell(g(x_t), y_t)$.
6:   $Q_t \sim$ Bernoulli($p_t$).
7:   if $Q_t = 1$ then
8:     Set $S_t = S_{t-1} \cup \{(x_t, y_t, \tfrac{1}{p_t})\}$.
9:   else
10:    $S_t = S_{t-1}$.
11:  end if
12:  $h_t = \arg\min_{h\in H} \sum_{(x,y,w)\in S_t} w\,\ell(h(x), y)$.
13: end for
14: return $h_T$
In order to prove Theorem 4.2, we first present the following lemma.

Lemma B.1. Define the weighted empirical loss as $L_{t-1}(f) = \frac{1}{t-1}\sum_{i=1}^{t-1}\frac{Q_i}{p_i}\ell(f(x_i), y_i)$. For all $\delta > 0$, with probability at least $1 - \delta$, for all $t \in \{1, \ldots, T\}$ and for all $f, g \in H_{t-1}$, we have $|L_{t-1}(f) - L_{t-1}(g) - L(f) + L(g)| \le \Delta_{t-1}$.

Proof of Lemma B.1. Fix any $t \in \{1, \ldots, T\}$. For any $f, g \in H_{t-1}$, define
$$Z_i = \frac{Q_i}{p_i}\big(\ell(f(x_i), y_i) - \ell(g(x_i), y_i)\big) - \big(L(f) - L(g)\big) \quad \text{for } i \in \{1, \ldots, t-1\}.$$
The sequence $Z_i$ is a martingale difference sequence since
$$E[Z_i \mid Z_1, \ldots, Z_{i-1}] = E\Big[\frac{Q_i}{p_i}\big(\ell(f(x_i), y_i) - \ell(g(x_i), y_i)\big) - \big(L(f) - L(g)\big) \,\Big|\, Z_1, \ldots, Z_{i-1}\Big] = 0.$$
Moreover, $|Z_i| \le \frac{Q_i}{p_i}|\ell(f(x_i), y_i) - \ell(g(x_i), y_i)| + |L(f) - L(g)| \le 2$, due to the fact that $p_i \ge |\ell(f(x_i), y_i) - \ell(g(x_i), y_i)|$ for $f, g \in H_{t-1}$. Thus, we can apply Azuma's inequality:
$$\Pr\big[|L_{t-1}(f) - L_{t-1}(g) - L(f) + L(g)| \ge \Delta_{t-1}\big] \le 2\exp\big(-(t-1)\Delta_{t-1}^2/8\big) = \frac{\delta}{T(T+1)|H|^2},$$
where we used the fact that the $(x_t, y_t)$ are i.i.d. Since $H_{t-1}$ is a random subset of $H$, we take a union bound over $f, g \in H$. Then we take another union bound over $t \in \{1, \ldots, T\}$ to finish the proof.

Next we provide the proof of Theorem 4.2.

Theorem 4.2. Let $h^* \in H$ be the minimizer of the expected loss, $h^* = \arg\min_{h\in H} L(h)$. For any $\delta > 0$, with probability at least $1 - \delta$, for any $t \ge 1$ with $t \in \{1, 2, \ldots, T\}$, we have that $h^* \in H_t$ and that $L(f) - L(g) \le 2\Delta_{t-1}$ for any $f, g \in H_t$. In particular, if $h_T$ is the output of IWeS-V, then $L(h_T) - L(h^*) \le 2\Delta_{T-1} = O\big(\sqrt{\log(T/\delta)/T}\big)$.

Proof of Theorem 4.2. We show that $h^* \in H_t$ by induction. The base case holds since $h^* \in H_1 = H$. Now, assuming that $h^* \in H_{t-1}$ holds, we show that $h^* \in H_t$. Let $\tilde{h} = \arg\min_{f\in H_{t-1}} L_{t-1}(f)$. By Lemma B.1,
$$L_{t-1}(h^*) - L_{t-1}(\tilde{h}) \le L(h^*) - L(\tilde{h}) + \Delta_{t-1} \le \Delta_{t-1},$$
since $L(h^*) - L(\tilde{h}) \le 0$ by definition of $h^*$. Thus, $L_{t-1}(h^*) \le L_{t-1}(\tilde{h}) + \Delta_{t-1}$, which means that $h^* \in H_t$ by definition of $H_t$. Since $H_t \subseteq H_{t-1}$, Lemma B.1 implies that for any $f, g \in H_t$,
$$L(f) - L(g) \le L_{t-1}(f) - L_{t-1}(g) + \Delta_{t-1} \le \big(L_{t-1}(\tilde{h}) + \Delta_{t-1}\big) - L_{t-1}(\tilde{h}) + \Delta_{t-1} = 2\Delta_{t-1}.$$
Noting that $h^*, h_t \in H_t$ completes the proof.

B.3 PROOFS FOR STATEMENTS IN SECTION 4.2

Theorem 4.3. For any $\delta > 0$, with probability at least $1 - \delta$, the expected sampling rate of the IWeS-V algorithm is $\sum_{t=1}^T E_{(x_t,y_t)\sim D}[p_t \mid F_{t-1}] = O\big(\theta_S\big(L(h^*)T + \sqrt{T\log(T/\delta)}\big)\big)$.

Proof of Theorem 4.3. By Theorem 4.2, $H_t \subseteq \{h \in H : L(h) \le L(h^*) + 2\Delta_{t-1}\}$. Using this fact and the fact that $h^*$ is the best-in-class hypothesis, it holds that
$$\rho_S(h, h^*) = E_{(x,y)\sim D}\big[|\ell(h(x), y) - \ell(h^*(x), y)|\big] \le L(h) + L(h^*) \le 2L(h^*) + 2\Delta_{t-1}.$$
Thus, letting $r = 2L(h^*) + 2\Delta_{t-1}$, it holds that $H_t \subseteq B_S(h^*, r)$. Then,
$$E_{(x_t,y_t)\sim D}[p_t \mid F_{t-1}] = E_{(x_t,y_t)\sim D}\Big[\sup_{f,g\in H_t}|\ell(f(x_t), y_t) - \ell(g(x_t), y_t)| \,\Big|\, F_{t-1}\Big]$$
$$\le 2\,E_{(x_t,y_t)\sim D}\Big[\sup_{h\in H_t}|\ell(h(x_t), y_t) - \ell(h^*(x_t), y_t)| \,\Big|\, F_{t-1}\Big]$$
$$\le 2\,E_{(x_t,y_t)\sim D}\Big[\sup_{h\in B_S(h^*, r)}|\ell(h(x_t), y_t) - \ell(h^*(x_t), y_t)|\Big] \le 2\theta_S r = 4\theta_S\big(L(h^*) + \Delta_{t-1}\big).$$
Summing the above over $t \in [T]$, the sampling rate bound of IWeS-V is then given by $O\big(\theta_S L(h^*)T + \theta_S\sqrt{T\log(T/\delta)}\big)$, which completes the proof.

Theorem 4.4. If the loss function has bounded slope asymmetry $K_\ell$, then $\theta_S \le \theta_{AL}K_\ell$.

Proof of Theorem 4.4. We first prove
$$B_{AL}(h^*, r) \subseteq B_S(h^*, r) \subseteq B_{AL}(h^*, K_\ell r). \qquad (5)$$
The left-hand side inclusion follows directly from the definitions.
For the right-hand side, assume $h \in B_S(h^*, r)$. Then, by the definition of $K_\ell$, we have
$$r \ge \rho_S(h, h^*) = E_{(x,y)\sim D}\big[|\ell(h(x), y) - \ell(h^*(x), y)|\big] \ge \frac{1}{K_\ell}\,E_{x}\Big[\sup_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|\Big] = \frac{1}{K_\ell}\,\rho_{AL}(h, h^*),$$
where the second inequality comes from
$$K_\ell = \sup_{z,z'\in Z}\frac{\max_{y\in Y}|\ell(z, y) - \ell(z', y)|}{\min_{y\in Y}|\ell(z, y) - \ell(z', y)|} \ge \frac{\max_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|}{\min_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|} \quad \text{for every } x,$$
and hence
$$K_\ell \ge \frac{E_{x}\big[\max_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{E_{x}\big[\min_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|\big]} \ge \frac{E_{x}\big[\max_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{E_{(x,y)\sim D}\big[|\ell(h(x), y) - \ell(h^*(x), y)|\big]},$$
which follows from basic properties of expectations. As a result, we have
$$\theta_S = \sup_{r\ge 0}\frac{E_{(x,y)\sim D}\big[\max_{h\in B_S(h^*, r)}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r} \qquad \text{(by definition of } \theta_S\text{)}$$
$$\le \sup_{r\ge 0}\frac{E_{(x,y)\sim D}\big[\max_{h\in B_{AL}(h^*, K_\ell r)}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r} \qquad \text{(using equation (5))}$$
$$= K_\ell\,\sup_{r\ge 0}\frac{E_{(x,y)\sim D}\big[\max_{h\in B_{AL}(h^*, r)}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r} \qquad \text{(redefining } K_\ell r \text{ as } r\text{)}$$
$$\le K_\ell\,\sup_{r\ge 0}\frac{E_{x}\big[\max_{h\in B_{AL}(h^*, r)}\sup_{y\in Y}|\ell(h(x), y) - \ell(h^*(x), y)|\big]}{r} = \theta_{AL}K_\ell,$$
which concludes the proof.

B.4 THE ENHANCED-IWES-V ALGORITHM

Here we analyze a modified algorithm, which we call Enhanced-IWeS-V, that uses a new slack term,
$$\Delta^{\mathrm{EIWeS}}_t = \frac{2}{t}\sqrt{\Big(\sum_{s=1}^{t} p_s\Big)\log\big(8T^2|H|^2\log(T)/\delta\big)} + \frac{6}{t}\sqrt{\log\big((3+t)t^2/\delta\big)\log\big(8T^2|H|^2\log(T)/\delta\big)},$$
to define the version space in the IWeS-V algorithm. The theorem below shows that Enhanced-IWeS-V admits an improved sampling rate bound that is smaller than the bound in Theorem 4.3, since the square root term also scales with $L(h^*)$. In particular, in the realizable setting where $L(h^*) = 0$, the sampling rate bound is $O(\log^3(T))$, which depends poly-logarithmically on $T$ and is smaller than the $\sqrt{T\log(T)}$ bound from Theorem 4.3.

Theorem B.2. For all $\delta > 0$ and all $T \ge 3$, with probability at least $1 - \delta$, the expected sampling rate of the Enhanced-IWeS-V algorithm is
$$\sum_{t=1}^T E_{(x_t,y_t)\sim D}[p_t \mid F_{t-1}] \le \theta_S\,L(h^*)\,T + O\big(\sqrt{L(h^*)\,T\log(T/\delta)}\big) + O\big(\log^3(T/\delta)\big).$$

Proof of Theorem B.2. The proof follows that of Lemma 6 in Cortes et al. (2019). From the proof of Theorem 4.3, for $t \ge 3$,
$$E_{(x_t,y_t)\sim D}\big[p_t \mid F_{t-1}\big] \le 4\theta_S\big(L(h^*) + \Delta_{t-1}\big).$$
Plugging the expression for $\Delta^{\mathrm{EIWeS}}_{t-1}$ into $\Delta_{t-1}$, and applying a concentration inequality to relate $\sum_{t=1}^T p_t$ to $\sum_{t=1}^T E_{(x_t,y_t)\sim D}[p_t \mid F_{t-1}]$, we end up with a recursion on $E_{(x_t,y_t)\sim D}[p_t \mid F_{t-1}]$:
$$E_{(x_t,y_t)\sim D}\big[p_t \mid F_{t-1}\big] \le 4\theta_S L(h^*) + \frac{4\theta_S c_1}{t-1}\sqrt{\sum_{s=1}^{t-1} E_{(x_s,y_s)\sim D}\big[p_s \mid F_{s-1}\big]} + \frac{c_2\log\big((t-1)|H|/\delta\big)}{t-1},$$
where $c_1 = 2\sqrt{\log\big(8T^2|H|^2\log(T)/\delta\big)}$ and $c_2$ is a constant. Then we show that for all $t \ge 3$, for constants $c_3 = O\big(\sqrt{L(h^*)\log(T|H|/\delta)}\big)$ and $c_4 = O\big(\log^2(T|H|/\delta)\big)$, we have
$$E_{(x_t,y_t)\sim D}\big[p_t \mid F_{t-1}\big] \le 4\theta_S L(h^*) + \frac{c_3}{\sqrt{t-1}} + \frac{c_4}{t-1}.$$
The above can be proved by induction as in Cortes et al. (2019). Summing the above over $t \in [T]$ gives the result. We absorb the dependency on $|H|$ inside the big-O notation.