Agnostic Multi-Group Active Learning

Nick Rittler, University of California - San Diego, nrittler@ucsd.edu
Kamalika Chaudhuri, University of California - San Diego, kamalika@cs.ucsd.edu

Abstract

Inspired by the problem of improving classification accuracy on rare or hard subsets of a population, there has been recent interest in models of learning where the goal is to generalize to a collection of distributions, each representing a group. We consider a variant of this problem from the perspective of active learning, where the learner is endowed with the power to decide which examples are labeled from each distribution in the collection, and the goal is to minimize the number of label queries while maintaining PAC-learning guarantees. Our main challenge is that standard active learning techniques such as disagreement-based active learning do not directly apply to the multi-group learning objective. We modify existing algorithms to provide a consistent active learning algorithm for an agnostic formulation of multi-group learning, which, given a collection of $G$ distributions and a hypothesis class $H$ with VC-dimension $d$, outputs an $\epsilon$-optimal hypothesis using

$$\widetilde{O}\left(\left(\nu^2/\epsilon^2\right) G d\, \theta_{\mathcal{G}}^2 \log^2(1/\epsilon) + G\log(1/\epsilon)/\epsilon^2\right)$$

label queries, where $\theta_{\mathcal{G}}$ is the worst-case disagreement coefficient over the collection. Roughly speaking, this guarantee improves upon the label complexity of standard multi-group learning in regimes where disagreement-based active learning algorithms may be expected to succeed, and the number of groups is not too large. We also consider the special case where each distribution in the collection is individually realizable with respect to $H$, and demonstrate that $\widetilde{O}\left(G d\, \theta_{\mathcal{G}} \log(1/\epsilon)\right)$ label queries are sufficient for learning in this case. We further give an approximation result for the full agnostic case inspired by the group-realizable strategy.

1 Introduction

There is a growing theory literature concerned with choosing a classifier that performs well on multiple subpopulations or groups [1, 2, 7, 6, 4, 5, 3]. In many cases, the motivation comes from a perspective of fairness, where a typical requirement is that we classify with similar accuracy across groups [7, 6, 4]. In other cases, the motivation may simply be to train more reliable classifiers. For example, cancer detection models with good overall accuracy often suffer from poor ability to detect rare subtypes of cancer that are not well-represented or identified in training, which suggests that naive ERM may be insufficient in practice [8].

In this work, we consider the following formulation of multi-group learning. The learner is given a collection of distributions $\mathcal{G} = \{D_g\}_{g=1}^G$, each corresponding to a group, and a hypothesis class $H$, and wants to pick a classifier that approximately minimizes the maximum classification error over the group distributions. We consider this problem from an active learning perspective, where the learner has the power to choose which examples from each group it wants to label during training. In a standard extension of the active learning literature, we set out to design schemes for choosing which examples from each group should be labeled, where the goal is to minimize the number of label queries while retaining PAC-learning guarantees.

A major challenge in harnessing the power of active algorithms, even in standard agnostic settings, is making sure they are consistent.
In the case of active learning, this means that as the number of labels requested approaches infinity, the learner outputs an optimal hypothesis. To complicate things further, the main algorithmic paradigm for consistent agnostic active learning over a single distribution - disagreement-based active learning (DBAL) - fails to admit direct application to the multi-group learning objective.

The fundamental idea in DBAL is that the learner may safely spend its labeling budget in the disagreement region, a subset of instance space where empirically well-performing hypotheses disagree about how new examples should be labeled. When the learner need only consider a single distribution, error differences between classifiers are specified entirely through their performance on the disagreement region, and so spending the labeling budget here allows the learner to figure out which hypotheses are best while saving on labels. The problem is that when multiple group distributions must be considered, the absolute errors of classifiers on each group must be estimated to compare the performance of two classifiers in their worst case over the collection, and this property no longer holds.

We resolve this via the observation that, while we cannot spend all our labeling budget in the disagreement region, we can exploit the agreement in its complement to cheaply estimate the absolute errors of classifiers on each group. In particular, we estimate the absolute errors by choosing a representative classifier $h' \in H'$ in the set of empirically well-performing classifiers $H'$, and estimating its error on the complement of the disagreement region on each group distribution. These error estimates can be used to construct estimates for the absolute errors on each group for each $h \in H'$ at the statistical cost of estimating a coin bias, leading to a relatively cheap, consistent active strategy.

We analyze the number of label queries made by this scheme in terms of a standard complexity measure in the active learning literature called the disagreement coefficient [9, 10], and show an upper bound of

$$\widetilde{O}\left(\left(\nu^2/\epsilon^2\right) G d\, \theta_{\mathcal{G}}^2 \log^2(1/\epsilon) + G\log(1/\epsilon)/\epsilon^2\right),$$

where $\theta_{\mathcal{G}}$ is the maximal disagreement coefficient over the collection of group distributions. We discuss some regimes where this label complexity can be expected to push below sample complexity lower bounds for a learner that can request samples from each group distribution during training, but does not have the power to abstain from labeling specific examples.

We also consider the special case of agnostic learning where each group distribution is individually realizable, but no single hypothesis separates all groups simultaneously. In this case, we show that all dependence on $1/\epsilon^2$ in the label complexity can be replaced with $\log(1/\epsilon)$ when disagreement coefficients are bounded. It turns out that using the strategy we develop in this special case leads to an approximation algorithm for the general agnostic case, for which we give guarantees.

2 Related Work

2.1 Multi-Group Learning

The majority of the empirical work on multi-group learning has been through the lens of Group Distributionally Robust Optimization (G-DRO) [11, 12, 13]. The goal in G-DRO is to choose a classifier that minimizes the maximal risk against an unknown mixture over a collection of distributions $\{D_g\}_{g=1}^G$ representing groups.
One assumes a completely passive sampling setting - all data is given to the learner at the beginning of training, and the learner has no ability to draw extra, fine-grained samples. The strategy is usually empirical risk minimization (ERM) - or some regularized variant - on the empirical max loss over groups; for a set of classifiers parameterized by $\varphi \in \Phi$, letting $S_g$ denote the set of examples in the training set coming from $D_g$, one performs

$$\min_{\varphi \in \Phi} \max_{g \in [G]} \frac{1}{|S_g|}\sum_{(x_i, y_i)\in S_g} \ell(f_\varphi(x_i), y_i)$$

for some loss $\ell$. It is important to note that the learner knows the group identity of each sample in the training set, but is not provided with group information at test time, precluding the possibility of training a separate classifier for each group.

Multi-group PAC learning considers the multi-group problem under the passive sampling assumption from a more classical learning-theoretic perspective [7, 6]. Here, one assumes there is a single distribution $D$ from which one is given samples, but also a collection of subsets of instance space $\mathcal{G}$ over which one wants to learn conditional distributions. Given a hypothesis class $H$, the learner tries to improperly learn a classifier $f$ that competes with the optimal hypothesis on each conditional distribution specified by a group $g$ in the collection - formally, one requires that for a given error tolerance $\epsilon$, $f$ has the property

$$\forall g \in \mathcal{G},\quad P_D(f(x)\neq y \mid x\in g) \le \inf_{h\in H} P_D(h(x)\neq y \mid x\in g) + \epsilon$$

with high probability. An interesting wrinkle in this literature is that the group identity of samples is available at both training and test times. It has been shown that a sample complexity of $\widetilde{O}\left(\log(|H|)/\gamma\epsilon^2\right)$ is sufficient for agnostic learning in this model, where $\gamma$ is the minimal mass of a group $g$ under $D$ [6].

Table 1: Overview of the complexity of multi-group learning. The $\widetilde{O}$ notation hides factors logarithmic in $d$, $G$, $1/\delta$, and $\log(1/\epsilon)$. We reserve discussion of regimes in which our algorithm improves on results in collaborative learning for Section 5.

| Problem | Full Agnostic | Group-Realizable |
|---|---|---|
| Passive Multi-Group [6] | $\widetilde{O}\left(\log(\lvert H\rvert)/\gamma\epsilon^2\right)$ | $\widetilde{O}\left(\log(\lvert H\rvert)/\gamma\epsilon\right)$ |
| Collaborative Learning [3] | $\widetilde{O}\left(d\log(1/\epsilon)/\epsilon^2 + G/\epsilon^2\right)$ | ? |
| Active Multi-Group (us) | $\widetilde{O}\left((\nu^2/\epsilon^2)Gd\theta_{\mathcal{G}}^2\log^2(1/\epsilon) + G\log(1/\epsilon)/\epsilon^2\right)$ | $\widetilde{O}\left(Gd\theta_{\mathcal{G}}\log(1/\epsilon)\right)$ |

Collaborative learning studies the multi-group problem under an alternative sampling model [1, 2, 3]. In this case, we are given a collection of distributions $\{D_g\}_{g=1}^G$, each corresponding to a group. Given some hypothesis class $H$, the goal is to learn a classifier $f$, possibly improperly, that is evaluated against its worst-case loss over $D_1, \ldots, D_G$; formally, we would like $f$ to satisfy

$$\max_{g\in[G]} P_{D_g}(f(x)\neq y) \le \inf_{h\in H}\max_{g\in[G]} P_{D_g}(h(x)\neq y) + \epsilon.$$

In contrast with multi-group PAC learning, the learner may decide how many samples from each $D_g$ it wants to collect during training, and group identity is hidden on test examples. This models the case where a learner may want to collect more data from a particularly difficult group of instances, such as a rare or hard-to-diagnose type of cancer. It has been shown for finite hypothesis classes that $\Theta(\log(|H|)/\epsilon^2 + G/\epsilon^2)$ total samples over all groups are necessary and sufficient to learn in this model; $O(d\log(1/\epsilon)/\epsilon^2 + G/\epsilon^2)$ total samples are sufficient for VC-classes [3].

Our work extends the model of collaborative learning, and endows the learner with the ability to decide which samples from each group distribution $D_g$ should be labeled. This is the standard framework of active learning, applied to the multi-group setting.
As in collaborative learning, we assume group identity is hidden at test time.

2.2 Active Learning

Active learning concerns itself with the development of learning algorithms for training classifiers that have power over which training examples should be labeled [14, 15]. The field has largely focused on uncovering settings in which algorithmic approaches reduce the number of labels required for PAC-style learning guarantees below the sample complexity lower bounds that apply to i.i.d. data collection from the underlying distribution [16, 17]. In the agnostic, 0-1 loss setting, the standard upper bounds for label complexity are of the form $O\left(\theta\left(d\log(1/\epsilon) + d\nu^2/\epsilon^2\right)\right)$. Here, $\nu$ is the noise rate, i.e. the true error of the optimal hypothesis $h^*$, and $\theta$ is a distribution-dependent parameter called the disagreement coefficient. Thus, gains of active strategies over standard passive lower bounds of $\Omega(d\nu/\epsilon^2)$ depend on certain easiness conditions like small noise rates and bounded disagreement coefficients [15].

The vast majority of the work on active learning has been done in the 0-1 loss setting [18, 9, 10, 19, 20, 21]. It has been significantly harder to push the design of active learning algorithms past the regime of accuracy on a fixed distribution. While some work has attempted to generalize classical ideas of active learning to different losses [22], these are heavily outnumbered in the literature.

As previously mentioned, the most difficult part of designing agnostic active learning strategies is maintaining consistency. The issue comes down to a phenomenon referred to as sampling bias: because active learners would like to target certain parts of space to save on labels, there is a risk that the learner prematurely stops sampling in a part of space containing some detail of the distribution that could only be detected at a higher labeling resolution. This can easily lead to inconsistent strategies [15]. Thus, a major contribution of our work is exhibiting a consistent active scheme for the multi-group problem.

3 Preliminaries

3.1 Learning Problem

We study a binary classification setting where examples fall in some instance space $\mathcal{X}$, and labels lie in $\mathcal{Y} := \{-1, 1\}$. We suppose we are given some pre-specified, finite collection of distributions $\mathcal{G} = \{D_g\}_{g=1}^G$ over $\mathcal{X}\times\mathcal{Y}$ corresponding to groups. Given a hypothesis class $H$ of measurable classifiers with VC-dimension $d$, the goal of the learner is to pick some $h\in H$ from finite data that performs well across all the distributions in $\mathcal{G}$ in the worst case.

Let $L_{\mathcal{G}}(h\mid g) := P_{D_g}(h(x)\neq y)$ be the error of a hypothesis $h$ on group $g$. Formally speaking, the learner would like to use finite data to choose a classifier approximately obtaining

$$\inf_{h\in H}\max_{g\in[G]} L_{\mathcal{G}}(h\mid g).$$

We often use $L^{\max}_{\mathcal{G}}(h)$ as shorthand for $\max_{g\in[G]} L_{\mathcal{G}}(h\mid g)$. We use $\nu := \inf_{h\in H} L^{\max}_{\mathcal{G}}(h)$ to denote the noise rate of $H$ on the multi-distribution objective. The use of the term "agnostic" throughout reflects the fact that we make no assumption that $\nu = 0$ in our algorithm design or analysis. We assume for simplicity that there is some $h^*\in H$ attaining $\nu$.
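To make the objective concrete, the following minimal sketch computes the empirical analogue of $L^{\max}_{\mathcal{G}}$; the classifier interface and the per-group samples here are hypothetical stand-ins, not part of our formal model:

```python
import numpy as np

def worst_group_error(h, group_samples):
    """Empirical analogue of the multi-group objective max_g L(h | g).

    h:             a classifier mapping an array of instances to labels in {-1, +1}.
    group_samples: a list of (X_g, y_g) pairs, one labeled sample per group.
    Returns the largest empirical 0-1 error over the groups.
    """
    return max(np.mean(h(X_g) != y_g) for X_g, y_g in group_samples)
```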
3.2 Active Learning Model

We consider a standard active learning model, specified as follows. Let $\mathrm{supp}(g)$ denote the support of the marginal of $D_g$ over instance space. The active learner has access to two sampling oracles for each distribution $D_g$. The first is $U_g(\cdot)$, which, given a set $S\subseteq\mathcal{X}$ measurable with respect to $D_g$, returns an unlabeled sample from $D_g$ conditioned on $S$; if $P_{D_g}(x\in S) = 0$, then $U_g(S)$ returns "None". The second is $O_g(\cdot)$, which, given a point in $\mathrm{supp}(g)$, returns a sample from the conditional distribution over labels specified by $x$ and $g$. More formally, querying $U_g(S)$ for $S$ such that $P_{D_g}(x\in S) \neq 0$ is equivalent to drawing i.i.d. samples according to the marginal of $D_g$ over instance space (independent of previous randomness), and returning the first example that falls in $S$; querying the oracle $O_g(x)$ for $x\in\mathrm{supp}(g)$ is equivalent to receiving a sample from a Rademacher random variable with parameter $P_{D_g}(Y = 1\mid X = x)$.

As is standard in active learning, the active learner is assumed to have functionally unlimited access to queries from $U_g(\cdot)$. On the other hand, queries to the oracles $O_g(\cdot)$ are precious: the label complexity of a strategy executed by the active learner is the sum of queries to the oracles $O_g(\cdot)$ over all $g$, and is to be minimized subject to a desired generalization error guarantee.

4 Challenges in Multi-Group Active Learning

In this section, we give some background on classical disagreement-based methods for a single distribution, and discuss in more detail the challenge of designing consistent active learning strategies in the multi-group setting.

4.1 Background on Disagreement-Based Active Learning

Almost all agnostic active learning algorithms for accuracy over a single distribution boil down to disagreement-based methods [18, 10, 15, 16]. The fundamental idea in this school of algorithms is that one can learn the relative accuracy of two classifiers $h$ and $h'$ by only requesting labels for examples in the part of instance space on which they disagree about how examples should be labeled. More generally, given a set of classifiers $H'\subseteq H$, one can consider the disagreement region of $H'$, defined as

$$\Delta(H') := \left\{x\in\mathcal{X} : \exists\, h, h'\in H' \text{ s.t. } h(x)\neq h'(x)\right\}.$$

As alluded to above, the difference in accuracy of classifiers $h, h'\in H'$ is specified entirely through this inherently label-independent notion. For a single distribution $D$, we may write

$$P_D(h(x)\neq y) - P_D(h'(x)\neq y) = P_D\left(x\in\Delta(H')\right)\left[P_D\left(h(x)\neq y \mid x\in\Delta(H')\right) - P_D\left(h'(x)\neq y \mid x\in\Delta(H')\right)\right],$$

as by definition, $h$ and $h'$ have the same conditional loss on $\Delta(H')^c$. Inspired by this observation, the idea is to label examples in $\Delta(H')$, and ignore those outside of it. This allows the learner to learn about the relative performance of classifiers while saving on the labels of examples falling in $\Delta(H')^c$. In running a DBAL algorithm, one hopes certain classifiers quickly reveal themselves to be empirically so much worse on $\Delta(H')$ than the current ERM hypothesis that, by standard concentration bounds, they can be inferred to be worse than $\epsilon$-optimal on $D$ with high probability. Elimination of these classifiers shrinks the disagreement region, allowing the labeling to become further fine-grained. Given the above loss decomposition, this leads to consistent active learning strategies.
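The decomposition above is easy to sanity-check numerically. The following sketch, with a synthetic distribution and two threshold classifiers chosen purely for illustration, confirms that the empirical error difference is realized entirely on the disagreement region:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)              # synthetic marginal over X = [0, 1]
flip = rng.uniform(0, 1, x.size) < 0.8      # labels agree with sign(x - 0.5) 80% of the time
y = np.where(flip, np.sign(x - 0.5), -np.sign(x - 0.5))

h  = lambda z: np.sign(z - 0.4)             # two threshold classifiers
hp = lambda z: np.sign(z - 0.6)

disagree = h(x) != hp(x)                    # empirical disagreement region
err_diff = np.mean(h(x) != y) - np.mean(hp(x) != y)
restricted = disagree.mean() * (np.mean(h(x)[disagree] != y[disagree])
                                - np.mean(hp(x)[disagree] != y[disagree]))
assert np.isclose(err_diff, restricted)     # the difference lives entirely on the region
```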
4.2 Labeling in the Disagreement Region: No Longer Enough

In the multi-group setting, the strategy of comparing the performance of classifiers solely on $\Delta(H')$ breaks down. Although the classifiers in $H'$ still agree on $\Delta(H')^c$, this is no longer enough to infer differences in the worst-case error over groups $L^{\max}_{\mathcal{G}}$; differences in performance on $\Delta(H')$ are not generally representative of differences in absolute errors over the group distributions. The following simple example makes this concrete.

Example 1. Consider the task of determining which of two classifiers $h$ and $h'$ has lower worst-case error over distributions $D_1$ and $D_2$ with marginal supports $S_1\subseteq\mathcal{X}$ and $S_2\subseteq\mathcal{X}$. Let their disagreement region be denoted by $\Delta = \{x\in\mathcal{X} : h(x)\neq h'(x)\}$, and let $l(f, i, A)$ denote the conditional loss of classifier $f$ on $S_i\cap A$ under $D_i$. Suppose we only know their conditional losses on $\Delta\cap S_1$ and $\Delta\cap S_2$ under $D_1$ and $D_2$, respectively. We see for $h$ that

$$l(h, i, A) = \begin{cases} 49/100 & i = 1,\ A = \Delta\cap S_1\\ 52/100 & i = 2,\ A = \Delta\cap S_2\\ ? & i = 1,\ A = \Delta^c\cap S_1\\ ? & i = 2,\ A = \Delta^c\cap S_2\end{cases}$$

and for $h'$ that

$$l(h', i, A) = \begin{cases} 51/100 & i = 1,\ A = \Delta\cap S_1\\ 48/100 & i = 2,\ A = \Delta\cap S_2\\ ? & i = 1,\ A = \Delta^c\cap S_1\\ ? & i = 2,\ A = \Delta^c\cap S_2\end{cases}$$

Consider ignoring the performance of the classifiers on $\Delta^c$, and using

$$\max_{i\in\{1,2\}} l(\cdot, i, S_i\cap\Delta)$$

as a surrogate for the multi-group objective. In this case, we would choose $h'$ as the better of the two hypotheses. Suppose now that $\Delta\cap S_1$ and $\Delta\cap S_2$ have mass $\gamma$ under $D_1$ and $D_2$, respectively, and that $l(h, 1, \Delta^c\cap S_1) = l(h', 1, \Delta^c\cap S_1) = 3/10$. Finally, suppose that $l(h, 2, \Delta^c\cap S_2) = l(h', 2, \Delta^c\cap S_2) = 2/10$. Then under the true multi-group objective, by decomposing the group losses, one can compute that $h$ has the lower worst-case error over groups $D_1$ and $D_2$.
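The arithmetic of Example 1 is quick to verify; the choice $\gamma = 1/2$ below is one illustrative value, not one prescribed by the example:

```python
# Worked check of Example 1 with the illustrative choice gamma = 1/2.
gamma = 0.5
loss_in_delta  = {"h": (0.49, 0.52), "h'": (0.51, 0.48)}  # conditional losses on Delta ∩ S_i
loss_out_delta = (0.30, 0.20)                             # both classifiers agree outside Delta

for name, (l1, l2) in loss_in_delta.items():
    worst = max(gamma * l1 + (1 - gamma) * loss_out_delta[0],
                gamma * l2 + (1 - gamma) * loss_out_delta[1])
    print(name, worst)   # h: 0.395, h': 0.405
# The surrogate restricted to Delta prefers h' (0.51 < 0.52), yet the true
# worst-case group error is lower for h.
```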
Thus, to utilize the disagreement region in multi-group algorithms, we will need to label at least some samples in $\Delta(H')^c$ as $H'$ shrinks. The specification of such a strategy is the content of the next section.

5 General Agnostic Multi-Group Learning

5.1 An Agnostic Algorithm

The basic idea in Algorithm 1 is similar to classical DBAL approaches for a single distribution. We start with the full hypothesis class $H$, and look to iteratively eliminate hypotheses from contention as we learn how to classify on each group through targeted labeling.

Our solution to the problem DBAL faces above is to keep track of the errors of well-performing hypotheses on the complement of the disagreement region in a way that exploits the agreement property. To do this, we construct a two-part estimate for the loss of a hypothesis on a given group. Denote by $H_i$ the set of hypotheses still in contention at iteration $i$. Let $R_i = \Delta(H_i)$, and let $S_{R_i,g}$ be a labeled sample from $U_g(R_i)$ and $S_{R_i^c,g}$ be a labeled sample from $U_g(R_i^c)$. We can now estimate the loss for some $h\in H_i$ on group $g$ via

$$\hat{L}_{S;R_i}(h\mid g) := P_{D_g}(x\in R_i)\,\hat{L}_{S_{R_i,g}}(h) + P_{D_g}(x\in R_i^c)\,\hat{L}_{S_{R_i^c,g}}(h_{H_i}),$$

where $\hat{L}_S(h) := \frac{1}{|S|}\sum_{(x,y)\in S} 1[h(x)\neq y]$ is a standard empirical loss estimate (taken to be an arbitrary constant if $S = \emptyset$; see the Appendix for details), and $h_{H_i}\in H_i$ is an arbitrarily chosen hypothesis from $H_i$ that is used in the loss estimate of every $h\in H_i$. This leads to an unbiased estimator, given that every $h\in H_i$ labels the sample from this part of space in exactly the same way. The utility of this estimator is that by choosing an arbitrary representative $h_{H_i}\in H_i$, we can estimate the loss of all hypotheses still in contention to precision $O(\epsilon)$ on $R_i^c$ with $O(1/\epsilon^2)$ samples, removing the usual dependence on the VC-dimension.

Algorithm 1 General Agnostic Algorithm

1: procedure multi_group_agnostic($H$, $\epsilon$, $\delta$, $\{U_g(\cdot)\}_{g=1}^G$, $\{O_g(\cdot)\}_{g=1}^G$)
2:   $H_1 \leftarrow H$, $I \leftarrow \lceil\log(1/\epsilon)\rceil$
3:   for $i\in[I]$ do
4:     $R_i \leftarrow \Delta(H_i)$
5:     $m_i \leftarrow \max_{g\in[G]} P_{D_g}(x\in\Delta(H_i))$
6:     for $g\in[G]$ do
7:       $S'_{R_i,g} \leftarrow 1024\left(\frac{m_i}{\epsilon 2^{I-i}}\right)^2\left(2d\log(64/\epsilon) + \ln\left(8G\lceil\log(1/\epsilon)\rceil/\delta\right)\right)$ i.i.d. samples from $U_g(R_i)$
8:       $S'_{R_i^c,g} \leftarrow \frac{128\ln\left(4G\lceil\log(1/\epsilon)\rceil/\delta\right)}{(\epsilon 2^{I-i})^2}$ i.i.d. samples from $U_g(R_i^c)$
9:       if None $\in S'_{R_i,g}$ then   ▷ $P_{D_g}(x\in R_i) = 0$ in this case
10:        $S_{R_i,g} \leftarrow \emptyset$
11:      else
12:        $S_{R_i,g} \leftarrow \{(x, O_g(x)) : x\in S'_{R_i,g}\}$
13:      end if
14:      if None $\in S'_{R_i^c,g}$ then
15:        $S_{R_i^c,g} \leftarrow \emptyset$
16:      else
17:        $S_{R_i^c,g} \leftarrow \{(x, O_g(x)) : x\in S'_{R_i^c,g}\}$
18:      end if
19:    end for
20:    $\hat{h}_i \leftarrow \arg\min_{h\in H_i} L^{\max}_{S;R_i}(h)$
21:    $H_{i+1} \leftarrow \left\{h\in H_i : L^{\max}_{S;R_i}(h) \le L^{\max}_{S;R_i}(\hat{h}_i) + 2^{I-i}\epsilon/4\right\}$
22:  end for
23:  return $\hat{h} = \arg\min_{h\in H_{I+1}} L^{\max}_{S;R_I}(h)$
24: end procedure

On the other hand, as the disagreement region shrinks, $P_{D_g}(x\in R_i)$ shrinks as well. So while we will still need to invoke uniform convergence to get reliable loss estimates in $R_i$, the precision to which we need to estimate losses in this part of space decreases with every iteration, and eventually the overall dependence on the VC-dimension is diminished. This latter observation is the standard source of gains in DBAL [18, 9, 24].

After forming these loss estimates on each group, we construct unbiased loss estimates for the worst case over groups via

$$L^{\max}_{S;R_i}(h) := \max_{g\in[G]} \hat{L}_{S;R_i}(h\mid g).$$

These loss estimates inherit concentration properties from the two-part estimator above. We draw enough samples at each iteration $i$ that we essentially learn the multi-group problem to precision $2^{I-i}\epsilon$.

We note that Algorithm 1 assumes access to the underlying group marginal measures $P_{D_g}$. This is common in the active learning literature [18, 20]. Probabilities of events in instance space can be estimated to arbitrary accuracy using only unlabeled data, so this assumption is not dangerous to our goal of lowering label complexities.
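For illustration, here is a minimal sketch of the two-part estimate; the interfaces are hypothetical, and the sample sizes are left to the caller rather than set to the values prescribed by Algorithm 1:

```python
import numpy as np

def two_part_loss(h, h_rep, p_R, sample_R, sample_Rc):
    """Two-part estimate of L(h | g) relative to a region R.

    h:         the hypothesis being evaluated.
    h_rep:     an arbitrary representative of H_i; all of H_i agrees with it
               on R^c, so its error there stands in for every h in H_i.
    p_R:       P_{D_g}(x in R), assumed known (estimable from unlabeled data).
    sample_R:  labeled (x, y) pairs drawn from D_g conditioned on R.
    sample_Rc: labeled (x, y) pairs drawn from D_g conditioned on R^c.
    """
    # Convention from the Appendix: an empty sample contributes an arbitrary
    # constant, since its region then has probability zero under D_g.
    err_R  = np.mean([h(x) != y for x, y in sample_R]) if sample_R else 1.0
    err_Rc = np.mean([h_rep(x) != y for x, y in sample_Rc]) if sample_Rc else 1.0
    return p_R * err_R + (1 - p_R) * err_Rc
```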
5.2 Guarantees

Vitally, the scheme given in Algorithm 1 is consistent. It is a lemma of ours that the number of samples drawn at each iteration is sufficiently large that the true excess error of any $h\in H_{i+1}$ is no more than $2^{I-i}\epsilon$. Thus, after $I = \lceil\log(1/\epsilon)\rceil$ iterations, the ERM hypothesis on $L^{\max}_{S;R_I}(\cdot)$ is $\epsilon$-optimal with high probability.

We can bound the label complexity of the algorithm using standard techniques from DBAL. A ubiquitous quantity in the analysis of disagreement-based schemes is the disagreement coefficient [9, 25]. The general idea is that the disagreement coefficient bounds the rate of decrease, in $r$, of the measure of the disagreement region of a ball of radius $r$ around $h^*$ in the pseudo-metric $\rho_g(h, h') := P_{D_g}(h(x)\neq h'(x))$. Precisely, we use the following definition of the disagreement coefficient in our analysis [10, 24]: given a group distribution $D_g$, the disagreement coefficient on $g$ is

$$\theta_g := \sup_{h\in H}\sup_{r'\ge 2\nu+\epsilon}\frac{P_{D_g}\left(x\in\Delta(B_g(h, r'))\right)}{r'},$$

where $B_g(h, r') := \{h'\in H : \rho_g(h, h')\le r'\}$ is a ball of radius $r'$ about $h$ in the pseudo-metric $\rho_g$. We further notate the maximum disagreement coefficient over the collection $\mathcal{G}$ as $\theta_{\mathcal{G}} := \max_g\theta_g$. The disagreement coefficient is trivially bounded above by $1/\epsilon$, but can be bounded independently of $\epsilon$ in many cases [10, 25]. For example, when $H$ is the class of linear separators in $d$ dimensions and the underlying marginal distribution is uniform over the Euclidean unit sphere, the disagreement coefficient is $\Theta(\sqrt{d})$.

Theorem 1. For all $\epsilon > 0$, $\delta\in(0,1)$, collections of groups $\mathcal{G}$, and hypothesis classes $H$ with $d < \infty$, with probability $1-\delta$, the output $\hat{h}$ of Algorithm 1 satisfies $L^{\max}_{\mathcal{G}}(\hat{h})\le L^{\max}_{\mathcal{G}}(h^*) + \epsilon$, and its label complexity is bounded by

$$\widetilde{O}\left(G\,\theta_{\mathcal{G}}^2\left(\frac{\nu^2}{\epsilon^2} + 1\right)\left(d\log(1/\epsilon) + \log(1/\delta)\right)\log(1/\epsilon) + \frac{G\log(1/\epsilon)\log(1/\delta)}{\epsilon^2}\right).$$

Here, the $\widetilde{O}$ notation hides factors of $\log(\log(1/\epsilon))$ and $\log(G)$; we leave all proofs for the Appendix. Theorem 1 tells us that Algorithm 1 enjoys the following upside over passive and collaborative learning approaches: the dependence on the standard interaction of the VC-dimension $d$ and $1/\epsilon^2$ is removed, and replaced with $Gd\theta_{\mathcal{G}}^2\log^2(1/\epsilon)\,\nu^2/\epsilon^2$, which in settings with small disagreement coefficients and low noise rates will be a significant improvement for small $\epsilon$.

5.3 Comparison to Lower Bounds in Collaborative Learning

We compare our label complexity guarantees to results in collaborative learning, where the learner has the power to ask for samples from specific group distributions, but not to selectively label these samples. This is a strictly more demanding comparison than to pure passive settings, but a fair one, given that active learners have the option of executing any collaborative learning strategy.

In collaborative learning, for finite hypothesis classes $H$, it is known that

$$\Omega\left(\frac{\log(|H|) + G\log(\min(|H|, G)/\delta)}{\epsilon^2}\right)$$

total labels over all groups are necessary [3]. We compare this lower bound to a simplified version of the label complexity guarantee in Theorem 1:

$$\widetilde{O}\left(dG\theta_{\mathcal{G}}^2\log^2(1/\epsilon) + \frac{G\log(1/\epsilon)}{\epsilon^2}\right),$$

thus implicitly assuming $\nu$ is negligibly small, and making all comparisons up to factors logarithmic in $G$, $1/\delta$, and $\log(1/\epsilon)$. The former is a standard assumption under which we may hope an agnostic active learner to succeed [15].

Even the simplified upper bound does not admit the cleanest comparison to this lower bound, due to our excess factor of $\log(1/\epsilon)$ in the second term. However, it does showcase that while we pay slightly more per group than necessary, under conditions amenable to active learning, we pay significantly less per dimension $d$. Particularly for small $\epsilon$, one can see that it is approximately sufficient that $G < o\left(\left(\theta_{\mathcal{G}}^2\left(\epsilon\log(1/\epsilon)\right)^2\right)^{-1}\right)$ for the simplified upper bound to beat the lower bound. For a more fine-grained comparison that in some sense underestimates the power of Algorithm 1, assume that the following condition governs the relationship of $G$, $d$, and $\epsilon$:

$$G\log(1/\epsilon) \le d < \left(\theta_{\mathcal{G}}^2\,\epsilon^2\log^2(1/\epsilon)\right)^{-1}.$$

Then the simplified bound is smaller in order than the lower bound above.
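To illustrate the comparison numerically (dropping constants and hidden logarithmic factors, using natural logarithms for simplicity, and with parameter values that are purely hypothetical):

```python
import math

def active_upper(d, G, theta, eps):
    # Simplified Theorem 1 bound: d*G*theta^2*log^2(1/eps) + G*log(1/eps)/eps^2.
    return d * G * theta**2 * math.log(1/eps)**2 + G * math.log(1/eps) / eps**2

def collab_lower(d, G, eps):
    # Collaborative-learning lower bound with log|H| ~ d: (d + G)/eps^2.
    return (d + G) / eps**2

# These values satisfy G*log(1/eps) <= d < 1/(theta^2 * eps^2 * log^2(1/eps)).
d, G, theta, eps = 100, 5, 2.0, 0.01
print(f"{active_upper(d, G, theta, eps):.2e}")   # ~2.7e5
print(f"{collab_lower(d, G, eps):.2e}")          # ~1.1e6
```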
6 Group Realizable Learning

A special case of the learning problem in which extreme active learning gains can be readily seen comes when the hypothesis class $H$ achieves zero noise rate on each group $D_g$. This setting has been considered in the passive multi-group learning literature [6]. Formally speaking, in the group realizable setting, the following condition holds:

$$\forall g\in[G],\ \exists\, h^*_g\in H \text{ s.t. } L_{\mathcal{G}}(h^*_g\mid g) = 0,$$

i.e. for every group in the collection $\mathcal{G}$, there is some hypothesis achieving 0 error on that group. Note that this differs from the fully realizable setting, where there is some $h^*\in H$ with $L^{\max}_{\mathcal{G}}(h^*) = 0$. While fully realizable implies group realizable, the converse is not true. Thus, group realizability represents an intermediate regime between the realizable setting and the full agnostic setting.

6.1 Algorithm

In the group realizable case, it is possible to reduce the problem to that of active learning over a hypothesis class with respect to a single distribution. This can be accomplished as follows. For each $D_g$, we call as a subroutine an active learner that is guaranteed to find an order-$\epsilon$-optimal hypothesis $\hat{h}_g$ with high probability over its queries. We then gather new unlabeled samples from each $D_g$, and instead of requesting labels from $O_g(\cdot)$, label each unlabeled point with $\hat{h}_g$. The final step is an empirical risk minimization on these artificially labeled samples with respect to the multi-group objective. See Algorithm 2 for a formal specification of the strategy.

Algorithm 2 Group Realizable Algorithm

1: procedure group_realizable($H$, $\epsilon$, $\delta$, active learner $A$, $\{U_g(\cdot)\}_{g=1}^G$, $\{O_g(\cdot)\}_{g=1}^G$)
2:   for $g\in[G]$ do
3:     $\hat{h}_g \leftarrow A(H, \epsilon/6, \delta/2G, U_g(\mathcal{X}), O_g)$
4:     $S'_g \leftarrow \frac{144}{\epsilon^2}\left(2d\ln(24/\epsilon) + \ln(8G/\delta)\right)$ samples from oracle $U_g(\mathcal{X})$
5:     $\hat{S}_g \leftarrow \left\{\left(x, \hat{h}_g(x)\right) : x\in S'_g\right\}$
6:   end for
7:   return $\hat{h} = \arg\min_{h\in H}\max_{g\in[G]}\frac{1}{|\hat{S}_g|}\sum_{(x,\hat{y})\in\hat{S}_g} 1[h(x)\neq\hat{y}]$
8: end procedure
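In code, the reduction might look like the following sketch; the learner `A`, the group oracle interfaces, and the fixed unlabeled-sample size are hypothetical stand-ins for the quantities in Algorithm 2:

```python
import numpy as np

def group_realizable(H, groups, A, n_unlabeled=10_000):
    """Reduction from multi-group learning to per-group active learning.

    H:      an iterable of candidate hypotheses (finite, for simplicity).
    groups: objects exposing unlabeled(n) -> instances and a label oracle,
            standing in for U_g and O_g.
    A:      an active learner returning a near-optimal hypothesis for one group.
    """
    pseudo_labeled = []
    for g in groups:
        h_g = A(H, g)                       # label-efficient subroutine call on group g
        X = g.unlabeled(n_unlabeled)        # fresh unlabeled draw: no label queries here
        pseudo_labeled.append((X, h_g(X)))  # artificially label with h_g
    # ERM on the artificially labeled data, with respect to the worst group.
    return min(H, key=lambda h: max(np.mean(h(X) != y) for X, y in pseudo_labeled))
```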
6.2 Guarantees

The strategy given in Algorithm 2 leads to a consistent active learning scheme, provided the active learners called as subroutines have standard guarantees that can be inherited. Theorem 2 gives a guarantee to this end. The proof follows from an argument similar to one used in [10]: because the subroutine calls return hypotheses with near-zero error on each group, the artificially labeled training set used in the ERM step looks nearly identical to a counterfactual training set constructed by querying the labels $O_g(x)$ for each unlabeled $x$. This is similar to the idea in [24]. We present Theorem 2 assuming access to a classical, realizable active learner due to [26].

Theorem 2. Suppose Algorithm 2 is run with the active learner $A_{\mathrm{CAL}}$ of [26]. Then for all $\epsilon > 0$, $\delta\in(0,1)$, hypothesis classes $H$ with $d < \infty$, and collections of groups $\mathcal{G}$ with the group realizability property under $H$, with probability $1-\delta$, the output $\hat{h}$ satisfies $L^{\max}_{\mathcal{G}}(\hat{h})\le L^{\max}_{\mathcal{G}}(h^*) + \epsilon$, and the number of labels requested is $\widetilde{O}\left(dG\theta_{\mathcal{G}}\log(1/\epsilon)\right)$.

Thus, when disagreement coefficients across the collection of groups are bounded independently of $\epsilon$, the usual passive dependence on $1/\epsilon^2$ is replaced by $\log(1/\epsilon)$. In the passive multi-group setting of [7], it has been shown that $\widetilde{O}(\log(|H|)/\gamma\epsilon)$ samples are sufficient for group realizable learning, where we recall $\gamma$ is a lower bound on the probability of getting a sample in each group [6].

7 Full Agnostic Approximation

7.1 Inconsistency of the Reduction in the Full Agnostic Regime

Algorithm 2 admits a clean analysis, and nicely harnesses the power of realizable active learners for a single distribution. One might wonder whether a similar strategy could remain consistent in the full agnostic regime. Unfortunately, the direct application of Algorithm 2 using agnostic learners does not yield a consistent active learning algorithm. In fact, consistency fails even when, for each $g\in[G]$, $\hat{h}_g$ is the Bayes optimal classifier on $D_g$ and $\nu_g := \inf_{h\in H} L_{\mathcal{G}}(h\mid g)$ is small. This lack of consistency comes down to the fact that labeling with the Bayes optimal classifier underestimates the noise rates on each group, which in turn may bias the output of the ERM step.

7.2 A $3\nu$-Approximation Algorithm

Although the strategy of creating an artificially labeled training set with near-optimal hypotheses on each group fails outside of the group realizable case, it possesses a nice approximation property. We give a guarantee to this end in Theorem 3. It states that if we call an active learner with agnostic guarantees on each group $D_g$, and then use the outputs $\hat{h}_g$ to artificially label a new batch of unlabeled data from each group, then ERM on this artificially labeled data gives at worst a $(2\max_g\nu_g + \epsilon)$-optimal hypothesis with high probability.

Theorem 3. Suppose Algorithm 2 is run with the agnostic active learner $A_{\mathrm{DHM}}$ of [10]. Then for all $\epsilon > 0$, $\delta\in(0,1)$, hypothesis classes $H$ with $d < \infty$, and collections of groups $\mathcal{G}$, with probability $1-\delta$, the output $\hat{h}$ satisfies

$$L^{\max}_{\mathcal{G}}(\hat{h}) \le L^{\max}_{\mathcal{G}}(h^*) + 2\max_{g\in[G]}\nu_g + \epsilon \le 3L^{\max}_{\mathcal{G}}(h^*) + \epsilon,$$

and the number of labels requested is $\widetilde{O}\left(dG\theta_{\mathcal{G}}\left(\log^2(1/\epsilon) + \frac{\nu_{\max}^2}{\epsilon^2}\right)\right)$, where $\nu_{\max} := \max_{g\in[G]}\nu_g$.

The proof is very similar to that of Theorem 2, but notes in addition that $\hat{h}_g$ mislabels roughly a $\nu_g$-fraction of the unlabeled samples from each group $g$. This allows us to upper bound the distortion of the ERM step.

8 Conclusion

In this work, we have taken a first look at active multi-group learning. Though the design of general agnostic strategies in this setting is quite challenging, an interesting future direction may be the search for strategies that work in more specific cases, for example extending our work in the group realizable setting. In particular, the search for algorithms with small label complexities under specific low-noise conditions, such as Tsybakov noise on each $D_g$, may prove fruitful [27].

References

[1] Avrim Blum, Nika Haghtalab, Ariel D. Procaccia, and Mingda Qiao. Collaborative PAC learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[2] Huy L. Nguyen and Lydia Zakynthinou. Improved algorithms for collaborative PAC learning, 2018.
[3] Nika Haghtalab, Michael I. Jordan, and Eric Zhao. On-demand sampling: Learning optimally from multiple distributions, 2023.
[4] Jacob Abernethy, Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern, Chris Russell, and Jie Zhang. Active sampling for min-max fairness, 2022.
[5] Agnieszka Słowik and Léon Bottou. On distributionally robust optimization and data rebalancing. In Proc. AISTATS 2022, February 2022.
[6] Christopher Tosh and Daniel Hsu. Simple and near-optimal algorithms for hidden stratification and multi-group learning. CoRR, abs/2112.12181, 2021.
[7] Guy N. Rothblum and Gal Yona. Multi-group agnostic PAC learnability. CoRR, abs/2105.09989, 2021.
[8] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. CoRR, abs/1909.12475, 2019.
[9] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pages 353-360, 2007.
[10] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Proceedings of the 20th International Conference on Neural Information Processing Systems, 2007.
[11] Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers?, 2018.
[12] Yonatan Oren, Shiori Sagawa, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust language modeling, 2019.
[13] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. CoRR, abs/1911.08731, 2019.
[14] Burr Settles. Active learning literature survey. 2009.
[15] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767-1781, 2011.
[16] Steve Hanneke and Liu Yang. Minimax analysis of active learning. 2014.
[17] Maria-Florina Balcan and Ruth Urner. Active learning - modern learning theory, 2016.
[18] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65-72, 2006.
[19] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.
[20] Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13(1), 2012.
[21] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. CoRR, abs/1407.2657, 2014.
[22] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. CoRR, abs/0812.4952, 2008.
[23] Steve Hanneke and Liu Yang. Negative results for active learning with convex losses. Journal of Machine Learning Research - Proceedings Track, 9:321-325, 2010.
[24] Chicheng Zhang and Kamalika Chaudhuri. Active learning from weak and strong labelers. CoRR, abs/1510.02847, 2015.
[25] Steve Hanneke. Theory of active learning. 2014.
[26] David A. Cohn, Les E. Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
[27] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition. arXiv preprint math/0507180, 2005.
[28] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
[29] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Springer New York, 1994.

9 Appendix

9.1 Guarantees for General Agnostic Algorithm

In this section, we give proofs for the guarantees of Algorithm 1. We begin with some definitions, starting with how empirical loss estimates are made.

Definition 1. Given a hypothesis $h\in H$ and a set of pairs $S = \{(x_i, y_i) : x_i\in\mathcal{X}, y_i\in\mathcal{Y}\}_{i=1}^N$, let

$$\hat{L}_S(h) := \frac{1}{N}\sum_{i=1}^N 1[h(x_i)\neq y_i]$$

be the standard empirical loss of $h$ on $S$. Let $\hat{L}_\emptyset(h) := 1$.

The convention $\hat{L}_\emptyset(h) = 1$ allows us to collapse the two-part loss estimates in the case where the probability of drawing an unlabeled sample in a specific region is 0; under the specification of the algorithm, $S = \emptyset$ if and only if the probability of a sample falling in the disagreement region or its complement is 0 under $D_g$, in which case we can safely ignore estimation in one of these regions.

Definition 2. Given a set of classifiers $H'\subseteq H$, we say $H'$ agrees on a subset $S\subseteq\mathcal{X}$ if for each $x\in S$ and for each pair $(h, h')\in H'\times H'$, it holds that $h(x) = h'(x)$.

We now recall the two-part estimator for the loss of a hypothesis introduced above.

Definition 3. Fix a group distribution $D_g$, some $H'\subseteq H$, a hypothesis $h\in H'$, and some $R\subseteq\mathcal{X}$ which is measurable with respect to each marginal of $D_g\in\mathcal{G}$ and for which $H'$ agrees on $R^c$. Given sets of pairs $S_{R,g}$ and $S_{R^c,g}$, and some arbitrarily chosen classifier $h_{H'}\in H'$, let

$$\hat{L}_{S;R}(h\mid g) := P_{D_g}(x\in R)\,\hat{L}_{S_{R,g}}(h) + P_{D_g}(x\in R^c)\,\hat{L}_{S_{R^c,g}}(h_{H'}).$$

As mentioned in the main body, the same $h_{H'}\in H'$ must be used in the estimate of the loss under $D_g$ in the agreement region for all $h\in H'$. The extent to which this estimator is useful can be captured by standard uniform convergence arguments. To this end, we first introduce a function that will prove to control its deviations nicely.

Definition 4. Given a confidence parameter $\delta\in(0,1)$, a group distribution $D_g\in\mathcal{G}$, some $R\subseteq\mathcal{X}$ that is measurable with respect to each marginal of $D_g\in\mathcal{G}$, and sample sizes $m, m' > 0$, define the function

$$\Gamma_g(\delta, R, m, m') := \begin{cases} P_{D_g}(x\in R)\left(\frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}}\right) + P_{D_g}(x\in R^c)\sqrt{\frac{\ln(4/\delta)}{2m'}} & \text{if } P_{D_g}(x\in R) > 0,\ P_{D_g}(x\in R^c) > 0\\[4pt] \frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}} & \text{if } P_{D_g}(x\in R) > 0,\ P_{D_g}(x\in R^c) = 0\\[4pt] \sqrt{\frac{\ln(4/\delta)}{2m'}} & \text{if } P_{D_g}(x\in R) = 0,\ P_{D_g}(x\in R^c) > 0.\end{cases}$$

Lemma 1. Fix $\delta\in(0,1)$, a set of group distributions $\mathcal{G}$, and a group distribution $D_g\in\mathcal{G}$ arbitrarily. Further, fix a subset $R\subseteq\mathcal{X}$ measurable with respect to each marginal of $D_g\in\mathcal{G}$, and a set of classifiers $H'\subseteq H$ with the property that $H'$ agrees on $R^c$. Suppose we query $m > 0$ unlabeled samples from $U_g(R)$, and $m' > 0$ samples from $U_g(R^c)$. Suppose further that we label the output via calls to $O_g(\cdot)$, forming the labeled samples $S_{R,g}$ and $S_{R^c,g}$, respectively; if either $P_{D_g}(x\in R) = 0$ or $P_{D_g}(x\in R^c) = 0$, then we set the corresponding sample to be $\emptyset$. Then with probability $1-\delta$, it holds for all $h\in H'$ that

$$\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S;R}(h\mid g)\right| \le \Gamma_g(\delta, R, m, m').$$
Further, for all $\gamma > 0$, if

$$m \ge \frac{16\left(P_{D_g}(x\in R)\right)^2}{\gamma^2}\left(2d\ln(8/\gamma) + \ln(8/\delta)\right) \quad\text{and}\quad m' \ge \frac{2\ln(4/\delta)}{\gamma^2},$$

then $\Gamma_g(\delta, R, m, m') < \gamma$.

Proof. We begin with the case where both $P_{D_g}(x\in R)\neq 0$ and $P_{D_g}(x\in R^c)\neq 0$. In this case, we are able to draw unlabeled samples from both regions, and neither $S_{R,g}$ nor $S_{R^c,g}$ is $\emptyset$. By a lemma of Vapnik [28], we have with probability $1-\delta/2$ over the draw of $m$ samples from $U_g(R)$ and their labeling via $O_g(\cdot)$, simultaneously for each $h\in H'$:

$$\left|\hat{L}_{S_{R,g}}(h) - P_{D_g}(h(x)\neq y\mid x\in R)\right| \le \frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}}.$$

On $R^c$, all $h\in H'$ agree, and so estimating the conditional loss for each $h\in H'$ in this region is as statistically hard as estimating a single Bernoulli parameter, which we do by arbitrarily choosing a classifier to use for the loss estimate in this part of space. Thus, by the definition of the two-part estimator and Hoeffding's inequality [29], we have with probability $1-\delta/2$, for all $h\in H'$ simultaneously,

$$\left|\hat{L}_{S_{R^c,g}}(h_{H'}) - P_{D_g}(h(x)\neq y\mid x\in R^c)\right| \le \sqrt{\frac{\ln(4/\delta)}{2m'}}.$$

By a union bound, with probability $1-\delta$, both of these events take place, and so for all $h\in H'$ simultaneously,

$$\begin{aligned} L_{\mathcal{G}}(h\mid g) &= P_{D_g}(h(x)\neq y\mid x\in R)\,P_{D_g}(x\in R) + P_{D_g}(h(x)\neq y\mid x\in R^c)\,P_{D_g}(x\in R^c)\\ &\le \left(\hat{L}_{S_{R,g}}(h) + \frac{1}{m} + \sqrt{\frac{\ln(8/\delta)+d\ln(2em/d)}{m}}\right)P_{D_g}(x\in R) + \left(\hat{L}_{S_{R^c,g}}(h_{H'}) + \sqrt{\frac{\ln(4/\delta)}{2m'}}\right)P_{D_g}(x\in R^c)\\ &= \hat{L}_{S;R}(h\mid g) + \Gamma_g(\delta, R, m, m').\end{aligned}$$

The lower bound leading to the absolute value is analogous.

Vapnik [28] also tells us that for any $\gamma' > 0$, a sample of size $m \ge \frac{4}{\gamma'^2}\left(2d\ln(4/\gamma') + \ln(8/\delta)\right)$ is sufficient to yield $\frac{1}{m} + \sqrt{\left(\ln(8/\delta) + d\ln(2em/d)\right)/m} < \gamma'$. Let $\gamma' = \gamma/2P_{D_g}(x\in R)$. Then, substituting for $\gamma'$ and bounding the probability inside the natural log above by 1,

$$m \ge \frac{16\left(P_{D_g}(x\in R)\right)^2}{\gamma^2}\left(2d\ln(8/\gamma) + \ln(8/\delta)\right)$$

implies that

$$\frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}} < \frac{\gamma}{2P_{D_g}(x\in R)}.$$

As a corollary to Hoeffding, if $m' \ge 2\ln(4/\delta)/\gamma^2$, then $\sqrt{\ln(4/\delta)/2m'} < \gamma/2$. Thus, we may write

$$\Gamma_g(\delta, R, m, m') = P_{D_g}(x\in R)\left(\frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}}\right) + P_{D_g}(x\in R^c)\sqrt{\frac{\ln(4/\delta)}{2m'}} < \gamma/2 + \gamma/2 = \gamma.$$

Now suppose that $P_{D_g}(x\in R^c) = 0$. In this case, we have $S_{R^c,g} = \emptyset$. Again, we have that with probability $1-\delta/2$,

$$\left|\hat{L}_{S_{R,g}}(h) - P_{D_g}(h(x)\neq y\mid x\in R)\right| \le \frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}}.$$

When $P_{D_g}(x\in R^c) = 0$, it holds that $P_{D_g}(x\in R) = 1$, and so

$$\begin{aligned} L_{\mathcal{G}}(h\mid g) &= P_{D_g}(h(x)\neq y\mid x\in R)\,P_{D_g}(x\in R) + P_{D_g}(h(x)\neq y\mid x\in R^c)\,P_{D_g}(x\in R^c)\\ &= P_{D_g}(h(x)\neq y\mid x\in R)\\ &\le \hat{L}_{S_{R,g}}(h) + \frac{1}{m} + \sqrt{\frac{\ln(8/\delta)+d\ln(2em/d)}{m}}\\ &= \hat{L}_{S;R}(h\mid g) + \Gamma_g(\delta, R, m, m'),\end{aligned}$$

where the final equality comes from the facts that $P_{D_g}(x\in R^c) = 0$ and $P_{D_g}(x\in R) = 1$, as well as the definitions of $\hat{L}_{S;R}(h\mid g)$ and $\Gamma_g(\delta, R, m, m')$. Similarly to the above, if we let $\gamma' = \gamma/2P_{D_g}(x\in R) = \gamma/2$, then

$$m \ge \frac{16}{\gamma^2}\left(2d\ln(8/\gamma) + \ln(8/\delta)\right)$$

implies that

$$\frac{1}{m} + \sqrt{\frac{\ln(8/\delta) + d\ln(2em/d)}{m}} < \gamma/2,$$

which by the definition of $\Gamma_g(\delta, R, m, m')$ when $P_{D_g}(x\in R^c) = 0$ gives us $\Gamma_g(\delta, R, m, m') < \gamma/2 < \gamma$. The case where $P_{D_g}(x\in R) = 0$ follows the previous argument for when $P_{D_g}(x\in R^c) = 0$. $\square$

Definition 5. Given a collection of group distributions $\mathcal{G}$, some $H'\subseteq H$, a hypothesis $h\in H'$, some subset $R\subseteq\mathcal{X}$ measurable with respect to each marginal of $D_g\in\mathcal{G}$, and labeled samples $S_{R,g}$ and $S_{R^c,g}$ for each $g$, we define the empirical estimate of the multi-group loss of $h$ parameterized by $R$ via

$$L^{\max}_{S;R}(h) := \max_{g\in[G]}\hat{L}_{S;R}(h\mid g).$$

Having recalled the way in which we form empirical estimates for the group worst-case loss of a given hypothesis, we can show a simple concentration lemma for this estimator using the concentration property for individual groups proved in Lemma 1.

Lemma 2. Fix $\delta\in(0,1)$, a set of group distributions $\mathcal{G}$, a subset $R\subseteq\mathcal{X}$ measurable with respect to each marginal of $D_g\in\mathcal{G}$, and a set of classifiers $H'\subseteq H$ that agree on $R^c$. Suppose for each $g\in[G]$, we query $m_g > 0$ unlabeled samples from $U_g(R)$, and $m'_g > 0$ samples from $U_g(R^c)$.
Suppose further that we label the outputs via calls to $O_g(\cdot)$, forming the labeled samples $S_{R,g}$ and $S_{R^c,g}$, respectively, for each $g\in[G]$; if $P_{D_g}(x\in R) = 0$ or $P_{D_g}(x\in R^c) = 0$, then we set the corresponding sample to be $\emptyset$. Then with probability $1-\delta$, it holds for all $h\in H'$ that

$$\left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{S;R}(h)\right| \le \max_{g\in[G]}\Gamma_g(\delta/G, R, m_g, m'_g).$$

Proof. By Lemma 1 and a union bound, it holds with probability $1-\delta$ that on all $D_g$, for all $h\in H'$ simultaneously,

$$\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S;R}(h\mid g)\right| \le \Gamma_g(\delta/G, R, m_g, m'_g).$$

Thus we may write

$$\left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{S;R}(h)\right| = \left|\max_{g'\in[G]} L_{\mathcal{G}}(h\mid g') - \max_{g\in[G]}\hat{L}_{S;R}(h\mid g)\right| \le \max_{g'\in[G]}\left|L_{\mathcal{G}}(h\mid g') - \hat{L}_{S;R}(h\mid g')\right| \le \max_{g\in[G]}\Gamma_g(\delta/G, R, m_g, m'_g). \quad\square$$

We now use Lemma 2 to show that Algorithm 1 is conservative enough that the optimal hypothesis $h^*$ is never eliminated from contention throughout the run of the algorithm, with high probability.

Lemma 3. Fix $\delta\in(0,1)$, a collection of group distributions $\mathcal{G}$, and a hypothesis class $H$ with $d < \infty$ arbitrarily. With probability $1-\delta$, it holds after each iteration $i$ of Algorithm 1 that $h^*\in H_{i+1}$.

Proof. By Lemmas 1 and 2, and a union bound over iterations, the number of samples labeled at each iteration is sufficiently large for us to conclude that with probability $1-\delta$, for every iteration $i$ and for each $h\in H_i$, it holds that²

$$\left|L^{\max}_{S;R_i}(h) - L^{\max}_{\mathcal{G}}(h)\right| \le 2^{I-i}\epsilon/8.$$

We give an inductive argument conditioned on this high probability event. When $i = 1$, we have $h^*\in H_1$ because $H_1 = H$, and $h^*\in H$ by definition. If $h^*\in H_i$ for some $i\ge 1$, then $h^*\in H_{i+1}$ if and only if

$$L^{\max}_{S;R_i}(h^*) \le L^{\max}_{S;R_i}(\hat{h}_i) + 2^{I-i}\epsilon/4.$$

When for each $h\in H_i$ it holds that $|L^{\max}_{S;R_i}(h) - L^{\max}_{\mathcal{G}}(h)| \le 2^{I-i}\epsilon/8$, we may write

$$\begin{aligned} L^{\max}_{S;R_i}(h^*) - L^{\max}_{S;R_i}(\hat{h}_i) &\le L^{\max}_{S;R_i}(h^*) - L^{\max}_{\mathcal{G}}(h^*) + L^{\max}_{\mathcal{G}}(\hat{h}_i) - L^{\max}_{S;R_i}(\hat{h}_i)\\ &\le \left|L^{\max}_{S;R_i}(h^*) - L^{\max}_{\mathcal{G}}(h^*)\right| + \left|L^{\max}_{\mathcal{G}}(\hat{h}_i) - L^{\max}_{S;R_i}(\hat{h}_i)\right|\\ &\le 2^{I-i}\epsilon/8 + 2^{I-i}\epsilon/8 = 2^{I-i}\epsilon/4,\end{aligned}$$

where the first inequality comes from the optimality of $h^*$, i.e. $L^{\max}_{\mathcal{G}}(h^*)\le L^{\max}_{\mathcal{G}}(\hat{h}_i)$. Thus, we must have $h^*\in H_{i+1}$. $\square$

Now, using the fact that the optimal hypothesis stays in contention throughout the run of the algorithm, we can give a guarantee on the true error of each hypothesis $h\in H_{i+1}$. The idea is that using concentration and the small empirical error of each $h\in H_{i+1}$, we can say that the true errors of each $h\in H_{i+1}$ are similar to the true error of the ERM hypothesis $\hat{h}_i$, and then use the true error of $\hat{h}_i$ as a reference point to which we can compare the true errors of $h\in H_{i+1}$ and $h^*$.

Lemma 4. Fix $\delta\in(0,1)$, a collection of group distributions $\mathcal{G}$, and a hypothesis class $H$ with $d < \infty$ arbitrarily. Then with probability $1-\delta$, after every iteration $i$ of Algorithm 1, it holds for all $h\in H_{i+1}$ that

$$L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(h^*) \le 2^{I-i}\epsilon.$$

Proof. If $h\in H_{i+1}$, then by the specification of the algorithm, it holds that $L^{\max}_{S;R_i}(h) - L^{\max}_{S;R_i}(\hat{h}_i) \le 2^{I-i}\epsilon/4$. Because $\hat{h}_i$ is the ERM hypothesis at iteration $i$, it holds that $L^{\max}_{S;R_i}(\hat{h}_i) - L^{\max}_{S;R_i}(h) \le 0 < 2^{I-i}\epsilon/4$, and thus we may conclude

$$\left|L^{\max}_{S;R_i}(h) - L^{\max}_{S;R_i}(\hat{h}_i)\right| \le 2^{I-i}\epsilon/4.$$

By Lemma 2 and the number of samples labeled at each iteration, with probability $1-\delta$, it holds for all iterations and for all $h\in H_i$ that $|L^{\max}_{S;R_i}(h) - L^{\max}_{\mathcal{G}}(h)| \le 2^{I-i}\epsilon/8$. Conditioned on this event, if $h\in H_{i+1}$, we have

$$\begin{aligned}\left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(\hat{h}_i)\right| &= \left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{S;R_i}(h) + L^{\max}_{S;R_i}(h) - L^{\max}_{S;R_i}(\hat{h}_i) + L^{\max}_{S;R_i}(\hat{h}_i) - L^{\max}_{\mathcal{G}}(\hat{h}_i)\right|\\ &\le \left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{S;R_i}(h)\right| + \left|L^{\max}_{S;R_i}(h) - L^{\max}_{S;R_i}(\hat{h}_i)\right| + \left|L^{\max}_{S;R_i}(\hat{h}_i) - L^{\max}_{\mathcal{G}}(\hat{h}_i)\right|\\ &\le 2^{I-i}\epsilon/8 + 2^{I-i}\epsilon/4 + 2^{I-i}\epsilon/8 = 2^{I-i}\epsilon/2.\end{aligned}$$

By Lemma 3, it holds that $h^*\in H_{i+1}$ whenever $|L^{\max}_{S;R_i}(h) - L^{\max}_{\mathcal{G}}(h)| \le 2^{I-i}\epsilon/8$ for all $h\in H_i$ at all iterations.
Thus, this bound on the true error difference with the ERM hypothesis $\hat{h}_i$ applies to $h^*$, and we may write for arbitrary $h\in H_{i+1}$ that

$$L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(h^*) \le \left|L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(\hat{h}_i)\right| + \left|L^{\max}_{\mathcal{G}}(\hat{h}_i) - L^{\max}_{\mathcal{G}}(h^*)\right| \le 2^{I-i}\epsilon,$$

which is the desired result. $\square$

² We do not directly apply Lemma 1 with $\gamma = 2^{I-i}\epsilon/8$ here. We use this quantity in the outer dependence on $\gamma$ of Lemma 1, but for the natural log dependence on $\gamma$, we substitute $\epsilon/8$ to simplify the analysis. Thus we take slightly more samples than Lemma 1 directly suggests. Because we take the largest probability of the disagreement region over groups as $m_i$, the sample size for each $g$ is at least the one suggested by Lemma 1.

Definition 6. Given a group distribution $D_g\in\mathcal{G}$, a hypothesis $h\in H$, and a radius $r\ge 0$, let the $D_g$-disagreement ball in $H$ of radius $r$ about $h$ be

$$B_g(h, r) := \{h'\in H : \rho_g(h, h')\le r\},$$

where $\rho_g(h, h') := P_{D_g}(h(x)\neq h'(x))$.

Definition 7. Given a group distribution $D_g\in\mathcal{G}$ and a hypothesis class $H$, let the disagreement coefficient of $D_g$ be defined as

$$\theta_g := \sup_{h\in H}\sup_{r'\ge 2\nu+\epsilon}\frac{P_{D_g}\left(x\in\Delta(B_g(h, r'))\right)}{r'}.$$

We further define the disagreement coefficient over a collection of group distributions $\mathcal{G}$ as $\theta_{\mathcal{G}} := \max_{g\in[G]}\theta_g$.

Given these definitions, we are now ready to state the main theorem. The consistency comes from what we showed in Lemma 4: as the true excess error of each $h\in H_{i+1}$ decreases with each iteration, after enough iterations every $h\in H_{i+1}$ is $\epsilon$-optimal. The label complexity bound follows standard ideas in the DBAL literature; see for example [9, 24]. Essentially, we show that at each iteration $i$, because the true error of any $h\in H_i$ on the multi-group objective can't be too large, the disagreement of $h$ and $h^*$ on any single group cannot be too large. This leads to a bound on the size of the disagreement region for each $g$.

Theorem 1. For all $\epsilon > 0$, $\delta\in(0,1)$, collections of group distributions $\mathcal{G}$, and hypothesis classes $H$ with $d < \infty$, with probability $1-\delta$, the output $\hat{h}$ of Algorithm 1 satisfies $L^{\max}_{\mathcal{G}}(\hat{h})\le L^{\max}_{\mathcal{G}}(h^*) + \epsilon$, and its label complexity is bounded by

$$\widetilde{O}\left(G\,\theta_{\mathcal{G}}^2\left(\frac{\nu^2}{\epsilon^2} + 1\right)\left(d\log(1/\epsilon) + \log(1/\delta)\right)\log(1/\epsilon) + \frac{G\log(1/\epsilon)\log(1/\delta)}{\epsilon^2}\right).$$

Proof. Lemma 4 says that the number of samples drawn at each iteration is sufficiently large that with probability $1-\delta$, for all $i\in[I]$ and all $h\in H_{i+1}$, we have $L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(h^*)\le 2^{I-i}\epsilon$. Thus, after $I = \lceil\log(1/\epsilon)\rceil$ iterations, the output $\hat{h}$ satisfies the consistency condition.

To see the label complexity, which is the sum of the number of labels queried at each iteration, note that at iteration $i$ we label no more than

$$1024\left(\frac{m_i}{\epsilon 2^{I-i}}\right)^2\left(2d\log\frac{64}{\epsilon} + \ln\frac{8G\lceil\log(1/\epsilon)\rceil}{\delta}\right) + \frac{128\ln\left(4G\lceil\log(1/\epsilon)\rceil/\delta\right)}{(\epsilon 2^{I-i})^2}$$

samples for each group distribution $D_g$, where $m_i = \max_g P_{D_g}(x\in\Delta(H_i))$. The only term here that depends on $i$ is $m_i/(\epsilon 2^{I-i})$. By Lemma 4, with probability $1-\delta$, it holds for each $i > 1$ that every $h\in H_i$ satisfies $L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(h^*)\le 2^{I-i+1}\epsilon$; this holds automatically at $i = 1$ by the setting of $I = \lceil\log(1/\epsilon)\rceil$. Thus, at arbitrary $i$ and for arbitrary $g\in[G]$ and $h\in H_i$, we may write

$$\begin{aligned}\rho_g(h, h^*) &= P_{D_g}(h(x)\neq h^*(x))\\ &= P_{D_g}(h(x)\neq y,\ h^*(x) = y) + P_{D_g}(h(x) = y,\ h^*(x)\neq y)\\ &\le P_{D_g}(h(x)\neq y) + P_{D_g}(h^*(x)\neq y)\\ &= L_{\mathcal{G}}(h\mid g) + L_{\mathcal{G}}(h^*\mid g)\\ &\le L^{\max}_{\mathcal{G}}(h) + L^{\max}_{\mathcal{G}}(h^*)\\ &= L^{\max}_{\mathcal{G}}(h) - L^{\max}_{\mathcal{G}}(h^*) + 2L^{\max}_{\mathcal{G}}(h^*)\\ &\le 2^{I-i+1}\epsilon + 2\nu,\end{aligned}$$

where we recall $\nu$ is the noise rate on the multi-group objective. Thus, with probability $1-\delta$, for each $i\le I$ and $g\in[G]$, it holds that $H_i\subseteq B_g(h^*, 2^{I-i+1}\epsilon + 2\nu)$.
Given this observation, we may then write, for all $g$, that

$$P_{D_g}(x\in\Delta(H_i)) \le P_{D_g}\left(x\in\Delta\left(B_g(h^*, 2\nu + 2^{I-i+1}\epsilon)\right)\right),$$

as if there are $h, h'\in H_i$ that disagree on some $x$, then $h, h'\in B_g(h^*, 2\nu + 2^{I-i+1}\epsilon)$, and so $h, h'$ also realize disagreement on $x$ for the larger set of classifiers. Recalling the definition of $m_i$, this allows us to bound the sum over iterations of the terms depending on $i$ as

$$\begin{aligned}\sum_{i=1}^{I}\left(\frac{m_i}{\epsilon 2^{I-i}}\right)^2 &\le \sum_{i=1}^{I}\left(\frac{\max_g P_{D_g}\left(x\in\Delta(B_g(h^*, 2\nu + 2^{I-i+1}\epsilon))\right)}{2\nu + 2^{I-i+1}\epsilon}\cdot\frac{2\nu + 2^{I-i+1}\epsilon}{\epsilon 2^{I-i}}\right)^2\\ &\le \sum_{i=1}^{I}\left(\max_g\sup_{h\in H}\sup_{r\ge 2\nu+\epsilon}\frac{P_{D_g}\left(x\in\Delta(B_g(h, r))\right)}{r}\right)^2\left(\frac{2\nu + 2^{I-i+1}\epsilon}{\epsilon 2^{I-i}}\right)^2\\ &\le 4\lceil\log(1/\epsilon)\rceil\left(\frac{\nu + \epsilon}{\epsilon}\right)^2\left(\max_g\theta_g\right)^2 = 4\lceil\log(1/\epsilon)\rceil\left(\frac{\nu + \epsilon}{\epsilon}\right)^2\theta_{\mathcal{G}}^2.\end{aligned}$$

The label complexity bound then follows by noting that the algorithm labels the same number of samples for all $G$ groups at each iteration, and by ignoring factors of $\log(G)$ and $\log(\log(1/\epsilon))$. $\square$

9.2 Group-Realizable Guarantees

Theorem 2. Suppose Algorithm 2 is run with the active learner $A_{\mathrm{CAL}}$ of [26]. Then for all $\epsilon > 0$, $\delta\in(0,1)$, hypothesis classes $H$ with $d < \infty$, and collections of group distributions $\mathcal{G}$ that are group realizable with respect to $H$, with probability $1-\delta$, the output $\hat{h}$ satisfies $L^{\max}_{\mathcal{G}}(\hat{h})\le L^{\max}_{\mathcal{G}}(h^*) + \epsilon$, and the number of labels requested is $\widetilde{O}\left(dG\theta_{\mathcal{G}}\log(1/\epsilon)\right)$.

Proof. The label complexity follows directly from the guarantees given in [15]. By a union bound, with probability $1-\delta$, we have for all $g\in[G]$ that $A_{\mathrm{CAL}}$ returns $\hat{h}_g$ with the property that $L_{\mathcal{G}}(\hat{h}_g\mid g)\le\epsilon/6$.

Fix some $g\in[G]$ arbitrarily. Consider a counterfactual training set $S_g$, unseen by the learner, constructed by labeling each example $x\in S'_g$ via the oracle call $O_g(x)$. Then Vapnik [28] tells us that $m_g := |S'_g|$ is sufficiently large that with probability $1-\delta/2$, for each $h\in H$ simultaneously, we have $|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S_g}(h)| < \epsilon/6$. Again by the union bound, this uniform convergence on $S_g$ and the guarantee on the runs of $A_{\mathrm{CAL}}$ both hold for each $g\in[G]$.

Conditioned on this high probability event, we can first note that for some arbitrary $h\in H$,

$$\begin{aligned}\left|\hat{L}_{S_g}(h) - \hat{L}_{\hat{S}_g}(h)\right| &= \frac{1}{m_g}\left|\sum_{i=1}^{m_g}\left(1[h(x_i)\neq y_i] - 1[h(x_i)\neq\hat{h}_g(x_i)]\right)\right|\\ &\le \frac{1}{m_g}\sum_{i=1}^{m_g} 1[y_i\neq\hat{h}_g(x_i)]\\ &\le L_{\mathcal{G}}(\hat{h}_g\mid g) + \epsilon/6\\ &\le \epsilon/6 + \epsilon/6 = \epsilon/3,\end{aligned}$$

where the second-to-last inequality comes from the uniform convergence over $S_g$, and the final inequality comes from the success of the runs of $A_{\mathrm{CAL}}$. Then for arbitrary $h$, combining Vapnik's guarantee and the inequality we just showed, we may write

$$\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{\hat{S}_g}(h)\right| \le \left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S_g}(h)\right| + \left|\hat{L}_{S_g}(h) - \hat{L}_{\hat{S}_g}(h)\right| < \epsilon/6 + \epsilon/3 = \epsilon/2.$$

Given this guarantee on the representativeness of the artificially labeled samples on each group $g$, we have a guarantee for representativeness over the worst case. For arbitrary $h\in H$, we may write

$$\left|L^{\max}_{\mathcal{G}}(h) - \max_{g\in[G]}\hat{L}_{\hat{S}_g}(h)\right| = \left|\max_{g\in[G]} L_{\mathcal{G}}(h\mid g) - \max_{g\in[G]}\hat{L}_{\hat{S}_g}(h)\right| \le \max_{g\in[G]}\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{\hat{S}_g}(h)\right| \le \epsilon/2.$$

Thus, by the fact that $\hat{h}$ is the ERM on the artificially labeled dataset, we have

$$L^{\max}_{\mathcal{G}}(\hat{h}) \le \max_{g\in[G]}\hat{L}_{\hat{S}_g}(\hat{h}) + \epsilon/2 \le \max_{g\in[G]}\hat{L}_{\hat{S}_g}(h^*) + \epsilon/2 \le L^{\max}_{\mathcal{G}}(h^*) + \epsilon. \quad\square$$

9.3 Approximation Guarantees

Theorem 3. Suppose Algorithm 2 is run with the agnostic active learner $A_{\mathrm{DHM}}$ of [10]. Then for all $\epsilon > 0$, $\delta\in(0,1)$, hypothesis classes $H$ with $d < \infty$, and collections of groups $\mathcal{G}$, with probability $1-\delta$, the output $\hat{h}$ satisfies

$$L^{\max}_{\mathcal{G}}(\hat{h}) \le L^{\max}_{\mathcal{G}}(h^*) + 2\max_{g\in[G]}\nu_g + \epsilon \le 3L^{\max}_{\mathcal{G}}(h^*) + \epsilon,$$

and the number of labels requested is $\widetilde{O}\left(dG\theta_{\mathcal{G}}\left(\log^2(1/\epsilon) + \frac{\nu_{\max}^2}{\epsilon^2}\right)\right)$, where $\nu_{\max} := \max_{g\in[G]}\nu_g$.

Proof. The proof is almost identical to that of Theorem 2. The label complexity bound follows directly from [10]. Similarly to before, we have that for all $g\in[G]$, $A_{\mathrm{DHM}}$ returns $\hat{h}_g$ with the property that $L_{\mathcal{G}}(\hat{h}_g\mid g)\le L_{\mathcal{G}}(h^*_g\mid g) + \epsilon/6$. Fix some $g\in[G]$ arbitrarily.
On a counterfactual training set $S_g$, unseen by the learner, constructed by labeling each example $x\in S'_g$ via the oracle call $O_g(x)$, it holds that $m_g := |S'_g|$ is sufficiently large that with probability $1-\delta/2$, for each $h\in H$ simultaneously, we have $|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S_g}(h)| < \epsilon/6$. By the union bound, this uniform convergence and the guarantee on the runs of $A_{\mathrm{DHM}}$ both hold. Thus, we can first note that for some arbitrary $h\in H$,

$$\begin{aligned}\left|\hat{L}_{S_g}(h) - \hat{L}_{\hat{S}_g}(h)\right| &= \frac{1}{m_g}\left|\sum_{i=1}^{m_g}\left(1[h(x_i)\neq y_i] - 1[h(x_i)\neq\hat{h}_g(x_i)]\right)\right|\\ &\le \frac{1}{m_g}\sum_{i=1}^{m_g} 1[y_i\neq\hat{h}_g(x_i)]\\ &\le L_{\mathcal{G}}(\hat{h}_g\mid g) + \epsilon/6\\ &\le L_{\mathcal{G}}(h^*_g\mid g) + \epsilon/3 = \nu_g + \epsilon/3,\end{aligned}$$

where the second-to-last inequality comes from uniform convergence over $S_g$, and the final equality comes from the correctness guarantee of $A_{\mathrm{DHM}}$. Then for arbitrary $h$, combining Vapnik's guarantee and the inequality we just showed, we may write

$$\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{\hat{S}_g}(h)\right| \le \left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{S_g}(h)\right| + \left|\hat{L}_{S_g}(h) - \hat{L}_{\hat{S}_g}(h)\right| < \epsilon/6 + \nu_g + \epsilon/3 = \nu_g + \epsilon/2.$$

Then, as above, we have, for arbitrary $h\in H$,

$$\left|L^{\max}_{\mathcal{G}}(h) - \max_{g\in[G]}\hat{L}_{\hat{S}_g}(h)\right| \le \max_{g\in[G]}\left|L_{\mathcal{G}}(h\mid g) - \hat{L}_{\hat{S}_g}(h)\right| \le \max_{g\in[G]}\nu_g + \epsilon/2 \le \nu + \epsilon/2,$$

where the final inequality uses $\max_g\nu_g\le\nu$: for any $h$ and any $g$, $\max_{g'\in[G]} L_{\mathcal{G}}(h\mid g')\ge L_{\mathcal{G}}(h\mid g)\ge\nu_g$, so $\nu\ge\nu_g$. Thus, by the fact that $\hat{h}$ is the ERM on the artificially labeled dataset, we have

$$L^{\max}_{\mathcal{G}}(\hat{h}) \le \max_{g\in[G]}\hat{L}_{\hat{S}_g}(\hat{h}) + \max_{g\in[G]}\nu_g + \epsilon/2 \le \max_{g\in[G]}\hat{L}_{\hat{S}_g}(h^*) + \max_{g\in[G]}\nu_g + \epsilon/2 \le L^{\max}_{\mathcal{G}}(h^*) + 2\max_{g\in[G]}\nu_g + \epsilon \le 3L^{\max}_{\mathcal{G}}(h^*) + \epsilon. \quad\square$$