# Metric-Fair Active Learning

Jie Shen¹, Nan Cui¹, Jing Wang²

¹Department of Computer Science, Stevens Institute of Technology, Hoboken, New Jersey, USA. ²Amazon, New York City, New York, USA. Correspondence to: Jie Shen, Nan Cui, Jing Wang.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

**Abstract.** Active learning has become a prevalent technique for designing label-efficient algorithms, where the central principle is to query and fit only informative labeled instances. It is, however, known that an active learning algorithm may incur unfairness due to such an instance selection procedure. In this paper, we therefore study metric-fair active learning of homogeneous halfspaces, and show that under the distribution-dependent PAC learning model, fairness and label efficiency can be achieved simultaneously. We further propose two extensions of our main results: 1) we show that it is possible to make the algorithm robust to the adversarial noise, one of the most challenging noise models in learning theory; and 2) it is possible to significantly improve the label complexity when the underlying halfspace is sparse.

## 1. Introduction

Deep learning has become the driving force behind modern artificial intelligence. However, it requires a massive amount of labeled data for model training. Though unlabeled data are abundant in many applications, labels are typically precious and expensive to acquire, especially in areas such as medicine and physiology. In this regard, active learning has been broadly utilized as a paradigm for learning a good model with significantly fewer labels, by designing strategies that adaptively select informative instances to annotate (Cohn et al., 1994; Dasgupta et al., 2005; Balcan et al., 2006; Dasgupta, 2009). On the other hand, practitioners from different disciplines have recently highlighted the ethical and legal challenges posed by machine learning systems with the potential to discriminate against specific population groups (Chouldechova & Roth, 2020). The rising concern is that the designed algorithms achieve appealing prediction accuracy, yet cannot recognize ethical or moral considerations, and will likely output a solution that treats vulnerable groups unfairly. For instance, Apple's credit card was investigated by financial regulators after customers discovered that the lending algorithms were discriminating against women. In this light, there is a growing interest in incorporating different fairness criteria into algorithmic design, aiming to guarantee that similar individuals or groups will be treated equally (Dwork et al., 2012; Zemel et al., 2013; Hardt et al., 2016; Yona & Rothblum, 2018; Liu et al., 2018; Dwork & Ilvento, 2019; Liu et al., 2019; Dwork et al., 2020a;b; Ding et al., 2021).

In this paper, we study these two properties, which are seemingly at odds with each other, and propose the first provable active learning algorithm to address the general concern that the instance selection paradigm in active learning may lead to unfairness. Our goal is three-fold: 1) designing a computationally efficient algorithm that learns the underlying hypothesis class under the probably approximately correct (PAC) model of Valiant (1984); 2) significantly reducing the label complexity using active learning techniques; and 3) ensuring fairness on unseen data, i.e. generalization of the fairness guarantee. The first two objectives have been broadly studied in the literature and were achieved by many active learning algorithms (Balcan et al., 2007; Awasthi et al., 2017); hence our main contribution falls into the design of a new family of active learning algorithms that additionally satisfy the fairness guarantee.

### 1.1. Formal setup

We study efficient PAC learning of homogeneous halfspaces (Valiant, 1984), which is arguably one of the most important problems in learning theory (Rosenblatt, 1958). Denote by $\mathcal{X} := \mathbb{R}^d$ the instance space and by $\mathcal{Y} := \{-1, 1\}$ the label space. The class of homogeneous halfspaces is given by $\mathcal{H} := \{x \mapsto \mathrm{sign}(w \cdot x) : w \in \mathbb{R}^d, \|w\|_2 = 1\}$. Let $D$ be the joint distribution on $\mathcal{X} \times \mathcal{Y}$ and denote by $D_X$ the marginal distribution on $\mathcal{X}$. For any hypothesis $w \in \mathcal{H}$, we define the error rate as $\mathrm{err}_D(w) := \Pr_{(x,y) \sim D}\big(\mathrm{sign}(w \cdot x) \neq y\big)$. Let $\mathrm{EX}_D$ be the sample generation oracle such that each time the learner makes a call, it returns a labeled instance $(x, y)$ that is randomly drawn from $D$. In the active learning setting, however, each time $\mathrm{EX}_D$ is called, a labeled instance $(x, y)$ is still randomly drawn from $D$, but the oracle only returns the instance $x$. The learner must make another call to a label revealing oracle $\mathrm{EX}_D^Y$ to obtain the label $y$.

Let $\epsilon \in (0, 1)$ be the target classification error rate, and $\delta \in (0, 1)$ be the failure probability. We say $\mathcal{H}$ is PAC learnable if there exist a learning algorithm $A$ and quantities $n^A_{\epsilon,\delta}$ and $m^A_{\epsilon,\delta}$ satisfying the following: given any $\epsilon$ and $\delta$, by making $n^A_{\epsilon,\delta}$ calls to $\mathrm{EX}_D$ and $m^A_{\epsilon,\delta}$ calls to $\mathrm{EX}_D^Y$, $A$ outputs a halfspace $\hat{w}$ with $\mathrm{err}_D(\hat{w}) \le \min_{w \in \mathcal{H}} \mathrm{err}_D(w) + \epsilon$ with probability at least $1 - \delta$ (over the draw of the data and the internal randomness of $A$). The minima of $n^A_{\epsilon,\delta}$ and $m^A_{\epsilon,\delta}$ over all possible algorithms are termed the sample complexity and label complexity, respectively.

In addition to the PAC guarantee, we also aim to establish a fairness guarantee. We consider the notion of approximate metric-fairness due to Yona & Rothblum (2018), yet with a slight modification.

**Definition 1** (Approximate metric-fairness). Given a metric $\zeta : \mathcal{X} \times \mathcal{X} \to [0, 1]$, let the fairness error be $f_\zeta(w; D_X) := \Pr_{(x, x') \sim D_X \times D_X}\big(|w \cdot x - w \cdot x'| > \zeta(x, x')\big)$. A hypothesis $w$ is said to be $\alpha$-approximately metric-fair if $f_\zeta(w; D_X) \le \alpha$. We call $\alpha$ the fairness error rate.
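For intuition, the fairness error in Definition 1 can be estimated by Monte Carlo sampling of pairs. The sketch below is illustrative only: the particular metric $\zeta$ (a clipped Euclidean distance) and the Gaussian marginal are assumptions made for the example, not part of the paper's setup.

```python
import numpy as np

def fairness_error(w, pairs, zeta):
    """Empirical estimate of f_zeta(w; D_X): the fraction of sampled
    pairs (x, x') on which |w.x - w.x'| exceeds the metric zeta(x, x')."""
    violations = [abs(w @ x - w @ xp) > zeta(x, xp) for x, xp in pairs]
    return float(np.mean(violations))

# Toy check with an illustrative metric zeta(x, x') = min(1, ||x - x'||_2)
# and an isotropic Gaussian stand-in for D_X.
rng = np.random.default_rng(0)
d = 5
w = np.ones(d) / np.sqrt(d)                       # a unit-norm halfspace
zeta = lambda x, xp: min(1.0, float(np.linalg.norm(x - xp)))
pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(1000)]
alpha_hat = fairness_error(w, pairs, zeta)        # estimated fairness error rate
```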
We note that when $\alpha = 0$, Definition 1 reduces to the notion of perfect metric-fairness (Dwork et al., 2012). That is, for all $(x, x') \in \mathcal{X} \times \mathcal{X}$, $|w \cdot x - w \cdot x'| \le \zeta(x, x')$ almost surely. Yet, as shown in Yona & Rothblum (2018), even when a perfectly metric-fair hypothesis exists and has zero error rate for some simple learning problems, it cannot be found in polynomial time by any perfectly metric-fair algorithm. Therefore, throughout the paper, we will only consider finding a hypothesis with the property of approximate metric-fairness, which relaxes perfectness in such a way that for an $\alpha$ fraction of the pairs, the metric-fairness property may not hold. It is also worth mentioning that Yona & Rothblum (2018) considered a slightly more general definition where $w$ is said to be metric-fair if $|w \cdot x - w \cdot x'| \le \zeta(x, x') + \gamma$ for some slack parameter $\gamma \ge 0$. We find such relaxation unnecessary, and our analysis differs in that we design a computationally efficient learning algorithm. Specifically, we will utilize a different fairness loss function; see Section 3.

We are now in a position to state the probably approximately correct and fair (PACF) learning problem.

**Definition 2** (PACF learning). A learning algorithm PACF-learns a hypothesis class $\mathcal{H}$ if for any underlying distribution $D$, target classification error rate $\epsilon \in (0, 1)$, confidence $\delta \in (0, 1)$, fairness metric $\zeta(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to [0, 1]$, and fairness error rate $\alpha \in (0, 1)$, it randomly draws a number of samples from $D$ and, with probability $1 - \delta$ over the draw, outputs a halfspace $\hat{w}$ satisfying: 1) $\mathrm{err}_D(\hat{w}) \le \min_{w \in \mathcal{H}_\alpha} \mathrm{err}_D(w) + \epsilon$; and 2) $f_\zeta(\hat{w}; D_X) \le \alpha$, where $\mathcal{H}_\alpha \subseteq \mathcal{H}$ consists of all halfspaces that are $\alpha$-approximately metric-fair.

It is worth mentioning that in terms of classification error, Yona & Rothblum (2018) competed with the best hypothesis in a subclass $\mathcal{H}_{\alpha - \epsilon_\alpha}$ where $\epsilon_\alpha \in (0, \alpha)$ (the most interesting regime being $\epsilon_\alpha = \Theta(\alpha)$), known as relaxed PACF learning.
In this work, we alternatively make a realizability assumption that there exists a perfectly metric-fair target halfspace $w^\star$ in $\mathcal{H}_\alpha$ to avoid the relaxation of PACF learnability. This naturally interpolates the settings of Dwork et al. (2012) and Yona & Rothblum (2018), and thus our results can be thought of as circumventing the computational hardness of finding the perfectly metric-fair hypothesis while permitting simpler technical analysis.

### 1.2. Main results

The main contribution of this paper is a metric-fair active learning algorithm that fortifies state-of-the-art active learning algorithms with the PACF guarantee. In this section, we summarize our main results; readers are referred to Section 5 for a more comprehensive discussion. We first present a basic PACF algorithm with label efficiency. Then we show how to improve this basic algorithm so that it can tolerate the adversarial label noise (Kearns et al., 1992) and can learn sparse halfspaces (Littlestone, 1987). All of our analysis hinges on a mild assumption on the marginal distribution $D_X$.

**Assumption 1.** The marginal distribution $D_X$ is isotropic log-concave on $\mathcal{X}$; namely, it has zero mean, unit covariance matrix, and the logarithm of its density function is concave.

Observe that the family of isotropic log-concave distributions is fairly standard and general (Lovász & Vempala, 2007b; Vempala, 2010; Balcan & Long, 2013). Without any assumption on the distribution, active learning may fail to provide any improvement over passive learning in terms of label efficiency; see the example given by Dasgupta (2005).

Our first result concerns label-efficient PACF learning in the noise-free setting, where there is a perfect hypothesis $w^\star \in \mathcal{H}$ that incurs zero error rate.

**Theorem 3.** If Assumption 1 is satisfied and there exists $w^\star \in \mathcal{H}$ with $\mathrm{err}_D(w^\star) = 0$ and $f_\zeta(w^\star; D_X) = 0$, then there is an efficient algorithm that PACF learns $\mathcal{H}$.
In addition, the label complexity is $\tilde{O}\big(d \cdot \mathrm{polylog}(\frac{1}{\epsilon}, \frac{1}{\alpha}, \frac{1}{\delta})\big)$.

**Table 1.** Comparison to the most relevant works. Compared to Zhang (2018), our algorithms can produce metric-fair hypotheses. Compared to Yona & Rothblum (2018), we have an exponential improvement in the dependence on $\epsilon$ and $\alpha$, and are robust to adversarial noise. We remark that the PACF learnability of halfspaces was not set out in Yona & Rothblum (2018), but their results imply what we list here.

| Work | Label Complexity | Metric Fairness | Noise Tolerance |
|---|---|---|---|
| Zhang (2018) | $\mathrm{polylog}(\frac{1}{\epsilon}) \cdot O(t\,\mathrm{polylog}(d))$ | No | Yes |
| Yona & Rothblum (2018) | $\frac{1}{\epsilon^2} \cdot \frac{1}{\alpha^2} \cdot O(t\,\mathrm{polylog}(d))$ | Yes | No |
| This Work (Theorem 4) | $\mathrm{polylog}(\frac{1}{\epsilon}, \frac{1}{\alpha}) \cdot O(d)$ | Yes | Yes |
| This Work (Theorem 5) | $\mathrm{polylog}(\frac{1}{\epsilon}, \frac{1}{\alpha}) \cdot O(t\,\mathrm{polylog}(d))$ | Yes | Yes |

Observe that the obtained label complexity significantly improves upon that of Yona & Rothblum (2018): theirs is proportional to $\frac{1}{(\epsilon\alpha)^2}$, while we have a poly-logarithmic dependence on both $\frac{1}{\epsilon}$ and $\frac{1}{\alpha}$. This is due to our new algorithmic design and the distributional assumption we made.

We then consider a more challenging setting where no perfect halfspace with zero error rate exists in $\mathcal{H}$, known as the adversarial noise model (Haussler, 1992; Kearns et al., 1992). Note that this is a very challenging label noise, and only recently have efficient algorithms been established, though without fairness guarantees (Awasthi et al., 2017; Yan & Zhang, 2017; Shen, 2021a).

**Assumption 2.** The distribution $D$ is said to satisfy the $\eta$-adversarial-noise condition if there is a halfspace $w^\star \in \mathcal{H}$ with $\mathrm{err}_D(w^\star) \le \eta$.

**Theorem 4.** If Assumptions 1 and 2 are satisfied, and $f_\zeta(w^\star; D_X) = 0$, then there is a polynomial-time algorithm that PACF learns $\mathcal{H}$ provided $\eta \le O(\epsilon)$. In addition, the label complexity is $\tilde{O}\big(d \cdot \mathrm{polylog}(\frac{1}{\epsilon}, \frac{1}{\alpha}, \frac{1}{\delta})\big)$.

Lastly, the margin-based active learning framework allows us to explore the learning of structured halfspaces.
In particular, we are interested in learning $t$-sparse halfspaces, with the goal of obtaining label complexity that is sublinear in the dimension $d$, a property termed attribute-efficiency (Littlestone, 1987). Such a property has also been broadly studied in the statistics and signal processing communities (Chen et al., 1998; Tibshirani, 1996; Candès & Tao, 2005).

**Assumption 3.** The hypothesis class consists of $t$-sparse halfspaces, i.e. $\mathcal{H} = \{x \mapsto \mathrm{sign}(w \cdot x) : w \in \mathbb{R}^d, \|w\|_2 = 1, \|w\|_0 \le t\}$, where $\|w\|_0$ counts the number of non-zero elements of $w$.

**Theorem 5** (Theorem 9, informal). If Assumptions 1, 2 and 3 are satisfied and $f_\zeta(w^\star; D_X) = 0$, then there is a polynomial-time algorithm that PACF learns $\mathcal{H}$ provided $\eta \le O(\epsilon)$. In addition, the label complexity is $\tilde{O}\big(t \cdot \mathrm{polylog}(d, \frac{1}{\epsilon}, \frac{1}{\alpha}, \frac{1}{\delta})\big)$.

Observe that both Theorem 3 and Theorem 4 are special cases of the above result. Without the fairness constraint, the setting of Theorem 5 was studied in Zhang (2018). In fact, our high-level idea of algorithmic design is inspired by that work. The crucial difference lies in the incorporation of metric-fairness and, hence, a new theoretical analysis of the generalization ability of metric fairness in the margin-based active learning framework. We summarize our results and some closely related prior works in Table 1.

### 1.3. Overview of our techniques

We sketch the main techniques in this section. From a high level, we incorporate metric-fairness into the celebrated margin-based active learning framework to achieve the performance guarantees stated in Theorem 5.

1) **Metric-fair learning via convex fairness loss.** Given the definition of metric-fairness, it is natural to consider an indicator function as the loss to evaluate whether the fairness constraint is violated by a hypothesis $w$ on a pair $(x, x') \in \mathcal{X} \times \mathcal{X}$:

$$f_\zeta(w; (x, x')) = \begin{cases} 1 & \text{if } |w \cdot x - w \cdot x'| > \zeta(x, x'), \\ 0 & \text{otherwise.} \end{cases}$$

Yet, such a loss function is discrete and often computationally hard to optimize.
Thus, we consider the following surrogate loss that is amenable to optimization:

$$f^G_\zeta(w; (x, x')) := \max\big\{0,\; G\big(|w \cdot x - w \cdot x'| - \zeta(x, x')\big) + 1\big\},$$

which is a hinge-loss type upper bound of $f_\zeta(w; (x, x'))$ and, very importantly, is convex with respect to $w$. Now given a set $T$ of instances drawn independently from $D_X$, it is possible to (arbitrarily) group the instances into pairs to form a set $M(T) \subseteq \mathcal{X} \times \mathcal{X}$ and examine the empirical fairness loss induced by $M(T)$:

$$\frac{1}{|M(T)|} \sum_{(x, x') \in M(T)} f^G_\zeta(w; (x, x')).$$

Note that $M(T)$ can be thought of as a set of instance pairs independently drawn from $\mathcal{X} \times \mathcal{X}$, sometimes called a graph matching of $T$ by imagining the instances in $T$ as the nodes of a graph. This allows us to establish a generalization guarantee for metric-fairness, similar to Yona & Rothblum (2018). In our algorithm, we will mainly consider a constraint on the above empirical loss function, which can be evaluated in polynomial time. This is one of the primary components when the algorithm produces a new iterate. By enforcing such a convex fairness constraint in all iterations, we obtain the fairness guarantee for all iterates, and hence for the final output of the algorithm.

2) **Margin-based active learning with fairness: entangling unlabeled and labeled data.** The margin-based active learning framework of Balcan et al. (2007) proceeds in an iterative fashion, where in each phase it minimizes an empirical hinge loss to produce a new iterate. The key aspect that differentiates it from passive learning is that in each phase, it draws a bulk of instances from $D_X$ but only queries the labels of those residing in a band $B$, known as localized sampling. In addition, it searches for the new iterate in a localized hypothesis space, which is roughly a trust region around the target halfspace $w^\star$. As the algorithm proceeds, this localized hypothesis space shrinks at a geometric rate; hence after a few phases, a good hypothesis can be returned. We make several key observations.
First, labels are queried only during empirical hinge loss minimization, and the number of labeled instances is such that the empirical hinge loss is a good approximation to the expected loss. Second, the instances drawn at the beginning of each phase are used to construct the surrogate fairness constraint. It turns out that such unlabeled and labeled instances are interleaved during hinge loss minimization: labeled samples are used to define the hinge loss, while unlabeled ones are used to construct the constraint. Yet, we show that the size of the unlabeled sample and the necessary number of labels are almost independent. Therefore, differently from prior active learning algorithms, after localized sampling we also perform a random sampling of the remaining instances and query their labels. This ensures that we make only the necessary label queries, and is the key to obtaining the poly-logarithmic dependence on $\frac{1}{\alpha}$.

Our treatment of adversarial noise is rather standard. It turns out that the framework is inherently robust to adversarial noise as long as the noise rate $\eta \le O(\epsilon)$, as set out in Awasthi et al. (2017). Technically speaking, this comes from the fact that the expected error within the localized sampling region is only a constant away from the error rate of $w^\star$, which suffices to establish the desired bound on the classification error rate. Moreover, we can show that the additional fairness constraint does not hurt this analysis.

Finally, in order to incorporate the sparsity structure of the underlying halfspace, we enforce an additional $\ell_1$-norm constraint in the localized hypothesis space. This narrows down the search space and hence improves the label complexity. At a technical level, the additional $\ell_1$-norm constraint significantly reduces the Rademacher complexity (Kakade et al., 2008). Since the $\ell_1$-norm constraint may promote a non-sparse halfspace, at the end of each phase we perform hard thresholding to ensure sparsity.
This will increase the error rate by a constant factor, but we can show that overall it is still well-controlled, a key observation made in Zhang (2018).

### 1.4. Roadmap

We discuss more related works in Section 2, covering fairness and active learning. A concrete problem setup is presented in Section 3. We elaborate on our main algorithms in Section 4, followed by performance guarantees in Section 5. We conclude the paper in Section 6, and defer all proof details to the appendix.

## 2. Related Works

In this section, we provide a brief review of learning with fairness and of active learning.

**Fairness.** There are two important fairness criteria: statistical fairness and individual fairness. Statistical fairness singles out a small number of protected demographic groups (e.g. kids) and then requires that some statistical metric be equal across all these groups. The algorithms use a variety of metrics, the most popular of which are the raw positive classification rate (Feldman et al., 2015), the false positive and false negative classification rates (Hardt et al., 2016; Kleinberg et al., 2017), and the positive predictive value (Kleinberg et al., 2017). However, statistical fairness does not provide adequate protection for individuals or a structured subgroup, since it only examines whether the protected groups receive an average benefit. In contrast to statistical fairness, individual fairness focuses on specific individuals rather than an average across populations. Dwork et al. (2012) suggested that for each pair of individuals, a metric should be used in such a way that similar individuals are treated similarly; this is the concept of individual fairness. Based on this notion, a series of algorithms has been developed that integrate the online learning setting and guarantee individual fairness via an oracle that evaluates fairness violations (Kim et al., 2018; Kearns et al., 2018; Gillen et al., 2018).
A very elegant work due to Yona & Rothblum (2018) presented a generalization guarantee of fairness under the notion of approximate metric-fairness. For a comprehensive discussion of fairness in machine learning, we refer readers to the recent survey by Mehrabi et al. (2021).

**Active learning.** Active learning concerns the scenario where unlabeled data are abundant but labeling can be very expensive. The study of active learning was initiated by Cohn et al. (1994). It turns out, however, that in general, even for the very simple problem of learning halfspaces, active learning may not provide any improvement in label complexity over passive learning (Dasgupta, 2005). In this regard, distributional assumptions on the unlabeled data are often made, such as the uniform distribution on a unit ball (Balcan et al., 2007). One line of research combines empirical risk minimization with active learning and demonstrates surprising properties beyond label efficiency, such as noise tolerance (Awasthi et al., 2017; Shen, 2021b). Another line of research designs more sophisticated algorithms for improved noise tolerance (Yan & Zhang, 2017; Zhang et al., 2020; Diakonikolas et al., 2020b;a; Shen, 2021a; Zhang & Li, 2021). Interestingly, Awasthi et al. (2016); Zhang (2018) observed that the margin-based active learning framework is inherently compatible with attribute-efficiency, an important property that has been investigated in learning theory and statistics (Blum, 1990; Blum et al., 1991; Donoho, 2006; Tropp & Wright, 2010; Shen & Li, 2018). The objective of attribute-efficient learning is to learn a sparse model with improved sample and label complexity, typically with the hope of obtaining bounds that are logarithmic in the dimension.
More recently, Shen & Zhang (2021) developed an attribute-efficient active learning algorithm that is robust to the malicious noise, where both instances and labels can be adversarially corrupted (Valiant, 1985).

## 3. Preliminaries

For a vector $w \in \mathbb{R}^d$, we denote its $\ell_2$- and $\ell_1$-norms by $\|w\|_2$ and $\|w\|_1$, respectively. We use $\|w\|_0$ to count the number of non-zero elements of $w$. We denote by $B_2(w, r)$ the $\ell_2$-ball centered at $w$ with radius $r$, and by $B_1(w, \rho)$ the $\ell_1$-ball centered at $w$ with radius $\rho$. We define the angle between two vectors $w_1, w_2 \in \mathbb{R}^d$ as $\theta(w_1, w_2) = \arccos\frac{w_1 \cdot w_2}{\|w_1\|_2 \|w_2\|_2}$. The letter $c$ and its subscript variants such as $c_1, c_2, c_3$ are reserved for specific constants; the letters $\kappa$ and $\bar{c}$ are also reserved constants. Readers may refer to Appendix A for their detailed values. The capital letter $K$ and its subscript variants are also constants, but their values may change from appearance to appearance.

Recall that we denote the instance space by $\mathcal{X} = \mathbb{R}^d$, the label space by $\mathcal{Y}$, and the class of $t$-sparse homogeneous halfspaces by $\mathcal{H} := \{x \mapsto \mathrm{sign}(w \cdot x) : w \in \mathbb{R}^d, \|w\|_2 = 1, \|w\|_0 \le t\}$. Note that a halfspace is non-sparse if $t = d$; hence, this is a more general definition of the underlying hypothesis class.

In our algorithm, we will often choose unlabeled instances from $D_X$ conditional on a sampling region $B := \{x \in \mathbb{R}^d : |w \cdot x| \le b\}$. This can be done by repeatedly calling $\mathrm{EX}_D$ until seeing an instance $x$ lying in $B$; this process is referred to as rejection sampling. The distribution of $D_X$ conditional on the event $x \in B$ is denoted by $D_{X|B}$.

We will consider a scaled hinge loss as a proxy for the 0/1 classification error,

$$\ell_\tau\big(w; (x, y)\big) := \max\Big\{0,\; 1 - \frac{y \, w \cdot x}{\tau}\Big\}. \tag{1}$$

Note that even with the scaling parameter $\tau > 0$, this hinge loss still upper bounds the 0/1 classification error. The expected scaled hinge loss for a distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ is thus given by

$$\ell_\tau(w; D) := \mathbb{E}_{(x,y) \sim D}\big[\ell_\tau(w; (x, y))\big]. \tag{2}$$

Further, we will consider the empirical scaled hinge loss with respect to a sample set $S \subseteq \mathcal{X} \times \mathcal{Y}$,

$$\ell_\tau(w; S) := \frac{1}{|S|} \sum_{(x,y) \in S} \ell_\tau\big(w; (x, y)\big). \tag{3}$$

In the sequel, we summarize useful definitions for fairness.

**Definition 6** (Metric-fairness loss). The metric-fairness loss on a pair $(x, x')$ with respect to a metric $\zeta : \mathcal{X} \times \mathcal{X} \to [0, 1]$ is as follows:

$$f_\zeta(w; (x, x')) := \mathbb{1}\big\{|w \cdot x - w \cdot x'| > \zeta(x, x') + \gamma\big\}, \tag{4}$$

where $\mathbb{1}\{E\}$ is the indicator function which outputs 1 if the event $E$ holds and 0 otherwise. The expected metric-fairness loss with respect to the distribution $D_X$ is thus given by

$$f_\zeta(w; D_X) := \mathbb{E}_{(x,x') \sim D_X \times D_X}\big[f_\zeta(w; (x, x'))\big]. \tag{5}$$

We will establish uniform convergence of the metric-fairness loss through Rademacher complexity. One key requirement is that the empirical pairs $(x, x')$ be independent draws from $D_X \times D_X$. In our algorithmic design, however, we will not draw pairs directly. Rather, we draw a set $T$ of instances and construct the set of pairs through the notion of matching in graph theory, by thinking of $T$ as a fully connected graph with instances being the vertices, an elegant idea due to Yona & Rothblum (2018).

**Definition 7** (Matching). Given a set $T$ of instances in $\mathcal{X}$, a matching $M(T)$ is a set of instance pairs with no common instances. In addition, every instance in $T$ is contained in one pair of $M(T)$.

Equipped with a matching $M(T)$, it is possible to evaluate the degree of violation of the fairness constraint for a given hypothesis $w$ on a given set $T$. For example, we may consider the empirical metric-fairness loss $f_\zeta(w; M(T)) := \frac{1}{|M(T)|} \sum_{(x,x') \in M(T)} f_\zeta(w; (x, x'))$. However, $f_\zeta(w; (x, x'))$ is a nonconvex function that is intractable to optimize. Yona & Rothblum (2018) considered a function of the form $\hat{f}_\zeta(w; (x, x')) := \max\{0, |w \cdot x - w \cdot x'| - \zeta(x, x')\}$, which is convex but not always an upper bound of the 0/1 metric-fairness loss (4). This resulted in technical complications in their uniform convergence analysis. We propose to use the following surrogate function.
**Definition 8** (Surrogate metric-fairness loss). The surrogate metric-fairness loss on a pair $(x, x')$ with respect to a metric $\zeta : \mathcal{X} \times \mathcal{X} \to [0, 1]$ is as follows:

$$f^G_\zeta(w; (x, x')) := \max\big\{0,\; G\big(|w \cdot x - w \cdot x'| - \zeta(x, x')\big) + 1\big\}, \tag{6}$$

where $G > 0$ is a parameter.

It is easy to see that the surrogate loss is convex with respect to $w$ and is always an upper bound of the 0/1 loss in Definition 6. Thus, given an instance set $T$, we can build an arbitrary matching $M(T)$ and consider the convex fairness loss over $M(T)$ as follows:

$$f^G_\zeta(w; M(T)) := \frac{1}{|M(T)|} \sum_{(x,x') \in M(T)} f^G_\zeta(w; (x, x')). \tag{7}$$

Since our algorithm and analysis hold for any matching of $T$, we will often write $f^G_\zeta(w; T)$ in place of $f^G_\zeta(w; M(T))$.

## 4. Main Algorithms

We describe the main algorithms in this section. For the purpose of exposition, we start with a basic algorithm that works under noise-free data and without a sparsity pattern in the underlying halfspace. We then present an attribute-efficient algorithm for the case where the hypothesis class is $t$-sparse halfspaces. For both algorithms, there is no particular treatment of the adversarial noise; it enters only our theoretical analysis.

### 4.1. Hyper-parameter setting

Our algorithms proceed in phases. In each phase $k$, we set

$$r_k = 2^{-k-3}, \quad b_k = \bar{c}\, r_k, \quad \rho_k = 2\sqrt{t}\, r_k, \quad \tau_k = \kappa\, b_k, \tag{8}$$

where $\rho_k$ is used only in Algorithm 2. The failure probability is $\delta_k := \frac{\delta}{(k+1)(k+2)}$. We will draw a set of instances $T_k$ by calling $\mathrm{EX}_D$ for $n_k$ times, and then query the label of some instances in $T_k$ by calling $\mathrm{EX}_D^Y$ for $m_k$ times, where

$$m_k = K_2\, t \log^3\frac{d}{\alpha \delta_k} \cdot \log d. \tag{10}$$

**Algorithm 1** Active Learning of Halfspaces with Approximate Metric-Fairness

Require: Target classification error rate $\epsilon$, failure probability $\delta$, fairness metric function $\zeta(\cdot, \cdot)$, fairness error rate $\alpha$, sample generation oracle $\mathrm{EX}_D$, label revealing oracle $\mathrm{EX}_D^Y$.
Ensure: A halfspace $\hat{w}$ with $\mathrm{err}_D(\hat{w}) \le \epsilon$ that is also $\alpha$-approximately metric-fair.
1: $k_{\max} \leftarrow \log\big(\frac{\pi}{128 c_1 \epsilon}\big)$.
2: for $k = 1, 2, \ldots, k_{\max}$ do
3: &nbsp;&nbsp;&nbsp; $T_k \leftarrow$ independently draw $n_k$ instances from $D_X$.
4: &nbsp;&nbsp;&nbsp; Build a matching $M(T_k)$ and construct $W_k$.
5: &nbsp;&nbsp;&nbsp; $S_k \leftarrow$ randomly sample $m_k$ instances in $T_k \cap B_k$ and query their labels.
6: &nbsp;&nbsp;&nbsp; Find $w_k \in W_k$ such that $\ell_{\tau_k}(w_k; S_k) \le \min_{w \in W_k} \ell_{\tau_k}(w; S_k) + \kappa$.
7: end for
8: return $\hat{w} \leftarrow w_{k_{\max}}$.

Recall that $\kappa, \bar{c} > 0$ are reserved constants (see Appendix A), and $K_1, K_2 > 0$ are constants whose values we will not particularly track. It is also worth noting that $n_k$ and $m_k$ are referred to as the sample size and label size at phase $k$.

### 4.2. Active learning of general halfspaces with fairness

Our algorithm proceeds in multiple phases. Fix a phase $k \ge 1$. The basic metric-fair active learning algorithm, Algorithm 1, consists of two major stages: sampling from a region $B_k$, called localized sampling, and minimizing a hinge loss under a certain constraint set $W_k$. Since the sample generation oracle only returns instances drawn from $D_X$, we need to call it sufficiently many times to obtain an instance set $T_k$, and identify those in $B_k$. The localized sampling region $B_k$ is defined as follows:

$$B_k = \begin{cases} \mathbb{R}^d & k = 1, \\ \{x : |w_{k-1} \cdot x| \le b_k\} & k \ge 2. \end{cases} \tag{11}$$

Intuitively, as the algorithm proceeds, the iterate $w_{k-1}$ will be very close to the target halfspace $w^\star$, and thus only instances close to the decision boundary of $w_{k-1}$ are informative, in the sense that $w_{k-1}$ may disagree with $w^\star$ on their labels; this is why we only query labels in the band $B_k$. Note that such a setting is standard in margin-based active learning (Awasthi et al., 2017).

The design of $W_k$ is more delicate, as we need to ensure that the iterates are metric-fair. Without the fairness consideration, it is known in the active learning literature that we can choose $W_k$ to be the same as $Q_k$, where

$$Q_k = \begin{cases} B_2(0, 1) & k = 1, \\ B_2(0, 1) \cap B_2(w_{k-1}, r_k) & k \ge 2. \end{cases} \tag{12}$$

The constraint set $Q_k$ has two properties with high probability: 1) it is shrinking; and 2) the target halfspace $w^\star$ keeps lying in $Q_k$ for all $k \ge 1$ (see Proposition 21). To account for the fairness property, we need an additional constraint.
We will consider

$$M_k = \Big\{w \in \mathbb{R}^d : f^{G_k}_\zeta(w; T_k) \le \frac{\alpha}{2}\Big\}. \tag{13}$$

Recall that $f^{G_k}_\zeta(w; T_k)$ is a convex function that upper bounds the 0/1 fairness error. This will be a useful fact for showing the generalization ability of the metric-fairness property. The parameter $G_k > 0$ can be adaptively selected; yet, we find that $G_k = 1$ works in our analysis. It is unclear whether a more involved choice would improve the performance guarantees; we leave this as future study. We would also like to mention that Yona & Rothblum (2018) proposed to use the surrogate loss $\hat{f}_\zeta(w; T_k) := \frac{1}{|M(T_k)|} \sum_{(x,x') \in M(T_k)} \max\{0, |w \cdot x - w \cdot x'| - \zeta(x, x')\}$. This is convex as well, but the trouble is that it is not always an upper bound of the 0/1 fairness error defined in Definition 6; thus, there are technical complications in the analysis of Yona & Rothblum (2018).

Consequently, we will consider a constraint set for the candidate halfspaces as follows:

$$W_k = M_k \cap Q_k. \tag{14}$$

Note that any $w \in W_k$ is $\frac{\alpha}{2}$-approximately metric-fair on the instance set $T_k$.

Then we need to find a good halfspace $w_k$. To this end, we minimize the hinge loss $\ell_{\tau_k}(w; S_k)$. The labeled instance set $S_k$ is chosen in a rather involved manner. Given the set $T_k$ drawn at the beginning, we first identify the instances residing in $B_k$, denoted $T'_k := T_k \cap B_k$. While prior works query the labels of all instances in $T'_k$ (Balcan et al., 2007; Zhang, 2018), we first perform a random sampling from $T'_k$ to form an instance set of size $m_k$, and query $\mathrm{EX}_D^Y$ on their labels. As will be clear in the analysis, $m_k$ is orders of magnitude smaller than the size of $T'_k$, a key step to ensure the announced label complexity. In fact, had we not performed the random sampling, our label complexity would be proportional to $\frac{1}{\alpha^2}$, which is exponentially more than what we obtained. Technically speaking, the reason that we only need a small amount of labeled data is that $m_k$ labeled instances suffice to guarantee uniform convergence to the expected hinge loss, which will imply a small classification error.
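As a toy illustration (not the paper's implementation), the membership test for the constraint set $W_k = M_k \cap Q_k$ can be sketched as follows. Here the matching simply pairs consecutive instances of $T_k$, $G = 1$ as in the analysis, and the function names are ours:

```python
import numpy as np

def surrogate_fair_loss(w, T, zeta, G=1.0):
    """Empirical convex surrogate f^G_zeta(w; M(T)), where the matching
    M(T) pairs consecutive instances of T."""
    pairs = [(T[i], T[i + 1]) for i in range(0, len(T) - 1, 2)]
    vals = [max(0.0, G * (abs(np.dot(w, x) - np.dot(w, xp)) - zeta(x, xp)) + 1.0)
            for x, xp in pairs]
    return float(np.mean(vals))

def in_Wk(w, w_prev, r_k, T_k, zeta, alpha, G=1.0):
    """Membership test for W_k = M_k intersect Q_k (phase k >= 2):
      Q_k: ||w||_2 <= 1 and ||w - w_prev||_2 <= r_k,
      M_k: surrogate fairness loss on T_k at most alpha / 2."""
    in_Qk = np.linalg.norm(w) <= 1.0 and np.linalg.norm(w - w_prev) <= r_k
    in_Mk = surrogate_fair_loss(w, T_k, zeta, G) <= alpha / 2.0
    return in_Qk and in_Mk

# Example: the zero vector trivially satisfies both constraints when zeta = 1.
T = [np.ones(3), -np.ones(3)]
ok = in_Wk(np.zeros(3), np.zeros(3), 0.5, T, lambda x, xp: 1.0, alpha=0.1)
```

In the actual algorithm one would minimize the hinge loss over this convex set rather than merely test membership; the sketch only makes the two constraints concrete.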
At the other end of the spectrum, we do need a large amount of unlabeled data (i.e. $T_k$) to guarantee uniform convergence to the expected fairness error. Therefore, the algorithm entangles the unlabeled and labeled data yet disentangles their sizes. Such an idea of random sampling also appeared in a very recent work of Shen & Zhang (2021), but was motivated in a quite different context: there, a bulk of instances $T_k$ is drawn to detect malicious instances (Valiant, 1985), and then a small amount is sampled for labeling to obtain near-optimal label complexity.

Equipped with the labeled instance set $S_k$ and the convex constraint set $W_k$, we optimize the empirical hinge loss up to a constant $\kappa > 0$, which is computationally efficient. Here, the hinge loss is parameterized by a Lipschitz coefficient $\tau_k$ that shrinks exponentially with the phase number $k$; this will be useful for controlling the label complexity. A potential concern is whether such a convex program is feasible. We give an affirmative answer; in fact, we show that the target halfspace is a feasible solution, namely, $w^\star \in W_k$ for all phases $k$. Observe that this immediately implies that after $O(\log\frac{1}{\epsilon})$ phases, we have $\|w_k - w^\star\|_2 \le \epsilon$, due to the shrinking $\ell_2$-norm constraint imposed in $Q_k$ and the setting $r_k = \Theta(2^{-k})$.

Notably, a natural characteristic of our algorithm is its noise tolerance. When labels are corrupted by adversarial noise during the training phase, the shrinking sampling region $B_k$ well controls the amplitude of the noise. Therefore, Algorithm 1 simultaneously guarantees metric fairness and tolerance to adversarial noise.

### 4.3. Active learning of sparse halfspaces with fairness

Now we present Algorithm 2, which leverages the prior knowledge that the underlying hypothesis class is $t$-sparse halfspaces to achieve attribute efficiency.
Inspired by Zhang (2018), we modify the constraint set Qk in (12) as follows to incorporate the sparsity structure:

Qk = B2(0, 1) ∩ B1(0, √t) if k = 1;
Qk = Q1 ∩ B2(wk−1, rk) ∩ B1(wk−1, ρk) if k ≥ 2. (15)

The key difference from the Qk used in Algorithm 1 is that the constraint set now involves the intersection of an ℓ1-norm ball and an ℓ2-norm ball centered at wk−1, which is useful for establishing a label complexity that is logarithmic in the dimension. Furthermore, we also narrow down the constraint set Qk so that it is a subset of Q1 for all k ≥ 1. This simplifies our analysis of the generalization of metric-fairness (but is not vital). It is worth mentioning that the hinge loss minimizer vk is not always t-sparse. For technical reasons, we will perform a hard thresholding step Pt(vk), which sets all but the t largest (in magnitude) elements of vk to zero, followed by an ℓ2-normalization. We note that though hard thresholding is a non-convex projection, it is still possible to control the resultant deviation; see Proposition 21. Another notable aspect of Algorithm 2 is that the returned hypothesis is v_{kmax} rather than w_{kmax}.

Algorithm 2 Active Learning of Sparse Halfspaces with Approximate Metric-Fairness
Require: sparsity parameter t, target classification error rate ϵ, failure probability δ, fairness metric function ζ(·, ·), fairness error rate α, sample generation oracle EXD, label revealing oracle EXYD.
Ensure: a halfspace ŵ with errD(ŵ) ≤ ϵ that is also α-approximately metric-fair.
1: kmax ← log(π/(128 c1 ϵ)).
2: for k = 1, 2, . . . , kmax do
3:   Tk ← independently draw nk instances from DX.
4:   Build a matching M(Tk) and construct Wk.
5:   Sk ← randomly sample mk instances in Tk ∩ Bk and query their labels.
6:   Find vk ∈ Wk such that ℓ_{τk}(vk; Sk) ≤ min_{w∈Wk} ℓ_{τk}(w; Sk) + κ.
7:   wk ← Pt(vk)/∥Pt(vk)∥2.
8: end for
return ŵ ← v_{kmax}.
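A bare-bones sketch of one phase of Algorithm 2 (steps 3 to 7) may help fix ideas: filter the band, label only a small random subsample, approximately minimize the hinge loss by projected subgradient descent, then hard-threshold and normalize. The fairness constraint Mk and the ℓ1-ball constraint are omitted here for brevity (any convex solver handling the full set Wk would do), and all step sizes and budgets are illustrative placeholders, not the paper's settings.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_ball(w, center, radius):
    """Euclidean projection onto the l2 ball B_2(center, radius)."""
    diff = [a - c for a, c in zip(w, center)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm <= radius:
        return w
    s = radius / norm
    return [c + s * d for c, d in zip(center, diff)]

def hard_threshold(v, t):
    """P_t(v): keep the t largest-magnitude entries, zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:t])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def one_phase(T_k, w_prev, b_k, r_k, tau_k, m_k, t, query_label,
              steps=300, lr=0.1):
    # step 5: band filtering, then only a small random subsample gets labeled
    band = [x for x in T_k if abs(dot(w_prev, x)) <= b_k]
    S_k = [(x, query_label(x)) for x in random.sample(band, min(m_k, len(band)))]
    # step 6: approximate hinge-loss minimization over B_2(w_prev, r_k)
    w = list(w_prev)
    for _ in range(steps):
        g = [0.0] * len(w)
        for x, y in S_k:
            if y * dot(w, x) / tau_k < 1:   # subgradient of the active hinge term
                for i, xi in enumerate(x):
                    g[i] -= y * xi / (tau_k * len(S_k))
        w = project_ball([wi - lr * gi for wi, gi in zip(w, g)], w_prev, r_k)
    # step 7: hard thresholding followed by l2-normalization
    p = hard_threshold(w, t)
    norm = sum(a * a for a in p) ** 0.5
    return [a / norm for a in p]
```

The key point the sketch makes visible is that the label oracle is invoked only mk times per phase, while the (much larger) unlabeled set Tk is used for filtering and, in the full algorithm, for the fairness constraint.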
Though it is possible to show that w_{kmax} enjoys the PAC guarantee just as v_{kmax} does, we find it technically difficult to prove its fairness guarantee, due to the non-convex nature of the hard thresholding operation. Therefore, the returned halfspace is not t-sparse, yet comes with the PACF guarantee and attribute efficiency.

5. Performance Guarantees

We provide the performance guarantees for our algorithms in this section. Since Algorithm 1 is a special case of Algorithm 2, we state the main results for the latter first; the guarantees of Algorithm 1 follow as an immediate corollary.

Theorem 9 (Main result). Suppose Assumptions 1, 2, 3 are satisfied, and w* is such that f_ζ(w*; DX) = 0. For any given ϵ, δ, α ∈ (0, 1), if η ≤ c0·ϵ for a sufficiently small constant c0 > 0, then the following holds. With probability 1 − δ, Algorithm 2 runs in polynomial time and returns a halfspace ŵ such that errD(ŵ) ≤ ϵ. In addition, ŵ is α-approximately metric-fair. The sample complexity is Õ(1/ϵ), and the label complexity is O(t · log³(d/(αδ)) · log d · log(1/ϵ)).

Observe that the fairness metric ζ(·, ·) enters our analysis only through its maximal value, which is assumed to be 1 in Definition 1. It is straightforward to reproduce our analysis for any universally bounded metric function; we leave it to interested readers.

Corollary 10. Suppose Assumptions 1 and 2 are satisfied, and w* is such that f_ζ(w*; DX) = 0. For any given ϵ, δ, α ∈ (0, 1), if η ≤ c0·ϵ for a sufficiently small constant c0 > 0, then the following holds. With probability 1 − δ, Algorithm 1 runs in polynomial time and returns a halfspace ŵ such that errD(ŵ) ≤ ϵ. In addition, ŵ is α-approximately metric-fair. The sample complexity is Õ(1/ϵ), and the label complexity is O(d · log³(d/(αδ)) · log d · log(1/ϵ)).

5.1. Proof sketch of Theorem 9

To prove our main result, we show a generalization bound for the metric-fair loss and then establish the PAC guarantee. To begin with, we demonstrate a generalization bound for the metric-fair loss f_ζ^{Gk}(w; (x, x′)).
Lemma 11 (Lemma 14, informal). Consider phase k of Algorithm 2. Let Zk := max_{(x,x′)∈M(Tk)} ζ(x, x′), Xk := max_{x∈Tk} ∥x∥∞, and Πk := Gk(2√t·Xk + Zk). If |Tk| ≥ K · (1/(α′)²) · (Gk²Xk²·t log d + Gk²Zk² + Πk²) · log(1/δ′) for some constant K > 0, then with probability 1 − δ′ over the draw of Tk, the following holds for all w ∈ Wk and any α′ > 0: |f_ζ^{Gk}(w; DX) − f_ζ^{Gk}(w; Tk)| ≤ α′.

Such a maximal difference between the empirical and expected loss functions follows from uniform convergence via Rademacher complexity (Bartlett & Mendelson, 2001). The primary challenge is to estimate the Rademacher complexity of the underlying hypothesis class. We show that it is possible to upper bound the maximal value of the function f_ζ^{Gk}(w; (x, x′)) in view of the constraint set Wk and our distributional assumption on DX. This, in conjunction with the fact that the loss is Gk-Lipschitz, implies the desired result. By setting α′ = α/2 in Lemma 11, combining with the constraint f_ζ^{Gk}(w; Tk) ≤ α/2, and observing that f_ζ^{Gk}(w; (x, x′)) always upper bounds the 0/1 fairness error f_ζ(w; (x, x′)), we obtain the approximate metric-fairness guarantee; see Appendix B for the full proof.

Theorem 12 (Generalization of fairness). Consider phase k of Algorithm 2. With probability 1 − δk over the draw of Tk, all w ∈ Wk are α-approximately metric-fair with respect to ζ(·, ·), provided that |Tk| ≥ K · (1/α²) · t log⁴(d/δk) for some constant K > 0. In particular, the hinge loss minimizer vk is α-approximately metric-fair.

We now discuss the PAC guarantee stated in Theorem 9. Let T′k = Tk ∩ Bk, and imagine that T′k is labeled by EXYD. In this way, the samples in T′k are independent draws from Dk, the distribution D conditioned on the event x ∈ Bk. Consequently, Sk is a subset randomly sampled from T′k. We show that with our settings of nk and mk, the following inequalities hold:

|ℓ_{τk}(w; Sk) − ℓ_{τk}(w; T′k)| ≤ κ,
|ℓ_{τk}(w; T′k) − ℓ_{τk}(w; Dk)| ≤ κ,
|ℓ_{τk}(w; Dk) − L_{τk}(w; DX|Bk)| ≤ κ,

where L_{τk}(w; DX|Bk) := E_{x∼DX|Bk}[ℓ_{τk}(w; (x, sign(w*·x)))].
The first inequality shows that the random sampling of Sk from T′k preserves the hinge loss, which is crucial for obtaining improved label complexity, since we only need to annotate Sk, whose size mk is much less than |T′k| (which is roughly Θ(bk·nk)). The second inequality is not surprising, due to uniform convergence. The last inequality is very useful for handling the adversarial noise, since it asserts that the hinge loss on the corrupted distribution Dk does not deviate far from that on the clean distribution. Combining them, we can show that L_{τk}(w; DX|Bk) ≤ ℓ_{τk}(w; Sk) up to a constant additive factor. This suffices to establish the PAC guarantee in view of standard results from margin-based active learning; see Appendix C. Combining the PAC guarantee and Theorem 12, we obtain the PACF guarantee. Finally, recall that the sample complexity refers to the total number of calls to EXD, which is the sum of all nk := |Tk|; the label complexity refers to the total number of calls to EXYD, which is the sum of all mk := |Sk|. This completes the proof of Theorem 9, observing our settings of nk and mk in Section 4.1.

6. Conclusion and Future Works

In this paper, we presented the first computationally efficient active learning algorithm with the property of approximate metric-fairness. The core idea is to interleave unlabeled and labeled data in a delicate way, obtaining an exponential improvement in label complexity while retaining metric-fairness for the candidate hypotheses. Our analysis is based on the presumption that a perfectly metric-fair hypothesis exists in the given hypothesis class. It would be useful to consider the weaker condition that such a target hypothesis is only approximately metric-fair.
It is also important to investigate whether we can improve the sample complexity in terms of the dependence on the fairness error rate, say, to be proportional to 1/α.

Acknowledgements

We thank the anonymous reviewers and meta-reviewer for valuable comments on improving the notation and proof structure. This work is supported by NSF-IIS-1948133 and the startup funding from Stevens Institute of Technology.

References

Awasthi, P., Balcan, M., Haghtalab, N., and Zhang, H. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Annual Conference on Learning Theory, pp. 152-192, 2016.

Awasthi, P., Balcan, M., and Long, P. M. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50:1-50:27, 2017.

Balcan, M. and Long, P. M. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the 26th Annual Conference on Learning Theory, pp. 288-316, 2013.

Balcan, M., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 65-72, 2006.

Balcan, M., Broder, A. Z., and Zhang, T. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pp. 35-50, 2007.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of the 14th Annual Conference on Computational Learning Theory, pp. 224-240, 2001.

Blum, A. Learning Boolean functions in an infinite attribute space. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pp. 64-72, 1990.

Blum, A., Hellerstein, L., and Littlestone, N. Learning in the presence of finitely or infinitely many irrelevant attributes. In Proceedings of the 4th Annual Workshop on Computational Learning Theory, pp. 157-166, 1991.

Candès, E. J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.

Chouldechova, A. and Roth, A. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82-89, 2020.

Cohn, D. A., Atlas, L. E., and Ladner, R. E. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

Dasgupta, S. Coarse sample complexity bounds for active learning. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 235-242, 2005.

Dasgupta, S. The two faces of active learning. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pp. 1, 2009.

Dasgupta, S., Kalai, A. T., and Monteleoni, C. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 249-263, 2005.

Diakonikolas, I., Kane, D. M., Kontonis, V., Tzamos, C., and Zarifis, N. A polynomial time algorithm for learning halfspaces with Tsybakov noise. CoRR, abs/2010.01705, 2020a.

Diakonikolas, I., Kontonis, V., Tzamos, C., and Zarifis, N. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, pp. 1486-1513, 2020b.

Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 6478-6490, 2021.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

Dwork, C. and Ilvento, C. Fairness under composition. In Proceedings of the 10th Innovations in Theoretical Computer Science Conference, pp. 33:1-33:20, 2019.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. S. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226, 2012.
Dwork, C., Ilvento, C., and Jagadeesan, M. Individual fairness in pipelines. In Proceedings of the 1st Symposium on Foundations of Responsible Computing, pp. 7:1-7:22, 2020a.

Dwork, C., Ilvento, C., Rothblum, G. N., and Sur, P. Abstracting fairness: Oracles, metrics, and interpretability. In Proceedings of the 1st Symposium on Foundations of Responsible Computing, pp. 8:1-8:16, 2020b.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining, pp. 259-268, 2015.

Gillen, S., Jung, C., Kearns, M. J., and Roth, A. Online learning with an unknown fairness metric. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pp. 2605-2614, 2018.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 3315-3323, 2016.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 793-800, 2008.

Kearns, M. J., Schapire, R. E., and Sellie, L. Toward efficient agnostic learning. In Proceedings of the 5th Annual Conference on Computational Learning Theory, pp. 341-352, 1992.

Kearns, M. J., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, pp. 2569-2577, 2018.

Kim, M. P., Reingold, O., and Rothblum, G. N. Fairness through computationally-bounded awareness. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pp. 4847-4857, 2018.

Kleinberg, J. M., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, pp. 43:1-43:23, 2017.

Littlestone, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm (extended abstract). In Proceedings of the 28th Annual Symposium on Foundations of Computer Science, pp. 68-77, 1987.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 3156-3164, 2018.

Liu, L. T., Simchowitz, M., and Hardt, M. The implicit fairness criterion of unconstrained learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 4051-4060, 2019.

Lovász, L. and Vempala, S. S. The geometry of log-concave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307-358, 2007a.

Lovász, L. and Vempala, S. S. The geometry of log-concave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307-358, 2007b.

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6):115:1-115:35, 2021.

Rosenblatt, F. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.

Shen, J. On the power of localized Perceptron for label-optimal learning of halfspaces with adversarial noise. In Proceedings of the 38th International Conference on Machine Learning, pp. 9503-9514, 2021a.

Shen, J. Sample-optimal PAC learning of halfspaces with malicious noise. In Proceedings of the 38th International Conference on Machine Learning, pp. 9515-9524, 2021b.

Shen, J. and Li, P. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1-42, 2018.

Shen, J. and Zhang, C. Attribute-efficient learning of halfspaces with malicious noise: Near-optimal label complexity and noise tolerance. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pp. 1072-1113, 2021.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

Tropp, J. A. and Wright, S. J. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948-958, 2010.

Valiant, L. G. A theory of the learnable. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing, pp. 436-445, 1984.

Valiant, L. G. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pp. 560-566, 1985.

Vempala, S. S. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):32:1-32:14, 2010.

Yan, S. and Zhang, C. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 1056-1066, 2017.

Yona, G. and Rothblum, G. N. Probably approximately metric-fair learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 5666-5674, 2018.

Zemel, R. S., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pp. 325-333, 2013.

Zhang, C. Efficient active learning of sparse halfspaces. In Proceedings of the 31st Annual Conference on Learning Theory, pp. 1856-1880, 2018.

Zhang, C. and Li, Y. Improved algorithms for efficient active learning of halfspaces with Massart and Tsybakov noise. In Proceedings of the 34th Annual Conference on Learning Theory, pp. 4526-4527, 2021.
Zhang, C., Shen, J., and Awasthi, P. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 7184-7197, 2020.

A. Summary of Useful Notations and Reserved Parameters

We recall the following fairness loss functions on a pair (x, x′) ∈ X × X:

f_ζ(w; (x, x′)) = 1{ |w·x − w·x′| − ζ(x, x′) > 0 },
f_ζ^G(w; (x, x′)) = max{0, G(|w·x − w·x′| − ζ(x, x′)) + 1}.

Note that the latter is always an upper bound of the former, and that the latter is convex with respect to w. Given a set T of instances drawn from the distribution DX and its matching M(T), we define

f_ζ(w; M(T)) = (1/|M(T)|) Σ_{(x,x′)∈M(T)} f_ζ(w; (x, x′)),
f_ζ(w; DX) = E_{(x,x′)∼DX×DX}[f_ζ(w; (x, x′))].

Note that since M(T) is a matching, we have |M(T)| = |T|/2. Since an arbitrary matching works in our analysis, we often omit the specification and write f_ζ(w; T) := f_ζ(w; M(T)). Likewise, we can define f_ζ^G(w; T) and f_ζ^G(w; DX). We recall the following scaled hinge loss function on a pair (x, y) ∈ X × Y:

ℓ_τ(w; (x, y)) = max{0, 1 − y·w·x/τ}.

Given a distribution D on X × Y, let ℓ_τ(w; D) := E_{(x,y)∼D}[ℓ_τ(w; (x, y))]. Further, we will consider the hinge loss with respect to a set S of labeled instances, ℓ_τ(w; S) := (1/|S|) Σ_{(x,y)∈S} ℓ_τ(w; (x, y)). For a given phase k ≥ 1, we collect useful notations in Table 2.

Table 2. Notations used in phase k of Algorithm 2.
Bk: the band {x : |w_{k−1}·x| ≤ bk}
DX|Bk: the marginal distribution DX conditioned on the event x ∈ Bk
Dk: the joint distribution D conditioned on the event x ∈ Bk
Tk: a set of instances drawn from DX
T′k: Tk ∩ Bk (to appear in our analysis)
Sk: a subset of T′k that is labeled by EXYD
M(Tk): a matching of Tk, viewing it as a graph
Qk: {w : ∥w∥2 ≤ 1, ∥w∥1 ≤ √t, ∥w − w_{k−1}∥2 ≤ rk, ∥w − w_{k−1}∥1 ≤ ρk}
Mk: {w : f_ζ^G(w; Tk) ≤ α/2}
Wk: Qk ∩ Mk
Xk: max_{x∈Tk} ∥x∥∞

Constants. We clarify the choices of absolute constants. The constants c1, c2, c3, c4, c5, c6 are specified in Lemma 23, Lemma 24, and Lemma 25.
The constants κ ≤ π/(2^{12}c2) and c̄ are jointly chosen. Let g(a) := (κc6·a + (c3π/4)·exp(−c4·a/(4π)))·c2. We set κ = exp(−a) and plug it into g(a). It is then easy to see that g(a) is continuous and tends to zero as a goes to infinity. Thus, there must exist c̄ such that g(c̄) ≤ π/2^8; in addition, c̄ is an absolute constant since all coefficients in g(a) are constants. We then let κ = min{exp(−c̄), π/(2^{12}c2)}.

B. Approximate Metric-Fairness Guarantee

Recall that Wk = Mk ∩ Qk. We first show that the target halfspace w* ∈ Mk; hence, the hinge loss minimization problem is always feasible. We then show that as long as we draw a sufficient number of instances, any feasible solution is approximately metric-fair (Theorem 12).

Lemma 13. Consider any phase k of Algorithm 2. The target halfspace w* is contained in Mk.

Proof. Since we consider the realizable setting, we have f_ζ(w*; DX) = 0. Namely, |w*·x − w*·x′| ≤ ζ(x, x′) holds almost surely. This implies that f_ζ^{Gk}(w*; Tk) = 0 ≤ α/2 almost surely.

The following theorem states that with our constructed constraint set Wk, any w ∈ Wk is α-approximately metric-fair.

B.1. Proof of Theorem 12

Proof. We will mainly use the result of Lemma 14. By our setting, we have Gk = 1, Zk ≤ 1, Xk = Θ(log(d·nk/δk)). Therefore, when nk ≥ Ω((1/α²)·(t log³ d)·log(1/δk)), we have

|f_ζ^{Gk}(w; DX) − f_ζ^{Gk}(w; Tk)| ≤ α/2. (16)

We now upper bound the expected metric-fairness loss when instances are drawn from DX in phase k. Note that for all w ∈ Wk, we have

f_ζ(w; DX) = E_{(x,x′)∼DX×DX}[f_ζ(w; (x, x′))] ≤ E_{(x,x′)∼DX×DX}[f_ζ^{Gk}(w; (x, x′))] ≤ f_ζ^{Gk}(w; Tk) + α/2 ≤ α,

where the first inequality follows from our construction of f_ζ^G(w; (x, x′)), which always upper bounds f_ζ(w; (x, x′)); the second inequality follows from (16); and the last step follows from the construction of the constraint set Wk, which ensures f_ζ^{Gk}(w; Tk) ≤ α/2.

B.2. Uniform convergence of metric-fairness loss

Lemma 14 (Formal statement of Lemma 11). Consider phase k of Algorithm 2. Let Zk := max_{(x,x′)∈M(Tk)} ζ(x, x′) and let n be the size of Tk.
With probability 1 − δ′ over the draw of Tk, the following holds:

|f_ζ^{Gk}(w; DX) − f_ζ^{Gk}(w; Tk)| ≤ 4Gk·Xk·√(2t log(2d)/n) + 2Gk·√(Zk²/n) + Πk·√(8 log(2/δ′)/n),

where Πk = Gk(2√t·Xk + Zk) + 1. Therefore, for any α′ > 0, |f_ζ^{Gk}(w; DX) − f_ζ^{Gk}(w; Tk)| ≤ α′ as soon as n ≥ K·(1/(α′)²)·(Gk²Xk²·t log d + Gk²Zk² + Πk²)·log(1/δ′) for some constant K > 0.

Proof. We consider the following hypothesis class:

G := {(x, x′) ↦ |w·x − w·x′| − ζ(x, x′) : w ∈ Wk}.

Observe that f_ζ^{Gk} = q ∘ g, where g ∈ G and q(a) = max{0, Gk·a + 1}. Let F be the hypothesis class of all such q ∘ g. Since q(·) is a Gk-Lipschitz function, it is known that

Rn(F) ≤ Gk·Rn(G),

where Rn(·) denotes the empirical Rademacher complexity of a sample set of size n. Let L := {x ↦ w·x : ∥w∥1 ≤ √t}. By Claim 2.17 of Yona & Rothblum (2018), we have

Rn(G) ≤ 4Rn(L) + 2√(max_{(x,x′)∈M(Tk)} ζ²(x, x′)/n).

To upper bound Rn(L), we make use of Theorem 1 of Kakade et al. (2008), observing that ∥w∥1 ≤ √t. This implies

Rn(L) ≤ Xk·√(2t log(2d)/n),

where we recall that Xk is an upper bound on the infinity norm of the instances in Tk. Putting everything together, we have

Rn(F) ≤ 4Gk·Xk·√(2t log(2d)/n) + 2Gk·√(max_{(x,x′)∈M(Tk)} ζ²(x, x′)/n).

Lastly, we need to upper bound f_ζ^{Gk} itself: f_ζ^{Gk} ≤ Gk·|a| + 1 ≤ Gk(2√t·Xk + Zk) + 1, where a denotes the argument of q. By standard uniform convergence via Rademacher complexity, e.g. Lemma 27, we obtain the claimed result.

C. PAC Guarantee

We aim to show that for any phase k, the distance between the new iterate wk and the target halfspace w* is half of that between w_{k−1} and w*. Thus, after O(log(1/ϵ)) iterations, we obtain an iterate with classification error rate lower than ϵ. Define the expected hinge loss with respect to clean samples in Bk as

L_{τk}(w; DX|Bk) = E_{x∼DX|Bk}[ℓ_{τk}(w; (x, sign(w*·x)))]. (19)

Our key tool is a crucial observation from the margin-based active learning framework (Balcan et al., 2007; Awasthi et al., 2017; Zhang et al., 2020), which states that in each phase, it suffices to find an iterate wk whose error rate within the band Bk is a small constant.
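The band localization behind this observation can be illustrated numerically: for an isotropic Gaussian (a concrete isotropic log-concave distribution), the band {x : |v·x| ≤ b} has mass Θ(b) (Lemma 25), and almost all of the disagreement mass between two nearby halfspaces lies inside the band (the exponential tail of Lemma 24). The sketch below is an illustration under these assumptions, with toy parameters.

```python
import math
import random

random.seed(7)
d, n, theta, b = 10, 20000, 0.1, 0.5

u = [1.0] + [0.0] * (d - 1)
v = [math.cos(theta), math.sin(theta)] + [0.0] * (d - 2)  # angle(u, v) = theta

dot = lambda a, c: sum(p * q for p, q in zip(a, c))
sign = lambda z: 1 if z >= 0 else -1

in_band = disagree = disagree_outside = 0
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    if abs(dot(v, x)) <= b:
        in_band += 1                       # mass of the band (Lemma 25: Theta(b))
    if sign(dot(u, x)) != sign(dot(v, x)):
        disagree += 1                      # disagreement region of u and v
        if abs(dot(v, x)) > b:
            disagree_outside += 1          # disagreements escaping the band

band_mass = in_band / n
outside_frac = disagree_outside / max(disagree, 1)
# band_mass is roughly proportional to b; outside_frac is tiny, so labeling
# only inside the band captures essentially all of the disagreement.
```

This is precisely why querying labels only within Bk suffices to drive down the error between wk and w*.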
We first recall that Dk is the joint distribution D on X × Y conditioned on the event x ∈ Bk. In our analysis, we need the following notion:

T′k := Tk ∩ Bk. (20)

We will imagine that T′k is labeled by EXYD. This is only for the purpose of analysis; in our algorithm, T′k itself is never labeled in full. Now Sk is a randomly sampled subset of T′k with size mk. Again, we remark that the number of labels we need in phase k is mk. Due to random sampling with replacement, we have

E_{Sk}[ℓ_{τk}(w; Sk)] = ℓ_{τk}(w; T′k). (21)

Using standard uniform convergence via Rademacher complexity, we can show that ℓ_{τk}(w; Sk) is close to ℓ_{τk}(w; T′k). For example, the following is implied by Proposition 36 of Shen & Zhang (2021).

Lemma 15. With probability 1 − δ′, the following holds:

|ℓ_{τk}(w; Sk) − ℓ_{τk}(w; T′k)| ≤ (1 + ρk·Xk/τk)·√(8 log(2/δ′)/mk),

where Xk := max_{x∈T′k} ∥x∥∞.

The following lemma shares the same merit as above, but the expectation is taken over Dk.

Lemma 16 (Lemma 13 of Zhang (2018)). Consider any phase k of Algorithm 2. Let n′k := |T′k|. With probability 1 − δk,

|ℓ_{τk}(w; T′k) − ℓ_{τk}(w; Dk)| ≤ K·log(n′k·d/(ϵδk))·√(t log(2d/δk)/n′k) for some constant K > 0.

Lastly, it was shown that as long as the adversarial noise rate is O(ϵ), the expected hinge loss over the noisy joint distribution Dk is a good proxy for the expected hinge loss over the correctly labeled instances in Bk. This was originally discovered by Awasthi et al. (2017).

Lemma 17 (Lemma 3.8 of Awasthi et al. (2017)). Consider any phase k of Algorithm 2. The following holds for an arbitrarily small constant κ > 0, provided that the adversarial noise rate η ≤ Kϵ for a sufficiently small constant K > 0:

|ℓ_{τk}(w; Dk) − L_{τk}(w; DX|Bk)| ≤ κ.

Combining Lemma 15, Lemma 16, and Lemma 17, we can show the following useful result.

Proposition 18. Consider any phase k of Algorithm 2.
With probability 1 − δk, the following holds for an arbitrarily small constant κ > 0:

sup_{w∈Qk} |ℓ_{τk}(w; Sk) − L_{τk}(w; DX|Bk)| ≤ 3κ,

provided that mk ≥ K1·t log³(nk·d/δk) and n′k ≥ K1·t log⁴(d/(ϵδk)) for a sufficiently large constant K1 > 0, and the adversarial noise rate η ≤ K2·ϵ for a sufficiently small constant K2 > 0.

Proof. Due to our hyper-parameter setting, we have ρk = Θ(√t·τk) and bk = Θ(τk). In addition, using the sub-exponential tail bound of isotropic log-concave distributions, we can show that max_{x∈T′k} ∥x∥∞ ≤ max_{x∈Tk} ∥x∥∞ ≤ O(log(nk·d/δ′)) holds with probability 1 − δ′ (see Lemma 26). Thus, by Lemma 15, for any constant κ > 0, when mk ≥ K1·t log³(nk·d/δk) for a large enough constant K1 > 0, with probability 1 − δk,

|ℓ_{τk}(w; Sk) − ℓ_{τk}(w; T′k)| ≤ κ.

In addition, when n′k ≥ K1·t log⁴(d/(ϵδk)), Lemma 16 implies that

|ℓ_{τk}(w; T′k) − ℓ_{τk}(w; Dk)| ≤ κ.

These two inequalities, combined with Lemma 17 and the triangle inequality, give the desired result.

Therefore, we obtain the following key result on the error rate of vk on DX|Bk.

Proposition 19. Consider any phase k of Algorithm 2. With probability 1 − δk, the following holds for an arbitrarily small constant κ > 0:

Pr_{x∼DX|Bk}(sign(vk·x) ≠ sign(w*·x)) ≤ 8κ,

provided that mk ≥ K1·t log³(nk·d/δk) and n′k ≥ K1·t log⁴(d/(ϵδk)) for a sufficiently large constant K1 > 0, and the adversarial noise rate η ≤ K2·ϵ for a sufficiently small constant K2 > 0.

Proof. Denote by v*k the global optimum of the hinge loss minimization problem of Algorithm 2. We have

Pr_{x∼DX|Bk}(sign(vk·x) ≠ sign(w*·x)) ≤ L_{τk}(vk; DX|Bk)
(a) ≤ ℓ_{τk}(vk; Sk) + 3κ
(b) ≤ ℓ_{τk}(v*k; Sk) + 4κ
(c) ≤ ℓ_{τk}(w*; Sk) + 4κ
(d) ≤ L_{τk}(w*; DX|Bk) + 7κ
≤ 8κ.

In the above expression, the first step follows from the fact that the hinge loss upper bounds the 0/1 loss; Steps (a) and (d) follow from Proposition 18; Step (b) follows from the definition of vk; Step (c) follows from the fact that v*k is the global minimizer of ℓ_{τk}(w; Sk); and the last step follows from Lemma 3.7 of Awasthi et al.
(2017), which states that L_{τk}(w*; DX|Bk) can be made an arbitrarily small constant κ provided that τk = Θ(κ·bk).

Finally, we establish the classification error guarantee of vk on DX. This is fairly standard, due to Balcan et al. (2007). The only minor difference in our analysis is that we need to incorporate the fairness constraint. In view of Lemma 23, it is more convenient to bound the angle between vk and w*.

Proposition 20. Consider any phase k of Algorithm 2. Assume that w* ∈ Wk. Then with probability 1 − δk,

θ(vk, w*) ≤ 2^{−k−8}·π,

provided that mk ≥ K1·t log³(nk·d/δk) and n′k ≥ K1·t log⁴(d/(ϵδk)) for a sufficiently large constant K1 > 0, and the adversarial noise rate η ≤ K2·ϵ for a sufficiently small constant K2 > 0.

Proof. For k = 1, note that w* ∈ W1 and we draw samples from DX. Thus, Proposition 19 implies

Pr_{x∼DX}(sign(vk·x) ≠ sign(w*·x)) ≤ 8κ. (22)

Lemma 23 then gives

θ(vk, w*) ≤ 8κ·c2 ≤ 2^{−9}·π, (23)

due to the setting of κ. For any k ≥ 2, we have

Pr_{x∼DX}(sign(vk·x) ≠ sign(w*·x), x ∈ Bk) = Pr_{x∼DX|Bk}(sign(vk·x) ≠ sign(w*·x)) · Pr_{x∼DX}(x ∈ Bk) ≤ 8κ·c6·bk = 8κ·c6·c̄·rk, (24)

where the inequality follows from Proposition 19 and Lemma 25. On the other hand,

θ(vk, w*) ≤ π·∥vk − w*∥2 ≤ π(∥vk − w_{k−1}∥2 + ∥w* − w_{k−1}∥2) ≤ 2π·rk,

where the first inequality is a folklore and the last inequality holds since both vk and w* are in Wk. Therefore, since we set bk = c̄·rk with c̄ ≥ 8π/c4, we have bk ≥ (4/c4)·θ(vk, w*). In view of Lemma 24, we have

Pr_{x∼DX}(sign(vk·x) ≠ sign(w*·x), |vk·x| ≥ bk) ≤ c3·θ(vk, w*)·exp(−c4·bk/(2θ(vk, w*))) ≤ 2c3π·rk·exp(−c4·bk/(4π·rk)) = 2c3π·rk·exp(−c4·c̄/(4π)).

This combined with (24) gives that for any k ≥ 2,

Pr_{x∼DX}(sign(vk·x) ≠ sign(w*·x)) ≤ [8κ·c6·c̄ + 2c3π·exp(−c4·c̄/(4π))]·rk. (25)

Combining (25), Lemma 23, and the setting rk = 2^{−k−3}, we obtain

θ(vk, w*) ≤ [κ·c6·c̄ + (c3π/4)·exp(−c4·c̄/(4π))]·c2·2^{−k}.

Recall that c̄ is a constant such that the coefficient of 2^{−k} is at most π/2^8 (see Appendix A). This completes the proof.

Proposition 21. Consider any phase k of Algorithm 2. If θ(vk, w*) ≤ 2^{−k−8}·π, then w* ∈ W_{k+1} with certainty.

Proof.
The proof follows from standard algebraic calculations. We will use the following fact: let S be some set, and let Π_S(w) be the ℓ2-projection onto S. Suppose w* ∈ S. Then for any w,

∥Π_S(w) − w*∥2 ≤ ∥Π_S(w) − w∥2 + ∥w − w*∥2 ≤ 2∥w − w*∥2. (26)

We first show that ∥wk − w*∥2 ≤ r_{k+1}. Let v̂k = vk/∥vk∥2. We have

∥wk − w*∥2 = ∥Pt(vk)/∥Pt(vk)∥2 − w*∥2 = ∥Pt(v̂k)/∥Pt(v̂k)∥2 − w*∥2 ≤ 2∥Pt(v̂k) − w*∥2 ≤ 4∥v̂k − w*∥2 ≤ 2^{−k−4},

where the first and second inequalities use (26), and in the third inequality we use

∥v̂k − w*∥2 = 2 sin(θ(vk, w*)/2) ≤ θ(vk, w*) ≤ 2^{−k−8}·π ≤ 2^{−k−6}.

By the sparsity of wk and w*, and our choice ρ_{k+1} = √(2t)·r_{k+1}, we always have ∥wk − w*∥1 ≤ √(2t)·∥wk − w*∥2 ≤ √(2t)·r_{k+1} = ρ_{k+1}. Since w* has unit ℓ2-norm and is t-sparse, we also have ∥w*∥2 ≤ 1 and ∥w*∥1 ≤ √t. Therefore, w* ∈ Q_{k+1}. Since we assumed that w* has zero fairness error, we have w* ∈ M_{k+1}. Thus, w* ∈ W_{k+1}.

D. PACF Guarantee, Sample Complexity, and Label Complexity

Theorem 22 (Restatement of Theorem 9). With probability 1 − δ, Algorithm 2 runs in polynomial time and returns a halfspace ŵ such that errD(ŵ) ≤ ϵ. In addition, ŵ is α-approximately metric-fair. The sample complexity is Õ(1/ϵ), and the label complexity is O(t · log³(d/(αδ)) · log d · log(1/ϵ)).

Proof. Note that w* ∈ W1. For any phase k, it follows from Proposition 20 and Theorem 12 that the following holds simultaneously with probability 1 − δk: θ(vk, w*) ≤ 2^{−k−8}·π, and vk is α-approximately metric-fair. By iteratively applying Proposition 20 and Proposition 21, we have that with probability 1 − Σ_{k=1}^{kmax} δk ≥ 1 − δ, v_{kmax} is such that θ(v_{kmax}, w*) ≤ 2^{−kmax−8}·π, and it is α-approximately metric-fair. Using Lemma 23,

Pr_{x∼DX}(sign(v_{kmax}·x) ≠ sign(w*·x)) ≤ (1/c1)·θ(v_{kmax}, w*) ≤ π/(c1·2^{kmax+8}) ≤ ϵ/2,

due to the choice of kmax. Now, using the triangle inequality and our assumption that errD(w*) ≤ c0·ϵ for a sufficiently small constant c0 > 0, we have

Pr_{(x,y)∼D}(sign(v_{kmax}·x) ≠ y) ≤ Pr_{x∼DX}(sign(v_{kmax}·x) ≠ sign(w*·x)) + errD(w*) ≤ ϵ.

The classification error guarantee and the fairness error guarantee follow in view of ŵ = v_{kmax}. Next, we show the sample complexity.
For any phase k, to ensure fairness, we required |Tk| ≥ K·(1/α²)·t log⁴(d/δk) in Theorem 12. On the other hand, Proposition 20 holds when |T′k| ≥ K·t log⁴(d/(ϵδk)), where T′k = Tk ∩ Bk. Since the density of the band Bk is Θ(bk) (see Lemma 25), by the Chernoff bound it suffices to set |Tk| = Ω((1/bk)·(|T′k| + log(1/δk))) = Ω((1/bk)·t log⁴(d/(ϵδk))). Hence, in order to fulfill both Theorem 12 and Proposition 20, we need |Tk| = Ω((1/α² + 1/bk)·t log⁴(d/(ϵδk))). Consequently, since bk = Θ(2^{−k}) and kmax = O(log(1/ϵ)), the total number of instances needed by Algorithm 2 is

Σ_{k=1}^{kmax} |Tk| = Õ(1/ϵ),

where Õ(·) hides factors polynomial in t and 1/α, and polylogarithmic in d, 1/ϵ, and 1/δ.

Lastly, we analyze the label complexity. Note that labels are needed only when we solve the hinge loss minimization problem. Hence, the number of labels in each phase k is mk, which is required to satisfy mk = Ω(t log³(nk·d/δk)) = Ω(t·log³(d/(αδk))·log d). Consequently, the total number of labels needed by Algorithm 2 is

Σ_{k=1}^{kmax} mk = O(t·log³(d/(αδ))·log d·log(1/ϵ)).

The proof is complete.

E. Auxiliary Lemmas

Throughout this section, we always assume that DX is isotropic log-concave.

Lemma 23 (Vempala (2010)). There exist constants c1, c2 > 0 such that the following holds:

c1·Pr_{x∼DX}(sign(u·x) ≠ sign(v·x)) ≤ θ(u, v) ≤ c2·Pr_{x∼DX}(sign(u·x) ≠ sign(v·x)).

Lemma 24 (Theorem 21 of Balcan & Long (2013)). There are absolute constants c3, c4 > 0 such that the following holds for all isotropic log-concave distributions DX. Let u and v be two unit vectors in R^d and assume that θ(u, v) = θ < π/2. Then for any b ≥ (4/c4)·θ, we have

Pr_{x∼DX}(sign(u·x) ≠ sign(v·x) and |v·x| ≥ b) ≤ c3·θ·exp(−c4·b/(2θ)).

Lemma 25 (Lovász & Vempala (2007a)). There exist constants c5, c6 > 0 such that the following holds. For any unit vector v and positive real number b,

c5·b ≤ Pr_{x∼DX}(|v·x| ≤ b) ≤ c6·b.

Lemma 26 (Lemma 20 of Awasthi et al. (2016)). Let T be a set of instances drawn from DX. With probability 1 − δ, max_{x∈T} ∥x∥∞ = O(log(|T|·d/δ)).

Lemma 27 (Bartlett & Mendelson (2001)). Consider a loss function L : Y × A → [0, 1] and a dominating cost function φ : Y × A → [0, 1].
Let F be a class of functions mapping from X to A, and let (X_i, Y_i)_{i=1}^n be independently selected according to the probability measure P. Then, for any integer n and any 0 < δ < 1, with probability at least 1 − δ over a sample set S of length n, every f in F satisfies

E[L(Y, f(X))] ≤ (1/n)·Σ_{(X,Y)∈S} φ(Y, f(X)) + Rn(φ̃ ∘ F) + √(8 log(2/δ)/n),

where φ̃ ∘ F = {(x, y) ↦ φ(y, f(x)) − φ(y, 0) : f ∈ F} and φ(y, a) is an upper bound of L(y, a).
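As a concrete illustration of the ℓ1-ball Rademacher bound invoked in the proof of Lemma 14 (Theorem 1 of Kakade et al. (2008)), the sketch below Monte-Carlo evaluates the empirical Rademacher complexity of L = {x ↦ w·x : ∥w∥1 ≤ W1}, which admits the closed form (W1/n)·E_σ∥Σ_i σ_i x_i∥∞ per Rademacher draw, and compares it with the bound ∥x∥∞·W1·√(2 log(2d)/n). The data and parameters are toy placeholders, not the paper's setting.

```python
import math
import random

random.seed(3)
n, d, W1 = 200, 50, 1.0
X = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
X_inf = max(abs(x_ij) for row in X for x_ij in row)   # bound on ||x||_inf

def rademacher_estimate(X, W1, draws=300):
    """R_n(L) = (W1/n) E_sigma || sum_i sigma_i x_i ||_inf, estimated by sampling.
    The supremum of a linear functional over the l1 ball is W1 times the
    infinity norm of its coefficient vector, making each draw exactly computable."""
    n, d = len(X), len(X[0])
    total = 0.0
    for _ in range(draws):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        coord_sums = [sum(s * row[j] for s, row in zip(sigma, X)) for j in range(d)]
        total += (W1 / n) * max(abs(c) for c in coord_sums)
    return total / draws

estimate = rademacher_estimate(X, W1)
bound = X_inf * W1 * math.sqrt(2.0 * math.log(2 * d) / n)
# the Monte-Carlo estimate sits below the Kakade et al.-style bound
```

With W1 = √t, this is exactly the quantity Rn(L) bounded in the proof of Lemma 14, and the √(log d/n) scaling is what ultimately yields the logarithmic dimension dependence of the label complexity.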