# Loss Balancing for Fair Supervised Learning

Mohammad Mahdi Khalili (Yahoo Research, NYC, NY; also a visiting professor at The Ohio State University, Columbus, OH), Xueru Zhang (The Ohio State University, Columbus, OH), Mahed Abroshan (Optum Labs, London, UK; work done while at the Alan Turing Institute, London, UK). Correspondence to: Mohammad Mahdi Khalili.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

## Abstract

Supervised learning models have been used in various domains such as lending, college admission, face recognition, and natural language processing. However, they may inherit pre-existing biases from training data and exhibit discrimination against protected social groups. Various fairness notions have been proposed to address these unfairness issues. In this work, we focus on Equalized Loss (EL), a fairness notion that requires the expected loss to be (approximately) equalized across different groups. Imposing EL on the learning process leads to a non-convex optimization problem even if the loss function is convex, and existing fair learning algorithms cannot properly be adopted to find the fair predictor under the EL constraint. This paper introduces an algorithm that can leverage off-the-shelf convex programming tools (e.g., CVXPY (Diamond and Boyd, 2016; Agrawal et al., 2018)) to efficiently find the global optimum of this non-convex optimization. In particular, we propose the ELminimizer algorithm, which finds the optimal fair predictor under EL by reducing the non-convex optimization to a sequence of convex optimization problems. We theoretically prove that our algorithm finds the global optimal solution under certain conditions. We then support our theoretical results through several empirical studies.

## 1. Introduction

As machine learning (ML) algorithms are increasingly being used in applications such as education, lending, recruitment, healthcare, and criminal justice, there is a growing concern that the algorithms may exhibit discrimination against protected population groups. For example, speech recognition products such as Google Home and Amazon Alexa were shown to have accent bias (Harwell, 2018). The COMPAS recidivism prediction tool, used by courts in the US in parole decisions, has been shown to have a substantially higher false positive rate for African Americans compared to the general population (Dressel and Farid, 2018). Amazon had been using automated software since 2014 to assess applicants' resumes, which was found to be biased against women (Dastin, 2018). As a result, several works have focused on interpreting machine learning models to understand how features and sensitive attributes contribute to the output of the model (Ribeiro et al., 2016; Lundberg and Lee, 2017; Abroshan et al., 2023).
Various fairness notions have been proposed in the literature to measure and remedy biases in ML systems; they can be roughly classified into two categories: 1) individual fairness focuses on equity at the individual level and requires similar individuals to be treated similarly (Dwork et al., 2012; Biega et al., 2018; Jung et al., 2019; Gupta and Kamble, 2019); 2) group fairness requires certain statistical measures to be (approximately) equalized across different groups distinguished by some sensitive attribute (Hardt et al., 2016; Conitzer et al., 2019; Khalili et al., 2020; Zhang et al., 2020; Khalili et al., 2021; Diana et al., 2021; Williamson and Menon, 2019; Zhang et al., 2022).

Several approaches have been developed to satisfy a given definition of fairness; they fall under three categories: 1) pre-processing, which modifies the original dataset, e.g., by removing certain features or reweighing (e.g., (Kamiran and Calders, 2012; Celis et al., 2020; Abroshan et al.)); 2) in-processing, which modifies the algorithm, e.g., by imposing fairness constraints or changing the objective function during training (e.g., (Zhang et al., 2018; Agarwal et al., 2018; 2019; Reimers et al., 2021; Calmon et al., 2017)); 3) post-processing, which adjusts the output of the algorithm based on sensitive attributes (e.g., (Hardt et al., 2016)).

In this paper, we focus on group fairness, and we aim to mitigate unfairness in supervised learning using an in-processing approach. This problem can be cast as a constrained optimization problem: minimizing a loss function subject to a group fairness constraint. We are particularly interested in the Equalized Loss (EL) fairness notion proposed by Zhang et al. (2019), which requires the expected loss (e.g., mean squared error (MSE) or binary cross entropy (BCE) loss) to be equalized across different groups (Zhang et al. (2019) propose the EL notion without providing an efficient algorithm for satisfying it).

The problem of finding fair predictors by solving constrained optimizations has been studied extensively. Komiyama et al. (2018) propose the coefficient-of-determination constraint for learning a fair regressor and develop an algorithm for minimizing the MSE under their proposed fairness notion. Agarwal et al. (2019) propose an approach to finding a fair regression model under bounded group loss and statistical parity fairness constraints. Agarwal et al. (2018) study classification problems and aim to find fair classifiers under various fairness notions, including statistical parity and equal opportunity. In particular, they consider the zero-one loss as the objective function and train a randomized fair classifier over a finite hypothesis space; they show that this problem can be reduced to finding the saddle point of a linear Lagrangian function. Zhang et al. (2018) propose an adversarial debiasing technique to find fair classifiers under equalized odds, equal opportunity, and statistical parity. Unlike these previous works, we focus on the Equalized Loss fairness notion, which has not been well studied. Finding an EL fair predictor requires solving a non-convex optimization, and no existing algorithm in the fair ML literature with a theoretical performance guarantee can be properly applied to EL fairness (see Section 2 for a detailed discussion). Our main contribution can be summarized as follows: we develop an algorithm with a theoretical performance guarantee for EL fairness. In particular:
- We propose the ELminimizer algorithm, which solves the non-convex constrained optimization problem of finding the optimal fair predictor under the EL constraint. We show that this non-convex problem can be reduced to a sequence of convex constrained optimizations. The proposed algorithm finds the global optimal solution, is applicable to both regression and classification problems, and can be easily implemented using off-the-shelf convex programming tools.
- In addition to ELminimizer, which finds the global optimal solution, we develop a simple algorithm for finding a sub-optimal predictor satisfying EL fairness. We prove that there is a sub-optimal solution satisfying EL fairness that is a linear combination of the optimal solutions to two unconstrained optimizations, and that it can be found without solving any constrained optimization.
- We conduct a sample complexity analysis and provide a generalization performance guarantee. In particular, we show that the sample complexity analysis of Donini et al. (2018) is applicable to learning problems under EL.
- We also examine (in the appendix) the relation between Equalized Loss (EL) and Bounded Group Loss (BGL), another fairness notion proposed by Agarwal et al. (2019). We show that under certain conditions, these two notions are closely related and do not contradict each other.

## 2. Problem Formulation

Consider a supervised learning problem where the training dataset consists of triples $(\boldsymbol{X}, A, Y)$ from two social groups. Random variable $\boldsymbol{X} \in \mathcal{X}$ is the feature vector (in the form of a column vector; we use bold letters to represent vectors), $A \in \{0, 1\}$ is the sensitive attribute (e.g., race, gender) indicating group membership, and $Y \in \mathcal{Y} \subseteq \mathbb{R}$ is the label/output. We denote realizations of random variables by lowercase letters (e.g., $(\boldsymbol{x}, a, y)$ is a realization of $(\boldsymbol{X}, A, Y)$). Feature vector $\boldsymbol{X}$ may or may not include sensitive attribute $A$. Set $\mathcal{Y}$ can be either $\{0, 1\}$ or $\mathbb{R}$: if $\mathcal{Y} = \{0, 1\}$ (resp. $\mathcal{Y} = \mathbb{R}$), the problem of interest is a binary classification (resp. regression) problem.

Let $\mathcal{F}$ be a set of predictors $f_{\boldsymbol{w}} : \mathcal{X} \to \mathbb{R}$ parameterized by weight vector $\boldsymbol{w}$ with dimension $d_w$ (predictive models such as logistic regression, linear regression, and deep learning models are parameterized by such a weight vector). If the problem is binary classification, then $f_{\boldsymbol{w}}(\boldsymbol{x})$ is an estimate of $\Pr(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$. Our framework can be easily applied to multi-class classification, where $f_{\boldsymbol{w}}(\boldsymbol{X})$ becomes a vector; because this only complicates the notation without providing additional insight into our algorithm, we present the method and algorithms in a binary setting.

Consider loss function $l : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$, where $l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))$ measures the error of $f_{\boldsymbol{w}}$ in predicting $Y$. We denote the expected loss with respect to the joint probability distribution of $(\boldsymbol{X}, Y)$ by $L(\boldsymbol{w}) := \mathbb{E}\{l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))\}$. Similarly, $L_a(\boldsymbol{w}) := \mathbb{E}\{l(Y, f_{\boldsymbol{w}}(\boldsymbol{X})) | A = a\}$ denotes the expected loss of the group with sensitive attribute $A = a$. In this work, we assume that $l(y, f_{\boldsymbol{w}}(\boldsymbol{x}))$ is differentiable and strictly convex in $\boldsymbol{w}$ (e.g., binary cross entropy loss). We do not consider non-differentiable losses (e.g., zero-one loss), as they have already been extensively studied in the literature, e.g., (Hardt et al., 2016; Zafar et al., 2017; Lohaus et al., 2020).

Without fairness considerations, a predictor that simply minimizes the total expected loss, i.e., $\arg\min_{\boldsymbol{w}} L(\boldsymbol{w})$, may be biased against certain groups. To mitigate the risk of unfairness, we consider the Equalized Loss (EL) fairness notion, formally defined below.

**Definition 2.1** (γ-EL (Zhang et al., 2019)). We say $f_{\boldsymbol{w}}$ satisfies the equalized loss (EL) fairness notion if $L_0(\boldsymbol{w}) = L_1(\boldsymbol{w})$. Moreover, we say $f_{\boldsymbol{w}}$ satisfies γ-EL for some $\gamma > 0$ if $-\gamma \leq L_0(\boldsymbol{w}) - L_1(\boldsymbol{w}) \leq \gamma$.

Note that if $l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))$ is a (strictly) convex function in $\boldsymbol{w}$, both $L_0(\boldsymbol{w})$ and $L_1(\boldsymbol{w})$ are also (strictly) convex in $\boldsymbol{w}$.
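As a concrete reading of Definition 2.1, the following minimal sketch (our own illustration, not part of the paper's released code) estimates the two group losses from finite samples and reports the EL gap; the helper name `el_gap` and the default squared loss are our choices.

```python
import numpy as np

def el_gap(y, y_hat, a, loss=lambda y, p: (y - p) ** 2):
    """Empirical EL gap |L0_hat - L1_hat| (illustrative; squared loss by default)."""
    l = loss(y, y_hat)            # per-sample losses
    L0 = l[a == 0].mean()         # empirical loss of group A = 0
    L1 = l[a == 1].mean()         # empirical loss of group A = 1
    return abs(L0 - L1)
```

A predictor then satisfies empirical γ-EL exactly when the returned gap is at most γ.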
However, $L_0(\boldsymbol{w}) - L_1(\boldsymbol{w})$ is not necessarily convex (as an example, consider $h_0(x) = x^2$ and $h_1(x) = 2x^2 - x$; although both $h_0$ and $h_1$ are convex, their difference $h_0(x) - h_1(x)$ is not a convex function). As a result, the following optimization problem for finding a fair predictor under γ-EL is not a convex program:

$$\min_{\boldsymbol{w}} L(\boldsymbol{w}) \quad \text{s.t.} \quad -\gamma \leq L_0(\boldsymbol{w}) - L_1(\boldsymbol{w}) \leq \gamma. \tag{1}$$

We say a group is the disadvantaged group if it experiences a higher loss than the other group. Before discussing how to find the global optimal solution of the above non-convex optimization problem and train a γ-EL fair predictor, we first discuss why γ-EL is an important fairness notion and why the majority of fair learning algorithms in the literature cannot be used to find γ-EL fair predictors.

### 2.1. Existing Fairness Notions & Algorithms

Next, we (mathematically) introduce some of the most commonly used fairness notions and compare them with γ-EL. We also discuss why the majority of proposed fair learning algorithms are not properly applicable to EL fairness.

**Overall Misclassification Rate (OMR):** This notion was considered by Zafar et al. (2017; 2019) for classification problems. Let $\hat{Y} = \mathbb{I}(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)$, where $\mathbb{I}(\cdot) \in \{0, 1\}$ is an indicator function, so that $\hat{Y} = 1$ if $f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5$. OMR requires $\Pr(\hat{Y} \neq Y | A = 0) = \Pr(\hat{Y} \neq Y | A = 1)$, which is not a convex constraint. As a result, Zafar et al. (2017; 2019) propose a method to relax this constraint using decision boundary covariances. We emphasize that OMR is different from EL fairness: OMR only equalizes the accuracy of binary predictions across groups, while EL is capable of considering fairness in estimating the probability $\Pr(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$, e.g., by using the binary cross entropy loss. Note that in many applications, such as conversion prediction, click prediction, and medical diagnosis, it is critical to estimate $\Pr(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$ accurately for different groups, beyond the final predictions $\hat{Y}$. Moreover, unlike EL, OMR is not applicable to regression problems. Therefore, the relaxation method proposed by Zafar et al. (2017; 2019) cannot be applied to the EL fairness constraint.

**Statistical Parity (SP), Equal Opportunity (EO):** For binary classification, statistical parity (Dwork et al., 2012) (resp. equal opportunity (Hardt et al., 2016)) requires the positive classification rates (resp. true positive rates) to be equalized across groups. Formally,

$$\Pr(\hat{Y} = 1 | A = 0) = \Pr(\hat{Y} = 1 | A = 1) \quad \text{(SP)},$$
$$\Pr(\hat{Y} = 1 | A = 0, Y = 1) = \Pr(\hat{Y} = 1 | A = 1, Y = 1) \quad \text{(EO)}.$$

Both notions can be rewritten in expectation form using an indicator function. Specifically, SP is equivalent to $\mathbb{E}\{\mathbb{I}(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5) | A = 0\} = \mathbb{E}\{\mathbb{I}(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5) | A = 1\}$, and EO to $\mathbb{E}\{\mathbb{I}(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5) | A = 0, Y = 1\} = \mathbb{E}\{\mathbb{I}(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5) | A = 1, Y = 1\}$. Since the indicator function is neither differentiable nor convex, Donini et al. (2018) use a linear relaxation of EO as a proxy (this linear relaxation is applicable to EL with some modification; we use it as one of our baselines). However, linear relaxation may negatively affect the fairness of the predictor (Lohaus et al., 2020).
To address this issue, Lohaus et al. (2020) and Wu et al. (2019) develop convex relaxation techniques for the SP and EO fairness criteria by convexifying the indicator function $\mathbb{I}(\cdot)$. However, these convex relaxation techniques are not applicable to the EL fairness notion because $l(\cdot, \cdot)$ in our setting is a convex loss, not a zero-one function. FairBatch (Roh et al., 2020) is another algorithm that has been proposed to find a predictor under SP or EO; it introduces a sampling bias into mini-batch selection. However, the bias in the mini-batch sampling distribution leads to a biased estimate of the gradient, and there is no guarantee that FairBatch finds the global optimum. FairBatch can be used to find a sub-optimal fair predictor under the EL fairness notion, and we use it as a baseline. Shen et al. (2022) propose an algorithm for EO that adds a penalty term to the objective function, similar to the penalty method (Ben-Tal and Zibulevsky, 1997); we use the penalty method as a baseline as well. Hardt et al. (2016) propose a post-processing algorithm that randomly flips binary predictions to satisfy EO or SP. However, this method does not guarantee finding an optimal classifier (Woodworth et al., 2017). Agarwal et al. (2018) introduce a reduction approach for SP or EO. However, this method finds a randomized classifier satisfying SP or EO only in expectation. In other words, to satisfy SP, the reduction approach finds a distribution $Q$ over $\mathcal{F}$ such that

$$\sum_{f \in \mathcal{F}} Q(f)\, \mathbb{E}\{l(Y, f(\boldsymbol{X})) | A = 0\} = \sum_{f \in \mathcal{F}} Q(f)\, \mathbb{E}\{l(Y, f(\boldsymbol{X})) | A = 1\},$$

where $Q(f)$ is the probability of selecting model $f$ under distribution $Q$. Obviously, satisfying a fairness constraint in expectation may lead to unfair predictions because $Q$ can still assign non-zero probability to unfair models. In summary, some of the approaches used for SP/EO may be applicable to the EL fairness notion (e.g., linear relaxation or FairBatch), but they can only find sub-optimal solutions (see Section 6 for more details).

**Statistical Parity for Regression:** SP can be adjusted to be suitable for regression. As proposed by Agarwal et al. (2019), statistical parity for a regressor $f_{\boldsymbol{w}}(\cdot)$ is defined as

$$\Pr(f_{\boldsymbol{w}}(\boldsymbol{X}) \geq z \,|\, A = a) = \Pr(f_{\boldsymbol{w}}(\boldsymbol{X}) \geq z), \quad \forall z, a. \tag{2}$$

To find a predictor satisfying constraint (2), Agarwal et al. (2019) use the reduction approach mentioned above. However, this approach only finds a randomized predictor satisfying SP in expectation and cannot be applied to optimization problem (1). (The appendix includes a detailed discussion of why the reduction approach is not appropriate for EL fairness.)

**Bounded Group Loss (BGL):** γ-BGL was introduced by Agarwal et al. (2019) for regression problems. It requires the loss experienced by each group to be bounded by γ; that is, $L_a(\boldsymbol{w}) \leq \gamma, \; a \in \{0, 1\}$. Agarwal et al. (2019) use the reduction approach to find a randomized regression model under γ-BGL. In addition to the reduction method, if $L(\boldsymbol{w})$, $L_0(\boldsymbol{w})$, and $L_1(\boldsymbol{w})$ are convex in $\boldsymbol{w}$, we can directly use convex solvers (e.g., CVXPY (Diamond and Boyd, 2016; Agrawal et al., 2018)) to find a γ-BGL fair predictor, because the following is a convex problem:

$$\min_{\boldsymbol{w}} L(\boldsymbol{w}) \quad \text{s.t.} \quad L_a(\boldsymbol{w}) \leq \gamma, \; a \in \{0, 1\}. \tag{3}$$

However, for non-convex optimization problems such as (1), convex solvers cannot be used directly.
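To illustrate why (3) is directly solvable while (1) is not, here is a minimal CVXPY sketch of the γ-BGL problem (3) for a linear model with squared loss; the toy data and variable names are placeholders of ours.

```python
import cvxpy as cp
import numpy as np

# Toy data (placeholders): (X_a, y_a) are samples of group a.
rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(50, 3)), rng.normal(size=50)
X1, y1 = rng.normal(size=(30, 3)), rng.normal(size=30)

gamma = 1.5
w = cp.Variable(3)
L0 = cp.sum_squares(y0 - X0 @ w) / len(y0)   # group-0 empirical MSE
L1 = cp.sum_squares(y1 - X1 @ w) / len(y1)   # group-1 empirical MSE
p0 = len(y0) / (len(y0) + len(y1))           # group-0 fraction
L = p0 * L0 + (1 - p0) * L1                  # overall empirical loss

# Problem (3): both group losses bounded by gamma -- convex, so CVXPY applies.
prob = cp.Problem(cp.Minimize(L), [L0 <= gamma, L1 <= gamma])
prob.solve()
print(w.value, L0.value, L1.value)
```

Replacing the two constraints above with $|L_0 - L_1| \leq \gamma$ would make the feasible set non-convex (a difference of convex functions), which is precisely the difficulty that ELminimizer sidesteps.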
We want to emphasize that even though many fairness notions and algorithms already exist in the literature, none of the existing algorithms can solve the non-convex optimization (1) efficiently and find a globally optimal fair predictor under the EL notion.

## 3. Optimal Model under γ-EL

In this section, we consider optimization problem (1) under the EL fairness constraint. This problem is non-convex, and finding the global optimal solution is difficult. We propose an algorithm that finds the solution to the non-convex optimization (1) by solving a sequence of convex optimization problems. Before presenting the algorithm, we introduce two assumptions, which will be used when proving the convergence of the proposed algorithm.

**Assumption 3.1.** Expected losses $L_0(\boldsymbol{w})$, $L_1(\boldsymbol{w})$, and $L(\boldsymbol{w})$ are strictly convex and differentiable in $\boldsymbol{w}$. Moreover, each of them has a unique minimizer.

Let $\boldsymbol{w}_{G_a}$ be the optimal weight vector minimizing the loss associated with group $A = a$. That is,

$$\boldsymbol{w}_{G_a} = \arg\min_{\boldsymbol{w}} L_a(\boldsymbol{w}). \tag{5}$$

Since problem (5) is an unconstrained convex optimization problem, $\boldsymbol{w}_{G_a}$ can be found efficiently by common convex solvers. We make the following assumption about $\boldsymbol{w}_{G_a}$.

**Assumption 3.2.** We assume the following holds: $L_0(\boldsymbol{w}_{G_0}) \leq L_1(\boldsymbol{w}_{G_0})$ and $L_1(\boldsymbol{w}_{G_1}) \leq L_0(\boldsymbol{w}_{G_1})$.

Assumption 3.2 implies that when a group experiences its lowest possible loss, that group is not the disadvantaged group. Under Assumptions 3.1 and 3.2, the optimal 0-EL fair predictor can be found using our proposed algorithm (i.e., function ELminimizer($\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma$) with $\gamma = 0$); the complete procedure is shown in Algorithm 1, in which parameter $\epsilon > 0$ specifies the stopping criterion: as $\epsilon \to 0$, the output approaches the global optimal solution.

**Algorithm 1** Function ELminimizer
Input: $\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma$
Parameters: $\lambda^{(0)}_{\mathrm{start}} = L_0(\boldsymbol{w}_{G_0})$, $\lambda^{(0)}_{\mathrm{end}} = L_0(\boldsymbol{w}_{G_1})$, $i = 0$
Define $\tilde{L}_1(\boldsymbol{w}) = L_1(\boldsymbol{w}) + \gamma$
1: while $\lambda^{(i)}_{\mathrm{end}} - \lambda^{(i)}_{\mathrm{start}} > \epsilon$ do
2:   $\lambda^{(i)}_{\mathrm{mid}} = (\lambda^{(i)}_{\mathrm{end}} + \lambda^{(i)}_{\mathrm{start}})/2$
3:   Solve the following convex optimization problem:
$$\boldsymbol{w}^*_i = \arg\min_{\boldsymbol{w}} \tilde{L}_1(\boldsymbol{w}) \quad \text{s.t.} \quad L_0(\boldsymbol{w}) \leq \lambda^{(i)}_{\mathrm{mid}} \tag{4}$$
4:   $\lambda^{(i)} = \tilde{L}_1(\boldsymbol{w}^*_i)$
5:   if $\lambda^{(i)} \geq \lambda^{(i)}_{\mathrm{mid}}$ then
6:     $\lambda^{(i+1)}_{\mathrm{start}} = \lambda^{(i)}_{\mathrm{mid}}$; $\lambda^{(i+1)}_{\mathrm{end}} = \lambda^{(i)}_{\mathrm{end}}$
7:   else
8:     $\lambda^{(i+1)}_{\mathrm{end}} = \lambda^{(i)}_{\mathrm{mid}}$; $\lambda^{(i+1)}_{\mathrm{start}} = \lambda^{(i)}_{\mathrm{start}}$
9:   end if
10:  $i = i + 1$
11: end while
Output: $\boldsymbol{w}^*_i$

**Algorithm 2** Solving Optimization (1)
Input: $\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma$
1: $\boldsymbol{w}_{\gamma}$ = ELminimizer($\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma$)
2: $\boldsymbol{w}_{-\gamma}$ = ELminimizer($\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, -\gamma$)
3: if $L(\boldsymbol{w}_{\gamma}) \leq L(\boldsymbol{w}_{-\gamma})$ then
4:   $\boldsymbol{w}^* = \boldsymbol{w}_{\gamma}$
5: else
6:   $\boldsymbol{w}^* = \boldsymbol{w}_{-\gamma}$
7: end if
Output: $\boldsymbol{w}^*$

Intuitively, Algorithm 1 solves the non-convex optimization (1) by solving a sequence of convex constrained optimizations. When $\gamma > 0$ (i.e., relaxed fairness), the optimal γ-EL fair predictor can be found with Algorithm 2, which calls function ELminimizer twice. The convergence of Algorithm 1 to the optimal 0-EL fair solution and the convergence of Algorithm 2 to the optimal γ-EL fair solution are stated in the following theorems.

**Theorem 3.3** (Convergence of Algorithm 1 when γ = 0). Let $\{\lambda^{(i)}_{\mathrm{mid}} | i = 0, 1, 2, \ldots\}$ and $\{\boldsymbol{w}^*_i | i = 0, 1, 2, \ldots\}$ be the two sequences generated by Algorithm 1 when $\gamma = \epsilon = 0$, i.e., ELminimizer($\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, 0, 0$). Under Assumptions 3.1 and 3.2, we have

$$\lim_{i \to \infty} \boldsymbol{w}^*_i = \boldsymbol{w}^* \quad \text{and} \quad \lim_{i \to \infty} \lambda^{(i)}_{\mathrm{mid}} = \mathbb{E}\{l(Y, f_{\boldsymbol{w}^*}(\boldsymbol{X}))\},$$

where $\boldsymbol{w}^*$ is the global optimal solution to (1).

Theorem 3.3 implies that when $\gamma = \epsilon = 0$ and $i$ goes to infinity, the solution to convex problem (4) coincides with the solution to (1).
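The following minimal sketch (our own; the released ELminimizer in Algorithms.py differs in interface) shows how the bisection in Algorithm 1 can be driven by an off-the-shelf convex solver, with each iteration solving an instance of problem (4).

```python
import cvxpy as cp

def el_minimizer(L0_expr, L1_expr, w, lam_lo, lam_hi, gamma, eps=1e-2):
    """Bisection sketch of ELminimizer: each step solves convex problem (4).

    L0_expr, L1_expr: CVXPY expressions for the two group losses in w.
    lam_lo, lam_hi: initial bracket, i.e., L0 evaluated at the two group optima.
    """
    w_star = None
    while lam_hi - lam_lo > eps:
        lam_mid = (lam_hi + lam_lo) / 2
        prob = cp.Problem(cp.Minimize(L1_expr + gamma), [L0_expr <= lam_mid])
        prob.solve()
        w_star = w.value
        if prob.value >= lam_mid:   # lambda^(i) >= lambda_mid: move bracket up
            lam_lo = lam_mid
        else:
            lam_hi = lam_mid
    return w_star
```

Algorithm 2 would call this helper once with $+\gamma$ and once with $-\gamma$, keeping whichever output attains the smaller overall loss.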
**Theorem 3.4** (Convergence of Algorithm 2). Assume that $L_0(\boldsymbol{w}_{G_0}) - L_1(\boldsymbol{w}_{G_0}) < -\gamma$ and $L_0(\boldsymbol{w}_{G_1}) - L_1(\boldsymbol{w}_{G_1}) > \gamma$. If $\boldsymbol{w}_O$ does not satisfy the γ-EL constraint, then, as $\epsilon \to 0$, the output of Algorithm 2 converges to the optimal γ-EL fair solution (i.e., the solution to (1)).

**Complexity Analysis.** The while loop in Algorithm 1 is executed $O(\log(1/\epsilon))$ times. Therefore, Algorithm 1 solves a constrained convex optimization problem $O(\log(1/\epsilon))$ times. Constrained convex optimization problems can be solved efficiently via subgradient methods (Nedić and Ozdaglar, 2009), barrier methods (Wright, 2001), stochastic gradient descent with only one projection (Mahdavi et al., 2012), interior point methods (Nemirovski, 2004), etc. For instance, Nemirovski (2004) shows that several classes of convex optimization problems can be solved in polynomial time. Therefore, the time complexity of Algorithm 1 depends on the convex solver: if the time complexity of solving (4) is $O(p(d_w))$, the overall time complexity of Algorithm 1 is $O(p(d_w) \log(1/\epsilon))$.

**Regularization.** So far we have considered a supervised learning model without regularization. Next, we explain how Algorithm 2 can be applied to a regularized problem. Consider the following optimization problem:

$$\min_{\boldsymbol{w}} \Pr(A = 0) L_0(\boldsymbol{w}) + \Pr(A = 1) L_1(\boldsymbol{w}) + R(\boldsymbol{w}), \quad \text{s.t.} \quad |L_0(\boldsymbol{w}) - L_1(\boldsymbol{w})| \leq \gamma, \tag{6}$$

where $R(\boldsymbol{w})$ is a regularizer. In this case, we can rewrite the optimization problem as

$$\min_{\boldsymbol{w}} \Pr(A = 0)\big(L_0(\boldsymbol{w}) + R(\boldsymbol{w})\big) + \Pr(A = 1)\big(L_1(\boldsymbol{w}) + R(\boldsymbol{w})\big), \quad \text{s.t.} \quad \big|\big(L_0(\boldsymbol{w}) + R(\boldsymbol{w})\big) - \big(L_1(\boldsymbol{w}) + R(\boldsymbol{w})\big)\big| \leq \gamma.$$

If we define $\tilde{L}_a(\boldsymbol{w}) := L_a(\boldsymbol{w}) + R(\boldsymbol{w})$ and $\tilde{L}(\boldsymbol{w}) := \Pr(A = 0)\tilde{L}_0(\boldsymbol{w}) + \Pr(A = 1)\tilde{L}_1(\boldsymbol{w})$, then problem (6) takes the form of problem (1) with $(\tilde{L}_0(\boldsymbol{w}), \tilde{L}_1(\boldsymbol{w}), \tilde{L}(\boldsymbol{w}))$ in place of $(L_0, L_1, L)$ and can be solved by Algorithm 2.

## 4. Sub-optimal Model under γ-EL

In Section 3, we showed that the non-convex optimization problem (1) can be reduced to a sequence of convex constrained optimizations (4), and based on this we proposed Algorithm 2, which finds the optimal γ-EL fair predictor. However, that algorithm still requires solving a convex constrained optimization in each iteration. In this section, we propose another algorithm that finds a sub-optimal solution to optimization (1) without solving a constrained optimization in each iteration. The algorithm consists of two phases: (1) finding two weight vectors by solving two unconstrained convex optimization problems; (2) generating a new weight vector satisfying γ-EL from the two weight vectors found in the first phase.

**Phase 1: Unconstrained optimization.** We ignore EL fairness and solve the following unconstrained problem:

$$\boldsymbol{w}_O = \arg\min_{\boldsymbol{w}} L(\boldsymbol{w}). \tag{7}$$

Because $L(\boldsymbol{w})$ is strictly convex in $\boldsymbol{w}$, this problem can be solved efficiently using convex solvers. Predictor $f_{\boldsymbol{w}_O}$ is the optimal predictor without the fairness constraint, and $L(\boldsymbol{w}_O)$ is the smallest attainable overall expected loss. Let $\hat{a} = \arg\max_{a \in \{0,1\}} L_a(\boldsymbol{w}_O)$, i.e., group $\hat{a}$ is the disadvantaged group under predictor $f_{\boldsymbol{w}_O}$. Then, for the disadvantaged group $\hat{a}$, we find $\boldsymbol{w}_{G_{\hat{a}}}$ via optimization (5).

**Phase 2: Binary search to find the fair predictor.** For $\beta \in [0, 1]$, we define the following two functions:

$$g(\beta) := L_{\hat{a}}\big((1 - \beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}\big) - L_{1-\hat{a}}\big((1 - \beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}\big); \quad h(\beta) := L\big((1 - \beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}\big),$$

where $g(\beta)$ can be interpreted as the loss disparity between the two demographic groups under predictor $f_{(1-\beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}}$, and $h(\beta)$ is the corresponding overall expected loss.
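Before stating their properties, here is how $g$ and $h$ would be evaluated in code; this is a minimal sketch of ours in which `group_loss` and `overall_loss` are placeholder callables returning (empirical) expected losses.

```python
def g_and_h(beta, w_O, w_G, group_loss, overall_loss, a_hat):
    """Evaluate the disparity g(beta) and overall loss h(beta) on the segment
    between the unconstrained optimum w_O and the group optimum w_G."""
    w = (1 - beta) * w_O + beta * w_G                    # interpolated weights
    g = group_loss(w, a_hat) - group_loss(w, 1 - a_hat)  # loss disparity
    h = overall_loss(w)                                  # overall loss
    return g, h
```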
Some properties of functions $g(\cdot)$ and $h(\cdot)$ are summarized in the following theorem.

**Theorem 4.1.** Under Assumptions 3.1 and 3.2:
1. There exists $\beta_0 \in [0, 1]$ such that $g(\beta_0) = 0$;
2. $h(\beta)$ is strictly increasing in $\beta \in [0, 1]$;
3. $g(\beta)$ is strictly decreasing in $\beta \in [0, 1]$.

Theorem 4.1 implies that, in a $d_w$-dimensional space, if we start from $\boldsymbol{w}_O$ and move toward $\boldsymbol{w}_{G_{\hat{a}}}$ along a straight line, the overall loss increases and the disparity between the two groups decreases until we reach $(1 - \beta_0)\boldsymbol{w}_O + \beta_0 \boldsymbol{w}_{G_{\hat{a}}}$, at which point 0-EL fairness is satisfied. Note that $\beta_0$ is the unique root of $g$. Since $g(\beta)$ is strictly decreasing, $\beta_0$ can be found by binary search. For approximate γ-EL fairness, there are multiple values of $\beta$ such that $(1 - \beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}$ satisfies γ-EL; since $h(\beta)$ is strictly increasing in $\beta$, among all such $\beta$ we choose the smallest one. The method for finding a sub-optimal solution to optimization (1) is described in Algorithm 3. The while loop in Algorithm 3 is repeated $O(\log(1/\epsilon))$ times, and since the time complexity of the operations in each iteration (i.e., evaluating $g_\gamma(\beta^{(i)}_{\mathrm{mid}})$) is $O(d_w)$, the total time complexity of Algorithm 3 is $O(d_w \log(1/\epsilon))$. We can formally prove that the output of Algorithm 3 satisfies the γ-EL constraint.

**Theorem 4.2.** Assume that Assumptions 3.1 and 3.2 hold, and let $g_\gamma(\beta) = g(\beta) - \gamma$. If $g_\gamma(0) \leq 0$, then $\boldsymbol{w}_O$ satisfies γ-EL fairness; if $g_\gamma(0) > 0$, then $\lim_{i \to \infty} \beta^{(i)}_{\mathrm{mid}} = \beta^{(\infty)}_{\mathrm{mid}}$ exists, and $(1 - \beta^{(\infty)}_{\mathrm{mid}})\boldsymbol{w}_O + \beta^{(\infty)}_{\mathrm{mid}} \boldsymbol{w}_{G_{\hat{a}}}$ satisfies the γ-EL fairness constraint.

Note that since $h(\beta)$ is increasing in $\beta$, we only need to find the smallest $\beta$ such that $(1 - \beta)\boldsymbol{w}_O + \beta \boldsymbol{w}_{G_{\hat{a}}}$ satisfies γ-EL, which is $\beta^{(\infty)}_{\mathrm{mid}}$ in Theorem 4.2.

**Algorithm 3** Sub-optimal solution to optimization (1)
Input: $\boldsymbol{w}_{G_{\hat{a}}}, \boldsymbol{w}_O, \epsilon, \gamma$
Initialization: $g_\gamma(\beta) = g(\beta) - \gamma$, $i = 0$, $\beta^{(0)}_{\mathrm{start}} = 0$, $\beta^{(0)}_{\mathrm{end}} = 1$
1: if $g_\gamma(0) \leq 0$ then
2:   $\boldsymbol{w} = \boldsymbol{w}_O$, and go to line 13;
3: end if
4: while $\beta^{(i)}_{\mathrm{end}} - \beta^{(i)}_{\mathrm{start}} > \epsilon$ do
5:   $\beta^{(i)}_{\mathrm{mid}} = (\beta^{(i)}_{\mathrm{start}} + \beta^{(i)}_{\mathrm{end}})/2$;
6:   if $g_\gamma(\beta^{(i)}_{\mathrm{mid}}) \geq 0$ then
7:     $\beta^{(i+1)}_{\mathrm{start}} = \beta^{(i)}_{\mathrm{mid}}$, $\beta^{(i+1)}_{\mathrm{end}} = \beta^{(i)}_{\mathrm{end}}$;
8:   else
9:     $\beta^{(i+1)}_{\mathrm{start}} = \beta^{(i)}_{\mathrm{start}}$, $\beta^{(i+1)}_{\mathrm{end}} = \beta^{(i)}_{\mathrm{mid}}$;
10:  end if
11: end while
12: $\boldsymbol{w} = (1 - \beta^{(i)}_{\mathrm{mid}})\boldsymbol{w}_O + \beta^{(i)}_{\mathrm{mid}} \boldsymbol{w}_{G_{\hat{a}}}$;
13: Output: $\boldsymbol{w}$

Since Algorithm 3 finds a sub-optimal solution, it is important to investigate the performance of this sub-optimal fair predictor, especially in the worst case. The following theorem gives an upper bound on the expected loss of $f_{\boldsymbol{w}}$, where $\boldsymbol{w}$ is the output of Algorithm 3.

**Theorem 4.3.** Under Assumptions 3.1 and 3.2, we have $L(\boldsymbol{w}) \leq \max_{a \in \{0,1\}} L_a(\boldsymbol{w}_O)$. That is, the expected loss of $f_{\boldsymbol{w}}$ is no worse than the loss of the disadvantaged group under predictor $f_{\boldsymbol{w}_O}$.

**Learning with Finite Samples.** So far we have proposed algorithms for solving optimization (1). In practice, the joint probability distribution of $(\boldsymbol{X}, A, Y)$ is unknown, and the expected loss must be estimated by the empirical loss. Specifically, given $n$ i.i.d. samples $\{(\boldsymbol{X}_i, A_i, Y_i)\}_{i=1}^n$ and a predictor $f_{\boldsymbol{w}}$, the empirical losses of the entire population and of each group are defined as

$$\hat{L}(\boldsymbol{w}) = \frac{1}{n} \sum_{i=1}^n l(Y_i, f_{\boldsymbol{w}}(\boldsymbol{X}_i)), \quad \hat{L}_a(\boldsymbol{w}) = \frac{1}{n_a} \sum_{i: A_i = a} l(Y_i, f_{\boldsymbol{w}}(\boldsymbol{X}_i)),$$

where $n_a = |\{i \,|\, A_i = a\}|$. Because the γ-EL fairness constraint is defined in terms of expected loss, the optimization problem for finding an optimal γ-EL fair predictor from empirical losses is

$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}), \quad \text{s.t.} \quad |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| \leq \hat{\gamma}. \tag{8}$$
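Combining the interpolation of Section 4 with these empirical losses, a minimal sketch of Algorithm 3's binary search (our own restatement; the released Algorithm3 function differs in interface, and `emp_group_loss` is a placeholder) looks as follows.

```python
def algorithm3(w_O, w_G, emp_group_loss, a_hat, gamma, eps=1e-2):
    """Binary search over beta so the interpolated weights satisfy gamma-EL."""
    def g_gamma(beta):
        w = (1 - beta) * w_O + beta * w_G
        return emp_group_loss(w, a_hat) - emp_group_loss(w, 1 - a_hat) - gamma

    if g_gamma(0.0) <= 0:                # unconstrained optimum already gamma-EL fair
        return w_O
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if g_gamma(mid) >= 0:            # disparity still too large: move toward w_G
            lo = mid
        else:
            hi = mid
    return (1 - mid) * w_O + mid * w_G
```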
In this section, we investigate how to choose $\hat{\gamma}$ so that, with high probability, the predictor found by solving problem (8) satisfies γ-EL fairness, and $\hat{\boldsymbol{w}}$ is a good estimate of the solution $\boldsymbol{w}^*$ to optimization (1). We aim to show that we can set $\hat{\gamma} = \gamma$ if the number of samples is sufficiently large. To understand the relation between (8) and (1), we follow the general sample complexity analysis of Donini et al. (2018) and show that it is applicable to EL. We proceed under the assumption used in (Donini et al., 2018).

**Assumption 4.4.** With probability $1 - \delta$, the following holds: $\sup_{f_{\boldsymbol{w}} \in \mathcal{F}} |L(\boldsymbol{w}) - \hat{L}(\boldsymbol{w})| \leq B(\delta, n, \mathcal{F})$, where $B(\delta, n, \mathcal{F})$ is a bound that goes to zero as $n \to +\infty$.

Note that, by (Shalev-Shwartz and Ben-David, 2014), if the class $\mathcal{F}$ is learnable with respect to loss function $l(\cdot, \cdot)$, then such a bound $B(\delta, n, \mathcal{F})$ that goes to zero as $n$ goes to infinity always exists. As an example, if $\mathcal{F}$ is a compact subset of linear predictors in a Reproducing Kernel Hilbert Space (RKHS) and the loss $l(y, f(\boldsymbol{x}))$ is Lipschitz in its second argument, then Assumption 4.4 is satisfied (Bartlett and Mendelson, 2002); the vast majority of linear predictors, such as support vector machines and logistic regression, can be defined in an RKHS.

**Theorem 4.5.** Let $\mathcal{F}$ be a set of learnable functions, and let $\hat{\boldsymbol{w}}$ and $\boldsymbol{w}^*$ be the solutions to (8) and (1), respectively, with $\hat{\gamma} = \gamma + \sum_{a \in \{0,1\}} B(\delta, n_a, \mathcal{F})$. Then, with probability at least $1 - 6\delta$, the following hold:

$$L(\hat{\boldsymbol{w}}) - L(\boldsymbol{w}^*) \leq 2B(\delta, n, \mathcal{F}) \quad \text{and} \quad |L_0(\hat{\boldsymbol{w}}) - L_1(\hat{\boldsymbol{w}})| \leq \gamma + 2B(\delta, n_0, \mathcal{F}) + 2B(\delta, n_1, \mathcal{F}).$$

Theorem 4.5 shows that as $n_0, n_1$ go to infinity, $\hat{\gamma} \to \gamma$, and both the empirical and expected losses satisfy γ-EL. In addition, as $n$ goes to infinity, the expected loss at $\hat{\boldsymbol{w}}$ approaches the minimum possible expected loss. Therefore, solving (8) with empirical losses is equivalent to solving (1) when the number of data points from each group is sufficiently large.

## 5. Beyond Linear Models

So far, we have assumed that the loss function is strictly convex. This assumption mainly holds for training linear models (e.g., ridge regression, regularized logistic regression). Training deep models, by contrast, involves minimizing non-convex objective functions. To train a deep model under the equalized loss fairness notion, we can use Algorithm 2 for fine-tuning under EL as long as the objective function is convex with respect to the parameters of the output layer (in classification or regression problems with an $l_2$ regularizer, the objective is strictly convex in the output-layer parameters, regardless of the network structure before the output layer).

To clarify how Algorithm 2 can be used for deep models, consider for simplicity a neural network with one hidden layer for regression. Let $\tilde{W}$ be an $m \times d$ matrix ($d$ is the size of feature vector $\boldsymbol{X}$ and $m$ is the number of neurons in the hidden layer) denoting the parameters of the first layer of the neural network, and let $\boldsymbol{w}$ be a vector corresponding to the output layer. To find a neural network satisfying the equalized loss fairness notion, we first train the network without any fairness constraint using common gradient descent algorithms (e.g., stochastic gradient descent). Let $\tilde{W}^*$ and $\boldsymbol{w}^*$ denote the network parameters after this unconstrained training. We can then use Algorithm 2 to fine-tune the parameters of the output layer under the equalized loss fairness notion. Define $\tilde{\boldsymbol{X}} := [1, \sigma(\tilde{W}^* \boldsymbol{X})]^T$, where $\sigma(\cdot)$ is the hidden-layer activation. The fine-tuning problem for the output layer can then be written as

$$\underline{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} \mathbb{E}\{l(Y, \boldsymbol{w}^T \tilde{\boldsymbol{X}})\}, \quad \text{s.t.} \quad \big|\mathbb{E}\{l(Y, \boldsymbol{w}^T \tilde{\boldsymbol{X}}) | A = 0\} - \mathbb{E}\{l(Y, \boldsymbol{w}^T \tilde{\boldsymbol{X}}) | A = 1\}\big| \leq \gamma. \tag{9}$$

The objective of this problem is strictly convex, and the problem can be solved with Algorithm 2. After solving it, $[\tilde{W}^*, \underline{\boldsymbol{w}}]$ are the final parameters of the neural network satisfying the equalized loss fairness notion. Note that a similar optimization problem can be written for fine-tuning any deep model with a classification or regression task.
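As a sketch of this fine-tuning recipe (our own illustration, assuming a sigmoid hidden layer as in our experiments), one freezes the trained hidden layer, builds the output-layer features, and hands the convex problem (9) to a routine such as the el_minimizer sketch above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def head_features(W_tilde, X):
    """Features seen by the output layer: hidden activations plus a bias term."""
    H = sigmoid(X @ W_tilde.T)                     # frozen hidden layer, shape (n, m)
    return np.hstack([np.ones((len(X), 1)), H])    # X_tilde = [1, sigma(W* X)]

# After unconstrained training yields W_tilde, build head_features(W_tilde, X)
# per group and solve the convex output-layer problem (9) over w only.
```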
## 6. Experiments

We conduct experiments on two real-world datasets to evaluate the performance of the proposed algorithms. In our experiments, we used a system with the following configuration: 24 GB of RAM, 2 cores of P100-16GB GPU, and 2 cores of an Intel Xeon CPU @ 2.3 GHz. More information about the experiments and instructions for reproducing the empirical results are provided in the Appendix. The code is available at https://github.com/KhaliliMahdi/Loss_Balancing_ICML2023.

**Baselines.** As discussed in Section 2, not all fair learning algorithms are applicable to EL fairness. The following three baselines are applicable to EL fairness.

**Penalty Method (PM):** The penalty method (Ben-Tal and Zibulevsky, 1997) finds a fair predictor under the γ-EL constraint by solving

$$\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}) + t \cdot \max\{0, |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| - \gamma\}^2 + R(\boldsymbol{w}), \tag{10}$$

where $t$ is the penalty parameter and $R(\boldsymbol{w}) = 0.002 \|\boldsymbol{w}\|_2^2$ is the regularizer. This problem is not convex in general and cannot be solved with a convex solver; we solve (10) using Adam gradient descent (Kingma and Ba, 2014) with a learning rate of 0.005 and the default Adam parameters in PyTorch. We set the penalty parameter to $t = 0.1$ and double it every 100 iterations.

**Linear Relaxation (LinRe):** Inspired by (Donini et al., 2018), for linear regression we relax the EL constraint to

$$-\gamma \leq \frac{1}{n_0} \sum_{i: A_i = 0} (Y_i - \boldsymbol{w}^T \boldsymbol{X}_i) - \frac{1}{n_1} \sum_{i: A_i = 1} (Y_i - \boldsymbol{w}^T \boldsymbol{X}_i) \leq \gamma.$$

For logistic regression, we relax the constraint to

$$-\gamma \leq \frac{1}{n_0} \sum_{i: A_i = 0} (Y_i - 0.5) \cdot (\boldsymbol{w}^T \boldsymbol{X}_i) - \frac{1}{n_1} \sum_{i: A_i = 1} (Y_i - 0.5) \cdot (\boldsymbol{w}^T \boldsymbol{X}_i) \leq \gamma.$$

Note that the sign of $(Y_i - 0.5) \cdot (\boldsymbol{w}^T \boldsymbol{X}_i)$ determines whether the binary classifier makes a correct prediction.

**FairBatch (Roh et al., 2020):** This method was originally proposed for equal opportunity, statistical parity, and equalized odds. With some modifications (see the Appendix for details), it can be applied to EL fairness. The algorithm measures each group's loss in each epoch and shifts the mini-batch sampling distribution in favor of the group with the higher empirical loss. When implementing FairBatch, we use Adam optimization with default parameters, a learning rate of 0.005, and a batch size of 100.

[Figure 1: Trade-off between overall MSE and unfairness $|\hat{L}_0 - \hat{L}_1|$ on the test set. A lower curve implies a better trade-off.]

[Figure 2: Trade-off between overall BCE and unfairness $|\hat{L}_0 - \hat{L}_1|$ on the test set. A lower curve implies a better trade-off.]
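For concreteness, a minimal PyTorch sketch of the PM baseline for a linear model (our own; the released penalty_method function may differ in details) is given below.

```python
import torch

def penalty_method(X0, y0, X1, y1, gamma, t=0.1, lr=0.005, iters=1000):
    """Adam training of objective (10) for a linear model (illustrative sketch)."""
    w = torch.zeros(X0.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for i in range(iters):
        opt.zero_grad()
        L0 = torch.mean((y0 - X0 @ w) ** 2)          # group-0 empirical MSE
        L1 = torch.mean((y1 - X1 @ w) ** 2)          # group-1 empirical MSE
        n0, n1 = len(y0), len(y1)
        L = (n0 * L0 + n1 * L1) / (n0 + n1)          # overall empirical MSE
        penalty = torch.clamp(torch.abs(L0 - L1) - gamma, min=0) ** 2
        loss = L + t * penalty + 0.002 * torch.sum(w ** 2)
        loss.backward()
        opt.step()
        if (i + 1) % 100 == 0:
            t *= 2                                   # escalate penalty coefficient
    return w.detach()
```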
**Linear Regression and Law School Admission Dataset.** In the first experiment, we use the law school admission dataset, which includes information on 21,790 law students studying at 163 different law schools across the United States (Wightman, 1998). The dataset contains entrance exam scores (LSAT), grade-point average (GPA) prior to law school, and first-year average grade (FYA). Our goal is to train a γ-EL fair regularized linear regression model that estimates a student's FYA given LSAT and GPA. We consider Black and White demographic groups; 18,285 data points belong to White students and 1,282 to Black students. We randomly split the dataset into training and test sets (70% for training and 30% for testing) and conduct five independent runs of the experiment. A fair predictor is found by solving

$$\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}) + 0.002 \|\boldsymbol{w}\|_2^2 \quad \text{s.t.} \quad |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| \leq \gamma, \tag{11}$$

with $\hat{L}$ and $\hat{L}_a$ being the overall and group-specific empirical MSE, respectively ($0.002\|\boldsymbol{w}\|_2^2$ is the regularizer). We use Algorithm 2 and Algorithm 3 with $\epsilon = 0.01$ to find the optimal linear regression model under EL and adopt the CVXPY Python library (Diamond and Boyd, 2016; Agrawal et al., 2018) as the convex optimization solver in the ELminimizer algorithm.

Table 1 reports the means and standard deviations of the empirical loss and of the loss difference between Black and White students; the column headers specify the desired fairness level (γ = 0 and γ = 0.1) used as input to each algorithm.

Table 1: Linear regression under EL fairness (mean squared error loss).

| Method | test loss (γ = 0) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0) | test loss (γ = 0.1) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0.1) |
|---|---|---|---|---|
| PM | 0.9246 ± 0.0083 | 0.1620 ± 0.0802 | 0.9332 ± 0.0101 | 0.1438 ± 0.0914 |
| LinRe | 0.9086 ± 0.0190 | 0.2687 ± 0.0588 | 0.8668 ± 0.0164 | 0.2587 ± 0.0704 |
| FairBatch | 0.8119 ± 0.0316 | 0.2862 ± 0.1933 | 0.8610 ± 0.0884 | 0.2708 ± 0.1526 |
| Algorithm 2 | 0.9186 ± 0.0179 | 0.0699 ± 0.0469 | 0.8556 ± 0.0217 | 0.1346 ± 0.0749 |
| Algorithm 3 | 0.9522 ± 0.0209 | 0.0930 ± 0.0475 | 0.8977 ± 0.0223 | 0.1437 ± 0.0907 |

Based on Table 1, when the desired fairness level is γ = 0, the fairness level of the models trained by LinRe and FairBatch is far from γ = 0. We also found that the performance of FairBatch depends strongly on the random seed: the fairness level of the model trained by FairBatch has a high variance (0.1933 in this example) across the five independent runs, and only in some runs does it achieve the desired fairness level. This is because FairBatch comes with no performance guarantee; as stated in (Roh et al., 2020), FairBatch computes a biased estimate of the gradient in each epoch, and the mini-batch sampling distribution keeps changing from epoch to epoch. We observed that FairBatch performs better with a non-linear model (see Table 3). Both Algorithms 2 and 3 achieve a fairness level close to γ = 0; however, Algorithm 3 finds a sub-optimal solution and incurs a higher MSE than Algorithm 2. For γ = 0.1, in addition to Algorithms 2 and 3, the penalty method also achieves a fairness level close to the desired γ = 0.1 (i.e., $|\hat{L}_1 - \hat{L}_0| = 0.0892$). Algorithm 2 still achieves the lowest MSE compared to Algorithm 3 and the penalty method. The model trained by FairBatch again exhibits high variance in its fairness level.
We want to emphasize that even though Algorithm 3 has a higher MSE than Algorithm 2, it is much faster, as the complexity analysis in Sections 3 and 4 suggests. We also investigate the trade-off between fairness and overall loss under the different algorithms. Figure 1 plots the MSE as a function of the loss difference between Black and White students. Specifically, we run Algorithm 2, Algorithm 3, and the baselines for γ ∈ {0.025, 0.05, 0.1, 0.15, 0.2}; for each γ, we repeat the experiment five times and compute the average MSE and the average MSE difference on the test set. Figure 1 shows that the penalty method, linear relaxation, and FairBatch are not sensitive to the input γ, whereas Algorithms 2 and 3 are: as γ increases, $|\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})|$ increases and the MSE decreases.

Table 2: Logistic regression under EL fairness (binary cross entropy loss).

| Method | test loss (γ = 0) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0) | test loss (γ = 0.1) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0.1) |
|---|---|---|---|---|
| PM | 0.5594 ± 0.0101 | 0.0091 ± 0.0067 | 0.5404 ± 0.0046 | 0.0892 ± 0.0378 |
| LinRe | 0.3468 ± 0.0013 | 0.0815 ± 0.0098 | 0.3441 ± 0.0012 | 0.1080 ± 0.0098 |
| FairBatch | 1.5716 ± 0.8071 | 0.6191 ± 0.5459 | 1.2116 ± 0.8819 | 0.3815 ± 0.3470 |
| Algorithm 2 | 0.3516 ± 0.0015 | 0.0336 ± 0.0075 | 0.3435 ± 0.0012 | 0.1110 ± 0.0140 |
| Algorithm 3 | 0.3521 ± 0.0015 | 0.0278 ± 0.0075 | 0.3377 ± 0.0015 | 0.1068 ± 0.0138 |

**Logistic Regression and Adult Income Dataset.** We consider the adult income dataset, which contains information on 48,842 individuals (Kohavi, 1996). Each data point consists of 14 features, including age, education, and race. We take race (White or Black) as the sensitive attribute, denoting the White group by A = 0 and the Black group by A = 1. We first pre-process the dataset by removing data points with a missing value or with a race other than Black or White, leaving 41,961 data points, of which 4,585 belong to the Black group. For each data point, we convert all categorical features into one-hot vectors, obtaining a 110-dimensional feature vector, and randomly split the dataset into training and test sets (70% of the dataset is used for training). The goal is to predict whether an individual's income exceeds $50K using a γ-EL fair logistic regression model. In this experiment, we solve optimization problem (11), with $\hat{L}$ and $\hat{L}_a$ being the overall and group-specific empirical binary cross entropy (BCE) loss, respectively.

The comparison of Algorithm 2, Algorithm 3, and the baselines is shown in Table 2, where we conduct five independent runs and report the mean and standard deviation of the overall loss and of the loss difference between the two demographic groups; the column headers show the value of γ used as input to the algorithms. The results show that linear relaxation, Algorithm 2, and Algorithm 3 perform very similarly: all three satisfy γ-EL with a small test loss. As in Table 1, FairBatch exhibits high variance in its performance, which depends strongly on the random seed. In Figure 2, we compare the performance-fairness trade-off, focusing on binary cross entropy on the test dataset.
To generate this figure, we run Algorithm 2, Algorithm 3, and the baselines (we omit the FairBatch curve due to its large overall loss and high performance variance) for γ ∈ {0.02, 0.04, 0.06, 0.08, 0.1}, five times each, and compute the average BCE and the average BCE difference. We observe that Algorithms 2 and 3 and the linear relaxation exhibit a similar trade-off between $\hat{L}$ and $|\hat{L}_0 - \hat{L}_1|$.

**Experiment with a non-linear model.** We repeat our first experiment with non-linear models to demonstrate how our algorithms can be used to fine-tune a non-linear model. We work with the law school admission dataset and train a neural network with one hidden layer of 125 neurons, using the sigmoid activation function for the hidden layer. We run the following algorithms:

- **Penalty Method:** We solve optimization problem (10); in this example, $\hat{L}$ and $\hat{L}_a$ are no longer convex. All hyperparameters except the learning rate remain as in the first experiment; the learning rate is set to 0.001.
- **FairBatch:** We train the whole network using FairBatch with mini-batch Adam optimization, a batch size of 100, and a learning rate of 0.001.
- **Linear Relaxation:** To take advantage of CVXPY, we first train the network without any fairness constraint using batch Adam optimization (i.e., the batch size equals the size of the training dataset) with a learning rate of 0.001. We then fine-tune the 126 parameters of the output layer under relaxed EL fairness; in particular, we solve problem (9) after linear relaxation.
- **Algorithm 2 and Algorithm 3:** We use Algorithms 2 and 3 to fine-tune the neural network: after training the network without any constraint using batch Adam optimization, we solve (9) using Algorithm 2 and Algorithm 3.

Table 3 reports the mean and standard deviation of the empirical loss and of the loss difference between Black and White students. Both Algorithm 2 and Algorithm 3 achieve a fairness level ($|\hat{L}_0 - \hat{L}_1|$) close to the desired level γ, and their MSE under the non-linear model is slightly lower than under the linear model. We also investigate how the MSE $\hat{L}$ changes as a function of the fairness level $|\hat{L}_1 - \hat{L}_0|$: Figure 3 illustrates the MSE-fairness trade-off. To generate this plot, we repeat the experiment for γ ∈ {0.025, 0.05, 0.1, 0.15, 0.2}; for each γ, we run the experiment five times and compute the average MSE $\hat{L}$ and the average MSE difference on the test set. Based on Figure 3, FairBatch and LinRe are not very sensitive to the input γ, although FairBatch can sometimes show a better trade-off than Algorithm 2. In this example, PM, Algorithm 2, and Algorithm 3 are sensitive to γ: as γ increases, the MSE $\hat{L}$ decreases and $|\hat{L}_0 - \hat{L}_1|$ increases.

Table 3: Neural network training under EL fairness (mean squared error loss).
| Method | test loss (γ = 0) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0) | test loss (γ = 0.1) | test $\vert\hat{L}_0 - \hat{L}_1\vert$ (γ = 0.1) |
|---|---|---|---|---|
| PM | 0.9490 ± 0.0584 | 0.1464 ± 0.1055 | 0.9048 ± 0.0355 | 0.1591 ± 0.0847 |
| LinRe | 0.8489 ± 0.0195 | 0.6543 ± 0.0322 | 0.8235 ± 0.0165 | 0.5595 ± 0.0482 |
| FairBatch | 0.9012 ± 0.1918 | 0.2771 ± 0.1252 | 0.8638 ± 0.0863 | 0.1491 ± 0.0928 |
| Algorithm 2 | 0.9117 ± 0.0172 | 0.0761 ± 0.0498 | 0.8519 ± 0.0195 | 0.1454 ± 0.0749 |
| Algorithm 3 | 0.9427 ± 0.0190 | 0.0862 ± 0.0555 | 0.8908 ± 0.0209 | 0.1423 ± 0.0867 |

[Figure 3: Trade-off between overall MSE and unfairness $|\hat{L}_0 - \hat{L}_1|$ on the test set. A lower curve implies a better trade-off.]

**Limitations and Negative Societal Impact.** 1) Our theoretical guarantees are valid under the stated assumptions (e.g., the convexity of $L(\boldsymbol{w})$, i.i.d. samples, a binary sensitive attribute), which have been clearly stated throughout this paper. 2) In this paper, we develop an algorithm for finding a fair predictor under EL fairness; we do not claim this notion is better than other fairness notions. Depending on the scenario, this notion may or may not be suitable for mitigating unfairness.

## 7. Conclusion

In this work, we studied supervised learning problems under Equalized Loss (EL) fairness (Zhang et al., 2019), a notion that requires the expected loss to be balanced across different demographic groups. By imposing the EL constraint, the learning problem can be formulated as a non-convex problem. We proposed two algorithms with theoretical performance guarantees to find the global optimal solution and a sub-optimal solution to this non-convex problem.

## Acknowledgment

This work is partially supported by the NSF under grant IIS-2202699.

## References

Mahed Abroshan, Mohammad Mahdi Khalili, and Andrew Elliott. Counterfactual fairness in synthetic data generation. In NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.

Mahed Abroshan, Saumitra Mishra, and Mohammad Mahdi Khalili. Symbolic metamodels for interpreting black-boxes using primitive functions. arXiv preprint arXiv:2302.04791, 2023.

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69. PMLR, 2018.

Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120-129. PMLR, 2019.

Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42-60, 2018.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Aharon Ben-Tal and Michael Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2):347-366, 1997.

Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 405-414, 2018.

Flavio P Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3995-4004, 2017.

L Elisa Celis, Vijay Keswani, and Nisheeth Vishnoi. Data preprocessing to mitigate bias: A maximum entropy based approach.
In International Conference on Machine Learning, pages 1349-1359. PMLR, 2020.

Vincent Conitzer, Rupert Freeman, Nisarg Shah, and Jennifer Wortman Vaughan. Group fairness for the allocation of indivisible goods. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1853-1860, 2019.

Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. http://reut.rs/2MXzkly, 2018.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Emily Diana, Wesley Gill, Michael Kearns, Krishnaram Kenthapadi, and Aaron Roth. Minimax group fairness: Algorithms and experiments. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 66-76, 2021.

Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. Advances in Neural Information Processing Systems, 31, 2018.

Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.

Swati Gupta and Vijay Kamble. Individual fairness in hindsight. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 805-806, 2019.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315-3323, 2016.

Drew Harwell. The accent gap. http://wapo.st/3pUqZ0S, 2018.

Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. Eliciting and enforcing subjective individual fairness. arXiv preprint arXiv:1905.10660, 2019.

Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.

Mohammad Mahdi Khalili, Xueru Zhang, Mahed Abroshan, and Somayeh Sojoudi. Improving fairness and privacy in selection problems. arXiv preprint arXiv:2012.03812, 2020.

Mohammad Mahdi Khalili, Xueru Zhang, and Mahed Abroshan. Fair sequential selection using supervised learning models. Advances in Neural Information Processing Systems, 34:28144-28155, 2021.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202-207, 1996.

Junpei Komiyama, Akiko Takeda, Junya Honda, and Hajime Shimao. Nonconvex optimization for regression with fairness constraints. In International Conference on Machine Learning, pages 2737-2746. PMLR, 2018.

Michael Lohaus, Michael Perrot, and Ulrike Von Luxburg. Too relaxed to be fair. In International Conference on Machine Learning, pages 6360-6369. PMLR, 2020.

Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.

Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi. Stochastic gradient descent with only one projection. Advances in Neural Information Processing Systems, 25:494-502, 2012.

Angelia Nedić and Asuman Ozdaglar. Subgradient methods for saddle-point problems.
Journal of Optimization Theory and Applications, 142(1):205-228, 2009.

Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture Notes, 42(16):3215-3224, 2004.

Christian Reimers, Paul Bodesheim, Jakob Runge, and Joachim Denzler. Towards learning an unbiased classifier from biased data via conditional adversarial debiasing. arXiv preprint arXiv:2103.06179, 2021.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144, 2016.

Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. FairBatch: Batch selection for model fairness. In International Conference on Learning Representations, 2020.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. Optimising equal opportunity fairness in model training. arXiv preprint arXiv:2205.02393, 2022.

Linda F Wightman. LSAC National Longitudinal Bar Passage Study. LSAC Research Report Series, 1998.

Robert Williamson and Aditya Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786-6797. PMLR, 2019.

Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920-1953. PMLR, 2017.

Stephen J Wright. On the convergence of the Newton/log-barrier method. Mathematical Programming, 90(1):71-100, 2001.

Yongkai Wu, Lu Zhang, and Xintao Wu. On convexity and bounds of fairness-aware classification. In The World Wide Web Conference, pages 3356-3362, 2019.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180, 2017.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: A flexible approach for fair classification. The Journal of Machine Learning Research, 20(1):2737-2778, 2019.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335-340, 2018.

Xueru Zhang, Mohammadmahdi Khaliligarekani, Cem Tekin, and Mingyan Liu. Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness. Advances in Neural Information Processing Systems, 32:15269-15278, 2019.

Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. Long-term impacts of fair machine learning. Ergonomics in Design, 28(3):7-11, 2020.

Xueru Zhang, Mohammad Mahdi Khalili, Kun Jin, Parinaz Naghizadeh, and Mingyan Liu. Fairness interventions as (dis)incentives for strategic manipulation. In International Conference on Machine Learning, pages 26239-26264. PMLR, 2022.

## A. Appendix

### A.1. Some notes on the code for reproducibility

In this part, we describe the files provided in our GitHub repository.
- law_data.py: includes a function law_data(seed), which processes the law school admission dataset and randomly splits it into training and test datasets (we keep 70% of the data points for training). In our experiments, we set the seed to 0, 1, 2, 3, and 4 to obtain five different splits for the five repetitions of each experiment.
- Adult_data.py: includes a function Adult_dataset(seed), which processes the adult income dataset and randomly splits it into training and test datasets. As above, seeds 0-4 give the five splits used to repeat our experiments.
- Algorithms.py: includes the following functions:
  - ELminimizer(X0, Y0, X1, Y1, gamma, eta, model): implements the ELminimizer algorithm. (X0, Y0) are the training data points of group A = 0 and (X1, Y1) those of group A = 1; gamma is the fairness level for the EL constraint; eta is the regularizer parameter (0.002 in our experiments); model selects the model to train ("linear" for linear regression, "logistic" for logistic regression). It returns five variables (w, b, l0, l1, l): w and b are the weight vector and bias term of the trained model, l0 and l1 are the average training losses of group 0 and group 1, respectively, and l is the overall training loss.
  - Algorithm2(X0, Y0, X1, Y1, gamma, eta, model): implements Algorithm 2, which calls ELminimizer twice. It returns the same five variables (w, b, l0, l1, l) defined above.
  - Algorithm3(X0, Y0, X1, Y1, gamma, eta, model): implements Algorithm 3, which finds a sub-optimal solution under EL fairness. It also returns the five variables defined above.
  - solve_constrained_opt(X0, Y0, X1, Y1, eta, landa, model): uses the CVXPY package to solve optimization problem (4); landa is set equal to $\lambda^{(i)}_{\mathrm{mid}}$ to solve (4) in iteration i of Algorithm 1.
  - calculate_loss(w, b, X0, Y0, X1, Y1, model): computes the test loss for model parameters w, b (trained by Algorithm 2 or 3); it returns the average loss of group 0 and group 1 and the overall loss on the given dataset.
  - solve_lin_constrained_opt(X0, Y0, X1, Y1, gamma, eta, model): solves optimization problem (8) after linear relaxation.
- Baseline.py: includes the following functions:
  - penalty_method(method, X_0, y_0, X_1, y_1, num_itr, lr, r, gamma, seed, epsilon): method is either "linear" for linear regression or "logistic" for logistic regression. This function uses the penalty method to train the model under EL with Adam optimization; num_itr is the maximum number of iterations, r the regularization parameter (0.002 in our experiments), lr the learning rate, gamma the fairness level, and epsilon is used for the stopping criterion. It returns the trained model (a torch module), the training losses of group 0 and group 1, and the overall training loss.
  - fair_batch(method, X_0, y_0, X_1, y_1, num_itr, lr, r, alpha, gamma, seed, epsilon): simulates the FairBatch algorithm (Roh et al., 2020). The input parameters are similar to those of penalty_method except for alpha.
table1_2.py uses the above functions to reproduce the results in Table 1 and Table 2, and figure1_2.py uses them to reproduce Figure 1 and Figure 2. We provide comments in these files to make the code more readable. We have also provided code for training non-linear models: please use table3.py and figure3.py to generate the results in Table 3 and Figure 3, respectively.

Lastly, use the following commands to generate the results in Table 1:

python3 table1_2.py --experiment=1 --gamma=0.0
python3 table1_2.py --experiment=1 --gamma=0.1

Use the following commands to generate the results in Table 2:

python3 table1_2.py --experiment=2 --gamma=0.0
python3 table1_2.py --experiment=2 --gamma=0.1

Use the following commands to generate the results in Table 3:

python3 table3.py --gamma=0.0
python3 table3.py --gamma=0.1

Use the following command to generate the results in Figure 1:

python3 figure1_2.py --experiment=1

Use the following command to generate the results in Figure 2:

python3 figure1_2.py --experiment=2

Use the following command to generate the results in Figure 3:

python3 figure3.py

Note that you need to install the packages in requirements.txt.

A.2. Notes on FairBatch (Roh et al., 2020)

This method was proposed to find a predictor under equal opportunity, equalized odds, or statistical parity. In each epoch, it identifies the disadvantaged group and increases the corresponding sub-sampling rate in mini-batch selection for the next epoch. We modify this approach for γ-EL as follows (a code sketch follows this description). We initialize the sub-sampling rate of group a (denoted by $SR_a^{(0)}$) for mini-batch formation as $SR_a^{(0)} = n_a/n$, $a = 0, 1$, and form the mini-batches using $SR_0^{(0)}$ and $SR_1^{(0)}$. At epoch i, we run gradient descent using the mini-batches formed by $SR_0^{(i-1)}$ and $SR_1^{(i-1)}$, and we obtain new model parameters $w_i$. After epoch i, we calculate the empirical loss of each group and update $SR_a^{(i)}$ as follows:

$$SR_a^{(i)} = \begin{cases} SR_a^{(i-1)} + \alpha & \text{if } \hat{L}_a(w_i) - \hat{L}_{1-a}(w_i) > \gamma,\\ SR_a^{(i-1)} - \alpha & \text{if } \hat{L}_a(w_i) - \hat{L}_{1-a}(w_i) < -\gamma,\\ SR_a^{(i-1)} & \text{otherwise,}\end{cases}$$

where α is a hyperparameter (equal to 0.005 in our experiments).
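For concreteness, a minimal sketch of this subsampling-rate update is given below. Only the update rule above is taken from the description; the clipping and renormalization steps are our own (hypothetical) choices to keep the rates a valid sampling distribution.

```python
import numpy as np

def update_subsampling_rates(sr, group_losses, alpha=0.005, gamma=0.0):
    """One gamma-EL FairBatch-style update of per-group subsampling rates.

    sr: current rates [SR_0, SR_1]; group_losses: empirical losses [L_0, L_1].
    Clipping and renormalization are illustrative additions, not part of the
    update rule described above.
    """
    sr = np.array(sr, dtype=float)
    for a in (0, 1):
        gap = group_losses[a] - group_losses[1 - a]
        if gap > gamma:        # group a is disadvantaged: sample it more
            sr[a] += alpha
        elif gap < -gamma:     # group a is advantaged: sample it less
            sr[a] -= alpha
    sr = np.clip(sr, 1e-3, 1.0)
    return sr / sr.sum()       # keep the rates a valid sampling distribution

# Example: group 0 currently incurs a higher empirical loss than group 1.
print(update_subsampling_rates([0.5, 0.5], [0.9, 0.4], alpha=0.005, gamma=0.1))
```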
A.3. Details of numerical experiments and additional numerical results

Due to the space limit of the main paper, we provide more details on our experiments here.

Stopping criteria for the penalty method and FairBatch: We stopped the learning process when the change in the objective function between two consecutive epochs was less than $10^{-6}$; we did not observe any significant change by choosing a smaller threshold.

Learning rate for the penalty method and FairBatch: We chose 0.005 as the learning rate for training a linear model. For the experiment with a non-linear model, we set the learning rate to 0.001.

Stopping criteria for Algorithm 2 and Algorithm 3: As stated in the main paper, we set ϵ = 0.01 in ELminimizer and Algorithm 3. Choosing a smaller ϵ did not change the performance significantly.

Linear relaxation: Note that equation (8) after linear relaxation is a convex optimization problem. We directly solve it using CVXPY.

The experiments were run on a system with the following configuration: 24 GB of RAM, 2 cores of a P100-16GB GPU, and 2 cores of an Intel Xeon CPU @ 2.3 GHz. We used the GPUs for training FairBatch.

A.4. Notes on the Reduction Approach (Agarwal et al., 2018; 2019)

Let Q(f) be a distribution over F. In order to find the optimal Q(f) using the reduction approach, we have to solve the following optimization problem:

$$\min_Q \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}$$
$$\text{s.t.}\quad \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=0\} = \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\},$$
$$\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=1\} = \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}.$$

Similar to (Agarwal et al., 2018; 2019), we can re-write the above optimization problem in the following form:

$$\min_Q \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}$$
$$\text{s.t.}\quad \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=0\} - \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\} \leq 0,$$
$$-\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=0\} + \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\} \leq 0,$$
$$\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=1\} - \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\} \leq 0,$$
$$-\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=1\} + \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\} \leq 0.$$

Then, the reduction approach forms the Lagrangian function as follows:

$$\mathcal{L}(Q, \mu) = \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\} + \mu_1\Big(\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=0\} - \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}\Big) + \mu_2\Big(-\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=0\} + \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}\Big) + \mu_3\Big(\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=1\} - \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}\Big) + \mu_4\Big(-\sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\,|\,A=1\} + \sum_{f\in F} Q(f)\,\mathbb{E}\{l(Y, f(X))\}\Big),$$

with $\mu_1 \geq 0, \mu_2 \geq 0, \mu_3 \geq 0, \mu_4 \geq 0$. Since f is parametrized by w, we can find a distribution Q(w) over $\mathbb{R}^{d_w}$. Therefore, we rewrite the problem in the following form:

$$\mathcal{L}(Q(w), \mu_1, \mu_2, \mu_3, \mu_4) = \sum_w Q(w)L(w) + \mu_1\Big(\sum_w Q(w)L_0(w) - \sum_w Q(w)L(w)\Big) + \mu_2\Big(-\sum_w Q(w)L_0(w) + \sum_w Q(w)L(w)\Big) + \mu_3\Big(\sum_w Q(w)L_1(w) - \sum_w Q(w)L(w)\Big) + \mu_4\Big(-\sum_w Q(w)L_1(w) + \sum_w Q(w)L(w)\Big).$$

The reduction approach updates Q(w) and $(\mu_1, \mu_2, \mu_3, \mu_4)$ alternately. Looking carefully at Algorithm 1 in (Agarwal et al., 2018), after updating $(\mu_1, \mu_2, \mu_3, \mu_4)$, we need access to an oracle that is able to solve the following optimization problem in each iteration:

$$\min_w\; (1 - \mu_1 + \mu_2 - \mu_3 + \mu_4)L(w) + (\mu_1 - \mu_2)L_0(w) + (\mu_3 - \mu_4)L_1(w).$$

The above optimization problem is not convex for all $(\mu_1, \mu_2, \mu_3, \mu_4)$: whenever one of the coefficients multiplying $L(w)$, $L_0(w)$, or $L_1(w)$ is negative, the objective is a difference of convex functions. Therefore, in order to use the reduction approach, we would need access to an oracle that solves this non-convex optimization problem, which is not available. Note that the original problem (1) is a non-convex optimization problem, and using the reduction approach just leads to another non-convex optimization problem.
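To see the non-convexity concretely, the small numeric check below builds the Hessian of the oracle objective for squared-error losses and exhibits a negative eigenvalue. The toy data and multiplier values are illustrative assumptions; the point is only that nonnegative multipliers can make the combined objective indefinite.

```python
import numpy as np

# Toy design matrices for the two groups; their feature covariances point in
# different directions, which is what exposes the non-convexity below.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(50, 2)) * np.array([2.0, 0.1])   # group A = 0
X1 = rng.normal(size=(60, 2)) * np.array([0.1, 2.0])   # group A = 1
n0, n1 = len(X0), len(X1)
n = n0 + n1

# Hessians of the group-wise and overall MSE: 2 X^T X / n (each one convex).
H0 = 2.0 * X0.T @ X0 / n0
H1 = 2.0 * X1.T @ X1 / n1
H_all = (n0 * H0 + n1 * H1) / n

# Oracle objective from above:
# (1 - m1 + m2 - m3 + m4) L(w) + (m1 - m2) L0(w) + (m3 - m4) L1(w).
m1, m2, m3, m4 = 0.0, 5.0, 0.0, 0.0                    # illustrative multipliers
H = (1 - m1 + m2 - m3 + m4) * H_all + (m1 - m2) * H0 + (m3 - m4) * H1

# A negative eigenvalue certifies the oracle objective is non-convex here.
print("min Hessian eigenvalue:", np.linalg.eigvalsh(H).min())
```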
A.5. Equalized Loss & Bounded Group Loss

In this section, we study the relation between the EL and BGL fairness notions. It is straightforward to see that any predictor satisfying γ-BGL also satisfies γ-EL. However, it is unclear to what extent an optimal fair predictor under γ-EL satisfies the BGL fairness notion. Next, we theoretically study the relation between BGL and EL. Let $w^*$ denote the solution to (1) and $f_{w^*}$ the corresponding optimal γ-EL fair predictor. Theorem A.1 below shows that, under certain conditions, it is impossible for both groups to experience a loss larger than 2γ under the optimal γ-EL fair predictor.

Theorem A.1. Suppose there exists a predictor that satisfies the γ-BGL fairness notion. That is, the following optimization problem has at least one feasible point:

$$\min_w L(w) \quad \text{s.t.} \quad L_a(w) \leq \gamma, \; a \in \{0, 1\}. \tag{12}$$

Then, the following holds:

$$\min\{L_0(w^*), L_1(w^*)\} \leq \gamma; \qquad \max\{L_0(w^*), L_1(w^*)\} \leq 2\gamma.$$

Theorem A.1 shows that γ-EL implies 2γ-BGL whenever γ-BGL is feasible. Therefore, if γ is not too small (e.g., γ = 0), EL and BGL do not contradict each other. We emphasize that we are not claiming that EL fairness is better than BGL, or vice versa. Instead, these relations indicate the impacts the two fairness constraints could have on the model performance; the results may further provide guidance for policy-makers.

A.6. Proofs

In order to prove Theorem 3.3, we first introduce two lemmas.

Lemma A.2. Under Assumption 3.2, there exists $\bar{w} \in \mathbb{R}^{d_w}$ such that $L_0(\bar{w}) = L_1(\bar{w}) = L(\bar{w})$ and $\lambda^{(0)}_{start} \leq L(\bar{w}) \leq \lambda^{(0)}_{end}$.

Proof. Let $q_0(\beta) = L_0((1-\beta)w_{G_0} + \beta w_{G_1})$, $q_1(\beta) = L_1((1-\beta)w_{G_0} + \beta w_{G_1})$, and $q(\beta) = q_0(\beta) - q_1(\beta)$ for $\beta \in [0, 1]$. Note that $\nabla_w L_a(w_{G_a}) = 0$ because $w_{G_a}$ is the minimizer of $L_a(w)$. First, we show that $L_0((1-\beta)w_{G_0} + \beta w_{G_1})$ is an increasing function of β and $L_1((1-\beta)w_{G_0} + \beta w_{G_1})$ is a decreasing function of β. Note that $q_0'(0) = (w_{G_1} - w_{G_0})^T \nabla_w L_0(w_{G_0}) = 0$, and $q_0(\beta)$ is convex because $L_0(w)$ is convex. This implies that $q_0'(\beta)$ is an increasing function, and $q_0'(\beta) \geq 0$ for all $\beta \in [0, 1]$. Similarly, we can show that $q_1'(\beta) \leq 0$ for all $\beta \in [0, 1]$. Note that under Assumption 3.2, $q(0) < 0$ and $q(1) > 0$. Therefore, by the intermediate value theorem, there exists $\bar\beta \in (0, 1)$ such that $q(\bar\beta) = 0$. Define $\bar{w} = (1-\bar\beta)w_{G_0} + \bar\beta w_{G_1}$. We have:

$$q(\bar\beta) = 0 \implies L_0(\bar{w}) = L_1(\bar{w}) = L(\bar{w});$$
$$w_{G_0} \text{ is the minimizer of } L_0 \implies L(\bar{w}) = L_0(\bar{w}) \geq \lambda^{(0)}_{start};$$
$$q_0'(\beta) \geq 0 \; \forall \beta \in [0, 1] \implies q_0(1) \geq q_0(\bar\beta) \implies \lambda^{(0)}_{end} \geq L_0(\bar{w}) = L(\bar{w}).$$

Lemma A.3. $L_0(w_i^*) = \lambda^{(i)}_{mid}$, where $w_i^*$ is the solution to (4).

Proof. We proceed by contradiction. Assume that $L_0(w_i^*) < \lambda^{(i)}_{mid}$ (i.e., $w_i^*$ is an interior point of the feasible set of (4)). Notice that $w_{G_1}$ cannot be in the feasible set of (4) because $L_0(w_{G_1}) = \lambda^{(0)}_{end} > \lambda^{(i)}_{mid}$. As a result, $\nabla_w L_1(w_i^*) \neq 0$. This is a contradiction because $w_i^*$ is an interior point of the feasible set of a convex optimization problem and cannot be optimal if $\nabla_w L_1(w_i^*)$ is not equal to zero.

Proof [Theorem 3.3]. Let $I_i = [\lambda^{(i)}_{start}, \lambda^{(i)}_{end}]$ be a sequence of intervals. It is easy to see that $I_1 \supseteq I_2 \supseteq \cdots$ and $\lambda^{(i)}_{end} - \lambda^{(i)}_{start} \to 0$ as $i \to \infty$. Therefore, by the Nested Interval Theorem, $\cap_{i=1}^{\infty} I_i$ consists of exactly one real number $\bar\lambda$, and both $\lambda^{(i)}_{start}$ and $\lambda^{(i)}_{end}$ converge to $\bar\lambda$. Because $\lambda^{(i)}_{mid} = \frac{\lambda^{(i)}_{start} + \lambda^{(i)}_{end}}{2}$, $\lambda^{(i)}_{mid}$ also converges to $\bar\lambda$.

Now, we show that $L(w^*) \in I_i$ for all i ($w^*$ is the solution to (1) when γ = 0; as a result, $L_0(w^*) = L_1(w^*) = L(w^*)$). Note that $L(w^*) = L_0(w^*) \geq \lambda^{(0)}_{start}$ because $w_{G_0}$ is the minimizer of $L_0$. Moreover, $\lambda^{(0)}_{end} \geq L(w^*)$; otherwise $L(\bar{w}) < L(w^*)$ ($\bar{w}$ is defined in Lemma A.2) and $w^*$ is not an optimal solution under 0-EL. Therefore, $L(w^*) \in I_0$. Now we proceed by induction. Suppose $L(w^*) \in I_i$; we show that $L(w^*) \in I_{i+1}$ as well. We consider two cases:

Case 1: $L(w^*) \leq \lambda^{(i)}_{mid}$. In this case, $w^*$ is a feasible point for (4), and $L_1(w_i^*) = \lambda^{(i)} \leq L_1(w^*) = L(w^*) \leq \lambda^{(i)}_{mid}$. Therefore, $L(w^*) \in I_{i+1}$.

Case 2: $L(w^*) > \lambda^{(i)}_{mid}$. In this case, we proceed by contradiction to show that $\lambda^{(i)} \geq \lambda^{(i)}_{mid}$. Assume that $\lambda^{(i)} < \lambda^{(i)}_{mid}$. Define $r(\beta) = r_0(\beta) - r_1(\beta)$, where $r_a(\beta) = L_a((1-\beta)w_{G_0} + \beta w_i^*)$. Note that $\lambda^{(i)} = r_1(1)$, and by Lemma A.3, $r_0(1) = \lambda^{(i)}_{mid}$. Therefore, $r(1) = \lambda^{(i)}_{mid} - \lambda^{(i)} > 0$. Moreover, under Assumption 3.2, $r(0) < 0$. Therefore, by the intermediate value theorem, there exists $\beta_0 \in (0, 1)$ such that $r(\beta_0) = 0$. Similar to the proof of Lemma A.2, we can show that $r_0(\beta)$ is an increasing function for all $\beta \in [0, 1]$. As a result, $r_0(\beta_0) < r_0(1) = \lambda^{(i)}_{mid}$. Define $w_0 = (1-\beta_0)w_{G_0} + \beta_0 w_i^*$. We have:

$$r_0(\beta_0) = L_0(w_0) = L_1(w_0) = L(w_0) < \lambda^{(i)}_{mid}, \tag{13}$$
$$L(w^*) > \lambda^{(i)}_{mid}. \tag{14}$$

The last two equations imply that $w^*$ is not a globally optimal fair solution under the 0-EL fairness constraint, i.e., it is not the global optimal solution to (1). This is a contradiction. Therefore, if $L(w^*) > \lambda^{(i)}_{mid}$, then $\lambda^{(i)} \geq \lambda^{(i)}_{mid}$. As a result, $L(w^*) \in I_{i+1}$.

By the two cases above and the Nested Interval Theorem, we conclude that

$$L(w^*) \in \cap_{i=1}^{\infty} I_i, \quad \lim_{i\to\infty} \lambda^{(i)}_{mid} = L(w^*); \quad \text{define } \lambda^{*}_{mid} := \lim_{i\to\infty} \lambda^{(i)}_{mid}.$$

Therefore, $\lim_{i\to\infty} w_i^*$ is the solution to the following optimization problem:

$$\arg\min_w L_1(w) \quad \text{s.t.} \quad L_0(w) \leq \lambda^{*}_{mid} = L(w^*).$$

By Lemma A.3, the solution to the above optimization problem (i.e., $\lim_{i\to\infty} w_i^*$) satisfies $L_0(\lim_{i\to\infty} w_i^*) = \lambda^{*}_{mid} = L(w^*)$. Therefore, $\lim_{i\to\infty} w_i^*$ is the global optimal solution to optimization problem (1).
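As a sanity check on this argument, the short sketch below runs the bisection on two strictly convex toy losses for which everything is available in closed form. The interval-update rule (comparing $\lambda^{(i)} = L_1(w_i^*)$ with $\lambda^{(i)}_{mid}$) is inferred from the two cases above, so treat this as an illustrative reconstruction rather than the paper's Algorithm 1.

```python
import math

# Toy strictly convex group losses satisfying Assumption 3.2:
# L0(w) = (w - 1)^2 (minimizer w_G0 = 1), L1(w) = 2(w + 1)^2 (minimizer w_G1 = -1).
L0 = lambda w: (w - 1.0) ** 2
L1 = lambda w: 2.0 * (w + 1.0) ** 2

def argmin_L1_subject_to_L0(lam):
    # Closed-form solution of min_w L1(w) s.t. L0(w) <= lam for this toy pair:
    # the unconstrained minimizer -1 is feasible iff lam >= 4; otherwise the
    # constraint is active and the boundary point closer to -1 is 1 - sqrt(lam).
    return -1.0 if lam >= 4.0 else 1.0 - math.sqrt(lam)

lam_start, lam_end = L0(1.0), L0(-1.0)      # [lambda_start, lambda_end] = [0, 4]
for _ in range(50):                          # bisection on lambda
    lam_mid = 0.5 * (lam_start + lam_end)
    w_i = argmin_L1_subject_to_L0(lam_mid)
    if L1(w_i) >= lam_mid:                   # inferred update rule (see the cases above)
        lam_start = lam_mid
    else:
        lam_end = lam_mid

w_star = -3.0 + 2.0 * math.sqrt(2.0)         # exact 0-EL solution of L0(w) = L1(w)
print(f"bisection: L = {lam_mid:.6f}, exact: L = {L0(w_star):.6f}")
```

On this example, both printed values agree (about 1.372583), matching the claim that $\lambda^{(i)}_{mid} \to L(w^*)$.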
Proof [Theorem 3.4]. Let us assume that $w_O$ does not satisfy γ-EL (if $w_O$ satisfies γ-EL, it is already the optimal predictor under γ-EL fairness and there is no need to solve any constrained optimization problem; note that $w_O$ is the solution to problem (7)). Let $w^*$ be the optimal weight vector under γ-EL. It is clear that $w^* \neq w_O$.

Step 1. We show that one of the following holds:

$$L_0(w^*) - L_1(w^*) = \gamma, \tag{15}$$
$$L_0(w^*) - L_1(w^*) = -\gamma. \tag{16}$$

Proof by contradiction. Assume that $-\gamma < L_0(w^*) - L_1(w^*) < \gamma$. This implies that $w^*$ is an interior point of the feasible set of optimization problem (1). Since $w^* \neq w_O$, we have $\nabla L(w^*) \neq 0$. As a result, the objective function of (1) can be improved at $w^*$ by moving in the direction of $-\nabla L(w^*)$. This is a contradiction. Therefore, $|L_0(w^*) - L_1(w^*)| = \gamma$.

Step 2. $w_\gamma = \text{ELminimizer}(w_{G_0}, w_{G_1}, \epsilon, \gamma)$ is the solution to the following optimization problem:

$$\min_w \Pr\{A=0\} L_0(w) + \Pr\{A=1\} L_1(w) \quad \text{s.t.} \quad L_0(w) - L_1(w) = \gamma. \tag{17}$$

To show this claim, notice that the solution to optimization problem (17) is the same as the solution to

$$\min_w \Pr\{A=0\} L_0(w) + \Pr\{A=1\} \tilde{L}_1(w) \quad \text{s.t.} \quad L_0(w) - \tilde{L}_1(w) = 0, \tag{18}$$

where $\tilde{L}_1(w) = L_1(w) + \gamma$. Since $L_0(w_{G_0}) - \tilde{L}_1(w_{G_0}) < 0$ and $L_0(w_{G_1}) - \tilde{L}_1(w_{G_1}) > 0$, by Theorem 3.3 we know that $w_\gamma = \text{ELminimizer}(w_{G_0}, w_{G_1}, \epsilon, \gamma)$ finds the solution to (18) as ϵ goes to zero. Lastly, because $|L_0(w^*) - L_1(w^*)| = \gamma$, we have:

$$w^* = \begin{cases} w_\gamma & \text{if } L(w_\gamma) \leq L(w_{-\gamma}),\\ w_{-\gamma} & \text{otherwise.}\end{cases} \tag{19}$$

Thus, Algorithm 2 finds the solution to (1).

Proof [Theorem 4.1].

1. Under Assumption 3.2, $g(1) < 0$. Moreover, $g(0) \geq 0$. Therefore, by the intermediate value theorem, there exists $\beta_0 \in [0, 1]$ such that $g(\beta_0) = 0$.

2. Since $w_O$ is the minimizer of $L(w)$, $h'(0) = 0$. Moreover, since $L(w)$ is strictly convex, $h(\beta)$ is strictly convex and $h'(\beta)$ is a strictly increasing function. As a result, $h'(\beta) > 0$ for $\beta > 0$, and $h(\beta)$ is strictly increasing.

3. Similar to the above argument, $s(\beta) = L_{\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}})$ is a strictly decreasing function (notice that $s'(1) = 0$ and $s(\beta)$ is strictly convex). Note that $h(\beta) = \Pr\{A=\hat a\} L_{\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}}) + \Pr\{A = 1-\hat a\} L_{1-\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}})$ is strictly increasing while $L_{\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}})$ is strictly decreasing. Therefore, we conclude that $L_{1-\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}})$ is strictly increasing. As a result, $g(\beta)$ must be strictly decreasing.

Proof [Theorem 4.2]. First, we show that if $g_\gamma(0) \leq 0$, then $w_O$ satisfies γ-EL:

$$g_\gamma(0) \leq 0 \implies g(0) - \gamma \leq 0 \implies L_{\hat a}(w_O) - L_{1-\hat a}(w_O) \leq \gamma.$$

Moreover, $L_{\hat a}(w_O) - L_{1-\hat a}(w_O) \geq 0$ because $\hat a = \arg\max_a L_a(w_O)$. Therefore, γ-EL is satisfied.

Now, let $g_\gamma(0) > 0$. Note that under Assumption 3.2, $g_\gamma(1) = L_{\hat a}(w_{G_{\hat a}}) - L_{1-\hat a}(w_{G_{\hat a}}) - \gamma < 0$. Therefore, by the intermediate value theorem, there exists $\beta_0$ such that $g_\gamma(\beta_0) = 0$. Moreover, by Theorem 4.1, $g_\gamma$ is a strictly decreasing function. Therefore, the binary search proposed in Algorithm 3 converges to the root of $g_\gamma(\beta)$. As a result, $(1-\beta^{(\infty)}_{mid})w_O + \beta^{(\infty)}_{mid} w_{G_{\hat a}}$ satisfies γ-EL. Note that since $g(\beta)$ is strictly decreasing and $g(\beta^{(\infty)}_{mid}) = \gamma$, $\beta^{(\infty)}_{mid}$ is the smallest possible β under which $(1-\beta)w_O + \beta w_{G_{\hat a}}$ satisfies γ-EL. Since h is increasing, the smallest possible β gives the best accuracy.
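A minimal sketch of this binary search is given below. It assumes the model-interpolation form described in the proof (searching over $\beta$ along the path $(1-\beta)w_O + \beta w_{G_{\hat a}}$); the example losses, names, and stopping rule are illustrative, not the repository's Algorithm 3.

```python
def beta_binary_search(g_gamma, eps=0.01):
    """Find the root of a strictly decreasing g_gamma over [0, 1] by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if g_gamma(mid) > 0:   # EL gap still exceeds gamma: move beta up
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative losses along the path w(beta) = (1 - beta) w_O + beta w_{G_a}:
# the disadvantaged group's loss decreases in beta, the other group's increases
# (cf. the proof of Theorem 4.1).
L_hat = lambda b: (1 - b) * 0.8 + b * 0.2      # loss of the disadvantaged group
L_other = lambda b: (1 - b) * 0.3 + b * 0.6    # loss of the other group
gamma = 0.1
beta = beta_binary_search(lambda b: L_hat(b) - L_other(b) - gamma)
print(f"beta = {beta:.4f}, EL gap = {L_hat(beta) - L_other(beta):.4f}")
```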
Proof [Theorem 4.3]. If $g_\gamma(0) \leq 0$, then $w_O$ satisfies γ-EL, and the output of Algorithm 3 is $w = w_O$. In this case, it is easy to see that $L(w_O) \leq \max_{a\in\{0,1\}} L_a(w_O)$ (because $L(w_O)$ is a weighted average of $L_0(w_O)$ and $L_1(w_O)$).

Now assume that $g_\gamma(0) > 0$. Note that if we prove this theorem for γ = 0, then the theorem holds for γ > 0 as well. This is because the optimal predictor under 0-EL satisfies the γ-EL condition too; in other words, 0-EL is a stronger constraint than γ-EL.

Let γ = 0. In this case, Algorithm 3 finds $w = (1-\beta_0)w_O + \beta_0 w_{G_{\hat a}}$, where $\beta_0$ is defined in Theorem 4.1. We have:

$$(*)\quad g(\beta_0) = 0 \implies L_{\hat a}(w) = L_{1-\hat a}(w).$$

In the proof of Theorem 4.1, we showed that $L_{\hat a}((1-\beta)w_O + \beta w_{G_{\hat a}})$ is decreasing in β. Therefore, we have:

$$(**)\quad L_{\hat a}(w) \leq L_{\hat a}(w_O).$$

Therefore, we have:

$$L(w) = \Pr\{A=\hat a\}\, L_{\hat a}(w) + \Pr\{A = 1-\hat a\}\, L_{1-\hat a}(w) \tag{20}$$
$$= L_{\hat a}(w) \quad \text{(by (*))} \tag{21}$$
$$\leq L_{\hat a}(w_O) = \max_{a\in\{0,1\}} L_a(w_O) \quad \text{(by (**))}. \tag{22}$$

Proof [Theorem 4.5]. By the triangle inequality, the following holds:

$$\sup_{f_w\in F} \Big| |L_0(w) - L_1(w)| - |\hat L_0(w) - \hat L_1(w)| \Big| \tag{23}$$
$$\leq \sup_{f_w\in F} |L_0(w) - \hat L_0(w)| + \sup_{f_w\in F} |L_1(w) - \hat L_1(w)|. \tag{24}$$

Therefore, with probability at least $1-2\delta$ we have:

$$\sup_{f_w\in F} \Big| |L_0(w) - L_1(w)| - |\hat L_0(w) - \hat L_1(w)| \Big| \leq B(\delta, n_0, F) + B(\delta, n_1, F). \tag{25}$$

As a result, with probability at least $1-2\delta$, the following holds:

$$\{w \mid f_w\in F, \; |L_0(w) - L_1(w)| \leq \gamma\} \subseteq \{w \mid f_w\in F, \; |\hat L_0(w) - \hat L_1(w)| \leq \hat\gamma\}. \tag{26}$$

Now consider the following:

$$L(\hat{\boldsymbol w}) - L(w^*) = L(\hat{\boldsymbol w}) - \hat L(\hat{\boldsymbol w}) + \hat L(\hat{\boldsymbol w}) - \hat L(w^*) + \hat L(w^*) - L(w^*). \tag{27}$$

By (26), $\hat L(\hat{\boldsymbol w}) - \hat L(w^*) \leq 0$ with probability at least $1-2\delta$. Thus, with probability at least $1-2\delta$, we have:

$$L(\hat{\boldsymbol w}) - L(w^*) \leq L(\hat{\boldsymbol w}) - \hat L(\hat{\boldsymbol w}) + \hat L(w^*) - L(w^*). \tag{28}$$

Therefore, under Assumption 4.4, we can conclude that with probability at least $1-6\delta$,

$$L(\hat{\boldsymbol w}) - L(w^*) \leq 2B(\delta, n, F).$$

In addition, by (25), with probability at least $1-2\delta$, we have:

$$|L_0(\hat{\boldsymbol w}) - L_1(\hat{\boldsymbol w})| \leq B(\delta, n_0, F) + B(\delta, n_1, F) + |\hat L_0(\hat{\boldsymbol w}) - \hat L_1(\hat{\boldsymbol w})| \leq \hat\gamma + B(\delta, n_0, F) + B(\delta, n_1, F) = \gamma + 2B(\delta, n_0, F) + 2B(\delta, n_1, F).$$

Proof [Theorem A.1]. Let $\bar{\boldsymbol w}$ be a feasible point of optimization problem (12); then $\bar{\boldsymbol w}$ is also a feasible point of (1). We proceed by contradiction and consider three cases:

Case 1: $\min\{L_0(w^*), L_1(w^*)\} > \gamma$ and $\max\{L_0(w^*), L_1(w^*)\} > 2\gamma$. In this case, $L(w^*) > \gamma \geq L(\bar{\boldsymbol w})$. This is a contradiction because it implies that $w^*$ is not an optimal solution to (1) and $\bar{\boldsymbol w}$ is a better solution for (1).

Case 2: $\min\{L_0(w^*), L_1(w^*)\} > \gamma$ and $\max\{L_0(w^*), L_1(w^*)\} \leq 2\gamma$. This case is similar to the above: $\min\{L_0(w^*), L_1(w^*)\} > \gamma$ implies that $L(w^*) > \gamma \geq L(\bar{\boldsymbol w})$. This is a contradiction because it implies that $w^*$ is not an optimal solution to (1).

Case 3: $\min\{L_0(w^*), L_1(w^*)\} \leq \gamma$ and $\max\{L_0(w^*), L_1(w^*)\} > 2\gamma$. We have $\max\{L_0(w^*), L_1(w^*)\} - \min\{L_0(w^*), L_1(w^*)\} > \gamma$, which shows that $w^*$ is not a feasible point of (1). This is a contradiction.

Therefore, $\max\{L_0(w^*), L_1(w^*)\} \leq 2\gamma$ and $\min\{L_0(w^*), L_1(w^*)\} \leq \gamma$.