Calibrated Data-Dependent Constraints with Exact Satisfaction Guarantees

Songkai Xue, Department of Statistics, University of Michigan, sxue@umich.edu
Yuekai Sun, Department of Statistics, University of Michigan, yuekai@umich.edu
Mikhail Yurochkin, IBM Research, MIT-IBM Watson AI Lab, mikhail.yurochkin@ibm.com

We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are calibrated: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amenable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).

1 Motivation

In machine learning (ML) practice, accuracy is often only one of many training objectives. For example, algorithmic fairness considerations may require a credit scoring system to perform comparably on men and women. Here are a few other examples.

Churn rate and stability. The churn rate of an ML model compared to another model is the fraction of samples on which the predictions of the two models differ [21, 30]. In ML practice, one may wish to control the churn rate between a new model and its predecessor because a high churn rate can disorient users and downstream system components. One way of training models with small churn is to enforce a churn rate constraint during training.

Precision, recall, etc. Classification and information retrieval models must often balance precision and recall. To train such models, practitioners carefully trade off one metric for the other by optimizing for one metric subject to constraints on the other.

Resource constraints. Practitioners sometimes wish to control how often a classifier predicts a certain class due to budget or resource constraints. For example, a company that uses ML to select customers for a targeted offer may wish to constrain the fraction of customers selected for the offer. Another prominent example of a stochastic optimization problem with resource constraints is the newsvendor problem, which we come back to in Section 4.

Unlike constraints on the structure of model parameters (e.g., sparsity), the constraints encoding the preceding training objectives are data-dependent. This leads to the issue of constraint generalization: whether the constraints generalize out-of-sample. For example, if a classifier is trained to have comparable accuracy on two subpopulations in the training data, will it also have comparable accuracy on samples from the two subpopulations at test time?

In this paper, we consider the out-of-sample generalization of expected-value constraints. To keep things simple, consider a stochastic optimization problem with a single expected-value constraint:

$$\theta^\star \in \arg\min_{\theta\in\Theta}\ \mathbb{E}_{P_0}[f(\theta; Z)] = \int_{\mathcal{Z}} f(\theta; z)\,dP_0(z) \quad \text{subject to} \quad \mathbb{E}_{P_0}[g(\theta; Z)] = \int_{\mathcal{Z}} g(\theta; z)\,dP_0(z) \le 0, \tag{1.1}$$

where $\Theta$ is a (finite-dimensional) parameter space, $f, g : \Theta\times\mathcal{Z} \to \mathbb{R}$ are (known) cost and constraint functions, and $Z \in \mathcal{Z}$ is a random variable that represents a sample. The distribution of $Z$ is unknown, so we cannot solve (1.1) directly.
Instead, we obtain IID training samples $\{Z_i\}_{i=1}^n$ from the true underlying distribution $P_0$ and solve the empirical version of (1.1):

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) \le 0. \tag{1.2}$$

The estimator $\hat\theta_n$ (of $\theta^\star$) is guaranteed to satisfy the empirical constraint (i.e., $\frac{1}{n}\sum_{i=1}^n g(\hat\theta_n; Z_i) \le 0$), but it is unclear whether $\hat\theta_n$ satisfies the actual (population) constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$. As we shall see, under standard assumptions on (1.1), $\hat\theta_n$ only satisfies the actual constraint with probability approaching $\frac12$ (see Corollary 2.2). This is especially problematic for constraints that encode algorithmic fairness goals. For example, the 80% rule published by the US Equal Employment Opportunity Commission, interpreted in the machine learning context, requires the rate at which a classifier predicts the advantaged label in minority groups to be at least 80% of the rate at which the classifier predicts the advantaged label in the majority group [3].

In this paper, we propose a distributionally robust version of (1.2) that guarantees the actual constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$ will be satisfied with probability $1-\alpha$:

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \le 0, \tag{1.3}$$

where $D_\varphi$ is a $\varphi$-divergence (see Section 2 for details), $\hat P_n$ is the empirical distribution of the training samples, and $\sqrt{\rho}$ is the $1-\alpha$ quantile of a standard normal random variable. More concretely, we show that $\hat\theta_n$ achieves asymptotically exact constraint satisfaction:

$$\lim_{n\to\infty} \mathbb{P}\big\{\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0\big\} = 1-\alpha. \tag{1.4}$$

Here the inner expectation is with respect to $Z$; the outer probability is with respect to the training samples $\{Z_i\}_{i=1}^n$. Three desirable properties of (1.3) are:

1. Exact constraint satisfaction: if the actual probability of constraint satisfaction exceeds $1-\alpha$, then the method is too conservative, which may (unnecessarily) increase the cost of the model. By picking $\rho$ in (1.3) carefully, constraints are satisfied with asymptotically exact probability $1-\alpha$.

2. Computational efficiency: as we shall see, the computational cost of solving (1.3) is comparable to the cost of solving distributionally robust sample average approximation (SAA) problems.

3. Pivotality: there are no nuisance parameters to estimate (e.g., asymptotic variances) in (1.3). The user merely needs to look up the correct quantile of the standard normal distribution for their desired level of constraint generalization.

The rest of this paper is organized as follows. In Section 2, we develop the method, theory, and algorithm for stochastic optimization problems with a single constraint. In Section 3, we extend our method, theory, and algorithm to stochastic optimization problems with multiple constraints. In Section 4, we validate our theory by simulating a resource-constrained newsvendor problem. In Section 5, we demonstrate the efficacy of our method by using it to train an algorithmically fair income classifier. In addition, we show how to apply our method to a fairness-constrained learning problem and discuss two practical considerations for fair ML application scenarios. Finally, we summarize our work in Section 6 and point out an interesting avenue of future work.

1.1 Related work

The closest work to ours is [27]. They seek to pick a (data-dependent) uncertainty set $\mathcal{U}$ such that

$$\lim_{n\to\infty} \mathbb{P}\Big\{\sup_{P\in\mathcal{U}} \mathbb{E}_P[g(\theta; Z)] \ge \mathbb{E}_{P_0}[g(\theta; Z)] \ \text{for all } \theta\in\Theta\Big\} = 1-\alpha. \tag{1.5}$$

This condition is stronger than necessary: we only require

$$\lim_{n\to\infty} \mathbb{P}\Big\{\sup_{P\in\mathcal{U}} \mathbb{E}_P[g(\hat\theta_n; Z)] \ge \mathbb{E}_{P_0}[g(\hat\theta_n; Z)]\Big\} = 1-\alpha, \tag{1.6}$$

where $\hat\theta_n$ is a (data-dependent) estimator (not necessarily (1.2) or (1.3)). [27] studies (asymptotic) constraint satisfaction (1.4) for all deterministic objective functions (see Section 1.1 of [27] for details).
They advocate picking a KL divergence ball with radius that depends on the excursion probability of a certain $\chi^2$ process.

Another closely related line of work is on data-splitting approaches for ensuring constraint generalization [37, 7]. At a high level, these approaches split the training data into training and validation subsets and use the validation subset to tune models trained on the training subset so that they satisfy the constraints. Although (computationally) simple and intuitive, this approach does not allow users to precisely control the constraint violation probability.

[27] is the latest in a line of work on distributionally robust optimization (DRO) showing that the optimal values of DRO problems

$$\min_{\theta\in\Theta}\ \sup_{P\in\mathcal{U}} \mathbb{E}_P[f(\theta; Z)], \tag{1.7}$$

where $\mathcal{U}$ is a (data-dependent) uncertainty set of probability distributions, are upper confidence bounds for the optimal values of stochastic optimization problems. Common choices of uncertainty sets in DRO include uncertainty sets defined by moment or support constraints [6, 12, 22], $\varphi$-divergences [4, 26, 31], and Wasserstein distances [34, 5, 18, 28, 35]. This line of work is motivated by Owen's seminal work on empirical likelihood [32]. In recent work, [26, 15] show that the optimal value of DRO problems with empirical likelihood uncertainty sets leads to asymptotically exact upper confidence bounds for the optimal value of stochastic optimization problems ([15] consider more general $\varphi$-divergence uncertainty sets). [5] establish similar coverage results for Wasserstein uncertainty sets. Our work is also closely related to the work on the variance regularization properties of DRO [31], which uses DRO to approximate the variance-regularized cost function (see (2.4)). [20] establish similar results for Wasserstein DRO.

Lastly, we relate our work to the literature on chance constrained optimization (see [24] and the references therein). The general goal of chance constrained optimization is to minimize a loss function subject to the requirement that the probability of satisfying uncertain constraints exceeds a prescribed level. In contrast, our methods reformulate expected value constraints, and we show that the solution of the reformulated problem enjoys an asymptotically exact probabilistic guarantee of constraint satisfaction. In addition, the data-dependent constraints in our work are also unknown in practice, which differs from the common setup in the chance constrained optimization literature.

2 Single expected value constraint

We motivate (1.3) by considering a few alternatives. First, we note that the results later in this section show that (1.2) violates the actual constraint in (1.1) approximately half the time (see Corollary 2.2). The most straightforward modification of (1.2) that ensures $\hat\theta_n$ satisfies the (actual) constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$ is to add a margin; i.e., enforce the constraint

$$\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + \gamma_n \le 0 \tag{2.1}$$

in (1.2). If we pick the slack term $\gamma_n$ so that

$$\mathbb{P}\Big\{\sup_{\theta\in\Theta}\Big(\mathbb{E}_{P_0}[g(\theta; Z)] - \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i)\Big) \le \gamma_n\Big\} \ge 1-\alpha,$$

then it is not hard to check that the resulting $\hat\theta_n$ satisfies the (actual) constraint with probability greater than $1-\alpha$ [36, 29]. However, this approach is most likely conservative because the constraint is unnecessarily stringent at $\theta$'s for which $\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i)$ is less variable. It is also not pivotal: $\gamma_n$ is often set using bounds from (uniform) concentration inequalities, which typically depend on unknown problem parameters.
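To make the non-pivotal nature of (2.1) concrete, here is a minimal sketch of a uniform margin obtained from a Hoeffding-style bound; the bound $B$ on $|g|$ and the union bound over a finite grid of parameter values are illustrative assumptions of ours, not part of the method studied in this paper:

```python
import numpy as np

def hoeffding_margin(n: int, alpha: float, B: float, grid_size: int = 1) -> float:
    """Uniform slack gamma_n for the margin constraint (2.1), via Hoeffding's
    inequality plus a union bound over a finite grid of theta values.

    Assumes g(theta; Z) takes values in [-B, B]; B is typically unknown in
    practice, which is exactly why this construction is not pivotal.
    """
    return B * np.sqrt(2.0 * np.log(grid_size / alpha) / n)
```

The explicit dependence on $B$ (and on how $\Theta$ is discretized) illustrates the unknown problem parameters mentioned above, and the resulting margin is typically much looser than necessary.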
To relax the empirical constraint in a way that adapts to the variability of the empirical constraint, we replace the uniform margin in (2.1) with a parameter-dependent margin:

$$\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + z_{1-\alpha}\,\frac{\hat\sigma(\theta)}{\sqrt{n}} \le 0, \tag{2.2}$$

where $z_{1-\alpha}$ is the $1-\alpha$ quantile of a standard normal random variable and $\hat\sigma^2(\theta)$ is an estimate of the asymptotic variance of $g(\theta; Z)$. We recognize the (parameter-dependent) margin as (a multiple of) the standard error of the empirical constraint. It is possible to show that enforcing (2.2) achieves asymptotically exact constraint generalization (1.4) [27]. The main issue with this method is that it is not amenable to standard stochastic optimization algorithms. In particular, even if the original constraint in (1.2) is convex, (2.2) is generally non-convex. Another issue is that it is not pivotal: the user must estimate the asymptotic variance of $g(\theta; Z)$.

To overcome these two issues, we consider a distributionally robust version of (1.2); i.e., enforcing

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \le 0, \tag{2.3}$$

where $D_\varphi(P\,\|\,Q) \triangleq \int \varphi\big(\tfrac{dP}{dQ}\big)\,dQ$ is a $\varphi$-divergence. Common choices of $\varphi$ include $\varphi(t) = (t-1)^2$ (which leads to the $\chi^2$-divergence) and $\varphi(t) = -\log t + t - 1$ (which leads to the Kullback-Leibler divergence). Although there are many other choices for the uncertainty set in (2.3), we pick a $\varphi$-divergence ball because (i) (2.3) with a $\varphi$-divergence ball is asymptotically equivalent to (2.2):

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \approx \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + z_{1-\alpha}\,\frac{\hat\sigma(\theta)}{\sqrt{n}}, \tag{2.4}$$

and (ii) it leads to pivotal uncertainty sets. For the theoretical analysis, we always use $\varphi(t) = (t-1)^2$ and the $\chi^2$-divergence in the remainder of this paper.

Before we state the asymptotically exact constraint satisfaction property of (1.3) rigorously, we describe our assumptions on the problem.

1. Smoothness and concentration: $f$ and $g$ are twice continuously differentiable with respect to $\theta$, and $f(\theta^\star; Z)$, $\nabla f(\theta^\star; Z)$, $g(\theta^\star; Z)$, $\nabla g(\theta^\star; Z)$ are sub-Gaussian random variables.

2. Uniqueness: the stochastic optimization problem with a single expected value constraint (1.1) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.

3. Strict complementarity: $\lambda^\star > 0$.

4. Positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

The preceding assumptions are not the most general, but they are easy to interpret. The smoothness conditions on $f$ and $g$ with respect to $\theta$, the concentration conditions on $f(\theta^\star; Z)$ and $g(\theta^\star; Z)$, and the uniqueness condition facilitate the use of standard tools from asymptotic statistics to study the large sample properties of the constraint value. The strict complementarity condition rules out problems in which the constraint is extraneous; i.e., problems in which the unconstrained minimum coincides with the constrained minimum.

We are ready to state the asymptotically exact constraint satisfaction property of (1.3) rigorously. The main technical result characterizes the limiting distribution of the constraint value.

Theorem 2.1. Let $\hat\theta_n$ be an optimal solution of (1.3) converging in probability as $n \to \infty$ to $\theta^\star$. Under the standing assumptions, we have

$$\sqrt{n}\,\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \rightsquigarrow \mathcal{N}\Big(-\sqrt{\rho\,\mathrm{Var}_{P_0}[g(\theta^\star; Z)]},\ \mathrm{Var}_{P_0}[g(\theta^\star; Z)]\Big).$$

We translate this result on the constraint value to a result on constraint generalization.

Corollary 2.2. Let $\sqrt{\rho}$ be the $1-\alpha$ quantile of a standard normal random variable. Under the conditions of Theorem 2.1, we have

$$\lim_{n\to\infty}\mathbb{P}\big\{\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0\big\} = \mathbb{P}\{U \le \sqrt{\rho}\} = 1-\alpha,$$

where $U \sim \mathcal{N}(0, 1)$ is a standard Gaussian random variable.

From Theorem 2.1 and Corollary 2.2 (see proofs in Appendix A), we find that

1. picking $\rho = 0$ (i.e., equivalently solving (1.2)) leads to a constraint violation probability that approaches $\frac12$ in the large sample limit;

2. the relation between the mean and variance of the limiting distribution of the constraint value in Theorem 2.1 allows us to pick $\rho$ in a pivotal way (i.e., $\rho$ does not depend on nuisance parameters).
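To illustrate the pivotal calibration, here is a minimal sketch (in Python, assuming numpy and scipy are available) of how a user would pick the radius $\rho$, together with the asymptotically equivalent margin form (2.4) of the robust constraint at a fixed $\theta$; the function names are ours, not the paper's:

```python
import numpy as np
from scipy.stats import norm

def calibrated_radius(alpha: float) -> float:
    """Pivotal radius for (1.3): sqrt(rho) = z_{1-alpha}, i.e., rho = z_{1-alpha}^2.
    No nuisance parameters are needed (Corollary 2.2)."""
    return norm.ppf(1.0 - alpha) ** 2

def margin_form(g_vals: np.ndarray, rho: float) -> float:
    """Asymptotic value (2.4) of the robust constraint at a fixed theta:
    (1/n) sum_i g(theta; Z_i) + sqrt(rho) * sigma_hat(theta) / sqrt(n),
    where g_vals holds the constraint values g(theta; Z_i) on the sample."""
    n = len(g_vals)
    return g_vals.mean() + np.sqrt(rho) * g_vals.std(ddof=1) / np.sqrt(n)

rho = calibrated_radius(alpha=0.05)  # targets 95% constraint satisfaction
```

For example, a 95% satisfaction target ($\alpha = 0.05$) gives $\sqrt{\rho} = z_{0.95} \approx 1.645$, i.e., $\rho \approx 2.71$, regardless of the problem instance.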
2.1 Stochastic approximation for (1.3)

In the rest of this section, we derive a stochastic optimization algorithm to solve (1.3) efficiently. As we shall see, the computational cost of this algorithm is comparable to the cost of solving a DRO problem. The key insight is that the robust constraint function has a dual form (see Appendix J):

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] = \inf_{\mu\ge 0,\,\eta\in\mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\Big(\frac{g(\theta; Z_i)-\eta}{\mu}\Big) + \frac{\rho\mu}{n} + \eta, \tag{2.5}$$

where $\varphi^*(s) \triangleq \sup_t\{st - \varphi(t)\}$ is the convex conjugate of $\varphi$. As we use the $\chi^2$-divergence with $\varphi(t) = (t-1)^2$, the corresponding conjugate is $\varphi^*(s) = \frac{s^2}{4} + s$. The Lagrangian of (1.3) is

$$L(\theta, \lambda) \triangleq \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] = \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda \inf_{\mu\ge 0,\,\eta\in\mathbb{R}}\Big\{\frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\Big(\frac{g(\theta; Z_i)-\eta}{\mu}\Big) + \frac{\rho\mu}{n} + \eta\Big\}.$$

We see that evaluating the dual function $\inf_\theta L(\theta, \lambda)$ (at a fixed $\lambda$) entails solving a stochastic optimization problem that is suitable for stochastic approximation. This suggests a dual ascent algorithm for solving (1.3): (1) evaluate the dual function at $\lambda_t$ by solving a stochastic optimization problem; (2) update $\lambda_t$ with a dual ascent step. We summarize this algorithm in Algorithm 1. The main cost of Algorithm 1 is incurred in the third line: evaluating the dual function. Fortunately, this step is suitable for stochastic approximation, so we can leverage recent advances in the literature to reduce the (computational) cost of this step. The total cost of this algorithm is comparable to that of distributionally robust optimization.

Algorithm 1 Dual ascent algorithm for (1.3)
1: Input: starting dual iterate $\lambda_0 \ge 0$
2: repeat
3: Evaluate dual function: $(\theta_t, \mu_t, \eta_t) \leftarrow \arg\min_{\theta,\,\mu\ge 0,\,\eta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda_t\Big\{\frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\big(\frac{g(\theta; Z_i)-\eta}{\mu}\big) + \frac{\rho\mu}{n} + \eta\Big\}$
4: Dual ascent update: $\lambda_{t+1} \leftarrow \Big[\lambda_t + \beta_t\Big(\frac{1}{n}\sum_{i=1}^n \mu_t\,\varphi^*\big(\frac{g(\theta_t; Z_i)-\eta_t}{\mu_t}\big) + \frac{\rho\mu_t}{n} + \eta_t\Big)\Big]_+$, where $\beta_t > 0$ is a step size
5: until converged
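As a concreteness check, the inner infimum in (2.5) can be evaluated numerically at a fixed $\theta$. Below is a minimal sketch, assuming numpy/scipy; the optimizer choice is ours, and Algorithm 1 instead folds this minimization into the primal step:

```python
import numpy as np
from scipy.optimize import minimize

def phi_star(s):
    # Convex conjugate of phi(t) = (t - 1)^2 (chi^2-divergence): phi*(s) = s^2/4 + s.
    return s ** 2 / 4.0 + s

def robust_constraint_value(g_vals: np.ndarray, rho: float) -> float:
    """Dual form (2.5): inf over mu > 0, eta in R of
    (1/n) sum_i mu * phi*((g_i - eta)/mu) + rho * mu / n + eta."""
    n = len(g_vals)

    def dual_obj(x):
        mu, eta = np.exp(x[0]), x[1]  # parameterize mu = exp(.) to keep mu > 0
        return np.mean(mu * phi_star((g_vals - eta) / mu)) + rho * mu / n + eta

    x0 = np.array([np.log(g_vals.std() + 1e-8), g_vals.mean()])
    return minimize(dual_obj, x0, method="Nelder-Mead").fun
```

For this particular conjugate the infimum has the closed form $\bar g_n + \sqrt{\rho}\,\hat\sigma_n/\sqrt{n}$, where $\bar g_n$ and $\hat\sigma_n^2 = \frac1n\sum_i (g_i - \bar g_n)^2$ are the sample mean and variance of the $g_i$'s, which matches the margin form (2.4) exactly and provides a handy unit test for any numerical implementation.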
3 Multiple expected value constraints

In this section, we extend the results from the preceding section to stochastic optimization problems with multiple data-dependent constraints. Consider a stochastic optimization problem with $K$ expected value constraints:

$$\theta^\star \in \arg\min_{\theta\in\Theta}\ \mathbb{E}_{P_0}[f(\theta; Z)] \quad \text{subject to} \quad \mathbb{E}_{P_0}[g_k(\theta; Z)] \le 0,\ k\in[K]. \tag{3.1}$$

Following the development in Section 2, we enforce the expected value constraints with robust versions of the sample average constraints:

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho_k/n} \mathbb{E}_P[g_k(\theta; Z)] \le 0,\ k\in[K], \tag{3.2}$$

where $\rho = (\rho_1, \ldots, \rho_K)^\top$ are the uncertainty set radii for the constraints. There are other approaches to enforcing multiple constraints that result in constraint generalization; we focus on (3.2) here because it allows the user to adjust the constraint generalization probability for different constraints.

First, we extend Theorem 2.1 and Corollary 2.2 to problems with multiple (expected value) constraints. We assume:

1. Smoothness and concentration: for $k\in[K]$, $f$ and $g_k$ are twice continuously differentiable with respect to $\theta$, and $f(\theta^\star; Z)$, $\nabla f(\theta^\star; Z)$, $g_k(\theta^\star; Z)$, $\nabla g_k(\theta^\star; Z)$ are sub-Gaussian random variables.

2. Uniqueness: the stochastic optimization problem with $K$ expected value constraints (3.1) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.

3. Strict complementarity: $\lambda^\star \in \mathrm{int}(\mathbb{R}^K_+)$; i.e., each component of $\lambda^\star$ is strictly positive.

4. Positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

The strict complementarity condition seems especially strong here because it requires all the constraints to be active. It is possible (with extra notational overhead) to state the result in terms of just the active constraints. We refer to Section 5.1 for more information about the unknown active set. Further, as long as the sample size is large enough, the active constraints in (3.2) coincide with the active constraints in (3.1). To keep things simple, we assume all the constraints are active.

Theorem 3.1. Let $\hat\theta_n$ be an optimal solution of (3.2) converging in probability as $n \to \infty$ to $\theta^\star$. Under the standing assumptions, we have

$$\sqrt{n}\begin{bmatrix}\mathbb{E}_{P_0}[g_1(\hat\theta_n; Z)]\\ \vdots\\ \mathbb{E}_{P_0}[g_K(\hat\theta_n; Z)]\end{bmatrix} \rightsquigarrow \mathcal{N}\left(-\begin{bmatrix}\sqrt{\rho_1\,\mathrm{Var}_{P_0}[g_1(\theta^\star; Z)]}\\ \vdots\\ \sqrt{\rho_K\,\mathrm{Var}_{P_0}[g_K(\theta^\star; Z)]}\end{bmatrix},\ \mathrm{Var}_{P_0}\begin{bmatrix}g_1(\theta^\star; Z)\\ \vdots\\ g_K(\theta^\star; Z)\end{bmatrix}\right).$$

Corollary 3.2. Under the conditions of Theorem 3.1, we have

$$\lim_{n\to\infty}\mathbb{P}\big\{\mathbb{E}_{P_0}[g_k(\hat\theta_n; Z)] \le 0 \ \text{for all } k\in[K]\big\} = \mathbb{P}\{U \le \sqrt{\rho}\} \quad \text{(componentwise)},$$

where $\sqrt{\rho} = (\sqrt{\rho_1}, \ldots, \sqrt{\rho_K})^\top$ and $U$ is a Gaussian random vector with mean zero and covariance $\mathrm{diag}\{\mathrm{Var}_{P_0}[g_k(\theta^\star; Z)]\}_{k=1}^{K\,-\frac12}\ \mathrm{Var}_{P_0}\big[(g_1(\theta^\star; Z), \ldots, g_K(\theta^\star; Z))^\top\big]\ \mathrm{diag}\{\mathrm{Var}_{P_0}[g_k(\theta^\star; Z)]\}_{k=1}^{K\,-\frac12}$; i.e., the correlation matrix of the constraint values.

From Theorem 3.1 and Corollary 3.2 (see proofs in Appendices B and C), we find that the probability of constraint satisfaction decreases exponentially as the number of constraints increases. We also see that our method is no longer pivotal for multiple expected value constraints: the uncertainty set radii depend on the (unknown) correlation structure among the constraint values. Fortunately, it is not hard to estimate this correlation structure. The most straightforward way is with the empirical correlation matrix: let $\hat\Sigma_n$ be the empirical covariance matrix of the constraint values; the empirical correlation matrix is then given by $\hat R_n \triangleq \mathrm{diag}(\hat\Sigma_n)^{-\frac12}\,\hat\Sigma_n\,\mathrm{diag}(\hat\Sigma_n)^{-\frac12}$.

Finally, it is straightforward to extend the algorithm for solving (1.3) to (3.2). The Lagrangian of (3.2) is

$$L(\theta, \lambda) \triangleq \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \sum_{k=1}^K \lambda_k \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho_k/n} \mathbb{E}_P[g_k(\theta; Z)] = \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \sum_{k=1}^K \lambda_k \inf_{\mu_k\ge 0,\,\eta_k\in\mathbb{R}}\Big\{\frac{1}{n}\sum_{i=1}^n \mu_k\,\varphi^*\Big(\frac{g_k(\theta; Z_i)-\eta_k}{\mu_k}\Big) + \frac{\rho_k\mu_k}{n} + \eta_k\Big\},$$

where we recalled the dual form of the robust constraint function (2.5) in the second step. We see that evaluating the dual function $\inf_\theta L(\theta, \lambda)$ (at a fixed $\lambda$) entails solving a stochastic optimization problem that is suitable for stochastic approximation. This suggests a similar dual ascent algorithm for solving (3.2); we skip the details here (see Algorithm 2 in Appendix D).

4 Simulations

We simulate the frequency of constraint satisfaction for the following multi-item newsvendor problem:

$$\max_{\theta\in\Theta}\ \mathbb{E}_{P_0}\big[p^\top \min\{Z, \theta\}\big] - c^\top\theta \quad \text{subject to} \quad \mathbb{E}_{P_0}\big[\|(Z^{(1)} - \theta^{(1)})_+\|_2^2\big] \le \varepsilon_1,\ \ \mathbb{E}_{P_0}\big[\|(Z^{(2)} - \theta^{(2)})_+\|_2^2\big] \le \varepsilon_2, \tag{4.1}$$

where $c \in \mathbb{R}^d_+$ is the manufacturing cost, $p \in \mathbb{R}^d_+$ is the sell price, $\theta \in \Theta = [0, 100]^d$ is the number of items in stock, $Z \in \mathbb{R}^d$ is a random variable with probability distribution $P_0$ representing the demand, and there are $d$ items in total. The distribution $P_0$ is unknown, but we observe IID samples $Z_1, \ldots, Z_n$ from $P_0$. All of the items have been partitioned into two groups so that the corresponding demand and stock can be written as $Z = (Z^{(1)}, Z^{(2)})$ and $\theta = (\theta^{(1)}, \theta^{(2)})$. The constraints in the problem exclude stock levels that underestimate the demand too much for each group of items, where $\varepsilon_1, \varepsilon_2 > 0$ indicate the tolerance for such underestimation. The goal is to maximize profit while satisfying the constraints. It is easy to rewrite the maximization problem (4.1) as a minimization problem with expected value constraints in the form of (3.1), so we can apply our method (3.2). We pick $P_0$ to be a multivariate Gaussian with independent components so that the two constraints are uncorrelated with each other (see Appendix E for details). Throughout the simulations, we solve (3.2) with $\sqrt{\rho} = (z_{1-\alpha}, z_{1-\alpha})^\top$ for $\alpha \in \{0.4, 0.25, 0.1, 0.05, 0.005\}$. As suggested by our asymptotic theory in Section 3, the nominal probability of constraint satisfaction is $1-\alpha$ for each constraint and $(1-\alpha)^2$ for both constraints due to the independence setup.
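The limiting joint satisfaction probability in Corollary 3.2 can be estimated by plugging the empirical correlation matrix $\hat R_n$ into a multivariate normal CDF. A minimal sketch (assuming numpy/scipy; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_satisfaction_probability(G: np.ndarray, rho: np.ndarray) -> float:
    """Estimate P{U <= sqrt(rho)} from Corollary 3.2, with U ~ N(0, R_hat_n).

    G: (n, K) array of constraint values g_k(theta_hat; Z_i).
    rho: length-K array of uncertainty set radii (rho_1, ..., rho_K).
    """
    Sigma = np.cov(G, rowvar=False)   # empirical covariance Sigma_hat_n
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)        # empirical correlation R_hat_n
    U = multivariate_normal(mean=np.zeros(G.shape[1]), cov=R, allow_singular=True)
    return U.cdf(np.sqrt(rho))
```

With independent constraints and $\rho_k = z_{1-\alpha}^2$ for every $k$, this returns approximately $(1-\alpha)^K$, matching the nominal value used in the simulations below; conversely, one can search over $\rho$ to hit a desired joint probability.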
In Figure 1, we plot the frequencies of constraint satisfaction for each constraint and for both constraints, all averaged over 1000 replicates. As the sample size $n$ grows, the frequency-versus-probability curve converges to the dashed line of the theoretical limiting probability of constraint satisfaction, validating our theory in the large sample regime. For more simulations (e.g., a single constraint, two dependent constraints), we refer to Appendix E.

Figure 1: Frequency versus limiting probability of constraint satisfaction for the first constraint (left), the second constraint (middle), and both constraints (right).

5 Application to fair machine learning

As ML models are deployed in high-stakes decision making and decision support roles, the fairness of the models has come under increased scrutiny. In response, there is a flurry of recent work on mathematical definitions of algorithmic fairness [16, 23, 25] and algorithms to enforce the definitions [1, 10, 38]. A prominent class of fairness definitions is group fairness; such definitions require equality of certain metrics (e.g., false/true positive rates) among demographic groups.

For example, consider a fair binary classification problem. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the input space, $\mathcal{Y} = \{0, 1\}$ be the set of possible labels, and $\mathcal{A}$ be the set of possible values of the protected/sensitive attribute. In this setup, training and test examples are tuples of the form $(X, A, Y) \in \mathcal{X}\times\mathcal{A}\times\mathcal{Y}$, and a classifier is a map $f : \mathcal{X} \to \{0, 1\}$. A popular definition of algorithmic fairness for binary classification is equality of opportunity [23].

Definition 5.1 (equality of opportunity). Let $Y = 1$ be the advantaged label that is associated with a positive outcome and $\hat Y \triangleq f(X)$ be the output of the classifier. Equality of opportunity entails

$$\mathbb{P}\{\hat Y = 1 \mid A = a, Y = 1\} = \mathbb{P}\{\hat Y = 1 \mid A = a', Y = 1\} \quad \text{for all } a, a' \in \mathcal{A}.$$

Equality of opportunity, or true positive rate parity, means that the prediction $\hat Y = f(X)$, conditioned on the advantaged label $Y = 1$, is statistically independent of the protected attribute $A$. Furthermore, an approximate version of equality of opportunity can be readily defined. We say that $\hat Y = f(X)$ satisfies $\varepsilon$-equality of opportunity if

$$\big|\mathbb{P}\{\hat Y = 1 \mid A = a, Y = 1\} - \mathbb{P}\{\hat Y = 1 \mid A = a', Y = 1\}\big| \le \varepsilon \quad \text{for all } a, a' \in \mathcal{A}.$$

In this case, $\varepsilon > 0$ represents a practitioner's tolerance for fairness violations.

Given a parametric model space $\mathcal{H} = \{f_\theta : \theta\in\Theta\}$ and a loss function $\ell : \Theta\times\mathcal{X}\times\mathcal{Y} \to \mathbb{R}_+$, an in-processing fair ML routine minimizes the (empirical) risk $\mathbb{E}[\ell(\theta; X, Y)]$ while satisfying some fairness constraints. Most commonly, definitions of group fairness (including equality of opportunity, demographic parity, and more) can be written as special cases of a general set of linear constraints [1, 2] of the form $M\mu(\theta) \le c$, where the matrix $M \in \mathbb{R}^{K\times T}$ and the vector $c \in \mathbb{R}^K$ encode the constraints; $\mu(\theta) : \Theta \to \mathbb{R}^T$ is a vector of (conditional) moments $\mu_t(\theta) = \mathbb{E}[h_t(X, A, Y, \theta) \mid \mathcal{E}_t]$ for $t\in[T]$, where $h_t : \mathcal{X}\times\mathcal{A}\times\mathcal{Y}\times\Theta \to \mathbb{R}$ and the event $\mathcal{E}_t$ is defined with respect to $(X, A, Y)$. This framework fits our methodology if we note that each (conditional) moment can be written as

$$\mu_t(\theta) = \frac{\mathbb{E}_{(X,A,Y)\sim P_0}\big[h_t(X, A, Y, \theta)\,\mathbb{1}\{\mathcal{E}_t(X, A, Y)\}\big]}{\mathbb{E}_{(X,A,Y)\sim P_0}\big[\mathbb{1}\{\mathcal{E}_t(X, A, Y)\}\big]}. \tag{5.1}$$

Here the indicator $\mathbb{1}\{\mathcal{E}_t\}$ takes value 1 if the event $\mathcal{E}_t$ happens, and 0 otherwise. Moreover, we write $\mathcal{E}_t(X, A, Y)$ to emphasize that $\mathcal{E}_t$ depends only on $(X, A, Y)$ and not on $\theta$ in any way.
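The group-conditional rates in Definition 5.1 and in the moments (5.1) are straightforward to estimate from a labeled sample. A minimal sketch of the empirical $\varepsilon$-equality-of-opportunity check (numpy assumed; the function name is ours, and we assume every group has positive examples):

```python
import numpy as np

def equal_opportunity_gap(y_pred, y_true, a):
    """Largest pairwise gap in empirical true positive rates across groups:
    max_{a, a'} |P_hat{Y_hat=1 | A=a, Y=1} - P_hat{Y_hat=1 | A=a', Y=1}|.
    epsilon-equality of opportunity holds when this gap is <= epsilon."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))
    tprs = [y_pred[(a == g) & (y_true == 1)].mean() for g in np.unique(a)]
    return max(tprs) - min(tprs)

# e.g., equal_opportunity_gap(model_predictions, y, group) <= 0.05 checks 0.05-EO.
```

Each such rate is itself a ratio of the form (5.1), with $h_t = \mathbb{1}\{\hat Y = 1\}$ and $\mathcal{E}_t = \{A = a, Y = 1\}$, which motivates the plug-in treatment that follows.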
Note that (5.1) is a ratio of expected values, which is a non-linear statistical functional of $P_0$. To use our method, we first replace the denominator of $\mu_t(\theta)$ with an estimator, such as the unbiased estimator $\hat{\mathbb{P}}(\mathcal{E}_t) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\mathcal{E}_t(X_i, A_i, Y_i)\}$. The resulting plug-in estimate of $\mu_t(\theta)$ is then linear in $P_0$, allowing us to apply our method (see similar tricks in [8]). We describe the application of our method to $\varepsilon$-equality of opportunity in Appendix F.

5.1 A two-stage method for unknown active set

In practice, it is probable that only a subset of the constraints is active. Furthermore, we do not know beforehand whether or not a constraint is active in the true population problem. To handle this scenario, we propose a two-stage method:

1. At the first stage, we solve the sample average approximation (SAA) problem, i.e., (3.2) with $\rho = 0_K$. By doing so, we identify the active set of the SAA problem.

2. At the second stage, we solve (3.2) with $\rho$ such that $\rho_k$ is a positive number only if the $k$-th constraint, $k\in[K]$, was identified as active at the first stage.

In Appendix G, we show that the two-stage method also enjoys the calibration property (similar to Theorem 3.1 and Corollary 3.2) under standard assumptions (i.e., strict complementarity). At a high level, the limiting probability of satisfying the true constraints depends solely on the correlation structure among the active constraints and the uncertainty set radii for the active constraints, as long as the SAA problem identifies the active constraints with probability tending to 1.

5.2 Proxy dual function for non-differentiable constraints

Constraint functions in fair ML are often non-differentiable. For instance, fairness metrics are typically linear combinations of indicators, which result in non-differentiable rate constraints [8-10]. This prevents the use of gradient-based optimization algorithms. Fortunately, only the dual function evaluation step in Algorithm 1 requires access to gradients. Therefore, we can modify the algorithm by (1) introducing a proxy dual function, which uses a differentiable surrogate $\tilde g$ instead of the non-differentiable $g$ in the dual function evaluation step, and (2) keeping $g$ in the dual ascent step. For an indicator function $h(t) = \mathbb{1}\{t > 0\}$, one can replace it by the sigmoidal function $h_1(t) = (1 + e^{-at})^{-1}$ or the hinge upper bound $h_2(t) = \max\{0, t + 1\}$ to produce tractable surrogates for non-differentiable rate constraints [11, 17, 9]. We summarize the proxy dual ascent algorithm in Appendix H.
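A minimal sketch of the two surrogates mentioned above (numpy assumed): in the proxy dual ascent algorithm, these replace the indicator only inside the dual function evaluation step, while the exact indicator is kept in the dual ascent step.

```python
import numpy as np

def indicator(t):
    """Exact rate-constraint building block 1{t > 0}; kept in the dual ascent step."""
    return (np.asarray(t) > 0).astype(float)

def sigmoid_surrogate(t, a=5.0):
    """Smooth surrogate h1(t) = (1 + exp(-a t))^{-1}; the slope a is a tuning
    parameter (the default here is our choice, not the paper's)."""
    return 1.0 / (1.0 + np.exp(-a * np.asarray(t)))

def hinge_surrogate(t):
    """Hinge upper bound h2(t) = max{0, t + 1} on the indicator 1{t > 0}."""
    return np.maximum(0.0, np.asarray(t) + 1.0)
```

The sigmoid gives a smooth two-sided approximation, while the hinge gives a convex upper bound, so constraints enforced through the hinge are conservative rather than merely approximate.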
5.3 Adult experiments

We compare the frequency of constraint satisfaction (at test time) of the sample average approximation and of our method with nominal probabilities 0.60, 0.75, 0.90, and 0.95 on the Adult dataset from UCI [13]. The classification task is to predict whether an individual's income is higher than $50K per year. The fairness goal is $\varepsilon$-demographic parity ($\varepsilon$-DP):

$$\big|\mathbb{P}(\hat Y = 1 \mid A = 1) - \mathbb{P}(\hat Y = 1 \mid A = 0)\big| \le \varepsilon,$$

where $A = 1$ (male) is the advantaged group and $A = 0$ (female) is the disadvantaged group. We use a logistic regression model for classification and the techniques in this section for implementation. In Figure 2, we show line plots of the frequency of constraint satisfaction and box plots of the classification error rate, all summarized over 100 replicates.

The left panel shows that solving (3.1) directly leads to a one-half chance of constraint violation, while our method's constraint satisfaction frequency matches its nominal value. The price of a higher chance of test-time fairness satisfaction is an increase in classification error rate, as shown in the right panel: going from the baseline to a 95% chance of fairness satisfaction costs roughly a 2% increase in error rate. We refer to Appendices I and K for details and more experiments.

Figure 2: Frequency of constraint satisfaction (left) and classification error rate (right) for demographic parity tolerance $\varepsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05\}$. The baseline (sample average approximation, SAA, nominal probability 0.50) and our method (with nominal probabilities 0.60, 0.75, 0.90, 0.95) are compared.

6 Summary and discussion

We study the problem of exact constraint satisfaction probability in stochastic optimization with expected-value constraints. We propose a distributionally robust reformulation of data-dependent constraints and provide a theoretical guarantee of constraint satisfaction with an asymptotically exact probability specified by the user. To solve the reformulated problem, we propose a scalable dual ascent algorithm and its variants. The computational cost of our algorithm is comparable to that of a standard distributionally robust optimization problem. Our theory on exact constraint satisfaction probability is validated via simulations on a resource-constrained newsvendor problem. The efficacy of our method is empirically demonstrated on fair machine learning applications.

Some data-dependent constraints are by nature non-linear in the underlying probability measure; for example, (5.1) is a ratio of expected values. An intriguing direction for future research is to generalize the methods and theory developed in this work to constraints on non-linear functions of expected values. Such constraints are known as statistical functionals in the statistics literature [19]. The non-linear dependence of the constraint function on the probability measure precludes stochastic approximation as a general way of evaluating the dual function, as the constraint function no longer admits a dual form like (2.5), calling for the development of new algorithms.

Acknowledgments and Disclosure of Funding

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1916271, 2027737, and 2113373.

References

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A Reductions Approach to Fair Classification. arXiv:1803.02453 [cs], July 2018.

[2] Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair Regression: Quantitative Definitions and Reduction-based Algorithms. arXiv:1905.12843 [cs, stat], May 2019.

[3] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019.

[4] Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities. Management Science, 59(2):341-357, November 2012. ISSN 0025-1909. doi: 10.1287/mnsc.1120.1641.
[5] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein Profile Inference and Applications to Machine Learning. arXiv:1610.05627 [math, stat], October 2016.

[6] Xin Chen, Melvyn Sim, and Peng Sun. A Robust Optimization Perspective on Stochastic Programming. Operations Research, 55:1058-1071, 2007. doi: 10.1287/opre.1070.0441.

[7] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints. arXiv:1807.00028 [cs, stat], September 2018.

[8] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In International Conference on Machine Learning, pages 1397-1405. PMLR, 2019.

[9] Andrew Cotter, Heinrich Jiang, Maya Gupta, Serena Wang, Taman Narayan, Seungil You, and Karthik Sridharan. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. Journal of Machine Learning Research, 20(172):1-59, 2019. ISSN 1533-7928.

[10] Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-Player Games for Efficient Non-Convex Constrained Optimization. In Algorithmic Learning Theory, pages 300-332. PMLR, March 2019.

[11] Mark A. Davenport, Richard G. Baraniuk, and Clayton D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1888-1898, 2010.

[12] Erick Delage and Yinyu Ye. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems. Operations Research, 58:595-612, 2010. doi: 10.1287/opre.1090.0741.

[13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[14] John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. October 2016.

[15] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. arXiv:1610.03425 [stat], October 2016.

[16] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness Through Awareness. arXiv:1104.3913 [cs], April 2011.

[17] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832-840. PMLR, 2017.

[18] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. May 2015.

[19] Luisa Turrin Fernholz. Von Mises Calculus for Statistical Functionals, volume 19. Springer Science & Business Media, 2012.

[20] Chao Gao. Robust Regression via Multivariate Regression Depth. arXiv:1702.04656 [math, stat], February 2017.

[21] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying Real-world Goals with Dataset Constraints. Advances in Neural Information Processing Systems, 29, 2016.

[22] Joel Goh and Melvyn Sim. Distributionally Robust Optimization and Its Tractable Approximations. Operations Research, 58(4-part-1):902-917, August 2010. ISSN 0030-364X, 1526-5463. doi: 10.1287/opre.1090.0795.

[23] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of Opportunity in Supervised Learning.
arXiv:1610.02413 [cs], October 2016.

[24] Ruiwei Jiang and Yongpei Guan. Data-driven chance constrained stochastic program. Mathematical Programming, 158(1):291-327, 2016.

[25] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual Fairness. arXiv:1703.06856 [cs, stat], March 2018.

[26] H. Lam and Enlu Zhou. Quantifying uncertainty in sample average approximation. In 2015 Winter Simulation Conference (WSC), pages 3846-3857, December 2015. doi: 10.1109/WSC.2015.7408541.

[27] Henry Lam. Recovering Best Statistical Guarantees via the Empirical Divergence-Based Distributionally Robust Optimization. Operations Research, 67(4):1090-1105, July 2019. ISSN 0030-364X. doi: 10.1287/opre.2018.1786.

[28] Jaeho Lee and Maxim Raginsky. Minimax Statistical Learning with Wasserstein Distances. arXiv:1705.07815 [cs], May 2017.

[29] James Luedtke and Shabbir Ahmed. A Sample Approximation Approach for Optimization with Probabilistic Constraints. SIAM Journal on Optimization, 19(2):674-699, January 2008. ISSN 1052-6234. doi: 10.1137/070702928.

[30] Mahdi Milani Fard, Quentin Cormier, Kevin Canini, and Maya Gupta. Launch and Iterate: Reducing Prediction Churn. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[31] Hongseok Namkoong and John C. Duchi. Stochastic Gradient Methods for Distributionally Robust Optimization with F-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 2216-2224, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9.

[32] Art Owen. Empirical Likelihood Ratio Confidence Regions. The Annals of Statistics, 18(1):90-120, March 1990. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176347494.

[33] Reuven Y. Rubinstein and Alexander Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, volume 13. Wiley, 1993.

[34] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally Robust Logistic Regression. September 2015.

[35] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv:1710.10571 [cs, stat], October 2017.

[36] Wei Wang and Shabbir Ahmed. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters, 36(5):515-519, September 2008. ISSN 0167-6377. doi: 10.1016/j.orl.2008.05.003.

[37] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning Non-Discriminatory Predictors. In Conference on Learning Theory, pages 1920-1953. PMLR, June 2017.

[38] Mikhail Yurochkin and Yuekai Sun. SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness. In International Conference on Learning Representations, September 2020.