Calibrated Data-Dependent Constraints with Exact Satisfaction Guarantees

Songkai Xue, Department of Statistics, University of Michigan, sxue@umich.edu
Yuekai Sun, Department of Statistics, University of Michigan, yuekai@umich.edu
Mikhail Yurochkin, IBM Research, MIT-IBM Watson AI Lab, mikhail.yurochkin@ibm.com

We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are calibrated: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amenable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).

1 Motivation

In machine learning (ML) practice, accuracy is often only one of many training objectives. For example, algorithmic fairness considerations may require a credit scoring system to perform comparably on men and women. Here are a few other examples.

Churn rate and stability. The churn rate of an ML model compared to another model is the fraction of samples on which the predictions of the two models differ [21, 30]. In ML practice, one may wish to control the churn rate between a new model and its predecessor because a high churn rate can disorient users and downstream system components. One way of training models with small churn is to enforce a churn rate constraint during training.

Precision, recall, etc. Classification and information retrieval models must often balance precision and recall. To train such models, practitioners carefully trade off one metric for the other by optimizing for one metric subject to constraints on the other.

Resource constraints. Practitioners sometimes wish to control how often a classifier predicts a certain class due to budget or resource constraints. For example, a company that uses ML to select customers for a targeted offer may wish to constrain the fraction of customers selected for the offer. Another prominent example of a stochastic optimization problem with resource constraints is the newsvendor problem, which we come back to in Section 4.

Unlike constraints on the structure of model parameters (e.g., sparsity), the constraints encoding the preceding training objectives are data-dependent. This leads to the issue of constraint generalization: whether the constraints generalize out-of-sample. For example, if a classifier is trained to have comparable accuracy on two subpopulations in the training data, will it also have comparable accuracy on samples from the two subpopulations at test time?

In this paper, we consider the out-of-sample generalization of expected-value constraints. To keep things simple, consider a stochastic optimization problem with a single expected-value constraint:

$$\theta^\star \in \arg\min_{\theta\in\Theta}\ \mathbb{E}_{P_0}[f(\theta; Z)] = \int_{\mathcal{Z}} f(\theta; z)\,dP_0(z) \quad \text{subject to} \quad \mathbb{E}_{P_0}[g(\theta; Z)] = \int_{\mathcal{Z}} g(\theta; z)\,dP_0(z) \le 0, \tag{1.1}$$

where $\Theta$ is a (finite-dimensional) parameter space, $f, g : \Theta\times\mathcal{Z} \to \mathbb{R}$ are (known) cost and constraint functions, and $Z \in \mathcal{Z}$ is a random variable that represents a sample. The distribution of $Z$ is unknown, so we cannot solve (1.1) directly.
Instead, we obtain IID training samples $\{Z_i\}_{i=1}^n$ from the true underlying distribution $P_0$ and solve the empirical version of (1.1):

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) \le 0. \tag{1.2}$$

The estimator $\hat\theta_n$ (of $\theta^\star$) is guaranteed to satisfy the empirical constraint (i.e., $\frac{1}{n}\sum_{i=1}^n g(\hat\theta_n; Z_i) \le 0$), but it is unclear whether $\hat\theta_n$ satisfies the actual (population) constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$. As we shall see, under standard assumptions on (1.1), $\hat\theta_n$ only satisfies the actual constraint with probability approaching $\frac12$ (see Corollary 2.2). This is especially problematic for constraints that encode algorithmic fairness goals. For example, the 80% rule published by the US Equal Employment Opportunity Commission, interpreted in the machine learning context, requires the rate at which a classifier predicts the advantaged label in minority groups to be at least 80% of the rate at which the classifier predicts the advantaged label in the majority group [3].

In this paper, we propose a distributionally robust version of (1.2) that guarantees the actual constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$ will be satisfied with probability $1-\alpha$:

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \le 0, \tag{1.3}$$

where $D_\varphi$ is a $\varphi$-divergence (see Section 2 for details), $\hat P_n$ is the empirical distribution of the training samples, and $\sqrt{\rho}$ is the $1-\alpha$ quantile of a standard normal random variable. More concretely, we show that $\hat\theta_n$ achieves asymptotically exact constraint satisfaction:

$$\lim_{n\to\infty} \mathbb{P}\big\{\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0\big\} = 1-\alpha. \tag{1.4}$$

Here the inner expectation is with respect to $Z$; the outer probability is with respect to the training samples $\{Z_i\}_{i=1}^n$. Three desirable properties of (1.3) are:

1. Exact constraint satisfaction: if the actual probability of constraint satisfaction exceeds $1-\alpha$, then the method is too conservative, which may (unnecessarily) increase the cost of the model. By picking $\rho$ in (1.3) carefully, constraints are satisfied with asymptotically exact probability $1-\alpha$.

2. Computational efficiency: as we shall see, the computational cost of solving (1.3) is comparable to the cost of solving distributionally robust sample average approximation (SAA) problems.

3. Pivotality: there are no nuisance parameters to estimate (e.g., asymptotic variances) in (1.3). The user merely needs to look up the correct quantile of the standard normal distribution for their desired level of constraint generalization.

The rest of this paper is organized as follows. In Section 2, we develop the method, theory, and algorithm for stochastic optimization problems with a single constraint. In Section 3, we extend our method, theory, and algorithm to stochastic optimization problems with multiple constraints. In Section 4, we validate our theory by simulating a resource-constrained newsvendor problem. In Section 5, we demonstrate the efficacy of our method by using it to train an algorithmically fair income classifier. In addition, we show how to apply our method to a fairness-constrained learning problem and discuss two practical considerations for fair ML application scenarios. Finally, we summarize our work in Section 6 and point out an interesting avenue of future work.

1.1 Related work

The closest work to ours is [27]. They seek to pick a (data-dependent) uncertainty set $\mathcal{U}$ such that

$$\lim_{n\to\infty} \mathbb{P}\Big\{\sup_{P\in\mathcal{U}} \mathbb{E}_P[g(\theta; Z)] \ge \mathbb{E}_{P_0}[g(\theta; Z)] \ \text{for all } \theta\in\Theta\Big\} = 1-\alpha. \tag{1.5}$$

This condition is stronger than necessary: we only require

$$\lim_{n\to\infty} \mathbb{P}\Big\{\sup_{P\in\mathcal{U}} \mathbb{E}_P[g(\hat\theta_n; Z)] \ge \mathbb{E}_{P_0}[g(\hat\theta_n; Z)]\Big\} = 1-\alpha, \tag{1.6}$$

where $\hat\theta_n$ is a (data-dependent) estimator (not necessarily (1.2) or (1.3)). [27] studies (asymptotic) constraint satisfaction (1.4) for all deterministic objective functions (see Section 1.1 of [27] for details).
They advocate picking a KL divergence ball with radius that depends on the excursion probability of a certain $\chi^2$ process.

Another closely related line of work is on data-splitting approaches for ensuring constraint generalization [37, 7]. At a high level, these approaches split the training data into training and validation subsets and use the validation subset to tune models trained on the training subset so that they satisfy the constraints. Although (computationally) simple and intuitive, this approach does not allow users to precisely control the constraint violation probability.

[27] is the latest in a line of work on distributionally robust optimization (DRO) showing that the optimal values of DRO problems

$$\min_{\theta\in\Theta}\ \sup_{P\in\mathcal{U}} \mathbb{E}_P[f(\theta; Z)], \tag{1.7}$$

where $\mathcal{U}$ is a (data-dependent) uncertainty set of probability distributions, are upper confidence bounds for the optimal values of stochastic optimization problems. Common choices of uncertainty sets in DRO include uncertainty sets defined by moment or support constraints [6, 12, 22], $\varphi$-divergences [4, 26, 31], and Wasserstein distances [34, 5, 18, 28, 35]. This line of work is motivated by Owen's seminal work on empirical likelihood [32]. In recent work, [26, 15] show that the optimal value of DRO problems with empirical likelihood uncertainty sets leads to asymptotically exact upper confidence bounds for the optimal value of stochastic optimization problems ([15] consider more general $\varphi$-divergence uncertainty sets). [5] establish similar coverage results for Wasserstein uncertainty sets. Our work is also closely related to the work on the variance regularization properties of DRO [31], which uses DRO to approximate the variance-regularized cost function (see (2.4)). [20] establish similar results for Wasserstein DRO.

Lastly, we relate our work to the literature on chance constrained optimization (see [24] and the references therein). The general goal of chance constrained optimization is to minimize a loss function subject to the requirement that the probability of satisfying uncertain constraints exceeds a prescribed level. In contrast, our methods reformulate expected value constraints, and we show that the solution of the reformulated problem enjoys an asymptotically exact probabilistic guarantee of constraint satisfaction. In addition, the data-dependent constraints in our work are also unknown in practice, which differs from the common setup in the chance constrained optimization literature.

2 Single expected value constraint

We motivate (1.3) by considering a few alternatives. First, we note that the results later in this section show that (1.2) violates the actual constraint in (1.1) approximately half the time (see Corollary 2.2). The most straightforward modification of (1.2) that ensures $\hat\theta_n$ satisfies the (actual) constraint $\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0$ is to add a margin; i.e., enforce the constraint

$$\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + \gamma_n \le 0 \tag{2.1}$$

in (1.2). If we pick the slack term $\gamma_n$ so that

$$\mathbb{P}\Big\{\sup_{\theta\in\Theta}\Big(\mathbb{E}_{P_0}[g(\theta; Z)] - \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i)\Big) \le \gamma_n\Big\} \ge 1-\alpha,$$

then it is not hard to check that the resulting $\hat\theta_n$ satisfies the (actual) constraint with probability greater than $1-\alpha$ [36, 29]. However, this approach is most likely conservative because the constraint is unnecessarily stringent at $\theta$'s for which $\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i)$ is less variable. It is also not pivotal: $\gamma_n$ is often set using bounds from (uniform) concentration inequalities, which typically depend on unknown problem parameters.
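To make the non-pivotal nature of (2.1) concrete, here is a minimal sketch of a uniform margin obtained from a Hoeffding-style bound; the bound $B$ on $|g|$ and the union bound over a finite grid of parameter values are illustrative assumptions of ours, not part of the method studied in this paper:

```python
import numpy as np

def hoeffding_margin(n: int, alpha: float, B: float, grid_size: int = 1) -> float:
    """Uniform slack gamma_n for the margin constraint (2.1), via Hoeffding's
    inequality plus a union bound over a finite grid of theta values.

    Assumes g(theta; Z) takes values in [-B, B]; B is typically unknown in
    practice, which is exactly why this construction is not pivotal.
    """
    return B * np.sqrt(2.0 * np.log(grid_size / alpha) / n)
```

The explicit dependence on $B$ (and on how $\Theta$ is discretized) illustrates the unknown problem parameters mentioned above, and the resulting margin is typically much looser than necessary.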
To relax the empirical constraint in a way that adapts to the variability of the empirical constraint, we replace the uniform margin in (2.1) with a parameter-dependent margin:

$$\frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + z_{1-\alpha}\,\frac{\hat\sigma(\theta)}{\sqrt{n}} \le 0, \tag{2.2}$$

where $z_{1-\alpha}$ is the $1-\alpha$ quantile of a standard normal random variable and $\hat\sigma^2(\theta)$ is an estimate of the asymptotic variance of $g(\theta; Z)$. We recognize the (parameter-dependent) margin as (a multiple of) the standard error of the empirical constraint. It is possible to show that enforcing (2.2) achieves asymptotically exact constraint generalization (1.4) [27]. The main issue with this method is that it is not amenable to standard stochastic optimization algorithms. In particular, even if the original constraint in (1.2) is convex, (2.2) is generally non-convex. Another issue is that it is not pivotal: the user must estimate the asymptotic variance of $g(\theta; Z)$.

To overcome these two issues, we consider a distributionally robust version of (1.2); i.e., enforcing

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \le 0, \tag{2.3}$$

where $D_\varphi(P\,\|\,Q) \triangleq \int \varphi\big(\tfrac{dP}{dQ}\big)\,dQ$ is a $\varphi$-divergence. Common choices of $\varphi$ include $\varphi(t) = (t-1)^2$ (which leads to the $\chi^2$-divergence) and $\varphi(t) = -\log t + t - 1$ (which leads to the Kullback-Leibler divergence). Although there are many other choices for the uncertainty set in (2.3), we pick a $\varphi$-divergence ball because (i) (2.3) with a $\varphi$-divergence ball is asymptotically equivalent to (2.2):

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] \approx \frac{1}{n}\sum_{i=1}^n g(\theta; Z_i) + z_{1-\alpha}\,\frac{\hat\sigma(\theta)}{\sqrt{n}}, \tag{2.4}$$

and (ii) it leads to pivotal uncertainty sets. For the theoretical analysis, we always use $\varphi(t) = (t-1)^2$ and the $\chi^2$-divergence in the remainder of this paper.

Before we state the asymptotically exact constraint satisfaction property of (1.3) rigorously, we describe our assumptions on the problem.

1. Smoothness and concentration: $f$ and $g$ are twice continuously differentiable with respect to $\theta$, and $f(\theta^\star; Z)$, $\nabla f(\theta^\star; Z)$, $g(\theta^\star; Z)$, $\nabla g(\theta^\star; Z)$ are sub-Gaussian random variables.

2. Uniqueness: the stochastic optimization problem with a single expected value constraint (1.1) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.

3. Strict complementarity: $\lambda^\star > 0$.

4. Positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

The preceding assumptions are not the most general, but they are easy to interpret. The smoothness conditions on $f$ and $g$ with respect to $\theta$, the concentration conditions on $f(\theta^\star; Z)$ and $g(\theta^\star; Z)$, and the uniqueness condition facilitate the use of standard tools from asymptotic statistics to study the large sample properties of the constraint value. The strict complementarity condition rules out problems in which the constraint is extraneous; i.e., problems in which the unconstrained minimum coincides with the constrained minimum.

We are ready to state the asymptotically exact constraint satisfaction property of (1.3) rigorously. The main technical result characterizes the limiting distribution of the constraint value.

Theorem 2.1. Let $\hat\theta_n$ be an optimal solution of (1.3) converging in probability as $n \to \infty$ to $\theta^\star$. Under the standing assumptions, we have

$$\sqrt{n}\,\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \rightsquigarrow \mathcal{N}\Big(-\sqrt{\rho\,\mathrm{Var}_{P_0}[g(\theta^\star; Z)]},\ \mathrm{Var}_{P_0}[g(\theta^\star; Z)]\Big).$$

We translate this result on the constraint value to a result on constraint generalization.

Corollary 2.2. Let $\sqrt{\rho}$ be the $1-\alpha$ quantile of a standard normal random variable. Under the conditions of Theorem 2.1, we have

$$\lim_{n\to\infty}\mathbb{P}\big\{\mathbb{E}_{P_0}[g(\hat\theta_n; Z)] \le 0\big\} = \mathbb{P}\{U \le \sqrt{\rho}\} = 1-\alpha,$$

where $U \sim \mathcal{N}(0, 1)$ is a standard Gaussian random variable.

From Theorem 2.1 and Corollary 2.2 (see proofs in Appendix A), we find that

1. picking $\rho = 0$ (i.e., equivalently solving (1.2)) leads to a constraint violation probability that approaches $\frac12$ in the large sample limit;

2. the relation between the mean and variance of the limiting distribution of the constraint value in Theorem 2.1 allows us to pick $\rho$ in a pivotal way (i.e., $\rho$ does not depend on nuisance parameters).
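To illustrate the pivotal calibration, here is a minimal sketch (in Python, assuming numpy and scipy are available) of how a user would pick the radius $\rho$, together with the asymptotically equivalent margin form (2.4) of the robust constraint at a fixed $\theta$; the function names are ours, not the paper's:

```python
import numpy as np
from scipy.stats import norm

def calibrated_radius(alpha: float) -> float:
    """Pivotal radius for (1.3): sqrt(rho) = z_{1-alpha}, i.e., rho = z_{1-alpha}^2.
    No nuisance parameters are needed (Corollary 2.2)."""
    return norm.ppf(1.0 - alpha) ** 2

def margin_form(g_vals: np.ndarray, rho: float) -> float:
    """Asymptotic value (2.4) of the robust constraint at a fixed theta:
    (1/n) sum_i g(theta; Z_i) + sqrt(rho) * sigma_hat(theta) / sqrt(n),
    where g_vals holds the constraint values g(theta; Z_i) on the sample."""
    n = len(g_vals)
    return g_vals.mean() + np.sqrt(rho) * g_vals.std(ddof=1) / np.sqrt(n)

rho = calibrated_radius(alpha=0.05)  # targets 95% constraint satisfaction
```

For example, a 95% satisfaction target ($\alpha = 0.05$) gives $\sqrt{\rho} = z_{0.95} \approx 1.645$, i.e., $\rho \approx 2.71$, regardless of the problem instance.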
2.1 Stochastic approximation for (1.3)

In the rest of this section, we derive a stochastic optimization algorithm to solve (1.3) efficiently. As we shall see, the computational cost of this algorithm is comparable to the cost of solving a DRO problem. The key insight is that the robust constraint function has a dual form (see Appendix J):

$$\sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] = \inf_{\mu\ge 0,\,\eta\in\mathbb{R}}\ \frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\Big(\frac{g(\theta; Z_i)-\eta}{\mu}\Big) + \frac{\rho\mu}{n} + \eta, \tag{2.5}$$

where $\varphi^*(s) \triangleq \sup_t\{st - \varphi(t)\}$ is the convex conjugate of $\varphi$. As we use the $\chi^2$-divergence with $\varphi(t) = (t-1)^2$, the corresponding conjugate is $\varphi^*(s) = \frac{s^2}{4} + s$. The Lagrangian of (1.3) is

$$L(\theta, \lambda) \triangleq \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho/n} \mathbb{E}_P[g(\theta; Z)] = \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda \inf_{\mu\ge 0,\,\eta\in\mathbb{R}}\Big\{\frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\Big(\frac{g(\theta; Z_i)-\eta}{\mu}\Big) + \frac{\rho\mu}{n} + \eta\Big\}.$$

We see that evaluating the dual function $\inf_\theta L(\theta, \lambda)$ (at a fixed $\lambda$) entails solving a stochastic optimization problem that is suitable for stochastic approximation. This suggests a dual ascent algorithm for solving (1.3): (1) evaluate the dual function at $\lambda_t$ by solving a stochastic optimization problem; (2) update $\lambda_t$ with a dual ascent step. We summarize this algorithm in Algorithm 1. The main cost of Algorithm 1 is incurred in the third line: evaluating the dual function. Fortunately, this step is suitable for stochastic approximation, so we can leverage recent advances in the literature to reduce the (computational) cost of this step. The total cost of this algorithm is comparable to that of distributionally robust optimization.

Algorithm 1 Dual ascent algorithm for (1.3)
1: Input: starting dual iterate $\lambda_0 \ge 0$
2: repeat
3: Evaluate dual function: $(\theta_t, \mu_t, \eta_t) \leftarrow \arg\min_{\theta,\,\mu\ge 0,\,\eta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \lambda_t\Big\{\frac{1}{n}\sum_{i=1}^n \mu\,\varphi^*\big(\frac{g(\theta; Z_i)-\eta}{\mu}\big) + \frac{\rho\mu}{n} + \eta\Big\}$
4: Dual ascent update: $\lambda_{t+1} \leftarrow \Big[\lambda_t + \beta_t\Big(\frac{1}{n}\sum_{i=1}^n \mu_t\,\varphi^*\big(\frac{g(\theta_t; Z_i)-\eta_t}{\mu_t}\big) + \frac{\rho\mu_t}{n} + \eta_t\Big)\Big]_+$, where $\beta_t > 0$ is a step size
5: until converged
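As a concreteness check, the inner infimum in (2.5) can be evaluated numerically at a fixed $\theta$. Below is a minimal sketch, assuming numpy/scipy; the optimizer choice is ours, and Algorithm 1 instead folds this minimization into the primal step:

```python
import numpy as np
from scipy.optimize import minimize

def phi_star(s):
    # Convex conjugate of phi(t) = (t - 1)^2 (chi^2-divergence): phi*(s) = s^2/4 + s.
    return s ** 2 / 4.0 + s

def robust_constraint_value(g_vals: np.ndarray, rho: float) -> float:
    """Dual form (2.5): inf over mu > 0, eta in R of
    (1/n) sum_i mu * phi*((g_i - eta)/mu) + rho * mu / n + eta."""
    n = len(g_vals)

    def dual_obj(x):
        mu, eta = np.exp(x[0]), x[1]  # parameterize mu = exp(.) to keep mu > 0
        return np.mean(mu * phi_star((g_vals - eta) / mu)) + rho * mu / n + eta

    x0 = np.array([np.log(g_vals.std() + 1e-8), g_vals.mean()])
    return minimize(dual_obj, x0, method="Nelder-Mead").fun
```

For this particular conjugate the infimum has the closed form $\bar g_n + \sqrt{\rho}\,\hat\sigma_n/\sqrt{n}$, where $\bar g_n$ and $\hat\sigma_n^2 = \frac1n\sum_i (g_i - \bar g_n)^2$ are the sample mean and variance of the $g_i$'s, which matches the margin form (2.4) exactly and provides a handy unit test for any numerical implementation.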
3 Multiple expected value constraints

In this section, we extend the results from the preceding section to stochastic optimization problems with multiple data-dependent constraints. Consider a stochastic optimization problem with $K$ expected value constraints:

$$\theta^\star \in \arg\min_{\theta\in\Theta}\ \mathbb{E}_{P_0}[f(\theta; Z)] \quad \text{subject to} \quad \mathbb{E}_{P_0}[g_k(\theta; Z)] \le 0,\ k\in[K]. \tag{3.1}$$

Following the development in Section 2, we enforce the expected value constraints with robust versions of the sample average constraints:

$$\hat\theta_n \in \arg\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) \quad \text{subject to} \quad \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho_k/n} \mathbb{E}_P[g_k(\theta; Z)] \le 0,\ k\in[K], \tag{3.2}$$

where $\rho = (\rho_1, \ldots, \rho_K)^\top$ are the uncertainty set radii for the constraints. There are other approaches to enforcing multiple constraints that result in constraint generalization; we focus on (3.2) here because it allows the user to adjust the constraint generalization probability for different constraints.

First, we extend Theorem 2.1 and Corollary 2.2 to problems with multiple (expected value) constraints. We assume:

1. Smoothness and concentration: for $k\in[K]$, $f$ and $g_k$ are twice continuously differentiable with respect to $\theta$, and $f(\theta^\star; Z)$, $\nabla f(\theta^\star; Z)$, $g_k(\theta^\star; Z)$, $\nabla g_k(\theta^\star; Z)$ are sub-Gaussian random variables.

2. Uniqueness: the stochastic optimization problem with $K$ expected value constraints (3.1) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.

3. Strict complementarity: $\lambda^\star \in \mathrm{int}(\mathbb{R}^K_+)$; i.e., each component of $\lambda^\star$ is strictly positive.

4. Positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

The strict complementarity condition seems especially strong here because it requires all the constraints to be active. It is possible (with extra notational overhead) to state the result in terms of just the active constraints. We refer to Section 5.1 for more information about the unknown active set. Further, as long as the sample size is large enough, the active constraints in (3.2) coincide with the active constraints in (3.1). To keep things simple, we assume all the constraints are active.

Theorem 3.1. Let $\hat\theta_n$ be an optimal solution of (3.2) converging in probability as $n \to \infty$ to $\theta^\star$. Under the standing assumptions, we have

$$\sqrt{n}\begin{bmatrix}\mathbb{E}_{P_0}[g_1(\hat\theta_n; Z)]\\ \vdots\\ \mathbb{E}_{P_0}[g_K(\hat\theta_n; Z)]\end{bmatrix} \rightsquigarrow \mathcal{N}\left(-\begin{bmatrix}\sqrt{\rho_1\,\mathrm{Var}_{P_0}[g_1(\theta^\star; Z)]}\\ \vdots\\ \sqrt{\rho_K\,\mathrm{Var}_{P_0}[g_K(\theta^\star; Z)]}\end{bmatrix},\ \mathrm{Var}_{P_0}\begin{bmatrix}g_1(\theta^\star; Z)\\ \vdots\\ g_K(\theta^\star; Z)\end{bmatrix}\right).$$

Corollary 3.2. Under the conditions of Theorem 3.1, we have

$$\lim_{n\to\infty}\mathbb{P}\big\{\mathbb{E}_{P_0}[g_k(\hat\theta_n; Z)] \le 0 \ \text{for all } k\in[K]\big\} = \mathbb{P}\{U \le \sqrt{\rho}\} \quad \text{(componentwise)},$$

where $\sqrt{\rho} = (\sqrt{\rho_1}, \ldots, \sqrt{\rho_K})^\top$ and $U$ is a Gaussian random vector with mean zero and covariance $\mathrm{diag}\{\mathrm{Var}_{P_0}[g_k(\theta^\star; Z)]\}_{k=1}^{K\,-\frac12}\ \mathrm{Var}_{P_0}\big[(g_1(\theta^\star; Z), \ldots, g_K(\theta^\star; Z))^\top\big]\ \mathrm{diag}\{\mathrm{Var}_{P_0}[g_k(\theta^\star; Z)]\}_{k=1}^{K\,-\frac12}$; i.e., the correlation matrix of the constraint values.

From Theorem 3.1 and Corollary 3.2 (see proofs in Appendices B and C), we find that the probability of constraint satisfaction decreases exponentially as the number of constraints increases. We also see that our method is no longer pivotal for multiple expected value constraints: the uncertainty set radii depend on the (unknown) correlation structure among the constraint values. Fortunately, it is not hard to estimate this correlation structure. The most straightforward way is with the empirical correlation matrix: let $\hat\Sigma_n$ be the empirical covariance matrix of the constraint values; the empirical correlation matrix is then given by $\hat R_n \triangleq \mathrm{diag}(\hat\Sigma_n)^{-\frac12}\,\hat\Sigma_n\,\mathrm{diag}(\hat\Sigma_n)^{-\frac12}$.

Finally, it is straightforward to extend the algorithm for solving (1.3) to (3.2). The Lagrangian of (3.2) is

$$L(\theta, \lambda) \triangleq \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \sum_{k=1}^K \lambda_k \sup_{P:\,D_\varphi(P\,\|\,\hat P_n)\le \rho_k/n} \mathbb{E}_P[g_k(\theta; Z)] = \frac{1}{n}\sum_{i=1}^n f(\theta; Z_i) + \sum_{k=1}^K \lambda_k \inf_{\mu_k\ge 0,\,\eta_k\in\mathbb{R}}\Big\{\frac{1}{n}\sum_{i=1}^n \mu_k\,\varphi^*\Big(\frac{g_k(\theta; Z_i)-\eta_k}{\mu_k}\Big) + \frac{\rho_k\mu_k}{n} + \eta_k\Big\},$$

where we recalled the dual form of the robust constraint function (2.5) in the second step. We see that evaluating the dual function $\inf_\theta L(\theta, \lambda)$ (at a fixed $\lambda$) entails solving a stochastic optimization problem that is suitable for stochastic approximation. This suggests a similar dual ascent algorithm for solving (3.2); we skip the details here (see Algorithm 2 in Appendix D).

4 Simulations

We simulate the frequency of constraint satisfaction for the following multi-item newsvendor problem:

$$\max_{\theta\in\Theta}\ \mathbb{E}_{P_0}\big[p^\top \min\{Z, \theta\}\big] - c^\top\theta \quad \text{subject to} \quad \mathbb{E}_{P_0}\big[\|(Z^{(1)} - \theta^{(1)})_+\|_2^2\big] \le \varepsilon_1,\ \ \mathbb{E}_{P_0}\big[\|(Z^{(2)} - \theta^{(2)})_+\|_2^2\big] \le \varepsilon_2, \tag{4.1}$$

where $c \in \mathbb{R}^d_+$ is the manufacturing cost, $p \in \mathbb{R}^d_+$ is the sell price, $\theta \in \Theta = [0, 100]^d$ is the number of items in stock, $Z \in \mathbb{R}^d$ is a random variable with probability distribution $P_0$ representing the demand, and there are $d$ items in total. The distribution $P_0$ is unknown, but we observe IID samples $Z_1, \ldots, Z_n$ from $P_0$. All of the items have been partitioned into two groups so that the corresponding demand and stock can be written as $Z = (Z^{(1)}, Z^{(2)})$ and $\theta = (\theta^{(1)}, \theta^{(2)})$. The constraints in the problem exclude stock levels that underestimate the demand too much for each group of items, where $\varepsilon_1, \varepsilon_2 > 0$ indicate the tolerance for such underestimation. The goal is to maximize profit while satisfying the constraints. It is easy to rewrite the maximization problem (4.1) as a minimization problem with expected value constraints in the form of (3.1), so we can apply our method (3.2). We pick $P_0$ to be a multivariate Gaussian with independent components so that the two constraints are uncorrelated with each other (see Appendix E for details). Throughout the simulations, we solve (3.2) with $\sqrt{\rho} = (z_{1-\alpha}, z_{1-\alpha})^\top$ for $\alpha \in \{0.4, 0.25, 0.1, 0.05, 0.005\}$. As suggested by our asymptotic theory in Section 3, the nominal probability of constraint satisfaction is $1-\alpha$ for each constraint and $(1-\alpha)^2$ for both constraints due to the independence setup.
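The limiting joint satisfaction probability in Corollary 3.2 can be estimated by plugging the empirical correlation matrix $\hat R_n$ into a multivariate normal CDF. A minimal sketch (assuming numpy/scipy; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_satisfaction_probability(G: np.ndarray, rho: np.ndarray) -> float:
    """Estimate P{U <= sqrt(rho)} from Corollary 3.2, with U ~ N(0, R_hat_n).

    G: (n, K) array of constraint values g_k(theta_hat; Z_i).
    rho: length-K array of uncertainty set radii (rho_1, ..., rho_K).
    """
    Sigma = np.cov(G, rowvar=False)   # empirical covariance Sigma_hat_n
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)        # empirical correlation R_hat_n
    U = multivariate_normal(mean=np.zeros(G.shape[1]), cov=R, allow_singular=True)
    return U.cdf(np.sqrt(rho))
```

With independent constraints and $\rho_k = z_{1-\alpha}^2$ for every $k$, this returns approximately $(1-\alpha)^K$, matching the nominal value used in the simulations below; conversely, one can search over $\rho$ to hit a desired joint probability.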
In Figure 1, we plot the frequencies of constraint satisfaction for each constraint and for both constraints, all averaged over 1000 replicates. As the sample size $n$ grows, the frequency-versus-probability curve converges to the dashed line of the theoretical limiting probability of constraint satisfaction, validating our theory in the large sample regime. For more simulations (e.g., a single constraint, two dependent constraints), we refer to Appendix E.

Figure 1: Frequency versus limiting probability of constraint satisfaction for the first constraint (left), the second constraint (middle), and both constraints (right).

5 Application to fair machine learning

As ML models are deployed in high-stakes decision making and decision support roles, the fairness of the models has come under increased scrutiny. In response, there is a flurry of recent work on mathematical definitions of algorithmic fairness [16, 23, 25] and algorithms to enforce the definitions [1, 10, 38]. A prominent class of fairness definitions is group fairness; such definitions require equality of certain metrics (e.g., false/true positive rates) among demographic groups.

For example, consider a fair binary classification problem. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the input space, $\mathcal{Y} = \{0, 1\}$ be the set of possible labels, and $\mathcal{A}$ be the set of possible values of the protected/sensitive attribute. In this setup, training and test examples are tuples of the form $(X, A, Y) \in \mathcal{X}\times\mathcal{A}\times\mathcal{Y}$, and a classifier is a map $f : \mathcal{X} \to \{0, 1\}$. A popular definition of algorithmic fairness for binary classification is equality of opportunity [23].

Definition 5.1 (equality of opportunity). Let $Y = 1$ be the advantaged label that is associated with a positive outcome and $\hat Y \triangleq f(X)$ be the output of the classifier. Equality of opportunity entails

$$\mathbb{P}\{\hat Y = 1 \mid A = a, Y = 1\} = \mathbb{P}\{\hat Y = 1 \mid A = a', Y = 1\} \quad \text{for all } a, a' \in \mathcal{A}.$$

Equality of opportunity, or true positive rate parity, means that the prediction $\hat Y = f(X)$, conditioned on the advantaged label $Y = 1$, is statistically independent of the protected attribute $A$. Furthermore, an approximate version of equality of opportunity can be readily defined. We say that $\hat Y = f(X)$ satisfies $\varepsilon$-equality of opportunity if

$$\big|\mathbb{P}\{\hat Y = 1 \mid A = a, Y = 1\} - \mathbb{P}\{\hat Y = 1 \mid A = a', Y = 1\}\big| \le \varepsilon \quad \text{for all } a, a' \in \mathcal{A}.$$

In this case, $\varepsilon > 0$ represents a practitioner's tolerance for fairness violations.

Given a parametric model space $\mathcal{H} = \{f_\theta : \theta\in\Theta\}$ and a loss function $\ell : \Theta\times\mathcal{X}\times\mathcal{Y} \to \mathbb{R}_+$, an in-processing fair ML routine minimizes the (empirical) risk $\mathbb{E}[\ell(\theta; X, Y)]$ while satisfying some fairness constraints. Most commonly, definitions of group fairness (including equality of opportunity, demographic parity, and more) can be written as special cases of a general set of linear constraints [1, 2] of the form $M\mu(\theta) \le c$, where the matrix $M \in \mathbb{R}^{K\times T}$ and the vector $c \in \mathbb{R}^K$ encode the constraints; $\mu(\theta) : \Theta \to \mathbb{R}^T$ is a vector of (conditional) moments $\mu_t(\theta) = \mathbb{E}[h_t(X, A, Y, \theta) \mid \mathcal{E}_t]$ for $t\in[T]$, where $h_t : \mathcal{X}\times\mathcal{A}\times\mathcal{Y}\times\Theta \to \mathbb{R}$ and the event $\mathcal{E}_t$ is defined with respect to $(X, A, Y)$. This framework fits our methodology if we note that each (conditional) moment can be written as

$$\mu_t(\theta) = \frac{\mathbb{E}_{(X,A,Y)\sim P_0}\big[h_t(X, A, Y, \theta)\,\mathbb{1}\{\mathcal{E}_t(X, A, Y)\}\big]}{\mathbb{E}_{(X,A,Y)\sim P_0}\big[\mathbb{1}\{\mathcal{E}_t(X, A, Y)\}\big]}. \tag{5.1}$$

Here the indicator $\mathbb{1}\{\mathcal{E}_t\}$ takes value 1 if the event $\mathcal{E}_t$ happens, and 0 otherwise. Moreover, we write $\mathcal{E}_t(X, A, Y)$ to emphasize that $\mathcal{E}_t$ depends only on $(X, A, Y)$ and not on $\theta$ in any way.
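The group-conditional rates in Definition 5.1 and in the moments (5.1) are straightforward to estimate from a labeled sample. A minimal sketch of the empirical $\varepsilon$-equality-of-opportunity check (numpy assumed; the function name is ours, and we assume every group has positive examples):

```python
import numpy as np

def equal_opportunity_gap(y_pred, y_true, a):
    """Largest pairwise gap in empirical true positive rates across groups:
    max_{a, a'} |P_hat{Y_hat=1 | A=a, Y=1} - P_hat{Y_hat=1 | A=a', Y=1}|.
    epsilon-equality of opportunity holds when this gap is <= epsilon."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))
    tprs = [y_pred[(a == g) & (y_true == 1)].mean() for g in np.unique(a)]
    return max(tprs) - min(tprs)

# e.g., equal_opportunity_gap(model_predictions, y, group) <= 0.05 checks 0.05-EO.
```

Each such rate is itself a ratio of the form (5.1), with $h_t = \mathbb{1}\{\hat Y = 1\}$ and $\mathcal{E}_t = \{A = a, Y = 1\}$, which motivates the plug-in treatment that follows.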
Note that (5.1) is a ratio of expected values, which is a non-linear statistical functional of $P_0$. To use our method, we first replace the denominator of $\mu_t(\theta)$ with an estimator, such as the unbiased estimator $\hat{\mathbb{P}}(\mathcal{E}_t) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\mathcal{E}_t(X_i, A_i, Y_i)\}$. The resulting plug-in estimate of $\mu_t(\theta)$ is then linear in $P_0$, allowing us to apply our method (see similar tricks in [8]). We describe the application of our method to $\varepsilon$-equality of opportunity in Appendix F.

5.1 A two-stage method for unknown active set

In practice, it is probable that only a subset of the constraints is active. Furthermore, we do not know beforehand whether or not a constraint is active in the true population problem. To handle this scenario, we propose a two-stage method:

1. At the first stage, we solve the sample average approximation (SAA) problem, i.e., (3.2) with $\rho = 0_K$. By doing so, we identify the active set of the SAA problem.

2. At the second stage, we solve (3.2) with $\rho$ such that $\rho_k$ is a positive number only if the $k$-th constraint, $k\in[K]$, was identified as active at the first stage.

In Appendix G, we show that the two-stage method also enjoys the calibration property (similar to Theorem 3.1 and Corollary 3.2) under standard assumptions (i.e., strict complementarity). At a high level, the limiting probability of satisfying the true constraints depends solely on the correlation structure among the active constraints and the uncertainty set radii for the active constraints, as long as the SAA problem identifies the active constraints with probability tending to 1.

5.2 Proxy dual function for non-differentiable constraints

Constraint functions in fair ML are often non-differentiable. For instance, fairness metrics are typically linear combinations of indicators, which result in non-differentiable rate constraints [8-10]. This prevents the use of gradient-based optimization algorithms. Fortunately, only the dual function evaluation step in Algorithm 1 requires access to gradients. Therefore, we can modify the algorithm by (1) introducing a proxy dual function, which uses a differentiable surrogate $\tilde g$ instead of the non-differentiable $g$ in the dual function evaluation step, and (2) keeping $g$ in the dual ascent step. For an indicator function $h(t) = \mathbb{1}\{t > 0\}$, one can replace it by the sigmoidal function $h_1(t) = (1 + e^{-at})^{-1}$ or the hinge upper bound $h_2(t) = \max\{0, t + 1\}$ to produce tractable surrogates for non-differentiable rate constraints [11, 17, 9]. We summarize the proxy dual ascent algorithm in Appendix H.
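A minimal sketch of the two surrogates mentioned above (numpy assumed): in the proxy dual ascent algorithm, these replace the indicator only inside the dual function evaluation step, while the exact indicator is kept in the dual ascent step.

```python
import numpy as np

def indicator(t):
    """Exact rate-constraint building block 1{t > 0}; kept in the dual ascent step."""
    return (np.asarray(t) > 0).astype(float)

def sigmoid_surrogate(t, a=5.0):
    """Smooth surrogate h1(t) = (1 + exp(-a t))^{-1}; the slope a is a tuning
    parameter (the default here is our choice, not the paper's)."""
    return 1.0 / (1.0 + np.exp(-a * np.asarray(t)))

def hinge_surrogate(t):
    """Hinge upper bound h2(t) = max{0, t + 1} on the indicator 1{t > 0}."""
    return np.maximum(0.0, np.asarray(t) + 1.0)
```

The sigmoid gives a smooth two-sided approximation, while the hinge gives a convex upper bound, so constraints enforced through the hinge are conservative rather than merely approximate.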
5.3 Adult experiments

We compare the frequency of constraint satisfaction (at test time) of the sample average approximation and of our method with nominal probabilities 0.60, 0.75, 0.90, and 0.95 on the Adult dataset from UCI [13]. The classification task is to predict whether an individual's income is higher than $50K per year. The fairness goal is $\varepsilon$-demographic parity ($\varepsilon$-DP):

$$\big|\mathbb{P}(\hat Y = 1 \mid A = 1) - \mathbb{P}(\hat Y = 1 \mid A = 0)\big| \le \varepsilon,$$

where $A = 1$ (male) is the advantaged group and $A = 0$ (female) is the disadvantaged group. We use a logistic regression model for classification and the techniques in this section for implementation. In Figure 2, we show line plots of the frequency of constraint satisfaction and box plots of the classification error rate, all summarized over 100 replicates.

The left panel shows that solving (3.1) directly leads to a one-half chance of constraint violation, while our method's constraint satisfaction frequency matches its nominal value. The price of a higher chance of test-time fairness satisfaction is an increase in classification error rate, as shown in the right panel: going from the baseline to a 95% chance of fairness satisfaction costs roughly a 2% increase in error rate. We refer to Appendices I and K for details and more experiments.

Figure 2: Frequency of constraint satisfaction (left) and classification error rate (right) for demographic parity tolerance $\varepsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05\}$. The baseline (sample average approximation, SAA, nominal probability 0.50) and our method (with nominal probabilities 0.60, 0.75, 0.90, 0.95) are compared.

6 Summary and discussion

We study the problem of exact constraint satisfaction probability in stochastic optimization with expected-value constraints. We propose a distributionally robust reformulation of data-dependent constraints and provide a theoretical guarantee of constraint satisfaction with an asymptotically exact probability specified by the user. To solve the reformulated problem, we propose a scalable dual ascent algorithm and its variants. The computational cost of our algorithm is comparable to that of a standard distributionally robust optimization problem. Our theory on exact constraint satisfaction probability is validated via simulations on a resource-constrained newsvendor problem. The efficacy of our method is empirically demonstrated on fair machine learning applications.

Some data-dependent constraints are by nature non-linear in the underlying probability measure; for example, (5.1) is a ratio of expected values. An intriguing direction for future research is to generalize the methods and theory developed in this work to constraints on non-linear functions of expected values. Such constraints are known as statistical functionals in the statistics literature [19]. The non-linear dependence of the constraint function on the probability measure precludes stochastic approximation as a general way of evaluating the dual function, as the constraint function no longer admits a dual form like (2.5), calling for the development of new algorithms.

Acknowledgments and Disclosure of Funding

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1916271, 2027737, and 2113373.

References

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A Reductions Approach to Fair Classification. arXiv:1803.02453 [cs], July 2018.

[2] Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair Regression: Quantitative Definitions and Reduction-based Algorithms. arXiv:1905.12843 [cs, stat], May 2019.

[3] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019.

[4] Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities. Management Science, 59(2):341-357, November 2012. ISSN 0025-1909. doi: 10.1287/mnsc.1120.1641.
[5] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein Profile Inference and Applications to Machine Learning. arXiv:1610.05627 [math, stat], October 2016.

[6] Xin Chen, Melvyn Sim, and Peng Sun. A Robust Optimization Perspective on Stochastic Programming. Operations Research, 55:1058-1071, 2007. doi: 10.1287/opre.1070.0441.

[7] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints. arXiv:1807.00028 [cs, stat], September 2018.

[8] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In International Conference on Machine Learning, pages 1397-1405. PMLR, 2019.

[9] Andrew Cotter, Heinrich Jiang, Maya Gupta, Serena Wang, Taman Narayan, Seungil You, and Karthik Sridharan. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. Journal of Machine Learning Research, 20(172):1-59, 2019. ISSN 1533-7928.

[10] Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-Player Games for Efficient Non-Convex Constrained Optimization. In Algorithmic Learning Theory, pages 300-332. PMLR, March 2019.

[11] Mark A. Davenport, Richard G. Baraniuk, and Clayton D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1888-1898, 2010.

[12] Erick Delage and Yinyu Ye. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems. Operations Research, 58:595-612, 2010. doi: 10.1287/opre.1090.0741.

[13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[14] John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. October 2016.

[15] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. arXiv:1610.03425 [stat], October 2016.

[16] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness Through Awareness. arXiv:1104.3913 [cs], April 2011.

[17] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832-840. PMLR, 2017.

[18] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. May 2015.

[19] Luisa Turrin Fernholz. Von Mises Calculus for Statistical Functionals, volume 19. Springer Science & Business Media, 2012.

[20] Chao Gao. Robust Regression via Multivariate Regression Depth. arXiv:1702.04656 [math, stat], February 2017.

[21] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying Real-world Goals with Dataset Constraints. Advances in Neural Information Processing Systems, 29, 2016.

[22] Joel Goh and Melvyn Sim. Distributionally Robust Optimization and Its Tractable Approximations. Operations Research, 58(4-part-1):902-917, August 2010. ISSN 0030-364X, 1526-5463. doi: 10.1287/opre.1090.0795.

[23] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of Opportunity in Supervised Learning.
arXiv:1610.02413 [cs], October 2016.

[24] Ruiwei Jiang and Yongpei Guan. Data-driven chance constrained stochastic program. Mathematical Programming, 158(1):291-327, 2016.

[25] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual Fairness. arXiv:1703.06856 [cs, stat], March 2018.

[26] H. Lam and Enlu Zhou. Quantifying uncertainty in sample average approximation. In 2015 Winter Simulation Conference (WSC), pages 3846-3857, December 2015. doi: 10.1109/WSC.2015.7408541.

[27] Henry Lam. Recovering Best Statistical Guarantees via the Empirical Divergence-Based Distributionally Robust Optimization. Operations Research, 67(4):1090-1105, July 2019. ISSN 0030-364X. doi: 10.1287/opre.2018.1786.

[28] Jaeho Lee and Maxim Raginsky. Minimax Statistical Learning with Wasserstein Distances. arXiv:1705.07815 [cs], May 2017.

[29] James Luedtke and Shabbir Ahmed. A Sample Approximation Approach for Optimization with Probabilistic Constraints. SIAM Journal on Optimization, 19(2):674-699, January 2008. ISSN 1052-6234. doi: 10.1137/070702928.

[30] Mahdi Milani Fard, Quentin Cormier, Kevin Canini, and Maya Gupta. Launch and Iterate: Reducing Prediction Churn. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[31] Hongseok Namkoong and John C. Duchi. Stochastic Gradient Methods for Distributionally Robust Optimization with F-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 2216-2224, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9.

[32] Art Owen. Empirical Likelihood Ratio Confidence Regions. The Annals of Statistics, 18(1):90-120, March 1990. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176347494.

[33] Reuven Y. Rubinstein and Alexander Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, volume 13. Wiley, 1993.

[34] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally Robust Logistic Regression. September 2015.

[35] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv:1710.10571 [cs, stat], October 2017.

[36] Wei Wang and Shabbir Ahmed. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters, 36(5):515-519, September 2008. ISSN 0167-6377. doi: 10.1016/j.orl.2008.05.003.

[37] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning Non-Discriminatory Predictors. In Conference on Learning Theory, pages 1920-1953. PMLR, June 2017.

[38] Mikhail Yurochkin and Yuekai Sun. SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness. In International Conference on Learning Representations, September 2020.