# omnipredictors_for_constrained_optimization__5e5a45ce.pdf

Omnipredictors for Constrained Optimization

Lunjia Hu * 1 Inbal Rachel Livni Navon * 1 Omer Reingold * 1 Chutong Yang * 1

Abstract The notion of omnipredictors (Gopalan, Kalai, Reingold, Sharan and Wieder ITCS 2022), suggested a new paradigm for loss minimization. Rather than learning a predictor based on a known loss function, omnipredictors can easily be postprocessed to minimize any one of a rich family of loss functions compared with the loss of hypotheses in a class C. It has been shown that such omnipredictors exist and are implied (for all convex and Lipschitz loss functions) by the notion of multicalibration from the algorithmic fairness literature. In this paper, we introduce omnipredictors for constrained optimization and study their complexity and implications. The notion that we introduce allows the learner to be unaware of the loss function that will be later assigned as well as the constraints that will be later imposed, as long as the subpopulations that are used to define these constraints are known. We show how to obtain omnipredictors for constrained optimization problems, relying on appropriate variants of multicalibration. We also investigate the implications of this notion when the constraints used are so-called group fairness notions.

1. Introduction

A predominant usage for outcome prediction is to inform the choice of a related action. Predicting the probability of a medical condition may help decide on a medical intervention or determine a life insurance premium rate. Predicting the probability of rain may help decide on the method of commuting to work or on a vacation destination or on wedding plans. For each possible action and outcome pair, there may be an associated loss the cost of catching a cold while riding to work on a bike in the rain or perhaps the cost of changing a wedding venue at the last minute. A learning

*Equal contribution 1Computer Science Department, Stanford University, Stanford, USA. Correspondence to: Lunjia Hu <lunjia@stanford.edu>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

algorithm may try to come up with a hypothesis that determines an action to minimize an expected loss based on a particular loss function. The challenge in this prevalent paradigm of loss minimization is that different loss functions call for very different learning algorithms, which is problematic for a variety of reasons (e.g. multiple relevant loss functions or loss functions that are undetermined at the time of learning). The notion of omnipredictors that was introduced recently by Gopalan, Kalai, Reingold, Sharan and Wieder (Gopalan et al., 2022) provides a way to learn a single predictor that can be naturally post-processed (without access to data) to an action that minimizes any one of a very wide collection of loss functions. Gopalan et al. (2022) showed that omniprediction is implied by multicalibrated prediction, a notion introduced by H ebert-Johnson, Kim, Reingold and Rothblum in the algorithmic fairness literature (H ebert-Johnson et al., 2018).

While loss minimization is a natural goal, it may not be the only consideration in choosing an action. There may, for example, be capacity constraints (e.g. a limited number of vaccines) as well as fairness and diversity considerations. In this work, we introduce a notion of omniprediction that applies to the task of loss minimization conditioned on a set of constraints. For example, imagine we are deciding on which patients would receive a medical intervention when the budget for offering that intervention is limited (capacity constraint), or when we want this intervention to be assigned proportionally to the size of two subpopulations (statistical parity), or when we want the probability of receiving an intervention among patients who experience medical complications to be the same in two different subpopulations (equal opportunity). Our notion of omniprediction allows learning a single predictor that could be used to minimize a large collection of loss functions, even when arbitrary subsets of constraints are imposed from a rich family of constraints. We show how to formalize such a notion (exposing subtleties not existing in the original notion of omniprediction), how to obtain it using some variants of multicalibration, demonstrating that seeking an accurate depiction of the current world may be useful even when the final goal is a socially engineered action. Finally, we study the interaction between loss minimization and fairness constraints, showing that loss minimization has the potential to support fairness objectives.

Omnipredictors for Constrained Optimization

Unconstrained Omniprediction. We assume a distribution D, over pairs (x, y), where x X represents an individual, and y represents an outcome associated with x. For example, x is the attributes of a patient and y is whether that patient experienced a specific medical condition (in this paper, we will consider Boolean outcomes, i.e., y {0, 1}, but the notion could be generalized). We consider individual loss functions. A loss function ℓis applied to an action a and an outcome y and signifies the loss ℓ(y, a) incurred when taking action a and observing outcome y (as we will discuss below, our results apply to a more general set of loss functions that can take into account membership of an individual in some predefined subpopulation).

The learning task of loss minimization is to learn a function c mapping individuals to actions such that the expected loss, E(x,y) D[ℓ(y, c(x))], is at least as small (up to some error term) as E(x,y) D[ℓ(y, c (x))] for any function c in a hypothesis class C. Note that different loss functions may require different functions c and different learning algorithms to train them. The notion of omniprediction offers a way for a single algorithm to learn a predictor p : X [0, 1] that allows optimizing any loss function in a rich family (e.g. all loss functions that are convex and κ-Lipschitz in the action). In this sense, p imitates the true probability predictor p : X [0, 1] where p (x) = Pr D[y = 1 | x]. Note that for every nice loss function, it is fairly easy to transform p (x) to an action a = τℓ(p (x)) that individually minimizes ℓ(y, a) (conditioned on x). Loosely, p is an (L, C)-omnipredictor if for every ℓ L, applying τℓto p to get c(x) = τℓ(p(x)) minimizes loss ℓcompared with the class C. An omnipredictor resolves the aforementioned disadvantage of traditional loss minimization as it can be trained without knowledge of the specific loss function chosen and the loss function is only needed to decide on an action.

It has been shown in (Gopalan et al., 2022) that omniprediction is a somewhat surprising application of the notion of multicalibration, introduced by H ebert-Johnson et al. (2018) with the motivation of preventing unfair discrimination. Calibration roughly asks that every prediction value be accurate on average over the instances when the prediction value is given. Multicalibration asks a predictor to be calibrated not only over the entire population but also on many subpopulations (thus, a multicalibrated predictor cannot trade the accuracy of a relevant minority group for the benefit of the majority population). Ignoring some subtleties, a predictor p is C-multicalibrated (up to error α) if for all c C, P

v E(x,y) D[(y v)c(x)1(p(x) = v)] α, where the summation is over v in the range of p (we assume the range is finite). It is shown in (Gopalan et al., 2022) that a Cmulticalibrated predictor is also an (L, C)-omnipredictor for a wide class of loss functions (all convex and Lipschitz loss functions), and Gopalan et al. (2023) relax the multicali-

bration requirement to calibrated multiaccuracy when the loss functions have additional properties (e.g. when they are induced by generalized linear models). As we discuss in Appendix G, many previous algorithms can construct multiaccurate and multicalibrated predictors, and some of these algorithms have been implemented in real applications such as mortality risk prediction (Barda et al., 2020).

Constraints are Essential but Challenging. Omnipredictors constructed in previous work (Gopalan et al., 2022; 2023) allow us to efficiently solve various downstream loss minimization tasks. Each of these tasks aims to minimize the expectation of a loss function and beyond that the solutions to these tasks are not guaranteed to satisfy any nontrivial constraints. However, many loss minimization problems in practice naturally come with constraints that cannot be simply expressed as minimizing an expected loss E(x,y) D[ℓ(y, c(x))]. For example, if an action c(x) represents the amount of resources allocated to individual x, it is common to impose a budget constraint E[c(x)] B for an average budget B per individual. Other natural constraints come from the algorithmic fairness literature and are known as group fairness notions. Here, we assume that the entire set X of individuals is partitioned into t subpopulations (i.e., groups) S1, . . . , St. Common examples of group fairness constraints include statistical parity (E[c(x)|x Si] being approximately equal for every choice of i = 1, . . . , t), equal opportunity (E[c(x)|x Si, y = 1] being approximately equal for every i), and equalized odds (for every b = 0, 1, the expectation E[c(x)|x Si, y = b] being approximately equal for every i).

Constraints as basic as the budget constraint already impose challenges to the omniprediction results in previous work. This is because in previous work the final action c(x) = τℓ(p(x)) is extremely local: it depends only on the loss function ℓand the prediction p(x) for that single individual x. Even if p(x) equals the true conditional probability Pr D[y = 1|x], such local actions that completely ignore the marginal distribution over individuals and the predictions p(x ) for other individuals x X \ {x} cannot in general minimize the squared loss under even the simplest budget constraint (see Appendix A). While a loss function can be optimized for every individual separately, to determine whether an action c(x) would violate the budget constraint, it is necessary to know the actions c(x ) assigned to other individuals x X \ {x}. When constraints are present, omnipredictors are only possible when we allow more flexible ways of turning predictions into actions.

1.1. Our Contributions

We start by generalizing the powerful notion of omniprediction to more widely-applicable loss minimization tasks that have constraints.

Omnipredictors for Constrained Optimization

Defining Omniprediction for Constrained Loss Minimization. We consider constrained loss minimization tasks in general forms, where every task has an objective function f0 : X A {0, 1} R and a collection of constraint functions fj : X A {0, 1} R indexed by j J. The goal of the task is to find an action function c : X A that minimizes the objective E(x,y) D[f0(x, c(x), y)] while satisfying the constraints E(x,y) D[fj(x, c(x), y)] 0 for every j J. Results in this paper extend to more general tasks where we use an arbitrary Lipschitz function to combine constraints as well as objectives (Appendix E).

Following previous work, for a class T of tasks and a class C of hypotheses c : X A, we say a predictor p : X [0, 1] is an omnipredictor if it allows us to efficiently solve any task T T compared to the hypotheses in C. More specifically, in our constrained setting, an omnipredictor p allows us to efficiently produce a good action function c : X A for any task T T such that c approximately satisfies all the constraints in T, and the objective achieved by c does not exceed (up to a small error) the objective of any c C that satisfy all the constraints of T.

A key challenge in formalizing omniprediction for constrained loss minimization is to specify the procedure of efficiently turning a predictor p : X [0, 1] into an action function c : X A for a specific task T T . As discussed earlier, previous work only allows c(x) to be τ(p(x)) for a transformation function τ that only depends on T, and this local transformation is not sufficient in our constrained setting. We need more flexible transformations, and we also need to maintain the efficiency of such transformations. We solve this challenge by examining the semantics behind the transformation τ(p(x)) in previous work: this transformation corresponds to solving the task T optimally while pretending that p(x) is the true conditional probability Pr D[y = 1|x]. We thus use transformations induced by solving the task on a simulated distribution defined by p in our definition of omniprediction (Definition 2.1). We show that this not only makes omniprediction possible for constrained problems, but also maintains the efficiency of the transformation. Moreover, as we discuss below, we can construct omnipredictors for important families of constrained loss minimization problems from group-wise variants of the multiaccuracy and/or multicalibration conditions. Note that conditions such as multiaccuracy and multicalibration are already needed in previous omniprediction results that do not handle constraints!

Constructing Omnipredictors for Group Objectives and Constraints. We develop omnipredictors for an important class of constrained loss minimization tasks, namely, tasks with group objectives and constraints. Here, as in many problems in the fairness literature, we assume that the set X of individuals is partitioned into t groups S1, . . . , St, and

we let g : X [t] denote the group partition function, i.e., g(x) = i if and only if x Si. We say an objective/constraint function f : X A {0, 1} R is a group constraint if there exists f : [t] A {0, 1} R such that f(x, a, y) = f (g(x), a, y) for every (x, a, y) X A {0, 1}. Tasks with group objectives and constraints are significantly more general than unconstrained tasks in previous work with a loss function ℓ(y, a) that does not depend on the individual x at all.

In Section 4, we show that omnipredictors for loss minimization problems with group objectives and constraints can be obtained from group-wise multiaccuracy and/or multicalibration conditions. Here, group-wise multiaccuracy and multicalibration require the predictor to satisfy multiaccuracy and multicalibration when conditioned on every group Si (see Section 2.3 for formal definitions). Specifically, we show the following results from the simplest setting to more challenging ones:

1. We start by considering a simple but general class of objectives/constraints that are convex and special (Definition 4.2). Objectives in this class include the common ℓ1 loss, the squared loss, loss induced by generalized linear models (up to scaling), and group combinations of these loss functions (e.g. each group chooses the ℓ1 or the squared loss). Constraints in this class include budget constraints and group fairness constraints such as statistical parity, equal opportunity, and equalize odds. In Theorem 4.4, we show that omnipredictors for tasks with convex and special group objectives and constraints can be obtained from group multiaccuracy w.r.t. the hypothesis class C plus group calibration. This generalizes the results in (Gopalan et al., 2023) to our constrained and multi-group setting.

2. In Theorem 4.6, we show that for general convex and Lipschitz group objectives and constraints, we can construct omnipredictors from group multicalibration w.r.t. C. This generalizes the results in (Gopalan et al., 2022) to our constrained and multi-group setting.

3. In Theorem 4.7, we show that for general (non-convex) group objectives and constraints, omnipredictors can be obtained from group calibration plus group levelset multiaccuracy w.r.t. C, namely, being accurate in expectation over individuals x Si with c(x) = a for every group i, hypothesis c C, and action a.

We provide counterexamples in Appendix I to show that it is necessary to strengthen multiaccuracy/multicalibration to their group-wise and occasionally level-set variants in our constrained setting.

We prove all our omniprediction results in a unified and streamlined fashion using Lemma 3.1. Previously, Gopalan

Omnipredictors for Constrained Optimization

et al. (2023) also aim to build a unified framework for omnipredictors using the notion of outcome indistinguishability (Dwork et al., 2021). While the initial omniprediction result in (Gopalan et al., 2022) requires multicalibration (as an unconstrained special case of our Theorem 4.6), Gopalan et al. (2023) only require a weaker calibrated multiaccuracy condition (as an unconstrained special case of our Theorem 4.4) and they provide a simpler and more structured analysis than Gopalan et al. (2022). However, the result in Gopalan et al. (2023) requires the loss functions to satisfy additional properties (we call such loss functions special objectives in Definition 4.2), and it particularly focuses on loss functions induced by generalized linear models. That is, Gopalan et al. (2023) fall short of providing a simple analysis that fully reconstructs the result in Gopalan et al. (2022). By proving Theorems 4.4 and 4.6, we show that our streamlined analysis using Lemma 3.1 can not only reconstruct the results in Gopalan et al. (2022; 2023), but also generalize them to the constrained setting.

Loss Minimization Can Augment Fairness. When solving an optimization task T using an omnipredictor p, for fairness and interpretability reasons, it is natural to require the solution c to be rank-preserving. That is, we require c(x) c(x ) when p(x) p(x ). For example, this could mean that we grant higher loans to individuals predicted more likely to repay it. A violation of the rank-preserving property corresponds to granting excessive loans to people that are likely to default on it, which causes harm to these individuals as well as the ones that deserve the loans more. With group constraints, it makes more sense to only require ranks to be preserved within each group, i.e., for individuals x, x satisfying g(x) = g(x ). This is a necessary relaxation, as group fairness constraints aim to increase opportunities for individuals from certain groups (e.g. to rectify historical discrimination), which would not always preserve ranks between individuals from different groups. However, some unreasonable objectives would incentivize the solution to be not rank-preserving even within a group. For example, an unreasonable objective could be f(x, a, y) = 1 |a y| for all x Si, and f(x, a, y) = |a y| for all other x, assuming the actions a are in [0, 1] after scaling. This objective incentivizes giving loans to individuals in Si that are likely to default on it, instead of those that are likely to repay it. A group fairness constraint, such as parity, can enforce giving a fair total amount of loan to the individuals in Si but cannot promise that the loans are given to those predicted to be more likely to repay it. This limitation of group fairness notions has been repeatedly demonstrated (cf. (Dwork et al., 2012) for an early example), and often abuses of these notions lead to violations of the rank-preserving requirement as in the example. In Section 5, we formally study the conditions of the objective and constraints under which we can ensure that the solution c obtained from an

omnipredictor p is rank-preserving within every group.

1.2. Related Work

Loss minimization under fairness or other constraints is a rich research area. For any given fairness definition, it is natural to ask how to learn under the corresponding constraints and how to minimize loss (or maximize utility). This has been studied for various group notions of fairness (cf (Zafar et al., 2017b)) but also for more refined notions such as metric fairness and multi-group metric fairness (Dwork et al., 2012; Rothblum & Yona, 2018; Kim et al., 2018). A common approach to combining loss minimization with fairness constraints is to add a fairness regularizer to the risk minimization (Donini et al., 2018; Kamishima et al., 2012; Zafar et al., 2017b). Non-convex constraints have been considered in (Cotter et al., 2019). Accordingly, they also formulate the problem as a non-convex optimization problem which may be hard to solve. There is also a line of empirical work on loss minimization with fairness constraints (Zemel et al., 2013; Zafar et al., 2017a; Goh et al., 2016). Finally, some recent related works focus on other learning setting under fairness constraint, like learning policies (Nabi et al., 2019), online learning (Bechavod & Roth, 2022), federated learning (Hu et al., 2022b), and ranking (Dwork et al., 2019).

A key difference between our work and most previous work on loss minimization is that we aim for learning a single predictor that can efficiently solve a variety of downstream constrained loss minimization tasks. Moreover, as we do not make any assumption on the true data distribution D, we consider it infeasible to learn the distribution D entirely and we only require conditions such as multicalibration that can be much easier to achieve using existing algorithms in the literature. Some works, such as (Celis et al., 2019; Agarwal et al., 2018; Narasimhan, 2018; Sharifi-Malvajerdi et al., 2019), can deal with multiple loss minimization tasks but they require approximately learning the true distribution D within a small total variation distance or approximately learning the true labels.

In an influential paper, Hardt, Price and Srebro (Hardt et al., 2016) propose equalized odds and equal opportunity as group notions of fairness. They give methods of postprocessing a predictor to enforce these constraints while minimizing loss. They show optimality compared with solutions that can be obtained from post-processing the predictor, whereas in this work we directly aim for optimality with respect to a rich pre-specified hypothesis class C. We consider more general loss functions with real-valued actions compared to the loss functions in (Hardt et al., 2016) that only take binary values as input, and we also consider more general constraints beyond the group fairness constraints in (Hardt et al., 2016).

Rothblum & Yona (2021) use the notion of outcome in-

Omnipredictors for Constrained Optimization

distinguishability (Dwork et al., 2021), closely related to multicalibration, to obtain loss minimization, not only on the entire population but also on many subpopulations. Their approach relies on a locality property of the loss function which they term f-proper. When this property is satisfied, for every fixed individual x0 X, the optimal action c(x0) for that individual x0 only depends on E[y|x = x0] and not on E[y|x = x1] for other individuals x1 X \ {x0}. In our constrained setting, this locality property fails to hold: to satisfy a group constraint, the action c(x0) must coordinate with the actions c(x1) for other individuals x1 in or out of the group/subpopulation of x0.

Independently of our work, Globus-Harris et al. (2022) also study the problem of solving downstream tasks by postprocessing multicalibrated predictors. They focus on the 0-1 loss for classification tasks and thus their results do not imply the full power of omnipredictors that handle arbitrary loss functions from a rich family. They also focus on a few specific group fairness constraints, whereas we consider more general classes of constraints. By assuming multicalibration with respect to delicately-designed classes, their predictors can be efficiently post-processed to satisfy constraints on intersecting groups. Again independently of our work, Kim & Perdomo (2023) study omniprediction in an (unconstrained) performative setting, where the distribution of the outcome y of an individual x can change based on the action c(x).

1.3. Limitations and Social Impacts

While our work is theoretical, we view it as giving a foundation and proof-of-concept for potential omnipredictors to be deployed in the real world with fairness considerations. Omnipredictors allow us to efficiently solve optimization problems and adapt to rich families of objectives and constraints, but carelessly choosing constraints and objectives for the omnipredictors may not always lead to good decisions. In some situations, different fairness constraints can lead to contradictory fairness guarantees, and choosing a wrong fairness constraint may lead to inappropriate actions. Our results in Section 5 are motivated by situations where fairness constraints alone are not enough for acceptable actions and a good objective is needed to augment the fairness constraints for arguably fairer actions. In terms of limitations of the results, our omnipredictors rely on a fixed group partition g and a fixed (unknown) distribution D, and it remains an interesting question to construct omnipredictors that can adapt to changes in g and D as well. Addressing this limitation could help protect new and evolving subpopulations. It is an interesting question whether our techniques (in particular Lemma 3.1) can be applied to tasks with more general outcomes beyond binary outcomes y {0, 1}. Unconstrained versions of such tasks have been considered by Gopalan et al. (2022), and we leave it for future work to

generalize their results to the constrained setting.

2. Problem Setup

Throughout the paper, we use X to denote a non-empty set of individuals, and use D to denote a distribution over X {0, 1}. We use A to denote a non-empty set of actions, and use c : X A to denote an action function that assigns an action c(x) to every individual x X (e.g. hiring the individual or not). We occasionally consider a randomized action function c : X A that assigns every individual x X a distribution c(x) A over actions in A. For generality we sometimes only make statements about randomized action functions, where one should view a deterministic action function c : X A as the randomized action function c : X A where c (x) A is the degenerate distribution supported on c(x) for every x X.

2.1. Constrained Loss Minimization Tasks

Given a loss function f0 : X A {0, 1} R and a collection of constraints fj : X A {0, 1} R indexed by j J, we define a constrained loss minimization task T to be the following optimization problem:

minimize c:X A E (x,y) D f0(x, c(x), y) (1)

s.t. E (x,y) D fj(x, c(x), y) 0 for every j J.

It is often challenging to solve a task T optimally, and we need to consider approximate and potentially randomized solutions. For β R and ε R 0, we define sol D(T, β, ε) to be the set of randomized action functions c : X A satisfying

E (x,y) D E a c(x) f0(x, a, y) β, and

E (x,y) D E a c(x) fj(x, a, y) ε for every j J.

For a class C of functions c : X A, we define

opt D(T, C, ε) := inf{β R : C sol D(T, β, ε) = }.

Note that opt D(T, C, ε) may take any value in R { }, where we define inf = + . In Appendix E, we show how results in this paper extend to more general tasks where we combine constraints and objectives using arbitrary Lipschitz functions.

2.2. Omnipredictors for Constrained Loss Minimization

An omnipredictor, as introduced by Gopalan et al. (2022), allows us to solve a family of downstream loss minimization tasks without training a different model from scratch for every task in the family. Previous work focuses on omnipredictors for unconstrained loss minimization (Gopalan

Omnipredictors for Constrained Optimization

et al., 2022; 2023). We generalize this notion to constrained loss minimization as follows.

For a distribution D over X {0, 1} and a predictor p : X [0, 1], we define the simulated distribution Dp to be the distribution of (x, y ) X {0, 1} where we first draw (x, y) from D and then draw y from the Bernoulli distribution Ber(p(x)) with mean p(x). For a hypothesis class C consisting of functions c : X A, suppose we want to solve a downstream constrained loss minimization task T on the true distribution D and we want a comparable or better solution than the best hypothesis c C. An omnipredictor p should allow us to achieve this goal by finding an approximately optimal solution c from another, ideally simpler, hypothesis class C for the same task T but on the simulated distribution Dp defined by the omnipredictor p. Such an omnipredictor is particularly powerful when it works for tasks T from a rich family T and when solving any T T on the simulated distribution Dp over hypothesis class C

is significantly easier than directly solving T on the true distribution D over hypothesis class C. This leads to the following formal definition of an omnipredictor:

Definition 2.1. Let D be a distribution over X {0, 1} and ε 0 be a parameter. Let T be a collection of constrained loss minimization tasks and let p : X [0, 1] be a predictor. For classes C, C of functions c : X A, we say p is a (T , C, C , ε)-omnipredictor on D if the following holds for any T T . Defining β := opt D(T, C, 0) R and β := opt Dp(T, C , ε/3) R, we have

C sol Dp(T, β + ε/3, 2ε/3) sol D(T, β + ε, ε).

Suppose we have an omnipredictor p as in the definition above, and we want to solve an arbitrary constrained loss minimization task T T in comparison with the class C, i.e., we want to find a solution in sol D(T, β + ε, ε). Instead of collecting data points from D and solve the task from scratch, we just need to find a solution in C sol Dp(T, β + ε/3, 2ε/3), i.e., a solution c C that approximately solves the task on the simulated distribution Dp. This is usually much easier than solving the task on the original distribution D for the following two reasons:

Simplicity from Dp. First, since we know p, we know the conditional distribution of y given x in (x, y) Dp, and thus the only unknown part about Dp is the marginal distribution of x, which can be learned from unlabeled data drawn from D (i.e., examples of x in (x, y) D with y concealed). For all the omniprediction results in this paper, we assume that the downstream tasks have a group structure specified by a fixed group partition function g : X [t] (see Section 4). To solve such tasks on the simulated distribution Dp, all we need to know about the marginal distribution of x is the probability that x belongs to each of a few subsets defined independently of the actual downstream

task. We can estimate these probabilities when we train the omnipredictor p, and no additional data (labeled or not) is needed at all when we use p to solve downstream tasks (see Appendix H).

Simplicity from C . Second, solving downstream tasks over the new class C can be computationally much more efficient than solving them over the original class C. In all of the omniprediction results in this paper, we choose C to be very simple (as Cp,g and Crand p,g in Definition 4.3) so that its complexity depends on the number of groups and the size of the range of p, which can be made to be very small (O(1/ε)), whereas C can be significantly more complex. Specifically, every function in Cp,g (resp. Crand p,g ) assigns the same action (resp. same distribution over actions) to individuals x with the same p(x) and g(x). In Appendix H we give very efficient algorithms for solving constrained loss minimization tasks given omnipredictors.

In previous work on omniprediction without constraints, the optimal solution c on the simulated distribution Dp is trivial to find: it is given by choosing c(x) so that Ey Ber(p(x)) f0(x, c(x), y) is minimized (Bayes optimal solution). That is, the optimal c(x) depends only on x, f0, and p(x) (often f0 does not depend on x and thus c(x) only depends on f0 and p(x)). Because of this locality property, previous definitions of omniprediction for unconstrained loss minimization simply explicitly uses the optimal solution on the simulated distribution Dp without defining a task on Dp or even without defining Dp at all. Our Definition 2.1 not only generalizes these previous definitions, but also deals with more challenging tasks with constraints where the locality property fails to hold.

2.3. Group Multiaccuracy and Multicalibration

A main contribution of this paper is showing that omnipredictors for a variety of constrained loss minimization problems can be obtained from group-wise multiaccuracy and/or multicalibration conditions. The notions of multiaccuracy and multicalibration are introduced by H ebert Johnson et al. (2018) and Kim et al. (2019), and there are many algorithms for achieving these notions in previous work (see Appendix G). We define these notions here as special cases of the following generalized multicalibration notion. For the definitions below, we assume D is a distribution over X {0, 1} and ε 0 is a parameter.

Definition 2.2 (Generalized multicalibration (Gen MC) (see e.g. Kim et al., 2022, Definition 1.1 in Supplementary Information)). Let W be a class of functions w : X [0, 1] R. We say a predictor p : X [0, 1] satisfies (W, ε)- generalized multicalibration w.r.t. distribution D if

E (x,y) D[(y p(x))w(x, p(x))] ε for every w W.

Omnipredictors for Constrained Optimization

For simplicity, we additionally require the range of p, range(p) := {p(x) : x X}, to be a finite subset of [0, 1].1

We use Gen MCD(W, ε) to denote the set of predictors p satisfying the conditions above.

We define multiaccuracy and multicalibration below as special cases of Gen MC in a general group-wise setting, by choosing an appropriate function class W in every definition. Here, we assume that the set X of individuals is partitioned into t groups (i.e., subpopulations). We use g : X [t] to denote the group partition function that assigns every individual x X a group index g(x) [t] := {1, . . . , t}.

Definition 2.3 (Group Multiaccuracy (Grp MA)). For a class H of functions h : X R and group index g : X [t], we define Grp MAD(H, g, ε) to be the set Gen MCD(W, ε) where W consists of all functions w : X [0, 1] R such that there exist h H and τ : [t] [ 1, 1] satisfying w(x, v) = h(x)τ(g(x)) for every (x, v) X [0, 1]. We say a predictor p is (H, g, ε)-multiaccurate w.r.t. distribution D if p Grp MAD(H, g, ε). When the distribution D is clear from context, we often drop it and write Grp MA(H, g, ε) (similarly for other definitions below).

In Appendix B we give an equivalent definition of Grp MA in a form closer to similar definitions in the literature. We do this for other definitions below in this section as well. Definition 2.4 (Group Multicalibration (Grp MC)). For a class H of functions h : X R and group index g : X [t], we define Grp MCD(H, g, ε) to be the set Gen MCD(W, ε) where W consists of all functions w : X [0, 1] R such that there exist h H and τ : [t] [0, 1] [ 1, 1] satisfying w(x, v) = h(x)τ(g(x), v) for every (x, v) X [0, 1]. We say a predictor p is (H, g, ε)-multicalibrated w.r.t. distribution D if p Grp MCD(H, g, ε).

The following definition of group calibration is a special case of group multicalibration where H only contains the constant function h that maps every x X to 1: Definition 2.5 (Group Calibration (Grp Cal)). We define Grp Cal D(g, ε) to be the set Gen MCD(W, ε) where W consists of all functions w : X [0, 1] [ 1, 1] such that there exists τ : [t] [0, 1] [ 1, 1] satisfying w(x, v) = τ(g(x), v) for every (x, v) X [0, 1]. We say a predictor p is (g, ε)-calibrated w.r.t distribution D if p Grp Cal D(g, ε).

The following definition is a variant of group multiaccuracy where the transformation τ also takes the function value

1As we discuss in Appendix H, a simple discretization allows us to get a predictor p with range(p) {0, 1/m, 2/m, . . . , 1} for m = O(1/ε) that satisfy all the group multiaccuracy and multicalibration requirements we need for our results. Also, all previous algorithms for achieving multicalibration naturally produce predictors with such discrete ranges.

h(x) as input, and we view individuals x with the same h(x) as belonging to the same level set of h. Definition 2.6 (Group Level-Set Multiaccuracy (Grp LMA)). For an arbitrary finite set A and a class H of functions h : X A, we define Grp LMAD(H, g, ε) to be the set Gen MCD(W, ε) where W consists of all functions w : X [0, 1] [ 1, 1] such that there exist h H and τ : [t] A [ 1, 1] satisfying w(x, v) = τ(g(x), h(x)) for every (x, v) X [0, 1]. We say a predictor p is (H, g, ε)-level-set multiaccurate w.r.t distribution D if p Grp LMAD(H, g, ε).

When the group partition function g is a constant function g0 that assigns every individual to the same group, we recover notions in the standard single-group setting: multiaccuracy (MAD(H, ε) := Grp MAD(H, g0, ε)), multicalibration (MCD(H, ε) := Grp MCD(H, g0, ε)), and calibration (Cal D(ε) := Grp Cal D(g0, ε)).

3. Our Approach

We describe our general approach for constructing and analyzing omnipredictors for constrained loss minimization tasks. Our approach is similar in spirit to the outcome indistinguishability perspective taken by (Gopalan et al., 2023), but our approach is more general: it takes constraints into account and can also be applied to reconstruct the results in previous papers on omnipredictors (Gopalan et al., 2022; 2023). In particular, we overcome the limitation of (Gopalan et al., 2023) that it falls short of fully explaining the initial omnipredictors results in (Gopalan et al., 2022). Our approach is based on the following key lemma: Lemma 3.1. Let D be a distribution over X {0, 1} and ε 0 be a parameter. Let T be a collection of constrained loss minimization tasks and let C, C be classes of functions c : X A. If a predictor p satisfies the following two properties for every T T , then p is a (T , C, C , ε)- omnipredictor on D:

1. Let f0 be the loss function of T and (fj)j J be the constraints of T. For every c C, there exists c C

such that for every j {0} J,

E (x,y) Dp E a c (x) fj(x, a, y)

E (x,y) D E a c(x) fj(x, a, y) + ε/3. (2)

2. For every c C and every j {0} J,

E (x,y) D E a c(x) fj(x, a, y)

E (x,y) Dp E a c(x) fj(x, a, y) + ε/3. (3)

Lemma 3.1 reduces the task of constructing an omnipredictor to satisfying the conditions in (2) and (3). We prove

Omnipredictors for Constrained Optimization

Lemma 3.1 in Appendix C and show how to apply it to construct omnipredictors for a variety of constrained loss minimization tasks in Section 4. Lemma 3.1 allows us to give short and streamlined proofs for all our results in Section 4, and these results generalize previous results in (Gopalan et al., 2022; 2023) as special cases.

4. Omnipredictors from Group Multiaccuracy and Multicalibration

In this section, we apply Lemma 3.1 and show that we can obtain omnipredictors for loss minimization tasks with group objectives and constraints from group multiaccuracy and/or multicalibration conditions. Here, we assume that the individual set X is partitioned into t groups by a group partition function g : X [t] assigning a group index g(x) [i] to every individual x X. Definition 4.1. For a group partition g : X [t], we say an objective/constraint function f : X A {0, 1} R is a group objective/constraint if there exists f : [t] A {0, 1} R such that f(x, a, y) = f (g(x), a, y) for every (x, a, y) X A {0, 1}.

Proofs for the results in this section are deferred to Appendix D. These results show that algorithms in previous work for achieving multiaccuracy and multicalibration allow us to obtain omnipredictors even when constraints are imposed on the loss minimization tasks. We discuss these algorithms in more detail in Appendix G.

We start with a basic case where the objectives and constraints are convex and special, defined below. We use f(x, a) to denote f(x, a, 1) f(x, a, 0). Definition 4.2. Let the action set A R be an interval. We say an objective/constraint function f : X A {0, 1} R is convex if f(x, , y) is convex for every fixed (x, y) X {0, 1}. We say f is special w.r.t a group partition g : X [t] if there exist τ1, τ2 : [t] [ 1, 1] such that f(x, a) = τ1(g(x)) + τ2(g(x))a.

Examples of convex and special group objectives when A = [0, 1] include the ℓ1 loss f(x, a, y) = |a y|/2, the squared loss f(x, a, y) = (a y)2/2, and group-wise combinations of them (every group chooses either ℓ1 or squared loss). As demonstrated in (Gopalan et al., 2023), loss functions induced from generalized linear models are also special after appropriate scaling. Examples of convex and special constraints include all linear constraints, i.e., constraint functions f for which there exist τ1, τ2 : [t] [ 1, 1] and τ3, τ4 : [t] R such that

f(x, a, y) = τ1(i)y + τ2(i)ay + τ3(i) + τ4(i)a (4)

for every (x, a, y) X A {0, 1} where i := g(x). Linear constraints are general enough to express fairness constraints such as statistical parity, equal opportunity (equal

true positive rates), and equalized odds (equal true positive rates and equal false positive rates) as follows. For every group i [t], define ri := Pr[g(x) = i], r+ i := Pr[g(x) = i|y = 1], and r i := Pr[g(x) = i|y = 0]. These fairness constraints can be expressed as2

E[1(g(x) = i)c(x)] = ri E[c(x)], (statistical parity)

E[1(g(x) = i)c(x)y] = r+ i E[c(x)y], (equal true positive rates)

E[1(g(x) = i)c(x)(1 y)] = r i E[c(x)(1 y)]. (equal false positive rates)

Each of the above fairness constraints can be written as E[f(x, c(x), y)] = 0 for an appropriate f satisfying (4). For example, for statistical parity, we choose f as follows:

f(x, a, y) = 1(g(x) = i)a ria. (statistical parity)

Moreover, we can express approximate fairness constraints as a combination of linear constraints because | E[f(x, c(x), y)]| α is equivalent to E[f(x, c(x), y) α] 0 and E[ f(x, c(x), y) α] 0.

For tasks with group objectives/constraints, we often choose the class C in our definition of omnipredictors (Definition 2.1) to be Cp,g and Crand p,g in the following definition: Definition 4.3. For an action set A, a group partition function g : X [t] and a predictor p : X [0, 1], we define Cp,g to be the class consisting of all functions c : X A such that there exists τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X. We define Crand p,g to be the class consisting of all functions c : X A such that there exists τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X.

We now state our omniprediction theorem for convex and special constraints and objectives. In the theorems below, we use D to denote an underlying distribution over X {0, 1} and use C to denote a class of functions c : X A. Theorem 4.4. Let A = [0, 1] be an action set and let g : X [t] be a group partition. Let T be a class of tasks that only have group constraints and group objectives that are all convex and special. Let p be a predictor in Grp MAD(C, g, ε/6) Grp Cal D(g, ε/6) and define Cp,g as in Definition 4.3. Then p is a (T , C, Cp,g, ε)-omnipredictor on D.

We remark that the convexity assumption in the theorem above can be removed if we replace Cp,g with Crand p,g (Theorem D.9), in which case we can handle any finite action

2Here for simplicity we assume that we know ri, r+ i , r i . These quantities can be estimated from unlabeled data and a predictor satisfying group calibration. It is also natural to compute and store estimates for these quantities when we train an omnipredictor before seeing downstream tasks because these estimates are helpful for general tasks, not just for those with group fairness constraints (see Appendix H).

Omnipredictors for Constrained Optimization

set A [0, 1]. Once we construct an omnipredictor using Theorem 4.4 (and other theorems in this section), we can efficiently transform it into nearly optimal actions for any task T T (see Appendix H). Theorem 4.4 generalizes the results in (Gopalan et al., 2023) that hold in the singlegroup unconstrained setting. Our following theorem deals with general convex and Lipschitz group objectives and constraints and it generalizes the results in (Gopalan et al., 2022).

Definition 4.5. We say an objective/constraint function f : X A {0, 1} R is κ-Lipschitz if f(x, , y) is κ-Lipschitz for every fixed (x, y) X {0, 1}. We say f has B-bounded difference if f(x, a) [ B, B] for every (x, a) X A.

Theorem 4.6. Let A = [0, 1] be an action set and let g : X [t] be a group partition. Let T be a class of tasks that only have group objectives and group constraints that are all convex and 1-Lipschitz and have 1-bounded differences. Let p be a predictor in Grp MCD(C, g, ε/15) Grp Cal D(g, ε/15) and define Cp,g as in Definition 4.3. Then p is a (T , C, Cp,g, ε)-omnipredictor on D.

Finally, we consider general group constraints. These constraints allow us to constrain the entire distribution of c(x) (e.g. constraints on Pr[c(x) A ] for A A) and the distribution of c(x) within each group (e.g. constraints on Pr[c(x) A , g(x) = i]).

Theorem 4.7. Let A be a finite non-empty action set and let g : X [t] be a group partition. Let T be a class of tasks with group constraints and group objectives that all have 1-bounded differences. Let p be a predictor in Grp LMAD(C, g, ε/3) Grp Cal D(g, ε/3) and define Crand p,g as in Definition 4.3. Then p is a (T , C, Crand p,g , ε)-omnipredictor on D.

We give counterexamples in Appendix I showing that strengthening standard multiaccuracy and multicalibration to their group-wise and/or level-set variants in the theorems above is necessary.

5. Interaction between Group Fairness and Loss Minimization

In this section we explain how we can use our omnipredictors to get an additional property, which we call rankpreserving. The intuition is that if we assume the predictor p : X [0, 1] describes an approximation to the true probability Pr(x,y) D[y = 1], then we want individuals x with higher p(x) to get higher action values, for real-valued actions a R. This requirement can be thought of as a fairness property, that individuals that are more likely to succeed (within the same group) should get higher actions.

Definition 5.1. Let A R be set of real-valued actions. We

say a transformation τ : [t] [0, 1] A is rank-preserving if for all i [t] and v > v [0, 1] we have τ(i, v) τ(i, v ). Let p : X [0, 1] be a predictor and g : X [t] be a group index function. We denote by rp-Cp,g the set of all functions c Cp,g such that there exists a rankpreserving transformation τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X.

Our goal is to show that for a large class of optimization problems T , the post-processing of the omnipredictor can output an optimal solution that is also rank-preserving. We achieve this by showing that for every problem T T ,

opt Dp(T, rp-Cp,g, ε) opt Dp(T, Cp,g, ε). (5)

Inequality (5) implies that when we solve a downstream task using an omnipredictor, i.e., when we compute a solution in Cp,g sol Dp(T, β + ε/3, 2ε/3) as in Definition 2.1, we can search only within the class rp-Cp,g instead of the entire class Cp,g. This would ensure that our final solution is rank-preserving. Searching over rp-Cp,g can be implemented efficiently by adding linear constraints (for the rankpreserving requirement) to the linear/convex programming in Appendix H.

In Appendix F we prove Lemma F.4 showing that (5) holds for a class of optimization problems when the loss functions and constraints satisfy certain monotonicity conditions, and there is a single linear constraint per group i [t]. In Lemma F.6 we prove a randomized version of (5), when the constraints are independent of the outcome y. We remark that some monotonicity requirements from the loss functions and the constraints are necessary to promise a rank-preserving optimal solution.

Acknowledgments

We thank Cynthia Dwork and Guy Rothblum for discussions that motivated this project and Parikshit Gopalan and Michael Kim for useful discussions. We thank the authors of Globus-Harris et al. (2022) for discussing their work with us.

LH is supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, Omer Reingold s NSF Award IIS-1908774, and Moses Charikar s Simons Investigators award. IL is supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, the Sloan Foundation Grant 2020-13941, and the Zuckerman STEM Leadership Program. OR is supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness and the Simons Foundation Investigators award 689988. CY is supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness and Omer Reingold s NSF Award IIS-1908774.

Omnipredictors for Constrained Optimization

Agarwal, A., Beygelzimer, A., Dud ık, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60 69. PMLR, 2018.

Barda, N., Riesel, D., Akriv, A., Levy, J., Finkel, U., Yona, G., Greenfeld, D., Sheiba, S., Somer, J., Bachmat, E., Rothblum, G. N., Shalit, U., Netzer, D., Balicer, R., and Dagan, N. Developing a COVID-19 mortality risk prediction model when individual-level data are not available. Nature Communications, 11(1):4439, 2020. doi: 10.1038/s41467-020-18297-9. URL https: //doi.org/10.1038/s41467-020-18297-9.

Bechavod, Y. and Roth, A. Individually fair learning with one-sided feedback. ar Xiv preprint ar Xiv:2206.04475, 2022.

Canonne, C. L. A short note on learning discrete distributions. ar Xiv preprint ar Xiv:2002.11457, 2020.

Celis, L. E., Huang, L., Keswani, V., and Vishnoi, N. K. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the conference on fairness, accountability, and transparency, pp. 319 328, 2019.

Cotter, A., Jiang, H., Gupta, M. R., Wang, S., Narayan, T., You, S., and Sridharan, K. Optimization with nondifferentiable constraints with applications to fairness, recall, churn, and other goals. J. Mach. Learn. Res., 20 (172):1 59, 2019.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J. S., and Pontil, M. Empirical risk minimization under fairness constraints. Advances in Neural Information Processing Systems, 31, 2018.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214 226. ACM, 2012.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Learning from outcomes: Evidence-based rankings. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 106 125. IEEE, 2019.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1095 1108, 2021.

Feldman, V. Distribution-specific agnostic boosting. In Yao, A. C. (ed.), Innovations in Computer

Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings, pp. 241 250. Tsinghua University Press, 2010. URL http://conference.iiis.tsinghua.edu. cn/ICS2010/content/papers/20.html.

Globus-Harris, I., Gupta, V., Jung, C., Kearns, M., Morgenstern, J., and Roth, A. Multicalibrated regression for downstream fairness. ar Xiv preprint ar Xiv:2209.07312, 2022.

Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. Satisfying real-world goals with dataset constraints. Advances in Neural Information Processing Systems, 29, 2016.

Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., and Wieder, U. Omnipredictors. In Braverman, M. (ed.), 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, January 31 - February 3, 2022, Berkeley, CA, USA, volume 215 of LIPIcs, pp. 79:1 79:21. Schloss Dagstuhl - Leibniz-Zentrum f ur Informatik, 2022. doi: 10.4230/LIPIcs.ITCS.2022.79. URL https://doi. org/10.4230/LIPIcs.ITCS.2022.79.

Gopalan, P., Hu, L., Kim, M. P., Reingold, O., and Wieder, U. Loss Minimization Through the Lens Of Outcome Indistinguishability. In Tauman Kalai, Y. (ed.), 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 60:1 60:20, Dagstuhl, Germany, 2023. Schloss Dagstuhl Leibniz Zentrum f ur Informatik. ISBN 978-3-95977-263-1. doi: 10.4230/LIPIcs.ITCS.2023.60. URL https://drops. dagstuhl.de/opus/volltexte/2023/17563.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016.

H ebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationallyidentifiable) masses. In International Conference on Machine Learning, pp. 1939 1948. PMLR, 2018.

Hu, L. and Peale, C. Comparative Learning: A Sample Complexity Theory for Two Hypothesis Classes. In Tauman Kalai, Y. (ed.), 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 72:1 72:30, Dagstuhl, Germany, 2023. Schloss Dagstuhl Leibniz-Zentrum f ur Informatik. ISBN 978-3-95977-263-1. doi: 10.4230/LIPIcs. ITCS.2023.72. URL https://drops.dagstuhl. de/opus/volltexte/2023/17575.

Hu, L., Peale, C., and Reingold, O. Metric entropy duality and the sample complexity of outcome indistinguisha-

Omnipredictors for Constrained Optimization

bility. In Dasgupta, S. and Haghtalab, N. (eds.), Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pp. 515 552. PMLR, 29 Mar 01 Apr 2022a. URL https://proceedings. mlr.press/v167/hu22a.html.

Hu, S., Wu, Z. S., and Smith, V. Provably fair federated learning via bounded group loss. ar Xiv preprint ar Xiv:2203.10190, 2022b.

Kalai, A. T., Mansour, Y., and Verbin, E. On agnostic boosting and parity learning. In STOC 08, pp. 629 638. ACM, New York, 2008. doi: 10.1145/ 1374376.1374466. URL https://doi.org/10. 1145/1374376.1374466.

Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Fairness-aware classifier with prejudice remover regularizer. In Joint European conference on machine learning and knowledge discovery in databases, pp. 35 50. Springer, 2012.

Kearns, M. J. and Schapire, R. E. Efficient distribution-free learning of probabilistic concepts (extended abstract). In 31st Annual Symposium on Foundations of Computer Science, St. Louis, Missouri, USA, October 22-24, 1990, Volume I, pp. 382 391. IEEE Computer Society, 1990. doi: 10.1109/FSCS.1990.89557. URL https://doi. org/10.1109/FSCS.1990.89557.

Kim, M., Reingold, O., and Rothblum, G. Fairness through computationally-bounded awareness. Advances in Neural Information Processing Systems, 31, 2018.

Kim, M. P. and Perdomo, J. C. Making Decisions Under Outcome Performativity. In Tauman Kalai, Y. (ed.), 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 79:1 79:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl Leibniz Zentrum f ur Informatik. ISBN 978-3-95977-263-1. doi: 10.4230/LIPIcs.ITCS.2023.79. URL https://drops. dagstuhl.de/opus/volltexte/2023/17582.

Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Blackbox post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247 254, 2019.

Kim, M. P., Kern, C., Goldwasser, S., Kreuter, F., and Reingold, O. Universal adaptability: Targetindependent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4):e2108097119, 2022. doi: 10.1073/pnas. 2108097119. URL https://www.pnas.org/doi/ abs/10.1073/pnas.2108097119.

Nabi, R., Malinsky, D., and Shpitser, I. Learning optimal fair policies. In International Conference on Machine Learning, pp. 4674 4682. PMLR, 2019.

Narasimhan, H. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pp. 1646 1654. PMLR, 2018.

Rothblum, G. N. and Yona, G. Probably approximately metric-fair learning. Unpublished Manuscript, 2018.

Rothblum, G. N. and Yona, G. Multi-group agnostic PAC learnability. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9107 9115. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/ v139/rothblum21a.html.

Sharifi-Malvajerdi, S., Kearns, M., and Roth, A. Average individual fairness: Algorithms, generalization and experiments. Advances in Neural Information Processing Systems, 32, 2019.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pp. 1171 1180, 2017a.

Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Artificial intelligence and statistics, pp. 962 970. PMLR, 2017b.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International conference on machine learning, pp. 325 333. PMLR, 2013.

Omnipredictors for Constrained Optimization

A. Constraints Require Global Transformations

We give a simple example where the optimal action c(x) for a constrained loss minimization problem depends not only on the single prediction p(x) but also on the predictions p(x ) for other individuals x = x, assuming that the predictor p agrees with the ground truth: p(x) = E[y|x]. Let X = {x1, x2} be a set of two individuals, and let D be a distribution of (x, y) X {0, 1} such that the marginal distribution of x is the uniform distribution over X, and for a predictor p : X [0, 1] we have ED[y|x = xi] = p(xi) for every i = 1, 2. Consider the problem of minimizing the expected squared loss ED[(y c(x))2] under a budget constraint ED[c(x)] 1/2 over action functions c : X R. Defining a1 := c(x1), a2 := c(x2), p1 := p(x1), p2 := p(x2), we can write the problem as minimizing

ED[(y c(x))2] = 1

2(p1(1 a1)2 + (1 p1)(0 a1)2) + 1

2(p2(1 a2)2 + (1 p2)(0 a2)2)

2p1(1 p1) + 1

2p2(1 p2) + 1

2(a1 p1)2 + 1

2(a2 p2)2 (6)

under the constraint a1 + a2 1 over the variables a1, a2. Note that the first two terms in the objective (6) are independent of the actions a1, a2, so minimizing (6) is equivalent to minimizing the Euclidean distance from point (a1, a2) to point (p1, p2). The optimal solution (a1, a2) is given by projecting the two dimensional point (p1, p2) onto the feasible region (grey area in Figure 1). It is clear that a1 depends on both p1 and p2, and similarly a2 depends on both p1 and p2. It is straightforward to generalize this example to more than two individuals where the optimal action for any individual depends on the predictions for all the other individuals.

6s7O9U59+Dd1HD9Aj FKPna Bu9RAdog Cj6g L6i7+gie B98D4Fnxep Qat+cw/9Fc G3n9JR/P8=</latexit>1

Figure 1. A simple constrained problem whose optimal solution has global dependence.

B. Proof of Equivalence in Multiaccuracy and Multicalibration Definitions

Below we state equivalent definitions of the notions in Section 2.3.

Equivalently to Definition 2.3, Grp MAD(H, g, ε) is the set of predictors p : X [0, 1] satisfying the following for every h H: X

E (x,y) D[(y p(x))h(x)1(g(x) = i)] ε.

Equivalently to Definition 2.4, Grp MCD(H, g, ε) is the set of predictors p : X [0, 1] satisfying the following for every h H:

E (x,y) D[(y p(x))h(x)1(g(x) = i, p(x) = v)] ε.

where the sum is over i [t] and v range(p).

Equivalently to Definition 2.5, Grp Cal D(g, ε) is the set of predictors p : X [0, 1] satisfying:

E (x,y) D[(y p(x))1(g(x) = i, p(x) = v)] ε.

Omnipredictors for Constrained Optimization

Equivalently to Definition 2.6, Grp LMA(H, g, ε) is the set of predictors p : X [0, 1] satisfying the following for every h H: X

E (x,y) D[(y p(x))1(g(x) = i, h(x) = a)] ε.

We prove the equivalence relationship for Grp MA. Similar proofs can be applied to other definitions. Claim B.1. In Definition 2.3, a predictor p belongs to Grp MA(C, g, ε) if and only if

i [t] | E (x,y) D[(y p(x))h(x)1(g(x) = i)]| ε for every h H. (7)

Proof. We first show that p Grp MA(C, g, ε) implies (7). For a fixed h H, we choose τ : [t] [ 1, 1] such that

τ(i) = sign E (x,y) D[(y p(x))h(x)1(g(x) = i)] , (8)

where sign(v) = 1 if v 0, and sign(v) = 1 if v < 0. By our assumption p Grp MA(C, g, ε),

ε E (x,y) D[(y p(x))h(x)τ(g(x))]

i [t] E (x,y) D[(y p(x))h(x)1(g(x) = i)τ(i)]

i [t] | E (x,y) D[(y p(x))h(x)1(g(x) = i)]|. (by (8))

This proves (7). Now we prove that (7) implies p Grp MA(C, g, ε). For any h H and τ : [t] [ 1, 1],

E (x,y) D[(y p(x))h(x)τ(g(x))]

i [t] E (x,y) D[(y p(x))h(x)1(g(x) = i)τ(i)]

i [t] | E (x,y) D[(y p(x))h(x)1(g(x) = i)]| (by τ(i) [ 1, 1])

ε. (by (7))

This proves p Grp MA(C, g, ε).

Remark B.2. The proof above can be adapted to show that if we restrict τ to only output values in { 1, 1} instead of [ 1, 1], we also get an equivalent definition of Grp MA, and this holds for other definitions in Section 2.3 as well.

C. Proof of Lemma 3.1

We restate and prove Lemma 3.1 below.

Lemma C.1. Let D be a distribution over X {0, 1} and ε 0 be a parameter. Let T be a collection of constrained loss minimization tasks and let C, C be classes of functions c : X A. If a predictor p satisfies the following two properties for every T T , then p is a (T , C, C , ε)-omnipredictor on D:

1. Let f0 be the loss function of T and (fj)j J be the constraints of T. For every c C, there exists c C such that for every j {0} J,

E (x,y) Dp E a c (x) fj(x, a, y)

E (x,y) D E a c(x) fj(x, a, y) + ε/3. (2)

Omnipredictors for Constrained Optimization

2. For every c C and every j {0} J,

E (x,y) D E a c(x) fj(x, a, y)

E (x,y) Dp E a c(x) fj(x, a, y) + ε/3. (3)

Proof. Fix an arbitrary task T T . Define β := opt D(T, C, 0) and β := opt Dp(T, C , ε/3) as in Definition 2.1. By the definition of β, for any β1 > β, there exists c C sol D(T, β1, 0). By (2), there exists c C sol Dp(T, β1 + ε/3, ε/3). This implies that β β1 + ε/3, and thus β β + ε/3. Now we have β + ε/3 β + 2ε/3, and thus

C sol Dp(T, β + ε/3, 2ε/3) C sol Dp(T, β + 2ε/3, 2ε/3). (9)

Inequality (3) implies that for any β2 R and ε R 0, C sol Dp(T, β2, ε ) sol D(T, β2 + ε/3, ε + ε/3), and thus

C sol Dp(T, β + 2ε/3, 2ε/3) sol D(T, β + ε, ε). (10)

Combining (9) and (10) completes the proof.

D. Proofs for Section 4

D.1. Proof of Theorem 4.4

Theorem 4.4. Let A = [0, 1] be an action set and let g : X [t] be a group partition. Let T be a class of tasks that only have group constraints and group objectives that are all convex and special. Let p be a predictor in Grp MAD(C, g, ε/6) Grp Cal D(g, ε/6) and define Cp,g as in Definition 4.3. Then p is a (T , C, Cp,g, ε)-omnipredictor on D.

We first prove three helper lemmas/claims below and then prove Theorem 4.4. Claim D.1. For any predictor p : X [0, 1], any function f : X A {0, 1} R and any c : X A, we have

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) = E (x,y) D[(y p(x)) f(x, c(x))], (11)

where f(x, a) := f(x, a, 1) f(x, a, 0) for every (x, a) X A.

Proof. The claim is proved by plugging the following equation into the left-hand side of (11).

f(x, c(x), y) = f(x, c(x), 0) + y f(x, c(x)).

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y)

= E (x,y) D[f(x, c(x), 0) + y f(x, c(x))] E (x,y) Dp[f(x, c(x), 0) + y f(x, c(x))].

The distributions D, Dp are identical on the x part, therefore f(x, c(x), 0) cancels out. The distribution Dp is defined such that y = 1 with probability p(x), which finishes the proof.

Lemma D.2. In the setting of Theorem 4.4, for every c C, there exists c Cp,g such that for every convex and special group objective/constraint f : X A {0, 1} R, it holds that

E (x,y) Dp f(x, c (x), y) E (x,y) D f(x, c(x), y) + ε/3.

Proof. By Claim D.1,

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) = E (x,y) D[(y p(x)) f(x, c(x))]. (12)

Omnipredictors for Constrained Optimization

Since f is a special objective/constraint, there exist τ1, τ2 : [t] [ 1, 1] such that f(x, c(x)) = τ1(g(x)) + τ2(g(x))c(x). By our assumption that p Grp Cal(g, ε/6), we have

E (x,y) D[(y p(x))τ1(g(x))] ε/6.

By our assumption that p Grp MA(C, g, ε/6), we have

E (x,y) D[(y p(x))τ2(g(x))c(x)] ε/6.

Combining them, we have

E (x,y) D[(y p(x)) f(x, c(x))] = E (x,y) D[(y p(x))(τ1(g(x)) + τ2(g(x))c(x))] ε/3. (13)

Finally, define τ such that τ(i, v) = E[c(x)|g(x) = i, p(x) = v] and define c (x) = τ(g(x), p(x)). It is clear that c Cp,g. Moreover, by the convexity of f, we have

E (x,y) Dp f(x, c (x), y) E (x,y) Dp f(x, c(x), y).

Combining this with (12) and (13) completes the proof.

Lemma D.3. In the setting of Theorem 4.4, for every c Cp,g, for every convex and special group objective/constraint f : X A {0, 1} R, it holds that

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) + ε/3.

Proof. By Claim D.1,

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) = E (x,y) D[(y p(x)) f(x, c(x))]. (14)

Since f is convex and special, there exists τ : [t] A [ 2, 2] such that f(x, a) = τ(g(x), a). Since c Cp,g, there exists τ : X [0, 1] A such that c(x) = τ(g(x), p(x)). Therefore,

E (x,y) D[(y p(x)) f(x, c(x))] = E (x,y) D[(y p(x))τ(g(x), τ (g(x), p(x)))] ε/3, (15)

where the last inequality holds by our assumption that p Grp Cal(g, ε/6). Combining (14) and (15) completes the proof.

Proof of Theorem 4.4. The proof is completed by applying Lemma 3.1 to the setting of Theorem 4.4 and observing that (2) and (3) in Lemma 3.1 can be established by Lemma D.2 and Lemma D.3, respectively.

D.2. Proof of Theorem 4.6

Theorem 4.6. Let A = [0, 1] be an action set and let g : X [t] be a group partition. Let T be a class of tasks that only have group objectives and group constraints that are all convex and 1-Lipschitz and have 1-bounded differences. Let p be a predictor in Grp MCD(C, g, ε/15) Grp Cal D(g, ε/15) and define Cp,g as in Definition 4.3. Then p is a (T , C, Cp,g, ε)- omnipredictor on D.

We first prove three helper lemmas below and then prove Theorem 4.6.

Lemma D.4 ((Gopalan et al., 2022)). Let c : X R be a function. Let g : X [t] be a group partition function. Let f : X R {0, 1} R be a convex 1-Lipschitz group objective/constraint (Definitions 4.1, 4.2 and 4.5). Define τ, τ : [t] R such that τ(i) = E[y|g(x) = i] and τ (i) = E[c(x)|g(x) = i] for every i [t]. Assume that P

i [t] | E(x,y) D[(y τ(i))c(x)1(g(x) = i)]| ε. We have

E (x,y) D[f(x, τ (g(x)), y)] E (x,y) D[f(x, c(x), y)] + 2ε.

Omnipredictors for Constrained Optimization

Lemma D.4 is essentially Theorem 19 in (Gopalan et al., 2022). The only difference is that in (Gopalan et al., 2022), the function f is not allowed to depend on x, whereas in Lemma D.4, we allow f to depend on the group index g(x) of x. The proof in (Gopalan et al., 2022) can be used here without any essential change. Lemma D.5. In the setting of Theorem 4.6, for every c C, there exists c Cp,g such that for every convex 1-Lipschitz group objective/constraint f : X A {0, 1} R with 1-bounded difference, it holds that

E (x,y) Dp f0(x, c (x), y) E (x,y) D f0(x, c(x), y) + ε/3. (16)

Proof. We fix an arbitrary c C and define τ, τ : [t] [0, 1] [0, 1] such that τ(i, v) = E[y|g(x) = i, p(x) = v] and τ (i, v) = E[c(x)|g(x) = i, p(x) = v] for every (i, v) [t] [0, 1].

By our assumption that p Grp Cal(g, ε/15),

E (x,y) D |p(x) τ(g(x), p(x))| ε/15.

By our assumption that p Grp MC(C, g, ε/15), X

v range(p) | E (x,y) D[(y p(x))c(x)1(g(x) = i, p(x) = v)]| ε/15.

Combining the inequalities above, X

v range(p) | E (x,y) D[(y τ(g(x), p(x)))c(x)1(g(x) = i, p(x) = v)]| 2ε/15.

Define c : X A such that c (x) = τ (g(x), p(x)) for every x X. Clearly, c Cp,g. Taking the groups in Lemma D.4 to be {x X : g(x) = i, p(x) = v} here for (i, v) [t] range(p), we have

E (x,y) D f(x, c (x), y) E (x,y) D f(x, c(x), y) + 4ε/15. (17)

By Claim D.1,

E (x,y) D f(x, c (x), y) E (x,y) Dp f(x, c (x), y) = E (x,y) D[(y p(x)) f(x, c (x))]. (18)

Since we assume that f is a group objective/constraint and it has 1-bounded difference, there exists τ : [t] A [ 1, 1] such that f(x, a) = τ (g(x), a). By our definition c (x) = τ (g(x), p(x)),

E (x,y) D[(y p(x)) f(x, c (x))] = E (x,y) D[(y p(x))τ (g(x), τ (g(x), p(x)))].

By our assumption that p Grp Cal(g, ε/15),

E (x,y) D[(y p(x))τ (g(x), τ (g(x), p(x)))] ε/15. (19)

Combining (17), (18), and (19) proves (16).

Lemma D.6. In the setting of Theorem 4.6, for every c Cp,g, for every convex 1-Lipschitz group objective/constraint f : X A {0, 1} R with 1-bounded difference, it holds that

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) + ε/3.

Proof. The proof is similar to the proof of Lemma D.3 and we omit the details. In the proof of Lemma D.3, we use the assumption that p Grp Cal(g, ε/6) and the fact that there exists τ : [t] A [ 2, 2] such that f(x, a) = τ(g(x), a). For our f with 1-bounded difference, we can similarly take τ : [t] A [ 1, 1] and use our assumption that p Grp Cal(g, ε/15) Grp Cal(g, ε/3).

Proof of Theorem 4.6. The proof is completed by applying Lemma 3.1 to the setting of Theorem 4.6 and observing that (2) and (3) in Lemma 3.1 can be established by Lemma D.5 and Lemma D.6, respectively.

Omnipredictors for Constrained Optimization

D.3. Proof of Theorem 4.7

Theorem 4.7. Let A be a finite non-empty action set and let g : X [t] be a group partition. Let T be a class of tasks with group constraints and group objectives that all have 1-bounded differences. Let p be a predictor in Grp LMAD(C, g, ε/3) Grp Cal D(g, ε/3) and define Crand p,g as in Definition 4.3. Then p is a (T , C, Crand p,g , ε)-omnipredictor on D.

We first prove two helper lemmas below and then prove Theorem 4.7.

Lemma D.7. In the setting of Theorem 4.7, for every c C, there exists c Crand p,g such that for every group objective/constraint f : X A {0, 1} R with 1-bounded difference, it holds that

E (x,y) Dp E a c (x) f(x, a, y) E (x,y) D f(x, c(x), y) + ε/3.

Proof. By Claim D.1,

E (x,y) D f(x, c(x), y) E (x,y) Dp f(x, c(x), y) = E (x,y) Dp[(y p(x)) f(x, c(x))]. (20)

Since we assume that f is a group objective/constraint and it has 1-bounded difference, there exists τ : [t] A [ 1, 1] such that f(x, a) = τ(g(x), a). By our assumption that p Grp LMA(C, g, ε/3),

E (x,y) Dp[(y p(x)) f(x, c(x))] = E (x,y) Dp[(y p(x))τ(g(x), c(x)))] ε/3. (21)

Combining (20) and (21), we have

E (x,y) Dp f(x, c(x), y) E (x,y) D f(x, c(x), y) + ε/3. (22)

Now we define τ : [t] [0, 1] A such that τ (i, v) is the conditional distribution of c(x) given g(x) = i and p(x) = v. We define c : X A such that c (x) = τ (g(x), c(x)). Clearly, c Crand p,g . Since f is a group objective/constraint, there exists τ : [t] A { 1, 1} R such that f(x, a, y) = τ (g(x), a, y). Now we have

E (x,y) Dp f(x, c(x), y) = E[E[f(x, c(x), y)|g(x), p(x)]]

= E[E[τ (g(x), c(x), y)|g(x), p(x)]]

E a τ (g(x),p(x)),y Ber(p(x))[τ (g(x), a, y)]

= E (x,y) Dp E a c (x)[f(x, a, y)]. (23)

Combining (22) and (23) completes the proof.

Lemma D.8. In the setting of Theorem 4.7, for every c Crand p,g , for every group objective/constraint f : X A {0, 1} R with 1-bounded difference, it holds that

E (x,y) D E a c(x) f(x, a, y) E (x,y) Dp E a c(x) f(x, a, y) + ε/3. (24)

Proof. By our assumption c Crand p,g , there exists τ : [t] [0, 1] A such that c(x) = τ(g(x), p(x)) for every x X. Consider the joint distribution of (x, a, y) where (x, y) D and a c(x). This distribution can be equivalently defined as follows. We first construct a function τ : [t] [0, 1] A at random, where τ (i, v) A is drawn independently from the distribution τ(i, v) A for every (i, v) [t] [0, 1]. We then draw (x, y) D and choose c(x) = τ (g(x), p(x)). This equivalent construction also works when we replace D with Dp. Therefore, to prove (24), it suffices to prove that for every τ : [t] [0, 1] A,

E (x,y) D f(x, τ (g(x), p(x)), y) E (x,y) Dp f(x, τ (g(x), p(x)), y) + ε/3. (25)

Omnipredictors for Constrained Optimization

By Claim D.1,

E (x,y) D f(x, τ (g(x), p(x)), y) E (x,y) Dp f(x, τ (g(x), p(x)), y)

= E (x,y) D[(y p(x)) f(x, τ (g(x), p(x)))]. (26)

Since we assume that f is a group objective/constraint and it has 1-bounded difference, there exists τ : [t] A [ 1, 1] such that f(x, a) = τ (g(x), a). Therefore,

E (x,y) D[(y p(x)) f(x, τ (g(x), p(x)))]

= E (x,y) D[(y p(x))τ (g(x), τ (g(x), p(x)))]

where the last inequality follows from our assumption p Grp Cal(g, ε/3). Combining (26) and (27) proves (25).

Proof of Theorem 4.7. The proof is completed by applying Lemma 3.1 to the setting of Theorem 4.7 and observing that (2) and (3) in Lemma 3.1 can be established by Lemma D.7 and Lemma D.8, respectively.

D.4. Variant of Theorem 4.4

Theorem D.9. Let D be a distribution over X {0, 1}. Let A [0, 1] be a finite action set. Let T be a class of tasks that only have group constraints and group objectives that are all special. Let C be a class of functions c : X A. Let p be a predictor in Grp MAD(C, g, ε/6) Grp Cal D(g, ε/6) and define Crand p,g as in Definition 4.3. Then p is a (T , C, Crand p,g , ε)-omnipredictor on D.

We first prove two helper lemmas below and then prove Theorem D.9.

Lemma D.10. In the setting of Theorem D.9, for every c C, there exists c Crand p,g such that for every special group objective/constraint f : X A {0, 1} R, it holds that

E (x,y) Dp E a c (x) f(x, a, y) E (x,y) D f(x, c(x), y) + ε/3.

Proof. Using the same argument as in the proof of Lemma D.2, we can show that

E (x,y) Dp f(x, c(x), y) E (x,y) D f(x, c(x), y) + ε/3.

This is the same as (22) as in the proof of Lemma D.7, and the rest of the proof follows the same argument as in the proof of Lemma D.7.

Lemma D.11. In the setting of Theorem D.9, for every c Crand p,g , for every special group objective/constraint f : X A {0, 1} R, it holds that

E (x,y) D E a c(x) f(x, a, y) E (x,y) Dp E a c(x) f(x, a, y) + ε/3.

Proof. The proof follows from the same argument as the proof of Lemma D.8. In Lemma D.8, we use the assumption that p Grp Cal(g, ε/3) and that f has 1-bounded difference. Here we have the assumption that p Grp Cal(g, ε/6), and since we assume f is special and A [0, 1], we know that f has 2-bounded difference.

Proof of Theorem D.9. The proof is completed by applying Lemma 3.1 to the setting of Theorem D.9 and observing that (2) and (3) in Lemma 3.1 can be established by Lemma D.10 and Lemma D.11, respectively.

Omnipredictors for Constrained Optimization

E. Lipschitz Combination of Constraints

We show that all our omniprediction results in Section 4 can be extended to more general constrained loss minimization tasks where we combine the constraints using a Lipschitz function. Specifically, we consider more general tasks where each task T not only has an objective f0 : X A {0, 1} R and constraints fj : X A {0, 1} for j [m], but also has a combining function Γ : Rm R. The task T corresponds to the following optimization problem:

minimize c:X A E (x,y) D f0(x, c(x), y) (28)

s.t. Γ E (x,y) D f1(x, c(x), y), . . . , E (x,y) D fm(x, c(x), y) 0.

The task in (1) can be viewed as a special case of (28) where Γ is the max function: Γ(r1, . . . , rm) = max(r1, . . . , rm). For a task T in the form of (28), for β R and ε R 0, we can again define sol D(T, β, ε) to be the set of randomized action functions c : X A satisfying

E (x,y) D E a c(x) f0(x, a, y) β, and

Γ E (x,y) D E a c(x) f1(x, a, y), . . . , E (x,y) D E a c(x) fm(x, a, y) ε.

Correspondingly, for a class C consisting of functions c : X A, we define

opt D(T, C, ε) := inf{β R : C sol D(T, β, ε) = }.

We can then similarly define omnipredictors for these tasks in the same way as in Definition 2.1.

Here we focus on obtaining omnipredictors for tasks with Lipschitz combining functions Γ. We say Γ is κ-Lipschitz (in the ℓ norm) if |Γ(r1, . . . , rm) Γ(r 1, . . . , r m)| κ maxi [m] |ri r i|. For tasks with 1-Lipschitz combining functions, we have the following analogue of Lemma 3.1:

Lemma E.1. Let T be a class of constrained loss minimization tasks each having a 1-Lipschitz combining function. Let C and C be classes of action functions f : X A as in Definition 2.1. If a predictor p satisfies the following two properties for every T T , then p is a (T , C, C , ε)-omnipredictor:

1. Let f0 be the loss function of T and (fj)j J be the constraints of T. For every c C, there exists c C such that

E (x,y) Dp E a c (x) f0(x, a, y) E (x,y) D E a c(x) f0(x, a, y) + ε/3, and (29) E (x,y) Dp E a c (x) fj(x, a, y) E (x,y) D E a c(x) fj(x, a, y) ε/3 for every j J. (30)

2. For every c C ,

E (x,y) D E a c(x) f0(x, a, y) E (x,y) Dp E a c(x) f0(x, a, y) + ε/3, and (31) E (x,y) D E a c(x) fj(x, a, y) E (x,y) Dp E a c(x) fj(x, a, y) ε/3 for every j J. (32)

Lemma E.1 can be proved similarly to Lemma 3.1 using the observation that (30) implies the following by the 1-Lipschitz assumption on Γ and an analogous observation for (32):

Γ E (x,y) Dp E a c (x) f1(x, a, y), . . . , E (x,y) Dp E a c (x) fm(x, a, y)

Γ E (x,y) D E a c(x) f1(x, a, y), . . . , E (x,y) D E a c(x) fm(x, a, y) + ε/3.

We thus omit the proof of Lemma E.1. The only difference between Lemma E.1 and Lemma 3.1 in the requirements needed for p to be an omnipredictor is the additional absolute values in (30) and (32). As all our proofs in Section 4 are

Omnipredictors for Constrained Optimization

through Lemma 3.1, they can be adapted to tasks with constraints combined by a Lipschitz function using Lemma E.1. The absolute values in (30) and (32) only require us to make sure that for every constraint function f, both f and f satisfy the assumptions needed for our theorems in Section 4 (e.g. we need to replace convex by affine ). Note that all linear constraints f defined in (4) satisfy that both f and f are convex and special. Ideas in this section can be applied to tasks where the objective function is also a Lipschitz combination:

minimize c:X A Γ E (x,y) D f 1(x, c(x), y), . . . , E (x,y) D f m (x, c(x), y)

s.t. Γ E (x,y) D f1(x, c(x), y), . . . , E (x,y) D fm(x, c(x), y) 0.

F. Rank-preserving Transformations of Omnipredictors

In this section we identify cases where we can ensure that our solutions to downstream tasks given an omnipredictor p is rank-preserving (Definition 5.1). Specifically, we identify conditions under which (5) is satisfied.

Our results in this section focus on actions a [0, 1] and assume that the objective function is rank-preserving:

Definition F.1. Let g : X [t] be a group partition function. We say an objective function f0 : X [0, 1] {0, 1} [0, 1] is rank-preserving (within groups), if there exists a function f : [t] [0, 1] {0, 1} such that for all x X, a [0, 1], y {0, 1}, we have f0(x, a, y) = f(g(x), a, y) and for every i [t] and a > a [0, 1],

f(i, a, 1) f(i, a , 1)

f(i, a, 0) f(i, a , 0).

Rank preserving a desired property which holds when the loss function represents the distance between the taken action and the outcome. In particular, the ℓ1 loss and squared loss satisfy it, as well as every loss function of form f(x, a, y) = dist(a, y), when dist is a distance function.

We also assume that the predictor p which we use to solve downstream tasks is monotone:

Definition F.2. Let D be a distribution over X {0, 1} and let g : X [t] be a group partition function. We say a predictor p : X [0, 1] is monotone w.r.t. D and g if for every i [t] and for every v, v [0, 1] satisfying v > v , we have E(x,y) D[y|p(x) = v, g(x) = i] E(x,y) D[y|p(x) = v , g(x) = i].

This monotonicity requirement is satisfied if D = Dp. In Appendix F.1 we describe how to modify p using samples from D to satisfy this requirement even when D is different from Dp.

We start with the simpler case, where we assume that the constraint combines functions that are fixed for each group:

Definition F.3. Let D be a distribution over X {0, 1}. Let g : X [t] be a group partition function. Let A be an action set. Let σ1, . . . , σt : A {0, 1} R be functions. We say a constrained loss minimization task T as in (28) is compatible with σ1, . . . , σt if T only has a single constraint of the form

Γ(ξ1, . . . , ξt) 0 (33)

for some function Γ : Rt R where ξi := E(x,y) D[1(g(x) = i)σi(c(x), y)].

In the lemma below we establish (5) when each σi is restricted to the form σi(a, y) = α1y + α2a + α3ay for some α1, α2, α3 R satisfying α2(α2 + α3) 0. The flexibility in the combining function Γ allows the constraint (33) to express group fairness constraints such as statistical parity and equal opportunity even when each σi is restricted in this special form.

Lemma F.4. Let D be a distribution over X {0, 1}, and g : X [t] be a group partition function. Let A = [0, 1] be the action set, and let σ1, . . . , σt : A {0, 1} R be functions where each σi can be written as σi(a, y) = α1y + α2a + α3ay for some α1, α2, α3 R satisfying α2(α2 + α3) 0. Let T be a task compatible with σ1, . . . , σk and assume its objective function f0 : X A {0, 1} R is rank-preserving and convex. For a predictor p : X [0, 1], assuming p is monotone w.r.t. D and g (which is always satisfied when D = Dp), we have

opt D(T, rp-Cp,g, ε) = opt D(T, Cp,g, ε).

Omnipredictors for Constrained Optimization

Furthermore, given any deterministic c Cp,g, such that c(x) = τ(g(x), p(x)) for a transformation τ : [t] V A, there exists an algorithm running in time polynomial in t, |V | , ε and outputting a transformation τ : [t] V A that is rank-preserving, and c (x) = τ(g(x), p(x)) has the same objective value as c up to a factor of ε with high probability.

We remark that the requirement α2(α2 + α3) 0 cannot be removed. Without this requirement, it could be the case that some functions in rp-Cp,g satisfy the constraint but none of them is rank-preserving. This highlights the importance of picking appropriate loss functions and constraints if we want to achieve fair outcome.

Proof. We prove the claim by an iterative process, taking τ that is not rank-preserving on some inputs and correcting it.

Suppose τ is not rank-preserving, and there exists i [t], v, v V such that (τ(i, v) τ(i, v ))(v v ) < 0. We show a local correction from τ to τ such that τ is rank-preserving on v, v . The final transformation is created by fixing all such violations. We denote

θ = Pr (x,y) D[p(x) = v|g(x) = i, p(x) {v, v }] (34)

qv = E (x,y) D[y|g(x) = i, p(x) = v] (35)

qv = E (x,y) D[y|g(x) = i, p(x) = v ] (36)

Constraint value. Let us assume θ|α2 + α3qv| (1 θ)|α2 + α3qv |. (37)

This is without loss of generality because otherwise we can switch the roles of v and v , in which case θ and 1 θ also switch. Our goal is to update the values of τ(i, v) and τ(i, v ) so that they no longer violate the rank-preserving property while keeping ξi defined below unchanged so that the constraint of T is still satisfied:

ξi := E (x,y) D[1(g(x) = i)σi(τ(g(x), p(x)), y)].

By our assumption, we can write σi as σi(a, y) = α1y + α2a + α3ay for α1, α2, α3 R satisfying α2(α2 + α3) 0. We only change the values of τ(i, v) and τ(i, v ), so to keep ξi unchanged, it suffices to keep the following conditional expectation unchanged:

E (x,y) D[σi(τ(g(x), p(x)), y)|g(x) = i, p(x) {v, v }] (38)

= α1(θqv + (1 θ)qv ) + α2(θτ(i, v) + (1 θ)τ(i, v )) + α3(θqvτ(i, v) + (1 θ)qv τ(i, v )). (39)

Simply switching between τ(i, v) and τ(i, v ) does not work, as the constraint can be violated. Instead, we set τ (i, v) = τ(i, v ), and then we want to set τ (i, v ) to some value keeping the expectation in Equation (38) exactly the same. We denote τ (i, v ) = z and look for z such that

α1(θqv + (1 θ)qv ) + α2(θτ(i, v ) + (1 θ)z) + α3(θqvτ(i, v ) + (1 θ)qv z)

= α1(θqv + (1 θ)qv ) + α2(θτ(i, v) + (1 θ)τ(i, v )) + α3(θqvτ(i, v) + (1 θ)qv τ(i, v ))

z = α2θ + α3θqv α2(1 θ) + α3(1 θ)qv τ(i, v) + α2(1 2θ) + α3((1 θ)qv θqv)

α2(1 θ) + α3(1 θ)qv τ(i, v ). (40)

We can only set τ (i, v ) to a value z [0, 1], so we need to check that the above expression is in this range. We can write z = γτ(i, v) + (1 γ)τ(i, v ) for

γ = α2θ + α3θqv α2(1 θ) + α3(1 θ)qv = θ 1 θ α2 + α3qv

α2 + α3qv . (41)

We prove γ [0, 1] as follows, which implies z [0, 1] (as the range of τ is also [0, 1]). To prove γ 0, we know that θ [0, 1], so we need to show that the second expression is also positive. From the lemma requirement, we have that α2(α2 + α3) 0, together with qv, q v [0, 1] we get that (α2 + qvα3)/(α2 + qv α3) 0 and so γ 0. The fact γ 1 follows from (37).

Omnipredictors for Constrained Optimization

Objective. We are left with showing that the objective of T (i.e., expected value of f0) is not increased by the correction. For simplicity of notations we denote ℓ0(v) = f0(x, τ(i, v), 0) and ℓ1(v) = f0(x, τ(i, v), 1) for some x with g(x) = i. The values of ℓ0(v) and ℓ1(v) are independent of the actual choice of such x because f0 is a group constraint by Definition F.1.

Since the objective value is an expectation and therefore additive, it is enough to analyze the value for x such that g(x) = i, p(x) {v, v }. The original expected objective value for these x s is

E (x,y) D[f0(x, τ(g(x), p(x)), y)|g(x) = i, p(x) {v, v }]

= ℓ1(v)θqv + ℓ0(v)θ(1 qv)

+ ℓ1(v )(1 θ)qv + ℓ0(v )(1 θ)(1 qv ).

The same expectation with τ is given by

E (x,y) D[f0(x, τ (g(x), p(x)), y)|g(x) = i, p(x) {v, v }]

= ℓ1(v )θqv + ℓ0(v )θ(1 qv)

+ f0(i, z, 1)(1 θ)qv + f0(i, z, 0)(1 θ)(1 qv )

ℓ1(v )θqv + ℓ0(v )θ(1 qv)

+ (γℓ1(v) + (1 γ)ℓ1(v ))(1 θ)qv

+ (γℓ0(v) + (1 γ)ℓ0(v ))(1 θ)(1 qv ).

The last inequality holds because of the convexity of f0. Taking the difference, we get

E (x,y) D[f0(x, τ(g(x), p(x), y) f0(x, τ (g(x), p(x), y)|g(x) = i, p(x) {v, v }]

(ℓ1(v) ℓ1(v ))θqv + (ℓ0(v) ℓ0(v ))θ(1 qv)

+ γ(ℓ1(v ) ℓ1(v))(1 θ)qv + γ(ℓ0(v ) ℓ0(v))(1 θ)(1 qv )

= (ℓ1(v) ℓ1(v ))(θqv γ(1 θ)qv )

+ (ℓ0(v ) ℓ0(v))(γ(1 θ)(1 qv ) θ(1 qv)). (42)

We show that the expression above is nonnegative. We focus on the case where v > v , and a similar argument applies to the other case v < v . By our assumption that τ(i, v), τ(i, v ) violate the rank-preserving property, we know that τ(i, v ) τ(i, v). By our assumption that f0 is rank-preserving,

ℓ1(v) ℓ1(v ) = f0(i, τ(i, v), 1) f0(i, τ(i, v ), 1) 0,

ℓ0(v ) ℓ0(v) = f0(i, τ(i, v ), 0) f0(i, τ(i, v), 0) 0. (43)

γ = θ 1 θ (α2 + α3)qv + α2(1 qv)

(α2 + α3)qv + α2(1 qv ).

By our monotonicity assumption on p, we have qv qv . Using our assumption that α2(α2 + α3) 0 in the equation above, we get θ 1 θ 1 qv 1 qv γ θ 1 θ qv qv .

θqv γ(1 θ)qv 0.

γ(1 θ)(1 qv ) θ(1 qv) 0. (44)

Plugging (43) and (44) into (42) proves that (42) is nonnegative. Therefore correcting τ does not increase the objective.

After preforming this step for every pair v, v that violate the rank-preserving property, the resulting transformation τ is rank-preserving.

Omnipredictors for Constrained Optimization

The correction described above uses the exact value of θ, qv, qv . In order to implement such algorithm in practice, we approximate θ, qv, qv , and update τ based on our approximation. Using the approximation instead of the exact values can reduce objective by the approximation error. The running time of the algorithm is polynomial in |V | , t, δ when δ is the accuracy parameter.

The previous theorem modified a transformation τ into a rank-preserving one by correcting its values for every violation. Allowing the correction to be randomized, the theorem holds for a larger collection of constraints. In order to do so, we first define rank-preserving for a randomized transformation.

Definition F.5. A randomized transformation τ : [0, 1] [t] A for A = [0, 1] is rank-preserving within groups, if for every i [t], v > v V and γ [0, 1],

Pr[τ(i, v) γ] Pr[τ(i, v ) γ].

Let p : X [0, 1] be a predictor and g : X [t] be a group index function. We denote by rp-Crand p,g the set of all functions c Crand p,g such that there exists a rank-preserving transformation τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X.

Lemma F.6. Let A [0, 1] be a discrete action set, g : X [t] be a group partition function, and T be a task with constraints that are independent of the outcome. For a predictor p and a distribution D, assuming p is monotone w.r.t. D and g (which is always satisfied when D = Dp), we have

opt D(T, rp-Crand p,g , ε) = opt D(T, Crand p,g , ε).

Furthermore, given any c Crand p,g , such that c(x) = τ(g(x), p(x)) for a transformation τ : [t] V A, there exists an algorithm running in time polynomial in t, |V | , ε and outputting a randomized transformation τ : [t] V A that is rank-preserving, and c (x) = τ(g(x), p(x)) has the same objective value as c up to a factor of ε with high probability.

Proof. The proof follows the same structure of the previous proof. Let τ be a randomized transformation, and assume that there exists v > v such that τ is not rank-preserving on v, v . We describe a single step in an iterative process, transforming τ into τ .

Intuitively, we take the histogram of the values of τ on the input set {x X|g(x) = i, p(x) {v, v }}, and assign v the lower values in the histogram and v the upper ones.

θ = Pr (x,y) D[p(x) = v|g(x) = i, p(x) {v, v }] (45)

θa =θ Pr[τ(i, v) = a] + (1 θ) Pr[τ(i, v ) = a], a A (46)

when the probability in the second definition is over the internal randomness of τ. For every a A, we define the function u : A [0, 1] indicating how much of θa is coming from τ(i, v). That is, for all a A if θ = 0 we have

u(a) = θ Pr[τ(i, v) = a]/θa.

When θa = 0, u(a) can take any value in [0, 1]. Notice that by definition, Pr[τ(i, v ) = a] = θa(1 u(a))/(1 θ).

We define τ by creating an analog function u : A [0, 1], when u indicates if a certain outcome a A is in the upper part of the histogram (and should be assigned to τ (i, v)) or lower part (and should be assigned to τ (i, v )). Fractional values u (a) imply that a is in the middle of the histogram, i.e. assigned to both. For every a A let

a a θa θ 0 if P

a a θa 1 θ 1 θa θ P

a >a θa otherwise. (47)

We are now ready to define τ to equal τ on all except (i, v), (i, v ), in which we have:

a A Pr[τ (i, v) = a] = θau (a)

Omnipredictors for Constrained Optimization

a A Pr[τ (i, v) = a] = θa 1 θ(1 u (a)). (49)

Notice that τ is rank-preserving on inputs (i, v), (i, v ) by definition.

We next show that τ satisfies all of the constraints in the same way was τ. Let fj(i, a, y) = f(i, a) be any constraint that is not a function of y. Then we have

E (x,y) D[f(i, τ(i, p(x)))] = X

a A Pr (x,y) D[τ(i, p(x)) = a]f(i, a).

The transformations τ, τ only differ on inputs (i, v), (i, v ), so it is enough to analyze the difference on these inputs. For every a A,

Pr (x,y) D[τ(i, p(x)) = a|g(x) = i, p(x) {v, v }] = θ Pr[τ(i, v) = a] + (1 θ) Pr[τ(i, v ) = a] = θa.

For the new transformation,

Pr (x,y) D[τ (i, p(x)) = a|g(x) = i, p(x) {v, v }] = θ Pr[τ (i, v) = a] + (1 θ) Pr[τ (i, v ) = a]

θ + (1 θ) θa

1 θ(1 u (a)) = θa.

Therefore, we get that E(x,y) D[f(i, τ(i, p(x)))] = E(x,y) D[f(i, τ (i, p(x)))].

We are left with proving that this correction does not increase the loss. We define qv, qv as in the previous proof.

qv = E (x,y) D[y|g(x)i, p(x) = v] (50)

qv = E (x,y) D[y|g(x)i, p(x) = v ]. (51)

The expected loss of τ on the relevant inputs:

E (x,y) D[f0(i, τ(i, x), y)|g(x) = i, p(x) {v, v }]

a A Pr[τ(i, v) = a] (qvf0(i, a, 1) + (1 qv)f0(i, a, 0))

a A Pr[τ(i, v ) = a] (qv f0(i, a, 1) + (1 qv )f0(i, a, 0))

a A f0(i, a, 1)θa (u(a)qv + (1 u(a))qv )

a A f0(i, a, 0)θa (u(a)(1 qv) + (1 u(a))(1 qv )) .

By definition, the loss of τ is exactly the same only with u instead of u.

Comparing the two losses we get:

E (x,y) D[f0(i, τ(i, x), y)|g(x) = i, p(x) {v, v }] E (x,y) D[f0(i, τ (i, x), y)|g(x) = i, p(x) {v, v }] (52)

a A θa(f0(i, a, 1) f0(i, a, 0))(u(a) u (a))(qv qv ). (53)

Denote γa = (u(a) u (a))(qv qv ). From our assumption, qv qv . From the definition of u (a), for every a A we have X

a a A u (a) X

a a A u(a).

Omnipredictors for Constrained Optimization

a A u(a) = P

a A u (a), we have that P

a γa = 0, and that there exists a such that γa 0 for all a > a, and γa 0 for all a a. Since the function f0 is rank preserving, we have that for every a > a ,

f0(i, a, 1) f0(i, a, 0) f0(i, a , 1) f0(i, a , 0).

Therefore, X

a,γa 0 γa(f0(i, a, 1) f0(i, a, 0)) X

a,γa 0 γa(f0(i, a, 1) f0(i, a, 0)).

Which implies that P

a γa(f0(i, a, 1) f0(i, a, 0)) 0 and the loss of τ is at most the loss of τ.

The final transformation τ is created by repeatedly applying the above step until τ is rank-preserving. The process ends after |V |2 such switching steps.

When performing the algorithm in practice we do not know u, θ, qv, qv exactly and need to approximate them at every step. This adds an error to the algorithm.

F.1. Monotone predictor

In the following claim we show that a calibrated predictor with a discrete range can be modified to one that is monotone (as in Definition F.2) with high probability, by merging small level sets and level sets that are close together. This claim only holds for functions w with bounded range, although the rest of the section holds more generally. We remark that as long as the hypothesis class H contains bounded functions h : X [0, 1], then the claim below holds for all classes W defining group or level-set calibration on Section 2.3. In case of group multi-accuracy or calibration with negative value of τ, the claim below should be run on each part {x|g(x) = i} separately. Claim F.7. Let V [0, 1] be a discrete set, and let W be a class of functions w : X [0, 1] [0, 1] containing a function fv(x, v ) = 1(v = v ) for all v V . Let p : X [0, 1] be a predictor with a discrete range V such that p Gen MCD(W, ε). Then there is an algorithm running in time O(|V |3 1 ε2δ), uses O(|V |3 1 ε2δ) samples, that with probability 1 δ outputs a monotone predictor p Gen MCD(W, 6ε).

Proof. We describe a simple algorithm for merging the levels of p that are too close to each other or too small. We start by looking at the partition of X defined by p, then merge parts that are too small or too close to each other. Let P = P1, . . . P|V | be the partition of x defined by p.

The algorithm sample S of size O(|V |3 1 ε2δ) of (x, y) D, and do:

1. While there exists a part Pi such that Pr(x,y) S[x Pi] < 2ε |V |, merge Pi with its neighbor.

2. While there are Pi, Pj P such that E (x,y) S[y|x Pi] E (x,y) S[y|x Pj] < 2ε

merge Pi, Pj.

3. Set p : X [0, 1] by choosing for every part x Pi the value E(x ,y ) S[y |x Pi].

From Claim J.1, by taking O(|V |3 1 ε2δ), with probability 1 δ/2 the algorithm approximates Pr(x,y) D[p(x) = v] up to an error of ε |V |. After the first step of the algorithm, each Pi has size at least ε |V |. Therefore, from Claim J.1 the algorithm approximates E(x,y) S[y|x Pi] up to an additive error of ε |V | with probability 1 δ/2 for all parts. Assuming all approximations are correct, the predictor p is monotone. Therefore, p is monotone with probability at least 1 δ.

To prove the generalized calibration, we first use the function fv W and get that for every v V , E (x,y) D[(y v)1(p(x) = v)] ε, (54)

Omnipredictors for Constrained Optimization

Assume that the algorithm skips Item 2, and only preforms merging for small sets and asigns new values. Let p be this predictor. Then for p we have, E (x,y) D[(y p (x))w(x, v)]

E (x,y) D[(y p (x))w(x, v)] + E (x,y) D[(p (x) p(x))w(x, v)]

ε + E (x,y) D[(p (x) p(x))w(x, v)]

Pr (x,y) D[p(x) in small Pi] + X

large Pi E (x,y) D[(p (x) p(x))w(x, v)1(x Pi)]

large Pi E (x,y) D[(p (x) p(x))1(x Pi)]

3ε + E (x,y) D[(y v)1(p(x) = v)] + X

E (x,y) D[(y v)1(p (x) = v)] . (55)

Where large Pi s are those that the algorithm does not merge in Item 1. From Equation (54), the first expectation is bounded by ε. From the paragraph above, with probability at least 1 δ/2 the we have E(x,y) S[y|x Pi] E(x,y) D[y|x Pi] ε/ |V | for all large partitions Pi. Together we get E(x,y) D[(y p (x))w(x, v)] 5ε.

Our monotone predictor p has an extra step in Item 2, in which the algorithm merges parts Pi, Pj. The algorithm only merges parts in which the expected value of y, E[y|x Pi] is within distance ε

V . Therefore, even if we preform |V | merges, we have that E (x,y) D[|p (x) p (x)|] ε.

Substituting p (x) instead of p (x) on equation Equation (55) can only increase the expected value by ε.

G. Algorithms for Multiaccuracy and Multicalibration

The computational and sample complexity of learning a multiaccuracy/multicalibrated predictor w.r.t. a function class C using i.i.d. data points from the true distribution D depends on the complexity and structure of the class C. In (H ebert-Johnson et al., 2018), the authors show that the task can be efficiently reduced to weak agnostic learning for C (Kalai et al., 2008; Feldman, 2010). This implies that the sample and computational complexity of learning a multicalibrated predictor cannot be much larger than weak agnostic learning. Hu et al. (2022a) concretely characterize the sample complexity of learning a multiaccurate/multicalibrated predictor in terms of the fat-shattering dimension of C (Kearns & Schapire, 1990), and they also study the sample complexity of multiaccuracy/multicalibration with additional realizability assumptions about D, which is a setting further explored by Hu & Peale (2023) (results in our paper do not require any assumption on D). Gopalan et al. (2023) propose and implement algorithms for calibrated multiaccuracy and demonstrate their efficiency compared to achieving multicalibration. Many of our results in this paper require group multiaccuracy/multicalibration, and such a predictor can be obtained by first learning a multiaccurate/multicalibrated predictor w.r.t. C on each group and then combining. Some of our results in this paper require group level-set multiaccuracy. This can be equivalently viewed as multiaccuracy w.r.t. a larger class C(g) of binary functions c : X { 1, 1} such that there exist c C and τ : [t] A { 1, 1} satisfying c (x) = τ(g(x), c(x)) for every x X. The complexity of C(g) depends on the complexity of C and the group partition g.

H. Optimization Algorithms on the Simulated Distribution

An omnipredictor p, as in Definition 2.1, allows us to solve downstream tasks T T on the true distribution D by solving the task on the simulated distribution Dp. In this section, we show very efficient algorithms for solving the task on the simulated distribution for all the settings we consider in Section 4.

Specifically, in Definition 2.1, we define β := opt Dp(T, C , ε/3) R. Suppose the objective of T is f0 : X A

Omnipredictors for Constrained Optimization

{0, 1} R and the constraints of T are fj : X A {0, 1} R for every j J. The task of finding a solution in C sol Dp(T, β + ε/3, 2ε/3) is to solve the following optimization problem approximately:

minimize c C E (x,y) Dp E a c(x) f0(x, a, y) (56)

s.t. E (x,y) Dp E a c(x) fj(x, a, y) 0 for every j J.

In Theorems 4.4 and 4.6, the action set A is the interval [0, 1], and the objective f0 and the constraints fj are convex group objective/constraints. That is, for every j {0} J, there exists f j : [t] A {0, 1} R such that fj(x, a, y) = f j(g(x), a, y) for every (x, a, y) X A {0, 1}, and the function f j(i, , y) is convex for every i [t] and y {0, 1}. Moreover, the class C is the class Cp,g in Definition 4.3, i.e., C consists of all functions c : X A such that there exists τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X. Thus, (56) becomes the following equivalent problem:

minimize τ:[t] [0,1] [0,1] E (x,y) Dp f 0(g(x), τ(g(x), p(x)), y) (57)

s.t. E (x,y) Dp f j(g(x), τ(g(x), p(x)), y) ε/3 for every j J.

Let V := range(p) denote the range of p. Since the functions c C in Theorem 4.4 and Theorem 4.6 output bounded values c(x) A = [0, 1], we can always make sure that V is finite and has size O(1/ε ) when we require p to be (C, g, ε )- multiaccurate and/or (C, g, ε )-multicalibrated because discretizing the values p(x) to multiples of ε /2 can only increase the group multiaccuracy/multicalibration error by at most ε /2. Let prob(i, v) denote Pr(x,y) Dp[g(x) = i, p(x) = v]. The optimization problem (57) above is equivalent to

minimize τ:[t] V [0,1]

v V prob(i, v) vf 0(i, τ(i, v), 1) + (1 v)f 0(i, τ(i, v), 0) (58)

v V prob(i, v) vf j(i, τ(i, v), 1) + (1 v)f j(i, τ(i, v), 0) ε/3 for every j J.

Suppose for now that we know the probabilities prob(i, v). The optimization problem (58) above is a convex program with size O(t |V | |J|) and thus can be solved efficiently assuming that we can efficiently compute f j for every j {0} J and its sub-gradient. When we do not know prob(i, v), we can estimate it to sufficient accuracy using i.i.d. data points from the marginal distribution of x in (x, y) Dp, which is the same marginal distribution of x in (x, y) D. Thus these data points are exactly unlabeled data points from the true distribution D. By standard concentration results (e.g. Claim J.1), using n = O(ε 2 1 (|V |t + log(1/δ))) data points we can compute an estimate est(i, v) for each prob(i, v) such that with probability at least 1 δ, X

v V |est(i, v) prob(i, v)| ε1/3.

The computation of these estimates est is independent of the actual loss minimization task. Thus it can be done when we train the omnipredictor p, in which case no data is needed when solving a downstream task using p and these estimates.

In Theorems 4.7 and D.9, the action set A is a finite set, and the objective f0 and the constraints fj are group objective/constraints. The class C is the class Crand p,g , i.e., C consists of all functions c : X A such that there exists τ : [t] [0, 1] A satisfying c(x) = τ(g(x), p(x)) for every x X. Thus, (56) becomes the following equivalent problem:

minimize τ:[t] [0,1] A E (x,y) Dp E a τ(g(x),p(x)) f 0(g(x), a, y) (59)

s.t. E (x,y) Dp E a τ(g(x),p(x)) f j(g(x), a, y) ε/3 for every j J.

Defining V and prob(i, v) as before and using τ (i, v, a) to denote the probability mass on a A in τ(i, v), the optimization problem (59) above is equivalent to the following:

minimize τ :[t] V A R

a A prob(i, v)τ (i, v, a) vf 0(i, a, 1) + (1 v)f 0(i, a, 0) (60)

Omnipredictors for Constrained Optimization

a A prob(i, v)τ (i, v, a) vf j(i, a, 1) + (1 v)f j(i, a, 0) ε/3, j J,

a A τ (i, v, a) = 1, (i, v) [t] V,

τ (i, v, a) 0, (i, v, a) [t] V A.

This optimization problem (60) is a linear program of size O(t |V | |A| |J|) and thus can be solved efficiently.

I. Counterexamples

I.1. Group Multiaccuracy is Necessary

We show that the group multiaccuracy and group calibration assumptions in Theorem 4.4 cannot be replaced by standard (non-group-wise) multicalibration. Claim I.1. Let A = [0, 1] be an action set. There exists a non-empty set X over individuals, a group partition function g : X [t], a distribution D over X {0, 1}, a task T, a class C of functions c : X A, a predictor p : X [0, 1] with the following properties. The task T has the ℓ1 objective f0(x, a, y) = |a y| and linear constraints (as in (4)). The predictor p belongs to MC(C, 0) Cal(0). However, p is not a ({T}, C, Cp,g, ε)-omnipredictor for sufficiently small ε > 0.

Proof. We assume that X = {x1, x2, x3, x4} and (x, y) D can be sampled by first drawing x from the uniform distribution over X, and then drawing y Ber(p (x)) for

0.5, if x = x1, 0.5, if x = x2, 0, if x = x3, 1, if x = x4.

The function class C consists of a single function c defined by

0.75, x = x1, 0.25, x = x2, 0, x {x3, x4}.

The groups are defined by

( 1, x {x1, x3}, 2, x {x2, x4}.

The constraints fj of the task T are defined by

f1(x, a, y) = 1(i = 1)0.375 1(i = 1)a

f2(x, a, y) = 1(i = 1)0.375 + 1(i = 1)a

f3(x, a, y) = 1(i = 2)0.125 1(i = 2)a

f4(x, a, y) = 1(i = 2)0.125 + 1(i = 2)a

That is, they require that E[c(x)|g(x) = 1] = 0.375, E[c(x)|g(x) = 2] = 0.125. We can easily see that c satisfies the constraint:

E x[c(x)|g(x) = 1] = 0.75 0.5 = 0.375,

E x[c(x)|g(x) = 2] = 0.25 0.5 = 0.125.

Omnipredictors for Constrained Optimization

We choose p : X [0, 1] to be the constant function satisfying p(x) = 0.5 for all x X. We show that p MC(C, 0) Cal(0). We start from calibration:

E (x,y) D[y] = 0.5 = E (x,y) D[p(x)].

Now we show multicalibration with respect to c C:

E (x,y) D[c(x) (y p(x))]

= E (x,y) D[c(x) (y 0.5)]

= 0.25 (0.75(0.5 0.5) + 0.25(0.5 0.5) + 0 (0 0.5) + 0 (1 0.5))) = 0.

The objective value of c is:

β := opt D(T, C, 0)

= E (x,y) D[f0(i, c(x), y)]

= 0.125 (|1 0.75| + |0 0.75| + |1 0.25| + |0 0.25|) + 0.25 (|0, 0| + |1, 0|)

= 0.125 (2 0.25 + 2 0.75) + 0.25 = 0.25 + 0.25

Since p is a constant function, any c Cp,g must satisfy c (x1) = c (x3) and c (x2) = c (x4) because g(x1) = g(x3) and g(x2) = g(x4). To satisfy the constraints up to a small error ε, c must be close to assigning 0.375 to x1 and x3, and assigning 0.125 to x2 and x4. We calculate the loss for this c :

E (x,y) D[f0(i, c (x), y)] = 0.125 (|1 0.375| + |0 0.375| + |1 0.125| + |0 0.125|)

+ 0.25 (|0 0.325| + |1 0.125|)

= 0.125 (0.625 + 0.375 + 0.125 + 0.875) + 0.25 (0.375 + 0.875)

= 0.25 + 0.25 1.25

This implies that for small enough ε, we have Cp,g sol D(T, β + ε, ε) = , and thus p cannot be a ({T}, C, Cp,g, ε)- omnipredictor.

I.2. Group Level-Set Multiaccuracy is Necessary

We show an example task with non-convex constraints and a non-special objective, and thus none of our Theorems 4.4, 4.6 and D.9 could be applied to the example. Theorem 4.7 is applicable, but it requires group level-set multiaccuracy. Below we show that for this task group multicalibration is indeed not enough and the level-set variant is necessary to guarantee omniprediction. Claim I.2. Let A = [0, 1] be an action set. There exists a non-empty set X over individuals, a group partition function g : X [t], a distribution D over X {0, 1}, a task T, a class C of functions c : X A, a predictor p : X [0, 1] with the following properties. The task T only has group constraints and objectives with 1-bounded differences. The predictor p belongs to Grp MCD(C, g, 0) Grp Cal D(g, 0). However, p is not a ({T}, C, Crand p,g , ε)-omnipredictor for sufficiently small ε > 0.

Proof. Let X = {x1, x2, x3} and let g : X [t] be the trivial group partition that assigns every individual x X to the same group g(x) = 1. The distribution D is defined by first choosing x X uniformly at random, and then choosing y Ber(p (x)) for

0.25, x = x1, 1, x = x2, 0.25, x = x3.

Omnipredictors for Constrained Optimization

The function class C contains only a single function C = {c} defined by:

0.1, x = x1, 0.2, x = x2, 0.3, x = x3.

We choose the objective f0 of T to be the cubic loss: f0(x, a, y) = |a y|3. We choose the collection of constraints fj of T to be

f1(x, a, y) = 1(a = 0.1) 1

f2(x, a, y) = 1(a = 0.1) + 1

f3(x, a, y) = 1(a = 0.2) 1

f4(x, a, y) = 1(a = 0.2) + 1

f5(x, a, y) = 1(a = 0.3) 1

f6(x, a, y) = 1(a = 0.3) + 1

For an action function c : X A to satisfy these constraints exactly, it must satisfy

Pr (x,y) D[c (x) = a] = 1/3 for every a {0.1, 0.2, 0.3} .

It is clear that the only function c C satisfies the constraints. The objective value achieved by c is

β := opt D(T, C, 0) = E (x,y) D[f0(x, c(x), y)]

j Pr[x = xj] E (x,y) D[y|x Uj]|1 cj|3 + (1 E (x,y) D[y|x Uj])|cj|3

4(0.9)3 + 3

4(0.1)3 + 1(0.8)3 + 0(0.2)3 + 1

4(0.7)3 + 3

The predictor p : X [0, 1] defined by p(x) = 0.5 for all x X. We show that p Grp MCD(C, g, 0) Grp Cal D(g, 0). We show it, starting from calibration: E (x,y) D[y] = 0.5 = E (x,y) D[p(x)].

For group multicalibration with respect to c C:

E (x,y) D [c(x) (y p(x))] = 1

Since both p and g are constant functions, any c Crand p,g has to give all x X the same distribution c(x) of actions. To satisfy the constraints up to a small error ε, c (x) must be close to the uniform distribution over {0.1, 0.2, 0, 3} for every x. When c (x) is this uniform distribution for every x, we have

E (x,y) D E a c(x)[f0(x, a, y)] = X

b {0,1},a {0.1,0.2,0.3} Pr (x,y) D[y = b, c(x) = a] |y a|3

3 (0.9)3 + (0.1)3 + (0.8)3 + (0.2)3 + (0.7)3 + (0.3)3

Omnipredictors for Constrained Optimization

Therefore, for small enough ε > 0, we have Crand p,g sol D(T, β + ε, ε) = , and thus p cannot be a ({T}, C, Crand p,g , ε)- omnipredictor.

J. Helper Claims

The following claim is a standard result (see e.g. (Canonne, 2020, Theorem 1)): Claim J.1. Let Z be a non-empty set partitioned into Z(1), . . . , Z(m). For ε, δ (0, 1/2) and an integer n W(ε 2(m + log(1/δ))) for a sufficiently large absolute constant W > 0, let z1, . . . , zn Z be n data points drawn i.i.d. from any distribution D over Z. Then with probability at least 1 δ, the following inequality holds:

i=1 1(zi Z(j)) Pr z D[z Z(j)]