# Predictive Multiplicity in Classification

Charles T. Marx¹, Flavio du Pin Calmon², Berk Ustun³

¹Haverford College  ²Harvard SEAS  ³UC San Diego. Correspondence to: Charles T. Marx, Berk Ustun.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Prediction problems often admit competing models that perform almost equally well. This effect challenges key assumptions in machine learning when competing models assign conflicting predictions. In this paper, we define predictive multiplicity as the ability of a prediction problem to admit competing models with conflicting predictions. We introduce formal measures to evaluate the severity of predictive multiplicity and develop integer programming tools to compute them exactly for linear classification problems. We apply our tools to measure predictive multiplicity in recidivism prediction problems. Our results show that real-world datasets may admit competing models that assign wildly conflicting predictions, and motivate the need to measure and report predictive multiplicity in model development.

## 1 Introduction

Machine learning algorithms are often designed to fit the best model from data. For example, modern methods for empirical risk minimization fit a model by optimizing a specific objective (e.g., error rate) over models that obey a specific set of constraints (e.g., linear classifiers with equal TPR between groups). In an ideal scenario where stakeholders agree on such a problem formulation (Passi & Barocas, 2019) and we are given a large dataset of representative examples, the use of machine learning may still lead to ethical challenges if there are multiple best-fitting models.

In machine learning, multiplicity refers to the ability of a prediction problem to admit multiple competing models that perform almost equally well. Several works mention that prediction problems can exhibit multiplicity (see e.g., Mountain & Hsiao, 1989; McCullagh & Nelder, 1989), but few discuss its implications. The work of Breiman (2001) is a major exception. In a seminal position paper, Breiman describes how multiplicity challenges explanations derived from a single predictive model: if one can fit multiple competing models, each of which provides a different explanation of the data-generating process, how can we tell which explanation is correct? Drawing parallels between the discordant explanations of competing models and the discordant testimonies of witnesses in the motion picture Rashomon, Breiman refers to this dilemma as the Rashomon effect. In the context of his work, the Rashomon effect is in fact an argument against the misuse of explanations. Seeing how prediction problems can exhibit multiplicity, we should not use the explanations of a single model to draw conclusions about the broader data-generating process, at least until we can rule out multiplicity.

Machine learning has changed drastically since Breiman coined the Rashomon effect. Many models are now built exclusively for prediction (Kleinberg et al., 2015). In applications like lending and recidivism prediction, predictions affect people (Binns et al., 2018), and multiplicity raises new challenges when competing models assign conflicting predictions. Consider the following examples:

Recidivism Prediction: Say that a recidivism prediction problem admits competing models with conflicting predictions.
In this case, a person who is predicted to recidivate by one model may be predicted not to recidivate by a competing model that performs equally well. If so, we may want to ignore predictions for this person or even forgo deployment.

Lending: Consider explaining the prediction of a loan approval model to an applicant who is denied a loan (e.g., by producing a counterfactual explanation for the prediction; Martens & Provost, 2014). If competing models assign conflicting predictions, then these predictions may lead to contradictory explanations. In this case, reporting evidence of competing models with conflicting predictions would mitigate unwarranted rationalization of the model resulting from fairwashing (Aïvodji et al., 2019; Laugel et al., 2019; Slack et al., 2020) or explanation bias (Koehler, 1991).

In this work, we define predictive multiplicity as the ability of a prediction problem to admit competing models that assign conflicting predictions. Predictive multiplicity affects key tasks in modern machine learning, from model selection to model validation to post-hoc explanation. In such tasks, presenting stakeholders with information about predictive multiplicity empowers them to challenge the resulting decisions. Our goal is to allow stakeholders to measure and report predictive multiplicity in the same way that we measure and report test error. To this end, we introduce formal measures of predictive multiplicity in classification:

Ambiguity: How many individuals are assigned conflicting predictions by any competing model?

Discrepancy: What is the maximum number of predictions that could change if we were to switch the model that we deploy with a competing model?

Both measures are designed to support stakeholder participation in model development and deployment (see e.g., Figure 1). For example, ambiguity counts the number of individuals whose predictions are determined by the decision to deploy one model over another. These individuals should have a say in model selection and should be able to contest the predictions assigned to them by a model in deployment.

The main contributions of this paper are as follows:

1. We introduce formal measures of predictive multiplicity for classification problems: ambiguity and discrepancy.

2. We develop integer programming tools to compute ambiguity and discrepancy for linear classification problems. Our tools compute these measures exactly by solving non-convex empirical risk minimization problems over the set of competing models.

3. We present an empirical study of predictive multiplicity in recidivism prediction. Our results show that real-world datasets can admit competing models with highly conflicting predictions, and illustrate how reporting predictive multiplicity can inform stakeholders in such cases. For example, in the ProPublica COMPAS dataset (Angwin et al., 2016), we find that a competing model that is only 1% less accurate than the most accurate model assigns conflicting predictions to over 17% of individuals, and that the predictions of 44% of individuals are affected by model choice.

### 1.1 Related Work

Multiplicity. Recent work in machine learning tackles multiplicity from the Rashomon perspective. Fisher et al. (2018) and Dong & Rudin (2019) develop methods to measure variable importance over the set of competing models.
Semenova & Rudin (2019) present a formal measure of the size of the set of competing models and use it to characterize settings where simple models perform well. Our work differs from this stream of research in that we study competing models with conflicting predictions (see Figure 2). Predictive multiplicity reflects irreconcilable differences between subsets of predictions, similar to the impossibility results in the fair machine learning literature (Chouldechova, 2017; Kleinberg et al., 2016; Corbett-Davies et al., 2017).

Figure 1. Classifiers with conflicting predictions can perform equally well. We show 4 linear classifiers $\hat{h}_a, \hat{h}_b, \hat{h}_c, \hat{h}_d$ that optimize accuracy on a 2D classification problem with 100 points: 25 positive points at each of $(x_1, x_2) = (0, 1)$ and $(1, 0)$, and 25 negative points at each of $(0, 0)$ and $(1, 1)$. The predictions of any 2 models differ on 50 points; thus, discrepancy is 50%. The predictions of all 100 points vary based on model choice; thus, ambiguity is 100%.

Figure 2. Predictive multiplicity reflects irreconcilable differences between the predictions of competing models. We depict two classification problems where the competing classifiers $h_a$ and $h_b$ optimize accuracy, highlighting points that are assigned conflicting predictions in red and regions of conflict in yellow. On the left (Rashomon effect), $h_a$ and $h_b$ assign the same predictions on the training data but produce conflicting explanations of the importance of $x_1$ vs. $x_2$. On the right (predictive multiplicity), $h_a$ and $h_b$ assign conflicting predictions on the training data.

Model Selection. Techniques to resolve multiplicity can be broadly categorized as approaches for tie-breaking and reconciliation. Classical approaches for model selection break ties using measures like AIC, BIC, or K-CV error (see e.g., McAllister, 2007; Ding et al., 2018). These approaches are designed to improve out-of-sample performance. However, they may fail to do so when problems exhibit predictive multiplicity. In Figure 1, for example, tie-breaking would not improve out-of-sample performance as all competing models perform equally well.

Bayesian Approaches. Bayesian approaches explicitly represent multiplicity through posterior distributions over models. Posterior distributions are commonly used to construct a single model for deployment via majority vote or randomization procedures (see e.g., McAllester, 1999; Germain et al., 2016). In theory, however, posterior distributions could allow for an ad hoc analysis of predictive multiplicity, e.g., by counting conflicting predictions over a set of models sampled from the posterior (see Dusenberry et al., 2020). While valuable, these analyses may underestimate the severity of predictive multiplicity because the sample would not contain all competing models.

Integer Programming. Our work is part of a recent stream of research on integer programming methods for classification (Nguyen & Sanner, 2013; Belotti et al., 2016; Ustun & Rudin, 2015; Ustun et al., 2019). We present methods to compute measures of predictive multiplicity for linear classification problems by solving integer programs. Integer programming allows us to count conflicting predictions over the full set of competing classifiers, i.e., all models that attain ε-optimal values of a discrete performance metric like the error rate.
In contrast, traditional approaches for reducing computation would produce unreliable estimates of predictive multiplicity, e.g., counting conflicting predictions over models that attain ε-optimal values of a convex surrogate loss. In this case, we could underestimate or overestimate predictive multiplicity because models that attain near-optimal performance may differ significantly from models that attain near-optimal values of a surrogate loss.

## 2 Framework

In this section, we introduce measures of predictive multiplicity. For clarity of exposition, we present measures for binary classification problems. Our measures generalize to problems where models optimize other performance metrics (e.g., AUC), predict multiple outcomes, or obey additional constraints on performance or model form.

Preliminaries. We start with a dataset of $n$ examples $\{(x_i, y_i)\}_{i=1}^n$, where each example consists of a feature vector $x_i = (1, x_{i1}, \ldots, x_{id}) \in \mathbb{R}^{d+1}$ and a label $y_i \in \{-1, +1\}$. We use the dataset to fit a baseline classifier $h_0 : \mathbb{R}^{d+1} \to \{-1, +1\}$ from a hypothesis class $H$ by minimizing empirical risk (i.e., training error):

$$h_0 \in \mathop{\mathrm{argmin}}_{h \in H} \hat{R}(h), \quad \text{where } \hat{R}(h) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(x_i) \neq y_i].$$

This practice is aligned with the goal of optimizing true risk (i.e., test error) when $h_0$ generalizes. Generalization is a reasonable assumption in our setting as we work with a simple hypothesis class (see e.g., empirical results in Table 1). Fitting models that optimize performance on all of the training data is a best practice in machine learning (see e.g., Cawley & Talbot, 2010, for a discussion).¹

¹For example, in a typical setting where we need to control overfitting by tuning hyperparameters over a validation dataset, we would first find hyperparameters that optimize an estimate of out-of-sample error (e.g., mean 5-fold CV error). We would then fit a model to optimize performance for these hyperparameters using all of the training data.

Competing Models. We measure predictive multiplicity over a set of classifiers that perform almost as well as the baseline classifier. We refer to this set as the ε-level set and to ε as the error tolerance.

Definition 1 (ε-Level Set). Given a baseline classifier $h_0$ and a hypothesis class $H$, the ε-level set around $h_0$ is the set of all classifiers $h \in H$ with an error rate of at most $\hat{R}(h_0) + \epsilon$ on the training data:

$$S_\epsilon(h_0) := \{h \in H : \hat{R}(h) \leq \hat{R}(h_0) + \epsilon\}.$$

Predictive multiplicity can arise over an ε-level set where ε = 0 (see e.g., Figure 1). Despite this, we typically measure predictive multiplicity over an ε-level set where ε > 0. This is because a competing model with near-optimal performance on the training data may outperform the optimal model in deployment. In such cases, it would not be defensible to rule out competing models due to small differences in training error. In practice, ε should be set so that the ε-level set is likely to include a model that attains optimal performance in deployment. This can be achieved by computing confidence intervals for out-of-sample performance (e.g., via bootstrapping or cross-validation) or by using generalization bounds (e.g., by setting ε so that with high probability the ε-level set contains the model that optimizes true risk).

Predictive Multiplicity. A prediction problem exhibits predictive multiplicity if competing models assign conflicting predictions over the training data.

Definition 2 (Predictive Multiplicity). Given a baseline classifier $h_0$ and an error tolerance ε, a prediction problem exhibits predictive multiplicity over the ε-level set $S_\epsilon(h_0)$ if there exists a model $h \in S_\epsilon(h_0)$ such that $h(x_i) \neq h_0(x_i)$ for some $x_i$ in the training dataset.

The fact that competing models assign conflicting predictions means that model selection will involve arbitrating irreconcilable predictions.
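To make Definitions 1 and 2 concrete, the sketch below checks whether a candidate linear classifier is a competing model that conflicts with the baseline on at least one training point. This is an illustrative NumPy snippet rather than the paper's released tooling; the function names (`zero_one_risk`, `conflicts_with_baseline`) are our own.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Empirical risk R_hat(h) of the linear classifier h(x) = sign(<w, x>).

    X has shape (n, d + 1) with a leading column of ones for the intercept;
    y holds labels in {-1, +1}.
    """
    preds = np.sign(X @ w)
    preds[preds == 0] = 1  # break ties toward the positive class
    return np.mean(preds != y)

def conflicts_with_baseline(w, w0, X, y, eps):
    """Return True if h (coefficients w) lies in the eps-level set around the
    baseline h0 (coefficients w0) and assigns a conflicting prediction to at
    least one training point, i.e., the problem exhibits predictive multiplicity."""
    in_level_set = zero_one_risk(w, X, y) <= zero_one_risk(w0, X, y) + eps
    preds, base_preds = np.sign(X @ w), np.sign(X @ w0)
    preds[preds == 0], base_preds[base_preds == 0] = 1, 1
    return bool(in_level_set and np.any(preds != base_preds))
```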
In what follows, we present formal measures of predictive multiplicity. Each measure evaluates the severity of predictive multiplicity by counting the number of examples that are assigned conflicting predictions by competing models in the ε-level set.

Definition 3 (Ambiguity). The ambiguity of a prediction problem over the ε-level set $S_\epsilon(h_0)$ is the proportion of points in the training dataset that can be assigned a conflicting prediction by a competing classifier $h \in S_\epsilon(h_0)$:

$$\alpha_\epsilon(h_0) := \frac{1}{n}\sum_{i=1}^n \max_{h \in S_\epsilon(h_0)} \mathbb{1}[h(x_i) \neq h_0(x_i)].$$

Definition 4 (Discrepancy). The discrepancy of a prediction problem over the ε-level set $S_\epsilon(h_0)$ is the maximum proportion of conflicting predictions between the baseline classifier $h_0$ and a competing classifier $h \in S_\epsilon(h_0)$:

$$\delta_\epsilon(h_0) := \max_{h \in S_\epsilon(h_0)} \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(x_i) \neq h_0(x_i)].$$

Ambiguity represents the number of predictions that can change over the set of competing models. This reflects the number of individuals whose predictions are determined by model choice, who could contest the prediction assigned to them by the deployed model, and who should have a say in model selection. Discrepancy represents the maximum number of predictions that can change if we switch the baseline classifier with a competing classifier. This reflects the fact that, in practice, changing multiple predictions requires that the conflicting predictions all be realized by a single competing model.

We end with a discussion of the relationship between accuracy and predictive multiplicity. In Proposition 1, we bound the number of conflicts between the optimal model and a model in the ε-level set. We include a proof in Appendix A.

Proposition 1 (Bound on Discrepancy). The discrepancy between $h_0$ and any competing classifier in the ε-level set $h \in S_\epsilon(h_0)$ obeys:

$$\delta_\epsilon(h_0) \leq 2\hat{R}(h_0) + \epsilon.$$

Proposition 1 demonstrates how the severity of predictive multiplicity depends on the accuracy of the baseline model. Specifically, a less accurate baseline model provides more room for predictive multiplicity. This result motivates why it is important to measure discrepancy and ambiguity using the best possible baseline model.

## 3 Methodology

In this section, we present integer programming tools to compute ambiguity and discrepancy for linear classification problems.

### 3.1 Overview

Baseline Classifier. Our tools compute ambiguity and discrepancy given a baseline linear classifier $h_0$, i.e., the classifier that we would typically deploy. In our experiments, we use a baseline classifier $h_0$ that minimizes the error rate, which we fit using a MIP formulation in Appendix B. This ensures that multiplicity does not arise due to suboptimality. Thus, the only way to avoid multiplicity is to change the prediction problem, i.e., by changing the dataset, the model class, or the constraints.

Path Algorithms. We present path algorithms to compute ambiguity and discrepancy for all possible ε-level sets. Path algorithms efficiently compute the information needed to show how ambiguity and discrepancy change with respect to ε (see Figure 3). These plots relax the need for practitioners to choose ε a priori, and calibrate their choice of ε in settings where small changes in ε may produce large changes in ambiguity and discrepancy.
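Before turning to the MIP formulations, note that Definitions 3 and 4 can be evaluated directly over any finite pool of candidate models. The sketch below does exactly that; because a pool need not contain every classifier in the ε-level set, the resulting values only lower-bound the exact measures computed by the MIP-based tools. The function name `multiplicity_over_pool` and the coefficient-vector representation are our own choices.

```python
import numpy as np

def predictions(w, X):
    """Predictions in {-1, +1} of the linear classifier with coefficients w."""
    p = np.sign(X @ w)
    p[p == 0] = 1
    return p

def multiplicity_over_pool(w0, pool, X, y, eps):
    """Evaluate ambiguity (Definition 3) and discrepancy (Definition 4) over a
    finite pool of candidate coefficient vectors. Both values are lower bounds
    on the exact measures, since the pool may miss classifiers in the level set."""
    n = len(y)
    base_preds = predictions(w0, X)
    base_risk = np.mean(base_preds != y)

    flipped_anywhere = np.zeros(n, dtype=bool)  # points flipped by some competing model
    max_conflicts = 0                           # most flips realized by a single model
    for w in pool:
        preds = predictions(w, X)
        if np.mean(preds != y) <= base_risk + eps:  # w is a competing model
            conflicts = preds != base_preds
            flipped_anywhere |= conflicts
            max_conflicts = max(max_conflicts, int(conflicts.sum()))

    return flipped_anywhere.mean(), max_conflicts / n  # (ambiguity, discrepancy)
```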
MIP Formulations. We compute ambiguity and discrepancy by fitting classifiers from the ε-level set. We fit each classifier by solving a discrete empirical risk minimization problem. We formulate each problem as a mixed integer program (MIP). Our MIP formulations can easily be changed to compute predictive multiplicity for more complex prediction problems, e.g., problems where we optimize other performance measures (e.g., TPR, FPR) or where models must obey constraints on model form or model predictions (e.g., group fairness constraints as in Zafar et al., 2019; Celis et al., 2019; Cotter et al., 2019).

MIP Solvers. We solve each MIP with a MIP solver such as CPLEX, CBC, or Gurobi. MIP solvers find the global optimum of a discrete optimization problem using exhaustive search algorithms like branch-and-bound (Wolsey, 1998). In our setting, solving a MIP returns: (i) an upper bound on the objective value; (ii) a lower bound on the objective value; and (iii) the coefficients of a linear classifier that achieves the upper bound. When the upper bound matches the lower bound, the solution in (iii) is certifiably optimal, and our measures are exact. If a MIP solver does not return a certifiably optimal solution within a user-specified time limit, the bounds from (i) and (ii) can be used to produce bounds on ambiguity and discrepancy.

### 3.2 Computing Discrepancy

Given a training dataset, a baseline classifier $h_0$, and a user-specified error tolerance ε, we compute the discrepancy over the ε-level set around $h_0$ by solving the following optimization problem:

$$\min_{h \in H} \; \sum_{i=1}^n \mathbb{1}[h(x_i) = h_0(x_i)] \quad \text{s.t.} \quad \hat{R}(h) \leq \hat{R}(h_0) + \epsilon. \tag{1}$$

We denote the optimal solution to Equation (1) as $g_\epsilon$. For linear classification problems, we can recover the coefficients of $g_\epsilon$ by solving the following MIP formulation, which we refer to as DiscMIP($h_0$, ε):

$$
\begin{aligned}
\min_{w,\, a} \quad & \sum_{i=1}^n a_i \\
\text{s.t.} \quad & M_i a_i \geq \gamma + h_0(x_i) \sum_{j=0}^d w_j x_{ij} && i = 1, \ldots, n && \text{(2a)} \\
& \sum_{i:\, h_0(x_i) = y_i} (1 - a_i) \;+ \sum_{i:\, h_0(x_i) \neq y_i} a_i \;\leq\; n\,(\hat{R}(h_0) + \epsilon) && && \text{(2b)} \\
& w_j = w_j^+ + w_j^- && j = 0, \ldots, d && \text{(2c)} \\
& \sum_{j=0}^d (w_j^+ - w_j^-) = 1 && && \text{(2d)} \\
& a_i \in \{0, 1\} && i = 1, \ldots, n \\
& w_j^+ \in [0, 1], \; w_j^- \in [-1, 0] && j = 0, \ldots, d
\end{aligned}
$$

DiscMIP minimizes the agreement between $h$ and $h_0$ using indicator variables $a_i = \mathbb{1}[h(x_i) = h_0(x_i)]$. These variables are set via the Big-M constraints in (2a). These constraints depend on: (i) a margin parameter $\gamma > 0$, which should be set to a small positive number (e.g., $\gamma = 10^{-4}$); and (ii) the Big-M parameters $M_i$, which can be set as $M_i = \gamma + \max_i \|x_i\|_1$ since we have fixed $\|w\|_1 = 1$ in constraint (2d). Constraint (2b) ensures that any feasible classifier must belong to the ε-level set.

Bounds. Solving DiscMIP returns the coefficients of the classifier that maximizes discrepancy with respect to the baseline classifier $h_0$. If the solution is not certifiably optimal, the upper bound from DiscMIP corresponds to a lower bound on discrepancy. Likewise, the lower bound from DiscMIP corresponds to an upper bound on discrepancy.
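The following is a minimal sketch of a DiscMIP-style formulation in cvxpy, assuming a MIP-capable solver (e.g., GLPK_MI, CBC, or Gurobi) is configured. It mirrors the structure above but is not the authors' released implementation; in particular, the encoding of the level-set constraint follows our reconstruction of (2b), and the coefficient split uses a nonnegative `w_neg` (so `w = w_pos - w_neg`), a cosmetic difference from the sign convention in the formulation.

```python
import cvxpy as cp
import numpy as np

def disc_mip(w0, X, y, eps, gamma=1e-4):
    """Sketch of a DiscMIP-style program: find a linear classifier in the
    eps-level set around the baseline (coefficients w0) that agrees with the
    baseline on as few training points as possible.
    Returns (discrepancy, coefficients)."""
    n, d1 = X.shape
    h0 = np.sign(X @ w0)
    h0[h0 == 0] = 1
    base_risk = np.mean(h0 != y)

    # Split coefficients into positive and negative parts so that the L1-norm
    # constraint stays linear, as in the paper's MIP.
    w_pos = cp.Variable(d1, nonneg=True)
    w_neg = cp.Variable(d1, nonneg=True)
    w = w_pos - w_neg
    a = cp.Variable(n, boolean=True)  # a_i = 1[h(x_i) = h0(x_i)]

    M = gamma + np.abs(X).sum(axis=1)      # Big-M values; valid since ||w||_1 = 1
    h0_correct = (h0 == y).astype(float)   # points the baseline classifies correctly

    constraints = [
        # (2a): a_i is forced to 1 unless h disagrees with h0 by margin gamma
        cp.multiply(M, a) >= gamma + cp.multiply(h0, X @ w),
        # (2b): errors of h, written in terms of agreement, stay in the eps-level set
        h0_correct @ (1 - a) + (1 - h0_correct) @ a <= n * (base_risk + eps),
        # (2d): fix the L1 norm of w to 1
        cp.sum(w_pos + w_neg) == 1,
        w_pos <= 1,
        w_neg <= 1,
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(a)), constraints)
    prob.solve(solver=cp.GLPK_MI)  # any MIP-capable solver available to cvxpy works

    discrepancy = (n - prob.value) / n
    return discrepancy, w.value
```

Splitting the coefficients into positive and negative parts is what lets the Big-M values be computed from the data alone, since every feasible `w` has unit ℓ1 norm.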
Path Algorithm. In Algorithm 1, we present a procedure to compute discrepancy for all possible values of ε. The procedure solves DiscMIP($h_0$, ε) for increasing values of ε ∈ E. At each iteration, it uses the current solution to initialize DiscMIP for the next iteration. The solution from the previous iteration produces upper and lower bounds that reduce the search space of the MIP, which is much faster than solving DiscMIP separately for each ε.

Algorithm 1: Compute Discrepancy for All Values of ε
Input: baseline classifier h0; E, values of ε sorted in increasing order
for ε in E do
  g_ε ← solution to DiscMIP(h0, ε)
  δ_ε ← number of conflicts between g_ε and h0
  ε_next ← next value of ε in E
  initialize DiscMIP(h0, ε_next) with g_ε
end for
Output: {δ_ε, g_ε} for each ε in E, the discrepancy and a maximally discrepant classifier

### 3.3 Computing Ambiguity

We present an algorithm to compute ambiguity for all possible values of ε. Given a baseline classifier $h_0$, the algorithm fits a pathological classifier $g_i$ for each point in the training data, i.e., the most accurate linear classifier that must assign a conflicting prediction to point $i$. Given a pathological classifier $g_i$ for each $i$, it then computes ambiguity over the ε-level set by counting the number of pathological classifiers whose error is within ε of the error of the baseline classifier. Observe that ambiguity can be expressed as follows:

$$\alpha_\epsilon(h_0) = \frac{1}{n}\sum_{i=1}^n \max_{h \in S_\epsilon(h_0)} \mathbb{1}[h(x_i) \neq h_0(x_i)] = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[\hat{R}(g_i) \leq \hat{R}(h_0) + \epsilon].$$

Thus, this approach corresponds to evaluating the summands in the expression for ambiguity in Definition 3. We fit $g_i$ by solving the following optimization problem:

$$\min_{h \in H} \; \hat{R}(h) \quad \text{s.t.} \quad h(x_i) \neq h_0(x_i). \tag{3}$$

Here, the constraint $h(x_i) \neq h_0(x_i)$ forces $h$ to assign a conflicting prediction to $x_i$. For linear classification problems, we can recover the coefficients of $g_i$ by solving the following MIP formulation, which we refer to as FlipMIP($h_0$, $x_i$):

$$
\begin{aligned}
\min_{w,\, l} \quad & \sum_{k=1}^n l_k \\
\text{s.t.} \quad & M_k l_k \geq \gamma - y_k \sum_{j=0}^d w_j x_{kj} && k = 1, \ldots, n && \text{(4a)} \\
& h_0(x_i) \sum_{j=0}^d w_j x_{ij} \leq -\gamma && && \text{(4b)} \\
& w_j = w_j^+ + w_j^- && j = 0, \ldots, d && \text{(4c)} \\
& \sum_{j=0}^d (w_j^+ - w_j^-) = 1 && && \text{(4d)} \\
& l_k \in \{0, 1\} && k = 1, \ldots, n \\
& w_j^+ \in [0, 1], \; w_j^- \in [-1, 0] && j = 0, \ldots, d
\end{aligned}
$$

FlipMIP minimizes the error rate of a pathological classifier $g_i$ using the indicator variables $l_k = \mathbb{1}[h(x_k) \neq y_k]$. These variables are set through the Big-M constraints in (4a), whose parameters can be set in the same way as those in DiscMIP. Constraint (4b) enforces the condition that $g_i(x_i) \neq h_0(x_i)$.

Bounds. When a solver does not return a certifiably optimal solution to FlipMIP within a user-specified time limit, it will return upper and lower bounds on the objective value of FlipMIP that can be used to bound ambiguity. The upper bound will produce a lower bound on ambiguity. The lower bound will produce an upper bound on ambiguity.

Path Algorithm. In Algorithm 2, we present a procedure to efficiently compute ambiguity by initializing each instance of FlipMIP. In line 2, the procedure sets the upper bound for FlipMIP using the most accurate classifier in POOL that obeys the constraint $h(x_i) \neq h_0(x_i)$. Given a certifiably optimal baseline classifier, we can initialize the lower bound of FlipMIP to $n\hat{R}(h_0)$.

Algorithm 2: Compute Ambiguity for All Values of ε
Input: baseline classifier h0; E, values of ε; POOL ← ∅, a pool of pathological classifiers
1: for i in {1, 2, ..., n} do
2:   initialize FlipMIP(h0, x_i) using the best solution in POOL
3:   g_i ← solution to FlipMIP(h0, x_i)
4:   add g_i to POOL
5: end for
6: for ε in E do
7:   α_ε ← proportion of points i with R̂(g_i) ≤ R̂(h0) + ε
8: end for
Output: {α_ε} for each ε in E, and {g_i} for i = 1, ..., n
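The counting step in lines 6-8 of Algorithm 2 is simple enough to sketch directly. A minimal version, assuming the training errors of the pathological classifiers have already been collected into an array (the names `flip_errors` and `ambiguity_path` are ours):

```python
import numpy as np

def ambiguity_path(flip_errors, base_error, eps_grid):
    """Second loop of Algorithm 2: given R_hat(g_i) for each training point i
    (the error of the best classifier forced to flip point i's prediction),
    compute ambiguity for every error tolerance in eps_grid."""
    flip_errors = np.asarray(flip_errors)
    return {eps: float(np.mean(flip_errors <= base_error + eps)) for eps in eps_grid}

# Example with 5 training points: as the tolerance grows, more points become flippable.
# ambiguity_path([0.30, 0.36, 0.30, 0.40, 0.34], base_error=0.30, eps_grid=[0.0, 0.05, 0.12])
# -> {0.0: 0.4, 0.05: 0.6, 0.12: 1.0}
```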
## 4 Experiments

In this section, we apply our tools to measure predictive multiplicity in recidivism prediction problems. We have three goals: (i) to measure the incidence of predictive multiplicity in real-world classification problems; (ii) to discuss how reporting predictive multiplicity can inform stakeholders; and (iii) to show that we can also measure predictive multiplicity using existing tools, albeit imperfectly. We include software to reproduce our results at https://github.com/charliemarx/pmtools.

Our focus on recidivism prediction should not be viewed as an endorsement of the practice. We consider recidivism prediction since it is a domain where predictive multiplicity has serious ethical implications, and where the existence of predictive multiplicity may serve as an additional reason to forgo the deployment of machine learning entirely (see e.g., Harcourt, 2008; Lum & Isaac, 2016; Barabas et al., 2017, for broader critiques).

| Dataset | Outcome Variable | n | d | Train Error of $h_0$ | Test Error of $h_0$ |
|---|---|---|---|---|---|
| compas arrest | rearrest for any crime | 5,380 | 18 | 32.7% | 33.4% |
| compas violent | rearrest for violent crime | 8,768 | 18 | 37.7% | 37.9% |
| pretrial CA arrest | rearrest for any crime | 9,926 | 22 | 34.1% | 34.4% |
| pretrial CA fta | failure to appear | 8,738 | 22 | 36.3% | 36.3% |
| recidivism CA arrest | rearrest for any offense | 114,522 | 20 | 34.4% | 34.2% |
| recidivism CA drug | rearrest for drug-related offense | 96,664 | 20 | 36.3% | 36.2% |
| recidivism NY arrest | rearrest for any offense | 31,624 | 20 | 31.0% | 31.8% |
| recidivism NY drug | rearrest for drug-related offense | 27,526 | 20 | 32.5% | 33.6% |

Table 1. Recidivism prediction datasets used in Section 4. For each dataset, we fit a baseline linear classifier that minimizes training error. As shown, the models generalize, as training error is close to test error. This is expected given that we fit models from a simple hypothesis class. Here, n and d denote the number of examples and features in each dataset, respectively. All datasets are publicly available. We include a copy of compas arrest and compas violent with our code. The remaining datasets must be requested from ICPSR due to privacy restrictions.

Datasets. We derive 8 datasets from the following studies of recidivism in the United States: compas from Angwin et al. (2016); pretrial from Felony Defendants in Large Urban Counties (US Dept. of Justice, 2014b); recidivism from Recidivism of Prisoners Released in 1994 (US Dept. of Justice, 2014a). We process each dataset by binarizing features and dropping examples with missing entries. For clarity of exposition, we oversample the minority class to equalize the number of positive and negative examples. Oversampling allows us to report our measures for level sets defined in terms of error rates instead of TPR/FPR. We find that oversampling has a negligible effect on our measures of multiplicity. We provide a summary of each dataset in Table 1.

Measurement Protocol. We compute our measures of predictive multiplicity for each dataset as follows. We split each dataset into a training set composed of 80% of points and a test set composed of 20% of points. We use the training set to fit a baseline classifier that minimizes the 0-1 loss directly by solving MIP (6) in Appendix B. We measure ambiguity and discrepancy for all possible values of the error tolerance ε using Algorithms 1 and 2. We solve each MIP on a 3.33 GHz CPU with 16 GB RAM. We allocate at most 6 hours to fit the baseline model, 6 hours to fit the models to compute discrepancy for all ε, and 6 hours to fit the models to compute ambiguity for all ε.

Ad Hoc Measurement Protocol. We also compute ambiguity and discrepancy through an ad hoc approach. We include these results to show that an imperfect analysis of predictive multiplicity can reveal salient information. Here, we produce a pool of competing models using the glmnet package of Friedman et al. (2010). We fit 1,100 linear classifiers using penalized logistic regression. Each model corresponds to an optimizer of the logistic loss with a different degree of $\ell_1$ and $\ell_2$ regularization. We choose the baseline model as the model that minimizes the 5-fold CV test error.
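The ad hoc protocol above uses the R package glmnet. The sketch below shows a rough Python analogue using scikit-learn's elastic-net logistic regression, which one could combine with the earlier `multiplicity_over_pool` sketch; the grid sizes and the helper name `fit_penalized_pool` are our own choices, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_penalized_pool(X, y, n_strengths=25):
    """Fit penalized logistic regression models over a grid of l1/l2 penalties and
    return their coefficients in the (intercept, w_1, ..., w_d) layout used by the
    earlier sketches. X here excludes the leading column of ones."""
    pool = []
    for l1_ratio in np.linspace(0.0, 1.0, 11):        # 0 = pure l2, 1 = pure l1
        for C in np.logspace(-4, 4, n_strengths):     # inverse regularization strength
            clf = LogisticRegression(
                penalty="elasticnet", l1_ratio=l1_ratio, C=C,
                solver="saga", max_iter=5000,
            ).fit(X, y)
            pool.append(np.concatenate([clf.intercept_, clf.coef_.ravel()]))
    return pool
```

Counting conflicting predictions over such a pool mirrors the ad hoc analysis in Figure 3; as the results below show, it detects multiplicity but understates it relative to the exact MIP-based measures.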
Figure 3. Severity of predictive multiplicity measured using our tools (top) and using the ad hoc approach (bottom) for compas arrest. We plot the values of discrepancy (left) and ambiguity (right) over the ε-level set. We find a discrepancy of 17% and an ambiguity of 44% over the 1%-level set. This means that one can change 17% of predictions by switching the baseline model with a model that is only 1% less accurate, and that 44% of individuals are assigned conflicting predictions by models in the 1%-level set. We include similar plots for other datasets in Appendix C.

### 4.2 Results

In Figure 3, we plot ambiguity and discrepancy for all possible values of the error tolerance ε for compas arrest, comparing the measures produced using our tools to those produced using the ad hoc analysis. In Table 2, we compare competing classifiers for compas arrest. In what follows, we discuss these results.

On the Incidence of Predictive Multiplicity. Our results in Figure 3 show how predictive multiplicity arises in real-world prediction problems. For the 8 datasets we consider, we find that between 4% and 53% of individuals are assigned conflicting predictions in the 1%-level set. In compas arrest, for example, we observe an ambiguity of 44%. Considering discrepancy, we can find a competing model in the 1%-level set that would assign a conflicting prediction to 17% of individuals.

On the Burden of Multiplicity. Our results show that the incidence of multiplicity can differ significantly between protected groups. In compas violent, for example, predictive multiplicity disproportionately affects African Americans compared to individuals of other ethnic groups: the proportion of individuals who are assigned conflicting predictions over the 1%-level set is 72.9% for African Americans but 37.2% for Caucasians. Groups with a larger burden of multiplicity are more vulnerable to model selection, and more likely to be affected when competing models are ignored.

On the Implications of Predictive Multiplicity. Our results illustrate how reporting ambiguity and discrepancy can challenge model development and deployment. In compas arrest, for example, our baseline model provably optimizes training error and generalizes. Without an analysis of predictive multiplicity, practitioners could deploy this model. Our analysis reveals that there exists a competing model that assigns conflicting predictions to 17% of individuals. Thus, these measures support the need for greater scrutiny and stakeholder involvement in model selection.

Reporting ambiguity and discrepancy also helps us calibrate trust in downstream processes in the modern machine learning life-cycle (e.g., evaluating feature influence as in Kumar et al., 2020; Marx et al., 2019). Consider the process of explaining individual predictions.
In this case, an ambiguity of 44% means one could produce conflicting explanations for 44% of predictions. While every explanation would help us understand how competing models operate, evidence of conflicting predictions would provide a safeguard against unwarranted rationalization.

| | Baseline Model | Individual Ambiguity Model | Discrepancy Model |
|---|---|---|---|
| $h(x_p)$ | +1 | -1 | -1 |
| Error (Train/Test) | 32.7% / 33.4% | 32.7% / 33.4% | 33.6% / 34.5% |
| Discrepancy (Train/Test) | 0.0% / 0.0% | 0.0037% / 0.0% | 16.8% / 15.1% |
| Score Function | linear scores over binarized features for age, sex, prior arrests, juvenile misdemeanors and felonies, and charge degree; the three models assign markedly different coefficients | | |

Table 2. Competing linear classifiers that assign conflicting predictions to $x_p$ in compas arrest. We show the baseline model (left), the competing model fit to measure ambiguity for $x_p$ (middle), and the competing model fit to measure discrepancy (right). The baseline model predicts $h(x_p) = +1$ while the other models predict $h(x_p) = -1$. Thus, there exist at least two competing models that predict that $x_p$ would not recidivate. In addition, each model exhibits different coefficients and measures of variable importance.

On Model Selection. When presented with many competing models, a natural solution is to choose among them to optimize secondary objectives. We support this practice when secondary objectives reflect bona fide goals rather than a means of resolving multiplicity (see Section 5 for a discussion). However, tie-breaking does not always yield a unique model. For example, on the compas arrest dataset, we can break ties between competing models in the 1%-level set on the basis of a group fairness criterion (i.e., by minimizing the disparity in accuracy between African Americans and other ethnic groups). In this case, we find 102 competing models that are also within 1% of optimal in terms of the secondary criterion.

On Ad Hoc Measurement. Our results for the ad hoc approach show how measuring and reporting predictive multiplicity can reveal useful information even without specialized tools. In compas arrest, for example, an ad hoc analysis reveals an ambiguity of 10% and a discrepancy of 7% over the set of competing models. These estimates are far lower than those produced using our tools (44% and 17%, respectively). This is because the ad hoc approach only considers competing models that can be obtained by varying the $\ell_1$ and $\ell_2$ penalties in penalized logistic regression, rather than all linear classifiers in the 1%-level set.
These results show that ad hoc approaches can detect predictive multiplicity, but should not be used to certify the absence of multiplicity.

## 5 Concluding Remarks

Prediction problems can exhibit predictive multiplicity due to a host of reasons, including feature selection, a misspecified hypothesis class, or the existence of latent groups. Even as there exist techniques to choose between competing models, we do not advocate a general prescription to resolve predictive multiplicity. Instead, we argue that we should measure and report multiplicity just as we measure and report test error (Saleiro et al., 2018; Reisman et al., 2018). In this way, predictive multiplicity can be resolved on a case-by-case basis, and in a way that allows for input from stakeholders (as per the principles of contestable design; see e.g., Hirsch et al., 2017; Kluttz et al., 2018).

Reporting predictive multiplicity can change how we build and deploy models in human-facing applications. In such settings, presenting stakeholders with meaningful information about predictive multiplicity may lead them to think carefully about which model to deploy, consider assigning favorable predictions to individuals who receive conflicting predictions, or forgo deployment entirely.

## Acknowledgements

We thank Sorelle Friedler, Dylan Slack, and Ben Green for helpful discussions, and anonymous reviewers for constructive feedback. This research is supported in part by the National Science Foundation under Grant No. CAREER CIF-1845852 and by a Google Faculty Award.

## References

Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., and Tapp, A. Fairwashing: the risk of rationalization. arXiv preprint arXiv:1901.09749, 2019.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. ProPublica, May 23, 2016.

Barabas, C., Dinakar, K., Ito, J., Virza, M., and Zittrain, J. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. arXiv preprint arXiv:1712.08238, 2017.

Belotti, P., Bonami, P., Fischetti, M., Lodi, A., Monaci, M., Nogales-Gómez, A., and Salvagnin, D. On handling indicator constraints in mixed integer programming. Computational Optimization and Applications, 65(3):545-566, 2016.

Binns, R., Van Kleek, M., Veale, M., Lyngs, U., Zhao, J., and Shadbolt, N. 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 377. ACM, 2018.

Breiman, L. Statistical modeling: The two cultures. Statistical Science, 16(3):199-231, 2001.

Cawley, G. C. and Talbot, N. L. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079-2107, 2010.

Celis, L. E., Huang, L., Keswani, V., and Vishnoi, N. K. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 319-328. ACM, 2019.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797-806. ACM, 2017.
Cotter, A., Jiang, H., Wang, S., Narayan, T., You, S., Sridharan, K., and Gupta, M. R. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. 2019.

Ding, J., Tarokh, V., and Yang, Y. Model selection techniques: An overview. IEEE Signal Processing Magazine, 35(6):16-34, 2018.

Dong, J. and Rudin, C. Variable importance clouds: A way to explore variable importance for the set of good models. arXiv preprint arXiv:1901.03209, 2019.

Dusenberry, M. W., Tran, D., Choi, E., Kemp, J., Nixon, J., Jerfel, G., Heller, K., and Dai, A. M. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 204-213, 2020.

Fisher, A., Rudin, C., and Dominici, F. All models are wrong but many are useful: Variable importance for black-box, proprietary, or misspecified prediction models, using model class reliance. arXiv preprint arXiv:1801.01489, 2018.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884-1892, 2016.

Harcourt, B. E. Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age. University of Chicago Press, 2008.

Hirsch, T., Merced, K., Narayanan, S., Imel, Z. E., and Atkins, D. C. Designing contestability: Interaction design, machine learning, and mental health. In Proceedings of the 2017 Conference on Designing Interactive Systems, pp. 95-99. ACM, 2017.

Kleinberg, J., Ludwig, J., Mullainathan, S., and Obermeyer, Z. Prediction policy problems. American Economic Review, 105(5):491-495, 2015.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Kluttz, D., Kohli, N., and Mulligan, D. K. Contestability and professionals: From explanations to engagement with algorithmic systems. Available at SSRN 3311894, 2018.

Koehler, D. J. Explanation, imagination, and confidence in judgment. Psychological Bulletin, 110(3):499, 1991.

Kumar, I. E., Venkatasubramanian, S., Scheidegger, C., and Friedler, S. Problems with Shapley-value-based explanations as feature importance measures. arXiv preprint arXiv:2002.11097, 2020.

Laugel, T., Lesot, M.-J., Marsala, C., Renard, X., and Detyniecki, M. The dangers of post-hoc interpretability: Unjustified counterfactual explanations. arXiv preprint arXiv:1907.09294, 2019.

Lum, K. and Isaac, W. To predict and serve? Significance, 13(5):14-19, 2016.

Martens, D. and Provost, F. Explaining data-driven document classifications. MIS Quarterly, 38(1):73-100, 2014.

Marx, C., Phillips, R., Friedler, S., Scheidegger, C., and Venkatasubramanian, S. Disentangling influence: Using disentangled representations to audit model predictions. In Advances in Neural Information Processing Systems, pp. 4496-4506, 2019.

McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37(3):355-363, 1999.

McAllister, J. W. Model selection and the multiplicity of patterns in empirical data. Philosophy of Science, 74(5):884-894, 2007.

McCullagh, P. and Nelder, J. A. Generalized Linear Models, volume 37. CRC Press, 1989.

Mountain, D. and Hsiao, C. A combined structural and flexible functional approach for modeling energy substitution. Journal of the American Statistical Association, 84(405):76-87, 1989.
Nguyen, T. and Sanner, S. Algorithms for direct 0-1 loss optimization in binary classification. In Proceedings of the 30th International Conference on Machine Learning, pp. 1085-1093, 2013.

Passi, S. and Barocas, S. Problem formulation and fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 39-48. ACM, 2019.

Reisman, D., Schultz, J., Crawford, K., and Whittaker, M. Algorithmic impact assessments: A practical framework for public agency accountability. AI Now Institute, 2018.

Saleiro, P., Kuester, B., Stevens, A., Anisfeld, A., Hinkson, L., London, J., and Ghani, R. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

Semenova, L. and Rudin, C. A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755, 2019.

Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180-186, 2020.

US Dept. of Justice, Bureau of Justice Statistics. Recidivism of prisoners released in 1994. 2014a. doi: 10.3886/ICPSR03355.v8.

US Dept. of Justice, Bureau of Justice Statistics. State court processing statistics, 1990-2009: Felony defendants in large urban counties. 2014b. doi: 10.3886/ICPSR02038.v5.

Ustun, B. and Rudin, C. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349-391, 2015. doi: 10.1007/s10994-015-5528-6.

Ustun, B., Spangher, A., and Liu, Y. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pp. 10-19. ACM, 2019.

Ustun, B., Liu, Y., and Parkes, D. Fairness without harm: Decoupled classifiers with preference guarantees. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6373-6382. PMLR, 2019.

Wolsey, L. A. Integer Programming, volume 42. Wiley, New York, 1998.

Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1-42, 2019.