# Predictive Multiplicity in Classification

Charles T. Marx¹, Flavio du Pin Calmon², Berk Ustun³

¹Haverford College  ²Harvard SEAS  ³UC San Diego. Correspondence to: Charles T. Marx, Berk Ustun.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Prediction problems often admit competing models that perform almost equally well. This effect challenges key assumptions in machine learning when competing models assign conflicting predictions. In this paper, we define predictive multiplicity as the ability of a prediction problem to admit competing models with conflicting predictions. We introduce formal measures to evaluate the severity of predictive multiplicity and develop integer programming tools to compute them exactly for linear classification problems. We apply our tools to measure predictive multiplicity in recidivism prediction problems. Our results show that real-world datasets may admit competing models that assign wildly conflicting predictions, and motivate the need to measure and report predictive multiplicity in model development.

## 1 Introduction

Machine learning algorithms are often designed to fit the best model from data. For example, modern methods for empirical risk minimization fit a model by optimizing a specific objective (e.g., error rate) over models that obey a specific set of constraints (e.g., linear classifiers with equal TPR between groups). In an ideal scenario where stakeholders agree on such a problem formulation (Passi & Barocas, 2019) and we are given a large dataset of representative examples, the use of machine learning may still lead to ethical challenges if there are multiple best-fitting models.

In machine learning, multiplicity refers to the ability of a prediction problem to admit multiple competing models that perform almost equally well. Several works mention that prediction problems can exhibit multiplicity (see e.g., Mountain & Hsiao, 1989; McCullagh & Nelder, 1989), but few discuss its implications. The work of Breiman (2001) is a major exception. In a seminal position paper, Breiman describes how multiplicity challenges explanations derived from a single predictive model: if one can fit multiple competing models, each of which provides a different explanation of the data-generating process, how can we tell which explanation is correct? Drawing parallels between the discordant explanations of competing models and the discordant testimonies of witnesses in the motion picture Rashomon, Breiman refers to this dilemma as the Rashomon effect. In the context of his work, the Rashomon effect is in fact an argument against the misuse of explanations. Seeing how prediction problems can exhibit multiplicity, we should not use the explanations of a single model to draw conclusions about the broader data-generating process, at least until we can rule out multiplicity.

Machine learning has changed drastically since Breiman coined the Rashomon effect. Many models are now built exclusively for prediction (Kleinberg et al., 2015). In applications like lending and recidivism prediction, predictions affect people (Binns et al., 2018), and multiplicity raises new challenges when competing models assign conflicting predictions. Consider the following examples:

Recidivism Prediction: Say that a recidivism prediction problem admits competing models with conflicting predictions.
In this case, a person who is predicted to recidivate by one model may be predicted not to recidivate by a competing model that performs equally well. If so, we may want to ignore predictions for this person or even forgo deployment.

Lending: Consider explaining the prediction of a loan approval model to an applicant who is denied a loan (e.g., by producing a counterfactual explanation for the prediction; Martens & Provost, 2014). If competing models assign conflicting predictions, then these predictions may lead to contradictory explanations. In this case, reporting evidence of competing models with conflicting predictions would mitigate unwarranted rationalization of the model resulting from fairwashing (Aïvodji et al., 2019; Laugel et al., 2019; Slack et al., 2020) or explanation bias (Koehler, 1991).

In this work, we define predictive multiplicity as the ability of a prediction problem to admit competing models that assign conflicting predictions. Predictive multiplicity affects key tasks in modern machine learning, from model selection to model validation to post-hoc explanation. In such tasks, presenting stakeholders with information about predictive multiplicity empowers them to challenge the resulting decisions. Our goal is to allow stakeholders to measure and report predictive multiplicity in the same way that we measure and report test error. To this end, we introduce formal measures of predictive multiplicity in classification:

Ambiguity: How many individuals are assigned conflicting predictions by any competing model?

Discrepancy: What is the maximum number of predictions that could change if we were to switch the model that we deploy with a competing model?

Both measures are designed to support stakeholder participation in model development and deployment (see e.g., Figure 1). For example, ambiguity counts the number of individuals whose predictions are determined by the decision to deploy one model over another. These individuals should have a say in model selection and should be able to contest the predictions assigned to them by a model in deployment.

The main contributions of this paper are as follows:

1. We introduce formal measures of predictive multiplicity for classification problems: ambiguity and discrepancy.

2. We develop integer programming tools to compute ambiguity and discrepancy for linear classification problems. Our tools compute these measures exactly by solving non-convex empirical risk minimization problems over the set of competing models.

3. We present an empirical study of predictive multiplicity in recidivism prediction. Our results show that real-world datasets can admit competing models with highly conflicting predictions, and illustrate how reporting predictive multiplicity can inform stakeholders in such cases. For example, in the ProPublica COMPAS dataset (Angwin et al., 2016), we find that a competing model that is only 1% less accurate than the most accurate model assigns conflicting predictions to over 17% of individuals, and that the predictions of 44% of individuals are affected by model choice.

### 1.1 Related Work

Multiplicity. Recent work in machine learning tackles multiplicity from the Rashomon perspective. Fisher et al. (2018) and Dong & Rudin (2019) develop methods to measure variable importance over the set of competing models.
Semenova & Rudin (2019) present a formal measure of the size of the set of competing models and use it to characterize settings where simple models perform well. Our work differs from this stream of research in that we study competing models with conflicting predictions (see Figure 2). Predictive multiplicity reflects irreconcilable differences between subsets of predictions, similar to the impossibility results in the fair machine learning literature (Chouldechova, 2017; Kleinberg et al., 2016; Corbett-Davies et al., 2017).

Figure 1. Classifiers with conflicting predictions can perform equally well. We show 4 linear classifiers $\hat{h}_a, \hat{h}_b, \hat{h}_c, \hat{h}_d$ that optimize accuracy on a 2D classification problem with 100 points: 25 positive points at each of $(x_1, x_2) = (0, 1)$ and $(1, 0)$, and 25 negative points at each of $(0, 0)$ and $(1, 1)$. The predictions of any 2 models differ on 50 points; thus, discrepancy is 50%. The predictions of all 100 points vary based on model choice; thus, ambiguity is 100%.

Figure 2. Predictive multiplicity reflects irreconcilable differences between the predictions of competing models. We depict two classification problems where the competing classifiers $h_a$ and $h_b$ optimize accuracy, highlighting points that are assigned conflicting predictions in red and regions of conflict in yellow. On the left (Rashomon effect), $h_a$ and $h_b$ assign the same predictions on the training data but produce conflicting explanations of the importance of $x_1$ vs. $x_2$. On the right (predictive multiplicity), $h_a$ and $h_b$ assign conflicting predictions on the training data.

Model Selection. Techniques to resolve multiplicity can be broadly categorized as approaches for tie-breaking and reconciliation. Classical approaches for model selection break ties using measures like AIC, BIC, or K-CV error (see e.g., McAllister, 2007; Ding et al., 2018). These approaches are designed to improve out-of-sample performance. However, they may fail to do so when problems exhibit predictive multiplicity. In Figure 1, for example, tie-breaking would not improve out-of-sample performance as all competing models perform equally well.

Bayesian Approaches. Bayesian approaches explicitly represent multiplicity through posterior distributions over models. Posterior distributions are commonly used to construct a single model for deployment via majority vote or randomization procedures (see e.g., McAllester, 1999; Germain et al., 2016). In theory, however, posterior distributions could allow for an ad hoc analysis of predictive multiplicity, e.g., by counting conflicting predictions over a set of models sampled from the posterior (see Dusenberry et al., 2020). While valuable, these analyses may underestimate the severity of predictive multiplicity because the sample would not contain all competing models.

Integer Programming. Our work is part of a recent stream of research on integer programming methods for classification (Nguyen & Sanner, 2013; Belotti et al., 2016; Ustun & Rudin, 2015; Ustun et al., 2019). We present methods to compute measures of predictive multiplicity for linear classification problems by solving integer programs. Integer programming allows us to count conflicting predictions over the full set of competing classifiers, i.e., all models that attain ε-optimal values of a discrete performance metric like the error rate.
In contrast, traditional approaches for reducing computation would produce unreliable estimates of predictive multiplicity, e.g., counting conflicting predictions over models that attain ε-optimal values of a convex surrogate loss. In this case, we could underestimate or overestimate predictive multiplicity because models that attain near-optimal performance may differ significantly from models that attain near-optimal values of a surrogate loss.

## 2 Framework

In this section, we introduce measures of predictive multiplicity. For clarity of exposition, we present measures for binary classification problems. Our measures generalize to problems where models optimize other performance metrics (e.g., AUC), predict multiple outcomes, or obey additional constraints on performance or model form.

Preliminaries. We start with a dataset of $n$ examples $\{(x_i, y_i)\}_{i=1}^n$, where each example consists of a feature vector $x_i = (1, x_{i1}, \ldots, x_{id}) \in \mathbb{R}^{d+1}$ and a label $y_i \in \{-1, +1\}$. We use the dataset to fit a baseline classifier $h_0 : \mathbb{R}^{d+1} \to \{-1, +1\}$ from a hypothesis class $H$ by minimizing empirical risk (i.e., training error):

$$h_0 \in \mathop{\mathrm{argmin}}_{h \in H} \hat{R}(h), \quad \text{where } \hat{R}(h) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(x_i) \neq y_i].$$

This practice is aligned with the goal of optimizing true risk (i.e., test error) when $h_0$ generalizes. Generalization is a reasonable assumption in our setting as we work with a simple hypothesis class (see e.g., empirical results in Table 1). Fitting models that optimize performance on all of the training data is a best practice in machine learning (see e.g., Cawley & Talbot, 2010, for a discussion).¹

¹For example, in a typical setting where we need to control overfitting by tuning hyperparameters over a validation dataset, we would first find hyperparameters that optimize an estimate of out-of-sample error (e.g., mean 5-fold CV error). We would then fit a model to optimize performance for these hyperparameters using all of the training data.

Competing Models. We measure predictive multiplicity over a set of classifiers that perform almost as well as the baseline classifier. We refer to this set as the ε-level set and to ε as the error tolerance.

Definition 1 (ε-Level Set). Given a baseline classifier $h_0$ and a hypothesis class $H$, the ε-level set around $h_0$ is the set of all classifiers $h \in H$ with an error rate of at most $\hat{R}(h_0) + \epsilon$ on the training data:

$$S_\epsilon(h_0) := \{h \in H : \hat{R}(h) \leq \hat{R}(h_0) + \epsilon\}.$$

Predictive multiplicity can arise over an ε-level set where ε = 0 (see e.g., Figure 1). Despite this, we typically measure predictive multiplicity over an ε-level set where ε > 0. This is because a competing model with near-optimal performance on the training data may outperform the optimal model in deployment. In such cases, it would not be defensible to rule out competing models due to small differences in training error. In practice, ε should be set so that the ε-level set is likely to include a model that attains optimal performance in deployment. This can be achieved by computing confidence intervals for out-of-sample performance (e.g., via bootstrapping or cross-validation) or by using generalization bounds (e.g., by setting ε so that with high probability the ε-level set contains the model that optimizes true risk).

Predictive Multiplicity. A prediction problem exhibits predictive multiplicity if competing models assign conflicting predictions over the training data.

Definition 2 (Predictive Multiplicity). Given a baseline classifier $h_0$ and an error tolerance ε, a prediction problem exhibits predictive multiplicity over the ε-level set $S_\epsilon(h_0)$ if there exists a model $h \in S_\epsilon(h_0)$ such that $h(x_i) \neq h_0(x_i)$ for some $x_i$ in the training dataset.

The fact that competing models assign conflicting predictions means that model selection will involve arbitrating irreconcilable predictions.
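To make Definitions 1 and 2 concrete, the sketch below checks whether a candidate linear classifier is a competing model that conflicts with the baseline on at least one training point. This is an illustrative NumPy snippet rather than the paper's released tooling; the function names (`zero_one_risk`, `conflicts_with_baseline`) are our own.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Empirical risk R_hat(h) of the linear classifier h(x) = sign(<w, x>).

    X has shape (n, d + 1) with a leading column of ones for the intercept;
    y holds labels in {-1, +1}.
    """
    preds = np.sign(X @ w)
    preds[preds == 0] = 1  # break ties toward the positive class
    return np.mean(preds != y)

def conflicts_with_baseline(w, w0, X, y, eps):
    """Return True if h (coefficients w) lies in the eps-level set around the
    baseline h0 (coefficients w0) and assigns a conflicting prediction to at
    least one training point, i.e., the problem exhibits predictive multiplicity."""
    in_level_set = zero_one_risk(w, X, y) <= zero_one_risk(w0, X, y) + eps
    preds, base_preds = np.sign(X @ w), np.sign(X @ w0)
    preds[preds == 0], base_preds[base_preds == 0] = 1, 1
    return bool(in_level_set and np.any(preds != base_preds))
```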
In what follows, we present formal measures of predictive multiplicity. Each measure evaluates the severity of predictive multiplicity by counting the number of examples that are assigned conflicting predictions by competing models in the ε-level set.

Definition 3 (Ambiguity). The ambiguity of a prediction problem over the ε-level set $S_\epsilon(h_0)$ is the proportion of points in the training dataset that can be assigned a conflicting prediction by a competing classifier $h \in S_\epsilon(h_0)$:

$$\alpha_\epsilon(h_0) := \frac{1}{n}\sum_{i=1}^n \max_{h \in S_\epsilon(h_0)} \mathbb{1}[h(x_i) \neq h_0(x_i)].$$

Definition 4 (Discrepancy). The discrepancy of a prediction problem over the ε-level set $S_\epsilon(h_0)$ is the maximum proportion of conflicting predictions between the baseline classifier $h_0$ and a competing classifier $h \in S_\epsilon(h_0)$:

$$\delta_\epsilon(h_0) := \max_{h \in S_\epsilon(h_0)} \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(x_i) \neq h_0(x_i)].$$

Ambiguity represents the number of predictions that can change over the set of competing models. This reflects the number of individuals whose predictions are determined by model choice, who could contest the prediction assigned to them by the deployed model, and who should have a say in model selection. Discrepancy represents the maximum number of predictions that can change if we switch the baseline classifier with a competing classifier. This reflects the fact that, in practice, changing multiple predictions requires that the conflicting predictions all be realized by a single competing model.

We end with a discussion of the relationship between accuracy and predictive multiplicity. In Proposition 1, we bound the number of conflicts between the optimal model and a model in the ε-level set. We include a proof in Appendix A.

Proposition 1 (Bound on Discrepancy). The discrepancy between $h_0$ and any competing classifier in the ε-level set $h \in S_\epsilon(h_0)$ obeys:

$$\delta_\epsilon(h_0) \leq 2\hat{R}(h_0) + \epsilon.$$

Proposition 1 demonstrates how the severity of predictive multiplicity depends on the accuracy of the baseline model. Specifically, a less accurate baseline model provides more room for predictive multiplicity. This result motivates why it is important to measure discrepancy and ambiguity using the best possible baseline model.

## 3 Methodology

In this section, we present integer programming tools to compute ambiguity and discrepancy for linear classification problems.

### 3.1 Overview

Baseline Classifier. Our tools compute ambiguity and discrepancy given a baseline linear classifier $h_0$, i.e., the classifier that we would typically deploy. In our experiments, we use a baseline classifier $h_0$ that minimizes the error rate, which we fit using a MIP formulation in Appendix B. This ensures that multiplicity does not arise due to suboptimality. Thus, the only way to avoid multiplicity is to change the prediction problem, i.e., by changing the dataset, the model class, or the constraints.

Path Algorithms. We present path algorithms to compute ambiguity and discrepancy for all possible ε-level sets. Path algorithms efficiently compute the information needed to show how ambiguity and discrepancy change with respect to ε (see Figure 3). These plots relax the need for practitioners to choose ε a priori, and calibrate their choice of ε in settings where small changes in ε may produce large changes in ambiguity and discrepancy.
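Before turning to the MIP formulations, note that Definitions 3 and 4 can be evaluated directly over any finite pool of candidate models. The sketch below does exactly that; because a pool need not contain every classifier in the ε-level set, the resulting values only lower-bound the exact measures computed by the MIP-based tools. The function name `multiplicity_over_pool` and the coefficient-vector representation are our own choices.

```python
import numpy as np

def predictions(w, X):
    """Predictions in {-1, +1} of the linear classifier with coefficients w."""
    p = np.sign(X @ w)
    p[p == 0] = 1
    return p

def multiplicity_over_pool(w0, pool, X, y, eps):
    """Evaluate ambiguity (Definition 3) and discrepancy (Definition 4) over a
    finite pool of candidate coefficient vectors. Both values are lower bounds
    on the exact measures, since the pool may miss classifiers in the level set."""
    n = len(y)
    base_preds = predictions(w0, X)
    base_risk = np.mean(base_preds != y)

    flipped_anywhere = np.zeros(n, dtype=bool)  # points flipped by some competing model
    max_conflicts = 0                           # most flips realized by a single model
    for w in pool:
        preds = predictions(w, X)
        if np.mean(preds != y) <= base_risk + eps:  # w is a competing model
            conflicts = preds != base_preds
            flipped_anywhere |= conflicts
            max_conflicts = max(max_conflicts, int(conflicts.sum()))

    return flipped_anywhere.mean(), max_conflicts / n  # (ambiguity, discrepancy)
```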
MIP Formulations. We compute ambiguity and discrepancy by fitting classifiers from the ε-level set. We fit each classifier by solving a discrete empirical risk minimization problem. We formulate each problem as a mixed integer program (MIP). Our MIP formulations can easily be changed to compute predictive multiplicity for more complex prediction problems, e.g., problems where we optimize other performance measures (e.g., TPR, FPR) or where models must obey constraints on model form or model predictions (e.g., group fairness constraints as in Zafar et al., 2019; Celis et al., 2019; Cotter et al., 2019).

MIP Solvers. We solve each MIP with a MIP solver such as CPLEX, CBC, or Gurobi. MIP solvers find the global optimum of a discrete optimization problem using exhaustive search algorithms like branch-and-bound (Wolsey, 1998). In our setting, solving a MIP returns: (i) an upper bound on the objective value; (ii) a lower bound on the objective value; and (iii) the coefficients of a linear classifier that achieves the upper bound. When the upper bound matches the lower bound, the solution in (iii) is certifiably optimal, and our measures are exact. If a MIP solver does not return a certifiably optimal solution within a user-specified time limit, the bounds from (i) and (ii) can be used to produce bounds on ambiguity and discrepancy.

### 3.2 Computing Discrepancy

Given a training dataset, a baseline classifier $h_0$, and a user-specified error tolerance ε, we compute the discrepancy over the ε-level set around $h_0$ by solving the following optimization problem:

$$\min_{h \in H} \; \sum_{i=1}^n \mathbb{1}[h(x_i) = h_0(x_i)] \quad \text{s.t.} \quad \hat{R}(h) \leq \hat{R}(h_0) + \epsilon. \tag{1}$$

We denote the optimal solution to Equation (1) as $g_\epsilon$. For linear classification problems, we can recover the coefficients of $g_\epsilon$ by solving the following MIP formulation, which we refer to as DiscMIP($h_0$, ε):

$$
\begin{aligned}
\min_{w,\, a} \quad & \sum_{i=1}^n a_i \\
\text{s.t.} \quad & M_i a_i \geq \gamma + h_0(x_i) \sum_{j=0}^d w_j x_{ij} && i = 1, \ldots, n && \text{(2a)} \\
& \sum_{i:\, h_0(x_i) = y_i} (1 - a_i) \;+ \sum_{i:\, h_0(x_i) \neq y_i} a_i \;\leq\; n\,(\hat{R}(h_0) + \epsilon) && && \text{(2b)} \\
& w_j = w_j^+ + w_j^- && j = 0, \ldots, d && \text{(2c)} \\
& \sum_{j=0}^d (w_j^+ - w_j^-) = 1 && && \text{(2d)} \\
& a_i \in \{0, 1\} && i = 1, \ldots, n \\
& w_j^+ \in [0, 1], \; w_j^- \in [-1, 0] && j = 0, \ldots, d
\end{aligned}
$$

DiscMIP minimizes the agreement between $h$ and $h_0$ using indicator variables $a_i = \mathbb{1}[h(x_i) = h_0(x_i)]$. These variables are set via the Big-M constraints in (2a). These constraints depend on: (i) a margin parameter $\gamma > 0$, which should be set to a small positive number (e.g., $\gamma = 10^{-4}$); and (ii) the Big-M parameters $M_i$, which can be set as $M_i = \gamma + \max_i \|x_i\|_1$ since we have fixed $\|w\|_1 = 1$ in constraint (2d). Constraint (2b) ensures that any feasible classifier must belong to the ε-level set.

Bounds. Solving DiscMIP returns the coefficients of the classifier that maximizes discrepancy with respect to the baseline classifier $h_0$. If the solution is not certifiably optimal, the upper bound from DiscMIP corresponds to a lower bound on discrepancy. Likewise, the lower bound from DiscMIP corresponds to an upper bound on discrepancy.
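The following is a minimal sketch of a DiscMIP-style formulation in cvxpy, assuming a MIP-capable solver (e.g., GLPK_MI, CBC, or Gurobi) is configured. It mirrors the structure above but is not the authors' released implementation; in particular, the encoding of the level-set constraint follows our reconstruction of (2b), and the coefficient split uses a nonnegative `w_neg` (so `w = w_pos - w_neg`), a cosmetic difference from the sign convention in the formulation.

```python
import cvxpy as cp
import numpy as np

def disc_mip(w0, X, y, eps, gamma=1e-4):
    """Sketch of a DiscMIP-style program: find a linear classifier in the
    eps-level set around the baseline (coefficients w0) that agrees with the
    baseline on as few training points as possible.
    Returns (discrepancy, coefficients)."""
    n, d1 = X.shape
    h0 = np.sign(X @ w0)
    h0[h0 == 0] = 1
    base_risk = np.mean(h0 != y)

    # Split coefficients into positive and negative parts so that the L1-norm
    # constraint stays linear, as in the paper's MIP.
    w_pos = cp.Variable(d1, nonneg=True)
    w_neg = cp.Variable(d1, nonneg=True)
    w = w_pos - w_neg
    a = cp.Variable(n, boolean=True)  # a_i = 1[h(x_i) = h0(x_i)]

    M = gamma + np.abs(X).sum(axis=1)      # Big-M values; valid since ||w||_1 = 1
    h0_correct = (h0 == y).astype(float)   # points the baseline classifies correctly

    constraints = [
        # (2a): a_i is forced to 1 unless h disagrees with h0 by margin gamma
        cp.multiply(M, a) >= gamma + cp.multiply(h0, X @ w),
        # (2b): errors of h, written in terms of agreement, stay in the eps-level set
        h0_correct @ (1 - a) + (1 - h0_correct) @ a <= n * (base_risk + eps),
        # (2d): fix the L1 norm of w to 1
        cp.sum(w_pos + w_neg) == 1,
        w_pos <= 1,
        w_neg <= 1,
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(a)), constraints)
    prob.solve(solver=cp.GLPK_MI)  # any MIP-capable solver available to cvxpy works

    discrepancy = (n - prob.value) / n
    return discrepancy, w.value
```

Splitting the coefficients into positive and negative parts is what lets the Big-M values be computed from the data alone, since every feasible `w` has unit ℓ1 norm.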
Path Algorithm. In Algorithm 1, we present a procedure to compute discrepancy for all possible values of ε. The procedure solves DiscMIP($h_0$, ε) for increasing values of ε ∈ E. At each iteration, it uses the current solution to initialize DiscMIP for the next iteration. The solution from the previous iteration produces upper and lower bounds that reduce the search space of the MIP, which is much faster than solving DiscMIP separately for each ε.

Algorithm 1: Compute Discrepancy for All Values of ε
Input: baseline classifier h0; E, values of ε sorted in increasing order
for ε in E do
  g_ε ← solution to DiscMIP(h0, ε)
  δ_ε ← number of conflicts between g_ε and h0
  ε_next ← next value of ε in E
  initialize DiscMIP(h0, ε_next) with g_ε
end for
Output: {δ_ε, g_ε} for each ε in E, the discrepancy and a maximally discrepant classifier

### 3.3 Computing Ambiguity

We present an algorithm to compute ambiguity for all possible values of ε. Given a baseline classifier $h_0$, the algorithm fits a pathological classifier $g_i$ for each point in the training data, i.e., the most accurate linear classifier that must assign a conflicting prediction to point $i$. Given a pathological classifier $g_i$ for each $i$, it then computes ambiguity over the ε-level set by counting the number of pathological classifiers whose error is within ε of the error of the baseline classifier. Observe that ambiguity can be expressed as follows:

$$\alpha_\epsilon(h_0) = \frac{1}{n}\sum_{i=1}^n \max_{h \in S_\epsilon(h_0)} \mathbb{1}[h(x_i) \neq h_0(x_i)] = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[\hat{R}(g_i) \leq \hat{R}(h_0) + \epsilon].$$

Thus, this approach corresponds to evaluating the summands in the expression for ambiguity in Definition 3. We fit $g_i$ by solving the following optimization problem:

$$\min_{h \in H} \; \hat{R}(h) \quad \text{s.t.} \quad h(x_i) \neq h_0(x_i). \tag{3}$$

Here, the constraint $h(x_i) \neq h_0(x_i)$ forces $h$ to assign a conflicting prediction to $x_i$. For linear classification problems, we can recover the coefficients of $g_i$ by solving the following MIP formulation, which we refer to as FlipMIP($h_0$, $x_i$):

$$
\begin{aligned}
\min_{w,\, l} \quad & \sum_{k=1}^n l_k \\
\text{s.t.} \quad & M_k l_k \geq \gamma - y_k \sum_{j=0}^d w_j x_{kj} && k = 1, \ldots, n && \text{(4a)} \\
& h_0(x_i) \sum_{j=0}^d w_j x_{ij} \leq -\gamma && && \text{(4b)} \\
& w_j = w_j^+ + w_j^- && j = 0, \ldots, d && \text{(4c)} \\
& \sum_{j=0}^d (w_j^+ - w_j^-) = 1 && && \text{(4d)} \\
& l_k \in \{0, 1\} && k = 1, \ldots, n \\
& w_j^+ \in [0, 1], \; w_j^- \in [-1, 0] && j = 0, \ldots, d
\end{aligned}
$$

FlipMIP minimizes the error rate of a pathological classifier $g_i$ using the indicator variables $l_k = \mathbb{1}[h(x_k) \neq y_k]$. These variables are set through the Big-M constraints in (4a), whose parameters can be set in the same way as those in DiscMIP. Constraint (4b) enforces the condition that $g_i(x_i) \neq h_0(x_i)$.

Bounds. When a solver does not return a certifiably optimal solution to FlipMIP within a user-specified time limit, it will return upper and lower bounds on the objective value of FlipMIP that can be used to bound ambiguity. The upper bound will produce a lower bound on ambiguity. The lower bound will produce an upper bound on ambiguity.

Path Algorithm. In Algorithm 2, we present a procedure to efficiently compute ambiguity by initializing each instance of FlipMIP. In line 2, the procedure sets the upper bound for FlipMIP using the most accurate classifier in POOL that obeys the constraint $h(x_i) \neq h_0(x_i)$. Given a certifiably optimal baseline classifier, we can initialize the lower bound of FlipMIP to $n\hat{R}(h_0)$.

Algorithm 2: Compute Ambiguity for All Values of ε
Input: baseline classifier h0; E, values of ε; POOL ← ∅, a pool of pathological classifiers
1: for i in {1, 2, ..., n} do
2:   initialize FlipMIP(h0, x_i) using the best solution in POOL
3:   g_i ← solution to FlipMIP(h0, x_i)
4:   add g_i to POOL
5: end for
6: for ε in E do
7:   α_ε ← proportion of points i with R̂(g_i) ≤ R̂(h0) + ε
8: end for
Output: {α_ε} for each ε in E, and {g_i} for i = 1, ..., n
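The counting step in lines 6-8 of Algorithm 2 is simple enough to sketch directly. A minimal version, assuming the training errors of the pathological classifiers have already been collected into an array (the names `flip_errors` and `ambiguity_path` are ours):

```python
import numpy as np

def ambiguity_path(flip_errors, base_error, eps_grid):
    """Second loop of Algorithm 2: given R_hat(g_i) for each training point i
    (the error of the best classifier forced to flip point i's prediction),
    compute ambiguity for every error tolerance in eps_grid."""
    flip_errors = np.asarray(flip_errors)
    return {eps: float(np.mean(flip_errors <= base_error + eps)) for eps in eps_grid}

# Example with 5 training points: as the tolerance grows, more points become flippable.
# ambiguity_path([0.30, 0.36, 0.30, 0.40, 0.34], base_error=0.30, eps_grid=[0.0, 0.05, 0.12])
# -> {0.0: 0.4, 0.05: 0.6, 0.12: 1.0}
```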
## 4 Experiments

In this section, we apply our tools to measure predictive multiplicity in recidivism prediction problems. We have three goals: (i) to measure the incidence of predictive multiplicity in real-world classification problems; (ii) to discuss how reporting predictive multiplicity can inform stakeholders; and (iii) to show that we can also measure predictive multiplicity using existing tools, albeit imperfectly. We include software to reproduce our results at https://github.com/charliemarx/pmtools.

Our focus on recidivism prediction should not be viewed as an endorsement of the practice. We consider recidivism prediction since it is a domain where predictive multiplicity has serious ethical implications, and where the existence of predictive multiplicity may serve as an additional reason to forgo the deployment of machine learning entirely (see e.g., Harcourt, 2008; Lum & Isaac, 2016; Barabas et al., 2017, for broader critiques).

| Dataset | Outcome Variable | n | d | Train Error of $h_0$ | Test Error of $h_0$ |
|---|---|---|---|---|---|
| compas arrest | rearrest for any crime | 5,380 | 18 | 32.7% | 33.4% |
| compas violent | rearrest for violent crime | 8,768 | 18 | 37.7% | 37.9% |
| pretrial CA arrest | rearrest for any crime | 9,926 | 22 | 34.1% | 34.4% |
| pretrial CA fta | failure to appear | 8,738 | 22 | 36.3% | 36.3% |
| recidivism CA arrest | rearrest for any offense | 114,522 | 20 | 34.4% | 34.2% |
| recidivism CA drug | rearrest for drug-related offense | 96,664 | 20 | 36.3% | 36.2% |
| recidivism NY arrest | rearrest for any offense | 31,624 | 20 | 31.0% | 31.8% |
| recidivism NY drug | rearrest for drug-related offense | 27,526 | 20 | 32.5% | 33.6% |

Table 1. Recidivism prediction datasets used in Section 4. For each dataset, we fit a baseline linear classifier that minimizes training error. As shown, the models generalize, as training error is close to test error. This is expected given that we fit models from a simple hypothesis class. Here, n and d denote the number of examples and features in each dataset, respectively. All datasets are publicly available. We include a copy of compas arrest and compas violent with our code. The remaining datasets must be requested from ICPSR due to privacy restrictions.

Datasets. We derive 8 datasets from the following studies of recidivism in the United States: compas from Angwin et al. (2016); pretrial from Felony Defendants in Large Urban Counties (US Dept. of Justice, 2014b); recidivism from Recidivism of Prisoners Released in 1994 (US Dept. of Justice, 2014a). We process each dataset by binarizing features and dropping examples with missing entries. For clarity of exposition, we oversample the minority class to equalize the number of positive and negative examples. Oversampling allows us to report our measures for level sets defined in terms of error rates instead of TPR/FPR. We find that oversampling has a negligible effect on our measures of multiplicity. We provide a summary of each dataset in Table 1.

Measurement Protocol. We compute our measures of predictive multiplicity for each dataset as follows. We split each dataset into a training set composed of 80% of points and a test set composed of 20% of points. We use the training set to fit a baseline classifier that minimizes the 0-1 loss directly by solving MIP (6) in Appendix B. We measure ambiguity and discrepancy for all possible values of the error tolerance ε using Algorithms 1 and 2. We solve each MIP on a 3.33 GHz CPU with 16 GB RAM. We allocate at most 6 hours to fit the baseline model, 6 hours to fit the models to compute discrepancy for all ε, and 6 hours to fit the models to compute ambiguity for all ε.

Ad Hoc Measurement Protocol. We also compute ambiguity and discrepancy through an ad hoc approach. We include these results to show that an imperfect analysis of predictive multiplicity can reveal salient information. Here, we produce a pool of competing models using the glmnet package of Friedman et al. (2010). We fit 1,100 linear classifiers using penalized logistic regression. Each model corresponds to an optimizer of the logistic loss with a different degree of $\ell_1$ and $\ell_2$ regularization. We choose the baseline model as the model that minimizes the 5-fold CV test error.
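The ad hoc protocol above uses the R package glmnet. The sketch below shows a rough Python analogue using scikit-learn's elastic-net logistic regression, which one could combine with the earlier `multiplicity_over_pool` sketch; the grid sizes and the helper name `fit_penalized_pool` are our own choices, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_penalized_pool(X, y, n_strengths=25):
    """Fit penalized logistic regression models over a grid of l1/l2 penalties and
    return their coefficients in the (intercept, w_1, ..., w_d) layout used by the
    earlier sketches. X here excludes the leading column of ones."""
    pool = []
    for l1_ratio in np.linspace(0.0, 1.0, 11):        # 0 = pure l2, 1 = pure l1
        for C in np.logspace(-4, 4, n_strengths):     # inverse regularization strength
            clf = LogisticRegression(
                penalty="elasticnet", l1_ratio=l1_ratio, C=C,
                solver="saga", max_iter=5000,
            ).fit(X, y)
            pool.append(np.concatenate([clf.intercept_, clf.coef_.ravel()]))
    return pool
```

Counting conflicting predictions over such a pool mirrors the ad hoc analysis in Figure 3; as the results below show, it detects multiplicity but understates it relative to the exact MIP-based measures.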
Figure 3. Severity of predictive multiplicity measured using our tools (top) and using the ad hoc approach (bottom) for compas arrest. We plot the values of discrepancy (left) and ambiguity (right) over the ε-level set. We find a discrepancy of 17% and an ambiguity of 44% over the 1%-level set. This means that one can change 17% of predictions by switching the baseline model with a model that is only 1% less accurate, and that 44% of individuals are assigned conflicting predictions by models in the 1%-level set. We include similar plots for other datasets in Appendix C.

### 4.2 Results

In Figure 3, we plot ambiguity and discrepancy for all possible values of the error tolerance ε for compas arrest, comparing the measures produced using our tools to those produced using the ad hoc analysis. In Table 2, we compare competing classifiers for compas arrest. In what follows, we discuss these results.

On the Incidence of Predictive Multiplicity. Our results in Figure 3 show how predictive multiplicity arises in real-world prediction problems. For the 8 datasets we consider, we find that between 4% and 53% of individuals are assigned conflicting predictions in the 1%-level set. In compas arrest, for example, we observe an ambiguity of 44%. Considering discrepancy, we can find a competing model in the 1%-level set that would assign a conflicting prediction to 17% of individuals.

On the Burden of Multiplicity. Our results show that the incidence of multiplicity can differ significantly between protected groups. In compas violent, for example, predictive multiplicity disproportionately affects African Americans compared to individuals of other ethnic groups: the proportion of individuals who are assigned conflicting predictions over the 1%-level set is 72.9% for African Americans but 37.2% for Caucasians. Groups with a larger burden of multiplicity are more vulnerable to model selection, and more likely to be affected when competing models are ignored.

On the Implications of Predictive Multiplicity. Our results illustrate how reporting ambiguity and discrepancy can challenge model development and deployment. In compas arrest, for example, our baseline model provably optimizes training error and generalizes. Without an analysis of predictive multiplicity, practitioners could deploy this model. Our analysis reveals that there exists a competing model that assigns conflicting predictions to 17% of individuals. Thus, these measures support the need for greater scrutiny and stakeholder involvement in model selection.

Reporting ambiguity and discrepancy also helps us calibrate trust in downstream processes in the modern machine learning life-cycle (e.g., evaluating feature influence as in Kumar et al., 2020; Marx et al., 2019). Consider the process of explaining individual predictions.
In this case, an ambiguity of 44% means one could produce conflicting explanations for 44% of predictions. While every explanation would help us understand how competing models operate, evidence of conflicting predictions would provide a safeguard against unwarranted rationalization.

| | Baseline Model | Individual Ambiguity Model | Discrepancy Model |
|---|---|---|---|
| $h(x_p)$ | +1 | -1 | -1 |
| Error (Train/Test) | 32.7% / 33.4% | 32.7% / 33.4% | 33.6% / 34.5% |
| Discrepancy (Train/Test) | 0.0% / 0.0% | 0.0037% / 0.0% | 16.8% / 15.1% |
| Score Function | linear scores over binarized features for age, sex, prior arrests, juvenile misdemeanors and felonies, and charge degree; the three models assign markedly different coefficients | | |

Table 2. Competing linear classifiers that assign conflicting predictions to $x_p$ in compas arrest. We show the baseline model (left), the competing model fit to measure ambiguity for $x_p$ (middle), and the competing model fit to measure discrepancy (right). The baseline model predicts $h(x_p) = +1$ while the other models predict $h(x_p) = -1$. Thus, there exist at least two competing models that predict that $x_p$ would not recidivate. In addition, each model exhibits different coefficients and measures of variable importance.

On Model Selection. When presented with many competing models, a natural solution is to choose among them to optimize secondary objectives. We support this practice when secondary objectives reflect bona fide goals rather than a means of resolving multiplicity (see Section 5 for a discussion). However, tie-breaking does not always yield a unique model. For example, on the compas arrest dataset, we can break ties between competing models in the 1%-level set on the basis of a group fairness criterion (i.e., by minimizing the disparity in accuracy between African Americans and other ethnic groups). In this case, we find 102 competing models that are also within 1% of optimal in terms of the secondary criterion.

On Ad Hoc Measurement. Our results for the ad hoc approach show how measuring and reporting predictive multiplicity can reveal useful information even without specialized tools. In compas arrest, for example, an ad hoc analysis reveals an ambiguity of 10% and a discrepancy of 7% over the set of competing models. These estimates are far lower than those produced using our tools (44% and 17%, respectively). This is because the ad hoc approach only considers competing models that can be obtained by varying the $\ell_1$ and $\ell_2$ penalties in penalized logistic regression, rather than all linear classifiers in the 1%-level set.
These results show that ad hoc approaches can detect predictive multiplicity, but should not be used to certify the absence of multiplicity.

## 5 Concluding Remarks

Prediction problems can exhibit predictive multiplicity due to a host of reasons, including feature selection, a misspecified hypothesis class, or the existence of latent groups. Even as there exist techniques to choose between competing models, we do not advocate a general prescription to resolve predictive multiplicity. Instead, we argue that we should measure and report multiplicity just as we measure and report test error (Saleiro et al., 2018; Reisman et al., 2018). In this way, predictive multiplicity can be resolved on a case-by-case basis, and in a way that allows for input from stakeholders (as per the principles of contestable design; see e.g., Hirsch et al., 2017; Kluttz et al., 2018).

Reporting predictive multiplicity can change how we build and deploy models in human-facing applications. In such settings, presenting stakeholders with meaningful information about predictive multiplicity may lead them to think carefully about which model to deploy, consider assigning favorable predictions to individuals who receive conflicting predictions, or forgo deployment entirely.

## Acknowledgements

We thank Sorelle Friedler, Dylan Slack, and Ben Green for helpful discussions, and anonymous reviewers for constructive feedback. This research is supported in part by the National Science Foundation under Grant No. CAREER CIF-1845852 and by a Google Faculty Award.

## References

Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., and Tapp, A. Fairwashing: the risk of rationalization. arXiv preprint arXiv:1901.09749, 2019.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. ProPublica, May 23, 2016.

Barabas, C., Dinakar, K., Ito, J., Virza, M., and Zittrain, J. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. arXiv preprint arXiv:1712.08238, 2017.

Belotti, P., Bonami, P., Fischetti, M., Lodi, A., Monaci, M., Nogales-Gómez, A., and Salvagnin, D. On handling indicator constraints in mixed integer programming. Computational Optimization and Applications, 65(3):545-566, 2016.

Binns, R., Van Kleek, M., Veale, M., Lyngs, U., Zhao, J., and Shadbolt, N. 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 377. ACM, 2018.

Breiman, L. Statistical modeling: The two cultures. Statistical Science, 16(3):199-231, 2001.

Cawley, G. C. and Talbot, N. L. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079-2107, 2010.

Celis, L. E., Huang, L., Keswani, V., and Vishnoi, N. K. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 319-328. ACM, 2019.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797-806. ACM, 2017.
Cotter, A., Jiang, H., Wang, S., Narayan, T., You, S., Sridharan, K., and Gupta, M. R. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. 2019.

Ding, J., Tarokh, V., and Yang, Y. Model selection techniques: An overview. IEEE Signal Processing Magazine, 35(6):16-34, 2018.

Dong, J. and Rudin, C. Variable importance clouds: A way to explore variable importance for the set of good models. arXiv preprint arXiv:1901.03209, 2019.

Dusenberry, M. W., Tran, D., Choi, E., Kemp, J., Nixon, J., Jerfel, G., Heller, K., and Dai, A. M. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 204-213, 2020.

Fisher, A., Rudin, C., and Dominici, F. All models are wrong but many are useful: Variable importance for black-box, proprietary, or misspecified prediction models, using model class reliance. arXiv preprint arXiv:1801.01489, 2018.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884-1892, 2016.

Harcourt, B. E. Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age. University of Chicago Press, 2008.

Hirsch, T., Merced, K., Narayanan, S., Imel, Z. E., and Atkins, D. C. Designing contestability: Interaction design, machine learning, and mental health. In Proceedings of the 2017 Conference on Designing Interactive Systems, pp. 95-99. ACM, 2017.

Kleinberg, J., Ludwig, J., Mullainathan, S., and Obermeyer, Z. Prediction policy problems. American Economic Review, 105(5):491-495, 2015.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Kluttz, D., Kohli, N., and Mulligan, D. K. Contestability and professionals: From explanations to engagement with algorithmic systems. Available at SSRN 3311894, 2018.

Koehler, D. J. Explanation, imagination, and confidence in judgment. Psychological Bulletin, 110(3):499, 1991.

Kumar, I. E., Venkatasubramanian, S., Scheidegger, C., and Friedler, S. Problems with Shapley-value-based explanations as feature importance measures. arXiv preprint arXiv:2002.11097, 2020.

Laugel, T., Lesot, M.-J., Marsala, C., Renard, X., and Detyniecki, M. The dangers of post-hoc interpretability: Unjustified counterfactual explanations. arXiv preprint arXiv:1907.09294, 2019.

Lum, K. and Isaac, W. To predict and serve? Significance, 13(5):14-19, 2016.

Martens, D. and Provost, F. Explaining data-driven document classifications. MIS Quarterly, 38(1):73-100, 2014.

Marx, C., Phillips, R., Friedler, S., Scheidegger, C., and Venkatasubramanian, S. Disentangling influence: Using disentangled representations to audit model predictions. In Advances in Neural Information Processing Systems, pp. 4496-4506, 2019.

McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37(3):355-363, 1999.

McAllister, J. W. Model selection and the multiplicity of patterns in empirical data. Philosophy of Science, 74(5):884-894, 2007.

McCullagh, P. and Nelder, J. A. Generalized Linear Models, volume 37. CRC Press, 1989.

Mountain, D. and Hsiao, C. A combined structural and flexible functional approach for modeling energy substitution. Journal of the American Statistical Association, 84(405):76-87, 1989.
Nguyen, T. and Sanner, S. Algorithms for direct 0-1 loss optimization in binary classification. In Proceedings of the 30th International Conference on Machine Learning, pp. 1085-1093, 2013.

Passi, S. and Barocas, S. Problem formulation and fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 39-48. ACM, 2019.

Reisman, D., Schultz, J., Crawford, K., and Whittaker, M. Algorithmic impact assessments: A practical framework for public agency accountability. AI Now Institute, 2018.

Saleiro, P., Kuester, B., Stevens, A., Anisfeld, A., Hinkson, L., London, J., and Ghani, R. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

Semenova, L. and Rudin, C. A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755, 2019.

Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180-186, 2020.

US Dept. of Justice, Bureau of Justice Statistics. Recidivism of prisoners released in 1994. 2014a. doi: 10.3886/ICPSR03355.v8.

US Dept. of Justice, Bureau of Justice Statistics. State court processing statistics, 1990-2009: Felony defendants in large urban counties. 2014b. doi: 10.3886/ICPSR02038.v5.

Ustun, B. and Rudin, C. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349-391, 2015. doi: 10.1007/s10994-015-5528-6.

Ustun, B., Spangher, A., and Liu, Y. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pp. 10-19. ACM, 2019.

Ustun, B., Liu, Y., and Parkes, D. Fairness without harm: Decoupled classifiers with preference guarantees. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6373-6382. PMLR, 2019.

Wolsey, L. A. Integer Programming, volume 42. Wiley, New York, 1998.

Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1-42, 2019.