# Learning Optimal Predictive Checklists

Haoran Zhang (Massachusetts Institute of Technology, haoranz@mit.edu), Quaid Morris (Memorial Sloan Kettering Cancer Center, morrisq@mskcc.org), Berk Ustun* (UC San Diego, berk@ucsd.edu), Marzyeh Ghassemi* (Massachusetts Institute of Technology, mghassem@mit.edu)

*Equal supervision. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Checklists are simple decision aids that are often used to promote safety and reliability in clinical applications. In this paper, we present a method to learn checklists for clinical decision support. We represent predictive checklists as discrete linear classifiers with binary features and unit weights. We then learn globally optimal predictive checklists from data by solving an integer programming problem. Our method allows users to customize checklists to obey complex constraints, including constraints to enforce group fairness and to binarize real-valued features at training time. In addition, it pairs models with an optimality gap that can inform model development and determine the feasibility of learning sufficiently accurate checklists on a given dataset. We pair our method with specialized techniques that speed up its ability to train a predictive checklist that performs well and has a small optimality gap. We benchmark the performance of our method on seven clinical classification problems, and demonstrate its practical benefits by training a short-form checklist for PTSD screening. Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.

## 1 Introduction

Checklists are simple tools that are widely used to assist humans when carrying out important tasks or making important decisions [8, 13, 18, 35, 44, 50, 60, 61, 64, 65, 66]. These tools are often used as predictive models in modern healthcare applications. In such settings, a checklist is a set of Boolean conditions that predicts a condition of interest, e.g., a list of symptoms that flags a patient for a critical illness when M out of N symptoms are checked. These kinds of predictive checklists are often used for clinical decision support because they are easy to use and easy to understand [33, 54]. In contrast to other kinds of predictive models, clinicians can easily scrutinize a checklist and make an informed decision as to whether they will adopt it. Once they have decided to use a checklist, they can integrate the model into their clinical workflow without extensive training or technology (e.g., as a printed sheet [52]).

Considering these benefits, one of the key challenges in using predictive checklists in healthcare applications is finding a reliable way to create them [33]. Most predictive checklists in medicine are either hand-crafted by panels of experts [30, 41], or built by combining statistical techniques and heuristics [e.g., logistic regression, stepwise feature selection, and rounding; 39]. These approaches make it difficult to develop checklists that are sufficiently accurate, as the panel or pipeline must effectively specify a model that performs well under stringent assumptions on model form. Given the simplicity of the model class, it is entirely possible that some datasets may never admit a checklist that is sufficiently accurate to deploy in a clinical setting, as even checklists that are accurate at the population level may perform poorly on a minority population [57, 70].
In this paper, we introduce a machine learning method to learn checklists from data. Our method is designed to streamline the creation of predictive checklists in a way that overcomes specific challenges of model development in modern healthcare applications. Our method solves an integer programming problem to return the most accurate checklist that obeys user-specified constraints on model form and/or model performance. This approach is computationally challenging, but provides specific functionality that simplifies and streamlines model development. First, it learns the most accurate checklist by optimizing exact measures of model performance (i.e., accuracy rather than a convex surrogate measure). Second, it seeks to improve the performance of checklists by adaptively binarizing features at training time. Third, it allows practitioners to train checklists that obey custom requirements on model form or on predictions, by allowing them to encode these requirements as constraints in the optimization problem. Finally, it provides practitioners with an optimality gap, which informs them when a sufficiently accurate checklist does not exist.

The main contributions of this paper are:

1. We present a machine learning method to learn checklists from data. Our method allows practitioners to customize models to obey a wide range of real-world constraints. In addition, it pairs checklists with an optimality gap that can inform practitioners in model development.
2. We develop specialized techniques that improve the ability of our approach to train a checklist that performs well, and to pair this model with a small optimality gap. One of these techniques can be used to train checklists heuristically.
3. We conduct a broad empirical study of predictive checklists on clinical classification datasets [24, 32, 37, 51, 58]. Our results show that predictive checklists can perform as well as state-of-the-art classifiers on some datasets, and that our method can provide practitioners with an optimality gap to flag when this is not the case. We highlight the ability of our method to handle real-world requirements through applications where we enforce group fairness constraints on a mortality prediction task, and where we build a short-form checklist to screen for PTSD.
4. We provide a Python package to train and customize predictive checklists with open-source and commercial solvers, including CBC [29] and CPLEX [20] (see https://github.com/MLforHealth/predictive_checklists).

## 2 Related Work

The use of checklist-style classifiers, i.e., classifiers that assign coefficients of 1 to Boolean conditions, dates to the work of Burgess [9]. The practice is often referred to as "unit weighting", and is motivated by observations that it often performs surprisingly well [see e.g., 7, 19, 23, 27]. Our work is strongly related to methods that learn sparse linear classifiers with small integer coefficients [see e.g., 4, 11, 31, 38, 68, 69, 75]. We learn checklists from data by solving an integer program. Our problem can be viewed as a special case of an IP proposed by Ustun and Rudin [68] and Zeng et al. [77] to learn sparse linear classifiers with small integer coefficients. The problem that we consider is considerably easier to solve from a computational standpoint because it restricts coefficients to binary values and does not include $\ell_0$-regularization.
Here, we make use of this "slack" in computation to add functionality that improves the performance of checklists, namely: (1) constraints to binarize continuous and categorical features into items that can be included in a checklist [see also 10, 12, 71]; and (2) specialized techniques to reduce computation during the learning process.

Checklists are M-of-N rules, i.e., classifiers that predict $y = +1$ when at least M out of N conditions are true. Early work in machine learning used M-of-N rules in an auxiliary capacity, e.g., to explain the predictions of neural networks [67], or as features in decision trees [78]. More recent work has focused on learning M-of-N rules for standalone prediction; e.g., Chevaleyre et al. [16] describe a method to learn M-of-N rules by rounding the coefficients of linear SVMs to $\{-1, 0, 1\}$ using a randomized rounding procedure [see also 79]. Methods to learn disjunctions and conjunctions are also relevant since they correspond to 1-of-N rules and N-of-N rules, respectively. These methods learn models by solving a special case of the ERM problem in (1) where M = 1 or M = N. Given that this problem is NP-hard, existing methods often reduce computation by using algorithms that return approximate solutions, e.g., simulated annealing [72], set cover [48, 49], or ERM with a convex surrogate loss [21, 22, 46, 47]. These approaches could be used to learn checklists by solving the ERM problem in (1). However, they would not be able to handle the discrete constraints needed for binarization and customization. More generally, they would not be guaranteed to recover the most accurate checklist for a given dataset, nor pair models with a certificate of optimality that can be used to inform model development.

## 3 Methodology

We start with a dataset of $n$ examples $(x_i, y_i)_{i=1}^n$, where $x_i = [x_{i,1}, \ldots, x_{i,d}] \in \{0,1\}^d$ is a vector of $d$ binary variables and $y_i \in \{\pm 1\}$ is a label. Here, the $x_i$ are Boolean variables that could be used as items in a checklist. We denote the indices of positive and negative examples as $I^+ = \{i \mid y_i = +1\}$ and $I^- = \{i \mid y_i = -1\}$ respectively, and let $n^+ = |I^+|$ and $n^- = |I^-|$. We denote the set of positive integers up to $k$ as $[k] = \{1, \ldots, k\}$.

We use the dataset to learn a predictive checklist, i.e., a Boolean threshold rule that predicts $\hat{y} = +1$ when at least $M$ of $N$ items are checked. We represent a checklist as a linear classifier of the form $\hat{y}_i = \mathrm{sign}(\lambda^\top x_i - M)$. Here, $\lambda = [\lambda_1, \ldots, \lambda_d] \in \{0,1\}^d$ is a coefficient vector and $\lambda_j = 1$ iff the checklist contains item $j$. We denote the number of items in a checklist with coefficients $\lambda$ as $N = \sum_{j=1}^d \lambda_j$, and denote the threshold number of items that must be checked to assign a positive prediction as $M$.

We learn checklists from data by solving an empirical risk minimization problem of the form:

$$\min_{\lambda, M} \;\; l(\lambda, M) + \epsilon_N N + \epsilon_M M \tag{1}$$

Here, $l(\lambda, M) = \sum_{i=1}^n \mathbb{1}[y_i \neq \hat{y}_i]$ counts the number of mistakes of a checklist with parameters $\lambda$ and $M$. The parameters $\epsilon_N > 0$ and $\epsilon_M > 0$ are small penalties used to specify lexicographic preferences. We set $\epsilon_N < \frac{1}{nd}$ and $\epsilon_M < \frac{\epsilon_N}{d}$. These choices ensure that optimization will return the most accurate checklist, breaking ties between checklists that are equally accurate to favor smaller $N$, and breaking ties between checklists that are equally accurate and sparse to favor smaller $M$. Smaller values of $N$ and $M$ are preferable: with smaller $N$, users check fewer items, and with smaller $M$, users can stop checking as soon as $M$ items are checked (see Figure 3).
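To make the model class and the lexicographic penalties concrete, the following is a minimal sketch (not the authors' released package); the function and array names are our own, and the toy data is illustrative.

```python
import numpy as np

def checklist_predict(X, item_mask, M):
    """Predict +1 when at least M of the selected items are checked.

    X         : (n, d) binary matrix of candidate items.
    item_mask : (d,) binary vector; item_mask[j] = 1 if item j is on the checklist.
    M         : threshold number of checked items for a positive prediction.
    """
    checked = X @ item_mask              # number of checklist items checked per example
    return np.where(checked >= M, 1, -1)

def erm_objective(X, y, item_mask, M):
    """Objective of problem (1): mistakes plus tiny penalties that prefer small N, then small M."""
    n, d = X.shape
    eps_N = 0.9 / (n * d)                # eps_N < 1 / (n d)
    eps_M = 0.9 * eps_N / d              # eps_M < eps_N / d
    mistakes = np.sum(checklist_predict(X, item_mask, M) != y)
    N = int(item_mask.sum())
    return mistakes + eps_N * N + eps_M * M

# toy usage: 5 examples, 3 candidate items, a 1-of-2 checklist over items 0 and 2
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 1, 0]])
y = np.array([1, 1, -1, -1, 1])
print(erm_objective(X, y, item_mask=np.array([1, 0, 1]), M=1))
```

Because the penalties are smaller than one mistake, two checklists with different error counts are never reordered by the $N$ and $M$ terms; the terms only break ties.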
We recover a globally optimal solution to (1) by solving the integer program (IP):

$$
\begin{aligned}
\min_{\lambda, z, M} \quad & l^+ + W^- l^- + \epsilon_N N + \epsilon_M M \\
\text{s.t.} \quad & B_i z_i \geq M - \sum_{j=1}^d \lambda_j x_{i,j} && i \in I^+ && (2\text{a}) \\
& B_i z_i \geq \sum_{j=1}^d \lambda_j x_{i,j} - M + 1 && i \in I^- && (2\text{b}) \\
& l^+ = \sum_{i \in I^+} z_i, \qquad l^- = \sum_{i \in I^-} z_i, \qquad N = \sum_{j=1}^d \lambda_j \\
& M \in [N], \qquad z_i \in \{0,1\} \;\; i \in [n], \qquad \lambda_j \in \{0,1\} \;\; j \in [d]
\end{aligned}
\tag{2}
$$

Here, $l^+$ and $l^-$ are variables that count the number of mistakes on positive and negative examples. The values of $l^+$ and $l^-$ are computed using the mistake indicators $z_i = \mathbb{1}[\hat{y}_i \neq y_i]$, which are set to 1 through the "Big-M" constraints in (2a) and (2b). These constraints depend on the Big-M parameters $B_i$, which can each be set to its tightest possible value, e.g., $B_i = \max_{\lambda, M}\,(M - \lambda^\top x_i)$ for $i \in I^+$. $W^-$ is a user-defined parameter that reflects the relative cost of misclassifying a negative example. By default, we set $W^- = 1$, so that the objective minimizes training error.

Customization. The IP in (2) can be customized to fit checklists that obey a wide range of real-world requirements on performance and model form. We list examples of constraints in Table 1. Our method provides two ways of handling classification problems with different costs of False Negatives (FNs) and False Positives (FPs). First, the user can specify the relative misclassification costs for FPs and FNs in the objective function by setting $W^-$. Here, optimizing (2) corresponds directly to minimizing the weighted error. Second, the user can instead specify limits on FPR and FNR, e.g., minimize the training FPR (by setting $W^- = n$) subject to training FNR $\leq 20\%$.

| Model Requirement | Example | Constraint |
|---|---|---|
| Model Size | Use at most $N_{\max}$ items | $N \leq N_{\max}$ |
| Procedural | If the checklist includes an item about coughing, then include an item about fever | $\lambda_{\text{fever}} \geq \lambda_{\text{cough}}$ |
| Prediction | Predict $\hat{y}_i = +1$ for all patients with both fever and coughing | $\lambda_{\text{fever}} x_{i,\text{fever}} + \lambda_{\text{cough}} x_{i,\text{cough}} \geq M$ for all $i$ with $x_{i,\text{fever}} = x_{i,\text{cough}} = 1$ |
| Class-Based Accuracy | Max FPR of $\gamma$ | $l^- \leq \gamma\, n^-$ |
| Group Fairness | Max FPR disparity of $\gamma$ between males and females | $\frac{l^-_M}{n^-_M} - \frac{l^-_F}{n^-_F} \leq \gamma$ |
| Minimax Fairness | No group with FNR worse than $\delta$ | $l^+_g \leq \delta\, n^+_g \;\; \forall g \in G$ |

Table 1: Model requirements that can be addressed by our method. Each requirement can be directly encoded into IP (2). The IP can then be solved with the same MIP solver to obtain the most accurate checklist that obeys these constraints.

Adaptive Binarization. Practitioners often work with datasets that contain real-valued and categorical features. In such settings, rule-learning methods often require practitioners to binarize such features before training. Our approach allows practitioners to binarize features into items during training, so that all items are binarized in a way that maximizes accuracy. This approach requires practitioners to first define $T_j$ candidate items for each non-binary feature $u_{i,j}$. For example, given a real-valued feature $u_{i,j} \in \mathbb{R}$, the set of candidate items could take the form $\{x_{i,j,1}, x_{i,j,2}, \ldots, x_{i,j,T_j}\}$, where $x_{i,j,t} = \mathbb{1}[u_{i,j} \geq v_{j,t}]$ and $v_{j,t}$ is a threshold. We would add the constraint $\sum_{t \in [T_j]} \lambda_{j,t} \leq 1$ to IP (2) to ensure that the checklist uses at most one of the $T_j$ items. In this way, the IP chooses a binary item that is aligned with the goal of maximizing accuracy. In practice, the set of candidate items can be produced automatically [e.g., using binning or information gain; 53] or specified on the basis of clinical standards (as real-valued features like blood pressure and BMI have established thresholds). This approach is general enough to allow practitioners to optimize over all possible thresholds, though this may lead to overfitting. Our experiments in Section 5.2 show that adaptive binarization produces a meaningful improvement in performance for our model class.
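The sketch below shows one way IP (2) could be written down with the open-source python-mip package (CBC backend, consistent with the solvers mentioned above); it is our own illustration rather than the authors' released implementation, the helper name `fit_checklist_ip` is hypothetical, and the loose Big-M constant and default weights are our own choices. It includes a model-size constraint and the one-item-per-feature-group constraint used for adaptive binarization.

```python
from mip import Model, xsum, minimize, BINARY, INTEGER

def fit_checklist_ip(X, y, N_max=8, groups=None, W_neg=1.0, time_limit=60):
    """Sketch of IP (2). X is an (n, d) 0/1 matrix of candidate items; y has entries in {-1, +1}.

    groups maps a feature-group name to the column indices derived from that feature,
    so at most one binarized item per real-valued feature enters the checklist.
    """
    n, d = len(X), len(X[0])
    I_pos = [i for i in range(n) if y[i] == +1]
    I_neg = [i for i in range(n) if y[i] == -1]
    eps_N = 0.9 / (n * d)
    eps_M = 0.9 * eps_N / d

    m = Model()                                                 # CBC by default
    lam = [m.add_var(var_type=BINARY) for _ in range(d)]        # item selection lambda_j
    z = [m.add_var(var_type=BINARY) for _ in range(n)]          # mistake indicators z_i
    M = m.add_var(var_type=INTEGER, lb=1, ub=N_max)             # threshold M
    N = xsum(lam)

    B = d + 1  # a valid but loose Big-M constant; tighter per-example values speed up the solve
    for i in I_pos:   # (2a): if fewer than M selected items are checked, z_i is forced to 1
        m += B * z[i] >= M - xsum(lam[j] * X[i][j] for j in range(d))
    for i in I_neg:   # (2b): if at least M selected items are checked, z_i is forced to 1
        m += B * z[i] >= xsum(lam[j] * X[i][j] for j in range(d)) - M + 1

    m += N <= N_max                        # model size constraint (Table 1)
    m += M <= N                            # M cannot exceed the number of items
    if groups is not None:                 # adaptive binarization: one item per feature group
        for cols in groups.values():
            m += xsum(lam[j] for j in cols) <= 1

    l_pos = xsum(z[i] for i in I_pos)
    l_neg = xsum(z[i] for i in I_neg)
    m.objective = minimize(l_pos + W_neg * l_neg + eps_N * N + eps_M * M)
    m.optimize(max_seconds=time_limit)

    # assumes a feasible checklist was found within the time limit
    gap = 1 - m.objective_bound / m.objective_value             # optimality gap reported to the user
    items = [j for j in range(d) if lam[j].x >= 0.5]
    return items, round(M.x), gap
```

The same pattern extends to the other rows of Table 1: each requirement becomes one or more linear constraints added with `m +=` before calling `optimize`.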
Optimality Gap. We solve IP (2) with a MIP solver, which finds a globally optimal solution through exhaustive search algorithms like branch-and-bound [73]. Given an instance of IP (2) and a time limit for when to stop the search, a solver returns: (i) the best feasible checklist found within the time limit, i.e., values of $\lambda$ and $M$ that achieve the best objective value $V^{\max} = l(\lambda, M)$; and (ii) a lower bound $V^{\min}$ on the objective value of any feasible solution. These quantities are used to compute the optimality gap $\varepsilon := 1 - \frac{V^{\min}}{V^{\max}}$. When the upper bound $V^{\max}$ matches the lower bound $V^{\min}$, the solver returns a checklist with an optimality gap of $\varepsilon = 0$, i.e., one that is certifiably optimal.

In our setting, the optimality gap reflects the worst-case difference in training error between the checklist that a solver returns and the most accurate checklist that we could hope to obtain by solving IP (2). Given a solution to IP (2) with an objective value of $L$ and an optimality gap of $\varepsilon$, any feasible checklist has a training error of at least $(1 - \varepsilon)L$. When $\varepsilon$ is small, the most accurate checklist has a training error close to $L$. Thus, if we are not satisfied with the performance of our model, we know that no checklist can perform substantially better than $L$, and can make an informed decision to relax constraints, fit a classifier from a more complex hypothesis class, or not fit a classifier at all. When $\varepsilon$ is large, our problem may admit a checklist with training error far smaller than $L$. In that case, if we are not satisfied with the performance of the checklist, we cannot determine whether this is because the dataset does not admit a sufficiently accurate model or because the solver has not been able to find it.

## 4 Algorithmic Improvements

In this section, we present specialized techniques that reduce the time required to find a feasible solution and/or produce a solution with a smaller optimality gap.

### 4.1 Submodular Heuristic

Our first technique is a submodular optimization procedure that we use to generate initial feasible solutions for IP (2). We consider a knapsack cover problem that can be solved to produce a checklist of $N_{\max}$ or fewer items that: (1) maximizes coverage of positive examples; (2) covers at most a budget $B$ of negative examples; and (3) uses at most one item from each of the feature groups $R_1, \ldots, R_T$, where each $R_t$ denotes the set of items derived from a particular real-valued feature and $\cup_{t=1}^T R_t = [d]$.

$$
\begin{aligned}
\max_{A \subseteq [d]} \quad & f^+_M(A) && (3\text{a}) \\
\text{s.t.} \quad & |A| \leq N_{\max} && (3\text{b}) \\
& |A \cap R_t| \leq 1 \quad t \in [T] && (3\text{c}) \\
& \sum_{j \in A} c(j) \leq B && (3\text{d})
\end{aligned}
\tag{3}
$$

Here, $f^+_M(A) := \sum_{i \in I^+} \min\big(\sum_{j \in A} x_{i,j},\, M\big)$ rewards covering positive examples with at least $M$ items in $A$: each positive example contributes the number of items in $A$ that it checks, truncated at $M$. Constraint (3b) limits the number of items in the checklist. Constraint (3c) requires the checklist to use at most one item from each feature group $R_1, \ldots, R_T$. Constraint (3d) controls the number of negative examples that are covered by items in $A$. This is a knapsack constraint that assigns each item a cost $c(j) = \sum_{i \in I^-} x_{i,j}$ and limits the overall cost of $A$ using the budget parameter $B > 0$. The optimization problem in (3) can be solved using a submodular maximization algorithm, given that $f^+_M$ is monotone submodular (see Appendix A.1 for a proof) and that constraints (3b) to (3d) are special kinds of matroid constraints. We solve (3) with a submodular maximization algorithm adapted from Badanidiyuru and Vondrák [2] (Appendix A.2). The procedure takes as input $M$, $N_{\max}$, and $B$, and outputs a set of checklists that use up to $N_{\max}$ items. This procedure can run within 1 second. Given a problem where we want to train a checklist that has at most $N_{\max}$ items, we solve this problem $|\mathcal{B}| \cdot N_{\max}$ times, varying $M \in [N_{\max}]$ and $B \in \mathcal{B}$.
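A plain greedy variant conveys the idea behind the heuristic; the paper adapts a faster algorithm from Badanidiyuru and Vondrák [2], so the sketch below is our own simplified illustration with hypothetical function names. It adds, one at a time, the item with the largest gain in the truncated coverage $f^+_M$ while respecting the size, one-item-per-group, and budget constraints.

```python
import numpy as np

def truncated_coverage(X_pos, items, M):
    """f_M^+(A): each positive example contributes the number of chosen items it checks, capped at M."""
    if len(items) == 0:
        return 0
    return int(np.minimum(X_pos[:, items].sum(axis=1), M).sum())

def greedy_cover(X_pos, X_neg, M, N_max, budget, groups):
    """Greedy sketch of problem (3). `groups` is a list of column-index lists, one per source feature."""
    cost = X_neg.sum(axis=0)                                    # c(j): negatives that check item j
    group_of = {j: t for t, cols in enumerate(groups) for j in cols}
    chosen, used_groups, spent = [], set(), 0
    while len(chosen) < N_max:                                  # constraint (3b)
        base = truncated_coverage(X_pos, chosen, M)
        best_j, best_gain = None, 0
        for j in range(X_pos.shape[1]):
            if j in chosen or group_of[j] in used_groups:       # constraint (3c)
                continue
            if spent + cost[j] > budget:                        # constraint (3d)
                continue
            gain = truncated_coverage(X_pos, chosen + [j], M) - base
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:                                      # no item improves coverage within budget
            break
        chosen.append(best_j)
        used_groups.add(group_of[best_j])
        spent += cost[best_j]
    return chosen
```

The prefixes of the greedy selection give checklists of every size up to $N_{\max}$, so one run yields a pool of candidate checklists that can seed IP (2).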
```
Algorithm 1: Sequential Training over N and M
Input: maximum number of items N_max
Input: objective value function V(·)
pool ← ∅
for i ∈ [N_max], j ∈ [i] do
    S ← {C | C ∈ pool, N(C) ≤ i, M(C) ≤ j}
    C_init ← argmin_{C ∈ S} V(C)
    C_new ← solution of IP (2) with initial solution C_init, subject to N ≤ i and M ≤ j
    add {C_new} to pool
end
return pool
```

Figure 1: (a) Performance profile when solving IP (2) with and without initialization: optimality gap versus the number of branch-and-bound nodes when solving IP (2) for the heart dataset with $N \leq 6$ and $M \leq 2$. (b) Algorithm for sequential training over N and M.

In Figure 1a, we show the effect of initializing the IP with checklists produced by the submodular heuristic. As shown, these solutions allow us to initialize the search with a good upper bound, allowing us to ultimately recover a certifiably optimal solution faster. Our empirical results in Section 5.2 show that checklists from our submodular heuristic outperform checklists fit using existing baselines. This highlights the value of this approach as a heuristic to fit checklists (in that it can produce checklists that perform well), and as a technique for initialization (in that it can produce checklists that can be used to initialize IP (2) with a small upper bound).

### 4.2 Path Algorithms

Our second technique is a path algorithm for sequential training. Given an instance of the checklist IP (2), this algorithm sequentially solves smaller instances of IP (2). The algorithm exploits the fact that small instances can be solved to optimality quickly and thus "ruled out" of the feasible region of subsequent instances. The resulting approach can reduce the size of the branch-and-bound tree and improve the ability to recover a checklist with a small optimality gap. These algorithms can wrap around any procedure to solve IP (2). In Algorithm 1, we present a path algorithm to train checklists over the full range of model sizes from 1 to $N_{\max}$. The algorithm recovers a solution to a problem with a hard model size limit by solving smaller instances with progressively larger limits on $N$ and $M$. We store the best feasible solution from each iteration in a pool, which is then used to initialize future iterations. If the best feasible solution is certifiably optimal, we add a constraint to restrict the search space for subsequent instances. This allows us to reduce the size of the branch-and-bound tree, resulting in a faster solution in each iteration.

## 5 Experimental Results

In this section, we benchmark the performance of our methods on seven clinical prediction tasks.

Data. We consider the seven clinical classification tasks shown in Table 2. For each task, we create a classification dataset by using each of the following techniques to binarize features:

- Fixed: We convert continuous and ordinal features into threshold indicator variables using the median, and convert each categorical feature into an indicator for its most common category.
- Adaptive: We convert continuous and ordinal features into 4 threshold indicator variables using quintiles as thresholds, and convert each categorical feature with a one-hot encoding.
- Optbinning: We use the method proposed by Navas-Palencia [53] to binarize all features.

Due to class imbalance, we oversample the minority class of each dataset to equalize the number of positive and negative examples.
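As an illustration of the Adaptive scheme, the sketch below (our own helper using pandas; the column names in the usage line are hypothetical) expands each continuous or ordinal column into four threshold indicators at its quintiles and one-hot encodes each categorical column, tracking which items come from the same source feature so that the one-item-per-group constraint can be applied.

```python
import pandas as pd

def adaptive_binarize(df, continuous_cols, categorical_cols):
    """Expand raw features into candidate binary items, tracking feature groups."""
    items, groups = {}, {}
    for col in continuous_cols:
        thresholds = df[col].quantile([0.2, 0.4, 0.6, 0.8]).unique()    # quintile cut points
        names = []
        for v in thresholds:
            name = f"{col}>={v:.3g}"
            items[name] = (df[col] >= v).astype(int)                    # x = 1[u >= v]
            names.append(name)
        groups[col] = names
    for col in categorical_cols:
        dummies = pd.get_dummies(df[col], prefix=col).astype(int)       # one-hot encoding
        for name in dummies.columns:
            items[name] = dummies[name]
        groups[col] = list(dummies.columns)
    return pd.DataFrame(items), groups

# usage (hypothetical column names):
# X_items, groups = adaptive_binarize(df, ["age", "creatinine"], ["admission_type"])
```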
Methods. We compare the performance of the following methods for creating checklists:

- MIP: We fit a predictive checklist by solving an instance of IP (2) with the appropriate constraints. We solve this problem using CPLEX 12.10 [20], paired with the computational improvements in Section 4, on a 2.4 GHz CPU with 16 GB RAM for 60 minutes.
- Cover: We fit a predictive checklist using the submodular heuristic in Section 4.1. We set $N_{\max} = 8$, and vary $M \in [N_{\max}]$ and $B \in \{k\, n^+ \mid k \in \{\tfrac{1}{2}, 1, 2, 3\}\}$.

| Dataset | $d_T$ | $d_{pct}$ | n | Prediction Task | Reference |
|---|---|---|---|---|---|
| adhd | 5 | 20 | 594 | Patient diagnosed with ADHD | [51] |
| cardio | 40 | 52 | 8,815 | Patient with cardiogenic shock died in hospital | [58] |
| kidney | 17 | 80 | 1,722 | Patient with kidney failure died after renal replacement therapy | [37] |
| mortality | 50 | 484 | 21,139 | Patient died in hospital | [37] |
| ptsd | 20 | 80 | 873 | Patient has a PCL-5 PTSD diagnosis | [51] |
| readmit | 42 | 94 | 9,766 | Patient re-admitted to hospital within 30 days | [32] |
| heart | 13 | 82 | 303 | Patient has heart disease | [24] |

Table 2: Clinical prediction tasks used in Section 5. Here, n is the number of samples prior to oversampling, $d_T$ is the number of feature groups, and $d_{pct}$ is the number of features after binarization using the adaptive method.

| Dataset | Metric | Unit (Fixed) | Cover (Fixed) | MIP_OR (Fixed) | MIP (Fixed) | Unit (Adaptive) | Cover (Adaptive) | MIP_OR (Adaptive) | MIP (Adaptive) | LR | XGB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adhd (n = 632, dpct = 20) | test error (%) (min, max) | 16.0 (11.9, 19.0) | 17.1 (13.5, 20.6) | 16.0 (11.9, 19.0) | 17.1 (13.5, 20.6) | 1.4 (0.8, 2.4) | 6.3 (2.4, 11.7) | 5.2 (4.0, 7.9) | 0.5 (0.0, 0.8) | 1.1 (0.8, 2.4) | 1.6 (0.8, 2.4) |
| | train error (%) | 11.4 | 11.1 | 11.4 | 11.1 | 1.1 | 5.4 | 5.4 | 0.5 | 0.5 | 1.1 |
| | gap (%) | - | - | 0.0 | 0.0 | - | - | 0.0 | 0.0 | - | - |
| cardio (n = 15254, dpct = 51) | test error (%) (min, max) | 23.8 (21.1, 26.5) | 25.3 (21.4, 27.2) | 29.6 (28.4, 31.0) | 24.1 (22.5, 25.6) | 23.0 (21.7, 24.9) | 25.3 (23.7, 26.8) | 29.2 (27.7, 30.9) | 22.6 (21.5, 24.1) | 21.6 (20.2, 23.3) | 26.3 (25.0, 27.7) |
| | train error (%) | 25.0 | 26.2 | 29.6 | 24.1 | 23.0 | 25.5 | 29.2 | 22.5 | 20.6 | 6.4 |
| | gap (%) | - | - | 19.5 | 82.8 | - | - | 51.9 | 83.2 | - | - |
| kidney (n = 1760, dpct = 80) | test error (%) (min, max) | 33.2 (33.2, 33.2) | 35.6 (31.8, 38.9) | 37.2 (34.7, 40.6) | 33.3 (31.2, 36.1) | 37.9 (36.1, 39.8) | 37.2 (36.4, 38.6) | 34.7 (33.2, 36.9) | 33.8 (31.2, 37.5) | 30.9 (28.7, 33.5) | 25.3 (20.2, 28.7) |
| | train error (%) | 32.4 | 33.6 | 36.7 | 31.0 | 34.1 | 36.5 | 34.0 | 30.4 | 27.2 | 0.2 |
| | gap (%) | - | - | 45.7 | 78.9 | - | - | 43.3 | 82.4 | - | - |
| mortality (n = 36684, dpct = 478) | test error (%) (min, max) | 34.3 (33.1, 36.1) | 38.1 (37.4, 39.7) | 37.2 (36.6, 38.4) | 29.1 (27.8, 30.2) | 33.8 (32.3, 35.5) | 36.5 (35.5, 37.0) | 37.8 (37.0, 39.4) | 29.2 (29.0, 29.5) | 25.0 (23.5, 26.4) | 28.0 (27.2, 29.1) |
| | train error (%) | 32.9 | 36.9 | 37.4 | 29.0 | 33.4 | 36.5 | 37.8 | 29.6 | 22.6 | 4.6 |
| | gap (%) | - | - | 0.0 | 81.0 | - | - | 34.0 | 80.4 | - | - |
| ptsd (n = 1106, dpct = 80) | test error (%) (min, max) | 10.9 (7.7, 14.9) | 10.3 (9.1, 12.6) | 13.8 (12.2, 15.8) | 11.0 (6.8, 14.9) | 10.2 (7.7, 12.7) | 12.6 (7.7, 20.3) | 16.2 (16.2, 16.2) | 8.7 (5.9, 11.7) | 8.0 (5.9, 10.4) | 4.6 (1.8, 6.4) |
| | train error (%) | 8.3 | 9.5 | 13.4 | 8.2 | 9.2 | 10.8 | 12.5 | 5.6 | 2.8 | 0.0 |
| | gap (%) | - | - | 0.0 | 0.0 | - | - | 0.0 | 66.5 | - | - |
| readmit (n = 16732, dpct = 90) | test error (%) (min, max) | 35.3 (33.8, 37.2) | 37.5 (35.3, 40.0) | 37.9 (36.4, 39.3) | 36.2 (33.8, 37.1) | 34.2 (32.6, 35.5) | 34.6 (32.6, 36.2) | 35.2 (34.8, 35.5) | 33.9 (32.6, 35.7) | 33.0 (32.2, 33.8) | 36.6 (34.7, 38.1) |
| | train error (%) | 35.2 | 37.3 | 36.5 | 35.2 | 34.4 | 34.4 | 33.2 | 33.1 | 32.0 | 13.9 |
| | gap (%) | - | - | 49.3 | 81.0 | - | - | 49.9 | 73.4 | - | - |
| heart (n = 330, dpct = 82) | test error (%) (min, max) | 16.3 (13.6, 24.2) | 18.5 (12.1, 25.8) | 29.1 (21.2, 39.4) | 28.8 (28.8, 28.8) | 16.1 (7.6, 31.8) | 23.9 (15.2, 33.3) | 29.1 (21.2, 39.4) | 16.7 (13.6, 19.7) | 17.6 (10.6, 27.3) | 17.6 (12.1, 25.8) |
| | train error (%) | 15.2 | 16.4 | 23.6 | 13.9 | 15.2 | 18.2 | 23.6 | 13.6 | 15.2 | 9.7 |
| | gap (%) | - | - | 0.0 | 0.0 | - | - | 0.0 | 54.5 | - | - |
Table 3: Performance results of all methods on all datasets. For checklist models, we report the training error, test error, and optimality gap for the checklist that minimizes training error and satisfies all constraints. The intervals under test error reflect the 5-CV minimum and maximum test error. We additionally report results for LR and XGB as performance baselines that represent an informal lower bound on the error of possible checklists.

- Unit: We fit a predictive checklist using unit weighting [7] (sketched after Figure 3 below). We fit L1-regularized logistic regression models for L1 penalties in $[10^{-5}, 10]$. We convert each model into an array of checklists by including items with positive coefficients and the complements of items with negative coefficients. We convert each logistic regression model into multiple checklists by setting $M \in [N]$.

Evaluation. We use 5-fold cross-validation and report the mean, minimum, and maximum test error across the five folds. We use the training set of each fold to fit a predictive checklist that contains at most $N \leq 8$ items and that is required to select at most 1 item from each feature group. We report the training error of a final model trained on all available data. We report results for MIP checklists where $M = 1$, which we refer to as MIP_OR since it corresponds to an OR rule. Finally, we report performance for L1-regularized logistic regression (LR) and XGBoost (XGB) as baselines to evaluate performance. We train these methods using all features in an adaptive-binarized dataset to produce informal lower bounds on the error rate. For each method and each dataset, we report the performance of models that achieve the lowest training error with $N \leq 8$ items. We compare the performance of Fixed and Adaptive binarization here, and report the performance of Optbinning in Appendix D.2.

### 5.2 Results

In Table 3, we report performance for each method on each dataset when using Fixed binarization and Adaptive binarization. We report the corresponding results for Optbinning in Appendix D.2 due to space limitations. In Figure 2, we show the performance effects of varying $N$. Lastly, we show predictive checklists trained using our method in Figure 3 and Appendix D.3. In what follows, we discuss these results.

On Performance. Our results in Table 3 show that our method MIP outperforms existing techniques to fit checklists (Unit) on 7/7 datasets in terms of training error, and on 5/7 datasets in terms of 5-fold CV test error. As shown in Figure 2, this improvement typically holds across all values of $N$. In general, we find that relaxing the value of $M$ yields performance gains across all datasets (see, e.g., results for ptsd, where the difference in test error between MIP_OR and MIP is 7.8%), though smaller values of $M$ can produce checklists that are easier to use (see checklists in Appendix D.3).

Figure 2: Performance of checklists with at most $N$ items on mortality, fit using our approach (MIP) and unit weighting (Unit); the plot shows training and test error for each method versus the maximum number of items (1 to 8).

Figure 3: Predictive checklist fit for readmit using MIP. Predict 30-day readmission if 3+ items are checked:
- # admissions past year ≥ 1
- bed days ≥ 14
- length of stay ≥ 8 days
- chronic or uncontrolled pain
- mood problems past 5 years
- substance abuse past month
The checklist has a train error of 33.1% and a test error of 33.9%. We show sample checklists for all other datasets in Appendix D.3.
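For completeness, here is a sketch of how we read the Unit baseline described above (our own implementation with scikit-learn; the penalty grid and its parameterization are illustrative): each L1-penalized logistic regression is turned into candidate checklists by keeping items with positive coefficients, complementing items with negative coefficients, and sweeping the threshold M.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unit_weight_checklists(X, y, C_grid=np.logspace(-5, 1, 20)):
    """Convert L1-penalized logistic regressions into candidate (pos_items, neg_items, M) checklists."""
    candidates = []
    for C in C_grid:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
        w = clf.coef_.ravel()
        pos_items = np.where(w > 0)[0]          # include these items as-is
        neg_items = np.where(w < 0)[0]          # include the complements (1 - x_j) of these items
        N = len(pos_items) + len(neg_items)
        for M in range(1, N + 1):               # one checklist per threshold value
            candidates.append((pos_items, neg_items, M))
    return candidates

def predict_unit_checklist(X, pos_items, neg_items, M):
    """Predict +1 when at least M of the (possibly complemented) items are checked."""
    checked = X[:, pos_items].sum(axis=1) + (1 - X[:, neg_items]).sum(axis=1)
    return np.where(checked >= M, 1, -1)
```

Among the resulting candidates, the checklist with the lowest training error subject to $N \leq 8$ would be the one reported in Table 3.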
The results in Table 3 also highlight the potential of adaptive binarization to improve performance by binarizing features at training time: MIP with adaptive binarization achieves a better training error than fixed binarization on 6/7 datasets and a better test error on 5/7 datasets. The performance loss of checklists relative to more complex model classes (LR and XGB) varies greatly between datasets. Checklists trained on heart and readmit perform nearly as well as more complex classifiers. On mortality and kidney, however, the loss may exceed 5%.

On Generalization. Our results show that checklist performance on the training set tends to generalize well to the test set. This holds for checklists trained using all methods (i.e., Unit, Cover, MIP_OR, and MIP), which is expected given that checklists belong to a simple model class. Generalization can help us customize checklists in an informed way, as we can assume that any changes to the training loss (e.g., with the addition of a constraint) lead to similar changes on the test set. In this regime, practitioners can evaluate the impact of model requirements on predictive performance: they can measure differences in training error between checklists that obey different requirements, and treat these differences as a reliable proxy for differences in test error.

On the Value of an Optimality Gap. One of the benefits of our approach is that it pairs each checklist with an optimality gap, which can inform model development by identifying settings where we cannot train a checklist that is sufficiently accurate (see Section 3). On heart, we train a checklist with a training error of 13.6% and an optimality gap of 54.5%, which suggests that there may exist a checklist with a training error as low as 6.2%. This optimality gap is not available for checklists trained with a heuristic (like unit weighting or domain expertise), so we would not be able to attribute poor model performance to a difficult problem or to the heuristic being ineffective.

On Fairness. We analyze the fairness of MIP checklists trained on the kidney dataset. The task is to predict in-hospital mortality for ICU patients with kidney failure after receiving Continuous Renal Replacement Therapy (CRRT). In this setting, a false negative is often worse than a false positive, so we train checklists that minimize training FPR under a constraint that limits training FNR to 20%, using an 80%/20% train/test split. We find that our checklists exhibit performance disparities across sex and race, which is consistent with the tendency of simpler models to exhibit performance disparities [40]. To address this issue, we train a new checklist by solving a version of IP (2) with constraints that limit disparities in training FPR and FNR over the 6 intersectional groups $g \in G = \{\text{Male}, \text{Female}\} \times \{\text{White}, \text{Black}, \text{Other}\}$. Specifically, we add 30 constraints that limit the pairwise FPR disparity across groups to 15%, and 6 constraints that cap the FNR of each intersectional group at 20%. This differs from prior methods in fair classification [76] because it enforces group fairness using exact measures [i.e., without convex relaxations; 45] and over intersectional groups. We find that we can limit performance disparities across all values of $N$ (see Figures 4 and 5, and Appendix D.4), observing that MIP with group fairness constraints has a larger FPR overall but exhibits lower FPR disparities over groups (e.g., reducing the test set FPR gap between Black Females and White Males from 54.5% to 30.6%).
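Continuing the python-mip sketch from Section 3 (again our own illustration, with a hypothetical helper name), the exact group fairness constraints above reduce to linear constraints on the mistake indicators of IP (2):

```python
from itertools import combinations
from mip import xsum

def add_group_fairness(m, z, y, group_ids, fpr_gap=0.15, max_fnr=0.20):
    """Add exact group fairness constraints to the checklist IP sketched earlier.

    m         : the mip.Model holding IP (2).
    z         : list of per-example mistake indicator variables.
    y         : labels in {-1, +1}.
    group_ids : an intersectional group label (e.g., sex x race) for each example.
    """
    groups = sorted(set(group_ids))
    fpr = {}
    for g in groups:
        neg = [i for i, (gi, yi) in enumerate(zip(group_ids, y)) if gi == g and yi == -1]
        pos = [i for i, (gi, yi) in enumerate(zip(group_ids, y)) if gi == g and yi == +1]
        if neg:
            fpr[g] = xsum(z[i] for i in neg) * (1.0 / len(neg))    # group FPR, linear in z
        if pos:                                                    # minimax FNR: one constraint per group
            m += xsum(z[i] for i in pos) * (1.0 / len(pos)) <= max_fnr
    for g, h in combinations(groups, 2):                           # pairwise FPR disparities
        if g in fpr and h in fpr:
            m += fpr[g] - fpr[h] <= fpr_gap
            m += fpr[h] - fpr[g] <= fpr_gap
    return m
```

With 6 intersectional groups this produces 15 pairs, i.e., 30 one-sided FPR-gap constraints plus 6 FNR constraints, matching the counts described above.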
Figure 4: FPR (left), maximum FPR disparity (center), and worst-case FNR (right) versus the number of items for checklists for mortality prediction given CRRT using kidney. Group fairness constraints effectively limit the FPR disparity and worst-case FNR during training, at the cost of overall performance.

Figure 5: Checklists trained to predict mortality given CRRT (a) without group fairness constraints and (b) with group fairness constraints. Checklist (a) has a better overall FPR, but has a worse FNR for the worst-case group and exhibits disparities in FPR across groups. Checklist (b) obeys group fairness constraints that attenuate the training FPR gap to 15% and the worst-case FNR to 20%.

(a) MIP. Predict mortality given CRRT if 3+ items are checked: Age (66.0 years); AST (162.6 IU/L); Blood pH (7.29); MCV (99.0 fl); Norepinephrine (0.1 mcg/kg/min); Platelets (65.0 ×10³/µL); RDW (19.2%); Time in ICU (14.1 hours).

| | FNR | FPR | Worst FNR | Max FPR Gap |
|---|---|---|---|---|
| Training | 20.0% | 43.9% | 33.3% | 24.3% |
| Test | 22.2% | 52.6% | 62.5% | 54.5% |

(b) MIP + Fairness Constraints. Predict mortality given CRRT if 2+ items are checked: ALT (16.0 IU/L); Bicarbonate (17.0 mmol/L); Blood pH (7.22); Norepinephrine (0.1 mcg/kg/min); RDW (19.2%); Time in ICU (117.3 hours).

| | FNR | FPR | Worst FNR | Max FPR Gap |
|---|---|---|---|---|
| Training | 17.5% | 52.2% | 18.1% | 13.9% |
| Test | 19.6% | 55.1% | 50.0% | 38.3% |

## 6 Learning a Short-Form Checklist for PTSD Screening

In this section, we demonstrate the practical benefits of our approach by learning a short-form checklist for Post-Traumatic Stress Disorder (PTSD) screening.

Background. The PTSD Checklist for DSM-5 (PCL-5) is a self-report screening tool for PTSD [5, 74]. The PCL-5 consists of 20 questions that assess the severity of PTSD symptoms under the DSM-5. Patients respond to each question with answers of Not at all, A little bit, Moderately, Quite a bit, or Extremely. Given these responses, the PCL-5 assigns a provisional diagnosis of PTSD by counting the number of responses of Moderately or above across four clusters of questions. Patients with a provisional diagnosis are then assessed for PTSD by a specialist. The PCL-5 can take over 10 minutes to complete, which limits its use in studies that include the PCL-5 in a battery of tests (i.e., using the provisional diagnosis to evaluate the prevalence of PTSD and its effect as a potential confounder), and has led to the development of short-form models [see e.g., 6, 43, 80].

Problem Formulation. Our goal is to create a short-form version of the PCL-5, i.e., a checklist that assigns the same provisional diagnoses as the PCL-5 using only a subset of the 20 questions. We train our model using data from the AURORA study [51], which contains PCL-5 responses of U.S. patients who have visited the emergency department following a traumatic event. Here, $y_i = +1$ if the patient is assigned a provisional PTSD diagnosis. We encode the response to each question into four binary variables $x_{i,k,l} = \mathbb{1}[q_k \geq l]$, where $q_k$ is the response to question $k$ and $l \in [4]$ indexes the response level. Our final dataset contains $d = 80$ binary variables for $n = 873$ patients, which we split into an 80% training set and a 20% test set. Since this checklist would be used to screen patients, a false negative diagnosis is less desirable than a false positive diagnosis. We train a checklist that minimizes FPR and that obeys the following operational constraints: (1) limit FNR to 5%; (2) pick one threshold for each PCL-5 question; (3) use at most 8 questions. We make use of the same methods and techniques as in Section 5.2.
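The response encoding above can be reproduced in a few lines. This sketch is our own; the mapping of the five response options to ordinal levels 0 through 4, and the ordering of the resulting 80 features, are assumptions for illustration.

```python
import numpy as np

# assumed ordinal coding of the five PCL-5 response options
RESPONSES = ["Not at all", "A little bit", "Moderately", "Quite a bit", "Extremely"]
LEVEL = {r: l for l, r in enumerate(RESPONSES)}   # "Not at all" -> 0, ..., "Extremely" -> 4

def encode_pcl5(responses):
    """Encode 20 PCL-5 answers into 80 binary items x_{k,l} = 1[q_k >= l] for l = 1..4."""
    q = np.array([LEVEL[r] for r in responses])            # ordinal response per question
    return np.concatenate([(q >= l).astype(int) for l in range(1, 5)])

# usage: one patient's 20 answers -> a length-80 binary feature vector
x = encode_pcl5(["Moderately"] * 20)
```

The one-threshold-per-question requirement is then exactly the one-item-per-feature-group constraint from Section 3, with each question's four indicators forming one group.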
Results. Our results show that short-form checklists fit using our method outperform other checklists (see Figure 6). In Figure 7, we display a sample $N = 8$, $M = 4$ checklist trained with our method, which reproduces the PCL-5 diagnosis with high accuracy, offering an alternative to the PCL-5 with simpler binary items and less than half the number of questions.

Figure 6: FPR versus the number of items for short-forms of the PCL-5. MIP outperforms Unit across all values of $N$.

Figure 7: Short-form of the PCL-5 trained with MIP. Predict PTSD if 4+ items are checked:
- Avoiding thinking about experience (moderately or above)
- Avoiding activities or situations (moderately or above)
- Blame for stressful experience (a little bit or above)
- Feeling distant or cut off (quite a bit or above)
- Irritable or angry outbursts (moderately or above)
- Loss of interest in activities (moderately or above)
- Repeated disturbing dreams (moderately or above)
- Trouble experiencing positive feelings (moderately or above)
The checklist has a train/test FPR of 6.6%/10.8% and a train/test FNR of 4.7%/6.3%. It reproduces the PCL-5 diagnosis on the training/test sets with 94.1%/90.9% accuracy, and has a 70.8% optimality gap. We show condensed PCL-5 questions due to space constraints.

## 7 Concluding Remarks

In this work, we presented a machine learning method to learn predictive checklists from data by solving an integer program. Our method illustrates a promising approach to build models that obey constraints on qualities like safety [1] and fairness [3, 14]. Using our approach, practitioners can potentially co-design checklists alongside clinicians by encoding their exact requirements into the optimization problem and evaluating their effects on predictive performance [17, 36]. We caution against the use of items in a checklist as causal factors of outcomes (i.e., questions in the short-form PCL are not drivers of PTSD), and against the use of our method solely due to its interpretability. Recent work has argued against the use of models simply because they are "interpretable" [25] and highlights the importance of validating claims on interpretability through user studies [42, 59]. We emphasize that effective deployment of the checklist models we create does not end at adoption. The proper integration of our models in a clinical setting will require careful negotiation at the personal and institutional level, and should be considered in the context of care delivery [28, 62, 63] and clinical workflows [26, 55].

## Acknowledgments and Disclosure of Funding

This material is based upon work supported by the National Science Foundation under Grant No. 2040880. We acknowledge a Wellcome Trust Data for Science and Health grant (222914/Z/21/Z) and an NSERC grant (PDF-516984). Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Dr. Marzyeh Ghassemi is funded in part by Microsoft Research and a Canadian CIFAR AI Chair held at the Vector Institute. We would like to thank Taylor Killian, Bret Nestor, Aparna Balagopalan, and four anonymous reviewers for their valuable feedback.

## References

[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[2] Ashwinkumar Badanidiyuru and Jan Vondrák. Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1497–1514. SIAM, 2014.
[3] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018. URL http://www.fairmlbook.org.
[4] Lieven Billiet, Sabine Van Huffel, and Vanya Van Belle. Interval coded scoring: a toolbox for interpretable scoring systems. PeerJ Computer Science, 4:e150, 2018.
[5] Christy A Blevins, Frank W Weathers, Margaret T Davis, Tracy K Witte, and Jessica L Domino. The posttraumatic stress disorder checklist for DSM-5 (PCL-5): Development and initial psychometric evaluation. Journal of Traumatic Stress, 28(6):489–498, 2015.
[6] Paul D Bliese, Kathleen M Wright, Amy B Adler, Oscar Cabrera, Carl A Castro, and Charles W Hoge. Validating the primary care posttraumatic stress disorder screen and the posttraumatic stress disorder checklist with soldiers returning from combat. Journal of Consulting and Clinical Psychology, 76(2):272, 2008.
[7] Philip Bobko, Philip L Roth, and Maury A Buster. The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10(4):689–709, 2007.
[8] David Brodie and Richard Wells. An evaluation of the utility of three ergonomics checklists for predicting health outcomes in a car manufacturing environment. In Proc. of the 29th Annual Conference of the Human Factors Association of Canada. Citeseer, 1997.
[9] Ernest W Burgess. Factors determining success or failure on parole. The Workings of the Indeterminate Sentence Law and the Parole System in Illinois, pages 221–234, 1928.
[10] Emilio Carrizosa, Belen Martin-Barragan, and Dolores Romero Morales. Binarized support vector machines. INFORMS Journal on Computing, 22(1):154–167, 2010.
[11] Emilio Carrizosa, Amaya Nogales-Gómez, and Dolores Romero Morales. Strongly agree or strongly disagree?: Rating features in support vector machines. Information Sciences, 329:256–273, 2016.
[12] Emilio Carrizosa, Amaya Nogales-Gómez, and Dolores Romero Morales. Clustering categories in support vector machines. Omega, 66:28–37, 2017.
[13] Ken Catchpole and Stephanie Russ. The problem with checklists. BMJ Quality & Safety, 24(9):545–549, 2015.
[14] Simon Caton and Christian Haas. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053, 2020.
[15] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[16] Yann Chevaleyre, Frederic Koriche, and Jean-Daniel Zucker. Rounding methods for discrete linear classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 651–659, 2013.
[17] Evangelia Christodoulou, Jie Ma, Gary S Collins, Ewout W Steyerberg, Jan Y Verbakel, and Ben Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110:12–22, 2019.
[18] Robyn Clay-Williams and Lacey Colligan. Back to basics: checklists in aviation and healthcare. BMJ Quality & Safety, 24(7):428–431, 2015.
[19] Jacob Cohen. Things I have learned (so far). In Annual Convention of the American Psychological Association, 98th, Aug 1990, Boston, MA, US. American Psychological Association, 1992.
[20] IBM ILOG CPLEX. V12.1: User's manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
[21] Sanjeeb Dash, Dmitry M Malioutov, and Kush R Varshney. Screening for learning classification rules via boolean compressed sensing. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3360–3364. IEEE, 2014.
[22] Sanjeeb Dash, Dmitry M Malioutov, and Kush R Varshney. Learning interpretable classification rules using sequential rowsampling. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3337–3341. IEEE, 2015.
[23] Robyn M Dawes. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, 1979.
[24] Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer, Johann-Jakob Schmid, Sarbjit Sandhu, Kern H Guppy, Stella Lee, and Victor Froelicher. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5):304–310, 1989.
[25] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[26] B Alex Dummett, Carmen Adams, Elizabeth Scruth, Vincent Liu, Margaret Guo, and Gabriel J Escobar. Incorporating an early detection system into routine clinical practice in two community hospitals. Journal of Hospital Medicine, 11:S25–S31, 2016.
[27] Hillel J Einhorn and Robin M Hogarth. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 13(2):171–192, 1975.
[28] Madeleine Clare Elish. The stakes of uncertainty: developing and integrating machine learning in clinical care. In Ethnographic Praxis in Industry Conference Proceedings, volume 2018, pages 364–380. Wiley Online Library, 2018.
[29] John Forrest, Ted Ralphs, Stefan Vigerske, Lou Hafer, Bjarni Kristjansson, JP Fasano, E Straver, M Lubin, HG Santos, R Lougee, et al. coin-or/Cbc: Version 2.9.9, 2018. URL http://dx.doi.org/10.5281/zenodo.1317566.
[30] Brigid M Gillespie and Andrea Marshall. Implementation of safety checklists in surgery: a realist synthesis of evidence. Implementation Science, 10(1):137, 2015.
[31] Sharad Goel, Justin M Rao, Ravi Shroff, et al. Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. Annals of Applied Statistics, 10(1):365–394, 2016.
[32] Jeffrey L Greenwald, Patrick R Cronin, Victoria Carballo, Goodarz Danaei, and Garry Choy. A novel model for predicting rehospitalization risk incorporating physical function, cognitive status, and psychosocial support using natural language processing. Medical Care, 55(3):261–266, 2017.
[33] Brigette Hales, Marius Terblanche, Robert Fowler, and William Sibbald. Development of medical checklists for improved quality of patient care. International Journal for Quality in Health Care, 20(1):22–30, 2008.
[34] Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):96, 2019. ISSN 2052-4463. doi: 10.1038/s41597-019-0103-9. URL https://doi.org/10.1038/s41597-019-0103-9.
[35] Alex B Haynes, Thomas G Weiser, William R Berry, Stuart R Lipsitz, Abdel-Hadi S Breizat, E Patchen Dellinger, Teodoro Herbosa, Sudhir Joseph, Pascience L Kibatala, Marie Carmela M Lapitan, et al. A surgical safety checklist to reduce morbidity and mortality in a global population. New England Journal of Medicine, 360(5):491–499, 2009.
[36] Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. Human factors in model interpretability: Industry practices, challenges, and needs. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1):1–26, 2020.
[37] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
[38] Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel, and Daniel G Goldstein. Simple rules to guide expert classifications. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(3):771–800, 2020.
[39] Ronald C Kessler, Lenard Adler, Minnie Ames, Olga Demler, Steve Faraone, Eva Hiripi, Mary J Howes, Robert Jin, Kristina Secnik, Thomas Spencer, et al. The World Health Organization Adult ADHD Self-Report Scale (ASRS): a short screening scale for use in the general population. Psychological Medicine, 35(2):245, 2005.
[40] Jon Kleinberg and Sendhil Mullainathan. Simplicity creates inequity: implications for fairness, stereotypes, and interpretability. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 807–808, 2019.
[41] Heidi S Kramer and Frank A Drews. Checking the lists: A systematic review of electronic checklist use in health care. Journal of Biomedical Informatics, 71:S6–S12, 2017.
[42] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 59–67, 2019.
[43] Ariel J Lang and Murray B Stein. An abbreviated PTSD checklist for use as a screening instrument in primary care. Behaviour Research and Therapy, 43(5):585–594, 2005.
[44] Lorelei Lingard, Glenn Regehr, Beverley Orser, Richard Reznick, G Ross Baker, Diane Doran, Sherry Espin, John Bohnen, and Sarah Whyte. Evaluation of a preoperative checklist and team briefing among surgeons, nurses, and anesthesiologists to reduce failures in communication. Archives of Surgery, 143(1):12–17, 2008.
[45] Michael Lohaus, Michael Perrot, and Ulrike Von Luxburg. Too relaxed to be fair. In International Conference on Machine Learning, pages 6360–6369. PMLR, 2020.
[46] Dmitry Malioutov and Kush Varshney. Exact rule learning via boolean compressed sensing. In Proceedings of The 30th International Conference on Machine Learning, pages 765–773, 2013.
[47] Dmitry M Malioutov, Kush R Varshney, Amin Emad, and Sanjeeb Dash. Learning interpretable classification rules with boolean compressed sensing. In Transparent Data Mining for Big and Small Data, pages 95–121. Springer, 2017.
[48] Mario Marchand and John Shawe-Taylor. Learning with the set covering machine. In ICML, pages 345–352, 2001.
[49] Mario Marchand and John Shawe-Taylor. The set covering machine. The Journal of Machine Learning Research, 3:723–746, 2003.
[50] Robert Mauro, Asaf Degani, Loukia Loukopoulos, and Immanuel Barshi. The operational context of procedures and checklists in commercial aviation. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 56, pages 758–762. SAGE Publications, Los Angeles, CA, 2012.
[51] Samuel A McLean, Kerry Ressler, Karestan Chase Koenen, Thomas Neylan, Laura Germine, Tanja Jovanovic, Gari D Clifford, Donglin Zeng, Xinming An, Sarah Linnstaedt, et al. The AURORA study: a longitudinal, multimodal library of brain biology and function after traumatic stress exposure. Molecular Psychiatry, 25(2):283–296, 2020.
[52] Keith E Morse, Steven C Bagley, and Nigam H Shah. Estimate the hidden deployment cost of predictive models to improve patient care. Nature Medicine, 26(1):18–19, 2020.
[53] Guillermo Navas-Palencia. Optimal binning: mathematical programming formulation. arXiv preprint arXiv:2001.08025, 2020.
[54] Janki Patel, Kamran Ahmed, Khurshid A Guru, Fahd Khan, Howard Marsh, Mohammed Shamim Khan, and Prokar Dasgupta. An overview of the use and implementation of checklists in surgical specialities: a systematic review. International Journal of Surgery, 12(12):1317–1323, 2014.
[55] Shirley S Paulson, B Alex Dummett, Julia Green, Elizabeth Scruth, Vivian Reyes, and Gabriel J Escobar. What do we do after the pilot is done? Implementation of a hospital early warning system at scale. The Joint Commission Journal on Quality and Patient Safety, 46(4):207–216, 2020.
[56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[57] Emma Pierson, David M Cutler, Jure Leskovec, Sendhil Mullainathan, and Ziad Obermeyer. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nature Medicine, 27(1):136–140, 2021.
[58] Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1–13, 2018.
[59] Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. Manipulating and measuring model interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–52, 2021.
[60] Peter Pronovost, Dale Needham, Sean Berenholtz, David Sinopoli, Haitao Chu, Sara Cosgrove, Bryan Sexton, Robert Hyzy, Robert Welsh, Gary Roth, et al. An intervention to decrease catheter-related bloodstream infections in the ICU. New England Journal of Medicine, 355(26):2725–2732, 2006.
[61] Hajo Reijers, Henrik Leopold, and Jan Recker. Towards a science of checklists. In Proceedings of the 50th Hawaii International Conference on System Sciences, pages 5773–5782. University of Hawaii, 2017.
[62] Mark P Sendak, Joshua D'Arcy, Sehj Kashyap, Michael Gao, Marshall Nichols, Kristin Corey, William Ratliff, and Suresh Balu. A path for translation of machine learning products into healthcare delivery. EMJ Innov, 10:19–172, 2020.
[63] Pratik Shah, Francis Kendall, Sean Khozin, Ryan Goosen, Jianying Hu, Jason Laramie, Michael Ringel, and Nicholas Schork. Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digital Medicine, 2(1):1–5, 2019.
[64] Daniel L Stufflebeam. Guidelines for developing evaluation checklists: the checklists development checklist (CDC). Kalamazoo, MI: The Evaluation Center. Retrieved on January 16, 2008, 2000.
[65] Øyvind Thomassen, Guttorm Brattebø, Eirik Søfteland, HM Lossius, and J-K Heltne. The effect of a simple checklist on frequent pre-induction deficiencies. Acta Anaesthesiologica Scandinavica, 54(10):1179–1184, 2010.
[66] Øyvind Thomassen, Ansgar Storesund, Eirik Søfteland, and Guttorm Brattebø. The effects of safety checklists in medicine: a systematic review. Acta Anaesthesiologica Scandinavica, 58(1):5–18, 2014.
[67] Geoffrey G Towell and Jude W Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71–101, 1993.
[68] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.
[69] Berk Ustun and Cynthia Rudin. Learning optimized risk scores. Journal of Machine Learning Research, 20(150):1–75, 2019.
[70] Darshali A Vyas, Leo G Eisenstein, and David S Jones. Hidden in plain sight: reconsidering the use of race correction in clinical algorithms, 2020.
[71] Tong Wang. Multi-value rule sets. arXiv preprint arXiv:1710.05257, 2017.
[72] Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. Or's of And's for interpretable classification, with application to context-aware recommender systems. arXiv preprint arXiv:1504.07614, 2015.
[73] Laurence A Wolsey. Integer Programming, volume 42. Wiley, New York, 1998.
[74] Jennifer H Wortmann, Alexander H Jordan, Frank W Weathers, Patricia A Resick, Katherine A Dondanville, Brittany Hall-Clark, Edna B Foa, Stacey Young-McCaughan, Jeffrey S Yarvis, Elizabeth A Hembree, et al. Psychometric analysis of the PTSD Checklist-5 (PCL-5) among treatment-seeking military service members. Psychological Assessment, 28(11):1392, 2016.
[75] Mike Wu, Michael Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[76] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42, 2019.
[77] Jiaming Zeng, Berk Ustun, and Cynthia Rudin. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(3):689–722, 2017.
[78] Zijian Zheng. Constructing X-of-N attributes for decision tree learning. Machine Learning, 40(1):35–75, 2000.
[79] Jean-Daniel Zucker, Yann Chevaleyre, and Dao Van Sang. Experimental analysis of new algorithms for learning ternary classifiers. In 2015 IEEE RIVF International Conference on Computing & Communication Technologies, pages 19–24. IEEE, 2015.
[80] Kelly L Zuromski, Berk Ustun, Irving Hwang, Terence M Keane, Brian P Marx, Murray B Stein, Robert J Ursano, and Ronald C Kessler. Developing an optimal short-form of the PTSD checklist for DSM-5 (PCL-5). Depression and Anxiety, 36(9):790–800, 2019.