# CLASSIFICATION WITH CONCEPTUAL SAFEGUARDS

Published as a conference paper at ICLR 2024

Hailey Joren, UC San Diego, hjoren@ucsd.edu
Charles Marx, Stanford University, ctmarx@stanford.edu
Berk Ustun, UC San Diego, berk@ucsd.edu

ABSTRACT

We propose a new approach to promote safety in classification tasks with established concepts. Our approach, called a conceptual safeguard, acts as a verification layer for models that predict a target outcome by first predicting the presence of intermediate concepts. Given this architecture, a safeguard ensures that a model meets a minimal level of accuracy by abstaining from uncertain predictions. In contrast to a standard selective classifier, a safeguard provides an avenue to improve coverage by allowing a human to confirm the presence of uncertain concepts on instances on which it abstains. We develop methods to build safeguards that maximize coverage without compromising safety, namely techniques to propagate the uncertainty in concept predictions and to flag salient concepts for human review. We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks.

1 INTRODUCTION

One of the most promising applications of machine learning is to automate routine tasks that a human can perform. We can now train deep learning models to perform such tasks across applications, be it to identify a bird in an image [1], detect toxicity in text [2], or detect pneumonia in a chest x-ray [3]. Even when these models outperform human experts [4, 5, 6], their performance often falls short of the levels we need to reap the benefits of full automation. A bird identification model that is 80% accurate may not work reliably enough to be used in the field by conservationists. A pneumonia detection model with 95% accuracy is still not sufficient to eliminate the need for human oversight in a hospital setting.

One strategy to reap some of the benefits of automation in such tasks is abstention. Given any model that is insufficiently accurate, we can measure the uncertainty in its predictions and improve its accuracy by abstaining from predictions that are too uncertain. In this way, we improve accuracy by sacrificing coverage, i.e., the proportion of instances where the model assigns a prediction. One of the common barriers to abstention in general applications is how to handle instances where a model abstains. In applications where we wish to automate a routine task, we can simply pass these instances to the human expert who would have made the decision in the first place. Thus, abstention represents a way to reap the benefits of partial automation.

In this paper, we present a modeling paradigm to ensure safety in such applications that we call a conceptual safeguard (see Fig. 1). A conceptual safeguard is an abstention mechanism for models that predict an outcome by first predicting a set of concepts. Given such a model, a conceptual safeguard operates as a verification layer, estimating the uncertainty in each prediction and abstaining whenever that uncertainty exceeds the threshold needed to ensure a minimal level of accuracy. Unlike a traditional selective classifier, a conceptual safeguard provides a way to improve coverage by allowing experts to confirm the presence of specific concepts on instances on which we abstain. Although conceptual safeguards are designed to be a simple component that can be implemented off the shelf, designing methods to learn them is challenging.
On the one hand, concepts introduce a degree of uncertainty that we must account for at test time. If we fail to account for it, we may, in the best case, abstain more than necessary to achieve a desired level of accuracy and, in the worst case, fail to meet that target altogether. On the other hand, we must build systems that are designed for confirmation, i.e., where we can reasonably expect that confirming concepts will improve coverage, and where we can rank the concepts in a way that improves coverage. Our work seeks to address these challenges so that we can reap the benefits of these models.

[Figure 1: two-panel schematic. Left panel ("Conceptual safeguards abstain using concept uncertainty"): an image of a skin lesion is mapped to predicted concepts Pr(Pigmented) = 10%, Pr(Irreg Vasc) = 30%, Pr(Dotted) = 70%; the uncertainty propagator returns a soft prediction of 62% and the selective classifier abstains. Right panel ("Abstentions can be resolved by confirming concepts"): after confirmation, the concept predictions become Pr(Pigmented) = 10%, Pr(Irreg Vasc) = 0%, Pr(Dotted) = 100%; the uncertainty propagator returns 95% and the selective classifier predicts Melanoma.]

Figure 1: Conceptual safeguard to detect melanoma from an image of a skin lesion. We consider a model that estimates the probabilities of m concepts: Dotted, Pigmented, . . . , Irregular Vasc. Given these probabilities, a conceptual safeguard decides whether to output a prediction $\hat{y} \in \{\text{Melanoma}, \text{No Melanoma}\}$ or to abstain ($\hat{y} = ?$). The safeguard improves accuracy by abstaining on images that would receive a low-confidence prediction, and it measures confidence in a way that accounts for the uncertainty in concept predictions through uncertainty propagation. On the left, we show an image where a safeguard abstains because its confidence fails to meet the threshold needed to ensure high accuracy (Pr(Melanoma) = 62% < 90%). On the right, we show how a human expert can resolve the abstention by confirming the presence of the concepts Dotted and Irreg Vasc in the image.

Contributions. Our main contributions include:

1. We introduce a general-purpose approach for safe automation in classification tasks with concept annotations. Our approach can be applied using off-the-shelf supervised learning techniques.
2. We develop a technique to account for concept uncertainty in concept bottleneck models. Our technique can use native estimates from the components of a concept bottleneck to return reliable estimates of label uncertainty, improving the accuracy-coverage trade-off in selective classification.
3. We propose a confirmation policy that can improve coverage. Our policy flags concepts in instances on which a model abstains, prioritizing high-value concepts that could resolve abstention. This strategy can be readily customized to conform to a budget and applied offline.
4. We benchmark conceptual safeguards on classification datasets with concept labels. Our results show that safeguards can improve accuracy and coverage through uncertainty propagation and concept confirmation.

RELATED WORK

Selective Classification. Our work is related to a large body of work on machine learning with a reject option [see e.g., 7, for a recent survey], and more specifically to methods for selective classification [8, 9, 10, 11]. Our goal is to learn a selective classifier that assigns as many predictions as possible while adhering to a minimal level of accuracy [i.e., the bounded abstention model of 11].
We build conceptual safeguards for this task through a post-hoc approach [see e.g., 12] in which we are given a model and build a verification layer that estimates the confidence of each prediction and abstains on predictions where the model is insufficiently confident. The key challenge in our approach is that we require reliable estimates of uncertainty to abstain effectively [13, 14], which we address by propagating the uncertainty in concept predictions to the uncertainty in the final outcome.

Deep Learning with Concepts. Our work is related to a stream of work on deep learning with concept annotations. A large body of work uses these annotations to promote interpretability, either by explaining the predictions of a DNN in terms of concepts that a human can understand [15, 16], or by building concept bottleneck models, i.e., models that predict an outcome of interest by first predicting concepts [17]. One of the key motivations for concept bottleneck models is the potential for humans to intervene at test time [17], e.g., to improve performance by correcting a concept prediction that was incorrectly detected. Recent work shows that many architectures that perform well cannot readily support interventions [see e.g., 18, 19].¹ Likewise, interventions may not be practical in applications where we wish to automate a routine task. In such cases, we need a mechanism to flag predictions for human review to prevent human experts from having to check each prediction. Our work highlights several avenues to avoid these limitations. In particular, we work with an independent architecture that is amenable to interventions, consider a restricted class of interventions, and present an approach that does not require constant human supervision.

¹For example, if we build a sequential architecture in which we train a front-end model to predict the label from predicted concepts, then interventions may reduce accuracy, as the front-end may rely on incorrect concept predictions to output accurate label predictions.

One overarching challenge in building concept bottleneck models is the need for concept annotations. Indeed, very few datasets include concept annotations, and those that do are often incomplete (e.g., an example may be missing some or all concept labels). In practice, the lack of concept labels can limit the applicability and the performance of concept bottleneck models, and it has motivated a recent stream of work on machine-driven concept annotation [see e.g., 20] and on drawing on alternative sources of information [21, 22]. Our work outlines an alternative approach to overcome this barrier to adoption: rather than building a new model that is sufficiently accurate everywhere, we take the model we have and use it as much as possible by allowing it to abstain from prediction.

2 FRAMEWORK

We consider a classification task in which we predict a label from a complex feature space. We start with a dataset of n i.i.d. training examples $\{(x_i, c_i, y_i)\}_{i=1}^n$, where example $i$ consists of:

- a vector of features $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$, e.g., the pixels of image $i$;
- a vector of $m$ concepts $c_i \in \mathcal{C} = \{0, 1\}^m$, e.g., $c_{i,k} = 1$ if x-ray $i$ contains a bone spur;
- a label $y_i \in \mathcal{Y} = \{0, 1\}$, e.g., $y_i = 1$ if patient $i$ has arthritis.

Objective. We use the dataset to build a selective classification model $h : \mathcal{X} \to \mathcal{Y} \cup \{?\}$. Given a feature vector $x_i$, we denote the output of the model as $\hat{y}_i := h(x_i)$, where $\hat{y}_i = ?$ denotes that the model abstains from predicting on $x_i$. In the context of an automation task, we would like our model to assign as many predictions as possible while adhering to a target accuracy to ensure safety at test time. In this setup, abstention offers a viable path to meet this constraint, as it allows the model to hold back on instances where it would otherwise assign an incorrect prediction.
Given a target accuracy $\tau \in (0, 1)$, we express these requirements as the optimization problem

$$\max_h \; \mathrm{Coverage}(h) \quad \text{s.t.} \quad \mathrm{Accuracy}(h) \geq \tau,$$

where:

- $\mathrm{Coverage}(h) := \Pr(\hat{y} \neq ?)$ is the coverage of the model $h$, i.e., the proportion of instances where $h$ outputs a prediction;
- $\mathrm{Accuracy}(h) := \Pr(y = \hat{y} \mid \hat{y} \neq ?)$ is the selective accuracy of the model $h$, i.e., the accuracy of $h$ over the instances on which it outputs a prediction.

System Components. We consider a selective classifier with the components shown in Fig. 1. Given the dataset, we first train the basic components of an independent concept bottleneck model:

- A concept detector $g : \mathcal{X} \to [0, 1]^m$. Given features $x_i$, the concept detector returns a vector of $m$ probabilities $q_i = [q_{i,1}, \ldots, q_{i,m}] \in [0, 1]^m$, where each $q_{i,k} := \Pr(c_{i,k} = 1 \mid x_i)$ captures the probability that concept $k$ is present in instance $i$.
- A front-end model $f : \mathcal{C} \to \mathcal{Y}$, which takes as input a vector of hard concepts $c_i$ and returns as output the outcome probability $\tilde{y}_i := f(c_i)$.

[Figure 2: an accuracy-coverage plot (y-axis: selective accuracy, up to 100%) showing curves for Safeguard and Safeguard + Confirmation, with operating points marked at $\sigma = 0.1$.]

Figure 2: We evaluate the performance of classification models that can abstain using an accuracy-coverage curve [11]. Given a model that outputs a probability prediction for each point, a conceptual safeguard flags points on which the model abstains based on a confidence threshold $\sigma \in [0, 0.5]$, where setting $\sigma = 0$ leads to 100% coverage and setting $\sigma = 0.5$ leads to 0% coverage.

In what follows, we treat these models as fixed components that are given, and we assume that they satisfy two requirements: (i) independent training, meaning that we train the concept detector and the front-end model using the datasets $\{(x_i, c_i)\}_{i=1}^n$ and $\{(c_i, y_i)\}_{i=1}^n$, respectively; (ii) calibration, i.e., that the concept detector and front-end model return calibrated probability estimates of their respective outcomes. As we will discuss shortly, independent training will be essential for confirmation, while calibration will be essential for abstention.

In practice, these are mild requirements that we can satisfy using off-the-shelf methods for supervised learning. Given a dataset, for example, we can fit the concept detectors using a standard deep learning algorithm, and fit the front-end model using logistic regression. In settings where the dataset is missing concept labels for certain examples, we can train a separate detector for each concept to maximize the amount of data available to each concept detector. In settings where we have an embedding model [21] or an existing end-to-end deep learning model, we can dramatically reduce the computation by training the concept detectors via fine-tuning. In all cases, we can calibrate each model by applying a post-hoc calibration technique, e.g., Platt scaling [23].

Conceptual Safeguards. Conceptual safeguards operate as a confidence-based selective classifier, abstaining on points where they cannot assign a sufficiently confident prediction. We control this behavior through an internal component called the selection gate $\varphi_\sigma : [0, 1] \to \mathcal{Y} \cup \{?\}$, which is parameterized by a confidence threshold $\sigma \in [0, 1]$. Given a threshold $\sigma$, the gate takes as input a soft label $\tilde{y}_i \in [0, 1]$ and returns:

$$\hat{y}_i = \varphi_\sigma(\tilde{y}_i) = \begin{cases} 0 & \text{if } \tilde{y}_i \in [0, \sigma) \\ ? & \text{if } \tilde{y}_i \in [\sigma, 1 - \sigma] \\ 1 & \text{if } \tilde{y}_i \in (1 - \sigma, 1] \end{cases}$$
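To make the selection gate concrete, here is a minimal sketch in Python; the function name, the use of `None` to encode abstention, and the example values are our own illustrative choices rather than the paper's implementation, and the sketch assumes the soft label is a calibrated probability with $\sigma \in [0, 0.5]$.

```python
def selection_gate(y_soft, sigma):
    """Map a calibrated soft prediction in [0, 1] to 0, 1, or abstain (None)."""
    if y_soft < sigma:            # confident negative: y_soft in [0, sigma)
        return 0
    if y_soft > 1.0 - sigma:      # confident positive: y_soft in (1 - sigma, 1]
        return 1
    return None                   # abstain: y_soft in [sigma, 1 - sigma]

# With sigma = 0.10, the 62% prediction from Fig. 1 triggers an abstention,
# while the 95% prediction obtained after confirmation is accepted.
print(selection_gate(0.62, 0.10))  # None
print(selection_gate(0.95, 0.10))  # 1
```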
As shown in Fig. 2, we can set $\sigma$ to choose an operating point for the model on the accuracy-coverage curve, e.g., to meet a desired target accuracy rate by sacrificing coverage. In settings where our front-end model returns calibrated probability predictions, we can use them to set the threshold $\sigma$:

Definition 1. A probabilistic prediction $\tilde{y} \in [0, 1]$ for a binary label $y \in \{0, 1\}$ is calibrated if $\Pr(y = 1 \mid \tilde{y} = t) = t$ for all values $t \in [0, 1]$.

Proposition 1. Suppose that $\tilde{y}$ is a calibrated probability prediction for $y$. Then any selective classifier $\varphi_\sigma(\tilde{y})$ that abstains when $\tilde{y}$ has confidence below $1 - \sigma$ achieves accuracy of at least $1 - \sigma$.

Proposition 1 is a standard result ensuring high accuracy for selective classifiers that output calibrated probabilistic predictions (see Appendix A for a proof). In cases where the front-end model is not calibrated, the result holds only up to the degree of miscalibration. Thus, the reliability of this approach hinges on the reliability of our confidence estimates. An alternative approach for such cases is to treat the output of the front-end model $f$ as a confidence score and tune $\sigma$ using a calibration dataset [see, e.g., the SGR algorithm of 12].

Confirmation. We let users confirm the presence of concepts on instances where a model abstains. In a pneumonia detection task, for example, we can ask a radiologist to confirm the presence of concept $k$ in x-ray $i$. We refer to this procedure as confirmation rather than intervention to distinguish it from other ways in which a human expert could alter the output of a concept detector.² We consider tasks where each concept is human-verifiable, i.e., it can be detected correctly by the human experts that we would query to confirm it [c.f., concepts where a human may be uncertain, as in 24]. In such tasks, confirming a concept replaces its predicted probability $q_{i,k}$ with its ground-truth value $c_{i,k} \in \{0, 1\}$.

²For example, intervention may refer to correction, in which a human replaces a hard concept prediction with its correct value.

We cast confirmation as a policy function from $[0, 1]^m$ to $[0, 1]^m$, where $S \subseteq [m]$ denotes a subset of concepts to confirm. The function takes as input a vector of raw concept predictions and returns a vector of partially confirmed concept predictions $p_i = [p_{i,1}, \ldots, p_{i,m}] \in [0, 1]^m$, where:

$$p_{i,k} = \begin{cases} c_{i,k} & \text{if } k \in S \\ q_{i,k} & \text{if } k \notin S. \end{cases}$$
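As a small illustration of the confirmation step above, the sketch below replaces predicted concept probabilities with expert-supplied values for a confirmed subset $S$; the function name and the example values (taken from Fig. 1) are ours, and we assume the expert returns the correct value, i.e., that concepts are human-verifiable.

```python
import numpy as np

def confirm_concepts(q, c_true, confirmed):
    """Return the partially confirmed concept vector p.

    q         : length-m array of predicted concept probabilities q_{i,k}
    c_true    : length-m array of expert-verified concept values in {0, 1}
    confirmed : set of confirmed concept indices (the set S)
    """
    p = np.asarray(q, dtype=float).copy()
    for k in confirmed:
        p[k] = c_true[k]   # confirmed concepts collapse to hard 0/1 values
    return p

# Confirming Irreg Vasc (absent) and Dotted (present) as in Fig. 1:
print(confirm_concepts([0.10, 0.30, 0.70], c_true=[0, 0, 1], confirmed={1, 2}))
# -> [0.1 0.  1. ]
```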
3 METHODOLOGY

In this section, we describe how to build the internal components of a conceptual safeguard. We first discuss how to output reliable predicted probabilities given uncertain inputs at test time. We then introduce a technique to flag promising examples that can be confirmed to improve coverage.

3.1 UNCERTAINTY PROPAGATION

It is important that the concept detectors are probabilistic, so that we can prioritize confirming concepts that have high uncertainty. However, this creates a mismatch: the concept detectors output probabilities, but the front-end model requires hard concepts as inputs. Here, we describe a simple strategy wherein we use uncertainty propagation to allow the front-end model to accept probabilities rather than hard concepts as input. We make two assumptions about the underlying data distribution to motivate our approach.

Assumption 2. The label $y$ and the features $x$ are conditionally independent given the concepts $c$.

Assumption 3. The concepts $\{c_1, \ldots, c_m\}$ are conditionally independent given the features $x$.

We can use these assumptions to write the conditional label distribution $p(y \mid x)$ in terms of quantities that we can readily obtain from a concept detector and a front-end model:

$$p(y \mid x) = \sum_{c \in \{0,1\}^m} p(y \mid c, x)\, p(c \mid x) \qquad (1)$$
$$= \sum_{c \in \{0,1\}^m} p(y \mid c)\, p(c \mid x) \qquad \text{(Assumption 2)}$$
$$= \sum_{c \in \{0,1\}^m} \underbrace{p(y \mid c)}_{\text{output from } f(c)} \; \prod_{k=1}^{m} \underbrace{p(c_k \mid x)}_{\text{output from } g_k(x)} \qquad \text{(Assumption 3)}$$

We use this decomposition to propagate uncertainty from the inputs of the front-end model to its output. Specifically, we compute the expected prediction of the front-end model over all possible realizations of hard concepts, weighting each realization by the probabilities from the concept detectors. In practice, we replace each quantity in Eq. (1) with an estimate computed by the concept detectors and the front-end model. Given predicted probabilities from the concept detectors $p_i = (p_{i,1}, \ldots, p_{i,m})$, we compute the estimate of $p(y \mid x)$ as:

$$f(p_i) := \sum_{c \in \{0,1\}^m} f(c) \prod_{k=1}^{m} p_{i,k}^{\,c_k} (1 - p_{i,k})^{1 - c_k} \qquad (2)$$

Here, we abuse notation slightly and use $f(p_i)$ to denote the front-end model applied to uncertain concepts, and $f(c)$ to denote the front-end model applied to hard concepts. This computation requires $2^m$ calls to the front-end model, which is negligible in practice, as most front-end models are trained with a limited number of concepts. In settings where $m$ is large, or computation is prohibitive, we can use a sample of concept vectors to construct a Monte Carlo estimate.
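The expectation in Eq. (2) can be computed directly by enumerating hard concept vectors. The sketch below is a minimal implementation under the assumptions above; the function name is ours, and the toy front-end used in the example is purely illustrative rather than a trained model.

```python
from itertools import product

import numpy as np

def propagate_uncertainty(f, p):
    """Estimate of p(y | x) from Eq. (2): the expected front-end prediction
    under independent Bernoulli concepts with probabilities p."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    y_soft = 0.0
    for c in product([0, 1], repeat=m):                  # all 2^m hard concept vectors
        c = np.array(c)
        weight = np.prod(np.where(c == 1, p, 1.0 - p))   # Pr(c | x) under Assumption 3
        y_soft += weight * f(c)                          # weight the hard prediction f(c)
    return y_soft

# Toy front-end (for illustration only): predict the label whenever
# at least two of the three concepts are present.
f = lambda c: float(c.sum() >= 2)
print(propagate_uncertainty(f, [0.10, 0.30, 0.70]))
```

For large $m$, the enumeration can be replaced by averaging $f$ over concept vectors sampled from the same Bernoulli distributions, which yields the Monte Carlo estimate mentioned above.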
3.2 CONFIRMATION

Selective classification guarantees higher accuracy at the cost of potentially lower coverage. In what follows, we describe a strategy that can mitigate this loss in coverage through human confirmation, i.e., by manually verifying concepts on instances on which we abstain. In principle, confirmation will always lead to an improvement in coverage. In practice, human confirmation is labor-intensive and may require expertise, so we want techniques that account for confirmation costs by flagging only the most promising concepts for review.

In Algorithm 1, we present a routine to flag concepts for a human expert to review among instances on which a model abstains. The routine iterates over a batch of $n$ points on which the model has abstained and identifies salient concepts that can be confirmed by a human expert to improve coverage. Algorithm 1 computes the gain associated with confirming each concept on each instance and then returns the concepts with the maximum gain while adhering to a user-specified confirmation budget $B > 0$. We associate each concept $k$ with a confirmation cost $\gamma_k > 0$, which can be set to reflect the time or expertise required for each confirmation. The routine selects concepts for review based on the expected gain in certainty.

Algorithm 1: Greedy Concept Selection
Input: $\{i \in [n] \mid \varphi_\sigma(f(q_i)) = ?\}$, the instances on which the model abstained
Input: $\gamma_1, \ldots, \gamma_m > 0$, the cost to confirm each concept
Input: $B > 0$, the confirmation budget
1: $S_1, \ldots, S_n \leftarrow \{\}$ (concepts to confirm for each instance)
2: repeat
3: $(i^*, k^*) \leftarrow \arg\max_{i,k} \mathrm{Gain}(q_i, k)$ s.t. $k \notin S_i$ (select the best remaining concept)
4: $S_{i^*} \leftarrow S_{i^*} \cup \{k^*\}$
5: $B \leftarrow B - \gamma_{k^*}$
6: until $B < 0$ or $S_i = [m]$ for all $i \in [n]$
Output: $S_1, \ldots, S_n$, the concepts to confirm for each abstained instance

We measure the gain in certainty in terms of the variance of the prediction after confirmation:

$$\mathrm{Gain}(q, k) = \operatorname{Var}_{p_k \sim \mathrm{Bern}(q_k)}\!\big[ f(q[q_k \leftarrow p_k]) \big] = q_k (1 - q_k) \big( f(q[q_k \leftarrow 1]) - f(q[q_k \leftarrow 0]) \big)^2 \qquad (3)$$

The gain measure in (3) captures the sensitivity of the front-end model's prediction to confirming concept $k$. In particular, we seek to identify concepts that, if confirmed, would induce a large change in the output of the front-end model and thus resolve abstentions. Given that we do not know the underlying value of concept $k$ prior to confirmation, we treat $q_k$ as a random variable: the concept vector will be set to $q[q_k \leftarrow 1]$ with probability $q_k$ and to $q[q_k \leftarrow 0]$ with probability $1 - q_k$. Here, $q[q_k \leftarrow 1]$ refers to the vector $q$ with $q_k$ replaced by 1.
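To illustrate how Eq. (3) and Algorithm 1 fit together, here is a minimal sketch; the function names are ours, and `f_soft` is assumed to be a front-end that propagates uncertainty as in Section 3.1 (e.g., the `propagate_uncertainty` sketch above with the front-end model fixed).

```python
import numpy as np

def gain(f_soft, q, k):
    """Eq. (3): q_k (1 - q_k) * (f(q[q_k <- 1]) - f(q[q_k <- 0]))^2."""
    q_hi = np.array(q, dtype=float)
    q_lo = np.array(q, dtype=float)
    q_hi[k], q_lo[k] = 1.0, 0.0
    return q[k] * (1.0 - q[k]) * (f_soft(q_hi) - f_soft(q_lo)) ** 2

def greedy_concept_selection(f_soft, Q, costs, budget):
    """Sketch of Algorithm 1: flag concepts to confirm on abstained instances.

    Q      : list of concept-probability vectors, one per abstained instance
    costs  : costs[k] is the cost gamma_k of confirming concept k
    budget : total confirmation budget B
    """
    S = [set() for _ in Q]                 # concepts to confirm for each instance
    while budget > 0:
        candidates = [(gain(f_soft, q, k), i, k)
                      for i, q in enumerate(Q)
                      for k in range(len(q)) if k not in S[i]]
        if not candidates:                 # every concept is already flagged
            break
        _, i, k = max(candidates)          # best remaining (instance, concept) pair
        S[i].add(k)
        budget -= costs[k]
    return S
```

This sketch recomputes the gain of every remaining (instance, concept) pair at each step, which is affordable when the number of abstentions and concepts is modest; a cached or batched variant would be needed at larger scale.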
4 EXPERIMENTS

We present experiments in which we benchmark conceptual safeguards on a collection of real-world classification datasets. Our goal is to evaluate their accuracy-coverage trade-offs, and to study the effect of uncertainty propagation and confirmation through ablation studies. We include details on our setup and results in Appendix B, and provide code to reproduce our results on GitHub.

We consider six classification datasets with concept annotations:

- The melanoma and skincancer datasets are image classification tasks to diagnose melanoma and skin cancer, derived from the Derm7pt dataset [25].
- The warbler and flycatcher datasets are image classification tasks derived from the Caltech-UCSD Birds dataset [26] to identify different species of birds.
- The noisyconcepts25 and noisyconcepts75 datasets are synthetic classification tasks designed to control the noise in concepts (see Appendix B.1).

We process each dataset to binarize categorical concepts (e.g., Wing Color to Wing Color Red). We split each dataset into a training sample (80%, used to build a selective classification model) and a test sample (20%, used to evaluate coverage and selective accuracy in deployment).

Models. We train a selective classification model using one of the following methods:

- X→Y MLP: a multilayer perceptron trained on top of the penultimate layer of the embedding model. This baseline represents an end-to-end deep learning model that directly predicts the output without concepts.
- Baseline: an independent concept bottleneck model built from concept detectors $g_1, \ldots, g_m$ and a front-end model $f$ trained to predict the label from the true concepts.
- CS: a conceptual safeguard built from the same concept detectors and front-end model as the Baseline. This model propagates the uncertainty from the concept detectors $g_1, \ldots, g_m$ to the front-end model $f$ as described in Section 3.

We build the Baseline and CS models using the same front-end model $f$ and concept detectors $g_1, \ldots, g_m$. We train the front-end model $f$ using logistic regression, and the concept detectors using embeddings from a pre-trained model [i.e., Inception V3, 27] (for all datasets other than the noisyconcepts datasets).

Evaluation. We report the performance of each model through an accuracy-coverage curve as in Fig. 2, which plots its coverage and selective accuracy on the test sample across thresholds. We evaluate the impact of confirmation in concept-based models (i.e., Baseline and CS) under the following policies: Impact Conf, which flags concepts to review using Algorithm 1; and Random Conf, which flags a random subset of concepts to review. We control the number of examples to confirm by setting a confirmation budget, and we plot accuracy-coverage curves for confirmation budgets of 0/10/20/50% to show how performance changes under no/low/medium/high levels of human intervention, respectively.

4.2 RESULTS

We present the accuracy-coverage curves for all methods and all datasets in Table 1. Given these curves, we can evaluate the gains from uncertainty propagation by comparing Baseline to CS, and the gains from confirmation by comparing Baseline + Random Conf to CS + Impact Conf. Overall, our results show that our methods outperform their respective baselines in terms of coverage and selective accuracy. In general, we find that the gains vary across datasets and confirmation budgets. Given a desired target accuracy, for example, we achieve higher coverage on warbler and flycatcher than on melanoma and skincancer. Such differences arise due to differences in the quality of concept annotations and their relevance for the prediction task at hand.

On Uncertainty Propagation. Our results highlight how uncertainty propagation can improve a safeguard. On the one hand, we find that uncertainty propagation improves the accuracy-coverage trade-off. On the skincancer dataset, for example, we see a major improvement in the accuracy-coverage curves between a model that propagates uncertainty (CS + Random Conf, blue) and a comparable model that does not (Baseline + Random Conf, green). On the other hand, accounting for uncertainty can improve these trade-offs by producing a more effective confirmation policy. Our results highlight these effects by showing how the gains from uncertainty propagation change under a confirmation budget. On the flycatcher dataset, for example, accounting for uncertainty makes little difference when we do not confirm examples (i.e., a 0% confirmation budget). In a regime where we allow for confirmation, with a 20% confirmation budget, we find that accounting for uncertainty can increase coverage from 46.2% to 63.5% when the threshold is set to $\sigma = 0.05$ (see Section 4.1).

[Table 1: a grid of accuracy-coverage curve plots. Rows: warbler (n = 5,994, m = 112, Wah et al. [26]); flycatcher (n = 5,994, m = 112, Wah et al. [26]); melanoma (n = 616, m = 17, Kawahara et al. [25]); skincancer (n = 616, m = 17, Kawahara et al. [25]); noisyconcepts25 (n = 100,000, m = 3); noisyconcepts75 (n = 100,000, m = 3). Columns: confirmation budgets of 0%, 10%, 20%, and 50%.]

Table 1: Accuracy-coverage curves for all methods on all datasets. We include additional results in Appendix B for multiclass tasks. Note that Baseline (black) and Baseline + Random Conf (green) produce identical results without confirmation. Likewise, CS + Random Conf (blue) and CS + Impact Conf (purple) are also equivalent under the same condition.

On Confirmation. Our results show that confirming concepts improves performance across all datasets. On flycatcher, we find that confirming a random subset of concepts under a 20% budget improves coverage from 26.9% to 46.2% for a threshold of $\sigma = 0.05$ (Baseline → Baseline + Random Conf). In a conceptual safeguard, coverage starts from 63.5% due to uncertainty propagation (CS + Random Conf) and improves to 84.6% as a result of our targeted confirmation policy (CS + Impact Conf). The gains from confirmation depend on the underlying task and dataset. For example, the gains may be smaller when the concept detectors perform well enough without confirmation to achieve high accuracy (e.g., on the warbler and noisyconcepts25 datasets).
| Dataset | X→Y MLP | Baseline | CS | Baseline + Random Conf | CS + Impact Conf |
|---|---|---|---|---|---|
| warbler (n = 5,994, m = 112, Wah et al. [26]) | 86.0% / 100.00% / 100.00% / 100.00% | 17.3% / 74.0% / 100.00% / 100.00% | 74.0% / 87.3% / 100.00% / 100.00% | 30.0% / 89.3% / 100.00% / 100.00% | 90.00% / 100.00% / 100.00% / 100.00% |
| flycatcher (n = 5,994, m = 112, Wah et al. [26]) | 0.0% / 82.7% / 92.3% / 100.00% | 26.9% / 44.2% / 44.2% / 57.7% | 0.0% / 36.5% / 36.5% / 51.9% | 46.2% / 46.2% / 65.4% / 100.00% | 84.62% / 100.00% / 100.00% / 100.00% |
| melanoma (n = 616, m = 17, Kawahara et al. [25]) | 0.0% / 0.0% / 0.0% / 0.0% | 0.0% / 0.0% / 0.0% / 18.6% | 4.3% / 14.3% / 14.3% / 34.3% | 8.6% / 8.6% / 41.4% / 41.4% | 52.86% / 72.86% / 100.00% / 100.00% |
| skincancer (n = 616, m = 17, Kawahara et al. [25]) | 0.0% / 0.0% / 32.9% / 86.59% | 0.0% / 0.0% / 11.0% / 11.0% | 0.0% / 36.6% / 36.6% / 36.6% | 6.10% / 15.9% / 28.0% / 36.6% | 0.0% / 39.02% / 85.37% / – |
| noisyconcepts25 (n = 100,000, m = 3) | 0.0% / 32.3% / 44.1% / 80.4% | 0.0% / 32.3% / 43.6% / 80.4% | 0.0% / 32.3% / 43.6% / 80.4% | 0.0% / 33.9% / 45.8% / 86.8% | 0.0% / 44.66% / 69.16% / 93.31% |
| noisyconcepts75 (n = 100,000, m = 3) | 0.0% / 0.0% / 0.0% / 0.0% | 0.0% / 0.0% / 0.0% / 0.0% | 0.0% / 0.0% / 0.0% / 0.0% | 0.0% / 0.0% / 0.0% / 0.0% | 0.0% / 11.17% / 11.17% / 11.17% |

Table 2: Coverage at specific thresholds for a confirmation budget of 20%. Each cell reports coverage at thresholds $\sigma$ = 0.05 / 0.1 / 0.15 / 0.2. We present results for other datasets and confirmation budgets in Appendix B.2.

Our results highlight the value of a targeted confirmation policy as a technique that can lead to meaningful gains in coverage without compromising safety or requiring real-time human supervision. In practice, these gains are only part of the benefits of our approach, as practitioners may be able to specify costs in a way that limits the expertise required for confirmation. For example, non-expert users may be able to confirm concepts such as Wing Color Red, while confirming a final species prediction such as Red Faced Cormorant may require greater expertise.

5 CONCLUDING REMARKS

Conceptual safeguards offer a general-purpose approach to promote safety through selective classification. In applications where we wish to automate routine tasks that humans could perform, safeguards allow us to reap some of the benefits of automation through abstention in a way that can promote interpretability and improve coverage. Although our work has primarily focused on binary classification tasks, our machinery can be applied to build conceptual safeguards for multiclass tasks (see Appendix B.3), and it can be extended to other supervised prediction problems.

Our approach has a number of overarching limitations that affect concept bottlenecks and selective classifiers. As with all models trained with concept annotations, we expect to incur some loss in performance relative to an end-to-end model when the concepts do not capture all of the information about the label that is available in the input. In practice, this gap may be large, and we may instead reap the benefits of automation using a traditional selective classifier. As with other selective classification methods, we expect that abstention may exacerbate disparities in coverage or performance across subpopulations [see e.g., 28]. In our setting, it may be difficult to pin down the source of these disparities, as they may arise from the concept annotations, the models, or their interaction. Nevertheless, we may be able to mitigate these effects through a suitable confirmation policy.
ACKNOWLEDGMENTS

We thank Taylor Joren, Mert Yuksekgonul, and Lily Weng for helpful discussions. This work was supported by funding from the National Science Foundation Grants IIS-2040880 and IIS-2313105, the NIH Bridge2AI Center Grant U54HG012510, and an Amazon Research Award.

REFERENCES

[1] Michael A Tabak, Mohammad S Norouzzadeh, David W Wolfson, Steven J Sweeney, Kurt C VerCauteren, Nathan P Snow, Joseph M Halseth, Paul A Di Salvo, Jesse S Lewis, Michael D White, et al. Machine learning to classify animal species in camera trap images: Applications in ecology. Methods in Ecology and Evolution, 10(4):585–590, 2019.
[2] Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. Designing toxic content classification for a diversity of perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021), pages 299–318, 2021.
[3] An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, et al. Robust and interpretable medical image classifiers via concept bottleneck models. arXiv preprint arXiv:2310.03182, 2023.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[5] Wenying Zhou, Yang Yang, Cheng Yu, Juxian Liu, Xingxing Duan, Zongjie Weng, Dan Chen, Qianhong Liang, Qin Fang, Jiaojiao Zhou, et al. Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images. Nature Communications, 12(1):1259, 2021.
[6] Eunji Chong, Elysha Clark-Whitney, Audrey Southerland, Elizabeth Stubbs, Chanel Miller, Eliana L Ajodan, Melanie R Silverman, Catherine Lord, Agata Rozga, Rebecca M Jones, et al. Detection of eye contact with deep neural networks is as accurate as human experts. Nature Communications, 11(1):6386, 2020.
[7] Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey. arXiv preprint arXiv:2107.11277, 2021.
[8] Chi-Keung Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, pages 247–254, 1957.
[9] C Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.
[10] Giorgio Fumera and Fabio Roli. Reject option with multiple thresholds. Pattern Recognition, 33(12):2099–2101, 2000.
[11] Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023.
[12] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in Neural Information Processing Systems, 30, 2017.
[13] Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pages 1507–1517. PMLR, 2021.
[14] Adam Fisch, Tommi Jaakkola, and Regina Barzilay. Calibrated selective classification. arXiv preprint arXiv:2208.12084, 2022.
[15] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV).
In International Conference on Machine Learning, pages 2668–2677. PMLR, 2018.
[16] Zhi Chen, Yijie Bei, and Cynthia Rudin. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2(12):772–782, 2020.
[17] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.
[18] Marton Havasi, Sonali Parbhoo, and Finale Doshi-Velez. Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems, 35:23386–23397, 2022.
[19] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? arXiv preprint arXiv:2105.04289, 2021.
[20] Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models. arXiv preprint arXiv:2304.06129, 2023.
[21] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv preprint arXiv:2205.15480, 2022.
[22] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33:20554–20565, 2020.
[23] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[24] Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human uncertainty in concept-based AI systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889, 2023.
[25] Jeremy Kawahara, Sara Daneshvar, Giuseppe Argenziano, and Ghassan Hamarneh. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE Journal of Biomedical and Health Informatics, 23(2):538–546, 2018.
[26] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
[27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[28] Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, and Percy Liang. Selective classification can magnify disparities across groups. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.