Interactive Concept Bottleneck Models

Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, Krishnamurthy Dvijotham
Google Research
{kushalchauhan, rishabhtiwari, janfreyberg, shenoypradeep, dvij}@google.com

Abstract

Concept bottleneck models (CBMs) are interpretable neural networks that first predict labels for human-interpretable concepts relevant to the prediction task, and then predict the final label based on the concept label predictions. We extend CBMs to interactive prediction settings where the model can query a human collaborator for the labels of some concepts. We develop an interaction policy that, at prediction time, chooses which concepts to request a label for so as to maximally improve the final prediction. We demonstrate that a simple policy combining concept prediction uncertainty and influence of the concept on the final prediction achieves strong performance and outperforms static approaches as well as active feature acquisition methods proposed in the literature. We show that the interactive CBM can achieve accuracy gains of 5-10% with only 5 interactions over competitive baselines on the Caltech-UCSD Birds, CheXpert, and OAI datasets.

1 Introduction

Deep learning-based AI systems have demonstrated significant capabilities across a range of applications. However, in many sensitive or safety-critical applications like healthcare or toxicity detection, AI-based predictive systems are seldom deployed in a standalone fashion; instead, they are used as one component in the overall decision-making workflow (Lee et al. 2021). A concrete interactive setting that has received significant attention in the literature is that of active feature acquisition (AFA) (Greiner, Grove, and Roth 2002; Kanani and Melville 2008). In this setting, a classifier can request additional features at prediction time. There is a cost associated with each feature acquisition, so AFA algorithms have to reason about the value of the acquired feature relative to its cost.

In this paper, we focus on a slightly different setting where the classifier always has access to a basic set of features (like the pixels of an image). However, at prediction time, the classifier can request additional labels corresponding to human-interpretable concepts that are relevant to the prediction task. For example, when predicting bird species from a bird image, the classifier can request access to the wing shape. A key distinguishing feature of this framework from the general active feature acquisition setting is that the human-interpretable concepts are potentially inferrable from the initial features input to the model. In particular, we build on concept bottleneck models (Koh et al. 2020), which explicitly predict concept labels from images and then predict the final label based on the concept label predictions (Figure 1(a)).

Figure 1: Interactive Prediction: Panel (a) shows a concept bottleneck model (Koh et al. 2020) that predicts a label $y$ from an input $x$ through an intermediate concept prediction layer (figure adapted from Koh et al. (2020)). Panel (b) shows our proposal: after predicting concepts, the system interactively queries the human for true values $c_i$ for concepts chosen so as to maximize prediction accuracy and minimize acquisition cost.
The authors argue that CBMs show better explainability, performance under distribution shift, and improvements in performance when concept labels are made available at prediction time. However, they only consider static policies for test-time intervention that request concept labels in a predetermined order. We present Cooperative Prediction (CooP), a dynamic interactive policy that can reason about the uncertainty associated with a given test instance, and request only those concepts that improve predictive power on that instance. Thus, for a given number of queried concepts (or, more generally, a budget for concept acquisition cost), our interactive models can achieve significantly higher levels of performance than static baselines.

Figure 2: Influence of interventions using queries from the CooP policy on a CUB-trained model on two example images. The y-axis denotes the difference between the probability assigned to the correct and top incorrect class.

Figure 2 illustrates, with a pair of example images, how our approach selects different sequences of concepts to be queried from the user, based on the ambiguity inherent in the specific instances, in a bird species identification task (Wah et al. 2011). The specific sequence of queries allows us to rapidly improve our confidence in the true label. We make the following contributions:¹

- We develop a simple approach for training policies that act on top of concept bottleneck models (CBMs) from Koh et al. (2020), with the objective of achieving a specified trade-off between interaction cost and predictive performance. Our approach has only a couple of tunable parameters and can be learned using a small validation set separate from the training set used for the CBM.
- We compare our dynamic instance-based query policy against static policies that determine a fixed order of querying attributes for all examples (Koh et al. 2020), as well as state-of-the-art active feature acquisition strategies (Shim, Hwang, and Yang 2018) that apply to the more general setting of acquiring features beyond human-interpretable concepts.
- Our model incorporates and optimizes for a cost model of feature acquisition; we show that our approach can adapt to settings with non-uniform costs of querying attributes, and demonstrate superior performance compared to the baselines.

¹Code is available at https://github.com/google-research/google-research/tree/master/interactive_cbms

2 Related Work

The closest relevant work to our paper is that on active feature acquisition and concept-aware models. We review the literature in both and explain how our work is distinguished from this prior work.

Active Feature Acquisition (AFA): The goal of active feature acquisition is to acquire a subset of features, in a cost-sensitive manner, for each instance in order to maximize performance at test time. Zubek and Dietterich (2002) propose an AO*-based learning algorithm to heuristically search for a classification policy. Ma et al. (2018) and Zannone et al. (2019) use a partial variational autoencoder to predict the remaining features given the acquired ones, modeling feature importance and uncertainty, and combine it with an acquisition policy to maximize information gain. Shim, Hwang, and Yang (2018) treat this as a joint learning problem, training the classifier and an RL agent together to learn when and which features to acquire so as to increase classification accuracy while maintaining cost-efficiency.
Li and Oliva (2021) reformulate the problem as a Markov decision process (MDP) and learn a generative surrogate model that captures inter-feature dependencies to aid the RL agent with intermediate rewards and auxiliary information. In this paper, we deal with a special case of the general AFA problem where the classifier always has access to an initial set of features (like the pixels in an image), but can request additional human-interpretable concept labels for each prediction. The goal of our work, just as in AFA, is to select which concepts to acquire labels for in a cost-sensitive manner. However, we leverage the fact that we are not acquiring arbitrary features but human-interpretable concepts that have strong correlations to the input features and the final label, and exploit the fact that we can train a CBM to solve the prediction task. This makes our approach more performant than the general AFA approach.

Branson et al. (2010) work in exactly the same setting as us and even obtain results on one of the datasets we use. In that work, the authors posit a generative model for the concept labels given the final label, and assume that the concepts are independent of the input features given the final label. They exploit this assumption to compute a tractable posterior distribution on the label given additional acquired concept labels. Our work does not make any such assumptions, instead relying on the CBM to learn the correlations between the input features and concepts, and the influence of the concept labels on the final label. We achieve stronger results empirically than the approach presented in Branson et al. (2010).

Concept-Aware Models: Concept bottleneck models (Koh et al. 2020) were developed to show that building models that explicitly predict concept labels from images or other raw features helps with explainability, performance under distribution shift, and improvements in performance when concept labels are made available at prediction time. CBMs have been extended in various ways: Bahadori and Heckerman (2021) perform causal reasoning to debias CBMs, and Antognini and Faltings (2021) develop textual rationalizations based on concept interventions. However, of these works, only Koh et al. (2020) explicitly consider test-time intervention, and even there they only consider static policies that request concept labels in a fixed predetermined order (learned on a validation set). In this work, we allow for dynamic interactive policies that, for each prediction, reason about which concept labels are useful to improve the reliability of the prediction and request only those. Thus, with the same number of concepts allowed to be queried at prediction time, our interactive models can achieve significantly higher levels of performance than static baselines.

We outline the problem formulation as well as a simple approach to computing interactive policies on top of pretrained predictive models.

3.1 Formalizing an Interactive Prediction System

Input, Concept and Label Spaces: We denote the raw input features by $x$ (these can correspond, for example, to images or text or features derived from these), the labels by $y$, and the acquired concepts by $c$. We assume that $c$ is a vector of $m$ categorical concepts and $y$ is a categorical scalar taking $K$ possible values. Individual concepts are denoted $c_i$, $i = 1, \ldots, m$. The set of possible values $c$ can take is denoted $\mathcal{C}$ (with the set of possible values for $c_i$ denoted
$\mathcal{C}_i = \{1, \ldots, n_i\}$, so that $\mathcal{C} = \prod_i \mathcal{C}_i$), and the set of possible values $y$ can take is denoted $\mathcal{Y} = \{1, 2, \ldots, K\}$. We denote the space of possible inputs by $\mathcal{X}$. We also note here that the term concept is loosely applied: in some contexts, it may refer simply to additional attributes (a user's age or gender, for example) or pieces of information that can be acquired at some cost at prediction time. Hence we use the terms concepts and attributes interchangeably.

Concept Bottleneck Model: The CBM is the composition of an input-to-concept ($\mathcal{X} \to \mathcal{C}$) model $p_\theta(c|x)$ and a concept-to-label ($\mathcal{C} \to \mathcal{Y}$) model $p_\phi(y|c)$. We assume that both models make probabilistic predictions and that the input-to-concept model makes an independent probabilistic prediction for each concept $c_i$, denoted by $p_\theta(c_i|x)$ for each $i = 1, \ldots, m$. We assume that these models have been trained and are available to us; we do not make any assumptions about how the models were trained, or whether the output predictions represent calibrated probabilities.

Intervention: In the absence of interactivity, the two-stage model makes predictions by first invoking the $\mathcal{X} \to \mathcal{C}$ model and then passing the output to the $\mathcal{C} \to \mathcal{Y}$ model to get the final prediction. However, in our setting, we allow for interventions on the concepts, i.e., replacing the predicted value (or distribution over values) of a concept with its ground-truth value. We denote the predicted concept values as $\hat{c} = p_\theta(\cdot|x)$ and the intervened concept values (i.e., the ground truth) as $c$. We further denote the prediction of a $\mathcal{C} \to \mathcal{Y}$ model with partial intervention on a subset of concepts $S$ as $p_\phi(y \mid c_S, \hat{c}_{\bar{S}})$, where $\bar{S} = \{1, \ldots, m\} \setminus S$. We also use the notation $p_\phi(y \mid c_S = v, \hat{c}_{\bar{S}})$ to denote the prediction given the intervention where the concepts $c_S$ are set to a specific value $v$.

Interaction Cost Model: We denote the cost of acquiring an attribute $c_i$ as $q_i > 0$. We assume all costs are positive and that the cost of acquiring a concept is the same independent of the previously acquired concepts or the value the concept takes. The total cost of acquiring a set of attributes $S$ is assumed to be $\sum_{i \in S} q_i$. Extensions that don't make these assumptions are possible, but we leave this to future work. We assume that for each prediction made there is a budget $B$, and that interaction can only occur while the total cost of concepts acquired so far is below $B$.

Interactive Policies: Given the two-stage model, we define an interactive policy $\psi$ as follows: an interactive policy takes as input a set of revealed concepts $c_S$ where $S \subseteq \{1, \ldots, m\}$, the $\mathcal{X} \to \mathcal{C}$ and $\mathcal{C} \to \mathcal{Y}$ models, and the interaction costs $q$, and outputs the new concept to acquire: $\psi(S, p_\theta, p_\phi, q, B) = i \in \bar{S} = \{1, \ldots, m\} \setminus S$. We will usually drop the dependence on $p_\theta, p_\phi, q$ (as these are assumed to be always available) and simply write $\psi(S)$. A rollout of an interaction policy corresponds to Algorithm 1. The final prediction generated on an input $x$ is denoted $\mathrm{rollout}(\psi, x)$.

Separation of Policy and Model Learning: In this paper, we restrict ourselves to learning an interactive policy as a post-hoc step on top of a learned model. We do not make assumptions about how the two-stage model (i.e., $p_\theta(\cdot)$, $p_\phi(\cdot)$) is trained.

Dataset for Policy Learning: We assume that we can acquire a dataset consisting of $(x, c, y)$ triplets sampled i.i.d. from an underlying unknown joint distribution $P_{\text{data}}$. Our goal is then to minimize

$$\mathbb{E}_{(x,c,y) \sim P_{\text{data}}}\left[\ell(y, \mathrm{rollout}(\psi, x))\right] \quad (1)$$

where $\ell$ is a loss function that measures the discrepancy between the true label $y$ and the label output by rolling out the policy $\psi$.
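To make the intervention notation concrete, the following minimal sketch shows how a prediction with partial intervention, $p_\phi(y \mid c_S, \hat{c}_{\bar{S}})$, could be computed from the two model stages. The function names and the representation of concepts as per-concept probability vectors are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

def predict_with_intervention(p_theta, p_phi, x, revealed):
    """Compute p_phi(y | c_S, c_hat_{S-bar}) for a CBM.

    p_theta(x) -> list of m probability vectors, one per concept (X -> C).
    p_phi(c)   -> probability vector over the K labels (C -> Y).
    revealed   -> dict mapping concept index i in S to its ground-truth value.
    """
    c_hat = p_theta(x)  # predicted concept distributions \hat{c}
    for i, true_value in revealed.items():
        # Intervention: replace the predicted distribution for concept i
        # with a one-hot encoding of its ground-truth value.
        one_hot = np.zeros_like(c_hat[i])
        one_hot[true_value] = 1.0
        c_hat[i] = one_hot
    return p_phi(c_hat)
```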
Algorithm 1: Policy Rollout
  S ← ∅, b ← 0
  while b ≤ B do
    i ← ψ(S)
    if b + q_i ≤ B then
      Acquire the label for concept c_i
      S ← S ∪ {i}
      b ← b + q_i
    else
      b ← B + 1
    end if
  end while
  Output prediction argmax_{y ∈ 𝒴} p_φ(y | c_S, p_θ(c_{S̄} | x))

3.2 Interactive Policy Learning with Cooperative Prediction (CooP)

We present Cooperative Prediction (CooP), a lightweight approach to learning interactive policies that attempt to optimize the objective in equation (1). The key intuition behind CooP is that deciding which concept labels to acquire should be informed by three considerations: (a) the uncertainty associated with the concept prediction: if we can already infer the concept label with high confidence from the input features, there is not much value in acquiring it; (b) the impact of the concept label on the final label prediction: if knowing the value of the concept does not change the predicted label confidence scores by much, it is not very valuable; and (c) the cost of acquiring the concept.

CooP uses a very simple measure of each of the three components, and chooses concepts to acquire iteratively in a greedy fashion by forming a score function from the three components and choosing the not-yet-acquired concept that scores the highest. In particular, we use the following measures:

Concept prediction uncertainty (CPU): We compute the entropy of the distribution $p_\theta(c_i|x)$ for each concept $c_i$ with $i \notin S$:
$$\mathrm{CPU}(c_i; S) = H\left[p_\theta(c_i|x)\right]$$

Concept importance score (CIS): We compute the expected change in the softmax score $p_\phi(y|c)$ associated with the final label prediction when the concept label for concept $c_i$ is changed in the inputs to the $\mathcal{C} \to \mathcal{Y}$ model, i.e.:
$$\mathrm{CIS}(c_i; S, k) = \mathbb{E}\left[p_\phi\left(y = k \mid c_i = v, c_S, \hat{c}_{\bar{S} \setminus \{i\}}\right)\right] - p_\phi\left(y = k \mid c_S, \hat{c}_{\bar{S}}\right)$$
where $k$ is the label predicted in the previous round of the interaction and the expectation is taken over concept values $v \sim p_\theta(c_i|x)$.

Acquisition cost: This is simply the acquisition cost $q_i$ of each attribute.

The final score is a linear combination of normalized versions of these scores (each score is normalized so that the range of values it takes on the policy-learning dataset lies between 0 and 1); the overall attribute selection procedure is outlined in Algorithm 2.

Algorithm 2: CooP policy
  Given p_θ, p_φ, q, the set of concepts acquired so far S, the highest-scoring predicted label k ∈ 𝒴 based on acquired concepts, and score importance weights α, β, γ ∈ ℝ₊
  for all i ∉ S do
    Compute score_i = α·CPU(c_i; S) + β·CIS(c_i; S, k) − γ·q_i
  end for
  Output argmax_i score_i

Each linear combination of scores leads to a different interactive policy, and the weights are tuned to optimize performance on a holdout validation set. A primary advantage of the proposal is its ease of learning, especially in sparse-data scenarios, as it only requires estimating two mixing parameters. As the first paper proposing this novel problem setting (to our knowledge), our primary goal here is to demonstrate that careful policy selection can indeed provide value. We leave further improvements and theoretically principled approaches to future work, focusing here on demonstrating that a simple approach works well for this novel problem setting and formulation.
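A minimal sketch of one greedy CooP step (Algorithm 2) is given below, building on the `predict_with_intervention` helper sketched earlier. The normalization of the three terms to [0, 1] on the policy-learning set is omitted for brevity, and all interfaces are illustrative assumptions rather than our released implementation.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution: the CPU term."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def coop_step(p_theta, p_phi, x, revealed, costs, k, alpha, beta, gamma):
    """Return the index of the next concept to acquire (Algorithm 2)."""
    c_hat = p_theta(x)
    base = predict_with_intervention(p_theta, p_phi, x, revealed)[k]
    best_i, best_score = None, -np.inf
    for i in range(len(c_hat)):
        if i in revealed:
            continue  # only score not-yet-acquired concepts
        cpu = entropy(c_hat[i])  # concept prediction uncertainty
        # CIS: E_{v ~ p_theta(c_i|x)}[p_phi(y=k | c_i=v, ...)] - p_phi(y=k | ...)
        cis = sum(p_v * predict_with_intervention(
                      p_theta, p_phi, x, {**revealed, i: v})[k]
                  for v, p_v in enumerate(c_hat[i])) - base
        score = alpha * cpu + beta * cis - gamma * costs[i]
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```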
There are 28 such categorical concepts, resulting in a total of 112 binary labels. In an interactive setting, attributes are revealed at prediction time only when the policy queries an attribute. In practice, this could be seen as asking human labelers in an interactive setting to provide specific hints on concepts they can easily identify (like beak length or wing color), even if they are unable to make the final prediction of which species the bird is, as most labelers will be unable to do so unless they are specifically trained on this task.

CHEXPERT: This dataset contains chest X-rays accompanied by binary concept labels extracted from a report generated by a radiologist, with the goal of predicting whether the X-ray was normal or abnormal (Irvin et al. 2019). Each chest X-ray is also accompanied by 13 binary attributes that include concepts easily recognized by a non-expert (presence of a fracture or any support devices on the patient), harder-to-label attributes that need a nurse or physician (e.g., cardiomegaly), and, finally, attributes that require a radiologist to label.

OAI (Osteoarthritis Initiative): This dataset contains knee X-rays, annotated with the Kellgren-Lawrence grade (KLG), a 4-level ordinal variable (assumed to be categorical for training) that measures the severity of knee osteoarthritis. Each knee X-ray is also annotated with 10 ordinal attributes describing joint space narrowing, bone spurs, calcification, etc., resulting in a total of 40 binary concepts.

4.2 Base Models

Following the proposals made by Koh et al. (2020), we train the following two kinds of concept bottleneck models (CBMs):²

Independent model: The $\mathcal{X} \to \mathcal{C}$ and $\mathcal{C} \to \mathcal{Y}$ models are trained separately, mapping the inputs $x$ to the true concepts $c$, and the true concepts $c$ to the labels $y$, respectively. The $\mathcal{C} \to \mathcal{Y}$ model sees true concept values as input at train time, and estimated values (probabilities) at test time.

Joint model: Both the $\mathcal{X} \to \mathcal{C}$ and $\mathcal{C} \to \mathcal{Y}$ models are learned using a joint optimization criterion that combines the concept prediction loss (cross-entropy) and the label prediction loss (cross-entropy). The probabilities output by the $\mathcal{X} \to \mathcal{C}$ model are passed on to the $\mathcal{C} \to \mathcal{Y}$ model during training.

We build our interactive intervention models on top of these CBMs, and propose and evaluate various methods of intervention using each of these as the base CBM. Although we performed extensive experimentation with both the Independent and Joint models, due to space considerations we present only results from the Independent CBM here. Findings on the Joint model are qualitatively similar; please see Chauhan et al. (2022) for more details.

4.3 Training and Evaluation

For each experiment, we split the data into three sets: train, validation, and test; the details are available in Table 1. We used the training data to train the base CBMs, and the validation data to select parameter settings, if any, for intervention policies. We then retrained the base model on the pooled train + validation data³, and finally reported performance of the policies on the unseen test set. In case an intervention policy did not require parameter selection, we directly trained the base model on the pooled train + validation data. Finally, CooP requires accurate measures of uncertainty in order to make effective decisions; we therefore calibrate the pooled concept probabilities across the training data for a given base model using isotonic regression. Our primary metric for the classification tasks is accuracy for CUB and OAI, and AUC for CHEXPERT.
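For the isotonic-regression calibration step, a sketch along the following lines (using scikit-learn, with a synthetic binary concept standing in for real held-out predictions) conveys the idea; fitting one calibrator per concept and the exact data used to fit it are assumptions for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_concept_calibrator(raw_probs, true_labels):
    """Fit an isotonic calibrator for one binary concept from held-out
    predicted probabilities and ground-truth labels."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_probs, true_labels)
    return iso

# Toy example: a deliberately miscalibrated concept predictor.
rng = np.random.default_rng(0)
raw = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < raw**2).astype(float)
calibrator = fit_concept_calibrator(raw, labels)
print(calibrator.predict(np.array([0.2, 0.5, 0.9])))  # calibrated p(c_i=1|x)
```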
For each intervention policy, we start with performance using only the input $x$. We then iteratively obtain true labels for intermediate concepts $c$ as specified by the policy, set the concept value to the observed true value, and report the performance of the updated prediction after adding the observed concept. In this manner, we obtain a curve measuring the performance metric as a function of the number of observed concepts.

²Although CooP is agnostic to the specific way the base CBM has been trained, each base CBM may have idiosyncrasies in its predictive power that interact with CooP; therefore, specific base CBMs combined with CooP may perform better overall on any given dataset.

³We did this due to data paucity in the datasets we studied. This is common practice for these datasets, and does not compromise the findings since all reported results are on an unseen test set.

|             | CUB           | CHEXPERT      | OAI           |
|-------------|---------------|---------------|---------------|
| Input dims  | (299, 299, 3) | (320, 320, 3) | (512, 512, 3) |
| Concepts    | 112           | 13            | 10            |
| Train split | 4,796         | 178,731       | 31,370        |
| Val split   | 1,198         | 22,341        | 4,426         |
| Test split  | 5,794         | 22,342        | 4,522         |

Table 1: Details of the datasets used in our experiments.

4.4 Baselines for Comparison

Greedy: Select an ordering of attributes using a greedy ranking scheme over the validation dataset; i.e., the first attribute in the list is the one that improves the performance measure the most on average over the validation set. Subsequent attribute orders are determined in a similar greedy fashion after conditioning on all previous attributes being available.

Random: For each instance in the training set, choose the next attribute to query at random.

Skyline: Evaluate an oracle greedy approach which finds, for each instance in the test set, the specific greedy order of querying attributes that provides maximum incremental gains at each step for that test instance. This is an oracular skyline since it uses the test label for optimization, and is an approximate ceiling on the performance achievable under any interactive policy.⁴

⁴The reason this skyline is only approximately oracular is that searching all possible orderings of attributes for querying is combinatorially infeasible; as seen in our results, the proposed greedy approximation quickly approaches a 100% performance measure, suggesting it is a good approximation.

Active Feature Acquisition (AFA): We also compare against the active feature acquisition policy based on the work of Shim, Hwang, and Yang (2018). We use image embeddings from a ResNet18 pre-trained on ImageNet as auxiliary information, and ground-truth concepts as features that can be acquired; we train an RL policy to actively acquire features/concepts as described in Shim, Hwang, and Yang (2018). This algorithm is relatively sensitive to the value of a hyperparameter r_cost that determines the tradeoff between model performance and acquisition cost. For evaluating AFA, we performed a hyperparameter search for r_cost in the range [-1e-5, -0.06] for each number of intervention steps using the accuracy/AUC on the validation set. We also tried using fine-tuned ResNet features instead of pre-trained features as auxiliary information, but this resulted in severe overfitting on the training data. For a fair comparison, we use the same data splits as the other baselines, and the policy is trained until convergence.

4.5 Intervention Costs

In practice, it is likely that obtaining certain concept labels has a higher cost, for example due to the difficulty of the annotation task, the data being purchased at a cost from a data provider, or the privacy costs of asking for sensitive user information. We studied the following cost models:
Unit cost: Each query made by the interactive policy incurs a unit cost. This is largely appropriate for the CUB dataset, since the concepts are mostly identifiable by a non-expert.

Random cost: To stress-test our approach, we also experiment with random cost models where a randomly chosen cost from the range [1, 7] is assigned to each concept, which is then normalized such that the total cost of all concepts is 100.

Systematic cost: In datasets like CHEXPERT, it is clear that some attributes can be easily labeled by non-experts while others require a specialized radiologist. We thus assign systematic costs that have a strong justification based on the difficulty of acquiring a concept label. Based on consultations with domain experts, we use concept acquisition costs of 1, 3, and 10 for concepts that are very easy, moderately difficult, and very difficult to annotate, respectively. As with random costs, we normalize CHEXPERT's systematic costs such that the total cost of all concepts is 100.

The costs are factored into the learning of policies for CooP as described in Section 3.2. For the CHEXPERT dataset, where a systematic cost was available, we used it to evaluate our cost optimization procedure; for CUB and OAI, we used random costs.
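As a small illustration of these cost models, the sketch below generates random and systematic per-concept costs normalized to a total of 100; the particular difficulty assignment in the systematic example is made up for illustration, not the expert-derived CHEXPERT assignment.

```python
import numpy as np

def random_costs(m, low=1.0, high=7.0, total=100.0, seed=0):
    """Random per-concept costs in [low, high], normalized to sum to `total`."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(low, high, size=m)
    return total * q / q.sum()

def systematic_costs(difficulties, total=100.0):
    """Map per-concept difficulty tiers (1 = very easy, 3 = moderately
    difficult, 10 = very difficult) to normalized costs."""
    q = np.asarray(difficulties, dtype=float)
    return total * q / q.sum()

print(random_costs(m=13))
print(systematic_costs([1, 1, 3, 10, 3, 1, 10, 3, 1, 1, 3, 10, 1]))  # illustrative tiers
```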
5 Results

We perform experiments that demonstrate the following valuable contributions from CooP:

Performance gains from adaptive interactive policies: Section 5.1 shows that by querying just 5 additional concept labels at prediction time, our interactive policies can improve the relevant performance metric (AUC or accuracy) by 20-25% over any static baseline that determines in advance an order in which attributes are queried. In some cases, CooP achieves a sizable fraction of the possible gain determined by the Skyline. Improving CooP further to fully close the gap to the Skyline is a compelling direction for future work.

Cost-aware acquisition: Relative to policies that only use uncertainty to drive interactivity, CooP is cost-aware and acquires concept labels only when they are valuable, achieving a better tradeoff between cost and performance than simpler policies (Section 5.2).

We conducted additional analyses, including sample efficiency for CooP, and ablation studies probing the contributions of different factors in our scoring function. Details are available in Chauhan et al. (2022).

5.1 Predictive Performance Improvement

Figure 3: Accuracy gains vs. interaction cost (unit cost model) on different datasets. See text for details.

Figure 3 shows the results of our experiments on the three datasets: CUB, CHEXPERT, and OAI. Across all three datasets, we see that the simple Greedy policy already outperforms Random consistently across the range of intervention steps. This serves as a strong existence proof of nontrivial intervention policies. Note that the Greedy policy requests concept labels in the same sequence regardless of the test data instance. In contrast, CooP uses instance-specific uncertainties both in concept labels and final predictions, and as a result significantly outperforms Greedy, again across datasets. In particular, CooP is able to substantially improve the performance metric with as few as 5 queries.

We also note that the AFA baseline has uneven performance across datasets and the range of intervention steps. For instance, the initial performance of AFA is quite poor, followed by a rapid rise, on the CUB and OAI datasets. On CHEXPERT, AFA fails to improve upon Random, a weak baseline. Finally, on the OAI dataset, AFA does outperform CooP, although only by small margins and after the key first few queries. This variable performance is driven by two factors: the significant data needs of the AFA algorithm, and the need to select a particular tradeoff cost at training time, rather than being able to smoothly adjust the cost-benefit tradeoffs at test time on a per-instance basis.

An interesting finding is that for all datasets, the Skyline⁵ performs noticeably better than both the other baselines and CooP. Even though the Skyline is an oracle, this finding suggests the possibility of additional headroom available to other sophisticated, potentially data-hungry policies.

⁵Although additional information should never worsen performance, we do see that adding certain concepts decreases accuracy, particularly for the Skyline. This is due to the heuristic nature of the Skyline, and the sparse-data settings that limit the base CBM's exposure to certain rare concepts in the training data.

Figure 5: Comparison of the behavior of the CooP and Greedy policies for the first five steps of intervention on two example images. The bar plots show the probabilities assigned to the top 5 classes according to the model, with the first (blue) bar representing the correct class. The bar plot titles give the concept revealed by the respective policy.

Figure 5 illustrates CooP's ability to customize intervention queries to each instance, which is its key differentiating feature compared to Greedy. In the first image, Greedy chooses to reveal concepts like the bill shape and wing color, which can be reasonably inferred from the image. CooP instead chooses to reveal the undertail color, which is difficult to see in the shadow, and the wing shape, which can't be inferred since the bird's wings are closed. Similarly, in the second image, most concepts that Greedy chooses to reveal are visible to some extent. CooP instead queries concepts like the uppertail color and head pattern, which are hidden in the shadows, and the wing shape, which again can't be inferred since the wings are closed.

5.2 Cost-Efficient Interventions

In the previous experiments, we assumed that all interventions (concept labels) have unit cost; we now explore the scenario where different concepts have different costs (see Section 4.5 for details on the cost models).

Figure 4: Cost-efficient interventions on different datasets. For CUB and OAI, we report mean results for 10 random cost assignments. For CHEXPERT, we use the domain-informed cost model.

Figure 4 shows the results for CooP when optimizing for intervention costs. The presented data is similar to Figure 3, except that the tradeoff is now accuracy versus total cost, as opposed to the total number of steps. We see that CooP outperforms the baselines both without and with cost-sensitive selection, and CooP with costs is better. This demonstrates CooP's ability to incorporate cost structure into the optimization of the interactive policy.
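The curves in Figures 3 and 4 come from repeatedly querying a policy and re-scoring the updated predictions (the procedure described in Section 4.3). A bare-bones version of that evaluation loop, building on the earlier sketches, with accuracy as the metric and the dataset interfaces assumed for illustration, looks roughly like this:

```python
import numpy as np

def intervention_curve(policy, p_theta, p_phi, X, C, y, num_steps):
    """Accuracy after 0..num_steps acquired concepts per test instance.

    policy(x, revealed) -> index of the next concept to query, e.g. a
    closure around coop_step() with fixed alpha, beta, gamma, and costs.
    """
    revealed = [dict() for _ in X]
    curve = []
    for step in range(num_steps + 1):
        preds = []
        for n, x in enumerate(X):
            if step > 0:  # acquire one more ground-truth concept label
                i = policy(x, revealed[n])
                revealed[n][i] = C[n][i]
            probs = predict_with_intervention(p_theta, p_phi, x, revealed[n])
            preds.append(int(np.argmax(probs)))
        curve.append(float(np.mean(np.array(preds) == np.asarray(y))))
    return curve  # curve[t] = accuracy after t interventions
```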
6 Discussion

We proposed a novel problem setting, that of iterative/interactive refinement of model predictions using human inputs, and the cost-efficient optimization of this interactive loop. We demonstrated a principled first-cut approach to learning such optimization policies in the context of two-stage, or concept bottleneck, models where interactions are simplified to querying concept or attribute labels. We do not provide a wide or exhaustive discussion of the advanced algorithmic possibilities; indeed, we anticipate that future work will explore a number of alternate formulations, both of learning the base models and of optimizing interactive policies on top of those base models. In particular, one could hypothesize architectures other than the two-stage CBMs explored here. Further, human inputs could include information other than the bottleneck concepts, for instance side information that cannot be inferred from the input data, region-of-interest annotations, etc. A key related challenge for our work (and concept bottleneck models in general) is the need for intermediate supervision in the form of concept labels. Future work could explore weak or distant supervision for obtaining these concept labels. Finally, there is a need to evaluate our approach in realistic human-AI collaboration setups, where UX or other psychological factors may impact the interactive performance of the human-AI team (Bansal et al. 2021).

Ethics Statement

Our goal is to increase the interpretability and robustness of predictive models, a significant net positive, especially for applications such as medical diagnosis. As such, we do not expect any adverse outcomes from our work or follow-on research. For our experiments, we used open-sourced datasets collected with appropriate review processes; we did not conduct human-in-the-loop experiments, as that was out of scope for this paper. For an eventual system that includes humans in the predictive workflow, our proposal only leverages additional instance-specific data at test time, as supplied by the expert responsible for prediction; this minimizes the potential for data misuse by our model.

References

Antognini, D.; and Faltings, B. 2021. Rationalization through Concepts. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 761–775.

Bahadori, M. T.; and Heckerman, D. E. 2021. Debiasing Concept-based Explanations with Causal Analysis. In ICLR 2021.

Bansal, G.; Wu, T.; Zhou, J.; Fok, R.; Nushi, B.; Kamar, E.; Ribeiro, M. T.; and Weld, D. 2021. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–16.

Branson, S.; Wah, C.; Schroff, F.; Babenko, B.; Welinder, P.; Perona, P.; and Belongie, S. 2010. Visual Recognition with Humans in the Loop. In European Conference on Computer Vision, 438–451. Springer.

Chauhan, K.; Tiwari, R.; Freyberg, J.; Shenoy, P.; and Dvijotham, K. 2022. Interactive Concept Bottleneck Models. arXiv preprint arXiv:DOI.

Greiner, R.; Grove, A. J.; and Roth, D. 2002. Learning Cost-Sensitive Active Classifiers. Artificial Intelligence, 139(2): 137–174.

Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 590–597.

Kanani, P.; and Melville, P.
2008. Prediction-Time Active Feature-Value Acquisition for Cost-Effective Customer Targeting. Advances in Neural Information Processing Systems (NIPS).

Koh, P. W.; Nguyen, T.; Tang, Y. S.; Mussmann, S.; Pierson, E.; Kim, B.; and Liang, P. 2020. Concept Bottleneck Models. In Daumé III, H.; and Singh, A., eds., Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 5338–5348. PMLR.

Lee, M. H.; Siewiorek, D. P.; Smailagic, A.; Bernardino, A.; and Bermúdez i Badia, S. 2021. A Human-AI Collaborative Approach for Clinical Decision Making on Rehabilitation Assessment. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–14.

Li, Y.; and Oliva, J. 2021. Active Feature Acquisition with Generative Surrogate Models. In International Conference on Machine Learning, 6450–6459. PMLR.

Ma, C.; Tschiatschek, S.; Palla, K.; Hernández-Lobato, J. M.; Nowozin, S.; and Zhang, C. 2018. EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE. arXiv preprint arXiv:1809.11142.

Shim, H.; Hwang, S. J.; and Yang, E. 2018. Joint Active Feature Acquisition and Classification with Variable-Size Set Encoding. In NeurIPS, 1375–1385.

Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Computation & Neural Systems Technical Report.

Zannone, S.; Hernández-Lobato, J. M.; Zhang, C.; and Palla, K. 2019. ODIN: Optimal Discovery of High-Value Information Using Model-Based Deep Reinforcement Learning. In ICML Real-World Sequential Decision Making Workshop.

Zubek, V. B.; and Dietterich, T. G. 2002. Pruning Improves Heuristic Search for Cost-Sensitive Learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, 19–26. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 1558608737.