# Editable Concept Bottleneck Models

Lijie Hu*, Chenyang Ren*, Zhengyu Hu*, Hongbin Lin, Cheng-Long Wang, Zhen Tan, Weimin Lyu, Jingfeng Zhang, Hui Xiong, Di Wang

Abstract

Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove or insert some training data or new concepts from trained CBMs for reasons such as privacy concerns, data mislabeling, spurious concepts, and concept annotation errors. Thus, deriving efficient editable CBMs without retraining from scratch remains a challenge, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and adaptability of our ECBMs, affirming their practical value in CBMs. Code is available at https://github.com/kaustpradalab/ECBM

*Equal contribution. Affiliations: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; Shanghai Jiao Tong University; Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Arizona State University; Stony Brook University; The University of Auckland. Correspondence to: Di Wang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Modern deep learning models, such as large language models (Zhao et al., 2023; Yang et al., 2024a;b; Xu et al., 2023; Yang et al., 2024c) and large multimodal models (Yin et al., 2023; Ali et al., 2024; Cheng et al., 2024), often exhibit intricate non-linear architectures, posing challenges for end-users seeking to comprehend and trust their decisions. This lack of interpretability presents a significant barrier to adoption, particularly in critical domains such as healthcare (Ahmad et al., 2018; Yu et al., 2018) and finance (Cao, 2022), where transparency is paramount. To address this demand, explainable artificial intelligence (XAI) models (Das & Rad, 2020; Hu et al., 2023b;a) have emerged, offering explanations for their behavior and insights into their internal mechanisms. Among these, Concept Bottleneck Models (CBMs) (Koh et al., 2020) have gained prominence for explaining the prediction process of end-to-end AI models. CBMs add a bottleneck layer for placing human-understandable concepts. In the prediction process, CBMs first predict the concept labels using the original input and then predict the final classification label using the predicted concepts in the bottleneck layer, which provides a self-explained decision to users.
Figure 1: An illustration of Editable Concept Bottleneck Models with three settings (concept-label level, concept level, and data level).

Existing research on CBMs predominantly addresses two primary concerns. First, CBMs rely heavily on laborious dataset annotation; researchers have explored solutions to this challenge in unlabeled settings (Oikarinen et al., 2023; Yuksekgonul et al., 2023; Lai et al., 2023). Second, the performance of CBMs often lags behind that of original models lacking the concept bottleneck layer, attributed to incomplete information extraction from the original data to bottleneck features; researchers aim to bridge this utility gap (Sheth & Ebrahimi Kahou, 2023; Yuksekgonul et al., 2023; Espinosa Zarlenga et al., 2022). However, few of them considered the adaptivity or editability of CBMs, crucial aspects encompassing annotation errors, data privacy considerations, or concept updates. These demands are increasingly pertinent in the era of large models. We delineate the editable setting into three key aspects (illustrated in Figure 1):

Concept-label-level: In most scenarios, concept labels are annotated by humans or experts. Thus, annotation errors are unavoidable, indicating a need to correct some concept labels in a trained CBM.

Concept-level: In CBMs, the concept set is pre-defined by LLMs or experts. However, in many cases, evolving situations demand concept updates, as evidenced by discoveries such as chronic obstructive pulmonary disease as a risk factor for lung cancer, where doctors need to add related concepts.
For another example, recent research found that obesity is a risk factor for severe COVID-19 (Sattar et al., 2020) and that factors such as older age, male gender, and Asian race are associated with the risk of COVID-19 infection (Rozenfeld et al., 2020). On the other hand, one may also want to remove some spurious or unrelated concepts from the task. This demand is even more urgent in rapidly evolving domains like the pandemic.

Data-level: Data issues can arise in CBMs when training data is erroneous or poisoned. For example, if a doctor identifies a case as erroneous or poisoned, this data sample becomes unsuitable for training. Therefore, it is essential to have the capability to completely delete such data from the learned models. We need an editable model that can interact effectively with doctors.

The most direct way to address the above three problems is retraining from scratch on the corrected data. However, retraining in such cases is prohibitively expensive, especially for large models, as it is resource-intensive and time-consuming. Therefore, developing an efficient method to approximate prediction changes becomes paramount, and providing users with an adaptive and editable CBM is both crucial and urgent. We propose Editable Concept Bottleneck Models (ECBMs) to tackle these challenges. Specifically, compared to retraining, ECBMs provide a mathematically rigorous closed-form approximation for the above three settings to address editability within CBMs efficiently. Leveraging the influence function (Cook, 2000; Cook & Weisberg, 1980), we quantify the impact of individual data points, individual concept labels, and of a concept across all data on the model parameters. Despite the growing attention and utility of influence functions in machine learning (Koh & Liang, 2017), their application in CBMs remains largely unexplored due to their composite structure, i.e., the intermediate representation layer.
To the best of our knowledge, we are the first work to fill this gap by demonstrating the effectiveness of influence functions in elucidating the behavior of CBMs, especially in identifying mislabeled data and discerning data influence. Comprehensive experiments on benchmark datasets show that our ECBMs are efficient and effective. Our contributions are summarized as follows.

- We delineate three different settings that need various levels of data or concept removal in CBMs: concept-label-level, concept-level, and data-level. To the best of our knowledge, our research marks the first exploration of data removal issues within CBMs.
- To make CBMs able to remove data or concept influence without retraining, we propose the Editable Concept Bottleneck Models (ECBMs). Our approach in ECBMs offers a mathematically rigorous closed-form approximation. Furthermore, to improve computational efficiency, we present streamlined versions integrating Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC).
- To showcase the effectiveness and efficiency of our ECBMs, we conduct comprehensive experiments across various benchmark datasets demonstrating superior performance.

2. Related Work

Concept Bottleneck Models. CBM (Koh et al., 2020) stands out as an innovative deep-learning approach for image classification and visual reasoning. It introduces a concept bottleneck layer into deep neural networks, enhancing model generalization and interpretability by learning specific concepts. However, CBM faces two primary challenges: its performance often lags behind that of original models lacking the concept bottleneck layer, attributed to incomplete information extraction from the original data to bottleneck features; additionally, CBM relies on laborious dataset annotation. Researchers have explored solutions to these challenges. Chauhan et al. (2023) extend CBM into interactive prediction settings, introducing an interaction policy to determine which concepts to label, thereby improving final predictions. Oikarinen et al. (2023) address CBM limitations and propose a novel framework called Label-free CBM. This innovative approach enables the transformation of any neural network into an interpretable CBM without requiring labeled concept data, all while maintaining high accuracy. Post-hoc Concept Bottleneck Models (Yuksekgonul et al., 2023) can be applied to various neural networks without compromising model performance, preserving interpretability advantages. CBM work in the image domain also includes Havasi et al. (2022), Kim et al. (2023), Keser et al. (2023), Sawada & Nakamura (2022), and Sheth & Kahou (2023). Despite many works on CBMs, we are the first to investigate the interactive influence between concepts through influence functions. Our research endeavors to bridge this gap by utilizing influence functions in CBMs, thereby deciphering the interaction of concept models and providing an adaptive solution to concept editing. For more related work, please refer to Appendix I.

3. Preliminaries

Concept Bottleneck Models. In this paper, we consider the original CBM, and we adopt the notations used by Koh et al. (2020). We consider a classification task with a concept set $\{p_1, \ldots, p_k\}$, where each $p_i$ is a concept given by experts or LLMs, and a training dataset $D = \{z_i\}_{i=1}^n$, where $z_i = (x_i, y_i, c_i)$. Here, for $i \in [n]$, $x_i \in \mathbb{R}^d$ is the input feature vector, $y_i \in \mathbb{R}^{d_o}$ is the label (with $d_o$ corresponding to the number of classes), and $c_i = (c_i^1, \ldots, c_i^k) \in \mathbb{R}^k$ is the concept vector. In this context, $c_i^j$ is the label of concept $p_j$ for the $i$-th data point.
In CBMs, our goal is to learn two representations: a concept predictor $g : \mathbb{R}^d \to \mathbb{R}^k$ that maps the input space to the concept space, and a label predictor $f : \mathbb{R}^k \to \mathbb{R}^{d_o}$ that maps the concept space to the prediction space. Usually, the map $f$ is linear. For each training sample $z_i = (x_i, y_i, c_i)$, we consider two empirical loss functions, yielding the concept predictor $\hat{g}$ and the label predictor $\hat{f}$:

$$\hat{g} = \arg\min_g \sum_{i=1}^n \sum_{j=1}^k \mathcal{L}_C^j\big(g(x_i), c_i\big), \qquad (1)$$

where $\mathcal{L}_C^j$ is the cross-entropy loss between the predicted $j$-th concept $g_j(x_i)$ and its label $c_i^j$. For brevity, we write the total concept loss for data $(x_i, c_i)$ as $\mathcal{L}_C(g(x_i), c_i) = \sum_{j=1}^k \mathcal{L}_C^j(g(x_i), c_i)$. Once we obtain the concept predictor $\hat{g}$, the label predictor is defined as

$$\hat{f} = \arg\min_f \sum_{i=1}^n \mathcal{L}_Y\big(f(\hat{g}(x_i)), y_i\big), \qquad (2)$$

where $\mathcal{L}_Y$ is the cross-entropy loss, similar to (1). CBMs enforce dual precision in predicting interpretable concept vectors $\hat{c} = \hat{g}(x)$ (matching the concept $c$) and final outputs $\hat{y} = \hat{f}(\hat{c})$ (matching the label $y$), ensuring transparent reasoning through explicit concept mediation. Furthermore, in this paper we focus primarily on scenarios in which the label predictor $f$ is a linear transformation, motivated by its interpretability advantages in tracing concept-to-label relationships. For details on the symbols used, please refer to the notation table in the Appendix.

Influence Function. The influence function measures the dependence of an estimator on the value of individual points in the sample. Consider a neural network $\hat{\theta} = \arg\min_\theta \sum_{i=1}^n \ell(z_i; \theta)$ with loss function $\ell$ and dataset $D = \{z_i\}_{i=1}^n$. If we remove $z_m$ from the training dataset, the parameters become $\hat{\theta}_{-z_m} = \arg\min_\theta \sum_{i \neq m} \ell(z_i; \theta)$. The influence function provides an efficient model approximation by defining a family of $\epsilon$-parameterized models $\hat{\theta}_{\epsilon, z_m} = \arg\min_\theta \sum_{i=1}^n \ell(z_i; \theta) + \epsilon\, \ell(z_m; \theta)$.
By performing a first-order Taylor expansion on the gradient of the objective function corresponding to the argmin, the influence function is defined as

$$\mathcal{I}_{\hat{\theta}}(z_m) \triangleq \frac{d\hat{\theta}_{\epsilon, z_m}}{d\epsilon}\bigg|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta \ell(z_m; \hat{\theta}),$$

where $H_{\hat{\theta}} = \nabla^2_\theta \sum_{i=1}^n \ell(z_i; \hat{\theta})$ is the Hessian matrix. When the loss function $\ell$ is twice-differentiable and strongly convex in $\theta$, the Hessian $H_{\hat{\theta}}$ is positive definite and thus the influence function is well-defined. For non-convex loss functions, Bartlett (1953) proposed replacing the Hessian $H_{\hat{\theta}}$ with $\hat{H} = G_{\hat{\theta}} + \delta I$, where $G_{\hat{\theta}}$ is the Fisher information matrix defined as $\sum_{i=1}^n \nabla_\theta \ell(z_i; \theta)\, \nabla_\theta \ell(z_i; \theta)^\top$, and $\delta$ is a damping term used to ensure the positive definiteness of $\hat{H}$. We can employ the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) method to further accelerate the computation. See Appendix C for additional details.

4. Editable Concept Bottleneck Models

In this section, we introduce our ECBMs for the three settings mentioned in the introduction, leveraging the influence function. Specifically, at the concept-label level, we calculate the influence of individual concept labels of a set of data samples; at the concept level, we calculate the influence of multiple concepts; and at the data level, we calculate the influence of multiple samples.

4.1. Concept Label-level Editable CBM

In many cases, certain data samples contain erroneous annotations for specific concepts, yet their other information remains valuable. This is particularly relevant in domains such as medical imaging, where acquiring data is often costly and time-consuming. In such scenarios, it is common to correct the erroneous concept annotations rather than removing the entire data point from the dataset. Estimating the retrained model parameters is crucial in this context. We refer to this scenario as the concept-label-level editable CBM.
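Before stating the CBM-specific updates, the basic influence-function correction can be checked numerically. The sketch below is a toy illustration (not our implementation): a ridge-regularized least-squares "concept predictor" whose loss is exactly quadratic, so the first-order update for correcting one mislabeled target coincides with retraining; for CBMs the analogous update is only an approximation. The ridge weight `lam` plays the role of the damping term $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
lam = 1.0  # ridge weight; plays the role of the damping term delta

X = rng.normal(size=(n, d))
c = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # clean targets
m = 7
c_bad = c.copy()
c_bad[m] += 3.0  # an annotation error on sample m

def fit(targets):
    # minimize 0.5 * ||X theta - targets||^2 + 0.5 * lam * ||theta||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)

theta_hat = fit(c_bad)  # model trained on the mislabeled data

# Influence-style edit: subtract H^{-1} times the gradient difference
# between the corrected and the wrong label on sample m
H = X.T @ X + lam * np.eye(d)
g_new = X[m] * (X[m] @ theta_hat - c[m])      # gradient with corrected label
g_old = X[m] * (X[m] @ theta_hat - c_bad[m])  # gradient with wrong label
theta_edit = theta_hat - np.linalg.solve(H, g_new - g_old)

theta_retrain = fit(c)  # ground truth: retrain with the corrected label
err_edit = np.linalg.norm(theta_edit - theta_retrain)
err_orig = np.linalg.norm(theta_hat - theta_retrain)
```

For this quadratic loss, the Hessian does not depend on the parameters, so the single Newton step recovers the retrained solution up to floating-point error; `err_edit` is orders of magnitude below `err_orig`.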
Mathematically, we have a set of erroneous data $D_e$ with associated index set $S_e \subseteq [n] \times [k]$ such that for each $(w, r) \in S_e$, $(x_w, y_w, c_w) \in D_e$ with $c_w^r$ mislabeled and $\tilde{c}_w^r$ its corrected concept label. Our goal is to estimate the retrained CBM. The retrained concept predictor and label predictor are represented as follows:

$$\hat{g}_e = \arg\min_g \bigg[ \sum_{(i,j) \notin S_e} \mathcal{L}_C^j\big(g(x_i), c_i\big) + \sum_{(i,j) \in S_e} \mathcal{L}_C^j\big(g(x_i), \tilde{c}_i\big) \bigg], \qquad (3)$$

$$\hat{f}_e = \arg\min_f \sum_{i=1}^n \mathcal{L}_Y\big(f(\hat{g}_e(x_i)), y_i\big). \qquad (4)$$

For simple neural networks, we can use the influence function approach directly to estimate the retrained model. However, for the CBM architecture, if we intervene on the true concepts, the concept predictor $\hat{g}$ changes to $\hat{g}_e$ accordingly. Observe that the input of the label predictor comes from the output of the concept predictor, so it is also subject to change. Therefore, we need to adopt a two-stage editing approach. Here we consider the influence functions for (3) and (4) separately: we first edit the concept predictor $\hat{g}$ toward $\hat{g}_e$, and then edit $\hat{f}$ toward $\hat{f}_e$ based on our approximated concept predictor. To begin, we provide the following definitions:

Definition 4.1. Define the gradients of the $j$-th concept loss and of the label loss for the $i$-th data point $x_i$ as
$$G_C^j(x_i, c_i; g) \triangleq \nabla_g \mathcal{L}_C^j\big(g(x_i), c_i\big), \qquad G_Y(x_i; g, f) \triangleq \nabla_f \mathcal{L}_Y\big(f(g(x_i)), y_i\big).$$

Theorem 4.2. The retrained concept predictor $\hat{g}_e$ defined by (3) can be approximated by $\tilde{g}_e$, defined by
$$\tilde{g}_e = \hat{g} - H_{\hat{g}}^{-1} \sum_{(w,r) \in S_e} \big( G_C^r(x_w, \tilde{c}_w; \hat{g}) - G_C^r(x_w, c_w; \hat{g}) \big),$$
where $H_{\hat{g}} = \nabla_g \sum_{i,j} G_C^j(x_i, c_i; \hat{g})$ is the Hessian matrix of the loss function with respect to $\hat{g}$.

Theorem 4.3. The retrained label predictor $\hat{f}_e$ defined by (4) can be approximated by $\tilde{f}_e$, defined by
$$\tilde{f}_e = \hat{f} + H_{\hat{f}}^{-1} \sum_{i=1}^n \big( G_Y(x_i; \hat{g}, \hat{f}) - G_Y(x_i; \tilde{g}_e, \hat{f}) \big),$$
where $H_{\hat{f}} = \nabla_f \sum_{i=1}^n G_Y(x_i; \hat{g}, \hat{f})$ is the Hessian matrix, and $\tilde{g}_e$ is given in Theorem 4.2.

Difference from Test-Time Intervention. The ability to intervene in CBMs allows human users to interact with the model during the prediction process.
For example, a medical expert can directly replace an erroneously predicted concept value $\hat{c}$ and observe its impact on the final prediction $\hat{y}$. However, the underlying flaws in the concept predictor remain unaddressed, meaning similar errors may persist on new test data. In contrast, under the editable CBM framework, not only can test-time interventions be performed, but the concept predictor of the CBM can also be further refined based on test data that repeatedly produces errors. Our ECBM method incorporates the corrected test data into the training dataset without requiring full retraining. This approach extends the rectification process from the data level to the model level.

4.2. Concept-level Editable CBM

In this case, a set of concepts is removed due to incorrect attribution or spurious concepts, termed a concept-level edit.¹ Specifically, denoting the erroneous concept index set as $M \subseteq [k]$, we aim to delete these concept labels from all training samples and to investigate the impact of updating the concept set within the training data on the model's predictions. Note that, compared to the concept-label case above, the output (input) dimension of the retrained concept predictor (label predictor) changes: if we delete $t$ concepts from the dataset, then $g$ becomes $g' : \mathbb{R}^d \to \mathbb{R}^{k-t}$ and $f$ becomes $f' : \mathbb{R}^{k-t} \to \mathbb{R}^{d_o}$. More specifically, if we retrain the CBM with the revised dataset, the corresponding concept predictor becomes

$$\hat{g}^{-p_M} = \arg\min_{g'} \sum_{i=1}^n \sum_{j \notin M} \mathcal{L}_C^j\big(g'(x_i), c_i\big). \qquad (5)$$

¹ For convenience, in this paper we only consider concept removal; our method directly extends to concept insertion.

This variation in parameter dimension makes influence-function-based editing challenging for the concept predictor. The influence function edits the predictor by approximating the parameter change from the original base after $\epsilon$-weighting the loss of a given sample, and thus it cannot deal with changes in parameter dimensions.
To overcome the challenge, our strategy is to develop a transformation of $\hat{g}^{-p_M}$ that aligns its dimension with $\hat{g}$, so that we can apply the influence function to edit the CBM. We achieve this by mapping $\hat{g}^{-p_M}$ to $\bar{g}^{-p_M} \triangleq P(\hat{g}^{-p_M})$, which has the same number of parameters as $\hat{g}$ and the same predicted concepts as $\hat{g}^{-p_M}$ for all $j \in [k] \setminus M$. Concretely, we insert a zero row vector into the $r$-th row of the matrix in the final layer of $\hat{g}^{-p_M}$ for each $r \in M$; the mapping $P$ is therefore one-to-one. Moreover, let $T$ denote the parameter space of $\hat{g}$ and let $T_0 \subseteq T$ be that of $\bar{g}^{-p_M}$. Note that $\bar{g}^{-p_M}$ is the optimal model of the following objective:

$$\bar{g}^{-p_M} = \arg\min_{g' \in T_0} \sum_{i=1}^n \sum_{j \notin M} \mathcal{L}_C^j\big(g'(x_i), c_i\big), \qquad (6)$$

i.e., it is the optimal model of the concept-predictor loss on the remaining concepts under the constraint $T_0$. Now we can apply the influence function to edit $\hat{g}$ to approximate $\bar{g}^{-p_M}$, with the rows of the last layer indexed by $M$ restricted to the value $0$; we denote the result by $\tilde{g}^{-p_M}$. Afterwards, we remove from $\tilde{g}^{-p_M}$ the parameters initially inserted to fill the dimensional difference, which always equal $0$ because of the restriction applied in the editing stage, thus approximating the true edited concept predictor $\hat{g}^{-p_M}$. We now detail the editing process from $\hat{g}$ to $\tilde{g}^{-p_M}$ in the following theorem.

Theorem 4.4. For the retrained concept predictor $\hat{g}^{-p_M}$ defined in (5), we map it to $\bar{g}^{-p_M}$ as in (6), and we can edit the initial $\hat{g}$ toward $\bar{g}^{-p_M}$ via
$$\tilde{g}^{-p_M} = \hat{g} + H_{\hat{g}}^{-1} \sum_{j \in M} \sum_{i=1}^n G_C^j(x_i, c_i; \hat{g}),$$
where $H_{\hat{g}} = \nabla_g \sum_{j \notin M} \sum_{i=1}^n G_C^j(x_i, c_i; \hat{g})$. Then, by removing all zero rows inserted during the mapping phase, we can naturally approximate $\hat{g}^{-p_M} \approx P^{-1}(\tilde{g}^{-p_M})$.
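The padding map $P$ and its inverse can be sketched in a few lines of numpy. This is an illustrative toy (the variable names and shapes are ours, not from our implementation): a reduced last-layer weight matrix is padded with zero rows at the removed concept indices, so its shape matches the original predictor while the removed concepts always output zero.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d_h = 6, 4                         # k concepts, hidden width of g's last layer
M = [1, 4]                            # indices of the removed concepts
keep = [j for j in range(k) if j not in M]

W = rng.normal(size=(k, d_h))         # last-layer weights of the original g-hat
W_reduced = W[keep]                   # last layer of a retrained reduced predictor

# P: insert a zero row at each removed index r in M, restoring the original shape.
# np.insert positions refer to the array before insertion, hence the r - i shift.
insert_pos = [r - i for i, r in enumerate(sorted(M))]
W_padded = np.insert(W_reduced, insert_pos, 0.0, axis=0)

h = rng.normal(size=d_h)              # a hidden representation for one input
out_padded = W_padded @ h             # padded predictor's concept outputs

# P^{-1}: deleting the zero rows recovers the reduced predictor exactly
W_back = np.delete(W_padded, M, axis=0)
```

The padded predictor agrees with the reduced one on every kept concept, outputs exactly zero on the removed ones, and the row deletion inverts the map, which is why $P$ is one-to-one.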
For the second stage of training, assume we aim to remove the concepts $p_r$ for $r \in M$, and the new optimal model is $\hat{f}^{-p_M}$. We encounter the same difficulty as in the first stage: the number of parameters of the label predictor changes. To address this, our key observation is that the existing literature on CBMs always uses a linear transformation for the label predictor, meaning that input dimensions with value $0$ contribute nothing to the final prediction. To leverage this property, we fill the missing values in the input of the updated predictor with $0$; that is, we replace $\hat{g}^{-p_M}$ with $\bar{g}^{-p_M}$ and consider $\hat{f}^{p_M=0}$ defined by

$$\hat{f}^{p_M=0} = \arg\min_f \sum_{i=1}^n \mathcal{L}_Y\big(f(\bar{g}^{-p_M}(x_i)), y_i\big). \qquad (7)$$

In total, we have the following lemma:

Lemma 4.5. In the CBM, suppose the label predictor is a linear transformation $\hat{f} \cdot c$ with input $c$. For each $r \in M$, remove the $r$-th concept from $c$ and denote the new input as $c'$; alternatively, set the $r$-th concept to $0$ and denote the new input as $c^0$. Then $\hat{f}^{-p_M} \cdot c' = \hat{f}^{p_M=0} \cdot c^0$ for any input $c$.

Lemma 4.5 demonstrates that the retrained $\hat{f}^{-p_M}$ and $\hat{f}^{p_M=0}$, when given inputs $\hat{g}^{-p_M}(x)$ and $\bar{g}^{-p_M}(x)$ respectively, yield identical outputs. Consequently, we can utilize $\hat{f}^{p_M=0}$ as the editing target in place of $\hat{f}^{-p_M}$.

Theorem 4.6. For the revised retrained label predictor $\hat{f}^{p_M=0}$ defined by (7), we can edit the initial label predictor $\hat{f}$ to $\tilde{f}^{p_M=0}$ as a substitute for $\hat{f}^{p_M=0}$:
$$\hat{f}^{p_M=0} \approx \tilde{f}^{p_M=0} = \hat{f} - H_{\hat{f}}^{-1} \sum_{l=1}^n G_Y(x_l; \tilde{g}^{-p_M}, \hat{f}),$$
where $H_{\hat{f}} = \nabla_f \sum_{l=1}^n G_Y(x_l; \tilde{g}^{-p_M}, \hat{f})$ is the Hessian matrix. Deleting the $r$-th input dimension of $\tilde{f}^{p_M=0}$ for each $r \in M$ maps it to $\tilde{f}^{-p_M}$, the approximation of the final edited label predictor $\hat{f}^{-p_M}$ at the concept level.

4.3. Data-level Editable CBM

In this scenario, we are concerned with fully removing the influence of data samples on CBMs for various reasons, such as the training data being poisoned or erroneous.
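Throughout, our edits follow a two-stage recipe: first edit the concept predictor, then edit the label predictor while accounting for its shifted inputs. As an illustration of that recipe (not the ECBM algorithm itself), the sketch below instantiates it end-to-end on a toy fully linear CBM with damped quadratic losses, where every influence step is a closed-form solve; all names and shapes here are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 300, 6, 3
lam = 1.0                              # damping / ridge term

X = rng.normal(size=(n, d))
C = X @ rng.normal(size=(d, k)) + 0.1 * rng.normal(size=(n, k))  # concept labels
y = C @ rng.normal(size=k) + 0.1 * rng.normal(size=n)            # task labels

def ridge(A, b):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

G = np.arange(5)                       # samples to remove
mask = np.ones(n, dtype=bool); mask[G] = False

# Original CBM: linear concept predictor Wg, linear label predictor wf
Wg = ridge(X, C)
wf = ridge(X @ Wg, y)

# Stage 1: influence-style edit of the concept predictor (drop group G)
Hg = X.T @ X + lam * np.eye(d)
Wg_edit = Wg + np.linalg.solve(Hg, X[G].T @ (X[G] @ Wg - C[G]))

# Stage 2a: remove G from the label-predictor loss (inputs still from old g)
Chat = X @ Wg
Hf = Chat.T @ Chat + lam * np.eye(k)
wf_mid = wf + np.linalg.solve(Hf, Chat[G].T @ (Chat[G] @ wf - y[G]))

# Stage 2b: account for the label predictor's inputs shifting to the edited g
Cold, Cnew = Chat[mask], X[mask] @ Wg_edit
Hf2 = Cold.T @ Cold + lam * np.eye(k)
delta = Cnew.T @ (Cnew @ wf_mid - y[mask]) - Cold.T @ (Cold @ wf_mid - y[mask])
wf_edit = wf_mid - np.linalg.solve(Hf2, delta)

# Ground truth: retrain both stages from scratch without G
Wg_re = ridge(X[mask], C[mask])
wf_re = ridge(X[mask] @ Wg_re, y[mask])

g_err_edit = np.linalg.norm(Wg_edit - Wg_re)
g_err_orig = np.linalg.norm(Wg - Wg_re)
f_err_edit = np.linalg.norm(wf_edit - wf_re)
f_err_orig = np.linalg.norm(wf - wf_re)
```

On this toy, both edited predictors land far closer to the retrained ones than the unedited models, while avoiding the two retraining solves; the same structure, with Hessians replaced by damped EK-FAC approximations, underlies the updates below.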
Specifically, we have a set of samples to be removed, $\{(x_i, y_i, c_i)\}_{i \in G}$ with $G \subseteq [n]$. We then define the retrained concept predictor as

$$\hat{g}^{-z_G} = \arg\min_g \sum_{i \in [n] \setminus G} \sum_{j=1}^k \mathcal{L}_C^j\big(g(x_i), c_i\big), \qquad (8)$$

which can be estimated by the following theorem:

Theorem 4.7. For dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, given a set of data points $z_r = (x_r, y_r, c_r)$, $r \in G$, to be removed, the updated concept predictor $\hat{g}^{-z_G}$ defined by (8) admits the approximation

$$\hat{g}^{-z_G} \approx \tilde{g}^{-z_G} = \hat{g} + H_{\hat{g}}^{-1} \sum_{r \in G} \sum_{j=1}^k G_C^j(x_r, c_r; \hat{g}), \qquad (9)$$

where $H_{\hat{g}} = \nabla_g \sum_{i,j} G_C^j(x_i, c_i; \hat{g})$ is the Hessian matrix of the loss function with respect to $\hat{g}$. Based on $\hat{g}^{-z_G}$, the label predictor becomes $\hat{f}^{-z_G}$, defined by

$$\hat{f}^{-z_G} = \arg\min_f \sum_{i \in [n] \setminus G} \mathcal{L}_Y\big(f(\hat{g}^{-z_G}(x_i)), y_i\big). \qquad (10)$$

Compared with the original loss (2) before unlearning, we can observe two changes in (10): first, $|G|$ data points are removed from the loss $\mathcal{L}_Y$; second, the inputs to the loss change from $\hat{g}(x_i)$ to $\hat{g}^{-z_G}(x_i)$. Therefore, it is difficult to estimate (10) directly with an influence function. Here we introduce an intermediate label predictor

$$f'^{-z_G} = \arg\min_f \sum_{i \in [n] \setminus G} \mathcal{L}_Y\big(f(\hat{g}(x_i)), y_i\big), \qquad (11)$$

and split the estimation of $\hat{f}^{-z_G} - \hat{f}$ into $\hat{f}^{-z_G} - f'^{-z_G}$ and $f'^{-z_G} - \hat{f}$.

Theorem 4.8. For dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, given data points $z_r = (x_r, y_r, c_r)$, $r \in G$, to be removed, the intermediate label predictor $f'^{-z_G}$ defined in (11) satisfies
$$f'^{-z_G} - \hat{f} \approx -H_{\hat{f}}^{-1} \sum_{i \in [n] \setminus G} G_Y(x_i; \hat{g}, \hat{f}) \triangleq A_G,$$
where $H_{\hat{f}} = \nabla_f \sum_{i=1}^n G_Y(x_i; \hat{g}, \hat{f})$. We denote the edited version of $f'^{-z_G}$ as $\tilde{f}' = \hat{f} + A_G$. Define
$$B_G \triangleq -H_{f'}^{-1} \sum_{i \in [n] \setminus G} \big( G_Y(x_i; \tilde{g}^{-z_G}, f'^{-z_G}) - G_Y(x_i; \hat{g}, f'^{-z_G}) \big),$$
where $H_{f'} = \nabla_f \sum_{i \in [n] \setminus G} G_Y(x_i; \hat{g}, f'^{-z_G})$ is the Hessian matrix with respect to $f'^{-z_G}$ (in practice, $f'^{-z_G}$ is replaced by its estimate $\tilde{f}'$). Then $\hat{f}^{-z_G}$ can be estimated by $f'^{-z_G} + B_G$. Combining the above two-stage approximation, the final edited label predictor is

$$\tilde{f}^{-z_G} = \tilde{f}' + B_G = \hat{f} + A_G + B_G. \qquad (12)$$

Acceleration via EK-FAC.
As mentioned in Section 3, the loss function in CBMs is non-convex, meaning the Hessian matrices in our theorems may not be well-defined. To address this, we adopt the EK-FAC approach, where the Hessian is approximated as $\hat{H}_\theta = G_\theta + \delta I$. Here, $G_\theta$ represents the Fisher information matrix of the model $\theta$, and $\delta$ is a small damping term introduced to ensure positive definiteness. For details on applying EK-FAC to CBMs, see Appendix C.1. Additionally, refer to Algorithms 6-8 in the Appendix for the EK-FAC-based algorithms corresponding to our three levels, with their original (Hessian-based) versions provided in Algorithms 1-3, respectively.

Theoretical Bounds. We provide error bounds for the concept predictor between retraining and ECBM across all three levels; see Appendices D.2, E.2, and F.2 for details. We show that, in certain scenarios, the approximation error becomes theoretically tolerable when leveraging a damping term $\delta$ to regularize the Hessian matrix.

5. Experiments

In this section, we demonstrate our main experimental results on utility evaluation, editing efficiency, and interpretability evaluation. Details and additional results are in Appendix H due to the space limit.

5.1. Experimental Settings

Datasets. We utilize three datasets: X-ray Grading (OAI) (Nevitt et al., 2006), Bird Identification (CUB) (Wah et al., 2011), and the Large-scale CelebFaces Attributes dataset (CelebA) (Liu et al., 2015). OAI is a multi-center observational study of knee osteoarthritis comprising 36,369 data points; we configure 10 concepts characterizing crucial osteoarthritis indicators such as joint space narrowing, osteophytes, and calcification. CUB consists of 11,788 data points belonging to 200 classes, with 112 binary attributes describing detailed visual features of birds. (The original dataset is processed; a detailed explanation can be found in Appendix H.)
CelebA comprises 202,599 celebrity images, each annotated with 40 binary attributes detailing facial features such as hair color, eyeglasses, and smiling. As the dataset lacks predefined classification tasks, following Espinosa Zarlenga et al. (2022), we designate 8 attributes as labels and the remaining 32 attributes as concepts. For all the above datasets, we follow the same network architecture and settings outlined in Koh et al. (2020).

Ground Truth and Baselines. We use retraining as the ground-truth method. Retrain: we retrain the CBM from scratch after removing the samples, concept labels, or concepts from the training set. We employ two baseline methods: CBM-IF and ECBM. CBM-IF: a direct implementation of our theorems for model updates in the three settings (see Algorithms 1-3 in the Appendix). ECBM: as discussed above, all of our model updates can be further accelerated via EK-FAC; ECBM is the EK-FAC-accelerated version of Algorithms 1-3 (refer to Algorithms 6-8 in the Appendix).

Evaluation Metrics. We utilize two primary evaluation metrics: the F1 score and runtime (RT). The F1 score measures model performance by balancing precision and recall. Runtime, measured in minutes, is the total running time each method takes to update the model.

Implementation Details. Our experiments utilized an Intel Xeon CPU and an RTX 3090 GPU. For utility evaluation, at the concept level, one concept was randomly removed for the OAI dataset while ten concepts were randomly removed for the CUB dataset, repeated with five different seeds. At the data level, 3% of the data points were randomly deleted, repeated 10 times with different seeds. At the concept-label level, we randomly selected 3% of the data points and randomly modified one concept of each, repeating this 10 times for consistency across iterations.

5.2. Evaluation of Utility and Editing Efficiency

Our experimental results, as illustrated in Table 1, demonstrate the effectiveness of ECBMs compared to traditional retraining and CBM-IF, particularly emphasizing computational efficiency without compromising accuracy. Specifically, on OAI, ECBMs achieved F1 scores close to those of retraining (0.8808 vs. 0.8825) while significantly reducing the runtime from 297.77 minutes to 2.36 minutes. This pattern is consistent on the CUB dataset, where the runtime decreased from 85.56 minutes for retraining to 0.65 minutes for ECBMs, with a negligible difference in F1 score (0.7971 vs. 0.7963). These results highlight the potential of ECBMs to provide substantial time savings, requiring only a small fraction of the computational time needed for retraining, while maintaining comparable accuracy. This efficiency is particularly crucial in scenarios where frequent updates to model annotations are needed, confirming the utility of ECBMs in dynamic environments where running time and accuracy are critical.

Table 1: Performance comparison of different methods on the three datasets (F1 score and runtime in minutes, mean ± std).

| Edit Level | Method | OAI F1 | OAI RT (min) | CUB F1 | CUB RT (min) | CelebA F1 | CelebA RT (min) |
|---|---|---|---|---|---|---|---|
| Concept Label | Retrain | 0.8825 ± 0.0054 | 297.77 ± 35.01 | 0.7971 ± 0.0066 | 85.56 ± 4.22 | 0.3827 ± 0.0272 | 304.71 ± 35.15 |
| | CBM-IF (Ours) | 0.8639 ± 0.0033 | 4.63 ± 0.89 | 0.7699 ± 0.0035 | 1.33 ± 0.09 | 0.3561 ± 0.0134 | 5.54 ± 0.82 |
| | ECBM (Ours) | 0.8808 ± 0.0039 | 2.36 ± 0.54 | 0.7963 ± 0.0050 | 0.65 ± 0.08 | 0.3845 ± 0.0327 | 2.49 ± 0.53 |
| Concept | Retrain | 0.8448 ± 0.0191 | 258.84 ± 42.48 | 0.7811 ± 0.0047 | 87.21 ± 7.62 | 0.3776 ± 0.0350 | 355.85 ± 25.39 |
| | CBM-IF (Ours) | 0.8214 ± 0.0071 | 4.94 ± 0.91 | 0.7579 ± 0.0065 | 1.45 ± 0.08 | 0.3609 ± 0.0202 | 5.51 ± 0.89 |
| | ECBM (Ours) | 0.8403 ± 0.0090 | 2.36 ± 0.60 | 0.7787 ± 0.0058 | 0.59 ± 0.17 | 0.3761 ± 0.0280 | 2.48 ± 0.52 |
| Data | Retrain | 0.8811 ± 0.0065 | 319.37 ± 31.08 | 0.7838 ± 0.0051 | 86.20 ± 7.74 | 0.3797 ± 0.0375 | 325.62 ± 50.59 |
| | CBM-IF (Ours) | 0.8472 ± 0.0046 | 5.07 ± 0.75 | 0.7623 ± 0.0031 | 1.46 ± 0.08 | 0.3536 ± 0.0166 | 5.97 ± 0.75 |
| | ECBM (Ours) | 0.8797 ± 0.0038 | 2.50 ± 0.49 | 0.7827 ± 0.0088 | 0.65 ± 0.19 | 0.3748 ± 0.0347 | 2.49 ± 0.61 |

Compared to CBM-IF, ECBM also showed a slight reduction in runtime and a significant improvement in F1 score.
The former verifies the effective acceleration of our algorithm by EK-FAC. We can also see that the original version of ECBM, i.e., CBM-IF, also has a lower runtime than retraining, but a lower F1 score than ECBM. Such results may have several causes: for example, our original theorems depend on the inverses of Hessian matrices, which may not be well-defined for non-convex losses; moreover, these Hessian matrices may be ill-conditioned or singular, making the computation of their inverses imprecise and unstable.

Editing Multiple Samples. To comprehensively evaluate the editing capabilities of ECBM in various scenarios, we conducted experiments on performance when multiple samples need to be removed. Specifically, for the concept-label and data levels, we consider different ratios of edited samples (1-10%), while for the concept level, we consider removing different numbers of concepts {2, 4, 6, ..., 20}. We compared the performance of retraining, CBM-IF, and ECBM. As shown in Figure 2, except for certain cases at the concept level, the F1 score of ECBM is generally only around 0.0025 lower than that of retraining, significantly better than the corresponding results of CBM-IF. Recalling Table 1, ECBM is also far faster than retraining. Consequently, ECBM is an editing method that achieves a trade-off between speed and effectiveness.

5.3. Results on Interpretability

ECBMs can measure concept importance. The original motivation of the influence function is to calculate the importance score of each sample.
Here, we show that the influence function for the concept level in Theorem 4.4 can be used to calculate the importance of each concept in CBMs, providing an explainability tool for CBMs. In detail, we conduct our experiments on the CUB dataset. We first select the 1-10 most influential and 1-10 least influential concepts according to our influence function. Then, we remove these concepts, update the model via retraining or our ECBM, and analyze the change (F1 score difference) with respect to the original CBM before removal. The results in Figure 3(a) demonstrate that removing the 1-10 most influential concepts identified by ECBM decreases the F1 score by more than 0.025 compared to the CBM before removal. In contrast, Figure 3(b) shows that the change in F1 score remains consistently below 0.005 when removing the least influential concepts. These findings strongly indicate that the influence function in ECBM can successfully determine the importance of concepts. Furthermore, we observe that the gap between the F1 scores of retraining and ECBM is consistently smaller than 0.005, and even smaller in the case of the least important concepts. This further suggests that when ECBM edits various concepts, its performance is very close to the ground truth.

ECBMs can erase data influence. At the data level, ECBMs aim to facilitate efficient removal of samples. We perform membership inference attacks (MIAs) to provide direct evidence that ECBMs can indeed erase data influence. An MIA is a privacy attack that aims to infer whether a specific data sample was part of the training dataset used to train a model. The attacker exploits the model's behavior, such as overconfidence or overfitting, to distinguish between training (member) and non-training (non-member) data points.
In MIAs, the attacker typically queries the model with a data sample and observes its prediction confidence or loss values, which tend to be higher for members of the training set than for non-members (Shokri et al., 2017). To quantify the success of these edits, we calculate the RMIA (Removed Membership Inference Attack) score for each category. The RMIA score is defined as the model's confidence in classifying whether a given sample belongs to the training set. Lower RMIA values indicate that the sample behaves more like a test-set (non-member) sample (Zarifzadeh et al., 2024). This metric is especially crucial for edited samples, as a successful ECBM should make the removed members behave similarly to non-members, reducing their membership vulnerability. See Appendix H for its definition.
Figure 2: Impact of the edit ratio on the three settings on the CUB dataset (concept-label level and data level: 1-10% of samples edited; concept level: 2-20 concepts deleted; methods: Retrain, CBM-IF, ECBMs).
Figure 3: F1 score difference after removing the most and least influential concepts given by ECBM ((a) the 1-10 most influential concepts; (b) the 1-10 least influential concepts).
Figure 4: RMIA scores of data before and after removal ((a) before editing: normal fits of the non-members' and members' RMIA distributions; (b) after editing: normal fits of the non-members' and removed members' RMIA distributions).
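The member/non-member comparison used in this evaluation can be illustrated with synthetic scores. The following sketch uses hypothetical Gaussian stand-ins for the RMIA score distributions, not the real attack, to show how editing should shrink the mean gap between removed members and non-members:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in scores (hypothetical values): members score higher than
# non-members before editing; removed members drift down after editing.
non_members = rng.normal(0.0495, 0.008, size=200)
members = rng.normal(0.0635, 0.008, size=200)
removed_members = rng.normal(0.0521, 0.008, size=200)

gap_before = abs(members.mean() - non_members.mean())
gap_after = abs(removed_members.mean() - non_members.mean())
```

A successful edit corresponds to `gap_after` being much smaller than `gap_before`, i.e., the removed members' score distribution moving toward the non-members'.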
We conducted experiments by randomly selecting 200 samples from the training set (members) and 200 samples from the test set (non-members) of the CUB dataset. We calculated the RMIA scores for these samples and plotted their frequency distributions, as shown in Figure 4(a). The mean RMIA score for non-members was 0.049465, while members had a mean score of 0.063505. Subsequently, we applied ECBMs to remove the 200 training samples from the model, updated the model parameters, and then recalculated the RMIA scores. After editing, the mean RMIA score of the removed members decreased to 0.052105, significantly closer to the non-members' mean score. This shift in RMIA values demonstrates the effectiveness of ECBMs in editing the model, as the removed members now exhibit behavior closer to that of non-members. The post-editing RMIA score distributions are shown in Figure 4(b). These results provide evidence of the effectiveness of ECBMs in editing the model's knowledge about specific samples. 6. Conclusion In this paper, we propose Editable Concept Bottleneck Models (ECBMs). ECBMs can remove or insert training data or concepts from trained CBMs, for reasons such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors, without retraining from scratch. Furthermore, to improve computational efficiency, we present streamlined versions integrating EK-FAC. Experimental results show that our ECBMs are efficient and effective. Impact Statement This research proposes editable concept bottleneck models. While we acknowledge the importance of evaluating societal impacts, our analysis suggests that there are no immediate ethical risks requiring specific mitigation measures beyond standard practices in machine learning.
Acknowledgements
Lijie Hu, Chenyang Ren, Cheng-Long Wang, and Di Wang are supported in part by the funding BAS/1/1689-01-01, URF/1/4663-01-01, REI/1/5232-01-01, REI/1/5332-01-01, and URF/1/5508-01-01 from KAUST, and funding from KAUST - Center of Excellence for Generative AI, under award number 5940. Zhengyu Hu, Hongbin Lin, and Hui Xiong are supported by the National Key R&D Program of China (Grant No. 2023YFF0725001), the National Natural Science Foundation of China (Grant No. 92370204), the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023B1515120057), the Guangzhou-HKUST(GZ) Joint Funding Program (Grant No. 2023A03J0008), and the Education Bureau of Guangzhou Municipality.
References
Agarwal, N., Bullins, B., and Hazan, E. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(116):1-40, 2017.
Ahmad, M. A., Eckert, C., and Teredesai, A. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 559-560, 2018.
Ali, M. A., Li, Z., Yang, S., Cheng, K., Cao, Y., Huang, T., Hu, L., Yu, L., and Wang, D. Prompt-saw: Leveraging relation-aware graphs for textual prompt compression. arXiv preprint arXiv:2404.00489, 2024.
Bartlett, M. Approximate confidence intervals. Biometrika, 40(1/2):12-19, 1953.
Basu, S., Pope, P., and Feizi, S. Influence functions in deep learning are fragile. In International Conference on Learning Representations (ICLR), 2021.
Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141-159. IEEE, 2021.
Brophy, J. and Lowd, D. Machine unlearning for random forests. In International Conference on Machine Learning, pp. 1092-1104. PMLR, 2021.
Brunet, M.-E., Alkalay-Houlihan, C., Anderson, A., and Zemel, R.
Understanding the origins of bias in word embeddings. In International Conference on Machine Learning, pp. 803-811. PMLR, 2019.
Cao, L. AI in finance: challenges, techniques, and opportunities. ACM Computing Surveys (CSUR), 55(3):1-38, 2022.
Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463-480. IEEE, 2015.
Chauhan, K., Tiwari, R., Freyberg, J., Shenoy, P., and Dvijotham, K. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37(5), pp. 5948-5955, 2023.
Chen, C., Sun, F., Zhang, M., and Ding, B. Recommendation unlearning. In Proceedings of the ACM Web Conference 2022, pp. 2768-2777, 2022a.
Chen, H., Si, S., Li, Y., Chelba, C., Kumar, S., Boning, D., and Hsieh, C.-J. Multi-stage influence function. Advances in Neural Information Processing Systems, 33:12732-12742, 2020.
Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. Graph unlearning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 499-513, 2022b.
Cheng, K., Lin, G., Fei, H., Yu, L., Ali, M. A., Hu, L., Wang, D., et al. Multi-hop question answering under temporal knowledge editing. arXiv preprint arXiv:2404.00492, 2024.
Chowdhury, S. B. R., Choromanski, K., Sehanobish, A., Dubey, A., and Chaturvedi, S. Towards scalable exact machine unlearning using parameter-efficient fine-tuning. arXiv preprint arXiv:2406.16257, 2024.
Cook, R. D. Detection of influential observation in linear regression. Technometrics, 42(1):65-68, 2000.
Cook, R. D. and Weisberg, S. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495-508, 1980.
Das, A. and Rad, P. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371, 2020.
Espinosa Zarlenga, M., Barbiero, P., Ciravegna, G., Marra, G., Giannini, F., Diligenti, M., Shams, Z., Precioso, F., Melacci, S., Weller, A., Liò, P., and Jamnik, M. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems, volume 35, 2022.
George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vincent, P. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. Advances in Neural Information Processing Systems, 31, 2018.
Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304-9312, 2020a.
Golatkar, A., Achille, A., and Soatto, S. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, pp. 383-398. Springer, 2020b.
Golatkar, A., Achille, A., Ravichandran, A., Polito, M., and Soatto, S. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 792-801, 2021.
Guo, C., Goldstein, T., Hannun, A., and Van Der Maaten, L. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
Guo, H., Rajani, N., Hase, P., Bansal, M., and Xiong, C. FastIF: Scalable influence functions for efficient model interpretation and debugging. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10333-10350, 2021.
Han, X., Wallace, B. C., and Tsvetkov, Y. Explaining black box predictions and unveiling data artifacts through influence functions. arXiv preprint arXiv:2005.06676, 2020.
Havasi, M., Parbhoo, S., and Doshi-Velez, F. Addressing leakage in concept bottleneck models.
Advances in Neural Information Processing Systems, 35:23386-23397, 2022.
Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., and Wang, D. Improving faithfulness for vision transformers. arXiv preprint arXiv:2311.17983, 2023a.
Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., and Wang, D. SEAT: Stable and explainable attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37(11), pp. 12907-12915, 2023b.
Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008-2016. PMLR, 2021.
Keser, M., Schwalbe, G., Nowzad, A., and Knoll, A. Interpretable model-agnostic plausibility verification for 2D object detectors using domain-invariant concept bottleneck models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3890-3899, 2023.
Kim, E., Jung, D., Park, S., Kim, S., and Yoon, S. Probabilistic concept bottleneck models. arXiv preprint arXiv:2306.01574, 2023.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885-1894. PMLR, 2017.
Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. Concept bottleneck models. In International Conference on Machine Learning, pp. 5338-5348. PMLR, 2020.
Kwon, Y., Wu, E., Wu, K., and Zou, J. DataInf: Efficiently estimating data influence in LoRA-tuned LLMs and diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
Lai, S., Hu, L., Wang, J., Berti-Equille, L., and Wang, D. Faithful vision-language interpretation via concept bottleneck models. In The Twelfth International Conference on Learning Representations, 2023.
Liu, J., Lou, J., Qin, Z., and Ren, K. Certified minimax unlearning with generalization rates and deletion capacity. Advances in Neural Information Processing Systems, 36, 2024.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.
Mehta, R., Pal, S., Singh, V., and Ravi, S. N. Deep unlearning via randomized conditionally independent Hessians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10422-10431, 2022.
Nevitt, M., Felson, D., and Lester, G. The osteoarthritis initiative. Protocol for the cohort study, 1:2, 2006.
Oikarinen, T., Das, S., Nguyen, L., and Weng, L. Label-free concept bottleneck models. In International Conference on Learning Representations, 2023.
Rozenfeld, Y., Beam, J., Maier, H., Haggerson, W., Boudreau, K., Carlson, J., and Medows, R. A model of disparities: risk factors associated with COVID-19 infection. International Journal for Equity in Health, 19(1):126, 2020.
Sattar, N., McInnes, I. B., and McMurray, J. J. Obesity is a risk factor for severe COVID-19 infection: multiple potential mechanisms. Circulation, 142(1):4-6, 2020.
Sawada, Y. and Nakamura, K. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 10:41758-41765, 2022.
Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075-18086, 2021.
Sheth, I. and Ebrahimi Kahou, S. Auxiliary losses for learning generalizable concept-based models. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 26966-26990, 2023.
Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models.
In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2017.
Thudi, A., Deza, G., Chandrasekaran, V., and Papernot, N. Unrolling SGD: Understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pp. 303-319. IEEE, 2022.
Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology, 2011.
Wang, H., Ustun, B., and Calmon, F. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. In International Conference on Machine Learning, pp. 6618-6627. PMLR, 2019.
Warnecke, A., Pirch, L., Wressnegger, C., and Rieck, K. Machine unlearning of features and labels. Network and Distributed System Security (NDSS) Symposium, 2023.
Wu, Y., Dobriban, E., and Davidson, S. DeltaGrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pp. 10355-10366. PMLR, 2020.
Xu, X., Kong, K., Liu, N., Cui, L., Wang, D., Zhang, J., and Kankanhalli, M. An LLM can fool itself: A prompt-based adversarial attack. arXiv preprint arXiv:2310.13345, 2023.
Yang, S., Ali, M. A., Wang, C.-L., Hu, L., and Wang, D. MoRAL: MoE augmented LoRA for LLMs' lifelong learning. arXiv preprint arXiv:2402.11260, 2024a.
Yang, S., Hu, L., Yu, L., Ali, M. A., and Wang, D. Human-AI interactions in the communication era: Autophagy makes large models achieving local optima. arXiv preprint arXiv:2402.11271, 2024b.
Yang, S., Su, J., Jiang, H., Li, M., Cheng, K., Ali, M. A., Hu, L., and Wang, D. Dialectical alignment: Resolving the tension of 3H and security threats of LLMs. arXiv preprint arXiv:2404.00486, 2024c.
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
Yu, K.-H., Beam, A. L., and Kohane, I. S. Artificial intelligence in healthcare. Nature Biomedical Engineering, 2(10):719-731, 2018.
Yuksekgonul, M., Wang, M., and Zou, J. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations, 2023.
Zarifzadeh, S., Liu, P., and Shokri, R. Low-cost high-power membership inference attacks. In Forty-first International Conference on Machine Learning, 2024.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
A. Notation Table
Table 2: Notation Table.
c = {p_1, ..., p_k}: Set of concepts provided by experts or LLMs.
D = {z_i}_{i=1}^n: Training dataset, where z_i = (x_i, y_i, c_i).
x_i ∈ R^d: Feature vector of the i-th sample.
y_i ∈ R^{d_o}: Label of the i-th sample, with d_o being the number of classes.
c_i = (c_i^1, ..., c_i^k) ∈ R^k: Concept vector of the i-th sample.
c̄_w^r: Corrected concept label for the w-th sample and r-th concept.
c_i^j: Weight of the concept p_j in the concept vector c_i.
g: R^d → R^k: Concept predictor mapping the input space to the concept space.
f: R^k → R^{d_o}: Label predictor mapping the concept space to the prediction space.
L_C^j(g(x), c): Loss function of the j-th concept predictor.
L_Y(f(ĝ(x)), y): Loss function from the concept space to the output space.
G_C^j(x_i, c_i; g): Gradient of the loss L_C^j(g(x_i), c_i).
G_Y(x_i; g, f): Gradient of the loss L_Y(f(g(x_i)), y_i).
H_θ̂: Hessian matrix of the loss function with respect to θ̂.
G_θ̂: Fisher information matrix of the model θ̂.
δ: Damping term for ensuring positive definiteness of the Hessian.
ĝ: Estimated concept predictor.
f̂: Estimated label predictor.
ĝ_e: Retrained concept predictor after correcting erroneous data.
f̂_e: Retrained label predictor after correcting erroneous data.
ĝ^{∖M}: Retrained concept predictor after removing the concepts indexed by M.
ĝ^{∖M} (mapped): The above predictor mapped to the same dimensionality as ĝ.
g^{∖M}: Approximation of the retrained concept predictor ĝ^{∖M}.
f̂_{M=0}: Label predictor after setting the r-th concept to zero for each r ∈ M.
f_{M=0}: Approximation of the label predictor f̂_{M=0}.
H_ĝ: Hessian matrix of the loss function with respect to ĝ.
H_f̂: Hessian matrix of the loss function with respect to f̂.
M ⊆ [k]: Set of erroneous concept indices to be removed.
G ⊆ [n]: Set of indices of samples to be removed from the dataset.
z_r = (x_r, y_r, c_r): Data sample to be removed, where r ∈ G.
ĝ_{−G}: Retrained concept predictor after removing the samples indexed by G.
g_{−G}: Approximation of the retrained concept predictor ĝ_{−G}.
f_{−G}: Intermediate label predictor.
f_{−G}: Final edited label predictor after removing the samples indexed by G.
B. Influence Function
Consider a neural network θ̂ = argmin_θ Σ_{i=1}^n L(z_i, θ) with loss function L and dataset D = {z_i}_{i=1}^n. That is, θ̂ minimizes the empirical risk
R(θ) = Σ_{i=1}^n L(z_i, θ).
Assume R is strongly convex in θ; then θ̂ is uniquely defined. If we remove a point z_m from the training dataset, the parameters become θ̂_{−z_m} = argmin_θ Σ_{i≠m} L(z_i, θ). Up-weighting z_m by a sufficiently small ε, the revised risk R_ε(θ) = Σ_{i=1}^n L(z_i; θ) + ε L(z_m; θ) is still strongly convex, so its minimizer θ̂_{ε,z_m} is also uniquely defined. Denote the parameter change as Δ_ε = θ̂_{ε,z_m} − θ̂. Since θ̂_{ε,z_m} minimizes R_ε, the first-order optimality condition gives
∇R(θ̂_{ε,z_m}) + ε ∇L(z_m, θ̂_{ε,z_m}) = 0.
Since θ̂_{ε,z_m} → θ̂ as ε → 0, we perform a Taylor expansion of the left-hand side around θ̂:
[∇R(θ̂) + ε ∇L(z_m, θ̂)] + [∇²R(θ̂) + ε ∇²L(z_m, θ̂)] Δ_ε ≈ 0.
Noting that ε ∇²L(z_m, θ̂) Δ_ε is an o(‖Δ_ε‖) term, smaller than the remaining parts, we drop it in the following analysis. The Taylor expansion then becomes
[∇R(θ̂) + ε ∇L(z_m, θ̂)] + ∇²R(θ̂) Δ_ε ≈ 0.
Solving for Δ_ε, we obtain
Δ_ε ≈ −[∇²R(θ̂)]^{−1} [∇R(θ̂) + ε ∇L(z_m, θ̂)].
Remembering that θ̂ minimizes R, we have ∇R(θ̂) = 0. Dropping the o(ε) term,
Δ_ε ≈ −ε [∇²R(θ̂)]^{−1} ∇L(z_m, θ̂) = −ε H_θ̂^{−1} ∇L(z_m, θ̂).
In particular, removing z_m corresponds to ε = −1, so Δ_{−1} ≈ H_θ̂^{−1} ∇L(z_m, θ̂) ≜ I_up,params(z_m).
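The resulting one-step approximation θ̂_{−z_m} ≈ θ̂ + H_θ̂^{−1} ∇L(z_m, θ̂) can be sanity-checked against exact retraining on a toy model. A minimal NumPy sketch using ridge regression as a stand-in (all dimensions and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 100, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def fit(Xs, ys):
    # theta = argmin_t sum_i (x_i^T t - y_i)^2 + lam * ||t||^2
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

theta = fit(X, y)                  # trained on all n points
theta_retrain = fit(X[1:], y[1:])  # exact removal of z_0

# One-step influence edit: theta_{-z0} ~= theta + H^{-1} grad L(z_0, theta),
# with H the Hessian of the full training objective.
H = 2.0 * (X.T @ X + lam * np.eye(d))
grad0 = 2.0 * (X[0] @ theta - y[0]) * X[0]
theta_if = theta + np.linalg.solve(H, grad0)
```

On this quadratic stand-in, the influence estimate lands far closer to the retrained parameters than the unedited model does; the residual gap comes only from evaluating the Hessian on the full rather than the reduced dataset.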
Besides, we can obtain the approximation of θ̂_{−z_m} directly by θ̂_{−z_m} ≈ θ̂ + H_θ̂^{−1} ∇L(z_m, θ̂).
C. Acceleration for Influence Function
EK-FAC. The EK-FAC method relies on two approximations to the Fisher information matrix (equivalent to G_θ̂ in our setting) that make it feasible to compute the inverse of the matrix. First, assume that the derivatives of the weights in different layers are uncorrelated, which implies that G_θ̂ has a block-diagonal structure. Suppose ĝ_θ can be written as ĝ_θ(x) = g_{θ_L} ∘ ⋯ ∘ g_{θ_l} ∘ ⋯ ∘ g_{θ_1}(x) with l ∈ [L]. We fold the bias into the weights and vectorize the parameters of the l-th layer into a vector θ_l ∈ R^{d_l}, where d_l ∈ N is the number of l-th-layer parameters. Then G_θ̂ can be replaced by diag(G_1(θ̂), ..., G_L(θ̂)), where G_l(θ̂) ≜ n^{−1} Σ_{i=1}^n ∇_{θ_l}ℓ_i ∇_{θ_l}ℓ_i^T. Denote by h_l and o_l the output and pre-activated output of the l-th layer. Then G_l(θ) can be approximated by
G_l(θ) ≈ Ĝ_l(θ) ≜ (n^{−1} Σ_{i=1}^n h_{l−1}(x_i) h_{l−1}(x_i)^T) ⊗ (n^{−1} Σ_{i=1}^n ∇_{o_l}ℓ_i ∇_{o_l}ℓ_i^T) ≜ Ω_{l−1} ⊗ Γ_l.
Furthermore, in order to accelerate the inversion and introduce the damping term, we perform eigenvalue decompositions of the matrices Ω_{l−1} and Γ_l, obtaining Q_Ω Λ_Ω Q_Ω^T and Q_Γ Λ_Γ Q_Γ^T. Then the inverse of the damped approximation can be obtained as
Ĥ_l(θ)^{−1} ≈ (Ĝ_l(θ) + λ_l I_{d_l})^{−1} = (Q_Ω ⊗ Q_Γ)(Λ_Ω ⊗ Λ_Γ + λ_l I_{d_l})^{−1}(Q_Ω ⊗ Q_Γ)^T.
Besides, George et al. (2018) proposed a corrected variant that sets the i-th diagonal element of Λ_Ω ⊗ Λ_Γ to Λ*_ii = n^{−1} Σ_{j=1}^n ((Q_Ω ⊗ Q_Γ)^T ∇_{θ_l}ℓ_j)_i².
C.1. EK-FAC for CBMs
In our CBM model, the label predictor is a single linear layer, so the cost of computing its Hessian is affordable. However, the concept predictor is based on ResNet-18, which has many parameters. Therefore, we perform EK-FAC for ĝ. Since
ĝ = argmin_g Σ_{j=1}^k L_C^j = argmin_g Σ_{i=1}^n Σ_{j=1}^k L_C(g_j(x_i), c_i^j),
we define H_ĝ = ∇²_ĝ Σ_{i,j} L_C^j(g(x_i), c_i) as the Hessian matrix of the loss function with respect to the parameters.
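The Kronecker-factored damped inverse described above can be verified numerically: the eigenbasis form must agree with the dense inverse of Ω ⊗ Γ + λI. A small NumPy sketch with hypothetical 3×3 and 4×4 factors:

```python
import numpy as np

rng = np.random.default_rng(4)

def spd(m):
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)   # well-conditioned SPD factor

Omega, Gamma = spd(3), spd(4)        # stand-in activation/gradient factors
lam = 0.1                            # damping

# Eigendecompositions of the two small Kronecker factors.
eva, Qa = np.linalg.eigh(Omega)
evs, Qs = np.linalg.eigh(Gamma)

Q = np.kron(Qa, Qs)                  # eigenvectors of Omega (x) Gamma
Lam = np.kron(eva, evs)              # eigenvalues of Omega (x) Gamma

# Damped inverse via the eigenbasis: only small factors are decomposed.
G_inv = Q @ np.diag(1.0 / (Lam + lam)) @ Q.T

# Naive dense reference: invert the full 12x12 matrix directly.
G = np.kron(Omega, Gamma) + lam * np.eye(12)
```

The payoff in practice is that one decomposes two small factors instead of one matrix of size d_l × d_l, while the damping λ enters only on the diagonal of the eigenvalue product.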
To this end, consider the l-th layer of ĝ, which takes as input a layer of activations {a_{j,t}}, where j ∈ {1, 2, ..., J} indexes the input map and t ∈ T indexes the spatial location, typically a 2-D grid. This layer is parameterized by a set of weights W = (w_{i,j,δ}) and biases b = (b_i), where i ∈ {1, ..., I} indexes the output map and δ ∈ Δ indexes the spatial offset (from the center of the filter). The convolution layer computes a set of pre-activations as
[S_l]_{i,t} = s_{i,t} = Σ_{j,δ} w_{i,j,δ} a_{j,t+δ} + b_i.
Denote the loss derivative with respect to s_{i,t} as Ds_{i,t} ≜ ∂(Σ_j L_C^j)/∂s_{i,t}, which can be computed during backpropagation. The activations are stored as a matrix A_{l−1} of dimension |T| × J. Similarly, the weights are stored as an I × J|Δ| array W_l. The straightforward implementation of convolution, though highly parallel in theory, suffers from poor memory access patterns. Instead, efficient implementations typically leverage what is known as the expansion operator ⟦·⟧. For instance, ⟦A_{l−1}⟧ is a |T| × J|Δ| matrix, defined as
⟦A_{l−1}⟧_{t, j|Δ|+δ} = [A_{l−1}]_{(t+δ), j} = a_{j,t+δ}.
In order to fold the bias into the weights, we add a homogeneous coordinate (i.e., a column of all 1's) to the expanded activations ⟦A_{l−1}⟧ and denote this by ⟦A_{l−1}⟧_H. Concatenating the bias vector to the weight matrix, we have θ_l = (b_l, W_l). Then the approximation of H_ĝ is given by
G^{(l)}(ĝ) = E[Dw_{i,j,δ} Dw_{i',j',δ'}] = E[(Σ_{t∈T} a_{j,t+δ} Ds_{i,t})(Σ_{t'∈T} a_{j',t'+δ'} Ds_{i',t'})] ≈ E[⟦A_{l−1}⟧_H^T ⟦A_{l−1}⟧_H] ⊗ (1/|T|) E[DS_l^T DS_l] ≜ Ω_{l−1} ⊗ Γ_l.
We estimate the expectations using means over the training set: Ω_{l−1} ≈ (1/n) Σ_i ⟦A^i_{l−1}⟧_H^T ⟦A^i_{l−1}⟧_H and Γ_l ≈ (1/(n|T|)) Σ_i (DS^i_l)^T DS^i_l. Furthermore, if the factors Ω̂_{l−1} and Γ̂_l have eigendecompositions Q_Ω Λ_Ω Q_Ω^T and Q_Γ Λ_Γ Q_Γ^T, respectively, then the eigendecomposition of Ω̂_{l−1} ⊗ Γ̂_l can be written as
Ω̂_{l−1} ⊗ Γ̂_l = (Q_Ω Λ_Ω Q_Ω^T) ⊗ (Q_Γ Λ_Γ Q_Γ^T) = (Q_Ω ⊗ Q_Γ)(Λ_Ω ⊗ Λ_Γ)(Q_Ω ⊗ Q_Γ)^T.
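The corrected diagonal of George et al. (2018), Λ*_ii = n^{−1} Σ_j ((Q_Ω ⊗ Q_Γ)^T ∇_{θ_l}ℓ_j)_i², is just the second moment of the per-sample gradients projected into the Kronecker eigenbasis. A sketch with a generic orthonormal Q standing in for Q_Ω ⊗ Q_Γ and stand-in gradients (all sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 12
grads = rng.normal(size=(n, p))       # stand-in per-sample gradients

# Any orthonormal Q stands in for the Kronecker eigenbasis Q_Omega (x) Q_Gamma.
Q = np.linalg.qr(rng.normal(size=(p, p)))[0]

# EK-FAC correction: refit each eigenvalue as the mean squared projection
# of the per-sample gradients onto the corresponding eigenvector.
proj = grads @ Q                      # row j holds (Q^T grad_j)^T
Lam_star = (proj ** 2).mean(axis=0)   # Lam*_ii = n^{-1} sum_j (Q^T grad_j)_i^2

# Equivalently, Lam* is the diagonal of Q^T F Q for the empirical Fisher F.
F = grads.T @ grads / n
```

The correction therefore keeps the cheap Kronecker eigenbasis but replaces the Kronecker-product eigenvalues with the exact diagonal of the Fisher in that basis.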
Since subsequent inverse operations are required and the current approximation of G^{(l)}(ĝ) is only positive semi-definite, we actually use a damped version:
Ĝ_l(ĝ)^{−1} = (G_l(ĝ) + λ_l I_{d_l})^{−1} ≈ (Q_Ω ⊗ Q_Γ)(Λ_Ω ⊗ Λ_Γ + λ_l I_{d_l})^{−1}(Q_Ω ⊗ Q_Γ)^T. (13)
Besides, George et al. (2018) proposed a corrected variant of equation 13 that sets the i-th diagonal element of Λ_Ω ⊗ Λ_Γ to Λ*_ii = n^{−1} Σ_{j=1}^n ((Q_Ω ⊗ Q_Γ)^T ∇_{θ_l}ℓ_j)_i².
D. Concept-label-level Influence
D.1. Proof of Concept-label-level Influence Function
We have a set of erroneous data D_e and its associated index set S_e ⊆ [n] × [k] such that for each (w, r) ∈ S_e we have (x_w, y_w, c_w) ∈ D_e, where c^r_w is mislabeled and c̄^r_w is its corrected concept label. Thus, our goal is to approximate the new CBM without retraining.
Proof Sketch. Our goal is to edit ĝ and f̂ into ĝ_e and f̂_e. (i) First, we introduce new parameters ĝ_{ε,e} that minimize a modified loss function with a small perturbation ε. (ii) Then, we perform a Newton step around ĝ and obtain an estimate for ĝ_e. (iii) Next, we consider changing the concept predictor at one data point (x_{i_c}, y_{i_c}, c_{i_c}), retrain the model to obtain a new label predictor f̂_{i_c}, and obtain an approximation for f̂_{i_c}. (iv) We then iterate i_c over 1, 2, ..., n, sum all the equations together, and perform a Newton step around f̂ to obtain an approximation for f̂_e. (v) Finally, we substitute the estimate of ĝ into the equation for f̂_e to obtain the final approximation.
Theorem D.1. Define the gradient of the j-th concept predictor at the i-th data point x_i as
G^j_C(x_i, c_i; g) ≜ ∇_g L^j_C(g(x_i), c_i). (14)
Then the retrained concept predictor ĝ_e, defined by
ĝ_e = argmin_g Σ_{(i,j)∉S_e} L^j_C(g(x_i), c_i) + Σ_{(i,j)∈S_e} L^j_C(g(x_i), c̄_i),
can be approximated by g_e, defined by
g_e ≜ ĝ − H_ĝ^{−1} Σ_{(w,r)∈S_e} (G^r_C(x_w, c̄_w; ĝ) − G^r_C(x_w, c_w; ĝ)), (16)
where H_ĝ = ∇²_ĝ Σ_{i,j} L^j_C(ĝ(x_i), c_i) is the Hessian matrix of the loss function with respect to ĝ.
Proof.
For each index (w, r) ∈ S_e, indicating that the r-th concept of the w-th data point is wrong, we correct this concept c^r_w to c̄^r_w. Rewrite ĝ_e as
ĝ_e = argmin_g Σ_{i,j} L^j_C(g(x_i), c_i) + Σ_{(w,r)∈S_e} L^r_C(g(x_w), c̄_w) − Σ_{(w,r)∈S_e} L^r_C(g(x_w), c_w). (17)
To approximate this effect, define new parameters ĝ_{ε,e} as
ĝ_{ε,e} ≜ argmin_g Σ_{i,j} L^j_C(g(x_i), c_i) + Σ_{(w,r)∈S_e} ε L^r_C(g(x_w), c̄_w) − Σ_{(w,r)∈S_e} ε L^r_C(g(x_w), c_w). (18)
Then, because ĝ_{ε,e} minimizes equation 18, we have
Σ_{i,j} ∇_g L^j_C(ĝ_{ε,e}(x_i), c_i) + Σ_{(w,r)∈S_e} ε ∇_g L^r_C(ĝ_{ε,e}(x_w), c̄_w) − Σ_{(w,r)∈S_e} ε ∇_g L^r_C(ĝ_{ε,e}(x_w), c_w) = 0.
Performing a Taylor expansion of the above equation at ĝ,
Σ_{i,j} ∇_g L^j_C(ĝ(x_i), c_i) + Σ_{(w,r)∈S_e} ε ∇_g L^r_C(ĝ(x_w), c̄_w) − Σ_{(w,r)∈S_e} ε ∇_g L^r_C(ĝ(x_w), c_w) + ∇²_g Σ_{i,j} L^j_C(ĝ(x_i), c_i) (ĝ_{ε,e} − ĝ) ≈ 0. (19)
Because of equation 15, the first term of equation 19 equals 0. Then we have
ĝ_{ε,e} − ĝ ≈ −Σ_{(w,r)∈S_e} ε H_ĝ^{−1} (∇_g L^r_C(ĝ(x_w), c̄_w) − ∇_g L^r_C(ĝ(x_w), c_w)),
where H_ĝ = ∇²_ĝ Σ_{i,j} L^j_C(ĝ(x_i), c_i). Then, taking a Newton step around ĝ, we obtain
ĝ_e ≈ g_e ≜ ĝ − H_ĝ^{−1} Σ_{(w,r)∈S_e} (∇_g L^r_C(ĝ(x_w), c̄_w) − ∇_g L^r_C(ĝ(x_w), c_w)). (20)
With the gradient notation G^j_C(x_i, c_i; g) ≜ ∇_g L^j_C(g(x_i), c_i) of equation 14, this is exactly the approximation g_e claimed in Theorem D.1.
Theorem D.2. Define the gradient of the label predictor at the i-th data point x_i as
G_Y(x_i; g, f) ≜ ∇_f L_Y(f(g(x_i)), y_i). (24)
Then the retrained label predictor f̂_e, defined by
f̂_e = argmin_f Σ_{i=1}^n L_Y(f(ĝ_e(x_i)), y_i),
can be approximated by
f_e ≜ f̂ + H_f̂^{−1} Σ_{i=1}^n (G_Y(x_i; ĝ, f̂) − G_Y(x_i; g_e, f̂)),
where H_f̂ = ∇_f̂ Σ_{i=1}^n G_Y(x_i; ĝ, f̂) is the Hessian matrix and g_e is given in Theorem D.1.
Proof. We now derive the edited label predictor f̂_e.
First, we consider changing the concept predictor at only one data point (x_{i_c}, y_{i_c}, c_{i_c}) and retrain the model to obtain a new label predictor f̂_{i_c}:
f̂_{i_c} = argmin_f Σ_{i=1, i≠i_c}^n L_Y(f(ĝ(x_i)), y_i) + L_Y(f(ĝ_e(x_{i_c})), y_{i_c}).
We rewrite the above equation as follows:
f̂_{i_c} = argmin_f Σ_{i=1}^n L_Y(f(ĝ(x_i)), y_i) + L_Y(f(ĝ_e(x_{i_c})), y_{i_c}) − L_Y(f(ĝ(x_{i_c})), y_{i_c}).
We define f̂_{ε,i_c} as
f̂_{ε,i_c} = argmin_f Σ_{i=1}^n L_Y(f(ĝ(x_i)), y_i) + ε L_Y(f(ĝ_e(x_{i_c})), y_{i_c}) − ε L_Y(f(ĝ(x_{i_c})), y_{i_c}).
Differentiating with respect to f on both sides of the optimality condition, we have
Σ_{i=1}^n ∇_f L_Y(f̂_{ε,i_c}(ĝ(x_i)), y_i) + ε ∇_f L_Y(f̂_{ε,i_c}(ĝ_e(x_{i_c})), y_{i_c}) − ε ∇_f L_Y(f̂_{ε,i_c}(ĝ(x_{i_c})), y_{i_c}) = 0.
Performing a Taylor expansion of the above equation at f̂,
Σ_{i=1}^n ∇_f L_Y(f̂(ĝ(x_i)), y_i) + ε ∇_f L_Y(f̂(ĝ_e(x_{i_c})), y_{i_c}) − ε ∇_f L_Y(f̂(ĝ(x_{i_c})), y_{i_c}) + ∇²_f Σ_{i=1}^n L_Y(f̂(ĝ(x_i)), y_i) (f̂_{ε,i_c} − f̂) ≈ 0.
Then we have
f̂_{ε,i_c} − f̂ ≈ −ε H_f̂^{−1} (∇_f L_Y(f̂(ĝ_e(x_{i_c})), y_{i_c}) − ∇_f L_Y(f̂(ĝ(x_{i_c})), y_{i_c})),
where H_f̂ = ∇²_f̂ Σ_{i=1}^n L_Y(f̂(ĝ(x_i)), y_i). Iterating i_c over 1, 2, ..., n and summing all the equations together, we can obtain
f̂_{ε,e} − f̂ ≈ −ε H_f̂^{−1} Σ_{i=1}^n (∇_f L_Y(f̂(ĝ_e(x_i)), y_i) − ∇_f L_Y(f̂(ĝ(x_i)), y_i)).
Performing a Newton step around f̂, we have
f̂_e ≈ f̂ − H_f̂^{−1} Σ_{i=1}^n (∇_f L_Y(f̂(ĝ_e(x_i)), y_i) − ∇_f L_Y(f̂(ĝ(x_i)), y_i)). (25)
Substituting the edited predictor g_e of equation 20 into equation 25, we have
f_e ≜ f̂ − H_f̂^{−1} Σ_{i=1}^n (∇_f L_Y(f̂(g_e(x_i)), y_i) − ∇_f L_Y(f̂(ĝ(x_i)), y_i)).
Defining the gradient of the label predictor at the i-th data point x_i as
G_Y(x_i; g, f) ≜ ∇_f L_Y(f(g(x_i)), y_i), (26)
we can write f_e = f̂ + H_f̂^{−1} Σ_{i=1}^n (G_Y(x_i; ĝ, f̂) − G_Y(x_i; g_e, f̂)), which is the approximation claimed in Theorem D.2, with g_e given in Theorem D.1.
D.2.
Theoretical Bound for the Influence Function
Consider the dataset D = {(x_i, c_i, y_i)}_{i=1}^n. The loss function of the concept predictor g is defined as
L_Total(D; g) = Σ_{i=1}^n L_C(g(x_i), c_i) + (δ/2)‖g‖² = Σ_{i=1}^n Σ_{j=1}^k L^j_C(g(x_i), c_i) + (δ/2)‖g‖².
Mathematically, we have a set of erroneous data D_e and its associated index set S_e ⊆ [n] × [k] such that for each (w, r) ∈ S_e we have (x_w, y_w, c_w) ∈ D_e, where c^r_w is mislabeled and c̄^r_w is the corrected concept label. Thus, our goal is to estimate the retrained CBM. The retrained concept predictor is represented as
ĝ_e = argmin_g Σ_{(i,j)∉S_e} L^j_C(g(x_i), c_i) + Σ_{(i,j)∈S_e} L^j_C(g(x_i), c̄_i) + (δ/2)‖g‖².
Define the corrected dataset as D'. Then the loss function with the influence of the erroneous data D_e removed becomes
L'(D'; g) = Σ_{(i,j)∉S_e} L^j_C(g(x_i), c_i) + Σ_{(i,j)∈S_e} L^j_C(g(x_i), c̄_i) + (δ/2)‖g‖². (28)
Assume ĝ = argmin_g L_Total(D; g) is the original model parameter, and ĝ_e(D') is the minimizer of L'(D'; g), obtained by retraining. Denote by g_e(D') the updated model with the influence of the erroneous data D_e removed, obtained by the influence-function method in Theorem D.1, which is an estimate of ĝ_e(D'). To simplify the problem, we concentrate on the removal of the erroneous data D_e and neglect the process of adding the corrected data back. Once we obtain the bound for ‖ĝ_e(D') − g_e(D')‖ under this circumstance, the bound for the case where the corrected data are added back can be derived using a similar approach. For brevity, we use the same notation. Then the loss function L'(D'; g) becomes
L'(D'; g) = Σ_{(i,j)∉S_e} L^j_C(g(x_i), c_i) + (δ/2)‖g‖² = L_Total(D; g) − Σ_{(i,j)∈S_e} L^j_C(g(x_i), c_i), (29)
and the definition of g_e(D') becomes
g_e(D') ≜ ĝ + H_ĝ^{−1} Σ_{(w,r)∈S_e} G^r_C(x_w, c_w; ĝ), (30)
where H_ĝ = ∇²_ĝ Σ_{i,j} L^j_C(ĝ(x_i), c_i) + δI is the Hessian matrix of the loss function with respect to ĝ. Here δI is a small damping term for ensuring the positive definiteness of the Hessian.
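For a quadratic loss, the Theorem D.1-style one-step edit recovers retraining exactly, which makes a small numerical check possible. The sketch below uses squared loss as a stand-in for the concept loss L_C and `lam` as the damping/regularization δ (all quantities hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 200, 6, 1.0
X = rng.normal(size=(n, d))
c = (X @ rng.normal(size=d) > 0).astype(float)  # true concept labels

c_wrong = c.copy()
c_wrong[0] = 1.0 - c_wrong[0]                   # mislabel sample 0's concept

def fit(labels):
    # argmin_g sum_i (x_i^T g - label_i)^2 + lam * ||g||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ labels)

g_hat = fit(c_wrong)       # model trained with the erroneous label
g_retrain = fit(c)         # ground truth: retrained on the corrected label

# One Newton step with the gradient difference between the corrected and
# erroneous labels (the Theorem D.1 edit, squared-loss stand-in).
H = 2.0 * (X.T @ X + lam * np.eye(d))
grad_new = 2.0 * (X[0] @ g_hat - c[0]) * X[0]
grad_old = 2.0 * (X[0] @ g_hat - c_wrong[0]) * X[0]
g_edit = g_hat - np.linalg.solve(H, grad_new - grad_old)
```

On this stand-in the edit matches retraining to numerical precision; for non-quadratic losses the mismatch is exactly the error that the bound of this section controls, shrinking as the damping δ grows.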
Introducing the damping term into the Hessian is essentially equivalent to adding a regularization term to the initial loss function. Consequently, δ can also be interpreted as the regularization strength. In this part, we study the error between the estimated influence given by the method of Theorem D.1 and retraining. We use the parameter change as the evaluation metric:
‖(g_e − ĝ) − (ĝ_e − ĝ)‖ = ‖g_e − ĝ_e‖. (31)
Assumption D.3. The loss L_C(x, c; g; j) ≜ L^j_C(g(x), c) is convex and twice-differentiable in g, with positive regularization δ > 0. There exists C_H ∈ R such that ‖∇²_g L_C(x, c; g_1) − ∇²_g L_C(x, c; g_2)‖₂ ≤ C_H ‖g_1 − g_2‖₂ for all (x, c) ∈ D = {(x_i, c_i)}_{i=1}^n, j ∈ [k], and g_1, g_2 ∈ Γ.
Then the function L̄(D, S_e; g) ≜ Σ_{(i,j)∈S_e} L^j_C(g(x_i), c_i) = Σ_{(i,j)∈S_e} L_C(x_i, c_i; g; j) is convex and twice-differentiable in g, with some positive regularization, and we have ‖∇²_g L̄(D, S_e; g_1) − ∇²_g L̄(D, S_e; g_2)‖₂ ≤ |S_e| C_H ‖g_1 − g_2‖₂ for g_1, g_2 ∈ Γ.
Corollary D.4. ‖∇²_g L'(D'; g_1) − ∇²_g L'(D'; g_2)‖₂ ≤ (nk + |S_e|) C_H ‖g_1 − g_2‖₂. Define C'_H ≜ (nk + |S_e|) C_H.
Definition D.5. Define C'_L ≜ ‖∇_g L̄(D, S_e; ĝ)‖₂, σ'_min as the smallest singular value of ∇²_g L'(D'; ĝ), and σ_min as the smallest singular value of ∇²_g L_Total(D; ĝ).
Based on the above corollaries and assumptions, we derive the following theorem.
Theorem D.6. The error between the actual influence and our predicted influence satisfies
‖ĝ_e(D') − g_e(D')‖ ≤ C'_H C'^2_L / (2(σ'_min + δ)³) + (2δ + σ_min + σ'_min) C'_L / ((δ + σ'_min)(δ + σ_min)).
Proof. We will use the one-step Newton approximation as an intermediate step. Define the Newton step as
Δg_Nt(D') ≜ H_δ^{−1} ∇_g L̄(D, S_e; ĝ),
where H_δ = δI + ∇²_g L'(D'; ĝ) is the regularized empirical Hessian at ĝ, reweighted after removing the influence of the erroneous data. The one-step Newton approximation of ĝ_e(D') is then defined as g_Nt(D') ≜ ĝ + Δg_Nt(D').
In the following, we separate the error between $\tilde g_e(D')$ and $\hat g_e(D')$ into two parts:
$$\hat g_e(D') - \tilde g_e(D') = \underbrace{\hat g_e(D') - g_{Nt}(D')}_{\mathrm{Err}_{Nt,\mathrm{act}}(D')} + \underbrace{(g_{Nt}(D') - \hat g) - (\tilde g_e(D') - \hat g)}_{\mathrm{Err}_{Nt,\mathrm{if}}(D')}.$$
Firstly, in Step 1, we derive the bound for the Newton-actual error $\mathrm{Err}_{Nt,\mathrm{act}}(D')$. Since $L'(g)$ is strongly convex with parameter $\sigma'_{\min} + \delta$ and minimized by $\hat g_e(D')$, we can bound the distance $\|\hat g_e(D') - g_{Nt}(D')\|_2$ in terms of the norm of the gradient at $g_{Nt}$:
$$\|\hat g_e(D') - g_{Nt}(D')\|_2 \le \frac{2}{\sigma'_{\min} + \delta}\, \|\nabla_g L'(g_{Nt}(D'))\|_2. \qquad (32)$$
Therefore, the problem reduces to bounding $\|\nabla_g L'(g_{Nt}(D'))\|_2$. Note that $\nabla_g L'(\hat g) = -\nabla_g L^{\triangle}(\hat g)$; this is because $\hat g$ minimizes $L' + L^{\triangle}$, that is, $\nabla_g L'(\hat g) + \nabla_g L^{\triangle}(\hat g) = 0$. Recall that $\Delta g_{Nt} = H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, S_e; \hat g) = -H_{\delta}^{-1}\, \nabla_g L'(D'; \hat g)$. Given the above conditions, we can bound $\mathrm{Err}_{Nt,\mathrm{act}}(D')$:
$$\begin{aligned} \|\nabla_g L'(g_{Nt}(D'))\|_2 &= \|\nabla_g L'(\hat g + \Delta g_{Nt}(D')) - \nabla_g L'(\hat g) - \nabla_g^2 L'(\hat g)\,\Delta g_{Nt}(D')\|_2 \\ &\le \int_0^1 \|\nabla_g^2 L'(\hat g + t\,\Delta g_{Nt}(D')) - \nabla_g^2 L'(\hat g)\|_2\, \|\Delta g_{Nt}(D')\|_2\, dt \le \frac{C_H'}{2}\, \|\Delta g_{Nt}(D')\|_2^2 \\ &= \frac{C_H'}{2}\, \|H_{\delta}^{-1}\, \nabla_g L^{\triangle}(\hat g)\|_2^2 \le \frac{C_H'}{2(\sigma'_{\min} + \delta)^2}\, \|\nabla_g L^{\triangle}(\hat g)\|_2^2 \le \frac{C_H' C_L'^2}{2(\sigma'_{\min} + \delta)^2}. \qquad (33) \end{aligned}$$
Now we come to Step 2 to bound $\mathrm{Err}_{Nt,\mathrm{if}}(D')$, where we bound the difference in parameter change between the Newton step and our ECBM method.
$$(g_{Nt}(D') - \hat g) - (\tilde g_e(D') - \hat g) = \left[\big(\delta I + \nabla_g^2 L'(\hat g)\big)^{-1} - \big(\delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g)\big)^{-1}\right]\nabla_g L^{\triangle}(D, S_e; \hat g).$$
For simplification, we substitute the matrices
$$A = \delta I + \nabla_g^2 L'(\hat g), \qquad B = \delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g),$$
which are positive definite with $(\delta + \sigma'_{\min}) I \preceq A \preceq (\delta + \sigma'_{\max}) I$ and $(\delta + \sigma_{\min}) I \preceq B \preceq (\delta + \sigma_{\max}) I$. Therefore,
$$\begin{aligned} \|(g_{Nt}(D') - \hat g) - (\tilde g_e(D') - \hat g)\|_2 &= \|(A^{-1} - B^{-1})\,\nabla_g L^{\triangle}(D, S_e; \hat g)\|_2 \le \big(\|A^{-1}\|_2 + \|B^{-1}\|_2\big)\,\|\nabla_g L^{\triangle}(D, S_e; \hat g)\|_2 \\ &\le \frac{2\delta + \sigma_{\min} + \sigma'_{\min}}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}\,\|\nabla_g L^{\triangle}(D, S_e; \hat g)\|_2 \le \frac{C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}. \qquad (34) \end{aligned}$$
By combining the conclusions of Step 1 and Step 2 in Equations (32), (33), and (34), we obtain the error between the actual influence and our predicted influence:
$$\|\hat g_e(D') - \tilde g_e(D')\|_2 \le \frac{C_H' C_L'^2}{(\sigma'_{\min} + \delta)^3} + \frac{C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}.$$
Remark D.7. Theorem D.6 reveals one key finding about influence-function estimation: the estimation error scales inversely with the regularization parameter $\delta$ (i.e., $O(1/\delta)$), indicating that increased regularization improves approximation accuracy.
Remark D.8. In CBMs, retraining is the most accurate way to handle the removal of a training data point. For the concept predictor, we derive a theoretical error bound for the influence-function-based approximation. The label predictor, however, differs: as a single-layer linear model, it is computationally inexpensive to retrain, yet its input depends on the concept predictor, which makes a theoretical analysis challenging due to (1) input dependency: changes in the concept predictor affect the label predictor's input, coupling their updates; and (2) error propagation: errors from the concept predictor propagate to the label predictor, introducing complex interactions. Given the label predictor's low retraining cost, direct retraining is more practical and accurate. Thus, we focus our theoretical analysis on the concept predictor.
E. Concept-level Influence
E.1.
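To make the $O(1/\delta)$ scaling in Remark D.7 concrete, the bound can simply be evaluated for several damping strengths. The constants below are made-up illustrative values, not quantities measured in the paper; the sketch only demonstrates that the bound shrinks as $\delta$ grows.

```python
# Evaluate a Theorem D.6-style error bound for several damping values delta.
# C_H, C_L, s_min, s_min_p are illustrative stand-ins for C_H', C_L',
# sigma_min, and sigma'_min (assumed values, not from the paper).
C_H, C_L, s_min, s_min_p = 2.0, 1.5, 0.1, 0.05

def error_bound(delta):
    newton_term = C_H * C_L**2 / (s_min_p + delta)**3
    influence_term = C_L * (2 * delta + s_min + s_min_p) / (
        (delta + s_min_p) * (delta + s_min))
    return newton_term + influence_term

bounds = [error_bound(d) for d in (0.1, 1.0, 10.0)]
assert bounds[0] > bounds[1] > bounds[2]   # larger delta -> tighter bound
```

The first term decays cubically and the second roughly as $1/\delta$, so for large damping the influence-versus-Newton discrepancy dominates.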
Proof of Concept-level Influence Function. We address the situation where the concepts $p_r$, $r \in M$, are removed from the dataset. Our goal is to estimate $\hat g^{-p_M}$ and $\hat f^{-p_M}$, the concept and label predictors trained on the dataset with the concepts $p_r$, $r \in M$, removed.
Proof Sketch. The main ideas are as follows: (i) First, we define a new predictor $\hat g'^{-p_M}$, which has the same dimension as $\hat g$ and the same output as $\hat g^{-p_M}$, and deduce an approximation of $\hat g'^{-p_M}$. (ii) Then, instead of removing $p_r$, we consider setting $p_r = 0$; this yields $\hat f_{p_M=0}$, which is equivalent to $\hat f^{-p_M}$ by Lemma E.1, and we estimate this new predictor as a substitute. (iii) Next, we assume we use the updated concept predictor $\hat g'^{-p_M}$ for only one data point $(x_{i_r}, y_{i_r}, c_{i_r})$, obtain a new label predictor $\hat f_{i_r}$, and derive a one-step Newton approximation of $\hat f_{i_r}$ around $\hat f$. (iv) Finally, we repeat the above process for all data points and, combining the estimate of $\hat g'^{-p_M}$ from Theorem E.2, obtain a closed-form influence function for $\hat f$.
First, we introduce the following lemma.
Lemma E.1. For the concept bottleneck model, suppose the label predictor applies a linear transformation $\hat f^{\top} c$ to its input $c$. For each $r \in M$, remove the $r$-th concept from $c$ and denote the new input by $c'$; alternatively, set the $r$-th concept to $0$ and denote the new input by $c_0$. Then $(\hat f^{-p_M})^{\top} c' = (\hat f_{p_M=0})^{\top} c_0$ for any $c$.
Proof. Assume the parameter spaces of $\hat f^{-p_M}$ and $\hat f_{p_M=0}$ are $\Gamma'$ and $\Gamma_0$, respectively. There exists a surjection $P: \Gamma_0 \to \Gamma'$, where for any $\theta \in \Gamma_0$, $P(\theta)$ is the operation that removes the $r$-th row of $\theta$ for every $r \in M$. Then we have
$$P(\theta)^{\top} c' = \sum_{t \notin M} \theta[t]\, c[t] = \sum_{t} \theta[t]\, \mathbb{I}\{t \notin M\}\, c[t] = \theta^{\top} c_0.$$
Thus the second-stage loss satisfies $L_Y(\theta, c_0) = L_Y(P(\theta), c')$ for both models on every sample. Moreover, by differentiating, for any $\theta' \in \Gamma'$ and any $\theta \in P^{-1}(\theta')$, the components of $\nabla_\theta L_Y(\theta, c_0)$ indexed outside $M$ coincide with $\nabla_{\theta'} L_Y(\theta', c')$. Thus, if the same initialization is performed, $(\hat f^{-p_M})^{\top} c' = (\hat f_{p_M=0})^{\top} c_0$ for any $c$ in the dataset.
Theorem E.2.
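Lemma E.1 is easy to check numerically. A minimal numpy sketch, under an assumed orientation convention (logits computed as `theta @ c`, so removing a concept removes a column of `theta`; shapes and values are illustrative):

```python
import numpy as np

# Sketch of Lemma E.1 for a linear label predictor logits = theta @ c:
# deleting the r-th concept from the input together with the matching
# column of theta gives the same logits as keeping the dimensions but
# zeroing either the concept or the column.
rng = np.random.default_rng(1)
k, n_classes, r = 5, 3, 2
theta = rng.normal(size=(n_classes, k))
c = rng.normal(size=k)

# removed version: drop concept r and the matching column of theta
out_removed = np.delete(theta, r, axis=1) @ np.delete(c, r)

# zeroed version: same shapes, but concept r set to 0
c0 = c.copy(); c0[r] = 0.0
out_zeroed = theta @ c0

assert np.allclose(out_removed, out_zeroed)
```

The removed entry never contributes to the inner product, which is all the lemma's loss-equivalence relies on.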
For the retrained concept predictor $\hat g^{-p_M}$ defined by
$$\hat g^{-p_M} = \arg\min_{g'} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g'(x_i), c_i),$$
we map it to $\hat g'^{-p_M}$, defined as
$$\hat g'^{-p_M} = \arg\min_{g' \in T_0} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g'(x_i), c_i).$$
We can then edit the initial $\hat g$ into an estimate $\tilde g'^{-p_M}$ of $\hat g'^{-p_M}$, defined as
$$\tilde g'^{-p_M} \triangleq \hat g - H_{\hat g}^{-1} \sum_{i=1}^n \sum_{j \notin M} G_C^j(x_i, c_i; \hat g), \qquad (35)$$
where $H_{\hat g} = \nabla_g \sum_{j \notin M} \sum_{i=1}^n G_C^j(x_i, c_i; \hat g)$. Then, by removing all zero rows inserted during the mapping phase, we naturally obtain the approximation $\hat g^{-p_M} \approx P^{-1}(\tilde g'^{-p_M})$.
Proof. At this level, we consider the scenario of removing a set of mislabeled concepts (or introducing new ones). After removing concepts from all the data, the new concept predictor has a different dimension from the original one. We denote by $g_j(x_i)$ the $j$-th concept predictor applied to $x_i$, and by $c_i^j$ the $j$-th concept of data point $z_i$. For simplicity, we treat $g$ as a collection of $k$ concept predictors and separate the columns into vectors $g_j(x_i)$; in practice, the neural network computes $g$ as a whole. For comparison, we introduce the new notation $\hat g'^{-p_M}$. Writing out the weights of the last layer of the network, $\hat g^{-p_M}$ and $\hat g'^{-p_M}$ act as follows (shown for a single removed concept $r \in M$):
$$\hat g^{-p_M}(x) = \underbrace{\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1 d_i} \\ w_{21} & w_{22} & \cdots & w_{2 d_i} \\ \vdots & \vdots & & \vdots \\ w_{(k-1)1} & w_{(k-1)2} & \cdots & w_{(k-1) d_i} \end{pmatrix}}_{(k-1) \times d_i} \underbrace{\begin{pmatrix} x_1 \\ \vdots \\ x_{d_i} \end{pmatrix}}_{d_i \times 1} = \underbrace{\begin{pmatrix} c_1 \\ \vdots \\ c_{r-1} \\ c_{r+1} \\ \vdots \\ c_k \end{pmatrix}}_{(k-1) \times 1},$$
$$\hat g'^{-p_M}(x) = \underbrace{\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1 d_i} \\ \vdots & \vdots & & \vdots \\ w_{(r-1)1} & w_{(r-1)2} & \cdots & w_{(r-1) d_i} \\ 0 & 0 & \cdots & 0 \\ w_{(r+1)1} & w_{(r+1)2} & \cdots & w_{(r+1) d_i} \\ \vdots & \vdots & & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{k d_i} \end{pmatrix}}_{k \times d_i} \begin{pmatrix} x_1 \\ \vdots \\ x_{d_i} \end{pmatrix} = \begin{pmatrix} c_1 \\ \vdots \\ c_{r-1} \\ 0 \\ c_{r+1} \\ \vdots \\ c_k \end{pmatrix}.$$
Firstly, we want to edit $\hat g$ into $\hat g'^{-p_M} \in T_0 = \{w_{\mathrm{final}, r} = 0 \text{ for } r \in M\} \subseteq T$, where $w_{\mathrm{final}}$ denotes the parameters of the final layer of the neural network. Let us take a look at the definition of $\hat g'^{-p_M}$:
$$\hat g'^{-p_M} = \arg\min_{g' \in T_0} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g'(x_i), c_i).$$
Then, we separate the items related to the removed concepts from the rest and rewrite $\hat g$ in the following form:
$$\hat g = \arg\min_{g \in T} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g(x_i), c_i) + \sum_{i=1}^n \sum_{r \in M} L_C^r(g(x_i), c_i).$$
If the removed-concept part is up-weighted by some small $\epsilon$, we obtain the new parameters $\hat g_{\epsilon,-p_M}$, abbreviated as $\hat g_{\epsilon}$ below:
$$\hat g_{\epsilon,-p_M} \triangleq \arg\min_{g \in T} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g(x_i), c_i) + \epsilon \sum_{i=1}^n \sum_{r \in M} L_C^r(g(x_i), c_i).$$
Obviously, when $\epsilon \to 0$, $\hat g_{\epsilon} \to \hat g'^{-p_M}$. From the definitions above we obtain the minimization conditions
$$\nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g'^{-p_M}(x_i), c_i) = 0, \qquad (36)$$
$$\nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_{\epsilon}(x_i), c_i) + \epsilon\, \nabla_g \sum_{i=1}^n \sum_{r \in M} L_C^r(\hat g_{\epsilon}(x_i), c_i) = 0.$$
Performing a first-order Taylor expansion of Equation (36) around $\hat g_{\epsilon}$ gives
$$\nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_{\epsilon}(x_i), c_i) + \nabla_g^2 \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_{\epsilon}(x_i), c_i)\, \big(\hat g'^{-p_M} - \hat g_{\epsilon}\big) \approx 0,$$
and hence
$$\hat g'^{-p_M} - \hat g_{\epsilon} \approx -H_{\hat g_{\epsilon}}^{-1}\, \nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_{\epsilon}(x_i), c_i),$$
where $H_{\hat g_{\epsilon}} = \nabla_g^2 \sum_{j \notin M} \sum_{i=1}^n L_C^j(\hat g_{\epsilon}(x_i), c_i)$. Observing that $\hat g_{\epsilon} = \hat g'^{-p_M}$ when $\epsilon = 0$ and $\hat g_{\epsilon} = \hat g$ when $\epsilon = 1$, we obtain
$$\hat g'^{-p_M} - \hat g \approx -H_{\hat g}^{-1}\, \nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g(x_i), c_i),$$
where $H_{\hat g} = \nabla_g^2 \sum_{j \notin M} \sum_{i=1}^n L_C^j(\hat g(x_i), c_i)$. This yields an approximation of $\hat g'^{-p_M}$:
$$\hat g'^{-p_M} \approx \hat g - H_{\hat g}^{-1}\, \nabla_g \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g(x_i), c_i). \qquad (37)$$
Recalling the definition of the gradient $G_C^j(x_i, c_i; \hat g) = \nabla_g L_C^j(\hat g(x_i), c_i) = -\nabla_g\, \hat g_j(x_i) \log(c_i^j)$, the approximation becomes
$$\tilde g'^{-p_M} \triangleq \hat g - H_{\hat g}^{-1} \sum_{i=1}^n \sum_{j \notin M} G_C^j(x_i, c_i; \hat g).$$
Theorem E.3. For the retrained label predictor $\hat f^{-p_M}$ defined as
$$\hat f^{-p_M} = \arg\min_{f'} \sum_{i=1}^n L_Y\big(f'(\hat g^{-p_M}(x_i)), y_i\big),$$
we can consider its equivalent version $\hat f_{p_M=0}$:
$$\hat f_{p_M=0} = \arg\min_f \sum_{i=1}^n L_Y\big(f(\hat g'^{-p_M}(x_i)), y_i\big),$$
which can be edited as
$$\hat f_{p_M=0} \approx \tilde f_{p_M=0} \triangleq \hat f - H_{\hat f}^{-1} \sum_{l=1}^n G_Y(x_l; \tilde g'^{-p_M}, \hat f),$$
where $H_{\hat f} = \nabla_{\hat f} \sum_{i=1}^n G_Y(x_i; \hat g, \hat f)$ is the Hessian matrix. Deleting the $r$-th dimension of $\tilde f_{p_M=0}$ for each $r \in M$, we map it to $\tilde f^{-p_M}$, which approximates the final edited label predictor $\hat f^{-p_M}$ at the concept level.
Proof. Now, we come to the approximation of $\hat f^{-p_M}$, noting that the input dimension of $f$ decreases to $k - |M|$.
We consider setting $p_r = 0$ for all data points in the training phase of the label predictor, which yields another optimal model $\hat f_{p_M=0}$. From Lemma E.1, we know that for the same input $x$, $\hat f_{p_M=0}$ and $\hat f^{-p_M}$ produce the same output, and the values of the corresponding parameters in $\hat f_{p_M=0}$ and $\hat f^{-p_M}$ are equal. Now, let us consider how to edit the initial $\hat f$ into $\hat f_{p_M=0}$. Firstly, assume we use the updated concept predictor $\hat g'^{-p_M}$ for only one data point $(x_{i_r}, y_{i_r}, c_{i_r})$ and obtain the following $\hat f_{i_r}$:
$$\hat f_{i_r} = \arg\min_f \sum_{i=1}^n L_Y(f(\hat g(x_i)), y_i) + L_Y\big(f(\hat g'^{-p_M}(x_{i_r})), y_{i_r}\big) - L_Y\big(f(\hat g(x_{i_r})), y_{i_r}\big).$$
Then up-weight the $i_r$-th data point by some small $\epsilon$ to obtain the new parameters
$$\hat f_{\epsilon, i_r} = \arg\min_f \sum_{i=1}^n L_Y(f(\hat g(x_i)), y_i) + \epsilon\, L_Y\big(f(\hat g'^{-p_M}(x_{i_r})), y_{i_r}\big) - \epsilon\, L_Y\big(f(\hat g(x_{i_r})), y_{i_r}\big).$$
The minimization condition is then
$$\nabla_f \sum_{i=1}^n L_Y(\hat f_{\epsilon,i_r}(\hat g(x_i)), y_i) + \epsilon\, \nabla_f L_Y(\hat f_{\epsilon,i_r}(\hat g'^{-p_M}(x_{i_r})), y_{i_r}) - \epsilon\, \nabla_f L_Y(\hat f_{\epsilon,i_r}(\hat g(x_{i_r})), y_{i_r}) = 0.$$
Expanding the first term around $\hat f$ (noting $\hat f_{\epsilon,i_r} \to \hat f$ as $\epsilon \to 0$) gives
$$\nabla_f \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) + \epsilon\, \nabla_f L_Y(\hat f(\hat g'^{-p_M}(x_{i_r})), y_{i_r}) - \epsilon\, \nabla_f L_Y(\hat f(\hat g(x_{i_r})), y_{i_r}) + \nabla_f^2 \Big(\sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i)\Big)\big(\hat f_{\epsilon,i_r} - \hat f\big) \approx 0.$$
Note that $\nabla_f \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) = 0$. Thus we have
$$\hat f_{\epsilon,i_r} - \hat f \approx -\epsilon\, H_{\hat f}^{-1}\left[\nabla_f L_Y(\hat f(\hat g'^{-p_M}(x_{i_r})), y_{i_r}) - \nabla_f L_Y(\hat f(\hat g(x_{i_r})), y_{i_r})\right],$$
and we conclude that
$$\frac{d \hat f_{\epsilon,i_r}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat f}^{-1}\left[\nabla_{\hat f} L_Y(\hat f(\hat g'^{-p_M}(x_{i_r})), y_{i_r}) - \nabla_{\hat f} L_Y(\hat f(\hat g(x_{i_r})), y_{i_r})\right].$$
Performing a one-step Newton iteration at $\hat f$, we get the approximation of $\hat f_{i_r}$:
$$\hat f_{i_r} \approx \hat f + H_{\hat f}^{-1}\left[\nabla_{\hat f} L_Y(\hat f(\hat g(x_{i_r})), y_{i_r}) - \nabla_{\hat f} L_Y(\hat f(\hat g'^{-p_M}(x_{i_r})), y_{i_r})\right].$$
Reconsidering the definition of $\hat f_{i_r}$, we used the updated concept predictor $\hat g'^{-p_M}$ for only one data point $(x_{i_r}, y_{i_r}, c_{i_r})$. We now carry out this operation for all the other data points to estimate $\hat f_{p_M=0}$. Combining the minimization condition from the definition of $\hat f$, we have
$$\hat f_{p_M=0} \approx \hat f + H_{\hat f}^{-1}\left[\nabla_{\hat f} \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) - \nabla_{\hat f} \sum_{i=1}^n L_Y(\hat f(\hat g'^{-p_M}(x_i)), y_i)\right] = \hat f - H_{\hat f}^{-1} \sum_{l=1}^n \nabla_{\hat f} L_Y(\hat f(\hat g'^{-p_M}(x_l)), y_l). \qquad (38)$$
Theorem E.2 gives us the edited version $\tilde g'^{-p_M}$ of $\hat g'^{-p_M}$.
Substituting it into Equation (38), we get the final closed-form edited label predictor at the concept level:
$$\hat f_{p_M=0} \approx \tilde f_{p_M=0} \triangleq \hat f - H_{\hat f}^{-1}\, \nabla_{\hat f} \sum_{l=1}^n L_Y\big(\hat f(\tilde g'^{-p_M}(x_l)), y_l\big),$$
where $H_{\hat f} = \nabla_{\hat f}^2 \sum_{i=1}^n L_{Y_i}(\hat f, \hat g)$ is the Hessian matrix of the loss function with respect to $\hat f$. Recalling the definition of the gradient $G_Y(x_l; \tilde g'^{-p_M}, \hat f) = \nabla_{\hat f} L_Y\big(\hat f(\tilde g'^{-p_M}(x_l)), y_l\big)$, the approximation becomes
$$\hat f_{p_M=0} \approx \tilde f_{p_M=0} = \hat f - H_{\hat f}^{-1} \sum_{l=1}^n G_Y(x_l; \tilde g'^{-p_M}, \hat f).$$
E.2. Theoretical Bound for the Influence Function
Consider the dataset $D = \{(x_i, c_i, y_i)\}_{i=1}^n$. The loss function of the concept predictor $g$ is defined as
$$L_{\mathrm{Total}}(D; g) = \sum_{i=1}^n L_C(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = \sum_{i=1}^n \sum_{j=1}^k L_C^j(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = -\sum_{i=1}^n \sum_{j=1}^k g_j(x_i)\log(c_i^j) + \frac{\delta}{2}\|g\|^2.$$
Mathematically, we have a set of erroneous concepts to be removed, denoted by $p_r$ for $r \in M$. The retrained concept predictor then becomes
$$\hat g^{-p_M} = \arg\min_{g'} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g'(x_i), c_i) + \frac{\delta}{2}\|g'\|^2.$$
We map it to $\hat g'^{-p_M} \triangleq P(\hat g^{-p_M})$, which has the same number of parameters as $\hat g$ and the same predicted concepts $\hat g'^{-p_M}(j)$ as $\hat g^{-p_M}(j)$ for all $j \in [k] \setminus M$. We achieve this by inserting a zero row vector into the $r$-th row of the matrix in the final layer of $\hat g^{-p_M}$ for each $r \in M$; thus, the mapping $P$ is one-to-one. Moreover, if the parameter space of $\hat g$ is $T$, then the parameter space $T_0$ of $\hat g'^{-p_M}$ is a subset of $T$. Note that $\hat g'^{-p_M}$ is the optimal model of the following objective function:
$$\hat g'^{-p_M} = \arg\min_{g' \in T_0} \sum_{i=1}^n \sum_{j \notin M} L_C^j(g'(x_i), c_i) + \frac{\delta}{2}\|g'\|^2.$$
Then the loss function with the influence of the erroneous concepts removed becomes
$$L'(D; g) = \sum_{i=1}^n \sum_{j \notin M} L_C^j(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = L_{\mathrm{Total}}(D; g) - \sum_{i=1}^n \sum_{j \in M} L_C^j(g(x_i), c_i). \qquad (39)$$
Assume $\hat g = \arg\min_g L_{\mathrm{Total}}(D; g)$ is the original model parameter. $\hat g^{-p_M}(D)$ and $\hat g'^{-p_M}(D)$ are the minimizers of $L'(D; g)$ obtained by retraining in the two different parameter spaces; $\hat g'^{-p_M}(D)$ shares the same dimensionality as the original model.
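The mapping $P$ and its inverse are pure bookkeeping on the final-layer weight matrix; a minimal numpy sketch (the function name and shapes are illustrative assumptions):

```python
import numpy as np

# Sketch of the zero-row embedding used for the mapping P: the predictor
# retrained over k-|M| concepts is embedded back into the original
# k-dimensional space by inserting all-zero rows at the removed indices,
# so the edited and retrained predictors can be compared parameter-wise.
def insert_zero_rows(W_small, removed, k):
    """W_small: (k-|M|, d) final-layer weights; returns (k, d) with zero rows."""
    d = W_small.shape[1]
    W = np.zeros((k, d))
    keep = [j for j in range(k) if j not in set(removed)]
    W[keep] = W_small
    return W

W_small = np.arange(6, dtype=float).reshape(3, 2)   # 3 kept concepts
W_full = insert_zero_rows(W_small, removed=[1], k=4)
assert W_full.shape == (4, 2)
assert np.allclose(W_full[1], 0.0)                  # inserted zero row
assert np.allclose(np.delete(W_full, 1, axis=0), W_small)  # P^{-1} recovers it
```

Deleting the zero rows again (`np.delete`) plays the role of $P^{-1}$, so the mapping is one-to-one exactly as stated above.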
Because $\hat g^{-p_M}(D)$ and $\hat g'^{-p_M}(D)$ produce identical outputs given identical inputs, to simplify the proof we use $\hat g'^{-p_M}(D)$ as the retrained model. Denote by $\tilde g^{-p_M}(D)$ the updated model with the influence of the erroneous concepts removed, obtained by the influence-function method of Theorem E.2; it is an estimate of $\hat g'^{-p_M}(D)$:
$$\tilde g^{-p_M}(D) \triangleq \hat g + H_{\hat g}^{-1} \sum_{i=1}^n \sum_{j \in M} G_C^j(x_i, c_i; \hat g).$$
In this part, we study the error between the estimated influence given by Theorem E.2 and $\hat g'^{-p_M}(D)$. We use the parameter change as the evaluation metric:
$$\|(\tilde g^{-p_M} - \hat g) - (\hat g'^{-p_M} - \hat g)\| = \|\tilde g^{-p_M} - \hat g'^{-p_M}\|. \qquad (40)$$
Assumption E.4. The loss $L_C(D; g; j) \triangleq \sum_{i=1}^n L_C^j(g(x_i), c_i)$ is convex and twice-differentiable in $g$, with positive regularization $\delta > 0$. There exists $C_H \in \mathbb{R}$ such that $\|\nabla_g^2 L_C(D; g_1; j) - \nabla_g^2 L_C(D; g_2; j)\|_2 \le C_H \|g_1 - g_2\|_2$ for all $j \in [k]$ and $g_1, g_2 \in \Gamma$.
Definition E.5.
$$C_L' = \max_j \|\nabla_g L_C(D; \hat g; j)\|_2, \quad \sigma'_{\min} = \text{smallest singular value of } \nabla_g^2 L'(D; \hat g), \quad \sigma_{\min} = \text{smallest singular value of } \nabla_g^2 L_{\mathrm{Total}}(D; \hat g),$$
$$L^{\triangle}(D, M; g) = \sum_{j \in M} L_C(D; g; j). \qquad (41)$$
Corollary E.6.
$$L'(D; g) = L_{\mathrm{Total}}(D; g) - L^{\triangle}(D, M; g), \qquad (42)$$
$$\|\nabla_g^2 L'(D; g_1) - \nabla_g^2 L'(D; g_2)\|_2 \le (k + |M|)\, C_H \|g_1 - g_2\|_2.$$
Define $C_H' \triangleq (k + |M|)\, C_H$. Based on the above corollaries and assumptions, we derive the following theorem.
Theorem E.7. The error between the actual influence and our predicted influence satisfies
$$\|\hat g'^{-p_M}(D) - \tilde g^{-p_M}(D)\|_2 \le \frac{C_H'\,(|M|\, C_L')^2}{(\sigma'_{\min} + \delta)^3} + \frac{|M|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}.$$
Proof. We will use the one-step Newton approximation as an intermediate step. Define $\Delta g_{Nt}(D)$ as
$$\Delta g_{Nt}(D) \triangleq H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, M; \hat g),$$
where $H_{\delta} = \delta I + \nabla_g^2 L'(D; \hat g)$ is the regularized empirical Hessian at $\hat g$, reweighted after removing the influence of the erroneous concepts. The one-step Newton approximation of $\hat g'^{-p_M}(D)$ is then defined as $g_{Nt}(D) \triangleq \Delta g_{Nt}(D) + \hat g$.
In the following, we separate the error between $\tilde g^{-p_M}(D)$ and $\hat g'^{-p_M}(D)$ into two parts:
$$\hat g'^{-p_M}(D) - \tilde g^{-p_M}(D) = \underbrace{\hat g'^{-p_M}(D) - g_{Nt}(D)}_{\mathrm{Err}_{Nt,\mathrm{act}}(D)} + \underbrace{(g_{Nt}(D) - \hat g) - (\tilde g^{-p_M}(D) - \hat g)}_{\mathrm{Err}_{Nt,\mathrm{if}}(D)}.$$
Firstly, in Step 1, we derive the bound for the Newton-actual error $\mathrm{Err}_{Nt,\mathrm{act}}(D)$. Since $L'(g)$ is strongly convex with parameter $\sigma'_{\min} + \delta$ and minimized by $\hat g'^{-p_M}(D)$, we can bound the distance $\|\hat g'^{-p_M}(D) - g_{Nt}(D)\|_2$ in terms of the norm of the gradient at $g_{Nt}$:
$$\|\hat g'^{-p_M}(D) - g_{Nt}(D)\|_2 \le \frac{2}{\sigma'_{\min} + \delta}\, \|\nabla_g L'(g_{Nt}(D))\|_2. \qquad (43)$$
Therefore, the problem reduces to bounding $\|\nabla_g L'(g_{Nt}(D))\|_2$. Note that $\nabla_g L'(\hat g) = -\nabla_g L^{\triangle}(\hat g)$, because $\hat g$ minimizes $L' + L^{\triangle}$, that is, $\nabla_g L'(\hat g) + \nabla_g L^{\triangle}(\hat g) = 0$. Recall that $\Delta g_{Nt} = H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, M; \hat g) = -H_{\delta}^{-1}\, \nabla_g L'(D; \hat g)$. Given the above conditions, we can bound $\mathrm{Err}_{Nt,\mathrm{act}}(D)$:
$$\begin{aligned} \|\nabla_g L'(g_{Nt}(D))\|_2 &= \|\nabla_g L'(\hat g + \Delta g_{Nt}(D)) - \nabla_g L'(\hat g) - \nabla_g^2 L'(\hat g)\,\Delta g_{Nt}(D)\|_2 \\ &\le \int_0^1 \|\nabla_g^2 L'(\hat g + t\,\Delta g_{Nt}(D)) - \nabla_g^2 L'(\hat g)\|_2\, \|\Delta g_{Nt}(D)\|_2\, dt \le \frac{C_H'}{2}\, \|\Delta g_{Nt}(D)\|_2^2 \\ &= \frac{C_H'}{2}\, \|H_{\delta}^{-1}\, \nabla_g L^{\triangle}(\hat g)\|_2^2 \le \frac{C_H'}{2(\sigma'_{\min} + \delta)^2}\, \|\nabla_g L^{\triangle}(\hat g)\|_2^2 \le \frac{C_H'\,(|M|\, C_L')^2}{2(\sigma'_{\min} + \delta)^2}. \qquad (44) \end{aligned}$$
Now we come to Step 2 to bound $\mathrm{Err}_{Nt,\mathrm{if}}(D)$, where we bound the difference in parameter change between the Newton step and our ECBM method.
$$(g_{Nt}(D) - \hat g) - (\tilde g^{-p_M}(D) - \hat g) = \left[\big(\delta I + \nabla_g^2 L'(\hat g)\big)^{-1} - \big(\delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g)\big)^{-1}\right]\nabla_g L^{\triangle}(D, M; \hat g).$$
For simplification, we substitute the matrices $A = \delta I + \nabla_g^2 L'(\hat g)$ and $B = \delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g)$, which are positive definite with $(\delta + \sigma'_{\min}) I \preceq A \preceq (\delta + \sigma'_{\max}) I$ and $(\delta + \sigma_{\min}) I \preceq B \preceq (\delta + \sigma_{\max}) I$. Therefore,
$$\begin{aligned} \|(g_{Nt}(D) - \hat g) - (\tilde g^{-p_M}(D) - \hat g)\|_2 &= \|(A^{-1} - B^{-1})\,\nabla_g L^{\triangle}(D, M; \hat g)\|_2 \le \big(\|A^{-1}\|_2 + \|B^{-1}\|_2\big)\,\|\nabla_g L^{\triangle}(D, M; \hat g)\|_2 \\ &\le \frac{2\delta + \sigma_{\min} + \sigma'_{\min}}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}\,\|\nabla_g L^{\triangle}(D, M; \hat g)\|_2 \le \frac{|M|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}. \qquad (45) \end{aligned}$$
By combining the conclusions of Step 1 and Step 2 in Equations (43), (44), and (45), we obtain the error between the actual influence and our predicted influence:
$$\|\hat g'^{-p_M}(D) - \tilde g^{-p_M}(D)\|_2 \le \frac{C_H'\,(|M|\, C_L')^2}{(\sigma'_{\min} + \delta)^3} + \frac{|M|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}.$$
Remark E.8. Theorem E.7 reveals one key finding about influence-function estimation: the estimation error scales inversely with the regularization parameter $\delta$ ($O(1/\delta)$), indicating that increased regularization improves approximation accuracy. Besides, the error bound grows with the number of removed concepts $|M|$ (linearly in its dominant second term), which implies that the estimation error increases as more erroneous concepts are removed.
F. Data-level Influence
F.1. Proof of Data-level Influence Function
We address the situation where, for the dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, a set of data points $z_r = (x_r, y_r, c_r)$, $r \in G$, is to be removed. Our goal is to estimate $\hat g^{-z_G}$ and $\hat f^{-z_G}$, the concept and label predictors trained on the dataset with $z_r$, $r \in G$, removed.
Proof Sketch. (i) First, we estimate the retrained concept predictor $\hat g^{-z_G}$. (ii) Then, we define a new label predictor $\bar f^{-z_G}$ and estimate $\bar f^{-z_G} - \hat f$. (iii) Next, to reduce the computational complexity, we use a lemma-based method to approximate the Hessian matrix associated with $\bar f^{-z_G}$. (iv) Next, we compute the difference $\hat f^{-z_G} - \bar f^{-z_G}$, point by point, as
$$-H_{\bar f^{-z_G}}^{-1}\left[\nabla_{\hat f} L_Y\big(\bar f^{-z_G}(\hat g^{-z_G}(x_{i_r})), y_{i_r}\big) - \nabla_{\hat f} L_Y\big(\bar f^{-z_G}(\hat g(x_{i_r})), y_{i_r}\big)\right].$$
(v) Finally, we divide $\hat f^{-z_G} - \hat f$, the quantity we are actually concerned with, into $(\hat f^{-z_G} - \bar f^{-z_G}) + (\bar f^{-z_G} - \hat f)$.
Theorem F.1. For the dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, given a set of data points $z_r = (x_r, y_r, c_r)$, $r \in G$, to be removed, suppose the updated concept predictor $\hat g^{-z_G}$ is defined by
$$\hat g^{-z_G} = \arg\min_g \sum_{i \in [n]\setminus G} L_C(g(x_i), c_i), \quad \text{where } L_C(g(x_i), c_i) \triangleq \sum_{j=1}^k L_C^j(g(x_i), c_i).$$
Then we have the following approximation of $\hat g^{-z_G}$:
$$\hat g^{-z_G} \approx \tilde g^{-z_G} \triangleq \hat g + H_{\hat g}^{-1} \sum_{r \in G} \sum_{j=1}^k G_C^j(x_r, c_r; \hat g), \qquad (46)$$
where $H_{\hat g} = \nabla_g \sum_{i,j} G_C^j(x_i, c_i; \hat g)$ is the Hessian matrix of the loss function with respect to $\hat g$.
Proof. Firstly, we rewrite $\hat g^{-z_G}$ as
$$\hat g^{-z_G} = \arg\min_g \sum_{i=1}^n L_C(g(x_i), c_i) - \sum_{r \in G} L_C(g(x_r), c_r). \qquad (47)$$
Then we down-weight the removed data by some $\epsilon$ and obtain a new predictor $\hat g_{-z_G,\epsilon}$, abbreviated as $\hat g_{\epsilon}$:
$$\hat g_{\epsilon} \triangleq \arg\min_g \sum_{i=1}^n L_C(g(x_i), c_i) - \epsilon \sum_{r \in G} L_C(g(x_r), c_r).$$
Because $\hat g_{\epsilon}$ minimizes the right-hand side of Equation (47) when $\epsilon = 1$, we have the optimality condition
$$\nabla_{\hat g_{\epsilon}} \sum_{i=1}^n L_C(\hat g_{\epsilon}(x_i), c_i) - \epsilon\, \nabla_{\hat g_{\epsilon}} \sum_{r \in G} L_C(\hat g_{\epsilon}(x_r), c_r) = 0.$$
When $\epsilon \to 0$, $\hat g_{\epsilon} \to \hat g$, so we can perform a first-order Taylor expansion around $\hat g$:
$$\nabla_g \sum_{i=1}^n L_C(\hat g(x_i), c_i) - \epsilon\, \nabla_g \sum_{r \in G} L_C(\hat g(x_r), c_r) + \nabla_g^2 \sum_{i=1}^n L_C(\hat g(x_i), c_i)\,(\hat g_{\epsilon} - \hat g) \approx 0. \qquad (48)$$
Recalling the definition $\hat g = \arg\min_g \sum_{i=1}^n L_C(g(x_i), c_i)$, the first term of Equation (48) equals $0$. Letting $\epsilon \to 0$, we have
$$\frac{d\hat g_{\epsilon}}{d\epsilon}\Big|_{\epsilon=0} = H_{\hat g}^{-1} \sum_{r \in G} \nabla_g L_C(\hat g(x_r), c_r),$$
where $H_{\hat g} = \nabla_g^2 \sum_{i=1}^n L_C(\hat g(x_i), c_i)$. Remember that $\hat g_{\epsilon} \to \hat g^{-z_G}$ as $\epsilon \to 1$. Performing a Newton step at $\hat g$, we obtain the method to edit the original concept predictor at the data level:
$$\hat g^{-z_G} \approx \tilde g^{-z_G} \triangleq \hat g + H_{\hat g}^{-1} \sum_{r \in G} \nabla_g L_C(\hat g(x_r), c_r).$$
Recalling the definition of $G_C^j(x_i, c_i; \hat g)$, we can edit the initial $\hat g$ to
$$\hat g^{-z_G} \approx \tilde g^{-z_G} = \hat g + H_{\hat g}^{-1} \sum_{r \in G} \sum_{j=1}^k G_C^j(x_r, c_r; \hat g), \qquad (49)$$
where $H_{\hat g} = \nabla_g \sum_{i,j} G_C^j(x_i, c_i; \hat g)$ is the Hessian matrix of the loss function with respect to $\hat g$.
Theorem F.2. For the dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, given a set of data points $z_r = (x_r, y_r, c_r)$, $r \in G$, to be removed.
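The group-removal update of Eq. (46) can again be illustrated on a toy quadratic model. A minimal numpy sketch under assumed choices (ridge loss, random data, made-up index set `G`); with the leave-out Hessian, the one-step Newton edit coincides with retraining on the reduced dataset:

```python
import numpy as np

# Toy sketch (assumed setup): remove a whole group G of training points from
# a ridge-regression "concept predictor" with one damped Newton step.
# For this quadratic loss the group edit matches retraining exactly.
rng = np.random.default_rng(2)
n, d, delta = 40, 3, 1e-1
X, c = rng.normal(size=(n, d)), rng.normal(size=n)
H = X.T @ X + delta * np.eye(d)
g_hat = np.linalg.solve(H, X.T @ c)

G = [3, 11, 25]                                   # indices to remove
XG, cG = X[G], c[G]
grad_G = XG.T @ (XG @ g_hat - cG)                 # gradient of removed terms
H_G = H - XG.T @ XG                               # Hessian after removal
g_edit = g_hat + np.linalg.solve(H_G, grad_G)

X_keep, c_keep = np.delete(X, G, 0), np.delete(c, G)
g_retrain = np.linalg.solve(X_keep.T @ X_keep + delta * np.eye(d),
                            X_keep.T @ c_keep)
assert np.allclose(g_edit, g_retrain)
```

Replacing `H_G` by the full-data Hessian `H` gives the cheaper influence-function variant, which is then only approximate.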
The label predictor retrained on the revised dataset becomes
$$\hat f^{-z_G} = \arg\min_f \sum_{i \in [n]\setminus G} L_{Y_i}(f, \hat g^{-z_G}). \qquad (50)$$
The intermediate label predictor $\bar f^{-z_G}$ is defined by
$$\bar f^{-z_G} = \arg\min_f \sum_{i \in [n]\setminus G} L_{Y_i}(f, \hat g) = \arg\min_f \sum_{i \in [n]\setminus G} L_Y(f(\hat g(x_i)), y_i). \qquad (51)$$
Then $\bar f^{-z_G} - \hat f$ can be approximated by a term $A_G$, and we denote the edited version of $\bar f^{-z_G}$ as $\bar f \triangleq \hat f + A_G$. Define $B_G$ as
$$B_G \triangleq -H_{\bar f}^{-1} \sum_{i \in [n]\setminus G} \left[G_Y(x_i; \tilde g^{-z_G}, \bar f) - G_Y(x_i; \hat g, \bar f)\right],$$
where $H_{\bar f} = \nabla_f \sum_{i \in [n]\setminus G} G_Y(x_i; \hat g, \bar f)$ is the Hessian matrix with respect to $\bar f$. Then $\hat f^{-z_G}$ can be estimated by $\bar f + B_G$. Combining the above two-stage approximation, the final edited label predictor $\tilde f^{-z_G}$ is obtained as
$$\tilde f^{-z_G} = \bar f + B_G = \hat f + A_G + B_G. \qquad (52)$$
Proof. There is a large gap between $\hat f^{-z_G}$ and $\hat f$. Thus, we first define $\bar f^{-z_G}$ as
$$\bar f^{-z_G} = \arg\min_f \sum_{i=1}^n L_Y(f(\hat g(x_i)), y_i) - \sum_{r \in G} L_Y(f(\hat g(x_r)), y_r).$$
Then, to estimate $\bar f^{-z_G}$, we define $\bar f_{\epsilon,-z_G}$ as
$$\bar f_{\epsilon,-z_G} = \arg\min_f \sum_{i=1}^n L_Y(f(\hat g(x_i)), y_i) - \epsilon \sum_{r \in G} L_Y(f(\hat g(x_r)), y_r).$$
From the minimization condition, we have
$$\nabla_f \sum_{i=1}^n L_Y(\bar f_{\epsilon,-z_G}(\hat g(x_i)), y_i) - \epsilon\, \nabla_f \sum_{r \in G} L_Y(\bar f_{\epsilon,-z_G}(\hat g(x_r)), y_r) = 0.$$
Performing a first-order Taylor expansion at $\hat f$,
$$\nabla_f \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) - \epsilon\, \nabla_{\hat f} \sum_{r \in G} L_Y(\hat f(\hat g(x_r)), y_r) + \nabla_f^2 \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i)\,\big(\bar f_{\epsilon,-z_G} - \hat f\big) \approx 0.$$
Then $\bar f^{-z_G}$ can be approximated by
$$\bar f^{-z_G} \approx \hat f + H_{\hat f}^{-1} \sum_{r \in G} \nabla_{\hat f} L_Y(\hat f(\hat g(x_r)), y_r) \triangleq \hat f + A_G. \qquad (53)$$
The edited version of $\bar f^{-z_G}$ is thus defined as
$$\bar f = \hat f + A_G. \qquad (54)$$
Next, we estimate the difference between $\hat f^{-z_G}$ and $\bar f^{-z_G}$. Rewrite $\bar f^{-z_G}$ as
$$\bar f^{-z_G} = \arg\min_f \sum_{i \in S} L_Y(f(\hat g(x_i)), y_i), \qquad (55)$$
where $S \triangleq [n]\setminus G$. Comparing Equation (50) with (55), we still need to define an intermediary predictor $\hat f^{-z_G, i_r}$ as
$$\hat f^{-z_G, i_r} = \arg\min_f \sum_{i \in S} L_{Y_i}(f, \hat g) + L_{Y_{i_r}}(f, \hat g^{-z_G}) - L_{Y_{i_r}}(f, \hat g).$$
Up-weighting the $i_r$-th data point by some $\epsilon$, we define
$$\hat f_{\epsilon,-z_G,i_r} = \arg\min_f \sum_{i \in S} L_{Y_i}(f, \hat g) + \epsilon\, L_{Y_{i_r}}(f, \hat g^{-z_G}) - \epsilon\, L_{Y_{i_r}}(f, \hat g),$$
which we denote by $\hat f'_{\epsilon}$ in the remainder of the proof.
Then, from the minimization condition, we have
$$\nabla_f \sum_{i \in S} L_{Y_i}(\hat f'_{\epsilon}, \hat g) + \epsilon\, \nabla_{\hat f} L_{Y_{i_r}}(\hat f'_{\epsilon}, \hat g^{-z_G}) - \epsilon\, \nabla_{\hat f} L_{Y_{i_r}}(\hat f'_{\epsilon}, \hat g) = 0. \qquad (56)$$
When $\epsilon \to 0$, $\hat f'_{\epsilon} \to \bar f^{-z_G}$. We then perform a Taylor expansion of Equation (56) at $\bar f^{-z_G}$:
$$\nabla_f \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g) + \epsilon\, \nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g^{-z_G}) - \epsilon\, \nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g) + \nabla_{\hat f}^2 \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g)\,\big(\hat f'_{\epsilon} - \bar f^{-z_G}\big) \approx 0.$$
Organizing the above equation gives
$$\hat f'_{\epsilon} - \bar f^{-z_G} \approx -\epsilon\, H_{\bar f^{-z_G}}^{-1} \left[\nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g^{-z_G}) - \nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g)\right],$$
where $H_{\bar f^{-z_G}} = \nabla_{\hat f}^2 \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g)$. When $\epsilon = 1$, $\hat f'_{\epsilon} = \hat f^{-z_G, i_r}$. Performing a Newton iteration with step size $1$ at $\bar f^{-z_G}$,
$$\hat f^{-z_G, i_r} \approx \bar f^{-z_G} - H_{\bar f^{-z_G}}^{-1} \left[\nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g^{-z_G}) - \nabla_{\hat f} L_{Y_{i_r}}(\bar f^{-z_G}, \hat g)\right].$$
Iterating $i_r$ through the set $S$, we have
$$\hat f^{-z_G} \approx \bar f^{-z_G} - H_{\bar f^{-z_G}}^{-1} \left[\nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g^{-z_G}) - \nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g)\right]. \qquad (57)$$
The edited version of $\hat g^{-z_G}$ was deduced as $\tilde g^{-z_G}$ in Theorem F.1; substituting this approximation into (57), we have
$$\hat f^{-z_G} \approx \bar f^{-z_G} - H_{\bar f^{-z_G}}^{-1} \left[\nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \tilde g^{-z_G}) - \nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f^{-z_G}, \hat g)\right]. \qquad (58)$$
Note that we cannot obtain $\bar f^{-z_G}$ and $H_{\bar f^{-z_G}}$ directly, because we do not retrain the label predictor but edit it to $\bar f$ as a substitute. Therefore, we approximate $\bar f^{-z_G}$ by $\bar f$ and $H_{\bar f^{-z_G}}$ by $H_{\bar f}$, defined as
$$H_{\bar f} = \nabla_{\hat f}^2 \sum_{i \in S} L_{Y_i}(\bar f, \hat g).$$
We then define $B_G$ as
$$B_G \triangleq -H_{\bar f}^{-1} \left[\nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f, \tilde g^{-z_G}) - \nabla_{\hat f} \sum_{i \in S} L_{Y_i}(\bar f, \hat g)\right]. \qquad (59)$$
Combining (54) and (59), we deduce the final closed-form edited label predictor
$$\tilde f^{-z_G} = \bar f + B_G = \hat f + A_G + B_G.$$
Replacing the gradients of $L_C$ and $L_Y$ by their definitions, we obtain the final version of the approximation.
F.2. Theoretical Bound for the Influence Function
Consider the dataset $D = \{(x_i, c_i, y_i)\}_{i=1}^n$. The loss function of the concept predictor $g$ is defined as
$$L_{\mathrm{Total}}(D; g) = \sum_{i=1}^n L_C(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = \sum_{i=1}^n \sum_{j=1}^k L_C^j(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = -\sum_{i=1}^n \sum_{j=1}^k g_j(x_i)\log(c_i^j) + \frac{\delta}{2}\|g\|^2.$$
Mathematically, we have a set of erroneous data points $z_r = (x_r, y_r, c_r)$, $r \in G$, to be removed.
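The two-stage label-predictor edit $\tilde f^{-z_G} = \hat f + A_G + B_G$ of Eq. (52) can be sketched on a toy ridge model. Everything below is an assumed illustrative setup (random concept features, a small perturbation standing in for the concept predictor's own update), not the paper's code: stage $A_G$ removes the group with the original concept features, and stage $B_G$ swaps in the edited concepts on the remaining data.

```python
import numpy as np

# Toy two-stage edit f_edit = f_hat + A + B for data-level removal.
# C plays the role of hat-g(x_i); C_new of the edited concept predictor's
# outputs (a small assumed perturbation); all shapes/constants illustrative.
rng = np.random.default_rng(3)
n, k, delta, G = 60, 4, 1.0, [2, 9, 17]
C = rng.normal(size=(n, k))
C_new = C + 1e-3 * rng.normal(size=(n, k))
y = rng.normal(size=n)

def ridge(F, t):
    return np.linalg.solve(F.T @ F + delta * np.eye(k), F.T @ t)

f_hat = ridge(C, y)
H = C.T @ C + delta * np.eye(k)

# Stage A (Eq. 53-style): undo the removed group's pull, full-data Hessian
A = np.linalg.solve(H, C[G].T @ (C[G] @ f_hat - y[G]))
f_bar = f_hat + A

keep = [i for i in range(n) if i not in G]
H_bar = C[keep].T @ C[keep] + delta * np.eye(k)
# Stage B (Eq. 59-style): account for old -> edited concept features
grad_new = C_new[keep].T @ (C_new[keep] @ f_bar - y[keep])
grad_old = C[keep].T @ (C[keep] @ f_bar - y[keep])
B = -np.linalg.solve(H_bar, grad_new - grad_old)
f_edit = f_bar + B

f_retrain = ridge(C_new[keep], y[keep])
# the edit lands much closer to retraining than the original predictor
assert np.linalg.norm(f_edit - f_retrain) < np.linalg.norm(f_hat - f_retrain)
```

Unlike the concept-predictor sketch, this edit is only approximate even for a quadratic loss, since stage A reuses the full-data Hessian; that residual is exactly what the two-stage construction trades for speed.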
The retrained concept predictor then becomes
$$\hat g^{-z_G} = \arg\min_g \sum_{i \in [n]\setminus G} \sum_{j=1}^k L_C^j(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2.$$
Define the new dataset as $D' = \{(x_i, c_i, y_i)\}_{i \in [n]\setminus G}$. The loss function with the influence of the erroneous data removed then becomes
$$L'(D'; g) = \sum_{i \in [n]\setminus G} \sum_{j=1}^k L_C^j(g(x_i), c_i) + \frac{\delta}{2}\|g\|^2 = L_{\mathrm{Total}}(D; g) - \sum_{i \in G} \sum_{j=1}^k L_C^j(g(x_i), c_i). \qquad (60)$$
Assume $\hat g = \arg\min_g L_{\mathrm{Total}}(D; g)$ is the original model parameter and $\hat g^{-z_G}$ is the minimizer of $L'(D'; g)$. Denote by $\tilde g^{-z_G}$ the updated model with the influence of the erroneous data removed, obtained by the influence-function method of Theorem F.1; it is an estimate of $\hat g^{-z_G}$:
$$\tilde g^{-z_G} = \hat g + H_{\hat g}^{-1} \sum_{r \in G} \sum_{j=1}^k G_C^j(x_r, c_r; \hat g). \qquad (61)$$
In this part, we study the error between the estimated influence given by Theorem F.1 and $\hat g^{-z_G}$. We use the parameter change as the evaluation metric:
$$\|(\tilde g^{-z_G} - \hat g) - (\hat g^{-z_G} - \hat g)\| = \|\tilde g^{-z_G} - \hat g^{-z_G}\|. \qquad (62)$$
Assumption F.3. The loss $L_C(x, c; g) \triangleq \sum_{j=1}^k L_C^j(g(x), c)$ is convex and twice-differentiable in $g$, with positive regularization $\delta > 0$. There exists $C_H \in \mathbb{R}$ such that $\|\nabla_g^2 L_C(x, c; g_1) - \nabla_g^2 L_C(x, c; g_2)\|_2 \le C_H \|g_1 - g_2\|_2$ for all $(x, c) \in D$ and $g_1, g_2 \in \Gamma$.
Definition F.4.
$$C_L' = \max_{(x,c) \in D} \|\nabla_g L_C(x, c; \hat g)\|_2, \quad \sigma'_{\min} = \text{smallest singular value of } \nabla_g^2 L'(D'; \hat g), \quad \sigma_{\min} = \text{smallest singular value of } \nabla_g^2 L_{\mathrm{Total}}(D; \hat g),$$
$$L^{\triangle}(D, G; g) = \sum_{i \in G} L_C(x_i, c_i; g). \qquad (63)$$
Corollary F.5.
$$L'(D'; g) = L_{\mathrm{Total}}(D; g) - L^{\triangle}(D, G; g), \qquad (64)$$
$$\|\nabla_g^2 L'(D'; g_1) - \nabla_g^2 L'(D'; g_2)\|_2 \le (n + |G|)\, C_H \|g_1 - g_2\|_2.$$
Define $C_H' \triangleq (n + |G|)\, C_H$. Based on the above corollaries and assumptions, we derive the following theorem.
Theorem F.6. The error between the actual influence and our predicted influence satisfies
$$\|\hat g^{-z_G}(D') - \tilde g^{-z_G}(D')\|_2 \le \frac{C_H'\,(|G|\, C_L')^2}{(\sigma'_{\min} + \delta)^3} + \frac{|G|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}.$$
Proof. We will use the one-step Newton approximation as an intermediate step.
Define $\Delta g_{Nt}(D')$ as
$$\Delta g_{Nt}(D') \triangleq H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, G; \hat g),$$
where $H_{\delta} = \delta I + \nabla_g^2 L'(D'; \hat g)$ is the regularized empirical Hessian at $\hat g$, reweighted after removing the influence of the erroneous data. The one-step Newton approximation of $\hat g^{-z_G}(D')$ is then defined as $g_{Nt}(D') \triangleq \Delta g_{Nt}(D') + \hat g$. In the following, we separate the error between $\tilde g^{-z_G}(D')$ and $\hat g^{-z_G}(D')$ into two parts:
$$\hat g^{-z_G}(D') - \tilde g^{-z_G}(D') = \underbrace{\hat g^{-z_G}(D') - g_{Nt}(D')}_{\mathrm{Err}_{Nt,\mathrm{act}}(D')} + \underbrace{(g_{Nt}(D') - \hat g) - (\tilde g^{-z_G}(D') - \hat g)}_{\mathrm{Err}_{Nt,\mathrm{if}}(D')}.$$
Firstly, in Step 1, we derive the bound for the Newton-actual error $\mathrm{Err}_{Nt,\mathrm{act}}(D')$. Since $L'(g)$ is strongly convex with parameter $\sigma'_{\min} + \delta$ and minimized by $\hat g^{-z_G}(D')$, we can bound the distance $\|\hat g^{-z_G}(D') - g_{Nt}(D')\|_2$ in terms of the norm of the gradient at $g_{Nt}$:
$$\|\hat g^{-z_G}(D') - g_{Nt}(D')\|_2 \le \frac{2}{\sigma'_{\min} + \delta}\, \|\nabla_g L'(g_{Nt}(D'))\|_2. \qquad (65)$$
Therefore, the problem reduces to bounding $\|\nabla_g L'(g_{Nt}(D'))\|_2$. Note that $\nabla_g L'(\hat g) = -\nabla_g L^{\triangle}(D, G; \hat g)$, because $\hat g$ minimizes $L' + L^{\triangle}$, that is, $\nabla_g L'(\hat g) + \nabla_g L^{\triangle}(D, G; \hat g) = 0$. Recall that $\Delta g_{Nt} = H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, G; \hat g) = -H_{\delta}^{-1}\, \nabla_g L'(D'; \hat g)$. Given the above conditions, we can bound $\mathrm{Err}_{Nt,\mathrm{act}}(D')$:
$$\begin{aligned} \|\nabla_g L'(g_{Nt}(D'))\|_2 &= \|\nabla_g L'(\hat g + \Delta g_{Nt}(D')) - \nabla_g L'(\hat g) - \nabla_g^2 L'(\hat g)\,\Delta g_{Nt}(D')\|_2 \\ &\le \int_0^1 \|\nabla_g^2 L'(\hat g + t\,\Delta g_{Nt}(D')) - \nabla_g^2 L'(\hat g)\|_2\, \|\Delta g_{Nt}(D')\|_2\, dt \le \frac{C_H'}{2}\, \|\Delta g_{Nt}(D')\|_2^2 \\ &= \frac{C_H'}{2}\, \|H_{\delta}^{-1}\, \nabla_g L^{\triangle}(D, G; \hat g)\|_2^2 \le \frac{C_H'}{2(\sigma'_{\min} + \delta)^2}\, \|\nabla_g L^{\triangle}(D, G; \hat g)\|_2^2 \le \frac{C_H'\,(|G|\, C_L')^2}{2(\sigma'_{\min} + \delta)^2}. \qquad (66) \end{aligned}$$
Now we come to Step 2 to bound $\mathrm{Err}_{Nt,\mathrm{if}}(D')$, where we bound the difference in parameter change between the Newton step and our ECBM method.
$$(g_{Nt}(D') - \hat g) - (\tilde g^{-z_G}(D') - \hat g) = \left[\big(\delta I + \nabla_g^2 L'(\hat g)\big)^{-1} - \big(\delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g)\big)^{-1}\right]\nabla_g L^{\triangle}(D, G; \hat g).$$
For simplification, we substitute the matrices $A = \delta I + \nabla_g^2 L'(\hat g)$ and $B = \delta I + \nabla_g^2 L_{\mathrm{Total}}(\hat g)$, which are positive definite with $(\delta + \sigma'_{\min}) I \preceq A \preceq (\delta + \sigma'_{\max}) I$ and $(\delta + \sigma_{\min}) I \preceq B \preceq (\delta + \sigma_{\max}) I$. Therefore,
$$\begin{aligned} \|(g_{Nt}(D') - \hat g) - (\tilde g^{-z_G}(D') - \hat g)\|_2 &= \|(A^{-1} - B^{-1})\,\nabla_g L^{\triangle}(D, G; \hat g)\|_2 \le \big(\|A^{-1}\|_2 + \|B^{-1}\|_2\big)\,\|\nabla_g L^{\triangle}(D, G; \hat g)\|_2 \\ &\le \frac{2\delta + \sigma_{\min} + \sigma'_{\min}}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}\,\|\nabla_g L^{\triangle}(D, G; \hat g)\|_2 \le \frac{|G|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}. \qquad (67) \end{aligned}$$
By combining the conclusions of Step 1 and Step 2 in Equations (65), (66), and (67), we obtain the error between the actual influence and our predicted influence:
$$\|\hat g^{-z_G}(D') - \tilde g^{-z_G}(D')\|_2 \le \frac{C_H'\,(|G|\, C_L')^2}{(\sigma'_{\min} + \delta)^3} + \frac{|G|\, C_L'\,(2\delta + \sigma_{\min} + \sigma'_{\min})}{(\delta + \sigma'_{\min})(\delta + \sigma_{\min})}.$$
Remark F.7. The error bound grows with the number of removed data points $|G|$, which implies that the estimation error increases as more erroneous data are removed.
G. Algorithm
Algorithm 1 Concept-label-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original concept predictor $\hat g$, label predictor $\hat f$, a set of erroneous data $D_e$ and its associated index set $S_e$.
2: For each index $(w, r) \in S_e$, correct $c_w^r$ to the right label $\tilde c_w^r$ for the $w$-th data point $(x_w, y_w, c_w)$.
3: Compute the Hessian matrix of the loss function with respect to $\hat g$: $H_{\hat g} = \nabla_{\hat g}^2 \sum_{i,j} L_C^j(\hat g_j(x_i), c_i^j)$.
4: Update the concept predictor: $\tilde g = \hat g - H_{\hat g}^{-1} \sum_{(w,r)\in S_e} \left[\nabla_{\hat g} L_C^r(\hat g_r(x_w), \tilde c_w^r) - \nabla_{\hat g} L_C^r(\hat g_r(x_w), c_w^r)\right]$.
5: Compute the Hessian matrix of the loss function with respect to $\hat f$: $H_{\hat f} = \nabla_{\hat f}^2 \sum_{i=1}^n L_{Y_i}(\hat f, \hat g)$.
6: Update the label predictor: $\tilde f = \hat f + H_{\hat f}^{-1} \nabla_f \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) - H_{\hat f}^{-1} \nabla_f \sum_{l=1}^n L_Y(\hat f(\tilde g(x_l)), y_l)$.
7: Return: $\tilde f$, $\tilde g$.
Algorithm 2 Concept-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original concept predictor $\hat g$, label predictor $\hat f$, and the index set $M$ of concepts to be removed.
2: For each $r \in M$, set $p_r = 0$ for all data points $z \in D$.
3: Compute the Hessian matrix of the loss function with respect to $\hat g$: $H_{\hat g} = \nabla_{\hat g}^2 \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_j(x_i), c_i^j)$.
4: Update the concept predictor: $\tilde g' = \hat g - H_{\hat g}^{-1} \nabla_{\hat g} \sum_{i=1}^n \sum_{j \notin M} L_C^j(\hat g_j(x_i), c_i^j)$.
5: Compute the Hessian matrix of the loss function with respect to $\hat f$: $H_{\hat f} = \nabla_{\hat f}^2 \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i)$.
6: Update the label predictor: $\tilde f = \hat f - H_{\hat f}^{-1} \nabla_{\hat f} \sum_{l=1}^n L_Y(\hat f(\tilde g'(x_l)), y_l)$.
7: Map $\tilde g'$ to $\tilde g$ by removing the $r$-th row of the matrix in the final layer of $\tilde g'$ for each $r \in M$.
8: Return: $\tilde f$, $\tilde g$.
Algorithm 3 Data-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original concept predictor $\hat g$, label predictor $\hat f$, and the index set $G$ of data to be removed.
2: For each $r \in G$, remove the $r$-th data point $(x_r, y_r, c_r)$ from $D$ and define the new dataset as $S$.
3: Compute the Hessian matrix of the loss function with respect to $\hat g$: $H_{\hat g} = \nabla_{\hat g}^2 \sum_{i,j} L_C^j(\hat g_j(x_i), c_i^j)$.
4: Update the concept predictor: $\tilde g = \hat g + H_{\hat g}^{-1} \sum_{r \in G} \nabla_g L_C(\hat g(x_r), c_r)$.
5: Compute the Hessian matrix of the loss function with respect to $\hat f$: $H_{\hat f} = \nabla_{\hat f}^2 \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i)$.
6: Compute $A_G$ as $A_G = H_{\hat f}^{-1} \sum_{r \in G} \nabla_{\hat f} L_Y(\hat f(\hat g(x_r)), y_r)$.
7: Obtain $\bar f$ as $\bar f = \hat f + A_G$.
8: Compute the Hessian matrix of the loss function with respect to $\bar f$: $H_{\bar f} = \nabla_{\bar f}^2 \sum_{i \in [n]\setminus G} L_Y(\bar f(\hat g(x_i)), y_i)$.
9: Compute $B_G$ as $B_G = -H_{\bar f}^{-1} \sum_{i \in [n]\setminus G} \left[\nabla_{\bar f} L_Y(\bar f(\tilde g(x_i)), y_i) - \nabla_{\bar f} L_Y(\bar f(\hat g(x_i)), y_i)\right]$.
10: Update the label predictor: $\tilde f = \hat f + A_G + B_G$.
11: Return: $\tilde f$, $\tilde g$.
Algorithm 4 EK-FAC for Concept Predictor $\hat g$
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original concept predictor $\hat g$.
2: for the $l$-th convolution layer of $\hat g$ do
3: Define the input activations $\{a_{j,t}\}$, weights $W = (w_{i,j,\delta})$, and biases $b = (b_i)$ of this layer;
4: Obtain the expanded activations $[\![A_{l-1}]\!]$ as $[\![A_{l-1}]\!]_{t,\, j|\Delta|+\delta} = [A_{l-1}]_{(t+\delta), j} = a_{j, t+\delta}$;
5: Compute the pre-activations: $[S_l]_{i,t} = s_{i,t} = \sum_{j,\delta} w_{i,j,\delta}\, a_{j, t+\delta} + b_i$.
6: During backpropagation, obtain $\mathcal{D}s_{i,t} = \partial L_C / \partial s_{i,t}$, where $L_C = \sum_{j=1}^k \sum_{i=1}^n L_C^j$;
7: Compute $\hat\Omega_{l-1}$ and $\hat\Gamma_l$ as
$$\hat\Omega_{l-1} = \frac{1}{n}\sum_{i=1}^n [\![A^i_{l-1}]\!]^{\top} [\![A^i_{l-1}]\!], \qquad \hat\Gamma_l = \frac{1}{n|T|}\sum_{i=1}^n \mathcal{D}S^i_l\, (\mathcal{D}S^i_l)^{\top};$$
8: Perform eigenvalue decompositions of $\hat\Omega_{l-1}$ and $\hat\Gamma_l$ to obtain $Q_{\Omega}, \Lambda_{\Omega}, Q_{\Gamma}, \Lambda_{\Gamma}$, which satisfy $\hat\Omega_{l-1} = Q_{\Omega}\Lambda_{\Omega}Q_{\Omega}^{\top}$ and $\hat\Gamma_l = Q_{\Gamma}\Lambda_{\Gamma}Q_{\Gamma}^{\top}$;
9: Define a diagonal matrix $\Lambda^*$ and compute its diagonal elements as
$$\Lambda^*_{ii} = \frac{1}{n}\sum_j \big((Q_{\Omega_{l-1}} \otimes Q_{\Gamma_l})^{\top} \nabla_{\theta_l} L_{C_j}\big)_i^2;$$
10: Compute $\hat H_l^{-1}$ as
$$\hat H_l^{-1} = (Q_{\Omega_{l-1}} \otimes Q_{\Gamma_l})\, (\Lambda^* + \lambda_l I_{d_l})^{-1}\, (Q_{\Omega_{l-1}} \otimes Q_{\Gamma_l})^{\top};$$
11: end for
12: Splice the $\hat H_l^{-1}$ sequentially into a large block-diagonal matrix
$$\hat H_{\hat g}^{-1} = \mathrm{diag}\big(\hat H_1^{-1}, \ldots, \hat H_d^{-1}\big),$$
where $d$ is the number of convolution layers of the concept predictor.
13: Return: the inverse Hessian matrix $\hat H_{\hat g}^{-1}$.
Algorithm 5 EK-FAC for Label Predictor $\hat f$
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original label predictor $\hat f$.
2: Denote the pre-activated output of $\hat f$ as $f^*$, and compute $A$ as $A = \frac{1}{n}\sum_{i=1}^n \hat g(x_i)\, \hat g(x_i)^{\top}$;
3: Compute $B$ as $B = \frac{1}{n}\sum_{i=1}^n \nabla_{f^*} L_Y(\hat f(\hat g(x_i)), y_i)\, \nabla_{f^*} L_Y(\hat f(\hat g(x_i)), y_i)^{\top}$;
4: Perform eigenvalue decompositions of $A$ and $B$ to obtain $Q_A, \Lambda_A, Q_B, \Lambda_B$, which satisfy $A = Q_A \Lambda_A Q_A^{\top}$ and $B = Q_B \Lambda_B Q_B^{\top}$;
5: Define a diagonal matrix $\Lambda^*$ and compute its diagonal elements as $\Lambda^*_{ii} = \frac{1}{n}\sum_j \big((Q_A \otimes Q_B)^{\top} \nabla_{\hat f} L_{Y_j}\big)_i^2$;
6: Compute $\hat H_{\hat f}^{-1}$ as $\hat H_{\hat f}^{-1} = (Q_A \otimes Q_B)\, (\Lambda^* + \lambda I_d)^{-1}\, (Q_A \otimes Q_B)^{\top}$;
7: Return: the inverse Hessian matrix $\hat H_{\hat f}^{-1}$.
Algorithm 6 EK-FAC Concept-label-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^n$, original concept predictor $\hat g$, label predictor $\hat f$, a set of erroneous data $D_e$ with its associated index set $S_e$, and damping parameter $\lambda$.
2: For each index $(w, r) \in S_e$, correct $c_w^r$ to the right label $\tilde c_w^r$ for the $w$-th data point $(x_w, y_w, c_w)$.
3: Use the EK-FAC method in Algorithm 4 to accelerate the inverse-Hessian-vector-product (iHVP) computation for $\hat g$ and obtain the inverse Hessian matrix $\hat H_{\hat g}^{-1}$.
4: Update the concept predictor: $\tilde g = \hat g - \hat H_{\hat g}^{-1} \sum_{(w,r)\in S_e} \left[\nabla_{\hat g} L_C^r(\hat g_r(x_w), \tilde c_w^r) - \nabla_{\hat g} L_C^r(\hat g_r(x_w), c_w^r)\right]$.
5: Use the EK-FAC method in Algorithm 5 to accelerate the iHVP computation for $\hat f$ and obtain $\hat H_{\hat f}^{-1}$.
6: Update the label predictor: $\tilde f = \hat f + \hat H_{\hat f}^{-1} \nabla_f \sum_{i=1}^n L_Y(\hat f(\hat g(x_i)), y_i) - \hat H_{\hat f}^{-1} \nabla_f \sum_{l=1}^n L_Y(\hat f(\tilde g(x_l)), y_l)$.
7: Return: $\tilde f$, $\tilde g$.
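The Kronecker-factored structure exploited in Algorithms 4 and 5 is what makes the inverse-Hessian-vector products cheap: for a linear layer, if the Hessian is approximated by $A \otimes B$ (input-covariance factor times gradient-covariance factor), then $(A \otimes B)^{-1}\mathrm{vec}(V) = \mathrm{vec}(B^{-1} V A^{-1})$, two small solves instead of one large one. A minimal numpy sketch with random stand-ins for the two factors (an assumed toy setup, not the paper's implementation):

```python
import numpy as np

# K-FAC-style iHVP sketch: A ~ activation covariance, B ~ gradient covariance
# (both random, symmetric positive definite stand-ins), V ~ a gradient matrix.
rng = np.random.default_rng(4)
d_in, d_out, lam = 5, 3, 1e-2
A = np.cov(rng.normal(size=(d_in, 200))) + lam * np.eye(d_in)
B = np.cov(rng.normal(size=(d_out, 200))) + lam * np.eye(d_out)
V = rng.normal(size=(d_out, d_in))

# Kronecker-factored iHVP: two small solves
ihvp_kfac = np.linalg.solve(B, V) @ np.linalg.inv(A)

# check against the explicit Kronecker Hessian acting on vec(V)
# (column-major vec, using (A (x) B) vec(V) = vec(B V A^T) with A symmetric)
H = np.kron(A, B)
ihvp_full = np.linalg.solve(H, V.reshape(-1, order="F")).reshape(
    (d_out, d_in), order="F")
assert np.allclose(ihvp_kfac, ihvp_full)
```

The eigendecompositions in Algorithms 4 and 5 refine this by rescaling the Kronecker eigenbasis with empirically fitted diagonal variances (the "E" in EK-FAC) before adding the damping $\lambda$.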
Algorithm 7 EK-FAC Concept-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^{n}$, original concept predictor $\hat{g}$, label predictor $\hat{f}$, the index set $M$ of concepts to be removed, and damping parameter $\lambda$.
2: For each $r \in M$, set $p_r = 0$ for all data $z \in D$.
3: Use the EK-FAC method in Algorithm 4 to accelerate the iHVP computation for $\hat{g}$ and obtain the inverse Hessian approximation $\hat{H}_{\hat{g}}^{-1}$.
4: Update the concept predictor: $\tilde{g} = \hat{g} - \hat{H}_{\hat{g}}^{-1} \nabla_{\hat{g}} \sum_{i=1}^{n} L_{C^j}(\hat{g}^j(x_i), c_i^j)$.
5: Use the EK-FAC method in Algorithm 5 to accelerate the iHVP computation for $\hat{f}$ and obtain $\hat{H}_{\hat{f}}^{-1}$.
6: Update the label predictor: $\tilde{f} = \hat{f} - \hat{H}_{\hat{f}}^{-1} \nabla_{\hat{f}} \sum_{l=1}^{n} L_Y(\hat{f}(\tilde{g}(x_l)), y_l)$.
7: Map $\tilde{g}$ to $\tilde{g}'$ by removing the $r$-th row of the weight matrix in the final layer of $\tilde{g}$ for each $r \in M$.
8: Return: $\tilde{f}$, $\tilde{g}'$.

Algorithm 8 EK-FAC Data-level ECBM
1: Input: Dataset $D = \{(x_i, y_i, c_i)\}_{i=1}^{n}$, original concept predictor $\hat{g}$, label predictor $\hat{f}$, a set of erroneous data $D_e$ with associated index set $S_e$, and damping parameter $\lambda$.
2: For each index $(w, r) \in S_e$, correct $c^r_w$ to the right label $\tilde{c}^r_w$ for the $w$-th data point $(x_w, y_w, c_w)$.
3: Use the EK-FAC method in Algorithm 4 to accelerate the iHVP computation for $\hat{g}$ and obtain the inverse Hessian approximation $\hat{H}_{\hat{g}}^{-1}$.
4: Update the concept predictor: $\tilde{g} = \hat{g} - \hat{H}_{\hat{g}}^{-1} \sum_{(w,r) \in S_e} \big[ \nabla_{\hat{g}} L_{C^r}(\hat{g}^r(x_w), \tilde{c}^r_w) - \nabla_{\hat{g}} L_{C^r}(\hat{g}^r(x_w), c^r_w) \big]$.
5: Use the EK-FAC method in Algorithm 5 to accelerate the iHVP computation for $\hat{f}$ and obtain $\hat{H}_{\hat{f}}^{-1}$. Compute $A = \hat{H}_{\hat{f}}^{-1} \sum_{i \in [n]} \nabla_{\hat{f}} L_Y(\hat{f}(\hat{g}(x_i)), y_i)$ and obtain $f' = \hat{f} + A$.
6: Use the EK-FAC method in Algorithm 5 to accelerate the iHVP computation for $f'$ and obtain $\hat{H}_{f'}^{-1}$. Compute $B = \hat{H}_{f'}^{-1} \sum_{i \in [n]} \nabla_{f'} \big[ L_Y(f'(\tilde{g}(x_i)), y_i) - L_Y(f'(\hat{g}(x_i)), y_i) \big]$ and update the label predictor: $\tilde{f} = \hat{f} + A + B$.
7: Return: $\tilde{f}$, $\tilde{g}$.

H. Additional Experiments

H.1. Experimental Setting

Methodology for Processing the CUB Dataset. For the CUB dataset, we follow the setting in (Koh et al., 2020).
We aggregate instance-level concept annotations into class-level concepts via majority voting: for example, if more than 50% of crows in the data have black wings, we set all crows to have black wings.

RMIA score. The RMIA score is computed as
$$ LR_\theta(x, z) = \frac{\Pr\big(f_\theta(x) \mid \mathcal{N}(\mu_{\mathrm{in}}(x), \sigma^2_{\mathrm{in}}(x))\big) \,/\, \Pr\big(f_\theta(x) \mid \mathcal{N}(\mu_{\mathrm{out}}(x), \sigma^2_{\mathrm{out}}(x))\big)}{\Pr\big(f_\theta(z) \mid \mathcal{N}(\mu_{\mathrm{in}}(z), \sigma^2_{\mathrm{in}}(z))\big) \,/\, \Pr\big(f_\theta(z) \mid \mathcal{N}(\mu_{\mathrm{out}}(z), \sigma^2_{\mathrm{out}}(z))\big)}, $$
where $f_\theta(x)$ denotes the model's output (logits) for the data point $x$, $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, $\mu_{\mathrm{in}}(x)$ is the mean of the model's outputs for $x$ under the assumption that $x$ belongs to the training set, and $\sigma^2_{\mathrm{in}}(x)$ is the corresponding variance ($\mu_{\mathrm{out}}$ and $\sigma^2_{\mathrm{out}}$ are defined analogously for $x$ outside the training set). The likelihoods $\Pr(f_\theta(x) \mid \mathcal{N})$ represent the probability that the model's output $f_\theta(x)$ follows the Gaussian distribution parameterized by $\mu$ and $\sigma^2$ under the two hypotheses: $x$ being a member of the training set versus not being a member.

H.2. Improvement via Harmful Data Removal

We conducted additional experiments on the CUB dataset with synthetically introduced noisy concepts or labels. First, we introduce noise at three levels. At the concept level, we select 10% of the concepts and flip their concept labels for a portion of the data. At the data level, we select 10% of the data and flip their classification labels. At the concept-label level, we select 10% of all concept labels and flip them. We then conduct the following experiment: we inject noise at each of the three levels and train the model. Afterwards, we remove the noise and retrain; the retrained model serves as the ground truth (GT) for this harmful-data-removal task. In contrast, we use ECBM to remove the harmful data.

Figure 5: Model performance after the removal of harmful data.

From Figure 5, we observe that model performance improves across all three settings after noise removal, whether by retraining or by ECBM editing.
This confirms that ECBM performs nearly as well as retraining across these experimental scenarios, providing further evidence of the robustness of our method.

H.3. Periodic Editing Performance

ECBM supports periodic editing. To evaluate ECBM under multiple rounds of editing, we conduct the following experiments. First, we introduce noise at three levels. At the concept level, we select 10% of the concepts and flip their concept labels for a portion of the data. At the data level, we select 10% of the data and flip their classification labels. At the concept-label level, we select 10% of all concept labels and flip them. We then edit iteratively. At the concept level, we first remove 1% of the concepts, then retrain or use ECBM to edit, and repeat. At the data level, we first remove 1% of the data, then retrain or use ECBM to edit. At the concept-label level, we first remove one concept label from 1% of the data, then retrain or use ECBM to edit. Note that when removing the next 1% of the concepts, ECBM edits the model based on the result of the previous edit. The results for each level are shown in Figures 6, 7, and 8. Across all three levels, we find that once the mislabeled information is removed, the retrained model achieves better accuracy and F1 score than the initial model. Furthermore, the ECBM-edited model performs similarly to the retrained model even after 10 rounds of editing, which demonstrates the ability of our ECBM method to handle multiple edits.

H.4. More Visualization Results and Explanation

Visualization. Since CBM is an explainable model, we evaluate the interpretability of our ECBM against retraining. We present visualization results for the concept-level edit.
Figure 9 presents the top 10 most influential concepts and their corresponding predicted concept labels obtained by our ECBM and by the retraining method after randomly deleting concepts from the CUB dataset. (A detailed explanation can be found in Appendix H.4.1.) Our ECBM can explain which concepts are crucial and how they assist the prediction. Specifically, of the top 10 most important concepts in the ground truth (retraining), ECBM accurately recovers 9. For instance, we correctly identify "has_upperparts_color::orange", "has_upper_tail_color::red", and "has_breast_color::black" as among the most important concepts when predicting categories. Additional visualization results at the data level and concept-label level on the OAI and CUB datasets are included in Appendix H.4.2.

Figure 6: Accuracy and F1 score of the edited model compared with the retrained model at the concept level. (a) Accuracy; (b) F1 score.

Figure 7: Accuracy and F1 score of the edited model compared with the retrained model at the data level. (a) Accuracy; (b) F1 score.

H.4.1. EXPLANATION FOR VISUALIZATION RESULTS

At the concept level, we remove each concept one at a time, retrain the CBM, and subsequently evaluate the model performance.
We rank the concepts in descending order of the resulting loss in model performance. Concepts whose removal causes significant changes in model performance are considered influential. The top 10 concepts are shown in the Retrain column of Figure 9. In contrast, using our ECBM method (Algorithm 7) instead of retraining yields the top 10 concepts shown in the ECBM column of Figure 9. To help readers connect the top 10 influential concepts with the input image, we provide visualizations of the data and list the concept labels corresponding to the top 10 influential concepts in Figures 9, 10, and 11. For the other two levels and for additional datasets, we conduct a similar procedure; the corresponding visualization results are presented in Figures 12, 13, 14, 15, and 16.

H.4.2. VISUALIZATION RESULTS

We provide additional visualization results in Figures 10, 11, 12, 13, 14, 15, and 16.

Figure 8: Accuracy and F1 score of the edited model compared with the retrained model at the concept-label level. (a) Accuracy; (b) F1 score.

Figure 9: Visualization of the top 10 most influential concepts for the CBM (identified by ECBM or retraining), highlighted on an extracted image.

I. More Related Work

Influence Function. The influence function, originally a staple of robust statistics (Cook, 2000; Cook & Weisberg, 1980), has seen extensive adoption within machine learning since Koh & Liang (2017) introduced it to the field.
Its versatility spans various applications, including detecting mislabeled data, interpreting models, addressing model bias, and facilitating machine unlearning tasks. Notable works in machine unlearning include unlearning features and labels (Warnecke et al., 2023), minimax unlearning (Liu et al., 2024), forgetting a subset of image data when training deep neural networks (Golatkar et al., 2020a; 2021), and graph unlearning involving nodes, edges, and features. Recent advances, such as the LiSSA method (Agarwal et al., 2017; Kwon et al., 2023) and kNN-based techniques (Guo et al., 2021), have been proposed to improve computational efficiency. In addition, various studies have applied influence functions to interpret models across different domains, including natural language processing (Han et al., 2020) and image classification (Basu et al., 2021), while also addressing biases in classification models (Wang et al., 2019), word embeddings (Brunet et al., 2019), and fine-tuned models (Chen et al., 2020). Despite the many studies on influence functions, we are the first to use them to construct editable CBMs. Moreover, the influence function for CBMs is more involved than for traditional neural networks: in a traditional network, only the predicted output changes, whereas in a CBM we must first remove the true concept and then approximate the predicted concept in order to approximate the output. Bridging the gap between the true and predicted concepts poses a significant theoretical challenge in our proof.

Model Unlearning. Model unlearning has gained significant attention in recent years, with various methods (Bourtoule et al., 2021; Brophy & Lowd, 2021; Cao & Yang, 2015; Chen et al., 2022a;b) proposed to efficiently remove the influence of specific data from trained machine learning models. Existing approaches can be broadly categorized into exact and approximate unlearning methods.
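The influence-function edits discussed above, like the Newton-step unlearning methods surveyed next, share a one-step form: remove a training point by moving the parameters along the inverse-Hessian-preconditioned gradient of that point's loss. A minimal self-contained sketch (ours; ridge regression stands in for a single convex CBM stage, and all variable names are illustrative) compares this edit against exact retraining:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: ridge regression plays the role of one (convex) model stage.
n, d, lam = 200, 5, 1e-2
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def fit(Xs, ys):
    """Exact minimiser of sum_i (x_i^T w - y_i)^2 / 2 + lam/2 * ||w||^2."""
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

theta_hat = fit(X, y)  # model trained on the full dataset

# Influence-function edit: unlearn training point r with a single
# Newton-style step, theta_tilde = theta_hat + H^{-1} grad_r, where H is
# the full-data Hessian and grad_r is the gradient of the removed
# point's loss at theta_hat.
r = 7
H = X.T @ X + lam * np.eye(d)
grad_r = X[r] * (X[r] @ theta_hat - y[r])
theta_tilde = theta_hat + np.linalg.solve(H, grad_r)

# Ground truth: retrain from scratch on the remaining n-1 points.
mask = np.arange(n) != r
theta_retrain = fit(X[mask], y[mask])

# The edited parameters should be far closer to the retrained ones than
# the original parameters are.
assert (np.linalg.norm(theta_tilde - theta_retrain)
        < np.linalg.norm(theta_hat - theta_retrain))
```

Because the ridge objective is quadratic, the remaining gap to the retrained solution comes only from using the full-data Hessian instead of the leave-one-out Hessian, a second-order error, which is why the edited parameters land much closer to the retrained ones than the original parameters do.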
Exact unlearning methods aim to replicate the result of retraining by selectively updating only a portion of the dataset, thereby avoiding the computational expense of retraining on the entire dataset (Sekhari et al., 2021; Chowdhury et al., 2024). Approximate unlearning methods, on the other hand, adjust the model parameters to approximately satisfy the optimality condition of the objective function on the remaining data (Golatkar et al., 2020a; Guo et al., 2019; Izzo et al., 2021). These methods fall into three subcategories: (1) Newton-step-based updates that leverage Hessian-related terms [22, 26, 31, 34, 40, 43, 49], often incorporating Gaussian noise to mitigate residual data influence; to reduce computational costs, some works approximate the Hessian using the Fisher information matrix (Golatkar et al., 2020a) or small Hessian blocks (Mehta et al., 2022). (2) Neural tangent kernel (NTK)-based unlearning, which approximates training as a linear process (Golatkar et al., 2020b). (3) SGD path-tracking methods, such as DeltaGrad (Wu et al., 2020) and unrolled SGD (Thudi et al., 2022), which reverse the optimization trajectory of stochastic gradient descent during training. Despite these advances, none of these methods can handle the special architecture of CBMs. Moreover, given the high cost of obtaining data, we sometimes prefer to correct the data rather than remove it, which model unlearning cannot achieve.

J. Limitations and Broader Impacts

It is important to acknowledge that the ECBM approach is essentially an approximation of the model that would be obtained by retraining on the edited data. However, our results indicate that this approximation is effective in real-world applications. Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer.
However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove or insert training data or concepts in trained CBMs for reasons such as data mislabeling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs. Our ECBM can serve as an interactive, editable explanation tool for doctors in real-world settings.

Figure 10: Visualization of the top-10 most influential concepts for different classes in CUB.

Figure 11: Visualization of the top-10 most influential concepts for different classes in CUB.

Figure 12: Visualization of the most influential concept label related to different data in CUB.

Figure 13: Visualization of the most influential concept label related to different data in CUB.

Figure 14: Visualization of the most influential concept label related to different data in CUB.

Figure 15: Visualization of the most influential concept label related to different data in CUB.

Figure 16: Visualization of the most influential concept label related to different data in OAI.