# Stochastic Concept Bottleneck Models

Moritz Vandenhirtz*, Sonia Laguna*, Ričards Marcinkevičs, Julia E. Vogt
Department of Computer Science, ETH Zurich, Switzerland

*Equal contribution. Correspondence to {moritz.vandenhirtz,slaguna}@inf.ethz.ch

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the efficient training and inference procedure of CBMs. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.

1 Introduction

In today's world, machine learning plays a crucial role in making important decisions, from healthcare to finance and law. However, as these algorithms become more complex, understanding how they arrive at their decisions becomes increasingly challenging. This lack of interpretability is a significant concern, especially in situations where trustworthiness, transparency, and accountability are paramount (Lipton, 2016; Doshi-Velez & Kim, 2017).

Recent studies have focused on Concept Bottleneck Models (CBMs) (Koh et al., 2020; Havasi et al., 2022; Shin et al., 2023), a class of models that predict human-understandable concepts upon which the final target prediction is based. CBMs offer interpretability since a user can inspect the predicted concept values to understand how the model arrives at its final target prediction. Moreover, if they disagree with a concept prediction, they can intervene by adjusting it to the right value, which in turn affects the target prediction. For example, consider the yellow warbler in Figure 1(a), where a user might notice that the binary concept "yellow primary color" is mispredicted. Upon this realization, they can intervene on the CBM by setting its value to 1, which increases the probability of the class "Yellow Warbler". This way of interacting allows any untrained user to engage with the model to increase its predictive performance. However, if the user input is that the primary color is yellow, should the likelihood of a yellow crown not increase too? This adaptation would increase the predicted likelihood of the correct class even more, as yellow warblers are characterized by their fully yellow body. Currently, vanilla CBMs do not exhibit this behavior, as they do not use the intervened-on concepts to update their remaining concept predictions. This indicates that they adapt suboptimally to the additional knowledge gained.
Figure 1: Overview of the proposed method for the CUB dataset. (a) A user intervenes on the concept "yellow primary color". Unlike CBMs, our method then uses this information to adjust the predicted probabilities of correlated concepts (e.g., "yellow crown" rises from 0.39 to 0.93), thereby affecting the target prediction (the predicted probability of "Yellow Warbler" rises from 0.32 to 0.73). (b) Schematic overview of the intervention procedure: concept logits are modeled as $\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \boldsymbol{\Sigma}(\mathbf{x}))$ with concepts $\mathbf{c} \sim \mathrm{Bern}(\sigma(\boldsymbol{\eta}))$; a user's intervention $\mathbf{c}_{\mathcal{S}}$ is used to infer the logits of the remaining concepts via $\boldsymbol{\eta}_{\setminus\mathcal{S}} \mid \boldsymbol{\eta}_{\mathcal{S}} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}})$. (c) Visualization of the learned global dependency structure as a correlation matrix for the 112 concepts of CUB (Wah et al., 2011), with a characterization of the concepts on the left.

To this end, we propose to extend the concept predictions with a model of their dependencies, as depicted in Figure 1. The proposed approach captures concept dependencies by modeling the concept logits with a learnable non-diagonal normal distribution, which enables efficient, scalable computation of the effect of interventions on other concepts. By integrating concept correlations, we reduce the time and effort of laboriously intervening on many correlated variables and increase the efficacy of interventions on the downstream prediction. Thanks to the explicit distributional assumptions, the model is trained end-to-end, retaining the training and inference speed of classic CBMs as well as the benefits of training the concept and target predictors jointly. Moreover, we show that our method excels when querying user interventions based on predicted concept uncertainty (Shin et al., 2023), further highlighting its practical utility, as such policies spare users from manually sifting through the concepts to identify necessary interventions. Lastly, based on the distributional concept parameterization, we propose a novel approach for computing dependency-aware interventions through the likelihood-based confidence region.

Contributions. This work contributes to the line of research on concept bottleneck models in several ways. (i) We propose to capture and model concept dependencies with a multivariate normal distribution. (ii) We derive a novel intervention strategy based on the confidence region of the normal distribution that incorporates concept correlations; using the learned concept dependencies during the intervention procedure allows for stronger intervention effectiveness. (iii) We provide a thorough empirical assessment of the proposed method on synthetic tabular and natural image data. Additionally, we combine our method with concept discovery, alleviating the need for annotations by using CLIP-inferred concepts. In particular, we show that the proposed method (a) discovers meaningful, interpretable patterns in the form of concept dependencies, (b) allows for fast, scalable inference, and (c) outperforms related work with respect to intervention effectiveness thanks to the proposed concept modeling and intervention strategy.
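To make the update in Figure 1(b) concrete, the following is a minimal sketch of the Gaussian conditioning step, assuming NumPy. The function name and interface are ours, and the mapping from a user's intervention $\mathbf{c}_{\mathcal{S}}$ to intervened logits $\boldsymbol{\eta}_{\mathcal{S}}$ (the confidence-region strategy) is derived in the main text and not shown here.

```python
import numpy as np

def condition_on_intervention(mu, Sigma, idx_s, eta_s):
    """Conditional distribution of the remaining logits given intervened
    logits eta_s, under eta ~ N(mu, Sigma). Standard Gaussian conditioning."""
    idx_r = np.setdiff1d(np.arange(len(mu)), idx_s)  # indices of remaining concepts
    S_rr = Sigma[np.ix_(idx_r, idx_r)]
    S_rs = Sigma[np.ix_(idx_r, idx_s)]
    S_ss = Sigma[np.ix_(idx_s, idx_s)]
    K = S_rs @ np.linalg.inv(S_ss)                   # regression of eta_rest on eta_S
    mu_tilde = mu[idx_r] + K @ (eta_s - mu[idx_s])   # shifted mean
    Sigma_tilde = S_rr - K @ S_rs.T                  # reduced covariance
    return mu_tilde, Sigma_tilde
```

Updated probabilities for the remaining concepts then follow by pushing $\tilde{\boldsymbol{\mu}}$, or samples from $\mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}})$, through the sigmoid, as in Figure 1(b).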
2 Background & Related Work

Concept bottleneck models (Koh et al., 2020; Lampert et al., 2009; N. Kumar et al., 2009) are typically trained on data points $(\mathbf{x}, \mathbf{c}, y)$, comprising the covariates $\mathbf{x} \in \mathcal{X}$, target $y \in \mathcal{Y}$, and $C$ annotated binary concepts $\mathbf{c} \in \mathcal{C}$. Consider a neural network $f_\theta$ parameterized by $\theta$ and a slice $(g_\psi, h_\phi)$ (Leino et al., 2018) s.t. $\hat{y} = f_\theta(\mathbf{x}) = g_\psi(h_\phi(\mathbf{x}))$. CBMs enforce a concept bottleneck $\hat{\mathbf{c}} = h_\phi(\mathbf{x})$ such that the model's final output depends on the covariates $\mathbf{x}$ solely through the predicted concepts $\hat{\mathbf{c}}$. While Koh et al. (2020) propose the soft CBM, where the concept logits parameterize the bottleneck, Havasi et al. (2022) argue that such a representation leads to leakage, where additional unwanted information in the concept representation is used to predict the target (Margeloiu et al., 2021; Mahinpei et al., 2021). Thus, they parameterize the bottleneck by binarized concept predictions and call it the hard CBM. Then, Havasi et al. (2022) equip the hard CBM with an autoregressive structure of the form $c_i \mid \mathbf{x}, c_{<i}$, predicting each concept conditioned on the covariates and the preceding concepts.
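As a minimal illustration of this setup, the sketch below assumes PyTorch and exposes both the soft (probability) and hard (binarized) bottleneck variants; class and argument names are illustrative, not the original implementations of Koh et al. (2020) or Havasi et al. (2022).

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """y_hat = g_psi(h_phi(x)): the target depends on x only through concepts."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 n_concepts: int, n_classes: int):
        super().__init__()
        self.h_phi = nn.Sequential(backbone, nn.Linear(feat_dim, n_concepts))  # concept predictor
        self.g_psi = nn.Linear(n_concepts, n_classes)                          # target predictor

    def forward(self, x: torch.Tensor, hard: bool = True):
        c_prob = torch.sigmoid(self.h_phi(x))  # predicted concept probabilities
        # Hard CBM: binarize the bottleneck. Thresholding has no gradient, so
        # training typically uses a relaxation such as the Gumbel-softmax.
        bottleneck = (c_prob > 0.5).float() if hard else c_prob
        return c_prob, self.g_psi(bottleneck)
```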
References

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... others (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Chauhan, K., Tiwari, R., Freyberg, J., Shenoy, P., & Dvijotham, K. (2023). Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, pp. 5948–5955).

Collins, K. M., Barker, M., Zarlenga, M. E., Raman, N., Bhatt, U., Jamnik, M., ... Dvijotham, K. (2023). Human uncertainty in concept-based AI systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES 2023), Montréal, QC, Canada (pp. 869–889). ACM.

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608. doi: 10.48550/arXiv.1702.08608

Espinosa Zarlenga, M., Barbiero, P., Ciravegna, G., Marra, G., Giannini, F., Diligenti, M., ... others (2022). Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems (Vol. 35, pp. 21400–21413).

Espinosa Zarlenga, M., Collins, K., Dvijotham, K., Weller, A., Shams, Z., & Jamnik, M. (2024). Learning to receive help: Intervention-aware concept embedding models. Advances in Neural Information Processing Systems, 36.

Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.

Havasi, M., Parbhoo, S., & Doshi-Velez, F. (2022). Addressing leakage in concept bottleneck models. In Advances in Neural Information Processing Systems. Retrieved from https://openreview.net/forum?id=tglniD_fn9

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Heidemann, L., Monnet, M., & Roscher, K. (2023). Concept correlation and its effects on concept-based models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 4780–4788).

Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.

Jang, E., Gu, S., & Poole, B. (2017). Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations (ICLR 2017). OpenReview.net. Retrieved from https://openreview.net/forum?id=rkE3y85ee

Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 2668–2677). PMLR. Retrieved from https://proceedings.mlr.press/v80/kim18d.html

Kim, E., Jung, D., Park, S., Kim, S., & Yoon, S. (2023). Probabilistic concept bottleneck models. In Proceedings of the 40th International Conference on Machine Learning (Vol. 202, pp. 16521–16540). PMLR. Retrieved from https://proceedings.mlr.press/v202/kim23g.html

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015). Retrieved from http://arxiv.org/abs/1412.6980

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014). Retrieved from http://arxiv.org/abs/1312.6114

Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., & Liang, P. (2020). Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning (Vol. 119, pp. 5338–5348). PMLR. Retrieved from https://proceedings.mlr.press/v119/koh20a.html

Kraft, D. (1988). A software package for sequential quadratic programming. Forschungsbericht, Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Kumar, A., Liang, P. S., & Ma, T. (2019). Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.

Kumar, N., Berg, A. C., Belhumeur, P. N., & Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision (pp. 365–372). Kyoto, Japan: IEEE. Retrieved from https://doi.org/10.1109/ICCV.2009.5459250

Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE. Retrieved from https://doi.org/10.1109/CVPR.2009.5206594

Leino, K., Sen, S., Datta, A., Fredrikson, M., & Li, L. (2018). Influence-directed explanations for deep convolutional networks. In 2018 IEEE International Test Conference (ITC). IEEE. Retrieved from https://doi.org/10.1109/test.2018.8624792
Lipton, Z. C. (2016). The mythos of model interpretability. Communications of the ACM, 61(10), 35–43. doi: 10.48550/arXiv.1606.03490

Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR 2017). OpenReview.net. Retrieved from https://openreview.net/forum?id=S1jE5L5gl

Mahinpei, A., Clark, J., Lage, I., Doshi-Velez, F., & Pan, W. (2021). Promises and pitfalls of black-box concept learning models. arXiv:2106.13314. Retrieved from https://doi.org/10.48550/arXiv.2106.13314

Marcinkevičs, R., Laguna, S., Vandenhirtz, M., & Vogt, J. E. (2024). Beyond concept bottleneck models: How to make black boxes intervenable? In Advances in Neural Information Processing Systems (Vol. 37).

Marcinkevičs, R., Reis Wolfertstetter, P., Klimiene, U., Chin-Cheong, K., Paschke, A., Zerres, J., ... Vogt, J. E. (2024). Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Medical Image Analysis, 91, 103042. Retrieved from https://www.sciencedirect.com/science/article/pii/S136184152300302X

Margeloiu, A., Ashman, M., Bhatt, U., Chen, Y., Jamnik, M., & Weller, A. (2021). Do concept bottleneck models learn as intended? arXiv:2105.04289. Retrieved from https://doi.org/10.48550/arXiv.2105.04289

Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., ... Glocker, B. (2020). Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty. In Advances in Neural Information Processing Systems (Vol. 33, pp. 12756–12767).

Naeini, M. P., Cooper, G., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 29).

Neal, R. M. (1995). Bayesian learning for neural networks (Doctoral dissertation, University of Toronto, Canada). Retrieved from https://librarysearch.library.utoronto.ca/permalink/01UTORONTO_INST/14bjeso/alma991106438365706196

Oikarinen, T., Das, S., Nguyen, L. M., & Weng, T.-W. (2023). Label-free concept bottleneck models. In The 11th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=FlCg47MNvBA

Panousis, K. P., Ienco, D., & Marcos, D. (2023). Sparse linear concept discovery models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2767–2771).

Panousis, K. P., Ienco, D., & Marcos, D. (2024). Coarse-to-fine concept bottleneck models. In NeurIPS 2024: 38th Annual Conference on Neural Information Processing Systems.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... others (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).

Sheth, I., Rahman, A. A., Sevyeri, L. R., Havaei, M., & Kahou, S. E. (2022). Learning from uncertain concepts via test time interventions. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022. Retrieved from https://openreview.net/forum?id=WVe3vok8Cc3
Shin, S., Jo, Y., Ahn, S., & Lee, N. (2023). A closer look at the intervention procedure of concept bottleneck models. In Proceedings of the 40th International Conference on Machine Learning (Vol. 202, pp. 31504–31520). PMLR. Retrieved from https://proceedings.mlr.press/v202/shin23a.html

Silvey, S. (1975). Statistical inference. Taylor & Francis. Retrieved from https://books.google.ch/books?id=qIKLejbVMf4C

Singhi, N., Kim, J. M., Roth, K., & Akata, Z. (2024). Improving intervention efficacy via concept realignment in concept bottleneck models. arXiv:2405.01531.

Steinmann, D., Stammer, W., Friedrich, F., & Kersting, K. (2023). Learning to intervene on concept bottlenecks. arXiv:2308.13453. Retrieved from https://doi.org/10.48550/arXiv.2308.13453

Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.

Yuksekgonul, M., Wang, M., & Zou, J. (2023). Post-hoc concept bottleneck models. In The 11th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=nA5AZ8CEyow

A Dataset Details

In this section, we provide additional details on the datasets used in the experiments.

A.1 Synthetic Data-Generating Mechanism

Here, we describe the data-generating mechanism of the synthetic dataset in more detail. Let $N$, $p$, and $C$ denote the number of independent data points $\{(\mathbf{x}_n, \mathbf{c}_n, y_n)\}_{n=1}^{N}$, covariates, and concepts, respectively. We set $N = 50{,}000$, $p = 1{,}500$, and $C = 100$, with a 60%-20%-20% train-validation-test split. The generative process is as follows (see the sketch after this list):

1. Randomly sample $W \in \mathbb{R}^{C \times 10}$ s.t. $w_{i,j} \sim \mathcal{N}(0, 1)$ for $1 \le i \le C$ and $1 \le j \le 10$.
2. Generate a positive definite matrix $\Sigma \in \mathbb{R}^{C \times C}$ s.t. $\Sigma = WW^\top + D$, where $D \in \mathbb{R}^{C \times C}$ is diagonal with entries $\delta_i \sim U[0, 1]$ for $1 \le i \le C$.
3. Randomly sample logits $H \in \mathbb{R}^{N \times C}$ s.t. $\boldsymbol{\eta}_n \sim \mathcal{N}(\mathbf{0}, \Sigma)$ for $1 \le n \le N$.
4. Let $c_{n,i} = \mathbb{1}\{\eta_{n,i} \ge 0\}$ for $1 \le n \le N$ and $1 \le i \le C$.
5. Let $h: \mathbb{R}^C \to \mathbb{R}^p$ be a randomly initialized multilayer perceptron with ReLU nonlinearities.
6. Let $\mathbf{x}_n = h(\boldsymbol{\eta}_n) + \boldsymbol{\epsilon}_n$ s.t. $\boldsymbol{\epsilon}_n \sim \mathcal{N}(\mathbf{0}, I)$ for $1 \le n \le N$.
7. Let $g: \mathbb{R}^C \to \mathbb{R}$ be a randomly initialized linear perceptron.
8. Let $y_n = \mathbb{1}\{g(\mathbf{c}_n) \ge y_{\mathrm{med}}\}$ for $1 \le n \le N$, where $y_{\mathrm{med}}$ denotes the median of $g(\mathbf{c}_n)$.
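Below is a NumPy sketch of steps 1-8. The hidden width of the random MLP $h$ and the seed are our assumptions, as they are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice
N, p, C = 50_000, 1_500, 100    # reduce N for a quick test run

W = rng.normal(size=(C, 10))                             # step 1
Sigma = W @ W.T + np.diag(rng.uniform(size=C))           # step 2: positive definite
H = rng.multivariate_normal(np.zeros(C), Sigma, size=N)  # step 3: logits eta_n
c = (H >= 0).astype(np.float32)                          # step 4: binary concepts

# Steps 5-6: random ReLU MLP h: R^C -> R^p (hidden width 256 is our assumption),
# plus unit Gaussian observation noise.
W1 = rng.normal(size=(C, 256))
W2 = rng.normal(size=(256, p))
X = np.maximum(H @ W1, 0.0) @ W2 + rng.normal(size=(N, p))

# Steps 7-8: random linear g: R^C -> R, binarize the target at its median.
w_g = rng.normal(size=C)
scores = c @ w_g
y = (scores >= np.median(scores)).astype(np.int64)
```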
A.2 Natural Image Datasets

Caltech-UCSD Birds-200-2011. We evaluate on the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011) (https://www.vision.caltech.edu/datasets/cub_200_2011/, no license available). It comprises 11,788 photographs of 200 distinct bird species annotated with 312 concepts, such as belly color and pattern. We follow the original train-test split and the revised version of the dataset proposed in the initial CBM work (Koh et al., 2020), in which only the 112 most widespread binary attributes are included in the final dataset and concepts are shared across samples of identical classes. The images were resized to a resolution of 224×224 pixels. Finally, following the originally proposed augmentations, we applied random horizontal flips, modified the brightness and saturation, and applied normalization during training.

CIFAR-10. CIFAR-10 (Krizhevsky et al., 2009) (https://www.cs.toronto.edu/~kriz/cifar.html, no license available) is a natural image benchmark with 60,000 32×32 colour images and 10 classes. We kept the original train-test split, with 50,000 samples in the train set and a balanced total of 6,000 images per class. We generated 143 concept labels as described in Section 4 using large language and vision models. At training time, as for CUB, we applied augmentations including modifications to brightness and saturation, random horizontal flips, and normalisation. Images were rescaled to a size of 224×224 pixels.

B Implementation Details

This section provides further implementation details of SCBM and the evaluated baselines. All methods were implemented using PyTorch (v2.1.1) (Ansel et al., 2024). All models are trained for 150 epochs on the synthetic and 300 epochs on the natural image datasets with the Adam optimizer (Kingma & Ba, 2015), a learning rate of $10^{-4}$, and a batch size of 64. For the independently trained autoregressive model, we split the training epochs into 2/3 for the concept predictor and 1/3 for the target predictor. For the methods requiring sampling, the number of Monte Carlo samples is set to $M = 100$ (a sketch of this prediction step is given below); we provide an ablation for $M = 10$ in Appendix C.2. Note that since the predictor head is very simple, the MC sampling of SCBMs is extremely fast and does not influence computational complexity by more than 0.1%. For the synthetic tabular data, we use a fully connected neural network as backbone, with 3 non-linear layers, batch normalization, and dropout. For the CUB dataset, we use a pretrained ResNet-18 (He et al., 2016), and for the lower-resolution CIFAR-10, a simple convolutional neural network with 2 convolutional layers followed by ReLU, dropout, and a fully connected layer. For fairness of the comparisons, all baselines share the same architecture choices, and all experiments are performed over 10 random seeds.
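The Monte Carlo prediction step can be sketched as follows, assuming NumPy, a single data point with concept-logit mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, and a linear target head with weights `W_y`; names and the softmax averaging are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def mc_predict(mu, Sigma, W_y, M=100, rng=None):
    """Average target prediction over M sampled hard concept vectors."""
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(Sigma)                   # eta = mu + L z, z ~ N(0, I)
    eta = mu + rng.normal(size=(M, len(mu))) @ L.T  # M logit samples
    probs = 1.0 / (1.0 + np.exp(-eta))              # concept probabilities
    c = rng.uniform(size=probs.shape) < probs       # Bernoulli concept samples
    logits_y = c.astype(float) @ W_y.T              # cheap linear head per sample
    z = np.exp(logits_y - logits_y.max(axis=1, keepdims=True))
    return (z / z.sum(axis=1, keepdims=True)).mean(axis=0)  # mean class probabilities
```

Because the head is a single matrix product over all $M$ samples, this is consistent with the observation above that MC sampling adds little computational cost.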
Resource Usage. For the experiments of the main paper, we used a cluster of mostly GeForce RTX 2080s with 2 CPU workers. Over all methods, we estimate an average runtime of 8 hours per experiment, each running on a single GPU. This amounts to 5 methods × 3 datasets × 10 seeds × 8 hours = 1,200 hours. Adding to that, the ablation figures required another 40 runs, amounting to a total of 1,520 hours of compute. Note that we only report the numbers needed to generate the final results, not the development time, which we roughly estimate to be around 10 times larger.

C Further Experiments

In this section, we show additional experiments to provide a more in-depth understanding of SCBM's effectiveness. We ablate multiple hyperparameters to show how they influence model performance, and we evaluate our model in further settings.

Figure 4: Performance after intervening on concepts in the order of highest predicted uncertainty on CIFAR-100 with 892 concepts. Concept and target accuracy (%) are shown in the first and second rows, respectively. Results are reported as averages and standard deviations of model performance across 3 seeds.

Figure 5: Intervention performance in the order of highest predicted uncertainty on CUB. Concept and target accuracy (%) are shown in the first and second rows, respectively. Results are reported as averages and standard deviations of model performance across 3 seeds.

C.1 Intervention Performance on CIFAR-100

We present the results on the CIFAR-100 dataset with 892 concepts obtained from Oikarinen et al. (2023) in Figure 4 to showcase the scalability of SCBMs. The results underline the efficiency of our method. Notably, the autoregressive baseline exhibits a negative dip, likely because its independently trained target predictor is not aligned with the concept predictors in this noisy, CLIP-annotated scenario. Note that these components need to be trained independently to avoid sequential MC sampling during training, which would otherwise increase training time significantly. Our jointly trained SCBMs do not have this issue and surpass the baselines. We use the same configuration as for CIFAR-10, except that we set $M = 10$ to reduce the memory requirement.

C.2 Number of Monte Carlo Samples

To show that SCBMs do not rely on a large number of Monte Carlo samples, we provide an ablation of $M$ in Figure 5. Even for $M = 10$, SCBMs perform well. Note, however, that since $M$ is not a driving factor of SCBMs' computational cost, it can simply be left at a high value.

C.3 Random Intervention Policy

Figure 6: Performance after intervening on concepts in random order. Concept and target accuracy (%) are shown in the first and second rows, respectively. Results are reported as averages and standard deviations of model performance across ten seeds. Panels include (a) Synthetic and (c) CIFAR-10; the methods compared are Hard CBM, CEM, Autoregressive CBM, Global SCBM, and Amortized SCBM.

In Figure 6, we present the intervention performance of SCBMs and the baseline methods under a random intervention order. Compared to the uncertainty-based intervention policy of Figure 2, the intervention curves of all methods are less steep, confirming the usefulness of the policy proposed by Shin et al. (2023). As before, SCBMs still outperform the baseline methods, with the amortized variant beating the global one on the real-world datasets. We observe that on CIFAR-10, for the first interventions, an improvement in concept accuracy is not directly reflected in improved target prediction for SCBMs, which is likely due to the low signal-to-noise ratio of the CLIP-inferred concepts.

C.4 Regularization Strength

In Figure 7, we analyze the impact of the strength of $\lambda_2$ from Equation 6. Due to environmental considerations, we conducted these experiments using only 5 seeds and limited the number of interventions to 20. Our findings indicate that SCBMs are not sensitive to the choice of $\lambda_2$, except that the unregularized amortized variant exhibits slight patterns of overfitting.

Figure 7: Performance on CUB after intervening on concepts in the order of highest predicted uncertainty with differing regularization strengths. Concept and target accuracy (%) are shown in the first and second columns, respectively. Results are reported as averages and standard deviations of model performance across five seeds. For each SCBM variant, darker colors denote higher regularization strength $\lambda_2$.
C.5 Intervention Strategy

In Figure 8, we analyze the effect of the intervention strategy. Our findings indicate that while SCBMs remain effective with the strategy proposed by Koh et al. (2020), which sets the logits to the 5th (if $c_i = 0$) or 95th (if $c_i = 1$) percentile of the training distribution, our proposed strategy based on the confidence region results in stronger intervenability.

Figure 8: Performance on CUB after intervening on concepts in the order of highest predicted uncertainty, comparing the proposed intervention strategy to the intervention of Koh et al. (2020), which sets the logits to the 5th or 95th empirical percentile of the training distribution. Concept and target accuracy (%) are shown in the first and second columns, respectively. Results are reported as averages and standard deviations of model performance across five seeds.

C.6 Confidence Region Level

In Figure 9, we analyze the effect of the level $1 - \alpha$ of the likelihood-based confidence region. Our findings indicate that SCBMs are not sensitive to the choice of $1 - \alpha$, with higher levels performing slightly better.

Figure 9: Performance on CUB after intervening on concepts in the order of highest predicted uncertainty with differing levels $1 - \alpha$ of the confidence region. Concept and target accuracy (%) are shown in the first and second columns, respectively. Results are reported as averages and standard deviations of model performance across three seeds.

C.7 Jaccard Index

Panousis et al. (2024) propose to measure the interpretive capacity of concepts with the Jaccard index (Jaccard, 1901). Accordingly, in Table 4, we extend Table 1 with this metric (a sketch of the metric follows the table). The interpretation of the results does not change, indicating that the comparison is robust to the choice of evaluation metric.

Table 4: Test-set performance before interventions. Results are averaged across ten seeds.

| Dataset | Method | Concept Accuracy | Concept Jaccard | Target Accuracy |
|---|---|---|---|---|
| Synthetic | Hard CBM | 61.42 ± 0.07 | 43.80 ± 1.32 | 58.38 ± 0.39 |
| | CEM | 61.42 ± 0.12 | 44.84 ± 1.36 | 58.01 ± 0.49 |
| | Autoregressive CBM | 62.17 ± 0.11 | 45.30 ± 1.29 | 59.60 ± 0.62 |
| | Global SCBM | 61.57 ± 0.05 | 44.53 ± 1.02 | 58.39 ± 0.53 |
| | Amortized SCBM | 62.41 ± 0.20 | 45.85 ± 1.45 | 58.96 ± 0.38 |
| CUB | Hard CBM | 94.97 ± 0.07 | 77.22 ± 0.33 | 67.72 ± 0.57 |
| | CEM | 95.12 ± 0.07 | 78.20 ± 0.28 | 69.60 ± 0.30 |
| | Autoregressive CBM | 95.33 ± 0.07 | 79.21 ± 0.21 | 69.24 ± 0.44 |
| | Global SCBM | 94.99 ± 0.09 | 76.83 ± 0.47 | 68.19 ± 0.63 |
| | Amortized SCBM | 95.22 ± 0.09 | 78.29 ± 0.28 | 69.87 ± 0.56 |
| CIFAR-10 | Hard CBM | 85.51 ± 0.04 | 81.54 ± 0.08 | 69.73 ± 0.29 |
| | CEM | 85.12 ± 0.14 | 81.06 ± 0.21 | 72.24 ± 0.33 |
| | Autoregressive CBM | 85.31 ± 0.06 | 81.31 ± 0.10 | 68.88 ± 0.47 |
| | Global SCBM | 85.86 ± 0.04 | 81.81 ± 0.19 | 70.74 ± 0.29 |
| | Amortized SCBM | 86.00 ± 0.03 | 81.97 ± 0.20 | 71.66 ± 0.25 |
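For reference, a minimal sketch of the Jaccard index on binary concept predictions, assuming NumPy; computing it per concept and averaging, and treating empty unions as a perfect match, are our assumptions about the exact aggregation behind Table 4.

```python
import numpy as np

def concept_jaccard(c_true, c_pred):
    """Mean Jaccard index |A ∩ B| / |A ∪ B| over concepts.

    c_true, c_pred: binary arrays of shape (n_samples, n_concepts).
    """
    inter = np.logical_and(c_true, c_pred).sum(axis=0)
    union = np.logical_or(c_true, c_pred).sum(axis=0)
    per_concept = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return float(per_concept.mean())
```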
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Claims are supported by evidence in the Results section and Appendix.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, we have a Limitations & Future Work paragraph at the end of the conclusion.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We provide derivations of the method's theoretical foundations (detailed up to an acceptable degree of expected math knowledge) in the Method section.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We disclose hyperparameters in the main text and Appendix. We also offer the code for reproducibility in case any information is missing.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have released an anonymized version of the repository.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: See Question 4.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We provide error bars in all experiments, as we believe this to be of utmost importance for reproducible research. For the Appendix, we have reduced the number of seeds and/or experiment size to save computational resources for the environment's sake.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: See Appendix.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The Code of Ethics was followed.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Given the more foundational nature of this work, there is no direct negative influence that the authors can think of that might arise from this work specifically.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: To the best of our knowledge, our work does not have a high risk for misuse.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Licenses for all used datasets were clearly stated.

13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: In the Appendix, the data-generating mechanism is clearly stated for the introduced synthetic dataset. Additionally, the new method is described in detail.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.