# FastSHAP: Real-Time Shapley Value Estimation

Published as a conference paper at ICLR 2022

Neil Jethani (New York University), Mukund Sudarshan (New York University), Ian Covert (University of Washington), Su-In Lee (University of Washington), Rajesh Ranganath (New York University)

ABSTRACT

Although Shapley values are theoretically appealing for explaining black-box models, they are costly to calculate and thus impractical in settings that involve large, high-dimensional models. To remedy this issue, we introduce FastSHAP, a new method for estimating Shapley values in a single forward pass using a learned explainer model. To enable efficient training without requiring ground truth Shapley values, we develop an approach to train FastSHAP via stochastic gradient descent using a weighted least squares objective function. In our experiments with tabular and image datasets, we compare FastSHAP to existing estimation approaches and find that it generates accurate explanations with an orders-of-magnitude speedup.

1 INTRODUCTION

With the proliferation of black-box models, Shapley values (Shapley, 1953) have emerged as a popular explanation approach due to their strong theoretical properties (Lipovetsky and Conklin, 2001; Štrumbelj and Kononenko, 2014; Datta et al., 2016; Lundberg and Lee, 2017). In practice, however, Shapley value-based explanations are known to have high computational complexity, with an exact calculation requiring an exponential number of model evaluations (Van den Broeck et al., 2021). Speed becomes a critical issue as models increase in size and dimensionality, and for the largest models in fields such as computer vision and natural language processing, there is an unmet need for significantly faster Shapley value approximations that maintain high accuracy.

Recent work has addressed the computational challenges with Shapley values using two main approaches. First, many works have proposed stochastic estimators (Castro et al., 2009; Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017; Covert et al., 2020) that rely on sampling either feature subsets or permutations; though often consistent, these estimators require many model evaluations and impose an undesirable trade-off between run-time and accuracy. Second, some works have proposed model-specific approximations, e.g., for trees (Lundberg et al., 2020) or neural networks (Shrikumar et al., 2017; Chen et al., 2018b; Ancona et al., 2019; Wang et al., 2021); while generally faster, these approaches can still require many model evaluations, often induce bias, and typically lack flexibility regarding how held-out features are handled, a subject of ongoing debate in the field (Aas et al., 2019; Janzing et al., 2020; Frye et al., 2020; Covert et al., 2021).

Here, we introduce a new approach for efficient Shapley value estimation: to achieve the fastest possible run-time, we propose learning a separate explainer model that outputs precise Shapley value estimates in a single forward pass. Naïvely, such a learning-based approach would seem to require a large training set of ground truth Shapley values, which would be computationally intractable. Instead, our approach trains an explainer model by minimizing an objective function inspired by the Shapley value's weighted least squares characterization (Charnes et al., 1988), which enables efficient gradient-based optimization.

Our contributions.
We introduce FastSHAP, an amortized approach for generating real-time Shapley value explanations.¹ We derive an objective function from the Shapley value's weighted least squares characterization and investigate several ways to reduce gradient variance during training. Our experiments show that FastSHAP provides accurate Shapley value estimates with an orders-of-magnitude speedup relative to non-amortized estimation approaches. Finally, we also find that FastSHAP generates high-quality image explanations (fig. 1) that outperform gradient-based methods (e.g., IntGrad and Grad-CAM) on quantitative inclusion and exclusion metrics.

¹ https://git.io/JCqFV (PyTorch), https://git.io/JCqbP (TensorFlow)

[Figure 1: Explanations generated by each method for Imagenette images; columns compare FastSHAP against KernelSHAP, KernelSHAP-S, Integrated Gradients, and SmoothGrad.]

2 BACKGROUND

In this section, we introduce notation used throughout the paper and provide an overview of Shapley values and their weighted least squares characterization.

Let $\mathbf{x} \in \mathcal{X}$ be a random vector consisting of $d$ features, or $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_d)$, and let $\mathbf{y} \in \mathcal{Y} = \{1, \ldots, K\}$ be the response variable for a classification problem. We use $s \in \{0, 1\}^d$ to denote subsets of the indices $\{1, \ldots, d\}$ and define $x_s := \{x_i\}_{i : s_i = 1}$. The symbols $\mathbf{x}, \mathbf{y}, \mathbf{s}$ are random variables and $x, y, s$ denote possible values. We use $\mathbf{1}$ and $\mathbf{0}$ to denote vectors of ones and zeros in $\mathbb{R}^d$, so that $\mathbf{1}^\top s$ is a subset's cardinality, and we use $e_i$ to denote the $i$th standard basis vector. Finally, $f(x; \eta) : \mathcal{X} \mapsto \Delta^{K-1}$ is a model that outputs a probability distribution over $\mathbf{y}$ given $x$, and $f_y(x; \eta)$ is the probability for the $y$th class.

2.1 SHAPLEY VALUES

Shapley values were originally developed as a credit allocation technique in cooperative game theory (Shapley, 1953), but they have since been adopted to explain predictions from black-box machine learning models (Štrumbelj and Kononenko, 2014; Datta et al., 2016; Lundberg and Lee, 2017). For any value function (or set function) $v : 2^d \mapsto \mathbb{R}$, the Shapley values $\phi(v) \in \mathbb{R}^d$, or $\phi_i(v) \in \mathbb{R}$ for each feature $i = 1, \ldots, d$, are given by the formula

$$\phi_i(v) = \frac{1}{d} \sum_{s : s_i = 0} \binom{d-1}{\mathbf{1}^\top s}^{-1} \big( v(s + e_i) - v(s) \big). \quad (1)$$

The difference $v(s + e_i) - v(s)$ represents the $i$th feature's contribution to the subset $s$, and the summation represents a weighted average across all subsets that do not include $i$.

In the model explanation context, the value function is chosen to represent how an individual prediction varies as different subsets of features are removed. For example, given an input-output pair $(x, y)$, the prediction for the $y$th class can be represented by a value function $v_{x,y}$ defined as

$$v_{x,y}(s) = \mathrm{link}\Big( \mathbb{E}_{p(\mathbf{x}_{1-s})} \big[ f_y(x_s, \mathbf{x}_{1-s}; \eta) \big] \Big), \quad (2)$$

where the held-out features $\mathbf{x}_{1-s}$ are marginalized out using their joint marginal distribution $p(\mathbf{x}_{1-s})$, and a link function (e.g., logit) is applied to the model output. Recent work has debated the properties of different value function formulations, particularly the choice of how to remove features (Aas et al., 2019; Janzing et al., 2020; Frye et al., 2020; Covert et al., 2021). However, regardless of the formulation, this approach to model explanation enjoys several useful theoretical properties due to its use of Shapley values: for example, the attributions are zero for irrelevant features, and they are guaranteed to sum to the model's prediction. We direct readers to prior work for a detailed discussion of these properties (Lundberg and Lee, 2017; Covert et al., 2021).
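To make eq. (1) concrete, the following is a minimal sketch that computes exact Shapley values for a small set function by direct enumeration; the toy game `v` and all names here are illustrative, not part of the paper. The exponential cost of this enumeration is exactly what motivates the approximations discussed next.

```python
import itertools
from math import comb

def exact_shapley_values(value_fn, d):
    """Compute exact Shapley values via eq. (1).

    value_fn maps a tuple of d binary indicators (a subset s) to a scalar v(s).
    Enumerates all 2^(d-1) subsets per feature, so this is only feasible for
    small d -- precisely the bottleneck that FastSHAP addresses.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight 1 / (d * C(d-1, size)) for each subset of the given size.
            weight = 1.0 / (d * comb(d - 1, size))
            for subset in itertools.combinations(others, size):
                s = [0] * d
                for j in subset:
                    s[j] = 1
                without = value_fn(tuple(s))   # v(s)
                s[i] = 1
                with_i = value_fn(tuple(s))    # v(s + e_i)
                phi[i] += weight * (with_i - without)
    return phi

# Toy cooperative game: each player contributes 1, with a bonus of 2 when
# players 0 and 1 cooperate; the bonus credit is split evenly between them.
v = lambda s: sum(s) + (2.0 if s[0] and s[1] else 0.0)
print(exact_shapley_values(v, d=3))  # [2.0, 2.0, 1.0]
```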
Unfortunately, Shapley values also introduce computational challenges: the summation in eq. (1) involves an exponential number of subsets, which makes it infeasible to calculate for large $d$. Fast approximations are therefore required in practice, as we discuss next.

2.2 KERNELSHAP

KernelSHAP (Lundberg and Lee, 2017) is a popular Shapley value implementation that relies on an alternative Shapley value interpretation. Given a value function $v_{x,y}(s)$, eq. (1) shows that the values $\phi(v_{x,y})$ are the features' weighted average contributions; equivalently, their weighted least squares characterization says that they are the solution to an optimization problem over $\phi_{x,y} \in \mathbb{R}^d$,

$$\phi(v_{x,y}) = \arg\min_{\phi_{x,y}} \; \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{x,y} \big)^2 \Big] \quad (3)$$
$$\text{s.t.} \quad \mathbf{1}^\top \phi_{x,y} = v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}), \qquad \text{(Efficiency constraint)}$$

where the distribution $p(s)$ is defined as

$$p(s) \propto \frac{d - 1}{\binom{d}{\mathbf{1}^\top s} \cdot \mathbf{1}^\top s \cdot (d - \mathbf{1}^\top s)} \qquad \text{(Shapley kernel)}$$

for $s$ such that $0 < \mathbf{1}^\top s < d$ (Charnes et al., 1988).

Based on this view of the Shapley value, Lundberg and Lee (2017) introduced KernelSHAP, a stochastic estimator that solves an approximate version of eq. (3) given some number of subsets sampled from $p(s)$. Although the estimator is consistent and empirically unbiased (Covert and Lee, 2021), KernelSHAP often requires many samples to achieve an accurate estimate, and it must solve eq. (3) separately for each input-output pair $(x, y)$. As a result, it is unacceptably slow for some use cases, particularly in settings with large, high-dimensional models. Our approach builds on KernelSHAP, leveraging the Shapley value's weighted least squares characterization to design a faster, amortized estimation approach.

3 FASTSHAP

We now introduce FastSHAP, a method that amortizes the cost of generating Shapley values across many data samples. FastSHAP has two main advantages over existing approaches: (1) it avoids solving separate optimization problems for each input to be explained, and (2) it can use similar data points to efficiently learn the Shapley value function $\phi(v_{x,y})$.

3.1 AMORTIZING SHAPLEY VALUES

In our approach, we propose generating Shapley value explanations using a learned parametric function $\phi_{\mathrm{fast}}(x, y; \theta) : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}^d$. Once trained, the parametric function can generate explanations in a single forward pass, providing a significant speedup over methods that approximate Shapley values separately for each sample $(x, y)$.

Rather than using a dataset of ground truth Shapley values for training, we train $\phi_{\mathrm{fast}}(x, y; \theta)$ by penalizing its predictions according to the weighted least squares objective in eq. (3), or by minimizing the following loss,

$$\mathcal{L}(\theta) = \mathbb{E}_{p(x)} \, \mathbb{E}_{\mathrm{Unif}(y)} \, \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{\mathrm{fast}}(x, y; \theta) \big)^2 \Big], \quad (4)$$

where $\mathrm{Unif}(y)$ represents a uniform distribution over classes. If the model's predictions are forced to satisfy the Efficiency constraint, then given a large enough dataset and a sufficiently expressive model class for $\phi_{\mathrm{fast}}$, the global optimizer $\phi_{\mathrm{fast}}(x, y; \theta^*)$ is a function that outputs exact Shapley values (see proof in appendix A). Formally, the global optimizer satisfies the following:

$$\phi_{\mathrm{fast}}(x, y; \theta^*) = \phi(v_{x,y}) \quad \text{almost surely in } p(x, y). \quad (5)$$
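The sketch below illustrates how the objective in eq. (4) can be estimated by Monte Carlo; subset sampling follows the Shapley kernel's size marginal, which is proportional to $1/(k(d-k))$. The hypothetical `explainer(x, y)` returning per-feature estimates and `value_fn(x, s, y)` evaluating $v_{x,y}(s)$ are placeholder interfaces, and this is an illustrative PyTorch sketch, not the paper's reference implementation.

```python
import torch

def sample_shapley_subsets(d, n):
    """Sample n subsets s in {0,1}^d from the Shapley kernel p(s).

    The size marginal is proportional to 1 / (k * (d - k)) for k = 1..d-1;
    given a size, the subset is uniform among subsets of that size.
    """
    sizes = torch.arange(1, d)
    probs = 1.0 / (sizes * (d - sizes))
    probs = probs / probs.sum()
    k = sizes[torch.multinomial(probs, n, replacement=True)]
    s = torch.zeros(n, d)
    for row, ki in enumerate(k):
        idx = torch.randperm(d)[:ki]
        s[row, idx] = 1.0
    return s

def fastshap_loss(explainer, value_fn, x, y, n_subsets=32):
    """Monte Carlo estimate of the weighted least squares loss in eq. (4).

    explainer(x, y) returns per-feature Shapley estimates of shape (batch, d);
    value_fn(x, s, y) evaluates v_{x,y}(s) for a batch of inputs and one subset
    layout per input. Both are placeholders for models you have trained.
    """
    batch, d = x.shape
    phi = explainer(x, y)                                    # (batch, d)
    s = sample_shapley_subsets(d, n_subsets)                 # (n_subsets, d)
    v0 = value_fn(x, torch.zeros(batch, d), y)               # v(0), shape (batch,)
    loss = 0.0
    for si in s:
        vs = value_fn(x, si.expand(batch, d), y)             # v(s), shape (batch,)
        pred = phi @ si                                      # s^T phi, shape (batch,)
        loss = loss + ((vs - v0 - pred) ** 2).mean()
    return loss / n_subsets
```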
We explore two approaches to address the efficiency requirement. First, we can enforce efficiency by adjusting the Shapley value predictions using their additive efficient normalization (Ruiz et al., 1998), which applies the following operation to the model's outputs:

$$\phi^{\mathrm{eff}}_{\mathrm{fast}}(x, y; \theta) = \phi_{\mathrm{fast}}(x, y; \theta) + \frac{\mathbf{1}}{d} \underbrace{\big( v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) - \mathbf{1}^\top \phi_{\mathrm{fast}}(x, y; \theta) \big)}_{\text{Efficiency gap}}. \quad (6)$$

The normalization step can be applied at inference time and optionally during training; in appendix B, we show that this step is guaranteed to make the estimates closer to the true Shapley values. Second, we can relax the efficiency property by augmenting $\mathcal{L}(\theta)$ with a penalty on the efficiency gap (see eq. (6)); the penalty requires a parameter $\gamma > 0$, and as we set $\gamma \to \infty$ we can guarantee that efficiency holds (see appendix A). Algorithm 1 summarizes our training approach.

Algorithm 1: FastSHAP training
    Input: value function $v_{x,y}$, learning rate $\alpha$
    Output: FastSHAP explainer $\phi_{\mathrm{fast}}(x, y; \theta)$
    initialize $\phi_{\mathrm{fast}}(x, y; \theta)$
    while not converged do
        sample $x \sim p(x)$, $y \sim \mathrm{Unif}(y)$, $s \sim p(s)$
        predict $\hat{\phi} \leftarrow \phi_{\mathrm{fast}}(x, y; \theta)$
        if normalize then
            set $\hat{\phi} \leftarrow \hat{\phi} + d^{-1}\big( v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) - \mathbf{1}^\top \hat{\phi} \big)$
        end
        calculate $\mathcal{L} \leftarrow \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \hat{\phi} \big)^2$
        update $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$
    end

Empirical considerations. Optimizing $\mathcal{L}(\theta)$ using a single set of samples $(x, y, s)$ is problematic because of high variance in the gradients, which can lead to poor optimization. We therefore consider several steps to reduce gradient variance. First, as is conventional in deep learning, we minibatch across multiple samples from $p(x)$. Next, when possible, we calculate the loss jointly across all classes $y \in \{1, \ldots, K\}$. Then, we experiment with using multiple samples $s \sim p(s)$ for each input sample $x$. Finally, we explore paired sampling, where each sample $s$ is paired with its complement $\mathbf{1} - s$, which has been shown to reduce KernelSHAP's variance (Covert and Lee, 2021). Appendix C shows proofs that these steps are guaranteed to reduce gradient variance, and ablation experiments in appendix D demonstrate their improvement on FastSHAP's accuracy.

3.2 A DEFAULT VALUE FUNCTION FOR FASTSHAP

FastSHAP has the flexibility to work with any value function $v_{x,y}(s)$. Here, we describe a default value function that is useful for explaining predictions from a classification model. The value function's aim is to assess, for each subset $s$, the classification probability when only the features $x_s$ are observed. Because most models $f(x; \eta)$ do not support making predictions without all the features, we require an approximation that simulates the inclusion of only $x_s$ (Covert et al., 2021). To this end, we use a supervised surrogate model (Frye et al., 2020; Jethani et al., 2021) to approximate marginalizing out the remaining features $\mathbf{x}_{1-s}$ using their conditional distribution.

Separate from the original model $f(x; \eta)$, the surrogate model $p_{\mathrm{surr}}(y \mid m(x, s); \beta)$ takes as input a vector of masked features $m(x, s)$, where the masking function $m$ replaces features $x_i$ such that $s_i = 0$ with a [mask] value that is not in the support of $\mathcal{X}$. Similar to prior work (Frye et al., 2020; Jethani et al., 2021), the parameters $\beta$ are learned by minimizing the following loss function:

$$\mathcal{L}(\beta) = \mathbb{E}_{p(x)} \, \mathbb{E}_{p(s)} \Big[ D_{\mathrm{KL}}\big( f(x; \eta) \,\|\, p_{\mathrm{surr}}(y \mid m(x, s); \beta) \big) \Big]. \quad (7)$$

It has been shown that the global optimizer of eq. (7), or $p_{\mathrm{surr}}(y \mid m(x, s); \beta^*)$, is equivalent to marginalizing out features from $f(x; \eta)$ with their conditional distribution (Covert et al., 2021):

$$p_{\mathrm{surr}}(y \mid m(x, s); \beta^*) = \mathbb{E}\big[ f_y(\mathbf{x}; \eta) \mid \mathbf{x}_s = x_s \big]. \quad (8)$$

The choice of distribution $p(s)$ does not affect the global optimizer of eq. (7), but we use the Shapley kernel to put more weight on subsets likely to be encountered when training FastSHAP.
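As a rough illustration of eq. (7), the sketch below distills the original model into a surrogate that receives masked inputs. `MASK_VALUE`, `surrogate`, and `original_model` are placeholder names of our own; a real implementation would use a mask representation appropriate to the data type (e.g., a learned mask token for images or text).

```python
import torch
import torch.nn.functional as F

MASK_VALUE = -999.0  # placeholder outside the support of x; an assumption for this sketch

def mask_features(x, s):
    """m(x, s): replace features with s_i = 0 by the [mask] value."""
    return torch.where(s.bool(), x, torch.full_like(x, MASK_VALUE))

def surrogate_loss(surrogate, original_model, x, s):
    """Monte Carlo estimate of the KL objective in eq. (7).

    original_model(x) and surrogate(x_masked) both return class probabilities.
    At the optimum, the surrogate marginalizes out the masked features under
    their conditional distribution, as stated in eq. (8).
    """
    with torch.no_grad():
        target = original_model(x)                       # f(x; eta), shape (batch, K)
    log_pred = torch.log(surrogate(mask_features(x, s)) + 1e-12)
    # F.kl_div with log-prob inputs computes KL(target || prediction).
    return F.kl_div(log_pred, target, reduction="batchmean")
```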
We use the surrogate model as a default choice for two reasons. First, it requires a single prediction for each evaluation of $v_{x,y}(s)$, which permits faster training than the common approach of averaging across many background samples (Lundberg and Lee, 2017; Janzing et al., 2020). Second, it yields explanations that reflect the model's dependence on the information communicated by each feature, rather than its algebraic dependence (Frye et al., 2020; Covert et al., 2021).

4 RELATED WORK

Recent work on Shapley value explanations has largely focused on how to remove features (Aas et al., 2019; Frye et al., 2020; Covert et al., 2021) and how to approximate Shapley values efficiently (Chen et al., 2018b; Ancona et al., 2019; Lundberg et al., 2020; Covert and Lee, 2021). Model-specific approximations are relatively fast, but they often introduce bias and are entangled with specific feature removal approaches (Shrikumar et al., 2017; Ancona et al., 2019; Lundberg et al., 2020). In contrast, model-agnostic stochastic approximations are more flexible, but they must trade off run-time and accuracy in the explanation. For example, KernelSHAP samples subsets to approximate the solution to a weighted least squares problem (Lundberg and Lee, 2017), while other approaches sample marginal contributions (Castro et al., 2009; Štrumbelj and Kononenko, 2014) or feature permutations (Illés and Kerényi, 2019; Mitchell et al., 2021). FastSHAP trains an explainer model to output an estimate that would otherwise require orders of magnitude more model evaluations, and, unlike other fast approximations, it is agnostic to the model class and feature removal approach.

[Figure 2: Comparison of Shapley value approximation accuracy across methods. Using three datasets (news, census, bankruptcy), each panel plots the mean $\ell_2$ distance to the ground truth against the number of model evaluations for KernelSHAP, KernelSHAP (paired), permutation sampling, permutation sampling (antithetical), and FastSHAP. FastSHAP is represented by a horizontal line since it requires only a single forward pass. The baselines require 200-2,000 model evaluations to achieve FastSHAP's level of accuracy.]

Other methods have been proposed to generate explanations using learned explainer models. These are referred to as amortized explanation methods (Covert et al., 2021; Jethani et al., 2021), and they include several approaches that are comparable to gradient-based methods in terms of compute time (Dabkowski and Gal, 2017; Chen et al., 2018a; Yoon et al., 2018; Schwab and Karlen, 2019; Schulz et al., 2020; Jethani et al., 2021). Notably, one approach generates a training dataset of ground truth explanations and then learns an explainer model to output explanations directly (Schwab and Karlen, 2019), a principle that can be applied with any attribution method, at least in theory. However, for Shapley values, generating a large training set would be very costly, so FastSHAP sidesteps the need for a training set using a custom loss function based on the Shapley value's weighted least squares characterization (Charnes et al., 1988).
5 STRUCTURED DATA EXPERIMENTS

We analyze FastSHAP's performance by comparing it to several well-understood baselines. First, we evaluate its accuracy on tabular (structured) datasets by comparing its outputs to the ground truth Shapley values. Then, to disentangle the benefits of amortization from the in-distribution value function, we make the same comparisons using different value function formulations $v_{x,y}(s)$. Unless otherwise stated, we use the surrogate model value function introduced in section 3.2. Later, in section 6, we test FastSHAP's ability to generate image explanations.

Baseline methods. To contextualize FastSHAP's accuracy, we compare it to several non-amortized stochastic estimators. First, we compare to KernelSHAP (Lundberg and Lee, 2017) and its acceleration that uses paired sampling (Covert and Lee, 2021). Next, we compare to a permutation sampling approach and its acceleration that uses antithetical sampling (Mitchell et al., 2021). As a performance metric, we calculate the proximity to Shapley values that were obtained by running KernelSHAP to convergence; we use these values as our ground truth because KernelSHAP is known to converge to the true Shapley values given infinite samples (Covert and Lee, 2021). These baselines were all run using an open-source implementation.²

² https://github.com/iancovert/shapley-regression/ (License: MIT)

Implementation details. We use either neural networks or tree-based models for each of $f(x; \eta)$ and $p_{\mathrm{surr}}(y \mid m(x, s); \beta)$. The FastSHAP explainer model $\phi_{\mathrm{fast}}(x, y; \theta)$ is implemented with a network $g(x; \theta) : \mathcal{X} \mapsto \mathbb{R}^{d \times |\mathcal{Y}|}$ that outputs a vector of Shapley values for every $y \in \mathcal{Y}$; deep neural networks are ideal for FastSHAP because they have high representation capacity, they can provide many-to-many mappings, and they can be trained by stochastic gradient descent. Appendix D contains more details about our implementation, including model classes, network architectures and training hyperparameters.

[Figure 3: FastSHAP approximation accuracy for different value functions. Using the marketing dataset, each panel plots the mean $\ell_2$ distance to the ground truth against the number of model evaluations for one value function (surrogate, marginal, baseline). FastSHAP provides accurate Shapley value estimates regardless of the value function, with the baselines requiring 200-1,000 model evaluations to achieve FastSHAP's level of accuracy. Error bars represent 95% confidence intervals.]

We also perform a series of experiments to determine several training hyperparameters for FastSHAP, exploring (1) whether or not to use paired sampling, (2) the number of subset samples to use, and (3) how to best enforce the efficiency constraint. Based on the results (see appendix D), we use the following settings for our tabular data experiments: we use paired sampling, between 32 and 64 samples of $s$ per $x$ sample, additive efficient normalization during both training and inference, and we set $\gamma = 0$ (since the normalization step is sufficient to enforce efficiency).
5.1 ACCURACY OF FASTSHAP EXPLANATIONS

Here, we test whether FastSHAP's estimates are close to the ground truth Shapley values. Our experiments use data from a 1994 United States census, a bank marketing campaign, bankruptcy statistics, and online news articles (Dua and Graff, 2017). The census data contains 12 input features, and the binary label indicates whether a person makes over $50K a year (Kohavi et al., 1996). The marketing dataset contains 17 input features, and the label indicates whether the customer subscribed to a term deposit (Moro et al., 2014). The bankruptcy dataset contains 96 features describing various companies and whether they went bankrupt (Liang et al., 2016). The news dataset contains 60 numerical features about articles published on Mashable, and our label indicates whether the share count exceeds the median number (1,400) (Fernandes et al., 2015). The datasets were each split 80/10/10 for training, validation and testing.

In fig. 2, we show the distance of each method's estimates to the ground truth as a function of the number of model evaluations for the news, census and bankruptcy datasets. Figure 3 shows results for the marketing dataset with three different value functions (see section 5.2). For the baselines, each sample $s$ requires evaluating the model given a subset of features, but since FastSHAP requires only a single forward pass of $\phi_{\mathrm{fast}}(x, y; \theta)$, we show it as a horizontal line. To reach FastSHAP's level of accuracy on the news, census and bankruptcy datasets, KernelSHAP requires between 1,200-2,000 model evaluations; like prior work (Covert and Lee, 2021), we find that paired sampling improves KernelSHAP's rate of convergence, helping reach FastSHAP's accuracy in 250-1,000 model evaluations. The permutation sampling baselines tend to be faster: the original version requires between 300-1,000 evaluations, and antithetical sampling takes 200-500 evaluations to reach an accuracy equivalent to FastSHAP. Across all four datasets, however, FastSHAP achieves its level of accuracy at least 600× faster than the original version of KernelSHAP, and 200× faster than the best non-amortized baseline.

5.2 DISENTANGLING AMORTIZATION AND THE CHOICE OF VALUE FUNCTION

In this experiment, we verify that FastSHAP produces accurate Shapley value estimates regardless of the choice of value function. We use the marketing dataset for this experiment and test the following value functions (sketched in code at the end of this subsection):

1. (Surrogate/In-distribution) $v_{x,y}(s) = p_{\mathrm{surr}}(y \mid m(x, s); \beta)$
2. (Marginal/Out-of-distribution) $v_{x,y}(s) = \mathbb{E}_{p(\mathbf{x}_{1-s})}\big[ f_y(x_s, \mathbf{x}_{1-s}; \eta) \big]$
3. (Baseline removal) $v_{x,y}(s) = f_y(x_s, x^b_{1-s}; \eta)$, where $x^b \in \mathcal{X}$ are fixed baseline values (the mean for continuous features and the mode for discrete ones)

In fig. 3 we compare FastSHAP to the same non-amortized baseline methods, where each method generates Shapley value estimates using the value functions listed above. The results show that FastSHAP maintains the same computational advantage across all three cases: to achieve the same accuracy as FastSHAP's single forward pass, the baseline methods require at least 200 model evaluations, but in some cases up to nearly 1,000.
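For concreteness, the three value functions above might be sketched as follows, reusing the hypothetical `mask_features` helper from the section 3.2 sketch; the background-sample approximation of the marginal expectation and all names here are illustrative assumptions, not the paper's implementation.

```python
import torch

def surrogate_value(surrogate, x, s, y):
    """In-distribution value function: v_{x,y}(s) = p_surr(y | m(x, s))."""
    return surrogate(mask_features(x, s))[torch.arange(len(x)), y]

def marginal_value(model, x, s, y, background, n_samples=32):
    """Marginal value function: held-out features are drawn from their joint
    marginal, approximated here by sampling rows of a background dataset."""
    out = 0.0
    for _ in range(n_samples):
        ref = background[torch.randint(len(background), (len(x),))]
        x_mixed = torch.where(s.bool(), x, ref)
        out = out + model(x_mixed)[torch.arange(len(x)), y]
    return out / n_samples

def baseline_value(model, x, s, y, x_baseline):
    """Baseline-removal value function: held-out features are set to fixed
    values (e.g., per-feature mean for continuous, mode for discrete)."""
    x_fixed = torch.where(s.bool(), x, x_baseline.expand_as(x))
    return model(x_fixed)[torch.arange(len(x)), y]
```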
6 IMAGE EXPERIMENTS

Images represent a challenging setting for Shapley values due to their high dimensionality and the computational cost of model evaluation. We therefore compare FastSHAP to KernelSHAP on two image datasets. We also consider several widely used gradient-based explanation methods, because they are the most commonly used methods for explaining image classifiers.

[Figure 4: Explanations generated by each method for CIFAR-10 images; columns compare FastSHAP against KernelSHAP, KernelSHAP-S, Integrated Gradients, and SmoothGrad.]

6.1 DATASETS

We consider two popular image datasets for our experiments. CIFAR-10 (Krizhevsky et al., 2009) contains 60,000 32×32 images across 10 classes, and we use 50,000 samples for training and 5,000 samples each for validation and testing. Each image is resized to 224×224 using bilinear interpolation to interface with the ResNet-50 architecture (He et al., 2016). Figure 4 shows example CIFAR-10 explanations generated by each method. The Imagenette dataset (Howard and Gugger, 2020), a subset of 10 classes from the ImageNet dataset, contains 13,394 total images. Each image is cropped to keep the 224×224 central region, and the data is split 9,469/1,963/1,962. Example Imagenette explanations are shown in fig. 1.

6.2 EXPLANATION METHODS

We test three Shapley value estimators: FastSHAP, KernelSHAP, and DeepSHAP (Lundberg and Lee, 2017), where the last is an existing approximation designed for neural networks. We test KernelSHAP with the zeros baseline value function, which we refer to simply as KernelSHAP, and with the in-distribution surrogate value function, which we refer to as KernelSHAP-S. We also compare these methods to the gradient-based explanation methods Grad-CAM (Selvaraju et al., 2017), SmoothGrad (Smilkov et al., 2017) and IntGrad (Sundararajan et al., 2017). Gradient-based methods are relatively fast and have therefore been widely adopted for explaining image classifiers. Finally, we also compare to CXPlain (Schwab and Karlen, 2019), an amortized explanation method that generates attributions that are not based on Shapley values.

Implementation details. The models $f(x; \eta)$ and $p_{\mathrm{surr}}(y \mid m(x, s); \beta)$ are both ResNet-50 networks (He et al., 2016) pretrained on ImageNet and fine-tuned on the corresponding imaging dataset. FastSHAP, CXPlain, and KernelSHAP are all implemented to output 14×14 superpixel attributions for each class. For FastSHAP, we parameterize $\phi_{\mathrm{fast}}(x, y; \theta)$ to output superpixel attributions: we use an identical pretrained ResNet-50 but replace the final layers with a 1×1 convolutional layer so that the output is 14×14×K (see details in appendix D); a code sketch of this architecture follows below. We use an identical network to produce attributions for CXPlain. For FastSHAP, we do not use additive efficient normalization, and we set $\gamma = 0$; we find that this relaxation of the Shapley value's efficiency property does not inhibit FastSHAP's ability to produce high-quality image explanations. KernelSHAP and KernelSHAP-S are implemented using the shap³ package's default parameters, and Grad-CAM, SmoothGrad, and IntGrad are implemented using the tf-explain⁴ package's default parameters.

³ https://shap.readthedocs.io/en/latest/ (License: MIT)
⁴ https://tf-explain.readthedocs.io/en/latest/ (License: MIT)
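A minimal sketch of the explainer architecture described above, assuming torchvision's pretrained ResNet-50; the truncation at the 14×14 stage follows appendix D, while the lazily initialized 1×1 convolution (which infers the incoming channel count) is our own simplification.

```python
import torch
import torchvision

class FastSHAPImageExplainer(torch.nn.Module):
    """Sketch of the image explainer: a pretrained ResNet-50 truncated at the
    block whose spatial output is 14x14, followed by a 1x1 convolution with K
    filters, so the y-th output slice holds the superpixel attributions for
    class y. Treat the details as illustrative assumptions.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Drop everything after the 14x14 stage (the last block, pooling, fc).
        self.features = torch.nn.Sequential(*list(backbone.children())[:-3])
        # LazyConv2d infers the incoming channel count on the first forward pass.
        self.head = torch.nn.LazyConv2d(num_classes, kernel_size=1)

    def forward(self, x):
        # (batch, 3, 224, 224) -> (batch, K, 14, 14): one attribution map per class
        return self.head(self.features(x))

explainer = FastSHAPImageExplainer(num_classes=10)
maps = explainer(torch.randn(2, 3, 224, 224))
print(maps.shape)  # torch.Size([2, 10, 14, 14])
```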
6.3 QUALITATIVE REMARKS

Explanations generated by each method are shown in fig. 4 for CIFAR-10 and fig. 1 for Imagenette (see appendix E for more examples). While a qualitative evaluation is insufficient to draw conclusions about each method, we offer several remarks on these examples. FastSHAP, and to some extent Grad-CAM, appear to reliably highlight the important objects, while the KernelSHAP explanations are noisy and fail to localize important regions. To a lesser extent, CXPlain occasionally highlights important regions. In comparison, the remaining methods (SmoothGrad, IntGrad and DeepSHAP) are granulated and highlight only small parts of the key objects. Next, we consider quantitative metrics that test these observations more systematically.

6.4 QUANTITATIVE EVALUATION

[Figure 5: Imagenette inclusion and exclusion curves. The change in top-1 accuracy as an increasing percentage of the pixels estimated to be important are excluded (top) or included (bottom), for KernelSHAP, KernelSHAP-S, Grad-CAM, Integrated Gradients, SmoothGrad, DeepSHAP, CXPlain, and FastSHAP.]

Evaluating the quality of Shapley value estimates requires access to ground truth Shapley values, which is computationally infeasible for images. Instead, we use two metrics that evaluate an explanation's ability to identify informative image regions. These metrics build on several recent proposals (Petsiuk et al., 2018; Hooker et al., 2018; Jethani et al., 2021) and evaluate the model's classification accuracy after including or excluding pixels according to their estimated importance.

Similar to Jethani et al. (2021), we begin by training a single evaluation model $p_{\mathrm{eval}}$ to approximate the $f(x; \eta)$ model's output given a subset of features; this serves as an alternative to training separate models on each set of features (Hooker et al., 2018) and offers a more realistic option than masking features with zeros (Schwab and Karlen, 2019). This procedure is analogous to the $p_{\mathrm{surr}}$ training procedure in section 3.2, except it sets the subset distribution to $p(s) = \mathrm{Uniform}(\{0, 1\}^d)$ to ensure all subsets are equally weighted.

Next, we analyze how the model's predictions change as we remove either important or unimportant features according to each explanation. Using a set of 1,000 images, each image is first labeled by the original model $f(x; \eta)$ using the most likely predicted class. We then use explanations generated by each method to produce feature rankings and compute the top-1 accuracy (a measure of agreement with the original model) as we either include or exclude the most important features, ranging from 0-100%. The area under each curve (AUC) is termed the Inclusion AUC or Exclusion AUC. These metrics match the idea that an accurate image explanation should (1) maximally degrade the performance of $p_{\mathrm{eval}}$ when important features are excluded, and (2) maximally improve the performance of $p_{\mathrm{eval}}$ when important features are included (Petsiuk et al., 2018; Hooker et al., 2018). The explanations are evaluated by removing superpixels; for gradient-based methods, we coarsen the explanations using the sum total importance within each superpixel. In appendix E, we replicate these metrics using log-odds rather than top-1 accuracy, finding a similar ordering among methods.
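The sketch below outlines how the Inclusion/Exclusion AUC could be computed. The placeholder `eval_model_accuracy(masks)` is assumed to report top-1 agreement with the original model when the evaluation model $p_{\mathrm{eval}}$ sees only the unmasked superpixels; the 11-point grid over inclusion fractions is an illustrative choice, not the paper's exact protocol.

```python
import numpy as np

def inclusion_exclusion_auc(attributions, eval_model_accuracy, mode="exclusion"):
    """Sketch of the Inclusion/Exclusion AUC metric.

    attributions: (n_images, d) importance scores, one per superpixel.
    eval_model_accuracy(masks): placeholder returning top-1 agreement with the
    original model when only features with mask=True are visible to p_eval.
    """
    n, d = attributions.shape
    order = np.argsort(-attributions, axis=1)      # most important first
    fractions = np.linspace(0.0, 1.0, 11)
    rows = np.arange(n)[:, None]
    accuracies = []
    for frac in fractions:
        k = int(round(frac * d))
        if mode == "exclusion":                    # hide the top-k features
            masks = np.ones((n, d), dtype=bool)
            masks[rows, order[:, :k]] = False
        else:                                      # show only the top-k features
            masks = np.zeros((n, d), dtype=bool)
            masks[rows, order[:, :k]] = True
        accuracies.append(eval_model_accuracy(masks))
    return np.trapz(accuracies, fractions)         # area under the curve
```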
Results. Table 1 shows the Inclusion and Exclusion AUC achieved by each method for both CIFAR-10 and Imagenette. In fig. 5, we also present the curves used to generate these AUCs for Imagenette. Lower Exclusion AUCs and higher Inclusion AUCs are better. These results show that FastSHAP outperforms all baseline methods when evaluated with Exclusion AUC: when the pixels identified as important by FastSHAP are removed from the images, the sharpest decline in top-1 accuracy is observed. Additionally, FastSHAP performs well when evaluated on the basis of Inclusion AUC, second only to KernelSHAP-S.

Table 1: Exclusion and Inclusion AUCs. Evaluation of each method on the basis of Exclusion AUC (lower is better) and Inclusion AUC (higher is better) calculated using top-1 accuracy. Parentheses indicate 95% confidence intervals, and the best methods are bolded in each column.

| Method | CIFAR-10 Exclusion AUC | CIFAR-10 Inclusion AUC | Imagenette Exclusion AUC | Imagenette Inclusion AUC |
|---|---|---|---|---|
| FastSHAP | **0.42 (0.41, 0.43)** | 0.78 (0.77, 0.79) | **0.51 (0.49, 0.52)** | 0.79 (0.78, 0.80) |
| KernelSHAP | 0.64 (0.63, 0.65) | 0.78 (0.77, 0.79) | 0.68 (0.67, 0.70) | 0.77 (0.75, 0.78) |
| KernelSHAP-S | 0.54 (0.52, 0.55) | **0.86 (0.85, 0.87)** | 0.61 (0.60, 0.62) | **0.82 (0.80, 0.83)** |
| Grad-CAM | 0.52 (0.51, 0.53) | 0.76 (0.75, 0.77) | 0.52 (0.50, 0.53) | 0.74 (0.73, 0.76) |
| Integrated Gradients | 0.55 (0.54, 0.56) | 0.74 (0.73, 0.75) | 0.65 (0.64, 0.67) | 0.73 (0.71, 0.74) |
| SmoothGrad | 0.70 (0.69, 0.71) | 0.72 (0.71, 0.73) | 0.72 (0.71, 0.73) | 0.73 (0.72, 0.75) |
| DeepSHAP | 0.65 (0.64, 0.66) | 0.79 (0.78, 0.80) | 0.69 (0.68, 0.71) | 0.74 (0.73, 0.75) |
| CXPlain | 0.56 (0.55, 0.57) | 0.71 (0.70, 0.72) | 0.60 (0.58, 0.61) | 0.72 (0.71, 0.74) |

For Imagenette, Grad-CAM performs competitively with FastSHAP on Exclusion AUC and KernelSHAP-S marginally beats FastSHAP on Inclusion AUC. KernelSHAP-S also outperforms on Inclusion AUC with CIFAR-10, which is perhaps surprising given its high level of noise (fig. 4). However, KernelSHAP-S does not do as well when evaluated using Exclusion AUC, and Grad-CAM does not do as well on Inclusion AUC. The remaining methods are, by and large, not competitive on either metric (except DeepSHAP on CIFAR-10 Inclusion AUC). An accurate explanation should perform well on both metrics, so these results show that FastSHAP provides the most versatile explanations, because it is the only approach to excel at both Inclusion and Exclusion AUC.

Finally, we also test FastSHAP's robustness to limited training data. In appendix E, we find that FastSHAP outperforms most baseline methods on Inclusion and Exclusion AUC when using just 25% of the Imagenette data, and that it remains competitive when using just 10%.

6.5 SPEED EVALUATION

Table 2: Training and explanation run-times for 1,000 images (in minutes).

| Explanation | CIFAR-10 | Imagenette |
|---|---|---|
| FastSHAP | 0.04 | 0.04 |
| KernelSHAP | 453.69 | 1089.50 |
| KernelSHAP-S | 460.10 | 586.12 |
| Grad-CAM | 0.38 | 0.30 |
| IntGrad | 0.91 | 0.92 |
| SmoothGrad | 1.00 | 1.05 |
| DeepSHAP | 5.39 | 6.01 |
| CXPlain | 0.04 | 0.04 |

| Training | CIFAR-10 | Imagenette |
|---|---|---|
| FastSHAP | 693.57 | 146.49 |
| KernelSHAP-S | 362.03 | 73.22 |
| CXPlain | 538.49 | 93.00 |

The image experiments were run using 8 cores of an Intel Xeon Gold 6148 processor and a single NVIDIA Tesla V100. Table 2 records the time required to explain 1,000 images. For FastSHAP, KernelSHAP-S and CXPlain, we also report the time required to train the surrogate and/or explainer models. The amortized explanation methods, FastSHAP and CXPlain, incur a fixed training cost but very low marginal cost for each explanation. The gradient-based methods are slightly slower, but KernelSHAP requires significantly more time. These results suggest that FastSHAP is well suited for real-time applications where it is crucial to keep explanation times as low as possible. Further, when users need to explain a large quantity of data, FastSHAP's low explanation cost can quickly compensate for its training time.

7 DISCUSSION

In this work, we introduced FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. To enable efficient training, we sidestepped the need for a training set and derived a learning approach from the Shapley value's weighted least squares characterization.
Our experiments demonstrate that FastSHAP can produce accurate Shapley value estimates while achieving a significant speedup over non-amortized approaches, as well as more accurate image explanations than popular gradient-based methods. While Shapley values provide a strong theoretical grounding for model explanation, they have not been widely adopted for explaining large-scale models due to their high computational cost. FastSHAP can solve this problem, making fast and high-quality explanations possible in fields such as computer vision and natural language processing. By casting model explanation as a learning problem, FastSHAP stands to benefit as the state of deep learning advances, and it opens a new direction of research for efficient Shapley value estimation.

8 REPRODUCIBILITY

Code to implement FastSHAP is available online in two separate repositories: https://github.com/iancovert/fastshap contains a PyTorch implementation and https://github.com/neiljethani/fastshap/ a TensorFlow implementation, both with examples of tabular and image data experiments. The complete code for our experiments is available at https://github.com/iclr1814/fastshap, and details are described throughout section 5 and section 6, with model architectures and hyperparameters reported in appendix D. Proofs for our theoretical claims are provided in appendix A, appendix B, and appendix C.

9 ACKNOWLEDGEMENTS

We thank the reviewers for their thoughtful feedback, and we thank the Lee Lab for helpful discussions. Neil Jethani was partially supported by NIH T32 GM136573. Mukund Sudarshan was partially supported by a PhRMA Foundation Predoctoral Fellowship. Mukund Sudarshan and Rajesh Ranganath were partly supported by NIH/NHLBI Award R01HL148248, and by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. Ian Covert and Su-In Lee were supported by the NSF Awards CAREER DBI-1552309 and DBI-1759487; the NIH Awards R35GM128638 and R01NIAAG061132; and the American Cancer Society Award 127332-RSG-15-097-01-TBG.

REFERENCES

Aas, K., Jullum, M., and Løland, A. (2019). Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464.

Ancona, M., Oztireli, C., and Gross, M. (2019). Explaining deep neural networks with a polynomial time algorithm for Shapley value approximation. In International Conference on Machine Learning, pages 272-281. PMLR.

Castro, J., Gómez, D., and Tejada, J. (2009). Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726-1730.

Charnes, A., Golany, B., Keane, M., and Rousseau, J. (1988). Extremal principle solutions of games in characteristic function form: core, Chebychev and Shapley value generalizations. In Econometrics of Planning and Efficiency, pages 123-133. Springer.

Chen, J., Song, L., Wainwright, M., and Jordan, M. (2018a). Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pages 883-892. PMLR.

Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018b). L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794.
Covert, I. and Lee, S.-I. (2021). Improving KernelSHAP: Practical Shapley value estimation using linear regression. In International Conference on Artificial Intelligence and Statistics, pages 3457-3465. PMLR.

Covert, I., Lundberg, S., and Lee, S.-I. (2020). Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems, 33.

Covert, I., Lundberg, S., and Lee, S.-I. (2021). Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1-90.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

Dabkowski, P. and Gal, Y. (2017). Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pages 6967-6976.

Datta, A., Sen, S., and Zick, Y. (2016). Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings - 2016 IEEE Symposium on Security and Privacy, SP 2016, pages 598-617. Institute of Electrical and Electronics Engineers Inc.

Dua, D. and Graff, C. (2017). UCI machine learning repository.

Fernandes, K., Vinagre, P., and Cortez, P. (2015). A proactive intelligent decision support system for predicting the popularity of online news. In Portuguese Conference on Artificial Intelligence, pages 535-546. Springer.

Frye, C., de Mijolla, D., Begley, T., Cowton, L., Stanley, M., and Feige, I. (2020). Shapley explainability on the data manifold. In International Conference on Learning Representations.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. (2018). A benchmark for interpretability methods in deep neural networks. arXiv preprint arXiv:1806.10758.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.

Howard, J. and Gugger, S. (2020). Fastai: A layered API for deep learning. Information, 11(2):108.

Illés, F. and Kerényi, P. (2019). Estimation of the Shapley value by ergodic sampling. arXiv preprint arXiv:1906.05224.

Janzing, D., Minorics, L., and Blöbaum, P. (2020). Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics, pages 2907-2916. PMLR.

Jethani, N., Sudarshan, M., Aphinyanaphongs, Y., and Ranganath, R. (2021). Have we learned to explain?: How interpretability methods can learn to encode predictions in their interpretations. In International Conference on Artificial Intelligence and Statistics, pages 1459-1467. PMLR.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30:3146-3154.

Kohavi, R. et al. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Knowledge Discovery and Data Mining, volume 96, pages 202-207.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Liang, D., Lu, C.-C., Tsai, C.-F., and Shih, G.-A. (2016). Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study. European Journal of Operational Research, 252(2):561-572.
Lipovetsky, S. and Conklin, M. (2001). Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319-330.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56-67.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30:4765-4774.

Mitchell, R., Cooper, J., Frank, E., and Holmes, G. (2021). Sampling permutations for Shapley value estimation. arXiv preprint arXiv:2104.12199.

Moro, S., Cortez, P., and Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22-31.

Petsiuk, V., Das, A., and Saenko, K. (2018). RISE: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.

Ruiz, L. M., Valenciano, F., and Zarzuelo, J. M. (1998). The family of least square values for transferable utility games. Games and Economic Behavior, 24(1-2):109-130.

Schulz, K., Sixt, L., Tombari, F., and Landgraf, T. (2020). Restricting the flow: Information bottlenecks for attribution. arXiv preprint arXiv:2001.00396.

Schwab, P. and Karlen, W. (2019). CXPlain: Causal explanations for model interpretation under uncertainty. In Advances in Neural Information Processing Systems, pages 10220-10230.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618-626.

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307-317.

Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145-3153. PMLR.

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.

Štrumbelj, E. and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647-665.

Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319-3328. PMLR.

Van den Broeck, G., Lykov, A., Schleich, M., and Suciu, D. (2021). On the tractability of SHAP explanations. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.

Wang, R., Wang, X., and Inouye, D. I. (2021). Shapley explanation networks. In International Conference on Learning Representations.

Yoon, J., Jordon, J., and van der Schaar, M. (2018). INVASE: Instance-wise variable selection using neural networks. In International Conference on Learning Representations.

A FASTSHAP GLOBAL OPTIMIZER

Here, we prove that FastSHAP is trained using an objective function whose global optimizer outputs the true Shapley values. Recall that the loss function for the explainer model $\phi_{\mathrm{fast}}(x, y; \theta)$ is

$$\mathcal{L}(\theta) = \mathbb{E}_{p(x)} \, \mathbb{E}_{\mathrm{Unif}(y)} \, \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{\mathrm{fast}}(x, y; \theta) \big)^2 \Big].$$
As mentioned in the main text, it is necessary to force the model to satisfy the Efficiency constraint, or the property that the predictions from $\phi_{\mathrm{fast}}(x, y; \theta)$ satisfy

$$\mathbf{1}^\top \phi_{\mathrm{fast}}(x, y; \theta) = v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) \quad \forall \, x \in \mathcal{X}, \, y \in \mathcal{Y}.$$

One option for guaranteeing the efficiency property is to adjust the model outputs using their additive efficient normalization (see section 3.1). Incorporating this constraint on the predictions, we can then view the loss function as an expectation across $(x, y)$ and write the expected loss for each sample $(x, y)$ as a separate optimization problem over the variable $\phi_{x,y} \in \mathbb{R}^d$:

$$\min_{\phi_{x,y}} \; \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{x,y} \big)^2 \Big] \quad \text{s.t.} \quad \mathbf{1}^\top \phi_{x,y} = v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}). \quad (9)$$

This is a constrained weighted least squares problem with a unique global minimizer, and it is precisely the Shapley value's weighted least squares characterization (see eq. (3)). We can therefore conclude that the optimal prediction for each pair $(x, y)$ is the true Shapley values, or that $\phi^*_{x,y} = \phi(v_{x,y})$. As a result, the global optimizer for our objective is a model $\phi_{\mathrm{fast}}(x, y; \theta^*)$ that outputs the true Shapley values almost everywhere in the data distribution $p(x, y)$.

Achieving the global optimum requires the ability to sample from $p(x)$ (or a sufficiently large dataset), perfect optimization, and a function class for $\phi_{\mathrm{fast}}(x, y; \theta)$ that is expressive enough to contain the global optimizer. The universal approximation theorem (Cybenko, 1989; Hornik, 1991) implies that a sufficiently large neural network can represent the Shapley value function to arbitrary accuracy as long as it is a continuous function. Specifically, we require the function $h_y(x) = \phi(v_{x,y})$ to be continuous in $x$ for all $y$. This holds in practice when we use a surrogate model parameterized by a continuous neural network, because $v_{x,y}(s)$ is continuous in $x$ for all $(s, y)$, and the Shapley value is a linear combination of $v_{x,y}(s)$ across different values of $s$ (see eq. (1)).

Another approach to enforce the efficiency property is by using efficiency regularization, or penalizing the efficiency gap in the explainer model's predictions (section 3.1). If we incorporate this regularization term with parameter $\gamma > 0$, then our objective yields the following optimization problem for each $(x, y)$ pair:

$$\min_{\phi_{x,y}} \; \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{x,y} \big)^2 \Big] + \gamma \big( v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) - \mathbf{1}^\top \phi_{x,y} \big)^2.$$

For finite hyperparameter values $\gamma \in [0, \infty)$, this problem relaxes the Shapley value's efficiency property and eliminates the requirement that predictions must sum to the grand coalition's value. However, as we let $\gamma \to \infty$, the penalty term becomes closer to the hard constraint in eq. (9). Note that in practice, we use finite values for $\gamma$ and observe sufficiently accurate results, whereas using excessively large $\gamma$ values would render gradient-based optimization ineffective.

B ADDITIVE EFFICIENT NORMALIZATION

Here, we provide a geometric interpretation for the additive efficient normalization step and prove that it is guaranteed to yield Shapley value estimates closer to their true values. Consider a game $v$ with Shapley values $\phi(v) \in \mathbb{R}^d$, and assume that we have Shapley value estimates $\hat{\phi} \in \mathbb{R}^d$ that do not satisfy the efficiency property. To force this property to hold, we can project these estimates onto the efficient hyperplane, or the subset of $\mathbb{R}^d$ where the efficiency property is satisfied. This corresponds to solving the following optimization problem over $\phi_{\mathrm{eff}} \in \mathbb{R}^d$:

$$\min_{\phi_{\mathrm{eff}}} \; \|\phi_{\mathrm{eff}} - \hat{\phi}\|^2 \quad \text{s.t.} \quad \mathbf{1}^\top \phi_{\mathrm{eff}} = v(\mathbf{1}) - v(\mathbf{0}).$$

We can solve the problem via its Lagrangian, denoted by $L(\phi_{\mathrm{eff}}, \nu)$, with the Lagrange multiplier $\nu \in \mathbb{R}$ as follows:

$$L(\phi_{\mathrm{eff}}, \nu) = \|\phi_{\mathrm{eff}} - \hat{\phi}\|^2 + \nu \big( v(\mathbf{1}) - v(\mathbf{0}) - \mathbf{1}^\top \phi_{\mathrm{eff}} \big),$$

whose solution is

$$\phi^*_{\mathrm{eff}} = \hat{\phi} + \frac{\mathbf{1}}{d} \big( v(\mathbf{1}) - v(\mathbf{0}) - \mathbf{1}^\top \hat{\phi} \big).$$

This transformation, where the efficiency gap is split evenly and added to each estimate, is known as additive efficient normalization (Ruiz et al., 1998). We implement it as an output layer for FastSHAP's predictions to ensure that they satisfy the efficiency property (section 3). This step can therefore be understood as a projection of the network's output onto the efficient hyperplane.

The normalization step is guaranteed to produce corrected estimates $\phi^*_{\mathrm{eff}}$ that are closer to the true Shapley values $\phi(v)$ than the original estimates $\hat{\phi}$. To see this, note that the projection step guarantees that $\hat{\phi} - \phi^*_{\mathrm{eff}}$ and $\phi^*_{\mathrm{eff}} - \phi(v)$ are orthogonal vectors, so the Pythagorean theorem yields the following inequality:

$$\|\phi(v) - \hat{\phi}\|^2 = \|\phi(v) - \phi^*_{\mathrm{eff}}\|^2 + \|\phi^*_{\mathrm{eff}} - \hat{\phi}\|^2 \geq \|\phi(v) - \phi^*_{\mathrm{eff}}\|^2.$$
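A short numerical check of the projection derived above; the helper name is ours, and the assertion verifies that the corrected estimates satisfy efficiency exactly.

```python
import torch

def additive_efficient_normalization(phi, grand, null):
    """Project Shapley value estimates onto the efficient hyperplane.

    phi:   raw estimates, shape (batch, d)
    grand: v(1) per sample, shape (batch,)
    null:  v(0) per sample, shape (batch,)
    The efficiency gap is split evenly across features, the closed-form
    solution of the projection problem above.
    """
    gap = grand - null - phi.sum(dim=1)            # shape (batch,)
    return phi + gap.unsqueeze(1) / phi.shape[1]

# Sanity check: the corrected estimates satisfy efficiency exactly.
phi = torch.randn(4, 5)
grand, null = torch.randn(4), torch.randn(4)
phi_eff = additive_efficient_normalization(phi, grand, null)
assert torch.allclose(phi_eff.sum(dim=1), grand - null, atol=1e-6)
```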
C REDUCING GRADIENT VARIANCE

Recall that our objective function $\mathcal{L}(\theta)$ is defined as follows:

$$\mathcal{L}(\theta) = \mathbb{E}_{p(x)} \, \mathbb{E}_{\mathrm{Unif}(y)} \, \mathbb{E}_{p(s)} \Big[ \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{\mathrm{fast}}(x, y; \theta) \big)^2 \Big].$$

The objective's gradient is given by

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{p(x)} \, \mathbb{E}_{\mathrm{Unif}(y)} \, \mathbb{E}_{p(s)} \big[ \nabla(x, y, s; \theta) \big], \quad (10)$$

where we define

$$\nabla(x, y, s; \theta) := \nabla_\theta \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \phi_{\mathrm{fast}}(x, y; \theta) \big)^2.$$

When FastSHAP is trained with a single sample $(x, y, s)$, the gradient covariance is given by $\mathrm{Cov}\big(\nabla(x, y, s; \theta)\big)$, which may be too large for effective optimization. We use several strategies to reduce gradient variance. First, given a model that outputs estimates for all classes $y \in \{1, \ldots, K\}$, we calculate the loss jointly for all classes. This yields gradients that we denote as $\nabla(x, s; \theta)$, defined as

$$\nabla(x, s; \theta) := \mathbb{E}_{\mathrm{Unif}(y)}\big[ \nabla(x, y, s; \theta) \big],$$

where we have the relationship

$$\mathrm{Cov}\big(\nabla(x, s; \theta)\big) \preceq \mathrm{Cov}\big(\nabla(x, y, s; \theta)\big)$$

due to the law of total covariance. Next, we consider minibatches of $b$ independent $x$ samples, which yields gradients $\nabla_b(x, s; \theta)$ with covariance given by

$$\mathrm{Cov}\big(\nabla_b(x, s; \theta)\big) = \frac{1}{b} \mathrm{Cov}\big(\nabla(x, s; \theta)\big).$$

We then consider sampling $m$ independent coalitions $s$ for each input $x$, resulting in the gradients $\nabla^m_b(x, s; \theta)$ with covariance given by

$$\mathrm{Cov}\big(\nabla^m_b(x, s; \theta)\big) = \frac{1}{mb} \mathrm{Cov}\big(\nabla(x, s; \theta)\big).$$

Finally, we consider a paired sampling approach, where each sample $s \sim p(s)$ is paired with its complement $\mathbf{1} - s$. Paired sampling has been shown to reduce KernelSHAP's variance (Covert and Lee, 2021), and our experiments show that it helps improve FastSHAP's accuracy (appendix D).

The training algorithm in the main text is simplified by omitting these gradient variance reduction techniques, so we also provide algorithm 2 below, which includes minibatching, multiple coalition samples, paired sampling, efficiency regularization and parallelization over all output classes.

Algorithm 2: Full FastSHAP training
    Input: value function $v_{x,y}$, learning rate $\alpha$, batch size $b$, samples $m$, penalty parameter $\gamma$
    Output: FastSHAP explainer $\phi_{\mathrm{fast}}(x, y; \theta)$
    initialize $\phi_{\mathrm{fast}}(x, y; \theta)$
    while not converged do
        set $R \leftarrow 0$, $\mathcal{L} \leftarrow 0$
        for $i = 1, \ldots, b$ do
            sample $x \sim p(x)$
            for $y = 1, \ldots, K$ do
                predict $\hat{\phi} \leftarrow \phi_{\mathrm{fast}}(x, y; \theta)$
                calculate $R \leftarrow R + \big( v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) - \mathbf{1}^\top \hat{\phi} \big)^2$   // Pre-normalization penalty
                if normalize then
                    set $\hat{\phi} \leftarrow \hat{\phi} + d^{-1}\big( v_{x,y}(\mathbf{1}) - v_{x,y}(\mathbf{0}) - \mathbf{1}^\top \hat{\phi} \big)$
                end
                for $j = 1, \ldots, m$ do
                    if paired sampling and $j \bmod 2 = 0$ then
                        set $s \leftarrow \mathbf{1} - s$   // Invert previous subset
                    else
                        sample $s \sim p(s)$
                    end
                    calculate $\mathcal{L} \leftarrow \mathcal{L} + \big( v_{x,y}(s) - v_{x,y}(\mathbf{0}) - s^\top \hat{\phi} \big)^2$
                end
            end
        end
        update $\theta \leftarrow \theta - \alpha \nabla_\theta \big( \tfrac{\mathcal{L}}{bmK} + \gamma R \big)$
    end
[Figure 6: FastSHAP accuracy as a function of the number of training samples (census, marketing, bankruptcy, and news datasets). Each panel plots the mean $\ell_2$ distance to the ground truth against the number of $s$ samples per $x$, with and without paired sampling. The results show that using more $s$ samples per $x$ improves FastSHAP's closeness to the ground truth Shapley values, as does the use of paired sampling.]

D FASTSHAP MODELS AND HYPERPARAMETERS

In this section, we describe the models and architectures used for each dataset, as well as the hyperparameters used when training FastSHAP.

D.1 MODELS

Tabular datasets. For the original model $f(x; \eta)$, we use neural networks for the news and marketing datasets and gradient boosted trees for the census (LightGBM (Ke et al., 2017)) and bankruptcy (XGBoost (Chen and Guestrin, 2016)) datasets. The FastSHAP model $\phi_{\mathrm{fast}}(x, y; \theta)$ and the surrogate model $p_{\mathrm{surr}}(y \mid m(x, s); \beta)$ are implemented using neural networks that consist of 2-3 fully connected layers with 128 units and ReLU activations (see the sketch at the end of this subsection). The $p_{\mathrm{surr}}$ models use a softmax output layer, while $\phi_{\mathrm{fast}}$ has no output activation. The models are trained using Adam with a learning rate of $10^{-3}$, and we use a learning rate scheduler that multiplies the learning rate by a factor of 0.5 after 3 epochs of no validation loss improvement. Early stopping was triggered after the validation loss ceased to improve for 10 epochs.

Image datasets. The models $f(x; \eta)$ and $p_{\mathrm{surr}}$ are ResNet-50 models pretrained on ImageNet. We use these without modification to the architecture and fine-tune them on each image dataset. To create the $\phi_{\mathrm{fast}}(x, y; \theta)$ model, we modify the architecture to return a tensor of size 14×14×K. First, the layers after the 4th convolutional block are removed; the output of this block is 14×14×256. We then append a 2D convolutional layer with K filters, each of size 1×1, so that the output is 14×14×K and the $y$th 14×14 slice corresponds to the superpixel-level Shapley values for each class $y \in \mathcal{Y}$. Each model is trained using Adam with a learning rate of $10^{-3}$, and we use a learning rate scheduler that multiplies the learning rate by a factor of 0.8 after 3 epochs of no validation loss improvement. Early stopping was triggered after the validation loss ceased to improve for 20 epochs.
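A minimal sketch of the tabular explainer described above, using the 128-unit fully connected layers reported for this appendix; everything beyond those reported settings (names, defaults, the layer count chosen here) is an illustrative assumption.

```python
import torch

class TabularExplainer(torch.nn.Module):
    """Tabular FastSHAP explainer sketch: a small fully connected network that
    maps an input x to a (d, K) matrix of per-class Shapley value estimates,
    with ReLU hidden activations and no output activation.
    """
    def __init__(self, d: int, num_classes: int, hidden: int = 128, layers: int = 3):
        super().__init__()
        blocks, in_dim = [], d
        for _ in range(layers):
            blocks += [torch.nn.Linear(in_dim, hidden), torch.nn.ReLU()]
            in_dim = hidden
        blocks.append(torch.nn.Linear(hidden, d * num_classes))
        self.net = torch.nn.Sequential(*blocks)
        self.d, self.num_classes = d, num_classes

    def forward(self, x):
        # (batch, d) -> (batch, d, K): one Shapley value per feature and class
        return self.net(x).view(-1, self.d, self.num_classes)
```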
D.2 FASTSHAP HYPERPARAMETERS

We now explore various settings of FastSHAP's hyperparameters and observe their impact on FastSHAP's performance. There are two types of hyperparameters: sampling hyperparameters, which affect the number of samples of $s$ taken during training, and efficiency hyperparameters, which control how we enforce the Efficiency constraint. Sampling hyperparameters include: (1) whether to use paired sampling, and (2) the number of samples of $s$ per $x$ to take during training. Efficiency hyperparameters include: (1) the choice of $\gamma$ in eq. (4), and (2) whether to perform the additive efficient normalization during training, inference or both.

To understand the effect of sampling hyperparameters, we perform experiments using the same tabular datasets from the main text. We use the in-distribution value function $p_{\mathrm{surr}}$ and compute the ground truth Shapley values the same way as in our previous experiments (i.e., by running KernelSHAP to convergence). Figure 6 shows the mean $\ell_2$ distance between FastSHAP's estimates and the ground truth. We find that across all four datasets, increasing the number of training samples of $s$ generally improves the mean $\ell_2$ distance to ground truth. We also find that for any fixed number of samples (greater than 1), using paired sampling improves FastSHAP's accuracy.

Table 3 shows the results of an ablation study for the efficiency hyperparameters. Normalization (or Norm.) refers to the additive efficient normalization step (applied during training and inference, or only during inference), and penalty refers to the efficiency regularization technique with the parameter set to $\gamma = 0.1$. We find that using normalization during training uniformly achieves better results than without normalization or with normalization only during inference. The efficiency regularization approach proves to be less effective, generally leading to less accurate Shapley value estimates. Based on these results, we opt to use additive efficient normalization in our tabular data experiments.

Table 3: FastSHAP ablation results. The distance to the ground truth Shapley values is displayed for several FastSHAP variations, showing that normalization helps and that the penalty is unnecessary.

| | Census $\ell_2$ | Census $\ell_1$ | Bankruptcy $\ell_2$ | Bankruptcy $\ell_1$ |
|---|---|---|---|---|
| Normalization | 0.0229 | 0.0863 | 0.0295 | 0.2436 |
| Normalization + Penalty | 0.0261 | 0.0971 | 0.0320 | 0.2740 |
| Inference Norm. | 0.0406 | 0.1512 | 0.0407 | 0.3450 |
| Inference Norm. + Penalty | 0.0452 | 0.1671 | 0.0473 | 0.4471 |
| No Norm. | 0.0501 | 0.1933 | 0.0408 | 0.3474 |
| No Norm. + Penalty | 0.0513 | 0.1926 | 0.0474 | 0.4490 |

E ADDITIONAL RESULTS FOR IMAGE EXPERIMENTS

In this section, we provide additional results for the FastSHAP image experiments.

E.1 INCLUSION AND EXCLUSION METRICS

Table 4 shows our inclusion and exclusion metrics when replicated using log-odds rather than accuracy. Similar to our metrics described in the main text, we choose the class predicted by the original model for each image, and we measure the average log-odds for that class as we include or exclude important features according to the explanations generated by each method. The results confirm roughly the same ordering between methods, with FastSHAP being the only method to achieve strong results on both metrics for both datasets. Figure 7 shows the raw inclusion and exclusion curves for both the accuracy and log-odds-derived metrics.

Table 4: Exclusion and Inclusion AUCs calculated using the average log-odds of the predicted class.
| Method | CIFAR-10 Exclusion AUC | CIFAR-10 Inclusion AUC | Imagenette Exclusion AUC | Imagenette Inclusion AUC |
|---|---|---|---|---|
| FastSHAP | 5.92 (5.62, 6.14) | 5.36 (5.16, 5.63) | 7.98 (7.68, 8.33) | 5.40 (5.16, 5.60) |
| KernelSHAP | 9.88 (9.55, 10.20) | 5.36 (5.14, 5.63) | 10.68 (10.36, 11.00) | 5.07 (4.81, 5.31) |
| KernelSHAP-S | 8.01 (7.68, 8.34) | 6.80 (6.65, 6.96) | 9.39 (9.11, 9.66) | 6.01 (5.78, 6.26) |
| Grad-CAM | 7.75 (7.44, 8.09) | 4.99 (4.81, 5.26) | 7.77 (7.49, 8.05) | 4.65 (4.40, 4.89) |
| Integrated Gradients | 8.34 (8.03, 8.61) | 4.58 (4.37, 4.85) | 10.14 (9.79, 10.46) | 4.34 (4.10, 4.58) |
| SmoothGrad | 10.99 (10.67, 11.29) | 4.30 (4.08, 4.58) | 11.19 (10.84, 11.48) | 4.47 (4.24, 4.70) |
| DeepSHAP | 9.96 (9.61, 10.24) | 5.47 (5.28, 5.76) | 10.93 (10.61, 11.20) | 4.63 (4.38, 4.85) |
| CXPlain | 8.34 (8.00, 8.58) | 4.02 (3.80, 4.31) | 9.13 (8.83, 9.41) | 4.33 (4.11, 4.57) |

[Figure 7: Additional inclusion and exclusion curves. The change in top-1 accuracy or average log-odds of the predicted class as an increasing percentage of the pixels estimated to be important are excluded (left) or included (right) from the set of 1,000 images. (a) Imagenette: log-odds-derived inclusion and exclusion curves. (b) CIFAR-10: accuracy-derived inclusion and exclusion curves. (c) CIFAR-10: log-odds-derived inclusion and exclusion curves.]

E.2 FASTSHAP ROBUSTNESS TO LIMITED DATA

To test FastSHAP's robustness to the size of the training data, we compare its performance when trained with varying amounts of the Imagenette dataset. Figure 8 plots the change in inclusion and exclusion AUC, calculated using top-1 accuracy, achieved when training FastSHAP with 95%, 85%, 75%, 50%, 25%, 15%, 10%, and 5% of the training dataset. We find that FastSHAP remains competitive when using just 10% of the data, and that it outperforms most baseline methods by a large margin when using just 25%.

[Figure 8: FastSHAP robustness to limited data. The curves are generated by training FastSHAP with varying portions of the Imagenette dataset and evaluating the Inclusion and Exclusion AUC. Horizontal lines show the Exclusion and Inclusion AUCs for each of the baseline methods, as reported in table 1.]

E.3 EXAMPLE FASTSHAP IMAGE EXPLANATIONS

Finally, we show additional explanations generated by FastSHAP and the baseline methods for both CIFAR-10 and Imagenette.

[Figure 9: Explanations generated by FastSHAP for 18 randomly selected CIFAR-10 images. Each column corresponds to a CIFAR-10 class, and the model's prediction (in logits) is provided below each image.]
[Figure 10: Explanations generated by FastSHAP for 18 randomly selected Imagenette images. Each column corresponds to an Imagenette class (e.g., English springer), and the model's prediction (in logits) is provided below each image.]

[Figure 11: Explanations generated for the predicted class for 15 randomly selected CIFAR-10 images. Each column corresponds to an explanation method (KernelSHAP, KernelSHAP-S, Integrated Gradients, SmoothGrad, ...), and each row is labeled with the image's corresponding class.]

[Figure 12: Explanations generated for the predicted class for 15 randomly selected Imagenette images. Each column corresponds to an explanation method (KernelSHAP, KernelSHAP-S, Integrated Gradients, SmoothGrad, ...), and each row is labeled with the image's corresponding class.]