# RKHS-SHAP: Shapley Values for Kernel Methods

Siu Lun Chau (Department of Statistics, University of Oxford), Robert Hu (Amazon, London), Javier Gonzalez (Microsoft Research, Cambridge), Dino Sejdinovic (School of Computer and Mathematical Sciences, University of Adelaide)*

*Work mainly done while the authors were with the Department of Statistics, University of Oxford. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract. Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values (SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose RKHS-SHAP, an attribution method for kernel machines that can efficiently compute both Interventional and Observational Shapley values using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations, a key yet often overlooked desideratum for consistent model interpretation. Further, we propose the Shapley regulariser, applicable to a general empirical risk minimisation framework, which allows learning while controlling the level of a specific feature's contribution to the model. We demonstrate that the Shapley regulariser enables learning that is robust to covariate shift of a given feature and fair learning that controls the SVs of sensitive features.

## 1 Introduction

Machine learning model interpretability is critical for researchers, data scientists, and developers to explain, debug and trust their models and understand the value of their findings.

[Figure 1: An example of RKHS-SHAP providing local explanations of why a kernel logistic model predicts a patient to be diabetic [11]. RKHS-SHAP provides a more granular level of explanation than studying lengthscales across dimensions.]

A typical way to understand model performance is to attribute importance scores to each input feature [5]. These scores can be computed either for an entire dataset to explain the model's overall behaviour (global) or individually for each single prediction (local). Understanding feature importances in reproducing kernel Hilbert space (RKHS) methods such as kernel ridge regression and support vector machines often requires studying the kernel lengthscales across dimensions [44, Chapter 5]: the larger the lengthscale, the less relevant the feature is to the model. Albeit straightforward, this approach comes with three shortcomings. (1) It only provides global feature importances and cannot be individualised to each single prediction. This is limiting, as global importance does not necessarily imply local importance [33]. In safety-critical domains such as medicine, understanding individual predictions is arguably more important than capturing general model performance; see Fig. 1 for an example of a local explanation. (2) Tuning the lengthscales typically requires a user-specified grid of possible configurations, selected via cross-validation. This pre-specification injects a substantial amount of human bias into the explanation task. (3) Lengthscales of kernels acting on different data types, such as binary and continuous variables, are difficult to compare and interpret.
To address this problem we turn to the Shapley value (SV) [35] literature, which has become central to many model explanation methods in recent years. The Shapley value originated in game theory as a way of fairly distributing credit among players working in coalition. Štrumbelj and Kononenko [40] were among the first to connect SVs with machine learning explanations by casting predictions as coalition games and features as players. Since then, a variety of SV-based explanation methods have been proposed, for example LINEARSHAP [40] for linear models, TREESHAP [24] for tree ensembles and DEEPSHAP [23] for deep networks. Model-agnostic methods such as DATASHAPLEY [15], SAGE [9] and KERNELSHAP [23] (the "kernel" in KERNELSHAP refers to its estimation procedure and is not related to RKHS kernel methods) have also been proposed. However, to the best of our knowledge, an SV-based local feature attribution framework suited for kernel methods has not been proposed. While one could still apply the model-agnostic KERNELSHAP to kernel machines, we show that by representing distributions as elements in the RKHS through kernel mean embeddings [38, 27], we can compute Shapley values more efficiently by circumventing the need to sample from and estimate an exponential number of densities required to compute the value functions, an essential component of Shapley value computation. We call this approach RKHS-SHAP to distinguish it from KERNELSHAP. Through the lens of the RKHS, we study Shapley values from a functional perspective and prove that our method is robust with respect to local perturbations under mild assumptions, an important yet often neglected criterion for explanation models as discussed in Hancox-Li [17]. In addition, a Shapley regulariser based on RKHS-SHAP is proposed for the empirical risk minimisation framework, allowing the modeller to control the degree of feature contribution during learning. We also discuss its application to learning that is robust to covariate shift of a given feature and to fair learning that controls contributions from sensitive features. We summarise our contributions below:

1. We propose RKHS-SHAP, a model-specific algorithm that computes Shapley values efficiently for kernel methods by circumventing the need to sample from and fit an exponential number of densities.
2. We prove that the corresponding Shapley values are robust to local perturbations under mild assumptions, thus providing consistent explanations for the kernel model.
3. We propose a Shapley regulariser for the empirical risk minimisation framework, allowing the modeller to control the degree of feature contribution during learning.

The paper is outlined as follows. In Section 2 we provide an overview of Shapley values and kernel methods. In Section 3 we introduce RKHS-SHAP and show robustness of the algorithm. The Shapley regulariser is introduced in Section 4. Section 5 provides extensive experiments. We conclude our work in Section 6.

## 2 Background Materials

Notation. We denote by $X, Y$ random variables (rv) with distribution $p(X, Y)$ taking values in the $d$-dimensional instance space $\mathcal{X} \subseteq \mathbb{R}^d$ and the label space $\mathcal{Y}$ (which could be $\mathbb{R}$ or discrete) respectively. We use $D = \{1, \ldots, d\}$ to denote the feature index set of $X$ and $S \subseteq D$ to denote a subset of features of interest. Lower case letters are used to denote observations from the corresponding rvs.
### 2.1 The Shapley Value

The Shapley value was first proposed by Shapley [35] to allocate performance credit across coalition game players in the following sense. Let $\nu : \{0,1\}^d \to \mathbb{R}$ be a coalition game that returns a score for each coalition $S \subseteq D_g$, where $D_g = \{1, \ldots, d\}$ represents a set of players. Assuming the grand coalition $D_g$ is participating and one wishes to provide the $i$th player with a fair allocation of the total profit $\nu(D_g)$, how should one do it? Surely this is related to each player's marginal contribution to the profit with respect to a coalition $S$, i.e. $\nu(S \cup \{i\}) - \nu(S)$. Shapley [35] proved that there exists a unique combination of marginal contributions that satisfies a set of favourable and fair game-theoretic axioms, commonly known as efficiency, null player, symmetry and additivity. This unique combination of contributions was later named the Shapley value. Formally, given a coalition game $\nu$, the Shapley value for player $i$ is computed as

$$\phi_i(\nu) = \frac{1}{d} \sum_{S \subseteq D_g \setminus \{i\}} \binom{d-1}{|S|}^{-1} \big[\nu(S \cup \{i\}) - \nu(S)\big]. \tag{1}$$

Choosing $\nu$ for ML explanation. In recent years, the Shapley value concept has become popular for feature attribution in machine learning. SHAP [23], SHAPLEY EFFECTS [37], DATASHAPLEY [15] and SAGE [9] are all examples that cast model explanations as coalition games by choosing problem-specific value functions $\nu$. Denote by $f : \mathcal{X} \to \mathcal{Y}$ the machine learning model of interest. Value functions for local attribution at an observation $x$ often take the form of the expectation of $f$ with respect to some reference distribution $r(X_{S^c} \mid X_S = x_S)$, where $S \subseteq D$ is a coalition of features in analogy to the game theory setting, such that

$$\nu_{x,S}(f) = \mathbb{E}_{r(X_{S^c} \mid X_S = x_S)}\big[f(\{x_S, X_{S^c}\})\big], \tag{2}$$

where $\{x_S, X_{S^c}\}$ denotes the concatenation of the arguments. We write $f$ as the main argument of $\nu$ to highlight its interpretation as a functional indexed by the local observation $x$ and coalition $S$. When $r$ is set to be the marginal distribution, i.e. $r(X_{S^c} \mid X_S = x_S) = p(X_{S^c})$, the value function is called the Interventional value function by Janzing et al. [20]. The Observational value function [13], on the other hand, sets the reference distribution to be the conditional distribution $p(X_{S^c} \mid X_S = x_S)$. Other choices of reference distribution lead to Shapley values with specific properties, e.g. better locality of explanations [14] or incorporation of causal knowledge [18]. In this work we restrict our attention to the marginal and conditional cases, as they are the two most commonly adopted choices in the literature.

Definition 1. Given a model $f$, local observation $x$ and a coalition set $S \subseteq D$, the Interventional and Observational value functions are defined as $\nu^{(I)}_{x,S}(f) := \mathbb{E}_{X_{S^c}}[f(\{x_S, X_{S^c}\})]$ and $\nu^{(O)}_{x,S}(f) := \mathbb{E}_{X_{S^c}}[f(\{x_S, X_{S^c}\}) \mid X_S = x_S]$.

The right choice of $\nu$ has been a long-standing debate in the community. While Janzing et al. [20] argued from a causal perspective that $\nu^{(I)}_{x,S}$ is the correct notion to represent missingness of features in an explanation task, Frye et al. [13] argued that computing the marginal expectation ignores feature correlation and leads to unrealistic results, since one would be evaluating the value function outside the data manifold. This controversy was further investigated by Chen et al. [6], who argued that the choice of $\nu$ is application dependent and that the two approaches each lead to an explanation that is either true to the model (marginal expectation) or true to the data (conditional expectation).
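To make Eq. (1) concrete, here is a minimal brute-force sketch (our own illustration, not code from the paper) that enumerates all coalitions for a small number of players and evaluates the Shapley value of each player from an arbitrary value function supplied as a callable; the three-player game is a hypothetical toy example.

```python
# A minimal brute-force illustration of Eq. (1) for a small number of players d.
# `nu` is any callable mapping a coalition (frozenset of player indices) to a score;
# the three-player game below is a hypothetical example, not taken from the paper.
from itertools import combinations
from math import comb

def shapley_value(nu, i, d):
    """Shapley value of player i for the coalition game nu over players {0, ..., d-1}."""
    players = [j for j in range(d) if j != i]
    phi = 0.0
    for size in range(d):
        for S in combinations(players, size):
            S = frozenset(S)
            # weight 1/d * 1/binom(d-1, |S|), marginal contribution nu(S ∪ {i}) - nu(S)
            weight = 1.0 / (d * comb(d - 1, len(S)))
            phi += weight * (nu(S | {i}) - nu(S))
    return phi

# Toy game: players 0 and 1 are perfect substitutes, player 2 contributes independently.
def nu(S):
    return (1.0 if (0 in S or 1 in S) else 0.0) + (0.5 if 2 in S else 0.0)

values = [shapley_value(nu, i, 3) for i in range(3)]
print(values)                                        # [0.5, 0.5, 0.5]
print(sum(values), nu(frozenset({0, 1, 2})))         # efficiency: both equal 1.5
```

The printed check illustrates the efficiency axiom: the values sum to the profit of the grand coalition, and the symmetric players 0 and 1 receive equal credit.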
When the context is clear, we denote the Shapley value of the $i$th feature of observation $x$ at $f$ as $\phi_{x,i}(f)$ and use a superscript to indicate whether it is Interventional, $\phi^{(I)}_{x,i}(f)$, or Observational, $\phi^{(O)}_{x,i}(f)$.

Computing Shapley values. While Shapley values can be estimated directly from Eq. (1) using a sampling approach [40], Lundberg and Lee [23] proposed KERNELSHAP, a more efficient algorithm for estimating Shapley values in high-dimensional feature spaces by casting Eq. (1) as a weighted least squares problem. Similar to LIME [33], for each data point $x$, model $f$ and feature coalition $S$, KERNELSHAP places a linear model $u_x(S) = \beta_{x,0} + \sum_{i \in S} \beta_{x,i}$ to explain the value function $\nu_{x,S}(f)$, which corresponds to solving the following regression problem: $\min_{\beta_{x,0},\ldots,\beta_{x,d}} \sum_{S \subseteq D} w(S)\,\big(u_x(S) - \nu_{x,S}(f)\big)^2$, where $w(S) = \frac{d-1}{\binom{d}{|S|}\,|S|\,(d-|S|)}$ is a carefully chosen weighting such that the regression coefficients recover the Shapley values. In particular, one sets $w(\emptyset) = w(D) = \infty$ to effectively enforce the constraints $\beta_{x,0} = \nu_{x,\emptyset}(f)$ and $\sum_{i \in D}\beta_{x,i} = \nu_{x,D}(f) - \nu_{x,\emptyset}(f)$. Denoting each subset $S \subseteq D$ by the corresponding binary vector $z \in \{0,1\}^d$, and with a slight abuse of notation setting $\nu_{\cdot,z} := \nu_{\cdot,S}$ and $w(z) := w(S)$ for $S = \{j : z[j] = 1\}$, we can express the Shapley values $\beta_x := [\beta_{x,0}, \ldots, \beta_{x,d}]$ as $\beta_x = (Z^\top W Z)^{-1} Z^\top W v_x$, where $Z \in \{0,1\}^{2^d \times d}$ is the binary matrix with rows $\{z_i\}_{i=1}^{2^d}$, $W$ is the diagonal matrix with entries $w_{ii} = w(z_i)$, and $v_x := [\nu_{x,z_1}(f), \ldots, \nu_{x,z_{2^d}}(f)]^\top \in \mathbb{R}^{2^d}$ is the vector of evaluated value functions, which is often estimated using sampling and data imputation. We explain the pathology of this approach in detail in Section 3. In practice, instead of evaluating all $2^d$ combinations, one subsamples coalitions $z \sim w(z)$ for computational efficiency [8].

Model-specific Shapley methods. KERNELSHAP provides efficient model-agnostic estimation of Shapley values. However, by leveraging additional structural knowledge about specific models, one can further improve computational performance. This leads to a variety of model-specific approximations, most of which rely on utilising model structure to speed up computation of the value functions. For example, LINEARSHAP [40] explains linear models using model coefficients directly. TREESHAP [24] provides an exponential reduction in complexity compared to KERNELSHAP by exploiting the tree structure. DEEPSHAP [23], on the other hand, combines DEEPLIFT [36] with Shapley values and uses the compositional nature of deep networks to improve efficiency. However, to the best of our knowledge, a kernel-method-specific Shapley value approximation has not been studied. Later in Section 3, we will show that, under a mild structural assumption on the RKHS, kernel methods can be used to speed up the computation in KERNELSHAP by estimating value functions analytically, thus circumventing the need to estimate and sample from an exponential number of densities.

Related work on kernel-based Shapley methods. Da Veiga's [10] work, which tackles global sensitivity analysis by proposing the kernel-based maximum mean discrepancy as a value function, is conceptually most similar to ours. However, there are multiple key differences in our contributions. Firstly, their method is designed for global explanation, while ours is for local explanation. Secondly, similarly to the interventional SV, they do not consider any conditional distributions, which leads to completely different estimation procedures. Lastly, their method aims at understanding the input/output relationship of a numerical simulation model, while ours focuses on understanding specific RKHS models learnt from a machine learning task, e.g. kernel ridge regression and kernel logistic regression.
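The weighted least squares solve described in the "Computing Shapley values" paragraph above can be sketched in a few lines for small $d$. The sketch below is our own illustration rather than the shap package's implementation; it approximates the infinite weights on the empty and grand coalitions with a large constant, and the value function `nu_x` is a placeholder the reader would supply.

```python
# Sketch of the KERNELSHAP weighted least squares solve for a single observation x.
# For small d we enumerate all 2^d coalitions; the infinite weights on the empty and
# grand coalitions are approximated by a large constant. `nu_x` is a placeholder for
# the evaluated value function nu_{x,S}(f) and must be supplied by the reader.
import numpy as np
from itertools import product
from math import comb

def kernelshap_weights_and_design(d, big=1e6):
    Z = np.array(list(product([0, 1], repeat=d)), dtype=float)    # 2^d x d binary matrix
    w = np.empty(len(Z))
    for r, z in enumerate(Z):
        s = int(z.sum())
        if s == 0 or s == d:
            w[r] = big                                   # approximates the equality constraints
        else:
            w[r] = (d - 1) / (comb(d, s) * s * (d - s))  # Shapley kernel weight w(S)
    Zb = np.hstack([np.ones((len(Z), 1)), Z])            # prepend intercept column for beta_0
    return Zb, w

def kernelshap_solve(nu_x, d):
    Zb, w = kernelshap_weights_and_design(d)
    v = np.array([nu_x(np.flatnonzero(z)) for z in Zb[:, 1:]])    # value of each coalition
    W = np.diag(w)
    beta = np.linalg.solve(Zb.T @ W @ Zb, Zb.T @ W @ v)           # (Z'WZ)^{-1} Z'W v
    return beta                                                   # [beta_0, beta_1, ..., beta_d]

# Sanity check with an additive game: the coefficients recover the per-feature payoffs.
payoffs = np.array([0.2, -1.0, 3.0])
beta = kernelshap_solve(lambda S: payoffs[S].sum(), d=3)
print(beta)   # approx [0, 0.2, -1.0, 3.0]
```

For an additive game the recovered coefficients coincide with the per-feature payoffs, which is a quick sanity check that the weighting $w(S)$ indeed recovers Shapley values.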
### 2.2 Kernel Methods

Kernel methods are one of the pillars of machine learning, as they provide flexible yet principled ways to model complex functional relationships and come with well-established statistical properties and theoretical guarantees.

Empirical risk minimisation. Recall that in the supervised learning framework, we learn a function $f : \mathcal{X} \to \mathcal{Y}$ from a hypothesis space $\mathcal{H}$, such that given a training set $(\mathbf{x}, \mathbf{y}) = \{(x_i, y_i)\}_{i=1}^n$ sampled independently and identically from $p$, the following empirical risk is minimised: $f^\ast = \arg\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda_f \Omega(f)$, where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is the loss function, $\Omega : \mathcal{H} \to \mathbb{R}$ a regularisation function and $\lambda_f$ a scalar controlling the level of regularisation. Denote by $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive definite kernel with feature map $\psi_x$ for input $x \in \mathcal{X}$ and by $\mathcal{H}_k$ the corresponding RKHS. If we pick $\mathcal{H}_k$ as our hypothesis space, then the Representer theorem [39] tells us that the optimal solution takes the form $f^\ast = \sum_{i=1}^n \alpha_i k(\cdot, x_i) = \Psi_{\mathbf{x}}\alpha$, where $\Psi_{\mathbf{x}} = [\psi_{x_1}, \ldots, \psi_{x_n}]$ is the feature matrix obtained by stacking feature maps along columns. If $\ell$ is the squared loss, the above optimisation is known as kernel ridge regression and $\alpha$ can be recovered in closed form as $\alpha = (K_{\mathbf{x}\mathbf{x}} + \lambda_f I)^{-1}\mathbf{y}$, where $K_{\mathbf{x}\mathbf{x}} = \Psi_{\mathbf{x}}^\top \Psi_{\mathbf{x}}$ is the kernel matrix. If $\ell$ is the logistic loss, the problem is known as kernel logistic regression, and $\alpha$ can be obtained using gradient descent.

Kernel embedding of distributions. An essential component of RKHS-SHAP is the embedding of both marginal and conditional distributions of features into the RKHS [38, 27], which allows one to estimate the value functions analytically. Formally, the kernel mean embedding (KME) of a marginal distribution $P_X$ is defined as $\mu_X := \mathbb{E}_X[\psi_X] = \int_{\mathcal{X}} \psi_x \, dP_X(x)$, and its empirical estimate is $\hat{\mu}_X := \frac{1}{n}\sum_{i=1}^n \psi_{x_i}$. Furthermore, given another kernel $g : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ with feature map $\psi_Y$ and RKHS $\mathcal{H}_g$, the conditional mean embedding (CME) of the conditional distribution $P_{Y|X=x}$ is defined as $\mu_{Y|X=x} := \mathbb{E}[\psi_Y \mid X = x] = \int_{\mathcal{Y}} \psi_y \, dP_{Y|X=x}(y)$. One way to understand the CME is to view it as an evaluation of a vector-valued (VV) function $\mu_{Y|X} : \mathcal{X} \to \mathcal{H}_g$ such that $\mu_{Y|X}(x) = \mu_{Y|X=x}$, which minimises the risk $\mathbb{E}_{p(X,Y)}\big[\lVert \psi_Y - \mu_{Y|X}(X)\rVert^2_{\mathcal{H}_g}\big]$ [16]. Let $\mathcal{L}(\mathcal{H}_g)$ be the space of bounded linear operators from $\mathcal{H}_g$ to itself. Denote by $\Gamma_{\mathbf{x}} : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{H}_g)$ the operator-valued kernel $\Gamma_{\mathbf{x}}(x, x') = k(x, x')\mathbf{1}$, with $\mathbf{1}$ the identity operator on $\mathcal{H}_g$, and denote by $\mathcal{H}_{\Gamma_{\mathbf{x}}}$ the corresponding vector-valued RKHS. Utilising the VV-Representer theorem [26], we can minimise the following empirical risk:

$$\hat{\mu}_{Y|X} = \arg\min_{\mu_{Y|X} \in \mathcal{H}_{\Gamma_{\mathbf{x}}}} \sum_{i=1}^n \lVert \psi_{y_i} - \mu_{Y|X}(x_i)\rVert^2_{\mathcal{H}_g} + n\eta \lVert \mu_{Y|X}\rVert^2_{\Gamma_{\mathbf{x}}},$$

where $\eta > 0$ is a regularisation parameter. This leads to the empirical estimate of the CME, $\hat{\mu}_{Y|X} = \Psi_{\mathbf{y}}(K_{\mathbf{x}\mathbf{x}} + n\eta I)^{-1}\Psi_{\mathbf{x}}^\top$, where $\Psi_{\mathbf{y}} := [\psi_{y_1}, \ldots, \psi_{y_n}]$ and $\Psi_{\mathbf{x}} := [\psi_{x_1}, \ldots, \psi_{x_n}]$ are feature matrices. Intuitively, this essentially turns CME estimation into a regression problem from $\mathcal{X}$ to the vector-valued labels $\psi_Y$. Please see Micchelli and Pontil [26] and Grünewälder et al. [16] for further discussion of vector-valued RKHSs and CMEs.
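The closed-form kernel ridge regression solution and the empirical CME estimator above translate directly into a few lines of code. The sketch below is ours, uses RBF kernels, and the synthetic data and hyperparameters (lengthscales, $\lambda$, $\eta$) are illustrative choices only.

```python
# Minimal sketch of closed-form kernel ridge regression and the empirical conditional
# mean embedding (CME) estimator described above, using RBF kernels on synthetic data.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# Kernel ridge regression: alpha = (K_xx + lambda_f I)^{-1} y, with f(x*) = K_{x* x} alpha.
lam = 1e-2
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
X_test = rng.normal(size=(5, 2))
f_test = rbf(X_test, X) @ alpha

# Empirical CME of X2 given X1: mu_{X2|X1=x1} = Psi_{x2}(K_{x1 x1} + n*eta*I)^{-1} k_{x1}(., x1).
eta = 1e-2
n = len(X)
K1 = rbf(X[:, [0]], X[:, [0]])                       # kernel on the conditioning feature X1
K2 = rbf(X[:, [1]], X[:, [1]])                       # kernel on the target feature X2
W = np.linalg.solve(K1 + n * eta * np.eye(n), K1)    # (K_{x1x1} + n*eta*I)^{-1} K_{x1x1}
# Entry (i, j) of K2 @ W is <psi_{x2_i}, mu_{X2|X1 = x1_j}>: conditional expectations of
# kernel evaluations, obtained without estimating any conditional density.
cme_pairings = K2 @ W
```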
In fact, when using finite-dimensional feature maps, as is the case when running Random Fourier Features [32] or Nyström methods [45] for scalability, one can reduce the computational complexity of evaluating the empirical CME from $O(n^3)$ to $O(b^3) + O(b^2 n)$ [27], where $b$ is the dimension of the feature map and can often be chosen much smaller than $n$ [21].

## 3 RKHS-SHAP

While KERNELSHAP is model agnostic, by restricting our attention to the class of kernel methods, faster Shapley value estimation can be derived. We assume our RKHS takes a tensor product structure, i.e. $\mathcal{H}_k = \bigotimes_{i=1}^d \mathcal{H}_{k^{(i)}}$, where $k^{(i)}$ is the kernel for dimension $i \in D$. This structural assumption allows us to decompose the value functionals into tensor products of embeddings and feature maps, so that we can estimate them analytically, as later shown in Prop. 2. Tensor product RKHSs are commonly used in practice, as they preserve universality of the kernels on the individual dimensions [42], thus providing a rich function space. Note that this assumption is not essential within our framework: for a non-product kernel, one can still evaluate the value functions using tools from conditional mean embeddings and utilise our interpretability pipeline without conditional density estimation. We show this in Appendix C. In the following, we lay out the disadvantages of the existing sampling and data-imputation approach and show that by estimating the value functionals as elements of the RKHS, we can circumvent the need to learn and sample from an exponential number of conditional densities, thus improving the computational efficiency of the estimation.

Estimating value functions by sampling. Estimating the Observational value function $\nu^{(O)}_{x,S}(f)$ is typically much harder than the Interventional value function $\nu^{(I)}_{x,S}(f)$, as it requires integration with respect to the unknown conditional density $p(X_{S^c} \mid X_S)$. Therefore, estimating OSVs often boils down to a two-stage approach: (1) conditional density estimation and (2) Monte Carlo averaging over imputed data, as in Aas et al. [1], who considered using a multivariate Gaussian and a Gaussian copula for the density estimation. Recently, an alternative way to estimate observational value functions was proposed by Frye et al. [13], who formulate the estimation as a regression problem and compute the value function using a masked neural network directly, without making any distributional assumption. This method shares conceptual similarities with ours but uses very different tools for the estimation. We highlight these differences in Appendix B. Once the conditional density $p(X_{S^c} \mid X_S)$ for each $S \subseteq D$ is estimated, the observational value function at the $i$th observation $x^i$ can be computed by averaging over $m$ Monte Carlo samples from the estimated conditional density, i.e. $\frac{1}{m}\sum_{j=1}^m f(\{x^i_S, x^j_{S^c}\})$, where $\{x^i_S, x^j_{S^c}\}$ is the concatenation of $x^i_S$ with the $j$th sample $x^j_{S^c}$ from $p(X_{S^c} \mid X_S = x^i_S)$. Note further that the Monte Carlo samples cannot be reused for another observation $x^k$, as the conditional densities are different. In other words, $n \times m$ Monte Carlo samples are required for each coalition $S$ if one wishes to compute Shapley values for all $n$ observations. This is clearly not desirable.
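For contrast with the embedding-based estimators introduced next, the following sketch shows the kind of two-stage estimator just described, in the spirit of the Gaussian approach of Aas et al. [1]: fit a multivariate Gaussian, sample the missing features from its conditional, and average the model over the imputed points. Function names and defaults are ours, not from any existing package, and fresh samples are indeed required for every observation and coalition, which is exactly the cost discussed above.

```python
# Sketch of the two-stage sampling estimator of the Observational value function
# nu^{(O)}_{x,S}(f): (1) fit a multivariate Gaussian to the data, (2) sample the missing
# features from the fitted conditional and average f over the imputed points.
import numpy as np

def gaussian_osv(f, X_train, x, S, m=100, rng=None):
    """Monte Carlo estimate of E[f({x_S, X_Sc}) | X_S = x_S] under a fitted Gaussian.
    `f` is assumed to take an (m, d) array and return m predictions."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = X_train.shape[1]
    S = list(S)
    Sc = [j for j in range(d) if j not in S]
    if not Sc:                                   # grand coalition: nothing to impute
        return float(f(x[None, :]).mean())
    mu, Sigma = X_train.mean(0), np.cov(X_train, rowvar=False)
    if not S:                                    # empty coalition: conditional = marginal
        cond_mu, cond_Sigma = mu, Sigma
    else:                                        # conditional Gaussian of X_Sc given X_S = x_S
        A = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
        cond_mu = mu[Sc] + A @ (x[S] - mu[S])
        cond_Sigma = Sigma[np.ix_(Sc, Sc)] - A @ Sigma[np.ix_(S, Sc)]
    samples = rng.multivariate_normal(cond_mu, cond_Sigma, size=m)
    imputed = np.tile(x.astype(float), (m, 1))
    imputed[:, Sc] = samples                     # keep x_S fixed, replace X_{S^c}
    return float(f(imputed).mean())              # fresh samples needed for every (x, S) pair
```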
In the spirit of Vapnik's principle ("when solving a problem of interest, try to avoid solving a more general problem as an intermediate step" [43, Section 1.9]), since our goal is to estimate the conditional expectations that lead to Shapley values, we do not solve the harder and more general problem of conditional density estimation as an intermediate step, but instead utilise the arsenal of kernel methods to estimate the conditional expectations directly. Further discussion comparing the complexity of RKHS-SHAP with density estimation methods can be found in Appendix A.

Estimating value functions using mean embeddings. If our model $f$ lives in $\mathcal{H}_k$, both the marginal and the conditional expectation can be estimated analytically without any sampling or density estimation. We first show that the Riesz representations [30] of both the Interventional and the Observational value functionals exist and are well defined in $\mathcal{H}_k$. In the following, for simplicity, we denote a functional and its corresponding Riesz representer by the same symbol; for example, we write $\nu_{x,S}(f) = \langle f, \nu_{x,S}\rangle_{\mathcal{H}_k}$ when the context is clear. Given a vector of $n$ instances $\mathbf{x}$, we denote the corresponding vector of value functions as $\nu_{\mathbf{x},S}(f) = \{\nu_{x_i,S}(f)\}_{i=1}^n$. All proofs of this paper can be found in Appendix D.

Proposition 2 (Riesz representations of value functionals). Let $k$ be the product kernel of $d$ bounded kernels $k^{(i)} : \mathcal{X}^{(i)} \times \mathcal{X}^{(i)} \to \mathbb{R}$, where $\mathcal{X}^{(i)}$ is the domain of the $i$th feature for $i \in D$. The Riesz representations of the Interventional and Observational value functionals then exist and can be written as $\nu^{(I)}_{x,S} = \psi_{x_S} \otimes \mu_{X_{S^c}}$ and $\nu^{(O)}_{x,S} = \psi_{x_S} \otimes \mu_{X_{S^c}|X_S = x_S}$, where $\psi_{x_S} := \bigotimes_{i \in S}\psi_{x^{(i)}}$, $\mu_{X_{S^c}} := \mathbb{E}\big[\bigotimes_{i \in S^c}\psi_{X^{(i)}}\big]$ and $\mu_{X_{S^c}|X_S = x_S} := \mathbb{E}\big[\bigotimes_{i \in S^c}\psi_{X^{(i)}} \mid X_S = x_S\big]$. The corresponding finite-sample estimators $\hat{\nu}^{(I)}_{x,S}$ and $\hat{\nu}^{(O)}_{x,S}$ are obtained by replacing the KME and CME components with their empirical estimators.

As a result, given $f^\ast = \Psi_{\mathbf{x}}\alpha$ trained on the dataset $(\mathbf{x}, \mathbf{y})$, Prop. 2 allows us to estimate the value functionals analytically, since $\hat{\nu}^{(I)}_{x,S}(f^\ast) = \langle f^\ast, \psi_{x_S}\otimes\hat{\mu}_{X_{S^c}}\rangle$ and $\hat{\nu}^{(O)}_{x,S}(f^\ast) = \langle f^\ast, \psi_{x_S}\otimes\hat{\mu}_{X_{S^c}|X_S = x_S}\rangle$. This corresponds to the direct non-parametric estimators of the value functions given in the following proposition, which circumvent the need for sampling or density estimation.

Proposition 3. Given a vector of instances $\mathbf{x}$ and $f = \Psi_{\mathbf{x}}\alpha$, the empirical estimates of the functionals can be computed as $\hat{\nu}^{(I)}_{\mathbf{x},S}(f) = \alpha^\top K^{(I)}_{\mathbf{x},S}$ and $\hat{\nu}^{(O)}_{\mathbf{x},S}(f) = \alpha^\top K^{(O)}_{\mathbf{x},S}$, respectively, where $K^{(I)}_{\mathbf{x},S} = K_{\mathbf{x}_S\mathbf{x}_S} \odot \frac{1}{n}\,\mathrm{diag}(K_{\mathbf{x}_{S^c}\mathbf{x}_{S^c}}\mathbf{1}_n)\,\mathbf{1}_n\mathbf{1}_n^\top$ and $K^{(O)}_{\mathbf{x},S} = K_{\mathbf{x}_S\mathbf{x}_S} \odot \Xi_S K_{\mathbf{x}_S\mathbf{x}_S}$, with $\mathbf{1}_n$ the all-one vector of length $n$, $\odot$ the Hadamard product and $\Xi_S = K_{\mathbf{x}_{S^c}\mathbf{x}_{S^c}}(K_{\mathbf{x}_S\mathbf{x}_S} + n\eta I)^{-1}$.

Finally, to obtain the Shapley values from these value functions, we deploy the same least squares approach as KERNELSHAP.

Proposition 4 (RKHS-SHAP). Given $f \in \mathcal{H}_k$ and $\nu$, the Shapley values $B \in \mathbb{R}^{d \times n}$ for all $d$ features and all $n$ inputs $\mathbf{x}$ can be computed as $B = (Z^\top W Z)^{-1} Z^\top W \hat{V}$, where $\hat{V}_{i,:} = \hat{\nu}_{\mathbf{x},S_i}(f)$.
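Putting Props. 2-4 together, a complete (if brute-force) RKHS-SHAP sketch needs only kernel matrices: per-coalition value functions from Prop. 3 followed by the KERNELSHAP weighted least squares solve of Prop. 4. The code below is our illustrative implementation, not the authors' released code; it enumerates all $2^d$ coalitions, uses product RBF kernels, and the lengthscale, regularisers and toy data are assumptions made for the example.

```python
# Sketch of RKHS-SHAP (Props. 2-4): value functions are computed analytically from
# kernel matrices, then Shapley values are recovered with the weighted least squares
# solve. Product RBF kernels and brute-force coalition enumeration are used for clarity.
import numpy as np
from itertools import product
from math import comb

def rbf_subset(X, S, ls=1.0):
    """Product of per-dimension RBF kernels over the coordinates in S (all-ones if S is empty)."""
    if len(S) == 0:
        return np.ones((X.shape[0], X.shape[0]))
    XS = X[:, S]
    d2 = ((XS[:, None, :] - XS[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def rkhs_shap(X, y, lam=1e-2, eta=1e-2, observational=False, ls=1.0, big=1e6):
    n, d = X.shape
    alpha = np.linalg.solve(rbf_subset(X, list(range(d)), ls) + lam * np.eye(n), y)  # KRR fit
    coalitions = list(product([0, 1], repeat=d))
    V = np.zeros((len(coalitions), n))          # hat{V}[r, :] = value function on coalition r
    w = np.zeros(len(coalitions))
    for r, z in enumerate(coalitions):
        S = [j for j in range(d) if z[j] == 1]
        Sc = [j for j in range(d) if z[j] == 0]
        K_S, K_Sc = rbf_subset(X, S, ls), rbf_subset(X, Sc, ls)
        if observational:                       # Prop. 3, observational case
            Xi = K_Sc @ np.linalg.inv(K_S + n * eta * np.eye(n))
            K_val = K_S * (Xi @ K_S)
        else:                                   # Prop. 3, interventional case
            K_val = K_S * K_Sc.mean(axis=1, keepdims=True)
        V[r] = alpha @ K_val
        s = len(S)
        w[r] = big if s in (0, d) else (d - 1) / (comb(d, s) * s * (d - s))
    Z = np.hstack([np.ones((len(coalitions), 1)), np.array(coalitions, dtype=float)])
    B = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w[:, None] * V))   # (Z'WZ)^{-1} Z'W V
    return B[1:]                                # d x n matrix of Shapley values

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
B = rkhs_shap(X, y, observational=False)
print(B.shape)        # (3, 200): one attribution per feature and observation
```

Note that no samples are drawn and no densities are fitted: every value function is a product of kernel matrices already available from training, which is the source of the speed-up reported in Section 5.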
Estimating value functions with specific models. To the best of our knowledge, TREESHAP [24] was the only machine-learning-model-specific SV algorithm computing conditional expectations using the properties of the model (a tree in this case) directly, rather than relying on some sort of sampling procedure and density estimation. However, it is unclear how to validate the assumptions about the feature distribution in TREESHAP, which are specified as "the distribution generated by the tree", as discussed by Sundararajan and Najmi [41]. In comparison, RKHS-SHAP does not place assumptions on the underlying feature distribution and computes the corresponding conditional expectations analytically via mean embeddings. However, one should note that each of these model-specific algorithms is only designed to explain a specific model class; it is therefore not informative to compare, e.g., TREESHAP values with RKHS-SHAP values, as they explain different models.

### 3.1 Robustness of RKHS-SHAP

Robustness of interpretability methods is important from both an epistemic and an ethical perspective, as discussed in Hancox-Li [17]. On the other hand, Alvarez-Melis and Jaakkola [2] showed empirically that Shapley methods, when used with complex non-linear black-box models such as neural networks, yield explanations that vary considerably for some neighbouring inputs, even if the deep network gives similar predictions in those neighbourhoods. In light of this, we analyse the Shapley values obtained from our proposed RKHS-SHAP and show that they are robust. To this end, we first formally define the Shapley functional.

Proposition 5 (Shapley functional). Given a value functional $\nu$ indexed by input $x$ and coalition $S$, the Shapley functional $\phi_{x,i} : \mathcal{H}_k \to \mathbb{R}$, such that $\phi_{x,i}(f)$ gives the $i$th Shapley value of $x$ on $f$, has the following Riesz representation in the RKHS:

$$\phi_{x,i} = \frac{1}{d}\sum_{S \subseteq D\setminus\{i\}} \binom{d-1}{|S|}^{-1}\big(\nu_{x,S\cup\{i\}} - \nu_{x,S}\big).$$

Analogously, we denote by $\phi^{(I)}_{x,i}$ and $\phi^{(O)}_{x,i}$ the Interventional Shapley functional (ISF) and the Observational Shapley functional (OSF) respectively. Using the functional formalism, we now show that, given $f \in \mathcal{H}_k$, when $\lVert x - x'\rVert_2 \le \delta$ for $\delta > 0$, the difference in Shapley values at $x$ and $x'$ will be small for all features, i.e. $|\phi_{x,i}(f) - \phi_{x',i}(f)|$ is small for all $i \in D$. This follows from

$$|\phi_{x,i}(f) - \phi_{x',i}(f)|^2 = |\langle f, \phi_{x,i} - \phi_{x',i}\rangle|^2 \le \lVert f\rVert^2_{\mathcal{H}_k}\,\lVert\phi_{x,i} - \phi_{x',i}\rVert^2_{\mathcal{H}_k}, \tag{3}$$

where we use Cauchy-Schwarz for the last inequality. Therefore, for a given $f$ with fixed RKHS norm, the key to showing robustness lies in bounding the difference of the Shapley functionals. In the following theorem, we make two assumptions: (1) the base kernels $k^{(i)}$ for each dimension $i \in D$ are bounded, and (2) the (population) conditional mean embedding functions $\mu_{X_{S^c}|X_S}$ belong to the vector-valued RKHSs $\mathcal{H}_{\Gamma_{X_S}}$ for all coalitions $S \subseteq D$, and therefore have finite norms. This assumption is also adopted in Park and Muandet [29, Theorem 4.5].

Theorem 6 (Bounding Shapley functionals). Let $k$ be a product kernel of $d$ bounded kernels with $|k^{(i)}(x, x)| \le M$ for all $i \in D$. Denote $M_\mu := \sup_{S \subseteq D} M^{|S|}$, $M_\Gamma := \sup_{S \subseteq D}\lVert\mu_{X_{S^c}|X_S}\rVert^2_{\Gamma_{X_S}}$ and $L_\delta := \sup_{S \subseteq D}\lVert\psi_{x_S} - \psi_{x'_S}\rVert^2_{\mathcal{H}_k}$. Let $\delta > 0$ and assume $|x^{(i)} - x'^{(i)}|^2 \le \delta$ for all features $i \in D$. Then the differences of the Interventional and Observational Shapley functionals for feature $i$ at observations $x, x'$ can be bounded as $\lVert\phi^{(I)}_{x,i} - \phi^{(I)}_{x',i}\rVert^2_{\mathcal{H}_k} \le 2 M_\mu L_\delta$ and $\lVert\phi^{(O)}_{x,i} - \phi^{(O)}_{x',i}\rVert^2_{\mathcal{H}_k} \le 4 M_\Gamma M_\mu L_\delta$. If $k$ is the RBF kernel with lengthscale $l$, then $\lVert\phi^{(I)}_{x,i} - \phi^{(I)}_{x',i}\rVert^2_{\mathcal{H}_k} \le 4\big(1 - \exp(-d\delta/2l^2)\big)$ and $\lVert\phi^{(O)}_{x,i} - \phi^{(O)}_{x',i}\rVert^2_{\mathcal{H}_k} \le 8 M_\Gamma\big(1 - \exp(-d\delta/2l^2)\big)$.

Therefore, as long as $\lVert f\rVert_{\mathcal{H}_k}$ is small, RKHS-SHAP returns Shapley values that are robust with respect to small perturbations. Notice that the Shapley functionals do not depend on $f$ and can be estimated separately, purely from data. We show in the next section how this key property allows us to use the functional itself to aid the learning of $f$, enabling us to enforce particular structural constraints on $f$ via an additional regularisation term.
## 4 Shapley regularisation

Regularisation is popular in machine learning because it allows inductive bias to be injected so as to learn functions with specific properties. For example, the classical L1 and L2 regularisers are used to control the sparsity and smoothness of model parameters. Manifold regularisation [4], on the other hand, exploits the geometry of the distribution of unlabelled data to improve learning in a semi-supervised setting, whereas Pérez-Suay et al. [31] and Li et al. [22] adopted a kernel dependence regulariser to learn functions for fair regression and fair dimensionality reduction. In the following, we propose a new Shapley regulariser based on the Shapley functionals, which allows learning while controlling the level of a specific feature's contribution to the model.

Formulation. Let $A$ be a specific feature whose contribution we wish to regularise, $f$ the function we wish to learn, and $\phi_{x_i,A}(f)$ the Shapley value of $A$ at a given observation $x_i$. Our goal is to penalise the mean squared magnitude of $\{\phi_{x_i,A}(f)\}_{i=1}^n$ in the ERM framework, which corresponds to $\min_{f \in \mathcal{H}_k}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda_f\lVert f\rVert^2_{\mathcal{H}_k} + \frac{\lambda_S}{n}\sum_{i=1}^n |\phi_{x_i,A}(f)|^2$, where $\ell$ is some loss function and $\lambda_f$ and $\lambda_S$ control the levels of regularisation. If we replace the population Shapley functional with the finite-sample estimate from Prop. 2 and utilise the Representer theorem, we can rewrite the optimisation in terms of $\alpha$.

Proposition 7. The above optimisation can be rewritten as $\min_{\alpha \in \mathbb{R}^n}\sum_{i=1}^n \ell(y_i, K_{x_i\mathbf{x}}\alpha) + \lambda_f\alpha^\top K_{\mathbf{x}\mathbf{x}}\alpha + \frac{\lambda_S}{n}\alpha^\top\zeta_A\zeta_A^\top\alpha$. To regularise the Interventional SVs of $A$ (ISV-REG), we set $\zeta_A = \frac{1}{J}\sum_{j=1}^J\big(K^{(I)}_{\mathbf{x},S_j\cup A} - K^{(I)}_{\mathbf{x},S_j}\big)$, where the $S_j$ are coalitions sampled from $p_{SV}(S) = \frac{1}{d}\binom{d-1}{|S|}^{-1}$. To regularise the Observational SVs (OSV-REG), we set $\zeta_A = \frac{1}{J}\sum_{j=1}^J\big(K^{(O)}_{\mathbf{x},S_j\cup A} - K^{(O)}_{\mathbf{x},S_j}\big)$. In particular, the closed-form optimal dual weights $\alpha = \big(K^2_{\mathbf{x}\mathbf{x}} + \lambda_f K_{\mathbf{x}\mathbf{x}} + \frac{\lambda_S}{n}\zeta_A\zeta_A^\top\big)^{-1}K_{\mathbf{x}\mathbf{x}}\mathbf{y}$ can be recovered when $\ell$ is the squared loss.

Choice of regularisation. Similar to the feature attribution problem, the choice of regularising against ISVs or OSVs is application dependent and boils down to whether one wants to take the correlation of $A$ with other features into account or not.

[Figure 2: (a) RKHS-SHAP experiments: estimation of Shapley values using data from the Banana distribution, with panels (i) estimating ISVs, (ii) estimating OSVs and (iii) a run time analysis (log10 seconds) comparing RKHS-ISV, KSHAP-ISV, RKHS-OSV and GSHAP-OSV. (b) ISV-REG experiments: RMSEs of $f_{reg}$ on noisy test data at different noise levels $\sigma$ and Shapley regularisation strengths. All scores are averaged over 10 runs and 1 sd is reported.]

ISV-REG. ISV-REG can be used to protect the model when covariate shift of a variable $A$ is expected to happen at test time and one wishes to downscale $A$'s contribution during training instead of completely removing this (potentially useful) feature. Such a situation may arise if, e.g., different measurement equipment or a different process is used for collecting observations of $A$ at test time. The ISV is well suited for this problem, as dependencies across features will be broken by the covariate shift at test time.

OSV-REG. On the other hand, OSV-REG finds its application in fair learning: learning a function that is fair with respect to some sensitive feature $A$. There exists a variety of fairness notions one could consider, such as Statistical Parity, Equality of Opportunity and Equalised Odds [7]. In particular, we consider the fairness notion recently explored in the literature [19, 25] that uses Shapley values, which are becoming a bridge between explainable AI and fairness, given that they can detect biased explanations from biased models. In particular, Jain et al. [19] illustrated that if a model is fair with respect to a sensitive feature $A$, then $A$ should have neither a positive nor a negative contribution towards the prediction. This corresponds to $A$ having SVs of negligible magnitude. Simply removing $A$ from training does not make the model fair, as contributions of $A$ might enter the model via correlated features; it is therefore important to take feature correlations into account while regularising. Hence, it is natural to deploy OSV-REG for fair learning.
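Before moving to the experiments, the closed form in Prop. 7 for the squared loss can be implemented directly. The sketch below is our illustration of ISV-REG (the OSV-REG variant would swap in $K^{(O)}$); it repeats the per-coalition kernel construction for self-containedness, the coalition sampling follows $p_{SV}$, and all hyperparameters and the toy data are arbitrary choices of ours.

```python
# Sketch of ISV-REG (Prop. 7) with the squared loss: closed-form dual weights
# alpha = (K_xx^2 + lambda_f K_xx + (lambda_S / n) zeta_A zeta_A^T)^{-1} K_xx y,
# with zeta_A built from J coalitions sampled from p_SV.
import numpy as np

def rbf_subset(X, S, ls=1.0):
    if len(S) == 0:
        return np.ones((X.shape[0], X.shape[0]))
    d2 = ((X[:, S][:, None, :] - X[:, S][None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def interventional_K(X, S, ls=1.0):
    Sc = [j for j in range(X.shape[1]) if j not in S]
    return rbf_subset(X, S, ls) * rbf_subset(X, Sc, ls).mean(axis=1, keepdims=True)

def isv_reg_fit(X, y, A, lam_f=1e-2, lam_S=1.0, J=50, ls=1.0, rng=None):
    """Kernel ridge regression with the Interventional Shapley regulariser on feature A."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    others = [j for j in range(d) if j != A]
    # Sample S ~ p_SV(S) = (1/d) / binom(d-1, |S|): size uniform on {0,...,d-1}, then a uniform subset.
    zeta = np.zeros((n, n))
    for _ in range(J):
        size = rng.integers(0, d)
        S = [int(j) for j in rng.choice(others, size=size, replace=False)]
        zeta += interventional_K(X, S + [A], ls) - interventional_K(X, S, ls)
    zeta /= J
    K = rbf_subset(X, list(range(d)), ls)
    M = K @ K + lam_f * K + (lam_S / n) * (zeta @ zeta.T)
    return np.linalg.solve(M, K @ y)            # closed-form dual weights alpha

# Usage: downweight feature 4's contribution before an anticipated covariate shift.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 10.0]) + 0.1 * rng.normal(size=300)
alpha = isv_reg_fit(X, y, A=4, lam_S=2.0)
```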
## 5 Experiments

We demonstrate specific properties of RKHS-SHAP and the Shapley regularisers using four synthetic experiments, because these properties are best illustrated in a fully controlled environment. For example, to highlight the merit of the distributional-assumption-free value function estimation in RKHS-SHAP, we need ground-truth conditional expectations of value functions for verification, but these are not available in real-world data because we do not observe the true data-generating distribution. Nonetheless, as model interpretability is a practical problem, we have also run several larger-scale ($n = 50000$ and $1.8 \times 10^6$) real-world explanation tasks using RKHS-SHAP and report our findings in Appendix E for a complete empirical demonstration. All code and implementations are made publicly available [3].

In the first two experiments, we evaluate RKHS-SHAP methods against benchmarks on estimating Interventional and Observational SVs on a Banana-shaped distribution with nonlinear dependencies [34]. The setup allows us to obtain closed-form expressions for the ground-truth ISVs and OSVs, yet the conditional distributions among features are challenging to estimate using standard parametric density estimation methods. We also present a run time analysis to demonstrate empirically that mean-embedding-based approaches are significantly more efficient than sampling-based approaches. Finally, the last two experiments are applications of the Shapley regularisers to modelling that is robust to covariate shift and to fair learning with respect to a sensitive feature.

In the following, we denote by RKHS-OSV and RKHS-ISV the OSV and ISV obtained from RKHS-SHAP. As a benchmark, we implement the model-agnostic sampling-based algorithm KERNELSHAP from the Python package shap [23]. We denote the ISV obtained from KERNELSHAP as KSHAP-ISV. As shap does not offer a model-agnostic OSV algorithm, we implement the approach of Aas et al. [1] (described in Section 3), where OSVs are estimated using Monte Carlo samples from fitted multivariate Gaussians. We denote this approach as GSHAP-OSV. We fit a kernel ridge regression in each of our experiments. Lengthscales of the kernel are selected using the median heuristic [12] and regularisation parameters are selected using cross-validation. Further implementation details and real-world data illustrations are included in Appendix E.

### 5.1 RKHS-SHAP experiments

Experiment 1: Estimating Shapley values from Banana data. We consider the following 2d Banana distribution $B(b^{-1}, v)$ from Sejdinovic et al. [34]: sample $Z \sim N(0, \mathrm{diag}(v, 1))$ and transform the data by setting $X_1 = Z_1$ and $X_2 = b^{-1}(Z_1^2 - v) + Z_2$. Regression labels are obtained from $f_{\mathrm{truth}}(X) = b^{-1}(X_1^2 - v) + X_2$.
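A sketch of this data-generating process (our own code; the seed and sample size are arbitrary) is as follows.

```python
# Sketch of the Banana data-generating process B(b^{-1}, v) used in Experiments 1-2,
# with regression labels from f_truth(X) = (X_1^2 - v)/b + X_2.
import numpy as np

def banana_data(n, b, v=10.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    Z = rng.multivariate_normal(np.zeros(2), np.diag([v, 1.0]), size=n)
    X = np.column_stack([Z[:, 0], (Z[:, 0] ** 2 - v) / b + Z[:, 1]])
    y = (X[:, 0] ** 2 - v) / b + X[:, 1]          # labels from f_truth evaluated at X
    return X, y

X, y = banana_data(3000, b=1.0)   # smaller b gives a more nonlinearly elongated distribution
```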
This formulation allows us to compute the true ISVs and OSVs in closed form, i.e. $\phi^{(I)}_{X,1}(f_{\mathrm{truth}}) = b^{-1}(X_1^2 - v)$, $\phi^{(I)}_{X,2}(f_{\mathrm{truth}}) = X_2$, $\phi^{(O)}_{X,1}(f_{\mathrm{truth}}) = \frac{1}{2}\big(3b^{-1}(X_1^2 - v) - X_2\big)$ and $\phi^{(O)}_{X,2}(f_{\mathrm{truth}}) = \frac{1}{2}\big(3X_2 - b^{-1}(X_1^2 - v)\big)$. In the following we simulate 3000 data points from $B(b^{-1}, 10)$ with $b \in \{1, 10, 20, 50, 100\}$, where smaller values of $b$ correspond to more nonlinearly elongated distributions. We choose $R^2$ as our metric, since the true Shapley values for each experiment are scaled according to $b$. Figures 2a(i) and 2a(ii) show the $R^2$ scores of the estimated ISVs and OSVs against the ground-truth SVs across the different configurations. We see that RKHS-ISV and KSHAP-ISV give exactly the same $R^2$ scores across configurations. This is not surprising, as the two methods are mathematically equivalent: while KSHAP-ISV averages over the evaluated $\{f(x'_j)\}$ with $x'_j$ being the imputed data, RKHS-ISV aggregates the feature maps of the imputed data first before evaluating $f$, i.e. $\frac{1}{n}\sum_{j=1}^n f(x'_j) = \big\langle f, \frac{1}{n}\sum_{j=1}^n \psi_{x'_j}\big\rangle_{\mathcal{H}_k} = \langle f, \hat{\mu}_X\rangle_{\mathcal{H}_k}$. However, it is this subtle difference in the order of operations that contributes to a significant difference in computational speed, as we show in Experiment 2. In the case of estimating OSVs, we see that RKHS-OSV is consistently better than GSHAP-OSV in all configurations. This highlights the merit of RKHS-OSV: no density estimation is needed, thus avoiding the potential distribution model misspecification that occurs in GSHAP-OSV.

Experiment 2: Run time analysis. In this experiment we sample $n$ data points from $B(1, 10)$, where $n \in \{100, 500, 1000, 1500, 3000, 5000\}$, and record the $\log_{10}$ seconds required by each algorithm. In practice, as the software documentation of shap suggests, one is encouraged to subsample the data before passing it to the KERNELSHAP algorithm as the background sampling distribution, to avoid slow run times. Since this approach speeds up computation at the expense of estimation accuracy (as less data is used), for a fair comparison with our RKHS-SHAP method, which utilises all the data, we pass the whole training set to the KERNELSHAP algorithm. Figure 2a(iii) illustrates the run times across methods. We note that the difference in run time between the two sampling-based methods KSHAP-ISV and GSHAP-OSV can be attributed to different software implementations, but we observe that they are both significantly slower than RKHS-ISV and RKHS-OSV. RKHS-OSV is slower than RKHS-ISV, as it involves matrix inversion when computing the empirical CME. In practice, one can trivially subsample data for RKHS-SHAP to achieve further speedups, as in the shap package, but one can also deploy a range of kernel approximation techniques as discussed in Section 2.2.

### 5.2 Shapley regularisation experiments

For the last two experiments we simulate 3000 samples from $X \sim N(0, \Sigma)$ with $\mathrm{diag}(\Sigma) = \mathbf{1}_5$ and $\Sigma_{4,5} = \Sigma_{5,4} = 0.9$, and 0 otherwise; features $X_4$ and $X_5$ are therefore highly correlated. We set our regression labels as $f_{\mathrm{true}}(x) = x^\top\beta$ with $\beta = [1, 2, 3, 4, 10]^\top$, making $X_5$ the most influential feature. We use 70% of our data for training and 30% for testing.

Experiment 3: Protection against covariate shift using ISV-REG. For this experiment, we inject extra zero-mean Gaussian noise into the most influential feature $X_5$ in the test set, i.e. $X_5' = X_5 + \sigma N(0, 1)$ for $\sigma \in \{0, 0.1, 0.5, 1, 1.5\}$. We assume that covariate shift in $X_5$ is expected to occur at test time, due to, e.g.,
a change in measurement precision. Hence, we train our model $f_{reg}$ using ISV-REG at different regularisation levels $\lambda_S \in \{0, 0.5, 1, 1.5, 2, 2.5\}$. We then compare RMSEs when no covariate shift is present ($\sigma = 0$) against RMSEs at the different noise levels. The results are shown in Figure 2b. We see that when no regularisation is applied, RMSEs increase rapidly as $\sigma$ increases, indicating that the standard, unprotected kernel ridge regressor is sensitive to noise in $X_5'$. As the Shapley regularisation parameter increases, the RMSE of the noiseless case gradually increases too, but the RMSEs on the noisy data are much closer to the noiseless case, exhibiting robustness to the covariate shift.

[Figure 3: Distributions of SVs ($\phi^{(I)}_{X,4}(f_{reg})$, $\phi^{(I)}_{X,5}(f_{reg})$, $\phi^{(O)}_{X,4}(f_{reg})$, $\phi^{(O)}_{X,5}(f_{reg})$) of the sensitive feature $X_5$ and the correlated feature $X_4$ obtained from ISV-REG and OSV-REG at different regularisation parameters. Colour intensity represents the strength of regularisation.]

Experiment 4: Fair learning with OSV-REG. Lastly, we demonstrate the use of the Shapley regulariser to enable fair learning. In this context, as we will see, OSV-REG is the appropriate regulariser. Consider $X_5$ as a sensitive feature whose contribution we would like to minimise during the learning of $f$. Recall that $X_4$ is highly correlated with $X_5$, so it contains sensitive information from $X_5$ as well. Figure 3 shows how the distributions of the ISVs and OSVs of $X_4$ and $X_5$ change as $\lambda_S$ increases. As the regularisation increases, the SVs of $X_5$ become more centred at 0, indicating a smaller contribution to the model $f_{reg}$. Similar behaviour can be seen in the distribution of $\phi^{(O)}_{X,4}(f_{reg})$ but not in $\phi^{(I)}_{X,4}(f_{reg})$. This illustrates how ISV-REG propagates unfairness through the correlated feature $X_4$, while OSV-REG can take feature correlations into account by minimising the contribution of sensitive information during learning.

## 6 Conclusion, limitations, and future directions

In this work, we proposed a more accurate and more efficient algorithm to compute Shapley values for kernel methods, termed RKHS-SHAP. We proved that the corresponding local attributions are robust to local perturbations under mild assumptions, a desirable property for consistent model interpretation. Furthermore, we proposed the Shapley regulariser, which allows learning while controlling a specific feature's contribution to the model. We suggested two applications of this regulariser and concluded our work with synthetic experiments demonstrating specific aspects of our contributions. Extensive real-world data explanations are provided in Appendix E.2 for empirical demonstration. While our methods are currently only applicable to functions arising from kernel methods, a fruitful direction would be to extend their applicability to more general models using the same paradigm. It would also be interesting to extend our formulation to kernel-based hypothesis testing, for example to interpret results from two-sample tests.

Acknowledgements. The authors would like to thank Shahine Bouabid for helpful comments. SLC is supported by the EPSRC and MRC through the OxWaSP CDT programme EP/L016710/1. DS is supported in part by Tencent AI Lab and in part by the Alan Turing Institute (EP/N510129/1).

References

[1] Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464, 2019.

[2] David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.
[3] Anonymous Author(s). https://anonymous.4open.science/r/RKHS-SHAP/, 2022.

[4] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(11), 2006.

[5] Diogo V. Carvalho, Eduardo M. Pereira, and Jaime S. Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8):832, 2019.

[6] Hugh Chen, Joseph D. Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.

[7] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.

[8] Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley value estimation using linear regression. In International Conference on Artificial Intelligence and Statistics, pages 3457-3465. PMLR, 2021.

[9] Ian Covert, Scott Lundberg, and Su-In Lee. Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems, 33, 2020.

[10] Sébastien Da Veiga. Kernel-based ANOVA decomposition and Shapley effects: application to global sensitivity analysis. arXiv preprint arXiv:2101.05487, 2021.

[11] Diabetes dataset from Kaggle.com. https://www.kaggle.com/datasets/mathchi/diabetes-data-set?resource=download, 2022.

[12] Seth Flaxman, Dino Sejdinovic, John P. Cunningham, and Sarah Filippi. Bayesian learning of kernel embeddings. arXiv preprint arXiv:1603.02160, 2016.

[13] Christopher Frye, Damien de Mijolla, Laurence Cowton, Megan Stanley, and Ilya Feige. Shapley-based explainability on the data manifold. arXiv preprint arXiv:2006.01272, 2020.

[14] Sahra Ghalebikesabi, Lucile Ter-Minassian, Karla Diaz-Ordaz, and Chris C. Holmes. On locality of local explanation models. Advances in Neural Information Processing Systems, 34, 2021.

[15] Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242-2251. PMLR, 2019.

[16] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. arXiv preprint arXiv:1205.4656, 2012.

[17] Leif Hancox-Li. Robustness in machine learning explanations: Does it matter? In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 640-647, 2020.

[18] Tom Heskes, Evi Sijben, Ioan Gabriel Bucur, and Tom Claassen. Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models. Advances in Neural Information Processing Systems, 33:4778-4789, 2020.

[19] Aditya Jain, Manish Ravula, and Joydeep Ghosh. Biased models have biased explanations. arXiv preprint arXiv:2012.10986, 2020.

[20] Dominik Janzing, Lenon Minorics, and Patrick Blöbaum. Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics, pages 2907-2916. PMLR, 2020.

[21] Z. Li, J.-F. Ton, D. Oglic, and D. Sejdinovic. Towards a unified analysis of random Fourier features. In International Conference on Machine Learning (ICML), pages 3905-3914. PMLR, 2019.

[22] Zhu Li, Adrian Perez-Suay, Gustau Camps-Valls, and Dino Sejdinovic. Kernel dependence regularizers and Gaussian processes with applications to algorithmic fairness. arXiv preprint arXiv:1911.04322, 2019.
[23] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765-4774, 2017.

[24] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.

[25] Masayoshi Mase, Art B. Owen, and Benjamin B. Seiler. Cohort Shapley value for algorithmic fairness. arXiv preprint arXiv:2105.07168, 2021.

[26] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177-204, 2005.

[27] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.

[28] League of Legends Interpretability Demonstration. https://slundberg.github.io/shap/notebooks/League%20of%20Legends%20Win%20Prediction%20with%20XGBoost.html, 2022.

[29] Junhyung Park and Krikamol Muandet. A measure-theoretic approach to kernel conditional mean embeddings. arXiv preprint arXiv:2002.03689, 2020.

[30] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, volume 152. Cambridge University Press, 2016.

[31] Adrián Pérez-Suay, Valero Laparra, Gonzalo Mateo-García, Jordi Muñoz-Marí, Luis Gómez-Chova, and Gustau Camps-Valls. Fair kernel learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 339-355. Springer, 2017.

[32] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, page 5. Citeseer, 2007.

[33] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135-1144, 2016.

[34] Dino Sejdinovic, Heiko Strathmann, Maria Lomeli Garcia, Christophe Andrieu, and Arthur Gretton. Kernel adaptive Metropolis-Hastings. In International Conference on Machine Learning, pages 1665-1673. PMLR, 2014.

[35] Lloyd S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307-317, 1953.

[36] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145-3153. PMLR, 2017.

[37] Eunhye Song, Barry L. Nelson, and Jeremy Staum. Shapley effects for global sensitivity analysis: Theory and computation. SIAM/ASA Journal on Uncertainty Quantification, 4(1):1060-1083, 2016.

[38] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98-111, 2013.

[39] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[40] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647-665, 2014.

[41] Mukund Sundararajan and Amir Najmi. The many Shapley values for model explanation. In International Conference on Machine Learning, pages 9269-9278. PMLR, 2020.
[42] Zoltán Szabó and Bharath K. Sriperumbudur. Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(1):8724-8752, 2017.

[43] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.

[44] Christopher K. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.

[45] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. Advances in Neural Information Processing Systems, 25:476-484, 2012.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See [3].
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the experiments section.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Experiments.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]