# Robust Data Valuation with Weighted Banzhaf Values

Weida Li (vidaslee@gmail.com); Yaoliang Yu (yaoliang.yu@uwaterloo.ca), School of Computer Science, University of Waterloo; Vector Institute
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Data valuation, a principled way to rank the importance of each training datum, has become increasingly important. However, existing value-based approaches (e.g., Shapley) are known to suffer from the stochasticity inherent in utility functions, which renders consistent and reliable ranking difficult. Recently, Wang and Jia (2023) proposed the noise-structure-agnostic framework to advocate the Banzhaf value for its robustness against such stochasticity, as it achieves the largest safe margin among many alternatives. Surprisingly, our empirical study shows that the Banzhaf value is not always the most robust when compared with a broader family: weighted Banzhaf values. To analyze this scenario, we introduce the concept of Kronecker noise to parameterize stochasticity, through which we prove that the uniquely robust semi-value, which can be analytically derived from the underlying Kronecker noise, lies in the family of weighted Banzhaf values while minimizing the worst-case entropy. In addition, we adopt the maximum sample reuse principle to design an estimator that efficiently approximates weighted Banzhaf values, and show that it enjoys the best time complexity in terms of achieving an (ϵ, δ)-approximation. Our theory is verified under both synthetic and authentic noises. For the latter, we fit a Kronecker noise to the inherent stochasticity, which is then plugged in to generate the predicted most robust semi-value. Our study suggests that weighted Banzhaf values are promising when facing undue noises in data valuation.

1 Introduction

In machine learning, data curation plays a crucial role in training powerful models, and one big difficulty facing practitioners is how to quantify the extent of usefulness each datum possesses (under a specific application). In practice, large datasets are typically noisy and often contain mislabeled data, which can harm the downstream performance of models trained on top of them. Sources that affect data quality include mechanical or human labelling errors (Frénay and Verleysen 2013) and poisoning attacks from malicious adversaries (Steinhardt et al. 2017). Therefore, it would be extremely useful if the potential importance of each datum could be efficiently and reliably determined. Data valuation aims to lay out a principled way to filter out noisy data and hunt down data that are valuable to retain.

In the emerging field of data valuation, the concept of value, originally developed in cooperative game theory as an axiomatic and principled way to assign contributions to all players in a game, has become increasingly popular in machine learning (e.g., Ghorbani and Zou 2019; Jia et al. 2019a,b; Kwon and Zou 2022; Lin et al. 2022; Wang and Jia 2023). Typically, the procedures of training and testing are treated as games that constitute utility functions, while each datum serves as a player. The contribution imputed to each player is then equivalent to the importance determined for each datum. Recently, Wang and Jia (2023) pointed out that the inherent stochasticity in utility functions, which stems from the procedure of training models, is substantial enough to incur unexpected inconsistency in ranking data.
To solve this issue, they introduced the noise-structure-agnostic framework to demonstrate that the Banzhaf value (Banzhaf 1965) is the most robust against such stochasticity compared with other previously employed values. Surprisingly, our extensive experiments on various datasets, see Figure 1 for some examples, reveal that the Banzhaf value is not always the most robust compared with weighted Banzhaf values, a family to which the Banzhaf value belongs.

[Figure 1 (panels: FMNIST (10 classes), MNIST (10 classes), gas (6 classes)): The first row contains results from the experiment of ranking consistency, while the second row is from the experiment of noisy label detection. For both metrics, the higher the better. The error bars represent the standard deviation over 10 random seeds. Specifically, each Spearman's rank correlation coefficient is averaged over $\binom{10}{2} = 45$ pairs of independent rankings. Each parameter along the horizontal axis defines a weighted Banzhaf value, and 0.50 corresponds to the Banzhaf value.]

In this work, our goal is to shed further light on the above-mentioned phenomenon both theoretically and empirically. As one might expect, our extensive experiments confirm that overall there is no single universally best value, in terms of either noisy label detection or ranking consistency, across a variety of experiment settings. Nevertheless, we show that it is possible to adaptively estimate a weighted Banzhaf value that achieves competitive robustness compared with the existing baselines. We summarize our contributions as follows:

1. First, we introduce the concept of Kronecker noise to parameterize the inherent stochasticity in practice, and prove that the uniquely most robust semi-value lies in the family of weighted Banzhaf values in terms of minimizing the worst-case entropy. Specifically, the parameter for the most robust semi-value can be directly derived from the underlying Kronecker noise.
2. Second, it is known that evaluating each semi-value exactly is exponentially expensive w.r.t. the number of data being valuated, and thus we adopt the maximum sample reuse principle to design an efficient estimator for approximating weighted Banzhaf values. In particular, we show that it enjoys the best time complexity in terms of achieving an (ϵ, δ)-approximation.
3. Third, both synthetic and authentic noises are employed in our experiments to verify the robustness assertion in Theorem 1. For the latter, we fit a Kronecker noise to the inherent stochasticity by minimizing the Kullback-Leibler divergence between Gaussian distributions.
4. Last, our empirical study shows that it is often the case that some member of the family of weighted Banzhaf values achieves the best performance in noisy label detection as well as ranking consistency when compared with the existing baselines. This empirical evidence highlights that there is no single universally best value across various experiment settings.

Our code is available at https://github.com/watml/weighted-Banzhaf.

2 Background

2.1 Notations

Let $D_{tr}$ and $D_{val}$ be training and validation datasets, respectively. We write $n = |D_{tr}|$, and identify $D_{tr}$ with $[n] = \{1, 2, \ldots, n\}$. The definition of the utility function $v: 2^{[n]} \to \mathbb{R}$ is pivotal for employing value-based data valuation methods. Typically, for each subset $S \subseteq [n]$, $v(S)$ is set to be the performance of a chosen model trained on $S$. For example, $v(S)$ is usually the accuracy measured on $D_{val}$ for classification tasks.
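As a concrete illustration, the following is a minimal sketch of such a utility function, assuming a scikit-learn classifier and fixed training/validation arrays. It is an illustrative reimplementation under our own assumptions (e.g., the majority-class fallback for degenerate subsets; the paper instead evaluates a randomly initialized model for the empty set), not the authors' released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_utility(X_tr, y_tr, X_val, y_val, seed=0):
    """Return v: subset of training indices -> validation accuracy."""
    def v(S):
        S = sorted(S)
        # Subsets on which a classifier cannot be fit (empty set, or a single class)
        # are scored by a majority-class predictor; this fallback is our assumption.
        if len(S) == 0 or len(np.unique(y_tr[S])) < 2:
            majority = np.bincount(y_tr.astype(int)).argmax()
            return float(np.mean(y_val == majority))
        model = LogisticRegression(max_iter=1000, random_state=seed)
        model.fit(X_tr[S], y_tr[S])
        return float(model.score(X_val, y_val))
    return v
```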
Specifically, $v(\emptyset)$ is set to be the performance of randomly initialized models in this work. Let $G_n$ be the set that contains all possible utility functions with domain $2^{[n]}$, and define $G = \bigcup_{n \ge 1} G_n$. In particular, the subscript of $v_n \in G$ indicates the number of data being valuated. For convenience, we write $S \cup i$ and $S \setminus i$ instead of $S \cup \{i\}$ and $S \setminus \{i\}$, and use $s = |S|$ when $S$ represents a set.

2.2 Axioms for Values

Value-based data valuation methods take advantage of the axiomatic approach developed in cooperative game theory (Dubey et al. 1981; Shapley 1953; Weber 1977). A value is a function $\phi$ that maps each utility function $v_n \in G$ to a contribution vector $c \in \mathbb{R}^n$. In other words, $\phi_i(v_n) = c_i$ is the contribution of the $i$-th datum in $D_{tr}$ to the model performance achieved by training on $D_{tr}$. There are four axioms we expect a value to have:

1. Linearity: $\phi(a \cdot u_n + v_n) = a \cdot \phi(u_n) + \phi(v_n)$ for every $u_n, v_n \in G$ and every $a \in \mathbb{R}$;
2. Dummy: $\phi_i(v_n) = C$ if $v_n(S \cup i) = v_n(S) + C$ for every $S \subseteq [n] \setminus i$;
3. Monotonicity: $\phi_i(v_n) \ge 0$ if $v_n(S \cup i) \ge v_n(S)$ for every $S \subseteq [n] \setminus i$;
4. Symmetry:[1] $\phi(v_n \circ \pi) = \phi(v_n) \circ \pi$ for every permutation $\pi$ of $[n]$.

As proved by Weber (1977, Theorems 5 and 10),[2] a value $\phi$ satisfies the above four axioms if and only if there exists a list of non-negative vectors $\Xi = (p^n \in \mathbb{R}^n)_{n \ge 1}$ such that, for each $n \ge 1$, $\sum_{s=1}^{n} \binom{n-1}{s-1} p^n_s = 1$ and

$$\phi_i(v_n; \Xi) = \sum_{S \subseteq [n] \setminus i} p^n_{s+1}\,(v_n(S \cup i) - v_n(S)) \quad \text{for every } v_n \in G \text{ and every } i \in [n]. \tag{1}$$

All such values are called probabilistic values. There is one more regularization implicitly used by Dubey et al. (1981) to filter out some undesired probabilistic values. The resulting regularized probabilistic values are called semi-values.[3][4] We explicitly rephrase this regularization as an extra axiom in this paper. Let $z$ be an extra datum labelled by $n+1$, and extend each utility function $v_n \in G$ into $v_{n+1}$ defined by $v_{n+1}(S) = v_n(S \cap [n])$ for every $S \subseteq [n+1]$. The extra axiom reads as follows:

5. Consistency: $\phi_i(v_n; \Xi) = \phi_i(v_{n+1}; \Xi)$ for every $v_n \in G$ and every $i \in [n]$.

The axiom of consistency states that a value should not change the previously assigned values of importance if the added data do not contribute anything. Note that the proposed axiom of consistency is the discrete version of the dummy consistency proposed by Friedman (2003, Definition 7) for cost-sharing problems. We have the following characterization for semi-values:

Proposition 1 (Dubey et al. 1981, Theorem 1(a)). A probabilistic value $\phi$ is a semi-value if and only if there exists a probability distribution $P$ on the interval $[0, 1]$ such that

$$p^n_s = \int_{[0,1]} t^{s-1}(1-t)^{n-s}\, dP(t) \quad \text{for every } 1 \le s \le n. \tag{2}$$

[1] It can be equivalently replaced by the axiom of equal treatment in this context, which is not generally equivalent to the axiom of symmetry. We refer the reader to Appendix A for more details.
[2] Their assumption that $v_n(\emptyset) = 0$ can be quickly removed by showing that $\phi(v_n) = 0$ if $v_n$ is constant, based on the axioms of linearity and monotonicity.
[3] In some references (e.g., Wang and Jia 2023), probabilistic value is referred to as semi-value, while semi-value may refer to a distinct entity (e.g., Dragan 2006), i.e., what we call semi-value in this work.
[4] The family of probabilistic values is strictly larger than that of semi-values. For example, for $p^3 = (0, 1/2, 0)$, there is no non-negative $p^4$ satisfying $p^3_i = p^4_i + p^4_{i+1}$ for every $1 \le i \le 3$.

Table 1: The underlying CDFs on the interval $[0, 1]$ for semi-values of interest. $x$ denotes the variable of the CDFs. $\chi_{[0.5,1]}$ is defined by $\chi_{[0.5,1]}(x) = 1$ if $x \in [0.5, 1]$ and $0$ otherwise.
Beta(α, β) is a parameterized notation for the Beta Shapley values.

| Method | G-Shapley | Beta(α, β) | Data Banzhaf |
| --- | --- | --- | --- |
| CDF | $x$ | $\int_0^x t^{\beta-1}(1-t)^{\alpha-1}\, dt$ | $\chi_{[0.5,1]}(x)$ |

Moreover, the map from all such probability distributions to semi-values is one-to-one. The axiom of consistency basically imposes that $p^n_s = p^{n+1}_s + p^{n+1}_{s+1}$ for every $1 \le s \le n$, which is the key for connecting all components in $\Xi$. All in all, combining Eqs. (1) and (2) yields the formula for each semi-value,

$$\phi_i(v_n; P) = \int_0^1 \sum_{S \subseteq [n]\setminus i} t^{s}(1-t)^{n-1-s}\,(v_n(S \cup i) - v_n(S))\; dP(t) \tag{3}$$

for every $v_n \in G$ and every $i \in [n]$. It suggests that each semi-value is just an integral of weighted Banzhaf values w.r.t. the underlying probability distribution $P$. In particular, substituting the uniform distribution for $P$ recovers the formula for the Shapley value (Shapley 1953) derived from the multilinear extension of utility functions (Owen 1972, Theorem 5).

2.3 Semi-values for Data Valuation

As a pioneering work, Ghorbani and Zou (2019) proposed Data Shapley, which includes G-Shapley and TMC-Shapley, for data valuation, and demonstrated its effectiveness in applications. Later, Beta Shapley was developed to improve the overall performance of value-based data valuation (Kwon and Zou 2022). Recently, Wang and Jia (2023) set up a noise-structure-agnostic framework to demonstrate that their proposed Data Banzhaf is the most robust compared with the previous works. Basically, the distinction between these methods is the choice of semi-value. The underlying cumulative distribution functions (CDFs) of $P$ in Eq. (2) for these value-based data valuation methods are listed in Table 1. Specifically, TMC-Shapley (if the indices for truncation are the same for all sampled permutations of $[n]$) is some probabilistic value up to some scalar.

In the cooperative game theory literature, weighted Banzhaf values (which are semi-values) were discovered by Ding et al. (2010) and Marichal and Mathonet (2011, Theorem 10) based on the approximation of utility functions (Hammer and Holzman 1992).[5] To obey the axiom of symmetry, the family of symmetric weighted Banzhaf values is taken herein, which can be parameterized by $w \in [0, 1]$. Precisely, for the $w$-weighted Banzhaf value, the corresponding coefficients in Eq. (1) are $p^n_s = w^{s-1}(1-w)^{n-s}$ for every $1 \le s \le n$. Equivalently, its underlying CDF is $\chi_{[w,1]}$, while its underlying probability distribution is the Dirac delta distribution $\delta_w$ defined by $\delta_w(S) = 1$ if $w \in S$ and $0$ otherwise, for every Borel measurable subset $S$ of $[0, 1]$. In particular, the 0.5-weighted Banzhaf value is exactly the Banzhaf value employed in Data Banzhaf, while the 1.0-weighted Banzhaf value is just the leave-one-out method, i.e., $\phi_i(v_n; \delta_1) = v_n([n]) - v_n([n]\setminus i)$ for every $v_n \in G$ and $i \in [n]$.

[5] In some references (e.g., Domenech et al. 2016), weighted Banzhaf values are referred to as multinomial probabilistic values.

3 Main Work

In this section, we first introduce how to use the maximum sample reuse (MSR) principle to design an efficient estimator for weighted Banzhaf values, and also present its time complexity in terms of (ϵ, δ)-approximation. Then, we introduce the concept of Kronecker noise to parameterize the underlying stochasticity in utility functions, through which we show that the uniquely most robust semi-value always lies in the family of weighted Banzhaf values while minimizing the worst-case entropy.

3.1 Efficient Estimator for Weighted Banzhaf Values

To exactly compute each probabilistic value shown in Eq. (1), we have to evaluate the provided utility function $2^n$ times in total, which is intractable when $n$ is large, not to mention the expensive cost of training models. Therefore, in general, each probabilistic value can only be approximated using estimators. Specifically, the sampling lift estimator can be used to approximate every probabilistic value (Moehle et al. 2022), and its weighted version was employed by Kwon and Zou (2022) for approximating their proposed Beta Shapley values. Meanwhile, there are efficient estimators designed specifically for the Shapley value (e.g., Castro et al. 2009; Covert and Lee 2021; Jia et al. 2019b; Lundberg and Lee 2017). Recently, Wang and Jia (2023) demonstrated how to use the MSR principle to design an efficient estimator for the Banzhaf value, but they also showed that this principle cannot be extended to many other probabilistic values (e.g., the Beta Shapley values). Nevertheless, we show that an MSR estimator exists for every weighted Banzhaf value.

For approximating the $w$-weighted Banzhaf value based on some $v_n \in G$, we first independently sample a sequence of Bernoulli vectors $\{b_1, b_2, \ldots, b_m\}$ where the entries of each $b_j \in \mathbb{R}^n$ are drawn from $n$ independent Bernoulli distributions with success probability $w$; then, the sequence of Bernoulli vectors is transformed into a sequence of subsets $\{S_1, S_2, \ldots, S_m\}$ of $[n]$ by defining $i \in S_j$ if and only if the $i$-th entry of $b_j$ is $1$; finally, the approximate value is computed by

$$\hat{\phi}_i(v_n; \delta_w) = \frac{1}{|\mathcal{S}_{\ni i}|} \sum_{S \in \mathcal{S}_{\ni i}} v_n(S) \;-\; \frac{1}{|\mathcal{S}_{\not\ni i}|} \sum_{S \in \mathcal{S}_{\not\ni i}} v_n(S) \quad \text{for every } i \in [n], \tag{4}$$

where $\mathcal{S}_{\ni i} = \{S_j \mid i \in S_j\}$ and $\mathcal{S}_{\not\ni i} = \{S_j \mid i \notin S_j\}$. As shown in Eq. (4), each utility evaluation is employed for estimating the importance of every datum in $[n]$, whereas for the (reweighted) sampling lift estimator two utility evaluations serve only one datum; see Appendix D.2 for more details. Therefore, the MSR estimators can be expected to be much more efficient, which is confirmed by Figure 2. Meanwhile, such an MSR estimator enjoys the currently best time complexity bound, analogous to the one demonstrated by Wang and Jia (2023, Theorem 4.9).

Proposition 2. Assume $|v_n(S)| \le r$ for every $v_n \in G$ and $S \subseteq [n]$. The estimator based on Eq. (4) requires $O\!\left(\frac{n}{\epsilon^2}\log\frac{n}{\delta}\right)$ utility evaluations to achieve $P(\|\hat{\phi}(v_n; \delta_w) - \phi(v_n; \delta_w)\|_2 \ge \epsilon) \le \delta$ for every utility function $v_n \in G$.
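To make Eq. (4) concrete, here is a minimal sketch of the MSR estimator. It is an illustrative reimplementation under our own conventions (data indexed 0 to n-1, subsets passed to the utility as frozensets, so it composes with the utility sketch in Section 2.1), not the authors' released estimator.

```python
import numpy as np

def msr_weighted_banzhaf(v, n, w, m, rng=None):
    """MSR estimate of the w-weighted Banzhaf value (Eq. (4)) for all n data.

    v : callable mapping a frozenset of data indices {0,...,n-1} to a real utility
    m : number of sampled subsets (= number of utility evaluations)
    """
    rng = np.random.default_rng(rng)
    sums = np.zeros((n, 2))    # column 1 accumulates v(S) for S containing i, column 0 otherwise
    counts = np.zeros((n, 2))
    for _ in range(m):
        b = rng.random(n) < w                    # Bernoulli(w) membership vector
        u = v(frozenset(np.flatnonzero(b)))      # one utility evaluation, reused for every datum
        sums[np.arange(n), b.astype(int)] += u
        counts[np.arange(n), b.astype(int)] += 1
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts                    # data never sampled on one side stay NaN
    return means[:, 1] - means[:, 0]
```

Each of the m sampled subsets triggers exactly one utility evaluation shared by all n data, in line with the maximum-sample-reuse discussion above.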
3.2 Robustness under Kronecker Noises

If $2^{[n]}$ is ordered, each stochastic utility function $v_n \in G$ can be treated as a random vector in $\mathbb{R}^{2^n}$, and thus its randomness can be summarized by its covariance matrix $D \in \mathbb{R}^{2^n \times 2^n}$. However, there are $\Theta(4^n)$ parameters to characterize. To make the analysis tractable, our approach is to parameterize $D$ as simply as possible so that the most robust semi-value can be analyzed accordingly. Before proceeding, we introduce the binary ordering for $2^{[n]}$, so that we can legally have the covariance matrix $D$.

Ordering of $2^{[n]}$. Define a binary sequence $[0]_2, [1]_2, \ldots, [2^n - 1]_2$ by letting $[k]_2$ be the binary representation of $k$ in $n$-digit format. We order $\{S\}_{S \subseteq [n]}$ along this binary sequence by assigning to each binary representation $[k]_2$ a subset $S_k \subseteq [n]$ defined by $i \in S_k$ if and only if the $(n+1-i)$-th digit of $[k]_2$ is $1$. For example, if $n = 3$, the binary sequence is $000, 001, 010, 011, 100, 101, 110, 111$, and the corresponding ordering for the subsets of $[3]$ is $\emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}$. We refer to this ordering as the binary ordering. In the rest of the paper, we treat each stochastic utility function $v_n \in G$ as a random vector in $\mathbb{R}^{2^n}$ w.r.t. the binary ordering.
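For concreteness, the following minimal sketch (ours, not from the paper's codebase) enumerates subsets in this binary ordering and views a utility function as such a vector; for $n = 3$ it reproduces the order listed above (with data labelled 1 to n here, matching the paper's notation).

```python
import numpy as np

def binary_ordering(n):
    """Subsets of [n] = {1,...,n} in the binary ordering: i belongs to S_k iff
    bit (i - 1) of k is set, for k = 0, ..., 2**n - 1."""
    return [frozenset(i for i in range(1, n + 1) if (k >> (i - 1)) & 1)
            for k in range(2 ** n)]

def utility_as_vector(v, n):
    """View a utility function v: 2^[n] -> R as a vector in R^{2^n}
    with coordinates listed in the binary ordering."""
    return np.array([v(S) for S in binary_ordering(n)])

# For n = 3: [set(), {1}, {2}, {1,2}, {3}, {1,3}, {2,3}, {1,2,3}]
print([set(S) for S in binary_ordering(3)])
```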
The advantage of this ordering is that it aligns well with the Kronecker product, so that we can parameterize the covariance matrix of each stochastic utility function using just one simple covariance matrix $\Sigma \in \mathbb{R}^{2 \times 2}$.

Assumption 1 (Kronecker noises). Let $2^{[n]}$ be ordered w.r.t. the binary ordering. For each random vector $v_n \in \mathbb{R}^{2^n}$ (equivalently, a stochastic utility function in $G$), we assume $v_n - E[v_n] = \epsilon_n \otimes \epsilon_{n-1} \otimes \cdots \otimes \epsilon_1$ where $(\epsilon_i)_{1 \le i \le n}$ are $n$ independent continuous random vectors in $\mathbb{R}^2$ that have the same positive-definite covariance matrix $\Sigma$.

Proposition 3. Under Assumption 1, we have
$$v_n(S) = E[v_n(S)] + \prod_{i \in S} \epsilon_{i,2} \prod_{i \in [n]\setminus S} \epsilon_{i,1} \quad \text{for every } S \subseteq [n],$$
$$\mathrm{Cov}(v_n) = \underbrace{\Sigma \otimes \Sigma \otimes \cdots \otimes \Sigma}_{n \text{ repetitions}},$$
$$\mathrm{Cov}(v_n(S), v_n(T)) = \sigma_{11}^{|[n]\setminus(S\cup T)|}\, \sigma_{12}^{|(S\setminus T)\cup(T\setminus S)|}\, \sigma_{22}^{|S\cap T|} \quad \text{for every } S, T \subseteq [n].$$

Remark 1. According to Proposition 3, Assumption 1 provides a simple parameterization for the covariance matrix of the stochasticity inherent in $v_n$. Though such a simple model is idealized, it is sufficient for arguing theoretically that there is no single universally most robust semi-value across different stochastic utility functions. Notably, as shown in Table 2, weighted Banzhaf values tend to be the most consistent in ranking data, and the most empirically robust one varies across different experiment settings. Overall, our theory emphasizes that a noise-structure-specific framework is able to provide a finer conclusion on robustness than the previous noise-structure-agnostic framework (Wang and Jia 2023).

Next, we propose to use the entropy of continuous random vectors to define the worst-case uncertainty brought by the parameterized covariance matrix, which is

$$\sup_{v_n \in G_n :\, \mathrm{Cov}(v_n) = \Sigma_{[n]}} h(\phi(v_n; P)) \tag{5}$$

where $P$ is a probability distribution that defines a semi-value via Eq. (3), $h$ is the entropy that measures the uncertainty of continuous random vectors, and $\Sigma_{[n]} = \Sigma \otimes \Sigma \otimes \cdots \otimes \Sigma$ ($n$ repetitions). Intuitively, $\phi(v_n; P)$ tends to be deterministic as $h(\phi(v_n; P))$ decreases to zero. Since Eq. (5) defines the least upper bound on $h(\phi(v_n; P))$, the lower Eq. (5) is, the more robust (equivalently, the more deterministic) the corresponding semi-value is. Therefore, the most robust semi-value is defined to be the one that solves

$$\operatorname*{argmin}_{P \in \mathcal{P}} \;\sup_{v_n \in G_n :\, \mathrm{Cov}(v_n) = \Sigma_{[n]}} h(\phi(v_n; P)) \tag{6}$$

where $\mathcal{P}$ is the set that contains all (Borel-measurable) probability distributions on the interval $[0, 1]$. In other words, we try to determine a semi-value that can best tolerate the largest uncertainty brought by any noise having the covariance matrix $\Sigma_{[n]}$. As one may expect, the most robust semi-value depends on the choice of $\Sigma \in \mathbb{R}^{2 \times 2}$.

Since $\phi(v_n; P)$ is linear in $v_n \in G_n$, the matrix $\mathrm{Cov}(\phi(v_n; P))$ subject to $\mathrm{Cov}(v_n) = \Sigma_{[n]}$ depends only on $\Sigma_{[n]}$ and $P$. Consequently, $\sup_{v_n \in G_n :\, \mathrm{Cov}(v_n) = \Sigma_{[n]}} h(\phi(v_n; P)) = \sup_{Y :\, \mathrm{Cov}(Y) = \Psi} h(Y)$ where $\Psi = \mathrm{Cov}(\phi(v_n; P))$. It is known that $\sup_{Y :\, \mathrm{Cov}(Y) = \Psi} h(Y) = \frac{n}{2}(1 + \log 2\pi) + \frac{1}{2}\log\det\Psi$, and the maximum is achieved when $Y$ is a Gaussian random vector. Besides, $Y$ is Gaussian if $v_n$ is Gaussian because $\phi(v_n; P)$ is linear in $v_n \in G_n$. All in all, the optimization problem (6) equivalently reduces to

$$\operatorname*{argmin}_{P \in \mathcal{P}} \;\det(\mathrm{Cov}(\phi(v_n; P))) \quad \text{s.t.}\quad \mathrm{Cov}(v_n) = \Sigma_{[n]}. \tag{7}$$

Theorem 1. Under Assumption 1, and assuming further that $\sigma_{12} < \min(\sigma_{11}, \sigma_{22})$, the Dirac delta distribution $\delta_{w^*}$ with

$$w^* = \frac{\sigma_{11} - \sigma_{12}}{\sigma_{11} + \sigma_{22} - 2\sigma_{12}} \tag{8}$$

is optimal for problem (7) for every $n \ge 1$. In addition, the optimal solution is unique if $n > 3$.
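As a quick numerical sanity check on Theorem 1 (ours, for illustration only), the following sketch builds the exact linear map of the $w$-weighted Banzhaf value for a small $n$, computes $\det(\mathrm{Cov}(\phi(v_n; \delta_w)))$ under a Kronecker covariance, and confirms that the determinant is minimized near $w^*$ from Eq. (8).

```python
import numpy as np

def weighted_banzhaf_matrix(n, w):
    """Matrix A with phi(v; delta_w) = A @ v_vec, where v_vec lists v(S) in the binary ordering."""
    A = np.zeros((n, 2 ** n))
    for k in range(2 ** n):                  # subset S_k: i in S_k iff bit (i-1) of k is set
        s = bin(k).count("1")
        for i in range(1, n + 1):
            if (k >> (i - 1)) & 1:           # S_k plays the role of S + i  -> positive term
                A[i - 1, k] += w ** (s - 1) * (1 - w) ** (n - s)
            else:                            # S_k plays the role of S      -> negative term
                A[i - 1, k] -= w ** s * (1 - w) ** (n - 1 - s)
    return A

def logdet_cov(n, Sigma, w):
    """log det Cov(phi(v; delta_w)) when Cov(v) is the n-fold Kronecker power of Sigma."""
    Sigma_n = np.array([[1.0]])
    for _ in range(n):
        Sigma_n = np.kron(Sigma, Sigma_n)
    A = weighted_banzhaf_matrix(n, w)
    return np.linalg.slogdet(A @ Sigma_n @ A.T)[1]

n = 5
Sigma = np.array([[1.5, 0.2], [0.2, 0.5]])   # sigma12 < min(sigma11, sigma22)
grid = np.linspace(0.01, 0.99, 197)
best_w = grid[np.argmin([logdet_cov(n, Sigma, w) for w in grid])]
w_star = (Sigma[0, 0] - Sigma[0, 1]) / (Sigma[0, 0] + Sigma[1, 1] - 2 * Sigma[0, 1])
print(best_w, w_star)                        # the grid minimizer should be close to w* from Eq. (8)
```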
Remark 2. Theorem 1 suggests that the 0.5-weighted Banzhaf value (i.e., the Banzhaf value) is the most robust provided that $\sigma_{11} = \sigma_{22}$, which coincides with (Wang and Jia 2023, Theorem 4.6). Specifically, they simulated isotropic Gaussian noises added to deterministic utility functions to verify that the Banzhaf value is the most robust in ranking data, which is also supported by Theorem 1. However, the first column of Figure 3 shows that the Banzhaf value is not always the most robust if the added noises follow Kronecker noises with $\sigma_{11} \ne \sigma_{22}$. In practice, Table 2 also refutes the claim that the Banzhaf value is universally the most robust.

4 Experiments

[Figure 2: Comparison of three estimators for the 0.8-weighted Banzhaf value on the 2dplanes dataset. The shaded region represents the standard deviation over 10 different random seeds.]

Our experiments cover three aspects: 1) the approximation efficiency of the MSR estimator for weighted Banzhaf values; 2) verification of Theorem 1 using synthetic and authentic noises; and 3) an empirical study showing that the most robust/effective semi-values tend to be weighted Banzhaf values. All datasets used are from open sources and are classification tasks. Except for MNIST and FMNIST, each $D_{tr}$ or $D_{val}$ is balanced between classes. Unless explicitly stated otherwise, we set $|D_{val}| = 200$. All utility functions are set to be the accuracy reported on $D_{val}$ with logistic regression models trained on $D_{tr}$, except that we implement LeNet (LeCun et al. 1998) for MNIST and FMNIST. For efficiency, we adopt one-epoch, one-mini-batch learning for training models in all types of experiments (Ghorbani and Zou 2019). More experiment details and results are included in Appendices D and E.

4.1 Approximation Efficiency

The MSR estimator is compared with the reweighted sampling lift and sampling lift estimators. To be able to calculate the exact values, we set $|D_{tr}| = 16$. Besides, the learning rate is set to 1.0. For a fair comparison, we plot the relative difference $\|\hat{\phi} - \phi\|_2 / \|\phi\|_2$ against the average number of utility evaluations, i.e., (#utility evaluations)/n. Figure 2 is plotted with 10 different random seeds used by the estimators. It demonstrates that the MSR estimator is significantly the most efficient.

4.2 Verification of Theorem 1

Experiments based on synthetic and authentic noises are designed to verify Theorem 1. We set $|D_{tr}| = 10$ so that the exact values can be cheaply computed. For the synthetic experiments, we simulate Kronecker noises with each $\epsilon_i$ independently and identically following a Gaussian distribution $N(0, \Sigma)$, which are added to deterministic utility functions. Besides, the learning rate is set to 0.05. Specifically, we fix $\sigma_{11}$ and $\sigma_{22}$ while varying $\sigma_{12}$ along the horizontal axis. For each $\sigma_{12}$, the corresponding added noises are simulated using 10 different seeds. The Spearman's rank correlation coefficient is used to measure the similarity between the rankings of the ground-truth value and any noisy one. The results are reported in the first column of Figure 3, where the curve tagged as the robust semi-value refers to the weighted Banzhaf value predicted by Eq. (8). It is clear that Theorem 1 predicts the most robust semi-value well, and that the Banzhaf value is not universally the most robust under these synthetic noises.

Next, we examine Theorem 1 by fitting a Kronecker noise to the stochasticity inherent in the given utility function, and then computing the predicted most robust semi-value using Eq. (8).
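A minimal sketch of this fitting step is given below (ours, for illustration; the formal objective is stated in the next paragraph). Given an empirical covariance $\hat{D}$ of the utility vector, it optimizes the $2 \times 2$ matrix $\Sigma$ through its Cholesky factor, dropping the KL terms that do not depend on $\Sigma$, and then reads off $w^*$ from Eq. (8); the choice of Nelder-Mead is ours.

```python
import numpy as np
from scipy.optimize import minimize

def kron_power(M, n):
    out = np.array([[1.0]])
    for _ in range(n):
        out = np.kron(M, out)
    return out

def fit_kronecker_noise(D_hat, n):
    """Fit a 2x2 covariance Sigma minimizing KL(N(0, D_hat) || N(0, Sigma^{(x)n}))
    up to additive constants that do not depend on Sigma. Sigma is parameterized
    through its Cholesky factor to stay positive definite."""
    def objective(theta):
        a, b, c = theta
        L = np.array([[np.exp(a), 0.0], [b, np.exp(c)]])
        Sigma = L @ L.T
        Sigma_inv_n = kron_power(np.linalg.inv(Sigma), n)
        trace_term = float(np.sum(Sigma_inv_n * D_hat))  # trace(Sigma_inv_n @ D_hat), both symmetric
        logdet_term = n * 2 ** (n - 1) * np.log(np.linalg.det(Sigma))
        return 0.5 * (trace_term + logdet_term)

    a, b, c = minimize(objective, x0=np.zeros(3), method="Nelder-Mead").x
    L = np.array([[np.exp(a), 0.0], [b, np.exp(c)]])
    return L @ L.T

# Sigma = fit_kronecker_noise(D_hat, n=10)
# w_star = (Sigma[0, 0] - Sigma[0, 1]) / (Sigma[0, 0] + Sigma[1, 1] - 2 * Sigma[0, 1])  # Eq. (8)
```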
To produce simple utility functions, we set $|D_{val}| = 10$. Besides, the learning rate for utility functions is set to 1.0. The underlying covariance matrix is empirically approximated by $\hat{D}$, and then we fit a Kronecker noise by optimizing the Kullback-Leibler divergence between Gaussian distributions, which is

$$\operatorname*{argmin}_{\Sigma \in \mathbb{R}^{2 \times 2}} \; \mathrm{KL}\Big( N(0, \hat{D}) \,\Big\|\, N\big(0, \underbrace{\Sigma \otimes \Sigma \otimes \cdots \otimes \Sigma}_{10 \text{ repetitions}}\big) \Big).$$

The final results, averaged over 10 random seeds, are presented in the last two columns of Figure 3. Each parameter in {0.00, 0.05, 0.10, . . . , 0.95, 1.00} defines a specific weighted Banzhaf value, while {(16, 1), (4, 1), (1, 4), (1, 16)} are the parameters for the Beta Shapley values. The one tagged as robust is generated from the approximate Kronecker noise using Eq. (8). As demonstrated, the predicted robust semi-values do achieve the best results in both ranking consistency and variance.

[Figure 3 (panels: spambase (synthetic), iris (authentic), phoneme (authentic)). (a) Synthetic noises: each point represents the Spearman's rank correlation coefficient averaged over 10 random seeds. For each $\sigma_{12}$, the robust semi-value is the weighted Banzhaf value parameterized by $(\sigma_{11} - \sigma_{12})/(\sigma_{11} + \sigma_{22} - 2\sigma_{12})$, as predicted by Theorem 1. For the first row, we set $\sigma_{11} = 1.5$ and $\sigma_{22} = 0.5$, while it is $\sigma_{11} = 0.5$ and $\sigma_{22} = 1.5$ for the other. (b) Authentic noises: ranking consistency is measured by the mean and standard deviation of the Spearman's rank correlation coefficients between all pairs of 10 random seeds. The variance refers to $\frac{1}{10}\sum_{i=1}^{10} \|\hat{\phi}_i - m\|_2^2$ where $m = \frac{1}{10}\sum_{i=1}^{10} \hat{\phi}_i$. The computed robust parameters are 0.5255 and 0.5341 for the iris and phoneme datasets, respectively.]

4.3 Empirical Study

Following the previous study (Wang and Jia 2023), we conduct an exploratory study on 23 datasets, including two types of experiment: 1) noisy label detection and 2) ranking consistency. We fix $|D_{tr}| = 1{,}000$ for all datasets, except that it is $|D_{tr}| = 2{,}000$ for MNIST and FMNIST, which means the induced stochasticity inherent in utility functions is more complicated than Assumption 1. For each dataset, the learning rate of the chosen model is individually fine-tuned. The randomness in each utility function stems from the underlying random seed that determines the order of feeding training data as well as the initialization of models. All results are reported using the mean and standard deviation over 10 random seeds. For each estimator, the total number of utility evaluations is set to 400,000. The permutation estimator (Castro et al. 2009) is used for the Shapley value, the reweighted sampling lift estimator for Beta Shapley, and the MSR estimator for weighted Banzhaf values. Specifically, the 0-weighted Banzhaf value and the 1-weighted Banzhaf value require only $n + 1$ utility evaluations to compute exactly, and thus we do not use any estimators for them.

Table 2: Ranking consistency. For weighted Banzhaf values, we only report the best one from the parameter list {0.00, 0.05, 0.10, . . . , 0.95, 1.00}. The first value in the column of Weighted Banzhaf is the parameter that achieves the highest ranking consistency. Boldface numbers mark the best. The number following each dataset refers to the number of classes used in the experiments.
Dataset Weighted Banzhaf Banzhaf Beta(16, 1) Beta(4, 1) Beta(1, 4) Beta(1, 16) Shapley MNIST (10) 0.20 | 0.680 0.120 0.581 0.075 0.073 0.082 0.020 0.043 0.004 0.039 0.009 0.037 0.011 0.019 FMNIST (10) 0.30 | 0.670 0.042 0.562 0.040 0.017 0.092 0.007 0.046 0.001 0.035 0.009 0.029 0.002 0.021 2dplanes (2) 0.80 | 0.950 0.012 0.839 0.091 0.424 0.155 0.358 0.133 0.046 0.087 0.001 0.130 0.256 0.083 bank-marketing (2) 0.75 | 0.895 0.028 0.814 0.092 0.599 0.096 0.445 0.093 0.015 0.094 0.041 0.128 0.226 0.033 bioresponse (2) 0.00 | 0.997 0.000 0.948 0.003 0.193 0.032 0.169 0.030 0.016 0.037 0.004 0.034 0.122 0.021 covertype (7) 0.35 | 0.840 0.033 0.832 0.032 0.618 0.066 0.490 0.051 0.035 0.134 0.028 0.227 0.234 0.026 cpu (2) 0.20 | 0.926 0.011 0.874 0.008 0.549 0.096 0.298 0.084 0.003 0.132 0.004 0.158 0.142 0.028 default (2) 0.30 | 0.953 0.011 0.941 0.007 0.542 0.078 0.369 0.059 0.022 0.090 0.004 0.131 0.170 0.034 gas (6) 0.05 | 0.959 0.009 0.729 0.019 0.204 0.076 0.105 0.057 0.003 0.068 0.017 0.094 0.063 0.032 har (6) 0.05 | 0.956 0.003 0.741 0.012 0.133 0.063 0.074 0.057 0.004 0.050 0.001 0.058 0.051 0.031 letter (26) 0.50 | 0.966 0.004 0.966 0.004 0.530 0.048 0.346 0.065 0.021 0.110 0.015 0.148 0.197 0.031 optdigits (10) 0.25 | 0.923 0.010 0.898 0.009 0.687 0.039 0.485 0.049 0.024 0.076 0.004 0.145 0.278 0.030 pendigits (10) 0.05 | 0.947 0.014 0.806 0.010 0.384 0.100 0.259 0.085 0.031 0.096 0.024 0.142 0.113 0.024 phoneme (2) 0.85 | 0.783 0.159 0.588 0.309 0.601 0.141 0.499 0.148 0.028 0.125 0.008 0.153 0.325 0.105 pol (2) 0.25 | 0.970 0.003 0.938 0.003 0.479 0.108 0.320 0.105 0.066 0.144 0.033 0.158 0.186 0.026 satimage (6) 0.30 | 0.863 0.024 0.814 0.028 0.341 0.135 0.189 0.087 0.004 0.086 0.006 0.087 0.100 0.033 segment (7) 0.25 | 0.941 0.012 0.883 0.006 0.361 0.118 0.213 0.069 0.001 0.102 0.005 0.149 0.113 0.030 spambase (2) 0.20 | 0.939 0.007 0.876 0.007 0.466 0.073 0.336 0.070 0.001 0.094 0.004 0.148 0.194 0.030 texture (11) 0.05 | 0.961 0.011 0.790 0.009 0.217 0.073 0.109 0.050 0.003 0.072 0.020 0.080 0.049 0.032 wind (2) 0.00 | 0.793 0.111 0.720 0.036 0.398 0.091 0.215 0.078 0.004 0.079 0.012 0.114 0.116 0.028 Table 3: Noisy label detection. For weighted Banzhaf values, we only report the best one from the parameter list {0.00, 0.05, 0.10, . . . , 0.95, 1.00}. The first value in the column of Weighted Banzhaf is the parameter that achieves the highest F1-score. Boldface numbers mark the best. The number following each dataset refers to the number of classes used in the experiments. 
Dataset Weighted Banzhaf Banzhaf Beta(16, 1) Beta(4, 1) Beta(1, 16) Beta(1, 4) Shapley MNIST (10) 0.50 | 0.625 0.040 0.625 0.040 0.228 0.023 0.226 0.017 0.210 0.013 0.207 0.013 0.229 0.020 FMNIST (10) 0.25 | 0.496 0.058 0.448 0.027 0.217 0.019 0.223 0.017 0.208 0.015 0.209 0.016 0.212 0.016 2dplanes (2) 0.05 | 0.609 0.059 0.346 0.014 0.572 0.052 0.532 0.055 0.219 0.035 0.210 0.030 0.412 0.053 bank-marketing (2) 0.05 | 0.528 0.015 0.460 0.014 0.487 0.025 0.452 0.026 0.247 0.024 0.226 0.020 0.368 0.025 bioresponse (2) 0.05 | 0.502 0.012 0.481 0.011 0.323 0.031 0.316 0.030 0.221 0.022 0.204 0.027 0.293 0.025 covertype (7) 0.05 | 0.518 0.026 0.318 0.025 0.481 0.025 0.398 0.026 0.217 0.014 0.212 0.023 0.318 0.032 cpu (2) 0.05 | 0.772 0.017 0.648 0.015 0.682 0.031 0.508 0.045 0.242 0.028 0.212 0.031 0.390 0.027 default (2) 0.05 | 0.398 0.010 0.346 0.011 0.355 0.019 0.316 0.019 0.203 0.022 0.211 0.023 0.286 0.029 gas (6) 0.10 | 0.767 0.018 0.731 0.021 0.402 0.030 0.358 0.020 0.220 0.018 0.209 0.025 0.321 0.026 har (6) 0.10 | 0.869 0.008 0.809 0.013 0.388 0.028 0.355 0.025 0.222 0.023 0.210 0.022 0.328 0.021 letter (26) 0.05 | 0.552 0.022 0.467 0.009 0.444 0.038 0.384 0.022 0.238 0.030 0.226 0.030 0.336 0.022 optdigits (10) 0.05 | 0.852 0.011 0.701 0.016 0.739 0.025 0.652 0.023 0.248 0.030 0.218 0.033 0.513 0.033 pendigits (10) 0.35 | 0.822 0.006 0.814 0.008 0.590 0.056 0.489 0.037 0.240 0.027 0.224 0.026 0.390 0.027 phoneme (2) 0.05 | 0.492 0.021 0.382 0.023 0.482 0.030 0.450 0.034 0.242 0.048 0.225 0.033 0.376 0.047 pol (2) 0.05 | 0.646 0.011 0.554 0.009 0.498 0.035 0.476 0.037 0.275 0.046 0.248 0.045 0.388 0.030 satimage (6) 0.05 | 0.604 0.082 0.476 0.037 0.472 0.055 0.398 0.055 0.225 0.021 0.209 0.022 0.342 0.031 segment (7) 0.15 | 0.763 0.015 0.712 0.014 0.502 0.042 0.452 0.042 0.249 0.027 0.228 0.031 0.351 0.021 spambase (2) 0.05 | 0.757 0.012 0.717 0.008 0.586 0.057 0.511 0.033 0.243 0.036 0.223 0.038 0.412 0.027 texture (11) 0.15 | 0.834 0.009 0.762 0.013 0.458 0.041 0.388 0.036 0.210 0.018 0.194 0.034 0.315 0.028 wind (2) 0.05 | 0.686 0.016 0.618 0.019 0.520 0.026 0.428 0.044 0.227 0.025 0.222 0.031 0.359 0.020 Ranking Consistency In data valuation, it could be the ranking of data that matters the most, but the randomness of utility functions may produce inconsistent rankings. For each semi-value, the averaged Spearman s rank correlation coefficients over all possible pairs, which is 10 2 = 45 in total, are calculated to measure ranking consistency. The results are summarized in Table 2, while figures with more details are included in Appendix E. As presented, it is mostly some member of weighted Banzhaf values, not always the Banzhaf value, that achieves the highest ranking consistency compared with the baselines. Noisy Label Detection Notice that a more robust semi-value does not necessarily produce a higher quality of data valuation. This experiment is to present how well weighted Banzhaf values would achieve in noisy label detection. For each dataset, we randomly flip the labels of 20 percent of data in Dtr to be any of the rest in a uniform manner. Those flipped data are treated as having noisy/incorrect labels. Particularly, such a flipped Dtr is also used for reporting ranking consistency in Table 2. Then, each estimated semi-value is used to detect noisy labels by marking data assigned with values of importance less than or equal to 20 percent percentile as mislabeled. The results are summarized in Table 3, and figures that contain more details are included in Appendix E. 
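The detection protocol just described can be summarized by a short sketch (ours, for illustration; it assumes the estimated values and the ground-truth mask of flipped labels are available).

```python
import numpy as np
from sklearn.metrics import f1_score

def noisy_label_f1(values, flipped_mask, quantile=20):
    """Flag data whose estimated value is at or below the given percentile of all
    values, and score the flags against the ground-truth flipped labels with F1."""
    threshold = np.percentile(values, quantile)
    flagged = values <= threshold
    return f1_score(flipped_mask, flagged)
```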
As observed, the best performance is mostly achieved by weighted Banzhaf values, and most of them are not the Banzhaf value. In both types of experiments, it manifests itself that the semi-value that achieves the best in ranking consistency does not coincide with the one that achieves the highest F1-score on each dataset. Nevertheless, the detailed figures in Appendix E show that weighted Banzhaf values tend to be more robust than the Beta Shapley values, which include the Shapley value. Moreover, we can conclude that there is no single universally the best semi-value across various experiment settings in terms of either ranking consistency or noisy label detection. For each utility function, it would be useful if the best semi-value can be effectively determined in advance. 5 Conclusion We introduce a noise-structure-specific framework to theoretically assert that it is always a member of weighted Banzhaf values that is deemed the most robust in terms of minimizing the worst-case entropy. Theorem 1 shows that the most robust semi-value would vary across different experiment settings. Our extensive empirical study demonstrates that weighted Banzhaf values are promising when dealing with inevitable noises inherent in utility functions. Particularly, all experiments suggest that there is no universally the best or the most robust semi-value for all experiment settings. Therefore, future investigation is to set up principled approaches that can pin down the most robust or the most effective value in advance. 6 Limitations The concept of Kronecker noise is still ideal for fitting the stochasticity in practice. The verification of Theorem 1 using fitted Kronecker noises can only be done based on simple utility functions, as one may expect. Nevertheless, the theoretical and empirical evidence is sufficient to argue that there is no single universally the most robust or the most effective semi-value across various experiment settings, and that weighted Banzhaf values are promising in practice. Acknowledgments and Disclosure of Funding We thank the reviewers and the area chair for thoughtful comments that have improved our final draft. YY gratefully acknowledges NSERC and CIFAR for funding support. Banzhaf III, J. F. (1965). Weighted Voting Doesn t Work: A Mathematical Analysis . Rutgers Law Review, vol. 19, pp. 317 343. Castro, J., D. Gómez, and J. Tejada (2009). Polynomial Calculation of the Shapley Value Based on Sampling . Computers & Operations Research, vol. 36, no. 5, pp. 1726 1730. Covert, I. and S.-I. Lee (2021). Improving Kernel SHAP: Practical Shapley Value Estimation via Linear Regression . In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 3457 3465. Ding, G., R. F. Lax, J. Chen, P. P. Chen, and B. D. Marx (2010). Transforms of Pseudo-Boolean Random Variables . Discrete Applied Mathematics, vol. 158, no. 1, pp. 13 24. Domenech, M., J. M. Giménez, and M. A. Puente (2016). Some Properties for Probabilistic and Multinomial (Probabilistic) Values on Cooperative Games . Optimization, vol. 65, no. 7, pp. 1377 1395. Dragan, I. (2006). On the Semivalues and the Least Square Values Average Per Capita Formulas and Relationships . Acta Mathematica Sinica, vol. 22, pp. 1539 1548. Dubey, P., A. Neyman, and R. J. Weber (1981). Value Theory without Efficiency . Mathematics of Operations Research, vol. 6, no. 1, pp. 122 128. Frénay, B. and M. Verleysen (2013). Classification in the Presence of Label Noise: A Survey . 
IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845 869. Friedman, E. J. (2003). Paths and Consistency in Additive Cost Sharing . International Journal of Game Theory, vol. 32, no. 4, pp. 501 518. Ghorbani, A. and J. Zou (2019). Data Shapley: Equitable Valuation of Data for Machine Learning . In: International Conference on Machine Learning. PMLR, pp. 2242 2251. Hammer, P. L. and R. Holzman (1992). Approximations of Pseudo-Boolean Functions; Applications to Game Theory . Zeitschrift für Operations Research, vol. 36, no. 1, pp. 3 21. Jia, R., D. Dao, B. Wang, F. A. Hubis, N. M. Gurel, B. L. C. Zhang, and C. S. D. Song (2019a). Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms . Proceedings of the VLDB Endowment, vol. 12, no. 11. Jia, R. et al. (2019b). Towards Efficient Data Valuation Based on the Shapley Value . In: The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, pp. 1167 1176. Kwon, Y. and J. Zou (2022). Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning . In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 8780 8802. Le Cun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-Based Learning Applied to Document Recognition . Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 2324. Lin, J., A. Zhang, M. Lécuyer, J. Li, A. Panda, and S. Sen (2022). Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments . In: International Conference on Machine Learning. PMLR, pp. 13468 13504. Lundberg, S. M. and S.-I. Lee (2017). A Unified Approach to Interpreting Model Predictions . Advances in Neural Information Processing Systems, vol. 30. Marichal, J.-L. and P. Mathonet (2011). Weighted Banzhaf Power and Interaction Indexes Through Weighted Approximations of Games . European Journal of Operational Research, vol. 211, no. 2, pp. 352 358. Moehle, N., S. Boyd, and A. Ang (2022). Portfolio Performance Attribution via Shapley Value . Journal of Investment Management, vol. 20, no. 3, pp. 33 52. Owen, G. (1972). Multilinear Extensions of Games . Management Science, vol. 18, no. 5-part-2, pp. 64 79. Shapley, L. S. (1953). A Value for N-Person Games . Annals of Mathematics Studies, vol. 28, pp. 307 317. Steinhardt, J., P. W. Koh, and P. Liang (2017). Certified Defenses for Data Poisoning Attacks . In: Advances in Neural Information Processing Systems. Vol. 30. Wang, J. T. and R. Jia (2023). Data Banzhaf: A Robust Data Valuation Framework for Machine Learning . In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 6388 6421. Weber, R. J. (1977). Probabilistic Values for Games . The Shapley Value. Essays in Honor of Lloyd S. Shapley, pp. 101 119. Table of Contents A The Axiom of Equal Treatment 12 B Proofs for the Propositions 13 C Prerequisites and Proof for Theorem 1 15 C.1 Extend Semi-values to Be Semi-indices . . . . . . . . . . . . . . . . . . . . . . 15 C.2 Interaction Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.3 Matrix Representation for Semi-indices . . . . . . . . . . . . . . . . . . . . . . 18 C.4 A Theorem of Robustness for Semi-indices . . . . . . . . . . . . . . . . . . . . 19 C.5 Proof for Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.6 Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 D Experiment Settings 26 D.1 Fitting Kronecker Noises . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . 26 D.2 Sampling Lift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 E More Experiment Results 27 A The Axiom of Equal Treatment In the existing references, the axiom of symmetry is sometimes used to refer to the axiom of equal treatment, but they are not always interchangeable. Equal Treatment: ϕi(vn) = ϕj(vn) if vn(S i) = vn(S j) for every S [n]\{i, j}. The axiom of equal treatment is not always equal to the axiom of symmetry. As a counterexample, pick n = 2, and consider a value ϕ defined by ϕ1(v) = 1 2v(2) and ϕ2(v) = 1 3v(2). The reader can verify that it satisfies the axiom of equal treatment but violates the axiom of symmetry (e.g., take v(1) = 3 and v(2) = 6). Nevertheless, they are equivalent in some contexts. Recall that a weaker version of the axiom of dummy is Null: ϕi(vn) = 0 if vn(S i) = vn(S) for every S [n]\i. Proposition 4. Suppose a value ϕ verifies the axioms of linearity and null, then ϕ verifies the axiom of symmetry if and only if it verifies the axiom of equal treatment. Proof. The only if part is trivial as the axiom of symmetry always implies the axiom of equal treatment, and thus we only prove the if part. Let n 1 be fixed, since ϕ verifies the axioms of linearity and null, by (Weber 1977, Theorem 2),6 we have S [n]\i pi S (vn(S i) vn(S)) for every vn Gn and i [n]. For every S N, we define u S Gn by letting u S(T) = 1 if T = S and 0 otherwise. For every i = j and S [n]\ij, by the axiom of equal treatment, there is pi S = ϕi(u S i + u S j + u S ij) = ϕj(u S i + u S j + u S ij) = pj S. (9) On the other hand, for every S with s > 1, and every i, j S, there is pi S\i = ϕi(u S) = ϕj(u S) = pj S\j. (10) 6By using the axiom of null instead of the axiom of dummy, the only difference is that there is no P S [n]\i pi S = 1. Combing Eqs. (9) and (10), we can get pi S = ps for every i [n] and every S [n]\i. To demonstrate, take p1 {2,3} and p4 {5,6} as example, there is p1 {2,3} = p4 {2,3} = p2 {4,3} = p5 {4,3} = p3 {4,5} = p6 {4,5} = p4 {5,6}. Therefore, the equality pi S = ps for every i [n] and every S [n]\i implies the axiom of symmetry. B Proofs for the Propositions Proposition 2. Assume vn r for every vn G. The estimator based on Eq.(4) requires O( n δ ) utility evaluations to achieve P( ˆϕ(vn; δw) ϕ(vn; δw) 2 ϵ) δ for every utility function vn G. Proof. Note that it only requires n + 1 utility evaluations for exactly yielding 0 and 1-weighted Banzhaf values. Therefore, the MSR estimator is designed for 0 < w < 1. Our proof is adapted from (Wang and Jia 2023, Theorem 4.9). Let vn G and i [n] be fixed, and by assumption vn r. According to the underlying sampling for {Sj}, there is |S i| binomial(w). Let a bridge estimator defined by ϕi(vn; δw) = 1 wm S S i vn(S) 1 (1 w)m S S i vn(S). For simplicity, we write ϕi, ˆϕi and ϕi instead of ϕi(vn; δw), ˆϕi(vn; δw) and ϕi(vn; δw). Plus, we also have s i = |S i| and s i = |S i|. First of all, we prove that E[ϕi] = ϕi. Suffice it to prove that E[ci(S)vn(S)] = ϕi where ci(S) = 1 w if i S and 1 (1 w) otherwise. For every S [n]\i, w = ws+1(1 w)n 1 s w = ws(1 w)n 1 s, 1 w = ws(1 w)n s 1 w = ws(1 w)n 1 s. On the other hand, ϕi = P S [n]\i ws(1 w)n 1 s(vn(S i) vn(S)). Next, we prove the convergence for the MSR estimator. Since E[ϕi] = ϕi, by the Hoeffding s inequality, P(|ϕi ϕi| ϵ) 2 exp mϵ2 where T = max( 1 w, 1 1 w). 
Assume s i > 0 and s i > 0, note that s i + s i = m, there is |ˆϕi ϕi| = | 1 S S i vn(S) + 1 s i 1 (1 w)m S S i vn(S) r wm |s i wm| + r (1 w)m |s i (1 w)m| = r wm |s i wm| + r (1 w)m |s i wm| = r w(1 w)m |s i wm| . Note that the first inequality in the above still holds if s i = 0 or s i = 0. Since s i binomial(w), by the Hoeffding s inequality, there is P(|s i wm| ) 2 exp 2 2 Therefore, |ˆϕi ϕi| < r w(1 w)m provided that |s i wm| < . Then, P(|ˆϕi ϕi| ϵ) = P(|ˆϕi ϕi| ϵ |s i wm| < ) + P(|ˆϕi ϕi| ϵ |s i wm| ) P(|ˆϕi ϕi| ϵ | |s i wm| < ) + 2 exp 2 2 P(|ϕi ϕi| ϵ r w(1 w)m | |s i wm| < ) + 2 exp 2 2 P(|ϕi ϕi| ϵ r w(1 w)m) 1 2 exp 2 2 m + 2 exp 2 2 2 exp m(ϵ r w(1 w)m) 2 1 2 exp 2 2 m + 2 exp 2 2 m ϵ r w(1 w)m 2 + 2 exp 2 2 where the last inequality is due to that 1 2 exp 2 2 3 if m is sufficiently large. Solving the equation m(ϵ r w(1 w)m) 2 2r2T 2 = 2 2 m yields = w(1 w)mϵ r(2T w 2T w2+1). Specifically, 2Tw 2Tw2 + 1 = 2|w 0.5|+2, and thus its range is [2, 3) provided that w (0, 1), which indicates ϵ r w(1 w)m ϵ So, we have 2 2 m = Cmϵ2 where C = 2w2(1 w)2 r2(2|w 0.5|+2)2 . Eventually, P(|ˆϕi ϕi| ϵ) 5 exp Cmϵ2 provided that 1 2 exp 2 2 3, or equivalently m log 6 Cϵ2 . Then, P( ˆϕ ϕ 2 ϵ) P( [ i [n] |ˆϕi ϕi| ϵ n) 5n exp Cmϵ2 Solving 5n exp Cmϵ2 n δ leads to m n Cϵ2 log 5n δ . Eventually, we conclude that the MSR estimator requires Cϵ2 , n Cϵ2 log 5n samples (equivalently, utility evaluations) to achieve P( ˆϕ ϕ 2 ϵ) δ. Proposition 3. Under Assumption 1, there is vn(S) = E[vn(S)] + Y i S ϵi,1, for every S [n], Cov(vn) = Σ Σ Σ (n repetitions), Cov(vn(S), vn(T)) = σ|[n]\(S T )| 11 σ|(S\T ) (T \S)| 12 σ|S T | 22 for every S, T [n] Proof. By Assumption 1, vn E[vn] = ϵn ϵn 1 ϵ1 such that (ϵi)1 i n are independent continuous random vectors in R2 that have the same covariance matrix Σ. For simplicity, we write ϵ[n] = ϵn ϵn 1 ϵ1. Since (A B)(C D) = (AC) (BD), there is ϵ[n](ϵ[n]) = (ϵnϵ n ) (ϵn 1ϵ n 1) (ϵ1ϵ 1 ). By the independence, we have Cov(vn) = Cov(ϵ[n]) = n times z }| { Σ Σ Σ . (11) Next, we prove the rest by induction. It is clear that it holds for n = 1. Suppose it holds for n = k > 0, i.e., , vk(S) = E[vk(S)] + Y i S ϵi,1 for every S [k], Cov(vk(S), vk(T)) = σ|[k]\(S T )| 11 σ|(S\T ) (T \S)| 12 σ|S T | 22 for every S, T [k] (12) and we proceed for n = k + 1. Suppose the binary ordering for {S}S [k] is S1, S2, , S2k, then, the binary ordering for {S}S [k+1] is S1, S2, . . . , S2k, S1 {k + 1}, S2 {k + 1}, . . . , S2k {k + 1}, (13) which aligns well with ϵk+1 (ϵk ϵk 1 ϵ1). Thus, we have vk+1(S) = E[vk+1(S)] + Q i S ϵi,1 for every S [k + 1]. On the other hand, write C[k] = Cov(vk), by Eq. (11), we have C[k+1] = σ11C[k] σ12C[k] σ12C[k] σ22C[k]. Let S, T [k + 1] be given, there are four cases in total: i) S, T [k]; ii) S, T [k]; iii) S [k], T [k]; and iv) S [k], T [k]. For the first case, combing Eqs. (13), (14) and (12), there is C[k+1](S, T) = σ11C[k](S, T) = σ|[k+1]\(S T )| 11 σ|(S\T ) (T \S)| 12 σ|S T | 22 . The rest of the cases can be done similarly. C Prerequisites and Proof for Theorem 1 To prove Theorem 1, our first step is to lift each ϕ(vn; P) into R2n w.r.t. the binary ordering. Precisely, we extend each semi-value ϕP , a shorthand for ϕ( ; P), to be a semi-index IP that maps each v Gn (for every n 1) into another utility function in Gn, whose matrix representations on all Gn are inductive. Then, we prove a theorem of robustness for semi-indices, which is then used to prove Theorem 1. 
C.1 Extend Semi-values to Be Semi-indices Index is another concept developed in cooperative game theory that serves as an extension for value (Fujimoto et al. 2006; Grabisch and Roubens 1999), which has also been employed in the community of machine learning (Sundararajan et al. 2020; Zhang et al. 2021). Specifically, an index I is a function that maps each utility function v Gn (for every n 1) into another utility function in Gn. Note that each utility function v Gn can be viewed as a vector indexed by all subsets of [n]. In other words, I is to impute contributions to all possible subsets of [n], not just to each datum in [n], so as to account for the contributions of interactions among all data [n]. For convenience, we write I(vn, S) for the value of I(vn) indexed at S. Notably, the axiom of recursion was proposed by Grabisch and Roubens (1999) to extend the Banzhaf and Shapley values into the Banzhaf and Shapley indices, respectively. To state the axiom of recursion, we need further definitions. Suppose v Gn and S [n] are given. The restriction of v on S, denoted by v S, is defined by v S(T) = v(T), T S. (15) In other words, the domain 2[n] of v is restricted to be 2S. Plus, let j [n]\S be given, the restriction of v on S in the presence of j, denoted by v j S , is defined by v j S (T) = v(T j), T S. (16) Remark 3. The definition of v j S is not the same as the one defined by Grabisch and Roubens (1999), which is instead v j S (T) = v(T j) v(j), T S. The term v(j) is only to make v j S ( ) = 0, and we comment that it is not necessary for all the results therein. Moreover, this extra term will make Eq. (17) false. 1. Recursion: For every v Gn, S [n] with s > 1, and j S, there is I(v, S) = I(v j [n]\j, S\j) I(v[n]\j, S\j). The axiom of recursion dictates that the contribution imputed to the subset S should be the difference of the contributions imputed to S\j in the presence and absence of any player j S. So, starting from IP (vn, i) = ϕi(vn; P) for every vn G and i [n], the axiom of recursion can be used to define IP (vn, S) recursively for every non-empty subsets S [n]. Particularly, the defined value of I(vn, S) is independent of the choice of j S. One thing that remains is to determine the contribution for the empty subset. For each semi-value ϕP , we argue that it should be what is stated as the axiom of empty. 2. Empty: To extend a semi-value ϕP to be a semi-index IP , there should be IP (vn, ) = P S [n] pn+1 s+1 v(S) for every vn G. Recall that pn+1 s+1 = R 1 0 ts(1 t)n sd P(t) for every 0 s n. Observe that P S [n] pn+1 s+1 = Pn s=0 n s pn+1 s+1 = 1, the imposed value for IP (vn, ) is just the expected value of vn w.r.t. the probability distribution (pn+1 s+1 )S [n] induced by the underlying P P. One can verify the axiom of empty consistently extends the axiom of recursion, i.e., IP (v, i) = IP (v i [n]\i, ) IP (v[n]\i, ) for every v Gn and i [n]. (17) Definition 1 (Semi-indices). A semi-index IP is extended from a semi-value ϕP using the axioms of recursion and empty. Remark 4. We can therefore extend each w-weighted Banzhaf value to be w-weighted Banzhaf index. Particularly, 0-weighted Banzhaf index is exactly the Möbius transform Eq. (18), while 1-weighted Banzhaf index is instead the co-Möbius transform (Grabisch et al. 2000, Eq. (12)). In addition, extending the Shapley value and the Banzhaf value in this way will yield the Shapley interaction index and the Banzhaf interaction index presented in (Grabisch and Roubens 1999, Theorem 3). 
Proposition 5 (Formulas for Semi-indices). For each semi-index IP , there is IP (vn, S) = X R [n]\S pn+1 s r+1 Svn(R) for every vn G and S [n] where pm j = Z 1 0 tj 1(1 t)m jd P(t) for 1 j m and Svn(R) = X T S ( 1)s tvn(R T). Proof. For simplicity, we write v instead of vn. It is clear that it holds when s = 0 and s = 1. Assume it holds for all S such that |S| = k. Let S [n] that satisfies |S| = k + 1, and some i S be given, there is =IP (v i [n]\i, S\i) IP (v[n]\i, S\i) the axiom of recursion R [n]\S pn k r+1 S\iv i [n]\i(R) X R [n]\S pn k r+1 S\iv[n]\i(R) by inductive assumption R [n]\S pn k r+1 ( S\iv(R i) S\iv(R)) R [n]\S pn+1 s r+1 Sv(R). Note the above reasoning is independent of the choice of i S. The reader can also verify that each semi-index IP satisfies the following axioms, the first three of which are often laid out for how indices should be in the fields of cooperative game theory and machine learning (Fujimoto et al. 2006; Grabisch and Roubens 1999; Sundararajan et al. 2020). 3. Linearity: IP (a un + vn, S) = a IP (un, S) + IP (vn, S) for every vn, un G, a R and S [n]. 4. Dummy Interaction: If S [n] with s > 1 contains a datum i such that for some constant C, vn(T i) = vn(T) + C for all T [n]\i, then IP (vn, S) = 0. 5. Symmetry: IP (vn π, S) = IP (vn, π(S)) for every utility function vn G, every permutation π of [n] and every S [n]. 6. Consistency: For each utility function vn G, its extension vn+1 is defined by vn+1(S) = vn(S [n]) for every S [n + 1]. Then, IP (vn+1, S) = IP (vn, S) for every S [n]. C.2 Interaction Transform The concept of interaction transform is crucial to establish well-structured matrix representations for each semi-index IP . Specifically, Definitions 3 and 4 are inspired by (Denneberg and Grabisch 1999). Definition 2 (Möbius Transform). The Möbius transform µ is a function that maps each utility function v Gn (for every n 1) into another utility function in Gn. For each vn G, we write µ(vn, S) for the value of µ(vn) indexed at S [n]. The definition is µ(vn, S) = X R S ( 1)s rvn(R) for every S [n]. (18) Definition 3 (Convolution). The convolution ι u between each ι RN and each u Gn is defined by (ι u)(S) = X [n] T S ι(t s)u(T) for every S [n]. (19) In other words, each ι RN may be seen as a function that maps each utility function v Gn (for every n 1) into another utility function in Gn, and thus we may write ι(u) instead of ι u. Definition 4 (Interaction Transform). For each semi-index IP , its interaction transform is defined to be a vector ιP RN that satisfies IP = ιP µ. Proposition 6 (Interaction Transforms for Semi-indices). For each semi-index IP , its interaction transform ιP RN is ιP (i) = Z 1 0 tid P(t) for every i N. (20) Here, ιP (i) refers to the i-th entry of ιP . Proof. Let v Gn be given, and for simplicity, we write µ instead for µ(v). For every S [n], there is IP (v, S) = X T [n]\S pn+1 s t+1 Sv(T) = X T [n]\S pn+1 s t+1 X R T [n]\S pn+1 s t+1 µ(S R) = X W \S T [n]\S pn+1 s t+1 µ(W) pn+1 s w s+i+1 We refer the reader to (Grabisch et al. 2000, Eq. (10)) for the equality Sv(T) = P R T µ(S R) used in the second equality. Since pm k = R 1 0 tk 1(1 t)m k for 1 k m, there is pn+1 s w s+i+1 = Z 1 0 tw sd P(t). Therefore, we have IP (v, S) = X 0 tw sd P(t) µ(W) = X [n] W S ιP (w s)µ(W) = [ιP µ(v)](S). C.3 Matrix Representation for Semi-indices From now on, let 2[n] be ordered w.r.t. the binary ordering. Then, vn G is a vector in R2n, and so is IP (vn). Since IP is linear in vn Gn, IP has a matrix representation on each Gn. 
Proposition 7 (Matrix Representations for Semi-indices). Let 2[n] be ordered using the binary ordering, the matrix representation of semi-index IP on Gn w.r.t. the standard basis is M[n] P = Z 1 0 B[n](t)d P(t) n repetitions z }| { B[1](t) B[1](t) B[1](t) and B[1](t) = 1 t t 1 1 Here, is the Kronecker product. Proof. Since IP = ιP µ, suffice it to find the matrix representations for both µ and ιP w.r.t. the standard basis in Gn. As provided by Grabisch et al. (2000, Eq. (45)), which can be verified by induction using Eq. (13), the matrix representation for the Möbius transform µ on Gn w.r.t. the standard basis is n repetitions z }| { W[1] W[1] W[1] where W[1] = 1 0 1 1 We first prove by induction that the matrix representation of ιP on Gn w.r.t. the standard basis is T[n] P = Z 1 0 A[n](t)d P(t) (21) n repetitions z }| { A[1](t) A[1](t) A[1](t) and A[1](t) = 1 t 0 1 It is clear that it holds for n = 1. Assume it holds for some k > 0, i.e., for every u[k] Gk ιP u[k] = T[k] P u[k], (22) and we proceed with k + 1. Let u[k+1] Gk+1 and S [k + 1] be given, there are two cases: i) S [k] and ii) S [k]. If S [k], by Eq. (19), (ιP u[k+1])(S) [k+1] T S ιP (t s)u[k+1](T) [k] T S ιP (t s)u[k+1](T) + X [k] T S ιP (t s)u[k+1](T) [k] T S ιP (t s)u[k+1] [k] (T) + X [k] T S ιP (t s)u[k+1](T) by Eq. (15) =[T[k] P u[k+1] [k] ](S) + X [k] T S ιP (t s)u[k+1](T) by Eq. (22) =[T[k] P u[k+1] [k] ](S) + X [k] R S ιP (r + 1 s)u[k+1], {k+1} [k] (R) by R = T\{k + 1} and Eq. (16) =[T[k] P u[k+1] [k] ](S) + Z 1 0 t A[k](t)d P(t) u[k+1], {k+1} [k] (S) by Eqs. (20), (21) and (22) =[T[k+1] P u[k+1]](S) by Eq. (13). The remaining case can be done in a similar fashion. After all, the matrix representation of ιP on Gn w.r.t. the standard basis is M[n] P = T[n] P W[n] n repetitions z }| { A[1](t) A[1](t) A[1](t) d P(t) ( n repetitions z }| { W[1] W[1] W[1]) 0 (A[1](t) A[1](t) A[1](t)) (W[1] W[1] W[1])d P(t) 0 B[1](t) B[1](t) B[1](t)d P(t) where B[1](t) = A[1](t) W[1]. C.4 A Theorem of Robustness for Semi-indices Definition 5 (Truncated Semi-indices). Given a semi-index IP and a non-empty collection of subsets S 2[n], the truncated semi-index IP,S is obtained by only retaining the values of IP indexed by S. Theorem 2. Under Assumption 1, and further assume σ12 < min(σ11, σ22), for every non-empty collection of subsets S 2[n], the Dirac delta distribution δw with w = σ11 σ12 σ11 + σ22 2σ12 is optimal to the optimization problem argmin P P det (Cov (IP,S(vn))) s.t. Cov(vn) = Σ Σ Σ (n repetitions). Proof. The matrix representation of IP on Gn w.r.t. the standard basis is M[n] P = T[n] P W[n], as presented in Proposition 7. For simplicity, we write V[n],P = Cov(IP (vn)). Since IP is linear in vn, there is V[n],P = M[n] P C[n](M[n] P ) . (23) Note that det(T[n] P ) = det(W[n]) = 1 as they are triangular matrices with all diagonal entries being one, and thus we have det(M[n] P ) = 1, (24) det(V[n],P ) = det(C[n]). (25) Observe that Cov(IP,S(vn)) is just a sub-matrix of V[n],P . To represent such sub-matrices, for every non-empty α, β [2n], Vα,β [n],P is defined to be the sub-matrix of V[n],P whose rows and columns are indexed by α and β, respectively. Plus, we write Vα [n],P and V:,α [n],P instead for Vα,α [n],P and V[2n],α [n],P , respectively. Therefore, for each IP,S(vn), there exists a non-empty α [2n] such that Vα [n],P = Cov(IP,S(vn)). We prove by induction that for every n > 0, it holds that for every non-empty α [2n] and for every P P, det(Vα [n],P ) det(Vα [n],δw ). 
For $n = 1$, let $s = \int_0^1 t\,\mathrm{d}P(t)$; there is
\[
V_{[1],P} = \begin{pmatrix} \sigma_{11}(1-s)^2 + \sigma_{22}s^2 + 2\sigma_{12}s(1-s) & (\sigma_{11}+\sigma_{22}-2\sigma_{12})s + \sigma_{12} - \sigma_{11} \\ (\sigma_{11}+\sigma_{22}-2\sigma_{12})s + \sigma_{12} - \sigma_{11} & \sigma_{11}+\sigma_{22}-2\sigma_{12} \end{pmatrix},
\]
by which one can verify the statement by noticing that the function $s \mapsto \sigma_{11}(1-s)^2 + \sigma_{22}s^2 + 2\sigma_{12}s(1-s)$ achieves its minimum at the point $s = w^*$. Suppose the statement holds for some $k > 0$, and we proceed with $k+1$. Before proceeding with the induction, we have to explore the Kronecker structure between $V_{[k+1],P}$ and $V_{[k],P}$. By Eq. (23) and Proposition 7,
\[
V_{[k+1],P} = \int_0^1 \begin{pmatrix} (1-t)B^{[k]}(t) & t\,B^{[k]}(t) \\ -B^{[k]}(t) & B^{[k]}(t) \end{pmatrix}\mathrm{d}P(t)\, \begin{pmatrix} \sigma_{11}C^{[k]} & \sigma_{12}C^{[k]} \\ \sigma_{12}C^{[k]} & \sigma_{22}C^{[k]} \end{pmatrix} \left(\int_0^1 \begin{pmatrix} (1-t)B^{[k]}(t) & t\,B^{[k]}(t) \\ -B^{[k]}(t) & B^{[k]}(t) \end{pmatrix}\mathrm{d}P(t)\right)^{\!\top}. \tag{26}
\]
Denote $L^{[k]}_P = \int_0^1 t\,B^{[k]}(t)\,\mathrm{d}P(t)$; Eq. (26) can be rewritten as
\[
V_{[k+1],P} = \begin{pmatrix} M^{[k]}_P - L^{[k]}_P & L^{[k]}_P \\ -M^{[k]}_P & M^{[k]}_P \end{pmatrix} \begin{pmatrix} \sigma_{11}C^{[k]} & \sigma_{12}C^{[k]} \\ \sigma_{12}C^{[k]} & \sigma_{22}C^{[k]} \end{pmatrix} \begin{pmatrix} M^{[k]}_P - L^{[k]}_P & L^{[k]}_P \\ -M^{[k]}_P & M^{[k]}_P \end{pmatrix}^{\!\top} = \begin{pmatrix} U & F^\top \\ F & E \end{pmatrix}, \tag{27}
\]
where
\begin{align}
U &= \sigma_{11}(M^{[k]}_P - L^{[k]}_P)C^{[k]}(M^{[k]}_P - L^{[k]}_P)^\top + \sigma_{22}L^{[k]}_P C^{[k]}(L^{[k]}_P)^\top + \sigma_{12}(M^{[k]}_P - L^{[k]}_P)C^{[k]}(L^{[k]}_P)^\top + \sigma_{12}L^{[k]}_P C^{[k]}(M^{[k]}_P - L^{[k]}_P)^\top, \tag{28}\\
F &= M^{[k]}_P C^{[k]} K, \tag{29}\\
K &= (\sigma_{12}-\sigma_{11})(M^{[k]}_P - L^{[k]}_P)^\top + (\sigma_{22}-\sigma_{12})(L^{[k]}_P)^\top, \tag{30}\\
E &= (\sigma_{11}+\sigma_{22}-2\sigma_{12})\,V_{[k],P}. \tag{31}
\end{align}
Our goal is to simplify the expression $S = U - F^\top E^{-1} F$. Note that $E$ and $M^{[k]}_P$ are invertible by Eqs. (24) and (25) and because $\Sigma$ is assumed to be positive definite (and thus $\det(C^{[k]}) > 0$). Observing that $\sigma_{11}+\sigma_{22}-2\sigma_{12} > 0$ by assumption, and using Eqs. (23) and (31), Eq. (29) is equal to
\[
F = E \cdot \frac{1}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\big((M^{[k]}_P)^\top\big)^{-1}K,
\]
which indicates that
\[
F^\top E^{-1} F = \frac{1}{(\sigma_{11}+\sigma_{22}-2\sigma_{12})^2}\,K^\top\big(M^{[k]}_P\big)^{-1} E\,E^{-1}E\,\big((M^{[k]}_P)^\top\big)^{-1} K = \frac{1}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\,K^\top C^{[k]} K.
\]
The last equality is obtained by using Eqs. (23) and (31) again. Write $R = M^{[k]}_P - L^{[k]}_P$. From Eqs. (28) and (30), there is
\begin{align*}
U &= \sigma_{11}RC^{[k]}R^\top + \sigma_{22}L^{[k]}_P C^{[k]}(L^{[k]}_P)^\top + \sigma_{12}\big(RC^{[k]}(L^{[k]}_P)^\top + L^{[k]}_P C^{[k]}R^\top\big), \\
K^\top C^{[k]} K &= (\sigma_{12}-\sigma_{11})^2 RC^{[k]}R^\top + (\sigma_{22}-\sigma_{12})^2 L^{[k]}_P C^{[k]}(L^{[k]}_P)^\top + (\sigma_{12}-\sigma_{11})(\sigma_{22}-\sigma_{12})\big(RC^{[k]}(L^{[k]}_P)^\top + L^{[k]}_P C^{[k]}R^\top\big).
\end{align*}
Notice that
\begin{align}
(\sigma_{11}+\sigma_{22}-2\sigma_{12})U - K^\top C^{[k]} K &= (\sigma_{11}\sigma_{22}-\sigma_{12}^2)\big(RC^{[k]}R^\top + L^{[k]}_P C^{[k]}(L^{[k]}_P)^\top + RC^{[k]}(L^{[k]}_P)^\top + L^{[k]}_P C^{[k]}R^\top\big) \nonumber\\
&= (\sigma_{11}\sigma_{22}-\sigma_{12}^2)\,(R + L^{[k]}_P)\,C^{[k]}\,(R + L^{[k]}_P)^\top = (\sigma_{11}\sigma_{22}-\sigma_{12}^2)\,M^{[k]}_P C^{[k]} (M^{[k]}_P)^\top. \tag{32}
\end{align}
Combining Eqs. (32) and (23), we eventually get the simplified formula for $S$,
\[
S = \frac{\sigma_{11}\sigma_{22}-\sigma_{12}^2}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\,V_{[k],P}. \tag{33}
\]
In particular, $\sigma_{11}\sigma_{22}-\sigma_{12}^2 > 0$ as $\Sigma$ is assumed to be positive definite. We are now ready to complete the induction by cases, looking into Eq. (27).

Case 1: If $[2^{k+1}]\setminus[2^k] \subseteq \alpha \subseteq [2^{k+1}]$, let $\beta = \alpha \setminus ([2^{k+1}]\setminus[2^k])$, which could be empty. Then, there is
\[
V^{\alpha}_{[k+1],P} = \begin{pmatrix} U^{\beta} & (F^{:,\beta})^\top \\ F^{:,\beta} & (\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P} \end{pmatrix}.
\]
Note that we can write
\[
F^{:,\beta} = F D_{\beta}, \tag{34}
\]
where $D_{\beta}$ is the corresponding column-selecting matrix. It also gives that
\[
D_{\beta}^\top U D_{\beta} = U^{\beta}. \tag{35}
\]
Using Eqs. (33), (34) and (35), according to Lemma 2, there is
\begin{align*}
\det(V^{\alpha}_{[k+1],P}) &= \det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P}\big)\,\det\Big(U^{\beta} - (F^{:,\beta})^\top \big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P}\big)^{-1}F^{:,\beta}\Big) \\
&= \det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P}\big)\,\det(S^{\beta}) = L\,\det(S^{\beta}) = L\,\det\Big(\tfrac{\sigma_{11}\sigma_{22}-\sigma_{12}^2}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\,V^{\beta}_{[k],P}\Big),
\end{align*}
where $L = \det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P}\big) = (\sigma_{11}+\sigma_{22}-2\sigma_{12})^{2^k}\det(C^{[k]})$ by Eq. (25). Therefore, by the inductive assumption, for every $P \in \mathcal{P}$,
\[
\det(V^{\alpha}_{[k+1],P}) = L\,\det\Big(\tfrac{\sigma_{11}\sigma_{22}-\sigma_{12}^2}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\,V^{\beta}_{[k],P}\Big) \ \ge\ L\,\det\Big(\tfrac{\sigma_{11}\sigma_{22}-\sigma_{12}^2}{\sigma_{11}+\sigma_{22}-2\sigma_{12}}\,V^{\beta}_{[k],\delta_{w^*}}\Big) = \det(V^{\alpha}_{[k+1],\delta_{w^*}}). \tag{36}
\]
Case 2: If $[2^{k+1}]\setminus[2^k] \not\subseteq \alpha \subseteq [2^{k+1}]$ and $\alpha \neq \emptyset$, let $\beta = [2^{k+1}]\setminus[2^k]$ and $\gamma = \alpha \cap \beta$. Note that $\gamma$ could be empty.
By Lemma 5, there is
\[
\det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V^{\gamma-2^k}_{[k],P}\big)\,\det\big(V^{\beta\cup\alpha}_{[k+1],P}\big) \ \le\ \det\big(V^{\alpha}_{[k+1],P}\big)\,\det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V_{[k],P}\big),
\]
which implies, for every $P \in \mathcal{P}$,
\begin{align}
\det(V^{\alpha}_{[k+1],P}) &\ge \frac{\det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V^{\gamma-2^k}_{[k],P}\big)\,\det\big(V^{\beta\cup\alpha}_{[k+1],P}\big)}{(\sigma_{11}+\sigma_{22}-2\sigma_{12})^{2^k}\det(C^{[k]})} \nonumber\\
&\ge \frac{\det\big((\sigma_{11}+\sigma_{22}-2\sigma_{12})V^{\gamma-2^k}_{[k],\delta_{w^*}}\big)\,\det\big(V^{\beta\cup\alpha}_{[k+1],\delta_{w^*}}\big)}{(\sigma_{11}+\sigma_{22}-2\sigma_{12})^{2^k}\det(C^{[k]})} = \det(V^{\alpha}_{[k+1],\delta_{w^*}}). \tag{37}
\end{align}
The second inequality comes from the inductive assumption as well as Case 1, while the last equality is due to the fact that $V_{[k+1],\delta_{w^*}}$ is diagonal according to Lemma 7.

C.5 Proof for Theorem 1

Theorem 1. Under Assumption 1, and further assuming $\sigma_{12} < \min(\sigma_{11}, \sigma_{22})$, the Dirac delta distribution $\delta_{w^*}$ with
\[
w^* = \frac{\sigma_{11} - \sigma_{12}}{\sigma_{11} + \sigma_{22} - 2\sigma_{12}} \tag{8}
\]
is optimal for problem (7) for every $n \ge 1$. In addition, the optimal solution is unique if $n > 3$.

Proof. Let $\mathcal{S} = \{\{1\}, \{2\}, \ldots, \{n\}\}$; then $I_{P,\mathcal{S}}(v_n) = \phi(v_n; P)$. Therefore, by Theorem 2, $\delta_{w^*}$ is an optimal solution. Regarding uniqueness, it does not hold if $n \le 2$. For example, let $\mathcal{O} = \{P \in \mathcal{P} \mid \int_0^1 t\,\mathrm{d}P(t) = w^*\}$; we point out that every $P \in \mathcal{O}$ is optimal if $n = 1$ or $n = 2$. To see $|\mathcal{O}| > 1$, note for instance that if $\sigma_{11} = \sigma_{22}$ and $\sigma_{12} = 0$, the probability distribution for the Shapley value lies in $\mathcal{O}$. Nevertheless, we prove the uniqueness for $n > 3$ as follows: 1) first show that if the uniqueness holds for $n = k$, then it also holds for $n = k+1$, and 2) then demonstrate that it holds for $n = 4$.

For the inductive part, it can be completed by reusing the induction in the proof of Theorem 2. Specifically, for Case 1, let $\alpha = ([2^{k+1}]\setminus[2^k]) \cup \beta$ where $\beta$ contains all indices pointing to the subsets $\{1\}, \{2\}, \ldots,$ and $\{k\}$; then the inequality in Eq. (36) is strict if $P \neq \delta_{w^*}$. On the other hand, for Case 2, let $\alpha$ contain all indices pointing to the subsets $\{1\}, \{2\}, \ldots,$ and $\{k+1\}$; the second inequality in Eq. (37) is then also strict if $P \neq \delta_{w^*}$, by using the aforementioned strict inequality in Case 1.

To show the uniqueness for $n = 4$, we again look into the proof of Theorem 2. Specifically, in the induction, we focus on Case 2 for $k = 3$ and $\alpha = \{2, 3, 5, 9\}$, which points to the subsets $\{1\}, \{2\}, \{3\}$ and $\{4\}$. Note that $\beta = [2^4]\setminus[2^3] = \{9, 10, \ldots, 16\}$ in Case 2. By Lemma 5, a necessary condition for the first inequality in Eq. (37) to hold with equality is that each column of $V^{\{2,3,5\},\{10,11,\ldots,16\}}_{[4],P}$ lies in $\mathrm{span}(V^{\{2,3,5\},\{9\}}_{[4],P})$. We prove that this condition can only be met by $\delta_{w^*}$, and thus the first inequality in Eq. (37) is strict if $P \neq \delta_{w^*}$. Regarding Eq. (23), using the fact that
\[
V_{[n],P} = \Big(\int_0^1 B^{[n]}(t)\,\mathrm{d}P(t)\Big)\, C^{[n]}\, \Big(\int_0^1 B^{[n]}(t)\,\mathrm{d}P(t)\Big)^{\!\top} = \int_0^1\!\!\int_0^1 B^{[n]}(t)\, C^{[n]}\, \big(B^{[n]}(s)\big)^\top\,\mathrm{d}P(s)\,\mathrm{d}P(t),
\]
one can verify that $V^{\{2,3,5\},\{9\}}_{[4],P} = c_P \mathbf{1}$ for some $c_P \in \mathbb{R}$. Therefore, to meet the necessary condition, all entries in each column of $V^{\{2,3,5\},\{10,11,\ldots,16\}}_{[4],P}$ should be the same. In particular, there is
\[
V^{\{3,5\},\{12\}}_{[4],P} = \begin{pmatrix} \displaystyle\int_0^1\!\!\int_0^1 (\sigma_{11}+\sigma_{22}-2\sigma_{12})\,p(t)^2\big(p(st) + (\sigma_{12}-\sigma_{11})(s+t-1) + \sigma_{11}\big)\,\mathrm{d}P(s)\,\mathrm{d}P(t) \\[2mm] \displaystyle\int_0^1\!\!\int_0^1 p(s)\,p(t)^3\,\mathrm{d}P(s)\,\mathrm{d}P(t) \end{pmatrix},
\]
where $p(w) = (\sigma_{11}+\sigma_{22}-2\sigma_{12})w + \sigma_{12} - \sigma_{11}$. The equation $V^{\{3\},\{12\}}_{[4],P} - V^{\{5\},\{12\}}_{[4],P} = 0$ leads to
\[
\int_0^1 (\sigma_{11}\sigma_{22}-\sigma_{12}^2)\,p(t)^2\,\mathrm{d}P(t) = 0.
\]
Set $g(w) = (\sigma_{11}\sigma_{22}-\sigma_{12}^2)\,p(w)^2$; since $g(w)$ attains its minimum $0$ only at the point $w^*$, by Lemma 6, the necessary condition is met if and only if $P = \delta_{w^*}$. To ease the reader's verification of the calculations in the proof of Theorem 1, we include the following Matlab snippet.
syms sigma11 sigma12 sigma22 s t real;
Sigma1 = [sigma11, sigma12; sigma12, sigma22];
B1 = [1-t, t; -1, 1];
Bt = B1; Sigma = Sigma1;
for i = 1:3
    Sigma = kron(Sigma1, Sigma);
    Bt = kron(B1, Bt);
end
Bs = subs(Bt, t, s);
V = Bt*Sigma*Bs';
disp(simplify(V([2,3,5], 9)));     % V^{{2,3,5},{9}}
Vp = simplify(V([3,5], 12));
disp(Vp);                          % V^{{3,5},{12}}
disp(simplify(Vp(1)-Vp(2)));       % V^{{3},{12}} - V^{{5},{12}}

In this section, we provide all the lemmas necessary to prove Theorems 1 and 2.

Lemma 1. Suppose $A$ is positive definite and $B$ is positive semi-definite; then $\det(A+B) \ge \det(A)$, where the equality holds if and only if $B = 0$.

Proof. Since $A + B = A^{1/2}(I + A^{-1/2}BA^{-1/2})A^{1/2}$, there is $\det(A+B) = \det(A)\det(I + A^{-1/2}BA^{-1/2})$. By spectral decomposition, $A^{-1/2}BA^{-1/2} = QDQ^\top$ where $QQ^\top = I$ and $D$ is diagonal with all entries being non-negative. Then, $\det(I + A^{-1/2}BA^{-1/2}) = \det(I + D) \ge 1$, and thus $\det(A+B) \ge \det(A)$. For the claim on the equality, suppose $\det(I + A^{-1/2}BA^{-1/2}) = 1$; this forces $D = 0$, and thus $B = 0$.

Lemma 2. Given a positive definite matrix $M \in \mathbb{S}_{++}$ that is partitioned as
\[
M = \begin{pmatrix} A_{m\times m} & B_{m\times k} \\ B^\top_{k\times m} & C_{k\times k} \end{pmatrix},
\]
there is $\det(M) = \det(A - BC^{-1}B^\top)\det(C)$.

Proof. It follows from the factorization
\[
Q \begin{pmatrix} A_{m\times m} & B_{m\times k} \\ B^\top_{k\times m} & C_{k\times k} \end{pmatrix} Q^\top = \begin{pmatrix} A - BC^{-1}B^\top & 0 \\ 0 & C \end{pmatrix}, \quad \text{where } Q = \begin{pmatrix} I_{m\times m} & -BC^{-1} \\ 0 & I_{k\times k} \end{pmatrix}.
\]
Lemma 3. Let $M$ be the matrix given in Lemma 2; then $\det(M) \le \det(A)\det(C)$. Moreover, the equality holds if and only if $B = 0$.

Proof. By Lemmas 2 and 1, $\det(M) = \det(A - BC^{-1}B^\top)\det(C) \le \det(A)\det(C)$. Notice that the proof for Lemma 2 ensures that $A - BC^{-1}B^\top$ is positive definite. By Lemma 1, the equality holds if and only if $BC^{-1}B^\top = 0$. Note that for a positive definite matrix $C^{-1}$, $x^\top C^{-1}x = 0$ if and only if $x = 0$. Therefore, $BC^{-1}B^\top = 0$ if and only if $B = 0$.

Notation. For a square matrix $A \in \mathbb{R}^{m\times m}$ and $\alpha, \beta \subseteq \{1, 2, \ldots, m\}$, $A[\alpha, \beta]$ is defined to be the sub-matrix of $A$ obtained by only keeping the rows and columns indexed by $\alpha$ and $\beta$, respectively. In addition, $A[\alpha]$ is shorthand for $A[\alpha, \alpha]$.

Lemma 4. Suppose a positive definite matrix $M \in \mathbb{R}^{m\times m}$ is given. For every $\alpha \subseteq \{1, 2, \ldots, m\}$, there is
\[
\det(M^{-1}[\alpha^c]) = \frac{\det(M[\alpha])}{\det(M)}.
\]
The convention is $\det(M[\emptyset]) = 1$.

Proof. Without loss of generality, we write
\[
M = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix}
\]
such that $M[\alpha] = C$ and $M[\alpha^c] = A$. By Lemma 2, $\det(M) = \det(S)\det(C)$ where $S = A - BC^{-1}B^\top$. On the other hand,
\[
M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}BC^{-1} \\ -C^{-1}B^\top S^{-1} & C^{-1} + C^{-1}B^\top S^{-1}BC^{-1} \end{pmatrix}. \tag{38}
\]
Therefore, we have $\det(M^{-1}[\alpha^c]) = \det(S^{-1}) = \det(C)/\det(M) = \det(M[\alpha])/\det(M)$.

Lemma 5. For every positive definite matrix $M \in \mathbb{S}_{++}$ of size $m \times m$ and every $\alpha, \beta \subseteq \{1, 2, \ldots, m\}$,
\[
\det(M[\alpha \cup \beta])\,\det(M[\alpha \cap \beta]) \ \le\ \det(M[\alpha])\,\det(M[\beta]).
\]
The convention here is $\det(M[\emptyset]) = 1$. Moreover, provided that $(\alpha\cup\beta)\setminus\alpha \neq \emptyset$ and $(\alpha\cup\beta)\setminus\beta \neq \emptyset$, and, w.l.o.g., writing
\[
M[\alpha\cup\beta] = \begin{pmatrix} A & B & C \\ B^\top & D & E \\ C^\top & E^\top & F \end{pmatrix} \quad \text{such that} \quad M[\alpha] = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} \ \text{ and } \ M[\beta] = \begin{pmatrix} D & E \\ E^\top & F \end{pmatrix},
\]
a necessary condition to have the equality is that each column of $C$ lies in $\mathrm{span}(B)$.

Proof. Our proof is adapted from (Horn and Johnson 2012, Theorem 7.8.9), to which we add a necessary condition for the equality. By Lemmas 4 and 3, applied within the positive definite matrix $M[\alpha\cup\beta]$, there is
\begin{align*}
\frac{\det(M[\alpha\cap\beta])}{\det(M[\alpha\cup\beta])} &= \det\big(M[\alpha\cup\beta]^{-1}[((\alpha\cup\beta)\setminus\alpha) \cup ((\alpha\cup\beta)\setminus\beta)]\big) \\
&\le \det\big(M[\alpha\cup\beta]^{-1}[(\alpha\cup\beta)\setminus\alpha]\big)\,\det\big(M[\alpha\cup\beta]^{-1}[(\alpha\cup\beta)\setminus\beta]\big) = \frac{\det(M[\alpha])}{\det(M[\alpha\cup\beta])}\cdot\frac{\det(M[\beta])}{\det(M[\alpha\cup\beta])},
\end{align*}
which rearranges into the claimed inequality. Suppose $(\alpha\cup\beta)\setminus\alpha \neq \emptyset$ and $(\alpha\cup\beta)\setminus\beta \neq \emptyset$; by Lemma 3, the equality holds if and only if $M[\alpha\cup\beta]^{-1}[(\alpha\cup\beta)\setminus\beta, (\alpha\cup\beta)\setminus\alpha] = 0$. Using Eq. (38), one can verify that
\[
M[\alpha\cup\beta]^{-1}[(\alpha\cup\beta)\setminus\beta, (\alpha\cup\beta)\setminus\alpha] = (A^{-1}BG^{-1}H - A^{-1}C)(J - H^\top G^{-1}H)^{-1},
\]
where $G = D - B^\top A^{-1}B$, $H = E - B^\top A^{-1}C$ and $J = F - C^\top A^{-1}C$, and thus the equality holds if and only if $C = BG^{-1}H$.

Lemma 6. Given a continuous non-negative function $g : [0,1] \to \mathbb{R}_+$ that achieves its minimum $0$ only at one point $w^* \in [0,1]$, $\delta_{w^*}$ is the unique probability measure on $[0,1]$ that solves
\[
\operatorname*{argmin}_{P \in \mathcal{P}} \ L(P) := \int_0^1 g(t)\,\mathrm{d}P(t).
\]
Proof. It is clear that $L(\delta_{w^*}) = 0 = \inf_{P \in \mathcal{P}} L(P)$.
It suffices to prove that $L(P) = 0$ implies $P = \delta_{w^*}$. Suppose $P \in \mathcal{P}$ is given such that $L(P) = 0$. A point $w \in [0,1]$ is said to be irrelevant w.r.t. $P$ if there exists $\epsilon > 0$ such that $P((w-\epsilon, w+\epsilon) \cap [0,1]) = 0$, and relevant otherwise. For the sake of contradiction, suppose a point $w \in [0,1]$ distinct from $w^*$ (i.e., $w \neq w^*$) is relevant. Since $g(w) > g(w^*) = 0$ and $g$ is continuous, there exists $\epsilon > 0$ such that $g(s) > 0$ for all $s \in [w-\epsilon, w+\epsilon] \cap [0,1]$. Since $m := \inf_{s \in [w-\epsilon, w+\epsilon]\cap[0,1]} g(s) > 0$ and $w$ is relevant, we have
\[
L(P) \ \ge \int_{[w-\epsilon, w+\epsilon]\cap[0,1]} g(t)\,\mathrm{d}P(t) \ \ge \int_{[w-\epsilon, w+\epsilon]\cap[0,1]} m\,\mathrm{d}P(t) = m\,P([w-\epsilon, w+\epsilon]\cap[0,1]) > 0,
\]
a contradiction. So the only possible relevant point is $w^*$. Next, we prove that $P = \delta_{w^*}$. For each $\delta > 0$, construct $S_\delta = [0, w^*-\delta] \cup [w^*+\delta, 1]$. For each $w \in S_\delta$, since $w$ is irrelevant, there exists $\epsilon(w) > 0$ such that $P(O_w) = 0$ where $O_w = (w-\epsilon(w), w+\epsilon(w)) \cap [0,1]$. Then, $\{O_w : w \in S_\delta\}$ forms an open (w.r.t. $[0,1]$) cover of $S_\delta$. Since $S_\delta$ is compact, there is a finite subcover $S_\delta \subseteq \bigcup_{1 \le i \le k} O_{w_i}$. Therefore, $P(S_\delta) \le \sum_{i=1}^{k} P(O_{w_i}) = 0$, which implies $P(S_\delta) = 0$. By letting $\delta \to 0$, $P(S_\delta) \to P([0,1]\setminus\{w^*\}) = 0$. Using the $\pi$-$\lambda$ theorem, e.g., (Durrett 2019, Theorem A.1.4), one can conclude that $P = \delta_{w^*}$.

Lemma 7. Suppose $V_{[n],P} = M^{[n]}_P C^{[n]} (M^{[n]}_P)^\top$ where $M^{[n]}_P$ is defined in Proposition 7 and $C^{[n]} = \Sigma \otimes \Sigma \otimes \cdots \otimes \Sigma$ ($n$ repetitions), and assume $\sigma_{12} < \min(\sigma_{11}, \sigma_{22})$. Then, the Dirac delta distribution $\delta_{w^*}$ with
\[
w^* = \frac{\sigma_{11} - \sigma_{12}}{\sigma_{11} + \sigma_{22} - 2\sigma_{12}}
\]
makes the matrix $V_{[n],P}$ diagonal. In addition, $\delta_{w^*}$ is the unique probability distribution on $[0,1]$ that makes $V_{[n],P}$ diagonal if $n > 1$.

Proof. Since $(A \otimes B)(C \otimes D) = AC \otimes BD$, according to Proposition 7, for each $w \in [0,1]$,
\[
V_{[n],\delta_w} = B^{[n]}(w)\,C^{[n]}\,\big(B^{[n]}(w)\big)^\top = \underbrace{V_{[1],\delta_w} \otimes V_{[1],\delta_w} \otimes \cdots \otimes V_{[1],\delta_w}}_{n \text{ repetitions}}, \quad \text{where } V_{[1],\delta_w} = \begin{pmatrix} p_1(w) & p_2(w) \\ p_2(w) & C \end{pmatrix},
\]
\[
p_1(w) = \sigma_{11}(1-w)^2 + \sigma_{22}w^2 + 2\sigma_{12}w(1-w), \quad p_2(w) = (\sigma_{11}+\sigma_{22}-2\sigma_{12})w + \sigma_{12} - \sigma_{11}, \quad C = \sigma_{11}+\sigma_{22}-2\sigma_{12}.
\]
Using Eq. (13) (note $2^{[n]}$ is ordered w.r.t. the binary ordering), one can show by induction that the $(A, B)$-th entry of $V_{[n],\delta_w}$ is
\[
p_1(w)^{|[n]\setminus(A\cup B)|}\, p_2(w)^{|(A\setminus B)\cup(B\setminus A)|}\, C^{|A\cap B|} \quad \text{for every } A, B \subseteq [n].
\]
Therefore, every off-diagonal entry of $V_{[n],\delta_w}$ contains the factor $p_2(w)^k$ for some positive integer $k$. Notice that $p_2(w^*) = 0$, and thus $V_{[n],\delta_{w^*}}$ is diagonal.

Conversely, suppose that a probability distribution $P$ on $[0,1]$ is given such that $V_{[n],P}$ is diagonal. Then,
\[
V_{[n],P} = \Big(\int_0^1 B^{[n]}(t)\,\mathrm{d}P(t)\Big)\, C^{[n]}\, \Big(\int_0^1 B^{[n]}(t)\,\mathrm{d}P(t)\Big)^{\!\top}.
\]
Observe that, along the last column, $V_{[n],P}$ equals $\int_0^1 B^{[n]}(t)\,C^{[n]}\,(B^{[n]}(t))^\top\,\mathrm{d}P(t)$, as one can prove by induction that the last row of $B^{[n]}(t)$ contains only constants. Since $B^{[n]}(t)\,C^{[n]}\,(B^{[n]}(t))^\top$ can be computed recursively, one can show by induction that $V_{[n],P}(\emptyset, [n]) = \int_0^1 p_2(t)^n\,\mathrm{d}P(t)$ and $V_{[n],P}(\{1\}, [n]) = \int_0^1 C\,p_2(t)^{n-1}\,\mathrm{d}P(t)$. Set $g(w) = p_2(w)^n$ when $n$ is even and $g(w) = C\,p_2(w)^{n-1}$ otherwise. Note that $g(w) \ge 0$ and attains its minimum $0$ only at the point $w^*$. Since $n > 1$, the corresponding entry is off-diagonal and hence equals $0$, i.e., $\int_0^1 g(t)\,\mathrm{d}P(t) = 0$; by Lemma 6, this implies $P = \delta_{w^*}$.

D Experiment Settings

The datasets we use in the main paper are summarized in Table 4. The logistic model is trained by minimizing
\[
\sum_{(x,y) \in D_{\mathrm{tr}}} \mathrm{CrossEntropy}(Wx + b,\, y)
\]
over its parameters $W$ and $b$. In the experiment of noisy label detection, the F1-score is computed as
\[
\mathrm{F1} = \frac{2\,|\{z \in D_{\mathrm{tr}} \mid z \text{ is flipped and } z \text{ is detected as noisy}\}|}{|\{z \in D_{\mathrm{tr}} \mid z \text{ is flipped}\}| + |\{z \in D_{\mathrm{tr}} \mid z \text{ is detected as noisy}\}|},
\]
which is the harmonic mean of precision and recall.

D.1 Fitting Kronecker Noises

We independently sample 128 sequences $\emptyset = S_0 \subset S_1 \subset \cdots \subset S_n = [n]$ such that $|S_{i+1}\setminus S_i| = 1$ for every $0 \le i < n$. For each subset $S_i$ ($0 \le i \le n$) in a sampled sequence, $v(S_i)$ is computed using 128 random seeds. All these utility evaluations are then employed to generate an estimated covariance matrix $\hat{D} \in \mathbb{R}^{(n+1)\times(n+1)}$, which is an estimate of a sub-matrix of the ground-truth covariance matrix of the stochasticity. Note that, under Assumption 1, for every $0 \le i, j \le n$,
\[
\mathrm{Cov}(v_n(S_i), v_n(S_j)) = \sigma_{11}^{\,n-\max(i,j)}\, \sigma_{12}^{\,\max(i,j)-\min(i,j)}\, \sigma_{22}^{\,\min(i,j)}.
\]
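The following is a minimal sketch, written by us for illustration, of one way to realize this fitting step: under Assumption 1 we build $D(\sigma)$ from the closed-form entries above and fit $(\sigma_{11}, \sigma_{12}, \sigma_{22})$ by minimizing a Kullback-Leibler divergence between the zero-mean Gaussians $\mathcal{N}(0, \hat{D})$ and $\mathcal{N}(0, D(\sigma))$. The function names, the moment-based initialization, the optimizer and the chosen KL direction are our assumptions rather than the exact implementation used in the experiments; the fitted parameters are then plugged into Eq. (8) to predict the most robust weighted Banzhaf value.

import numpy as np
from scipy.optimize import minimize

def kronecker_cov(s11, s12, s22, n):
    # Covariance of (v(S_0), ..., v(S_n)) along a nested chain under Assumption 1:
    # Cov(v(S_i), v(S_j)) = s11^(n - max(i,j)) * s12^(max(i,j) - min(i,j)) * s22^(min(i,j)).
    D = np.empty((n + 1, n + 1))
    for i in range(n + 1):
        for j in range(n + 1):
            lo, hi = min(i, j), max(i, j)
            D[i, j] = s11 ** (n - hi) * s12 ** (hi - lo) * s22 ** lo
    return D

def fit_kronecker_noise(D_hat, n):
    # Fit (s11, s12, s22) by minimizing KL(N(0, D_hat) || N(0, D(sigma))).
    _, logdet_hat = np.linalg.slogdet(D_hat)

    def kl(params):
        s11, s12, s22 = params
        # Keep Sigma positive definite and sigma12 < min(sigma11, sigma22), as assumed in Theorem 1.
        if min(s11, s22) <= 0 or s12 >= min(s11, s22) or s11 * s22 <= s12 ** 2:
            return np.inf
        D = kronecker_cov(s11, s12, s22, n)
        sign, logdet = np.linalg.slogdet(D)
        if sign <= 0:
            return np.inf
        return 0.5 * (np.trace(np.linalg.solve(D, D_hat)) - (n + 1) + logdet - logdet_hat)

    # Moment-based initialization: D[0,0] = s11^n, D[n,n] = s22^n, D[0,n] = s12^n.
    x0 = np.array([D_hat[0, 0] ** (1 / n), abs(D_hat[0, n]) ** (1 / n), D_hat[n, n] ** (1 / n)])
    return minimize(kl, x0, method="Nelder-Mead").x

def predicted_w_star(s11, s12, s22):
    # Eq. (8): the predicted most robust weighted Banzhaf parameter.
    return (s11 - s12) / (s11 + s22 - 2 * s12)

For example, one would call `s11, s12, s22 = fit_kronecker_noise(D_hat, n)` followed by `predicted_w_star(s11, s12, s22)` to obtain the predicted parameter.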
D.2 Sampling Lift

Basically, sampling lift, which applies to all semi-values, refers to any estimator based on
\[
\phi_i(v; P) = \mathbb{E}_S[v_n(S \cup i) - v_n(S)] \quad \text{for every } v_n \in \mathcal{G} \text{ and } i \in [n],
\]
where $S$ is sampled from $2^{[n]\setminus i}$ with $\mathbb{P}(S) = p^n_{s+1}$. There are two different ways to sample $S$. One is derived from Eq. (2). For example, to estimate the value for the datum $n \in [n]$, the procedure is: i) sample $t \in [0,1]$ according to the underlying probability distribution $P \in \mathcal{P}$; ii) let $X$ be a Bernoulli random variable such that $\mathbb{P}(X = 1) = t$, and sample a vector $b \in \mathbb{R}^{n-1}$ whose entries independently follow $X$; and iii) define $S \subseteq [n]\setminus n$ by letting $i \in S$ if and only if $b_i = 1$. The other way follows by noticing that
\[
\phi_i(v; P) = \sum_{k=1}^{n} q_k \cdot \frac{1}{\binom{n-1}{k-1}} \sum_{\substack{S \subseteq [n]\setminus i \\ |S| = k-1}} \big(v(S \cup i) - v(S)\big) = \mathbb{E}_k\big[\mathbb{E}_{S : |S| = k-1}[v(S \cup i) - v(S)]\big],
\]
where $q_k = p^n_k \binom{n-1}{k-1}$ for every $1 \le k \le n$. The corresponding sampling strategy is: i) sample $k \sim \{q_k\}_{1\le k\le n}$, and then ii) sample $S$ uniformly from $2^{[n]\setminus i}$ subject to $|S| = k-1$. We comment that the first way does not require calculating the distribution $\{q_k\}_{1\le k\le n}$. Note that the implementation of $\{q_k\}_{1\le k\le n}$ might suffer from numerical blow-up induced by $\binom{n-1}{k-1}$, which is why Wang and Jia (2023) did not report results for the Beta Shapley values when $n > 500$. A minimal Monte-Carlo sketch of the first strategy for weighted Banzhaf values is given after Table 4.

Table 4: A summary of datasets used in this work.

Dataset (reference)                     Source
MNIST (LeCun et al. 1998)               PyTorch
FMNIST (Xiao et al. 2017)               PyTorch
2dplanes                                https://www.openml.org/d/727
bank-marketing (Moro et al. 2011)       https://www.openml.org/d/1461
bioresponse                             https://www.openml.org/d/4134
covertype                               https://www.openml.org/d/150
cpu                                     https://www.openml.org/d/761
credit                                  https://www.openml.org/d/31
default (Yeh and Lien 2009)             https://www.openml.org/d/42477
diabetes                                https://www.openml.org/d/37
fraud (Dal Pozzolo et al. 2014)         https://www.openml.org/d/42175
gas (Vergara et al. 2012)               https://www.openml.org/d/1476
har (Anguita et al. 2013)               https://www.openml.org/d/1478
iris                                    https://www.openml.org/d/61
letter (Frey and Slate 1991)            https://www.openml.org/d/6
optdigits                               https://www.openml.org/d/28
pendigits                               https://www.openml.org/d/32
phoneme                                 https://www.openml.org/d/1489
pol                                     https://www.openml.org/d/722
satimage                                https://www.openml.org/d/182
segment                                 https://www.openml.org/d/36
spambase                                https://www.openml.org/d/44
texture                                 https://www.openml.org/d/40499
wind                                    https://www.openml.org/d/847
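Below is the minimal Monte-Carlo sketch of the first sampling strategy in Appendix D.2, specialized to a weighted Banzhaf value $P = \delta_w$, for which step i) is unnecessary and every datum other than $i$ enters $S$ independently with probability $w$. The function `sampling_lift_weighted_banzhaf`, its arguments, and the toy additive utility are our own illustration; in practice the utility would train a model on the subset and report validation accuracy, and the estimator proposed in the main paper is the maximum-sample-reuse one rather than this plain sampling lift.

import numpy as np

def sampling_lift_weighted_banzhaf(utility, n, w, num_samples=1000, seed=0):
    # Plain sampling-lift estimate of a weighted Banzhaf value (P = delta_w).
    # utility: maps a frozenset S of indices in range(n) to a real number.
    # For each datum i, S is drawn from 2^([n] \ {i}) with P(S) = w^|S| * (1 - w)^(n - 1 - |S|).
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for i in range(n):
        acc = 0.0
        for _ in range(num_samples):
            mask = rng.random(n) < w      # include each datum independently with probability w
            mask[i] = False               # S must exclude i
            S = frozenset(np.flatnonzero(mask).tolist())
            acc += utility(S | {i}) - utility(S)
        values[i] = acc / num_samples
    return values

if __name__ == "__main__":
    # Toy utility: a noisy additive game, so the estimates should track the per-datum weights.
    weights = np.array([0.1, 0.5, 0.2, 0.7, 0.05])
    def toy_utility(S):
        return float(sum(weights[j] for j in S)) + np.random.normal(scale=0.01)
    print(sampling_lift_weighted_banzhaf(toy_utility, n=len(weights), w=0.5, num_samples=2000))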
E More Experiment Results

In this section, we provide detailed figures, see Figures 4 to 10, to supplement the results reported in Tables 3 and 2. In each figure, lr stands for the learning rate used for that dataset. In addition, we also run experiments while setting $|D_{\mathrm{tr}}| = 200$, 10 percent of which is flipped; the corresponding results are presented in Figures 11 to 17. Every figure follows the same layout: the first row shows results from the experiment of ranking consistency and the second row from the experiment of noisy label detection, each reported as mean and standard deviation, with the F1-score computed over 10 random seeds and the Spearman's rank correlation coefficient computed over all 45 pairs of the 10 random seeds.

Figure 4: panels 2dplanes, bank-marketing, bioresponse.
Figure 5: panels covertype, cpu, default.
Figure 6: panels gas, har, letter.
Figure 7: panels optdigits, pendigits, phoneme.
Figure 8: panels pol, satimage, segment.
Figure 9: panels spambase, texture, wind.
Figure 10: panels MNIST, FMNIST.
Figure 11: panels 2dplanes, bank-marketing, bioresponse.
Figure 12: panels covertype, cpu, credit.
Figure 13: panels default, diabetes, fraud.
Figure 14: panels gas, har, letter.
Figure 15: panels optdigits, pendigits, phoneme.
Figure 16: panels pol, satimage, segment.
Figure 17: panels spambase, texture, wind.

Extra References

Anguita, D., A. Ghio, L. Oneto, X. Parra, J. L. Reyes-Ortiz, et al. (2013). "A Public Domain Dataset for Human Activity Recognition Using Smartphones". In: ESANN. Vol. 3, p. 3.
Dal Pozzolo, A., O. Caelen, Y.-A. Le Borgne, S. Waterschoot, and G. Bontempi (2014). "Learned Lessons in Credit Card Fraud Detection from a Practitioner Perspective". Expert Systems with Applications, vol. 41, no. 10, pp. 4915-4928.
Denneberg, D. and M. Grabisch (1999). "Interaction Transform of Set Functions over a Finite Set". Information Sciences, vol. 121, no. 1-2, pp. 149-170.
Durrett, R. (2019). Probability: Theory and Examples. Vol. 49. Cambridge University Press.
Frey, P. W. and D. J. Slate (1991). "Letter Recognition Using Holland-Style Adaptive Classifiers". Machine Learning, vol. 6, pp. 161-182.
Fujimoto, K., I. Kojadinovic, and J.-L. Marichal (2006). "Axiomatic Characterizations of Probabilistic and Cardinal-Probabilistic Interaction Indices". Games and Economic Behavior, vol. 55, no. 1, pp. 72-99.
Grabisch, M., J.-L. Marichal, and M. Roubens (2000). "Equivalent Representations of Set Functions". Mathematics of Operations Research, vol. 25, no. 2, pp. 157-178.
Grabisch, M. and M. Roubens (1999). "An Axiomatic Approach to the Concept of Interaction Among Players in Cooperative Games". International Journal of Game Theory, vol. 28, no. 4, pp. 547-565.
Horn, R. A. and C. R. Johnson (2012). Matrix Analysis. Cambridge University Press.
Moro, S., R. Laureano, and P. Cortez (2011). "Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology".
Sundararajan, M., K. Dhamdhere, and A. Agarwal (2020). "The Shapley Taylor Interaction Index". In: International Conference on Machine Learning. PMLR, pp. 9259-9268.
Vergara, A., S. Vembu, T. Ayhan, M. A. Ryan, M. L. Homer, and R. Huerta (2012). "Chemical Gas Sensor Drift Compensation Using Classifier Ensembles". Sensors and Actuators B: Chemical, vol. 166, pp. 320-329.
Xiao, H., K. Rasul, and R. Vollgraf (2017). "Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms". arXiv preprint arXiv:1708.07747.
Yeh, I.-C. and C.-h. Lien (2009). "The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients". Expert Systems with Applications, vol. 36, no. 2, pp. 2473-2480.
Zhang, H., Y. Xie, L. Zheng, D. Zhang, and Q. Zhang (2021). "Interpreting Multivariate Shapley Interactions in DNNs". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35, no. 12, pp. 10877-10886.