# linear_tree_shap__3f98db52.pdf

Linear Tree Shap

Peng Yu1,2,5 Chao Xu1 Albert Bifet2,4 Jesse Read3

1 University of Electronic Science and Technology of China 2 LTCI, T el ecom Paris, IP Paris 3 LIX, Ecole Polytechnique, IP Paris 4 AI Institute, University of Waikato 5 Shopify {peng.yu,albert.bifet}@telecom-paris.fr cxu@uestc.edu.cn jesse.read@polytechnique.edu

Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. It provides a linear weighting to features independent of the tree structure. The rise in popularity is mainly due to Tree Shap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear Tree Shap. Like Tree Shap, Linear Tree Shap is exact and requires the same amount of memory.

1 Introduction

Machine learning in the industry has played more and more critical roles. For both business and fairness purposes, the need for explainability has been increasing dramatically. As one of the most popular machine learning models, the tree-based model attracted much attention. Several methods were developed to improve the interpretability of complex tree models, such as sampling-based local explanation model LIME[9], game-theoretical based Shapley value[10], etc. Shapley value gained particular interest due to both local and globally consistent and efficient implementation: Tree Shap[5]. With the broad adoption of Shapley value, the industry has been seeking a much more efficient implementation. Various methods like GPUTree Shap[6] and Fast Tree Shap[12] were proposed to speed up Tree Shap. GPUTree Shap primarily focuses on utilizing GPU to perform efficient parallelization. And Fast Tree Shap improves the efficiency of Tree Shap by utilizing caching. All of them are empirical approaches lacking a mathematical foundation and are thus making them hard to understand.

We solve the exact Shapley value computing problem based on polynomial arithmetic. By utilizing the properties of polynomials, our proposed algorithm Linear Tree Shap can compute the exact Shapley value in linear time. And there is no compromise in memory utilization.

1.1 Contrast with previous result

We compare the running time of our algorithm with previous results for a single tree in Table 1, since all current algorithms for ensemble of trees are applying the same algorithm to each tree individually.

Corresponding author

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

Let S be the number of samples to be explained, N the number of features, L the number of leaves in the tree, and D is the maximum depth of the tree. For simplicity, we assume every feature is used in the tree, and therefore N = O(L). Also, D L.

Algorithm Time Complexity Space Complexity

Original Tree SHAP [5] O(SLD2) O(D2 + N) Fast Tree SHAP v1 [12] O(SLD2) O(D2 + N) Fast Tree SHAP v2 [12] O(L2DD + SLD) O(L2D)

Linear Tree SHAP O(SLD) O(D2 + N)

Table 1: Comparison of both computational and space complexity

temperature

0.5 rain(D)

0.4 rain (C)

0.6 rain (B)

0.7 rain (A)

w:0.6 yes w:0.5 > 19

Figure 1: An example decision tree Tf shows chances of rain

2 Methodology

2.1 Notation & Background

Elementary symmetry polynomials are widely used in our paper. R[x] denotes the set of polynomials with coefficient in R. R[x]d are polynomials of degree no larger than d. We use for polynomial multiplication. For two polynomials a and b, a

b is the quotient of the polynomial division a/b.

For two vectors x, y Rd, x, y is the inner product of x and y. We abuse the notation, so when a polynomial appears in the inner product, we take it to mean the coefficient vector of the polynomial. Namely, if p, q are both polynomials of the same degree, then p, q is the inner product of their coefficient vectors. We use for matrix multiplication.

We refer to x X Rm as an instance and f : X R as the fitted tree model in a supervised learning task. Here, m denotes the number of all features, M is the set of all features, and |.| is the cardinality operation, namely |M| = m. We denote x[i] as the value of feature i of instance x.

We have to start with some common terminologies because our algorithm is closely involved with trees. A rooted tree T = (V, E) is a directed tree where each edge is oriented away from the root r V . For each node v, Pv is the root to v path, i.e. the set of edges from the root to v. L(v) is the set of leaves reachable from v. L(T) = L(r) is the set of leaves of T. T is a full binary tree if every non-leaf node has two children. If an edge e goes from u to v, then u and v are the tail and head of e, respectively. We write h(e) for the head of e.

A tree is weighted, if there is an edge weight we for each edge e. It is a labeled tree if each edge also has a label ℓe. For a labeled tree T, let Ei be all the edges of the tree with label i. Similarly, define Pi,v = Pv Ei, the set of edges in the root to v path that has label i. The last edge of any subset of a path is the edge furthest away from the root.

For our purpose, a decision tree is a weighted labeled rooted full binary tree. There is a corresponding decision tree for the fitted tree model Tf. The internal nodes of the tree are called the decision nodes, and the leaves are called the end nodes. Every decision node has a label of feature i, and every end node contains a prediction v. The label of each edge is the feature of the head node of the edge. We will call the label on the edge of the feature. Every edge e contains weight we that is the conditional probability based on associated splitting criteria during training.

When predicting a given instance, x, decision tree model f sends the instance to one of its leaves according to splitting criteria. We draw an example decision tree in Fig 1. Each leaf node is labeled with id and prediction value. Every decision node is labeled with the feature. We also associate each edge with conditional probability w and splitting criteria of features in the parenting node.

To represent the marginal effect, we use f S(x) : X R, S M to denote the prediction of instance x of the fitted tree model using only the features in active set S, and treat the rest of features of instance x as missing. Using this representation, the default prediction f(x) is a shorthand for f M(x).

The Shapley value of a decision tree model f is the function ϕ(f, i) : X R,

ϕ(f, i)(x) = 1 |M|

1 |M| 1 |S| f S {i}(x) f S(x). (1)

The Shapley value ϕ(f, i)(x) quantifies the marginal contribution of feature i in the tree model f when predicting instance x. The problem of computing the Shapley values is the following algorithmic problem.

The Tree Shapley Value Problem Input: A decision tree Tf for function f : X R over m features, and x X. Output: The vector (ϕ(f, 1)(x), . . . , ϕ(f, m)(x)).

Meanwhile, decision nodes cannot split instances with missing feature values. A common convention is to use conditional expectation. When a decision node encounters a missing value, it redirects the instance to both children and returns the weighted sum of both children s predictions. The weights differ between decision nodes and are empirical instance proportions during training: wl, wr. Here wl + wr = 1 and 0 < wl < 1. A similar approach is also used in both Treeshap[5], and C4.5 [7] to deal with missing values. Any instance would result in a single leaf when there is no missing feature. In contrast, an instance might reach multiple different leaves with a missing feature.

Here, we use an example instance x =(temperature: 20, cloudy: no, wind speed: 6) with tree f in Fig 1 to show the full process of Shapley value computing. By following the decision nodes of Tf, the prediction f(x) is leaf C: 0.4 chance of raining. Now we compute the importance/Shapley value of feature (temperature: 18) among x for getting prediction of 0.4 chance of raining.

The importance/Shapley value of feature (temperature: 18) is

ϕ(f, temperature)(x) = 1

3( 1 2 2 (f{temperature, cloudy, wind speed}(x) f{cloudy, wind speed}(x))

+ 1 2 1 (f{temperature, wind speed}(x) f{wind speed}(x) + f{temperature, cloudy}(x) f{cloudy}(x))

+ 1 2 0 (f{temperature}(x) f (x)))

To elaborate more, a term f{cloudy, wind speed}(x) with x =(temperature: 20, cloudy: no, wind speed: 6) is equivalent to f(cloudy: no, wind speed: 6). When traversal first decision node: temperature, value to current feature is considered as unspecified. We sum over children leaves with empirical weights and get 0.5 D + 0.5 C as prediction.

On the other hand, decision tree f can be linearized into decision rules [8]. A decision rule can be seen as a decision tree with only a single path. A decision rule Rv : X R for a leaf v can be constructed via starting from root node, following all the conditions along the path to the leaf v. We use F(R) to represent the set of all features specified in decision rule R, namely, F(R) = {i|Pi,v = }.

The linearization of the decision tree f to decision rules is the relation f(x) = P

v L(Tf ) Rv(x). Example tree in Fig 1 can be linearized into 4 rules:

1. RA: if (temperature > 19) and (is cloudy) then predict 0.7 chance of rain else predict 0 chance of rain

2. RB: if (temperature > 19) and (is not cloudy) and (wind speed > 8) then predict 0.6 chance of rain else predict 0 chance of rain

3. RC: if (temperature > 19) and (is not cloudy) and (wind speed 8) then predict 0.4 chance of rain else predict 0 chance of rain

4. RD: if (temperature 19) then predict 0.5 chance of rain else predict 0 chance of rain

For a decision rule R, we also introduce prediction with active set S, RS : X R. When features are missing, leaf value further weighted by their conditional probability is provided as the prediction. Here we introduce the definition recursively. First, we define the prediction of rule R associated with leaf value V with empty input:

Rv = Rv (x) = V Y

e Pv we (2)

Where we is the conditional probability/proportion of instances, when splitting by decision node at the source of edge e, the proportion of instances belong to the current edge. V is the prediction of the leaf node v that defines that decision rule.

We say x πi(R), if x[i] is satisfied by every splitting criteria concerning feature i in decision rule R. For a given instance x and leaf v, we use a new variable qi,v(x) to denote the marginal prediction of Rv when adding feature i to active set S.

e Pi,v 1 we x πi(Rv) 0 x / πi(Rv) (3)

The empty product equals 1, hence if Pi,v = , qi,v(x) = 1. We omit the super/subscript v if there is no ambiguity on the leaf node. So, with i / S, we can write:

R{i} S(x) = qi(x)RS(x) (4)

Since is a subset of any set S, we can get RS(x) via products of weights:

RS(x) = R Y

j S qj(x) (5)

With RS, f S can also be linearized into the sum of rule predictions:

v L(Tf ) Rv S(x). (6)

2.2 Some special functions and their properties

Definition 2.1. Define the reciprocal binomial polynomial to be Bd(x) = Pd i=0 d i 1xi.

Definition 2.2. The function ψd : R[x]d R is defined as

ψd(A) := A, Bd

d + 1 . (7)

We write ψ(p) = ψd(p) where d is the degree of p.

The function ψd has two nice properties: additive for same degree polynomial and scale invariant when multiplying binomial coefficient.

Proposition 2.1. Let p, q R[x]d, and k N.

Additivity: ψd(p) + ψd(q) = ψd(p + q).

Scale Invariant: ψ(p (1 + y)k) = ψ(p).

2.3 Summary polynomials and their relation to Shapley value

Consider we have a function f represented by a decision tree Tf. We want to explain a particular sample x, therefore in the later sections, we abuse the notation and let g to mean g(x) whenever g : X R, e.g qi,v = qi,v(x). In order to not confuse the readers, the polynomials we are constructing always have the formal variable y.

Since tree prediction can be linearized into decision rules, and the Shapley value also has Linearity property, we decompose the Shapley value of a tree as the sum of the Shapley value of decision rules.

ϕ(f, i) = X

v L(Tf ) ϕ(Rv, i) (8)

Now, for each decision rule, we define a summary polynomial. Definition 2.3. For a decision tree Tf and an instance x. For a decision rule associated with leaf v in Tf, the summary polynomial Gv is defined as

Gv(y) = Rv Y

j F (Rv) (qj,v + y) (9)

Next, we study the relationship between the summary polynomial and the Shapley value of corresponding decision rule. Lemma 2.2. Let v be a leaf in Tf, then

ϕ(Rv, i) = (qi,v 1)ψ Gv qi,v + y

Proof. Since everything involved in the proof is related to the leaf v, we drop v from the super/subscripts for simplicity. First, we simplify the Shapley value of rule R into:

ϕ(R, i) = 1

1 m 1 |S| RS {i} RS = R (qi 1)

1 m 1 |S| Y

j S qj (10)

When feature i does not appear in R, qi 1 returns 0, thus Shapley value on feature i from rule R is 0. Let |F(R)| = d, the number of features in R. The Shapley value of R further reduces to:

ϕ(R, i) = R (qi 1)

S F (R)\{i}

j S qj (11)

We observe that R P|S|=k S F (R)\{i} Q

j S qj is precisely the coefficient of yk in G qi+y.

We obtain the weighted sum of all subsets decision rule prediction using the inner product:

S F (R)\{i} 1/ d 1 |S|

S F (R)\{i}

j S qj = G qi + y , Bd 1 (12)

Shapley value for R has a compact form as shown in Eq.13.

ϕ(R, i) = R (qi 1)

S F (R)\{i}

1 d 1 |S| Y

d G qi + y , Bd 1

= (qi 1)ψ G qi + y

2.4 Computations

Even though we have simplified the Shapley value of a decision rule in a compact form using polynomials, it is still not friendly in computational complexity. In particular, the values qi,v are flat aggregated statistics and do not necessarily share terms in-between different rules. This makes it difficult to share intermediate results across different rules. To benefit from the fact that decision rules overlap, we develop an edge-based polynomial representation.

For every edge e with feature i, we use e to denote its closest ancestor edge that shares the same feature. In cases such edge does not exist, e = . We also use x πu to represent the x[i] is satisfied by all splitting criteria on feature i associated with all edges in Pi,u.

e Pi,h(e) 1 we x πh(e) 0 x / πh(e) (14)

We also define additionally that p = 1.

If e is the last edge in Pi,v then pe = qi,v. This is the key to avoid qi,v completely, and instead switch to pe. Our analysis will make sure that any pe that does not correspond to qi,v for any v and i gets cancelled out.

We show a relation between the Shapley value of a decision rule and the newly defined pe s.

Consider an operation d1,d2 : R[x]d1 R[x]d2 R[x]max(d1,d2). The subscript is omitted when d1, d2 is implicit. G1 G2 := G1 + G2 (1 + y)d1 d2, (15)

We extend the summary polynomial to all nodes in the tree. Let Gu = L

v L(u) Gv. Denote de as the degree of Gu, where h(e) = u. Proposition 2.3. Let v be a leaf in Tf, and dv be the degree of Gv then

ϕ(Rv, i) = X

e Pi,v (pe 1)ψ Gv (y + 1)de dv

(pe 1)ψ Gv (y + 1)de dv

Proof. Let e be the last edge of Pi,v. We note a few facts. pe = qi,v, y + qi,v divides Gv, and the sum is a telescoping sum. Put them together. X

e Pi,v (pe 1)ψ Gv (y + 1)de dv

(pe 1)ψ Gv (y + 1)de dv

=(pe 1)ψ Gv (y + 1)de dv

=(qi,v 1)ψ Gv (y + 1)de dv

=(qi,v 1)ψ Gv y + qi,v (y + 1)de dv

=(qi,v 1)ψ Gv y + qi,v

The following theorem establishes the relation between Shapley values, the summary polynomials at each node, and pe for each edge e. Theorem 2.4 (Main).

ϕ(f, i) = X

e Ei (pe 1)ψ Gh(e)

$ Gh(e) (y + 1)de de

Proof. Based on linearity of Shapley Value, ϕ(f, i) = P

v L(Tf ) ϕ(Rv, i).

For each rule Rv, we can scale their summary polynomial Gv to the degree of Gh(e). Based on Proposition 2.3,

ϕ(f, i) = X

e Pi,v (pe 1)ψ Gv (y + 1)de dv

(pe 1)ψ Gv (y + 1)(de dv)+(de de)

Observe that for any (e, v) pair, we have e Ei and v L(h(e)) if and only if v L(Tf) and e Pi,v. Hence, can order the summation by summing through the edges.

ϕ(f, i) = X

v L(h(e)) (pe 1)ψ Gv (y + 1)de dv

(pe 1)ψ Gv (y + 1)(de dv)+(de de)

Observe that at each edge e, all summary polynomial G is scaled to the same degree de. According to Proposition 2.1, we can add the summary polynomials before evaluate using ψ(.). Now, focus on the first part of the sum.

v L(h(e)) ((pe 1)ψ( Gv (y + 1)de dv

y + pe ) = X

e Ei (pe 1)ψ

Gv (y + 1)de dv

e Ei (pe 1)ψ(

P v L(h(e)) Gv (y + 1)de dv

e Ei (pe 1)ψ(

v L(h(e)) Gv

e Ei (pe 1)ψ Gh(e)

Using the exact same proof, we can also obtain

v L(h(e)) (pe 1)ψ Gv (y + 1)(de dv)+(de de)

e Ei (pe 1)ψ

$ Gh(e) (y + 1)de de

2.5 Linear Tree SHAP and complexity analysis

By Theorem 2.4, we can obtain an algorithm in two phases. Efficiently compute the summary polynomial on each node(Algorithm 2) and then evaluates for ϕ(f, i) directly(Algorithm 3). Both parts of the algorithm are straightforward, basically computing directly through definition and tree traversal. The final values of S[i] is the desired value ϕ(f, i)(x) after running Algorithm 4.

To analyze the running time, one can see each node is visited a constant number of times. The operations are polynomial addition, multiplication, division, inner product, or constant-time operations. In general, all those polynomial operations takes O(D log D) time for degree D polynomial [1]. This shows the total running time is O(LD log D).

However, we never need the coefficients of the polynomials. So we can improve the running time by storing the summary polynomials in a better-suited form, the multipoint interpolation form. Namely, we evaluate the polynomials G on a set of predefined unique points Y = (y0, y1, y2, , y D) RD+1, and store G(Y ) = (G(y0), . . . , G(y D)) instead. In this form, addition, product and division takes O(D) time [2]. The evaluation function ψ(G) also takes O(D) time but needs more explanation.

Denote V(Y ) RD+1 D+1 as the Vandermonde matrix of Y , where vi,j V(Y ) = yj i is the jth power of yi.

COMPUTESUMMARYPOLYNOMIALS(x, v, C): if node v is leaf: G[v] C Rv else: if v is not the root: e edge with v as head C C (y + pe(x)) if e = : C C y+pe (x) l, r children of v COMPUTESUMMARYPOLYNOMIALS(x, l, C) COMPUTESUMMARYPOLYNOMIALS(x, r, C) G[v] G[l] G[r] return G[v]

Algorithm 2: Obtain the summary polynomial for each node.

AGGREGATESHAPLEY(x, v, G): if v has children: l, r children of v AGGREGATESHAPLEY(x, l, G) AGGREGATESHAPLEY(x, r, G) if v is not the root: e edge with v as head i feature of edge e S[i] S[i] + (pe(x) 1)ψ( j G[v] y+pe(x) k )

S[i] S[i] (pe (x) 1)ψ( j G[v] (y+1) de de

y+pe (x) k )

Algorithm 3: Obtain the Shapley value vector.

Lemma 2.5. Let p, q R[x]d, and its coefficients A and B, respectively, then we have

p, q = A, B = p(Y ), V(Y ) 1B .

Proof. Polynomial evaluation can be consider as inner product of coefficient and Vandermonde matrix of input Y . Namely p(Y ) = A V(Y ). Therefore p(Y ), V(Y ) 1B = A V(Y ), V(Y ) 1B = A, V(Y ) V(Y ) 1B = A, B completes the proof.

In order to compute the inner products G, Bd in O(D) time, we have to precompute Nd = V(Y ) 1Cd, where Cd is the coefficient of Bd, for all 0 d D. This can be done simply in O(D) time for each d, so a total of O(D2) time.

By storing the polynomial in interpolation form, all our polynomial operations on each node take O(D) time. Therefore the total running time is O(LD).

Other than the summary polynomials, the algorithm uses constant space to store information on nodes and edges. Each summary polynomial takes O(D) space to store. Therefore the algorithm takes O(LD) space. Nevertheless, we can save space by realizing the algorithms only need a single top-down and a single bottom-up step. By joining two steps into one, the algorithm consumes the summary polynomials on the spot. Hence the total space usage will be bounded by O(D) times the stack size, bounded by D, the depth of the tree. The final total space usage is improved to O(D2).

Remark Even though Y can be arbitrarily chosen based on the maximum depth of the tree D, it is shown that Chebyshev points are near-optimal in numerical stability [11]. In our Linear Tree Shap implementation, we used the Chebyshev points of the second kind.

LINEARTREESHAP(x, Tf): G an array indexed by the nodes COMPUTESUMMARYPOLYNOMIALS(x, root(Tf), 1) S an array indexed by the features AGGREGATESHAPLEY(x, root(Tf), G) return S

Algorithm 4: The entire LINEARTREESHAP algorithm.

5 10 15 depth

dataset = adult

5 10 15 depth

dataset = conductor

method linear_tree_shap fast_tree_shap_v1 fast_tree_shap_v2 tree_shap

Figure 5: Speed up comparison

Datasets # Instances # Attributes Task Classes adult [4] 48,842 64 Classification 2 conductor [3] 21,263 81 Regression - Table 2: Datasets

3 Experiments

We run an experiment on both regression dataset adult and classification dataset conductor(summary in Table 2) to compare both our method and two popular algorithms, Tree Shap and Fast Tree Shap. We explain Trees with depths ranging from 2 to 18. And to align the performance across different depths of trees, we plot the ratio between the time of Tree Shap and the time of all methods in Figure 5. We run every algorithm on the same test set 5 times to get both average speeds up and the error bar. And for fair comparison purposes, all methods are limited to using a single core.

Linear Tree Shap is the fastest among all setups. And due to heavy memory usage, Fast Tree Shap V2 falls back to V1 when tree depth reaches 18 for the dataset conductor. Since the degree of the polynomial is bounded both by the depth of the tree also the number of unique features per decision rule, with deeper depth, dataset Conductor has much more speed up gains thanks to a higher number of features. We can conclude that the Linear Tree Shap is more efficient than all state-of-the-art Shapley value computing methods in both theory and practice.

[1] Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. Mc Graw-hill New York, 1986.

[2] Henri Cohen. A course in computational algebraic number theory, volume 8. Springer-Verlag Berlin, 1993.

[3] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346 354, 2018.

[4] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 96, page 202 207. AAAI Press, 1996.

[5] Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex De Grave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable ai for trees. Nature machine intelligence, 2(1):56 67, 2020.

[6] Rory Mitchell, Eibe Frank, and Geoffrey Holmes. Gputreeshap: Fast parallel tree interpretability, 2020.

[7] J Ross Quinlan. C4. 5: programs for machine learning. Elsevier, 2014.

[8] J.R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27(3):221 234, 1987.

[9] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. why should i trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135 1144, 2016.

[10] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307 317, 1953.

[11] Sven Weisbrich, Georgios Malissiovas, and Frank Neitzel. On the fast approximation of point clouds using chebyshev polynomials. Journal of Applied Geodesy, 15(4):305 317, 2021.

[12] Jilei Yang. Fast treeshap: Accelerating shap value computation for trees. ar Xiv preprint ar Xiv:2109.09847, 2021.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes]

(c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes]

(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]

(d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]