# Scalable Interpretability via Polynomials

Abhimanyu Dubey (Meta AI, dubeya@fb.com), Filip Radenovic (Meta AI, filipradenovic@fb.com), Dhruv Mahajan (Meta AI, dhruvm@fb.com)

Abstract

Generalized Additive Models (GAMs) have quickly become the leading choice for interpretable machine learning. However, unlike uninterpretable methods such as DNNs, they lack expressive power and easy scalability, and are hence not a feasible alternative for real-world tasks. We present a new class of GAMs that use tensor rank decompositions of polynomials to learn powerful, inherently-interpretable models. Our approach, titled Scalable Polynomial Additive Models (SPAM), is effortlessly scalable and models all higher-order feature interactions without a combinatorial parameter explosion. SPAM outperforms all current interpretable approaches, and matches DNN/XGBoost performance on a series of real-world benchmarks with up to hundreds of thousands of features. Human subject evaluations show that SPAMs are more interpretable in practice, and they are hence an effortless replacement for DNNs when creating interpretable and high-performance systems suitable for large-scale machine learning. Source code is available at github.com/facebookresearch/nbm-spam.

1 Introduction

Interpretable machine learning systems suffer from a tradeoff between approximation and interpretability: high-performing models used in practice have millions of parameters and are highly non-linear, making them difficult to interpret. Post-hoc explainability attempts to remove this tradeoff by explaining black-box predictions with interpretable instance-specific approximations (e.g., LIME [Ribeiro et al., 2016], SHAP [Lundberg and Lee, 2017]); however, such explanations are notoriously unstable [Ghorbani et al., 2019a, Lakkaraju and Bastani, 2020], expensive to compute [Slack et al., 2021], and in many cases, inaccurate [Lipton, 2018]. It is therefore desirable to learn models that are instead inherently interpretable (glass-box), i.e., do not require post-hoc interpretability, but provide better performance than inherently-interpretable classical approaches such as linear or logistic regression. This has led to a flurry of research interest in Generalized Additive Models [Hastie and Tibshirani, 2017], i.e., methods that non-linearly transform each input feature separately, e.g., tree-based GAMs and GA2Ms [Lou et al., 2013], or NAMs [Agarwal et al., 2021]. While these are an improvement over linear methods, their highly complex training procedures mean they are not a simple drop-in replacement for black-box methods. Furthermore, it is not straightforward to model feature interactions in these models: the combinatorial explosion in higher-order interactions makes this infeasible without computationally expensive feature selection [Lou et al., 2013, Chang et al., 2021]. Contrary to these approaches, we revisit an entirely different solution: polynomials. It is folklore in machine learning that since polynomials are universal approximators [Stone, 1948, Weierstrass, 1885], modeling all possible feature interactions would, hypothetically, suffice to create powerful learners and eliminate the need for intricate non-linear transformations of data.
Furthermore, polynomials can also provide a rubric for interpretability: lower-degree polynomials, which have interactions between only a few features, are interpretable but likely less accurate, whereas polynomials of higher degree have larger predictive power at the cost of interpretability. Thus, not only are polynomial models capable of capturing all possible interactions between features, they also give practitioners the ability to select a suitable model (order of interactions, i.e., degree) based precisely on their requirements in the interpretability-performance tradeoff. However, such an approach has to date remained elusive, as learning interpretable feature interactions efficiently and at scale remains an open problem. In this paper, we introduce a highly scalable approach to modeling feature interactions for inherently interpretable classifiers based on rank-decomposed polynomials. Our contributions are:

1. First, we present an algorithm titled Scalable Polynomial Additive Models (SPAM) to learn inherently-interpretable classifiers that can learn all possible feature interactions by leveraging low-rank tensor decompositions of polynomials [Nie, 2017b]. To the best of our knowledge, SPAM is among the first Generalized Additive Models with feature interactions that scale to 100K features. SPAM can be trained end-to-end with SGD (backpropagation) and GPU acceleration, but has orders of magnitude fewer parameters than comparable interpretable models.

2. We demonstrate that under a coarse regularity assumption (Assumption 1), SPAM converges to the optimal polynomial as the number of samples n → ∞. Furthermore, we establish novel non-asymptotic excess risk bounds that match classic bounds for linear or full-rank polynomials.

3. SPAM outperforms all current interpretable baselines on several machine learning tasks. To the best of our knowledge, our experimental benchmark considers problems of size (number of samples and dimensionality) orders of magnitude larger than prior work, and we show that our simple approach scales easily to all problems, whereas prior approaches fail. Moreover, we demonstrate that modeling feature interactions is as important as non-linear feature transformations: on most real-world tabular datasets, pairwise interactions suffice to match DNN performance.

4. We conduct a detailed human subject evaluation to corroborate our performance with practical interpretability. We demonstrate that SPAM explanations are more faithful compared to post-hoc methods, making them a drop-in replacement for black-box methods such as DNNs for interpretable machine learning on tabular and concept-bottleneck [Koh et al., 2020] problems.

2 Related Work

Here we survey the literature most relevant to our work, i.e., interpretable ML with feature interactions, and polynomials for machine learning. Please see Appendix Section B for a detailed survey.

Transparent and Interpretable Machine Learning. Early work focused on greedy or ensemble approaches to modeling interactions [Friedman, 2001, Friedman and Popescu, 2008] that enumerate pairwise interactions and learn additive interaction effects. Such approaches often pick spurious interactions when data is sparse [Lou et al., 2013] and are impossible to scale to modern-sized datasets due to the enumeration of individual combinations. As an improvement, Lou et al. [2013] proposed GA2M, which uses a statistical test to select only true interactions.
However, GA2M fails to scale to large datasets, as it requires constant re-training of the model and ad-hoc operations such as discretization of features, which may require tuning for datasets with a large dimensionality. Other generalized additive models require expensive training of millions of decision trees, kernel machines or splines [Hastie and Tibshirani, 2017], making them unattractive compared to black-box models. An alternate approach is to learn interpretable neural network transformations. Neural Additive Models (NAMs, Agarwal et al. [2021]) learn a DNN per feature. TabNet [Arık and Pfister, 2021] and NIT [Tsang et al., 2018] alternatively modify NN architectures to increase their interpretability. NODE-GAM [Chang et al., 2021] improves NAMs with oblivious decision trees for better performance while maintaining interpretability. Our approach is notably distinct from these prior works: we do not require iterative re-training; we can learn all pairwise interactions regardless of dimensionality; we can train SPAM via backpropagation; and we scale effortlessly to very large-scale datasets.

Learning Polynomials. The idea of decomposing polynomials was of interest prior to the deep learning era. Specifically, the work of Ivakhnenko [1971], Oh et al. [2003], Shin and Ghosh [1991] studies learning neural networks with polynomial interactions, also known as ridge polynomial networks (RPNs). However, RPNs are typically not interpretable: they learn interactions of a very high order and include non-linear transformations. Similar rank decompositions have been studied in the context of matrix completion [Recht, 2011], and are also a subject of interest in tensor decompositions [Nie, 2017a, Brachat et al., 2010], where, contrary to our work, the objective is to decompose existing tensors rather than directly learn decompositions via gradient descent. Recently, Chrysos et al. [2019, 2020] used tensor decompositions to learn higher-order polynomial relationships in intermediate layers of generative models. However, their work uses a recursive formulation and learns high-degree polynomials directly from uninterpretable input data (e.g., images), and hence is non-interpretable.

3 Scalable Polynomial Additive Models

Setup and Notation. Matrices are represented by uppercase boldface letters, e.g., $\mathbf{X}$, and vectors by boldface lowercase, e.g., $\mathbf{x}$. We assume that the covariates lie within the set $\mathcal{X} \subseteq \mathbb{R}^d$, and the labels lie within the finite set $\mathcal{Y}$. Data $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}$ are drawn following some unknown (but fixed) distribution $\mathbb{P}$. We assume we are provided with $n$ i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ as the training set.

Motivation. Generalized Additive Models [Hastie and Tibshirani, 2017] are an excellent design choice for interpretable models, as they learn transformations of individual features, allowing us to model exactly the contribution of any feature. A typical GAM has the form

$$\underbrace{\sum_i f_i(x_i)}_{\text{order 1 (unary)}} + \underbrace{\sum_{i<j} f_{ij}(x_i, x_j)}_{\text{order 2 (pairwise)}} + \dots + \underbrace{f_{1\dots d}(\mathbf{x})}_{\text{order } d},$$

where $f_i$, $f_{ij}$, etc., are possibly non-linear transformations; moving from left to right trades interpretability for performance. It is evident [Lou et al., 2013] that as the order of interaction increases, e.g., beyond pairwise interactions, these models are no longer interpretable, albeit at some benefit to performance. While some approaches (e.g., Agarwal et al. [2021], Chang et al. [2021]) learn $f_i$ via neural networks, we want to learn the simplest GAMs, i.e., polynomials. Specifically, we want to learn a polynomial $P(\mathbf{x})$ of order (degree) $k \leq d$ of the form:

$$P(\mathbf{x}) = b + \sum_i w^{(1)}_i x_i + \sum_{i,j} w^{(2)}_{ij} x_i x_j + \dots$$
Here, the weights $\mathbf{W}^{(l)} = \{w^{(l)}_{i_1 \dots i_l}\}$, $1 \leq l \leq k$, capture the $l$-ary interactions between subsets of $l$ features. For small values of $d$ and $k$, one can potentially enumerate all possible interactions and learn a linear model of $O(d^k)$ dimensionality; however, this approach quickly becomes infeasible for large $d$ and $k$. Furthermore, the large number of parameters in $P(\mathbf{x})$ makes regularization essential for performance, and the computation of each interaction can be expensive. Alternatively, observe that any polynomial that models $k$-ary interactions can be written as follows, for weights $\{\mathbf{W}^{(l)}\}_{l=1}^k$:

$$P(\mathbf{x}) = \mathbf{W}^{(1)} \circ_1 \mathbf{x} + \mathbf{W}^{(2)} \circ_2 \mathbf{x} + \dots + \mathbf{W}^{(k)} \circ_k \mathbf{x} + b.$$

Here, the weights $\mathbf{W}^{(l)} \in \mathbb{R}^{d^l}$ are written as tensors of order $l$, and the operation $\circ_l$ refers to a repeated inner product along $l$ dimensions, e.g., for $\mathbf{W} \in \mathbb{R}^{d^2}$, $\mathbf{x}^\top \mathbf{W} \mathbf{x} = \mathbf{W} \circ_2 \mathbf{x} = (\mathbf{W} \cdot \mathbf{x}) \cdot \mathbf{x}$, where $\cdot$ denotes the inner product. More generally, any symmetric order-$l$ tensor admits an equivalent polynomial representation having only degree-$l$ terms. Now, this representation is still plagued by the earlier curse of dimensionality for arbitrary weight tensors $\{\mathbf{W}^{(l)}\}_{l=1}^k$, and we will now demonstrate how to circumvent this issue by exploiting rank decompositions.

3.1 Learning Low-Rank Decompositions of Polynomials

The primary insight of our approach is to observe that any symmetric tensor $\mathbf{W}^{(l)}$ of order $l$ also admits a rank decomposition [Nie, 2017a, Brachat et al., 2010] ($\otimes$ being the outer product):

$$\mathbf{W}^{(l)} = \sum_{i=1}^{r} \lambda_i \cdot \underbrace{\mathbf{u}_i \otimes \mathbf{u}_i \otimes \dots \otimes \mathbf{u}_i}_{l \text{ times}},$$

where $\{\mathbf{u}_i\}_{i=1}^r$ are the (possibly orthonormal) $d$-dimensional basis vectors, $r \in \mathbb{Z}^+$ denotes the rank of the tensor, and the scalars $\{\lambda_i\}_{i=1}^r$, $\lambda_i \in \mathbb{R}$, denote the singular values. Our objective is to directly learn the rank decomposition of an order-$l$ tensor, and therefore learn the required polynomial function. This gives us a total of $O(rd)$ learnable parameters per tensor, down from a previous dependence of $O(d^l)$. This low-rank formulation additionally enables us to compute the polynomial more efficiently. Specifically, for a degree-$k$ polynomial with ranks $\mathbf{r} = \{1, r_2, \dots, r_k\}$, we have:

$$P(\mathbf{x}) = b + \langle \mathbf{u}_1, \mathbf{x} \rangle + \sum_{i=1}^{r_2} \lambda_{2i} \langle \mathbf{u}_{2i}, \mathbf{x} \rangle^2 + \sum_{i=1}^{r_3} \lambda_{3i} \langle \mathbf{u}_{3i}, \mathbf{x} \rangle^3 + \dots + \sum_{i=1}^{r_k} \lambda_{ki} \langle \mathbf{u}_{ki}, \mathbf{x} \rangle^k. \quad (1)$$

Here $\{\mathbf{u}_{li}\}_{i=1}^{r_l}$ and $\{\lambda_{li}\}_{i=1}^{r_l}$ denote the corresponding bases and singular values for $\mathbf{W}^{(l)}$. The above expression can now be computed via simple inner products, and has a time as well as space complexity of $O(d\,\|\mathbf{r}\|_1)$, instead of the earlier $O(d^k)$. Here, $\|\mathbf{r}\|_1$ simply represents the sum of the ranks of all $k$ tensors. Note that low-rank decompositions do not enjoy a free lunch: the rank $r$ of each weight tensor can, in the worst case, be polynomial in $d$. However, it is reasonable to assume that correlations in the data $\mathcal{X}$ will ensure that a small $r$ suffices for real-world problems (see, e.g., Assumption 1).

Optimization. For notational simplicity, let us parameterize any order-$k$ polynomial by the weight set $\Theta = \{b, \mathbf{u}_1, \{\lambda_{2i}, \mathbf{u}_{2i}\}_{i=1}^{r_2}, \dots, \{\lambda_{ki}, \mathbf{u}_{ki}\}_{i=1}^{r_k}\} \in \mathbb{R}^{r(d+1)}$, where from now onwards $r = \|\mathbf{r}\|_1 = 1 + r_2 + r_3 + \dots + r_k$ denotes the cumulative rank across all the tensors. We write the polynomial parametrized by $\Theta$ as $P(\cdot\,; \Theta)$, and the learning problem for any loss $\ell$ is:

$$\Theta^\star = \arg\min_{\Theta \in \mathcal{C}} \sum_{i=1}^{n} \ell\big(P(\mathbf{x}_i; \Theta), y_i\big) + \beta \cdot R(\Theta). \quad (2)$$

Here, $\mathcal{C}$ denotes the feasible set for $\Theta$, and $R(\Theta)$ denotes an appropriate regularization term scaled by $\beta > 0$. We can show that the above problem is well-behaved for certain data distributions $\mathcal{X}$.
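To make this parameterization concrete, the following is a minimal PyTorch sketch of the forward pass in Equation 1. It is an illustration under our own naming assumptions (class `LowRankPolynomial`, attributes `bases` and `singular_values`), not the released nbm-spam implementation.

```python
import torch
import torch.nn as nn


class LowRankPolynomial(nn.Module):
    """Degree-k polynomial with rank-decomposed interaction tensors (cf. Equation 1)."""

    def __init__(self, d, ranks):
        # ranks = [r_2, ..., r_k] gives the rank of the order-2, ..., order-k tensors.
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        self.u1 = nn.Parameter(torch.zeros(d))  # order-1 (linear) weights
        # One basis matrix U_l (r_l x d) and one singular-value vector lambda_l per order l >= 2.
        self.bases = nn.ParameterList([nn.Parameter(0.01 * torch.randn(r, d)) for r in ranks])
        self.singular_values = nn.ParameterList([nn.Parameter(torch.zeros(r)) for r in ranks])

    def forward(self, x):  # x: (batch, d)
        out = self.bias + x @ self.u1
        for order, (U, lam) in enumerate(zip(self.bases, self.singular_values), start=2):
            proj = x @ U.t()                   # (batch, r_l): <u_{li}, x> for every basis
            out = out + (proj ** order) @ lam  # sum_i lambda_{li} <u_{li}, x>^l
        return out


# A degree-3 scorer with ranks r_2 = 32, r_3 = 16 on 54-dimensional inputs.
model = LowRankPolynomial(d=54, ranks=[32, 16])
scores = model(torch.rand(8, 54))  # shape (8,), one polynomial score per example
```

Each order $l$ only requires the inner products $\langle \mathbf{u}_{li}, \mathbf{x} \rangle$, which is what keeps the per-example cost at $O(d\,\|\mathbf{r}\|_1)$.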
Proposition 1. If the regularization $R$ and loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ are convex, $\lambda > 0$, and $\mathcal{X} \subseteq \mathbb{R}^d_+$, then the optimization problem in Equation 2 is convex in $\Theta$ for $\mathcal{C} = \mathbb{R}^{r(d+1)}$.

Proposition 1 suggests that if the training data is positive (achieved by renormalization), typical loss functions (e.g., cross-entropy or mean squared error) and regularizers $R$ such as $L_p$ norms are well-suited for optimization. To ensure convergence, one can use proximal SGD with a small learning rate [Nitanda, 2014]. In practice, we find that unconstrained SGD also provides comparable performance.

3.2 Improving Polynomials for Learning

Geometric rescaling. Learning higher-order interactions becomes trickier as the order of interaction increases: the product of two features is an order of magnitude different from the original features, and consequently, higher-order products are progressively disproportionate. To mitigate this, we rescale features such that (a) the scale is preserved across terms, and (b) the variance in interactions is captured. We replace the input $\mathbf{x}$ with $\mathbf{x}_l = \mathrm{sign}(\mathbf{x}) \odot |\mathbf{x}|^{1/l}$ for an interaction of order $l$. Specifically,

$$P(\mathbf{x}) = \langle \mathbf{u}_1, \mathbf{x} \rangle + \sum_{i=1}^{r_2} \lambda_{2i} \langle \mathbf{u}_{2i}, \mathbf{x}_2 \rangle^2 + \sum_{i=1}^{r_3} \lambda_{3i} \langle \mathbf{u}_{3i}, \mathbf{x}_3 \rangle^3 + \dots + \sum_{i=1}^{r_k} \lambda_{ki} \langle \mathbf{u}_{ki}, \mathbf{x}_k \rangle^k + b. \quad (3)$$

We denote this model as SPAM-LINEAR. We argue that this rescaling for unit-bounded features ensures higher interpretability as well as better learning (since all features are of similar variance). Note that for order 1, $\mathbf{x}_1 = \mathbf{x}$. For $k \geq 2$, consider a pairwise interaction between two features $x_i = 0.5$ and $x_j = 0.6$. Naively multiplying the two gives a feature value of $x_i x_j = 0.3$ (smaller than both $x_i$ and $x_j$), whereas an intuitive value of the feature would be the geometric mean $\sqrt{x_i x_j} \approx 0.55$. Regarding the reduced variance, observe that, unconstrained, $\mathbb{V}(x_i x_j) \approx \mathbb{V}(x_i) \cdot \mathbb{V}(x_j)$, whereas $\mathbb{V}(\sqrt{x_i x_j}) \approx \mathbb{V}(\sqrt{x_i}) \cdot \mathbb{V}(\sqrt{x_j}) \approx \sqrt{\mathbb{V}(x_i) \cdot \mathbb{V}(x_j)}$. If the features have small variance, e.g., $\mathbb{V}(x_i) = \mathbb{V}(x_j) = 10^{-2}$, and are uncorrelated, then the first case would provide a much smaller variance for the interaction, whereas the rescaling would preserve variance.

Shared bases for multi-class problems. When learning a multi-class classifier, learning a unique polynomial for each class creates a large model (with $O(2drC)$ weights) when the number of classes $C$ is large, which leads to overfitting and issues with regularization. Instead, we propose sharing the bases ($\mathbf{u}$) across classes for all higher-order terms (order $\geq 2$), and learning class-specific singular values ($\lambda$) for all higher-order terms per class. For any input $\mathbf{x}$, $P(\mathbf{x}; \Theta) = \mathrm{SOFTMAX}(\{P_c(\mathbf{x}; \Theta)\}_{c=1}^C)$, where, for any class $c \in \{1, \dots, C\}$,

$$P_c(\mathbf{x}; \Theta) = b_c + \langle \mathbf{u}^c_1, \mathbf{x} \rangle + \sum_{i=1}^{r_2} \lambda^c_{2i} \langle \mathbf{u}_{2i}, \mathbf{x}_2 \rangle^2 + \sum_{i=1}^{r_3} \lambda^c_{3i} \langle \mathbf{u}_{3i}, \mathbf{x}_3 \rangle^3 + \dots + \sum_{i=1}^{r_k} \lambda^c_{ki} \langle \mathbf{u}_{ki}, \mathbf{x}_k \rangle^k.$$

Here the bias $b_c$, the first-order weights $\mathbf{u}^c_1$, and the singular values $\lambda^c_{li}$ are unique to each class, whereas the bases $\mathbf{u}_{li}$ are shared across classes. This reduces the model size to $O((d + r)C + rd)$ from $O(rdC)$. We set $\ell$ as the cross-entropy of the softmax of $\{P_c(\cdot\,; \Theta)\}_{c=1}^C$, as both these operations preserve convexity.

Exploring nonlinear input transformations. Motivated by non-linear GAMs (e.g., GA2M [Lou et al., 2013] and NAM [Agarwal et al., 2021]), we modify our low-rank decomposed algorithm by replacing $\mathbf{x}_l$ for each $l$ with feature-wise non-linearities. We learn a non-linear SPAM $P(\cdot\,; \Theta)$ as

$$P(\mathbf{x}; \Theta) = b + \langle \mathbf{u}_1, F_1(\mathbf{x}) \rangle + \sum_{i=1}^{r_2} \lambda_{2i} \langle \mathbf{u}_{2i}, F_2(\mathbf{x}) \rangle^2 + \dots + \sum_{i=1}^{r_k} \lambda_{ki} \langle \mathbf{u}_{ki}, F_k(\mathbf{x}) \rangle^k.$$

Here, the function $F_i(\mathbf{x}) = [f_{i1}(x_1), f_{i2}(x_2), \dots, f_{id}(x_d)]$ is a Neural Additive Model (NAM) [Agarwal et al., 2021]. We denote this model as SPAM-NEURAL. For more details on the NAM we use, please refer to Appendix Section C.
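A hedged sketch of the geometric rescaling and the shared-basis multi-class variant described above, for the order-2 case, is given below; `rescale` and `SharedBasisSPAM` are illustrative names of our own (not the released nbm-spam API), and the resulting logits feed into a standard softmax cross-entropy loss as described.

```python
import torch
import torch.nn as nn


def rescale(x, order):
    """Geometric rescaling x_l = sign(x) * |x|^(1/l) for interactions of order l."""
    return torch.sign(x) * x.abs().pow(1.0 / order)


class SharedBasisSPAM(nn.Module):
    """Order-2 multi-class SPAM: pairwise bases shared across classes, singular values per class."""

    def __init__(self, d, num_classes, rank):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_classes))
        self.u1 = nn.Parameter(torch.zeros(num_classes, d))       # class-specific linear weights
        self.U2 = nn.Parameter(0.01 * torch.randn(rank, d))       # shared pairwise bases
        self.lam2 = nn.Parameter(torch.zeros(num_classes, rank))  # class-specific singular values

    def forward(self, x):                  # x: (batch, d), assumed non-negative after preprocessing
        x2 = rescale(x, order=2)           # geometric rescaling for the pairwise terms
        proj_sq = (x2 @ self.U2.t()) ** 2  # (batch, rank): <u_{2i}, x_2>^2
        return self.bias + x @ self.u1.t() + proj_sq @ self.lam2.t()  # (batch, C) class logits


# e.g., a CUB-200-sized problem: 278 concepts, 200 classes, pairwise rank 800.
logits = SharedBasisSPAM(d=278, num_classes=200, rank=800)(torch.rand(4, 278))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 200, (4,)))
```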
Note that the resulting SPAM-NEURAL model is still interpretable: the interaction between features $x_i$ and $x_j$, for example, is now given by $\big(\sum_{l=1}^{r_2} \lambda_{2l} u_{2li} u_{2lj}\big) \cdot f_{2i}(x_i) f_{2j}(x_j)$ instead of the $\big(\sum_{l=1}^{r_2} \lambda_{2l} u_{2li} u_{2lj}\big) \cdot \sqrt{x_i x_j}$ of linear SPAM.

Dropout for bases via $\lambda$. To capture non-overlapping feature correlations, we introduce a basis dropout by setting the contribution of any particular basis to zero at random via the singular values $\lambda$. Specifically, observe that the contribution of any basis-singular value pair $(\mathbf{u}, \lambda)$ at order $l$ is $\lambda \langle \mathbf{u}, \mathbf{x} \rangle^l$. We apply dropout to $\lambda$ to ensure that the network learns robust basis directions.

3.3 Approximation and Learning-Theoretic Guarantees

We present some learning-theoretic guarantees for SPAM. Proofs are deferred to Appendix Section A.

Proposition 2. Let $f$ be a continuous real-valued function over a compact set $\mathcal{X}$. For any threshold $\epsilon > 0$, there exists a low-rank polynomial $P : \mathcal{X} \to \mathbb{R}$ such that $\sup_{\mathbf{x} \in \mathcal{X}} |f(\mathbf{x}) - P(\mathbf{x})| < \epsilon$.

The above result demonstrates that, asymptotically, decomposed polynomials are universal function approximators. Next, we present an assumption on the data-generating distribution as well as a novel excess risk bound that characterizes learning with polynomials more precisely (non-asymptotically).

Assumption 1 (Exponential Spectral Decay of Polynomial Approximators). Let $\mathcal{P}_k$ denote the family of all polynomials of degree at most $k$, and let $P_{\star,k}$ denote the optimal polynomial in $\mathcal{P}_k$, i.e., $P_{\star,k} = \arg\min_{P \in \mathcal{P}_k} \mathbb{E}_{(\mathbf{x},y) \sim \mathbb{P}}[\ell(P(\mathbf{x}), y)]$. We assume that for all $k$, $P_{\star,k}$ admits a decomposition as described in Equation 1 such that, for all $1 \leq l \leq k$, there exist absolute constants $C_1 < \infty$ and $C_2 = O(1)$ such that $|\lambda_{lj}| \leq C_1 \exp(-C_2 \cdot j^{\gamma})$ for each $j \geq 1$ and $l \in [1, k]$.

This condition essentially bounds the smoothness of the nearest polynomial approximator of $f$, i.e., it implies that only a few degrees of freedom suffice to approximate $f$ accurately. Assumption 1 provides a soft threshold for singular value decay. One can also replace the exponential with a slower polynomial decay and obtain similar results; we discuss this in Appendix Section A. We can now present our error bound for $L_2$-regularized models (see the Appendix for $L_1$-regularized models).

Theorem 1. Let $\ell$ be 1-Lipschitz and $\delta \in (0, 1]$. Let $\mathcal{E}(P_{\star,k})$ denote the generalization error of the optimal degree-$k$ polynomial $P_{\star,k}$, and let $\mathcal{E}(\widehat{P}_{\mathbf{r},k})$ denote the generalization error of the empirical risk minimizer SPAM $\widehat{P}_{\mathbf{r},k} = \arg\min_P \sum_{i=1}^n \ell(P(\mathbf{x}_i), y_i)$ with ranks $\mathbf{r} = [1, r_2, \dots, r_k]$. Then, if $\mathbb{P}$ satisfies Assumption 1 with constants $C_1$, $C_2$ and $\gamma$, we have for $L_2$-regularized ERM, where $\|\mathbf{u}_{ij}\|_2 \leq B_{u,2}$ for all $i \in [k]$, $j \in [r_i]$, and $\|\lambda\|_2 \leq B_{\lambda,2}$ with $\lambda = \{\lambda_{ij}\}$, with probability at least $1 - \delta$,

$$\mathcal{E}(\widehat{P}_{\mathbf{r},k}) - \mathcal{E}(P_{\star,k}) \leq 2 B_{\lambda,2} \cdot (\,\cdots\,).$$

The above result demonstrates the expected scaling of the generalization error, matching the bounds obtained for $L_1$- (see Appendix Section A) and $L_2$-regularized linear classifiers [Wainwright, 2019]. We note that the cumulative rank $r = \|\mathbf{r}\|_1$ contributes an $O(r)$ dependence to the Rademacher complexity. This highlights a trade-off between optimization and generalization: a larger $r$ gives a lower approximation error (smaller second term), but a larger model complexity (larger first term). Observe, however, that even when $r = \Omega((\log d)^p)$ for some $p \geq 1$, the approximation error (second term) diminishes as $o(1/d^{p\gamma})$. This suggests that in practice, one only needs a cumulative rank that is (poly)logarithmic in $d$. The above result also provides a non-asymptotic variant of Proposition 2: as $n, r \to \infty$, $\mathcal{E}(\widehat{P}_{\mathbf{r},k}) \to \mathcal{E}(P_{\star,k})^+$.
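Before moving to experiments, the basis dropout and the pairwise-interaction readout described in Section 3.2 could be realized roughly as follows; this is a sketch under the same assumed tensor shapes and names as the earlier snippets, not the released implementation.

```python
import torch
import torch.nn as nn

# Basis dropout: randomly zero singular values during training so that no single
# basis direction dominates the learned interactions (only applied at training time).
lam2 = nn.Parameter(0.01 * torch.randn(800))                   # order-2 singular values
lam2_dropped = nn.functional.dropout(lam2, p=0.2, training=True)

# Pairwise-interaction readout for an order-2 SPAM: the weight on the interaction
# between features i and j is sum_l lambda_{2l} * u_{2li} * u_{2lj}, i.e. the (i, j)
# entry of U2^T diag(lambda_2) U2.
U2 = 0.01 * torch.randn(800, 278)                              # pairwise bases (rank x d)
pairwise_weights = U2.t() @ (lam2.detach().unsqueeze(1) * U2)  # (d, d) interaction matrix

i, j = 10, 42
print("interaction weight between features i and j:", pairwise_weights[i, j].item())
```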
4 Offline Experiments

Here we evaluate SPAM against a set of benchmark algorithms on both classification and regression problems. The baseline approaches we consider are (see Appendix Section C for precise details):

Deep Neural Networks (DNN): Standard multi-layer fully-connected neural networks, included to demonstrate an upper bound on performance. These are explained via the perturbation-based LIME [Ribeiro et al., 2016] and SHAP [Lundberg and Lee, 2017] methods in Section 5.

Linear / Logistic Regression: These serve as our baseline interpretable models. We learn both L1- and L2-regularized models on unary and pairwise features with minibatch SGD.

Table 1: Evaluation of SPAM on benchmarks against prior work. (↑): higher is better, (↓): lower is better. Runs averaged over 10 random trials with optimal hyperparameters; a dash denotes that the method did not scale to, or was not evaluated on, that dataset.

| Method | CH RMSE (↓) | FICO AUROC (↑) | CovType Acc (↑) | News Acc (↑) | CUB Acc (↑) | iNat Acc (↑) | CO114 mAP (↑) |
|---|---|---|---|---|---|---|---|
| Interpretable baselines | | | | | | | |
| Linear (Order 1) | 0.7354 | 0.7909 | 0.7254 | 0.8238 | 0.7451 | 0.3932 | 0.1917 |
| Linear (Order 2) | 0.7293 | 0.7910 | 0.7601 | - | 0.7617 | 0.4292 | 0.2190 |
| EBMs (Order 1) | 0.5586 | 0.7985 | 0.7392 | - | - | - | - |
| EB2Ms (Order 2) | 0.4919 | 0.7998 | - | - | - | - | - |
| NAM | 0.5721 | 0.7993 | 0.7359 | - | 0.7632 | 0.4194 | 0.2056 |
| Uninterpretable black-box baselines | | | | | | | |
| XGBoost | 0.4428 | 0.7925 | 0.8860 | 0.7677 | 0.7186 | - | - |
| DNN | 0.5014 | 0.7936 | 0.9694 | 0.8494 | 0.7684 | 0.4584 | 0.2376 |
| Our interpretable models | | | | | | | |
| SPAM (Linear, Order 2) | 0.6474 | 0.7940 | 0.7732 | 0.8472 | 0.7786 | 0.4605 | 0.2361 |
| SPAM (Neural, Order 2) | 0.4914 | 0.8011 | 0.7770 | - | 0.7762 | 0.4689 | - |
| SPAM (Linear, Order 3) | 0.6410 | 0.7945 | 0.8066 | 0.8520 | 0.7741 | 0.4684 | 0.2368 |
| SPAM (Neural, Order 3) | 0.4865 | 0.8024 | 0.8857 | - | 0.7753 | 0.4722 | - |

Table 2: Tabular datasets.

| | California Housing (CH) | FICO | Cover Type (CovType) | Newsgroups |
|---|---|---|---|---|
| Source | Pace and Barry [1997] | FICO [2018] | Blackard and Dean [1999] | Lang [1995] |
| Instances | 20,640 | 10,459 | 581,012 | 18,828 |
| Features | 8 | 23 | 54 | 146,016 |
| Classes | - | 2 | 7 | 20 |
| Feature type | Numeric | Mixed | Mixed | TF-IDF |

Gradient Boosted Trees (XGBoost): We use the xgboost library. This baseline is mainly to compare accuracy, as the number of trees required is typically large, making the ensemble uninterpretable.

Explainable Boosting Machines (EBMs) [Lou et al., 2013]: EBMs use millions of shallow bagged trees operating on one feature at a time. Note that this approach is not scalable to datasets with many features or multi-class problems. We report scores on the datasets where we successfully trained EBMs without sophisticated engineering, using the interpretml library [Nori et al., 2019].

Neural Additive Models (NAMs) [Agarwal et al., 2021]: These models are neural-network extensions of EBMs. Note that this method also does not scale to some datasets.

Training Setup. We train L1/L2-regularized SPAM variants with minibatch SGD implemented in PyTorch. For regression tasks, we measure the root mean squared error (RMSE). For binary classification, we report the area under the ROC curve (AUROC); for multi-class classification, we report the top-1 accuracy (Acc@1); and for object detection, we report mean Average Precision (mAP). We tune hyperparameters via random sampling over a grid. Note that for all experiments, both NAM and SPAM-NEURAL have identical MLP structures to ensure identical approximation power.
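As a rough illustration of this training setup (not the exact released script), an L2-regularized SPAM head could be fit with a standard PyTorch loop along the following lines; it reuses the assumed SharedBasisSPAM module from the Section 3.2 sketch, and omits dataset loading, metric computation, and hyperparameter search.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real tabular dataset (features assumed non-negative / rescaled).
X, y = torch.rand(1024, 54), torch.randint(0, 7, (1024,))
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)

model = SharedBasisSPAM(d=54, num_classes=7, rank=64)  # the sketch module from Section 3.2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)  # L2 regularization

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        loss.backward()
        optimizer.step()
```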
For definitions of metrics and hyperparameter ranges, see Appendix Section C.

4.1 Measuring Benchmark Performance

We select tasks to explore a variety of settings, from regression to multi-class classification, and also explore different dataset scales, from a few hundred samples and tens of features to 100K-scale datasets (both in the number of samples and in data dimensionality), while ensuring that the features are interpretable. Our tabular datasets are summarized in Table 2. Please see Appendix Section C.1 for details. For all datasets with no pre-defined train-val-test split, we use a fixed random sample of 70% of the data for training, 10% for validation and 20% for testing. For the 20 Newsgroups dataset, we split the pre-defined training split 7:1 for training and validation, respectively.

In an effort to scale interpretable approaches beyond tabular datasets, we consider benchmark problems using the Independent Concept Bottleneck framework of Koh et al. [2020] for image classification and object detection. We use a convolutional neural network (CNN) backbone, ResNet-50 [He et al., 2016], that is trained to predict interpretable concepts (e.g., parts of an object) from images. After training the backbone, we extract predicted concepts for all inputs, and learn an interpretable classifier head from these concepts. In all experiments, the backbone remains identical across comparisons, and we only compare the head classifiers (see Appendix Section C.3 for more details).

Figure 1: Ablation of Accuracy with: (A1, A2) Degree k; (B1, B2) Rank r; (C) L1 Sparsity.

We select three datasets for evaluation in the concept bottleneck setting:

1. Caltech-UCSD Birds (CUB-200) [Wah et al., 2011]: This is a fine-grained visual categorization dataset where one has to classify between 200 species of birds. The interpretable concepts are 278 binary bird attributes, e.g., the shape of the beak, the color of the wings, etc. The dataset has 5,994 training and 5,794 validation images. See Appendix Section C.3.1 for more details.

2. iNaturalist Birds [Van Horn et al., 2018, 2021]: iNaturalist can be thought of as a larger version of the previous dataset. We only select the Birds super-category of samples, which has 414,000 training and 14,860 validation instances across 1,486 classes. Since the iNaturalist dataset does not contain dense part and attribute annotations, we use the predicted concepts from the CUB-200 backbone model extracted on all samples.

3. Common Objects Dataset (CO114, Proprietary): Here we consider, for the first time, a concept bottleneck model for object detection. We construct a dataset involving common household objects (e.g., bed, saucer, tables, etc.) with their bounding-box annotations. For each bounding box, we collect 2,618 interpretable annotations (e.g., parts, color, etc.). The dataset has 2,645,488 training and 58,525 validation samples across 115 classes with 2,618 interpretable concepts. We report the mean Average Precision (mAP) metric. For more details please refer to Appendix Section C.3.2.

Our results are summarized in Table 1. There are four main takeaways from the results.
First, observe that both SPAM-LINEAR and SPAM-NEURAL comfortably outperform their prior interpretable counterparts on all datasets (e.g., Order 2 SPAM-LINEAR outperforms all linear methods, even the full-rank pairwise model), and SPAM-NEURAL comfortably outperforms all non-linear baselines. Next, observe that on all datasets but Cover Type, second-degree interactions suffice to match or even outperform DNN performance; we discuss Cover Type in detail in the next paragraph. Thirdly, observe that SPAM models are as scalable as the black-box approaches, whereas prior work (e.g., NAM and EBMs) is not. In the case of NAM, we were unable to scale due to the sheer number of parameters required to model large datasets, e.g., Newsgroups (≈900M parameters), and for EBMs, the training time and memory requirement increased dramatically for large datasets. Furthermore, EB2Ms do not even support multi-class problems. Finally, observe that for many problems, e.g., iNat and CUB, we do not require non-linear feature transformations once we model feature interactions.

Cover Type Dataset. XGBoost and DNNs perform substantially better on Cover Type than linear or neural SPAM. On analysis, we found that existing SPAM models underfit on Cover Type, and hence we increased the total parameters of SPAM-NEURAL via subnets, identical to NAMs [Agarwal et al., 2021], where each feature is mapped to s non-linear features (as opposed to 1 originally). With s = 8 subnets, the performance of SPAM-NEURAL (Order 3) improves to 0.9405. In comparison, NAMs with 8 subnets provide a lower accuracy of 0.7551.

Table 3: Throughput benchmarking (higher is faster); throughput in images / second.

| Method | CH | FICO | CovType | News |
|---|---|---|---|---|
| NAM [Agarwal et al., 2021] | 5×10^5 | 1.2×10^5 | 8×10^4 | 23 |
| NAM (Order 2) | 1.1×10^4 | 6×10^3 | 3×10^3 | - |
| Linear SPAM (Order 2) | 6.1×10^7 | 6.7×10^7 | 6.1×10^7 | 2.6×10^6 |
| Neural SPAM (Order 2) | 1.7×10^5 | 7.9×10^3 | 4.1×10^3 | - |
| Linear SPAM (Order 3) | 3.2×10^7 | 3.7×10^7 | 3.9×10^7 | 1.8×10^5 |
| Neural SPAM (Order 3) | 1.1×10^5 | 5.3×10^3 | 2.6×10^3 | - |
| MLP | 1.3×10^7 | 1.3×10^7 | 1.3×10^7 | 2.2×10^5 |

4.2 Ablation Studies

Rank and Degree of Interactions. By Proposition 2, it is natural to expect the approximation quality to increase with the degree k and rank r, i.e., as we introduce more higher-order interactions between features. We ablate the degree k on the tabular benchmark Cover Type and the concept bottleneck benchmark iNaturalist (Birds), as summarized in Figure 1A. We observe, as expected, that increasing the degree k leads to moderate improvements beyond k = 3, but with diminishing returns. Similarly, we examine the effect of the cumulative rank r on performance on the Newsgroups and iNaturalist (Birds) datasets (Figure 1B); we observe that performance is largely insensitive to r, and plateaus after a while. We expect that as r increases further, model complexity will dominate and performance will likely begin to decrease, matching the full-rank performance at r = O(d²) for pairwise models.

Sparsity in Higher-Order Relationships. A requirement for interpretability is to ensure that the learned models can be explained with a few interpretable concepts. Since the number of higher-order combinations increases with k, we enforce sparsity to limit the number of active feature assignments. We penalize the objective in Equation 2 with a regularization term that inhibits dense feature selection. If $\mathbf{U}$ denotes the matrix stacking all basis vectors $\{\mathbf{u}_{li}\}$, we add the penalty $R(\Theta) := \|\mathbf{U}\|_1$, as sketched below.
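A minimal way to add this penalty to the earlier training sketch, assuming the SharedBasisSPAM module from Section 3.2 and a tunable coefficient beta, is:

```python
import torch

beta = 1e-4  # sparsity strength, tuned on the validation set

def loss_with_sparsity(model, xb, yb):
    # Cross-entropy plus an L1 penalty on the stacked basis matrix, discouraging
    # dense pairwise interactions so that explanations stay short.
    ce = torch.nn.functional.cross_entropy(model(xb), yb)
    return ce + beta * model.U2.abs().sum()
```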
This ensures that every basis u only captures limited interactions between features, and hence the overall complexity of the model is limited. We examine accuracy as a function of the fraction of non-zero pairwise interactions for a degree-2, rank-800 SPAM on CUB-200 in Figure 1C, and find that only 6% of the possible interactions suffice to obtain competitive performance.

Examining Assumption 1 (Spectral Decay). The spectral decay assumption presented in Assumption 1 is crucial to obtaining reasonable generalization bounds (Theorem 1). To test this assumption, we examine the spectra of a SPAM-LINEAR (Order 2) model for all 200 classes of CUB-200; these are depicted in Figure 2A. Furthermore, we fit an exponential model to the decay itself and obtain γ = 3, C1 = 0.54 and C2 = 0.006, in accordance with Assumption 1.

4.3 Additional Experiments

4.3.1 Comparisons on Common Benchmarks

In addition to the 7 datasets considered earlier, we evaluate on 9 further benchmark datasets commonly used in the interpretability literature. We evaluate on the MIMIC2, Credit and COMPAS datasets as presented in Agarwal et al. [2021], and the Click, Epsilon, Higgs, Microsoft, Yahoo and Year datasets from Chang et al. [2021]. The results are summarized in Table 4. We observe that SPAM models are competitive with prior work, outperforming it on a number of tasks (the best interpretable model per column is in bold).

4.3.2 Runtime Evaluation

We provide a comparison of SPAM runtimes on 4 different datasets to establish their scalability. We consider California Housing (CH, Pace and Barry [1997], 8 features), FICO HELOC (FICO, FICO [2018], 23 features), Cover Type (CovType, Blackard and Dean [1999], 54 features) and 20 Newsgroups (News, Lang [1995], 146,016 features), and evaluate the throughput of the optimal Linear and Neural SPAM models against Neural Additive Models (NAM, Agarwal et al. [2021]) and MLPs. Table 3 reports the throughput: Linear SPAM is more efficient than MLPs, and Neural SPAM is orders of magnitude faster than NAM, which does not scale to higher orders.

Table 4: Comparison against state-of-the-art interpretable neural networks on additional benchmarks (best interpretable model per column in bold).

| Method | MIMIC2 (AUC) | Credit (AUC) | COMPAS (AUC) | Click (ERR) | Epsilon (ERR) | Higgs (ERR) | Microsoft (MSE) | Yahoo (MSE) | Year (MSE) |
|---|---|---|---|---|---|---|---|---|---|
| NAM [Agarwal et al., 2021] | 0.8539 | 0.9766 | 0.7368 | 0.3447 | 0.1079 | 0.2972 | 0.5824 | 0.6093 | 85.25 |
| NODE-GAM [Chang et al., 2021] | 0.8320 | 0.9810 | 0.7420 | **0.3342** | 0.1040 | 0.2970 | 0.5821 | 0.6101 | 85.09 |
| Linear SPAM (Order 2) | 0.8514 | 0.9836 | **0.7426** | 0.3791 | **0.1011** | 0.2881 | 0.5710 | 0.5923 | 81.30 |
| Neural SPAM (Order 2) | **0.8664** | **0.9850** | 0.7411 | 0.3348 | 0.1020 | **0.2750** | **0.5671** | **0.5869** | **79.99** |
| XGBoost | 0.8430 | 0.9780 | 0.7440 | 0.3334 | 0.1112 | 0.2328 | 0.5544 | 0.5420 | 78.53 |

Table 5: Feature importances for interpretable models used in human subject evaluations.

| Model | Importances for input $\mathbf{x} = \{x_1, \dots, x_d\}$ |
|---|---|
| Linear / post-hoc linear | $w_i \cdot x_i$, $i \in \{1, \dots, d\}$ |
| SPAM (Linear, Order 2) | $u_{1i} \cdot x_i$, $i \in \{1, \dots, d\}$, and $\big(\sum_{l=1}^{r_2} \lambda_{2l} u_{2li} u_{2lj}\big) \cdot \sqrt{x_i x_j}$, $i, j \in [d]$ |

5 Human Subject Evaluations

Experiment Methodology. We now evaluate how explanations from SPAM fare in a practical setting against non-interpretable benchmarks, i.e., black-box models equipped with post-hoc explanations. Our objective is to mimic a practical setting where the model user must decide between a fully-interpretable model and a black-box classifier with post-hoc explainability.
Our experiment design is a Prediction Task [Hoffman et al., 2018, Muramatsu and Pratt, 2001], where the objective for the users is to guess what the model predicted given an explanation. The motivation of such a design is to ascertain both the faithfulness and the interpretability of the model explanations.

Computing Explanations. We first compute the feature importances for any class c as the contribution of any specific feature to the logit for class c (for binary classification or regression, there is only one class). For the Logistic Regression model, we simply use the contribution of each feature to the prediction. For SPAM-Linear (Order 2), we still use each feature's contribution, but features can now be unary (x_i) or pairwise (x_i and x_j). As black-box post-hoc baselines, we consider DNNs with LIME [Ribeiro et al., 2016] and Kernel SHAP [Lundberg and Lee, 2017] to generate post-hoc linear explanations. See Table 5 for the formulation of importances for the models used in our experiments.

Experiment Design. We conduct a separate experiment for each model M with explanations of length E, obtained by selecting the E most important features. Each such experiment runs in two phases: training and testing. In the training phase, participants are shown 8 images of birds and their corresponding explanations to develop their mental model. Four of these images have been predicted as class A by the model, and the remaining as class B. They then move on to the testing phase, where they are successively shown the explanations for 5 unseen images (which the model could have predicted as either A or B), and the users must answer, given the explanation, which class (A or B) the model predicted. If they desire, the users can move back to the training phase at any time to revise model explanations. We do not show the corresponding images in the test phase, as we want the user feedback to rely solely on the faithfulness of explanations. We measure the mean user accuracy on predicting model decisions. For more details on the interface, please see Appendix Section D.1. For each experiment, we follow an independent-subjects design, and gathered a total of 150 participants per experiment. Each task (corresponding to one model-explanation pair) lasted 3 minutes. The participants were compensated with $0.75 USD ($15/hour), and all experiments were run on Amazon Mechanical Turk (AMT, Buhrmester et al. [2016]) with a total cost of $3000. To remove poorly performing participants, we only select those that get the first decision correct, and compute the mean user accuracy using the remaining 4 images.

Study #1: Comparing Black-Box and Transparent Explanations. We compare different interpretable models with a fixed explanation length of E = 7. Our objective is to assess the interpretability of pairwise interactions compared to black-box and linear explanations, and to re-answer whether it is even necessary to have fully-interpretable models compared to black-box models with post-hoc explanations. We used samples from 5 pairs of classes (i.e., choices of A and B) from the CUB-200 dataset (see Appendix Section D.1 for images). The results from this experiment are summarized in Figure 2B. Observe that with SPAM, the mean user accuracy (mUA) is substantially higher (0.71) compared to both linear (0.67) and post-hoc LIME (0.65). We would like to remark that even for E = 7, SPAM explanations contain an average of 3.1 pairwise terms, and therefore the higher-order terms are substantial.
Figure 2: (A) Spectral Decay on CUB-200; Human Subject Evaluation results for (B) Comparing Black-box and glass-box explanations, and (C) LIME vs. SPAM with different explanation lengths.

A one-sided t-test [Freund and Simon, 1967] gives a significance (p-value) of 2.29 × 10⁻⁴ and a test statistic of 3.52 for the SPAM mUA being higher than LIME's, and correspondingly for Linear, we get a significance (p-value) of 0.03801 and a test statistic of 1.666. Hence SPAM's improvements in interpretability are statistically significant over both linear and post-hoc approaches.

Study #2: Varying Explanation Length. One can argue that the increased interpretability of polynomial approaches is due to the fact that each higher-order term involves multiple features (e.g., a pairwise term involves two features), and hence the increased mean user accuracy (mUA) is due to the larger number of features shown (either individually or as pairs). We examine this hypothesis by varying the explanation length E for both SPAM-Linear (Order 2) and LIME, which consequently increases the number of terms seen for both approaches. We observe in Figure 2C that regardless of explanation method, mUA is maximal at E = 7, and introducing more terms decreases interpretability.

6 Discussion and Conclusion

We presented a simple and scalable approach for modeling higher-order interactions in interpretable machine learning. Our approach alleviates several of the concerns with existing algorithms and focuses on presenting a viable alternative for practical, large-scale machine learning systems. In addition to offline experimental guarantees, we demonstrate through an extensive user study that these models are indeed more interpretable in practice, and therefore can readily substitute post-hoc interpretability without any loss in performance. Moreover, to the best of our knowledge, our work provides the first incremental analysis of interpretability and performance: we show how progressively increasing the model complexity (with higher-order interactions) brings better performance while compromising interpretability in practice, and the sweet spot appears to rest at pairwise interactions. We have several follow-up directions. First, given that our model provides precise guarantees on generalization, it is a feasible starting point for understanding tradeoffs between privacy and explainability from a rigorous perspective. Next, one can consider utilizing SPAM to understand failure modes and spurious correlations directly. Furthermore, SPAM-esque decompositions can also be useful in other domains beyond interpretability, e.g., language and vision, where modeling higher-order interactions is challenging due to the curse of dimensionality.

References

R. Agarwal, L. Melnick, N. Frosst, X. Zhang, B. Lengerich, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. Advances in Neural Information Processing Systems, 34, 2021.

S. O. Arık and T. Pfister. TabNet: Attentive interpretable tabular learning. In AAAI, volume 35, pages 6679–6687, 2021.

A. Asuncion and D. Newman. UCI machine learning repository, 2007.

S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

B. Bercu, B. Delyon, and E. Rio. Concentration inequalities for sums and martingales. Springer.

J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999.

J. Brachat, P. Comon, B. Mourrain, and E. Tsigaridas. Symmetric tensor decomposition. Linear Algebra and its Applications, 433(11-12):1851–1872, 2010.

M. Buhrmester, T. Kwang, and S. D. Gosling. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality data? 2016.

D. V. Carvalho, E. M. Pereira, and J. S. Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8):832, 2019.

S. Chakraborty, R. Tomsett, R. Raghavendra, D. Harborne, M. Alzantot, F. Cerutti, M. Srivastava, A. Preece, S. Julier, R. M. Rao, et al. Interpretability of deep learning models: A survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6. IEEE, 2017.

C.-H. Chang, R. Caruana, and A. Goldenberg. NODE-GAM: Neural generalized additive model for interpretable deep learning. arXiv preprint arXiv:2106.01613, 2021.

C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su. This looks like that: Deep learning for interpretable image recognition. Advances in Neural Information Processing Systems, 32, 2019.

G. Chrysos, S. Moschoglou, Y. Panagakis, and S. Zafeiriou. PolyGAN: High-order polynomial generators. arXiv preprint arXiv:1908.06571, 2019.

G. G. Chrysos, S. Moschoglou, G. Bouritsas, Y. Panagakis, J. Deng, and S. Zafeiriou. P-nets: Deep polynomial neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7325–7335, 2020.

C. K. Chui and X. Li. Realization of neural networks with one hidden layer. In Multivariate Approximation: From CAGD to Wavelets, pages 77–89. World Scientific, 1993.

M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. Communications of the ACM, 63(1):68–77, 2019.

FICO. FICO community explainable machine learning challenge, 2018. URL https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=2.

J. E. Freund and G. A. Simon. Modern elementary statistics, volume 256. Prentice-Hall, Englewood Cliffs, NJ, 1967.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.

A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019a.

A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim. Towards automatic concept-based explanations. Advances in Neural Information Processing Systems, 32, 2019b.

T. J. Hastie and R. J. Tibshirani. Generalized additive models. Routledge, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608, 2018.

A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, (4):364–378, 1971.

P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.

H. Lakkaraju and O. Bastani. "How do I fool you?": Manipulating user trust via misleading black box explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 79–85, 2020.

K. Lang. NewsWeeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier, 1995.

T. Laugel, M.-J. Lesot, C. Marsala, X. Renard, and M. Detyniecki. The dangers of post-hoc interpretability: Unjustified counterfactual explanations. arXiv preprint arXiv:1907.09294, 2019.

Z. C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–631, 2013.

S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.

A. Madsen, S. Reddy, and S. Chandar. Post-hoc interpretability for neural NLP: A survey. arXiv preprint arXiv:2108.04840, 2021.

P. Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 9, pages 245–303, 2000.

A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476, 2017.

J. Muramatsu and W. Pratt. Transparent queries: Investigating users' mental models of search engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 217–224, 2001.

J. Nie. Generating polynomials and symmetric tensor decompositions. Foundations of Computational Mathematics, 17(2):423–465, 2017a.

J. Nie. Low rank symmetric tensor approximations. SIAM Journal on Matrix Analysis and Applications, 38(4):1517–1540, 2017b.

A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. Advances in Neural Information Processing Systems, 27, 2014.

H. Nori, S. Jenkins, P. Koch, and R. Caruana. InterpretML: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223, 2019.

S.-K. Oh, W. Pedrycz, and B.-J. Park. Polynomial neural networks architecture: Analysis and design. Computers & Electrical Engineering, 29(6):703–725, 2003.

R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.

B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).

M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

Y. Shin and J. Ghosh. The pi-sigma network: An efficient higher-order neural network for pattern classification and function approximation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume i, pages 13–18 vol. 1, 1991. doi: 10.1109/IJCNN.1991.155142.

A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.

A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145–3153. PMLR, 2017.

D. Slack, A. Hilgard, S. Singh, and H. Lakkaraju. Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in Neural Information Processing Systems, 34, 2021.

M. H. Stone. The generalized Weierstrass approximation theorem. Mathematics Magazine, 21(5):237–254, 1948.

M. Tsang, H. Liu, S. Purushotham, P. Murali, and Y. Liu. Neural interaction transparency (NIT): Disentangling learned interactions for improved interpretability. Advances in Neural Information Processing Systems, 31, 2018.

G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.

G. Van Horn, E. Cole, S. Beery, K. Wilber, S. Belongie, and O. Mac Aodha. Benchmarking representation learning for natural world image collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12884–12893, 2021.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

K. Weierstrass. On the analytical representability of so-called arbitrary functions of a real variable. Meeting reports of the Royal Prussian Academy of Sciences in Berlin, 2:633–639, 1885.

Z. Yang, C. Jin, Z. Wang, M. Wang, and M. I. Jordan. On function approximation in reinforcement learning: Optimism in the face of large state spaces. arXiv preprint arXiv:2011.04622, 2020.

R. Zhang, P. Madumal, T. Miller, K. A. Ehinger, and B. I. Rubinstein. Invertible concept-based explanations for CNN models with non-negative concept activation vectors. arXiv preprint arXiv:2006.15417, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See conclusion.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Appendix Section A.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix Section A.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see Section 4 and supplementary material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Please see Sections 5 and 4.
   (b) Did you mention the license of the assets? [Yes] See accompanying supplementary material.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] No new assets except code, see 3(a).
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] See supplementary material.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] No participant risks were identified in the experiment design.
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] See Section 5.