# Fairness Reprogramming

Guanhua Zhang (UC Santa Barbara, guanhua@ucsb.edu), Yihua Zhang (Michigan State University, zhan1908@msu.edu), Yang Zhang (MIT-IBM Watson AI Lab, yang.zhang2@ibm.com), Wenqi Fan (The Hong Kong Polytechnic University, wenqifan@polyu.edu.hk), Qing Li (The Hong Kong Polytechnic University, csqli@comp.polyu.edu.hk), Sijia Liu (Michigan State University & MIT-IBM Watson AI Lab, liusiji5@msu.edu), Shiyu Chang (UC Santa Barbara, chang87@ucsb.edu)

## Abstract

Despite a surge of recent advances in promoting machine learning (ML) fairness, the existing mainstream approaches mostly require training or finetuning the entire weights of the neural network to meet the fairness criteria. However, this is often infeasible in practice for large-scale trained models due to large computational and storage costs, low data efficiency, and model privacy issues. In this paper, we propose a new generic fairness learning paradigm, called FAIRREPROGRAM, which incorporates the model reprogramming technique. Specifically, FAIRREPROGRAM considers the case where models cannot be changed and appends to the input a set of perturbations, called the fairness trigger, which is tuned towards the fairness criteria under a min-max formulation. We further introduce an information-theoretic framework that explains why and under what conditions fairness goals can be achieved using the fairness trigger. We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output prediction of fixed ML models by providing false demographic information that hinders the model from utilizing the correct demographic information to make the prediction. Extensive experiments on both NLP and CV datasets demonstrate that our method can achieve better fairness improvements than retraining-based methods with far less data dependency under two widely-used fairness criteria. Code is available at https://github.com/UCSB-NLP-Chang/Fairness-Reprogramming.git.

## 1 Introduction

Fairness in machine learning (ML) has become a critical concern. Due to biases in data collection, the output prediction is often spuriously correlated with some demographic attributes, which are thus undesirably incorporated into the decision-making process of machine learning models. For example, it has been found that some abusive language detection systems tend to classify texts that contain mere mentions of certain minority groups, e.g., homosexual groups, as abusive content, even though the texts themselves are not abusive at all [1, 2]. Despite the recent advances in fairness-promoting learning methods [3-7], the existing mainstream approaches mostly require retraining or finetuning the entire model parameters towards an extra fairness objective. However, this is often infeasible in practice, particularly for well-trained large-scale models, due to the huge computation and storage costs. In addition, for machine learning models that are deployed as a service, model retraining is hindered by limited access to the model parameters. Recently, model reprogramming has emerged as an alternative technique to model finetuning. In particular, model reprogramming keeps the pre-trained model fixed and instead modifies its input to re-purpose the model towards different objectives.
For example, it has been shown that a well-crafted input perturbation can reprogram an ImageNet classifier to solve the task of counting squares in an image [8, 9]. It has also been shown that, by learning task-specific embedding prompts concatenated to the inputs, pre-trained language models can achieve better performance than full-parameter tuning in natural language understanding tasks [10-12]. Compared with finetuning methods, model reprogramming enjoys lower cost, better scalability, and requires less access to the model parameters. Hence come our research questions: can model reprogramming techniques be applied to fairness objectives? If so, why and how would it work?

In this paper, we revisit model reprogramming and propose a novel generic fairness learning paradigm, called FAIRREPROGRAM. In particular, FAIRREPROGRAM perturbs the input by appending to it a global constant vector/feature, called the fairness trigger, which is optimized towards the fairness objective under a min-max framework. FAIRREPROGRAM is a generic framework that works for various tasks and domains. We further introduce an information-theoretic framework that explains why and under what conditions fairness goals can be achieved using a constant fairness trigger. We show theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output prediction of fixed ML models by providing false demographic information that hinders the model from utilizing the correct demographic information to make predictions. We perform extensive experiments across various NLP and CV datasets with in-the-wild biases. The results show that FAIRREPROGRAM can consistently achieve better fairness improvements than retraining-based methods under the two widely-used fairness notions, but with far less trade-off in accuracy. For example, with comparable accuracy, our method can outperform the retraining-based baseline with 10.5% and 36.5% lower bias scores over the two fairness criteria on the CelebA dataset with the hair color prediction task and gender as the demographic information. In addition, our method demonstrates great transferability and interpretability. Our theoretical analysis and empirical findings can provide useful insights toward more practical, scalable, and flexible fairness learning paradigms.

## 2 Related Work

**Fairness in ML** Fairness problems in ML models have received increasing attention from both industry [13] and academia [14-17]. There is a myriad of fairness definitions in the literature [18, 14, 19, 20]. Among them, group fairness notions are among the most popular [21-23]; they require ML models to perform similarly for different demographic groups. In this paper, we mainly focus on the two most widely-used group fairness definitions, demographic parity [21] and equalized odds [22], but it is worth mentioning that our method generalizes to other fairness notions. Existing fairness-promoting methods can be broadly categorized into pre-processing, in-processing, and post-processing methods [24]. Pre-processing methods calibrate the training data to remove spurious correlations and train fair models on the modified data [25-28, 2, 1, 29, 30]. In-processing methods train ML models with extra fairness-aware regularization [3-7, 31, 32]. For example, an adversarial framework has been introduced to train model parameters to meet fairness requirements [33]. In our method, we adopt a similar adversarial loss but optimize the fairness triggers with a fixed model.
Despite their effectiveness, these methods usually consider training fair models from scratch and do not directly apply to already-trained models. Post-processing methods focus on calibrating trained ML models to be fair [24]. Many of them modify the model outputs to meet the fairness criteria [18, 22, 34-44]. For example, the model outputs can be directly modified to meet equalized odds by solving an optimization problem [22]. Alternatively, a boosting-based method has been introduced to calibrate model outputs [40].

**Model reprogramming** Model reprogramming [45, 9, 8, 46-48] aims to repurpose an already-trained neural network for different tasks. Different from typical transfer learning, which requires modifying the structure and parameters of the given pre-trained model, reprogramming instead designs a trainable program appended to the input, while keeping the pre-trained model intact. Model reprogramming can be designed in the form of an input-agnostic perturbation [45, 8] or a trainable input transformation function together with a label mapping from the source domain to the target domain [9]. In particular, the feasibility of designing a universal input perturbation to reprogram a well-trained ImageNet classifier to the CIFAR-10 dataset has been demonstrated in the white-box setting [8]. As an exploration of reprogramming in the discrete scenario, another work [46] successfully reprograms a text classification neural network for alternate classification tasks. This work also shows the possibility of developing reprogramming in the black-box setting, where the reprogrammer may not have access to the parameters of the target model. Recent work [47] shows the possibility of repurposing deep neural networks designed for image classification for natural language processing and other sequence classification tasks. It has been argued that the success of reprogramming lies in the size of the average input gradient, and that the input dimension is crucial to the performance of the reprogrammer [48]. It has also been shown that generative models like FairGANs [49] can be transferred to other tasks by reprogramming with variational auto-encoders [50]. A closely related topic to model reprogramming is prompt learning in NLP [11]. It has been shown that, by designing designated text prompts appended to inputs, pre-trained language models can be redirected to perform well on downstream tasks in a few-shot setting [51]. Prompt-based tuning methods have become mainstream and achieve better performance than fine-tuning in many scenarios [52-55]. Seminal works on prompt learning can be found in [11, 12]. However, nearly all existing methods focus on using model reprogramming to improve accuracy in domain-transfer tasks; to the best of our knowledge, our work is the first to generalize model reprogramming to improve the fairness of a trained model.

## 3 Fairness Reprogramming

In this section, we introduce the FAIRREPROGRAM algorithm. Regarding notation, upper-case letters, $\boldsymbol{X}$ and $X$, denote random vectors and random variables, respectively; lower-case letters, $\boldsymbol{x}$ and $x$, denote deterministic vectors and scalars, respectively. $p_X(\cdot)$ or $p(X)$ denotes the probability density (mass) function of the (discrete) random variable $X$.

### 3.1 Problem Formulation

Consider a classification task, where $\boldsymbol{X}$ represents the input feature and $Y$ represents the output label. In addition, there exists some sensitive attribute or demographic group, $Z$, that may be spuriously correlated with $Y$.
There is a pre-trained classifier, $f^*(\cdot)$, that predicts $Y$ from $\boldsymbol{X}$, i.e., $\hat{Y} = f^*(\boldsymbol{X})$. The weights of the classifier are considered fixed (hence the superscript $*$). Unfortunately, due to the spurious correlation between $Z$ and $Y$, the classifier may be biased against certain demographics. Our goal is to improve the fairness of the classifier by modifying the input $\boldsymbol{X}$, rather than modifying the classifier's fixed weights. In particular, we aim to achieve either of the following fairness criteria:

$$\text{Equalized Odds: } \hat{Y} \perp Z \mid Y, \quad \text{or} \quad \text{Demographic Parity: } \hat{Y} \perp Z, \tag{1}$$

where $\perp$ denotes independence. The following two subsections explain how to modify the input and how to design the optimization objective, respectively.

### 3.2 Modifying the Input Features

Input modification primarily involves appending a fairness trigger to the input. Formally, the input modification takes the following generic form:

$$\tilde{\boldsymbol{X}} = m(\boldsymbol{X}; \theta, \boldsymbol{\delta}) = [\boldsymbol{\delta}, g(\boldsymbol{X}; \theta)], \tag{2}$$

where $\tilde{\boldsymbol{X}}$ denotes the modified input and $[\cdot, \cdot]$ denotes vector concatenation. As can be observed, the input modification consists of two steps. First, $\boldsymbol{X}$ is fed through a transformation function $g(\cdot\,; \theta)$, where $\theta$ represents the hyper-parameters of the transformation function. The actual form of $g(\cdot\,; \theta)$ is contingent upon the application and modality, but a general requirement is that $g(\cdot\,; \theta)$ should largely retain the information necessary for classification. The second step is to append a fairness trigger, $\boldsymbol{\delta}$, to the input, which is a vector that can be optimized over. It is important to note that $\boldsymbol{\delta}$ is a constant: different inputs get appended the same trigger. Although this does not seem intuitive, we will soon show that a constant trigger is all you need to achieve fair predictions on all different inputs. Below are the specific forms of the transformation (Eq. (2)) that we use.

**Text classification** In text classification, $\boldsymbol{X}$ represents a sequence of input token embeddings. To modify the input, we simply append a fixed number of embeddings after the input text. In this case, $g(\cdot\,; \theta)$ is the identity mapping, and $\boldsymbol{\delta}$ corresponds to the appended embeddings.

Figure 1: Demonstration of the (a) border and (b) patch triggers applied to an image from CelebA [56].

**Image classification** In image classification, $\boldsymbol{X}$ represents the (vectorized) input image. Unlike text classification, where the input can have a variable length, the length of the input to an image classification network is fixed. We thus apply the following two approaches to append the trigger, as shown in Fig. 1. The first approach, called the patch approach, removes a patch from the original image and appends a trigger of the same size as the patch at the patch location (as shown in Fig. 1(b)). In this case, $g(\cdot\,; \theta)$ is a function that removes the patch dimensions and retains the rest, with $\theta$ representing the patch location; $\boldsymbol{\delta}$ represents the trigger feature that replaces the patch. The second approach, called the border approach, shrinks the image to a smaller image and then appends the trigger at the border (as shown in Fig. 1(a)). In this case, $g(\cdot\,; \theta)$ is a function that shrinks the image, and $\boldsymbol{\delta}$ represents the trigger feature at the border.
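To make the image trigger formats concrete, below is a minimal sketch, assuming PyTorch and batched image tensors, of how the border and patch modifications in Eq. (2) could be applied. The function names, the bilinear resizing choice, and the way $\theta$ (border width / patch location) is passed are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the released implementation) of the two image trigger
# formats in Eq. (2). Assumes PyTorch tensors of shape (batch, C, H, W);
# the resizing choice and argument names are illustrative assumptions.
import torch
import torch.nn.functional as F

def apply_border_trigger(x: torch.Tensor, delta: torch.Tensor, border: int) -> torch.Tensor:
    """Border trigger: g(.) shrinks the image; the trainable delta fills the frame."""
    b, c, h, w = x.shape
    inner = F.interpolate(x, size=(h - 2 * border, w - 2 * border),
                          mode="bilinear", align_corners=False)   # g(X; theta)
    out = delta.expand(b, c, h, w).clone()                        # same trigger for every input
    out[:, :, border:h - border, border:w - border] = inner
    return out

def apply_patch_trigger(x: torch.Tensor, delta: torch.Tensor, top: int, left: int) -> torch.Tensor:
    """Patch trigger: g(.) keeps the image except a patch region, which delta replaces."""
    out = x.clone()
    ph, pw = delta.shape[-2:]
    out[:, :, top:top + ph, left:left + pw] = delta               # theta = patch location
    return out

# Usage sketch: a trainable border trigger for 224x224 CelebA-style inputs.
delta_border = torch.zeros(3, 224, 224, requires_grad=True)
images = torch.rand(8, 3, 224, 224)
modified = apply_border_trigger(images, delta_border, border=20)
```

Because the trigger tensor is shared across the batch, gradients from every example accumulate on the same `delta`, which is what makes a single global trigger trainable.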
### 3.3 Optimization Objective

Our optimization objective is as follows:

$$\min_{\boldsymbol{\delta}, \theta} \; \mathcal{L}_{\text{util}}(\mathcal{D}_{\text{tune}}, f^* \circ m) + \lambda \mathcal{L}_{\text{fair}}(\mathcal{D}_{\text{tune}}, f^* \circ m), \tag{3}$$

where $m = m(\cdot\,; \theta, \boldsymbol{\delta})$ represents the input modification function as in Eq. (2); $\circ$ denotes function composition (nested functions); and $\mathcal{D}_{\text{tune}}$ represents the dataset used to train the fairness trigger. Note that this is different from the dataset on which the classifier, $f^*$, is pre-trained.

The first loss term, $\mathcal{L}_{\text{util}}$, is the utility loss function of the task. For classification tasks, $\mathcal{L}_{\text{util}}$ is usually the cross-entropy loss, i.e.,

$$\mathcal{L}_{\text{util}}(\mathcal{D}_{\text{tune}}, f^* \circ m) = \mathbb{E}_{\boldsymbol{X}, Y \sim \mathcal{D}_{\text{tune}}}\big[\text{CE}\big(Y, f^*(m(\boldsymbol{X}))\big)\big], \tag{4}$$

where $\text{CE}(\cdot, \cdot)$ denotes the cross-entropy loss. The second loss term, $\mathcal{L}_{\text{fair}}$, encourages the prediction to follow the fairness criteria in Eq. (1). According to Eq. (1), $\mathcal{L}_{\text{fair}}$ should measure how much information about $Z$ is contained in $\hat{Y}$. To measure this, we introduce another network, called the discriminator, $d(\cdot\,; \phi)$, where $\phi$ represents its parameters. If the equalized odds criterion is applied, then $d(\cdot\,; \phi)$ should predict $Z$ from $\hat{Y}$ and $Y$; if the demographic parity criterion is applied, then the input to $d(\cdot\,; \phi)$ would just be $\hat{Y}$. In the following, we focus on the equalized odds criterion for conciseness. The information about $Z$ can then be measured by maximizing the negative cross-entropy loss for the prediction of $Z$ over the discriminator parameters, i.e.,

$$\mathcal{L}_{\text{fair}}(\mathcal{D}_{\text{tune}}, f^* \circ m) = \max_{\phi} \; \mathbb{E}_{\boldsymbol{X}, Y, Z \sim \mathcal{D}_{\text{tune}}}\big[-\text{CE}\big(Z, d(f^*(m(\boldsymbol{X})), Y; \phi)\big)\big]. \tag{5}$$

By plugging Eqs. (4) and (5) into Eq. (3), we can see that the entire optimization objective becomes a min-max framework, where the discriminator tries to improve its prediction of $Z$ while the fairness trigger tries to make that prediction worse. As shown in [33], when the discriminator cannot predict $Z$ better than chance, the aforementioned fairness criteria can be achieved.
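In practice, the min-max objective can be optimized with two alternating gradient steps per batch: one that updates the discriminator to predict $Z$, and one that updates the trigger to preserve utility while fooling the discriminator. The sketch below illustrates the equalized-odds variant, assuming PyTorch, a frozen classifier `f_star`, a small MLP discriminator `d`, a trigger-application function `apply_trigger`, and a data loader yielding `(x, y, z)`; it is a simplified stand-in for, not a copy of, the released training code.

```python
# Simplified sketch of the min-max training in Eqs. (3)-(5), equalized-odds variant.
# Assumptions: PyTorch; `f_star` is the frozen classifier; `d` is a small MLP
# discriminator; `apply_trigger(x, delta)` implements m(X) from Eq. (2).
import torch
import torch.nn.functional as F

def train_fairness_trigger(f_star, d, apply_trigger, delta, loader,
                           lam=1.0, epochs=10, lr_delta=0.1, lr_d=0.01):
    f_star.eval()
    for p in f_star.parameters():          # classifier weights stay fixed
        p.requires_grad_(False)
    delta.requires_grad_(True)
    opt_delta = torch.optim.Adam([delta], lr=lr_delta)
    opt_d = torch.optim.Adam(d.parameters(), lr=lr_d)

    for _ in range(epochs):
        for x, y, z in loader:
            logits = f_star(apply_trigger(x, delta))           # f*(m(X))
            y_hat = logits.softmax(dim=-1)
            y_onehot = F.one_hot(y, logits.size(-1)).float()   # condition on Y for EO

            # Step 1: discriminator tries to predict Z from (Y_hat, Y).
            opt_d.zero_grad()
            loss_d = F.cross_entropy(d(torch.cat([y_hat.detach(), y_onehot], dim=-1)), z)
            loss_d.backward()
            opt_d.step()

            # Step 2: trigger minimizes utility loss plus lambda times the negative
            # discriminator cross-entropy (i.e., it tries to fool the discriminator).
            opt_delta.zero_grad()
            loss_util = F.cross_entropy(logits, y)
            loss_fair = -F.cross_entropy(d(torch.cat([y_hat, y_onehot], dim=-1)), z)
            (loss_util + lam * loss_fair).backward()
            opt_delta.step()
    return delta
```

For the demographic-parity variant, the one-hot label would simply be dropped from the discriminator input, matching the description above.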
### 3.4 Why Does It Work?

It is not immediately straightforward why a global trigger can obscure the demographic information for any input. In this section, we propose an information-theoretic framework that illustrates one of the mechanisms through which the trigger can remove the demographic information. Our theoretical framework builds upon the data generation process shown in Fig. 2(a). Specifically, we assume that $\boldsymbol{X}$ consists of a set of features, i.e., $\boldsymbol{X} = [X_1, \ldots, X_T]$, where $T$ is the total number of features. In text classification, a feature can be a word or a word piece; in image classification, a feature can be a specific shape, color, pattern, etc. Assume that these features can be divided into two groups. The first group, denoted $\boldsymbol{X}^{(y)}$, consists of features that are directly governed by the output label $Y$; the second group, denoted $\boldsymbol{X}^{(z)}$, consists of features that are directly governed by the demographic information $Z$. $Z$ and $Y$ can be spuriously correlated, i.e., there can be common confounders, $C$, between $Z$ and $Y$. As a result, both $\boldsymbol{X}^{(y)}$ and $\boldsymbol{X}^{(z)}$ are indicative of $Y$.

Figure 2: Illustration of why the fairness trigger works. (a) The data generation process. (b) The information flow from data to the classifier through the sufficient statistics. (c) A fairness trigger strongly indicative of a demographic group can confuse the classifier with a false demographic posterior, thus preventing the classifier from using the correct demographic information.

To further simplify our theoretical analysis, we consider a bag-of-features scenario, where each feature in $\boldsymbol{X}^{(y)}$ is drawn from the vocabulary set $\mathcal{X}^{(y)}$, and each feature in $\boldsymbol{X}^{(z)}$ is drawn from the vocabulary set $\mathcal{X}^{(z)}$. There should not be any overlap between the two vocabulary sets, i.e., $\mathcal{X}^{(y)} \cap \mathcal{X}^{(z)} = \emptyset$; otherwise, it would violate our assumption that demographic-related features are biased features. It can be shown (in Appendix C) that the posterior distributions, $p_Y(\cdot \mid \boldsymbol{X}^{(y)})$ and $p_Z(\cdot \mid \boldsymbol{X}^{(z)})$, are the sufficient statistics of $\boldsymbol{X}^{(y)}$ and $\boldsymbol{X}^{(z)}$, respectively, for inferring $Y$.

In other words, these two posterior distributions summarize all the information about $\boldsymbol{X}^{(y)}$ and $\boldsymbol{X}^{(z)}$ that the classifier needs to know to predict $Y$. Therefore, we assume that the classifier takes the following generic form:

$$\hat{Y} = f^*(\boldsymbol{X}) = h\big(p^{\text{tr}}_Y(\cdot \mid \boldsymbol{X}^{(y)}),\; p^{\text{tr}}_Z(\cdot \mid \boldsymbol{X}^{(z)})\big). \tag{6}$$

Note that we add the superscript tr to emphasize that the probability distributions are over the dataset on which the classifier is trained, because the classifier has never been trained on inputs modified with the fairness trigger. Eq. (6) encompasses many common decision functions. For example, it can be shown (in Appendix C) that the posterior distribution $p(Y \mid \boldsymbol{X})$, which is the minimizer of the cross-entropy loss, is a special case of Eq. (6). As illustrated in Fig. 2(b), $p_Y(\cdot \mid \boldsymbol{X}^{(y)})$ and $p_Z(\cdot \mid \boldsymbol{X}^{(z)})$ provide two sets of information from the input features. $p_Y(\cdot \mid \boldsymbol{X}^{(y)})$ provides the unbiased information, because a desirable fair classifier should rely only upon $p_Y(\cdot \mid \boldsymbol{X}^{(y)})$ to make a decision. On the other hand, $p_Z(\cdot \mid \boldsymbol{X}^{(z)})$ provides the biased information, because it conveys the demographic information. In other words, the fairness goals can be achieved by cutting off the biased information path. Therefore, our research question boils down to: is it possible to cut off the biased information path with a global fairness trigger $\boldsymbol{\delta}$?

Without loss of generality, assume that $\boldsymbol{\delta}$ consists of only one feature. Consider the case where $\boldsymbol{\delta}$ is a demographic feature, i.e., $\boldsymbol{\delta} \in \mathcal{X}^{(z)}$. In this case, we assume the transformed input as defined in Eq. (2) can also be divided into two groups, $\tilde{\boldsymbol{X}} = [\tilde{\boldsymbol{X}}^{(y)}, \tilde{\boldsymbol{X}}^{(z)}]$, where

$$\tilde{\boldsymbol{X}}^{(y)} = g(\boldsymbol{X}^{(y)}), \quad \tilde{\boldsymbol{X}}^{(z)} = [\boldsymbol{\delta}, g(\boldsymbol{X}^{(z)})]. \tag{7}$$

The following theorem states our main conclusion:

**Theorem 1.** Under the assumptions in Eqs. (6) and (7), and some additional regularity conditions (formally stated in the appendix), if the fairness trigger $\boldsymbol{\delta}$ is indicative of a certain demographic group $z$, then

$$\lim_{p^{\text{tr}}(Z = z \mid \tilde{X}^{(z)}_0 = \boldsymbol{\delta}) \to 1} \text{MI}(\hat{\tilde{Y}}, Z \mid Y) = 0, \tag{8}$$

where MI denotes mutual information, and $\hat{\tilde{Y}} = f^*(\tilde{\boldsymbol{X}})$ is the classifier's prediction after the input is modified. $p^{\text{tr}}(Z = z \mid \tilde{X}^{(z)}_0 = \boldsymbol{\delta}) \to 1$ means that the fairness trigger is very strongly indicative of the demographic group $z$. Therefore, Thm. 1 essentially states that if the prepended trigger feature is very strongly indicative of a certain demographic group, then equalized odds can be achieved. A formal proof is presented in Appendix C. Here we would like to give an intuitive explanation. When $p^{\text{tr}}(Z = z \mid \tilde{X}^{(z)}_0 = \boldsymbol{\delta}) \to 1$, it also holds that $p^{\text{tr}}(Z = z \mid \tilde{\boldsymbol{X}}^{(z)} = \tilde{\boldsymbol{x}}^{(z)}) \to 1$. In other words, the fairness trigger $\boldsymbol{\delta}$ would overshadow the rest of the demographic features and trick the classifier into believing that all the different inputs belong to the same demographic group $z$. As a result, the second argument in Eq. (6) would reduce to a constant (1 for demographic group $z$ and 0 elsewhere), effectively blocking the biased information path, as shown in Fig. 2(c). Note that the premise for the fairness trigger to work is that the classifier has never seen the modified input. Otherwise, the classifier would be able to learn to ignore the constant trigger and still elicit the true demographic information from the input.
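A toy, numerical way to see the mechanism behind Theorem 1 (this is an illustration with made-up numbers, not the formal proof in Appendix C): in a naive bag-of-features model, appending one feature whose likelihood overwhelmingly favors a single demographic group drives the posterior $p(Z \mid \tilde{\boldsymbol{X}}^{(z)})$ toward a constant, no matter what the other demographic features say.

```python
# Toy illustration (made-up numbers, not the formal proof in Appendix C) of the
# intuition behind Theorem 1: a trigger feature that is overwhelmingly indicative
# of one demographic group collapses p(Z | X^(z)) to a near-constant.
import numpy as np

# Likelihoods p(feature | Z) over a 3-word demographic vocabulary, for Z in {0, 1}.
p_feat_given_z = np.array([
    [0.70, 0.25, 0.05],   # Z = 0
    [0.10, 0.30, 0.60],   # Z = 1
])
prior_z = np.array([0.5, 0.5])

def posterior_z(feature_ids):
    """Bayes posterior p(Z | X^(z)) under a naive bag-of-features model."""
    log_post = np.log(prior_z) + np.log(p_feat_given_z[:, feature_ids]).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

print(posterior_z([0, 0]))       # demographic features pointing to Z = 0
print(posterior_z([2, 2]))       # demographic features pointing to Z = 1

# Now make feature 2 act like the fairness trigger: p(trigger | Z = 1) ~ 1.
p_feat_given_z[:, 2] = [1e-6, 0.999]
print(posterior_z([2, 0, 0]))    # trigger prepended to a Z=0-leaning input:
                                 # the posterior collapses toward Z = 1 anyway
```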
## 4 Experiments

In this section, we evaluate the effectiveness of FAIRREPROGRAM on both NLP and CV applications in terms of accuracy, fairness, performance in the low-data regime, transferability, and interpretability.

### 4.1 Experiment Setup

**Datasets** We consider the following two commonly used NLP and CV datasets:

- Civil Comments [57, 58]: The dataset contains 448k texts with labels that depict the toxicity of each input. The demographic information of each text is provided.
- CelebA [56]: The dataset contains over 200k human face images, each with 39 binary attribute annotations. We follow the conventional setting [56] that adopts the hair color prediction task in our experiments and uses the gender annotation as the demographic information [59-61].

For both datasets, we split the entire data into a training set, a tuning set, a validation set, and a testing set. The training set is used for base model training, i.e., to obtain a biased model for reprogramming. The tuning set and validation set are used for trigger training and hyper-parameter selection. We report our results on the testing set. It is worth mentioning that there is no overlapping data between the different sets, and the size of the tuning set is much smaller than that of the training set. Specifically, we set the size ratio between the tuning set and the training set to 1/5 and 1/100 for Civil Comments and CelebA, respectively. The full statistics of the datasets can be found in Appendix A.1.

**Metrics** Besides model accuracy, we introduce two empirical fairness metrics, one under each of the two fairness criteria in Eq. (1). For binary classification, the metrics are calculated as:

$$\text{DP: } \big|\,p(\hat{Y} = 1) - p(\hat{Y} = 1 \mid Z = z)\,\big|, \qquad \text{EO: } \big(\,|\text{FPR} - \text{FPR}_z| + |\text{FNR} - \text{FNR}_z|\,\big)\,/\,2,$$

where DP and EO stand for demographic parity and equalized odds, respectively. FPR and FNR are the false positive/negative rates, and the subscript $z$ denotes that the score is calculated within a specific demographic group $Z = z$. For example, $\text{FPR}_{\text{male}}$ indicates the false positive rate calculated over all examples with the male annotation. For a multi-class setting, the bias scores are first calculated similarly using one-vs-all for each class and then averaged across different classes. All reported results are the average of three different random runs. It can be shown that these metrics are non-negative and become zero when their corresponding fairness criteria are achieved. For better elaboration, we report the negative bias scores in our experiments, so the larger these negative scores are, the better the model satisfies the corresponding fairness criteria.

**Baselines and implementation details** We consider the following models for comparison:

- BASE: the base model to be reprogrammed, trained with the cross-entropy loss on the training set.
- ADVIN [33]: an in-processing adversarial training method that optimizes both model accuracy and fairness using the training set.
- ADVPOST: a post-processing variant of ADVIN, which fine-tunes the BASE model with the same fairness-aware adversarial objectives as ADVIN, but using the (low-resource) tuning set only.

For the NLP experiments, we use a pre-trained BERT [62] to obtain the BASE and ADVIN models. We use ADAMW [63] as the optimizer and set the learning rate to $10^{-5}$ for all baselines and 0.1 for FAIRREPROGRAM. For the CV experiments, we consider a RESNET-18 [64] pre-trained on ImageNet. The discriminator used in ADVIN, ADVPOST, and FAIRREPROGRAM is a three-layer MLP, and its parameters are optimized using ADAM with a learning rate of 0.01. We pick the best model based on the accuracy (for BASE) or the bias scores (for all other debiasing methods) on the validation set. We refer to Appendix A.2 for more details and Appendix B for more baseline studies.
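For concreteness, the sketch below computes the two bias scores from the Metrics paragraph for binary classification, assuming NumPy arrays of hard predictions, labels, and group annotations. Taking the worst-case group as the final score is an aggregation choice made for this sketch; the figures report the negated scores, so values closer to zero are better.

```python
# Sketch (assuming NumPy arrays of 0/1 predictions, labels, and group ids) of the
# DP and EO bias scores described above. Aggregating by the worst-case group is an
# assumption of this sketch; scores are negated when plotted.
import numpy as np

def dp_bias(y_hat: np.ndarray, z: np.ndarray) -> float:
    """|p(Y_hat = 1) - p(Y_hat = 1 | Z = z)|, worst case over groups."""
    overall = y_hat.mean()
    return max(abs(overall - y_hat[z == g].mean()) for g in np.unique(z))

def eo_bias(y_hat: np.ndarray, y: np.ndarray, z: np.ndarray) -> float:
    """(|FPR - FPR_z| + |FNR - FNR_z|) / 2, worst case over groups."""
    def rates(yh, yt):
        fpr = yh[yt == 0].mean() if (yt == 0).any() else 0.0
        fnr = (1 - yh[yt == 1]).mean() if (yt == 1).any() else 0.0
        return fpr, fnr
    fpr, fnr = rates(y_hat, y)
    gaps = []
    for g in np.unique(z):
        fpr_g, fnr_g = rates(y_hat[z == g], y[z == g])
        gaps.append((abs(fpr - fpr_g) + abs(fnr - fnr_g)) / 2)
    return max(gaps)
```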
Figure 3: Results on (a) Civil Comments and (b) CelebA. We report the negative DP (left) and the negative EO (right) scores. For each method, we vary the trade-off parameter $\lambda$ (as in Eq. (3)) to record the performance. The closer a dot is to the upper-right corner, the better the model. We consider five different $\lambda$'s for each method. The solid curve is the fitted polynomial of order 30.

Figure 4: Results on (a) Civil Comments and (b) CelebA with different tuning data ratios. We report the negative DP (left) and negative EO (right) scores. We consider a fixed BASE model trained on the training set, whose negative bias scores are presented as a black dashed line. We then train the other methods with different tuning data ratios to promote fairness of the BASE model.

Next, we introduce the implementation details of the triggers for the different variants of FAIRREPROGRAM. For the image classification task, we adopt the border and patch triggers shown in Fig. 1, termed FAIRREPROGRAM (BORDER) and FAIRREPROGRAM (PATCH), respectively. We define the trigger size as the width of the trigger frame for the border trigger and the width of the square patch for the patch trigger. Unless otherwise stated, the default trigger sizes are 20 and 80 for the two settings, respectively. For the text classification task, we introduce a probability vector $\boldsymbol{v}_i$ to control the selection of the trigger word for each position $i$. Specifically, we set the trigger $\boldsymbol{\delta}_i = \boldsymbol{E}\boldsymbol{v}_i$, where $\boldsymbol{E}$ represents the pretrained word embedding matrix of BERT. We then simply concatenate $\boldsymbol{\delta}$ after all input texts in the embedding space as the fairness trigger (the trigger is appended as a suffix after all input tokens but before [SEP] for BERT). We introduce two types of triggers. The first type, called FAIRREPROGRAM (SOFT), uses continuous $\boldsymbol{v}_i$'s, and each $\boldsymbol{v}_i$ is projected onto the continuous probability simplex using the bisection algorithm after each training step. The second type, called FAIRREPROGRAM (HARD), discretizes each $\boldsymbol{v}_i$ into a one-hot vector $\hat{\boldsymbol{v}}_i$ via an argmax operation. We adopt the straight-through technique [65] to update $\boldsymbol{v}_i$ during training. The triggers found by FAIRREPROGRAM (HARD) enjoy better interpretability, as they correspond to a sequence of word tokens. Unless specified otherwise, we set the number of trigger words to five for our experiments.
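To illustrate the soft and hard text-trigger parameterizations described above, here is a compact sketch assuming PyTorch and BERT's word embedding table `E` of shape (vocab, dim). Two simplifications are ours: a softmax stands in for the bisection-based simplex projection, and the trigger is concatenated at the very end of the sequence rather than just before [SEP].

```python
# Sketch of the soft/hard trigger parameterizations (simplified: softmax stands in
# for the bisection simplex projection; the trigger is appended at the sequence end).
import torch
import torch.nn.functional as F

def soft_trigger(v: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """FairReprogram (soft): each trigger embedding is a vocab-probability-weighted mix of rows of E."""
    probs = v.softmax(dim=-1)                     # (num_trigger_words, vocab)
    return probs @ E                              # (num_trigger_words, dim)

def hard_trigger(v: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """FairReprogram (hard): one-hot via argmax, with a straight-through gradient [65]."""
    probs = v.softmax(dim=-1)
    one_hot = F.one_hot(probs.argmax(dim=-1), probs.size(-1)).float()
    st = one_hot + probs - probs.detach()         # forward: one-hot; backward: grad of probs
    return st @ E

def append_trigger(token_embeds: torch.Tensor, trigger_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate trigger embeddings after the input token embeddings."""
    batch = token_embeds.size(0)
    trig = trigger_embeds.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([token_embeds, trig], dim=1)

# Usage sketch: 5 trigger words over a 30k-word vocabulary with 768-dim embeddings.
v = torch.zeros(5, 30000, requires_grad=True)
E = torch.randn(30000, 768)
inputs = torch.randn(8, 128, 768)                 # a batch of token embeddings
modified = append_trigger(inputs, soft_trigger(v, E))
```

The hard variant keeps a discrete, human-readable trigger in the forward pass while still letting gradients flow to `v`, which is what makes the found word sequences interpretable.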
### 4.2 Results

Fig. 3 shows the performance of the proposed FAIRREPROGRAM and the other baselines on both the NLP (subfigure (a)) and CV (subfigure (b)) datasets using the DP (left) and EO (right) metrics. In each subfigure, the data points of the same method (dots in the same color) are generated by explicitly changing the adversary weight $\lambda$ in Eq. (3), which controls the trade-off between fairness and accuracy. We further fit the data with polynomial regression to produce the curves. Appendix A.2 shows the detailed $\lambda$ choices for different methods. Here are our key observations. First, our method improves the fairness of the BASE model. In particular, our methods (both orange and red curves) achieve higher negative DP and EO scores with comparable classification accuracy. Second, our method enjoys a better fairness-accuracy trade-off than all other baselines. Specifically, the curves of our method lie closer to the upper-right corner of the plots, which implies that our method improves model fairness with fewer sacrifices in accuracy. It is also worth noting that although ADVIN achieves good fairness scores, it uses much more data for training.

**Limited data setting** We further evaluate ADVPOST and FAIRREPROGRAM as the amount of data in the tuning set decreases. Specifically, we fix a $\lambda$ for each method such that all methods achieve comparable bias scores with the full tuning set. The detailed $\lambda$ choices are provided in Appendix A.2. We then apply these methods to subsets of the tuning set with different proportions. The results are shown in Fig. 4. There are two key observations. First, our method can consistently improve fairness over the BASE model even with 1% of the tuning data, indicating the high data efficiency of FAIRREPROGRAM. Second, FAIRREPROGRAM achieves better fairness than ADVPOST as the amount of tuning data decreases. For example, in Fig. 4(a), the curve of our method is significantly above that of ADVPOST as the tuning data decreases. When the tuning set size is extremely small, ADVPOST significantly deteriorates and even underperforms the BASE model.

Figure 5: Results in the transfer setting. We report negative DP (left) and negative EO (right) scores. The triggers are first trained on a BASE model. We then evaluate the triggers on another unseen BASE model. We change the parameter $\lambda$ to trade off accuracy against fairness and draw the curves in the same way as in Fig. 3. The highlighted point corresponds to the average of all BASE models with different random seeds.

**Transferability** Next, we show the transferability of the fairness triggers found by FAIRREPROGRAM. We first tune the triggers with a BASE source model and then apply the triggers to a target model trained with a different random seed. The results are shown in Fig. 5. As can be seen, FAIRREPROGRAM achieves comparable fairness-accuracy trade-offs on both the source model and the target model, indicating that our method has good transferability. This intriguing property brings two benefits: if access to the ML model parameters is infeasible (e.g., when ML models are provided as services), users can train a surrogate model and tune the trigger based on it to promote fairness of the original model; when the ML model parameters are updated with new data (e.g., online learning), users can still use the original trigger to fix fairness problems. We further elaborate on the results of FAIRREPROGRAM for transferring to different tasks and model architectures in Appendix B.5.

Figure 6: Gradient-based saliency maps of different methods, visualized with GRAD-CAM [66]. The highlighted zones (marked in red) depict the regions exerting the greatest influence on the predicted labels (non-blond hair vs. blond hair) in each row, which also reflect the attention of the model on the input image.

**Input saliency attribution** Figs. 6 and 7 compare the saliency maps of some example inputs with and without the fairness triggers. Specifically, for the NLP application, we extract a subset of Civil Comments with religion-related demographic annotations and apply Integrated Gradients (IG) [67] to localize the word pieces that contribute most to the text toxicity classification. For the CV application, we use GRAD-CAM [66] to identify class-discriminative regions of CelebA's test images. As shown in Fig. 7, our fairness trigger consists of many religion-related words (e.g., diocesan, hebrew, parish). Meanwhile, the predicted toxicity score of the benign text starting with "muslims" is significantly reduced. These observations verify our theoretical hypothesis that the fairness trigger is strongly indicative of a certain demographic group, which prevents the classifier from using the true demographic information.
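The attribution workflow on the CV side can be reproduced with standard tooling; the sketch below uses Captum's LayerGradCam on a torchvision ResNet-18 stand-in and mirrors the kind of analysis shown in Fig. 6. The model, the inputs, and the layer choice are placeholders, not the authors' exact setup.

```python
# Sketch of the Grad-CAM attribution used for Fig. 6, with placeholder model and
# inputs (Captum + torchvision; not the authors' exact visualization code).
import torch
from torchvision.models import resnet18
from captum.attr import LayerGradCam, LayerAttribution

model = resnet18(num_classes=2).eval()        # stand-in for the trained hair-color classifier
images = torch.rand(4, 3, 224, 224)           # stand-in for (possibly trigger-modified) inputs
preds = model(images).argmax(dim=-1)

gradcam = LayerGradCam(model, model.layer4)   # attribute w.r.t. the last conv block
attr = gradcam.attribute(images, target=preds)
heatmaps = LayerAttribution.interpolate(attr, images.shape[-2:])  # upsample for overlay
```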
In addition, Fig. 6 presents the input saliency maps of two input images with respect to their predicted labels, non-blond hair and blond hair, respectively. As can be observed, when there is no fairness trigger, the saliency region incorrectly concentrates on the facial parts, indicating that the classifier is likely using biased information, such as gender, for its decision. With the fairness trigger, the saliency region moves to the hair parts, which matches the behavior of ADVIN. These results confirm that our fairness trigger can drive models to make fairer predictions.

| Text (non-toxic) | Predicted toxicity |
| --- | --- |
| muslims need to take a look in the mirror | 0.149 |
| muslims need to take a look in the mirror same diocesan bula rev proceedings | 0.069 |
| muslims need to take a look in the mirror soto cc rib hebrew armenian | 0.054 |
| muslims need to take a look in the mirror paul long course parish body | 0.073 |

Figure 7: A text example from Civil Comments with INTEGRATED GRADIENT [67, 68] highlighting important words that influence the ERM model's predictions. The text is concatenated with three triggers generated with different adversary weights. Green highlights the words that lean toward toxic predictions, and red highlights non-toxic-leaning words. The model prediction tends to be correct after adding the triggers.

Table 1: Predictions of the demographic classifier on a null input with triggers generated by different $\lambda$. The demographic predictions for CV triggers indicate the predicted scores for Male and Female; for NLP, they indicate the scores for Christian, Muslim, and Other religion.

| Trigger | Demographic prediction |
| --- | --- |
| same diocesan bula rev proceedings | 0.96, 0.11, 0.02 |
| soto cc rib hebrew armenian | 0.51, 0.08, 0.81 |
| paul long course parish body | 0.98, 0.04, 0.03 |

To further verify that the triggers encode demographic information, we trained a demographic classifier to predict the demographics from inputs (texts or images) without triggers. The obtained demographic classifiers can accurately identify the demographics contained in the inputs and achieve over 0.99 AUC on the validation datasets. Then, we use the demographic classifier to predict the demographic information of a null image/text (we use an empty string as the null text and an all-black image as the null image) with the trigger. Specifically, we select three triggers generated with different $\lambda$ values for both datasets. The results can be seen in Table 1 (one text can contain multiple religions, so the probabilities do not sum to one for the NLP triggers). We see that the demographic classifier gives confident outputs on the triggers. For example, the trigger "paul long course parish body" is classified as containing Christian with 0.98 confidence, indicating that the found triggers are highly indicative of demographics. This is consistent with our perspective in Section 3.4 that the fairness triggers encode fake demographic information to obscure ML models from making biased predictions.

### 4.3 Multi-Class Classification

To extend our evaluation to a multi-class setting, we use the CelebA dataset and select $n$ binary attributes that may be spuriously correlated with gender [59-61]. Then, following [69], we construct data groups by enumerating all $2^n$ possible binary vectors, where each dimension corresponds to a binary attribute. We index these vectors and treat them as the class labels, as sketched below.
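A small sketch of this label construction, assuming a NumPy 0/1 matrix of the selected attributes (e.g., Blond Hair, Smiling, Attractive), is given below; the particular binary-to-index convention is ours.

```python
# Sketch of the multi-class label construction: each row of n binary attributes is
# mapped to one of the 2^n class indices (the bit ordering is our own convention).
import numpy as np

def attrs_to_class(attrs: np.ndarray) -> np.ndarray:
    """Treat each row of 0/1 attributes as a base-2 number -> class index in [0, 2^n)."""
    n = attrs.shape[1]
    weights = 2 ** np.arange(n)[::-1]     # e.g., n = 3 -> [4, 2, 1]
    return attrs @ weights

# Three attributes give 8 classes; adding Wavy Hair as a fourth gives 16 classes.
example = np.array([[0, 1, 1], [1, 0, 0]])
print(attrs_to_class(example))            # -> [3 4]
```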
Figure 8: Performance of multi-class classification: (a) 8-class negative DP, (b) 8-class negative EO, (c) 16-class negative DP, (d) 16-class negative EO. For (a) and (b), we use the attributes Blond Hair, Smiling, and Attractive for the multi-class construction. We add an additional attribute, Wavy Hair, for (c) and (d).

Fig. 8 shows the accuracy-fairness trade-off curves, similar to Fig. 3. It can be observed that our method outperforms the other methods, as the red curves are closer to the top-right corner. Also, as the number of class labels increases, the post-processing-based ADVPOST falls behind its in-processing counterpart ADVIN, indicating that a larger number of classes may pose more challenges to post-processing methods.

Figure 9: Ablation study of the trigger size on (a) Civil Comments and (b) CelebA. We evaluate the bias scores with different numbers of trigger words (Civil Comments) and different trigger sizes (CelebA) with a fixed adversary weight $\lambda$.

### 4.4 Ablation Studies

We perform an ablation study to investigate the effects of the trigger size. Specifically, we run experiments with different numbers of trigger words / trigger patch sizes on the NLP / CV dataset. We set a $\lambda$ value for each method such that all methods achieve comparable bias scores with the largest trigger size. The detailed $\lambda$ choices can be found in Appendix A.2. We then train the triggers with different sizes on the tuning set using the fixed $\lambda$'s. For the text trigger, as shown in Fig. 9(a), we see that the negative bias score gets worse as the number of trigger words gets smaller. However, our method can still improve fairness over the BASE model even with only a one-word trigger. On the other hand, the results with five trigger words and above are all comparable, indicating that five words are enough to achieve the fairness goal. Similarly, for the image trigger, as shown in Fig. 9(b), the results suggest that a larger trigger consistently improves fairness. On the other hand, we show in Appendix B that a larger trigger size can hurt accuracy, which is similar to the effect of increasing $\lambda$.

### 4.5 Summary of Additional Results

We compare the proposed FAIRREPROGRAM with four additional baselines and show the full results with variance in Tab. 3. We further compare our method with MMD methods, where $\mathcal{L}_{\text{fair}}$ in Eq. (3) is replaced with a Maximum Mean Discrepancy regularization [70] to factor out the instability of adversarial training; the results are shown in Fig. 10. We also implement fairness reprogramming in the black-box setting on the CelebA dataset, where the model parameters are not available for training the reprogrammer; the results are shown in Fig. 11. Besides, we show that FAIRREPROGRAM can also be applied to tabular data, and the corresponding experimental results on the Adult dataset are shown in Fig. 13.
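As a rough sketch of the MMD alternative mentioned above (our own simplification, assuming PyTorch, binary demographic groups, and an RBF kernel with a fixed bandwidth), the fairness regularizer can penalize the discrepancy between the two groups' prediction distributions; conditioning on $Y$ would give the equalized-odds counterpart.

```python
# Sketch of an MMD-based fairness regularizer that can replace L_fair in Eq. (3)
# (our simplification: binary groups, RBF kernel with fixed bandwidth; this matches
# the demographic-parity flavor, and conditioning on Y would give the EO flavor).
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

def mmd_fair_loss(y_hat: torch.Tensor, z: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between predicted probability vectors of groups z = 0 and z = 1."""
    p0, p1 = y_hat[z == 0], y_hat[z == 1]
    return (rbf_kernel(p0, p0, sigma).mean()
            + rbf_kernel(p1, p1, sigma).mean()
            - 2 * rbf_kernel(p0, p1, sigma).mean())
```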
## 5 Conclusion

In this paper, we introduce a novel model-reprogramming-based fairness-promoting method, termed FAIRREPROGRAM. Specifically, FAIRREPROGRAM considers a fixed ML model and optimizes a set of vectors, named the fairness trigger, concatenated to inputs to boost model fairness. We introduce an information-theoretic framework to explain the rationale for why FAIRREPROGRAM can improve model fairness. As implied by our theoretical framework as well as our empirical findings, the fairness trigger can effectively mask out the true demographic information with its strong, false demographic information. Extensive experiments demonstrate that our method can achieve better fairness improvements than retraining-based methods with far less training cost. We further empirically show that fairness triggers enjoy great transferability and interpretability. We hope that FAIRREPROGRAM can inspire new fairness learning paradigms that are more feasible and flexible in practice.

## Acknowledgement

The work of Yihua Zhang, Sijia Liu, and Shiyu Chang was partially supported by National Science Foundation (NSF) Grant IIS-2207052. The computing resources used in this work were partially supported by the MIT-IBM Watson AI Lab.

## References

[1] Lucas Dixon, John Li, Jeffrey Scott Sorensen, Nithum Thain, and Lucy Vasserman, Measuring and mitigating unintended bias in text classification, Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018.
[2] J. Park, Jamin Shin, and Pascale Fung, Reducing gender bias in abusive language detection, in EMNLP, 2018.
[3] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi, Fairness constraints: Mechanisms for fair classification, in AISTATS, 2017.
[4] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach, A reductions approach to fair classification, arXiv, vol. abs/1803.02453, 2018.
[5] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma, Fairness-aware classifier with prejudice remover regularizer, in ECML/PKDD, 2012.
[6] Sina Baharlouei, Maher Nouiehed, and Meisam Razaviyayn, Rényi fair inference, arXiv: Learning, 2019.
[7] Adrián Pérez-Suay, Valero Laparra, Gonzalo Mateo-García, Jordi Muñoz-Marí, Luis Gómez-Chova, and Gustau Camps-Valls, Fair kernel learning, in ECML/PKDD, 2017.
[8] Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein, Adversarial reprogramming of neural networks, arXiv preprint arXiv:1806.11146, 2018.
[9] Yun-Yun Tsai, Pin-Yu Chen, and Tsung-Yi Ho, Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources, in International Conference on Machine Learning. PMLR, 2020, pp. 9614-9624.
[10] Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May, WARP: Word-level adversarial reprogramming, arXiv preprint arXiv:2101.00121, 2021.
[11] Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun, OpenPrompt: An open-source framework for prompt-learning, arXiv, vol. abs/2111.01998, 2022.
[12] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, arXiv, vol. abs/2107.13586, 2021.
[13] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miroslav Dudík, and H. Wallach, Improving fairness in machine learning systems: What do industry practitioners need?, Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
[14] A. Chouldechova and Aaron Roth, The frontiers of fairness in machine learning, arXiv, vol. abs/1810.08810, 2018.
[15] Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth M. Belding-Royer, Kai-Wei Chang, and William Yang Wang, Mitigating gender bias in natural language processing: Literature review, in ACL, 2019.
[16] Ninareh Mehrabi, Fred Morstatter, Nripsuta Ani Saxena, Kristina Lerman, and A. G. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR), vol. 54, pp. 1-35, 2021.
[17] Anjalie Field, Su Lin Blodgett, Zeerak Waseem, and Yulia Tsvetkov, A survey of race, racism, and anti-racism in NLP, arXiv, vol. abs/2106.11410, 2021.
[18] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel, Fairness through awareness, arXiv, vol. abs/1104.3913, 2012.
[19] Karima Makhlouf, Sami Zhioua, and Catuscia Palamidessi, Survey on causal-based machine learning fairness notions, arXiv, vol. abs/2010.09553, 2020.
[20] T. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang, Fairness without demographics in repeated loss minimization, in ICML, 2018.
[21] Toon Calders and Sicco Verwer, Three naive Bayes approaches for discrimination-free classification, Data Mining and Knowledge Discovery, vol. 21, pp. 277-292, 2010.
[22] Moritz Hardt, Eric Price, and Nathan Srebro, Equality of opportunity in supervised learning, in NIPS, 2016.
[23] Tim Räz, Group fairness: Independence revisited, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.
[24] Simon Caton and Christian Haas, Fairness in machine learning: A survey, arXiv, vol. abs/2010.04053, 2020.
[25] Faisal Kamiran and Toon Calders, Data preprocessing techniques for classification without discrimination, Knowledge and Information Systems, vol. 33, pp. 1-33, 2011.
[26] Richard S. Zemel, Ledell Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork, Learning fair representations, in ICML, 2013.
[27] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Eduardo Scheidegger, and Suresh Venkatasubramanian, Certifying and removing disparate impact, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[28] Flávio du Pin Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney, Optimized pre-processing for discrimination prevention, in NIPS, 2017.
[29] Aditya Grover, Kristy Choi, Rui Shu, and Stefano Ermon, Fair generative modeling via weak supervision, in ICML, 2020.
[30] Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Conghui Zhu, and T. Zhao, Demographics should not be the reason of toxicity: Mitigating discrimination in text classifications with instance weighting, in ACL, 2020.
[31] Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh, FR-Train: A mutual information-based approach to fair and robust training, in ICML, 2020.
[32] Wenbin Zhang and Eirini Ntoutsi, FAHT: An adaptive fairness-aware decision tree classifier, in IJCAI, 2019.
[33] B. Zhang, Blake Lemoine, and Margaret Mitchell, Mitigating unwanted biases with adversarial learning, Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018.
[34] Felix Petersen, Debarghya Mukherjee, Yuekai Sun, and Mikhail Yurochkin, Post-processing for individual fairness, in NeurIPS, 2021.
[35] Dennis Wei, Karthikeyan Natesan Ramamurthy, and Flávio du Pin Calmon, Optimized score transformation for fair classification, in AISTATS, 2020.
[36] Pranjal Awasthi, Matthäus Kleindessner, and Jamie H. Morgenstern, Equalized odds postprocessing under imperfect group information, in AISTATS, 2020.
[37] Blake E. Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro, Learning non-discriminatory predictors, arXiv, vol. abs/1702.06081, 2017.
[38] Alan Mishler and Edward H. Kennedy, Fairness in risk assessment instruments: Post-processing to achieve counterfactual equalized odds, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.
[39] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang, Men also like shopping: Reducing gender bias amplification using corpus-level constraints, in EMNLP, 2017.
[40] Michael P. Kim, Amirata Ghorbani, and James Y. Zou, Multiaccuracy: Black-box post-processing for fairness in classification, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[41] Pranay Kr. Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R. Varshney, and Ruchir Puri, Bias mitigation post-processing for individual and group fairness, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2847-2851, 2019.
[42] Pranay Kr. Lohia, Priority-based post-processing bias mitigation for individual and group fairness, arXiv, vol. abs/2102.00417, 2021.
[43] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Mark D. M. Leiserson, Decoupled classifiers for group-fair and efficient machine learning, in FAT, 2018.
[44] Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, L. Oneto, and Massimiliano Pontil, Leveraging labeled and unlabeled data for consistent fair binary classification, in NeurIPS, 2019.
[45] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola, Visual prompting: Modifying pixel space to adapt pre-trained models, arXiv preprint arXiv:2203.17274, 2022.
[46] Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, and Farinaz Koushanfar, Adversarial reprogramming of text classification neural networks, arXiv preprint arXiv:1809.01829, 2018.
[47] Paarth Neekhara, Shehzeen Hussain, Jinglong Du, Shlomo Dubnov, Farinaz Koushanfar, and Julian McAuley, Cross-modal adversarial reprogramming, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2427-2435.
[48] Yang Zheng, Xiaoyi Feng, Zhaoqiang Xia, Xiaoyue Jiang, Ambra Demontis, Maura Pintor, Battista Biggio, and Fabio Roli, Why adversarial reprogramming works, when it fails, and how to tell the difference, arXiv preprint arXiv:2108.11673, 2021.
[49] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu, FairGAN: Fairness-aware generative adversarial networks, 2018 IEEE International Conference on Big Data (Big Data), pp. 570-575, 2018.
[50] Beatrice Nobile, Gabriele Santin, Bruno Lepri, and Pierpaolo Brutti, Reprogramming FairGANs with variational auto-encoders: A new transfer learning model, arXiv, vol. abs/2203.05811, 2022.
[51] Tianyu Gao, Adam Fisch, and Danqi Chen, Making pre-trained language models better few-shot learners, arXiv, vol. abs/2012.15723, 2021.
[52] Xiang Lisa Li and Percy Liang, Prefix-tuning: Optimizing continuous prompts for generation, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. abs/2101.00190, 2021.
[53] Timo Schick and Hinrich Schütze, It's not just size that matters: Small language models are also few-shot learners, arXiv, vol. abs/2009.07118, 2021.
[54] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh, Eliciting knowledge from language models using automatically generated prompts, arXiv, vol. abs/2010.15980, 2020.
[55] Shizhe Diao, Xuechun Li, Yong Lin, Zhichao Huang, and Tong Zhang, Black-box prompt learning for pre-trained language models, arXiv, vol. abs/2201.08531, 2022.
[56] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, Deep learning face attributes in the wild, in Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[57] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard L. Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang, WILDS: A benchmark of in-the-wild distribution shifts, in ICML, 2021.
[58] Jigsaw/Conversation AI, Jigsaw unintended bias in toxicity classification, 2019.
[59] Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes, Investigating bias and fairness in facial expression recognition, in European Conference on Computer Vision. Springer, 2020, pp. 506-523.
[60] Saloni Dash and Amit Sharma, Counterfactual generation and fairness evaluation using adversarially learned inference, arXiv, 2020.
[61] Sunhee Hwang, Sungho Park, Dohyung Kim, Mirae Do, and Hyeran Byun, FairFaceGAN: Fairness-aware facial image-to-image translation, arXiv preprint arXiv:2012.00282, 2020.
[62] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv, vol. abs/1810.04805, 2019.
[63] Ilya Loshchilov and Frank Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017.
[64] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[65] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv, vol. abs/1308.3432, 2013.
[66] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618-626.
[67] Mukund Sundararajan, Ankur Taly, and Qiqi Yan, Axiomatic attribution for deep networks, arXiv, vol. abs/1703.01365, 2017.
[68] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson, Captum: A unified and generic model interpretability library for PyTorch, 2020.
[69] Ni Zhuang, Yan Yan, Si Chen, Hanzi Wang, and Chunhua Shen, Multi-label learning based deep transfer neural network for facial attribute classification, Pattern Recognition, vol. 80, pp. 225-240, 2018.
[70] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel, The variational fair autoencoder, arXiv preprint arXiv:1511.00830, 2015.
[71] Geoff Pleiss, M. Raghavan, Felix Wu, J. Kleinberg, and Kilian Q. Weinberger, On fairness and calibration, in NIPS, 2017.
[72] Faisal Kamiran, Asim Karim, and Xiangliang Zhang, Decision theory for discrimination-aware classification, 2012 IEEE 12th International Conference on Data Mining, pp. 924-929, 2012.
[73] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang, AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, Oct. 2018.
[74] Yimeng Zhang, Yuguang Yao, Jinghan Jia, Jinfeng Yi, Mingyi Hong, Shiyu Chang, and Sijia Liu, How to robustify black-box ML models? A zeroth-order optimization perspective, arXiv preprint arXiv:2203.14195, 2022.
[75] Arthur Asuncion and David Newman, UCI machine learning repository, 2007.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Appendix D.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix D.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Appendix C.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix C.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 4.1 and Appendix A.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4.1 and Appendix A.2.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix A.2.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 4.1.
   (b) Did you mention the license of the assets? [No]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]