# Optimal Classification under Performative Distribution Shift

Edwige Cyffers, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, edwige.cyffers@inria.fr
Muni Sreenivas Pydi, Université Paris Dauphine, Université PSL, CNRS, LAMSADE, 75016 Paris
Jamal Atif, Université Paris Dauphine, Université PSL, CNRS, LAMSADE, 75016 Paris
Olivier Cappé, École Normale Supérieure, Université PSL, CNRS, Inria, DI ENS, 75005 Paris

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Performative learning addresses the increasingly pervasive situations in which algorithmic decisions may induce changes in the data distribution as a consequence of their public deployment. We propose a novel view in which these performative effects are modelled as push-forward measures. This general framework encompasses existing models and enables novel performative gradient estimation methods, leading to more efficient and scalable learning strategies. For distribution shifts, unlike previous models which require full specification of the data distribution, we only assume knowledge of the shift operator that represents the performative changes. This approach can also be integrated into various change-of-variable-based models, such as VAEs or normalizing flows. Focusing on classification with a linear-in-parameters performative effect, we prove the convexity of the performative risk under a new set of assumptions. Notably, we do not limit the strength of performative effects but rather their direction, requiring only that classification becomes harder when deploying more accurate models. In this case, we also establish a connection with adversarially robust classification by reformulating the minimization of the performative risk as a min-max variational problem. Finally, we illustrate our approach on synthetic and real datasets.

1 Introduction

Machine learning models are increasingly deployed in real-world scenarios where their predictions can influence the users' behavior, thereby altering the underlying data distribution. This phenomenon, though rooted in long-standing economic theory [Morgenstern, 1928, Muth, 1961], has recently attracted interest in the machine learning community under the name of performative prediction [Perdomo et al., 2020, Hardt and Mendler-Dünner, 2023]. Consider for instance a social ranking system: if it consistently favors a particular subpopulation of individuals, user behavior might shift towards mimicking the main characteristics of this subgroup or, conversely, some features of this subpopulation can undergo modification as a consequence of the selection by the system, both effects leading to subtle alterations of the original data distribution. More generally, performative learning captures dynamics at stake in strategic classification, where individuals are confronted with algorithmic decisions that impact their lives, such as loan acceptance, college admission, or probation, and might thus try to overturn predictions by optimizing some of their features. This feedback loop, where predictions influence future data, poses new challenges and necessitates the development of novel approaches within statistical learning theory and practice [Perdomo et al., 2020, Jagadeesan et al., 2022, Drusvyatskiy and Xiao, 2023, Hardt and Mendler-Dünner, 2023, Zezulka and Genin, 2023]. Perdomo et al.
[2020] proposed to formalize performative learning as a generalized risk minimization problem, with the performative risk being defined as

$$\mathrm{PR}(\theta) = \mathbb{E}_\theta[\ell(Z; \theta)], \qquad (1)$$

where ℓ is a loss function, θ a model's parameters, and Z an observable random variable drawn from a distribution $P_\theta$ also parametrized by θ itself. In light of the difficulty of minimizing PR(θ) directly, one can define a decoupled performative risk as $\mathrm{DPR}(\theta, \theta') = \mathbb{E}_\theta[\ell(Z; \theta')]$, clarifying the interplay between the model's prediction and the distribution change. This can be seen as a Stackelberg game that stabilizes when neither the modeler (learned parameters) nor the environment (distribution) has an incentive to change its state. Solving the performative learning problem consists in minimizing this risk under the constraint that θ = θ′, because the testing samples will follow the distribution corresponding to the deployed model, and thus PR(θ) = DPR(θ, θ).

Minimizing DPR(θ, θ′) w.r.t. θ′ for a fixed θ corresponds to the classical machine learning setting. In contrast, estimating the performative effect, i.e., knowing how to optimize θ for a given θ′, is more challenging as, by definition, one can only perform statistics from samples collected for values of the parameters θ for which the model has already been deployed. Hence, performative learning does require some form of counterfactual extrapolation, i.e., what will happen to the data distribution when the parameter θ changes from its current setting? Hence, instead of focusing on methods finding performatively optimal points, $\theta_{\mathrm{PO}} \in \arg\min_\theta \mathrm{PR}(\theta)$, many previous works, following Perdomo et al. [2020], focus on finding stable points $\theta_{\mathrm{PS}} \in \arg\min_\theta \mathrm{DPR}(\theta_{\mathrm{PS}}, \theta)$, through methods that iteratively minimize the empirical risk. This line of research is appropriate in settings where the performative effect can be tamed. If it is sufficiently small, explicitly taking into account the performative changes of distribution is not required and optimal and stable points will be close enough [Perdomo et al., 2020]. However, real use cases do not always satisfy such strong assumptions (see further discussion in Section 4). In general, stable points may not be good proxies for performatively optimal points, particularly in settings where the performative effect cannot be bounded a priori.

Towards this goal, another line of research focuses on finding the optimal points $\theta_{\mathrm{PO}}$. Izzo et al. [2022] propose to use Monte Carlo sample-based approximations of the gradient of the performative risk, $\nabla_\theta \mathrm{PR}(\theta)$, based on the score function estimator (see Section 2 below). Miller et al. [2021] use a two-stage approach that deploys random models to estimate the performative effect in the first stage, and then minimizes the estimated performative risk in the second stage. A drawback of both of these approaches is the restrictive set of assumptions needed to show that the algorithms converge. While Izzo et al. [2022] assume the convexity of PR(θ) along with smoothness and boundedness conditions, Miller et al. [2021] assume that the loss function is simultaneously strongly convex and smooth. Moreover, the score function estimator of Izzo et al. [2022] necessitates full knowledge of a parametric form of $P_\theta$, which is unrealistic in practice. Alternatively, Jagadeesan et al. [2022] resort to derivative-free (or zeroth-order) optimization strategies.
However, such an approach is appropriate only when it is possible to sequentially deploy a large number of model instances, and it does not scale with the dimension of the parameter θ.

The present work is connected to the second line of research discussed above, where the focus is on finding the optimal point $\theta_{\mathrm{PO}}$. Our contributions are as follows.

(i) Modelling the performative effect as a push-forward operator. This novel approach provides a new explicit expression of the performative gradient. Not only does this approach allow estimation of the performative gradient in settings where previous methods could not, but we show that in typical use cases, the variance of this new estimator is significantly smaller.

(ii) Convexity for performative classification. We then focus on the specific task of strategic classification, as this performative learning problem encompasses various real use cases with important societal impact, such as college admission or credit decisions. Our second contribution is to provide new convexity results on the performative risk in this case. Whereas existing results only proved convexity under assumptions restricting the performative effect to be small compared to the (assumed) strong convexity of the loss function ℓ(z; θ), our results leverage structural assumptions on the performative effect that ensure that the performative risk is convex without any restriction on the strength of the performative effect.

(iii) Linking performative and robust learning. We establish a connection between performative learning and adversarially robust learning, paving the way to transferring robustness results to the performative learning field. In particular, this result gives new insights on the empirical evidence in favor of using regularization in the presence of performative effects. Finally, we illustrate our findings on synthetic and real-world datasets.

2 Push-forward Model for Performative Effects

In this section, we study the general performative learning setting without yet specializing it to the classification context. In Section 2.1, we introduce the push-forward model of performative learning and derive the expression of the gradient of the performative risk under this model. In Section 2.2, we present a reparameterization-based estimator for the gradient of the performative risk, and compare it to the score function based estimator considered by Izzo et al. [2022].

2.1 The Push-forward Model

We aim to minimize the performative risk defined in eq. (1), where the observation Z is drawn from the distribution $P_\theta$, which depends on the parameter $\theta \in \mathbb{R}^p$ of the learning model. For this to be tractable, one needs additional hypotheses on the nature of the performative effect. We propose to represent the performative effect through a push-forward measure, which matches the intuition of having an untouched distribution that is steered by the performative effect.

Assumption 1 (Push-forward Performative Model). For a given model parameter $\theta \in \mathbb{R}^p$, the distribution of the samples under the performative effect is given by $P_\theta = \varphi(\cdot; \theta)_{\#} P$, where φ(·; θ) is a differentiable invertible mapping on $\mathbb{R}^d$, depending on θ.

This assumption can be equivalently stated by the probabilistic representation $Z \overset{d}{=} \varphi(U; \theta)$, where U ∼ P (the symbol $\overset{d}{=}$ denoting equality in distribution). If P admits a density, then so does $P_\theta$, with density function given by $p_\theta(z) = |J_z\psi(z; \theta)|\, p(\psi(z; \theta))$, where ψ(·; θ) = φ⁻¹(·; θ).
In this last formula, $J_z\psi(z; \theta)$ refers to the Jacobian matrix with entries $[J_z\psi(z; \theta)]_{ij} = \partial\psi_i(z; \theta)/\partial z_j$. From an abstract point of view, such a representation of a parametrized family of distributions exists under very general conditions. However, the above model is more interesting in scenarios where φ(·; θ) is a simple operator, for instance a linear one, as is the case in most of the examples considered by Perdomo et al. [2020], Miller et al. [2021], and the dependence with respect to θ can also be made explicit. This representation is also modular in the sense that φ could be chosen as the composition $\varphi(u; \theta) = \varphi_0(\varphi_1(u; \theta))$, where $\varphi_0^{-1}(Z)$ corresponds to a fixed (not depending on θ) representation of Z in a feature space and φ₁(·; θ) models the performative effect in the representation space. Although we will not explicitly consider such cases in the rest of the paper, this representation of the performative effect is particularly attractive when using embedding tools based on kernels [Hofmann et al., 2008], neural nets (e.g., VAEs [Kingma and Welling, 2013]) or normalizing flows [Papamakarios et al., 2021, Kobyzev et al., 2021].

This structural assumption on the performative effect yields a new estimator for the performative gradient, which may be seen as an instance of the "reparametrization trick" used in VAEs, normalizing flows or by Kucukelbir et al. [2017]. Mohamed et al. [2020] also refer to this approach as "pathwise" gradient estimation.

Theorem 1 (Performative Risk Gradient). Under Assumption 1, the gradient of the performative risk is given by

$$\nabla_\theta \mathrm{PR}(\theta) = \mathbb{E}_\theta\big[\nabla_\theta \ell(Z; \theta)\big] + \mathbb{E}_\theta\big[J_\theta^T \varphi(\psi(Z; \theta); \theta)\, \nabla_z \ell(Z; \theta)\big], \qquad (2)$$

where $\nabla_z\ell(z; \theta)$ and $\nabla_\theta\ell(z; \theta)$ denote respectively the gradient with respect to the first and the second parameter of the loss, and $J_\theta^T\varphi(u; \theta)$ is the transpose of the Jacobian of φ with respect to θ.

Proof. Notice that under Assumption 1, we can rewrite the decoupled risk with a change of variable as $\mathrm{DPR}(\theta, \theta') = \mathbb{E}[\ell(\varphi(U; \theta); \theta')]$. This expression leads to

$$\nabla_\theta \mathrm{PR}(\theta) = \nabla_\theta \mathbb{E}[\ell(\varphi(U; \theta); \theta)] = \mathbb{E}\big[\nabla_\theta \ell(\varphi(U; \theta); \theta) + J_\theta^T \varphi(U; \theta)\, \nabla_z \ell(\varphi(U; \theta); \theta)\big],$$

which gives eq. (2) under a change of variable.

2.2 Estimating the Performative Gradient

From Theorem 1, it is clear that the gradient of the performative risk in eq. (2) is composed of two terms: the first term corresponds to the classical risk minimization, while the second one, which we will refer to as the performative gradient in the following, captures the performative effect. The first term can be estimated by $\frac{1}{n}\sum_{i=1}^n \nabla_\theta\ell(Z_i; \theta)$ as usual. For the second term, we propose the following estimator.

Definition 1 (Reparameterization-based Performative Gradient Estimator). The performative gradient $\nabla_\theta \mathrm{DPR}(\theta, \theta')|_{\theta'=\theta}$ admits as unbiased estimator

$$\hat G^{\mathrm{RP}}_\theta = \frac{1}{n}\sum_{i=1}^n J_\theta^T \varphi(\psi(Z_i; \theta); \theta)\, \nabla_z \ell(Z_i; \theta). \qquad (3)$$

This estimator allows performing gradient descent to minimize the performative risk and thus, if the performative objective is well behaved, to converge to the performative optimal point. $\hat G^{\mathrm{RP}}_\theta$ should be compared to the following estimator used by Izzo et al. [2022], which relies on the well-known score function formula (see [L'Ecuyer, 1991, Kleijnen and Rubinstein, 1996, Mohamed et al., 2020] and references therein):

$$\hat G^{\mathrm{SF}}_\theta = \frac{1}{n}\sum_{i=1}^n \ell(Z_i; \theta)\, \nabla_\theta \log p_\theta(Z_i).$$

While both $\hat G^{\mathrm{RP}}_\theta$ and $\hat G^{\mathrm{SF}}_\theta$ estimate the same quantity $\nabla_\theta \mathrm{DPR}(\theta, \theta')|_{\theta'=\theta}$, $\hat G^{\mathrm{RP}}_\theta$ has two distinct advantages over $\hat G^{\mathrm{SF}}_\theta$.
First, computing $\hat G^{\mathrm{SF}}_\theta$ requires access to the analytical form of $p_\theta$, which is fairly unrealistic in a learning scenario, whereas our estimator $\hat G^{\mathrm{RP}}_\theta$ only requires knowledge of φ, paving the way for a semi-parametric approach in which the performative effect is modelled explicitly, but not the distribution of the data. For general maps φ, $\hat G^{\mathrm{RP}}_\theta$ still requires the use of the inverse mapping ψ; however, this is not required in situations where the Jacobian $J_\theta\varphi(u; \theta)$ does not depend on u. Specifically, when φ is a shift operator, one obtains a very simple expression for $\hat G^{\mathrm{RP}}_\theta$, as shown in the following example.

Example 1 (Shift Operator). If the performative effect can be modelled by a shift operator, i.e., φ(U; θ) = U + Π(θ), the $\hat G^{\mathrm{RP}}_\theta$ estimator is given by

$$\hat G^{\mathrm{RP}}_\theta = J_\theta^T \Pi(\theta)\, \frac{1}{n}\sum_{i=1}^n \nabla_z \ell(Z_i; \theta),$$

where $J_\theta\Pi(\theta)$ is the Jacobian of the performative shift Π(θ).

In addition to removing the need to know $p_\theta$, a second advantage of $\hat G^{\mathrm{RP}}_\theta$ is that it can lead to a significant decrease of the variance of the estimates, as illustrated by the following example.

Example 2 (Performative Gaussian Mean Estimation). Let $\ell(z; \theta) = \|z - \theta\|^2/2$, and $Z \overset{d}{=} U + \Pi\theta$, that is, Π(θ) = Πθ is a linear shift operator. We will assume $U \sim \mathcal N(0, \sigma^2 I_d)$, so that $p_\theta(z) \propto \exp[-\|z - \Pi\theta\|^2/(2\sigma^2)]$, where Π represents the performative effect. The gradient of DPR(θ, θ′) w.r.t. the distributional parameter θ is given both by

$$\nabla_\theta \mathrm{DPR}(\theta, \theta') = \mathbb{E}_\theta[\Pi^T \nabla_z \ell(Z; \theta')] = \Pi^T \mathbb{E}_\theta[Z - \theta'] = \Pi^T \mathbb{E}[U + a] \quad \text{(reparameterization)}$$
$$= \mathbb{E}_\theta[\ell(Z; \theta')\, \nabla_\theta \log p_\theta(Z)] = \Pi^T \frac{1}{2\sigma^2}\, \mathbb{E}\big[\|U + a\|^2 U\big] \quad \text{(score function)}$$

where a = Πθ − θ′. Hence, in this case $\hat G^{\mathrm{RP}}_\theta = \Pi^T \frac{1}{n}\sum_{i=1}^n (U_i + a)$, while $\hat G^{\mathrm{SF}}_\theta = \Pi^T \frac{1}{2n\sigma^2}\sum_{i=1}^n \|U_i + a\|^2 U_i$. Both of these expressions have equal expectation $\Pi^T(\Pi\theta - \theta')$, which corresponds to the gradient of DPR(θ, θ′) w.r.t. θ. However, the reparametrization estimator $\hat G^{\mathrm{RP}}_\theta$ has covariance $\sigma^2\Pi^T\Pi/n$ while the score-based estimator $\hat G^{\mathrm{SF}}_\theta$ has covariance

$$\frac{1}{n}\,\Pi^T\left[\frac{(d^2 + 6d + 8)\sigma^2 + 2(d + 4)\|a\|^2 + \|a\|^4/\sigma^2}{4}\, I_d + aa^T\right]\Pi.$$

The details of this computation can be found in Appendix A.1. Both estimators are unbiased, but note that while $\hat G^{\mathrm{RP}}_\theta$ would always be an unbiased estimator of the performative gradient without any further assumption on the distribution of U, the unbiasedness of $\hat G^{\mathrm{SF}}_\theta$ relies on the fact that U is Gaussian. $\hat G^{\mathrm{RP}}_\theta$ has a covariance that does not depend on θ, θ′ nor on the dimension d. In contrast, the covariance of $\hat G^{\mathrm{SF}}_\theta$ includes a factor that increases with d², making it unreliable in high dimensions. It also includes additional terms that grow with the norm of Πθ − θ′, so the estimator becomes less reliable when the performative effect is strong.

One could argue that the previous result does not provide a fair comparison between both estimators, as the variance of $\hat G^{\mathrm{SF}}_\theta$ can be reduced by subtracting a baseline. Indeed, $\tilde G^{\mathrm{SF}}_\theta = \frac{1}{n}\sum_{i=1}^n (\ell(Z_i; \theta) - m)\, \nabla_\theta \log p_\theta(Z_i)$ is also an unbiased estimator of the gradient of the performative effect (for any choice of the baseline m), as the score function has, by definition, zero expectation. Tuning m properly may reduce the variance by creating a so-called control variate; see, e.g., Greensmith et al. [2004] for the use of this principle in policy gradient methods. However, similar calculations (detailed in Appendix A.1) show that the minimum covariance that can be achieved by subtracting a baseline is

$$\frac{1}{n}\,\Pi^T\big[\big((1 + d/2)\sigma^2 + \|a\|^2\big) I_d + aa^T\big]\Pi,$$

which is still larger than the covariance of $\hat G^{\mathrm{RP}}_\theta$ by a factor that grows with the model dimension d.
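The variance gap in Example 2 is easy to check by simulation. The following sketch is ours and not part of the paper's released code; the dimension, sample size, noise level and the diagonal Π are arbitrary illustrative choices.

```python
# Monte Carlo check of Example 2: compare the reparameterization estimator
#   G_RP = Pi^T (1/n) sum_i (U_i + a)
# with the score-function estimator
#   G_SF = Pi^T 1/(2 n sigma^2) sum_i ||U_i + a||^2 U_i,
# both unbiased for Pi^T a, where a = Pi theta - theta'.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 7, 100, 0.5
Pi = np.diag(rng.uniform(0.5, 2.0, size=d))      # illustrative diagonal shift matrix
theta, theta_p = rng.normal(size=d), rng.normal(size=d)
a = Pi @ theta - theta_p

def g_rp(U):
    return Pi.T @ np.mean(U + a, axis=0)

def g_sf(U):
    w = np.sum((U + a) ** 2, axis=1)             # ||U_i + a||^2 for each sample
    return Pi.T @ np.mean(w[:, None] * U, axis=0) / (2 * sigma**2)

reps = 5000
draws = sigma * rng.standard_normal((reps, n, d))
rp = np.array([g_rp(U) for U in draws])
sf = np.array([g_sf(U) for U in draws])
print("true gradient          :", np.round(Pi.T @ a, 3))
print("RP mean / max variance :", np.round(rp.mean(0), 3), rp.var(0).max())
print("SF mean / max variance :", np.round(sf.mean(0), 3), sf.var(0).max())  # much larger
```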
The fact that the reparameterization-based estimator is preferable when considering the Gaussian distribution with a quadratic loss function was observed before by Mohamed et al. [2020] in the scalar case. The above computations however show that the difference between the two approaches gets more and more significant as the dimension increases. The case of other distribution/loss function combinations still needs to be investigated.

3 Classification under Performative Shift

In this section, we specialize to the setting of binary classification, which encompasses various machine learning applications where performative effects are expected. Usually, this setting involves a desirable class and an undesirable one. For example, the desirable class might represent college admission, loan acceptance, no-spam email, or probation. In this setting, one can also expect that individuals belonging to the favored class (we designate this as class 1) do not need to alter their features, or only with small changes. On the contrary, individuals with negative predictions in class 0 have an incentive to modify their features, resulting in a significant performative effect.

We particularize the arguments introduced in Section 2 to the setting of binary classification by fixing z = (x, y), with a covariate vector $x \in \mathbb{R}^d$ and a label y ∈ {0, 1}. As is done classically (see, e.g., [Bach, 2024]), we further assume that the classifier $f_\theta(x)$ is a real-valued function that depends on a parameter $\theta \in \mathbb{R}^p$ and that a convex loss surrogate Φ is used, such that the loss function ℓ(z; θ) is equal to $\Phi\big((-1)^{1+y} f_\theta(x)\big)$, that is, $\Phi(f_\theta(x))$ when y = 1 and $\Phi(-f_\theta(x))$ when y = 0. We model the performative effect via label-dependent push-forward models, i.e., under $P_\theta$, $X|_{Y=1} \overset{d}{=} \varphi_1(U_1; \theta)$ and $X|_{Y=0} \overset{d}{=} \varphi_0(U_0; \theta)$, where φ₁ and φ₀ represent the performative changes affecting the class-conditional distributions of classes 1 and 0 respectively. If the classifier $f_\theta(x)$ is sufficiently expressive, changes such that φ₁ = φ₀ will not create performative effects. We thus focus on scenarios where the performative changes affect each class-conditional distribution differently. For concreteness, we will assume the following.

Assumption 2. $P_\theta(Y = 1) = \rho$ is fixed and not subject to performative effects.

Assumption 3. φ₁ does not depend on θ and, for simplicity, we assume it is the identity function.

Assumption 4. φ₀(u₀; θ) = u₀ + Π(θ) is a shift operator.

Assumption 2 is a consequence of the intuitive property that even if the distribution is modified, the ground-truth labels are not impacted by the performative effect. Assumption 3 allows us to focus on the performative effect on the unfavored class and simplifies the presentation, but could easily be relaxed. Finally, Assumption 4 restricts the performative change to a shift, which does simplify the problem but still corresponds to a realistic model for feature alteration. It is important to stress that, despite the fact that this performative effect is modelled as a shift, the joint distribution of Z = (X, Y) does not belong to the location-scale family discussed by Miller et al. [2021]. Under these assumptions, the decoupled performative risk takes the following form:

$$\mathrm{DPR}(\theta, \theta') = \mathbb{E}_\theta\big[\Phi\big((-1)^{1+Y} f_{\theta'}(X)\big)\big] = \rho\,\mathbb{E}\big[\Phi(f_{\theta'}(U_1))\big] + (1-\rho)\,\mathbb{E}\big[\Phi(-f_{\theta'}(U_0 + \Pi(\theta)))\big], \qquad (4)$$

where the performative effect is only manifested in the second term, which corresponds to class 0.

Remark 1 (Localization of the Performative Shift). In eq. (4), we refer to U₀, which corresponds to the covariates of the second class in the absence of performative effect, i.e., when θ = 0.
However, for a shift operator, for any value of $\bar\theta$, one may equivalently write that, under $P_\theta$, $X \overset{d}{=} \bar\varphi(U^{\bar\theta}; \theta)$, where $\bar\varphi(u; \theta) = u + \Pi(\theta) - \Pi(\bar\theta)$ and $U^{\bar\theta}$ is distributed under $P_{\bar\theta}$. Thus eq. (4) can equivalently be rewritten by taking the expectation under an arbitrary parameter value $\bar\theta$, upon defining the performative effect as $U^{\bar\theta} + \Pi(\theta) - \Pi(\bar\theta)$ and the linear model as $f_\theta(x) = x^T(\theta - \bar\theta)$.

4 Convexity of Performative Risk

Identification of the cases in which the performative risk PR(θ) is convex is an important step towards generalizing results obtained in the context of traditional (i.e., non-performative) learning theory. Existing results mainly exploit the fact that if the loss function is strongly convex, a sufficiently small performative effect cannot break this convexity. For this reason, it is often assumed [Miller et al., 2021, Hardt and Mendler-Dünner, 2023] that the change in the distributions has a bounded sensitivity with respect to the parameters, using the 1-Wasserstein distance: $W_1(P_\theta, P_{\theta'}) \le \varepsilon\,\|\theta - \theta'\|_2$. Note that in our setting, such an ε exists and corresponds to the operator norm of Π. In order to preserve convexity, it is then needed that ε ≤ µ/(2L) when the risk is L-smooth and µ-strongly convex. The pricing model considered by Izzo et al. [2022] is a very simple example showing that the performative risk can be convex while not fulfilling this criterion.

Example 3 (Pricing Model). Given a fixed set of d resources, the pricing model aims at finding the prices $\theta \in \mathbb{R}^d$ for the d resources that maximize the overall profit given the elasticity of the demand level:

$$\ell(z; \theta) = -z^T\theta \quad \text{with} \quad Z \overset{d}{=} \varphi(U; \theta) = U - \Pi\theta, \qquad (5)$$

where Π is a diagonal matrix with positive elements, encoding the elasticity of the demand level to a raise in the price of each resource, and µ = E[U] contains the baseline demand levels for each resource. In this example, the performative risk $\mathrm{PR}(\theta) = -\sum_{i=1}^d (\mu_i - \Pi_{ii}\theta_i)\,\theta_i$ is a strongly convex quadratic function minimized at $\theta_i^\star = \mu_i/(2\Pi_{ii})$. In contrast, the decoupled performative risk $\mathrm{DPR}(\theta, \theta') = -(\mu - \Pi\theta)^T\theta'$ is still convex, but not strongly convex in θ′, and is always minimized in θ′ at infinity. Despite its simplicity, this example is thus not covered by existing theorems, and the retraining procedures considered by Perdomo et al. [2020] fail by diverging.

Moreover, this example highlights that requiring a small sensitivity for the performative effect does not match the true convexity conditions. The performative risk is indeed strongly convex as long as the $\Pi_{ii}$ are positive, irrespective of their magnitude. In contrast, in this example, the performative risk would become non-convex if one of the $\Pi_{ii}$ were negative, even with a small magnitude. This motivates the search for related phenomena in the classification context, by looking at conditions ensuring the convexity of eq. (4). Indeed, we show that in the classification setting, one can also observe convexity without restriction on the magnitude of the performative effect. For this, we consider the case of linearly parameterized models, which we denote for simplicity by $f_\theta(x) = x^T\theta$. Note that, as discussed in Section 2, our model of the performative effect is composable and hence we could also consider the more general linearly parameterized model in which $f_\theta(x) = \psi(x)^T\theta$, with a linear-in-the-parameter performative effect in the feature space such that ψ(X) = U + Π(θ). For ease of notation, we stick to the case where ψ is the identity function in the following.
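Under these linear models, the performative risk of eq. (4) is straightforward to evaluate by plain Monte Carlo, which is how risk profiles such as the one in Figure 1 below can be reproduced. The sketch below is an illustration under assumed Gaussian class-conditional distributions and arbitrary numerical values; these choices are not part of the general model.

```python
# Monte Carlo evaluation of the performative risk of eq. (4) for a linear
# classifier f_theta(x) = x^T theta and a linear shift Pi(theta) = Pi @ theta.
# Gaussian class-conditional distributions and all constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
rho, n, sigma = 0.5, 200_000, 0.5
mu1, mu0 = np.array([-1.0, 1.0]), np.array([0.0, 0.0])
lam = 0.5                                   # diagonal coefficient of Pi
Pi = lam * np.eye(2)

U1 = mu1 + sigma * rng.standard_normal((n, 2))   # class 1, unaffected (Assumption 3)
U0 = mu0 + sigma * rng.standard_normal((n, 2))   # class 0 before the shift

def surrogate(v):                           # logistic loss: convex and non-increasing
    return np.logaddexp(0.0, -v)

def perf_risk(theta):
    X0 = U0 + Pi @ theta                    # class 0 under the deployed model (Assumption 4)
    return rho * surrogate(U1 @ theta).mean() + (1 - rho) * surrogate(-(X0 @ theta)).mean()

# One-dimensional profile along a fixed direction; convex whenever lam >= 0.
direction = np.array([-1.0, 1.0]) / np.sqrt(2.0)
for t in np.linspace(-2.0, 2.0, 9):
    print(f"t = {t:+.2f}   PR = {perf_risk(t * direction):.4f}")
```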
Using standard arguments, the choice of a convex loss surrogate Φ then entails that DPR(θ, θ′) is a convex function of θ′. For the same reason, if Π(θ) = Πθ is a linear-in-parameters shift operator, DPR(θ, θ′) is also convex in θ. Note however that, unless there is no performative effect (i.e., if Π = 0), eq. (4) is not jointly convex in (θ, θ′). The following result shows that it is nonetheless the case that the performative risk PR(θ) = DPR(θ, θ) is convex under the condition that Π is a positive semidefinite matrix (see proof in Appendix A.2).

Theorem 2 (Convexity of Classification Performative Risk). Under Assumptions 2 to 4, for a linearly parameterized classifier $f_\theta(x) = x^T\theta$ and a linear shift operator Π(θ) = Πθ, the performative risk PR(θ) is convex when Π is a positive semidefinite matrix and one of the following conditions holds: (a) Φ is the quadratic loss function; (b) Φ is a convex non-increasing function (such as the hinge, logistic or exponential loss).

This theorem allows us to extend the known convexity results to losses that are not strongly convex, and to performative effects with arbitrary magnitude.

Figure 1: Profile of the performative risk for classifying two Gaussians centered at µ₀ = (0, 0) and µ₁ = (−1, 1), with quadratic loss and various values of λ for the diagonal coefficients of Π. The performative risk remains convex as long as Π is positive semidefinite, i.e., λ ≥ 0, and becomes non-convex whenever some of the λᵢ are negative.

Remark 2 (Generalization to a Performative Effect Affecting Both Classes). One could remove Assumption 3 to allow class 1 to change under the performative effect. The convexity of PR(θ) remains if φ₁(u; θ) = u − Π₁θ, where Π₁ is a positive semidefinite matrix. In this case as well, the performative effect makes the classification task harder, and the performative risk remains convex.

5 Connection with Robustness and Regularization

In order to enforce strong convexity of the loss function ℓ(·; θ), previous works on performative prediction have considered the use of an additional regularization term; see, e.g., Section 5.2 of [Perdomo et al., 2020], where logistic regression is used with a ridge regularizer. When doing so, it has been observed empirically that the retraining method performs quite well. To build on this observation, we show below that for linear-in-the-parameter performative effects that tend to make the classification task harder, the performative optimum may indeed be interpreted as a regularized version of the base classification problem. This regularization does not take the form of an additive penalty but can be interpreted as the solution of a specific adversarially robust classification objective. In this section, we use the slightly stronger assumption that Π is a symmetric positive definite matrix, in order to ensure that both $\|v\|_\Pi = (v^T\Pi v)^{1/2}$ and $\|v\|_{\Pi^{-1}} = (v^T\Pi^{-1} v)^{1/2}$ are norms on $\mathbb{R}^d$.

Theorem 3 (Variational Formulation of the Performative Risk). Under Assumptions 2 to 4, for linearly parameterized classifiers $f_\theta(x) = x^T\theta$ and linear shift operators Π(θ) = Πθ, and assuming that Φ is a convex non-increasing function and that Π is symmetric positive definite, the performative risk may be rewritten as

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta)\big] + (1-\rho)\,\mathbb{E}\Big[\max_{\{\Delta U_0 :\ \|\Delta U_0\|_{\Pi^{-1}} \le \|\theta\|_\Pi\}} \Phi\big(-(U_0 + \Delta U_0)^T\theta\big)\Big]. \qquad (6)$$
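The key step behind eq. (6) (see Appendix A.3) is that the inner maximization is attained at the displacement ∆U₀ = Πθ, whose dot product with θ equals $\|\theta\|_\Pi^2$. A quick numerical sanity check of this step, with arbitrary values of our choosing and not taken from the paper's code, is sketched below.

```python
# Check that, over the ball {Delta : ||Delta||_{Pi^{-1}} <= ||theta||_Pi},
# the linear form Delta^T theta is maximized at Delta = Pi @ theta, with value ||theta||_Pi^2.
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
Pi = A @ A.T + 0.1 * np.eye(d)               # symmetric positive definite, as in Theorem 3
Pi_inv = np.linalg.inv(Pi)
theta = rng.normal(size=d)

radius = np.sqrt(theta @ Pi @ theta)          # ||theta||_Pi, the constraint radius
delta_star = Pi @ theta                       # claimed maximizer
print("feasible     :", np.isclose(np.sqrt(delta_star @ Pi_inv @ delta_star), radius))
print("closed form  :", delta_star @ theta, "=", radius**2)

# Random search over the boundary of the feasible set never does better.
cands = rng.normal(size=(100_000, d))
scale = radius / np.sqrt(np.einsum("ij,jk,ik->i", cands, Pi_inv, cands))
best = np.max((scale[:, None] * cands) @ theta)
print("random search:", best, "<=", radius**2)
```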
Intuitively, for a classification-calibrated loss function [Bartlett et al., 2006, Bach, 2024] and classes with identical covariances, we expect θ to align with the direction of µ₁(θ) − µ₀(θ), so that, when Π is positive definite, the performative shift Πθ has itself a positive dot product with µ₁(θ) − µ₀(θ). The reformulation of the performative risk in eq. (6) formalizes this intuition by showing that the performative optimum is associated with an adversarially robust classification task [Goodfellow et al., 2015, Madry et al., 2018, Ribeiro et al., 2023] in which the points of class 0 are allowed to shift towards those of class 1, so as to increase the overall loss. Compared to objectives found in the robust classification literature, the specificity of eq. (6) lies in the fact that the tolerance (or budget) on the adversarial displacement ∆U₀ depends on both Π and θ.

To understand the role played by the $\|\cdot\|_\Pi$ and $\|\cdot\|_{\Pi^{-1}}$ norms, consider the particular case where only a subset of the variables have a performative effect, i.e., let θ and U₀ be partitioned as $\theta = (\theta_p, \theta_s)$ and $U_0 = (U_{0,p}, U_{0,s})$, with Π block-diagonal with blocks $\Pi_p = \gamma I$ and $\Pi_s = \epsilon I$. Letting ε tend to zero, one obtains that the performative risk is equal to

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta)\big] + (1-\rho)\,\mathbb{E}\Big[\max_{\{\Delta U_{0,p} :\ \|\Delta U_{0,p}\| \le \gamma \|\theta_p\|\}} \Phi\big(-(U_{0,s}^T\theta_s + (U_{0,p} + \Delta U_{0,p})^T\theta_p)\big)\Big].$$

The above expression shows that in this case, only the coordinates subject to the performative effect appear in the adversarial reformulation. In the proof of Theorem 3 (see Appendix A.3), we observe that the second term of eq. (6) may also be rewritten as $(1-\rho)\,\mathbb{E}[\Phi(-U_0^T\theta - \|\theta\|_\Pi^2)]$. Similarly to the case studied by Ribeiro et al. [2023], the term $\|\theta\|_\Pi^2$ that appears inside the surrogate loss function Φ has a regularization effect. Note however that it is not equivalent to the use of a standard ridge regression penalty on θ. The following theorem provides a bound on the performative optimum that highlights the role played by Π on the significance of this regularization effect.

Theorem 4 (Regularization Bound). Define µᵢ = E[Uᵢ]. Under Assumptions 2 to 4, for linearly parameterized classifiers $f_\theta(x) = x^T\theta$ and linear shift operators Π(θ) = Πθ, when Φ is a convex non-increasing function and Π a symmetric positive definite matrix, the minimizer θ* of PR(θ) satisfies the following condition:

$$\|\theta^\star\|_\Pi \le \frac{1}{1-\rho}\,\big\|\rho\mu_1 - (1-\rho)\mu_0\big\|_{\Pi^{-1}}. \qquad (7)$$

Theorem 4 shows that the performative optimum has a smaller value in $\|\cdot\|_\Pi$ norm when the performative effect is stronger, that is, when Π gets larger. In the particular case where Π = γI, eq. (7) rewrites as $\gamma^{1/2}\|\theta^\star\| \le \gamma^{-1/2}\|\rho\mu_1 - (1-\rho)\mu_0\|/(1-\rho)$, and thus larger values of γ decrease the r.h.s. while the l.h.s. increases for identical values of $\|\theta^\star\|$, showing that $\|\theta^\star\|$ has to decrease to zero.

6 Experiments

In this section¹, we test the performance of our algorithm, Reparametrization-based Performative Gradient Descent (RPPerfGD), with respect to existing algorithms. Three baselines were introduced in Perdomo et al. [2020]. First, Repeated Risk Minimization (RRM) computes at each step the next θ that minimizes the non-performative risk, leading to the update rule $\theta_{t+1} = \arg\min_{\theta'} \mathrm{DPR}(\theta_t, \theta')$. In practice, we found that this algorithm is unstable as soon as performative effects become significant. We thus report separately the results obtained with this algorithm in Appendix B.2. A second baseline is Repeated Gradient Descent (RGD), which also ignores the performative effect but limits itself to a gradient step towards this minimization: $\theta_{t+1} = \theta_t - \eta\,\nabla_{\theta'}\mathrm{DPR}(\theta_t, \theta')\big|_{\theta'=\theta_t}$.
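To make the difference between the updates concrete, here is a schematic sketch of one RGD step next to one RPPerfGD step under the shift model of Example 1 and Assumptions 2 to 4. The callables `grad_theta_loss` and `grad_x_loss` (per-sample gradients of the surrogate loss with respect to θ and with respect to the covariates) are assumed interfaces introduced for illustration, not the paper's released implementation (see also Algorithm 1).

```python
import numpy as np

def rgd_step(theta, X, y, grad_theta_loss, eta):
    # Repeated Gradient Descent: one step on the empirical risk, ignoring the performative term.
    g1 = np.mean([grad_theta_loss(x, yi, theta) for x, yi in zip(X, y)], axis=0)
    return theta - eta * g1

def rpperfgd_step(theta, X, y, grad_theta_loss, grad_x_loss, Pi, eta):
    # RPPerfGD: same first term, plus the reparameterization-based performative
    # gradient of eq. (3), which under Assumptions 3 and 4 only involves class-0 samples.
    g1 = np.mean([grad_theta_loss(x, yi, theta) for x, yi in zip(X, y)], axis=0)
    X0 = X[np.asarray(y) == 0]
    g2 = Pi.T @ np.sum([grad_x_loss(x, 0, theta) for x in X0], axis=0) / len(X)
    return theta - eta * (g1 + g2)
```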
In numerical experiments, it is often chosen to add a regularizer to the objective function, which is particularly interesting in the context of performative learning, as discussed in Section 5. Hence, we report Regularized Repeated Gradient Descent (RRGD), which corresponds to the repeated gradient with a loss function including an additional ridge penalty on θ, leading to a more conservative behavior that is also more robust to performative effects. Finally, we also compare to Score Function Performative Gradient Descent (SFPerfGD), which estimates the performative part of the gradient using the $\hat G^{\mathrm{SF}}_\theta$ estimator based on the score-function approach (see Section 2). This was previously used in small dimensions in Miller et al. [2021].

¹ The code is available at https://github.com/totilas/PerfOpti

Figure 2: (a) Logistic regression to classify two Gaussian distributions centered at (0, 0) and (−1, 1) with different magnitudes of performative effects γ. We report the accuracy for three different magnitudes of the performative effect, from no performative effect (γ = 0) to a strong one (γ = 1). (b) Position of the parameter θ in its 2D space, starting from (0, 0) and following different paths depending on the algorithm. (c) Accuracy of a classification with quadratic loss on two Gaussian distributions of dimension 7 with various levels of variance σ of the distributions. (d) Same experiments but using the learnt Π for RPPerfGD. (e) In this case, Frobenius norm of the difference between the true matrix Π and the estimated version. Note that in RGD and RRGD the estimation of Π is not used in the algorithm. (f) Logistic regression for the Housing dataset with various magnitudes of performative shift λ on coordinates 0, 4 and 6. Accuracy is averaged over 20 runs.

Influence of the performative effect. In fig. 2b, we illustrate how taking into account the performative effect allows mitigating regimes with strong distribution shift. We generate two Gaussian distributions, one fixed for class 1 and one moving with mean $\mu(\theta)^T = (-1, 1) + \gamma\,\theta^T \mathrm{diag}(0.1, 0.9)$, where γ is the magnitude of the performative effect. We learn a logistic regression and report the accuracy of the predictions as well as the trajectory of the parameter in (θ₁, θ₂)-space. As expected, when there is no performative effect (γ = 0), all methods are equivalent. As soon as there are performative effects, the performative gradient takes advantage over the other methods. RRGD offers an interesting tradeoff in terms of performance: while agnostic to the performative effect, it still moderates the value of θ and thus the magnitude of the performative effect.

Stability of the estimator. In fig. 2c, we illustrate the result of Example 2 by training a classifier with the square loss, showing that the score-based estimator used in SFPerfGD becomes unstable in high dimensions. We use Gaussian distributions of dimension 7, with two dimensions subject to performative effects, and vary the variance of the distributions.
When the scale σ is small, the variance of the estimator increases to the point of making learning impossible, with unstable trajectories of the parameter θ. Even when the scale is large enough to ensure convergence, RPPerfGD provides faster convergence, illustrating its better scalability to high dimensions.

Estimation of Π. We estimate Π in fig. 2d and fig. 2e by running a ridge regression along the successive deployments of the model, as described in Algorithm 1: the ridge penalty ensures that the estimate of Π is initially close to zero, making the RPPerfGD updates very similar to those of RGD, and it is easy to check that on the order of d deployments are enough to obtain a non-degenerate estimate of Π. While this plug-in approach is not guaranteed to converge from a theoretical standpoint, we observe results that are very similar to the case where Π is fully known.

House price prediction. To simulate performative effects from a dataset, we follow the methodology of Perdomo et al. [2020], shifting the i-th coordinate by a factor λθᵢ if it could be easily modified, and keeping its real value intact otherwise. We use the binarized version of the Housing dataset², where the outcome is whether the price is high or not. Assuming that a seller wants to obtain a high price, the high price is the favored class. Some characteristics are hard to tamper with, such as the location or the income, whereas others can be slightly adjusted, such as the number of households and the number of bedrooms (a room could be promoted to a bedroom). Coordinates 0, 4 and 6 are thus shifted while the others remain identical. We see that when the magnitude of the shift increases, RPPerfGD outperforms RGD. In particular, RPPerfGD seems to converge faster than the non-performative approach.

² https://www.openml.org/d/823

Algorithm 1: RPPerfGD with Π learning
Input: step size η, regularizer λ, starting point θ₀, loss ℓ
Output: parameters θ_K and diagonal matrix Π_K
1: Π₀ ← 0_{d×d}  // initialize Π as a zero matrix of size d × d
2: for k ∈ {0, ..., K − 1} do
3:   Receive n samples $\{x_k^i\}_{i=1}^n \sim \mathcal D(\theta_k)$, among which n₀ samples with label 0, denoted $x_{0,k}^i$
4:   Compute $\Delta_1 \leftarrow \frac{1}{n}\sum_{i=1}^n \nabla_\theta \ell(x_k^i, \theta_k)$  // non-performative part of the gradient
5:   Compute $\Delta_2 \leftarrow \frac{1}{n}\,\Pi_k^T \sum_{i=1}^{n_0} \nabla_x \ell(x_{0,k}^i, \theta_k)$  // performative part of the gradient, over negative samples
6:   $\theta_{k+1} \leftarrow \theta_k - \eta(\Delta_1 + \Delta_2)$  // gradient descent step
7:   $\Pi_{k+1} \leftarrow \arg\min_\Pi \sum_{j=1}^{k}\sum_{l=1}^{n_0} \|x_{0,j}^l - \hat\mu - \Pi\theta_j\|^2 + \lambda\|\Pi\|^2$  // µ̂ is the estimated mean of the class-0 covariates

7 Conclusion

In this work, we have investigated the consequences of assuming a novel, more explicit model for performative effects, under the form of a push-forward shift of distribution. We have demonstrated that it comes with practically important consequences, such as enabling more reliable performative gradient estimation in large-dimensional models. In the classification case, we observed that when the change of distribution is given by a linear-in-parameters shift, the performative risk is convex under relatively general assumptions. It would be interesting to study how these results may extend to non-linear models of the performative effect. Finally, we have shown that certain kinds of performative effects induce an implicit regularization of the risk minimization problem. Moreover, this regularization effect can alternatively be viewed through the lens of adversarial robustness. It would be useful to explore whether this reformulation can be used to optimize the performative risk without an explicit model for the performative effect.
8 Acknowledgments

This work was supported by grants ANR-20-THIA-0014 (program AI PhD@Lille) and ANR-22-PESN-0014 under the France 2030 program. The authors thank Francis Bach, Bruno Loureiro and Kamélia Daudel for their helpful insights.

References

Francis Bach. Learning Theory from First Principles. MIT Press, 2024.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.

Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 48(2), 2023.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 2004.

Moritz Hardt and Celestine Mendler-Dünner. Performative prediction: Past and future, 2023. arXiv:2310.16608.

Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3), 2008.

Zachary Izzo, James Zou, and Lexing Ying. How to learn when data gradually reacts to your model. In International Conference on Artificial Intelligence and Statistics, 2022.

Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. In International Conference on Machine Learning, 2022.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013. arXiv:1312.6114.

Jack P. C. Kleijnen and Reuven Y. Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3), 1996.

Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 2021.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14), 2017.

Pierre L'Ecuyer. An overview of derivative estimation. In Proceedings of the 23rd Conference on Winter Simulation, 1991.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

John Miller, Juan C. Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. In International Conference on Machine Learning, 2021.

Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21, 2020.

Oskar Morgenstern. Wirtschaftsprognose: Eine Untersuchung ihrer Voraussetzungen und Möglichkeiten. Springer Vienna, 1928.

John F. Muth. Rational expectations and the theory of price movements. Econometrica, 1961.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(1), 2021.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, 2020.

Antonio H. Ribeiro, Dave Zachariah, Francis Bach, and Thomas B. Schön.
Regularization properties of adversarially-trained linear regression. In Conference on Neural Information Processing Systems, 2023.

Sebastian Zezulka and Konstantin Genin. Performativity and prospective fairness, October 2023. arXiv:2310.08349.

A Supplementary Material

A.1 Proofs for Example 2

Using the notations of Example 2, the score function estimator may be written as $\Pi^T \frac{1}{2\sigma^2} G$, where $G = \|U + a\|^2 U$ with $U \sim \mathcal N(0, \sigma^2 I_d)$ and $a \in \mathbb{R}^d$ is a deterministic vector depending on the parameters θ, θ′ and the performative effect. To compute the expectation and covariance matrix of G we will use Isserlis' (or Wick's probability) theorem, which states that

(a) $\mathbb{E}[U_{i_1} \cdots U_{i_{2m+1}}] = 0$, for any $\{i_1, \dots, i_{2m+1}\} \in \{1, \dots, d\}^{2m+1}$;

(b) $\mathbb{E}[U_{i_1} \cdots U_{i_{2m}}] = \sigma^{2m} \sum_{\{j_1,k_1\},\dots,\{j_m,k_m\} \in \mathcal P(\{i_1,\dots,i_{2m}\})} \delta_{j_1 k_1} \cdots \delta_{j_m k_m}$,

where $\mathcal P(\{i_1, \dots, i_{2m}\})$ denotes all the distinct ways of partitioning $\{i_1, \dots, i_{2m}\}$ into non-overlapping (unordered) pairs and δ is the Kronecker delta. It is easily checked that the number of partitions in $\mathcal P(\{i_1, \dots, i_{2m}\})$ is equal to $\binom{2m}{m}\frac{m!}{2^m}$, which is also equal to the product of all odd numbers between 1 and 2m − 1.

For the expectation, $\mathbb{E}[G] = \mathbb{E}[(\|U\|^2 + \|a\|^2 + 2a^T U)\,U] = 2\,\mathbb{E}[UU^T]\,a = 2\sigma^2 a$, as the expansion of all other terms would involve an odd number of coordinates of U.

Let $M = \mathbb{E}[GG^T]$; we have

$$M_{ij} = \mathbb{E}\Big[\sum_{k=1}^d (U_k + a_k)^2 \sum_{l=1}^d (U_l + a_l)^2\, U_i U_j\Big] = \mathbb{E}\Big[\sum_{k,l=1}^d \big(U_k^2 U_l^2 + U_k^2 a_l^2 + U_l^2 a_k^2 + 4 a_k a_l U_k U_l + a_k^2 a_l^2\big)\, U_i U_j\Big],$$

omitting terms in the expansion that involve an odd number of coordinates of U (which have zero expectation). Now, we apply Isserlis' theorem to each term in this decomposition, starting with the lowest order (rightmost) ones:

$$\mathbb{E}\Big[\sum_{k,l=1}^d a_k^2 a_l^2\, U_i U_j\Big] = \sigma^2\|a\|^4\,\delta_{ij},$$

$$\mathbb{E}\Big[\sum_{k,l=1}^d 4 a_k a_l\, U_k U_l U_i U_j\Big] = 4\sigma^4 \sum_{k,l=1}^d a_k a_l\,(\delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk}) = 4\sigma^4\big(\|a\|^2\delta_{ij} + 2 a_i a_j\big),$$

$$\mathbb{E}\Big[\sum_{k,l=1}^d \big(U_k^2 a_l^2 + U_l^2 a_k^2\big)\, U_i U_j\Big] = 2\sigma^4\|a\|^2 \sum_{k=1}^d (\delta_{ij} + 2\delta_{ki}\delta_{kj}) = 2\sigma^4\|a\|^2 (d + 2)\,\delta_{ij},$$

$$\mathbb{E}\Big[\sum_{k,l=1}^d U_k^2 U_l^2\, U_i U_j\Big] = \sigma^6 \sum_{k,l=1}^d \big(\delta_{ij} + 2\delta_{il}\delta_{jl} + 2\delta_{ik}\delta_{jk} + 2\delta_{kl}\delta_{ij} + 8\delta_{kl}\delta_{ki}\delta_{lj}\big) = \sigma^6 (d^2 + 6d + 8)\,\delta_{ij},$$

where the last decomposition is obtained by examination of the $\binom{6}{3}\frac{3!}{2^3} = 15$ possible partitions of {k, k, l, l, i, j} into 3 pairs of indices. Putting everything together, one obtains

$$M = \big[(d^2 + 6d + 8)\sigma^6 + 2(d + 4)\|a\|^2\sigma^4 + \|a\|^4\sigma^2\big] I_d + 8\sigma^4 a a^T,$$

which yields

$$\mathrm{Cov}(G) = \big[(d^2 + 6d + 8)\sigma^6 + 2(d + 4)\|a\|^2\sigma^4 + \|a\|^4\sigma^2\big] I_d + 4\sigma^4 a a^T. \qquad (8)$$

Subtracting a scalar baseline m yields the estimator $\tilde G = (\|U + a\|^2 - m)\,U$, which has the same expectation as G. In terms of covariances, one has

$$\mathrm{Cov}(\tilde G) = \mathrm{Cov}(G) + m^2\sigma^2 I_d - 2m\,\mathbb{E}\big[\|U + a\|^2 U U^T\big].$$

The rightmost expression, when expanded, features terms that have already been met in the computation above, and it is easy to check that $\mathbb{E}\big[\|U + a\|^2 U U^T\big] = \big((d + 2)\sigma^4 + \|a\|^2\sigma^2\big) I_d$. Hence,

$$\mathrm{Cov}(\tilde G) = \mathrm{Cov}(G) + \big[m^2\sigma^2 - 2m\big((d + 2)\sigma^4 + \|a\|^2\sigma^2\big)\big] I_d.$$

In the above equation, the scalar term $m^2\sigma^2 - 2m((d + 2)\sigma^4 + \|a\|^2\sigma^2)$ is minimized by choosing $m = (d + 2)\sigma^2 + \|a\|^2$ and is then equal to $-\big((d + 2)\sigma^2 + \|a\|^2\big)^2\sigma^2$, which, combined with eq. (8), yields

$$\mathrm{Cov}(\tilde G) \succeq \big[2(d + 2)\sigma^6 + 4\|a\|^2\sigma^4\big] I_d + 4\sigma^4 a a^T. \qquad (9)$$

A.2 Proof of Theorem 2

Proof. For (a), eq. (4) may be rewritten as

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[(U_1^T\theta - 1)^2\big] + (1-\rho)\,\mathbb{E}\big[((U_0 + \Pi\theta)^T\theta + 1)^2\big].$$

Denoting $\mathbb{E}(U_i) = \mu_i$ and $\mathrm{Cov}(U_i) = \Sigma_i$, for i ∈ {0, 1}, one has

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[((U_1 - \mu_1)^T\theta - (1 - \mu_1^T\theta))^2\big] + (1-\rho)\,\mathbb{E}\big[((U_0 - \mu_0)^T\theta + (\mu_0 + \Pi\theta)^T\theta + 1)^2\big]$$
$$= \rho\big[\|\theta\|_{\Sigma_1}^2 + (1 - \mu_1^T\theta)^2\big] + (1-\rho)\big[\|\theta\|_{\Sigma_0}^2 + ((\mu_0 + \Pi\theta)^T\theta + 1)^2\big].$$

Both squared norms are convex, as well as the squares of, respectively, an affine function and a convex second order polynomial (as Π is positive semidefinite).
For (b), examining

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta)\big] + (1-\rho)\,\mathbb{E}\big[\Phi\big(-(U_0 + \Pi\theta)^T\theta\big)\big], \qquad (10)$$

one observes that $\Phi(u_1^T\theta)$ is convex by our assumption on Φ (for any value of u₁); $(u_0 + \Pi\theta)^T\theta$ is a convex second order (multivariate) polynomial in θ when Π is positive semidefinite and $v \mapsto \Phi(-v)$ is convex non-decreasing, hence $\Phi(-(u_0 + \Pi\theta)^T\theta)$ is also convex. Thus PR(θ) is also convex in θ as an expectation of convex functions.

A.3 Proof of Theorem 3

Proof. To obtain eq. (6), as $v \mapsto \Phi(-v)$ is non-decreasing, one has

$$\max_{\{\Delta u_0 :\ \|\Delta u_0\|_{\Pi^{-1}} \le \|\theta\|_\Pi\}} \Phi\big(-(u_0 + \Delta u_0)^T\theta\big) = \Phi\Big(-u_0^T\theta - \max_{\{\Delta u_0 :\ \|\Delta u_0\|_{\Pi^{-1}} \le \|\theta\|_\Pi\}} (\Delta u_0)^T\theta\Big)$$

for any outcome u₀ of the random variable U₀. The maximization occurs for $\Delta u_0 = \Pi\theta$, which does not depend on u₀, leading to $(\Delta u_0)^T\theta = \|\theta\|_\Pi^2$ and thus

$$\max_{\{\Delta U_0 :\ \|\Delta U_0\|_{\Pi^{-1}} \le \|\theta\|_\Pi\}} \Phi\big(-(U_0 + \Delta U_0)^T\theta\big) = \Phi\big(-U_0^T\theta - \|\theta\|_\Pi^2\big),$$

whose expectation is recognized as the second term of eq. (10).

A.4 Proof of Theorem 4

Proof. Recall from the proof of Theorem 2 that the performative risk can be rewritten as follows:

$$\mathrm{PR}(\theta) = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta)\big] + (1-\rho)\,\mathbb{E}\big[\Phi\big(-(U_0 + \Pi\theta)^T\theta\big)\big] = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta)\big] + (1-\rho)\,\mathbb{E}\big[\Phi\big(-U_0^T\theta - \|\theta\|_\Pi^2\big)\big].$$

Let $\mu_\rho = \rho\mu_1 - (1-\rho)\mu_0$. We have

$$\Phi(0) = \mathrm{PR}(0) \ge \mathrm{PR}(\theta^\star) = \rho\,\mathbb{E}\big[\Phi(U_1^T\theta^\star)\big] + (1-\rho)\,\mathbb{E}\big[\Phi\big(-U_0^T\theta^\star - \|\theta^\star\|_\Pi^2\big)\big]$$
$$\ge \rho\,\Phi\big(\mathbb{E}[U_1^T\theta^\star]\big) + (1-\rho)\,\Phi\big(\mathbb{E}[-U_0^T\theta^\star - \|\theta^\star\|_\Pi^2]\big) = \rho\,\Phi(\mu_1^T\theta^\star) + (1-\rho)\,\Phi\big(-\mu_0^T\theta^\star - \|\theta^\star\|_\Pi^2\big)$$
$$\ge \Phi\big(\rho\,\mu_1^T\theta^\star - (1-\rho)(\mu_0^T\theta^\star + \|\theta^\star\|_\Pi^2)\big) = \Phi\big(\mu_\rho^T\theta^\star - (1-\rho)\|\theta^\star\|_\Pi^2\big),$$

where we have successively used Jensen's inequality and the convexity of Φ. Since Φ is non-increasing, we must have $0 \le \mu_\rho^T\theta^\star - (1-\rho)\|\theta^\star\|_\Pi^2$. Denoting $\beta = \Pi^{1/2}\theta^\star$, it holds that

$$(1-\rho)\|\beta\|^2 \le \mu_\rho^T \Pi^{-1/2}\beta \le \|\mu_\rho\|_{\Pi^{-1}}\,\|\beta\|,$$

where the last inequality is obtained using Cauchy-Schwarz. Hence,

$$\|\theta^\star\|_\Pi = \|\beta\| \le \frac{\|\mu_\rho\|_{\Pi^{-1}}}{1-\rho}.$$

B Numerical Experiments

B.1 Full parameters list

In this section, we report all the parameters needed to reproduce the figures in the paper. Note that we always use the same step size for all the methods. This choice stems from the fact that the methods are equivalent (up to the regularization parameter) when there is no performative effect.

Table 1: Parameters used for fig. 2b

| Parameter | Value |
| --- | --- |
| Number of iterations (num_iter) | 100 |
| Sample size (n) | 1000 |
| Scale (σ) | 0.5 |
| Average number of iterations (num_iter_average) | 100 |
| Step size (step_size) | 0.1 |
| Regularization parameter (λ) | 3 × 10⁻² |

Table 2: Parameters used for fig. 2c

| Parameter | Value |
| --- | --- |
| Number of iterations (num_iter) | 25 |
| Sample size (n) | 1000 |
| Initial scale (scale0) | 0.5 |
| Performative shift matrix (Π) | diag([0.1, 3, 0, 0, 0, 0, 0]) |
| Mean of class 0 (µ) | [1, 2, 0.5, 0.5, 0, 0, 0] |
| Average number of iterations (num_iter_average) | 100 |
| Step size (step_size) | 0.1 |
| Regularization parameter (λ) | 10⁻¹ |

B.2 Repeated Risk Minimization

In this section, we report the same learning tasks as those reported in the main text, but for Repeated Risk Minimization (RRM). In every setting, as soon as the performative effect is not negligible, the technique diverges. To ensure the readability of the figures and avoid shrinking the differences between the other algorithms, we report it separately.

Table 3: Parameters used for fig. 2f

| Parameter | Value |
| --- | --- |
| Number of iterations (num_iter) | 15 |
| Sample size (n) | 18000 |
| Number of runs (n_runs) | 20 |
| Step size (step_size) | 0.2 |
| Regularization parameter (λ) | 5 × 10⁻³ |

Figure 3: (a) Learning a logistic regression between two Gaussian distributions centered at (0, 0) and (−1, 1) with different magnitudes of performative effects γ.
(b) Accuracy of a classification with quadratic loss on two Gaussian distributions of dimension 7 with various levels of noise σ.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We define our approach with push-forward measures in Section 2, the application to classification in Section 3, the new theorems on convexity in Section 4 and the relation with robustness in Section 5.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss our model as push-forward measures in Section 2 and we discuss our theorems to ensure convexity in Section 4. For the experiments, we acknowledge the lack of true performative datasets and use the same simulation approximations as in previous work, as described in Section 6.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All theorems are stated with their full set of assumptions, with proofs just after them or in the supplementary material (Appendix A).

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The experiments are described in Section 6. Additional choices of parameters are reported in Appendix B.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: The code will be publicly released after publication.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The detailed parameters are in the Appendix (Appendix B). The choices of parameters are quite limited as the studied models have very simple architectures.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: All figures report the standard deviation over the runs, as stated in the legends.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The experiments only use small datasets, so all runs were done on a single laptop, where the time of execution is very small (a few seconds per run).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The paper conforms to the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper is theoretical and should not directly have any societal impact. In fact, raising attention to the feedback loops that might occur in performative learning could lead to a positive societal impact, if better understood.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We only reuse one dataset (Housing) and we cite it.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.