Anticipating Performativity by Predicting from Predictions

Celestine Mendler-Dünner, Max Planck Institute for Intelligent Systems, Tübingen, cmendler@tuebingen.mpg.de
Frances Ding, University of California, Berkeley, frances@berkeley.edu
Yixin Wang, University of Michigan, yixinw@umich.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they are designed to predict. Understanding the causal effect of predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes. This can make the causal effect of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability. Despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal effect of predictions can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically we show that, given our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.

1 Introduction

Predictions can impact sentiments, alter expectations, inform actions, and thus change the course of events. Through their influence on people, predictions have the potential to change the regularities in the population they seek to describe and understand. This insight underlies the theories of performativity [38] and reflexivity [62] that play an important role in modern economics and finance. Recently, Perdomo et al. [51] pointed out that the social theory of performativity has important implications for machine learning theory and practice. Prevailing approaches to supervised learning assume that features X and labels Y are sampled jointly from a fixed underlying data distribution that is unaffected by attempts to predict Y from X. Performativity questions this assumption and suggests that the deployment of a predictive model can disrupt the relationship between X and Y. Hence, changes to the predictive model can induce shifts in the data distribution. For example, consider a lender with a predictive model for risk of default: performativity could arise if individuals who are predicted as likely to default are given higher interest loans, which make default even more likely [41], akin to a self-fulfilling prophecy. In turn, a different predictive model that predicts smaller risk and suggests offering more low-interest loans could cause some individuals who previously looked risky
to be able to pay the loans back, which would appear as a shift in the relationship between features X and loan repayment outcomes Y. This performative nature of predictions poses a challenge to using historical data to predict the outcomes that will arise under the deployment of future models.

1.1 Our work

In this work, we aim to understand under what conditions observational data is sufficient to identify the performative effects of predictions. Only when causal identifiability is established can we rely on data-driven strategies to anticipate performativity and reason about the downstream consequences of deploying new models. Towards this goal, we focus on a subclass of performative prediction problems in this paper where performative effects of predictions solely surface as a shift in the outcome variable, and the distribution over covariates X is unaffected by the prediction Ŷ. Our goal is to identify the expected counterfactual outcome MY(x, ŷ) := E[Y | X = x, do(Ŷ = ŷ)]. Understanding the causal mechanism MY is crucial for model evaluation, as well as model optimization. In particular, it allows for offline evaluation of the potential outcome Y of an individual X subject to a predictive model f_new with the prediction Ŷ = f_new(X) before actually deploying it.

The need for observing predictions. We start by illustrating the hardness of performativity-agnostic learning by relating performative prediction to a concept shift problem. Using the specifics of the performative shift, we establish a lower bound on the extrapolation error of predicting Y from X under the deployment of a new model f_new that is different from the model f_train deployed during data collection. In particular, the extrapolation error grows with the distance between the prediction functions of the two models and the strength of performativity. This lower bound on the extrapolation error demonstrates the necessity to take performativity into account for reliably predicting Y.

Predicting from predictions. We then explore the feasibility of learning performative effects when the predictions were recorded during data collection, so that training samples (X, Ŷ, Y) are available. As an identification strategy for learning MY, we focus on building a meta machine learning model that predicts Y for an individual with features X who is subjected to a prediction Ŷ. We term this data-driven strategy predicting from predictions; it treats the predictions as an input to the meta machine learning model. The meta model seeks to answer: what would the outcome be if we were to deploy a different prediction model? Crucially, this "what if" question is causal in nature; it aims to understand the potential outcome under an intervention, which is different from merely estimating the outcome variable in previously seen data. Whether such a transferable model is learnable depends on whether the training data provides causal identifiability [49]. Only after causal identifiability is established can we rely on observational data to select and design optimal prediction models under performativity.

Establishing identifiability. For our main technical results, we first show that, in general, observing Ŷ is not sufficient for identifying the causal effects of predictions. In particular, if the training data was collected under the deployment of a deterministic prediction function, the mechanism MY cannot be uniquely identified. The reason is a lack of coverage in the training data, as X and Ŷ are deterministically bound.
Next, we establish several conditions under which observing Ŷ is sufficient for identifying MY. The first condition exploits the presence of randomness in the prediction. This randomness could be purposely built into the prediction for individual fairness, differential privacy, or other considerations. The second condition exploits the property that predictive models are often over-parameterized, which leads to incongruence in functional complexity between different causal paths, enabling the effects of predictions to be separated from other variables' effects. The third condition takes advantage of discreteness in predictions such that performative effects can be disentangled from the continuous relationship between covariates and outcomes. Together, these results reveal that particularities of the performative prediction problem can enable us to recover the causal effect of predictions from observational data. In particular, we show that, under these conditions, standard supervised learning techniques can be used to find these transferable functional relationships by treating predictions as model inputs. Empirically, we demonstrate that supervised learning succeeds in finding MY even in finite samples. We conclude with a discussion of limitations and extensions of our work, pointing out potential violations of the modeling assumptions underlying our causal analysis and proposing directions for future work.

1.2 Broader context and related work

The work by Perdomo et al. [51] initiated the discourse of performativity in the context of supervised learning by pointing out that the deployment of a predictive model can impact the data distribution we train our models on. Existing scholarship on performative prediction [c.f., 51, 42, 12, 44, 24, 26, 68, 45, 52, 31] has predominantly focused on achieving a particular solution concept with a prediction function that maps X to Y in the presence of unknown performative effects. We are interested in understanding the underlying causal mechanism of the performative distribution shift. Our work is motivated by the seemingly natural approach of lifting the supervised-learning problem and incorporating the prediction as an input feature when building a meta machine learning model for explaining Y. By establishing a connection to causal identifiability, our goal is to understand when such a data-driven strategy can help anticipate the downstream effects of predictions.

This work focuses on the setting where predictions lead to changes in the relationship between covariates X and label Y, while the marginal distribution P(X) over covariates is assumed to be fixed. This setting where performativity only surfaces in the label describes an interesting subclass of problems falling under the umbrella of performative (a.k.a. model-induced or decision-dependent) distribution shifts [51, 37, 12]. Our assumptions are complementary to the strategic classification framework [8, 20] that focuses on a setting where performative effects concern P(X), while P(Y | X) is assumed to remain stable. Consequently, causal questions in strategic classification [e.g., 22, 3, 59] are concerned with identifying stable causal relationships between X and Y. Since we assume P(Y | X) can change (i.e., the true underlying concept determining outcomes can change), conceptually different questions emerge in our work.
Similar in spirit to strategic classification, the work on algorithmic recourse and counterfactual explanations [32, 28, 65] focuses on the causal link between features and predictions, whereas we focus on the downstream effects of predictions. There are interesting parallels between our work and related work on the offline evaluation of online policies [e.g., 35, 63, 36, 58]. In particular, [63] explicitly emphasize the importance of logging propensities of the deployed policy during data collection to be able to mitigate selection bias. In our work the deployed model can induce a concept shift. Thus, we find that additional information about the predictions of the deployed model needs to be recorded to be able to foresee the impact of a new predictive model on the conditional distribution P(Y | X), beyond enabling propensity weighting [55]. A notable work by [66] investigates how predictions at one time step impact predictions in future time steps. Complementary to these existing works, we show that randomness in the predictive model is not the only way causal effects of predictions can be identified. For our theoretical results, we build on classical tools from causal inference [48, 57, 64]. In particular, we distill unique properties of the performative prediction problem to design assumptions for the identifiability of the causal effect of predictions.

2 The causal force of prediction

Predictions can be performative and impact the population of individuals they aim to predict. Formulated in the language of causal inference [48]: the deployment of a predictive model represents an intervention on a causal diagram that describes the underlying data generation process of the population. We will expand on this causal perspective to study an instance of the performative prediction problem described below.

2.1 Prediction as a partial mediator

Consider a machine learning application relying on a predictive model f that maps features X to a predicted label Ŷ. We assume the predictive model f is performative in that the prediction Ŷ = f(X) has a direct causal effect on the outcome variable Y of the individual it concerns. Thereby the prediction impacts how the outcome variable Y is generated from the features X. The causal diagram illustrating this setting is visualized in Figure 1. The features X ∈ 𝒳 ⊆ R^d are drawn i.i.d. from a fixed underlying continuous distribution over covariates DX with support 𝒳. The outcome Y ∈ 𝒴 ⊆ R is a function of X, partially mediated by the prediction Ŷ ∈ 𝒴. The prediction Ŷ is determined by the deployed predictive model f : 𝒳 → 𝒴. For a given prediction function f, every individual is assumed to be sampled i.i.d. from the data generation process described by the causal graph in Figure 1. We assume the exogenous noise ξY is zero mean, and ξf allows the prediction function to be randomized.

X = ξX, where ξX ∼ DX   (1)
Ŷ = f(X, ξf), where ξf ∼ Df   (2)
Y = g(X, Ŷ) + ξY, where ξY ∼ DY   (3)

Figure 1: Performative effects mediated by predictions, for a given f (causal graph over X, Ŷ, and Y).

Note that our model is not meant to describe performativity in its full generality (which includes other ways f may affect P(X, Y)). Rather, it describes an important and practically relevant class of performative feedback problems that are characterized by two properties: 1) performativity surfaces only in the label Y, and 2) performative effects are mediated by the prediction, such that Y ⊥ f | Ŷ, rather than dependent on the specifics of the decision rule.
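To make the data-generating process concrete, the following minimal simulation sketch instantiates the structural equations (1)–(3). The linear base effect, the coefficient values, and the specific deterministic predictor f are illustrative assumptions for this sketch, not part of the general model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
beta = np.array([1.0, -2.0, 0.5])   # assumed direct effect of X on Y
alpha = 0.8                         # assumed strength of the performative effect

def f(X):
    # Deployed predictive model; here an arbitrary fixed linear rule (deterministic).
    return X @ np.array([0.9, -1.8, 0.4])

X = rng.normal(size=(n, d))                          # X = xi_X,  xi_X ~ D_X        (1)
Y_hat = f(X)                                         # Y_hat = f(X, xi_f)           (2)
Y = X @ beta + alpha * Y_hat + rng.normal(size=n)    # Y = g(X, Y_hat) + xi_Y       (3)

# Deploying a different model f changes how Y is generated, even though D_X is fixed.
```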
Application examples. Causal effects of predictions on outcomes have been documented in multiple contexts: A bank's prediction about a client (e.g., his or her creditworthiness in applying for a loan) determines the interest rate assigned to them, which in turn changes the client's financial situation [41]. Mathematical models that predict stock prices inform the actions of traders and thus heavily shape financial markets and economic realities [38]. Zillow's housing price predictions directly impact sales prices [39]. Predictions about the severity of an illness play an important role in treatment decisions and hence the very chance of survival of the patient [34]. Another prominent example from psychology is the Pygmalion effect [56]. It refers to the phenomenon that high expectations lead to improved performance, which is widely documented in the context of education [6], sports [61], and organizations [16]. Examples of such performativity abound, and we hope to have convinced the reader that performative effects in the label are important for algorithmic prediction.

2.2 Implications for performativity-agnostic learning

We begin by considering the classical supervised learning task where Ŷ is unobserved. The goal is to learn a model h : 𝒳 → 𝒴 for predicting the label Y from the features X. To understand the inherent challenge of classical prediction under performativity, we investigate the relationship between X and Y more closely. Specifically, the data generation process (Figure 1) implies that

P(Y | X) = ∫ P(Y | Ŷ, X) P(Ŷ | X) dŶ.   (4)

This expression makes explicit how the relationship between X and Y that we aim to learn depends on the predictive model governing P(Ŷ | X). As a consequence, when the deployed predictive model at test time differs from the model at training time, performative effects surface as concept shift [17]. Such distribution shift problems are known to be intractable without structural knowledge about the shift, implying that we cannot expect h to generalize to distributions induced by future model deployments. Let us inspect the resulting extrapolation gap in more detail and put existing positive results on performative prediction into perspective.

Extrapolation loss. We illustrate the effect of performativity on predictive performance using a simple instantiation of the structural causal model from Figure 1. To this end, assume a linear performative effect of strength α > 0 and a base function g1 : 𝒳 → 𝒴:

g(X, Ŷ) := g1(X) + αŶ.   (5)

Now, assume we collect training data under the deployment of a predictive model fθ and validate our model under the deployment of fφ. We adopt the notion of a distribution map from Perdomo et al. [51] and write DXY(f) for the joint distribution over (X, Y) surfacing from the deployment of a model f. We assess the quality of our predictive model h : 𝒳 → 𝒴 over a distribution DXY(f) induced by f via the loss function ℓ : 𝒴 × 𝒴 → R and write Rf(h) := E_{(x,y)∼DXY(f)} ℓ(h(x), y) for the risk of h on the distribution induced by f. We use h*_f for the risk minimizer, h*_f := argmin_{h∈H} Rf(h), and H for the hypothesis class we optimize over. Proposition 1 bounds the extrapolation loss and can be viewed as a concrete instantiation of the more general extrapolation bounds for performative prediction discussed in [37] within the feedback model from Figure 1.

Proposition 1 (Hardness of performativity-agnostic prediction). Consider the data generation process in Figure 1 with g given in (5) and fθ, fφ being deterministic functions.
Take a loss function ℓ : 𝒴 × 𝒴 → R that is γ-smooth and µ-strongly convex in its second argument. Let h*_fθ be the risk minimizer over the training distribution and assume the problem is realizable, i.e., h*_fθ ∈ H. Then, we can bound the extrapolation loss of h*_fθ on the distribution induced by fφ as

(γ/2) α² d²_DX(fθ, fφ) ≥ R_{fθ→fφ}(h*_fθ) ≥ (µ/2) α² d²_DX(fθ, fφ),   (6)

where d²_DX(fθ, fφ) := E_{x∼DX} (fθ(x) − fφ(x))² and R_{fθ→fφ}(h) := R_fφ(h) − R_fθ(h).

The extrapolation loss R_{fθ→fφ}(h*_fθ) is zero if and only if either the strength of performativity tends to zero (α → 0), or the predictions of the two predictors fθ and fφ are identical over the support of DX. If this is not the case, an extrapolation gap is inevitable. This elucidates the fundamental hardness of performative prediction from (feature, label) pairs (X, Y) when performative effects disrupt the causal relationship between X and Y. The special case where α = 0 aligns with the assumption of classical supervised learning, in which there is no performativity. This may hold in practice if the predictive model is solely used for descriptive purposes, or if the agent making the prediction does not enjoy any economic power [21]. The second special case where the extrapolation error is small is when d²_DX(fθ, fφ) → 0, in which case DXY(fθ) and DXY(fφ) are equal in distribution and hence exhibit the same risk minimizer. Such a scenario can happen, for example, if the model fφ is obtained by retraining fθ on observational data and a fixpoint is reached (fθ = h*_fθ). The convergence of policy optimization strategies to such fixpoints (performative stability) has been studied in prior work [e.g., 51, 42, 12] and enabled optimality results even in the presence of performative concept shifts, relying on the target model fφ not being chosen arbitrarily, but based on a pre-specified update strategy.

3 Identifying the causal effect of prediction

Having illustrated the hardness of performativity-agnostic learning, we explore under what conditions incorporating the presence of performative predictions into the learning task enables us to anticipate the performative effects of Ŷ on Y. Towards this goal, we assume that the mediator Ŷ in Figure 1 is observed; the prediction takes on the role of the treatment in our causal analysis, and we cannot possibly hope to estimate the treatment effect of a treatment that is unobserved.

3.1 Problem setup

Assume we are given access to data points (x, ŷ, y) generated i.i.d. from the structural causal model in Figure 1 under the deployment of a prediction function fθ. From this observational data, we wish to estimate the expected potential outcome of an individual under the deployment of an unseen (but known) predictive model fφ. We note that, given our causal graph, the implication of intervening on the function f can equivalently be explained by an intervention on the prediction Ŷ. Thus, we are interested in identifying the causal mechanism:

MY(x, ŷ) := E[Y | X = x, do(Ŷ = ŷ)].   (7)

Unlike P(Y | X), the mechanism MY is invariant to changes in the predictive model governing P(Ŷ | X). Thus, being able to identify MY will allow us to make inferences about the potential outcome surfacing from planned model updates, beyond explaining patterns in historical data. We can evaluate MY to infer y for any x at ŷ = fφ(x) for fφ being the model of interest. For simplicity of notation, we will write D(fθ) to denote the joint distribution over (X, Ŷ, Y) of the observed data collected under the deployment of the predictive model fθ.
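If an estimate of MY is available, the "what if" evaluation described above reduces to plugging a candidate model's predictions into that estimate. The sketch below is schematic and works under assumed interfaces: M_hat, f_phi, and the loss are placeholders for illustration, not objects defined in the paper.

```python
import numpy as np

def offline_risk(M_hat, f_phi, X, loss):
    """Anticipate the risk of deploying a candidate model f_phi from observational data.

    M_hat : callable (X, y_hat) -> estimate of M_Y(x, y_hat) = E[Y | X=x, do(Y_hat=y_hat)]
    f_phi : candidate predictive model, callable on X
    X     : covariate sample drawn from D_X (unaffected by the deployed model)
    loss  : pointwise loss between a prediction and an outcome
    """
    y_hat = f_phi(X)                 # predictions the candidate model would issue
    y_expected = M_hat(X, y_hat)     # anticipated outcomes under do(Y_hat = y_hat)
    # Note: plugging in the expected outcome omits the irreducible noise term; for the
    # squared loss that term is the same for every candidate model, so rankings agree.
    return np.mean(loss(y_hat, y_expected))

# Example usage with the squared loss:
# risk_phi = offline_risk(M_hat, f_phi, X, lambda yh, y: (yh - y) ** 2)
```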
We say MY can be identified if it can uniquely be expressed as a function of observed data. More formally:

Definition 1 (identifiability). Given a predictive model f, the causal graph in Figure 1, and a set of assumptions A, we say MY is identifiable from D(f) if for any function h that complies with assumptions A and satisfies h(x, ŷ) = MY(x, ŷ) for all pairs (x, ŷ) in the support of D(f), it must also hold that h(x, ŷ) = MY(x, ŷ) for all pairs (x, ŷ) ∈ 𝒳 × 𝒴.

Without causal identifiability, there might be models h ≠ MY that explain the training distribution equally well but do not transfer to the distribution induced by the deployment of a new model. Causal identifiability is crucial for enabling extrapolation. It quantifies the limits of what we can infer given access to the training data distribution, ignoring finite sample considerations.

Identification with supervised learning. Identifiability of MY from samples of D(fθ) implies that the historical data collected under the deployment of fθ contains sufficient information to recover the invariant relationship (7). As a concrete identification strategy, consider the following standard variant of supervised learning that takes in samples (x, ŷ, y) and builds a meta-model that predicts Y from X, Ŷ by solving the following risk minimization problem:

h_SL := argmin_{h∈H} E_{(x,ŷ,y)∼D(fθ)} [(h(x, ŷ) − y)²],   (8)

where H denotes the hypothesis class. We consider the squared loss for risk minimization because it pairs well with the exogenous noise ξY in (3) being additive and zero mean. The strategy (8) is an instance of what we term predicting from predictions. Lemma 2 provides a sufficient condition for the supervised learning solution h_SL to recover the invariant causal quantity MY.

Lemma 2 (Identification strategy). Consider the data generation process in Figure 1 and a set of assumptions A. Given a hypothesis class H such that every h ∈ H complies with A and the problem is realizable, i.e., MY ∈ H. Then, if MY is causally identifiable from D(fθ) given A, the risk minimizer h_SL in (8) will coincide with MY.

3.2 Challenges for identifiability

The main challenge for identification of MY from data is that, in general, the prediction rule fθ which produces Ŷ is a deterministic function of the covariates X. This means that, for any realization of X, we only get access to one Ŷ = fθ(X) in the training distribution, which makes it challenging to disentangle the direct and the indirect effects of X on Y. To illustrate this challenge, consider the function h(x, ŷ) := MY(x, fθ(x)) that ignores the input parameter ŷ and only relies on x for explaining the outcome. This function explains y equally well and cannot be distinguished from MY based on data collected under the deployment of a deterministic prediction rule fθ. The problem is akin to fitting a linear regression model to two perfectly correlated covariates. More broadly, this ambiguity is due to what is known as a lack of overlap (or lack of positivity) in the literature on causal inference [47, 23]. In the covariate shift literature, the lack of overlap surfaces when the covariate distribution violates the common support assumption and the propensity scores are not well-defined (see, e.g., Pan and Yang [46]). This problem renders causal identification, and thus data-driven learning of performative effects from deterministic predictions, fundamentally challenging.
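The analogy to perfectly correlated regressors can be made concrete. The sketch below uses an assumed one-dimensional linear instance (all parameter values are illustrative): under a deterministic fθ, the true mechanism and a function that ignores ŷ entirely attain identical training risk, so the data cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
beta, alpha, theta = 1.0, 0.5, 1.0     # assumed ground truth: y = beta*x + alpha*y_hat

y_hat = theta * x                                       # deterministic predictor f_theta
y = beta * x + alpha * y_hat + 0.1 * rng.normal(size=n)

def training_risk(coef_x, coef_yhat):
    return np.mean((coef_x * x + coef_yhat * y_hat - y) ** 2)

# The true mechanism (beta, alpha) and an alternative that puts zero weight on y_hat
# produce the same fitted values, hence identical empirical risk (up to float error).
print(training_risk(beta, alpha))                  # true mechanism M_Y
print(training_risk(beta + alpha * theta, 0.0))    # ignores y_hat, fits equally well
```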
Proposition 3 (Nonidentifiability from deterministic predictions). Consider the structural causal model in Figure 1. Assume Y non-trivially depends on Ŷ, and the set 𝒴 is not a singleton. Then, given a deterministic prediction function f, the mechanism MY is not identifiable from D(f).

The identifiability issue persists as long as the two variables X, Ŷ are deterministically bound and there is no incongruence or hidden structure that can be exploited to disentangle the direct effect of X on Y from the indirect effect mediated by Ŷ. In the following, we focus on particularities of prediction problems and show how they allow us to identify MY.

3.3 Identifiability from randomization

We start with the most natural setting that provides identifiability guarantees: randomness in the prediction function fθ. Using standard arguments about overlap [47] we can identify MY(x, ŷ) for any pair (x, ŷ) with positive probability in the data distribution D(fθ) from which the training data is sampled. To relate this to our goal of identifying the outcome under the deployment of an unseen model fφ, we introduce the following definition:

Definition 2 (output overlap). Given two predictive models fθ, fφ, the model fφ is said to satisfy output overlap with fθ if, for all x ∈ 𝒳 and any subset Y′ ⊆ 𝒴 with positive measure P[fφ(x) ∈ Y′] > 0, it holds that

P[fθ(x) ∈ Y′] > 0.   (9)

In particular, output overlap requires the support of the new model's predictions fφ(x) to be contained in the support of fθ(x) for every potential x ∈ 𝒳. The following proposition takes advantage of the fact that the joint distribution over (X, Y) is fully determined by the deployed model's predictions to relate output overlap to identification:

Proposition 4. Given the causal graph in Figure 1, the mechanism MY(x, ŷ) is identifiable from D(fθ) for any pair (x, ŷ) with ŷ = fφ(x), as long as fφ is a prediction function that satisfies output overlap with fθ.

Proposition 4 allows us to pinpoint the models fφ to which we can extrapolate from data collected under fθ. Furthermore, it makes explicit that, for collecting data to learn about performative effects, it is ideal to deploy a predictor fθ that is randomized so that the prediction output has full support over 𝒴 for any x. Such a model would generate a dataset that guarantees global identification of MY over 𝒳 × 𝒴 and thus robust conclusions about any future deployable model fφ. One interesting and relevant setting that satisfies this property is the differentially private release of predictions through an additive Laplace (or Gaussian) noise mechanism applied to the output of the prediction function [13]. (In Appendix B we discuss two additional natural sources of randomness, randomized decisions and noisy measurements of covariates, that can potentially help identification with appropriate side-information.) While standard in the literature, a caveat of identification from randomization is that there are several reasons a decision-maker may choose not to deploy a randomized prediction function in performative environments, including negative externalities and concerns about user welfare [29], but also business interests to preserve consumer value of the prediction-based service offered. In the context of our credit scoring example, random predictions would imply that interest rates are randomly assigned to applicants in order to learn how the rates impact their probability of paying back. We cannot presently observe this scenario, given regulatory requirements for lending institutions.
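As a concrete illustration of this identification-from-randomization argument, the sketch below releases the predictions of a deterministic base model through an additive Laplace noise mechanism and then regresses Y on (X, Ŷ). The linear ground truth, the noise scale, and the use of scikit-learn are illustrative assumptions; the point is only that exogenous noise in Ŷ lets ordinary least squares recover the coefficients of MY.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, d = 50_000, 3
beta = np.array([1.0, -2.0, 0.5])     # assumed direct effect of X on Y
alpha = 0.8                           # assumed performativity strength

X = rng.normal(size=(n, d))
f_theta = X @ np.array([0.9, -1.8, 0.4])            # deterministic base predictions
y_hat = f_theta + rng.laplace(scale=0.5, size=n)    # Laplace mechanism: randomized release
y = X @ beta + alpha * y_hat + rng.normal(size=n)

# Predicting from predictions: regress Y on (X, Y_hat) jointly.
h = LinearRegression().fit(np.column_stack([X, y_hat]), y)
print(h.coef_)   # approximately [1.0, -2.0, 0.5, 0.8], i.e., (beta, alpha) is recovered
```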
3.4 Identifiability through overparameterization

The following two sections consider situations where we can achieve identification, without randomization, from data collected under a deterministic fθ. Our first result exploits incongruences in functional complexity arising from machine learning models that are overparameterized [e.g., 30]. By overparameterization, we refer to the fact that the representational complexity of the model is larger than the underlying concept it needs to describe.

Assumption 1 (overparameterization). We say a function f is overparameterized with respect to G over 𝒳 if there is no function g′ ∈ G and c ∈ R such that f(x) = c · g′(x) for all x ∈ 𝒳.

A challenge for identification is that for deterministic fθ the prediction can be reconstructed from X without relying on Ŷ, and thus h(x, ŷ) = MY(x, fθ(x)) cannot be distinguished from MY based on observational data. However, note that this ambiguity relies on there being an admissible h such that h(·, ŷ) for a fixed ŷ can represent fθ. If fθ is overparameterized with respect to the hypothesis class H, this ambiguity is resolved. Let us make this intuition concrete with an example:

Example 3.1. Assume the structural equation for y in Figure 1 is g(x, ŷ) = αx + βŷ for some unknown α, β. Consider prediction functions fθ of the form fθ(x) = γx² + ξx for some γ, ξ ≥ 0, and let H be the class of linear functions. Then, any admissible estimate h ∈ H takes the form h(x, ŷ) = α′x + β′ŷ. For h to be consistent with observations we need α′ + β′ξ = α + βξ and β′γ = βγ. This system of equations has a unique solution as long as γ > 0, which corresponds to the case where fθ is overparameterized with respect to H. In contrast, for γ = 0 the function h(x, ŷ) = (α + βξ)x would explain the training data equally well.

The following result generalizes this argument to separable functions.

Proposition 5. Consider the structural causal model in Figure 1 where fθ is a deterministic function. Assume that g can be decomposed as g(X, Ŷ) = g1(X) + αŶ for some α > 0 and g1 ∈ G, where the function class G is closed under addition (i.e., g1, g2 ∈ G implies a1·g1 + a2·g2 ∈ G for all a1, a2 ∈ R). Let H contain functions that are separable in X and Ŷ, linear in Ŷ, and such that for every h ∈ H it holds that h(·, ŷ) ∈ G for any fixed ŷ. Then, if fθ is overparameterized with respect to G over the support of DX, MY is identifiable from D(fθ).

3.5 Identifiability from classification

A second ubiquitous source of incongruence that we can exploit for identification is the discrete nature of predictions in the context of classification. The resulting discontinuity in the relationship between X and Ŷ enables us to disentangle MY from the direct effect of X on Y. This identification strategy is akin to the popular regression discontinuity design [33] and relies on the assumption that all other variables in X are continuously related to Y around the discontinuities in Ŷ.

Figure 2: Extrapolation error of supervised learning with and without access to Ŷ, plotted against performativity strength. (a) In the non-identifiable setting, adding Ŷ as a feature harms generalization performance. (b) Randomized fθ, (c) overparameterized fθ, (d) discrete fθ: randomization, overparameterization, and discrete predictions are each sufficient for avoiding this failure mode.
Proposition 6. Consider the structural causal model in Figure 1 where fθ is a deterministic function. Assume that the structural equation for Y is separable, g(X, Ŷ) = g1(X) + g2(Ŷ) for all X, Ŷ, for some differentiable functions g1 and g2. Further, suppose X is a continuous random variable and Ŷ is a discrete random variable that takes on at least two distinct values with non-zero probability. Then, MY is identifiable from D(fθ).

Similar to Proposition 5, the separability assumption together with incongruence provides a way to disentangle the direct effect from the indirect effect of X on Y. Separability is necessary in order to achieve global identification guarantees; without randomness, the identification of entangled components without overlap is fundamentally hard. Thus, under violations of the separability assumptions, we can only expect the separable components of g to be correctly identified. Similarly, a regression discontinuity design only enables the identification of the causal effect locally around the discontinuity. Extrapolation away from the decision boundary to models fφ that are substantially different from fθ increasingly relies on separability to hold true.

4 Empirical evaluation

We investigate empirically how well the supervised learning solution h_SL in (8) is able to identify the causal mechanism MY from observational data in practical settings with finite data.

Methodology. We generated semi-synthetic data for our experiments, using a Census income prediction dataset from folktables.org [11]. Using this dataset as a starting point, we simulate a training dataset and a test dataset with distribution shift as follows: First, we choose two different predictors fθ and fφ to predict a target variable of interest (e.g., income) from covariates X (e.g., age, occupation, education, etc.). If not specified otherwise, fθ is fit to the original dataset to minimize squared error, while fφ is trained on randomly shuffled labels. Next, we posit a function g for simulating the performative effects. Then, we generate a training dataset of (X, Ŷ, Y) tuples from the causal model in Figure 1, using the covariates X from the original data, g, and fθ to generate Ŷ and Y. Similarly, we generate a test dataset of (X, Ŷ, Y) tuples, using X, g, and fφ. We assess how well supervised methods learn transferable functional relationships by fitting a model h_SL to the training dataset and then evaluating the root mean squared error (RMSE) for regression and the accuracy for classification on the test dataset. In our figures, we visualize the standard error from 10 replicates with different random seeds and we compare to an in-distribution baseline trained and evaluated on samples of D(fφ). If not specified otherwise we use N = 200,000 samples.

4.1 Necessity of identification guarantees for supervised learning

We start by illustrating why our identification guarantees are crucial for supervised learning under performativity. To this end, we instantiate the structural equation g in Figure 1 as

g(X, Ŷ) = g1(X) + αŶ   (10)

with g1(X) = β⊤X and ξY ∼ N(0, 1). The coefficients β are determined by linear regression on the original dataset. The hyperparameter α quantifies the performativity strength that we vary in our experiments. The predictions Ŷ are generated from a linear model fθ that we modify to illustrate the resulting impact on identifiability. We optimize h_SL in (8) over H being the class of linear functions.
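The evaluation protocol of this section can be summarized in a short sketch. Synthetic Gaussian covariates stand in for the Census data here, and the model classes, parameter values, and noise scale are simplifying assumptions; setting sigma = 0 reproduces the non-identifiable failure mode of Figure 2(a), while sigma > 0 corresponds to the randomized setting of Figure 2(b).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, d = 20_000, 5
alpha = 0.5          # performativity strength
sigma = 0.5          # noise in f_theta's predictions (sigma = 0: non-identifiable setting)

X = rng.normal(size=(n, d))
beta = rng.normal(size=d)                      # plays the role of g1 fit to the base data
base_labels = X @ beta + rng.normal(size=n)

f_theta = LinearRegression().fit(X, base_labels)                   # deployed at training time
f_phi = LinearRegression().fit(X, rng.permutation(base_labels))    # trained on shuffled labels

def simulate(f, noise_scale=0.0):
    """Generate (Y_hat, Y) under the causal model of Figure 1 with g as in (10)."""
    y_hat = f.predict(X) + noise_scale * rng.normal(size=n)
    y = X @ beta + alpha * y_hat + rng.normal(size=n)
    return y_hat, y

yhat_tr, y_tr = simulate(f_theta, noise_scale=sigma)   # training distribution D(f_theta)
yhat_te, y_te = simulate(f_phi)                        # test distribution D(f_phi)

h_with = LinearRegression().fit(np.column_stack([X, yhat_tr]), y_tr)   # predicts from predictions
h_without = LinearRegression().fit(X, y_tr)                            # performativity-agnostic

rmse = lambda m, Z, y: mean_squared_error(y, m.predict(Z)) ** 0.5
print("with Y_hat as feature:   ", rmse(h_with, np.column_stack([X, yhat_te]), y_te))
print("without Y_hat as feature:", rmse(h_without, X, y_te))
```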
Figure 3: Ablation study of extrapolation performance (curves compare Ŷ included vs. not included as a feature). (a) Varying functional complexity of fθ: we vary mθ; adding Ŷ as a feature helps as soon as fθ is overparameterized with respect to g1 (mθ > 3). (b) Varying degree of noise in fθ: a small amount of noise is sufficient for identifiability. (c) Varying N for randomized fθ: we vary the number of datapoints for training h_SL.

We start by illustrating a failure mode of supervised learning in a non-identifiable setting (Proposition 3). To this end, we let fθ be a deterministic linear model fit to the base dataset (fθ(X) ≈ β⊤X). This results in MY not being identifiable from D(fθ). In Figure 2(a) we can see that supervised learning indeed struggles to identify a transferable functional relationship from the training data. The meta model returns h_SL(X, Ŷ) = (1 + α)Ŷ, instead of identifying g, which leads to a high extrapolation error independent of the strength of performativity. While we only show the error for one fφ in Figure 2(a), the error grows with the distance d²_DX(fθ, fφ). In contrast, when the feature Ŷ is not included, the supervised learning strategy returns h_SL(X) = (1 + α)β⊤X. The extrapolation loss of this performativity-agnostic model scales with the strength of performativity (Proposition 1) and is thus strictly smaller than the error of the model that predicts from predictions.

Next, we move to the regime of our identification results (Propositions 4–6). To this end, we modify the way the predictions in the training data are generated. In Figure 2(b) we use additive Gaussian noise to determine the predictions as Ŷ = fθ(X) + η with η ∼ N(0, σ²). In Figure 2(c) we augment the input to fθ with second-degree polynomial features to achieve overparameterization. In Figure 2(d) we round the predictions of fθ to obtain discrete values. In all three cases, including Ŷ as a feature is beneficial and allows the model to match in-distribution accuracy baselines, closing the extrapolation gap that is inevitable for performativity-agnostic prediction.

4.2 Strength of incongruence and finite samples

We next conduct an ablation study and investigate how the degree of overparameterization and the noise level for randomized fθ impact the extrapolation performance of supervised learning. To this end, we consider the setup in (10) with a general function g1. We fix the level of performativity at α = 0.5 for this experiment. We optimize h_SL in (8) over H (which we vary).

In Figure 3(a) we investigate the effect of overparameterization of fθ on the extrapolation error of h_SL. We choose fully connected neural networks with a single hidden layer to represent the functions g1, fθ, and h_SL. For g1 and H we take a neural network with m = 3 units in the hidden layer. The model g1 is fit to the original dataset. We vary the number of units in the hidden layer of fθ, denoted mθ. As expected, the extrapolation error decreases with the complexity of fθ. As soon as mθ > m there is a significant benefit to including predictions as features. In this regime, MY becomes identifiable, as Proposition 5 suggests. In turn, without access to Ŷ the model suffers an inevitable extrapolation gap due to a concept shift that is independent of the properties of fθ.
In Figure 3(b) we investigate the effect of the magnitude of additive noise added to the predictions. Here H and g1 are linear functions. We have Ŷ = fθ(X) + βη with η ∼ N(0, 1) and we vary the noise level β. We see that even small amounts of noise are sufficient for identification, and adding Ŷ as a feature to our meta machine learning model is effective as soon as the noise in fθ is non-zero. In Figure 3(c) we fix the noise level at 0.5 and vary the number of samples N. We find that only moderate dataset sizes are necessary for predicting from predictions to approximate MY in our identifiable settings.

5 Discussion

This paper focused on identifying the causal effect of predictions on outcomes from observational data. We point out several natural situations where this causal question can be answered, but we also highlight situations where observational data is not sufficiently informative to reason about performative effects. By establishing a connection between causal identifiability and the feasibility of anticipating performative effects using data-driven techniques, this paper contributes to a better understanding of the suitability of supervised learning techniques for explaining social effects arising from the deployment of predictive models in economically and socially relevant applications.

We hope the positive results in this work serve as a message for data collection: only if predictions are observed can they be incorporated to anticipate the performative effects of future model deployments. Thus, access to this information is crucial for an analyst hoping to understand the effects of deployed predictive models, an engineer hoping to foresee consequences of model updates, or a researcher studying performative phenomena. To date, such data is scarcely available in benchmark datasets, hindering the progress towards a better understanding of performative effects, essential for the reliable deployment of algorithmic systems in the social world. At the same time, we have shown that the deterministic nature of prediction poses unique challenges for causal identifiability even if Ŷ is observed. Thus, the success of observational designs (as shown in our empirical investigations) is closely tied to the corresponding identifiability conditions being satisfied. Our results must not be understood as a green light to justify the use of supervised learning techniques to address performativity in full generality beyond the scope of our theoretical results.

Limitations and Extensions. The central assumption of our work is the causal model in Figure 1. While it carves out a rich and interesting class of performative prediction problems that allows us to articulate the challenges of covariates and predictions being coupled, it cannot account for all mechanisms of performativity. This in turn gives rise to interesting questions for follow-up studies.

A first neglected aspect is performativity through social influence. Our causal model relies on the stable unit treatment value assumption (SUTVA) [23]. There is no possibility for the prediction of one individual to impact the outcome of his or her peers. Such an individualistic perspective is not unique to our paper but prevalent in existing causal analyses and model-based approaches to performative prediction and strategic classification [e.g., 20, 25, 43, 3, 18, 22]. Spillover effects [cf. 60, 64, 1, 40] are yet unexplored in the context of performative prediction.
Nevertheless, they have important implications for how causal effects should be estimated and interpreted. In the context of our work they imply that an intervention on f can no longer be explained solely by changing an individual's prediction. As a result, approaches for microfounding performative effects based on models learned from simple, unilateral interventions on an individual's prediction result in different causal estimates than supervised-learning-based methods for identification as studied in this work. A preliminary study included in Appendix C shows that data-driven techniques can pick up on interference patterns in the data and benefit from structural properties such as network homophily [19], whereas individualistic modeling misses out on the indirect component arising from neighbors influencing each other.

A second aspect is performativity in non-causal prediction. Our model posits that prediction is solely based on features X that are causal for the outcome Y. This is a desirable situation in many practical applications because causal predictions disincentivize gaming by strategic individuals manipulating their features [43, 3] and offer explanations for the outcome that persist across environments [54, 7]. Nevertheless, non-causal variables are often included as input features in practical machine learning prediction tasks. Establishing a better understanding of the implications of the resulting causal dependencies due to performativity could be an important direction for future work.

Finally, performative effects can also lead to covariate shift and impact the joint distribution P(X, Y) = P(Y | X)P(X) over covariates and labels. We assumed that performative effects only surface in P(Y | X). For our theoretical results, this implied that overlap in the X variable across environments is trivially satisfied, which enabled us to pinpoint the challenges of learning performative effects due to the coupling between X and Ŷ. For establishing identification in the presence of a causal arrow fθ → X, additional steps are required to ensure identifiability.

Acknowledgement

The authors would like to thank Moritz Hardt and Lydia Liu for many helpful discussions throughout the development of this project, Tijana Zrnic, Krikamol Muandet, Jacob Steinhardt, Meena Jagadeesan and Juan Perdomo for feedback on the manuscript, and Gary Cheng for helpful discussions on differential privacy. We are also grateful for a constructive discourse and valuable feedback provided by the reviewers that greatly helped improve the manuscript.

References

[1] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4), 2017.
[2] Susan Athey. Machine learning and causal inference for policy evaluation. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 5–6, 2015.
[3] Yahav Bechavod, Katrina Ligett, Steven Wu, and Juba Ziani. Gaming helps! Learning from strategic interactions in natural dynamics. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 1234–1242. PMLR, 2021.
[4] Joël Berger, Margit Osterloh, and Katja Rost. Focal random selection closes the gender gap in competitiveness. Science Advances, 6(47), 2020.
[5] Joël Berger, Margit Osterloh, Katja Rost, and Thomas Ehrmann. How to prevent leadership hubris? Comparing competitive selections, lotteries, and their combination. The Leadership Quarterly, 31(5):101388, 2020.
[6] Xander M. Bezuijen, Peter T. van den Berg, Karen van Dam, and Henk Thierry. Pygmalion and employee learning: The role of leader behaviors. Journal of Management, 35(5):1248–1267, 2009.
[7] Peter Bühlmann. Invariance, causality and robustness. arXiv, abs/1812.08233, 2018.
[8] Michael Brückner, Christian Kanzow, and Tobias Scheffer. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 13(85):2617–2654, 2012.
[9] Flavio Chierichetti, Silvio Lattanzi, and Alessandro Panconesi. Rumor spreading in social networks. In Automata, Languages and Programming, 2009. ISBN 978-3-642-02930-1.
[10] Mina Cikara, Matthew M. Botvinick, and Susan T. Fiske. Us versus them: Social identity shapes neural responses to intergroup competition and harm. Psychological Science, 22(3):306–313, 2011.
[11] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34, 2021. Code available at: https://github.com/zykls/folktables.
[12] Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions. arXiv, abs/2011.11173, 2020.
[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284, 2006.
[14] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, 2012.
[15] Dean Eckles, Nikolaos Ignatiadis, Stefan Wager, and Han Wu. Noise-induced randomization in regression discontinuity designs. arXiv, abs/2004.09458, 2020.
[16] Dov Eden. Leadership and expectations: Pygmalion effects and other self-fulfilling prophecies in organizations. The Leadership Quarterly, 3(4):271–305, 1992.
[17] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 2014.
[18] Ganesh Ghalme, Vineet Nair, Itay Eilat, Inbal Talgam-Cohen, and Nir Rosenfeld. Strategic classification in the dark. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 3672–3681. PMLR, 2021.
[19] Paul Goldsmith-Pinkham and Guido W. Imbens. Social networks and the identification of peer effects. Journal of Business & Economic Statistics, 31(3):253–264, 2013.
[20] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS '16, pages 111–122, 2016.
[21] Moritz Hardt, Meena Jagadeesan, and Celestine Mendler-Dünner. Performative power. arXiv, abs/2203.17232, 2022.
[22] Keegan Harris, Dung Daniel T Ngo, Logan Stapleton, Hoda Heidari, and Steven Wu. Strategic instrumental variable regression: Recovering causal relationships from strategic responses. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 8502–8522. PMLR, 2022.
[23] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
[24] Zachary Izzo, Lexing Ying, and James Zou. How to learn when data reacts to your model: Performative gradient descent. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4641–4650, 2021.
[25] Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4687–4697, 2021.
[26] Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 9760–9785, 2022.
[27] Kaggle. Give me some credit, 2011. URL https://www.kaggle.com/competitions/GiveMeSomeCredit/data.
[28] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: From counterfactual explanations to interventions. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 353–362, 2021.
[29] Adam D. I. Kramer, Jamie Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 2014.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.
[31] Bogdan Kulynych. Causal prediction can induce performative stability. In ICML 2022 Workshop on Spurious Correlations, Invariance and Stability, 2022.
[32] Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. Comparison-based inverse classification for interpretability in machine learning. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations, pages 100–111, 2018.
[33] David S. Lee and Thomas Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48(2), 2010.
[34] Scott Levin, Matthew Toerper, Eric Hamrock, Jeremiah S. Hinson, Sean Barnes, Heather Gardner, Andrea Dugas, Bob Linton, Tom Kirsch, and Gabor Kelen. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the emergency severity index. Annals of Emergency Medicine, 71(5):565–574.e2, 2018.
[35] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 297–306, 2011.
[36] Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 929–934, 2015. ISBN 9781450334730.
[37] Yang Liu, Yatong Chen, and Jiaheng Wei. Induced domain adaptation. arXiv, abs/2107.05911, 2021.
[38] Donald MacKenzie. An engine, not a camera: How financial models shape markets. MIT Press, 2008.
[39] Nikhil Malik. Does machine learning amplify pricing errors in housing market? Economics of ML feedback loops. Information Systems & Economics eJournal, 2020.
[40] Charles F. Manski. Identification of endogenous social effects: The reflection problem. The Review of Economic Studies, 60(3):531–542, 1993.
[41] Gustavo Manso. Feedback effects of credit ratings. Journal of Financial Economics, 109(2):535–548, 2013.
[42] Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33, pages 4929–4939, 2020.
[43] John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 6917–6926, 2020.
[44] John P Miller, Juan C Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of PMLR, pages 7710–7720, 2021.
[45] Adhyyan Narang, Evan Faulkner, Dmitriy Drusvyatskiy, Maryam Fazel, and Lillian J. Ratliff. Multiplayer performative prediction: Learning in decision-dependent games. arXiv, abs/2201.03398, 2022.
[46] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010. doi: 10.1109/TKDE.2009.191.
[47] Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
[48] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
[49] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[51] Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning, pages 7599–7609, 2020.
[52] Georgios Piliouras and Fang-Yi Yu. Multi-agent performative prediction: From global stability and optimality to chaos. arXiv, abs/2201.10483, 2022.
[53] Aahlad Puli, Adler Perotte, and Rajesh Ranganath. Causal estimation with functional confounders. Advances in Neural Information Processing Systems, 33:5115–5125, 2020.
[54] Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19:1309–1342, 2018.
[55] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983. ISSN 00063444.
[56] Robert Rosenthal and Lenore Jacobson. Pygmalion in the classroom. The Urban Review, 1968.
[57] Donald B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980. ISSN 01621459.
[58] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML '16, pages 1670–1679, 2016.
[59] Yonadav Shavit, Benjamin Edelman, and Brian Axelrod. Causal strategic linear regression. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 8676–8686, 2020.
[60] Michael E Sobel. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398–1407, 2006.
[61] Gloria B. Solomon, David A. Striegel, John F. Eliot, Steve N. Heon, Jana L. Maas, and Valerie K. Wayda. The self-fulfilling prophecy in college basketball: Implications for effective coaching. Journal of Applied Sport Psychology, 8(1):44–59, 1996.
[62] George Soros. The Alchemy of Finance. John Wiley & Sons, 2015.
[63] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning – Volume 37, ICML '15, pages 814–823, 2015.
[64] Eric J Tchetgen and Tyler J VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012.
[65] Stratis Tsirtsis and Manuel Gomez Rodriguez. Decisions, counterfactual explanations and strategic behavior. In Advances in Neural Information Processing Systems, volume 33, pages 16749–16760, 2020.
[66] Stefan Wager, Nick Chamandy, Omkar Muralidharan, and Amir Najmi. Feedback detection for live predictors. In Proceedings of the 27th International Conference on Neural Information Processing Systems – Volume 2, NIPS '14, pages 3428–3436. MIT Press, 2014.
[67] Yixin Wang and David M. Blei. The blessings of multiple causes. Journal of the American Statistical Association, 114(528):1574–1596, 2019.
[68] Killian Wood, Gianluca Bianchin, and Emiliano Dall'Anese. Online projected gradient descent for stochastic optimization with decision-dependent distributions. IEEE Control Systems Letters, 6:1646–1651, 2022.

6 Paper checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix F.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]