# Plug-in Performative Optimization

Licong Lin (1), Tijana Zrnic (2)

(1) Department of Statistics, University of California, Berkeley, USA. (2) Stanford Data Science and Department of Statistics, Stanford University, USA. Correspondence to: Licong Lin, Tijana Zrnic.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. When predictions are performative, the choice of which predictor to deploy influences the distribution of future observations. The overarching goal in learning under performativity is to find a predictor that has low performative risk, that is, good performance on its induced distribution. One family of solutions for optimizing the performative risk, including bandits and other derivative-free methods, is agnostic to any structure in the performative feedback, leading to exceedingly slow convergence rates. A complementary family of solutions makes use of explicit models for the feedback, such as best-response models in strategic classification, enabling faster rates. However, these rates critically rely on the feedback model being correct. In this work we study a general protocol for making use of possibly misspecified models in performative prediction, called plug-in performative optimization. We show that this solution can be far superior to model-agnostic strategies, as long as the misspecification is not too extreme. Our results support the hypothesis that models, even if misspecified, can indeed help with learning in performative settings.

1. Introduction

Predictions have the power to influence the patterns they aim to predict. For example, stock price predictions inform trading decisions and hence prices; traffic predictions influence routing decisions and thus traffic outcomes; recommendations shape users' consumption and thus preferences. This pervasive phenomenon has been formalized in a framework called performative prediction (Perdomo et al., 2020). A central feature that distinguishes the framework from traditional supervised learning is the concept of a distribution map $\mathcal{D}(\cdot)$. This object, aimed at capturing the feedback from predictions to future observations, is a mapping from predictors $f_\theta$ to their induced data distributions $\mathcal{D}(\theta)$. The main goal in performative prediction is thus to deploy a predictor $f_\theta$ that will have good performance after deployment, that is, on its induced distribution $\mathcal{D}(\theta)$. Formally, the goal is to choose predictor parameters $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$ so as to minimize the performative risk:

$$\mathrm{PR}(\theta) = \mathbb{E}_{z \sim \mathcal{D}(\theta)}[\ell(z; \theta)],$$

where $\ell(z; \theta)$ is the loss incurred by predicting on instance $z$ with model $\theta$. Typically, $z$ is a feature-outcome pair $(x, y)$. We refer to $\theta_{\mathrm{PO}} = \arg\min_{\theta \in \Theta} \mathrm{PR}(\theta)$ as the performative optimum.

The main challenge in optimizing the performative risk lies in the fact that the map $\mathcal{D}(\cdot)$ is not known. We only observe samples from $\mathcal{D}(\theta)$ for models $\theta$ that have been deployed; we do not observe any feedback for the (typically infinitely many) other models. A key discriminating factor between existing solutions for optimizing under performativity is how they cope with this uncertainty.

One group of methods accounts for the feedback without assuming a problem-specific structure for it. This group includes bandit strategies (Kleinberg et al., 2008; Jagadeesan et al., 2022) and derivative-free optimization (Flaxman et al., 2004; Miller et al., 2021).
These methods converge to optima at typically slow convergence rates; without convexity, the rates can even be exponentially slow. Moreover, their rates rely on regularity conditions that are out of the learner's control, such as convexity of the performative risk (Miller et al., 2021; Izzo et al., 2021; Dong et al., 2018) or bounded performative effects (Jagadeesan et al., 2022).

A complementary group of methods, an important starting point for this work, takes feedback into account by positing explicit models for it. Such models include best-response models for strategic classification (Hardt et al., 2016; Jagadeesan et al., 2021; Levanon & Rosenfeld, 2021; Ghalme et al., 2021), rational-agent models in economics (Spence, 1978; Wooldridge, 2003), and parametric distribution shifts (Izzo et al., 2021; Miller et al., 2021; Izzo et al., 2022), among others. To argue that methods building on models find optimal solutions, existing analyses assume that the model is well-specified. However, models of social behavior are widely acknowledged to be simplistic representations of real-world dynamics. Yet, despite the unavoidable misspecification of models, they are ubiquitous in practice. Though their simplicity leads to misspecification, it also allows for efficient, interpretable, and practical solutions. Motivated by this observation, in this work we ask: can models for performative feedback be useful, even if misspecified?

1.1. Our Contribution

We initiate a study of the benefits of modeling feedback in performative prediction. We show that models, even if misspecified, can indeed help with learning under performativity.

We begin by defining a general protocol for performative optimization with feedback models, which we call plug-in performative optimization. The protocol consists of three steps. First, the learner deploys models $\theta_i \sim \tilde{\mathcal{D}}$ and collects data $z_i \sim \mathcal{D}(\theta_i)$, $i \in [n]$. Here, $\tilde{\mathcal{D}}$ is an exploration distribution of the learner's choosing (for example, it can be uniform on $\Theta$ when $\Theta$ is bounded). The second step is to use the observations $\{(\theta_i, z_i)\}_{i=1}^n$ to fit an estimate of the distribution map. The map is chosen from a parametric family of possible maps $\mathcal{D}_B = \{\mathcal{D}_\beta\}_{\beta \in B}$, obtained through modeling. The estimation of the map thus reduces to computing an estimate $\hat\beta$. For example, in strategic classification, $\beta$ could be a parameter quantifying the strategic agents' tradeoff between utility and cost. Finally, the third step is to compute the plug-in performative optimum:

$$\hat\theta_{\mathrm{PO}} = \arg\min_{\theta \in \Theta} \mathrm{PR}_{\hat\beta}(\theta) = \arg\min_{\theta \in \Theta} \mathbb{E}_{z \sim \mathcal{D}_{\hat\beta}(\theta)}[\ell(z; \theta)].$$

We prove a general excess-risk bound on $\mathrm{PR}(\hat\theta_{\mathrm{PO}}) - \mathrm{PR}(\theta_{\mathrm{PO}})$, showing that the error decomposes into two terms. The first is a misspecification error term, MisspecErr, which captures the gap between the true performative risk and the plug-in performative risk $\mathrm{PR}_{\hat\beta}(\theta)$ in the large-sample regime. This term is irreducible and does not vanish as the sample size $n$ grows. The second is a statistical error term that captures the imperfection in fitting $\hat\beta$ due to finite samples. For a broad class of problems, our main result can be summarized as follows.

Theorem 1.1 (Informal). The excess risk of the plug-in performative optimum is bounded by:

$$\mathrm{PR}(\hat\theta_{\mathrm{PO}}) - \mathrm{PR}(\theta_{\mathrm{PO}}) \le c \cdot \mathrm{MisspecErr} + O\!\left(\tfrac{1}{\sqrt{n}}\right),$$

for some universal constant $c > 0$.

Therefore, although the misspecification error is irreducible, the statistical error vanishes at a fast rate.
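To get a feel for the size of this gap, the following back-of-the-envelope calculation compares the parametric $O(1/\sqrt{n})$ statistical rate with the $O(n^{-1/(d_\theta+1)})$ rate of the bandit baseline discussed next. It is a minimal sketch: the target accuracy and dimension are illustrative choices of ours, and all constants are ignored.

```python
import math

# Illustrative choices (ours, not the paper's): target statistical error and dimension of theta.
target, d_theta = 0.01, 10

# Parametric plug-in rate ~ n^(-1/2): need n on the order of target^(-2).
n_plugin = math.ceil(target ** -2)

# Model-agnostic bandit rate ~ n^(-1/(d_theta + 1)): need n on the order of target^(-(d_theta + 1)).
n_bandit = math.ceil(target ** -(d_theta + 1))

print(f"plug-in:        n ~ {n_plugin:.0e}")   # about 1e+04
print(f"model-agnostic: n ~ {n_bandit:.0e}")   # about 1e+22
```

Hidden constants matter in practice, but the qualitative gap is the point.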
In contrast, model-agnostic strategies such as bandit algorithms (Kleinberg et al., 2008; Jagadeesan et al., 2022) do not suffer from misspecification but have an exceedingly slow, often exponentially slow, statistical rate. For example, the bandit algorithm of Jagadeesan et al. (2022) has an excess risk of $O(n^{-1/(d_\theta+1)})$. This is why feedback models are useful: for a finite $n$, their excess risk can be far smaller than the risk of a model-agnostic strategy due to the rapidly vanishing statistical rate. The statistical rate is fast because it only depends on the parametric estimation rate of $\hat\beta$; it does not depend on the complexity of $\mathrm{PR}$.

One important case of performative prediction is strategic classification. We apply our general theory to common best-response models in strategic classification. We also conduct numerical evaluations that confirm our theoretical findings. Overall, our results support the use of models in optimization under performative feedback.

1.2. Related Work

Performative prediction. We build on the growing body of work studying performative prediction (Perdomo et al., 2020). Existing work studies different variants of retraining (Perdomo et al., 2020; Mendler-Dünner et al., 2020; Drusvyatskiy & Xiao, 2022), which converge to so-called performatively stable solutions, as well as methods for finding performative optima (Miller et al., 2021; Izzo et al., 2021; Jagadeesan et al., 2022). The methods in the latter category are largely model-agnostic and as such converge at slow rates. Exceptions include the study of parametric distribution shifts (Izzo et al., 2021; 2022) and location families (Miller et al., 2021; Jagadeesan et al., 2022), but those analyses crucially rely on the model being well-specified. We are mainly interested in settings where $\mathcal{D}(\theta)$ is a general black box. Other work in performative prediction includes the study of time-varying shifts (Brown et al., 2022; Izzo et al., 2022; Li & Wai, 2022; Ray et al., 2022), multi-agent settings (Dean et al., 2022; Qiang et al., 2022; Narang et al., 2022; Piliouras & Yu, 2023), and causality and robustness (Maheshwari et al., 2022; Mendler-Dünner et al., 2022; Kim & Perdomo, 2023); it would be valuable to extend our theory on the use of models to those settings.

Strategic classification and economic modeling. Strategic classification (Hardt et al., 2016; Dong et al., 2018; Levanon & Rosenfeld, 2021; Zrnic et al., 2021), as well as other problems studying strategic agent behavior, frequently uses models of agent behavior in order to compute Stackelberg equilibria, which are direct analogues of performative optima. However, convergence to Stackelberg equilibria assumes correctness of the models, a challenge we circumvent in this work. We use strategic classification as a primary domain of application of our general theory.

Statistics under model misspecification. Our work is partially inspired by works in statistics studying the benefits and impact of modeling, including under misspecification (White, 1980; 1982; Buja et al., 2019a;b). At a technical level, our results are related to M-estimation (Van der Vaart, 2000; Geer, 2000; Mou et al., 2019) and semiparametric statistics (Tsiatis, 2007; Kennedy, 2022).

Zeroth-order optimization. Plug-in performative optimization serves as an alternative to black-box baselines for zeroth-order optimization, which have previously been studied in performative prediction.
These include bandit algorithms (Kleinberg et al., 2008; Jagadeesan et al., 2022) and zeroth-order convex optimization algorithms (Flaxman et al., 2004; Miller et al., 2021). As mentioned, we show that the use of models can give far smaller excess risk, given the fast convergence rates of parametric learning problems.

2. Plug-in Performative Optimization Protocol

We describe the main protocol at the focus of our study and then instantiate it with an example. We consider the use of parametric models $\mathcal{D}_B := \{\mathcal{D}_\beta\}_{\beta \in B}$ for modeling the true unknown distribution map $\mathcal{D}$, where $B \subseteq \mathbb{R}^{d_\beta}$. We denote $\mathrm{PR}_\beta(\theta) = \mathbb{E}_{z \sim \mathcal{D}_\beta(\theta)}[\ell(z; \theta)]$. Since $\mathcal{D}_B$ is a collection of maps, we call it a distribution atlas. We emphasize that it need not hold that $\mathcal{D} \in \mathcal{D}_B$; the model could be misspecified.

The protocol for plug-in performative optimization proceeds as follows. First, the learner collects pairs of i.i.d. observations $\{(\theta_i, z_i)\}_{i=1}^n$, where $\theta_i$ is deployed according to some exploration distribution $\tilde{\mathcal{D}}$ and $z_i \sim \mathcal{D}(\theta_i)$. The exploration distribution should be dispersed enough to enable capturing varied distributions $\mathcal{D}(\theta_i)$ (e.g., uniform, Gaussian with a full-rank covariance, etc.). Then, the learner estimates the distribution map by fitting $\hat\beta$ based on the collected observations:

$$\hat\beta = \widehat{\mathrm{Map}}((\theta_1, z_1), \dots, (\theta_n, z_n)),$$

where $\widehat{\mathrm{Map}}$ is some model-fitting function. We will consider different criteria for fitting $\hat\beta$. We let $\beta^*$ denote the large-sample limit of $\hat\beta$, $\beta^* = \lim_{n \to \infty} \hat\beta$. Finally, the learner finds the plug-in performative optimum:

$$\hat\theta_{\mathrm{PO}} = \arg\min_{\theta \in \Theta} \mathrm{PR}_{\hat\beta}(\theta) = \arg\min_{\theta \in \Theta} \mathbb{E}_{z \sim \mathcal{D}_{\hat\beta}(\theta)}[\ell(z; \theta)].$$

We summarize the protocol in Algorithm 1.

Algorithm 1: Plug-in performative optimization
Require: distribution atlas $\mathcal{D}_B$, exploration strategy $\tilde{\mathcal{D}}$, loss $\ell(z; \theta)$, map-fitting algorithm $\widehat{\mathrm{Map}}$.
1: Deploy $\theta_i \sim \tilde{\mathcal{D}}$, observe $z_i \sim \mathcal{D}(\theta_i)$, $i \in [n]$.
2: Fit the distribution map: $\hat\beta = \widehat{\mathrm{Map}}((\theta_1, z_1), \dots, (\theta_n, z_n))$, where $\hat\beta \in B$.
3: Compute the plug-in performative optimum: $\hat\theta_{\mathrm{PO}} = \arg\min_{\theta \in \Theta} \mathbb{E}_{z \sim \mathcal{D}_{\hat\beta}(\theta)}[\ell(z; \theta)]$.

Notice that, since $\mathcal{D}_{\hat\beta}$ is known to the learner, we may solve for $\hat\theta_{\mathrm{PO}}$ explicitly in Step 3 of the protocol, without collecting any additional real data. In particular, solving for $\hat\theta_{\mathrm{PO}}$ incurs only computational complexity, not statistical complexity. A detailed discussion on how to execute Step 3 empirically can be found in Appendix C.

A canonical choice of $\widehat{\mathrm{Map}}$ that we will focus on is empirical risk minimization:

$$\hat\beta = \arg\min_{\beta \in B} \sum_{i=1}^n r(\theta_i, z_i; \beta),$$

where $r$ is a loss function. Throughout we will use $\tilde\theta$ and $\tilde z$ to denote draws $\tilde\theta \sim \tilde{\mathcal{D}}$, $\tilde z \sim \mathcal{D}(\tilde\theta)$. Then, $\beta^* = \arg\min_{\beta \in B} \mathbb{E}[r(\tilde\theta, \tilde z; \beta)]$. For example, one can choose $r(\theta, z; \beta) = -\log p_\beta(z; \theta)$ to be the negative log-likelihood, where $p_\beta(\cdot\,; \theta)$ is the density under $\mathcal{D}_\beta(\theta)$, in which case $\hat\beta$ is the maximum-likelihood estimator. Under this choice,

$$\beta^* = \arg\max_{\beta \in B} \mathbb{E}[\log p_\beta(\tilde z; \tilde\theta)] = \arg\min_{\beta \in B} \mathrm{KL}(\bar{\mathcal{D}}, \bar{\mathcal{D}}_\beta).$$

Here, $\bar{\mathcal{D}}$ is the distribution of $\tilde z$, that is, the distribution map $\mathcal{D}(\theta)$ averaged over $\theta \sim \tilde{\mathcal{D}}$. We similarly define $\bar{\mathcal{D}}_\beta$. Therefore, $\beta^*$ is the KL projection of the true data-generating process onto the considered distribution atlas.
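The following sketch puts the three steps of Algorithm 1 together in code. It is a minimal illustration under our own conventions, not the authors' implementation: `deploy_and_sample`, `fit_map` (playing the role of $\widehat{\mathrm{Map}}$), `plug_in_risk`, and `explore` are placeholders supplied by the learner, and Step 3 is solved by brute-force grid search for simplicity.

```python
import numpy as np

def plug_in_performative_optimization(deploy_and_sample, fit_map, plug_in_risk,
                                      explore, theta_grid, n, rng):
    """Minimal sketch of Algorithm 1.

    deploy_and_sample(theta, rng) -> z         draw z ~ D(theta) from the real world
    fit_map(thetas, zs)           -> beta_hat  map-fitting algorithm (e.g., ERM over the atlas)
    plug_in_risk(theta, beta_hat) -> float     PR_{beta_hat}(theta), in closed form or by simulation
    explore(rng)                  -> theta     draw from the exploration distribution
    """
    # Step 1: deploy exploratory models and collect performative feedback.
    thetas = [explore(rng) for _ in range(n)]
    zs = [deploy_and_sample(theta, rng) for theta in thetas]

    # Step 2: fit the distribution map within the chosen atlas.
    beta_hat = fit_map(np.asarray(thetas), np.asarray(zs))

    # Step 3: minimize the plug-in performative risk; no further real data is needed here,
    # so any optimizer can be used (grid search for simplicity).
    risks = [plug_in_risk(theta, beta_hat) for theta in theta_grid]
    return theta_grid[int(np.argmin(risks))], beta_hat
```

In Example 1 below, for instance, `fit_map` would be a one-dimensional least-squares fit and `plug_in_risk` has a closed form.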
Example 1 (Biased coin flip). To build intuition for the introduced concepts, we consider an illustrative example. Suppose we want to predict the outcome of a biased coin flip, where the bias arises due to performative effects. The outcome is generated as $z \sim \mathcal{D}(\theta) = \mathrm{Bern}(0.5 + \mu\theta + \eta\theta^2)$, where $\mu \in (0, 0.5)$ and $\eta \in (0, 0.5 - \mu)$. The parameter $\theta \in [0, 1]$ aims to predict the outcome while minimizing the squared loss, $\ell(z; \theta) = (z - \theta)^2$. Suppose that we know that $\theta$ introduces a bias to the coin flip, but we do not know how strongly or in what way. We thus choose a simple model for the bias, $\mathcal{D}_\beta(\theta) = \mathrm{Bern}(0.5 + \beta\theta)$, and fit $\beta$ in a data-driven way. To do so, we deploy $\theta_i \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0, 1]$ and observe $z_i \sim \mathcal{D}(\theta_i)$, for $i \in [n]$. One natural way to fit the distribution map is to solve

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n (z_i - 0.5 - \beta\theta_i)^2.$$

Finally, we compute the plug-in performative optimum as

$$\hat\theta_{\mathrm{PO}} = \arg\min_{\theta \in [0,1]} \mathbb{E}_{z \sim \mathcal{D}_{\hat\beta}(\theta)}[(z - \theta)^2] = \frac{1 - \hat\beta}{2(1 - 2\hat\beta)},$$

provided this value lies in $[0, 1]$. It is not hard to show that the population limit of $\hat\beta$ is equal to $\beta^* = \mu + 0.75\eta$. Therefore, if the feedback model is well-specified, meaning $\eta = 0$, then $\beta^*$ indeed recovers the true distribution map, and $\hat\theta_{\mathrm{PO}}$ converges to the true performative optimum.

3. Excess Risk

We study the excess risk of plug-in performative optimization. The key takeaway of this section is that the excess risk depends on two sources of error: one is the misspecification error due to the fact that, often, $\mathcal{D} \notin \mathcal{D}_B$; the other is the statistical error due to the gap between $\hat\beta$ and $\beta^*$. Formally, define

$$\mathrm{MisspecErr} = \sup_{\theta \in \Theta} |\mathrm{PR}_{\beta^*}(\theta) - \mathrm{PR}(\theta)|, \qquad \mathrm{StatErr}_n = \sup_{\theta \in \Theta} |\mathrm{PR}_{\beta^*}(\theta) - \mathrm{PR}_{\hat\beta}(\theta)|.$$

We note that the statistical error depends on the sample size $n$, while the misspecification error is irreducible even in the large-sample limit. In later sections we will show that the statistical error vanishes at a fast rate, namely $O(1/\sqrt{n})$, for a broad class of problems. In Theorem 3.1 we state a general bound on the excess risk in terms of these two sources of error.

Theorem 3.1. The excess risk of the plug-in performative optimum is bounded by:

$$\mathrm{PR}(\hat\theta_{\mathrm{PO}}) - \mathrm{PR}(\theta_{\mathrm{PO}}) \le 2\,(\mathrm{MisspecErr} + \mathrm{StatErr}_n).$$

Theorem 3.1 illuminates the benefits of feedback models: if the model is a reasonable approximation, the misspecification error is not too large; at the same time, due to the parametric specification of the distribution atlas, the statistical error vanishes quickly. Therefore, we conclude that even misspecified models can lead to lower excess risk than entirely model-agnostic strategies such as bandit algorithms.

Remark 3.2. It should be noted that there may be numerical inaccuracy in solving for $\hat\theta_{\mathrm{PO}}$ in Step 3 of Algorithm 1. However, the bound of Theorem 3.1 degrades smoothly: if a $\delta$-suboptimal solution is attained, then the excess risk increases by at most $\delta$. As mentioned before, $\delta$ does not depend on $n$; it only depends on the amount of computation.

In the rest of this section we give fine-grained bounds on the misspecification error and the statistical error under appropriate regularity assumptions, providing intuition via examples along the way. The most natural way to bound the misspecification error is in terms of a distributional distance between the true distribution map $\mathcal{D}(\theta)$ and the modeled distribution map $\mathcal{D}_{\beta^*}(\theta)$. We define the misspecification of a distribution atlas.

Definition 3.3 (Misspecification). We say that a distribution atlas is $\eta$-misspecified in distance dist if, for all $\theta \in \Theta$, it holds that $\mathrm{dist}(\mathcal{D}_{\beta^*}(\theta), \mathcal{D}(\theta)) \le \eta$.

We will measure misspecification in either total-variation distance or Wasserstein (i.e., earth mover's) distance. Depending on the problem setting, one of the two distances will yield a smaller misspecification parameter and thus a tighter rate according to Theorem 3.1. We will also require that the atlas is smooth in the analyzed distance.

Definition 3.4 (Smoothness). We say that a distribution atlas is $\epsilon$-smooth in distance dist if, for all $\beta, \beta' \in B$ and $\theta \in \Theta$, it holds that $\mathrm{dist}(\mathcal{D}_\beta(\theta), \mathcal{D}_{\beta'}(\theta)) \le \epsilon \|\beta - \beta'\|$.

Unless stated otherwise, $\|\cdot\|$ denotes the $\ell_2$-norm.
In some examples the parameter of the atlas will be a matrix, in which case the norm will be the operator norm op. It is important to note that, while the misspecification parameter is a property of the true distribution map, the smoothness parameter is entirely in the learner s control, as it is solely a property of the chosen distribution atlas. In what follows, Section 3.1 and Section 3.2 focus on bounding the misspecification error. Section 3.3 focuses on bounding the statistical error with an explicit rate. 3.1. Total-Variation Misspecification First we consider misspecification in total-variation (TV) distance. Building on Theorem 3.1, we obtain the following excess-risk bound as a function of TV misspecification. Corollary 3.5. Suppose the distribution atlas is ηTVmisspecified and ϵTV-smooth in TV distance. Moreover, suppose that |ℓ(z; θ)| Bℓand ˆβ β Cn. Then, the excess risk of the plug-in performative optimum is bounded by: PR(ˆθPO) PR(θPO) 4Bℓ ηTV + 4Bℓ ϵTV Cn. Corollary 3.5 shows that plug-in performative optimization is efficient as long as the distribution atlas is smooth, not too misspecified, and the rate of estimation Cn is fast. In Section 3.3 we will prove convergence rates Cn when ˆβ is a sufficiently regular empirical risk minimizer. We now build intuition for the relevant terms in Corollary 3.5 through examples. First we give a couple of examples of distribution atlases and bound their smoothness parameter ϵTV. Example 2 (Mixture model). Suppose that we have k candidate distribution maps {D(i)(θ)}k i=1. We would like to find a combination of these maps that approximates the true map D(θ) as closely as possible. To do so, we can define Dβ(θ) = Pk i=1 βi D(i)(θ), where β [0, 1]k defines the mixture weights. This model is smooth in TV distance: TV(Dβ(θ), Dβ (θ)) 1 Plug-in Performative Optimization Example 3 (Self-fulfilling prophecy). Suppose that we want to model outcomes that follow a self-fulfilling prophecy, meaning that predicting a certain outcome makes it more likely for the outcome to occur. Assume we have historical data of feature label pairs before the model was deployed. Denote the resulting empirical distribution by D0 = DX 0 DY |X 0 . We assume the features are nonperformative; only the labels exhibit performative effects. Then, we can model the label distribution as DY |X β (θ) = (1 β)DY |X 0 +βδfθ(X), where δfθ(X) denotes a point mass at the predicted label. Here, β [0, 1] tunes the strength of performativity: β = 0 implies no performativity, while β = 1 a perfect self-fulfilling prophecy. This atlas has TV-smoothness equal to ϵTV = 1. Next, we describe a general type of misspecification that implies a bound on ηTV. Example 4 ( Typically well-specified model). Suppose that the data distribution consists of observations about strategic individuals. Suppose that a (1 p)-fraction of the population is rational and acts in a predictable fashion. The remaining p-fraction acts arbitrarily. Then, if we model the predictable behavior appropriately, meaning Dβ (θ) follows the distribution produced by the rational agents, the misspecification parameter ηTV is at most p. More generally, if we have D(θ) = (1 p)Dβ (θ) + p D(θ), where D(θ) is an arbitrary component, then ηTV p. 3.2. Wasserstein Misspecification We next consider misspecification in Wasserstein (i.e. earth mover s) distance. Building on Theorem 3.1, we bound the excess-risk via Wasserstein misspecification. Corollary 3.6. 
Suppose that the distribution atlas is ηW - misspecified and ϵW -smooth in Wasserstein distance. Moreover, suppose that the loss ℓ(z; θ) is Lz-Lipschitz in z, and that ˆβ β Cn. Then, the excess risk of the plug-in performative optimum is bounded by: PR(ˆθPO) PR(θPO) 2Lz ηW + 2Lz ϵW Cn. As in Corollary 3.5, we see that the excess risk of the plug-in performative optimum is small as long as the distribution atlas is smooth, not overly misspecified, and the rate Cn is sufficiently fast. Below we give an example of a natural distribution atlas and characterize its Wasserstein smoothness. Example 5 (Performative outcomes). As in Example 3, suppose that we have data of feature outcome pairs before any model deployment, and suppose that only the outcomes are performative while the features are static. Let D0 denote the historical data distribution. We assume that a predictor θ affects the outcomes only through its predictions fθ(x). One simple way to model such feedback is via an additive effect on the outcomes. Formally, we define (x, y) Dβ(θ) (x0, y0) D0, x = x0, y = y0 + β fθ(x). As in Example 3, β R controls the strength of performativity. This atlas is ϵW -smooth in Wasserstein distance for ϵW = supθ Ex DX 0 [|fθ(x)|]. One way that misspecification can arise is due to omittedvariable bias. We illustrate this in the following example and explicitly bound the misspecification parameter ηW . Example 6 (Omitted-variable bias). Suppose that only a subset of the coordinates I [d] of θ induce performative effects. This can happen in linear or logistic regression, where the coordinates of θ measure feature importance, but only a subset of the features are manipulable. Specifically, assume the data follows a location family model: z D(θ) z = z0 + MθI, where M Rdz |I| is a true parameter of the shift and z0 is a zero-mean draw from a base distribution D0. Suppose the model omits one performative coordinate by mistake: z DM(θ) z = z0 + MθI , where I = I \ {imiss} and M Rdz |I |. If c M is fit via least-squares, c M = arg min M 1 n Pn i=1 zi Mθi,I 2, and D is a product distribution, then the population-level counterpart of c M is equal to M = MI , which denotes the restriction of M to the columns indexed by I . Putting everything together, we can conclude that the misspecification due to the omitted coordinate is ηW = supθ W(D(θ), DM (θ)) B Mimiss , where we assume the imiss-coordinate of |θ| is at most B. 3.3. Bounding the Estimation Error We saw that the statistical error is driven by the estimation rate of β . We show that for a broad class of problems the rate is O(n 1 2 ). We focus on map-fitting via empirical risk minimization (ERM): ˆβ = arg min β B i=1 r(θi, zi; β), (1) where B Rdβ is bounded and convex. We denote r(β) = E[r( θ, z; β)], and note that β = arg minβ B r(β). To establish the error bound, we rely on the following assumptions. Assumption 3.7. ERM problem (1) is regular if: (a) r(β) is convex and µ-strongly convex at β , and the Hessian 2r(β) is σr-Lipschitz; (b) β B, the gradient r( θ, z; β) is a subexponential vector with parameter Br > 0; (c) u, u Sdβ 1, supβ B u 2r( θ, z; β)u is subexponential with parameter Lr > 0. Plug-in Performative Optimization We will see that Assumption 3.7 is satisfied in many important settings. Condition (a) is fairly standard; conditions (b,c) seem less standard because we want to make them realistic for problems with diverging dimension. They immediately hold if r(θ, z; β), 2r(θ, z; β) are bounded, which are standard conditions. 
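As a concrete illustration of ERM-based map fitting, the following small simulation revisits the omitted-variable setting of Example 6 with toy parameters of our own choosing: the true shift acts on two coordinates of $\theta$, the fitted model keeps only one of them, and $\widehat M$ is obtained by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_theta, d_z = 20000, 3, 2
I = [0, 1]          # coordinates of theta that are truly performative
I_reduced = [0]     # the model mistakenly omits coordinate 1

M_true = np.array([[1.0, 0.5],
                   [0.3, -0.2]])             # d_z x |I|, our toy choice

# Exploration: product (independent) distribution over Theta, here Unif[-1, 1]^d_theta.
thetas = rng.uniform(-1.0, 1.0, size=(n, d_theta))
z0 = rng.normal(0.0, 0.2, size=(n, d_z))     # zero-mean base noise
zs = z0 + thetas[:, I] @ M_true.T            # z = z0 + M * theta_I

# Least-squares fit of the (misspecified) reduced model z ~ M_hat * theta_{I_reduced}.
X = thetas[:, I_reduced]
M_hat = np.linalg.lstsq(X, zs, rcond=None)[0].T   # shape d_z x |I_reduced|

print("fitted column:\n", M_hat)
print("true column for the kept coordinate:\n", M_true[:, [0]])
# Because the exploration distribution is a product distribution, M_hat converges to the
# columns of M_true indexed by I_reduced; the leftover misspecification is driven by the
# omitted column M_true[:, 1], as described in Example 6.
```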
The following lemma is the key tool for obtaining our main excess-risk bounds with a fast rate, stated in Theorem 3.9. Lemma 3.8. Assume the map-fitting algorithm is regular (Ass. 3.7) and that n log n C(dβ + log(1/δ)) for a sufficiently large C > 0. Then, for some constant C > 0, with probability 1 δ it holds that dβ + log(1/δ) Theorem 3.9. Assume the map-fitting algorithm is regular (Ass. 3.7) and that n log n C(dβ + log(1/δ)) for a sufficiently large C > 0. If the distribution atlas is ηTV-misspecified and ϵTVsmooth in total-variation distance, and |ℓ(z; θ)| Bℓ, then there exists a C > 0 such that, with probability 1 δ: PR(ˆθPO) PR(θPO) 4BℓηTV + C BℓϵTV p dβ + log(1/δ) If the distribution atlas is ηW -misspecified and ϵW - smooth in Wasserstein distance, and ℓ(z; θ) is Lz Lipschitz in z, then there exists a C > 0 such that, with probability 1 δ: PR(ˆθPO) PR(θPO) 2LzηW + C LzϵW p dβ + log(1/δ) The excess risk is thus bounded by the sum of a term due to misspecification and a fast statistical rate. To contrast this with a model-agnostic rate, the bandit algorithm of (Jagadeesan et al., 2022) does not suffer from misspecification but has an exponentially slow excess risk, O(n 1/(dθ+1)). The analysis underlying Theorem 3.9 is based on standard ERM analyses under model misspecification, though there are differences. The main one is that our setting requires that we analyze the error in terms of the difference between parameters ˆβ β , as opposed to the excess risk r(ˆβ) r(β ) in the standard ERM analysis. This difference requires new tools and assumptions akin to those in Assumption 3.7. 4. Applications We apply our theory and plug-in performative optimization to several problems with performative feedback, building on prevalent models for those problems. In each problem, we prove the model s smoothness and fast estimation of β . 4.1. Strategic Regression We begin by considering strategic regression. Here, a population of individuals described by (x, y) strategically responds to a deployed predictor fθ. For example, the predictor could be fθ(x) = θ x. Distribution atlas. The strategic responses consist of manipulating features in order to maximize a utility function, which is often equal to the prediction itself. Formally, given an individual with features x0, a commonly studied response model is gβ(x0, θ) = arg maxx(uθ(x) 1 2β x x0 2), where uθ(x) is a concave utility function and the second term captures the cost of feature manipulations. Here, β [βmin, βmax] R+ trades off utility and cost. The natural distribution atlas capturing the above response model is obtained as follows. Suppose that we have a historical distribution of feature label pairs D0. Then, let: (x, y) Dβ(θ) (x0, y) D0, x = gβ(x0, θ). (2) Claim 4.1. If uθ(x) Bu and uθ(x) is Lu-Lipschitz, then {Dβ}β has ϵW Bu 1 βmax Lu . This bound is attained for uθ(x) = x θ. There, Lu = 0, Bu = supθ Θ θ , so the bound equals ϵW maxθ Θ θ . This is tight because supx,θ Θ gβ(x, θ) gβ (x, θ) = |β β | supθ Θ θ . Map fitting. We can fit ˆβ via maximum-likelihood estimation (MLE). Suppose that DX 0 has a density and denote it by p X 0 . Then, we can let ˆβ = arg min β B 1 i=1 log p X 0 (xi β uθi(xi)) . Given the first-order optimality condition for gβ(x0, θ), under mild regularity and correct specification of the model, meaning D(θ) = Dβ(θ) for some β, this map-fitting strategy ensures ηW = 0, as expected. For example, if DX 0 = N(0, σ2I), MLE reduces to ˆβ = arg min β B i=1 xi β uθi(xi) 2. 
Least-squares makes sense even if the features are not Gaussian; it just coincides with MLE for Gaussians. We show a fast estimation rate of least-squares under mild conditions via Lemma 3.8. Plug-in Performative Optimization Claim 4.2. If E[ u θ( x) 2] > 0, x and u θ( x) are subgaussian, and n log n C(1 + log(1/δ)) for a sufficiently large C > 0, then |ˆβ β | C p 1 + log(1/δ) n with probability 1 δ. If, in addition, the loss function ℓ(z; θ) is Lz-Lipschitz in z, combining Claim 4.1, Claim 4.2, and Corollary 3.6 gives an upper bound on the excess risk PR(ˆθPO) PR(θPO). 4.2. Binary Strategic Classification Next, we consider binary strategic classification, in which a population of strategic individuals described by (x, y) takes strategic actions in order to reach a decision boundary. We assume the learner s decision rule is obtained by thresholding a linear model, fθ(x) = 1{θ x T}, for some T. Without loss of generality we assume θ = 1, since the rule is invariant to rescaling θ and T. Distribution atlas. A common model assumes that the individuals have a budget β > 0 on how much they can change their features (Kleinberg & Raghavan, 2020; Chen et al., 2020; Zrnic et al., 2021). The individuals move to the decision boundary if it is within ℓ2 distance β. Formally, an individual with features x0 responds with gβ(x0, θ) = x0 + θ(T x 0 θ) if x 0 θ [T β, T), and does not move otherwise. The natural distribution atlas corresponding to the above model is defined as in Eq. (2), for a given base distribution D0. We show that this atlas is smooth in totalvariation distance. Claim 4.3. If, for all θ, x 0 θ has a density upper bounded by ϕu, then {Dβ}β has ϵTV ϕu. Map fitting. According to the atlas, all individuals with x 0 θ [T β, T) move to the decision boundary, defined by x θ = T. Therefore, one can estimate the individuals budget by finding ˆβ such that P{x 0 θ [T ˆβ, T]} = 1 n Pn i=1 1{x i θi (T ϵ)}, for a small ϵ > 0. The latter term estimates the mass in a small neighborhood of the boundary. Therefore, P{x 0 θ [T β , T]} = P{ x θ (T ϵ)}. For simplicity we assume x 0 θ has a density on R so that ˆβ exists. Note that P{x 0 θ [T β, T]} is known for all β because it is a property of the base distribution, so finding ˆβ reduces to estimating P{ x θ (T ϵ)}. Claim 4.4. If x 0 θ has a density lower bounded by ϕl, then with probability 1 δ. If the learner s loss is bounded, putting together Claim 4.3, Claim 4.4, and Corollary 3.5 gives an upper bound on the excess risk PR(ˆθPO) PR(θPO). 4.3. Location Families Lastly, we consider general location families (Miller et al., 2021; Jagadeesan et al., 2022; Ray et al., 2022), in which the deployment of θ leads to performativity via a linear shift. This model often appears in strategic classification with linear or logistic regression, and can capture performativity only in certain features (Miller et al., 2021). Distribution atlas. The location-family model is defined by z DM(θ) z = Mθ + z0, where M Rdz dθ is a matrix that parameterizes the shift, and z0 is a sample from a zero-mean base distribution D0. We assume supθ Θ θ Bθ. It is not hard to see that the atlas is smooth in op. Claim 4.5. The atlas {DM}M has ϵW Bθ. Map fitting. We fit the distribution map via least-squares: c M = arg min M i=1 zi Mθi 2. Thus, M = arg min M E[ z M θ 2]. We provide control on the estimation error below. Claim 4.6. Assume D is zero-mean and subgaussian with κmin I E[ θ θ ] κmax I. 
Further, for all $u \in \mathbb{S}^{d_\theta - 1}$ and $v \in \mathbb{S}^{d_z - 1}$, assume $u^\top \tilde\theta \cdot v^\top \tilde z$ is subexponential with parameter $L_{\theta z}$, and $\|\mathbb{E}[\tilde\theta \tilde z^\top]\|_{\mathrm{op}} \le B$. Then, if $n \ge C(d_\theta + d_z + \log(1/\delta))$ for some sufficiently large $C > 0$, there exists $C' > 0$ such that with probability $1 - \delta$ we have

$$\|\widehat M - M^*\|_{\mathrm{op}} \le C' \sqrt{\frac{d_\theta + d_z + \log(1/\delta)}{n}}.$$

The above assumptions hold under mild regularity conditions when the model is nearly well-specified. If the loss is $L_z$-Lipschitz in $z$, then using Claim 4.5, Claim 4.6, and Corollary 3.6, we can bound the excess performative risk $\mathrm{PR}(\hat\theta_{\mathrm{PO}}) - \mathrm{PR}(\theta_{\mathrm{PO}})$.

5. Experiments

We confirm the qualitative takeaways of our theory empirically. We compare our approach with two model-agnostic algorithms: derivative-free optimization (DFO) (Flaxman et al., 2004) and greedy SGD (Mendler-Dünner et al., 2020). The latter naively retrains while ignoring the feedback; it is a practical heuristic but only converges to stable points. For the location-family experiment, we also compare our approach with PerfGD (Izzo et al., 2021), which approximates the performative gradient via numerical methods. We refer the reader to Appendix B for further details.

5.1. Location Family

We start with the location-family setting. We assume the true map follows a linear model with a quadratic term,

$$z_i = b + M_1\theta_i + s\,M_2(\theta_i \odot \theta_i) + z_{0,i},$$

where $z_i, \theta_i, b \in \mathbb{R}^d$, $\theta_i \odot \theta_i := (\theta_{i,1}^2, \dots, \theta_{i,d}^2)$ represents the quadratic effect, and $z_{0,i} \sim N(0, \sigma^2 I_d)$. The parameter $s \ge 0$ varies the magnitude of misspecification. We want to minimize the loss $\ell(z; \theta) = \|z - \theta\|^2$, and we use a simple linear model to approximate $\mathcal{D}(\theta)$, i.e., $z \sim \mathcal{D}_M(\theta) \iff z \overset{d}{=} b + M\theta + z_0$. To fit $\widehat M$, we use the loss $r(\theta, z; M) = \|z - M\theta\|^2$. We vary $d \in \{5, 10\}$ and let $M_i = \tilde M_i / \|\tilde M_i\|_{\mathrm{op}}$, $i \in \{1, 2\}$, where $\tilde M_i \in \mathbb{R}^{d \times d}$ have entries generated i.i.d. from $N(0, 1)$. We let $b \sim N(0, I_d)$, $\sigma = 0.5$, and $\Theta = \{\theta : \|\theta\| \le 1\}$.

Figure 1. (Location family) Excess risk of plug-in performative optimization, DFO, greedy SGD, and PerfGD, with ±1 standard deviation, on a logarithmic scale; panels vary the misspecification strength $s$ (e.g., $s = 0.5$ and $s = 1.5$ for $d = 10$).

In Figure 1 we see that the excess risk of our algorithm converges rapidly to a value that reflects the degree of misspecification. It approaches zero for $s = 0$ (top panel) due to no misspecification and stabilizes at a nonzero value for $s > 0$ (middle and bottom panels), consistent with our theory. In contrast, the risks of both PerfGD and DFO decrease slowly, while SGD quickly reaches a suboptimal value.
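A minimal end-to-end simulation of this location-family setup is sketched below. It follows the recipe just described (quadratic misspecification of strength $s$, least-squares map fitting, squared loss), but with our own small dimensions and sample size, and it solves Step 3 by projected gradient descent on the closed-form plug-in risk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, s, sigma = 5, 20000, 0.5, 0.5      # small toy values, not the exact runs from the paper

def unit_op_norm(A):
    return A / np.linalg.norm(A, 2)       # normalize by the operator norm

# Ground truth (unknown to the learner): z = b + M1 theta + s * M2 (theta ⊙ theta) + noise.
M1, M2 = unit_op_norm(rng.normal(size=(d, d))), unit_op_norm(rng.normal(size=(d, d)))
b = rng.normal(size=d)

def sample_z(theta):
    return b + M1 @ theta + s * M2 @ (theta * theta) + sigma * rng.normal(size=d)

# Step 1: exploration, here with theta drawn inside the unit ball (one simple choice).
thetas = rng.normal(size=(n, d))
thetas /= np.maximum(1.0, np.linalg.norm(thetas, axis=1, keepdims=True))
zs = np.array([sample_z(t) for t in thetas])

# Step 2: fit the (misspecified) linear atlas  z ~ b_hat + M_hat theta  by least squares.
X = np.hstack([np.ones((n, 1)), thetas])
coef = np.linalg.lstsq(X, zs, rcond=None)[0]
b_hat, M_hat = coef[0], coef[1:].T

# Step 3: the plug-in risk is E||b_hat + M_hat theta + z0 - theta||^2
#       = ||b_hat + (M_hat - I) theta||^2 + const, minimized over {||theta|| <= 1}
# by projected gradient descent.
A = M_hat - np.eye(d)
theta = np.zeros(d)
for _ in range(2000):
    theta -= 0.05 * 2 * A.T @ (b_hat + A @ theta)
    theta /= max(1.0, np.linalg.norm(theta))
print("plug-in optimum:", np.round(theta, 3))
```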
5.2. Strategic Regression

We next consider strategic regression. We first generate 5000 i.i.d. base samples $(x_i, y_i)$, where $x_i \sim N(0, I_{d_x})$ and $y_i \in \{0, 1\}$ follows from a logistic model with a fixed parameter vector $\eta \sim \mathrm{Unif}(\mathbb{S}^{d_x - 1})$. Denote the joint empirical distribution of $(x_i, y_i)$ by $\mathcal{D}_0$. The true distribution map, $(x, y) \sim \mathcal{D}(\theta)$, is defined via

$$(x_0, y_0) \sim \mathcal{D}_0, \quad y = y_0, \quad x = \arg\max_{x \in \mathbb{R}^{d_x}} \left(\theta^\top x - \frac{1}{2\beta}\|x - x_0\|_\rho^\rho\right),$$

for some $\rho > 1$. We set $d_x = 5$ and $\beta = 2$. To construct a model for $\mathcal{D}(\cdot)$, we follow the procedure from Section 4.1, using the linear utility and quadratic cost. Thus, $\rho = 2$ results in correct specification. We choose $\ell(z; \theta)$ to be the logistic loss with a ridge penalty and $\Theta = \{\theta : \|\theta\| \le 1\}$.

We examine the well-specified scenario $\rho = 2$ and two misspecified cases $\rho \in \{2.5, 3\}$, comparing our method with DFO and greedy SGD in Figure 2.

Figure 2. (Strategic regression) Excess risk and accuracy of plug-in performative optimization, DFO, and greedy SGD, with ±1 standard deviation, on a logarithmic scale.

We see that our algorithm quickly tends to zero excess risk when $\rho = 2$ (top left). When $\rho \neq 2$ (middle and bottom left), the algorithm still converges, albeit to a suboptimal point. SGD converges to a suboptimal point with nonzero excess risk, while the excess risk of the DFO method decreases at a slow rate. We remark that PerfGD is not applicable in our strategic regression setting, as $\mathcal{D}(\theta)$ does not admit a probability density function. Similar trends appear for test accuracy.

6. Discussion

We discuss several limitations and comment on a few technical aspects of our work. Along the way, we discuss valuable future directions in the context of modeling the distribution map in performative prediction.

First, our approach may be of limited use when the learner has very little prior knowledge of the true data-generating process, as this prevents them from choosing an atlas in an informed way. This is a hard scenario, and it is possible that no existing method would lead to satisfactory performance. For example, while the model-free approach of Jagadeesan et al. (2022) is asymptotically more robust, for a reasonable number of samples it is not clear that it would outperform our approach with a highly misspecified model. One valuable future direction is to choose a family of distribution maps $\mathcal{D}_B$ that is sufficiently complex and has strong expressive power, e.g., neural networks, in cases where the learner is very uncertain about the true map. When the number of samples is suitably large, this may give a small excess risk, as the misspecification error is smaller with a more expressive model. While in the current paper we mainly focus on simple models with closed-form expressions, we are hopeful that our approach could be valuable in such settings with black-box modeling as well.

Modeling could be particularly useful if one considers the fact that, in reality, data distributions take time to shift after model deployment and do not generate i.i.d. observations. Such time-varying shifts were, in fact, modeled in prior work (Brown et al., 2022; Izzo et al., 2022; Li & Wai, 2022; Ray et al., 2022). It would be interesting to study the impacts of modeling the time-varying aspect of distribution maps.

Many of our examples assumed access to historical data. However, this assumption is not fundamental to our framework.
Often this just means that there was another historical model in place under which the data was collected. We can model this as step 0 in our framework, where we first deploy a default model θ0 (e.g. θ0 = 0) and collect data points drawn from D(θ0) D0 (for instance, in Example 5 this would correspond to deploying a model with fθ(x) = 0 for all x; there exist analogues for other problems as well). This alternative view would simply incur an additional statistical error term due to having finite-sample access to D(θ0). In this work, we used standard ERM to estimate the distribution map. However, the learner may want to incorporate criteria other than model fit (which is the focus on ERM) in this process. For example, the learner may want to use a classical model selection criterion such as AIC or BIC (see, e.g., Hastie et al. (2009)) to regularize towards simpler models and run ERM afterwards. Investigating model selection criteria beyond ERM for distribution map estimation is another valuable future direction. Impact Statement We have analyzed performative prediction with misspecified models. Our results highlight the statistical gains of using models, however modeling has consequences far beyond statistical efficiency. On the positive side, modeling can help interpretability and computational efficiency. On the negative side, however, using highly misspecified models can lead to unfairness, lack of validity, and poor downstream decisions. For example, modeling may be too coarse and not represent certain demographic groups properly. Going forward, it would be valuable to develop deeper understanding of such negative aspects of modeling. Overall, given that models are ubiquitous in practice, we believe they merit further study especially under misspecification and we have only scratched the surface of this agenda. Brown, G., Hod, S., and Kalemaj, I. Performative prediction in a stateful world. In International Conference on Artificial Intelligence and Statistics, pp. 6045 6061. PMLR, 2022. Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L. Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34(4):523 544, 2019a. Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. Models as approximations II: A modelfree theory of parametric regression. Statistical Science, 34(4):545 565, 2019b. Chen, Y., Liu, Y., and Podimata, C. Learning strategyaware linear classifiers. Advances in Neural Information Processing Systems, 33:15265 15276, 2020. Dean, S., Curmei, M., Ratliff, L. J., Morgenstern, J., and Fazel, M. Multi-learner risk reduction under endogenous participation dynamics. ar Xiv preprint ar Xiv:2206.02667, 2022. Dong, J., Roth, A., Schutzman, Z., Waggoner, B., and Wu, Z. S. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 55 70, 2018. Drusvyatskiy, D. and Xiao, L. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 2022. Flaxman, A. D., Kalai, A. T., and Mc Mahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM Symposium on Discrete Algorithms, 2004. Geer, S. A. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. Ghalme, G., Nair, V., Eilat, I., Talgam-Cohen, I., and Rosenfeld, N. Strategic classification in the dark. In International Conference on Machine Learning, pp. 3672 3681. 
PMLR, 2021. Hardt, M., Megiddo, N., Papadimitriou, C., and Wootters, M. Strategic classification. In Proceedings of the 2016 Plug-in Performative Optimization ACM conference on innovations in theoretical computer science, pp. 111 122, 2016. Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009. Izzo, Z., Ying, L., and Zou, J. How to learn when data reacts to your model: performative gradient descent. In International Conference on Machine Learning, pp. 4641 4650. PMLR, 2021. Izzo, Z., Zou, J., and Ying, L. How to learn when data gradually reacts to your model. In International Conference on Artificial Intelligence and Statistics, pp. 3998 4035. PMLR, 2022. Jagadeesan, M., Mendler-D unner, C., and Hardt, M. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, pp. 4687 4697. PMLR, 2021. Jagadeesan, M., Zrnic, T., and Mendler-D unner, C. Regret minimization with performative feedback. In International Conference on Machine Learning, pp. 9760 9785. PMLR, 2022. Kennedy, E. H. Semiparametric doubly robust targeted double machine learning: a review. ar Xiv preprint ar Xiv:2203.06469, 2022. Kim, M. P. and Perdomo, J. C. Making decisions under outcome performativity. In 14th Innovations in Theoretical Computer Science Conference (ITCS), 2023. Kleinberg, J. and Raghavan, M. How do classifiers induce agents to invest effort strategically? ACM Transactions on Economics and Computation (TEAC), 8(4):1 23, 2020. Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 681 690, 2008. Levanon, S. and Rosenfeld, N. Strategic classification made practical. In International Conference on Machine Learning, pp. 6243 6253. PMLR, 2021. Li, Q. and Wai, H.-T. State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3164 3186. PMLR, 2022. Maheshwari, C., Chiu, C.-Y., Mazumdar, E., Sastry, S., and Ratliff, L. Zeroth-order methods for convex-concave minmax problems: Applications to decision-dependent risk minimization. In International Conference on Artificial Intelligence and Statistics, pp. 6702 6734. PMLR, 2022. Mendler-D unner, C., Perdomo, J., Zrnic, T., and Hardt, M. Stochastic optimization for performative prediction. Advances in Neural Information Processing Systems, 33: 4929 4939, 2020. Mendler-D unner, C., Ding, F., and Wang, Y. Anticipating performativity by predicting from predictions. Advances in Neural Information Processing Systems, 35:31171 31185, 2022. Miller, J. P., Perdomo, J. C., and Zrnic, T. Outside the echo chamber: Optimizing the performative risk. In International Conference on Machine Learning, pp. 7710 7720. PMLR, 2021. Mou, W., Ho, N., Wainwright, M. J., Bartlett, P., and Jordan, M. I. A diffusion process perspective on posterior contraction rates for parameters. ar Xiv preprint ar Xiv:1909.00966, 2019. Narang, A., Faulkner, E., Drusvyatskiy, D., Fazel, M., and Ratliff, L. Learning in stochastic monotone games with decision-dependent data. In International Conference on Artificial Intelligence and Statistics, pp. 5891 5912. PMLR, 2022. Perdomo, J., Zrnic, T., Mendler-D unner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp. 7599 7609. PMLR, 2020. Piliouras, G. and Yu, F.-Y. 
Multi-agent performative prediction: From global stability and optimality to chaos. In Proceedings of the 24th ACM Conference on Economics and Computation, pp. 1047 1074, 2023. Qiang, L., Yau, C.-Y., and Wai, H. T. Multi-agent performative prediction with greedy deployment and consensus seeking agents. In Advances in Neural Information Processing Systems, 2022. Ray, M., Ratliff, L. J., Drusvyatskiy, D., and Fazel, M. Decision-dependent risk minimization in geometrically decaying dynamic environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8081 8088, 2022. Spence, M. Job market signaling. In Uncertainty in economics, pp. 281 306. Elsevier, 1978. Tsiatis, A. Semiparametric Theory and Missing Data. New York: Springer, 2007. Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019. Plug-in Performative Optimization White, H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48:817 838, 1980. White, H. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1 25, 1982. ISSN 00129682, 14680262. URL http://www.jstor. org/stable/1912526. Wooldridge, M. Reasoning about rational agents. MIT press, 2003. Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34: 15257 15269, 2021. Plug-in Performative Optimization A.1. Notation and Definitions In the proofs we will sometimes use c, c > 0 to denote universal constants and C, C > 0 to denote constants that may depend on the parameters introduced in the assumptions. We allow the values of the constants to vary from place to place. We say a random variable x is subexponential with parameter ν if P{|x| t} 2 exp t for any t 0. Unless specified, we do not assume x has mean zero in general. Moreover, we say a vector x Rd is subexponential with parameter ν if u, x is subexponential with parameter ν for any fixed direction u Sd 1. Similarly, we say a random variable x is subgaussian with parameter σ if P{|x| t} 2 exp t2 for any t 0. Likewise, a vector x Rd is subgaussian with parameter σ if u, x is subgaussian with parameter σ for any fixed direction u Sd 1. A.2. Proof of Theorem 3.1 Define the population-level counterpart of ˆθPO as: θ = arg min θ Θ PRβ (θ). We can write PR(ˆθPO) PR(θPO) = (PR(ˆθPO) PRβ (ˆθPO)) + (PRβ (ˆθPO) PR ˆβ(ˆθPO)) + (PR ˆβ(ˆθPO) PR ˆβ(θ )) + (PR ˆβ(θ ) PRβ (θ )) + (PRβ (θ ) PRβ (θPO)) + (PRβ (θPO) PR(θPO)). By the definition of ˆθPO, we know PR ˆβ(ˆθ) PR ˆβ(θ ) 0. Similarly, by the definition of θ , we know PRβ (θ ) PRβ (θPO) 0. Using these inequalities, we establish PR(ˆθPO) PR(θPO) (PR(ˆθPO) PRβ (ˆθPO)) + (PRβ (ˆθPO) PR ˆβ(ˆθPO)) + (PR ˆβ(θ ) PRβ (θ )) + (PRβ (θPO) PR(θPO)) 2 sup θ |PR(θ) PRβ (θ)| + 2 sup θ |PRβ (θ) PR ˆβ(θ)| = 2(Stat Errn + Misspec Err). A.3. Proof of Corollary 3.5 |PRβ (θ) PR(θ)| = Z ℓ(z; θ)(pβ (z; θ) p(z; θ))dz Bℓ Z |pβ (z; θ) p(z; θ)|dz = 2Bℓ TV(Dβ (θ), D(θ)). Therefore, Misspec Err 2Bℓsup θ Θ TV(Dβ (θ), D(θ)) 2BℓηTV. Plug-in Performative Optimization By a similar argument as above, |PRβ (θ) PR ˆβ(θ)| 2BℓTV(Dβ (θ), D ˆβ(θ)). Applying ϵTV-smoothness of the distribution atlas, we get Stat Errn 2Bℓsup θ Θ TV(Dβ (θ), D ˆβ(θ)) 2BℓϵTV ˆβ β 2BℓϵTVCn. Applying Theorem 3.1 completes the proof. A.4. 
Proof of Corollary 3.6 Denote by Π(D, D ) a coupling between two distributions D and D . We have PRβ (θ) PR(θ) = inf Π(Dβ (θ),D(θ)) E(z,z ) Π(Dβ (θ),D(θ))[ℓ(z; θ) ℓ(z ; θ)] inf Π(Dβ (θ),D(θ)) E(z,z ) Π(Dβ (θ),D(θ))[|ℓ(z; θ) ℓ(z ; θ)|] L inf Π(Dβ (θ),D(θ)) E(z,z ) Π(Dβ (θ),D(θ))[ z z ] = LW(Dβ (θ), D(θ)). Therefore, Misspec Err L sup θ Θ W(Dβ (θ), D(θ)) L ηW . By a similar argument, |PRβ (θ) PR ˆβ(θ)| LW(Dβ (θ), D ˆβ(θ)). Applying ϵW -smoothness of the distribution atlas, we get Stat Errn L sup θ Θ W(Dβ (θ), D ˆβ(θ)) LϵW ˆβ β L ϵW Cn. Applying Theorem 3.1 completes the proof. A.5. Proof of Lemma 3.8 We first present Lemma A.1 which we will use in the proof. Lemma A.1. 2 β β 2, β β µ 2σr β β , β β µ See the proof of Lemma A.1 in Supplement A.6 Let Bβ be the bound of the parameter set B, i.e., β Bβ for any β B. We will show Lemma 3.8 holds with some sufficiently large constants C, C > 0 that depend polynomially on (log |1 + Bβ|, 1/µ, Lr, Br, σr). i=1 r(θi, zi; β); r(β) := Eθ D,z D(θ)[r(θ, z; β)]. We begin by claiming the following result, which we will prove later. With probability over 1 δ sup β B βrn(β) βr(β) C p dβ + log(1/δ) With this result at hand, it follows from Lemma A.1 that 2 ˆβ β 2, µ2 2σr ˆβ β o βr(ˆβ), ˆβ β = βr(ˆβ) βrn(ˆβ), ˆβ β βr(ˆβ) βrn(ˆβ) ˆβ β C p dβ + log(1/δ) n ˆβ β , (5) Plug-in Performative Optimization where the first equality is due to the fact that βrn(ˆβ) = 0. Eliminating ˆβ β in both the first and the last term of (5) yields dβ + log(1/δ) By the sample size Assumption in Lemma 3.8, we may assume n is sufficiently large such that µ2 2σr C log n q dβ+log(1/δ) n for some constant C > 0. Therefore, we conclude that dβ + log(1/δ) Proof of Eq. (4). Let {u1, u2, . . . , u M} be a 1/2-covering of Sdβ 1 in the Euclidean norm such that |M| 5dβ. Define the random variables ϕu,β := u, βrn(β) βr(β) , ϕu := sup β B ϕu,β. It follows from a standard discretization argument (e.g., (Wainwright, 2019), Chap. 6) that sup β B βrn(β) βr(β) 2 sup β B sup i [M] ϕui,β. (6) We make the following claim which will be proved at the end. With probability over 1 δ 1 n i=1 sup β B 2 βr(θi, zi; β) op c Lr + c Lr dβ + log(1/δ) n c Lr. (7) for some constant c > 0. Let ϵ > 0 be some value we specify later. Construct an ε-covering net {β1, . . . , βN} of B in . Then the covering number |N| (1 + 2Bβ/ϵ)dβ, and sup β B sup i [M] ϕui,β sup i [M] sup j [N] ϕui,βj + sup i [M] sup β1 β2 ε |ϕui,β1 ϕui,β2| sup i [M] sup j [N] ϕui,βj + sup i [M] sup β1 β2 ε i=1 sup β B u i 2 βr(θi, zi; β)(β1 β2) sup i [M] sup j [N] ϕui,βj + sup i [M] sup β1 β2 ε ui i=1 sup β B 2 βr(θi, zi; β) sup i [M] sup j [N] ϕui,βj + c Lrε, (8) where the last inequality follows from the claim in (7). Since u, βr(θi, zi; β) E[ βr(θi, zi; β)] is zero mean subexponential with parameter Br by condition (b) of Assumption 3.7, it follows from concentration of subexponential variables that P{|ϕui,βj| t} 2 exp nt2 for any |t| Br. Applying a union bound over i, j, we establish P max i [M],j [N] |ϕui,βj| t 2 exp cdβ(log(1 + 2Bβ/ϵ) + 1) nt2 for any |t| Br. (9) dβ+log(1/δ) log(1/δ) + dβ(log(1 + 2Bβ/ϵ) + 1) n CBr p log n(dβ + log(1/δ)) n Br Plug-in Performative Optimization for some constant c > 0, where the last inequality uses the sample size assumption of the lemma. Substituting the values of ϵ and t into Equations (8), (9) and combining with Eq. (6), we obtain sup β B βrn(β) βr(β) CBr p log n(dβ + log(1/δ)) n + c Lr dβ + log(1/δ) dβ + log(1/δ) with probability over 1 δ for some parameter-dependent constant C . Proof of Eq. (7). 
Similar to equation (6), from a standard discretization argument we have 1 n i=1 sup β B 2 βr(θi, zi; β) op 2 sup j [M] i=1 sup β B u j 2 βr(θi, zi; β)uj. Since supβ B u j 2 βr(θi, zi; β)uj are subexponential variables by condition (c) in Assumption 3.7, we have from properties of subexponential variables and Bernstein s inequality that i=1 sup β B 2 βr(θi, zi; β)uj c Lr + t i=1 sup β B 2 βr(θi, zi; β)uj E[u j sup β B 2 βr(θi, zi; β)uj] + t exp c min{ nt Applying a union bound over j [M] and setting t = c Lr q dβ+log(1/δ) n < c Lr, we establish i=1 sup β B 2 βr(θi, zi; β)uj c Lr + c Lr dβ + log(1/δ) for some c > 0 with probability over 1 δ. A.6. Proof of Lemma A.1 For any β β µ σr , by a Taylor expansion of r(β) at β and Ass. 3.7(a), we have r(β), β β = r(β) r(β ), β β µ β β 2 σr This gives the second part of Lemma A.1. When β β µ σr , write β = β + tu, where u = (β β )/ β β . For a fixed direction u, define β(t) := β + tu and f(u, t) := D r(β(t)), β(t) β E = u, r(β(t)) . Then for t 0, using Ass. 3.7(a) again we obtain tf(u, t) = u, 2r(β(t))u 0 Therefore f(u, t) is increasing in t and for any β β µ σr D r(β), β β This gives the second part of Lemma A.1. Plug-in Performative Optimization A.7. Proof of Theorem 3.9 The first statement follows directly by putting together Corollary 3.5 and Lemma 3.8. Similarly, the second statement follows by putting together Corollary 3.6 and Lemma 3.8. A.8. Proof of Claim 4.1 Fix x and θ. We will use the shorthand notation gβ(x, θ) xβ. First, we show that xβ xβ 2 Bu 1 βmax Lu |β β |. To see this, notice that the optimality condition of the best-response equation is equal to: β uθ(xβ) xβ = x. Since β uθ(xβ) xβ = β uθ(xβ ) xβ = x, we know β uθ(xβ) β uθ(xβ ) = xβ xβ . We also know β uθ(xβ) β uθ(xβ ) = β uθ(xβ) β uθ(xβ ) + β uθ(xβ ) β uθ(xβ ) βLu xβ xβ + Bu |β β | . Therefore, xβ xβ βLu xβ xβ + Bu |β β | . Rearranging the terms, we get xβ xβ Bu 1 βLu |β β |. By the definition of Wasserstein distance, this condition directly implies W(Dβ(θ), Dβ (θ)) Bu 1 βmax Lu |β β |, which is the definition of Bu 1 βmax Lu -smoothness. A.9. Proof of Claim 4.2 The claim follows by Lemma 3.8 after verifying the conditions required in Assumption 3.7. We have r(θ, x; β) = x β uθ(x) 2, so r(θ, x; β) = 2(x uθ(x) β uθ(x) 2) and 2r(θ, x; β) = 2 uθ(x) 2. Conditions (b) and (c) of Assumption 3.7 are thus satisfied by x and u θ( x) being subgaussian since products of subgaussians are subexponential. Condition (a) is satisfied by the fact that r(β) = E[ x β u θ( x) 2] is a quadratic in β when E[ u θ( x) 2] > 0. A.10. Proof of Claim 4.3 Fix θ, β, β , and without loss of generality let β > β . We show that TV(Dβ(θ), Dβ (θ)) ϕu. The distributions Dβ(θ) and Dβ (θ) are equal to each other and to D0 for all {x : x θ ( , T β) (T, )}. Moreover, under both Dβ(θ) and Dβ (θ), there is no mass for {x : x θ (T β , T)}. The distributions thus only differ for {x : x θ [T β, T β ] {T}}. Since the density of x θ is bounded by ϕu, the measure of such vectors x is at most ϕu|β β |. A.11. Proof of Claim 4.4 By Hoeffding s inequality, with probability 1 δ it holds that 1 n i=1 1{x i θi (T ϵ)} P{ x θ (T ϵ)} Plug-in Performative Optimization 2n . Next we argue that |ˆβ β | ϕl by contradiction. Suppose |ˆβ β | > i=1 1{x i θi (T ϵ)} P{ x θ (T ϵ)} = P{x 0 θ [T ˆβ, T] P{x 0 θ [T β , T]} > ϕl ϕl = , which contradicts Hoeffding s inequality. Therefore, we conclude that |ˆβ β | A.12. 
Proof of Claim 4.5 By the definition of Wasserstein distance, we have W(DM1(θ), DM2(θ)) = W(M1θ + z0, M2θ + z0) = M1θ M2θ Bθ M1 M2 op for any M1, M2. Therefore, the distribution atlas {DM}M is ϵW -smooth with parameter Bθ. A.13. Proof of Claim 4.6 Let νθ be the subgaussian parameter of D. We prove that there exists C, C depending polynomially on (1/κmin, κmax, νθ, Lθz, B) such that Claim 4.6 holds. By definition, we have i=1 θiz i , M = E[ θ θ ] 1E[ θ z ]. We state the following results, which we will prove later: E[θiθ i ] 1 cν2 θ κ2 min dθ + log(1/δ) i=1 θiz i E[θiz i ] dθ + dz + log(1/δ) Combining Equations (10), (11) with the assumptions of the claim, we establish i=1 θiz i E[θiθ i ] 1E[θiz i ] E[θiθ i ] 1 + E[θiθ i ] 1 op i=1 θiz i E[θiz i ] dθ + dz + log(1/δ) for some C > 0 that depends on problem-specific parameters. Plug-in Performative Optimization Proof of Eq. (10). Under the conditions of the claim, we establish from concentration inequalities for subgaussian vectors (see, e.g., Theorem 6.5 in (Wainwright, 2019)) that with probability at least 1 δ, 2 κmin cν2 θ dθ + log(1/δ) κmax + cν2 θ dθ + log(1/δ) where the last line follows from the sample-size assumption. In addition, we also have from (Wainwright, 2019) that 1 n i=1 θiθ i E[θiθ i ] dθ + log(1/δ) Therefore, it follows from Woodbury s matrix identity and the last two displays that E[θiθ i ] 1 (E[θiθ i ]) 1 i=1 θiθ i E[θiθ i ] (E[θiθ i ]) 1 op cν2 θ κ2 min dθ + log(1/δ) The second inequality follows from the assumption on sample size. Proof of Eq. (11). Let {u1, . . . , u M} be a 1/4-covering of Sdθ 1 in the Euclidean norm with |M| 9dθ, and {v1, . . . , v N} to be a 1/4-covering of Sdz 1 with |N| 9dz. Then by a standard discretization argument, we have 1 n i=1 θiz i E[θizi] op 2 sup k [M],l [N] i=1 u k θiz i vl E[u k θiz i vl]. Since u k θiz i vl E[u k θiz i vl] are zero-mean subexponential variables by assumption, it follows that i=1 u k θiz i vl E[u k θiz i vl] 2 exp c min nt2 Applying a union bound over M, N and setting t = c Lθz q dθ+dz+log(1/δ) n < c Lθz with some sufficiently large constant c > 0 yields i=1 u k θiz i vl E[u k θiz i vl] i=1 u k θiz i vl E[u k θiz i vl] 2 exp (dθ + dz) log 9 c min nt2 which gives Eq. (11). Plug-in Performative Optimization 0 5000 10000 15000 20000 25000 Number of Samples PR( PO) PR( PO) Plug-in DFO SGD 0 5000 10000 15000 20000 25000 Number of Samples PR( PO) PR( PO) Plug-in DFO SGD 0 5000 10000 15000 20000 25000 Number of Samples PR( PO) PR( PO) Plug-in DFO SGD 0 5000 10000 15000 20000 25000 Number of Samples Plug-in DFO SGD 0 5000 10000 15000 20000 25000 Number of Samples Plug-in DFO SGD 0 5000 10000 15000 20000 25000 Number of Samples Plug-in DFO SGD Figure 3. Excess risk (top) and accuracy (bottom) versus n for plug-in performative optimization, the DFO algorithm, and greedy SGD, with a changed value of β = 1. We display the 1 standard deviation, logarithmically scaled. The takeaways are largely the same as in Figure 2. B. Further Experimental Results and Details We repeat each experiment 10 times and plot the mean excess risk as well as the 1 standard deviation. In all experiments on strategic classification, we choose the ridge parameter λ = 0.001. In Figure 3 we provide an additional comparison in the context of the strategic-regression example from Section 5. We let β = 1, showing that our takeaways are robust to the exact value of β. We also run the strategic-regression experiment on a real data set. 
We also run the strategic-regression experiment on a real data set. We use the credit data set, in particular the processed version available at https://github.com/ustunb/actionable-recourse. The data set contains 30,000 samples of $d = 17$ features and a $\{0, 1\}$-valued outcome $y_i$, with $y_i = 1$ denoting that individual $i$ does not default on a credit card payment. The features include marital status, age, education level, and payment patterns. We assume the individuals can modify their records of education level and payment patterns (features 7-17), but cannot change the other records. We use 1500 randomly drawn data points to form the base distribution $D_0$; we assume the same true response model and use the same distribution atlas as before. We set $\beta = 5$ and $\Theta = \{\theta : \|\theta\| \le 1\}$, and standardize the features so that each column has zero mean and unit variance. In Figure 4, we observe patterns similar to those in Figure 2, though the gap in accuracy between our method and SGD is smaller.

Below we provide implementation details for the considered baselines.

Derivative-free optimization (DFO). Starting from $\theta_0 = \mathbf{1}_d/\sqrt{d}$, we run the updates

$$\theta_{t+1} = \mathrm{Proj}_1\Big(\theta_t - \eta_t\,\widehat{\mathbb{E}}\big[u\,\mathrm{PR}(\theta_t + \delta u)\,d/\delta\big]\Big)$$

for $t \ge 0$, where the step size is $\eta_t = c_0/(t+1)$, $u$ is uniformly distributed on $\mathbb{S}^{d-1}$, and $\widehat{\mathbb{E}}[u\,\mathrm{PR}(\theta_t + \delta u)\,d/\delta]$ denotes an unbiased sample estimate of $\mathbb{E}[u\,\mathrm{PR}(\theta_t + \delta u)\,d/\delta]$ computed from $m$ i.i.d. pairs $(u, z) \sim \mathrm{Unif}(\mathbb{S}^{d-1}) \times D(\theta_t + \delta u)$. The projection $\mathrm{Proj}_1(x)$ denotes the Euclidean projection of $x \in \mathbb{R}^d$ onto the ball $\{v \in \mathbb{R}^d : \|v\| \le 1\}$. We choose the step-size parameter $c_0 \in [10^{-4}, 10^{-1}]$, the batch size $m \in [1, 500]$, and $\delta \in [0.1, 100]$ via grid search.

Greedy stochastic gradient descent (SGD). Starting from $\theta_0 = \mathbf{1}_d/\sqrt{d}$, we run the updates

$$\theta_{t+1} = \mathrm{Proj}_1\big(\theta_t - \eta_t\,\nabla_\theta \ell(z_t; \theta_t)\big)$$

with step size $\eta_t = c_0/(t+1)$ and $z_t \sim D(\theta_t)$. The step-size parameter $c_0 \in [10^{-4}, 10]$ and the batch size $m \in [1, 500]$ are selected via grid search. The greedy SGD algorithm neglects the implicit dependence of $z$ on $\theta$ due to performativity, and therefore typically converges to suboptimal points.

Figure 4. Excess risk (top) and accuracy (bottom) versus $n$ for plug-in performative optimization, the DFO algorithm, and greedy SGD, on the credit data set. We display $\pm 1$ standard deviation, logarithmically scaled.

Performative gradient descent (PerfGD). PerfGD runs stochastic gradient descent on the performative risk using an estimated performative gradient. Assume the distribution map has the form $z \sim D(\theta) \iff z \stackrel{d}{=} \mathcal{N}(f(\theta), \sigma^2 I_d)$, where $f$ is some unknown smooth function. Starting from $\theta_0 = \mathbf{1}_d/\sqrt{d}$, we first run the greedy SGD updates for $H$ burn-in steps. Next, we run SGD on the performative risk using an estimated performative gradient, namely,

$$\theta_{t+1} = \mathrm{Proj}_1\Big(\theta_t - \eta_t\,\widehat{\nabla}_\theta\,\mathbb{E}[\ell(z_t; \theta_t)]\Big),$$

with step size $\eta_t = c_0/(t+1)$ and $z_t \sim D(\theta_t)$, where the estimated performative gradient is computed as in Algorithm 3 and Eq. (2) of Izzo et al. (2021), via numerically estimating the gradient of $f$. We choose the number of burn-in steps $H = 10d$. The step-size parameter $c_0 \in [10^{-4}, 10]$ and the batch size $m \in [1, 500]$ are selected via grid search. It should be noted that the numerical approximation of the gradient of $f$ is unstable when $d > 1$, which results in the suboptimal performance of PerfGD in our location-family experiment.
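For reference, here is a minimal sketch of the DFO and greedy SGD baselines described above. The callables `deploy_and_sample`, `loss`, and `loss_grad` are hypothetical placeholders for deploying a model and drawing samples from $D(\theta)$, evaluating the loss, and evaluating its gradient in $\theta$; the hyperparameters $c_0$, $m$, and $\delta$ correspond to the grid-searched quantities above.

```python
import numpy as np

def proj_unit_ball(x):
    """Euclidean projection onto the unit ball {v : ||v|| <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def dfo(deploy_and_sample, loss, d, T, c0=0.01, m=50, delta=1.0, seed=0):
    """DFO baseline: single-point gradient estimator of the performative risk."""
    rng = np.random.default_rng(seed)
    theta = np.ones(d) / np.sqrt(d)
    for t in range(T):
        eta = c0 / (t + 1)
        grad_est = np.zeros(d)
        for _ in range(m):
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)                          # u ~ Unif(S^{d-1})
            z = deploy_and_sample(theta + delta * u, 1)[0]  # z ~ D(theta + delta * u)
            grad_est += u * loss(z, theta + delta * u) * d / delta
        grad_est /= m
        theta = proj_unit_ball(theta - eta * grad_est)
    return theta

def greedy_sgd(deploy_and_sample, loss_grad, d, T, c0=0.1, m=1):
    """Greedy SGD baseline: ignores the dependence of D(theta) on theta."""
    theta = np.ones(d) / np.sqrt(d)
    for t in range(T):
        eta = c0 / (t + 1)
        zs = deploy_and_sample(theta, m)                    # z_t ~ D(theta_t)
        grad = np.mean([loss_grad(z, theta) for z in zs], axis=0)
        theta = proj_unit_ball(theta - eta * grad)
    return theta
```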
C. Solving for $\hat\theta_{\mathrm{PO}}$

The map $D_{\hat\beta}$ belongs to the distribution atlas chosen by the learner, and as such it is fully specified and known to the learner. Therefore, solving for $\hat\theta_{\mathrm{PO}} = \arg\min_{\theta\in\Theta} \mathbb{E}_{z\sim D_{\hat\beta}(\theta)}[\ell(z; \theta)]$ can only incur error due to computational inaccuracies. There is no additional statistical complexity (i.e., no dependence on $n$), which is the focus of our excess-risk bounds in Theorem 3.1. In a sense, our results can be thought of as analogous to classical generalization bounds for empirical risk minimizers: we are concerned with characterizing the performance of the empirical risk minimizer, not with computational strategies for finding it.

More practically, there are several approaches one can take to compute $\hat\theta_{\mathrm{PO}}$. Sometimes $\hat\theta_{\mathrm{PO}}$ has a closed-form expression, as in Example 1; in such cases there is no error in Step 3 of Algorithm 1. Sometimes $D_{\hat\beta}(\theta)$ and $\ell(z; \theta)$ are simple enough that $\mathrm{PR}_{\hat\beta}(\theta)$ has a closed-form expression; in such cases, we compute $\hat\theta_{\mathrm{PO}}$ by running gradient descent on $\mathrm{PR}_{\hat\beta}(\theta)$. This is the case in all our experiments. Alternatively, if $\mathrm{PR}_{\hat\beta}(\theta)$ does not have a closed-form expression, one may compute $\hat\theta_{\mathrm{PO}}$ by applying a black-box optimizer to an unbiased estimate of $\mathbb{E}_{z\sim D_{\hat\beta}(\theta)}[\ell(z; \theta)]$ obtained by drawing many i.i.d. samples from $D_{\hat\beta}(\theta)$ (the right algorithm depends on what we know about the problem; generically, we can always use DFO (Flaxman et al., 2004)). Since these samples are all synthetic and do not count toward the sample complexity (they require no real data collection, only simulation), we can draw arbitrarily many samples to achieve a small numerical error.
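To illustrate the last approach, the following is a minimal sketch, assuming a hypothetical sampler `sample_from_map(beta_hat, theta, n_mc, rng)` for the fitted map $D_{\hat\beta}(\theta)$ and a loss function `loss(z, theta)`: it forms a Monte-Carlo estimate of the plug-in performative risk from synthetic samples and minimizes it with a simple random search over $\Theta = \{\theta : \|\theta\| \le 1\}$. Any black-box optimizer, e.g., DFO, could be used in place of the random search.

```python
import numpy as np

def plugin_pr(theta, beta_hat, sample_from_map, loss, n_mc=10_000, rng=None):
    """Monte-Carlo estimate of PR_{beta_hat}(theta) = E_{z ~ D_{beta_hat}(theta)}[loss(z, theta)].

    All samples are synthetic draws from the fitted map, so n_mc can be made large."""
    rng = np.random.default_rng(0) if rng is None else rng
    zs = sample_from_map(beta_hat, theta, n_mc, rng)
    return np.mean([loss(z, theta) for z in zs])

def random_search_theta(beta_hat, sample_from_map, loss, d,
                        n_candidates=2000, n_mc=10_000, seed=0):
    """Black-box minimization of the plug-in performative risk over the unit ball."""
    rng = np.random.default_rng(seed)
    best_theta, best_val = None, np.inf
    for _ in range(n_candidates):
        # Draw a candidate uniformly from the unit ball: random direction times radius U^{1/d}.
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        theta = u * rng.uniform() ** (1.0 / d)
        val = plugin_pr(theta, beta_hat, sample_from_map, loss, n_mc, rng)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```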