# Stochastic Optimization for Performative Prediction

Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, Moritz Hardt
University of California, Berkeley
{mendler,jcperdomo,tijana.zrnic,hardt}@berkeley.edu

In performative prediction, the choice of a model influences the distribution of future data, typically through actions taken based on the model's predictions. We initiate the study of stochastic optimization for performative prediction. What sets this setting apart from traditional stochastic optimization is the difference between merely updating model parameters and deploying the new model. The latter triggers a shift in the distribution that affects future data, while the former keeps the distribution as is. Assuming smoothness and strong convexity, we prove rates of convergence for both greedily deploying models after each stochastic update (greedy deploy) as well as for taking several updates before redeploying (lazy deploy). In both cases, our bounds smoothly recover the optimal $O(1/k)$ rate as the strength of performativity decreases. Furthermore, they illustrate how, depending on the strength of performative effects, there exists a regime where either approach outperforms the other. We experimentally explore the trade-off on both synthetic data and a strategic classification simulator.

Equal contribution. MH is a paid consultant for Twitter.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## 1 Introduction

Prediction in the social world is often performative in that a prediction triggers actions that influence the outcome. A forecast about the spread of a disease, for example, can lead to drastic public health action aimed at deterring the spread of the disease. In hindsight, the forecast might then appear to have been off, but this may largely be due to the actions taken based on it. Performativity arises naturally in consequential statistical decision-making problems in domains ranging from financial markets to online advertising.

Recent work [17] introduced and formalized performative prediction, an extension of the classical supervised learning setup whereby the choice of a model can change the data-generating distribution. This perspective leads to an important notion of stability requiring that a model is optimal on the distribution it entails. Stability prevents a certain cat-and-mouse game in which the learner repeatedly updates a model because it is no longer accurate on the observed data. Prior work established conditions under which stability can be achieved through repeated risk minimization on the full data-generating distribution.

When samples arrive one-by-one over time, however, the learner faces a new challenge compared with traditional stochastic optimization. With every new sample that arrives, the learner has to decide whether to deploy the model, thereby triggering a drift in distribution, or to continue to collect more samples from the same distribution. Never deploying a model avoids distribution shift, but forgoes the possibility of converging to a stable point. Deploying the model too greedily could lead to overwhelming distribution shift that hampers convergence. In fact, it is not even clear that fast convergence to stability is possible at all in an online stochastic setting.

### 1.1 Our contributions

In this work, we initiate the study of stochastic optimization for performative prediction.
Our main results are the first convergence guarantees for the stochastic gradient method in performative settings. Previous finite-sample guarantees had an exponential dependence on the dimension. We distinguish between two natural variants of the stochastic gradient method. One variant, called greedy deploy, updates model parameters and deploys the model at every step, after seeing a single example. The other, called lazy deploy, updates model parameters on multiple samples before deploying a model. We show that both methods converge to a stable solution. However, which one is preferable depends both on the cost of model deployment and the strength of performativity.

To state our results more precisely, we recall the formal setup of performative prediction. In performative prediction, we assume that after deploying a model parameterized by $\theta$, data are drawn from the distribution $D(\theta)$. The distribution map $D(\cdot)$ maps model parameters to data-generating distributions. Given a loss function $\ell(z; \theta)$, a performatively stable model satisfies the fixed-point condition
$$\theta \in \arg\min_{\theta'}\, \mathbb{E}_{z \sim D(\theta)}\, \ell(z; \theta').$$
Performative stability expresses the desideratum that the model minimizes loss on the distribution $D(\theta)$ that it entails. Once we have found a performatively stable model, we therefore have no reason to deviate from it based on the data that we observe.

The stochastic gradient method in this setting operates in a sequence of rounds. In each round $k$, the algorithm starts from a model $\theta_k$ and can choose to perform $n(k)$ stochastic gradient updates, where each data point is drawn i.i.d. from the distribution $D(\theta_k)$. After $n(k)$ stochastic gradient updates, the algorithm deploys the new model parameters $\theta_{k+1}$. Henceforth, the data-generating distribution is $D(\theta_{k+1})$ and the algorithm proceeds to the next round. For greedy deploy, $n(k) = 1$ for all $k$, whereas for lazy deploy $n(k)$ is a hyperparameter we can choose freely.

To analyze the stochastic gradient method, we import the same assumptions that were used in prior work on performative prediction. Apart from smoothness and strong convexity of the loss function, the main assumption is that the distribution map is sufficiently Lipschitz. This means that a small change to the model parameters (in Euclidean distance) leads to a small change in the data-generating distribution (as measured in the Wasserstein metric).

Our first main result shows that under these assumptions, greedy deploy achieves the same convergence rate as the stochastic gradient method in the absence of performativity.

**Theorem 1.1** (Greedy deploy, informal). If the loss is smooth and strongly convex and the distribution map is sufficiently Lipschitz, greedy deploy converges to performative stability at rate $O(1/k)$, where $k$ is the number of model deployment steps.

Generally speaking, the Lipschitz parameter has to be smaller than the inverse condition number of the loss function for our bound to guarantee convergence. The exact rate stated in Theorem 3.2 further improves as the Lipschitz constant tends to 0.

In many realistic scenarios, data are plentiful, but deploying a model in a large production environment is costly. In such a scenario, it makes sense to minimize the number of model deployment steps by updating the model parameters on multiple data points before initiating another model deployment. This is precisely what lazy deploy accomplishes, as our next result shows.

**Theorem 1.2** (Lazy deploy, informal).
Under the same assumptions as above, for any $\alpha > 0$, lazy deploy converges to performative stability at rate $O(1/k^{\alpha})$, provided that $O(k^{1.1\alpha})$ samples are collected between deployments $k$ and $k+1$.

In particular, this shows that any distance from optimality $\delta > 0$ can be achieved with $(1/\delta)^{c}$ model deployments for an arbitrarily small $c > 0$, at the cost of collecting polynomially many samples in $1/\delta$.

Our main theorems provide upper bounds on the convergence rate of each method. As such, they draw only an incomplete picture of the relative performance of these methods. Our empirical investigation therefore aims to shed further light on their relative merits. In particular, our experiments show that greedy deploy generally performs better than lazy deploy when the distribution map is Lipschitz with a small constant, i.e., the performative effects are weak. Conversely, lazy deploy fares better when the Lipschitz constant is large. These observations are consistent with what our theoretical upper bounds suggest.

### 1.2 Related work

Perdomo et al. [17] introduced the performative prediction framework and analyzed algorithms for finding stable points that operate at the population level. While they also analyze some finite-sample extensions of these procedures, their analysis relies on concentration of the empirical distribution around the true distribution in the Wasserstein metric, and hence requires sample complexity exponential in the dimension. In contrast, our analysis ensures convergence even if the learner collects a single sample at every step.

There has been a long line of work [3, 4, 5, 6, 12] within the learning theory community studying concept drift and learning from drifting distributions. Our results differ from these previous works since in performative prediction, changes in distribution are not a passive feature of the environment, but rather an active consequence of model deployment. This introduces several new considerations, such as the conceptual idea of performative stability, which is the main focus of our investigation.

Our work draws upon ideas from the stochastic convex optimization literature [7, 16, 18, 20, 21, 22]. Relative to these previous studies, our work analyzes the behavior of the stochastic gradient method in performative settings, where the underlying objective changes as a response to model deployment.

Lastly, we can view instances of performative prediction as special cases of reinforcement learning problems with nice structure, such as a Lipschitz mapping from policy parameters to the induced distribution over trajectories (see [17] for further discussion). The variants of the stochastic gradient method we consider can be viewed as policy gradient-like algorithms [1, 10, 23, 24] for this setting.

## 2 Preliminaries

We start by reviewing the core concepts of the framework of performative prediction. Afterwards, we set the stage for our analysis of stochastic algorithms by first considering gradient descent at the population level. In doing so, we highlight some of the fundamental limitations of gradient descent in performative settings.

### 2.1 The framework of performative prediction

Throughout our presentation, we focus on predictive models $f_\theta$ that are parametrized by a vector $\theta \in \Theta \subseteq \mathbb{R}^d$, where the parameter space $\Theta$ is a closed, convex set. The model or classifier, $f_\theta$, maps instances $z \in \mathbb{R}^m$ to predictions $f_\theta(z)$. Typically, we think of $z$ as being a feature, label pair $(x, y)$. We assess the quality of a classifier $f_\theta$ via a loss function $\ell(z; \theta)$.
The key theme in performative prediction is that the choice of deployed model $f_\theta$ influences the future data distribution and hence the expected loss of the classifier $f_\theta$. This behavior is formalized via the notion of a distribution map $D(\cdot)$, which is the key conceptual device of the framework. For every $\theta \in \Theta$, $D(\theta)$ denotes the distribution over instances $z$ induced by the deployment of $f_\theta$.

In this paper, we consider the setting where at each step, the learner observes a single sample $z \sim D(\theta)$, where $f_\theta$ is the most recently deployed classifier. After having observed this sample, the learner chooses whether to deploy a new model or to leave the distribution as is before collecting the next sample.

We adopt the following Lipschitzness assumption on the distribution map. It captures the idea that if two classifiers make similar predictions, then they also induce similar distributions.

**Definition 1** ($\varepsilon$-sensitivity [17]). A distribution map $D(\cdot)$ is $\varepsilon$-sensitive if for all $\theta, \theta' \in \Theta$:
$$W_1\bigl(D(\theta), D(\theta')\bigr) \le \varepsilon\, \|\theta - \theta'\|_2,$$
where $W_1$ denotes the Wasserstein-1, or earth mover's distance.

The value of $\varepsilon$ indicates the strength of performative effects; small $\varepsilon$ means that the distribution induced by the model $f_\theta$ is not overly sensitive to the choice of $\theta$, while large $\varepsilon$ indicates high sensitivity. As an extreme case, $\varepsilon = 0$ implies $D(\theta) = D(\theta')$ for all $\theta, \theta' \in \Theta$ and hence there are no performative effects, as in classical supervised learning.

Given how the choice of classifier induces a change in distribution, a naturally appealing property of a predictive model in performative settings is that it achieves minimal risk on the distribution that it induces. This solution concept is referred to as performative stability.

**Definition 2** (Performative stability). A classifier $f_{\theta_{PS}}$ is performatively stable if
$$\theta_{PS} \in \arg\min_{\theta}\, \mathbb{E}_{z \sim D(\theta_{PS})}\, \ell(z; \theta).$$
We refer to $\theta_{PS}$ as being performatively stable, or simply stable, if $f_{\theta_{PS}}$ is performatively stable.

Performative stability captures an equilibrium state in which a classifier induces a shift in distribution by the environment, yet remains simultaneously optimal for this new distribution. These solutions are referred to as stable since they eliminate the need for retraining. Besides eliminating the need for retraining, performatively stable classifiers were also shown to have nearly optimal predictive power on the distribution they induce. More specifically, stable points approximately minimize the performative risk, $PR(\theta) = \mathbb{E}_{z \sim D(\theta)}\, \ell(z; \theta)$, in the case of a strongly convex loss and a reasonably small sensitivity parameter $\varepsilon$ (Theorem 4.3, [17]).

To illustrate these abstract concepts, we instantiate a simple traffic prediction example with performative effects which will serve as a running example throughout the paper.

**Example 1** (ETA estimation). Suppose that each day we want to estimate the duration of a trip on a fixed route from the current weather conditions. Let $x \in \{0, 1\}$ denote a binary indicator of whether the current day is sunny or rainy, and suppose that $\Pr\{x = 1\} = p \in (0, 1)$. Let $f_\theta$ denote the deployed classifier which predicts trip duration $y$ from $x$. Assume $y$ behaves according to the following model:
$$y = \mu + w\,x - \varepsilon\,(f_\theta(x) - \mu),$$
where $\mu > 0$ denotes the usual time needed to complete the route on a sunny day, $w > 0$ denotes additional incurred time due to bad weather, and $-\varepsilon\,(f_\theta(x) - \mu)$ denotes the performative effects, for some $\varepsilon \in (0, 1)$. Namely, if the model predicts a faster than usual time to the destination, more people want to take the route, thus worsening the traffic conditions and making $y$ large. If, on the other hand, the model predicts a longer trip, then few people follow the route and $y$ is smaller.

Suppose that the model class is all predictors of the form $f_\theta(x) = \theta_1 x + \theta_2$, where $\theta = (\theta_1, \theta_2)$ and $\theta_1 \in (0, w)$, $\theta_2 \in (0, 2\mu)$. It is not hard to see that the distribution map corresponding to this data-generating process is $\varepsilon$-sensitive. Assume that we measure predictive performance according to the squared loss, $\ell((x, y); \theta) = \frac{1}{2}(y - \theta_1 x - \theta_2)^2$. Then, a simple calculation reveals that the unique performatively stable classifier, satisfying Definition 2, corresponds to $\theta_{PS} = \bigl(\tfrac{w}{1+\varepsilon},\, \mu\bigr)$. In fact, one can show that $\theta_{PS}$ is simultaneously optimal in the sense that it minimizes the performative risk,
$$\theta_{PS} = \arg\min_{\theta} PR(\theta) = \arg\min_{\theta}\, \mathbb{E}_{(x,y) \sim D(\theta)}\, \ell((x, y); \theta).$$
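Since the closed form of $\theta_{PS}$ is not spelled out in the text above, here is a short verification of it under the stated data-generating model (a reconstruction following the fixed-point definition of stability, not a quotation of the original derivation):

```latex
% Under D(\theta): x \sim \mathrm{Bernoulli}(p) and, exactly,
% y = \mu + w x - \varepsilon(\theta_1 x + \theta_2 - \mu), so the least-squares
% fit on D(\theta) recovers this affine relation between y and x:
G(\theta) \;=\; \arg\min_{\theta'} \mathbb{E}_{(x,y)\sim D(\theta)}\,
   \tfrac{1}{2}\bigl(y - \theta_1' x - \theta_2'\bigr)^2
   \;=\; \bigl(w - \varepsilon\theta_1,\;\; \mu - \varepsilon(\theta_2 - \mu)\bigr).

% Imposing the fixed-point condition \theta_{PS} = G(\theta_{PS}) gives
\theta_{PS,1} = w - \varepsilon\,\theta_{PS,1} \;\Longrightarrow\; \theta_{PS,1} = \tfrac{w}{1+\varepsilon},
\qquad
\theta_{PS,2} = \mu - \varepsilon\,(\theta_{PS,2} - \mu) \;\Longrightarrow\; \theta_{PS,2} = \mu.
```

Both coordinates lie in the constraint set since $\varepsilon \in (0,1)$, and plugging $\theta_{PS}$ back into the residual shows it attains zero performative risk, which is why it is simultaneously performatively optimal here.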
### 2.2 Population-level results

Before analyzing optimization algorithms in stochastic settings, we first consider their behavior at the population level. Throughout our analysis, we make the following assumptions on the loss $\ell(z; \theta)$, which hold for broad classes of objectives. To ease readability, we let $Z = \bigcup_{\theta \in \Theta} \mathrm{supp}(D(\theta))$.

**(A1)** (joint smoothness) A loss function $\ell(z; \theta)$ is $\beta$-jointly smooth if the gradient $\nabla \ell(z; \theta)$ is $\beta$-Lipschitz in $\theta$ and $z$, that is, for all $\theta, \theta' \in \Theta$ and $z, z' \in Z$ it holds that
$$\|\nabla \ell(z; \theta) - \nabla \ell(z; \theta')\|_2 \le \beta\, \|\theta - \theta'\|_2 \quad \text{and} \quad \|\nabla \ell(z; \theta) - \nabla \ell(z'; \theta)\|_2 \le \beta\, \|z - z'\|_2.$$
(Gradients of the loss are always taken with respect to the parameters $\theta$.)

**(A2)** (strong convexity) A loss function $\ell(z; \theta)$ is $\gamma$-strongly convex if for all $\theta, \theta' \in \Theta$ and $z \in Z$ it holds that
$$\ell(z; \theta) \ge \ell(z; \theta') + \nabla \ell(z; \theta')^\top (\theta - \theta') + \frac{\gamma}{2}\,\|\theta - \theta'\|_2^2.$$
For $\gamma = 0$, this condition is equivalent to convexity.

We will refer to $\beta/\gamma$, where $\beta$ is as in (A1) and $\gamma$ as in (A2), as the condition number.

In this paper we are interested in the convergence of optimization methods to performatively stable classifiers. However, unlike classical risk minimizers in supervised learning, it is not a priori clear that performatively stable classifiers always exist. We thus recall the following fact.

**Fact 2.1** ([17]). Assume that the loss is $\beta$-jointly smooth (A1) and $\gamma$-strongly convex (A2). If $D(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < \frac{\gamma}{\beta}$, then there exists a unique performatively stable point $\theta_{PS} \in \Theta$.

We note that it is not possible to reduce the sensitivity $\varepsilon$ by merely rescaling the problem while keeping the ratio $\gamma/\beta$ the same; the critical condition $\varepsilon\beta/\gamma < 1$ remains unaltered by scaling. (The reason is that the notion of joint smoothness we consider does not scale like strong convexity when rescaling $\theta$. For example, rescaling $\theta \mapsto 2\theta$, and thus making $\varepsilon \mapsto \varepsilon/2$, would downscale the strong convexity parameter and the parameter corresponding to the usual notion of smoothness in optimization by 4; however, the smoothness in $z$ would downscale only by 2.)

The upper bound $\varepsilon < \gamma/\beta$ on the sensitivity parameter is not only crucial for the existence of unique stable points but also for algorithmic convergence. It defines a regime outside which gradient descent is not guaranteed to converge even at the population level. To be more precise, consider repeated gradient descent (RGD), defined recursively as
$$\theta_{k+1} = \theta_k - \eta_k\, \mathbb{E}_{z \sim D(\theta_k)}\bigl[\nabla \ell(z; \theta_k)\bigr], \quad k \ge 1,$$
where $\theta_1 \in \Theta$ is initialized arbitrarily. As shown in the following result, RGD need not converge if $\varepsilon \ge \gamma/\beta$. Furthermore, a strongly convex loss is necessary to ensure convergence, even if performative effects are arbitrarily weak.

**Proposition 2.2.** Suppose that the distribution map $D(\cdot)$ is $\varepsilon$-sensitive. Repeated gradient descent can fail to converge in any of the following cases, for any choice of positive step size sequence $\{\eta_k\}$:
(a) The loss is $\beta$-jointly smooth (A1) and convex, but not strongly convex (A2), for any $\beta, \varepsilon > 0$.
(b) The loss is $\beta$-jointly smooth (A1) and $\gamma$-strongly convex (A2), but $\varepsilon \ge \gamma/\beta$, for any $\gamma, \beta, \varepsilon > 0$.

On the other hand, if $\varepsilon < \gamma/\beta$ we prove that RGD converges to a unique performatively stable point at a linear rate. Proposition 2.3 strengthens the corresponding result of Perdomo et al. [17], who showed linear convergence of RGD for $\varepsilon < \gamma/(\gamma + \beta)$. Proofs can be found in Appendix C.

**Proposition 2.3.** Assume that the loss is $\beta$-jointly smooth (A1) and $\gamma$-strongly convex (A2), and suppose that the distribution map $D(\cdot)$ is $\varepsilon$-sensitive. Let $\varepsilon < \gamma/\beta$, and suppose that $\theta_{PS} \in \mathrm{Int}(\Theta)$. Then, repeated gradient descent (RGD) with a constant step size $\eta_k = \frac{\gamma - \varepsilon\beta}{2(1 + \varepsilon^2)\beta^2}$ converges to the stable point $\theta_{PS}$ at a linear rate:
$$\|\theta_{k+1} - \theta_{PS}\|_2 \le \delta \quad \text{for} \quad k \ge \frac{4(1 + \varepsilon^2)\beta^2}{(\gamma - \varepsilon\beta)^2}\, \log\bigl(\|\theta_1 - \theta_{PS}\|_2 / \delta\bigr).$$
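As a concrete illustration of Proposition 2.3 (not part of the original text), the following minimal Python sketch runs RGD on a toy problem where the population gradient is available in closed form; the Gaussian distribution map and all constants are illustrative choices, mirroring the synthetic example of Section 4.1.

```python
# Toy problem: squared loss l(z; theta) = 0.5 * (z - theta)^2 and
# distribution map D(theta) = N(mu + eps * theta, sigma^2), so that
# E_{z ~ D(theta)}[grad l(z; theta)] = theta - (mu + eps * theta).
mu, eps = 10.0, 0.5            # eps < gamma / beta = 1, inside the convergence regime
gamma = beta = 1.0             # strong convexity and smoothness of the squared loss

def population_grad(theta):
    """Exact gradient of the expected loss on D(theta), evaluated at theta."""
    return theta - (mu + eps * theta)

def repeated_gradient_descent(theta, steps=500):
    # Constant step size from Proposition 2.3: (gamma - eps*beta) / (2 (1 + eps^2) beta^2).
    eta = (gamma - eps * beta) / (2.0 * (1.0 + eps**2) * beta**2)
    for _ in range(steps):
        theta -= eta * population_grad(theta)
    return theta

theta_ps = mu / (1.0 - eps)    # closed-form stable point of this toy problem
print(repeated_gradient_descent(0.0), theta_ps)   # the two numbers should nearly coincide
```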
Together, these results show that $\gamma/\beta$ is a sharp threshold on the sensitivity $\varepsilon$ for the convergence of gradient descent in performative settings, thereby resolving an open problem due to Perdomo et al. [17]. Having characterized the convergence regime of gradient descent, we now move on to presenting our main technical results, focusing on the case of a smooth, strongly convex loss with $\varepsilon < \gamma/\beta$.

## 3 Stochastic optimization results

We introduce two variants of the stochastic gradient method for optimization in performative settings (i.e., stochastic gradient descent, SGD), which we refer to as greedy deploy and lazy deploy. Each method performs a stochastic gradient update to the model parameters at every iteration; however, they deploy these updated models at different time intervals. To analyze these methods, in addition to (A1) and (A2), we make the following assumption, which is customary in the stochastic optimization literature [7, 19].

**(A3)** (second moment bound) There exist constants $\sigma^2$ and $L^2$ such that for all $\theta, \theta' \in \Theta$:
$$\mathbb{E}_{z \sim D(\theta)}\, \|\nabla \ell(z; \theta')\|_2^2 \le \sigma^2 + L^2\, \|\theta' - G(\theta)\|_2^2, \quad \text{where} \quad G(\theta) \stackrel{\text{def}}{=} \arg\min_{\theta'}\, \mathbb{E}_{z \sim D(\theta)}\, \ell(z; \theta').$$

Given the operator $G(\cdot)$, performative stability can equivalently be expressed as $\theta_{PS} \in G(\theta_{PS})$.

**Greedy Deploy**
- Input: step size sequence $\{\eta_k\}_{k=1}^{\infty}$
- Deploy initial classifier $\theta_1 \in \Theta$
- For each $k = 1, 2, \ldots$:
  - Observe $z^{(k)} \sim D(\theta_k)$
  - Update model parameters: $\theta_{k+1} = \theta_k - \eta_k \nabla \ell(z^{(k)}; \theta_k)$
  - Deploy $\theta_{k+1}$

**Lazy Deploy**
- Input: step size sequence $\{\eta_{k,j}\}_{k,j=1}^{\infty}$
- Deploy initial classifier $\theta_1 \in \Theta$
- For each $k = 1, 2, \ldots$:
  - Set $\varphi_{k,1} = \theta_k$
  - For each $j = 1, \ldots, n(k)$:
    1. Observe $z^{(k)}_j \sim D(\theta_k)$
    2. Update model parameters: $\varphi_{k,j+1} = \varphi_{k,j} - \eta_{k,j} \nabla \ell(z^{(k)}_j; \varphi_{k,j})$
  - Deploy $\theta_{k+1} = \varphi_{k, n(k)+1}$

Figure 1: Stochastic gradient method for performative prediction. Greedy deploy publishes the new classifier at every step, while lazy deploy performs several gradient updates before releasing the new model.
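For readers who prefer code, the two procedures in Figure 1 translate directly into the following schematic Python sketch. This is an illustration, not the authors' implementation: `sample(theta)` stands for whatever mechanism draws $z \sim D(\theta)$ under the currently deployed model, and `grad_loss(z, theta)` returns $\nabla \ell(z; \theta)$.

```python
def greedy_deploy(theta, sample, grad_loss, step_size, num_rounds):
    """Deploy after every stochastic gradient step (left panel of Figure 1)."""
    for k in range(1, num_rounds + 1):
        z = sample(theta)                      # z^(k) ~ D(theta_k): data react to the deployed model
        theta = theta - step_size(k) * grad_loss(z, theta)
        # deploying theta here is what shifts the distribution for the next sample
    return theta


def lazy_deploy(theta, sample, grad_loss, step_size, num_rounds, n):
    """Take n(k) gradient steps on samples from D(theta_k) before redeploying (right panel)."""
    for k in range(1, num_rounds + 1):
        phi = theta                            # phi_{k,1} = theta_k
        for j in range(1, n(k) + 1):
            z = sample(theta)                  # samples still come from the last *deployed* model
            phi = phi - step_size(k, j) * grad_loss(z, phi)
        theta = phi                            # deploy theta_{k+1} = phi_{k, n(k)+1}
    return theta
```

The only difference between the two procedures is whether the sampling distribution tracks every update (greedy) or only the deployed iterate (lazy).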
### 3.1 Greedy deploy

A natural algorithm for stochastic optimization in performative prediction is a direct extension of the stochastic gradient method, whereby at every time step, we observe a sample $z^{(k)} \sim D(\theta_k)$, compute a gradient update to the current model parameters $\theta_k$, and deploy the new classifier $\theta_{k+1}$ (see left panel in Figure 1). We call this algorithm greedy deploy. In the context of our previous traffic prediction example, this greedy procedure corresponds to iteratively updating and redeploying the model based off information from the most recent trip. While this procedure is algorithmically identical to the stochastic gradient method in traditional convex optimization, in performative prediction, the distribution of the observed samples depends on the trajectory of the algorithm.

We begin by stating a technical lemma which introduces a recursion for the distance between $\theta_k$ and $\theta_{PS}$.

**Lemma 3.1.** Assume (A1), (A2) and (A3). If the distribution map $D(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < \gamma/\beta$, then greedy deploy with step size $\eta_k$ satisfies the following recursion for all $k \ge 1$:
$$\mathbb{E}\,\|\theta_{k+1} - \theta_{PS}\|_2^2 \le \bigl(1 - 2\eta_k(\gamma - \varepsilon\beta) + 2\eta_k^2 L^2\bigr)\, \mathbb{E}\,\|\theta_k - \theta_{PS}\|_2^2 + 2\eta_k^2 \sigma^2.$$

Similar recursions underlie many proofs of SGD, and Lemma 3.1 can be seen as their generalization to the performative setting. Furthermore, we see how the bound implies a strong contraction to the performatively stable point if the performative effects are weak, that is, when $\varepsilon \ll \gamma/\beta$. Using this recursion, a simple induction argument suffices to prove that greedy deploy converges to the performatively stable solution (see Appendix D). Moreover, it does so at the usual $O(1/k)$ rate.

**Theorem 3.2.** Assume (A1), (A2) and (A3). If the distribution map $D(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < \frac{\gamma}{\beta}$, then for all $k \ge 0$ greedy deploy with step size $\eta_k = \bigl((\gamma - \varepsilon\beta)k + 8L^2/(\gamma - \varepsilon\beta)\bigr)^{-1}$ satisfies
$$\mathbb{E}\,\|\theta_{k+1} - \theta_{PS}\|_2^2 \le \frac{M_{\text{greedy}}}{(\gamma - \varepsilon\beta)^2\, k + 8L^2}, \quad \text{where} \quad M_{\text{greedy}} = \max\bigl\{2\sigma^2,\; 8L^2\, \|\theta_1 - \theta_{PS}\|_2^2\bigr\}.$$

Comparing this result to the traditional analysis of SGD for smooth, strongly convex objectives (e.g. [18]), we see that the traditional factor of $\gamma$ is replaced by $\gamma - \varepsilon\beta$, which we view as the effective strong convexity parameter of the performative prediction problem. When $\varepsilon = 0$, there are no performative effects and the problem of finding the stable solution reduces to that of finding the risk minimizer on a fixed, static distribution. Consequently, it is natural for the two bounds to coincide.

### 3.2 Lazy deploy

Contrary to greedy deploy, lazy deploy collects multiple data points and hence takes multiple stochastic gradient steps between consecutive model deployments. In the setting from Example 1, this corresponds to observing the traffic conditions across multiple days, and potentially diverse conditions, before deploying a new model. This modification significantly changes the trajectory of lazy deploy relative to greedy deploy, given that the observed samples follow the distribution of the last deployed model, which might differ from the current iterate. More precisely, after deploying $\theta_k$, we perform $n(k)$ stochastic gradient steps on the model parameters, using samples from $D(\theta_k)$, before we deploy the last iterate as $\theta_{k+1}$ (see right panel in Figure 1).

At a high level, lazy deploy converges to performative stability because it progressively approximates repeated risk minimization (RRM), defined recursively as
$$\theta_{k+1} = \arg\min_{\theta'}\, \mathbb{E}_{z \sim D(\theta_k)}\, \ell(z; \theta') \quad \text{for } k \ge 1,$$
with $\theta_1 \in \Theta$ initialized arbitrarily. Perdomo et al. [17] show that RRM converges to a performatively stable classifier at a linear rate when $\varepsilon < \gamma/\beta$.

Since the underlying distribution remains static between deployments, a classical analysis of SGD shows that for large $n(k)$ these "offline" iterates $\varphi_{k,j}$ converge to the risk minimizer on the distribution corresponding to the previously deployed classifier. In particular, for large $n(k)$, $\theta_{k+1} \approx G(\theta_k)$. By virtue of approximately tracing out the trajectory of RRM, lazy deploy converges to $\theta_{PS}$ as well. This sketch is formalized in the following theorem. For details we refer to Appendix E.

**Theorem 3.3.** Assume (A1), (A2), and (A3), and that the distribution map $D(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < \frac{\gamma}{\beta}$. For any $\alpha > 0$, running lazy deploy with $n(k) \ge n_0 k^{\alpha}$, $k = 1, 2, \ldots$, many steps between deployments and step size sequence $\eta_{k,j} = (\gamma j + 8L^2/\gamma)^{-1}$ satisfies
$$\mathbb{E}\,\|\theta_{k+1} - \theta_{PS}\|_2^2 \le c^k\, \|\theta_1 - \theta_{PS}\|_2^2 + \frac{M_{\text{lazy}}}{k^{\alpha(1 - o(1))}},$$
where $c = \bigl(\tfrac{\varepsilon\beta}{\gamma}\bigr)^2 + o(1)$ and $M_{\text{lazy}} = \frac{3(\sigma + \gamma)^2}{\gamma^2(1 - c)}$. Here, $o(1)$ is independent of $k$ and vanishes as $n_0$ grows; $n_0$ is chosen large enough such that $c < 1$.
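As a usage note (illustrative, with placeholder constants), the schedules appearing in Theorems 3.2 and 3.3 plug directly into the `greedy_deploy` and `lazy_deploy` sketches given after Figure 1; `sample` and `grad_loss` are the same assumed problem-specific callables as before.

```python
import math

# Placeholder problem constants; in practice they come from the loss and distribution map.
gamma, beta, eps_sens, L = 1.0, 1.0, 0.5, 1.0
n0, alpha = 5, 1.0

def greedy_step(k):
    # Theorem 3.2: eta_k = ((gamma - eps*beta) k + 8 L^2 / (gamma - eps*beta))^(-1)
    gap = gamma - eps_sens * beta               # effective strong convexity
    return 1.0 / (gap * k + 8.0 * L**2 / gap)

def lazy_samples(k):
    # Theorem 3.3: n(k) >= n0 * k^alpha samples between deployments k and k+1
    return math.ceil(n0 * k**alpha)

def lazy_step(k, j):
    # Theorem 3.3: eta_{k,j} = (gamma j + 8 L^2 / gamma)^(-1), independent of k
    return 1.0 / (gamma * j + 8.0 * L**2 / gamma)

# e.g. theta = lazy_deploy(theta0, sample, grad_loss, lazy_step, num_rounds=50, n=lazy_samples)
```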
### 3.3 Discussion

In this section, we have presented how varying the intervals at which we deploy models trained with stochastic gradient descent in performative prediction leads to qualitatively different algorithms. While greedy deploy resembles classical SGD with a step size sequence adapted to the strength of distribution shift, lazy deploy can be viewed as a rough approximation of repeated risk minimization.

As we alluded to previously, the convergence behavior of both algorithms is critically affected by the strength of performative effects $\varepsilon$. For $\varepsilon \ll \gamma/\beta$, the effective strong convexity parameter $\gamma - \varepsilon\beta$ of the performative prediction problem is large. In this setting, the relevant distribution shift of deploying a new model is negligible and greedy deploy behaves almost like SGD in classical supervised learning, converging quickly to performative stability. Conversely, for $\varepsilon$ close to the convergence threshold, the contraction of greedy deploy to the performatively stable classifier is weak. In this regime, we expect lazy deploy to perform better, since the convergence of the offline iterates $\varphi_{k,j}$ to the risk minimizer on the current distribution $G(\theta_k)$ is unaffected by the value of $\varepsilon$. Lazy deploy then converges by closely mimicking the behavior of RRM.

Furthermore, both algorithms differ in their sensitivity to different initializations. In greedy deploy, the initial distance $\|\theta_1 - \theta_{PS}\|_2^2$ decays polynomially, while in lazy deploy it decays at a linear rate. This suggests that the lazy deploy algorithm is more robust to poor initialization. While we derive these insights purely by inspecting our upper bounds, we find that these heuristic observations also hold empirically, as shown in the next section.

In terms of the asymptotics of both algorithms, we identify the following tradeoff between the number of samples and the number of deployments sufficient to converge to performative stability.

**Corollary 3.4.** Assume (A1), (A2), and (A3), and that $D(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < \frac{\gamma}{\beta}$. To ensure that greedy deploy returns a solution $\theta^{\star}$ such that $\mathbb{E}\,\|\theta^{\star} - \theta_{PS}\|_2^2 \le \delta$, it suffices to collect $O(1/\delta)$ samples and to deploy $O(1/\delta)$ classifiers. To achieve the same guarantee using lazy deploy, it suffices to collect $O\bigl((1/\delta)^{\frac{\alpha+1}{\alpha\omega}}\bigr)$ samples and to deploy $O\bigl((1/\delta)^{\frac{1}{\alpha}}\bigr)$ classifiers, for any $\alpha > 0$ and some $\omega = 1 - o(1)$ which tends to 1 as $n_0$ grows.

Figure 2: Convergence of lazy and greedy deploy to performative stability for varying values of $\varepsilon$: (a) $\varepsilon = 0.2$, (b) $\varepsilon = 0.6$, (c) $\varepsilon = 0.9$. We use $n(k) = k$ for lazy deploy. The results are for the synthetic Gaussian example with $\mu = 10$, $\sigma = 0.1$.

We see from the above result that by choosing large enough values of $n_0$ and $\alpha$, we can make the sample complexity of the lazy deploy algorithm come arbitrarily close to that of greedy deploy. However, to match the same convergence guarantee, lazy deploy only performs $O\bigl((1/\delta)^{1/\alpha}\bigr)$ deployments, which is significantly better than the $O(1/\delta)$ deployments for greedy. This reduction in the number of deployments is particularly relevant when considering the settings that performative prediction is meant to address. Whenever we use prediction in social settings, there are important social costs associated with making users adapt to a new model [15]. Furthermore, in industry, there are often significant technical challenges associated with deploying a new classifier [2]. By choosing $n(k) = n_0 k^{\alpha}$ appropriately, we can reduce the number of deployments necessary for lazy deploy to converge while at the same time improving the sample complexity of the algorithm.
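For a rough sense of scale, the following back-of-the-envelope computation plugs concrete numbers into the orders stated in Corollary 3.4; constants and the $\omega$ factor are ignored, so the numbers are only indicative.

```python
# Illustrative deployment/sample counts implied by the rates in Corollary 3.4,
# ignoring constants and setting omega = 1.
delta = 1e-4
print(f"greedy: ~{1/delta:.0f} samples and ~{1/delta:.0f} deployments")
for alpha in (1.0, 2.0, 4.0):
    deployments = (1 / delta) ** (1 / alpha)
    samples = (1 / delta) ** ((alpha + 1) / alpha)
    print(f"lazy (alpha={alpha}): ~{deployments:.0f} deployments, ~{samples:.0f} samples")
```

The pattern matches the discussion above: lazy deploy trades a polynomial increase in samples for a polynomially smaller number of deployments, and the sample overhead shrinks as $\alpha$ grows.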
## 4 Experiments

We complement our theoretical analysis of greedy and lazy deploy with a series of empirical evaluations. First, we carry out experiments using synthetic data where we can analytically compute stable points and carefully evaluate the tradeoffs suggested by our theory. Second, we evaluate the performance of these procedures on a strategic classification simulator previously used as a benchmark for optimization in performative settings by [17].

### 4.1 Synthetic data

For our first experiment, we consider the task of estimating the mean of a Gaussian random variable under performative effects. In particular, we consider minimizing the expected squared loss $\ell(z; \theta) = \frac{1}{2}(z - \theta)^2$, where $z \sim D(\theta) = \mathcal{N}(\mu + \varepsilon\theta, \sigma^2)$. For $\varepsilon > 0$, the true mean of the distribution $D(\theta)$ depends on our revealed estimate $\theta$. Furthermore, for $\varepsilon < \gamma/\beta = 1$, the problem has a unique stable point. A short algebraic manipulation shows that $\theta_{PS} = \frac{\mu}{1 - \varepsilon}$. As per our theory, both greedy and lazy deploy converge to performative stability for all $\varepsilon < 1$.

Effect of performativity. We compare the convergence behavior of lazy deploy and greedy deploy for various values of $\varepsilon$ in Figure 2. We choose step sizes for both algorithms according to our theorems in Section 3. In the case of lazy deploy, we set $\alpha = 1$, and hence $n(k) \propto k$. We see that when performative effects are weak, i.e., $\varepsilon \ll \gamma/\beta$, greedy deploy outperforms lazy. Lazy deploy in turn is better at coping with large distribution shifts from strong performative effects. These results confirm the conclusions from our theory and show that the choice of greedy vs lazy deployment can indeed have a large impact on algorithm performance depending on the value of $\varepsilon$.

Deployment schedules. We also experiment with different deployment schedules $n(k)$ for lazy deploy. As described in Theorem 3.3, we can choose $n(k) \propto k^{\alpha}$ for any $\alpha > 0$. The results for $\alpha \in \{0.5, 1, 2\}$ and $\varepsilon \in \{0.2, 0.6, 0.9\}$ are depicted in Figure 4 in the Appendix. We find that shorter deployment schedules, i.e., smaller $\alpha$, lead to faster progress during initial stages of the optimization, whereas longer deployment schedules fare better in the long run while at the same time significantly reducing the number of deployments.
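This synthetic setup is easy to reproduce by combining the sketches from Section 3. The snippet below is an illustrative re-implementation, not the authors' experiment code; in particular, the step sizes are simplified rather than the exact theorem schedules.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps = 10.0, 0.1, 0.6           # D(theta) = N(mu + eps * theta, sigma^2)
theta_ps = mu / (1.0 - eps)               # closed-form stable point

def sample(theta):
    return rng.normal(mu + eps * theta, sigma)

def grad_loss(z, theta):                  # gradient of 0.5 * (z - theta)^2 in theta
    return theta - z

def greedy(theta, rounds):
    for k in range(1, rounds + 1):        # simplified Theta(1/k) step size
        theta -= grad_loss(sample(theta), theta) * 2.0 / ((1.0 - eps) * (k + 1))
    return theta

def lazy(theta, rounds, n0=1, alpha=1.0):
    for k in range(1, rounds + 1):
        phi, n_k = theta, int(np.ceil(n0 * k**alpha))
        for j in range(1, n_k + 1):       # samples drawn from the last deployed model theta_k
            phi -= grad_loss(sample(theta), phi) / (j + 1)
        theta = phi                       # deploy
    return theta

# Both distances to the stable point should shrink toward zero.
print(abs(greedy(0.0, 1000) - theta_ps), abs(lazy(0.0, 60) - theta_ps))
```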
Figure 3: Convergence of lazy and greedy deploy to performative stability. Results are for the strategic classification experiments with $\varepsilon = 100$. (Left panel) convergence as a function of the number of samples. (Right panel) convergence as a function of the number of deployments.

### 4.2 Strategic classification

In addition to the experiments on synthetic data, we also evaluate the performance of the two optimization procedures in a simulated strategic classification setting. Strategic classification is a two-player game between an institution which deploys a classifier $f_\theta$ and individual agents who manipulate their features in order to achieve a more favorable classification. Perdomo et al. [17] introduce a credit scoring simulator in which a bank deploys a logistic regression classifier to determine the probability that an individual will default on a loan. Individuals correspond to feature, label pairs $(x, y)$ drawn from a Kaggle credit scoring dataset [9]. Given the bank's choice of a classifier $f_\theta$, individuals solve an optimization problem to compute the best-response set of features, $x_{BR}$. This optimization procedure is parameterized by a value $\varepsilon$ which determines the extent to which agents can change their respective features. The bank then observes the manipulated data points $(x_{BR}, y)$.

This data-generating process can be described by a distribution map, which we can verify is $\varepsilon$-sensitive. For additional details we refer to Appendix A. At each time step, the learner observes a single sample from the distribution in which the individual's features have been manipulated in response to the most recently deployed classifier. This is in contrast to the experimental setup in [17], where the learner gets to observe the entire distribution of manipulated features at every step. While we cannot compute the stable point analytically in this setting, we can calculate it empirically by running RRM until convergence.

Results. The inverse condition number of this problem is much smaller than in the Gaussian example; we have $\gamma/\beta \approx 10^{-2}$. We first pick $\varepsilon$ within the regime of provable convergence, i.e., $\varepsilon = 10^{-3}$, and compare the two methods. As expected, for such a small value of $\varepsilon$, greedy deploy is the preferred method. Results are depicted in Figure 6 in the Appendix. Furthermore, we explore the behavior of these algorithms outside the regime of provable convergence, with $\varepsilon \gg \gamma/\beta$. We choose step sizes for both algorithms as defined in Section 3, with the exception that we ignore the $\varepsilon$-dependence in the step size schedule of greedy deploy and choose the same initial step size as for lazy deploy (Theorem 3.2). As illustrated in Figure 3 (left), lazy significantly outperforms greedy deploy in this setting. Moreover, the performance of lazy deploy significantly improves with $\alpha$. In addition to speeding up convergence, choosing larger sample collection schedules $n(k)$ substantially reduces the number of deployments, as seen in Figure 3 (right).

## Funding disclosure

This research was supported by an NSF CAREER award (#1750555). MH is a paid consultant for Twitter. He was previously a paid consultant for Google. CMD is supported by the Swiss National Science Foundation Postdoc Mobility Fellowship Program. JCP is in part supported by the U.S. National Science Foundation Graduate Research Fellowship Program.

## Broader impact statement

The motivation for studying performative prediction comes from the observation that whenever we use supervised learning in social settings, we almost never make predictions for prediction's sake, but rather to inform decision making within some broader context [11]. Banks predict default risks to decide to whom they will allocate loans. Commuters use estimated time of arrival (ETA) prediction to choose which route to take to work. Governments predict crime rates to decide how to deploy police forces [8, 13]. In each of these settings, our choice of predictive model leads to changes in the way the broader system behaves and hence in the distribution over observed data.

Our work introduces optimization procedures for finding classifiers with good predictive performance for these performative settings. Here, we use "good" to indicate that these models are accurate. However, as is clear from examples in recent history, the societal impacts of having an accurate model depend on the context in which prediction is used, and the intent of the system designer. As a society, we can benefit from having robust and reliable systems to forecast congestion in cities, yet it would be remiss of us to overlook how these advances could also be used to infringe upon civil liberties. As a subfield of learning theory, performative prediction is only just starting to receive attention from the community and papers in this area are largely theoretical in nature.
Therefore, much remains to be seen in terms of the broader impact of these ideas. We eagerly welcome feedback and comments from members of the community as to how we may ensure the success of this research agenda.

## References

[1] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. Volume 125 of Proceedings of Machine Learning Research, pages 64–66. PMLR, 09–12 Jul 2020.
[2] Algorithmia. 2020 State of Enterprise Machine Learning. 2020. https://info.algorithmia.com/hubfs/2019/Whitepapers/The-State-of-Enterprise-ML-2020/Algorithmia_2020_State_of_Enterprise_ML.pdf.
[3] Peter L. Bartlett. Learning with a Slowly Changing Distribution. In Proceedings of the Conference on Computational Learning Theory (COLT), pages 243–252, 1992.
[4] Peter L. Bartlett, Shai Ben-David, and Sanjeev R. Kulkarni. Learning Changing Concepts by Exploiting the Structure of Change. Machine Learning, 41(2):153–174, 2000.
[5] Rakesh D. Barve and Philip M. Long. On the Complexity of Learning from Drifting Distributions. Information and Computation, 138(2):170–193, 1997.
[6] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-Stationary Stochastic Optimization. Operations Research, 63(5):1227–1244, 2015.
[7] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, 2018.
[8] Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway Feedback Loops in Predictive Policing. In Proceedings of the 1st ACM Conference on Fairness, Accountability and Transparency, pages 160–171, 2018.
[9] Kaggle. Give Me Some Credit. https://www.kaggle.com/c/GiveMeSomeCredit/data.
[10] Sham M. Kakade. A Natural Policy Gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
[11] Amanda Kube, Sanmay Das, and Patrick J. Fowler. Allocating Interventions Based on Predicted Outcomes: A Case Study on Homelessness Services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 622–629, 2019.
[12] Anthony Kuh, Thomas Petsche, and Ronald L. Rivest. Learning Time-Varying Concepts. In Advances in Neural Information Processing Systems (NIPS), pages 183–189, 1991.
[13] Kristian Lum and William Isaac. To Predict and Serve? Significance, 13(5):14–19, 2016.
[14] John Miller, Chloe Hsu, Jordan Troutman, Juan Perdomo, Tijana Zrnic, Lydia Liu, Yu Sun, Ludwig Schmidt, and Moritz Hardt. WhyNot, 2020.
[15] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The Social Cost of Strategic Classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239. Association for Computing Machinery, 2019.
[16] Eric Moulines and Francis R. Bach. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS), pages 451–459, 2011.
[17] Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative Prediction. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
[18] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1571–1578, 2012.
[19] Benjamin Recht and Stephen J. Wright. Optimization for Modern Data Analysis. 2019. Preprint available at http://eecs.berkeley.edu/~brecht/opt4mlbook.
[20] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[21] Tom Schaul, Sixin Zhang, and Yann LeCun. No More Pesky Learning Rates. In Proceedings of the International Conference on Machine Learning (ICML), volume 28, pages 343–351, 2013.
[22] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic Convex Optimization. In Proceedings of the Conference on Computational Learning Theory (COLT), 2009.
[23] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[24] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4):229–256, 1992.