Regret Minimization with Performative Feedback

Meena Jagadeesan¹  Tijana Zrnic¹  Celestine Mendler-Dünner²

¹University of California, Berkeley  ²Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Meena Jagadeesan, Tijana Zrnic, Celestine Mendler-Dünner <cmendler@tuebingen.mpg.de>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.

1. Introduction

Predictive models deployed in social settings are often performative. This means that the model's predictions, by means of being used to inform consequential decisions, influence the outcomes the model aims to predict in the first place. For example, travel time estimates influence routing decisions and thus realized travel times; stock price predictions influence trading activity and hence prices. Such feedback-loop behavior arises in a variety of domains, including public policy, trading, traffic predictions, and recommendation systems.

Perdomo et al. (2020) formalized this phenomenon under the name performative prediction. A key concept in this framework is the distribution map, which formalizes the dependence of the data distribution on the deployed predictive model. This object maps a model, encoded by a parameter vector θ, to a distribution D(θ) over instances. Naturally, in a performative environment, a model's performance is measured on the distribution that results from its deployment. That is, given a loss function ℓ(z;θ), which measures the learner's loss when they predict on instance z using model θ, we evaluate a model based on its performative risk:

    PR(θ) := E_{z∼D(θ)} ℓ(z;θ).    (1)

In contrast with the risk function studied in usual supervised learning, the performative risk takes an expectation over a model-dependent distribution. Importantly, this distribution is unknown ahead of time; for example, one can hardly anticipate the distribution of travel times induced by a traffic forecasting system without deploying the system first. Due to this inherent uncertainty about D(θ), it is not possible to find a model with low performative risk offline. The learner needs to interact with the environment and deploy models θ to explore the induced distributions D(θ).
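To make the objective in Equation (1) concrete, here is a minimal simulation sketch. The Gaussian location-shift distribution map, the clipped squared loss, and all names are illustrative assumptions rather than constructs from the paper; the point is only that the expectation defining PR(θ) is taken over a distribution that itself depends on θ, so the risk landscape cannot be evaluated without deploying models.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_distribution_map(theta, n, sensitivity=0.5):
    """Illustrative distribution map D(theta): a mean shift proportional to theta.

    In the paper D(theta) is unknown to the learner; this toy map exists only
    so that PR(theta) can be simulated.
    """
    return rng.normal(loc=sensitivity * theta, scale=1.0, size=n)

def loss(z, theta):
    """A bounded loss ell(z; theta) in [0, 1], chosen purely for illustration."""
    return np.clip((z - theta) ** 2, 0.0, 1.0)

def performative_risk(theta, n=200_000):
    """Monte Carlo approximation of PR(theta) = E_{z ~ D(theta)} ell(z; theta)."""
    z = sample_from_distribution_map(theta, n)
    return float(loss(z, theta).mean())

# The risk of each model is only revealed by (simulated) deployment:
for theta in (0.0, 0.5, 1.0):
    print(f"theta={theta:.1f}  PR(theta) ~= {performative_risk(theta):.3f}")
```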
Given the online nature of this task, we measure the loss incurred by deploying a sequence of models θ_1,...,θ_T by evaluating the performative regret:

    Reg(T) = Σ_{t=1}^{T} ( E[PR(θ_t)] − min_θ PR(θ) ),

where the expectation is taken over the possible randomness in the choice of {θ_t}_{t=1}^{T}. Performative regret measures the suboptimality of the deployed sequence of models relative to a performative optimum θ_PO ∈ argmin_θ PR(θ).

At first glance, performative regret minimization might seem equivalent to a classical bandit problem. Bandit solutions minimize regret while requiring only noisy zeroth-order access to the unknown reward function, in our case PR. The resulting regret bounds generally grow with some notion of complexity of the reward function.

However, a naive application of bandit baselines misses out on a crucial fact: performative regret minimization exhibits significantly richer feedback than bandit feedback. When deploying a model θ, the learner gains access to samples from the induced distribution D(θ), rather than only a noisy estimate of the risk PR(θ). We call this feedback model performative feedback. Together with the fact that the learner knows the loss ℓ(z;θ), performative feedback can be used to inform the reward of unexplored arms. For instance, it allows the computation of an unbiased estimate of E_{z∼D(θ)} ℓ(z;θ′) for any point θ′.

To illustrate the power of this feedback model, consider the limiting case in which the performative effects entirely vanish and the distribution map is constant, i.e. D(θ) ≡ D* for some fixed distribution D* independent of θ. With zeroth-order feedback, the learner would still need to deploy different models to explore the landscape of PR and find a point with low risk. However, with performative feedback, a single deployment gives samples from D*, thus resolving all uncertainty in the objective (1) apart from finite-sample uncertainty. This raises the question: with performative feedback, can one achieve regret bounds that scale only with the complexity of the distribution map, and not that of the performative risk?

1.1. Our Contribution

We study the problem of performative regret minimization based on performative feedback. Our main contribution is performative regret bounds that scale primarily with the complexity of the distribution map. The key conceptual idea is to apply bandits tools to carefully explore the distribution map, and then propagate this knowledge to the objective (1) in order to minimize performative regret.

Performative confidence bounds algorithm. Our main focus is on a setting where the distribution map is Lipschitz in an appropriate sense. We propose a new algorithm that takes advantage of performative feedback in order to construct non-trivial confidence bounds on the performative risk in unexplored regions of the parameter space and thus guide exploration. A crucial implication of these bounds is that the algorithm can discard highly suboptimal regions of the parameter space without ever deploying a model nearby. We summarize the regret guarantee of our performative confidence bounds algorithm:

Theorem 1.1 (Informal). Suppose that the distribution map D(θ) is ϵ-Lipschitz and that the loss ℓ(z;θ) is L_z-Lipschitz in z. Then, after T deployments, the performative confidence bounds algorithm achieves a regret bound of

    Reg(T) = O( √T + T^{(d+1)/(d+2)} (L_zϵ)^{d/(d+2)} ),

where d denotes the zooming dimension of the problem.

We compare the bound in Theorem 1.1 to a baseline Lipschitz bandits regret bound.
The concept of zooming dimension stems from the work of Kleinberg et al. (2008) and serves as an instance-dependent notion of dimensionality. Kleinberg et al. showed that sublinear regret O( T^{(d′+1)/(d′+2)} L^{d′/(d′+2)} ) can be achieved if the reward function is L-Lipschitz, where d′ is a zooming dimension. The performative risk can be guaranteed to be Lipschitz if the distribution map is Lipschitz and the loss ℓ(z;θ) is Lipschitz in both arguments.

The primary benefit of Theorem 1.1 is that our regret bound scales only with the Lipschitz constant of the distribution map, rather than the Lipschitz constant of the performative risk. In particular, our result allows PR(θ) to be highly irregular as a function of θ, seeing that ℓ(z;θ) as a function of θ is unconstrained. This difference becomes salient when ϵ → 0, meaning that the performative effects vanish: our regret bound grows as O(√T) in an essentially dimension-independent manner. More precisely, the dimension can only arise implicitly through a model class complexity term. On the other hand, the rate of classical Lipschitz bandits remains exponential in the dimension.

Another difference between our regret bound and that of Lipschitz bandits is in the zooming dimension. In particular, d′ is a zooming dimension no smaller than the zooming dimension d we obtain in Theorem 1.1. As we will elaborate on in later sections, the benefit we derive from the zooming dimension comes from the fact that it implicitly depends on the Lipschitz constant driving the objective, which is smaller when making full use of performative feedback.

Extension to location families. In addition, we study performative regret minimization for the special case where the distribution map has a location family form (Miller et al., 2021). We again prove regret bounds that scale only with the complexity of the distribution map, rather than the complexity of the performative risk. We adapt the LinUCB algorithm (Li et al., 2010) to learn the hidden parameters of the location family. This enables us to achieve O(√T) regret without placing any strong convexity assumptions on the performative risk that are required in prior work (Miller et al., 2021). In particular, our result again allows PR(θ) to be highly irregular as a function of θ. More broadly, our work establishes a connection between performative prediction and the bandits literature, which we believe is a worthwhile direction for future inquiry.

Consequences for finding performative optima. While we have contextualized our work within online regret minimization, our performative confidence bounds algorithm has the additional property that it converges to the set of performative optima. Thus, if run for sufficiently many time steps, it generates a model with near-minimal performative risk. From this perspective, our algorithm offers superior guarantees over retraining methods (e.g., (Perdomo et al., 2020)), which are unlikely to reach minima of the performative risk.

1.2. Related Work

Performative prediction. Prior work on performative prediction has largely studied gradient-based optimization methods (Perdomo et al., 2020; Mendler-Dünner et al., 2020; Drusvyatskiy and Xiao, 2020; Brown et al., 2022; Miller et al., 2021; Izzo et al., 2021; Maheshwari et al., 2022; Li and Wai, 2022; Ray et al., 2022; Dong and Ratliff, 2021). Many of the studied procedures only converge to performatively stable points, that is, points θ that satisfy the fixed-point condition θ ∈ argmin_{θ′} E_{z∼D(θ)} ℓ(z;θ′).
In general, stable points are not minimizers of the performative risk (Perdomo et al., 2020; Miller et al., 2021), which implies that procedures converging to stable points do not achieve sublinear performative regret. There are exceptions in the literature that focus on finding performative optima (Miller et al., 2021; Izzo et al., 2021), but those algorithms rely on proving or assuming convexity of the performative risk; in this work we make no convexity assumptions. In fact, it is known that the performative risk can be nonconvex even when the loss ℓ(z;θ) is convex and the performative effects are relatively weak (Perdomo et al., 2020; Miller et al., 2021). One other work that studies performative optimality, without imposing convexity, is that of Dong and Ratliff (2021), but they focus on optimization heuristics that are not guaranteed to minimize performative regret.

Learning in Stackelberg games. Performative prediction is closely related to learning in Stackelberg games: if D(θ) is thought of as a best response to the deployment of θ according to some unspecified utility function, then performative optima can be thought of as Stackelberg equilibria. There have been many works on learning dynamics in Stackelberg games in recent years (Balcan et al., 2015; Jin et al., 2020; Fiez et al., 2020; Fiez and Ratliff, 2020). Notably, Balcan et al. (2015) also study the benefit of a richer feedback model: they assume the agent's type is revealed after taking an action. When combined with a known agent-response model, this allows them to directly infer the loss of unexplored strategies. In contrast, performative feedback does not imply full-information feedback. One instance of performative prediction that has an explicit Stackelberg structure, meaning D(θ) is defined as a best response, is strategic classification (Hardt et al., 2016). Several works have studied learning dynamics in strategic classification (Dong et al., 2018; Chen et al., 2020; Bechavod et al., 2021; Zrnic et al., 2021); notably, Dong et al. (2018) and Chen et al. (2020) provide solutions that minimize Stackelberg regret, of which performative regret is an analog in the performative prediction context. However, all of these works rely on strong structural assumptions, such as linearity of the predictor or convexity of the risk function, which significantly reduce the amount of necessary exploration compared to the mild Lipschitzness conditions we impose in our work.

Continuum-armed bandits. Particularly inspiring for our work is the literature on continuum-armed bandits (Agrawal, 1995; Kleinberg, 2004; Auer et al., 2007; Kleinberg et al., 2008; Podimata and Slivkins, 2021). As we will elaborate on in Section 2, performative prediction can be cast as a Lipschitz continuum-armed bandit problem. However, while this means that one can use an off-the-shelf Lipschitz bandit algorithm to minimize performative regret, this would generally be a conservative solution. After pulling an arm θ in performative prediction, the learner observes samples from D(θ). As explained earlier, in combination with the structure of our objective, this feedback model is more powerful than classical bandit feedback, where a noisy version of the mean reward at θ is observed. Moreover, it is fundamentally different from partial-feedback and side-information models studied in the literature, e.g. (Mannor and Shamir, 2011; Kocák et al., 2014; Wu et al., 2015; Cohen et al., 2016).
1.3. Preliminaries

Performative prediction, set up as an online learning problem, can be formalized as follows. At every step the learner chooses a model θ in the parameter space Θ ⊆ R^{d_Θ}. We assume¹ max{‖θ‖ : θ ∈ Θ} ≤ 1 for simplicity. The expected loss of model θ is given by PR(θ) = E_{z∼D(θ)} ℓ(z;θ). We assume that the objective function is bounded so that ℓ(z;θ) ∈ [0,1] for all z and θ.

At every time step t, the learner chooses a model θ_t and observes a constant number m_0 of i.i.d. samples, {z_t^{(i)}}_{i∈[m_0]}, where z_t^{(i)} ∼ D(θ_t). The regret incurred by choosing θ_t at time step t is Δ(θ_t) := PR(θ_t) − PR(θ_PO), where θ_PO is the performative optimum. The constant m_0 quantifies how many samples the learner can collect in a time window determined by how often they incur regret. For example, at the beginning of each week the learner might update the model, and thus at the end of each week they incur regret for the model they chose to deploy. In that case, m_0 is the number of samples the learner collects per week. Note that a learner with larger m_0 collects an empirical distribution that more accurately reflects D(θ_t) and thus naturally minimizes regret at a faster rate.

To formally disentangle the effects of the parameter vector θ on the performative risk through the distribution map and the loss function, we use the notion of the decoupled performative risk (Perdomo et al., 2020):

    DPR(θ,θ′) := E_{z∼D(θ)} ℓ(z;θ′).

This object captures the risk incurred by a model θ′ on the distribution D(θ). Note that PR(θ) = DPR(θ,θ) by definition.

¹Throughout we use ‖·‖ to denote the ℓ2-norm for vectors and the operator norm for matrices.

To measure the complexity of the distribution map we consider how much the distribution D(θ) can change with changes in θ, as formalized by ϵ-sensitivity.

Assumption 1.2 (ϵ-sensitivity (Perdomo et al., 2020)). A distribution map D(·) is ϵ-sensitive if for any pair θ,θ′ ∈ Θ it holds that W(D(θ),D(θ′)) ≤ ϵ‖θ − θ′‖, where W denotes the Wasserstein-1 distance.

In the context of a traffic forecasting app, ϵ can be thought of as being proportional to the size of the user base of the app. When D(θ) arises from the aggregate behavior of strategic agents manipulating their features in response to a model θ, ϵ grows when features are more easily manipulable.

2. A Black-Box Bandits Approach

Performative regret minimization can be set up as a continuum-armed bandits problem where an arm corresponds to a choice of model parameters θ. Performative feedback is sufficient to simulate noisy zeroth-order feedback about the reward function, as assumed in bandits. When we deploy θ_t, the samples from D(θ_t) enable us to compute an unbiased estimate

    \widehat{PR}(θ_t) = (1/m_0) Σ_{i=1}^{m_0} ℓ(z_t^{(i)};θ_t)

of the risk PR(θ_t). Moreover, since we assume the loss function is bounded, the noise in the estimate \widehat{PR}(θ_t) is subgaussian, as typically required in bandits.
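As a small illustration of how performative feedback simulates bandit feedback, the helper below turns the m_0 samples observed after deploying θ_t into the plug-in estimate \widehat{PR}(θ_t); the bounded loss and the hard-coded samples in the example are assumptions for illustration, not the paper's.

```python
import numpy as np

def pr_hat(samples, theta, loss):
    """Plug-in estimate (1/m0) * sum_i loss(z_i; theta) of PR(theta),
    computed from the m0 samples z_i ~ D(theta) observed after deploying theta."""
    z = np.asarray(samples, dtype=float)
    return float(np.mean(loss(z, theta)))

# Illustrative bounded loss and fake observed samples (both assumptions):
bounded_loss = lambda z, theta: np.clip((z - theta) ** 2, 0.0, 1.0)
print(pr_hat([0.2, -0.1, 0.4, 0.9], theta=0.0, loss=bounded_loss))
```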
A standard condition that makes continuum-armed bandit problems tractable is a bound on how fast the reward can change when moving from one arm to a nearby arm. Formally, this regularity is ensured by assuming Lipschitzness of the reward function, in our case Lipschitzness of the performative risk. The dependence of PR(θ) on θ is twofold, as seen in Equation (1). Thus, the most natural way to ensure that PR(θ) is Lipschitz is to ensure that each of these two dependencies is Lipschitz. This yields the following bound:

Lemma 2.1 (Lipschitzness of PR). If the loss ℓ(z;θ) is L_z-Lipschitz in z and L_θ-Lipschitz in θ and the distribution map is ϵ-sensitive, then the performative risk is (L_θ + ϵL_z)-Lipschitz.

The intuition behind Lemma 2.1 is that PR(θ) is guaranteed to be Lipschitz if DPR(θ,θ′) is Lipschitz in each argument individually. Lipschitzness in the second argument follows from requiring that the loss be Lipschitz in θ. Lipschitzness in the first argument follows from combining Lipschitzness of the loss in z and ϵ-sensitivity of the distribution map.

2.1. Adaptive Zooming

Once we have established Lipschitzness of the performative risk, we can apply techniques from the Lipschitz bandits literature. Kleinberg et al. (2008) proposed a bandit algorithm that adaptively discretizes promising regions of the space of arms, using Lipschitzness of the reward function to bound the additional loss due to discretization. Their method, called the zooming algorithm, will serve as a baseline for our problem. The algorithm enjoys an instance-dependent regret that takes advantage of nice problem instances, while maintaining tight guarantees in the worst case. The rate depends on the zooming dimension, which is upper bounded in the worst case by the dimension of the full space d_Θ.

Proposition 2.2 (Zooming algorithm (Kleinberg et al., 2008)). Suppose m_0 = o(log T). Then, after T deployments, the zooming algorithm achieves a regret bound of

    Reg(T) = O( T^{(d+1)/(d+2)} ((log T)/m_0)^{1/(d+2)} (L_θ + ϵL_z)^{d/(d+2)} ),

where d denotes the (L_θ + ϵL_z)-zooming dimension.

The zooming dimension quantifies the niceness of a problem instance by measuring the size of a covering of near-optimal arms, instead of the entire parameter space. Roughly speaking, if the reward function is very flat in that there are many near-optimal points, then the zooming dimension is close to the dimension d_Θ of the parameter space. However, if the reward has sufficient curvature, then the zooming dimension can be much smaller than d_Θ. The zooming dimension is defined formally as follows:

Definition 2.3 (α-zooming dimension). A performative prediction problem instance has α-zooming dimension equal to d if any minimal s-cover of any subset of {θ : Δ(θ) ≤ 16αs} includes at most a constant multiple of (3/s)^d elements from {θ : 16αr ≤ Δ(θ) < 32αr}, for all 0 < r ≤ s ≤ 1.

For well-behaved instances, the definition intuitively requires every minimal s-cover of {θ : 16αr ≤ Δ(θ) < 32αr} to have size at most of order (3/s)^d. Definition 2.3 slightly differs from the definition presented in (Kleinberg et al., 2008) and makes the dependence on the Lipschitz constant explicit; we use Definition 2.3 to later ease the comparison to our new algorithm. The differences between the two definitions are minor technicalities that we do not expect to alter the zooming dimension in a meaningful way, neither formally nor conceptually. See Appendix E for a discussion.

Figure 1. Confidence bounds after deploying θ_1 and θ_2. (left) Confidence bounds via Lipschitzness, Equation (2). (right) Performative confidence bounds, Equation (3). The performative feedback model used for this illustration can be found in Appendix G.

3. Making Use of Performative Feedback

In this section, we illustrate how we can take advantage of performative feedback beyond computing a point estimate of the deployed model's risk. For now, we ignore finite-sample considerations and assume access to the entire distribution D(θ) after deploying a model θ.
We will address finite-sample uncertainty when presenting our main algorithm in the next section.

3.1. Constructing Performative Confidence Bounds

First, we demonstrate how performative feedback allows constructing tighter confidence bounds on the performative risk of unexplored models, compared to only relying on Lipschitzness of the risk function PR(θ). Suppose we deploy a set of models S ⊆ Θ and for each θ ∈ S we observe D(θ). Then, under the regularity conditions of Lemma 2.1, we can bound the risk of any θ′ ∈ Θ as

    max_{θ∈S} [ PR(θ) − (L_θ + L_zϵ)‖θ − θ′‖ ] ≤ PR(θ′) ≤ min_{θ∈S} [ PR(θ) + (L_θ + L_zϵ)‖θ − θ′‖ ].    (2)

These confidence bounds only use D(θ) for the purpose of computing PR(θ) and rely on Lipschitzness to construct confidence sets around the risk of unexplored models. However, in light of the structure of our objective function (1), the bounds in Equation (2) do not make full use of performative feedback; in particular, access to D(θ) actually allows us to evaluate DPR(θ,θ′) for any θ′. Importantly, this information can further reduce our uncertainty about PR(θ′), and we can bound:

    PR(θ′) = DPR(θ,θ′) + (DPR(θ′,θ′) − DPR(θ,θ′)) ≤ DPR(θ,θ′) + L_zϵ‖θ − θ′‖.

Thus we can get tighter bounds on the performative risk at an unexplored parameter θ′:

    max_{θ∈S} [ DPR(θ,θ′) − L_zϵ‖θ − θ′‖ ] ≤ PR(θ′) ≤ min_{θ∈S} [ DPR(θ,θ′) + L_zϵ‖θ − θ′‖ ].    (3)

We call the confidence bounds computed in (3) performative confidence bounds. In Figure 1, we visualize and contrast these confidence bounds with the confidence bounds obtained via Lipschitzness. We observe that by computing DPR we can significantly tighten the confidence regions.

Figure 2. Performative feedback allows discarding unexplored suboptimal models even in regions that have not been explored. A model θ is discarded if PR_LB(θ) > PR_min. The loss function and feedback model are the same as in Figure 1.

The tightness of the confidence bounds depends on the set S of deployed models. By choosing a cover of the parameter space, we can get an estimate of the performative risk that has low approximation error on the whole parameter space.

Proposition 3.1. Let S_γ be a γ-cover of Θ and suppose we deploy all models θ ∈ S_γ. Then, using performative feedback we can compute an estimate \widehat{PR}(θ) such that for any θ ∈ Θ it holds that |PR(θ) − \widehat{PR}(θ)| ≤ γL_zϵ.

Proposition 3.1 implies that after exploring the cover S_γ, we can find a model whose suboptimality is at most O(γL_zϵ). To contextualize the bound in Proposition 3.1, consider an approach that uses the same cover S_γ but only relies on zeroth-order feedback, that is, {PR(θ) : θ ∈ S_γ}. Then, the only feasible estimate of PR over the whole space is \widehat{PR}(θ) = PR(Π_{S_γ}(θ)), where Π_{S_γ}(θ) = argmin_{θ′∈S_γ} ‖θ − θ′‖ is the projection onto the cover S_γ. This zeroth-order approach only guarantees an accuracy of |PR(θ) − \widehat{PR}(θ)| ≤ (L_zϵ + L_θ)γ, a strictly weaker approximation than the one in Proposition 3.1.

3.2. Sequential Elimination of Suboptimal Models

Now we show how performative confidence bounds can guide exploration. Specifically, we show that every deployment informs the risk of unexplored models, which allows us to sequentially discard suboptimal regions of the space. To develop a formal procedure for discarding points, let PR_LB(θ) denote a lower confidence bound on PR(θ) and PR_min denote an upper confidence bound on PR(θ_PO) based on the information from the models deployed so far:

    PR_LB(θ) = max_{θ′ already deployed} ( DPR(θ′,θ) − L_zϵ‖θ − θ′‖ ),
    PR_min = min_{θ∈Θ} min_{θ′ already deployed} ( DPR(θ′,θ) + L_zϵ‖θ − θ′‖ ).
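The sketch below computes these quantities directly from Equation (3): it forms estimates of DPR(θ′,·) from the samples observed for each deployed model and applies the resulting bounds PR_LB and PR_min, together with the elimination rule PR_LB(θ) > PR_min discussed next. The finite candidate grid standing in for Θ, the one-dimensional parameters, and the interface of `deployed` (a mapping from deployed parameters to their samples) are all illustrative assumptions; finite-sample slack is ignored here, as in this section.

```python
import numpy as np

def dpr_hat(samples, theta_prime, loss):
    """Estimate of DPR(theta, theta') from samples z ~ D(theta)."""
    return float(np.mean(loss(np.asarray(samples, dtype=float), theta_prime)))

def performative_bounds(deployed, target, lz_eps, loss):
    """Performative confidence bounds of Equation (3) on PR(target).

    `deployed` maps each deployed parameter theta -> its observed samples.
    Returns (PR_LB(target), upper confidence bound on PR(target)).
    """
    lowers, uppers = [], []
    for theta, samples in deployed.items():
        d = dpr_hat(samples, target, loss)
        slack = lz_eps * abs(target - theta)  # plays the role of Lz*eps*||theta - theta'||
        lowers.append(d - slack)
        uppers.append(d + slack)
    return max(lowers), min(uppers)

def is_discardable(deployed, target, grid, lz_eps, loss):
    """Elimination rule: PR_LB(target) > PR_min, with the minimum over Theta
    approximated by a finite grid."""
    pr_lb, _ = performative_bounds(deployed, target, lz_eps, loss)
    pr_min = min(performative_bounds(deployed, g, lz_eps, loss)[1] for g in grid)
    return pr_lb > pr_min
```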
It is not difficult to see that the following lower bound on the suboptimality of model θ holds:

Proposition 3.2. We have Δ(θ) ≥ PR_LB(θ) − PR_min for all θ. In particular, models θ with PR_LB(θ) > PR_min cannot be optimal.

We recall our toy example from Figure 1 and illustrate in Figure 2 the parameter configurations we can discard after the deployment of two models, θ_1 and θ_2. We can see that access to DPR allows us to discard a large portion of the parameter space, and, in contrast to the baseline black-box approach, it is possible to discard regions of the space that have not been explored.

4. Performative Confidence Bounds Algorithm

We introduce our main algorithm that builds on the two insights from the previous section. We furthermore provide a rigorous, finite-sample analysis of its guarantees.

4.1. Algorithm Overview

Our performative confidence bounds algorithm, formally stated in Algorithm 1, takes advantage of performative feedback by assessing the risk of unexplored models and thus guiding exploration. We give an overview of the main steps. Inspired by the successive elimination algorithm (Even-Dar et al., 2002), the algorithm keeps track of and refines an active set of models A ⊆ Θ. Roughly speaking, active models are those that are estimated to have low risk and only they are admissible to deploy. To deal with finite-sample uncertainty, the algorithm proceeds in phases which progressively refine the precision of the finite-sample risk estimates. More precisely, in phase p the algorithm chooses an error tolerance γ_p and deploys a model for n_p steps. In each step, m_0 samples induced by the deployed model are collected, and n_p is chosen so that the inferred estimates of DPR are γ_p-accurate. Formally, if θ is deployed in phase p, we collect an empirical distribution \widehat{D}(θ) of n_p m_0 samples so that |\widehat{DPR}(θ,θ′) − DPR(θ,θ′)| ≤ γ_p for all θ′ with high probability, where \widehat{DPR}(θ,θ′) := E_{z∼\widehat{D}(θ)} ℓ(z;θ′). These estimates of DPR are used to construct performative confidence bounds and refine A.

Algorithm 1 Performative Confidence Bounds Algorithm
Require: time horizon T, number of samples m_0, sensitivity ϵ, Lipschitz constant L_z, complexity bound C
 1: Initialize A = Θ
 2: for phase p = 0,1,... do
 3:   Set error tolerance γ_p = 2^{-p} and net radius r_p = γ_p/(L_zϵ)
 4:   Let n_p = ⌈(2C + 3√(log T))² / (γ_p² m_0)⌉
 5:   Initialize S_p ← N_{r_p}(A), P_p ← ∅
 6:   while S_p ≠ ∅ do
 7:     Draw θ_net from S_p uniformly at random
 8:     Deploy θ_net for n_p steps to form \widehat{DPR}(θ_net, ·)
 9:     Update S_p ← S_p \ {θ_net} and P_p ← P_p ∪ {θ_net}
10:     Update estimate of upper bound on PR(θ_PO):
          PR_min = min_{θ∈Θ} min_{θ′∈P_p} ( \widehat{DPR}(θ′,θ) + L_zϵ‖θ − θ′‖ )
11:     Update lower bound for all models θ ∈ A:
          PR_LB(θ) = max_{θ′∈P_p} ( \widehat{DPR}(θ′,θ) − L_zϵ‖θ − θ′‖ )
12:     A ← A \ {θ ∈ A : PR_LB(θ) > PR_min + 2γ_p}
13:     S_p ← S_p \ {θ ∈ S_p : Ball_{r_p}(θ) ∩ A = ∅}
14:   end while
15: end for

Each phase begins by constructing a net of the current active set A. The points in the net are sequentially deployed in the phase, unless they are deemed to be suboptimal based on previous deployments in that phase, in which case they are eliminated. During phase p, we denote by P_p the running set of deployed points and by S_p the running set of net points that have not been discarded. We initialize S_p to a minimal r_p-net of the current set of active points A, denoted N_{r_p}(A), where r_p is proportional to γ_p. A net point θ gets eliminated from S_p if no point in Ball_{r_p}(θ) := {θ′ ∈ Θ : ‖θ − θ′‖ ≤ r_p} is active. This means that we may deploy suboptimal points in the net if they help inform active points nearby.
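For concreteness, here is a heavily simplified sketch of the control flow of Algorithm 1 over a finite grid that stands in for Θ. The environment interface `deploy(theta, n)` (returning n samples from D(θ)), the greedy net construction, the one-dimensional parameters, and the handling of the final phase boundary are all simplifying assumptions made for illustration; constants follow the listing above.

```python
import numpy as np

def greedy_net(points, radius):
    """Greedy radius-net of `points` (a stand-in for a minimal net N_r(A))."""
    net = []
    for th in points:
        if all(abs(th - c) > radius for c in net):
            net.append(th)
    return net

def performative_confidence_bounds_algorithm(grid, deploy, loss, lz_eps, m0, C, T, seed=0):
    """Simplified sketch of Algorithm 1; `grid` is a finite stand-in for Theta."""
    rng = np.random.default_rng(seed)
    active = list(grid)                                # A <- Theta
    steps = 0
    dpr = lambda s, th: float(np.mean(loss(np.asarray(s, dtype=float), th)))

    p = 0
    while True:
        gamma = 2.0 ** (-p)                            # error tolerance gamma_p
        r = gamma / lz_eps                             # net radius r_p
        n_p = max(1, int(np.ceil((2 * C + 3 * np.sqrt(np.log(T))) ** 2
                                 / (gamma ** 2 * m0))))
        net = greedy_net(active, r)                    # S_p
        deployed = {}                                  # P_p with its samples
        while net:
            theta_net = net.pop(rng.integers(len(net)))
            deployed[theta_net] = deploy(theta_net, n_p * m0)
            steps += n_p
            if steps >= T:
                return active
            # Performative confidence bounds from this phase's deployments;
            # abs(...) plays the role of ||theta - theta'|| in one dimension.
            upper = {th: min(dpr(s, th) + lz_eps * abs(th - td)
                             for td, s in deployed.items()) for th in grid}
            lower = {th: max(dpr(s, th) - lz_eps * abs(th - td)
                             for td, s in deployed.items()) for th in active}
            pr_min = min(upper.values())
            active = [th for th in active if lower[th] <= pr_min + 2 * gamma]
            net = [th for th in net if any(abs(th - a) <= r for a in active)]
        p += 1
```

The two dictionary comprehensions mirror lines 10 and 11 of the listing; everything else is bookkeeping for the phase structure and the net.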
4.2. Comparison with Adaptive Zooming Algorithm

While we borrow the idea of an instance-dependent zooming dimension from Kleinberg et al. (2008), Algorithm 1 and its analysis are substantially different from prior work. In particular, Kleinberg et al. (2008) study an adaptive zooming algorithm which combines a UCB-based approach with an arm activation step. Adapting this method to our setting encounters several obstacles that we describe below.

First, a naive application of the adaptive zooming algorithm proposed by Kleinberg et al. (2008) does not lead to sublinear regret in our setting, unless we assume Lipschitzness of PR. Their rule for activating new arms requires that the reward of arms within a given radius in Euclidean distance of the pulled arm is similar. However, without Lipschitzness of PR, there is no radius that would ensure this property.

Given the shortcomings of this exploration strategy, one might imagine that selecting a better distance between arms, e.g. one based on performative confidence bounds, would result in a better algorithm. A natural distance function would be d(θ,θ′) taken as (an empirical estimate of) PR(θ) − DPR(θ,θ′) + L_zϵ‖θ − θ′‖. The challenge is that the analysis in Kleinberg et al. (2008) explicitly requires symmetry of the distance function, which d(θ,θ′) violates. Therefore, to single out the L_zϵ dependence, it is necessary to disentangle learning the structure of the distribution map from the elimination of arms based on reward, which is in stark contrast with UCB-style adaptive zooming algorithms. Algorithm 1 achieves this by relying on a novel adaptation of successive elimination.

4.3. Regret Bound

Before we state the regret bound for Algorithm 1, let us comment on an important component in the analysis. Recall that throughout the algorithm we operate with finite-sample estimates of the decoupled performative risk to bound the risk of unexplored models. Specifically, for any deployed θ, we make use of \widehat{DPR}(θ,θ′) for all θ′. Since we need these estimates to be valid simultaneously for all θ′, we rely on uniform convergence. As such, the Rademacher complexity of the loss function class naturally enters the bound.

Definition 4.1 (Rademacher complexity). Given a loss function ℓ(z;θ), we define C(ℓ) to be:

    C(ℓ) = sup_{θ∈Θ} sup_{n∈ℕ} E[ sup_{θ′∈Θ} (1/√n) Σ_{j=1}^{n} ϵ_j ℓ(z_j^θ;θ′) ],

where ϵ_j ∼ Rademacher and z_j^θ ∼ D(θ), j ∈ [n], which are all independent of each other.

Now we can state our regret guarantee for Algorithm 1.

Theorem 4.2 (Main regret bound). Assume the loss ℓ(z;θ) is L_z-Lipschitz in z and let ϵ denote the sensitivity of the distribution map. Suppose that C is any value such that C(ℓ) ≤ C and m_0 = o(B²_{log T,C}), where B_{log T,C} := √(log T) + C. Then, after T time steps, Algorithm 1 achieves a regret bound of

    Reg(T) = O( ( (L_zϵ)^d B²_{log T,C} / m_0 )^{1/(d+2)} T^{(d+1)/(d+2)} + B_{log T,C} √(T/m_0) ),

where d is the (L_zϵ)-sequential zooming dimension.

The sequential zooming dimension, formally defined in Definition 4.4, accounts for the sequential elimination of models within each phase. It will become clear in the next section that the sequential zooming dimension is upper bounded by the usual zooming dimension from Definition 2.3. In Appendix F, we provide an example where the sequential zooming dimension is strictly smaller than the zooming dimension.

Proposition 4.3. For all α > 0, the α-zooming dimension is at least as large as the α-sequential zooming dimension.
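Since the constant B_{log T,C} in Theorem 4.2 is driven by the Rademacher complexity of Definition 4.1, the following sketch estimates the inner quantity of that definition by Monte Carlo before turning to the comparison below. The outer suprema over θ and n are not taken, and the data distribution, the parameter grid, and the loss are illustrative assumptions rather than objects from the paper.

```python
import numpy as np

def rademacher_complexity_mc(loss, theta_grid, sample_z, n=200, trials=500, seed=0):
    """Monte Carlo estimate of E[ sup_{theta'} (1/sqrt(n)) * sum_j eps_j * loss(z_j; theta') ]
    for z_j drawn from a single induced distribution D(theta)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(trials):
        z = sample_z(rng, n)                      # z_j ~ D(theta), j in [n]
        eps = rng.choice([-1.0, 1.0], size=n)     # independent Rademacher signs
        sup_val = max(np.sum(eps * loss(z, tp)) for tp in theta_grid)
        values.append(sup_val / np.sqrt(n))
    return float(np.mean(values))

# Illustrative assumptions: standard normal induced data, clipped squared loss.
estimate = rademacher_complexity_mc(
    loss=lambda z, t: np.clip((z - t) ** 2, 0.0, 1.0),
    theta_grid=np.linspace(-1.0, 1.0, 41),
    sample_z=lambda rng, n: rng.normal(size=n),
)
print(f"estimated Rademacher term: {estimate:.3f}")
```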
The primary advantage of Theorem 4.2 over the Lipschitz bandit baseline can be seen by examining the first term in the regret bound. This term resembles the black-box regret bound from Section 2; however, the key difference is that the bound of Theorem 4.2 depends on the complexity of the distribution map rather than that of the performative risk. In particular, the Lipschitz constant is L_zϵ and not L_θ + L_zϵ. The advantage is pronounced when ϵ → 0, making the first term of the bound in Theorem 4.2 vanish so only the O(√T) term remains. On the other hand, the bound in Proposition 2.2 maintains an exponential dimension dependence.

Taking the limit as ϵ → 0 also reveals why the second term in the bound emerges. Even if the distribution map is constant, there is regret arising from finite-sample error. This is a key conceptual difference in the meaning of Lipschitzness of the distribution map versus that of the performative risk: L_θ + L_zϵ being 0 implies that PR is flat and thus all models are optimal, while performative regret minimization is nontrivial even if L_zϵ = 0. Unlike the first term, the second term due to finite samples is dimension-independent apart from any dependence implicit in the Rademacher complexity.

We note that the presence of the Rademacher complexity term C(ℓ) makes a direct comparison of the bound in Theorem 4.2 and the bound in Proposition 2.2 subtle. When the Rademacher complexity is very high, the regret bound in Theorem 4.2 may be worse. Nonetheless, for many natural function classes, the Rademacher complexity is polynomial in the dimension; in these cases, Theorem 4.2 can substantially outperform the regret bound in Proposition 2.2.

Another key feature of the regret bound in Theorem 4.2 worth highlighting is the zooming dimension. Definition 2.3 allows us to directly compare the dimension in Theorem 4.2 with the dimension in Proposition 2.2: the (L_zϵ)-zooming dimension of Algorithm 1 is no larger than, and most likely smaller than, the (L_θ + L_zϵ)-zooming dimension in the black-box approach. Moreover, the sequential variant of zooming dimension in Theorem 4.2 can further reduce the dimension.

Finally, the main assumption underpinning the bound in Theorem 4.2 is that DPR is (L_zϵ)-Lipschitz in its first argument. Assumption 1.2 coupled with Lipschitzness of the loss in the data achieves this. However, this property can hold with different regularity assumptions on the distribution map and loss function; e.g., if the loss is bounded and the distribution map is Lipschitz in total variation distance.

Figure 3. Sequential deployment of models allows Algorithm 1 to eliminate points from S_p, reducing the number of deployments. The deployment of θ_net,1 and θ_net,2 allows one to eliminate θ_net,3.

4.4. Sequential Zooming Dimension

The zooming dimension of Definition 2.3 does not take into account that, using performative feedback, our algorithm can eliminate unexplored models within a phase. We illustrate the benefits of this sequential exploration strategy in Figure 3, where the deployment of two models is sufficient to eliminate the remaining model in the cover. This motivates a sequential definition of zooming dimension that captures the benefits of sequential exploration. To set up this definition, we need to introduce some notation.
For a set of points S, an enumeration π : S → {1,...,|S|} that specifies an ordering on S, and a number k ∈ {1,...,|S|}, let PR_LB(θ;k) := max_{θ′∈S : π(θ′)