# Performative Reinforcement Learning

Debmalya Mandal¹, Stelios Triantafyllou¹, Goran Radanovic¹

¹Max Planck Institute for Software Systems, Saarbruecken, Germany. Correspondence to: Debmalya Mandal. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We introduce the framework of performative reinforcement learning, where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction (Perdomo et al., 2020), we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results to the setting where the learner merely performs gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters, e.g., regularization, smoothness, and the number of samples.

1. Introduction

Over the last decade, advances in reinforcement learning techniques have enabled several breakthroughs in AI. These milestones include AlphaGo (Silver et al., 2017), Pluribus (Brown & Sandholm, 2019), and AlphaStar (Vinyals et al., 2019). Such success stories of reinforcement learning in multi-agent game-playing environments have naturally led to the adoption of RL in many real-world scenarios, e.g., recommender systems (Aggarwal et al.) and healthcare (Esteva et al., 2019). However, these critical domains often pose new challenges, including the mismatch between the deployed policy and the target environment. Existing frameworks of reinforcement learning ignore the fact that a deployed policy might change the underlying environment (i.e., the reward, the probability transition function, or both). Such a mismatch between the deployed policy and the environment often arises in practice. For example, recommender systems often use a contextual Markov decision process to model interaction with a user (Hansen et al., 2020). In such a contextual MDP, the initial context/user feature is drawn according to a distribution, and then the user interacts with the platform according to the context-specific MDP. However, it has been repeatedly observed that such recommender systems not only change the user demographics (i.e., the distribution of contexts) but also how users interact with the platform (Chaney et al., 2018; Mansoury et al., 2020). Our second example comes from autonomous vehicles (AV). Even if we ignore the multi-agent aspect of these learning algorithms, a deployed AV might change how pedestrians and other cars behave, and the resulting environment might be quite different from what the designers of the AV had in mind (Nikolaidis et al., 2017a). Recently, Perdomo et al.
(2020) introduced the notion of performative prediction, where the predictions made by a classifier change the data distribution. In the context of reinforcement learning, however, the situation is different, as changing transition dynamics introduce additional complexities. If the underlying probability transition function changes, then the class of feasible policies and/or models changes with time. This implies that we need a framework that is more general than the framework of performative prediction and can model policy-dependent outcomes in reinforcement learning.

Our Contributions: In this paper, we introduce the notion of performative reinforcement learning, where the deployed policy not only changes the reward vector but also the underlying transition probability function. We introduce the notion of a performatively stable policy and show under what conditions various repeated retraining methods (e.g., policy optimization, gradient ascent, etc.) converge to such a stable solution. Our precise contributions are the following. We consider a regularized version of the reinforcement learning problem where the variables are long-term discounted state-action occupancy measures. We show that, when both the probability transition function and the reward function change smoothly in response to the occupancy measure, repeated optimization of the regularized reinforcement learning objective converges to a stable solution. We then show that if the learner performs repeated projected gradient ascent steps, convergence is also guaranteed provided that the step size is small enough. Compared to the supervised learning setting (Perdomo et al., 2020), the projection step is necessary because the probability transition function, and hence the space of occupancy measures, changes with time. Next, we extend our result to the finite-sample setting, where the learner has access to a collection of samples from the updated environment. For this setting, we use an empirical version of the Lagrangian of the regularized RL problem. We show that repeatedly solving for a saddle point of this empirical Lagrangian (where the max player corresponds to the learner) also converges to a stable solution provided that the number of samples is large enough. Finally, we empirically evaluate the effect of various parameter choices (regularization, smoothness, number of samples, etc.) through extensive experiments on a two-agent grid-world environment. In this environment, the first agent performs various types of repeated retraining, whereas the second agent responds according to a smooth response function.

Our Techniques: At a high level, our theoretical results might look similar to the results derived by Perdomo et al. (2020). However, there are many challenges. We repeatedly maximize a regularized objective whose feasible region is the space of feasible occupancy measures. As the probability transition function changes with time, the feasible region of the optimization problem also changes with time, so ideas from the supervised classification setting (Mendler-Dünner et al., 2020) cannot be applied directly. Therefore, we look at the dual problem, which is strongly convex, and the mapping from an occupancy measure to the corresponding dual optimal solution turns out to be a contraction. We believe that this dual perspective on performative prediction might be useful for analyzing the convergence of other algorithms with decision-dependent outcomes.
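To fix ideas, the repeated retraining procedure analyzed in this paper can be summarized by the following schematic loop. This is an illustrative sketch only (the helper names `response` and `retrain` are hypothetical); the concrete retraining step — full optimization, a projected gradient step, or a saddle-point solve on samples — is what varies across the later sections.

```python
def repeated_retraining(d0, response, retrain, num_iters):
    """Schematic repeated retraining in a decision-dependent MDP.
    response: occupancy measure -> (P_d, r_d), the environment's reaction
    retrain:  (P, r) -> new occupancy measure (e.g., solve a regularized RL problem)"""
    d = d0
    for _ in range(num_iters):
        P_d, r_d = response(d)   # deploy pi_d; the environment shifts to M(pi_d)
        d = retrain(P_d, r_d)    # re-optimize against the observed environment
    return d                     # under the conditions studied below, this approaches d_S
```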
For performative reinforcement learning, the finite sample setting is very different than the supervised learning setting. This is because we do not have independent sample access from the new environment. At time-step t, we can only access the new model through the policy πt (or occupancy measure dt). In that sense, the learner faces an offline reinforcement learning problem where the samples are collected from the behavior policy πt. This is also the reason we need an additional overlap assumption, which is often standard in offline reinforcement learning (Munos & Szepesv ari, 2008). 1.1. Related Work: Perdomo et al. (2020) introduced the notion of performative prediction. Subsequent papers have considered several aspects of this framework including optimization (Mendler D unner et al., 2020; Miller et al., 2021), multi-agent systems (Narang et al., 2022), and population dynamics (Brown et al., 2020). However, to the best of our knowledge, performative prediction in sequential decision making is mostly unexplored. A possible exception is (Bell et al., 2021) who consider a setting where the transition and reward of the underlying MDP depend non-deterministically on the deployed policy. Since the mapping is non-deterministic, it doesn t lead to a notion of equilibrium, and the authors instead focus on the optimality and convergence of various RL algorithms. Stochastic Stackelberg Games: Our work is also closely related to the literature on stochastic games (Shapley, 1953; Filar & Vrieze, 2012), and in particular, those that study Stackelberg (commitment) strategies (Von Stackelberg, 2010), where a leader commits a policy to which a follower (best) responds. While different algorithmic approaches have been proposed for computing Stackelberg equilibria (SE) in stochastic games or related frameworks (Vorobeychik & Singh, 2012; Letchford et al., 2012; Dimitrakakis et al., 2017), computing optimal commitment policies has shown to be a computationally intractable (NP-hard) problem (Letchford et al., 2012). More recent works have studied learning SE in stochastic games, both from practical perspective (Rajeswaran et al., 2020; Mishra et al., 2020; Huang et al., 2022) and theoretical perspective (Bai et al., 2021; Zhong et al., 2021). The results in this paper differ from this line of work in two ways. Firstly, our framework abstracts the response model of an agent s effective environment in that it does not model it through a rational agency with a utility function. Instead, it is more aligned with the approaches that learn the response function of the follower agent (Sinha et al., 2016; Kar et al., 2017; Sessa et al., 2020), out of which the closest to our work is (Sessa et al., 2020) that considers repeated games. Secondly, given that we consider solution concepts from performative prediction rather than SE, we focus on repeated retraining as the algorithm of interest, rather than bi-level optimization approaches. Other related work: We also draw a connection to other RL frameworks. Naturally, this work relates to RL settings that study non-stationary environments. 
These include recent learning-theoretic results, such as (Gajane et al., 2018; Fei et al., 2020; Domingues et al., 2021; Cheung et al., 2020; Wei & Luo, 2021), that allow non-stationary rewards and transitions provided a bounded number or amount of changes (under a no-regret regime), the extensive literature on learning under adversarial reward functions (e.g., (Even-Dar et al., 2004; Neu et al., 2012; Dekel & Hazan, 2013; Rosenberg & Mansour, 2019)), or the recent works on learning under corrupted feedback (Lykouris et al., 2021). However, the setting of this paper is more structured, i.e., the environment responds to the deployed policy and does not change arbitrarily. Our work is also broadly related to the extensive literature on multi-agent RL; we refer the reader to (Zhang et al., 2021) for a selective overview. A canonical example of a multi-agent setting that closely relates to the setting of this paper is human-AI cooperation, where the AI's policy influences the human behavior (Dimitrakakis et al., 2017; Nikolaidis et al., 2017b; Crandall et al., 2018; Radanovic et al., 2019; Carroll et al., 2019). In fact, our experiments are inspired by human-AI cooperative interaction. While the algorithmic framework of repeated retraining has been discussed in some of the works on cooperative AI (e.g., see (Carroll et al., 2019)), these works do not provide a formal treatment of the problem at hand. Finally, this paper also relates to the extensive literature on offline RL (Levine et al., 2020), as the learner faces an offline RL problem when performing repeated retraining with finite samples. We utilize the analysis of (Zhan et al., 2022) to establish finite-sample guarantees, under a standard assumption on sample generation (Munos & Szepesvári, 2008; Farahmand et al., 2010; Xie & Jiang, 2021) and overlap in occupancy measure (Munos & Szepesvári, 2008; Zhan et al., 2022). Note that offline RL has primarily focused on static RL settings in which the policy of a learner does not affect the model of the underlying environment.

We are primarily concerned with Markov Decision Processes (MDPs) with a fixed state space S, action set A, discount factor γ, and starting state distribution ρ. The reward and the probability transition functions of the MDP will be functions of the adopted policy. We consider the infinite-horizon setting where the learner's goal is to maximize the total sum of discounted rewards. We will write s_k to denote the state visited at time-step k and a_k to denote the action taken at time-step k. When the learner adopts policy π, the underlying MDP has reward function r_π and probability transition function P_π. We will write M(π) to denote the corresponding MDP, i.e., M(π) = (S, A, P_π, r_π, ρ). Note that only the reward and the transition probability function change according to the policy π. When an agent adopts policy π and the underlying MDP is M(π') = (S, A, P_{π'}, r_{π'}, ρ), the probability of a trajectory τ = (s_k, a_k)_{k≥0} is given as
$$P(\tau) = \rho(s_0) \prod_{k \ge 0} \pi(a_k \mid s_k)\, P_{\pi'}(s_{k+1} \mid s_k, a_k).$$
We will write τ ∼ P^π_{π'} to denote such a trajectory τ. Given a policy π and an underlying MDP M(π'), we write V^π_{π'}(ρ) for the value function of π w.r.t. the starting state distribution ρ, defined as follows:
$$V^{\pi}_{\pi'}(\rho) = \mathbb{E}_{\tau \sim P^{\pi}_{\pi'}}\Big[\sum_{k=0}^{\infty} \gamma^k\, r_{\pi'}(s_k, a_k)\,\Big|\, \rho\Big]. \quad (1)$$
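To make the policy-dependent value function in eq. (1) concrete, the following minimal sketch (in Python/NumPy; the names are illustrative and not taken from the paper's code) evaluates V^π_{π'}(ρ) for a tabular MDP by solving the Bellman evaluation equations, where the environment M(π') is produced by a user-supplied response function.

```python
import numpy as np

def evaluate_policy(pi, P, r, rho, gamma):
    """Value V^pi_{pi'}(rho) of policy pi (S x A) in a fixed tabular MDP
    with transitions P (S x A x S), rewards r (S x A), start dist rho (S)."""
    S, A = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)      # Markov chain induced by pi (S x S)
    r_pi = np.einsum("sa,sa->s", pi, r)        # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # v = r_pi + gamma * P_pi v
    return rho @ v

def performative_value(pi, response, rho, gamma):
    """Performative value V^pi_pi(rho): evaluate pi in the environment M(pi)
    returned by a (hypothetical) response function policy -> (P, r)."""
    P_pi, r_pi = response(pi)
    return evaluate_policy(pi, P_pi, r_pi, rho, gamma)
```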
Solution Concepts: Given the definition of the value function in eq. (1), we can now define the solution concepts for our setting. First, we define the performative value function of a policy, which is the expected total return of the policy given that the environment changes in response to the policy.

Definition 1 (Performative Value Function). Given a policy π and a starting state distribution ρ ∈ Δ(S), the performative value function V^π_π(ρ) is defined as
$$V^{\pi}_{\pi}(\rho) = \mathbb{E}_{\tau \sim P^{\pi}_{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t\, r_{\pi}(s_t, a_t)\,\Big|\, \rho\Big].$$

We now define the performatively optimal policy, which maximizes the performative value function.

Definition 2 (Performatively Optimal Policy). A policy π is performatively optimal if it maximizes the performative value function, i.e., $\pi \in \arg\max_{\pi'} V^{\pi'}_{\pi'}(\rho)$. We will write π_P to denote the performatively optimal policy.

Although π_P maximizes the performative value function, it need not be stable, i.e., the policy need not be optimal with respect to the changed environment M(π_P). We next define the notion of a performatively stable policy, which captures this notion of stability.

Definition 3 (Performatively Stable Policy). A policy π_S is performatively stable if it satisfies the condition $\pi_S \in \arg\max_{\pi} V^{\pi}_{\pi_S}(\rho)$. We will write π_S to denote the performatively stable policy.

The definition of a performatively stable policy implies that if the underlying MDP is M(π_S), then an optimal policy is π_S. This means that after deploying the policy π_S in the MDP M(π_S), the environment doesn't change and the learner is also optimizing her reward in this stable environment. We next show that even for an MDP with a single state, these two solution concepts can be very different.

Example: Consider an MDP with a single state s and two actions a and b. If a policy plays action a with probability θ and action b with probability 1 − θ, then r(s, a) = 1/2 − εθ and r(s, b) = 1/2 + εθ, for some ε < 1/4. Note that for θ = 0 both actions give the same reward, and the learner can just play action b. Therefore, a policy that always plays action b is a stable policy and achieves a reward of 1/(2(1 − γ)). On the other hand, a policy that plays action a with probability θ has performative value function
$$\frac{\theta(1/2 - \epsilon\theta) + (1-\theta)(1/2 + \epsilon\theta)}{1-\gamma},$$
which for θ = 1/4 equals (1/2 + ε/8)/(1 − γ). So, for ε > 0, a performatively optimal policy can achieve a higher value function than a stable policy.

We will mainly consider the tabular MDP setting, where the numbers of states and actions are finite. Even though solving a tabular MDP in the classic reinforcement learning problem is relatively straightforward, we will see that even the tabular setting raises many interesting challenges for performative reinforcement learning.

Discounted State-Action Occupancy Measure: Note that it is not a priori clear whether a performatively stable policy (as defined in Definition 3) always exists. This is because such an existence guarantee is usually established through a fixed-point argument, but the set of optimal solutions need not be convex: if both π_1 and π_2 satisfy the condition of Definition 3, their convex combination might not be optimal. So, in order to find a stable policy, we instead consider the linear programming formulation of reinforcement learning. Given a policy π, its long-term discounted state-action occupancy measure in the MDP M(π) is defined as
$$d^{\pi}(s,a) = \mathbb{E}_{\tau \sim P^{\pi}_{\pi}}\Big[\sum_{k=0}^{\infty} \gamma^k\, \mathbb{1}\{s_k = s, a_k = a\}\,\Big|\, \rho\Big].$$
Conversely, given an occupancy measure d, one can consider the following policy π_d, which has occupancy measure d:
$$\pi_d(a \mid s) = \begin{cases} \dfrac{d(s,a)}{\sum_b d(s,b)} & \text{if } \sum_b d(s,b) > 0, \\[4pt] \dfrac{1}{A} & \text{otherwise.} \end{cases} \quad (2)$$
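As a concrete illustration of the map in eq. (2), the following sketch (illustrative names, not the paper's implementation) converts an occupancy measure into the corresponding policy, and also computes the occupancy measure of a policy in a fixed model for comparison.

```python
import numpy as np

def policy_from_occupancy(d):
    """Policy pi_d of eq. (2): normalize d(s, .) per state; uniform where d(s, .) = 0."""
    S, A = d.shape
    mass = d.sum(axis=1, keepdims=True)
    return np.where(mass > 0, d / np.maximum(mass, 1e-12), 1.0 / A)

def occupancy_of_policy(pi, P, rho, gamma):
    """Discounted state-action occupancy d^pi(s,a) of pi in a fixed model (P, rho, gamma)."""
    S, A = pi.shape
    P_pi = np.einsum("sa,sat->st", pi, P)
    # state occupancy solves mu = rho + gamma * P_pi^T mu
    mu = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return mu[:, None] * pi
```

For a fixed model, these two maps are inverse to one another whenever d is a valid occupancy measure for that model (up to states with zero occupancy).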
With this definition, we can pose the problem of finding a performatively stable occupancy measure. An occupancy measure d is performatively stable if it is an optimal solution of the following problem:
$$d_S \in \arg\max_{d \ge 0} \sum_{s,a} d(s,a)\, r_d(s,a) \quad (3)$$
$$\text{s.t.}\quad \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_d(s',a',s) \quad \forall s.$$
With a slight abuse of notation, we write P_d for P_{π_d} and r_d for r_{π_d}, where π_d is defined in eq. (2).

If either the probability transition function or the reward function changes drastically in response to the occupancy measure, then the optimization problem (3) need not even have a stable point. Therefore, as is standard in performative prediction, we make the following sensitivity assumption regarding the underlying environment.

Assumption 1. The reward and probability transition mappings are (ε_r, ε_p)-sensitive, i.e., the following holds for any two occupancy measures d and d':
$$\|r_d - r_{d'}\|_2 \le \epsilon_r \|d - d'\|_2, \qquad \|P_d - P_{d'}\|_2 \le \epsilon_p \|d - d'\|_2.$$

The motivation behind this assumption comes from multi-agent systems, particularly Stackelberg games. Suppose the leader agent changes her policy and, in response, the follower agents change their policies according to a smooth response function. Then, from the leader's perspective, the environmental change is smooth. Furthermore, it is also possible to state the assumption in terms of policies, i.e., to bound the change in the environment by the change in policies. This is because similar policies imply similar state visitations (e.g., see Lemma 14.1 in (Agarwal et al., 2021)). The converse is not always true, so our assumption is weaker. If the reward r_d and the transition P_d are held fixed in eq. (3), the objective of eq. (3) is linear and the set of optimal solutions is convex; a simple application of the Kakutani fixed-point theorem (Glicksberg, 1952) then shows that there always exists a performatively stable solution.¹

Proposition 1. Suppose Assumption 1 holds for some constants (ε_r, ε_p). Then the optimization problem (3) always has a fixed point.

¹The proof of this result and all other results are provided in the appendix.

3. Convergence of Repeated Retraining

Even though the optimization problem (3) is guaranteed to have a stable solution, it is not clear that repeatedly optimizing this objective converges to such a point. We now consider a regularized version of the optimization problem (3) and attempt to obtain a stable solution of the regularized problem. In Subsection 3.3 we will show that such a stable solution guarantees approximate stability with respect to the original unregularized objective (3).
$$\max_{d \ge 0} \sum_{s,a} d(s,a)\, r_d(s,a) - \frac{\lambda}{2}\|d\|_2^2 \quad (4)$$
$$\text{s.t.}\quad \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_d(s',a',s) \quad \forall s.$$
Here λ > 0 is a constant that determines the strong concavity of the objective.
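For a fixed model, i.e., with the responses P_d and r_d frozen at the currently deployed occupancy measure, the regularized problem above is a concave quadratic program over occupancy measures and can be handed to an off-the-shelf solver. The sketch below is illustrative only (it is not the paper's implementation and assumes the cvxpy package is available); it writes down exactly the objective and flow constraints of eq. (4) for a fixed (P, r), which is also the subproblem solved at each retraining step in the next part of this section.

```python
import cvxpy as cp
import numpy as np

def solve_regularized_rl(P, r, rho, gamma, lam):
    """Maximize sum_{s,a} d(s,a) r(s,a) - lam/2 ||d||^2 over occupancy measures
    feasible for a fixed model: transitions P (S x A x S), start dist rho, discount gamma."""
    S, A = r.shape
    d = cp.Variable((S, A), nonneg=True)
    # Bellman flow constraints: sum_a d(s,a) = rho(s) + gamma * sum_{s',a'} d(s',a') P(s',a',s)
    constraints = [
        cp.sum(d[s, :]) == rho[s] + gamma * cp.sum(cp.multiply(d, P[:, :, s]))
        for s in range(S)
    ]
    objective = cp.Maximize(cp.sum(cp.multiply(d, r)) - lam / 2 * cp.sum_squares(d))
    cp.Problem(objective, constraints).solve()
    return d.value
```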
Before analyzing the behavior of repeatedly optimizing the new objective (4), we discuss two important issues. First, we consider quadratic regularization for simplicity, and our results can be easily extended to any strongly convex regularizer. In particular, we use the strong convexity of the squared L2 norm to show that the solution map of the regularized optimization problem forms a contraction. If we used the L1 norm instead, this mapping might not be a contraction. Note, however, that the regularized objective (4) still has a fixed point, since the objective is still concave; so we believe that repeated optimization still converges to a stable point, but we might only have convergence in the limit. Second, we apply regularization in the occupancy measure space, whereas regularization in the policy space is more commonly used (Mnih et al., 2016). However, the value function is generally a non-convex function of the policy, and it is not immediately clear whether the solution of the corresponding optimization problem in the policy space also gives a contraction mapping.

Since the performatively stable occupancy measure d_S is not known, a commonly adopted strategy is repeated policy optimization. At time t, the learner obtains the occupancy measure d_t and deploys the policy π_t (as defined in eq. (2)). In response, the environment changes to P_t = P_{d_t} and r_t = r_{d_t}, and the learning agent solves the following optimization problem to obtain the next occupancy measure d_{t+1}:
$$\max_{d \ge 0} \sum_{s,a} d(s,a)\, r_t(s,a) - \frac{\lambda}{2}\|d\|_2^2 \quad (5)$$
$$\text{s.t.}\quad \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_t(s',a',s) \quad \forall s.$$
We next show that repeatedly solving the problem (5) converges to a stable point.

Theorem 1. Suppose Assumption 1 holds with $\lambda > \frac{12 S^{3/2}(2\epsilon_r + 5S\epsilon_p)}{(1-\gamma)^4}$, and let $\mu = \frac{12 S^{3/2}(2\epsilon_r + 5S\epsilon_p)}{\lambda(1-\gamma)^4}$. Then for any δ > 0 we have $\|d_t - d_S\|_2 \le \delta$ for all
$$t \ge 2(1-\mu)^{-1} \ln\Big(\frac{2}{\delta(1-\gamma)}\Big).$$

Here we discuss some of the main challenges behind the proof of this theorem. The primal objective (5) is strongly concave, but the feasible region of the optimization problem changes with each iteration. So we cannot apply the results from performative prediction (Perdomo et al., 2020) and instead look at the dual objective, which is A(1−γ)²/λ-strongly convex. Although the dual problem is strongly convex, it does not satisfy Lipschitz continuity w.r.t. P. However, we show that the norm of the optimal solution of the dual problem is bounded by O(S/(1−γ)²), and this is sufficient to show that the dual objective is Lipschitz-continuous with respect to P at the dual optimal solution. We show that the proof argument used in Perdomo et al. (2020) works if we replace global Lipschitz continuity by such local Lipschitz continuity. Finally, we translate the bound on the dual solution back to a guarantee on the primal solution (i.e., on ||d_t − d_S||_2) using the strong duality of the optimization problem (5). This step crucially uses the quadratic regularization in the primal.

Here we make several observations regarding the assumptions required by Theorem 1. First, Theorem 1 suggests that for a given sensitivity (ε_r, ε_p) and discount factor γ, one can choose the parameter λ so that convergence to a stable point is guaranteed. Second, for given values of λ and γ, if the sensitivity is small enough, then repeatedly optimizing (5) converges to a stable point.

3.1. Gradient Ascent Based Algorithm

We now extend our result to the setting where the learner does not fully solve the optimization problem (5) at every iteration. Rather, the learner takes a gradient step with respect to the changed environment at every iteration. Let C_t denote the set of occupancy measures that are compatible with the probability transition function P_t:
$$C_t = \Big\{ d : \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_t(s',a',s)\ \ \forall s, \ \text{ and }\ d(s,a) \ge 0\ \ \forall s,a \Big\}. \quad (6)$$
The gradient ascent algorithm first takes a gradient step according to the objective function $r_t^{\top} d - \frac{\lambda}{2}\|d\|_2^2$ and then projects the resulting occupancy measure onto C_t:
$$d_{t+1} = \mathrm{Proj}_{C_t}\big(d_t + \eta(r_t - \lambda d_t)\big) = \mathrm{Proj}_{C_t}\big((1-\eta\lambda)\, d_t + \eta\, r_t\big). \quad (7)$$
Here Proj_C(v) finds a point in C that is closest to v in the L2 norm.
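A minimal sketch of the update in eq. (7) is given below (illustrative names, not the paper's implementation; cvxpy is assumed, as in the earlier sketch). The projection is computed via a quadratic program analogous to the convex program in eq. (8) below.

```python
import cvxpy as cp
import numpy as np

def project_onto_occupancy_set(v, P, rho, gamma):
    """L2 projection of v onto the set C_t of eq. (6) for the fixed model P."""
    S, A = v.shape
    d = cp.Variable((S, A), nonneg=True)
    constraints = [
        cp.sum(d[s, :]) == rho[s] + gamma * cp.sum(cp.multiply(d, P[:, :, s]))
        for s in range(S)
    ]
    cp.Problem(cp.Minimize(cp.sum_squares(d - v)), constraints).solve()
    return d.value

def projected_gradient_step(d_t, P_t, r_t, rho, gamma, lam, eta):
    """One step of eq. (7): ascent on r_t^T d - lam/2 ||d||^2, then projection onto C_t."""
    v = (1.0 - eta * lam) * d_t + eta * r_t
    return project_onto_occupancy_set(v, P_t, rho, gamma)
```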
We next show that repeatedly taking projected gradient ascent steps with an appropriate step size η converges to a stable point.

Theorem 2. Let $\lambda \ge \max\Big\{4\epsilon_r,\ 2S,\ \frac{20\gamma^2 S^{1.5}(\epsilon_r + \epsilon_p)}{(1-\gamma)^2}\Big\}$, let the step size be η = 1/λ, and let $\mu = \sqrt{\frac{64\gamma^2\epsilon_p^2}{(1-\gamma)^4}\Big(1 + \frac{30\gamma^4 S^2}{(1-\gamma)^4}\Big)}$. Suppose Assumption 1 holds with ε_p sufficiently small — the threshold scales as min{1/(γS), 1/(100γ³S)} up to factors of (1 − γ) (see Appendix B.2). Then for any δ > 0 we have $\|d_t - d_S\|_2 \le \delta$ for all
$$t \ge (1-\mu)^{-1} \ln\Big(\frac{2}{\delta(1-\gamma)}\Big).$$

Proof Sketch: First, the projection step in (7) can be computed through the following convex program:
$$\min_{d \ge 0}\ \frac{1}{2}\big\| d - (1-\eta\lambda)\, d_t - \eta\, r_t \big\|_2^2 \quad (8)$$
$$\text{s.t.}\quad \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_t(s',a',s) \quad \forall s.$$
Even though the objective above is convex, its feasible region changes with time. So we again look at the dual objective, which is strongly concave and has a fixed feasible region. Given an occupancy measure d_t, let GD_η(d_t) be the optimal solution of the problem (7). We show that if the step size η is chosen small enough, then the operator GD_η(·) is a contraction mapping, by first proving that the corresponding optimal dual solutions form a contraction, and then using strong duality to transfer the guarantee back to the primal optimal solutions. Finally, we show that the fixed point of the mapping GD_η(·) indeed coincides with the performatively stable solution d_S.

In order to simplify the proof of Theorem 2, we substituted η = 1/λ and then chose a suitable value of λ. This means that as ε_r, ε_p → 0, λ need not approach zero. However, it is possible to choose a value of λ that goes to zero as ε_r, ε_p approach zero. In particular, this can be achieved by setting η < 1/S and imposing additional constraints on ε_p. The proof of Theorem 2 shows that the projected gradient step results in a contraction with a constant term of the form γ²ε_p²(1 + 2ηS)²; if η < 1/S, then this term is smaller than 1 when ε_p is small.

3.2. Finite Sample Guarantees

So far we assumed that after deploying the policy corresponding to the occupancy measure d_t, we observe the updated environment M_t = (S, A, P_t, r_t, γ). In practice, however, we do not have access to the true model but only to samples from the updated environment. Our setting is more challenging than the finite-sample setting considered by Perdomo et al. (2020). Unlike the supervised learning setting, we do not have access to independent samples from the new environment: at time t we can deploy the policy π_t corresponding to the occupancy measure d_t, and we can access trajectories from the new environment M_t only through the policy π_t. Therefore, at every step, the learner faces an offline reinforcement learning problem where π_t is the behavioral policy. A standard assumption in offline reinforcement learning is overlap in occupancy measure between the behavior policy and a class of target policies (Munos & Szepesvári, 2008). Without such overlap, one can get no information regarding the optimal policy from the trajectories visited by the behavioral policy. We make the following assumption regarding the overlap in occupancy measure between a deployed policy and the optimal policy in the changed environment.

Assumption 2. Given an occupancy measure d, let d* be the optimal occupancy measure maximizing eq. (5) when the deployed occupancy measure is d (i.e., with model (P_d, r_d)), and let d̄ be the occupancy measure of π_d in P_d. There exists B > 0 such that d*(s,a) ≤ B · d̄(s,a) for all s, a.

When there is no performativity, d̄ equals d, and the assumption states overlap between the occupancy measure of the deployed policy and the optimal policy. This is the standard assumption of single-policy coverage in offline reinforcement learning. When there is performativity, d̄ can be different from d, since the deployed policy π_d might have an occupancy measure different from d in the changed model P_d; in that case we require overlap between d̄ and the optimal occupancy measure.
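When the model is computable (as in the experiments, where the follower response is known), the coverage constant in Assumption 2 can be checked numerically. The following is a small illustrative sketch (hypothetical names; d_bar can be obtained, e.g., with the occupancy_of_policy and policy_from_occupancy helpers sketched earlier).

```python
import numpy as np

def coverage_constant(d_star, d_bar, tol=1e-12):
    """Smallest B with d_star(s,a) <= B * d_bar(s,a) wherever d_star has mass."""
    mask = d_star > tol
    if np.any(mask & (d_bar <= tol)):
        return np.inf   # optimal occupancy visits pairs the deployed policy never reaches
    return float((d_star[mask] / d_bar[mask]).max())
```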
Assumption 2 is also significantly weaker than the uniform coverage assumption, which requires d̄(s,a) > 0 for all s, a: our assumption allows d̄(s,a) = 0 as long as the optimal policy doesn't visit state s, or never takes action a in state s.

Data: We assume the following model of sample generation at time t. Given the occupancy measure d_t, let the normalized occupancy measure be d̃_t(s,a) = (1−γ) d_t(s,a). First, a state-action pair (s_i, a_i) is sampled i.i.d. as (s_i, a_i) ∼ d̃_t, then a reward r_i ∼ r_t(s_i, a_i), and finally a next state s'_i ∼ P_t(·|s_i, a_i). We have access to m_t such tuples at time t, and the data collected at time t is given as D_t = {(s_i, a_i, r_i, s'_i)}_{i=1}^{m_t}. We would like to point out that this is a standard model of sample generation in offline reinforcement learning (see, e.g., (Munos & Szepesvári, 2008; Farahmand et al., 2010; Xie & Jiang, 2021)).

With finite samples, the learner needs to optimize an empirical version of the optimization problem (5). We follow the recent literature on offline reinforcement learning (Zhan et al., 2022) and consider the Lagrangian of eq. (5):
$$L(d, h; M_t) = d^{\top} r_t - \frac{\lambda}{2}\|d\|_2^2 + \sum_s h(s)\Big( -\sum_a d(s,a) + \rho(s) + \gamma \sum_{s',a'} d(s',a')\, P_t(s',a',s) \Big)$$
$$= -\frac{\lambda}{2}\|d\|_2^2 + \sum_s h(s)\rho(s) + \sum_{s,a} d_t(s,a)\, \frac{d(s,a)}{d_t(s,a)} \Big( r_t(s,a) - h(s) + \gamma \sum_{s'} P_t(s,a,s')\, h(s') \Big).$$
The above expression motivates the following empirical version of the Lagrangian:
$$\hat{L}(d, h; M_t) = -\frac{\lambda}{2}\|d\|_2^2 + \sum_s h(s)\rho(s) + \frac{1}{m_t(1-\gamma)} \sum_{i=1}^{m_t} \frac{d(s_i, a_i)}{d_t(s_i, a_i)} \big( r(s_i, a_i) - h(s_i) + \gamma h(s'_i) \big). \quad (9)$$
We repeatedly solve for a saddle point of the objective (9):
$$(d_{t+1}, h_{t+1}) = \arg\max_d \arg\min_h \hat{L}(d, h; M_t). \quad (10)$$
The next theorem provides convergence guarantees for the repeated optimization procedure (10), provided that the number of samples is large enough.

Theorem 3 (Informal Statement). Suppose Assumption 1 holds with $\lambda \ge \frac{24 S^{3/2}(2\epsilon_r + 5S\epsilon_p)}{(1-\gamma)^4}$, and Assumption 2 holds with parameter B. Let $\mu = \frac{24 S^{3/2}(2\epsilon_r + 5S\epsilon_p)}{\lambda(1-\gamma)^4}$. For any δ > 0 and error probability p, if we repeatedly solve the optimization problem (10) with number of samples $m_t \ge \tilde{O}\big( \frac{A^2 B^2}{\delta^4 (2\epsilon_r + 5S\epsilon_p)^2} \ln\frac{t}{p} \big)$,² then with probability at least 1 − p we have $\|d_t - d_S\|_2 \le \delta$ for all $t \ge (1-\mu)^{-1}\ln\big(\frac{2}{\delta(1-\gamma)}\big)$.

²Here we ignore terms that are logarithmic in S, A, and 1/δ.

Proof Sketch: We first show that the empirical version of the Lagrangian L̂(d, h; M_t) is close to the true Lagrangian L(d, h; M_t) with high probability. This is shown using the Chernoff-Hoeffding inequality and an ε-net argument over the variables. Here we use the fact that for the dual variables we can restrict attention to the space H = {h : ||h||_2 ≤ 3S/(1−γ)²}, as the optimal solution is guaranteed to exist in this space. We then show that an optimal saddle point of the empirical Lagrangian (9) is close to the optimal solution of the true Lagrangian. In particular, if |L(d, h; M_t) − L̂(d, h; M_t)| ≤ ε, then we are guaranteed that ||d_{t+1} − GD(d_t)||_2 ≤ O(ε), where GD(d_t) denotes the solution obtained by optimizing the exact objective. By choosing an appropriate number of samples, we can make the error term ε small enough and establish the following recurrence relation: ||d_{t+1} − d_S||_2 ≤ βδ + β||d_t − d_S||_2 for some β < 1/2. The rest of the proof follows the main idea of the proof of Theorem 3.10 of (Mendler-Dünner et al., 2020). If ||d_t − d_S||_2 > δ, then we get a contraction mapping. On the other hand, if ||d_t − d_S||_2 ≤ δ, then subsequent iterations cannot move d_t outside of the δ-neighborhood of d_S.
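To make eq. (9) concrete, the following sketch (illustrative, not the paper's implementation) evaluates the empirical Lagrangian from a batch D_t; this is the quantity that the saddle-point procedure in eq. (10), and Algorithm 1 in Section 4, repeatedly optimize.

```python
import numpy as np

def empirical_lagrangian(d, h, data, d_t, rho, gamma, lam):
    """Empirical Lagrangian of eq. (9).
    data: list of (s_i, a_i, r_i, s_next_i) tuples drawn as described above;
    d, d_t are S x A occupancy measures, h is a length-S dual vector."""
    m = len(data)
    total = -0.5 * lam * np.sum(d ** 2) + np.dot(h, rho)
    for (s, a, r, s_next) in data:
        w = d[s, a] / max(d_t[s, a], 1e-12)          # importance ratio d / d_t
        total += w * (r - h[s] + gamma * h[s_next]) / (m * (1.0 - gamma))
    return total
```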
3.3. Approximating the Unregularized Objective

Theorem 1 shows that repeatedly optimizing objective (4) converges to a stable solution (say d^λ_S) with respect to the regularized objective (4). Here we show that the solution d^λ_S approximates the performatively stable and performatively optimal solutions with respect to the unregularized objective (3).

Theorem 4. There exists a choice of the regularization parameter λ such that repeatedly optimizing objective (5) converges to a solution d^λ_S with the following guarantee:³
$$\sum_{s,a} r_{d^{\lambda}_S}(s,a)\, d^{\lambda}_S(s,a) \ \ge\ \max_{d \in C(d^{\lambda}_S)} \sum_{s,a} r_{d^{\lambda}_S}(s,a)\, d(s,a) \ -\ O\big( S^{3/2}(\epsilon_r + S\epsilon_p)/(1-\gamma)^6 \big).$$

³C(d̃) denotes the set of occupancy measures that are feasible with respect to P(d̃) = P_{d̃}.

Note that as ε = max{ε_r, ε_p} converges to zero, the solution d^λ_S also approaches a performatively stable solution with respect to the original unregularized objective.

Theorem 5 (Informal Statement). Let d_PO be the performatively optimal solution with respect to the unregularized objective, and let ε = max{ε_r, ε_p}. Then there exists a value of λ such that repeatedly optimizing objective (5) converges to a solution d^λ_S with the following guarantee:
$$\sum_{s,a} r_{d^{\lambda}_S}(s,a)\, d^{\lambda}_S(s,a) \ \ge\ \sum_{s,a} r_{d_{PO}}(s,a)\, d_{PO}(s,a) \ -\ O\Big( \max\Big\{ \frac{S^{5/3} A^{1/3} \epsilon^{2/3}}{(1-\gamma)^{14/3}},\ \frac{\epsilon S}{(1-\gamma)^4} \Big\} \Big).$$

We again see that as ε converges to zero, d^λ_S approaches a performatively optimal solution with respect to the original objective. The proof of Theorem 5 uses the following bound on the distance between the performatively stable and performatively optimal solutions of the regularized objective:
$$\|d^{\lambda}_S - d^{\lambda}_{PO}\|_2 \ \le\ \frac{A}{\lambda(1-\gamma)^4}\Big( \epsilon_r\big(1 + \gamma S\big) + \frac{\epsilon_p\, \gamma S}{(1-\gamma)^2} \Big).$$
We believe that the bounds of Theorems 4 and 5 can be improved with a more careful analysis. However, the error bound should grow as 1 − γ decreases. This is because the diameter (i.e., the maximum L2 norm) of an occupancy measure is at most 1/(1−γ)², and even in performative prediction such an approximation bound grows with the diameter of the model.⁴

⁴For example, see Proposition E.1 of (Perdomo et al., 2020), which is stated for diameter 1 and a convex loss function.

4. Experiments

In this section, we experimentally evaluate the performance of various repeated retraining methods and determine the effects of various parameters on convergence.⁵ All experiments are conducted on a grid-world environment proposed by (Triantafyllou et al., 2021).⁶ We next describe how this environment is adapted for simulating performative reinforcement learning.

⁵Source code: https://github.com/gradanovic/icml2023-performative-rl-paper-code
⁶The original single-agent version of this environment can be found in (Voloshin et al., 2019).

Gridworld: We consider the grid-world environment shown in Figure 3, in which n + 1 agents share control over an actor. The agents' objective is to guide the actor from some initial state S to the terminal state G, while minimizing their total discounted cost. We will call the first agent the principal and the other n agents the followers. In each state, the agents select their actions simultaneously. The principal agent, A1, chooses the direction of the actor's next move by taking one of four actions (left, right, up, and down). Any other agent, Aj, decides whether or not to intervene in A1's action.
Figure 2: Experimental results for Gridworld. The panels show (a) RPO with λ = 1 (curves for different values of β), (b) RPO with β = 10 (curves for different values of λ), (c) RGA with λ = 1, β = 5 (curves for different step sizes η), (d) the suboptimality gap of RGA with λ = 1, β = 5, (e) ROL with finite samples for λ = 1, β = 5, and (f) RPO with finite samples for λ = 1, β = 5, where the curves in (e) and (f) correspond to m ∈ {100, 200, 500, 1000}. All plots were generated with γ = 0.9 and 1000 iterations. Except for (d), each panel plots the normalized distance c_t ||d_{t+1} − d_t||_2 against the iteration t, where c_t = 1/||d_t||_2. RPO stands for repeated policy optimization, RGA for repeated gradient ascent, ROL for repeatedly solving the (empirical) Lagrangian, and FS for finite samples. The parameters are λ (regularization), β (smoothness), η (step size), and m (number of trajectories).

If the majority of the n follower agents choose not to intervene, then the actor moves one cell towards the direction chosen by A1; otherwise, it moves one cell towards the new direction chosen by the majority of the followers. Note that the principal's and the followers' policies together determine a policy for the actor agent.

Figure 3: Gridworld.

The cost at each state is defined according to the grid-world shown in Figure 3. If the actor visits a blank or an S cell, then all the agents receive a small negative reward equal to −0.01. If an F cell is visited, then a slightly larger cost equal to 0.02 is imposed, and for H cells a severe penalty of 0.5 is inflicted. Additionally, whenever any Aj decides to intervene, it pays an additional cost of 0.05.

Response Model: We implement agent A1 as a learner who performs repeated retraining. The initial policy of agent A1 is an ε-optimal single-agent policy (ε = 0.1), computed under the assumption that no other agent intervenes. Subsequently, agent A1 performs one type of repeated retraining (e.g., gradient ascent). The follower agents, on the other hand, always respond to the policy of A1 according to a response model. In particular, given a policy of A1 (say π1), we first compute an optimal Q-value function Q_j^{|π1} for each agent Aj. Note that Q_j^{|π1} is computed w.r.t. a perturbed grid-world, which was generated by performing a random cell perturbation on the grid-world of Figure 3; the perturbed grid-worlds are different for different agents. Then we compute an average Q-function defined as Q^{|π1} = (1/n) Σ_{j=2}^{n+1} Q_j^{|π1}. A policy π2 is then determined by the Boltzmann softmax operator with temperature parameter β:
$$\pi_2(a_i \mid s) = \frac{e^{\beta\, Q^{|\pi_1}(s, a_i)}}{\sum_j e^{\beta\, Q^{|\pi_1}(s, a_j)}}.$$
Note that the new policy π2, which represents the majority response of the n follower agents, effectively plays the role of a changing environment that responds to A1. Additionally, the parameter β controls the smoothness of the changing environment, as viewed by A1.
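A minimal sketch of this response model is shown below (illustrative only; it assumes a hypothetical q_value helper that computes a follower's optimal Q-function in its perturbed grid-world given π1).

```python
import numpy as np

def follower_response(pi_1, follower_worlds, beta, q_value):
    """Boltzmann response of the followers to the principal's policy pi_1.
    follower_worlds: list of perturbed grid-world models, one per follower.
    q_value(world, pi_1) -> optimal Q-function (S x A) for that follower."""
    q_avg = np.mean([q_value(world, pi_1) for world in follower_worlds], axis=0)
    logits = beta * q_avg
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)    # pi_2(a | s)
```

Larger values of β make this map less smooth (closer to a hard argmax), which is the smoothness knob varied in the experiments below.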
Repeated Policy Optimization: We first consider the scenario where agent A1 obtains complete knowledge of the updated reward and probability transition function at time t. In our setting, this means that A1 decides on π_1^t, all the other agents respond according to the softmax operator and jointly determine π_2^t, and then agent A1 observes the new policy π_2^t. This lets A1 compute the new probability transition function P_t and reward function r_t, and solve optimization problem (5). The solution is the new occupancy measure d_1^{t+1} for A1, and A1 computes the new policy π_1^{t+1} for time t + 1 by normalization using eq. (2). Figure 2(a) shows the convergence results of repeated policy optimization for different values of β, with λ fixed to 1. We see that if the response function of the environment (i.e., the policy of agent A2) is not smooth enough (e.g., for β = 200), the algorithm fails to converge to a stable solution. In Figure 2(b) we fix β to 10 and vary the value of the parameter λ (the strength of regularization). We see that the algorithm converges only for large enough values of the constant λ. Furthermore, we observe that the convergence is faster as λ increases.

Repeated Gradient Ascent: We now examine what happens if agent A1 uses repeated gradient ascent instead of fully optimizing the objective at each iteration. Here, too, A1 fully observes π_2^t (hence P_t and r_t) at time t. However, instead of full optimization, A1 performs a projected gradient ascent step (7) to compute the next occupancy measure d_1^{t+1}. Figure 2(c) shows the performance of repeated gradient ascent for different values of the step size η. We observe that when η is small (e.g., η ≤ O(1/λ)), the learner converges to a stable solution, but the learner diverges for large η. Additionally, the convergence is faster for step sizes closer to 1/λ (as suggested by Theorem 2). Since repeated gradient ascent does not fully solve the optimization problem (5), we also plot the suboptimality gap of the current solution in Figure 2(d). This is measured as the difference between the objective value of the best feasible solution (w.r.t. M_t) and that of the current solution d_1^t, normalized by the former. As the step size η is varied, we see a trend similar to Figure 2(c).

Effect of Finite Samples: Finally, we investigate the scenario where A1 does not know π_2^t at time t, and obtains samples from the new environment M_t by deploying π_1^t. In our experiments, instead of sampling from the occupancy measure, A1 directly samples m trajectories of fixed length (up to 100) following policy π_1^t. We considered two approaches for obtaining the new policy π_1^{t+1}. First, A1 solves the empirical Lagrangian (9) through an iterative method. In particular, we use an alternating optimization based method (Algorithm 1), where h_n is updated through the follow-the-regularized-leader (FTRL) algorithm and d_n is updated through a best response.⁷ Second, A1 computes estimates of the probability transition function (P̂_t) and the reward function (r̂_t) and solves eq. (5) with these estimates. For both versions, we ran our experiments with 20 different seeds, and Figures 2(e) and 2(f) show the average errors along with the standard errors for the two approaches. For both settings, we observe that as m increases, the algorithms eventually converge to a smaller neighborhood around the stable solution. However, for a large number of samples (m = 500 or 1000), directly solving the Lagrangian (Figure 2(e)) gives improved results.

⁷Since the objective (9) is linear in h and concave in d, standard arguments (Freund & Schapire, 1996) show that Algorithm 1 finds an ε-approximate saddle point in O(SAB/((1−γ)²ε²)) iterations.

Algorithm 1: Alternating Optimization for the Empirical Lagrangian
Set H = 3S/(1−γ)² and H = {h : ||h||_2 ≤ H}.
for n = 1, 2, ..., N do
  h_n = argmin_{h∈H} Σ_{n'=1}^{n−1} ⟨∇_h L̂(d_{n'}, h; M_t), h⟩ + β ||h||_2^2
  d_n = argmax_{d ≥ 0, d(s,a) ≤ B·d̂_t(s,a) ∀s,a} L̂(d, h_n; M_t)
end for
Return d̄ = (1/N) Σ_{n=1}^N d_n.

5.
Conclusion In this work, we introduce the framework of performative reinforcement learning and study under what conditions repeated retraining methods (e.g., policy optimization, gradient ascent) converges to a stable policy. In the future, it would be interesting to extend our framework to handle high dimensional state-space, and general function approximation. The main challenge with general function approximation is that a stable policy might not exist, so the first step would be to characterize under what conditions there is a fixed point. Moreover, most RL algorithms with function approximation work in the policy space. So, another challenge lies in generalizing optimization problem 5 to handle representations of states and actions. Another interesting question is to resolve the hardness of finding stable policy with respect to the unregularized objective. To the best of our knowledge, this question is unresolved even for performative prediction with just convex loss function. It could be interesting to explore connections between our repeated optimization procedure and standard reinforcement learning methods, e.g., policy gradient methods (Mnih et al., 2016; Neu et al., 2017). However, we note that policy gradient methods typically work in the policy space, and might not even converge to a stable point under changing environments. Finally, for the finite samples setting, it would be interesting to use offline reinforcement learning algorithms (Levine et al., 2020) for improving the speed of convergence to a stable policy. Acknowledgements The authors would like to thank the anonymous reviewers for their insightful comments. Stelios Triantafyllou and Goran Radanovic were supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) project number 467367360. Performative Reinforcement Learning Agarwal, A., Jiang, N., Kakade, S., and Sun, W. Reinforcement learning: Theory and algorithms. 2021. Aggarwal, C. C. et al. Recommender systems, volume 1. Springer. Bai, Y., Jin, C., Wang, H., and Xiong, C. Sample-efficient learning of stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34, 2021. Bell, J., Linsefors, L., Oesterheld, C., and Skalse, J. Reinforcement learning in newcomblike environments. Advances in Neural Information Processing Systems, 34, 2021. Bertsekas, D. Convex optimization theory, volume 1. Athena Scientific, 2009. Brown, G., Hod, S., and Kalemaj, I. Performative prediction in a stateful world. ar Xiv preprint ar Xiv:2011.03885, 2020. Brown, N. and Sandholm, T. Superhuman ai for multiplayer poker. Science, 365(6456):885 890, 2019. Carroll, M., Shah, R., Ho, M. K., Griffiths, T., Seshia, S., Abbeel, P., and Dragan, A. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems, 32, 2019. Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 224 232, 2018. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism. In International Conference on Machine Learning, pp. 1843 1854. PMLR, 2020. Crandall, J. W., Oudah, M., Ishowo-Oloko, F., Abdallah, S., Bonnefon, J.-F., Cebrian, M., Shariff, A., Goodrich, M. A., Rahwan, I., et al. Cooperating with machines. Nature communications, 9(1):1 12, 2018. Dekel, O. 
and Hazan, E. Better rates for any adversarial deterministic mdp. In International Conference on Machine Learning, pp. 675 683. PMLR, 2013. Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. Multi-view decision processes: the helper-ai problem. Advances in Neural Information Processing Systems, 30, 2017. Domingues, O. D., M enard, P., Pirotta, M., Kaufmann, E., and Valko, M. A kernel-based approach to non-stationary reinforcement learning in metric spaces. In International Conference on Artificial Intelligence and Statistics, pp. 3538 3546. PMLR, 2021. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., De Pristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., and Dean, J. A guide to deep learning in healthcare. Nature medicine, 25(1):24 29, 2019. Even-Dar, E., Kakade, S. M., and Mansour, Y. Experts in a markov decision process. Advances in neural information processing systems, 17, 2004. Farahmand, A.-m., Szepesv ari, C., and Munos, R. Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23, 2010. Fei, Y., Yang, Z., Wang, Z., and Xie, Q. Dynamic regret of policy optimization in non-stationary environments. Advances in Neural Information Processing Systems, 33: 6743 6754, 2020. Filar, J. and Vrieze, K. Competitive Markov decision processes. Springer Science & Business Media, 2012. Freund, Y. and Schapire, R. E. Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pp. 325 332, 1996. Gajane, P., Ortner, R., and Auer, P. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. ar Xiv preprint ar Xiv:1805.10066, 2018. Glicksberg, I. L. A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points. Proceedings of the American Mathematical Society, 3(1):170 174, 1952. Hansen, C., Hansen, C., Maystre, L., Mehrotra, R., Brost, B., Tomasi, F., and Lalmas, M. Contextual and sequential user embeddings for large-scale music recommendation. In Fourteenth ACM conference on recommender systems, pp. 53 62, 2020. Huang, P., Xu, M., Fang, F., and Zhao, D. Robust reinforcement learning as a stackelberg game via adaptively-regularized adversarial training. ar Xiv preprint ar Xiv:2202.09514, 2022. Kar, D., Ford, B., Gholami, S., Fang, F., Plumptre, A., Tambe, M., Driciru, M., Wanyama, F., Rwetsiba, A., Nsubaga, M., et al. Cloudy with a chance of poaching: Adversary behavior modeling and forecasting with realworld poaching data. 2017. Performative Reinforcement Learning Letchford, J., Mac Dermed, L., Conitzer, V., Parr, R., and Isbell, C. L. Computing optimal strategies to commit to in stochastic games. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory, pp. 3242 3245. PMLR, 2021. Mansoury, M., Abdollahpouri, H., Pechenizkiy, M., Mobasher, B., and Burke, R. Feedback loop and bias amplification in recommender systems. In Proceedings of the 29th ACM international conference on information & knowledge management, pp. 2145 2148, 2020. Mendler-D unner, C., Perdomo, J., Zrnic, T., and Hardt, M. 
Stochastic optimization for performative prediction. Advances in Neural Information Processing Systems, 33: 4929 4939, 2020. Miller, J. P., Perdomo, J. C., and Zrnic, T. Outside the echo chamber: Optimizing the performative risk. In International Conference on Machine Learning, pp. 7710 7720. PMLR, 2021. Mishra, R. K., Vasal, D., and Vishwanath, S. Model-free reinforcement learning for stochastic stackelberg security games. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 348 353. IEEE, 2020. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Munos, R. and Szepesv ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (5), 2008. Narang, A., Faulkner, E., Drusvyatskiy, D., Fazel, M., and Ratliff, L. J. Multiplayer performative prediction: Learning in decision-dependent games. ar Xiv preprint ar Xiv:2201.03398, 2022. Neu, G., Gyorgy, A., and Szepesv ari, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pp. 805 813. PMLR, 2012. Neu, G., Jonsson, A., and G omez, V. A unified view of entropy-regularized markov decision processes. ar Xiv preprint ar Xiv:1705.07798, 2017. Nikolaidis, S., Nath, S., Procaccia, A. D., and Srinivasa, S. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, pp. 323 331, 2017a. Nikolaidis, S., Nath, S., Procaccia, A. D., and Srinivasa, S. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, pp. 323 331, 2017b. Perdomo, J., Zrnic, T., Mendler-D unner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp. 7599 7609. PMLR, 2020. Radanovic, G., Devidze, R., Parkes, D., and Singla, A. Learning to collaborate in markov decision processes. In International Conference on Machine Learning, pp. 5261 5270. PMLR, 2019. Rajeswaran, A., Mordatch, I., and Kumar, V. A game theoretic framework for model based reinforcement learning. In International conference on machine learning, pp. 7953 7963. PMLR, 2020. Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pp. 5478 5486. PMLR, 2019. Sessa, P. G., Bogunovic, I., Kamgarpour, M., and Krause, A. Learning to play sequential games versus unknown opponents. Advances in Neural Information Processing Systems, 33:8971 8981, 2020. Shapley, L. S. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095 1100, 1953. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017. Sinha, A., Kar, D., and Tambe, M. Learning adversary behavior in security games: A pac model perspective. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 214 222, 2016. Triantafyllou, S., Singla, A., and Radanovic, G. On blame attribution for accountable multi-agent sequential decision making. Advances in Neural Information Processing Systems, 34, 2021. Vershynin, R. 
Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Performative Reinforcement Learning Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575 (7782):350 354, 2019. Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. ar Xiv preprint ar Xiv:1911.06854, 2019. Von Stackelberg, H. Market structure and equilibrium. Springer Science & Business Media, 2010. Vorobeychik, Y. and Singh, S. Computing stackelberg equilibria in discounted stochastic games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp. 1478 1484, 2012. Wei, C.-Y. and Luo, H. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. In Conference on Learning Theory, pp. 4300 4354. PMLR, 2021. Xie, T. and Jiang, N. Batch value-function approximation with only realizability. In International Conference on Machine Learning, pp. 11404 11413. PMLR, 2021. Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. D. Offline reinforcement learning with realizability and singlepolicy concentrability. ar Xiv preprint ar Xiv:2202.04634, 2022. Zhang, K., Yang, Z., and Bas ar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321 384, 2021. Zhong, H., Yang, Z., Wang, Z., and Jordan, M. I. Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers? ar Xiv preprint ar Xiv:2112.13521, 2021. Performative Reinforcement Learning Table of Contents A Additional Information on Experimental Setup and Results 13 A.1 Additional Information on Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.3 Total Amount of Compute and Type of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 B Missing Proofs 14 B.1 Proof of Convergence of Repeated Maximization (Theorem 1) . . . . . . . . . . . . . . . . . . . . . . . 14 B.2 Proof of Convergence of Repeated Gradient Ascent (Theorem 2) . . . . . . . . . . . . . . . . . . . . . 21 B.3 Formal Statement and Proof of Convergence with Finite Samples (Theorem 3) . . . . . . . . . . . . . . 29 B.4 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 B.5 Assumptions Regarding Quadratic Regularizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 B.6 Omitted Proofs from Subsection 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 A. Additional Information on Experimental Setup and Results In this section, we provide additional information on the experimental setup (Section A.1), as well as additional experimental results (Section A.2). We also provide information regarding the total amount of compute time and the type of resources that were used (Section A.3). A.1. Additional Information on Experimental Setup Gridworld: The initial state of the actor in the grid-world is selected uniformly at random from the cells denoted by S. Additionally, the actor remains at its current cell in case it attempts a move that would lead it outside the grid-world. 
Regarding the reward function, all the agents receive a positive reward equal to +1 whenever the actor reaches the terminal state G. Response Model: The response model of a follower agent is based on a perturbed grid-world instead of the one in Fig. 3. In other words, each of the n follower agents sees different cell costs than the ones A1 sees. As a result, they might respond to the policy of A1, by adopting a policy that performs unnecessary or even harmful interventions w.r.t. the grid-world of Fig. 3. A perturbed grid-world is generated from the grid-world of Fig. 3 with the following procedure. First, G and S cells stay the same between the two grid-worlds. Then, any blank, F or H cell remains unchanged with probability 0.7, and with probability 0.3 we perturb its type to blank , F or H (the perturbation is done uniformly at random). A.2. Additional Experimental Results In this section, we provide additional insights on the interventional policies of the follower agents. The repeated retraining method we use in these experiments is the repeated policy optimization. More specifically, we present a visual representation of the limiting environment i.e. the majority of the agents policy in the limit, i.e., after the method has converged to a stable solution. The configurations are set to λ = 1, β = 5, and we vary discount factor γ. As mentioned in Section 4, the policy of the follower agents can be thought of as a changing environment that responds to the policy updates of A1. To visualize how this environment looks like in the limit, we depict in Figure 6 several limiting policies of the follower agents. From the Figures 4, and 5 we observe that for smaller discount factor, the majority of the follower agents tend to intervene closer to the goal state. Performative Reinforcement Learning Figure 4: γ = 0.5 Figure 5: γ = 0.9 Figure 6: Figures 4, and 5 visualize two instances of the interventional policy of agent A2 in the Gridworld environment. All figures correspond to the majority of the followers policy at convergence for various values of the discount factor γ. Empty cells denote states where majority of the agents most probable action is to not intervene. For cells with a red arrow the (highest probability) action of the majority of the follower agents is to intervene by forcing the actor to move one cell towards the direction of the arrow. A.3. Total Amount of Compute and Type of Resources All experiments were conducted on a computer cluster with machines equipped with 2 Intel Xeon E5-2667 v2 CPUs with 3.3GHz (16 cores) and 50 GB RAM. Table 1 reports the total computation times for our experiments (Section 4). Note that at each iteration of the repeated gradient ascent experiment, apart from the gradient step a full solution of the optimization problem 5 was also computed, in order to report the suboptimality gap. Repeated Policy Optimization 767 sec Repeated Gradient Ascent 964 sec Repeated Policy Optimization with Finite Samples 33746 sec Repeated Gradient Ascent with Finite Samples 35396 sec Table 1: Total computation times for the different experiments described in Section 4. B. Missing Proofs B.1. Proof of Convergence of Repeated Maximization (Theorem 1) Proof. We first compute the dual of the concave optimization problem 5. The Lagrangian is given as L(d, h) = d rt λ 2 d 2 2 + X a d(s, a) + ρ(s) + γ X s ,a d(s , a)Pt(s , a, s) At an optimal solution we must have d L(d, h) = 0, which gives us the following expression for d. 
d(s, a) = rt(s, a) es h(es)Pt(s, a, es) (11) Performative Reinforcement Learning Substituting the above value of d we get the following dual problem. s,a h(s)rt(s, a) + γ s ,a h(s)rt(s , a)Pt(s , a, s) + X es h(es)Pt(s, a, es) + γ2 es,ˆs h(es)h(ˆs)Pt(s, a, ˆs)Pt(s, a, es) (12) Note that the dual objective is parameterized by reward function rt and probability transition function Pt which are the parameters corresponding to the occupancy measure dt. We will write L( ; Mt) to denote this dual objective function. For a given occupancy measure (i.e. dt = d) we will write GD(d) to denote the optimal solution to the primal problem 5. We first aim to show that the operator GD( ) is a contraction mapping. Consider two occupancy measures d and ˆd. Let r (resp. ˆr) be the reward functions in response to the occupancy measure d (resp. ˆd). Similarly, let P (resp. ˆP) be the probability transition function in response to the occupancy measure d (resp. ˆd). Let h (resp. ˆh) be the optimal dual solutions corresponding to the occupancy measures d (resp. ˆd) i.e. h arg maxh L(h ; M) and ˆh arg maxh L(h ; ˆ M). Lemma 2 proves that the objective is A(1 γ)2/λ strongly convex. Therefore, we have the following two inequalities. L(h; M) L(ˆh; M) h ˆh L(ˆh; M) + A(1 γ)2 L(ˆh; M) L(h; M) A(1 γ)2 These two inequalities give us the following bound. 2 (h ˆh) L(ˆh; M) (15) We now bound the Lipschitz constant of the term (h ˆh) Ld(ˆh; M) with respect to the MDP M. Lemma 3 gives us the following bound. L(ˆh; M) L(ˆh; ˆ M) 2 4S A λ r ˆr 2 + Now notice that the dual variable ˆh is actually an optimal solution and we can use lemma 4 to bound its norm by 3S (1 γ)2 . Furthermore, under assumption 1, we have r ˆr 2 ϵr d ˆd 2 and P ˆP 2 ϵp d ˆd 2. Substituting these bounds we get the following inequality. L(ˆh; M) L(ˆh; ˆ M) 2 4S A λ ϵr d ˆd 2 + AS λ 3S (1 γ)2 Aϵr λ + 10γS2 Aϵp λ(1 γ)2 Performative Reinforcement Learning We now substitute the above bound in equation 15. 2 (h ˆh) L(ˆh; M) = (h ˆh) L(ˆh; M) L(ˆh; ˆ M) [As ˆh is optimal for L( ; ˆ M)] L(ˆh; M) L(ˆh; ˆ M) 2 Aϵr λ + 10γS2 Aϵp λ(1 γ)2 Rearranging we get the following inequality. h ˆh 2 λ A(1 γ)2 Aϵr λ + 10γS2 Aϵp λ(1 γ)2 Recall that GD(d) (resp. GD( ˆd)) are the optimal solution corresponding to the primal problem when the deployed occupancy measure is d (resp. ˆd). Therefore, we can apply lemma 1 to obtain the following bound. GD(d) GD( ˆd) 2 1 + 4ϵr + 6ϵp ˆh 2 λ 1 + 4ϵr + 6ϵp 3S/(1 γ)2 AS λ λ A(1 γ)2 Aϵr λ + 10γS2 Aϵp λ(1 γ)2 1 + 4ϵr + 6ϵp 3S/(1 γ)2 Aϵr λ + 10γS2 Aϵp λ(1 γ)2 Now it can be easily verified that if λ > 12S3/2(1 γ) 4(2ϵr + 5Sϵp) then β = 12S3/2(2ϵr+5Sϵp) λ(1 γ)4 < 1. This implies that the operator GD( ) is a contraction mapping and the sequence of iterates {dt}t 1 converges to a fixed point. In order to determine the speed of convergence let us substitute d = dt and ˆd = d S. This gives us GD(dt) d S 2 β dt d S 2. As GD(dt) = dt+1 we have dt+1 d S 2 β dt d S 2. After t iterations we have dt d S 2 βt d0 d S 2. Therefore, if t ln ( d0 d S 2 /δ) / ln(1/β) we are guaranteed that dt d S 2 δ. Since d0 d S 2 2 1 γ , the desired upper bound on the number of iterations becomes the following. ln ( d0 d S 2 /δ) ln(1/β) 2 (1 β) 1 ln 2 δ(1 γ) Lemma 1. Consider two state-action occupancy measures d and ˆd. Let λ 2 2ϵr + 3ϵp ˆh 2 . Then we have the following bound. d ˆd 2 1 + 4ϵr + 6ϵp ˆh 2 λ Proof. Recall the relationship between the dual and the primal variables. 
d(s, a) = rt(s, a) es h(es)P(s, a, es) Performative Reinforcement Learning This gives us the following bound on the difference (d(s, a) ˆd(s, a))2. d(s, a) ˆd(s, a) 2 3 λ2 (r(s, a) ˆr(s, a))2 + 1 h(s) ˆh(s) 2 s h(s )P(s, a, s ) X s ˆh(s ) ˆP(s, a, s ) [By Jensen s inequality] λ2 (r(s, a) ˆr(s, a))2 + 1 h(s) ˆh(s) 2 h(s ) ˆh(s ) P(s, a, s ) + ˆh(s ) P(s, a, s ) ˆP(s, a, s ) !2 λ2 (r(s, a) ˆr(s, a))2 + 1 h(s) ˆh(s) 2 h(s ) ˆh(s ) P(s, a, s ) s ˆh(s ) P(s, a, s ) ˆP(s, a, s ) !2 [By Jensen s inequality] λ2 (r(s, a) ˆr(s, a))2 + 1 h(s) ˆh(s) 2 P(s, a, s ) ˆP(s, a, s ) 2 [By Cauchy-Schwarz inequality] Now summing over s and a we get the following bound. λ2 r ˆr 2 2 + 7AS We now use the assumptions r ˆr 2 ϵr d ˆd 2 and P ˆP 2 ϵp d ˆd 2. h ˆh 2 + 3ϵp Rearranging we get the following bound. 1 2ϵr + 3ϵp ˆh 2 λ 1 + 4ϵr + 6ϵp ˆh 2 λ The last inequality uses the fact that λ 2(2ϵr + 3ϵp ˆh 2). Lemma 2. The dual objective Ld (as defined in 12) is A(1 γ)2 λ -strongly convex. Proof. The derivative of the dual objective Ld with respect to h(s) is given as follows. a rt(s, a) + γ s ,a rt(s , a)Pt(s , a, s) + ρ(s) + A es,a h(es) (Pt(s, a, es) + Pt(es, a, s)) + γ2 s ,a,es h(es)Pt(s , a, es)Pt(s , a, s) (16) Performative Reinforcement Learning This gives us the following identity. Ld(h) Ld(eh) (h eh) = A s,es,a (h(s) eh(s)) (Pt(s, a, es) + Pt(es, a, s)) (h(es) eh(es)) s,es (h(s) eh(es))Pt(s , a, es)Pt(s , a, s)(h(s) eh(s)) Let us now define the matrix Ma RS S with entries Ma(s, s ) = Pt(s, a, s ). Ld(h) Ld(eh) (h eh) = A a (h eh) Ma + M a (h eh) + γ2 a (h eh) M a Ma(h eh) a (h eh) Id γMa γM a + γ2M a Ma (h eh) The last inequality uses lemma 5. Lemma 3. The dual function Ld (as defined in eq. (12)) satisfies the following bound for any h and MDP M, c M. Ld(h, M) Ld(h, c M) 2 4S A λ r ˆr 2 + Proof. From the expression of the derivative of Ld with respect to h (eq. (16)) we get the following bound. Ld(h, M) Ld(h, c M) 2 a (r(s, a) ˆr(s, a)) r(s , a)P(s , a, s) ˆr(s , a) ˆP(s , a, s) γ es,a h(es)(P(es, a, s) ˆP(es, a, s)) es,a h(es)(P(s, a, es) ˆP(s, a, es)) + γ2 s ,a,es h(es) P(s , a, es)P(s , a, s) ˆP(s , a, es) ˆP(s , a, s) λ2 r ˆr 2 2 + 5γ2 r(s , a)P(s , a, s) ˆr(s , a) ˆP(s , a, s) es,a h(s)(P(es, a, s) ˆP(es, a, s)) es,a h(es)(P(s, a, es) ˆP(s, a, es)) s ,a,es h(es) P(s , a, es)P(s , a, s) ˆP(s , a, es) ˆP(s , a, s) We now use four bounds to complete the proof. The following bounds use Jensen s inequality and Cauchy-Schwarz Performative Reinforcement Learning inequality. Bound 1 : X r(s , a)P(s , a, s) ˆr(s , a) ˆP(s , a, s) s ,a |r(s , a) ˆr(s , a)| P(s , a, s) + ˆr(s , a) P(s , a, s) ˆP(s , a, s) P(s , a, s) ˆP(s , a, s) Bound 2 : X es,a h(es)(P(es, a, s) ˆP(es, a, s)) es h(es)2 X P(es, a, s) ˆP(es, a, s) 2 A h 2 2 P ˆP 2 Bound 3 : X es,a h(es)(P(s, a, es) ˆP(s, a, es)) es h(es)2 X P(s, a, es) ˆP(s, a, es) 2 A h 2 2 P ˆP 2 Bound 4 : X s ,a,es h(es) P(s , a, es)P(s , a, s) ˆP(s , a, es) ˆP(s , a, s) es h(es)2 X P(s , a, es)P(s , a, s) ˆP(s , a, es) ˆP(s , a, s) P(s , a, es)P(s , a, s) ˆP(s , a, es) ˆP(s , a, s) 2 P(s , a, es)(P(s , a, s) ˆP(s , a, s)) + ˆP(s , a, s)(P(s , a, es) ˆP(s , a, es)) 2SA h 2 2 X P(s , a, s) ˆP(s , a, s) 2 + P(s , a, es) ˆP(s , a, es) 2 4S2A h 2 2 P ˆP 2 Using the four upper bounds shown above, we can complete the proof. Ld(h, M) Ld(h, c M) 2 λ2 r ˆr 2 2 + 5γ2 r ˆr 1 + P( , , s) ˆP( , , s) 1 λ2 h 2 2 P ˆP 2 2 + 20S2Aγ4 λ2 h 2 2 P ˆP 2 λ2 + 10S2Aγ2 r ˆr 2 2 + 10γ2SA λ2 h 2 2 + 20S2Aγ4 Performative Reinforcement Learning Lemma 4. 
The norm of the optimal solution to the dual problem (defined in 12) is bounded by 3S (1 γ)2 for any choice of MDP M. Proof. The dual objective Ld is strongly-convex and has a unique solution. The optimal solution can be obtained by setting the derivative with respect to h to zero. Rearranging the derivative of the dual objective (16) we get the following systems of equations. a Pt(s, a, s) + γ2 s ,a Pt(s , a, s)2 ˆs =s h(ˆs) a Pt(s, a, ˆs) γ a Pt(ˆs, a, s) + γ2 s ,a Pt(s , a, ˆs)Pt(s , a, s) a rt(s, a) ρ(s) γ s ,a rt(s , a)Pt(s , a, s) Therefore let us define a matrix B RS S and a vector b RS with the following entries. a Pt(s, a, s) + γ2 s ,a Pt(s , a, s)2 if s = ˆs a Pt(s, a, ˆs) γ a Pt(ˆs, a, s) + γ2 s ,a Pt(s , a, ˆs)Pt(s , a, s) o.w. a rt(s, a) ρ(s) γ s ,a rt(s , a)Pt(s , a, s) Then the optimal solution is the solution of the system of equations Bh = b. We now provide a bound on the L2-norm of such a solution. For each a, we define matrix Ma RS S with entries Ma(s, ˆs) = Pt(s, a, ˆs). Then the matrix B can be expressed as follows. Ma + M a + γ2 a M a Ma A(1 γ)2 The last inequality uses lemma 5. Notice that for γ < 1 this also shows that the matrix is invertible. We can also bound the norm of the vector b. λ + ρ(s) + γ s ,a Pt(s , a, s) λ2 + 3 ρ 2 2 + 3γ2 s ,a Pt(s , a, s) λ2 + 3 + 3SAγ2 s ,a Pt(s , a, s) 9S2A2 Therefore, we have the following bound on the optimal value. A 1b 2 b 2 λmin(A) 3S (1 γ)2 Lemma 5. For each a, let the matrix Ma RS S be defined so that Ma(s, s ) = P(s, a, s ). a Id γ(Ma + M a ) + γ2M a Ma Performative Reinforcement Learning Proof. Let Ma = UaΣa U a be the Eigen-decomposition of the matrix Ma. Then we have X a Id γ(Ma + M a ) + γ2M a Ma = X a (Id γMa) (Id γMa) Ua(Id γΣa)U a Ua(Id γΣa)U a = X a Ua (Id γΣa)2 U a a Id(1 γ)2 = A(1 γ)2Id The last line follows the largest eigenvalue of Ma is 1, and therefore the smallest diagonal entry of the matrix (Id γΣa)2 is at least (1 γ)2. B.2. Proof of Convergence of Repeated Gradient Ascent (Theorem 2) Proof. The dual of the optimization problem 8 is given as follows. s,a h(s) ((1 ηλ)dt(s, a) + ηrt(s, a)) X s ,a Pt(s , a, s) ((1 ηλ)dt(s , a) + ηrt(s , a)) A s ,a h(s )Pt(s , a, s) γ2 s ,s h(s )h(s ) X s,a Pt(s, a, s )Pt(s, a, s ) We will consider the equivalent minimization problem. s,a h(s) ((1 ηλ)dt(s, a) + ηrt(s, a)) + X s ,a Pt(s , a, s) ((1 ηλ)dt(s , a) + ηrt(s , a)) + A s ,a h(s )Pt(s , a, s) + γ2 s ,s h(s )h(s ) X s,a Pt(s, a, s )Pt(s, a, s ) Let us call the above objective function P( ; M) for a given MDP M. Consider two occupancy measures d and ˆd. Let r (resp. ˆr) be the reward functions in response to the occupancy measure d (resp. ˆd) i.e. r = R(d) and ˆr = R( ˆd). Similarly let P (resp. ˆP) be the probability transition functions in response to the occupancy measures d (resp. ˆd). We will write GD( ) to denote the projected gradient ascent step defined in eq. (7). In particular, if we write C to define the set of occupancy measures feasible with respect to P, then we have GD(d) = Proj C ((1 ηλ)d + ηr) (19) Note that GDη(d) is the optimal solution to the primal problem 8 with dt = d. Let h be the corresponding dual optimal solution. Similarly let ˆh be the optimal dual solution corresponding to the occupancy measure ˆd. Since h is the unique minimizer of P( ; M) and P( ; M) is A(1 2γ)-strongly convex for any M (lemma 7) we have the following set of inequalities. P(h; M, d) P(ˆh; M, d) (h ˆh) P(ˆh; M, d) + A(1 γ)2/2 h ˆh 2 P(ˆh; M, d) P(h; M, d) A(1 γ)2/2 h ˆh 2 These two inequalities give us the following bound. 
A(1 γ)2 h ˆh 2 2 (h ˆh) P(ˆh; M, d) (20) Performative Reinforcement Learning We now apply lemma 8 to bound the Lipschitz constant of the term (h ˆh) P(ˆh; M). P(ˆh; M, d) P(ˆh; ˆ M, ˆd) 2 2 5A(1 ηλ)2(1 + 2γ2S2) d ˆd 2 2 + 5η2A (1 ηλ)2 + 2γ2S2 r ˆr 2 2 + 5γ2SA 2(1 ηλ)2 (1 γ)2 + 2η2 + 6γ2 ˆh 2 5A(1 ηλ)2(1 + η2ϵ2 r + 2γ2S2) + 10η2γ2AS2ϵ2 r d ˆd 2 + 5γ2SA 2(1 ηλ)2 (1 γ)2 + 2η2 + 12γ2 (1 + 2ηS)2 (1 γ)4 + 4(1 ηλ)2 S2 ϵ2 p d ˆd 2 5A(1 ηλ)2 1 + η2ϵ2 r + 2γ2S2 + 2γ2S (1 γ)2 ϵ2 p + 48γ4S3 A(1 γ)6 ϵ2 p + 10η2γ2AS2ϵ2 r + 10η2γ2SAϵ2 p + 60γ4SAϵ2 p (1 + 2ηS)2 5A(1 ηλ)2 1 + η2ϵ2 r + 2γ2S2 + 50γ2S3 (1 γ)6 ϵ2 p + 10η2γ2S2A(ϵ2 p + ϵ2 r) + 60γ4SAϵ2 p (1 + 2ηS)2 | {z } := 2 The last inequality uses lemma 9 and assumption 1. Substituting this bound in equation 20 we get the following inequality. A(1 γ)2 h ˆh 2 2 (h ˆh) P(ˆh; M, d) = (h ˆh) P(ˆh; M, d) (h ˆh) P(ˆh; ˆ M, ˆd) P(ˆh; M, d) P(ˆh; ˆ M, ˆd) 2 h ˆh 2 Rearranging we get the following inequality. h ˆh 2 A(1 γ)2 GDη(d) GDη( ˆd) 2 2 8γ2SA 4(1 ηλ)2 + 4η2ϵ2 r + 8γ2 ˆh 2 The last line uses lemma 6. After rearranging we get the following inequality. GDη(d) GDη( ˆd) 2 2 4(1 ηλ)2 + 4η2ϵ2 r + 8γ2 ˆh 2 2 ϵ2 p + 8γ2 2S 4(1 ηλ)2 + 4η2ϵ2 r + 16γ2ϵ2 p(1 + 2ηS)2 (1 γ)4 + 32(1 ηλ)2 γ2ϵ2 p S2 A(1 γ)6 + 8γ2 2S For η = 1/λ we get the following bound. GDη(d) GDη( ˆd) 2 4ϵ2 r λ2 + 16γ2ϵ2 p(1 + 2S/λ)2 (1 γ)4 + 80γ4S3(ϵ2 r + ϵ2 p) λ2(1 γ)4 + 480γ6S2ϵ2 p(1 + 2S/λ)2 If we choose λ max n 4ϵr, 2S, 20γ2S1.5(ϵr+ϵp) (1 γ)2 o we get the following condition. GDη(d) GDη( ˆd) 2 1 2 + 64γ2ϵ2 p (1 γ)4 + 1920γ6S2ϵ2 p (1 γ)8 Performative Reinforcement Learning For a contraction mapping we need the following condition. 64γ2ϵ2 p (1 γ)4 We consider two cases. First, if 30γ4S2 (1 γ)4 < 1 then one can show that a sufficient condition is ϵp < γS/3. On the other hand, (1 γ)4 1 then we need ϵp < (1 γ)4 100γ3S . Combining the two conditions above a sufficient condition for contraction is the following. ϵp < min γS Now if we set µ = q 1 2 + 64γ2ϵ2p (1 γ)4 + 1920γ6S2ϵ2p (1 γ)8 we get the contraction mapping: GDη(d) GDη( ˆd) 2 µ d ˆd 2. Let d be the fixed point of this contraction mapping. Using d = dt and ˆd = d we get the following sequence of inequalities. dt+1 d 2 µ dt d 2 . . . µt d1 d 2 µt 2 1 γ The last inequality uses the fact that for any occupancy measure d we have d 2 d 1 1 1 γ . Rearranging we get that as long as t ln 2 δ(1 γ) / ln(1/µ) we have dt d 2 δ. We now show that the fixed point d is a stable point. In response to d , let the probability transition function (resp. reward function) be P (resp. d ). Let C be the set of occupancy measures corresponding to d . Note that C is a convex set. We consider two cases. First, (1 ηλ)d + ηr C . Then d = GDη(d ) = (1 ηλ)d + ηr and r λd = P(d ; P , r ) = 0. Since P( ; P , r ) is a concave function the occupancy measure d is the optimal point and is a stable point. Second, we consider the case when (1 ηλ)d + ηr / C . Since d = Proj C ((1 ηλ)d + ηr ). Since C is a convex set, by the projection theorem (see e.g. (Bertsekas, 2009)) we have the following inequality for any d C . ((1 ηλ)d + ηr d ) (d d ) 0 η(λd r ) (d d ) 0 P(d ; P , r ) (d d ) 0 This implies that d maximizes the function P( ; P , r ) over the set C and is a stable point. Lemma 6. Consider two state-action occupancy measures d and ˆd. Let h (resp. ˆh) be the dual optimal solutions to the projection (eq. (18)) corresponding to occupancy measure dt = d (resp. ˆd). Then we have the following inequality. GDη(d) GDη( ˆd) 2 2 4(1 ηλ)2 + 4η2ϵ2 r + 8γ2 ˆh 2 2 + 8γ2SA h ˆh 2 Proof. 
Recall the relationship between the dual and the primal variables. GDη(d)(s, a) = (1 ηλ)d(s, a) + ηr(s, a) h(s) + γ X es h(es)P(s, a, es) Performative Reinforcement Learning This gives us the following bound on the difference (GDη(d)(s, a) GDη( ˆd)(s, a))2. GDη(d)(s, a) GDη( ˆd)(s, a) 2 4(1 ηλ)2 d(s, a) ˆd(s, a) 2 + 4η2 (r(s, a) ˆr(s, a))2 + 4 h(s) ˆh(s) 2 + 4γ2 X s h(s )P(s, a, s ) X s ˆh(s ) ˆP(s, a, s ) 4(1 ηλ)2 d(s, a) ˆd(s, a) 2 + 4η2 (r(s, a) ˆr(s, a))2 + 4 h(s) ˆh(s) 2 + 4γ2 X h(s ) ˆh(s ) P(s, a, s ) + ˆh(s ) P(s, a, s ) ˆP(s, a, s ) !2 4(1 ηλ)2 d(s, a) ˆd(s, a) 2 + 4η2 (r(s, a) ˆr(s, a))2 h(s ) ˆh(s ) P(s, a, s ) s ˆh(s ) P(s, a, s ) ˆP(s, a, s ) !2 4(1 ηλ)2 d(s, a) ˆd(s, a) 2 + 4η2 (r(s, a) ˆr(s, a))2 + 8γ2 h ˆh 2 2 + 8γ2 ˆh 2 P(s, a, s ) ˆP(s, a, s ) 2 Now summing over s and a we get the following bound. GDη(d) GDη( ˆd) 2 2 4(1 ηλ)2 d ˆd 2 2 + 4η2 r ˆr 2 2 + 8γ2SA h ˆh 2 2 + 8γ2 ˆh 2 We now use the assumptions r ˆr 2 ϵr d ˆd 2 and P ˆP 2 ϵp d ˆd 2. GDη(d) GDη( ˆd) 2 2 4(1 ηλ)2 + 4η2ϵ2 r + 8γ2 ˆh 2 2 + 8γ2SA h ˆh 2 Lemma 7. The objective function P( ; M) (as defined in 18) is A(1 γ)2-strongly convex. Proof. The derivative of the objective function P( ; M) with respect to h(s) is given as follows. a ((1 ηλ)dt(s, a) + ηrt(s, a)) + ρ(s) + γ X s ,a Pt(s , a, s) ((1 ηλ)dt(s, a) + ηrt(s, a)) + Ah(s) γ X es,a h(es) (Pt(es, a, s) + Pt(s, a, es)) + γ2 X es,a Pt(es, a, s )Pt(es, a, s) (21) This gives us the following identity. P(h; Mt) P(eh; Mt) (h eh) = A h eh 2 s,s ,a (h(s ) eh(s ))(Pt(s , a, s) + Pt(s, a, s ))(h(s) eh(s)) s ,s (h(s ) eh(s )) X es,a Pt(es, a, s )Pt(es, a, s)(h(s) eh(s)) Performative Reinforcement Learning Now for each action a, we define the following matrix Ma RS S with entries Ma(s, s ) = Pt(s, a, s ). Note that matrix Ma is row-stochastic and has eigenvalues bounded between 1 and 1. P(h; Mt) P(eh; Mt) (h eh) = A h eh 2 2 γ (h eh) X + γ2(h eh) X Id γ(Ma + M a ) + γ2M a Ma (h eh) A(1 γ)2 h eh 2 The last line uses lemma 5. Lemma 8. The dual function (as defined in 18) satisfies the following guarantee for any h, occupancy measures d, ˆd, and MDP M, ˆ M. P(h; M, d) P(h; c M, ˆd) 2 5A(1 ηλ)2(1 + 2γ2S2) d ˆd 2 2 + 5η2A (1 ηλ)2 + 2γ2S2 r ˆr 2 2 + 5γ2SA 2(1 ηλ)2 (1 γ)2 + 2η2 + 6γ2 h 2 2 Proof. From the expression of the derivative of the function P( ; M, d) (21) we have the following bound. P(h; M, d) P(h; c M, ˆd) 2 a (d(s, a) ˆd(s, a)) + η(1 ηλ) X a (r(s, a) ˆr(s, a)) + γη X P(s , a, s)r(s, a) ˆP(s , a, s)ˆr(s, a) + γ(1 ηλ) X P(s , a, s)d(s, a) ˆP(s , a, s) ˆd(s, a) es,a h(es) P(es, a, s) + P(s, a, es) ˆP(es, a, s) ˆP(s, a, es) P(es, a, s )P(s, a, es) ˆP(es, a, s ) ˆP(s, a, es) Performative Reinforcement Learning a (d(s, a) ˆd(s, a)) + 5η2(1 ηλ)2 X a (r(s, a) ˆr(s, a)) P(s , a, s)r(s, a) ˆP(s , a, s)ˆr(s, a) + 5γ2(1 ηλ)2 X P(s , a, s)d(s, a) ˆP(s , a, s) ˆd(s, a) es,a h(es) P(es, a, s) + P(s, a, es) ˆP(es, a, s) ˆP(s, a, es) P(es, a, s )P(s, a, es) ˆP(es, a, s ) ˆP(s, a, es) We now establish several bounds to complete the proof. The bounds mainly use the Cauchy-Schwarz inequality and the Jensen s inequality. Bound 1 : X P(s , a, s)d(s, a) ˆP(s , a, s) ˆd(s, a) s ,a P(s , a, s) d(s, a) ˆd(s, a) + ˆd(s, a) P(s , a, s) ˆP(s , a, s) s ,a P(s , a, s) d(s, a) ˆd(s, a) s ,a ˆd(s, a) P(s , a, s) ˆP(s , a, s) d(s, a) ˆd(s, a) 2 X s ,a (P(s , a, s))2 a ( ˆd(s, a))2 X s P(s , a, s) ˆP(s , a, s) s,a,s P(s , a, s) + 2AS (1 γ)2 X P(s , a, s) ˆP(s , a, s) 2 2S2A d ˆd 2 2 + 2AS (1 γ)2 Similarly one can establish the following bound. 
Bound 2 : X P(s , a, s)r(s, a) ˆP(s , a, s)ˆr(s, a) 2S2A r ˆr 2 2 + 2AS P ˆP 2 Performative Reinforcement Learning Bound 3 : X es,a h(es) P(es, a, s) + P(s, a, es) ˆP(es, a, s) ˆP(s, a, es) es h(es)2 X P(es, a, s) + P(s, a, es) ˆP(es, a, s) ˆP(s, a, es) !2 P(es, a, s) + P(s, a, es) ˆP(es, a, s) ˆP(s, a, es) 2 2A h 2 2 P ˆP 2 Bound 4 : X P(es, a, s )P(s, a, es) ˆP(es, a, s ) ˆP(s, a, es) P(es, a, s )P(s, a, es) ˆP(es, a, s ) ˆP(s, a, es) es,a P(s, a, es) P(es, a, s ) ˆP(es, a, s ) + ˆP(es, a, s ) P(s, a, es) ˆP(s, a, es) es,a P(s, a, es) P(es, a, s ) ˆP(es, a, s ) + 2 h 2 2 X es,a ˆP(es, a, s ) P(s, a, es) ˆP(s, a, es) es,a P(s, a, es) X P(es, a, s ) ˆP(es, a, s ) 2 + 2 h 2 2 X es,a ˆP(es, a, s ) P(s, a, es) ˆP(s, a, es) 2 2 h 2 2 P ˆP 2 s,es,a P(s, a, es) + 2 h 2 2 X P(s, a, es) ˆP(s, a, es) 2 4SA h 2 2 P ˆP 2 Performative Reinforcement Learning We now substitute the four bounds above to complete the proof. P(h; M, d) P(h; c M, ˆd) 2 2 5(1 ηλ)2A d ˆd 2 2 + 5η2(1 ηλ)2A r ˆr 2 2 + 5γ2(1 ηλ)2 2S2A d ˆd 2 2 + 2AS (1 γ)2 + 5γ2η2 2S2A r ˆr 2 2 + 2AS P ˆP 2 + 10γ2 h 2 2 P ˆP 2 2 + 20γ4SA h 2 2 P ˆP 2 5A(1 ηλ)2(1 + 2γ2S2) d ˆd 2 2 + 5η2A (1 ηλ)2 + 2γ2S2 r ˆr 2 2 + 5γ2SA 2(1 ηλ)2 (1 γ)2 + 2η2 + 6γ2 h 2 2 Lemma 9. For any choice of MDP M and occupancy measure d, the L2-norm of the optimal solution to the dual objective (as defined in 18) is bounded by 1 + 2ηS (1 γ)2 + 2 |1 ηλ| S/ Proof. The objective function P is strongly-convex and has a unique solution. We set the derivative with respect to h to zero and get the following system of equations. a P(s, a, s) + γ2 X es,a P(es, a, s)2 a (P(s , a, s) + P(s, a, s )) + γ2 X es,a P(es, a, s )P(es, a, s) a ((1 ηλ)d(s, a) + ηr(s, a)) ρ(s) γ X s ,a P(s , a, s) ((1 ηλ)d(s, a) + ηr(s, a)) Let us now define matrix B RS S and a vector b RS with the following entries. B(s, s ) = A 2γ P a P(s, a, s) + γ2 P es,a P(es, a, s)2 if s = s γ P a(P(s , a, s) + P(s, a, s )) + γ2 P es,a P(es, a, s )P(es, a, s) o.w. a ((1 ηλ)d(s, a) + ηr(s, a)) ρ(s) γ X s ,a P(s , a, s) ((1 ηλ)d(s, a) + ηr(s, a)) Then the optimal solution is the solution to the system of equations Bx = b. We now provide a bound on the L2-norm of such a solution. For each action a, we define a matrix Ma RS S with entries Ma(s, s ) = P(s, a, s ). Then the matrix B can be expressed as follows. B = A Id γ X a (Ma + M a ) + γ2 X a M a Ma A(1 γ)2Id The last inequality uses lemma 5. This also implies that for γ < 1 the matrix B is invertible. Performative Reinforcement Learning We now bound the norm of the vector b. We will use the fact that for any state P s,a d(s, a) = 1/(1 γ). b 2 b 1 |1 ηλ| X s,a d(s, a) + η X s,a r(s, a) + X + γ |1 ηλ| X s,s ,a P(s , a, s)d(s, a) + ηγ X s,s ,a P(s , a, s) |r(s, a)| 1 γ + ηSA + A + γ |1 ηλ| X s,a P(s , a, s)2 s X s,a d(s, a)2 + ηγSA 1 γ + ηSA + A + γ |1 ηλ| d 2 X s,a P(s , a, s) + ηγSA 1 γ + ηSA + A + γ |1 ηλ| S A 1 γ + ηγSA A(1 + 2ηS) + 2 |1 ηλ| S The optimal solution to the dual objective is bounded by A 1b 2 b 2 λmin(A) 1 + 2ηS (1 γ)2 + 2 |1 ηλ| S/ B.3. Formal Statement and Proof of Convergence with Finite Samples (Theorem 3) Theorem 6. Suppose assumption 1 holds with λ 24S3/2(2ϵr+5Sϵp) (1 γ)4 , and assumption 2 holds with parameter B. For a given δ, and error probability p, if we repeatedly solve the optimization problem 10 with number of samples β4δ4(2ϵr + 5Sϵp)2 A) βδ(2ϵr + 5Sϵp) then we have dt d S 2 δ t (1 µ) 1 ln 2 δ(1 γ) with probability at least 1 p where µ = 24S3/2(2ϵr+5Sϵp) Proof. We will write d GD(dt) to denote the occupancy measure dt+1. 
(d GD(dt), ˆht+1) = arg max d arg min h ˆL(d, h; Mt) Let us also write d GD(d S) to denote the primal solution corresponding to the stable solution d S i.e. (d GD(d S), ˆh S) = arg max d arg min h ˆL(d, h; MS) Let us also recall the definition of the operator GD( ). Given occupancy measure dt, GD(dt) is the optimal solution to the optimization problem 5 when we use the exact model Mt. Because of strong-duality this implies there exists ht+1 so that (GD(dt), ht+1) = arg max d arg min h L(d, h; Mt) Since GD(d S) = d S there also exists h S so that (d S, h S) = arg max d arg min h L(d, h; MS) Performative Reinforcement Learning Because of lemma 4 we can assume the L2-norms of the dual solutions ht+1, ˆh S, and ˆh S are bounded by 3S (1 γ)2 . Since there exists a saddle point with bounded norm, we can just consider the restricted set H = n h : h 2 3S (1 γ)2 o .8 Moreover, by assumption 2 we know that GD(dt)(s, a)/dt(s, a) B for any (s, a). Therefore, we can apply lemma 10 with δ1 = p/2t2 and H = 3S/(1 γ)2 to get the following bound, ˆL(d, h; Mt) L(d, h; Mt) 18S1.5(B + A)ϵ (1 γ)3 (22) ϵ2 (ln(t/p) + ln(S/(1 γ)ϵ)) (23) h H and maxs,a d(s, a)/dt(s, a) B. Since the event (22) holds at time t with probability at least 1 p 2t2 , by a union bound over all time steps, this event holds with probability at least 1 p. Note that the objective L( , ht+1; Mt) is λ-strongly concave. Therefore, we have L(d GD(dt), ht+1; Mt) L(GD(dt), ht+1; Mt) λ GD(dt) d GD(dt) 2 Rearranging and using lemma 12 we get the following bound. GD(dt) d GD(dt) 2 v u u t2 L(GD(dt), ht+1; Mt) L(d GD(dt), ht+1; Mt) The proof of theorem 1 establishes that the operator GD( ) is a contraction. In particular, it shows that GD(dt) d S 2 β dt d S 2 for β = 12S3/2(2ϵr + 5ϵp) λ(1 γ)4 and λ > 12S3/2(1 γ) 4(2ϵr + 5Sϵp) This gives us the following recurrence relation on the iterates of the algorithm. dt+1 d S 2 = d GD(dt) d S 2 d GD(dt) GD(dt) 2 + GD(dt) d S 2 λ + β dt d S 2 We choose λ = 24S3/2(1 γ) 4(2ϵr + 5Sϵp) which ensures β < 1/2 and gives the following recurrence. dt+1 d S 2 2 p A)ϵ 2ϵr + 5Sϵp + β dt d S 2 βδ + β dt d S 2 (24) The last line requires the following bound on ϵ. ϵ β2δ2(2ϵr + 5Sϵp) Substituting the bound on ϵ in equation (23) the required number of samples at time-step t is given as follows. β4δ4(2ϵr + 5Sϵp)2 A) βδ(2ϵr + 5Sϵp) 8See lemma 3 of (Zhan et al., 2022) for a proof of this statement. Performative Reinforcement Learning In order to analyze the recurrence relation (24) we consider two cases. First, if dt d S 2 δ we have dt+1 d S 2 2β dt d S 2 Since β < 1/2 this is a contraction, and after ln( d0 d S 2)/ ln(1/2β) iterations we are guaranteed that dt d S 2 δ. Since d0 d S 2 2 1 γ , the required number of iterations for this event to occur is given by the following upper bound. ln( d0 d S 2) ln(1/2β) 2(1 2β) 1 ln 2 δ(1 γ) On the other hand, if dt d S δ, equation (24) gives us dt+1 d S 2 2βδ < δ [Since β < 1/2] Therefore, once dt d S 2 δ we are guaranteed that dt d S 2 δ for any t t. Lemma 10. Suppose m 1 ϵ2 (A ln(2/δ1) + ln(4H/ϵ) + 2A ln(ln(SABH/ϵ)/ϵ)). Then for any occupancy measure d satisfying maxs,a d(s, a)/dt(s, a) B and any h with h 2 H the following bound holds with probability at least 1 δ1. ˆL(d, h; Mt) L(d, h; Mt) 6H Proof. Note that the expected value of the objective above equals L(d, h; Mt). 
E h ˆL(d, h; Mt) i = λ 2 d 2 2 + X s h(s)ρ(s) + 1 i=1 E d(si, ai) dt(si, ai) (r(si, ai) h(si) + γh(s i)) 2 d 2 2 + X s h(s)ρ(s) + X s,a dt(s, a) d(s, a) r(s, a) h(s) + γ X s Pt(s |s, a)h(s ) = L(d, h; Mt) By the overlap assumption 2 and the assumption h 2 H the following bound holds for each i, 1 1 γ d(si, ai) dt(si, ai) (r(si, ai) h(si) + γh(s i)) 2BH Therefore we can apply the Chernoff-Hoeffding inequality and obtain the following inequality. ˆL(d, h; Mt) L(d, h; Mt) 2BH We now extend this bound to any occupancy measure d D and h in the set H = {h : h 2 H}. By lemma 5.2 of (Vershynin, 2010) we can choose an ϵ-net, Hϵ of the set H of size at most 1 + 2H ϵ S and for any point h H we are guaranteed to find h Hϵ so that h h 2. However, such an additive error bound is not sufficient for the set of occupancy measures because of the overlap assumption 2. So instead we choose a multiplicative ϵ-net as follows. For any (s, a) we consider the grid points dt(s, a), (1 + ϵ)dt(s, a), . . . , (1 + ϵ)pdt(s, a) for p = log(B/dt(s, a))/ log(1 + ϵ). Although dt(s, a) can be arbitrarily small, without loss of generality we can assume that dt(s, a) ϵ 4SABH . This is because from the expression of L(d, h; Mt) (9), it is clear that ignoring such small terms introduces an error of at most ϵ/4. Therefore we can choose p = 2 log(SABH/ϵ)/ log(1+ϵ). Taking a Cartesian product over the set of all state, action pairs we see that we can choose an ϵ-net Dϵ so that |Dϵ| 2 log(SABH/ϵ) log(1+ϵ) SA 2 log(SABH/ϵ) ϵ SA . Notice that we are guaranteed that for any d D there exists a d Dϵ such that d(s, a)/d (s, a) 1 + ϵ. Performative Reinforcement Learning By a union bound over the elements in Hϵ and Dϵ the following bound holds for any d Dϵ and h Hϵ with probability at least 1 δ1. ˆL(d, h; Mt) L(d, h; Mt) 2BH v u u t SA ln 2 δ1 ϵ + 2SA ln(ln(SABH/ϵ)/ϵ) m | {z } :=Tm We now extend the bound above for any d D and h H using lemma 11. There exists ed Dϵ so that maxs,a d(s, a)/ed(s, a) ϵ. Similarly there exists hϵ Hϵ so that h eh 2 ϵ. Let L0(d, h; Mt) = L(d, h; Mt) + λ 2 d 2 2 P s h(s)ρ(s). ˆL(d, h; Mt) L(d, h; Mt) ˆL0(d, h; Mt) ˆL0(ed,eh; Mt) + ˆL(ed,eh; Mt) L(ed,eh; Mt) + L0(ed,eh; Mt) L0(d, h; Mt) SAHϵ 1 γ + 4BH Therefore, if m 1 ϵ2 (A ln(2/δ1) + ln(4H/ϵ) + 2A ln(ln(SABH/ϵ)/ϵ)) then Tm Sϵ and we have the following bound. ˆL(d, h; Mt) L(d, h; Mt) 6H Lemma 11. Suppose we are guaranteed that d(s,a) e d(s,a) 1 + ϵ and h eh 2 ϵ, and h 2 , eh 2 H. Let L0(d, h; Mt) = L(d, h; Mt) + λ s h(s)ρ(s) and define ˆL0(d, h; Mt) analogously. Then the following inequalities hold. L0(d, h; Mt) L0(ed,eh; Mt) 6 SAHϵ 1 γ and ˆL0(d, h; Mt) ˆL0(ed,eh; Mt) 4BH Proof. First note that d ed 2 s,a d(s, a) ed(s, a) 2 P s,a d(s, a)2ϵ2 ϵ2 (1 γ)2 . L0(d, h; Mt) L0(ed,eh; Mt) X d(s, a) ed(s, a) s,a d(s, a)h(s) ed(s, a)eh(s) | {z } :=T1 s Pt(s, a, s )h(s ) ed(s, a) X s Pt(s, a, s )eh(s ) | {z } :=T2 We now bound the terms T1 and T2. s,a d(s, a) h(s) eh(s) + eh(s) d(s, a) ed(s, a) s,a d(s, a) + s X a (d(s, a) ed(s, a) Performative Reinforcement Learning s Pt(s, a, s )h(s ) ed(s, a) X s Pt(s, a, s )eh(s ) d(s, a) ed(s, a) X s Pt(s, a, s ) |h(s )| + X s,a ed(s, a) X s Pt(s, a, s ) h(s ) eh(s ) h 2 d ed 1 + h eh 1 s,a ed(s, a) Substituting the bounds on T1 and T2 we get the following bound on L(d, h; Mt) L(ed,eh; Mt) . L0(d, h; Mt) L0(ed,eh; Mt) We now consider bounding the difference ˆL0(d, h; Mt) ˆL0(ed,eh; Mt) . 
ˆL0(d, h; Mt) ˆL0(ed,eh; Mt) d(si, ai) dt(si, ai)(r(si, ai) h(si) + γh(s i)) ed(si, ai) dt(si, ai)(r(si, ai) eh(si) + γeh(s i)) | {z } :=T3 We now bound the term T3. T3 = 1 m(1 γ) d(si, ai) dt(si, ai)(r(si, ai) h(si) + γh(s i)) ed(si, ai) dt(si, ai)(r(si, ai) eh(si) + γeh(s i)) i=1 B h(si) eh(si) + γ h(s i) eh(s i) d(si, ai) ed(si, ai) r(si, ai) eh(si) + γeh(s i) h eh 1 + ϵB 1 γ (1 + 2 h 1) Sϵ 1 γ + 3BH Lemma 12. Let (d , h ) arg maxd arg minh L(d, h; M) and ( ˆd, ˆh) arg maxd arg minh L(d, h; c M). Moreover, suppose L(d, h; M) L(d, h; c M) ϵ for all d and h with h 2 3S/(1 γ)2. Then we have L(d , h ) L( ˆd, h ) 2ϵ Proof. We will drop the dependence on the underlying model and write L(d, h; M) as L(d, h) and L(d, h; c M) as ˆL(d, h). L(d , h ) L( ˆd, h ) = L(d , h ) L(d , ˆh(d ) | {z } :=T1 + L(d , ˆh(d )) ˆL(d , ˆh(d )) | {z } :=T2 + ˆL(d , ˆh(d )) ˆL( ˆd, ˆh) | {z } :=T3 + ˆL( ˆd, ˆh) ˆL( ˆd, h ) | {z } :=T4 + ˆL( ˆd, h ) L( ˆd, h ) | {z } :=T5 Performative Reinforcement Learning Here we write ˆh(ed) = arg minh ˆL(ed, h) i.e. the dual solution that minimizes the objective ˆL(ed, ). Since h minimizes L(d , ), by lemma 4 we have h 2 3S/(1 γ)2. By a similar argument ˆh 2 3S/(1 γ)2. Therefore, both T2 and T5 are at most ϵ. Given d , h minimizes L(d , ). Therefore, the term T1 is at most zero. By a similar argument the term T4 is at most zero. Now for the term T3, notice that ˆh = ˆh( ˆd) and it also minimizes the objective ˆL(d, ˆh(d)). Therefore, the term T3 is also at most zero. B.4. Proof of Proposition 1 Proof. Let D be the following set D = n d RS A : d(s, a) 0 s, a and P s,a d(s, a) = 1 1 γ o . We define a set-valued function ϕ : D 2D as follows. ϕ(d) = arg max e d 0 s,a ed(s, a)rd(s, a) a ed(s, a) = ρ(s) + γ X s ,a ed(s , a)Pd(s , a, s) s (26) Any fixed point of ϕ( ) corresponds to a stable point. First, note that ϕ(d) is non-empty as one can always choose ed to be the occupancy measure associated with any arbitrary policy π in an MDP with probability transition function Pd. Now suppose d1, d2 S(d). Then for any ρ [0, 1] it is easy to show that ρd1 + (1 ρ)d2 ϕ(d). This is because all the constraints are linear, so ρd1 + (1 ρ)d2 is feasible. Moreover, the objective is linear, so ρd1 + (1 ρ)d2 also attains the same objective value. We now show that the function ϕ is upper hemicontinuous. Let L be the Lagrangian of the optimization problem (26). L(ed, h; Md) = X s,a ed(s, a)rd(s, a) + X a ed(s, a) ρ(s) γ X s ,a ed(s , a)Pd(s , a, s) Note that the Lagrangian is continuous (in fact linear) in ed, and continuous in d (from the assumption of (ϵr, ϵp)-sensitivity). Finally, observe the alternative definition of the function ϕ. ϕ(d) = arg max e d D min h L(ed, h; Md) Since the minimum of continuous functions is also continuous, and the set D is compact, we can apply Berge s maximum theorem to conclude that the function ϕ( ) is upper hemicontinuous. Now an application of Kakutani fixed point theorem (Glicksberg, 1952) shows that ϕ has a fixed point. B.5. Assumptions Regarding Quadratic Regularizer Throughout we performed repeated optimization with quadratic regularization. Our proof techniques can be easily generalized if we consider a strongly convex regularizer R(d). Suppose, at time t we solve the following optimization problem. s,a ed(s, a)rt(s, a) R(ed) a ed(s, a) = ρ(s) + γ X s ,a ed(s , a)Pt(s , a, s) s (27) Since R is strongly convex, (R ) 1 exists and we can use this result to show that the dual of the (27) is strongly convex. 
In fact, as in the proof of theorem 1 we can write down the lagrangian L(d, h) and at an optimal solution we must have Performative Reinforcement Learning d L(d, h) = 0. This gives the following expression. d(s, a) = (R ) 1 rt(s, a) h(s) + γ X es h(es)Pt(s, a, es) We can use the result above to show that the dual is strongly convex and the optimal dual solutions form a contraction. Then we can translate this guarantee back to the primal using (28). B.6. Omitted Proofs from Subsection 3.3 We will write dλ P O to write the performatively optimal solution when using the regularization parameter λ i.e. dλ P O arg max d 0 s,a d(s, a)rλ P O(s, a) λ a d(s, a) = ρ(s) + γ X s ,a d(s , a)P λ P O(s , a, s) s (29) Here we write rλ P O = R(dλ P O) and P λ P O = P(dλ P O) to denote the reward function and probability transition function in response to the optimal occupancy measure dλ P O. We will also write dλ S to denote the performatively stable solution and rλ S (resp. P λ S ) to denote the corresponding reward (resp. probability transition) function. The next lemma bounds the distance between dλ S and dλ P O. B.6.1. PROOF OF THEOREM 4 Proof. Suppose repeatedly maximizing the regularized objective converges to a stable solution dλ S i.e. X s,a rdλ S(s, a)dλ S(s, a) λ 2 max d C(dλ S) s,a rdλ S(s, a)d(s, a) λ Therefore, X s,a rdλ S(s, a)dλ S(s, a) max d C(dλ S) s,a rdλ S(s, a)d(s, a) λ max d C(dλ S) s,a rdλ S(s, a)d(s, a) λ 2(1 γ)2 The last inequality uses d 2 2 = P s,a d(s, a)2 = (1 γ) 2 P s,a ((1 γ)d(s, a))2 (1 γ) 2 P s,a(1 γ)d(s, a) = (1 γ) 2. Now we substitute λ = 12S3/2(2ϵr+5Sϵp) (1 γ)4 from theorem 1 and get the following bound. s,a rdλ S(s, a)dλ S(s, a) max d C(dλ S) s,a rdλ S(s, a)d(s, a) 6S3/2(2ϵr + 5Sϵp) B.6.2. FORMAL STATEMENT AND PROOF OF THEOREM 5 Theorem 7. Let d P O be the performatively optimal solution with respect to the original (unregularized) objective. Then there exists a choice of regularization parameter (λ) such that repeatedly optimizing objective (12) converges to a policy (dλ S) with the following guarantee X s,a rdλ S(s, a)dλ S(s, a) X s,a rd P O(s, a)d P O(s, a) S)ϵr + γSϵp (1 γ)2 2/3 , ϵr (1 γ)2 + ϵp S (1 γ)4 Performative Reinforcement Learning Proof. Let us write hλ P O to denote the dual optimal solution i.e. hλ P O arg min h Ld(h; M λ P O) Moreover, let hλ S be the dual optimal solution corresponding to the stable solution dλ S. X s,a rd P O(s, a)d P O(s, a) X s,a rdλ S(s, a)dλ S(s, a) s,a rd P O(s, a)d P O(s, a) λ 2 d P O 2 2 2 d P O 2 2 s,a rdλ S(s, a)dλ S(s, a) λ s,a rdλ P O(s, a)dλ P O(s, a) λ 2 d P O 2 2 s,a rdλ S(s, a)dλ S(s, a) λ Ld(hλ P O; M λ P O) Ld(hλ S; M λ S) + λ 2 d P O 2 2 The first inequality uses the fact that dλ P O is the performatively optimal solution with regularization parameter λ. The second inequality uses strong duality and expresses the objective in terms of optimal dual variables. We now bound the difference Ld(hλ P O; M λ P O) Ld(hλ S; M λ S). Ld(hλ P O; M λ P O) Ld(hλ S; M λ S) = Ld(hλ P O; M λ P O) Ld(hλ S; M λ P O) + Ld(hλ S; M λ P O) Ld(hλ S; M λ S) Ld(hλ S; M λ P O) Ld(hλ S; M λ S) [Since hλ P O minimizes Ld( ; M λ P O)] S + hλ S 2)ϵp dλ S dλ P O 2 [By inequality 32 ] S3A λ2(1 γ)6 S + 3S (1 γ)2 2 [By lemma 13] The term d P O 2 2 can be bounded as P s,a d2 P O(s, a) = (1 γ) 2 P s,a(d P O(s, a)(1 γ))2 (1 γ) 2 P s,a d P O(s, a)(1 γ) = (1 γ) 2. This gives us the following bound. 
s,a rd P O(s, a)d P O(s, a) X s,a rdλ S(s, a)dλ S(s, a) S3A λ2(1 γ)6 S + 3S (1 γ)2 2 + λ 2(1 γ)2 λ2 S3A (1 γ)6 S + 3S (1 γ)2 | {z } :=T1 +λ 1 2(1 γ)2 | {z } :=T2 Note that in order to apply lemma 13 we need λ λ0 = 2 2ϵr + 9ϵp S(1 γ) 2 . So we consider two cases. First if (2T1/T2) 1/3 > λ0. In that case, we can use λ = (2T1/T2) 1/3 and get the following upper bound. X s,a rd P O(s, a)d P O(s, a) X s,a rdλ S(s, a)dλ S(s, a) O T 1/3 1 T 2/3 2 S + 3S (1 γ)2 Performative Reinforcement Learning On the other hand, if (2T1/T2) 1/3 λ0 then we can substitute λ = λ0 and get the following bound. s,a rd P O(s, a)d P O(s, a) X s,a rdλ S(s, a)dλ S(s, a) 1 λ2 0 T1 + λ0T2 (T 1/3 1 T 1/3 2 )2 + λ0T2 = T 1/3 1 T 2/3 2 + λ0T2 2λ0T2 = O ϵr (1 γ)2 + ϵp S (1 γ)4 Lemma 13. Suppose λ 2 2ϵr + 9ϵp S (1 γ)2 . Then we have dλ S dλ P O 2 O S + ϵp γS (1 γ)2 Proof. Let us write hλ P O to denote the dual optimal solution i.e. hλ P O arg min h Ld(h; M λ P O) Moreover, let hλ S be the dual optimal solution corresponding to the stable solution dλ S. Since the dual Ld( ; M λ P O) objective is strongly convex (lemma 2) and hλ P O is the corresponding optimal solution we have, Ld(hλ S; M λ P O) Ld(hλ P O; M λ P O) A(1 γ)2 hλ S hλ P O 2 From lemma (1) we get the following bound. 1 2ϵr + 3ϵp hλ S 2 λ ! dλ S dλ P O 2 3 hλ S hλ P O 2 Substituting the above bound in eq. (30) and using lemma 4 we get the following inequality for any λ > 2ϵr +9ϵr S/(1 γ)2. Ld(hλ S; M λ P O) Ld(hλ P O; M λ P O) (1 γ)2 1 2ϵr + 3ϵp hλ S 2 λ !2 dλ S dλ P O 2 We now upper bound Ld(hλ S; M λ P O) Ld(hλ S; M λ S). Using lemma 14 we get the following bound. Ld(hλ S; M λ P O) Ld(hλ S; M λ S) S) rλ S rλ P O 2 + γ(2 S + hλ S 2) P λ S P λ P O 2 S + hλ S 2)ϵp dλ S dλ P O 2 (32) Note that the following sequence of inequalities hold. Ld(hλ S; M λ P O) Ld(hλ P O; M λ P O) Ld(hλ S; M λ S) The first inequality is true because hλ P O minimizes Ld( ; M λ P O). The second inequality holds because the primal objective at performative optimal solution (dλ P O) upper bound the primal objective at performatively stable solution (dλ S) and by strong duality the primal objectives are equal to the corresponding dual objectives. Therefore we must have Ld(hλ S; M λ P O) Performative Reinforcement Learning Ld(hλ S; M λ S) Ld(hλ S; M λ P O) Ld(hλ P O; M λ P O), and by using equations (32) and (31) we get the following bound on dλ S dλ P O 2. dλ S dλ P O 2 S + hλ S 2)ϵp 18S (1 γ)2 1 2ϵr + 3ϵp hλ S 2 λ The last line uses lemma 4 and λ 2(2ϵr + 9ϵp S/(1 γ)2) Ld(h; M) Ld(h; c M) h 2 S) r ˆr 2 + γ(2 S + h 2) P b P 2 Proof. 
From the definition of the dual objective 12 we have, Ld(h; M) Ld(h; c M) 1 s,a h(s)(r(s, a) ˆr(s, a)) | {z } :=T1 s,s ,a h(s) r(s , a)P(s , a, s) ˆr(s , a) ˆP(s , a, s) | {z } :=T2 es h(es) P(s, a, es) ˆP(s, a, es) | {z } :=T3 es,ˆs h(es)h(ˆs) P(s, a, ˆs)P(s, a, es) b P(s, a, ˆs) b P(s, a, es) | {z } :=T4 a (r(s, a) ˆr(s, a)) s,s ,a h(s)P(s , a, s) (r(s , a) ˆr(s , a)) s,s ,a h(s)ˆr(s , a) P(s , a, s) ˆP(s , a, s) s ,a P(s , a, s) (r(s , a) ˆr(s , a)) P(s , a, s) ˆP(s , a, s) s ,a P(s , a, s) (r(s , a) ˆr(s , a))2 + h 2 P(s , a, s) ˆP(s , a, s) 2 SA r ˆr 2 + h 2 Performative Reinforcement Learning es,a h(es) P(s, a, es) b P(s, a, es) P(s, a, es) b P(s, a, es) !2 es,ˆs h(es)h(ˆs)P(s, a, ˆs) P(s, a, es) b P(s, a, es) es,ˆs h(es)h(ˆs) b P(s, a, ˆs) P(s, a, ˆs) b P(s, a, ˆs) es h(es) P(s, a, es) b P(s, a, es) ˆs h(ˆs)P(s, a, ˆs) h 2 ˆs P(s, a, ˆs) = h 2] P(s, a, es) b P(s, a, es) Substituting the upper bounds on T1, T2, T3, and T4 into the upper bound on Ld(h; M) Ld(h; c M) gives the desired bound.
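As a closing illustration of the repeated retraining analysed in this appendix, the following self-contained Python sketch (our illustration under simplifying assumptions, not the authors' implementation) runs repeated optimization of the regularized objective on a small random MDP. Each round solves the quadratically regularized problem through its dual linear system, in the spirit of the derivation in Section B.1 and Lemma 4, and then lets the reward respond to the deployed occupancy measure through a hypothetical linear, ε_r-sensitive map (the transition kernel is kept fixed, i.e. ε_p = 0). The non-negativity constraints on d are assumed inactive at the optimum, as in the dual derivation above; `solve_regularized_rl` and `respond` are names we introduce for illustration.

```python
import numpy as np

def solve_regularized_rl(r, P, rho, gamma, lam):
    """Solve max_d d.r - (lam/2)||d||_2^2 subject to the occupancy-measure
    flow constraints, via the dual linear system (cf. Section B.1 and Lemma 4).
    Assumes the dropped non-negativity constraints are inactive at the optimum."""
    S, A = r.shape
    I = np.eye(S)
    H = np.zeros((S, S))   # sum_a (I - gamma*M_a)^T (I - gamma*M_a), with M_a(s, s') = P(s, a, s')
    g = -lam * rho         # right-hand side: sum_a (I - gamma*M_a)^T r_a - lam*rho
    for a in range(A):
        K = I - gamma * P[:, a, :]
        H += K.T @ K
        g += K.T @ r[:, a]
    h = np.linalg.solve(H, g)  # unique dual optimum (H is positive definite for gamma < 1)
    # Primal-dual relation: d(s, a) = (r(s, a) - h(s) + gamma * sum_s' P(s, a, s') h(s')) / lam
    d = (r - h[:, None] + gamma * np.einsum("sap,p->sa", P, h)) / lam
    return d

def respond(d, r0, eps_r):
    # Hypothetical eps_r-sensitive reward response to the deployed occupancy measure.
    return r0 - eps_r * d

rng = np.random.default_rng(0)
S, A, gamma, lam, eps_r = 5, 3, 0.9, 5.0, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))   # fixed transition kernel of shape (S, A, S)
rho = np.full(S, 1.0 / S)                    # uniform initial state distribution
r0 = rng.uniform(size=(S, A))

d = solve_regularized_rl(r0, P, rho, gamma, lam)
for t in range(25):
    d_next = solve_regularized_rl(respond(d, r0, eps_r), P, rho, gamma, lam)
    print(t, np.linalg.norm(d_next - d))     # distance between successive iterates
    d = d_next
```

With λ sufficiently large relative to ε_r, the successive iterates contract and the printed distances decay geometrically, which is the qualitative behaviour that Theorem 1 makes quantitative.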