# bilevel_optimization_with_coupled_decisiondependent_distributions__c9fc20a9.pdf

Bilevel Optimization with Coupled Decision-Dependent Distributions

Songtao Lu 1

Bilevel optimization has gained signiﬁcant popularity in recent years due to its ability to formulate various machine learning problems. For instance, in meta-learning, the upper-level (UL) problem offers a good initialization for the lower-level (LL) model to facilitate adaptation. However, the decision variables can impact data features and outcomes, leading to the phenomenon known as performativity. In this work, we investigate the inclusion of decisiondependent distributions in bilevel optimization. Speciﬁcally, we consider the scenarios where the UL data distribution depends on the LL optimization variable, and the LL data distribution also depends on the UL decision variable. We ﬁrst establish sufﬁcient conditions for the existence of performatively stable (PS) solutions in this class of bilevel problems. Also, we propose efﬁcient stochastic algorithms to ﬁnd the PS point with theoretical convergence rate analysis and discuss the theoretical optimality of the obtained solution. Our theoretical analysis is corroborated through a series of numerical experiments, wherein we evaluate the performance of the bilevel performative prediction algorithms alongside non-performative counterparts in the context of meta strategic learning problems.

1. Introduction

In this work, we consider the following class of bilevel performative prediction (PP) problems, where the data distribution at each level is dependent on the decision

1IBM Research, Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA. Correspondence to: Songtao Lu <songtao@ibm.com>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

variable from the other level:

P : min x E Z Dx(y (x))f(x, y (x); Z) (1a)

s.t. y (x) = arg min y E Z Dy(x)ℓ(x, y; Z) (1b)

where x, y respectively denote the decision/optimization variables at the upper-level (UL) and lower-level (LL) of this problem, loss functions f(x, y (x); Z) and ℓ(x, y; Z) are smooth, y (x) represents the optimal solution of the LL problem given x, Dy(x) stands for the LL data distribution depended on the UL variable x, and similarly Dx(y (x)) for the UL data distribution that is dependent on the LL loss function through the optimal LL decision variable y (x).

Motivation of This Work. One of the most prominent machine learning models that can be formulated using bilevel optimization is model-agnostic meta-learning (MAML) (Finn et al., 2017; Rajeswaran et al., 2019). MAML has been successful in addressing the challenge of data distribution shifts between seen and unseen tasks, particularly in few-shot learning scenarios. In MAML, the meta-learner at the UL explores invariant features such as good initializations that can be utilized to handle unseen tasks, while the individual learners at the LL focus on ﬁtting personalized data. However, in many real-world problems that fall under the umbrella of PP, such as election forecasts, ﬁnancial markets, online advertising, and trafﬁc predictions, the decision variables can heavily inﬂuence the data distribution (Perdomo et al., 2020). Hence, it becomes crucial to incorporate decision-dependent distributions into the framework of meta-learning, as both levels of the optimization problem are tightly intertwined through their mutual interaction. This motivates the inclusion of crosslevel dependence between decision variables and data distributions in bilevel optimization.

Major Challenges in Bilevel PP. In contrast to singlelevel PP problems, solving bilevel optimization problems involves generating two sequences, each used for minimizing the loss function at its respective level. Due to the interdependence between decision variables and data distributions across levels, it is unclear whether changes in the data-generating distribution occur in a competitive or collaborative manner during the play of this performative Stackelberg game. Consequently, demonstrating the convergence rate of the bilevel stochastic algorithm is

Bilevel Optimization with Coupled Decision-Dependent Distributions

highly challenging. It necessitates the construction of a Lyapunov function that accounts for the data distribution s dependence and satisﬁes a descent or contraction property. In addition, the theoretical analysis requires decoupling the stochastic error terms between the two levels.

Main Contributions of This Work. In this work, we provide the formal deﬁnitions of performatively optimal (PO) and performatively stable (PS) points in bilevel optimization and establish the conditions for the existence of the bilevel performatively stable (BPS) points using the bilevel repeated risk minimization (Bi-RRM) method. Building upon the existence of the BPS points, we propose an efﬁcient bilevel stochastic gradient descent (Bi-SGD) method and derive sufﬁcient conditions for Bi-SGD to obtain the BPS point. Our theoretical analysis reveals that Bi-SGD achieves comparable iteration and sample complexities to traditional (bilevel) stochastic gradient descent (SGD) for ﬁnding optimal solutions in nonperformative strongly convex (bilevel) problems. To validate our theoretical results and demonstrate the stability of Bi-SGD under data distribution shifts, we conduct experiments on both synthetic and real data sets. The results underscore the efﬁcacy of Bi-RRM and Bi-SGD. The main contributions of this work are highlighted as follows.

Bilevel performative prediction. To the best of our knowledge, our proposed bilevel PP model is the ﬁrst to incorporate decision-dependent data distributions into the learning process. This unique approach distinguishes our work from existing models in the ﬁeld. Theoretical analysis. Our theoretical results provide novel insights into the existence of BPS points in the bilevel PP model. We demonstrate, for the ﬁrst time, that when the sensitivity parameters of the data distribution at both levels fall below a certain threshold, a unique BPS point exists. Additionally, we prove that the bilevel SGD algorithm is capable of achieving the PS point at a rate of O(1/T), which is on par with the standard non-performative learning setting. Here, T represents the total number of iterations. Applications to meta strategic learning. We present the results of applying the bilevel PP model to meta strategic learning and showcase the importance of incorporating the decision-dependent distribution in the algorithm design in terms of improving testing accuracy.

2. Background and Related Work

Performative Prediction (PP). There is a signiﬁcant body of research focusing on various aspects of PP, including model and algorithm design, convergence analysis, and quantiﬁcation of generalization performance. The concept of PP was ﬁrst introduced in (Perdomo et al., 2020), which

addresses the strategic feedback effect in single-level risk minimization problems. The work also proposes measures to capture performative optimality and stability in this setting, where the data distribution is induced by the predictive model. It is shown that under certain conditions on the loss function (such as smoothness and strong convexity), the PS point can be achieved by utilizing repeated risk minimization (RRM) or gradient descent (GD) on the performative risk when the changes in the data distribution are not signiﬁcant or controllable in terms of the Wasserstein-1 distance. Based on this work, more efﬁcient SGD methods have been proposed in (Mendler-Dünner et al., 2020) to ﬁnd the PS points by considering the trade-off between the frequency of model deployment and the stochastic update of the model. In (Drusvyatskiy & Xiao, 2022), several families of classical stochastic optimization algorithms, such as clipped gradient and proximal point, have been studied for PP with rigorous convergence rate analysis. These works highlight the main challenge in proving algorithm convergence, which lies in quantifying the bias introduced by distribution shifts after model deployment. Also, it is worth noting that the convergence results in these seminal works rely on the strong convexity assumption of the loss functions. Recent works have attempted to relax this assumption to the convex case (Miller et al., 2021) or the weakly convex case (Zhao, 2022), but they require additional assumptions, such as the mixture dominance condition or other notions of Lipschitz distributions (Mofakhami et al., 2022), to ensure the wellbehaved nature of the distribution map. Given the dynamic nature of online data sequences, PP naturally extends to model richer classes of online learning problems, enabling the advancement of existing algorithms to adapt to changing environments and facilitating the design of new algorithms to address distribution shifts. For instance, online projected gradient descent has been applied to optimize the charging of a ﬂeet of electric vehicles, where the time-varying costs depend on random variables with decision-dependent distributions (Wood et al., 2021). The proximal stochastic gradient method has been employed to tackle online tracking problems, where the dynamics are jointly dependent on both time and decision variables, and its non-asymptotic convergence behavior under the time drift is analyzed in (Cutler et al., 2021). Furthermore, performative feedback has been utilized in algorithm design to construct conﬁdence bounds on the risk of unexplored models and guide exploration (Jagadeesan et al., 2022). If the exact gradient of the temporal drift function over time is accessible, the underlying changes in dynamics can be approximated in an online fashion. Based on this observation, the predictorcorrector method is proposed for time-varying stochastic optimization (Maity et al., 2022).

Bilevel Optimization with Coupled Decision-Dependent Distributions

Table 1. Comparison of the existing representative works that are related to PP and bilevel optimization, where Opt.: optimization, Cond.: the existence condition of the PS points, Gap: the maximum distance between the PS and PO points, Deploy: the single-loop/double-loop structure of deploying the algorithm/method.

Algorithms Opt. Framework Cond. Gap Deploy Rate

RRM and GD (Perdomo et al., 2020) single-level O(ε) O(ε) single O(log(1/T)) SGD (Mendler-Dünner et al., 2020) single-level O(ε) O(ε) single O(1/T) Multi-Pf D (Li et al., 2022b) single-level+consensus O(εavg) n/a single O(1/T) BSA (Ghadimi & Wang, 2018) bilevel n/a n/a double O(1/T) STSA (Shen & Chen, 2022) bilevel n/a n/a single O(1/T) Bi-RRM (This work) bilevel O(εxεy + εy) O(εx(1+εy)) double O(log(1/T)) Bi-SGD (This work) bilevel O(εx+εxεy +εy) O(εx(1+εy)) single O(1/T)

In addition to the applications of PP in online learning, there is another line of research that focuses on PP for performative reinforcement learning, where the policy affects both the underlying reward and the transition kernel of the Markov chain (Mandal et al., 2022). Subsequently, the convergence analysis of state-dependent PP algorithms has been investigated under the Markov transition model (Li & Wai, 2022; Roy et al., 2022; Brown et al., 2022) as well as general stateful performative dynamics (Izzo et al., 2022). When prior knowledge about the decisiondependent distribution is available, more informative PP algorithms can be designed using performative gradient descent, which can improve the performance of classic gradient-based methods under distribution shifts (Izzo et al., 2021). Apart from the optimization perspective, recent research has also focused on identifying the causal effect of predictions (Mendler-Dünner et al., 2022), exploring the performative power by measuring the relationship between the decision maker and the population (Hardt et al., 2022), and analyzing outcome performativity with an emphasis on the performative effects of decisions on the conditional distribution rather than the traditional overall distribution (Kim & Perdomo, 2022).

Two (multiple) Players Game. The decision-making process and strategic response in PP can be also viewed as a two-player game, where the data reacts based on the decision variables to maximize a utility function. PP assumes that this interaction occurs simultaneously. In (Zrnic et al., 2021), the order of play in strategic classiﬁcation between the decision maker and strategic agents is further examined, revealing that the update frequency determines the roles of the leader and follower. Furthermore, algorithm design for PP in zero-sum games has been developed, along with corresponding convergence analysis (Wood & Dall Anese, 2022; Maheshwari et al., 2022). In a multi-agent setting, decision makers can work in either a competitive or collaborative manner. In (Narang et al., 2022), a competitive multi-player performative prediction setting is introduced, where each local player aims to minimize their performative risk under the joint action space, considering that the local data distribution is inﬂuenced by the decisions

of all players. It is shown that using SGD at each player can effectively ﬁnd the Nash equilibrium of this problem. In (Li et al., 2022b), a consensus-based multiagent performative prediction (Multi-Pf D) is considered, where all agents connected through a communication network seek a global PS solution by interacting with local strategic data. Additionally, it is demonstrated that the existence condition regarding the ε-sensitivity parameter can be relaxed from ε (in the single-agent setting) to εavg, where εavg represents the average sensitivity parameter across the entire network.

Bilevel Optimization and Meta-Learning. Model transferability is a topic of signiﬁcant interest in decisiondependent learning systems (Liu et al., 2021b). MAML has emerged as a powerful tool for enhancing the generalization performance of trained models when faced with new tasks in both supervised learning and reinforcement learning settings (Liu et al., 2019; Rajeswaran et al., 2019). The convergence rate of MAML and regret analysis have been characterized in previous studies (Balcan et al., 2019; Fallah et al., 2020), along with investigations into generalization errors (Chua et al., 2021; Chen et al., 2022). As mentioned earlier, MAML can be seen as a special case of bilevel optimization (Franceschi et al., 2018) or a Stackelberg game (Fiez et al., 2020). Also, bilevel optimization can be applied to various other machine learning problems, including multi-task AUC maximization (Hu et al., 2022), meta causal discovery (Lu & Gao, 2023), data hyper-cleaning (Franceschi et al., 2018), etc.

Consequently, solving the bilevel optimization problem becomes crucial. One of the earliest approaches for deterministic convex bilevel problems is the bilevel gradient sequential averaging method (Sabach & Shtern, 2017), while for stochastic bilevel problems, the bilevel stochastic approximation (BSA) method is commonly adopted (Ghadimi & Wang, 2018). BSA solves the LL problem by iteratively updating its variables in an inner loop and then switches to optimize the UL variables. The theoretical convergence rates of BSA have been provided in (Ghadimi & Wang, 2018) for strongly convex, convex, and nonconvex UL loss functions when the LL

Bilevel Optimization with Coupled Decision-Dependent Distributions

loss function is strongly convex. Building upon this, subsequent research has focused on improving the iteration or sample complexity of BSA. Examples include the development of single-time scale or multi-sequence singletimescale (STSA) bilevel algorithms (Chen et al., 2021; Shen & Chen, 2022).

When the LL objective function is strongly convex, it is possible to derive a closed form of the UL gradient. However, computing this hyper-gradient requires the inversion of the LL Hessian matrix. To enhance computation efﬁciency, multiple efﬁcient bilevel algorithms and techniques have been proposed, including approximate implicit differentiation (Ji et al., 2021), adaptive stochastic algorithms (Huang & Huang, 2021), and an adaptation of the well-known SAGA algorithm (Dagréou et al., 2022; Li et al., 2022a). Besides, generalized bilevel optimization solvers have been developed by relaxing the strong convexity assumption of the LL loss function (Ye et al., 2022; Liu et al., 2022). A comprehensive survey paper (Liu et al., 2021a) provides an overview of these research areas, and Table 1 presents a summary of the representative works related to both PP and bilevel optimization.

3. Bilevel Performative Optimality

In this work, we consider the scenario where the data distribution over the features and outcomes at each level of the problem depends on the decision variables at the other level. Consequently, the performance evaluation of this type of two-player performative model is based on the expected loss over the distributions induced at both levels. To be more precise, in contrast to existing singlelevel PP models (Perdomo et al., 2020), our study focuses on a bilevel model that aims to ﬁnd the following optimal points through a performative Stackelberg game. Deﬁnition 1. (Bilevel performative optimality and risk). A point x O is bilevel performatively optimal (BPO) if it satisﬁes

x O = arg min x F(x) (2)

where F(x) E Z D(y (x))[f(x, y (x); Z)] is deﬁned as

the UL performative risk, y (x) is deﬁned in (1b), and EZ Dy(x)ℓ(x, y; Z) is deﬁned as the LL performative risk. PS is another well-established notion of PP, which refers to a point that attains the global optimal solution of the optimization problem considering the data distributions induced by that point itself. A formal deﬁnition of the bilevel PS point can be expressed as follows. Deﬁnition 2. (Bilevel performative stability and decoupled risk). A point x S is bilevel performatively stable (BPS) if it satisﬁes

x S arg min x E Z D(y (x,x S))[f(x, y (x, x S); Z)] (3a)

s.t. y (x, x S) = arg min y E Z D(x S)[ℓ(x, y; Z)]. (3b)

Also, let DR(x, x ) EZ D(y (x,x ))[f(x, y (x, x ); Z)] be the decoupled bilevel performative risk.

It is clear that x S = arg minx DR(x, x S) and x O = arg minx DR(x, x). The concept of performative stability revolves around the idea of a ﬁxed point in risk minimization, wherein the learned model minimizes the risk on the data distribution that arises from its own deployment (Perdomo et al., 2020). This property validates the optimality of the closed-loop training strategy and serves as motivation for incorporating PP in the metalearning setting. In this case, the objective of both levels of learning is to identify the permutation invariant space after the model has been deployed.

It is evident that the decision-dependent distribution plays a crucial role in bridging the gap between model deployment and model parameter optimization. Similar to the concept of Lipschitz continuity used in the ﬁeld of optimization to quantify function changes, the literature (Perdomo et al., 2020) has introduced the notion of ε-sensitivity. This measure is employed to assess the variations in decisiondependent distributions caused by changes in the decision variables. Deﬁnition 3. (ε-sensitive) A distribution map D( ) is εsensitive if all x, x :

W1 (D(x), D(x )) ε x x 2 (4)

where W1 denotes the Wasserstein-1 distance (aka the earth mover s distance) between two distributions.

Given the deﬁnitions of BPO, BPS, and ε-sensitivity, it is still not clear whether the BPO or BPS exists or not. Note that the decision variables compete to minimize their objective functions through the UL and LL optimization processes, along with the decision-dependent distributions. This coupling can lead to oscillations in the iterates generated by the optimization algorithms, making the convergence analysis challenging, as mentioned earlier. To address this, we ﬁrst introduce Bi-RRM, which demonstrates the existence of the BPS point and provides insights into the iteration complexity of ﬁnding this point.

4. Existence of BPS Point In this section, we will propose the Bi-RRM method for solving this bilevel PP problem (1).

4.1. Bilevel Repeated Risk Minimization (Bi-RRM) The main idea of RRM involves a retraining procedure outlined as follows. It begins by solving the bilevel optimization problem using the data distribution induced by the decision variable xr. Next, it updates the state to xr+1 using the obtained solution, and this process is repeated iteratively, with r denoting the index of the iterations. Mathematically, Bi-RRM performs the

Bilevel Optimization with Coupled Decision-Dependent Distributions

following recursive update:

xr+1 = R(xr)

arg min φ E Z Dx(y (φ,xr)) [f(φ, y (φ, xr); Z)], (5a)

s.t. y (φ, xr) = arg min y E Z Dy(xr)[ℓ(φ, y; Z)] (5b)

where R( ) represents the one-step update of the RRM algorithm. The Bi-RRM assumes the existence of an oracle algorithm that can obtain the global optimal solution of the bilevel optimization problem with respect to the variable φ. This optimization problem takes into account both the UL and LL data distributions induced by the decision variable x at iteration r.

4.2. Theoretical Assumptions Before showing the theoretical convergence result of Bi RRM, we make the following blanket assumption for problem (1). Without loss of generality, we assume that decisiondependent distributions at both UL and LL are ε-sensitive but with different constants. Assumption 1. Assume that the distribution maps are εxand εy-sensitive at each level, namely, W1 (Dx(x), Dx(x )) εx x x 2 and W1 (Dy(y), Dy(y )) εy y y 2. Next, we assume the strong convexity and smoothness of the loss functions as follows. Assumption 2. (Strong convexity) Assume that loss functions F(x) and ℓ(x, y) are respectively γxand γystrongly convex, namely, F(x) F(x ) + x F(x ) T(x x ) + γx

ℓ(x, y) ℓ(x, y )+ yℓ(x, y ) T(y y )+ γy

Assumption 3. (Smoothness of the UL loss function and distribution) Assume that the loss function f(x, y; Z) w.r.t. x, y is smooth and the gradients of f(x, y; Z) w.r.t. Z are jointly Lz f-Lipschitz continuous x, y, namely, xf(x, y; Z) xf(x, y; Z ) Lz f Z Z , (6a)

yf(x, y; Z) yf(x, y; Z ) Lz f Z Z . (6b)

(Smoothness of the LL loss function and distribution) Similarly, we assume that loss function ℓ(x, y; Z) is smooth and the gradient of ℓ(x, y; Z) is Lz ℓ-Lipschitz continuous x, y, namely, yℓ(x, y; Z) yℓ(x, y; Z ) Lz ℓ Z Z . (7)

(Second-order smoothness of the LL loss function and distribution) Assume that loss function ℓ(x, y) is continuously twice differentiable and its the Jacobian and Hessian matrices are respectively Lz ℓxyand Lz ℓyy-Lipschitz continuous x, y, namely,

2 xyℓ(x, y; Z) 2 xyℓ(x, y; Z ) Lz ℓxy Z Z , (8a)

2 yyℓ(x, y; Z) 2 yyℓ(x, y; Z ) Lz ℓyy Z Z . (8b)

Assumption 4. (Boundedness of the gradient and Jacobian) We assume that yf(x, y) Cy f and 2 xyℓ(x, y) Cℓxy x, y.

The assumptions of the Lipschitz continuity on data distributions are commonly used in the existing theoretical works of quantifying the convergence of the single-level PP algorithms (Mendler-Dünner et al., 2020; Drusvyatskiy & Xiao, 2022). Remark 1. Under the strong convexity assumption of the LL objective function, we have the closed form of computing the UL gradient through the chain rule, which is x F(x) = xf(x, y (x)) 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x)) (Ghadimi & Wang, 2018; Shen & Chen, 2022). It can be seen that the UL gradient involves the second-order derivatives of the LL loss function. Therefore, we assume the second-order continuity and boundedness of the loss function, which will be used for measuring the distribution changes with respect to the UL and LL variables. In the following, we will present the theoretical results and relegate the detailed proofs in the appendix.

4.3. Convergence of Bi-RRM Theorem 1. Suppose that A.1 A.4 hold and iterates {xr, r 1} are generated by the Bi-RRM method. Then,

R(x) R(x ) (Cxyεxεy+Cyεy) x x , x, x (9)

( Lz f + Cℓxy Lz f γy

( Cy f ( Lz ℓxy + Lz ℓyy Cℓxy

+Lz ℓ ( Lx f + Cy f Ly ℓxy γy + Cℓxy

( Ly f + Cy f Ly ℓyy γy

))) . (10b)

If εx and εy satisfy

Cxyεxεy + Cyεy 1, (11)

then the sequence generated by Bi-RRM converges to a unique BPS point x S at a linear rate, namely,

xr x S 2 (Cxyεxεy + Cyεy)r x1 x S .

This result claims that under A.1 A.4 when (11) is satisﬁed, there exists a global optimal BPS solution of problem (1). Remark 2. If we set εx = εy ε, it becomes evident that the condition (11) ensuring the linear convergence rate of Bi-RRM to the BPS solution is ε < (2Cxy) 1(

C2y + 4Cxy Cy). Remark 3. Another important observation is that εx and εy are coupled in a bilinear manner in the existence condition (11) of the BPS point. This means that if both values

Bilevel Optimization with Coupled Decision-Dependent Distributions

are large, the coupling term will be ampliﬁed, making it impossible for Cxyεxεy + Cyεy to be less than 1. Additionally, it is worth noting that the asymmetry between Cxy and Cy is dependent on Cℓxy, Ly ℓxy, Lz ℓxy. This observation aligns with intuition, as the Jacobian matrix reﬂects the coupling relationship between x and y.

5. Bilevel Performative Prediction Algorithms However, the Bi-RRM method is computationally inefﬁcient as it requires solving a bilevel optimization problem completely at each iteration, including assessing the full gradient. In order to address this issue, we propose a simple gradient-based stochastic bilevel algorithm to ﬁnd the BPS point.

5.1. Bilevel Stochastic Gradient Descent (Bi-SGD) To simplify our notation, we deﬁne ℓ(x, y) = EZ D(x)[ℓ(x, y; Z)] and f(x, y (x)) = EZ D(y (x))[f(x, y (x); Z)]. Furthermore, we use ℓ(xr, yr; Zy) to denote ℓ(xr, yr; Z), where Z Dy(xr), and f(xr, yr; Zx) to denote f(xr, yr; Z), where Z Dx(yr). Then, the proposed bilevel SGD algorithm for minimizing both the UL and LL performative risks can be written concisely as follows.

yr+1 = yr βr b yℓ(xr, yr; Zy), (12a)

xr+1 = xr αr b xf(xr, yr; Zx), (12b)

where αr, βr respectively denote the step sizes of the UL and LL learning processes, and b yℓ(xr, yr; Zy), b xf(xr, yr; Zx) respectively represent the gradient estimates of yℓ(xr, yr) and xf(xr, yr) with only utilizing a minibatch of i.i.d. data samples. Here, xf(x, y) denotes the surrogate of x F(x), which simply replaces y (x) in x F(x) by y.

Remark 4. If the full gradients of both the UL and LL loss functions can be obtained at each iteration, the Bi-SGD algorithm reduces to bilevel gradient descent (Bi-GD).

5.2. Theoretical Assumptions To quantify the descent achieved by Bi-SGD after each round of updates, we need the following assumptions.

Assumption 5. Assume that the gradient of loss function f(x, y) w.r.t. x is Lipschitz continuous with constant Lx f for x and Lx f for y. Similarly, we assume that the gradient of f(x, y) w.r.t. y is Lipschitz continuous with constant Ly f for y and Ly f for x, and loss function ℓ(x, y) is Lℓ-smooth.

Also, we assume that the Jacobian and Hessian matrices of loss function ℓ(x, y) are Lipschitz continuous with constants Lx ℓxy and Lx ℓyy for x and constants Ly ℓxy and Ly ℓyy for y.

Let Fr = σ{y1, x1, . . . , yr, xr} denote the ﬁltration of the random variables up to iteration r, where σ{ } is the σ-

algebra generated by the random variables. Assumption 6. (Quality of both the LL and UL gradient estimates) Assume that the LL gradient estimate is unbiased and with bounded variance, namely,

E[b yℓ(xr, yr; Z)|Fr] = yℓ(xr, yr), (13a)

E[ b yℓ(xr, yr; Z) yℓ(xr, yr) 2|Fr] σ2 ℓ (13b)

Assume that the UL gradient estimate is biased and with bounded variance, namely,

E[b xf(xr, yr; Z)|Fr] xf(xr, yr) + br, (14)

E[ b xf(xr, yr; Z) xf(xr, yr) br 2|Fr] σ2 f,

where br denotes the bias term and br δr, r.

Remark 5. The bias term br primarily arises from the estimation of the inverse Hessian matrix during the stochastic approximation of the UL gradient. In (Ghadimi & Wang, 2018), it has been demonstrated that by utilizing the Hessian inverse approximation sampling method, b xf(xr, yr; Zx) can be accurately obtained. Additionally, it has been shown that the size of the resulting bias term exponentially decreases as the number of samples increases.

All the aforementioned assumptions regarding Lipschitz continuity and the quality of gradient estimates are standard in the convergence analysis for stochastic bilevel algorithms (Ghadimi & Wang, 2018; Dagréou et al., 2022; Chen et al., 2021; Shen & Chen, 2022). In the following section, we will present the theoretical results and provide detailed proofs in the appendix.

5.3. Convergence Rates of Bi-SGD and Bi-GD Based on these mild assumptions, we can show the main theorem regarding the convergence rate of Bi-SGD as follows. Theorem 2. (Convergence Rate of Bi-SGD) Suppose that A.1-A.6 hold and iterates {xr, yr, r 1} are generated by Bi-SGD. When the step sizes are chosen as αr = Θ(1/r), βr = Θ(1/r) and εx and εy satisfy

Cxεx + Cxyεxεy + Cyεy Lεx,εy F 4(Lεx,εy F + γx) (15)

Cx ( Lz f + Cℓxy Lz f γy

γxγy , (16)

then, it holds for any r that

E xr x S 2 + E yr y (x S) 2 = O (1

where Lεx,εy F is linear in terms of εx and εy (the detailed expression of Lεx,εy F is shown in Lemma 3 in the appendix.) Moreover, we have

lim r xr x S 2 0, lim r yr y (x S) 2 0 (18)

almost surely.

Bilevel Optimization with Coupled Decision-Dependent Distributions

Remark 6. By comparing the conditions (11) and (15), it is evident that Bi-SGD imposes more stringent requirements, necessitating smaller values of εx and εy in order to ﬁnd the BPS point. Corollary 1. (Convergence Rate of Bi-GD) Suppose that A.1-A.5 hold and iterates {xr, yr, r 1} are generated by Bi-GD. When the step sizes are chosen as αr = Θ(1), βr = Θ(1) and εx and εy satisfy (15), it holds for any r that

Vr+1 ( 1 min ( Lεx,εy F γxαr 2(Lεx,εy F + γx), γyβr

where Vr xr x S 2 + yr y (x S) 2.

Remark 7. Corollary 1 claims that Bi-GD can achieve a linear convergence rate in converging to the BPS solution of problem (1), which is the same as Bi-RRM.

6. Relation between BPS and BPO Points

The aforementioned results demonstrate that Bi-SGD or Bi-GD (or Bi-RRM) can achieve the BPS point at a linear or sublinear rate. However, directly minimizing the performative risk to ﬁnd the BPO points is a challenging task, even in the single-level case (Perdomo et al., 2020). Instead, we can estimate the distance between x O and x S by assuming that the loss function satisﬁes Lipschitz continuity with respect to the data distribution, namely, |f(x, y (x); Z) f(x, y (x); Z )| Lz|Z Z |, where Lz is a constant. Theorem 3. Suppose that A.1-A.4 hold and function f(x, y (x); Z) is Lz-Lipschitz in Z. Then, for every BPS point and BPO point, the following relation holds.

x O x S 2εx Lz

γxγy (Cℓxy + Lz ℓεy) . (20)

Remark 8. Theorem 3 shows that the distance between x O and x S can be bounded by the values of εx and εy. This bound indicates that when εx and εy are small, x S is very close to x O. Furthermore, it suggests that x S can serve as an approximation of x O, with εx playing a more dominant role in the error bound compared to εy.

7. Numerical Experiments

In this section, we conduct experiments to evaluate the performance of Bi-RRM and Bi-SGD on the bilevel strategic classiﬁcation problem and the meta strategic learning problem using both the synthetic and real data sets. In all the experiments, we adopt a linear utility function to model the reactions of the decision variables to the data, following the approach in (Perdomo et al., 2020). Speciﬁcally, for Bi-RRM, the feature vectors are shifted by εxy (xr) for Bi-RRM at UL (εxyr for Bi-SGD) and εyxr at LL. Additionally, we compare the convergence behavior of these algorithms under different choices of εx and εy.

7.1. Toy Example

Firstly, we consider a simple strategic bilevel problem as follows,

min x Rd E {(ai,bi)} Dx(y (x)) bi a T i x + λx

2 x y (x) 2

s.t. y (x)=arg min y Rd E {(ci,di)} Dy(x) di c T i y + λy

where both UL and LL objective functions are regression mismatch loss plus a quadratic regularization term. Here, bi and di are generated respectively through the linear regression models, i.e., bi = a T i x + nx and di = c T i y + ny, i, where x , y and noise terms nx, ny are i.i.d. Gaussian random variables with zero mean and unit variance. The features ai and ci are i.i.d. Gaussian random variables with zero mean and variance 1 and 2 respectively. The regularization terms penalize the dissimilarity between the UL and LL variables, which appears commonly in the federated learning (T Dinh et al., 2020) and meta-learning (Rajeswaran et al., 2019) settings.

In our numerical experiments, we set the problem dimension as 5 and the total number of data samples as 50. The parameter ε represents both εx and εy. We choose λx = λy = 1 10 3, the minibatch size as 5, and use the same step size 1/ r for both αr and βr in Bi-SGD. In this example, we measure the optimality of the solutions based on the size of the UL gradient. Note that this problem is strongly convex and unconstrained, so a zero stationary gap indicates the global optimality of the iterate generated by either Bi-RRM or Bi-SGD. For Bi-RRM, we can employ the existing bilevel optimization solver as an oracle at each step given Dx(y (xr)) and Dy(xr).

Results on the Synthetic Data Set. The results are shown in Figure 1 based on 10 independent runs. From Figure 1(a), it can be observed that as the value of ε increases from 1 10 3 to 0.1, the convergence rate of the Bi-RRM algorithm slows down, which aligns with our theoretical analysis. According to our theory, once ε surpasses a certain threshold, Bi-RRM fails to converge. The numerical results conﬁrm this, as it is evident that when ε = 1 or ε = 10, Bi-RRM fails to ﬁnd any stationary solution and may exhibit oscillatory behavior.

Another noteworthy characteristic of the bilevel problem is the asymmetry between the leader and the follower. As observed in Figure 1(b), the convergence rate of the Bi-RRM algorithm is more sensitive to the UL problem compared to the LL problem in terms of εx-sensitivity and εy-sensitivity. A notable comparison is the case of εx = 100, εy = 0.1 versus εx = 0.1, εy = 100. It is evident that the Bi-RRM algorithm converges to the global optimal solution of the problem when εx = 0.1, εy = 100, but fails to do so when εx = 100, εy = 0.1. This implies that the contribution of the LL optimization process (or the

Bilevel Optimization with Coupled Decision-Dependent Distributions

number of iterations

stationary gap

" = 10 " = 1 " = 0:1 " = 0:01 " = 0:001

(a) Bi-RRM (εx = εy)

number of iterations

stationary gap

"x = 100; "y = 1 "x = 100; "y = 0:1 "x = 10; "y = 1 "x = 10; "y = 10 "x = 0:1; "y = 100 "x = 0:1; "y = 1000

(b) Bi-RRM (εx = εy)

number of iterations

stationary gap

" = 0:001 " = 0:01 " = 0:1 " = 0:5 " = 1

(c) Bi-SGD (εx = εy)

Figure 1. Performance of Bi-RRM and Bi-SGD on the least-squares over different levels of the ε-sensitivity parameters.

follower) to the convergence of the entire sequence is not as sensitive as the UL optimization process (or the leader), which is again consistent with our theoretical result shown in Theorem 1.

7.2. Meta Strategic Classiﬁcation We also test the Bi-RRM and Bi-SGD on the problem of meta strategic classiﬁcation, which can be simply written as follows.

min x Rd 1 m

i=1 E Zi Di(y i (x))fi(y i (x); Zi) + λx

s.t. y i (x) = arg min yi Rd E Zi Di(x)ℓi(x, yi; Zi), i [m]

where ℓi(x, yi; Zi) yi, fi(x; Zi) + λy/2 yi x 2, λx, λy > 0, and m denotes the total number of tasks. It is obvious that when Di(y i (x)) = Di(x) = Di, i, this problem reduces to the classic formulation of MAML (Fallah et al., 2020; Finn et al., 2017). Here, the logistic regression loss is chosen as the objective function fi(; ). Results on the Spambase Data Set. In the numerical experiments, we employ the UCI machine learning repository spambase data set (Hopkins et al., 1999) for binary classiﬁcation. This data set consists of 57 attributes with both continuous and discrete values, comprising a total of 4601 instances. Each instance is associated with a class label, denoted by 1 or 0, indicating whether it is categorized as spam or not. To demonstrate the effectiveness of meta-learning on this data set, we partition it into multiple subsets and treat each subset as an individual task. For each task, we create different data distributions as follows: initially, we randomly shufﬂe the entire data set and select 5 data samples for each task, resulting in a total of 5 tasks. Subsequently, we remove certain features from each task to create the training data set. More speciﬁcally, we set the ith and the i + 3th columns of the data set to 0 for these 5 tasks, where i ranges from 1 to 5. Finally, we utilize 800 data samples that are not included in the training data set, with 5 samples used for meta-training and the remaining samples used for meta-testing. Overall, each task only has a subset of

the spambase data set features, and the learned model is evaluated on the meta-testing data set that partially overlaps with the training data. This setup ensures that the features in the latent space are transferable, enabling us to assess the generalization performance of the learned model. We conduct the numerical experiments to compare the performance of both Bi-SGD and traditional SGD (without PP training). We set the values of λx, λy to 1 10 3 and 1 respectively, and the step size is chosen as 0.5/ 10 + r for both Bi-SGD and SGD. The results, averaged over 10 independent trials, are presented in Figure 2. Figure 2(a) illustrates that when ε is large, Bi-SGD may not converge to a stationary point, as observed in the case when ε = 1. We further analyze the meta-training and meta-testing performance of both algorithms under different levels of sensitivity parameters. Figure 2(b) demonstrates that Bi SGD, trained on the strategic data generating process, can perform well in the few-shot learning setting, yielding high meta-training accuracy. It is important to note that SGD is only trained on nonstrategic features. In this case, the use of SGD may lead to reduced training accuracy and increased sensitivity to the ε-sensitivity parameters. The meta-testing results provide clearer insights into these characteristics. Firstly, when ε is small, the testing accuracy achieved by Bi SGD is close to the maximum one, aligning with our theoretical analysis presented in Theorem 3. Secondly, the convergence rate and behavior of both Bi-SGD and SGD strongly depend on the levels of ε-sensitivity, emphasizing the signiﬁcance of quantifying the maximum budget for the sensitivity parameters. Lastly, the meta-learning approach yields signiﬁcantly higher testing accuracy compared to the single-level learning strategy. Results on the Amazon Review Data Set. We also perform the numerical experiments using the UCI sentiment labeled sentences data set (Kotzias et al., 2015), speciﬁcally the Amazon reviews subset, which is a prominent area of research in natural language processing. To vectorize the sentences, we utilize the Count Vectorizer from the scikit-learn library, generating word count vectors

Bilevel Optimization with Coupled Decision-Dependent Distributions

(a) Stationarity

number of iterations

train accuracy

(b) Train accuracy (c) Test accuracy

Figure 2. Performance comparison of Bi-SGD and SGD for meta strategic learning over different levels of the ε-sensitivity parameters on the spambase data set, where ε εx = εy.

number of iterations

(a) Train accuracy

number of iterations

(b) Test accuracy

Figure 3. Performance comparison of Bi-SGD, Bi-SGD without PP training, Lazy-SGD (aka Lazy deploy), and SGD without PP training for meta strategic learning over different levels of the ε-sensitivity parameters on the Amazon review data set, where ε εx = εy.

for each sentence. Each data sample is represented by a vector of size 1546. For our experiments, we select 10 data samples for each of the 5 tasks, resulting in a total of 50 data samples. As the vectors are sparse, we partition each of them as 5 parts and set the entries of the ith part as zero for the ith task, creating heterogeneously distributed and partially observed data vectors for these tasks. Additionally, we use another set of 10 and 250 data samples as the meta-training and meta-testing data sets, respectively. We compare the performance of our proposed Bi-SGD algorithm with Lazy-SGD (aka Lazy deploy) (Mendler-Dünner et al., 2020), traditional SGD and bilevel SGD without PP training. The step sizes for the tested algorithms are chosen as 10/ 10 + r.

The results, depicted in Figure 3, reveal that Bi-SGD achieves higher meta-training and meta-testing accuracies compared to the other two benchmark algorithms. This can be attributed to the bilevel structure, which excels at learning the invariant latent feature space. As a result, it leads to improved generalization during the metatesting phase. Furthermore, we observe that increasing the value of ε corresponds to a decrease in accuracy, indicating that a larger deviation can result in a slower

convergence rate, aligning with our theoretical ﬁndings. Additional numerical results can be found in Section E of the appendix.

8. Concluding Remarks In this work, our focus is on stochastic bilevel optimization with applications to meta strategic learning. We begin by verifying the existence of the BPS point based on the εsensitivity parameters when strategic learners participate in the bilevel PP process. We then revisit the commonly used bilevel SGD method to solve this class of bilevel optimization problems in an iterative way. Moreover, we establish the convergence rate of Bi-SGD, which matches the standard rate of SGD for solving non-performative prediction problems. This convergence rate is achieved under the satisfaction of the newly provided existence conditions for the BPS point. Besides, we quantify an upper bound on the mismatch between the BPS and BPO points, demonstrating that a smaller ε-sensitivity parameter leads to closer proximity between these two points. The numerical experiments support and validate our theoretical results, providing empirical evidence for the effectiveness of our proposed approaches.

Bilevel Optimization with Coupled Decision-Dependent Distributions

Balcan, M.-F., Khodak, M., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. In Proceedings of International Conference on Machine Learning (ICML), pp. 424 433, 2019.

Brown, G., Hod, S., and Kalemaj, I. Performative prediction in a stateful world. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 6045 6061, 2022.

Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends R in Machine Learning, 8(3-4):231 357, 2015.

Chen, L., Lu, S., and Chen, T. Understanding benign overﬁtting in gradient-based meta learning. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Chen, T., Sun, Y., and Yin, W. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2021.

Chua, K., Lei, Q., and Lee, J. D. How ﬁne-tuning allows for effective meta-learning. Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 34: 8871 8884, 2021.

Cutler, J., Drusvyatskiy, D., and Harchaoui, Z. Stochastic optimization under distributional drift. ar Xiv preprint ar Xiv:2108.07356, 2021.

Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Drusvyatskiy, D. and Xiao, L. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 2022.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 1082 1092, 2020.

Fiez, T., Chasnov, B., and Ratliff, L. Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study. In Proceedings of International Conference on Machine Learning (ICML), pp. 3133 3144, 2020.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of International Conference on Machine Learning (ICML), pp. 1126 1135, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of International Conference on Machine Learning (ICML), pp. 1568 1577, 2018.

Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018.

Hardt, M., Jagadeesan, M., and Mendler-Dünner, C. Performative power. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Hopkins, M., Reeber, E., Forman, G., and Suermondt, J. Spambase. UCI Machine Learning Repository, 1999.

Hu, Q., Zhong, Y., and Yang, T. Multi-block min-max bilevel optimization with applications in multi-task deep auc maximization. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Huang, F. and Huang, H. Biadam: Fast adaptive bilevel optimization methods. ar Xiv preprint ar Xiv:2106.11396, 2021.

Izzo, Z., Ying, L., and Zou, J. How to learn when data reacts to your model: performative gradient descent. In Proceedings of International Conference on Machine Learning (ICML), pp. 4641 4650, 2021.

Izzo, Z., Zou, J., and Ying, L. How to learn when data gradually reacts to your model. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 3998 4035, 2022.

Jagadeesan, M., Zrnic, T., and Mendler-Dünner, C. Regret minimization with performative feedback. ar Xiv preprint ar Xiv:2202.00628, 2022.

Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In Proceedings of International Conference on Machine Learning (ICML), pp. 4882 4892, 2021.

Kim, M. P. and Perdomo, J. C. Making decisions under outcome performativity. ar Xiv preprint ar Xiv:2210.01745, 2022.

Kotzias, D., Denil, M., De Freitas, N., and Smyth, P. From group to individual labels using deep features. In Proceedings of ACM SIGKDD International Conference

Bilevel Optimization with Coupled Decision-Dependent Distributions

on Knowledge Discovery and Data Mining (KDD), pp. 597 606, 2015.

Li, J., Gu, B., and Huang, H. A fully single loop algorithm for bilevel optimization without hessian inverse. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 36, pp. 7426 7434, 2022a.

Li, Q. and Wai, H.-T. State dependent performative prediction with stochastic approximation. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 2022.

Li, Q., Yau, C.-Y., and Wai, H.-T. Multi-agent performative prediction with greedy deployment and consensus seeking agents. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022b.

Liu, H., Socher, R., and Xiong, C. Taming MAML: Efﬁcient unbiased meta-reinforcement learning. In Proceedings of International Conference on Machine Learning (ICML), pp. 4061 4071, 2019.

Liu, R., Gao, J., Zhang, J., Meng, D., and Lin, Z. Investigating bi-level optimization for learning and vision from a uniﬁed perspective: A survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.

Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. A general descent aggregation framework for gradientbased bi-level optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Liu, Y., Chen, Y., Tang, Z., and Zhang, K. Model transferability with responsive decision subjects. ar Xiv preprint ar Xiv:2107.05911, 2021b.

Lu, S. and Gao, T. Meta-DAG: Meta causal discovery via bilevel optimization. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Maheshwari, C., Chiu, C.-Y., Mazumdar, E., Sastry, S., and Ratliff, L. Zeroth-order methods for convex-concave min-max problems: Applications to decision-dependent risk minimization. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 6702 6734, 2022.

Maity, S., Mukherjee, D., Banerjee, M., and Sun, Y. Predictor-corrector algorithms for stochastic optimization under gradual distribution shift. ar Xiv preprint ar Xiv:2205.13575, 2022.

Mandal, D., Triantafyllou, S., and Radanovic, G. Performative reinforcement learning. ar Xiv preprint ar Xiv:2207.00046, 2022.

Mendler-Dünner, C., Perdomo, J., Zrnic, T., and Hardt, M. Stochastic optimization for performative prediction. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), volume 33, pp. 4929 4939, 2020.

Mendler-Dünner, C., Ding, F., and Wang, Y. Anticipating performativity by predicting from predictions. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Miller, J. P., Perdomo, J. C., and Zrnic, T. Outside the echo chamber: Optimizing the performative risk. In Proceedings of International Conference on Machine Learning (ICML), pp. 7710 7720, 2021.

Mofakhami, M., Mitliagkas, I., and Gidel, G. Performative prediction with neural networks. In Neur IPS Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.

Narang, A., Faulkner, E., Drusvyatskiy, D., Fazel, M., and Ratliff, L. Learning in stochastic monotone games with decision-dependent data. In Proceedings of International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 5891 5912, 2022.

Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. Performative prediction. In Proceedings of International Conference on Machine Learning (ICML), pp. 7599 7609, 13 18 Jul. 2020.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 32, 2019.

Robbins, H. and Siegmund, D. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pp. 233 257. 1971.

Roy, A., Balasubramanian, K., and Ghadimi, S. Projectionfree constrained stochastic nonconvex optimization with state-dependent markov data. ar Xiv preprint ar Xiv:2206.11346, 2022.

Sabach, S. and Shtern, S. A ﬁrst order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640 660, 2017.

Shen, H. and Chen, T. A single-timescale analysis for stochastic approximation with multiple coupled sequences. Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

T Dinh, C., Tran, N., and Nguyen, J. Personalized federated learning with moreau envelopes. In Proceedings

Bilevel Optimization with Coupled Decision-Dependent Distributions

of Advances in Neural Information Processing Systems (Neur IPS), volume 33, pp. 21394 21405, 2020.

Wood, K. and Dall Anese, E. Stochastic saddle point problems with decision-dependent distributions. ar Xiv preprint ar Xiv:2201.02313, 2022.

Wood, K., Bianchin, G., and Dall Anese, E. Online projected gradient descent for stochastic optimization with decision-dependent distributions. IEEE Control Systems Letters, 6:1646 1651, 2021.

Ye, M., Liu, B., Wright, S., Stone, P., and Liu, Q. BOME! bilevel optimization made easy: A simple ﬁrstorder approach. In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), 2022.

Zhao, Y. Optimizing the performative risk under weak convexity assumptions. ar Xiv preprint ar Xiv:2209.00771, 2022.

Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. Who leads and who follows in strategic classiﬁcation? Proceedings of Advances in Neural Information Processing Systems (Neur IPS), pp. 15257 15269, 2021.

Bilevel Optimization with Coupled Decision-Dependent Distributions

A. Preliminaries

Before showing the detailed derivations of the lemmas and theorems, we give the following inequalities and equalities which are often used in the proofs.

A.1. Inequalities

1. (Corollary 3.1 (Drusvyatskiy & Xiao, 2022)) Suppose that f(y; Z) is C1 smooth and the map Z 7 f(y; Z) is L-Lipschitz continuous. Also, assume that there exists a ε > 0 satisfying W1(D(x), D(x )) ε x x . Then,

E Z D(x) f(y; Z) E Z D(x ) f(y; Z) εL x x . (23)

2. (Lemma 3.11 (Bubeck et al., 2015)). Let f : Rd R be L-smooth and γ-strongly convex, then for all x, y Rd

f(x) f(y), x y γL γ + L x y 2 + 1 γ + L f(x) f(y) 2. (24)

3. Young s inequality with parameter θ > 0 is

2 y 2, x, y. (25)

From the optimality condition of the LL problem, we have

2 xyℓ(x, y (x)) + xy (x) T 2 yyℓ(x, y (x)) = 0, (26)

which gives the gradient of the UL objective function as

x F(x) = f(x, y (x)) f(x, y (x))

= xf(x, y (x)) 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x)) (27)

by applying the chain rule. By following this notation, the expression of f(x, y) is deﬁned as follows (Ghadimi & Wang, 2018; Shen & Chen, 2022)

f(x, y) = xf(x, y) 2 xyℓ(x, y) [ 2 yyℓ(x, y) ] 1 yf(x, y). (28)

Based on the deﬁnitions of y (φ, x) shown in (5b) and y (x) shown in (1b), we have

y (x, x) = y (x). (29)

A.2. Lipschitz Constants

In this section, we present the Lipschitz constants used to quantify the changes in gradients. First, the detailed deﬁnitions of the Lipschitz continuous constants in Assumption 5 are further given as follows.

(Smoothness of the UL loss function) Assume that the gradient of loss function f(x, y; Z) w.r.t. x is Lipschitz continuous x, x , y, y , namely,

E Z D( ) xf(x, y; Z) xf(x , y; Z) Lx f x x , (30a) E Z D( ) xf(x, y; Z) xf(x, y ; Z) Lx f y y . (30b)

Bilevel Optimization with Coupled Decision-Dependent Distributions

Also, we assume that the gradient of f(x, y; Z) w.r.t. y is Lipschitz continuous x, x , y, y , namely,

E Z D( ) yf(x, y; Z) yf(x, y ; Z) Ly f y y , (31a) E Z D( ) yf(x, y; Z) yf(x, y ; Z) Ly f x x . (31b)

(Smoothness of the LL loss function) Similarly, we also assume that loss function ℓ(x, y; Z) is Ly ℓ-smooth, x, y, y namely,

E Z D( ) yℓ(x, y; Z) yℓ(x, y ; Z) Ly ℓ y y . (32)

Assume that the Jacobian and Hessian matrices of loss function ℓ(x, y) are Lipschitz continuous x, x , y, y , namely,

E Z D( ) 2 xyℓ(x, y; Z) 2 xyℓ(x , y; Z) Lx ℓxy x x , (33a) E Z D( ) 2 xyℓ(x, y; Z) 2 xyℓ(x, y ; Z) Ly ℓxy y y , (33b) E Z D( ) 2 yyℓ(x, y; Z) 2 yyℓ(x , y; Z) Lx ℓyy x x , (33c) E Z D( ) 2 yyℓ(x, y; Z) 2 yyℓ(x, y ; Z) Ly ℓyy y y . (33d)

Next, we will introduce the technical lemmas that quantify the changes in gradients with respect to shifts in the data distribution. These lemmas are used in the convergence proofs of both Bi-RRM and Bi-SGD. The following table provides a summary of the gradient Lipschitz continuity with respect to the decision-dependent distributions.

Table 2. Notations for Constants

Constant (abbrv.) Deﬁnition Details

Lεx f f(x, y) f(x, y (x)) Lεx f y y (x) Lemma 1

Lεy y y (x) y (x ) 2 Lεy y x x Lemma 2

Lεx,εy F (LF ) f(x, y (x)) f(x , y (x )) Lεx,εy F x x Lemma 3

Lεx,εy Z (LZ)

E Z Dx(y (φ,x)) f(x, y (φ, x); Z) E Z Dx(y (φ,x )) f(x, y (φ, x ); Z) Lεx,εy Z x x Lemma 4

L εx,εy Z (L Z) E Z Dx(y (x)) f(x, y (x); Z) E Z D(y (x )) f(x, y (x, x ); Z) L εx,εy Z x x Lemma 4

Lemma 1. The gradient of the UL objective function is Lipschitz continuous with constant Lεx f , namely,

E Z Dx(y) f(x, y; Z) E Z Dx(y (x)) f(x, y (x); Z) Lεx f y y (x) (34)

where constant

Lεx f Lz fεx + Lx f + Cy f Ly ℓxy γy + Cy f Cℓxy Ly ℓyy γ2y + Cℓxy

( Lz fεx + Ly f ) . (35)

Bilevel Optimization with Coupled Decision-Dependent Distributions

Proof. From (27) and (28), we can have

f(x, y) f(x, y (x))

xf(x, y) 2 xyℓ(x, y) [ 2 yyℓ(x, y) ] 1 yf(x, y)

( xf(x, y (x)) 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x))) (36)

E Z Dx(y) xf(x, y; Z) E Z Dx(y (x)) xf(x, y (x); Z)

+ E Z Dy(x) 2 xyℓ(x, y; Z) [ E Z Dy(x) 2 yyℓ(x, y; Z) ] 1 E Z Dx(y) yf(x, y; Z)

E Z Dy(x) 2 xyℓ(x, y (x); Z) [ E Z Dy(x) 2 yyℓ(x, y (x); Z) ] 1 E Z Dx(y (x)) yf(x, y (x); Z) (37)

E Z Dx(y) xf(x, y; Z) E Z Dx(y (x)) xf(x, y; Z) + E Z Dx(y (x)) xf(x, y; Z) E Z Dx(y (x)) xf(x, y (x); Z)

+ 1 + 2 + 3 (38)

(23) (Lz fεx + Lx f) y y (x) + 1 + 2 + 3 (39)

1 ( E Z Dy(x) 2 xyℓ(x, y; Z) E Z Dy(x) 2 xyℓ(x, y (x); Z) ) [ 2 yyℓ(x, y) ] 1 yf(x, y), (40a)

2 2 xyℓ(x, y (x))

([ E Z Dy(x) 2 yyℓ(x, y; Z) ] 1 [ E Z Dy(x) 2 yyℓ(x, y (x); Z) ] 1)

yf(x, y), (40b)

3 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 ( E Z Dx(y) yf(x, y; Z) E Z Dx(y (x)) yf(x, y (x); Z) ) . (40c)

( E Z Dy(x) 2 xyℓ(x, y; Z) E Z Dy(x) 2 xyℓ(x, y (x); Z) ) [ 2 yyℓ(x, y) ] 1 yf(x, y) Cy f Ly ℓxy γy y y (x) ,

and similarly we can also obtain

2 xyℓ(x, y (x))

([ E Z Dy(x) 2 yyℓ(x, y; Z) ] 1 [ E Z Dy(x) 2 yyℓ(x, y (x); Z) ] 1)

Cℓxy Cy f Ly ℓyy γ2y y y (x) (41)

where we use the fact H 1 2 H 1 1 = H 1 1 (H1 H2)H 1 2 H 1 1 H 1 2 H1 H2 , and

3 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 ( E Z Dx(y) yf(x, y; Z) E Z Dx(y (x)) yf(x, y; Z) )

+ 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 ( E Z Dx(y (x)) yf(x, y; Z) E Z Dx(y (x)) yf(x, y (x); Z) )

( Lz fεx + Ly f ) y y (x) . (42)

Bilevel Optimization with Coupled Decision-Dependent Distributions

Therefore, we have f(x, y) f(x, y (x))

Lz fεx + Lx f + Cy f Ly ℓxy γy + Cy f Cℓxy Ly ℓyy γ2y + Cℓxy

( Lz fεx + Ly f )

y y (x) . (43)

Lemma 2. Function y (φ, x) is Lipschitz continuous (w.r.t. x) with constant Lz ℓεyγ 1 y , namely,

y (φ, x) y (φ, x ) Lz ℓεy γy x x , φ. (44)

Also, function y (x) is Lipschitz continuous (w.r.t. x) with constant Lεy y , namely,

y (x) y (x ) 2 Lεy y x x (45)

where constant

Lεy y Cℓxy + Lz ℓεy γy . (46)

Proof. First, deﬁne the following auxiliary variables:

y (φ, x) = arg min y E Z Dy(x)ℓ(φ, y; Z), (47a)

y (φ, x ) = arg min y E Z Dy(x )ℓ(φ, y; Z). (47b)

As ℓ(x, y; Z) is strongly convex, we can have

E Z Dy(x) ℓ(φ, y (φ, x); Z) E Z Dy(x) ℓ(φ, y (φ, x ); Z), y (φ, x) y (φ, x ) γy y (φ, x) y (φ, x ) 2. (48)

From the optimality conditions of (47a) and (47b), we have E Z Dy(x ) yℓ(φ, y (φ, x ); Z) = 0 and

E Z Dy(x) yℓ(φ, y (φ, x); Z) = 0. Therefore, we can replace E Z Dy(x) ℓ(φ, y (φ, x); Z) by E Z Dy(x ) ℓ(φ, y (φ, x ); Z)

in (48). As a result, we obtain

γy y (φ, x) y (φ, x ) 2

E Z Dy(x ) ℓ(φ, y (φ, x ); Z) E Z Dy(x) ℓ(φ, y (φ, x ); Z), y (φ, x) y (φ, x )

(23) Lz ℓεy x x y (φ, x) y (φ, x ) , (50)

which is equivalent to

y (φ, x) y (φ, x ) Lz ℓεy γy x x . (51)

y (x) = arg min y E Z Dy(x)ℓ(x, y; Z), (52)

so we know that

y (x ) = arg min y E Z Dy(x )ℓ(x , y; Z). (53)

Bilevel Optimization with Coupled Decision-Dependent Distributions

From (26), we have

y (x) 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 Cℓxy

which gives

y (x) y (x , x) Cℓxy

γy x x , (55)

as both y (x) and y (x , x) are obtained based on data sample Z Dy(x).

Combining (55) and (51) yields y (x) y (x ) y (x) y (x , x) + y (x , x) y (x ) (56)

γy + Lz ℓεy γy

Cℓxy + Lz ℓεy γy | {z }

where in (a) we substitute φ into (51) with x .

Lemma 3. The gradient of the UL objective function is Lipschitz continuous (w.r.t. x) with constant Lεx,εy F (abbreviated as LF ), namely,

x F(x) x F(x ) = f(x, y (x)) f(x , y (x )) Lεx,εy F x x (59)

where constant

Lεx,εy F Lεy y ( Lz fεx + Lx f ) + Lx f +

( Lz ℓxyεy + Lx ℓxy + Ly ℓxy Lεy y ) Cy f γy

+ Cℓxy Cy f ( Lz ℓyyεy + Lx ℓyy + Lεy y Ly ℓyy )

γ2y + Cℓxy ( Lεy y Lz fεx + Ly f + Lεy y Ly f )

Proof. According to the deﬁnition of f(x, y (x)), we can have

xf(x, y (x)) 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x))

( xf(x , y (x )) 2 xyℓ(x , y (x )) [ 2 yyℓ(x , y (x )) ] 1 yf(x , y (x ))) (61)

(a) xf(x, y (x)) xf(x , y (x ))

+ 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x)) 2 xyℓ(x , y (x )) [ 2 yyℓ(x , y (x )) ] 1 yf(x , y (x )))

where (a) is true due to the triangle inequality.

Next, let us deﬁne the following quantities:

1 ( E Z Dy(x) 2 xyℓ(x, y (x); Z) E Z Dy(x ) 2 xyℓ(x , y (x ); Z) ) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x)), (62a)

2 2 xyℓ(x , y (x ))

([ E Z Dy(x) 2 yyℓ(x, y (x); Z) ] 1 [ E Z Dy(x ) 2 yyℓ(x , y (x ); Z) ] 1)

yf(x, y (x)), (62b)

3 2 xyℓ(x , y (x )) [ 2 yyℓ(x , y (x )) ] 1( E Z D(y (x)) yf(x, y (x); Z) E Z D(y (x )) yf(x , y (x ); Z) ) . (62c)

Bilevel Optimization with Coupled Decision-Dependent Distributions

Then, we can have

xf(x, y (x)) xf(x , y (x )) + 1 + 2 + 3 (63)

( Lεy y ( Lz fεx + Lx f ) + Lx f +

( Lz ℓxyεy + Lx ℓxy + Ly ℓxy Lεy y ) Cy f γy

+ Cℓxy Cy f ( Lz ℓyyεy + Lx ℓyy + Lεy y Ly ℓyy )

γ2y + Cℓxy ( Lεy y Lz fεx + Ly f + Lεy y Ly f )

) x x , (64)

where we use the following facts: ﬁrst we have

xf(x, y (x)) xf(x , y (x ))

E Z Dx(y (x)) xf(x, y (x)) E Z Dx(y (x )) xf(x , y (x )) (65)

E Z Dx(y (x)) xf(x, y (x)) E Z Dx(y (x )) xf(x, y (x))

+ E Z Dx(y (x )) xf(x, y (x)) E Z Dx(y (x )) xf(x , y (x )) (66)

(23) ( Lz fεx + Lx f ) y (x) y (x ) + Lx f x x (67)

(45) ( Lεy y ( Lz fεx + Lx f ) + Lx f ) x x , (68)

and similarly we can have E Z Dy(x) 2 xyℓ(x, y (x); Z) E Z Dy(x ) 2 xyℓ(x , y (x ); Z)

(23) ( Lz ℓxyεy + Lx ℓxy ) x x + Ly ℓxy y (x) y (x ) (69)

(45) ( Lz ℓxyεy + Lx ℓxy + Ly ℓxy Lεy y ) x x , (70)

E Z Dy(x) 2 yyℓ(x, y (x); Z) E Z Dy(x ) 2 yyℓ(x , y (x ); Z)

(23),(45) ( Lz ℓyyεy + Lx ℓyy + Lεy y Ly ℓyy ) x x , (71)

E Z D(y (x)) yf(x, y (x); Z) E Z D(y (x )) yf(x , y (x ); Z)

(23),(45) ( Lεy y Lz fεx + Ly f + Lεy y Ly f ) x x . (72)

Lemma 4. The gradient of the UL objective function is Lipschitz continuous (w.r.t. x) with constant Lεx,εy Z (LZ), φ, x, x , namely,

E Z Dx(y (φ,x)) f(φ, y (φ, x); Z) E Z Dx(y (φ,x )) f(φ, y (φ, x ); Z) Lεx,εy Z x x (73)

where constant

(( Lz f + Cℓxy Lz f γy

) εx + Lx f + Cy f Ly ℓxy γy + Cℓxy

Ly f + Cy f Ly ℓyy γy

)) Lz ℓεy γy + Cy f γy

( Lz ℓxy + Cℓxy Lz ℓyy γy

Bilevel Optimization with Coupled Decision-Dependent Distributions

Also, the gradient of the UL objective function is Lipschitz continuous (w.r.t. x) with constant L εx,εy Z (L Z), x, x , namely, E Z Dx(y (x)) f(x, y (x); Z) E Z D(y (x )) f(x, y (x, x ); Z) L εx,εy Z x x

where constant

L εx,εy Z Lεx,εy Z + ( Lz f + Cℓxy Lz f γy

Proof. Based on the closed form of f(x, y (φ, x)), we have

E Z Dx(y (φ,x)) f(x, y (φ, x); Z) E Z D(y (φ,x )) f(x, y (φ, x ); Z)

E Z Dx(y (φ,x)) xf(x, y (φ, x); Z) E Z Dx(y (φ,x )) xf(x, y (φ, x ); Z)

+ 2 xyℓ(x, y (φ, x)) [ 2 yyℓ(x, y (φ, x)) ] 1 yf(x, y (φ, x))

( 2 xyℓ(x, y (φ, x )) [ 2 yyℓ(x, y (φ, x )) ] 1 yf(x, y (φ, x )) ) (76)

(23) ( Lx f + Lz fεx ) y (φ, x) y (φ, x ) + 1 + 2 + 3

Lx f + Lz fεx + Cy f Ly ℓxy γy + Cℓxy Cy f Ly ℓyy γ2y + Cℓxy ( Lz fεx + Ly f )

y (φ, x) y (φ, x )

( Lz ℓxy + Cℓxy Lz ℓyy γy

) εy x x (77)

where in (a) we use the following facts

1 ( E Z Dy(x) 2 xyℓ(x, y (φ, x); Z) E Z Dy(x ) 2 xyℓ(x, y (φ, x); Z) )[ 2 yyℓ(x, y (φ, x)) ] 1 yf(x, y (φ, x)), (78a)

2 2 xyℓ(x, y (φ, x ))

([ E Z Dy(x) 2 yyℓ(x, y (φ, x); Z) ] 1 [ E Z Dy(x ) 2 yyℓ(x, y (φ, x ); Z) ] 1)

yf(x, y (x)),

3 2 xyℓ(x, y (φ, x )) [ 2 yyℓ(x, y (φ, x )) ] 1( E Z Dx(y (φ,x)) yf(x, y (φ, x); Z) E Z Dx(y (φ,x )) yf(x, y (φ, x ); Z) )

1 (23) Cy f Lz ℓxyεy γy x x + Cy f Ly ℓxy γy y (φ, x) y (φ, x ) , (79a)

2 (23) Cℓxy Cy f Lz ℓyyεy γ2y x x + Cℓxy Cy f Ly ℓyy γ2y y (φ, x) y (φ, x ) , (79b)

3 (23) Cℓxy ( Lz fεx + Ly f )

γy y (φ, x) y (φ, x ) . (79c)

Plugging (51) into (77) gives (73) and (74) immediately.

Bilevel Optimization with Coupled Decision-Dependent Distributions

Similarly, we also have E Z Dx(y (x)) f(x, y (x); Z) E Z Dx(y (x )) f(x, y (x, x ); Z)

E Z Dx(y (x)) xf(x, y (x); Z) E Z Dx(y (x )) xf(x, y (x, x ); Z)

+ 2 xyℓ(x, y (x)) [ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x))

E Z Dy(x ) 2 xyℓ(x, y (x, x ); Z) [ E Z Dy(x ) 2 yyℓ(x, y (x, x ); Z) ] 1 E Z Dx(y (x )) yf(x, y (x, x ); Z)

(23) Lx f y (x) y (x, x ) + Lz fεx y (x) y (x ) + 1 + 2 + 3

( Lx f + Cy f Ly ℓxy γy + Cℓxy Cy f Ly ℓyy γ2y + Cℓxy Ly f γy

y (x) y (x, x )

( Lz ℓxy + Cℓxy Lz ℓyy γy

) εy x x + ( Lz f + Cℓxy Lz f γy

) εx y (x) y (x ) (81)

1 ( E Z Dy(x) 2 xyℓ(x, y (x); Z) E Z Dy(x ) 2 xyℓ(x, y (x, x ); Z) )[ 2 yyℓ(x, y (x)) ] 1 yf(x, y (x)), (82a)

2 E Z Dy(x ) 2 xyℓ(x, y (x, x ); Z)

([ E Z Dy(x) 2 yyℓ(x, y (x); Z) ] 1 [ E Z Dy(x ) 2 yyℓ(x, y (x, x ); Z) ] 1)

yf(x, y (x)), (82b)

3 E Z Dy(x ) 2 xyℓ(x, y (x, x ); Z) [ E Z Dy(x ) 2 yyℓ(x, y (x, x ); Z) ] 1

( E Z Dx(y (x)) yf(x, y (x); Z) E Z Dx(y (x )) yf(x, y (x, x ); Z) )

1 (23) Cy f Lz ℓxyεy γy x x + Cy f Ly ℓxy γy y (x) y (x, x ) , (83a)

2 (23) Cℓxy Cy f Lz ℓyyεy γ2y x x + Cℓxy Cy f Ly ℓyy γ2y y (x) y (x, x ) , (83b)

3 (23) Cℓxy Lz fεx γy y (x) y (x ) + Cℓxy Ly f γy y (x) y (x, x ) . (83c)

Plugging (51) and (45) into (81) gives E Z Dx(y (x)) f(x, y (x); Z) E Z Dx(y (x )) f(x, y (x, x ); Z)

( Lx f + Cy f Ly ℓxy γy + Cℓxy Cy f Ly ℓyy γ2y + Cℓxy Ly f γy

) Lz ℓεy γy x x

( Lz ℓxy + Cℓxy Lz ℓyy γy

) εy x x + ( Lz fεx + Cℓxy Lz fεx γy

) Cℓxy + Lz ℓεy γy x x (84)

( Lεx,εy Z + ( Lz f + Cℓxy Lz f γy

) x x , (85)

which completes the proof.

Bilevel Optimization with Coupled Decision-Dependent Distributions

B. Repeated Risk Minimization (Proof of Theorem 1)

Proof. First, let

g(φ) E Z Dx(y (φ,x))[f(φ, y (φ, x); Z)], s.t. y (φ, x) = arg min y E Z Dy(x)[ℓ(φ, y; Z)], (86a)

g (φ) E Z Dx(y (φ,x ))[f(φ, y (φ, x ); Z)], s.t. y (φ, x ) = arg min y E Z Dy(x )[ℓ(φ, y; Z)]. (86b)

Note that g(φ) is γx-strongly convex and

xr+1 = R(xr) arg min φ E Z Dx(y (φ,xr))[f(φ, y (φ, xr); Z)], s.t. y (φ, xr) = arg min y E Z Dy(xr)[ℓ(φ, y; Z)]. (87)

Based on the strong convexity of function g(φ), we have

g(R(x)) g(R(x )) R(x) R(x ), g(R(x )) + γx

2 R(x) R(x ) 2, (88a)

g(R(x )) g(R(x)) γx

2 R(x) R(x ) 2, (88b)

which together give

γx R(x) R(x ) 2 R(x) R(x ), g(R(x )) . (89)

g(R(x )) g (R(x ))

E Z Dx(y (R(x ),x)) xf(R(x ), y (R(x ), x); Z) E Z Dx(y (R(x ),x )) xf(R(x ), y (R(x ), x ); Z)

+ E Z Dy(R(x ),x) 2 xyℓ(R(x ), y (R(x ), x); Z) [ E Z Dy(R(x ),x) 2 yyℓ(R(x ), y (R(x ), x); Z) ] 1

E Z Dx(y (R(x ),x)) yf(R(x ), y (R(x ), x); Z)

E Z Dy(R(x ),x ) 2 xyℓ(R(x ), y (R(x ), x ); Z) [ E Z Dy(R(x ),x ) 2 yyℓ(R(x ), y (R(x ), x ); Z) ] 1

E Z Dx(y (R(x ),x )) yf(R(x ), y (R(x ), x ); Z) (90)

(73) Lεx,εy Z x x . (91)

Using the dual formulation of the optimal transport distance and εx-sensitivity of Dx yields

R(x) R(x ), g(R(x )) R(x) R(x ), g (R(x )) Lεx,εy Z x x R(x) R(x ) . (92)

Note that R(x) R(x ), g (R(x )) 0 due to the optimality condition of the UL problem. Combining (89) and (92) yields

R(x) R(x ) Lεx,εy Z

γx x x (74) = (Cxyεxεy + Cyεy) x x (93)

Cxy Lz ℓ γxγy

( Lz f + Cℓxy Lz f γy

( Lz ℓxy + Lz ℓyy Cℓxy

( Lx f + Cy f Ly ℓxy γy + Cℓxy

Ly f + Cy f Ly ℓyy γy

By using the deﬁnition of the PS solution (3) (i.e., R(x S) = x S) and Bi-RRM (5) (i.e., xr+1 = R(xr)), we can conclude that

xr x S 2 (Cxyεxεy + Cyεy) xr 1 x S (Cxyεxεy + Cyεy)r x0 x S 2. (95)

Bilevel Optimization with Coupled Decision-Dependent Distributions

C. Convergence Rate of Bi-SGD (Proof of Theorem 2)

In this section, we will provide detailed proofs of the convergence rate of PP based Bi-SGD.

First, we can have the following descent lemma that quantiﬁes the decrease of the objective value after performing one step update of solving the LL problem.

C.1. Descent Lemma of the LL Problem

Lemma 5. (Convergence of the LL Variables) Suppose that Assumption 1 Assumption 6 hold. If the iterates {xr, yr, r} are generated by Bi-SGD and the step size βr satisﬁes

βr γy 2(Ly ℓ)2 (96)

then, we have

E yr+1 y (xr) 2 (1 βrγy)E yr y (xr) 2 + 2β2 rσ2 ℓ, (97)

E yr+1 y (xr+1) 2

( 1 + Y1αr + Lεy y (σ f + 1)σ fα2 r ) E yr+1 y (xr) 2 + (Cℓxy Lεx f αr γy + Y2α2 r

) E yr y (xr) 2

+ ( Fα2 r + αr 2(LF + γx)

) E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

+ ( LF γxαr 2(LF + γx) + Xα2 r

) E xr x 2 + 4 ( Lεy y )2 α2 rσ2 f + Lεy y α2 rσ2 f + 4 ( Lεy y )2 α2 rδ2 r + 3

2Lεy y αrδ2 r (98)

where constants Y1, Y2, X, F, σ f are as follows

Y1 Cℓxy Lεx f γy + 2C2 ℓxy ( Lεx,εy Z )2 (LF + γx) γ2y LF γx + 2C2 ℓxy(LF + γx)

γ2y + Lεy y , (99a)

Y2 6 ( Lεy y )2 (Lεx f )2 + 3(Lεx f )2Lεy y σ f, (99b)

X 6 ( Lεy y )2 ( Lεx,εy Z )2 + 3Lεy y ( Lεx,εy Z )2 σ f, (99c)

F 6 ( Lεy y )2 + 3Lεy y σ f, and σ f σf + δr. (99d)

Proof. The distance between yr+1 and y (xr) can be quantiﬁed as follows.

E yr+1 y (xr) 2

= E yr+1 yr + yr y (xr) 2 (100)

(12a) E yr y (xr) 2 2βr yr y (xr), E Z Dy(xr) b yℓ(xr, yr; Z) + E yr+1 yr 2 (101)

(a) (1 2βrγy)E yr y (xr) 2 + E yr+1 yr 2 (102)

(12a) (1 2βrγy)E yr y (xr) 2 + β2 r E b ℓ(xr, yr; Zy) 2

(b) (1 2βrγy + 2β2 r(Ly ℓ)2)E yr y (xr) 2 + 2β2 rσ2 ℓ (104)

(c) (1 βrγy)E yr y (xr) 2 + 2β2 rσ2 ℓ (105)

where (a) is true because yℓ(xr, y (xr)) = 0, unbiasedness of the LL gradient estimator, and the strong convexity, i.e.,

yr y (xr), yℓ(xr, yr) yℓ(xr, y (xr)) γy yr y (xr) 2, (106)

Bilevel Optimization with Coupled Decision-Dependent Distributions

and in (b) we use the bounded variance of the LL gradient estimate (i.e., Assumption 6), optimality condition of the LL problem (i.e., E Z Dy(xr) yℓ(xr, y (xr); Z) = 0 ), and Lipschitz continuity, i.e.,

E Z Dy(xr) yℓ(xr, yr; Z) yℓ(xr, y (xr); Z)

2 (Ly ℓ)2 yr y (xr) 2, (107)

and (c) holds when βr γy/(2(Ly ℓ)2).

Then, we can decompose yr+1 y (xr+1) 2 into the following three parts.

E yr+1 y (xr+1) 2

= yr+1 y (xr) + y (xr) y (xr+1) 2 (108)

= E yr+1 y (xr) 2 + E y (xr) y (xr+1) 2 | {z }

+2 E yr+1 y (xr), y (xr) y (xr+1) | {z }

Next, we will respectively give the upper bounds of terms T1 and T2. Firstly, we have that

T1 = E y (xr) y (xr+1) 2 (110)

(a) ( Lεy y )2 E xr+1 xr 2 (111)

(12b) ( Lεy y )2 α2 r E b f(xr, yr; Zx) 2 (112)

(b) 2 ( Lεy y )2 α2 r f(xr, yr) 2 + 4 ( Lεy y )2 α2 rσ2 f + 4 ( Lεy y )2 α2 rδ2 r (113)

where in (a) we apply the Lipschitz continuity of y (x), and (b) follows due to the bounded variance of the UL gradient estimate from Assumption 6.

Let x denote x S. Based on the closed form and Lipschitz continuity of f(xr, yr), we can obtain

f(xr, yr) = f(xr, yr) f(xr, y (xr)) + f(xr, y (xr)) f(x , y (x )) (114)

E Z Dx(yr) f(xr, yr; Z) E Z Dx(y (xr)) f(xr, y (xr); Z)

+ E Z Dx(y (xr)) f(xr, y (xr); Z) E Z Dx(y (x )) f(xr, y (xr, x ); Z)

+ E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z) (115)

Lεx f yr y (xr) + L εx,εy Z xr x

+ E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z) (116)

where we use (34) and (73) in the last inequality.

Consequently, we can obtain

T1 (116) 2 ( Lεy y )2 α2 r ( 3(Lεx f )2E yr y (xr) 2 + 3L 2 ZE xr x 2)

+ 2 ( Lεy y )2 α2 r3 E Z Dx(y (x )) f(xr, y (xr, x )) E Z Dx(y (x )) f(x , y (x ))

+ 4 ( Lεy y )2 α2 rσ2 f + 4 ( Lεy y )2 α2 rδ2 r. (117)

Secondly, we will establish an upper bound for T2 as follows. Let bxr+1 ϑxr+(1 ϑ)xr+1, ϑ [0, 1]. By the mean-value

Bilevel Optimization with Coupled Decision-Dependent Distributions

theorem, we can obtain

T2 = E y (xr) yr+1, y (xr+1) y (xr) (118) = E y (xr) yr+1, y (bxr+1) T(xr+1 xr) (119)

= E y (xr) yr+1, αr y (bxr+1) T b f(xr, yr; Zx) (120) = T21 + T22

T21 E y (xr) yr+1, αr y (bxr+1) T f(xr, yr; Zx) , (121a)

T22 E y (xr) yr+1, αr y (bxr+1) T ( b f(xr, yr; Zx) f(xr, yr; Zx) ) . (121b)

Next, we will provide the upper bounds of T11 and T22.

T21 (54) Cℓxy

γy αr E y (xr) yr+1 f(xr, yr) (122)

γy αr E y (xr) yr+1 ( Lεx f yr y (xr) + L εx,εy Z xr x

+ E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

γy Lεx f αr E y (xr) yr+1 yr y (xr) + Cℓxy

γy L εx,εy Z αr E y (xr) yr+1 xr x

γy αr E y (xr) yr+1 E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z) (124)

(a) Cℓxy Lεx f αr 2γy E yr+1 y (xr) 2 + Cℓxy Lεx f αr 2γy E yr y (xr) 2

+ C2 ℓxy(L εx,εy Z )2(LF + γx)αr γxγ2y LF E yr+1 y (xr) 2 + LF γxαr 4(LF + γx)E xr x 2

+ C2 ℓxy(LF + γx)αr

γ2y E yr+1 y (xr) 2

+ αr 4(LF + γx)

E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

where in (a) we apply Young s inequality.

Regarding term T22, it can be upper bounded by

T22 = E y (xr) yr+1, αr ( y (bxr+1) y (xr)) T ( b f(xr, yr; Zx) f(xr, yr) )

+ E y (xr) yr+1, αr y (xr) T ( b f(xr, yr; Zx) f(xr, yr) )

(a) Lεy y αr E y (xr) yr+1 bxr+1 xr b f(xr, yr; Zx) f(xr, yr)

+ Lεy y αr E y (xr) yr+1 Eb f(xr, yr; Zx) f(xr, yr) (127)

(b) Lεy y αr E y (xr) yr+1 xr+1 xr b f(xr, yr; Zx) f(xr, yr)

2 E y (xr) yr+1 2 + Lεy y αrδ2 r 2 (128)

where (a) holds because of (45) and conditional independence of data sampling of Zx, in (b) we use (14).

Bilevel Optimization with Coupled Decision-Dependent Distributions

Using the update rule of sequence xr (12b) in (128) further gives

T22 (12b) Lεy y α2 r E y (xr) yr+1 b f(xr, yr; Zx) b f(xr, yr; Zx) f(xr, yr)

2 E y (xr) yr+1 2 + Lεy y αrδ2 r 2 (129)

(a) Lεy y α2 r E y (xr) yr+1 f(xr, yr) b f(xr, yr; Zx) f(xr, yr)

+ Lεy y α2 r E y (xr) yr+1 b f(xr, yr; Zx) f(xr, yr) 2

2 E y (xr) yr+1 2 + Lεy y αrδ2 r 2 (130)

(b) (Lεy y α2 r(σf + δr)

2 + Lεy y α2 r(σf + δr)2

2 + Lεy y αr

) E y (xr) yr+1 2

+ Lεy y α2 r(σf + δr)

2 f(xr, yr) 2 + Lεy y α2 rσ2 f + 3Lεy y αrδ2 r 2 (131)

(116) 3L 2 ZLεy y α2 r(σf + δr) 2 E xr x 2

+ 3(Lεx f )2Lεy y α2 r(σf + δr)

2 E yr y (xr) 2

+ 3Lεy y α2 r(σf + δr)

E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

+ (Lεy y α2 r(σf + δr)

2 + Lεy y α2 r(σf + δr)2

2 + Lεy y αr

) E y (xr) yr+1 2

+ Lεy y α2 rσ2 f + 3Lεy y αrδ2 r 2 . (132)

where (a) is true due to the fact that b f(xr, yr; Zx) f(xr, yr) + b f(xr, yr; Zx) f(xr, yr) , and in (b) we use Young s inequality and Assumption 6 (i.e., E b f(xr, yr; Zx) f(xr, yr) σf + δr).

Combining (125) and (132) yields

( ( Cℓxy Lεx f 2γy + C2 ℓxy L 2 Z(LF + γx) γ2y LF γx + C2 ℓxy(LF + γx)

+ Lεy y α2 r(σf + δr)

2 + Lεy y α2 r(σf + δr)2

2 + Lεy y αr

E yr+1 y (xr) 2

( Cℓxy Lεx f αr 2γy + 3(Lεx f )2Lεy y α2 r(σf + δr)

E yr y (xr) 2

+ ( αr LF γx 4(LF + γx) + 3L 2 ZLεy y α2 r(σf + δr) 2

+ (3Lεy y α2 r(σf + δr)

2 + αr 4(LF + γx)

) E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

+ Lεy y α2 rσ2 f + 3Lεy y αrδ2 r 2 . (133)

Bilevel Optimization with Coupled Decision-Dependent Distributions

To simplify the notation, let σ f σf + δr. Substituting (117) and (133) back into (109) gives

E yr+1 y (xr+1) 2

( Cℓxy Lεx f γy + 2C2 ℓxy L 2 Z(LF + γx) γ2y LF γx + 2C2 ℓxy(LF + γx)

+ Lεy y α2 rσ f + Lεy y α2 rσ 2 f + Lεy y αr

E yr+1 y (xr) 2

+ ( 6 ( Lεy y )2 (Lεx f )2α2 r + Cℓxy Lεx f αr γy + 3(Lεx f )2Lεy y α2 rσ f

) E yr y (xr) 2

+ ( 6 ( Lεy y )2 ( L εx,εy Z )2 α2 r + αr LF γx 2(LF + γx) + 3 ( L εx,εy Z )2 Lεy y α2 rσ f

+ ( 6 ( Lεy y )2 α2 r + 3Lεy y α2 rσ f + αr 2(LF + γx)

) E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

+ 4 ( Lεy y )2 α2 rσ2 f + Lεy y α2 rσ2 f + 4 ( Lεy y )2 α2 rδ2 r + 3

2Lεy y αrδ2 r, (134)

which gives (98) directly.

Bilevel Optimization with Coupled Decision-Dependent Distributions

C.2. Descent Lemma of the UL Problem

Lemma 6. Suppose that Assumption 1 Assumption 6 hold. If the iterates {xr, yr, r} are generated by Bi-SGD and the εx and εy satisfy

Cxεx + Cxyεxεy + Cyεy Lεx,εy F 4(Lεx,εy F + γx) (135)

Cx ( Lz f + Cℓxy Lz f γy

γxγy , (136)

then, we have

LF + γx + 1

( 4(Lεx f )2(LF + γx)αr

LF γx + 2(Lεx f )2α2 r

E yr y (xr) 2

( αr LF + γx 2α2 r

) E Z Dx(y (x )) f(xr, y (xr, x ); Z) f(x , y (x ); Z)

+ 4α2 rσ2 f + 4α2 rδ2 r + 4(LF + γx)

LF γx αrδ2 r. (137)

Proof. Let x denote the x S of this problem. First, we can obtain

= E xr+1 xr + xr x 2 (138)

= E xr x 2 + 2 xr+1 xr, xr x + E xr+1 xr 2 (139)

(12b) = E xr x 2 2αr

E Z Dx(yr) b f(xr, yr; Z), xr x + E xr+1 xr 2 (140)

= E xr x 2 2αr

E Z Dx(yr) b f(xr, yr; Z) E Z Dx(y (xr)) f(xr, y (xr); Z), xr x

E Z Dx(y (xr)) f(xr, y (xr); Z), xr x + E xr+1 xr 2. (141)

Using the fact that E Z Dx(y (x )) f(x , y (x ); Z) = 0, we can have

E Z Dx(y (xr)) f(xr, y (xr); Z) E Z Dx(y (x )) f(xr, y (xr, x ); Z), xr x

+ E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z), xr x

(a) L εx,εy Z xr x 2 + E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z), xr x

(b) L εx,εy Z E xr x 2 + LF γx LF + γx xr x 2

+ 1 LF + γx

E Z Dx(y (x )) f(xr, y (xr, x ); Z) E Z Dx(y (x )) f(x , y (x ); Z)

Bilevel Optimization with Coupled Decision-Dependent Distributions

where (a) holds by applying Young s inequality and (73), in (b) we use the LF -smooth and γx-strongly convexity of f(x, y (x)) by applying Lemma 3.11 in (Bubeck et al., 2015).

By substituting (142) into (141), we obtain

( 1 2αr LF γx

LF + γx + 2αr L εx,εy Z

E Z Dx(yr) b f(xr, yr; Z) E Z Dx(y (xr)) f(xr, y (xr); Z), xr x

+ E xr+1 xr 2 αr LF + γx

E Z Dx(y (x )) f(xr, y (xr, x ); Z) f(x , y (x ); Z)

(a) ( 1 3αr LF γx 2(LF + γx) + 2αr L εx,εy Z

) E xr x 2 + 4(Lεx f )2(LF + γx)αr

LF γx E yr y (xr) 2

+ E xr+1 xr 2 αr LF + γx

E Z Dx(y (x )) f(xr, y (xr, x ); Z) f(x , y (x ); Z)

2 + 4(LF + γx)

LF γx αrδ2 r

(b) ( 1 3αr LF γx 2(LF + γx) + 2αr L εx,εy Z

) E xr x 2 + 4(Lεx f )2(LF + γx)αr

LF γx E yr y (xr) 2

+ 2α2 r E f(xr, yr) 2 αr LF + γx

E Z Dx(y (x )) f(xr, y (xr, x )) f(x , y (x ))

+ 4α2 rσ2 f + 4α2 rδ2 r + 4(LF + γx)

LF γx αrδ2 r (145)

(c) ( 1 3αr LF γx 2(LF + γx) + 2αr L εx,εy Z + 2(L εx,εy Z αr)2 ) E xr x 2

( 4(Lεx f )2(LF + γx)αr

LF γx + 2(Lεx f )2α2 r

E yr y (xr) 2

( αr LF + γx 2α2 r

) E Z Dx(y (x )) f(xr, y (xr, x ); Z) f(x , y (x ); Z)

+ 4α2 rσ2 f + 4α2 rδ2 r + 4(LF + γx)

LF γx αrδ2 r (146)

where in (a) we use

E Z Dx(yr) f(xr, yr; Z) E Z Dx(y (xr)) f(xr, y (xr); Z), xr x

+ E Z Dx(yr) b f(xr, yr; Z) E Z Dx(yr) f(xr, yr; Z), xr x

LF γx 8(LF + γx)E xr x 2 + 2(Lεx f )2(LF + γx)

LF γx E yr y (xr) 2

+ LF γx 8(LF + γx)E xr x 2 + 2(LF + γx)

LF γx δ2 r (147)

LF γx 4(LF + γx)E xr x 2 + 2(Lεx f )2(LF + γx)

LF γx E yr y (xr) 2 + 2(LF + γx)

LF γx δ2 r, (148)

(b) is true due to (12b) and the fact that E b f(xr, yr, Zx) E b f(xr, yr, Zx) f(xr, yr) + f(xr, yr) σf + δr + f(xr, yr) , and (c) holds by applying (116).

Bilevel Optimization with Coupled Decision-Dependent Distributions

When LF γx LF + γx 4L εx,εy Z , (149)

αr LF γx LF + γx + 2αr L εx,εy Z + 2(L εx,εy Z αr)2 αr LF γx 2(LF + γx) + 1

Note that the expression of L εx,εy Z is given in (74). So, the condition

LF γx L εx,εy Z 1 (151)

is equivalent to

1 4(LF + γx)

( Lεx,εy Z + ( Lz f + Cℓxy Lz f γy

= 4(LF + γx)

γx + ( Lz f + Cℓxy Lz f γy

(94) = 4(LF + γx)

LF (Cxεx + Cxyεxεy + Cyεy) (154)

Cx ( Lz f + Cℓxy Lz f γy

γxγy . (155)

Combining (150) and (154) gives the desired result.

C.3. Convergence of the Whole Sequence

Proof. Combining (97), (98) and (137), we can get

E xr+1 x 2 + E yr+1 y (xr+1) 2

E xr x 2 + E yr y (xr) 2 | {z }

LF + γx + 1

( ( 1 + Y1αr + Lεy y (σ f + 1)σ fα2 r ) (1 βrγy) 1

( Cℓxy Lεx f αr γy + Y2α2 r + 4(Lεx f )2(LF + γx)αr

LF γx + 2(Lεx f )2α2 r

E yr y (xr) 2

( αr 2(LF + γx) (2 + F)α2 r

) E Z Dx(y (x )) f(xr, y (xr, x )) f(x , y (x ))

+ 4 ( Lεy y )2 α2 rσ2 f + Lεy y α2 rσ2 f + 4 ( Lεy y )2 α2 rδ2 r + 3

2Lεy y αrδ2 r + 4α2 rσ2 f + 4α2 rδ2 r + 4(LF + γx)

LF γx αrδ2 r

+ 2 ( 1 + Y1αr + Lεy y (σ f + 1)σ fα2 r ) (1 βrγy)β2 rσ2 ℓ (156)

where Pr is the Lyapunov function.

In order to the contraction property of Pr, we need the terms in front of E xr x 2, E yr y (xr) 2, E Z Dx(y (x )) f(xr, y (xr, x ); Z) f(x , y (x ); Z)

2 to be negative. To be more speciﬁc, we have the following

requirements.

Bilevel Optimization with Coupled Decision-Dependent Distributions

1) to make the term αr LF γx

8 ( LF γx LF +γx αr )2 + Xα2 r be negative, we can choose a small enough step size, i.e.,

LF γx 2(LF + γx) + 1

)2 αr + Xαr 0

2(LF + γx) ( 1 8 ( LF γx LF +γx

)2 + X ) 4LF γx

(LF + γx) ( LF γx LF +γx

)2 = 4(LF + γx)

so that this term is less than LF γx 2(LF +γx).

2) to make the term (1+Y1αr +Lεy y (σ f +1)σ fα2 r)(1 βrγy) 1+( Cℓxy Lεx f αr γy +Y2α2 r + 4(Lεx f )2(LF +γx)αr

LF γx +2(Lεx f )2α2 r)

be less than βrγy

2 , we can require

Y1αr + Lεy y (σ f + 1)σ fα2 r +

( Cℓxy Lεx f αr γy + Y2α2 r + 4(Lεx f )2(LF + γx)αr

LF γx + 2(Lεx f )2α2 r

2 + Y1αr + Lεy y (σ f + 1)σ fα2 r

) βrγy (158)

2 + Y1αr + Lεy y (σ f + 1)σ fα2 r

Y1 + Cℓxy Lεx f γy + 4(Lεx f )2(LF + γx)

αr + ( Lεy y (σ f + 1)σ f + Y2 + 2(Lεx f )2) α2 r

1 + Cℓxy Lεx f γy Y1 + 4(Lεx f )2(LF + γx)

+ Y 1 1 ( Lεy y (σ f + 1)σ f + Y2 + 2(Lεx f )2) αr = Θ(αr). (160)

where the last equality is true because from Lemma 3.2 in (Ghadimi & Wang, 2018) we know that only Ω(log(α 1 r )) number of data samples can make δ2 r < O(αr), so it holds that σ f σf + O(1) when we choose αr O(1/r).

3) to make the term αr 2(LF +γx) (2 + F)α2 r be positive, we can require 1 4(LF + γx) (2 + F)αr 0

= αr 1 4(LF + γx)(2 + F). (161)

Therefore, we can get

E[Pr+1] ( 1 min ( LF γxαr 2(LF + γx), βrγy

)) Pr + O ( α2 rσ2 f ) + O ( β2 rσ2 ℓ ) + O ( ((αr + α2 r)δ2 r ) (162)

If we choose αr βr, then

E[Pr+1] (1 Ω(αr)) Pr + O ( α2 r ) + O ( (αr + α2 r)δ2 r ) . (163)

Considering that only a number of data samples on the order of Ω(log(α 1 r )) can result in δ2 r < αr, as stated in Lemma 3.2 of (Ghadimi & Wang, 2018), we can conclude that

E[Pr+1] (1 Ω(αr)) Pr + O ( α2 r ) . (164)

If we choose αr βr O(1/r), applying the Robbins-Siegmund theorem (Robbins & Siegmund, 1971) gives

lim r xr x S 0, lim r yr y (x S) 0 almost surely, (165)

since r=1 αr = . Also, note that in this choice of the step size Πr r =1(1 αr ) = O(1/r), so we can have

E[Pr] = O (1

which completes the proof.

Bilevel Optimization with Coupled Decision-Dependent Distributions

C.4. Proof of Corollary 1

Proof. If the full gradient estimate is used, we can immediately from (162) have that

Pr+1 ( 1 min ( LF γxαr 2(LF + γx), βrγy

)) Pr, (167)

giving a linear convergence of Bi-GD to the BPS solution.

D. Revisiting BPO and BPS (Proof of Theorem 3)

Proof. Based on the deﬁnitions of the BPO solution (2) and BPS solution (3), we know

DR(x S, x S) E Z D(y (x S))f(x S, y (x S); Z), s.t. y (x S) = arg min y E Z D(x S)[ℓ(x S, y; Z)] , (168)

DR(x O, x O) E Z D(y (x O))f(x O, y (x O); Z), s.t. y (x O) = arg min y E Z D(x O) [ℓ(x O, y; Z)] . (169)

DR(x O, x O) DR(x S, x S) DR(x S, x O). (170)

Then, we deﬁne an auxiliary variable as follows

y (x O, x S) = arg min y E Z D(x S) [ℓ(x O, y; Z)] . (171)

From the strong convexity of ℓ(x, y), we know that

E Z D(x S) ℓ(x O, y (x O)) ℓ(x O, y (x O, x S)), y (x O) y (x O, x S) γy

2 y (x O) y (x O, x S) 2.

Similar as the derivations in (49) and (50), according the optimality conditions that both E Z D(x S) ℓ(x O, y (x O, x S); Z)

and E Z D(x O) ℓ(x O, y (x O); Z) are zero, we can obtain

2 y (x O) y (x O, x S) 2

E Z D(x S) ℓ(x O, y (x O)) E Z D(x O) ℓ(x O, y (x O)), y (x O) y (x O, x S)

Lz ℓεy x S x O y (x O) y (x O, x S) , (173)

which directly gives

y (x O) y (x O, x S) Lz ℓεy γy x S x O . (174)

Due to the strong convexity of the UL function, we have

E Z D(y (x S))f(x O, y (x O); Z) E Z D(y (x S))f(x S, y (x S); Z)

+ E Z D(y (x S)) f(x S, y (x S); Z), x O x S

2 x O x S 2. (175)

Moverover, by using the optimality condition of the UL variable x S, we can get

E Z D(y (x S)) [f(x O, y (x O); Z) f(x S, y (x S); Z)] γx

2 x O x S 2. (176)

Bilevel Optimization with Coupled Decision-Dependent Distributions

Also, note that E Z D(y (x S))f(x O, y (x O); Z) E Z D(y (x O))f(x O, y (x O); Z)

εx Lz y (x S) y (x O, x S) + y (x O, x S) y (x O) (177)

(55),(174) εx Lz

γy x S x O + Lz ℓεy γy x S x O ) . (178)

By the deﬁnition of the PO solution x O shown in (2), we have

E Z D(y (x O))f(x O, y (x O); Z) E Z D(y (x S))f(x O, y (x S); Z), (179)

which gives

E Z D(y (x S)) [f(x O, y (x O); Z) f(x O, y (x S); Z)]

E Z D(y (x S))f(x O, y (x O); Z) E Z D(y (x O))f(x O, y (x O); Z). (180)

Substituting (176) and (178) into (180) yields

2 x O x S 2 εx Lz

γy (Cℓxy + Lz ℓεy) x S x O (181)

x O x S 2εx Lz

γxγy (Cℓxy + Lz ℓεy) . (182)

E. Additional Numerical Experiments

We also evaluate the performance of Bi-SGD for the meta performative prediction learning problem by varying the values of εx and εy on the spambase data set. As shown in Figure 4, it can be observed that when either the sensitivity parameter εx or εx is large, Bi-SGD yields lower meta-training and meta-testing accuracies, which aligns with our intuition. Another interesting ﬁnding is that the meta-test accuracy with εx = 0, εy = 0.1 is signiﬁcantly higher than the case with εx = 0.1, εy = 0, indicating that the meta-testing accuracy is more affected by εx compared to εy in this example.

number of iterations

train accuracy

"x = 0; "y = 1

"x = 0; "y = 0:01 "x = 0; "y = 0:1

"x = 0:1; "y = 0 "x = 0:01; "y = 0

(a) Train accuracy

number of iterations

test accuracy

"x = 0; "y = 1

"x = 0; "y = 0:01 "x = 0; "y = 0:1

"x = 0:1; "y = 0 "x = 0:01; "y = 0

(b) Test accuracy

Figure 4. Comparison of Bi-SGD for meta strategic learning over different combinations of the sensitivity parameters at each level.