# Contextual Stochastic Bilevel Optimization

Yifan Hu (EPFL & ETH Zürich, Switzerland), Jie Wang (Georgia Tech, United States), Yao Xie (Georgia Tech, United States), Andreas Krause (ETH Zürich, Switzerland), Daniel Kuhn (EPFL, Switzerland)

Correspondence: yifan.hu@epfl.ch. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We introduce contextual stochastic bilevel optimization (CSBO), a stochastic bilevel optimization framework whose lower-level problem minimizes an expectation conditioned on some contextual information and the upper-level decision variable. This framework extends classical stochastic bilevel optimization to settings where the lower-level decision maker responds optimally not only to the decision of the upper-level decision maker but also to some side information, and where there are multiple or even infinitely many followers. It captures important applications such as meta-learning, personalized federated learning, end-to-end learning, and Wasserstein distributionally robust optimization with side information (WDRO-SI). Due to the presence of contextual information, existing single-loop methods for classical stochastic bilevel optimization are unable to converge. To overcome this challenge, we introduce an efficient double-loop gradient method based on the Multilevel Monte Carlo (MLMC) technique and establish its sample and computational complexities. When specialized to stochastic nonconvex optimization, our method matches existing lower bounds. For meta-learning, the complexity of our method does not depend on the number of tasks. Numerical experiments further validate our theoretical results.

## 1 Introduction

A contextual stochastic bilevel optimization (CSBO) problem differs from a classical stochastic bilevel optimization problem only in that its lower-level problem is conditioned on a given context $\xi$:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^{d_x}}\ & F(x) := \mathbb{E}_{\xi \sim P_\xi,\, \eta \sim P_{\eta|\xi}}\big[f(x, y^*(x; \xi); \eta, \xi)\big] && \text{(upper level)}\\
\text{where}\ & y^*(x; \xi) := \operatorname*{argmin}_{y \in \mathbb{R}^{d_y}}\ \mathbb{E}_{\eta \sim P_{\eta|\xi}}\big[g(x, y; \eta, \xi)\big] \quad \forall \xi \text{ and } \forall x. && \text{(lower level)}
\end{aligned}
\tag{1}
$$

Here $\xi \sim P_\xi$ and $\eta \sim P_{\eta|\xi}$ are random vectors, with $P_{\eta|\xi}$ denoting the conditional distribution of $\eta$ for a given $\xi$. The dimensions of the upper-level decision variable $x$ and the lower-level decision variable $y$ are $d_x$ and $d_y$, respectively. The functions $f$ and $g$ are continuously differentiable in $(x, y)$ for any given sample pair $(\xi, \eta)$. The function $f(x, y; \eta, \xi)$ can be nonconvex in $x$, but the function $g(x, y; \eta, \xi)$ must be strongly convex in $y$ for any given $x$, $\eta$ and $\xi$. Thus, $y^*(x; \xi)$ is the unique minimizer of the strongly convex lower-level problem for any given $x$ and $\xi$. Note that, on its own, the lower-level problem can be viewed as a contextual stochastic optimization problem [Bertsimas and Kallus, 2020] parametrized in $x$.

We assume that the joint distribution of $\xi$ and $\eta$ is unknown. However, we assume that we have access to any number of independent and identically distributed (i.i.d.) samples from $P_\xi$, and that, for any given realization of $\xi$, we can generate any number of i.i.d. samples from the conditional distribution $P_{\eta|\xi}$. The bilevel structure generally makes the objective function $F(x)$ nonconvex in the decision variable $x$, except in a few special cases. Thus, we aim to develop efficient gradient-based algorithms for finding an $\epsilon$-stationary point of the nonconvex objective function $F$, i.e., a point $\hat x$ satisfying the inequality $\mathbb{E}\|\nabla F(\hat x)\|^2 \le \epsilon^2$.
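To fix ideas, the sampling model assumed above amounts to two oracles: one draws contexts from $P_\xi$, and the other draws conditional samples from $P_{\eta|\xi}$ for any queried context. The following minimal Python sketch illustrates this interface; the Gaussian distributions are placeholders of our own choosing, since the true joint distribution is unknown in our setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xi(d_xi=3):
    """Draw one context xi ~ P_xi (a stand-in distribution for illustration)."""
    return rng.normal(size=d_xi)

def sample_eta_given_xi(xi, n=1):
    """Draw n i.i.d. samples eta ~ P_{eta|xi}; here a Gaussian whose mean
    depends on xi, purely as a placeholder for the unknown conditional."""
    return rng.normal(loc=xi.sum(), scale=1.0, size=n)
```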
CSBO generalizes the widely studied class of stochastic bilevel optimization (SBO) problems [Ghadimi and Wang, 2018], whose lower-level problem minimizes an unconditional expectation:

$$\min_{x \in \mathbb{R}^{d_x}} \mathbb{E}_{\xi \sim P_\xi}\big[f(x, y^*(x); \xi)\big] \quad \text{where} \quad y^*(x) := \operatorname*{argmin}_{y \in \mathbb{R}^{d_y}} \mathbb{E}_{\eta \sim P_\eta}\big[g(x, y; \eta)\big]. \tag{2}$$

Indeed, (2) is a special case of CSBO if the upper- and lower-level objective functions are stochastically independent. Another special case of CSBO is the conditional stochastic optimization (CSO) problem [Hu et al., 2020a,b, He and Kasiviswanathan, 2023, Goda and Kitade, 2023], representable as

$$\min_{x \in \mathbb{R}^{d_x}} \mathbb{E}_{\xi \sim P_\xi}\Big[f\big(x, \mathbb{E}_{\eta \sim P_{\eta|\xi}}[h(x; \eta, \xi)]; \xi\big)\Big]. \tag{3}$$

Indeed, (3) is a special case of CSBO if we set $g(x, y; \eta, \xi) = \|y - h(x; \eta, \xi)\|^2$, in which case the lower-level problem admits the unique closed-form solution $y^*(x; \xi) = \mathbb{E}_{P_{\eta|\xi}}[h(x; \eta, \xi)]$.

Applications. Despite the wide applicability of SBO to various machine learning and game theory paradigms, SBO cannot capture two important cases. The first case involves the lower-level decision maker responding optimally not only to the upper-level decision $x$ but also to some side information $\xi$ such as weather, spatial, or temporal information. The second case involves multiple lower-level decision makers, especially when their total number is large. CSBO captures both settings and encompasses various important machine learning paradigms as special cases, including meta-learning [Rajeswaran et al., 2019], personalized federated learning [Shamsian et al., 2021, Xing et al., 2022, Wang et al., 2023a], hierarchical representation learning [Yao et al., 2019], end-to-end learning [Donti et al., 2017, Sadana et al., 2023, Rychener et al., 2023, Grigas et al., 2021], Sinkhorn distributionally robust optimization (DRO) [Wang et al., 2023b], Wasserstein DRO with side information [Yang et al., 2022], information retrieval [Qiu et al., 2022], contrastive learning [Qiu et al., 2023], and instrumental variable regression [Muandet et al., 2020]. Below we provide a detailed discussion of meta-learning, personalized federated learning, and end-to-end learning.

Meta-Learning and Personalized Federated Learning. Both applications can be viewed as special cases of CSBO. For meta-learning with $M$ tasks or personalized federated learning with $M$ users, the goal is to find a common regularization center $x$ shared by all tasks or all users:

$$
\begin{aligned}
\min_x\ & \mathbb{E}_{i \sim \mu}\mathbb{E}_{D_i^{\text{test}} \sim \rho_i}\big[l_i(y_i^*(x), D_i^{\text{test}})\big] && \text{(upper level)}\\
\text{where}\ & y_i^*(x) = \operatorname*{argmin}_{y_i}\ \mathbb{E}_{D_i^{\text{train}} \sim \rho_i}\big[l_i(y_i, D_i^{\text{train}})\big] + \frac{\lambda}{2}\|y_i - x\|^2 \quad \forall i \in [M],\ \forall x. && \text{(lower level)}
\end{aligned}
\tag{4}
$$

Here, $\mu$ is the empirical uniform distribution on $[M]$. The upper-level problem minimizes the generalization loss across all tasks/users by tuning the joint regularization center $x$, and the lower-level problem finds optimal model parameters $y_i$ close to $x$ for each individual task or user; a minimal sketch of this per-task lower-level solve is given below. Note that $M$ may be as large as $O(10^3)$ in meta-learning and as large as $O(10^9)$ in personalized federated learning. Thus, it is crucial to design methods with complexity bounds independent of $M$.
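To make the lower-level problem in (4) concrete, the following sketch solves the regularized per-task problem for one task $i$ by plain SGD. The names `loss_fn` and `sample_batch` and the hyper-parameter defaults are illustrative placeholders rather than part of the method; any stochastic solver for the strongly convex objective would do.

```python
import torch

def solve_task_lower_level(x, loss_fn, sample_batch, lam, steps=100, lr=1e-2):
    """Sketch of the lower level in (4) for one task i:
    y_i*(x) = argmin_y E[l_i(y, D_train)] + (lam/2) * ||y - x||^2."""
    y = x.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        batch = sample_batch()                       # D_train ~ rho_i
        loss = loss_fn(y, batch) + 0.5 * lam * torch.sum((y - x.detach()) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return y.detach()
```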
End-to-End Learning. Traditionally, applications from inventory control to online advertising involve a two-step approach: first estimating a demand function or the click-through rate, and then making decisions based on this estimate. End-to-end learning streamlines this into a single-step method, allowing the optimization to account for estimation errors and thereby enabling more informed decisions. This can be framed as a special case of CSBO, where the upper-level problem seeks the best estimator, while the lower-level problem makes optimal decisions based on the upper-level estimator and the contextual information $\xi$. For example, in online advertising, $x$ represents the click-through-rate estimator, and $y^*(x; \xi)$ denotes the optimal advertisement display for a customer characterized by the feature vector $\xi$. For a comprehensive review, see the recent survey [Sadana et al., 2023].

Challenges. Given the wide applicability of CSBO, it is expedient to look for efficient solution algorithms. Unfortunately, when extended to CSBO, existing algorithms for SBO or CSO either suffer from sub-optimal convergence rates or are entirely unable to handle the contextual information. Indeed, a major challenge of CSBO is to estimate $y^*(x; \xi)$ for (typically) uncountably many realizations of $\xi$. In the following, we explain in more detail why existing methods fail.

If the lower-level problem is strongly convex, then SBO can be addressed with numerous efficient single-loop algorithms [Guo and Yang, 2021, Guo et al., 2021, Chen et al., 2022a, 2021, Hong et al., 2023, Yang et al., 2021]. Indeed, as the unique minimizer $y^*(x)$ of the lower-level problem in (2) depends only on the upper-level decision variable $x$, these algorithms can sequentially update the upper- and lower-level decision variables $x$ and $y$ in a single loop while ensuring that the sequence $\{y_t\}_t$ approximates $\{y^*(x_t)\}_t$. Specifically, these approaches leverage the approximation $y^*(x_{t+1}) \approx y^*(x_t) + \nabla y^*(x_t)^\top (x_{t+1} - x_t)$, which is accurate if $x$ is updated with small stepsizes. However, these algorithms generically fail to converge on CSBO problems because the minimizer $y^*(x; \xi)$ of the lower-level problem in (1) additionally depends on the context $\xi$, i.e., each realization of $\xi$ corresponds to a lower-level constraint. Consequently, there can be infinitely many lower-level constraints. It is unclear how samples from $P_{\eta|\xi}$ corresponding to a fixed context $\xi$ can be reused to estimate the minimizer $y^*(x; \xi')$ corresponding to a different context $\xi'$. Since gradient-based methods sample $\xi_t$ independently in each iteration $t$, no single sequence $\{y_t\}_t$ can approximate the sequence $\{y^*(x_t; \xi_t)\}_t$. Guo and Yang [2021] and Hu et al. [2023] analyze a special case of the CSBO problem (1) in which $\xi$ is supported on $M$ points, as in (4). However, the sample complexity of their algorithm grows linearly with $M$. In contrast, we develop methods for general CSBO problems and show that their sample complexities are independent of the support of $\xi$.

SBO problems can also be addressed with double-loop stochastic gradient descent (DL-SGD), which solves the lower-level problem to approximate optimality before updating the upper-level decision variable [Ji et al., 2021, Ghadimi and Wang, 2018]. We will show that these DL-SGD algorithms can be extended to CSBO problems, and we will analyze their sample complexity as well as their computational complexity. Unfortunately, it turns out that, when applied to CSBO problems, DL-SGD incurs high per-iteration sampling and computational costs to obtain a low-bias gradient estimator for $\nabla F$. More precisely, solving the contextual lower-level problem to $\epsilon$-optimality for a fixed $\xi$ requires $O(\epsilon^{-2})$ samples from $P_{\eta|\xi}$ and gradient evaluations of $g$, which leads to an $\widetilde O(\epsilon^{-6})$ total sample and computational complexity for obtaining an $\epsilon$-stationary point of $F$.
Methodology. Given these observations, which indicate that existing methods can fail or be sub-optimal for solving the CSBO problem, we next discuss the motivation for our algorithm design. Our goal is to build gradient estimators that share the same small bias as DL-SGD but require far fewer samples and incur a much lower computational cost, at the expense of a slightly increased variance. To obtain estimators with low bias, low variance, and low sampling and computational costs, we propose a multilevel Monte Carlo (MLMC) approach [Giles, 2015, Hu et al., 2021, Asi et al., 2021], which is reminiscent of the control variate technique, and combine it with inverse propensity weighting [Glynn and Quinn, 2010]. We refer to the proposed method as random truncated MLMC (RT-MLMC) and demonstrate that the RT-MLMC estimator for $\nabla F$ requires only $O(1)$ samples from $P_{\eta|\xi}$. This is a significant improvement vis-à-vis DL-SGD, which requires $O(\epsilon^{-2})$ samples. Consequently, the sample complexity as well as the gradient complexity over $g$ (i.e., the number of $g$-gradient evaluations) of RT-MLMC for finding an $\epsilon$-stationary point of $F$ is $\widetilde O(\epsilon^{-4})$. While the idea of using MLMC in stochastic optimization is not new [Hu et al., 2021, Asi et al., 2021], the construction of MLMC gradient estimators for CSBO and the analysis of the variance of the RT-MLMC gradient estimators are novel contributions of this work.

### 1.1 Our Contributions

We introduce CSBO as a unifying framework for a broad range of machine learning tasks and optimization problems. We propose two methods, DL-SGD and RT-MLMC, and analyze their complexities; see Table 1 for a summary. When specialized to SBO and CSO problems, RT-MLMC matches the performance of the state-of-the-art algorithms for SBO [Chen et al., 2021] and CSO [Hu et al., 2021], respectively. When specialized to classical stochastic nonconvex optimization, RT-MLMC matches the lower bounds by Arjevani et al. [2023].

Table 1: Complexity of the RT-MLMC and DL-SGD algorithms for finding an $\epsilon$-stationary point of $F$. The sample complexity refers to the total number of samples from $P_\xi$ as well as $P_{\eta|\xi}$.

| Nonconvex CSBO | Sample Complexity | Gradient Complexity of $g$ | Gradient Complexity of $f$ | Per-Iteration Memory Cost |
| --- | --- | --- | --- | --- |
| RT-MLMC | $\widetilde O(\epsilon^{-4})$ | $\widetilde O(\epsilon^{-4})$ | $\widetilde O(\epsilon^{-4})$ | $O(d_x + d_y)$ |
| DL-SGD | $\widetilde O(\epsilon^{-6})$ | $\widetilde O(\epsilon^{-6})$ | $\widetilde O(\epsilon^{-4})$ | $O(d_x + d_y)$ |

For meta-learning with $M$ tasks, the complexity bounds of RT-MLMC are constant in $M$. Thus, RT-MLMC outperforms the methods by Guo and Yang [2021] and Hu et al. [2023] when $M$ is large. For Wasserstein DRO with side information [Yang et al., 2022], existing methods only cater for affine and non-parametric decision rules. In contrast, RT-MLMC allows for neural network approximations. We also present the first sample and gradient complexity bounds for WDRO-SI. For meta-learning and Wasserstein DRO with side information, our experiments show that the RT-MLMC gradient estimator can be computed an order of magnitude faster than the DL-SGD gradient estimator, especially when the contextual lower-level problem is solved to high accuracy.

Preliminaries. For any function $\psi: \mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \to \mathbb{R}$ with arguments $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$, we use $\nabla\psi$, $\nabla_1\psi$ and $\nabla_2\psi$ to denote the gradients of $\psi$ with respect to $(x, y)$, $x$ and $y$, respectively. Similarly, we use $\nabla^2\psi$, $\nabla^2_{11}\psi$ and $\nabla^2_{22}\psi$ to denote the Hessians of $\psi$ with respect to $(x, y)$, $x$ and $y$, respectively. In addition, $\nabla^2_{12}\psi$ stands for the $(d_x \times d_y)$-matrix with entries $\partial^2\psi/\partial x_i\partial y_j$.
A function $\varphi: \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz continuous if $|\varphi(x) - \varphi(x')| \le L\|x - x'\|$ for all $x, x' \in \mathbb{R}^d$, and it is $S$-Lipschitz smooth if it is continuously differentiable and satisfies $\|\nabla\varphi(x) - \nabla\varphi(x')\| \le S\|x - x'\|$ for all $x, x' \in \mathbb{R}^d$. In addition, $\varphi$ is called $\mu$-strongly convex if it is continuously differentiable and if $\varphi(x) - \varphi(x') \ge \nabla\varphi(x')^\top(x - x') + \frac{\mu}{2}\|x - x'\|^2$ for all $x, x' \in \mathbb{R}^d$. The identity matrix is denoted by $I$. Finally, we use $\widetilde O(\cdot)$ as a variant of the classical $O(\cdot)$ symbol that hides logarithmic factors.

## 2 Algorithms for Contextual Stochastic Bilevel Optimization

Throughout the paper, we make the following assumptions. Similar assumptions appear in the SBO literature [Ghadimi and Wang, 2018, Guo and Yang, 2021, Chen et al., 2022a, 2021, Hong et al., 2023].

Assumption 2.1. The CSBO problem (1) satisfies the following regularity conditions:

(i) $f$ is continuously differentiable in $x$ and $y$ for any fixed $\eta$ and $\xi$, and $g$ is twice continuously differentiable in $x$ and $y$ for any fixed $\eta$ and $\xi$.

(ii) $g$ is $\mu_g$-strongly convex in $y$ for any fixed $x$, $\eta$ and $\xi$.

(iii) $f$, $g$, $\nabla f$, $\nabla g$ and $\nabla^2 g$ are $L_{f,0}$-, $L_{g,0}$-, $L_{f,1}$-, $L_{g,1}$- and $L_{g,2}$-Lipschitz continuous in $(x, y)$ for any fixed $\eta$ and $\xi$, respectively.

(iv) If $(\eta, \xi) \sim P_{(\eta,\xi)}$, then $\nabla f(x, y; \eta, \xi)$ is an unbiased estimator for $\nabla\mathbb{E}_{(\eta,\xi) \sim P_{(\eta,\xi)}}[f(x, y; \eta, \xi)]$ with variance at most $\sigma_f^2$ uniformly across all $x$ and $y$. Also, if $\eta \sim P_{\eta|\xi}$, then $\nabla g(x, y; \eta, \xi)$ is an unbiased estimator for $\nabla\mathbb{E}_{\eta \sim P_{\eta|\xi}}[g(x, y; \eta, \xi)]$ with variance at most $\sigma_{g,1}^2$, and $\nabla^2 g(x, y; \eta, \xi)$ is an unbiased estimator for $\nabla^2\mathbb{E}_{\eta \sim P_{\eta|\xi}}[g(x, y; \eta, \xi)]$ with variance at most $\sigma_{g,2}^2$, uniformly across all $x$, $y$ and $\xi$.

Assumption 2.1 ensures that problem (1) is well-defined. In particular, by slightly adapting the proofs of [Ghadimi and Wang, 2018, Lemma 2.2] and [Hong et al., 2023, Lemma 2], it allows us to show that $F$ is $L_F$-Lipschitz continuous as well as $S_F$-Lipschitz smooth for some $L_F, S_F > 0$. Assumptions 2.1 (i-iii) also imply that the gradients of $f$ and $g$ with respect to $(x, y)$ can be interchanged with the expectations with respect to $(\eta, \xi) \sim P_{(\eta,\xi)}$ and $\eta \sim P_{\eta|\xi}$. Hence, Assumptions 2.1 (i-iii) readily imply the unbiasedness of the gradient estimators imposed in Assumption 2.1 (iv). In fact, only the uniform variance bounds do not already follow from Assumptions 2.1 (i-iii).

In order to design SGD-type algorithms for problem (1), we first construct gradient estimators for $\nabla F$. To this end, we observe that the Jacobian $\nabla_1 y^*(x; \xi) \in \mathbb{R}^{d_x \times d_y}$ exists and is Lipschitz continuous in $x$ for any fixed $\xi$ thanks to [Chen et al., 2021, Lemma 2]. By the chain rule, we therefore have

$$\nabla F(x) = \mathbb{E}_{(\eta,\xi) \sim P_{(\eta,\xi)}}\Big[\nabla_1 f(x, y^*(x;\xi); \eta, \xi) + \nabla_1 y^*(x;\xi)\, \nabla_2 f(x, y^*(x;\xi); \eta, \xi)\Big].$$

Algorithm 1 Epoch SGD$(K, x, \xi, y_1^0)$
Input: number of epochs $K$, sample $\xi$, upper-level decision $x$, initial iterate $y_1^0$.
1: for $k = 1$ to $K$ do
2:  for $j = 0$ to $2^k - 1$ do
3:   Sample $\eta_k^j$ from $P_{\eta|\xi}$ and update $y_k^{j+1} = y_k^j - \beta_0 2^{-k}\, \nabla_2 g(x, y_k^j; \eta_k^j, \xi)$.
4:  end for
5:  Update $y_{k+1}^0 = 2^{-k}\sum_{j=0}^{2^k-1} y_k^j$.
6: end for
Output: $y_1^0$, $y_K^0$, and $y_{K+1}^0$.

By following a similar procedure as in [Ghadimi and Wang, 2018], we can derive an explicit formula for $\nabla_1 y^*(x;\xi)$ (for details we refer to Appendix B) and substitute it into the above equation to obtain

$$\nabla F(x) = \mathbb{E}_{(\eta,\xi) \sim P_{(\eta,\xi)}}\Big[\nabla_1 f(x, y^*(x;\xi); \eta, \xi) - \mathbb{E}_{\eta' \sim P_{\eta|\xi}}\big[\nabla^2_{12} g(x, y^*(x;\xi); \eta', \xi)\big]\, \Lambda(x, y^*(x;\xi); \xi)\, \nabla_2 f(x, y^*(x;\xi); \eta, \xi)\Big],$$

where $\Lambda(x, y; \xi) = \big(\mathbb{E}_{\eta \sim P_{\eta|\xi}}[\nabla^2_{22} g(x, y; \eta, \xi)]\big)^{-1}$. Thus, the main challenges in constructing a gradient estimator are to compute and store the minimizer $y^*(x; \xi)$ as well as the inverse expected Hessian matrix $\Lambda(x, y; \xi)$ for all (potentially uncountably many) realizations of $\xi$. Computing these two objects exactly would be too expensive. In the remainder of this section, we thus derive estimators for $y^*(x; \xi)$ and $\Lambda(x, y; \xi)$, and we combine these two estimators to construct an estimator for $\nabla F(x)$.
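Before turning to estimators, the implicit gradient formula above can be sanity-checked on a toy instance where everything is available in closed form. The sketch below assumes a deterministic quadratic lower level $g(x, y) = \frac{1}{2}y^\top A y - y^\top B x$ and upper level $f(x, y) = \frac{1}{2}\|y\|^2$, choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 3, 4
B = rng.normal(size=(dy, dx))
M = rng.normal(size=(dy, dy))
A = M @ M.T + np.eye(dy)       # positive definite: strongly convex lower level

# g(x, y) = 0.5 y^T A y - y^T B x  =>  y*(x) = A^{-1} B x
# f(x, y) = 0.5 ||y||^2            =>  F(x) = 0.5 ||A^{-1} B x||^2
x = rng.normal(size=dx)
y_star = np.linalg.solve(A, B @ x)

grad1_f = np.zeros(dx)         # nabla_1 f = 0 for this f
grad2_f = y_star               # nabla_2 f = y
hess12_g = -B.T                # nabla^2_{12} g, shape (dx, dy)
Lam = np.linalg.inv(A)         # Lambda = (nabla^2_{22} g)^{-1}

grad_F = grad1_f - hess12_g @ Lam @ grad2_f

# closed form: grad F(x) = B^T A^{-1} A^{-1} B x (A is symmetric)
assert np.allclose(grad_F, B.T @ np.linalg.solve(A, np.linalg.solve(A, B @ x)))
```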
Estimating $y^*(x; \xi)$. We estimate $y^*(x; \xi)$ using the gradient-based method Epoch SGD by Hazan and Kale [2014], Asi et al. [2021], which involves $K$ epochs of increasing lengths. Each epoch $k = 1, \dots, K$ starts from the average of the iterates computed in epoch $k - 1$ and then applies $2^k$ stochastic gradient steps to the lower-level problem with stepsize $\beta_0 2^{-k}$ (see Algorithm 1). In the following, we use the output $y_{K+1}^0$ of Algorithm 1 with inputs $K$, $x$, $\xi$ and $y_1^0$ as an estimator for the minimizer $y^*(x; \xi)$ of the lower-level problem. We use Epoch SGD for the following two reasons. First, Epoch SGD attains the optimal convergence rate for strongly convex stochastic optimization in the gradient oracle model [Hazan and Kale, 2014]. In addition, it is widely used in practical machine learning training procedures. Note that $y^*(x; \xi)$ could also be estimated via classical SGD. Even though this would lead to similar complexity results, the analysis would become more cumbersome.

Estimating $\Lambda(x, y; \xi)$. Following [Ghadimi and Wang, 2018], one can estimate the inverse of an expected random matrix $A$ with $0 \preceq A \preceq I$ using a Neumann series argument. Specifically, we have

$$\big[\mathbb{E}_{A \sim P_A} A\big]^{-1} = \sum_{n=0}^{\infty}\big(I - \mathbb{E}_{A \sim P_A} A\big)^n = \sum_{n=0}^{\infty} \mathbb{E}_{A_1,\dots,A_n \sim P_A}\Big[\prod_{i=1}^{n}(I - A_i)\Big] \approx \sum_{n=0}^{N-1} \mathbb{E}_{A_1,\dots,A_n \sim P_A}\Big[\prod_{i=1}^{n}(I - A_i)\Big].$$

The truncated series on the right-hand side provides a good approximation if $N \gg 1$. Assumption 2.1 (iii) implies that $0 \preceq \nabla^2_{22} g(x, y; \eta_n, \xi) \preceq 2L_{g,1} I$. Hence, the above formula can be applied to $A = \frac{1}{2L_{g,1}}\nabla^2_{22} g(x, y; \eta_n, \xi)$, which gives rise to an attractive estimator for $\Lambda(x, y; \xi)$ of the form

$$\hat\Lambda(x, y; \xi) := \begin{cases} \dfrac{N}{2L_{g,1}}\, I & \text{if } \hat N = 0,\\[2mm] \dfrac{N}{2L_{g,1}} \displaystyle\prod_{n=1}^{\hat N}\Big(I - \tfrac{1}{2L_{g,1}}\nabla^2_{22} g(x, y; \eta_n, \xi)\Big) & \text{if } \hat N \ge 1. \end{cases} \tag{5}$$

Here, $\hat N$ is a random integer drawn uniformly from $\{0, 1, \dots, N-1\}$ that is independent of the i.i.d. samples $\eta_1, \dots, \eta_{\hat N}$ from $P_{\eta|\xi}$. Chen et al. [2022b] showed that the estimator (5) displays the following properties: its bias decays exponentially with $N$, its variance grows quadratically with $N$, and its sampling cost grows linearly with $N$. Below we call $N$ the approximation number.

Estimating $\nabla F(x)$ via DL-SGD. For any given $K$ and $N$, we construct the DL-SGD estimator for the gradient of $F$ via the following procedure: (i) generate a sample $\xi$ from $P_\xi$, (ii) generate i.i.d. samples $\eta'$ and $\eta''$ from the conditional distribution $P_{\eta|\xi}$, (iii) run Epoch SGD as described in Algorithm 1 with an arbitrary initial iterate $y_1^0$ to obtain $y_{K+1}^0$, and (iv) construct $\hat\Lambda(x, y_{K+1}^0; \xi)$ as in (5). Using these ingredients, we can now construct the DL-SGD gradient estimator as

$$\hat v_K(x) := \nabla_1 f(x, y_{K+1}^0; \eta', \xi) - \nabla^2_{12} g(x, y_{K+1}^0; \eta'', \xi)\, \hat\Lambda(x, y_{K+1}^0; \xi)\, \nabla_2 f(x, y_{K+1}^0; \eta', \xi). \tag{6}$$

Algorithm 2 SGD Framework
Input: number of iterations $T$, stepsizes $\{\alpha_t\}_{t=1}^T$, initial iterate $x_1$.
1: for $t = 1$ to $T$ do
2:  Construct a gradient estimator $v(x_t)$ and update $x_{t+1} = x_t - \alpha_t v(x_t)$.
3: end for
Output: $\hat x_T$ uniformly sampled from $\{x_1, \dots, x_T\}$.

In Lemma 2 below, we will analyze the bias and variance as well as the sampling and computational costs of the DL-SGD gradient estimator. We will see that a small bias $\|\mathbb{E}[\hat v_K(x)] - \nabla F(x)\| \le \epsilon$ can be ensured by setting $K = O(\log(\epsilon^{-1}))$, in which case Epoch SGD computes $O(\epsilon^{-2})$ stochastic gradients of $g$. From now on, we refer to Algorithm 2 with $v(x) = \hat v_K(x)$ as the DL-SGD algorithm.
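As a complement to the pseudocode, the following sketch implements the randomized truncated Neumann-series estimator (5) with explicit matrices. Here `sample_hessian` is a placeholder for a stochastic Hessian oracle returning i.i.d. samples of $\nabla^2_{22} g(x, y; \eta, \xi)$ with $0 \preceq H \preceq 2L_{g,1} I$.

```python
import numpy as np

def neumann_inverse_estimate(sample_hessian, d, N, L_g1, rng):
    """Randomized truncated Neumann-series estimator (5) for
    Lambda = (E[H])^{-1}, where H are d x d Hessian samples
    satisfying 0 <= H <= 2*L_g1*I in the PSD order."""
    N_hat = int(rng.integers(N))          # uniform on {0, ..., N-1}
    prod = np.eye(d)
    for _ in range(N_hat):                # product of N_hat independent factors
        prod = prod @ (np.eye(d) - sample_hessian() / (2.0 * L_g1))
    return (N / (2.0 * L_g1)) * prod
```

Averaging many independent calls recovers the truncated series, so the bias decays exponentially in $N$ while each call touches at most $N - 1$ Hessian samples.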
### 2.1 RT-MLMC Gradient Estimator

The bottleneck in evaluating the DL-SGD gradient estimator is the computation of $y_{K+1}^0$. The computational costs can be reduced, however, by exploiting the telescoping sum property

$$\hat v_K(x) = \hat v_1(x) + \sum_{k=1}^{K-1}\big[\hat v_{k+1}(x) - \hat v_k(x)\big] = \hat v_1(x) + \sum_{k=1}^{K-1} p_k\, \frac{\hat v_{k+1}(x) - \hat v_k(x)}{p_k} = \hat v_1(x) + \mathbb{E}_{\hat k \sim P_{\hat k}}\bigg[\frac{\hat v_{\hat k+1}(x) - \hat v_{\hat k}(x)}{p_{\hat k}}\bigg],$$

where $\hat v_{\hat k}$ is defined as in (6) with $k = 1, \dots, K$ replacing $K$, and where $P_{\hat k}$ is a truncated geometric distribution with $P_{\hat k}(\hat k = k) = p_k \propto 2^{-k}$ for every $k = 1, \dots, K-1$. This observation prompts us to construct the RT-MLMC gradient estimator as

$$\hat v(x) = \hat v_1(x) + p_{\hat k}^{-1}\big(\hat v_{\hat k+1}(x) - \hat v_{\hat k}(x)\big). \tag{7}$$

The RT-MLMC gradient estimator has three key properties:

- It is an unbiased estimator for the DL-SGD gradient estimator, i.e., $\mathbb{E}_{\hat k \sim P_{\hat k}}[\hat v(x)] = \hat v_K(x)$.
- Evaluating $\hat v(x)$ requires computing $y_{k+1}^0(x, \xi)$ with probability $p_k$, which decays exponentially with $k$. To ensure a small bias, we need to set $K = O(\log(\epsilon^{-1}))$, and thus $p_K = O(\epsilon)$. Hence, most of the time, Epoch SGD only needs to run over $k \ll K$ epochs. As a result, the average sampling and computational costs are markedly smaller for RT-MLMC than for DL-SGD.
- Since $\hat v_{k+1}(x)$ and $\hat v_k(x)$ differ only in $y_{k+1}^0$ and $y_k^0$, both of which are generated by the same run of Epoch SGD and are thus highly correlated, $\hat v_{k+1}(x) - \hat v_k(x)$ has a small variance thanks to a control variate effect [Nelson, 1990]. Hence, the variance of RT-MLMC is well-controlled, as shown in Lemma 2.

In Lemma 2 below, we will analyze the bias and variance as well as the sampling and computational costs of the RT-MLMC gradient estimator. We will see that it requires only $O(1)$ samples to ensure that the bias drops to $O(\epsilon)$. This is in stark contrast to the DL-SGD estimator, which needs $O(\epsilon^{-2})$ samples. The lower sample complexity and the corresponding lower computational cost come at the expense of an increased variance of the order $O(\log(\epsilon^{-1}))$. The construction of the RT-MLMC gradient estimator is detailed in Algorithm 3. From now on, we refer to Algorithm 2 with $v(x) = \hat v(x)$ as the RT-MLMC algorithm.

Algorithm 3 RT-MLMC Gradient Estimator for Conditional Bilevel Optimization
Input: iterate $x$, largest epoch number $K$, initialization $y_1^0$, approximation number $N$.
1: Sample $\xi$ from $P_\xi$ and sample $\hat N$ uniformly from $\{0, \dots, N-1\}$.
2: Sample $\hat k$ from the truncated geometric distribution $P_{\hat k}$.
3: Run Epoch SGD$(\hat k, x, \xi, y_1^0)$ and obtain $y_{\hat k+1}^0$, $y_{\hat k}^0$, and $y_1^0$.
4: Construct $\hat v_{\hat k+1}(x)$, $\hat v_{\hat k}(x)$, and $\hat v_1(x)$ according to (6) and compute $\hat v(x) = \hat v_1(x) + p_{\hat k}^{-1}(\hat v_{\hat k+1}(x) - \hat v_{\hat k}(x))$.
Output: $\hat v(x)$.

### 2.2 Memory and Arithmetic Operational Costs

The per-iteration memory and arithmetic operational costs of DL-SGD as well as RT-MLMC are dominated by the cost of computing the matrix-vector product

$$\hat c(x, y; \xi) := \nabla^2_{12} g(x, y; \eta'', \xi)\, \hat\Lambda(x, y; \xi)\, \nabla_2 f(x, y; \eta', \xi). \tag{8}$$

By (5), $\hat\Lambda(x, y; \xi)$ is a product of $\hat N$ matrices of the form $I - \frac{1}{2L_{g,1}}\nabla^2_{22} g(x, y; \eta_n, \xi)$, and the $n$-th matrix coincides with the Jacobian of $y - \frac{1}{2L_{g,1}}\nabla_2 g(x, y; \eta_n, \xi)$ with respect to $y$. We can thus compute (8) recursively as follows. We first set $v = \nabla_2 f(x, y; \eta', \xi)$. Next, we update $v$ by setting it to the gradient of $\big(y - \frac{1}{2L_{g,1}}\nabla_2 g(x, y; \eta_{\hat N}, \xi)\big)^\top v$ with respect to $y$, which is computed via automatic differentiation. This yields $v = \big(I - \frac{1}{2L_{g,1}}\nabla^2_{22} g(x, y; \eta_{\hat N}, \xi)\big)\nabla_2 f(x, y; \eta', \xi)$. By using a backward recursion with respect to $n$, we can continue to multiply $v$ from the left with the other matrices in the expansion of $\hat\Lambda(x, y; \xi)$. This procedure is highly efficient because the memory and arithmetic operational cost of computing the product of a Hessian matrix with a constant vector via automatic differentiation is bounded, up to a universal constant, by the cost of computing the gradient of the same function [Rajeswaran et al., 2019]. See Algorithm 4 in Appendix C for details. The expected arithmetic operational cost of Algorithm 4 is $O(Nd)$ and the memory cost is $O(d)$.
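In an automatic differentiation framework, the backward recursion just described never forms a Hessian explicitly. The sketch below is a minimal PyTorch rendition under the assumption that `g_fn` and `f_fn` are scalar-valued functions of tensors `x` and `y` created with `requires_grad=True`; it mirrors the recursion of Algorithm 4 in Appendix C but is not a verbatim transcription of it.

```python
import torch

def mixed_hvp_estimate(g_fn, f_fn, x, y, etas, eta_p, eta_pp, N, L_g1):
    """Estimate c = nabla^2_{12} g * Lambda_hat * nabla_2 f from (8) using
    only Hessian-vector products; etas plays the role of eta_1..eta_{N_hat},
    eta_p of eta', and eta_pp of eta''."""
    v = torch.autograd.grad(f_fn(x, y, eta_p), y)[0].detach()
    for eta in etas:
        # apply one factor (I - nabla^2_{22} g / (2 L_g1)) to v
        g_y = torch.autograd.grad(g_fn(x, y, eta), y, create_graph=True)[0]
        hv = torch.autograd.grad(g_y, y, grad_outputs=v)[0]
        v = (v - hv / (2.0 * L_g1)).detach()
    v = (N / (2.0 * L_g1)) * v                 # prefactor of Lambda_hat in (5)
    g_y = torch.autograd.grad(g_fn(x, y, eta_pp), y, create_graph=True)[0]
    return torch.autograd.grad(g_y, x, grad_outputs=v)[0]  # nabla^2_{12} g * v
```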
## 3 Complexity Bounds

In this section, we derive the sample and gradient complexities of the proposed algorithms. We first analyze the error of the general SGD framework detailed in Algorithm 2.

Lemma 1 (Error Analysis of Algorithm 2). If Algorithm 2 with constant stepsize $\alpha_t = \alpha \le 1/(2S_\psi)$ is used to minimize an $L_\psi$-Lipschitz continuous and $S_\psi$-Lipschitz smooth function $\psi(x)$, then we have

$$\mathbb{E}\|\nabla\psi(\hat x_T)\|^2 \le \frac{2A_1}{\alpha T} + \frac{2}{T}\sum_{t=1}^{T}\Big(L_\psi\,\mathbb{E}\big\|\mathbb{E}[v(x_t) - \nabla\psi(x_t)\mid x_t]\big\| + S_\psi\alpha\, \mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2\Big),$$

where $A_1 := \psi(x_1) - \min_x \psi(x)$.

Lemma 1 slightly generalizes [Rakhlin et al., 2012, Ghadimi et al., 2016, Bottou et al., 2018]. We defer the proof to Appendix A. Thus, to prove convergence to a stationary point, we need to characterize the bias, variance, and computational costs of the DL-SGD and the RT-MLMC gradient estimators.

Lemma 2 (Bias, Variance, Sampling Cost and Computational Cost). We have the following results.

- The biases of the DL-SGD and RT-MLMC estimators match and satisfy
$$\big\|\mathbb{E}\hat v_K(x) - \nabla F(x)\big\| = \big\|\mathbb{E}\hat v(x) - \nabla F(x)\big\| \le O\Big(\mu_g^{-1}\big(1 - \mu_g/(2L_{g,1})\big)^N\Big) + O\big(N^2 2^{-K/2}\big),$$
and the corresponding variances satisfy $\mathrm{Var}(\hat v_K(x)) = O(N^2)$ and $\mathrm{Var}(\hat v(x)) = O(KN^4)$.
- The numbers of samples and iterations needed by Epoch SGD to build a DL-SGD estimator are bounded by $N + 2^{K+1} - 1$ and $2^{K+1} - 1$, respectively. The expected numbers of samples and iterations needed for an RT-MLMC estimator are bounded by $N + 3K$ and $3K$, respectively.

Lemma 2 implies that setting $N = O(\log(\epsilon^{-1}))$ and $K = O(\log(\epsilon^{-1}))$ reduces the bias to $O(\epsilon)$. In this case the RT-MLMC estimators have higher variances than the DL-SGD estimators, but their variances are still of the order $O(\log(\epsilon^{-1}))$. On the other hand, using RT-MLMC estimators reduces the per-iteration sampling and computational costs from $O(2^K) = O(\epsilon^{-2})$ to $O(K) = O(\log(\epsilon^{-1}))$. Note that Hu et al. [2021] characterize the properties of general MLMC estimators and derive their complexity bounds. However, the proposed RT-MLMC estimators for CSBO problems are the first of their kind. In addition, as we need to estimate the Hessian inverse $\Lambda(x, y^*(x;\xi); \xi)$, our analysis is more involved. In contrast to Asi et al. [2021], who use MLMC techniques for estimating projections and proximal points, we use MLMC techniques for estimating gradients in bilevel optimization. The following main theorem summarizes our complexity bounds.

Theorem 1 (Complexity Bounds). If Assumption 2.1 holds, then Algorithm 2 based on the RT-MLMC or the DL-SGD estimator outputs an $\epsilon$-stationary point of $F$ provided that $N = O(\log(\epsilon^{-1}))$, $K = O(\log(\epsilon^{-1}))$, $\alpha = O(\epsilon^2)$ and $T = O(\epsilon^{-4})$. When using the RT-MLMC estimator, the sample complexities of $\xi$ and $\eta$ as well as the gradient complexities of $g$ and $f$ are $\widetilde O(\epsilon^{-4})$. When using the DL-SGD estimator, the sample complexity of $\eta$ and the gradient complexity of $g$ are $\widetilde O(\epsilon^{-6})$, while the sample complexity of $\xi$ and the gradient complexity of $f$ are $\widetilde O(\epsilon^{-4})$.
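The cost gap quantified in Lemma 2 is easy to see numerically: DL-SGD always pays $2^{K+1} - 1$ lower-level iterations, whereas RT-MLMC pays $2^{\hat k+1} - 1$ iterations only with probability $p_{\hat k} \propto 2^{-\hat k}$. A few lines of Python confirm the $O(K)$ expected cost:

```python
# Expected per-estimator Epoch SGD iterations: DL-SGD vs. RT-MLMC (Lemma 2).
for K in (6, 8, 10, 12):
    p = [2.0 ** -k for k in range(1, K + 1)]
    p = [pk / sum(p) for pk in p]                  # truncated geometric weights
    rt = sum(pk * (2 ** (k + 1) - 1) for k, pk in enumerate(p, start=1))
    print(f"K={K}: DL-SGD {2 ** (K + 1) - 1}, RT-MLMC {rt:.1f}")
```

For $K = 12$ this prints 8191 iterations for DL-SGD versus roughly 23 in expectation for RT-MLMC, consistent with the $2^{K+1} - 1$ versus $3K$ bounds.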
Remark. Theorem 1 asserts that the sample complexity of $\eta$ and the gradient complexity of $g$ are much smaller for RT-MLMC than for DL-SGD, while the gradient complexities of $f$ are comparable. When specialized to SBO or CSO problems, the complexity bounds of RT-MLMC match those of the state-of-the-art algorithms, namely ALSET for SBO problems [Chen et al., 2021] and MLMC-based methods for CSO problems [Hu et al., 2021]. When restricted to classical stochastic nonconvex optimization, the complexity bounds of RT-MLMC match the existing lower bounds [Arjevani et al., 2023]. These observations further highlight the effectiveness of RT-MLMC across various settings.

## 4 Applications and Numerical Experiments

### 4.1 Meta-Learning

Optimization-based meta-learning [Finn et al., 2017, Rajeswaran et al., 2019] aims to find a common shared regularization parameter for multiple similar yet different machine learning tasks, in order to avoid the overfitting that would result from training each task separately on its own dataset, which may contain only a few data points. Recall the objective of optimization-based meta-learning from (4) [Rajeswaran et al., 2019]:

$$
\begin{aligned}
\min_x\ & \mathbb{E}_{i \sim \mu}\mathbb{E}_{D_i^{\text{test}} \sim \rho_i}\big[l_i(y_i^*(x), D_i^{\text{test}})\big]\\
\text{where}\ & y_i^*(x) = \operatorname*{argmin}_{y_i}\ \mathbb{E}_{D_i^{\text{train}} \sim \rho_i}\big[l_i(y_i, D_i^{\text{train}})\big] + \frac{\lambda}{2}\|y_i - x\|^2 \quad \forall i \in [M],\ \forall x.
\end{aligned}
\tag{9}
$$

Here, $\mu$ is the distribution over all $M$ tasks, $\rho_i$ is the data distribution of task $i$, $D_i^{\text{train}}$ and $D_i^{\text{test}}$ are the training and testing datasets of task $i$, $x$ is the parameter shared by all tasks, $y_i^*(x)$ is the best parameter for the regularized objective of task $i$, $l_i$ is a loss function that measures the average loss on the dataset of the $i$-th task, and $\lambda$ is the regularization parameter ensuring that the optimal solution of the lower-level problem is not too far from the shared parameter obtained from the upper-level problem. Note that the same problem also occurs in personalized federated learning, with each lower level corresponding to one user, and that the task distribution $\mu$ is usually replaced by the average over all $M$ tasks. In such cases, existing works [Guo and Yang, 2021, Rajeswaran et al., 2019] only establish convergence rates that scale linearly with the number of tasks $M$. In contrast, the sample complexity of our proposed method does not depend on the number of tasks $M$, enabling substantially faster computation for large $M$.

The seminal model-agnostic meta-learning (MAML) method [Finn et al., 2017] approximates problem (9) by replacing $y_i^*(x)$ with a one-step gradient update, i.e., $\hat y_i(x) := x - \lambda^{-1}\nabla l_i(x, D_i^{\text{train}})$; a minimal code sketch of this update follows below.
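For reference, the one-step MAML update is a single gradient step from the shared parameter; a hedged PyTorch sketch, with `loss_fn` and the data batch as placeholders, is:

```python
import torch

def maml_one_step(x, loss_fn, train_batch, lam):
    """One-step MAML approximation of y_i*(x) in (9):
    y_hat_i(x) = x - (1/lam) * grad l_i(x, D_train)  [Finn et al., 2017]."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(x, train_batch), x)[0]
    return (x - grad / lam).detach()
```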
We study the case where each loss function $l_i(x, D)$ is a multi-class logistic loss with a linear classifier. The experiment is conducted on tiny ImageNet [Mnmoustafa, 2017], pre-processed with the pre-trained ResNet-18 network [He et al., 2016] to extract linear features. Since the network has learned a rich set of hierarchical features from the ImageNet dataset [Deng et al., 2009], it typically extracts useful features for other image datasets. Note that each task consists of labels with similar characteristics.

Figure 1: Test error of DL-SGD (left) and RT-MLMC (right) compared with the MAML baseline, plotted against upper-level iterations on meta-learning. K represents how accurately we solve the lower-level problem.

Table 2: Computation time (in seconds) of the DL-SGD and RT-MLMC gradient estimators on meta-learning.

| K | DL-SGD Mean | DL-SGD Variance | RT-MLMC Mean | RT-MLMC Variance |
| --- | --- | --- | --- | --- |
| 6 | 2.65e-02 | 6.34e-03 | 2.73e-02 | 1.46e-02 |
| 8 | 7.23e-02 | 7.77e-03 | 3.41e-02 | 1.85e-02 |
| 10 | 2.48e-01 | 2.75e-02 | 4.93e-02 | 4.06e-02 |
| 12 | 9.38e-01 | 3.71e-02 | 1.08e-01 | 5.44e-02 |

Figure 1 presents the average logistic loss evaluated on the test dataset against the number of iterations, with each iteration representing one upper-level update. From the plot, we see that both the DL-SGD and RT-MLMC methods tend to achieve better generalization performance when using a larger number of levels K. As shown in Table 2, RT-MLMC computes the upper-level gradient estimator about 9 times faster than DL-SGD when K is large. In contrast, the MAML baseline does not achieve superior performance, since the one-step gradient update does not solve the lower-level problem to approximate global optimality. In Appendix D.1, we provide numerical results for a modified MAML algorithm with multiple gradient updates, which achieves better performance than MAML but is still worse than our proposed method.

After the initial submission of this paper, the concurrent work Hu et al. [2023] proposed two algorithms (BSVRBv1 and BSVRBv2) that apply to the meta-learning formulation (9). Their algorithm BSVRBv1 is computationally expensive, as it requires the exact computation of the inverse of the lower-level Hessian matrix (of size 5120 × 5120 in this example) in each upper-level iteration. In the following, we therefore compare our algorithm with their Hessian-free algorithm BSVRBv2 in Figure 2. In the left plot of Figure 2, we examine the performance of the RT-MLMC method and BSVRBv2 over the same number of total epochs; the RT-MLMC method achieves a much better test error. In the right plot of Figure 2, we examine both methods over the same amount of computational time. Although the per-upper-level-iteration computational cost of BSVRBv2 is small, it takes BSVRBv2 a much longer time to achieve a test error similar to that of RT-MLMC.

Figure 2: Left: Performance of the proposed RT-MLMC algorithm and BSVRBv2 against upper-level iterations on meta-learning. Right: Performance of the algorithms against total computational time on meta-learning (the marked point indicates where the outer iteration number of BSVRBv2 equals 1e4).

### 4.2 Wasserstein DRO with Side Information

WDRO-SI [Yang et al., 2022] studies robust stochastic optimization with side information [Bertsimas and Kallus, 2020]. Let $\xi$ denote the side information and $\eta$ the randomness dependent on $\xi$. WDRO-SI seeks a parameterized mapping $f(x; \xi)$ from the side information $\xi$ to a decision $w$ that minimizes the expected loss with respect to $\eta$ under the worst-case distributional shift over $(\xi, \eta)$. Rigorously, with a penalty on the distributional robustness constraint, WDRO-SI admits the form

$$\min_x \max_P\ \Big\{\mathbb{E}_{(\xi,\eta) \sim P}\big[l(f(x; \xi); \eta)\big] - \lambda\, \mathcal{C}(P, P_0)\Big\}, \tag{10}$$

where $l(w; \eta)$ is the loss function dependent on the decision $w$ and the random variable $\eta$, $P_0$ is the nominal distribution of $(\xi, \eta)$, which usually takes the form of an empirical distribution, and $\mathcal{C}(\cdot,\cdot)$ is a causal transport distance between distributions [Yang et al., 2022, Definition 1], a variant of the Wasserstein distance that better captures the information in $\xi$. For distributionally robust feature-based newsvendor problems [Zhang et al., 2023], the covariate $\xi$ can be temporal, spatial, or weather information, $\eta$ is the random demand, $f(x; \xi)$ denotes the ordering quantity for a given $\xi$, and $l(f(x; \xi); \eta)$ characterizes the loss incurred if the ordering quantity $f(x; \xi)$ does not match the demand $\eta$.

Incorporating the cost function of the causal transport distance used in [Yang et al., 2022] and utilizing the dual form, the WDRO-SI problem (10) can be reformulated as a special case of CSBO:

$$
\begin{aligned}
\min_x\ & \mathbb{E}_{\xi \sim P^0_\xi}\mathbb{E}_{\eta \sim P^0_{\eta|\xi}}\Big[l\big(f(x; y^*(x;\xi)); \eta\big) - \lambda\,\|y^*(x;\xi) - \xi\|^2\Big] && \text{(upper level)}\\
\text{where}\ & y^*(x;\xi) := \operatorname*{argmin}_{\tilde\xi}\ \mathbb{E}_{\eta \sim P^0_{\eta|\xi}}\big[-l(f(x; \tilde\xi); \eta)\big] + \lambda\,\|\xi - \tilde\xi\|^2 \quad \forall \xi \text{ and } \forall x. && \text{(lower level)}
\end{aligned}
\tag{11}
$$

The original work [Yang et al., 2022] only allows affine decision rules $f$ or non-parametric approximations, while our approach supports neural network approximations, i.e., $f(x; \xi)$ can be a neural network parameterized by $x$. Using Theorem 1, we obtain the first sample and gradient complexity bounds for WDRO-SI.
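The lower-level problem in (11) searches for a worst-case perturbed covariate. A minimal sketch of this inner solve via gradient ascent, in the spirit of the inner loop of Algorithm 5 in Appendix D.2, is given below; `f_nn`, `loss_fn`, and the step counts are assumptions for illustration, not the authors' implementation.

```python
import torch

def worst_case_covariate(x, f_nn, loss_fn, xi, etas, lam, steps=20, lr=0.1):
    """Sketch of the lower level in (11): approximate
    argmax_{xi_tilde} E_eta[l(f(x; xi_tilde); eta)] - lam * ||xi - xi_tilde||^2."""
    xi_t = xi.detach().clone().requires_grad_(True)
    for _ in range(steps):
        obj = torch.stack([loss_fn(f_nn(x, xi_t), eta) for eta in etas]).mean()
        obj = obj - lam * torch.sum((xi_t - xi) ** 2)
        grad = torch.autograd.grad(obj, xi_t)[0]
        with torch.no_grad():
            xi_t += lr * grad                    # ascent step on xi_tilde
    return xi_t.detach()
```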
For the distributionally robust feature-based newsvendor problem, we compare the performance of DL-SGD and RT-MLMC, and we also compare with ERM and WDRO, which do not incorporate side information. Figure 3 (left) and (middle) present the test loss versus the number of upper-level iterations for DL-SGD and RT-MLMC, respectively. From the plots, using a larger number of epochs K for the lower-level problem generally yields lower test loss values, i.e., better generalization performance. Figure 3 (right) highlights the importance of incorporating side information, as WDRO-SI outperforms the other two baselines. In addition, more observations of $\eta$ for a given side information $\xi$ enhance the performance. Table 3 reports the computation time for the DL-SGD and RT-MLMC gradient estimators; RT-MLMC is significantly faster since it properly balances the bias-variance-computation trade-off in the gradient simulation.

Figure 3: Test error on WDRO-SI against the number of upper-level updates for DL-SGD (left) and RT-MLMC (middle). The right panel compares WDRO-SI with ERM and Wasserstein DRO, which do not incorporate side information; m denotes the number of samples of $\eta$ generated from $P_{\eta|\xi}$ for each realization of $\xi$.

Table 3: Computation time (in seconds) of the DL-SGD and RT-MLMC gradient estimators for WDRO-SI.

| K | DL-SGD Mean | DL-SGD Variance | RT-MLMC Mean | RT-MLMC Variance |
| --- | --- | --- | --- | --- |
| 2 | 1.27e-02 | 2.67e-03 | 5.04e-03 | 7.26e-04 |
| 4 | 5.25e-02 | 2.58e-03 | 1.25e-02 | 8.26e-03 |
| 6 | 1.68e-01 | 2.74e-03 | 2.02e-02 | 9.39e-03 |
| 8 | 4.63e-01 | 2.08e-03 | 3.41e-02 | 1.68e-02 |

## 5 Conclusion

We introduced the class of contextual stochastic bilevel optimization problems, which involve a contextual stochastic optimization problem at the lower level. In addition, we designed efficient gradient-based solution schemes and analyzed their sample and gradient complexities. Numerical results on two complementary applications showcase the expressiveness of the proposed problem class as well as the efficiency of the proposed algorithms. Future research should address generalized CSBO problems with constraints at the lower level, which will require alternative gradient estimators.

## Acknowledgments and Disclosure of Funding

This research was supported by the Swiss National Science Foundation under the NCCR Automation, grant agreement 51NF40_180545, an NSF CAREER award CCF-1650913, NSF grants DMS-2134037, CMMI-2015787, CMMI-2112533, DMS-1938106 and DMS-1830210, and the Coca-Cola Foundation.

## References
Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1-2):165–214, 2023.

Hilal Asi, Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Stochastic bias-reduced gradient methods. In Advances in Neural Information Processing Systems, 2021.

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. In Advances in Neural Information Processing Systems, 2021.

Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, 2022a.

Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for minimizing compositions of convex functions and random functions and its applications in network revenue management. arXiv preprint arXiv:2205.01774, 2022b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Priya Donti, Brandon Amos, and J Zico Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, 2017.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

Michael B Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.

Adam N Glynn and Kevin M Quinn. An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18(1):36–56, 2010.

Takashi Goda and Wataru Kitade. Constructing unbiased gradient estimators with finite variance for conditional stochastic optimization. Mathematics and Computers in Simulation, 204:743–763, 2023.

Paul Grigas, Meng Qi, and Zuo-Jun Max Shen. Integrated conditional estimation-optimization. arXiv preprint arXiv:2110.12351, 2021.

Zhishuai Guo and Tianbao Yang. Randomized stochastic variance-reduced methods for stochastic bilevel optimization. arXiv preprint arXiv:2105.02266, 2021.

Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. On stochastic moving-average estimators for non-convex optimization. arXiv preprint arXiv:2104.14840, 2021.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Lie He and Shiva Prasad Kasiviswanathan. Debiasing conditional stochastic optimization. arXiv preprint arXiv:2304.10613, 2023.

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
Quanqi Hu, Zi-Hao Qiu, Zhishuai Guo, Lijun Zhang, and Tianbao Yang. Blockwise stochastic variance-reduced methods with parallel speedup for multi-block bilevel optimization. In International Conference on Machine Learning, 2023.

Yifan Hu, Xin Chen, and Niao He. Sample complexity of sample average approximation for conditional stochastic optimization. SIAM Journal on Optimization, 30(3):2103–2133, 2020a.

Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems, 2020b.

Yifan Hu, Xin Chen, and Niao He. On the bias-variance-cost tradeoff of stochastic optimization. In Advances in Neural Information Processing Systems, 2021.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, 2021.

Mohammed Ali Mnmoustafa. Tiny ImageNet, 2017. URL https://kaggle.com/competitions/tiny-imagenet.

Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual instrumental variable regression. In Advances in Neural Information Processing Systems, 2020.

Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990.

Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of NDCG surrogates for deep learning with provable convergence. In International Conference on Machine Learning, 2022.

Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, and Tianbao Yang. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In International Conference on Machine Learning, 2023.

Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, 2012.

Yves Rychener, Daniel Kuhn, and Tobias Sutter. End-to-end learning for stochastic optimization: A Bayesian perspective. In International Conference on Machine Learning, 2023.

Utsav Sadana, Abhilash Chenreddy, Erick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision making under uncertainty. arXiv preprint arXiv:2306.10374, 2023.

Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, 2021.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Bokun Wang, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Memory-based optimization methods for model-agnostic meta-learning and personalized federated learning. Journal of Machine Learning Research, 24:1–46, 2023a.

Jie Wang, Rui Gao, and Yao Xie. Sinkhorn distributionally robust optimization. arXiv preprint arXiv:2109.11926, 2023b.

Pengwei Xing, Songtao Lu, Lingfei Wu, and Han Yu. Big-Fed: Bilevel optimization enhanced graph-aided federated learning. IEEE Transactions on Big Data, 2022.
Jincheng Yang, Luhao Zhang, Ningyuan Chen, Rui Gao, and Ming Hu. Decision-making with side information: A causal transport robust approach. Optimization Online, 2022.

Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. In Advances in Neural Information Processing Systems, 2021.

Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. Hierarchically structured meta-learning. In International Conference on Machine Learning, 2019.

Luhao Zhang, Jincheng Yang, and Rui Gao. Optimal robust policy for feature-based newsvendor. Management Science, 2023.

## A Proofs of Technical Results

Proof of Lemma 1. Since the function $\psi$ is $S_\psi$-Lipschitz smooth, we have

$$
\begin{aligned}
\mathbb{E}\psi(x_{t+1}) - \mathbb{E}\psi(x_t) &\le \mathbb{E}\big[\nabla\psi(x_t)^\top(x_{t+1} - x_t)\big] + \frac{S_\psi}{2}\mathbb{E}\|x_{t+1} - x_t\|^2\\
&= -\alpha_t\,\mathbb{E}\big[\nabla\psi(x_t)^\top v(x_t)\big] + \frac{S_\psi\alpha_t^2}{2}\mathbb{E}\|v(x_t)\|^2\\
&\le -\alpha_t\,\mathbb{E}\|\nabla\psi(x_t)\|^2 + \alpha_t\,\mathbb{E}\big[\nabla\psi(x_t)^\top(\nabla\psi(x_t) - v(x_t))\big] + S_\psi\alpha_t^2\,\mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2 + S_\psi\alpha_t^2\,\mathbb{E}\|\nabla\psi(x_t)\|^2\\
&= -(\alpha_t - S_\psi\alpha_t^2)\,\mathbb{E}\|\nabla\psi(x_t)\|^2 + \alpha_t\,\mathbb{E}\big[\nabla\psi(x_t)^\top\mathbb{E}[\nabla\psi(x_t) - v(x_t)\mid x_t]\big] + S_\psi\alpha_t^2\,\mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2\\
&\le -\frac{\alpha_t}{2}\,\mathbb{E}\|\nabla\psi(x_t)\|^2 + \alpha_t\,\mathbb{E}\big[\|\nabla\psi(x_t)\|\,\big\|\mathbb{E}[\nabla\psi(x_t) - v(x_t)\mid x_t]\big\|\big] + S_\psi\alpha_t^2\,\mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2,
\end{aligned}
$$

where the first inequality uses the Lipschitz smoothness of $\psi$, the first equality uses the update rule of the SGD algorithm, the second inequality uses the elementary bound $\|v\|^2 \le 2\|v - \nabla\psi\|^2 + 2\|\nabla\psi\|^2$, the second equality uses the tower property of conditional expectations, and the last inequality uses $\alpha_t \le 1/(2S_\psi)$ together with the Cauchy-Schwarz inequality. Rearranging terms and setting $\alpha_t = \alpha$, we have

$$\mathbb{E}\|\nabla\psi(x_t)\|^2 \le \frac{2(\mathbb{E}\psi(x_t) - \mathbb{E}\psi(x_{t+1}))}{\alpha} + 2\,\mathbb{E}\big[\|\nabla\psi(x_t)\|\,\big\|\mathbb{E}[\nabla\psi(x_t) - v(x_t)\mid x_t]\big\|\big] + 2S_\psi\alpha\,\mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2.$$

Averaging from $t = 1$ to $t = T$, we have

$$\mathbb{E}\|\nabla\psi(\hat x_T)\|^2 = \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla\psi(x_t)\|^2 \le \frac{2(\psi(x_1) - \min_x\psi(x))}{\alpha T} + \frac{2}{T}\sum_{t=1}^T\Big(L_\psi\,\mathbb{E}\big\|\mathbb{E}[\nabla\psi(x_t) - v(x_t)\mid x_t]\big\| + S_\psi\alpha\,\mathbb{E}\|v(x_t) - \nabla\psi(x_t)\|^2\Big),$$

where the inequality holds as $\psi$ is $L_\psi$-Lipschitz continuous and thus $\|\nabla\psi(x)\| \le L_\psi$ for all $x$. $\square$

To demonstrate the bias, variance, and sampling and computational costs of building the DL-SGD and RT-MLMC gradient estimators, we first show the convergence of the iterates of Epoch SGD on the lower-level minimization problem with side information. The analysis follows similarly as in [Hazan and Kale, 2014, Asi et al., 2021].

Lemma 3 (Error of Epoch SGD). For given $x$ and $\xi$, the iterate $y_{K+1}^0$ of Algorithm 1 with $\beta_0 = (4\mu_g)^{-1}$ satisfies $\mathbb{E}\|y_{K+1}^0 - y^*(x;\xi)\|^2 \le 2L_{g,0}^2\,\mu_g^{-2}\, 2^{-(K+1)}$.

It is important to note that the initial stepsize $\beta_0$ does not necessarily have to equal $(4\mu_g)^{-1}$. Indeed, equivalent results can be achieved with any constant $\beta_0 > 0$; the choice of this specific value is primarily to streamline the analysis. In reality, there are numerous instances where the value of $\mu_g$ is unknown beforehand. Under such circumstances, one practical approach is to set $\beta_0$ to a standard value, such as 0.4.

Proof of Lemma 3. Denote $G(x, y; \xi) := \mathbb{E}_{\eta|\xi}[g(x, y; \eta, \xi)]$. For ease of notation, throughout the proof, $\mathbb{E}$ denotes the full expectation conditioned on given $x$ and $\xi$. Utilizing the update of $y$ in the $k$-th epoch of the Epoch SGD algorithm, it holds for any $y$ and any $j = 0, \dots, 2^k - 1$ that

$$
\begin{aligned}
\mathbb{E}\|y_k^{j+1} - y\|^2 &= \mathbb{E}\big\|y_k^j - \beta_k\nabla_2 g(x, y_k^j; \eta_k^j, \xi) - y\big\|^2\\
&= \mathbb{E}\|y_k^j - y\|^2 + \beta_k^2\,\mathbb{E}\|\nabla_2 g(x, y_k^j; \eta_k^j, \xi)\|^2 - 2\beta_k\,\mathbb{E}\big[\nabla_2 g(x, y_k^j; \eta_k^j, \xi)^\top(y_k^j - y)\big]\\
&\le \mathbb{E}\|y_k^j - y\|^2 + \beta_k^2 L_{g,0}^2 - 2\beta_k\big(\mathbb{E}G(x, y_k^j; \xi) - G(x, y; \xi)\big),
\end{aligned}
$$

where the inequality utilizes the convexity of $g$ and $G$ in $y$. Rearranging terms, the above inequality yields

$$\mathbb{E}G(x, y_k^j; \xi) - G(x, y; \xi) \le \frac{\mathbb{E}\|y_k^j - y\|^2 - \mathbb{E}\|y_k^{j+1} - y\|^2}{2\beta_k} + \frac{\beta_k L_{g,0}^2}{2}.$$
Summing up from $j = 0$ to $j = 2^k - 1$ and dividing both sides by $2^k$, we obtain the relation

$$\mathbb{E}G(x, y_{k+1}^0; \xi) - G(x, y; \xi) \le \frac{1}{2^k}\sum_{j=0}^{2^k-1}\big(\mathbb{E}G(x, y_k^j; \xi) - G(x, y; \xi)\big) \le \frac{\mathbb{E}\|y_k^0 - y\|^2}{2\beta_k 2^k} + \frac{\beta_k L_{g,0}^2}{2}, \tag{12}$$

where the first inequality uses the convexity of $G$ in $y$ and Jensen's inequality. Since $G(x, y; \xi)$ is $\mu_g$-strongly convex in $y$ for any given $x$ and $\xi$, it holds that

$$G(x, y_1^0; \xi) - G(x, y^*(x;\xi); \xi) \le -\nabla_2 G(x, y_1^0; \xi)^\top\big(y^*(x;\xi) - y_1^0\big) - \frac{\mu_g}{2}\|y^*(x;\xi) - y_1^0\|^2 \le \max_y\Big\{-\nabla_2 G(x, y_1^0; \xi)^\top(y - y_1^0) - \frac{\mu_g}{2}\|y - y_1^0\|^2\Big\} = \frac{\|\nabla_2 G(x, y_1^0; \xi)\|^2}{2\mu_g} \le \frac{L_{g,0}^2}{2\mu_g}.$$

Next, we use induction on $k$ to show that $\mathbb{E}G(x, y_k^0; \xi) - G(x, y^*(x;\xi); \xi) \le \frac{L_{g,0}^2}{2^k\mu_g}$. The base step $k = 1$ follows from the inequality established above. As for the induction step, suppose that $\mathbb{E}G(x, y_k^0; \xi) - G(x, y^*(x;\xi); \xi) \le \frac{L_{g,0}^2}{2^k\mu_g}$ holds for some $k$. Plugging $y = y^*(x;\xi)$ into (12), we thus find

$$
\begin{aligned}
\mathbb{E}G(x, y_{k+1}^0; \xi) - G(x, y^*(x;\xi); \xi) &\le \frac{\mathbb{E}\|y_k^0 - y^*(x;\xi)\|^2}{2\beta_k 2^k} + \frac{\beta_k L_{g,0}^2}{2} \le \frac{\mathbb{E}G(x, y_k^0; \xi) - G(x, y^*(x;\xi); \xi)}{\mu_g\beta_k 2^k} + \frac{\beta_0 L_{g,0}^2}{2^{k+1}}\\
&= \frac{\mathbb{E}G(x, y_k^0; \xi) - G(x, y^*(x;\xi); \xi)}{\mu_g\beta_0} + \frac{\beta_0 L_{g,0}^2}{2^{k+1}} \le \frac{L_{g,0}^2}{\mu_g^2\beta_0 2^k} + \frac{\beta_0 L_{g,0}^2}{2^{k+1}} = \frac{L_{g,0}^2}{\mu_g 2^{k+2}} + \frac{L_{g,0}^2}{\mu_g 2^{k+3}} \le \frac{L_{g,0}^2}{\mu_g 2^{k+1}},
\end{aligned}
$$

where the second inequality uses the strong convexity of $G$ in $y$ and the fact that $y^*(x;\xi)$ minimizes $G$, the first equality uses $\beta_k = \beta_0/2^k$, the third inequality uses the induction hypothesis, and the second equality uses the definition $\beta_0 = (4\mu_g)^{-1}$. This concludes the induction. Therefore, we have

$$\mathbb{E}G(x, y_{K+1}^0; \xi) - G(x, y^*(x;\xi); \xi) \le \frac{L_{g,0}^2}{\mu_g 2^{K+1}}.$$

By the $\mu_g$-strong convexity of $G(x, \cdot\,; \xi)$ and the fact that $y^*(x;\xi)$ is its minimizer, we thus have

$$\mathbb{E}\|y_{K+1}^0 - y^*(x;\xi)\|^2 \le \frac{2}{\mu_g}\big(\mathbb{E}G(x, y_{K+1}^0; \xi) - G(x, y^*(x;\xi); \xi)\big) \le \frac{L_{g,0}^2}{\mu_g^2 2^K}. \qquad\square$$

Proof of Lemma 2. We first demonstrate the properties of the RT-MLMC gradient estimator and then those of the DL-SGD gradient estimator. To facilitate the analysis, we define

$$V(x) = \mathbb{E}_{P_\xi, P_{\eta|\xi}}\Big[\nabla_1 f(x, y^*(x;\xi); \eta, \xi) - \mathbb{E}_{\eta'' \sim P_{\eta|\xi}}\big[\nabla^2_{12} g(x, y^*(x;\xi); \eta'', \xi)\big]\,\mathbb{E}_{\{\eta_n\}_{n=1}^{\hat N} \sim P_{\eta|\xi}}\big[\hat\Lambda(x, y^*(x;\xi); \xi)\big]\,\nabla_2 f(x, y^*(x;\xi); \eta, \xi)\Big].$$

RT-MLMC gradient estimator. By the triangle inequality, we have

$$\|\mathbb{E}\hat v(x) - \nabla F(x)\| \le \|\mathbb{E}\hat v(x) - V(x)\| + \|V(x) - \nabla F(x)\|.$$

From Chen et al. [2022b, Lemma 2.2], we know for given $x$, $y$, and $\xi$ that

$$\Big\|\mathbb{E}_{\{\eta_n\}_{n=1}^{\hat N} \sim P_{\eta|\xi}}\big[\hat\Lambda(x, y; \xi)\big] - \Lambda(x, y; \xi)\Big\| \le \frac{1}{\mu_g}\Big(1 - \frac{\mu_g}{2L_{g,1}}\Big)^N.$$

Utilizing the Lipschitz continuity of $g$ and $f$, we know that

$$\|V(x) - \nabla F(x)\| \le \frac{L_{g,1}L_{f,0}}{\mu_g}\Big(1 - \frac{\mu_g}{2L_{g,1}}\Big)^N.$$

On the other hand, by the definition of $\hat v(x)$, we have

$$\mathbb{E}\hat v(x) - V(x) = \mathbb{E}\nabla_1 f(x, y_{K+1}^0; \eta, \xi) - \mathbb{E}\nabla_1 f(x, y^*(x;\xi); \eta, \xi) - \Big(\mathbb{E}\big[\nabla^2_{12} g(x, y_{K+1}^0; \eta'', \xi)\,\hat\Lambda(x, y_{K+1}^0; \xi)\,\nabla_2 f(x, y_{K+1}^0; \eta', \xi)\big] - \mathbb{E}\big[\nabla^2_{12} g(x, y^*(x;\xi); \eta'', \xi)\,\hat\Lambda(x, y^*(x;\xi); \xi)\,\nabla_2 f(x, y^*(x;\xi); \eta', \xi)\big]\Big).$$

By the Lipschitz continuity of $\nabla f$ and Lemma 3, we have

$$\big\|\mathbb{E}\nabla_1 f(x, y_{K+1}^0; \eta, \xi) - \mathbb{E}\nabla_1 f(x, y^*(x;\xi); \eta, \xi)\big\| \le L_{f,1}\,\mathbb{E}\|y_{K+1}^0 - y^*(x;\xi)\|.$$

Utilizing the Lipschitz continuity of $\nabla f$, $\nabla g$, and $\nabla^2 g$ in $y$, we obtain

$$\|\mathbb{E}\hat v(x) - V(x)\| \le \Big(L_{f,1} + \frac{L_{g,1}L_{f,1}N}{L_{g,1}} + \frac{L_{f,0}L_{g,2}N}{L_{g,1}} + L_{f,0}L_{g,1}N^2\Big)\,\mathbb{E}\|y_{K+1}^0 - y^*(x;\xi)\| \le O\Big(\frac{N^2}{\mu_g 2^{K/2}}\Big),$$

where the last step uses Lemma 3, which gives $\mathbb{E}\|y_{K+1}^0 - y^*(x;\xi)\| \le \sqrt{2}\,L_{g,0}\,\mu_g^{-1}\, 2^{-(K+1)/2}$. Next, we show the sampling and computational costs. To build the RT-MLMC gradient estimator $\hat v(x)$, we need one sample of $\xi$, and the expected number of samples of $\eta$ from $P_{\eta|\xi}$ is

$$N + \sum_{k=1}^{K}\big(2^{k+1} - 1\big)\,\frac{2^{-k}}{1 - 2^{-K}} < N + 3K.$$

On average, the number of iterations needed by Epoch SGD is

$$\sum_{k=1}^{K}\big(2^{k+1} - 1\big)\,\frac{2^{-k}}{1 - 2^{-K}} < 3K.$$

Next, we demonstrate the variance of $\hat v(x)$. Denote

$$
\begin{aligned}
H_K(1) &:= \nabla_1 f(x, y_K^0; \eta', \xi), & H_K(2) &:= \nabla^2_{12} g(x, y_K^0; \eta'', \xi)\,\hat\Lambda(x, y_K^0; \xi)\,\nabla_2 f(x, y_K^0; \eta', \xi),\\
H^*(1) &:= \nabla_1 f(x, y^*(x;\xi); \eta', \xi), & H^*(2) &:= \nabla^2_{12} g(x, y^*(x;\xi); \eta'', \xi)\,\hat\Lambda(x, y^*(x;\xi); \xi)\,\nabla_2 f(x, y^*(x;\xi); \eta', \xi).
\end{aligned}
$$
Thus one may rewrite $\hat v(x)$ and $V(x)$ as

$$\hat v(x) = \frac{H_{\hat k+1}(1) - H_{\hat k}(1) - H_{\hat k+1}(2) + H_{\hat k}(2)}{p_{\hat k}} + H_1(1) - H_1(2), \qquad V(x) = \mathbb{E}\big[H^*(1) - H^*(2)\big].$$

It holds that

$$
\begin{aligned}
\mathbb{E}\|\hat v(x) - \mathbb{E}\hat v(x)\|^2 &\le \mathbb{E}\|\hat v(x) - V(x)\|^2 \le 2\,\mathbb{E}\|\hat v(x) - H_1(1) + H_1(2)\|^2 + 2\,\mathbb{E}\|H_1(1) - H_1(2) - V(x)\|^2\\
&\le 4\,\mathbb{E}\Big\|\frac{H_{\hat k+1}(1) - H_{\hat k}(1)}{p_{\hat k}}\Big\|^2 + 4\,\mathbb{E}\Big\|\frac{H_{\hat k+1}(2) - H_{\hat k}(2)}{p_{\hat k}}\Big\|^2 + 2\,\mathbb{E}\|H_1(1) - H_1(2) - V(x)\|^2,
\end{aligned}
$$

where the first inequality holds by the definition of the variance, the second inequality holds by the Cauchy-Schwarz inequality, and the third inequality uses the Cauchy-Schwarz inequality and the definition of $\hat v(x)$. It remains to analyze the three terms on the right-hand side. For the first term, we have

$$
\begin{aligned}
\mathbb{E}\Big\|\frac{H_{\hat k+1}(1) - H_{\hat k}(1)}{p_{\hat k}}\Big\|^2 &= \sum_{k=1}^{K} p_k^{-1}\,\mathbb{E}\|H_{k+1}(1) - H_k(1)\|^2 \le \sum_{k=1}^{K} p_k^{-1} L_{f,1}^2\,\mathbb{E}\|y_{k+1}^0 - y_k^0\|^2\\
&\le \sum_{k=1}^{K} p_k^{-1} L_{f,1}^2\cdot 2\big(\mathbb{E}\|y_{k+1}^0 - y^*(x;\xi)\|^2 + \mathbb{E}\|y^*(x;\xi) - y_k^0\|^2\big) \le \sum_{k=1}^{K} p_k^{-1} L_{f,1}^2\,\frac{6L_{g,0}^2}{\mu_g^2 2^k} \le \frac{6KL_{f,1}^2 L_{g,0}^2}{\mu_g^2},
\end{aligned}
$$

where the first inequality uses the Lipschitz continuity of $\nabla f$, the second inequality uses the Cauchy-Schwarz inequality, the third inequality uses Lemma 3, and the last inequality uses the definition of $p_k$. A similar analysis applies to the second term:

$$
\begin{aligned}
\mathbb{E}\Big\|\frac{H_{\hat k+1}(2) - H_{\hat k}(2)}{p_{\hat k}}\Big\|^2 &= \sum_{k=1}^{K} p_k^{-1}\,\mathbb{E}\|H_{k+1}(2) - H_k(2)\|^2 \le \sum_{k=1}^{K} p_k^{-1}\cdot 6\Big(L_{f,1}^2 N^2 + \frac{L_{f,0}^2 L_{g,2}^2 N^2}{L_{g,1}^2} + L_{f,0}^2 N^4\Big)\,\mathbb{E}\|y_{k+1}^0 - y_k^0\|^2\\
&\le \sum_{k=1}^{K} p_k^{-1}\cdot 2\Big(\frac{L_{g,0}^2}{\mu_g^2 2^k} + \frac{L_{g,0}^2}{\mu_g^2 2^{k-1}}\Big)\cdot 6\Big(L_{f,1}^2 N^2 + \frac{L_{f,0}^2 L_{g,2}^2 N^2}{L_{g,1}^2} + L_{f,0}^2 N^4\Big)\\
&= \frac{36KL_{g,0}^2}{\mu_g^2}\Big(L_{f,1}^2 N^2 + \frac{L_{f,0}^2 L_{g,2}^2 N^2}{L_{g,1}^2} + L_{f,0}^2 N^4\Big),
\end{aligned}
$$

where the first inequality uses the Lipschitz continuity of $f$, $\nabla f$, $\nabla g$ and $\nabla^2 g$ in $y$, the second inequality uses Lemma 3, and the last equality uses the definition of $p_k$. For the third term, we have

$$\mathbb{E}\|H_1(1) - H_1(2) - V(x)\|^2 \le 2\,\mathbb{E}\|H_1(1) - \mathbb{E}H^*(1)\|^2 + 2\,\mathbb{E}\|H_1(2) - \mathbb{E}H^*(2)\|^2.$$

Notice that

$$\mathbb{E}\|H_1(1) - \mathbb{E}H^*(1)\|^2 = \mathbb{E}\big\|\nabla_1 f(x, y_1^0; \eta, \xi) - \mathbb{E}\nabla_1 f(x, y_1^0; \eta, \xi)\big\|^2 + \big\|\mathbb{E}\nabla_1 f(x, y_1^0; \eta, \xi) - \mathbb{E}\nabla_1 f(x, y^*(x;\xi); \eta, \xi)\big\|^2 \le \sigma_f^2 + \frac{L_{f,1}^2 L_{g,0}^2}{2\mu_g^2},$$

where the equality holds because $\mathbb{E}\|a + b\|^2 = \mathbb{E}\|a\|^2 + \mathbb{E}\|b\|^2 + 2\,\mathbb{E}[a^\top b]$ and the cross term vanishes, and the inequality holds by Lemma 3 and the Lipschitz continuity of $\nabla f$. On the other hand, we have

$$
\begin{aligned}
\mathbb{E}\|H_1(2) - \mathbb{E}H^*(2)\|^2 &= \mathbb{E}\|H_1(2) - \mathbb{E}H_1(2)\|^2 + \|\mathbb{E}H_1(2) - \mathbb{E}H^*(2)\|^2 \le \mathbb{E}\|H_1(2)\|^2 + \|\mathbb{E}H_1(2) - \mathbb{E}H^*(2)\|^2\\
&\le \frac{L_{g,1}^2 L_{f,0}^2 N^2}{4L_{g,1}^2} + 6\Big(L_{f,1}^2 N^2 + \frac{L_{f,0}^2 L_{g,2}^2 N^2}{L_{g,1}^2} + L_{f,0}^2 N^4\Big)\frac{L_{g,0}^2}{2\mu_g^2},
\end{aligned}
$$

where the first equality again uses $\mathbb{E}\|a + b\|^2 = \mathbb{E}\|a\|^2 + \mathbb{E}\|b\|^2 + 2\,\mathbb{E}[a^\top b]$, and the last inequality holds by Lemma 3 and the Lipschitz continuity assumptions. As a result, we conclude that the variance satisfies

$$\mathbb{E}\|\hat v(x) - \mathbb{E}\hat v(x)\|^2 \le \mathbb{E}\|\hat v(x) - V(x)\|^2 = O(KN^4).$$

DL-SGD gradient estimator. Note that $\mathbb{E}\hat v(x) = \mathbb{E}\hat v_K(x)$. Thus the bias follows directly from the analysis of RT-MLMC. Next, we consider the per-iteration sampling cost and the number of iterations of Epoch SGD. DL-SGD runs Epoch SGD as a subroutine for $2^{K+1} - 1$ iterations; thus the sampling cost is $N + 2^{K+1} - 1$. It remains to consider the variance of the DL-SGD estimator.
Note that $\hat v_K(x) = H_{K+1}(1) - H_{K+1}(2)$. Following a similar decomposition as for the third term in the variance bound for the RT-MLMC estimator, we have

$$
\begin{aligned}
\mathbb{E}\|\hat v_K(x) - \mathbb{E}\hat v_K(x)\|^2 &\le \mathbb{E}\|\hat v_K(x) - V(x)\|^2 \le 2\,\mathbb{E}\|H_{K+1}(1) - \mathbb{E}H^*(1)\|^2 + 2\,\mathbb{E}\|H_{K+1}(2) - \mathbb{E}H^*(2)\|^2\\
&\le 2\,\mathbb{E}\|H_{K+1}(1) - \mathbb{E}H_{K+1}(1)\|^2 + 2\,\|\mathbb{E}H_{K+1}(1) - \mathbb{E}H^*(1)\|^2 + 2\,\mathbb{E}\|H_{K+1}(2)\|^2 + 2\,\|\mathbb{E}H_{K+1}(2) - \mathbb{E}H^*(2)\|^2\\
&\le 2\sigma_f^2 + \frac{L_{f,1}^2 L_{g,0}^2}{2^K\mu_g^2} + \frac{L_{g,1}^2 L_{f,0}^2 N^2}{2L_{g,1}^2} + 12\Big(L_{f,1}^2 N^2 + \frac{L_{f,0}^2 L_{g,2}^2 N^2}{L_{g,1}^2} + L_{f,0}^2 N^4\Big)\frac{L_{g,0}^2}{2^{K+1}\mu_g^2} = O\big(N^2 + N^4 2^{-K}\big),
\end{aligned}
$$

where the first inequality uses the definition of the variance, the second and third inequalities use the Cauchy-Schwarz inequality, and the last inequality and equality use the Lipschitz continuity of $f$, $\nabla f$ and $\nabla g$, following a similar argument as for the third term in the variance bound for the RT-MLMC estimator. Note that to control the bias, we let both $N$ and $K$ be of order $O(\log(\epsilon^{-1}))$; thus $N^2$ is the dominating term in the variance of the DL-SGD gradient estimator. $\square$

Next, we demonstrate the proof of Theorem 1.

Proof of Theorem 1. We first demonstrate the analysis for RT-MLMC. RT-MLMC gradient method: combining Lemmas 1 and 2, we know that

$$
\begin{aligned}
\mathbb{E}\|\nabla F(\hat x_T)\|^2 &\le \frac{2(\mathbb{E}F(x_1) - \min_x F(x))}{\alpha T} + \frac{2}{T}\sum_{t=1}^{T} L_F\,\mathbb{E}\big\|\mathbb{E}[\nabla F(x_t) - \hat v(x_t)\mid x_t]\big\| + \frac{2S_F\alpha}{T}\sum_{t=1}^{T}\mathbb{E}\|\hat v(x_t) - \nabla F(x_t)\|^2\\
&\le \frac{2(\mathbb{E}F(x_1) - \min_x F(x))}{\alpha T} + \frac{2}{T}\sum_{t=1}^{T} L_F\,\mathbb{E}\big\|\mathbb{E}[\nabla F(x_t) - \hat v(x_t)\mid x_t]\big\| + \frac{4S_F\alpha}{T}\sum_{t=1}^{T}\Big(\mathbb{E}\|\hat v(x_t) - V(x_t)\|^2 + \mathbb{E}\|V(x_t) - \nabla F(x_t)\|^2\Big)\\
&= O\Big(\frac{1}{\alpha T} + \frac{1}{\mu_g}\Big(1 - \frac{\mu_g}{2L_{g,1}}\Big)^N + \frac{N^2}{\mu_g 2^{K/2}} + \alpha S_F K N^4\Big).
\end{aligned}
$$

To ensure that $\hat x_T$ is an $\epsilon$-stationary point, it suffices to make the right-hand side of the above inequality $O(\epsilon^2)$. Correspondingly, we set $N = O(\log(\epsilon^{-1}))$, $K = O(\log(\epsilon^{-2}))$, $\alpha = O(T^{-1/2})$, and $T = \widetilde O(\epsilon^{-4})$. As a result, the total sampling and gradient complexities over $g$ are of order $3KT = \widetilde O(\epsilon^{-4})$. Since at each upper-level iteration we only compute one gradient of $f$, the gradient complexity over $f$ is $T = \widetilde O(\epsilon^{-4})$.

Next, we demonstrate the analysis for DL-SGD. DL-SGD gradient method: combining Lemmas 1 and 2, we have

$$\mathbb{E}\|\nabla F(\hat x_T)\|^2 \le \frac{2(\mathbb{E}F(x_1) - \min_x F(x))}{\alpha T} + \frac{2}{T}\sum_{t=1}^{T} L_F\,\mathbb{E}\big\|\mathbb{E}[\nabla F(x_t) - \hat v_K(x_t)\mid x_t]\big\| + \frac{4S_F\alpha}{T}\sum_{t=1}^{T}\Big(\mathbb{E}\|\hat v_K(x_t) - V(x_t)\|^2 + \mathbb{E}\|V(x_t) - \nabla F(x_t)\|^2\Big) = O\Big(\frac{1}{\alpha T} + \frac{1}{\mu_g}\Big(1 - \frac{\mu_g}{2L_{g,1}}\Big)^N + \frac{N^2}{\mu_g 2^{K/2}} + \alpha S_F N^2\Big).$$

To ensure that $\hat x_T$ is an $\epsilon$-stationary point, it suffices to let $N = O(\log(\epsilon^{-1}))$, $K = O(\log(\epsilon^{-2}))$, $\alpha = O(T^{-1/2})$, and $T = \widetilde O(\epsilon^{-4})$. As a result, the sample complexity and the gradient complexity over $g$ are of order $O(2^K T) = \widetilde O(\epsilon^{-6})$. At each upper-level iteration, we compute one gradient of $f$, and thus the gradient complexity over $f$ is $T = \widetilde O(\epsilon^{-4})$. $\square$

## B Computing $\nabla_1 y^*(x; \xi)$

To derive an explicit formula for $\nabla_1 y^*(x;\xi)$, we use the first-order optimality condition of the unconstrained lower-level problem. Indeed, as $g(x, y; \eta, \xi)$ is strongly convex in $y$, for given $x$ and $\xi$, $y^*(x;\xi)$ is the unique solution of the equation

$$\mathbb{E}_{P_{\eta|\xi}}\big[\nabla_2 g(x, y^*(x;\xi); \eta, \xi)\big] = 0.$$

Taking gradients with respect to $x$ on both sides and using the chain rule, we obtain

$$\mathbb{E}_{P_{\eta|\xi}}\big[\nabla^2_{21} g(x, y^*(x;\xi); \eta, \xi)\big] + \mathbb{E}_{P_{\eta|\xi}}\big[\nabla^2_{22} g(x, y^*(x;\xi); \eta, \xi)\big]\,\nabla_1 y^*(x;\xi)^\top = 0.$$

Since $g(x, y; \eta, \xi)$ is $\mu_g$-strongly convex in $y$ for any given $x$, $\eta$, and $\xi$, the expected Hessian matrix $\mathbb{E}_{P_{\eta|\xi}}[\nabla^2_{22} g(x, y; \eta, \xi)] \in \mathbb{S}^{d_y}_{++}$ is invertible, and thus we find

$$\nabla_1 y^*(x;\xi) = -\,\mathbb{E}_{P_{\eta|\xi}}\big[\nabla^2_{12} g(x, y^*(x;\xi); \eta, \xi)\big]\,\Big(\mathbb{E}_{P_{\eta|\xi}}\big[\nabla^2_{22} g(x, y^*(x;\xi); \eta, \xi)\big]\Big)^{-1},$$

where $\mathbb{E}_{P_{\eta|\xi}}[\nabla^2_{12} g(x, y^*(x;\xi); \eta, \xi)]$ is in fact the transpose of $\mathbb{E}_{P_{\eta|\xi}}[\nabla^2_{21} g(x, y^*(x;\xi); \eta, \xi)]$.

## C Efficient Implementation for Hessian-Vector Products

Our goal is to compute $\hat c(x, y; \xi) = \nabla^2_{12} g(x, y; \eta'', \xi)\,\hat\Lambda(x, y; \xi)\,\nabla_2 f(x, y; \eta', \xi)$ in (8) efficiently.

Algorithm 4 Hessian-Vector Implementation for Computing $\hat c(x, y; \xi)$
Input: iterates $(x, y)$, samples $\xi$, $\eta'$, $\eta''$ and $\{\eta_n\}_{n=1}^{\hat N}$ from $P_{\eta|\xi}$, $\hat N \ge 1$.
C Efficient Implementation for Hessian-Vector Products

Our goal is to compute $\hat c(x, y; \xi) = \nabla^2_{12} g(x, y; \eta', \xi)\,\hat\Lambda(x, y; \xi)\,\nabla_2 f(x, y; \eta'', \xi)$ in (8) efficiently.

Algorithm 4 Hessian-Vector Implementation for Computing $\hat c(x, y; \xi)$
Input: iterates $(x, y)$, samples $\xi$, $\eta'$, $\eta''$, and $\{\eta_n\}_{n=1}^{\hat N}$ with $\eta_n \sim P_{\eta|\xi}$, $\hat N \ge 1$
1: Compute $r_0 = \nabla_2 f(x, y; \eta'', \xi)$
2: for $n = 1$ to $\hat N$ do
3: Get the Hessian-vector product $v_n = \nabla^2_{22} g(x, y; \eta_n, \xi)\,r_{n-1}$ via automatic differentiation with respect to $y$
4: Update $r_n = r_{n-1} - v_n / L_{g,1}$
5: end for
6: Update $r_{\hat N} \leftarrow (\hat N / L_{g,1})\,r_{\hat N}$
7: Get the Hessian-vector product $c = \nabla^2_{12} g(x, y; \eta', \xi)\,r_{\hat N}$ via automatic differentiation
Output: $c$

Remark: Steps 3 and 7 can be implemented at cost $O(d)$: we first compute the scalar inner product $\nabla_2 g(x, y; \eta_n, \xi)^\top r_{n-1}$ and then differentiate it (with respect to $y$ in Step 3 and with respect to $x$ in Step 7), which avoids forming any Hessian matrix explicitly; see the sketch below.
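The following PyTorch sketch illustrates how Algorithm 4 and the remark above could be realized with double backpropagation. It is a minimal illustration, not the authors' implementation; `f_fn` and `g_fn` are assumed callables returning the scalar losses $f(x, y; \eta, \xi)$ and $g(x, y; \eta, \xi)$ for fixed samples.

```python
import torch

def algorithm4_hvp(f_fn, g_fn, x, y, etas, eta_p, eta_pp, xi, L_g1):
    """Sketch of Algorithm 4: returns c = grad2_{12} g(x, y; eta', xi) r_Nhat,
    where r is built by the recursion of Steps 1-6."""
    x = x.detach().requires_grad_(True)
    y = y.detach().requires_grad_(True)
    # Step 1: r_0 = grad_y f(x, y; eta'', xi)
    r = torch.autograd.grad(f_fn(x, y, eta_pp, xi), y)[0].detach()
    # Steps 2-5: r_n = r_{n-1} - grad2_{22} g(x, y; eta_n, xi) r_{n-1} / L_{g,1}
    for eta_n in etas:
        g_y = torch.autograd.grad(g_fn(x, y, eta_n, xi), y, create_graph=True)[0]
        v = torch.autograd.grad(g_y @ r, y)[0]   # HVP: differentiate <grad_y g, r> over y
        r = (r - v / L_g1).detach()
    # Step 6: rescale by Nhat / L_{g,1}
    r = (len(etas) / L_g1) * r
    # Step 7: differentiate <grad_y g, r> over x to get grad2_{12} g r
    g_y = torch.autograd.grad(g_fn(x, y, eta_p, xi), y, create_graph=True)[0]
    return torch.autograd.grad(g_y @ r, x)[0]
```

Detaching `r` at every step keeps the computation graph small, so each iteration costs only one extra backward pass rather than a full Hessian evaluation.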
D Implementation Details

In this section, we provide all relevant implementation details. We fine-tune the stepsize for all approaches using the following strategy: we pick a fixed breakpoint $t_0$ and adjust the stepsize accordingly. Specifically, for the $t$-th outer iteration with $t \le t_0$, the stepsize is set to $1/\sqrt{t}$, while for iterations beyond $t_0$ it is set to $1/t$. We choose the breakpoint $t_0$ such that the training loss remains relatively stable as the number of outer iterations approaches $t_0$.

D.1 Meta-Learning

Let $D = (a, b)$ denote a training or testing sample, where $a \in \mathbb{R}^d$ is the feature vector and $b \in [C]$ is the label among $C$ categories. In this section, the loss $\ell_i(x, D)$ is the multi-class logistic loss
$$\ell(x; D) = -\bar b^\top x^\top a + \log\big(\mathbf{1}^\top e^{x^\top a}\big),$$
where $x \in \mathbb{R}^{d \times C}$ is the linear classifier and $\bar b \in \{0, 1\}^C$ is the one-hot vector of the label $b$.

The experiment uses the Tiny ImageNet dataset, which consists of 100,000 images belonging to 200 classes. After data pre-processing, each image is represented by a 512-dimensional feature vector. We split the dataset so that 90% of the images from each class form the training set and the remaining 10% the testing set. The meta-learning problem comprises 20 tasks, each involving the classification of 10 classes. Additionally, we set the hyper-parameter $\lambda = 2$.

Finally, we report additional meta-learning experiments in Figure 4. In the left plot, we examine the performance of MAML with the stepsize of the inner-level gradient update varied over {5e-3, 1e-2, 5e-2, 1e-1, 2e-1}. The plot shows that MAML performs similarly across small stepsizes, whereas its performance degrades for large stepsizes. In the right plot, we examine the performance of $m$-step MAML against the number of upper-level iterations. $m$-step MAML approximates problem (4) by replacing $y^*_i(x)$ with the $m$-step gradient update $\hat y_{i,m}(x)$, defined recursively through $\hat y_{i,0}(x) = x$ and
$$\hat y_{i,k}(x) = \hat y_{i,k-1}(x) - \gamma\,\nabla\Big[\ell_i\big(\hat y_{i,k-1}(x), D^{\mathrm{train}}_{i,k-1}\big) + \frac{\lambda}{2}\big\|\hat y_{i,k-1}(x) - x\big\|^2\Big]$$
with stepsize $\gamma$ and $D^{\mathrm{train}}_{i,k-1} \sim \rho_i$ for $k = 1, \ldots, m$. We take the number $m$ of inner-level gradient updates from {1, 4, 8, 12}. The plot shows that multi-step MAML performs better than one-step MAML but still does not outperform the RT-MLMC method.

Figure 4: Left: performance of one-step MAML with different stepsizes against upper-level iterations on meta-learning. Right: performance of $m$-step MAML against upper-level iterations on meta-learning. (Both panels plot the logistic loss over $10^4$ epochs and include RT-MLMC with $K_{\max} = 12$ as a reference.)

D.2 Wasserstein DRO with Side Information

In this subsection, we take the loss function $\ell(w, \eta) = h(w - \eta)_+ + b(\eta - w)_+$, where $h > 0$ is a constant representing the per-unit holding cost and $b > 0$ is a constant representing the per-unit backlog cost. Since Assumption 2.1 does not hold for this objective due to its non-smooth structure, we approximate the loss with the smoothed version
$$\ell_\beta(w, \eta) = \frac{h}{\beta}\log\big(1 + e^{\beta(w - \eta)}\big) + \frac{b}{\beta}\log\big(1 + e^{\beta(\eta - w)}\big),$$
where we set the hyper-parameter $\beta = 5$ to balance the trade-off between approximation quality and smoothness.

The synthetic dataset is generated as follows: the covariate $\xi$ is sampled from the 100-dimensional uniform distribution supported on $[-15, 15]^{100}$. The demand $\eta$ depends on $\xi$ in a nonlinear way:
$$\eta = f_{\mathrm{NN}}(x; \xi) + \epsilon, \quad \epsilon \sim N(0, 1), \quad f_{\mathrm{NN}}(x; \xi) = 10\,\mathrm{Sigmoid}\big(W_3\,\mathrm{ReLU}(W_2\,\mathrm{ReLU}(W_1\xi + b_1) + b_2) + b_3\big),$$
where the neural network parameter is $x := (W_1, b_1, \ldots, W_3, b_3)$. In particular, the ground-truth neural network parameter is generated using the uniform initialization procedure torch.nn.init.uniform_ in PyTorch. We quantify the performance of a given $x$ using the testing loss $\mathbb{E}_{(\xi,\eta)\sim P}[\ell(f(x; \xi), \eta)]$, where the expectation is estimated using sample average approximation based on $2 \times 10^5$ sample points. When creating the training dataset, we generate $M = 50$ samples of $\xi$, denoted $\{\xi_i\}_{i \in [M]}$, and for each $\xi_i$ we generate $m \in \{10, 30, 50, 100\}$ samples of $\eta$ from the conditional distribution $P_{\eta|\xi_i}$. When generating the left two plots of Fig. 3, we set $m = 100$.

When solving the WDRO baseline, we consider the formulation
$$\min_x \max_P \Big\{ \mathbb{E}_{(\xi,\eta)\sim P}\big[\ell_\beta(f(x; \xi); \eta)\big] - \lambda W(P, P_0) \Big\}, \tag{13}$$
where $W(\cdot, \cdot)$ is the standard Wasserstein distance using the same cost function as the causal transport distance. We apply the SGD algorithm developed in [Sinha et al., 2018] to solve the WDRO formulation; see Algorithm 5 for the detailed implementation. The hyper-parameter $\lambda$ for the WDRO-SI and WDRO formulations has been fine-tuned via grid search over the set {1, 10, 50, 100, 150} for optimal performance.

Algorithm 5 Gradient Descent Ascent Heuristic for Solving (13)
Input: number of outer iterations $T$, number of inner iterations $T_{\mathrm{in}}$, stepsizes $\{\alpha_t\}_{t=1}^{T}$ and $\{\beta_s\}_{s=1}^{T_{\mathrm{in}}}$, initial iterate $x_1$
1: for $t = 1$ to $T$ do
2: Sample $(\xi_t, \eta_t)$ from $P$ and initialize $\xi^t_1 = \xi_t$.
3: for $s = 1$ to $T_{\mathrm{in}}$ do
4: Generate $g_s = \nabla_\xi\big[\ell_\beta(f(x_t; \xi^t_s); \eta_t) - \lambda\|\xi^t_s - \xi_t\|^2\big]$.
5: Update $\xi^t_{s+1} = \xi^t_s + \beta_s\,\mathrm{Adam}(\xi^t_s, g_s)$.
6: end for
7: Compute $G_t = \nabla_x\big[\ell_\beta(f(x_t; \xi^t_{T_{\mathrm{in}}+1}); \eta_t)\big]$.
8: Update $x_{t+1} = x_t - \alpha_t G_t$.
9: end for
Output: $x_T$
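For concreteness, below is a minimal PyTorch sketch of this gradient descent ascent heuristic (our illustration, not the authors' code). It assumes `model` implements the decision map $\xi \mapsto f(x; \xi)$, `sample_pair` draws one training pair $(\xi_t, \eta_t)$, and constant stepsizes stand in for the schedules $\{\alpha_t\}$ and $\{\beta_s\}$.

```python
import torch
import torch.nn.functional as F

def smoothed_loss(w, eta, h=1.0, b=1.0, beta=5.0):
    # l_beta(w, eta) = (h/beta) log(1 + e^{beta(w-eta)}) + (b/beta) log(1 + e^{beta(eta-w)})
    return (h / beta) * F.softplus(beta * (w - eta)) + (b / beta) * F.softplus(beta * (eta - w))

def gda_wdro(model, sample_pair, lam, T=1000, T_in=15, alpha=1e-3, beta_s=1e-2):
    opt_x = torch.optim.SGD(model.parameters(), lr=alpha)
    for _ in range(T):
        xi, eta = sample_pair()                            # step 2: draw (xi_t, eta_t)
        xi_adv = xi.clone().detach().requires_grad_(True)  # inner iterate xi^t_s
        opt_xi = torch.optim.Adam([xi_adv], lr=beta_s)
        for _ in range(T_in):                              # steps 3-6: inner ascent over xi
            opt_xi.zero_grad()
            obj = smoothed_loss(model(xi_adv), eta).mean() - lam * torch.sum((xi_adv - xi) ** 2)
            (-obj).backward()                              # ascend by descending the negation
            opt_xi.step()
        opt_x.zero_grad()                                  # steps 7-8: outer descent step on x
        smoothed_loss(model(xi_adv.detach()), eta).mean().backward()
        opt_x.step()
    return model
```

Running the inner Adam ascent on a detached copy of the covariate keeps the adversarial perturbation from leaking gradients into the outer update, mirroring the two-timescale structure of Algorithm 5.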