# differentiable_distributionally_robust_optimization_layers__dc57f3c0.pdf Differentiable Distributionally Robust Optimization Layers Xutao Ma 1 Chao Ning 1 Wenli Du 2 In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i.e., how to differentiate decisions with respect to an ambiguity set. In this paper, we develop such differentiable DRO layers for generic mixed-integer DRO problems with parameterized second-order conic ambiguity sets and discuss its extension to Wasserstein ambiguity sets. To differentiate the mixed-integer decisions, we propose a novel dual-view methodology by handling continuous and discrete parts of decisions via different principles. Specifically, we construct a differentiable energy-based surrogate to implement the dual-view methodology and use importance sampling to estimate its gradient. We further prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments. 1. Introduction In real-world scenarios, decision-making problems are typically affected by uncertainties. Therefore, machine learning techniques are usually leveraged to predict the behavior of Corresponding Author 1Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China 2The Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China. Correspondence to: Chao Ning . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). the uncertainty, and then this prediction is passed to an optimization problem to derive decisions (Ning & You, 2019). Conventionally, the learning model is trained by minimizing a prediction loss, i.e., in a prediction-focused way. In recent years, decision-focused learning, also known as smart predict-and-optimize in operations research (Elmachtoub & Grigas, 2022), has received much research interest (Mandi et al., 2023; Sadana et al., 2023). Different from prediction-focused learning, decision-focused learning aims to train a learning model that minimizes a decision loss, i.e., improving the decision quality. To implement decisionfocused learning, differentiable optimization layers play the central role of passing gradient information from the decision back to the learning model, and this is achieved by differentiating the decision with respect to the learning target. From the learning side, the learning target of most differentiable optimization layer research is a point prediction of uncertain quantity, and some research learns to predict the distribution of uncertainty. However, in prior research, the robustness of prediction is typically ignored, so the decision made based on this prediction is also in lack of robustness. As an emerging paradigm for robust decision-making, distributionally robust optimization (DRO) has seen a boom in both theory and applications in recent years (Delage & Ye, 2010; Wiesemann et al., 2014; Mohajerin Esfahani & Kuhn, 2018; Rahimian & Mehrotra, 2022). Therefore, to improve decision quality while preserving robustness, developing a differentiable DRO layer to learn the ambiguity set in a decision-focused way is highly desired but has not been investigated yet. From the optimization side, most differentiable optimization layer research focuses on either pure continuous or pure discrete decisions. However, the decisions in practical problems are typically mixed-integer. Only Ferber et al. (2020) developed a mixed-integer linear program (MILP) layer. However, their approach relies on the specific solution structure of linear program (LP). Therefore, how to differentiate the mixed-integer decisions for generic mixed-integer convex optimization remains an unsolved problem. To fill the aforementioned research gaps, this paper develops the first differentiable DRO layers with mixed-integer decisions. That is, the learning target is an ambiguity set Differentiable Distributionally Robust Optimization Layers Figure 1. Sequential learning and decision-making pipeline. and the output decisions are mixed-integer. The ambiguity set we mainly focus on is the class of parameterized secondorder conic (SOC) ambiguity set (Bertsimas et al., 2019), which is widely adopted in various applications (Zhou et al., 2019; Zhang et al., 2022; Yang et al., 2023), and Wasserstein ambiguity set is also discussed in Appendix C. The major contributions of this paper are summarized as follows. We develop the first generic differentiable DRO layers, which enable integrating learning and distributionally robust decision-making via gradient descent. We propose a novel dual-view methodology to differentiate the mixed-integer decisions. We note that this methodology can be applied to develop any mixedinteger convex optimization layer, not limited to the proposed DRO layers. We construct a differentiable energy-based surrogate value function to implement the dual-view methodology and use importance sampling to estimate its gradient. In theory, we prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline, which is of interest in its own right, for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments. 2. Related Literature We first review existing work on differentiable optimization layers with pure continuous and pure discrete decisions. Convex optimization layers. To differentiate continuous decisions of constrained optimization, the basic idea is to apply the implicit differentiation theorem to the optimality conditions. Following this idea, Amos & Kolter (2017) successfully differentiated through constrained quadratic programs. Differentiating through LP was achieved by adding regulation terms in Wilder et al. (2019) and Mandi & Guns (2020). For linear conic programming, the optimality condition was derived by leveraging the homogeneous self-dual embedding technique (Busseti et al., 2019), and then the implicit differentiation was applied (Agrawal et al., 2019b). Finally, Agrawal et al. (2019a) aggregated all these work and developed the differentiable convex optimization layer package cvxpylayers. Combinatorial optimization layers. To handle the nondifferentiability of discrete decisions, Berthet et al. (2020) developed differentiable surroagte solution by adding perturbation. Similar ideas also appeared in Niepert et al. (2021) and Poganˇci c et al. (2020). We refer to Dalle et al. (2022) for a review of this perturbation technique. Aside from the differentiable optimization layer approach that manages to differentiate the decision, some research constructs a surrogate loss to circumvent difficulty. Surrogate loss approach. The seminal work Elmachtoub & Grigas (2022) developed a surrogate SPO+ loss. Shah et al. (2022) and Zharmagambetov et al. (2023) constructed a training-based surrogate loss. Kong et al. (2022) developed a surrogate loss for stochastic programming (SP) by using an energy-based model. For combinatorial optimization problems, Mulamba et al. (2020) and Mandi et al. (2022) constructed surrogate loss functions by maximizing the probability of the ground-truth optimal decision. From the perspective of the learning target, most of the work mentioned above only considered point prediction, except for Donti et al. (2017) and Kong et al. (2022), which learned conditional distributions. Chenreddy et al. (2022) and Sun et al. (2023) investigated prediction-focused learning methods for uncertainty sets, and Wang et al. (2023) developed a learning method for robust optimization (RO) based on an augmented Lagrangian method. Very recently, Chenreddy & Delage (2024) developed an end-to-end learning method for robust optimization. Perhaps the most relevant work to this paper is Costa & Iyengar (2023), which to the best of our knowledge is the only research on distributionally robust decision-focused learning. However, their framework presumes the uncertainty distribution to have a residual structure and only applies to specific financial problems with continuous decisions. On the contrary, our differentiable DRO layers apply to general distributions and a broad family of optimization problems with mixed-integer decisions. 3. Background In this section, we provide some background information on the topic of this paper. Differentiable Distributionally Robust Optimization Layers 3.1. Decision-Making under Uncertainty A typical sequential learning and decision-making pipeline is shown in Figure 1, where the decision-maker first leverages a learning model Mϕ to predict some information U concerning the uncertainty y from covariate z. Such information U can be a point prediction, conditional distribution, uncertainty set, or ambiguity set of the uncertainty y. Subsequently, the decision-maker takes U as a parameter and solves a constrained optimization problem minx X f(x, U ) to derive the decision x . Depending on the form of U , this constrained optimization problem can be deterministic optimization, SP, RO, or DRO. Finally, after the decision is made, the uncertainty y is revealed and the decision loss l(x , y) is realized. 3.2. Decision-Focused Learning In the conventional prediction-focused approach, the learning model is trained independently of the subsequent optimization process. On the contrary, in decision-focused learning, the learning model is trained by directly minimizing the decision loss, which can be formally expressed as the following bilevel problem. min ϕ Φ E(z,y) P l x (Mϕ(z)), y s.t. x (Mϕ(z)) = arg min x X f(x, U = Mϕ(z)) (1) where ϕ is the parameter of the learning model Mϕ we want to train, P is the joint distribution of covariate z and uncertainty y, and here we assume (1) is well-defined, i.e., the solution set of the argmin operator is a singleton. To optimize this bilevel problem by gradient descent, it necessitates the computation of the following gradient. l x (Mϕ(z)), y = l x (Mϕ(z)), y Mϕ(z) Mϕ(z) where the first and last terms are easy to compute. However, the existence of argmin operator poses great difficulty in the computation of the middle term x U , so the goal of a differentiable optimization layer is to compute this term. In this paper, we aim to develop a differentiable DRO layer, i.e., the learning target U is an ambiguity set and minx X f(x, U ) is a DRO problem. Therefore, the goal is to develop a method to differentiate the mixed-integer decision x with respect to the ambiguity set U , i.e., computing x 3.3. Distributionally Robust Optimization The DRO takes an ambiguity set as the parameter and outputs a decision by optimizing the following problem. x (U ) = arg min x X f(x, U ) := max P U Ey P[c(x, y)] (3) where the cost function c is usually taken as the decision loss l and f is often referred to as worst-case expectation . 4. Differentiable Distributionally Robust Optimization Layers Since the space of all ambiguity sets is infinite-dimensional, directly learning in this space is generally computationally impossible. Therefore, we focus on the class of parameterized SOC ambiguity sets, which stem from the well-known SOC ambiguity set (Bertsimas et al., 2019). To define the parameterized SOC ambiguity set, we first introduce the following differentiable parameterized secondorder cone representable set, which is an extension of the conventional second-order cone representable set (see Appendix A.1). Definition 4.1. A set W (θ) RK is a differentiable parameterized second-order cone representable set with parameter θ if there exists a collection of J second-order cone inequalities such that y W (θ) v : Aj(θ) bj(θ) Lmj 0, j [J] where Lmj represents a mj dimensional second-order cone and matrixes Aj(θ) and vectors bj(θ) are differentiable functions of θ. Now we define the parameterized SOC ambiguity set. Definition 4.2. An ambiguity set U (θ) is a parameterized SOC ambiguity set with parameter θ if it can be expressed as follows. EP[gi(y, αi)] σi, i [I] where P is a distribution of y, θ = (α1, σ1, , αI, σI), support Ξ RK of the uncertainty is a second-order cone representable set, and the epigraph of each gi, epi gi = {(y, u)|u gi(y, αi)} (5) is a differentiable parameterized second-order cone representable set with parameter αi By selecting functions gi, the parameterized SOC ambiguity set can characterize a variety of distributional features. We Differentiable Distributionally Robust Optimization Layers present some examples of the parameterized SOC ambiguity set in Appendix A.2 and see also Bertsimas et al. (2019). For the cost function c(x, y) in Equation (3), we consider both the singleand two-stage cost functions. Assumption 4.3. The cost function c(x, y) admits a singlestage formulation (i) without recourse or a two-stage formulation (ii) with relatively complete recourse as follows. (i). Single-stage formulation: c(x, y) = PK k=1 ck(x)yk, where the epigraph of each function ck is a second-order cone representable set and the uncertainty y is required to be non-negative. (ii). Two-stage formulation: c(x, y) is the optimal value of an LP, i.e., c(x, y) = min γ 0 q T γ s.t. T (y)x + W γ = h(y) (6) where T (y) = T 0 + PK k=1 T kyk, h(y) = h0 + PK k=1 hkyk, and the constraint in (6) is feasible for all x X and y Ξ. To derive a tractable reformulation of the DRO problem (3), we need the following regularity assumption on the parameterized SOC ambiguity set, and the detailed explanation of Assumption 4.4 is presented in Appendix A.3. Assumption 4.4. Slater s condition holds for the parameterized SOC ambiguity set U (θ). Now we can state the following reformulation theorem. Theorem 4.5. Suppose U (θ) is a parameterized SOC ambiguity set and Assumption 4.3 and Assumption 4.4 hold, then the worst-case expectation f(x, U (θ)) = max P U (θ) Ey P[c(x, y)] is a linear second-order cone program. Proof. See Appendix B.1 Since the parameterized SOC ambiguity set is determined by the finite-dimensional parameter θ, it suffices to use the learning model Mϕ(z) to learn the parameter θ, i.e., θ = Mϕ(z). (7) Therefore, the goal of the differentiable DRO layer comes down to computing x By Theorem 4.5, the worst-case expectation function f is a linear second-order cone programming. Therefore, if the decision x is continuous, the gradient x θ can be directly computed by the technique of differentiating through a cone program (Agrawal et al., 2019b). We formally state this result in Theorem 4.6. Theorem 4.6. Suppose conditions in Theorem 4.5 hold and x is a continuous variable with X a second-order cone representable set, then x = arg minx X f(x, U (θ)) is differentiable with respect to θ. Proof. See Appendix B.2 However, when the decision x is mixed-integer, it is inherently non-differentiable due to the discreteness of integer variables. Therefore, it is necessary to develop a new methodology to handle mixed-integer decisions. 4.1. Dual-View of Mixed-Integer Decisions Existing research on differentiable optimization layers all views the decision x as a function of the parameter and handles it by the principle of automatic differentiation. However, this view does not work for discrete decisions. Although some research manages to differentiate the discrete decisions by adding perturbations (Berthet et al., 2020), these approaches are still restricted to integer linear programming. Alternatively, by examining the bilevel formulation (1) of the whole decision-focused learning task, we notice that the decision-making process is the lower-level problem. Therefore, the decision x can be viewed as a constraint, and we can handle it via the principle of constrained optimization. By the above observations, we propose the following dualview methodology to address mixed-integer decisions. Dual-View Methodology: I The continuous part of the decisions is viewed as a function of parameters and handled via the principle of automatic differentiation. II The discrete part of the decisions is viewed as a constraint of the whole bilevel learning problem and handled via the principle of constrained optimization. In this dual-view methodology, we have already established part I in Theorem 4.6. To better illustrate the idea of part II, we make the following assumptions and notations. Assumption 4.7. x = (xd, xc) is a mixed-integer variable with discrete part xd {0, 1}n1 and continuous part xc Rn2. The feasible region of x is X = X ({0, 1}n1 Rn2), where X is a second-order cone representable set. We denote by Xd the feasible region of the discrete part of variable x, i.e., Xd = {xd {0, 1}n1| xc Rn2 : (xd, xc) X}, and by Xc(xd) the feasible region of the continuous part variable xc given the integer part xd Xd, i.e., Xc(xd) = {xc Rn2|(xd, xc) X}. The next assumption ensures that the bilevel problem (1) is well-defined, and see Appendix A.4 for a detailed discussion of this assumption. Assumption 4.8. (i) For all ϕ Φ, z Z, and xd Xd, the optimal continuous solution x c(xd, Mϕ(z)) := arg min xc Xc(xd) f (xd, xc), Mϕ(z) Differentiable Distributionally Robust Optimization Layers (ii) For all ϕ Φ, the optimal integer solution x d(Mϕ(z)) := arg min xd Xd f (xd, x c(xd, Mϕ(z)), Mϕ(z) is unique almost surely, i.e., ϕ Φ, Pz x d(Mϕ(z)) is a singleton = 1 where Pz is the marginal distribution of covariate z. By the above assumptions, the bilevel problem (1) can be reformulated as follows. Corollary 4.9. Suppose Assumption 4.7 and Assumption 4.8 hold, the decision-focused learning can be formulated as min ϕ Φ E(z,y) P l x d(Mϕ(z)), x c(x d, Mϕ(z)) , y s.t. x d(Mϕ(z)) = arg min xd Xd f (xd, x c(xd, Mϕ(z))), Mϕ(z) (8) From the bilevel form (8) we can see more clearly the idea of dual-view methodology. The optimal continuous decision x c is embedded as a function of the integer decision and learning target Mϕ(z). On the contrary, the optimal integer solution x d is explicitly expressed as a constraint. Let R(ϕ) denote the value function of problem (8), i.e., R(ϕ) := E(z,y) P l x d(Mϕ(z)), x c(x d, Mϕ(z)) , y Then problem (8) is equivalent to minϕ Φ R(ϕ). Following Part II in the dual-view methodology, we handle the constrained optimization (8) by approximating the value function sequentially. That is, we want to construct a sequence of differentiable surrogate value functions Rλ(ϕ) such that Rλ(ϕ) R(ϕ) in some sense. 4.2. Energy-Based Surrogate Value Function To construct such a surrogate function Rλ(ϕ), we first construct point surrogate function rλ(Mϕ(z), y) for each point (z, y) and then define Rλ(ϕ) as Rλ(ϕ) = E(z,y) P rλ(Mϕ(z), y) (9) We construct the point surrogate function rλ(ϕ, z, y) by leveraging the energy-based model. Specifically, we assign each feasible integer decision xd Xd the following energy function E(xd, Mϕ(z), λ), where λ is a positive scalar. E(xd,Mϕ(z), λ) =exp f((xd, x c(xd, Mϕ(z)), Mϕ(z)) Based on the energy function, we can define a distribution p(xd|Mϕ(z), λ) over Xd. p(xd|Mϕ(z), λ) = E(xd, Mϕ(z), λ) P x d Xd E(x d, Mϕ(z), λ) (11) We then construct point surrogate function rλ(ϕ, z, y) as follows. rλ(Mϕ(z), y) := Exd p(xd|Mϕ(z),λ)l((xd, x c(xd, Mϕ(z))), y) (12) xd Xd p(xd|Mϕ(z), λ)l((xd, x c(xd, Mϕ(z))), y) By the above construction, we notice that the optimal integer solution x d(Mϕ(z)) has the largest energy, so the corresponding probability p(x d|Mϕ(z), λ) is also the highest. When λ 0+, this probability will converge to 1, and rλ(ϕ, z, y) will also converge to the true decision loss l (x d(Mϕ(z)), x c(x d, Mϕ(z))), y . To further establish convergence results of the surrogate value function Rλ(ϕ), we need the following continuity assumptions and the concept of epi-convergence. Assumption 4.10. The decision loss l((xd, xc), y) is bounded and continuous in xc. For any z Z, f((xd, xc), Mϕ(z)) is continuous in xc and ϕ. Assumption 4.11. For all xd Xd and z Z, the optimal continuous decision x c(xd, Mϕ(z)) is continuous in ϕ. We note that Assumption 4.10 is easily satisfied. For Assumption 4.11, since x c(xd, Mϕ(z)) is differentiable with respect to θ = Mϕ(z) by Theorem 4.6, Assumption 4.11 simply requires the continuity of Mϕ in its parameter ϕ. Definition 4.12 (Bonnans & Shapiro (2013), p.41). A sequence of functions {Rn(ϕ)} epi-converges to a function R(ϕ) if and only if ϕ Φ, condition (i) and (ii) hold. (i) For any sequence {ϕn} converges to ϕ, lim infn Rn(ϕn) R(ϕ) (ii) There exists a sequence {ϕn} converging to ϕ such that lim supn Rn(ϕn) R(ϕ) The epi-convergence of Rλ(ϕ) and asymptotic convergence of optimal solution are established in Theorem 4.13. Theorem 4.13. Suppose Assumption 4.7, Assumption 4.8, Assumption 4.10, and Assumption 4.11 hold, then for any sequence λn 0+ as n , the following two assertions hold for the energy-based surrogate value function Rλ(ϕ). (i) Rλn(ϕ) epi-converges to R(ϕ) as n . (ii) if ϕλnk arg minϕ Φ Rλnk (ϕ) for some sub-sequence {nk} N and {ϕλnk } converges to a point ϕ , then Differentiable Distributionally Robust Optimization Layers Figure 2. Decision-focused learning pipeline for contextual distributionally robust decision-making. ϕ arg minϕ Φ R(ϕ) and limk infϕ Φ Rλnk (ϕ) = infϕ Φ R(ϕ) Proof. See Appendix B.3 4.3. Gradient Estimation According to Theorem 4.13, the optimal parameter ϕ of the learning model Mϕ can be derived by optimizing the surrogate value functions Rλ(ϕ) sequentially. We next show in Theorem 4.14 that the surrogate value function Rλ(ϕ) is differentiable, so it can be optimized via gradient descent. Theorem 4.14. Suppose conditions in Theorem 4.5 and Theorem 4.13 hold, then Rλ(ϕ) is differentiable with respect to ϕ and the gradient is ϕ = E(z,y) P where θ = Mϕ(z) is the learning target, and rλ(θ,y) θ can be computed by " E (xd, θ, λ) E(xd, θ, λ) l xd, xc (xd, θ) , y # E (xd, θ, λ) E(xd, θ, λ) l xd, xc (xd, θ) , y " l xd, xc (xd, θ) , y xc xc (xd, θ) (14) where p = p(xd|θ, λ) and E (xd, θ, λ) = E(xd,θ,λ) Proof. See Appendix B.4 Note that in the last term of Equation (14), xc (xd,θ) θ is the gradient of continuous decision with respect to the parameter, which is exactly what we develop in Theorem 4.6. The gradient (14) can be estimated by sampling from distribution p(xd|θ, λ). However, direct sampling from p(xd|θ, λ) necessitates the computation of the normalizer in Equation (11), which requires the calculation of the energy function of all the feasible integer solutions. To avoid this problem, we adopt the self-normalized importance sampling method (See Appendix A.5). To construct a proposal distribution q that resembles p(xd|θ, λ), we first derive T integer solutions C = {xd1, , xd T } with the largest energy functions by solving f for T times (See Appendix A.6 for readers not familiar with this oracle) and then construct the proposal distribution q as follows. q(xd) = E(xd, θ, λ) P t [T ] E(xdt, θ, λ) + M(|Xd| T), xd C q(xd) = M P t [T ] E(xdt, θ, λ) + M(|Xd| T), xd / C where M is a constant that can be understood as the energy of other integer solutions. Therefore, each term in Equation (14) can be estimated unbiasedly by sampling from q. 5. Application: Contextual Distributionally Robust Decision-Making As an application of the differentiable DRO layers in Section 4, we develop a decision-focused learning pipeline for contextual distributionally robust decision-making tasks (Bertsimas & Van Parys, 2022; Wang et al., 2021; Yang et al., 2022). In this paper, we mainly focus on and develop a decisionfocused learning method for DRO with SOC ambiguity set, but in fact, the proposed DRO Layer technique can also be extended to the Wasserstein ambiguity set and we discuss this issue in Appendix C. 5.1. Decision-Focused Learning Pipeline The proposed pipeline is illustrated in Figure 2. A learning model Mϕ is first leveraged to learn the ambiguity set parameter θ from covariate z. However, the output parameter θ provided by the learning model can lead to an empty ambiguity set, i.e., U (θ) = , and this problem typically happens when the learning model is a neural network (NN). Differentiable Distributionally Robust Optimization Layers To fix this problem, we add a projection layer after the learning layer. The projection layer takes θ as input and outputs θproj such that U (θproj) is always non-empty. To achieve this, we construct the projection layer as follows. θproj := arg min θproj θproj θ s.t. Qz U (θproj) (16) In the constraint of (16), we explicitly require that a distribution Qz lies in the parameterized SOC ambiguity set U (θproj), which ensures the non-emptyness of U (θproj). This distribution Qz should be understood as an estimation of the conditional distribution of uncertainty y given covariate z. To construct such a conditional distribution estimation Qz, we take the idea in Bertsimas & Kallus (2020), which constructed the conditional distribution from data (zn, yn), n [N] in a weighted sample average way as follows. n=1 ωn(z)δyn, ωn(z) 0, n=1 ωn(z) = 1 (17) where δ[ ] is the Dirac delta function. In (17), the weight ωn(z) can be intuitively understood as a measurement of closeness between z and data zn. Some research papers provide such weight functions to choose from (Bertsimas & Kallus, 2020; Kallus & Mao, 2023), for example, the k-nearest-neighbors weight function. With the formulation (17) of Qz, the constraint of (16) is equivalent to n=1 ωn(z)gi(yn, αi) σi, i [I] (18) Further, if gi, i [I] are selected as in Appendix A.2, then (18) can be reformulated into finitely many second-order cone constraints (See Appendix A.7 for this result). Therefore, the projection layer is a convex optimization layer. After projection, θproj is fed into the DRO layer to construct the surrogate value function Rλ(ϕ), which is usually estimated by data of a certain batch size. Then, the backpropagation and parameter update processes are conducted. 5.2. Prediction-Focused Pre-training If the learning model is very complicated, for example, a deep neural network, it can be hard to train it directly via the decision-focused learning pipeline. Therefore, to facilitate convergence, we first pre-train the learning model in a prediction-focused fashion. In Definition 4.2, function gi(y, αi) captures distributional features of the uncertainty y. Therefore, lower Ey Q[gi(y, αi)] can be deemed as better characterization of these distributional features, i.e., better prediction. By the above observation, we define the loss function of prediction-focused pre-training as EQz[gi(y, αi)] + EQz[gi(y, αi)] σi where Qz is defined in Equation (17). 6. Experiments To validate the effectiveness of the proposed differentiable DRO layers, we conduct experiments on a toy example and the portfolio management problem1. In both experiments, problems are formulated in a contextual DRO setting and thus we can apply the decision-focused learning pipeline developed in Section 5. The detailed experiment setup is presented in Appendix D. 6.1. Toy Example: Multi-item Newsvendor Problem We consider a multi-item newsvendor problem (19) where two options are provided for buying each item, i.e., retail and wholesale. The wholesale price ad i is lower than the retail price ac i but it can only be sold at a fixed amount vi. i=1 ac ixc i + ad i vixd i + bi(yi xc i vixd i )+ + di(xc i + vixd i yi)+ s.t. xc i 0, xd i {0, 1}, i [n] (19) where yi is the demand of item i, the continuous variable xc denotes amount of item bought from retail, integer variable xd denotes the wholesale option, b is the unit price of additional ordering, and d is the unit holding cost. To characterize the uncertainty in demand yi, we consider the following three types of parameterized SOC ambiguity sets, where firstand second-order moment features are characterized. UI(µI, σI) = EP [ y µI 1] σI UII(µII, σII) = EP h y µII 2 2 i σII UIII(µI, σI, µII, σII) = UI(µI, σI) \ UII(µII, σII) (SOC-III) 1Source code of all the experiments is available at https://github.com/DOCU-Lab/Differentiable DRO Layers. Differentiable Distributionally Robust Optimization Layers Figure 3. Experimental results on the multi-item newsvendor problem. The ambiguity set parameters are learned by NN in all the experiments. We compare the proposed decision-focused learning method (Section 5.1) with the prediction-focused learning method, which is described in Section 5.2. To fully verify the superior performance of the decisionfocused learning method, we further compare it with the prediction-focused benchmark. The parameters of the ambiguity set in the prediction-focused benchmark are selected with the full knowledge of the conditional distribution p(y|z). For example, in SOC-I, the parameters µI and σI are directly set to the mean and first-order absolute central moment of the conditional distribution p(y|z). Therefore, the performance of the prediction-focused benchmark is the optimal performance that the prediction-focused learning method can expect. In experiments, we take n = 4 and conduct 10 runs for cases of different training data sizes. We use the percentage optimality gap, whose definition can be found in Appendix D, to measure the performance of each method. The experiment results are presented in Figure 3. By Figure 3, the performance of prediction-focused learning will converge to the prediction-focused benchmark as the training data size grows. On the contrary, by directly minimizing the decision loss, the proposed decision-focused learning method demonstrates better performance. Quantitatively, the proposed decision-focused learning method demonstrates average improvements of 21.4%, 18.7%, and 18.1% compared with the prediction-focused benchmark in the three ambiguity sets, respectively. 6.2. Portfolio Management Problem As mentioned in the literature review, Costa & Iyengar (2023) also developed an end-to-end DRO method for portfolio management problems, but their method only applies to pure continuous decision cases. Therefore, in this part, we make a direct comparison with the method proposed in Costa & Iyengar (2023) on the portfolio management problem with pure continuous decisions, and then we further conduct experiments with mixed-integer decisions. 6.2.1. PORTFOLIO MANAGEMENT PROBLEM WITH PURE CONTINUOUS DECISIONS The portfolio management problem (20) with pure continuous decisions aims to select the optimal portfolio x Rn that minimizes the cost and the uncertainty comes from the return y. min x y T x s.t. 1T x = 1, x 0 (20) To characterize the uncertainty in return y, we consider the following parameterized SOC ambiguity set. EP h yi µi 2 2 i σi, i [n] In experiments, we take the dimension n of assets to 40. For learning the ambiguity set parameter, we use the same 2-layer neural network in both our method and the method proposed in Costa & Iyengar (2023) for a fair comparison. We use 2,500 data for training, 500 data for validation, and 1,000 data for testing. The average percentage profit of the proposed decision-focused learning method, predictionfocused learning method, and the method proposed in Costa & Iyengar (2023) are 0.289, 0.082, and 0.268, respectively. We also plot the wealth evolution in Figure 4. This superiority in performance is ascribed to the fact that a restrictive assumption on the structure of uncertainty distribution is presumed in the method of Costa & Iyengar (2023) while our method applies to general distributions. 6.2.2. PORTFOLIO MANAGEMENT PROBLEM WITH MIXED-INTEGER DECISIONS We further conduct experiments on the mixed-integer portfolio management problem (22), where some of the assets Differentiable Distributionally Robust Optimization Layers Figure 4. Wealth evolution on a 40-dimensional continuous portfolio management problem using decision-focused learning, prediction-focused learning, and method proposed in Costa & Iyengar (2023). (In the legend, DFL stands for decision-focused learning, and PFL stands for prediction-focused learning.) are only allowed to be bought with either a fixed amount or 0. Thus, the decisions on these assets are binary variables. min xc,xd y T c xc y T d diag(v)xd s.t. 1T xc + v T xd = 1, xc 0, xd {0, 1} (22) where yc, yd are returns corresponding to assets with continuous decisions xc and binary decisions xd, and v denotes the fixed amount of assets allowed to be bought. In experiments, the SOC ambiguity set is set to (21) if not explicitly specified and the number of binary decisions is set to 1 5 of the number of the problem dimension. Performance with Different Dimensions: We compare the performance of the proposed decision-focused learning method with prediction-focused learning method on mixed-integer portfolio management problems of different dimensions, and the results are presented in Table 1, where problem dimension refers to the number of assets. Generally, the performance gap between decision-focused and predict-focused learning methods scales as the problem dimension becomes larger. The advantage of decision-focused method becomes more apparent as the problem dimension increases. Table 1. Average percentage profit of decision-focused learning and prediction-focused learning on mixed-integer portfolio management problems with different dimensions. Problem dimension Method Improvement DFL PFL 20 0.1561 0.0583 167% 40 0.1763 0.0634 178% 60 0.2145 0.0479 347% Performance with Different SOC Ambiguity Sets: The the SOC ambiguity set (4) is determined by the constraints gi, which significantly affect the performance. Therefore, we conduct experiments on 60-dimensional mixed-integer portfolio management problems with three SOC ambiguity sets with different numbers of constraints. The detailed information of these three ambiguity sets is presented in Appendix D. The performance of decision-focused and predictionfocused learning methods using these three types of ambiguity sets are presented in Table 2. With more complex ambiguity sets, both decision-focused and prediction-focused methods have better performance, and the improvement for decision-focused learning generally grows. Table 2. Average percentage profit of decision-focused learning and prediction-focused learning on 60-dimensional mixed-integer portfolio management problem with different SOC ambiguity sets. SOC constraint No. Method Improvement DFL PFL 15 0.0762 0.0357 113% 30 0.1349 0.0491 174% 60 0.2145 0.0479 347% 7. Discussion of Limitations The major limitation of applying the proposed DRO-Layer method to large-scale problems lies in the heavy computational burden. Specifically, the decision-focused learning pipeline we developed in Section 4 can be decomposed into four processes: 1. learning layer; 2. projection layer; 3. solving MICP; 4. DRO Layer. Processes 2 and 4 are built on Cvxpylayers (Agrawal et al., 2019a), and Process 3 is built on commercial solver Gurobi. To test the computational efficiency and scalability, we conduct experiments and present the computational time of each of these four processes in Appendix E. It is noteworthy that both Cvxpylayers and Gurobi are built on CPU rather than on GPU, thereby leading to computational inefficiency and relatively weak scalability inevitably. If GPU training is allowable in these software, we believe the computation will not be a burden. 8. Conclusion We developed the first generic differentiable DRO layers, where a novel dual-view methodology was proposed to handle the mixed-integer decision via distinct principles. Based on the proposed differentiable DRO layers, we further developed a decision-focused learning pipeline for contextual DRO problems and verified its effectiveness in experiments. Differentiable Distributionally Robust Optimization Layers Acknowledgements This work was supported in part by the National Natural Science Foundation of China under Grant 62103264, and in part by the National Natural Science Foundation of China (Basic Science Center Program) under Grant 61988101. Impact Statement This paper presents work whose goal is to advance the integration of DRO and machine learning. None of the potential impacts of our work we feel must be specifically highlighted here. Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. Advances in neural information processing systems, 32, 2019a. Agrawal, A., Barratt, S. T., Boyd, S. P., Busseti, E., and Moursi, W. M. Differentiating through a cone program. Journal of Applied and Numerical Optimization, 2019b. Amos, B. and Kolter, J. Z. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136 145. PMLR, 2017. Ben-Tal, A. and Nemirovski, A. Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, 2001. Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J.- P., and Bach, F. Learning with differentiable pertubed optimizers. Advances in neural information processing systems, 33:9508 9519, 2020. Bertsimas, D. and Kallus, N. From predictive to prescriptive analytics. Management Science, 66(3):1025 1044, 2020. Bertsimas, D. and Van Parys, B. Bootstrap robust prescriptive analytics. Mathematical Programming, 195(1): 39 78, 2022. Bertsimas, D., Sim, M., and Zhang, M. Adaptive distributionally robust optimization. Management Science, 65(2): 604 618, 2019. Bonnans, J. F. and Shapiro, A. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013. Busseti, E., Moursi, W. M., and Boyd, S. Solution refinement at regular points of conic problems. Computational Optimization and Applications, 74:627 643, 2019. Chenreddy, A. R. and Delage, E. End-to-end conditional robust optimization. Ar Xiv, abs/2403.04670, 2024. Chenreddy, A. R., Bandi, N., and Delage, E. Data-driven conditional robust optimization. Advances in Neural Information Processing Systems, 35:9525 9537, 2022. Costa, G. and Iyengar, G. N. Distributionally robust end-toend portfolio construction. Quantitative Finance, 23(10): 1465 1482, 2023. Dalle, G., Baty, L., Bouvier, L., and Parmentier, A. Learning with combinatorial optimization layers: a probabilistic approach. ar Xiv preprint ar Xiv:2207.13513, 2022. Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595 612, 2010. Donti, P., Amos, B., and Kolter, J. Z. Task-based end-to-end model learning in stochastic optimization. Advances in neural information processing systems, 30, 2017. Elmachtoub, A. N. and Grigas, P. Smart predict, then optimize . Management Science, 68(1):9 26, 2022. Ferber, A., Wilder, B., Dilkina, B., and Tambe, M. Mipaal: Mixed integer program as a layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 1504 1511, 2020. Kallus, N. and Mao, X. Stochastic optimization forests. Management Science, 69(4):1975 1994, 2023. Kong, L., Cui, J., Zhuang, Y., Feng, R., Prakash, B. A., and Zhang, C. End-to-end stochastic optimization with energy-based model. Advances in Neural Information Processing Systems, 35:11341 11354, 2022. Mandi, J. and Guns, T. Interior point solving for lp-based prediction+ optimisation. Advances in Neural Information Processing Systems, 33:7272 7282, 2020. Mandi, J., Bucarey, V., Tchomba, M. M. K., and Guns, T. Decision-focused learning: through the lens of learning to rank. In International Conference on Machine Learning, pp. 14935 14947. PMLR, 2022. Mandi, J., Kotary, J., Berden, S., Mulamba, M., Bucarey, V., Guns, T., and Fioretto, F. Decision-focused learning: Foundations, state of the art, benchmark and future opportunities. ar Xiv preprint ar Xiv:2307.13565, 2023. Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115 166, 2018. Differentiable Distributionally Robust Optimization Layers Mulamba, M., Mandi, J., Diligenti, M., Lombardi, M., Bucarey, V., and Guns, T. Contrastive losses and solution caching for predict-and-optimize. ar Xiv preprint ar Xiv:2011.05354, 2020. Niepert, M., Minervini, P., and Franceschi, L. Implicit mle: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567 14579, 2021. Ning, C. and You, F. Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming. Computers & Chemical Engineering, 125:434 448, 2019. Poganˇci c, M. V., Paulus, A., Musil, V., Martius, G., and Rolinek, M. Differentiation of blackbox combinatorial solvers. In International Conference on Learning Representations, 2020. Rahimian, H. and Mehrotra, S. Frameworks and results in distributionally robust optimization. Open Journal of Mathematical Optimization, 3:1 85, 2022. Sadana, U., Chenreddy, A., Delage, E., Forel, A., Frejinger, E., and Vidal, T. A survey of contextual optimization methods for decision making under uncertainty. ar Xiv preprint ar Xiv:2306.10374, 2023. Shah, S., Wang, K., Wilder, B., Perrault, A., and Tambe, M. Decision-focused learning without decision-making: Learning locally optimized decision losses. Advances in Neural Information Processing Systems, 35:1320 1332, 2022. Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on stochastic programming: modeling and theory. SIAM, 2021. Sun, C., Liu, L., and Li, X. Predict-then-calibrate: A new perspective of robust contextual lp. ar Xiv preprint ar Xiv:2305.15686, 2023. Wang, I., Becker, C., Van Parys, B., and Stellato, B. Learning for robust optimization. ar Xiv preprint ar Xiv:2305.19225, 2023. Wang, T., Chen, N., and Wang, C. Distributionally robust prescriptive analytics with wasserstein distance. ar Xiv preprint ar Xiv:2106.05724, 2021. Wiesemann, W., Kuhn, D., and Sim, M. Distributionally robust convex optimization. Operations research, 62(6): 1358 1376, 2014. Wilder, B., Dilkina, B., and Tambe, M. Melding the datadecisions pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1658 1665, 2019. Yang, J., Zhang, L., Chen, N., Gao, R., and Hu, M. Decisionmaking with side information: A causal transport robust approach. Optimization Online, 2022. Yang, Y., Yin, Y., Wang, D., Ignatius, J., Cheng, T., and Dhamotharan, L. Distributionally robust multi-period location-allocation with multiple resources and capacity levels in humanitarian logistics. European Journal of Operational Research, 305(3):1042 1062, 2023. Zhang, J., Li, Y., and Yu, G. Emergency relief network design under ambiguous demands: A distributionally robust optimization approach. Expert Systems with Applications, 208:118139, 2022. Zharmagambetov, A., Amos, B., Ferber, A., Huang, T., Dilkina, B., and Tian, Y. Landscape surrogate: Learning decision losses for mathematical optimization under partial information. ar Xiv preprint ar Xiv:2307.08964, 2023. Zhou, Y., Shahidehpour, M., Wei, Z., Li, Z., Sun, G., and Chen, S. Distributionally robust unit commitment in coordinated electricity and district heating networks. IEEE Transactions on Power Systems, 35(3):2155 2166, 2019. Differentiable Distributionally Robust Optimization Layers A. Supplementary Background Material In this section, we give supplementary information on DRO with SOC ambiguity set in A.1, A.2, and A.3. Most of these materials are selected from Ben-Tal & Nemirovski (2001) and Bertsimas et al. (2019). In Appendix A.4, A.5, A.6, and A.7, detailed information on our method are provided. A.1. Second-Order Cone Representable Set [Ben-Tal & Nemirovski (2001), p.86] A set W Rn is a second-order cone representable set if there exists J second-order cone inequalities of the form bj Lmj 0, j [J] (23) bj Lmj 0, j [J] (24) where Lm is the m dimensional second-order cone: Lm = n y = (y1, , ym)T Rm ym q y2 1 + + y2 m 1 o (25) A.2. Examples of Parameterized SOC Ambiguity Set The parameterized SOC ambiguity set is determined by its constraints EP[gi(y, αi)] σi, and here we present some choices of function g and the parameter α. 1. g = µT y with vector µ as the parameter. 2. g = |µT y h| with vector µ and scalar h as the parameter. 3. g = (|µT y h|)p for some rational p 1 with vector µ and scalar h as the parameter. 4. g = ((µT y h)+)2 = (max{0, µT y h})2 vector µ and scalar h as the parameter. 5. g = Ay µ p for some rational p 1 norm p with matrix A and vector µ as the parameter. More examples can be constructed by taking the maximum, i.e., g = maxl [L] gi, and non-negtive sum, i.e., g = PL l=1 λigi for λi 0. See Ben-Tal & Nemirovski (2001) for more operators that preserve the second-order cone representable property of g. A.3. Slater s Condition for Parameterized SOC Ambiguity Set According to Proposition 1 in Bertsimas et al. (2019), the parameterized SOC ambiguity set U (θ) (defined in Definition 4.2) can be equivalently reformulated as follows. EQ[ui] σi, i [I] where each dimension ui of variable u corresponds to the constraint gi in U (θ), distribution Q is on the space of (y, u), #y Q is the marginal distribution of Q on dimension y, and the support V is defined as V = {(y, u)|y Ξ, gi(y, αi) ui, i [I]} = {(y, u)|y Ξ} i=1 {(y, u)|gi(y, αi) ui} (27) By Definition 4.2, {(y, u)|y Ξ} is second-order cone representable set, and each {(y, u)|gi(y, αi) ui} are differentiable parameterized second-order representable sets. Therefore, V is also a differentiable parameterized second-order representable set, so V can be represented by finitely many second-order cone constraints. bj(θ) Lmj 0, j [J] Differentiable Distributionally Robust Optimization Layers The Slater s condition requires that there exist a (y , u , v ) such that u i < σi, i [I] and bj(θ) >Lmj 0, j [J] (29) A.4. Discussion of Assumption 4.8 Assumption (i) ensures the uniqueness of the continuous decision, which is common in differentiable optimization layer research (Agrawal et al., 2019a). For the discrete decision xd, since its feasible region Xd is finite, we can not require the uniqueness of xd for all θ Θ, where θ = Mϕ(z). For example, if we consider the combinatorial optimization problem, i.e., xd = arg min xd {0,1}n f(xd, θ) := θT xd, s.t. 1T xd = 1 (30) The solution to this problem is not unique when the prediction θ has multiple minimum elements. However, we note that for problem (30), the θ leading to multiple solutions has measure zero in its space Θ, and this property holds for a lot of problems. Therefore, we require in assumption (ii) the uniqueness to hold almost surely with respect to the marginal distribution Pz of z. Specifically, in combinatorial optimization problem (30), if the learning model is a linear one, i.e., θ = Az + b with ϕ = (A, b), where the rows of matrix A are all different, and covariate z has marginal distribution absolutely continuous with respect to the Lebesgue measure, then (ii) is satisfied. We further note that (ii) is also implicitly assumed in Poganˇci c et al. (2020). A.5. Self-Normalized Importance Sampling Suppose we want to compute the expectation of a random variable J(xd), i.e., Exd p(xd|θ,λ)[J(xd)] (31) where p(xd|θ, λ) is defined as in Equation (11). The importance sampling aims to compute Exd p(xd|θ,λ)[J(xd)] by leveraging a proposal distribution q which is absolutely continuous with respect to p(xd|θ, λ). With proposal distribution q, the above expectation can be computed by Exd p(xd|θ,λ)[J(xd)] = Exd q(xd) q(xd) J(xd) Therefore, by sampling from the known distribution q, the expectation can be estimated unbiasedly. A.6. Oracles to Get T Optimal Integer Solutions Suppose that we want to solve T optimal integer solutions of the following mixed-integer linear cone program. min xd,xc,v a T xd + b T xc + c T v s.t. xd {0, 1}nd, xc Rnc AT i xd + BT i xc + CT i v Ki 0, i [I] Such mixed-integer linear cone program can be solved by commercial solvers like gurobi. We solve this program and get the optimal integer solution x1 d, and then we add the following constraint to problem (32). 1 2x1 d T xd + 1T x1 d 1 (33) Differentiable Distributionally Robust Optimization Layers Constraint (33) only cuts out x1 d from the feasible region. Therefore, by solving (32) with extra constraint (33) we will get the second optimal solution x2 d. Then we cut out x2 d to get x3 d. Repeat the above process for T times and we will get T optimal integer solutions. A.7. Analysis of the Projection Layer If gi, i [I] in the parameterized SOC ambiguity set are selected as what we give in Appendix A.2, then the epigraph of gi(y, αi) with respect to αi, i.e., {(αi, u)|αi Ai, u gi(y, αi)} (34) is also a second-order cone representable set. By Ben-Tal & Nemirovski (2001) p.91, this representability is preserved by a non-negative sum, so n=1 ωn(z)gi(yn, αi) σi (35) can be expressed by finitely many second-order cone constraints. B.1. Proof of Theorem 4.5 When the cost function c(x, y) is of the form (ii) in Assumption 4.3, then Theorem 4.5 coincides with Theorem 1 in Bertsimas et al. (2019). For the case the cost function c(x, y) is of the form (i) in Assumption 4.3, the proof is quite similar. Since in this case c(x, y) is linear in y, according to Theorem 1 in Bertsimas et al. (2019), the worst-case expectation f(x, U (θ)) = max P U (θ) Ey P[c(x, y)] is equivalent to f(x, U (θ)) = min r,β r + βT σ (36) s.t. r c(x, y) βT u, (y, u) V , y 0 (37) where V is defined in Equation (27). Since Assumption 4.4 holds, we can leverage the duality theory to reformulate constraint (37) as follows. r max y,u V ,y 0 c(x, y) βT u r max y,u V ,y 0 k=1 ck(x)yk βT u (39) r max y 0 min ηj Lmj 0 k=1 ck(x)yk βT u + j=1 ηT j Ay j (θ)y + Au j (θ)u + Av j (θ)v bj(θ) r min ηj Lmj 0 max y 0 k=1 ck(x)yk βT u + j=1 ηT j Ay j (θ)y + Au j (θ)u + Av j (θ)v bj(θ) j=1 ηT j bj(θ), j=1 (Ay j (θ))T ηj j=1 (Au j (θ))T ηj = β, j=1 (Av j (θ))T ηj = 0, ηj Lmj 0 where (Ay j (θ), Au j (θ), Av j (θ)) = Aj(θ) and bj(θ) are defined in Equation (28), and (41) hold by duality theory since the Slater s condition holds by Assumption 4.4. Differentiable Distributionally Robust Optimization Layers The second term in (42) can be reformulated as j=1 (Ay j (θ)ek)T ηj ck(x), k [K] (43) where ek is the vector where the kth element is 1 and other elements are 0. Since by Assumption 4.3 the epigraph of ci(x) is a second-order cone representable set, thus each PJ j=1(Ay j (θ)ek)T ηj ck(x) can be expressed by finitely many second-order cone constraints. Therefore, Theorem 4.5 holds for both the cases in Assumption 4.3. B.2. Proof of Theorem 4.6 By Theorem 4.5, the worst-case expectation f(x, U (θ)) is a linear second-order cone program and all the constraints are linear in x, so under continuous assumption of x, the DRO problem min x X f(x, U (θ)) (44) is also a linear second-order cone program. Specifically, by Appendix B.1, if the cost function c(x, y) is of the form (i) in Assumption 4.3, (44) is equivalent to min x X f(x, U (θ)) = min x X,r,β r + βT σ (45) j=1 ηT j bj(θ) (46) j=1 (Ay j (θ)ek)T ηj ck(x), k [K] (47) j=1 (Au j (θ))T ηj = β (48) j=1 (Av j (θ))T ηj = 0, ηj Lmj 0 (49) β 0, ηj Lmj 0, j [J] (50) where constraint (47) is defined in Equation (43) in Appendix B.1 and can be reformulated into finitely many second-order cone constraints that are linear in x. If the cost function c(x, y) is of the form (ii) in Assumption 4.3, a similar reformulation can be derived. The cone programming minx X f(x, U (θ)) takes Aj(θ) = (Ay j (θ), Au j (θ), Av j (θ)) and bj(θ) as its parameter. Then by Agrawal et al. (2019b), the optimal value x = arg minx X f(x, U (θ)) is differentiable with respect to parameter Aj(θ) and bj(θ). Further, by Definition 4.1, Aj(θ) and bj(θ) are differentiable with respect to θ. Therefore, x = arg minx X f(x, U (θ)) is differentiable with respect to θ. B.3. Proof of Theorem 4.13 (i) It suffices to prove that for any sequence λn 0, ϕ Φ, and sequence ϕn ϕ, the following equality holds lim n Rλn(ϕn) = R(ϕ) (51) For ease of notation, we define f (xd, Mϕ(z)) = min xc Xc(xd) f((xd, xc), Mϕ(z)) = f (xd, x c(xd, Mϕ)), Mϕ(z) (52) Differentiable Distributionally Robust Optimization Layers By Assumption 4.10, f((xd, xc), Mϕ(z)) is continuous in xc and ϕ, and by Assumption 4.11, x c(xd, Mϕ(z)) is also continuous in ϕ, so f (xd, Mϕ(z)) is continuous in ϕ. If for a pair of (z, ϕ) the optimal integer solution x d(Mϕ(z)) = arg minxd Xd f (xd, Mϕ(z)) is unique, then it holds that min xd Xd/{x d(Mϕ(z))} f xd, Mϕ(z) f x d, Mϕ(z) > 0 (53) Then by the continuity of f (xd, Mϕ(z)) in ϕ, there exists a ϵ > 0 such that ϕ ϕ ϕ ϵ , min xd Xd/{x d(Mϕ(z))} f xd, Mϕ(z) f x d(Mϕ(z)), Mϕ(z) ϵ (54) By the above observation, we further define Y (ϕ, N) = z min xd Xd/{x d(Mϕ(z))} f xd, Mϕ(z) f x d(Mϕ(z)), Mϕ(z) 1 (55) It is obvious that Y (ϕ, 1) Y (ϕ, 2) Y (ϕ, N) . By the argument concerning (54), we conclude that if inequality (53) holds for a pair of (z, ϕ), then z Y (ϕ, N) for sufficiently large N. Since by Assumption 4.8, for any ϕ Φ, inequality (53) holds almost surely, thus we have N=1 Y (ϕ, N) = 1 (56) By Assumption 4.10, l is bounded, and we denote this bounded by Ψ. Therefore, for any ϵ > 0, there exist a Nϵ/Ψ such that P(Y (ϕ, Nϵ/Ψ)) 1 ϵ/Ψ (57) Therefore, when n Nϵ/Ψ, we have |Rλn(ϕn) R(ϕ)| (58) xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y # 1 z Y (ϕ, Nϵ/Ψ) " X xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y #) 1 z / Y (ϕ, Nϵ/Ψ) X xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) 1 z / Y (ϕ, Nϵ/Ψ) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y 1 z Y (ϕ, Nϵ/Ψ) " X xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y #) + 2 ϵ Differentiable Distributionally Robust Optimization Layers where 1 is the indicator function. In (61), for xd (Mϕ(z)), p xd (Mϕ(z)) Mϕn(z), λn = xd (Mϕ(z)),Mϕn(z) xd Xd/ xd (Mϕ(z)) exp xd (Mϕ(z)),Mϕn(z) Since ϕn ϕ, then for sufficiently large n, |ϕn ϕ| 1 Nϵ/Ψ . Therefore, for z Y (ϕ, Nϵ/Ψ), f x d, Mϕn(z) f xd (Mϕ(z)), Mϕn(z) 1 Nϵ/Ψ , xd p xd (Mϕ(z)) Mϕn(z), λn 1 1 + P xd Xd/ xd (Mϕ(z)) exp( 1 λn Nϵ/Ψ ) = 1 1 + (|Xd| 1)exp( 1 λn Nϵ/Ψ ) (65) Since λn 0+, we have lim n p xd (Mϕ(z)) Mϕn(z), λn = 1 (66) xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) = l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y (67) xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y = 0 Since l is bounded, by the bounded convergence theorem, we have 1 z Y (ϕ, Nϵ/Ψ) " X xd Xd p(xd|Mϕn(z), λn)l((xd, x c(xd, Mϕn(z))), y) l x d(Mϕ(z)), x c(xd (Mϕ(z)), Mϕ(z)) , y #) = 0 Therefore, lim n |Rλn(ϕn) R(ϕ)| 2ϵ (70) Since ϵ can be selected arbitrarily small, we have lim n |Rλn(ϕn) R(ϕ)| = 0 (71) Therefore, Rλn(ϕ) epi-converges to R(ϕ) as n . (ii) Since Rλn(ϕ) epi-converges to R(ϕ), (ii) can be immediately derived by applying Proposition 4.6 in Bonnans & Shapiro (2013) Differentiable Distributionally Robust Optimization Layers B.4. Proof of Theorem 4.14 We first prove Equation (14). By the chain rule, θ l xd, xc (xd, θ) , y + p(xd|θ, λ) l xd, xc (xd, θ) , y xc xc (xd, θ) By Assumption 4.7, the feasible region Xc(xd) given xd is a second-order cone representable set. Therefore, by Theorem 4.6, the optimal continuous solution xc (xd, θ) is differentiable with respect to θ, so the last gradient term xc (xd,θ) θ in Equation (72) is well-defined. Since the energy function E(xd, θ, λ) = exp f (xd, x c(xd, θ)), θ and x c(xd, θ)) is differentiable, thus E(xd, θ, λ) is also differentiable with respect to θ. Therefore, the first gradient term p(xd|θ,λ) θ in Equation (72) is well-defined since p(xd|θ, λ) = E(xd, θ, λ) P x d Xd E(x d, θ, λ) (74) Let Z(θ, λ) denote the normalizer P x d Xd E(x d, θ, λ). By the chain rule, we have θ = E(xd,θ,λ) P xd Xd E(xd ,θ,λ) θ = E (xd, θ, λ) Z(θ, λ) E(xd, θ, λ) , θ, λ) Z(θ, λ) (75) = p(xd|θ, λ)E (xd, θ, λ) E(xd, θ, λ) p(xd|θ, λ) X |θ, λ)E (xd , θ, λ) E(xd , θ, λ) = p(xd|θ, λ) ( E (xd, θ, λ) E(xd, θ, λ) Exd p(xd |θ,λ) , θ, λ) E(xd , θ, λ) By combining Equation (72) and Equation (77), we have xd Xd p(xd|θ, λ)E (xd , θ, λ) E(xd , θ, λ) l xd, xc (xd, θ) , y Exd p(xd |θ,λ) , θ, λ) E(xd , θ, λ) xd Xd p(xd|θ, λ)l xd, xc (xd, θ) , y ! xd Xd p(xd|θ, λ) l xd, xc (xd, θ) , y xc xc (xd, θ) = Exd p(xd|θ,λ) " E (xd, θ, λ) E(xd, θ, λ) l xd, xc (xd, θ) , y # Exd p(xd|θ,λ) E (xd, θ, λ) E(xd, θ, λ) Exd p(xd|θ,λ) l xd, xc (xd, θ) , y + Exd p(xd|θ,λ) " l xd, xc (xd, θ) , y xc xc (xd, θ) Differentiable Distributionally Robust Optimization Layers Therefore, we have derived Equation (14). Since rλ(θ, y) = rλ(Mϕ(z), y) is bounded by Assumption 4.10, then according to Theorem 9.56 in Shapiro et al. (2021), ϕ = E(z,y) P rλ(Mϕ(z), y) ϕ = E(z,y) P rλ(Mϕ(z), y) C. Extension to Wasserstein DRO Layer Here we present how to extend the proposed DRO Layer to Wasserstein-based DRO with a learnable radius. In the non-contextual setting, the reference distribution of the Wasserstein ambiguity set is typically set to an empirical distribution of N data points. In the contextual setting, we take the idea in Bertsimas & Kallus (2020) and set the conditional empirical distribution b P(z) as a weighted sum of data points, i.e. n=1 ωn(z)δyn (81) where δ[ ] is the Dirac delta function and ωn(z), n [N] are the weights satisfying P n [N] ωn(z) = 1. Let ϵθ(z) be the learnable radius with parameter θ. Then, the Wasserstein ambiguity set is U b P(z), ϵθ(z) = n P P(Ξ) = 1, d W b P(z), P ϵθ(z) o , (82) and the Wasserstein DRO problem is min x X max P U (b P(z),ϵθ(z)) EP[c(x, y)] (WDRO) where x is the mixed-integer decision variable satisfying Assumption 4.7 and c(x, y) is the cost function satisfying Assumption 4.3. As outlined in the paper, the procedure of the proposed DRO Layer is presented as follows. z θ U b P(z), ϵθ(z) Problem (WDRO) Solve (WDRO) for T times Construct proposal distribution Importance sampling Compute gradient (14) θ update Therefore, in order to learn the ambiguity set in a decision-focused style, we only need to ensure that (i) The problem (WDRO) is a mixed-integer linear cone programming. (ii) When the integer part of the decision variable (i.e., xd) is fixed, the problem (WDRO) is a linear cone programming. where condition (i) allows us to use a commercial solver like Gurobi to solve problem (WDRO) for T times so that we can construct the proposal distribution, and condition (ii) allows us to derive the gradient of continuous variables with respect to learnable parameter in computing gradient (14). In fact, these two conditions are satisfied for Wasserstein DRO, and we formally present this result in the following corollary. Corollary C.1. If the mixed-integer decision variable x satisfies Assumption 4.7 and the cost function c(x, y) satisfies Assumption 4.8, then condition (i) and (ii) hold for the problem (WDRO). Proof. The proof is straightforward by combining techniques in Wasserstein DRO (Mohajerin Esfahani & Kuhn, 2018) and second-order cone programming (Ben-Tal & Nemirovski, 2001). In fact, since Assumption 4.7 holds, we only need to show that condition (ii) holds when the decision variable x is pure continuous and the feasible region X is second-order cone representable. For simplicity, we prove the corollary for bilinear cost function c(x, y) = max k [K] x T T ky (83) and proof for general cost functions satisfying Assumption 4.3 is quite similar but with heavier notations. Differentiable Distributionally Robust Optimization Layers According to Mohajerin Esfahani & Kuhn (2018), the problem (WDRO) can be reformulated as inf x X,λ,sn,γnk λϵθ(z) + n=1 ωn(z)sn s.t. sn sup y Ξ x T T ky γT nky + γT nkyn, n [N], k [K] γnk λ, n [N], k [K] Since the uncertainty support Ξ is a second-order cone representable set, it can be formulated as follows. Ξ = y | v s.t. Ay j y + Av j bj Lmj 0, j [J] (85) By leveraging expression (85), we can further reformulate (84) as inf x X,λ,sn,γnk,ηnkj λϵθ(z) + n=1 ωn(z)sn j [J] b T j ηnkj, n [N], k [K] T T k x γnk + X j [J] Ay j T ηnkj = 0, n [N], k [K] j [J] Av j T ηnkj = 0, n [N], k [K] ηnkj Lmj 0, n [N], k [K], j [J] γnk λ, n [N], k [K] Problem (86) is already in the form of a linear cone programming. D. Experiment Setup D.1. Toy Example: Multi-item Newsvendor Problem We measure the performance by the following percentage optimality gap. Percentage Optimality Gap = E(z,y) P l(xlearning(z), y) E(z,y) P l(x (z), y) E(z,y) P l(x (z), y) (87) where xlearning(z) is the decision given by the learning method and x (z) is the optimal decision derived by solving the following problem. x (z) = arg min x X Ey p(y|z)l(x, y) (88) In computing the percentage optimality gap, we compute E(z,y) P l(xlearning(z), y) by sample average approximation using 1 105 pairs of (z, y) i.i.d. sampled from P. Since E(z,y) P l(x (z), y) = Ez Pz min x X Ey p(y|z) l(x, y) (89) we follow a two step approach to compute E(z,y) P l(x (z), y). The outer expectation in Equation (89) is approximated by sample average approximation using 400 z i.i.d. sampled from the marginal distribution Pz. The inner stochastic program minx X Ey p(y|z) l(x, y) is also solved by sample average approximation using 200 y i.i.d. sampled from the conditional distribution p(y|z). In experiments, we use k-nearest-neighbors weight function to construct the distribution Qz (defined in Equation (17)) in the projection layer, i.e., K , if zn is in the k-nearest-neighbor of z. Else, ωn(z) = 0. (90) Differentiable Distributionally Robust Optimization Layers In the experiments, we select K as N 20, where N is the training data size. We use neural networks to learn the ambiguity set parameters µI, σI, µII, σII, the architecture are presented as follows. µI : FC(1, 60) FC(60, 60) FC(60, 4) (91) σI : FC(1, 60) FC(60, 60) FC(60, 1) (92) µII : FC(1, 60) FC(60, 60) FC(60, 4) (93) σII : FC(1, 60) FC(60, 60) FC(60, 1) (94) where FC(m1, m2) represents full connection layer with m1 inputs and m2 outputs. Since µ and σ are learned by different neural networks, in the pre-training, we first train the parameter µ by minimizing EQz[gi(y, αi)] and then train σ by minimizing EQz[gi(y, αi)] σi . In conducting experiments, we vary the training data size N from 100 to 800, and in each case, 10 runs are conducted. In each case, the size of the validation data set is set to N 5 . In generating data, we first sample covariate z, and then sample y conditioned on z. In the multi-item newsvendor problem (19) with n = 4, we set ac = (0.25, 0.5, 0.75, 1), ad = 0.95 ac, v = (10.0, 13.0, 16.0, 19.0), b = (2.0, 4.0, 6.0, 8.0), d = (0.5, 1.0, 1.5, 2.0). The covariate z follows the uniform distribution on [0, 1], and conditioned on z, we set the distribution of demand y by y1 N 8 + 6(z 0.2)2, 1 1 + 8|z 0.2| , y2 N 11 + 6(z 0.4)2, 1 1 + 8|z 0.4| y3 N 14 + 6(z 0.6)2, 1 1 + 8|z 0.6| , y4 N 17 + 6(z 0.8)2, 1 1 + 8|z 0.8| D.2. Portfolio Management Problem In the experiment on the portfolio management problem, the covariate z is a vector of 5 dimensions, and the returns of n assets are generated by the following non-linear model. y = a + Bz + Ce + Dflat(zz T ) + z 1diag(s)g (95) where matrixes B Rn 5, C Rn 3, D Rn 25 and vector s Rn are randomly picked parameters, e R3 and g Rn are random variables independent of covariate z R5, and 1 is the 1-norm. We use 2,500 data for training, 500 data for validation, and 1,000 data for testing. The distribution Qz is also set to k-nearest distribution as in (90), and K is set to 20. In the gradient estimation, we solve the DRO 3 times to construct (15) and use 4 samples to estimate the gradient term by importance sampling. For the energy parameter λ in (10), we initially set it to 10, and subsequently reduce it by one-third every 30 epochs. D.2.1. PORTFOLIO MANAGEMENT PROBLEM WITH PURE CONTINUOUS DECISIONS We use the following two-layer full-connected neural network to learn the ambiguity set parameter in both our method and the method proposed in Costa & Iyengar (2023). FC(5, 22) FC(22, 27) FC(27, m) (96) where m is the dimension of the ambiguity set parameter. D.2.2. PORTFOLIO MANAGEMENT PROBLEM WITH MIXED-INTEGER DECISIONS In problem (22), the number of binary decisions is set to n 5 where n is the number of assets, and the problem parameter v Rn/5 is set to [1/n, , 1/n]. Differentiable Distributionally Robust Optimization Layers To analyze the impact of the number of constraints in the SOC ambiguity set on the performance of the proposed decisionfocused learning, we conduct experiments on the 60-dimensional mixed-integer portfolio management problem and consider the following three types of SOC ambiguity sets. U15(µ, σ) = y4(i 1)+j µ4(i 1)+j 2 U30(µ, σ) = y2(i 1)+j µ2(i 1)+j 2 U60(µ, σ) = EP h yi µi 2 2 i σi, i [60] Intuitively, U15 constrains 4 dimensions together and thus leads to 15 constraints, U30 constrains 2 dimensions together and thus leads to 30 constraints, and U60 constrains different dimensions individually and leads to 60 constraints. E. Discussion of Limitations The decision-focused learning pipeline we developed in Section 4 can be decomposed into four processes: 1. learning layer; 2. projection layer; 3. solving MICP; 4. DRO Layer. Processes 2 and 4 are built on Cvxpylayers (Agrawal et al., 2019a), and Process 3 is built on commercial solver Gurobi. Processes 1, 2, and 4 participate in both the forward and backward path, and solving MICP is only involved in the forward path. To test the scalability of our method, we test the running time for a batch of 100 instances on large-scale problems, and the results are presented in Table 3. The experiments are conducted on a laptop with an i7 CPU and 32G RAM. In Table 3, the running time for solving MICP and DRO Layer pertains to T = 1 and S = 1, so the total computational time for these two processes should be multiplied by T and S, respectively. From Table 3, we can see the computational time is very long in high-dimensional problems. However, we note that this computational inefficiency is due to the inefficiency of Cvxpylayers and Gurobi, because both of them are built on CPU rather than on GPU. Table 3. Running time for a batch of 100 instances on problems with different dimensions. (T stands for the number of solutions we derive to construct the proposal distribution, and S stands for the number of sampling to estimate gradient (14).) Dimension Forward path Backward propagation learning layer projection layer solving MICP ( T) DRO Layer ( S) 100 < 1ms 16.5s 8.4s 1.2s 9.8s 200 < 1ms 19.6s 12.8s 1.4s 13.1s 400 < 1ms 44.0s 24.1s 5.8s 90.5s 800 < 1ms 82.9s 47.6s 15.3s 358.7s