# Data-Driven Conditional Robust Optimization

Abhilash Chenreddy (GERAD & Dept. of Decision Sciences, HEC Montréal, Montréal, Quebec, Canada; abhilash.chenreddy@hec.ca), Nymisha Bandi (McGill University, Montréal, Quebec, Canada; nymisha.bandi@mcgill.ca), Erick Delage (GERAD & Dept. of Decision Sciences, HEC Montréal, Montréal, Quebec, Canada; erick.delage@hec.ca)

In this paper, we study a novel approach for data-driven decision-making under uncertainty in the presence of contextual information. Specifically, we address this problem using a new Conditional Robust Optimization (CRO) paradigm that seeks the solution of a robust optimization problem where the uncertainty set accounts for the most recent side information provided by a set of covariates. We propose an integrated framework that designs the conditional uncertainty set by jointly learning a partition in the covariate data space and simultaneously constructing region-specific deep uncertainty sets for the random vector that perturbs the CRO problem. We also provide theoretical guarantees for the coverage provided by conditional uncertainty sets and for the value-at-risk performance obtained using the proposed CRO model. Finally, we use simulated and real-world data to illustrate the implementation of our approach and compare it against two non-contextual robust optimization benchmark approaches to demonstrate the value of exploiting contextual information in robust optimization.

## 1 Introduction

In most real-world decision problems, the decision maker (DM) faces uncertainty either in the objective function that he aims to optimize, or in some of the constraints that he needs to satisfy. Stochastic Programming and Robust Optimization (RO) are the most popular methods for addressing this issue. With the growing availability of data, there has recently been a surge of interest in modeling optimization under uncertainty as contextual optimization problems that seek to leverage rich feature observations to make better decisions [Ban and Rudin, 2019, Bertsimas and Kallus, 2020]. In a simple cost minimization problem, where $X \subseteq \mathbb{R}^n$ and $c(x, \xi)$ respectively capture the feasible set of actions and a cost that depends on both the action $x$ and a random perturbation vector $\xi \in \mathbb{R}^m$, the contextual DM has access to a vector of covariates $\psi \in \mathbb{R}^m$ assumed to be correlated with $\xi$. This DM therefore traditionally wishes to identify an optimal policy, i.e. a functional $x : \mathbb{R}^m \to X$ that suggests an action in $X$ adapted to the observed realization of $\psi$, with respect to his expected cost over the joint distribution of $(\psi, \xi)$:

$$\min_{x(\cdot)} \; \mathbb{E}[c(x(\psi), \xi)]. \tag{1}$$

From a theoretical point of view, one can exploit the interchangeability property (see Theorem 14.60, [Rockafellar and Wets, 2009]) to identify an optimal policy for Problem (1) using the following conditional stochastic optimization (CSO) problem:

$$(\text{CSO}) \qquad x^*(\psi) \in \operatorname*{argmin}_{x \in X} \; \mathbb{E}[c(x, \xi) \,|\, \psi]. \tag{2}$$

While the literature that treats contextual optimization through the CSO problem is rich, much less attention has been given to contextual optimization in the risk-averse setting. Namely, one can easily think about replacing the risk-neutral expected value operator in problem (2) with a risk measure such as value-at-risk or conditional value-at-risk in order to prevent the DM from being exposed to the possibility of large costs.
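To make problem (2) concrete, the following is a minimal sketch of one standard estimate-then-optimize way of approximating it from data, in the spirit of the kernel- and reweighting-based CSO methods surveyed in Section 2: the conditional expectation is replaced by an average over the historical samples whose covariates are closest to the observed $\psi$. The k-nearest-neighbor weighting, the grid of candidate actions, and the toy newsvendor cost below are illustrative assumptions, not part of the method developed in this paper.

```python
import numpy as np

def cso_policy_knn(x_grid, cost, data_psi, data_xi, psi, k=30):
    """Approximate the CSO problem (2): average the cost over the xi_i whose
    covariates psi_i are the k nearest neighbors of the observed psi, and
    pick the candidate action with the lowest empirical conditional cost."""
    dist = np.linalg.norm(data_psi - psi, axis=1)
    neighbors = np.argsort(dist)[:k]              # indices of the k closest psi_i
    cond_sample = data_xi[neighbors]              # proxy sample from P(xi | psi)
    avg_cost = [np.mean([cost(x, xi) for xi in cond_sample]) for x in x_grid]
    return x_grid[int(np.argmin(avg_cost))]

# toy newsvendor example: order quantity x, random demand xi, holding/backorder costs
cost = lambda x, xi: 0.5 * max(x - xi, 0.0) + 2.0 * max(xi - x, 0.0)
rng = np.random.default_rng(0)
data_psi = rng.normal(size=(500, 3))                         # covariates
data_xi = 10 + 2 * data_psi[:, 0] + rng.normal(size=500)     # correlated demand
x_hat = cso_policy_knn(np.linspace(0, 20, 201), cost,
                       data_psi, data_xi, psi=np.array([1.0, 0.0, 0.0]))
```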
Moreover, while robust optimization is being used pervasively in disciplines that employ decision models, including chemical, civil, and electrical engineering, medicine, and physics (see respectively [Bernardo and Saraiva, 1998, Bendsøe et al., 1994, Mani et al., 2006, Chu et al., 2005, Bertsimas et al., 2007]) to name a few, the question of how to systematically integrate contextual information in this important class of decision models remains to this day unexplored. In this work, we therefore tackle for the first time the contextual optimization problem from the point of view of robust optimization. Namely, we consider a contextual DM that wishes to exploit the side information in the design and solution of a robust optimization problem. This naturally gives rise to the following conditional robust optimization (CRO) problem:

$$x^*(\psi) := \operatorname*{argmin}_{x \in X} \; \max_{\xi \in U(\psi)} \; c(x, \xi),$$

where $U(\psi)$ is an uncertainty set designed to contain with high probability the realization of $\xi$ conditionally on observing $\psi$. Our proposed approach will be data-driven in the sense that the design of the CRO problem will make use of historical observations of joint realizations of $\psi$ and $\xi$. Our contribution can be summarized as follows.

- We propose for the first time a framework for learning from data an uncertainty set for RO that adapts to side information. The training of this conditional uncertainty set is done by jointly learning a partition in the covariate data space using deep clustering methods, and simultaneously constructing region-specific deep uncertainty sets, using techniques from one-class classification, for the random vector that perturbs the CRO problem.
- We establish theoretical connections between CRO and Contextual Value-at-Risk Optimization (CVO):
  $$\min_{x(\cdot)} \; \mathrm{VaR}_{1-\epsilon}(c(x(\psi), \xi)), \tag{3}$$
  where $\mathrm{VaR}_{1-\epsilon}(Z) := \inf\{t \,|\, \mathbb{P}(Z \le t) \ge 1-\epsilon\}$ refers to the value-at-risk at the $1-\epsilon$ confidence level of $Z$.
- We demonstrate empirically that contextual robust optimization can improve the performance of robust optimization models in a data-driven portfolio optimization problem that employs real-world data from the US stock market. In particular, we find that in conditions where side information carries a strong signal about future returns, the risk of the portfolio can be reduced by up to 15%.

The paper is organized as follows. Section 2 surveys related work. Section 3 summarizes the approach discussed in [Goerigk and Kurtz, 2020]. Section 4 presents a Deep Cluster then Classify (DCC) scheme and our Integrated Deep Cluster then Classify (IDCC) scheme to generate conditional uncertainty sets. It also establishes the connections to CVO. Our case study based on real-world portfolio optimization is presented in Section 5, followed by conclusions in Section 6.

## 2 Related work

**Conditional Stochastic Optimization** [Hannah et al., 2010] was possibly the earliest work on CSO, where a kernel density estimation approach is exploited to formulate and solve a CSO problem. [Ban and Rudin, 2019] apply CSO to a newsvendor optimization problem where the performance of linear policies and kernel density estimation is explored and where the generalization error can be controlled using regularization. [Kallus and Mao, 2020] studied methods to train forest decision policies for CSO in a way that directly targets the optimization costs. [Ban et al., 2019] use residual tree methods to solve general multi-stage stochastic programs where information about the underlying uncertainty is available through covariate information.
[Kannan et al., 2020a] propose data-driven SAA frameworks for approximating the solution to two-stage stochastic programs with access to a finite number of samples of random variables and concurrently observed covariates. Recently, Lin et al. [2022] have applied a conditional VaR-constrained CSO formulation to the newsvendor problem. While most of the related work focuses on an estimate-then-optimize approach (see also [Srivastava et al., 2021b] and [Hu et al., 2022]), there have also been recent efforts in designing CSO models using an end-to-end paradigm (see [Elmachtoub and Grigas, 2022] and [Donti et al., 2017]).

**Distributionally robust CSO** One common challenge with the applications of CSO is due to the fact that often there are only a few samples (if any at all) drawn from the conditional distribution of $\xi$ given $\psi$ for each realization of $\psi$ [Hu et al., 2020]. This in turn causes a poor approximation of the true conditional distribution, resulting in poor out-of-sample performance. Most proposed solutions to this issue have relied on distributionally robust optimization (DRO). For example, [Bertsimas and Van Parys, 2021], [Bertsimas et al., 2022, Nguyen et al., 2021], and [Srivastava et al., 2021a] all propose DRO approaches that employ distribution sets that are centered at either the estimated conditional distribution or the joint empirical distribution of $(\psi, \xi)$. [Kannan et al., 2020b] applies distributionally robust optimization to the residual-based CSO model proposed in [Kannan et al., 2020a]. We finally note that none of these works have considered the problem of conditional DRO, where the distributional ambiguity set itself, namely its support or size, depends on contextual information.

**Data-driven Robust Optimization and One-class Classification** There has been a growing set of papers (see [Ohmori, 2021, McCord, 2019, Wang and Jacquillat, 2020]) proposing various frameworks that use both supervised and unsupervised one-class classification techniques to design uncertainty sets, which are then integrated into RO problems. Some approaches make use of the variance and covariance of historical data [Natarajan et al., 2008], while others [Goerigk and Kurtz, 2020, Wang et al., 2021] have exploited the representative power of deep neural networks to construct compact uncertainty sets. Up to this day, none of the data-driven robust optimization approaches have considered accounting for contextual information.

**Deep Clustering Methods** Traditional clustering methods like Gaussian Mixture Models (GMM) and k-means clustering rely on the original data representations and suffer from the curse of dimensionality. Recent developments in DNNs have led to the learning of high-quality representations; in particular, auto-encoder (AE) and decoder systems are appealing as they are able to learn representations in a fully unsupervised fashion. Several works, such as [Chang et al., 2017, Guo et al., 2017, Ji et al., 2017], combine variational AEs and GMMs to perform clustering and non-linearly map the input data into a latent space. A few works, such as [Fard et al., 2020], jointly learn the representations and cluster them with k-means. We modify these algorithms to introduce a probability simplex that interacts with the centroids and with the center of the uncertainty sets.

## 3 The Deep Data-Driven Robust Optimization (DDDRO) Approach

Focusing on a classical robust optimization model, i.e.
$\min_{x \in X} \max_{\xi \in U} c(x, \xi)$, the authors of [Goerigk and Kurtz, 2020] propose to employ deep learning to characterize the uncertainty set $U$ in a data-driven environment. In particular, they consider describing the uncertainty set $U$ in the form:

$$U(W, R) := \{\xi \in \mathbb{R}^m : \|f_W(\xi) - f_0\| \le R\}, \tag{4}$$

where $f_W : \mathbb{R}^m \to \mathbb{R}^d$ is a deep neural network, parametrized using $W$, that projects the perturbation vector $\xi$ to a new vector space where the uncertainty set can be more simply defined as a sphere of radius $R$ centered at some $f_0$. Given a dataset $D_\xi = \{\xi_1, \xi_2, \dots, \xi_N\}$, they propose discovering the underlying structure of $U$ by training the NN using a method found in the one-class classification literature, namely minimizing the empirical centered total variation of the projected data points:

$$\min_W \; \sum_{i=1}^N \|f_W(\xi_i) - f_0\|^2, \tag{5}$$

where $f_0 := (1/N) \sum_{i \in [N]} f_{W_0}(\xi_i)$ is the center of the projected points under some initial random choice of $f_{W_0}$. Once the network is trained, they calibrate the radius $R$ of $U$ in order to reach a targeted coverage $1-\epsilon$ of the data set. In terms of NN architecture, they favor a special class of fully connected neural networks of depth $L$:

$$f_W(c) = \sigma^L(W^L \sigma^{L-1}(W^{L-1} \cdots \sigma^1(W^1 c) \cdots)), \tag{6}$$

where each $W^\ell$ captures a linear projection while each $\sigma^\ell$ captures a term-wise piecewise linear activation function (e.g. ReLU, Hardtanh, or hard sigmoid):

$$\sigma^\ell_j(w_j) = a^\ell_k w_j + b^\ell_k \quad \text{if } \underline{\alpha}^\ell_k \le w_j \le \bar{\alpha}^\ell_k, \; k = 1, \dots, K,$$

with $\{a^\ell_k, b^\ell_k, \underline{\alpha}^\ell_k, \bar{\alpha}^\ell_k\}_{k=1}^K$ as the parameters that identify each of the $K$ affine pieces. The motivation for such an architecture comes from the proposed solution scheme for the RO problem, which relies on a constraint generation approach (see Algorithms 3 and 4 in the Appendix). This scheme relies on progressively adding scenarios to a reduced set $\bar U \subseteq U$ until the worst-case cost of the solution under $\bar U$ is the same as under $U$. Numerically, a critical step consists in identifying the worst-case realization in $U$, which is shown to reduce to a mixed-integer linear program when $c(x, \xi)$ is linear in $\xi$ under the selected NN architecture, due to the following representation of $U(W, R)$: there exist $u \in \{0, 1\}^{d \times K \times L}$, $\zeta \in \mathbb{R}^{d \times L}$, and $\varphi \in \mathbb{R}^{d \times L}$ such that

$$
\begin{aligned}
& \textstyle\sum_{k=1}^K u^{k,\ell}_j = 1, && \forall j, \ell \\
& \varphi^1 = W^1 \xi \\
& \zeta^\ell_j = \textstyle\sum_{k=1}^K u^{k,\ell}_j a^\ell_k \varphi^\ell_j + \sum_{k=1}^K u^{k,\ell}_j b^\ell_k, && \forall j, \ell \\
& \varphi^\ell = W^\ell \zeta^{\ell-1}, && \forall \ell \ge 2 \\
& \textstyle\sum_{k=1}^K u^{k,\ell}_j \underline{\alpha}^\ell_k \le \varphi^\ell_j \le \sum_{k=1}^K u^{k,\ell}_j \bar{\alpha}^\ell_k, && \forall j, \ell \\
& \|\zeta^L - f_0\| \le R,
\end{aligned}
$$

where we assume for simplicity that each layer of the deep neural network has $d$ neurons and $\varphi^\ell$ is the output at the $\ell$-th layer of the neural network. We refer interested readers to [Goerigk and Kurtz, 2020] for more details.
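For illustration, here is a minimal PyTorch sketch of the training and calibration steps described above: the projection network is trained by minimizing the centered total variation (5) around a center $f_0$ computed with the initial weights, and the radius is then set to the empirical $(1-\epsilon)$-quantile of the projected distances. The network width and depth, the optimizer, and the number of epochs are illustrative assumptions, not the settings used by [Goerigk and Kurtz, 2020].

```python
import numpy as np
import torch
import torch.nn as nn

def train_dddro_set(xi, d=8, epochs=500, eps=0.1, lr=1e-3):
    """Learn U(W, R) = {xi : ||f_W(xi) - f0|| <= R} as in (4)-(5)."""
    xi = torch.as_tensor(np.asarray(xi), dtype=torch.float32)  # N x m data matrix
    f_W = nn.Sequential(nn.Linear(xi.shape[1], d), nn.ReLU(),
                        nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
    with torch.no_grad():
        f0 = f_W(xi).mean(dim=0)            # center under the initial weights W0
    opt = torch.optim.Adam(f_W.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((f_W(xi) - f0) ** 2).sum(dim=1).mean()  # centered total variation
        loss.backward()
        opt.step()
    with torch.no_grad():
        dist = torch.linalg.norm(f_W(xi) - f0, dim=1).numpy()
    R = np.quantile(dist, 1 - eps)          # radius reaching 1 - eps data coverage
    return f_W, f0, R
```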
## 4 Deep Data-driven Conditional Robust Optimization

Let $(\psi, \xi)$ be a pair of random vectors defining respectively the side-information and random perturbation vectors of a contextual optimization problem. We denote our dataset by $D_{\psi\xi} := \{(\psi_1, \xi_1), \dots, (\psi_N, \xi_N)\}$. Our objective is to train a data-driven conditional uncertainty set $U(\psi)$ that will lead to robust solutions that are adapted to the type of perturbation that is experienced when $\psi$ is observed. In this section, we propose two algorithms, namely the Deep Cluster then Classify (DCC) and the Integrated Deep Cluster then Classify (IDCC), to do so, and propose a calibration procedure that offers some guarantees with respect to a contextual value-at-risk problem.

### 4.1 The Deep Cluster then Classify (DCC) Approach

A direct extension of G&K's DDDRO approach of Section 3 consists in reducing the side-information $\psi$ to a set of $K$ different clusters, which provides states of the environment in which one wishes to design customized data-driven uncertainty sets. Mathematically, $U(\psi) := U_{a(\psi)}$, where $a : \mathbb{R}^m \to [K]$ is a trained $K$-class cluster assignment function for $\psi$, and each $U_k$, for $k = 1, \dots, K$, is an uncertainty set for $\xi$ that is trained and sized using the procedure described in Section 3 with the dataset $D^k_\xi := \bigcup_{(\psi, \xi) \in D_{\psi\xi} : a(\psi) = k} \{\xi\}$. This process implicitly involves multiple sequential steps of training deep neural networks. Following [Moradi Fard et al., 2020], when performing deep K-means clustering to obtain $a(\psi)$, training can take the form of Algorithm 5, where the deep K-means algorithm simultaneously trains a representation $g_{V_E} : \mathbb{R}^m \to \mathbb{R}^d$ using an encoder network and $g_{V_D} : \mathbb{R}^d \to \mathbb{R}^m$ using a decoder network, together with a K-means classifier $a_\theta(\varphi) := \operatorname*{argmin}_{k \in [K]} \|\varphi - \theta_k\|^2$, by minimizing, using stochastic gradient descent in a coordinate descent scheme, a trade-off (weighted by $\alpha_K$) between the reconstruction error and the within-cluster centered total variation in the encoded space:

$$L_1(V, \theta) := (1 - \alpha_K) \frac{1}{N} \sum_{i=1}^N \|g_{V_D}(g_{V_E}(\psi_i)) - \psi_i\|^2 + \alpha_K \frac{1}{N} \sum_{i=1}^N \|g_{V_E}(\psi_i) - \theta_{a(\psi_i)}\|^2, \tag{8}$$

where $a(\psi) := a_\theta(g_{V_E}(\psi))$. To solve this problem, we iterate between improving $V := (V_E, V_D)$ while keeping $\theta$ fixed, and improving $\theta$ while keeping $V$ fixed.

Once the K-means and one-class classifiers are trained, we correct for a deficiency of the DDDRO approach, which wrongly assumes that the projected $f_{W^k}(\xi)$ are normalized for each $D^k_\xi$. Namely, we replace $U(W, R)$ with a set that employs an ellipsoid in the projected space according to the statistics of $D^k_\xi$:

$$U(W^k, R^k, S^k) := \{\xi \in \mathbb{R}^m : \|\Sigma_f^{k\,-1/2} (f_{W^k}(\xi) - \mu^k_f)\| \le R^k\}, \tag{9}$$

where $S^k$ is short for $(\mu^k_f, \Sigma^k_f)$ with

$$\mu^k_f := |D^k_\xi|^{-1} \sum_{\xi \in D^k_\xi} f_{W^k_0}(\xi) \quad \text{and} \quad \Sigma^k_f := |D^k_\xi|^{-1} \sum_{\xi \in D^k_\xi} (f_{W^k}(\xi) - \mu^k_f)(f_{W^k}(\xi) - \mu^k_f)^T.$$

The calibration of each $R^k$ can finally be done using the same procedure as in [Goerigk and Kurtz, 2020], but using the reduced dataset $D^k_\xi$.
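As an illustration of the ellipsoidal correction (9) and of the per-cluster calibration, the sketch below computes $(\mu^k_f, \Sigma^k_f)$ and $R^k$ from one cluster's data. Here `f_W` and `f_W0` are assumed to be callables mapping an (N x m) array of perturbations to their (N x d) projections under the trained and initial weights respectively, and the small diagonal jitter added to the covariance is a numerical-stability assumption of ours.

```python
import numpy as np

def calibrate_cluster_set(xi_k, f_W, f_W0, eps=0.1):
    """Statistics S^k = (mu_f^k, Sigma_f^k) of (9) and radius R^k covering a
    fraction 1 - eps of the cluster's projected points."""
    phi0 = f_W0(xi_k)                     # projections under the initial weights
    phi = f_W(xi_k)                       # projections under the trained weights
    mu_f = phi0.mean(axis=0)
    diff = phi - mu_f
    Sigma_f = diff.T @ diff / len(xi_k) + 1e-6 * np.eye(phi.shape[1])  # jitter
    # Mahalanobis-type distance || Sigma_f^{-1/2} (f_W(xi) - mu_f) ||
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma_f))
    dist = np.linalg.norm(diff @ L_inv.T, axis=1)
    R = np.quantile(dist, 1 - eps)        # same calibration rule as in Section 3
    return mu_f, Sigma_f, R
```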
### 4.2 The Integrated Deep Cluster-Classify (IDCC) Approach

While the simplicity of the approach presented in Section 4.1 makes it appealing, we identify two important weaknesses. First, by separating the training into multiple steps, it omits tackling the conditional uncertainty set learning problem as a whole: low total variation in the $\psi$ space (or a projection of it) does not necessarily imply that low total variation can easily be achieved in a projection of the $\xi$ space. Second, it is unclear how to adapt the approach to a context where a clear separation of the clusters is impossible and where the notion of partial membership to a cluster is more appropriate.

To address the first problem, we propose an integrated framework for performing deep clustering and deep uncertainty set design jointly. Namely, we propose to optimize all of $V$, $\theta$, and $\{W^k\}_{k=1}^K$ jointly using a loss function that trades off between the objectives used for clustering and each of the $K$ versions of one-class classifiers. We also tackle the issue of hard assignments by training a parameterized random assignment policy $\pi : \mathbb{R}^m \to \Delta_K$, where $\Delta_K$ is the probability simplex in $\mathbb{R}^K$ and $\theta$ are the parameters that define the policy space. In the context of employing a soft version of deep K-means [Fard et al., 2020], this random assignment policy takes the form of $\pi(\psi) := \pi^\theta(g_V(\psi))$, where

$$\pi^\theta_k(\psi) := \frac{\exp\{-\beta \|g_V(\psi) - \theta_k\|^2\}}{\sum_{k'=1}^K \exp\{-\beta \|g_V(\psi) - \theta_{k'}\|^2\}}. \tag{10}$$

With these adjustments, our proposed loss function takes the form of:

$$
\begin{aligned}
L^3_\alpha(V, \theta, \{W^k\}_{k=1}^K) := \; & \alpha_S \Big( (1 - \alpha_K)\, \mathbb{E}^\pi_D\big[\|g_{V_D}(g_{V_E}(\psi)) - \psi\|^2\big] + \alpha_K\, \mathbb{E}^\pi_D\big[\mathrm{TotalVar}^\pi_D(g_{V_E}(\psi), \theta_{\tilde a(\psi)} \,|\, \tilde a(\psi))\big] \Big) \\
& + (1 - \alpha_S) \sum_{k=1}^K \min_{\vartheta^k} \mathrm{TotalVar}^\pi_D(f_{W^k}(\xi), \vartheta^k \,|\, \tilde a(\psi) = k),
\end{aligned} \tag{11}
$$

where $\tilde a(\psi) \sim \pi^\theta(g_{V_E}(\psi))$ is the randomized assignment based on $\psi$, and $\mathrm{TotalVar}^\pi_D(\varphi, \theta \,|\, \tilde a(\psi)) := \sum_{j=1}^d \mathbb{E}^\pi_D[(\varphi_j - \theta_j)^2 \,|\, \tilde a(\psi)]$ is the conditional centered total variation of $\varphi$ given $\tilde a(\psi)$. In fact, all statistics are measured using the empirical distribution expressed in $D_{\psi\xi}$ and the conditional distribution produced by the randomized assignment policy $\pi^\theta(g_V(\psi))$, i.e.

$$\mathbb{P}^\pi_D((\psi, \xi, \tilde a) \in E) = \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K \mathbb{1}\{(\psi_i, \xi_i, k) \in E\}\, \pi^\theta_k(g_V(\psi_i)).$$

The explicit form of equation (11) can be found in Appendix B.1. Overall, $L^3_\alpha$ trades off (using $\alpha_S$) between the reconstruction error of the encoder-decoder networks on $\psi$, the expected recognizability of the $K$ clusters, i.e. the fact that the observed features $g_{V_E}(\psi)$ form distinct clusters of points, and the average compactness of the produced conditional uncertainty sets. In particular, as $\alpha_S \to 1$, we can expect the minimizer of $L^3_\alpha$ to converge to the minimizer of the cluster-then-classify approach. At the other end of the spectrum, when $\alpha_S \to 0$, the model will produce more self-contained conditional uncertainty sets, but at the price of less distinguishable clusters (in terms of $\psi$) that might poorly exploit the side-information. Algorithm 1 presents our proposed training scheme for the IDCC approach.

Given that we employ a random assignment policy, we propose replacing the deterministic CRO problem with its randomized version:

$$\tilde x^*(\psi) \in \operatorname*{argmin}_{x \in X} \; \max_{\xi \in \tilde U(\psi)} \; c(x, \xi),$$

where $\tilde U(\psi) := U(W^{\tilde a(\psi)}, R^{\tilde a(\psi)}, S^{\tilde a(\psi)})$ is a random uncertainty set (here, $S^k$ refers to $(\bar f^{\theta,V}_{W^k | \tilde a(\psi)=k}, \Sigma^{\theta,V}_{W^k | \tilde a(\psi)=k})$ with $\Sigma^{\theta,V}_{W^k | \tilde a(\psi)=k} := \sum_{i=1}^N \frac{\pi^\theta_k(g_{V_E}(\psi_i))}{\sum_{i'=1}^N \pi^\theta_k(g_{V_E}(\psi_{i'}))} (f_{W^k}(\xi_i) - \bar f^{\theta,V}_{W^k | \tilde a(\psi)=k})(f_{W^k}(\xi_i) - \bar f^{\theta,V}_{W^k | \tilde a(\psi)=k})^T$), and where we express the fact that, conditionally on $\psi$, $\tilde x(\psi)$ is a random policy that depends on the realization of $\tilde a$. Given the randomness of $\tilde U(\psi)$, one needs to be more careful in defining a calibration scheme for each $R^k$. Our proposed scheme is motivated by the following lemma, whose proof can be found in Appendix A.

**Lemma 4.1.** Let the random uncertainty set $\tilde U(\psi)$ satisfy
$$\mathbb{P}^\pi_D(\xi \in \tilde U(\psi) \,|\, \tilde a(\psi) = k) \ge 1 - \epsilon, \quad \forall k, \tag{12}$$
then it satisfies
$$\mathbb{P}^\pi_D(\xi \in \tilde U(\psi)) \ge 1 - \epsilon. \tag{13}$$

In particular, this lemma suggests calibrating each $R^k$ using bisection to solve

$$\frac{\sum_{i=1}^N \mathbb{1}\{\xi_i \in U(W^k, R, S^k)\}\, \pi^\theta_k(g_{V_E}(\psi_i))}{\sum_{i=1}^N \pi^\theta_k(g_{V_E}(\psi_i))} \ge 1 - \epsilon, \tag{14}$$

given that the resulting $R^k$ are the smallest that satisfy (12).

**Algorithm 1** Integrated deep cluster-classify with deep K-means

Input: dataset $D_{\psi\xi}$, number of clusters $K$, hyper-parameters $\alpha_K$, $\alpha_S$, $\beta$.
1. Randomly initialize $\theta_0$, $V_0$, and $W_0$; let $\pi_0 := \pi^{\theta_0}(g_{V_{E,0}}(\psi))$ and $W^k_0 := W_0$ for all $k$'s; set $t := 0$.
2. Repeat until $t \ge T$ or convergence:
   - Set $t := t + 1$.
   - Update $\theta^k_t := \mathbb{E}^\pi_D[g_{V_{E,t-1}}(\psi) \,|\, \tilde a(\psi) = k]$ using $\pi_{t-1}$.
   - Update $(V_t, \{W^k_t\}_{k=1}^K)$ using gradient descent on (11) with $\theta_t$.
   - Get $\pi_t := \pi^{\theta_t}(g_{V_{E,t}}(\psi))$.
3. Let $\pi(\cdot) := \pi_t(\cdot)$ and $W^k := W^k_t$ for all $k$'s.
4. For $k = 1, \dots, K$: calibrate $R^k$ using (14) and let $U_k := U(W^k, R^k, S^k)$.
5. Return $\pi(\cdot)$ and $\{U_k\}_{k=1}^K$.
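To complement Algorithm 1, the following sketch implements the soft assignment (10) and the weighted radius calibration (14). Instead of bisection, it sorts the projected distances of the training points (i.e. $\|\Sigma_f^{k\,-1/2}(f_{W^k}(\xi_i) - \mu^k_f)\|$) and takes the smallest radius whose cumulative $\pi^\theta_k$-weight reaches $1-\epsilon$, which yields the same smallest feasible $R^k$. The array names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def soft_assignment(g_psi, theta, beta):
    """Cluster-membership probabilities pi^theta_k(psi) of (10), for encoded
    covariates g_psi (N x d) and centroids theta (K x d)."""
    sq_dist = ((g_psi[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)  # N x K
    logits = -beta * sq_dist
    logits -= logits.max(axis=1, keepdims=True)        # numerical stabilization
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def calibrate_radius_weighted(dist_k, pi_k, eps=0.1):
    """Smallest R^k satisfying (14): the pi^theta_k-weighted fraction of points
    whose projected distance dist_k is <= R^k must be at least 1 - eps."""
    order = np.argsort(dist_k)
    cum_weight = np.cumsum(pi_k[order]) / pi_k.sum()
    idx = np.searchsorted(cum_weight, 1 - eps)
    return dist_k[order][min(idx, len(dist_k) - 1)]
```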
### 4.3 Connections to Contextual Value-at-Risk Optimization

In the previous subsections, we proposed two different schemes to produce a possibly randomized uncertainty set $\tilde U(\psi)$ that can be employed in a randomized CRO problem (note that in the case of Section 4.1, the conditional uncertainty set is deterministic, thus reducing the randomized version of CRO to a pure CRO problem). We also proposed a scheme for radii calibration so that they would satisfy the coverage property in (13). Hence, one can derive the following connection between conditional robust optimization and the CVO problem (3). The proof is pushed to Appendix A.

**Lemma 4.2.** When $\tilde U$ satisfies (13), the random policy $\tilde x(\cdot)$ of the randomized CRO problem, together with

$$\tilde v^* := \operatorname*{ess\,sup}^{\pi}_{D} \; \min_{x \in X} \max_{\xi \in \tilde U(\psi)} c(x, \xi),$$

provide a conservative approximate solution to the CVO problem under the empirical measure $\mathbb{P}^\pi_D$. Namely,

$$\mathrm{VaR}^{D,\pi}_{1-\epsilon}(c(\tilde x(\psi), \xi)) \le \tilde v^*.$$

In particular, in the case of the proposed DCC and IDCC approaches we have that

$$\tilde v^* = \max_{k \in [K]} \; \min_{x \in X} \; \max_{\xi \in U(W^k, R^k, S^k)} c(x, \xi).$$

As the robust optimization paradigm traditionally aims at offering statistical guarantees on the out-of-sample performance of the prescribed solutions, we describe below how a bootstrap method can be used to estimate the radii $R^k$'s.

**Remark 4.1.** Using bootstrapping methods, we can get a conservative approximation of each $R^k$ as the smallest $R$ that satisfies

$$\mathbb{P}_{\tilde D}\!\left( \sum_{i=1}^N \frac{\pi^\theta_k(g_{V_E}(\tilde\psi_i))}{\sum_{i'=1}^N \pi^\theta_k(g_{V_E}(\tilde\psi_{i'}))}\, \mathbb{1}\{\tilde\xi_i \in U(W^k, R, S^k)\} \ge 1 - \epsilon \right) \ge 1 - \delta,$$

where $\mathbb{P}_{\tilde D}$ measures the probability when resampling a new dataset $\tilde D := \{(\tilde\psi_i, \tilde\xi_i)\}_{i=1}^N$ of size $N$ with replacement from $D_{\psi\xi}$. When $N$ is large enough and assuming that each data point is drawn i.i.d. according to some unknown probability measure $\mathbb{P}$, we asymptotically get the guarantee that $\mathbb{P}(\xi \in \tilde U(\psi)) \ge 1 - \epsilon$ with probability higher than approximately $1 - K\delta$.
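A possible Monte-Carlo implementation of the bootstrap calibration in Remark 4.1 is sketched below; it reuses `calibrate_radius_weighted` from the previous sketch, and the reduction of the condition to a $(1-\delta)$-quantile over per-resample radii is our own simplification (it relies on the weighted coverage being nondecreasing in $R$), so it should be read as an assumption rather than as the authors' implementation.

```python
import numpy as np

def bootstrap_radius(dist_k, pi_k, eps=0.1, delta=0.05, B=1000, seed=0):
    """Approximate the bootstrap calibration of Remark 4.1: return a radius
    achieving weighted (1 - eps) coverage on at least a (1 - delta) fraction
    of datasets resampled with replacement from the training data."""
    rng = np.random.default_rng(seed)
    n, radii = len(dist_k), np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # one bootstrap resample
        radii[b] = calibrate_radius_weighted(dist_k[idx], pi_k[idx], eps)
    # smallest R whose per-resample coverage fails on at most a delta share
    return float(np.quantile(radii, 1 - delta))
```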
## 5 Experiments

In this section, we illustrate the coverage aspect of the IDCC approach using simulated data. We further demonstrate the advantage of the CRO problem using a standard risk-minimizing portfolio optimization problem. We compare the performance of IDCC with that of DCC, DDDRO (with the ellipsoidal correction in (9)), and the classical ellipsoidal uncertainty approach (i.e. DCC with $K = 1$ and $f_{W^1}(\xi) := \xi$). The IDCC and DCC methods incorporate the covariate information, whereas the DDDRO and ellipsoid approaches ignore this information. The neural network architecture and other modeling information are available in Appendix B. The code can be found on GitHub (https://anonymous.4open.science/r/Data-Driven-Conditional-Robust-Optimization-E160/). Our code uses the PyTorch implementation from [Goerigk and Kurtz, 2020], which is available online (https://github.com/goerigk/RO-DNN).

### 5.1 Conditional uncertainty set illustration using simulated data

For ease of illustration, we consider a simulation environment where $[\psi^T\ \xi^T]^T \in \mathbb{R}^4$ is a random vector whose distribution is an equal-weighted mixture of two 4-dimensional multivariate normal distributions. We consider $N = 500$, train IDCC (with $K = 2$), DDDRO, and the ellipsoid, and calibrate the uncertainty sets for a probability coverage of 90% and 99% (i.e. $\epsilon \in \{1\%, 10\%\}$). As a result, DDDRO and IDCC, which use deep neural networks, identify non-convex uncertainty sets, whose convex hulls are presented in Figure 1 together with the calibrated ellipsoid. The figure also presents the conditional distribution of $\xi$ according to $\mathbb{P}^\pi_D(\cdot \,|\, \tilde a(\psi) = k)$, using IDCC's randomized assignment, and the training dataset. One can remark that the conditional sets produced by IDCC exploit the side information by concentrating the uncertainty set on the region that has the most mass according to $\mathbb{P}^\pi_D(\cdot \,|\, \tilde a(\psi) = k)$, thus leading to a less conservative RO problem than DDDRO and the ellipsoid, which are oblivious to $\psi$. In fact, it appears to have successfully learned to at least partially recognize the mixture membership using $\psi$ and exploit this information to adapt the uncertainty set.

[Figure 1: Convex hull of trained uncertainty sets for two levels of coverage, with a conditional uncertainty set for IDCC that exploits two clusters. Panels: (a) $\tilde a(\psi) = 1$, 90% coverage; (b) $\tilde a(\psi) = 1$, 99% coverage; (c) $\tilde a(\psi) = 2$, 90% coverage; (d) $\tilde a(\psi) = 2$, 99% coverage. Legend: IDCC, DDDRO, Ellipsoid. The heatmap represents the conditional distribution of $\xi$ according to $\mathbb{P}^\pi_D(\cdot \,|\, \tilde a(\psi) = k)$; the cloud of points represents the training dataset.]

### 5.2 Robust portfolio optimization

We further investigate the empirical out-of-sample performance of the proposed uncertainty sets on a classical robust portfolio optimization problem. Namely, we consider a situation where an investor is trying to maximize the worst-case return based on an uncertainty set that provides $1-\epsilon$ probabilistic coverage of the uncertain future return vector. In particular, given that $x$ captures a vector of investments in $n = m$ different assets whose returns are captured using $\xi$, we let $c(x, \xi) := -\xi^T x$ capture the (negative) return on investment, and let $X := \{x \in \mathbb{R}^n \,|\, \sum_{i=1}^n x_i = 1,\ x \ge 0\}$ capture the need to invest one unit of wealth among the available assets. Following Lemma 4.2, this model can in turn be interpreted as conservatively approximating $\min_{x \in X} \mathrm{VaR}_{1-\epsilon}(-\xi^T x)$, where the objective is a risk-averse value-at-risk metric.

**Dataset** Our experiments make use of historical data from the US stock market. We collect the adjusted daily closing prices of 70 stocks (as used in [Xu and Cohen, 2018]) coming from 8 different sectors, from January 1, 2012, to December 31, 2020, using the Yahoo! Finance API. Each year has 252 data points, and we compute the percentage gain/loss w.r.t. the previous day to create our dataset for $\xi$. As for side information, we use the trading volume of individual stocks and other market indices (Volatility Index (VIX), 10-year Treasury Yield Index (TNX), Oil Index (CL=F), S&P 500 (GSPC), Global Income & Currency Fund (XGCFX), Dow Jones Index (DJI)) over the same period as covariates. Our algorithm gives the flexibility to use any number of such metrics as contextual information. Given the time-series nature of the data, at a given instance, we use 3 years of data to train and the following year as validation to pick the hyperparameters of our model, such as the learning rate, the weight decay, and the optimal number of clusters. We then retrain the model using the 4 years of data to build the final model. Upon calibrating the uncertainty set, we use it to solve the robust portfolio optimization problem. We then apply this policy to the following year of data and compute the performance metric, namely the Value-at-Risk (VaR) at different confidence levels, to compare the performances. VaR quantifies the level of risk of a portfolio over a specified time frame. Here, it gives an estimate of the maximum % loss the decision maker can incur over a period of 1 year when he uses the policy from the RO model. Intuitively, the lower the VaR, the less risky the generated policy. Many financial institutions use VaR to determine the amount of collateral needed when trading financial products, so lowering VaR at high confidence levels is crucial.
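To make the evaluation pipeline concrete, the sketch below solves the non-contextual ellipsoidal benchmark, whose robust counterpart has the closed form $-\mu^\top x + R\,\|\Sigma^{1/2}x\|_2$ over the simplex, and computes the out-of-sample VaR of the resulting daily portfolio losses. The use of cvxpy, the covariance jitter, and the function names are illustrative assumptions; the deep uncertainty sets of Sections 3 and 4 instead require the constraint-generation scheme of [Goerigk and Kurtz, 2020].

```python
import numpy as np
import cvxpy as cp

def ellipsoidal_robust_portfolio(returns_train, R):
    """min_x max_{xi in ellipsoid} -xi'x over the simplex, via the closed-form
    robust counterpart -mu'x + R * ||Sigma^{1/2} x||_2."""
    mu = returns_train.mean(axis=0)
    Sigma = np.cov(returns_train, rowvar=False) + 1e-8 * np.eye(len(mu))
    L = np.linalg.cholesky(Sigma)                    # Sigma = L L'
    x = cp.Variable(len(mu))
    objective = cp.Minimize(-mu @ x + R * cp.norm(L.T @ x, 2))
    cp.Problem(objective, [cp.sum(x) == 1, x >= 0]).solve()
    return x.value

def empirical_var(daily_returns, conf=0.99):
    """Out-of-sample VaR: the conf-level quantile of the daily percentage loss."""
    return float(np.quantile(-daily_returns, conf))

# illustrative usage on train/test return matrices (days x assets):
# x = ellipsoidal_robust_portfolio(train_returns, R=2.0)
# var_99 = empirical_var(test_returns @ x, conf=0.99)
```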
**Experiment Design** To test the robustness of the IDCC algorithm, we experiment on various randomly sampled stock combinations across different time periods. We randomly sampled a subset of 15 stocks in a time window and repeated the experiment for 10 runs on 3 moving time frames. We used a learning rate of 0.01, $\alpha_K = 0.5$, $\alpha_S = 0.5$, and $\beta = 0.1$ for all the experiments, and a cold-start K-means approach to determine $K$ for each run. We do this across all these experiments because it would be computationally expensive to tune the parameters through a grid search for each run, and also because our intention is to show the learning capability of our algorithm even with minimal tuning. The parameter tuning and implementation details can be found in Appendix B.3.

[Figure 2: Average VaR across portfolio simulations, at confidence levels 0.8, 0.9, 0.95, and 0.99, with one panel per test year ((c) shows 2019); methods compared: IDCC, DCC, DDDRO, Ellipsoid. Error bars report 95% CI.]

**Results** Figure 2 shows the average VaR across the runs at different confidence levels. It is evident that IDCC generally performs better than the baseline models. This difference is especially noticeable at higher confidence levels and vanishes as we move to lower confidence levels. Table 1 provides more details by comparing the overall and conditional cluster-level VaR with the baseline models. Specifically, in each run, we identify each cluster as either the majority or minority cluster depending on its frequency, and report averages of VaR (among the 10 runs) for each of these labels. The average frequencies for each label are also reported in the table. In particular, one can observe that the improvement in average overall VaR can reach up to 15% (see 2019 at the 0.99 confidence level). This advantage is even more clearly visible when we look at the individual cluster-level conditional VaR. For instance, in the year 2018 at the 0.99 confidence level, the majority cluster (~68% of the data) provides an improvement of 19% and an overall improvement of 9% compared to the second-best baseline model. A similar pattern is observed for the year 2019 as well. In the year 2017, the overall performance of IDCC is close to, and for some confidence levels slightly above, the baseline models. However, we see that the majority cluster (~80% of the data) performs better than the baseline models while the minority cluster has a slightly higher risk. We attribute this loss in performance to the fact that the minority cluster is much less frequent (~20% of the data) and therefore has fewer data available to properly learn its conditional uncertainty set. This large difference in frequencies might also indicate that the side information does not have a strong signal for the behavior of the returns during this period of time.
| | Method | 2017 (0.8) | 2017 (0.9) | 2017 (0.95) | 2017 (0.99) | 2018 (0.8) | 2018 (0.9) | 2018 (0.95) | 2018 (0.99) | 2019 (0.8) | 2019 (0.9) | 2019 (0.95) | 2019 (0.99) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | IDCC | 0.30 | 0.55 | 0.75 | 1.37 | 0.64 | 1.16 | 1.67 | 2.86 | 0.44 | 0.77 | 1.11 | 2.02 |
| | DDDRO | 0.31 | 0.52 | 0.79 | 1.46 | 0.63 | 1.24 | 1.84 | 3.17 | 0.45 | 0.84 | 1.27 | 2.35 |
| | Ellipsoid | 0.30 | 0.49 | 0.75 | 1.45 | 0.72 | 1.45 | 2.04 | 3.19 | 0.47 | 0.81 | 1.30 | 2.52 |
| Majority cluster (freq. 80% / 68% / 59%) | IDCC | 0.31 | 0.52 | 0.71 | 1.30 | 0.57 | 1.08 | 1.50 | 2.62 | 0.44 | 0.75 | 1.17 | 1.88 |
| | DDDRO | 0.31 | 0.52 | 0.74 | 1.35 | 0.59 | 1.15 | 1.63 | 3.23 | 0.45 | 0.85 | 1.31 | 2.06 |
| | Ellipsoid | 0.32 | 0.52 | 0.74 | 1.41 | 0.69 | 1.29 | 1.92 | 3.08 | 0.47 | 0.85 | 1.25 | 2.31 |
| Minority cluster (freq. 20% / 32% / 41%) | IDCC | 0.30 | 0.61 | 0.77 | 1.43 | 0.96 | 1.57 | 2.05 | 3.13 | 0.48 | 0.82 | 1.15 | 2.22 |
| | DDDRO | 0.30 | 0.56 | 0.84 | 1.39 | 1.00 | 1.66 | 2.04 | 3.30 | 0.49 | 0.84 | 1.40 | 2.39 |
| | Ellipsoid | 0.28 | 0.47 | 0.69 | 1.13 | 1.17 | 1.80 | 2.43 | 3.43 | 0.49 | 0.82 | 1.38 | 2.57 |

Table 1: Comparison of average value-at-risk (over 10 runs) for different levels of probability coverage $1-\epsilon$. Both the overall VaR and the conditional VaR given membership to the majority/minority cluster are presented; the conditional rows also report the average cluster frequencies for 2017/2018/2019.

## 6 Conclusion and Future Work

In this work, we introduced a new approach, Conditional Robust Optimization, for solving contextual optimization problems in a risk-averse setting. We proposed a novel integrated approach to design uncertainty sets that adapt to revealed covariate information. We identified connections to contextual value-at-risk optimization and showed empirically that our method reduces the out-of-sample VaR considerably compared to non-contextual RO schemes when the level of protection needed is high. As future work, it would be interesting to integrate data-driven conditional uncertainty sets in the context of multi-stage robust optimization models. Given that clustering techniques are often prone to learning correlations from the data that do not reflect true causal relations, there might also be a need to integrate causal inference methods into our approach. One might also be concerned about fairness considerations in contexts where side information might allow a certain class of individuals to be treated differently from others. This last issue might be addressed by adding fairness considerations to our integrated loss function.

**Acknowledgement** The authors gratefully acknowledge support from the Institut de Valorisation des Données (IVADO) and from the Canadian Natural Sciences and Engineering Research Council [Grants RGPIN-2016-05208 and 492997-2016].

## References

Gah-Yi Ban and Cynthia Rudin. The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1):90–108, 2019.

Gah-Yi Ban, Jérémie Gallien, and Adam J. Mersereau. Dynamic procurement of new products with covariate information: The residual tree method. Manufacturing & Service Operations Management, 21(4):798–815, 2019.

M. P. Bendsøe, A. Ben-Tal, and J. Zowe. Optimization methods for truss geometry and topology design. Structural Optimization, 7(3):141–159, 1994.

Fernando P. Bernardo and Pedro M. Saraiva. Robust optimization framework for process parameter and tolerance design. AIChE Journal, 44(9):2007–2017, 1998.

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.

Dimitris Bertsimas and Bart Van Parys. Bootstrap robust prescriptive analytics. Mathematical Programming, pages 1–40, 2021.

Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization in electromagnetic scattering problems. Journal of Applied Physics, 101(7):074507, 2007.

Dimitris Bertsimas, Christopher McCord, and Bradley Sturt. Dynamic optimization with side information. European Journal of Operational Research, 2022.

Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.

Millie Chu, Yuriy Zinchenko, Shane G. Henderson, and Michael B. Sharpe. Robust optimization for intensity modulated radiation therapy treatment planning under uncertainty. Physics in Medicine and Biology, 50(23):5463–5477, 2005.
Priya L. Donti, Brandon Amos, and J. Zico Kolter. Task-based end-to-end model learning. CoRR, abs/1703.04529, 2017.

Adam N. Elmachtoub and Paul Grigas. Smart "predict, then optimize". Management Science, 68(1):9–26, 2022.

Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep k-means: Jointly clustering with k-means and learning representations. Pattern Recognition Letters, 138:185–192, 2020.

Marc Goerigk and Jannis Kurtz. Data-driven robust optimization using unsupervised deep learning. arXiv preprint arXiv:2011.09769, 2020.

Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pages 1753–1759, 2017.

Lauren Hannah, Warren Powell, and David Blei. Nonparametric density estimation for stochastic optimization with an observable state variable. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.

Yichun Hu, Nathan Kallus, and Xiaojie Mao. Fast rates for contextual linear optimization. Management Science, 2022.

Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33:2759–2770, 2020.

Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. Advances in Neural Information Processing Systems, 30, 2017.

Nathan Kallus and Xiaojie Mao. Stochastic optimization forests. arXiv preprint arXiv:2008.07473, 2020.

Rohit Kannan, Güzin Bayraksan, and James R. Luedtke. Data-driven sample average approximation with covariate information. Optimization Online, URL: http://www.optimization-online.org/DB_HTML/2020/07/7932.html, 2020a.

Rohit Kannan, Güzin Bayraksan, and James R. Luedtke. Residuals-based distributionally robust optimization with covariate information. arXiv preprint arXiv:2012.01088, 2020b.

Shaochong Lin, Youhua (Frank) Chen, Yanzhi Li, and Zuo-Jun Max Shen. Data-driven newsvendor problems regularized by a profit risk constraint. Production and Operations Management, 31(4):1630–1644, 2022.

Murari Mani, Ashish K. Singh, and Michael Orshansky. Joint design-time and post-silicon minimization of parametric yield loss using adjustable robust optimization. In 2006 IEEE/ACM International Conference on Computer-Aided Design, pages 19–26, 2006.

Christopher George McCord. Data-driven dynamic optimization with auxiliary covariates. PhD thesis, Massachusetts Institute of Technology, 2019.

MOSEK ApS. MOSEK Fusion API for C++ 9.3.20, 2022. URL https://docs.mosek.com/latest/cxxfusion/index.html.

Karthik Natarajan, Dessislava Pachamanova, and Melvyn Sim. Incorporating asymmetric distributional information in robust value-at-risk optimization. Management Science, 54(3):573–585, 2008.

Viet Anh Nguyen, Fan Zhang, Jose Blanchet, Erick Delage, and Yinyu Ye. Robustifying conditional portfolio decisions via optimal transport, 2021.

Shunichi Ohmori. A predictive prescription using minimum volume k-nearest neighbor enclosing ellipsoid and robust optimization. Mathematics, 9(2):119, 2021.

Thomas J. Page Jr. Multivariate statistics: A vector space approach. JMR, Journal of Marketing Research (pre-1986), 21(2):236, 1984.

R. Tyrrell Rockafellar and Roger J.-B. Wets.
Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Prateek R. Srivastava, Yijie Wang, Grani A. Hanasusanto, and Chin Pang Ho. On data-driven prescriptive analytics with side information: A regularized Nadaraya-Watson approach, 2021a.

Prateek R. Srivastava, Yijie Wang, Grani A. Hanasusanto, and Chin Pang Ho. On data-driven prescriptive analytics with side information: A regularized Nadaraya-Watson approach. arXiv preprint arXiv:2110.04855, 2021b.

Cong Wang, Xin Peng, Chao Shang, Chen Fan, Liang Zhao, and Weimin Zhong. A deep learning-based robust optimization approach for refinery planning under uncertainty. Computers & Chemical Engineering, 155:107495, 2021.

Kai Wang and Alex Jacquillat. From classification to optimization: A scenario-based robust optimization approach. Available at SSRN 3734002, 2020.

Yumo Xu and Shay B. Cohen. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1979, 2018.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] (Section 6)
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] (Section 6)
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] (Appendix A)
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (Section 5)
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (Appendix B)
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] (Section 5)
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [TODO]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] https://github.com/goerigk/RO-DNN
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] https://anonymous.4open.science/r/Data-Driven-Conditional-Robust-Optimization-E160/README.md
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]