# Robust Data-driven Prescriptiveness Optimization

Mehran Poursoltani¹, Erick Delage², Angelos Georghiou³

¹ Desautels Faculty of Management, McGill University, Montréal, Canada. ² GERAD & Department of Decision Sciences, HEC Montréal, Montréal, Canada. ³ Department of Business and Public Administration, University of Cyprus, Nicosia, Cyprus. Correspondence to: Mehran Poursoltani.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

**Abstract.** The abundance of data has led to the emergence of a variety of optimization techniques that attempt to leverage available side information to provide more anticipative decisions. The wide range of methods and contexts of application have motivated the design of a universal unitless measure of performance known as the coefficient of prescriptiveness. This coefficient was designed to quantify both the quality of contextual decisions compared to a reference one and the prescriptive power of side information. To identify policies that maximize the former in a data-driven context, this paper introduces a distributionally robust contextual optimization model where the coefficient of prescriptiveness substitutes for the classical empirical risk minimization objective. We present a bisection algorithm to solve this model, which relies on solving a series of linear programs when the distributional ambiguity set has an appropriate nested form and polyhedral structure. Studying a contextual shortest path problem, we evaluate the robustness of the resulting policies against alternative methods when the out-of-sample dataset is subject to varying amounts of distribution shift.

## 1. Introduction

Stochastic programming is perceived as one of the fundamental methods devised for decision-making under uncertainty (see Shapiro et al. 2021 and Birge & Louveaux 2011). Given a cost function $h(x, \xi)$ that depends on a decision $x \in \mathbb{R}^{n_x}$ and a random vector $\xi \in \mathbb{R}^{n_\xi}$, the stochastic programming (SP) problem is defined as

$$\text{(SP)} \quad x^\star \in \arg\min_{x \in \mathcal{X}} \; \mathbb{E}_F[h(x, \xi)], \tag{1}$$

where $\mathcal{X}$ is a convex feasible set, $h(x, \xi)$ is a cost function that is assumed convex in $x$ for all $\xi$, and $\xi$ is assumed to be drawn from the distribution $F$. Solution methods for this problem mainly rely on either assuming an a priori distribution for $F$ or exploiting a set of independent and identically distributed observations. In the latter case, a set of i.i.d. observations of the random vector $\xi$, denoted by $\mathcal{S} := \{\xi_i\}_{i=1}^N$, can be used to formulate the following sample average approximation problem:

$$\text{(SAA)} \quad x^\star \in \arg\min_{x \in \mathcal{X}} \; \frac{1}{N}\sum_{i=1}^N h(x, \xi_i), \tag{2}$$

where we assume a uniform distribution over the observed data.
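To make the SAA step concrete, the following minimal sketch (our illustration with made-up costs and data, not an example taken from the paper) solves (2) for a newsvendor cost $h(x,\xi) := cx - p\min(x,\xi)$, whose SAA minimizer is known to be an empirical quantile of the observed demands:

```python
import numpy as np

rng = np.random.default_rng(0)
c, p = 3.0, 5.0                          # hypothetical unit cost and sale price
demand = rng.lognormal(3.0, 0.5, 400)    # i.i.d. observations of xi

def saa_cost(x, xi):
    # empirical average of h(x, xi) = c*x - p*min(x, xi)
    return np.mean(c * x - p * np.minimum(x, xi))

# For this cost, the SAA minimizer is the (1 - c/p)-quantile of the sample.
x_saa = np.quantile(demand, 1.0 - c / p)
print(x_saa, saa_cost(x_saa, demand))
```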
Recently, the availability of large datasets has played a critical role in redirecting the optimization methods devised for decision-making under uncertainty towards taking advantage of so-called *side information* or *covariates*. This paradigm encourages decision-makers to benefit from the available data beyond the desired random variables to make more anticipative decisions. For instance, a portfolio manager who optimizes her investments in the stock market may consider a variety of available micro- and macroeconomic indicators as side information to make more anticipative decisions (see Brandt et al. 2009 and Bazier-Matte & Delage 2020), while a traffic path planner can utilize side information like time of day, weather status, and holiday/work day to find the best route through the city (see Bertsimas & Kallus 2020). This gives rise to the following contextual stochastic optimization (CSO) problem:

$$\text{(CSO)} \quad x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \; \mathbb{E}_F[h(x, \xi) \,|\, \zeta], \tag{3}$$

where $\zeta \in \mathbb{R}^{n_\zeta}$ denotes the given vector of *covariates*, or so-called *features*. In this case, any observed random vector $\xi_i$ is accompanied by a vector of covariates $\zeta_i \in \mathbb{R}^{n_\zeta}$. The difficulty of this problem shows up when the conditional probability distribution $F_{\xi|\zeta}$ is unknown and only a set of i.i.d. observations $\mathcal{T} := \{(\zeta_i, \xi_i)\}_{i=1}^N$ is available. In this case, a data-driven variant of the CSO problem can be written as

$$x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \; \mathbb{E}_{\hat{F}_{\xi|\zeta}}[h(x, \xi)], \tag{4}$$

where $\hat{F}_{\xi|\zeta}$ is a conditional probability model for $\xi$ given $\zeta$ inferred from the available data, e.g., by training a random forest (Breiman, 2001) or via kernel density estimation (Ban & Rudin, 2019). To deal with possible overfitting in the presence of limited data, or with possible distribution shifts due to unexpected events, one can formulate a distributionally robust contextual stochastic optimization (DRCSO) model, which in general takes the following form:

$$\text{(DRCSO)} \quad x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \; \sup_{F \in \mathcal{D}} \mathbb{E}_F[h(x, \xi) \,|\, \zeta], \tag{5}$$

where $\mathcal{D}$ is the ambiguity set containing the set of admissible distributions (see Duchi et al. 2020; Bertsimas & Van Parys 2022; Kannan et al. 2020; Nguyen et al. 2022; Esteban-Pérez & Morales 2022; Srivastava et al. 2021, and literature within).

Recently, Bertsimas & Kallus 2020 proposed to compare the performance of different CSO (or DRCSO) approaches by measuring the *coefficient of prescriptiveness*, defined as:

$$P_F(x(\cdot)) := 1 - \frac{\mathbb{E}_F[h(x(\zeta), \xi)] - \mathbb{E}_F[\min_{x' \in \mathcal{X}} h(x', \xi)]}{\mathbb{E}_F[h(\hat{x}, \xi)] - \mathbb{E}_F[\min_{x' \in \mathcal{X}} h(x', \xi)]}, \tag{6}$$

where $\hat{x} := \arg\min_x \mathbb{E}_{\hat{F}}[h(x, \xi)]$ with $\hat{F}$ as the empirical distribution that puts equal weight on each observed data point $\{\xi_i\}_{i=1}^N$ (i.e., the solution of SAA). The idea behind the coefficient of prescriptiveness is that it measures the performance of a given policy $x(\zeta)$ relative to the constant decision $\hat{x}$, which is agnostic to the side information $\zeta$, and to the fully anticipative policy, which achieves the progressive optimal value of $\mathbb{E}_F[\min_{x' \in \mathcal{X}} h(x', \xi)]$. It is easy to see that a high value of $P_F$ indicates that the policy can leverage the contextual information of $\zeta$, with $P_F = 1$ indicating that the policy is achieving the fully anticipative performance in terms of $\xi$. In contrast, a low value of $P_F$ indicates that the policy is not able to exploit (or is even misled by) the available information. This behavior is reminiscent of $R^2$, the *coefficient of determination*, typically used in the context of predictive models, a connection which we discuss in the next section.

Following the introduction of the coefficient of prescriptiveness, this metric has been employed in several pieces of research to demonstrate the potential of proposed data-driven policies for leveraging the available side information. One can refer to Bertsimas et al. 2016 for such a comparison in the context of inventory management, Stratigakos et al. 2022 for energy trading, Notz & Pibernik 2022 for flexible capacity planning, and Kallus & Mao 2023 for shortest path and portfolio optimization problems.
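To fix ideas, here is a minimal sketch (hypothetical helper names; not code from the paper) of how the coefficient of prescriptiveness (6) can be estimated on a sample, given a candidate policy, the SAA decision $\hat x$, and an oracle for the fully anticipative cost $\min_{x' \in \mathcal{X}} h(x', \xi)$:

```python
import numpy as np

def coefficient_of_prescriptiveness(policy, x_saa, full_info_cost, h, data):
    """Estimate P_F in (6) on a sample {(zeta_i, xi_i)}.

    policy:          maps a covariate vector zeta to a decision x(zeta)
    x_saa:           the SAA decision (covariate-agnostic benchmark)
    full_info_cost:  maps xi to min_{x in X} h(x, xi)
    h:               cost function h(x, xi)
    data:            list of (zeta_i, xi_i) pairs
    """
    cost_policy = np.mean([h(policy(z), xi) for z, xi in data])
    cost_saa = np.mean([h(x_saa, xi) for z, xi in data])
    cost_star = np.mean([full_info_cost(xi) for z, xi in data])
    return 1.0 - (cost_policy - cost_star) / (cost_saa - cost_star)
```

A value near 1 signals that the policy nearly closes the gap between $\hat x$ and hindsight decisions, while a negative value signals that the covariates were actively misleading.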
We note that, in the current literature, $P_F$ is only used as a benchmark metric for assessing the performance of policies computed using different approaches; e.g., in Bertsimas & Kallus 2020, the metric compares policies computed (amongst others) using CSO where the conditional probability is estimated by random forests and kernel density estimation. Given its prevalence as a performance measure, it is natural to question whether it is possible and useful to directly optimize the coefficient of prescriptiveness. While one can show that maximizing $P_F$ reduces to solving the CSO problem, one may wonder how the $P_F$ measure should be robustified in order to improve out-of-sample performance. In this work, we introduce for the first time a distributionally robust version of $P_F$. We establish connections to other models in the literature and present an efficient algorithm to maximize it when the conditional probability model is discrete (such as with a random forest or with a kernel density estimator).

The rest of the paper is organized as follows. Section 2 motivates the optimization of the coefficient of prescriptiveness by explicating its relationship to the coefficient of determination in the field of statistics. Section 3 introduces a robust data-driven prescriptiveness optimization model that can be used to maximize a distributionally robust version of the coefficient of prescriptiveness. We reformulate this problem as a convex optimization problem that can reduce to a linear program when the ambiguity set takes the form of a so-called *nested Conditional Value-at-Risk (CVaR) set*. A bisection method is proposed to solve the latter, as well as an acceleration scheme. Finally, Section 4 presents the numerical experiments, where we evaluate the robustness of the resulting policies against benchmark ones in a shortest path problem when the out-of-sample dataset confronts a distribution shift. All proofs are relegated to Appendix A.

## 2. Motivation for optimizing P and its robustification

As argued in Bertsimas & Kallus 2020, in the context of predictive models, where one wishes to predict the value of $\xi \in \mathbb{R}$ based on a list of covariates $\zeta$ using a statistical model $f : \mathbb{R}^{n_\zeta} \to \mathbb{R}$, one popular metric that is employed takes the form of the so-called *coefficient of determination*:

$$R^2(f(\cdot)) := 1 - \frac{\mathbb{E}_{\hat{F}}[(f(\zeta) - \xi)^2]}{\mathbb{E}_{\hat{F}}[(\hat{\xi} - \xi)^2]},$$

where $\hat{\xi} := \mathbb{E}_{\hat{F}}[\xi]$ is the empirical mean of $\xi$ in the data set and $\hat{F}$ is the empirical joint distribution of $(\zeta, \xi)$. The popularity of $R^2$ compared to mean squared error as a measure of performance can be partially attributed to its being unitless. It is upper bounded by 1, with a value closer to 1 indicating that most of the variation of $\xi$ can be modeled using $f(\cdot)$. On the flip side, when it is strictly smaller than 0, its absolute value measures the percentage of additional variation that is introduced by the predictive model, thus indicating a degradation of predictive power when compared to the simple sample average $\hat{\xi}$.

The coefficient of prescriptiveness can be viewed as an attempt to introduce an analogous measure in the contextual optimization setting. More specifically, it reduces to $R^2$ when $n_x = 1$ and $h(x, \xi) := (x - \xi)^2$, namely:

$$P_{\hat{F}}(x(\cdot)) = 1 - \frac{\mathbb{E}_{\hat{F}}[(x(\zeta) - \xi)^2] - \mathbb{E}_{\hat{F}}[\min_{x'} (x' - \xi)^2]}{\mathbb{E}_{\hat{F}}[(\hat{x} - \xi)^2] - \mathbb{E}_{\hat{F}}[\min_{x'} (x' - \xi)^2]} = R^2(x(\cdot)),$$

since $\mathbb{E}_{\hat{F}}[\min_{x'} (x' - \xi)^2] = 0$ and $\hat{x} := \arg\min_x \mathbb{E}_{\hat{F}}[(x - \xi)^2] = \hat{\xi}$. Hence, the coefficient of prescriptiveness has a similar interpretation as $R^2$. Namely, $P_F$ is upper bounded by 1, and as it gets closer to 1, it indicates how successful the data-driven policy has been in closing the gap between the SAA solution, which makes no use of covariate information, and a hypothetical policy that would have access to full information about $\xi$.
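The equivalence with $R^2$ is easy to verify numerically; the following sketch (illustrative synthetic data, not from the paper) compares the coefficient of prescriptiveness under $h(x,\xi) := (x-\xi)^2$ with a direct computation of $R^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
zeta = rng.normal(size=500)
xi = 2.0 * zeta + rng.normal(size=500)   # ground truth: xi depends on zeta
f = lambda z: 2.0 * z                    # a candidate predictive model

h = lambda x, y: (x - y) ** 2
x_hat = xi.mean()                        # SAA solution = empirical mean
cost_policy = np.mean(h(f(zeta), xi))
cost_saa = np.mean(h(x_hat, xi))
P = 1 - cost_policy / cost_saa           # here E[min_x (x - xi)^2] = 0

R2 = 1 - np.sum((f(zeta) - xi) ** 2) / np.sum((xi.mean() - xi) ** 2)
assert np.isclose(P, R2)                 # P reduces to R^2 for squared loss
```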
One can also find traces in the literature of attempts to measure $R^2(x(\cdot))$ out-of-sample. Namely, Campbell & Thompson 2008 study whether excess stock return predictors can outperform historical averages in terms of the out-of-sample explanatory power of such predictors. This measure can be captured using

$$R^2_F(f(\cdot), \hat{F}) := 1 - \frac{\mathbb{E}_F[(f(\zeta) - \xi)^2]}{\mathbb{E}_F[(\hat{\xi} - \xi)^2]} = P_F(f(\cdot)),$$

which naturally leads to the question of whether $R^2(f(\cdot))$ is a good approximation for $R^2_F(f(\cdot), \hat{F})$ in a data-driven environment (potentially susceptible to distribution shifts). If not, then one must turn to employing more robust estimation methods.

## 3. Robust Data-driven Prescriptiveness Optimization

In order to tackle the robustification and optimization of $P$, we consider a more general version of this measure, which relaxes the assumption that the benchmark is the solution to (2) and widens the scope of our analysis. To this end, we define the prescriptiveness competitive ratio (PCR) of a policy $x(\cdot)$ with respect to a reference policy $\bar{x}$ as:

$$V_F(x(\cdot), \bar{x}) := \begin{cases} 1 - \dfrac{\mathbb{E}_F[h(x(\zeta),\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]}{\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]} & \text{if } \mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] > 0\\[1ex] 1 & \text{if } \mathbb{E}_F[h(\bar{x},\xi)] = \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] \text{ and } \mathbb{E}_F[h(x(\zeta),\xi)] = \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]\\[1ex] -\infty & \text{otherwise.} \end{cases}$$

Indeed, the coefficient of prescriptiveness can be considered a special case when $\bar{x} := \hat{x}$:

$$V_F(x(\cdot), \hat{x}) = 1 - \frac{\mathbb{E}_F[h(x(\zeta), \xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x', \xi)]}{\mathbb{E}_F[h(\hat{x}, \xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x', \xi)]} = P_F(x(\cdot))$$

when $\mathbb{E}_F[h(\hat{x}, \xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x', \xi)] > 0$, while the two other cases follow from the natural extension of the definition of $P_F(x(\cdot))$. In contrast to $P_F(x(\cdot))$, which benchmarks the policy $x(\cdot)$ only to the SAA solution, the definition of $V_F$ allows us to benchmark against any other static policy. This allows our model to accommodate situations where more sophisticated statistical tools might be used to obtain the reference decision (e.g., regularized or distributionally robust SAA approaches (Lam, 2019; Mohajerin Esfahani & Kuhn, 2018; Van Parys et al., 2021), variance-based regularized solution schemes (Duchi et al., 2020), or data-pooled solution schemes (Gupta & Kallus, 2022)).¹

In a finite sample regime, where $\hat{F}$ might fail to capture the true underlying distribution, or in a situation where we expect distribution shifts, one should be interested in a distributionally robust estimation of the PCR (or, equivalently, of the coefficient of prescriptiveness), which takes the form of:

$$V_\mathcal{D}(x(\cdot), \bar{x}) := \inf_{F \in \mathcal{D}} V_F(x(\cdot), \bar{x}) \;\;\Big(= \inf_{F \in \mathcal{D}} P_F(x(\cdot)) \text{ when } \bar{x} := \hat{x}\Big), \tag{PCR}$$

where $\mathcal{D}$ is a set of distributions over the joint space of $(\zeta, \xi)$, and the notation $V_\mathcal{D}$ is overloaded to denote the distributionally robust PCR measure. Furthermore, one might be interested in identifying the policy that maximizes the PCR in the form of the following distributionally robust optimization problem:

$$\text{(DRPCR)} \quad \max_{x(\cdot) \in \mathcal{H}} V_\mathcal{D}(x(\cdot), \bar{x}),$$

where $\mathcal{H} \subseteq \{x : \mathbb{R}^{n_\zeta} \to \mathcal{X}\}$. The following lemma provides interpretable bounds for the value of $V_\mathcal{D}$.

**Lemma 3.1.** If $\bar{x} \in \mathcal{H}$, then the optimal value of DRPCR is necessarily in the interval $[0, 1]$.

Lemma 3.1 can be interpreted as follows. First, if $x(\zeta)$ achieves $V_\mathcal{D}(x(\cdot), \bar{x}) = 1$, then the policy is guaranteed to exploit $\zeta$ just as efficiently as if it had full information about $\xi$ (namely, it achieves the fully anticipative performance). On the other end of the spectrum, $V_\mathcal{D}(x(\cdot), \bar{x}) = 0$ indicates that the policy can potentially fail to exploit any of the information present in $\zeta$. When $\bar{x} \in \mathcal{H}$, one can always prevent a negative PCR by falling back to the benchmark policy $\bar{x}$.

¹ In fact, one can go a step further and define $V_F(x(\cdot), \bar{x}(\cdot))$, where $\bar{x}(\cdot)$ is not a static policy. For example, $\bar{x}(\cdot)$ could be a simple rule-based policy such as the order-up-to policy in inventory control. For ease of exposition, we treat the benchmark policy $\bar{x}$ as a static policy for the remainder of the paper.
Next, we show that in an environment where the distribution is known, the optimal policy obtained from CSO is an optimal solution to DRPCR. Before proceeding, we first make the following assumption.

**Assumption 3.2.** The policy set $\mathcal{H}$ contains all possible mappings, i.e., $\mathcal{H} := \{x : \mathbb{R}^{n_\zeta} \to \mathcal{X}\}$.

**Lemma 3.3.** Given that Assumption 3.2 is satisfied, if the distribution set is a singleton, i.e., $\mathcal{D} = \{\bar{F}\}$, then the optimal policy obtained from the CSO problem that employs $\bar{F}$ maximizes DRPCR.

While Lemma 3.3 implies that DRPCR reduces to CSO when the distribution is known, thus making the question of PCR optimization and performance irrelevant, this is not the case anymore for larger ambiguity sets $\mathcal{D}$. In this section, we first present a convex reformulation of DRPCR and then provide a reformulation of the problem for the nested CVaR ambiguity set. Finally, we propose a decomposition algorithm for solving the problem based on a bisection algorithm.

### 3.1. Convex formulation for DRPCR

The following proposition provides a convex reformulation of DRPCR.

**Proposition 3.4.** Given that $\bar{x} \in \mathcal{H}$, DRPCR is equivalent to

$$\begin{aligned}
\max_{x(\cdot) \in \mathcal{H},\, \gamma}\;\; & \gamma & (8a)\\
\text{s.t.}\;\; & Q(x(\cdot), \gamma) \le 0 & (8b)\\
& 0 \le \gamma \le 1, & (8c)
\end{aligned}$$

where

$$Q(x(\cdot), \gamma) := \sup_{F \in \mathcal{D}} \mathbb{E}_F\Big[ h(x(\zeta), \xi) - \Big( (1-\gamma)\, h(\bar{x}, \xi) + \gamma \min_{x' \in \mathcal{X}} h(x', \xi) \Big) \Big]$$

is a convex non-decreasing function of $\gamma$. Moreover, problem (8) is a convex optimization problem when $\mathcal{H}$ is convex.

From the reformulation (8), one can draw interesting insights regarding the connection between DRPCR and risk-averse regret minimization; see Poursoltani et al. 2023. For $\gamma = 1$, the problem reduces to the ex-post risk-averse regret minimization problem. In contrast, for $\gamma = 0$, one can interpret the problem as regretting the performance of the policy compared to a policy with less information. In the notation of Poursoltani et al. 2023, this leads to a risk-averse regret problem with $\Delta = 1$.

### 3.2. The nested CVaR ambiguity set $\mathcal{D}$

In the following, we consider a discrete empirical distribution $\bar{F}$ and restrict $\mathcal{D}$ to be a nested CVaR ambiguity set. This ambiguity set is motivated by the works on nested dynamic risk measures (see Riedel 2004, Detlefsen & Scandolo 2005, and Ruszczyński & Shapiro 2006), as will be explained shortly. We formalize our approach through the following assumption.

**Assumption 3.5.** There is a discrete distribution $\bar{F}$, with $\{\zeta_\omega\}_{\omega \in \Omega_\zeta}$ and $\{\xi_{\omega'}\}_{\omega' \in \Omega_\xi}$ as the sets of distinct scenarios for $\zeta$ and $\xi$ respectively, such that the distribution set $\mathcal{D}$ takes the form of the nested CVaR ambiguity set with respect to $P_{\bar{F}}$, defined as

$$\mathcal{D}(\bar{F}, \alpha) := \left\{ F \in \mathcal{M}(\Omega_\zeta \times \Omega_\xi) \,\middle|\, \begin{array}{ll} P_F(\zeta = \zeta_\omega) = P_{\bar{F}}(\zeta = \zeta_\omega) & \forall \omega \in \Omega_\zeta\\ P_F(\xi = \xi_{\omega'} \,|\, \zeta_\omega) \le \frac{1}{1-\alpha}\, P_{\bar{F}}(\xi = \xi_{\omega'} \,|\, \zeta_\omega) & \forall \omega \in \Omega_\zeta,\, \omega' \in \Omega_\xi \end{array} \right\}, \tag{9}$$

where $\mathcal{M}(\Omega_\zeta \times \Omega_\xi)$ is the set of all distributions supported on the joint space $\{\zeta_\omega\}_{\omega \in \Omega_\zeta} \times \{\xi_{\omega'}\}_{\omega' \in \Omega_\xi}$. The structure of $\mathcal{D}(\bar{F}, \alpha)$ implies that there is no ambiguity in the marginal distribution of the observed random variable $\zeta$. Rather, the ambiguity is solely on the unobserved random variable $\xi$ and is sized using the parameter $\alpha$.
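Because the conditional probabilities in (9) are only capped at $\bar q/(1-\alpha)$, the inner worst case is attained by greedily loading probability mass on the largest cost realizations, which is exactly a discrete CVaR computation, anticipating the identity stated next. A minimal sketch (our illustration, not the authors' code):

```python
import numpy as np

def worst_case_conditional_expectation(g, q_bar, alpha):
    """sup over {q : 0 <= q <= q_bar/(1-alpha), sum q = 1} of q @ g,
    which equals CVaR_alpha of g under q_bar for a discrete distribution."""
    cap = q_bar / (1.0 - alpha)
    q = np.zeros_like(q_bar)
    budget = 1.0
    for i in np.argsort(-g):        # load mass on the largest costs first
        q[i] = min(cap[i], budget)
        budget -= q[i]
        if budget <= 1e-12:
            break
    return q @ g

def worst_case_expectation(G, p_bar, Q_bar, alpha):
    # G[w, w'] = g(x(zeta_w), xi_w'), p_bar = marginal of zeta,
    # Q_bar[w, :] = conditional distribution of xi given zeta_w
    return sum(p_bar[w] * worst_case_conditional_expectation(G[w], Q_bar[w], alpha)
               for w in range(len(p_bar)))
```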
The nested CVaR ambiguity set owes its name to Ruszczyński & Shapiro 2006 and the fact that, for any function $g(x, \xi)$:

$$\begin{aligned}
\sup_{F \in \mathcal{D}(\bar{F},\alpha)} \mathbb{E}_F[g(x(\zeta), \xi)] &= \sup_{F \in \mathcal{D}(\bar{F},\alpha)} \sum_{\omega \in \Omega_\zeta}\sum_{\omega' \in \Omega_\xi} P_F(\zeta = \zeta_\omega)\, P_F(\xi = \xi_{\omega'} \,|\, \zeta = \zeta_\omega)\, g(x(\zeta_\omega), \xi_{\omega'})\\
&= \mathbb{E}_{\bar{F}}\big[\mathrm{CVaR}^\alpha_{\bar{F}}(g(x(\zeta), \xi) \,|\, \zeta)\big].
\end{aligned}$$

For $\alpha = 0$, the problem reduces to $\min_{x(\cdot) \in \mathcal{H}} \mathbb{E}_{\bar{F}}[h(x(\zeta), \xi)]$, effectively recovering the CSO policy. At the other end of the spectrum, for $\alpha = 1$ the problem reduces to $\min_{x(\cdot) \in \mathcal{H}} \mathbb{E}_{\bar{F}}[\max_{\omega':\, P_{\bar{F}}(\xi = \xi_{\omega'}|\zeta) > 0} h(x(\zeta), \xi_{\omega'})]$, which implies that for each realization of $\zeta_\omega$ the decision $x(\zeta_\omega)$ is robust against all admissible realizations of $\xi$ given $\zeta_\omega$. The nested CVaR representation and the full policy space of Assumption 3.2 can be exploited to optimize $Q(x(\cdot), \gamma)$.

**Proposition 3.6.** Under Assumption 3.5, problem (8) can be reformulated as

$$\begin{aligned}
\max_{\gamma}\;\; & \gamma & (10a)\\
\text{s.t.}\;\; & \sum_{\omega \in \Omega_\zeta} P_{\bar{F}}(\zeta = \zeta_\omega)\, \phi_\omega(\gamma) \le 0 & (10b)\\
& 0 \le \gamma \le 1, & (10c)
\end{aligned}$$

where $\phi_\omega(\gamma)$ is a non-decreasing function (when $\bar{x} \in \mathcal{X}$) capturing the optimal value of

$$\begin{aligned}
\min_{x \in \mathcal{X},\, t,\, s \ge 0}\;\; & t + \frac{1}{1-\alpha}\sum_{\omega' \in \Omega_\xi} P_{\bar{F}}(\xi = \xi_{\omega'} \,|\, \zeta = \zeta_\omega)\, s_{\omega'} & (11a)\\
\text{s.t.}\;\; & s_{\omega'} \ge h(x, \xi_{\omega'}) - \Big((1-\gamma)\, h(\bar{x}, \xi_{\omega'}) + \gamma \min_{x' \in \mathcal{X}} h(x', \xi_{\omega'})\Big) - t, \quad \forall \omega' \in \Omega_\xi, & (11b)
\end{aligned}$$

and can be reduced to a linear program when $\mathcal{X}$ is polyhedral and $h(x, \xi_{\omega'})$ is linear programming representable.

In practice, $\bar{F}$ is often composed of an empirical distribution $\hat{F}_\zeta$ and a trained conditional distribution $\hat{F}_{\xi|\zeta}$. Given an optimal solution $\gamma^\star$ to problem (10), one should then define the extended optimal policy $x^\star(\zeta)$ beyond $\{\zeta_\omega\}_{\omega \in \Omega_\zeta}$ using the optimal solution of problem (11) with $\hat{F}_{\xi|\zeta}$. This being said, whether problem (11) is reducible to a linear program or, more generally, a convex optimization model, the size of the overall reformulation scales with $|\Omega_\zeta| \cdot |\Omega_\xi|$, which can be computationally challenging. We therefore propose a decomposition algorithm to efficiently solve the problem.

Let $\psi(\gamma) := \sum_{\omega \in \Omega_\zeta} P_{\bar{F}}(\zeta = \zeta_\omega)\, \phi_\omega(\gamma)$. Using the definition of $\phi_\omega(\gamma)$, we observe that for a fixed $\gamma$ one can evaluate $\psi(\gamma)$ by solving $|\Omega_\zeta|$ distinct problems (11), one for each $\omega \in \Omega_\zeta$. Moreover, given that each $\phi_\omega(\gamma)$ is non-decreasing (see Proposition 3.6), one concludes that $\psi(\gamma)$ is non-decreasing. Hence, one can design a bisection algorithm on $\gamma$ to solve the DRPCR problem (8). Namely, each step consists in identifying the mid-point $\bar\gamma$ of an interval known to contain the optimal value $\gamma^\star$, and verifying whether $\bar\gamma$ is feasible by evaluating $\psi(\bar\gamma)$ in order to decide which of the two subintervals, below or above $\bar\gamma$, contains $\gamma^\star$; see Figure 2 (left) in Appendix B. The details of this algorithm are presented in Algorithm 1. Its efficiency relies on the difficulty of executing step 7, i.e., evaluating $\phi_\omega(\gamma)$ for each $\omega$. The following lemma provides formal guarantees regarding the convergence rate of Algorithm 1.

**Lemma 3.7.** Algorithm 1 terminates in $\lceil\log_2(1/\epsilon)\rceil$ iterations. Moreover, if $\mathcal{X}$ is polyhedral and $h(x, \xi)$ linear programming representable, the algorithm terminates in polynomial time with respect to $\log(1/\epsilon)$, $|\Omega_\zeta|$, $|\Omega_\xi|$, $n_x$, $n_\xi$, and the sizes of the LP representations of $\mathcal{X}$ and of $h(x, \xi)$.

Appendix B further proposes an accelerated bisection algorithm for the case when $\mathcal{X}$ is convex. Namely, it derives the subgradient of $\psi(\gamma)$ and exploits its convexity to tighten the interval for $\gamma^\star$ at each iteration.

**Algorithm 1** Bisection algorithm for DRPCR
1: **Input:** tolerance $\epsilon > 0$
2: Set $\gamma^- := 0$, $\gamma^+ := 1$
3: **while** $\gamma^+ - \gamma^- > \epsilon$ **do**
4: Set $\bar\gamma := (\gamma^+ + \gamma^-)/2$
5: // Solve $\min_{x(\cdot) \in \mathcal{H}} Q(x(\cdot), \bar\gamma)$ to get the optimal value $\psi(\bar\gamma)$
6: **for** $\omega \in \Omega_\zeta$ **do**
7: Solve problem (11) with $\omega$ and $\bar\gamma$ to get the optimal value $\phi_\omega(\bar\gamma)$
8: **end for**
9: **if** $\psi := \sum_{\omega \in \Omega_\zeta} P_{\bar{F}}(\zeta = \zeta_\omega)\, \phi_\omega(\bar\gamma) \le 0$ **then**
10: Set $\gamma^- := \bar\gamma$
11: **else**
12: Set $\gamma^+ := \bar\gamma$
13: **end if**
14: **end while**
15: **Return** $\gamma^\star := \gamma^-$
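As an illustration of Algorithm 1, the sketch below (our own, under the assumptions that $h(x,\xi) := \xi^\top x$ is linear and $\mathcal{X} = \{x \ge 0 : Ax \le b\}$ is polyhedral; all names are hypothetical) evaluates $\phi_\omega(\bar\gamma)$ by solving the linear program (11) with SciPy and wraps the bisection loop around an oracle for $\psi$:

```python
import numpy as np
from scipy.optimize import linprog

def phi_omega(gamma, xis, q_bar, alpha, A, b, x_bar, full_info):
    """Optimal value of subproblem (11) for one scenario zeta_omega,
    assuming h(x, xi) = xi @ x and X = {x >= 0 : A x <= b}."""
    m, n = xis.shape                        # m scenarios of xi, n decisions
    # target costs (1-gamma)*h(x_bar, xi) + gamma*min_x h(x, xi)
    c_target = (1 - gamma) * xis @ x_bar + gamma * full_info
    # variables z = (x, t, s): minimize t + q_bar @ s / (1 - alpha)
    cost = np.concatenate([np.zeros(n), [1.0], q_bar / (1 - alpha)])
    # xi' x - t - s_w <= c_target_w for every xi scenario
    A_ub = np.hstack([xis, -np.ones((m, 1)), -np.eye(m)])
    b_ub = c_target
    # feasibility constraints A x <= b of the set X
    A_ub = np.vstack([A_ub, np.hstack([A, np.zeros((A.shape[0], 1 + m))])])
    b_ub = np.concatenate([b_ub, b])
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * m
    return linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun

def bisection_drpcr(evaluate_psi, eps=1e-3):
    """Algorithm 1: bisection on gamma given an oracle for psi(gamma)."""
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if evaluate_psi(mid) <= 0:
            lo = mid          # mid is feasible: the optimal gamma is above
        else:
            hi = mid
    return lo
```

Here `full_info` collects the precomputed fully anticipative values $\min_{x' \in \mathcal{X}} \xi_{\omega'}^\top x'$, and $\psi(\bar\gamma)$ would be obtained by averaging `phi_omega` over $\omega \in \Omega_\zeta$ with weights $P_{\bar{F}}(\zeta = \zeta_\omega)$.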
### 3.3. Generalized nested ambiguity set $\mathcal{D}$

One can generalize the results of the previous section by considering a generalized version of the ambiguity set, formalized in the following assumption.

**Assumption 3.8.** For a discrete distribution $\bar{F}$, the distribution set $\mathcal{D}$ takes the form of the generalized nested ambiguity set with respect to $P_{\bar{F}}$, defined as

$$\mathcal{D}(\bar{F}, r_\zeta, r_\xi) := \left\{ F \in \mathcal{M}(\Omega_\zeta \times \Omega_\xi) \,\middle|\, \begin{array}{l} d_\zeta(F_\zeta, \bar{F}_\zeta) \le r_\zeta\\ d_\xi\big(F_{\xi|\zeta_\omega}, \bar{F}_{\xi|\zeta_\omega}\big) \le r_\xi \quad \forall \omega \in \Omega_\zeta \end{array}\right\}, \tag{12}$$

where $d_\zeta(F_\zeta, \bar{F}_\zeta)$ and $d_\xi(F_{\xi|\zeta_\omega}, \bar{F}_{\xi|\zeta_\omega})$ are two convex divergence measures, i.e., non-negative, convex in their first argument, and minimized when the two probability measures are equal, applied to the marginal distribution of $\zeta$ and the conditional distribution of $\xi$ given $\zeta$, respectively.

The structure of (12) allows one to control the ambiguity about both the marginal distribution of $\zeta$ and the conditional distributions of $\xi$, using the parameters $r_\zeta$ and $r_\xi$ to bound the respective maximum divergences. In particular, it reduces to the nested CVaR ambiguity set when using

$$d_\zeta(F_\zeta, \bar{F}_\zeta) := \inf\left\{ s \,\middle|\, P_{F_\zeta}(\zeta = \zeta_\omega) \le \frac{1}{1-s}\, P_{\bar{F}_\zeta}(\zeta = \zeta_\omega),\; \forall \omega \in \Omega_\zeta \right\},$$

$$d_\xi(F_{\xi|\zeta}, \bar{F}_{\xi|\zeta}) := \inf\left\{ s \,\middle|\, P_{F_{\xi|\zeta}}(\xi = \xi_{\omega'}) \le \frac{1}{1-s}\, P_{\bar{F}_{\xi|\zeta}}(\xi = \xi_{\omega'}),\; \forall \omega' \in \Omega_\xi \right\},$$

$r_\zeta := 0$, and $r_\xi := \alpha$.

In the following, to simplify the presentation, given that $\Omega_\zeta$ and $\Omega_\xi$ are finite, we let $p \in \mathbb{R}^{|\Omega_\zeta|}$ denote the vector of probabilities $p_\omega := P_F(\zeta = \zeta_\omega)$ and $q^\omega \in \mathbb{R}^{|\Omega_\xi|}$ the probabilities $q^\omega_{\omega'} := P_F(\xi = \xi_{\omega'} \,|\, \zeta = \zeta_\omega)$, and similarly for $\bar{p}$ and $\bar{q}^\omega$ to capture the same probabilities under $\bar{F}$. We will further abuse notation and denote $d_\zeta(p, \bar{p}) := d_\zeta(F_\zeta, \bar{F}_\zeta)$ and $d_\xi(q^\omega, \bar{q}^\omega) := d_\xi(F_{\xi|\zeta_\omega}, \bar{F}_{\xi|\zeta_\omega})$. The following proposition generalizes Proposition 3.6.

**Proposition 3.9.** Under Assumption 3.8, problem (8) can be reformulated as

$$\begin{aligned}
\max_\gamma\;\; & \gamma & (13a)\\
\text{s.t.}\;\; & \sup_{p \in \mathcal{Z}} \sum_{\omega \in \Omega_\zeta} p_\omega\, \phi_\omega(\gamma) \le 0 & (13b)\\
& 0 \le \gamma \le 1, & (13c)
\end{aligned}$$

where $\mathcal{Z} := \{p : p \ge 0,\, e^\top p = 1,\, d_\zeta(p, \bar{p}) \le r_\zeta\}$ and $\phi_\omega(\gamma)$ is a non-decreasing function (when $\bar{x} \in \mathcal{X}$) capturing the optimal value of:

$$\begin{aligned}
\min_{x,\, t,\, \alpha,\, s}\;\; & t + r_\xi\alpha + d^*(s, \alpha, \bar{q}^\omega) & (14a)\\
\text{s.t.}\;\; & s_{\omega'} \ge h(x, \xi_{\omega'}) - \Big((1-\gamma)\, h(\bar{x}, \xi_{\omega'}) + \gamma \min_{x' \in \mathcal{X}} h(x', \xi_{\omega'})\Big) - t, \quad \forall \omega' \in \Omega_\xi & (14b)\\
& x \in \mathcal{X},\; \alpha \ge 0, & (14c)
\end{aligned}$$

where $d^*(s, \alpha, \bar{q}^\omega) := \sup_q s^\top q - \alpha\, d_\xi(q, \bar{q}^\omega)$ is the perspective of the convex conjugate of $d_\xi(q, \bar{q}^\omega)$.

Algorithm 1 can be applied in the generalized setting with the simple modification that problem (11) in step 7 is replaced with the convex problem (14), and that step 9 must compute $\sup_{p \in \mathcal{Z}} \sum_{\omega \in \Omega_\zeta} p_\omega\, \phi_\omega(\gamma)$, which now requires solving a convex optimization problem.

## 4. Experiments

In this section, we present a numerical study that compares the performance of DRPCR against three other data-driven benchmark methods in order to evaluate its robustness to perturbations of the data-generating process. Specifically, we observe how these models react when one faces a distribution shift for $\xi$. In a vehicle routing problem with travel time uncertainty, this can be interpreted as a shift in the distribution of the travel times, for instance when a special event is happening in town. Alternatively, one can think of an inventory management problem where the manager faces a shift in the demand distribution, e.g., an unforeseen increase in demand for sanitizer during the first days of an epidemic.
In general, there are numerous reasons why distribution shift considerations might be needed, depending on the context. In this regard, we refer the reader to Schrouff et al. 2022 and Filos et al. 2020 for such considerations in healthcare and autonomous driving applications.

The application that we consider for our numerical experiments is a shortest path problem described in Kallus & Mao 2023. A directed graph is defined as $G = (\mathcal{V}, \mathcal{A})$, where $\mathcal{V}$ denotes the set of nodes and $\mathcal{A} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of arcs, i.e., ordered pairs $(i, j)$ of nodes describing the existence of a directed path from node $i$ to node $j$. The corresponding travel time of such an arc is assumed to be $\xi_{(i,j)}$. The objective of this problem is to identify the shortest path from an origin (node $o$) to a destination (node $d$). Moving away from an ideal world of known parameters gives rise to a stochastic version of this problem. In this setting, the travel times along the arcs $\xi \in \mathbb{R}^{|\mathcal{A}|}$ are uncertain; however, one might still have access to side information or observed covariates. In this case, aiming at minimizing the expected travel time leads to the following CSO problem:

$$x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \mathbb{E}_{\hat{F}_{\xi|\zeta}}[x^\top \xi], \tag{15}$$

where

$$\mathcal{X} := \left\{ x \in \{0,1\}^{|\mathcal{A}|} \,\middle|\, \begin{array}{l} \sum_{j:(o,j)\in\mathcal{A}} x_{(o,j)} - \sum_{j:(j,o)\in\mathcal{A}} x_{(j,o)} = 1\\ \sum_{j:(d,j)\in\mathcal{A}} x_{(d,j)} - \sum_{j:(j,d)\in\mathcal{A}} x_{(j,d)} = -1\\ \sum_{j:(i,j)\in\mathcal{A}} x_{(i,j)} - \sum_{j:(j,i)\in\mathcal{A}} x_{(j,i)} = 0 \quad \forall i \in \mathcal{V}\setminus\{o, d\} \end{array}\right\}$$

and $x_{(i,j)} = 1$ if we decide to travel from node $i$ to node $j$ and $x_{(i,j)} = 0$ otherwise. Unlike Kallus & Mao 2023, we enforce the integrality constraints. Furthermore, $\hat{F}_{\xi|\zeta}$ denotes the conditional distribution inferred from the training dataset.

As discussed in Section 1, DRCSO is a method proposed for robustifying the policies against distributional uncertainty in the data-driven context. Consequently, one can consider DRCSO, as an alternative to CSO, for solving this shortest path problem. Using the nested CVaR ambiguity set introduced in Assumption 3.5 as the ambiguity set of DRCSO, one gets the model below:

$$\text{(DRCSO)} \quad x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \sup_{F_{\xi|\zeta} \in \mathcal{D}(\hat{F}_{\xi|\zeta}, \alpha)} \mathbb{E}_{F_{\xi|\zeta}}[x^\top \xi], \tag{16}$$

where

$$\mathcal{D}(\hat{F}_{\xi|\zeta}, \alpha) := \left\{ F_{\xi|\zeta} \in \mathcal{M}(\Omega_\xi) : P_{F_{\xi|\zeta}}(\xi = \xi_{\omega'}) \le \frac{1}{1-\alpha}\, P_{\hat{F}_{\xi|\zeta}}(\xi = \xi_{\omega'}) \;\; \forall \omega' \in \Omega_\xi \right\}$$

and $\alpha$ is the control parameter for the size of the ambiguity set.

Staying in the DRCSO context, one can exploit a worst-case regret minimization approach instead of worst-case expected travel time. In our experiments, we look into the optimal solutions arising from an ex-post regret minimization setting, introduced as a $\Delta = 1$ regret minimization model in Poursoltani et al. 2023. This leads to the following distributionally robust contextual regret optimization (DRCRO) problem:

$$x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \sup_{F_{\xi|\zeta} \in \mathcal{D}(\hat{F}_{\xi|\zeta}, \alpha)} \mathbb{E}_{F_{\xi|\zeta}}\Big[x^\top \xi - \min_{x' \in \mathcal{X}} x'^\top \xi\Big]. \tag{17}$$

In this case, the decision maker compares her travel time to the one resulting from a benchmark decision that knows the future realization of $\xi$. The ultimate goal is to minimize the worst-case expectation of this gap, the so-called *worst-case expected regret*, where the ambiguity set is the nested CVaR one.

Finally, we solve our introduced DRPCR problem under the nested CVaR ambiguity set, where the $Q(x(\cdot), \gamma)$ function takes the form of:

$$Q(x(\cdot), \gamma) := \sup_{F \in \mathcal{D}(\bar{F}, \alpha)} \mathbb{E}_F\Big[x(\zeta)^\top \xi - \Big((1-\gamma)\, \hat{x}^\top \xi + \gamma \min_{x' \in \mathcal{X}} x'^\top \xi\Big)\Big], \tag{18}$$

where $\bar{F}$ denotes the distribution derived from the training dataset, composed of the empirical distribution $\hat{F}_\zeta$ of $\zeta$ and the inferred conditional distribution $\hat{F}_{\xi|\zeta}$, while $\hat{x} := \arg\min_x \mathbb{E}_{\hat{F}}[h(x, \xi)]$ with $\hat{F}$ the empirical distribution that puts equal weight on each observed data point $\{\xi_i\}_{i=1}^N$ (i.e., the SAA solution).
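For concreteness, a minimal gurobipy sketch (our illustration, not the authors' released implementation; data-handling names are hypothetical) of the DRCSO route choice (16) for a single observed context, using the CVaR epigraph form from Section 3.2:

```python
import gurobipy as gp
from gurobipy import GRB

def drcso_shortest_path(nodes, arcs, o, d, xi_scenarios, q_hat, alpha):
    """Worst-case expected travel time route, as in (16), for one observed
    context; xi_scenarios[w][a] is the travel time of arc a in scenario w and
    q_hat[w] the conditional probability inferred for this context."""
    m = gp.Model("drcso_sp")
    x = m.addVars(arcs, vtype=GRB.BINARY, name="x")
    t = m.addVar(lb=-GRB.INFINITY, name="t")             # CVaR threshold
    s = m.addVars(len(xi_scenarios), lb=0.0, name="s")   # CVaR excesses
    # flow conservation: one unit of flow leaves o and enters d
    for i in nodes:
        rhs = 1 if i == o else (-1 if i == d else 0)
        m.addConstr(gp.quicksum(x[a] for a in arcs if a[0] == i)
                    - gp.quicksum(x[a] for a in arcs if a[1] == i) == rhs)
    # epigraph of CVaR: s_w >= xi_w' x - t for every scenario
    for w, xi in enumerate(xi_scenarios):
        m.addConstr(s[w] >= gp.quicksum(xi[a] * x[a] for a in arcs) - t)
    m.setObjective(
        t + (1.0 / (1.0 - alpha)) * gp.quicksum(
            q_hat[w] * s[w] for w in range(len(xi_scenarios))),
        GRB.MINIMIZE)
    m.optimize()
    return [a for a in arcs if x[a].X > 0.5]
```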
Based on an optimal solution $\gamma^\star$ for the DRPCR problem, one can retrieve an optimal policy using:

$$x^\star(\zeta) \in \arg\min_{x \in \mathcal{X}} \sup_{F_{\xi|\zeta} \in \mathcal{D}(\hat{F}_{\xi|\zeta}, \alpha)} \mathbb{E}_{F_{\xi|\zeta}}\Big[x^\top \xi - \Big((1-\gamma^\star)\, \hat{x}^\top \xi + \gamma^\star \min_{x' \in \mathcal{X}} x'^\top \xi\Big)\Big], \tag{19}$$

which can be obtained by solving (11) with $\gamma^\star$ and replacing $P_{\bar{F}}(\xi = \xi_{\omega'} \,|\, \zeta = \zeta_\omega)$ with $P_{\hat{F}_{\xi|\zeta}}(\xi_{\omega'})$.

We adapt our numerical experiments to the graph structure $G$ employed in Kallus & Mao 2023 with the same origin $o$ and destination $d$; we therefore study a graph with 45 nodes ($|\mathcal{V}| = 45$) and 97 arcs ($|\mathcal{A}| = 97$). We assume there exist 200 covariates ($n_\zeta = 200$) and that the vector composed of travel times $\xi$ and covariates $\zeta$ follows a multivariate normal distribution. Specifically, each covariate $\zeta_i$ follows a normal distribution with a mean of zero and a standard deviation of one (i.e., $\zeta_i \sim N(0, 1)$). Similarly, each travel time $\xi_{(i,j)}$ is normal with a standard deviation that matches the deviation present in Kallus & Mao 2023's dataset, yet both the correlation and the mean vector are treated differently. Starting with the correlation, we introduce a new correlation structure for $(\zeta, \xi)$ by instantiating a random correlation matrix (see Appendix D for details); this was done after observing that, with Kallus & Mao 2023's dataset, the optimal uninformed decisions produced nearly the same performance as the optimal hindsight decisions that exploited full information about realized travel times.

Our treatment of the mean of $\xi$ embodies our objective to study robustness to distribution shifts. Namely, while the data-generating process for the training set employs the same mean vector as in Kallus & Mao 2023, our validation dataset and out-of-sample test set measure the performance of the proposed policies on generating processes where the mean of $\xi$ has been perturbed, i.e., $\mathbb{E}[\xi_{(i,j)}] := (1 + \delta_{(i,j)})\, \mu_{(i,j)}$. Six tests were conducted for different levels of mean perturbation: no distribution shift ($\delta_{(i,j)} = 0$), along with tests with $\delta_{(i,j)}$ generated i.i.d. according to a uniform distribution on $[0\%, m]$, where $m \in M := \{20\%, 30\%, 40\%, 50\%, 60\%\}$ represents the maximum possible perturbation. Furthermore, the perturbation experienced in the validation set is independent of that of the test set. This is meant to simulate situations where the level of robustness is calibrated on a dataset experiencing a distribution shift of similar size to the shift experienced out-of-sample.

Experiments for each perturbation range contain 50 instances generated by resampling the training, validation, and test datasets. Both the training and validation datasets consist of 400 data points, while the test set contains 1000 data points and is used to measure the out-of-sample performance. The training dataset is used for learning purposes, which allows us to infer the conditional probabilities of $\hat{F}_{\xi|\zeta}$ once a new covariate vector $\zeta$ is observed. From a wide range of existing predictive tools for inference of $\hat{F}_{\xi|\zeta}$, Bertsimas & Kallus 2020 compare methods such as k-nearest-neighbors regression (Hastie et al. 2001), local linear regression (Cleveland & Devlin 1988), classification and regression trees (CART; Breiman et al. 1984), and random forests (RF; Breiman 2001). In their experiments, the best coefficient of prescriptiveness belongs to random forests. We exploit the code provided in Kallus & Mao 2023 to train random forests over our training datasets and then use them as the conditional distribution estimator $\hat{F}_{\xi|\zeta}$ for our validation and out-of-sample data points. The validation dataset is used to calibrate the size of the ambiguity set ($\alpha$) for the DRCSO, DRCRO, and DRPCR models.
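A condensed sketch of this data-generating process (our own reading of the setup and of Appendix D; parameter names are hypothetical) could look as follows, with `max_shift = 0` for the training set and `max_shift = m` for the shifted validation and test sets:

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

def generate_dataset(n_zeta, sigma_xi, mu_xi, max_shift, n_samples, seed=0):
    """Sample (zeta, xi) jointly normal: standard normal covariates, travel
    times with std sigma_xi and shifted means (1 + delta) * mu_xi, with a
    random correlation structure as in Appendix D (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_xi = len(mu_xi)
    dim = n_zeta + n_xi
    spd = make_spd_matrix(dim, random_state=seed)      # Algorithm 4
    d = np.sqrt(np.diag(spd))
    corr = spd / np.outer(d, d)                        # SPD -> correlation
    stds = np.concatenate([np.ones(n_zeta), sigma_xi])
    cov = corr * np.outer(stds, stds)                  # Algorithm 5
    delta = rng.uniform(0.0, max_shift, size=n_xi)     # i.i.d. mean shifts
    mean = np.concatenate([np.zeros(n_zeta), (1.0 + delta) * mu_xi])
    sample = rng.multivariate_normal(mean, cov, size=n_samples)
    return sample[:, :n_zeta], sample[:, n_zeta:]      # zeta, xi
```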
The procedure for calibrating $\alpha$ and the associated optimal $\gamma^\star$ for the DRPCR model, and for calibrating $\alpha$ for the DRCSO and DRCRO models, is described in Appendix C (see Algorithms 2 and 3, respectively). We define the set of discretized $\alpha$ values as $A := A_1 \cup A_2$, where $A_1$ includes 20 logarithmically spaced values in $[0.01, 0.99]$ and $A_2$ includes 20 evenly spaced values in $[0, 1)$. For CSO, Algorithm 3 can also be used with $A = \{0\}$. From a computational point of view, the training of the DRPCR algorithm took on average less than 36 minutes per instance, compared to closer to 3 minutes for DRCSO and DRCRO. The main difference comes from the extra Step 6 of Algorithm 2, which requires solving the DRPCR problem for each candidate value of $\alpha$; this step took on average 50 seconds and needs to be repeated for all $\alpha \in A$. Once the optimal $\alpha^\star$ and $\gamma^\star$ are determined by Algorithm 2 for a given training/validation dataset, equation (19) provides the optimal policy $x^\star(\cdot)$ for any covariate $\zeta$ received in real time. Similarly, equations (16) and (17) can be employed to derive the optimal DRCSO and DRCRO policies associated with the real-time input of $\zeta$, relying on the calibrated values of $\alpha$ obtained from Algorithm 3. All optimization problems are implemented in Python and solved using Gurobi 8.1.1 on a machine featuring an Intel Xeon(R) CPU E5-2687W v3 @ 3.10 GHz (2 processors) and 128 GB of RAM. The code used for the numerical experiments is available at https://github.com/erickdelage/robust_prescriptive_opt.

Figure 1(a) reports the coefficients of prescriptiveness $V_{F'}(x^\star(\cdot), \hat{x})$, where $F'$ is the empirical distribution of the test dataset, for the four policies and perturbation levels. More details on the average out-of-sample performance are also presented in Table 2 in Appendix E.

*Figure 1. Shortest path problem: (a) statistics of the out-of-sample coefficient of prescriptiveness (lower values indicate worse performance); (b) statistics of $\mathbb{E}_{F'}[\|x^\star(\zeta) - \hat{x}\|_1]$, where $F'$ is the out-of-sample distribution (lower values reflect a closer proximity to the SAA solution).*

We observe the following: (i) When considering a particular optimization model, the coefficient of prescriptiveness decreases as the magnitude of the distribution shift increases. Indeed, these policies face a more serious robustness challenge as they approach more extreme scenarios beyond what was seen in the training dataset. (ii) When the test set follows the same distribution as the training set, all four policies demonstrate roughly similar performance; however, when this set experiences a distribution shift, DRPCR policies differentiate their performance compared to the alternative ones. (iii) Imposing a more severe distribution shift accentuates this differentiation. For instance, when the mean travel times across the edges are perturbed by up to 50% in the test set, DRPCR policies provide a positive coefficient of prescriptiveness in at least 75% of instances; on the contrary, the alternative policies fail to reach a positive ratio in almost as many instances. This observation is further amplified in the case of 60% perturbation.
In this scenario, while the CSO, DRCSO, and DRCRO policies fail to return a positive out-of-sample coefficient of prescriptiveness, DRPCR can still reach a positive median of 4%, which can go up to 18% at its best.

Figure 1(b) depicts the statistics of the 1-norm distance of the adaptive policies from the optimal SAA solution. This illustration elucidates the reason behind the superior performance of the DRPCR method compared to the others. Indeed, it is notable that DRPCR implicitly utilizes $\bar{x}$ as an anchor for the adaptive policy. In other words, under large distribution shifts, it is able to learn in which contexts it is worth staying closer to $\bar{x}$, where the relative regret is zero, as explained below Lemma 3.1. This phenomenon highlights how DRPCR applies a completely different form of regularization compared to prior DRO models. For further insights, readers are directed to Appendix F, where an additional set of experiments is presented. In line with the approach in Kallus & Mao 2023, the integrality constraint on $x(\cdot)$ is relaxed in this supplementary investigation. One can refer to Table 1 for a comparison of the runtimes of Algorithms 2 and 3 under both relaxed and binary policies.

*Table 1. Average runtime (minutes) per instance; columns report the level of mean perturbation $m$.*

| Method | Type of problem | Algorithm | 0% | 20% | 30% | 40% | 50% | 60% |
|--------|-----------------|-----------|------|------|------|------|------|------|
| DRCSO | relaxed $x(\cdot)$ | 3 | 3.08 | 2.38 | 3.06 | 3.14 | 2.38 | 2.38 |
| DRCRO | relaxed $x(\cdot)$ | 3 | 2.96 | 2.28 | 2.96 | 3.02 | 2.28 | 2.28 |
| DRPCR | relaxed $x(\cdot)$ | 2 | 32.76 | 25.22 | 32.86 | 33.58 | 25.28 | 25.22 |
| DRCSO | binary $x(\cdot)$ | 3 | 3.08 | 2.64 | 3.04 | 3.42 | 3.02 | 3.40 |
| DRCRO | binary $x(\cdot)$ | 3 | 3.32 | 3.42 | 3.20 | 3.44 | 3.20 | 3.44 |
| DRPCR | binary $x(\cdot)$ | 2 | 34.28 | 38.08 | 33.32 | 37.92 | 33.14 | 37.92 |

## 5. Conclusion

The proposed DRPCR model offers an innovative method for calibrating contextual optimization problems and introduces a unique form of regularization, differing from previous DRO models and achieving significantly improved out-of-sample performance. Unfortunately, in its current form, the approach requires considerable training time, i.e., approximately tenfold that of the DRCSO and DRCRO models. This is due to the bisection algorithm needing $\lceil\log_2(1/\epsilon)\rceil$ steps to converge. In order to improve tractability (at the expense of optimality), one might consider limiting the admissible policies $x(\cdot)$ to those with affine dependence on the side information $\zeta$. The resulting DRPCR takes the form of a smaller optimization problem with size proportional to the number of dimensions of $\zeta$ rather than $|\Omega_\zeta|$. Affine policies might also facilitate the use of more general non-nested ambiguity sets, which constitute a current limitation of the proposed DRPCR model. Finally, we expect that additional empirical evaluation with other data-generating and application environments would certainly benefit our understanding of the value of the presented DRPCR approach.

## Impact Statement

While we expect this work to contribute in the long term to many potential societal consequences, none are direct consequences of the results presented in this paper.

## Acknowledgments

The authors gratefully acknowledge support from the Institut de Valorisation des Données (IVADO), from the Canadian Natural Sciences and Engineering Research Council [RGPIN-2022-05261], and from the Canada Research Chair program [CRC-2018-00105].

## References

Ban, G.-Y. and Rudin, C. The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1):90-108, 2019.
Bazier-Matte, T. and Delage, E. Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, 58(2):374-401, 2020.

Bertsimas, D. and Kallus, N. From predictive to prescriptive analytics. Management Science, 66(3):1025-1044, 2020.

Bertsimas, D. and Van Parys, B. Bootstrap robust prescriptive analytics. Mathematical Programming, 195(1-2):39-78, 2022.

Bertsimas, D., Kallus, N., and Hussain, A. Inventory management in the era of big data. Production and Operations Management, 25(12):2006-2009, 2016.

Birge, J. R. and Louveaux, F. Introduction to Stochastic Programming. Springer Science & Business Media, 2011.

Brandt, M. W., Santa-Clara, P., and Valkanov, R. Parametric portfolio policies: Exploiting characteristics in the cross-section of equity returns. The Review of Financial Studies, 22(9):3411-3447, 2009.

Breiman, L. Random forests. Machine Learning, 45:5-32, 2001.

Breiman, L., Friedman, J., Stone, C., and Olshen, R. Classification and Regression Trees. Wadsworth, 1984.

Campbell, J. Y. and Thompson, S. B. Predicting excess stock returns out of sample: Can anything beat the historical average? The Review of Financial Studies, 21(4):1509-1531, 2008.

Cleveland, W. S. and Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596-610, 1988.

Detlefsen, K. and Scandolo, G. Conditional and dynamic convex risk measures. Finance and Stochastics, 9(4):539-561, 2005.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses for latent covariate mixtures. CoRR, abs/2007.13982, 2020.

Esteban-Pérez, A. and Morales, J. M. Distributionally robust stochastic programs with side information based on trimmings. Mathematical Programming, 195(1):1069-1105, 2022.

Filos, A., Tigkas, P., McAllister, R., Rhinehart, N., Levine, S., and Gal, Y. Can autonomous vehicles identify, recover from, and adapt to distribution shifts? In International Conference on Machine Learning, pp. 3145-3153. PMLR, 2020.

Grötschel, M., Lovász, L., and Schrijver, A. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169-197, 1981.

Gupta, V. and Kallus, N. Data pooling in stochastic optimization. Management Science, 68(3):1595-1615, 2022.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

Kallus, N. and Mao, X. Stochastic optimization forests. Management Science, 69(4):1975-1994, 2023.

Kannan, R., Bayraksan, G., and Luedtke, J. R. Residuals-based distributionally robust optimization with covariate information. arXiv preprint arXiv:2012.01088, 2020.

Karmarkar, N. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4):373-395, 1984.

Lam, H. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090-1105, 2019.

Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115-166, 2018.

Nguyen, V. A., Zhang, F., Blanchet, J., Delage, E., and Ye, Y. Robustifying conditional portfolio decisions via optimal transport. arXiv preprint arXiv:2103.16451, 2022.
Notz, P. M. and Pibernik, R. Prescriptive analytics for flexible capacity management. Management Science, 68(3):1756-1775, 2022. doi: 10.1287/mnsc.2020.3867.

Poursoltani, M., Delage, E., and Georghiou, A. Risk-averse regret minimization in multistage stochastic programs. Operations Research, https://doi.org/10.1287/opre.2022.2429, 2023.

Riedel, F. Dynamic coherent risk measures. Stochastic Processes and their Applications, 112(2):185-200, 2004.

Ruszczyński, A. and Shapiro, A. Conditional risk mappings. Mathematics of Operations Research, 31(3):544-561, 2006.

Schrouff, J., Harris, N., Koyejo, O., Alabdulmohsin, I., Schnider, E., Opsahl-Ong, K., Brown, A., Roy, S., Mincu, D., Chen, C., et al. Maintaining fairness across distribution shift: Do we have viable solutions for real-world applications? arXiv preprint arXiv:2202.01034, 2022.

Shapiro, A. Interchangeability principle and dynamic equations in risk averse stochastic programming. Operations Research Letters, 45(4):377-381, 2017.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2021.

Sion, M. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, 1958.

Srivastava, P. R., Wang, Y., Hanasusanto, G. A., and Ho, C. P. On data-driven prescriptive analytics with side information: A regularized Nadaraya-Watson approach. arXiv preprint arXiv:2110.04855, 2021.

Stratigakos, A., Camal, S., Michiorri, A., and Kariniotakis, G. Prescriptive trees for integrated forecasting and optimization applied in trading of renewable energy. IEEE Transactions on Power Systems, 37(6):4696-4708, 2022.

Van Parys, B. P. G., Esfahani, P. M., and Kuhn, D. From data to decisions: Distributionally robust optimization is optimal. Management Science, 67(6):3387-3402, 2021.

## A. Proofs

### A.1. Proof of Lemma 3.1

This follows simply from $V_F(x(\cdot), \bar{x})$ being bounded above by 1 for all policies $x(\cdot)$ and all distributions $F$, due to:

$$V_F(x(\cdot), \bar{x}) = 1 - \frac{\mathbb{E}_F[h(x(\zeta),\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]}{\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]} \le 1 - \frac{\mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]}{\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]} = 1$$

when $\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] > 0$, and otherwise equal to $1$ or $-\infty$, both bounded above by 1. Hence,

$$\max_{x(\cdot)\in\mathcal{H}} \inf_{F\in\mathcal{D}} V_F(x(\cdot), \bar{x}) \le 1.$$

Moreover, if $\bar{x} \in \mathcal{H}$, then we have that

$$\max_{x(\cdot)\in\mathcal{H}} V_\mathcal{D}(x(\cdot), \bar{x}) \ge V_\mathcal{D}(\bar{x}, \bar{x}) = \begin{cases} 0 & \text{if } \mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] > 0 \text{ for some } F\in\mathcal{D}\\ 1 & \text{otherwise.}\end{cases}$$

### A.2. Proof of Lemma 3.3

Let $\tilde{x}(\cdot)$ be a CSO optimal policy; then necessarily $\tilde{x}(\cdot) \in \mathcal{H}$, since $\tilde{x}(\zeta) \in \mathcal{X}$ for all $\zeta$. This confirms that $\tilde{x}(\cdot)$ is feasible in DRPCR. Next, we can demonstrate optimality through:

$$V_\mathcal{D}(\tilde{x}(\cdot), \bar{x}) = V_{\bar{F}}(\tilde{x}(\cdot), \bar{x}) \ge \max_{x(\cdot)\in\mathcal{H}} V_{\bar{F}}(x(\cdot), \bar{x}) = \max_{x(\cdot)\in\mathcal{H}} V_\mathcal{D}(x(\cdot), \bar{x}),$$

since for all $x(\cdot) \in \mathcal{H}$ we have that $\mathbb{E}_{\bar{F}}[h(x(\zeta),\xi)|\zeta] \ge \min_{x(\cdot)\in\mathcal{H}} \mathbb{E}_{\bar{F}}[h(x(\zeta),\xi)|\zeta] = \mathbb{E}_{\bar{F}}[h(\tilde{x}(\zeta),\xi)|\zeta]$ for all $\zeta$, which we can show implies that $V_{\bar{F}}(\tilde{x}(\cdot), \bar{x}) \ge V_{\bar{F}}(x(\cdot), \bar{x})$. More specifically, if $\mathbb{E}_{\bar{F}}[h(\bar{x},\xi)] = \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]$, then either $\mathbb{E}_{\bar{F}}[h(x(\zeta),\xi)] = \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]$, thus

$$\mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)] \le \mathbb{E}_{\bar{F}}[h(\tilde{x}(\zeta),\xi)] \le \mathbb{E}_{\bar{F}}[h(x(\zeta),\xi)] = \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)],$$

meaning that $V_{\bar{F}}(\tilde{x}(\cdot), \bar{x}) = V_{\bar{F}}(x(\cdot), \bar{x}) = 1$, or $\mathbb{E}_{\bar{F}}[h(x(\zeta),\xi)] > \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]$, thus $V_{\bar{F}}(\tilde{x}(\cdot), \bar{x}) \ge -\infty = V_{\bar{F}}(x(\cdot), \bar{x})$. Alternatively, the case where $\mathbb{E}_{\bar{F}}[h(\bar{x},\xi)] > \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]$ is straightforward, as the function

$$f(y) := 1 - \frac{y - \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]}{\mathbb{E}_{\bar{F}}[h(\bar{x},\xi)] - \mathbb{E}_{\bar{F}}[\min_{x'\in\mathcal{X}} h(x',\xi)]}$$

is strictly decreasing.

### A.3. Proof of Proposition 3.4

We first present the DRPCR in epigraph form:
$$\begin{aligned}
\max_{\gamma,\, x(\cdot)\in\mathcal{H}}\;\; & \gamma & (20a)\\
\text{s.t.}\;\; & V_F(x(\cdot), \bar{x}) \ge \gamma, \quad \forall F\in\mathcal{D} & (20b)\\
& 0\le\gamma\le 1, & (20c)
\end{aligned}$$

where we added the redundant constraint $\gamma\in[0,1]$ since Lemma 3.1 ensures that the optimal value of DRPCR is in this interval. Focusing on constraint (20b), we can then consider two cases for the definition of $V_F(x(\cdot), \bar{x})$. In the case that $\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] > 0$, one can multiply both sides of the inequality by this quantity to equivalently obtain:

$$\mathbb{E}_F[h(x(\zeta),\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] \le (1-\gamma)\Big(\mathbb{E}_F[h(\bar{x},\xi)] - \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]\Big),$$

which is equivalent, when rearranging the terms, to:

$$\mathbb{E}_F\Big[h(x(\zeta),\xi) - (1-\gamma)\, h(\bar{x},\xi) - \gamma \min_{x'\in\mathcal{X}} h(x',\xi)\Big] \le 0. \tag{21}$$

In the second case, where $\mathbb{E}_F[h(\bar{x},\xi)] = \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)]$, constraint (20b) is equivalent to:

$$\mathbb{E}_F[h(x(\zeta),\xi)] = \mathbb{E}_F[\min_{x'\in\mathcal{X}} h(x',\xi)] \;\;\&\;\; \gamma\le 1,$$

yet $\gamma\le 1$ is redundant, while the former condition can equivalently be posed as (21). We are left with

$$\mathbb{E}_F\Big[h(x(\zeta),\xi) - (1-\gamma)\, h(\bar{x},\xi) - \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\Big] \le 0, \quad \forall F\in\mathcal{D},$$

which can equivalently be described by $Q(x(\cdot),\gamma)\le 0$. One can further conclude that $Q(x(\cdot),\gamma)$ is convex and non-decreasing in $\gamma$, given that it is the supremum of a set of affine non-decreasing functions:

$$Q(x(\cdot),\gamma) = \sup_{F\in\mathcal{D}} \mathbb{E}_F[h(x(\zeta),\xi) - h(\bar{x},\xi)] + \gamma\,\mathbb{E}_F\Big[h(\bar{x},\xi) - \min_{x'\in\mathcal{X}} h(x',\xi)\Big],$$

with $h(\bar{x},\xi) \ge \min_{x'\in\mathcal{X}} h(x',\xi)$ for all $\xi$ since $\bar{x}\in\mathcal{X}$.

### A.4. Proof of Proposition 3.6

Letting $g(x,\xi,\gamma) := h(x,\xi) - ((1-\gamma)\, h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi))$, we have that

$$\begin{aligned}
\psi(\gamma) := \min_{x(\cdot)\in\mathcal{H}} Q(x(\cdot),\gamma) &= \min_{x(\cdot)\in\mathcal{H}} \sup_{F\in\mathcal{D}(\bar{F},\alpha)} \mathbb{E}_F[g(x(\zeta),\xi,\gamma)]\\
&= \min_{x(\cdot)\in\mathcal{H}} \mathbb{E}_{\bar{F}}\big[\mathrm{CVaR}^\alpha_{\bar{F}}(g(x(\zeta),\xi,\gamma)\,|\,\zeta)\big]\\
&= \min_{x(\cdot)\in\mathcal{H}} \mathbb{E}_{\bar{F}}\Big[\inf_t\; t + \tfrac{1}{1-\alpha}\,\mathbb{E}_{\bar{F}}\big[\max(0,\, g(x(\zeta),\xi,\gamma) - t)\,\big|\,\zeta\big]\Big]\\
&= \mathbb{E}_{\bar{F}}\Big[\inf_{x\in\mathcal{X},\,t}\; t + \tfrac{1}{1-\alpha}\,\mathbb{E}_{\bar{F}}\big[\max(0,\, g(x,\xi,\gamma) - t)\,\big|\,\zeta\big]\Big],
\end{aligned}$$

where we exploit the infimum representation of CVaR and the interchangeability property of expected value operators (see Shapiro 2017 and references therein). Given that $\bar{F}$ is a discrete distribution as described in Assumption 3.5, one can compute $\psi(\gamma)$ by solving problem (11) for each scenario $\zeta_\omega$ with $\omega\in\Omega_\zeta$. Based on the solution of problem (11) for each $\omega\in\Omega_\zeta$, one can obtain $\psi(\gamma) := \sum_{\omega\in\Omega_\zeta} P_{\bar{F}}(\zeta=\zeta_\omega)\,\phi_\omega(\gamma)$, together with a potentially feasible policy $x(\zeta) := x^*_{\omega(\zeta)}$, where $\omega(\zeta) = \arg\min_{\omega\in\Omega_\zeta} \|\zeta - \zeta_\omega\|$ and $x^*_\omega$ refers to the minimizer of problem (11).

The function $\phi_\omega(\gamma)$ is non-decreasing in $\gamma$ since $\gamma$ only appears in constraint (11b), which can be rewritten as:

$$s_{\omega'} \ge h(x,\xi_{\omega'}) - h(\bar{x},\xi_{\omega'}) + \Big(h(\bar{x},\xi_{\omega'}) - \min_{x'\in\mathcal{X}} h(x',\xi_{\omega'})\Big)\gamma - t, \quad \forall\omega'\in\Omega_\xi.$$

Since the right-hand side of this constraint is non-decreasing in $\gamma$, due to $h(\bar{x},\xi_{\omega'}) \ge \min_{x'\in\mathcal{X}} h(x',\xi_{\omega'})$, one concludes that the minimum of (11) cannot decrease when $\gamma$ is increased, since the feasible set is reduced.

We further note that problem (11) can be reduced to a linear program when $\mathcal{X}$ is polyhedral and $h(x,\xi_{\omega'})$ is linear programming representable for all $\omega'\in\Omega_\xi$. For example, in the context of a portfolio optimization problem, where $\mathcal{X}$ is the probability simplex and $h(x,\xi) := \xi^\top x$, problem (8) reduces to:

$$\begin{aligned}
\max_{\gamma,\{x^\omega,t^\omega,s^\omega\}_{\omega\in\Omega_\zeta}}\;\; & \gamma\\
\text{s.t.}\;\; & \sum_{\omega\in\Omega_\zeta} P_{\bar{F}}(\zeta=\zeta_\omega)\Big(t^\omega + \tfrac{1}{1-\alpha}\sum_{\omega'\in\Omega_\xi} P_{\bar{F}}(\xi=\xi_{\omega'}|\zeta=\zeta_\omega)\, s^\omega_{\omega'}\Big) \le 0\\
& s^\omega_{\omega'} \ge \xi_{\omega'}^\top x^\omega - (1-\gamma)\,\xi_{\omega'}^\top\bar{x} - \gamma\min_{x'\in\mathcal{X}}\xi_{\omega'}^\top x' - t^\omega, \quad \forall\omega'\in\Omega_\xi,\,\omega\in\Omega_\zeta\\
& s^\omega\ge 0, \quad \forall\omega\in\Omega_\zeta\\
& \textstyle\sum_{i=1}^{n_x} x^\omega_i = 1,\;\; x^\omega\ge 0, \quad \forall\omega\in\Omega_\zeta\\
& 0\le\gamma\le 1.
\end{aligned}$$
Alternatively, in the context of a shortest path problem (see Section 4 for details), problem (8) reduces to a mixed-integer linear program:

$$\begin{aligned}
\max_{\gamma,\{x^\omega,t^\omega,s^\omega\}_{\omega\in\Omega_\zeta}}\;\; & \gamma\\
\text{s.t.}\;\; & \sum_{\omega\in\Omega_\zeta} P_{\bar{F}}(\zeta=\zeta_\omega)\Big(t^\omega + \tfrac{1}{1-\alpha}\sum_{\omega'\in\Omega_\xi} P_{\bar{F}}(\xi=\xi_{\omega'}|\zeta=\zeta_\omega)\, s^\omega_{\omega'}\Big) \le 0\\
& s^\omega_{\omega'} \ge \xi_{\omega'}^\top x^\omega - (1-\gamma)\,\xi_{\omega'}^\top\bar{x} - \gamma\min_{x'\in\mathcal{X}}\xi_{\omega'}^\top x' - t^\omega, \quad \forall\omega'\in\Omega_\xi,\,\omega\in\Omega_\zeta\\
& s^\omega\ge 0, \quad \forall\omega\in\Omega_\zeta\\
& \textstyle\sum_{j:(o,j)\in\mathcal{A}} x^\omega_{(o,j)} - \sum_{j:(j,o)\in\mathcal{A}} x^\omega_{(j,o)} = 1, \quad \forall\omega\in\Omega_\zeta\\
& \textstyle\sum_{j:(d,j)\in\mathcal{A}} x^\omega_{(d,j)} - \sum_{j:(j,d)\in\mathcal{A}} x^\omega_{(j,d)} = -1, \quad \forall\omega\in\Omega_\zeta\\
& \textstyle\sum_{j:(i,j)\in\mathcal{A}} x^\omega_{(i,j)} - \sum_{j:(j,i)\in\mathcal{A}} x^\omega_{(j,i)} = 0, \quad \forall i\in\mathcal{V}\setminus\{o,d\},\,\omega\in\Omega_\zeta\\
& x^\omega_{(i,j)}\in\{0,1\}, \quad \forall (i,j)\in\mathcal{A},\,\omega\in\Omega_\zeta\\
& 0\le\gamma\le 1.
\end{aligned}$$

### A.5. Proof of Lemma 3.7

One can first easily verify that, in Algorithm 1, $\Delta := \gamma^+ - \gamma^-$ is initially equal to 1 and reduces by a factor of 2 at every iteration. The algorithm therefore necessarily terminates after $\lceil\log_2(1/\epsilon)\rceil$ iterations. When $\mathcal{X}$ is polyhedral and $h(x,\xi)$ is linear programming representable, problem (11) reduces to a linear program that can be solved in polynomial time with respect to $|\Omega_\xi|$, $n_x$, $n_\xi$, and the sizes of the LP representations of $\mathcal{X}$ and of $h(x,\xi)$ (see Grötschel et al. 1981 and Karmarkar 1984). Given that this problem is solved $|\Omega_\zeta|$ times at each iteration of the algorithm, we conclude that the total run time of the algorithm is polynomial with respect to all of these quantities.

### A.6. Proof of Proposition 3.9

The proof follows similar steps as Proposition 3.6. Letting $g(x,\xi,\gamma) := h(x,\xi) - ((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi))$, we have that

$$\begin{aligned}
\psi(\gamma) &= \min_{x(\cdot)\in\mathcal{H}} \sup_{F\in\mathcal{D}(\bar{F}, r_\zeta, r_\xi)} \mathbb{E}_F[g(x(\zeta),\xi,\gamma)]\\
&= \min_{x(\cdot)\in\mathcal{H}} \sup_{p:\, p\ge 0,\,\sum_\omega p_\omega=1,\, d_\zeta(p,\bar{p})\le r_\zeta}\; \sum_{\omega\in\Omega_\zeta} p_\omega \sup_{q:\, q\ge0,\,\sum_{\omega'} q_{\omega'}=1,\, d_\xi(q,\bar{q}^\omega)\le r_\xi}\; \sum_{\omega'\in\Omega_\xi} q_{\omega'}\, g(x(\zeta_\omega),\xi_{\omega'},\gamma)\\
&= \sup_{p:\, p\ge 0,\,\sum_\omega p_\omega=1,\, d_\zeta(p,\bar{p})\le r_\zeta}\; \sum_{\omega\in\Omega_\zeta} p_\omega \min_{x\in\mathcal{X}} \sup_{q:\, q\ge0,\,\sum_{\omega'} q_{\omega'}=1,\, d_\xi(q,\bar{q}^\omega)\le r_\xi}\; \sum_{\omega'\in\Omega_\xi} q_{\omega'}\, g(x,\xi_{\omega'},\gamma)\\
&= \sup_{p:\, p\ge 0,\,\sum_\omega p_\omega=1,\, d_\zeta(p,\bar{p})\le r_\zeta}\; \sum_{\omega\in\Omega_\zeta} p_\omega\, \phi_\omega(\gamma), \qquad (22)
\end{aligned}$$

where

$$\phi_\omega(\gamma) := \min_{x\in\mathcal{X}} \sup_{q:\, q\ge0,\,\sum_{\omega'} q_{\omega'}=1,\, d_\xi(q,\bar{q}^\omega)\le r_\xi} \sum_{\omega'\in\Omega_\xi} q_{\omega'}\, g(x,\xi_{\omega'},\gamma). \tag{23}$$

Denoting $g\in\mathbb{R}^{|\Omega_\xi|}$ with $g_{\omega'} := g(x,\xi_{\omega'},\gamma)$, one can derive a reformulation of the inner supremum in (23) as an infimum following:

$$\begin{aligned}
\sup_{q:\, q\ge0,\,\sum_{\omega'} q_{\omega'}=1,\, d_\xi(q,\bar{q}^\omega)\le r_\xi} q^\top g &= \sup_q \inf_{\lambda\ge0,\,t,\,\alpha\ge0}\; q^\top g + \lambda^\top q + t(1 - \mathbf{1}^\top q) + \alpha(r_\xi - d_\xi(q,\bar{q}^\omega))\\
&= \inf_{\lambda\ge0,\,t,\,\alpha\ge0} \sup_q\; q^\top g + \lambda^\top q + t(1 - \mathbf{1}^\top q) + \alpha(r_\xi - d_\xi(q,\bar{q}^\omega))\\
&= \inf_{\lambda\ge0,\,t,\,\alpha\ge0}\; t + r_\xi\alpha + d^*(g + \lambda - t\mathbf{1},\, \alpha,\, \bar{q}^\omega),
\end{aligned}$$

where $d^*(v,\alpha,\bar{q}^\omega) := \sup_q v^\top q - \alpha\, d_\xi(q,\bar{q}^\omega)$ is the perspective of the convex conjugate of $d_\xi(q,\bar{q}^\omega)$. By joining this reformulation with the minimization in $x$, we get

$$\begin{aligned}
\min_{x\in\mathcal{X},\,t,\,\alpha\ge0,\,s}\;\; & t + r_\xi\alpha + d^*(s,\alpha,\bar{q}^\omega) & (24a)\\
\text{s.t.}\;\; & s_{\omega'} \ge g(x,\xi_{\omega'},\gamma) - t, \quad \forall\omega'\in\Omega_\xi, & (24b)
\end{aligned}$$

which concludes the proof.

## B. Acceleration strategy for Algorithm 1

One can possibly accelerate the convergence of the bisection Algorithm 1 by exploiting the fact that $\psi(\cdot)$ is a convex function when $\mathcal{X}$ is convex. Indeed, for the current interval $[\gamma^-, \gamma^+]$, $\psi(\gamma)$ can be under- and over-estimated; see Figure 2 (right). The procedure can be described as follows. First, we construct a line that underestimates $\psi$ by identifying a subgradient of the function at $\bar\gamma$. This can be computed analytically since

$$\begin{aligned}
\psi(\gamma) &:= \mathbb{E}_{\bar{F}}\left[\min_{x\in\mathcal{X}} \mathrm{CVaR}^\alpha\Big(h(x,\xi) - \big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\big) \,\Big|\, \zeta\Big)\right]\\
&= \mathbb{E}_{\bar{F}}\left[\min_{x\in\mathcal{X}} \sup_{F\in\mathcal{D}(\bar{F},\alpha)} \mathbb{E}_F\Big[h(x,\xi) - \big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\big) \,\Big|\, \zeta\Big]\right]\\
&\ge \mathbb{E}_{\bar{F}}\left[\sup_{F\in\mathcal{D}(\bar{F},\alpha)} \min_{x\in\mathcal{X}} \mathbb{E}_F\Big[h(x,\xi) - \big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\big) \,\Big|\, \zeta\Big]\right]\\
&\ge \mathbb{E}_{\bar{F}}\left[\min_{x\in\mathcal{X}} \mathbb{E}_{F^*_{\xi|\zeta}}\Big[h(x,\xi) - \big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\big) \,\Big|\, \zeta\Big]\right]\\
&= \mathbb{E}_{\bar{F}}\left[\min_{x\in\mathcal{X}} \underbrace{\mathbb{E}_{F^*_{\xi|\zeta}}[h(x,\xi) - h(\bar{x},\xi)\,|\,\zeta]}_{\text{offset } a} + \gamma\, \underbrace{\mathbb{E}_{F^*_{\xi|\zeta}}\Big[h(\bar{x},\xi) - \min_{x'\in\mathcal{X}} h(x',\xi) \,\Big|\, \zeta\Big]}_{\text{slope } b}\right],
\end{aligned}$$

where $F^*_{\xi|\zeta}$ is the conditional probability given $\zeta$ of any member (hopefully a maximizer) of $\mathcal{D}(\bar{F},\alpha)$.
Note that the first inequality is tight based on Sion's minimax theorem (see Sion 1958), given that $\mathcal{D}(\bar{F},\alpha)$ is compact, while the second is tight as long as $F^*_{\xi|\zeta}$ achieves the supremum. Such a maximizer can be identified using:

$$F^*_{\xi|\zeta} \in \arg\max_{F_{\xi|\zeta}\in\mathcal{M}(\Omega_\xi):\; P_{F_{\xi|\zeta}}(\xi)\le(1-\alpha)^{-1}P_{\bar{F}}(\xi|\zeta)\;\forall\xi}\; \mathbb{E}_{F_{\xi|\zeta}}\Big[h(x^*_\gamma(\zeta),\xi) - \Big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\Big)\Big],$$

where $x^*_\gamma(\zeta)$ is the minimizer of (11) with $\zeta_\omega = \zeta$, since $(x^*_\gamma(\cdot), F^*)$, with $F^*$ as the composition of $\bar{F}$ marginalized on $\zeta$ and $F^*_{\xi|\zeta}$,³ is a saddle point of:

$$g(x(\cdot), F) := \mathbb{E}_F\Big[h(x(\zeta),\xi) - \Big((1-\gamma)\,h(\bar{x},\xi) + \gamma\min_{x'\in\mathcal{X}} h(x',\xi)\Big)\Big].$$

Such an $F^*_{\xi|\zeta}$ can be obtained as a side product of solving problem (11), using the optimal dual variables associated with constraint (11b). If we denote by $\gamma^u := -a/b$ the point at which this underestimating line crosses zero, then the right bound of the interval can be updated to $\gamma^+ := \min(\gamma^+, \gamma^u)$.

*Figure 2. Visualization of the basic (left) and accelerated (right) bisection algorithms. The blue square brackets indicate the current estimated interval containing the optimal γ, and the red square brackets indicate the interval at the next iteration. The right graph also visualizes the over- and under-estimators of ψ(γ).*

The second step is to construct an overestimator. If $\psi(\bar\gamma) > 0$, then we evaluate $\psi(\gamma^-)$ and construct the line that passes through $(\gamma^-, \psi(\gamma^-))$ and $(\bar\gamma, \psi(\bar\gamma))$. If $\psi(\bar\gamma) < 0$, then we evaluate $\psi(\gamma^+)$ and construct the line that passes through $(\gamma^+, \psi(\gamma^+))$ and $(\bar\gamma, \psi(\bar\gamma))$. We denote the point at which this line evaluates to zero by $\gamma^o$, and update the left bound of the interval to $\gamma^- := \max(\gamma^-, \gamma^o)$. Hence, the new interval is contained in $[\gamma^-, \gamma^+]$, which can potentially significantly reduce the search space. We conclude this section by commenting that the accelerated bisection algorithm could require up to two evaluations of the $\psi$ function at each iteration, instead of a single one as in the original algorithm.

³ Namely, $P_{F^*}(\zeta) = P_{\bar{F}}(\zeta)$ and $P_{F^*}(\xi\,|\,\zeta) = P_{F^*_{\xi|\zeta}}(\xi)$ for all $\zeta$.

## C. Algorithms for calibrating the size of the ambiguity sets

**Algorithm 2** Calibrating the size of the ambiguity set ($\alpha$) for DRPCR
1: **Input:** training dataset $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{train}}}$, validation dataset $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$, and $A := \{\alpha_i\}_{i=1}^n \subset [0,1]$
2: Train a random forest model $\hat{F}_{\xi|\zeta}$ on $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{train}}}$
3: Let $\bar{F}$ be the composition of $\hat{F}_{\xi|\zeta}$ with the empirical distribution $\hat{F}_\zeta$ of $\zeta$ in the training set $\{\zeta_j\}_{j=1}^{N_{\text{train}}}$
4: **for** $i = 1,\dots,n$ **do**
5: // Construct $\hat{x}^*_i(\cdot)$ with $\alpha_i$ and $\bar{F}$
6: Solve DRPCR with $\alpha_i$ and $\bar{F}$ to get $\gamma^*_i$
7: // Evaluate $\hat{x}^*_i(\cdot)$ on the empirical distribution of realizations in $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$
8: **for** $j = 1,\dots,N_{\text{validation}}$ **do**
9: Solve (11) with $\gamma^*_i$, $\alpha_i$, and replacing $P_{\bar{F}}(\xi=\xi_{\omega'}|\zeta=\zeta_\omega)$ with $P_{\hat{F}_{\xi|\zeta_j}}(\xi_{\omega'})$ to get an optimal $x^*_j$
10: Let $\hat{x}^*_i(\zeta_j) := x^*_j$
11: **end for**
12: Set $s_i := P_{\hat{F}}(\hat{x}^*_i(\cdot))$ for the empirical distribution $\hat{F}$ on $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$
13: **end for**
14: Let $i^* := \arg\max_i s_i$ and set $\alpha^* := \alpha_{i^*}$, $\gamma^* := \gamma^*_{i^*}$, and $x^*(\cdot) := \hat{x}^*_{i^*}(\cdot)$
15: **Return** $(\alpha^*, \gamma^*, x^*(\cdot))$
**Algorithm 3** Calibrating the size of the ambiguity set ($\alpha$) for CVaR-loss/CVaR-regret
1: **Input:** training dataset $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{train}}}$, validation dataset $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$, and $A := \{\alpha_i\}_{i=1}^n \subset [0,1]$
2: Train a random forest model $\hat{F}_{\xi|\zeta}$ on $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{train}}}$
3: **for** $i = 1,\dots,n$ **do**
4: // Evaluate $\hat{x}^*_i(\cdot)$ on the empirical distribution of realizations in $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$
5: **for** $j = 1,\dots,N_{\text{validation}}$ **do**
6: Solve (16)/(17) with $\alpha_i$ for $\zeta := \zeta_j$ in the validation set to get an optimal $x^*_j$
7: Let $\hat{x}^*_i(\zeta_j) := x^*_j$
8: **end for**
9: Set $s_i := P_{\hat{F}}(\hat{x}^*_i(\cdot))$ for the empirical distribution $\hat{F}$ on $\{\zeta_j,\xi_j\}_{j=1}^{N_{\text{validation}}}$
10: **end for**
11: Let $i^* := \arg\max_i s_i$ and set $\alpha^* := \alpha_{i^*}$ and $x^*(\cdot) := \hat{x}^*_{i^*}(\cdot)$
12: **Return** $(\alpha^*, x^*(\cdot))$

## D. Generation of a random covariance matrix with arbitrary variances

A random covariance matrix for the random vector $(\zeta, \xi)$ is generated based on the two-step procedure that follows. The first step consists in generating a random symmetric positive-definite matrix as described in Algorithm 4, a method implemented in the `sklearn.datasets.make_spd_matrix` function of the scikit-learn machine learning library in Python.

**Algorithm 4** Generating a random symmetric positive-definite matrix
1: **Input:** dimension of the square matrix, $n_\zeta + n_\xi$
2: Generate a random square matrix $A$ of size $n_\zeta + n_\xi$ by sampling from the uniform distribution $U[0,1]$
3: Construct the symmetric matrix $M = A^\top A$
4: Decompose $M$ with the singular value decomposition (SVD) method as $M = U\Sigma V^\top$
5: Generate a random diagonal matrix $S$ by sampling from the uniform distribution $U[0,1]$
6: Construct $\Sigma' = S + J$, where $J$ is the square matrix of ones of size $n_\zeta + n_\xi$
7: Get the symmetric positive-definite matrix $M' = U\Sigma' V^\top$
8: **Return** $M'$

Given the vector of standard deviations for $(\zeta,\xi)$, denoted by $[\sigma_\zeta^\top\; \sigma_\xi^\top]^\top$, and a random symmetric positive-definite matrix generated by Algorithm 4, one can implement the second step, described in Algorithm 5, to get a random covariance matrix with the arbitrary standard deviations $[\sigma_\zeta^\top\; \sigma_\xi^\top]^\top$.

**Algorithm 5** Generating a random covariance matrix with arbitrary standard deviations
1: **Input:** random symmetric positive-definite matrix $M$ and vector of standard deviations $[\sigma_\zeta^\top\;\sigma_\xi^\top]^\top$
2: Convert the matrix $M$ into its associated correlation matrix $\mathrm{Corr} = \mathrm{diag}(M)^{-\frac12}\, M\, \mathrm{diag}(M)^{-\frac12}$
3: Get the covariance matrix $\mathrm{Cov} = \mathrm{diag}\big([\sigma_\zeta^\top\;\sigma_\xi^\top]^\top\big)\, \mathrm{Corr}\, \mathrm{diag}\big([\sigma_\zeta^\top\;\sigma_\xi^\top]^\top\big)$
4: **Return** $\mathrm{Cov}$
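In NumPy, the two algorithms combine into a short routine; a sketch (our transcription of the steps above, not released code):

```python
import numpy as np

def random_spd_matrix(dim, rng):
    """Algorithm 4: random symmetric positive-definite matrix."""
    a = rng.uniform(size=(dim, dim))          # step 2
    m = a.T @ a                               # step 3
    u, _, vt = np.linalg.svd(m)               # step 4
    s = np.diag(rng.uniform(size=dim))        # step 5
    sigma = s + np.ones((dim, dim))           # step 6: Sigma' = S + J
    return u @ sigma @ vt                     # step 7

def random_covariance(stds, rng):
    """Algorithm 5: rescale to a covariance with given standard deviations."""
    m = random_spd_matrix(len(stds), rng)
    d_inv = np.diag(1.0 / np.sqrt(np.diag(m)))
    corr = d_inv @ m @ d_inv                  # step 2 of Algorithm 5
    return np.diag(stds) @ corr @ np.diag(stds)
```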
## E. Average out-of-sample coefficient of prescriptiveness

*Table 2. Average out-of-sample coefficient of prescriptiveness; columns report the level of perturbation.*

| Problem type | Method | 0% | 20% | 30% | 40% | 50% | 60% |
|--------------|--------|------|------|------|------|-------|-------|
| Relaxed $x(\cdot)$ | CSO | 0.45 | 0.30 | 0.19 | 0.04 | -0.13 | -0.31 |
| | DRCSO | 0.45 | 0.30 | 0.18 | 0.04 | -0.13 | -0.31 |
| | DRCRO | 0.45 | 0.30 | 0.18 | 0.04 | -0.13 | -0.32 |
| | DRPCR | 0.45 | 0.31 | 0.23 | 0.13 | 0.05 | 0.01 |
| Binary $x(\cdot)$ | CSO | 0.45 | 0.30 | 0.19 | 0.04 | -0.13 | -0.31 |
| | DRCSO | 0.44 | 0.30 | 0.19 | 0.06 | -0.09 | -0.25 |
| | DRCRO | 0.44 | 0.30 | 0.19 | 0.05 | -0.11 | -0.28 |
| | DRPCR | 0.44 | 0.32 | 0.24 | 0.15 | 0.07 | 0.02 |

## F. Additional experiments

While the experiments in Section 4 consider an exact version of the shortest path problem, to be closer to the setting proposed in Kallus & Mao 2023, we also conduct a second set of experiments where $x(\cdot)$ represents relaxed variables. Figure 3(a) illustrates the coefficients of prescriptiveness obtained from the optimal relaxed policies. These results are, in general, aligned with the ones observed in Figure 1(a); however, one remarks the following. Firstly, the results derived from CSO remain exactly the same as in the binary case. This stems from the fact that optimal relaxed CSO decisions are known to be integral for stochastic shortest path problems; conversely, this is not the case for DRCSO, DRCRO, and DRPCR, where robustness breaks the linearity of the objective. Secondly, comparing Figures 1(a) and 3(a) reveals that forcing DRCSO and DRCRO to propose binary policies enhances their out-of-sample performance, surpassing that of CSO. Indeed, this setting seems to give these two approaches the chance to better prepare for potential distribution shifts; however, despite their enhanced performance, the highest degree of robustness to distribution shift remains associated with the DRPCR policies. Thirdly, this comparative analysis yields the counter-intuitive empirical evidence that out-of-sample performance might be slightly improved when imposing integrality constraints on the three robust models. We hypothesize that this might be caused by the additional flexibility of the relaxed models, which makes them more susceptible to overfitting their assumed stochastic models.

*Figure 3. Shortest path problem (relaxed version): (a) statistics of the out-of-sample coefficient of prescriptiveness (lower values indicate worse performance); (b) statistics of $\mathbb{E}_{F'}[\|x^\star(\zeta)-\hat{x}\|_1]$, where $F'$ is the out-of-sample distribution (lower values reflect a closer proximity to the SAA solution).*