# Conformal Inverse Optimization

Bo Lin (University of Toronto, blin@mie.utoronto.ca) · Erick Delage (HEC Montréal, Mila – Québec AI Institute, erick.delage@hec.ca) · Timothy C. Y. Chan (University of Toronto, Vector Institute, tcychan@mie.utoronto.ca)

**Abstract.** Inverse optimization has been increasingly used to estimate unknown parameters in an optimization model based on decision data. We show that such a point estimate is insufficient in a prescriptive setting where the estimated parameters are used to prescribe new decisions. The prescribed decisions may be low-quality and misaligned with human intuition, and thus unlikely to be adopted. To tackle this challenge, we propose conformal inverse optimization, which first learns an uncertainty set for the unknown parameters and then solves a robust optimization model to prescribe new decisions. Under mild assumptions, we show that our method enjoys provable guarantees on solution quality, as evaluated using both the ground-truth parameters and the decision maker's perception of the unknown parameters. Our method demonstrates strong empirical performance compared to classic inverse optimization.

## 1 Introduction

Inverse optimization (IO) is a supervised learning approach that fits the parameters of an optimization model to decision data. The fitted optimization model can then be used to prescribe future decisions. Such problems naturally arise in AI applications where human preferences are not explicitly given and instead need to be inferred from historical decisions. For this pipeline to succeed in practice, the prescribed decision should not only be of high quality but also align with human intuition (i.e., be perceived as high-quality). The latter encourages algorithm adoption (Chen et al., 2023; Donahue et al., 2023), which is critical in many real-world settings (Liu et al., 2023; Sun et al., 2022).

As an example, rideshare platforms, e.g., Uber and Lyft, suggest a shortest path to the driver at the start of a trip based on real-time traffic data (Nguyen, 2015). The driver then relies on her perception of the road network, formed through past experience, to evaluate the path. The driver may deviate from the suggested path if it is perceived to be low-quality. Although seasoned drivers are often capable of identifying a better path due to their tacit knowledge of the road network (Merchán et al., 2022), such deviations impose operational challenges: they may raise rider safety concerns and affect downstream decisions such as arrival time estimation, trip pricing, and rider-driver matching (Hu et al., 2022). Therefore, the platform may be interested in leveraging historical paths taken by drivers to suggest high-quality paths for future trips, as evaluated using both the travel time and the driver's perception.

In this paper, we first show that the classic IO pipeline may generate decisions that are low-quality and misaligned with human intuition. We next propose conformal IO, which first learns an uncertainty set from decision data and then solves a robust optimization model with the learned uncertainty set to prescribe decisions. Finally, we show that the proposed approach has provable guarantees on both actual and perceived solution quality. Our contributions are as follows.

**New framework.** We propose a new IO pipeline that integrates i) a novel method to learn uncertainty sets from decision data and ii) a robust optimization model for decision recommendation.
**Theoretical guarantees.** We prove that, with high probability, the learned uncertainty set contains parameters that make future observed decisions optimal. This coverage guarantee leads to bounds on the optimality gap of the decisions from conformal IO, as evaluated using the ground-truth parameters and the decision maker's (DM's) perceived parameters.

**Performance.** Through experiments, we demonstrate the strong empirical performance of conformal IO compared to classic IO and provide insights into modeling choices.

*38th Conference on Neural Information Processing Systems (NeurIPS 2024).*

## 2 Literature Review

**Inverse optimization.** IO is a method to estimate unknown parameters in an optimization problem based on decision data (Ahuja and Orlin, 2001; Chan et al., 2014; Chan and Kaw, 2020). Early IO papers focus on deterministic settings where the observed decisions are assumed to be optimal for the specified optimization model. Recently, IO has been extended to stochastic settings where the observed decisions are subject to errors and bounded rationality. Progress has been made on estimators that are consistent (Aswani et al., 2018; Dong et al., 2018), tractable (Chan et al., 2019; Tan et al., 2020; Zattoni Scroccaro et al., 2024), and robust to data corruption (Mohajerin Esfahani et al., 2018). Our paper is in the stochastic stream. Unlike existing methods that provide a point estimate of the unknown parameters, we learn an uncertainty set that can be used in a robust optimization model.

**Data-driven uncertainty set construction.** Recently, data have become a critical ingredient for designing the structure (Delage and Ye, 2010; Mohajerin Esfahani et al., 2018; Gao and Kleywegt, 2023) and calibrating the size (Chenreddy et al., 2022; Sun et al., 2023) of an uncertainty set. Our paper is related to the work of Sun et al. (2023), who first use an ML model to predict the unknown parameters and then calibrate an uncertainty set around the prediction. However, this approach does not apply in our setting because it requires observations of the unknown parameters, which we do not have access to. Our paper presents a new approach to calibrating uncertainty sets using decision data.

**Estimate, then optimize.** Conformal IO belongs to a family of data-driven optimization methods called estimate, then optimize (Elmachtoub et al., 2023). Recent research suggests that even small estimation errors may be amplified in the optimization step, resulting in significant decision errors. This issue can be mitigated by training the estimation model with decision-aware losses (Wilder et al., 2019; Mandi et al., 2022) and by robustifying the optimization model (Chan et al., 2023a). We take a similar approach to the second stream, yet deviate from it by i) utilizing decision data instead of observations of the unknown parameters, and ii) focusing on both the ground-truth and perceived solution quality, the latter of which has not been studied in this stream of literature.

**Preference learning.** Preference learning has been studied in the context of reinforcement learning and has recently attracted significant attention due to its application in AI alignment (Ji et al., 2023). Existing methods focus on learning a reward function or decision policy that is maximally consistent with expert decision trajectories (Ng and Russell, 2000; Wu et al., 2024) or with labeled preferences in the form of pairwise comparisons or rankings (Wirth et al., 2017; Christiano et al., 2017; Rafailov et al., 2024). We enrich this stream of literature in two ways.
First, we leverage unlabeled decision data that are not state-action trajectories but solutions to an optimization model. Second, instead of learning a policy that imitates expert behaviors, we aim to extract common wisdom from decision data crowd-sourced from DMs who are not necessarily experts, so as to encourage algorithm adoption.

## 3 Preliminaries

In this section, we first present the problem setup (Section 3.1) and then describe the challenges with the classic IO pipeline (Section 3.2). Finally, we provide intuition on why robustifying the IO pipeline helps (Section 3.3).

### 3.1 Problem Setup

**Data generation.** Consider a forward optimization problem

$$\mathrm{FO}(\theta, u):\quad \underset{x \in X(u)}{\text{minimize}}\ f(\theta, x) \tag{1}$$

where $x \in \mathbb{R}^n$ is the decision vector whose feasible region $X(u)$ is non-empty and parameterized by exogenous parameters $u \in \mathbb{R}^m$, $\theta \in \mathbb{R}^d$ is a parameter vector, and $f: \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}$ is the objective function. Suppose $u$ is distributed according to $\mathbb{P}_u$ supported on $U$. There exists a ground-truth parameter vector $\theta^\star$ that is unknown to the DM. Instead, the DM obtains a decision $\hat{x}$ by solving $\mathrm{FO}(\hat{\theta}, u)$, where $\hat{\theta}$ is a noisy perception of $\theta^\star$. We assume that, while the distribution $\mathbb{P}_\theta$ of $\hat{\theta}$ is unknown, it is supported on a known bounded set $\Theta \subseteq \mathbb{R}^d$ and that $\theta^\star \in \Theta$. Let $\mathbb{P}_{(\theta,u)}$ denote the joint distribution of $\hat{\theta}$ and $u$, and let $x^\star: \Theta \times U \to \mathbb{R}^n$ be an oracle that returns an optimal solution to FO drawn uniformly at random from $X^{\mathrm{OPT}}(\theta, u) := \arg\min\{f(\theta, x) \mid x \in X(u)\}$.

**Objective function.** We focus on cases where $f$ is linear in $\theta$ and convex in $x$, i.e., $f(\theta, x) = \sum_{i \in [d]} \theta_i f_i(x)$ where the $f_i: \mathbb{R}^n \to \mathbb{R}$ are convex basis functions. This generalizes the linear objective $f(\theta, x) = \theta^\top x$. Moreover, FO with this objective can be treated as a multi-objective optimization model, which has been used to model routing preferences (Rönnqvist et al., 2017), radiation therapy planning (Chan et al., 2014), and portfolio optimization (Dong and Zeng, 2021). In this setting, the optimal solution to FO is invariant to the scale of $\theta$: if $x \in X^{\mathrm{OPT}}(\theta, u)$, then $x \in X^{\mathrm{OPT}}(\beta\theta, u)$ for any $\beta \in \mathbb{R}_+$. So, without loss of generality, we set $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 = 1\}$.

**Learning task.** Given a dataset of $N$ decision-exogenous-parameter pairs $D = \{\hat{x}_k, u_k\}_{k=1}^N$, we are interested in finding a decision policy $\tilde{x}: U \to \mathbb{R}^n$ to suggest decisions for future $u$. We require $\tilde{x}(u) \in X(u)$ for any $u \in U$. As discussed later, $\tilde{x}(u)$ is usually generated by solving an optimization model that may have multiple optimal solutions, so we consider randomized policies (e.g., sampling uniformly from a set of optimal solutions). This is nonrestrictive because a deterministic policy can be recovered from a randomized policy that samples the deterministic solution with probability one.

As a running example, FO may represent a shortest path problem, where $u_k$ specifies the origin and destination of a trip, $\theta^\star$ gives the ground-truth travel times on the road segments, and $\hat{\theta}_k$ is a driver's perception of $\theta^\star$ (i.e., perceived travel times). Decision $\hat{x}_k$ corresponds to a path taken by a driver based on her perceived travel times. Given a set of historical trips $\{u_k, \hat{x}_k\}_{k \in [N]}$, our goal is to derive a routing policy $\tilde{x}$ that can provide path recommendations for future origin-destination pairs.
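To make this setup concrete, the following minimal sketch simulates the data-generating process for a linear forward problem. The feasible region, dimensions, and noise model are illustrative assumptions only (the paper's experiments use the problems in Appendix C), and `scipy.optimize.linprog` stands in for the oracle $x^\star$.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def solve_fo(theta, A_ub, b_ub):
    """Stand-in for the oracle x*(theta, u): one optimal solution of
    min_x theta @ x over a polyhedral X(u) = {x : A_ub @ x <= b_ub, 0 <= x <= 1}."""
    res = linprog(c=theta, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
    return res.x

d, N = 5, 100                               # illustrative sizes, not the paper's
theta_star = rng.random(d)
theta_star /= np.linalg.norm(theta_star)    # ground truth on the unit sphere

A_ub, b_ub = -np.ones((1, d)), -np.ones(1)  # toy feasible region: sum(x) >= 1
dataset = []
for _ in range(N):
    theta_hat = theta_star + 0.1 * rng.standard_normal(d)  # noisy perception of theta*
    theta_hat /= np.linalg.norm(theta_hat)
    dataset.append((solve_fo(theta_hat, A_ub, b_ub), (A_ub, b_ub)))
```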
#### 3.1.1 Assumptions

**Assumption 1 (I.I.D. Samples).** The dataset $D$ is generated via $\hat{x}_k := x^\star(\hat{\theta}_k, u_k)$, where $(\hat{\theta}_k, u_k)$ are i.i.d. samples from $\mathbb{P}_{(\theta,u)}$ for all $k \in [N]$.

**Assumption 2 (Bounded Inverse Feasible Set).** There exists a constant $\eta \in \mathbb{R}_+$ such that, for any $\theta, \theta' \in \Theta^{\mathrm{OPT}}(\hat{x}, u)$ with $\hat{x} \in X(u)$ and $u \in U$, we have $\|\theta - \theta'\|_2 \le \eta$, where

$$\Theta^{\mathrm{OPT}}(x, u) := \left\{\theta \in \mathbb{R}^d \mid x \in X^{\mathrm{OPT}}(\theta, u),\ \|\theta\|_2 = 1\right\}. \tag{2}$$

**Assumption 3 (Bounded Divergence).** There exists a constant $\sigma \in \mathbb{R}_+$ such that $\|\mathbb{E}(\hat{\theta}) - \theta^\star\|_2 \le \sigma$.

Assumption 1 is standard in the ML and IO literature. Assumption 2 is mild because $\Theta^{\mathrm{OPT}}(\hat{x}, u)$ is by definition bounded and is usually much smaller than $\Theta$. Assumption 3 states that the $\ell_2$ distance between the expected perceived parameters and the ground-truth parameters is bounded, which is reasonable in many real-world settings. For example, a driver's perceived travel cost ($\hat{\theta}$) should not be too different from the travel time ($\theta^\star$), as the latter is an important factor that drivers consider.

#### 3.1.2 Evaluation Metrics

**Definition 1.** The actual optimality gap (AOG) of a decision policy $\tilde{x}$ is defined as

$$\mathrm{AOG}(\tilde{x}) := \mathbb{E}\left[f(\theta^\star, \tilde{x}(u)) - f(\theta^\star, x^\star(\theta^\star, u))\right] \tag{3}$$

where the expectation is taken over the joint distribution of the random variable $u$ and the decision sampled using the possibly randomized policy $\tilde{x}$.

**Definition 2.** The perceived optimality gap (POG) of a decision policy $\tilde{x}$ is defined as

$$\mathrm{POG}(\tilde{x}) := \mathbb{E}\left[f(\hat{\theta}, \tilde{x}(u)) - f(\hat{\theta}, x^\star(\hat{\theta}, u))\right] \tag{4}$$

where the expectation is taken with respect to the randomness in $\hat{\theta}$, $u$, and possibly $\tilde{x}$.

AOG is an objective performance measure: achieving a low AOG means that $\tilde{x}$ generates high-quality decisions. In contrast, POG is a subjective measure that depends on the DM's perception of the problem; achieving a low POG is critical to mitigate algorithm aversion (Burton et al., 2020).

### 3.2 An Inverse Optimization Pipeline

Finding $\tilde{x}$ is challenging for three reasons. First, unlike many ML tasks where the prediction target is unconstrained, we require $\tilde{x}(u)$ to be feasible for FO, which may involve a large number of constraints. Supervised learning approaches that predict $\hat{x}$ from $u$ often fail as they typically provide no feasibility guarantees; an optimization module is usually needed to recover feasibility or to produce feasible solutions based on $u$ and some estimated $\theta$. Second, we do not directly observe $\theta^\star$ or $\hat{\theta}$, which precludes using classic ML techniques to estimate them. Finally, AOG and POG may not align with each other, so we are essentially dealing with a bi-objective problem.

Figure 1: Classic and conformal IO pipelines.

In light of the first two challenges, a classic IO pipeline (visualized in Figure 1) first obtains a point estimate $\tilde{\theta}$ of the unknown parameters and then employs the policy $\tilde{x}_{\mathrm{IO}}(u) := x^\star(\tilde{\theta}, u)$ to prescribe decisions for any $u \in U$ (Rönnqvist et al., 2017; Babier et al., 2020). Specifically, we can estimate the parameters by solving the following inverse optimization problem:

$$\mathrm{IO}(D):\quad \underset{\theta \in \Theta}{\text{minimize}}\ \frac{1}{N}\sum_{k \in [N]} \ell\left(\hat{x}_k, X^{\mathrm{OPT}}(\theta, u_k)\right), \tag{5}$$

where $\ell$ is a non-negative loss function that returns 0 when $\hat{x}_k \in X^{\mathrm{OPT}}(\theta, u_k)$. For instance, the following loss function is commonly used in the literature.

**Definition 3.** The sub-optimality loss of $\theta$ is given by

$$\ell_S\left(\hat{x}, X^{\mathrm{OPT}}(\theta, u)\right) := \max_{x \in X(u)}\ f(\theta, \hat{x}) - f(\theta, x). \tag{6}$$

The sub-optimality loss penalizes the optimality gap achieved by the observed decision under the estimated parameters. As remarked by Mohajerin Esfahani et al. (2018), this loss has better computational properties than its alternatives because it is convex in the unknown parameters.
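For a linear forward problem, the sub-optimality loss (6) can be evaluated with a single solver call. A minimal sketch, assuming the same polyhedral feasible-region encoding as in the earlier simulation sketch:

```python
import numpy as np
from scipy.optimize import linprog

def suboptimality_loss(theta, x_obs, A_ub, b_ub):
    """l_S(x_obs, X_OPT(theta, u)) = f(theta, x_obs) - min_{x in X(u)} f(theta, x).

    Non-negative, and zero exactly when x_obs is optimal under theta."""
    res = linprog(c=theta, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
    return float(theta @ x_obs - res.fun)
```

Averaging this loss over $D$ and minimizing over $\theta \in \Theta$ gives the estimator in (5).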
In fact, when the dataset $D$ is large and the unknown parameters $\theta$ are high-dimensional, the sub-optimality loss is usually the only loss function that leads to a tractable inverse problem, although it does not enjoy properties such as statistical consistency (see Chan et al. (2023b) for detailed discussions). While such a trade-off is acceptable in some applications, it is undesirable in our setting because the resulting policy can achieve arbitrarily large AOG and POG. To see this, consider the following example, which satisfies Assumptions 1–3 and is visualized in Figure 2.

**Example 1.** Let $\mathrm{FO}(\theta, u)$ be the following problem:

$$\begin{aligned} \text{minimize}\quad & \theta_1 x_1 + \theta_2 x_2 && \text{(7a)}\\ \text{subject to}\quad & x_1 + u x_2 \ge u && \text{(7b)}\\ & 0 \le x_1 \le u && \text{(7c)}\\ & 0 \le x_2 \le 2. && \text{(7d)} \end{aligned}$$

Let the ground truth be $\theta^\star = (\cos(\pi/4), \sin(\pi/4))$ and $U = \{u\}$, where $u > 1$ is a real constant. We are given a dataset $D = \{\hat{x}_k, u\}_{k=1}^N$ where $\hat{x}_k = x^\star(\hat{\theta}_k, u)$ with $\hat{\theta}_k$ uniformly and independently drawn from $\Theta = \{(\cos\delta, \sin\delta) \mid \delta \in (0, \pi/2)\}$ for all $k \in [N]$.

Figure 2: Illustration of Example 1. The gray areas are the feasible region $X(u)$. The black arrows are the ground-truth parameter $\theta^\star$. The gray arrows are the extreme rays of $\Theta$. The blue and green arrows are the point estimate $\tilde{\theta}$. The green area is the uncertainty set $C(\tilde{\theta}, \alpha)$. The black circles are the optimal solution to $\mathrm{FO}(\theta^\star, u)$. The blue and green circles are the suggested decisions. Note that $\tilde{x}_{\mathrm{IO}}$ may suggest any decision on the facet of $x_1 + u x_2 \ge u$; these are omitted for clarity.

**Lemma 1.** In Example 1, let $\tilde{\theta}_N$ denote an optimal solution to $\mathrm{IO}(D)$ with the sub-optimality loss (6). Then $\mathbb{P}(\tilde{\theta}_N = \theta_u) \to 1$ as $N \to \infty$, where $\theta_u := \left(1/\sqrt{1+u^2},\ u/\sqrt{1+u^2}\right)$.

Lemma 1 shows that, when using $\mathrm{IO}(D)$ with the sub-optimality loss to estimate the unknown parameters in Example 1, the probability of the estimate being $\theta_u$ converges to one asymptotically. This implies that, asymptotically, we are almost certain that $\tilde{x}_{\mathrm{IO}}(u) = x^\star(\theta_u, u)$, i.e., the policy samples uniformly from the facet corresponding to the constraint $x_1 + u x_2 \ge u$. As a result, $\tilde{x}_{\mathrm{IO}}$ can achieve arbitrarily large AOG and POG when $u$ is set to a large enough value.

**Proposition 1.** In Example 1, let $\tilde{x}_{\mathrm{IO}}(u) = x^\star(\theta_u, u)$. For any $v \in \mathbb{R}_+$ there exists some $\bar{u} > 1$ such that $\mathrm{AOG}(\tilde{x}_{\mathrm{IO}}) > v$ and $\mathrm{POG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}$.

### 3.3 Robustifying the Inverse Optimization Pipeline

A natural idea to improve the AOG and POG of $\tilde{x}$ is to robustify the decision pipeline. Specifically, instead of solving FO with some estimated parameters $\tilde{\theta}$, we solve the following robust forward optimization problem with an uncertainty set around $\tilde{\theta}$ to prescribe decisions (final steps in Figure 1):

$$\mathrm{RFO}\left(C(\tilde{\theta}, \alpha), u\right):\quad \underset{x \in X(u)}{\text{minimize}}\ \underset{\theta \in C(\tilde{\theta}, \alpha)}{\text{maximize}}\ f(\theta, x) \tag{8}$$

where $C$ is an uncertainty set centered at $\tilde{\theta}$ and $\alpha$ collects parameters that control its shape and size. Given the support set $\Theta$ defined in Section 3.1, we focus on the uncertainty set

$$C(\tilde{\theta}, \alpha) := \left\{\theta \in \mathbb{R}^d \mid \|\theta\|_2 = 1,\ \theta^\top \tilde{\theta} \ge \cos\alpha\right\} \tag{9}$$

where $\alpha \in (0, \pi]$ is the maximum angle between $\tilde{\theta}$ and any vector in the uncertainty set.

**Remark 1.** An alternative approach to robustify the IO pipeline is to replace IO with a distributionally robust IO for parameter estimation (Mohajerin Esfahani et al., 2018). However, this approach does not help: $\theta_u$ is still optimal to the distributionally robust IO in Example 1, so the AOG and POG can still be arbitrarily large. See Appendix A.3 for the complete statement and discussion.
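For objectives of the form $f(\theta, x) = \theta^\top F(x)$ with $F(x) := (f_1(x), \ldots, f_d(x))$, the inner maximization in (8) over the spherical cap (9) admits a simple closed form: writing $F(x)$ as its component along $\tilde{\theta}$ plus an orthogonal remainder, the maximizer is either $F(x)/\|F(x)\|$ (if it lies inside the cap) or a point on the cap's boundary. A minimal sketch of this standard computation:

```python
import numpy as np

def worst_case_objective(theta_tilde, F_x, alpha):
    """max { theta @ F_x : ||theta||_2 = 1, theta @ theta_tilde >= cos(alpha) }.

    Assumes ||theta_tilde||_2 = 1."""
    a = theta_tilde @ F_x                      # component of F(x) along theta_tilde
    b = np.linalg.norm(F_x - a * theta_tilde)  # orthogonal component
    if np.arctan2(b, a) <= alpha:              # F(x)/||F(x)|| already lies in the cap
        return float(np.linalg.norm(F_x))
    return float(np.cos(alpha) * a + np.sin(alpha) * b)
```

In Example 1, $F(x) = x$, so this routine evaluates the inner maximization of RFO exactly; in general, RFO amounts to minimizing this worst-case value over $X(u)$.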
Now, in Example 1, we analyze the performance of a policy that utilizes RFO to prescribe decisions.

**Lemma 2.** In Example 1, let $\tilde{x}_{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}(C(\tilde{\theta}_N, \alpha), u)$, where $\tilde{\theta}_N$ is an optimal solution to $\mathrm{IO}(D)$ with the sub-optimality loss (6). When $\alpha \in (0, \pi/2)$, we have $\mathbb{P}[\mathrm{AOG}(\tilde{x}_{\mathrm{CIO}}) = 0] \to 1$ and $\mathbb{P}[\mathrm{POG}(\tilde{x}_{\mathrm{CIO}}) < \pi/2] \to 1$ as $N \to \infty$.

Lemma 2 shows that, when using RFO to prescribe new decisions, the probability of achieving bounded AOG and POG converges to one as $N$ goes to infinity, as long as $\alpha \in (0, \pi/2)$. These bounds are independent of $u$, in contrast to the AOG and POG of classic IO, which can be arbitrarily large as $u$ changes (Proposition 1). However, the performance of this approach still depends on the choice of $\alpha$, which is non-trivial when FO is more complex than a two-dimensional linear program. We address this problem next.

## 4 Conformal Inverse Optimization

In this section, we present a principled approach to learning uncertainty sets that lead to provable performance guarantees. As shown later, the learned uncertainty set contains parameters that make the next DM's decision optimal with a specified probability. We call this approach conformal IO due to its connection to conformal prediction (Vovk et al., 2005), which aims to predict a set that contains the next prediction target with a specified probability. Our approach first converts each context-decision observation $(u_k, \hat{x}_k)$ into a parameter set $\Theta^{\mathrm{OPT}}(\hat{x}_k, u_k)$ containing all the parameters that explain $\hat{x}_k$ under $u_k$. We then adapt conformal prediction to produce a set that, with probability $\gamma$, contains at least one member of the next sampled $\Theta^{\mathrm{OPT}}(\hat{x}, u)$. Note that if $\Theta^{\mathrm{OPT}}(\hat{x}, u)$ is a singleton almost surely, this approach is equivalent to applying conformal prediction to $\hat{\theta}_k$ directly. As illustrated in Figure 1, there are three training steps in conformal IO: i) data split, ii) point estimation, and iii) uncertainty set calibration. We present these steps in Section 4.1 and analyze the properties of conformal IO in Section 4.2.

### 4.1 Learning an Uncertainty Set

**Data split.** We first split the dataset $D$ into training and validation sets, namely $D_{\mathrm{train}}$ and $D_{\mathrm{val}}$. Let $K_{\mathrm{train}}$ and $K_{\mathrm{val}}$ index $D_{\mathrm{train}}$ and $D_{\mathrm{val}}$, respectively, and let $N_{\mathrm{train}} = |D_{\mathrm{train}}|$ and $N_{\mathrm{val}} = |D_{\mathrm{val}}|$.

**Point estimation.** Given a training set $D_{\mathrm{train}}$, we apply data-driven IO techniques to obtain a point estimate $\tilde{\theta}$ of the unknown parameters. The most straightforward way is to solve $\mathrm{IO}(D_{\mathrm{train}})$ with any loss function. Alternatively, one may use end-to-end learning and optimization methods that do not require observations of the parameter vectors, e.g., the one proposed by Berthet et al. (2020). The point estimate can also come from other sources, e.g., an ML model that predicts the parameters. Our calibration method is independent of the point estimation method; a sketch of one possible instantiation follows.
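As one concrete (illustrative) instantiation of this step for a linear objective $f(\theta, x) = \theta^\top x$, the sketch below minimizes the average sub-optimality loss (6) by projected subgradient descent on the unit sphere, using the fact that $\hat{x}_k - x^\star(\theta, u_k)$ is a subgradient of $\ell_S$ at $\theta$. The dataset encoding matches the earlier simulation sketch; this is not the paper's prescribed estimator.

```python
import numpy as np
from scipy.optimize import linprog

def fit_theta(dataset, d, steps=200, lr=0.05, seed=0):
    """Point estimation: minimize (1/N) sum_k l_S(x_hat_k, X_OPT(theta, u_k))
    over the unit sphere by projected subgradient descent."""
    rng = np.random.default_rng(seed)
    theta = rng.random(d)
    theta /= np.linalg.norm(theta)
    for _ in range(steps):
        grad = np.zeros(d)
        for x_obs, (A_ub, b_ub) in dataset:
            x_opt = linprog(c=theta, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1)).x
            grad += x_obs - x_opt              # subgradient of the sub-optimality loss
        theta -= lr * grad / len(dataset)
        theta /= np.linalg.norm(theta)         # project back onto Theta (unit sphere)
    return theta
```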
**Uncertainty set calibration.** Given a point estimate $\tilde{\theta}$, we calibrate an uncertainty set that, with a specified probability, contains parameters that make the next observed decision optimal. This property is critical for the results in Section 4.2 to hold. While we could naively achieve it by setting $\alpha = \pi$, the resulting RFO may generate overly conservative decisions. Hence, we are interested in learning the smallest uncertainty set that satisfies this condition. We solve the following calibration problem:

$$\begin{aligned} \mathrm{CP}(\tilde{\theta}, D_{\mathrm{val}}, \gamma):\quad \underset{\alpha,\ \{\theta_k\}_{k \in K_{\mathrm{val}}}}{\text{minimize}}\quad & \alpha && \text{(10a)}\\ \text{subject to}\quad & \hat{x}_k \in X^{\mathrm{OPT}}(\theta_k, u_k), \quad k \in K_{\mathrm{val}} && \text{(10b)}\\ & \sum_{k \in K_{\mathrm{val}}} \mathbb{1}\left[\theta_k \in C(\tilde{\theta}, \alpha)\right] \ge \gamma(N_{\mathrm{val}} + 1) && \text{(10c)}\\ & \|\theta_k\|_2 = 1, \quad k \in K_{\mathrm{val}} && \text{(10d)}\\ & 0 \le \alpha \le \pi, && \text{(10e)} \end{aligned}$$

where the decision variable $\alpha$ controls the size of the uncertainty set, $\theta_k$ represents a possible parameter vector associated with data point $k \in K_{\mathrm{val}}$, and $\gamma \in [0, 1]$ is a DM-specified confidence level. Constraints (10b) ensure that $\theta_k$ makes the decision $\hat{x}_k$ optimal for each $k \in K_{\mathrm{val}}$. Constraint (10c) ensures that at least $\gamma(N_{\mathrm{val}} + 1)$ of the decisions in $D_{\mathrm{val}}$ can find a vector in $C$ that makes them optimal. Constraints (10d) ensure that the parameter vectors lie on the unit sphere, as required in Equation (9).

**Remark 2 (Optimality Conditions).** The specific form of Constraints (10b) depends on the structure of FO. For example, when FO is a linear program, Constraints (10b) can be replaced with dual feasibility and strong duality constraints. When FO is a general convex optimization problem, we can use the KKT conditions. For non-convex forward problems, we can replace Constraints (10b) with $f(\theta_k, \hat{x}_k) \le f(\theta_k, x)$ for all $x \in X(u)$, which can be generated in a cutting-plane fashion.

**Remark 3 (Feasibility).** For CP to be feasible, we require that, for each observed decision, there exists a $\theta \in \Theta$ that makes it optimal. This condition holds for a range of problems, e.g., routing problems and the knapsack problem, even if the DM is subject to bounded rationality, i.e., the DM settles for suboptimal solutions due to cognitive/computational limitations. For problems where this condition is violated, we may pre-process $D_{\mathrm{val}}$ to project $\hat{x}$ onto a point in $X(u)$ such that the condition is satisfied.

Solving CP is hard. First, CP is non-convex due to Constraints (10d). Second, Constraints (10b) involve the optimality conditions of $N_{\mathrm{val}}$ problems, so the size of CP grows quickly as $N_{\mathrm{val}}$ increases. Nevertheless, a large $D_{\mathrm{val}}$ is critical to ensure desirable properties of the learned uncertainty set (Section 4.2). Below, we introduce a decomposition method to solve CP efficiently.

**Theorem 1.** Let $D_{\mathrm{val}}$ be a dataset, $\gamma \in [0, 1]$, $\tilde{\theta} \in \mathbb{R}^d$, $\tau = \lceil \gamma(N_{\mathrm{val}} + 1) \rceil$, and let $\Gamma_\tau$ be an operator that returns the $\tau$-th largest value in a set. The optimal solution to $\mathrm{CP}(\tilde{\theta}, D_{\mathrm{val}}, \gamma)$ is $\alpha^\star_\gamma := \arccos\left(\Gamma_\tau(\{c_k\}_{k \in K_{\mathrm{val}}})\right)$, where $c_k := \max_{\theta_k}\left\{\theta_k^\top \tilde{\theta} \mid \hat{x}_k \in X^{\mathrm{OPT}}(\theta_k, u_k),\ \|\theta_k\|_2 = 1\right\}$.

Theorem 1 states that we can solve CP by first solving $N_{\mathrm{val}}$ optimization problems whose size is independent of $N_{\mathrm{val}}$ and then finding a quantile in a set of $N_{\mathrm{val}}$ elements. The first step is parallelizable and the second step can be done in $O(N_{\mathrm{val}} \log \tau)$ time. Since the problem defining $c_k$ is a maximization problem, we can replace the constraint $\|\theta_k\|_2 = 1$ with $\|\theta_k\|_2 \le 1$ if $\tilde{\theta} \in \mathbb{R}^d_+$, so this problem is convex when the forward problem is a linear program.
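Theorem 1 reduces calibration to $N_{\mathrm{val}}$ independent score computations followed by an order statistic. A minimal sketch of the second step; computing each score $c_k$ requires solving the per-sample problem in Theorem 1, whose form depends on FO's optimality conditions (Remark 2), so it is left abstract here:

```python
import math

def calibrate_alpha(c_scores, gamma):
    """Theorem 1: alpha_gamma = arccos(tau-th largest c_k), tau = ceil(gamma*(N_val+1)).

    c_scores[k] = max { theta_k @ theta_tilde :
                        x_hat_k in X_OPT(theta_k, u_k), ||theta_k||_2 = 1 }."""
    n_val = len(c_scores)
    tau = math.ceil(gamma * (n_val + 1))
    tau = min(max(tau, 1), n_val)          # practical guard; Theorem 2 needs tau <= n_val
    kth_largest = sorted(c_scores, reverse=True)[tau - 1]
    return math.acos(min(max(kth_largest, -1.0), 1.0))  # clip for numerical safety
```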
### 4.2 Properties of Conformal IO

**Theorem 2 (Uncertainty Set Validity).** Let $D_{\mathrm{val}}$ be a dataset that satisfies Assumption 1, $(\hat{\theta}, u)$ be a new i.i.d. sample from $\mathbb{P}_{(\theta,u)}$, $\hat{x} = x^\star(\hat{\theta}, u)$, $\hat{\Theta} := \Theta^{\mathrm{OPT}}(\hat{x}, u)$, and $\alpha^\star_\gamma$ be an optimal solution to $\mathrm{CP}(\tilde{\theta}, D_{\mathrm{val}}, \gamma)$, where $\tilde{\theta} \in \mathbb{R}^d$. We have, for any $\gamma \in [0, N_{\mathrm{val}}/(N_{\mathrm{val}} + 1)]$, that

$$\mathbb{P}\left(\hat{\Theta} \cap C(\tilde{\theta}, \alpha^\star_\gamma) \ne \emptyset\right) \ge \gamma. \tag{11}$$

For any $\gamma \in [0, 1]$, with probability at least $1 - 1/N_{\mathrm{val}}$,

$$\left|\mathbb{P}\left(\hat{\Theta} \cap C(\tilde{\theta}, \alpha^\star_\gamma) \ne \emptyset\right) - \gamma\right| \le \epsilon(N_{\mathrm{val}}) := \sqrt{\frac{8\log(N_{\mathrm{val}} + 1) + 2\log N_{\mathrm{val}}}{N_{\mathrm{val}}}} + \frac{2}{N_{\mathrm{val}}}. \tag{12}$$

Theorem 2 states that our learned uncertainty set is conservatively valid and asymptotically exact (Vovk et al., 2005). More specifically, first, our method produces a set that contains a $\theta$ making the next DM's decision optimal in no less than a $\gamma$ fraction of the times it is used (conservatively valid); the probability in Inequality (11) is with respect to the joint distribution over $D_{\mathrm{val}}$ and the new sample. Second, once the set is given, we have high confidence that the probability of the next DM's decision being covered is within $\epsilon(N_{\mathrm{val}})$ of $\gamma$; the probability in Inequality (12) is with respect to the new sample, while the high confidence is with respect to the draw of the validation set. Overall, we have almost sure convergence of $\mathbb{P}(\hat{\Theta} \cap C(\tilde{\theta}, \alpha^\star_\gamma) \ne \emptyset)$ to $\gamma$ as $N \to \infty$. Finally, we note that in many practical applications the number of decision observations $N$ can be quite large. For example, in our motivating application, rideshare platforms observe millions of trips on a daily basis, providing ample data for both point estimation and uncertainty set calibration in our pipeline.

Now, we relate the validity results to the performance of conformal IO. The following lemma is an immediate consequence of the objective function $f$ being linear in $\theta$.

**Lemma 3.** For any $\hat{x} = x^\star(\hat{\theta}, u)$ with $(\hat{\theta}, u) \in \Theta \times U$, there exists a constant $\nu(\hat{x}) \in \mathbb{R}_+$ such that, for any $\theta, \theta' \in \Theta$, we have $f(\theta, \hat{x}) - f(\theta', \hat{x}) \le \nu(\hat{x})\,\|\theta - \theta'\|_2$.

**Theorem 3 (POG and AOG Bounds).** Let $\tilde{x}_{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}(C(\tilde{\theta}, \alpha_1), u)$ for any $u \in U$, where $\tilde{\theta} \in \mathbb{R}^d$ and $\alpha_1$ are chosen such that, for a new sample $(\theta', u')$ from $\mathbb{P}_{(\theta,u)}$ and $x' = x^\star(\theta', u')$, $\mathbb{P}\left(C(\tilde{\theta}, \alpha_1) \cap \Theta^{\mathrm{OPT}}(x', u') \ne \emptyset\right) = 1$. If Assumptions 2–3 hold, then

$$\mathrm{POG}(\tilde{x}_{\mathrm{CIO}}) \le \left(\eta + \sqrt{2 - 2\cos 2\alpha_1}\right)\hat{\mu} + \eta\,\mu_{\mathrm{CIO}}, \tag{13}$$

and

$$\mathrm{AOG}(\tilde{x}_{\mathrm{CIO}}) \le \left(\sqrt{2 - 2\cos 2\alpha_1} + \eta + \sigma\right)\mu^\star + (\eta + \sigma)\,\mu_{\mathrm{CIO}}, \tag{14}$$

where $\hat{\mu} := \mathbb{E}[\nu(x^\star(\hat{\theta}, u))]$, $\mu_{\mathrm{CIO}} := \mathbb{E}[\nu(\tilde{x}_{\mathrm{CIO}}(u))]$, and $\mu^\star := \mathbb{E}[\nu(x^\star(\theta^\star, u))]$.

Theorem 3 states that, when $C(\tilde{\theta}, \alpha)$ contains a $\theta$ that makes the next DM's decision optimal almost surely, conformal IO achieves bounded POG and AOG. While we can meet this condition by using a large $\alpha$, the bounds increase with $\alpha$, reflecting that the decisions may become overly conservative. Instead, we can use CP to obtain an $\alpha$ that achieves close-to-100% coverage, possibly adding a small increment to $\alpha$ as extra protection. Moreover, we show in Section 5 that, even with $\gamma < 100\%$, conformal IO still demonstrates favorable performance compared to classic IO. Our bounds have problem-specific constants. To demonstrate tightness, we report their numerical values in Example 1 with $\alpha = \pi/4$ (Table 1). They closely track the performance of conformal IO, which outperforms classic IO by a large margin.

Table 1: Performance profile of classic and conformal IO in Example 1 (AOG and POG for varying $u$).

| Method | AOG, u=2 | AOG, u=10 | AOG, u=50 | AOG, u=100 | POG, u=2 | POG, u=10 | POG, u=50 | POG, u=100 |
|---|---|---|---|---|---|---|---|---|
| Classic IO | 0.35 | 3.18 | 17.32 | 35.00 | 0.74 | 4.55 | 24.51 | 49.50 |
| Conformal IO | 0.00 | 0.00 | 0.00 | 0.00 | 0.70 | 0.16 | 0.03 | 0.02 |
| Conformal IO bound | 0.70 | 0.05 | 0.002 | 0.001 | 1.58 | 0.67 | 0.15 | 0.08 |

## 5 Numerical Studies

**Data generation.** We consider two forward problems: the shortest path problem (linear program, $d = 120$) and the knapsack problem (integer program, $d = 10$); see Appendix C.2 for their formulations. We use synthetic instances, which is common practice as there is no well-established IO benchmark (Tan et al., 2020; Dong et al., 2018). For both problems, we randomly generate ground-truth parameters $\theta^\star$ and a dataset of $N = 1000$ DMs. For each DM $k \in [N]$, we generate her perceived parameters as $\hat{\theta}^i_k = (\theta^\star_i p^i_k + \epsilon^i_k)_+ + \epsilon_0$ for $i \in [d]$, where $p^i_k$ is uniformly drawn from $[1/2, 2]$, $\epsilon^i_k$ is drawn from a normal distribution with mean 0 and standard deviation 1, and $\epsilon_0 = 0.1$. For the shortest path problem, $u_k$ represents a random origin-destination pair on the network. For the knapsack problem, $u_k$ corresponds to the item weights and the DM's budget. The item weights $w_i$ for $i \in [d]$ are uniformly drawn from $[1, 10]$ and are shared among DMs. For each DM $k \in [N]$, we generate a budget $u_k = q_k \sum_i w_i$, where $q_k$ is uniformly drawn from $[1/5, 5]$.
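The perception-noise recipe above is straightforward to reproduce; a minimal numpy sketch (array shapes and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 1000
theta_star = rng.random(d)                 # ground-truth parameters

# Perceived parameters: theta_hat_k^i = (theta_star_i * p_k^i + eps_k^i)_+ + eps_0
P = rng.uniform(0.5, 2.0, size=(N, d))     # multiplicative distortions p_k^i ~ U[1/2, 2]
E = rng.standard_normal((N, d))            # additive noise eps_k^i ~ N(0, 1)
theta_hat = np.maximum(theta_star * P + E, 0.0) + 0.1   # eps_0 = 0.1

# Knapsack exogenous parameters: shared weights, DM-specific budgets u_k = q_k * sum_i w_i
w = rng.uniform(1.0, 10.0, size=d)         # item weights w_i ~ U[1, 10]
q = rng.uniform(0.2, 5.0, size=N)          # q_k ~ U[1/5, 5]
budgets = q * w.sum()
```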
**Experiment design.** Conformal IO is compatible with any approach that provides a point estimate of the unknown parameters from decision data (Step 2 in Section 4.1). To the best of our knowledge, i) IO with the sub-optimality loss and ii) the PFYL approach of Berthet et al. (2020) are the only two methods that can perform this task at scale. We thus implement conformal IO with these two methods, which also serve as our baselines. We call both i) and ii) classic IO to emphasize that they rely on a point estimate for decision prescription, although PFYL is not an IO approach. See Appendix C for implementation details. In all experiments, conformal IO uses the training set for point estimation and the validation set for calibration, while classic IO uses the union of the training and validation sets for point estimation. Hence, both have access to the same amount of data and are evaluated on the same test set. Unless otherwise noted, experiments are based on a 60/20/20 train-validation-test split and are repeated 10 times with different random seeds.

**Uncertainty set validity.** To verify Theorem 2, we empirically evaluate the out-of-sample coverage achieved by our uncertainty set under different target levels $\gamma$ and sample sizes $N_{\mathrm{val}}$. The point estimate is generated by IO with the sub-optimality loss. As shown in Figure 3, when the validation set is small ($N_{\mathrm{val}} = 10$), we always achieve the specified target but $C(\tilde{\theta}, \alpha^\star_\gamma)$ tends to over-cover (conservatively valid). With larger validation sets ($N_{\mathrm{val}} \in \{100, 200\}$), the coverage level gets closer to the specified $\gamma$ (asymptotically exact). These empirical findings echo our theoretical analysis.

Figure 3: Empirical coverage achieved by the learned uncertainty set (error bar = range).

**The value of robustness.** As shown in Figure 4, solving RFO with an uncertainty set learned by conformal IO leads to decisions of lower AOG and POG than solving FO with a point estimate from classic IO. On average, across values of $\gamma$, our approach improves AOG by 20.1–30.4% and POG by 15.0–23.2% for the shortest path problem, and improves AOG by 40.3–57.0% and POG by 13.5–20.1% for the knapsack problem. The improvement is orthogonal to the point estimation method: our decisions are of higher quality and better aligned with human intuition than those of classic IO. Finally, the performance of conformal IO generally improves as the quality of the point estimate increases (see Appendix D for details). Therefore, even though the conformal IO pipeline is robust to estimation errors, using the best available point estimation method is still recommended.

Figure 4: Performance profile of classic (blue) and conformal IO (green).

**Computational efficiency.** As shown in Table 2, conformal IO and classic IO require similar training times. When FO is an integer program (knapsack), training conformal IO is even faster because it replaces one relatively large inverse integer program (associated with $D_{\mathrm{train}} \cup D_{\mathrm{val}}$), which is notoriously difficult to solve (Bodur et al., 2022), with a smaller inverse integer program (associated with $D_{\mathrm{train}}$) and a set of small calibration problems that are parallelizable (Theorem 1).
At prediction time, our method achieves lower AOG and POG at the cost of solving a more challenging RFO. Nevertheless, the solution time of RFO is within one second in our instances.

Table 2: Average (std) computational time of classic and conformal IO in seconds.

| Problem | Training: Classic IO | Training: Conformal IO | Prediction per decision: FO | Prediction per decision: RFO |
|---|---|---|---|---|
| Shortest Path | 0.18 (0.02) | 0.27 (0.03) | 0.01 (0.00) | 0.63 (0.12) |
| Knapsack | 2.47 (0.37) | 1.95 (0.32) | 0.01 (0.00) | 0.44 (0.15) |

**Important hyper-parameters.** Finally, we provide empirical evidence on the choice of two important hyper-parameters in conformal IO: i) the confidence level $\gamma$ and ii) the train-validation split ratio. Regarding $\gamma$, as shown in Figure 4, the performance of conformal IO improves quickly as $\gamma$ increases from 0 to 50%, and remains stable or even worsens slightly after that. Hence, it is possible to improve the performance of conformal IO by tuning $\gamma$ via cross-validation; however, this requires an additional validation dataset. If such a dataset is unavailable, setting $\gamma$ to a relatively large value (e.g., 0.99) usually yields decent performance, which aligns with our theoretical analysis. Regarding the train-validation split ratio, both the estimation and calibration steps intuitively benefit from more data; when the dataset is small, we must strike a balance between the two steps to achieve low AOG and POG. We implement conformal IO for the shortest path problem under different dataset sizes ($N_{\mathrm{train}} + N_{\mathrm{val}}$) and train-validation split ratios ($N_{\mathrm{val}}/(N_{\mathrm{train}} + N_{\mathrm{val}})$). As shown in Figure 5, when the dataset is small, there is no benefit to using conformal IO because we do not have enough data to obtain a good point estimate and a good uncertainty set simultaneously. However, the performance of classic IO quickly plateaus as the dataset grows. Given a mid- or large-sized dataset, we generally benefit from putting more data in $D_{\mathrm{val}}$, echoing our theoretical analysis.

Figure 5: Percentage reduction in AOG and POG when using conformal IO vs. classic IO.

## 6 Conclusion and Future Work

In this paper, we propose conformal IO, a novel IO pipeline that recommends high-quality decisions aligned with human intuition. We present a new approach to learning uncertainty sets from decision data, which are then used in a robust optimization model to prescribe new decisions. We prove that conformal IO achieves bounded optimality gaps, as measured by both the ground-truth parameters and the DM's perceived parameters, and we demonstrate its strong empirical performance via extensive numerical studies. Finally, we highlight several challenges that underscore future research directions. First, we focus on objectives that are linear in the unknown parameters; while such objectives are ubiquitous in practice, it is of interest to extend our results to general convex objectives. Second, the robust forward problem, while leading to better decisions, is computationally more costly than the forward problem; future research could accelerate its solution process. Finally, while conformal IO consistently outperforms classic IO when the point estimation method is fixed, our experiments suggest that its performance hinges on the quality of the point estimate. Future research could explore point estimation methods that directly optimize the performance of the downstream robust optimization model.
## Acknowledgments

The authors are grateful to the anonymous reviewers for their valuable feedback and insightful comments. Erick Delage was partially supported by the Canadian Natural Sciences and Engineering Research Council [Grant RGPIN-2022-05261] and by the Canada Research Chair program [950-230057].

## References

Ahuja, R. K. and Orlin, J. B. (2001). Inverse optimization. Operations Research, 49(5):771–783.

Aswani, A., Shen, Z.-J., and Siddiq, A. (2018). Inverse optimization with noisy data. Operations Research, 66(3):870–892.

Babier, A., Mahmood, R., McNiven, A. L., Diamant, A., and Chan, T. C. (2020). Knowledge-based automated planning with three-dimensional generative adversarial networks. Medical Physics, 47(2):297–306.

Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J.-P., and Bach, F. (2020). Learning with differentiable perturbed optimizers. In Advances in Neural Information Processing Systems, volume 33, pages 9508–9519.

Bodur, M., Chan, T. C., and Zhu, I. Y. (2022). Inverse mixed integer optimization: Polyhedral insights and trust region methods. INFORMS Journal on Computing, 34(3):1471–1488.

Burton, J. W., Stein, M.-K., and Jensen, T. B. (2020). A systematic review of algorithm aversion in augmented decision making. Journal of Behavioral Decision Making, 33(2):220–239.

Chan, T. C. Y., Craig, T., Lee, T., and Sharpe, M. B. (2014). Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, 62(3):680–695.

Chan, T. C. Y. and Kaw, N. (2020). Inverse optimization for the recovery of constraint parameters. European Journal of Operational Research, 282(2):415–427.

Chan, T. C. Y., Lee, T., and Terekhov, D. (2019). Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, 65(3):1115–1135.

Chan, T. C. Y., Mahmood, R., O'Connor, D. L., Stone, D., Unger, S., Wong, R. K., and Zhu, I. Y. (2023a). Got (optimal) milk? Pooling donations in human milk banks with machine learning and optimization. Manufacturing & Service Operations Management, 0(0).

Chan, T. C. Y., Mahmood, R., and Zhu, I. Y. (2023b). Inverse optimization: Theory and applications. Operations Research, 0(0).

Chen, V., Liao, Q. V., Wortman Vaughan, J., and Bansal, G. (2023). Understanding the role of human intuition on reliance in human-AI decision-making with explanations. In Proceedings of the ACM on Human-Computer Interaction, volume 7, pages 1–32.

Chenreddy, A. R., Bandi, N., and Delage, E. (2022). Data-driven conditional robust optimization. In Advances in Neural Information Processing Systems, volume 35, pages 9525–9537.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30.

Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612.

Donahue, K., Kollias, K., and Gollapudi, S. (2023). When are two lists better than one?: Benefits and harms in joint decision-making. arXiv preprint arXiv:2308.11721.

Dong, C., Chen, Y., and Zeng, B. (2018). Generalized inverse optimization through online learning. In Advances in Neural Information Processing Systems, volume 31.

Dong, C. and Zeng, B. (2021). Wasserstein distributionally robust inverse multiobjective optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5914–5921.

Elmachtoub, A. N., Lam, H., Zhang, H., and Zhao, Y. (2023). Estimate-then-optimize versus integrated-estimation-optimization: A stochastic dominance perspective. arXiv preprint arXiv:2304.06833.
Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655.

Hu, X., Cirit, O., Binaykiya, T., and Hora, R. (2022). DeepETA: How Uber predicts arrival times using deep learning. Uber Engineering Blog. Available at https://www.uber.com/en-CA/blog/deepeta-how-uber-predicts-arrival-times/. Accessed: 2024-01-19.

Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. (2023). AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.

Liu, M., Tang, X., Xia, S., Zhang, S., Zhu, Y., and Meng, Q. (2023). Algorithm aversion: Evidence from ridesharing drivers. Management Science, 0(0).

Mandi, J., Bucarey, V., Tchomba, M. M. K., and Guns, T. (2022). Decision-focused learning: Through the lens of learning to rank. In International Conference on Machine Learning, pages 14935–14947. PMLR.

Merchán, D., Arora, J., Pachon, J., Konduri, K., Winkenbach, M., Parks, S., and Noszek, J. (2022). 2021 Amazon last mile routing research challenge: Data set. Transportation Science, 0(0).

Mohajerin Esfahani, P., Shafieezadeh-Abadeh, S., Hanasusanto, G. A., and Kuhn, D. (2018). Data-driven inverse optimization with imperfect information. Mathematical Programming, 167:191–234.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.

Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 663–670.

Nguyen, T. (2015). ETA phone home: How Uber engineers an efficient route. Uber Engineering Blog. Available at https://www.uber.com/en-CA/blog/engineering-routing-engine/. Accessed: 2024-01-19.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Rönnqvist, M., Svenson, G., Flisberg, P., and Jönsson, L.-E. (2017). Calibrated route finder: Improving the safety, environmental consciousness, and cost effectiveness of truck routing in Sweden. Interfaces, 47(5):372–395.

Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3).

Sun, C., Liu, L., and Li, X. (2023). Predict-then-calibrate: A new perspective of robust contextual LP. In Advances in Neural Information Processing Systems.

Sun, J., Zhang, D. J., Hu, H., and Van Mieghem, J. A. (2022). Predicting human discretion to adjust algorithmic prescription: A large-scale field experiment in warehouse operations. Management Science, 68(2):846–865.

Tan, Y., Terekhov, D., and Delong, A. (2020). Learning linear programs from optimal decisions. In Advances in Neural Information Processing Systems, volume 33, pages 19738–19749.

Tang, B. and Khalil, E. B. (2024). PyEPO: A PyTorch-based end-to-end predict-then-optimize library for linear and integer programming. Mathematical Programming Computation, 16(3):297–335.

Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a Random World, volume 29. Springer.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.

Wilder, B., Dilkina, B., and Tambe, M. (2019). Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1658–1665.
Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. (2017). A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46.

Wu, F., Ke, J., and Wu, A. (2024). Inverse reinforcement learning with the average reward criterion. In Advances in Neural Information Processing Systems, volume 36.

Zattoni Scroccaro, P., Atasoy, B., and Mohajerin Esfahani, P. (2024). Learning in inverse optimization: Incenter cost, augmented suboptimality loss, and algorithms. Operations Research, 0(0).

## A Omitted Statements and Proofs in Section 3

### A.1 Proof of Lemma 1

*Proof.* We first show that $\hat{x} \in \{(0, 1), (u, 0)\}$ almost surely. Let $\delta_u := \arccos\left(1/\sqrt{1+u^2}\right)$, so $\cos\delta_u = 1/\sqrt{1+u^2}$ and $\sin\delta_u = u/\sqrt{1+u^2}$. It is easy to verify that, when $\hat{\theta}_k \in \Theta_1 := \{(\cos\delta, \sin\delta) \mid \delta \in (0, \delta_u]\}$, we have $\hat{x}_k = x^\star(\hat{\theta}_k, u) = (0, 1)$ almost surely; when $\hat{\theta}_k \in \Theta_2 := \{(\cos\delta, \sin\delta) \mid \delta \in (\delta_u, \pi/2)\}$, we have $\hat{x}_k = x^\star(\hat{\theta}_k, u) = (u, 0)$ almost surely. Since $\hat{\theta}_k$ is uniformly distributed in $\Theta = \Theta_1 \cup \Theta_2$, the distribution of $\hat{x}_k$ is

$$\hat{x}_k = \begin{cases} (0, 1), & \text{w.p. } 2\delta_u/\pi \\ (u, 0), & \text{w.p. } (\pi - 2\delta_u)/\pi. \end{cases} \tag{15}$$

Given a sample set $D = \{u_k, \hat{x}_k\}_{k \in [N]}$, let $N_1$ and $N_2$ denote the numbers of $(0, 1)$ and $(u, 0)$ in $D$, respectively. We next show that when $N_1 > 0$ and $N_2 > 0$, $\theta_u$ is the unique optimal solution to $\mathrm{IO}(D)$. Specifically, in Example 1, $\mathrm{IO}(D)$ reads

$$\tilde{\theta}_N := \underset{\theta \in \Theta}{\arg\min}\ \frac{N_1}{N}\ell_1(\theta) + \frac{N_2}{N}\ell_2(\theta) \tag{16}$$

where

$$\ell_1(\theta) = \begin{cases} 0, & \text{if } \theta \in \Theta_1 \\ \theta_2 - u\theta_1, & \text{if } \theta \in \Theta_2, \end{cases} \tag{17}$$

$$\ell_2(\theta) = \begin{cases} u\theta_1 - \theta_2, & \text{if } \theta \in \Theta_1 \\ 0, & \text{if } \theta \in \Theta_2. \end{cases} \tag{18}$$

A simple calculation shows that when $N_1 > 0$ and $N_2 > 0$, the minimum is 0 and occurs uniquely at $\theta = (\cos\delta_u, \sin\delta_u)$; when $N_2 = 0$, the minimum is 0 and occurs for any $\theta \in \Theta_1$; when $N_1 = 0$, the minimum is 0 and occurs for any $\theta \in \Theta_2$. Therefore,

$$\mathbb{P}(N_1 N_2 > 0) \le \mathbb{P}\left(\tilde{\theta}_N = (\cos\delta_u, \sin\delta_u)\right) \le 1. \tag{19}$$

Given the distribution in Equation (15) and that $D$ consists of i.i.d. samples from $\mathbb{P}_\theta$, we have

$$\mathbb{P}(N_1 N_2 > 0) = 1 - \left(\frac{2\delta_u}{\pi}\right)^N - \left(1 - \frac{2\delta_u}{\pi}\right)^N, \tag{20}$$

which converges to 1 as $N \to \infty$. Therefore, we conclude that $\mathbb{P}(\tilde{\theta}_N = (\cos\delta_u, \sin\delta_u)) \to 1$ as $N \to \infty$. ∎

### A.2 Proof of Proposition 1

*Proof.* We first consider the AOG and POG of the decision policy $\tilde{x}_{\mathrm{IO}}(u) := x^\star(\theta_u, u)$ separately in the following two lemmas.

**Lemma 4.** In Example 1, let $\tilde{x}_{\mathrm{IO}}(u) = x^\star(\theta_u, u)$. For any $v \in \mathbb{R}_+$ there exists some $\bar{u}_{\mathrm{AOG}} > 1$ such that $\mathrm{AOG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}_{\mathrm{AOG}}$.

*Proof.* By the definition of $x^\star$, $\tilde{x}_{\mathrm{IO}}(u)$ is uniformly drawn from

$$X^{\mathrm{OPT}}(\theta_u, u) = \left\{(ut,\ 1 - t) \mid t \in [0, 1]\right\}, \tag{21}$$

the facet joining $(0, 1)$ and $(u, 0)$. Since the ground truth is $\theta^\star = (\cos(\pi/4), \sin(\pi/4))$, the true optimal solution is $x^\star = (0, 1)$ with $f(\theta^\star, x^\star) = \sqrt{2}/2$. Hence,

$$\mathrm{AOG}(\tilde{x}_{\mathrm{IO}}) = \int_0^1 \frac{ut + (1 - t)}{\sqrt{2}}\,\mathrm{d}t - \frac{\sqrt{2}}{2} = \frac{u - 1}{2\sqrt{2}}. \tag{22}$$

Therefore, for any $v \in \mathbb{R}_+$, there exists $\bar{u}_{\mathrm{AOG}} = 2\sqrt{2}\,v + 1$ such that $\mathrm{AOG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}_{\mathrm{AOG}}$. ∎

**Lemma 5.** In Example 1, let $\tilde{x}_{\mathrm{IO}}(u) = x^\star(\theta_u, u)$. For any $v \in \mathbb{R}_+$ there exists some $\bar{u}_{\mathrm{POG}} > 1$ such that $\mathrm{POG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}_{\mathrm{POG}}$.

*Proof.* As in (21), $\tilde{x}_{\mathrm{IO}}(u)$ is uniformly drawn from the facet $X^{\mathrm{OPT}}(\theta_u, u) = \{(ut, 1-t) \mid t \in [0, 1]\}$. It is easy to verify that, when $\hat{\theta} \in \Theta_1 := \{(\cos\delta, \sin\delta) \mid \delta \in (0, \delta_u]\}$, the DM's optimal decision is $(0, 1)$ with objective value $\sin\delta$ almost surely; when $\hat{\theta} \in \Theta_2 := \{(\cos\delta, \sin\delta) \mid \delta \in (\delta_u, \pi/2)\}$, it is $(u, 0)$ with objective value $u\cos\delta$ almost surely.
Since the decision drawn from $X^{\mathrm{OPT}}(\theta_u, u)$ is independent of the DM's perception $\hat{\theta}$, we have

$$\begin{aligned} \mathrm{POG}(\tilde{x}_{\mathrm{IO}}) &= \frac{2}{\pi}\left[\int_0^{\delta_u}\!\!\int_0^1 \left(ut\cos\delta + (1-t)\sin\delta - \sin\delta\right)\mathrm{d}t\,\mathrm{d}\delta + \int_{\delta_u}^{\pi/2}\!\!\int_0^1 \left(ut\cos\delta + (1-t)\sin\delta - u\cos\delta\right)\mathrm{d}t\,\mathrm{d}\delta\right] \\ &= \frac{1}{\pi}\left[\int_0^{\delta_u} (u\cos\delta - \sin\delta)\,\mathrm{d}\delta + \int_{\delta_u}^{\pi/2} (\sin\delta - u\cos\delta)\,\mathrm{d}\delta\right] \\ &= \frac{1}{\pi}\left(2\sqrt{1+u^2} - u - 1\right) > \frac{u - 1}{\pi}. \end{aligned}$$

The inequality holds because $\sqrt{1+u^2} > u$. Therefore, for any $v \in \mathbb{R}_+$, there exists $\bar{u}_{\mathrm{POG}} = \pi v + 1$ such that $\mathrm{POG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}_{\mathrm{POG}}$. ∎

Based on Lemmas 4 and 5, we conclude that for any $v \in \mathbb{R}_+$, setting $\bar{u} = \max\{\bar{u}_{\mathrm{AOG}}, \bar{u}_{\mathrm{POG}}\}$ yields $\mathrm{AOG}(\tilde{x}_{\mathrm{IO}}) > v$ and $\mathrm{POG}(\tilde{x}_{\mathrm{IO}}) > v$ for any $u > \bar{u}$. ∎

### A.3 Robustifying the Inverse Problem

An alternative approach to robustifying the classic IO pipeline is to solve a distributionally robust inverse optimization problem. Specifically, consider the following loss function proposed by Mohajerin Esfahani et al. (2018).

**Definition 4.** The distributionally robust sub-optimality loss of $\theta$ is given by

$$\ell_{\mathrm{DR\text{-}S}}(\theta) := \sup_{\mathbb{Q} \in \mathbb{B}^p_r(\hat{\mathbb{P}}_{u,\hat{x}})} \rho_{\mathbb{Q}}\left[\ell_S\left(\hat{x}, X^{\mathrm{OPT}}(\theta, u)\right)\right] \tag{24}$$

where $\hat{\mathbb{P}}_{u,\hat{x}}$ is the empirical distribution of $D$, $\mathbb{B}^p_r(\hat{\mathbb{P}}_{u,\hat{x}})$ is a $p$-Wasserstein ball of radius $r$ centered at $\hat{\mathbb{P}}_{u,\hat{x}}$, and $\rho_{\mathbb{Q}}$ is a risk measure, e.g., the value at risk. The distributionally robust inverse optimization problem is

$$\mathrm{DRIO}(D):\quad \underset{\theta \in \Theta}{\text{minimize}}\ \ell_{\mathrm{DR\text{-}S}}(\theta). \tag{25}$$

As shown by Mohajerin Esfahani et al. (2018), the estimated parameters from DRIO achieve bounded out-of-sample sub-optimality loss with high probability. However, this does not imply bounded AOG and POG for the decision policy.

**Lemma 6.** In Example 1, $\theta_u$ is an optimal solution to $\mathrm{DRIO}(D)$.

Lemma 6 shows that, in Example 1, the estimated parameter from $\mathrm{DRIO}(D)$ may still be $\theta_u$. Hence, the decision policy is identical to $\tilde{x}_{\mathrm{IO}}$, whose AOG and POG can be unbounded. The fundamental reason behind these negative results is the misalignment between the sub-optimality loss and the evaluation metrics. Achieving a low sub-optimality loss means that the suggested and observed decisions are of similar quality as evaluated using the estimated parameters. However, this does not speak to the similarity between these two decisions with respect to the DM's perceived parameters (POG) or the ground-truth parameters (AOG). Therefore, the out-of-sample guarantees on the sub-optimality loss do not translate into bounded AOG or POG.

### A.4 Proof of Lemma 2

We consider the AOG and POG of conformal IO separately in the following two lemmas.

**Lemma 7.** In Example 1, let $\tilde{x}_{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}(C(\tilde{\theta}_N, \alpha), u)$, where $\tilde{\theta}_N$ is an optimal solution to $\mathrm{IO}(D)$ with the sub-optimality loss (6). When $\alpha \in (0, \pi/2)$, we have $\mathbb{P}[\mathrm{AOG}(\tilde{x}_{\mathrm{CIO}}) = 0] \to 1$ as $N \to \infty$.

*Proof.* We first show that when $\tilde{\theta} = \theta_u$ and $\alpha \in (0, \pi/2)$, $\mathrm{RFO}(C(\theta_u, \alpha), u)$ has the unique optimal solution $(0, 1)$. Let $x^1 = (0, 1)$, $x^2 = (0, 2)$, $x^3 = (u, 0)$, and $x^4 = (u, 2)$ denote the four extreme points of the feasible region $X(u)$, and define

$$R(x) := \max_{\theta \in C(\theta_u, \alpha)}\ \theta_1 x_1 + \theta_2 x_2. \tag{27}$$
Since FO is a linear program, it suffices to show that, when $\alpha \in (0, \pi/2)$, $R(x^1) < \min\{R(x^2), R(x^3), R(x^4)\}$: if there were an optimal solution that is not an extreme point, then there would exist another extreme point $x^i$, $i \ne 1$, with $R(x^1) = R(x^i)$. Next, we compare $R(x^1)$ with $R(x^2)$, $R(x^3)$, and $R(x^4)$. It is easy to verify that

$$R(x^1) = \begin{cases} \sin(\delta_u + \alpha), & \text{if } \alpha \in (0, \pi/2 - \delta_u] \\ 1, & \text{if } \alpha \in (\pi/2 - \delta_u, \pi/2). \end{cases} \tag{28}$$

For $x^2$, we have

$$R(x^2) = \begin{cases} 2\sin(\delta_u + \alpha), & \text{if } \alpha \in (0, \pi/2 - \delta_u] \\ 2, & \text{if } \alpha \in (\pi/2 - \delta_u, \pi/2). \end{cases} \tag{29}$$

Hence, $R(x^1) < R(x^2)$ when $\alpha \in (0, \pi/2)$. For $x^3$, we have

$$R(x^3) = \begin{cases} u\cos(\delta_u - \alpha), & \text{if } \alpha \in (0, \delta_u] \\ u, & \text{if } \alpha \in (\delta_u, \pi/2). \end{cases} \tag{30}$$

Since $u > 1$, we have $\pi/2 - \delta_u < \pi/4 < \delta_u < \pi/2$. We show that $R(x^1) < R(x^3)$ for $\alpha$ in $(0, \pi/2 - \delta_u)$, $[\pi/2 - \delta_u, \delta_u)$, and $[\delta_u, \pi/2)$. When $\alpha \in (0, \pi/2 - \delta_u)$, we have

$$\begin{aligned} R(x^1) &= \sin(\delta_u + \alpha) = \sin\delta_u\cos\alpha + \cos\delta_u\sin\alpha \\ &= \frac{u}{\sqrt{1+u^2}}\cos\alpha + \frac{1}{\sqrt{1+u^2}}\sin\alpha \\ &< \frac{u}{\sqrt{1+u^2}}\cos\alpha + \frac{u^2}{\sqrt{1+u^2}}\sin\alpha \\ &= u\left(\cos\delta_u\cos\alpha + \sin\delta_u\sin\alpha\right) = u\cos(\delta_u - \alpha) = R(x^3). \end{aligned}$$

The first line uses the sum-of-angles identity, the second line uses the definition of $\delta_u$, the inequality holds because $u > 1$, and the last line again uses the definition of $\delta_u$ and the angle-difference identity. When $\alpha \in [\pi/2 - \delta_u, \delta_u)$, we have $\cos\alpha > \cos\delta_u = 1/\sqrt{1+u^2}$ (since $\alpha < \delta_u$) and $\sin\alpha \ge \sin(\pi/2 - \delta_u) = \cos\delta_u = 1/\sqrt{1+u^2}$, so

$$R(x^3) = u\cos\delta_u\cos\alpha + u\sin\delta_u\sin\alpha > \frac{u}{1+u^2} + \frac{u^2}{1+u^2} = \frac{u(1+u)}{1+u^2} > 1 = R(x^1),$$

where the last inequality holds because $u > 1$. When $\alpha \in [\delta_u, \pi/2)$, we have $R(x^1) = 1 < u = R(x^3)$. Hence, $R(x^1) < R(x^3)$ when $\alpha \in (0, \pi/2)$. For $x^4$, we have

$$R(x^4) = \max_{\theta \in C(\theta_u, \alpha)}\ u\theta_1 + 2\theta_2. \tag{31}$$

Writing $\theta = (\cos\delta, \sin\delta)$, let $\delta^\star_1$ denote the maximizer of the problem defining $R(x^1)$. It is easy to verify that $\delta^\star_1 \in (0, \pi/2)$ when $\alpha \in (0, \pi/2)$, so $\cos\delta^\star_1 > 0$ and $\sin\delta^\star_1 > 0$. Hence,

$$R(x^4) \ge u\cos\delta^\star_1 + 2\sin\delta^\star_1 > \sin\delta^\star_1 = R(x^1). \tag{32}$$

The first inequality holds because $\delta^\star_1$ may not be the maximizer of the problem associated with $x^4$; the second holds because $u > 1$, $\cos\delta^\star_1 > 0$, and $\sin\delta^\star_1 > 0$. Hence, when $\alpha \in (0, \pi/2)$, $\mathrm{RFO}(C(\theta_u, \alpha), u)$ has the unique optimal solution $x^1$, so $\tilde{x}_{\mathrm{CIO}}(u) = (0, 1)$ almost surely. Given that $(0, 1)$ is also the optimal solution to $\mathrm{FO}(\theta^\star, u)$, we have $\mathrm{AOG}(\tilde{x}_{\mathrm{CIO}}) = 0$ when $\tilde{\theta}_N = \theta_u$ and $\alpha \in (0, \pi/2)$. By Lemma 1, $\mathbb{P}(\tilde{\theta}_N = \theta_u) \to 1$ as $N \to \infty$, so we conclude that, when $\alpha \in (0, \pi/2)$, $\mathbb{P}[\mathrm{AOG}(\tilde{x}_{\mathrm{CIO}}) = 0] \to 1$ as $N \to \infty$. ∎

**Lemma 8.** In Example 1, let $\tilde{x}_{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}(C(\tilde{\theta}_N, \alpha), u)$, where $\tilde{\theta}_N$ is an optimal solution to $\mathrm{IO}(D)$ with the sub-optimality loss (6). When $\alpha \in (0, \pi/2)$, we have $\mathbb{P}[\mathrm{POG}(\tilde{x}_{\mathrm{CIO}}) < \pi/2] \to 1$ as $N \to \infty$.

*Proof.* As shown in the proof of Lemma 7, when $\alpha \in (0, \pi/2)$, $\mathrm{RFO}(C(\theta_u, \alpha), u)$ has the unique optimal solution $(0, 1)$, so $\tilde{x}_{\mathrm{CIO}}(u) = (0, 1)$ almost surely when $\tilde{\theta}_N = \theta_u$ and $\alpha \in (0, \pi/2)$. It is easy to verify that, when $\hat{\theta} \in \Theta_1 := \{(\cos\delta, \sin\delta) \mid \delta \in (0, \delta_u]\}$, the DM's optimal decision is $(0, 1)$ almost surely; when $\hat{\theta} \in \Theta_2 := \{(\cos\delta, \sin\delta) \mid \delta \in (\delta_u, \pi/2)\}$, it is $(u, 0)$ with objective value $u\cos\delta$ almost surely. Hence,

$$\mathrm{POG}(\tilde{x}_{\mathrm{CIO}}) = \frac{2}{\pi}\left[\int_0^{\delta_u} 0\,\mathrm{d}\delta + \int_{\delta_u}^{\pi/2} (\sin\delta - u\cos\delta)\,\mathrm{d}\delta\right] \le \frac{2}{\pi}\int_{\delta_u}^{\pi/2} \sin\delta\,\mathrm{d}\delta = \frac{2}{\pi}\cos\delta_u = \frac{2}{\pi\sqrt{1+u^2}} < \frac{\pi}{2}.$$

The last inequality holds because $u > 1$.
By Lemma 1, $\mathbb{P}(\tilde{\theta}_N = \theta_u) \to 1$ as $N \to \infty$. So we conclude that, when $\alpha \in (0, \pi/2)$, $\mathbb{P}[\mathrm{POG}(\tilde{x}_{\mathrm{CIO}}) < \pi/2] \to 1$ as $N \to \infty$. ∎

## B Proofs of Statements in Section 4

### B.1 Definitions

**Definition 5 (Empirical Rademacher Complexity).** Let $F$ be a class of functions mapping from $Z$ to $[a, b]$ and let $D = \{Z_1, Z_2, \ldots, Z_N\}$ be a fixed sample of size $N$ with elements in $Z$. The empirical Rademacher complexity of $F$ with respect to the sample $D$ is defined as

$$\hat{\mathcal{R}}_D(F) := \mathbb{E}_\sigma\left[\sup_{f \in F}\ \frac{1}{N}\sum_{i \in [N]} \sigma_i f(Z_i)\right] \tag{34}$$

where $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N)$ with the $\sigma_i$'s being independent uniform random variables taking values in $\{-1, 1\}$.

**Definition 6 (Rademacher Complexity).** Let $\mathbb{P}$ denote the distribution according to which samples are drawn. For any integer $N \ge 1$, the Rademacher complexity of a function class $F$ is the expectation of the empirical Rademacher complexity over samples of size $N$ drawn from $\mathbb{P}$:

$$\mathcal{R}_N(F) := \mathbb{E}_{D \sim \mathbb{P}^N}\left[\hat{\mathcal{R}}_D(F)\right]. \tag{35}$$

**Definition 7 (Growth Function).** Let $H$ be a class of functions taking values in $\{-1, 1\}$. The growth function $\Pi_H: \mathbb{N} \to \mathbb{N}$ of $H$ is defined as

$$\Pi_H(N) := \max_{(Z_1, Z_2, \ldots, Z_N) \in Z^N}\ \left|\left\{(h(Z_1), h(Z_2), \ldots, h(Z_N)) \mid h \in H\right\}\right| \tag{36}$$

which measures the maximum number of distinct ways in which $N$ data points in $Z$ can be classified using the function class $H$.

### B.2 Useful Lemmas

**Lemma 9 (Corollary 3.1 in Mohri et al. (2018)).** Let $H$ be a class of functions taking values in $\{-1, 1\}$. Then, for any integer $N \ge 1$,

$$\mathcal{R}_N(H) \le \sqrt{\frac{2\log \Pi_H(N)}{N}}. \tag{37}$$

**Lemma 10 (Theorem 4.10 in Wainwright (2019)).** For any $b$-uniformly bounded class of functions $F$, any positive integer $N \ge 1$, and any scalar $\delta \ge 0$, with probability at least $1 - \exp\left(-N\delta^2/(2b^2)\right)$, we have

$$\sup_{f \in F}\left|\frac{1}{N}\sum_{i \in [N]} f(X_i) - \mathbb{E}[f(X_i)]\right| \le 2\mathcal{R}_N(F) + \delta \tag{38}$$

where $\mathcal{R}_N(F)$ denotes the Rademacher complexity of the function class $F$.

### B.3 Proof of Theorem 1

*Proof.* We first present the extensive formulation of Problem (10). For convenience, define $\hat{\Theta}_k := \Theta^{\mathrm{OPT}}(\hat{x}_k, u_k)$ for any $k \in [N]$. When $\alpha \in [0, \pi]$, $\cos\alpha$ is strictly decreasing in $\alpha$, so minimizing $\alpha$ is equivalent to maximizing $\cos\alpha$. We can therefore replace the decision variable $\alpha$ in Problem (10) with a new decision variable $c := \cos\alpha$ together with the constraint $-1 \le c \le 1$. In addition, we introduce binary variables $y_k \in \{0, 1\}$ indicating whether $\hat{\Theta}_k$ intersects the learned uncertainty set ($= 1$) or not ($= 0$) for each $k \in K_{\mathrm{val}}$. Problem (10) can then be written as

$$\begin{aligned} \underset{c,\ \{\theta_k\},\ \{y_k\}}{\text{maximize}}\quad & c && \text{(39a)}\\ \text{subject to}\quad & \hat{x}_k \in X^{\mathrm{OPT}}(\theta_k, u_k), \quad k \in K_{\mathrm{val}} && \text{(39b)}\\ & \theta_k^\top \tilde{\theta} \ge c + 2(y_k - 1), \quad k \in K_{\mathrm{val}} && \text{(39c)}\\ & \sum_{k \in K_{\mathrm{val}}} y_k \ge \gamma(N_{\mathrm{val}} + 1) && \text{(39d)}\\ & \|\theta_k\|_2 = 1, \quad k \in K_{\mathrm{val}} && \text{(39e)}\\ & -1 \le c \le 1 && \text{(39f)}\\ & y_k \in \{0, 1\}, \quad k \in K_{\mathrm{val}}. && \text{(39g)} \end{aligned}$$

Constraints (39b) ensure that $\theta_k$ is a member of $\hat{\Theta}_k$ for each $k \in K_{\mathrm{val}}$. Constraints (39c) decide whether $\theta_k$ is taken into account when computing the maximal cosine value $c$, based on whether $\hat{\Theta}_k$ intersects $C$. Constraint (39d) ensures that $C$ intersects at least $\gamma(N_{\mathrm{val}} + 1)$ inverse feasible sets. Constraints (39e) force each $\theta_k$ onto the unit sphere as in Equation (9). Constraints (39f)–(39g) specify the ranges of the decision variables.

Observing that the objective of Problem (39) is to maximize $c$ and that the variables $\theta_k$ of different data points interact only through Constraints (39c), we can rewrite Problem (39) as

$$\begin{aligned} \text{maximize}\quad & c && \text{(40a)}\\ \text{subject to}\quad & c \le c_k - 2(y_k - 1), \quad k \in K_{\mathrm{val}} && \text{(40b)}\\ & \sum_{k \in K_{\mathrm{val}}} y_k \ge \gamma(N_{\mathrm{val}} + 1) && \text{(40c)}\\ & -1 \le c \le 1 && \text{(40d)}\\ & y_k \in \{0, 1\}, \quad k \in K_{\mathrm{val}}, && \text{(40e)} \end{aligned}$$

where

$$\begin{aligned} c_k := \underset{\theta_k}{\text{maximize}}\quad & \theta_k^\top \tilde{\theta} && \text{(41a)}\\ \text{subject to}\quad & \hat{x}_k \in X^{\mathrm{OPT}}(\theta_k, u_k) && \text{(41b)}\\ & \|\theta_k\|_2 \le 1. && \text{(41c)} \end{aligned}$$
Note that we replace Constraints (39e) with Constraint (41c): since the objective of Problem (41) is to maximize the inner product of $\theta_k$ and $\tilde{\theta}$, the maximum occurs only when $\|\theta_k\|_2 = 1$. We further observe that the optimal solution to Problem (40) is to set $y_k = 1$ for all $k$ such that $c_k \ge \Gamma_\tau\left(\{c_k\}_{k \in K_{\mathrm{val}}}\right)$ and $y_k = 0$ otherwise. Therefore, the optimal objective value of Problem (40) is $c^\star = \Gamma_\tau\left(\{c_k\}_{k \in K_{\mathrm{val}}}\right)$, corresponding to $\alpha^\star_\gamma = \arccos\left(\Gamma_\tau\left(\{c_k\}_{k \in K_{\mathrm{val}}}\right)\right)$. ∎

### B.4 Proof of Lemma 3

*Proof.* For any fixed $\hat{x}$, we have

$$f(\theta, \hat{x}) - f(\theta', \hat{x}) = \sum_{i \in [d]} (\theta_i - \theta'_i) f_i(\hat{x}) \le \sqrt{\sum_{i \in [d]} f_i^2(\hat{x})}\ \sqrt{\sum_{i \in [d]} (\theta_i - \theta'_i)^2} = \nu(\hat{x})\,\|\theta - \theta'\|_2$$

where $\nu(\hat{x}) := \sqrt{\sum_{i \in [d]} f_i^2(\hat{x})}$. The inequality follows from the Cauchy–Schwarz inequality. ∎

### B.5 Proof of Theorem 2

*Proof.* We first prove that the learned uncertainty set is conservatively valid. Following the conformal prediction language of Vovk et al. (2005), we define a conformality measure $A_{\tilde{\theta}}: \mathbb{R}^n \times U \to \mathbb{R}_+$ of each data point, i.e., each observed decision and exogenous parameter pair:

$$\begin{aligned} A_{\tilde{\theta}}(\hat{x}, u) := \underset{\theta}{\text{maximize}}\quad & \theta^\top \tilde{\theta} && \text{(42a)}\\ \text{subject to}\quad & \theta \in \Theta^{\mathrm{OPT}}(\hat{x}, u) && \text{(42b)}\\ & \|\theta\|_2 \le 1. && \text{(42c)} \end{aligned}$$

We note that $c_k = A_{\tilde{\theta}}(\hat{x}_k, u_k)$ for any $k \in K_{\mathrm{val}}$, where $c_k$ is defined in Theorem 1. Let $\tau = \lceil \gamma(N_{\mathrm{val}} + 1) \rceil$ and $A := \{A_{\tilde{\theta}}(\hat{x}_k, u_k)\}_{k \in K_{\mathrm{val}}}$, or equivalently, $A := \{c_k\}_{k \in K_{\mathrm{val}}}$. By the definition of $C(\tilde{\theta}, \alpha)$ and the choice of $\alpha$ with $\cos\alpha = \Gamma_\tau(A)$, the event $\{\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap C(\tilde{\theta}, \alpha) \ne \emptyset\}$ is equivalent to $\{A_{\tilde{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A)\}$, so

$$\mathbb{P}\left(\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap C(\tilde{\theta}, \alpha) \ne \emptyset\right) = \mathbb{P}\left(A_{\tilde{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A)\right). \tag{43}$$

Assumption 1 implies that the dataset $D' := D_{\mathrm{val}} \cup \{(\hat{x}, u)\}$ is exchangeable, i.e., the ordering of the data points in $D'$ does not affect its joint probability distribution (Shafer and Vovk, 2008). Therefore, the rank (from high to low) of $A_{\tilde{\theta}}(\hat{x}, u)$ in $A' := A \cup \{A_{\tilde{\theta}}(\hat{x}, u)\}$ is uniformly distributed in $\{1, 2, \ldots, N_{\mathrm{val}} + 1\}$. So we have

$$\begin{aligned} \gamma &\le \mathbb{P}\left(A_{\tilde{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A')\right) \\ &= 1 - \mathbb{P}\left(A_{\tilde{\theta}}(\hat{x}, u) < \Gamma_\tau(A')\right) \\ &= 1 - \mathbb{P}\left(A_{\tilde{\theta}}(\hat{x}, u) < \Gamma_\tau(A)\right) \\ &= \mathbb{P}\left(A_{\tilde{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A)\right) \\ &= \mathbb{P}\left(\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap C(\tilde{\theta}, \alpha) \ne \emptyset\right). \end{aligned}$$

The first line holds by the definition of $\tau$. We obtain the second line by taking the complement of the event in the first line (inside the probability). The third line holds because $A_{\tilde{\theta}}(\hat{x}, u)$ can never be strictly smaller than itself, so any elements of $A'$ that are strictly smaller than $A_{\tilde{\theta}}(\hat{x}, u)$ are in $A$; note that this line holds only when $\tau \le N_{\mathrm{val}}$, which occurs when $\gamma \le N_{\mathrm{val}}/(N_{\mathrm{val}} + 1)$, because $A$ has only $N_{\mathrm{val}}$ elements. We obtain the fourth line by taking the complement of the event in the third line. The last line holds by Equation (43). All probabilities are over the joint distribution of $D_{\mathrm{val}}$ and the new sample, i.e., over $D'$.

We next prove that the learned uncertainty set is asymptotically exact. Let $z_k := (u_k, \hat{x}_k)$ and $Z := \{z_k\}_{k \in K_{\mathrm{val}}}$. Define the function class

$$H = \left\{h(\cdot, \alpha) = \mathbb{1}\left[\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap C(\tilde{\theta}, \alpha) \ne \emptyset\right] \;\middle|\; \alpha \in (0, \pi)\right\}. \tag{44}$$

Let $\Pi_H$ denote the growth function of $H$ as in Definition 7. It is easy to verify that

$$\Pi_H(N_{\mathrm{val}}) = N_{\mathrm{val}} + 1 \tag{45}$$

because $h(z, \alpha)$ is monotonically increasing in $\alpha$ for any fixed $z \in Z$, so varying $\alpha$ can lead to only $N_{\mathrm{val}} + 1$ distinct outcomes on a fixed dataset $Z$. Therefore, by Lemma 9,

$$\mathcal{R}_{N_{\mathrm{val}}}(H) \le \sqrt{\frac{2\log(N_{\mathrm{val}} + 1)}{N_{\mathrm{val}}}} \tag{46}$$

where $\mathcal{R}_{N_{\mathrm{val}}}(H)$ denotes the Rademacher complexity of $H$ at sample size $N_{\mathrm{val}}$, as in Definition 6. The value of $\alpha$ is chosen as the smallest value satisfying

$$\frac{1}{N_{\mathrm{val}}}\sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha) = \frac{1}{N_{\mathrm{val}}}\sum_{k \in K_{\mathrm{val}}} \mathbb{1}\left[\Theta^{\mathrm{OPT}}(\hat{x}_k, u_k) \cap C(\tilde{\theta}, \alpha) \ne \emptyset\right] = \frac{\lceil \gamma(N_{\mathrm{val}} + 1) \rceil}{N_{\mathrm{val}}}, \tag{47}$$

so we have

$$\gamma \le \frac{1}{N_{\mathrm{val}}}\sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha) \le \gamma + \frac{2}{N_{\mathrm{val}}}. \tag{48}$$
B.4 Proof of Lemma 3

Proof. For any fixed $\hat{x}$, we have
$$f(\theta, \hat{x}) - f(\theta', \hat{x}) = \sum_{i \in [d]} (\theta_i - \theta'_i) f_i(\hat{x}) \le \sqrt{\sum_{i \in [d]} f_i^2(\hat{x})} \sqrt{\sum_{i \in [d]} (\theta_i - \theta'_i)^2} = \nu(\hat{x}) \|\theta - \theta'\|_2,$$
where $\nu(\hat{x}) := \sqrt{\sum_{i \in [d]} f_i^2(\hat{x})}$. The inequality follows from the Cauchy–Schwarz inequality.

B.5 Proof of Theorem 2

Proof. We first prove that the learned uncertainty set is conservatively valid. Following the conformal prediction language used by Vovk et al. (2005), we define a conformality measure $A_{\bar{\theta}} : \mathbb{R}^n \times U \to \mathbb{R}_+$ of each data point, i.e., an observed decision and exogenous parameter pair, as follows:
$$
\begin{array}{lll}
A_{\bar{\theta}}(\hat{x}, u) := \underset{\theta}{\text{maximize}} & \bar{\theta}^\top \theta & (42a)\\
\phantom{A_{\bar{\theta}}(\hat{x}, u) :=\ } \text{subject to} & \theta \in \Theta^{\mathrm{OPT}}(\hat{x}, u) & (42b)\\
\phantom{A_{\bar{\theta}}(\hat{x}, u) :=\ } & \|\theta\|_2 \le 1. & (42c)
\end{array}
$$
We note that $c_k = A_{\bar{\theta}}(\hat{x}_k, u_k)$ for any $k \in K_{\mathrm{val}}$, where $c_k$ is defined in Theorem 1. Let $\tau = \lceil \gamma (N_{\mathrm{val}} + 1) \rceil$ and $A := \{A_{\bar{\theta}}(\hat{x}_k, u_k)\}_{k \in K_{\mathrm{val}}}$, or equivalently, $A := \{c_k\}_{k \in K_{\mathrm{val}}}$. Due to the definition of $\mathcal{C}(\bar{\theta}, \alpha)$ and the fact that $\alpha$ is chosen such that $\cos \alpha = \Gamma_\tau(A)$, the event $\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset$ is equivalent to the event $A_{\bar{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A)$, so
$$\mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \big) = \mathbb{P}\big( A_{\bar{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A) \big). \quad (43)$$

Assumption 1 implies that the dataset $D' = D_{\mathrm{val}} \cup \{(\hat{x}, u)\}$ is exchangeable, i.e., the ordering of the data points in $D'$ does not affect its joint probability distribution (Shafer and Vovk, 2008). Therefore, the rank (from high to low) of $A_{\bar{\theta}}(\hat{x}, u)$ in $A' := A \cup \{A_{\bar{\theta}}(\hat{x}, u)\}$ is uniformly distributed over $\{1, 2, \ldots, N_{\mathrm{val}} + 1\}$. So we have
$$
\begin{aligned}
\gamma &\le \mathbb{P}\big( A_{\bar{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A') \big)\\
&= 1 - \mathbb{P}\big( A_{\bar{\theta}}(\hat{x}, u) < \Gamma_\tau(A') \big)\\
&= 1 - \mathbb{P}\big( A_{\bar{\theta}}(\hat{x}, u) < \Gamma_\tau(A) \big)\\
&= \mathbb{P}\big( A_{\bar{\theta}}(\hat{x}, u) \ge \Gamma_\tau(A) \big)\\
&= \mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \big).
\end{aligned}
$$
The first line holds due to the definition of $\tau$. We obtain the second line by taking the complement of the event in the first line (inside the probability). The third line holds because $A_{\bar{\theta}}(\hat{x}, u)$ can never be strictly smaller than itself, so any elements of $A'$ that are strictly smaller than $A_{\bar{\theta}}(\hat{x}, u)$ are in $A$; note that this step requires $\tau \le N_{\mathrm{val}}$, which occurs when $\gamma \le N_{\mathrm{val}} / (N_{\mathrm{val}} + 1)$, because $A$ only has $N_{\mathrm{val}}$ elements. We obtain the fourth line by taking the complement of the event in the third line (inside the probability). The last line holds due to Equation (43). We note that all the probabilities are over the joint distribution of $D_{\mathrm{val}}$ and the new sample, i.e., over $D'$.

We next prove that the learned uncertainty set is asymptotically exact. Let $z_k := (u_k, \hat{x}_k)$ and $Z := \{z_k\}_{k \in K_{\mathrm{val}}}$. We define a function class
$$H = \left\{ h(z, \alpha) = \mathbb{1}\left[ \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \right] \;\middle|\; \alpha \in (0, \pi) \right\}. \quad (44)$$
Let $\Pi_H$ denote the growth function of $H$ as defined in Definition 7. It is easy to verify that
$$\Pi_H(N_{\mathrm{val}}) = N_{\mathrm{val}} + 1 \quad (45)$$
because the value of $h(z, \alpha)$ is monotonically increasing in $\alpha$ for any fixed $z \in Z$, so varying $\alpha$ can only lead to $N_{\mathrm{val}} + 1$ different outcomes for a fixed dataset $Z$. Therefore, according to Lemma 9, we have
$$R_{N_{\mathrm{val}}}(H) \le \sqrt{\frac{2 \log (N_{\mathrm{val}} + 1)}{N_{\mathrm{val}}}}, \quad (46)$$
where $R_{N_{\mathrm{val}}}(H)$ denotes the Rademacher complexity of $H$ when the sample size is $N_{\mathrm{val}}$, as defined in Definition 6. We know that the value of $\alpha$ is chosen to be the smallest value that satisfies
$$\frac{1}{N_{\mathrm{val}}} \sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha) = \frac{1}{N_{\mathrm{val}}} \sum_{k \in K_{\mathrm{val}}} \mathbb{1}\left[ \Theta^{\mathrm{OPT}}(\hat{x}_k, u_k) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \right] = \frac{\lceil \gamma (N_{\mathrm{val}} + 1) \rceil}{N_{\mathrm{val}}}, \quad (47)$$
so we have
$$\gamma \le \frac{1}{N_{\mathrm{val}}} \sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha) \le \gamma + \frac{2}{N_{\mathrm{val}}}. \quad (48)$$
The second inequality holds because
$$\frac{\lceil \gamma (N_{\mathrm{val}} + 1) \rceil}{N_{\mathrm{val}}} \le \frac{\gamma N_{\mathrm{val}} + \gamma + 1}{N_{\mathrm{val}}} \le \gamma + \frac{2}{N_{\mathrm{val}}}, \quad (49)$$
where the last step uses $\gamma \le 1$.

Since $D_{\mathrm{val}}$ is i.i.d. sampled, for any fixed $\alpha$, $\frac{1}{N_{\mathrm{val}}} \sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha)$ provides a sample-average approximation of $\mathbb{E}[h(z, \alpha)]$, which can be interpreted as $\mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \big)$ for a new sample $(\hat{\theta}, u)$ from $\mathbb{P}_{(\hat{\theta}, u)}$ with $\hat{x} = x(\hat{\theta}, u)$. By applying Lemma 10 with $\delta = \sqrt{2 \log N_{\mathrm{val}} / N_{\mathrm{val}}}$, we have, with probability at least $1 - 1/N_{\mathrm{val}}$,
$$\left| \frac{1}{N_{\mathrm{val}}} \sum_{k \in K_{\mathrm{val}}} h(z_k, \alpha) - \mathbb{E}[h(z, \alpha)] \right| \le 2 R_{N_{\mathrm{val}}}(H) + \sqrt{\frac{2 \log N_{\mathrm{val}}}{N_{\mathrm{val}}}}. \quad (50)$$
By combining (46)–(50), we have, with probability at least $1 - 1/N_{\mathrm{val}}$,
$$\left| \mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha) \neq \emptyset \big) - \gamma \right| \le \sqrt{\frac{8 \log (N_{\mathrm{val}} + 1)}{N_{\mathrm{val}}}} + \sqrt{\frac{2 \log N_{\mathrm{val}}}{N_{\mathrm{val}}}} + \frac{2}{N_{\mathrm{val}}}. \quad (51)$$

B.6 Proof of Theorem 3

We bound the POG and AOG of conformal IO separately in the following two propositions.

Proposition 2 (Conformal IO Achieves Bounded POG). Let $x^{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$ for any $u \in U$, where $\bar{\theta} \in \mathbb{R}^d$ and $\alpha_1$ are chosen such that, for a new sample $(\theta', u')$ from $\mathbb{P}_{(\theta, u)}$ and $x' = x(\theta', u')$, we have $\mathbb{P}\big( \mathcal{C}(\bar{\theta}, \alpha_1) \cap \Theta^{\mathrm{OPT}}(u', x') \neq \emptyset \big) = 1$. If Assumptions 2–3 hold, then
$$\mathrm{POG}(x^{\mathrm{CIO}}) \le \left( \eta + \sqrt{2 - 2 \cos 2\alpha_1} \right) \mu + \eta \, \mu_{\mathrm{CIO}}, \quad (52)$$
where $\mu := \mathbb{E}\big[ \nu(x(\hat{\theta}, u)) \big]$ and $\mu_{\mathrm{CIO}} := \mathbb{E}\big[ \nu(x^{\mathrm{CIO}}(u)) \big]$.

Proof. We first bound the perceived optimality gap of a sampled DM. Let $(\hat{\theta}, u)$ be a sample from $\mathbb{P}_{(\theta, u)}$ and $\hat{x} = x(\hat{\theta}, u)$. Let $\hat{\theta}^{\mathrm{CIO}}(u)$ denote the optimal solution to the inner maximization problem in $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$ when the outer decision variable is set to $\hat{x}$, and let $\theta^{\mathrm{CIO}}(u)$ denote the optimal solution to the inner maximization problem when the outer decision variable is set to $x^{\mathrm{CIO}}(u)$. If $\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha_1) \neq \emptyset$, let $\theta$ be an element of $\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha_1)$. We have
$$
\begin{aligned}
f(\hat{\theta}, x^{\mathrm{CIO}}(u)) - f(\hat{\theta}, \hat{x})
&\le f(\theta, x^{\mathrm{CIO}}(u)) - f(\theta, \hat{x}) + \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big] \|\hat{\theta} - \theta\|_2\\
&\le f(\theta, x^{\mathrm{CIO}}(u)) - f(\theta, \hat{x}) + \eta \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big]\\
&\le f(\theta^{\mathrm{CIO}}(u), x^{\mathrm{CIO}}(u)) - f(\theta, \hat{x}) + \eta \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big]\\
&\le f(\hat{\theta}^{\mathrm{CIO}}(u), \hat{x}) - f(\theta, \hat{x}) + \eta \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big]\\
&\le \nu(\hat{x}) \|\hat{\theta}^{\mathrm{CIO}}(u) - \theta\|_2 + \eta \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big]\\
&\le \nu(\hat{x}) \sqrt{2(1 - \cos 2\alpha_1)} + \eta \big[ \nu(\hat{x}) + \nu(x^{\mathrm{CIO}}(u)) \big]\\
&= \nu(\hat{x}) \left( \eta + \sqrt{2 - 2 \cos 2\alpha_1} \right) + \eta \, \nu(x^{\mathrm{CIO}}(u)).
\end{aligned}
$$
The first line holds due to Lemma 3. The second line holds due to Assumption 2. The third line holds due to the definition of $\theta^{\mathrm{CIO}}(u)$. The fourth line holds because $\big( x^{\mathrm{CIO}}(u), \theta^{\mathrm{CIO}}(u) \big)$ is an optimal solution to $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$. The fifth line holds due to Lemma 3. The sixth line holds because both $\hat{\theta}^{\mathrm{CIO}}(u)$ and $\theta$ are in $\mathcal{C}(\bar{\theta}, \alpha_1)$, so the angle between them is no larger than $2\alpha_1$; since both are on the unit sphere, the L2 distance between them is bounded by $\sqrt{2(1 - \cos 2\alpha_1)}$.

Since $\alpha_1$ is chosen such that $\mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha_1) \neq \emptyset \big) = 1$, we have
$$\mathrm{POG}(x^{\mathrm{CIO}}) = \mathbb{E}\left[ f(\hat{\theta}, x^{\mathrm{CIO}}(u)) - f(\hat{\theta}, \hat{x}) \right] \le \mathbb{E}\left[ \nu(\hat{x}) \left( \eta + \sqrt{2 - 2 \cos 2\alpha_1} \right) + \eta \, \nu(x^{\mathrm{CIO}}(u)) \right] = \left( \eta + \sqrt{2 - 2 \cos 2\alpha_1} \right) \mu + \eta \, \mu_{\mathrm{CIO}},$$
where $\mu := \mathbb{E}[\nu(\hat{x})]$ and $\mu_{\mathrm{CIO}} := \mathbb{E}\big[ \nu(x^{\mathrm{CIO}}(u)) \big]$.
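As a quick numerical illustration of the bound in (52), take the purely illustrative values $\eta = 0.1$, $\alpha_1 = \pi/12$, and $\mu = \mu_{\mathrm{CIO}} = 1$ (these numbers are our own, not from the paper):

```latex
\begin{align*}
\mathrm{POG}(x^{\mathrm{CIO}})
  &\le \left(\eta + \sqrt{2 - 2\cos 2\alpha_1}\right)\mu + \eta\,\mu_{\mathrm{CIO}} \\
  &= \left(0.1 + \sqrt{2 - 2\cos(\pi/6)}\right)\cdot 1 + 0.1 \cdot 1 \\
  &\approx 0.1 + 0.5176 + 0.1 \approx 0.72.
\end{align*}
```

As $\alpha_1 \to 0$, i.e., as the calibrated uncertainty set shrinks, the bound tightens toward $\eta(\mu + \mu_{\mathrm{CIO}})$.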
Proposition 3 (Conformal IO Achieves Bounded AOG). Let $x^{\mathrm{CIO}}(u)$ be an optimal solution to $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$ for any $u \in U$, where $\bar{\theta} \in \mathbb{R}^d$ and $\alpha_1$ are chosen such that, for a new sample $(\theta', u')$ from $\mathbb{P}_{(\theta, u)}$ and $x' = x(\theta', u')$, we have $\mathbb{P}\big( \mathcal{C}(\bar{\theta}, \alpha_1) \cap \Theta^{\mathrm{OPT}}(u', x') \neq \emptyset \big) = 1$. If Assumptions 2–3 hold, then
$$\mathrm{AOG}(x^{\mathrm{CIO}}) \le \left( \sqrt{2 - 2 \cos 2\alpha_1} + \eta + \sigma \right) \mu^* + (\eta + \sigma) \mu_{\mathrm{CIO}}, \quad (53)$$
where $\mu^* := \mathbb{E}\big[ \nu(x(\theta^*, u)) \big]$.

Proof. We first derive an upper bound on the optimality gap of the suggested decision $x^{\mathrm{CIO}}(u)$, as evaluated using $\theta^*$, for any $u \in U$. Let $(\hat{\theta}, u)$ be a sample from $\mathbb{P}_{(\theta, u)}$, $\hat{x} = x(\hat{\theta}, u)$, and let $\theta$ be an element of $\Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha_1)$, which is non-empty almost surely because $\alpha_1$ is chosen such that $\mathbb{P}\big( \Theta^{\mathrm{OPT}}(\hat{x}, u) \cap \mathcal{C}(\bar{\theta}, \alpha_1) \neq \emptyset \big) = 1$. Let $\theta^{\mathrm{CIO}}(u)$ denote the optimal solution to the inner maximization problem in $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$ when the outer decision variable is set to $x^{\mathrm{CIO}}(u)$. For any $u \in U$, let $x^*(u) := x(\theta^*, u)$ and let $\theta^{*,\mathrm{CIO}}(u)$ denote the optimal solution to the inner maximization problem in $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$ when the outer decision variable is set to $x^*(u)$. We have
$$
\begin{aligned}
f(\theta^*, x^{\mathrm{CIO}}(u)) - f(\theta^*, x^*(u))
&\le f\big(\mathbb{E}(\hat{\theta}), x^{\mathrm{CIO}}(u)\big) - f\big(\mathbb{E}(\hat{\theta}), x^*(u)\big) + \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big] \|\theta^* - \mathbb{E}(\hat{\theta})\|_2\\
&\le f\big(\mathbb{E}(\hat{\theta}), x^{\mathrm{CIO}}(u)\big) - f\big(\mathbb{E}(\hat{\theta}), x^*(u)\big) + \sigma \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&= \mathbb{E}\left[ f(\hat{\theta}, x^{\mathrm{CIO}}(u)) - f(\hat{\theta}, x^*(u)) \right] + \sigma \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \mathbb{E}\left[ f(\theta, x^{\mathrm{CIO}}(u)) - f(\theta, x^*(u)) + \big( \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big) \|\hat{\theta} - \theta\|_2 \right] + \sigma \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \mathbb{E}\left[ f(\theta, x^{\mathrm{CIO}}(u)) - f(\theta, x^*(u)) \right] + (\eta + \sigma) \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \mathbb{E}\left[ f(\theta^{\mathrm{CIO}}(u), x^{\mathrm{CIO}}(u)) - f(\theta, x^*(u)) \right] + (\eta + \sigma) \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \mathbb{E}\left[ f(\theta^{*,\mathrm{CIO}}(u), x^*(u)) - f(\theta, x^*(u)) \right] + (\eta + \sigma) \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \mathbb{E}\left[ \nu(x^*(u)) \|\theta^{*,\mathrm{CIO}}(u) - \theta\|_2 \right] + (\eta + \sigma) \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&\le \nu(x^*(u)) \sqrt{2(1 - \cos 2\alpha_1)} + (\eta + \sigma) \big[ \nu(x^{\mathrm{CIO}}(u)) + \nu(x^*(u)) \big]\\
&= \left( \sqrt{2 - 2 \cos 2\alpha_1} + \eta + \sigma \right) \nu(x^*(u)) + (\eta + \sigma) \nu(x^{\mathrm{CIO}}(u)).
\end{aligned}
$$
The first line holds due to Lemma 3. The second line holds due to Assumption 3. The third line holds because $f$ is linear in $\theta$; the expectation is taken over $\mathbb{P}_\theta$. The fourth line holds due to Lemma 3. The fifth line holds due to Assumption 2. The sixth line holds due to the definition of $\theta^{\mathrm{CIO}}(u)$. The seventh line holds because $\big( x^{\mathrm{CIO}}(u), \theta^{\mathrm{CIO}}(u) \big)$ is an optimal solution to $\mathrm{RFO}\big(\mathcal{C}(\bar{\theta}, \alpha_1), u\big)$. The eighth line holds due to Lemma 3. The ninth line holds since both $\theta^{*,\mathrm{CIO}}(u)$ and $\theta$ are on the unit sphere and the angle between them is no greater than $2\alpha_1$, so the L2 distance between them is upper bounded by $\sqrt{2(1 - \cos 2\alpha_1)}$.

Next, we bound the AOG of $x^{\mathrm{CIO}}$. We have
$$\mathrm{AOG}(x^{\mathrm{CIO}}) = \mathbb{E}\left[ f(\theta^*, x^{\mathrm{CIO}}(u)) - f(\theta^*, x^*(u)) \right] \le \mathbb{E}\left[ \left( \sqrt{2 - 2 \cos 2\alpha_1} + \eta + \sigma \right) \nu(x^*(u)) + (\eta + \sigma) \nu(x^{\mathrm{CIO}}(u)) \right] = \left( \sqrt{2 - 2 \cos 2\alpha_1} + \eta + \sigma \right) \mu^* + (\eta + \sigma) \mu_{\mathrm{CIO}},$$
where $\mu^* := \mathbb{E}[\nu(x^*(u))]$.

C Numerical Experiment Details

C.1 Computational Setup

All algorithms are implemented and tested using Python 3.9.1 on a MacBook Pro with an Apple M1 Pro processor and 16 GB of RAM. Optimization models are implemented with Gurobi 9.5.2.

C.2 Forward Problems

C.2.1 Shortest Path

We consider the shortest-path problem on a $5 \times 5$ grid network $G(N, E)$, where $N$ and $E$ denote the node and edge sets, respectively. Let $E^+(i)$ and $E^-(i)$ denote the sets of edges that enter and leave node $i \in N$, respectively. Let $u_o$ and $u_d$ denote the origin and destination of the trip, respectively. We define binary decision variables $x_{ij}$ that take value 1 if edge $(i, j) \in E$ is traversed. The shortest-path problem is presented as follows:
$$
\begin{array}{llll}
\underset{x}{\text{minimize}} & \displaystyle\sum_{(i,j) \in E} \theta_{ij} x_{ij} & & (54a)\\
\text{subject to} & \displaystyle\sum_{(j,i) \in E^+(i)} x_{ji} - \sum_{(i,j) \in E^-(i)} x_{ij} = \begin{cases} 1, & \text{if } i = u_d\\ -1, & \text{if } i = u_o\\ 0, & \text{otherwise,} \end{cases} & i \in N & (54b)\\
& x_{ij} \in \{0, 1\}, \quad (i,j) \in E. & & (54c)
\end{array}
$$
The objective function minimizes the total travel cost. The first set of constraints are flow-balance constraints that ensure the selected edges form a path from $u_o$ to $u_d$. The second set of constraints specify the range of the decision variables. Note that the constraint matrix is totally unimodular, so we can relax the binary constraints to $0 \le x_{ij} \le 1$ for any $(i,j) \in E$ when implementing this model.

C.2.2 Knapsack

We consider a knapsack problem with $d = 10$ items. We define binary decision variables $x_i$ that indicate whether item $i \in [d]$ is selected ($= 1$) or not ($= 0$). The knapsack problem is presented as follows:
$$
\begin{array}{lll}
\underset{x}{\text{maximize}} & \displaystyle\sum_{i \in [d]} \theta_i x_i & (55a)\\
\text{subject to} & \displaystyle\sum_{i \in [d]} w_i x_i \le u & (55b)\\
& x_i \in \{0, 1\}, \quad i \in [d]. & (55c)
\end{array}
$$
The objective maximizes the total value of the selected items. The first constraint enforces a total budget for item selection. The second set of constraints specify the range of the decision variables.
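As a concrete companion to Problem (55), the following is a minimal gurobipy sketch of the knapsack forward problem; the instance data (values, weights, budget) are illustrative placeholders, not the paper's experimental settings.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def solve_knapsack(theta: np.ndarray, w: np.ndarray, u: float) -> np.ndarray:
    """Solve the forward problem (55): maximize theta^T x s.t. w^T x <= u, x binary."""
    d = len(theta)
    m = gp.Model("knapsack")
    m.Params.OutputFlag = 0  # silence solver logs
    x = m.addVars(d, vtype=GRB.BINARY, name="x")
    m.setObjective(gp.quicksum(theta[i] * x[i] for i in range(d)), GRB.MAXIMIZE)
    m.addConstr(gp.quicksum(w[i] * x[i] for i in range(d)) <= u, name="budget")
    m.optimize()
    return np.array([x[i].X for i in range(d)])

# Synthetic instance with d = 10 items (illustrative values only)
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, 10)   # item values
w = rng.uniform(0, 1, 10)       # item weights
x_opt = solve_knapsack(theta, w, u=2.5)
```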
C.3 Obtaining a Point Estimate

We consider two methods to obtain point estimates of the unknown parameters: i) data-driven inverse optimization with the sub-optimality loss, and ii) the gradient-based method proposed by Berthet et al. (2020). We implement the method from Berthet et al. (2020) with the package provided by Tang and Khalil (2024). Hyper-parameters are tuned using a separate validation set of 200 decision data points. The batch size is set to 64. We use the Adam optimizer with an initial learning rate of 0.1 and train the model for 20 epochs.

We present the implementation details of the data-driven inverse optimization method next. We solve
$$
\begin{array}{lll}
\underset{\theta \in \mathbb{R}^{|E|},\, l \in \mathbb{R}^{n_{\mathrm{train}}}_+}{\text{minimize}} & \displaystyle\sum_{k \in K_{\mathrm{train}}} l_k & (56a)\\
\text{subject to} & l_k \ge \theta^\top \hat{x}_k - \theta^\top x, \quad x \in X_k, \; k \in K_{\mathrm{train}}. & (56b)
\end{array}
$$
This problem is initialized without Constraints (56b), which are added iteratively via a cutting-plane method. Specifically, in each iteration, after solving Problem (56), let $\theta^*$ and $\{l_k^*\}_{k \in K_{\mathrm{train}}}$ be the optimal solution. For each data point $k \in K_{\mathrm{train}}$, we solve the following sub-problem:
$$\underset{x_k \in X(u_k)}{\text{minimize}} \; \theta^{*\top} x_k. \quad (57)$$
Let $x_k^*$ be the optimal solution to the sub-problem. If $l_k^* < \theta^{*\top} \hat{x}_k - \theta^{*\top} x_k^*$, we add the following cut to Problem (56):
$$l_k \ge \theta^\top \hat{x}_k - \theta^\top x_k^*. \quad (58)$$
We repeat this procedure until no cut is added to the master Problem (56).

C.4 Solving the Calibration Problem

C.4.1 Shortest Path

For each data point in the validation set, we calculate the value of $c_k$ by solving the following problem:
$$
\begin{array}{lll}
\underset{\theta \in \mathbb{R}^{|E|},\, w \in \mathbb{R}^{N},\, v \in \mathbb{R}^{|E|}_+}{\text{maximize}} & \bar{\theta}^\top \theta & (59a)\\
\text{subject to} & w_{d_k} - w_{o_k} - \displaystyle\sum_{(i,j) \in E} v_{ij} = \theta^\top \hat{x}_k & (59b)\\
& w_j - w_i - v_{ij} \le \theta_{ij}, \quad (i,j) \in E & (59c)\\
& \|\theta\|_2 \le 1, & (59d)
\end{array}
$$
where $w \in \mathbb{R}^{N}$ and $v \in \mathbb{R}^{|E|}_+$ denote the dual variables associated with the flow-balance constraints and the capacity constraints in the primal problem, respectively. The first constraint enforces strong duality. The second set of constraints are the dual feasibility constraints. The last constraint ensures the optimal solution is on the unit sphere. Note that we do not need to enforce $\|\theta\|_2 = 1$ because this is a maximization problem.

C.4.2 Knapsack

For each data point in the validation set, we calculate the value of $c_k$ by solving the following calibration problem:
$$
\begin{array}{lll}
\underset{\theta \in \mathbb{R}^d}{\text{maximize}} & \bar{\theta}^\top \theta & (60a)\\
\text{subject to} & \theta^\top \hat{x}_k \ge \theta^\top x, \quad x \in X(u_k) & (60b)\\
& \|\theta\|_2 \le 1. & (60c)
\end{array}
$$
We initialize this problem without Constraints (60b). In each iteration, after solving the calibration problem, we let $\theta^*$ denote the optimal solution. We then solve $\mathrm{FO}(\theta^*, u_k)$ and let $x^*$ denote its optimal solution. If $\theta^{*\top} x^* > \theta^{*\top} \hat{x}_k$, we add the corresponding cut to the model. We repeat this process until no cut is added.

C.5 Solving the Robust Forward Problem

Let $\alpha = \cos^{-1}\big(\Gamma_\tau(\{c_k\}_{k \in K_{\mathrm{val}}})\big)$. We then solve the following robust model to prescribe a decision given $u \in U$:
$$
\begin{array}{lll}
\underset{x \in X(u)}{\text{minimize}} \; \underset{\theta \in \mathbb{R}^{|E|}}{\text{maximize}} & \theta^\top x & (61a)\\
\text{subject to} & \bar{\theta}^\top \theta \ge \cos(\alpha) & (61b)\\
& \|\theta\|_2 \le 1. & (61c)
\end{array}
$$
We initialize this problem as follows:
$$
\begin{array}{lll}
\underset{x \in X(u),\, \Omega \in \mathbb{R}_+}{\text{minimize}} & \Omega & (62a)\\
\text{subject to} & \theta^\top x \le \Omega, \quad \theta \in \Theta. & (62b)
\end{array}
$$
We initialize $\Theta = \emptyset$. We first solve Problem (62); let $x^*$ and $\Omega^*$ denote the optimal solution. Then we solve the following sub-problem:
$$
\begin{array}{lll}
\underset{\theta \in \mathbb{R}^{|E|}}{\text{maximize}} & \theta^\top x^* & (63a)\\
\text{subject to} & \bar{\theta}^\top \theta \ge \cos(\alpha) & (63b)\\
& \|\theta\|_2 \le 1. & (63c)
\end{array}
$$
Let $\theta^*$ denote the optimal solution to the sub-problem. If $\theta^{*\top} x^* > \Omega^*$, we add $\theta^*$ to $\Theta$ and re-solve Problem (62). We repeat this procedure until no new solution is added to $\Theta$.
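To illustrate C.5, the sketch below implements this cutting-plane loop under stated assumptions: `solve_master` is a hypothetical problem-specific routine for Problem (62) (it encodes the feasible set $X(u)$ and must handle an empty scenario set), and the tolerance and iteration cap are our own choices.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def solve_subproblem(theta_bar: np.ndarray, alpha: float, x: np.ndarray) -> tuple[np.ndarray, float]:
    """Worst-case parameter for a fixed decision x (Problem (63)):
    maximize theta^T x s.t. theta_bar^T theta >= cos(alpha), ||theta||_2 <= 1."""
    m = gp.Model("sub")
    m.Params.OutputFlag = 0
    d = len(theta_bar)
    theta = m.addMVar(d, lb=-GRB.INFINITY, name="theta")
    m.setObjective(x @ theta, GRB.MAXIMIZE)
    m.addConstr(theta_bar @ theta >= np.cos(alpha))  # cone constraint (63b)
    m.addConstr(theta @ theta <= 1.0)                # unit-ball constraint (63c)
    m.optimize()
    return theta.X, m.ObjVal

def robust_forward(theta_bar, alpha, solve_master, tol=1e-6, max_iter=100):
    """Cutting-plane loop for Problem (61). `solve_master` solves Problem (62)
    over the scenario set Theta and returns (x, Omega)."""
    Theta = []  # scenario set, initialized empty
    for _ in range(max_iter):
        x, omega = solve_master(Theta)
        theta_new, worst = solve_subproblem(theta_bar, alpha, x)
        if worst <= omega + tol:  # no violated cut: x is robust-optimal
            return x
        Theta.append(theta_new)
    return x
```

Each iteration adds the worst-case cost vector for the current decision as a new cut, so the master's optimal value increases monotonically toward the robust optimum.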
D Additional Computational Results

In this section, we conduct additional computational experiments to assess how point-estimate quality affects the performance of the conformal IO pipeline. Specifically, we focus on the knapsack problem, where we randomly generate point estimates with an angular deviation $\delta$ from the ground-truth parameter $\theta^*$, with $\delta$ serving as a proxy for point-estimate quality. We then apply our calibration method to each point estimate and subsequently use the robust forward problem to generate decision recommendations. We vary the value of $\delta$ and report the average (std) out-of-sample AOG and POG achieved by our pipeline in Table 3.

Table 3: Mean (std) AOG and POG of conformal IO when varying point-estimate quality.

δ   | 0           | π/20        | π/10        | 3π/20       | π/5
POG | 8.45 (0.67) | 8.29 (0.66) | 8.94 (0.79) | 9.91 (0.95) | 11.58 (1.40)
AOG | 0.00 (0.00) | 0.16 (0.07) | 0.30 (0.19) | 0.74 (0.31) | 1.38 (0.53)

In general, conformal IO benefits from more accurate point estimates (i.e., when $\delta$ is small). However, the trend is not always strictly monotonic. For instance, as $\delta$ increases from 0 to $\pi/10$, the POG initially decreases before rising again. While beyond the scope of this work, exploring point-estimation methods that optimize the performance of the downstream robust optimization problem could be a valuable direction for future research.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: Theoretical results are proved in Sections 3 and 4 (complete proofs are in the appendix). These results are verified numerically in Section 5.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: Limitations and future research directions are discussed in Section 6.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Assumptions are stated and justified in Section 3.1.1. Proofs are in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Experiment details are well documented in the paper. We also open source our implementation at https://anonymous.4open.science/r/Conformal IO-B776. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. 
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: Data and source code are available at https://anonymous.4open.science/r/ConformalIO-B776.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: This information is documented in Section 5.
7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We report error bars/standard errors for all experimental results in Section 5.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: This information is disclosed in Appendix C.

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have reviewed the NeurIPS Code of Ethics and confirm that our paper conforms to it.

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This is a theoretical/methodological paper that presents foundational research in optimization. We do not see a direct path to any negative societal impact.
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This is a theoretical/methodological paper that presents foundational research in optimization. It does not involve any high-risk data or models.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We properly cite the code and papers on which our research builds.
13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: Our code and data, along with documentation, are available at https://anonymous.4open.science/r/ConformalIO-B776.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: Our paper does not involve crowdsourcing or research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: Our paper does not involve crowdsourcing or research with human subjects.