# From Inverse Optimization to Feasibility to ERM

Saurabh Mishra¹, Anant Raj², Sharan Vaswani¹

¹Simon Fraser University. ²SIERRA Project Team (Inria), Coordinated Science Laboratory (CSL), UIUC. Correspondence to: Saurabh Mishra, Sharan Vaswani.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Inverse optimization involves inferring unknown parameters of an optimization problem from known solutions, and is widely used in fields such as transportation, power systems, and healthcare. We study the contextual inverse optimization setting that utilizes additional contextual information to better predict the unknown problem parameters. We focus on contextual inverse linear programming (CILP), addressing the challenges posed by the non-differentiable nature of LPs. For a linear prediction model, we reduce CILP to a convex feasibility problem, allowing the use of standard algorithms such as alternating projections. The resulting algorithm for CILP is equipped with a linear convergence guarantee without additional assumptions such as degeneracy or interpolation. Next, we reduce CILP to empirical risk minimization (ERM) on a smooth, convex loss that satisfies the Polyak-Lojasiewicz condition. This reduction enables the use of scalable first-order optimization methods to solve large non-convex problems while maintaining theoretical guarantees in the convex setting. Subsequently, we use the reduction to ERM to quantify the generalization performance of the proposed algorithm on previously unseen instances. Finally, we experimentally validate our approach on synthetic and real-world problems, and demonstrate improved performance compared to existing methods.

## 1. Introduction

Inverse optimization (Heuberger, 2004) is the reverse of standard optimization: it uses a known output (decision) of an optimization problem to infer the unknown problem parameters. For example, in the context of linear programming (LP), inverse optimization uses the optimal solutions to the LP in order to learn the coefficients (costs) that can produce these solutions. Inverse optimization has found applications in transportation (Bertsimas et al., 2015), power systems (Birge et al., 2017), and healthcare (Chan et al., 2022) (refer to Chan et al. (2023) for a recent survey).

Figure 1: CIO framework: the model $f_\theta$ takes input $z$ and predicts the cost vector $c = f_\theta(z)$. This cost vector is the input of an optimization procedure that outputs the decision $x(c)$. Given the optimal decision $x^*$, the objective is to learn the model parameters such that the predicted decision $x(c)$ is close to the optimal decision. To train the model in an end-to-end fashion, the key challenge is to compute the gradient of the decision $x(c)$ w.r.t. $c$ (shown in red in the figure).

We focus on integrating additional contextual information into the inverse optimization framework. In particular, we leverage historical data and a machine learning (ML) model to predict the (unknown) optimization problem parameters that can render the (known) optimal decisions. This setting is commonly referred to as contextual inverse optimization (CIO) (Besbes et al., 2023; Sun et al., 2023) or data-driven inverse optimization (Mohajerin Esfahani et al., 2018).
CIO requires a combination of prediction and optimization, and has found applications in optimal transport and vehicle routing (Li et al., 2022), financial modeling (Cornuejols & Tütüncü, 2006), power systems (Bansal, 2005; Li et al., 2018), healthcare (Angalakudati et al., 2014), circuit design (Boyd et al., 2001), and robotics (Raja & Pugazhenthi, 2012). Some use-cases for CIO are as follows.

Example 1: Energy-cost aware scheduling (Wahdany et al., 2023), which involves using weather data to forecast wind-energy generation and hence energy prices (prediction). These predictions can be used to schedule jobs (optimization) to minimize energy costs. For CIO, the contextual information corresponds to weather data, and the solutions (decisions) correspond to past schedules.

Example 2: Shortest path planning (Guyomarch, 2017), which involves predicting the time taken through different routes or terrain (prediction). These predictions can be used to determine the shortest path between two locations (optimization). For CIO, the contextual information corresponds to images or features of the terrain, and the decisions correspond to known shortest paths for pairs of locations.

Example 3: Inverse reinforcement learning (IRL) (Ng et al., 2000), which involves learning the underlying reward function in a Markov decision process (MDP) from the observed behaviour of a human expert. The learned reward function can be used to infer a good policy for an artificial agent. Assuming that the human expert acts in order to maximize the implicit reward function, for CIO, the context corresponds to features of the MDP, and the decisions correspond to the observed expert behaviour.

Example 4: In rational choice theory, a common way to model agents (e.g. users interacting with a recommendation system) is to assume that (i) the agent is rational and is making decisions to optimize some unknown implicit utility function, and that (ii) the form (but not the parameters) of this utility function is known (e.g. whether it is linear or concave). For recommendation systems, the user's demographics and other metadata correspond to the context. In CIO, this context is used to predict the unknown parameters of the utility function, such that when it is maximized, it can explain the user's past purchases (corresponding to the known decisions) (Wilder et al., 2019). The estimated utility function can then be used to make better recommendations.

Since numerous combinatorial problems, including shortest path, max-flow, and perfect matching, can be cast as linear programs, we mainly focus on the case where the optimization problem is a linear program (LP) (in Section 4.4, we briefly consider more general non-linear problems). The examples of CIO presented above fall within the LP framework. For LPs, the key challenge of contextual inverse linear programming (CILP) lies in the non-differentiable nature of LPs. This limitation precludes the direct use of auto-differentiation techniques. To overcome this problem, we make the following contributions.

Contribution 1: For a linear prediction model, we reduce CILP to a convex feasibility problem (Section 4). This reduction enables the use of standard algorithms such as alternating projections. Unlike existing work (Sun et al., 2023), the resulting method (Algorithm 1) guarantees linear convergence to the solution without additional assumptions such as degeneracy or interpolation.
Contribution 2: To efficiently handle large-scale problems, we reduce the feasibility problem (and hence CILP) to empirical risk minimization (ERM) on a smooth, convex loss function satisfying the Polyak-Lojasiewicz (PL) condition (Polyak, 1964) (Section 5). This reduction allows us to employ scalable first-order optimization methods while retaining strong theoretical guarantees.

Contribution 3: In Section 6, we argue about the shortcomings of previous measures of performance for CILP, and propose a new sub-optimality metric. Subsequently, we use the reduction to ERM to quantify the performance of the proposed algorithm on previously unseen instances.

Contribution 4: In Section 7, we validate the effectiveness of our approach with experiments on synthetic shortest path and fractional knapsack problems (Sun et al., 2023), and real-world Warcraft shortest path and MNIST perfect matching tasks (Vlastelica et al., 2019). Our empirical results demonstrate that the proposed algorithm results in improved performance compared to the prior work.

## 2. Related Work

In this section, we review the related works, contrasting them with our proposed approach.

Inverse Optimization: Iyengar & Kang (2005) use the Karush-Kuhn-Tucker (KKT) optimality conditions of the LP to define the feasible set of cost vectors. Similarly, Mohajerin Esfahani et al. (2018) use the Wasserstein metric to find a set of robust cost vectors by formulating an inverse optimization problem. However, they do not consider the contextual setting, so no learning is required. In contrast, we use the KKT conditions to train a prediction model, mapping contextual information to optimal decisions. More recently, Besbes et al. (2023) consider solving CIO in both the online and offline settings. Their offline setting is similar to our problem formulation, but does not make any linearity or convexity assumptions. They derive bounds on the worst-case suboptimality for a specific mapping from features to cost vectors. However, it is unclear whether this mapping can be efficiently computed even in the special case of LPs. Furthermore, Besbes et al. (2023) assume realizability, i.e., there is no noise in the decisions and the model can perfectly fit (interpolate) the data. This further limits the practical utility of their framework. In contrast, we make no realizability assumptions and develop efficient algorithms for CILP.

Using the reduced cost optimality condition: Sun et al. (2023) proposed a method that uses the reduced cost optimality conditions (Luenberger et al., 1984) for LPs. The method constructs a surrogate loss function that encourages the prediction to satisfy the reduced cost optimality conditions. The resulting method has theoretical convergence guarantees, assuming that the LPs are non-degenerate and that the model can interpolate the training data. Both of these are strong assumptions and are not necessarily satisfied in practice. In contrast, we use the KKT conditions, which are equivalent to the reduced cost optimality conditions for non-degenerate LPs (see Appendix B.2 for a proof). Since the KKT conditions can also be used for degenerate LPs, our proposed framework provides theoretical guarantees without relying on this assumption. Moreover, our guarantees hold even without assuming interpolation.

Differentiating through LPs: Vlastelica et al. (2019) estimate the gradient through the LP by calculating the change in the decision when the prediction is perturbed.
However, this approach introduces additional hyper-parameters that are non-trivial to tune. Another common technique is to use the straight-through estimator (ST) (Sahoo et al., 2022). Given a set of predictions from the model, the ST method uses the LP to estimate the decisions. However, it does not consider the LP (it treats the corresponding Jacobian as an identity matrix) while back-propagating the gradient from the decisions to the model parameters. Though successful in practice, this method is not theoretically principled. The method in Berthet et al. (2020) computes expected gradients by perturbing the prediction target in different directions. While this method accurately models the gradient, it is not practically feasible because of the computational cost of solving LPs multiple times for each update to the model. One advantage of these techniques is their black-box nature, meaning that they only rely on the outputs from an LP, thus allowing the use of faster problem-specific solvers. In contrast, our work leverages the properties of LPs (and of some other problems described in Section 4.4), allowing us to develop a more efficient and principled approach.

Implicit Differentiation: The methods in Amos & Kolter (2017) and Amos (2019) focus on (strongly)-convex optimization problems. They calculate the gradient through such problems by differentiating through the optimality (KKT) conditions. However, since the solutions of LPs are located at the corners of the feasible polytope, this method yields zero gradients for LPs. To address this, QPTL (Wilder et al., 2019) adds a quadratic regularization to the LP, thus relaxing the problem to a non-linear strongly-convex quadratic program (QP), and then uses the technique in Amos & Kolter (2017). Similarly, Cameron et al. (2022) relax Mixed Integer Programs (MIPs) by using a log-barrier regularization, followed by the use of the techniques in Amos & Kolter (2017). These approaches suffer from two notable limitations: (i) they introduce additional hyper-parameters (the regularization strength), and (ii) they are only guaranteed to converge to the vicinity of the optimal solution (because of the bias introduced by the regularization). While our proposed framework also uses the KKT conditions, it ensures convergence to the optimal solution without introducing additional hyper-parameters.

Predict and Optimize: The CIO problem is related to the predict and optimize (PO) framework (Elmachtoub & Grigas, 2022). In contrast to CIO, PO requires knowledge of the ground-truth costs. Our work does not assume access to this additional information, but we note that the proposed algorithms can be directly used in the PO setting.

In the next section, we formally state the problem and highlight the technical challenges.

## 3. Problem Formulation

We consider the optimization procedure to be a linear program (LP). Without loss of generality, we assume the standard form of the LP and define $\hat{x}(c)$ as the solution to the LP with cost vector $c \in \mathbb{R}^m$:

$$\hat{x}(c) := \arg\min_x \ \langle c, x \rangle \quad \text{s.t.} \quad Ax = b, \ x \ge 0,$$

where $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$. The CILP problem consists of a training dataset $D = \{z_i, x_i^*\}_{i=1}^{N}$, where $z_i \in \mathbb{R}^{1 \times d}$ is the input ($Z \in \mathbb{R}^{N \times d}$ is the corresponding feature matrix) and $x_i^* \in \mathbb{R}^m$ is the corresponding optimal decision. We assume that the LP parameters $(A, b)$ encoding the constraints are known, whereas the cost vector $c$ is unknown and will be predicted using the data.
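To make the forward problem concrete, below is a minimal sketch of solving the standard-form LP $\hat{x}(c)$ with CVXPY (the library the paper's experiments use, with the ECOS solver for LPs). The problem data are illustrative placeholders, not from the paper.

```python
import cvxpy as cp
import numpy as np

def solve_lp(c, A, b):
    """Return an optimal solution x_hat(c) of min <c, x> s.t. Ax = b, x >= 0."""
    m = A.shape[1]
    x = cp.Variable(m, nonneg=True)        # encodes x >= 0
    problem = cp.Problem(cp.Minimize(c @ x), [A @ x == b])
    problem.solve(solver=cp.ECOS)          # the paper uses ECOS for LPs
    return x.value

# Toy instance: pick one of three items, minimizing cost.
A = np.array([[1.0, 1.0, 1.0]])            # constraint: x1 + x2 + x3 = 1
b = np.array([1.0])
c = np.array([2.0, 1.0, 3.0])
print(solve_lp(c, A, b))                   # mass concentrates on the cheapest coordinate
```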
Example: In the context of the shortest path problem (Example 2 in Section 1), consider an arbitrary pair $x^*, c \in \mathbb{R}^m$ in the dataset; for all $j \in [m]$, the variables $x_j^* \in \{0, 1\}$ denote whether an edge is included in the shortest path, and the weight of each edge is represented by the cost $c_j$. To ensure a valid path from the start vertex $s$ to the target vertex $t$, the flow constraints are encoded via $A$ and $b$. These constraints ensure that every vertex, except $s$ and $t$, maintains an equal number of incoming and outgoing edges. Vertex $s$ is constrained to have exactly one outgoing edge, and vertex $t$ has precisely one incoming edge.

When using a model $f_\theta$ with parameters $\theta$ to predict the cost vector, we define $\hat{x}(\hat{c})$ as:

$$\hat{x}(\hat{c}) := \arg\min_x \ \langle \hat{c}, x \rangle \quad \text{s.t.} \quad Ax = b, \ x \ge 0, \quad \hat{c} = f_\theta(z).$$

Given the dataset $D$ and knowledge of $(A, b)$, the CILP objective is to learn $\theta$ such that $\hat{x}(f_\theta(z_i)) = x_i^*$ for all $i \in [N]$.

### 3.1. Challenge in Gradient Estimation

To gain some intuition as to why the typical end-to-end learning approach via auto-differentiation (Paszke et al., 2019) will not work for CILP, consider using the squared loss to quantify the discrepancy between $\hat{x}(f_\theta(z_i))$ and $x_i^*$, i.e. $\ell(\theta) := \frac{1}{2}\sum_{i=1}^N \|\hat{x}(f_\theta(z_i)) - x_i^*\|^2$. Using the chain rule to compute the gradient with respect to $\theta$, we get

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial \hat{x}} \, \frac{\partial \hat{x}}{\partial c} \, \frac{\partial c}{\partial \theta}.$$

The first and last terms can be easily calculated. However, for an LP, the decision $\hat{x}$ is piece-wise constant with respect to $c$, and $\frac{\partial \hat{x}}{\partial c}$ is either 0 or undefined. Consequently, in the next section, we aim to develop an algorithm that does not rely on directly calculating $\frac{\partial \hat{x}}{\partial c}$.

## 4. Reduction to Convex Feasibility

For a linear model, we reduce the CILP problem to convex set feasibility (Section 4.1). In Section 4.2, we use alternating projections onto convex sets (POCS) to solve the feasibility problem and completely instantiate Algorithm 1. In Section 4.3, we describe some practical considerations when using Algorithm 1. In Section 4.4, we describe how to extend these ideas to handle non-linear but convex objectives and constraints.

### 4.1. Reduction

Recall that for an input $(z, x^*) \in D$, we aim to find a $c$ such that $\hat{x}(c) = x^*$. However, due to the non-uniqueness of the mapping from $x^*$ to $c$, there are potentially infinitely many values of $c$ that can yield $x^*$. We define $C$ to represent the set encompassing all such values of $c$. The set $C$ can be represented by exploiting the optimality conditions of the LP. The KKT conditions (Kuhn & Tucker, 1951) give necessary and sufficient conditions for the optimality of the LP. If $x^*$ is the solution to the standard LP, then the KKT conditions can be written as follows:

$$\nu^T A + \lambda - c = 0, \quad x^* \circ \lambda = 0, \quad \lambda \ge 0, \quad Ax^* = b, \quad x^* \ge 0,$$

where $\lambda \in \mathbb{R}^m_+$, $\nu \in \mathbb{R}^n$ are the dual variables, $x_i^* \lambda_i = 0$ (for all $i$) represents the complementary slackness condition, and $Ax^* = b$, $x^* \ge 0$ represents the feasibility of $x^*$. At optimality, there exist dual variables $(\lambda, \nu)$ such that the tuple $(x^*, \lambda, \nu, c)$ satisfies the KKT conditions. Since the KKT optimality conditions are both necessary and sufficient, given an optimal solution $x^*$, we can identify the set of cost vectors $c$ that satisfy these conditions. This enables us to define the convex set $C$ as:

$$C = \{c \mid \exists\, \lambda, \nu \ \text{s.t.} \ \nu^T A + \lambda - c = 0, \ x^* \circ \lambda = 0, \ \lambda \ge 0\} \tag{1}$$

Note that we omit the conditions $Ax^* = b$, $x^* \ge 0$ as they are satisfied by definition for the optimal solution $x^*$. For any $\lambda, \nu$, the set $C$ is affine and hence convex in $c$. We define $F$ as the set of cost vectors that are realizable by the linear model parameterized by $\theta \in \mathbb{R}^{d \times m}$. Formally, $F$ can be written as:

$$F = \{c \mid \exists\, \theta \ \text{s.t.} \ c = z\theta\}. \tag{2}$$

For a linear model, the set $F$ is linear and hence convex in $c$.
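A minimal sketch of the set $C$ in Eq. (1) follows: given an optimal decision $x^*$, it finds one feasible $(c, \lambda, \nu)$ tuple with CVXPY. This is illustrative code under our own naming, not the paper's implementation; note that $c = 0$ (with $\lambda = \nu = 0$) is always feasible, which is one motivation for the margin introduced in Section 4.3.

```python
import cvxpy as cp
import numpy as np

def find_cost_in_C(x_star, A):
    """Return some c in the set C of Eq. (1) for the optimal decision x_star."""
    n, m = A.shape
    c, nu = cp.Variable(m), cp.Variable(n)
    lam = cp.Variable(m, nonneg=True)          # dual variable lambda >= 0
    on = np.flatnonzero(x_star > 0)            # support of x_star (assumed non-empty)
    constraints = [A.T @ nu + lam - c == 0,    # stationarity: nu^T A + lambda - c = 0
                   lam[on] == 0]               # complementary slackness: lambda_i = 0 if x*_i > 0
    cp.Problem(cp.Minimize(0), constraints).solve()
    return c.value
```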
The objective of CILP is to find a $c \in C$ (resulting in the optimal solution $x^*$) that is also realizable by the model, i.e., that lies in the set $F$. Hence, we aim to find a $c$ that lies in the intersection $C \cap F$. Therefore, in this setting, CILP is equivalent to a convex feasibility problem.

### 4.2. Algorithm

The commonly employed method for solving convex feasibility problems is the alternating projections (POCS) algorithm (Von Neumann, 1949; Bauschke & Borwein, 1996). The POCS algorithm alternately projects a point onto the two sets. The algorithm is guaranteed to converge to a point in the intersection if the intersection is non-empty; otherwise, it converges to the closest point between the two sets (Deutsch, 1984; Bauschke & Borwein, 1993). Moreover, the rate of convergence is linear in the number of POCS iterations. In order to use the POCS algorithm for the CILP problem, we require the projection of an arbitrary point $q \in \mathbb{R}^m$ onto the set $C$. This corresponds to solving the following quadratic program (QP):

$$P_C(q) := \arg\min_c \ \|c - q\|_2^2 \quad \text{s.t.} \quad \nu^T A - c + \lambda = 0, \ \lambda \circ x^* = 0, \ \lambda \ge 0 \tag{3}$$

Eq. (3) returns a point $P_C(q)$, the Euclidean projection of $q$ onto $C$. For the projection of a point $q$ onto the set $F$, we require solving the following regression problem:

$$\hat{\theta} := \arg\min_\theta \ \frac{1}{2}\|q - z\theta\|^2; \quad P_F(q) = z\hat{\theta} \tag{4}$$

Eq. (4) returns a point $P_F(q)$, the Euclidean projection of $q$ onto $F$. Hence, POCS consists of alternately solving the optimization problems in Eq. (3) and Eq. (4). In order to extend the above idea to $N$ training points, we define sets $C_i$ analogous to Eq. (1) for each $i \in [N]$. We define $\mathcal{C} := C_1 \times C_2 \times \ldots \times C_N$ to be the Cartesian product of these sets. Hence, $\mathcal{C}$ consists of concatenated vectors $(c^1, c^2, \ldots, c^N) \in \mathbb{R}^{Nm}$ where $c^i \in C_i$. Similarly, we define $\mathcal{F} := \{(c^1, c^2, \ldots, c^N) \mid \exists\, \theta \ \text{s.t.} \ \forall i \in [N], \ c^i = z_i\theta\}$. Hence, the projection onto $\mathcal{C}$ corresponds to solving the QP in Eq. (3) for every point $i \in [N]$. Projecting an arbitrary point $q = (q^1, q^2, \ldots, q^N) \in \mathbb{R}^{Nm}$ onto $\mathcal{F}$ involves solving the regression problem $\hat{\theta} := \arg\min_\theta \frac{1}{2N}\sum_{i=1}^N \|q^i - z_i\theta\|^2$. Since the Cartesian product of convex sets is convex, both $\mathcal{C}$ and $\mathcal{F}$ are convex, and hence the resulting POCS algorithm will converge at a linear rate.

Algorithm 1: CILP
```text
Input: A, b, training dataset D = {(z_i, x_i*)}_{i=1}^N, model f_θ
Initialize θ_1
for t = 1, 2, ..., T do
    ĉ_i = f_{θ_t}(z_i) for all i ∈ [N]
    for i = 1, 2, ..., N do
        q_i = P_{C_i}(ĉ_i) by solving the optimization problem in Eq. (3)
    end
    θ_{t+1} = argmin_θ (1/2N) Σ_{i=1}^N ||q_i − f_θ(z_i)||²
end
Output: θ_{T+1}
```

Finally, we note that our algorithmic framework can handle a generic model $f_\theta$, though our theoretical results only hold for a linear model. The complete algorithm for a generic model $f_\theta$ is described in Algorithm 1. Next, we describe practical considerations when implementing the algorithm.
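Below is a minimal sketch of Algorithm 1 for a linear model; it is illustrative rather than the paper's released code. `project_C` solves the QP in Eq. (3) (the paper uses the OSQP solver for QPs), and the least-squares step implements the projection onto $\mathcal{F}$ in Eq. (4). The margin of Section 4.3 is omitted for simplicity.

```python
import cvxpy as cp
import numpy as np

def project_C(q, x_star, A):
    """Euclidean projection of q onto C_i by solving the QP in Eq. (3)."""
    n, m = A.shape
    c, nu = cp.Variable(m), cp.Variable(n)
    lam = cp.Variable(m, nonneg=True)
    on = np.flatnonzero(x_star > 0)             # lambda_j = 0 wherever x*_j > 0
    cons = [A.T @ nu + lam - c == 0, lam[on] == 0]
    cp.Problem(cp.Minimize(cp.sum_squares(c - q)), cons).solve(solver=cp.OSQP)
    return c.value

def pocs(Z, X_star, A, T=20):
    """Algorithm 1 for a linear model: alternate between Eq. (3) and Eq. (4)."""
    N, d = Z.shape
    theta = np.zeros((d, A.shape[1]))
    for _ in range(T):
        C_hat = Z @ theta                        # predicted costs, one row per example
        Q = np.stack([project_C(C_hat[i], X_star[i], A) for i in range(N)])
        theta, *_ = np.linalg.lstsq(Z, Q, rcond=None)   # projection onto F (Eq. (4))
    return theta
```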
### 4.3. Practical considerations: Margin

Recently, Sun et al. (2023) noted the benefits of using a margin with the LP optimality conditions (Luenberger et al., 1984). In Appendix B.2, we show how to modify their margin formulation for the KKT conditions, resulting in the modified set $C$ below:

$$C = \{c \mid \exists\, \lambda, \nu \ \text{s.t.} \ \nu^T A - c + \lambda = 0, \ \lambda_i \, \mathbb{I}\{x_i^* \neq 0\} = 0, \ \lambda_i \, \mathbb{I}\{x_i^* = 0\} \ge \chi\} \tag{5}$$

There are two key motivations for adding the margin: (i) it ensures that the algorithm does not converge to the trivial solution $c = 0$, and (ii) it ensures that the algorithm converges to the interior (rather than the boundary) of $C$, making it more robust to perturbations and improving the algorithm's generalization performance (El Balghiti et al., 2019) on previously unseen instances. We note that our framework and the resulting algorithm are not limited to LPs. Next, we describe how to extend these ideas to handle non-linear but convex objectives and constraints.

### 4.4. Handling non-linear optimization problems

Similar to the linear case, the KKT optimality conditions can be used to derive a convex set $C$ for a class of non-linear convex objectives and constraints (Iyengar & Kang, 2005). As an example, we instantiate $C$ for a specific quadratic program (QP) below and defer the general case to Appendix B:

$$\hat{x}(c) := \arg\min_{x \ge 0} \ -\langle c, x \rangle + \frac{\gamma}{2} x^T Q x \quad \text{s.t.} \quad \sum_{i=1}^m x_i = 1 \tag{6}$$

Example: For the portfolio optimization problem (Fabozzi et al., 2008) common in econometrics, $x_j$ and $c_j$ in Eq. (6) represent the fraction of investment and the expected return for stock $j \in [m]$, respectively. The matrix $Q \in \mathbb{R}^{m \times m}$ represents the risk associated with selecting similar stocks, and $\gamma$ is the given risk tolerance. The task is to maximize the return while minimizing the risk, subject to simplex constraints. Given historical data, a model that can be used to predict the expected return and risk matrix, and the best portfolios in hindsight, the CIO problem is to infer the model parameters. In this case, the convex set $C$, consisting of the pairs $(c, Q)$ that satisfy the KKT conditions of the QP, is given as:

$$C = \Big\{c, Q \ \Big|\ \exists\, \lambda, \nu \ \text{s.t.} \ \nu \mathbf{1}_m + \lambda + c - \frac{\gamma}{2}(Q + Q^T)x^* = 0, \ x^* \circ \lambda = 0, \ \lambda \ge 0 \Big\}$$

where $\lambda \in \mathbb{R}^m_+$, $\nu \in \mathbb{R}$, and $\mathbf{1}_m = (1, 1, \ldots, 1)$. The set $C$ is linear and therefore convex in $\{c, Q\}$. Consequently, when using a linear prediction model, the inverse problem for portfolio optimization can also be reduced to convex feasibility. For a general non-linear convex objective $\phi(x, \omega)$, where $\omega$ represents a general cost vector, the set $C$ is convex in $\omega$ if $\frac{\partial \phi(x, \omega)}{\partial x}\big|_{x = x^*}$, the derivative of the objective function $\phi$ w.r.t. $x$ evaluated at $x^*$, is convex in $\omega$. For instance, this condition also applies to semi-definite programs. Finally, we note that our framework is not limited to linear constraints, and can easily handle non-linear convex constraints. Please refer to Appendix B.1 for the derivation.
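For the portfolio QP above, the projection onto $C$ is again a QP, now jointly over $(c, Q)$. Below is a minimal sketch mirroring Eq. (3); the function name and interface are our own, and the constraint follows the stationarity condition stated above.

```python
import cvxpy as cp
import numpy as np

def project_portfolio_C(c_hat, Q_hat, x_star, gamma):
    """Project a predicted (c_hat, Q_hat) onto the KKT set C of the QP in Eq. (6)."""
    m = x_star.shape[0]
    c, Q = cp.Variable(m), cp.Variable((m, m))
    lam, nu = cp.Variable(m, nonneg=True), cp.Variable()
    on = np.flatnonzero(x_star > 0)          # complementary slackness: lambda_j = 0 if x*_j > 0
    cons = [nu * np.ones(m) + lam + c - (gamma / 2) * (Q + Q.T) @ x_star == 0,
            lam[on] == 0]
    obj = cp.Minimize(cp.sum_squares(c - c_hat) + cp.sum_squares(Q - Q_hat))
    cp.Problem(obj, cons).solve(solver=cp.OSQP)
    return c.value, Q.value
```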
### 4.5. Challenges for solving large-scale problems

The POCS approach described above requires computing the exact projection of a point $q$ onto the set $\mathcal{F}$. For high-dimensional problems, computing these exact projections is computationally expensive. Moreover, for non-linear models such as neural networks, $\mathcal{F}$ is non-convex and the resulting projection is ill-defined. Additionally, computing the exact projection requires iterating through the entire dataset of $N$ points, which can be prohibitive for the large datasets typical in practice. Consequently, in the next section, we reduce the problem to empirical risk minimization on an appropriate smooth, convex loss satisfying the PL condition. This reduction enables the use of computationally efficient (stochastic) first-order optimization algorithms common in the machine learning literature.

## 5. Reduction to Empirical Risk Minimization

For a linear model, in Section 5.1, we reduce the feasibility problem to empirical risk minimization (ERM) on an appropriate smooth, convex loss satisfying the PL condition and prove that the preconditioned gradient method (with a specific preconditioner) on this loss is equivalent to the POCS approach of Algorithm 1. Subsequently, in Section 5.3, we consider using computationally efficient (stochastic) first-order methods for minimizing the loss functions.

### 5.1. Reduction

We define the loss function $h(\theta)$ as follows:

$$h(\theta) := \frac{1}{2N}\sum_{i=1}^N \min_{q_i \in C_i} \|f_\theta(z_i) - q_i\|^2, \tag{7}$$

where $C_i$ represents the set of feasible cost vectors (defined in Eq. (1)) for data-point $i$. Hence, $h(\theta)$ represents the mean (across the data-points) of squared distances between the predicted cost vector $f_\theta(z_i)$ and the set $C_i$. In order to better interpret $h(\theta)$, consider a point $c_\theta = (c^1, c^2, \ldots, c^N) \in \mathbb{R}^{Nm}$ such that $c^i = f_\theta(z_i)$. Hence, $h(\theta) = \frac{d^2(c_\theta, \mathcal{C})}{2N}$, where $d^2(w, W)$ is the squared Euclidean distance of a point $w$ to the set $W$. Since $c_\theta \in \mathcal{F}$, minimizing $h(\theta)$ is related to minimizing the distance between the sets $\mathcal{F}$ and $\mathcal{C}$. Formally, in Proposition 5.1 (proved in Appendix B.4), we reduce the feasibility problem in Section 4 to minimizing $h(\theta)$.

Proposition 5.1. The point $\hat{c} := (c^1, c^2, \ldots, c^N)$, where $c^i = z_i\bar{\theta}$ and $\bar{\theta} \in \arg\min h(\theta)$, lies in the intersection $\mathcal{C} \cap \mathcal{F}$ if it exists; else $\hat{c} \in \mathcal{F}$ is the point closest to $\mathcal{C}$.

### 5.2. Properties of h(θ)

For a linear model where $f_\theta(z) = z\theta$, we show that $h(\theta)$ has desirable properties that allow it to be minimized efficiently. In the proposition below, we establish the convexity and smoothness of $h(\theta)$.

Proposition 5.2. For a linear model $f_\theta(z) = z\theta$ parameterized by $\theta \in \mathbb{R}^{d \times m}$, assuming (without loss of generality) that $\forall i$, $\|z_i\| \le 1$, $h(\theta)$ is a 1-smooth convex function.

The proof of the above proposition is included in Appendix B.3. In addition to convexity, we prove that when using a linear model, $h(\theta)$ satisfies the Polyak-Lojasiewicz (PL) inequality (Polyak, 1964; Karimi et al., 2016). The PL condition is a gradient domination property that implies curvature near the minima and entails that every stationary point is a global minimum. Formally, the PL inequality states that there exists a constant $\mu > 0$ such that for all $\theta$,

$$\|\nabla h(\theta)\|^2 \ge 2\mu \, (h(\theta) - h^*), \tag{8}$$

where $h^*$ is the minimum of $h$.

Proposition 5.3. For a linear model $f_\theta(z) = z\theta$ and assuming (i) (without loss of generality) that $\forall i$, $\|z_i\| \le 1$ and (ii) $\lambda_{\min}[Z^T Z] > 0$, $h(\theta)$ is not necessarily strongly-convex but satisfies the PL inequality with $\mu = \lambda_{\min}\big[\frac{\sum_{i=1}^N z_i^T z_i}{N}\big]$.

We include the proof in Appendix B.6 and note that such a result, showing that squared distance functions to (non-)convex sets are PL, was also recently shown in Garrigos (2023). Importantly, we note that convexity coupled with the PL condition implies that the function satisfies the restricted secant inequality (RSI) (Zhang & Yin, 2013), a condition stronger than PL but weaker than strong-convexity (Karimi et al., 2016, Theorem 2). Since we have reduced CILP to minimizing a loss function with desirable properties, we can use computationally efficient techniques like gradient descent and its stochastic and adaptive variants (Kingma & Ba, 2014; Duchi et al., 2011).
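Evaluating $h(\theta)$ only requires the per-example projections onto $C_i$. A minimal sketch for the linear-model case follows, reusing the illustrative `project_C` helper from Section 4.2 (our own naming, not the paper's code).

```python
import numpy as np

def loss_h(theta, Z, X_star, A, project_C):
    """Evaluate the loss h(theta) of Eq. (7): mean squared distance to the sets C_i."""
    C_hat = Z @ theta                                   # c_i = z_i theta
    sq_dists = [np.sum((C_hat[i] - project_C(C_hat[i], X_star[i], A)) ** 2)
                for i in range(Z.shape[0])]
    return 0.5 * np.mean(sq_dists)
```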
### 5.3. First-order Methods

We first show that for a linear model, Algorithm 1 is equivalent to the preconditioned gradient method on $h(\theta)$. With $Z \in \mathbb{R}^{N \times d}$ being the feature matrix, the preconditioned gradient update for minimizing $h(\theta)$ at iteration $t \in [T]$ with step-size $\eta$ and preconditioner $\big[\frac{Z^T Z}{N}\big]^{-1}$ is given as:

$$\theta_{t+1} = \theta_t - \eta \, [Z^T Z]^{-1} Z^T (Z\theta_t - q_t), \tag{9}$$

where $q_t = P_{\mathcal{C}}(Z\theta_t)$ and $\|Z\theta_t - q_t\|$ is the Euclidean distance to the set $\mathcal{C}$ at iteration $t$. Consequently, the point $Z\theta_{t+1}$ is exactly the Euclidean projection of $q_t$ onto the set $\mathcal{F}$. In Appendix B.5, we prove the following proposition.

Proposition 5.4. For a linear model $f_\theta(z) = z\theta$, the iterates corresponding to the preconditioned gradient method on $h(\theta)$ with $\eta = 1$ are identical to Algorithm 1.

We note that for general non-linear models, this connection to POCS and hence Algorithm 1 does not necessarily hold. Next, we consider minimizing $h(\theta)$ using gradient descent (GD) with step-size $\eta_t$ at iteration $t$. This results in the following general update:

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{N} \sum_{i=1}^N \frac{\partial f_\theta(z_i)}{\partial \theta}\Big|_{\theta=\theta_t} [f_\theta(z_i) - q_{i,t}], \tag{10}$$

where $q_{i,t} = P_{C_i}(f_{\theta_t}(z_i))$ and $\frac{\partial f_\theta(z_i)}{\partial \theta}\big|_{\theta=\theta_t}$ is the Jacobian of $f_\theta$ at iterate $\theta_t$. For a linear model, this simplifies to:

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{N} Z^T (Z\theta_t - q_t), \tag{11}$$

where $q_t = P_{\mathcal{C}}(Z\theta_t)$ and $\|Z\theta_t - q_t\|$ is the Euclidean distance to the set $\mathcal{C}$ at iteration $t$. The update in Eq. (11) can be interpreted as an inexact projection of $q_t$ onto $\mathcal{F}$. If $\bar{\theta} \in \arg\min_\theta h(\theta)$, standard convergence results (Karimi et al., 2016) for smooth, convex and PL loss functions guarantee that GD, after $T$ iterations, returns $\theta_T$ such that $h(\theta_T) - h(\bar{\theta}) = O(\exp(-T))$. To illustrate what this convergence rate implies for the feasibility and consequently the CILP problem, consider the case where $\mathcal{C} \cap \mathcal{F}$ is non-empty. In this case, Proposition 5.1 guarantees that $h(\bar{\theta}) = 0$ and hence, using GD with $T = O(\ln(N/\epsilon))$ iterations is guaranteed to return a point $\hat{c} := (c^1, c^2, \ldots, c^N) \in \mathcal{F}$, where $c^i = z_i\theta_T$, that is $\epsilon$-close to $\mathcal{C}$. Compared to Algorithm 1, we see that GD retains the fast linear convergence rate to a point in $\mathcal{C} \cap \mathcal{F}$.

GD requires iterating through the entire dataset for each update, which is inefficient for large datasets. To address this, we use stochastic gradient descent (SGD) (Robbins & Monro, 1951). Writing $h(\theta) = \frac{1}{N}\sum_{i=1}^N h_i(\theta)$, where $h_i(\theta) = \frac{1}{2}\min_{q_i \in C_i} \|f_\theta(z_i) - q_i\|^2$, the SGD update with step-size $\eta_t$ at iteration $t$ is:

$$\theta_{t+1} = \theta_t - \eta_t \frac{\partial f_\theta(z_{i_t})}{\partial \theta}\Big|_{\theta=\theta_t} [f_\theta(z_{i_t}) - q_{i_t,t}], \tag{12}$$

where $i_t \in [N]$ is the index of the loss function sampled uniformly at random at iteration $t$. For a linear model,

$$\theta_{t+1} = \theta_t - \eta_t \, z_{i_t}^T (z_{i_t}\theta_t - q_{i_t,t}), \tag{13}$$

where $q_{i_t,t} = P_{C_{i_t}}(z_{i_t}\theta_t)$. Similar to GD, the update in Eq. (13) can be interpreted as an inexact projection of $q_{i_t,t}$ onto $\mathcal{F}$. Compared to GD, which has an $O(N)$ per-iteration cost, SGD has an $O(1)$ iteration cost, making it preferable for large datasets. However, in general, SGD has a slower rate of convergence compared to GD. Specifically, when minimizing smooth, convex and PL functions, $T$ iterations of SGD with a decreasing $O(1/t)$ step-size is guaranteed to return $\theta_T$ such that $\mathbb{E}[h(\theta_T)] - h(\bar{\theta}) = O(1/T)$ (Karimi et al., 2016; Gower et al., 2021), where the expectation is over the random sampling in each iteration. If an additional interpolation property is satisfied, SGD with a constant step-size can match the convergence rate of GD (Ma et al., 2018; Vaswani et al., 2019; Bassily et al., 2018; Raj & Bach, 2021). Formally, for convex loss functions, interpolation is satisfied when $\bar{\theta} := \arg\min h(\theta)$ also simultaneously minimizes each $h_i$, i.e., $\|\nabla h_i(\bar{\theta})\| = 0$ for all $i \in [N]$.
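The SGD update in Eq. (13) is cheap to implement; a minimal sketch for the linear-model case follows, again assuming the illustrative `project_C` helper from Section 4.2.

```python
import numpy as np

def sgd(Z, X_star, A, project_C, T=1000, eta=0.1, seed=0):
    """Run the SGD update of Eq. (13) for a linear model c = z theta."""
    rng = np.random.default_rng(seed)
    N, d = Z.shape
    theta = np.zeros((d, A.shape[1]))
    for _ in range(T):
        i = rng.integers(N)                                  # sample i_t uniformly at random
        z = Z[i:i + 1]                                       # row vector z_i (1 x d)
        q = project_C((z @ theta)[0], X_star[i], A)          # q_{i,t} = P_{C_i}(z_i theta_t)
        theta -= eta * z.T @ ((z @ theta) - q[None, :])      # Eq. (13)
    return theta
```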
In the context of the feasibility problem, interpolation is satisfied if $f_{\bar{\theta}}(z_i) \in C_i$ and hence $h_i(\bar{\theta}) = 0$ for all $i \in [N]$. This implies that the intersection $\mathcal{C} \cap \mathcal{F}$ is non-empty, and in this case, SGD with a constant step-size requires $T = O(\ln(N/\epsilon))$ iterations to return a point $\epsilon$-close to $\mathcal{C} \cap \mathcal{F}$. Notably, the PL condition is not required for convergence; smooth and convex functions (even without the PL condition) ensure convergence with first-order methods, though at a slower rate (e.g., $O(1/\sqrt{T})$ instead of $O(1/T)$ for SGD).

The above results hold when using a linear model, ensuring convexity of the resulting function $h(\theta)$. Similar guarantees extend to non-parametric techniques like kernel methods, demonstrating the generality of our results. However, for expressive models such as deep neural networks, convexity is not necessarily satisfied. In certain regimes of over-parametrized neural networks, conditions resembling PL or variations thereof are satisfied (Liu et al., 2022; 2023). In these cases, SGD can still achieve linear convergence (Vaswani et al., 2019; Bassily et al., 2018), matching the results in the convex case. The above results are concerned with minimizing the loss on the training dataset. In the next section, we study the generalization performance of SGD on previously unseen instances sampled from the same distribution.

## 6. Generalization Guarantees

In this section, we use existing results on algorithmic stability (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Lei & Ying, 2020) to control the generalization error and subsequently bound the suboptimality for CILP. We first define the necessary notation and recall the necessary results from the algorithmic stability literature. We define $\rho$ to be the probability measure on the sample space $Y = Z \times X$, where $Z \subseteq \mathbb{R}^d$ and $X \subseteq \mathbb{R}^m$. We assume that the training dataset $D = \{(z_1, x_1^*), \ldots, (z_N, x_N^*)\}$ is drawn independently and identically from $\rho$. We define $h(\theta, (z, x^*)) := \frac{1}{2}\min_{q \in C(x^*)} \|f_\theta(z) - q\|^2$, where $C(x^*)$ is the set constructed according to Eq. (1). Furthermore, we denote the population loss for parameter $\theta$ as $\hat{h}(\theta) = \mathbb{E}_{(z, x^*) \sim \rho}[h(\theta, (z, x^*))]$.

Based on algorithmic stability, Lei & Ying (2021) prove the following generalization result for learning with smooth loss functions satisfying the PL condition.

Theorem 6.1. (Theorem 1 in Lei & Ying (2021)) Let $\theta_D$ denote the output of a randomized algorithm $\mathcal{A}$ when minimizing an $L$-smooth function $h$ that satisfies the PL inequality with constant $\mu$. Under the condition $N \ge 4L/\mu$, we have

$$\mathbb{E}\big[\hat{h}(\theta_D) - \inf_\theta h(\theta)\big] = O\Big(\frac{1}{N\mu}\,\mathbb{E}\big[\inf_\theta h(\theta)\big] + \mathbb{E}\big[h(\theta_D) - \inf_\theta h(\theta)\big]\Big).$$

The expectation in the above theorem is w.r.t. the randomness in selecting the training dataset of size $N$ and w.r.t. the stochasticity in the learning algorithm. In our context, since the randomized algorithm $\mathcal{A}$ is SGD, the bound on its generalization is a direct consequence of Theorem 6.1. In particular, since $\mathbb{E}_D[\inf_\theta h(\theta)] \le \inf_\theta \mathbb{E}_D[h(\theta)] = \inf_\theta \hat{h}(\theta)$, we can obtain the following result from Lei & Ying (2021).

Corollary 6.2. (Theorem 6 in Lei & Ying (2021)) When minimizing an $L$-smooth, $\mu$-RSI function, SGD with step-size $\eta_t = \frac{1}{\mu(t+1)}$ for all $t > 0$ has the following guarantee:

$$\mathbb{E}[\hat{h}(\theta_T)] - \inf_\theta \hat{h}(\theta) = O\Big(\frac{1}{N\mu} + \frac{1}{\mu^2 T}\Big).$$

The expectation in the above result is only over the stochasticity in SGD. The LHS represents the excess risk; the first term on the RHS decreases as $N$ increases, and the second term on the RHS represents the average (over $D$) optimization error that decreases as $T$ increases.
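For intuition, the bound in Corollary 6.2 can be read through the standard excess-risk decomposition (a sketch we add for exposition; the paper does not state it in this form):

$$
\mathbb{E}\big[\hat h(\theta_T)\big]-\inf_\theta \hat h(\theta)
= \underbrace{\mathbb{E}\big[\hat h(\theta_T)-h(\theta_T)\big]}_{\text{generalization gap}}
+ \underbrace{\mathbb{E}\big[h(\theta_T)-\inf_\theta h(\theta)\big]}_{\text{optimization error}}
+ \underbrace{\mathbb{E}\big[\inf_\theta h(\theta)\big]-\inf_\theta \hat h(\theta)}_{\le\, 0}.
$$

Theorem 6.1 controls the generalization gap, SGD with the stated step-size controls the optimization error, and the last term is non-positive since $\mathbb{E}_D[\inf_\theta h(\theta)] \le \inf_\theta \mathbb{E}_D[h(\theta)] = \inf_\theta \hat{h}(\theta)$.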
In the interpolation setting, since the model can fit any training dataset of size $N$ using SGD, $\inf_\theta \mathbb{E}[h(\theta)] = 0$. In this case, we obtain the following result from Lei & Ying (2021, Theorem 7).

Corollary 6.3. When minimizing an $L$-smooth, $\mu$-RSI function and if $\inf_\theta \mathbb{E}[h(\theta)] = 0$ for any choice of training dataset $D$ of size $N$, SGD with step-size $\eta_t = \eta = \frac{1}{L}$ for all $t > 0$ has the following guarantee:

$$\mathbb{E}[\hat{h}(\theta_T)] = O\Big(L\Big(1 - \frac{\mu}{L}\Big)^T\Big).$$

The above result shows that in the interpolation setting, the expected (over the randomness in SGD) population loss decreases at a linear rate (depending on $T$). Importantly, the above bound does not depend on $N$. Intuitively, if $h(\theta)$ is smooth and $N$ is large enough s.t. it satisfies the PL condition with $\mu > 0$, minimizing the loss over a single dataset results in minimizing the population loss. The above results bound the population loss $\hat{h}(\theta)$, which serves as a proxy for the decision quality of SGD. Subsequently, we establish a connection between $h(\theta)$ and the suboptimality in the CIO framework.

### 6.1. Sub-optimality

In this section, we first argue about the shortcomings of previous definitions of sub-optimality used to measure performance for CILP, and then propose a new sub-optimality metric. Recently, Sun et al. (2023) defined the suboptimality gap as $\Gamma_1(\theta, (z, x^*)) := \langle c_\theta, x^* - \hat{x}(c_\theta) \rangle$, where $c_\theta = f_\theta(z)$, and proved theoretical guarantees for this loss. We argue that $\Gamma_1$ is not the right metric, as the predicted $c_\theta$ can be made arbitrarily small, resulting in smaller values of $\Gamma_1$ without ensuring that $x^* \approx \hat{x}(c_\theta)$. On the other hand, work in predict-and-optimize (Elmachtoub & Grigas, 2022) assumes access to the ground-truth cost vector $c^*$ and proposes the sub-optimality $\Gamma_2(\theta, (z, c^*, x^*)) := \langle c^*, \hat{x}(c_\theta) - x^* \rangle$. Since we do not have access to $c^*$, we cannot directly use this measure of sub-optimality. Consequently, we use the projection of $c_\theta$ onto $C$ as a proxy for the ground-truth $c^*$ and define the suboptimality gap as:

$$\Gamma(\theta, (z, x^*)) = \Big\langle \frac{P_C(c_\theta)}{\|P_C(c_\theta)\|_2}, \ \hat{x}(c_\theta) - x^* \Big\rangle \tag{15}$$

It is important to note that we divide $P_C(c_\theta)$ by its $\ell_2$ norm to make the sub-optimality scale-invariant, i.e., small values of $\|P_C(c_\theta)\|$ do not necessarily imply small sub-optimality (unlike $\Gamma_1$). We now relate the suboptimality to the loss $h(\theta)$ and prove the following result in Appendix B.7.

Proposition 6.4. For $c_\theta := f_\theta(z) \in \mathbb{R}^m$, assuming that $\forall j \in [m]$, $[\hat{x}(c_\theta)]_j, x^*_j \in [0, 1]$, we have $\Gamma(\theta, (z, x^*)) \le \frac{2m}{\delta}\sqrt{h(\theta, (z, x^*))}$, where $\delta := O(\chi/\sqrt{m})$, $\chi$ is the margin, and the $O$ notation hides constants that depend on the LP.

As the sub-optimality is upper-bounded by $O(\sqrt{h(\theta)})$, we can control it by controlling the loss $h(\theta)$. Putting together the results in Corollaries 6.2 and 6.3 and Proposition 6.4, we observe that $T$ iterations of SGD result in the following bounds on the expected sub-optimality:

$$\mathbb{E}_{(z, x^*) \sim \rho}\big[\mathbb{E}_{D \sim \rho}\big[\mathbb{E}[\Gamma(\theta_T, (z, x^*))]\big]\big] = O\bigg(\frac{2m}{\delta}\Big[\inf_\theta \hat{h}(\theta) + \frac{1}{N\mu} + \frac{1}{\mu^2 T}\Big]^{1/2}\bigg)$$

in the general setting, and $O\big(\frac{2m}{\delta}[\exp(-\mu T)]^{1/2}\big)$ (independent of $N$) in the interpolation setting. Compared to these results, Sun et al. (2023) derive an $O(1/\sqrt{N})$ bound on the expected sub-optimality in terms of $\Gamma_1$ for both the interpolation and general settings. In the next section, we compare our method against several baselines on real-world and synthetic datasets and present the results.
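The metric in Eq. (15) is straightforward to compute given the LP solver and the projection onto $C$; a minimal sketch follows, assuming the illustrative `solve_lp` (Section 3) and `project_C` (Section 4.2) helpers.

```python
import numpy as np

def suboptimality(theta, z, x_star, A, b, solve_lp, project_C):
    """Scale-invariant sub-optimality gap of Eq. (15) for one example."""
    c_theta = z @ theta                         # predicted cost vector
    p = project_C(c_theta, x_star, A)           # proxy for the ground-truth cost
    x_hat = solve_lp(c_theta, A, b)             # decision induced by the prediction
    return np.dot(p / np.linalg.norm(p), x_hat - x_star)
```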
## 7. Experiments

The code is available here.

Figure 2: Decision loss: training and test plots for the real-world experiments. Our method significantly outperforms the other methods (ST, BB, MOM, SPO+).

Datasets and Model: To validate the effectiveness of our approach, we experiment with both synthetic and real-world benchmarks. We consider the two real-world tasks (Vlastelica et al., 2019) Warcraft Shortest Path and Perfect Matching below, and defer the synthetic experiments to Appendix C.

Warcraft Shortest Path (SP): The dataset consists of $(z, x^*)$ pairs where the input $z$ is an RGB image generated from the Warcraft II tileset. The output $x^*$ corresponds to the shortest path between given source-target pairs. The model predicts the edge weights for each tile in a $k \times k$ grid (where $k \in \{12, 18\}$). Given these edge weights, the optimization problem is to find the shortest path.

Perfect Matching (PM): The dataset consists of $(z, x^*)$ pairs where the input $z$ is a grey-scale image consisting of MNIST digits on a $k \times k$ grid. The output $x^*$ is a matching of each digit to one of its neighbors on the grid. The model predicts the edge weights between each pair of neighbouring digits. Given these edge weights, the optimization problem is to find a matching that minimizes the cumulative weight of the selected edges.

Figure 3: Estimate loss: training and test plots for the real-world experiments. Our method significantly outperforms existing methods (ST, BB, MOM) and is comparable to SPO+, which uses the knowledge of $c^*$.

Both datasets consist of 10000 training samples, 1000 validation samples and 1000 test samples each. For both SP and PM, we use ResNet-18 (He et al., 2016) followed by a softplus function $s(x) = \log(1 + \exp(x))$ to ensure the predicted cost is non-negative. Please refer to Appendix D for additional details about the model and datasets.

Methods: We compare the proposed method against several existing methods, including ST (Sahoo et al., 2022), MOM (Sun et al., 2023), BB (Vlastelica et al., 2019), QPTL (Wilder et al., 2019) and SPO+ (Bertsimas & Kallus, 2020). We use adaptive first-order methods, AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014), to minimize the loss in Eq. (7) for our method and the corresponding losses for the other baselines. We train all the methods for 50 epochs with a batch size of 100. We employ a grid search to find the best constant step size in {0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005}, across both the Adam and AdaGrad optimizers. The optimal settings are determined based on performance on the validation set. For the optimal settings, we consider 5 independent runs and plot the average result and standard deviation. Following Sun et al. (2023), we set $\chi = 1$ for all our experiments. In Appendix D.4, we also provide an ablation study varying $\chi$ in Table 1 and show that the algorithm is robust to $\chi$. We note that the QPTL approach is excluded from the real-world experiments as it is prohibitively slow for large problems (Amos & Kolter, 2017; Geng et al., 2023). For all the methods, we implement the LPs and QPs using the CVXPY library (Diamond & Boyd, 2016). For LPs, we use the ECOS solver (Domahidi et al., 2013), and for QPs, we use the OSQP solver (Stellato et al., 2020).

Metrics: For each method, we plot the standard metrics, estimate-loss and decision-loss, on both the train and test set, defined as:

$$\text{Estimate-Loss}(\theta) = \sum_{i=1}^N \big[\langle c_i^*, \hat{x}(f_\theta(z_i)) \rangle - \langle c_i^*, x_i^* \rangle\big] \tag{16}$$

$$\text{Decision-Loss}(\theta) = \sum_{i=1}^N \|\hat{x}(f_\theta(z_i)) - x_i^*\|^2 \tag{17}$$

Since all the datasets consist of $(z_i, c_i^*, x_i^*)$ triples where $x_i^* = \hat{x}(c_i^*)$, the estimate-loss and decision-loss are commonly used to measure performance in these tasks (experimentally, we found that the sub-optimality in Eq. (15) has a similar trend as the estimate-loss).
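A minimal sketch of these metrics follows, assuming the illustrative `solve_lp` helper from Section 3 and a dataset of $(z_i, c_i^*, x_i^*)$ triples.

```python
import numpy as np

def metrics(theta, Z, C_star, X_star, A, b, solve_lp):
    """Compute the estimate-loss (Eq. (16)) and decision-loss (Eq. (17))."""
    est, dec = 0.0, 0.0
    for z, c_star, x_star in zip(Z, C_star, X_star):
        x_hat = solve_lp(z @ theta, A, b)
        est += c_star @ x_hat - c_star @ x_star   # Eq. (16) summand
        dec += np.sum((x_hat - x_star) ** 2)      # Eq. (17) summand
    return est, dec
```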
We note that SPO+ requires access to the ground-truth cost vector $c^*$, while the other methods, including ours, do not. Though the MOM method does not require access to $c^*$ in principle, the paper's implementation (see the MOM code) uses this ground-truth information to calculate the basis and use the LP optimality conditions. We continue using this information for MOM, thereby overestimating its performance.

Results: In Fig. 2, w.r.t. the decision-loss, our method consistently outperforms the baselines by a considerable margin across tasks. In Fig. 3, w.r.t. the estimate-loss, our method outperforms all the baselines except for SPO+ on the SP problem. In Appendix D.3, we plot the wall-clock time per epoch for all the methods, and observe that our method is comparable to the baselines and scales gracefully as the dimension and the number of training examples increase. These results demonstrate the strong empirical performance of our method compared to the other baselines.

## 8. Discussion

We presented a reduction of CIO to convex feasibility, which enabled us to guarantee linear convergence to the solution without additional assumptions such as degeneracy or interpolation. We further reduced it to ERM on a smooth, convex loss that satisfies the PL condition. This enabled us to use first-order optimizers and demonstrate strong empirical performance on real-world tasks while being computationally efficient. For future work, we aim to address the following areas: (1) since solving the QP takes a substantial amount of time, we plan to incorporate techniques from Lavington et al. (2023) to allow for multiple updates to the model for every solve of the QP; (2) we intend to extend our framework to accommodate unknown constraints, broadening its applicability; (3) finally, we aim to experiment with general non-linear convex objectives.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

We would like to thank Michael Lu, Anh Dang, Reza Babanezhad, Vibhuti Dhingra, Tarannum Khan, Karan Desai, Ashwin Samudre and Hardik Chauhan for helpful feedback on the paper. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2022-04816. Anant Raj is supported by a Marie Sklodowska-Curie Fellowship (project NN-OVEROPT 101030817).

## References

Amos, B. Differentiable optimization-based modeling for machine learning. Ph.D. thesis, 2019.

Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136-145. PMLR, 2017.

Angalakudati, M., Balwani, S., Calzada, J., Chatterjee, B., Perakis, G., Raad, N., and Uichanco, J. Business analytics for flexible resource allocation under random emergencies. Management Science, 60:1552-1573, 2014.

Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1-3, 1966.

Bansal, R. Optimization methods for electric power systems: An overview. International Journal of Emerging Electric Power Systems, 2, 2005.
Bassily, R., Belkin, M., and Ma, S. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.

Bauschke, H. H. and Borwein, J. M. On the convergence of von Neumann's alternating projection algorithm for two sets. Set-Valued Analysis, 1:185-212, 1993.

Bauschke, H. H. and Borwein, J. M. On projection algorithms for solving convex feasibility problems. SIAM Review, 38(3):367-426, 1996.

Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J.-P., and Bach, F. Learning with differentiable perturbed optimizers. Advances in Neural Information Processing Systems, 33:9508-9519, 2020.

Bertsimas, D. and Kallus, N. From predictive to prescriptive analytics. Management Science, 66:1025-1044, 2020.

Bertsimas, D., Gupta, V., and Paschalidis, I. C. Data-driven estimation in equilibrium using inverse optimization. Mathematical Programming, 153:595-633, 2015.

Besbes, O., Fonseca, Y., and Lobel, I. Contextual inverse optimization: Offline and online learning. Operations Research, 2023.

Birge, J. R., Hortaçsu, A., and Pavlin, J. M. Inverse optimization for the recovery of market structure from market outcomes: An application to the MISO electricity market. Operations Research, 65(4):837-855, 2017.

Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499-526, 2002.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Boyd, S. P., Lee, T. H., et al. Optimal design of a CMOS op-amp via geometric programming. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20:1-21, 2001.

Cameron, C., Hartford, J., Lundy, T., and Leyton-Brown, K. The perils of learning before optimizing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 3708-3715, 2022.

Chan, T. C., Eberg, M., Forster, K., Holloway, C., Ieraci, L., Shalaby, Y., and Yousefi, N. An inverse optimization approach to measuring clinical pathway concordance. Management Science, 68(3):1882-1903, 2022.

Chan, T. C., Mahmood, R., and Zhu, I. Y. Inverse optimization: Theory and applications. Operations Research, 2023.

Cornuejols, G. and Tütüncü, R. Optimization Methods in Finance, volume 5. Cambridge University Press, 2006.

Deutsch, F. Rate of convergence of the method of alternating projections. Parametric Optimization and Approximation (Oberwolfach, 1983), 72:96-107, 1984.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17:2909-2913, 2016.

Domahidi, A., Chu, E., and Boyd, S. ECOS: An SOCP solver for embedded systems. In 2013 European Control Conference (ECC), pp. 3071-3076. IEEE, 2013.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

El Balghiti, O., Elmachtoub, A. N., Grigas, P., and Tewari, A. Generalization bounds in the predict-then-optimize framework. Advances in Neural Information Processing Systems, 32, 2019.

Elmachtoub, A. N. and Grigas, P. Smart "predict, then optimize". Management Science, 68(1):9-26, 2022.
Fabozzi, F. J., Markowitz, H. M., and Gupta, F. Portfolio selection. Handbook of Finance, 2, 2008.

Garrigos, G. Square distance functions are Polyak-Łojasiewicz and vice-versa. arXiv preprint arXiv:2301.10332, 2023.

Geng, H., Ruan, H., Wang, R., Li, Y., Wang, Y., Chen, L., and Yan, J. Rethinking and benchmarking the predict-then-optimize paradigm for combinatorial optimization problems. arXiv preprint arXiv:2311.07633, 2023.

Gower, R., Sebbouh, O., and Loizou, N. SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pp. 1315-1323. PMLR, 2021.

Guyomarch, J. Warcraft II open-source map editor, 2017. URL http://github.com/war2/war2edit.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225-1234. PMLR, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Heuberger, C. Inverse combinatorial optimization: A survey on problems, methods, and results. Journal of Combinatorial Optimization, 8:329-361, 2004.

Iyengar, G. and Kang, W. Inverse conic programming with applications. Operations Research Letters, 33(3):319-330, 2005.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795-811. Springer, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kuhn, H. W. and Tucker, A. W. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1951.

Lavington, J. W., Vaswani, S., Babanezhad, R., Schmidt, M., and Roux, N. L. Target-based surrogates for stochastic optimization. arXiv preprint arXiv:2302.02607, 2023.

Lei, Y. and Ying, Y. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp. 5809-5819. PMLR, 2020.

Lei, Y. and Ying, Y. Sharper generalization bounds for learning with gradient-dominated objective functions. In International Conference on Learning Representations, 2021.

Li, B., Wu, G., He, Y., Fan, M., and Pedrycz, W. An overview and experimental study of learning-based optimization algorithms for the vehicle routing problem. IEEE/CAA Journal of Automatica Sinica, 9:1115-1138, 2022.

Li, J., Liu, F., Wang, Z., Low, S. H., and Mei, S. Optimal power flow in stand-alone DC microgrids. IEEE Transactions on Power Systems, 33:5496-5506, 2018.

Liu, C., Zhu, L., and Belkin, M. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85-116, 2022.

Liu, C., Drusvyatskiy, D., Belkin, M., Davis, D., and Ma, Y.-A. Aiming towards the minimizers: fast convergence of SGD for overparametrized problems. arXiv preprint arXiv:2306.02601, 2023.
Luenberger, D. G., Ye, Y., et al. Linear and Nonlinear Programming, volume 2. Springer, 1984.

Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp. 3325-3334. PMLR, 2018.

Mohajerin Esfahani, P., Shafieezadeh-Abadeh, S., Hanasusanto, G. A., and Kuhn, D. Data-driven inverse optimization with imperfect information. Mathematical Programming, 167:191-234, 2018.

Ng, A. Y., Russell, S., et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 2, 2000.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Polyak, B. T. Gradient methods for solving equations and inequalities. USSR Computational Mathematics and Mathematical Physics, 4(6):17-32, 1964.

Raj, A. and Bach, F. Explicit regularization of stochastic gradient methods through duality. In International Conference on Artificial Intelligence and Statistics, pp. 1882-1890. PMLR, 2021.

Raja, P. and Pugazhenthi, S. Optimal path planning of mobile robots: A review. International Journal of Physical Sciences, 7:1314-1320, 2012.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Sahoo, S. S., Paulus, A., Vlastelica, M., Musil, V., Kuleshov, V., and Martius, G. Backpropagation through combinatorial algorithms: Identity with projection works. arXiv preprint arXiv:2205.15213, 2022.

Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637-672, 2020.

Sun, C., Liu, S., and Li, X. Maximum optimality margin: A unified approach for contextual linear programming and inverse linear programming. arXiv preprint arXiv:2301.11260, 2023.

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195-1204. PMLR, 2019.

Vlastelica, M., Paulus, A., Musil, V., Martius, G., and Rolínek, M. Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175, 2019.

Von Neumann, J. On rings of operators. Reduction theory. Annals of Mathematics, pp. 401-485, 1949.

Wahdany, D., Schmitt, C., and Cremer, J. L. More than accuracy: end-to-end wind power forecasting that optimises the energy system. Electric Power Systems Research, 221:109384, 2023.

Wilder, B., Dilkina, B., and Tambe, M. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1658-1665, 2019.

Zhang, H. and Yin, W. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.
## Supplementary Material

Organization of the Appendix: A. Definitions; B. Theoretical Results; C. Synthetic Experimental Results; D. Additional Real-world Experiment Details.

### A. Definitions

If the function $f$ is differentiable and $L$-smooth, then for all $v$ and $w$,

$$f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{L}{2}\|v - w\|^2. \quad \text{(Smoothness)}$$

If $f$ is convex, then for all $v$ and $w$,

$$f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle. \quad \text{(Convexity)}$$

If $f$ is $\mu$-strongly-convex, then for all $v$ and $w$,

$$f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle + \frac{\mu}{2}\|v - w\|^2. \quad \text{(Strong Convexity)}$$

### B. Theoretical Results

#### B.1. Generalizing our method to other classes of optimization problems

In this section, we relax our assumption on the class of problems from LPs to non-linear convex optimization objectives and constraints. We specify the condition on the objective for our reduction to hold; there is no extra condition (apart from convexity) on the optimization constraints for the reduction to hold. Similar to Section 4.1, we consider a single data-point $(z, x^*) \in D$ with $|D| = N$, where $z \in \mathbb{R}^d$, $x^* \in \mathbb{R}^m$, and we aim to find a $c \in \mathbb{R}^p$ such that $\hat{x}(c) = x^*$. Consider a non-linear convex optimization problem defined as:

$$\hat{x}(c) = \arg\min_x \ \phi(x, c) \quad \text{s.t.} \quad G(x) = 0, \ H(x) \le 0, \tag{18}$$

where $G(x)$ and $H(x)$ consist of $k$ and $l$ convex constraints respectively, and $\phi(x, c)$, $G_i(x)$, $H_i(x)$ are (non-linear) convex functions with respect to $x$.

Assumptions: For this reduction to hold, the condition we require is that $\frac{\partial \phi(x, c)}{\partial x}\big|_{x = x^*}$ is convex in $c$. Example: For LPs, $\frac{\partial \phi(x, c)}{\partial x}\big|_{x = x^*}$ simplifies to $c$, which is convex in $c$.

Writing the KKT optimality conditions for Eq. (18), we get:

$$\frac{\partial \phi(x, c)}{\partial x}\Big|_{x = x^*} + \nu^\top \frac{\partial G(x)}{\partial x}\Big|_{x = x^*} + \lambda^\top \frac{\partial H(x)}{\partial x}\Big|_{x = x^*} = 0 \quad \text{(stationarity)}$$
$$\lambda \ge 0 \quad \text{(dual feasibility)}$$
$$\lambda \circ H(x^*) = 0 \quad \text{(complementary slackness)}$$
$$G(x^*) = 0, \quad H(x^*) \le 0 \quad \text{(primal feasibility)} \tag{19}$$

where $\lambda \in \mathbb{R}^l$, $\nu \in \mathbb{R}^k$ are referred to as the dual variables. The equations $G(x^*) = 0$, $H(x^*) \le 0$ come from the problem definition. The stationarity condition is derived by differentiating the Lagrangian and substituting $x = x^*$, and the term $\lambda \circ H(x^*) = 0$ represents complementary slackness. Here, $\frac{\partial G(x)}{\partial x}\big|_{x = x^*}$ denotes the Jacobian of $G(x)$ w.r.t. $x$ evaluated at $x^*$. Now, for a given optimal decision $x^*$, the primal feasibility conditions $H(x^*) \le 0$, $G(x^*) = 0$ are inherently satisfied, so we omit them from Eq. (19). By our assumption, $\frac{\partial \phi(x, c)}{\partial x}\big|_{x = x^*}$ is convex in $c$, and $\lambda, \nu$ are multiplied by constants. Thus Eq. (19) is convex in $c, \lambda, \nu$. Therefore, the feasible set $C$ encompassing all the values of $c$ s.t. $\hat{x}(c) = x^*$ can be written as:

$$C = \Big\{c \ \Big|\ \exists\, \lambda, \nu \ \text{s.t.} \ \frac{\partial \phi(x, c)}{\partial x}\Big|_{x = x^*} + \nu^\top \frac{\partial G(x)}{\partial x}\Big|_{x = x^*} + \lambda^\top \frac{\partial H(x)}{\partial x}\Big|_{x = x^*} = 0, \ \lambda \ge 0, \ \lambda \circ H(x^*) = 0 \Big\} \tag{20}$$

Moreover, the projection of a point $\hat{c}$ onto the set $C$ can be attained by solving the QP below:

$$q(\hat{c}) = \arg\min_c \ \|c - \hat{c}\|^2 \tag{21}$$
$$\text{s.t.} \quad \frac{\partial \phi(x, c)}{\partial x}\Big|_{x = x^*} + \nu^\top \frac{\partial G(x)}{\partial x}\Big|_{x = x^*} + \lambda^\top \frac{\partial H(x)}{\partial x}\Big|_{x = x^*} = 0 \tag{22}$$
$$\lambda \ge 0 \tag{23}$$
$$\lambda \circ H(x^*) = 0 \tag{24}$$

Thus, our framework can be used for a non-linear convex class of functions where the first-order KKT conditions are both sufficient and necessary and $\frac{\partial \phi(x, c)}{\partial x}\big|_{x = x^*}$ is convex in $c$. Moreover, we do not assume any additional condition on the constraints $G, H$ apart from convexity.
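A minimal sketch of the projection in Eqs. (21)-(24) follows, specialized to the case $\frac{\partial \phi(x, c)}{\partial x}\big|_{x = x^*} = c$ (as in LPs) so that the problem stays a QP; the function names, the Jacobian inputs, and their sign conventions are our own illustrative choices following the stationarity condition above.

```python
import cvxpy as cp
import numpy as np

def project_general_C(c_hat, grad_G, grad_H, H_at_x_star):
    """Project c_hat onto the set C of Eq. (20), assuming grad_phi(x, c)|_{x=x*} = c.

    grad_G: (k, m) Jacobian of G at x_star; grad_H: (l, m) Jacobian of H at x_star.
    """
    m = c_hat.shape[0]
    c = cp.Variable(m)
    nu = cp.Variable(grad_G.shape[0])
    lam = cp.Variable(grad_H.shape[0], nonneg=True)        # dual feasibility (Eq. (23))
    inactive = np.flatnonzero(H_at_x_star < 0)             # strict inequality => lambda_i = 0 (Eq. (24))
    cons = [c + grad_G.T @ nu + grad_H.T @ lam == 0,       # stationarity (Eq. (22))
            lam[inactive] == 0]
    cp.Problem(cp.Minimize(cp.sum_squares(c - c_hat)), cons).solve()
    return c.value
```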
B.2. Equivalence between the margin in the KKT formulation and (Sun et al., 2023)

In this section, we derive the margin for the KKT formulation that is equivalent to the one in Sun et al. (2023). We further show that our margin formulation extends to degenerate LPs and does not require handling the degenerate and non-degenerate cases separately.

First, we derive the same margin formulation as in (Sun et al., 2023) for non-degenerate LPs in terms of the KKT conditions. In the case of a non-degenerate LP, we denote $B$ as the basis set defined as $B := \{j \mid x^*_j > 0\}$ and $M$ as the set of indices not in $B$, i.e. $M = [n] \setminus B$. $A_B$ represents the columns corresponding to the indices in the set $B$ and is invertible by definition. From the LP optimality conditions, the reduced cost satisfies $c_M^T - c_B^T (A_B)^{-1} A_M \geq 0$. To add a margin $\chi$ to the reduced cost optimality conditions, (Sun et al., 2023) proposed the following modification: $c_M^T - c_B^T (A_B)^{-1} A_M \geq \chi$.

Recall the KKT conditions for an LP:
$$\nu^T A - c^T + \lambda^T = 0$$ (25)
$$\lambda \geq 0$$ (26)
$$\lambda^T x = 0$$ (27)
Separating the stationarity condition (25) according to the index sets $B$ and $M$ for the matrix $A$, we get:
$$\nu^T A_B - c_B^T + \lambda_B^T = 0$$ (28)
$$\nu^T A_M - c_M^T + \lambda_M^T = 0$$ (29)
where $A_B$, $A_M$ are the corresponding columns of the matrix $A$ defined by the sets $B$ and $M$ respectively. Now, from Eq. (27), we have $\lambda^T x^* = 0$. For non-degenerate LPs, $x^*_B > 0$, which implies that $\lambda_B = 0$. Substituting this value in Eq. (28), we get:
$$\nu^T = c_B^T (A_B)^{-1} \quad \text{(substituting } \lambda_B = 0\text{)}$$
$$\lambda_M^T = c_M^T - \nu^T A_M \quad \text{(rearranging Eq. (29))}$$
$$\lambda_M^T = c_M^T - c_B^T (A_B)^{-1} A_M$$ (30)
From Eq. (26), we have $\lambda \geq 0$, which implies $\lambda_M \geq 0$. Thus, in Eq. (30), we recover the reduced cost optimality condition from the KKT formulation. Therefore, the equivalent of $c_M^T - c_B^T (A_B)^{-1} A_M \geq \chi$ in the KKT formulation is to impose the same constraint on $\lambda_M$, i.e. $\lambda_M \geq \chi$.

B.2.1. EXTENSION TO THE DEGENERATE CASE

In the case of degenerate LPs, $\exists i \in B$ s.t. $x^*_i = 0$. Moreover, the set $B$ is no longer unique, and in our experiments we found that the choice of $B$ affects the results of (Sun et al., 2023). Here, we define a separate pair of sets $B_d$, $M_d$ as: $B_d := \{i \mid x^*_i > 0\}$ and $M_d := \{i \mid x^*_i = 0\}$. Note that $B_d$, $M_d$ differ from the standard definition of a basis in LPs. For the degenerate case, we propose the following margin modification: $\lambda_{M_d} \geq \chi$. This can be written as:
$$\forall i \in [n]: \quad \lambda_i \geq \chi \, \mathbb{I}\{x^*_i = 0\}$$ (31)
Note that the margin modification in Eq. (31) is exactly the same as in the non-degenerate case, thus unifying the margin formulation for degenerate and non-degenerate LPs. Moreover, this also resolves the problem of determining the basis for degenerate LPs.
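To illustrate, here is a small numpy sketch (our reconstruction under the sign conventions of Eqs. (25)–(27), not the authors' code) that recovers the dual variables for a standard-form LP $\min_x c^T x$ s.t. $Ax = b$, $x \geq 0$ from a given optimal $x^*$ and reports violations of the unified margin condition in Eq. (31). The function name and inputs are hypothetical.

```python
import numpy as np

def margin_violations(c, A, x_star, chi=1.0, tol=1e-9):
    """Return max(chi - lambda_i, 0) over the indices i with x_star_i = 0."""
    Bd = x_star > tol                       # B_d = {i : x*_i > 0}
    # lambda_{B_d} = 0, so Eq. (28) gives nu^T = c_{B_d}^T (A_{B_d})^+.
    nu = np.linalg.pinv(A[:, Bd]).T @ c[Bd]
    lam = c - A.T @ nu                      # rearranging Eq. (25)
    return np.maximum(chi - lam[~Bd], 0.0)  # Eq. (31), checked on M_d only
```

For a non-degenerate LP, this returns all zeros exactly when the reduced-cost margin constraint $c_M^T - c_B^T (A_B)^{-1} A_M \geq \chi$ of Sun et al. (2023) holds, mirroring the equivalence derived above.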
B.3. Proof of Proposition 5.2

Proposition 5.2. For a linear model $f_\theta(z) = z\theta$ parameterized by $\theta \in \mathbb{R}^{d \times m}$, assuming (without loss of generality) that $\forall i, \|z_i\| \leq 1$, $h(\theta)$ is a 1-smooth convex function.

To prove that the loss function $h(\theta)$ is a smooth, convex function for a linear model, we define $\zeta_i(\theta)$ such that
$$h(\theta) = \frac{1}{N}\sum_{i=1}^{N} \zeta_i(\theta) \quad \text{where} \quad \zeta_i(\theta) := \frac{1}{2} \min_{q \in C_i} \|z_i\theta - q\|^2 = \frac{1}{2} d^2(z_i\theta, C_i),$$ (32)
where $d^2(z_i\theta, C_i)$ represents the squared Euclidean distance of the point $z_i\theta$ from the set $C_i$, and $C_i$ represents the set of feasible cost vectors (defined in Eq. (1)) for the corresponding solution $x^*_i$. We will prove that for an arbitrary $z$, $\zeta(\theta) := \frac{1}{2}\min_{q \in C} \|z\theta - q\|^2$ is 1-smooth and convex. This proves the desired statement since the mean of 1-smooth convex functions is 1-smooth and convex. For this, we define
$$\psi(\theta) := \min_{c \in C} \|z\theta - c\| = d(z\theta, C),$$ (33)
where $d(z\theta, C)$ represents the Euclidean distance of the point $z\theta$ from the set $C$, and $C$ represents the set of feasible cost vectors (defined in Eq. (1)) for the corresponding solution $x^*$. We will first prove that $\psi(\theta)$ is 1-Lipschitz and convex, and use this to prove the smoothness and convexity of $\zeta(\theta)$.

Lemma B.1. For $\|z\| \leq 1$, the loss $\psi(\theta)$ is 1-Lipschitz.

Proof. We first prove that the distance to a closed convex set $C$ is non-expansive. Consider two arbitrary points $x, y \in \mathbb{R}^d$. For the point $p \in C$ s.t. $p$ is the projection of $y$ onto $C$, we know that $d(y, C) = d(y, p)$ and $d(x, C) \leq d(x, p)$. Hence,
$$d(x, C) \leq d(x, p) \leq d(x, y) + d(y, p) = d(x, y) + d(y, C) \quad \text{(triangle inequality)}$$
$$\implies d(x, C) - d(y, C) \leq d(x, y).$$ (34)
Similarly, consider the point $q \in C$ s.t. $q$ is the projection of $x$ onto $C$; then $d(x, C) = d(x, q)$ and $d(y, C) \leq d(y, q)$. For the same reason,
$$d(y, C) \leq d(y, q) \leq d(y, x) + d(x, q) = d(y, x) + d(x, C)$$ (35)
$$\implies d(y, C) - d(x, C) \leq d(x, y).$$ (36)
Since $d(y, x) = d(x, y)$, combining Eq. (34) and Eq. (36) gives $|d(y, C) - d(x, C)| \leq d(x, y) = \|x - y\|$. Thus, the distance to a closed convex set $C$ is non-expansive. Now consider two points $\theta_1, \theta_2$ for the function $\psi$:
$$\psi(\theta_1) - \psi(\theta_2) = d(z\theta_1, C) - d(z\theta_2, C) \quad \text{(by definition of } \psi\text{)}$$
$$\leq d(z\theta_1, z\theta_2) \quad \text{(from the relation above)}$$
$$= \|z\theta_1 - z\theta_2\|$$ (37)
$$\leq \|z\| \, \|\theta_1 - \theta_2\| \quad \text{(Cauchy-Schwarz)}$$
$$\leq \|\theta_1 - \theta_2\| \quad \text{(since } \|z\| \leq 1 \text{ by assumption)}$$

Lemma B.2. The loss function $\psi(\theta)$ is convex.

Proof. Consider two parameter values $\theta_1, \theta_2$ and let the projections of $z\theta_1, z\theta_2$ onto $C$ be $c_1, c_2$ respectively. Additionally, consider $\theta_3$, a convex combination of $\theta_1, \theta_2$, i.e. $\theta_3 := \lambda\theta_1 + (1-\lambda)\theta_2$ for an arbitrary $\lambda \in (0, 1)$. Now, we can write:
$$\lambda\psi(\theta_1) + (1-\lambda)\psi(\theta_2) = \lambda\|z\theta_1 - c_1\| + (1-\lambda)\|z\theta_2 - c_2\|$$ (38)
$$\geq \|\lambda(z\theta_1 - c_1) + (1-\lambda)(z\theta_2 - c_2)\| \quad \text{(triangle inequality)}$$
$$= \|\lambda z\theta_1 + (1-\lambda) z\theta_2 - \lambda c_1 - (1-\lambda)c_2\|$$ (39)
$$= \|\lambda z\theta_1 + (1-\lambda) z\theta_2 - \bar{c}\| \quad (\bar{c} := \lambda c_1 + (1-\lambda)c_2)$$
$$= \|z\theta_3 - \bar{c}\| \quad \text{(by definition of } \theta_3\text{)}$$
$$= d(z\theta_3, \bar{c}) \geq d(z\theta_3, C) \quad \text{(since } C \text{ is convex, } \bar{c} \in C\text{)}$$
$$= \psi(\theta_3)$$
$$\implies \lambda\psi(\theta_1) + (1-\lambda)\psi(\theta_2) \geq \psi(\theta_3)$$ (40)
$$\implies \lambda\psi(\theta_1) + (1-\lambda)\psi(\theta_2) \geq \psi(\lambda\theta_1 + (1-\lambda)\theta_2)$$ (41)
Thus, the function $\psi$ is convex by the definition of convexity.

Lemma B.3. For $\|z\| \leq 1$, the loss $\zeta(\theta)$ is 1-smooth.

Proof. Consider two parameter values $\theta_1, \theta_2$ and denote the projections of $z\theta_1$ and $z\theta_2$ onto $C$ as $c_1$ and $c_2$ respectively. Since $\zeta(\theta) = \frac{1}{2}\|z\theta - P_C(z\theta)\|^2$, we have $\nabla \zeta(\theta) = z^T(z\theta - P_C(z\theta))$, and hence:
$$\|\nabla\zeta(\theta_1) - \nabla\zeta(\theta_2)\| = \|z^T(z\theta_1 - c_1) - z^T(z\theta_2 - c_2)\|$$ (42)
$$\leq \|z\| \, |d(z\theta_1, C) - d(z\theta_2, C)| \quad \text{(Cauchy-Schwarz)}$$
$$\leq |\psi(\theta_1) - \psi(\theta_2)| \quad \text{(since } \|z\| \leq 1 \text{ by assumption, and by definition of } \psi\text{)}$$
$$\leq \|\theta_1 - \theta_2\| \quad \text{(Lemma B.1)}$$
Thus, the function $\zeta(\theta)$ is 1-smooth.

Lemma B.4. The loss $\zeta(\theta)$ is convex.

Proof. Consider the function $g(x) = \frac{1}{2}x^2$ and note that $\zeta(\theta) = g(\psi(\theta))$. From Lemma B.2, we know that $\psi$ is convex. $g(x)$ is non-decreasing for $x \in \mathbb{R}_{\geq 0}$, and $\psi(\theta)$ is always non-negative. The composition of two functions is convex if $g$ is convex and non-decreasing on the range of $\psi$ and $\psi$ is convex (Boyd & Vandenberghe, 2004). Thus, the composite function $\zeta$ is convex.
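As a quick numerical sanity check of the gradient formula $\nabla \zeta(\theta) = z^T(z\theta - P_C(z\theta))$ used in Lemma B.3, the following numpy sketch (our illustration; the halfspace $C$ and all data are hypothetical placeholders, chosen only because the projection has a closed form) compares it against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
d_, m_ = 4, 3
z = rng.standard_normal((1, d_))
z /= np.linalg.norm(z)                      # enforce ||z|| <= 1
a, b = rng.standard_normal(m_), 0.5         # C = {c : a^T c <= b}

def proj_C(c):                              # closed-form halfspace projection
    gap = max(a @ c - b, 0.0)
    return c - gap * a / (a @ a)

def zeta(theta):                            # 0.5 * d^2(z theta, C)
    c = (z @ theta).ravel()
    return 0.5 * np.sum((c - proj_C(c)) ** 2)

def grad_zeta(theta):                       # z^T (z theta - P_C(z theta))
    c = (z @ theta).ravel()
    return z.T @ (c - proj_C(c))[None, :]

theta = rng.standard_normal((d_, m_))
eps, E = 1e-6, np.zeros((d_, m_))
E[0, 0] = eps
num = (zeta(theta + E) - zeta(theta - E)) / (2 * eps)
print(np.isclose(num, grad_zeta(theta)[0, 0], atol=1e-4))   # expect True
```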
B.4. Proof for Proposition 5.1

Proposition 5.1. The point $\hat{c} := (c_1, c_2, \ldots, c_N)$, where $c_i = z_i\tilde{\theta}$ and $\tilde{\theta} \in \arg\min h(\theta)$, lies in the intersection $C \cap F$ if it exists; else $\hat{c} \in F$ is the point closest to $C$.

Proof. The loss $h(\theta)$ is defined as:
$$h(\theta) = \frac{1}{2N}\sum_{i=1}^{N} \min_{q_i \in C_i} \|f_\theta(z_i) - q_i\|^2$$ (43)
where $C_i$ represents the set of feasible cost vectors (defined in Eq. (1)) for data-point $i$. We assume $f_\theta$ is a linear model for this proof, for which the function $h(\theta)$ is convex. In order to better interpret $h(\theta)$, consider a point $c_\theta = (c_1, c_2, \ldots, c_N) \in \mathbb{R}^{Nm}$ such that $c_i = f_\theta(z_i)$. Since $c_\theta \in F$, $h(\theta) = \frac{d^2(c_\theta, C)}{2N}$, where $d^2(w, W)$ is the squared Euclidean distance of a point $w$ to the set $W$. Hence, minimizing $h(\theta)$ amounts to minimizing the distance between the sets $F$ and $C$. Consequently, our loss can be reformulated as:
$$h(\theta) = \frac{1}{2N} d(c_\theta, C)^2$$ (44)
Let us denote $\tilde{\theta} \in \arg\min h(\theta)$ and the predicted point $\hat{c} := Z\tilde{\theta} \in F$. Therefore, $h(\tilde{\theta}) = \frac{1}{2N} d(\hat{c}, C)^2$. We prove the proposition by contradiction. Assume that $\hat{c} := Z\tilde{\theta} \in F$ is not the closest point to the set $C$, and let the closest point in $F$ to the set $C$ instead be $Z\theta_p$. Then the loss $h(\theta_p) = \frac{1}{2N} d(Z\theta_p, C)^2$, and since $Z\theta_p$ is the closest point in $F$ to $C$, $\theta_p$ is a minimizer of $h(\theta)$. As $\tilde{\theta}$ is also a minimizer, $h(\tilde{\theta}) = h(\theta_p)$, and thus $d(Z\tilde{\theta}, C) = d(Z\theta_p, C)$, contradicting the assumption. Hence, minimizing $h(\theta)$ leads to a point $\hat{c} = Z\tilde{\theta} \in F$ that is closest to $C$. In the case where an intersection exists, the closest distance to $C$ is 0; therefore, $\hat{c} \in F \cap C$.

B.5. Proof of Proposition 5.4

Proposition 5.4. For a linear model $f_\theta(z) = z\theta$, the iterates corresponding to the preconditioned gradient method on $h(\theta)$ with $\eta = 1$ are identical to Algorithm 1.

Proof. Consider iteration $t$ and denote the current model parameters by $\theta_t$. Let $\theta_{t+1}(\text{POCS})$ and $\theta_{t+1}(\text{ERM})$ denote the parameters produced at time $t+1$ by the POCS update and by one preconditioned first-order update on $h(\theta)$, respectively. Let $q_t = P_C(Z\theta_t)$ denote the projection of $Z\theta_t$. In POCS, $\theta_{t+1} = \arg\min_\theta \frac{1}{2N}\|Z\theta - q_t\|^2$. Solving this exactly gives the projection of $q_t$ onto the set $F$, which can be written as:
$$P_F(q_t) = Z\theta_{t+1} \quad \text{s.t.} \quad \theta_{t+1} = (Z^T Z)^{-1} Z^T q_t$$ (45)
Thus, $\theta_{t+1}(\text{POCS}) = (Z^T Z)^{-1} Z^T q_t$. The first-order update on $h(\theta)$ with step-size $\eta$ and preconditioner $\left[\frac{Z^T Z}{N}\right]^{-1}$ can be written as:
$$\theta_{t+1} = \theta_t - \eta \left[\tfrac{Z^T Z}{N}\right]^{-1} \nabla h(\theta_t) \quad \text{(definition of the preconditioned update)}$$
where $\nabla h(\theta_t) = \frac{1}{N} Z^T (Z\theta_t - q_t)$. Substituting into the preconditioned update, we get:
$$\theta_{t+1} = \theta_t - \eta \, [Z^T Z]^{-1} Z^T (Z\theta_t - q_t)$$ (46)
$$= (1 - \eta)\theta_t + \eta \, [Z^T Z]^{-1} Z^T q_t$$ (47)
$$= (Z^T Z)^{-1} Z^T q_t \quad \text{(for } \eta = 1\text{)}$$
Thus, $\theta_{t+1}(\text{ERM}) = (Z^T Z)^{-1} Z^T q_t$. Therefore, the iterates produced by POCS are identical to one step of the preconditioned gradient update with $\eta = 1$.
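The proposition can be verified numerically for a single iteration. The sketch below (our illustration; $Z$, $\theta_t$, and the halfspace standing in for the product set $C$ are hypothetical placeholders, with the projection applied row-wise) checks that the POCS update in Eq. (45) coincides with the preconditioned gradient step with $\eta = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_, m_ = 10, 4, 3
Z = rng.standard_normal((N, d_))
theta = rng.standard_normal((d_, m_))
a, b = rng.standard_normal(m_), 0.0

def proj_C(C_mat):                       # row-wise halfspace projection
    gap = np.maximum(C_mat @ a - b, 0.0)
    return C_mat - np.outer(gap, a) / (a @ a)

q = proj_C(Z @ theta)                    # q_t = P_C(Z theta_t)
ZtZ_inv = np.linalg.inv(Z.T @ Z)

theta_pocs = ZtZ_inv @ Z.T @ q           # Eq. (45): least-squares fit to q_t
grad_h = Z.T @ (Z @ theta - q) / N       # grad h(theta_t)
theta_erm = theta - (N * ZtZ_inv) @ grad_h   # preconditioner [Z^T Z / N]^{-1}, eta = 1
print(np.allclose(theta_pocs, theta_erm))    # expect True
```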
B.6. Proof for Proposition 5.3

Proposition 5.3. For a linear model $f_\theta(z) = z\theta$, and assuming (i) (without loss of generality) that $\forall i, \|z_i\| \leq 1$ and (ii) $\lambda_{\min}[Z^T Z] > 0$, $h(\theta)$ is not necessarily strongly-convex but satisfies the PL inequality with $\mu = \lambda_{\min}\left[\frac{\sum_{i=1}^N z_i z_i^T}{N}\right]$.

We prove that $h(\theta)$ is not necessarily a strongly-convex function by contradiction. Let us assume that $h(\theta)$ is an $\alpha$-strongly convex function with $\alpha > 0$. From the definition of $\alpha$-strong convexity, $h(\theta)$ must satisfy the following inequality for all $\theta_1, \theta_2$.

Figure 4: Two points $x$, $y$ and their projections $x_1$, $y_1$ onto the linear boundary of the set $C$. The angle between the vectors $y - x$ and $x - x_1$ is a right angle; thus, the two vectors are orthogonal.

$$h(\theta_1) \geq h(\theta_2) + \nabla h(\theta_2)^T (\theta_1 - \theta_2) + \tfrac{\alpha}{2}\|\theta_1 - \theta_2\|^2$$ (48)
Consider a special case with $N = 1$ point, $m = d = 1$, and $z = 1$. Let $y = z\theta_1 = \theta_1$ and $x = z\theta_2 = \theta_2$. Consider $C$ as an affine set and two points $(x, y)$ equidistant from $C$. Define $x_1, y_1$ as the projections of $x$ and $y$ onto $C$ respectively (refer to Fig. 4 above). Since $x$ and $y$ are equidistant from $C$, $\|x - x_1\| = \|y - y_1\|$. Moreover, since $x$ and $y$ are on the same side of $C$, the vector $y - x$ is orthogonal to the vector $x - x_1$; hence $\langle y - x, x - x_1 \rangle = 0$. Substituting these values in Eq. (48), we get:
$$\tfrac{1}{2}\|y - y_1\|^2 \geq \tfrac{1}{2}\|x - x_1\|^2 + \langle x - x_1, y - x \rangle + \tfrac{\alpha}{2}\|x - y\|^2$$ (49)
$$\implies 0 \geq \tfrac{\alpha}{2}\|x - y\|^2$$ (50)
$$\implies \alpha = 0$$ (51)
As $\alpha = 0$, the function $h(\theta)$ is not strongly convex for this case where $C$ is an affine set. However, we can show that the function $h(\theta)$ satisfies the PL inequality with $\mu = \frac{\lambda_{\min}[Z^T Z]}{N}$. Recall that $h(\theta)$ is defined as $\frac{1}{2N}\|Z\theta - q\|^2$, where $q = P_C(Z\theta)$ is the projection of $Z\theta$ onto $C$. Hence, $\nabla h(\theta) = \frac{1}{N} Z^T [Z\theta - q]$, and:
$$\|\nabla h(\theta)\|^2 = \tfrac{1}{N^2}\|Z^T [Z\theta - q]\|^2 \quad \text{(by definition of } \nabla h(\theta)\text{)}$$
$$\geq \tfrac{1}{N^2}\, \sigma^2_{\min}(Z^T)\, \|Z\theta - q\|^2 \quad (\|Ax\| \geq \sigma_{\min}(A)\|x\|)$$
$$= \tfrac{1}{N^2}\, \lambda_{\min}(Z^T Z)\, \|Z\theta - q\|^2 \quad \text{(using } \sigma^2_{\min}(Z^T) = \lambda_{\min}(Z^T Z)\text{)}$$
$$= \tfrac{2}{N}\, \lambda_{\min}(Z^T Z)\, h(\theta) \quad \text{(by definition of } h\text{)}$$
$$\geq \tfrac{2}{N}\, \lambda_{\min}(Z^T Z)\, [h(\theta) - h(\theta^*)] \quad \text{(since } h \text{ is non-negative)}$$
$$= 2\lambda_{\min}\left[\tfrac{\sum_{i=1}^N z_i z_i^T}{N}\right] [h(\theta) - h(\theta^*)] \quad \text{(replacing } Z^T Z \text{ with } \textstyle\sum_{i=1}^N z_i z_i^T\text{)}$$
$$= 2\mu \, [h(\theta) - h(\theta^*)] \quad \left(\text{for } \mu = \lambda_{\min}\left[\tfrac{\sum_{i=1}^N z_i z_i^T}{N}\right]\right)$$
Hence, $h(\theta)$ is $\mu$-PL.

B.7. Sub-optimality proofs

Proposition 6.4. For $c_\theta := f_\theta(z) \in \mathbb{R}^m$, assuming that $\forall j \in [m]$, $[\hat{x}(c_\theta)]_j, x^*_j \in [0, 1]$, we have $\Gamma(\theta, (z, x^*)) \leq \frac{\sqrt{2h(\theta)}\sqrt{m}}{\delta}$, where $\delta := O(\chi/\sqrt{m})$, $\chi$ is the margin, and the $O$ notation hides constants that depend on the LP.

$$\Gamma(\theta, (z, x^*)) = \left\langle \tfrac{P_C(c_\theta)}{\|P_C(c_\theta)\|}, \hat{x}(c_\theta) - x^* \right\rangle \quad \text{(by definition)}$$
$$\leq \tfrac{1}{\delta}\langle P_C(c_\theta), \hat{x}(c_\theta) - x^* \rangle \quad \text{(since, by Lemma B.5, } \|P_C(c_\theta)\| \geq \delta\text{)}$$
$$\leq \tfrac{1}{\delta}\langle P_C(c_\theta), \hat{x}(c_\theta) - x^* \rangle + \tfrac{1}{\delta}\langle c_\theta, x^* - \hat{x}(c_\theta) \rangle \quad \text{(by definition of } \hat{x}(c_\theta)\text{, } \langle c_\theta, x^* \rangle \geq \langle c_\theta, \hat{x}(c_\theta) \rangle\text{)}$$
$$= \tfrac{1}{\delta}\langle P_C(c_\theta) - c_\theta, \hat{x}(c_\theta) - x^* \rangle \quad \text{(rearranging terms)}$$
$$\leq \tfrac{1}{\delta}\|P_C(c_\theta) - c_\theta\| \, \|\hat{x}(c_\theta) - x^*\| \quad \text{(Cauchy-Schwarz)}$$
$$\leq \tfrac{1}{\delta}\|P_C(c_\theta) - c_\theta\| \sqrt{m} \quad \text{(from the assumption that } \forall j, \hat{x}_j, x^*_j \in [0, 1]\text{)}$$
$$= \tfrac{\sqrt{2h(\theta)}\sqrt{m}}{\delta} \quad \text{(from the definition of } h(\theta)\text{)}$$
Next, we give the lower-bound on the term $\delta$, which depends on the margin $\chi$ defined in Section 4.3.

Lemma B.5. For $c \in C$ defined using a margin $\chi$, $\|c\|_2 \geq \delta := \frac{\chi}{\sqrt{m}} \max_{j \in M} \min\left\{1, \min_{p \in B} \frac{1}{|\tau_{pj}|}\right\}$.

Proof. We lower-bound $\|c\|_2$ for an arbitrary $c \in C$ using the reduced cost optimality conditions (Luenberger et al., 1984). For this, we denote $B$ as the basis set defined as $B := \{j \mid x^*_j > 0\}$ and $M$ as the set of indices not in $B$, i.e. $M = [m] \setminus B$. Let us define $\tau_{pj} = [(A_B)^{\dagger} A_j]_p$, where $A_B$ represents the columns corresponding to the indices in the set $B$, and $(A_B)^{\dagger}$ is the pseudo-inverse of the matrix $A_B$.

To prove this proposition, we first consider two arbitrary vectors $a, b \in \mathbb{R}^m$ and show that $\|a\|_1 \geq \frac{\chi}{\max_{p \in [m]} |b_p|}$ is a necessary condition to ensure that $a^T b \geq \chi$. We do this by contradiction: assume $\|a\|_1 < \frac{\chi}{\max_{p \in [m]} |b_p|}$, but $a^T b \geq \chi$. In this case,
$$\|a\|_1 \max_{p \in [m]} |b_p| < \chi \implies \|a\|_1 \|b\|_\infty < \chi.$$
By Hölder's inequality, $a^T b \leq \|a\|_1 \|b\|_\infty < \chi$, which is a contradiction. Hence $\|a\|_1 \geq \frac{\chi}{\max_p |b_p|}$, giving a lower-bound on $\|a\|_1$.

Now, we can use this result to find the lower bound on $\|c\|_1$. In Appendix B.2, we have shown that the KKT conditions are equivalent to the reduced cost optimality conditions. We first find the lower bound needed to satisfy the reduced cost inequality for a specific index $j \in M$ and then extend the result to all indices in $M$. The reduced cost for an index $j \in M$ with margin $\chi$ is given by $r(j) := c_j - c_B^T (A_B)^{\dagger} A_j$; the reduced cost conditions imply that for all $j \in M$, $r(j) \geq \chi$. In terms of $\tau$, these conditions state that $c_j - \sum_{p \in B} c_p \tau_{pj} \geq \chi$. Equivalently, for all $j \in M$, $c^T \alpha_j \geq \chi$, where $\alpha_j$ represents the coefficients of $c$ in $r(j)$ (i.e., $(\alpha_j)_j = 1$, $(\alpha_j)_p = -\tau_{pj}$ for $p \in B$, and zero otherwise).

Using the above result to obtain a lower-bound on $\|c\|_1$, we have:
$$\|c\|_1 \geq \frac{\chi}{\max_{p \in [m]} |(\alpha_j)_p|}$$ (53)
$$= \frac{\chi}{\max\{1, \max_{p \in B} |\tau_{pj}|\}} \quad \text{(substituting the value of } \alpha_j\text{)}$$
$$= \chi \min\left\{1, \min_{p \in B} \frac{1}{|\tau_{pj}|}\right\} \quad \text{(rearranging terms)}$$
Since we require that the reduced cost condition be satisfied for all $j \in M$, we take the largest such bound:
$$\|c\|_1 \geq \chi \max_{j \in M} \min\left\{1, \min_{p \in B} \frac{1}{|\tau_{pj}|}\right\}$$
Finally, we use the relation between norms to lower-bound the value of $\|c\|_2$:
$$\delta := \min_{c \in C} \|c\|_2$$ (54)
$$\geq \min_{c \in C} \frac{1}{\sqrt{m}}\|c\|_1 \quad \text{(by norm inequality)}$$
$$\geq \frac{\chi}{\sqrt{m}} \max_{j \in M} \min\left\{1, \min_{p \in B} \frac{1}{|\tau_{pj}|}\right\}$$
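For completeness, the quantity $\delta$ from Lemma B.5 can be computed explicitly for a given LP. Below is a small numpy sketch (our illustration; the function name and inputs are hypothetical, and the pseudo-inverse $(A_B)^{\dagger}$ is used as in the definition of $\tau$).

```python
import numpy as np

def delta_lower_bound(A, x_star, chi, tol=1e-9):
    """Compute delta = chi/sqrt(m) * max_{j in M} min{1, min_{p in B} 1/|tau_pj|}."""
    m = A.shape[1]
    B = np.flatnonzero(x_star > tol)          # basis set B = {j : x*_j > 0}
    M = np.flatnonzero(x_star <= tol)         # M = [m] \ B
    A_B_pinv = np.linalg.pinv(A[:, B])
    best = 0.0
    for j in M:
        tau = A_B_pinv @ A[:, j]              # tau_{pj} = [(A_B)^+ A_j]_p
        if tau.size == 0:                     # guard for an empty basis
            inner = 1.0
        else:
            inner = min(1.0, float(np.min(1.0 / np.maximum(np.abs(tau), tol))))
        best = max(best, inner)
    return chi / np.sqrt(m) * best
```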
C. Synthetic Experimental Results

We conduct numerical experiments on two LP problems: the shortest path (SP) problem and the fractional knapsack problem considered in (Sun et al., 2023). For both SP and Knapsack, we generate 100 samples each for the training, validation, and test sets, using the codebase provided by Sun et al. (2023) to generate the datasets.

Shortest Path (SP-synth): The shortest path problem is defined on a 5 × 5 grid with $m = 40$ directed edges associated with the ground-truth cost vector $c^* \in \mathbb{R}^m$. The input is $z \in \mathbb{R}^d$ with $d = 6$; thus, $\theta \in \mathbb{R}^{d \times m}$. To make the problem harder, we use degree = 4 in the data-generation process.

Fractional Knapsack: The fractional knapsack problem is defined with input $z \in \mathbb{R}^d$ with $d = 5$. We have 10 items with associated cost vectors, and slack variables are added to convert the problem to standard form, making the dimension $m = 21$; thus, $\theta \in \mathbb{R}^{d \times m}$. To make the problem harder, we use degree = 2 in the data-generation process with attacking noise of attack-power = 3.0.

Methods and model: In these experiments, we compare both our variants, POCS (Algorithm 1) and ERM (Section 5) with GD, and show that they have similar performance. We compare our method against several existing methods, including ST (Sahoo et al., 2022), MOM (Sun et al., 2023), BB (Vlastelica et al., 2019), QPTL (Wilder et al., 2019) and SPO+ (Elmachtoub & Grigas, 2022). We train all the models in a deterministic (full-batch) setting employing a linear model.

Training details: For Ours (POCS), we solve the regression problem using the closed-form solution obtained through matrix inversion. For MOM and Ours (ERM), we employ the Armijo line-search (Armijo, 1966) with gradient descent. For the BB, ST, QPTL, and SPO+ methods, a grid search is used to find the best constant step-size in {10, 1, 0.1, 0.01, 0.001, 0.0001}. We also did a grid search over λ, the regularizer in QPTL and the perturbation weight in BB, with λ ∈ {100, 10, 1, 0.1, 0.01, 0.001}. The optimal settings are determined based on performance on the validation set, and we plot the training and test curves for the best-performing model.

Results: In Fig. 6, w.r.t. the decision loss, all the methods have similar performance except SPO+ and BB. For Knapsack, our method outperforms all the other baselines except SPO+. In Fig. 5, w.r.t. the estimate loss, our method significantly outperforms the other methods (ST, QPTL, BB, MOM) and is comparable to SPO+, which uses knowledge of $c^*$. Moreover, both our variants, POCS and ERM, have negligible performance differences.

Figure 5: Estimate loss. Figure 6: Decision loss. Figure 7: Training and test plots for the synthetic tasks. For both problems, our method significantly outperforms the other methods (ST, QPTL, BB, MOM) and is comparable to SPO+, which uses knowledge of $c^*$.

D. Additional real-world experiment details

D.1. Warcraft Shortest Path

In this section, we define the LP for the shortest path problem. Let $x_{ij}$ represent the edge from vertex $i$ to vertex $j$. Since all edges are bidirectional (each connection is represented by two directed edges), $x_{ij}$ is not the same as $x_{ji}$. $s$ and $t$ denote the source and target vertices, respectively, and $c_{ij}$ is the cost of selecting edge $x_{ij}$. Before stating the LP, let $N(i)$ be the set of vertices with an outgoing edge from vertex $i$, and $I(i)$ be the set of vertices with an incoming edge to vertex $i$. The LP can be written as follows:
$$\min_x \; c^T x \quad \text{subject to}$$
$$\forall i \notin \{s, t\}: \; \sum_{j \in N(i)} x_{ij} = \sum_{j \in I(i)} x_{ji} \quad \text{(flow conservation)}$$
$$\sum_j x_{sj} = 1 \quad \text{(source has one outgoing edge)}$$
$$\sum_j x_{jt} = 1 \quad \text{(target has one incoming edge)}$$
In this case, the source is always the top-left grid cell, and the target is the bottom-right grid cell. Since the ground-truth weights in the dataset are defined on the vertices, to run the problem as the LP defined above, we directly predict the weights of the edges.
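As an illustration of the LP above, here is a minimal scipy sketch (our own toy example, not the Warcraft pipeline; the graph and costs are hypothetical placeholders) that folds the source, target, and flow-conservation constraints into a single equivalent flow-balance system.

```python
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # (i, j): directed edge i -> j
cost = np.array([1.0, 4.0, 1.0, 5.0, 1.0])
s, t, n = 0, 3, 4

# Flow balance: out-flow minus in-flow is +1 at s, -1 at t, and 0 elsewhere,
# which combines the three constraint groups of the LP above.
A_eq = np.zeros((n, len(edges)))
for e, (i, j) in enumerate(edges):
    A_eq[i, e] += 1.0   # outgoing from i
    A_eq[j, e] -= 1.0   # incoming to j
b_eq = np.zeros(n)
b_eq[s], b_eq[t] = 1.0, -1.0

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print("selected edges:", [edges[e] for e in np.flatnonzero(res.x > 0.5)])
```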
Given that the ground-truth cost $c^*(\text{BB})$ is defined on vertex weights in the dataset, we define $c^*$ for the edge-weighted shortest path as:
$$\forall i, \; \forall j \in N(i): \quad c^*_{ij} = c^*_i(\text{BB})$$ (56)
Both formulations yield the same shortest path, validating the conversion. Please see (Vlastelica et al., 2019) for more details regarding the dataset.

Figure 8: Warcraft SP dataset sample: the input image (left), the ground-truth shortest path (center), and the ground-truth vertex weights (right). The task is to learn edge weights that retrieve the same shortest path.

D.2. Perfect Matching

In this section, we define the LP associated with the perfect matching problem. $x_{ij}$ represents the edge between vertices $i$ and $j$, and $N(i)$ represents the neighbours of vertex $i$. All edges in the graph are undirected, i.e. $x_{ij}$ and $x_{ji}$ represent the same edge. $c_{ij}$ represents the cost of selecting edge $x_{ij}$. Thus, the LP can be written as:
$$\min_x \; c^T x \quad \text{subject to} \quad \forall i: \; \sum_{j \in N(i)} x_{ij} = 1 \quad \text{(each vertex has exactly one incident matched edge)}$$

Figure 9: Perfect Matching dataset sample: the input image (left) and the corresponding min-cost perfect matching overlaid on the input image (right). Each input is a 12 × 12 grid, with each cell containing an MNIST digit. In Perfect Matching (PM), the edges highlighted by the orange lines are those selected by solving the min-cost perfect matching optimization problem. Ground-truth edge weights are inferred by reading the digits connected by the edge as a two-digit number. The task is to predict edge weights such that we obtain the same PM.

In our case, ground-truth edge weights are inferred by reading the digits on the two ends of the edge as a two-digit number. Please see (Vlastelica et al., 2019) for more details regarding the dataset.

D.3. Runtime Comparison

Figure 10: Training time (in seconds) per epoch for each method on the three real-world experiments. Our method is comparable to the other methods and scales well with the dimension of the problem.

To benchmark the computational efficiency of our method, we plot the average training time per epoch for all methods in Fig. 10. Our method is competitive with ST and SPO+, and faster than MOM and BB. BB requires two solver calls per gradient evaluation, while MOM, despite not needing solver calls, involves inverting the matrix $A_B$, which scales poorly with dimension. As shown in the plot, our method scales well with both the problem dimension and the dataset size.

D.4. Ablation study for the margin

To verify the robustness of Algorithm 1 to varying $\chi$, we run an ablation study varying $\chi$ over $[0.01, 10]$ for the synthetic datasets in the deterministic setting (refer to Appendix C for details). We train each model for 150 epochs according to Algorithm 1 and report the mean decision error (Eq. (17)) on both the train and test data for the synthetic shortest-path (SP) and knapsack problems. We observe that increasing the margin from 0.01 to 1 leads to small sub-optimality, and that increasing it beyond 1 does not improve the performance.

Margin   Train SP-synth   Test SP-synth   Train Knapsack   Test Knapsack
10       0.779            1.95            2.033            2.32
1        0.779            1.89            2.033            2.32
0.1      0.699            2.11            2.08             2.39
0.01     2.63             3.23            2.11             2.32

Table 1: Ablation results varying the margin $\chi$ from 10 to 0.01, and the corresponding performance on the train/test sets for the synthetic datasets.