# differentiable_integer_linear_programming__722240a8.pdf Published as a conference paper at ICLR 2025 DIFFERENTIABLE INTEGER LINEAR PROGRAMMING Zijie Geng1,2 , Jie Wang1,2 , Xijun Li3, Fangzhou Zhu4, Jianye Hao4,5, Bin Li1, Feng Wu1,2 1 University of Science and Technology of China 2 Mo E Key Laboratory of Brain-inspired Intelligent Perception and Cognition 3 Shanghai Jiao Tong University 4 Noah s Ark Lab, Huawei 5 Tianjin University zijiegeng@mail.ustc.edu.cn, {jiewangx,binli,fengwu}@ustc.edu.cn lixijun@sjtu.edu.cn, {zhufangzhou,haojianye}@huawei.com Machine learning (ML) techniques have shown great potential in generating highquality solutions for integer linear programs (ILPs). However, existing methods typically rely on a supervised learning paradigm, leading to (1) expensive training cost due to repeated invocations of traditional solvers to generate training labels, and (2) plausible yet infeasible solutions due to the misalignment between the training objective (minimizing prediction loss) and the inference objective (generating high-quality solutions). To tackle this challenge, we propose Diff ILO (Differentiable Integer Linear Programming Optimization), an unsupervised learning paradigm for learning to solve ILPs. Specifically, through a novel probabilistic modeling, Diff ILO reformulates ILPs discrete and constrained optimization problems into continuous, differentiable (almost everywhere), and unconstrained optimization problems. This reformulation enables Diff ILO to simultaneously solve ILPs and train the model via straightforward gradient descent, providing two major advantages. First, it significantly reduces the training cost, as the training process does not need the aid of traditional solvers at all. Second, it facilitates the generation of feasible and high-quality solutions, as the model learns to solve ILPs in an end-to-end manner, thus aligning the training and inference objectives. Experiments on commonly used ILP datasets demonstrate that Diff ILO not only achieves an average training speedup of 13.2 times compared to supervised methods, but also outperforms them by generating heuristic solutions with significantly higher feasibility ratios and much better solution qualities. 1 INTRODUCTION Integer linear programs (ILPs) are powerful tools in various areas such as operations research, mathematics, and engineering (Bixby et al., 2004; Bengio et al., 2021). They are able to model a broad range of combinatorial optimization (CO) problems and find diverse real-world applications, including scheduling (Ryan & Foster, 1981), planning (Beyer et al., 2016), and network design (Koster et al., 2010). Despite their significant importance, ILPs inherently exhibit a complex combinatorial nature and are known as quintessential NP-hard problems, posing substantial challenges in solving them efficiently. Therefore, extensive research and development efforts have been dedicated to advancing ILP solvers, such as SCIP (Achterberg, 2009) and Gurobi (Gurobi Optimization, 2021). These solvers are mainly based on traditional algorithms such as Branch-and-Bound (B&B) (Land & Doig, 2010) and Branch-and-Cut (B&C) (Mitchell, 2002), which are meticulously enhanced with various heuristics to improve efficiency and accuracy. Machine learning (ML) techniques, especially deep neural networks (DNNs), have recently shown great potential in solving or aiding the resolution of ILPs (Zhang et al., 2023; Li et al., 2024a). In practice, ILPs within specific scenarios often exhibit similar structures or patterns across instances, enabling ML methods to automatically identify and exploit these patterns to reduce computational complexity in a data-driven manner (Gasse et al., 2019). When trained on a dataset of instances, This work was done when Zijie Geng was an intern at Huawei. Corresponding author. Published as a conference paper at ICLR 2025 s. t. Ax b x {0, 1}n ILP Problem Predictor Predicted Solutions Differentiable Loss Gurobi Solutions (a) Previous Methods Supervised Learning Labels (b) Diff ILO Unsupervised Learning Gradient Descent Time-Consuming Figure 1: (a) Previous works mainly use supervised learning. They employ Gurobi to obtain solutions that serve as training labels, which is time-consuming. (b) Our proposed Diff ILO is an unsupervised learning approach. Its key idea is to design a differentiable loss function, enabling straightforward gradient descent methods to optimize the problems and the predictor simultaneously. DNNs are able to generalize to novel but similar instances, making decisions according to past experiences in a short time frame. The growing intersection of CO and ML has attracted considerable attention due to its potential to drive innovation and offer mutual benefits to both fields. Research efforts in this area can be generally categorized into two main streams. Some studies integrate ML into the traditional branch-and-bound framework, enhancing components such as branching decisions (Gasse et al., 2019; Kuang et al., 2024a;b; Liu et al., 2023; 2024b), separation processes (Li et al., 2023a), presolving (Kuang et al., 2023), and cut selection (Wang et al., 2023c; Ling et al., 2024; Wang et al., 2024b). Another line of research, which is the focus of this paper, directly employs ML to predict solutions (Khalil et al., 2022; Liu et al., 2024a; 2025). Recent advancements, such as Neural Diving (Nair et al., 2020; Yoon, 2022) and Predict-and-Search (Han et al., 2023; Huang et al., 2024; Zeng et al., 2024), have shown effectiveness in reducing the solving time. These methods typically follow a supervised learning paradigm, using traditional solvers like SCIP (Achterberg, 2009) and Gurobi (Gurobi Optimization, 2021) to generate near-optimal solutions as training labels. A predictor is trained to predict the near-optimal solutions from the given instances. Sophisticated heuristics are then applied to exploit the predicted solutions, thus accelerating the solving process. Despite these achievements, supervised learning methods still present significant challenges. First, they are time-extensive due to the need to repeatedly invoke traditional solvers to generate training labels. This is a fundamental limitation of supervised learning approaches (Wang et al., 2024a), which has driven researchers across various fields to turn their attention towards unsupervised learning approaches (Wang et al., 2022). Second, there exists an inherent misalignment between the objectives of minimizing prediction errors during training and generating high-quality solutions during inference. In other words, the predictor is trained to mimic the provided solutions rather than independently solving the problems. Consequently, these methods often yield plausible yet infeasible solutions (Zeng et al., 2024). In light of these issues, developing a differentiable approach for solving ILPs in an unsupervised learning paradigm becomes especially attractive. As illustrated in Figure 1, the core concept involves designing a differentiable loss function to directly optimize the problems and the predictor using gradient descent methods. In the field of CO, this differentiable loss is intuitively expected to align with the optimization objectives, thus enabling an end-to-end training framework that facilitates producing high-quality solutions. From a fundamental research perspective, such a method aligns with the ongoing academic pursuit of unsupervised learning. From a practical perspective, this method promises significant advantages, significantly reducing training time while improving the quality of generated solutions (Chen et al., 2023). To tackle this challenge, we propose Diff ILO (Differentiable Integer Linear Programming Optimization), a novel unsupervised learning approach for learning to generate high-quality ILP solutions. To the best of our knowledge, Diff ILO is the first method to employ pure ML techniques for training, without relying on traditional solvers, representing a front-line exploration of ML applications in the field of combinatorial optimization. Specifically, Diff ILO first relaxes the binary variables into continuous ones via probabilistic modeling, where the constraints are transformed into the form of expected violation. We theoretically demonstrate the equivalence preserved in this transformation. Diff ILO then adopts the penalty function method to convert the constrained problems into unconstrained ones. By further leveraging a reparameterization trick to flow the gra- Published as a conference paper at ICLR 2025 dient back-propagation, thus forming a differentiable (almost everywhere) loss function, Diff ILO optimizes both the ILP problems and the predictor parameters simultaneously via straightforward gradient descent methods. The training process is entirely unsupervised, thus significantly reducing the training time by bypassing the collection of labeled data. Moreover, the end-to-end optimization aligns the objectives of training and inference, thus consistently producing feasible and high-quality solutions. Extensive experiments demonstrate that Diff ILO not only achieves an average training speedup of 13.2 times compared to supervised methods, but also outperforms them by generating heuristic solutions with much higher feasibility ratios and significantly improved objective values. 2 RELATED WORK 2.1 MACHINE LEARNING FOR INTEGER LINEAR PROGRAMS Machine learning (ML) techniques have become increasingly prevalent in addressing combinatorial optimization (CO) problems, especially integer linear programs (ILPs) and mixed-integer linear programs (MILPs) (Bengio et al., 2021; Li et al., 2023b; Zhang et al., 2023; Huawei, 2021). Some studies incorporate ML models into heuristic components in modern solvers (He et al., 2014; Baltean Lugojan et al., 2019; Kuang et al., 2023; Li et al., 2024a), such as branching (Gasse et al., 2019), separation (Li et al., 2023a), and cut selection (Wang et al., 2023c), etc. Another line of research, which is the focus of this paper, employs ML to generate heuristic solutions (Nair et al., 2020; Yoon, 2022; Khalil et al., 2022; Ye et al., 2023b; Zeng et al., 2024). A notable recent advancement is the predict-and-search (PS) framework (Han et al., 2023; Huang et al., 2024), which first predicts initial solutions, and then employs solvers like Gurobi (Gurobi Optimization, 2021) or SCIP (Achterberg, 2009) to search within a strategically designed trust region for solution improvement. Such research is similar to the decision-focused learning (DFL) or predict-then-optimize framework (Elmachtoub & Grigas, 2022; Ferber et al., 2020; Zharmagambetov et al., 2024), which learns a model to map observable features into latent representation (e.g. coefficients in LP objective) used by solvers. 2.2 DIFFERENTIABLE APPROACHES Differentiable approaches aim to construct a differentiable optimization objectives, facilitating the straightforward application of gradient descent methods in an unsupervised, end-to-end manner. These methods have been effectively implemented in fields such as Partial Differential Equation (PDE) (Holl et al., 2020; Belbute-Peres et al., 2020) and Density Functional Theory (DFT) (Kvaal et al., 2014; Mathiasen et al., 2024; Li et al., 2024b), showcasing their effectiveness in solving such continuous problems. Applying these techniques to CO problems poses challenges due to the discrete nature of these problems. In the field of CO, differentiable approaches have been tailored for some specific problems, such as Boolean Satisfiability Problem (SAT) (Amizadeh et al., 2018) and Traveling Salesman Problem (TSP) (Gaile et al., 2022). Notably, Karalias & Loukas (2020) explored differentiable approaches for combinatorial optimization on graphs and developed Erd os Goes Neural, a differentiable and unsupervised learning framework that is similar to our approach. This concept has been further explored by some following works (Wang et al., 2022; Schuetz et al., 2022; Wang & Li, 2023). However, these methods are tailored for some specific problems and depend on custom-designed differentiable loss functions specific to these cases. To the best of our knowledge, extending these techniques to general ILPs remains non-trivial. 3 METHODOLOGY This section introduces our proposed Diff ILO framework, with an overview depicted in Figure 2. In Section 3.1, we present the probabilistic approach that reformulates a discrete, constrained ILP problem into a continuous, unconstrained optimization problem. Next, in Section 3.2, we adopt the Gumbel Softmax technique for reparameterization, which makes the problem differentiable almost everywhere (a.e.) and facilitates an efficient resolution via straightforward stochastic gradient descent. Finally, Section 3.3 details the implementation of the Diff ILO model, including its training and inference processes. The proofs and additional implementation specifics are available in Appendix A and Appendix B, respectively. Published as a conference paper at ICLR 2025 s. t. Ax b x {0, 1}n (P1) Discrete Constrained min ˆx c ˆx s. t. Ex p( |ˆx) [max(Ax b, 0)] = 0 (P2) Continuous Constrained min ˆx c ˆx + µ X s. t. ˆx [0, 1]n (P3) Continuous Unconstrained Bipartite Graph G GNN fθ Solution ˆx min ˆx c ˆx + µ X s. t. ˆx [0, 1]n (P4) Differentiable a.e. Thm. 5 Gradients Thm. 3 Sec. 3.1 Sec. 3.2 Reparameterization Thm. 4 Figure 2: Method overview of Diff ILO. We first transform the primal discrete and constrained problem into a continuous, unconstrained, and differentiable (a.e.) problem, as depicted by the blue arrows. Diff ILO employs a graph neural network (GNN) to predict solutions, as depicted by the black arrows. It optimizes both the ILP problem and the GNN parameters simultaneously through gradient descent, as depicted by the red arrows. 3.1 PROBABILISTIC MERIT FUNCTION We focus on integer linear programs (ILPs) that take the form of: min x c x | Ax b, x {0, 1}n . (P1) where n denotes the number of variables, c Rn denotes the objective coefficients, the matrix A = (a1, a2, , am) Rm n denotes the constraint coefficient matrix, m denotes the number of constraints, and b Rm denotes the right-hand-side biases of the constraints. Remark 1. Without loss of generality, we focus on ILPs with binary integer variables, also referred to as 0-1 programs. This simplification is reasonable given the fact that any bounded integer program can be converted into a binary (0-1) form (Dantzig, 2016; Ye et al., 2023a). An ILP is quintessentially a discrete and constrained optimization problem. Our first step is to reformulate it into a continuous problem. An intuitive approach is to relax the binary variables x {0, 1}n into continuous variables ˆx [0, 1]n, which is commonly known as the linear programming (LP) relaxation (Agmon, 1954; Bixby et al., 2004). However, clearly such relaxation alters the original solution space. To achieve an equivalent reformulation, we adopt a probabilistic approach (Karalias & Loukas, 2020), interpreting the continuous variables ˆx as probabilities associated with the binary variables. Specifically, each binary variable xi is assumed to follow a Bernoulli distribution, writing xi Bernoulli(ˆxi). The distribution of x is denoted as x p( |ˆx). Based on the probabilistic modeling, we reformulate the primal problem (P1) into: min ˆx c ˆx | Ex p( |ˆx) [max(Ax b, 0)] = 0, ˆx [0, 1]n . (P2) Remark 2. The novel reformulation in (P2) is motivated by the intuition that minimizing the expected constraint violations can restrict the distribution s support to the set of optimal solutions. For example, if a component ˆxi (0, 1), then the constraint Ex Bernoulli(ˆxi)[max(ax b, 0)] = 0 will ensure that ax b holds for both x = 0 and x = 1. Otherwise, if ˆxi = 0 or ˆxi = 1, Bernoulli(ˆxi) becomes a deterministic distribution and it indicates that aˆxi b holds. We now present the fundamental theoretical properties of (P2), establishing its equivalence with the primal problem (P1). Theorem 1 demonstrates the equivalence betweeen the two problems in terms of feasibility and solvability. Then, Theorem 2 shows that, loosely speaking, the optimal solutions to (P2) coincide with the optimal solutions to (P1). Theorem 1. The problem (P2) is feasible (and solvable, i.e., it admits at least one optimal solution) if and only if (P1) is feasible (and solvable). Published as a conference paper at ICLR 2025 Theorem 2. Let Ic {i [n] : ci = 0}. Then the following statements hold: 1. Suppose x {0, 1}n is an optimal solution to (P1). Then x is also an optimal solution to (P2). If a vector ˆx [0, 1]n is a feasible solution to (P2) and satisfies ˆx i = x i for all i Ic, then ˆx is an optimal solution to (P2). 2. Suppose ˆx [0, 1]n is an optimal solution to (P2). Then we have ˆx i {0, 1} for all i Ic. Let Iˆx = {i [n] : ˆx i {0, 1}}. If a vector x {0, 1}n satisfies x i = ˆx i for all i Iˆx , then x is an optimal solution to (P1). We can now conclude that transforming (P1) into its continuous format (P2) is well-justified. The next step is to apply the penalty function method to recast the constrained problem (P2) into an unconstrained format. We define ϕj(x) max(a j x bj, 0) and ˆϕj(ˆx) Ex p( |ˆx) [ϕj(x)] for each constraint indexed by j [m]. Plugging these penalty functions into the optimization objective results in the following unconstrained problem: min ˆx {Fµ(ˆx) c ˆx + µ X j ˆϕj(ˆx) | ˆx [0, 1]n}. (P3) Here, the function Fµ(ˆx) is referred to as a merit function (Nocedal & Wright, 1999). Its exactness is supported by the following theorem, which states that for a sufficiently large penalty coefficient µ, the penalty function method preserves the optimal solutions of the original problem. Theorem 3. There exists a positive scalar µ > 0 such that for any µ > µ , any optimal solution to (P3) is also an optimal solution to (P2). Remark 3. The conclusion in Theorem 3 differs from the established exact penalty function theory, which typically assumes the existence of a first-order Karush-Kuhn-Tucker (KKT) point (Di Pillo & Grippo, 1989). In contrast, our proof leverages the combinatorial properties of the primal problem. Remark 4. The probabilistic modeling and penalty function approach have been used in some previous studies for combinatorial optimization problems (Karalias & Loukas, 2020; Wang et al., 2022). However, these methods depend on predefined closed-form constraint penalties, which are hard to derive for general ILPs. As a result, they are applicable to some specific problems rather than general ILPs. Our key technical innovation lies in the transformation of constraints into an expectation form in (P2), eliminating the need for closed-form penalty designs. 3.2 GRADIENT BACK-PROPAGATION FOR OPTIMIZATION We are now interested in how to apply gradient-based methods, such as Stochastic Gradient Descent (SGD), to optimize (P3). This involves estimating the gradients ˆx ˆϕj(ˆx) = ˆx Ex p( |ˆx)[ϕj(x)]. Since ˆx appears in the sampling distribution and it is hard to derive a closed-form expression for the expectation in general ILPs, estimating the gradients is non-trivial. To address this, we employ the reparameterization trick to enable accurate and low-variance gradient estimates, thus facilitating efficient gradient back-propagation. Remark 5. An alternative approach for handling such non-differentiable computation graphs involving sampling is REINFORCE (also known as the score function estimator) (Williams, 1992). However, REINFORCE falls short as it does not explicitly propagate gradients from ϕj(x), leading to a potentially less efficient optimization process (Jang et al., 2017). Therefore, we favor the reparameterization trick in our approach. See Appendix D.2 for more details. We adopt the relaxed Bernoulli distribution (Maddison et al., 2016; Wang & Yin, 2020) for reparameterization, which is based on a simple observation illustrated in the following lemma. We denote the sigmoid function as σ(z) 1 1+e z , and its inverse, known as the logit function, as τ(p) σ( 1)(p) = log( p 1 p). We also denote the uniform distribution over (0, 1) as U(0, 1). Lemma 1. (Restated from (Maddison et al., 2016)) Let ˆx (0, 1) and ϵ be a random variable sampled from U(0, 1). We define ξ(ˆx; ϵ) σ (τ(ˆx) + τ(ϵ)). It follows that P (ξ(ˆx; ϵ) > 0.5) = ˆx. Remark 6. The distribution of ξ(ˆx; ϵ) is the so-called relaxed Bernoulli distribution. It serves as a soft approximation of the discrete Bernoulli distribution, enabling the gradient flow during backpropagation. It is also a specific application of the Gumbel-Softmax trick (Jang et al., 2017), and is also relevant to the random perturbed optimizers (Berthet et al., 2020). Published as a conference paper at ICLR 2025 Building on Lemma 1, we present the following theorem for reparameterization. Theorem 4. Let ˆx = (ˆx1, , ˆxn) (0, 1)n, and ϵ = (ϵ1, , ϵn) be a random vector, where each ϵi is independently and identically distributed (i.i.d.) as ϵi U(0, 1), writing ϵ pϵ( ). Let ξ(ˆx; ϵ) (ξ1, , ξn) , where ξi = ξ(ˆxi; ϵi) is defined as in Lemma 1. Let ψ(ˆx; ϵ) (ψ1, , ψn) , where ψi = [ξi] is the binary rounded value of ξi. It follows that: ˆϕj(ˆx) = Ex p( |ˆx) [ϕj(x)] = Eϵ pϵ( ) [ϕj (ψ(ˆx; ϵ))] . (1) Theorem 4 validates the effectiveness of the relaxed Bernoulli to reparameterization, in the sense that, the term ˆϕj(ˆx) can be accurately calculated by sampling random variables ϵ from a nonparametric distribution pϵ( ). However, note that the gradients derived from ψ, due to the existence of the rounding function, vanish everywhere. Therefore, while ψ can be used to compute the values of ϕj, we use ξ to flow gradients from ϕj to ˆx. Formally, we observe: ϕj(ψ(ˆx; ϵ)) = max{a j ψ(ˆx; ϵ) bj, 0} = a j ψ (ˆx; ϵ) bj I(a j ψ(ˆx; ϵ) bj > 0) a j ξ (ˆx; ϵ) bj I(a j ψ(ˆx; ϵ) bj > 0), where I( ) denotes the indicator function. In Equation 2, ψ is used to accurately determine whether the constraints are violated, thus preserving the combinatorial properties, while ξ acts as a surrogate to flow the gradient back-propagation. Consequently, we can approximate ˆϕj(ˆx) as ˆϕj(ˆx) ˆφj(ˆx) Eϵ pϵ( ) [φj(ˆx; ϵ)] , (3) where φj(ˆx; ϵ) a j ξ (ˆx; ϵ) bj I(a j ψ(ˆx; ϵ) bj > 0). (4) The advantage of ˆφj(ˆx) is that it is differentiable almost everywhere (a.e.), allowing for efficient gradient back-propagation. Therefore, we define the new surrogate problem as: min ˆx {Fµ(ˆx) c ˆx + µ j=1 ˆφj(ˆx) | ˆx [0, 1]n}. (P4) The merit function Fµ(ˆx) defined in (P4) is differentiable a.e., and thus (P4) can be resolved through gradient descent. Formally, we have the following theorem. Theorem 5. The merit function Fµ(ˆx) defined in (P4) is differentiable almost everywhere (a.e.) in (0, 1)n. At the differentiable points, the gradient is given by: ˆx Fµ(ˆx) = c + µ ϵ:a j ψ(ˆx;ϵ) bj>0 aj ˆx ξ(ˆx; ϵ) pϵ(ϵ)dϵ, (5) where denotes the element-wise product. Remark 7. Though not differentiable everywhere, modern deep learning frameworks such as Py Torch can properly handle such cases, even at non-differentiable points. Remark 8. We consider ˆx (0, 1)n as gradient calculations are only necessary within the interior of the solution space. In practice, we output logits and apply a sigmoid function to map it to (0, 1)n to represent the probabilities. As noted in Equation 4, we use the sampled binary solutions x p( |ˆx) to compute constraint violations. Therefore, information from the exact solutions, rather than relaxed ones, is also properly carried out, preserving the combinatorial nature of the problem. 3.3 MODEL IMPLEMENTATION So far, we have transformed the original ILP problem into a continuous, unconstrained, and differentiable (a.e.) problem that can be efficiently solved using straightforward gradient descent. Building on this transformation, we introduce the model implementation of Diff ILO, which simultaneously optimizes both the problem and the predictor parameters through direct gradient back-propagation, eliminating the need for labeled data. In line with the established practices in the field (Gasse et al., 2019; Han et al., 2023), we represent each ILP instance which is formulated as (P1) as a bipartite Published as a conference paper at ICLR 2025 graph G = (V W, E), where V and W denote the sets of variables and constraints, respectively, and E denotes the edges corresponding to the coefficients. Further details on the data representation are available in Appendix B.1. The predictor fθ, which is parameterized by θ and assumed to be differentiable with respect to θ, outputs the predicted variable probabilities ˆx = fθ(G). The model architecture is implemented as a Graph Neural Network (GNN) (Kipf & Welling, 2017; Shi et al., 2023; 2024; 2025), followed by a multilayer perceptron (MLP). The final layer applies a sigmoid function to ensure that the output probabilities ˆx (0, 1)n. More details on the model architecture are provided in Appendix B.2. Model Training During training, we update the parameters θ by optimizing the merit function Fµ(fθ(G)), as defined in (P4), through gradient descent. Specifically, let D = {G1, , G|D|} be a batch of ILP instances, each represented as a bipartite graph Gi, with ni variables and mi constraints. Let φi,j( ) be the penalty function corresponding to the previously defined φj( ) for the ith instance Gi, i.e., φi,j(ˆx; ϵ) a i,jξ (ˆx; ϵ) bi,j I a i,jψ(ˆx; ϵ) bi,j > 0 , (6) where ai,j and bi,j denote the coefficients and the right-hand-side terms of the jth constraints, respectively. For each instance Gi, we sample K random vectors ϵ(k) i U(0, 1)ni, where K is a hyperparameter. The training loss for this batch is defined as: L(θ; D) 1 |D| i=1 L(θ; Gi) = 1 |D| c fθ(Gi) + µ k=1 φi,j(fθ(Gi); ϵ(k) i ) The gradient of the loss function is then given by 1 j mi,1 k K: a i,jψ(fθ(Gi);ϵ(k) i ) bj>0 ˆx ξ fθ(Gi); ϵ(k) i We stabilize the training process through three useful techniques. First, to accommodate instances with ranging sizes and coefficient ranges, we apply a normalization to modify the loss function. Second, we find cosine annealing (Loshchilov & Hutter, 2016) to be beneficial for optimizing the learning schedule. Third, since the penalty coefficient µ is a critical hyperparameter in training, we introduce a dynamic and adaptive method for adjusting µ. These training techniques are further explained in detail in Appendix C.2. Model Inference During inference, for a given instance G, we can sample heuristic solutions from the predicted distributions p( |fθ(G)), which are immediately available. Generating heuristic solutions is particularly valuable in time-sensitive tasks or scenarios that require rapid decisionmaking, such as route planning and production scheduling. Moreover, the generated heuristic solutions can be used to improve the behavior of existing ILP solvers to find high-quality solutions within a constrained time frame. Diff ILO can theoretically be integrated into any framework that benefits from initialization with heuristic solutions, such as neural diving (Nair et al., 2020), large neighbourhood search (Huang et al., 2023), or Predict-and Search (Han et al., 2023), which are complementary to our approach. In this paper, inspired by Han et al. (2023) but with further simplification, we add the following constraint to the initial problem X ˆxi=0 xi + X ˆxi=1 (1 xi) < , (9) where ˆx is the generated heuristic solution, and is a hyperparameter. This constraint defines a trust region by limiting the number of variables which are different from the predicted ones to be fewer than . Besides, we provide ˆx for the solver as an initial solution. Further details on the inference process can be found in Appendix C.3. Published as a conference paper at ICLR 2025 0 25 50 75 100 Instance Objective Value 0 25 50 75 100 Instance 0 25 50 75 100 Instance BKS Gurobi PS PS+Gurobi Diff ILO Diff ILO+Gurobi Figure 4: The objective values of solutions generated by different approaches across the 100 test instances. BKS denotes the best known solutions. Gurobi denotes the heuristic module in Gurobi that searches for heuristic solutions before the exact resolution process. For a better visualization, objective values for the instances with no feasible solution found are assigned as 3000, 300, and 10000, respectively on these datasets. 4 EXPERIMENTS This section presents empirical results to demonstrate the effectiveness of our proposed Diff ILO in (1) generating feasible and high-quality heuristic solutions in an end-to-end manner, and (2) improving the overall performance of traditional solvers to find high-quality solutions within a constrained time frame. We then conduct a case study to provide some additional insights into the optimization process of Diff ILO. All training and evaluations are performed using the same hardware configuration, specifically an Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz, and an NVIDIA Ge Force RTX 3090 GPU. Code is available at https://github.com/MIRALab-USTC/L2O-Diff ILO. More experimental details can be found in Appendix C. PS (data collection) PS (training) Diff ILO (training) Figure 3: Training times of PS and Diff ILO on different datasets. Data collection refers to the time spent on solving training and validation instances to obtain labels, while training denotes the time spent on training the neural networks. Benchmarks We evaluate our method mainly on three widely used ILP problem benchmarks: set covering (SC) (Balas & Ho, 1980), maximum independent set problem (IS) (Bergman et al., 2015), and combinatorial auctions (CA) (Leyton-Brown et al., 2000). The datasets are generated using the code from Gasse et al. (2019). Following the settings described in the PS paper (Han et al., 2023), we generate 400 instances for each benchmark, 240 for training, 60 for validation, and 100 for testing, respectively. Additional details about these datasets can be found in Appendix C.1. To demonstrate the effectiveness of Diff ILO on realistic datasets, we conduct experiments on two subsets of MIPLIB 2017 (Gleixner et al., 2021). More details are in Appendix D.3. Baselines We compare our method against two main categories of baselines. First, we include traditional solvers, SCIP (Achterberg, 2009) and Gurobi (Gurobi Optimization, 2021), to evaluate whether the heuristic solutions generated by Diff ILO can accelerate the solving process. Second, we compare Diff ILO with the Predict-and-Search (PS) framework (Han et al., 2023), which first predicts solutions and then employs SCIP or Gurobi to search within a trust region for further improvement. PS and Neural Diving (ND) (Nair et al., 2020) are both representative supervised learning methods. However, since the prediction components in PS and ND are the same, we primarily include PS as our baseline. Some more recent supervised learning approaches such as those based on contrastive learning (Huang et al., 2024) or diffusion models (Zeng et al., 2024) have not been implemented. Although these methods can enhance the performance of supervised learning, they also lead to additional training time. For fairness, Diff ILO does not employ similar additional tricks either. The comparison includes capabilities of both methods to generate feasible solutions and to improve the performance of Gurobi and SCIP. We provide the results of some additional baselines, including ablation studies, in Appendix D.2. Published as a conference paper at ICLR 2025 Table 1: Average objective values obtained by different approaches at 10, 100, and 1, 000 seconds. We mark the best values in bold and underline the second-best values. Best known solution (BKS) refers to the solution obtained by running Gurobi for 3,600 seconds. SC (min, BKS: 86.45) IS (max, BKS:684.14) CA (max, BKS:22272.55) 10s 100s 1000s 10s 100s 1000s 10s 100s 1000s Gurobi 1031.39 87.09 86.52 682.02 684.12 684.13 22090.76 22242.58 22272.03 PS+Gurobi 131.87 125.26 125.26 684.13 684.13 684.13 22140.65 22243.12 22272.47 Diff ILO+Gurobi 95.65 86.78 86.48 684.00 684.12 684.14 22177.82 22260.48 22272.55 SCIP 96.15 89.91 86.93 660.79 679.80 684.05 21013.73 22151.71 22272.55 PS+SCIP 125.26 125.26 125.26 664.50 684.09 684.13 21712.73 22248.55 22272.55 Diff ILO+SCIP 94.16 87.47 86.57 674.30 684.06 684.13 21948.70 22256.08 22272.55 101 102 103 0.0 Average Primal Gap 101 102 103 0.000 101 102 103 0.000 101 102 103 Average Primal Gap 101 102 103 101 102 103 Gurobi SCIP PS+Gurobi PS+SCIP Diff ILO+Gurobi Diff ILO+SCIP Figure 5: The relative primal gap of different approaches as the solving process proceeds. The results are averaged across 100 test instances. 100 101 102 103 Gurobi Diff ILO+Gurobi (a) cvs16r106-72. 100 101 102 103 Gurobi Diff ILO+Gurobi (b) cvs16r128-89. Figure 6: The relative primal gap on two cvs test instances. Training and Inference We train Diff ILO for 1, 200 epochs on the SC and IS datasets, and 2, 400 epochs on the CA dataset. After each epoch, we use the trained model to generate solutions for the validation instances, recording the best feasible solutions and selecting the best epoch based on their average objectives. We train the PS predictor for 2, 400 epochs on all datasets, selecting the best epoch based on validating prediction loss. To improve traditional solvers, PS adds constraints to restrict the solution space within a trust region, which is controlled by some key hyperparameters. Finding suitable hyperparameters on different datasets is challenging and labor-intensive. In contrast, as shown in Formula 9, Diff ILO employs a simpler approach that does not require extensive hyperparameter tuning. In our experiments, we simply set = 200. More details about the training and inference processes can be found in Appendix C.2 and Appendix C.3, respectively. Training Time Comparison Figure 3 shows a comparison of the training times for Diff ILO and PS. The results show that supervised learning methods like PS spends much more time on collecting training labels than training the neural networks. In contrast, Diff ILO bypasses the labor-intensive labeling process, achieving an average speedup of 13.2 times across the three datasets. Generating Feasible Solutions We evaluate the ability of different methods to generate highquality feasible solutions. The evaluation is performed on the 100 test instances. In Figure 4, for PS and Diff ILO, we sample 30 solutions from the predicted distribution for each instance, select the best feasible solution, and report the obtained objective value. The average feasible ratio for Diff ILO, computed as (P instnace #feasible 30 )/(#instances), is 50.8%, 97.1%, and 99.4%, on SC, IS, and CA datasets, respectively. The results show that Diff ILO consistently produces high-quality feasible solutions even without solver assistance on almost all instances, while PS struggles to generate feasible solutions on many instances. Gurobi has a heuristic module to find a heuristic solution before Published as a conference paper at ICLR 2025 (a) z = 5x1 4x2 0.0 2.5 5.0 7.5 10.0 (b) #sampling=1 0.0 2.5 5.0 7.5 10.0 (c) #sampling=10 0.0 2.5 5.0 7.5 10.0 (d) Closed Form Figure 7: An illustrative example. (a) shows the objective function. (b) and (c) visualize the optimization processes of Diff ILO, which converge to the optimal solution. (d) visualizes the optimization process of optimizing the closed-form objective, which converges to a sub-optimal solution. executing the exact resolution process. We further assess the solutions found by combining these methods with this heuristic module. Results show that both PS and Diff ILO can improve Gurobi s heuristic solutions, but Diff ILO+Gurobi significantly outperforms PS+Gurobi, with objectives very close to the best known solutions. Improving traditional Solvers We then evaluate the capability of the generated solutions to accelerate the traditional solvers like Gurobi and SCIP to find better solutions in a constrained time frame. The results are in Table 1 and the solving processes are in Figure 5. The results demonstrate that Diff ILO consistently outperforms PS on the SC and CA datasets. We present the solving curves on the CVS dataset in Figure 6, and report the solving time on 3 neos test instances in Table 7 in Appendix D.3. Other additional experiments, including ablation studies and additional baselines, can be found in Appendix D. Case Study We present an illustrative example to demonstrate Diff ILO s optimization process. We consider a simple ILP problem min x1,x2 {0,1} { 5x1 4x2 | x1 + x2 1} . (10) This case is non-trivial for gradient descent, as its optimal solution (1, 0) and an sub-optimal solution (0, 1) have very close objectives. Figure 7 (a) displays the objective function z = 5x1 4x2. The transformed continuous unconstrained optimization problem is: min ˆx1,ˆx2 [0,1] n Fµ(ˆx1, ˆx2) 5ˆx1 4ˆx2 + µ Ex1,x2 [max(x1 + x2 1, 0)] o , (11) where xi Bernoulli(ˆxi). The closed form of Fµ(ˆx1, ˆx2) is derived as Fµ(ˆx1, ˆx2) = 5ˆx1 4ˆx2 + µ ˆx1ˆx2. (12) We first set µ = 20. Figures 7 (b) and (c) visualize the optimization process of Diff ILO, which samples x1 and x2 with the reparameterization trick for optimization. The numbers of sampled solutions (#sampling) are 1 and 10 in (b) and (c), respectively. They both converge to the optimal solution (1, 0). Moreover, increasing the number of samples stabilizes the optimization process. However, as shown in Figure 7 (d), when we directly optimize the smoothed function, i.e., the closed form of Fµ, it converges to a sub-optimal solutions (0, 1). We conduct experiments with 20 different random seeds. In 11 out of 20 runs, the optimization of the closed-form objective derives sub-optimal solutions. In contrast, Diff ILO s optimization approach derives the optimal solution in all the 20 runs. However, using the closed-form penalty function will always perform penalties. We further conduct experiments to investigate the influence of µ, and the results are shown in Table 9 in Appendix D.8. More discussions about this paper can be found in Appendix E. 5 CONCLUSION This paper proposes Diff ILO, a novel Differentiable Integer Linear Programming Optimization approach for learning to predict ILP solutions under an unsupervised learning paradigm. It is to our knowledge the first method that employs pure ML techniques to solve general ILPs entirely without the aid of traditional solvers. Experiments on commonly used ILP datasets demonstrate the effectiveness of Diff ILO in reducing training time and producing high-quality solutions. Published as a conference paper at ICLR 2025 REPRODUCIBILITY STATEMENT We provide the following information for the reproducibility of our proposed Diff ILO. The method is detailed in Section 3, with the proofs for theorems available in Appendix A. The implementation details, including data representation and model architecture, are provided in Appendix B The experimental details and results are in Section 4 and further elaborated in Appendix C. The code is publicly available at https://github.com/MIRALab-USTC/L2O-Diff ILO. ACKNOWLEDGMENTS The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported in part by the National Key R&D Program of China under contract 2022ZD0119801, National Nature Science Foundations of China grants U23A20388, and 62021001. This work was supported in part by Huawei as well. Tobias Achterberg. Scip: solving constraint integer programs. Mathematical Programming Computation, 1:1 41, 2009. Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382 392, 1954. Saeed Amizadeh, Sergiy Matusevych, and Markus Weimer. Learning to solve circuit-sat: An unsupervised differentiable approach. In International Conference on Learning Representations, 2018. Anonymous. A reoptimization framework for mixed integer linear programming with dynamic parameters. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=scd Gzuw C9u. under review. Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. Springer, 1980. Radu Baltean-Lugojan, Pierre Bonami, Ruth Misener, and Andrea Tramontani. Scoring positive semidefinite cutting planes for quadratic optimization via trained neural networks. preprint: http://www. optimization-online. org/DB HTML/2018/11/6943. html, 2019. Filipe De Avila Belbute-Peres, Thomas Economon, and Zico Kolter. Combining differentiable pde solvers and graph neural networks for fluid flow prediction. In international conference on machine learning, pp. 2402 2411. PMLR, 2020. Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d horizon. European Journal of Operational Research, 290(2): 405 421, 2021. David Bergman, Andr e Augusto Cir e, Willem Jan van Hoeve, and J. Hooker. Decision diagrams for optimization. Constraints, 20:494 495, 2015. Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with differentiable pertubed optimizers. Advances in neural information processing systems, 33:9508 9519, 2020. Hawthorne L Beyer, Yann Dujardin, Matthew E Watts, and Hugh P Possingham. Solving conservation planning problems with integer linear programming. Ecological Modelling, 328:14 22, 2016. Robert E Bixby, Mary Fenelon, Zonghao Gu, Ed Rothberg, and Roland Wunderling. Mixed-integer programming: A progress report. In The sharpest cut: the impact of Manfred Padberg and his work, pp. 309 325. SIAM, 2004. Published as a conference paper at ICLR 2025 Mathilde Caron, Hugo Touvron, Ishan Misra, Herv e J egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650 9660, 2021. Ziang Chen, Jialin Liu, Xinshang Wang, and Wotao Yin. On representing mixed-integer linear programs by graph neural networks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4gc3MGZra1d. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. George B Dantzig. Linear programming and extensions. In Linear programming and extensions. Princeton university press, 2016. Gianni Di Pillo and Luigi Grippo. Exact penalty functions in constrained optimization. SIAM Journal on control and optimization, 27(6):1333 1360, 1989. Adam N Elmachtoub and Paul Grigas. Smart predict, then optimize . Management Science, 68(1): 9 26, 2022. Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. Mipaal: Mixed integer program as a layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 1504 1511, 2020. El ıza Gaile, Andis Draguns, Em ıls Ozolin ˇs, and K arlis Freivalds. Unsupervised training for neural tsp solver. In International Conference on Learning and Intelligent Optimization, pp. 334 346. Springer, 2022. Maxime Gasse, Didier Ch etelat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32, 2019. Zijie Geng, Xijun Li, Jie Wang, Xiao Li, Yongdong Zhang, and Feng Wu. A deep instance generative framework for milp solvers under limited data availability. In Advances in Neural Information Processing Systems, 2023. Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: datadriven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443 490, 2021. LLC Gurobi Optimization. Gurobi optimizer. URL http://www. gurobi. com, 2021. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018. Qingyu Han, Linxin Yang, Qian Chen, Xiang Zhou, Dong Zhang, Akang Wang, Ruoyu Sun, and Xiaodong Luo. A gnn-guided predict-and-search framework for mixed-integer linear programming. In The Eleventh International Conference on Learning Representations, 2023. He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27, 2014. Philipp Holl, Vladlen Koltun, Kiwon Um, and Nils Thuerey. phiflow: A differentiable pde solving framework for deep learning via physical simulations. In Neur IPS workshop, volume 2, 2020. Taoan Huang, Aaron M Ferber, Yuandong Tian, Bistra Dilkina, and Benoit Steiner. Searching large neighborhoods for integer linear programs with contrastive learning. In International Conference on Machine Learning, pp. 13869 13890. PMLR, 2023. Taoan Huang, Aaron M Ferber, Arman Zharmagambetov, Yuandong Tian, and Bistra Dilkina. Contrastive predict-and-search for mixed integer linear programs. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id= zat Ln Lvbs8. Published as a conference paper at ICLR 2025 Huawei. Optverse solver. https://www. huaweicloud.com/product/modelarts/optverse.html, 2021. Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumble-softmax. In International Conference on Learning Representations (ICLR 2017). Open Review. net, 2017. Nikolaos Karalias and Andreas Loukas. Erdos goes neural: an unsupervised learning framework for combinatorial optimization on graphs. Advances in Neural Information Processing Systems, 33: 6659 6672, 2020. Elias B Khalil, Christopher Morris, and Andrea Lodi. Mip-gnn: A data-driven framework for guiding combinatorial solvers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 10219 10227, 2022. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https: //openreview.net/forum?id=SJU4ay Ygl. Arie MCA Koster, Manuel Kutschka, and Christian Raack. Towards robust network design using integer linear programming techniques. In 6th EURO-NGI Conference on Next Generation Internet, pp. 1 8. IEEE, 2010. Yufei Kuang, Xijun Li, Jie Wang, Fangzhou Zhu, Meng Lu, Zhihai Wang, Jia Zeng, Houqiang Li, Yongdong Zhang, and Feng Wu. Accelerate presolve in large-scale linear programming via reinforcement learning. ar Xiv preprint ar Xiv:2310.11845, 2023. Yufei Kuang, Jie Wang, Haoyang Liu, Fangzhou Zhu, Xijun Li, Jia Zeng, Jianye Hao, Bin Li, and Feng Wu. Rethinking branching on exact combinatorial optimization solver: The first deep symbolic discovery framework. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=j Kh NBul NMh. Yufei Kuang, Jie Wang, Yuyan Zhou, Xijun Li, Fangzhou Zhu, Jianye Hao, and Feng Wu. Towards general algorithm discovery for combinatorial optimization: Learning symbolic branching policy from bipartite graph. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=ULleq1Dtaw. Simen Kvaal, Ulf Ekstr om, Andrew M Teale, and Trygve Helgaker. Differentiable but exact formulation of density-functional theory. The Journal of chemical physics, 140(18), 2014. Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. Springer, 2010. Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM Conference on Electronic Commerce, pp. 66 76, 2000. Sirui Li, Wenbin Ouyang, Max B Paulus, and Cathy Wu. Learning to configure separators in branchand-cut. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. Xijun Li, Fangzhou Zhu, Hui-Ling Zhen, Weilin Luo, Meng Lu, Yimin Huang, Zhenan Fan, Zirui Zhou, Yufei Kuang, Zhihai Wang, et al. Machine learning insides optverse ai solver: Design principles and applications. ar Xiv preprint ar Xiv:2401.05960, 2024a. Yang Li, Jinpei Guo, Runzhong Wang, and Junchi Yan. From distribution learning in training to gradient search in testing for combinatorial optimization. In Advances in Neural Information Processing Systems, 2023b. Yang Li, Zechen Tang, Zezhou Chen, Minghui Sun, Boheng Zhao, He Li, Honggeng Tao, Zilong Yuan, Wenhui Duan, and Yong Xu. Neural-network density functional theory. ar Xiv preprint ar Xiv:2403.11287, 2024b. Haotian Ling, Zhihai Wang, and Jie Wang. Learning to stop cut generation for efficient mixedinteger linear programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 20759 20767, 2024. Published as a conference paper at ICLR 2025 Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, and Feng Wu. Promoting generalization for exact solvers via adversarial instance augmentation. ar Xiv preprint ar Xiv:2310.14161, 2023. Haoyang Liu, Jie Wang, Wanbo Zhang, Zijie Geng, Yufei Kuang, Xijun Li, Bin Li, Yongdong Zhang, and Feng Wu. MILP-Stu Dio: MILP instance generation via block structure decomposition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a. URL https://openreview.net/forum?id=W433RI0VU4. Haoyang Liu, Jie Wang, Zijie Geng, Xijun Li, Yuxuan Zong, Fangzhou Zhu, Jianye Hao, and Feng Wu. Apollo-MILP: An alternating prediction-correction neural solving framework for mixedinteger linear programming. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=m FY0t PDWK8. Hongyu Liu, Haoyang Liu, Yufei Kuang, Jie Wang, and Bin Li. Deep symbolic optimization for combinatorial optimization: Accelerating node selection by discovering potential heuristics. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2067 2075, 2024b. Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ar Xiv preprint ar Xiv:1611.00712, 2016. Alexander Mathiasen, Hatem Helal, Paul Balanca, Adam Krzywaniak, Ali Parviz, Frederik Hvilshøj, Blazej Banaszewski, Carlo Luschi, and Andrew William Fitzgibbon. Reducing the cost of quantum chemical data by backpropagating through density functional theory. ar Xiv preprint ar Xiv:2402.04030, 2024. John E Mitchell. Branch-and-cut algorithms for combinatorial optimization problems. Handbook of applied optimization, 1(1):65 77, 2002. Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid Von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, et al. Solving mixed integer programs using neural networks. ar Xiv preprint ar Xiv:2012.13349, 2020. Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999. David M Ryan and Brian A Foster. An integer programming approach to scheduling. Computer scheduling of public transport urban passenger vehicle and crew scheduling, pp. 269 280, 1981. Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367 377, 2022. Zhihao Shi, Xize Liang, and Jie Wang. LMC: Fast training of GNNs via subgraph sampling with provable convergence. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5VBBA91N6n. Zhihao Shi, Jie Wang, Fanghua Lu, Hanzhu Chen, Defu Lian, Zheng Wang, Jieping Ye, and Feng Wu. Label deconvolution for node representation learning on large-scale attributed graphs against learning bias. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11273 11286, 2024. doi: 10.1109/TPAMI.2024.3459408. Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, and Feng Wu. Accurate and scalable graph neural networks via message invariance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= Uqr FPhcm Fp. Lawrence Stewart, Francis Bach, Felipe Llinares-L opez, and Quentin Berthet. Differentiable clustering with perturbed spanning forests. Advances in Neural Information Processing Systems, 36, 2024. Published as a conference paper at ICLR 2025 Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. Haoyu Wang and Pan Li. Unsupervised learning for combinatorial optimization needs metalearning. ar Xiv preprint ar Xiv:2301.03116, 2023. Haoyu Wang, Jialin Liu, Xiaohan Chen, Xinshang Wang, Pan Li, and Wotao Yin. Dig-milp: a deep instance generator for mixed-integer linear programming with feasibility guarantee. ar Xiv preprint ar Xiv:2310.13261, 2023a. Haoyu Peter Wang, Nan Wu, Hang Yang, Cong Hao, and Pan Li. Unsupervised learning for combinatorial optimization with principled objective relaxation. Advances in Neural Information Processing Systems, 35:31444 31458, 2022. Hong Wang, Zhongkai Hao, Jie Wang, Zijie Geng, Zhen Wang, Bin Li, and Feng Wu. Accelerating data generation for neural operators via krylov subspace recycling. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/ forum?id=Upg RVWexa D. Jie Wang, Zijie Geng, Xijun Li, Jianye Hao, Yongdong Zhang, and Feng Wu. G2MILP: Learning to generate mixed-integer linear programming instances for milp solvers. Authorea Preprints, 2023b. Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, and Feng Wu. Learning to cut via hierarchical sequence/set model for efficient mixed-integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1 17, 2024b. doi: 10.1109/TPAMI.2024.3432716. Xi Wang and Junming Yin. Relaxed multivariate bernoulli distribution and its applications to deep generative models. In Conference on Uncertainty in Artificial Intelligence, pp. 500 509. PMLR, 2020. Zhihai Wang, Xijun Li, Jie Wang, Yufei Kuang, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, and Feng Wu. Learning cut selection for mixed-integer linear programming via hierarchical sequence model. In The Eleventh International Conference on Learning Representations, 2023c. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229 256, 1992. Huigen Ye, Hua Xu, and Hongyan Wang. Light-milpopt: Solving large-scale mixed integer linear programs with small-scale optimizer and small training dataset. In The Twelfth International Conference on Learning Representations, 2023a. Huigen Ye, Hua Xu, Hongyan Wang, Chengming Wang, and Yu Jiang. GNN&GBDT-guided fast optimizing framework for large-scale integer programming. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 39864 39878. PMLR, 23 29 Jul 2023b. Taehyun Yoon. Confidence threshold neural diving. ar Xiv preprint ar Xiv:2202.07506, 2022. Hao Zeng, Jiaqi Wang, Avirup Das, Junying He, Kunpeng Han, Haoyuan Hu, and Mingfei Sun. Effective generation of feasible solutions for integer programming via guided diffusion. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4107 4118, 2024. Jiayi Zhang, Chang Liu, Xijun Li, Hui-Ling Zhen, Mingxuan Yuan, Yawen Li, and Junchi Yan. A survey for solving mixed integer programming via machine learning. Neurocomputing, 519: 205 217, 2023. Arman Zharmagambetov, Brandon Amos, Aaron Ferber, Taoan Huang, Bistra Dilkina, and Yuandong Tian. Landscape surrogate: Learning decision losses for mathematical optimization under partial information. Advances in Neural Information Processing Systems, 36, 2024. Published as a conference paper at ICLR 2025 Theorem 1. The problem (P2) is feasible (and solvable, i.e., it admits at least one optimal solution) if and only if (P1) is feasible (and solvable). Proof. First, we show the equivalence of the feasibility. It is obvious that any feasible solution x to (P1) is also a feasible solution to (P2). Conversely, let us consider ˆx as a solution to (P2). It follows that Ex p( |ˆx ) [max(Ax b, 0)] = 0. Notably, supp p( |ˆx ) is not empty, and for x supp p( |ˆx ), we have max(Ax b, 0) = 0, and thus Ax b. Consequently, x is a feasible solution to (P1). Next, we show the equivalence of solvability. Notice that the domain of (P1) is {0, 1}n, which is a finite set. Therefore, (P1) is solvable if and only if it is feasible. Similarly, as the objective of (P2) is a continuous function over the compact set [0, 1]n, (P2) is solvable if and only if it is feasible. Based on these observations, we conclude that (P2) is solvable if and only if (P1) is solvable. Theorem 2. Let Ic {i [n] : ci = 0}. Then the following statements hold: 1. Suppose x {0, 1}n is an optimal solution to (P1). Then x is also an optimal solution to (P2). If a vector ˆx [0, 1]n is a feasible solution to (P2) and satisfies ˆx i = x i for all i Ic, then ˆx is an optimal solution to (P2). 2. Suppose ˆx [0, 1]n is an optimal solution to (P2). Then we have ˆx i {0, 1} for all i Ic. Let Iˆx = {i [n] : ˆx i {0, 1}}. If a vector x {0, 1}n satisfies x i = ˆx i for all i Iˆx , then x is an optimal solution to (P1). Proof. 1. Let x {0, 1}n be an optimal solution to (P1). Obviously it is also a feasible solution to (P2). Assume ˆx [0, 1]n is a feasible solution to (P2) and satisfies ˆx i = x i for all i Ic. It follows that c ˆx = c x . Let ˆx [0, 1]n be any feasible solution to (P2). For x supp p( |ˆx), we have max(Ax b, 0) = 0, and thus Ax b, which indicates that x is a feasible solution to (P1) and thus c x c x. Then we have c x c Ex p( |ˆx)[x] = c ˆx. It follows that c ˆx = c x c ˆx, indicating that ˆx is an optimal solution to (P2). This conclusion, combined with the feasibility of x , gives that x is also an optimal solution to (P2). 2. Let ˆx [0, 1]n be an optimal solution to (P2). Consider if ˆx i (0, 1) for some i Ic. Without loss of generality, we assume ci > 0. Define ˆx such that ˆx i = 0 and ˆx j = ˆxj for j = i. Then supp p( |ˆx ) supp p( |ˆx ), and so ˆx remains a feasible solution to (P2). However, c ˆx < c ˆx , contradicting the optimality of ˆx . Therefore ˆx i {0, 1} for all i Ic. Let x {0, 1}n satisfies x i = ˆx i for all i Iˆx . Then x supp p( |ˆx ) and thus it is a feasible solution to (P1). We also have c x = c ˆx . Let x {0, 1}n be a feasible solution to (P1). Then it is also a feasible solution to (P2), indicating that c ˆx c x. In follows that c x = c ˆx c x, indicating that x is an optimal solution to (P1). Theorem 3. There exists a positive scalar µ > 0 such that for any µ > µ , any optimal solution to (P3) is also an optimal solution to (P2). Proof. We define ϕ(x) P j ϕj(x) and ˆϕ(ˆx) P j ˆϕj(ˆx), which leads to ˆϕ(ˆx) = Ex p( |ˆx)[ϕ(x)]. The domain of ϕ( ) is {0, 1}n, a finite set. Therefore, we can find ρ min x supp ϕ( ) ϕ(x). (13) It follows that ϕ(x) ρ for x supp ϕ( ). We set µ = 2 n c 2 ρ , and assume µ > µ . Let ˆx [0, 1]n be an optimal solution to (P3) and ˆx be any feasible solution to (P3). It suffices to show that ˆϕ(ˆx ) = P j ˆϕj(ˆx ) = 0 and that c ˆx c ˆx. We denote X arg minx supp p( |ˆx ){c x + µ ϕ(x)}. The optimality of ˆx indicates that supp p( |ˆx ) = X , and that c ˆx + µ ˆϕ(ˆx ) c ˆx + µ ˆϕ(ˆx). (14) Published as a conference paper at ICLR 2025 The feasibility of ˆx indicates that ˆϕ(ˆx) = 0, and thus c ˆx + µ ˆϕ(ˆx ) c ˆx. (15) Therefore, for x X = supp p( |ˆx ), we have µ ϕ(x ) c (ˆx x ) c 2( ˆx 2 + x 2) 2 n c 2. (16) Consider if ˆϕ(ˆx ) > 0, which implies that supp p( |ˆx ) supp ϕ( ) = . Let x supp p( |ˆx ) supp ϕ( ). Then we have µ < µ c (ˆx x ) ϕ(x ) c 2( ˆx 2 + x 2) ρ = µ , (17) leading to a contradiction. Therefore, we have ˆϕ(ˆx ) = 0. Plugging it into Equation 14 completes the proof. Lemma 1. (Restated from (Maddison et al., 2016)) Let ˆx (0, 1) and ϵ be a random variable sampled from U(0, 1). We define ξ(ˆx; ϵ) σ (τ(ˆx) + τ(ϵ)). It follows that P (ξ(ˆx; ϵ) > 0.5) = ˆx. Proof. We have P (ξ(ˆx; ϵ) > 0.5) =P log( ˆx 1 ˆx) + log( ϵ 1 ϵ) > 0 =P ˆx 1 ˆx ϵ 1 ϵ > 1 =P (ϵ > 1 ˆx) = ˆx. Theorem 4. Let ˆx = (ˆx1, , ˆxn) (0, 1)n, and ϵ = (ϵ1, , ϵn) be a random vector, where each ϵi is independently and identically distributed (i.i.d.) as ϵi U(0, 1), writing ϵ pϵ( ). Let ξ(ˆx; ϵ) (ξ1, , ξn) , where ξi = ξ(ˆxi; ϵi) is defined as in Lemma 1. Let ψ(ˆx; ϵ) (ψ1, , ψn) , where ψi = [ξi] is the binary rounded value of ξi. It follows that: ˆϕj(ˆx) = Ex p( |ˆx) [ϕj(x)] = Eϵ pϵ( ) [ϕj (ψ(ˆx; ϵ))] . (1) Proof. By Lemma 1, we have P(ψi = 1) = P(ξi > 0.5) = ˆxi, (19) which implies that p(x|ˆx) = p(ψ(ˆx; ϵ) = x|ˆx). Therefore, ˆϕj(ˆx) = Ex p( |ˆx) [ϕj(x)] x {0,1}n ϕj(x)p(x|ˆx) x {0,1}n ϕj(x)p(ψ(ˆx; ϵ) = x|ˆx) x {0,1}n ϕj(x) Z ϵ:x=ψ(ˆx;ϵ) p(x|ˆx, ϵ)pϵ(ϵ)dϵ x=ψ(ˆx;ϵ) ϕj(x) =Eϵ pϵ( ) [ϕj(ψ(ˆx; g))] . Published as a conference paper at ICLR 2025 Theorem 5. The merit function Fµ(ˆx) defined in (P4) is differentiable almost everywhere (a.e.) in (0, 1)n. At the differentiable points, the gradient is given by: ˆx Fµ(ˆx) = c + µ ϵ:a j ψ(ˆx;ϵ) bj>0 aj ˆx ξ(ˆx; ϵ) pϵ(ϵ)dϵ, (5) where denotes the element-wise product. Proof. We have ˆx Fµ(ˆx) = ˆx j=1 ˆφj(ˆx) j=1 ˆx ˆφj(ˆx) = c + µ j=1 ˆx Eϵ pϵ( ) [φj(ˆx; ϵ)] j=1 Eϵ pϵ( ) ˆx a j ξ (ˆx; ϵ) bj I a j ψ(ˆx; ϵ) bj > 0 ϵ:a j ψ(ˆx;ϵ) bj>0 ˆx a j ξ (ˆx; ϵ) bj pϵ(ϵ)dϵ ϵ:a j ψ(ˆx;ϵ) bj>0 aj ˆx ξ(ˆx; ϵ) pϵ(ϵ)dϵ. Published as a conference paper at ICLR 2025 B IMPLEMENTATION DETAILS B.1 DATA REPRESENTATION Following previous works (Gasse et al., 2019; Han et al., 2023; Geng et al., 2023; Wang et al., 2023b), we represent each ILP problem as a weighted bipartite graph G = (V W, E), where V and W denote the sets of variables and constraints, respectively. The graph is equipped with a tuple of feature matrices (V, W, E), and the description of these features can be found in Table 2. Table 2: Description of variable, constraint, and edge features in our bipartite graph representation. Tensor Feature Description Objective Normalized objective coefficient. Variable coefficient Average variable coefficient in all constraints. Variable degree Degree of the variable node in the bipartite graph representation. Maximum variable coefficient Maximum variable coefficient in all constraints. Minimum variable coefficient Minimum variable coefficient in all constraints. W Constraint coefficient Average of all coefficients in the constraint. Constraint degree Degree of constraint nodes. Bias Normalized right-hand-side of the constraint. E Coefficient Constraint coefficient. B.2 MODEL ARCHITECTURE We employ a graph neural network (GNN), parameterized by θ, as the predictor. Specifically, given a bipartite graph G = (V W, E) equipped with the feature metrices (V, W, E), we use MLPs as embedding layers to obtain the initial embeddings : h(0) vi = MLPθ(vi), h(0) wj = MLPθ(wj), heij = MLPθ(eij). (22) After that, we perform K graph convolution layers, with each layer in the form of two interleaved half-convolutions (Gasse et al., 2019), defined as follows: h(k+1) wi MLPθ h(k) wi , X j:eij E MLPϕ h(k) wi , heij, h(k) vj h(k+1) vj MLPϕ h(k) vj , X i:eij E MLPϕ h(k+1) wi , heij, h(k) vj Each convolution layer is followed by two Graph Norm layers, one for variables and the other for constraints. We employ a concatenation Jumping Knowledge layer to aggregate information from all K layers and obtain the final node representations: CONCAT k=0, ,K h(k) vi , hwj = MLPθ CONCAT k=0, ,K h(k) wj . (24) Subsequently, we use another MLP to output the predicted logits for each variable: zvi = MLPθ (hvi) . (25) The logits are then used for resampling module, followed by a sigmoid function to output the generated solutions. Published as a conference paper at ICLR 2025 C EXPERIMENTAL DETAILS C.1 DETAILS OF THE BENCHMARKS We use three commonly used ILP benchmarks in our experiments. The data instances are generated using the code from https://github.com/ds4dm/learn2branch. We list the benchmark information in Table 3, including the generation algorithms, average numbers of constraints and average numbers of variables. Table 3: Statistics of the benchmarks. Dataset Generation Number of Constraints Number of Variables SC (Balas & Ho, 1980) 3000 2000 IS (Bergman et al., 2015) 5943 1500 CA (Leyton-Brown et al., 2000) 576 1500 C.2 TRAINING DETAILS As mentioned in Section 3.3 in the main text, here we introduce three useful techniqques in our training process. Normalization In practice, we conduct a normalization and modify the loss function on each instance Gi as c 2 fθ(Gi) + µ φi,j(fθ(Gi); ϵ(k) i ) Mi ai,j 2 , Mi > 0, c 2 fθ(Gi), Mi = 0, where Mi is the number of constraints violated. Learning Rate Annealing To facilitate a continuing model optimization and alleviate local optimum, we adopt a cosine annealing scheduler for the learning rate (Loshchilov & Hutter, 2016), with a period denoted as lr T. The training curves in Figure 10 in Appendix D.4 demonstrate the influece of the cosine annealing of learning rate on the training progress. Adaptive Penalty Coefficient The penalty coefficient µ is an important hyperparameter in Diff ILO, which influences the convergence of the training process. It is not set on per-instance level, but setting it as a single value is enough. Our probability modeling approach can somehow reduce the influence of µ. Specifically, the penalty term in our method is defined as P j Ex p( |ˆx)[max(a j x bj, 0)]. Notice that the penalty is only activated when when the constraint is violated. Thus, even if the penalty parameter is set relatively large, the penalty term is less likely to dominate the loss function significantly if the constraints are not violated. To reduce the need for manual parameter adjustment for µ, we use a dynamic and adaptive µ, which is inspired by the adaptive temperature in soft actor-critic algorithm (Haarnoja et al., 2018). Specifically, after each epoch, we update the coefficient µ according to the updating rule µk+1 = µk + mu step (cons cons targ), (27) where cons denotes the average constraint violation in this epoch, and cons targ is the target value of the average constraint violation. Empirically the hyperparameter mu targ is set as no more than 1 (according to the range of coefficients), as this indicates that there exist solutions with no constraint violation in a probabilistic sense. This dynamic way for tuning µ can effectively improve the algorithm robustness against the choice of µ. We present the training curves with different values the parameter µ and analyze the influence of the adaptive strategy for µ. The results are in Appendix D.5. Some important hyperparameters of the model training are provided in Table 4. Published as a conference paper at ICLR 2025 Table 4: Hyperparameters in our experiments. Hyperparameter SC IS CA Description embed size 32 32 32 The embedding size of the GNN predictor. depth 3 10 10 The depth of the GNN predictor. batch size 5 5 5 Number of ILP problems in each training batch. num samples 15 15 15 Number of sampled solutions for reparameterization. num epochs 1,200 1,200 2,400 Number of max running epochs. optimizer Adam Adam Adam Optimizer for training. learning rate 1e-4 8e-5 8e-5 Leaning rate for training. lr T 200 200 200 The period for learning rate cosine annealing. mu init 5.0 100.0 15 The initial value of µ. mu step 0.01 1.0 0.001 Step size for optimizing µ. cons targ 1.0 0.1 0.1 Target value of average constraint violation. C.3 INFERENCE DETAILS To incorporate the heuristic solutions into traditional solvers like Gurobi and SCIP, we define a trust region by limiting the number of variables which are different from the predicted ones to be fewer than 200. In practice, we add the following constraint to the original problem X ˆxi=0 xi + X ˆxi=1 (1 xi) < , (28) where ˆx denotes the provided heuristic solution, and is a hyperparameter. In our experiments, we simply set = 200. We also supply our best-found solution as an initial solution to the solver. For Gurobi, this is implemented as: for i, v in enumerate(m.get Vars()): v.Start = best_x[i] and for SCIP it is implemented as: sol = m.create Partial Sol() for i, v in enumerate(m.get Vars()): m.set Sol Val(sol, v, best_x[i]) m.add Sol(sol) Only one best solution is provided to the solver. Specifically, for each instance, we sample 1,000 solutions from the predicted distribution. We select the best feasible solution, if any, based on the objective value. If no feasible solution found, we select the one with the minimal merit function. During inference, SCIP 8.1.0 (Achterberg, 2009), Gurobi 10.0.1 (Gurobi Optimization, 2021) are used for solving instances. Following Han et al. (2023), we configure the solvers towards the heuristic-first mode the MIPFocus parameter for Gurobi and the AGGRESSIVE parameter in SCIP so that they will focus on finding better primal solutions. Specifically, we set m.Params.MIPFocus = 1 for Gurobi and m.set Heuristics(SCIP PARAMSETTING.AGGRESSIVE) for SCIP, respectively. The time limit for running each experiment is set to 1, 000 seconds. To improve traditional solvers, PS (Han et al., 2023) adds constraints to restrict the solution space within a trust region. The trust region search algorithm is controlled by three key hyperparameters, k0, k1, and . Searching for suitable hyperparameters for PS on different datasets is challenging and labor-intensive. For IS and CA, we use the default hyperparameters specified in the original paper. For SC, which is not included in the original paper, we conduct the hyperparameter search as follows: we first fix k0 and k1 to 100 and experiment with values in {5, 10, 15, 20}. We then experiment with k0 and k1 values in {100, 200, 300} with fixed . Published as a conference paper at ICLR 2025 D ADDITIONAL RESULTS D.1 SHIFTED GEOMETRIC MEAN OF RELATIVE GAPS To better understand the results, we also report the shifted geometric mean (SGM) of relative gaps of the instances in Table 5, which is a usually used metric to measure the model performance in the ILP community. Specifically, suppose the dataset contains N instances, the ith instance s best know objective is BKSi, and a method achieves an objective value OBJi. Its relative gap is defined as gapi = |OBJi BKSi| |BKSi| . (29) The shifted geometric mean across all instances is defined as i log(gapi + 1.0) Table 5: Shifted geometric mean (SGM) of relative gaps of different methods on all datasets. Lower SGM indicates better performance. We mark the best results in bold and underline the second-best results. SC IS CA 10s 100s 1000s 10s 100s 1000s 10s 100s 1000s Gurobi 10.6222 0.0070 0.0008 0.0031 2.97E-05 1.50E-05 0.0059 0.0013 3.48E-06 PS+Gurobi 0.5085 0.449 0.449 1.50E-05 1.50E-05 1.50E-05 0.0044 0.0009 3.61E-10 Diff ILO+Gurobi 0.1046 0.0037 0.0005 0.0002 2.96E-05 0 0.0042 0.0005 4.10E-10 SCIP 0.11 0.0373 0.0052 0.0341 0.0063 0.0001 0.0563 0.0054 1.0067 PS+SCIP 0.449 0.449 0.449 0.0286 7.36E-05 1.50E-05 0.0249 0.0011 1.00E-07 Diff ILO+SCIP 0.0873 0.0112 0.0015 0.0143 1.00E-04 1.50E-05 0.0145 0.0007 1.00E-07 D.2 RESULTS OF ADDITIONAL BASELINES In the main text, our main baseline is the PS method, which is a representative supervised learning approach. In this section, we include three additional baselines as follows, and the experimental results on the SC dataset are shown in Table 6 and Figure 8. CL-LNS (Huang et al., 2023) is a large neighborhood search (LNS) approach, a different framework from branch and bound (Bn B). We compare with CL-LNS to demonstrate the effectiveness of Diff ILO compared with LNS-based methods. Con Pa S (Huang et al., 2024) adopts contrastive learning for enhancing the PS framework. As they have not provided publicly released code, we implement the approach based on the paper s details. DDIM (Zeng et al., 2024) adopts diffusion models to learn to generate feasible solutions. We used the authors released code. Naive Relaxation We include a naive relaxation baseline for the ablation study to demonstrate the effectiveness of our proposed relaxation approach. Specifically, in this baseline, we directly optimize the penalty term P j max(a j ˆx bj, 0) instead of our proposed probabilistic one P j Ex p( |ˆx)[max(a j x bj, 0)]. The parameter µ is further tuned to achieve good convergence. Different from our approach, the naive baseline views the problem as a simple continuous one without considering the discrete nature of the original problem. Published as a conference paper at ICLR 2025 REINFORCE rather than reparameterization As noted in Remark 5, an alternative approach for handling non-differentiable computation graphs involving sampling is the REINFORCE method. To demonstrate the effectiveness of using reparameterization, we implement a REINFORCE method as a baseline. It computes gradients as ˆx Ex p( |ˆx)[C(x)] = Ex p( |ˆx)[C(x) ˆx log p( |ˆx)], (31) where C(x) denotes the merit function as defined in (P3). We find that all models collapse towards minimal objectives but significant constraint violations, even if we set a very large µ. This tendency underscores the well-known training challenges associated with RL models. Specifically, the REINFORCE method relies on random exploration without gradient guidance. When a solution is reached, the model receives only a reward signal but is unaware of the inherent components of the reward or the gradient information at the current point. In the vast search space, the absence of gradient-directed exploration can lead models to converge to trivial yet infeasible solutions. The results further demonstrate the necessity of re-parameterization trick for this task. Table 6: Results of additional baselines. We report the objective values achieved at 10s, 100s, and 1000s, respectively. The results show that Diff ILO still outperforms these baselines. Another baseline, REINFORCE, has failed to derive meaningful results and thus is not reported. SC (Min, BKS: 86.45) 10s 100s 1000s CL-LNS 203.27 91.96 86.77 Naive Relaxation 132.1 94.71 87.05 Diff ILO 95.65 86.78 86.48 We test the abilities of Diff ILO, Con Pa S, and DDIM to generate feaisble solutions. The results are shown in Figure 8. The results show that Con Pa S still fails to generate feasible solutions across most instances. DDIM demonstrates strong feasibility rates and successfully generates feasible solutions for all instances. However, when considering solution quality, Diff ILO still outperformed DDIM in terms of objective values. This highlights the strength of Diff ILO in producing higher-quality solutions. 0 25 50 75 100 Instance Objective Value BKS DFL Con Pa S DDIM Figure 8: Results of Diff ILO, Con Pa S (Huang et al., 2024), and DDIM (Zeng et al., 2024) on generating feasible solutions on SC. D.3 RESULTS ON MIPLIB DATASETS To demonstrate the effectiveness of Diff ILO on a more complex and realistic dataset, we conduct additional experiments on the MIPLIB 2017 benchmark (Gleixner et al., 2021), which is well known as a collection of challenging real-world MILP instances. Notice that MIPLIB contains instances across many different scenarios, and many of the large-scale problems do not have isomorphic counterparts, learning directly on the full MIPLIB is extremely challenging. Following Wang et al. (2023a), we first construct a subset of MIPLIB, called MIPLIB-CVS , to validate the effectiveness of Diff ILO. Specifically, it contains five capacitated vertex separator (CVS) problem instances from MIPLIB 2017. They are cvs08r139-94, cvs16r70-62, cvs16r89-60, cvs16r106-72, Published as a conference paper at ICLR 2025 and cvs16r128-89. We use the first three instances for training Diff ILO, and then test the model on the last two instances. The total training process takes about only 9 minutes. If we use supervised learning, it will take more than three hours to solve the training instances for providing labels. The solving progresses are shown in Figure 6, which overall demonstrate that Diff ILO can accelerate the solving process even on complex realistic datasets. Notably, on the cvs16r128-89, Diff ILO achieves the optimal solution (which is 97.0) in 1, 000 seconds, which even surpasses the result obtained by running Gurobi for 3, 600 seconds (which is 96.0). 100 101 102 103 Gurobi Diff ILO+Gurobi (a) cvs16r106-72. 100 101 102 103 Gurobi Diff ILO+Gurobi (b) cvs16r128-89. Figure 9: The relative primal gap on the two test instances. We have conducted an experiment on another subset of MIPLIB. We construct a subset of MIPLIB, called neos , to validate the effectiveness of Diff ILO. Specifically, we collect all binary instances with neos in their names and select the instances whose .mps files contain no more than 500,000 lines. We identify 25 such instances in total. Then, we randomly select 20 instances for training and use another 5 for test. Among these 5 instances, neos-952987 and neos-4382714-ruvuma derive no feasible solutions solved by Gurobi in 1,000 seconds. We report the solving time on the other 3 test instances, neos-829552, neos-831188, neos18 in Table 7. Table 7: Solving time on the 3 test instances from neos. neos-829552 neos-831188 neos18 Gurobi 44.83 63.24 4.24 Gurobi+Diff ILO 43.08 60.91 4.28 While there were slight overall improvements, they were not significant enough to draw firm conclusions. We attribute this to the inherent heterogeneity of the neos dataset. According to the MIPLIB website, the neos instances originate from diverse scenarios with unknown applications. This poses significant challenges for ML-based approaches, which rely on common patterns and generalizations across instances. Additionally, we find that the heterogeneity among training samples led to unstable training processes, further complicating evaluation. The results demonstrate that training on heterogeneous datasets still pose challenges to Diff ILO. D.4 TRAINING CURVES We present the training curves of Diff ILO in Figure 10. The consistent progress of training and validation curves shows that Diff ILO exhibits good generalization. The consistent progress of losses and objectives shows that Diff ILO effectively aligns the training and inference objectives. D.5 INFLUENCE OF ADAPTIVE PENALTY COEFFICIENT We present the training curves with two different settings of the penalty coefficient µ in Figure 11. The results show that with our proposed adaptive penalty strategy, the training process is robust to different configurations of µ, and the coefficient µ can adjust itself towards a reasonable value. Published as a conference paper at ICLR 2025 0 10000 20000 30000 40000 50000 0.00 SC train loss 0 10000 20000 30000 40000 50000 102 104 SC train objs 0 200 400 600 800 1000 0 SC valid loss 0 200 400 600 800 1000 102 104 SC valid objs 0 10000 20000 30000 40000 50000 IS train loss 0 10000 20000 30000 40000 50000 IS train objs 0 200 400 600 800 1000 675 IS valid loss 0 200 400 600 800 1000 675 IS valid objs 0 20000 40000 60000 80000 100000 CA train loss 0 20000 40000 60000 80000 100000 20000 CA train objs 0 500 1000 1500 2000 20000 CA valid loss 0 500 1000 1500 2000 CA valid objs Figure 10: The learning curves of Diff ILO on different datasets. Objs denotes the objective values of the best-found solutions at each step. 0 1000 2000 3000 4000 5000 6000 Training Curve of µ 0 25 50 75 100 125 Training Curve of Best Feasible Solution 0 25 50 75 100 125 Training Curve of Cons Violation Loss 0 25 50 75 100 125 Training Curve of Objective Loss Figure 11: Training curves for different configurations of the penalty parameter µ. In the two experiments, we set the initial value of µ as 1.0 and 5.0 (with miximum set as 5.0), respectively. The parameter µ is dynamically adaptive during training. Even if we set a small initial value for µ, e.g., 1.0 here, it can be adaptively adjusted towards a proper value, i.e., 5.0 here. D.6 ABLATION STUDIES We have conducted experiments to evaluate key choices, and the results are in Figure 12. The number of samples. We conduct experiments on a SC dataset, with results shown in Figure 12(a). We evaluated sample sizes of 5, 10, 15, 20, and 25. While larger sample sizes resulted in slightly smoother training curves and smaller sample sizes led to a little early convergence in the early stage of training, the overall results do not show significant differences. This demonstrates the robustness of Diff ILO to this parameter. For the main experiments, we just empirically set the sample size to 15. We also compared performance with and without the proposed normalization techniques. The results, presented in Figure D.6(b), show that our normalization method significantly accelerates convergence compared to directly summing all penalty terms. We also tested averaging the constraint penalties instead of summing them, which resulted in worse validation performance and thus is not presented in the figure. D.7 GENERALIZATION RESULTS We further test the zero-shot generalizability of PS and Diff ILO. Specifically, the models are trained on small SC instances (with 3, 000 constraints an 2, 000 variables), and tested on large SC instances (with 6, 000 constraints and 4, 000 variables). The results are in Table 8, demonstrating that Diff ILO generalizes well to large-sized instances. This may be because the unsupervised training approach Published as a conference paper at ICLR 2025 Table 8: Generalization to large-size datasets. The models are trained on small SC instances (with 3000 constraints an 2000 variables), and tested on large SC instances (with 6000 constraints and 4000 variables). Results show that Diff ILO performs well on large-sized instances, indicating its generalization ability. SC-small (BKS: 86.45) SC-large (BKS: 79.35) 10s 100s 1000s 10s 100s 1000s Gurobi 1031.39 87.09 86.52 993.65 85.92 79.58 PS+Gurobi 131.87 125.26 125.26 144.76 131.45 131.45 Diff ILO+Gurobi 95.65 86.78 86.48 97.83 84.72 79.55 0 100 200 300 num sample=5 num sample=10 num sample=15 num sample=20 num sample=25 (a) numsample 0 200 400 600 800 1000 w/ normalization w/o normalization (b) normalization Figure 12: Ablation studies on (a) number of samples in each step and (b) the normalization trick used for training. encourages the model to learn the fundamental mechanisms needed to solve problems, instead of merely memorizing simple statistical patterns in the data, thus outperforming supervised methods. D.8 CASE STUDY We have conducted additional experiments to investigate how the penalty parameter µ affects the success rate. The results are in Table 9. In this table, we report the number of successes out of 20 trails with different random seeds. The results show that optimizing closed-form merit function is robust against changes in µ. Interestingly, its performance is mainly determined by the random seed. However, it consistently underperforms compared to Diff ILO across a broad range of µ values. With a properly chosen µ, Diff ILO achieves a 100% success ratio. Table 9: The influence of µ on the success ratio in the case study. 8 9 10 11 12 13 14 15 16 17 18 19 20 25 50 100 closed-form 0 0 0 9 9 9 9 9 9 9 9 9 9 9 9 9 Diff ILO 0 0 0 0 0 2 7 17 20 20 20 20 20 20 19 16 Published as a conference paper at ICLR 2025 E DISCUSSIONS E.1 LIMITATIONS Sub-optimality Diff ILO is an unsupervised learning approach, which does not learn from the optimal solutions found by traditional solvers, but instead learns to produce solutions itself. There are reasonable concerns that Diff ILO may tend to generate sub-optimal solutions, especially when compared with supervised learning approaches. Sub-optimality is essentially a fundamental challenge for most optimization approaches. Therefore, more attention should be paid to addressing this issue and reducing the risk of sub-optimality. Research scope Diff ILO mainly focuses on solving integer linear programs (ILPs). Currently, we focus on ILPs with binary variables. More general ILPs with integer variables and mixed-integer linear programs (MILPs) are out of the scope of this paper. This extension poses new challenges, especially in dealing with the unbounded integer and continuous variables. However, we believe that Diff ILO provides an avenue for such directions, and we plan to explore the use of differentiable approaches for solving general ILPs and MILPs in future work. Waiting for more sophisticated designs The differentiable ILP solving is still in its early stage and lacks sophisticated designs. For example, the sampled solutions can be stored in a buffer for better sample efficiency. The specific gradient optimization algorithm for such task is not yet developed. Moreover, the current bipartite graph representation and GNN architecture are still simple, and can be further designed in the future. E.2 FUTURE AVENUES Used for large-scale pretraining The unsupervised nature of Diff ILO makes it suitable for pretraining tasks on large-scale datasets. Combination with supervised learning Diff ILO is based on unsupervised learning and may be stuck in sub-optimal. It is promising to combine it with small amounts of supervised data to better overcome the sub-optimality issue. Better optimization algorithm In this work, we use simple Adam for gradient-based optimization. Better and more specificly-used optimizers could be developed in the future. Moreover, integrating traditional methods such as branch-and-bound or large-neighborhood search into our framework could bolster its robustness and help the model navigate complex solution landscapes effectively. Extension to more general problems We note that a contemporaneous ICLR submission (Anonymous, 2024) reports and attempts to tackle the similar limitations. They stated: Most existing end-to-end machine learning-based methods primarily focus on predicting solutions for binary variables. Their approach involves converting integer variables into binary representations and predicting these binary bits iteratively. This iterative binary prediction approach could be extended to our framework, though it would require additional modifications. We plan to explore this direction in future work. Moreover, While this paper primarily focuses on ILPs, the underlying principles can be extended to non-linear problems. The key lies in the design of the probabilistic model for ˆϕj(ˆx) = Ex p( |ˆx)[ϕj(x)], where ϕj(x) can be adapted for non-linear constraints. Exploring such extensions is an exciting direction for future work. Better model architectures In the future, better model architectures will be expored to replace the GNN in the current methodology. Some potential practices include soft matching between nodes (Cuturi, 2013; Caron et al., 2021), differentiable clustering for a soft cluster assignment (Stewart et al., 2024), and vector quantization for assigning discrete values (Van Den Oord et al., 2017).