# Learning to Branch with Tree MDPs

Lara Scavuzzo, Delft University of Technology, l.v.scavuzzomontana@tudelft.nl
Feng Yang Chen, Polytechnique Montréal, feng-yang.chen@polymtl.ca
Didier Chételat, Polytechnique Montréal, didier.chetelat@polymtl.ca
Maxime Gasse, Mila, Polytechnique Montréal, maxime.gasse@polymtl.ca
Andrea Lodi, Jacobs Technion-Cornell Institute, Cornell Tech and Technion - IIT, andrea.lodi@cornell.edu
Neil Yorke-Smith, Delft University of Technology, n.yorke-smith@tudelft.nl
Karen Aardal, Delft University of Technology, k.i.aardal@tudelft.nl

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. [11] and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits better credit assignment than its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.

## 1 Introduction

Mixed Integer Linear Programs (MILPs) offer a powerful tool for modeling combinatorial optimization problems, and are used in many real-world applications [30]. The method of choice for solving MILPs to global optimality is the Branch-and-Bound (B&B) algorithm, which follows a divide-and-conquer strategy. A critical component of the algorithm is the branching rule, which is used to recursively partition the MILP search space by means of a search tree. While there is little understanding of optimal branching decisions in general settings [25], the choice of the branching rule has a decisive impact on solving performance in practice [3]. In modern solvers, state-of-the-art performance is obtained using hard-coded branching rules that rely on heuristics designed by domain experts to perform well on a representative set of test instances [18].

In recent years, increasing attention has been given to approaches that use Machine Learning (ML) to obtain good branching rules and improve upon expert-crafted heuristics [5]. Such a statistical approach is particularly well-suited in situations where similar problems are solved repeatedly, which is a common industrial scenario. In such cases, a generic and problem-agnostic branching rule might be suboptimal.

Figure 1 (panels: tree MDP episode, temporal MDP episodes): B&B process as a tree MDP episode vs. a temporal MDP episode. White nodes denote states, green nodes denote actions. In the tree MDP framework, the branching decision that splits a node $f$ is credited two rewards, $(r_h, r_i)$. In the temporal MDP framework, the same branching decision is credited with additional rewards that depend on the temporal order in which B&B nodes are processed: $(r_h, r_i, r_e)$, $(r_h, r_i, r_g, r_b, r_d, r_e)$, or $(r_g, r_h, r_i)$.
Due to the sequential nature of branching, a natural paradigm for formulating the learning problem is Reinforcement Learning (RL), which allows one to directly search for the optimal branching rule with respect to a metric of interest, such as the average solving time or the final B&B tree size. Etheve et al. [11] show that, when using depth-first search (DFS) as the tree exploration (a.k.a. node selection) policy in B&B, minimizing each subtree size is equivalent to minimizing the global tree size. Using this result, they propose an efficient value-based RL procedure for learning to branch.

In this paper we build upon the work of Etheve et al. [11], and we propose a generalization of the Markov Decision Process (MDP) paradigm that we call tree MDP (illustrated in Figure 1, formally discussed in Section 4). This general formulation captures the key insight of their approach, but opens the door to more general B&B settings, RL algorithms and reward functions. In particular, we show that DFS is just one way to enforce the tree Markov property, and we propose an alternative, more practical solution that simply relies on providing the optimal objective limit to the solver during training. Our contribution is three-fold:

1. We introduce the concepts of a tree MDP and of the tree Markov property, and we derive a policy gradient theorem for tree MDPs.
2. We propose an alternative way to enforce the tree Markov property in learning to branch, by imposing an optimal objective limit, which is less computationally demanding than DFS [11].
3. We show that adopting the tree MDP paradigm improves both the convergence speed of RL and the quality of the resulting branching rules, compared to the regular MDP paradigm.

The remainder of this paper is organized as follows. We formally introduce MILPs, B&B and MDPs in Section 2. In Section 3 we motivate the need for moving beyond imitation learning in learning to branch, and discuss the challenges this raises. In Section 4 we introduce the concepts of tree MDPs and the tree Markov property, derive a policy gradient theorem for tree MDPs, and present a convenient way to enforce the tree Markov property in B&B without DFS [11]. In Section 5 we conduct, for the first time, a thorough computational evaluation of RL for learning branching rules, on five synthetic MILP benchmarks with two difficulty levels. Finally, we discuss the significance of our results and future directions in Section 6.

## 2 Background

In this section, we describe the Branch-and-Bound (B&B) algorithm, and we show how the branching problem naturally formulates as a (temporal) Markov Decision Process.

### 2.1 The B&B algorithm

Consider a Mixed Integer Linear Program instance as an optimization problem of the form

$$\min_{x \in \mathbb{R}^n} \; \{\, c^T x \;:\; Ax \le b,\; l \le x \le u,\; x_i \in \mathbb{Z} \;\; \forall i \in I \,\}, \qquad (1)$$

where $c \in \mathbb{R}^n$ is the objective coefficient vector, $A \in \mathbb{R}^{m \times n}$ is the constraint matrix, $b \in \mathbb{R}^m$ is the constraint right-hand side, $l, u \in \mathbb{R}^n$ are the lower and upper bound vectors, respectively, and $I \subseteq \{1, 2, \dots, n\}$ is the subset of variables constrained to take integer values. Relaxing the integrality constraints, $x_i \in \mathbb{Z} \;\forall i \in I$, yields a Linear Program (LP) that can be solved efficiently in practice, for example with the Simplex algorithm. Solving this relaxation yields an LP solution $\hat{x}^*$ and provides a lower bound $c^T \hat{x}^*$ on the optimal value of the original problem (1). If by chance $\hat{x}^*$ satisfies the integrality constraints, $\hat{x}^*_i \in \mathbb{Z} \;\forall i \in I$, then it is also an optimal solution to (1). If, on the contrary, there is at least one variable $x_i$, $i \in I$, such that $\hat{x}^*_i$ is not an integer, then one can split the feasible region into two sub-problems by imposing

$$x_i \le \lfloor \hat{x}^*_i \rfloor \quad \text{or} \quad x_i \ge \lceil \hat{x}^*_i \rceil. \qquad (2)$$

This partitioning step is known as branching.
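For concreteness, the following minimal sketch (the two-variable instance is invented for illustration, and SciPy's `linprog` stands in for the solver's LP engine) solves the LP relaxation of a tiny problem of form (1), reports the resulting lower bound, and identifies the fractional integer variables that would have to be branched on with (2):

```python
# Illustrative sketch only: a toy instance of form (1), not taken from the paper.
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -4.0])           # objective coefficients
A = np.array([[6.0, 4.0]])           # constraint matrix
b = np.array([9.0])                  # constraint right-hand side
bounds = [(0.0, 1.0), (0.0, 1.0)]    # variable bounds l <= x <= u
integer_vars = [0, 1]                # index set I of integer-constrained variables

# LP relaxation: drop the integrality constraints and solve min c^T x s.t. Ax <= b, l <= x <= u.
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
x_hat = res.x
print("LP solution:", x_hat, "lower bound:", res.fun)

# If every x_i, i in I, is integral, x_hat already solves the MILP; otherwise the
# fractional variables are candidates for branching. Here x_hat = (5/6, 1), so x_0
# is fractional and a split of the form (2) on x_0 would be required.
fractional = [i for i in integer_vars if abs(x_hat[i] - round(x_hat[i])) > 1e-6]
print("fractional integer variables:", fractional)
```

In this toy case the relaxation only provides a lower bound, which is exactly the situation in which branching is needed.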
In its most basic form, the vanilla B&B algorithm recursively applies branching, thereby building a search tree where each node has an associated local sub-MILP, with a local LP solution and an associated local lower bound. If the local LP solution satisfies the MILP integrality constraints, it is termed a feasible solution, and it also provides an upper bound on (1). At any given time of the B&B process, a Global Lower Bound (GLB) on the original MILP can be obtained by taking the lowest of the local lower bounds of the leaf nodes of the tree. Similarly, a Global Upper Bound (GUB) can be obtained by taking the lowest of the upper bounds found so far. The B&B process terminates when no B&B leaf node can be further partitioned with (2), i.e., every leaf node satisfies one of the following conditions: the local LP has no solution (infeasible); the local lower bound is above the GUB (pruned); or the local LP solution satisfies the MILP integrality constraints (integer feasible). At termination we have GLB = GUB, and the original MILP (1) is solved.

The branching problem, a.k.a. the variable selection problem, is the task of choosing, at every B&B iteration, the variable $x_i$ that will be used to generate a partition of the form (2). While there is no universal metric of the quality of a branching rule, a common performance measure is the final size of the B&B tree (the smaller the better) [1].

### 2.2 Temporal MDPs

In the following, upper-case letters in italics denote random variables (e.g., $S$, $A$), their lower-case counterparts denote their values (e.g., $s$, $a$), and their calligraphic counterparts denote their domains (e.g., $s \in \mathcal{S}$, $a \in \mathcal{A}$). We consider episodic Markov Decision Processes (MDPs) of the form $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, r)$, with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, initial state distribution $p_{\text{init}}(s_0)$, state transition distribution $p_{\text{trans}}(s_{t+1} \mid s_t, a_t)$, and reward function $r : \mathcal{S} \to \mathbb{R}$. For simplicity, we assume finite episodes of length $T$, that is, $\tau = (s_0, a_0, \dots, s_T)$ with $|\tau| = T$. Together with a control mechanism represented by a stochastic policy $\pi(a_t \mid s_t)$, the MDP defines a probability distribution over trajectories, namely

$$p^\pi(\tau) = p_{\text{init}}(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p_{\text{trans}}(s_{t+1} \mid s_t, a_t).$$

The MDP control problem is to find a policy that maximizes the expected cumulative reward, $\pi^\star \in \arg\max_\pi V^\pi$, with $V^\pi := \mathbb{E}_{\tau \sim p^\pi}\big[\sum_t r(s_t)\big]$. A key property of MDPs is the temporal Markov property: conditioned on the current state $S_t$ and action $A_t$, the next state $S_{t+1}$ is independent of the earlier history of the episode.

Thus, it can be expected intuitively that learning branching policies within the tree MDP framework will be easier, and more sample-efficient, than learning within the temporal MDP framework. We will validate this hypothesis experimentally in Section 5.

### 4.4 Theoretical limitations

Our proposed B&B variants, ObjLim and DFS, allow for a clean formulation of the branching problem as a tree MDP, which we argue is key to unlocking more practical and sample-efficient learning of branching policies. However, the end goal is usually to learn a branching policy that performs well in realistic B&B settings, and the fact that a branching policy performs well in one of those variants does not guarantee that it will also perform well in the vanilla setting. This discrepancy between the training environment and the evaluation environment is a recurring problem in RL, and is more generally referred to as the transfer learning problem.
While there exist solutions to mitigate this problem, in this paper we leave the question aside and simply assume that the transfer problem is negligible. We thus directly report the performance obtained from each training setting in the realistic evaluation setting, a default B&B solver.

### 4.5 Connections with hierarchical RL

The tree MDP formulation has connections with hierarchical RL (HRL), a paradigm that aims at decomposing the learning task into a set of simpler tasks that can be solved recursively, independently of the parent task. The most closely related HRL approach is perhaps MAXQ [10], which decomposes the value function of an MDP recursively using a finite set of smaller constituent MDPs, each with its own action set and reward function. For example, delivering a package from a point A to a point B decomposes into: moving to A, picking up the package, moving to B, dropping the package. While both tree MDPs and MAXQ exploit a recursive tree decomposition in order to simplify the credit assignment problem, the two frameworks also differ on several points. First, in MAXQ the hierarchical sub-task structure must be known a priori for each new task, and results in a fixed, limited tree depth, while in tree MDPs the decomposition holds by construction and can result in virtually unlimited depth. Second, in MAXQ each sub-task results in a different MDP, while in tree MDPs all sub-tasks are the same. Lastly, in MAXQ the recursive decomposition must follow a temporal abstraction, where each episode is processed according to a depth-first traversal of the tree. In tree MDPs the decomposition is not tied to the temporal processing order of the episode, except for the requirement that a parent must be processed before its children; any such tree traversal order is allowed (see Figure 1).

**Algorithm 1: REINFORCE training loop**

1. **Input:** training set of MILP instances and their pre-computed optimal solutions $D$, maximum number of epochs $K$, time limit $\zeta$, entropy bonus $\lambda$, learning rate $\alpha$, sample rate $\beta$.
2. Initialize policy $\pi_\theta$ with random parameters $\theta$.
3. **for** epoch $= 1$ to $K$ **do**
4. **if** elapsed time $> \zeta$ **then** break
5. Sample 10 MILP instances from $D$
6. **for** each sampled instance **do**
7. Collect one episode $\tau$ by running B&B to optimality
8. Extract $\beta \cdot |\tau|$ state, action, return tuples $(s, a, G)$ from $\tau$ at random (with $G$ the local subtree size for tree MDPs, and the remaining episode size for temporal MDPs)
9. **end for**
10. $n \leftarrow$ number of collected tuples, $L \leftarrow 0$
11. **for** each collected tuple $(s, a, G)$ **do**
12. $L \leftarrow L + G \cdot \frac{1}{n} \log \pi_\theta(a \mid s)$ (policy gradient cost)
13. $L \leftarrow L - \lambda \cdot \frac{1}{n} H(\pi_\theta(\cdot \mid s))$ (entropy bonus)
14. **end for**
15. $\theta \leftarrow \theta - \alpha \nabla_\theta L$
16. **end for**
17. **return** $\pi_\theta$

Note that for tree MDPs, the return $G$ of every node can be computed efficiently with a single bottom-up traversal of the tree, which runs in $O(n)$ time.
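As a concrete, deliberately simplified reading of Algorithm 1, the sketch below implements two of its ingredients: the tree-MDP return of line 8, computed for every node as its local subtree size via the bottom-up pass mentioned above, and the loss of lines 10–15, a REINFORCE policy-gradient cost with an entropy bonus. It is an illustrative sketch under our own assumptions about the data layout (a `parents` list with parents appearing before their children, one logit vector per sampled state), not the authors' released implementation.

```python
# Illustrative sketch only: a simplified stand-in for Algorithm 1's return
# computation (line 8) and loss (lines 10-15), not the authors' code.
import torch


def subtree_sizes(parents):
    """Tree-MDP returns: the size of the subtree rooted at each B&B node.

    `parents[i]` is the index of node i's parent (-1 for the root). Parents are
    assumed to appear before their children, as they do when nodes are indexed
    in creation order; one reverse sweep then gives the O(n) bottom-up traversal."""
    sizes = [1] * len(parents)
    for i in reversed(range(1, len(parents))):  # visit children before parents
        sizes[parents[i]] += sizes[i]
    return sizes


def reinforce_loss(logits, actions, returns, entropy_bonus=0.01):
    """L = (1/n) sum_i [ G_i * log pi(a_i|s_i) - lambda * H(pi(.|s_i)) ].

    Minimizing L by gradient descent pushes probability mass away from branching
    decisions that produced large subtrees; the entropy term keeps exploration."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    return (returns * log_probs).mean() - entropy_bonus * dist.entropy().mean()


# Toy usage on a 5-node tree: root 0 has children 1 and 2, node 1 has children 3 and 4.
parents = [-1, 0, 0, 1, 1]
returns = subtree_sizes(parents)                  # [5, 3, 1, 1, 1]

logits = torch.randn(2, 4, requires_grad=True)    # 2 sampled states, 4 branching candidates
actions = torch.tensor([1, 3])                    # variables chosen at those states
G = torch.tensor([returns[1], returns[3]], dtype=torch.float)
loss = reinforce_loss(logits, actions, G)
loss.backward()                                   # gradient with respect to the policy parameters
```

In the actual training loop, the logits would come from the GNN branching policy evaluated on the sampled states, and in the temporal MDP setting `G` would instead be the remaining episode size.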
## 5 Experiments

We now compare the performance of four machine learning approaches: the imitation learning method of Gasse et al. [17], and three different RL methods. We also compare against SCIP's default rule (for a description of this rule we refer to the Supplementary Material, A.4). Code for reproducing all experiments is available online at https://github.com/lascavana/rl2branch.

**Benchmarks.** Similarly to Gasse et al. [17], we train and evaluate each method on five NP-hard problem benchmarks, consisting of synthetic combinatorial auctions, set covering, maximum independent set, capacitated facility location and multiple knapsack instances. For each benchmark we generate a training set of 10,000 instances, along with a small set of 20 validation instances for tracking RL performance during training. For the final evaluation, we further generate a test set of 40 instances of the same size as the training ones, and a transfer set of 40 instances that are larger and more challenging than the training ones. More information about the benchmarks and instance sizes can be found in the Supplementary Material (A.5).

**Training.** We use the Graph Neural Network (GNN) from Gasse et al. [17] to learn branching policies, with the same features and architecture. This state representation has been shown to provide an efficient encoding of the local MILP together with some global features. Notice that this makes our MDP formulation a POMDP, given that we do not encode the complete search tree (see Section 2.3). We compare four training methods: imitation learning from strong branching (IL); RL using temporal policy gradients (MDP); RL using tree policy gradients with DFS as the node selection strategy (tMDP+DFS), which enforces the tree Markov property due to Proposition 4.5; and RL using tree policy gradients with the optimal objective value set as an objective limit (tMDP+ObjLim), which corresponds to Proposition 4.4. Otherwise, we use default solver parameters, except that restarts and cutting planes after the root node are deactivated. We use plain REINFORCE [35] with an entropy bonus as our RL algorithm, for simplicity. Our training procedure is summarized in Algorithm 1. We set a maximum of 15,000 epochs and a time limit of six days for training. Our implementation uses PyTorch [31] together with PyTorch Geometric [12], and Ecole [32] for interfacing with the solver SCIP [15]. All experiments are run on compute nodes equipped with a GPU.

Figure 2 ((a) Combinatorial Auctions, (b) Set Covering, (c) Maximum Independent Set): Training curves for REINFORCE with temporal policy gradients (MDP), tree policy gradients with an objective limit (tMDP+ObjLim), and tree policy gradients with DFS node selection (tMDP+DFS). We report the final B&B tree size on the validation set (geometric mean over 20 instances × 5 seeds, the lower the better) versus the number of processed training samples on the x-axis. Solid lines show the moving average. Training curves for the remaining benchmarks can be found in the Supplementary Material (A.3).

**Evaluation.** For each branching rule evaluated, we solve every validation, test or transfer instance 5 times, each with a different random seed. We use default solver parameters, except that restarts and cutting planes after the root node are deactivated (as during training), with a time limit of 1 hour for each solving run. For tMDP+DFS and tMDP+ObjLim, the specific settings used during training (DFS node selection and optimal objective limit, respectively) are no longer used, thus providing a realistic evaluation setting. We report the geometric mean of the final B&B tree size as our metric of interest, as is common practice in the MILP literature [3]. We pair this with the average per-instance standard deviation (in percentage). We only consider solving runs that finished successfully for all methods, as in [17]. Extended results, including solving times, are provided in the Supplementary Material (A.3).

### 5.2 Results

Figure 2 showcases the convergence of our three RL paradigms, MDP, tMDP+ObjLim and tMDP+DFS, during training, in terms of the final B&B tree size on the validation set (the lower the better).
In order to better highlight the sample efficiency of each method, we report on the x-axis the cumulative number of collected training samples, which correlates with the length of the episodes collected during training. This provides a hardware-independent proxy for training time. As can be seen, the tree MDP paradigm clearly improves the convergence speed on these three benchmarks, with a clear domination of tMDP+ObjLim on one benchmark (Set Covering). Training curves for the remaining benchmarks are available in the Supplementary Material (A.3).

Table 1 reports the final performance of the branching rules obtained with each method, on both a held-out test set (same instance difficulty as training) and a transfer set (larger, more difficult instances than training). Despite a mismatch between the training and evaluation environments, which is required to enforce the tree Markov property, the tree MDP paradigm consistently produces equal or better branching rules than the temporal MDP paradigm on all 5 benchmarks. On one benchmark, Multiple Knapsack, the branching rules learned by RL outperform both SCIP's default branching rule and the strong branching imitation (IL) approach. The likely reason is that the MILP formulation of Multiple Knapsack has a very poor linear relaxation, which often results in no dual bound improvement after branching. This means that strong branching scores are in most cases not discriminative, which is problematic for rules that heavily rely on this criterion (see Section 3.1), such as SCIP's default rule or a policy that imitates strong branching. This situation makes a strong case for the potential of RL-based methods, which can adapt and devise alternative branching strategies. On the remaining 4 benchmarks, however, the RL methods perform worse than SCIP's default rule or IL, despite being based on the same GNN architecture. This illustrates the difficulty of learning to branch via RL, even on small-scale problems, and the remaining room for improvement. Additional evaluation criteria (solving times and number of time limits) are available in the Supplementary Material (A.3).

Table 1: Solving performance of the different branching rules in terms of the final B&B tree size (lower is better). We evaluate each method on a test set with instances of the same size as training, and a transfer set with larger instances. We report the geometric mean and standard deviation over 40 instances, solved 5 times with different random seeds, and we bold the best of the RL methods.

**Test**

| Model | Comb. Auct. | Set Cover | Max. Ind. Set | Facility Loc. | Mult. Knap. |
| --- | --- | --- | --- | --- | --- |
| SCIP default | 7.3 ± 39% | 10.7 ± 24% | 19.3 ± 52% | 203.6 ± 63% | 267.8 ± 96% |
| IL | 52.2 ± 13% | 51.8 ± 10% | 35.9 ± 36% | 247.5 ± 39% | 228.0 ± 95% |
| RL (MDP) | 86.7 ± 16% | 196.3 ± 20% | 91.8 ± 56% | 393.2 ± 47% | 143.4 ± 76% |
| RL (tMDP+DFS) | **86.1** ± 17% | **190.8** ± 20% | 89.8 ± 51% | 360.4 ± 46% | **135.8** ± 75% |
| RL (tMDP+ObjLim) | 87.0 ± 18% | 193.5 ± 23% | **85.4** ± 53% | **325.4** ± 41% | 142.4 ± 78% |

**Transfer**

| Model | Comb. Auct. | Set Cover | Max. Ind. Set | Facility Loc. | Mult. Knap. |
| --- | --- | --- | --- | --- | --- |
| SCIP default | 733.9 ± 26% | 61.4 ± 19% | 2867.1 ± 35% | 344.3 ± 57% | 592.3 ± 75% |
| IL | 805.1 ± 9% | 145.0 ± 6% | 1774.8 ± 38% | 407.8 ± 37% | 1066.1 ± 101% |
| RL (MDP) | 1906.3 ± 18% | 853.3 ± 27% | 2768.5 ± 76% | 679.4 ± 52% | 518.4 ± 79% |
| RL (tMDP+DFS) | **1804.6** ± 17% | **816.8** ± 25% | 2970.0 ± 76% | 609.1 ± 47% | 495.1 ± 81% |
| RL (tMDP+ObjLim) | 1841.9 ± 18% | 826.4 ± 26% | **2763.6** ± 74% | **496.0** ± 48% | **425.3** ± 64% |

## 6 Conclusions and Future Directions

This paper adds to a growing body of literature on using ML to assist decision-making in several key components of the B&B algorithm (see, e.g., [5, 7, 22]).
We contribute to the study of RL as a tool for learning to branch in MILP solvers. We present tree MDPs, a variant of Markov Decision Processes, and show that under some conditions the B&B branching process is tree-Markovian. We show that the approach of Etheve et al. [11] can be naturally cast as Q-learning for tree MDPs, and we propose an alternative, computationally more appealing way to enforce the tree Markov property in B&B, using optimal objective limits. Finally, we evaluate for the first time a variety of RL-based branching rules in a comprehensive computational study, and we show that tree MDPs improve the convergence speed of RL for branching, as well as the overall performance of the learnt branching rules.

These contributions bring us closer to learning efficient branching rules from scratch using RL, which could ultimately outperform existing branching heuristics built upon decades of expert knowledge and experimentation. However, despite the convergence speed-up that our method provides, training without expert knowledge remains computationally very heavy and in general still results in worse performance than its supervised learning counterpart, which reveals a significant gap that must be closed. As future work, we would like to explore ideas for further improving sample efficiency and generalization across instance sizes. This is necessary for RL to scale to larger, non-homogeneous benchmarks, such as MIPLIB [18], which at the moment remain out of reach for RL. Finally, although our focus in this paper is on improving variable selection for MILP, our tree MDP construction could be useful in other applications. Branch-and-bound is a type of divide-and-conquer algorithm, and we expect that, in general, this framework can be applied to any problem where one seeks to control such algorithms more efficiently. Examples include controlling the order in which a rover explores rooms in a building, or selecting the pivots in a quicksort algorithm.

## Acknowledgements

The authors thank the anonymous reviewers for their suggestions. The work of Karen Aardal, Lara Scavuzzo and Neil Yorke-Smith was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under grant number 952215, and by The Netherlands Organisation for Scientific Research (NWO), grant OCENW.GROOT.2019.015. The work of Feng Yang Chen, Didier Chételat, Maxime Gasse and Andrea Lodi was supported by the Canada Excellence Research Chairs program (CERC).

## References

[1] Tobias Achterberg. Constraint Integer Programming. PhD thesis, Technische Universität Berlin, 2007.
[2] Tobias Achterberg and Timo Berthold. Hybrid branching. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 309–311. Springer, 2009.
[3] Tobias Achterberg and Roland Wunderling. Mixed integer programming: Analyzing 12 years of progress. In Facets of Combinatorial Optimization, pages 449–481. Springer, 2013.
[4] Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial Optimization, pages 37–60. Springer, 1980.
[5] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research, 2020.
[6] David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization, volume 1. Springer, 2016.
[7] Antonia Chmiela, Elias B. Khalil, Ambros Gleixner, Andrea Lodi, and Sebastian Pokutta. Learning to schedule heuristics in branch-and-bound. arXiv preprint arXiv:2103.10294, 2021.
[8] Gérard Cornuéjols, Ranjani Sridharan, and Jean-Michel Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, 1991.
[9] Santanu S. Dey, Yatharth Dubey, Marco Molinaro, and Prachi Shah. A theoretical and computational analysis of full strong-branching. arXiv preprint arXiv:2110.10754, 2021.
[10] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[11] Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforcement learning for variable selection in a branch and bound algorithm. In CPAIOR, 2020.
[12] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[13] Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, 2011.
[14] Gerald Gamrath and Christoph Schubert. Measuring the impact of branching rules for mixed-integer programming. In Operations Research Proceedings 2017, pages 165–170. Springer, 2018.
[15] Gerald Gamrath, Daniel Anderson, Ksenia Bestuzheva, Wei-Kun Chen, Leon Eifler, Maxime Gasse, Patrick Gemander, Ambros Gleixner, Leona Gottwald, Katrin Halbig, Gregor Hendel, Christopher Hojny, Thorsten Koch, Pierre Le Bodic, Stephen J. Maher, Frederic Matter, Matthias Miltenberger, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Christine Tawfik, Stefan Vigerske, Fabian Wegscheider, Dieter Weninger, and Jakob Witzig. The SCIP Optimization Suite 7.0. ZIB-Report 20-10, Zuse Institute Berlin, March 2020.
[16] Gerald Gamrath, Timo Berthold, and Domenico Salvagnin. An exploratory computational analysis of dual degeneracy in mixed-integer programming. EURO Journal on Computational Optimization, 8(3):241–261, 2020.
[17] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, pages 15580–15592, 2019.
[18] Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. MIPLIB 2017. Mathematical Programming Computation, pages 1–48, 2021.
[19] Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In Advances in Neural Information Processing Systems, volume 33, 2020.
[20] Christoph Hansknecht, Imke Joormann, and Sebastian Stiller. Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem. arXiv preprint arXiv:1805.01415, 2018.
[21] He He, Hal Daumé III, and Jason M. Eisner. Learning to search in branch and bound algorithms. In Advances in Neural Information Processing Systems, pages 3293–3301, 2014.
[22] Elias B. Khalil, Christopher Morris, and Andrea Lodi. MIP-GNN: A data-driven framework for guiding combinatorial solvers. In AAAI, 2022.
[23] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM Conference on Electronic Commerce, pages 66–76, 2000.
[25] Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. TOP, 25(2):207–236, 2017.
[26] Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux. Online learning for strong branching approximation in branch-and-bound. Technical report, Université de Liège, 2016.
[27] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[29] Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
[30] Vangelis Th. Paschos. Applications of Combinatorial Optimization, volume 3. John Wiley & Sons, 2014.
[31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[32] Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Workshop on Learning Meets Combinatorial Algorithms, NeurIPS 2020, November 2020.
[33] Haoran Sun, Wenbo Chen, Hui Li, and Le Song. Improving learning to branch via reinforcement learning. In Learning Meets Combinatorial Algorithms at NeurIPS 2020, 2020.
[34] Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pages 1057–1063. Citeseer, 1999.
[35] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
[36] Giulia Zarpellon, Jason Jo, Andrea Lodi, and Yoshua Bengio. Parameterizing branch-and-bound search trees to learn branching policies. In AAAI, pages 3931–3939, 2021.