# Learning to Branch with Tree MDPs

Lara Scavuzzo, Delft University of Technology, l.v.scavuzzomontana@tudelft.nl
Feng Yang Chen, Polytechnique Montréal, feng-yang.chen@polymtl.ca
Didier Chételat, Polytechnique Montréal, didier.chetelat@polymtl.ca
Maxime Gasse, Mila, Polytechnique Montréal, maxime.gasse@polymtl.ca
Andrea Lodi, Jacobs Technion-Cornell Institute, Cornell Tech and Technion - IIT, andrea.lodi@cornell.edu
Neil Yorke-Smith, Delft University of Technology, n.yorke-smith@tudelft.nl
Karen Aardal, Delft University of Technology, k.i.aardal@tudelft.nl

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. [11] and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits better credit assignment than its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.

## 1 Introduction

Mixed Integer Linear Programs (MILPs) offer a powerful tool for modeling combinatorial optimization problems, and are used in many real-world applications [30]. The method of choice for solving MILPs to global optimality is the Branch-and-Bound (B&B) algorithm, which follows a divide-and-conquer strategy. A critical component of the algorithm is the branching rule, which is used to recursively partition the MILP search space by means of a search tree. While there is little understanding of optimal branching decisions in general settings [25], the choice of the branching rule has a decisive impact on solving performance in practice [3]. In modern solvers, state-of-the-art performance is obtained using hard-coded branching rules that rely on heuristics designed by domain experts to perform well on a representative set of test instances [18].

In recent years, increasing attention has been given to approaches that use Machine Learning (ML) to obtain good branching rules and improve upon expert-crafted heuristics [5]. Such a statistical approach is particularly well-suited in situations where similar problems are solved repeatedly, which is a common industrial scenario. In such cases, a generic and problem-agnostic branching rule might be suboptimal.

Figure 1 (panels: tree MDP episode, temporal MDP episodes): B&B process as a tree MDP episode vs. a temporal MDP episode. White nodes denote states, green nodes denote actions. In the tree MDP framework, the branching decision that splits a node $f$ is credited two rewards, $(r_h, r_i)$. In the temporal MDP framework, the same branching decision is credited with additional rewards that depend on the temporal order in which B&B nodes are processed: $(r_h, r_i, r_e)$, $(r_h, r_i, r_g, r_b, r_d, r_e)$, or $(r_g, r_h, r_i)$.
Due to the sequential nature of branching, a natural paradigm for formulating the learning problem is Reinforcement Learning (RL), which allows one to directly search for the optimal branching rule with respect to a metric of interest, such as the average solving time or the final B&B tree size. Etheve et al. [11] show that, when using depth-first search (DFS) as the tree exploration (a.k.a. node selection) policy in B&B, minimizing each subtree size is equivalent to minimizing the global tree size. Using this result, they propose an efficient value-based RL procedure for learning to branch.

In this paper we build upon the work of Etheve et al. [11], and we propose a generalization of the Markov Decision Process (MDP) paradigm that we call tree MDP (illustrated in Figure 1, formally discussed in Section 4). This general formulation captures the key insight of their approach, but opens the door to more general B&B settings, RL algorithms and reward functions. In particular, we show that DFS is just one way to enforce the tree Markov property, and we propose an alternative, more practical solution that simply relies on providing the optimal objective limit to the solver during training. Our contribution is three-fold:

1. We introduce the concepts of a tree MDP and of the tree Markov property, and we derive a policy gradient theorem for tree MDPs.
2. We propose an alternative way to enforce the tree Markov property in learning to branch, by imposing an optimal objective limit, which is less computationally demanding than DFS [11].
3. We show that adopting the tree MDP paradigm improves both the convergence speed of RL and the quality of the resulting branching rules, compared to the regular MDP paradigm.

The remainder of this paper is organized as follows. We formally introduce MILPs, B&B and MDPs in Section 2. In Section 3 we motivate the need for moving beyond imitation learning in learning to branch, and discuss the challenges this raises. In Section 4 we introduce the concepts of tree MDPs and the tree Markov property, derive a policy gradient theorem for tree MDPs, and present a convenient way to enforce the tree Markov property in B&B without DFS [11]. In Section 5 we conduct, for the first time, a thorough computational evaluation of RL for learning branching rules, on five synthetic MILP benchmarks with two difficulty levels. Finally, we discuss the significance of our results and future directions in Section 6.

## 2 Background

In this section, we describe the Branch-and-Bound (B&B) algorithm, and we show how the branching problem naturally formulates as a (temporal) Markov Decision Process.

### 2.1 The B&B algorithm

Consider a Mixed Integer Linear Program instance as an optimization problem of the form

$$\min_{x \in \mathbb{R}^n} \; \{\, c^T x \;:\; Ax \le b,\; l \le x \le u,\; x_i \in \mathbb{Z} \;\; \forall i \in I \,\}, \qquad (1)$$

where $c \in \mathbb{R}^n$ is the objective coefficient vector, $A \in \mathbb{R}^{m \times n}$ is the constraint matrix, $b \in \mathbb{R}^m$ is the constraint right-hand side, $l, u \in \mathbb{R}^n$ are the lower and upper bound vectors, respectively, and $I \subseteq \{1, 2, \dots, n\}$ is the subset of variables constrained to take integer values. Relaxing the integrality constraints, $x_i \in \mathbb{Z} \;\forall i \in I$, yields a Linear Program (LP) that can be solved efficiently in practice, for example with the Simplex algorithm. Solving this relaxation yields an LP solution $\hat{x}^*$ and provides a lower bound $c^T \hat{x}^*$ on the optimal value of the original problem (1). If by chance $\hat{x}^*$ satisfies the integrality constraints, $\hat{x}^*_i \in \mathbb{Z} \;\forall i \in I$, then it is also an optimal solution to (1). If, on the contrary, there is at least one variable $x_i$, $i \in I$, such that $\hat{x}^*_i$ is not an integer, then one can split the feasible region into two sub-problems by imposing

$$x_i \le \lfloor \hat{x}^*_i \rfloor \quad \text{or} \quad x_i \ge \lceil \hat{x}^*_i \rceil. \qquad (2)$$

This partitioning step is known as branching.
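For concreteness, the following minimal sketch (the two-variable instance is invented for illustration, and SciPy's `linprog` stands in for the solver's LP engine) solves the LP relaxation of a tiny problem of form (1), reports the resulting lower bound, and identifies the fractional integer variables that would have to be branched on with (2):

```python
# Illustrative sketch only: a toy instance of form (1), not taken from the paper.
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -4.0])           # objective coefficients
A = np.array([[6.0, 4.0]])           # constraint matrix
b = np.array([9.0])                  # constraint right-hand side
bounds = [(0.0, 1.0), (0.0, 1.0)]    # variable bounds l <= x <= u
integer_vars = [0, 1]                # index set I of integer-constrained variables

# LP relaxation: drop the integrality constraints and solve min c^T x s.t. Ax <= b, l <= x <= u.
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
x_hat = res.x
print("LP solution:", x_hat, "lower bound:", res.fun)

# If every x_i, i in I, is integral, x_hat already solves the MILP; otherwise the
# fractional variables are candidates for branching. Here x_hat = (5/6, 1), so x_0
# is fractional and a split of the form (2) on x_0 would be required.
fractional = [i for i in integer_vars if abs(x_hat[i] - round(x_hat[i])) > 1e-6]
print("fractional integer variables:", fractional)
```

In this toy case the relaxation only provides a lower bound, which is exactly the situation in which branching is needed.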
In its most basic form, the vanilla B&B algorithm recursively applies branching, thereby building a search tree where each node has an associated local sub-MILP, with a local LP solution and an associated local lower bound. If the local LP solution satisfies the MILP integrality constraints, it is termed a feasible solution, and it also provides an upper bound on (1). At any given time of the B&B process, a Global Lower Bound (GLB) on the original MILP can be obtained by taking the lowest of the local lower bounds of the leaf nodes of the tree. Similarly, a Global Upper Bound (GUB) can be obtained by taking the lowest of the upper bounds found so far. The B&B process terminates when no B&B leaf node can be further partitioned with (2), i.e., every leaf node satisfies one of the following conditions: the local LP has no solution (infeasible); the local lower bound is above the GUB (pruned); or the local LP solution satisfies the MILP integrality constraints (integer feasible). At termination we have GLB = GUB, and the original MILP (1) is solved.

The branching problem, a.k.a. the variable selection problem, is the task of choosing, at every B&B iteration, the variable $x_i$ that will be used to generate a partition of the form (2). While there is no universal metric of the quality of a branching rule, a common performance measure is the final size of the B&B tree (the smaller the better) [1].

### 2.2 Temporal MDPs

In the following, upper-case letters in italics denote random variables (e.g., $S$, $A$), their lower-case counterparts denote their values (e.g., $s$, $a$), and their calligraphic counterparts denote their domains (e.g., $s \in \mathcal{S}$, $a \in \mathcal{A}$). We consider episodic Markov Decision Processes (MDPs) of the form $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, r)$, with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, initial state distribution $p_{\text{init}}(s_0)$, state transition distribution $p_{\text{trans}}(s_{t+1} \mid s_t, a_t)$, and reward function $r : \mathcal{S} \to \mathbb{R}$. For simplicity, we assume finite episodes of length $T$, that is, $\tau = (s_0, a_0, \dots, s_T)$ with $|\tau| = T$. Together with a control mechanism represented by a stochastic policy $\pi(a_t \mid s_t)$, the MDP defines a probability distribution over trajectories, namely

$$p^\pi(\tau) = p_{\text{init}}(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p_{\text{trans}}(s_{t+1} \mid s_t, a_t).$$

The MDP control problem is to find a policy that maximizes the expected cumulative reward, $\pi^\star \in \arg\max_\pi V^\pi$, with $V^\pi := \mathbb{E}_{\tau \sim p^\pi}\big[\sum_t r(s_t)\big]$. A key property of MDPs is the temporal Markov property: conditioned on the current state $S_t$ and action $A_t$, the next state $S_{t+1}$ is independent of the earlier history of the episode.

Thus, it can be expected intuitively that learning branching policies within the tree MDP framework will be easier, and more sample-efficient, than learning within the temporal MDP framework. We will validate this hypothesis experimentally in Section 5.

### 4.4 Theoretical limitations

Our proposed B&B variants, ObjLim and DFS, allow for a clean formulation of the branching problem as a tree MDP, which we argue is key to unlocking more practical and sample-efficient learning of branching policies. However, the end goal is usually to learn a branching policy that performs well in realistic B&B settings, and the fact that a branching policy performs well in one of those variants does not guarantee that it will also perform well in the vanilla setting. This discrepancy between the training environment and the evaluation environment is a recurring problem in RL, and is more generally referred to as the transfer learning problem.
While there exist solutions to mitigate this problem, in this paper we leave the question aside and simply assume that the transfer problem is negligible. We thus directly report the performance obtained from each training setting in the realistic evaluation setting, a default B&B solver.

### 4.5 Connections with hierarchical RL

The tree MDP formulation has connections with hierarchical RL (HRL), a paradigm that aims at decomposing the learning task into a set of simpler tasks that can be solved recursively, independently of the parent task. The most closely related HRL approach is perhaps MAXQ [10], which decomposes the value function of an MDP recursively using a finite set of smaller constituent MDPs, each with its own action set and reward function. For example, delivering a package from a point A to a point B decomposes into: moving to A, picking up the package, moving to B, dropping the package. While both tree MDPs and MAXQ exploit a recursive tree decomposition in order to simplify the credit assignment problem, the two frameworks also differ on several points. First, in MAXQ the hierarchical sub-task structure must be known a priori for each new task, and results in a fixed, limited tree depth, while in tree MDPs the decomposition holds by construction and can result in virtually unlimited depth. Second, in MAXQ each sub-task results in a different MDP, while in tree MDPs all sub-tasks are the same. Lastly, in MAXQ the recursive decomposition must follow a temporal abstraction, where each episode is processed according to a depth-first traversal of the tree. In tree MDPs the decomposition is not tied to the temporal processing order of the episode, except for the requirement that a parent must be processed before its children; any such tree traversal order is allowed (see Figure 1).

**Algorithm 1: REINFORCE training loop**

1. **Input:** training set of MILP instances and their pre-computed optimal solutions $D$, maximum number of epochs $K$, time limit $\zeta$, entropy bonus $\lambda$, learning rate $\alpha$, sample rate $\beta$.
2. Initialize policy $\pi_\theta$ with random parameters $\theta$.
3. **for** epoch $= 1$ to $K$ **do**
4. **if** elapsed time $> \zeta$ **then** break
5. Sample 10 MILP instances from $D$
6. **for** each sampled instance **do**
7. Collect one episode $\tau$ by running B&B to optimality
8. Extract $\beta \cdot |\tau|$ state, action, return tuples $(s, a, G)$ from $\tau$ at random (with $G$ the local subtree size for tree MDPs, and the remaining episode size for temporal MDPs)
9. **end for**
10. $n \leftarrow$ number of collected tuples, $L \leftarrow 0$
11. **for** each collected tuple $(s, a, G)$ **do**
12. $L \leftarrow L + G \cdot \frac{1}{n} \log \pi_\theta(a \mid s)$ (policy gradient cost)
13. $L \leftarrow L - \lambda \cdot \frac{1}{n} H(\pi_\theta(\cdot \mid s))$ (entropy bonus)
14. **end for**
15. $\theta \leftarrow \theta - \alpha \nabla_\theta L$
16. **end for**
17. **return** $\pi_\theta$

Note that for tree MDPs, the return $G$ of every node can be computed efficiently with a single bottom-up traversal of the tree, which runs in $O(n)$ time.
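As a concrete, deliberately simplified reading of Algorithm 1, the sketch below implements two of its ingredients: the tree-MDP return of line 8, computed for every node as its local subtree size via the bottom-up pass mentioned above, and the loss of lines 10–15, a REINFORCE policy-gradient cost with an entropy bonus. It is an illustrative sketch under our own assumptions about the data layout (a `parents` list with parents appearing before their children, one logit vector per sampled state), not the authors' released implementation.

```python
# Illustrative sketch only: a simplified stand-in for Algorithm 1's return
# computation (line 8) and loss (lines 10-15), not the authors' code.
import torch


def subtree_sizes(parents):
    """Tree-MDP returns: the size of the subtree rooted at each B&B node.

    `parents[i]` is the index of node i's parent (-1 for the root). Parents are
    assumed to appear before their children, as they do when nodes are indexed
    in creation order; one reverse sweep then gives the O(n) bottom-up traversal."""
    sizes = [1] * len(parents)
    for i in reversed(range(1, len(parents))):  # visit children before parents
        sizes[parents[i]] += sizes[i]
    return sizes


def reinforce_loss(logits, actions, returns, entropy_bonus=0.01):
    """L = (1/n) sum_i [ G_i * log pi(a_i|s_i) - lambda * H(pi(.|s_i)) ].

    Minimizing L by gradient descent pushes probability mass away from branching
    decisions that produced large subtrees; the entropy term keeps exploration."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    return (returns * log_probs).mean() - entropy_bonus * dist.entropy().mean()


# Toy usage on a 5-node tree: root 0 has children 1 and 2, node 1 has children 3 and 4.
parents = [-1, 0, 0, 1, 1]
returns = subtree_sizes(parents)                  # [5, 3, 1, 1, 1]

logits = torch.randn(2, 4, requires_grad=True)    # 2 sampled states, 4 branching candidates
actions = torch.tensor([1, 3])                    # variables chosen at those states
G = torch.tensor([returns[1], returns[3]], dtype=torch.float)
loss = reinforce_loss(logits, actions, G)
loss.backward()                                   # gradient with respect to the policy parameters
```

In the actual training loop, the logits would come from the GNN branching policy evaluated on the sampled states, and in the temporal MDP setting `G` would instead be the remaining episode size.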
## 5 Experiments

We now compare the performance of four machine learning approaches: the imitation learning method of Gasse et al. [17], and three different RL methods. We also compare against SCIP's default rule (for a description of this rule we refer to the Supplementary Material, A.4). Code for reproducing all experiments is available online at https://github.com/lascavana/rl2branch.

**Benchmarks.** Similarly to Gasse et al. [17], we train and evaluate each method on five NP-hard problem benchmarks, consisting of synthetic combinatorial auctions, set covering, maximum independent set, capacitated facility location and multiple knapsack instances. For each benchmark we generate a training set of 10,000 instances, along with a small set of 20 validation instances for tracking RL performance during training. For the final evaluation, we further generate a test set of 40 instances of the same size as the training ones, and a transfer set of 40 instances that are larger and more challenging than the training ones. More information about the benchmarks and instance sizes can be found in the Supplementary Material (A.5).

**Training.** We use the Graph Neural Network (GNN) from Gasse et al. [17] to learn branching policies, with the same features and architecture. This state representation has been shown to provide an efficient encoding of the local MILP together with some global features. Notice that this makes our MDP formulation a POMDP, given that we do not encode the complete search tree (see Section 2.3). We compare four training methods: imitation learning from strong branching (IL); RL using temporal policy gradients (MDP); RL using tree policy gradients with DFS as the node selection strategy (tMDP+DFS), which enforces the tree Markov property due to Proposition 4.5; and RL using tree policy gradients with the optimal objective value set as an objective limit (tMDP+ObjLim), which corresponds to Proposition 4.4. Otherwise, we use default solver parameters, except that restarts and cutting planes after the root node are deactivated. We use plain REINFORCE [35] with an entropy bonus as our RL algorithm, for simplicity. Our training procedure is summarized in Algorithm 1. We set a maximum of 15,000 epochs and a time limit of six days for training. Our implementation uses PyTorch [31] together with PyTorch Geometric [12], and Ecole [32] for interfacing with the solver SCIP [15]. All experiments are run on compute nodes equipped with a GPU.

Figure 2 ((a) Combinatorial Auctions, (b) Set Covering, (c) Maximum Independent Set): Training curves for REINFORCE with temporal policy gradients (MDP), tree policy gradients with an objective limit (tMDP+ObjLim), and tree policy gradients with DFS node selection (tMDP+DFS). We report the final B&B tree size on the validation set (geometric mean over 20 instances × 5 seeds, the lower the better) versus the number of processed training samples on the x-axis. Solid lines show the moving average. Training curves for the remaining benchmarks can be found in the Supplementary Material (A.3).

**Evaluation.** For each branching rule evaluated, we solve every validation, test or transfer instance 5 times, each with a different random seed. We use default solver parameters, except that restarts and cutting planes after the root node are deactivated (as during training), with a time limit of 1 hour for each solving run. For tMDP+DFS and tMDP+ObjLim, the specific settings used during training (DFS node selection and optimal objective limit, respectively) are no longer used, thus providing a realistic evaluation setting. We report the geometric mean of the final B&B tree size as our metric of interest, as is common practice in the MILP literature [3]. We pair this with the average per-instance standard deviation (in percentage). We only consider solving runs that finished successfully for all methods, as in [17]. Extended results, including solving times, are provided in the Supplementary Material (A.3).

### 5.2 Results

Figure 2 showcases the convergence of our three RL paradigms, MDP, tMDP+ObjLim and tMDP+DFS, during training, in terms of the final B&B tree size on the validation set (the lower the better).
In order to better highlight the sample efficiency of each method, we report on the x-axis the cumulative number of collected training samples, which correlates with the length of the episodes collected during training. This provides a hardware-independent proxy for training time. As can be seen, the tree MDP paradigm clearly improves the convergence speed on these three benchmarks, with a clear domination of tMDP+ObjLim on one benchmark (Set Covering). Training curves for the remaining benchmarks are available in the Supplementary Material (A.3).

Table 1 reports the final performance of the branching rules obtained with each method, on both a held-out test set (same instance difficulty as training) and a transfer set (larger, more difficult instances than training). Despite a mismatch between the training and evaluation environments, which is required to enforce the tree Markov property, the tree MDP paradigm consistently produces equal or better branching rules than the temporal MDP paradigm on all 5 benchmarks. On one benchmark, Multiple Knapsack, the branching rules learned by RL outperform both SCIP's default branching rule and the strong branching imitation (IL) approach. The likely reason is that the MILP formulation of Multiple Knapsack has a very poor linear relaxation, which often results in no dual bound improvement after branching. This means that strong branching scores are in most cases not discriminative, which is problematic for rules that heavily rely on this criterion (see Section 3.1), such as SCIP's default rule or a policy that imitates strong branching. This situation makes a strong case for the potential of RL-based methods, which can adapt and devise alternative branching strategies. On the remaining 4 benchmarks, however, the RL methods perform worse than SCIP's default rule or IL, despite being based on the same GNN architecture. This illustrates the difficulty of learning to branch via RL, even on small-scale problems, and the remaining room for improvement. Additional evaluation criteria (solving times and number of time limits) are available in the Supplementary Material (A.3).

Table 1: Solving performance of the different branching rules in terms of the final B&B tree size (lower is better). We evaluate each method on a test set with instances of the same size as training, and a transfer set with larger instances. We report the geometric mean and standard deviation over 40 instances, solved 5 times with different random seeds, and we bold the best of the RL methods.

**Test**

| Model | Comb. Auct. | Set Cover | Max. Ind. Set | Facility Loc. | Mult. Knap. |
| --- | --- | --- | --- | --- | --- |
| SCIP default | 7.3 ± 39% | 10.7 ± 24% | 19.3 ± 52% | 203.6 ± 63% | 267.8 ± 96% |
| IL | 52.2 ± 13% | 51.8 ± 10% | 35.9 ± 36% | 247.5 ± 39% | 228.0 ± 95% |
| RL (MDP) | 86.7 ± 16% | 196.3 ± 20% | 91.8 ± 56% | 393.2 ± 47% | 143.4 ± 76% |
| RL (tMDP+DFS) | **86.1** ± 17% | **190.8** ± 20% | 89.8 ± 51% | 360.4 ± 46% | **135.8** ± 75% |
| RL (tMDP+ObjLim) | 87.0 ± 18% | 193.5 ± 23% | **85.4** ± 53% | **325.4** ± 41% | 142.4 ± 78% |

**Transfer**

| Model | Comb. Auct. | Set Cover | Max. Ind. Set | Facility Loc. | Mult. Knap. |
| --- | --- | --- | --- | --- | --- |
| SCIP default | 733.9 ± 26% | 61.4 ± 19% | 2867.1 ± 35% | 344.3 ± 57% | 592.3 ± 75% |
| IL | 805.1 ± 9% | 145.0 ± 6% | 1774.8 ± 38% | 407.8 ± 37% | 1066.1 ± 101% |
| RL (MDP) | 1906.3 ± 18% | 853.3 ± 27% | 2768.5 ± 76% | 679.4 ± 52% | 518.4 ± 79% |
| RL (tMDP+DFS) | **1804.6** ± 17% | **816.8** ± 25% | 2970.0 ± 76% | 609.1 ± 47% | 495.1 ± 81% |
| RL (tMDP+ObjLim) | 1841.9 ± 18% | 826.4 ± 26% | **2763.6** ± 74% | **496.0** ± 48% | **425.3** ± 64% |

## 6 Conclusions and Future Directions

This paper adds to a growing body of literature on using ML to assist decision-making in several key components of the B&B algorithm (see, e.g., [5, 7, 22]).
We contribute to the study of RL as a tool for learning to branch in MILP solvers. We present tree MDPs, a variant of Markov Decision Processes, and show that under some conditions the B&B branching process is tree-Markovian. We show that the approach of Etheve et al. [11] can be naturally cast as Q-learning for tree MDPs, and we propose an alternative, computationally more appealing way to enforce the tree Markov property in B&B, using optimal objective limits. Finally, we evaluate for the first time a variety of RL-based branching rules in a comprehensive computational study, and we show that tree MDPs improve the convergence speed of RL for branching, as well as the overall performance of the learnt branching rules.

These contributions bring us closer to learning efficient branching rules from scratch using RL, which could ultimately outperform existing branching heuristics built upon decades of expert knowledge and experimentation. However, despite the convergence speed-up that our method provides, training without expert knowledge remains computationally very heavy and in general still results in worse performance than its supervised learning counterpart, which reveals a significant gap that must be closed. As future work, we would like to explore ideas for further improving sample efficiency and generalization across instance sizes. This is necessary for RL to scale to larger, non-homogeneous benchmarks, such as MIPLIB [18], which at the moment remain out of reach for RL. Finally, although our focus in this paper is on improving variable selection for MILP, our tree MDP construction could be useful in other applications. Branch-and-bound is a type of divide-and-conquer algorithm, and we expect that, in general, this framework can be applied to any problem where one seeks to control such algorithms more efficiently. Examples include controlling the order in which a rover explores rooms in a building, or selecting the pivots in a quicksort algorithm.

## Acknowledgements

The authors thank the anonymous reviewers for their suggestions. The work of Karen Aardal, Lara Scavuzzo and Neil Yorke-Smith was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under grant number 952215, and by The Netherlands Organisation for Scientific Research (NWO), grant OCENW.GROOT.2019.015. The work of Feng Yang Chen, Didier Chételat, Maxime Gasse and Andrea Lodi was supported by the Canada Excellence Research Chairs program (CERC).

## References

[1] Tobias Achterberg. Constraint Integer Programming. PhD thesis, Technische Universität Berlin, 2007.
[2] Tobias Achterberg and Timo Berthold. Hybrid branching. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 309–311. Springer, 2009.
[3] Tobias Achterberg and Roland Wunderling. Mixed integer programming: Analyzing 12 years of progress. In Facets of Combinatorial Optimization, pages 449–481. Springer, 2013.
[4] Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial Optimization, pages 37–60. Springer, 1980.
[5] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research, 2020.
[6] David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization, volume 1. Springer, 2016.
[7] Antonia Chmiela, Elias B. Khalil, Ambros Gleixner, Andrea Lodi, and Sebastian Pokutta. Learning to schedule heuristics in branch-and-bound. arXiv preprint arXiv:2103.10294, 2021.
[8] Gérard Cornuéjols, Ranjani Sridharan, and Jean-Michel Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, 1991.
[9] Santanu S. Dey, Yatharth Dubey, Marco Molinaro, and Prachi Shah. A theoretical and computational analysis of full strong-branching. arXiv preprint arXiv:2110.10754, 2021.
[10] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[11] Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforcement learning for variable selection in a branch and bound algorithm. In CPAIOR, 2020.
[12] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[13] Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, 2011.
[14] Gerald Gamrath and Christoph Schubert. Measuring the impact of branching rules for mixed-integer programming. In Operations Research Proceedings 2017, pages 165–170. Springer, 2018.
[15] Gerald Gamrath, Daniel Anderson, Ksenia Bestuzheva, Wei-Kun Chen, Leon Eifler, Maxime Gasse, Patrick Gemander, Ambros Gleixner, Leona Gottwald, Katrin Halbig, Gregor Hendel, Christopher Hojny, Thorsten Koch, Pierre Le Bodic, Stephen J. Maher, Frederic Matter, Matthias Miltenberger, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Christine Tawfik, Stefan Vigerske, Fabian Wegscheider, Dieter Weninger, and Jakob Witzig. The SCIP Optimization Suite 7.0. ZIB-Report 20-10, Zuse Institute Berlin, March 2020.
[16] Gerald Gamrath, Timo Berthold, and Domenico Salvagnin. An exploratory computational analysis of dual degeneracy in mixed-integer programming. EURO Journal on Computational Optimization, 8(3):241–261, 2020.
[17] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, pages 15580–15592, 2019.
[18] Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. MIPLIB 2017. Mathematical Programming Computation, pages 1–48, 2021.
[19] Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In Advances in Neural Information Processing Systems, volume 33, 2020.
[20] Christoph Hansknecht, Imke Joormann, and Sebastian Stiller. Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem. arXiv preprint arXiv:1805.01415, 2018.
[21] He He, Hal Daumé III, and Jason M. Eisner. Learning to search in branch and bound algorithms. In Advances in Neural Information Processing Systems, pages 3293–3301, 2014.
[22] Elias B. Khalil, Christopher Morris, and Andrea Lodi. MIP-GNN: A data-driven framework for guiding combinatorial solvers. In AAAI, 2022.
[23] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM Conference on Electronic Commerce, pages 66–76, 2000.
[25] Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. TOP, 25(2):207–236, 2017.
[26] Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux. Online learning for strong branching approximation in branch-and-bound. Technical report, Université de Liège, 2016.
[27] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[29] Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
[30] Vangelis Th. Paschos. Applications of Combinatorial Optimization, volume 3. John Wiley & Sons, 2014.
[31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[32] Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Workshop on Learning Meets Combinatorial Algorithms, NeurIPS 2020, November 2020.
[33] Haoran Sun, Wenbo Chen, Hui Li, and Le Song. Improving learning to branch via reinforcement learning. In Learning Meets Combinatorial Algorithms at NeurIPS 2020, 2020.
[34] Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pages 1057–1063. Citeseer, 1999.
[35] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
[36] Giulia Zarpellon, Jason Jo, Andrea Lodi, and Yoshua Bengio. Parameterizing branch-and-bound search trees to learn branching policies. In AAAI, pages 3931–3939, 2021.