# learning_to_search_with_mctsnets__1d59a174.pdf

Learning to Search with MCTSnets

Arthur Guez * 1 Th eophane Weber * 1 Ioannis Antonoglou 1 Karen Simonyan 1

Oriol Vinyals 1 Daan Wierstra 1 R emi Munos 1 David Silver 1

Planning problems are among the most important and well-studied problems in artiﬁcial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm signiﬁcantly outperformed MCTS baselines.

1. Introduction

Many success stories in artiﬁcial intelligence are based on the application of powerful tree search algorithms to challenging planning problems (Samuel, 1959; Knuth & Moore, 1975; J unger et al., 2009). It has been well documented that planning algorithms can be highly optimised by tailoring them to the domain (Schaeffer, 2000). For example, the performance can often be dramatically improved by modifying the rules that select the trajectory to traverse, the states to expand, the evaluation function by which performance

*Equal contribution 1Deep Mind, London, UK. Correspondence to: A. Guez and T. Weber <{aguez, theophane}@google.com>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

is measured, and the backup rule by which those evaluations are propagated up the search tree. Our contribution is a new search algorithm in which all of these steps can be learned automatically and efﬁciently. Our work ﬁts into a more general trend of learning differentiable versions of algorithms.

One particularly powerful and general method for planning is Monte-Carlo tree search (MCTS) (Coulom, 2006; Kocsis & Szepesv ari, 2006), as used in the recent Alpha Go program (Silver et al., 2016). A typical MCTS algorithm consists of several phases. First, it simulates trajectories into the future, starting from a root state. Second, it evaluates the performance of leaf states - either using a random rollout, or using an evaluation function such as a value network . Third, it backs-up these evaluations to update internal values along the trajectory, for example by averaging over evaluations.

We present a neural network architecture that includes the same processing stages as a typical MCTS, but inside the neural network itself, as a dynamic computational graph. The key idea is to represent the internal state of the search, at each node, by a memory vector. The computation of the network proceeds forwards from the root state, just like a simulation of MCTS, using a simulation policy based on the memory vector to select the trajectory to traverse. The leaf state is then processed by an embedding network to initialize the memory vector at the leaf. The network proceeds backwards up the trajectory, updating the memory at each visited state according to a backup network that propagates from child to parent. Finally, the root memory vector is used to compute an overall prediction of value or action.

The major beneﬁt of our planning architecture, compared to more traditional planning algorithms, is that it can be exposed to gradient-based optimisation. This allows us to replace every component of MCTS with a richer, learnable equivalent while maintaining the desirable structural properties of MCTS such as the use of a model, iterative local computations, and structured memory. We jointly train the parameters of the evaluation network, backup network and simulation policy so as to optimise the overall predictions of the MCTS network (MCTSnet). The majority of the network is fully differentiable, allowing for

Learning to Search with MCTSnets

efﬁcient training by gradient descent. Still, internal action sequences directing the control ﬂow of the network cannot be differentiated, and learning this internal policy presents a challenging credit assignment problem. To address this, we propose a novel, generally-applicable approximate scheme for credit assignment that leverages the anytime property of our computational graph, allowing us to also effectively learn this part of the search network from data.

In the Sokoban domain, a classic planning task (Botea et al., 2003), we justify our network design choices and show that our learned search algorithm is able to outperform various model-free and model-based baselines.

2. Related Work

There has been signiﬁcant previous work on learning evaluation functions, using supervised learning or reinforcement learning, that are subsequently combined with a search algorithm (Tesauro, 1994; Baxter et al., 1998; Silver et al., 2016). However, the learning process is typically decoupled from the search algorithm, and has no awareness of how the search algorithm will combine those evaluations into an overall decision.

Several previous search architectures have learned to tune the parameters of the evaluation function so as to achieve the most effective overall search results given a speciﬁed search algorithm. The learning-to-search framework (Chang et al., 2015) learns an evaluation function that is effective in the context of beam search. Samuel s checkers player (Samuel, 1959), the TD(leaf) algorithm (Baxter et al., 1998; Schaeffer et al., 2001), and the Tree Strap algorithm apply reinforcement learning to ﬁnd an evaluation function that combines with minimax search to produce an accurate root evaluation (Veness et al., 2009); while comparison training (Tesauro, 1988) applies supervised learning to the same problem; these methods have been successful in chess, checkers and shogi. In all cases the evaluation function is scalar valued.

There have been a variety of previous efforts to frame the learning of internal search decisions as a meta-reasoning problem, one which can be optimized directly (Russell, 1995). Kocsis et al. (2005) apply black-box optimisation to learn the meta-parameters controlling an alpha-beta search, but do not learn ﬁne-grained control over the search decisions. Considering action choices at tree nodes as a bandit problem led to the widely used UCT variant of MCTS (Kocsis & Szepesv ari, 2006). Hay & Russell (2011) also studied the meta-problem in MCTS, but they only considered a myopic policy without function approximation. Pascanu et al. (2017) also investigate learning-to-plan using neural networks. However, their system uses an unstructured memory which makes complex branching very unlikely. Initial results have been provided for toy domains but it has not yet

been demonstrated to succeed in any domain approaching the complexity of Sokoban.

Other neural network architectures have also incorporated Monte-Carlo simulations. The I2A architecture (Weber et al., 2017) aggregates the results of several simulations into its network computation. MCTSnets both generalise and extend some ideas behind I2A: introducing a tree structured memory that stores node-speciﬁc statistics; and learning the simulation and tree expansion strategy, rather than rolling out each possible action from the root state with a ﬁxed policy. Similar to I2A, the predictron architecture (Silver et al., 2017b) also aggregates over multiple simulations; however, in that case the simulations roll out an implicit transition model, rather than concrete steps from the actual environment. A recent extension by Farquhar et al. (2017) performs planning over a ﬁxed-tree expansion of such implicit model.

The MCTSnet architecture may be understood from two distinct but equivalent perspectives. First, it may be understood as a search algorithm with a control ﬂow that closely mirrors the simulation-based tree traversals of MCTS. Second, it may be understood as a neural network represented by a dynamic computation graph that processes input states, performs intermediate computations on hidden states, and outputs a ﬁnal decision. We present each of these perspectives in turn, starting with the original, unmodiﬁed search algorithm.

3.1. MCTS Algorithm

The goal of planning is to ﬁnd the optimal strategy that maximises the total reward in an environment deﬁned by a deterministic transition model s = T(s, a), mapping each state and action to a successor state s , and a reward model r(s, a), describing the goodness of each transition.

MCTS is a simulation-based search algorithm that converges to a solution to the planning problem. At a high level, the idea of MCTS is to maintain statistics at each node, such as the visit count and mean evaluation, and to use these statistics to decide which branches of the tree to visit.

MCTS proceeds by running a number of simulations. Each simulation traverses the tree, selecting the most promising child according to the statistics, until a leaf node is reached. The leaf node is then evaluated using a rollout or value-network (Silver et al., 2016). This value is then propagated during a back-up phase that updates statistics of the tree along the traversed path, tracking the visit counts N(s), N(s, a) and mean evaluation Q(s, a) following from each state s and action a. Search proceeds in an anytime fashion: the statistics gradually become more accurate, and simulations focus on increasingly promising regions of the

Learning to Search with MCTSnets

We now describe a value-network MCTS in more detail. Each simulation from the root state s A is composed of four stages: 1

Algorithm 1: Value-Network Monte-Carlo Tree Search

1. Initialize simulation time t = 0 and current node s0 = s A.

2. Forward simulation from root state. Do until we reach a leaf node (N(st) = 0):

(a) Sample action at based on simulation policy, at π(a|st, {N(st), N(st, a), Q(st, a)}), (b) the reward rt = r(st, at) and next state st+1 = T(st, at) are computed (c) Increment t.

3. Evaluate leaf node s L found at depth L.

(a) Obtain value estimate V (s L), (b) Set N(s L) = 1.

4. Back-up phase from leaf node s L, for each t < L

(a) Set (s, a) = (st, at). (b) Update Q(s, a) towards the Monte-Carlo return:

Q(s, a) Q(s, a) + 1 N(s, a) + 1 (Rt Q(s, a))

where Rt = PL 1 t =t γt trt + γL t V (s L) (c) Update visit counts N(s) and N(s, a):

N(s) N(s) + 1, N(s, a) N(s, a) + 1

When the search completes, it selects the action at the root with the most visit counts. The simulation policy π is chosen to trade-off exploration and exploitation in the tree. In the UCT variant of MCTS (Kocsis & Szepesv ari, 2006), π is inspired by the UCB bandit algorithm (Auer, 2002).2

3.2. MCTSnet: search algorithm

We now present MCTSnet as a generalisation of MCTS in which the statistics being tracked by the search algorithm, the backup update, and the expansion policy, are all learned from data.

MCTSnet proceeds by executing simulations that start from the root state s A. When a simulation reaches a leaf node, that node is expanded and evaluated to initialize a vector of statistics. The back-up phase then updates statistics in each node traversed during the simulation. Speciﬁcally, the

1We write states with letter indices (s A, s B, . . . ) when referring to states in the tree across simulations, and use time indices (s0, s1, . . . ) for states within a simulation. 2In this case, the simulation policy is deterministic, π(s) = arg maxa Q(s, a) + c p

log(N(s))/N(s, a).

parent node statistics are updated to new values that depend on the child values and also on their previous values. Finally, the selected action is chosen according to the statistics at the root of the search tree.

Different sub-networks are responsible for each of the components of the search described in the previous paragraph. Internally, these sub-networks manipulate the memory statistics h at each node of the tree, which now have a vector representation, h Rn. An embedding network h ϵ(s; θe) evaluates the state s and computes initial raw statistics; this can be understood as the generalisation of the value network in Algorithm 1. A simulation policy a π( |h; θs) is used to select actions during each simulation, based on statistics h. A backup network hparent β(hparent, hchild; θb) updates and propagates the statistics up the search tree. Finally, an overall decision (or evaluation) is made by a readout network a ρ(hs A; θr).

An algorithmic description of the search network follows in Algorithm 2: 3

Algorithm 2: MCTSnet For m = 1 . . . M, do simulation:

1. Initialize simulation time t = 0 and current node s0 = s A.

2. Forward simulation from root state. Do until we reach a leaf node (N(st) = 0):

(a) Sample action at based on simulation policy, at π(a|hst; θs), (b) Compute the reward rt = r(st, at) and next state st+1 = T(st, at). (c) Increment t.

3. Evaluate leaf node s L found at depth L.

(a) Initialize node statistics using the embedding network: hs L ϵ(s L; θe)

4. Back-up phase from leaf node s L, for each t < L

(a) Using the backup network β, update the node statistic as a function of its previous statistics and the statistic of its child:

hst β(hst, hst+1, rt, at; θb)

After M simulations, readout network outputs a (real) action distribution from the root memory, ρ(hs A; θr).

3.3. MCTSnet: neural network architecture

We now present MCTSnet as a neural network architecture. The algorithm described above effectively deﬁnes a form of tree-structured memory: each node sk of the tree maintains its own corresponding statistics hk. The statistics are initial-

3For clarity, we omitted the update of visits N(s), only used to determine leaf nodes, which proceeds as in Alg. 1.

Learning to Search with MCTSnets

ized by the embedding network, but otherwise kept constant until the next time the node is visited. They are then updated using the backup network. For a ﬁxed tree expansion, this allows us to see MCTSnet as a deep residual and recursive network with numerous skip connections, as well as a large number of inputs - all corresponding to different potential future states of the environment. We describe MCTSnet again following this different viewpoint in this section.

It is useful to introduce an index for the simulation count m, so that the tree memory after simulation m is the set of hm s for all tree nodes s. Conditioned on a tree path pm+1 = s0, a0, s1, a1, , s L for simulation m + 1, the MCTSnet memory gets updated as follows:

1. For t = L:

hm+1 s L ϵ(s L) (1)

2. For t < L:

hm+1 st β(hm st, hm+1 st+1 , r(st, at), at; θb) (2)

3. For all other s:

hm+1 s hm s (simulation m + 1 is skipped) (3)

The tree path that gates this memory update is sampled as:

pm+1 P(s0a0 s L|hm) (4)

t=0 π(at|hm st; θs)1[st+1 = T(st, at)], (5)

where L is a random stopping time for the tree path deﬁned by (N m(s L 1) > 0, N m(s L) = 0). An illustration of this update process is provided in Fig. 1.

Note that the computation ﬂow of the MCTS network is not only deﬁned by the ﬁnal tree, but also by the order in which nodes are visited (the tree expansion process). Furthermore, taken as a whole, MCTSnet is a stochastic feed-forward network with single input (the initial state) and single output (the action probabilities). However, thanks to the tree-structured memory, MCTSnet naturally allows for partial replanning, in a fashion similar to MCTS. Assume that from root-state s A, the MCTS network chooses action a, and transitions (in the real environment) to new state s A. We can initialize the MCTS network for s A as the subtree rooted in s A, and initialize node statistics of the subtree to their previously computed values the resulting network would then be recurrent across real time-steps.

3.4. Design choices

We now provide design details for each sub-network in MCTSnet.

Backup β The backup network contains a gated residual connection, allowing it to selectively ignore information originating from a node s subtree. It updates hs as follows:

hs β(φ; θb) = hs + g(φ; θb)f(φ; θb), (6)

where φ = (hs, hs , r, a) and where g is a learned gating function with range [0, 1] and f is the learned update function. We justify this architecture in Sec. 4.2, by comparing it to a simpler MLP which maps φ to the updated value of hs.

Learned simulation policy π In its basic, unstructured form, the simulation policy network is a simple MLP mapping the statistics hs to the logits ψ(s, a), which deﬁne π(a|s; θs) exp(ψ(s, a)).

We consider adding structure by modulating each logit with side-information corresponding to each action. One form of action-speciﬁc information is obtained from the child statistic h T (s,a). Another form of information comes from a learned policy prior µ over actions, with log-probabilities ψ(s, a; θp) = log µ(s, a; θp); as in PUCT (Rosin, 2011). In our case, the policy prior comes from learning a small, model-free residual network on the same data. Combined, we obtain the following modulated network version for the simulation policy logits:

ψ(s, a) = w0ψp(s, a) + w1u hs, h T (s,a) , (7)

where u is a small network that combines information from the parent and child statistics.

Embedding ϵ and readout network ρ The embedding network is a standard residual convolution network. The readout network, ρ, is a simple MLP that transforms a memory vector at the root into the required output format, in this case an action distribution. See appendix for details.

3.5. Training MCTSnet

The readout network of MCTSnet ultimately outputs an overall decision or evaluation from the entire search. This ﬁnal output may in principle be trained according to any loss function, such as by value-based or policy gradient reinforcement learning.

However, in order to focus on the novel aspects of our architecture, we choose to investigate the MCTSnet architecture in a supervised learning setup in which labels are desired actions a in each state (the dataset is detailed in the appendix). Our objective is to then train MCTSnet to predict the action a from state s.

Learning to Search with MCTSnets

sim 1 2 3 4

search trees s A s A

s B s B s C s B

s A / h A s A

MCTSnet s B / h B

Figure 1. This diagram shows an execution of a search with M = 4. (Top) The evolution of the search tree rooted at s0 after each simulation, with the last simulation path highlighted in red. (Bottom) The computation graph in MCTSnet resulting from these simulations. Black arrows represent the application of the embedding network ϵ(s) to initialize h at tree node s. Red arrows represent the forward tree traversal during a simulation using the simulation policy (based on last memory state) and the environment model until a leaf node is reached. Blue arrows correspond to the backup network β, which updates the memory statistics h along the traversed simulation path based on the child statistic and the last updated parent memory (in addition to transition information such as reward). The diagram makes it clear that this backup mechanism can skip over simulations where a particular node was not visited. For example, the fourth simulation updates h B based on h B from the second simulation, since s B was not visited during the third simulation. Finally, the readout network ρ, in green, outputs the action distribution based on the last root memory h A. Additional diagrams are available in the appendix.

We denote zm the set of actions sampled stochastically during the mth simulation; z m the set of all stochastic actions taken up to simulation m, and z = z M the set of all stochastic actions. The number of simulations M is either chosen to be ﬁxed, or taken from a stochastic distribution p M.

After performing the desired number of simulations, the network output is the action probability vector pθ(a|s, z). It is a random function of the state s due to the stochastic actions z. We choose to optimize pθ(a|s, z) so that the prediction is on average correct, by minimizing the average cross entropy between the prediction and the correct label a . For a pair (s, a ), the loss is:

ℓ(s, a ) = Ez π(z|s) [ log pθ(a |s, z)] . (8)

This can also be interpreted as a lower-bound on the log-likelihood of the marginal distribution pθ(a|s) = Ez π(z|s) (pθ(a|z, s)).

We minimize l(s, a ) by computing a single sample estimate of its gradient (Schulman et al., 2015):

θ ℓ(s, a ) = Ez h θ log pθ(a |s, z)

+ ( θ log π(z|s; θs)) log pθ(a |s, z) i . (9)

The ﬁrst term of the gradient corresponds to the differentiable path of the network as described in section. The

second term corresponds to the gradient with respect to the simulation distribution, and uses the REINFORCE or score-function method. In this term, the ﬁnal log likelihood log pθ(a |s, z) plays the role of a reward signal: in effect, the quality of the search is determined by the conﬁdence in the correct label (as measured by its log-likelihood); the higher that conﬁdence, the better the tree expansion, and the more the stochastic actions z that induced that tree expansion will be reinforced. In addition, we follow the common method of adding a neg-entropy regularization term on π(a|s; θs) to the loss, to prevent premature convergence.

3.6. A credit assignment technique for anytime algorithms

Although it is unbiased, the REINFORCE gradient above has high variance; this is due to the difﬁculty of credit assignment in this problem: the number of decisions that contribute to a single decision a is large (between O(M log M) and O(M 2) for M simulations), and understanding how each decision contributed to a low error through a better tree expansion structure is very intricate.

In order to address this issue, we design a novel credit assignment technique for anytime algorithms, by casting the loss minimization for a single example as a sequential decision problem, and using reinforcement learning techniques to come up with a family of estimators, allowing us to ma-

Learning to Search with MCTSnets

nipulate the bias-variance trade-off.

Consider a general anytime algorithm which, given an initial state s, can run for an arbitrary number of internal steps M in MCTSnet, these are simulations. For each step m = 1 . . . M, any number of stochastic decisions (collectively denoted zm) may be taken, and at the end of each step, a candidate output distribution pθ(a|s, z m) may be evaluated against a loss function ℓ. The value of the loss at the end of step m is denoted ℓm = ℓ(pθ(a|s, z m)). We assume the objective is to maximize the terminal negative loss ℓM. Letting ℓ0 = 0, we rewrite the terminal loss as a telescoping sum:

ℓM = (ℓM ℓ0) = X

m=1...M (ℓm ℓm 1)

m=1...M rm,

where we deﬁne the (surrogate) reward rm as the decrease in loss (ℓm ℓm 1) obtained during the mth step. We then deﬁne the return Rm = P m m rm as the sum of future rewards from step m; by deﬁnition we have R1 = ℓM.

The REINFORCE term of equation (9), θ log π(z|s; θs) log pθ(a |s, z) = A, can be rewritten:

m θ log π(zm|s; θs)R1. (10)

Since stochastic variables in zm can only affect future rewards r m, m m, it follows from a classical policy gradient argument that (10) is, in expectation, also equal to:

m θ log π(zm|s, z<m; θs)Rm (11)

m θ log π(zm|s, z<m; θs)(ℓM ℓm 1). (12)

In other words, the stochastic decisions from step m do not simply use the terminal loss as reinforcement signal, but rather the difference between the terminal loss, and the loss computed before step m started (the baseline). This estimate is still unbiased, but has lower variance, especially for the later steps of algorithm. Next, we trade off bias and variance by introducing a discount term γ. In essence, for simulation choices zm, we choose to reward short term improvements more than later ones, since the relation between simulation m and later improvements is harder to ascertain and likely to mostly appear as noise. Letting Rγ m = P

m m γm mrm , our ﬁnal gradient estimate of the MCTSnet loss becomes:

θ l(s, a ) = Ez h θ log pθ(a |x, z)

m θ log π(zm|s; θs)Rγ m i , (13)

Rγ m can be rewritten as the average of future baselined losses lm+t lm 1, where t follows a truncated geometric distribution with parameter γ and maximum value M m. Letting γ = 0 leads to a greedy behavior, where actions of simulation m are only chosen as to maximize the immediate improvement in loss (ℓm ℓm 1). This myopic mode is linked to the single-step assumption proposed by Russell & Wefald (1989) in an analog context.

4. Experiments

Figure 2. Evolution of success ratio in Sokoban during training using a continuous evaluator. MCTSnet (with M = 25) against two model-free baselines. In one case (M = 2), the copy-model has access to the same number of parameters and the same subnetworks. When M = 25, the baseline also matches the amount of computation. We also provide performance of MCTS with UCT with variable number of simulations.

We investigate our architecture in the game of Sokoban, a classic, challenging puzzle game (Botea et al., 2003). As described above, our results are obtained in a supervised training regime. However, we continuously evaluate our network during training by running it as an agent in random Sokoban levels and report its success ratio in solving the levels. Throughout this experimental section, we keep the architecture and size of both embedding and readout network ﬁxed, as detailed in the appendix.

4.1. Main results

We ﬁrst compare our MCTSnet architecture with M = 25 simulations to a couple of model-free baselines. To assess whether the MCTSnet leverages the information contained in the simulations (from transition model T and reward function r), we consider a version of the network that uses a sham environment model where T(s, a) = s and r(s, a) = 0, but otherwise has identical architecture. For the case M = 2, the baseline has the same number of parameters as MCTSnet, and uses each subnetwork exactly once. We also

Learning to Search with MCTSnets

test this architecture for the case M = 25, in which case the model-free baseline has the same number of parameters and can perform the same amount of computation (but does not have access to the real environment model). We also evaluate a standard model-based method (with no possibility of learning) in the same environment: MCTS with a prelearned value function (see appendix for details). When given access to 25 simulations per step, same as MCTSnet, we observe 30% success ratio for MCTS in Sokoban. It requires 20 more times simulations for this version of MCTS to reach the level of performance of MCTSnet. We also tested a stronger model-based baseline that uses both a pre-learned policy and value network with a PUCT-like rule (Silver et al., 2016), this achieved 65% with 25 sims.

Overall, MCTSnet performs favorably against both modelbased and model-free baselines, see Fig. 2. These comparisons validate two ingredients of our approach. First, the comparison of MCTSnet to its model-free variant conﬁrms that it extracts information contained in states visited (and rewards obtained) during the search - in section 4.3 we show that it is also able to learn nontrivial search policies. Second, at test time, MCTSnet and MCTS both use the same environment model, and therefore have in principle access to the same information. The higher performance of MCTSnet demonstrates the beneﬁts of learning and propagating vector-valued statistics which are richer and more informative than those tracked by MCTS.

Using the architecture detailed in Sec. 3.4 and 25 simulations, MCTSnets reach 84 1% of levels solved4 close to the 87% obtained in (Weber et al., 2017), although in a different setting (supervised vs RL, 1e8 vs. 1e9 environment steps). We now consider detailed comparisons to justify our different design choices for MCTSnet.

4.2. Learned statistics and how to perform backups

In this section, we justify the backup network choice made in Sec. 3.4, by comparing the simple MLP version to the gated residual architecture we suggested. We ﬁnd the gated residual architecture for the backup network to be advantageous both in terms of stability and accuracy. As Fig. 3 illustrates, with M = 10 simulations, the gated residual version systematically achieves better performance. For larger number of simulations (M > 25), we found the nonresidual backup network to be simply numerically unstable. This can be explained by a large number of updates being recurrently applied, which can cause divergence when the network has arbitrary form. Instead, the gated network can quickly reduce the inﬂuence of the update coming from the subtree, and then slowly depart from the identity skipconnection; this is the behavior we ve observed in practice.

4A video of MCTSnet solving Sokoban levels is available here: https://goo.gl/2Bu8HD.

Figure 3. Comparison of two backup network architectures (Number of simulations is M = 10).

For all other experiments, we therefore employed the gated residual backup network in MCTSnet.

4.3. Learning the simulation policy

MCTSnet learns a tree exploration strategy that is directly tuned to obtain the best possible ﬁnal output. However, as previously mentioned, learning the simulation policy is challenging because of the noisy estimation of the pseudo-return for the selected sequence z. We investigate the effectiveness of our proposed designs for π (see Sec. 3.4) and of our proposed approximate credit assignment scheme for learning its parameters (see Sec. 3.6).

Basic setting Despite the estimation issues, we veriﬁed that we can nevertheless lift π, in its simple form, above the performance of a MCTSnet using a uniform random simulation policy. Note that MCTSnet with a random simulation strategy already performs reasonably since it can still take advantage of the learned statistics, backups, and readout. See blue and red curves in Fig. 4a for a comparison.

Improved credit assignment technique The vanilla gradient in Eq. (9) is enough to learn a simulation policy π that performs better than a random search strategy, but it is still hampered by the noisy estimation process of the simulation policy gradient. A more effective search strategy can be learned using our proposed credit assignment scheme in Sec. 3.6 and the modulated policy architecture in Sec. 3.4. To show this, we train MCTSnets for different values of the discount γ using the modulated policy architecture. The results in Fig. 4b demonstrate that the value of γ = 1, for which the network is optimizing the true loss, is not the ideal choice in MCTSnet. Lower values of γ perform better at different stages of training. In late training, when the estimation problem is more stationary, the advantage of γ < 1 reduces but remains. The best performing MCTSnet architecture is shown in comparison to others in Fig. 4a. We also investigated whether simply providing the policy prior

Learning to Search with MCTSnets

Figure 4. a) Comparison of different simulation policy strategies in MCTSnet with M = 25 simulations. b) Effect of different values of γ in our approximate credit assignment scheme.

term in Eq. 7 (i.e., setting w1 = 0) could match these results. With the policy prior learned with the right entropy regularization, this is indeed a well-performing simulation policy for MCTSnet with 25 simulations, but it did not match our best performing learned policy. We refer to it as distilled simulation.

4.4. Scalability with number of simulations

Figure 5. Performance of MCTSnets trained with different number of simulations M (nsims). Generally, larger searches achieve better results with less training steps.

Thanks to weight sharing, MCTSnets can in principle be run for an arbitrary number of simulations M 1. With larger number of simulations, the search has progressively more opportunities to query the environment model. To check whether our search network could take advantage of this during training, we compare MCTSnet for different number of simulations M, applied during both training and evaluation. In Fig. 5, we ﬁnd that our approach was able to query and extract relevant information with additional simulations,

generally achieving better results with less training steps.

5. Discussion

We have shown that it is possible to learn successful search algorithms by framing them as a dynamic computational graph that can be optimized with gradient-based methods. This may be viewed as a step towards a long-standing AI ambition: meta-reasoning about the internal processing of the agent. In particular, we proposed a neural version of the MCTS algorithm. The aim was to maintain the desirable properties of MCTS while allowing some ﬂexibility to improve on the choice of nodes to expand, the statistics to store in memory, and the way in which they are propagated, all using gradient-based learning. A more pronounced departure from existing search algorithms could also be considered, although this may come at the expense of a harder optimization problem. We have also assumed that the true environment is available as a simulator, but this could also be relaxed: a model of the environment could be learned separately (Weber et al., 2017), or even end-toend (Silver et al., 2017b). One advantage of this approach is that the search algorithm could learn how to make use of an imperfect model. Although we have focused on a supervised learning setup, our approach could easily be extended to a reinforcement learning setup by leveraging policy iteration with MCTS (Silver et al., 2017a; Anthony et al., 2017). We have focused on small searches, more similar in scale to the plans that are processed by the human brain (Arbib, 2003), than to the massive-scale searches in high-performance games or planning applications. In fact, our learned search performed better than a standard MCTS with more than an order-of-magnitude more computation, suggesting that learned approaches to search may ultimately replace their handcrafted counterparts.

Learning to Search with MCTSnets

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. ar Xiv preprint ar Xiv:1705.08439, 2017.

Arbib, M. A. The handbook of brain theory and neural networks. MIT press, 2003.

Auer, P. Using conﬁdence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002.

Baxter, J., Tridgell, A., and Weaver, L. Knightcap: A chess program that learns by combining td (λ) with gametree search. In Proceedings of the 15th International Conference on Machine Learning, 1998.

Botea, A., M uller, M., and Schaeffer, J. Using abstraction for planning in sokoban. In Computers and Games, volume 2883, pp. 360, 2003.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daume, H., and Langford, J. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2058 2066, 2015.

Coulom, R. Efﬁcient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72 83. Springer, 2006.

Farquhar, G., Rockt aschel, T., Igl, M., and Whiteson, S. Treeqn and atreec: Differentiable tree planning for deep reinforcement learning. In ICLR, 2017.

Hay, N. and Russell, S. J. Metareasoning for monte carlo tree search. Technical Report UCB/EECS-2011-119, EECS Department, University of California, Berkeley, 2011.

J unger, M., Liebling, T. M., Naddef, D., Nemhauser, G. L., Pulleyblank, W. R., Reinelt, G., Rinaldi, G., and Wolsey, L. A. 50 years of integer programming 1958-2008: From the early years to the state-of-the-art. Springer, 2009.

Knuth, D. E. and Moore, R. W. An analysis of alpha-beta pruning. Artiﬁcial intelligence, 6(4):293 326, 1975.

Kocsis, L. and Szepesv ari, C. Bandit based monte-carlo planning. In ECML, volume 6, pp. 282 293. Springer, 2006.

Kocsis, L., Szepesv ari, C., and Winands, M. H. RSPSA: enhanced parameter optimization in games. In Advances in Computer Games, pp. 39 56. Springer, 2005.

Pascanu, R., Li, Y., Vinyals, O., Heess, N., Buesing, L., Racani ere, S., Reichert, D., Weber, T., Wierstra, D., and Battaglia, P. Learning model-based planning from scratch. ar Xiv preprint ar Xiv:1707.06170, 2017.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artiﬁcial Intelligence, 61(3): 203 230, 2011.

Russell, S. Rationality and intelligence. In Proceedings of the 14th international joint conference on Artiﬁcial intelligence-Volume 1, pp. 950 957. Morgan Kaufmann Publishers Inc., 1995.

Russell, S. and Wefald, E. On optimal game-tree search using rational meta-reasoning. In Proceedings of the 11th international joint conference on Artiﬁcial intelligence Volume 1, pp. 334 340, 1989.

Samuel, A. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210, 1959.

Schaeffer, J. The games computers (and people) play. Advances in computers, 52:189 266, 2000.

Schaeffer, J., Hlynka, M., and Jussila, V. Temporal difference learning applied to a high-performance gameplaying program. In Proceedings of the 17th international joint conference on Artiﬁcial intelligence-Volume 1, pp. 529 534. Morgan Kaufmann Publishers Inc., 2001.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528 3536, 2015.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484 489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., et al. Mastering the game of go without human knowledge. Nature, 550:354 359, 2017a.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. The predictron: End-to-end learning and planning. In ICML, 2017b.

Tesauro, G. Connectionist learning of expert preferences by comparison training. In Advances in Neural Information Processing, pp. 99 106, 1988.

Learning to Search with MCTSnets

Tesauro, G. TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6: 215 219, 1994.

Veness, J., Silver, D., Blair, A., and Uther, W. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, pp. 1937 1945, 2009.

Weber, T., Racani ere, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. ar Xiv preprint ar Xiv:1707.06203, 2017.