# On Strength Adjustment for MCTS-Based Programs

I-Chen Wu,* Ti-Rong Wu,* An-Jen Liu,* Hung Guei, Tinghan Wei
Department of Computer Science, National Chiao Tung University, 1001 University Road, Hsinchu, Taiwan, ROC
{icwu, kds285, andyliu, ghung, ting}@aigames.nctu.edu.tw

\* Equal contribution.

## Abstract

This paper proposes an approach to strength adjustment for MCTS-based game-playing programs. In this approach, we use a softmax policy with a strength index $z$ to choose moves. Most importantly, we filter out low-quality moves by excluding those whose simulation counts fall below a pre-defined threshold ratio of the maximum simulation count. We perform a theoretical analysis, reaching the result that the adjusted policy is guaranteed to choose moves exceeding a lower bound in strength when a threshold ratio is used. The approach is applied to the Go program ELF OpenGo. The experiment results show that $z$ is highly correlated with the empirical strength; namely, given a threshold ratio of 0.1, $z$ is linearly related to the Elo rating with a regression error of 47.95 Elo for $z \in [-2, 2]$. Meanwhile, the covered strength range is about 800 Elo over the interval $z \in [-2, 2]$. With the ease of strength adjustment using $z$, we present two methods to adjust strength and predict opponents' strengths dynamically. To our knowledge, this result is state-of-the-art in terms of the range of strengths in Elo rating while maintaining a controllable relationship between the strength and a strength index.

## Introduction

Artificial intelligence in computer games has made significant progress in recent years, especially after DeepMind's AlphaGo (Silver et al. 2016) defeated human Go champions by a large margin in 2016. DeepMind then followed up this success with AlphaGo Zero (Silver et al. 2017b), which further improved playing strength without requiring human knowledge, resulting in programs much stronger than earlier versions of AlphaGo. Both AlphaGo and AlphaGo Zero incorporate deep neural networks into Monte Carlo tree search (MCTS) (Browne et al. 2012; Coulom 2006; Kocsis and Szepesvári 2006), which itself had been a major breakthrough responsible for more than ten years of rapid growth in computer games, particularly computer Go, before AlphaGo was announced.

Since AlphaGo Zero, many other programs, such as Fine Art (Tencent AI Lab 2018), Leela Zero (Pascutto 2018), and ELF OpenGo (Tian et al. 2018), have successfully reproduced the AlphaGo Zero algorithm. AlphaGo Zero's method was also applied to other games such as chess and shogi, reaching strength levels much higher than human champions and other top programs (Silver et al. 2017a).

Super-human game-playing programs capture the imagination and fascination of society at large; for human players, these programs pose an interesting challenge and offer opportunities for learning (Hunicke and Chapman 2004; Demediuk et al. 2017; Paulsen and Fürnkranz 2010; Sephton, Cowling, and Slaven 2015). However, it is also important to fit program difficulty to appropriate levels for human players. On the one hand, human players may lose interest if the game program is too weak; on the other hand, excessive difficulty tends to lead to frustration (Hunicke and Chapman 2004). From our observation, in the context of learning with programs, it is difficult to offer useful feedback to human players if they are constantly on the losing side.
Thus, in order to achieve an overall better game experience, and to improve the learning process for players, it is imperative to balance program difficulty accordingly. The fundamental goal is to offer programs with a wide variety of strength levels.

A simple and straightforward method to offer different program strengths is to reduce the total thinking time, or the total simulation count if MCTS is used. However, with this method, the search tree's relatively smaller size leaves the program vulnerable to tactical traps. For example, the ladder problem in Go is one of the most elementary shapes taught to human players; for programs, however, search is often required to handle ladders properly. It has been shown that when adjusting program strength through such reduction, simulation count and playing strength do not form a linear relationship (Sephton, Cowling, and Slaven 2015). In fact, once the number of simulations falls below a certain threshold, the program's playing strength drops catastrophically.

Another straightforward approach is to offer one program for each strength level, e.g., to train one network for each difficulty. A good example of this type of strength adjustment is Paulsen and Fürnkranz's (2010) work on training chess evaluation functions of different strengths. However, this approach usually requires a large amount of time and effort to tune and test the programs. This is especially difficult for games like Go, where strength levels span a wide range from 30 kyu to professional 9 dan (Hollosi and Pahle 2018), about a 3000 Elo rating difference.

This paper reviews a strength adjustment (SA) approach based on the softmax policy and proposes our modification in the following section (Strength Adjustment). We apply the method to the open-source Go program ELF OpenGo (abbreviated ELF for the rest of this paper) and demonstrate that the method can be easily used to adjust the program strength, covering a range of over 800 Elo. In the Strength Analysis section, this paper presents a hypothesis and performs theoretical analyses to justify the empirical strengths shown in the Strength Adjustment section. Having demonstrated that the program strength can be adjusted with relative ease, we introduce methods to adjust the strength dynamically in the Dynamic Strength Adjustment section. Finally, we provide discussion and summarize our contributions in the Conclusion section.

## Strength Adjustment

In this section, we first review past work on strength adjustment, then present our modification to the method. We apply the modified approach to the Go program ELF and provide empirical data.

### Past Work

For strength adjustment, Sephton et al. (2015) presented a method for MCTS-based game-playing programs using a simple softmax policy as follows. Given a strength index $z$, choose move $i$ with probability

$$p_z(i) = \frac{N(i)^z}{\sum_j N(j)^z},$$

where $N(i)$ is the number of simulations on move $i$ in MCTS. For simplicity of discussion in the rest of this paper, let move 1 be the move with the maximum simulation count, i.e., $N(1) = \max_i N(i)$. Conceptually, $z$ is the inverse of the softmax temperature. When $z$ is higher, the policy tends to choose moves with higher simulation counts, which tend to be higher-quality moves, as is the case with MCTS. When $z$ approaches infinity, the moves with the highest simulation counts are guaranteed to be chosen, and the policy thus exhibits the same behavior as the original MCTS. When $z = 0$, all the moves are chosen with equal probability, i.e., moves are chosen randomly.
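As a concrete illustration of this selection rule, the following is a minimal Python sketch that samples a move from $p_z$ given the simulation counts of the root moves. The function name, the example counts, and the log-space computation are our own choices, not part of the original work; moves with zero simulations are assumed to have been excluded beforehand.

```python
import numpy as np

def softmax_policy(sim_counts, z):
    """Selection probabilities p_z(i) proportional to N(i)^z (Sephton et al. 2015).

    `sim_counts` holds the MCTS simulation counts N(i) of the candidate moves
    (all assumed positive); `z` is the strength index, i.e., the inverse
    softmax temperature.  This is an illustrative sketch, not the authors' code.
    """
    counts = np.asarray(sim_counts, dtype=float)
    logits = z * np.log(counts)        # N(i)^z = exp(z * ln N(i)), done in log space
    logits -= logits.max()             # for numerical stability at large |z|
    weights = np.exp(logits)
    return weights / weights.sum()

# Example with hypothetical simulation counts for four moves.
counts = [260, 40, 15, 3]
for z in (2.0, 0.0, -2.0):
    p = softmax_policy(counts, z)
    move = np.random.choice(len(counts), p=p)   # sample the move to play
    print(f"z = {z:+.1f}  p = {np.round(p, 3)}  chosen move = {move}")
```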
When $z$ approaches negative infinity, the moves with the lowest simulation counts are chosen, i.e., the policy tends to choose the lowest-quality moves. Thus, $z$ can serve as an index of strength. Sephton et al. (2015) showed through experiments that $z$ is correlated with the empirical strength, but their experiment covered only six values of $z$ (from 1 to 6) for the game Lords of War, and the differences in win rates across these values of $z$ ranged from 5% to 24%, equivalent to a range of about 100 Elo.

As noted above, when $z$ is low, the policy tends to choose low-quality moves. However, in MCTS, many moves are not visited during simulation, or in some cases are visited only a few times because of the exploration bias. For this reason, it is not a good idea to allow the policy to choose the lowest-quality moves, which would result in a much weaker program or unpredictable behavior. In order to avoid choosing the very lowest-quality moves, Sephton et al. suggested choosing the first $k$ best moves as candidates, where $k$ is a given fixed value. However, it is still possible to choose a very low-quality move; e.g., in the case that only one move is viable while the others are extremely bad, the policy is still likely to choose bad moves.

### Our Approach

In our approach, we follow the softmax policy to choose moves via the strength index $z$. However, from our observation, it is critical to screen the candidate moves. For this issue, our approach uses a threshold ratio to avoid choosing moves with small simulation counts, in order to ensure the quality of moves. Namely, given a threshold ratio $R$, we only consider the moves $i$ with $N(i) \ge R \cdot N(1)$ as candidates. Assuming that move quality is correlated with simulation count (we discuss this in greater detail in the Strength Analysis section), this approach ensures that the qualities of the chosen moves are higher than those of the screened-out moves which do not reach the threshold. At the very least, the modified policy is less likely to choose extremely bad moves, as mentioned in the previous subsection.

For a high threshold ratio, more low-quality moves are filtered. When $R = 1$, the move with the highest simulation count is always chosen, behaving the same way as the original MCTS. In contrast, for a low threshold ratio, many low-quality moves are not filtered. Thus, it is important to set a reasonable threshold ratio, where the goal is to filter most low-quality moves while simultaneously allowing reasonable moves to be considered. In contrast to the previous work (Sephton, Cowling, and Slaven 2015), our empirical results in the next subsection show that strength can be adjusted across a wide range of over 800 Elo with the threshold ratio $R = 0.1$ and $z$ in the interval $[-2, 2]$. Thus, our approach is well suited to games that are considered to have very high depth (Cauwet et al. 2015).

### Empirical Results

We apply the above approach to the Go program ELF and present the experiment results. All experiments are performed on machines equipped with one GTX 1080 Ti GPU and one Intel Xeon E5-2683 v3 (14 cores in total) at 2.6 GHz with 128 GB of memory, running Linux. All games are played with one second per move, using one GPU and six CPU cores. For each benchmark, 250 games are played against a baseline, ELF with $z = 0$ and $R = 0.1$. Note that we do not use the original ELF, equivalent to $z = \infty$, as the baseline, since it is much too strong for some trials, e.g., those with negative $z$. Table 1 (below) shows the win rates and the relative Elo ratings of the ELF versions with $R = 0.1$ and different $z$ against the baseline.
Note that the Elo ratings shown are relative to the original ELF, which is set to 0 for simplicity of analysis. Since ELF follows the training process of AlphaGo Zero with 20 blocks, its real Elo rating is expected to be between 4000 and 5000 (Silver et al. 2017b).

| $z$ | Win rate (± error) | Elo rating (error bounds) |
|---|---|---|
| $\infty$ | 97.6% (±1.9%) | 0 (-106, +289) |
| 2 | 94.4% (±2.9%) | -153 (-78, +133) |
| 1.5 | 92.4% (±3.4%) | -210 (-70, +107) |
| 1 | 91.2% (±3.6%) | -237 (-66, +98) |
| 0.5 | 71.6% (±5.7%) | -483 (-46, +52) |
| 0 | 50.0% | -644 |
| -0.5 | 35.6% (±6.1%) | -747 (-48, +44) |
| -1 | 21.6% (±5.2%) | -868 (-59, +49) |
| -1.5 | 13.2% (±4.3%) | -971 (-76, +58) |
| -2 | 12.4% (±4.2%) | -983 (-79, +59) |
| $-\infty$ | 7.2% (±3.3%) | -1088 (-111, +71) |

Table 1. The win rates (against ELF with $z = 0$) and Elo ratings (relative to the original ELF) with respect to $z$ when $R = 0.1$.

Figure 1. The correlation between $z$ and the Elo rating when $R = 0.1$.

Figure 1 shows the correlation between $z$ and the Elo ratings. Interestingly, the two are highly correlated, with a low linear regression error of 47.95 Elo when $z$ is between -2 and 2. In addition, the range of strength is very wide, covering 1088 Elo over all $z$ and 830 Elo over the interval $z \in [-2, 2]$.

Furthermore, Figure 2 depicts the correlation between $z$ and the Elo rating for different threshold ratios $R$: 0, 0.02, 0.05, 0.1, 0.25, and 0.5. All games are also played against the same baseline as above. From Figure 2, the Elo rating is highly correlated with $z$ in most cases. A higher value of $R$ usually corresponds to higher Elo ratings.

Figure 2. Elo rating (relative to ELF) for different threshold ratios and strength indices.

Figure 3. The number of candidate moves with respect to $R$.

We observe that high values of $R$ are not appropriate. For example, when $R = 0.5$, the Elo rating shows no significant changes across different values of $z$. An intuitive explanation is that with a high threshold ratio most candidate moves are filtered out, so the value of $z$ does not matter as much. Figure 3 shows that the average number of candidates is only 1.4 for $R = 0.5$, and 1.9 for $R = 0.25$. Another effect is that the adjusted strength range is narrower, e.g., smaller than 500 Elo for these high threshold ratios. On the other hand, for low threshold ratios, the Elo rating drops quickly, and the strengths for different values of $z$ show no difference, e.g., for the lower values of $z$ when $R = 0$ and when $R = 0.02$. Thus, judging from both Figure 2 and Figure 3, the threshold ratios of 0.05 and 0.1 appear suitable for our needs. For simplicity of analysis, 0.1 is used as the threshold ratio unless otherwise stated.

## Strength Analysis

The above empirical results show that the strength is highly correlated with $z$. In fact, for $z$ between -2 and 2 and a threshold ratio of 0.1, $z$ and the strength show a near-linear relationship with a regression error of 47.95 Elo. However, the strength, or Elo rating, should level off as $z$ approaches $\infty$ or $-\infty$. Thus, intuitively, the curve of the Elo rating as a function of $z$ should be shaped like a logistic function. Applying logistic regression (Hosmer Jr., Lemeshow, and Sturdivant 2013), the curve is indeed close to a logistic function, with an error of 26.00 Elo. This section investigates this conjecture from a theoretical perspective. First, we review the generalized Bradley-Terry model. Second, we present a hypothesis on move strength. Then, from theoretical analysis, we show that the derived strengths are close to the empirical strengths. We calculate the regression error between the derived and the empirical strengths to justify the hypothesis.
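As a sanity check on the numbers above, the relative Elo ratings in Table 1 are consistent with converting each win rate $p$ against the $z = 0$ baseline into an Elo difference $400 \log_{10}\frac{p}{1-p}$ and shifting by the baseline's rating of -644, and the logistic shape of the Elo-versus-$z$ curve can be examined with an ordinary curve fit. The sketch below does both; the logistic parameterization, the SciPy-based fitting, and the initial guesses are our own choices, and the fitted constants are not the ones reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Win rates against the z = 0 baseline for the finite z values in Table 1 (R = 0.1).
z_vals    = np.array([ 2.0,  1.5,  1.0,  0.5,  0.0, -0.5, -1.0, -1.5, -2.0])
win_rates = np.array([.944, .924, .912, .716, .500, .356, .216, .132, .124])

# Bradley-Terry / Elo relation: an Elo difference d yields win probability
# 1 / (1 + 10^(-d/400)), hence d = 400 * log10(p / (1 - p)).
elo_vs_baseline = 400.0 * np.log10(win_rates / (1.0 - win_rates))
elo_rel = elo_vs_baseline - 644.0      # shift: the baseline sits at -644 vs. original ELF
print(np.round(elo_rel))               # approximately reproduces the Elo column of Table 1

# Fit a logistic curve Elo(z) = lo + (hi - lo) / (1 + exp(-k * (z - z0))),
# whose asymptotes lo and hi correspond to z -> -inf and z -> +inf.
def logistic(z, lo, hi, k, z0):
    return lo + (hi - lo) / (1.0 + np.exp(-k * (z - z0)))

params, _ = curve_fit(logistic, z_vals, elo_rel, p0=[-1100.0, 0.0, 1.0, 0.0])
rms = np.sqrt(np.mean((logistic(z_vals, *params) - elo_rel) ** 2))
print("fitted (lo, hi, k, z0):", np.round(params, 2), " RMS error (Elo):", round(rms, 1))
```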
### Generalized Bradley-Terry Model

The Bradley-Terry model has been the foundation of various ranking systems, including the Elo rating system. The model is used to estimate the strengths of players and to predict the win rates among these players. Note that the moves mentioned in the Past Work subsection can be viewed as players here. Namely, each player $i$ is associated with a positive value $\gamma_i$ representing the strength of $i$, and the probability that $i$ wins over $j$ is $\gamma_i / (\gamma_i + \gamma_j)$. Obviously, the higher $\gamma_i$ is, the higher the winning rate, implying a stronger player. The Elo rating of individual $i$ is $400 \log_{10} \gamma_i$ (Coulom 2008). For simplicity of discussion in this paper, we also define the rating $r_i = \ln \gamma_i$, whose corresponding Elo rating is $(400 / \ln 10)\, r_i$.

The Bradley-Terry model has also been generalized to handle competitions involving more than two players (Coulom 2007; Hunter 2004). Namely, the probability that $i$ wins among $n$ players $1, \ldots, n$ is formulated as $\gamma_i / \sum_{j=1}^{n} \gamma_j$. Another generalization is to allow competitions among teams instead of players. The strength and the corresponding rating of a team of $n$ players are estimated as

$$\gamma = \prod_{i=1}^{n} \gamma_i, \qquad r = \sum_{i=1}^{n} r_i. \tag{1}$$

In this paper, we also define the average strength and rating of a team of $n$ players to be

$$\bar{\gamma} = \Big(\prod_{i=1}^{n} \gamma_i\Big)^{1/n}, \qquad \bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i. \tag{2}$$

This is useful when $n$ is not fixed. In addition, consider a team that can choose one and only one player to participate, choosing player $i$ with probability $p_i$, where $\sum_i p_i = 1$. Thus, the strength and rating of the team are, respectively,

$$\gamma = \prod_{i} \gamma_i^{\,p_i}, \qquad r = \sum_{i} p_i \, r_i, \tag{3}$$

for the reason illustrated below. For example, let $p_i = m_i / m$, where the $m_i$ are nonnegative integers summing to $m$. We can consider a team composed of $m$ players, among which the number of players identical to player $i$ is $m_i$. Then the average strength and rating of this team are the same as $\gamma$ and $r$ in formula (3), respectively, since $(\prod_i \gamma_i^{m_i})^{1/m} = \prod_i \gamma_i^{p_i}$.

As mentioned above, moves with higher simulation counts in MCTS normally tend to have higher quality. Following this notion, we present a hypothesis for further theoretical analysis. Assume that, given a position, the strength of move $i$ is proportional to $N(i)^{z_s}$, where $z_s$ denotes a conjectured strength index for the moves selected by MCTS in the previous sections. Namely, let

$$\gamma_i = c \cdot N(i)^{z_s}. \tag{4}$$

Here, $c$ is a constant coefficient with respect to the same game position; i.e., different positions may have different relative strengths, and therefore different values of $c$. The rating of move $i$ is $r_i = \ln c + z_s \ln N(i)$. For simplicity of analysis, we use the rating $r$ in the following analysis without loss of generality. If the Elo rating is preferred, it can be obtained by the simple conversion described above.

Let $\gamma(z)$ and $r(z)$ denote the overall strength and rating under the above strength adjustment method, which chooses among all candidate moves using the softmax policy $p_z$. From the above Bradley-Terry model for team strength, we can derive that

$$r(z) = \sum_i p_z(i)\, r_i = \ln c + z_s \sum_i p_z(i) \ln N(i). \tag{5}$$

In the above formula, the first term in (5) is fixed for a given position, and it can therefore be removed to obtain relative ratings, say, relative to the rating $r(\infty)$ at $z = \infty$ (which always chooses the move with the maximum simulation count $N(1)$), as follows:

$$r(z) - r(\infty) = z_s \sum_i p_z(i) \ln \frac{N(i)}{N(1)} = z_s \sum_i p_z(i) \ln \rho_i, \tag{6}$$

where $\rho_i$ is the ratio $N(i)/N(1)$. Since all moves with a ratio less than $R$ are filtered out, $\rho_i \ge R$ for all candidates. In addition, since $N(1)$ is the maximum among all $N(i)$, the terms $\ln \rho_i$ are all non-positive. Thus, we obtain

$$r(z) - r(\infty) \ge z_s \ln R. \tag{7}$$

An important implication of formula (7) is that the relative ratings of the chosen moves are at worst $z_s \ln R$. Assume $R = 0.1$. The relative ratings of all chosen moves are then at worst $z_s \ln 0.1 \approx -2.3\, z_s$, i.e., no worse than move 1 by about $2.3\, z_s$. Since $z_s$ is a constant under this hypothesis, this implies that the strength of any chosen move is at worst a fixed amount below that of the best move. This ensures the quality of all chosen moves.

Now, let us consider following the above SA method to play a game $g$ containing a sequence of $T$ moves, or positions to move.
Let $\gamma_t(z)$ and $r_t(z)$ denote the strength and rating of the move made at the $t$-th position (i.e., on the $t$-th turn), which can be formulated as follows:

$$r_t(z) = \ln c_t + z_s \sum_i p_{z,t}(i) \ln N_t(i), \tag{8}$$

where $\gamma_t(z)$, $r_t(z)$, $p_{z,t}(i)$, and $N_t(i)$ are respectively the strength, rating, policy, and simulation counts of the moves at the $t$-th position in the game, and $c_t$ is the coefficient with respect to that position. Furthermore, let $\bar{\gamma}_g(z)$ and $\bar{r}_g(z)$ denote the averaged strength and rating, as follows:

$$\bar{\gamma}_g(z) = \Big(\prod_{t=1}^{T} \gamma_t(z)\Big)^{1/T}, \qquad \bar{r}_g(z) = \frac{1}{T} \sum_{t=1}^{T} r_t(z). \tag{9}$$

Note that we evaluate the averaged strength and rating as in formula (2), instead of the aggregated values in (1), simply because the number of moves in a game is not fixed. In the above formula, the first term, $\frac{1}{T}\sum_t \ln c_t$, does not depend on $z$, and it can therefore be omitted when calculating ratings relative to the one with $z = \infty$, similarly to before:

$$\bar{r}_g(z) - \bar{r}_g(\infty) = \frac{z_s}{T} \sum_{t=1}^{T} \sum_i p_{z,t}(i) \ln \frac{N_t(i)}{N_t(1)}. \tag{10}$$

Moreover, let the relative rating be normalized to be independent of the value $z_s$, as follows:

$$\hat{r}_g(z) = \frac{\bar{r}_g(z) - \bar{r}_g(\infty)}{z_s} = \frac{1}{T} \sum_{t=1}^{T} \sum_i p_{z,t}(i) \ln \frac{N_t(i)}{N_t(1)}. \tag{11}$$

For stochastic analysis, we extend the above by collecting several sets of games, where each set $G_R$ is collected from the games played under a designated threshold ratio $R$ in the above empirical experiments. We exclude extreme cases to minimize the effect of noise in our analysis; for example, the cases of $z = \infty$ and $z = -\infty$ are not included. For simplicity of analysis, let us illustrate the case of $G_{0.1}$, the set of games with threshold ratio 0.1, which contains about 2000 games. The expected relative rating under this set is

$$\hat{r}_E(z) = \mathbb{E}_{g \in G_{0.1}}\!\left[\hat{r}_g(z)\right]. \tag{12}$$

Figure 4 (below) depicts the solid curve of $\hat{r}_E(z)$ calculated from the set $G_{0.1}$ according to formula (12). The left y-axis indicates the value of $\hat{r}_E(z)$. The curve resembles a logistic function. Now, let $r_E(z)$ denote the expected rating under the set of games $G_{0.1}$. Thus, we have

$$r_E(z) - r_E(\infty) = z_s \, \hat{r}_E(z). \tag{13}$$

Figure 4. The curve of $\hat{r}_E(z)$ and the empirical data.

Then, we can derive that

$$z_s = \frac{r_E(\infty) - r_E(-\infty)}{\hat{r}_E(\infty) - \hat{r}_E(-\infty)}. \tag{14}$$

Since the values $r_E(\infty)$ and $r_E(-\infty)$ are supposed to approximate the strengths in the empirical experiments, they can be replaced by the empirical strengths at $z = \infty$ and $z = -\infty$, whose relative Elo ratings are 0 and -1088 as shown in Table 1 (converted to ratings by multiplying by $\ln 10 / 400$). Thus, the value of $z_s$ is derived to be 6.243 according to the above formula. The right y-axis in Figure 4 follows the y-axis in Figure 1. The regression error relative to the empirical strengths for $z$ between -2 and 2 is about 40.45 Elo, and the regression error relative to a logistic regression curve is about 10.51 Elo. These low errors justify the hypothesis.

| Data set | $G_{0.02}$ | $G_{0.05}$ | $G_{0.1}$ | $G_{0.25}$ | $G_{0.5}$ |
|---|---|---|---|---|---|
| $z_s$ | 4.385 | 5.369 | 6.243 | 6.244 | 3.292 |

Table 2. The conjectured strength indices $z_s$ estimated from different data sets.

In our experiments, we also derived the value of $z_s$ for the other sets of games, as shown in Table 2. From the table, $z_s$ for $G_{0.25}$ is almost the same as that for $G_{0.1}$, while those for $G_{0.02}$, $G_{0.05}$, and $G_{0.5}$ are lower. For $G_{0.02}$ and $G_{0.05}$, our conjecture is that the noise incurred by a low threshold ratio is high, as illustrated in the following. In the case of $G_{0.02}$, since the average simulation count of the best move is about 259.4 under the one-second time limit, the candidate set is highly likely to include moves with very low simulation counts (the threshold is only about 5 simulations). Since many of these simulations may be generated simply because of the exploration bias, they may introduce noise and therefore affect the verification of our hypothesis. As for $G_{0.5}$, we observe from Figure 3 that the average number of candidates is 1.4. Since this number is relatively low, in many cases the policy chooses from only a single candidate move; the resulting distribution is therefore insufficient to justify our hypothesis. As an example, the most extreme case is a threshold ratio of 1, where only the moves with the highest simulation counts are chosen, as in the original MCTS. The value of $z$ in this case does not affect the policy at all, since there is only one choice.
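To make the bound in formula (7) concrete, the sketch below evaluates the threshold-filtered softmax policy on the simulation counts of a single position and checks that the relative rating of formula (6) never falls below $z_s \ln R$. The simulation counts are hypothetical, $z_s = 6.243$ is the value estimated for $G_{0.1}$ in Table 2, and large finite values of $z$ stand in for $z = \pm\infty$; the function names are our own.

```python
import numpy as np

def filtered_softmax(counts, z, R):
    """Softmax policy p_z restricted to candidate moves with N(i) >= R * N(1)."""
    counts = np.asarray(counts, dtype=float)
    candidates = counts >= R * counts.max()          # threshold-ratio filter
    logits = np.where(candidates, z * np.log(counts), -np.inf)
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def relative_rating(counts, z, R, z_s):
    """r(z) - r(inf) = z_s * sum_i p_z(i) * ln(N(i)/N(1)), as in formula (6)."""
    counts = np.asarray(counts, dtype=float)
    p = filtered_softmax(counts, z, R)
    return z_s * np.sum(p * np.log(counts / counts.max()))

# Hypothetical simulation counts for one position; R = 0.1 keeps the first four moves.
counts, R, z_s = [260, 120, 40, 26, 5, 1], 0.1, 6.243
bound = z_s * np.log(R)                              # worst case from formula (7)
for z in (8.0, 2.0, 0.0, -2.0, -8.0):                # |z| = 8 approximates z -> +/- infinity
    r_rel = relative_rating(counts, z, R, z_s)
    assert r_rel >= bound - 1e-9                     # the chosen moves never fall below the bound
    print(f"z = {z:+5.1f}: relative rating = {r_rel:8.3f}   (bound = {bound:.3f})")
```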
## Dynamic Strength Adjustment

As stated in the previous sections, this paper presents a flexible strength adjustment method that works simply by altering the value of $z$ with an appropriate $R$, say 0.1. Moreover, the strength ratings are approximately linear with respect to $z$ in the interval $[-2, 2]$. This allows us to fit the program's strength to its opponents dynamically, provided the opponents' strengths are within $[-983, -153]$ in relative Elo, corresponding to the range of $z$ in $[-2, 2]$. This section introduces two types of dynamic strength adjustment: inter-game and intra-game strength adjustment. For the former, strengths are adjusted based on previous game results, while for the latter, strengths are adjusted within each game. We present two methods of dynamic strength adjustment (DSA) here only to showcase how opponent strengths can be predicted and adjusted to with relative ease; the presented methods are by no means a comprehensive review of all available methods.

### Inter-game Strength Adjustment

Inter-game strength adjustment is relatively easy. Namely, the strength index for a game is adjusted based on the previous game results, and the index remains unchanged within the game. Here, a simple adjustment method is presented and demonstrated to predict the opponent's strength. The prediction can then be used to set $z$ accordingly. The strength index $z$ is decreased for every win and increased for every loss, both by a small amount $\delta$. The initial value of $z$ is set to 0. The value $\delta$ is initialized to $\delta_0$ and decreased by a discount factor for each game, capped at a lower bound. In our experiments for this method, $\delta_0$ is approximately equivalent to 100 Elo based on the linear regression in Figure 1, and is then decreased by a constant factor per game down to a lower bound equivalent to 8 Elo.

In the experiment, 100 games are played against each of five opponents whose strength indices are $z \in \{2, 1, 0, -1, -2\}$, for a total of 500 games. The experiment is repeated five times, and the following experiment results are based on the average values over the five runs.

Figure 5. Strength index estimation for inter-game SA.

| Opponent | $z = 2$ | $z = 1$ | $z = 0$ | $z = -1$ | $z = -2$ |
|---|---|---|---|---|---|
| WR w/o DSA | 5.6% (±2.9%) | 8.8% (±3.6%) | 50.0% | 78.4% (±5.2%) | 87.6% (±4.2%) |
| WR inter-game | 43.4% (±4.4%) | 46.6% (±4.5%) | 52.0% (±4.5%) | 50.0% (±4.5%) | 54.8% (±4.5%) |
| Avg. $z$ | 1.93 | 0.88 | -0.04 | -1.06 | -1.73 |

Table 3. Win rate (WR) and average dynamic strength index (Avg. $z$) against different opponents using inter-game SA.

In Figure 5, each of the five lines indicates the predicted $z$ for one opponent. The result shows that our method can approximately predict opponents' strengths and clearly distinguish the five opponents. Table 3 also shows that the averaged win rate against each opponent is within 43-55% and that the averaged predicted $z$ is very close to the opponent's.

### Intra-game Strength Adjustment

Intra-game strength adjustment is relatively challenging, given that the algorithm has only one game in which to predict the opponent's approximate level of play. Players often play inconsistently within the same game, mixing good and bad moves. On the one hand, adjusting $z$ by large amounts leads to high variance in program strength. On the other hand, if the strength is adjusted by a small amount, the effect may not be sufficiently obvious. Our method is as follows. In principle, we still attempt to maintain an overall win rate of around 50%. For each move, we first estimate the current win rate $v$ by using the MCTS win rate of the move with the most simulation counts. The index $z$ is decreased when $v > 50\%$ and increased when $v < 50\%$, both by $\delta$.
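To illustrate the per-move update just described, the following minimal sketch adjusts $z$ from the estimated win rate $v$ and then samples a move with the threshold-filtered softmax policy; the stabilization and cool-down refinements of $\delta$ described next are omitted. The function `play_one_move` and its parameters are our own naming, and `filtered_softmax()` refers to the sketch in the Strength Analysis section.

```python
import numpy as np

def play_one_move(counts, v, z, delta, R=0.1):
    """One intra-game SA step: update z from the search win rate, then pick a move.

    `counts` are the MCTS simulation counts of the root moves and `v` is the MCTS
    win rate of the most-visited move; filtered_softmax() is the threshold-filtered
    policy sketched earlier.  Illustrative only, not the authors' implementation.
    """
    if v > 0.5:
        z -= delta                       # program appears to be winning: play weaker
    elif v < 0.5:
        z += delta                       # program appears to be losing: play stronger
    p = filtered_softmax(counts, z, R)   # softmax over candidates with N(i) >= R * N(1)
    move = int(np.random.choice(len(p), p=p))
    return move, z
```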
The program chooses moves based on the softmax policy proposed earlier. However, for stability, $\delta$ is set to be relatively small when $v$ is within the range $(50\% - \sigma,\ 50\% + \sigma)$, where $\sigma$ is a user-defined value, say 10%. In addition, $\delta$ decreases linearly from an initial value $\delta_0$, say 0.1, to 0 after a number of moves, say 150 moves. The purpose is to cool down the amplitude of changes as the game progresses, since the program should have an idea of its opponent's strength by that stage of the game. This cool-down mechanism is important: without it, the program will make unreasonable concessions when it is in the lead, or ramp up in strength indefinitely when it is behind.

In the experiment, the above method is used to play against the five opponents with strength indices $z \in \{2, 1, 0, -1, -2\}$. For each opponent, we consider two cases of the initial value $\delta_0$, 0.2 and 0.1, where for each case 100 games are played. Table 4 presents the experiment results. The results show that the average $z$ values over 100 games are close to the opponents' strength indices, especially when $\delta_0 = 0.2$. Note that the predicted $z$ for each game is the value at the end of the game. The standard deviation is high, as expected. The results in Table 4 also show that while all the win rates except that against the $z = 0$ opponent are not around 50%, the overall win rates are closer to 50% when compared to the baseline win rates without DSA (shown in the second row). This shows that intra-game DSA can predict opponents' strengths and even out the games. The reason the win rates are not balanced around 50%, despite the predicted $z$ being more or less accurate, is that the early moves in a game influence the outcome significantly, but the program has not yet observed its opponent's strength sufficiently at that point.

| Opponent | $z = 2$ | $z = 1$ | $z = 0$ | $z = -1$ | $z = -2$ |
|---|---|---|---|---|---|
| WR w/o DSA | 5.6% | 8.8% | 50.0% | 78.4% | 87.6% |
| WR, $\delta_0 = 0.2$ | 29.0% | 38.0% | 43.0% | 64.0% | 71.0% |
| Avg. $z$, $\delta_0 = 0.2$ | 1.85 | 0.92 | -0.10 | -1.39 | -1.57 |
| Std. $z$, $\delta_0 = 0.2$ | 1.82 | 1.81 | 1.41 | 1.73 | 1.56 |
| WR, $\delta_0 = 0.1$ | 15.0% | 32.0% | 49.0% | 73.0% | 72.0% |
| Avg. $z$, $\delta_0 = 0.1$ | 1.02 | 0.75 | -0.21 | -0.66 | -0.96 |
| Std. $z$, $\delta_0 = 0.1$ | 0.94 | 0.90 | 0.81 | 0.84 | 0.85 |

Table 4. Win rate (WR), average $z$ (Avg. $z$), and standard deviation of $z$ (Std. $z$) against different opponents using intra-game SA.

## Conclusion

In this paper, our major contributions are:

1. We propose an approach to strength adjustment for MCTS-based game-playing programs. In this approach, we follow a softmax policy (Sephton, Cowling, and Slaven 2015) with a strength index $z$ to choose moves. Most importantly, this approach uses a threshold ratio $R$ to filter out low-quality moves, namely those whose simulation counts in MCTS are lower than $R \cdot N(1)$.

2. We apply the approach to the Go program ELF and demonstrate that the strength can be adjusted easily. The empirical results show that the strength covers a range of about 830 Elo with a low linear regression error of 47.95 Elo, with respect to $z$ in the range $[-2, 2]$. To our knowledge, this result is state-of-the-art in terms of the range of strengths in Elo rating while maintaining a controllable relationship between the strength and a strength index. Another advantage is that the program is still able to play diverse moves despite its adjusted, weaker strength.

3. We present an in-depth strength analysis for the above empirical results. First, we make the hypothesis that, given a position, the strength of move $i$ is proportional to $N(i)^{z_s}$. From this hypothesis, the strength ratings of chosen moves are shown to be at worst a fixed value, $z_s \ln R$, relative to the best move. This justifies that the move quality is under control, avoiding exceptionally bad moves.
In addition, the analysis also shows that the derived strengths are close to the empirical strengths, with a regression error of 40.45 Elo, and to a logistic function, with a regression error of 10.51 Elo.

4. With the ease of strength adjustment using $z$, we introduce two methods to adjust strength dynamically: inter-game and intra-game strength adjustment. The experiment results show that these methods are able to predict opponents' expected strengths, though the variances can be high.

In practice, we have applied our method to ELF OpenGo and three versions of the Go program CGI (Wu et al. 2018), which together cover a strength range of over 3000 Elo, from beginners to ELF, which is much stronger than human champions. Our estimation is that ELF roughly covers the range of Elo ratings [3300, 4300], and the other three versions cover [2600, 3700], [1800, 2800], and [900, 2000]. The four versions have been tested on online Go websites against human players. Namely, the last two versions were tested against amateur players on the online Go site Tygem (Tongyang Online 2018), while the first two were used to play against professionals in the Hai Fong Go Association (Hai Fong Go Association 2018).

As mentioned in the introduction, the AlphaZero algorithm has also been successfully applied to other games such as chess and shogi, reaching strength levels much higher than human champions and other top programs (Silver et al. 2017a). With our approach, we expect to be able to provide a wide range of strength levels for each of these games. We expect our approach to impact not only the Go community, but also the games community at large.

## References

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1-43.

Cauwet, M.-L.; Teytaud, O.; Cazenave, T.; Saffidine, A.; Liang, H.-M.; Yen, S.-J.; Lin, H.-H.; and Wu, I-C. 2015. Depth, Balancing, and Limits of the Elo Model. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), 376-382. IEEE.

Coulom, R. 2006. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In International Conference on Computers and Games, 72-83. Springer.

Coulom, R. 2007. Computing Elo Ratings of Move Patterns in the Game of Go. ICGA Journal 30(4):198-208.

Coulom, R. 2008. Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength. In International Conference on Computers and Games, 113-124. Springer.

Demediuk, S.; Tamassia, M.; Raffe, W. L.; Zambetta, F.; Li, X.; and Mueller, F. 2017. Monte Carlo Tree Search Based Algorithms for Dynamic Difficulty Adjustment. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), 53-59. IEEE.

Hai Fong Go Association. 2018. The Homepage of Hai Fong Go Association. Retrieved from http://www.haifong.org/.

Hollosi, A., and Pahle, M. 2018. Sensei's Library. Retrieved from https://senseis.xmp.net/?RankWorldwideComparison.

Hosmer Jr., D. W.; Lemeshow, S.; and Sturdivant, R. X. 2013. Applied Logistic Regression, volume 398. John Wiley & Sons.

Hunicke, R., and Chapman, V. 2004. AI for Dynamic Difficulty Adjustment in Games. In Proceedings of the Challenges in Game AI Workshop, Nineteenth National Conference on Artificial Intelligence, 91-96. San Jose, California: AAAI Press.

Hunter, D. R. 2004. MM Algorithms for Generalized Bradley-Terry Models.
The Annals of Statistics 32(1):384-406.

Kocsis, L., and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning, 282-293. Springer.

Pascutto, G.-C. 2018. Leela Zero GitHub Repository. Retrieved from https://github.com/gcp/leela-zero.

Paulsen, P., and Fürnkranz, J. 2010. A Moderately Successful Attempt to Train Chess Evaluation Functions of Different Strengths. In Proceedings of the ICML-10 Workshop on Machine Learning and Games, Haifa, Israel.

Sephton, N.; Cowling, P. I.; and Slaven, N. H. 2015. An Experimental Study of Action Selection Mechanisms to Create an Entertaining Opponent. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), 122-129. IEEE.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587):484-489.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017a. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the Game of Go Without Human Knowledge. Nature 550(7676):354.

Tencent AI Lab. 2018. Fine Art (Software), Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Fine_Art_(software).

Tian, Y.; Ma, J.; Gong, Q.; Sengupta, S.; Chen, Z.; and Zitnick, C. L. 2018. ELF OpenGo GitHub Repository. Retrieved from https://github.com/pytorch/ELF.

Tongyang Online. 2018. The Homepage of Tygem. Retrieved from http://www.tygemgo.com/.

Wu, T.-R.; Wu, I-C.; Chen, G.-W.; Wei, T.-H.; Wu, H.-C.; Lai, T.-Y.; and Lan, L.-C. 2018. Multi-Labelled Value Networks for Computer Go. IEEE Transactions on Games 10(4):378-389.