# materials_discovery_using_max_karmed_bandit__19c3b768.pdf Journal of Machine Learning Research 25 (2024) 1-40 Submitted 2/22; Revised 10/23; Published 3/24 Materials Discovery using Max K-Armed Bandit Nobuaki Kikkawa kikkawa@mosk.tytlabs.co.jp Toyota Central R&D Labs., Inc. 41-1, Yokomichi, Nagakute, Aichi 480-1192, Japan Hiroshi Ohno oono-h@mosk.tytlabs.co.jp Toyota Central R&D Labs., Inc. 41-1, Yokomichi, Nagakute, Aichi 480-1192, Japan Editor: Aurelien Garivier Search algorithms for bandit problems are applicable in materials discovery. However, objectives of the conventional bandit problem are different from those of materials discovery. The conventional bandit problem aims to maximize the total rewards, whereas materials discovery aims to achieve breakthroughs in material properties. The max K-armed bandit (MKB) problem, which aims to acquire the single best reward, matches with the discovery tasks better than the conventional bandit. However, typical MKB algorithms are not directly applicable to materials discovery due to some difficulties. The typical algorithms have many hyperparameters and some difficulty in the directly implementation for the materials discovery. Thus, we propose a new MKB algorithm using an upper confidence bound of expected improvement of the best reward. This approach is guaranteed to be asymptotic to greedy oracles, which does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, unless the optimal arm coincides in the MKB and conventional bandit settings. Keywords: Max K-armed bandit problem, Confidence bounds, Monte Carlo tree search, Molecular design, Greedy oracle 1. Introduction Materials discovery integrated with machine learning is a field with immense growth potential. Material property predictions using regression and clustering methods (Liu et al., 2017; Meredig et al., 2018; Butler et al., 2018; Ramprasad et al., 2017; Pilania et al., 2013) are recognized as a beneficial approach in the development workplace. Materials discovery using deep learning (Agrawal and Choudhary, 2019; Jha et al., 2018), transfer learning (Jha et al., 2019; Yamada et al., 2019), and generative models (Sanchez-Lengeling et al., 2017; Sanchez-Lengeling and Aspuru-Guzik, 2018) is actively under investigation in advanced researches. Autonomous searches based on Bayesian optimization (Ueno et al., 2016; Kusne et al., 2020), Monte Carlo tree search (MCTS) (M. Dieb et al., 2017; Yang et al., 2017; Ju et al., 2018; Segler et al., 2018; Kiyohara and Mizoguchi, 2018; M. Dieb et al., 2018; Ka- c 2024 Nobuaki Kikkawa and Hiroshi Ohno. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0186.html. Kikkawa and Ohno jita et al., 2020; Kikkawa et al., 2020; Patra et al., 2020), and reinforcement learning (RL) (Sanchez-Lengeling et al., 2017; Popova et al., 2018; Olivecrona et al., 2017) have also been investigated to accelerate materials discovery. Active learning approaches (Kusne et al., 2020; Del Rosario et al., 2020) and the effective use of failed experiments (Raccuglia et al., 2016) are also important for overcoming the limitations of data generation in materials science. Finding novel materials with record-breaking properties of interest is one of the goals of materials discovery. However, the guiding principles of MCTS and RL seem to differ from the goal of materials discovery because these approaches mainly focus on maximizing the total reward (Auer et al., 2002; Kocsis and Szepesv ari, 2006; Browne et al., 2012; Sutton and Barto, 2018) rather than discovering a record-breaking material property. Therefore, these approaches tend to avoid selections with high failure rates even though those could lead to a few great breakthroughs. Because failure is often a prerequisite for success, these approaches are not always optimal for achieving a significant discovery. The max K-armed bandit (MKB) problem (Cicirello and Smith, 2005), also called the extreme bandit (Carpentier and Valko, 2014) or the max bandit (David and Shimkin, 2016), is a promising problem setting for materials discovery. In the MKB problem, a player aims to maximize the single best reward from a slot with K arms instead of the total reward in the conventional bandit problem (Lai and Robbins, 1985). Owing to these modifications, the algorithms for the MKB problem can explore the adventurous arm rather than the stable arm. Several algorithms have been proposed for the MKB problem (Carpentier and Valko, 2014; David and Shimkin, 2016; Streeter and Smith, 2006b; Achab et al., 2017; Streeter and Smith, 2006a; Baudry et al., 2022; Bhatt et al., 2022). However, their practical applications in materials discovery are limited. Some of them consider the time horizon T as a hyperparameter, even though their applications for MCTS are associated with many drawbacks. Other methods involve many hyperparameters depending on unknown reward distributions. This requires a time-consuming parameter tuning, which is extremely costly for materials discovery. To overcome these difficulties, we propose an MKB algorithm with one hyperparameter that employs an upper confidence bound (UCB) of the expected improvement (EI) of the maximum reward as the selection index of the arm. We apply this algorithm to synthetic problems and demonstrations of materials discovery using MCTS. The primary contributions of this study are as follows: 1. We propose MKB algorithms by introducing the UCB of EI of the best reward1, which has only one hyperparameter to control the balance between exploration and exploitation. 2. We demonstrate that the MCTS approach based on the MKB algorithm is effective for materials discovery than other MCTS algorithms based on the conventional bandit algorithm. 3. We prove that asymptotically optimal MKB algorithms can be generated using the UCB of EI of the best reward. 1. An algorithm is based on a tentative value of the UCB of the EI. Materials Discovery using Max K-Armed Bandit 4. We propose a time-independent oracle named Kikkawa s greedy oracle. This oracle makes it possible to discuss the MKB problem in the almost same manner as the conventional bandit problem. 5. To the best of our knowledge, this is the first study to actually apply the MKB algorithm to materials discovery. The remainder of this paper is structured as follows. Section 2 describes the related work of the MKB problem, materials discovery using MCTS, and related algorithms. After that, we define some terms and representations in Section 3. In Section 4, we describe the idea to create an MKB algorithm. In the following Section 5, the derivation of the proposed algorithm is presented. Section 6 demonstrates the experiments conducted for comparing the proposed algorithm with other bandit algorithms. In Section 7, we discuss the subtleties of the MKB problem and theoretical aspect of the role of the UCB of EI for the MKB problem. Finally, Section 8 presents our conclusions and discussions on the future outlook of the proposed algorithm. 2. Related Work In this section, we first describe the related work of the MKB problem, materials discovery using MCTS, and related algorithms. 2.1 Max K-armed bandit problem The MKB problem is expressed as a policy-decision problem that maximizes the single best reward maxt [T] rk(t)(t), where [T] := {1, 2, . . . , T}, rk(t) is the reward from the k-th arm at time t with a time-independent distribution fk(r), and k(t) is the selected arm index at time t determined based on the policy to be tuned (Cicirello and Smith, 2005). This is a simple variant of the conventional bandit problem, which aims to maximize the total reward P t [T] rk(t)(t) (Lai and Robbins, 1985). The MKB problem was first proposed by Cicirello and Smith (2005), who derived the optimal allocation order for this problem with the Gumbel-type reward distribution. The following year, Streeter and Smith (2006a) proposed an asymptotically optimal algorithm using the explore-then-commit (ETC) approach. They also proposed a UCB algorithm for the MKB problem, called Threshold Ascent, in the same year (Streeter and Smith, 2006b). This algorithm used a UCB of E[1[rk > rs-th]] as the selection index, where rs-th was the s-th maximum of observed rewards. The next stream of the algorithm development for the MKB problem was undertaken by Carpentier and Valko (2014). They estimated a finite-time upper bound of E[rmax] assuming the reward distribution as the second-order Pareto distribution and proposed Extreme Hunter algorithm based on it. Achab et al. (2017) proposed the ETC version of Extreme Hunter, and they also proposed a simple algorithm, denoted Robust UCBMax in this paper. The Robust UCBMax used a robust UCB (Bubeck et al., 2013) of E[rk1[rk > u]], where u was a threshold parameter. A probably approximately correct (PAC) approach for the MKB problem called Max-CB was discussed theoretically by David and Shimkin (2016). Kikkawa and Ohno Recently, distribution-free (DF) approaches were used to create new MKB algorithms. Bhatt et al. (2022) proposed Max-Median algorithm. In their algorithm, the order statistic, which estimates the median of maxima in conceivable subsets of observed rewards, was adopted as the selection index. Baudry et al. (2022) also employed the quantile of maxima (Qo Max) as an extension of the median of maxima, and developed ETC algorithm and subsampling dueling algorithm (SDA) using Qo Max, which was directly calculated from the subsets of observed rewards. Here, we summarize the features of the MKB algorithms in Table 1. Additionally, we show the main target of these algorithms in this table. The MKB algorithms have not been applied for materials discovery in the previous studies, although it was discussed by David and Shimkin (2016). Table 1: MKB algorithms. The term Anytime refers to algorithms that do not employ T as a hyperparameter. Algorithm Approach Anytime # of parameters Target Max Search (this work) UCB yes 1 synthetic problem, materials discovery asymptotically algorithm ETC no 2 - Threshold Ascent UCB no 2 scheduling Extreme Hunter finite-time upper bound no 5 synthetic problems, traffic analysis Extreme ETC ETC no 5 synthetic problem Robust UCBMax UCB yes 3 synthetic problem Max-CB PAC yes 2 - Max Median DF yes 2 synthetic problem Qo Max-ETC DF-ETC no 3 synthetic problem Qo Max-SDA DF-SDA yes 3 synthetic problem Some researchers also contributed theoretically. Cicirello and Smith (2005) stated that in the case of the MKB problem with the Gumbel-type reward distributions, the optimal algorithm should sample the observed best arm at a rate increasing double exponentially relative to the other arms. Carpentier and Valko (2014) introduced an expected regret, called extreme regret, for the MKB problem. They also proposed an algorithm where the regret had o(E maxt [T] rk(t) ). Although these theoretical developments were traced to the analogies of the conventional bandit problem, Nishihara et al. (2016) proved that no policy is guaranteed to asymptotically approach the oracle used by Carpentier and Valko (2014) in some settings. Nishihara et al. (2016) also pointed out some other subtleties on the MKB problem and proposed an oracle using EI although they does not analyze it much. 2.2 Materials discovery using Monte Carlo tree search There are several studies relating to materials discovery using MCTS (M. Dieb et al., 2017; Yang et al., 2017; Ju et al., 2018; Segler et al., 2018; Kiyohara and Mizoguchi, 2018; M. Dieb et al., 2018; Kajita et al., 2020; Kikkawa et al., 2020; Patra et al., 2020). M. Dieb et al. Materials Discovery using Max K-Armed Bandit (2017), in the pioneering work, compiled Si-Ge interfacial conformations into binaries and optimized them to maximize the thermal conductance using MCTS. Ju et al. (2018) also optimized the interface roughness by ternary embedding. The optimizations of the grain boundary (Kiyohara and Mizoguchi, 2018), doping (M. Dieb et al., 2018), and chemical syntheses (Segler et al., 2018; Patra et al., 2020) have also been investigated. Yang et al. (2017) applied MCTS to the optimization of chemical structures. They introduced a search tree in which nodes correspond to the simplified molecular-input lineentry system (SMILES) characters (Weininger, 1988), e.g., C , O , ( , and ) . Because the SMILES grammar can express most of molecules, the chemical-structure optimization is regarded as a string optimization in this approach. They showed that the MCTS approach outperformed other approaches in the SMILES search. The MCTS approach using SMILES was employed in subsequent studies. Kajita et al. (2020) introduced fragments of SMILES, such as CC and CO to restrict the search space of chemical structures. In their study, they attempted 5,500 evaluations 10 times using molecular dynamics (MD) simulations in a search run. They also confirmed the properties of the molecules with high rewards by synthetic experiments. Kikkawa et al. (2020) improved the flexibility of the restriction by introducing rule-based grammar into the search tree using a maze game. They also evaluated several thousand molecules in a search run using MD simulations. 2.3 Other algorithms The application of the single-player MCTS (Schadd et al., 2008) for materials discovery has also been considered. In this approach, a variance-dependent term is empirically added to the selection index of the UCB. Herein, we denote the bandit algorithm using this modified index sp UCB. We note that the best-arm identification, such as the UCBE algorithm (Audibert et al., 2010), is different from the MKB algorithm. The best-arm identification aims to find the arm with the maximum expectation reward not the single maximum through a search run. The algorithms based on the best-arm identification barely select arms with a low expectation reward even if the arm affords a high reward at low rates. 3. Definitions The definitions used in this section through Section 5 are listed as follows: Definition 1 (Bandit problem) The K-armed bandit problem, or simply bandit problem, is a problem to maximize (minimize) some objective G h {k(t)}t [T] ; {rk(t)}k [K],t [T] i (1) in a selection game with K arms during time horizon T, where k(t), t [T] is a player s selection which should be optimized. The arm k [K] returns a reward rk(t), t [T] at time t, following unknown time-independent reward distribution fk(r). A player also does not know T in the anytime setting. We usually omit the dependency of G on k(t) and rk(t), k [K], and t [T]. Kikkawa and Ohno Definition 2 (Conventional bandit problem) The conventional bandit problem is a bandit problem to maximize the total reward Gsum h {k(t)}t [T] i := X t [T] rk(t)(t). (2) Definition 3 (MKB problem) The MKB problem is a bandit problem to maximize the single maximum reward Gmax h {k(t)}t [T] i := max t [T] rk(t)(t). (3) Definition 4 (EI) In the bandit problem, the EI of arm k at time τ T is defined as EI [k, τ; G] := Efk h G h { k(t)}t [τ] ii Efk h G h {k(t)}t [τ 1] ii , (4) where k(t) = k(t) when t [τ 1] and k(t) = k when t = τ. Definition 5 (Complementary error function) erfc(x) := 2 x exp( x2)dx. (5) Definition 6 (Integral of erfc(x) (Olver et al., 2010)) ierfc(x) := Z x erfc(x)dx = 1 π exp( x2) x erfc(x). (6) Definition 7 (Sub-gaussian) A distribution f(r) is called a sub-gaussian distribution when m R and s R+, such that Pf[|r| < u] U(u; m, s2), (7) U(r; m, s2) := 2 exp (r m)2 The m and s2 are called the mean and variance proxies, respectively. Definition 8 I(r; m, s2) := Z r U(u; m, s2)du = 2πs2 erfc r m Definition 9 (Tentative upper bound) The symbol means that the right value is a tentative value of the upper bound of the left value. Definition 10 (Sub-exponential) A distribution g(x) is called a sub-exponential when b 0, such that Pg(x){x u} 2 exp u Materials Discovery using Max K-Armed Bandit 4. Our Concept Our main claim in this article is the effectiveness of Algorithm 1 which uses a UCB of EI of the single best reward as the selection index. We first show the reasonability of the use of a UCB of EI for the bandit algorithm by taking the conventional UCB (Auer et al., 2002; Bubeck et al., 2013) as an example. This example is intuitively clear although it is not rigorous. We provide a more theoretical discussion in Section 7. We also show in Lemmas 13 and 14 that the EI of the single best reward can be calculated from a survival function of fk(r). The substantial value is estimated in Theorem 16 in the next section. In the conventional bandit problem, the EI becomes the expected reward as shown in the following lemma. Lemma 11 (EI of conventional bandit problem) In the conventional bandit problem, EI [k, τ; Gsum] = Efk[r]. (11) Based on this lemma, we can consider that the conventional UCB algorithm (Auer et al., 2002) uses a UCB of EI[k, τ; Gsum] as a selection index. This relation of the conventional UCB and EI implies that the same approach is valid in the MKB problem as follows. Theorem 12 (Max Search strategy) Let zk := z(k, R(τ 1); Gmax) be a UCB of EI[k, τ; Gmax], where R(τ) := {k(t), rk(t)(t)} t τ is the set of the pairs of the selected arm ids and rewards previously played and obtained. The strategy which selects argmax k [K] z(k, R(τ 1); Gmax) (12) in each selection (Algorithm 1) is an asymptotically optimal approach in the MKB bandit problem. Proof See Section 7. Algorithm 1 Max Search Input: number of arms K, current time τ, and previous records R(τ 1). Output: selected arm index ˆk. 1: for each k [K] do 2: calculate zk = z(k, R(τ 1); Gmax) 4: ˆk argmax k [K] zk 5: return ˆk Theorem 12 does not state the explicit form of z(k, R(τ 1); Gmax) . Therefore, we should estimate it to use Algorithm 1. The lemmas are footholds for the estimation. Lemma 13 (EI of MKB problem) In the MKB problem, let rmax := maxt [τ 1] rk(t)(t) be given. Then, EI[k, τ; Gmax] = Efk max{rk(t)(t), rmax} rmax. (13) Kikkawa and Ohno Lemma 14 (EI and survival function) Let r be an independent identical distributed (i.i.d.) random variable following f(r) and r0 be given. Then, Ef [max {r, r0}] r0 = Z r0 S(u)du, (14) where S(r) := Z r f(u)du (15) is the survival function of f(r). Ef [max {r, r0}] = Z r0 r0f(r)dr + Z r0f(r)dr + Z r0f(r)dr + Z r0 r0 f(r)dr + Z Here, in switching the order of the integration, we used r0 r 0 u r (0 u r0 r0 r) (r0 u u r). (17) Lemma 13 assumes rmax given. It is no problem in the implementation because rmax can be recorded on a O(1) memory. Lemma 14 says that the EI in Lemma 13 can be calculated from the survival function of the reward distribution. Because of this, the remained work to obtain the selection index is the estimation of a substantial form of a UCB of R rmax Sk(r)dr with some assumption for the reward distribution. 5. Estimation for Selection Index In this section, we derive two concrete forms of the UCB of EI. One is derived strictly under Gaussian reward settings, and the other is derived approximately under sub-gaussian reward settings. As will be shown in the next section, the former appears to better match the tasks of materials discovery. However, the assumption for the reward distribution in the latter approach is looser than that in the former. Because the material properties of different materials groups are sometimes bounded do not lie on a continuum, the latter might be preferable in some cases of practical materials discovery. 5.1 Selection Index under Gaussian Reward Settings Under Gaussian reward settings, the reward distributions are written as fk(r) = N(r; µk, σ2 k), (18) Materials Discovery using Max K-Armed Bandit where N(r; µ, σ2) is a Gaussian distribution with mean µ and variance σ2. In this case, the confidence interval of R r Sk(u)du is given as follows: Proposition 15 (Confidence interval of R r S(u)du) Let f(r) be a Gaussian distribution with mean µ and variance σ2. Let S(r) be a survival function of f(r) and let ri, i [n] be i.i.d. Gaussian random variables following f(r). We then have, s ˆσ2 2 ierfc ˆσ2 + 2 ierfc with confidential level 1 α, where ˆµ := µ tn 1,1 α 2 σ n, ˆσ2 := (n 1) σ2 i=1 ri, σ2 := 1 n 1 i=1 (ri µ)2. (21) Here, tn,p and χ2 n,p are the p-quantiles of t and χ2 distributions, respectively. Proof The function Z 2 ierfc r µ is a monotonically increasing function of µ and σ2. Therefore, its confidence interval can be simply obtained by substituting the lower and upper confidence bounds ˆµ µ ˆµ+ and ˆσ2 σ2 ˆσ2 + (Pishro-Nik, 2014) into the function. From this proposition and Theorem 12, the selection index is obtained as ˆσ2 k 2 ierfc ˆµk = µk + tnk 1,1 α 2 σk nk , ˆσ2 k = (nk 1) σ2 k χ2 nk 1, α k(t)=k, t [τ 1] rk(t)(t), σ2 k = 1 nk 1 k(t)=k, t [τ 1] (rk(t)(t) µk)2, (25) where nk is the number that the arm k is selected previously. The summations in the calculation of µk and σ2 k are also performed only when the arm k is selected. The parameter α, which determines the balance between exploration and exploitation, is determined as α = ν c2, where ν = X k [K] nk. (26) Kikkawa and Ohno This value, usually called the allocation order, is consistent with the optimal order of the conventional bandit problem (Auer et al., 2002). However, it is inconsistent with the double exponential order, which is optimal in the MKB with Gumbel-type reward distributions (Cicirello and Smith, 2005). Considering the uncertainty of the MKB problem discussed in Section 7, the optimal allocation order will be explored in future work. Equation (23) is numerically calculated using Algorithm 2. In the present experiments, we set c = 1. Algorithm 2 Selection index under Gaussian settings Input: total number of selections previously performed ν, number of times the target arm is selected n, sum of the rewards obtained from the target arm R, sum of the square rewards obtained from the target arm Q, the maximum reward obtained until the current time rmax, and the hyperparameter c. Output: selection index z. 1: if n > 1 then 3: σ2 (Q n µ2)/(n 1) 5: ˆµ µ + tn 1,1 α/2 p 6: ˆσ2 (n 1) σ2/χ2 n 1,α/2 ˆσ2/2 ierfc h (rmax ˆµ)/ 11: return z 5.2 Selection Index under Sub-gaussian Reward Settings Employing the sub-gaussian assumption in the reward distributions, we have Z r Sk(u)du I(r; mk, s2 k). (27) Therefore, we can use a UCB of I(rmax; m, s2) as the selection index . We could only estimate a tentative value of this UCB as follows: Proposition 16 (UCB of I(r; m, s2)) Let f(r) be a sub-gaussian distribution with mean proxy m and variance proxy s2. Let S(r) be a survival function of f(r). Let ri, i [n] be i.i.d. sub-gaussian random variables following f(r). Then, I(r; m, s2) I(r; ˆm, ˆs2) (28) with the confidential level 0 < 1 α < 1 exp[ ( 2 ln 2)2n], or almost equivalently n > 13.613 ln α > 0, where i=1 ri, (29) Materials Discovery using Max K-Armed Bandit ˆs2 := 1 2[ln 2 γ(α, n)] i=1 r2 i ˆm2 ! γ(α, n) := ln α This proposition is weak because it states only a tentative value. However, Algorithm 1 with the selection index calculated from Algorithm 3, which is based on Proposition 16, performed well as shown in Section 6. Algorithm 3 Selection index under sub-gaussian settings Input: total number of selections previously performed ν, number of times the target arm is selected n, sum of the rewards obtained from the target arm R, sum of the square rewards obtained from the target arm Q, the maximum reward obtained until the current time rmax, and the hyperparameter c. Output: selection index z. 1: if n = 0 or ν < 2 then 5: γ β2 + 2 6: if γ > ln 2 then 10: ˆs2 (Q/n ˆm2)/[2(ln 2 γ)] 2πˆs2 erfc h (rmax ˆm)/ 14: return z The reason for the weakness of Proposition 16 is that the sub-gaussian assumption lacks an upper bound of s2(Lemma 18). We alternatively use a lower bound of s2 in this lemma to derive Proposition 16. The term tentative represents the theoretical inauthenticity of this treatment. This alternative is valid only if the lower bound captures the characteristics of the tail distributions, but the assumption appears to be reasonable from the results in Section 6. The derivation of Proposition 16 is based on Bernstein s inequality in Proposition 20 because the lower bound of s2 contains the expected square value of the reward. The details are given in the following subsections. 5.2.1 Sub-Gaussian assumption Using the sub-gaussian assumption, a UCB of I(r; m, s2) is given by the following conceptual lemma. Kikkawa and Ohno Lemma 17 (UCB of I(r; m, s2)) Consider n samples {ri}, i [n] taken from the subgaussian distribution f(r) with mean proxy m and variance proxy s2. Let ˆm and ˆs2 be UCBs of the mean and variance proxies with the confidence level 0 < 1 α < 1, respectively. Then, I(r; m, s2) I(r; ˆm, ˆs2) (32) with the confidence level 1 α, when r m. Proof With the confidence level 1 α, m ˆm and s2 ˆs2. I(m; m, s2) I(m; ˆm, ˆs2). I(r; m, s2) is monotonically decreasing for r and increasing for m and s2. Then, I(r; m, s2) I(r; ˆm, ˆs2) when r m. In this lemma, UCBs denoted as ˆm and ˆs2 are virtual. Thus, we represent it using the term conceptual . Unfortunately, as the following lemma indicates, even the upper bound of s2 cannot be determined under the sub-gaussian assumption. Lemma 18 (Bounds on s2) Let f(r) is a sub-gaussian with mean proxy m and variance proxy s2. Then, 2s2 Ef[(r m)2] ln 2 . (33) There are no upper bounds on s2. Proof Lower bound: As an equivalent condition to the sub-gaussian on Definition 7, the following Orlicz condition is established (Vershynin, 2018). Applying Jensen s inequality to this condition, we obtain exp Ef[(r m)2]/(2s2) 2. Then, 2s2 Ef[(r m)2]/ ln 2. No upper bound: From the sub-gaussian definition, f(r) U(r; m, s2). Then, f(r) U(r; m, ˆs2), where ˆs2 > s2. Therefore, f(r) can be considered as a sub-gaussian with variance proxy ˆs2 > s2. It means no upper bound of s2. Because of this lemma, we gave up deriving a rigor UCB of s2. Alternatively, we decided to use a UCB of the lower bound Ef[(r m)2]/ ln 2 as a tentative UCB. 5.2.2 Derivation for Theorem 16 To estimate a UCB of Ef[(r m)2]/ ln 2, we first indicate the sub-exponential property of (r m)2 in Lemma 19. A UCB of the expected value of the sub-exponential distribution is known to be estimated from Bernstein s inequality (Vershynin, 2018). Therefore, we estimate a UCB of Ef[(r m)2]/ ln 2 using a variant of Bernstein s inequality in Proposition 20. This result gives a tentative value of the UCB of s2 in Lemma 22. Then, applying Lemma 22 to Lemma 17, we obtain Proposition 16. As is shown in the following lemma, (r m)2 becomes sub-exponential under the subgaussian assumption. Lemma 19 (Sub-exponential property of square reward) Let r be an i.i.d. random variable following sub-gaussian f(r) with variance proxy s2. Then, m R, x = (r m)2 follows a sub-exponential with parameter b = 2s2. Materials Discovery using Max K-Armed Bandit Proof Let r be an i.i.d. random variable following a sub-gaussian f(r) with mean proxy m and variance proxy s2. Let g(x) be the distribution of x = (r m)2. Then, u 0, Pg{x u} = Pf{(r m)2 u} = Pf{|r m| u} where b = 2s2. We used the definition of sub-exponential property for the last inequality. Using the sub-exponential property of (r m)2, a UCB of Ef[(r m)2] can be estimated from a variant of Bernstein s inequality. Proposition 20 (Bernstein s inequality) Let g(x) be a sub-exponential distribution with parameter b. Let x1, x2, . . . , xn > 0 be i.i.d. sub-exponential random variables following g(x). Then, i=1 xi Eg [x] u 2nw+ nw2 + 2n), (36) 2nw nw2 2n) 0 u < 3b/2 exp ub 1 + 1 n 3b/2 u , (37) where u 0 and w := The proof is presented in Appendix A. From Lemma 19 and Proposition 20, a UCB of Ef[(r m)2] is given as follows: Corollary 21 (Confidence bounds of Ef[(r m)2]) Consider n samples {ri}i [n] taken from the sub-gaussian distribution f(r) with mean proxy m and variance proxy s2. Then, i=1 (ri m)2 γ+b Ef (r m)2 1 i=1 (ri m)2 + γ b, (38) when γ+ := β2 + 2 γ := β2 + 2 2β 0 < β < 1/ 2 1 + β2 1/ with a confidence level of 0 < 1 α < 1, where β := p ln α/n and b := 2s2. Using this corollary, we obtain a tentative UCB of s2 as follows: Lemma 22 (Tentative UCB of s2) Consider n samples {ri}i [n] taken from the subgaussian distribution f(r) with mean proxy m and variance proxy s2. Then, s2 1 2n(ln 2 γ) i=1 (ri m)2 (39) with the confidence level 0 < 1 α < 1 exp[ ( 2 ln 2)2n], or almost equivalently n > 13.613 ln α > 0. γ := β2 + 2 2β, where β := p Kikkawa and Ohno Proof Substituting 2s2 ln 2 = Ef[(r m)2] into Corollary 21, i=1 (ri m)2 + 2s2γ . (40) Then, when 0 < γ < ln 2 < 1/ s2 1 2n(ln 2 γ ) i=1 (ri m)2. (41) From the bounds of γ , 0 < β < 2 ln 2, which is equivalent to the bounds of α in this lemma. Lemmas 17 and 22 contain unknown ˆm and m. We simply select m = ˆm = (Pn i=1 ri)/n because we only estimated the tentative value of ˆs2. Proposition 16 is then obtained from these lemmas. Using the sample means for m and ˆm is also justified in terms of the order of convergence. The confidence bounds of the sample mean of the reward converge to the expected mean at O(n 1/2) (Auer et al., 2002). Then, we expect that m and ˆm also converge in the same order. This order is faster than O(n 1/4), which is the order of convergence of ˆs in Lemma 22. In this case, I(r; ˆm, ˆs2) in Lemma 17 converges to limn I(r; ˆm, ˆs2) at the same order of ˆs. Because of this, it is sufficient to correctly evaluate only ˆs. 5.2.3 Derivation for Algorithm 3 Setting α = ν c2, we can implement the UCB in Theorem 16 as Algorithm 3. We use symbol ν instead of τ 1 because of the generality in MCTS. In this algorithm, c is a hyperparameter that controls the balance between exploration and exploitation. We recommend c = 1/ 13.613 to satisfy the condition, γ < ln 2, when n > ln ν. To explore the search space more randomly, a larger c should be used. In the implementation, when the same z value was obtained from other arms, one of the arms was selected randomly. The inequality, n > ln ν, indicates that our algorithm allocates at least ln ν trials for the nonoptimal arms. This allocation order is consistent with the optimal order for the conventional bandit problem (Auer et al., 2002). 6. Experiments and Results We conduct two types of numerical experiments to compare our algorithms with other algorithms. One is the synthetic bandit problems with the Gaussian reward distributions, and the other is SMILES optimization using MCTS (Yang et al., 2017; Kajita et al., 2020; Kikkawa et al., 2020) as the demonstrations for materials discovery. We employed a single set of recommended or reasonable hyperparameters for all the experiments because the tuning of hyperparameters for the actual applications in materials discovery is extremely expensive. We set T = 10, 000 considering the realistic applications (Kajita et al., 2020; Kikkawa et al., 2020) unless the observed maximum reward clearly does not converge. We present the details of other algorithms in Appendix B. Materials Discovery using Max K-Armed Bandit 6.1 Synthetic problems for bandits The synthetic problems explored in the experiments include the following: easy problem This problem consists of three arms with the Gaussian parameters (µ1, σ1) = (1, 1), (µ2, σ2) = (0, 2), and (µ3, σ3) = ( 1, 3). Arm 3 is optimal for the MKB problem because of its large variance. However, in the conventional bandit approaches, arm 1 is preferred because of its high expectation reward. difficult problem This problem consists of three arms with (µ1, σ1) = ( 0.2, 1.1), (µ2, σ2) = (0, 1), and (µ3, σ3) = ( 0.8, 1.2). In this problem, the optimal arm in the MKB problem switches depending on the total number of trials. Arm 1 is optimal 102 T 109 because µ1 +2σ1 = µ2 +2σ2 and µ1 +6σ1 = µ3 +6σ3. The algorithms for determining the arm with the maximum expectation reward will select arm 2. An algorithm with a strong tendency to choose arms with high variances will have a higher preference toward arm 3 than arm 1. It is a challenge for the MKB algorithm to select arm 1 correctly. unfavorable problem This problem comprises three arms with the same variance; the Gaussian parameters of each arm were set to (µ1, σ1) = (1, 1), (µ2, σ2) = (0, 1), and (µ3, σ3) = ( 1, 1), respectively. In this setting, arm 1 is optimal. the conventional UCB will select the optimal arm correctly because this arm has the highest mean reward. The MKB algorithms will lose the conventional UCB because these algorithms incur costs for estimating the variance of each arm. The transition plots of the observed maximum and the ratio of the optimal arm selection averaged over 100 independent runs are shown in Figure 1. The plots of the observed maximum can directly evaluate the performance of the MKB algorithm; however it is susceptible to data variability. The ratio of the optimal arm selection can help in that case. In the result of the easy problem, the MKB algorithms exhibit higher observed maximum reward than the random search on average. Although the obtained maximum rewards are similar among these MKB algorithms, the ratios of the optimal arm selected clearly show that our algorithms identify the best arm first. As expected, the conventional UCB mainly selected the non-optimal arm. The sp UCB and UCBE also afforded worse results than those of the random search. In the difficult problem, the selection ratios show that the MKB algorithms selected the optimal arm more frequently than the random search, although slight differences were observed in the observed maximum reward. In particular, our algorithms more efficiently select the optimal arm than other MKB algorithms. The performances of the random search and UCBE were almost the same, and sp UCB and the conventional UCB exhibited the worse performances. In the unfavorable case, the conventional UCB worked the best from the viewpoint of the selection ratio. The performance of sp UCB is similar to that of the conventional UCB. Our algorithms also exhibited good performance, although the ratios were slightly lower than those of the conventional UCB. The results of Threshold Ascent and Robust UCBMax were better than those of UCBE. The random search afforded the worst result. Kikkawa and Ohno In the previous three tasks, the Max Search[sub-gaussian] algorithm outperformed the Max Search[Gaussian] algorithm. This performance difference is possibly attributable to different mean estimates in the two algorithms. The Gaussian-based algorithm achieved higher performance when using the sample mean µ than when using the UCB of the mean ˆµ. Figure 1: Transition plots of the observed maximum (upper) and ratio of the optimal arm selection (lower) on (a) the easy problem, (b) the difficult problem, and (c) the unfavorable problem. The colors represent the results of different algorithms: red, Max Search[Gaussian]; orange Max Search[sub-gaussian]; reddish brown, Threshold Ascent; purple, Robust UCBMax; green, sp UCB; sky blue, UCBE; blue, conventional UCB; and gray, random search. The error bars indicate the standard errors of 100 independent runs. If the differences between the two methods are more than two times the standard errors, there will be a significant difference between these methods with a 5 % significance level. 6.2 Molecular discovery using tree search As a demonstration of the molecular discovery problem, we attempted to optimize the molecular structure M which maximized either of the properties defined by the following empirical equations (Joback and Reid, 1987): Tb(M)[K] = 198.2 + X i frag(M) Tb,i, Materials Discovery using Max K-Armed Bandit Pc(M)[bar] = 0.113 + 0.0032Na(M) + X i frag(M) Pc,i η300K(M)[Pa s] = Mw(M) exp P i frag(M) ηa,i 597.82 i frag(M) ηb,i 11.202 where Tb, Pc, and η300K are the boiling temperature, critical pressure, and liquid dynamic viscosity at 300 K of molecule M, respectively; frag(M) was a set of atomic fragments of M, determined by Joback and Raid. The fragments simply determined for each atom type, such as carbon in methyl group, halogens, and ether oxygen in a ring group, etc. The functions Na(M) and Mw(M) were the number of atoms in M and molecular weight of M, respectively. The empirical parameters, Tb,i, Pc,i, ηa,i, and ηb,i, were optimized to reproduce the experimental properties. The properties, Tb, Pc, and η300K, depended on the molecular structure through these parameters. In addition to those three properties, the topological polar surface area TPSA(M) [ A2] ( A = 0.1nm) (Ertl et al., 2000) was maximized. Using these empirical formulas, we can verify the performance of the search algorithms in a short time. During the search process, the candidate molecular structures were generated using the following context-free grammar (Hopcroft et al., 2001) of the SMILES strings (Weininger, 1988). Using the context-free grammar, we could create a simple maze game (Kikkawa et al., 2020) systematically. Here, we applied the following rules: S C(X)(Y )(Y )(Y ), C(=O)(Y )(Y ), C(Y )C(Y )(=C(Y )C(Y )), or C(=O)(O(Y ))(Y ), X [H], F, Cl, Br, C(X)(Y )(Y ), O(Y ), N(Y )(Y ), C(=O)(Y ), C(Y )(=C(Y )(Y )), or C(=O)(O(Y )), Y [H], F, Cl, Br, C(X)(Y )(Y ), C(=O)(Y ), C(Y )(=C(Y )(Y )), or C(=O)(O(Y )), where S, X, and Y denote the non-terminal variables, and the upright characters denote the terminals. The start variable is set to S, and a string-generation process is completed when the string no longer has variables. The following additional rule was applied when the number of alphabets was greater than 40: X or Y [H]. This rule guarantees the termination of the generation process within the moderate molecular size. This limit is approximately 500 g/mol in molecular weight, and most of the known molecules in the database2 are within the limit. We employed hydrogen as the termination atom, which is commonly used in organic chemistry. The alphabets include the explicit H , and exclude the parenthesis and equal symbols. The string Br and Cl are considered as two alphabets. The search space of this molecular generator contains significantly more than 6.248 1013 molecular species, which is the number of isomers in C40H82 (Yeh, 1995). We did not consider the synthesizability and the target scope of generated molecules; however, it can be considered by modifying the grammar in practical use. 2. https://www.rsc.org/Merck-Index/ Kikkawa and Ohno The context-free language can be projected to a tree graph (Figure 2). Therefore, the molecular generator can be easily implemented with an MCTS algorithm, as shown in Algorithm 4. The node selection in each layer continues until a complete molecular string is created. Subsequently, the chemical property evaluation is performed, after which the property value is used as the reward. The reward value is recorded in each node passed in the creation, and it is used to calculate the selection indices in the next creation. The complete SMILES strings assigned on the different leaves are treated as the different molecules in this search algorithm even if these molecules have the same molecular symmetries. Figure 2: Tree image of SMILES generation. Algorithm 4 MCTS Input: number of trials T. 2: while τ < T do 4: v root() {the root node of the search tree.} 6: while v is not a leaf node do 7: k policy(records) {policy : Algorithm 1 or the algorithms in Appendix B. records : statistic data such as K, nk, Rk, R2 k, and rmax in Algorithm 1.} 9: v child(k, v) {the k-th child of the node v.} 10: end while 11: r reward(L) {the reward of the selected path L.} 12: records.add(L, r) {record the path L and the reward r.} 13: end while The properties, Tb, Pc, and η300K were calculated using the python thermo module (Bell and Contributors, 2016), and TPSA was calculated using the RDKit library (Landrum, 2016). When using η300K as the reward, the rules containing one of F, N, and =C were excluded because their empirical parameters were not available. Additionally, we note that all of the generated SMILES were valid in the network test of RDKit. Materials Discovery using Max K-Armed Bandit Using the transition plots of the observed maximum, we compared Max Search and other algorithms in Figure 3. The plots were obtained by averaging over each 100 independent search runs. Figure 3: Transition plots of the molecular discovery. (a) Tb, (b) Pc, (c) η300K, and (d) TPSA. The other notations are the same as those in Figure 1. The colors represent the results of different algorithms: red, Max Search[Gaussian]; orange, Max Search[sub-gussian]; green, sp UCB; sky blue, UCBE; blue, conventional UCB; and gray, random search. The error bars indicate the standard errors of the 100 independent runs . Threshold Ascent and Robust UCBMax cannot be implemented in MCTS because of the many hyperparameters involved. In the search of Tb in Figure 3(a), the conventional UCB afforded the highest rewards at t = 10, 000. This result is expected because the empirical formula of Tb is a simple sum of the fragment parameters. In such case, the optimal arm is almost equivalent to the arm with the best expectation reward. This condition corresponds to the unfavorable case of synthetic problems. In fact, the searches for other properties expressed by simple summation in the Joback method afforded similar results. For t < 2, 500, sp UCB demonstrated the best performance. This result is probably due to setting the exploitative hyperparameter c = 0.1 recommended in the original article (Schadd et al., 2008). The conventional UCB with c = 0.1 gave a similar transition plot. Our algorithms are less effective than the conventional UCB and comparable in performance to the sp UCB, but outperform UCBE and random search. Kikkawa and Ohno In the searches of Pc, η300K, and TPSA, our algorithms outperformed the other algorithms, especially in the late stage. There are some different tendencies in these transition plots. These differences are probably due to the differences in the population distributions of rewards. For example, for η300K, there are chemical structures with enormously high rewards in the search space. Our algorithm can find these structures with a high efficiency and success rate. In contrast, for TPSA, the population distribution probably has an upper bound near 290 A2. Our algorithms worked well even if such case. These results evidence the wide application range of our proposed algorithms. The chemical structures with the top three rewards of Tb, Pc, η300K, or TPSA in the 100 independent runs are shown in Figure 4. From the high-scoring molecules, we deduce that: Carboxyl groups are favorable for high Tb. Alcohol, carboxyl, and halogen groups are favorable for high viscosity. Polarized oxygen groups are favorable for high TPSA. These understandings are consistent with chemical knowledge. More complicated and highly optimized structures can be found in our algorithms than other algorithms. 7. Discussion The numerical experiments in the previous section show that the UCB of EI is possible as the selection index for the MKB problem. In this section, we discuss why that is so. Our discussion would contain non-rigorous arguments. However, we believe that this discussion will help with future work. We also discuss the incompleteness of our algorithm for multimodal reward distributions. This limitation is due to the sub-gaussian assumption, but this would not affect the application of MCTS. 7.1 Subtleties of Extreme Regret For our discussion, we should mention the subtleties of extreme regret, first pointed out by Nishihara et al. (2016). In this section, we review these subtleties. The extreme regret was introduced by Carpentier and Valko (2014), defined as follows: Definition 23 (Carpentier s regret) In the MKB problem, Carpentier s regret when k(t), t [T] are selected is defined as follows: Rk (t),k(t) C (T) := E max t [T] rk (t)(t) E max t [T] rk(t)(t) , (42) where k (t), t [T] denotes an oracle policy. The asymptotically optimal policy should satisfy Rk (t),k(t) C (T) = o E max t [T] rk (t)(t) . (43) A subtlety of Carpentier s regret is that the regret asymptotically approaches 0 for most policies in some settings. For example, we consider all reward distributions of the arms have bounded support. Then, any policy that selects each arm infinitely often achieves an Materials Discovery using Max K-Armed Bandit Figure 4: Chemical structures with the top three rewards in 100 independent runs. Kikkawa and Ohno asymptotically zero regret, meaning that even the random search is asymptotically optimal in the setting. To avoid this, Nishihara et al. (2016) defined an alternative regret: Definition 24 (Nishihara s regret) In the MKB problem, Nishihara s regret when k(t), t [T ] are selected is defined as follows: Rk (t),k(t) N (T) := 1 T : E max t [T ] rk(t)(t) E max t [T] rk (t)(t) , (44) where k (t), t [T] denotes an oracle policy. The asymptotically optimal policy should satisfy lim sup T Rk (t),k(t) N (T) 1. (45) This regret works even when the reward distributions have bounded supports. However, Nishihara et al. (2016) showed that there is a set of reward distributions such that lim sup T Rk SA ,k(t) N (T) K for any policy, where k SA is the selection of single-armed oracle defined in Definition 25. Namely, no policy is asymptotically optimal under Nishihara s regret. Because of this, we only treat reward distributions with unbounded supports in the following discussion. Another subtlety exists in the definition of the oracle policy. The previous works are essentially based on the single-armed oracle (Nishihara et al., 2016) as follows: Definition 25 (Single-armed oracle) In the MKB problem, the single-armed oracle is the policy that plays the single arm k SA (T) := argmaxk [K]E max t [T] rk(t) (46) over a time horizon T. However, this oracle gives different k SA depending on T. This fact can be confirmed by the following example. Example 1 Consider the MKB problem with K = 3. Let the reward distributions of each arm be f1(r) = N(r; 0, 0.01), f2(r) = N(r; 1, 0.25), and f3(r) = N(r; 15, 4). Then, the single-armed oracle gives k SA = 1 if T 11, k SA = 2 if 1.4 1011 T 5.4 1013, and k SA = 3 if T 3.9 10202. Proof The expected maximum reward sampled from the k-th arm over a time horizon T is bounded by ln T E max t [T] rk(t) µk + where µk and σ2 k are the mean and variance of the Gaussian reward distribution, fk(r), respectively (Kamath, 2015). Then, the example is established. Materials Discovery using Max K-Armed Bandit The T-dependency of the single-armed oracle means that the best arm cannot be determined without information on T (Nishihara et al., 2016). This raises a question about the regret analysis using the infinity limit of T. In Example 1, arm 3 should be selected most often to achieve the asymptotically zero regret. However, arm 1 or 2 is a more suitable choice when T < 5.4 1013. Because many applications cannot perform such a large number of trials, the regret analysis result is impractical. The T-dependency of the oracle is also inconvenient in MCTS applications. In an MCTS algorithm, T is not given except for the root node. Then, one cannot determine the best arm except for the root node even if the reward distribution is known. 7.2 T-independent oracles and asymptotics of UCB approach Nishihara et al. (2016) also proposed an oracle independent of T. Definition 26 (Nishihara s greedy oracle) In the MKB problem, Nishihara s greedy oracle is the policy that plays the arm with the maximum EI. Namely, this oracle plays k N (τ) := argmax k [K] E h max {rk(τ), rmax (τ 1)} rmax (τ 1)|{rk N (t)(t)}t τ 1 i (48) at time τ, where rmax (τ) := maxt [τ] rk N (t). This oracle uses EI to avoid the dependency on T. Therefore, in terms of Nishihara s greedy oracle, it is natural that we employ EI to derive a T-independent MKB algorithm. Although Nishihara et al. (2016) did not analyze this oracle much, we note that Nishihara s greedy oracle gives different k N (τ) depending on the oracle value rmax (τ) instead of T, as shown in the following example: Example 2 Consider the MKB problem with K = 3. Let the reward distributions of each arm be f1(r) = N(r; 0, 1), f2(r) = N(r; 2, 2), and f3(r) = N(r; 6, 3). Then, Nishihara s greedy oracle gives k N (τ) = 1 if rmax (τ 1) 1.3, k N (τ) = 2 if 7.0 rmax (τ 1) 11.9, and k N (τ) = 3 if rmax (τ 1) 18.9, Proof Because of Lemma 14, we should consider the integral of the survival function. The survival function of fk(r) = N(r; µk, σ2 k) is expressed as follows: r fk(u)du = 1 2 erfc r µk The bounds of erfc(x), x > 0 are given by c exp( βx2) < erfc(x) < exp( x2), (50) and β > 1 (Chiani et al., 2003; Chang et al., 2011). Then, y exp( βx2)dx < 1 y erfc(x)dx < 1 y exp( x2)dx, (52) Kikkawa and Ohno where y > 0. Using the definition and bounds of erfc(x) again, we obtain 4 β exp( β2y2) < 1 y erfc(x)dx < π 4 exp( y2). (53) These bounds give the example. This dependency generates a subtlety in an adaptive case. Consider one obtains rmax(τ) < 11.9 under a selection {k(t)}t [τ] in Example 2. A problem arises when the oracle value rmax (τ) > 18.9 at that time. In this case, the next selection of Nishihara s greedy oracle differs from the selection with the maximum EI, meaning that a policy simply approaching Nishihara s greedy oracle is not always effective in the MKB problem. The subtlety due to the dependence on the oracle value rmax (τ) is solved using the observed value rmax(τ) alternatively. We define this oracle as follows: Definition 27 (Kikkawa s greedy oracle) In the MKB problem, let k(t), t [τ 1] be the previous selections. Then, Kikkawa s greedy oracle plays k K (τ) := argmax k [K] E max {rk(τ), rmax(τ 1)} rmax(τ 1)|{rk(t)(t)}t τ 1 (54) at time τ, where rmax(τ) := maxt [τ] rk(t)(t). This oracle is equivalent to Nishihara s greedy oracle when all selections follow this oracle. In addition, this oracle gives the arm that has the maximum EI even when the non-oracle selections exist in k(t), t [τ 1]. The following proposition states that Kikkawa s greedy oracle asymptotically approaches Nishihara s greedy oracle in terms of Carpentier s regret. Proposition 28 (Asymptotics of Kikkawa s greedy oracle) Let k(t), t [T] contain o(T) non-oracle selections and other selections follow Kikkawa s greedy oracle. Then, Rk N (t),k(t) C (T) = o E max t [T] rk N (t)(t) , (55) EI [k, T; Gmax] = O 1 T E max t [T] rk N (t)(t) . (56) Proof Let nk(T), k [K] and n N k (T), k [K] be the numbers of the k-th arm selected in k(t), t [T] and k N (t), t [T], respectively. Then, δn := n N k (T) nk(T) = o(T) is Materials Discovery using Max K-Armed Bandit expected.3 Therefore, Rk N (t),k(t) C (T) = max k [K] max n [n N k (T)] rk(n) E max k [K] max n [nk(T)] rk(n) = E max k [K] max max n [nmin] rk(n), max n [δn] rk(n + nmin) max n [nmin] rk(n) E max max n [nmin] rk(n), max n [o(T)] rk(n + nmin) max n [nmin] rk(n) o(T)EI k, nmin; Gmax = o E max t [T] rk N (t)(t) , where nmin := min{n N k (T), nk(T)}. The proposition states that o(T) mistakes are allowed in the asymptotically optimal policy under the condition related to the maximum value. This condition can be satisfied by Gaussian distributions at least. Our concept in Section 4 can be obtained by simply substituting the EI in Kikkawa s greedy oracle into its UCB. Therefore, our conceptual algorithm is expected to approach Kikkawa s greedy oracle for large T. The number of non-oracle selections in Algorithm 1 can be estimated as follows: Proposition 29 (Number of non-oracle selections) Consider the MKB problem. Let EI [k, R(t 1); Gmax] , t [T] be an estimator of EI [k, t; Gmax]. Suppose a confidence interval of EI [k, t; Gmax] is known as EI [k, t; Gmax] EI [k, R(t 1); Gmax] C(k, α(t), nk(t)) (58) with confidence level 1 α(t), where nk(t), k [K] be the numbers of the k-th arm selected under Algorithm 1 with this confidence interval. Then, the number of non-oracle selections becomes o(T) when α(t) = o(1) and C(k, α(t), nk(t)) = o min k κ(t) k (t) t nk(t) where d > 0, κ(t) = [K]/k K (t) and k(t) = EI k K (t), t; Gmax EI [k, t; Gmax] . (60) Proof Consider the following events: Ak K (t),t : EI k K (t), R(t 1); Gmax + C(k K (t), α(t), nk K (t)) EI k K (t), t; Gmax , (61) Aκ(t),t : EI [κ(t), R(t 1); Gmax] EI [κ(t), t; Gmax] + C κ(t), α(t), nκ(t)(t) . (62) 3. Pathological conditions may exist. However, we do not consider them. Kikkawa and Ohno Then, the number of complementary cases can easily be counted as follows: k [K] Ac k,t t [T] α(t) = o(T). (63) Conversely, in the case of all Ak,t established, the non-oracle arm is selected when EI [κ(t), R(t 1); Gmax] + C(κ(t), α(t), nκ(t)(t)) EI k K (t), R(t 1); Gmax + C(k K (t), α(t), nk K (t)(t)) (64) for any κ(t). Then, EI k K (t), t; Gmax EI k K (t), R(t 1); Gmax + C(k K (t), α(t), nk K (t)(t)) EI [κ(t), R(t 1); Gmax] + C(κ(t), α(t), nκ(t)(t)) EI [κ(t), t; Gmax] + 2C(κ(t), α(t), nκ(t)(t)). Equations (61), (64), and (62) are used in the first, second, and third inequalities, respectively. Then, solving for nκ(t)(t) using Equation (59), we obtain nκ(t)(t) = o(t). Then, we obtain the proposition using Equation (63). This proposition means that Algorithm 1 asymptotically approaches Kikkawa s greedy oracle in terms of the number of non-oracle selections if true UCB was used. That is, Algorithm 1 is conceptually also an asymptotically optimal policy in terms of Nishihara s greedy oracle and Carpentier s regret through Proposition 28. Under Proposition 29, the Max Search[Gaussian] algorithm corresponds to the following case: EI [k, R(t 1); Gmax] = σ2 k 2 ierfc C(k, α(t), nk(t)) = O The derivation is shown in Appendix C. The function C(k, α(t), nk(t)) satisfies the condition in Proposition 29. The algorithm is then asymptotically optimal if the reward of each arm follows the Gaussian distribution. Similarly, Max Search[sub-gaussian] corresponds to the case with EI [k, R(t 1); Gmax] = I(rmax; ˆmk, lim α 0 ˆs2 k) (68) C(k, α(t), nk(t)) = O as the upper bound. Note that no lower bounds are stated in Theorem 16. Because the lower bound of EI [k, t; Gmax] is required in Proposition 29, incorrect cases can occur. An example given below. Materials Discovery using Max K-Armed Bandit Example 3 Consider the two-armed bandit problem with the reward distributions, f1(r) = Bernoulli(0.5), r {0, 1} and f2(r) = N(r; 0, 0.1). Then, I(rmax; ˆm1, ˆs2 1) > I(rmax; ˆm2, ˆs2 2) in most situations. It means that Max Search[sub-gaussian] selects arm 1 mainly. However, the oracle policy selects arm 2 after obtaining r = 1 once from arm 1 because EI[1, t; Gmax] = 0 and EI[2, t; Gmax] > 0 in this case. This result is not due to the bounded support of the Bernoulli distribution. The similar discussion is established even if the Bernoulli distribution is convolved with N(r; 0, 0.01). The algorithm failure in this example is due to the poor estimating ability of the subgaussian assumption for the lower bound of EI[1, t; Gmax] in Proposition 29. In our algorithm, the tail distribution should approach the estimator at nk(t) (ln t)4. The multimodal distribution, such as the Bernoulli distribution, can break this condition easily and has a bad effect for the optimal arm selection. Fortunately, these effects have not impacts empirically in our materials discovery demonstrations using MCTS. We infer that this is a feature of MCTS, and its theoretical analysis is an interesting issue for future studies. Lastly, we note that the proof of Proposition 29 is an analog of the proof for the conventional bandit problem (Auer et al., 2002; Jamieson, 2018) except for the optimal arm depending on the time t. This treatment can be allowed because the selections by Kikkawa s greedy oracle correspond to the arms with the maximum EI for any t. This feature of Kikkawa s greedy oracle is valuable. We are sure that several other proofs for the conventional bandit problem establish formally in the MKB problem using Kikkawa s greedy oracle. 8. Conclusion Here, we proposed MKB algorithms and applied them to synthetic problems and molecular-design demonstrations using MCTS for materials discovery. The proposed algorithms use a single hyperparameter and are easily implemented for MCTS. This feature gives the proposed algorithms an advantage over other MKB algorithms, and enables their application to materials discovery. In fact, to the best of our knowledge, this is the first case where the MKB algorithms are actually employed for materials discovery. The performances of the proposed algorithms were examined on the synthetic problems and molecular-structure optimizations. The experimental results demonstrated that the proposed algorithms found the maximum reward more efficiently than other algorithms when the optimal arm could not be determined only based on the expectation reward. In real molecular designs, most of the molecular properties would have a high complexity; thus, we believe that the MKB algorithms are useful for these tasks. In the theoretical aspect, we mainly contribute in two aspects. One is the proof of the effectiveness of the use of a UCB of EI. The proof result has wide flexibility, and this can be used to propose other algorithms with other assumptions for distributions, which will be addressed in future work. Especially, we indicate that our algorithm is weak for multimodal distributions because it is based on the Gaussian or sub-gaussian reward assumption. Nonparametric approaches are probably required for the MKB algorithm to show their true potential. Recent studies on the nonparametric MKB algorithms (Bhatt et al., 2022; Baudry et al., 2022) probably have a potential to overcome these difficulties. In another direction, theoretical analyses based on full parametric assumptions would help Kikkawa and Ohno us to thoroughly understand the MKB problem. The other contribution is the proposal of Kikkawa s greedy oracle. Using the proposed oracle, we can avoid many of the subtleties of the MKB problem. We think that several other statements in the conventional bandit problem can be imported to the MKB problem with the help of this oracle. It may be possible to apply our theoretical approach to other fields based on the bandit problem. Heuristics to reduce the required trials are also important for actual use. For this purpose, the MKB algorithm could combined with other bandit algorithms. Combining with a supervised learning also holds significant promise. Acknowledgments We thank Dr. Ryosuke Jinnouchi in TCRDL for reviewing our early draft. We also thank Nanako Ishihara for her internship work. We also thank the anonymous reviewers. Their incisive remarks have contributed to the theoretical quality of our manuscript. Materials Discovery using Max K-Armed Bandit Appendix A. Proof of Theorem 20 Let Xi be independent random variables drawn from the same sub-exponential g(X) with parameter b > 0. Then, i=1 Xi Eg(X) [X] i=1 Xi Eg(X) u or Eg(X) P where u > 0. Therefore, in the former case, i=1 Xi Eg(X) [X] u exp λ(u + Eg(X) [X]) ) exp λ(u + Eg(X) [X]) E , Markov s inequality where λ > 0 is an arbitrary parameter. Since the random variables Xi are independent of each other, their moment-generating function can be separated. Thus, we obtain i=1 Xi Eg(X) [X] u exp λ(u + Eg(X) [X]) Eg(X) = exp λ(u + Eg(X) [X]) 1 + λEg(X) [X] λp Eg(X) [Xp] λp Eg(X) [Xp] Since 1 + x exp x 0 Pg(X){Xp u} du Integral identity 0 Pg(X){X bv}pbpvp 1 dv Replace u with bpvp 0 e vvp 1 dv . Sub-exponential Kikkawa and Ohno The above integral corresponds to the Gamma function. Therefore, i=1 Xi Eg(X) [X] u bpλpΓ(p) np 1(p 1)! = exp λu + 2b2λ2 where |bλ/n| < 1. Replacing n bλ with ξ+, we obtain i=1 Xi Eg(X) [X] u b + 2 ξ+ + 2n2 b 4n , (74) where 0 < ξ+ < 2n. Hence, the optimized ξ+ is 2n2b 2b + u. (75) i=1 Xi Eg(X) [X] u 2nw+ nw2 + 2n , (76) where w+ := 2. Moreover, using the same approach, we obtain Eg(X) [X] 1 b + 2 ξ + 2n2 b 4n , (77) where ξ := n + bλ and 0 < ξ < 2n. Hence, the optimized ξ is 2n2b/(2b u) 0 u < 3b/2 2n ϵ 3b/2 u, (78) where ϵ is an infinitesimal. Then, we obtain Eg(X) [X] 1 2nw nw2 2n) 0 u < 3b/2 exp ub 1 + 1 n + O(ϵ) 3b/2 u, (79) ub 1 + 2 > 1/ 2. Equations 76 and 79 show that E [X] is bounded at a confidence level of 1 α > 0 as follows: i=1 Xi u + E [X] 1 i=1 Xi + u , (80) where α = exp h 2 2nw + n(w +)2 2n i , (81) Materials Discovery using Max K-Armed Bandit when u + 0, and 2nw n(w )2 2n 0 u < 3b/2 exp u b 1 + 1 n 3b/2 u , (82) where w := p u b 1 + 2 (double sign in the same order). From Eq. 81, we obtain 2 + β, (83) and u + = (β2 + 2 where β := p ln α/n. In addition, from Eq. 82, u = ( β2 + 2 2β)b 0 β < 1/ 2 (β2 + 1)b 1/ The theorem follows from Eqs. 76 and 79. Appendix B. Compared Algorithms We compared our algorithm with Algorithms 5-10. We employed the following hyperparameters and applied some modifications for the implementation. In Threshold Ascent, the hyperparameters were set to s = 100 and δ = 2 ln ν. We used the reward ranking instead of the iteration used in the original code (Streeter and Smith, 2006b). In Robust UCBMax, we set s = 100, u = rs-th, v = (rmax u)1+ϵ/ ν, and ϵ = 0.4, according to the original paper (Achab et al., 2017). Although the original paper employed the robust UCB with the truncated mean estimator, we used a simple version of the robust UCB (Bubeck et al., 2013). In sp UCB, c = 0.1 and D = 32 are used as the hyperparameters. These values are recommended in the original paper (Schadd et al., 2008). In UCBE (Audibert et al., 2010) and the conventional UCB (Auer et al., 2002), we used c = 1 as the hyperparameter. In some algorithms, we estimated the variance parameter, σ, as the sample variance of the first P = 10 random searches. Kikkawa and Ohno Algorithm 5 Threshold Ascent Input: number of arms K, time horizon T, the s-th maximum of observed reward rs-th number of times the k-th arm is selected nk, the i-th reward from the k-th arm rk,i, and hyperparameter δ. Output: selected arm index ˆk. 1: for each k [K] do 2: if nk = 0 or ν < 2 then 5: Sk = P i [nk] 1[rk,i > rs-th] 6: α ln(2TK/δ) 7: zk Sk/nk + (α + p α(2Sk + α))/nk 8: end if 10: ˆk argmax k [K] zk 11: return ˆk Algorithm 6 Robust UCBMax Input: number of arms K, number of times the k-th arm is selected nk, the i-th reward from the k-th arm rk,i, and hyperparameters u, v, and ϵ. Output: selected arm index ˆk. 1: ν = P k [K] nk 2: for each k [K] do 3: if nk = 0 or ν < 2 then 6: Sk = P i [nk] rk,i1[rk,i > u] 7: zk Sk/nk + 4v1/(1+ϵ)(2 ln ν/nk)ϵ/(1+ϵ) 10: ˆk argmax k [K] zk 11: return ˆk Materials Discovery using Max K-Armed Bandit Algorithm 7 sp UCB Input: number of arms K, current time τ, number of times the k-th arm is selected nk, sum of the rewards obtained from the k-th arm Rk, sum of square rewards obtained from the k-th arm Qk, and sample variance obtained from the first P trials σ. Output: selected arm index ˆk. 1: if τ P then 2: ˆk Random Search(K) 4: ν = P k [K] nk 5: for each k [K] do 6: if nk = 0 or ν < 2 then 9: mk Rk/nk 10: zk mk + cσ p ln ν/nk + q Qk nkm2 k+D nk 11: end if 12: end for 13: ˆk argmax k [K] zk 15: return ˆk Algorithm 8 UCBE Input: number of arms K, current time τ, number of times the k-th arm is selected nk, sum of the rewards obtained from the k-th arm Rk, and sample variance obtained from the first P trials σ. Output: selected arm index ˆk. 1: if τ P then 2: ˆk Random Search(K) 4: ν = P k [K] nk 5: for each k [K] do 6: if nk = 0 or ν < 2 then 9: zk Rk/nk + cσ p ν/nk 10: end if 11: end for 12: ˆk argmax k [K] zk 14: return ˆk Kikkawa and Ohno Algorithm 9 UCB Input: number of arms K, current time τ, number of times the k-th arm is selected nk, sum of the rewards obtained from the k-th arm Rk, and sample variance obtained from the first P trials σ. Output: selected arm index ˆk. 1: if τ P then 2: ˆk Random Search(K) 4: ν = P k [K] nk 5: for each k [K] do 6: if nk = 0 or ν < 2 then 9: zk Rk/nk + cσ p ln ν/nk 10: end if 11: end for 12: ˆk argmax k [K] zk 14: return ˆk Algorithm 10 Random Search Input: number of arms K. Output: selected arm index ˆk. 1: ˆk random(K) {randomly select any of 1, ..., K.} 2: return ˆk Materials Discovery using Max K-Armed Bandit Appendix C. Order of C(k, α(t), n) In the Gaussian reward settings, we have C(k, α(t), n) = ˆσ2 k 2 ierfc σ2 k 2 ierfc The order of the above equation can be derived using some asymptotic equations (Zelen and Sevelo, 1968). For large n and Nα/2 p 2(n 1), we can write ˆµk = µk + tn 1,1 α/2 σk n µk + N1 α/2[1 + O(n 1)] σk n (87) ˆσ2 k = (n 1) σ2 k χ2 n 1,α/2 2(n 1) σ2 k (Nα/2 + 2n 3)2 σ2 k Nα/2 p 2(n 1) (88) ln α), (89) where Np is the p-quantile of the standard normal distribution. Therefore, if α = ν c2, 2(n 1) ln ν n, (90) ˆσ2 k σ2 k = Θ Because ln ν n, we can apply the Taylor expansion to C(k, α(t), n) as follows: C(k, α(t), n) = O(ˆµk µk) + O(ˆσk σk) = O The order of C(k, α(t), n) under sub-gaussian reward settings can also be derived through Taylor expansion and the order of ˆs2 k in Proposition 16. C(k, α(t), n) = I(r; ˆmk, ˆs2 k) I(r; mk, s2 k) = O Kikkawa and Ohno Mastane Achab, Stephan Cl emen con, Aur elien Garivier, Anne Sabourin, and Claire Vernade. Max k-armed bandit: On the extremehunter algorithm and beyond. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 389 404. Springer, 2017. Ankit Agrawal and Alok Choudhary. Deep materials informatics: Applications of deep learning in materials science. MRS Communications, 9(3):779 792, 2019. Jean-Yves Audibert, S ebastien Bubeck, and R emi Munos. Best arm identification in multiarmed bandits. In COLT, pages 41 53. Citeseer, 2010. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Dorian Baudry, Yoan Russac, and Emilie Kaufmann. Efficient algorithms for extreme bandits. ar Xiv preprint ar Xiv:2203.10883, 2022. Caleb Bell and Contributors. Thermo: Chemical properties component of chemical engineering design library (chedl). 2016. URL https://github.com/Caleb Bell/thermo. Sujay Bhatt, Ping Li, and Gennady Samorodnitsky. Extreme bandits using robust statistics. IEEE Transactions on Information Theory, 2022. Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1 43, 2012. S ebastien Bubeck, Nicolo Cesa-Bianchi, and G abor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711 7717, 2013. Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547 555, 2018. Alexandra Carpentier and Michal Valko. Extreme bandits. In Neural Information Processing Systems, 2014. Seok-Ho Chang, Pamela C Cosman, and Laurence B Milstein. Chernoff-type bounds for the gaussian error function. IEEE Transactions on Communications, 59(11):2939 2944, 2011. Marco Chiani, Davide Dardari, and Marvin K Simon. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2(4):840 845, 2003. Vincent A Cicirello and Stephen F Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In The Proceedings of the Twentieth National Conference on Artificial Intelligence, volume 3, pages 1355 1361, 2005. Materials Discovery using Max K-Armed Bandit Yahel David and Nahum Shimkin. PAC lower bounds and efficient algorithms for the max karmed bandit problem. In International Conference on Machine Learning, pages 878 887. PMLR, 2016. Zachary Del Rosario, Matthias Rupp, Yoolhee Kim, Erin Antono, and Julia Ling. Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization. The Journal of Chemical Physics, 153(2):024112, 2020. Peter Ertl, Bernhard Rohde, and Paul Selzer. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. Journal of Medicinal Chemistry, 43(20):3714 3717, Oct 2000. ISSN 0022-2623. doi: 10.1021/jm000942e. URL https://doi.org/10.1021/jm000942e. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation, 2nd edition. SIGACT News, 32(1):60 65, March 2001. ISSN 0163-5700. doi: 10.1145/568438.568455. URL https://doi.org/10.1145/ 568438.568455. Kevin Jamieson. Lecture 3: Stochastic multi-armed bandits, regret minimization, 2018. URL https://courses.cs.washington.edu/courses/cse599i/18wi/ resources/lecture3/lecture3.pdf. Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal. Elemnet: Deep learning the chemistry of materials from only elemental composition. Scientific reports, 8(1):1 13, 2018. Dipendra Jha, Kamal Choudhary, Francesca Tavazza, Wei-keng Liao, Alok Choudhary, Carelyn Campbell, and Ankit Agrawal. Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning. Nature communications, 10(1):1 12, 2019. K. G. Joback and R. Reid. Estimation of pure-component properties from groupcontributions. Chemical Engineering Communications, 57:233 243, 1987. Shenghong Ju, TM Dieb, K Tsuda, and J Shiomi. Optimizing interface/surface roughness for thermal transport. In Machine Learning for Molecules and Materials NIPS 2018 Workshop, 2018. Seiji Kajita, Tomoyuki Kinjo, and Tomoki Nishi. Autonomous molecular design by Monte-Carlo tree search and rapid evaluations using molecular dynamics simulations. Communications Physics, 3(1):1 11, 2020. Gautam Kamath. Bounds on the expectation of the maximum of samples from a gaussian. URL http://www. gautamkamath. com/writings/gaussian max. pdf, 2015. Nobuaki Kikkawa, Seiji Kajita, and Kensuke Takechi. Self-learning molecular design for high lithium-ion conductive ionic liquids using maze game. Journal of Chemical Information and Modeling, 60(10):4904 4911, 2020. Kikkawa and Ohno Shin Kiyohara and Teruyasu Mizoguchi. Searching the stable segregation configuration at the grain boundary by a monte carlo tree search. The Journal of chemical physics, 148 (24):241741, 2018. Levente Kocsis and Csaba Szepesv ari. Bandit based monte-carlo planning. In European conference on machine learning, pages 282 293. Springer, 2006. A Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason Hattrick-Simpers, Brian De Cost, Suchismita Sarker, Corey Oses, Cormac Toher, Stefano Curtarolo, et al. On-thefly closed-loop materials discovery via Bayesian active learning. Nature communications, 11(1):1 11, 2020. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985. Greg Landrum. RDKit: Open-source cheminformatics software. 2016. URL https:// github.com/rdkit/rdkit/releases/tag/Release_2016_09_04. Yue Liu, Tianlu Zhao, Wangwei Ju, and Siqi Shi. Materials discovery and design using machine learning. Journal of Materiomics, 3(3):159 177, 2017. ISSN 2352-8478. doi: https://doi.org/10.1016/j.jmat.2017.08.002. Thaer M. Dieb, Shenghong Ju, Kazuki Yoshizoe, Zhufeng Hou, Junichiro Shiomi, and Koji Tsuda. MDTS: automatic complex materials design using Monte Carlo tree search. Science and technology of advanced materials, 18(1):498 503, 2017. Thaer M. Dieb, Zhufeng Hou, and Koji Tsuda. Structure prediction of boron-doped graphene by machine learning. The Journal of chemical physics, 148(24):241716, 2018. Bryce Meredig, Erin Antono, Carena Church, Maxwell Hutchinson, Julia Ling, Sean Paradiso, Ben Blaiszik, Ian Foster, Brenna Gibbons, Jason Hattrick-Simpers, Apurva Mehta, and Logan Ward. Can machine learning identify the next high-temperature superconductor? examining extrapolation performance for materials discovery. Mol. Syst. Des. Eng., 3:819 825, 2018. doi: 10.1039/C8ME00012C. Robert Nishihara, David Lopez-Paz, and L eon Bottou. No regret bound for extreme bandits. In Artificial Intelligence and Statistics, pages 259 267. PMLR, 2016. Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular denovo design through deep reinforcement learning. Journal of cheminformatics, 9(1):1 14, 2017. Frank WJ Olver, Daniel W Lozier, Ronald F Boisvert, and Charles W Clark. NIST handbook of mathematical functions hardback and CD-ROM. Cambridge university press, 2010. Tarak K Patra, Troy D Loeffler, and Subramanian KRS Sankaranarayanan. Accelerating copolymer inverse design using monte carlo tree search. Nanoscale, 12(46):23653 23662, 2020. Materials Discovery using Max K-Armed Bandit Ghanshyam Pilania, Chenchen Wang, Xun Jiang, Sanguthevar Rajasekaran, and Ramamurthy Ramprasad. Accelerating materials property predictions using machine learning. Scientific reports, 3(1):1 6, 2013. H. Pishro-Nik. Introduction to Probability, Statistics, and Random Processes. Kappa Research, LLC, 2014. ISBN 9780990637202. URL https://books.google.co.jp/books? id=3yq_o QEACAAJ. Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science advances, 4(7):eaap7885, 2018. Paul Raccuglia, Katherine C Elbert, Philip DF Adler, Casey Falk, Malia B Wenny, Aurelio Mollo, Matthias Zeller, Sorelle A Friedler, Joshua Schrier, and Alexander J Norquist. Machine-learning-assisted materials discovery using failed experiments. Nature, 533 (7601):73 76, 2016. Rampi Ramprasad, Rohit Batra, Ghanshyam Pilania, Arun Mannodi-Kanakkithodi, and Chiho Kim. Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 3(1):1 13, 2017. Benjamin Sanchez-Lengeling and Al an Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360 365, 2018. Benjamin Sanchez-Lengeling, Carlos Outeiral, Gabriel L Guimaraes, and Alan Aspuru Guzik. Optimizing distributions over molecular space. an objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC). Chem Rxiv, 2017, 2017. Maarten PD Schadd, Mark HM Winands, H Jaap Van Den Herik, Guillaume MJ-B Chaslot, and Jos WHM Uiterwijk. Single-player monte-carlo tree search. In International Conference on Computers and Games, pages 1 12. Springer, 2008. Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604 610, 2018. Matthew J Streeter and Stephen F Smith. An asymptotically optimal algorithm for the max k-armed bandit problem. In AAAI, pages 135 142, 2006a. Matthew J Streeter and Stephen F Smith. A simple distribution-free approach to the max k-armed bandit problem. In International Conference on Principles and Practice of Constraint Programming, pages 560 574. Springer, 2006b. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Tsuyoshi Ueno, Trevor David Rhone, Zhufeng Hou, Teruyasu Mizoguchi, and Koji Tsuda. COMBO: an efficient Bayesian optimization library for materials science. Materials discovery, 4:18 21, 2016. Kikkawa and Ohno Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. David Weininger. SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31 36, Feb 1988. ISSN 0095-2338. doi: 10.1021/ci00057a005. URL https: //pubs.acs.org/doi/abs/10.1021/ci00057a005. Hironao Yamada, Chang Liu, Stephen Wu, Yukinori Koyama, Shenghong Ju, Junichiro Shiomi, Junko Morikawa, and Ryo Yoshida. Predicting materials properties with little data using shotgun transfer learning. ACS central science, 5(10):1717 1730, 2019. Xiufeng Yang, Jinzhe Zhang, Kazuki Yoshizoe, Kei Terayama, and Koji Tsuda. Chem TS: an efficient python library for de novo molecular generation. Science and technology of advanced materials, 18(1):972 976, 2017. Chin-yah Yeh. Isomer enumeration of alkanes, labeled alkanes, and monosubstituted alkanes. Journal of chemical information and computer sciences, 35(5):912 913, 1995. Marvin Zelen and Norman C. Sevelo. Probability functions. In Milton Abramowitz and Irene A Stegun, editors, Handbook of mathematical functions with formulas, graphs, and mathematical tables, chapter 26, pages 925 996. US Government printing office, 1968.