MFES-HB: Efficient Hyperband with Multi-Fidelity Quality Measurements

Yang Li,1,4 Yu Shen,1,4 Jiawei Jiang,2 Jinyang Gao,5 Ce Zhang,2 Bin Cui1,3
1 Key Laboratory of High Confidence Software Technologies (MOE), School of EECS, Peking University, Beijing, China
2 Department of Computer Science, Systems Group, ETH Zurich, Switzerland
3 Institute of Computational Social Science, Peking University (Qingdao), China
4 AI Platform, Kuaishou Technology, Beijing, China
5 Alibaba Group, Hangzhou, China
{liyang.cs, shenyu, bin.cui}@pku.edu.cn, {jiawei.jiang, ce.zhang}@inf.ethz.ch, jinyang.gjy@alibaba-inc.com

Abstract

Hyperparameter optimization (HPO) is a fundamental problem in automatic machine learning (AutoML). However, due to the expensive evaluation cost of models (e.g., training deep learning models or training models on large datasets), vanilla Bayesian optimization (BO) is typically computationally infeasible. To alleviate this issue, Hyperband (HB) utilizes an early stopping mechanism to speed up configuration evaluations by terminating badly-performing configurations in advance. This leads to two kinds of quality measurements: (1) many low-fidelity measurements for configurations that get early-stopped, and (2) few high-fidelity measurements for configurations that are evaluated without being early stopped. The state-of-the-art HB-style method, BOHB, aims to combine the benefits of both BO and HB. Instead of sampling configurations randomly as in HB, BOHB samples configurations based on a BO surrogate model, which is constructed with the high-fidelity measurements only. However, the scarcity of high-fidelity measurements greatly hampers the efficiency of BO in guiding the configuration search. In this paper, we present MFES-HB, an efficient Hyperband method that is capable of utilizing both the high-fidelity and low-fidelity measurements to accelerate the convergence of HPO tasks. Designing MFES-HB is not trivial, as the low-fidelity measurements can be biased yet informative for guiding the configuration search. We therefore propose to build a Multi-Fidelity Ensemble Surrogate (MFES) based on the generalized Product of Experts framework, which can integrate useful information from multi-fidelity measurements effectively. Empirical studies on real-world AutoML tasks demonstrate that MFES-HB can achieve 3.3-8.9x speedups over the state-of-the-art approach BOHB.

Introduction

The performance of Machine Learning (ML) models heavily depends on the specific choice of hyperparameter configurations. As a result, automatically tuning the hyperparameters has attracted lots of interest from both academia and industry, and has become an indispensable component in modern AutoML systems (Hutter, Kotthoff, and Vanschoren 2018; Yao et al. 2018; Zöller and Huber 2019). Hyperparameter optimization (HPO) is often a computationally intensive process, as one often needs to try hyperparameter configurations iteratively and evaluate each configuration by training and validating the corresponding ML model. However, for ML models that are computationally expensive to train (e.g., deep learning models or models trained on large datasets), vanilla Bayesian optimization (BO) (Hutter, Hoos, and Leyton-Brown 2011; Bergstra et al. 2011; Snoek, Larochelle, and Adams 2012), one of the most prevailing frameworks for solving the HPO problem, suffers from a low-efficiency issue due to insufficient configuration evaluations within a limited budget.
Instead, Hyperband (HB) (Li et al. 2018) is a popular alternative, which uses an early stopping strategy to speed up configuration evaluation. In HB, the system dynamically allocates resources to a set of configurations drawn from a uniform distribution, and uses successive halving (Jamieson and Talwalkar 2016) to early stop the poorly-performing configurations after measuring their quality periodically. Among many efforts to improve Hyperband (Klein et al. 2017b; Falkner, Klein, and Hutter 2018), BOHB (Falkner, Klein, and Hutter 2018) combines the benefits of both Hyperband and traditional BO. It replaces the uniform-distribution sampling procedure in HB with a BO surrogate model, TPE (Bergstra et al. 2011), which is fitted on the quality measurements obtained from evaluations that are not early stopped. Due to the successive halving strategy of HB, most configuration evaluations end up being early stopped by the system, thus creating two kinds of quality measurements: (1) many low-fidelity quality measurements of the early-stopped configurations, and (2) few high-fidelity quality measurements of configurations that are evaluated with the complete training resource. One fundamental limitation of BOHB lies in the fact that the BO component only utilizes the few high-fidelity measurements to sample configurations, which are insufficient to train a BO surrogate that models the objective function in HPO well. Consequently, like vanilla BO, sampling configurations in BOHB also suffers from the low-efficiency issue. Can we also take advantage of the low-fidelity quality measurements, which are ignored by the existing methods, to further speed up the Hyperband-style methods? In this paper, our goal is to investigate a new Hyperband-style method that is capable of utilizing the multi-fidelity quality measurements, i.e., both the high-fidelity and the low-fidelity measurements, to accelerate the HPO process.

(Opportunities and Challenges) Taking advantage of the multi-fidelity quality measurements poses unique opportunities and challenges. Intuitively, numerous low-fidelity measurements are obtained from the early-stopped evaluations, which can boost the total number of measurements that BO can use. The low-fidelity measurements, obtained with partial training resources, are biased relative to the high-fidelity quality measurements. Nevertheless, due to the relevancy between early-stopped evaluations and complete evaluations, they can still reveal some useful information about the objective function. Therefore, there is great potential in utilizing the low-fidelity measurements to accelerate HPO. However, if we cannot balance the benefits and biases of the low-fidelity measurements well, we might be misled by harmful information towards a wrong objective function. In this paper, we propose MFES-HB, an efficient Hyperband method, which is capable of utilizing the previously unexploited multi-fidelity measurements to significantly accelerate the convergence of configuration search.
To utilize the multi-fidelity measurements without introducing the biases from the low-fidelity measurements, we first train multiple base surrogates on these measurements, grouped by their fidelities. Then we propose the Multi-Fidelity Ensemble Surrogate (MFES), which is used in the BO framework to sample configurations. Concretely, to make the best use of the biased yet informative low-fidelity measurements, MFES uses the generalized Product of Experts (gPoE) (Cao and Fleet 2014) to combine these base surrogates, and the contribution of each base surrogate to MFES can be adjusted based on its performance in approximating the objective function. Therefore, the heterogeneous information among multi-fidelity measurements can be automatically extracted by MFES in a reliable yet efficient way. Empirical studies on real-world HPO tasks demonstrate the superiority of the proposed method over competitive baselines. MFES-HB can achieve an order-of-magnitude speedup compared with Hyperband, and 3.3-8.9x speedups over the state-of-the-art method BOHB.

Related Work

Bayesian Optimization (BO). Machine learning (ML) has made great strides in many application areas, e.g., recommendation, computer vision, etc. (Goodfellow, Bengio, and Courville 2016; He et al. 2017; Jiang et al. 2017; Ma et al. 2019; Wu et al. 2020; Zhang et al. 2020). BO has been successfully applied to tune the hyperparameters of ML models. The main idea of BO is to use a probabilistic surrogate model M: p_M(f | x) to describe the relationship between a hyperparameter configuration x and its performance f(x) (e.g., validation error), and then utilize this surrogate to guide the configuration search (see more details about BO in Section 3). Spearmint (Snoek, Larochelle, and Adams 2012) uses a Gaussian process (GP) to model p_M(f | x) as a Gaussian distribution, and TPE (Bergstra et al. 2011) employs tree-structured Parzen density estimators to model p_M(f | x). Lastly, SMAC (Hutter, Hoos, and Leyton-Brown 2011) adopts a modified random forest to yield an uncertainty estimate of f(x). An empirical evaluation of the three methods (Eggensperger et al. 2013) shows that SMAC performs best on benchmarks with high-dimensional and complex hyperparameter spaces that include categorical and conditional hyperparameters, closely followed by TPE. Spearmint only works well with low-dimensional continuous hyperparameters, and cannot be applied to complex configuration spaces easily.

The early stopping mechanism, which stops the evaluations of poorly-performing configurations early, has been discussed in many methods (Swersky, Snoek, and Adams 2014; Domhan, Springenberg, and Hutter 2015; Baker et al. 2017; Klein et al. 2017b; Falkner, Klein, and Hutter 2018; Wang, Xu, and Wang 2018; Bertrand et al. 2017; Dai et al. 2019), including Hyperband (HB) (Li et al. 2018) and BOHB (Falkner, Klein, and Hutter 2018). In Section 3, we describe HB in more detail. Among them, LCNET-HB (Klein et al. 2017b) utilizes LCNet, which predicts the learning curve of configurations, to sample configurations in HB. In this paper, we explore using the multi-fidelity quality measurements in the BO framework to further accelerate HB-style methods.

Multi-fidelity Optimization methods exploit low-fidelity measurements about the objective function f to guide the search for the optimum of f, by conducting cheap low-fidelity evaluations proactively instead of early stopping (Swersky, Snoek, and Adams 2013; Klein et al. 2017a; Kandasamy et al. 2017; Poloczek, Wang, and Frazier 2017; Hu et al. 2019; Sen, Kandasamy, and Shakkottai 2018; Wu et al. 2019b,a; Takeno et al. 2020). For instance, FABOLAS (Klein et al. 2017a) and TSE (Hu et al. 2019) evaluate configurations on subsets of the training data and use the generated low-fidelity measurements to infer the quality on the full training set.
Transfer Learning methods for HPO aim to take advantage of auxiliary knowledge acquired from past HPO tasks (source problems) to achieve a faster convergence for the current HPO task (target problem) (Bardenet et al. 2013; Yogatama and Mann 2014; Schilling et al. 2015; Wistuba, Schilling, and Schmidt-Thieme 2016; Schilling, Wistuba, and Schmidt-Thieme 2016; Golovin et al. 2017; Feurer, Letham, and Bakshy 2018). While sharing a similar idea, here we investigate speeding up HB-style methods by using the multi-fidelity measurements from the current HPO task, instead of the measurements from past similar tasks. Thus, our work is inspired by, but orthogonal to, the transfer learning-related methods.

Bayesian Hyperparameter Optimization and Hyperband

We model the loss f(x) (e.g., validation error), which reflects the quality of an ML algorithm with the given hyperparameter configuration x ∈ X, as a black-box optimization problem. The goal of hyperparameter optimization (HPO) is to find arg min_{x∈X} f(x), where the only mode of interaction with the objective function f is to evaluate the given configuration x. Due to the randomness of most ML algorithms, we assume that f(x) cannot be observed directly but rather through a noisy observation y = f(x) + ε, with ε ~ N(0, σ²).

Algorithm 1: Pseudo code for Hyperband.
Input: the maximum amount of resource that can be allocated to a single hyperparameter configuration R, the discard proportion η, and the hyperparameter space X.
1: Initialize s_max = ⌊log_η(R)⌋, B = (s_max + 1)R.
2: for s ∈ {s_max, s_max - 1, ..., 0} do
3:   n1 = ⌈(B/R) · η^s / (s + 1)⌉, r1 = R · η^(-s).
4:   Sample n1 configurations from X randomly.
5:   Execute SH with the n1 configurations and r1 as input (the inner loop).
6: end for
7: return the configuration with the best evaluation result.

[Figure omitted: validation error vs. training resource (epochs), showing 9 low-fidelity measurements at the 1st early stopping, 3 low-fidelity measurements at the 2nd early stopping, and 1 high-fidelity measurement.]
Figure 1: Successive halving process (the inner loop of HB) in tuning a neural network, where n1 = 9, r1 = 1, R = 9, and η = 3; one unit of resource corresponds to 36 epochs. First, 9 configurations are evaluated with 1 unit of resource. Then the top 3 configurations continue their evaluations with 3 units of resource. Finally, only one configuration is evaluated with the maximum training resource R.

We now introduce two methods for solving this black-box optimization problem in more detail: Bayesian optimization and Hyperband, which are the basic ingredients of MFES-HB.

Bayesian Optimization

The main idea of Bayesian optimization (BO) is as follows. Since evaluating the objective function f for a given configuration x is very expensive, BO approximates f using a probabilistic surrogate model M: p(f | D) that is much cheaper to evaluate. Given a configuration x, the surrogate model M outputs the posterior predictive distribution at x, that is, f(x) ~ N(µ_M(x), σ²_M(x)). In the nth iteration, BO methods iterate the following three steps: (1) use the surrogate model M to select a configuration that maximizes the acquisition function, x_n = arg max_{x∈X} a(x; M), where the acquisition function is used to balance exploration and exploitation; (2) evaluate the configuration x_n to obtain its performance y_n = f(x_n) + ε with ε ~ N(0, σ²); (3) add this measurement (x_n, y_n) to the observed quality measurements D = {(x_1, y_1), ..., (x_{n-1}, y_{n-1})}, and refit the surrogate model on the augmented D. Expected improvement (EI) (Jones, Schonlau, and Welch 1998) is a common acquisition function:

a(x; M) = ∫ max(y* - y, 0) p_M(y | x) dy,    (1)

where y* is the best observed performance in D, i.e., y* = min{y_1, ..., y_{n-1}}, and M is the probabilistic surrogate model. By maximizing this EI function a(x; M) over the hyperparameter space X, BO methods can find the configuration with the largest EI value to evaluate in each iteration.
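To make the EI criterion in Eq. 1 concrete, the following is a minimal sketch (our own illustration, not code from the paper) of the closed-form EI for a Gaussian posterior under minimization; the `surrogate.predict` interface returning (mean, std) arrays is an assumption made for the example.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # Closed-form EI (Eq. 1) for minimization, given the surrogate's
    # Gaussian posterior mean/std at a batch of configurations.
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def select_next_config(surrogate, candidates, y_best):
    # Pick the candidate with the largest EI value; `surrogate.predict`
    # is assumed to return (mean, std) arrays for the candidate set.
    mu, sigma = surrogate.predict(candidates)
    ei = expected_improvement(mu, sigma, y_best)
    return candidates[int(np.argmax(ei))]
```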
Low-efficiency issue. One fundamental challenge of BO is that, for models that are computationally expensive to train, each complete evaluation of a configuration x often takes a significant amount of time. Given a limited budget, only a few measurements can be obtained, which is insufficient for BO methods to fit a surrogate that approximates f well. In this case, BO methods fail to converge to the optimal solution quickly (Wang et al. 2013; Li et al. 2020).

Hyperband

To accommodate the above issue of BO, Hyperband (HB) (Li et al. 2018) proposes to speed up configuration evaluations by early stopping the badly-performing configurations. It has the following two loops:

(1) Inner loop: successive halving (SH). Given a kind of training resource (e.g., the number of iterations, the size of the training subset), HB first evaluates n1 hyperparameter configurations with the initial r1 units of resource, and ranks them by their evaluation performance. Then HB continues the top 1/η of the configurations with η times larger resources (usually η = 3), that is, n2 = n1 · η^(-1) and r2 = r1 · η, and stops the evaluations of the other configurations in advance. This process repeats until the maximum training resource R is reached, that is, r_i = R. We provide an example to illustrate this procedure in Figure 1.

(2) Outer loop: grid search of n1 and r1. Given a fixed budget B, the values of n1 and r1 should be carefully chosen, because a larger n1 with a small initial training resource r1 may lead to good configurations being eliminated by mistake during the SH process. There is no prior knowledge of whether we should use a larger n1 with a small initial training resource r1, or a smaller n1 with a larger r1. HB addresses this "n versus B/n" problem by performing a grid search over feasible values of n1 and r1 in the outer loop. Algorithm 1 shows the enumeration of n1 and r1 in Line 3. Table 1 lists the number of evaluations and their corresponding training resources in one iteration of HB.

i    s = 4      s = 3      s = 2      s = 1      s = 0
1    (81, 1)    (27, 3)    (9, 9)     (6, 27)    (5, 81)
2    (27, 3)    (9, 9)     (3, 27)    (2, 81)
3    (9, 9)     (3, 27)    (1, 81)
4    (3, 27)    (1, 81)
5    (1, 81)

Table 1: The values of n_i and r_i in the HB evaluations. Here R = 81 and η = 3. Each column displays an inner loop (SH process). The pair (n_i, r_i) in each cell means there are n_i configuration evaluations with r_i units of training resource.

Note that the HB algorithm can be called multiple times until the HPO budget is exhausted.
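The outer and inner loops above can be summarized with a short sketch (a simplified illustration of Algorithm 1, not the authors' implementation; `sample_config` and `evaluate(config, resource)` are hypothetical callables supplied by the user, and rounding details are approximate).

```python
import math

def hyperband(sample_config, evaluate, R=81, eta=3):
    # One iteration of Hyperband: enumerate brackets (outer loop) and
    # run successive halving (SH) inside each bracket (inner loop).
    s_max = int(math.floor(math.log(R, eta)))
    B = (s_max + 1) * R                       # budget allotted to each bracket
    best_config, best_error = None, float("inf")

    for s in range(s_max, -1, -1):            # outer loop: grid over (n1, r1)
        n = int(math.ceil((B / R) * eta ** s / (s + 1)))
        r = R * eta ** (-s)
        configs = [sample_config() for _ in range(n)]

        for i in range(s + 1):                # inner loop: successive halving
            n_i = max(1, int(n * eta ** (-i)))
            r_i = r * eta ** i                # r_i reaches R in the last round
            scored = sorted(((evaluate(c, r_i), c) for c in configs[:n_i]),
                            key=lambda t: t[0])   # smaller error is better
            if scored[0][0] < best_error:
                best_error, best_config = scored[0]
            configs = [c for _, c in scored[: max(1, n_i // eta)]]
    return best_config, best_error
```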
The problem of HB and BOHB. The disadvantage of HB lies in that it samples configurations randomly from the uniform distribution. To improve it, BOHB utilizes a BO component to sample configurations in each HB iteration, instead of the uniform distribution. However, BOHB also suffers from the low-efficiency issue due to the scarcity of high-fidelity quality measurements. Consequently, BOHB does not fully unleash the potential of combining BO and HB.

Efficient Hyperband using Multi-Fidelity Quality Measurements

In this section, we introduce MFES-HB, an efficient Hyperband method that can utilize the multi-fidelity quality measurements in the framework of BO to speed up the convergence of configuration search. First, we investigate the characteristics of multi-fidelity measurements, and then describe the proposed Multi-Fidelity Ensemble Surrogate (MFES), which is capable of extracting instrumental information from multi-fidelity measurements effectively.

Multi-fidelity Quality Measurements

According to the amount of training resource used by the evaluations in HB, we can categorize the multi-fidelity measurements into K groups: D_1, ..., D_K, where K = ⌊log_η(R)⌋ + 1, and typically K is less than 7. A (quality) measurement (x, y) in group D_i with i ∈ [1 : K] is obtained by evaluating x with r_i = η^(i-1) units of resource. Thus D_K denotes the high-fidelity measurements from the evaluations with the maximum training resource r_K = R, and D_{1:K-1} denote the low-fidelity measurements obtained from the early-stopped evaluations. We then discuss the characteristics of D_{1:K} from two aspects:

(1) The number of measurements. Due to the successive halving strategy, the number of measurements in D_i, i.e., N_i = |D_i|, satisfies N_1 > N_2 > ... > N_K. Here, N_i denotes the total number of measurements in D_i with training resource r_i. Table 1 shows the N_i in one iteration of HB, that is, N_1 = 81, N_2 = 54, N_3 = 27, N_4 = 15, and N_5 = 10.

(2) The fidelities of measurements. The high-fidelity measurements D_K consist of unbiased measurements of the objective function f. The other D_i, the low-fidelity measurements, are composed of biased measurements of f. The BO surrogate model M_i, fitted on D_i with i < K, models the objective function f^i under training resource r_i, instead of the true objective function f = f^K under the maximum training resource R. Although f^1, ..., f^(K-1) are different from f, they are correlated with it. As i increases, the surrogate M_i, learned on the measurements D_i with a larger training resource r_i = η^(i-1), offers a higher-fidelity approximation to f because r_i is closer to R.
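As a small illustration of this grouping (our own sketch under stated assumptions, not code from the paper), the evaluation records produced by the SH runs can be bucketed by their resource level into D_1, ..., D_K:

```python
import math
from collections import defaultdict

def group_by_fidelity(records, R=27, eta=3):
    # Bucket (config, resource, error) records into D_1, ..., D_K, where a
    # record evaluated with r_i = eta**(i - 1) units of resource goes to D_i.
    K = int(math.floor(math.log(R, eta))) + 1
    groups = defaultdict(list)
    for config, resource, error in records:
        i = int(round(math.log(resource, eta))) + 1
        groups[i].append((config, error))
    # D_K holds the high-fidelity measurements; D_1..D_{K-1} are low-fidelity.
    return [groups.get(i, []) for i in range(1, K + 1)]
```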
Figure 2 provides a brief example to illustrate the diversity of the measurement fidelity in D_{1:K}. The quality measurements are visualized as heat maps, where good configurations with low validation errors are marked by the yellow region.

[Four heat maps omitted: validation error over the (learning rate, keep probability) grid for (a) D1: r1 = 1, (b) D2: r2 = 3, (c) D3: r3 = 9, (d) D4: r4 = 27.]
Figure 2: Validation error of 900 LeNet configurations (30 settings of the keep probability λ = 1 - dropout value in the dropout layer, and 30 settings of the learning rate α on a base-10 log scale in [-7, -2]) on the MNIST dataset using different training resources r = 1, 3, 9, 27. R = 27, K = 4, and one unit of resource corresponds to one epoch.

By comparing the yellow regions in each sub-figure, we can find that, as i ∈ [1 : 3] increases, the measurements in D_i with partial training resource r_i = η^(i-1) gradually approach the (unbiased) high-fidelity measurements in D_K, where K = 4. Hence we can conclude that (1) although D_{1:K-1} include biased measurements of f, they can still reveal some instrumental information for modeling f; and (2) the group of quality measurements D_i that offers a higher-fidelity approximation to f has a smaller number of measurements N_i.

The Proposed Algorithm

In MFES-HB, we train K base surrogates on D_{1:K} respectively. (1) D_K offers the highest fidelity when modeling f; however, the measurements in D_K are insufficient to train a BO surrogate that describes f well. (2) Although D_{1:K-1} contain a much larger number of quality measurements, the biased low-fidelity measurements in D_{1:K-1} cannot approximate f accurately. Thus none of the base surrogates alone can approximate f well. Instead, we propose to combine the base surrogates to obtain a more accurate approximation to f. However, combining the base surrogates is not trivial, as we need to integrate the heterogeneous information behind the base surrogates in a reliable and effective way. Since the performance values y in the D_i have different numerical ranges, we standardize each group by removing the mean and scaling to unit variance. In BO, the uncertainty prediction of the surrogate M_i at x is a Gaussian, i.e., f^i(x) ~ N(µ_Mi(x), σ²_Mi(x)). For brevity, we use µ_i(x) and σ²_i(x) to denote the mean and variance of the predictions from M_i.

Algorithm 2: Pseudo code of MFES-HB.
Input: the hyperparameter space X, the total budget to conduct HPO B_hpo, the maximum amount of resource for a hyperparameter configuration R, and the discard proportion η.
Output: the best configuration found.
1: Initialize: D_i = ∅ for i ∈ [1 : K], M_ens = None, s_max = ⌊log_η(R)⌋, B = (s_max + 1)R.
2: while the budget B_hpo is not exhausted do
3:   for s ∈ {s_max, s_max - 1, ..., 0} do
4:     n1 = ⌈(B/R) · η^s / (s + 1)⌉, r1 = R · η^(-s).
5:     Sample n1 configurations: X' = Sample(X, M_ens, n1).
6:     Execute the SH (inner loop) of HB with X' and r1 as input, and collect the new multi-fidelity quality measurements: D'_{1:K} = SH(X', r1).
7:     D_i = D_i ∪ D'_i for i ∈ [1 : K].
8:     Update the MFES: M_ens = Build(D_{1:K}).
9:   end for
10: end while
11: return the best configuration x* in D_K.

Ensemble Surrogate with gPoE

To fully utilize the biased yet informative low-fidelity measurements, we propose the Multi-Fidelity Ensemble Surrogate (MFES) M_ens, which can integrate the information from all base surrogates to approximate f effectively. Concretely, a weight w_i is assigned to each base surrogate M_i, which determines the contribution of M_i to the ensemble surrogate M_ens, where w_i ∈ [0, 1] and Σ_{i=1}^{K} w_i = 1. The base surrogate M_i that offers a more accurate approximation to f should own a larger proportion (larger w_i) in M_ens, and vice versa. Thus the weights reflect the influence of the measurements with different fidelities on M_ens. Below, we describe the way to combine the base surrogates with weights. To enable this ensemble surrogate in the BO framework, given a configuration x, its posterior prediction at x should also be a Gaussian, i.e., f_ens(x) ~ N(µ_ens(x), σ²_ens(x)). To derive the mean and variance, we need to combine the predictions from the base surrogates. The most straightforward solution is to use µ_ens(x) = Σ_i w_i µ_i(x) and σ²_ens(x) = Σ_i w_i² σ²_i(x), by assuming that the predictions from the base surrogates are independent.
However, this assumption contradicts the fact that these predictive distributions are correlated, as discussed before. Instead, we propose to use the generalized product of experts (gPoE) (Cao and Fleet 2014) framework to combine the predictions from M_{1:K}. The predictive mean and variance of the ensemble surrogate M_ens at x are given by:

µ_ens(x) = (Σ_i w_i σ_i^(-2)(x) µ_i(x)) · σ²_ens(x),
σ²_ens(x) = (Σ_i w_i σ_i^(-2)(x))^(-1),    (2)

where w_i is the weight of the ith base surrogate, which is used to control the influence of the individual surrogate M_i. Using gPoE has the following two advantages: (1) the unified prediction is still a Gaussian; (2) unreliable predictions are automatically filtered out of the ensemble surrogate.

Weight Calculation Method

In this section, we propose a heuristic method to compute the weight of each base surrogate. As mentioned above, the value of w_i should be proportional to the performance of M_i in approximating f. We measure the approximation performance of M_i with respect to f on the high-fidelity measurements D_K using a pairwise ranking loss. In HPO, the ranking loss is more reasonable than the mean squared error: the actual value of the prediction does not matter, and we only care about the partial ordering over configurations. We define the ranking loss as the number of mis-ranked pairs:

L(M_i; D_K) = Σ_{j=1}^{N_K} Σ_{k=1}^{N_K} 1((µ_i(x_j) < µ_i(x_k)) ⊕ (y_j < y_k)),    (3)

where ⊕ is the exclusive-or operator, N_K is the number of measurements in D_K, and (x_j, y_j) is a measurement in D_K. Further, for each M_i, we can calculate the percentage of order-preserving pairs by p_i = 1 - L(M_i; D_K) / N_pairs, where N_pairs is the number of measurement pairs in D_K. Finally, we apply the following weight discrimination operator to obtain the weight: w_i = p_i^θ / Σ_{k=1}^{K} p_k^θ, where θ ∈ N and it is set to 3 in our experiments. Since p_i ∈ [0, 1], this operator has a discriminative scaling effect on the different p_i: it (1) further decreases the weights of bad surrogates, and (2) increases the weights of good surrogates.

Algorithm 3: Pseudo code for Sample in MFES-HB.
Input: the hyperparameter space X, the fraction of random configurations ρ, the MFES M_ens, the number of random configurations N_s used to optimize EI, and the evaluation measurements D_{1:K}.
Output: the next configuration to evaluate.
1: if rand() ≤ ρ or M_ens = None, then return a random configuration.
2: Draw N_s configurations randomly and compute their acquisition (EI) values according to the EI criterion in Eq. 1, where M_ens is used as the surrogate model M.
3: return the configuration with the largest EI value.

For the base surrogates M_{1:K-1}, the ranking loss in Eq. 3 can measure their ability to approximate f, i.e., their generalization performance. However, for the surrogate M_K trained on D_K directly, this loss is an estimate of the in-sample error and cannot reflect generalization. To measure the generalization of M_K, we adopt a cross-validation strategy. First, we train N_K leave-one-out surrogates M_K^(-i) on D_K with the measurement (x_i, y_i) removed. Then the ranking loss for M_K can be computed by L(M_K) = Σ_{j=1}^{N_K} Σ_{k=1}^{N_K} 1((µ_K^(-j)(x_j) < µ_K^(-j)(x_k)) ⊕ (y_j < y_k)). In practice, when N_K is larger than 5, we use 5-fold cross-validation to compute L(M_K). In the beginning, w_K is set to 0, and w_i = 1/(K-1) for i ∈ [1 : K-1]. This means that we rely more on the low-fidelity information when no high-fidelity information is available. When |D_K| ≥ 3, the weights are calculated according to the proposed method.
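The following is a compact sketch of Eq. 2 and Eq. 3 (our own illustration, not the authors' code; each base surrogate is assumed to expose a `predict(x)` method returning a Gaussian mean and variance).

```python
import numpy as np

def gpoe_predict(x, surrogates, weights):
    # Combine base surrogate predictions with gPoE (Eq. 2):
    # precision-weighted fusion whose result is still Gaussian.
    mus = np.array([m.predict(x)[0] for m in surrogates])
    vars_ = np.array([m.predict(x)[1] for m in surrogates])
    prec = np.sum(weights / vars_)            # sum_i w_i * sigma_i^{-2}
    var_ens = 1.0 / prec
    mu_ens = var_ens * np.sum(weights * mus / vars_)
    return mu_ens, var_ens

def ranking_loss(surrogate, X_high, y_high):
    # Number of mis-ranked pairs on the high-fidelity set D_K (Eq. 3).
    mu = np.array([surrogate.predict(x)[0] for x in X_high])
    loss = 0
    for j in range(len(y_high)):
        for k in range(len(y_high)):
            loss += int((mu[j] < mu[k]) != (y_high[j] < y_high[k]))
    return loss

def compute_weights(losses, n_pairs, theta=3):
    # Weight discrimination operator: w_i proportional to p_i**theta,
    # with p_i = 1 - L_i / N_pairs.
    p = 1.0 - np.asarray(losses, dtype=float) / n_pairs
    w = p ** theta
    return w / w.sum()
```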
Putting It All Together

Algorithm 2 illustrates the pseudo code of MFES-HB. Before executing each SH run (the inner loop) of HB, the method utilizes the proposed MFES to sample n1 configurations (Line 5) according to the Sample procedure in Algorithm 3. After SH ends (Line 6), each D_i is augmented with the new measurements D'_i (Line 7). Then the method utilizes D_{1:K} to build the MFES (Line 8). The function Build(D_{1:K}) includes the following three steps: (1) refit each base surrogate M_i on the augmented D_i; (2) calculate the weight w_i for each surrogate; and (3) use gPoE to combine the base surrogates. Finally, the best configuration in D_K is returned (Line 11).

Discussions

Novelty. MFES-HB is the first method that explores combining the benefits of both HB and multi-fidelity Bayesian optimization (MFBO). The state-of-the-art BOHB only uses the high-fidelity measurements to guide configuration selection, and it suffers from the low-efficiency issue due to scarce high-fidelity measurements. To alleviate this issue, we propose to utilize the massive low-fidelity measurements. However, utilizing the massive and cheap low-fidelity measurements in an effective and efficient manner is not trivial, and we need to balance the fidelity versus number-of-measurements trade-off introduced in Section 3. We therefore propose to build an ensemble surrogate, which can leverage the useful information from the multi-fidelity measurements to guide the configuration search.

Convergence Analysis. (1) When the high-fidelity measurements become sufficient, i.e., N_K is larger than a threshold, w_K is set to 1 in MFES-HB. Therefore, the convergence result of MFES-HB will be no worse than that of the state-of-the-art BOHB. (2) MFES-HB also samples a constant fraction ρ of the configurations randomly (Line 1 in Algorithm 3), so the theoretical guarantee of HB still holds for MFES-HB. We provide a detailed analysis of the two guarantees in Appendix A.1 of the supplemental materials.

POGPE (Schilling, Wistuba, and Schmidt-Thieme 2016) also uses a similar product of Gaussian process (GP) experts to combine GP-based surrogates. However, it is trained on measurements from past HPO tasks, while the measurements in MFES-HB are obtained from the current HPO task. Moreover, the weight in POGPE is set to a constant w_i = 1/K.

Multi-fidelity Bayesian optimization (MFBO) methods can accelerate HPO by conducting low-fidelity evaluations proactively. However, since most MFBO methods (Swersky, Snoek, and Adams 2013; Klein et al. 2017a) use a GP as the surrogate model, (1) they cannot support complex configuration spaces easily; (2) most MF-based methods only support a particular type of training resource, as they often rely on customized optimization structures. MFES-HB, which inherits the advantages of HB, can support all resource types, e.g., the number of iterations (epochs) for an iterative algorithm, the size of a dataset subset, the number of steps in an MCMC chain, etc. In contrast, many MFBO methods are designed to work with one kind of training resource; e.g., FABOLAS and TSE only support (dataset) subsets as resources, and LCNET-HB only supports epochs (#iterations) as resources. (3) These methods are intrinsically sequential and difficult to parallelize. MFES-HB does not have the above three limitations, since it 1) uses a probabilistic random forest-based surrogate and 2) inherits the easy-to-parallelize merit of HB. In the following section, we evaluate the proposed method and discuss more advantages of MFES-HB.
Experiments and Results

To evaluate the proposed MFES-HB, we apply it to tune hyperparameters on several real-world AutoML tasks. Compared with other methods, the four main insights about MFES-HB that we investigate are as follows:
- Using low-fidelity quality measurements brings benefits.
- The proposed MFES can effectively utilize the multi-fidelity quality measurements from HB evaluations.
- MFES-HB can greatly accelerate HPO tasks. It reaches a similar performance to other methods while spending much less time.
- MFES-HB has the following advantages: scalability, generality, flexibility, and practicability in AutoML systems.

Experiment Settings

Baselines. We compare MFES-HB with the following eight baselines: (1) HB: the original Hyperband (Li et al. 2018); (2) BOHB (Falkner, Klein, and Hutter 2018): a model-based Hyperband that uses a TPE-based surrogate fitted on the high-fidelity measurements to sample configurations; (3) LCNET-HB (Klein et al. 2017b): also a model-based Hyperband, which utilizes LCNet to predict the learning curve of configurations when sampling configurations in HB; (4) SMAC (Hutter, Hoos, and Leyton-Brown 2011): a widely-used BO method with high-fidelity evaluations; (5) SMAC-ES (Domhan, Springenberg, and Hutter 2015): a variant of SMAC with learning curve extrapolation-based early stopping; (6) Batch-BO (González et al. 2016): a parallel BO method that conducts a batch of high-fidelity evaluations concurrently; (7) FABOLAS (Klein et al. 2017a) and (8) TSE (Hu et al. 2019): multi-fidelity optimization methods that utilize cheap fidelities of f on subsets of the training data. Besides the two compared MF methods (FABOLAS and TSE), we also consider the following three methods: MF-MES (Takeno et al. 2020), taKG (Wu et al. 2019a), and MTBO (Swersky, Snoek, and Adams 2013).

HPO Tasks. Table 2 describes the four real-world AutoML HPO tasks in our experiments. For example, in FCNet, we optimize 10 hyperparameters on MNIST; the resource type is the number of iterations; one unit of resource corresponds to 0.5 epoch, and the total HPO budget B_hpo for each baseline is 5 hours. In addition, to investigate the practicability and performance of MFES-HB in AutoML systems, we also implement MFES-HB in Auto-Sklearn (AUSK) (Feurer et al. 2015) and compare it with the built-in HPO method SMAC in AUSK on 10 public datasets. More details about the configuration space X and the datasets (12 in total) can be found in Appendix A.2 and A.3, respectively.

Task       Dataset       |X|   R    B_hpo   Resource Type   Unit
FCNet      MNIST         10    81   5h      #Iterations     0.5 epoch
ResNet     CIFAR-10      6     81   28h     #Iterations     2 epochs
XGBoost    Covtype       8     27   7.5h    Data Subset     1/27 #samples
AutoML     10 datasets   110   27   4h      Data Subset     1/27 #samples

Table 2: Four real-world HPO tasks. |X| is the number of hyperparameters in X; R is the maximum training resource; B_hpo is the total HPO budget.

[Figure omitted: average validation error vs. wall clock time (s), and weight value vs. iteration. Panels: (a) feasibility analysis comparing HB (no BO surrogate used), BOHB (high-fidelity surrogate only), MFES with single best surrogate, MFES with equal weight, and MFES (proposed surrogate); (b) weight update of w1-w5 across iterations; (c) gPoE and parameter check for θ = 1, 2, 3, 4 vs. BOHB and LC-idp (θ = 3).]
Figure 3: Optimizing 10 hyperparameters of FCNet on MNIST.
[Figure omitted: average validation error vs. wall clock time (s) for (a) results on FCNet (SMAC, SMAC-ES, HB, BOHB, LCNET-HB, MFES-HB) and (b) AutoML on Letter (AUSK(SMAC) vs. AUSK(MFES-HB)).]
Figure 4: Results for optimizing FCNet on MNIST (sequential) and AutoML on Letter (parallel).

Dataset Split, Metric and Parameter Setting. In each experiment, we randomly divided 20% of the training dataset to form the validation set, tracked the wall clock time (including optimization overhead and evaluation cost), and stored the lowest validation error after each evaluation. All methods are repeated 10 times with different random seeds, and the mean (± std) validation errors across runs are plotted. Moreover, the best models found by the baselines are applied to the test dataset, and the test errors are reported. All methods are discussed based on two metrics: (1) the time taken to reach the same validation error, and (2) the test performance. As recommended by HB and BOHB, η is set to 3 for the HB-based methods, and ρ = 0.2. In MFES-HB, we implement the MFES based on the probabilistic random forest from the SMAC package; the parameter θ used in the weight discrimination operator is set to 3. Figure 3(c) depicts the sensitivity analysis for θ. The same parallel mechanism as in BOHB is adopted in the parallel experiments. More details about the experiment settings, hardware, the parameter settings of the baselines, and their implementations (including source code) are available in Appendices A.4, A.5, and A.6.

Empirical Analysis

Figures 3 and 4(a) illustrate the results on FCNet, where we investigate (1) the feasibility of using the low-fidelity measurements and (2) the effectiveness of MFES. Figures 5 and 4(b) show the results on the four HPO tasks, where we study the efficiency of MFES-HB. Table 3 lists the test results. Below, we verify the four insights based on these results.

Using low-fidelity quality measurements brings benefits. Figure 3(a) shows the results on FCNet of (1) different versions of MFES that utilize the multi-fidelity measurements and (2) BOHB, which uses the high-fidelity measurements only. "MFES with single best surrogate" means that the base surrogate with the smallest ranking loss defined in Eq. 3 is used to sample configurations in HB. "MFES with equal weight" means that, for each surrogate M_i, w_i = 1/K. MFES refers to the proposed ensemble surrogate with the ranking loss-based weight calculation method. We observe that using low-fidelity measurements brings benefits, achieving a faster convergence in HPO.

MFES can exploit the multi-fidelity measurements effectively. From Figure 3(a), we find that the two variants (MFES with the single best surrogate and MFES with equal weights) cannot beat the proposed MFES method using gPoE. As shown in Figure 3(c), gPoE is more reasonable and effective in combining the base surrogates than the linear combination under the independence assumption (the LC-idp curve). Based on the above results, the proposed MFES, which combines base surrogates using gPoE with the ranking loss-based weight calculation technique, is an effective surrogate for utilizing the multi-fidelity measurements. To further investigate the weight update process, the values of the w_i across iterations are illustrated in Figure 3(b).
The surrogates trained on the lowest-fidelity measurements and on the scarce high-fidelity measurements have relatively smaller weights, while the surrogates trained on medium-fidelity measurements own larger weights. Figure 3(c) depicts the parameter sensitivity check for θ. Finally, Figure 4(a) shows the HPO results of all methods on FCNet, where MFES-HB obtains a (more than) 5x speedup over the baselines.

MFES-HB can accelerate HPO tasks. Figures 5 and 4(b) depict the empirical results on the four HPO tasks, where the tasks on FCNet and ResNet are conducted in parallel settings. MFES-HB spends less time than the compared methods to obtain a sub-optimal performance. Concretely, MFES-HB achieves a validation error of 7.5% on FCNet within 0.75 hours, 7.3% on ResNet within 4.3 hours, and 3.5% on XGBoost within 2.25 hours. To reach the same results, the other methods take about 5 hours on FCNet, 13.9 hours on ResNet, and 7.5 hours on XGBoost. MFES-HB thus achieves 3.2x to 6.7x speedups in finding a similar configuration. In particular, MFES-HB achieves 4.05x to 10.1x speedups over Hyperband, and 3.3x to 8.9x speedups over the state-of-the-art BOHB. Moreover, Table 3(a) shows that the configuration found by MFES-HB obtains the best test performance.

[Figure omitted: average validation error vs. wall clock time (s) for (a) FCNet (SMAC, SMAC-ES, HB, BOHB, LCNET-HB, MFES-HB, Batch-BO), (b) ResNet (BOHB, LCNET-HB, MFES-HB, Batch-BO), and (c) XGBoost (BOHB, FABOLAS, MFES-HB, TSE).]
Figure 5: Results for optimizing FCNet on MNIST, ResNet on CIFAR-10, and XGBoost on Covtype.

(a) Test results of baselines

Method      FCNet   ResNet   XGB
SMAC        7.63    9.10     3.52
SMAC-ES     7.49    8.37     -
Batch-BO    7.47    7.98     3.00
HB          7.55    8.40     3.56
LCNET-HB    7.49    8.21     -
BOHB        7.48    8.10     3.16
FABOLAS     -       -        3.40
TSE         -       -        3.12
MFES-HB     7.38    7.49     2.65

(b) Results on 10 AutoML datasets

Dataset       AUSK    AUSK(new)
MNIST         3.39    2.15
Letter        3.85    3.44
Higgs         26.84   26.79
Electricity   6.18    6.21
Kropt         19.84   13.08
Mv            0.03    0.01
Poker         12.91   4.30
Fried         6.60    6.62
A9a           17.23   17.09
2dplanes      6.59    6.41

Table 3: Mean test errors (%) of the compared baselines. In Table (a), since SMAC-ES & LCNET-HB depend on training iterations and FABOLAS & TSE only work with sample size, "-" marks the invalid cases. In Table (b), AUSK(new) represents Auto-Sklearn using MFES-HB as its HPO optimizer.
In addition, MFES-HB supports (1) complex hyperparameter space by using the BO component from SMAC (generality), and (2) HPO with different resource types (flexibility); while most multi-fidelity optimization approaches only support one type of resource, and cannot be extended to the other types easily. Finally, on 10 datasets, we compared the performance of MFES-HB with the built-in HPO algorithm (SMAC) in Auto-Sklearn (AUSK). Figure 4(b) shows the results on dataset Letter, and Table 3 (b) demonstrates its practicability and effectiveness in Auto ML system. In this paper, we introduced MFES-HB, an efficient Hyperband method that utilizes multi-fidelity quality measurements to accelerate HPO tasks. The multi-fidelity ensemble surrogate is proposed to integrate quality measurements with multiple fidelities effectively. We evaluated MFES-HB on a wide range of Auto ML HPO tasks, and demonstrated its superiority over the competitive approaches. Acknowledgments This work is supported by the National Key Research and Development Program of China (No.2018YFB1004403), NSFC (No.61832001, 61702015, 61702016), Beijing Academy of Artificial Intelligence (BAAI), and Kuaishou PKU joint program. Bin Cui is the corresponding author. References Baker, B.; Gupta, O.; Raskar, R.; and Naik, N. 2017. Practical neural network performance prediction for early stopping. ar Xiv preprint ar Xiv:1705.10823 2(3): 6. Bardenet, R.; Brendel, M.; K egl, B.; and Sebag, M. 2013. Collaborative hyperparameter tuning. In International Conference on Machine Learning, 199 207. Bergstra, J. S.; Bardenet, R.; Bengio, Y.; and K egl, B. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, 2546 2554. Bertrand, H.; Ardon, R.; Perrot, M.; and Bloch, I. 2017. Hyperparameter optimization of deep neural networks: Combining hyperband with Bayesian model selection. In Conf erence sur l Apprentissage Automatique. Cao, Y.; and Fleet, D. J. 2014. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. ar Xiv preprint ar Xiv:1410.7827 . Dai, Z.; Yu, H.; Low, B. K. H.; and Jaillet, P. 2019. Bayesian Optimization Meets Bayesian Optimal Stopping 1496 1506. Domhan, T.; Springenberg, J. T.; and Hutter, F. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI, volume 15, 3460 8. Eggensperger, K.; Feurer, M.; Hutter, F.; Bergstra, J.; Snoek, J.; Hoos, H.; and Leyton-Brown, K. 2013. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, 3. Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In Proceedings of the 35th International Conference on Machine Learning, 1436 1445. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; and Hutter, F. 2015. Efficient and robust automated machine learning. In Advances in neural information processing systems, 2962 2970. Feurer, M.; Letham, B.; and Bakshy, E. 2018. Scalable metalearning for bayesian optimization using ranking-weighted gaussian process ensembles. In Auto ML Workshop at ICML. Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.; and Sculley, D. 2017. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1487 1495. ACM. 
González, J.; Dai, Z.; Hennig, P.; and Lawrence, N. 2016. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, 648-657.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182.

Hu, Y.-Q.; Yu, Y.; Tu, W.-W.; Yang, Q.; Chen, Y.; and Dai, W. 2019. Multi-Fidelity Automatic Hyper-Parameter Tuning via Transfer Series Expansion. In AAAI.

Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, 507-523. Springer.

Hutter, F.; Kotthoff, L.; and Vanschoren, J., eds. 2018. Automated Machine Learning: Methods, Systems, Challenges. Springer. In press, available at http://automl.org/book.

Jamieson, K.; and Talwalkar, A. 2016. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 240-248.

Jiang, J.; Jiang, J.; Cui, B.; and Zhang, C. 2017. TencentBoost: a gradient boosting tree system with parameter server. In 2017 IEEE 33rd ICDE, 281-284. IEEE.

Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4): 455-492.

Kandasamy, K.; Dasarathy, G.; Schneider, J.; and Poczos, B. 2017. Multi-fidelity Bayesian optimisation with continuous approximations. arXiv preprint arXiv:1703.06240.

Klein, A.; Falkner, S.; Bartels, S.; Hennig, P.; and Hutter, F. 2017a. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 528-536.

Klein, A.; Falkner, S.; Springenberg, J. T.; and Hutter, F. 2017b. Learning curve prediction with Bayesian neural networks. In Proceedings of the International Conference on Learning Representations.

Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; and Talwalkar, A. 2018. Hyperband: A novel bandit-based approach to hyperparameter optimization. Proceedings of the International Conference on Learning Representations, 1-48.

Li, Y.; Jiang, J.; Gao, J.; Shao, Y.; Zhang, C.; and Cui, B. 2020. Efficient Automatic CASH via Rising Bandits. In AAAI, 4763-4771.

Ma, J.; Wen, J.; Zhong, M.; Chen, W.; and Li, X. 2019. MMM: Multi-source Multi-net Micro-video Recommendation with Clustered Hidden Item Representation Learning. Data Science and Engineering 4(3): 240-253.

Poloczek, M.; Wang, J.; and Frazier, P. 2017. Multi-information source optimization. In Advances in Neural Information Processing Systems, 4288-4298.

Schilling, N.; Wistuba, M.; Drumond, L.; and Schmidt-Thieme, L. 2015. Hyperparameter optimization with factorized multilayer perceptrons. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 87-103. Springer.

Schilling, N.; Wistuba, M.; and Schmidt-Thieme, L. 2016. Scalable hyperparameter optimization with products of Gaussian process experts. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 33-48. Springer.

Sen, R.; Kandasamy, K.; and Shakkottai, S. 2018. Noisy Blackbox Optimization with Multi-Fidelity Queries: A Tree Search Approach. arXiv: Machine Learning.

Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems.
Swersky, K.; Snoek, J.; and Adams, R. P. 2013. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2004-2012.

Swersky, K.; Snoek, J.; and Adams, R. P. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.

Takeno, S.; Fukuoka, H.; Tsukada, Y.; Koyama, T.; Shiga, M.; Takeuchi, I.; and Karasuyama, M. 2020. Multi-fidelity Bayesian Optimization with Max-value Entropy Search and its Parallelization.

Wang, J.; Xu, J.; and Wang, X. 2018. Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning.

Wang, Z.; Zoghi, M.; Hutter, F.; Matheson, D.; and De Freitas, N. 2013. Bayesian optimization in high dimensions via random embeddings. In Twenty-Third International Joint Conference on Artificial Intelligence.

Wistuba, M.; Schilling, N.; and Schmidt-Thieme, L. 2016. Two-stage transfer surrogate model for automatic hyperparameter optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 199-214. Springer.

Wu, J.; Toscano-Palmerin, S.; Frazier, P. I.; and Wilson, A. G. 2019a. Practical multi-fidelity Bayesian optimization for hyperparameter tuning.

Wu, J.; Toscano-Palmerin, S.; Frazier, P. I.; and Wilson, A. G. 2019b. Practical multi-fidelity Bayesian optimization for hyperparameter tuning, 284.

Wu, S.; Zhang, Y.; Gao, C.; Bian, K.; and Cui, B. 2020. GARG: Anonymous Recommendation of Point-of-Interest in Mobile Networks by Graph Convolution Network. Data Science and Engineering. ISSN 2364-1541. doi:10.1007/s41019-020-00135-z.

Yao, Q.; Wang, M.; Escalante, H. J.; Guyon, I.; Hu, Y.; Li, Y.; Tu, W.; Yang, Q.; and Yu, Y. 2018. Taking Human out of Learning Applications: A Survey on Automated Machine Learning. CoRR.

Yogatama, D.; and Mann, G. 2014. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, 1077-1085.

Zhang, W.; Jiang, J.; Shao, Y.; and Cui, B. 2020. Snapshot boosting: a fast ensemble framework for deep neural networks. Sci. China Inf. Sci. 63(1): 112102. doi:10.1007/s11432-018-9944-x. URL https://doi.org/10.1007/s11432-018-9944-x.

Zöller, M.; and Huber, M. F. 2019. Survey on Automated Machine Learning. CoRR abs/1904.12054.