ECONOMICAL HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY

Chi Wang, Qingyun Wu, Silu Huang, Amin Saied
Microsoft Corporation, Redmond, WA 98052, USA
{wang.chi,qingyun.wu,silu.huang,amin.saied}@microsoft.com

We study the problem of using low cost to search for hyperparameter configurations in a large search space with heterogeneous evaluation cost and model quality. We propose a blended search strategy to combine the strengths of global and local search, and prioritize them on the fly with the goal of minimizing the total cost spent in finding good configurations. Our approach demonstrates robust performance for tuning both tree-based models and deep neural networks on a large AutoML benchmark, as well as superior performance in model quality, time, and resource consumption for a production transformer-based NLP model fine-tuning task.

1 INTRODUCTION

Hyperparameter optimization (HPO) of modern machine learning models is a resource-consuming task, which is unaffordable to individuals or organizations with few resources (Yang & Shami, 2020). Operating HPO in a low-cost regime has numerous benefits, such as democratizing ML techniques, enabling new applications of ML that require frequent low-latency tuning, and reducing the carbon footprint. It is inherently challenging due to the nature of the task: trying a large number of configurations of heterogeneous cost and accuracy in a large search space. The expense can accumulate from multiple sources: either a large number of individually cheap trials or a small number of expensive trials can add up to a large total cost.

There have been multiple attempts to address the efficiency of HPO from different perspectives, each with its own strengths and limitations. For example, Bayesian optimization (BO) (Brochu et al., 2010), a class of global optimization algorithms, is used to minimize the total number of iterations needed to reach the global optima. However, when the cost of different hyperparameter configurations is heterogeneous, vanilla BO may select configurations that incur unnecessarily high cost. As opposed to BO, local search (LS) methods (Wu et al., 2021) are able to control total cost by preventing very expensive trials until necessary, but they may get trapped in local optima. Multi-fidelity methods (Jamieson & Talwalkar, 2016) aim to use cheap proxies to replace some of the expensive trials and approximate the accuracy assessment, but they can only be used when such proxies exist. A single search strategy can hardly meet the generic goal of economical HPO.

In this work, we propose a blended search strategy which combines global search and local search such that we can enjoy the benefits of both worlds: (1) global search can ensure convergence to the global optima when the budget is sufficient; and (2) local search methods enable better control of the cost incurred along the search trajectory. Given a particular global and local search method, our framework, named BlendSearch, combines them according to the following design principles. (1) Instead of sticking with a particular method for configuration selection, we consider both candidate search methods and decide which one to use at each round of configuration selection. (2) We use the global search method to help decide the starting points of local search threads.
(3) We use the local search method to intervene in the global search method's configuration selection to avoid configurations that may incur unnecessarily large evaluation cost. (4) We prioritize the search instances of both methods on the fly according to their performance and their efficiency of performance improvement. Extensive empirical evaluation on the AutoML Benchmark (Gijsbers et al., 2019) validates the robust performance of our method on a wide variety of datasets. BlendSearch is now publicly available in an open-source AutoML library (https://github.com/microsoft/FLAML).

Figure 1: A typical example of the different behaviors of BO, LS, and our proposed BlendSearch in tuning a set of 11-dimensional hyperparameters for XGBoost. (a) Loss vs. evaluation time for configs tried by each method; one point represents one config; the lower the loss, the better the quality of the config; longer evaluation time corresponds to larger training cost. (b) Best loss vs. optimization time per method; colored circles are configs proposed by LS threads in our method (a color change indicates a thread change); diamonds are configs proposed by BO in our method. BO is prone to selecting expensive but not necessarily good configs. LS avoids expensive configs in the beginning but is prone to getting stuck in local regions. BlendSearch switches between one BO and multiple LS search threads, prioritizes the more promising ones, and turns out to try more low-cost, high-quality configs.

2 BACKGROUND AND RELATED WORK

We first briefly introduce vanilla Bayesian optimization methods and local search methods, which are among the building blocks of our method.

Bayesian optimization is a class of global optimization algorithms suitable for optimizing expensive black-box functions. It models the probabilistic distribution of the objective conditioned on the optimization variables. Typical models include Gaussian processes (Snoek et al., 2012), random forests (Hutter et al., 2011), and the tree Parzen estimator (TPE) (Bergstra et al., 2011). In BO methods, an acquisition function is used to determine the next point to evaluate. Two common acquisition functions are the expected improvement (EI) (Bull, 2011) over the currently best-observed objective and the upper confidence bound (UCB) (Srinivas et al., 2009).

Local search methods are prevalent in the general optimization literature (Spall et al., 1992; Nesterov & Spokoiny, 2017) but less studied in the HPO literature due to the possibility of getting trapped in local optima (György & Kocsis, 2011). Recent work (Wu et al., 2021) shows that a local search method, FLOW2, can make HPO cost-effective when combined with low-cost initialization and random restart. At each iteration, it samples a pair of vectors (with opposite directions) uniformly at random from a unit sphere, the center of which is the best configuration found so far (a.k.a. the incumbent) and the radius of which is the current stepsize. Expensive configurations are avoided in the beginning because each iteration proposes a configuration near the incumbent. Random restart of the local search is performed once the convergence condition is satisfied.

There are several attempts to address the limitations of vanilla BO or local search methods. BOwLS (BO with local search) (Gao et al., 2020) uses a BO model to select the starting point of a local search thread. Each local search thread is run until convergence, and the BO model is updated with the start point and the converged loss.
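To make the FLOW2-style local search step described above concrete, the following is a minimal sketch rather than the original implementation: it assumes all hyperparameters are numerical and normalized to [0, 1]^d, and it uses a simplified stepsize-shrinking rule.

```python
import numpy as np

def flow2_step(evaluate, incumbent, best_loss, stepsize, rng):
    """One iteration of a FLOW2-style local search step (simplified sketch).

    evaluate: maps a config (np.ndarray in [0, 1]^d) to a loss.
    incumbent / best_loss: current best config and its observed loss.
    Samples a random unit direction and probes incumbent +/- stepsize * direction;
    moves to the first probe that improves the loss, otherwise shrinks the stepsize.
    """
    u = rng.normal(size=len(incumbent))
    u /= np.linalg.norm(u)                        # uniform direction on the unit sphere
    for signed in (u, -u):
        probe = np.clip(incumbent + stepsize * signed, 0.0, 1.0)
        loss = evaluate(probe)
        if loss < best_loss:                      # accept the improving probe
            return probe, loss, stepsize
    return incumbent, best_loss, 0.9 * stepsize   # illustrative shrink rule, not FLOW2's exact schedule
```

Random restart and FLOW2's exact stepsize schedule are omitted here; see Wu et al. (2021) for the full method.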
Trust region BO (Eriksson et al., 2019) fits a fixed number of local models and performs a principled global allocation of samples across these models via an implicit bandit approach. It is primarily designed for HPO problems with high-dimensional numerical hyperparameters. Unfortunately, none of the existing work that combines global search with local search considers the heterogeneity of the evaluation cost incurred along the search.

There are also many attempts to make HPO efficient by speeding up configuration evaluation. Multi-fidelity optimization methods (Klein et al., 2017; Li et al., 2017; Kandasamy et al., 2017; Falkner et al., 2018; Lu et al., 2019; Li et al., 2020) are proposed for this purpose. They usually require an additional degree of freedom in the problem, called fidelity, to allow performance assessment of a configuration at different fidelities. There is surprisingly little prior work on generic cost-effective HPO. Gaussian process with expected improvement per second (GPEIPS) (Snoek et al., 2012) models the evaluation cost using another Gaussian process and heuristically adds the estimated cost into the acquisition function. It does not always outperform GPEI because the acquisition function can over-penalize good but expensive configurations.

3 BLENDSEARCH

Our framework takes the following information as inputs.

B is the total budget of cost. In this work, we measure the cost by CPU/GPU time.

P is the input HPO problem, with the following attributes characterizing the problem. P.X is the search space, in which each element x is a d-dimensional hyperparameter configuration. For a non-categorical hyperparameter coordinate i ∈ [d], if different values of x_i lead to heterogeneous cost and there is a known value P.x_i^{LowCost} corresponding to a low cost, the coordinate is considered a controlled dimension. We use P.D to denote the set of controlled dimensions. P.LossFunc(·) is the loss function to be minimized over configurations x ∈ P.X. P.CostFunc(·) is the cost function that outputs the cost incurred when evaluating x. The goal of an HPO algorithm π is to minimize the loss P.LossFunc(x) under the constraint that the total cost incurred satisfies G(π) := Σ_{x ∈ I(π)} P.CostFunc(x) ≤ B, where I(π) is the search trajectory of algorithm π. Note that both P.LossFunc(x) and P.CostFunc(x) are black-box functions, meaning that their analytic form is typically not available and only function values can be observed. To distinguish the operation of querying the loss/cost function from the loss/cost observation, we use P.LossFunc(x) and P.CostFunc(x) to denote the former, and l(x) and c(x) to denote the latter. l and c (omitting x) are used when there is no ambiguity.

G is the global search method to be used. L is the local search method to be used. L.Δ is the largest stepsize used in local search method L, i.e., the largest possible change of a hyperparameter value between two consecutive search steps.

The overall design of our framework is presented in Figure 2 and Algorithm 1. The key idea is to maintain a pool of search threads, one of which corresponds to global search and the others to local search. The pool starts with one global search thread and gradually adds local search threads as the search goes on.

Figure 2: Framework. Both paths from the config evaluator are executed independently.
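For concreteness, the framework-level state F and the per-thread bookkeeping described in this section could be organized as in the following sketch. The names `SearchThread` and `FrameworkState` are illustrative and do not come from the actual FLAML implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SearchThread:
    """One search thread: an instance of the global or a local search method."""
    is_global: bool
    best_loss: float = float("inf")         # S.l1st
    second_best_loss: float = float("inf")  # S.l2nd
    cost_best: float = 0.0                  # S.c1st
    cost_second_best: float = 0.0           # S.c2nd
    cost_total: float = 0.0                 # S.c
    speed: float = 0.0                      # S.s, improvement per unit cost

@dataclass
class FrameworkState:
    """Framework-level bookkeeping F maintained by BlendSearch."""
    threads: List[SearchThread]                                 # F.S, threads[0] is the global thread
    priority: Dict[int, float] = field(default_factory=dict)    # F.Priority, keyed by thread index
    best_loss: float = float("inf")                             # F.l*
    total_cost: float = 0.0                                     # F.c
    # F.R: admissible region on the controlled dimensions, one [min, max] interval per dimension
    region_min: Dict[int, float] = field(default_factory=dict)
    region_max: Dict[int, float] = field(default_factory=dict)

    def admissible(self, config: Dict[int, float]) -> bool:
        """Config validator check: controlled dimensions must lie inside F.R."""
        return all(self.region_min[i] <= config[i] <= self.region_max[i]
                   for i in self.region_min)
```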
Here a search thread is an instance of a global search or local search method, each with its own search trajectory. At each round, a search thread selector selects one of the search threads from the pool according to a priority metric that reflects each search thread's current performance and efficiency of performance improvement. The selected search thread is then used to propose a configuration to evaluate in this round. When the selected search thread is the global search thread, a config validator first checks whether the proposed configuration is within the admissible region for evaluation. If not, a backup local search thread is used instead. A local search thread is created only when the global search thread proposes a valid config and a certain create-thread condition is met, and it is deleted once it converges. Two local search threads are merged into one if they are close enough. The priority of each search thread is updated after each evaluation round.

Algorithm 1 BlendSearch
Inputs: HPO problem P, a global search method G, a local search method L, total budget B.
1: Initialization: F.S = [S0], where S0 is an instance of G. We denote by F.x0 the initial point of the search. By design, the values of the controlled dimensions of F.x0 are set to P.x^{LowCost} and the values of the other dimensions are proposed by S0.
2: while F.c < B do
3:   S, Sbak ← SelectThread(F)
4:   (x, l, c) ← SearchEvaluate1Step(S, F, P)
5:   if x is invalid then (x, l, c) ← SearchEvaluate1Step(Sbak, F, P)
6:   if x is proposed by global search and CreateNewLSCondition is satisfied then
7:     S' ← InitializeSearchThread(L, P, (x, l, c))
8:     Add the new LS thread S' into the pool: F.S ← F.S + [S']
9:   DeleteAndMergeLS(F)  // merge or delete existing LS threads when necessary
10:  Update F.Priority
11:  if x is proposed by global search then UpdateGSModel(S0, (x, l, c))

For convenience, we use F to denote a collection of framework-level variables. F.S is the list of search threads maintained in our framework; it contains at least one search thread, and exactly one of them is the global search thread S0. F.Priority is a priority dictionary whose keys are the search threads in F.S and whose values are the priorities of the corresponding search threads. Bookkeeping information of F: F.l* is the best loss achieved among all the search threads and F.c is the total cost consumed in our framework. F.R is the admissible region on the controlled dimensions of the current search, which is a hyperrectangle and can be written as F.R := {[F.xmin_i, F.xmax_i]}_{i ∈ P.D}, with F.xmin_i and F.xmax_i denoting the minimum and maximum value along the i-th dimension of F.R respectively. They are initially set as F.xmin_i = F.xmax_i = P.x_i^{LowCost} for all i ∈ P.D. The admissible region gradually expands during the search: (1) it is expanded to cover all the points evaluated by all the search threads and all the points that can be reached by the local search within one search step, as shown in lines 7 and 8 of Algorithm 2; (2) it expands when a local search thread converges, as shown in line 3 of Algorithm 7 (included in Appendix A). In the following, we explain the key steps of our algorithm.

Step 1: Search thread selector (line 3 of Alg. 1). In addition to the primary search thread S, SelectThread also outputs a backup search thread Sbak, which is guaranteed to be a local search thread. It is set to None when there is no local search thread yet in F.S.
Specifically,

S = argmax_{S ∈ F.S} F.Priority(S);  Sbak = argmax_{S ∈ F.S\{S0}} F.Priority(S) if F.S\{S0} ≠ ∅, and Sbak = None otherwise.   (1)

The design of the priority metric follows the principle of optimism in the face of uncertainty from the multi-armed bandit problem to balance exploitation and exploration (Lattimore & Szepesvári, 2020). Specifically, linear extrapolation is performed adaptively and locally to calculate the improvement speed of each search thread. The estimated future reward based on such a linear extrapolation provides a first-order upper bound of the ground-truth future reward, assuming each search thread has a diminishing return, i.e., the speed of improvement decreases as more resource is spent. Formally, we introduce the following variables and functions for each search thread S ∈ F.S.

Bookkeeping information: S.l1st and S.l2nd are the best loss so far and the second best loss before the best loss was achieved. S.c1st and S.c2nd are the total cost consumed when S.l1st and S.l2nd were achieved respectively. S.c is the total cost spent in S. S.x1st is the best configuration found so far. S.s is the performance improvement speed of S, calculated as

S.s = (S.l2nd − S.l1st) / (S.c − S.c2nd).

This formula is only valid when there has been at least one improvement; otherwise we do not have enough information to estimate the speed of improvement. We set the speed to the highest speed among all the search threads when S.l2nd = S.l1st, which reflects the implicit assumption of diminishing return. S.xmin_i and S.xmax_i are the minimum and maximum value of the i-th dimension over all hyperparameter configurations evaluated in S. S.CostImp(·) is a function whose input is a target loss and whose output is the anticipated cost for S to reach a loss better than this target. We compute it with the following formula, which intuitively uses the cost for improvement in the past to estimate that in the future:

S.CostImp(l) = max{ S.c − S.c1st, S.c1st − S.c2nd, 2(S.l1st − l) / S.s }.   (2)

Our proposed priority metric is essentially the negative of the projected loss of S:

F.Priority(S) = −(S.l1st − S.s · b),   (3)

in which b = min( max_{S' ∈ F.S} S'.CostImp(F.l*), B − F.c ). Here max_{S' ∈ F.S} S'.CostImp(F.l*) can be considered as the resource needed for every search thread to achieve a better performance than the currently best performance F.l*. Our priority metric estimates the loss of each search thread if such an amount of resource (restricted by the remaining budget B − F.c) is given. By considering both a search thread's current performance and its potential improvement, it provides a fair trade-off between exploiting the currently-best and exploring the potentially-better choices.

Algorithm 2 SearchEvaluate1Step
Inputs: HPO problem P, search thread S, and F
1: if S is None then construct x as follows: generate the controlled dimensions of x by adding Gaussian noise to the corresponding dimensions of F.x0, and sample the remaining dimensions uniformly at random from the search space P.X
2: else x ← S.ProposeConfig()
3: if S = S0 (i.e., S is the global search thread) and x ∉ F.R then x ← invalid
4: else l, c ← P.LossFunc(x), P.CostFunc(x)  // evaluate configuration x
5: if x ≠ invalid then
6:   BookKeeping(S, x, l, c, F) and update the speed S.s
7:   ∀i ∈ P.D: S.xmin_i ← min{x_i, S.xmin_i}, F.xmin_i ← min{S.xmin_i − L.Δ, F.xmin_i}
8:   ∀i ∈ P.D: S.xmax_i ← max{x_i, S.xmax_i}, F.xmax_i ← max{S.xmax_i + L.Δ, F.xmax_i}
Outputs: x, l, c
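A minimal sketch of how the selection and priority of Eqs. (1)-(3) could be computed is given below, written against the illustrative `SearchThread`/`FrameworkState` fields introduced earlier; the actual FLAML implementation may differ in its details.

```python
def cost_for_improvement(thread, target_loss, max_speed):
    """Eq. (2): anticipated cost for `thread` to beat `target_loss`."""
    speed = thread.speed if thread.speed > 0 else max_speed
    eci = max(thread.cost_total - thread.cost_best,
              thread.cost_best - thread.cost_second_best)
    if thread.best_loss > target_loss and speed > 0:
        eci = max(eci, 2 * (thread.best_loss - target_loss) / speed)
    return eci

def update_priorities(state, budget):
    """Eq. (3): priority = -(projected loss after spending b more cost)."""
    max_speed = max((t.speed for t in state.threads), default=0.0)
    # b = min( max_S S.CostImp(F.l*), remaining budget )
    b = min(max(cost_for_improvement(t, state.best_loss, max_speed)
                for t in state.threads),
            budget - state.total_cost)
    for idx, t in enumerate(state.threads):
        speed = t.speed if t.speed > 0 else max_speed
        state.priority[idx] = -(t.best_loss - speed * b)

def select_thread(state):
    """Eq. (1): primary thread by priority; backup is the best local thread (or None)."""
    order = sorted(range(len(state.threads)), key=lambda i: state.priority[i], reverse=True)
    primary = order[0]
    local = [i for i in order if not state.threads[i].is_global]
    backup = local[0] if local else None
    return primary, backup
```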
Step 2: Config validator and evaluator (lines 4-5 of Alg. 1). After a search thread (and a backup search thread) is selected, the next step is to propose the next configuration to try with the chosen search thread(s). Intuitively, we generate the next configuration primarily according to the selected search thread S, whose priority is ranked the highest. But we set a guard rail for the global search thread, as it may propose an unnecessarily high-cost configuration. We thus introduce a config validator that validates the configurations proposed by global search according to whether they are within the current admissible region F.R of our framework (line 3 of Alg. 2). A configuration marked as invalid is considered prone to incurring unnecessarily high cost and will not be evaluated in this round. In this case, the selected backup search thread is used to perform another round of SearchEvaluate1Step (line 5 of Alg. 1) if it is a valid search thread (i.e., not None). In the case where the backup thread is None, we generate the new configuration according to line 1 of Alg. 2. The config validator helps avoid potentially high-cost evaluations and thus avoids creating local search threads from high-cost points until necessary. It does not stick to local searches forever because the admissible region F.R gets expanded. Note that according to the definition of F.R, only the controlled dimensions of a hyperparameter configuration are subject to the validity check. If needed, a multi-fidelity pruning strategy can be used in this config evaluator component. Multi-fidelity pruning does not necessarily yield better performance, so its adoption in BlendSearch is optional.

Step 3: Search thread creator, updater and cleaner (lines 6-11 of Alg. 1). If the newly proposed configuration was proposed by global search and is not marked as invalid, we consider creating a new local search thread using the proposed configuration as a starting point. To make sure the newly created local search thread is relatively good, we first check whether the proposed configuration's performance is better than at least half of the existing threads' performance (specified in CreateNewLSCondition). If so, a new local search thread is initialized and added to the active search thread pool F.S. In DeleteAndMergeLS, we check whether a local search thread has converged according to the convergence condition of the specific local search method. If it has, the search thread is removed from F.S. In addition, we also go through all the local search threads to check whether the incumbent of an LS thread is reachable in one step by another LS thread with lower loss (ref. Appendix A). If so, the former LS thread is deleted. After a configuration proposed by global search is evaluated, the observation tuple (x, l, c) is used to update the model of the global search method through the function UpdateGSModel. For example, when the global search method is a Bayesian optimization method, the model is its surrogate model. Due to the page limit, detailed pseudocode for several straightforward functions mentioned in our framework is provided in Appendix A, including CreateNewLSCondition, DeleteAndMergeLS, InitializeSearchThread and BookKeeping.
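The thread creation and merging logic of Step 3 admits a compact, self-contained sketch such as the one below. The function and variable names are illustrative, and the merge rule uses Euclidean distance between incumbents against the current stepsize, per the ReachableInOneStep definition in Appendix A.

```python
import math
from statistics import median

def should_create_local_thread(loss, local_best_losses):
    """CreateNewLSCondition: create a new LS thread if there is none yet, or the new
    loss is at least as good as the median of the existing LS threads' best losses."""
    return not local_best_losses or loss <= median(local_best_losses)

def merge_local_threads(incumbents, best_losses, stepsize):
    """DeleteAndMergeLS (merge part, simplified): drop an LS thread whose incumbent is
    reachable in one step (within `stepsize`) of another LS thread with a lower loss."""
    keep = set(range(len(incumbents)))
    for i in range(len(incumbents)):
        for j in range(len(incumbents)):
            if i == j or i not in keep or j not in keep:
                continue
            dist = math.dist(incumbents[i], incumbents[j])
            if dist <= stepsize and best_losses[j] < best_losses[i]:
                keep.discard(i)  # thread i is dominated by a nearby, better thread j
    return sorted(keep)
```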
4 EXPERIMENTS

We evaluate BlendSearch in tuning tabular machine learning libraries on an AutoML benchmark (Gijsbers et al., 2019), and in fine-tuning NLP models for text data. The AutoML benchmark consists of 39 tabular datasets that represent real-world data science classification problems. It includes datasets of all sizes, from different problem domains, and with various levels of difficulty. As each dataset has 10 cross-validation folds, all the results reported in this paper are averaged over the 10 folds. With this benchmark, we are able to evaluate multiple HPO methods on a large number of datasets within a manageable computational budget, for tuning three machine learning libraries: XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017) and DeepTables (https://github.com/DataCanvasIO/DeepTables). The first two are popular libraries based on gradient boosted trees, and the third is an easy-to-use deep learning toolkit which utilizes the latest research findings for tabular data. We chose them because gradient boosted trees and deep neural networks are the most frequent winners in data science competitions. We run experiments for XGBoost and LightGBM on the AutoML benchmark and report the results in Section 4.1. As a real application, we report an NLP model fine-tuning task for a production use case in Section 4.2. Finally, in Section 4.3 we perform an ablation study to investigate the effectiveness of several important components of our framework. Due to the page limit, we include the results for tuning DeepTables in Appendix B.

We include the following baselines in our experiments.

BO (Akiba et al., 2019): the Bayesian optimization baseline. We choose a modern HPO library, Optuna, and use the TPE sampler because of its flexibility in handling mixed continuous and discrete spaces and its good performance reported in existing work (Falkner et al., 2018).

LS (Wu et al., 2021): the recent baseline of local search with random restart, based on FLOW2. It has been shown to control cost effectively and to outperform BO methods for numerical cost-related hyperparameter search.

BOwLS (Gao et al., 2020): an existing approach that combines local search with BO, i.e., using BO to propose starting points for local search.

ASHA (Li et al., 2020), i.e., asynchronous successive halving: a state-of-the-art HPO method that uses multi-fidelity optimization and supports parallel tuning.

Initialization setting. For the local search method used in our framework, a low-cost initialization is needed to realize its unique advantage in controlling the cost. It is implemented by setting a low-cost initial value for each of the controlled dimensions. For example, among the hyperparameters tuned in LightGBM (shown in Table 2 of Appendix B), three hyperparameters, namely tree num, leaf num and min child weight, have initial values corresponding to the min or max values of their ranges, depending on whether they have a positive or negative correlation with the evaluation cost. This does not require the loss of the initial configuration to be low. Only a single low-cost initial value for each controlled dimension needs to be specified as input. To ensure a fair evaluation, we use the same low-cost initial point (if controlled dimensions exist) as the starting point for all the baselines.

Figure 3: Optimization performance curves for XGBoost on (a) credit (loss = 1-auc), (b) KDDCup09 (loss = 1-auc), and (c) volkert (loss = log-loss). Lines correspond to the mean loss over 10 folds, and shades correspond to 95% confidence intervals. 1-auc is used for binary classification and log-loss is used for multi-class classification.
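As an illustration of how such a low-cost initial point can be supplied in practice, the open-source library mentioned in Section 1 exposes a tuning API along the following lines. This is a sketch based on FLAML's documented tune interface; argument names and defaults may vary across library versions, and the LightGBM search space shown here is abbreviated.

```python
from flaml import tune
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_val, y_train, y_val = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2)

def evaluate_config(config):
    # train one configuration and report 1-auc as the loss to minimize
    model = lgb.LGBMClassifier(n_estimators=int(config["n_estimators"]),
                               num_leaves=int(config["num_leaves"]),
                               learning_rate=config["learning_rate"])
    model.fit(X_train, y_train)
    loss = 1 - roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return {"val_loss": loss}

analysis = tune.run(
    evaluate_config,
    config={                                   # abbreviated search space
        "n_estimators": tune.lograndint(4, 32768),
        "num_leaves": tune.lograndint(4, 32768),
        "learning_rate": tune.loguniform(0.01, 0.1),
    },
    # low-cost values for the controlled dimensions (tree num and leaf num)
    low_cost_partial_config={"n_estimators": 4, "num_leaves": 4},
    metric="val_loss",
    mode="min",
    num_samples=-1,          # let the time budget drive stopping
    time_budget_s=60,        # total cost budget B in seconds
)
print(analysis.best_config)
```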
Figure 4: Aggregated rank and scaled loss on LightGBM and XGBoost: (a) average rank, (b) scaled loss, (c) rank for the ablation study (BS-Priority:RoundRobin, BS-w/o-ConfigValidator, BS-CreateCond:Always, BS-CreateCond:BestOrFirst). The scaled loss is obtained by min-max scaling using the max and min loss over all time across all the methods. Sub-figures (a) & (b) share the same legend, and BS is short for BlendSearch in sub-figure (c). Both rank and scaled loss are aggregated first across datasets and then across the 10 folds of each dataset.

4.1 TUNING XGBOOST AND LIGHTGBM

We tune a set of 9-dimensional hyperparameters (all numerical) in LightGBM and 11-dimensional hyperparameters (9 numerical and 2 categorical) in XGBoost. A detailed description of the search space can be found in Appendix B. In this section, we omit the multi-fidelity baseline, ASHA, because we did not find a fidelity dimension that works well for tuning LightGBM and XGBoost (ref. additional results in Appendix B). We perform the evaluation on 37 of the 39 datasets in the AutoML benchmark (Robert and Dionis are excluded due to out-of-memory errors and extremely long training time). The input budget B (in terms of CPU time) is set to 4 hours for the 3 largest of the 37 datasets, and 1 hour for the rest.

From the performance curves shown in Figure 3, we observe that BO tends to perform well on small datasets, e.g., credit in Figure 3(a), where the 1h budget is sufficient. Under the same budget, it may perform badly on large datasets, e.g., volkert in Figure 3(c), as it may try configurations which consume a very large portion of the budget at an early stage of the search. On a medium-sized dataset, e.g., KDDCup09 in Figure 3(b), the local search method is more efficient than BO in the early stage but is outperformed in the later stage. BlendSearch performs similarly to the better of LS and BO in the early stage, and surpasses both of them in the later stage. We also observe that BOwLS performs similarly to LS (sometimes worse). This is because BOwLS needs to wait until a local search converges before proposing a new one. The aggregated results over all the test cases in Figure 4(a) & (b) are consistent with these observations.

The interplay between local and global search in BlendSearch. We investigated the dynamics of local and global search thread selection in BlendSearch. As a case study, we show the results of tuning XGBoost on two datasets in Figure 1(b) and Figure 5(a).

Figure 5: Search thread selection in BlendSearch when tuning XGBoost on the dataset KDDCup09. In this figure (and in Figure 1(b), which shows the result for the dataset volkert), each marker corresponds to one search thread selection in BlendSearch. The diamonds correspond to the global search thread, and the solid circles correspond to the local search threads (different colors for different threads).

Figure 6: The distributions of the number of thread switches and the number of local search threads induced by the GS (short for global search) in BlendSearch for tuning XGBoost. The result is aggregated over all the datasets evaluated in tuning XGBoost. The outliers with a very large number of thread switches and threads induced by GS are from very small datasets, where the number of evaluations is large.
The result in Figure 1(b) shows that the global search indeed plays a role after 1000s, which contributes to BlendSearch's better performance compared to LS. The result in Figure 5(a) shows that although the first point suggested by the global search (at around 200s) does not yield significantly better performance immediately, the induced new local search thread (the thread in green circles) leads to significantly better performance soon after. These two figures together indicate that global search is not only responsible for directly finding better configurations, but also for creating local search threads that can achieve yet better performance. We show two statistics about the overall interplay between global and local search in BlendSearch over all the datasets evaluated for tuning XGBoost in Figure 6. We provide additional experiments for tuning XGBoost and LightGBM in Appendix B, including a comparison with ASHA under different settings of fidelity, and an empirical study of the effect of the low-cost initialization.

Takeaway. (1) BlendSearch is able to overcome the limitations of BO and LS and, at the same time, inherit their advantages. By blending BO and LS, BlendSearch outperforms both on this large collection of datasets over time. (2) The interplay between global and local search indeed contributes to BlendSearch's good performance.

4.2 TRANSFORMER-BASED NLP MODEL FINE-TUNING

This section presents an application of economical hyperparameter optimization to an NLP model fine-tuning task used in a large software company. It starts from a large transformer-based model, Turing-NLRv2, with 24 transformer layers, each with 16 attention heads and hidden dimension 1024, totaling 340M parameters. It is pretrained on English Wikipedia (2,500M words) and BookCorpus (800M words), and uses byte-pair encoding (msturing.org). This pre-trained model is then fine-tuned (Dai & Le, 2015) for use in multiple production scenarios, including sequence classification, named entity recognition and question answering. The fine-tuning procedure is performed for a dozen separate tasks, and is repeated on a regular cadence, typically every few weeks. We focus our experiment on a single sequence classification fine-tuning task where the objective is to label a document (consisting of one or more sentences) with one of five possible classes. For fine-tuning this model we introduce a classification layer with 1024 × 5 = 5120 additional weight parameters, randomly initialized. The dataset used for training consists of 52K labeled examples, which we split 80/20 for training/validation. The objective to maximize is the f1-score obtained on the validation set of 10.4K labeled documents. Selecting hyperparameters for fine-tuning this model has been a manual process that typically takes a data scientist a few days.

Table 1: Results of language model fine-tuning.
method      | wall-clock                                   | GPU-hours | f1
Manual      | 10-20 working hours, spanned over a few days | 40-80h    | 0.645
BlendSearch | 3h                                           | 12h       | 0.652
ASHA-1      | 6h                                           | 24h       | 0.624
ASHA-16     | 6h                                           | 384h      | 0.640

In our experiment we use a 6-dimensional search space (4 numerical and 2 categorical). A detailed description of all the hyperparameters tuned, including their ranges, can be found in Table 5 in Appendix B. We compare to ASHA and let it use 16 VMs with 4 NVIDIA Tesla V100 GPUs on each VM. We run ASHA with 16 concurrent jobs for 6 hours of wall-clock time.
That amounts to 4 × 16 × 6 = 384 GPU-hours of hardware cost. We run BlendSearch on a single VM of the same configuration for 3 hours, which uses 12 GPU-hours in total. For comparison, we include ASHA-1 using the same single VM. The results are summarized in Table 1.

Takeaway. Not only does BlendSearch find a more accurate model than both ASHA and the manual effort, it does so faster and consumes only 3% of the resources of ASHA-16.

4.3 ABLATION STUDY

To investigate the effectiveness of several important modules of our framework (colored in orange in Figure 2), we perform an ablation study in tuning XGBoost on a random subset of the datasets (one third of the datasets mentioned in Section 4.1). We show the aggregated rank of different variants of our method in Figure 4(c). Specifically, we study the following three modules of the framework. (1) Priority metric: BS-Priority:RoundRobin uses a round-robin policy in the search thread selector. (2) Config validator: BS-w/o-ConfigValidator skips the validity check on the proposed configuration. (3) Create-new-thread condition: in BS-CreateCond:Always and BS-CreateCond:BestOrFirst, the following two conditions are used for creating new local search threads respectively: always create, and create a new thread only when the loss is better than all existing search threads' losses or there is no local search thread yet.

From this ablation study, we have the following observations: (1) Round-robin selection is worse than selection using our designed priority metric. Round-robin's relative performance becomes worse in the later stage of the search because it cannot avoid badly-performing search threads. (2) The config validator is also vital in our framework. (3) Overall, the condition for creating new threads has a smaller impact on our method compared to the other designs studied. The "better than half" condition used by default tends to perform the best.

5 EXTENSION AND FUTURE WORK

In the low-resource scenario targeted by this paper, a single trial is not resource-saturated even if we spend all the resources on it, so we do not recommend parallel trials in this scenario. In the case where more resources are available than the maximum a single trial can consume, our framework can be extended by running the trials from different search threads on multiple workers. For example, if there are additional workers available, we can keep invoking the search thread selector (but skip the local search threads that already have O(d) trials running). Our design of having multiple independent local search threads naturally allows efficient asynchronous parallel trials. The design of utilizing existing global optimization methods allows existing easy-to-parallelize global optimization methods (such as random search or batch versions of BO) to be plugged in. The prioritization of search threads is still useful as long as the maximal number of concurrent trials divided by the number of search threads is smaller than O(d). Since our method can be used together with multi-fidelity pruning methods, it can naturally inherit their asynchronous resource scheduling when used in the parallel setting. Parallelization is supported in the latest version of BlendSearch's implementation. In this work, we show the effectiveness of BlendSearch through an extensive empirical study.
As future work, it is worth studying the theoretical properties of BlendSearch, including theoretical guarantees on its convergence rate and total resource consumption.

REFERENCES

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.

Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.

Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879-2904, November 2011.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In KDD, 2016.

Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, 2015.

David Eriksson, Michael Pearce, Jacob Gardner, Ryan D. Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems, 2019.

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

Yuzhou Gao, Tengchao Yu, and Jinglai Li. Bayesian optimization with local search. In International Conference on Machine Learning, Optimization, and Data Science, 2020.

Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. An open source AutoML benchmark. In AutoML Workshop at ICML 2019, 2019.

András György and Levente Kocsis. Efficient multi-start strategies for local search algorithms. Journal of Artificial Intelligence Research, 41:407-444, 2011.

Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, 2011.

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 2016.

Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 2017.

Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, 2017.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A system for massively parallel hyperparameter tuning. In Proceedings of Machine Learning and Systems, 2020.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar.
Hyperband: A novel bandit-based approach to hyperparameter optimization. In ICLR'17, 2017.

Zhiyun Lu, Liyu Chen, Chao-Kai Chiang, and Fei Sha. Hyper-parameter tuning under a budget constraint. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.

James C. Spall et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341, 1992.

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

Qingyun Wu, Chi Wang, and Silu Huang. Frugal optimization for cost-related hyperparameters. In AAAI'21, 2021.

Li Yang and Abdallah Shami. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295-316, 2020.

A MORE DETAILS ABOUT BLENDSEARCH

We provide detailed pseudocode of the sub-algorithms used in our framework.

Algorithm 3 SelectThread
Inputs: framework-level state F, which keeps a list of candidate search threads F.S and a priority function F.Priority.
1: S ← argmax_{S ∈ F.S} F.Priority(S)
2: Sbak ← argmax_{S ∈ F.S\{S0}} F.Priority(S) if F.S\{S0} ≠ ∅, and Sbak ← None otherwise
Outputs: S, Sbak

Algorithm 4 BookKeeping
Inputs: search thread S, one observation (x, l, c) generated by S, and F.
1: S.c ← S.c + c
2: if S is a local search thread then update {S.xmin_i}_{i ∈ [d]} and {S.xmax_i}_{i ∈ [d]}
3: if l < S.l1st then
4:   S.l2nd, S.c2nd ← S.l1st, S.c1st; S.l1st, S.c1st ← l, S.c; and S.xbest ← x
5: if l < F.l* then F.l* ← l

Algorithm 5 InitializeSearchThread
Inputs: search method L or G, problem P, and initial point (x, l, c).
1: S.π ← L.π or G.π
2: Initialize the bookkeeping information of S: S.l1st = S.l2nd = l, S.c1st = S.c2nd = S.c = c, and S.xmin_i = S.xmax_i = x_i for i ∈ P.D

Choice of L.Δ used in constructing the admissible region F.R. We normalize each numerical hyperparameter into [0, 1]. The setting of L.Δ is decided by the local search method. For the local search method we chose, L.Δ is a constant corresponding to the initial stepsize (fixed as 0.1).

Definition of ReachableInOneStep in Alg. 7. We define local search thread S1 to be reachable by S2 if the distance between their incumbents is no larger than the maximal distance between the next proposal of S2 and the incumbent of S2. In the local search method we use, the incumbent is the currently best config, and the maximal distance is equal to the stepsize.

B MORE DETAILS ABOUT EXPERIMENTS AND ADDITIONAL RESULTS

B.1 EXPERIMENT SETUP

Settings of BO and LS. For BO, we use the implementation from Optuna 2.0.0 (https://optuna.readthedocs.io/en/stable/index.html) with the default settings for the TPE sampler. For LS, we follow the implementation guidelines from Wu et al. (2021). After a local search thread is created from a particular starting point, we fix the categorical dimensions and only search over the numerical dimensions in that local search thread.
A local search thread S is considered to have converged (corresponding to S.converged() in Algorithm 7) once the stepsize of the local search thread is smaller than a lower bound introduced by Wu et al. (2021).

Experiments in tuning XGBoost and LightGBM. The XGBoost and LightGBM experiments are performed on a server with an Intel Xeon E5-2690 v4 2.6GHz CPU and 256GB RAM. A full list of the hyperparameters tuned and their ranges can be found in Table 3 and Table 2. The search space for numerical hyperparameters aligns with the search space used in Wu et al. (2021). On the same fold, the same random seed is used for LS, BO and BS. Experiments on different folds use different random seeds.

Experiments in NLP model fine-tuning. For ASHA, we set the min and max epochs to 1 and 16, and the reduction factor to 4.

Algorithm 6 CreateNewLSCondition
Inputs: l, F.
Outputs: |F.S| = 1 or l ≤ Median({S.l1st}_{S ∈ F.S\{S0}})

Algorithm 7 DeleteAndMergeLS
Inputs: S, F
1: if S.converged() then
2:   F.S ← F.S \ {S}
3:   F.R.xmin_i ← F.R.xmin_i − L.Δ and F.R.xmax_i ← F.R.xmax_i + L.Δ, ∀i ∈ P.D
4: else
5:   for S' ∈ F.S \ {S} do
6:     if S ∈ S'.ReachableInOneStep() and S'.l < S.l then
7:       F.S ← F.S \ {S}
8:       break
9:     else if S' ∈ S.ReachableInOneStep() and S.l < S'.l then F.S ← F.S \ {S'}

Result aggregation details. The aggregated rank in Figure 4(a) & (c) and 12(c) is calculated as follows: (1) per dataset per fold, each method is ranked based on the loss on the validation set at each second (x-axis), starting from when there is at least one finished config evaluation in any method; (2) the rank is then averaged across datasets per fold; (3) we finally compute the average rank (line) and confidence interval (shaded area) across the 10 folds. The scaled loss in Figure 4(b) is calculated similarly: per dataset per fold, min-max scaling is applied to each method using the maximum and minimum loss along the whole performance curve across all methods.

B.2 ADDITIONAL EXPERIMENTAL RESULTS ON LIGHTGBM AND XGBOOST

More performance curves on LightGBM and XGBoost. The performance curves for tuning LightGBM on 3 representative datasets with a 1h budget are shown in Figure 7. We observe that the performance of LS is quite good (compared to BO), especially on large datasets. This result is consistent with the results reported in Wu et al. (2021), where all the tuned hyperparameters are numerical. In our XGBoost tuning experiment, we include categorical hyperparameters. LS performs worse in this case because the introduction of categorical hyperparameters amplifies the local search method's limitation of being trapped in local optima. The observations about BlendSearch for LightGBM are similar to those for XGBoost tuning. The performance curves on the three large datasets with a 4h budget are shown in Figures 8 and 9, where similar conclusions can be drawn.

Multi-fidelity. We compare BO and BlendSearch with the multi-fidelity baseline ASHA for tuning LightGBM and XGBoost in Figure 10. In this experiment, we tried two choices of fidelity dimension with ASHA: the number of iterations and the sample size (the sample size begins with 10K, so small datasets are excluded). The results show that the multi-fidelity baselines overall perform no better than BO and are significantly worse than BlendSearch.
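For reference, the rank and min-max scaled-loss aggregation described above could be computed roughly as follows. This is a sketch; the array layout of the best-loss-so-far curves and the normal-approximation confidence interval are assumptions.

```python
import numpy as np

def aggregate_rank_and_scaled_loss(curves):
    """curves: array of shape (n_datasets, n_folds, n_methods, n_seconds) holding each
    method's best validation loss so far at every second, for every dataset and fold."""
    # (1) per dataset per fold, rank the methods by loss at every second (1 = best)
    ranks = curves.argsort(axis=2).argsort(axis=2) + 1

    # min-max scale per dataset/fold using the max and min loss over all methods and all time
    lo = curves.min(axis=(2, 3), keepdims=True)
    hi = curves.max(axis=(2, 3), keepdims=True)
    scaled = (curves - lo) / np.maximum(hi - lo, 1e-12)

    # (2) average across datasets per fold, then (3) mean and an approximate 95% CI across folds
    rank_per_fold = ranks.mean(axis=0)    # shape: (n_folds, n_methods, n_seconds)
    loss_per_fold = scaled.mean(axis=0)

    def mean_ci(a):
        m = a.mean(axis=0)
        ci = 1.96 * a.std(axis=0, ddof=1) / np.sqrt(a.shape[0])
        return m, ci

    return mean_ci(rank_per_fold), mean_ci(loss_per_fold)
```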
Table 2: Hyperparameters tuned in LightGBM.
hyperparameter    | type  | range                        | init
tree num          | int   | [4, min(32768, # instance)]  | 4
leaf num          | int   | [4, min(32768, # instance)]  | 4
min child weight  | float | [0.001, 20]                  | 20
learning rate     | float | [0.01, 0.1]                  | random
subsample         | float | [0.6, 1.0]                   | random
reg alpha         | float | [1e-10, 1.0]                 | random
reg lambda        | float | [1e-10, 1.0]                 | random
max bin           | int   | [7, 1023]                    | random
colsample by tree | float | [0.7, 1.0]                   | random

Table 3: Hyperparameters tuned in XGBoost.
hyperparameter     | type        | range                        | init
tree num           | int         | [4, min(32768, # instance)]  | 4
leaf num           | int         | [4, min(32768, # instance)]  | 4
min child weight   | float       | [0.001, 20]                  | 20
learning rate      | float       | [0.01, 0.1]                  | random
subsample          | float       | [0.6, 1.0]                   | random
reg alpha          | float       | [1e-10, 1.0]                 | random
reg lambda         | float       | [1e-10, 1.0]                 | random
colsample by level | float       | [0.6, 1.0]                   | random
colsample by tree  | float       | [0.7, 1.0]                   | random
booster            | categorical | {gbtree, gblinear}           | gblinear
tree method        | categorical | {auto, approx, hist}         | random

Table 4: Hyperparameters tuned in DeepTables.
hyperparameter        | type        | range                                      | init
early stopping rounds | int         | [1, max(min(1.5M / # instance, 150), 10)]  | 10
batch size            | int         | [16, 1024]                                 | 512
dropout               | float       | [0, 0.5]                                   | 0.1
learning rate         | float       | [1e-4, 3e-2]                               | 3e-4
dense dropout         | float       | [0, 0.5]                                   | 0.1
net                   | categorical | {DCN, dnn nets}                            | random
auto discrete         | boolean     | {False, True}                              | random
apply gbm features    | boolean     | {False, True}                              | random
fixed embedding dim   | boolean     | {False, True}                              | random

Table 5: Hyperparameters tuned in fine-tuning the Turing language model.
hyperparameter             | type        | range                                                                            | init
learning rate              | float       | [1e-6, 1e-3]                                                                     | random
hidden dropout probability | float       | [0.05, 0.4]                                                                      | random
warmup proportion          | float       | [0.2, 0.4]                                                                       | random
batch size                 | categorical | {16, 32}                                                                         | 32
epochs                     | int         | [1, 16]                                                                          | 1
learning rate scheduler    | categorical | {Warmup linear decay polynomial, Warmup linear, Warmup linear decay exponential} |

Figure 7: Optimization performance curves for LightGBM (1h) on (a) blood (loss = 1-auc), (b) numerai28.6 (loss = 1-auc), and (c) Albert (loss = 1-auc).

Figure 8: Optimization performance curves for LightGBM (4h) on (a) Fashion-MNIST, (b) dilbert, and (c) Covertype (loss = log-loss for all three).

Figure 9: Optimization performance curves for XGBoost (4h) on (a) Fashion-MNIST, (b) dilbert, and (c) Covertype (loss = log-loss for all three).

Figure 10: Aggregated results on LightGBM and XGBoost: (a) average rank, (b) scaled loss. PruneWithSample and PruneWithIter represent ASHA using the sample size and the iteration number as the resource dimension respectively. BO and BlendSearch are the same as those in Figure 4.

Figure 11: Aggregated rank and scaled loss on LightGBM with random initialization: (a) average rank, (b) scaled loss.

Figure 12: Results on DeepTables: (a) performance curves on vehicle (loss = log-loss), (b) aggregated rank, and (c) aggregated scaled loss. Compared methods: BOHB, Hyperband, ASHA, BlendSearch-BOHB, BlendSearch-Hyperband, BlendSearch-ASHA.
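To show how a search space like the one in Table 2 can be written down programmatically, here is one possible encoding using FLAML's tune sampling primitives. This is illustrative only: `n_instances` stands for the dataset size, whether a dimension is sampled on a log scale is an assumption, and the exact primitive names may differ across library versions.

```python
from flaml import tune

def lightgbm_search_space(n_instances):
    """One possible encoding of the Table 2 search space (illustrative sketch)."""
    upper = min(32768, n_instances)
    return {
        "n_estimators": tune.lograndint(4, upper),       # tree num
        "num_leaves": tune.lograndint(4, upper),         # leaf num
        "min_child_weight": tune.loguniform(0.001, 20),
        "learning_rate": tune.loguniform(0.01, 0.1),
        "subsample": tune.uniform(0.6, 1.0),
        "reg_alpha": tune.loguniform(1e-10, 1.0),
        "reg_lambda": tune.loguniform(1e-10, 1.0),
        "max_bin": tune.lograndint(7, 1023),
        "colsample_bytree": tune.uniform(0.7, 1.0),
    }

# low-cost initial values for the controlled dimensions,
# per the initialization setting described in Section 4
low_cost_partial_config = {"n_estimators": 4, "num_leaves": 4, "min_child_weight": 20}
```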
Ablation study on the low-cost initialization. In this work, we use low-cost initialization for the controlled dimensions of the hyperparameters. Although such information is fairly easy to obtain, we investigate our method's robustness when no controlled dimension is provided. We test BlendSearch in a controlled-dimension-agnostic setting: there are still hyperparameters with heterogeneous cost, but the controlled dimensions and a low-cost initial point are not specified as input. In such a scenario, BlendSearch uses random initialization and the config validator always returns yes. We compare BlendSearch with local search and BO under this setting using the same random initial point. In Figure 11(a) & (b) we report the aggregated rank and scaled loss on LightGBM across half of the datasets mentioned in Section 4.1. The results show that even when BlendSearch is agnostic to the controlled dimensions and a random initialization is used, it still outperforms both the local search method and BO.

B.3 TUNING DEEPTABLES

In this experiment, we tune 9-dimensional hyperparameters (5 numerical and 4 categorical, as detailed in Table 4) in DeepTables. Since the training of deep neural networks is more time-consuming than that of XGBoost, we run the DeepTables experiments on the datasets where DeepTables is worse than the best known performance in the benchmark, including shuttle, cnae, mfeat, vehicle, phoneme, and kc1. All experiments for DeepTables are performed on a server with the same CPU, 110GB RAM, and one Tesla P100 GPU. A full list of the hyperparameters tuned and their ranges can be found in Table 4.

Recall that multi-fidelity pruning strategies can be incorporated into BlendSearch in the config evaluator component. In this experiment, we are particularly interested in showing the performance of BlendSearch when combined with multi-fidelity methods. To this end, we include three state-of-the-art multi-fidelity methods, BOHB (Falkner et al., 2018), ASHA (Li et al., 2020), and asynchronous Hyperband (Li et al., 2017; 2020), which have been shown to be efficient for tuning deep neural networks, as well as the BlendSearch variant based on each of them. We use the following libraries for the baselines: for BOHB, we use HpBandSter 0.7.4 (https://github.com/automl/HpBandSter); for ASHA and asynchronous Hyperband, we use the implementations from Optuna 2.0.0. In all the methods compared, including both the existing methods and the variants of BlendSearch, the number of training epochs is used as the fidelity dimension, with the maximum epochs set to 1024, the reduction factor set to 3, and the minimum epochs set to 4. For ASHA, we set the minimum early stopping rate to 4 (we adopted this setting as it yields better performance compared to the default setting, i.e., 0).

BlendSearch incorporates existing multi-fidelity methods in the following way. Each config, whether proposed by global search or local search, uses the same schedule to increase the fidelity and check its pruning condition. For example, when ASHA (Li et al., 2020), i.e., asynchronous successive halving with a reduction factor of η, is used as the pruning strategy, after each config is evaluated at a certain fidelity, it is compared with the other configs already evaluated at the same fidelity. The config is pruned if its loss is ranked in the worst 1/η; otherwise, its fidelity is multiplied by η.
In addition to the original pruning conditions specified by the multi-fidelity method, a configuration is also pruned at a fidelity level at which no pruning has been performed yet if it does not yield superior performance (compared to the currently best performance) when evaluated at that fidelity level.

We present the performance of all the compared methods for tuning DeepTables in Figure 12. Figure 12(a) shows the learning curves on the dataset vehicle with a budget of 1h. Figure 12(b) and (c) show the aggregated rank and loss on all 6 datasets within a budget of 1h. The performance of the multi-fidelity methods is significantly improved when they are used within our BlendSearch framework.
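A minimal sketch of the successive-halving style pruning rule described above, including the extra condition of pruning non-improving configs at a fidelity where nothing has been pruned yet, might look like the following. It is an illustration with hypothetical class and method names, not the exact scheduler used in the experiments.

```python
from collections import defaultdict

class HalvingPruner:
    """Sketch of an ASHA-style pruning rule with reduction factor eta (illustration only)."""

    def __init__(self, eta=3, max_fidelity=1024):
        self.eta = eta
        self.max_fidelity = max_fidelity
        self.losses = defaultdict(list)   # fidelity -> losses observed at that fidelity
        self.pruned = defaultdict(int)    # fidelity -> number of configs pruned there

    def report(self, fidelity, loss, best_loss_so_far):
        """Record `loss` at `fidelity`; return the next fidelity to run, or None if pruned."""
        history = self.losses[fidelity]
        history.append(loss)
        # ASHA-style rule: prune if the loss ranks in the worst 1/eta at this fidelity
        if sorted(history).index(loss) >= len(history) * (1 - 1 / self.eta):
            self.pruned[fidelity] += 1
            return None
        # extra condition from the paper: at a fidelity where nothing has been pruned yet,
        # also prune configs that do not beat the currently best observed performance
        if self.pruned[fidelity] == 0 and loss >= best_loss_so_far:
            self.pruned[fidelity] += 1
            return None
        nxt = fidelity * self.eta
        return nxt if nxt <= self.max_fidelity else None
```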