ECONOMICAL HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY

Chi Wang, Qingyun Wu, Silu Huang, Amin Saied
Microsoft Corporation, Redmond, WA 98052, USA
{wang.chi,qingyun.wu,silu.huang,amin.saied}@microsoft.com

We study the problem of using low cost to search for hyperparameter configurations in a large search space with heterogeneous evaluation cost and model quality. We propose a blended search strategy to combine the strengths of global and local search, and prioritize them on the fly with the goal of minimizing the total cost spent in finding good configurations. Our approach demonstrates robust performance for tuning both tree-based models and deep neural networks on a large AutoML benchmark, as well as superior performance in model quality, time, and resource consumption for a production transformer-based NLP model fine-tuning task.

1 INTRODUCTION

Hyperparameter optimization (HPO) of modern machine learning models is a resource-consuming task, which is unaffordable to individuals or organizations with few resources (Yang & Shami, 2020). Operating HPO in a low-cost regime has numerous benefits, such as democratizing ML techniques, enabling new applications of ML that require frequent low-latency tuning, and reducing the carbon footprint. It is inherently challenging due to the nature of the task: trying a large number of configurations of heterogeneous cost and accuracy in a large search space. The expense can accumulate from multiple sources: either a large number of individually cheap trials or a small number of expensive trials can add up to a large total cost.

There have been multiple attempts to address the efficiency of HPO from different perspectives, each with its own strengths and limitations. For example, Bayesian optimization (BO) (Brochu et al., 2010), a class of global optimization algorithms, is used to minimize the total number of iterations needed to reach the global optima. However, when the cost of different hyperparameter configurations is heterogeneous, vanilla BO may select configurations that incur unnecessarily high cost. As opposed to BO, local search (LS) methods (Wu et al., 2021) are able to control total cost by preventing very expensive trials until necessary, but they may get trapped in local optima. Multi-fidelity methods (Jamieson & Talwalkar, 2016) aim to use cheap proxies to replace some of the expensive trials and approximate the accuracy assessment, but they can only be used when such proxies exist. A single search strategy can hardly meet the generic goal of economical HPO.

In this work, we propose a blended search strategy which combines global search and local search such that we can enjoy the benefits of both worlds: (1) global search can ensure convergence to the global optima when the budget is sufficient; and (2) local search methods enable better control of the cost incurred along the search trajectory. Given a particular global and local search method, our framework, named BlendSearch, combines them according to the following design principles. (1) Instead of sticking with a particular method for configuration selection, we consider both candidate search methods and decide which one to use at each round of configuration selection. (2) We use the global search method to help decide the starting points of local search threads.
(3) We use the local search method to intervene in the global search method's configuration selection to avoid configurations that may incur unnecessarily large evaluation cost. (4) We prioritize the search instances of both methods on the fly according to their performance and their efficiency of performance improvement. Extensive empirical evaluation on the AutoML Benchmark (Gijsbers et al., 2019) validates the robust performance of our method on a wide variety of datasets. BlendSearch is now publicly available in an open-source AutoML library (https://github.com/microsoft/FLAML).

Figure 1: A typical example of the different behaviors of BO, LS, and our proposed BlendSearch in tuning a set of 11-dimensional hyperparameters for XGBoost. (a) Loss vs. evaluation time for configs tried by each method; one point represents one config; the lower the loss, the better the quality of the config; longer evaluation time corresponds to larger training cost. (b) Best loss vs. optimization time per method; colored circles are configs proposed by LS threads in our method (a color change indicates a thread change); diamonds are configs proposed by BO in our method. BO is prone to selecting expensive but not necessarily good configs. LS avoids expensive configs in the beginning but is prone to getting stuck in local regions. BlendSearch switches between one BO and multiple LS search threads, prioritizes the more promising ones, and turns out to try more low-cost, high-quality configs.

2 BACKGROUND AND RELATED WORK

We first briefly introduce vanilla Bayesian optimization methods and local search methods, which are among the building blocks of our method.

Bayesian optimization is a class of global optimization algorithms suitable for optimizing expensive black-box functions. It models the probabilistic distribution of the objective conditioned on the optimization variables. Typical models include Gaussian processes (Snoek et al., 2012), random forests (Hutter et al., 2011), and the tree Parzen estimator (TPE) (Bergstra et al., 2011). In BO methods, an acquisition function is used to determine the next point to evaluate. Two common acquisition functions are the expected improvement (EI) (Bull, 2011) over the currently best-observed objective and the upper confidence bound (UCB) (Srinivas et al., 2009).

Local search methods are prevalent in the general optimization literature (Spall et al., 1992; Nesterov & Spokoiny, 2017) but less studied in the HPO literature due to the possibility of getting trapped in local optima (György & Kocsis, 2011). Recent work (Wu et al., 2021) shows that a local search method, FLOW2, can make HPO cost-effective when combined with low-cost initialization and random restart. At each iteration, it samples a pair of vectors (with opposite directions) uniformly at random from a unit sphere, the center of which is the best configuration found so far (a.k.a. the incumbent) and the radius of which is the current stepsize. Expensive configurations are avoided in the beginning because each iteration proposes a configuration near the incumbent. Random restart of the local search is performed once the convergence condition is satisfied.

There are several attempts to address the limitations of vanilla BO or local search methods. BOwLS (BO with local search) (Gao et al., 2020) uses a BO model to select the starting point of a local search thread. Each local search thread is run until convergence, and the BO model is updated with the start point and the converged loss.
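To make the FLOW2-style local search step described above concrete, the following is a minimal sketch rather than the original implementation: it assumes all hyperparameters are numerical and normalized to [0, 1]^d, and it uses a simplified stepsize-shrinking rule.

```python
import numpy as np

def flow2_step(evaluate, incumbent, best_loss, stepsize, rng):
    """One iteration of a FLOW2-style local search step (simplified sketch).

    evaluate: maps a config (np.ndarray in [0, 1]^d) to a loss.
    incumbent / best_loss: current best config and its observed loss.
    Samples a random unit direction and probes incumbent +/- stepsize * direction;
    moves to the first probe that improves the loss, otherwise shrinks the stepsize.
    """
    u = rng.normal(size=len(incumbent))
    u /= np.linalg.norm(u)                        # uniform direction on the unit sphere
    for signed in (u, -u):
        probe = np.clip(incumbent + stepsize * signed, 0.0, 1.0)
        loss = evaluate(probe)
        if loss < best_loss:                      # accept the improving probe
            return probe, loss, stepsize
    return incumbent, best_loss, 0.9 * stepsize   # illustrative shrink rule, not FLOW2's exact schedule
```

Random restart and FLOW2's exact stepsize schedule are omitted here; see Wu et al. (2021) for the full method.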
Trust region BO (Eriksson et al., 2019) fits a fixed number of local models and performs a principled global allocation of samples across these models via an implicit bandit approach. It is primarily designed for HPO problems with high-dimensional numerical hyperparameters. Unfortunately, none of the existing work that combines global search with local search considers the heterogeneity of the evaluation cost incurred along the search.

There are also many attempts to make HPO efficient by speeding up configuration evaluation. Multi-fidelity optimization methods (Klein et al., 2017; Li et al., 2017; Kandasamy et al., 2017; Falkner et al., 2018; Lu et al., 2019; Li et al., 2020) are proposed for this purpose. They usually require an additional degree of freedom in the problem, called fidelity, to allow performance assessment of a configuration at different fidelities. There is surprisingly little prior work on generic cost-effective HPO. Gaussian process with expected improvement per second (GPEIPS) (Snoek et al., 2012) models the evaluation cost using another Gaussian process and heuristically adds the estimated cost into the acquisition function. It does not always outperform GPEI because the acquisition function can over-penalize good but expensive configurations.

3 BLENDSEARCH

Our framework takes the following information as inputs.

B is the total budget of cost. In this work, we measure the cost by CPU/GPU time.

P is the input HPO problem, with the following attributes characterizing the problem. P.X is the search space, in which each element x is a d-dimensional hyperparameter configuration. For a non-categorical hyperparameter coordinate i ∈ [d], if different values of x_i lead to heterogeneous cost and there is a known value P.x_i^{LowCost} corresponding to a low cost, the coordinate is considered a controlled dimension. We use P.D to denote the set of controlled dimensions. P.LossFunc(·) is the loss function to be minimized over configurations x ∈ P.X. P.CostFunc(·) is the cost function that outputs the cost incurred when evaluating x. The goal of an HPO algorithm π is to minimize the loss P.LossFunc(x) under the constraint that the total cost incurred satisfies G(π) := Σ_{x ∈ I(π)} P.CostFunc(x) ≤ B, where I(π) is the search trajectory of algorithm π. Note that both P.LossFunc(x) and P.CostFunc(x) are black-box functions, meaning that their analytic form is typically not available and only function values can be observed. To distinguish the operation of querying the loss/cost function from the loss/cost observation, we use P.LossFunc(x) and P.CostFunc(x) to denote the former, and l(x) and c(x) to denote the latter. l and c (omitting x) are used when there is no ambiguity.

G is the global search method to be used. L is the local search method to be used. L.Δ is the largest stepsize used in local search method L, i.e., the largest possible change of a hyperparameter value between two consecutive search steps.

The overall design of our framework is presented in Figure 2 and Algorithm 1. The key idea is to maintain a pool of search threads, one of which corresponds to global search and the others to local search. The pool starts with one global search thread and gradually adds local search threads as the search goes on.

Figure 2: Framework. Both paths from the config evaluator are executed independently.
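For concreteness, the framework-level state F and the per-thread bookkeeping described in this section could be organized as in the following sketch. The names `SearchThread` and `FrameworkState` are illustrative and do not come from the actual FLAML implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SearchThread:
    """One search thread: an instance of the global or a local search method."""
    is_global: bool
    best_loss: float = float("inf")         # S.l1st
    second_best_loss: float = float("inf")  # S.l2nd
    cost_best: float = 0.0                  # S.c1st
    cost_second_best: float = 0.0           # S.c2nd
    cost_total: float = 0.0                 # S.c
    speed: float = 0.0                      # S.s, improvement per unit cost

@dataclass
class FrameworkState:
    """Framework-level bookkeeping F maintained by BlendSearch."""
    threads: List[SearchThread]                                 # F.S, threads[0] is the global thread
    priority: Dict[int, float] = field(default_factory=dict)    # F.Priority, keyed by thread index
    best_loss: float = float("inf")                             # F.l*
    total_cost: float = 0.0                                     # F.c
    # F.R: admissible region on the controlled dimensions, one [min, max] interval per dimension
    region_min: Dict[int, float] = field(default_factory=dict)
    region_max: Dict[int, float] = field(default_factory=dict)

    def admissible(self, config: Dict[int, float]) -> bool:
        """Config validator check: controlled dimensions must lie inside F.R."""
        return all(self.region_min[i] <= config[i] <= self.region_max[i]
                   for i in self.region_min)
```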
Here a search thread is an instance of a global search or local search method, each with its own search trajectory. At each round, a search thread selector selects one of the search threads from the pool according to a priority metric that reflects each search thread's current performance and efficiency of performance improvement. The selected search thread is then used to propose a configuration to evaluate in this round. When the selected search thread is the global search thread, a config validator first checks whether the proposed configuration is within the admissible region for evaluation. If not, a backup local search thread is used instead. A local search thread is created only when the global search thread proposes a valid config and a certain create-thread condition is met, and it is deleted once it converges. Two local search threads are merged into one if they are close enough. The priority of each search thread is updated after each evaluation round.

Algorithm 1 BlendSearch
Inputs: HPO problem P, a global search method G, a local search method L, total budget B.
1: Initialization: F.S = [S0], where S0 is an instance of G. We denote by F.x0 the initial point of the search. By design, the values of the controlled dimensions of F.x0 are set to P.x^{LowCost} and the values of the other dimensions are proposed by S0.
2: while F.c < B do
3:   S, Sbak ← SelectThread(F)
4:   (x, l, c) ← SearchEvaluate1Step(S, F, P)
5:   if x is invalid then (x, l, c) ← SearchEvaluate1Step(Sbak, F, P)
6:   if x is proposed by global search and CreateNewLSCondition is satisfied then
7:     S' ← InitializeSearchThread(L, P, (x, l, c))
8:     Add the new LS thread S' into the pool: F.S ← F.S + [S']
9:   DeleteAndMergeLS(F)  // merge or delete existing LS threads when necessary
10:  Update F.Priority
11:  if x is proposed by global search then UpdateGSModel(S0, (x, l, c))

For convenience, we use F to denote a collection of framework-level variables. F.S is the list of search threads maintained in our framework; it contains at least one search thread, and exactly one of them is the global search thread S0. F.Priority is a priority dictionary whose keys are the search threads in F.S and whose values are the priorities of the corresponding search threads. Bookkeeping information of F: F.l* is the best loss achieved among all the search threads and F.c is the total cost consumed in our framework. F.R is the admissible region on the controlled dimensions of the current search, which is a hyperrectangle and can be written as F.R := {[F.xmin_i, F.xmax_i]}_{i ∈ P.D}, with F.xmin_i and F.xmax_i denoting the minimum and maximum value along the i-th dimension of F.R respectively. They are initially set as F.xmin_i = F.xmax_i = P.x_i^{LowCost} for all i ∈ P.D. The admissible region gradually expands during the search: (1) it is expanded to cover all the points evaluated by all the search threads and all the points that can be reached by the local search within one search step, as shown in lines 7 and 8 of Algorithm 2; (2) it expands when a local search thread converges, as shown in line 3 of Algorithm 7 (included in Appendix A). In the following, we explain the key steps of our algorithm.

Step 1: Search thread selector (line 3 of Alg. 1). In addition to the primary search thread S, SelectThread also outputs a backup search thread Sbak, which is guaranteed to be a local search thread. It is set to None when there is no local search thread yet in F.S.
Specifically,

S = argmax_{S ∈ F.S} F.Priority(S);  Sbak = argmax_{S ∈ F.S\{S0}} F.Priority(S) if F.S\{S0} ≠ ∅, and Sbak = None otherwise.   (1)

The design of the priority metric follows the principle of optimism in the face of uncertainty from the multi-armed bandit problem to balance exploitation and exploration (Lattimore & Szepesvári, 2020). Specifically, linear extrapolation is performed adaptively and locally to calculate the improvement speed of each search thread. The estimated future reward based on such a linear extrapolation provides a first-order upper bound of the ground-truth future reward, assuming each search thread has a diminishing return, i.e., the speed of improvement decreases as more resource is spent. Formally, we introduce the following variables and functions for each search thread S ∈ F.S.

Bookkeeping information: S.l1st and S.l2nd are the best loss so far and the second best loss before the best loss was achieved. S.c1st and S.c2nd are the total cost consumed when S.l1st and S.l2nd were achieved respectively. S.c is the total cost spent in S. S.x1st is the best configuration found so far. S.s is the performance improvement speed of S, calculated as

S.s = (S.l2nd − S.l1st) / (S.c − S.c2nd).

This formula is only valid when there has been at least one improvement; otherwise we do not have enough information to estimate the speed of improvement. We set the speed to the highest speed among all the search threads when S.l2nd = S.l1st, which reflects the implicit assumption of diminishing return. S.xmin_i and S.xmax_i are the minimum and maximum value of the i-th dimension over all hyperparameter configurations evaluated in S. S.CostImp(·) is a function whose input is a target loss and whose output is the anticipated cost for S to reach a loss better than this target. We compute it with the following formula, which intuitively uses the cost for improvement in the past to estimate that in the future:

S.CostImp(l) = max{ S.c − S.c1st, S.c1st − S.c2nd, 2(S.l1st − l) / S.s }.   (2)

Our proposed priority metric is essentially the negative of the projected loss of S:

F.Priority(S) = −(S.l1st − S.s · b),   (3)

in which b = min( max_{S' ∈ F.S} S'.CostImp(F.l*), B − F.c ). Here max_{S' ∈ F.S} S'.CostImp(F.l*) can be considered as the resource needed for every search thread to achieve a better performance than the currently best performance F.l*. Our priority metric estimates the loss of each search thread if such an amount of resource (restricted by the remaining budget B − F.c) is given. By considering both a search thread's current performance and its potential improvement, it provides a fair trade-off between exploiting the currently-best and exploring the potentially-better choices.

Algorithm 2 SearchEvaluate1Step
Inputs: HPO problem P, search thread S, and F
1: if S is None then construct x as follows: generate the controlled dimensions of x by adding Gaussian noise to the corresponding dimensions of F.x0, and sample the remaining dimensions uniformly at random from the search space P.X
2: else x ← S.ProposeConfig()
3: if S = S0 (i.e., S is the global search thread) and x ∉ F.R then x ← invalid
4: else l, c ← P.LossFunc(x), P.CostFunc(x)  // evaluate configuration x
5: if x ≠ invalid then
6:   BookKeeping(S, x, l, c, F) and update the speed S.s
7:   ∀i ∈ P.D: S.xmin_i ← min{x_i, S.xmin_i}, F.xmin_i ← min{S.xmin_i − L.Δ, F.xmin_i}
8:   ∀i ∈ P.D: S.xmax_i ← max{x_i, S.xmax_i}, F.xmax_i ← max{S.xmax_i + L.Δ, F.xmax_i}
Outputs: x, l, c
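A minimal sketch of how the selection and priority of Eqs. (1)-(3) could be computed is given below, written against the illustrative `SearchThread`/`FrameworkState` fields introduced earlier; the actual FLAML implementation may differ in its details.

```python
def cost_for_improvement(thread, target_loss, max_speed):
    """Eq. (2): anticipated cost for `thread` to beat `target_loss`."""
    speed = thread.speed if thread.speed > 0 else max_speed
    eci = max(thread.cost_total - thread.cost_best,
              thread.cost_best - thread.cost_second_best)
    if thread.best_loss > target_loss and speed > 0:
        eci = max(eci, 2 * (thread.best_loss - target_loss) / speed)
    return eci

def update_priorities(state, budget):
    """Eq. (3): priority = -(projected loss after spending b more cost)."""
    max_speed = max((t.speed for t in state.threads), default=0.0)
    # b = min( max_S S.CostImp(F.l*), remaining budget )
    b = min(max(cost_for_improvement(t, state.best_loss, max_speed)
                for t in state.threads),
            budget - state.total_cost)
    for idx, t in enumerate(state.threads):
        speed = t.speed if t.speed > 0 else max_speed
        state.priority[idx] = -(t.best_loss - speed * b)

def select_thread(state):
    """Eq. (1): primary thread by priority; backup is the best local thread (or None)."""
    order = sorted(range(len(state.threads)), key=lambda i: state.priority[i], reverse=True)
    primary = order[0]
    local = [i for i in order if not state.threads[i].is_global]
    backup = local[0] if local else None
    return primary, backup
```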
Step 2: Config validator and evaluator (lines 4-5 of Alg. 1). After a search thread (and a backup search thread) is selected, the next step is to propose the next configuration to try with the chosen search thread(s). Intuitively, we generate the next configuration primarily according to the selected search thread S, whose priority is ranked the highest. But we set a guard rail for the global search thread, as it may propose an unnecessarily high-cost configuration. We thus introduce a config validator that validates the configurations proposed by global search according to whether they are within the current admissible region F.R of our framework (line 3 of Alg. 2). A configuration marked as invalid is considered prone to incurring unnecessarily high cost and will not be evaluated in this round. In this case, the selected backup search thread is used to perform another round of SearchEvaluate1Step (line 5 of Alg. 1) if it is a valid search thread (i.e., not None). In the case where the backup thread is None, we generate the new configuration according to line 1 of Alg. 2. The config validator helps avoid potentially high-cost evaluations and thus avoids creating local search threads from high-cost points until necessary. It does not stick to local searches forever because the admissible region F.R gets expanded. Note that according to the definition of F.R, only the controlled dimensions of a hyperparameter configuration are subject to the validity check. If needed, a multi-fidelity pruning strategy can be used in this config evaluator component. Multi-fidelity pruning does not necessarily yield better performance, so its adoption in BlendSearch is optional.

Step 3: Search thread creator, updater and cleaner (lines 6-11 of Alg. 1). If the newly proposed configuration was proposed by global search and is not marked as invalid, we consider creating a new local search thread using the proposed configuration as a starting point. To make sure the newly created local search thread is relatively good, we first check whether the proposed configuration's performance is better than at least half of the existing threads' performance (specified in CreateNewLSCondition). If so, a new local search thread is initialized and added to the active search thread pool F.S. In DeleteAndMergeLS, we check whether a local search thread has converged according to the convergence condition of the specific local search method. If it has, the search thread is removed from F.S. In addition, we also go through all the local search threads to check whether the incumbent of an LS thread is reachable in one step by another LS thread with lower loss (ref. Appendix A). If so, the former LS thread is deleted. After a configuration proposed by global search is evaluated, the observation tuple (x, l, c) is used to update the model of the global search method through the function UpdateGSModel. For example, when the global search method is a Bayesian optimization method, the model is its surrogate model. Due to the page limit, detailed pseudocode for several straightforward functions mentioned in our framework is provided in Appendix A, including CreateNewLSCondition, DeleteAndMergeLS, InitializeSearchThread and BookKeeping.
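The thread creation and merging logic of Step 3 admits a compact, self-contained sketch such as the one below. The function and variable names are illustrative, and the merge rule uses Euclidean distance between incumbents against the current stepsize, per the ReachableInOneStep definition in Appendix A.

```python
import math
from statistics import median

def should_create_local_thread(loss, local_best_losses):
    """CreateNewLSCondition: create a new LS thread if there is none yet, or the new
    loss is at least as good as the median of the existing LS threads' best losses."""
    return not local_best_losses or loss <= median(local_best_losses)

def merge_local_threads(incumbents, best_losses, stepsize):
    """DeleteAndMergeLS (merge part, simplified): drop an LS thread whose incumbent is
    reachable in one step (within `stepsize`) of another LS thread with a lower loss."""
    keep = set(range(len(incumbents)))
    for i in range(len(incumbents)):
        for j in range(len(incumbents)):
            if i == j or i not in keep or j not in keep:
                continue
            dist = math.dist(incumbents[i], incumbents[j])
            if dist <= stepsize and best_losses[j] < best_losses[i]:
                keep.discard(i)  # thread i is dominated by a nearby, better thread j
    return sorted(keep)
```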
4 EXPERIMENTS

We evaluate BlendSearch in tuning tabular machine learning libraries on an AutoML benchmark (Gijsbers et al., 2019), and in fine-tuning NLP models for text data. The AutoML benchmark consists of 39 tabular datasets that represent real-world data science classification problems. It includes datasets of all sizes, from different problem domains, and with various levels of difficulty. As each dataset has 10 cross-validation folds, all the results reported in this paper are averaged over the 10 folds. With this benchmark, we are able to evaluate multiple HPO methods on a large number of datasets within a manageable computational budget, for tuning three machine learning libraries: XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017) and DeepTables (https://github.com/DataCanvasIO/DeepTables). The first two are popular libraries based on gradient boosted trees, and the third is an easy-to-use deep learning toolkit which utilizes the latest research findings for tabular data. We chose them because gradient boosted trees and deep neural networks are the most frequent winners in data science competitions. We run experiments for XGBoost and LightGBM on the AutoML benchmark and report the results in Section 4.1. As a real application, we report an NLP model fine-tuning task for a production use case in Section 4.2. Finally, in Section 4.3 we perform an ablation study to investigate the effectiveness of several important components of our framework. Due to the page limit, we include the results for tuning DeepTables in Appendix B.

We include the following baselines in our experiments.

BO (Akiba et al., 2019): the Bayesian optimization baseline. We choose a modern HPO library, Optuna, and use the TPE sampler because of its flexibility in handling mixed continuous and discrete spaces and its good performance reported in existing work (Falkner et al., 2018).

LS (Wu et al., 2021): the recent baseline of local search with random restart, based on FLOW2. It has been shown to control cost effectively and to outperform BO methods for numerical cost-related hyperparameter search.

BOwLS (Gao et al., 2020): an existing approach that combines local search with BO, i.e., using BO to propose starting points for local search.

ASHA (Li et al., 2020), i.e., asynchronous successive halving: a state-of-the-art HPO method that uses multi-fidelity optimization and supports parallel tuning.

Initialization setting. For the local search method used in our framework, a low-cost initialization is needed to realize its unique advantage in controlling the cost. It is implemented by setting a low-cost initial value for each of the controlled dimensions. For example, among the hyperparameters tuned in LightGBM (shown in Table 2 of Appendix B), three hyperparameters, namely tree num, leaf num and min child weight, have initial values corresponding to the min or max values of their ranges, depending on whether they have a positive or negative correlation with the evaluation cost. This does not require the loss of the initial configuration to be low. Only a single low-cost initial value for each controlled dimension needs to be specified as input. To ensure a fair evaluation, we use the same low-cost initial point (if controlled dimensions exist) as the starting point for all the baselines.

Figure 3: Optimization performance curves for XGBoost on (a) credit (loss = 1-auc), (b) KDDCup09 (loss = 1-auc), and (c) volkert (loss = log-loss). Lines correspond to the mean loss over 10 folds, and shades correspond to 95% confidence intervals. 1-auc is used for binary classification and log-loss is used for multi-class classification.
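As an illustration of how such a low-cost initial point can be supplied in practice, the open-source library mentioned in Section 1 exposes a tuning API along the following lines. This is a sketch based on FLAML's documented tune interface; argument names and defaults may vary across library versions, and the LightGBM search space shown here is abbreviated.

```python
from flaml import tune
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_val, y_train, y_val = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2)

def evaluate_config(config):
    # train one configuration and report 1-auc as the loss to minimize
    model = lgb.LGBMClassifier(n_estimators=int(config["n_estimators"]),
                               num_leaves=int(config["num_leaves"]),
                               learning_rate=config["learning_rate"])
    model.fit(X_train, y_train)
    loss = 1 - roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return {"val_loss": loss}

analysis = tune.run(
    evaluate_config,
    config={                                   # abbreviated search space
        "n_estimators": tune.lograndint(4, 32768),
        "num_leaves": tune.lograndint(4, 32768),
        "learning_rate": tune.loguniform(0.01, 0.1),
    },
    # low-cost values for the controlled dimensions (tree num and leaf num)
    low_cost_partial_config={"n_estimators": 4, "num_leaves": 4},
    metric="val_loss",
    mode="min",
    num_samples=-1,          # let the time budget drive stopping
    time_budget_s=60,        # total cost budget B in seconds
)
print(analysis.best_config)
```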
Figure 4: Aggregated rank and scaled loss on LightGBM and XGBoost: (a) average rank, (b) scaled loss, (c) rank for the ablation study (BS-Priority:RoundRobin, BS-w/o-ConfigValidator, BS-CreateCond:Always, BS-CreateCond:BestOrFirst). The scaled loss is obtained by min-max scaling using the max and min loss over all time across all the methods. Sub-figures (a) & (b) share the same legend, and BS is short for BlendSearch in sub-figure (c). Both rank and scaled loss are aggregated first across datasets and then across the 10 folds of each dataset.

4.1 TUNING XGBOOST AND LIGHTGBM

We tune a set of 9-dimensional hyperparameters (all numerical) in LightGBM and 11-dimensional hyperparameters (9 numerical and 2 categorical) in XGBoost. A detailed description of the search space can be found in Appendix B. In this section, we omit the multi-fidelity baseline, ASHA, because we did not find a fidelity dimension that works well for tuning LightGBM and XGBoost (ref. additional results in Appendix B). We perform the evaluation on 37 of the 39 datasets in the AutoML benchmark (Robert and Dionis are excluded due to out-of-memory errors and extremely long training time). The input budget B (in terms of CPU time) is set to 4 hours for the 3 largest of the 37 datasets, and 1 hour for the rest.

From the performance curves shown in Figure 3, we observe that BO tends to perform well on small datasets, e.g., credit in Figure 3(a), where the 1h budget is sufficient. Under the same budget, it may perform badly on large datasets, e.g., volkert in Figure 3(c), as it may try configurations which consume a very large portion of the budget at an early stage of the search. On a medium-sized dataset, e.g., KDDCup09 in Figure 3(b), the local search method is more efficient than BO in the early stage but is outperformed in the later stage. BlendSearch performs similarly to the better of LS and BO in the early stage, and surpasses both of them in the later stage. We also observe that BOwLS performs similarly to LS (sometimes worse). This is because BOwLS needs to wait until a local search converges before proposing a new one. The aggregated results over all the test cases in Figure 4(a) & (b) are consistent with these observations.

The interplay between local and global search in BlendSearch. We investigated the dynamics of local and global search thread selection in BlendSearch. As a case study, we show the results of tuning XGBoost on two datasets in Figure 1(b) and Figure 5(a).

Figure 5: Search thread selection in BlendSearch when tuning XGBoost on the dataset KDDCup09. In this figure (and in Figure 1(b), which shows the result for the dataset volkert), each marker corresponds to one search thread selection in BlendSearch. The diamonds correspond to the global search thread, and the solid circles correspond to the local search threads (different colors for different threads).

Figure 6: The distributions of the number of thread switches and the number of local search threads induced by the GS (short for global search) in BlendSearch for tuning XGBoost. The result is aggregated over all the datasets evaluated in tuning XGBoost. The outliers with a very large number of thread switches and threads induced by GS are from very small datasets, where the number of evaluations is large.
The result in Figure 1(b) shows that the global search indeed plays a role after 1000s, which contributes to BlendSearch's better performance compared to LS. The result in Figure 5(a) shows that although the first point suggested by the global search (at around 200s) does not yield significantly better performance immediately, the induced new local search thread (the thread in green circles) leads to significantly better performance soon after. These two figures together indicate that global search is not only responsible for directly finding better configurations, but also for creating local search threads that can achieve yet better performance. We show two statistics about the overall interplay between global and local search in BlendSearch over all the datasets evaluated for tuning XGBoost in Figure 6. We provide additional experiments for tuning XGBoost and LightGBM in Appendix B, including a comparison with ASHA under different settings of fidelity, and an empirical study of the effect of the low-cost initialization.

Takeaway. (1) BlendSearch is able to overcome the limitations of BO and LS and, at the same time, inherit their advantages. By blending BO and LS, BlendSearch outperforms both on this large collection of datasets over time. (2) The interplay between global and local search indeed contributes to BlendSearch's good performance.

4.2 TRANSFORMER-BASED NLP MODEL FINE-TUNING

This section presents an application of economical hyperparameter optimization to an NLP model fine-tuning task used in a large software company. It starts from a large transformer-based model, Turing-NLRv2, with 24 transformer layers, each with 16 attention heads and hidden dimension 1024, totaling 340M parameters. It is pretrained on English Wikipedia (2,500M words) and BookCorpus (800M words), and uses byte-pair encoding (msturing.org). This pre-trained model is then fine-tuned (Dai & Le, 2015) for use in multiple production scenarios, including sequence classification, named entity recognition and question answering. The fine-tuning procedure is performed for a dozen separate tasks, and is repeated on a regular cadence, typically every few weeks. We focus our experiment on a single sequence classification fine-tuning task where the objective is to label a document (consisting of one or more sentences) with one of five possible classes. For fine-tuning this model we introduce a classification layer with 1024 × 5 = 5120 additional weight parameters, randomly initialized. The dataset used for training consists of 52K labeled examples, which we split 80/20 for training/validation. The objective to maximize is the f1-score obtained on the validation set of 10.4K labeled documents. Selecting hyperparameters for fine-tuning this model has been a manual process that typically takes a data scientist a few days.

Table 1: Results of language model fine-tuning.
method      | wall-clock                                   | GPU-hours | f1
Manual      | 10-20 working hours, spanned over a few days | 40-80h    | 0.645
BlendSearch | 3h                                           | 12h       | 0.652
ASHA-1      | 6h                                           | 24h       | 0.624
ASHA-16     | 6h                                           | 384h      | 0.640

In our experiment we use a 6-dimensional search space (4 numerical and 2 categorical). A detailed description of all the hyperparameters tuned, including their ranges, can be found in Table 5 in Appendix B. We compare to ASHA and let it use 16 VMs with 4 NVIDIA Tesla V100 GPUs on each VM. We run ASHA with 16 concurrent jobs for 6 hours of wall-clock time.
That amounts to 4 × 16 × 6 = 384 GPU-hours of hardware cost. We run BlendSearch on a single VM of the same configuration for 3 hours, which uses 12 GPU-hours in total. For comparison, we include ASHA-1 using the same single VM. The results are summarized in Table 1.

Takeaway. Not only does BlendSearch find a more accurate model than both ASHA and the manual effort, it does so faster and consumes only 3% of the resources of ASHA-16.

4.3 ABLATION STUDY

To investigate the effectiveness of several important modules of our framework (colored in orange in Figure 2), we perform an ablation study in tuning XGBoost on a random subset of the datasets (one third of the datasets mentioned in Section 4.1). We show the aggregated rank of different variants of our method in Figure 4(c). Specifically, we study the following three modules of the framework. (1) Priority metric: BS-Priority:RoundRobin uses a round-robin policy in the search thread selector. (2) Config validator: BS-w/o-ConfigValidator skips the validity check on the proposed configuration. (3) Create-new-thread condition: in BS-CreateCond:Always and BS-CreateCond:BestOrFirst, the following two conditions are used for creating new local search threads respectively: always create, and create a new thread only when the loss is better than all existing search threads' losses or there is no local search thread yet.

From this ablation study, we have the following observations: (1) Round-robin selection is worse than selection using our designed priority metric. Round-robin's relative performance becomes worse in the later stage of the search because it cannot avoid badly-performing search threads. (2) The config validator is also vital in our framework. (3) Overall, the condition for creating new threads has a smaller impact on our method compared to the other designs studied. The "better than half" condition used by default tends to perform the best.

5 EXTENSION AND FUTURE WORK

In the low-resource scenario targeted by this paper, a single trial is not resource-saturated even if we spend all the resources on it, so we do not recommend parallel trials in this scenario. In the case where more resources are available than the maximum a single trial can consume, our framework can be extended by running the trials from different search threads on multiple workers. For example, if there are additional workers available, we can keep invoking the search thread selector (but skip the local search threads that already have O(d) trials running). Our design of having multiple independent local search threads naturally allows efficient asynchronous parallel trials. The design of utilizing existing global optimization methods allows existing easy-to-parallelize global optimization methods (such as random search or batch versions of BO) to be plugged in. The prioritization of search threads is still useful as long as the maximal number of concurrent trials divided by the number of search threads is smaller than O(d). Since our method can be used together with multi-fidelity pruning methods, it can naturally inherit their asynchronous resource scheduling when used in the parallel setting. Parallelization is supported in the latest version of BlendSearch's implementation. In this work, we show the effectiveness of BlendSearch through an extensive empirical study.
As future work, it is worth studying the theoretical properties of BlendSearch, including theoretical guarantees on its convergence rate and total resource consumption.

REFERENCES

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.

Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.

Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879-2904, November 2011.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In KDD, 2016.

Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, 2015.

David Eriksson, Michael Pearce, Jacob Gardner, Ryan D. Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems, 2019.

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

Yuzhou Gao, Tengchao Yu, and Jinglai Li. Bayesian optimization with local search. In International Conference on Machine Learning, Optimization, and Data Science, 2020.

Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. An open source AutoML benchmark. In AutoML Workshop at ICML 2019, 2019.

András György and Levente Kocsis. Efficient multi-start strategies for local search algorithms. Journal of Artificial Intelligence Research, 41:407-444, 2011.

Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, 2011.

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 2016.

Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 2017.

Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, 2017.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A system for massively parallel hyperparameter tuning. In Proceedings of Machine Learning and Systems, 2020.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar.
Hyperband: A novel bandit-based approach to hyperparameter optimization. In ICLR'17, 2017.

Zhiyun Lu, Liyu Chen, Chao-Kai Chiang, and Fei Sha. Hyper-parameter tuning under a budget constraint. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.

James C. Spall et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341, 1992.

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

Qingyun Wu, Chi Wang, and Silu Huang. Frugal optimization for cost-related hyperparameters. In AAAI'21, 2021.

Li Yang and Abdallah Shami. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295-316, 2020.

A MORE DETAILS ABOUT BLENDSEARCH

We provide detailed pseudocode of the sub-algorithms used in our framework.

Algorithm 3 SelectThread
Inputs: framework-level state F, which keeps a list of candidate search threads F.S and a priority function F.Priority.
1: S ← argmax_{S ∈ F.S} F.Priority(S)
2: Sbak ← argmax_{S ∈ F.S\{S0}} F.Priority(S) if F.S\{S0} ≠ ∅, and Sbak ← None otherwise
Outputs: S, Sbak

Algorithm 4 BookKeeping
Inputs: search thread S, one observation (x, l, c) generated by S, and F.
1: S.c ← S.c + c
2: if S is a local search thread then update {S.xmin_i}_{i ∈ [d]} and {S.xmax_i}_{i ∈ [d]}
3: if l < S.l1st then
4:   S.l2nd, S.c2nd ← S.l1st, S.c1st; S.l1st, S.c1st ← l, S.c; and S.xbest ← x
5: if l < F.l* then F.l* ← l

Algorithm 5 InitializeSearchThread
Inputs: search method L or G, problem P, and initial point (x, l, c).
1: S.π ← L.π or G.π
2: Initialize the bookkeeping information of S: S.l1st = S.l2nd = l, S.c1st = S.c2nd = S.c = c, and S.xmin_i = S.xmax_i = x_i for i ∈ P.D

Choice of L.Δ used in constructing the admissible region F.R. We normalize each numerical hyperparameter into [0, 1]. The setting of L.Δ is decided by the local search method. For the local search method we chose, L.Δ is a constant corresponding to the initial stepsize (fixed as 0.1).

Definition of ReachableInOneStep in Alg. 7. We define local search thread S1 to be reachable by S2 if the distance between their incumbents is no larger than the maximal distance between the next proposal of S2 and the incumbent of S2. In the local search method we use, the incumbent is the currently best config, and the maximal distance is equal to the stepsize.

B MORE DETAILS ABOUT EXPERIMENTS AND ADDITIONAL RESULTS

B.1 EXPERIMENT SETUP

Settings of BO and LS. For BO, we use the implementation from Optuna 2.0.0 (https://optuna.readthedocs.io/en/stable/index.html) with the default settings for the TPE sampler. For LS, we follow the implementation guidelines from Wu et al. (2021). After a local search thread is created from a particular starting point, we fix the categorical dimensions and only search over the numerical dimensions in that local search thread.
A local search thread S is considered to have converged (corresponding to S.converged() in Algorithm 7) once the stepsize of the local search thread is smaller than a lower bound introduced by Wu et al. (2021).

Experiments in tuning XGBoost and LightGBM. The XGBoost and LightGBM experiments are performed on a server with an Intel Xeon E5-2690 v4 2.6GHz CPU and 256GB RAM. A full list of the hyperparameters tuned and their ranges can be found in Table 3 and Table 2. The search space for numerical hyperparameters aligns with the search space used in Wu et al. (2021). On the same fold, the same random seed is used for LS, BO and BS. Experiments on different folds use different random seeds.

Experiments in NLP model fine-tuning. For ASHA, we set the min and max epochs to 1 and 16, and the reduction factor to 4.

Algorithm 6 CreateNewLSCondition
Inputs: l, F.
Outputs: |F.S| = 1 or l ≤ Median({S.l1st}_{S ∈ F.S\{S0}})

Algorithm 7 DeleteAndMergeLS
Inputs: S, F
1: if S.converged() then
2:   F.S ← F.S \ {S}
3:   F.R.xmin_i ← F.R.xmin_i − L.Δ and F.R.xmax_i ← F.R.xmax_i + L.Δ, ∀i ∈ P.D
4: else
5:   for S' ∈ F.S \ {S} do
6:     if S ∈ S'.ReachableInOneStep() and S'.l < S.l then
7:       F.S ← F.S \ {S}
8:       break
9:     else if S' ∈ S.ReachableInOneStep() and S.l < S'.l then F.S ← F.S \ {S'}

Result aggregation details. The aggregated rank in Figure 4(a) & (c) and 12(c) is calculated as follows: (1) per dataset per fold, each method is ranked based on the loss on the validation set at each second (x-axis), starting from when there is at least one finished config evaluation in any method; (2) the rank is then averaged across datasets per fold; (3) we finally compute the average rank (line) and confidence interval (shaded area) across the 10 folds. The scaled loss in Figure 4(b) is calculated similarly: per dataset per fold, min-max scaling is applied to each method using the maximum and minimum loss along the whole performance curve across all methods.

B.2 ADDITIONAL EXPERIMENTAL RESULTS ON LIGHTGBM AND XGBOOST

More performance curves on LightGBM and XGBoost. The performance curves for tuning LightGBM on 3 representative datasets with a 1h budget are shown in Figure 7. We observe that the performance of LS is quite good (compared to BO), especially on large datasets. This result is consistent with the results reported in Wu et al. (2021), where all the tuned hyperparameters are numerical. In our XGBoost tuning experiment, we include categorical hyperparameters. LS performs worse in this case because the introduction of categorical hyperparameters amplifies the local search method's limitation of being trapped in local optima. The observations about BlendSearch for LightGBM are similar to those for XGBoost tuning. The performance curves on the three large datasets with a 4h budget are shown in Figures 8 and 9, where similar conclusions can be drawn.

Multi-fidelity. We compare BO and BlendSearch with the multi-fidelity baseline ASHA for tuning LightGBM and XGBoost in Figure 10. In this experiment, we tried two choices of fidelity dimension with ASHA: the number of iterations and the sample size (the sample size begins with 10K, so small datasets are excluded). The results show that the multi-fidelity baselines overall perform no better than BO and are significantly worse than BlendSearch.
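For reference, the rank and min-max scaled-loss aggregation described above could be computed roughly as follows. This is a sketch; the array layout of the best-loss-so-far curves and the normal-approximation confidence interval are assumptions.

```python
import numpy as np

def aggregate_rank_and_scaled_loss(curves):
    """curves: array of shape (n_datasets, n_folds, n_methods, n_seconds) holding each
    method's best validation loss so far at every second, for every dataset and fold."""
    # (1) per dataset per fold, rank the methods by loss at every second (1 = best)
    ranks = curves.argsort(axis=2).argsort(axis=2) + 1

    # min-max scale per dataset/fold using the max and min loss over all methods and all time
    lo = curves.min(axis=(2, 3), keepdims=True)
    hi = curves.max(axis=(2, 3), keepdims=True)
    scaled = (curves - lo) / np.maximum(hi - lo, 1e-12)

    # (2) average across datasets per fold, then (3) mean and an approximate 95% CI across folds
    rank_per_fold = ranks.mean(axis=0)    # shape: (n_folds, n_methods, n_seconds)
    loss_per_fold = scaled.mean(axis=0)

    def mean_ci(a):
        m = a.mean(axis=0)
        ci = 1.96 * a.std(axis=0, ddof=1) / np.sqrt(a.shape[0])
        return m, ci

    return mean_ci(rank_per_fold), mean_ci(loss_per_fold)
```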
Table 2: Hyperparameters tuned in LightGBM.
hyperparameter    | type  | range                        | init
tree num          | int   | [4, min(32768, # instance)]  | 4
leaf num          | int   | [4, min(32768, # instance)]  | 4
min child weight  | float | [0.001, 20]                  | 20
learning rate     | float | [0.01, 0.1]                  | random
subsample         | float | [0.6, 1.0]                   | random
reg alpha         | float | [1e-10, 1.0]                 | random
reg lambda        | float | [1e-10, 1.0]                 | random
max bin           | int   | [7, 1023]                    | random
colsample by tree | float | [0.7, 1.0]                   | random

Table 3: Hyperparameters tuned in XGBoost.
hyperparameter     | type        | range                        | init
tree num           | int         | [4, min(32768, # instance)]  | 4
leaf num           | int         | [4, min(32768, # instance)]  | 4
min child weight   | float       | [0.001, 20]                  | 20
learning rate      | float       | [0.01, 0.1]                  | random
subsample          | float       | [0.6, 1.0]                   | random
reg alpha          | float       | [1e-10, 1.0]                 | random
reg lambda         | float       | [1e-10, 1.0]                 | random
colsample by level | float       | [0.6, 1.0]                   | random
colsample by tree  | float       | [0.7, 1.0]                   | random
booster            | categorical | {gbtree, gblinear}           | gblinear
tree method        | categorical | {auto, approx, hist}         | random

Table 4: Hyperparameters tuned in DeepTables.
hyperparameter        | type        | range                                      | init
early stopping rounds | int         | [1, max(min(1.5M / # instance, 150), 10)]  | 10
batch size            | int         | [16, 1024]                                 | 512
dropout               | float       | [0, 0.5]                                   | 0.1
learning rate         | float       | [1e-4, 3e-2]                               | 3e-4
dense dropout         | float       | [0, 0.5]                                   | 0.1
net                   | categorical | {DCN, dnn nets}                            | random
auto discrete         | boolean     | {False, True}                              | random
apply gbm features    | boolean     | {False, True}                              | random
fixed embedding dim   | boolean     | {False, True}                              | random

Table 5: Hyperparameters tuned in fine-tuning the Turing language model.
hyperparameter             | type        | range                                                                            | init
learning rate              | float       | [1e-6, 1e-3]                                                                     | random
hidden dropout probability | float       | [0.05, 0.4]                                                                      | random
warmup proportion          | float       | [0.2, 0.4]                                                                       | random
batch size                 | categorical | {16, 32}                                                                         | 32
epochs                     | int         | [1, 16]                                                                          | 1
learning rate scheduler    | categorical | {Warmup linear decay polynomial, Warmup linear, Warmup linear decay exponential} |

Figure 7: Optimization performance curves for LightGBM (1h) on (a) blood (loss = 1-auc), (b) numerai28.6 (loss = 1-auc), and (c) Albert (loss = 1-auc).

Figure 8: Optimization performance curves for LightGBM (4h) on (a) Fashion-MNIST, (b) dilbert, and (c) Covertype (loss = log-loss for all three).

Figure 9: Optimization performance curves for XGBoost (4h) on (a) Fashion-MNIST, (b) dilbert, and (c) Covertype (loss = log-loss for all three).

Figure 10: Aggregated results on LightGBM and XGBoost: (a) average rank, (b) scaled loss. PruneWithSample and PruneWithIter represent ASHA using the sample size and the iteration number as the resource dimension respectively. BO and BlendSearch are the same as those in Figure 4.

Figure 11: Aggregated rank and scaled loss on LightGBM with random initialization: (a) average rank, (b) scaled loss.

Figure 12: Results on DeepTables: (a) performance curves on vehicle (loss = log-loss), (b) aggregated rank, and (c) aggregated scaled loss. Compared methods: BOHB, Hyperband, ASHA, BlendSearch-BOHB, BlendSearch-Hyperband, BlendSearch-ASHA.
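To show how a search space like the one in Table 2 can be written down programmatically, here is one possible encoding using FLAML's tune sampling primitives. This is illustrative only: `n_instances` stands for the dataset size, whether a dimension is sampled on a log scale is an assumption, and the exact primitive names may differ across library versions.

```python
from flaml import tune

def lightgbm_search_space(n_instances):
    """One possible encoding of the Table 2 search space (illustrative sketch)."""
    upper = min(32768, n_instances)
    return {
        "n_estimators": tune.lograndint(4, upper),       # tree num
        "num_leaves": tune.lograndint(4, upper),         # leaf num
        "min_child_weight": tune.loguniform(0.001, 20),
        "learning_rate": tune.loguniform(0.01, 0.1),
        "subsample": tune.uniform(0.6, 1.0),
        "reg_alpha": tune.loguniform(1e-10, 1.0),
        "reg_lambda": tune.loguniform(1e-10, 1.0),
        "max_bin": tune.lograndint(7, 1023),
        "colsample_bytree": tune.uniform(0.7, 1.0),
    }

# low-cost initial values for the controlled dimensions,
# per the initialization setting described in Section 4
low_cost_partial_config = {"n_estimators": 4, "num_leaves": 4, "min_child_weight": 20}
```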
Ablation study on the low-cost initialization. In this work, we use low-cost initialization for the controlled dimensions of the hyperparameters. Although such information is fairly easy to obtain, we investigate our method's robustness when no controlled dimension is provided. We test BlendSearch in a controlled-dimension-agnostic setting: there are still hyperparameters with heterogeneous cost, but the controlled dimensions and a low-cost initial point are not specified as input. In such a scenario, BlendSearch uses random initialization and the config validator always returns yes. We compare BlendSearch with local search and BO under this setting using the same random initial point. In Figure 11(a) & (b) we report the aggregated rank and scaled loss on LightGBM across half of the datasets mentioned in Section 4.1. The results show that even when BlendSearch is agnostic to the controlled dimensions and a random initialization is used, it still outperforms both the local search method and BO.

B.3 TUNING DEEPTABLES

In this experiment, we tune 9-dimensional hyperparameters (5 numerical and 4 categorical, as detailed in Table 4) in DeepTables. Since the training of deep neural networks is more time-consuming than that of XGBoost, we run the DeepTables experiments on the datasets where DeepTables is worse than the best known performance in the benchmark, including shuttle, cnae, mfeat, vehicle, phoneme, and kc1. All experiments for DeepTables are performed on a server with the same CPU, 110GB RAM, and one Tesla P100 GPU. A full list of the hyperparameters tuned and their ranges can be found in Table 4.

Recall that multi-fidelity pruning strategies can be incorporated into BlendSearch in the config evaluator component. In this experiment, we are particularly interested in showing the performance of BlendSearch when combined with multi-fidelity methods. To this end, we include three state-of-the-art multi-fidelity methods, BOHB (Falkner et al., 2018), ASHA (Li et al., 2020), and asynchronous Hyperband (Li et al., 2017; 2020), which have been shown to be efficient for tuning deep neural networks, as well as the BlendSearch variant based on each of them. We use the following libraries for the baselines: for BOHB, we use HpBandSter 0.7.4 (https://github.com/automl/HpBandSter); for ASHA and asynchronous Hyperband, we use the implementations from Optuna 2.0.0. In all the methods compared, including both the existing methods and the variants of BlendSearch, the number of training epochs is used as the fidelity dimension, with the maximum epochs set to 1024, the reduction factor set to 3, and the minimum epochs set to 4. For ASHA, we set the minimum early stopping rate to 4 (we adopted this setting as it yields better performance compared to the default setting, i.e., 0).

BlendSearch incorporates existing multi-fidelity methods in the following way. Each config, whether proposed by global search or local search, uses the same schedule to increase the fidelity and check its pruning condition. For example, when ASHA (Li et al., 2020), i.e., asynchronous successive halving with a reduction factor of η, is used as the pruning strategy, after each config is evaluated at a certain fidelity, it is compared with the other configs already evaluated at the same fidelity. The config is pruned if its loss is ranked in the worst 1/η; otherwise, its fidelity is multiplied by η.
In addition to the original pruning conditions specified by the multi-fidelity method, a configuration is also pruned at a fidelity level at which no pruning has been performed yet if it does not yield superior performance (compared to the currently best performance) when evaluated at that fidelity level.

We present the performance of all the compared methods for tuning DeepTables in Figure 12. Figure 12(a) shows the learning curves on the dataset vehicle with a budget of 1h. Figure 12(b) and (c) show the aggregated rank and loss on all 6 datasets within a budget of 1h. The performance of the multi-fidelity methods is significantly improved when they are used within our BlendSearch framework.
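A minimal sketch of the successive-halving style pruning rule described above, including the extra condition of pruning non-improving configs at a fidelity where nothing has been pruned yet, might look like the following. It is an illustration with hypothetical class and method names, not the exact scheduler used in the experiments.

```python
from collections import defaultdict

class HalvingPruner:
    """Sketch of an ASHA-style pruning rule with reduction factor eta (illustration only)."""

    def __init__(self, eta=3, max_fidelity=1024):
        self.eta = eta
        self.max_fidelity = max_fidelity
        self.losses = defaultdict(list)   # fidelity -> losses observed at that fidelity
        self.pruned = defaultdict(int)    # fidelity -> number of configs pruned there

    def report(self, fidelity, loss, best_loss_so_far):
        """Record `loss` at `fidelity`; return the next fidelity to run, or None if pruned."""
        history = self.losses[fidelity]
        history.append(loss)
        # ASHA-style rule: prune if the loss ranks in the worst 1/eta at this fidelity
        if sorted(history).index(loss) >= len(history) * (1 - 1 / self.eta):
            self.pruned[fidelity] += 1
            return None
        # extra condition from the paper: at a fidelity where nothing has been pruned yet,
        # also prune configs that do not beat the currently best observed performance
        if self.pruned[fidelity] == 0 and loss >= best_loss_so_far:
            self.pruned[fidelity] += 1
            return None
        nxt = fidelity * self.eta
        return nxt if nxt <= self.max_fidelity else None
```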