# Hyper-parameter Tuning under a Budget Constraint

Zhiyun Lu¹, Liyu Chen¹, Chao-Kai Chiang², Fei Sha³
¹University of Southern California, ²Appier Inc., ³Google AI
{zhiyunlu, liyuc}@usc.edu, chaokai@gmail.com, fsha@google.com
(Author notes: contact author; work done while at USC; on leave from USC, feisha@usc.edu.)

## Abstract

Hyper-parameter tuning is of crucial importance for real-world machine learning applications. While existing works mainly focus on speeding up the tuning process, we propose to study the problem of hyper-parameter tuning under a budget constraint, which is a more realistic scenario in developing large-scale systems. We formulate the task as a sequential decision making problem and propose a solution that uses a Bayesian belief model to predict future performance, and an action-value function to plan and select the next configuration to run. With long-term prediction and planning capability, our method is able to early-stop unpromising configurations and adapt its tuning behavior to different constraints. Experimental results show that our method outperforms existing algorithms, including the state of the art, on real-world tuning tasks across a range of budgets.

## 1 Introduction

Hyper-parameter tuning is of crucial importance to designing and deploying machine learning systems. Broadly, hyper-parameters include the architecture of the learning models, regularization parameters, optimization methods, and other knobs to be tuned. It is challenging to explore the vast space of hyper-parameters efficiently to identify the optimal configuration. Quite a few approaches have been proposed: random search, Bayesian optimization (BO) [Snoek et al., 2012; Shahriari et al., 2016], the bandits-based Hyperband [Jamieson and Talwalkar, 2016; Li et al., 2016], and meta-learning [Chen et al., 2016; Bello et al., 2017; Franceschi et al., 2018].

Most of the prior works have focused on reducing as much as possible the computational cost of obtaining the optimal configuration. Finding the optimum, however, is not always practical: in deep learning, the search space of hyper-parameters is vast and it takes a long time to train each model. In this work, we look at a different but important perspective on hyper-parameter tuning: under a fixed time/computation cost, how can we improve the performance as much as possible?

Concretely, we study the problem of hyper-parameter tuning under a hard budget constraint. The budget offers practitioners a tradeoff: affordable resources balanced against good performance. It is a more realistic scenario in developing large-scale learning systems, and is especially applicable, for example, when the practitioner searches for the best model under the pressure of a deadline.

The budget constraint certainly complicates the hyper-parameter tuning strategy. While the strategy without the constraint is to explore and exploit in the hyper-parameter space, a budget-aware strategy needs to decide how much to explore and exploit with respect to the resource/time. As most learning algorithms are iterative in nature, a human operator would monitor the training progress of different configurations and adjust the tuning strategy based on their potential future performance and how much resource remains: we have all had the experience of killing a running job because "this training is going nowhere"! How can we automate this process?
We formalize this inquiry as a sequential decision making problem, and propose a budgeted hyper-parameter tuning algorithm (BHPT) to automatically achieve good resource utilization. The BHPT algorithm uses a belief model to predict the future performance of configurations, and an action-value function to select configurations. It automatically balances exploration with exploitation under the budget, and adapts to different constraints. We empirically demonstrate the performance of the proposed algorithm on both synthetic and real-world datasets. BHPT outperforms state-of-the-art tuning algorithms across a wide range of budgets. Besides, it exhibits budget-adaptive tuning behaviors.

The rest of the paper is organized as follows. We first discuss related work in Sec. 2. We formally define the problem and introduce the sequential formulation in Sec. 3. We describe the proposed algorithm with analysis in Sec. 4. Experiments and the conclusion can be found in Sec. 5 and Sec. 6, respectively.

## 2 Related Work

Automated design of machine learning systems is an important research topic [Snoek et al., 2012]. Traditionally, hyper-parameter optimization (HO) is formulated as black-box optimization and solved by Bayesian optimization (BO), which uses a probabilistic model together with an acquisition function to adaptively select and identify the optimal configuration [Shahriari et al., 2016]. However, fully training a single configuration can be expensive. Recent advances exploit cheaper surrogate tasks, such as training on subsets of the data (Fabolas [Klein et al., 2017]) or partial training curves (Freeze-Thaw [Swersky et al., 2014]), to speed up the tuning. Hyperband [Li et al., 2016] formulates HO as a best-arm identification problem: it adaptively evaluates a configuration's intermediate results and quickly eliminates poor arms. Lately, Falkner et al. [2018] combine the benefits of BO and Hyperband to achieve fast convergence to the optimum. Other HO work includes gradient-based methods [Maclaurin et al., 2015; Franceschi et al., 2017], meta-learning [Jaderberg et al., 2017; Franceschi et al., 2018], the spectral approach [Hazan et al., 2017], and adaptive data allocation [Sabharwal et al., 2016]. However, none of them has an explicit notion of a hard budget constraint for the tuning process, nor do they consider adapting the tuning strategy across different budgets. In this work, we propose to take the resource constraint as an input to the tuning algorithm.

Table 1 compares popular tuning algorithms to our work along three dimensions: whether the method uses a (probabilistic) model to adaptively predict and identify good configurations; whether it supports early stopping and resuming during training; and whether it adapts to different budgets.

| algorithm | early stop and resume | adaptive (future) prediction | budget aware |
| --- | --- | --- | --- |
| random search | ✗ | ✗ | ✗ |
| BO (GP-EI/SMAC) | ✗ | ✓ | ✗ |
| Fabolas | ✗ | ✓ | ✗ |
| Freeze-Thaw | ✓ | ✓ | ✗ |
| Hyperband | ✓ | ✗ | ✗ |
| BHPT (ours) | ✓ | ✓ | ✓ |

Table 1: Comparison of tuning algorithms

## 3 Problem Statement

We start by introducing the notation and formally defining the budgeted hyper-parameter tuning problem. We then formulate the task as a sequential decision making problem.

### 3.1 Preliminaries

Configuration (arm). A configuration denotes a setting of the hyper-parameters, e.g. the architecture or the optimization method. We use [K] = {1, ..., K} to index the set of configurations. The terms configuration and arm are used interchangeably.
(See the supplementary material, Sec. A, for a full notation table; supplementary material link: https://bit.ly/2ICQF0K.)

Model. A model refers to the (intermediate) training outcome of a particular configuration, e.g. the weights of a neural net. We evaluate the model on a held-out set periodically, for example every epoch, and use the loss or error rate as the evaluation metric. We keep track of $\nu^k_b \in [0, 1)$, the minimum loss of configuration $k$ over epochs $1, \dots, b$. Note that $\nu^k_b$ is non-increasing in $b$. We always use the superscript to denote the configuration and the subscript for the budget/epoch.

Budget. The budget defines a computation constraint imposed on the tuning procedure. In this paper, we consider training time, measured in epochs, as the budget unit; the epoch is an abstract notion of computation resource in most iterative learning algorithms. Given a total budget $B \in \mathbb{N}_+$, a strategy $\mathbf{b} = (b_1, \dots, b_K) \in \mathbb{N}^K$ allocates the budget among the $K$ configurations, i.e. it runs configuration $k$ for $b_k$ epochs. The strategy must satisfy that the total epochs over the $K$ configurations add up to $B$: $\mathbf{b}^\top \mathbf{1}_K = B$. We use epoch and budget interchangeably when there is no confusion.

Constrained Optimization. The goal of the budgeted hyper-parameter tuning task is to obtain a well-optimized model under the constraint. Under the allocation strategy $\mathbf{b}$, arm $k$ returns a model with loss $\nu^k_{b_k}$. We search for the strategy which optimizes the loss of the best model out of the $K$ configurations, $\ell_B = \min\{\nu^1_{b_1}, \dots, \nu^K_{b_K}\}$. Concretely, the constrained optimization problem is

$$\min_{\mathbf{b}} \; \ell_B, \quad \text{s.t.} \;\; \mathbf{b}^\top \mathbf{1}_K = B. \tag{1}$$

### 3.2 Optimal Solution with Perfect Information

Although Eq. 1 is a combinatorial optimization, if we knew the training curves of all configurations, solving it would simply amount to putting the entire budget on the configuration $c^*$ that achieves the smallest loss:

$$\ell^*_B = \min_k \nu^k_B, \qquad c^* = \arg\min_k \nu^k_B. \tag{2}$$

Namely, $\nu^{c^*}_B = \ell^*_B$. This is infeasible in practice because it would require $KB$ epochs to obtain $\{\nu^k_b\}_{k \in [K], b \in [B]}$. However, since $\nu^k_1, \dots, \nu^k_B$ are dependent variables, it is feasible to predict the future losses using historical information, which helps to select the next configuration. This motivates us to formulate budgeted tuning as a finite-horizon sequential decision making problem.

### 3.3 Sequential Decision Making

We define a sequential tuning procedure where, at each step, the model of one configuration is selected to train for one epoch. At the $n$-th step, the tuning algorithm selects the action/configuration $a_n \in [K]$, and the corresponding loss $z_n \in [0, 1)$ is returned. We want to find a policy $\pi$ which selects the action at each step based on the history and the remaining budget $r = B - n$:

$$a_n = \pi\big(r, (a_1, z_1, \dots, a_{n-1}, z_{n-1})\big).$$

This process is repeated for $B$ steps and the final tuning output is

$$\ell^\pi_B = \min_{1 \le n \le B} z_n. \tag{3}$$

Solving Eq. 3 is equivalent to solving Eq. 1 in a sequential manner. The advantages are that the tuning algorithm can stop and resume the training of a configuration at any time, and that it can leverage the information in partial training curves and the remaining budget $r$ to make more economical decisions. The framework can be extended to other settings, e.g. running multiple configurations in parallel, or resources of heterogeneous types (CPU, GPU, etc.). While some extensions are straightforward, others require more careful design and are left as future work.
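To make the objective concrete, here is a minimal numerical sketch (not from the paper) of Eqs. 1-3 on pre-computed loss curves: given the full curves, the oracle of Eq. 2 puts the whole budget on the best arm, whereas a sequential policy only observes the losses of the epochs it actually runs. The curve values and the naive round-robin policy below are illustrative assumptions, not the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B = 3, 12                      # 3 configurations (arms), total budget of 12 epochs
# Illustrative loss curves: curves[k, t] is the loss of arm k after epoch t+1 (decreasing).
curves = np.sort(rng.uniform(0.1, 1.0, size=(K, B)), axis=1)[:, ::-1]

# Oracle with perfect information (Eq. 2): spend all B epochs on the best arm.
nu_B = curves[:, B - 1]           # nu^k_B, the minimum loss of arm k within B epochs
ell_star, c_star = nu_B.min(), nu_B.argmin()

# A sequential policy (Eq. 3) only sees the epochs it runs; here, naive round-robin.
pulls = np.zeros(K, dtype=int)    # b_k, epochs allocated to each arm so far
observed = []
for n in range(B):
    a = n % K                     # round-robin choice; BHPT instead picks argmin_a Q_r[a]
    z = curves[a, pulls[a]]       # loss returned after training arm a for one more epoch
    pulls[a] += 1
    observed.append(z)
ell_pi = min(observed)            # ell^pi_B

print(f"oracle: arm {c_star}, loss {ell_star:.3f}; "
      f"round-robin loss {ell_pi:.3f}; gap to oracle {ell_pi - ell_star:.3f}")
```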
Regret. The goal of the sequential problem is to optimize the policy $\pi$ such that the regret is minimized:

$$\min_\pi \; \ell^\pi_B - \ell^*_B \;=\; \min_\pi \; \Big(\min_{1 \le n \le B} z_n\Big) - \nu^{c^*}_B, \tag{4}$$

where $c^*$ is the best arm in Eq. 2. Note that $\ell^*_B$ assumes knowledge of all curves and is hence infeasible to achieve in practice.

Challenges. The sequential decision making problem introduced in our paper is an instance of extreme bandits [Nishihara et al., 2016]. This leads to the well-known exploration-exploitation tradeoff: on one hand, it is tempting to pick arms that have achieved small losses (exploitation); on the other hand, there is an incentive to pick other arms to see whether they can attain even smaller losses (exploration). However, Eq. 4 differs from typical bandits because $\ell^\pi_B$ is the minimum over all losses rather than the sum. This leads to two differences: firstly, the optimal policy/configuration depends on the horizon $B$; secondly, at every step, the optimal policy for the remaining horizon depends on the history. The key idea of our method for addressing this challenge is to re-plan at every step using historical information.

## 4 The BHPT Algorithm

In this section, we describe the BHPT algorithm. The main idea is to use a Bayesian belief model to predict the future performance of configurations, and to follow a policy to plan and select the next configuration. At step $n$, BHPT performs the following two steps:

$$a_n \leftarrow \pi(r, S_{n-1}) \;\; \text{(plan)}, \qquad S_n \leftarrow \mathrm{BeliefUpdate}(S_{n-1}, a_n, z_n) \;\; \text{(predict)}.$$

Here $S_n$ is the Bayesian belief model, which predicts future performance and gets updated at every step with new observations. $\pi(r, S_{n-1})$ is the policy that picks the next configuration; it takes both the remaining budget $r$ and the future predictions of the belief $S_{n-1}$ as input. The core of the policy $\pi$ is an action-value function $Q_r[a \mid S_{n-1}]$, which scores the expected future performance of a configuration. The BHPT algorithm can adaptively predict and identify good configurations through the Bayesian belief model $S_n$, and adapts its tuning behavior to different budgets through the dependency of $\pi(r, S_{n-1})$ on $r$, as summarized in Table 1. We describe the action-value function $Q$ in Sec. 4.1 and the Bayesian belief model in Sec. 4.2. We discuss how $Q$ balances the exploration-exploitation tradeoff in Sec. 4.3, and present the BHPT algorithm in Sec. 4.4.

### 4.1 Action-value Function Q

Suppose we are at the $n$-th step, with remaining budget $r = B - n$. Our goal is to find the policy which minimizes the future losses $\ell_r = \min_{n \le s \le B} z_s$. We use a belief state/model $S_{n-1}$, derived from the past history $(a_1, z_1, \dots, a_{n-1}, z_{n-1})$, to estimate the unknown training curves; it allows us to simulate and predict future outcomes. We compute an action-value function $Q_r[a \mid S_{n-1}]$ for each arm and select the next action that minimizes $Q$:

$$a_n = \pi\big(r, (a_1, z_1, \dots, a_{n-1}, z_{n-1})\big) = \arg\min_{a \in [K]} Q_r[a \mid S_{n-1}]. \tag{5}$$

The action-value function $Q$ (we drop $S_{n-1}$ from $Q$ to simplify the notation when it is clear from context) is defined as follows:

$$Q_r[a] = \mathbb{E}\Big[\min\big\{\nu^a_r,\; \min_{k \ne a} \mu^k_r\big\}\Big], \tag{6}$$

where $\nu^k_r$ is the minimum loss of configuration $k$ over the future $r$ epochs starting from the current epoch (recall Sec. 3.2), and $\mu^k_r = \mathbb{E}\big[\nu^k_r \mid S_{n-1}\big]$ is its expected value. Intuitively, $Q$ measures the potential of arm $a$ compared to the expected performance of the other arms. Note that this is an optimistic estimate of arm $a$'s performance, since it does not account for the risk when $\nu^a_r > \min_{k \ne a} \mu^k_r$. The design of $Q$ leverages the optimal planning structure: it directly predicts the final loss after $r$ epochs, which is an estimate of the optimal solution with perfect information (Sec. 3.2).
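To make Eq. 6 concrete, here is a minimal Monte Carlo sketch of $Q_r[a]$, assuming only that the belief model can draw samples of each arm's future minimum loss $\nu^a_r$ (the paper itself uses the closed form derived below; the independent Gaussian sampler here is an illustrative stand-in for the Freeze-Thaw posterior of Sec. 4.2, and all numbers are made up).

```python
import numpy as np

def q_value_mc(a, sample_nu_r, mu_r, n_samples=4000):
    """Monte Carlo estimate of Q_r[a] = E[min(nu_r^a, min_{k != a} mu_r^k)] (Eq. 6).

    sample_nu_r(k, size): draws samples of arm k's future minimum loss from the belief.
    mu_r: array of expected future minima mu_r^k for all arms.
    """
    best_other = np.delete(mu_r, a).min()      # min_{k != a} mu_r^k
    nu_a = sample_nu_r(a, n_samples)           # posterior samples of nu_r^a
    return np.minimum(nu_a, best_other).mean()

# Illustrative belief: each arm's future minimum loss is Gaussian
# (a stand-in for the Freeze-Thaw GP posterior described in Sec. 4.2).
means = np.array([0.30, 0.25, 0.40])           # mu_r^k
stds  = np.array([0.02, 0.10, 0.05])           # sigma[nu_r^k]
rng = np.random.default_rng(0)
sampler = lambda k, size: rng.normal(means[k], stds[k], size)

q = np.array([q_value_mc(a, sampler, means) for a in range(len(means))])
print("Q values:", q.round(4), "-> next arm (Eq. 5):", int(q.argmin()))
```

In this toy example, arms with either a small expected minimum or a large posterior variance pull the estimate down, which is exactly the exploration-exploitation behavior analyzed in Sec. 4.3.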
In what follows, we show that minimizing $Q$ is equivalent to maximizing a notion of surprise, which measures the potential gain of pulling an arm for the remaining budget. Specifically, what is a good action/arm for improving our estimate of future performance? There are two scenarios in which the outcomes contradict our prior belief: (a) an arm previously considered sub-optimal turns out to be the best arm, and (b) the arm previously considered best is actually inferior to others.

Denote by $\hat{c} = \arg\min_k \mu^k_r$ the predicted top configuration; its predicted loss is $\mu^{1\text{st}}_r = \min_k \mu^k_r$, and the predicted loss of the runner-up is $\mu^{2\text{nd}}_r$. Now consider a sub-optimal arm $a \ne \hat{c}$. Eq. 6 becomes

$$Q_r[a] = \mathbb{E}\big[\min\{\nu^a_r, \mu^{1\text{st}}_r\}\big] = \mu^{1\text{st}}_r - \mathbb{E}\big[(\mu^{1\text{st}}_r - \nu^a_r)^+\big]. \tag{7}$$

Note that case (a) happens when $\nu^a_r < \mu^{1\text{st}}_r$, and we expect to gain $\mu^{1\text{st}}_r - \nu^a_r$ by running $a$ instead of $\hat{c}$. The second term on the r.h.s., $\mathbb{E}\big[(\mu^{1\text{st}}_r - \nu^a_r)^+\big]$, integrates over the region where $\nu^a_r$ falls below $\mu^{1\text{st}}_r$, which quantifies the expected surprise of $a$. This quantity is called the value of perfect information in decision theory [Howard, 1966], a numerical value that measures the reduction of uncertainty. Minimizing $Q_r[a]$ favors the arm with large surprise.

Similarly, in case (b) we consider the predicted top arm $a = \hat{c}$: there is a surprise when $\nu^{\hat{c}}_r > \mu^{2\text{nd}}_r$, i.e. the expected best arm falls behind the other candidates. $Q$ in this case is

$$Q_r[\hat{c}] = \mathbb{E}\big[\min\{\nu^{\hat{c}}_r, \mu^{2\text{nd}}_r\}\big] = \mu^{1\text{st}}_r - \mathbb{E}\big[(\nu^{\hat{c}}_r - \mu^{2\text{nd}}_r)^+\big], \tag{8}$$

where the second term integrates over the region where, surprisingly, $\nu^{\hat{c}}_r$ is larger than $\mu^{2\text{nd}}_r$. Pulling $\hat{c}$ in this case is favorable since it has a large gain of surprise. Refer to Fig. 1 in the supplementary material, Sec. B, for a visualization of $Q$. We will come back to the properties and behaviors of $Q$ in Sec. 4.3, after introducing the details of the belief model in the next section.

### 4.2 Belief Model

In this section, we briefly describe the Bayesian belief model $S_n$ and explain how Eq. 6 is computed with the posterior distribution. The belief model captures our current knowledge of the training curves so as to predict the future $\nu^k_r$, and gets updated as new observations become available. In this paper, we adopt the Freeze-Thaw Gaussian process [Swersky et al., 2014]. We use $(k, t)$ to index the hyper-parameters and epochs respectively, and denote the loss of configuration $k$ at the $t$-th epoch by $y^k(t)$. The joint distribution of the losses from all configurations and epochs, $\mathbf{y} = [y^1(1), y^1(2), \dots, y^1(n_1), \dots, y^K(n_K)]$ (arm $k$ has $n_k$ epochs/losses), is Gaussian with specific Freeze-Thaw covariance kernels. We use the joint distribution to predict the future performance $y^k(t)$ for any $k, t$, and update the posterior with new observations by applying Bayes' rule. Details of the belief model can be found in the supplementary material, Sec. E.

Computing Q. Note that the future best loss is a random variable, $\nu^k_r = \min_{1 \le t \le r} y^k(t_0 + t)$, where $t_0$ is the current epoch. However, the distribution of $\nu^k_r$ is non-trivial to compute, as it is the minimum of $r$ correlated Gaussians. To simplify the computation,³ we approximate $\nu^k_r \approx y^k(t_0 + \tau^k)$, where we fix the time index $\tau^k$ deterministically to be the one that achieves the minimum loss in expectation, $\tau^k = \arg\min_{1 \le t \le r} \mathbb{E}\big[y^k(t_0 + t)\big]$.⁴
The advantage of this approximation is that $Q_r[a]$ can be computed efficiently in closed form:

$$Q_r[a] = \mathbb{E}_\nu\big[\min\{\nu^a_r, \mu\}\big] = \mu - \sigma[\nu^a_r]\,\big(s\,\Phi(s) + \phi(s)\big), \tag{9}$$

where $s = \frac{\mu - \mathbb{E}[\nu^a_r]}{\sigma[\nu^a_r]}$ is the normalized distance of $\nu^a_r$ to $\mu$, and $\mu$ is a constant, either $\mu^{1\text{st}}_r$ or $\mu^{2\text{nd}}_r$ depending on which $a$ we are looking at. $\sigma[\cdot]$ is the standard deviation, and $\Phi(\cdot)$ and $\phi(\cdot)$ are the cdf and pdf of the standard Gaussian, respectively.

³ We can always use a Monte Carlo approximation to compute this quantity. Asymptotically, our simplification does not affect the behavior of $Q$; see Sec. 4.3.
⁴ The intuition is that, for different random draws of the curve, $\nu^k_r$ can be any one of $y^k(t_0 + t)$ for $1 \le t \le r$. In most cases, $\Pr\big[\nu^k_r = y^k(t_0 + \tau^k)\big] \ge \Pr\big[\nu^k_r = y^k(t_0 + t)\big]$ for all $t$, so we use $y^k(t_0 + \tau^k)$ in place of $\nu^k_r$.

### 4.3 Behavior of Q

In this section, we analyze the behavior of the action-value function. With Gaussian-distributed $\nu^k_r$'s, the proposed $Q$ has nice properties. $Q$ balances exploration with exploitation: arms with a smaller mean (exploitation) and arms with a larger variance (exploration) are both preferred, since in either case the surprise area is large (the second term on the r.h.s. of Eq. 7 and Eq. 8). We provide an asymptotic analysis of the behavior as the budget goes to infinity. Proofs can be found in the supplementary material, Sec. C.

Proposition 1. $Q$ is asymptotically consistent, that is, $\lim_{B \to \infty} \ell^\pi_B - \ell^*_B = 0$. The regret (Eq. 4) goes to 0, and the best configuration is discovered as $B \to \infty$.

Proposition 2. As $B \to \infty$, the limiting ratio of the number of pulls between the top arm and the second-best arm approaches 1, assuming the same observation noise parameter.

### 4.4 Budgeted Hyper-parameter Tuning Algorithm

In this section, we discuss the practical use of $Q$ and present the BHPT algorithm. Imagine the behavior of $Q$ when applied to hyper-parameter tuning: all configurations get selected often at the beginning due to the high uncertainty. Gradually, as our predictions become more accurate, the actions focus on a few promising arms. Finally, $Q$ mostly allocates budget between the top two arms, according to Prop. 2. In practice this can be a waste of resources, because we aim to finalize on a single model. We therefore propose the following heuristics.

ε-Greedy policy. In practice, we can cut down the exploration in $Q$ and exploit the top arm more. Recall that $\hat{c}$ is the predicted best arm. We design the following ε-greedy policy:

$$a_n = \pi_\varepsilon(r, S_{n-1}) = \begin{cases} \hat{c}, & \text{w.p. } \varepsilon, \\ \arg\min_{a \ne \hat{c}} Q_r[a \mid S_{n-1}], & \text{otherwise.} \end{cases} \tag{10}$$

With probability $\varepsilon$ we select $\hat{c}$, and we follow $Q$ the rest of the time. We set $\varepsilon = 0.5$ in the experiments.⁵

⁵ Although we do not tune $\varepsilon$, $\pi_\varepsilon$ performs well, as demonstrated in the experiments.

Budget exhaustion. When the budget is running out, it is desirable to focus mainly on the top arm $\hat{c}$. Define $\tau = \arg\min_{1 \le t \le r} \mathbb{E}\big[y^{\hat{c}}(t_0 + t)\big]$, the number of epochs $\hat{c}$ is short of achieving its minimum. We use the condition $\tau < r$, where $r$ is the remaining budget, to keep track of budget exhaustion, and pick $\hat{c}$ when the condition is false.

The final algorithm is presented in Alg. 1; we refer to its two variants as BHPT and BHPT-ε in the experiments.

    Algorithm 1: Budgeted Hyper-parameter Tuning (BHPT)
    Input: budget B and configurations [K].
    for n = 1, 2, ..., B do
        if τ < r then
            a_n = π(r, S_{n-1}) or a_n = π_ε(r, S_{n-1})   // Eq. 5 and 6, or Eq. 10
        else
            a_n = ĉ                                        // budget exhaustion
        Run a_n and obtain the loss z_n = y^{a_n}(t) (for some t).
        S_n ← BeliefUpdate(S_{n-1}, a_n, z_n)               // Sec. 4.2
        Update τ and r.
    Output: min_{1 ≤ n ≤ B} z_n.
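As a rough illustration of how Alg. 1 could be put together, here is a Python sketch of the main loop; it is not the authors' code. The `belief` object (whose `predict`/`update` methods stand in for the Freeze-Thaw GP posterior of Sec. 4.2) and the `train_one_epoch` callback are hypothetical interfaces introduced only for this sketch.

```python
import numpy as np
from scipy.stats import norm

def future_stats(belief, k, r):
    """Posterior mean/std of arm k's loss over the next r epochs.
    belief.predict(k, r) is a hypothetical stand-in for the Freeze-Thaw GP posterior;
    it should return arrays (means, stds) of length r."""
    means, stds = belief.predict(k, r)
    tau = int(np.argmin(means)) + 1           # tau^k: 1-indexed epoch offset with smallest mean
    return means[tau - 1], stds[tau - 1], tau # approximate nu^k_r by y^k(t0 + tau^k)

def q_closed_form(mean_a, std_a, mu_const):
    """Eq. 9: E[min(nu_r^a, mu)] for Gaussian nu_r^a and constant mu."""
    s = (mu_const - mean_a) / max(std_a, 1e-12)
    return mu_const - std_a * (s * norm.cdf(s) + norm.pdf(s))

def bhpt(belief, train_one_epoch, K, B, eps=None, rng=np.random.default_rng(0)):
    losses = []
    for n in range(B):
        r = B - n
        stats = [future_stats(belief, k, r) for k in range(K)]
        mu = np.array([m for m, _, _ in stats])
        c_hat = int(mu.argmin())                         # predicted best arm
        mu1, mu2 = np.sort(mu)[:2]                       # mu^1st_r and mu^2nd_r
        if stats[c_hat][2] >= r:                         # budget exhaustion: tau < r is false
            a = c_hat
        elif eps is not None and rng.random() < eps:     # epsilon-greedy branch of Eq. 10
            a = c_hat
        else:
            q = [q_closed_form(m, s, mu2 if k == c_hat else mu1)
                 for k, (m, s, _) in enumerate(stats)]
            if eps is not None:
                q[c_hat] = np.inf                        # Eq. 10 restricts argmin to a != c_hat
            a = int(np.argmin(q))                        # Eq. 5
        z = train_one_epoch(a)                           # run arm a for one epoch, observe loss
        belief.update(a, z)                              # posterior update (Sec. 4.2)
        losses.append(z)
    return min(losses)                                   # ell^pi_B
```

Passing `eps=0.5` gives the BHPT-ε variant of Eq. 10, while `eps=None` follows Eq. 5 directly; the budget-exhaustion check commits to the predicted best arm when it needs all of the remaining budget to reach its predicted minimum.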
## 5 Experiment

In this section, we first compare and validate the conceptual advantage of the BHPT algorithm over other methods by analyzing the exploration-exploitation tradeoff under different budgets on synthetic data. Then we demonstrate the performance of the BHPT algorithm on real-world hyper-parameter tuning tasks. In particular, we include a task that tunes network architectures, because selecting the optimal architecture under a budget constraint is of great practical importance. We start by describing the experimental setup, i.e. the data, evaluation metrics and baseline methods, in Sec. 5.1, and then provide results and analysis in Sec. 5.2 and Sec. 5.3.

| dataset | K (# arms) | # hyper-params | evaluation |
| --- | --- | --- | --- |
| synthetic | 84 | NA | Eq. 11 |
| ResNet on CIFAR-10 | 96 | 5 | error rate |
| FCNet on MNIST | 50 | 10 | error rate |
| VAE on MNIST | 49 | 4 | ELBO |
| ResNet/AlexNet on CIFAR-10 | 49 | 6 | error rate |

Table 2: Data and evaluation metrics

### 5.1 Experiment Description

Data. For data preparation, we generate and save the learning curves of all configurations, to avoid repeated training when tuning under different budget constraints. For synthetic data, we generate 100 sets of training curves drawn from a Freeze-Thaw GP. For real-world data, we create 4 tuning tasks as summarized in Table 2. For more details, please refer to the supplementary material, Sec. F.

Evaluation Metric. As the evaluation metric on synthetic data, we define the normalized regret

$$R_B = \frac{\ell^\pi_B - \ell^*_B}{\ell_0 - \ell^*_B}, \tag{11}$$

where $\ell_0$ is the maximal initial loss over all arms, $\ell^\pi_B$ is the tuning output, and $\ell^*_B$ is the optimal solution with perfect information of the training curves. We normalize the regret by the range $\ell_0 - \ell^*_B$, which can differ substantially across sets. For the real-world tasks, see Table 2 for the evaluation metrics.

Baseline Methods. We compare to Hyperband, BO and its variants (Fabolas and Freeze-Thaw), and Rollout [Lam et al., 2016], an approximate dynamic programming approach that solves for the Bayes-optimal solution. Please refer to the supplementary material, Sec. D, for full descriptions of the methods and the implementation details. Table 1 in Sec. 2 summarizes the key differences between methods.

### 5.2 Results on Synthetic Data

On synthetic data, since we use the correct model, we expect the belief model to learn to accurately predict the future as the budget increases. We examine the behaviors of the proposed algorithms under different budgets. All results are averaged over the 100 synthetic sets. First, we plot the normalized regrets $R_B$ (Eq. 11) over budgets in Fig. 1(a). Our algorithms consistently outperform competing methods under different budgets. As $B$ increases, the regret of BHPT approaches 0 ($R_B \to 0$), as discussed in Sec. 4.3.

Exploration vs. Exploitation. Two factors determine the performance of an algorithm: whether it correctly identifies the optimal arm, and whether it spends sufficient budget on that arm to achieve a small loss. The first reflects whether the algorithm explores effectively, such that it can accurately estimate the curves and identify the top arm; the second indicates sufficient exploitation. To examine exploration, we plot the hit rate of the output arm on the top 5 arms (out of all 84) across different budgets in Fig. 1(b).⁶
To check exploitation, we visualize the percentage of the budget spent on the output arm in Fig. 1(c). In both (b) and (c), the higher the bar the better. In (b), the proposed BHPT and BHPT-ε explore better than all baselines. The "adaptive prediction" column in Table 1 explains the exploration behavior of the different methods: the Bayesian belief models in BO and BHPT improve with more budget, which leads to the increase in hit rate as the budget increases. In (c), our methods perform well in terms of exploitation. The "early stop" column in Table 1 partially explains the exploitation behavior: BO exploits strongly (and explores poorly) under small budgets because it does not early-stop configurations. Although Rollout has a belief model and makes future predictions like BHPT, it does not perform well on either task: its hit rate does not improve with more budget, nor does it exploit the output arm sufficiently. Our conjecture is that Rollout truncates the planning horizon due to the computational challenge, which leads to myopic behaviors and the poor results. Indeed, in all three subplots of Fig. 1, Rollout performs and behaves similarly to Hyperband, which only uses current performance to select actions. This demonstrates the importance of long-term prediction and planning in the budgeted tuning task.

⁶ The three stacks in each bar are the hit rate @ top-1, top-3, and top-5 in the ground truth, respectively. The numerical values are reported in the supplementary material, Sec. F.

Comparing (b) and (c), there is a clear tradeoff between exploration and exploitation: the hit rate decreases as the exploitation percentage increases. Note that BHPT adjusts this tradeoff automatically across different budgets. Comparing BHPT against its ε-greedy variant, BHPT-ε does slightly better in exploitation and worse in exploration.

[Figure 1: Budgeted optimization on 100 synthetic sets. Panels: (a) normalized regret $R_B$ vs. budget (1 budget unit = 6 iterations); (b) exploration: output-arm hit rate @ top 5 (%); (c) exploitation: % of budget on the output arm. Methods: BHPT (ours), BHPT-ε (ours), Hyperband, BO (GP-EI), Rollout. In (a), BHPT outperforms all other methods and $R_B \to 0$ as $B$ increases. In (b), the hit rate of BHPT is the highest, i.e. it correctly identifies the best arm.]

### 5.3 Results on Real-world Data

In this section, we report the tuning performance on real-world tuning tasks across different budget constraints. We plot the tuning outcomes (error rate or ELBO) over budgets in Fig. 2 and Fig. 3(a). Each curve is the average of 10 runs with different random seeds, and the mean with one standard deviation is shown in the figures. The BHPT methods work well under a wide range of budgets, and outperform BO, Hyperband, Freeze-Thaw and the state-of-the-art algorithm Fabolas across the 4 tuning tasks. The trend of the different methods across budgets is consistent with the observations on the synthetic data.

[Figure 2: Real-world hyper-parameter tuning tasks. Panels: (a) ResNet for classification; (b) FCNet for classification; (c) VAE as a generative model. Methods: BHPT (ours), BHPT-ε (ours), Hyperband, BO (SMAC), Fabolas, Freeze-Thaw. BHPT converges to the globally optimal model at the rightmost budget for all tasks; BHPT-ε is better under small budgets, while BHPT is better under large budgets.]
As explained in the synthetic experiment, the vanilla BHPT does better in exploration while the ε-greedy variant does more exploitation. This explains the superior performance of the ε-greedy variant under small budgets. However, the lack of exploration results in a worse belief model and hurts performance as the budget increases. This phenomenon is more salient on the architecture selection task, Fig. 3(a), where belief modeling is more challenging due to the different learning curve patterns of ResNet and AlexNet.

Budget Adaptive Behaviors. An important motivation for studying the budgeted tuning problem is that it is practically desirable to have an adaptive strategy under different constraints. For example, the optimal network architecture might change under different budgets. We would like to examine whether the proposed BHPT exhibits such adaptive behavior. We take the architecture selection task between ResNet and AlexNet on CIFAR-10,⁷ and visualize the ratio of the resource spent on ResNet configurations over budgets in Fig. 3(b). ResNet configurations converge more slowly than AlexNet, but reach smaller error rates. Thus it is rewarding to focus the tuning on AlexNet under small budgets, and vice versa. Indeed, BHPT allocates more resource to ResNet as the budget increases, in contrast to the baseline Hyperband, which samples the two networks more or less uniformly at random.

⁷ The hyper-parameters include the architecture type as well as others, such as the learning rate. There are 49 configurations in total, with 24 ResNets and 25 AlexNets.

[Figure 3: On an architecture selection task, BHPT adapts its behavior to different budgets, while Hyperband employs a fixed strategy. Panels: (a) architecture selection between AlexNet and ResNet; (b) budget allocation, % of budget spent on ResNet vs. total budget.]

## 6 Conclusion

In this paper, we study the budgeted hyper-parameter tuning problem. We formulate it as a sequential decision making problem, and propose the BHPT algorithm, which uses long-term predictions with an action-value function to allocate the resource efficiently. It exhibits budget-adaptive behavior, and achieves state-of-the-art performance on real-world tasks.

## Acknowledgments

We appreciate the feedback from the reviewers. This work is partially supported by NSF Awards IIS-1513966/1632803/1833137 and CCF-1139148, DARPA Award FA8750-18-2-0117, DARPA D3M Award UCB00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO awards W911NF-12-1-0241 and W911NF-15-1-0484.

## References

Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. arXiv preprint arXiv:1709.07417, 2017.

Yutian Chen, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Timothy P. Lillicrap, and Nando de Freitas. Learning to learn for global optimization of black box functions. arXiv preprint arXiv:1611.03824, 2016.

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. arXiv preprint arXiv:1703.01785, 2017.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. arXiv preprint arXiv:1706.00764, 2017.

Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, 1966.

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.

Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536, 2017.

Remi Lam, Karen Willcox, and David H. Wolpert. Bayesian optimization with a finite budget: An approximate dynamic programming approach. In Advances in Neural Information Processing Systems, pages 883–891, 2016.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.

Robert Nishihara, David Lopez-Paz, and Léon Bottou. No regret bound for extreme bandits. In AISTATS, pages 259–267, 2016.

Ashish Sabharwal, Horst Samulowitz, and Gerald Tesauro. Selecting near-optimal learners via incremental data allocation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.