Optimizing Hyperparameters with Conformal Quantile Regression

David Salinas 1, Jacek Golebiowski 1, Aaron Klein 1, Matthias Seeger 1, Cedric Archambeau 1

1 Amazon Web Services. Correspondence to: David Salinas.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Many state-of-the-art hyperparameter optimization (HPO) algorithms rely on model-based optimizers that learn surrogate models of the target function to guide the search. Gaussian processes are the de facto surrogate model due to their ability to capture uncertainty, but they make strong assumptions about the observation noise, which might not be warranted in practice. In this work, we propose to leverage conformalized quantile regression, which makes minimal assumptions about the observation noise and, as a result, models the target function in a more realistic and robust fashion, which translates to quicker HPO convergence on empirical benchmarks. To apply our method in a multi-fidelity setting, we propose a simple, yet effective, technique that aggregates observed results across different resource levels and outperforms conventional methods across many empirical tasks.

1. Introduction

Hyperparameters play a vital role in the machine learning (ML) workflow. They control the speed of the optimization process (e.g., learning rate), the capacity of the underlying statistical model (e.g., number of units) or the generalization quality through regularization (e.g., weight decay). While virtually all ML pipelines benefit from having their hyperparameters tuned, this can be tedious and expensive to do, and practitioners tend to leave many of them at their default values.

Hyperparameter optimization (Feurer & Hutter, 2018) is a powerful framework to tune hyperparameters automatically, with a large body of work tackling this optimization problem, spanning from simple heuristics to more complex model-based methods. Random-search-based approaches (Bergstra & Bengio, 2012) define a uniform distribution over the configuration space and repeatedly sample new configurations until a total budget is exhausted. Evolutionary algorithms modify a population of hyperparameter configurations by hand-designed mutations. Finally, model-based methods, such as Bayesian optimization (Snoek et al., 2012), use the collected data points to fit a surrogate model of the target objective, which informs the sampling distribution so that more promising configurations are selected over time.

Unfortunately, solving the HPO problem can become computationally expensive, as each function evaluation incurs the cost of fully training and validating the underlying machine learning model. Multi-fidelity HPO (Karnin et al., 2013; Li et al., 2017) accelerates this process by terminating the evaluation of under-performing configurations early and only training promising configurations to the end. Early stopping allows algorithms to explore more configurations within the total search budget. Baseline multi-fidelity methods can be made more efficient by relying on model-based sampling, exploiting past observations to select more promising configurations over time. Model-based multi-fidelity methods constitute the state-of-the-art in HPO today, but still face some limitations. Concretely, we identify two major gaps with the current approaches.
First, Bayesian optimization usually assumes that the observation noise is homoskedastic and Gaussian. This allows for an analytic expression of the likelihood for Gaussian processes, the most popular probabilistic model for Bayesian optimization (Snoek et al., 2012). However, most HPO problems exhibit heteroskedastic noise that can span several orders of magnitude (Salinas et al., 2020; Cowen-Rivers et al., 2020). For instance, the validation error of a model trained with SGD can behave very sporadically for large learning rates, and not taking this heteroskedasticity into account can severely hinder the performance of methods with Gaussian assumptions.

Second, it is difficult to model probabilistically the target metric both across configurations and resources (e.g., number of epochs trained), while at the same time retaining the simplicity of Gaussian processes. Previous work maintains separate models of the target at different resource levels (Falkner et al., 2018; Li et al., 2022), or uses a single joint model (Klein et al., 2020; Swersky et al., 2014). The former does not take into account dependencies across resource levels, even though these clearly exist. The latter has to account for the non-stationarity of the target function, and requires strong modeling assumptions in order to remain tractable.

With this paper, we propose a conformalized quantile regression surrogate model that can be used in the presence of any noise, possibly non-Gaussian or heteroskedastic. We further propose a new way to extend any model-based single-fidelity HPO method to the multi-fidelity setting, by aggregating observed results across different resource levels. This strategy can be used with most single-fidelity methods and greatly simplifies the model-based multi-fidelity setup. We show that, in spite of its simplicity, our framework offers competitive results across most common single-fidelity methods and significant improvements over baselines when paired with conformalized quantile regression. Our main contributions are the following:

- We introduce a novel surrogate method for HPO based on conformalized quantile regression which can handle heteroskedastic and non-Gaussian distributed noise.
- We propose a simple way to extend any single-fidelity method into a multi-fidelity method, by using only the last observed datapoint for each hyperparameter configuration to train the function surrogate.
- We run empirical evaluations on a large set of benchmarks, demonstrating that quantile regression surrogates achieve more robust performance compared to state-of-the-art methods in the single-fidelity case.
- We show that our new multi-fidelity framework outperforms state-of-the-art methods across multiple single-fidelity surrogates.

The paper first reviews related work and then discusses our proposed method for single-fidelity optimization leveraging conformal quantile prediction. We then describe an extension to the multi-fidelity case, before evaluating the method on an extensive set of benchmarks.

2. Related work

Bayesian optimization is one of the most successful strategies for hyperparameter optimization (HPO) (Shahriari et al., 2016). Based on a probabilistic model of the objective function, it iteratively samples new candidates by optimizing an acquisition function that balances between exploration and exploitation when searching the space.
Typically, Gaussian processes are used as the probabilistic surrogate model (Snoek et al., 2012), but other methods, such as random forests (Hutter et al., 2011) or Bayesian neural networks (Springenberg et al., 2016; Snoek et al., 2015), are possible. Alternatively, instead of modeling the objective function, previous work (Bergstra et al., 2011; Tiao et al., 2021) estimates the acquisition function directly via the density ratio of well and poorly performing configurations.

Despite its sample efficiency, Bayesian optimization still requires tens to hundreds of function evaluations to converge to well-performing configurations. To accelerate the optimization process, multi-fidelity optimization exploits cheap-to-evaluate fidelities of the objective function such as training epochs (Swersky et al., 2014). Jamieson & Talwalkar (2016) proposed to use successive halving (Karnin et al., 2013) for multi-fidelity hyperparameter optimization, which trains a set of configurations for a small budget and then only lets the top half of the configurations continue with twice as many resources. Hyperband (Li et al., 2017) calls successive halving as a sub-routine with varying minimum resource levels, to avoid configurations being terminated too early. Falkner et al. (2018) combined Hyperband with Bayesian optimization, replacing the inefficient random sampling of configurations by Bayesian optimization with kernel density estimators (Bergstra et al., 2011). ASHA (Li et al., 2019) extended Hyperband to the asynchronous case when using multiple workers, which led to significant improvements, and Klein et al. (2020); Li et al. (2022) later combined this method with Gaussian-process-based Bayesian optimization. Instead of relying on a model-based approach, Awad et al. (2021) proposed to combine Hyperband with evolutionary algorithms.

An orthogonal line of work models the learning curves of machine learning algorithms directly; see Mohr & van Rijn (2022) for an overview. Previous work by Domhan et al. (2015) fits an ensemble of parametric basis functions to the learning curve of a neural network. This method can be plugged into any HPO approach such that the evaluation of a network is stopped if it is unlikely to outperform previous configurations, and the prediction of the model is returned to the optimizer. Klein et al. (2017) used a Bayesian neural network to predict the parameters of these basis functions, which is able to model the correlation across hyperparameter configurations. Wistuba & Pedapati (2020) proposed neural network architectures that rank learning curves across different tasks.

To avoid requiring Gaussian homoskedastic noise, several papers considered the use of quantile regression for HPO (Picheny et al., 2013; Salinas et al., 2020; Moriconi et al., 2020), but those approaches do not ensure that the provided uncertainties are well calibrated. Conformal prediction has been gaining traction recently in ML applications due to its ability to provide well-calibrated uncertainty under widely applicable assumptions, in particular not requiring a Gaussian homoskedastic distribution (Shafer & Vovk, 2007). To the best of our knowledge, conformal prediction has only been considered for single-fidelity HPO by Stanton et al. (2022) and Doyle (2022).
The former applies a conformal correction to standard GP posteriors in order to improve model calibration under non-Gaussian noise, whereas we build our method on quantile regression, which is already robust to non-Gaussian and heteroskedastic noise. The latter conducted a preliminary study showing that conformal predictors can outperform random search on four datasets. The key difference to our method is that we utilize the framework of conformal quantile prediction from Romano et al. (2019), which leverages quantile regression and thereby bypasses the need to fit an additional model for the variance. In both cases, our work differs in that we consider the asynchronous multi-fidelity case, which allows the method to perform much better in the presence of hundreds of observations.

3. Single-fidelity Hyperparameter Optimization

In the single-fidelity hyperparameter optimization setting, we are interested in finding the hyperparameter configuration minimizing a blackbox function $f$:

$x^* = \arg\min_{x \in \mathcal{X}} f(x),$

where $f(x)$ denotes the validation error obtained for a hyperparameter configuration $x \in \mathcal{X}$. Hyperparameters may include the learning rate, the number of layers and the number of hidden dimensions of a transformer or a convolutional neural network. Given that evaluating $f$ is typically expensive and gradients are not readily available, we look for gradient-free and sample-efficient optimization methods.

Bayesian optimization is arguably one of the most popular approaches owing to its ability to efficiently trade off exploration and exploitation when searching the configuration space. In each iteration $n$, a probabilistic model of the objective function $f$ is fitted on the $n$ previous observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$; the first initial configurations are typically drawn at random. To select the next point to evaluate, an acquisition function is then optimized to select the most promising candidate based on the probabilistic surrogate, for instance by picking the configuration that maximizes the expected improvement (Jones et al., 1998). The standard approach to model the objective function uses Gaussian processes due to their computational tractability and theoretical guarantees (Srinivas et al., 2012). However, this approach assumes that the observation noise is Gaussian and homoskedastic. We next describe an approach to lift this restriction by leveraging quantile regression with conformal prediction as a probabilistic model for $f$.

4. Conformal Quantile Regression

Preliminaries. For a real-valued random variable $Y$, we denote by $g(y)$ its probability density function and by $F_Y(y) = \int_{-\infty}^{y} g(t)\,dt$ its cumulative distribution function (CDF). The associated quantile function of $Y$ is then defined as

$F_Y^{-1}(\alpha) = \inf\{y \in \mathbb{R} : F_Y(y) \geq \alpha\}.$

The quantile function allows us to easily obtain confidence intervals. One can also easily sample from the distribution by first sampling a random quantile level uniformly, $\alpha \sim U([0, 1])$, and then computing $y = F_Y^{-1}(\alpha)$, which provides one sample $y$ from the distribution of $Y$.

Quantile regression. Given data drawn from a joint distribution $(x, y) \sim F_{(X,Y)}$, quantile regression aims to estimate a given quantile $\alpha$ of the conditional distribution of $Y$ given $X = x$, i.e., to learn the function $q_\alpha(x) = F^{-1}_{Y|X=x}(\alpha)$ which predicts the quantile $\alpha$ conditioned on $x$. This problem can be solved by minimizing the quantile loss function (Bassett & Koenker, 1982) for a given quantile $\alpha$ and some predicted value $\hat{y}$:

$L_\alpha(y, \hat{y}) := \begin{cases} \alpha (y - \hat{y}) & \text{if } y - \hat{y} > 0, \\ (1 - \alpha)(\hat{y} - y) & \text{otherwise.} \end{cases}$    (1)

A critical property is that minimizing the quantile loss allows us to retrieve the desired quantile, in the sense that $\arg\min_{\hat{y}} \mathbb{E}_{y \sim F_Y} [L_\alpha(y, \hat{y})] = F_Y^{-1}(\alpha)$.
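To make Eq. (1) concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates the pinball loss on skewed synthetic data and checks numerically that the value minimizing the empirical loss is close to the empirical alpha-quantile.

```python
# Hypothetical illustration of Eq. (1): the pinball/quantile loss and a numerical check
# that its minimizer is the alpha-quantile of the data.
import numpy as np

def quantile_loss(y, y_hat, alpha):
    """Pinball loss L_alpha(y, y_hat) from Eq. (1)."""
    diff = y - y_hat
    return np.where(diff > 0, alpha * diff, (1 - alpha) * (-diff))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # skewed, non-Gaussian "observation noise"
alpha = 0.8

# Minimize the empirical quantile loss over a grid of candidate predictions.
candidates = np.linspace(y.min(), y.max(), 2_000)
losses = [quantile_loss(y, c, alpha).mean() for c in candidates]
best = candidates[int(np.argmin(losses))]

print(best, np.quantile(y, alpha))  # the two values should be close
```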
Given a set of $n$ observations $D = \{(x_i, y_i)\}_{i=1}^{n}$, one can thus estimate the quantile function by training a model $\hat{q}_\alpha$ with parameters $\theta$ that minimizes the quantile loss given by Eq. 1:

$\theta^* = \arg\min_{\theta} \sum_{i=1}^{n} L_\alpha(y_i, \hat{q}_\alpha(x_i)).$    (2)

Quantile Regression Surrogate. We now explain how we can leverage quantile regression to build a probabilistic surrogate for Bayesian optimization. To this end, we estimate $m$ models by minimizing Eq. 2 for equally spread-out quantiles $\{\alpha_1, \ldots, \alpha_m\}$, where $\alpha_j = j/(m+1)$ and where $m > 1$ is an even number. This allows us to provide probabilistic predictions conditioned on any hyperparameter configuration $x$.

[Figure 1: Samples from the synthetic function $f(x) \sim \mathcal{N}(0, \rho(x)^2)$ to be minimized (left), illustration of the Thompson sampling procedure based on $m = 8$ predicted quantiles (middle), and the acquisition function obtained for our method and a GP (right). When sampling, we sample one random quantile for each candidate and pick the best configuration obtained.]

We then use independent Thompson sampling as the acquisition function, taking advantage of the fact that samples can easily be obtained through predicted quantiles. Indeed, one can draw a sample $\tilde{y}(x)$ from the estimated conditional distribution $F_{Y|X}$ by simply sampling a quantile at random among the $m$ quantiles predicted by the model, $\{\hat{q}_{\alpha_1}(x), \ldots, \hat{q}_{\alpha_m}(x)\}$. (One downside typically associated with fitting $m$ models is that the predicted quantiles may not be monotonic (Gasthaus et al., 2019); however, this problem does not occur in our case since we simply use samples from the predicted distribution.) To select the next configuration to evaluate, we first sample a set of $N$ candidates $\tilde{X} = \{\tilde{x}_1, \ldots, \tilde{x}_N\}$ uniformly at random and then sample a validation performance for each of those candidates to obtain $\{\tilde{y}(\tilde{x}_1), \ldots, \tilde{y}(\tilde{x}_N)\}$. We then return the configuration with the lowest sampled value, $\tilde{x}^* = \arg\min_{\tilde{x} \in \{\tilde{x}_1, \ldots, \tilde{x}_N\}} \tilde{y}(\tilde{x})$.

We illustrate this procedure in Figure 1, which shows how different quantiles are sampled in a toy example and how the next point is selected by picking the configuration with the lowest sampled predicted value. In this example, we consider $f(x) \sim \mathcal{N}(0, \rho(x)^2)$ as the function to minimize, with $\rho(x) = \sin(x)^2 + \kappa$, where $\kappa = 0.3$ ensures positive variance. Samples of this function are shown in Figure 1, left. For this function, a standard GP is not able to represent the heteroskedastic variance and as such cannot favor any part of the space, as shown by its uniformly distributed acquisition. However, while the mean of the function is always zero, the optimal points to sample are situated at $\pi/2$ and $3\pi/2$, where the uncertainty is highest. While the GP cannot model this information given its homoskedastic noise assumption, quantile regression is able to regress the conditional quantiles and therefore correctly identify the best regions to sample, as evidenced by its acquisition function, which peaks at the two best values for the next candidate.
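The short sketch below reproduces the spirit of this toy example under our own assumptions (it is not the authors' code): scikit-learn's GradientBoostingRegressor with the quantile loss stands in for the gradient boosted quantile models, and the next candidate is chosen by independent Thompson sampling over the predicted quantiles.

```python
# Sketch of the Figure 1 setup: fit m quantile regressors on samples of
# f(x) ~ N(0, rho(x)^2) with rho(x) = sin(x)^2 + 0.3, then pick the next candidate by
# sampling one random predicted quantile per candidate (independent Thompson sampling).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
m = 8
alphas = [j / (m + 1) for j in range(1, m + 1)]  # alpha_j = j / (m + 1)

# Observations of the toy objective.
x = rng.uniform(0.0, 2 * np.pi, size=500)
y = rng.normal(loc=0.0, scale=np.sin(x) ** 2 + 0.3)

# One model per quantile level.
models = [
    GradientBoostingRegressor(loss="quantile", alpha=a).fit(x.reshape(-1, 1), y)
    for a in alphas
]

# Independent Thompson sampling: one random quantile per candidate, keep the lowest sample.
candidates = rng.uniform(0.0, 2 * np.pi, size=2000).reshape(-1, 1)
preds = np.stack([mdl.predict(candidates) for mdl in models], axis=1)  # shape (N, m)
sampled = preds[np.arange(len(candidates)), rng.integers(0, m, size=len(candidates))]
x_next = candidates[int(np.argmin(sampled))]
print(x_next)  # tends to lie near pi/2 or 3*pi/2, where the uncertainty is largest
```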
Conformalizing predictions. While quantile regression can learn the shape of any continuous distribution given enough data, the model predictions are not guaranteed to be well calibrated given insufficient data. More precisely, the estimated quantiles allow us to construct $m/2$ confidence intervals $C_j(x) = [\hat{q}_{\alpha_j}(x), \hat{q}_{1-\alpha_j}(x)]$ for $j \leq m/2$. (Note that the choice $\alpha_j = j/(m+1)$ ensures that the quantile $1 - \alpha_j$ belongs to the $m$ computed quantiles, since $1 - \alpha_j = \alpha_{m+1-j}$.) For each confidence interval, we would like to have a miscoverage rate of $2\alpha_j$, i.e., the target should have probability at least $1 - 2\alpha_j$ of falling in the confidence interval $C_j(x)$:

$P[Y \in C_j(x)] \geq 1 - 2\alpha_j.$    (3)

In the presence of heteroskedasticity, this requires the length of $C_j(x)$ to depend on $x$, which is possible with the use of quantile regression as illustrated in Figure 1. However, the coverage statement of Eq. 3 cannot be guaranteed when fitting models on a finite sample size. Miscalibrated intervals can be problematic for HPO, as they may lead to a model converging early to a suboptimal solution. To address this problem, we propose using the split conformal method from Romano et al. (2019), which allows us to obtain robust coverage for each of the $m/2$ predicted confidence intervals, and which we now describe.

The method consists in applying an offset to each confidence interval, which is estimated on a validation set. We divide the dataset of available observations $D = \{(x_i, y_i)\}_{i=1}^{n}$ into a training set $D_{\text{train}}$ and a validation set $D_{\text{val}}$. After fitting each of the $m$ quantile regression models $\hat{q}_{\alpha_j}$ on the training set $D_{\text{train}}$, we compute conformity scores that measure how well the predicted confidence intervals fit the data for each sample in the validation set:

$E_j = \left\{ \max\left(\hat{q}_{\alpha_j}(x_i) - y_i,\; y_i - \hat{q}_{1-\alpha_j}(x_i)\right) \right\}_{i=1}^{|D_{\text{val}}|}.$    (4)

The intuition behind the conformity scores is the following. First, note that the sign of the score is positive when the target $y_i$ falls outside the predicted interval and negative when it falls inside. This allows the score to account for both overcoverage and undercoverage, as we want to shrink the interval in case of overcoverage and enlarge it in case of undercoverage. In addition, the score amplitude always measures the distance to the closest endpoint of the confidence interval, i.e., the score amplitude of each sample is $|y_i - q_i|$, where $q_i$ is the quantile prediction closest to $y_i$ among $\hat{q}_{\alpha_j}(x_i)$ and $\hat{q}_{1-\alpha_j}(x_i)$. Given these scores, we compute a correction $\gamma_j$ set to

$\gamma_j = \text{the } (1 - 2\alpha_j)\left(1 + \tfrac{1}{|D_{\text{val}}|}\right)\text{-th empirical quantile of } E_j.$    (5)

The conformalized prediction interval for a new data point $(x, y)$ is then given by

$\hat{C}_j(x) = \left[\hat{q}_{\alpha_j}(x) - \gamma_j,\; \hat{q}_{1-\alpha_j}(x) + \gamma_j\right].$    (6)

An important property of this procedure is that the corrected confidence intervals are guaranteed to have valid coverage, i.e., the probability that $y$ belongs to the prediction interval $\hat{C}_j(x)$ can be shown to be arbitrarily close to $1 - 2\alpha_j$ (Romano et al., 2019):

$1 - 2\alpha_j \leq P[Y \in \hat{C}_j(x)] \leq 1 - 2\alpha_j + \frac{1}{|D_{\text{val}}| + 1}.$
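A minimal sketch of the split-conformal adjustment of Eqs. (4)-(6) is given below, assuming two already fitted models q_lo and q_hi for the quantiles alpha_j and 1 - alpha_j that expose a scikit-learn-style predict() method (e.g., the gradient boosted quantile regressors from the previous sketch); it is our own illustration, not the paper's implementation.

```python
# Sketch of the conformal offset gamma_j of Eq. (5) and the adjusted interval of Eq. (6).
import numpy as np

def conformal_offset(q_lo, q_hi, X_val, y_val, alpha_j):
    """Return gamma_j, the offset that widens or narrows [q_lo(x), q_hi(x)]."""
    lo = q_lo.predict(X_val)
    hi = q_hi.predict(X_val)
    # Conformity scores of Eq. (4): positive if y falls outside the interval.
    scores = np.maximum(lo - y_val, y_val - hi)
    # Eq. (5): the (1 - 2 alpha_j)(1 + 1/|D_val|)-th empirical quantile of the scores.
    level = min(1.0, (1 - 2 * alpha_j) * (1 + 1.0 / len(y_val)))
    return np.quantile(scores, level)

# Conformalized interval of Eq. (6) for new points X_new:
#   [q_lo.predict(X_new) - gamma_j, q_hi.predict(X_new) + gamma_j]
```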
Once the confidence intervals are readjusted by offsetting the quantiles, we can sample new candidates to evaluate using the Thompson-sampling-based protocol discussed in the previous section, while being able to guarantee coverage properties of our predicted quantiles. The pseudo-code of the proposed algorithm to select the next configuration to evaluate is given in Algo. 1, where the three key steps are 1) fitting the quantile regression models, 2) computing the quantile adjustments and 3) sampling the best candidate with independent Thompson sampling.

5. Multi-fidelity and Successive Halving

Single-fidelity methods steer the search towards the most promising part of the configuration space based on the observed validation performance of hyperparameter configurations; however, they do not leverage other available signals, such as the loss emitted at each epoch for a neural network trained with SGD. Multi-fidelity methods consider this additional information to further accelerate the search by cutting poor configurations preemptively. Formally, multi-fidelity optimization considers the following optimization problem:

$x^* = \arg\min_{x \in \mathcal{X}} f(x, r_{\max}),$

where $f(x, r_{\max})$ denotes the blackbox error obtained for a hyperparameter configuration $x$ at the maximum budget $r_{\max}$ (for instance the maximum number of epochs), and we assume that $r \in [r_{\min}, r_{\max}]$. Typically, early values of $f(x, r)$ for $r < r_{\max}$ are informative of $f(x, r_{\max})$, while being computationally cheaper to collect, and can help us to cut poorly performing configurations.

Asynchronous Successive Halving. Asynchronous Successive Halving (ASHA) (Li et al., 2019) is a multi-fidelity method that can leverage several workers to evaluate multiple configurations asynchronously while stopping poor configurations early. The method starts by evaluating a list of random configurations in parallel for a small initial budget. When a result is collected from a worker, the configuration is continued or stopped based on its result: the evaluation continues if it is among the top results seen so far at the given fidelity and is interrupted otherwise. Stopped configurations are replaced with new candidates sampled randomly, and the process is iterated until the tuning budget is exhausted. ASHA avoids synchronization points by evaluating each configuration based on the data available at the time, which can lead to false positives in the continuation decision. Indeed, some configurations may be continued due to poor competition rather than good performance, and would have been stopped if more data had been available. However, the avoidance of synchronization points efficiently deals with straggler evaluations and is one of the key components of the method's excellent performance in practice. The pseudocode of the method is given in the appendix.

Algorithm 1: CQR candidate suggestion pseudo-code.

function SUGGEST()
  Input: configuration space X, set of observations D, number of quantiles m, number of candidates N
  Output: next configuration to evaluate
  D_train, D_val = split_train_val(D)
  for j = 1, ..., m: fit model q_hat_{alpha_j} on D_train with Eq. 2
  for j = 1, ..., m/2:
    compute conformity scores E_j with Eq. 4
    compute correction gamma_j with Eq. 5
  X_tilde = sample N candidates from X
  for i = 1, ..., N:
    draw a random quantile index j in [1, m]
    if j <= m/2: y_tilde_i = q_hat_{alpha_j}(x_tilde_i) - gamma_j
    else: y_tilde_i = q_hat_{alpha_j}(x_tilde_i) + gamma_{m+1-j}
  i* = argmin_i y_tilde_i
  return x_tilde_{i*}
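The following Python rendering of Algorithm 1 is a hedged sketch of our reading of it, not the authors' Syne Tune implementation. It assumes configurations are encoded as numeric feature vectors, and it relies on a hypothetical sample_configurations(space, N) helper returning an (N, d) array of candidates drawn uniformly from the configuration space.

```python
# Sketch of Algorithm 1: fit quantile models, compute conformal corrections, and pick
# the next candidate with independent Thompson sampling.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def suggest(X_obs, y_obs, sample_configurations, space, m=4, n_candidates=2000, seed=0):
    rng = np.random.default_rng(seed)
    alphas = np.array([j / (m + 1) for j in range(1, m + 1)])

    # Split observations into a training and a validation (calibration) set (10% held out).
    X_tr, X_val, y_tr, y_val = train_test_split(X_obs, y_obs, test_size=0.1, random_state=seed)

    # 1) Fit one quantile regression model per level alpha_j (Eq. 2).
    models = [GradientBoostingRegressor(loss="quantile", alpha=a).fit(X_tr, y_tr) for a in alphas]

    # 2) Conformal corrections gamma_j for the lower half of the levels (Eqs. 4-5).
    gammas = np.zeros(m)
    for j in range(m // 2):
        lo = models[j].predict(X_val)
        hi = models[m - 1 - j].predict(X_val)          # model for quantile 1 - alpha_j
        scores = np.maximum(lo - y_val, y_val - hi)
        level = min(1.0, (1 - 2 * alphas[j]) * (1 + 1.0 / len(y_val)))
        gammas[j] = np.quantile(scores, level)

    # 3) Independent Thompson sampling over N random candidates.
    cand = sample_configurations(space, n_candidates)   # hypothetical helper, shape (N, d)
    preds = np.stack([mdl.predict(cand) for mdl in models], axis=1)   # (N, m)
    js = rng.integers(0, m, size=n_candidates)
    # Offset each sampled quantile by the conformal correction of its interval pair.
    offsets = np.where(js < m // 2, -gammas[js], gammas[m - 1 - js])
    sampled = preds[np.arange(n_candidates), js] + offsets
    return cand[int(np.argmin(sampled))]
```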
Model-based ASHA. One pitfall of ASHA is that candidates are sampled randomly, both at initialization and when workers become free after configurations are interrupted. However, after spending some time in the tuning process, we have gathered results that we would clearly like to use to bias the search towards the most promising parts of the configuration space. One challenge is that most single-fidelity model-based approaches regress a surrogate model of $f(x, r_{\max})$ given observations at the final fidelity $r_{\max}$. It then becomes difficult to combine model-based and multi-fidelity approaches, given that when we stop a poor configuration at a resource $r < r_{\max}$, we are unsure about what the value of $f(x, r_{\max})$ would have been.

Bridging single and multi-fidelity methods. We propose a simple data transformation that allows any single-fidelity method to be used in the multi-fidelity setting. We denote the configurations and evaluations obtained at a given time as $\{(x_i, \{f(x_i, r_1), \ldots, f(x_i, r_{n_{x_i}})\})\}_{i=1}^{n}$, where $n$ denotes the number of configurations evaluated and $n_x$ denotes the number of fidelities evaluated for a configuration $x$. We propose to consider the transformation that takes the last value of the time-observations of a given configuration, namely $z = f(x, r_{n_x})$, and then to use a single-fidelity method, rather than random search, to determine the next configuration to evaluate given the observations $D = \{(x_i, z_i)\}_{i=1}^{n}$, while using ASHA for the stopping decisions (a short code sketch of this transformation is given at the end of this section).

Relying on the last observed value $f(x, r_{n_x})$ rather than on all fidelities $f(x, r)$ for $r \leq r_{n_x}$ significantly simplifies the multi-fidelity setup, but obscures a portion of the available signal. However, as evaluating configuration candidates for longer is expected to improve their result, the data transformation effectively pushes the poorly and well performing configurations further apart. Assuming configurations are stopped with a probability inversely correlated with their performance, and that their results would not cross if training were to continue, the result-based ordering of configurations remains constant regardless of whether we use the last or the final observation. This means that under those assumptions it remains possible to discriminate between promising and non-promising configurations. (The accentuated spread between bad and good configurations can be mitigated by using quantile normalization, as the transformation is invariant to monotonic changes. We do not report results for this approach as it adds a layer of complexity and performed on par with just taking the last observations in our experiments.)

In addition to working well with the conformal quantile regression surrogate that we introduced, we will also show in our experiments that this simple transformation allows single-fidelity methods to be combined with ASHA while reaching the performance of state-of-the-art dedicated model-based multi-fidelity methods.
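The sketch below (a hypothetical helper of our own, not the paper's code) shows the proposed transformation: for every configuration we keep only the objective value observed at its highest evaluated fidelity and hand the resulting pairs to any single-fidelity surrogate.

```python
# Build D = {(x_i, z_i)} with z_i = f(x_i, r_{n_{x_i}}), the last observed fidelity value.
def last_fidelity_observations(history):
    """history: dict mapping a hashable configuration to {fidelity: observed_value}."""
    dataset = []
    for config, curve in history.items():
        top_fidelity = max(curve)            # largest resource level observed so far
        dataset.append((config, curve[top_fidelity]))
    return dataset

# Example: two configurations evaluated to different numbers of epochs.
history = {
    ("lr=0.1", "units=64"): {1: 0.9, 3: 0.5, 9: 0.3},
    ("lr=1.0", "units=64"): {1: 1.2},        # stopped early by ASHA
}
print(last_fidelity_observations(history))
# [(('lr=0.1', 'units=64'), 0.3), (('lr=1.0', 'units=64'), 1.2)]
```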
6. Experiments

We evaluate our method against state-of-the-art HPO algorithms on a large collection of real-world datasets, on both single- and multi-fidelity tasks. The code to reproduce our results is available at https://github.com/geoalgo/syne-tune/tree/icml_conformal.

Benchmarks. Our experiments rely on 13 tasks coming from the FCNet (Klein & Hutter, 2019), NAS201 (Dong & Yang, 2020) and LCBench (Zimmer et al., 2021) benchmarks, as well as NAS301 (Siems et al., 2020) using the implementation provided in (Pfisterer et al., 2022). All methods are evaluated asynchronously with 4 workers. Details on these benchmarks and their configuration space distributions are given in Appendix B.

Baselines. For single-fidelity benchmarks, we compare our proposed method (CQR) with random search (RS) (Bergstra et al.), TPE (Bergstra et al., 2011), Gaussian process based Bayesian optimization (GP) (Snoek et al., 2012), regularized evolution (REA) (Real et al., 2019) and BORE (Tiao et al., 2021). For multi-fidelity benchmarks, we compare against ASHA (Li et al., 2019), BOHB (Falkner et al., 2018), Hyper-Tune (HT) (Li et al., 2022) and Mobster (MOB) (Klein et al., 2020).

Experiment setup. All tuning experiments run asynchronously with 4 workers and are stopped when 200 $r_{\max}$ results have been observed, which corresponds to seeing 200 different configurations for single-fidelity methods, or when the wallclock time exceeds a fixed budget. All runs are repeated with 30 different random seeds, and we report means and standard errors. We use gradient boosted trees (Friedman, 2001) for the quantile-regression models, with the same hyperparameters used for BORE. We use the simulation backend provided by Syne Tune (Salinas et al., 2022) on an AWS m5.4xlarge machine, which allows us to account for both optimizer and blackbox runtimes.

Metrics. We benchmark the performance of methods using normalized regret, ranks averaged over tasks, and critical difference diagrams. The normalized regret, also called average distance to the minimum, is defined as $(y_t - y_{\min})/(y_{\max} - y_{\min})$, where $y_{\min}$ and $y_{\max}$ denote respectively the best and worst possible values of the blackbox $f$, and $y_t$ denotes the best value found at time-step $t$ by a given method. To aggregate across tasks, we report scores at 50 fractions of the total budget for each task. We then average normalized regret over each budget fraction across tasks and seeds. We compute ranks for each method at all time-steps and also average those values over tasks and seeds. Critical difference diagrams show groups of methods that are statistically tied, connected by a horizontal line, using the statistical test proposed by Demšar (2006). They are computed over the averaged ranks of the methods obtained for a fixed budget. The performance of all methods per task is also given in the appendix in Fig. 3, 4, 5.
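For concreteness, the following short sketch (our own illustration, with hypothetical inputs) computes the normalized-regret trajectory described above: `trajectory` is the sequence of observed objective values for one method on one task, and y_min / y_max are the best and worst values of the tabulated blackbox.

```python
# Normalized regret (average distance to the minimum) over a tuning trajectory.
import numpy as np

def normalized_regret(trajectory, y_min, y_max):
    """(y_t - y_min) / (y_max - y_min), where y_t is the best value found up to step t."""
    best_so_far = np.minimum.accumulate(np.asarray(trajectory, dtype=float))
    return (best_so_far - y_min) / (y_max - y_min)

traj = [0.42, 0.40, 0.45, 0.31, 0.35]
print(normalized_regret(traj, y_min=0.25, y_max=0.60))
```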
Results discussion for single-fidelity. Aggregated metrics for single-fidelity methods are shown in Figure 2, top. Our proposed method (CQR) outperforms all baselines in terms of both rank and regret. In particular, the critical difference diagrams show that our method outperforms all baselines at 50% of the tuning budget and is only statistically tied to BORE given 100% of the budget. Our results also show that GP-based tuning performs very well in the low-sample regime, where it can rely on its informative prior. After enough observations, our data-driven approach starts to outperform the Bayesian competitors, as it can model irregular noise better.

Analyzing surrogate performance. To better understand the performance gains, we now analyze the properties of different surrogate models in more detail. In Tab. 1, we compare surrogate accuracy (RMSE), the quality of the uncertainty estimates (calibration error) and the runtime. In particular, we measure those metrics for different numbers of samples $n$. In each case, we draw a random subset of size $n$ to train the surrogate model and then evaluate the three metrics on the remaining unseen examples. Results are averaged across seeds and benchmarks, and we normalize the target with quantile normalization. Results per task are given in the appendix, as well as the definition of the calibration error metric. We compare three surrogates: the baseline GP, conformalized quantile regression (CQR) and quantile regression (QR).

Table 1: RMSE, calibration error and runtime for different surrogates when increasing the number of samples.

n    | RMSE: GP / QR / CQR | Calibration error: GP / QR / CQR | Runtime: GP / QR / CQR
16   | 1.01 / 0.78 / 0.81  | 0.06 / 0.13 / 0.13               | 1.11 / 1.13 / 1.06
64   | 0.92 / 0.57 / 0.58  | 0.04 / 0.10 / 0.08               | 1.71 / 1.48 / 1.43
256  | 0.85 / 0.43 / 0.44  | 0.06 / 0.05 / 0.04               | 2.27 / 1.75 / 1.71
1024 | 0.58 / 0.37 / 0.37  | 0.11 / 0.04 / 0.03               | 20.03 / 2.23 / 2.16

Compared to the GP, the RMSE of the boosted-trees surrogates is always better, which is expected as boosted trees are known for their good performance on tabular data. However, a critical aspect in HPO is to provide good uncertainty estimates in addition to good mean predictions, as uncertainty plays a critical role in balancing exploration versus exploitation. To measure the quality of the uncertainty estimates, we analyze the calibration error of the different surrogates, which measures how over- or under-confident each predicted quantile of the surrogate predictive distribution is. When few observations are available (e.g., when $n \leq 64$), the quality of the GP's uncertainty estimates is better than that of both boosted-tree methods, which is expected as the GP can rely on its prior in this regime, whereas the data-driven QR and CQR lack the amount of data needed to estimate the quantile levels well enough. The lack of data also means that CQR cannot adjust confidence intervals accurately, given that its validation set is too small, and its calibration performance just matches QR. However, as the number of samples increases, the calibration of the tree-based methods quickly becomes better, which underlines that quantile regression better fits the noise observed in the benchmarks. As expected given the theoretical coverage guarantees, the calibration of CQR exceeds the calibration of QR given sufficient data, making it a better-suited surrogate for HPO.
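The surrogate comparison protocol described above can be sketched as follows (our own illustration under stated assumptions: X and y are NumPy arrays of encoded configurations and normalized targets; the point prediction of a quantile surrogate is the average of its predicted quantiles, as in Appendix D).

```python
# Subsample n observations, fit a quantile surrogate, and score RMSE on held-out points.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def evaluate_surrogate(X, y, n, m=4, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train, test = idx[:n], idx[n:]
    alphas = [j / (m + 1) for j in range(1, m + 1)]
    models = [
        GradientBoostingRegressor(loss="quantile", alpha=a).fit(X[train], y[train])
        for a in alphas
    ]
    # Point prediction = average of the predicted quantiles.
    mean_pred = np.mean([mdl.predict(X[test]) for mdl in models], axis=0)
    return float(np.sqrt(np.mean((mean_pred - y[test]) ** 2)))
```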
[Figure 2: Performance for single-fidelity (top), multi-fidelity (middle) and multi-fidelity variants (bottom): average rank over all tasks (left), normalized regret (middle) and critical difference diagrams obtained at 50% and 100% of the total budget (right).]

Results discussion for multi-fidelity. Next, we analyze the performance in the multi-fidelity setting in the middle of Fig. 2, where we show the performance of CQR+MF, which combines the single-fidelity method with the simple transformation described in Section 5. In contrast to the single-fidelity case, multi-fidelity optimization quickly yields many hundreds of observations, and the majority of the tuning process lies in the high-data regime. In this setup, CQR+MF shows a significant improvement in terms of HPO performance. In particular, while most multi-fidelity approaches are not statistically distinguishable from ASHA, our proposed method CQR+MF offers statistically significant improvements over ASHA at all times, and over all other model-based multi-fidelity methods after spending 50% and 100% of the total budget. We attribute the improvement mainly to the multi-fidelity setting providing more observations to the HPO methods, which plays into the strengths of the CQR surrogate illustrated in the previous paragraph in terms of accuracy and calibration.

Ablation study. The surrogate analysis showed that conformal prediction improves the calibration of quantile regression but has little effect on the surrogate RMSE. To examine the benefit of this contribution in the HPO setting, we next evaluate quantile regression with and without the conformal correction (QR+MF) in the bottom of Fig. 2. The performance of QR+MF is much worse than CQR+MF, which highlights the benefit of the better uncertainty estimates provided by conformalizing predictions. Next, we investigate in Fig. 2 the performance of the best single-fidelity methods, REA, GP and BORE, extended to the multi-fidelity setting with our simple extension. As for QR+MF and CQR+MF, the surrogates of all these methods are trained using the last fidelity observed for each hyperparameter configuration, and the worst configurations are stopped with asynchronous successive halving. While these methods perform worse than CQR+MF, they all outperform ASHA in terms of average rank and regret, except for REA, which we believe is due to the lower performance of the method. These simple extensions also match or improve over the performance of dedicated model-based multi-fidelity methods.
This illustrates the robustness of the proposed extension with respect to the choice of the single-fidelity method, and also shows the potential for future work to extend other advanced single-fidelity methods, for instance multi-objective or constrained ones, to the multi-fidelity case.

7. Conclusion

We presented a new HPO approach that allows the use of highly accurate tabular predictors, such as gradient boosted trees, while obtaining calibrated uncertainty estimates through conformal prediction. In addition, we showed that most single-fidelity methods can be extended to the multi-fidelity case by just using the last fidelity available, while achieving good performance. The method we proposed has a few limitations. For instance, the use of Thompson sampling may be less efficient in the presence of many hyperparameters; further work could consider extending the method with other acquisition functions, such as UCB. Further work could also investigate providing regret bounds, or extensions to support multi-objective or transfer-learning scenarios.

References

Awad, N., Mallik, N., and Hutter, F. DEHB: Evolutionary hyperband for scalable, robust and efficient hyperparameter optimization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI'21), 2021.

Bassett, G. and Koenker, R. An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association, 77(378):407-415, 1982. doi: 10.1080/01621459.1982.10477826. URL http://dx.doi.org/10.1080/01621459.1982.10477826.

Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 2012.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyperparameter optimization. pp. 2546-2554.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NIPS'11), 2011.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.

Cowen-Rivers, A. I., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R. R., Maraval, A. M., Jianye, H., Wang, J., Peters, J., and Ammar, H. B. HEBO: Pushing the limits of sample-efficient hyperparameter optimisation, 2020. URL https://arxiv.org/abs/2012.03826.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1):1-30, 2006. URL http://jmlr.org/papers/v7/demsar06a.html.

Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Dong, X. and Yang, Y. NAS-Bench-201: Extending the scope of reproducible neural architecture search. Technical Report arXiv:2001.00326 [cs.CV], 2020.

Doyle, R. Model agnostic conformal hyperparameter optimization. arXiv:2207.03017 [cs.LG], 2022.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 1436-1445, 2018.

Feurer, M. and Hutter, F. Hyperparameter optimization. In Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018.
Friedman, J. H. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001. URL http://www.jstor.org/stable/2699986.

Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., and Januschowski, T. Probabilistic forecasting with spline quantile function RNNs. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 1901-1910. PMLR, 2019. URL https://proceedings.mlr.press/v89/gasthaus19a.html.

Hutter, F., Hoos, H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11), 2011.

Jamieson, K. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS'16), 2016.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455, 1998.

Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1238-1246, Atlanta, Georgia, USA, 2013. PMLR. URL https://proceedings.mlr.press/v28/karnin13.html.

Klein, A. and Hutter, F. Tabular benchmarks for joint architecture and hyperparameter optimization. Technical Report arXiv:1905.04970 [cs.LG], 2019.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR'17), 2017.

Klein, A., Tiao, L., Lienart, T., Archambeau, C., and Seeger, M. Model-based asynchronous hyperparameter and neural architecture search. arXiv:2003.10865 [cs.LG], 2020.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. CoRR, abs/1807.00263, 2018. URL http://arxiv.org/abs/1807.00263.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In International Conference on Learning Representations (ICLR'17), 2017.

Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Hardt, M., Recht, B., and Talwalkar, A. Massively parallel hyperparameter tuning. Technical Report arXiv:1810.05934v4 [cs.LG], 2019.

Li, Y., Shen, Y., Jiang, H., Zhang, W., Li, J., Liu, J., Zhang, C., and Cui, B. Hyper-Tune: Towards efficient hyper-parameter tuning at scale. Proc. VLDB Endow., 15(6):1256-1265, 2022. doi: 10.14778/3514061.3514071. URL https://doi.org/10.14778/3514061.3514071.

Mohr, F. and van Rijn, J. N. Learning curves for decision making in supervised machine learning: a survey. arXiv:2201.12150 [cs.LG], 2022.

Moriconi, R., Kumar, K. S., and Deisenroth, M. P. High-dimensional Bayesian optimization with projections using quantile Gaussian processes. Optimization Letters, 14:51-64, 2020.

Pfisterer, F., Schneider, L., Moosbauer, J., Binder, M., and Bischl, B. YAHPO Gym: an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. In First Conference on Automated Machine Learning (Main Track), 2022.
Picheny, V., Ginsbourger, D., Richet, Y., and Caplin, G. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2-13, 2013. doi: 10.1080/00401706.2012.707580. URL https://doi.org/10.1080/00401706.2012.707580.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search, 2019.

Romano, Y., Patterson, E., and Candès, E. J. Conformalized quantile regression. 2019. doi: 10.48550/ARXIV.1905.03222.

Salinas, D., Shen, H., and Perrone, V. A quantile-based approach for hyperparameter transfer learning. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8438-8448. PMLR, 2020. URL https://proceedings.mlr.press/v119/salinas20a.html.

Salinas, D., Seeger, M., Klein, A., Perrone, V., Wistuba, M., and Archambeau, C. Syne Tune: A library for large scale hyperparameter tuning and reproducible research. In International Conference on Automated Machine Learning, pp. 16-1. PMLR, 2022.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. CoRR, abs/0706.3188, 2007. URL http://arxiv.org/abs/0706.3188.

Shahriari, B., Swersky, K., Wang, Z., Adams, R., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.

Siems, J., Zimmer, L., Zela, A., Lukasik, J., Keuper, M., and Hutter, F. NAS-Bench-301 and the case for surrogate benchmarks for neural architecture search. CoRR, abs/2008.09777, 2020. URL https://arxiv.org/abs/2008.09777.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NIPS'12), 2012.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, and Adams, R. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), 2015.

Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NIPS'16), 2016.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012. doi: 10.1109/tit.2011.2182033. URL https://doi.org/10.1109/tit.2011.2182033.

Stanton, S., Maddox, W., and Wilson, A. G. Bayesian optimization with conformal coverage guarantees, 2022. URL https://arxiv.org/abs/2210.12496.

Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw Bayesian optimization, 2014. URL https://arxiv.org/abs/1406.3896.

Tiao, L. C., Klein, A., Seeger, M. W., Bonilla, E. V., Archambeau, C., and Ramos, F. BORE: Bayesian optimization by density-ratio estimation. In International Conference on Machine Learning, pp. 10289-10300. PMLR, 2021.

Wistuba, M. and Pedapati, T. Learning to rank learning curves. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10303-10312. PMLR, 2020.
Zimmer, L., Lindauer, M., and Hutter, F. Auto-PyTorch Tabular: Multi-fidelity metalearning for efficient and robust AutoDL. arXiv:2006.13799 [cs, stat], April 2021. URL http://arxiv.org/abs/2006.13799.

A. ASHA description

We recall how the ASHA method proposed by Li et al. (2019) works. Given some positive constant $\eta \geq 2$, let us define a finite set of rungs $R = \{\eta^0 r_{\min}, \eta^1 r_{\min}, \eta^2 r_{\min}, \ldots, r_{\max}\}$. Successive halving (Karnin et al., 2013; Jamieson & Talwalkar, 2016) starts with a set of $N = \eta^K$ initial candidates, where for simplicity we assume that $K = \log_\eta (r_{\max}/r_{\min})$. In the first iteration, successive halving collects the performance of all $N$ configurations at $r_{\min}$ and only continues the evaluation of the top $1/\eta$ configurations at the next rung level. This process is iterated until the maximum resource level $r_{\max}$ is reached, and repeated until some total budget for the entire search process is exhausted. Successive halving can be trivially parallelized in the synchronous setting; however, this requires synchronization points at each rung level to wait for stragglers. Li et al. (2019) adapted successive halving to the asynchronous setting (ASHA), where configurations are immediately promoted to the next rung level once at least $\eta$ configurations have been observed at a rung. While this potentially leads to the promotion of configurations that are not among the top $1/\eta$ configurations, it removes any synchronization overhead and has been shown to perform very well in practice. See Algo. 2 for the pseudo-code of ASHA.

Algorithm 2: ASHA pseudo-code.

function ASHA()
  Input: minimum resource r_min, maximum resource r_max, reduction factor eta
  repeat
    for each free worker:
      (x, k) = get_job()
      run, then return val_loss(x, r_min * eta^k)
    for each completed job (x, k) with loss l:
      update configuration x in rung k with loss l
  until desired

function get_job()
  // Check if there is a promotable configuration.
  for k = log_eta(r_max / r_min) - 1, ..., 1, 0:
    candidates = top_k(rung k, |rung k| / eta)
    promotable = {t in candidates : t not promoted}
    if |promotable| > 0: return promotable[0], k + 1
  // If not, grow the bottom rung.
  Suggest a random configuration x.
  return x, 0
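The following compact Python sketch mirrors the promotion logic of Algorithm 2; it is our own illustration (not the Syne Tune implementation) and assumes rungs[k] maps configurations to their recorded loss at rung k and promoted[k] records which configurations were already promoted.

```python
# Sketch of ASHA's get_job(): promote from the highest rung possible, otherwise sample
# a random configuration for the bottom rung.
import math

def get_job(rungs, promoted, r_min, r_max, eta, sample_random_config):
    """Return (configuration, rung index) for the next job."""
    top_rung = int(math.log(r_max / r_min, eta)) - 1
    for k in range(top_rung, -1, -1):                      # check the highest rungs first
        results = rungs.get(k, {})
        n_top = len(results) // eta                        # top 1/eta of the rung
        top = sorted(results, key=results.get)[:n_top]     # lower loss is better
        promotable = [x for x in top if x not in promoted.get(k, set())]
        if promotable:
            promoted.setdefault(k, set()).add(promotable[0])
            return promotable[0], k + 1                    # continue with eta times the resources
    # No promotable configuration: grow the bottom rung with a random candidate.
    return sample_random_config(), 0
```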
B. Experiment details

Statistics of the different blackboxes are given in Table 2, and their configuration spaces are given in Table 3, together with the base distribution used for each hyperparameter when sampling random candidates. For LCBench, we run the 5 most expensive tasks among the 35 tasks available ("airlines", "albert", "covertype", "christine" and "Fashion-MNIST"). Since LCBench does not contain all possible evaluations on a grid, we run evaluations using a k-nearest-neighbors surrogate with k = 1. We use the Yahpo implementation (Pfisterer et al., 2022) to get access to NAS301 (Siems et al., 2020).

Table 2: Tabulated benchmark statistics.

Benchmark | #Evaluations | #Hyperparameters | #Tasks | #Fidelities
FCNet     | 62208        | 9                | 4      | 100
LCBench   | 2000         | 7                | 5      | 52
NAS201    | 15625        | 6                | 3      | 200
NAS301    | NA           | 35               | 1      | 97

Table 3: Configuration spaces for all tabulated benchmarks.

Benchmark | Hyperparameter | Configuration space | Domain
FCNet | activation_1 | [tanh, relu] | categorical
FCNet | activation_2 | [tanh, relu] | categorical
FCNet | batch_size | [8, 16, 32, 64] | finite-range log-space
FCNet | dropout_1 | [0.0, 0.3, 0.6] | finite-range
FCNet | dropout_2 | [0.0, 0.3, 0.6] | finite-range
FCNet | init_lr | [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1] | categorical
FCNet | lr_schedule | [cosine, const] | categorical
FCNet | n_units_1 | [16, 32, 64, 128, 256, 512] | finite-range log-space
FCNet | n_units_2 | [16, 32, 64, 128, 256, 512] | finite-range log-space
NAS201 | x0 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
NAS201 | x1 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
NAS201 | x2 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
NAS201 | x3 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
NAS201 | x4 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
NAS201 | x5 | [avg_pool_3x3, nor_conv_3x3, skip_connect, nor_conv_1x1, none] | categorical
LCBench | num_layers | [1, 5] | uniform
LCBench | max_units | [64, 512] | log-uniform
LCBench | batch_size | [16, 512] | log-uniform
LCBench | learning_rate | [1e-4, 1e-1] | log-uniform
LCBench | weight_decay | [1e-5, 0.1] | uniform
LCBench | momentum | [0.1, 0.99] | uniform
LCBench | max_dropout | [0.0, 1.0] | uniform
NAS301 | edge-normal-{0-13} | [max_pool_3x3, avg_pool_3x3, skip_connect, sep_conv_3x3, sep_conv_5x5, dil_conv_3x3, dil_conv_5x5] | categorical
NAS301 | node-normal-{0-13} | [max_pool_3x3, avg_pool_3x3, skip_connect, sep_conv_3x3, sep_conv_5x5, dil_conv_3x3, dil_conv_5x5] | categorical
NAS301 | inputs-node-normal-{3-5} | [0 1, 0 2, 1 2] | categorical
NAS301 | inputs-node-reduce-{3-5} | [0 1, 0 2, 1 2] | categorical

Scheduler details. We give here the list of parameters used for running the schedulers in our experiments. In general, the Syne Tune defaults have been used.

REA is run with a population size of 10, and 5 samples are drawn to select a mutation from.

GP is run using a Matérn 5/2 kernel with automatic relevance determination parameters. For each suggestion, the surrogate model is fit by marginal likelihood maximization, and a configuration is returned which maximizes the expected improvement acquisition function. This involves averaging over 20 samples of fantasy outcomes for pending evaluations.

TPE is based on a multivariate kernel density estimator, as proposed by Falkner et al. (2018), to capture interactions between hyperparameters, which is not possible with the univariate kernel density estimator used in the original TPE approach (Bergstra et al.). We limit the minimum bandwidth of the kernel density estimator to 0.1 to avoid all probability mass being assigned to a single categorical value, which would eliminate extrapolation.

ASHA runs the stopping variant described in Klein et al. (2020) with grace period 1 and reduction factor 3, so that stopping of trials happens after 1, 3, 9, ... epochs. Configurations for new trials are sampled at random.

BOHB uses the same multivariate kernel density estimator as TPE, and hyperparameters are set to the default values in Falkner et al. (2018). Note that BOHB uses the same asynchronous scheduling as ASHA and MOB, while the algorithm in Falkner et al. (2018) is synchronous.

MOB runs the same scheduling as ASHA, but configurations for new trials are chosen as in Bayesian optimization. We deal with pending evaluations by averaging the acquisition function over 20 samples of fantasy outcomes.

BORE is evaluated with XGBoost as the classifier with default hyperparameters (Chen & Guestrin, 2016). We use the default hyperparameters of the method, in particular γ = 1/4, which means that the method maximizes the probability that a configuration is in the top 25% of configurations.

HT uses the same rung levels (given by grace period and reduction factor) as ASHA and fits independent Gaussian process models to the data at each rung level. Compared to MOB, HT uses a more advanced acquisition function, which averages the rung-level models with a weighting that depends on the fraction of trials which flipped ranks between low and high levels.
QR and CQR: we estimate m = 4 quantiles by fitting gradient boosted tree models with quantile losses, and we use the same hyperparameters as BORE for the boosted trees. We conformalize only when more than 32 samples are available, to avoid poorly estimated corrections due to too few samples, and use 10% of the data for validation. We sample N = 2000 candidates when performing independent Thompson sampling to select the next candidate.

{REA/GP/BORE/CQR} + MF: for these methods, we run ASHA and call each single-fidelity method when a new configuration has to be suggested, using the data transformation proposed in Section 5 to obtain the method's observations $D = \{(x_i, z_i)\}_{i=1}^{n}$.

All multi-fidelity methods based on successive halving/ASHA (i.e., all except BOHB) use a single bracket. We use the implementations of all baselines provided in Syne Tune (https://github.com/awslabs/syne-tune).

C. Performance per task

We plot the performance of single-fidelity methods on all 13 tasks in Figure 3, of multi-fidelity methods in Figure 4, and of the multi-fidelity ablation variants in Figure 5.

D. Surrogate performance

RMSE. To compute the RMSE for quantile predictions, we first compute the mean by taking the average of the predicted quantiles.

Calibration error. A calibrated estimator of the conditional distribution $\hat{q}_\alpha(x)$ should exceed the target $\alpha$ percent of the time asymptotically. For instance, we expect the predicted 90th percentile to exceed the target 90% of the time for a calibrated estimator. Formally, we expect the following property to hold for a calibrated estimator for all $\alpha \in [0, 1]$:

$\mathbb{E}_{(x,y) \sim D}\left[\mathbf{1}_{y < \hat{q}_\alpha(x)}\right] = \alpha,$

where the mean is taken over examples of a validation set $(x, y) \in D_{\text{val}}$. The calibration error measures the gap between the predictions and this expected property (Kuleshov et al., 2018). Let us denote by $P(\alpha) = \mathbb{E}_{(x,y) \sim D}[\mathbf{1}_{y < \hat{q}_\alpha(x)}]$ the average number of times the prediction $\hat{q}_\alpha(x)$ exceeds the target; the calibration error is then defined over a set of quantiles $\alpha_1, \ldots, \alpha_k$ as

$\sqrt{\sum_{j=1}^{k} \left(P(\alpha_j) - \alpha_j\right)^2}.$

We report this error on 5 equally-spaced quantiles in the surrogate experiments. For models predicting quantiles (QR and CQR), we evaluate the calibration directly on the predicted quantiles; for models predicting a normal distribution (GP), we compute quantile predictions using the Gaussian inverse CDF.
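A short sketch of this calibration error is given below (our own illustration with hypothetical inputs): quantile_preds[alpha] holds the predictions $\hat{q}_\alpha(x)$ for the validation inputs, and the five equally spaced levels j/6 are one possible choice, not necessarily the one used in the paper.

```python
# Calibration error of Appendix D for a model that outputs quantile predictions.
import numpy as np

def calibration_error(quantile_preds, y_val, alphas=(1/6, 2/6, 3/6, 4/6, 5/6)):
    """sqrt( sum_j (P(alpha_j) - alpha_j)^2 ), with P(alpha) the empirical coverage."""
    y_val = np.asarray(y_val)
    gaps = []
    for a in alphas:
        coverage = np.mean(y_val < quantile_preds[a])   # fraction of targets below q_hat_a
        gaps.append(coverage - a)
    return float(np.sqrt(np.sum(np.square(gaps))))
```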
[Figure 3: Performance of single-fidelity methods over time on all individual tasks considered. Mean and standard errors are computed over 30 seeds.]

[Figure 4: Performance of multi-fidelity methods over time on all individual tasks considered. Mean and standard errors are computed over 30 seeds.]
[Figure 5: Performance of multi-fidelity ablation variants over time on all individual tasks considered. Mean and standard errors are computed over 30 seeds.]

Table 4: Calibration error, Pearson correlation and runtime for different surrogates when increasing the number of samples, per task. Results are averaged over 30 different seeds.

task | n    | RMSE: GP / QR / CQR | Calibration error: GP / QR / CQR | Runtime: GP / QR / CQR
     | 16   | 1.44 / 0.89 / 0.88  | 0.17 / 0.13 / 0.16               | 1.92 / 0.95 / 0.87
     | 64   | 1.19 / 0.68 / 0.67  | 0.11 / 0.09 / 0.08               | 3.11 / 1.36 / 1.27
     | 256  | 1.10 / 0.51 / 0.54  | 0.14 / 0.07 / 0.04               | 3.90 / 1.67 / 1.59
     | 1024 | 0.92 / 0.45 / 0.45  | 0.17 / 0.07 / 0.05               | 19.02 / 2.45 / 2.34

     | 16   | 0.79 / 0.74 / 0.81  | 0.01 / 0.14 / 0.13               | 0.54 / 1.20 / 1.14
     | 64   | 0.78 / 0.53 / 0.54  | 0.01 / 0.09 / 0.06               | 0.74 / 1.49 / 1.41
     | 256  | 0.72 / 0.41 / 0.42  | 0.02 / 0.06 / 0.04               | 0.97 / 1.71 / 1.73
     | 1024 | 0.22 / 0.35 / 0.35  | 0.05 / 0.03 / 0.03               | 19.39 / 2.11 / 2.04

     | 16   | 0.79 / 0.71 / 0.72  | 0.01 / 0.13 / 0.11               | 0.88 / 1.22 / 1.17
     | 64   | 0.80 / 0.51 / 0.53  | 0.01 / 0.11 / 0.09               | 1.28 / 1.60 / 1.60
     | 256  | 0.75 / 0.37 / 0.37  | 0.03 / 0.04 / 0.04               | 1.96 / 1.87 / 1.81
     | 1024 | 0.58 / 0.31 / 0.31  | 0.12 / 0.02 / 0.02               | 21.68 / 2.14 / 2.10