Learning to Rank Learning Curves

Martin Wistuba 1   Tejaswini Pedapati 1

1 IBM Research. Correspondence to: Martin Wistuba.

Abstract

Many automated machine learning methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other datasets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of the ranking, the influence of the different model components, as well as the predictive behavior of the model.

1. Introduction

A method commonly used by human experts to speed up the optimization of neural architectures or hyperparameters is the early termination of iterative training processes that are unlikely to improve the current solution. A common technique to determine the likelihood of no improvement is to compare the learning curve of a new configuration to that of the currently best configuration. This idea can also be used to speed up automated machine learning processes. For this purpose, it is common practice to extrapolate the partial learning curve in order to predict the final performance of the currently investigated model. Current extrapolation techniques have several weaknesses that prevent them from realizing their full potential in practice. Many of the methods require a sufficient number of sample learning curves to make reliable predictions (Chandrashekaran & Lane, 2017; Klein et al., 2017; Baker et al., 2018). Thus, the extrapolation method cannot be used for the first candidates, which means additional computational effort. Other methods do not have this disadvantage, but require sufficiently long learning curves to make reliable predictions, which again means unnecessary overhead (Domhan et al., 2015). Many of these methods also do not take into account other information such as the hyperparameters of the model being examined or its network architecture.

We address the need for sample learning curves by devising a transfer learning technique that uses learning curves from other problems. Since the range of accuracy varies from dataset to dataset, we have to account for this in our modeling. But since we are not interested in predicting the performance of a model anyway, we use a ranking model that models the probability that the model currently being investigated surpasses the best solution so far. In order to be able to make reliable predictions for short learning curves, we consider further characteristics of the model such as its network architecture. We compare our ranking method with respect to a ranking measure against different methods on five different image classification and four tabular regression datasets.
We also show that our method is capable of significantly accelerating neural architecture search (NAS) and hyperparameter optimization. Furthermore, we conduct several ablation studies to provide a better motivation of our model and its behavior.

2. Related Work

Most of the prior work on learning curve prediction is based on the idea of extrapolating the partial learning curve using a combination of continuously increasing basic functions. Domhan et al. (2015) define a set of 11 parametric basic functions, estimate their parameters, and combine them in an ensemble. Klein et al. (2017) propose a heteroscedastic Bayesian model which learns a weighted average of the basic functions. Chandrashekaran & Lane (2017) do not use basic functions but rely on previously observed learning curves of the current dataset. An affine transformation for each previously seen learning curve is estimated by minimizing the mean squared error with respect to the partial learning curve. The best fitting extrapolations are averaged as the final prediction. Baker et al. (2018) use a different procedure. They use support vector machines as sequential regression models to predict the final accuracy based on features extracted from the learning curve, its gradients, and the neural architecture itself.

The predictor by Domhan et al. (2015) is able to forecast without seeing any learning curve before, but requires observing more epochs for accurate predictions. The model by Chandrashekaran & Lane (2017) requires seeing a few learning curves before it can extrapolate future learning curves; however, accurate forecasts are then already possible after a few epochs. The algorithms proposed by Klein et al. (2017) and Baker et al. (2018) need to observe many full-length learning curves before providing any useful forecasts. This is prohibitive in scenarios where training is time-consuming, such as for large convolutional neural networks (CNNs).

All previous methods for automatically terminating iterative learning processes are based on methods that predict the learning curve. Ultimately, however, we are less interested in the exact learning curve than in whether the current learning curve leads to a better result. This way of obtaining a ranking is referred to as a pointwise ranking method (Liu, 2011). Pointwise methods have proven to be less effective than pairwise ranking methods, which directly optimize for the objective function (Burges et al., 2005). Yet, we are the first to consider a pairwise ranking loss for this application.

There are also bandit-based methods which leverage early termination. Successive Halving (Jamieson & Talwalkar, 2016) trains multiple models with settings chosen at random simultaneously and terminates the worst performing half at predefined intervals. Hyperband (Li et al., 2017) identifies that the choice of the intervals is vital and therefore proposes to run Successive Halving with different intervals. BOHB (Falkner et al., 2018) is an extension of Hyperband which replaces the random selection of settings with Bayesian optimization.
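To make the bandit-based baselines concrete, the following is a minimal Python sketch of the Successive Halving scheme just described. The function name and the train_one_interval callback are hypothetical placeholders for illustration, not part of any of the cited implementations.

```python
def successive_halving(configs, train_one_interval, budget=1, eta=2):
    """Minimal sketch of Successive Halving (Jamieson & Talwalkar, 2016).

    configs: list of candidate settings.
    train_one_interval(config, budget): hypothetical callback that continues
        training a configuration for the given budget and returns its current
        validation score (higher is better).
    """
    survivors = list(range(len(configs)))
    scores = [None] * len(configs)
    while len(survivors) > 1:
        # Train every surviving configuration for the current interval.
        for i in survivors:
            scores[i] = train_one_interval(configs[i], budget)
        # Keep only the best-performing half (more generally, the top 1/eta).
        survivors.sort(key=lambda i: scores[i], reverse=True)
        survivors = survivors[: max(1, len(survivors) // eta)]
        # The budget per surviving configuration grows in each round.
        budget *= eta
    return configs[survivors[0]]
```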
The use of methods that terminate less promising iterative training processes early is of particular interest in the field of automated machine learning, as such methods can, for example, significantly accelerate hyperparameter optimization. The most computationally intensive subproblem of automated machine learning is neural architecture search (Wistuba et al., 2019), the optimization of the neural network topology. Our method is not limited to this problem, but as it is currently one of the major challenges in automated machine learning, we use it as a sample application in our evaluation. For this particular application, optimization methods that leverage parameter sharing (Pham et al., 2018) have become established as a standard approach. Here, an overparameterized neural network spanning the search space is learned and finally used to determine the best architecture. DARTS (Liu et al., 2019) is one of the latest extensions of this idea; it uses a continuous relaxation of the search space, which allows learning the shared parameters as well as the architecture by means of gradient descent. These methods can be a strong alternative to early termination but do not transfer to general machine learning, and recent research implies that this approach does not work for arbitrary architecture search spaces (Yu et al., 2020).

3. Learning Curve Ranking

With learning curve we refer to the performance of an iterative learning algorithm as a function of the number of training iterations. We use the term final learning curve to explicitly denote the entire learning curve, $y_1, \ldots, y_L$, reflecting the training process from beginning to end. Here, $y_i$ is a measure of the performance of the model (e.g., classification accuracy), which is determined at regular intervals. In contrast, a partial learning curve, $y_1, \ldots, y_l$, refers to a learning curve that is observed only up to a time $l$. We visualize these concepts in Figure 1.

Figure 1. Learning curve prediction tries to predict the final performance from the partial learning curve (solid line). (Plot of validation accuracy over training, marking intermediate and final performance.)

There is a set of automated early termination methods which broadly follow the same principle. The first model is trained to completion and is considered the current best model $m_{\max}$ with performance $y_{\max}$. For each further model $m_i$, the probability that it is better than the current best model, $p(m_i > m_{\max})$, is monitored at periodic intervals during training. If it falls below a given threshold, the model's training is terminated (see Algorithm 1). All existing methods rely on a two-step approach to determine this probability, which involves extrapolating the partial learning curve and applying a heuristic measure. Instead of this two-step process, we propose LCRankNet to predict the probability that a model $m_i$ is better than $m_j$ directly. LCRankNet is based on a neural network $f$ which takes model characteristics and the partial learning curve, jointly denoted $x_i$, as input. We define the probability that $m_i$ is better than $m_j$ as

$$
p(m_i > m_j) = \hat{p}_{i,j} = \frac{e^{f(x_i) - f(x_j)}}{1 + e^{f(x_i) - f(x_j)}} \ . \qquad (1)
$$

Using the logistic function is an established modelling approach in the learning-to-rank community (Burges et al., 2005). Given a set of final learning curves for some models, determining the target values $p_{i,j}$ between these models is trivial, since we know whether $m_i$ is better than $m_j$ or not. Therefore, we set

$$
p_{i,j} = \begin{cases} 1 & \text{if } m_i > m_j \\ 0.5 & \text{if } m_i = m_j \\ 0 & \text{if } m_i < m_j \end{cases} \ . \qquad (2)
$$

We minimize the cross-entropy loss

$$
\mathcal{L}_{\text{ce}} = \sum_{i,j} - p_{i,j} \log \hat{p}_{i,j} - (1 - p_{i,j}) \log (1 - \hat{p}_{i,j}) \qquad (3)
$$

to determine the parameters of $f$. Now $p(m_i > m_j)$ can be predicted for arbitrary pairs $m_i$ and $m_j$ using Equation (1). A model $f_l$ is trained on all partial learning curves of length $l$ to predict the probabilities for curves of length $l$.
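For concreteness, Equations (1)-(3) translate directly into code. The following is a minimal NumPy sketch under our own naming; the scores would come from the network $f$ described below, and the numerically stabler form of the logistic function is used, which is mathematically identical to Equation (1).

```python
import numpy as np

def pairwise_probability(f_xi, f_xj):
    """Equation (1): probability that model i beats model j, given scores f(x_i), f(x_j)."""
    return 1.0 / (1.0 + np.exp(-(f_xi - f_xj)))  # logistic of the score difference

def pairwise_target(y_final_i, y_final_j):
    """Equation (2): target p_ij derived from the observed final performances."""
    if y_final_i > y_final_j:
        return 1.0
    if y_final_i < y_final_j:
        return 0.0
    return 0.5

def ranking_loss(scores, final_performances):
    """Equation (3): cross-entropy summed over ordered pairs of training curves."""
    loss, eps = 0.0, 1e-12
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p = pairwise_target(final_performances[i], final_performances[j])
            p_hat = np.clip(pairwise_probability(scores[i], scores[j]), eps, 1.0 - eps)
            loss -= p * np.log(p_hat) + (1.0 - p) * np.log(1.0 - p_hat)
    return loss
```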
Algorithm 1  Early Termination Method
Input: Dataset $d$, model $m$, performance $y_{\max}$ of the best model $m_{\max}$ so far.
Output: Learning curve.
1: for $l \leftarrow 1 \ldots L$ do
2:   Train $m$ on $d$ for a step and observe a further part of the learning curve, $y_l$.
3:   if $\max_{1 \le i \le l} y_i > y_{\max}$ then
4:     continue
5:   else if $p(m > m_{\max}) \le \delta$ then
6:     return $y$
7:   end if
8: end for
9: return $y$

3.1. Ranking Model to Learn Across Datasets

We have defined the prediction of our model in Equation (1). It depends on the output of the function $f$, which takes a model representation $x$ as input. The information contained in this representation depends on the task at hand. The representation consists of up to four different parts. First, the partial learning curve $y_1, \ldots, y_l$. Second, the description of the model's architecture, which is a sequence of strings representing the layers it is comprised of. Third, a dataset ID indicating on which dataset the corresponding architecture was trained and the learning curve observed. Fourth, all remaining hyperparameters, e.g. learning rate or batch size.

Figure 2. LCRankNet has four different components, each dealing with one type of input: architecture encoding, partial learning curve, dataset ID, and further hyperparameters. (The learning curve is processed by convolutions with kernel sizes k = 2..5; all component outputs feed into a fully connected layer.)

We model $f$ using a neural network and use special layers to process the different parts of the representation. The learned representations for the different parts are finally concatenated and fed to a fully connected layer. The architecture is visualized in Figure 2. We now describe the different components in detail.

Learning Curve Component  A CNN is used to process the partial learning curve. We consider up to four different convolutional layers with different kernel sizes. The exact number depends on the length of the learning curve, since we consider no kernel sizes larger than the learning curve length. Each convolution is followed by a global max pooling layer and their outputs are concatenated. This results in the learned representation of the learning curve. In the appendix we analyze alternative modeling options.

Architecture Component  In this paper we consider two experiments for learning curve prediction. In one we deal with NAS in the NASNet search space (Zoph et al., 2018); in the other we work on the joint optimization of architecture and hyperparameters on tabular data (Klein & Hutter, 2019). In the former experiment, the search space consists of CNNs which are composed of two cells with five blocks each. This means that 60 architecture decisions have to be made. In the second experiment, the fully connected neural networks (FCNNs) consist of only two layers, so the number of decisions is only six. For more details refer to the appendix. We learn an embedding for every option. An LSTM takes these embeddings and generates the architecture embedding.

Dataset Component  As we intend to learn from learning curves of other datasets, we include dataset embeddings as one of the components. Every dataset has its own embedding, and it is used whenever a learning curve observed on this dataset is selected. If the dataset is new, the corresponding embedding is initialized at random. As the model observes more learning curves from this dataset, it improves the dataset embedding. The embedding helps us to model dataset-specific peculiarities in order to adjust the ranking if necessary.
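A minimal PyTorch-style sketch of how these components could be wired together is shown below. All layer sizes, vocabulary sizes, and variable names are our own illustrative assumptions rather than the paper's exact configuration, and the reconstruction-based regularizer described under Technical Details below is omitted.

```python
import torch
import torch.nn as nn

class LCRankNetSketch(nn.Module):
    """Illustrative sketch of the scoring network f(x); all sizes are assumptions."""

    def __init__(self, n_arch_tokens, n_datasets, n_hyperparams,
                 curve_len, emb_dim=32, hidden=64):
        super().__init__()
        # Learning curve component: 1-D convolutions with kernel sizes 2..5
        # (only kernels no longer than the observed curve), each followed by
        # global max pooling.
        self.convs = nn.ModuleList([
            nn.Conv1d(1, hidden, kernel_size=k)
            for k in range(2, 6) if k <= curve_len
        ])
        # Architecture component: embedding of layer tokens followed by an LSTM.
        self.arch_emb = nn.Embedding(n_arch_tokens, emb_dim)
        self.arch_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        # Dataset component: one learned embedding per dataset ID.
        self.dataset_emb = nn.Embedding(n_datasets, emb_dim)
        # Remaining hyperparameters are passed directly to the final layer.
        in_dim = hidden * len(self.convs) + hidden + emb_dim + n_hyperparams
        self.out = nn.Linear(in_dim, 1)

    def forward(self, curve, arch_tokens, dataset_id, hyperparams):
        # curve: (batch, 1, l), arch_tokens: (batch, seq), dataset_id: (batch,)
        curve_feats = [conv(curve).max(dim=-1).values for conv in self.convs]
        _, (h, _) = self.arch_lstm(self.arch_emb(arch_tokens))
        parts = curve_feats + [h[-1], self.dataset_emb(dataset_id), hyperparams]
        return self.out(torch.cat(parts, dim=-1)).squeeze(-1)  # scalar score f(x)
```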
Hyperparameter Component  Other hyperparameters, if available, are passed directly to the last layer.

Technical Details  During the development process, we found that the architecture component leads to instabilities during training. To avoid this, we regularize the output of the LSTM layer by using it as an input to an autoregressive model, similar to a sequence-to-sequence model (Sutskever et al., 2014), that recreates the original description of the architecture. In addition, we use the attention mechanism (Bahdanau et al., 2015) to facilitate this process. All parameters of the layers in $f$ are trained jointly by means of Adam (Kingma & Ba, 2015) by minimizing

$$
\mathcal{L} = \alpha \mathcal{L}_{\text{ce}} + (1 - \alpha) \, \mathcal{L}_{\text{rec}} \ , \qquad (4)
$$

a weighted linear combination of the ranking loss (Equation (3)) and the reconstruction loss, with $\alpha = 0.8$.

The hyperparameter $\delta$ allows trading precision vs. recall, or search time vs. regret, respectively. There are two extreme cases: if we set $\delta \to 1$, every run is terminated immediately; if we set $\delta \to 0$, we never terminate any run. For our experiments we set $\delta = 0.45$, which means that if the predicted probability that the new model is better than the best one is below 45%, the run is terminated early.

4. Experiments

In this section, we first discuss how to create the meta-knowledge and then analyze our model in terms of learning curve ranking and its ability to accelerate NAS. Finally, we examine its individual components and its behavior in certain scenarios. Furthermore, we show that LCRankNet can be combined with various optimization methods and used for hyperparameter optimization tasks.

4.1. Meta-Knowledge

We compare our method to similar methods on five different datasets: CIFAR-10, CIFAR-100, Fashion-MNIST, Quickdraw, and SVHN. We use the original train/test splits if available. Quickdraw has a total of 50 million data points and 345 classes. To reduce the training time, we select a subset of this dataset: 100 randomly selected classes with 300 examples per class for the training split and 100 per class for the test split. For all datasets, 5,000 random data points of the training dataset serve as the validation split.

To create the meta-knowledge, we choose 200 architectures per dataset at random from the NASNet search space (Zoph et al., 2018), such that we train a total of 1,000 architectures. We would like to point out that these are 1,000 unique architectures; no architecture has been trained on several different datasets. Each architecture is trained for 100 epochs with stochastic gradient descent and a cosine learning rate schedule without restarts (Loshchilov & Hutter, 2017). We use standard image preprocessing and augmentation: for every image, we first subtract the channel mean and then divide by the channel standard deviation. Images are padded by a margin of four pixels and randomly cropped back to the original dimension. For all datasets but SVHN we apply random horizontal flipping. Additionally, we use Cutout (DeVries & Taylor, 2017).
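As a reference, the preprocessing pipeline just described could be assembled as in the following sketch. The channel statistics, image size, and Cutout patch size are dataset-dependent placeholders, and the small Cutout transform is our own stand-in since torchvision does not ship one under that name.

```python
import torch
from torchvision import transforms

class Cutout:
    """Masks a random square patch of the image tensor with zeros (DeVries & Taylor, 2017)."""
    def __init__(self, size=16):  # patch size is an assumption, not taken from the paper
        self.size = size

    def __call__(self, img):  # img: (C, H, W) tensor
        _, h, w = img.shape
        y, x = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
        y1, y2 = max(0, y - self.size // 2), min(h, y + self.size // 2)
        x1, x2 = max(0, x - self.size // 2), min(w, x + self.size // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

# Image side length and channel statistics shown for CIFAR-like inputs; both are
# placeholders that would be set per dataset.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad by four pixels, crop back to original size
    transforms.RandomHorizontalFlip(),      # skipped for SVHN in the paper
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.25, 0.25, 0.25)),
    Cutout(size=16),
])
```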
Table 1. Results obtained by the different methods on five different datasets. For both metrics, smaller is better. Regret is reported in percent, time in GPU hours.

| Method | CIFAR-10 Regr. / Time | CIFAR-100 Regr. / Time | Fashion Regr. / Time | Quickdraw Regr. / Time | SVHN Regr. / Time |
| --- | --- | --- | --- | --- | --- |
| No early termination | 0.00 / 1023 | 0.00 / 1021 | 0.00 / 1218 | 0.00 / 1045 | 0.00 / 1485 |
| Domhan et al. (2015) | 0.56 / 346 | 0.82 / 326 | 0.00 / 460 | 0.44 / 331 | 0.28 / 471 |
| Li et al. (2017) | 0.22 / 106 | 0.78 / 102 | 0.32 / 132 | 0.54 / 109 | 0.00 / 156 |
| Baker et al. (2018) | 0.00 / 89 | 0.00 / 77 | 0.00 / 129 | 0.00 / 107 | 0.00 / 241 |
| Jamieson & Talwalkar (2016) | 0.62 / 62 | 0.00 / 54 | 0.18 / 70 | 0.40 / 60 | 0.28 / 88 |
| Chandrashekaran & Lane (2017) | 0.62 / 30 | 0.00 / 35 | 0.28 / 41 | 0.30 / 82 | 0.06 / 164 |
| LCRankNet | 0.22 / 20 | 0.00 / 11 | 0.10 / 19 | 0.00 / 28 | 0.10 / 74 |

For the experiments in Section 4.6 we rely on the tabular benchmark (Klein & Hutter, 2019). This benchmark contains meta-data for four regression tasks. In total, 62,208 different FCNNs with different architectures and hyperparameters are evaluated. 100 settings per dataset are chosen at random as meta-knowledge.

The following experiments are conducted in a leave-one-dataset-out cross-validation. That means that when considering one dataset, all meta-knowledge but the one for this particular dataset is used.

4.2. Ranking Performance

First, we analyze the quality of the learning curve rankings produced by different learning curve prediction methods. In this experiment we choose 50 different learning curves at random as a test set. Five random learning curves are used as a training set for every repetition. Each learning curve prediction method ranks the 50 architectures by observing the partial learning curve, whose length varies from 0 to 30 percent of the final learning curve. We repeat this experiment ten times and report the mean and standard deviation of the correlation between the true and the predicted ranking in Figure 3. As a correlation metric, we use Spearman's rank correlation coefficient. Thus, the correlation is 1 for a perfect ranking and 0 for an uncorrelated, random ranking.

Figure 3. Ranking quality (Spearman correlation over the seen fraction of the learning curve) for Baker et al., Chandrashekaran & Lane, Domhan et al., the last-seen-value baseline, and LCRankNet on the five image datasets. The x-axis indicates the observed length of the learning curve. We report the mean of ten repetitions. The shaded area is the standard deviation. Our method LCRankNet outperforms its competitors on all datasets.

Our method LCRankNet shows a higher correlation on all datasets. When no or only very short partial learning curves are available, our method shows the biggest difference to the existing methods. The reason for this is the combination of considering the network architecture together with additional meta-knowledge. We analyze the impact of each component in detail in Section 4.5. The method by Chandrashekaran & Lane (2017) consistently shows the second-best results and in some cases catches up to the results of our method. The method by Baker et al. (2018) stands out due to its high standard deviation. It is by far the method with the smallest squared error on the test set. However, the smallest changes in the prediction lead to a significantly different ranking, which explains the high variance in its results. The method by Domhan et al. (2015) requires a minimum length of the learning curve to make predictions. Accordingly, we observe rank correlation values only from a learning curve length of 4% onward. Using the last seen value to determine the ranking of learning curves is a simple yet efficient method (Klein et al., 2017). In fact, it is able to outperform some of the more elaborate methods.
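The evaluation metric itself is easy to reproduce; a small sketch with SciPy, using placeholder values, is shown below: the predicted scores for a set of partial curves are compared against the final accuracies those runs actually reach.

```python
from scipy.stats import spearmanr

# Placeholder values: scores assigned by a predictor to four partial learning
# curves, and the final validation accuracies those runs actually reached.
predicted_scores = [0.41, 0.87, 0.12, 0.65]
final_accuracies = [0.71, 0.93, 0.55, 0.80]

rho, _ = spearmanr(predicted_scores, final_accuracies)
print(f"Spearman rank correlation: {rho:.3f}")  # 1 = perfect ranking, 0 = random
```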
4.3. Accelerating Random Neural Architecture Search

In this experiment, we demonstrate the utility of learning curve predictors in the search for network architectures. For the sake of simplicity, we accelerate a random search in the NASNet search space (Zoph et al., 2018). The random search samples 200 models and trains each of them for 100 epochs to obtain the final accuracy. In the end, the best of all these models is returned. Each learning curve predictor iterates over these sampled architectures in the same order and determines at every third epoch whether the training should be aborted. For Successive Halving (Jamieson & Talwalkar, 2016) and Hyperband (Li et al., 2017) we follow the algorithm defined by the authors and use the recommended settings. The current best model discovered after iterating over all 200 architectures is returned as the best model for this learning curve predictor. The goal of the methods is to minimize the regret compared to a random search without early stopping and at the same time reduce the computational effort.

One of our observations is that the method by Domhan et al. (2015) underestimates performance when the learning curve is short. As a result, the method ends each training process early after only a few epochs. Therefore, we follow the recommendation by Domhan et al. (2015) and do not end processes before we have seen the first 30 epochs.

We summarize the results in Table 1. Our first observation is that all methods accelerate the search with little regret. Here, we define regret as the difference between the accuracy of the model found by the plain random search and the accuracy of the model found by one of the methods. Not surprisingly, the method by Domhan et al. (2015) takes up the most time, as it requires significantly longer learning curves to make its decision. In addition, we can confirm the results of Baker et al. (2018) and Chandrashekaran & Lane (2017), both of which report better results than Domhan et al. (2015). Our method requires the least amount of time for each dataset. For CIFAR-100 we do not observe any regret, but a reduction of the time by a factor of 100. In some cases we observe an insignificantly higher regret than some of the other methods. In our opinion, the time saved makes up for it.

Figure 4. LCRankNet speeding up the architecture search for SVHN (validation error over learning curve length in %). The differences between learning curves are small, making this the hardest task.

In Figure 4 we visualize the random search for SVHN. As can be seen, many curves not only show similar behavior but also reach a similar error. For this reason, it is difficult to decide whether a model can safely be discarded, which explains the increased runtime. The only methods that do not show this behavior are Successive Halving and Hyperband. The simple reason is that the number of runs terminated and the time at which they are discarded are fixed by the algorithm and do not depend on the properties of the learning curve. The disadvantages are obvious: promising runs may be terminated and unpromising runs may run longer than needed.
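To summarize the protocol of this experiment, the sketch below couples a random search over pre-sampled architectures with a predictor queried at every third epoch, following Algorithm 1 and the threshold δ = 0.45; train_one_epoch and prob_better_than_best are hypothetical callbacks standing in for the training routine and for a predictor such as LCRankNet.

```python
def accelerated_random_search(architectures, train_one_epoch, prob_better_than_best,
                              epochs=100, check_every=3, delta=0.45):
    """Random search with early termination, as in the experiment of Section 4.3."""
    best_arch, best_acc = None, float("-inf")
    for arch in architectures:
        curve = []
        for epoch in range(1, epochs + 1):
            curve.append(train_one_epoch(arch))  # validation accuracy after this epoch
            if max(curve) > best_acc:
                continue  # already better than the incumbent, keep training
            # Every third epoch, ask the predictor whether to stop this run early.
            if epoch % check_every == 0 and prob_better_than_best(arch, curve) <= delta:
                break
        if max(curve) > best_acc:
            best_arch, best_acc = arch, max(curve)
    return best_arch, best_acc
```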
Finally, we compare our method with DARTS (Liu et al., 2019), a method that speeds up the search by parameter sharing. For a fair comparison, we train the architecture discovered in the previous experiment with a training setup equivalent to that of DARTS. This architecture achieves a classification error of 2.99% on CIFAR-10 after a search of only 20 GPU hours. In comparison, DARTS (1st order) needs 36 GPU hours for a similarly good architecture (3.00% error). DARTS (2nd order) can improve these results to 2.76%, but it takes 96 GPU hours. One of the major disadvantages of DARTS and related methods is that they cannot generalize to arbitrary architecture search spaces (Yu et al., 2020) and do not allow hyperparameter optimization. Our proposed method does not suffer from these problems.

4.4. LCRankNet Prediction Analysis

We saw in the previous section that LCRankNet does not perform perfectly. There are cases where its regret is greater than zero or where the search time is higher, which indicates that early stopping was applied later than possible. In this section we try to shed some light on the decisions made by the model and give the reader some insight into its behavior.

Figure 5. Analysis of the predicted probability $p(m > m_{\max})$ of LCRankNet with growing learning curve length (validation accuracy over learning curve length in %). The plots in the top row show examples of correct decisions by LCRankNet. The bottom row shows two examples where its decision was wrong.

We provide four example decisions of LCRankNet in Figure 5. The plots in the top row show cases where LCRankNet made a correct decision; the plots in the bottom row are examples of incorrect decisions. In both of the correct cases (top row), LCRankNet assigns a higher probability to begin with, using meta-knowledge. However, in the top left case it becomes evident after only very few epochs that $m_{\max}$ is consistently better than $m$, such that the probability $p(m > m_{\max})$ drops sharply to values close to zero, which would correctly stop training early on. The case in the top right plot is trickier. The probability increases up to a learning curve length of 12%, as the learning curve seems to indicate that $m$ is better than $m_{\max}$. However, the learning curve of $m_{\max}$ approaches the values of $m$; then the probability decreases and training is stopped early.

We continue with the discussion of the two examples in the bottom row, which LCRankNet erroneously terminated early, which in turn causes a regret higher than zero in Table 1. In the bottom left we show an example where sometimes $m$ is better for a longer period and sometimes $m_{\max}$. This is arguably a very difficult problem and it is hard to predict which curve will eventually be better. However, LCRankNet shows reasonable behavior. As the learning curve of $m$ is consistently worse than that of $m_{\max}$ in the segment from 15% to 42%, the probability decreases. Starting from a length of 57%, where $m$ shows superior performance to $m_{\max}$ for several epochs, the probability starts to rise. From a learning curve length of 70% onward, both learning curves are very close and the difference in the final accuracy is only 0.0022. The example visualized in the bottom right plot is a very interesting one. The learning curve of $m_{\max}$ is consistently better than or almost equal to that of $m$ up to the very end. Towards the end, the learning curves get very close; in fact, from a learning curve length of 63% onward, the maximum difference between $m$ and $m_{\max}$ per epoch is only 0.008. Hence, in the beginning it seems like a trivial decision to reject $m$. However, eventually $m$ turns out to be better than $m_{\max}$.
In conclusion, deciding whether one model will be better than another based on a partial learning curve is a challenging task. A model that turns out to be (slightly) better than another can be dominated consistently for most of the learning curve. This makes it a very challenging problem for automated methods and human experts alike.

4.5. Analysis of LCRankNet's Components

We briefly mentioned before which components of our learning curve ranker have an essential influence on its quality. We would like to deepen this analysis at this point and compare our proposed configuration with different variants in Figure 6. We consider variants with and without metadata, architecture description, or learning curve. In addition, we compare our configuration trained with the pairwise ranking loss to one trained with a pointwise ranking loss.

Figure 6. Analysis of the different components of LCRankNet (Spearman correlation over the seen fraction of the learning curve; variants: proposed configuration, architecture only, proposed configuration with L2 loss, proposed configuration w/o metadata, learning curve only, learning curve only w/o metadata). Every single component, metadata, consideration of the learning curve, and architecture description, is vital.

One of the most striking observations is that the metadata is essential for the model. This is not surprising, since learning the architecture embedding in particular needs sufficient data. Sufficient data is not available in this setup, so we observe a much higher variance for these variants. Even the variant that only considers the learning curve benefits from additional meta-knowledge. This too is not surprising, since stagnating learning processes show similar behavior regardless of the dataset. Using the meta-knowledge, both components achieve good results on their own. It can be clearly seen that these components are orthogonal to one another. The variant which only considers the architecture shows very good results for short learning curves. If only the learning curve is considered, the results are not very good at first, but improve significantly with the length of the learning curve. The combination of both ultimately leads to our method and further improves the results.

Finally, we compare the use of a pointwise ranking loss (L2 loss) with the pairwise ranking loss. Although our assumption was that the latter would be fundamentally better, since it optimizes the model parameters directly for the task at hand, in practice this does not necessarily seem to be the case. Especially for short learning curves, the simpler method achieves better results. However, once the learning curve is long enough, the pairwise ranking loss pays off.

4.6. Accelerating Hyperparameter Optimization

LCRankNet is not limited to NAS, but can be used with hyperparameter optimization methods as well. We demonstrate this by combining LCRankNet with TPE (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019), and Reinforcement Learning (RL) (Zoph & Le, 2017) to optimize the hyperparameters of an FCNN using a tabular data benchmark (Klein & Hutter, 2019).
The objective of this benchmark is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. All experiments are repeated ten times. Each optimization method can evaluate up to 100 different configurations and chooses the best one based on its validation MSE. We report results with respect to the test regret, following Klein & Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE. In our experiments, we observed that early termination generally improves the method (Figure 7, Appendix). All other results are listed in the Appendix.

Figure 7. TPE and RE can leverage LCRankNet to accelerate optimization (test regret over training time in seconds on Protein Structure; TPE vs. TPE (early term.) and RE vs. RE (early term.)). Results for reinforcement learning, more datasets, and further insights can be found in the Appendix.

5. Conclusion

In this paper we present LCRankNet, a method to automatically terminate unpromising model configurations early. The two main novelties of the underlying model are that it is able to consider learning curves from other datasets and that it uses a pairwise ranking loss. The former allows predictions for relatively short, and in extreme cases even without, learning curves. The latter directly allows modeling the probability that one configuration is better than another. We analyze our method on five different datasets against up to five alternatives. In an experiment to optimize network architectures, we obtain the fastest results. In the best case, LCRankNet is 100 times faster without sacrificing accuracy. We also examine the components and predictions of our method to give the reader a better understanding of the design choices and functionalities. Finally, we demonstrate its use in combination with various hyperparameter optimizers.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018. URL https://openreview.net/forum?id=HJqk3N1vG.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pp. 2546-2554, 2011. URL http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.

Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. N. Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pp. 89-96, 2005. doi: 10.1145/1102351.1102363. URL https://doi.org/10.1145/1102351.1102363.
Chandrashekaran, A. and Lane, I. R. Speeding up hyperparameter optimization by extrapolation of learning curves using previous builds. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I, volume 10534 of Lecture Notes in Computer Science, pp. 477-492. Springer, 2017.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.

Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3460-3468. AAAI Press, 2015.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1436-1445, 2018. URL http://proceedings.mlr.press/v80/falkner18a.html.

Jamieson, K. G. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 240-248, 2016. URL http://proceedings.mlr.press/v51/jamieson16.html.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Klein, A. and Hutter, F. Tabular benchmarks for joint architecture and hyperparameter optimization. CoRR, abs/1905.04970, 2019. URL http://arxiv.org/abs/1905.04970.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=S11KBYclx.

Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res., 18:185:1-185:52, 2017. URL http://jmlr.org/papers/v18/16-558.html.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, T. Learning to Rank for Information Retrieval. Springer, 2011. ISBN 978-3-642-14266-6. doi: 10.1007/978-3-642-14267-3. URL https://doi.org/10.1007/978-3-642-14267-3.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=Skq89Scxx.

Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092-4101, 2018. URL http://proceedings.mlr.press/v80/pham18a.html.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 4780-4789, 2019. doi: 10.1609/aaai.v33i01.33014780. URL https://doi.org/10.1609/aaai.v33i01.33014780.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pp. 3104-3112, 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.

Wistuba, M., Rawat, A., and Pedapati, T. A survey on neural architecture search. CoRR, abs/1905.01392, 2019. URL http://arxiv.org/abs/1905.01392.

Yu, K., Sciuto, C., Jaggi, M., Musat, C., and Salzmann, M. Evaluating the search phase of neural architecture search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings, 2020. URL https://openreview.net/forum?id=H1loF2NFwr.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697-8710, 2018. doi: 10.1109/CVPR.2018.00907. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zoph_Learning_Transferable_Architectures_CVPR_2018_paper.html.