# Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters

Jelena Luketina¹ (JELENA.LUKETINA@AALTO.FI), Mathias Berglund¹ (MATHIAS.BERGLUND@AALTO.FI), Klaus Greff² (KLAUS@IDSIA.CH), Tapani Raiko¹ (TAPANI.RAIKO@AALTO.FI)

¹ Department of Computer Science, Aalto University, Finland
² IDSIA, Dalle Molle Institute for Artificial Intelligence, USI-SUPSI, Manno-Lugano, Switzerland

## Abstract

Hyperparameter selection generally relies on running multiple full training trials, with selection based on validation set performance. We propose a gradient-based approach for locally adjusting hyperparameters during training of the model. Hyperparameters are adjusted so as to make the model parameter gradients, and hence updates, more advantageous for the validation cost. We explore the approach for tuning regularization hyperparameters and find that, in experiments on MNIST, SVHN and CIFAR-10, the resulting regularization levels are within the optimal regions. The additional computational cost depends on how frequently the hyperparameters are trained, but the tested scheme adds only a 30% computational overhead regardless of the model size. Since the method is significantly less computationally demanding than similar gradient-based approaches to hyperparameter optimization, and consistently finds good hyperparameter values, it can be a useful tool for training neural network models.

## 1. Introduction

Specifying and training artificial neural networks requires several design choices that are often not trivial to make. Many of these design choices boil down to the selection of hyperparameters. In practice, hyperparameter selection is often based on trial-and-error and grid or random search (Bergstra and Bengio, 2012). There are also a number of automated methods (Bergstra et al., 2011; Snoek et al., 2012), all of which rely on multiple complete training runs with varied fixed hyperparameters, with the hyperparameter selection based on the validation set performance. Although effective, these methods are expensive because the user needs to run multiple full training runs. In the worst case, if an extensive exploration is desired, the number of required runs grows exponentially with the number of hyperparameters to tune. In many practical applications such an approach is too tedious and time-consuming, and it would be useful to have a method that automatically finds acceptable hyperparameter values in a single training run, even when the user has no strong intuition about good values to try.

In contrast to these methods, we treat hyperparameters similarly to elementary¹ parameters during training, in that we simultaneously update both sets of parameters using stochastic gradient descent. The gradient of the elementary parameters is computed, as in usual training, from the cost of the regularized model on the training set, while the gradient of the hyperparameters (the hypergradient) comes from the cost of the unregularized model on the validation set. For simplicity, we will refer to the training set as T1 and to the validation set (or any other data set used exclusively for training the hyperparameters) as T2. The method itself will be called T1-T2, referring to the two simultaneous optimization processes.

¹ Borrowing the expression from Maclaurin et al. (2015), we refer to the model parameters customarily trained with backpropagation as elementary parameters, and to all other parameters as hyperparameters.
Similar approaches have been proposed since the late 1990s; however, these methods either require computing the inverse Hessian (Larsen et al., 1998; Bengio, 2000; Chen and Hagan, 1999; Foo et al., 2008) or propagating gradients through the entire history of parameter updates (Maclaurin et al., 2015). Moreover, these methods change the hyperparameters only once the elementary parameter training has ended. These drawbacks make them too expensive for use in modern neural networks, which often have millions of parameters and large data sets. The elements distinguishing our approach are:

1. By making some very rough approximations, our method for modifying hyperparameters avoids computationally expensive terms, including the computation of the Hessian or its inverse. This is because with the T1-T2 method, hyperparameter updates are based on stochastic gradient descent rather than Newton's method. Furthermore, any dependency of the elementary parameters on the hyperparameters beyond the last update is disregarded. As a result, the additional computational and memory overhead becomes comparable to that of back-propagation.

2. Hyperparameters are trained simultaneously with elementary parameters. The forward and backward passes can be computed simultaneously for the training and validation sets, further reducing the computational cost.

3. We add batch normalization (Ioffe and Szegedy, 2015) and adaptive learning rates (Kingma and Ba, 2015) to the process of hyperparameter training, which diminishes some of the problems of gradient-based hyperparameter optimization. Through batch normalization we can counter internal covariate shift, which eliminates the need for different learning rates at each layer and speeds up the adjustment of the elementary parameters to changes in the hyperparameters. This is particularly relevant when each layer is parametrized with a separate hyperparameter.

A common assumption is that the choice of hyperparameters affects the whole training trajectory, i.e. that changing a hyperparameter on the fly during training has a significant effect on the trajectory. This hysteresis effect implies that in order to measure how a hyperparameter combination influences validation set performance, the hyperparameters need to be kept fixed during the whole training procedure. However, to our knowledge this has not been systematically studied. If the hysteresis effect is weak enough and the largest changes to the hyperparameters happen early on, it becomes possible to train the model while tuning the hyperparameters on the fly during training, and then use the final hyperparameter values to retrain the model if a fixed set of hyperparameters is desired. We also explore this approach.

An important design choice when training neural network models is which regularization strategy to use in order to ensure that the model generalizes to data not included in the training set. Common regularization strategies involve adding explicit terms to the model or the cost function during training, such as penalty terms on the model weights, or injecting noise into the inputs or neuron activations.
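To illustrate how such explicit regularizers expose continuous hyperparameters, here is a minimal sketch (in PyTorch, with a toy linear model of our own choosing rather than the paper's code) of a training cost carrying both an additive Gaussian input noise level and an L2 weight penalty. Because the noise is reparametrized as a scale times a fixed standard-normal draw, the cost is differentiable with respect to both hyperparameters:

```python
import torch

def regularized_cost(W, x, y, noise_std, l2_weight):
    """Toy regularized cost: C1~ = C1(noisy inputs) + Omega(W, l2_weight)."""
    # Reparametrized input noise: noise_std scales an N(0, I) draw, so the
    # cost remains differentiable with respect to noise_std.
    x_noisy = x + noise_std * torch.randn_like(x)
    data_cost = ((x_noisy @ W - y) ** 2).mean()   # unregularized cost C1
    penalty = l2_weight * (W ** 2).sum()          # penalty term Omega
    return data_cost + penalty
```

This differentiability in the hyperparameters is what the method of Section 2 exploits.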
Injecting noise is particularly relevant for denoising autoencoders and related models (Vincent et al., 2010; Rasmus et al., 2015), where performance strongly depends on the level of noise. Although the proposed method could in principle work for any continuous hyperparameter, we have specifically focused on tuning regularization hyperparameters. We have chosen to use Gaussian noise added to the inputs and hidden-layer activations, in addition to an L2 weight penalty. A third, often used, regularization method that involves a hyperparameter choice is dropout (Srivastava et al., 2014). However, we have omitted studying dropout, as it is not trivial to compute a gradient with respect to the dropout rate. Moreover, dropout can be seen as a form of multiplicative Gaussian noise (Wang and Manning, 2013). We also omit adapting the learning rate, since we suspect that local gradient information is not sufficient to determine optimal learning rates.

In Section 2 we present the details of the proposed method. The method is tested with multiple MLP and CNN network structures and regularization schemes, detailed in Section 3. The results of the experiments are presented in Section 3.1.

## 2. Proposed Method

We propose a method, T1-T2, for tuning continuous hyperparameters of a model using the gradient of the performance of the model on a separate validation set T2. In essence, we train a neural network model on a training set T1 as usual. However, for each update of the network weights and biases, i.e. the elementary parameters of the network, we tune the hyperparameters so as to make the direction of the weight update as beneficial as possible for the validation cost on the separate dataset T2.

Formally, when training a neural network model, we try to minimize an objective function that depends on the training set, the model weights, and the hyperparameters that determine the strength of possible regularization terms. When using gradient descent, we denote the optimization objective function $\tilde{C}_1(\cdot)$ and the corresponding weight update as:

$$\tilde{C}_1(\theta \mid \lambda, T_1) = C_1(\theta \mid \lambda, T_1) + \Omega(\theta, \lambda), \tag{1}$$

$$\theta_{t+1} = \theta_t - \eta_1 \nabla_\theta \tilde{C}_1(\theta_t \mid \lambda_t, T_1), \tag{2}$$

where $C_1(\cdot)$ and $\Omega(\cdot)$ are the cost and regularization penalty terms, $T_1 = \{(x_i, y_i)\}$ is the training data set, $\theta = \{W_l, b_l\}$ is the set of elementary parameters comprising the weights and biases of each layer, $\lambda$ denotes the various hyperparameters that determine the strength of regularization, and $\eta_1$ is a learning rate. The subscript $t$ refers to the iteration number.

*Figure 1. Left: Values of additive input noise and L2 penalty $(n_0, \log(l_2))$ during training with the T1-T2 method for hyperparameter tuning. Trajectories are plotted over the grid-search result for the same regularization pair. Initial hyperparameter values are denoted with a square, final values with a star. Right: Similarly constructed trajectories for a model regularized with input and hidden-layer additive noise $(n_0, n_1)$.*

Assuming $T_2 = \{(x_i, y_i)\}$ is a separate validation data set, the generalization performance of the model is measured by the validation cost $C_2(\theta_{t+1}, T_2)$, which is usually a function of the unregularized model. Hence the cost function measuring the actual performance of the model does not depend on the regularizer directly, but only through the elementary parameter updates.
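To ground the notation, here is a minimal sketch of one elementary update, Eqs. (1)-(2), for a toy linear model with the hyperparameter held fixed (illustrative code, not the paper's implementation; the model, `lam`, and the data shapes are arbitrary choices):

```python
import torch

W = torch.randn(5, 3, requires_grad=True)        # theta: toy model weights
lam = torch.tensor(1e-3)                         # lambda: L2 strength, fixed here
x1, y1 = torch.randn(32, 5), torch.randn(32, 3)  # minibatch from T1

# Eq. (1): regularized objective C1~ = C1 + Omega
c1_tilde = ((x1 @ W - y1) ** 2).mean() + lam * (W ** 2).sum()

# Eq. (2): theta_{t+1} = theta_t - eta_1 * grad_theta(C1~)
grad_W = torch.autograd.grad(c1_tilde, W)[0]
with torch.no_grad():
    W -= 0.1 * grad_W                            # eta_1 = 0.1

# C2: unregularized validation cost on T2, evaluated at theta_{t+1}
x2, y2 = torch.randn(32, 5), torch.randn(32, 3)
c2 = ((x2 @ W - y2) ** 2).mean()
```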
The gradient of the validation cost with respect to $\lambda$ is:

$$\nabla_\lambda C_2 = (\nabla_\theta C_2)(\nabla_\lambda \theta_{t+1}).$$

We only consider the influence of the regularization hyperparameter on the current elementary parameter update, $\nabla_\lambda \theta_{t+1} = -\eta_1 \nabla_\lambda \nabla_\theta \tilde{C}_1$, based on Eq. (2). The hyperparameter update is therefore:

$$\lambda_{t+1} = \lambda_t + \eta_2 (\nabla_\theta C_2)(\nabla_\lambda \nabla_\theta \tilde{C}_1), \tag{3}$$

where $\eta_2$ is a learning rate. The method is greedy in the sense that it depends on only one parameter update, and hence rests on the assumption that a good hyperparameter choice can be evaluated from the local information within a single elementary parameter update.

### 2.1. Motivation and analysis

The most similar previously proposed model is the incremental-gradient version of the hyperparameter update from Chen and Hagan (1999). However, their derivation of the hypergradient assumes a Gauss-Newton update of the elementary parameters, making computation of the gradient and the hypergradient significantly more expensive.

A well-justified closed form for the term $\nabla_\lambda \theta$ is available once the elementary gradient has converged (Foo et al., 2008), with an update of the form (4). Comparing this expression with the T1-T2 update, (3) can be seen as approximating (4) in the case where the gradient is near convergence and the Hessian is well approximated by the identity, $\nabla^2_\theta \tilde{C}_1 = I$:

$$\lambda_{t+1} = \lambda_t + (\nabla_\theta C_2)(\nabla^2_\theta \tilde{C}_1)^{-1}(\nabla_\lambda \nabla_\theta \tilde{C}_1). \tag{4}$$

Another approach to hypergradient computation is given in Maclaurin et al. (2015). There, the term $\nabla_\lambda \theta_T$ ($T$ denoting the final iteration number) accounts for the effect of the hyperparameter on the entire history of updates:

$$\theta_T = \theta_0 + \sum_{0 \le t < T} \cdots$$
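Returning to the T1-T2 update itself, Eq. (3) can be realized with double backpropagation: build the elementary update inside the computation graph so that the gradient of the validation cost flows back into $\lambda$. Below is a minimal self-contained sketch of a single T1-T2 step in PyTorch. The toy linear model and the log-parametrization of $\lambda$ (to keep it positive) are our illustrative choices, not the paper's code, and $\eta_1$ is absorbed into the hyperparameter learning rate $\eta_2$, as in Eq. (3):

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 3, requires_grad=True)        # theta: elementary parameters
log_lam = torch.zeros(1, requires_grad=True)     # lambda, log-parametrized (> 0)
eta1, eta2 = 0.1, 0.01                           # learning rates

x1, y1 = torch.randn(32, 5), torch.randn(32, 3)  # batch from T1 (training set)
x2, y2 = torch.randn(32, 5), torch.randn(32, 3)  # batch from T2 (validation set)

# Eq. (1): regularized training cost C1~ = C1 + lam * ||W||^2
c1_tilde = ((x1 @ W - y1) ** 2).mean() + log_lam.exp() * (W ** 2).sum()

# Eq. (2): elementary update; create_graph=True retains the dependency of
# the gradient (and hence of the update) on the hyperparameter.
g1 = torch.autograd.grad(c1_tilde, W, create_graph=True)[0]
W_next = W - eta1 * g1

# Unregularized validation cost C2, evaluated at theta_{t+1}
c2 = ((x2 @ W_next - y2) ** 2).mean()

# Hypergradient: dC2/dlambda through the single update only; descending it
# realizes Eq. (3) with eta_1 absorbed into eta_2.
hyper_grad = torch.autograd.grad(c2, log_lam)[0]

with torch.no_grad():
    W.copy_(W_next.detach())          # commit the elementary update
    log_lam -= eta2 * hyper_grad      # commit the hyperparameter update
```

Note that the hypergradient costs one extra backward pass through the elementary gradient, rather than a Hessian or its inverse, which is why the overhead stays comparable to back-propagation.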